Open access peer-reviewed chapter

Development of Estimation Procedure of Population Mean in Two-Phase Stratified Sampling

Written By

Partha Parichha, Kajla Basu and Arnab Bandyopadhyay

Submitted: September 8th, 2018 Reviewed: December 3rd, 2018 Published: September 27th, 2019

DOI: 10.5772/intechopen.82850



This article addresses the estimation of a finite population mean in two-phase stratified random sampling. Using information on two auxiliary variables, a class of product to regression chain-type estimators is proposed and its properties are discussed. An unbiased version of the proposed class of estimators is constructed, and the optimality condition for the class is derived. The efficacy of the proposed methodology is demonstrated through empirical investigations carried out on data sets from natural populations as well as an artificially generated population. Survey statisticians are encouraged to use it.


  • stratified random sampling
  • double sampling
  • auxiliary variables
  • chain type estimators
  • bias
  • mean square error
  • efficiency
  • AMS 2000 Mathematics Subject Classification: 62D05

1. Introduction

In the present paper we make use of auxiliary information extracted from variables that are correlated with the study variable. Auxiliary information may be utilized at the planning, design and estimation stages to develop improved estimation procedures in sample surveys. Sometimes, information on an auxiliary variable is readily available for all the units of the population; for example, the tonnage (or seat capacity) of each vehicle or ship is known in transportation surveys, and the number of beds available in different hospitals may be known well in advance in health care surveys. If such information is lacking, it is sometimes relatively cheap to take a large preliminary sample in which the auxiliary variable alone is measured; this practice is known as two-phase (or double) sampling. Two-phase stratified sampling is a powerful and cost-effective technique for obtaining, from the first-phase (preliminary) sample, reliable estimates of the unknown parameters of the auxiliary variables. For example, Sukhatme [1] mentioned that in a survey to estimate the production of a lime crop based on orchards as sampling units, a comparatively larger sample is drawn to determine the acreage under the crop, while the yield rate is determined from a sub-sample of the orchards selected for determining acreage.

In order to construct an efficient estimator of the population mean of the auxiliary variable from the first-phase (preliminary) sample, Chand [2] introduced a technique of chaining another auxiliary variable with the first auxiliary variable by using the ratio estimator in the first-phase sample. The resulting estimator is known as the chain-type ratio estimator. This work was further extended by Kiregyera [3, 4], Tracy et al. [5], Singh and Espejo [6], Gupta and Shabbir [7], Shukla et al. [8], Choudhury and Singh [9] and Parichha et al. [10], among others, who proposed various chain-type ratio and regression estimators.

In practice, the population often consists of heterogeneous units. For example, in socio-economic surveys, people may live in rural areas, urban localities, ordinary domestic houses, hostels, hospitals, jails, etc. In such a situation one should carefully study the population according to the characteristics of the regions and then apply the sampling scheme independently within each stratum. This procedure is known as stratified random sampling. Most of the developments in two-phase sampling are based on simple random sampling only, while only a limited number of attempts have been made to address the problems of two-phase sampling in the framework of stratified random sampling. It is also noticeable that most of the research work on two-phase sampling produces biased estimates, and bias is a serious drawback in sample surveys. A sampling method is called biased if it systematically favors some outcomes over others. It results in a biased sample of a population (or non-human factors) in which all individuals, or instances, were not equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling. For example, telephone sampling is common in marketing surveys. A simple random sample may be chosen from a sampling frame consisting of a list of telephone numbers of people in the area being surveyed. This method does involve taking a simple random sample, but it is not a simple random sample of the target population (consumers in the area being surveyed). It will miss people who do not have a phone. It may also miss people who only have a cell phone with an area code outside the region being surveyed. It will also miss people who do not wish to be surveyed, including those who monitor calls on an answering machine and do not answer those from telephone surveyors. Thus the method systematically excludes certain types of consumers in the area. Inferences from a biased sample are clearly not as trustworthy as conclusions from a truly random sample.

Encouraged by the above work, we propose a class of product to regression chain-type estimators in stratified sampling using two auxiliary variables under double sampling. An unbiased version of the proposed class of estimators is obtained, which makes the estimation strategy more practicable. The dominance of the proposed estimation strategy over the conventional ones is established through empirical investigations carried out on data sets from natural as well as artificially generated populations.


2. Sampling structures and notations

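Consider a finite population $U = \{1, 2, \ldots, N\}$ of $N$ identifiable units divided into $L$ homogeneous strata, with the $h$th stratum $(h = 1, 2, \ldots, L)$ having $N_h$ units. Let $y$ and $(x, z)$ be the study variable and the two auxiliary variables, taking values $y_{hi}$ and $(x_{hi}, z_{hi})$ respectively for the unit $i = 1, 2, \ldots, N_h$ of the $h$th stratum. Let $\bar{Y} = \sum_{h=1}^{L} W_h \bar{Y}_h$, $\bar{X} = \sum_{h=1}^{L} W_h \bar{X}_h$, $\bar{Z} = \sum_{h=1}^{L} W_h \bar{Z}_h$ be the population means of the study and the auxiliary variables, and $\bar{Y}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} y_{hi}$, $\bar{X}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} x_{hi}$, $\bar{Z}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} z_{hi}$ be the corresponding stratum means. Here $W_h = N_h / N$ is the known stratum weight.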

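Let $C_{yh} = S_{yh}/\bar{Y}_h$, $C_{xh} = S_{xh}/\bar{X}_h$ and $C_{zh} = S_{zh}/\bar{Z}_h$ be the coefficients of variation, where $S_{yh} = \sqrt{\frac{\sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)^2}{N_h - 1}}$, $S_{xh} = \sqrt{\frac{\sum_{i=1}^{N_h} (x_{hi} - \bar{X}_h)^2}{N_h - 1}}$, $S_{zh} = \sqrt{\frac{\sum_{i=1}^{N_h} (z_{hi} - \bar{Z}_h)^2}{N_h - 1}}$ are the population standard deviations in the $h$th stratum.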

Let $\rho_{yxh}$, $\rho_{yzh}$ and $\rho_{xzh}$ be the correlation coefficients between $(y, x)$, $(y, z)$ and $(x, z)$ respectively in the $h$th stratum. Chand [2] and Kiregyera [3, 4] discussed a situation in simple random sampling where information on $x$ is unknown but another auxiliary variable $z$ is easily available. It is assumed that the population mean of one auxiliary variable $z$ is known in advance and the population mean of the other auxiliary variable $x$ is unknown. We seek to estimate $\bar{Y}$ through a two-phase stratified sampling design. Using simple random sampling without replacement (SRSWOR) at each phase, we adopt the double sampling scheme as follows.

  1. In the first phase, a preliminary large sample of size $n'_h$ is drawn from the $h$th stratum of size $N_h$ $(h = 1, 2, \ldots, L)$ and information on the auxiliary variables $x$ and $z$ is observed.

  2. In the second phase, a sub-sample of size $n_h$ is drawn from the $n'_h$ first-phase sample units of the $h$th stratum, and information on the study variable $y$ and the auxiliary variables $x$ and $z$ is recorded.

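Let $\bar{y}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} y_{hi}$, $\bar{x}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} x_{hi}$, $\bar{z}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} z_{hi}$, $\bar{x}'_h = \frac{1}{n'_h}\sum_{i=1}^{n'_h} x_{hi}$ and $\bar{z}'_h = \frac{1}{n'_h}\sum_{i=1}^{n'_h} z_{hi}$ be the corresponding sample means in the $h$th stratum, where the primed means are based on the first-phase sample.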
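The two-phase scheme and the sample means above can be illustrated for a single stratum with a minimal Python sketch. The function names, the toy stratum and the sizes 12 and 8 are illustrative assumptions, not part of the chapter's own computations.

```python
import random
from statistics import mean


def two_phase_sample(units, n_first, n_second, rng):
    """Draw a first-phase SRSWOR sample and a second-phase SRSWOR sub-sample.

    units: list of (y, x, z) tuples for one stratum (hypothetical layout).
    """
    if not (0 < n_second <= n_first <= len(units)):
        raise ValueError("need 0 < n_second <= n_first <= stratum size")
    first = rng.sample(units, n_first)    # first phase: only x and z are used
    second = rng.sample(first, n_second)  # second phase: y, x and z are used
    return first, second


def stratum_means(first, second):
    """Sample means of one stratum; the *_first entries are first-phase means."""
    return {
        "y_bar": mean(u[0] for u in second),
        "x_bar": mean(u[1] for u in second),
        "z_bar": mean(u[2] for u in second),
        "x_bar_first": mean(u[1] for u in first),
        "z_bar_first": mean(u[2] for u in first),
    }


rng = random.Random(0)
# Toy stratum of 20 units, values chosen only for illustration.
stratum = [(2.0 * i + 1.0, float(i), 0.5 * i) for i in range(20)]
first, second = two_phase_sample(stratum, n_first=12, n_second=8, rng=rng)
means = stratum_means(first, second)
```

Because the second-phase units are drawn from the first-phase list, every unit carrying a y value also contributes to the first-phase means of x and z.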


3. Discussion on existing estimation strategies

The usual stratified mean estimator $\bar{y}_{st}$ for the population mean $\bar{Y}$ is given by


The mean square error (MSE) of $\bar{y}_{st}$ is given by
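For reference, the textbook forms of these two quantities can be sketched as follows, assuming SRSWOR within each stratum and a second-phase sample of size $n_h$ in the $h$th stratum. The notation follows Section 2; writing the finite population correction as $\frac{1}{n_h} - \frac{1}{N_h}$ is a choice made here and may differ from the chapter's own shorthand.

```latex
\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h ,
\qquad
\mathrm{MSE}(\bar{y}_{st}) = V(\bar{y}_{st})
  = \sum_{h=1}^{L} W_h^{2} \left( \frac{1}{n_h} - \frac{1}{N_h} \right) S_{yh}^{2} .
```

Since $\bar{y}_{st}$ is unbiased for $\bar{Y}$ under this design, its MSE coincides with its variance.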


Motivated by the technique adopted by Chand [2], one may frame the chain ratio-product type estimator in the stratified sampling structure as


The bias and MSE of $\bar{y}_{hRP}$, to the first order of approximation, are obtained respectively as




Similarly, inspired by the technique adopted by Choudhury and Singh [9], one may frame the corresponding estimator in two-phase stratified random sampling as


where $k_h$ is a constant.

$\mathrm{Bias}(\bar{y}_{csh}) = \sum_{h=1}^{L} w_h \bar{y}_h A_{5h}$


4. Formulation of proposed estimation strategy

Motivated by the earlier work discussed above, we construct a class of product to regression chain-type estimators as


where $k_h$ $(h = 1, 2, \ldots, L)$ is a real constant which can be suitably determined by minimizing the MSE of the class of estimators $t_p$, and $\bar{x}_{dh} = \bar{x}'_h + b_{xzh}(n'_h)\,(\bar{Z}_h - \bar{z}'_h)$, where $b_{xzh}(n'_h)$ is the regression coefficient between the variables $x$ and $z$ in the $h$th stratum.
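The chain regression quantity $\bar{x}_{dh}$ can be illustrated with a short Python sketch. It assumes that the regression coefficient is the least-squares slope of x on z computed from the first-phase sample; the function names and the toy data are hypothetical.

```python
from statistics import mean


def regression_coefficient(xs, zs):
    """Least-squares slope of x on z, i.e. s_xz / s_z^2."""
    x_bar, z_bar = mean(xs), mean(zs)
    s_xz = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
    s_zz = sum((z - z_bar) ** 2 for z in zs)
    if s_zz == 0:
        raise ValueError("z has zero variance in the first-phase sample")
    return s_xz / s_zz


def chain_regression_mean(first_x, first_z, Z_bar_h):
    """x_dh = first-phase mean of x + slope * (known mean of z - first-phase mean of z)."""
    b = regression_coefficient(first_x, first_z)
    return mean(first_x) + b * (Z_bar_h - mean(first_z))


# Toy first-phase data in which x = 3z + 2 holds exactly.
first_z = [1.0, 2.0, 3.0, 4.0]
first_x = [3.0 * z + 2.0 for z in first_z]
x_dh = chain_regression_mean(first_x, first_z, Z_bar_h=10.0)
```

In this toy case the relationship between x and z is exactly linear, so the adjusted mean lands on the value of x implied by the known stratum mean of z.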


5. Bias and mean square errors of the proposed class of estimators $t_p$

The proposed class of estimators $t_p$ defined in Eq. (8) is a chain product and regression type estimator, and is therefore a biased estimator of the population mean $\bar{Y}$. We obtain its bias and mean square error under large-sample approximations using the following transformations:


and $E(e_i) = 0$ for $i = 1, 2, \ldots, 6$, where the $e_i$ $(i = 1, 2, \ldots, 6)$ are relative error terms.

Under the above transformations, the class of estimators $t_p$ may be represented as

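$$t_p = \sum_{h=1}^{L} w_h \bar{Y}_h (1+e_1) \left[ k_h (1+e_3)(1+e_2)^{-1} + (1-k_h) \left\{ (1+e_3) - \frac{\bar{Z}_h}{\bar{X}_h}\,\beta_{xzh} \left( e_4 + e_4 e_5 - e_4 e_6 \right) \right\} (1+e_2)^{-1} \right] \tag{10}$$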

We have the following expectations of the sample statistics of two-phase stratified sampling as




Expanding binomially, using the results from Eq. (1) and retaining terms up to the first order of sample size, we derive the expressions for the bias B(.) and mean square error M(.) of the class of estimators $t_p$ as

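$$B(t_p) = E(t_p - \bar{Y}) = \sum_{h=1}^{L} w_h \bar{Y}_h \left[ (1-k_h)\, b_{xzh} \frac{\bar{Z}_h}{\bar{X}_h} \left\{ f_2 \frac{S_{xzh}}{\bar{X}_h \bar{Z}_h} - f_1 \frac{S_{yzh}}{\bar{Y}_h \bar{Z}_h} - f_2 \left( \frac{\mu_{102}}{S_{xzh} \bar{Z}_h} - \frac{\mu_{003}}{S_{zh}^{2} \bar{Z}_h} \right) \right\} + f_3 \left( \frac{S_{xh}^{2}}{\bar{X}_h^{2}} - \frac{S_{yxh}}{\bar{Y}_h \bar{X}_h} \right) \right] \tag{12}$$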

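where $a = f_2 \rho_{xzh}^{2} C_{xh}^{2}$, $b = f_2 \rho_{yzh} \rho_{xzh} C_{yh} C_{xh} - f_2 \rho_{xzh}^{2} C_{xh}^{2}$ and $c = f_3 C_{xh}^{2} - 2 f_3 \rho_{yxh} C_{yh} C_{xh} + f_2 \rho_{xzh}^{2} C_{xh}^{2} - 2 f_2 \rho_{yzh} \rho_{xzh} C_{yh} C_{xh}$.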
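The three stratum-level terms a, b and c can be evaluated numerically with the small Python sketch below. The function name and the numeric inputs are hypothetical, and $f_2$ and $f_3$ are treated as plain numbers supplied by the user, since their definitions belong to the expectations listed earlier in this section.

```python
def abc_terms(f2, f3, rho_yx, rho_yz, rho_xz, C_y, C_x):
    """Return (a, b, c) for one stratum from the definitions given above."""
    a = f2 * rho_xz ** 2 * C_x ** 2
    b = f2 * rho_yz * rho_xz * C_y * C_x - f2 * rho_xz ** 2 * C_x ** 2
    c = (
        f3 * C_x ** 2
        - 2 * f3 * rho_yx * C_y * C_x
        + f2 * rho_xz ** 2 * C_x ** 2
        - 2 * f2 * rho_yz * rho_xz * C_y * C_x
    )
    return a, b, c


# Illustrative values only; they do not come from the chapter's data sets.
a, b, c = abc_terms(f2=0.1, f3=0.05, rho_yx=0.8, rho_yz=0.5, rho_xz=0.6,
                    C_y=1.0, C_x=2.0)
```

Note that a is non-negative by construction, which matters when the expression is later minimized with respect to $k_h$.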


6. Bias reduction for the proposed class of estimators

The bias of an estimator is a serious drawback, so unbiased versions of the proposed class of estimators are more desirable. Motivated by this argument and influenced by the bias correction techniques of Tracy et al. [5] and Bandyopadhyay and Singh [11], we proceed to derive the unbiased version of our proposed class of estimators $t_p$.

From Eq. (12), we observe that the expression for the bias of the estimator $t_p$ contains population parameters such as $\mu_{003}$, $\mu_{102}$, $S_{yxh}$, $S_{yzh}$, $S_{xh}^{2}$, $S_{yh}^{2}$, $\bar{Y}_h$, $\bar{X}_h$ and $S_{zh}^{2}$. Since $S_{zh}^{2}$ is known while $\mu_{003}$, $\mu_{102}$, $S_{yxh}$, $S_{yzh}$, $S_{xh}^{2}$, $S_{yh}^{2}$, $\bar{Y}_h$ and $\bar{X}_h$ are unknown, we replace the unknown parameters by their respective sample estimators (based on the second-phase sample of size $m$) $m_{003}$, $m_{102}$, $s_{yxh}$, $s_{yzh}$, $s_{xh}^{2}$, $s_{yh}^{2}$, $\bar{y}_h$ and $\bar{x}_h$, and obtain an estimator of $B(t_p)$ as


where $m_{pqr} = \frac{1}{m} \sum_{i=1}^{m} (x_{hi} - \bar{x}_h)^{p} (y_{hi} - \bar{y}_h)^{q} (z_{hi} - \bar{z}_h)^{r}$.
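The sample moment $m_{pqr}$ defined above translates directly into code. The following Python sketch uses a hypothetical function name and toy data; the exponent p applies to x, q to y and r to z, as in the definition.

```python
from statistics import mean


def central_moment(xs, ys, zs, p, q, r):
    """Sample moment m_pqr = (1/m) * sum (x - x_bar)^p (y - y_bar)^q (z - z_bar)^r."""
    m = len(xs)
    if m == 0 or len(ys) != m or len(zs) != m:
        raise ValueError("xs, ys and zs must be non-empty and of equal length")
    x_bar, y_bar, z_bar = mean(xs), mean(ys), mean(zs)
    total = sum(
        (x - x_bar) ** p * (y - y_bar) ** q * (z - z_bar) ** r
        for x, y, z in zip(xs, ys, zs)
    )
    return total / m


# Toy second-phase sample of three units.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
zs = [0.0, 1.0, 5.0]
m003 = central_moment(xs, ys, zs, 0, 0, 3)  # third central moment of z
m102 = central_moment(xs, ys, zs, 1, 0, 2)  # mixed moment of x and z^2
```

The two moments computed at the end are the ones that enter the estimated bias, namely $m_{003}$ and $m_{102}$.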

Motivated by the bias reduction techniques of Tracy et al. [5] and Bandyopadhyay and Singh [11], we derive the unbiased version $t'_p$ of the proposed class of estimators $t_p$, to the first order of approximation, in two-phase stratified sampling as


which becomes

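$$t'_p = \sum_{h=1}^{L} w_h \left[ \bar{y}_h \left\{ k_h \frac{\bar{x}'_h}{\bar{x}_h} + (1-k_h) \frac{\bar{x}_{dh}}{\bar{x}_h} \right\} - \bar{y}_h \left\{ (1-k_h)\, b_{xzh} \frac{\bar{z}_h}{\bar{x}_h} \left[ f_2 \frac{s_{xzh}}{\bar{x}_h \bar{z}_h} - f_1 \frac{s_{yzh}}{\bar{y}_h \bar{z}_h} - f_2 \left( \frac{m_{102}}{s_{xzh} \bar{z}_h} - \frac{m_{003}}{s_{zh}^{2} \bar{z}_h} \right) \right] + f_3 \left( \frac{s_{xh}^{2}}{\bar{x}_h^{2}} - \frac{s_{yxh}}{\bar{y}_h \bar{x}_h} \right) \right\} \right] \tag{15}$$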

Thus, the variance of $t'_p$, to the first order of approximation, is obtained as


From Eqs. (10) and (15) it is to be noted that the bias-corrected class of estimators $t'_p$ is preferable to the class of estimators $t_p$ in the two-phase sampling set-up, since $t'_p$ is an unbiased (up to the first order of sample size) class of estimators of $\bar{Y}$, while the class of estimators $t_p$ is biased.


7. Minimum variance of proposed class of estimators

It is evident from Eq. (16) that the variance of the proposed class of estimators $t'_p$ depends on the value of the constant $k_h$. We therefore seek the value of $k_h$ that minimizes this variance. The optimality condition under which the proposed class of estimators $t'_p$ has minimum variance is obtained as


Substituting the optimum value of the constant $k_h$ in Eq. (19), we have the minimum variance of the class of estimators $t'_p$ as
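As a sketch of how such an optimum arises, suppose that the first-order variance contribution of the $h$th stratum is quadratic in $k_h$, with the coefficients $a$, $b$ and $c$ of Section 5 and a remaining term $A_{0h}$ that does not depend on $k_h$. This quadratic form and the symbols $V_h$ and $A_{0h}$ are assumptions introduced here for illustration and are not taken from the chapter's own equations.

```latex
V_h(k_h) = A_{0h} + a\,k_h^{2} + 2b\,k_h + c ,
\qquad
\frac{\partial V_h}{\partial k_h} = 2a\,k_h + 2b = 0
\;\Longrightarrow\;
k_h^{\mathrm{opt}} = -\frac{b}{a} ,
\qquad
V_h\!\left(k_h^{\mathrm{opt}}\right) = A_{0h} + c - \frac{b^{2}}{a} .
```

Under this assumption the stationary point is a minimum whenever $a > 0$, that is, whenever $x$ and $z$ are correlated within the stratum.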


8. Efficiency comparison of the proposed strategy

It is important to investigate the performance of the proposed class of estimators relative to the existing ones. We use two natural population data sets and one artificially generated population data set to demonstrate the advantage of the proposed strategy.

8.1 Empirical investigations through natural populations

The data set of two natural populations has been presented below.

  • Population I (Source: Murthy [12], p. 228)

    y: Factory output in thousand rupees, x: Number of workers in the factory, and z: Fixed capital of the factory in thousand rupees.

The data consist of 80 observations, which are divided into four strata according to the auxiliary variable z: (i) z ≤ 500, (ii) 500 < z ≤ 1000, (iii) 1000 < z ≤ 2000, and (iv) z > 2000. Proportional allocation is used to allocate the sample size to the different strata.

  1. Stratum 1: z ≤ 500

  2. Stratum 2: 500 < z ≤ 1000

  3. Stratum 3: 1000 < z ≤ 2000

  4. Stratum 4: z > 2000
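The stratification rule and proportional allocation described for Population I can be sketched in Python as follows. The z values and the total sample size are hypothetical, and rounding by largest fractional remainder is one common convention that is assumed here; the chapter does not state how fractional allocations are rounded.

```python
import bisect


def assign_stratum(z, upper_bounds=(500, 1000, 2000)):
    """Map z to stratum 1..4 using the cut points z <= 500, <= 1000, <= 2000, > 2000."""
    return bisect.bisect_left(upper_bounds, z) + 1


def proportional_allocation(stratum_sizes, n_total):
    """Allocate n_total in proportion to stratum sizes (largest-remainder rounding)."""
    N = sum(stratum_sizes)
    exact = [n_total * N_h / N for N_h in stratum_sizes]
    alloc = [int(e) for e in exact]  # floor, since all values are non-negative
    leftovers = n_total - sum(alloc)
    # Give the remaining units to the strata with the largest fractional parts.
    order = sorted(range(len(exact)), key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in order[:leftovers]:
        alloc[h] += 1
    return alloc


# Hypothetical z values, used only to exercise the cut points.
z_values = [120, 500, 750, 1000, 1500, 2000, 2600, 310]
labels = [assign_stratum(z) for z in z_values]
sizes = [labels.count(h) for h in (1, 2, 3, 4)]
allocation = proportional_allocation(sizes, n_total=4)
```

Values that fall exactly on a cut point, such as z = 500, are placed in the lower stratum, in line with the inequalities above.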

  • Population II (Source: Koyuncu and Kadilar [13])

y: Number of teachers, x: Number of students in both primary and secondary schools, and z: Number of classes in both primary and secondary schools. The data cover 923 districts in 6 regions of Turkey in 2007, namely (i) Marmara, (ii) Aegean, (iii) Mediterranean, (iv) Central Anatolia, (v) Black Sea, and (vi) East and Southeast Anatolia (source: The Turkish Republic Ministry of Education).

  1. Marmara region

  2. Aegean region

  3. Mediterranean region

  4. Central Anatolia region

  5. Black Sea region


The percentage relative efficiencies (PRE) of the proposed class of estimators $t_p$ with respect to different estimators, under their respective optimum conditions, are shown in Table 1.

8.2 Empirical investigations through artificially generated population

An important aspect of simulation is that one builds a simulation model to replicate the actual system. Simulation allows comparison of analytical techniques and helps in concluding whether a newly developed technique is better than the existing ones. Motivated by Singh and Deo [14], Singh et al. [15] and Maji et al. [16], who adopted artificial population generation techniques, we have generated five sets of independent random numbers of size N (N = 100), namely $x_{1k}$, $y_{1k}$, $x_{2k}$, $y_{2k}$ and $z_k$ $(k = 1, 2, 3, \ldots, N)$, from a standard normal distribution with the help of R software. By varying the correlation coefficients $\rho_{yx}$ and $\rho_{xz}$, we have generated the following transformed variables of the population U with $\sigma_y^2 = 50$, $\mu_y = 10$, $\sigma_x^2 = 100$, $\mu_x = 50$, $\sigma_z^2 = 50$ and $\mu_z = 20$:


We have split the total population of size N = 100 into 5 strata, each of size 20, i.e. $N_h = 20$ $(h = 1, 2, \ldots, 5)$, taking the units sequentially, and consider $n'_h = 12$ and $n_h = 8$ $(h = 1, 2, \ldots, 5)$ for the efficiency comparison of the proposed strategy.
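A minimal Python sketch of such a simulation set-up is given below. The linear transformation used to induce correlation is a generic construction assumed here for illustration and is not the transformation used in the chapter; it also draws three rather than five sets of standard normal numbers. The function names, the seed and the correlation values 0.8 and 0.6 are likewise hypothetical.

```python
import math
import random


def generate_population(N, rho_yx, rho_xz, rng):
    """Generate N units (y, x, z) with the stated means and variances.

    Correlation is induced by a simple linear mixing of standard normals
    (an assumed construction, not the chapter's own transformation).
    """
    population = []
    for _ in range(N):
        y1, x1, z1 = (rng.gauss(0.0, 1.0) for _ in range(3))
        x_std = rho_yx * y1 + math.sqrt(1.0 - rho_yx ** 2) * x1
        z_std = rho_xz * x_std + math.sqrt(1.0 - rho_xz ** 2) * z1
        y = 10.0 + math.sqrt(50.0) * y1       # mu_y = 10, sigma_y^2 = 50
        x = 50.0 + math.sqrt(100.0) * x_std   # mu_x = 50, sigma_x^2 = 100
        z = 20.0 + math.sqrt(50.0) * z_std    # mu_z = 20, sigma_z^2 = 50
        population.append((y, x, z))
    return population


def split_sequentially(population, stratum_size):
    """Cut the population into consecutive strata of equal size."""
    return [population[i:i + stratum_size]
            for i in range(0, len(population), stratum_size)]


rng = random.Random(2019)
population = generate_population(100, rho_yx=0.8, rho_xz=0.6, rng=rng)
strata = split_sequentially(population, stratum_size=20)

# Two-phase SRSWOR draw in each stratum: 12 first-phase, 8 second-phase units.
samples = []
for stratum in strata:
    first = rng.sample(stratum, 12)
    second = rng.sample(first, 8)
    samples.append((first, second))
```

Because the strata are formed by taking units sequentially from independent draws, they are homogeneous by construction, which matches the remark made in the Conclusion about the artificial population.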

The percentage relative efficiencies of the proposed class of estimators $t_p$ with respect to different estimators (under their respective optimum conditions), derived from the data set of the artificially generated population, are shown in Table 2.


9. Conclusion

From the construction of the estimation strategy and the efficiency comparison of the proposed methodology, the following points are noted.

  1. From Table 1, it is clear that the proposed class of estimators is at least 1% better than the existing ones in estimating the population mean.

  2. Similarly, from Table 2 it is found that the new estimator is at least 28% better than the existing ones.

  3. It may also be noted from Tables 1 and 2 that the artificially generated population is homogeneous (the mean and variance of the respective variables are almost the same for different strata), whereas the natural populations are heterogeneous (the mean and variance of the respective variables differ across strata). The suggested estimators perform efficiently for both types of population.

  4. An unbiased version of the proposed technique has been obtained, which makes the proposed class of estimators much more practicable.

Table 1.

PRE of the proposed estimator $t_p$ with respect to different estimators through the data sets of the natural populations (columns: Population I and Population II).

We use the following expression to obtain the percent relative efficiency (PRE) of the proposed estimator $t_p$ with respect to different estimators: $\mathrm{PRE} = \dfrac{V(\bar{y})}{\mathrm{Min.}\,V(t_p)} \times 100$.
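The PRE expression can be computed as in the following Python sketch. The function name and the variance values are hypothetical and are not taken from Table 1 or Table 2.

```python
def percent_relative_efficiency(var_competitor, min_var_proposed):
    """PRE = V(competing estimator) / Min.V(proposed estimator) * 100."""
    if min_var_proposed <= 0:
        raise ValueError("the minimum variance of the proposed estimator must be positive")
    return var_competitor / min_var_proposed * 100.0


# Illustrative variances only.
pre = percent_relative_efficiency(var_competitor=2.56, min_var_proposed=2.0)
```

A PRE above 100 indicates that the proposed estimator has the smaller variance of the two.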

Table 2.

PRE of the proposed estimator $t_p$ with respect to different estimators through the data set of the artificially generated population (column: Artificially generated population).

We use the following expression to obtain the percent relative efficiency (PRE) of the proposed estimator $t_p$ with respect to different estimators: $\mathrm{PRE} = \dfrac{V(\bar{y})}{\mathrm{Min.}\,V(t_p)} \times 100$.

Thus, the proposed estimation technique addresses the problems of estimation in two-phase stratified sampling and may prove useful in real-life applications where the population is heterogeneous in nature and stratification is essential. In view of the gains achieved by the new estimator, survey statisticians are encouraged to use it.


  1. Sukhatme B. Some ratio type estimators in two-phase sampling. Journal of the American Statistical Association. 1962;57:628-632
  2. Chand L. Some ratio type estimators based on two or more auxiliary variables [unpublished PhD thesis]. Ames, Iowa (USA): Iowa State University; 1975
  3. Kiregyera B. A chain ratio type estimator in finite population double sampling using two auxiliary variables. Metrika. 1980;27:217-223
  4. Kiregyera B. Regression type estimators using two auxiliary variables and the model of double sampling from finite populations. Metrika. 1984;31:215-226
  5. Tracy DS, Singh HP, Singh R. An alternative to the ratio-cum-product estimator in sample surveys. Journal of Statistical Planning and Inference. 1996;53:375-387
  6. Singh HP, Espejo MR. Double sampling ratio-product estimator of a finite population mean in sampling surveys. Journal of Applied Statistics. 2007;34(1):71-85
  7. Gupta S, Shabbir J. On the use of transformed auxiliary variables in estimating population mean by using two auxiliary variables. Journal of Statistical Planning and Inference. 2007;137:1606-1611
  8. Shukla D, Pathak S, Thakur NS. Estimation of population mean using two auxiliary sources in sample surveys. Statistics in Transition. 2012;13(1):21-36
  9. Choudhury S, Singh BK. A class of chain ratio-product type estimators with two auxiliary variables under double sampling scheme. Journal of the Korean Statistical Society. 2012;41:247-256
  10. Parichha P, Basu K, Bandyopadhyay A, Mukhopadhyay P. Development of efficient estimation technique for population mean in two phase sampling using fuzzy tools. Journal of Applied Mathematics, Statistics and Informatics. 2017;13(2):5-28. DOI: 10.1515/jamsi-2017-0006
  11. Bandyopadhyay A, Singh GN. Predictive estimation of population mean in two-phase sampling. Communications in Statistics: Theory and Methods. 2016;45(14):4249-4267. DOI: 10.1080/03610926.2014.919396
  12. Murthy MN. Sampling Theory and Methods. Calcutta: Statistical Publishing Society; 1967
  13. Koyuncu N, Kadilar C. Family of estimators of population mean using two auxiliary variables in stratified sampling. Communications in Statistics: Theory and Methods. 2009;38:2398-2417
  14. Singh S, Deo B. Imputation by power transformation. Statistical Papers. 2003;44:555-579
  15. Singh S, Joarder AH, Tracy DS. Median estimation using double sampling. Australian & New Zealand Journal of Statistics. 2001;43(1):33-46
  16. Maji R, Singh GN, Bandyopadhyay A. Estimation of population mean in presence of random non-response in two-stage cluster sampling. Communications in Statistics: Theory and Methods, ISSN: 0361-0926. 2018. DOI: 10.1080/03610926.2018.1478101
