Open access peer-reviewed chapter

Development of Estimation Procedure of Population Mean in Two-Phase Stratified Sampling

By Partha Parichha, Kajla Basu and Arnab Bandyopadhyay

Submitted: September 8th 2018. Reviewed: December 3rd 2018. Published: September 27th 2019.

DOI: 10.5772/intechopen.82850

Abstract

This chapter addresses the estimation of a finite population mean in two-phase stratified random sampling. Using information on two auxiliary variables, a class of product to regression chain type estimators is proposed and its properties are discussed. An unbiased version of the proposed class is constructed, and the optimality condition for the class is derived. The efficacy of the proposed methodology is demonstrated through empirical investigations on data sets from natural populations as well as an artificially generated population. Survey statisticians may therefore consider using it.

Keywords

  • stratified random sampling
  • double sampling
  • auxiliary variables
  • chain type estimators
  • bias
  • mean square error
  • efficiency
  • AMS 2000 Mathematics Subject Classification: 62D05

1. Introduction

In this chapter we make use of auxiliary information extracted from variables that are correlated with the study variable. Auxiliary information may be utilized at the planning, design and estimation stages to develop improved estimation procedures in sample surveys. Sometimes, information on an auxiliary variable is readily available for all units of the population; for example, the tonnage (or seat capacity) of each vehicle or ship is known in transportation surveys, and the number of beds available in different hospitals may be known well in advance in health care surveys. If such information is lacking, it is sometimes relatively cheap to take a large preliminary sample in which the auxiliary variable alone is measured; this practice is known as two-phase (or double) sampling. Two-phase stratified sampling is a powerful and cost-effective technique for obtaining, from the first-phase (preliminary) sample, reliable estimates of the unknown parameters of the auxiliary variables. For example, Sukhatme [1] mentioned that in a survey to estimate the production of a lime crop based on orchards as sampling units, a comparatively large sample is drawn to determine the acreage under the crop, while the yield rate is determined from a sub-sample of the orchards selected for determining acreage.

In order to construct an efficient estimator of the population mean of the auxiliary variable in the first-phase (preliminary) sample, Chand [2] introduced a technique of chaining another auxiliary variable with the first auxiliary variable by using the ratio estimator in the first-phase sample. The resulting estimator is known as the chain-type ratio estimator. This work was further extended by Kiregyera [3, 4], Tracy et al. [5], Singh and Espejo [6], Gupta and Shabbir [7], Shukla et al. [8], Choudhury and Singh [9] and Parichha et al. [10], among others, who proposed various chain-type ratio and regression estimators.

In practice, the population often consists of heterogeneous units. For example, in socio-economic surveys, people may live in rural areas, urban localities, ordinary domestic houses, hostels, hospitals, jails, etc. In such a situation one should carefully study the population according to the characteristics of its regions and then apply the sampling scheme independently within each stratum. This procedure is known as stratified random sampling. Most developments in two-phase sampling are based on simple random sampling only, while only a limited number of attempts have been made to address two-phase sampling in the framework of stratified random sampling. It is also noticeable that most research work on two-phase sampling produces biased estimates, and bias is a serious drawback in sample surveys. A sampling method is called biased if it systematically favors some outcomes over others. It results in a biased sample of a population (or non-human factors) in which not all individuals, or instances, were equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling. For example, telephone sampling is common in marketing surveys. A simple random sample may be chosen from a sampling frame consisting of a list of telephone numbers of people in the area being surveyed. This method does involve taking a simple random sample, but it is not a simple random sample of the target population (consumers in the area being surveyed). It will miss people who do not have a phone. It may also miss people who only have a cell phone with an area code outside the region being surveyed. It will also miss people who do not wish to be surveyed, including those who monitor calls on an answering machine and don’t answer those from telephone surveyors. Thus the method systematically excludes certain types of consumers in the area. Inferences from a biased sample are clearly not as trustworthy as conclusions from a truly random sample.

Encouraged by the above work, we propose a class of product to regression chain type estimators in stratified sampling using two auxiliary variables under double sampling. An unbiased version of the proposed class of estimators is obtained, which makes the estimation strategy more practicable. The dominance of the proposed estimation strategy over the conventional ones is established through empirical investigations on data sets from natural populations as well as an artificially generated population.

2. Sampling structures and notations

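Consider a finite population $U = \{1, 2, \ldots, N\}$ of $N$ identifiable units divided into $L$ homogeneous strata, with the $h$th stratum ($h = 1, 2, \ldots, L$) having $N_h$ units. Let $y$ and $(x, z)$ be the study variable and the two auxiliary variables, taking values $y_{hi}$ and $(x_{hi}, z_{hi})$ respectively for unit $i = 1, 2, \ldots, N_h$ of the $h$th stratum. Let $\bar{Y} = \sum_{h=1}^{L} W_h \bar{Y}_h$, $\bar{X} = \sum_{h=1}^{L} W_h \bar{X}_h$ and $\bar{Z} = \sum_{h=1}^{L} W_h \bar{Z}_h$ be the population means of the study and auxiliary variables, and let $\bar{Y}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} y_{hi}$, $\bar{X}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} x_{hi}$ and $\bar{Z}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} z_{hi}$ be the corresponding stratum means. Here $W_h = N_h/N$ is the known stratum weight.

Let $C_{yh} = S_{yh}/\bar{Y}_h$, $C_{xh} = S_{xh}/\bar{X}_h$ and $C_{zh} = S_{zh}/\bar{Z}_h$ be the coefficients of variation, where $S_{yh} = \sqrt{\frac{1}{N_h - 1}\sum_{i=1}^{N_h}(y_{hi} - \bar{Y}_h)^2}$, $S_{xh} = \sqrt{\frac{1}{N_h - 1}\sum_{i=1}^{N_h}(x_{hi} - \bar{X}_h)^2}$ and $S_{zh} = \sqrt{\frac{1}{N_h - 1}\sum_{i=1}^{N_h}(z_{hi} - \bar{Z}_h)^2}$ are the population standard deviations in the $h$th stratum.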

Let $\rho_{yxh}$, $\rho_{yzh}$ and $\rho_{xzh}$ be the correlation coefficients between $(y, x)$, $(y, z)$ and $(x, z)$ respectively in the $h$th stratum. Chand [2] and Kiregyera [3, 4] discussed a situation in simple random sampling in which the population mean of $x$ is unknown but information on another auxiliary variable $z$ is readily available. It is assumed here that the population mean of the auxiliary variable $z$ is known in advance, while the population mean of the other auxiliary variable $x$ is unknown. We seek to estimate $\bar{Y}$ through a two-phase stratified sampling design. Using simple random sampling without replacement (SRSWOR) at each phase, we adopt the double sampling scheme as follows.

  1. In the first phase, a large preliminary sample of size $n'_h$ is drawn from the $h$th stratum of size $N_h$ ($h = 1, 2, \ldots, L$), and information on the auxiliary variables $x$ and $z$ is observed.

  2. In the second phase, a sub-sample of size $n_h$ is drawn from the $n'_h$ first-phase units of the $h$th stratum, and information on the study variable $y$ and the auxiliary variables $x$ and $z$ is taken.

Let $\bar{y}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} y_{hi}$, $\bar{x}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} x_{hi}$, $\bar{z}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} z_{hi}$, $\bar{x}'_h = \frac{1}{n'_h}\sum_{i=1}^{n'_h} x_{hi}$ and $\bar{z}'_h = \frac{1}{n'_h}\sum_{i=1}^{n'_h} z_{hi}$ be the corresponding sample means in the $h$th stratum.
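The two-phase draw within a single stratum can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stratum data, the seed and the names `two_phase_sample`, `first_phase` and `second_phase` are hypothetical and do not come from the chapter.

```python
import random


def two_phase_sample(stratum_units, n_first, n_second, rng):
    """Draw n_first units by SRSWOR, then n_second of those by SRSWOR."""
    if not n_second <= n_first <= len(stratum_units):
        raise ValueError("need n_second <= n_first <= stratum size")
    first_phase = rng.sample(stratum_units, n_first)    # x and z observed
    second_phase = rng.sample(first_phase, n_second)    # y, x and z observed
    return first_phase, second_phase


def mean(values):
    return sum(values) / len(values)


rng = random.Random(2019)
# Hypothetical stratum of 20 units; each unit is a (y, x, z) triple.
stratum = [(10.0 + i, 50.0 + 2.0 * i, 20.0 + i) for i in range(20)]
first_phase, second_phase = two_phase_sample(stratum, 12, 8, rng)

x_bar_first = mean([x for _, x, _ in first_phase])    # first-phase mean of x
z_bar_first = mean([z for _, _, z in first_phase])    # first-phase mean of z
y_bar = mean([y for y, _, _ in second_phase])         # second-phase mean of y
x_bar = mean([x for _, x, _ in second_phase])         # second-phase mean of x
```

In a full survey the same draw would be repeated independently in every stratum.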

3. Discussion on existing estimation strategies

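The usual stratified mean estimator $\bar{y}_{st}$ of the population mean $\bar{Y}$ is given by

$$\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h \tag{1}$$

The mean square error (MSE) of $\bar{y}_{st}$ is given by

$$\mathrm{MSE}(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right) S_{yh}^2 \tag{2}$$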
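As a minimal sketch, the stratified mean and its MSE can be computed as follows. The function names and the two-stratum numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def stratified_mean(weights, stratum_means):
    """Eq. (1): weighted sum of the stratum sample means."""
    return sum(w * m for w, m in zip(weights, stratum_means))


def mse_stratified_mean(weights, n, N, S2_y):
    """Eq. (2): sum over strata of W_h^2 (1/n_h - 1/N_h) S_yh^2."""
    return sum(w ** 2 * (1.0 / nh - 1.0 / Nh) * s2
               for w, nh, Nh, s2 in zip(weights, n, N, S2_y))


# Hypothetical two-stratum illustration (values are not from the chapter).
W = [0.4, 0.6]
y_bar_st = stratified_mean(W, [10.0, 20.0])
mse_st = mse_stratified_mean(W, n=[5, 10], N=[40, 60], S2_y=[4.0, 9.0])
```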

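Motivated by the technique adopted by Chand [2], one may frame the chain ratio-product type estimator in the stratified sampling structure as

$$\bar{y}_{RP}^{h} = \sum_{h=1}^{L} W_h \bar{y}_h \frac{\bar{x}'_h}{\bar{x}_h} \frac{\bar{Z}_h}{\bar{z}'_h} \tag{3}$$

The bias and MSE of $\bar{y}_{RP}^{h}$, to the first order of approximation, are obtained as

$$\mathrm{Bias}(\bar{y}_{RP}^{h}) \approx \sum_{h=1}^{L} W_h \bar{Y}_h \left[\left(\frac{1}{n_h} - \frac{1}{n'_h}\right) A_{1h} + \left(\frac{1}{n'_h} - \frac{1}{N_h}\right) A_{2h}\right] \tag{4}$$

$$\mathrm{MSE}(\bar{y}_{RP}^{h}) = \sum_{h=1}^{L} W_h^2 S_{yh}^2 \left[\left(\frac{1}{n_h} - \frac{1}{n'_h}\right) A_{3h} + \left(\frac{1}{n'_h} - \frac{1}{N_h}\right) A_{4h} + \left(\frac{1}{n_h} - \frac{1}{N_h}\right)\right] \tag{5}$$

where

$$A_{1h} = C_{xh}^2 - \rho_{yxh} C_{yh} C_{xh} \quad \text{and} \quad A_{2h} = C_{zh}^2 - \rho_{yzh} C_{yh} C_{zh},$$

$$A_{3h} = \frac{C_{xh}^2}{C_{yh}^2} - 2\rho_{yxh}\frac{C_{xh}}{C_{yh}} \quad \text{and} \quad A_{4h} = \frac{C_{zh}^2}{C_{yh}^2} - 2\rho_{yzh}\frac{C_{zh}}{C_{yh}}.$$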

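Similarly, inspired by the technique adopted by Choudhury and Singh [9], one may frame the corresponding estimator in two-phase stratified random sampling as

$$\bar{y}_{cs}^{h} = \sum_{h=1}^{L} W_h \bar{y}_h \left[k_h \frac{\bar{x}'_h}{\bar{x}_h} \frac{\bar{Z}_h}{\bar{z}'_h} + (1 - k_h) \frac{\bar{x}_h}{\bar{x}'_h} \frac{\bar{z}'_h}{\bar{Z}_h}\right] \tag{6}$$

where $k_h$ is a constant. Its bias, to the first order of approximation, is

$$\mathrm{Bias}(\bar{y}_{cs}^{h}) \approx \sum_{h=1}^{L} W_h \bar{Y}_h A_{5h}$$

$$A_{5h} = (1 - 2k_h) C_{yh} \left[\left(\frac{1}{n_h} - \frac{1}{n'_h}\right)\rho_{yxh} C_{xh} + \left(\frac{1}{n'_h} - \frac{1}{N_h}\right)\rho_{yzh} C_{zh}\right] + k_h \left[\left(\frac{1}{n_h} - \frac{1}{n'_h}\right) C_{xh}^2 + \left(\frac{1}{n'_h} - \frac{1}{N_h}\right) C_{zh}^2\right] \tag{7}$$

and its minimum MSE is

$$\mathrm{MSE}(\bar{y}_{cs}^{h})_{\min} = \sum_{h=1}^{L} W_h^2 S_{yh}^2 \times \left[\left(\frac{1}{n_h} - \frac{1}{N_h}\right) - \frac{\left\{\left(\frac{1}{n_h} - \frac{1}{n'_h}\right)\rho_{yxh} C_{xh} + \left(\frac{1}{n'_h} - \frac{1}{N_h}\right)\rho_{yzh} C_{zh}\right\}^2}{\left(\frac{1}{n_h} - \frac{1}{n'_h}\right) C_{xh}^2 + \left(\frac{1}{n'_h} - \frac{1}{N_h}\right) C_{zh}^2}\right] \tag{8}$$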
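A hedged numerical sketch of the two MSE expressions for a single stratum is given below. It assumes the usual first-order forms: the chain ratio MSE with the sampling factors $(1/n_h - 1/n'_h)$, $(1/n'_h - 1/N_h)$ and $(1/n_h - 1/N_h)$, and the minimum MSE of the ratio-product class as the $k_h$-optimised version of the same expansion. The function names are hypothetical, the stratum weight $W_h^2$ is left out, and the inputs are the Stratum 1 summaries of Population I reported in Section 8.1.

```python
def sampling_factors(n, n_first, N):
    f1 = 1.0 / n - 1.0 / N          # second phase vs. population
    f2 = 1.0 / n_first - 1.0 / N    # first phase vs. population
    f3 = 1.0 / n - 1.0 / n_first    # second phase vs. first phase
    return f1, f2, f3


def mse_chain_ratio(n, n_first, N, S2_y, Cy, Cx, Cz, r_yx, r_yz):
    """Per-stratum term of the chain ratio MSE (without W_h^2)."""
    f1, f2, f3 = sampling_factors(n, n_first, N)
    A3 = Cx ** 2 / Cy ** 2 - 2.0 * r_yx * Cx / Cy
    A4 = Cz ** 2 / Cy ** 2 - 2.0 * r_yz * Cz / Cy
    return S2_y * (f3 * A3 + f2 * A4 + f1)


def min_mse_ratio_product(n, n_first, N, S2_y, Cx, Cz, r_yx, r_yz):
    """Per-stratum term of the minimum MSE of the ratio-product class."""
    f1, f2, f3 = sampling_factors(n, n_first, N)
    numerator = (f3 * r_yx * Cx + f2 * r_yz * Cz) ** 2
    denominator = f3 * Cx ** 2 + f2 * Cz ** 2
    return S2_y * (f1 - numerator / denominator)


# Stratum 1 of Population I (summary values as reported in Section 8.1).
N1, n1_first, n1 = 19, 11, 5
Y1, Cy1, Cx1, Cz1 = 2669.247, 0.28363, 0.17153, 0.31299
r_yx1, r_yz1 = 0.81381, 0.9364
S2_y1 = (Cy1 * Y1) ** 2             # S_yh^2 = (C_yh * Ybar_h)^2

mse_rp = mse_chain_ratio(n1, n1_first, N1, S2_y1, Cy1, Cx1, Cz1, r_yx1, r_yz1)
mse_cs = min_mse_ratio_product(n1, n1_first, N1, S2_y1, Cx1, Cz1, r_yx1, r_yz1)
```

Because the chain ratio estimator is the special case $k_h = 1$ of the ratio-product class, `mse_cs` can never exceed `mse_rp` under these formulas.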

4. Formulation of proposed estimation strategy

Motivated by the earlier work discussed above, we have constructed a class of product to regression chain type estimators as

$$t_p = \sum_{h=1}^{L} W_h \bar{y}_h \left[k_h \frac{\bar{x}'_h}{\bar{x}_h} + (1 - k_h) \frac{\bar{x}'_{dh}}{\bar{x}_h}\right] \tag{9}$$

where $k_h$ ($h = 1, 2, \ldots, L$) is a real constant which can be suitably determined by minimizing the MSE of the class of estimators $t_p$, and $\bar{x}'_{dh} = \bar{x}'_h + b_{xzh}(n'_h)\,(\bar{Z}_h - \bar{z}'_h)$, where $b_{xzh}(n'_h)$ is the regression coefficient between the variables $x$ and $z$ in the $h$th stratum.
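The following sketch evaluates $t_p$ from per-stratum summaries. It assumes the reading of Eq. (9) in which both ratios have the second-phase mean $\bar{x}_h$ in the denominator, which is the form implied by the error expansion and the MSE in Section 5. The dictionary keys, the function name and all numbers are hypothetical.

```python
def proposed_estimator(strata, k):
    """Evaluate t_p from per-stratum summaries and constants k_h."""
    total = 0.0
    for s, kh in zip(strata, k):
        # Regression-adjusted first-phase mean of x (x-bar'_dh in the text).
        x_d = s["x_bar_first"] + s["b_xz"] * (s["Z_bar"] - s["z_bar_first"])
        ratio_part = s["x_bar_first"] / s["x_bar"]
        regression_part = x_d / s["x_bar"]
        total += s["W"] * s["y_bar"] * (kh * ratio_part
                                        + (1.0 - kh) * regression_part)
    return total


# Hypothetical summaries for two strata (not data from the chapter).
strata = [
    {"W": 0.5, "y_bar": 10.0, "x_bar": 50.0, "x_bar_first": 52.0,
     "z_bar_first": 19.0, "Z_bar": 20.0, "b_xz": 2.0},
    {"W": 0.5, "y_bar": 20.0, "x_bar": 100.0, "x_bar_first": 100.0,
     "z_bar_first": 30.0, "Z_bar": 30.0, "b_xz": 1.5},
]
t_p = proposed_estimator(strata, k=[0.5, 0.3])
```

When the first-phase means coincide with $\bar{x}_h$ and $\bar{Z}_h$, as in the second hypothetical stratum, the adjustment factor equals one and the stratum contributes exactly $W_h \bar{y}_h$.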

5. Bias and mean square errors of the proposed class of estimator tp

The proposed class of estimators $t_p$ defined in Eq. (9) is a chain product and regression type estimator and is therefore a biased estimator of the population mean $\bar{Y}$. We obtain its bias and mean square error under large sample approximations using the following transformations:

$$\bar{y}_h = \bar{Y}_h(1 + e_1), \quad \bar{x}_h = \bar{X}_h(1 + e_2), \quad \bar{x}'_h = \bar{X}_h(1 + e_3), \quad \bar{z}'_h = \bar{Z}_h(1 + e_4), \quad s_{xzh} = S_{xzh}(1 + e_5), \quad s_{zh}^2 = S_{zh}^2(1 + e_6)$$

with $E(e_i) = 0$ for $i = 1, 2, \ldots, 6$, where the $e_i$ are relative error terms.

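Under the above transformations the class of estimators $t_p$ may be represented as

$$t_p = \sum_{h=1}^{L} W_h \bar{Y}_h (1 + e_1) \left[k_h (1 + e_3)(1 + e_2)^{-1} + (1 - k_h) \left\{(1 + e_3) - \frac{\bar{Z}_h}{\bar{X}_h} \beta_{xzh} \left(e_4 + e_4 e_5 - e_4 e_6\right)\right\} (1 + e_2)^{-1}\right] \tag{10}$$

where $\beta_{xzh} = S_{xzh}/S_{zh}^2$ is the population regression coefficient of $x$ on $z$ in the $h$th stratum.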

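We have the following expectations of the sample statistics in two-phase stratified sampling:

$$\begin{aligned}
&E(e_1^2) = f_1 C_{yh}^2, \quad E(e_2^2) = f_1 C_{xh}^2, \quad E(e_3^2) = f_2 C_{xh}^2, \quad E(e_4^2) = f_2 C_{zh}^2, \\
&E(e_1 e_2) = f_1 \rho_{yxh} C_{yh} C_{xh}, \quad E(e_1 e_3) = f_2 \rho_{yxh} C_{yh} C_{xh}, \quad E(e_2 e_3) = f_2 C_{xh}^2, \\
&E(e_2 e_4) = E(e_3 e_4) = f_2 \rho_{xzh} C_{xh} C_{zh}, \quad E(e_1 e_4) = f_2 \rho_{yzh} C_{yh} C_{zh}, \\
&E(e_4 e_5) = f_2 \frac{\mu_{102}}{\bar{Z}_h S_{xzh}}, \quad E(e_4 e_6) = f_2 \frac{\mu_{003}}{\bar{Z}_h S_{zh}^2}, \quad E(e_2 e_5) = f_2 \frac{\mu_{201}}{\bar{X}_h S_{xzh}}, \quad E(e_2 e_6) = f_2 \frac{\mu_{102}}{\bar{X}_h S_{zh}^2}
\end{aligned} \tag{11}$$

where

$$f_1 = \frac{1}{n_h} - \frac{1}{N_h}, \quad f_3 = \frac{1}{n_h} - \frac{1}{n'_h}, \quad f_2 = \frac{1}{n'_h} - \frac{1}{N_h},$$

$$\mu_{pqr} = \frac{1}{N_h} \sum_{i=1}^{N_h} (x_{hi} - \bar{X}_h)^p (y_{hi} - \bar{Y}_h)^q (z_{hi} - \bar{Z}_h)^r; \quad (p, q, r \ge 0).$$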

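Expanding binomially, using the results from Eq. (11) and retaining terms up to the first order of sample size, we have derived the expressions for the bias $B(\cdot)$ and mean square error $M(\cdot)$ of the class of estimators $t_p$ as

$$B(t_p) = E(t_p - \bar{Y}) = \sum_{h=1}^{L} W_h \bar{Y}_h \left[(1 - k_h) \beta_{xzh} \frac{\bar{Z}_h}{\bar{X}_h} \left\{f_2 \frac{S_{xzh}}{\bar{X}_h \bar{Z}_h} - f_2 \frac{S_{yzh}}{\bar{Y}_h \bar{Z}_h} - f_2 \left(\frac{\mu_{102}}{S_{xzh} \bar{Z}_h} - \frac{\mu_{003}}{S_{zh}^2 \bar{Z}_h}\right)\right\} + f_3 \left(\frac{S_{xh}^2}{\bar{X}_h^2} - \frac{S_{yxh}}{\bar{Y}_h \bar{X}_h}\right)\right] \tag{12}$$

$$M(t_p) = E(t_p - \bar{Y})^2 = \sum_{h=1}^{L} W_h^2 \bar{Y}_h^2 \left[f_1 C_{yh}^2 + k_h^2 a + 2 k_h b + c\right] \tag{13}$$

where

$$a = f_2 \rho_{xzh}^2 C_{xh}^2, \quad b = f_2 \rho_{yzh} \rho_{xzh} C_{yh} C_{xh} - f_2 \rho_{xzh}^2 C_{xh}^2,$$

$$c = f_3 C_{xh}^2 - 2 f_3 \rho_{yxh} C_{yh} C_{xh} + f_2 \rho_{xzh}^2 C_{xh}^2 - 2 f_2 \rho_{yzh} \rho_{xzh} C_{yh} C_{xh}.$$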

6. Bias reduction for the proposed class of estimators

The bias of an estimator is a serious drawback, so an unbiased version of the proposed class of estimators is desirable.

From Eq. (12), we observe that the expression for the bias of the estimator $t_p$ contains population parameters such as $\mu_{003}$, $\mu_{102}$, $S_{yxh}$, $S_{yzh}$, $S_{xh}^2$, $S_{yh}^2$, $\bar{Y}_h$, $\bar{X}_h$ and $S_{zh}^2$. Here $S_{zh}^2$ is known while $\mu_{003}$, $\mu_{102}$, $S_{yxh}$, $S_{yzh}$, $S_{xh}^2$, $S_{yh}^2$, $\bar{Y}_h$ and $\bar{X}_h$ are unknown. Replacing the unknown parameters by their respective sample estimators (based on the second-phase sample of size $m$) $m_{003}$, $m_{102}$, $s_{yxh}$, $s_{yzh}$, $s_{xh}^2$, $s_{yh}^2$, $\bar{y}_h$ and $\bar{x}_h$, we get an estimator of $B(t_p)$ as

$$b(t_p) = \sum_{h=1}^{L} W_h \bar{y}_h \left[(1 - k_h) b_{xzh} \frac{\bar{z}_h}{\bar{x}_h} \left\{f_2 \frac{s_{xzh}}{\bar{x}_h \bar{z}_h} - f_2 \frac{s_{yzh}}{\bar{y}_h \bar{z}_h} - f_2 \left(\frac{m_{102}}{s_{xzh} \bar{z}_h} - \frac{m_{003}}{s_{zh}^2 \bar{z}_h}\right)\right\} + f_3 \left(\frac{s_{xh}^2}{\bar{x}_h^2} - \frac{s_{yxh}}{\bar{y}_h \bar{x}_h}\right)\right] \tag{14}$$

where $m_{pqr} = \frac{1}{m} \sum_{i=1}^{m} (x_{hi} - \bar{x}_h)^p (y_{hi} - \bar{y}_h)^q (z_{hi} - \bar{z}_h)^r$.

Following the bias reduction techniques of Tracy et al. [5] and Bandyopadhyay and Singh [11], we obtain the unbiased version of the proposed class of estimators $t_p$, to the first order of approximation in two-phase stratified sampling, as

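$$t_p' = t_p - b(t_p)$$

which becomes

$$t_p' = \sum_{h=1}^{L} W_h \bar{y}_h \left[k_h \frac{\bar{x}'_h}{\bar{x}_h} + (1 - k_h) \frac{\bar{x}'_{dh}}{\bar{x}_h} - (1 - k_h) b_{xzh} \frac{\bar{z}_h}{\bar{x}_h} \left\{f_2 \frac{s_{xzh}}{\bar{x}_h \bar{z}_h} - f_2 \frac{s_{yzh}}{\bar{y}_h \bar{z}_h} - f_2 \left(\frac{m_{102}}{s_{xzh} \bar{z}_h} - \frac{m_{003}}{s_{zh}^2 \bar{z}_h}\right)\right\} - f_3 \left(\frac{s_{xh}^2}{\bar{x}_h^2} - \frac{s_{yxh}}{\bar{y}_h \bar{x}_h}\right)\right] \tag{15}$$

Thus, the variance of $t_p'$ to the first order of approximation is obtained as

$$V(t_p') = M(t_p) = \sum_{h=1}^{L} W_h^2 \bar{Y}_h^2 \left[f_1 C_{yh}^2 + k_h^2 a + 2 k_h b + c\right] \tag{16}$$

From Eqs. (10) and (15) it is to be noted that the class of estimators $t_p'$ is preferable to the class of estimators $t_p$ in the two-phase sampling set-up, as $t_p'$ is an unbiased (up to the first order of sample size) estimator of $\bar{Y}$ while $t_p$ is biased.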

7. Minimum variance of proposed class of estimators

It is clear from Eq. (16) that the variance of the proposed class of estimators $t_p'$ depends on the value of the constant $k_h$. We therefore minimize the variance with respect to $k_h$. The optimality condition under which the proposed class of estimators $t_p'$ has minimum variance is obtained as

$$k_h = -\frac{b}{a} \tag{17}$$

Substituting the optimum value of the constant $k_h$ in Eq. (16), we have the minimum variance of the class of estimators $t_p'$ as

$$\mathrm{Min.}V(t_p') = \sum_{h=1}^{L} W_h^2 \bar{Y}_h^2 \left[f_1 C_{yh}^2 - \frac{b^2}{a} + c\right] \tag{18}$$
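The optimum constant and the resulting minimum variance can be checked numerically for one stratum. The sketch below is illustrative only: the function names are hypothetical, and the inputs are the Stratum 1 summaries of Population I from Section 8.1, with $W_1 = 19/80$.

```python
def abc_terms(n, n_first, N, Cy, Cx, r_yx, r_yz, r_xz):
    """Coefficients a, b, c of the quadratic in k_h for one stratum."""
    f2 = 1.0 / n_first - 1.0 / N
    f3 = 1.0 / n - 1.0 / n_first
    a = f2 * r_xz ** 2 * Cx ** 2
    cross = f2 * r_yz * r_xz * Cy * Cx
    b = cross - a
    c = f3 * Cx ** 2 - 2.0 * f3 * r_yx * Cy * Cx + a - 2.0 * cross
    return a, b, c


def variance_term(k, W, Y_bar, n, N, Cy, a, b, c):
    """One stratum's contribution to the first-order variance."""
    f1 = 1.0 / n - 1.0 / N
    return W ** 2 * Y_bar ** 2 * (f1 * Cy ** 2 + k ** 2 * a + 2.0 * k * b + c)


# Stratum 1 of Population I (summary values as reported in Section 8.1).
N1, n1_first, n1 = 19, 11, 5
W1 = N1 / 80.0
Y1, Cy1, Cx1 = 2669.247, 0.28363, 0.17153
r_yx1, r_yz1, r_xz1 = 0.81381, 0.9364, 0.9044

a, b, c = abc_terms(n1, n1_first, N1, Cy1, Cx1, r_yx1, r_yz1, r_xz1)
k_opt = -b / a                                   # optimality condition
v_min = variance_term(k_opt, W1, Y1, n1, N1, Cy1, a, b, c)

# Closed form of the minimum, for comparison with v_min.
f1 = 1.0 / n1 - 1.0 / N1
v_closed = W1 ** 2 * Y1 ** 2 * (f1 * Cy1 ** 2 - b ** 2 / a + c)
```

Since $a > 0$, the quadratic $k_h^2 a + 2 k_h b + c$ is convex in $k_h$, so any other choice of $k_h$ gives a variance term at least as large as `v_min`.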

8. Efficiency comparison of the proposed strategy

It is important to investigate the performance of the proposed class of estimators with respect to the existing ones. We use two natural populations and one artificially generated population to demonstrate the superiority of the proposed strategy.

8.1 Empirical investigations through natural populations

The data sets of the two natural populations are presented below.

  • Population I (Source: Murthy [12], p. 228)

    y: Factory output in thousand rupees, x: Number of workers in the factory, and z: Fixed capital of the factory in thousand rupees.

The data consist of 80 observations, which are divided into four strata according to the auxiliary variable z as: (i) z ≤ 500, (ii) 500 < z ≤ 1000, (iii) 1000 < z ≤ 2000, and (iv) z > 2000. Proportional allocation is used to allocate the sample size to the different strata.

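  1. Stratum 1 ($z \le 500$): $N_1 = 19$, $n'_1 = 11$, $n_1 = 5$, $\bar{Y}_1 = 2669.247$, $\bar{X}_1 = 65.15789$, $\bar{Z}_1 = 349.6842$, $C_{y1} = 0.28363$, $C_{x1} = 0.17153$, $C_{z1} = 0.31299$, $\rho_{yx1} = 0.81381$, $\rho_{yz1} = 0.9364$, $\rho_{xz1} = 0.9044$

  2. Stratum 2 ($500 < z \le 1000$): $N_2 = 32$, $n'_2 = 17$, $n_2 = 8$, $\bar{Y}_2 = 4657.625$, $\bar{X}_2 = 139.9668$, $\bar{Z}_2 = 706.5938$, $C_{y2} = 0.14366$, $C_{x2} = 0.3169$, $C_{z2} = 0.15457$, $\rho_{yx2} = 0.8883$, $\rho_{yz2} = 0.9259$, $\rho_{xz2} = 0.8456$

  3. Stratum 3 ($1000 < z \le 2000$): $N_3 = 14$, $n'_3 = 8$, $n_3 = 3$, $\bar{Y}_3 = 6537.214$, $\bar{X}_3 = 403.2143$, $\bar{Z}_3 = 1539.571$, $C_{y3} = 0.06365$, $C_{x3} = 0.20117$, $C_{z3} = 0.18004$, $\rho_{yx3} = 0.9295$, $\rho_{yz3} = 0.9835$, $\rho_{xz3} = 0.9366$

  4. Stratum 4 ($z > 2000$): $N_4 = 15$, $n'_4 = 9$, $n_4 = 4$, $\bar{Y}_4 = 7843.667$, $\bar{X}_4 = 763.2$, $\bar{Z}_4 = 2620.533$, $C_{y4} = 0.08232$, $C_{x4} = 0.22464$, $C_{z4} = 0.14156$, $\rho_{yx4} = 0.9787$, $\rho_{yz4} = 0.9692$, $\rho_{xz4} = 0.9454$
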
  • Population II (Source: Koyuncu and Kadilar [13]).

y: Number of teachers, x: Number of students in both primary and secondary schools, and z: Number of classes in both primary and secondary schools. The data cover 923 districts in 6 regions of Turkey in 2007: (i) Marmara, (ii) Aegean, (iii) Mediterranean, (iv) Central Anatolia, (v) Black Sea, and (vi) East and Southeast Anatolia (source: The Turkish Republic Ministry of Education).

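  1. Marmara region: $N_1 = 127$, $n'_1 = 60$, $n_1 = 31$, $\bar{Y}_1 = 703.74$, $\bar{X}_1 = 20804.59$, $\bar{Z}_1 = 498.28$, $C_{y1} = 1.25591$, $C_{x1} = 1.46538$, $C_{z1} = 1.115$, $\rho_{yx1} = 0.936$, $\rho_{yz1} = 0.97891$, $\rho_{xz1} = 0.93958$

  2. Aegean region: $N_2 = 117$, $n'_2 = 40$, $n_2 = 21$, $\bar{Y}_2 = 413$, $\bar{X}_2 = 9211.79$, $\bar{Z}_2 = 318.83$, $C_{y2} = 1.56155$, $C_{x2} = 1.64797$, $C_{z2} = 1.14804$, $\rho_{yx2} = 0.996$, $\rho_{yz2} = 0.97624$, $\rho_{xz2} = 0.96958$

  3. Mediterranean region: $N_3 = 103$, $n'_3 = 50$, $n_3 = 29$, $\bar{Y}_3 = 573.17$, $\bar{X}_3 = 14309.3$, $\bar{Z}_3 = 431.36$, $C_{y3} = 1.80307$, $C_{x3} = 1.9253$, $C_{z3} = 1.42097$, $\rho_{yx3} = 0.994$, $\rho_{yz3} = 0.98351$, $\rho_{xz3} = 0.97655$

  4. Central Anatolia region: $N_4 = 170$, $n'_4 = 75$, $n_4 = 38$, $\bar{Y}_4 = 424.66$, $\bar{X}_4 = 9478.85$, $\bar{Z}_4 = 311.32$, $C_{y4} = 1.90878$, $C_{x4} = 1.92206$, $C_{z4} = 1.47124$, $\rho_{yx4} = 0.983$, $\rho_{yz4} = 0.98296$, $\rho_{xz4} = 0.96362$

  5. Black Sea region: $N_5 = 205$, $n'_5 = 40$, $n_5 = 25$, $\bar{Y}_5 = 267.03$, $\bar{X}_5 = 5569.95$, $\bar{Z}_5 = 227.20$, $C_{y5} = 1.51162$, $C_{x5} = 1.52564$, $C_{z5} = 1.14811$, $\rho_{yx5} = 0.989$, $\rho_{yz5} = 0.96434$, $\rho_{xz5} = 0.96725$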

The percentage relative efficiencies (PRE) of the proposed class of estimators $t_p'$ with respect to different estimators, under their respective optimum conditions, are shown in Table 1.

8.2 Empirical investigations through artificially generated population

An important aspect of simulation is that one builds a simulation model to replicate the actual system. Simulation allows comparison of analytical techniques and helps in concluding whether a newly developed technique is better than the existing ones. Motivated by Singh and Deo [14], Singh et al. [15] and Maji et al. [16], who adopted artificial population generation techniques, we have generated five sets of independent random numbers of size $N$ ($N = 100$), namely $x'_{1k}$, $y'_{1k}$, $x'_{2k}$, $y'_{2k}$ and $z'_k$ ($k = 1, 2, 3, \ldots, N$), from a standard normal distribution with the help of the R software. By varying the correlation coefficients $\rho_{yx}$ and $\rho_{xz}$, we have generated the following transformed variables of the population $U$ with $\sigma_y^2 = 50$, $\mu_y = 10$, $\sigma_x^2 = 100$, $\mu_x = 50$, $\sigma_z^2 = 50$ and $\mu_z = 20$:

$$y_{1k} = \mu_y + \sigma_y \left[\rho_{yx} x'_{1k} + \sqrt{1 - \rho_{yx}^2}\; y'_{1k}\right], \quad x_{1k} = \mu_x + \sigma_x x'_{1k}, \quad z_k = \mu_z + \sigma_z \left[\rho_{xz} x'_{1k} + \sqrt{1 - \rho_{xz}^2}\; z'_k\right],$$

$$y_{2k} = y_{1k} \quad \text{and} \quad x_{2k} = x_{1k}.$$

We have split the total population of size $N = 100$ into 5 strata, each of size 20, i.e. $N_h = 20$ ($h = 1, 2, \ldots, 5$), taking the units sequentially, and consider $n'_h = 12$ and $n_h = 8$ ($h = 1, 2, \ldots, 5$) for the efficiency comparison of the proposed strategy.
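The generation and stratification steps can be sketched in Python as below. This is not the authors' R code: it uses only three of the five standard normal sets, the correlation values `rho_yx = 0.8` and `rho_xz = 0.7` are hypothetical placeholders (the text only says the correlations were varied), and the seed and function name are likewise assumptions.

```python
import math
import random


def generate_population(N, rho_yx, rho_xz, rng):
    """Generate N correlated (y, x, z) triples from standard normal draws."""
    mu_y, var_y = 10.0, 50.0
    mu_x, var_x = 50.0, 100.0
    mu_z, var_z = 20.0, 50.0
    sd_y, sd_x, sd_z = math.sqrt(var_y), math.sqrt(var_x), math.sqrt(var_z)
    population = []
    for _ in range(N):
        x1 = rng.gauss(0.0, 1.0)     # standard normal driving x
        y1 = rng.gauss(0.0, 1.0)     # independent noise for y
        z1 = rng.gauss(0.0, 1.0)     # independent noise for z
        y = mu_y + sd_y * (rho_yx * x1 + math.sqrt(1.0 - rho_yx ** 2) * y1)
        x = mu_x + sd_x * x1
        z = mu_z + sd_z * (rho_xz * x1 + math.sqrt(1.0 - rho_xz ** 2) * z1)
        population.append((y, x, z))
    return population


rng = random.Random(100)
# rho_yx and rho_xz below are hypothetical; the chapter does not list them.
population = generate_population(100, rho_yx=0.8, rho_xz=0.7, rng=rng)

# Five sequential strata of size 20, as described in the text.
strata = [population[i:i + 20] for i in range(0, 100, 20)]
```

Because the units are independent draws from one distribution, the sequential strata have similar means and variances, which is why this population is described as homogeneous in the conclusion.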

The percentage relative efficiencies of the proposed class of estimators $t_p'$ with respect to different estimators (under their respective optimum conditions), derived from the data set of the artificially generated population, are shown in Table 2.

9. Conclusion

From the construction of the estimation strategy and the efficiency comparison of the proposed methodology, the following points are noted.

  1. From Table 1, it is clear that the proposed class of estimators is at least 1% more efficient than the existing ones in estimating the population mean.

  2. Similarly, from Table 2 it is found that the new estimator is at least 28% more efficient than the existing ones.

  3. It may also be noted from Tables 1 and 2 that the artificially generated population is homogeneous (the mean and variance of the respective variables are almost the same for different strata), whereas the natural populations are heterogeneous (the mean and variance of the respective variables differ across strata). The suggested estimators perform well for both types of population.

  4. The unbiased version of the proposed technique has been obtained, which makes the proposed class of estimators much more practicable.

| Estimator | PRE, Population I | PRE, Population II |
|---|---|---|
| $\bar{y}_{st}$ | 173.3608 | 192.951 |
| $\bar{y}_{RP}^{h}$ | 101.1429 | 131.5654 |
| $\bar{y}_{cs}^{h}$ | 118.3215 | 172.226 |

Table 1.

PRE of the proposed estimator $t_p'$ with respect to different estimators through the data sets of the natural populations.

The percent relative efficiency (PRE) of the proposed estimator $t_p'$ with respect to an estimator $\bar{y}$ is computed as $\mathrm{PRE} = \dfrac{V(\bar{y})}{\mathrm{Min.}V(t_p')} \times 100$.

| Estimator | PRE, artificially generated population |
|---|---|
| $\bar{y}_{st}$ | 179.623 |
| $\bar{y}_{RP}^{h}$ | 128.256 |
| $\bar{y}_{cs}^{h}$ | 154.879 |

Table 2.

PRE of the proposed estimator $t_p'$ with respect to different estimators through the data set of the artificially generated population.

Thus, the proposed estimation technique addresses the problem of estimation in two-phase stratified sampling and may be useful in real-life applications, especially where the population is heterogeneous in nature and stratification is essential. Given the gains in efficiency achieved by the new estimator, survey statisticians may consider using it.

How to cite and reference

Partha Parichha, Kajla Basu and Arnab Bandyopadhyay (September 27th 2019). Development of Estimation Procedure of Population Mean in Two-Phase Stratified Sampling, Statistical Methodologies, Jan Peter Hessling, IntechOpen, DOI: 10.5772/intechopen.82850. Available from:
