Development of Estimation Procedure of Population Mean in Two-Phase Stratified Sampling

This article addresses the problem of estimating a finite population mean in two-phase stratified random sampling. Using information on two auxiliary variables, a class of product to regression chain type estimators is proposed and its properties are discussed. An unbiased version of the proposed class of estimators is constructed, and the optimality condition for the proposed class is derived. The efficacy of the proposed methodology is justified through empirical investigations carried out on data sets of natural populations as well as an artificially generated population. Survey statisticians may therefore be encouraged to use it.


Introduction
In this paper we make use of auxiliary information extracted from variables that are correlated with the study variable. Auxiliary information may be utilized at the planning, design and estimation stages to develop improved estimation procedures in sample surveys. Sometimes information on an auxiliary variable is readily available for all units of the population; for example, the tonnage (or seat capacity) of each vehicle or ship is known in transportation surveys, and the number of beds available in different hospitals may be known well in advance in health care surveys. If such information is lacking, it is sometimes relatively cheap to take a large preliminary sample in which the auxiliary variable alone is measured; this practice is known as two-phase (or double) sampling. Two-phase stratified sampling is a powerful and cost-effective technique for obtaining, from the first-phase (preliminary) sample, reliable estimates of the unknown parameters of the auxiliary variables. For example, Sukhatme [1] mentioned that in a survey to estimate the production of lime crop based on orchards as sampling units, a comparatively larger sample is drawn to determine the acreage under the crop, while the yield rate is determined from a sub-sample of the orchards selected for determining acreage.
In order to construct an efficient estimator of the population mean of the auxiliary variable in the first-phase (preliminary) sample, Chand [2] introduced a technique of chaining another auxiliary variable with the first auxiliary variable by using the ratio estimator in the first-phase sample. The resulting estimator is known as the chain-type ratio estimator. This work was further extended by Kiregyera [3,4], Tracy et al. [5], Singh and Espejo [6], Gupta and Shabbir [7], Shukla et al. [8], Choudhury and Singh [9] and Parichha et al. [10], among others, who proposed various chain-type ratio and regression estimators.
In practice, the population may often consist of heterogeneous units. For example, in socio-economic surveys, people may live in rural areas, urban localities, ordinary domestic houses, hostels, hospitals, jails, etc. In such a situation one should carefully study the population according to the characteristics of the regions and then apply the sampling scheme independently within each stratum. This procedure is known as stratified random sampling. It may be noted that most developments in the two-phase sampling scheme are based on simple random sampling, while only a limited number of attempts have been made to address the problems of two-phase sampling in the framework of stratified random sampling. It may also be noted that most research work on two-phase sampling produces biased estimates. However, bias is a serious drawback in sample surveys. A sampling method is called biased if it systematically favors some outcomes over others. It results in a biased sample of a population (or non-human factors) in which all individuals, or instances, were not equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling. For example, telephone sampling is common in marketing surveys. A simple random sample may be chosen from a sampling frame consisting of a list of telephone numbers of people in the area being surveyed. This method does involve taking a simple random sample, but it is not a simple random sample of the target population (consumers in the area being surveyed). It will miss people who do not have a phone. It may also miss people who only have a cell phone with an area code outside the region being surveyed. It will also miss people who do not wish to be surveyed, including those who monitor calls on an answering machine and do not answer those from telephone surveyors. Thus the method systematically excludes certain types of consumers in the area. Inferences from a biased sample are clearly not as trustworthy as conclusions from a truly random sample.
Encouraged by the above work, we propose a class of product to regression chain type estimators in stratified sampling using two auxiliary variables under double sampling. The unbiased version of the proposed class of estimators has been obtained, which makes the estimation strategy more practicable. The dominance of the proposed estimation strategy over the conventional ones has been established through empirical investigations carried out on data sets of natural as well as artificially generated populations.

Sampling structures and notations
Consider a finite population $U = \{1, 2, \ldots, N\}$ of N identifiable units divided into L homogeneous strata, with the hth stratum ($h = 1, 2, \ldots, L$) having $N_h$ units, so that $\sum_{h=1}^{L} N_h = N$. Let y and (x, z) be the study variable and the two auxiliary variables, taking values $y_{hi}$ and $(x_{hi}, z_{hi})$, respectively, for the unit $i = 1, 2, \ldots, N_h$ of the hth stratum.
Let $S_{y_h}$, $S_{x_h}$ and $S_{z_h}$ be the population standard deviations of y, x and z in the hth stratum. Let $\rho_{yx_h}$, $\rho_{yz_h}$ and $\rho_{xz_h}$ be the correlation coefficients between (y, x), (y, z) and (x, z), respectively, in the hth stratum. Chand [2] and Kiregyera [3,4] discussed a situation in simple random sampling in which information on x is unknown but another auxiliary variable z is easily available. It is assumed that the population mean of the auxiliary variable z is known in advance and the population mean of the other auxiliary variable x is unknown. We seek to estimate the population mean $\bar{Y}$ through a two-phase stratified sampling design. Using simple random sampling without replacement (SRSWOR) at each phase, we adopt the double sampling scheme as follows.
i. In the first phase, a large preliminary sample of size $n'_h$ is drawn from the hth stratum of size $N_h$ ($h = 1, 2, \ldots, L$), and information on the auxiliary variables x and z is observed.
ii. In the second phase, a sub-sample of size $n_h$ is drawn from the $n'_h$ first-phase sample units of the hth stratum, and information on the study variable y and the auxiliary variables x and z is recorded.
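Let $\bar{x}'_h = \frac{1}{n'_h}\sum_{i=1}^{n'_h} x_{hi}$ and $\bar{z}'_h = \frac{1}{n'_h}\sum_{i=1}^{n'_h} z_{hi}$ be the first-phase sample means, and let $\bar{y}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} y_{hi}$, $\bar{x}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} x_{hi}$ and $\bar{z}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} z_{hi}$ be the corresponding second-phase sample means in the hth stratum.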
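The two-phase scheme described above can be illustrated with a short Python sketch. It only draws unit indices within each stratum by SRSWOR at both phases; the function name `two_phase_srswor`, its arguments and the use of NumPy are assumptions made for illustration and are not part of the original study.

```python
import numpy as np

def two_phase_srswor(stratum_sizes, first_phase_sizes, second_phase_sizes, seed=0):
    """Draw first- and second-phase SRSWOR samples independently in each stratum.

    Returns a list with one (first_phase_indices, second_phase_indices) pair
    per stratum; indices refer to unit positions 0..N_h-1 inside the stratum.
    """
    rng = np.random.default_rng(seed)
    design = []
    for N_h, n1_h, n2_h in zip(stratum_sizes, first_phase_sizes, second_phase_sizes):
        if not (n2_h <= n1_h <= N_h):
            raise ValueError("need n_h <= n'_h <= N_h in every stratum")
        # Phase 1: large preliminary sample on which only x and z are observed.
        first = rng.choice(N_h, size=n1_h, replace=False)
        # Phase 2: sub-sample of the phase-1 units on which y is also observed.
        second = rng.choice(first, size=n2_h, replace=False)
        design.append((np.sort(first), np.sort(second)))
    return design

# Example call with five strata of 20 units, n'_h = 12 and n_h = 8.
design = two_phase_srswor([20] * 5, [12] * 5, [8] * 5, seed=1)
```

Because the second-phase units are drawn from the first-phase units, the auxiliary values observed at phase one are available for every unit on which y is measured.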

Discussion on existing estimation strategies
The usual stratified mean estimator $\bar{y}_{st}$ of the population mean $\bar{Y}$ is considered first, together with its mean square error (MSE). Motivated by the technique adopted by Chand [2], one may frame a chain ratio-product type estimator in the stratified sampling structure; its bias and MSE are obtained to the first order of approximation. Similarly, inspired by the technique adopted by Choudhury and Singh [9], one may frame a two-phase stratified random sampling estimator involving a constant $k_h$, whose MSE is likewise obtained to the first order of approximation.
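For orientation, the textbook form of the stratified mean, $\bar{y}_{st} = \sum_h W_h \bar{y}_h$ with $W_h = N_h/N$, and its variance under SRSWOR within strata, $\sum_h W_h^2 (1/n_h - 1/N_h) S^2_{y_h}$, can be sketched in Python as follows. This is the standard result rather than a transcription of the expressions in this section; the function names are hypothetical, and the sample variance is substituted for $S^2_{y_h}$ as an assumption.

```python
import numpy as np

def stratified_mean(samples, stratum_sizes):
    """Stratified mean: sum over strata of W_h * (sample mean of stratum h)."""
    N = float(sum(stratum_sizes))
    weights = np.array(stratum_sizes) / N
    means = np.array([np.mean(s) for s in samples])
    return float(np.sum(weights * means))

def stratified_mean_variance(samples, stratum_sizes):
    """Estimated variance: sum of W_h^2 * (1/n_h - 1/N_h) * s_h^2 (SRSWOR)."""
    N = float(sum(stratum_sizes))
    total = 0.0
    for s, N_h in zip(samples, stratum_sizes):
        n_h = len(s)
        W_h = N_h / N
        s2_h = np.var(s, ddof=1)  # unbiased sample variance, stands in for S^2
        total += W_h ** 2 * (1.0 / n_h - 1.0 / N_h) * s2_h
    return float(total)

# Toy data (hypothetical): two strata of sizes 4 and 6.
toy_samples = [[1.0, 3.0], [10.0, 20.0, 30.0]]
toy_sizes = [4, 6]
toy_mean = stratified_mean(toy_samples, toy_sizes)
toy_var = stratified_mean_variance(toy_samples, toy_sizes)
```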

Formulation of proposed estimation strategy
Motivated by the earlier work discussed above, we have constructed a class of product to regression chain type estimators $t_p$, where $k_h$ ($h = 1, 2, \ldots, L$) is a real constant that can be suitably determined by minimizing the MSE of the class of estimators $t_p$, and the coefficient appearing in the regression part is the regression coefficient between the variables x and z in the hth stratum.

Bias and mean square errors of the proposed class of estimators $t_p$
It can easily be noted that the proposed class of estimators $t_p$ defined in Eq. (8) is a chain product and regression type estimator. Therefore, it is a biased estimator of the population mean $\bar{Y}$. We thus obtain its bias and mean square error under large-sample approximations, using transformations in terms of relative error terms $e_i$ ($i = 1, 2, \ldots, 6$) with $E(e_i) = 0$. Under these transformations, the class of estimators $t_p$ may be expressed in terms of the $e_i$, and the expectations of the sample statistics under two-phase stratified sampling are evaluated. Expanding binomially, using the results from Eq. (1) and retaining terms up to the first order of sample size, we have derived the expressions for the bias B(.) and mean square error M(.) of the class of estimators $t_p$.

Bias reduction for the proposed class of estimators
Bias is a serious drawback of an estimator, so unbiased versions of the proposed class of estimators are more desirable. Motivated by this argument and influenced by the bias correction techniques of Tracy et al. [5] and Bandyopadhyay and Singh [11], we proceed to derive the unbiased version of our proposed class of estimators $t_p$.
From Eq. (12), we observe that the expression for the bias of the estimator $t_p$ contains population parameters such as $\mu_{003}$, $\mu_{102}$, $S_{yx_h}$, $S_{yz_h}$ and $S^2_{x_h}$. Replacing these by their respective sample estimates, such as $\bar{y}_h$, $\bar{x}_h$ and $s_{yz_h}$, we get an estimator of $B(t_p)$, where $m_{pqr} = \frac{1}{m}\sum_{i=1}^{m} (x_{hi} - \bar{x}_h)^p (y_{hi} - \bar{y}_h)^q (z_{hi} - \bar{z}_h)^r$. Following the bias reduction techniques of Tracy et al. [5] and Bandyopadhyay and Singh [11], we have derived the unbiased version $t'_p$ of the proposed class of estimators $t_p$, to the first order of approximation, under two-phase stratified sampling.
Thus, the variance of $t'_p$ is obtained to the first order of approximation. From Eqs. (10) and (15), it is to be noted that the class of estimators $t'_p$ is preferable to the class of estimators $t_p$ in the two-phase sampling set-up, as $t'_p$ is an unbiased (up to the first order of sample size) class of estimators of $\bar{Y}$, while the class of estimators $t_p$ is biased.

Minimum variance of proposed class of estimators
It is obvious from Eq. (16) that the variance of the proposed class of estimators $t'_p$ depends on the value of the constant $k_h$. We therefore seek the value of $k_h$ that minimizes this variance. The optimality condition under which the proposed class of estimators $t'_p$ attains minimum variance is obtained, and substituting the optimum value of the constant $k_h$ in Eq. (19) gives the minimum variance of the class of estimators $t'_p$.

Efficiency comparison of the proposed strategy
It is important to investigate the performance of the proposed class of estimators with respect to the existing ones. We use the data sets of two natural populations and one artificially generated population to justify the superiority of the proposed strategy.

Empirical investigations through natural populations
The data sets of the two natural populations are presented below.
• Population I (Source: Murthy [12], p. 228) y: Factory output in thousand rupees, x: Number of workers in the factory, and z: Fixed capital of factory in thousand rupees.
The data consist of 80 observations, which are divided into four strata according to the auxiliary variable z as: (i) z ≤ 500, (ii) 500 < z ≤ 1000, (iii) 1000 < z ≤ 2000, and (iv) z > 2000. Proportional allocation is used to allocate the sample size to the different strata.
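Proportional allocation assigns $n_h = n N_h / N$ units to the hth stratum. A minimal Python sketch is given here; the stratum sizes in the example are hypothetical (the actual stratum counts for Population I are not reproduced in this text), and the largest-remainder rounding is an assumption made so that the integer allocations sum to n.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n proportionally to stratum sizes.

    Fractional shares are floored, and the leftover units are handed to the
    strata with the largest fractional parts (largest-remainder rounding).
    """
    N = sum(stratum_sizes)
    exact = [n * N_h / N for N_h in stratum_sizes]
    alloc = [int(e) for e in exact]  # floor, since all shares are non-negative
    leftover = n - sum(alloc)
    by_remainder = sorted(range(len(exact)),
                          key=lambda h: exact[h] - alloc[h],
                          reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc

# Hypothetical split of 80 units into four strata, total sample size 16.
example_allocation = proportional_allocation([20, 30, 20, 10], 16)
```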

Marmara region, Aegean region, Central Anatolia region

Empirical investigations through artificially generated population
An important aspect of simulation is that one builds a simulation model to replicate the actual system. Simulation allows comparison of analytical techniques and helps in concluding whether a newly developed technique is better than the existing ones. Motivated by Singh and Deo [14], Singh et al. [15] and Maji et al. [16], who adopted artificial population generation techniques, we have generated five sets of independent random numbers of size N (N = 100), namely $x'_{1k}$, $y'_{1k}$, $x'_{2k}$, $y'_{2k}$ and $z'_k$ ($k = 1, 2, 3, \ldots, N$), from a standard normal distribution with the help of R software. By varying the correlation coefficients $\rho_{yx}$ and $\rho_{xz}$, we have generated transformed variables of the population U with $\sigma^2_y = 50$, $\mu_y = 10$, $\sigma^2_x = 100$, $\mu_x = 50$, $\sigma^2_z = 50$ and $\mu_z = 20$. We have split the total population of size N = 100 into 5 strata, each of size 20 (i.e., $N_h = 20$; $h = 1, 2, \ldots, 5$), taking the units sequentially, and consider $n'_h = 12$ and $n_h = 8$ ($h = 1, 2, \ldots, 5$) for the efficiency comparison of the proposed strategy. The percentage relative efficiencies of the proposed class of estimators $t'_p$ with respect to different estimators (under their respective optimum conditions), derived from the data set of the artificially generated population, are reported in Table 2.
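A population of this kind can be generated along the following lines. The sketch is written in Python rather than R, uses only three standard normal sets instead of five, and applies a common linear-mixing transformation to induce the correlations; the exact transformation used in the study is not reproduced in this text, so the mixing formulas, the function names and the values of $\rho_{yx}$ and $\rho_{xz}$ below are all assumptions made for illustration.

```python
import numpy as np

def generate_population(N=100, rho_yx=0.8, rho_xz=0.7, seed=0):
    """Generate (y, x, z) with the stated means/variances and target correlations.

    ASSUMPTION: y and z are built by mixing independent standard normals with
    the standardized x; this is one common construction, not the study's own.
    """
    rng = np.random.default_rng(seed)
    x0, y0, z0 = rng.standard_normal((3, N))
    mu_y, var_y = 10.0, 50.0
    mu_x, var_x = 50.0, 100.0
    mu_z, var_z = 20.0, 50.0
    x = mu_x + np.sqrt(var_x) * x0
    y = mu_y + np.sqrt(var_y) * (rho_yx * x0 + np.sqrt(1 - rho_yx ** 2) * y0)
    z = mu_z + np.sqrt(var_z) * (rho_xz * x0 + np.sqrt(1 - rho_xz ** 2) * z0)
    return y, x, z

def split_sequentially(values, n_strata=5):
    """Split units into equal-sized strata in their generated order."""
    return np.split(np.asarray(values), n_strata)

y, x, z = generate_population(N=100)
y_strata = split_sequentially(y)  # five strata of 20 units each
```

With this construction the population correlation between y and x equals `rho_yx`, and that between x and z equals `rho_xz`; in a finite population of size 100 the realized correlations will differ somewhat from these targets.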

Conclusion
From the construction of the estimation strategy and the efficiency comparison of the proposed methodology, the following points are noted.
1. From Table 1, it is clear that the proposed class of estimators is at least 1% better than the existing ones in estimating the population mean.
2. Similarly, from Table 2 it is found that the new estimator is at least 28% better than the existing ones.
3. It may also be noted from Tables 1 and 2 that the artificially generated population is homogeneous (the means and variances of the respective variables are almost the same across strata), whereas the natural populations are heterogeneous (the means and variances of the respective variables differ across strata). The suggested estimators perform equally well for both types.
4. The unbiased version of the proposed technique has been obtained, which makes the proposed class of estimators much more practicable.
Thus, the proposed estimation technique addresses the problem of estimation through two-phase stratified sampling and may be useful in real-life applications where the population is heterogeneous in nature and stratification is essential. Owing to the benefits achieved by the new estimator, survey statisticians may be encouraged to use it.
The percent relative efficiency (PRE) of the proposed estimator $t'_p$ with respect to the different estimators is used as the measure of comparison.
Table 2. PRE of the proposed estimator $t'_p$ with respect to different estimators through the data set of the artificially generated population.
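The PRE expression itself is not reproduced in this text. Under the conventional definition, which is assumed here, the PRE of a proposed estimator relative to an existing one is 100 times the ratio of the existing estimator's MSE to the proposed estimator's variance, so that values above 100 favor the proposed estimator. A minimal Python sketch with a hypothetical function name:

```python
def percent_relative_efficiency(mse_existing, var_proposed):
    """Conventional PRE (assumed definition): 100 * MSE(existing) / Var(proposed)."""
    if var_proposed <= 0:
        raise ValueError("variance of the proposed estimator must be positive")
    return 100.0 * mse_existing / var_proposed

# Illustrative numbers only: an existing MSE 28% larger than the proposed variance.
example_pre = percent_relative_efficiency(1.28, 1.0)
```

Under this assumed definition, a gain of at least 28% corresponds to a PRE of at least 128.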