Sample Survey Methods
Sample Survey Methods STAT 422
This 19 page Class Notes was uploaded by Mr. Alex Berge on Friday October 23, 2015. The Class Notes belongs to STAT 422 at University of Idaho taught by Christopher Williams in Fall. Since its upload, it has received 31 views. For similar materials see /class/227939/stat-422-university-of-idaho in Statistics at University of Idaho.
Date Created: 10/23/15
An example of estimating p from two-stage cluster sampling

Suppose in problem 9.9 we sampled 6 bottles per case instead of 4 and checked each to see if the volume was at least 8 ounces. We can then estimate the proportion of bottles in each case that are completely filled. Here are the calculations to estimate the overall proportion of bottles completely filled from the new machine:

$$N = 24, \quad n = 6, \quad M_i = 12, \quad m_i = 6, \quad M = 24(12) = 288, \quad \bar{M} = M/N = 12,$$

$$\hat{p} = \frac{\sum_{i=1}^{n} M_i \hat{p}_i}{n\bar{M}} = \frac{\sum_{i=1}^{n} M_i \hat{p}_i}{72}.$$

The estimated variance is

$$\hat{V}(\hat{p}) = \left(\frac{N-n}{N}\right)\frac{1}{n\bar{M}^2}\,s_b^2 + \frac{1}{nN\bar{M}^2}\sum_{i=1}^{n} M_i^2\left(\frac{M_i-m_i}{M_i}\right)\frac{\hat{p}_i\hat{q}_i}{m_i-1}, \qquad s_b^2 = \frac{\sum_{i=1}^{n}(M_i\hat{p}_i - \bar{M}\hat{p})^2}{n-1}.$$

For these data we have

$$\sum_{i=1}^{n} M_i^2\left(\frac{M_i-m_i}{M_i}\right)\frac{\hat{p}_i\hat{q}_i}{m_i-1} = 18.397, \qquad s_b^2 = \frac{21.355}{5} = 4.271,$$

so

$$\hat{V}(\hat{p}) = \frac{18}{24}\cdot\frac{1}{6(12^2)}(4.271) + \frac{1}{6(24)(12^2)}(18.397) = 0.00371 + 0.000887 = 0.004597,$$

giving a bound of $B = 2\sqrt{0.004597} = 0.136$.

Two practical tools: sample weights and the design effect

Use of sample weights

We have learned about many important topics in designing sample surveys, such as stratification, clustering, and the use of auxiliary information with ratio estimation. These ideas can be combined in many ways to obtain very complex multistage sampling designs, and our chapter on two-stage cluster sampling gave us a glimpse of the complexity that can be involved in more sophisticated designs. As we saw in that chapter, the usual estimators, particularly variance estimators, can become very complicated. In actual sampling studies, researchers often simplify calculations by using sampling weights and computational approximations for variance calculations.

Here we will introduce sampling weights by rewriting the expression for the estimator of the mean in stratified random sampling. Recall from the chapter on stratified random sampling that the estimator of $\mu$ is

$$\bar{y}_{st} = \frac{1}{N}\sum_{i=1}^{L} N_i\bar{y}_i = \frac{1}{N}\sum_{i=1}^{L}\sum_{j=1}^{n_i} w_{ij}\,y_{ij},$$

where $w_{ij} = N_i/n_i$ is the weight for the $j$th observation in stratum $i$ and has the interpretation that each observation in the sample represents $w_{ij} = N_i/n_i$ members of the population. Thus if a population of $N = 1000$ elements is divided into four strata, each of size $N_i = 250$, and if equal sample sizes of $n_i = 100$ are used for each stratum, then each
observation in the sample represents $N_i/n_i = 2.5$ elements from the population. The general idea is that a sampling weight is the reciprocal of a selection probability, so for the StRS example above, $\delta_{ij} = 1/w_{ij} = n_i/N_i$ is the probability of being sampled for a member of the $i$th stratum.

In many sampling studies, the sampling weights are calculated as the sampling design is developed. Once the sampling weights are calculated, any quantity of interest can be calculated as a weighted sum, as exemplified in the StRS expression above, and computational methods such as Taylor series, jackknife, or bootstrap methods can be used to calculate an approximate variance estimate. For multistage samples, weights are generated for each stage and then multiplied to obtain an overall weight.

The SAS example for this lecture shows the use of SAS Proc SURVEYSELECT to generate a two-stage cluster sample. The sampling weights produced from this sample are combined with the data in Proc SURVEYMEANS to obtain an estimate of the mean and an approximate variance estimate, which can be compared to the result obtained using the explicit formulas for estimating the mean from two-stage cluster sampling.

Use of a design effect

Computing sample sizes for complex surveys that are repeated over time is made easier with the concept of a design effect, denoted by deff. The design effect for a sampling plan and a statistic of interest is defined to be the ratio of the estimated variance of the statistic under the sampling plan to the estimated variance of the statistic under simple random sampling. As an example, consider estimating a proportion from a complex multistage design. The design effect for the complex design would be

$$\text{deff}(\text{complex design}, \hat{p}) = \frac{\hat{V}(\text{estimate from complex design})}{\hat{V}(\text{SRS with same sample size})} = \frac{\hat{V}(\text{estimate from complex design})}{\hat{p}(1-\hat{p})/n}.$$

The design effect is similar to a relative efficiency and measures the loss or gain in efficiency of the complex design relative to an SRS design. This is extremely useful when computing sample sizes for a future sample survey.
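As a rough illustration of the two tools above (this is not the SAS example from lecture), the Python sketch below builds the weights $w_{ij} = N_i/n_i$ for the four-strata example, checks that the weighted-sum form of the stratified mean agrees with $(1/N)\sum N_i\bar{y}_i$, and computes a design effect for a proportion. The function names, the simulated data, and the values 0.0017, 0.5, and 250 passed to `design_effect` are all hypothetical and chosen only for illustration.

```python
import random


def stratified_weights(stratum_sizes, sample_sizes):
    """Return the weight N_i / n_i for each stratum (shared by every unit in it)."""
    return [N_i / n_i for N_i, n_i in zip(stratum_sizes, sample_sizes)]


def weighted_mean(samples, weights, N):
    """Weighted-sum form of the stratified mean: (1/N) * sum_i sum_j w_ij * y_ij."""
    return sum(w * y for ys, w in zip(samples, weights) for y in ys) / N


def design_effect(var_complex, p_hat, n):
    """deff for a proportion: complex-design variance over the SRS variance p(1-p)/n."""
    return var_complex / (p_hat * (1.0 - p_hat) / n)


# Four strata of 250 elements with 100 sampled from each, as in the text.
stratum_sizes = [250, 250, 250, 250]
sample_sizes = [100, 100, 100, 100]
N = sum(stratum_sizes)

# Hypothetical measurements, simulated only so the formulas have data to run on.
rng = random.Random(422)
samples = [
    [rng.gauss(10 + 2 * i, 1.0) for _ in range(n_i)]
    for i, n_i in enumerate(sample_sizes)
]

weights = stratified_weights(stratum_sizes, sample_sizes)
y_st_weighted = weighted_mean(samples, weights, N)
y_st_direct = sum(
    N_i * sum(ys) / len(ys) for N_i, ys in zip(stratum_sizes, samples)
) / N

# Hypothetical inputs: a complex-design variance of 0.0017 with p_hat = 0.5, n = 250.
deff = design_effect(var_complex=0.0017, p_hat=0.5, n=250)

print("weights per stratum:", weights)
print("weighted mean:", y_st_weighted, "direct stratified mean:", y_st_direct)
print("design effect:", deff)
```

Storing one weight per observation, rather than one per stratum, is what lets the same weighted sum carry over to multistage designs, where the overall weight is the product of the stage weights.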
For a future survey, the sample size estimate is just the SRS sample size for a given bound multiplied by the design effect. For example, suppose a multistage sampling plan that involved clustering and stratification was used to estimate a proportion, and the design effect was 1.7. Then for the next survey, the sample size for an SRS sample and the given bound can be calculated and multiplied by 1.7 to give a sample size for the complex sample design.

Reference

Lohr, S.L. (1999). Sampling: Design and Analysis. Pacific Grove, CA: Brooks/Cole.

Introduction to unequal-probability sampling: PPS sampling with replacement

Unequal probability sampling with replacement

As we saw in the jobs example, there are situations in which it is desirable to have unequal probabilities of selecting elements into the sample. In section 3.3 of the text, it is stated that for a population with elements $u_1, u_2, \ldots, u_N$, we might choose to sample elements with replacement with respective selection probabilities $\delta_1, \delta_2, \ldots, \delta_N$. In that situation the estimator

$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{\delta_i}$$

is an unbiased estimator for $\tau$, and an unbiased estimator of $V(\hat{\tau})$ is given by

$$\hat{V}(\hat{\tau}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\delta_i} - \hat{\tau}\right)^2.$$

Note that this variance estimator is of the form $s^2/n$ and is unbiased for $V(\hat{\tau})$ because the with-replacement sampling scheme yields a sample of $n$ separate, independent estimates of $\tau$, namely $y_i/\delta_i$. This estimator and its variance estimator are due to Hansen and Hurwitz (1943), and thus the estimator is called the Hansen-Hurwitz estimator.

Sampling with replacement yields estimators whose theoretical properties are easy to understand, but sampling with replacement is often inefficient and can be impractical in some situations. Later we will discuss general approaches to unequal probability sampling without replacement, using estimators developed by Horvitz and Thompson (1952).

PPS sampling with replacement

In cluster sampling, it is often useful to use unequal probability sampling with replacement with probabilities
proportional to size (PPS). In this case, $\delta_i = m_i/M$, and our estimator of the total is

$$\hat{\tau}_{pps} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{m_i/M} = \frac{M}{n}\sum_{i=1}^{n}\frac{y_i}{m_i} = \frac{M}{n}\sum_{i=1}^{n}\bar{y}_i,$$

where $y_i$ is the total and $\bar{y}_i$ is the average of the observations in cluster $i$. To obtain an estimator of the mean we can just divide by $M$, giving

$$\hat{\mu}_{pps} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_i.$$

The estimated variances of these two estimators are then

$$\hat{V}(\hat{\tau}_{pps}) = \frac{M^2}{n(n-1)}\sum_{i=1}^{n}(\bar{y}_i - \hat{\mu}_{pps})^2$$

and

$$\hat{V}(\hat{\mu}_{pps}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}(\bar{y}_i - \hat{\mu}_{pps})^2.$$

See the examples from lecture and in the SAS code on the web.

References

Hansen, M.H. and Hurwitz, W.N. (1943). On the theory of sampling from finite populations. Annals of Mathematical Statistics 14, 333-362.

Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 663-685.

Comparison of Estimates

For any two random variables $y_1$ and $y_2$ we have $E(y_1 - y_2) = E(y_1) - E(y_2)$ and $V(y_1 - y_2) = V(y_1) + V(y_2) - 2\,\text{cov}(y_1, y_2)$. If $y_1$ and $y_2$ are independent, then the variance simplifies to $V(y_1 - y_2) = V(y_1) + V(y_2)$.

For comparing sample means, we will only consider the simplified case where the estimates are independent. In that case the estimator of $\mu_1 - \mu_2$ is $\bar{y}_1 - \bar{y}_2$, with

$$\hat{V}(\bar{y}_1 - \bar{y}_2) = \hat{V}(\bar{y}_1) + \hat{V}(\bar{y}_2).$$

If the population sizes are large, we often disregard the finite population correction (fpc) terms.

For comparing sample proportions, we will consider both the case where the samples are independent and the case where the two proportion estimates are dependent because of multinomial sampling. When the estimates are independent, the estimator of $p_1 - p_2$ is $\hat{p}_1 - \hat{p}_2$, with

$$\hat{V}(\hat{p}_1 - \hat{p}_2) = \hat{V}(\hat{p}_1) + \hat{V}(\hat{p}_2) = \frac{\hat{p}_1\hat{q}_1}{n_1 - 1}\left(\frac{N_1 - n_1}{N_1}\right) + \frac{\hat{p}_2\hat{q}_2}{n_2 - 1}\left(\frac{N_2 - n_2}{N_2}\right),$$

but notice that in the text they use $n$ in the denominator of the variance instead of $n - 1$, and they ignore the fpc terms.

When the estimates are from a multinomial sample (responses such as "yes", "no", or "maybe"), the proportion estimates are dependent and the appropriate estimators are $\hat{p}_1 - \hat{p}_2$ and

$$\hat{V}(\hat{p}_1 - \hat{p}_2) = \hat{V}(\hat{p}_1) + \hat{V}(\hat{p}_2) - 2\,\widehat{\text{cov}}(\hat{p}_1, \hat{p}_2) = \left[\frac{\hat{p}_1\hat{q}_1}{n-1} + \frac{\hat{p}_2\hat{q}_2}{n-1} + \frac{2\hat{p}_1\hat{p}_2}{n-1}\right]\left(\frac{N-n}{N}\right).$$

Notice that the third term in the variance expression has a plus sign
instead of a minus sign. This is because estimates of proportions from multinomial sampling are negatively correlated. In other words, if you know that more people said "yes", then fewer of them can have said "no", for example.

Advantages and Disadvantages of SRS designs

Comparison of estimators

Here we discuss the choice of an estimator of the mean $\mu_y$, comparing the ratio, regression, and difference estimators of the current chapter and the SRS estimator from Chapter 4. We first consider the bias of the four methods, then we introduce the concept of relative efficiency as a way to compare variances of estimators.

Bias

The sample mean from Chapter 4 is unbiased with SRS, and the difference estimator is also unbiased with SRS. The regression estimator is biased in sampling from finite populations, but the bias is usually small if the relationship between $y$ and $x$ is linear. The ratio estimator $r = \bar{y}/\bar{x}$ is a biased estimator of $R = \mu_y/\mu_x$, and an approximation to its relative bias is

$$\left(\frac{N-n}{nN}\right)\left(\frac{s_x^2}{\bar{x}^2} - \hat{\rho}\,\frac{s_x s_y}{\bar{x}\,\bar{y}}\right),$$

where $\hat{\rho}$ is the sample correlation between $x$ and $y$. This quantity can be calculated for a given data set, and simulations can also be used to understand the bias of both the ratio and regression estimators. See the example in Table 6.12 in the book and the SAS example on the web.

Comparing variances: the concept of relative efficiency

If two estimators are unbiased or have sufficiently small bias, we then rely on variance comparisons to choose the best estimator. One way to make a variance comparison is through the concept of relative efficiency. For two estimators, denoted by $E_1$ and $E_2$, based on the same sample size, the relative efficiency of $E_1$ to $E_2$ is defined as

$$RE\left(\frac{E_1}{E_2}\right) = \frac{V(E_2)}{V(E_1)}.$$

Thus if $RE(E_1/E_2) > 1$, estimator $E_1$ is favored. We will focus on estimated variances and use an estimated relative efficiency,

$$\widehat{RE}\left(\frac{E_1}{E_2}\right) = \frac{\hat{V}(E_2)}{\hat{V}(E_1)}.$$

The comparisons in the text are based on the following estimated variances:

$$\hat{V}(\bar{y}) = \left(\frac{N-n}{Nn}\right)s_y^2,$$

$$\hat{V}(\hat{\mu}_y) = \left(\frac{N-n}{Nn}\right)\left(s_y^2 + r^2 s_x^2 - 2r\hat{\rho}\,s_x s_y\right),$$

$$\hat{V}(\hat{\mu}_{yL}) = \left(\frac{N-n}{Nn}\right)s_y^2\left(1 - \hat{\rho}^2\right),$$

and

$$\hat{V}(\hat{\mu}_{yD}) = \left(\frac{N-n}{Nn}\right)\left(s_y^2 + s_x^2 - 2\hat{\rho}\,s_x s_y\right)$$

for the sample mean
of $y$, the ratio estimator, the regression estimator, and the difference estimator, respectively. After a few algebraic manipulations the following expressions emerge, where $b = \hat{\rho}\,s_y/s_x$ is the estimated regression slope:

$$\widehat{RE}\left(\frac{\hat{\mu}_{yL}}{\bar{y}}\right) = \frac{1}{1 - \hat{\rho}^2} > 1 \quad \text{unless } \hat{\rho} = 0,$$

$$\widehat{RE}\left(\frac{\hat{\mu}_{yL}}{\hat{\mu}_y}\right) = \frac{s_y^2 + r^2 s_x^2 - 2r\hat{\rho}\,s_x s_y}{s_y^2(1 - \hat{\rho}^2)} > 1 \quad \text{unless } b = r,$$

$$\widehat{RE}\left(\frac{\hat{\mu}_{yL}}{\hat{\mu}_{yD}}\right) = \frac{s_y^2 + s_x^2 - 2\hat{\rho}\,s_x s_y}{s_y^2(1 - \hat{\rho}^2)} > 1 \quad \text{unless } b = 1.$$

Conclusions

Stratified Random Sampling

When we know that parts of the population differ with respect to the quantity that we are estimating, we can obtain better estimates by using stratification. A stratified random sample is obtained by separating the population elements into nonoverlapping groups, called strata, and then selecting a simple random sample from each stratum. Stratification is particularly useful because (i) it can yield smaller error bounds than SRS, especially when the measurement is homogeneous within strata, (ii) the cost per observation can be lowered by appropriate choice of strata, and (iii) it will yield stratum-specific estimates of population parameters.

Drawing a Stratified Random Sample: notation

Choose strata, then take an SRS from each.

$L$ = number of strata
$N_i$ = population size of stratum $i$
$N = N_1 + N_2 + \cdots + N_L$ = total population size
$n_i$ = sample size from stratum $i$
$n = n_1 + n_2 + \cdots + n_L$ = total sample size

Estimation of a population mean and total from stratified random sampling

To estimate the mean $\mu$ from stratified random sampling, we use the stratified sample mean

$$\bar{y}_{st} = \frac{1}{N}\left[N_1\bar{y}_1 + N_2\bar{y}_2 + \cdots + N_L\bar{y}_L\right] = \frac{1}{N}\sum_{i=1}^{L} N_i\bar{y}_i.$$

Since the strata are independent, we obtain the estimated variance by summing the estimated variances from each stratum:

$$\hat{V}(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(\frac{N_i - n_i}{N_i}\right)\frac{s_i^2}{n_i}.$$

To estimate the total $\tau$ from stratified random sampling, we just estimate the total $\tau_i$ from each stratum and sum. Also, since the strata are independent, the estimated variance of the total is just the sum of the estimated variances from each stratum:

$$\hat{\tau}_{st} = \sum_{i=1}^{L} N_i\bar{y}_i = N\bar{y}_{st} \quad \text{and} \quad \hat{V}(\hat{\tau}_{st}) = \sum_{i=1}^{L} N_i^2\left(\frac{N_i - n_i}{N_i}\right)\frac{s_i^2}{n_i}.$$

SAS output for the two-stage cluster sampling example

(SAS listing of the raw data, with columns Obs, plant, bigm, and downtime; the individual values are not legible here.)

Summary by plant:

    Obs  plant   mi    mtime   mpop    Vtime
      1      1   10  5.40000     50  11.3778
      2      2   13  4.00000     65  10.6667
      3      3    9  5.66667     45  16.7500
      4      4   10  4.80000     48  13.2889
      5      5   10  4.30000     52  11.1222
      6      6   12  3.83333     58  14.8788
      7      7    8  5.00000     42   5.1429
      8      8   13  3.84615     66   4.3077
      9      9    8  4.87500     40   6.1250
     10     10   11  5.00000     56  11.8000

The MEANS Procedure

    Variable          Sum          Mean       Variance
    bigmy         2400.18   240.0179487    768.3270962
    Vterm2part   21985.30       2198.53      755298.90

Unbiased Estimate of Population Mean, Two-stage Cluster Design

    Estimate   Standard Error     Bound     sb^2
     4.80036          0.19259   0.38517  768.327

Ratio Estimate of Population Mean, Two-stage Cluster Design

    Estimate   Standard Error     Bound     sr^2
     4.59804          0.23165   0.46330  1234.42

Covariance and Correlation

We often want to understand the interrelationship between two random variables. One measure of the linear dependence between two random variables is their covariance,

$$\text{cov}(y_1, y_2) = E\left[(y_1 - \mu_1)(y_2 - \mu_2)\right],$$

which basically measures how well the random variables agree in their deviations about their means. It is difficult to compare covariances because of scale differences, so we usually also calculate their correlation, which is a scaled version of their covariance:

$$\rho = \frac{\text{cov}(y_1, y_2)}{\sigma_1\sigma_2}.$$

Estimation and Confidence Intervals

We use sample statistics as estimators of population parameters. We would like these estimators to be unbiased and have small variance. If we are comparing biased estimators, we can compare their mean squared error (MSE). When $n$, $N$, and $N - n$ are all large, the sample mean will tend to have a normal distribution.

We generally want to make statements like

$$P\left(|\hat{\theta} - \theta| < B\right) = 1 - \alpha$$

to quantify the amount of error of estimation. This leads to a confidence interval $(\hat{\theta} - B,\ \hat{\theta} + B)$ with confidence coefficient $1 - \alpha$. Generally $B$ is set to be $2\,\text{STD}(\hat{\theta})$, 2 times the standard error of the estimator, in which case Tchebysheff's theorem states that we achieve at least 75% confidence. If $\hat{\theta}$ is normal, we have approximately 95% confidence.

Simple Random Sampling

If a sample of size $n$ is drawn from a population of size $N$ such that every possible sample of size $n$ is equally likely, the sampling procedure is called
simple random sampling.

How to draw a simple random sample

Estimation of the population mean and total

Cluster Sampling

A cluster sample is a probability sample in which each sampling unit is a collection of elements. Two common reasons for using cluster sampling are (i) a frame of elements is either impossible or very costly to obtain, and (ii) the cost of sampling increases with the distance between the elements. When using cluster sampling, the first decision is what to use as a cluster; several examples of these considerations are discussed in the text. Once the clusters are chosen, a frame of clusters is obtained and then a simple random sample of clusters is taken.

Notation for cluster sampling

$N$ = the number of clusters in the population
$n$ = the number of clusters sampled
$m_i$ = the number of elements in cluster $i$, $i = 1, 2, 3, \ldots, N$
$\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i$ = the average cluster size for the sample
$M = \sum_{i=1}^{N} m_i$ = the number of elements in the population
$\bar{M} = M/N$ = the average cluster size for the population
$y_i$ = the total of all observations in the $i$th cluster

Estimation of a population mean $\mu$

Our estimator of the population mean is just the total of all elements in the sample divided by the number of elements in the sample:

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}, \quad \text{with} \quad \hat{V}(\bar{y}) = \left(\frac{N-n}{Nn\bar{M}^2}\right)s_r^2, \quad \text{where} \quad s_r^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y}\,m_i)^2}{n-1}.$$

Note that the estimator $\bar{y}$ is a ratio estimator. The estimated variance above is biased, so it is advisable to have $n \geq 20$ unless the $m_i$ are equal.

Example: the number of hours of television watched per day

Suppose we visit a small community of $N = 150$ households and we randomly sample $n = 10$ households. For each sampled household we find out how many people live at the household and how many hours of TV are watched per day by each of them. Now we have

$$\bar{y} = \frac{93}{27} = 3.44, \qquad \bar{m} = \frac{27}{10} = 2.7, \qquad \sum_{i=1}^{n}(y_i - \bar{y}\,m_i)^2 = 46.89, \quad \text{so that} \quad s_r^2 = \frac{46.89}{9} = 5.21.$$

Then, using $\bar{m} = 2.7$ in place of $\bar{M}$, we have

$$\hat{V}(\bar{y}) = \frac{140}{150}\cdot\frac{1}{2.7^2}\cdot\frac{5.21}{10} = 0.0667, \quad \text{so that} \quad B = 2\sqrt{0.0667} = 0.52.$$

We can also plot $y_i$ against $m_i$ to check the linearity of the data and whether the regression line appears to go through the origin.

Simple Random Sampling for Proportions

Often we are interested in estimating a population
proportion $p$ for some characteristic, such as the proportion of voters favoring some proposal or the proportion of an animal species having a particular genetic condition. To estimate a proportion for a particular characteristic, we define the variable $y$ for each sampled element to be equal to 1 if the element has the characteristic and 0 otherwise. Then our estimator for the proportion $p$ is just the sample mean of $y$,

$$\hat{p} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,$$

and since $\hat{p} = \bar{y}$, the variance estimator can be obtained by using our expression for $\hat{V}(\bar{y})$ and expressing it in terms of $\hat{p}$ and $\hat{q} = 1 - \hat{p}$:

$$\hat{V}(\hat{p}) = \frac{\hat{p}\hat{q}}{n-1}\left(\frac{N-n}{N}\right).$$

Sample Size selection for Proportions

Again, we can use the same approach that we used earlier to obtain a sample size expression for estimating proportions, by using the expression for the bound $B$ and solving for the sample size $n$:

$$n = \frac{Npq}{(N-1)B^2/4 + pq}.$$

The question also arises as to what value to use for $p$, since we are trying to estimate it. Here, however, it is easier, because if we do not have information from previous studies we can set $p = 0.5$ as a conservative value.

Examples
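The sample size expression for proportions can be tried out numerically. The Python sketch below is only an illustration and is not one of the examples from lecture: the function names and the inputs (a population of 2000 and a bound of 0.05) are hypothetical. It computes the required $n$ with the conservative choice $p = 0.5$, rounding up, and then evaluates the bound $2\sqrt{\hat{V}(\hat{p})}$ that such a sample would give if $\hat{p}$ turned out to be 0.5.

```python
import math


def srs_sample_size_proportion(N, B, p=0.5):
    """Sample size n = N*p*q / ((N - 1)*B^2/4 + p*q), rounded up to an integer."""
    q = 1.0 - p
    D = B ** 2 / 4.0
    return math.ceil(N * p * q / ((N - 1) * D + p * q))


def estimated_variance_proportion(p_hat, n, N):
    """Estimated variance of p_hat under SRS: p_hat*q_hat/(n - 1) * (N - n)/N."""
    return p_hat * (1.0 - p_hat) / (n - 1) * (N - n) / N


# Hypothetical planning inputs: population of 2000, desired bound 0.05.
N = 2000
B = 0.05

n_needed = srs_sample_size_proportion(N, B)  # conservative p = 0.5
achieved_bound = 2.0 * math.sqrt(estimated_variance_proportion(0.5, n_needed, N))

print("required sample size:", n_needed)
print("bound if p_hat = 0.5:", achieved_bound)
```

The achieved bound can differ slightly from the target because the planning formula is written in terms of $N - 1$ and $n$, while the variance estimator uses $n - 1$ and $N$. For a complex design, the resulting $n$ would be multiplied by the design effect as described earlier in these notes.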