Seminar in Biostatistics
Seminar in Biostatistics PUBH 9130
This 29 page Class Notes was uploaded by Mr. Myron Jacobs on Monday October 12, 2015. The Class Notes belongs to PUBH 9130 at Georgia Southern University taught by Robert Vogel in Fall. Since its upload, it has received 40 views. For similar materials see /class/222006/pubh-9130-georgia-southern-university in Public Health Genetics at Georgia Southern University.
Date Created: 10/12/15
Nonresponse and Missing Data in Sample Surveys

There are two types of nonresponse. The first type is unit nonresponse, in which a unit provides no information. The second type is item nonresponse, in which a unit provides some information but some information is missing. Both types of nonresponse pose serious threats to the accuracy of the estimates, and both types are difficult to avoid.

1 Effect of Nonresponse on Accuracy of Estimates

What happens when we are faced with nonresponse? Consider the following definitions:

$N$ = total number of enumeration units in the population
$N_1$ = total number of potential responding units in the population
$N_2$ = total number of potential nonresponders in the population, $N_2 = N - N_1$
$\bar{X}_1$ = mean level of the $N_1$ responders
$\bar{X}_2$ = mean level of the $N_2$ nonresponders
$\bar{X} = (N_1 \bar{X}_1 + N_2 \bar{X}_2)/N$ = mean of the $N$ enumeration units in the population

If we take a simple random sample of $n$ enumeration units and no attempt is made to obtain data from the potential nonresponders, then we are effectively estimating the mean level of the subgroup of $N_1$ responders in the population rather than that of the entire population of $N$ enumeration units. So our sample of $n$ enumeration units will yield $n_1$ responding enumeration units and a sample mean $\bar{x}$ based on these units. The expected value of $\bar{x}$ is

$E(\bar{x}) = \bar{X}_1$

and the bias of $\bar{x}$ is

$B(\bar{x}) = \bar{X}_1 - \bar{X} = (N_2/N)(\bar{X}_1 - \bar{X}_2)$

The most important thing to note about the bias is that it is independent of the number $n_1$ of units successfully sampled. Let me repeat: increasing the sample size will not reduce the bias. If you ask ten million people their opinion and only two million provide an opinion, you have a large bias. This is the major problem with the Literary Digest surveys and the current internet and call-in surveys. They are biased! The trick to a successful survey is to reduce nonresponse bias, that is, reduce the proportion $N_2/N$.

Example: A survey of 100 households obtained from a simple random sample is to be conducted in a
rural area containing 2000 households for the purpose of estimating the proportion of households without flush toilets. Suppose 20% (400) of the 2000 households refuse to participate, reducing our effective population to 1600 potential responders. Also suppose 100 (25%) of the 400 nonresponders do not have flush toilets, whereas 160 (10%) of the potential responders do not have flush toilets. Then there are 260 households of the 2000 households (13%) without flush toilets. We take a sample and make no attempt to obtain data from the nonresponders. Since 10% of the potential responders do not have flush toilets, we should expect our sample to reflect that percentage. The bias for this sample is

$B = (400/2000)(.10 - .25) = -.03$

With the same percentages but a nonresponse of only 100 households, the bias is

$B = (100/2000)(.10 - .25) = -.0075$

2 Methods of Increasing the Response Rates in Sample Surveys

A Increasing the Number of Households Contacted Successfully

With direct interviews, lack of contact occurs when nobody is home. Provisions need to be made to revisit households during the evening or to attempt to collect the information via telephone. Using the telephone may require 10 to 12 call backs. If you use a telephone, a "toll free" number, "unknown" number, or "private" number will not yield any results. Use a telephone with a known name if possible, such as "Georgia Southern University". In mail surveys of households, lack of contact occurs if the family no longer lives at the address at which the name is listed. If the listing unit is an address, then a visit to the address might be necessary to obtain the correct name. Remember, one in five American families moves each year.

B Increasing the Completion Rate in Mail Questionnaires

Good packaging of the questionnaire goes a long way. A carefully worded cover letter explaining the purpose of the survey and identifying the sponsoring agency, along with a statement of confidentiality, is mandatory. Do you know the difference between anonymous and confidential? If
you claim the response is anonymous, you cannot ethically track who responded and who did not. The materials must be of high quality and sent first class. The return envelope also should carry first class postage. Many institutions balk at first class postage and prefer metered mail. It is a universal observation that people are more likely to respond to a mail questionnaire that is attractive, professional in appearance, and requires no more than 30 minutes to complete. Paper with light pastel colors can be useful. When using a mail survey with physicians, white paper is just another piece of paper on the desk. Light blue, green, or yellow have enough color to stand out. Red and orange tend to make people angry. I also send out preletters and follow-up postcards along with additional mailings of the questionnaire. By claiming confidentiality, I can track my responders and reduce costs by sending repeated questionnaires only to the nonresponders. Tokens of appreciation can be useful and help increase response, but care and considerable thought need to go into what type of gratuity is to be provided.

C Decreasing the Number of Refusals in Face-to-Face or Telephone Interviews

Conventional wisdom in the survey business has always been that it is easy to refuse a mail survey, since the respondent has no direct contact with the organization conducting the interview. The wisdom continues that it is somewhat more difficult to refuse a telephone interview because voice contact has been established. Further, it is even more difficult to refuse a face-to-face interview because of eye contact. The telephone wisdom has waned considerably due to telemarketers, caller ID, and answering machines. In telephone interviews (given that the person answers) and face-to-face interviews, the nonresponse rate can be reduced if there is an effective publicity campaign initiated in advance of the survey and continued through the interviewing stages of the survey. Use of churches, local radio, and local television is helpful in getting out the message.
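Why all of this effort to raise response rates pays off can be checked numerically with the bias expression from Section 1, bias = (N2/N)(mean of responders - mean of nonresponders). The short Python sketch below is only an illustration: the function name `nonresponse_bias` and its argument names are made up for these notes, and the inputs are the flush-toilet numbers from the earlier example.

```python
def nonresponse_bias(n_population, n_nonresponders, mean_responders, mean_nonresponders):
    """Bias of the respondent mean: (N2 / N) * (mean of responders - mean of nonresponders)."""
    return (n_nonresponders / n_population) * (mean_responders - mean_nonresponders)


# Flush-toilet example: 2000 households; 10% of responders and
# 25% of nonresponders do not have flush toilets.
bias_400_refusals = nonresponse_bias(2000, 400, 0.10, 0.25)
bias_100_refusals = nonresponse_bias(2000, 100, 0.10, 0.25)

print(f"{bias_400_refusals:.4f}")  # -0.0300
print(f"{bias_100_refusals:.4f}")  # -0.0075
```

The sample size n never enters the function, which is the point made in Section 1: only a smaller nonresponse fraction (or a smaller difference between the two groups) shrinks the bias.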
In face-to-face interviews, proper attire and credentials proudly displayed are a big help.

D Endorsements

Response rates may be improved if the survey is endorsed by an agency or organization whose sphere of interest includes the subject matter. For example, a telephone survey to collect information about the prevalence of cancer screening, to be used by the Southeast Georgia Cancer Coalition, may have better results if it is endorsed by the Georgia Chapter of the American Cancer Association. If the survey is a mail survey, the endorsement should be included in the cover letter, and a signature of the CEO of the organization is very helpful. When surveying institutions, endorsements are extremely important. A survey of Family Medicine physicians in Georgia should be endorsed by the Georgia Family Medicine Association. To get these types of endorsements, it might be necessary to include the endorsing agency as a collaborator.

Incentives are effective in increasing response rates. Incentives usually are involved in mail surveys. Cash is always better than gifts.

3 Mail Surveys Combined with Interviews of Nonrespondents

Mail surveys are less expensive than personal interview surveys. However, mail surveys frequently do not provide a high enough response rate to make the survey valid or to meet the reliability specifications. If the response rate to the mail survey is low after repeated mailings and reminders, it is possible to use a two-stage sampling procedure in which the first stage is a mail survey and the second stage is either a personal interview or a telephone interview. The second stage is usually a subsample of the nonresponders from the first stage. The big question is how big the second stage subsample should be to get valid and reliable estimates.

Example: A community has 300 physicians, and a questionnaire is sent to a simple random sample of 100 physicians. The primary endpoint of the questionnaire is to find out if the physician accepts patients who cannot pay for their services either directly or
indirectly. Of the 100 questionnaires sent out, only 30 are returned. From the 70 nonresponding physicians a subsample of 20 physicians is selected, and an intensive effort is made to get responses from the 20 physicians in the subsample. Of the 20 physicians, 15 responded. The data from the survey are summarized as

Returns from sample    Number    Number 'yes'
Mail survey            30        20
Subsample              15        3

The notation for this example is defined as

$N$ = number of EUs in the population
$n$ = number of EUs in the initial sample
$n_1$ = number of EUs responding in the initial sample
$n_2 = n - n_1$ = number of EUs not responding in the initial sample
$n^*$ = number of the $n_2$ units selected for the second sample
$n_2'$ = number of the $n^*$ units with responses
$\bar{x}_1 = \sum_{i=1}^{n_1} x_i / n_1$ = first level mean of responders
$\bar{x}_2 = \sum_{i=1}^{n_2'} x_i / n_2'$ = second level mean of responders
$\bar{x}_{dub} = (n_1 \bar{x}_1 + n_2 \bar{x}_2)/n$ = mean of the double sample

For this example $N = 300$, $n = 100$, $n_1 = 30$, $n_2 = 70$, $n^* = 20$, $n_2' = 15$, $\bar{x}_1 = 20/30 = .67$, and $\bar{x}_2 = 3/15 = .20$, so

$\bar{x}_{dub} = (30 \times .67 + 70 \times .20)/100 = .34$

Note: if $n_2'$ is numerically close to $n^*$, then the estimator $\bar{x}_{dub}$ has little or no bias.

A Determining the Optimal Fraction of Initial Nonrespondents to Subsample for Intensive Effort

If a double sample is to be taken, the first question to address is how large a sample $n^*$ to take from the $n_2$ enumeration units who did not initially respond. The strategy that is usually used is optimal allocation based on three cost components:

$C_0$ = cost per questionnaire for mailing the initial questionnaires
$C_1$ = cost per returned questionnaire for processing
$C_2$ = cost per questionnaire from the second stage

The total of the field costs and processing costs is summarized as

$C = C_0 n + C_1 n_1 + C_2 n^*$

If the anticipated proportion $P_1$ of those initially sampled will respond to the initial stage, then the optimal number $n^*$ that should be sampled is

$n^* = n_2 \times \sqrt{(C_0 + C_1 P_1)/(C_2 P_1)}$

Example: Suppose $C_0$ = $1.50 per questionnaire for mailing, $C_1$ = $15.00 per questionnaire for processing, and $C_2$ = $45.00 per enumeration unit at the second stage. Also suppose the estimated initial response rate is 30%, so $P_1 = .30$, and we wish to initially sample 100 physicians. The optimal number to
subsample is

$n^* = 70 \times \sqrt{(1.50 + 15 \times 0.30)/(45 \times .30)} = 70 \times 2/3 = 46.667 \approx 47$

B Determination of Sample Size Needed for a Two-Stage Mail Survey

If we are planning a survey, the first question is how many surveys we send out in the initial sample. If we assume, as in the previous example, that only 30% of the physicians will respond to the initial mail survey, and we use the same cost components as in the previous example, we know that we need to sample 2/3 of the nonrespondents in the second stage. There were 70 nonrespondents and our computed sample size is $70 \times 2/3$, or 2/3 of the nonrespondents. Since the initial sampling plan was to use a simple random sample of physicians from a community of $N = 300$ physicians, setting $\varepsilon = .3$ and $z = 3$ (virtual certainty), and "guessing" that 80% of the physicians will say they will accept patients who cannot pay for their services, we can compute the needed sample size using the basic formula in Chapter 3 as

$n = \frac{9 \times 300 \times .8 \times .2}{9 \times .8 \times .2 + 299 \times .3^2 \times .8^2} \approx 24$ physicians

So if we have a response rate of 100%, we need to sample 24 physicians. However, we anticipate a 30% response rate, so we anticipate that only $.3 \times 24 = 7.2 \approx 7$ physicians will respond, which is considerably fewer than we need. To determine the actual number we need to sample in the first stage, we use the following formula:

$n = n' \{ 1 + (1 - P_1) [ \sqrt{C_2 P_1/(C_0 + C_1 P_1)} - 1 ] \}$

where $n'$ is the sample size based on the traditional sample size formula, $P_1$ is the response rate, and $C_0$, $C_1$, $C_2$ are the cost components. So in our example we get

$n = 24 \times \{ 1 + .7 \times [ \sqrt{45 \times .3/(1.5 + 15 \times .3)} - 1 ] \} = 32.4 \approx 33$

We take an initial sample of 33 physicians, of which we expect $.3 \times 33 = 9.9 \approx 10$ physicians to respond. Of the 23 nonresponders, we will do an intensive survey on a subsample of size

$n^* = 23 \times \sqrt{(1.50 + 15 \times 0.30)/(45 \times .30)} = 15.333 \approx 16$

Note: if the initial response rate can be increased to 60%, then $1 - P_1 = .4$ and our adjusted sample size is

$n = 24 \times \{ 1 + .4 \times [ \sqrt{45 \times .6/(1.5 + 15 \times .6)} - 1 ] \} = 29.79 \approx 30$

and we would get $.6 \times 30 = 18$ responses. On the second stage we would make an intensive effort on

$n^* = 12 \times \sqrt{(1.50 + 15 \times 0.60)/(45 \times .60)} = 7.48 \approx 8$

of the 12 nonresponders.

4
Other Uses of Double Sampling Methodology

Double sampling can be used in other situations besides nonresponse problems. The process is to take a preliminary sample, group the elements into two strata, take a subsample from one of the strata, and finally form a weighted combination of the two strata.

Example: In NHANES II, a survey of approximately 28,000 persons aged 6 months to 74 years provided data (guardians provide info for the 6 month olds, and probably others also, for you smart alecks) based on interviews, medical examinations, and laboratory tests. Chest radiographs were performed on approximately 10,000 subjects aged 25 and above. Each radiograph was screened by two radiologists for an array of items including lung cancer. One item was "enlargement of the pulmonary arteries". Such enlargements can be indicative of pulmonary hypertension. Of 10,153 chest radiographs of survey respondents 25 to 74 years of age screened by the two radiologists, 326 had pulmonary artery enlargement. However, in only 30 cases (9%) did the two radiologists agree. Due to the low agreement, a third radiologist was brought into the project and asked to provide a more careful analysis of the original 326 radiographs and of a sample of 288 radiographs from the 9827 radiographs "without pulmonary artery enlargement". The pertinent definitions and quantities are given as

$p_{dub} = (N_1 p_1 + N_2 p_2)/(N_1 + N_2)$
$N_1$ = number of chest radiographs originally scored positive = 326
$N_2$ = number of chest radiographs scored negative by both = 9827
$p_1 = x_1/N_1$, $p_2 = x_2/n_2$
$x_1$ = number confirmed by the third reader among the $N_1$ originally screened positive
$n_2$ = number sampled among the $N_2$ originally screened negative = 288
$x_2$ = number determined positive by the third reader among the $n_2$ sampled
$x_1 = 220$ (the third reader found 220 of the original 326 to be positive)
$x_2 = 74$ (the third reader found 74 of the 288 in the subsample to be positive)

$p_1 = 220/326 = .675$
$p_2 = 74/288 = .257$
$p_{dub} = (326 \times .675 + 9827 \times .257)/10153 = .2704$, or 27.04%

Based on the original readers, the estimate of "enlargement of the pulmonary arteries"
is $326/10153 = 3.2\%$.

5 Item Nonresponse: Methods of Imputation

Item nonresponse refers to missing data elements and also includes values of data elements that are clearly in error and cannot be used. For example, if a sample of Georgia state birth records for any particular year is taken, you should not be surprised to find a mother whose age is less than 2 years giving birth to her fourth child, with a weight of 12 pounds and a gestation time of 16 weeks.

A Mechanisms by Which Missing Data Values Arise

To address the problem of missing data values, it is crucial to understand the mechanism that caused the values to be missing. The analytic strategies for repairing the problem are based on the mechanism causing the data to be missing. There are three basic types of missing data: not recorded, not applicable, and not collected. "Not applicable" may be a question that does not apply to the person being interviewed, such as "Date of last PAP smear" when the respondent is a male. "Not collected" generally refers to data items that were added to the questionnaire at a later time. People interviewed early in the process were never asked these questions, i.e., the data were not collected from these individuals. "Not recorded" generally refers to questions that the respondent refused to answer. The "not recorded" missing items need to be dealt with carefully, and it is for this type of missing data that we need to understand the mechanism behind the reason for not answering the question.

Let $Y = (y_{ij})$ be a hypothetical $n$ by $k$ matrix consisting of the values of the $k$ variables for the $n$ units sampled. The $y_{ij}$ may be recorded or missing. Let $M = (m_{ij})$ be the corresponding $n$ by $k$ matrix of 1's and 0's, where 1 denotes that a particular $y_{ij}$ is missing. The mechanism for the occurrence of missing values is characterized by the distribution of the random matrix $M$ of missing values given the matrix $Y$ of true values. The three most common mechanisms are:

1 Missing Completely at Random: If the distribution of $M$
conditional on $Y$ is independent of $Y$, then the mechanism of item nonresponse is missing completely at random (MCAR). This means that the values are missing completely independently of the data. An example is a lab test that was made, but the specimen was lost on the way to the lab or the lab lost the result.

2 Missing at Random: If the distribution of $M$ conditional on $Y$ is dependent on the observed values of $Y$ but not on the missing values, then the mechanism is called missing at random (MAR). An example of MAR is the result of a mammogram. If the person interviewed is a 21 year old female and the response to the question "Have you ever had a mammogram?" is missing, we would classify this as MAR because the missing value does not depend on the true value of the answer (yes or no), but it is not independent of the matrix of values $Y$, since the woman is 21 and the question probably is not applicable. If the respondent was a male with a missing value, the mechanism is the same.

3 Nonignorable Missing Value: If the distribution of $M$ conditional on $Y$ is dependent on the missing values of $Y$, then the mechanism for this type of item nonresponse is called nonignorable. A typical example would be the question "How many drinks per week do you typically have?" A missing value for this question is frequently the result of the true answer to this question. Any question dealing with sensitive matters should be examined to see if it is nonignorable. An analysis that assumes the missing data mechanism is either MCAR or MAR when in fact it is nonignorable will produce biased estimates.

B Some Methods for Analyzing Data in the Presence of Missing Values

1 Complete Case Methods: Discard all units which have missing values on any variables used in the particular analysis. This practice is widely used, but it is not a good procedure and should be avoided. This method reduces the available sample size, resulting in a loss of precision. If this method were applied to the NHANES data, the final sample size would be less than 500 due to
data collected on teeth. If the individuals dropped from the study are very different from those who remain, the results will be biased. In complex sampling plans, individuals are assigned a weight which usually reflects the probability of selection. Deleting these people negates the validity of the weighting methods.

2 Imputation-Based Methods: These methods provide substitute, or imputed, values for the missing data. Once the data are imputed, methods of analysis that require complete data on all variables are then used to perform the analysis.

3 Reweighting Methods: These methods attempt to adjust for missing data by adjusting the sampling weights. This method is usually applied to unit nonresponse as opposed to item nonresponse. When it is used for item nonresponse, the reweighting is performed by using either logistic regression or probit analysis.

4 Model-Based Methods: These methods model the missing data using likelihood functions of the incomplete data and maximum likelihood methods. These methods have been developed for MAR and nonignorable missing data. One method is called "informative censoring".

C Imputation Methods (single value methods)

Substitution of Mean: This is a very easy and widely used, but very bad, method of imputation. In this method, the mean value of all individuals whose values for the particular variable are present is used as the value for all individuals with missing values for that particular variable. The end result is that the estimate of the population mean is the same as if it were calculated based on only the individuals who responded. However, the standard error will always be biased downward, since this method adds no variability to the data. Another problem is with correlations and their interpretation.

The Hot Deck/Cold Deck Method: To use a hot deck method, cells are formed based on various demographic variables. For individuals who responded to all variables, a registry of their values is constructed and the values are placed in the appropriate cells
based on their demographic data. If a person is identified with a missing value, the "correct" cell is selected based on the person's demographic characteristics, and an observation from that cell is selected at random and used as a "replacement". In a strict hot deck there is only one value in the registry at a time. Every time an actual value is recorded, it replaces the current registry value. If a person is missing a value, the current registry value is used. In a strict hot deck method, if there is a great deal of missing data, the registry value does not change often and adds little variability to the data, causing an underestimate of the variance. In a strict cold deck method each cell has a collection of values from which one can be randomly selected. An example using this methodology is the Southwest Georgia Cancer Coalition Pilot Survey.

Regression Methods: In this method, regression equations are built using the complete data, and then the missing data value (the dependent variable y in the regression equation) is predicted based on the regression equation. This method allows the imputed value to be "harmonious" with the person's other responses and also adds variability to the final analysis.

6 Multiple Imputation

Multiple imputation does not give a single imputed value for a missing data item, but two or more values. Once two or more values are imputed for each missing data item, the resulting data sets are analyzed by complete data set methods. The results of the multiple analyses are combined, and the resulting inference takes into account the uncertainty of the missing values. Consider the following example.

Example: Consider the following partial data set. MMSE is the mini-mental state examination and Education is the level of education of the subject, where 1 = elementary, 2 = high school, 3 = some college, 4 = college graduate. We need to impute the missing values for Education and MMSE.

Community  Building  Subject  Education  MMSE
1          1         1        2          17
1          1         2        missing    18
1          1         3        2          20
1          2         1        4          missing
1          2         2        3          27
1          3         1        3          20
1          3         2        2          18
2          1         1        1          11
2          1         2        1          missing
2          1         3        2          13
2          1         4        2          15
2          2         1        2          missing
2          2         2        2          16
3          1         1        3          24
3          1         2        3          26
3          2         1        missing    15
3          2         2        2          17
3          3         1        4          26
3          3         2        3          missing
3          3         3        3          21

To replace the four missing values of MMSE, we can replace a missing value with the value of that variable for any subject in the same PSU (here the community is the stratum and the building is the PSU). The missing value for subject 1 in stratum 1, PSU 2 can be replaced by the value of one other subject, since there are two subjects in that PSU. The value for subject 2 in stratum 2, PSU 1 can be replaced by one of three possible values, since there are three other people in that PSU. Similarly, the value for subject 1 in stratum 2, PSU 2 can be replaced by one value, and the value for subject 2 in stratum 3, PSU 3 can be replaced by one of two possible values. There are 6 possible imputed data sets that can be formed based on the available choices ($1 \times 3 \times 1 \times 2 = 6$). The possible imputation results for the four subjects are

Combination  A   B   C   D
1            27  11  16  26
2            27  11  16  21
3            27  13  16  26
4            27  13  16  21
5            27  15  16  26
6            27  15  16  21

Taking a random sample of three possible combinations, say 2, 3, and 5, the resultant data sets for MMSE are

Subject  1      2      3
1        17     17     17
2        18     18     18
3        20     20     20
4        27     27     27
5        27     27     27
6        20     20     20
7        18     18     18
8        11     11     11
9        11     13     15
10       13     13     13
11       15     15     15
12       16     16     16
13       16     16     16
14       24     24     24
15       26     26     26
16       15     15     15
17       17     17     17
18       26     26     26
19       21     26     26
20       21     21     21
Mean     18.95  19.30  19.40

1 Uses of Surveys
  Obtain information in a timely manner at reduced cost
  Politics
  Health and social services
  Marketing, opinions

2 Basic Definitions
  Population / Target Population
  Sample: subset of the population
  Sample survey: probability based sample
  Summary statistics
  Descriptive survey: used to estimate the level of a set of variables, not to test hypotheses
  Census

3 Design
  Sample Design
    Sampling plan
    Estimation procedures
  Survey Measurements: responsibility of the subject matter persons, not the statistician
    How do you pronounce his name (Likert), and what is the scale?
    Questionnaire design is a complex task
  Survey operations
    Select a sample
    Design questionnaire
    Pretest and modify if needed
    Train survey takers
    PILOT SURVEY on a small sample
    Data must be collected according to protocol to maintain reliability and validity (curbsiding)
  Statistical Analysis
    Considerable care needs to be taken in the interpretation of findings
    Errors need to be taken into account, both sampling and measurement
  Report Writing
    Report should include how everything, including problems, was resolved

4 Literary Digest
  History of Literary Digest Surveys
  Result of 1936 survey
  Merge into Time Magazine 1938
  Dewey beats Truman
  A large number of responses is not as important as how the responses are obtained

5 The Population and the Sample

The foundation of sampling methodology is knowing what the components of a population are.
  Population (universe, target population): consists of all individuals to which the findings are to be extrapolated
  Elementary units or elements
  Enumeration units (listing units)
  Enumeration rules (counting rule)
  The number of elementary units is denoted by $N$
  Each elementary unit will be identified by a label from 1 to $N$
  A characteristic or variable is denoted by a letter such as $x$ or $y$
  The value of a characteristic $\mathcal{X}$ in the $i$th elementary unit is denoted $X_i$

6 The function of a survey is to estimate population values; these values are called parameters.

Population Total: $X = \sum_{i=1}^{N} X_i$

Population Mean: $\bar{X} = \sum_{i=1}^{N} X_i / N$

Population Proportion: $P_x = X/N$, where $X_i = 1$ if the attribute is present and $X_i = 0$ if the attribute is absent; thus $X = \sum_{i=1}^{N} X_i$ represents the total number of elements with the attribute.

Population Variance and Standard Deviation:

Population Variance: $\sigma_x^2 = \sum_{i=1}^{N} (X_i - \bar{X})^2 / N$

Population Standard Deviation: $\sigma_x = \sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2 / N}$

When the characteristic being considered is a dichotomous attribute, the expression for the population variance reduces to (note: notation for this quantity varies between books)

$\sigma_x^2 = P_x(1 - P_x)$

Example: A Family Practice Residency Program has 24 residents, and the director is interested in the number of in-office patient contacts on average per resident per week. $N = 24$ and $X_i$ equals the number of contacts for resident $i$. If the total number of resident contacts for
a particular week is 292, then

$\bar{X} = \sum_{i=1}^{24} X_i / 24 = 292/24 = 12.2$ contacts

To find $\sigma_x$ we compute $\sigma_x^2$, which we will say equals 73.29 contacts$^2$. The standard deviation is $\sigma_x = 8.56$ contacts. If 7 residents failed to see any patients in the office that week, then $P_x = 17/24 = 0.71$, or 71% of the residents had at least one patient contact, $\sigma_x^2 = 0.71(1 - 0.71) = .2059$, and $\sigma_x = \sqrt{.2059} = .4538$.

The Coefficient of Variation (CV), denoted $V_x$, is the ratio of the standard deviation to the mean:

$V_x = \sigma_x / \bar{X}$

The coefficient of variation is a unitless number, since the units of measurement for the mean and standard deviation cancel. This parameter is often used to compare the variability of two or more characteristics. It measures the variability of the characteristic relative to the mean of the characteristic. The square of the CV is called the relative variance, or rel-variance, and is denoted $V_x^2$. A problem with the CV is that if the mean is zero, then the CV is not defined.

Example: The mean systolic blood pressure of a population is 130 mmHg and the standard deviation is 15 mmHg. The same population has a mean cholesterol level of 200 mg/100 ml with a standard deviation of 40 mg/100 ml. The coefficient of variation for systolic blood pressure is $15/130 = 0.115$ and for cholesterol $40/200 = 0.20$; thus the variability of cholesterol is greater than that of systolic blood pressure. Now suppose the mean diastolic blood pressure is 60 mmHg with a standard deviation of 8 mmHg. The coefficient of variation for diastolic blood pressure is $8/60 = 0.133$, so the variation of diastolic blood pressure is greater than that of systolic blood pressure.

To summarize:

Population Total                     $X = \sum_{i=1}^{N} X_i$
Population Mean                      $\bar{X} = X/N$
Population Proportion                $P_x = X/N$
Population Variance                  $\sigma_x^2 = \sum_{i=1}^{N} (X_i - \bar{X})^2 / N$
Population Standard Deviation        $\sigma_x = \sqrt{\sigma_x^2}$
Variance for Dichotomous Attribute   $\sigma_x^2 = P_x(1 - P_x)$
Coefficient of Variation             $V_x = \sigma_x / \bar{X}$
Rel-Variance                         $V_x^2 = \sigma_x^2 / \bar{X}^2$

7 The Sample

The primary objective of the sample survey is to take a subset, or sample, of the population and estimate the population parameters from that sample. Sample surveys can be categorized into two broad categories:
probability samples and nonprobability samples. A probability sample has the characteristic that every element in the population has a known nonzero probability of being included in the sample. A nonprobability sample is based on a sampling plan that does not have this feature (the Literary Digest surveys, the Chicago Tribune's "Dewey Defeats Truman", many internet surveys, convenience surveys).

Features of Probability Samples:
  Unbiased estimates of population parameters can be constructed
  Standard errors of these estimates can be constructed; this provides a means to determine the value of the estimates

Nonprobability samples do not have these features; thus you cannot evaluate the validity or reliability of the estimates. Types of nonprobability samples include:
  Purposive or judgmental sampling: individuals are selected who are considered the most representative of the population. An example is to pick a few "typical days" in the hospital to perform a survey.
  Quota sampling: contact and interview a certain number of people in a demographic group
  Snowball sampling: the first contact gives names of the next contacts, and the sample snowballs

8 Important Concepts of Sampling

Sampling Frame: The list of elements such that every element in the population has some chance of being selected into the sample by whatever method is used to select elements from the population. This could be a telephone directory, tax records, or hospital admission records.

Frequently sampling is performed in multiple stages. A multistage sampling design may at the first stage sample counties in a state, at the second stage areas within the county, at the third stage households within the area, and at the fourth stage an adult within the household. The units listed in the frame are generally called sampling units. The sampling frame for the first stage in the above example consists of the 159 counties in Georgia, and the sampling unit is a county. In the second stage the sampling unit would be the predefined areas within a selected county. In the third stage the sampling
unit would be households within the area, and in the fourth stage the sampling unit would be adults within a selected household. The sampling units from the final stage are called enumeration units or listing units.

9 Sampling Measurements and Summary Statistics

Upper case letters (sometimes upper case script letters) and Greek symbols are used to denote population parameters. Lower case letters are used to denote sample statistics; sample statistics estimate population parameters. Just as an upper case 'N' denotes the population size, the lower case 'n' denotes sample size; in multistage sampling other letters such as 'm' and 'p' are used to denote sample sizes.

Sample Total: $x = \sum_{i=1}^{n} x_i$

Sample Mean: $\bar{x} = \sum_{i=1}^{n} x_i / n$

Sample Proportion: $p_x = x/n$, where $x$ is the number in the sample with the attribute

Sample Variance: $s_x^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$

Sample Variance for a dichotomous attribute: $s_x^2 = n p_x (1 - p_x)/(n - 1)$

Note: when $n$ is large, $n/(n - 1)$ gets close to 1, so $s_x^2 \approx p_x(1 - p_x)$.

Sample Standard Deviation: $s_x = \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)}$

10 Estimates

An estimate of the population total $X$ is $x' = (N/n) x$. We multiply the sample total by the ratio of the number of elements in the population to the number of elements in the sample. An estimate of the population variance $\sigma_x^2$ is usually denoted $\hat{\sigma}_x^2 = ((N - 1)/N) s_x^2$.

Example: From a population of 25 physicians, the mean number of visits per day is $\bar{X} = 5.08$, with $X = 127$, $\sigma_x^2 = 67.91$, and $\sigma_x = 8.24$. Suppose we take a random sample of 9 physicians in which each physician has the same chance of being selected. We find the numbers of visits to be 5, 0, 12, 5, 6, 7, 37, 8, 0. From this we compute $x = 80$, $\bar{x} = 8.89$, and $s_x^2 = 125.11$. In addition, seven of the nine physicians made visits, so $p_x = 0.78$.

Our estimate of $X$ is $x' = (25/9)(80) = 222.22$.
Our estimate of $\bar{X}$ is $\bar{x} = 8.89$.
Our estimate of $\sigma_x^2$ is $\hat{\sigma}_x^2 = (24/25)(125.11) = 120.11$.

We see from this sample that our estimates are not very close to the population parameters. Taking a different sample will give us different estimates. Some estimates will be high, some low, some close to the population parameters, and others not so close.

11 Sampling Distributions

We would like to
estimate a population parameter. We will draw exactly one sample of size 'n'. However, because the population is finite, of size 'N', there are $\binom{N}{n}$ possible distinct samples. If we did construct all $\binom{N}{n}$ estimates of the population parameter of interest, the result would be a frequency distribution of estimates. This frequency distribution is called the Sampling Distribution of the Estimator.

Example: Students not Immunized for Measles

School   No. students   Total not immunized   Proportion
1        59             4                     0.068
2        28             5                     0.179
3        90             3                     0.033
4        44             3                     0.068
5        36             7                     0.194
6        57             8                     0.140
total    314            30                    0.096

Suppose we wish to estimate the total number of students, and the proportion thereof, who are not immunized, based on a sample of size two (2). Since there are 6 schools, the total possible number of samples of size 2 is $\binom{6}{2} = 15$. The samples are:

Sample   Schools   x    x'        Sample   Schools   x    x'
1        1, 2      9    27        9        2, 6      13   39
2        1, 3      7    21        10       3, 4      6    18
3        1, 4      7    21        11       3, 5      10   30
4        1, 5      11   33        12       3, 6      11   33
5        1, 6      12   36        13       4, 5      10   30
6        2, 3      8    24        14       4, 6      11   33
7        2, 4      8    24        15       5, 6      15   45
8        2, 5      12   36

Note: in the table above, $x' = (N/n)\,x = 3x$ is estimating the population total $X = 30$. Some estimates are above, some are below, and two estimates are exact. Sorting the samples based on $x'$, the 15 samples are distributed as:

x'    frequency   relative frequency
18    1           1/15
21    2           2/15
24    2           2/15
27    1           1/15
30    2           2/15
33    3           3/15
36    2           2/15
39    1           1/15
45    1           1/15

The mean of the sampling distribution of an estimator, with respect to the sampling plan, is called the expected value of the estimator and is denoted $E$. So if we are estimating some parameter denoted 'd' with an estimator $\hat{d}$, the expected value is given as

$E(\hat{d}) = \sum_{i=1}^{C} \hat{d}_i \pi_i$

where $C$ is the total number of samples, $\pi_i$ is the probability of selecting sample $i$, and $\hat{d}_i$ is the estimate from sample $i$. From our example,

$E(x') = 18(1/15) + 21(2/15) + \dots + 39(1/15) + 45(1/15) = 30$

When the expected value of an estimator equals the parameter it is estimating, the estimator is called unbiased.

The variance $Var(\hat{d})$ of the sampling distribution of the estimator $\hat{d}$, with respect to the sampling plan, is

$Var(\hat{d}) = \sum_{i=1}^{C} (\hat{d}_i - E(\hat{d}))^2 \pi_i$

The standard deviation $SE(\hat{d})$ of the sampling distribution of the estimated population parameter is also
known as the standard error and is the square root of the variance: $SE(\hat{d}) = \sqrt{Var(\hat{d})}$.

From the immunization example,

$Var(x') = (18 - 30)^2 (1/15) + (21 - 30)^2 (2/15) + \dots + (45 - 30)^2 (1/15) = 52.8$

and $SE(x') = \sqrt{52.8} = 7.27$.

12 Summary of Mean and Variance of Sampling Distribution

Totals: $E(x') = \sum_{i=1}^{C} x'_i \pi_i$, $Var(x') = \sum_{i=1}^{C} (x'_i - E(x'))^2 \pi_i$
Means: $E(\bar{x}) = \sum_{i=1}^{C} \bar{x}_i \pi_i$, $Var(\bar{x}) = \sum_{i=1}^{C} (\bar{x}_i - E(\bar{x}))^2 \pi_i$
Proportions: $E(p_x) = \sum_{i=1}^{C} p_{x,i} \pi_i$, $Var(p_x) = \sum_{i=1}^{C} (p_{x,i} - E(p_x))^2 \pi_i$

Remember: these are the means and variances of the estimators based on the sampling distribution. $\pi_i$ is the probability of selecting sample $i$, and $C$ is the number of distinct possible samples that can be made from the selected sampling plan.

13 Characteristics of Estimates of Population Parameters

Bias: The bias $B(\hat{d})$ of an estimate $\hat{d}$ of the population parameter $d$ is the difference between the expected value $E(\hat{d})$ of the sampling distribution of $\hat{d}$ and the true value of the unknown parameter $d$:

$B(\hat{d}) = E(\hat{d}) - d$

Example: In the school example, if we select samples of size 2 so that each school has the same probability of selection, the estimate of the total is unbiased. However, suppose we write each of the numbers 1 through 10 on a piece of paper, seal each paper in an envelope, and draw one envelope. Each envelope has a 1/10 chance of being selected. The sampling plan is defined as:

Number picked   Schools selected      Number picked   Schools selected
1               1 and 2               6               2 and 3
2               1 and 3               7               2 and 4
3               1 and 4               8               2 and 5
4               1 and 5               9               2 and 6
5               1 and 6               10              3 and 4

The Possible Samples

Sample   Schools   Total x'      Sample   Schools   Total x'
1        1, 2      27            6        2, 3      24
2        1, 3      21            7        2, 4      24
3        1, 4      21            8        2, 5      36
4        1, 5      33            9        2, 6      39
5        1, 6      36            10       3, 4      18

The Sampling Distribution

x'    π
18    1/10
21    2/10
24    2/10
27    1/10
33    1/10
36    2/10
39    1/10

$E(x') = 18(1/10) + 21(2/10) + \dots + 39(1/10) = 27.9 \neq 30$

The estimate is biased. Why?

MSE: The Mean Square Error of a population estimate $\hat{d}$ is denoted $MSE(\hat{d})$ and is the mean of the squared differences between the values of the estimate and the true value of $d$:

$MSE(\hat{d}) = \sum_{i=1}^{C} (\hat{d}_i - d)^2 \pi_i = \sum_{i=1}^{C} (\hat{d}_i - E(\hat{d}) + E(\hat{d}) - d)^2 \pi_i = Var(\hat{d}) + B^2(\hat{d})$

The cross term vanishes because $\sum_{i=1}^{C} (\hat{d}_i - E(\hat{d})) \pi_i = 0$. The MSE of an estimator therefore equals the variance of the estimator plus the bias squared.

Example: The $MSE(x')$ based on selecting two schools at random is $52.8 + 0 = 52.8$.
This is the sampling plan for which $E(x') = 30$, so the estimator is unbiased. The $Var(x')$ based on the sampling plan that yielded 27.9 as the expected estimate of the total is

$Var(x') = (18 - 27.9)^2 (1/10) + (21 - 27.9)^2 (2/10) + \dots + (39 - 27.9)^2 (1/10) = 50.49$

Also, we calculated the bias as $27.9 - 30 = -2.1$, so the bias squared is 4.41 and the $MSE = 50.49 + 4.41 = 54.9$.

Note: the biased estimator in this example has a smaller variance, but its MSE is larger. Bias, especially when uncorrected, can lead to incorrect inferences. From basic statistics, the most frequent test is a t-test. A t-test is the ratio of an estimator to its standard deviation (its standard error). The standard deviation is the square root of the variance, so if the estimator is biased then the variance is smaller than the MSE and the standard error understates the true error of the estimate, which in turn may yield false positives or false negatives.

14 Validity, Reliability and Accuracy

Reliability: Reliability of an estimator refers to how reproducible the estimator is over repetitions of the sampling plan. If there are no measurement errors in the survey, then the reliability of an estimator can be gauged in terms of its sampling variance or standard error. The smaller the variance, the greater the reliability, given no measurement errors.

Validity: Validity of an estimator refers to how the mean of the estimator over repetitions of the sampling plan differs from the true value of the parameter being estimated. If there are no measurement errors, then validity can be evaluated by examination of the bias. The smaller the bias, the greater the validity. Unbiased estimates are valid estimates.

Accuracy: The accuracy of an estimator refers to how far away a particular value of an estimate is, on average, from the true value of the parameter. Accuracy is measured by MSE. The smaller the MSE, the greater the accuracy.
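
The school example from sections 11 through 13 can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the names `not_immunized`, `summarize`, `srs` and `envelope` are illustrative assumptions, and every sample within a plan is assumed to be equally likely, as in the text. It enumerates the samples of each plan and computes the expected value, variance, standard error, bias and MSE of the estimated total x'.

```python
from itertools import combinations
from math import sqrt

# Students not immunized in each of the 6 schools (from the example table).
not_immunized = {1: 4, 2: 5, 3: 3, 4: 3, 5: 7, 6: 8}
N, n = 6, 2                       # schools in the population, schools per sample
X = sum(not_immunized.values())   # true population total (30)


def summarize(samples):
    """Summarize the sampling distribution of x' = (N/n) * x.

    Assumes every sample in `samples` is selected with equal probability.
    """
    pi = 1 / len(samples)
    estimates = [N / n * sum(not_immunized[s] for s in sample) for sample in samples]
    expected = sum(e * pi for e in estimates)
    variance = sum((e - expected) ** 2 * pi for e in estimates)
    mse = sum((e - X) ** 2 * pi for e in estimates)
    return {
        "E": expected,
        "Var": variance,
        "SE": sqrt(variance),
        "bias": expected - X,
        "MSE": mse,
    }


# Plan A: all 15 possible samples of two schools, each equally likely.
srs = summarize(list(combinations(range(1, N + 1), n)))

# Plan B: the envelope plan, which allows only these 10 pairs of schools.
envelope_samples = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
                    (2, 3), (2, 4), (2, 5), (2, 6), (3, 4)]
envelope = summarize(envelope_samples)

for name, result in (("simple random sample", srs), ("envelope plan", envelope)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Computing the MSE directly from the squared deviations around the true total, rather than as variance plus bias squared, gives an independent check of the decomposition stated in section 13.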