Introduction to Statistical Analysis
These class notes (4 pages) were uploaded by Gideon Yundt on Monday, September 21, 2015. They belong to STAT 2120 at the University of Virginia, taught by Dan Spitzner in Fall.
Date Created: 09/21/15
STAT 2120, Spring 2010: Notes on Topic 2

Least-squares regression

- A regression line describes a one-way linear relationship between variables.
  - An explanatory variable $x$ explains variability in a response variable $y$.
  - Often one wants to predict $y$ from a given $x$. Such a prediction is denoted $\hat{y}$.
- The least-squares regression line makes the sum of squared prediction errors as small as possible.
  - A prediction error is the vertical distance between a given point and a regression line.
  - The formula for the least-squares regression line is $\hat{y} = b_0 + b_1 x$, with slope $b_1 = r\, s_y / s_x$ and intercept $b_0 = \bar{y} - b_1 \bar{x}$.
  - Predictions are made by plugging in values of $x$.
  - Slope $b_1$ is the amount of change in $\hat{y}$ when $x$ increases by one unit. Intercept $b_0$ is the prediction at $x = 0$.
  - Calculate $b_0$ and $b_1$ by computer.
- Properties of the least-squares regression line:
  - Interchanging $x$ and $y$ modifies the formulation.
  - The line $\hat{y} = b_0 + b_1 x$ always passes through the point $(\bar{x}, \bar{y})$.
  - The slope formula $b_1 = r\, s_y / s_x$ interprets the relationship in units of $s_x$ and $s_y$ through $r$.
  - Similarly, $r^2$ measures the proportion of variability in $y$ that is explained by $x$.
- The residuals describe the leftover variation in $y$ after fitting the least-squares regression line. Each residual is defined by $y - \hat{y}$.
  - The average of the residuals is zero.
  - Analysis of residuals helps to assess the suitability of a linear relationship.
- A residual plot is a scatterplot of residuals against the values of $x$.
  - The ideal residual plot should exhibit no systematic pattern.
  - Patterns indicating a departure from the linear relationship are curvature, trends in spread, and outliers in the residuals.
- An outlier in $y$ corresponds with an outlier in the residuals. It appears as an observation that lies outside the overall pattern of the relationship.
- Influential observations are those whose individual deletion would have a strong impact on the regression line.
  - An influential observation is often an outlier in $x$ but may not be an outlier in $y$.

Cautions about correlation and regression

- Basic cautions:
  - Correlation is for two-way relationships; regression is for one-way relationships.
  - Both are relevant only for linear relationships.
  - Neither is resistant.
- Extrapolation is when predictions are made outside the range of the data.
  - It is often untrustworthy, since the linear relationship may not hold for $x$-values far outside those observed.
- Correlation calculated on averaged data is usually higher than that calculated on individuals.
- The relationship between two variables may be influenced by a third, lurking variable that is not observed.
  - Lurking variables may influence relationships between any type of variables, quantitative or categorical.
- Association is not causation.
  - An observed association may reflect the influence of a causal lurking variable. Such an association is called a nonsense correlation.
  - An experiment that controls lurking variables is best for establishing causation.
  - It is possible to establish causation without performing an experiment that controls for lurking variables, but the evidence that arises is weaker.

Relationships in categorical data

- Relationships in categorical data are explored by compiling variables in two-way tables.
  - A two-way table involves a row variable and a column variable.
  - A two-way table may record counts or percentages. Percentages are most useful because they are easy to compare in the form of distributions.
- Relationships are described through specialized distributions appearing in the table.
  - Bar graphs provide a useful means of presenting the relevant distributions.
  - The distributions of the row and column variables appear in the margins of the table and are called marginal distributions. Given as counts, they are called row and column totals.
  - A conditional distribution is calculated from the counts of one variable limited to a given category of the other variable.
  - An association may be described by examining the conditional distributions of one variable across the categories of the other variable. Typically the former would be the response variable and the latter the explanatory variable.
- Lurking variables may give rise to Simpson's paradox: patterns seen in individual categories are reversed in the patterns of the combined data.

Introduction to producing data

- Designing the production of data allows data analysis to be used for statistical inference.
  - Data are produced on a small scale and the intent is to generalize to a wider scale.
  - Among the ways that data may be produced are taking a sample or performing an experiment.
- Confounding arises between explanatory variables when their relationships with the response are indistinguishable.
- In an observational study, individuals are observed but no attempt is made to control the conditions of data production.
  - Observational studies are often plagued by confounding between an observed variable and an unobserved lurking variable.
- In an experiment, the conditions of data production are controlled by applying treatments to individuals.
  - One objective in designing an experiment is to avoid confounding between explanatory variables.

Designing a sample

- The key elements of a sampling study:
  - A population is a collection of individuals about which we want information and to which the conclusions of statistical inference are to be relevant.
  - A sample is the subset of a population on which data are measured and put to analysis.
  - The design of a sample refers to the method used to select it from the population.
- A sampling design is biased if it systematically favors certain portions of the population over others. Examples of biased sampling designs:
  - A voluntary sample arises when individuals are self-selected for the sample by responding to an incentive.
  - A convenience sample arises when selection for the sample is determined by the convenience of the selection-maker.
- Simple random sampling (SRS) selects a sample randomly in such a way that every fixed-size subset has an equal probability of being selected.
  - Bias is avoided in SRS by its use of chance.
  - SRS may be carried out by drawing labels from a hat, or by simulating that procedure with computer software or a table of random digits.
- A probability sample is a sample selected by chance, with each possible sample having a known selection probability.
  - SRS is an example of unbiased probability sampling.
  - In general, bias may be accommodated in probability sampling using knowledge of the selection probabilities.
- Stratified random sampling is an example of probability sampling in which simple random samples are drawn in distinct strata and aggregated.
- Multistage sampling is an example of probability sampling that is carried out in stages. At each stage, each unit in the current list of sampling units is narrowed to a list of more refined sampling units, and an SRS of those units is selected. The final list of sampling units is of individuals.
- Bias in sampling may originate from sources other than the sampling design, including:
  - Undercoverage in the list of individuals in the population.
  - Nonresponse of individuals selected for the sample.
  - Inaccurate responses of the respondent, which lead to response bias. This may be unintentionally encouraged by the interviewer.
  - Poor wording and design of a questionnaire.

Designing experiments

- In an experiment, a response variable is observed under controlled conditions that reflect carefully chosen values of explanatory variables.
- Terminology associated with experiments:
  - Individuals are referred to as subjects.
  - Explanatory variables are referred to as factors.
  - Each specific value of an explanatory variable is referred to as the level of a factor. It reflects the application of a treatment used to modify the experimental conditions in a specific way.
- Experiments provide focus on interesting treatments by holding uninteresting factors steady.
- [...] study of multiple factors.

STAT 2120, Spring 2010: Notes on Topic 9

Introduction to inference for proportions

- The interest is in analyzing data on categorical variables.
  - Data are in the form of counts or percents.
  - Relevant parameters are population proportions.
  - Suitable estimates are sample proportions.

Inference for a single proportion

- The relevant quantities in the sampling framework for inference on proportions are:
  - The population proportion $p$, the proportion of successes in the population.
  - The sample success count $X$, which records the number of successes in the sample.
  - The sample proportion $\hat{p} = X/n$, the proportion of successes in the sample.
- Suppose a sample is obtained by SRS. The sample proportion $\hat{p}$ has the following properties:
  - The mean and standard deviation of $\hat{p}$ are $\mu_{\hat{p}} = p$ and $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$.
  - If the sample size is large, then $\hat{p}$ is approximately Normal.
- The one-sample z test for a proportion is as follows:
  - Assumptions: a large size-$n$ simple random sample is drawn from a population with unknown population proportion $p$.
  - The null hypothesis is $H_0: p = p_0$ and the alternative hypothesis may be any of $H_a: p < p_0$, $H_a: p > p_0$, or $H_a: p \neq p_0$.
  - The standardized test statistic is $z = (\hat{p} - p_0) / \sqrt{p_0(1-p_0)/n}$.
  - The P-value is calculated as $P(Z \le z)$ if $H_a: p < p_0$, $P(Z \ge z)$ if $H_a: p > p_0$, and $2P(Z \ge |z|)$ if $H_a: p \neq p_0$.
  - A rule of thumb is that the stated significance level for this test is accurate if both $n p_0 \ge 10$ and $n(1-p_0) \ge 10$.
  - Observe that the test statistic is a z-score with $\sigma_{\hat{p}}$ replaced with its null value $\sigma^{0}_{\hat{p}} = \sqrt{p_0(1-p_0)/n}$.
- Procedures for testing $H_0: p = p_0$ have been developed for the case where $n$ is small, but will not be discussed here.
- The one-sample z confidence interval for a population proportion is as follows:
  - Assumptions: a large size-$n$ simple random sample is drawn from a population with unknown population proportion $p$.
  - A confidence interval for $p$ with approximate confidence level $C$ is $\hat{p} \pm z^{*} \sqrt{\hat{p}(1-\hat{p})/n}$, where $z^{*}$ is such that $C = P(-z^{*} \le Z \le z^{*})$ for $Z$ having a standard Normal distribution.
  - Derivation of the CI formula uses $SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})/n}$ to estimate $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$. The statistic $SE_{\hat{p}}$ is the standard error of $\hat{p}$. This differs from the test, where $\sigma_{\hat{p}}$ is instead replaced with its null value.
  - The margin of error is $m = z^{*}\, SE_{\hat{p}}$.
  - A rule of thumb is that the stated confidence level for this CI is accurate if both $n\hat{p} \ge 15$ and $n(1-\hat{p}) \ge 15$.
- The conventional CI above may be inaccurate when $\hat{p} \approx 0$ or $\hat{p} \approx 1$. An alternative for those cases is the one-sample plus-four confidence interval for a population proportion, which is described as follows:
  - The starting point is the Wilson estimate of $p$, given by $\tilde{p} = (X+2)/(n+4)$.
  - Assumptions: a large size-$n$ simple random sample is drawn from a population with unknown population proportion $p$.
  - A confidence interval for $p$ with approximate confidence level $C$ is $\tilde{p} \pm z^{*} \sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$, where $z^{*}$ is such that $C = P(-z^{*} \le Z \le z^{*})$ for $Z$ having a standard Normal distribution. The statistic $SE_{\tilde{p}} = \sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$ is the standard error of $\tilde{p}$.
  - A rule of thumb is that the stated confidence level for this CI is accurate if $n \ge 10$.
  - For further motivation, consider the hypothetical case where $X = 0$. Then $\hat{p} = 0$, $SE_{\hat{p}} = 0$, and the z CI is $0 \pm 0$, whereas $\tilde{p} = 2/(n+4)$ and the plus-four CI gives a meaningful result.
- As with CIs in general, a confidence interval for $p$ accounts only for the random variability in $\hat{p}$ or $\tilde{p}$, and not for more general biases that may arise in producing the data.
- Results of the CI and test procedures for $p$ will remain the same when failures are counted instead of successes, provided the formulation of the problem is consistent throughout.
- When planning a sample, the sample size may be chosen to target a desired margin of error $m$ of a CI with a desired confidence level $C$.
  - The relevant sample size formula is $n = (z^{*}/m)^2\, p^{*}(1-p^{*})$, where $p^{*}$ is an educated guess of $p$.
  - A conservative guess of $p$ is $p^{*} = 1/2$, which yields the sample size formula $n = (z^{*}/(2m))^2$. This is conservative in the sense that the calculated sample size will be at least as large as that produced by the previous formula. Differences between the formulas are especially large when $p^{*} \le 0.3$ or $p^{*} \ge 0.7$.
  - Oftentimes it is helpful to examine results of the first sample size formula across a range of values for $p^{*}$.
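The one-sample procedures above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the original notes: the function names (`one_sample_z_test`, `z_interval`, `plus_four_interval`, `sample_size`) and the `alternative` labels are hypothetical, and only the standard library's `statistics.NormalDist` is used for Normal probabilities.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def z_star(conf):
    """Critical value z* such that C = P(-z* <= Z <= z*)."""
    return Z.inv_cdf((1 + conf) / 2)


def one_sample_z_test(x, n, p0, alternative="two-sided"):
    """Return (z, P-value) for H0: p = p0 (hypothetical helper)."""
    p_hat = x / n
    # The test uses the null standard deviation sqrt(p0(1 - p0)/n).
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    if alternative == "less":        # Ha: p < p0
        p_value = Z.cdf(z)
    elif alternative == "greater":   # Ha: p > p0
        p_value = 1 - Z.cdf(z)
    else:                            # Ha: p != p0
        p_value = 2 * (1 - Z.cdf(abs(z)))
    return z, p_value


def z_interval(x, n, conf=0.95):
    """Conventional CI: p_hat +/- z* * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    m = z_star(conf) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - m, p_hat + m


def plus_four_interval(x, n, conf=0.95):
    """Plus-four CI based on the Wilson estimate (x + 2)/(n + 4)."""
    p_tilde = (x + 2) / (n + 4)
    m = z_star(conf) * sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde - m, p_tilde + m


def sample_size(m, conf=0.95, p_guess=0.5):
    """Smallest integer n with n >= (z*/m)^2 * p*(1 - p*)."""
    return ceil((z_star(conf) / m) ** 2 * p_guess * (1 - p_guess))
```

For example, `one_sample_z_test(60, 100, 0.5)` gives a z statistic of about 2, and `z_interval(0, 20)` collapses to a zero-width interval while `plus_four_interval(0, 20)` does not, matching the X = 0 motivation in the notes. With the default conservative guess of 1/2, `sample_size(0.03)` targets a margin of error of 0.03 at 95% confidence.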
Comparing two proportions

- Categorical data may be produced in comparative experiments.
  - Categorical data from matched-pairs experiments are possible but are not discussed. Comparing two proportions is instead discussed within the two-sample setup.
  - Notation writes $p_1$ and $p_2$ for the respective population proportions of the two populations, $X_1$ and $X_2$ for the corresponding sample success counts, and $\hat{p}_1 = X_1/n_1$ and $\hat{p}_2 = X_2/n_2$ for the corresponding sample proportions.
  - The specific data-analysis objectives when comparing proportions are to produce a confidence interval for $p_1 - p_2$ or to test $H_0: p_1 = p_2$ against a one- or two-sided alternative.
- The statistic $D = \hat{p}_1 - \hat{p}_2$ is often a suitable estimator of $p_1 - p_2$.
  - $D = \hat{p}_1 - \hat{p}_2$ is unbiased for $p_1 - p_2$, i.e., the mean is $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$.
  - The standard deviation of $D = \hat{p}_1 - \hat{p}_2$ is $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}$.
  - If the sample sizes $n_1$ and $n_2$ are both large, then the distribution of $D = \hat{p}_1 - \hat{p}_2$ is approximately $N(p_1 - p_2,\ \sigma_{\hat{p}_1 - \hat{p}_2})$.
- It is relevant to consider the case where $p_1 = p_2$.
  - When $p_1 = p_2$, the standard deviation of $D = \hat{p}_1 - \hat{p}_2$ simplifies to $\sigma^{0}_{\hat{p}_1 - \hat{p}_2} = \sqrt{p(1-p)(1/n_1 + 1/n_2)}$, where $p = p_1 = p_2$ is the common value of the population proportions. The superscript refers to a test of $H_0: p_1 = p_2$, where $\sigma^{0}_{\hat{p}_1 - \hat{p}_2}$ is relevant.
  - When $p_1 = p_2$, the standard error of $\hat{p}_1 - \hat{p}_2$ is $SE^{0}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$, where $\hat{p} = (X_1 + X_2)/(n_1 + n_2)$. This is sometimes called the pooled standard error of $\hat{p}_1 - \hat{p}_2$.
  - The statistic $\hat{p} = (X_1 + X_2)/(n_1 + n_2)$ is sometimes called the pooled estimate of $p = p_1 = p_2$, since it pools the data from both samples: $X_1 + X_2$ is the total number of successes in both samples and $n_1 + n_2$ is the total size of both samples.
- The two-sample z test for proportions is as follows:
  - Assumptions: independent simple random samples drawn from distinct populations with unknown population proportions $p_1$ and $p_2$.
  - The null hypothesis is $H_0: p_1 = p_2$ and the alternative hypothesis may be any of $H_a: p_1 < p_2$, $H_a: p_1 > p_2$, or $H_a: p_1 \neq p_2$.
  - The test statistic is $z = (\hat{p}_1 - \hat{p}_2) / \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$, where $\hat{p} = (X_1 + X_2)/(n_1 + n_2)$.
  - The P-value is calculated as $P(Z \le z)$ if $H_a: p_1 < p_2$, $P(Z \ge z)$ if $H_a: p_1 > p_2$, and $2P(Z \ge |z|)$ if $H_a: p_1 \neq p_2$.
  - A rule of thumb is that the stated significance level for this test is accurate if $n_1\hat{p}_1 \ge 5$, $n_1(1-\hat{p}_1) \ge 5$, $n_2\hat{p}_2 \ge 5$, and $n_2(1-\hat{p}_2) \ge 5$.
  - Use of the pooled standard error $SE^{0}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$ reflects the case where $H_0$ is true, in which $\hat{p} = (X_1 + X_2)/(n_1 + n_2)$ is the pooled estimate of $p = p_1 = p_2$.
- The two-sample z confidence interval for proportions is as follows:
  - Assumptions: independent, large size-$n_1$ and size-$n_2$ simple random samples drawn from distinct populations with unknown population proportions $p_1$ and $p_2$.
  - A confidence interval for $p_1 - p_2$ with approximate confidence level $C$ is $(\hat{p}_1 - \hat{p}_2) \pm z^{*} \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$, where $z^{*}$ is such that $C = P(-z^{*} \le Z \le z^{*})$ for $Z$ having a standard Normal distribution.
  - Derivation of the CI formula uses $SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$ to estimate $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}$. The statistic $SE_{\hat{p}_1 - \hat{p}_2}$ is the standard error of $D = \hat{p}_1 - \hat{p}_2$.
  - The margin of error is $m = z^{*}\, SE_{\hat{p}_1 - \hat{p}_2}$.
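The two-sample procedures can be sketched the same way. The sketch below is illustrative only and rests on assumptions not in the notes: the function names (`two_sample_z_test`, `two_sample_z_interval`) and the `alternative` labels are hypothetical. It shows the one distinction the notes stress: the test uses the pooled standard error, while the confidence interval uses the unpooled one.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def two_sample_z_test(x1, n1, x2, n2, alternative="two-sided"):
    """Return (z, P-value) for H0: p1 = p2 using the pooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Pooled estimate of the common proportion p = p1 = p2 under H0.
    p_pool = (x1 + x2) / (n1 + n2)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se_pool
    if alternative == "less":        # Ha: p1 < p2
        p_value = Z.cdf(z)
    elif alternative == "greater":   # Ha: p1 > p2
        p_value = 1 - Z.cdf(z)
    else:                            # Ha: p1 != p2
        p_value = 2 * (1 - Z.cdf(abs(z)))
    return z, p_value


def two_sample_z_interval(x1, n1, x2, n2, conf=0.95):
    """CI for p1 - p2 using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    # z* satisfies C = P(-z* <= Z <= z*).
    m = Z.inv_cdf((1 + conf) / 2) * se
    d = p1_hat - p2_hat
    return d - m, d + m
```

For instance, with 60 successes out of 100 in the first sample and 40 out of 100 in the second, `two_sample_z_test(60, 100, 40, 100)` pools to a common estimate of 0.5, and `two_sample_z_interval(60, 100, 40, 100)` returns an interval centered at the observed difference of 0.2.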