Note 9 for PUBHLTH 640 at UMass
BE540W Regression and Correlation

Topic 9: Regression and Correlation

Contents:
1. Definition of the Linear Regression Model
2. Estimation
3. The Analysis of Variance Table
4. Assumptions for the Straight Line Regression
5. Hypothesis Testing
6. Confidence Interval Estimation
7. Introduction to Correlation
8. Hypothesis Test for Correlation

1. Definition of the Linear Regression Model

In the last unit (Topic 8) the setting was that of two categorical (discrete) variables, such as smoking and low birth weight, and the use of chi-square tests of association and homogeneity. In this unit (Topic 9) our focus is the setting of two continuous variables, such as age and weight. This topic is an introduction to simple linear regression and correlation.

Linear Regression
Linear regression models the mean μ of one random variable as a linear function of one or more other variables that are treated as fixed. The estimation and hypothesis testing involved are extensions of ideas and techniques that we have already seen. In linear regression:
- We observe an outcome (or dependent) variable Y at several levels of the independent (or predictor) variable X; there may be more than one predictor X, as seen later.
- A linear regression model assumes that the values of the predictor X have been fixed in advance of observing Y.
- However, this is not always the reality. Often Y and X are observed jointly and are both random variables.

Correlation
Correlation considers the association of two random variables.
- The techniques of estimation and hypothesis testing are the same for linear regression and correlation analyses.
- Exploring the relationship begins with fitting a line to the points.

We develop the linear regression model analysis for a simple example involving one predictor and one outcome.

Example (Source: Kleinbaum, Kupper, and Muller, 1988). Available are pairs of observations of age and weight for n = 11 chicken
embryos:

    WT (Y)    AGE (X)   LOGWT (Z)
    0.029        6       -1.538
    0.052        7       -1.284
    0.079        8       -1.102
    0.125        9       -0.903
    0.181       10       -0.742
    0.261       11       -0.583
    0.425       12       -0.372
    0.738       13       -0.132
    1.130       14        0.053
    1.882       15        0.275
    2.812       16        0.449

We'll use a familiar notation.
- The data are 11 pairs (X_i, Y_i), where X = AGE and Y = WT: (X_1, Y_1) = (6, 0.029), ..., (X_11, Y_11) = (16, 2.812), and
- equivalently, 11 pairs (X_i, Y_i), where X = AGE and Y = LOGWT: (X_1, Y_1) = (6, -1.538), ..., (X_11, Y_11) = (16, 0.449).

Though simple, it helps to be clear in the research question. Does weight change with age? In the language of analysis of variance, we are asking the following: Can the variability in weight be explained, to a significant extent, by variations in age? What is a good functional form that relates age to weight?

We begin with a plot of X = AGE versus Y = WT.

[Figure: Scatter plot of WT vs AGE]

We check and learn about the following:
- The average and median of X
- The range and pattern of variability in X
- The average and median of Y
- The range and pattern of variability in Y
- The nature of the relationship between X and Y
- The strength of the relationship between X and Y
- The identification of any points that might be influential

For these data:
- The plot suggests a relationship between AGE and WT.
- A straight line might fit well, but another model might be better.
- We have adequate ranges of values for both AGE and WT.
- There are no outliers.

We might have gotten any of a variety of plots:

[Figure: example scatter plots]
- No relationship between X and Y
- Linear relationship between X and Y
- Non-linear relationship between X and Y

[Figure: scatter plots, each with an outlying point]
- Note the arrow pointing to the outlying point: fit of a linear model will yield an estimated slope that is spuriously non-zero.
- Note the arrow pointing to the outlying point: fit of a linear model will yield an estimated slope that is spuriously near zero.
- Note the arrow pointing to the outlying point: fit of a linear model will yield an estimated slope that is
spuriously high.

The bowl shape of our scatter plot suggests that perhaps a better model relates the logarithm of WT to AGE.

[Figure: Scatter plot of LOGWT vs AGE]

We'll investigate two models:
1. WT = β0 + β1·AGE
2. LOGWT = β0 + β1·AGE

Recall what you might have learned in an old math class about the equation of a line:

    y = β0 + β1·x

where β0 = "y-intercept" = value of y when x = 0, and β1 = "slope" = Δy/Δx = (change in y)/(change in x).

You might recall, too, a feel for the slope: slope > 0 (line rises), slope = 0 (line is flat), slope < 0 (line falls).

Definition of the Straight Line Model Y = β0 + β1·X

    Population:  Y = β0 + β1·X + ε
    Sample:      Y = β̂0 + β̂1·X + e

Y = β0 + β1·X is the relationship in the population. It is measured with error ε (measurement error). We do NOT know the value of β0, nor β1, nor ε. β̂0, β̂1, and e are our guesses of β0, β1, and ε; e is called the residual. We do have values of β̂0, β̂1, and e. The values of β̂0, β̂1, and e are obtained by the method of least squares estimation. To see whether β̂0 ≈ β0 and β̂1 ≈ β1, we perform regression diagnostics. (Note: this is not discussed in this course; see PubHlth 640, Intermediate Biostatistics.)

A little notation, sorry:
- Y = the outcome or dependent variable
- X = the predictor or independent variable
- μ_Y = the expected value of Y for all persons in the population
- μ_{Y|X=x} = the expected value of Y for the subpopulation for whom X = x
- σ²_Y = variability of Y among all persons in the population
- σ²_{Y|X=x} = variability of Y for the subpopulation for whom X = x
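Before deriving the estimates, it may help to see the model fit numerically. The following is a minimal Python sketch (Python stands in here for the course's SAS, and the variable names are my own) that anticipates the least-squares formulas of the next section, applied to the chick-embryo data; it reproduces the fit reported later in these notes, WT = -1.88453 + 0.23507·AGE.

```python
# Chick-embryo data (Kleinbaum, Kupper, and Muller, 1988): X = AGE, Y = WT
age = list(range(6, 17))
wt = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
      0.425, 0.738, 1.130, 1.882, 2.812]

n = len(age)
xbar = sum(age) / n          # mean of X
ybar = sum(wt) / n           # mean of Y

# Preliminary sums of squares (S_xx and S_xy of the next section)
sxx = sum((x - xbar) ** 2 for x in age)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(age, wt))

# Least-squares slope and intercept
b1 = sxy / sxx               # slope estimate
b0 = ybar - b1 * xbar        # intercept estimate

print(round(b1, 5), round(b0, 5))
```

Rounded to five decimals, the slope and intercept agree with the SAS output shown later (0.23507 and -1.88453).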
2. Estimation

We will use the method of least squares to obtain guesses of β0 and β1.

Goal: From the many possible lines through the scatter of points, choose the one line that is "closest" to the data.

What do we mean by "close"?
- We'd like the vertical distance between each observed Y_i and its corresponding fitted Ŷ_i to be as small as possible.
- It's not possible to choose β̂0 and β̂1 so as to minimize (Y_1 − Ŷ_1)² individually, and minimize (Y_2 − Ŷ_2)² individually, and so on.
- So instead we choose β̂0 and β̂1 to minimize their total:

    Σ_{i=1}^{n} (Y_i − Ŷ_i)² = Σ_{i=1}^{n} (Y_i − β̂0 − β̂1·X_i)²

A picture gives a feel for the fact that many lines are possible, and that we seek the closest in the sense of the vertical distances being as small as possible.

[Figure: scatter of points with a candidate line Ŷ = β̂0 + β̂1·X; β̂0 and β̂1 are chosen such that the sum of the squared vertical distances, Σ_{i=1}^{n} d_i², is minimized]

For each observed value x_i, we have an observed y_i and the predicted value ŷ_i on the line. The vertical distances are d_i = (y_i − ŷ_i).

The expression to be minimized, Σ_{i=1}^{n} (y_i − ŷ_i)², has a variety of names:
- residual sum of squares
- sum of squares about the regression line
- sum of squares due error, SSE

For the calculus lover: a little calculus yields the solution for the guesses β̂0 and β̂1. Consider SSE = Σ_{i=1}^{n} (Y_i − β̂0 − β̂1·X_i)².
- Step 1: Differentiate with respect to β̂1. Set the derivative equal to 0 and solve.
- Step 2: Differentiate with respect to β̂0. Set the derivative equal to 0, insert β̂1, and solve.

For the non-calculus lover, here are the estimates of β0 and β1:
- β1 is the slope; its estimate is denoted β̂1 or b1.
- β0 is the intercept; its estimate is denoted β̂0 or b0.

Some very helpful preliminary calculations:

    S_xx = Σ(X_i − X̄)² = ΣX_i² − n·X̄²
    S_yy = Σ(Y_i − Ȳ)² = ΣY_i² − n·Ȳ²
    S_xy = Σ(X_i − X̄)(Y_i − Ȳ) = ΣX_i·Y_i − n·X̄·Ȳ

Note: These expressions make use of a special notation called the summation notation. The capital S indicates summation. In S_xy, the first subscript x is saying
(X − X̄); the second subscript y is saying (Y − Ȳ):

    S_xy = Σ(X_i − X̄)(Y_i − Ȳ)
               └ subscript x   └ subscript y

Slope:

    β̂1 = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² = côv(X,Y)/vâr(X) = S_xy/S_xx

Intercept:

    β̂0 = Ȳ − β̂1·X̄

Prediction of Y:

    Ŷ = β̂0 + β̂1·X = b0 + b1·X

Do these estimates make sense?
- Slope β̂1 = S_xy/S_xx = côv(X,Y)/vâr(X): the linear movement in Y with linear movement in X is measured relative to the variability in X.
  - β̂1 = 0 says: with a unit change in X, overall there is a 50-50 chance that Y increases versus decreases.
  - β̂1 ≠ 0 says: with a unit increase in X, Y increases also (β̂1 > 0) or Y decreases (β̂1 < 0).
- Intercept β̂0 = Ȳ − β̂1·X̄: if the linear model is incorrect, or if the true model does not have a linear component, we obtain β̂1 = 0 and β̂0 = Ȳ as our best guess of an unknown Y.

Illustration in SAS. Code:

    data temp;
      input wt age logwt;
      label wt='Weight (Y)' age='Age (X)' logwt='LogWeight (Y)';
      cards;
    0.029 6 -1.538
    0.052 7 -1.284
    0.079 8 -1.102
    0.125 9 -0.903
    0.181 10 -0.742
    0.261 11 -0.583
    0.425 12 -0.372
    0.738 13 -0.132
    1.130 14 0.053
    1.882 15 0.275
    2.812 16 0.449
    ;
    run; quit;

    proc reg data=temp simple;   * option simple produces simple descriptives;
      title 'Regression of Y=Weight on X=Age';
      model wt=age;
    run; quit;

Partial listing of output:

    Parameter Estimates
                              Parameter   Standard
    Variable   Label      DF  Estimate    Error      t Value   Pr > |t|
    Intercept  Intercept   1  -1.88453    0.52584    -3.58     0.0059
    age        Age (X)     1   0.23507    0.04594     5.12     0.0006

Annotated: the Estimate -1.88453 is the intercept β̂0; the Estimate 0.23507 is the slope β̂1.

The fitted line is therefore: WT = -1.88453 + 0.23507·AGE.

Let's overlay the fitted line on our scatter plot.

[Figure: Scatter plot of WT vs AGE with fitted line]

- As we might have guessed, the straight line model may not be the best choice.
- The bowl shape of the scatter plot does have a linear component, however.
- Without the plot, we might have believed the straight line
fit is okay.

Let's try a straight line model fit to Y = LOGWT versus X = AGE. Partial listing of output:

    Parameter Estimates
                              Parameter   Standard
    Variable   Label      DF  Estimate    Error      t Value   Pr > |t|
    Intercept  Intercept   1  -2.68925    0.03064    -87.78    <.0001
    age        Age (X)     1   0.19589    0.00268     73.18    <.0001

Annotated: the Estimate -2.68925 is the intercept β̂0; the Estimate 0.19589 is the slope β̂1.

Thus the fitted line is: LOGWT = -2.68925 + 0.19589·AGE.

Now the scatter plot with the overlay of the fitted line looks much better.

[Figure: Scatter plot of LOGWT vs AGE with fitted line]

Now You Try: Prediction of Weight from Height (Source: Dixon and Massey, 1969)

    Individual   Height (X)   Weight (Y)
     1              60           110
     2              60           135
     3              60           120
     4              62           120
     5              62           140
     6              62           130
     7              62           135
     8              64           150
     9              64           145
    10              70           170
    11              70           185
    12              70           160

It helps to do the preliminary calculations:

    X̄ = 63.833      Ȳ = 141.667
    ΣX_i² = 49,068   ΣY_i² = 246,100   ΣX_i·Y_i = 109,380
    S_xx = 171.667   S_yy = 5266.667   S_xy = 863.333

Slope:

    β̂1 = S_xy/S_xx = 863.333/171.667 = 5.0291

Intercept:

    β̂0 = Ȳ − β̂1·X̄ = 141.667 − (5.0291)(63.833) = -179.3573

3. The Analysis of Variance Table

In Topic 1 (Summarizing Data) we learned that the numerator of the sample variance of the Y data is Σ(Y_i − Ȳ)². In regression settings where Y is the outcome variable, this same quantity Σ(Y_i − Ȳ)² is appreciated as the "total variance of the Y's". As we will see, other names for this are "total sum of squares", "total, corrected", and "SSY". (Note: "corrected" has to do with subtracting the mean before squaring.)

An analysis of variance table is all about partitioning the total variance of the Y's (corrected) into two components:
1. Due residual: the individual Y_i about the individual prediction Ŷ_i
2. Due regression: the prediction Ŷ_i about the overall mean Ȳ

Aside: Why are we interested in such a partition? We'd like to
know if, within the data, there exists the suggestion of a linear relationship ("signal") that can be discerned from chance variability ("noise").
1. The leftover variability of the observed Y_i about the predicted Ŷ_i: "noise"
2. The explained variability of the predicted Ŷ_i about the overall mean Ȳ: "signal"

Here is the partition. (Note: look closely and you'll see that both sides are the same.)

    (Y_i − Ȳ) = (Y_i − Ŷ_i) + (Ŷ_i − Ȳ)

Some algebra (not shown) reveals a nice partition of the total variability:

    Σ(Y_i − Ȳ)²    =    Σ(Y_i − Ŷ_i)²    +    Σ(Ŷ_i − Ȳ)²
    Total Sum of        Due Error Sum         Due Model Sum
    Squares             of Squares            of Squares

A closer look:
- (Y_i − Ȳ) = deviation of Y_i from Ȳ that is to be explained
- (Ŷ_i − Ȳ) = "due model", "signal", "systematic", "due regression"
- (Y_i − Ŷ_i) = "due error", "noise", or "residual"

In the world of regression analyses, we seek to explain the total variability Σ(Y_i − Ȳ)². What happens when β̂1 ≠ 0, and what happens when β̂1 = 0?

When β̂1 ≠ 0 (a straight line relationship is helpful):
- The best guess is Ŷ = β̂0 + β̂1·X.
- "Due model" is LARGE, because Ŷ − Ȳ = β̂1·(X − X̄) is non-zero whenever X ≠ X̄.
- "Due error" has to be small, so (due model)/(due error) will be large.

When β̂1 = 0 (a straight line relationship is not helpful):
- The best guess is Ŷ = β̂0 = Ȳ.
- "Due error" is nearly the TOTAL, because Y − Ŷ ≈ Y − Ȳ.
- "Due regression" has to be small, so (due model)/(due error) will be small.

How to Partition the Total Variance
1. The "total" or "total, corrected" refers to the variability of Y about Ȳ.
   - Σ(Y_i − Ȳ)² is called the "total sum of squares".
   - Degrees of freedom: df = n − 1.
   - Division of the total sum of squares by its df yields the "total mean square".
2. The "residual" or "due error" refers to the variability of Y about Ŷ.
   - Σ(Y_i − Ŷ_i)² is called the "residual sum of squares".
   - Degrees of freedom: df = n − 2.
   - Division of the residual sum of squares by its df yields the "residual mean square".
3. The "regression" or "due model" refers to the variability of Ŷ about Ȳ.
   - Σ(Ŷ_i − Ȳ)² is called the "regression sum of squares".
   - Degrees of freedom: df = 1.
   - Division of the regression sum
of squares by its df yields the "regression mean square" or "model mean square". This is an example of a variance component.

    Source              df      Sum of Squares               Mean Square
    Regression          1       SSR = Σ(Ŷ_i − Ȳ)²            SSR/1
    Error               n − 2   SSE = Σ(Y_i − Ŷ_i)²          SSE/(n − 2)
    Total, corrected    n − 1   SST = Σ(Y_i − Ȳ)²

Hint: Mean square = (sum of squares)/df.

Be careful! The question we may ask from an analysis of variance table is a limited one. Does the fit of the straight line model explain a significant portion of the variability of the individual Y about Ȳ? Is this better than using Ȳ alone? We are NOT asking: Is the choice of the straight line model correct? Would another functional form be a better choice?

We'll use a hypothesis test approach and the method of proof by contradiction.
- We begin with a null hypothesis that says β1 = 0 ("no linear relationship").
- Evaluation will focus on the comparison of the "due regression" mean square to the "residual" mean square.
- Recall that we reasoned the following:
  If β1 ≠ 0, then (due regression)/(due residual) will be LARGE.
  If β1 = 0, then (due regression)/(due residual) will be SMALL.
- Our p-value calculation will answer the question: If β1 = 0 truly, what are the chances of obtaining a value of (due regression)/(due residual) as large or larger than that observed?

To calculate chances, we need a probability model. So far, we have not needed one.

4. Assumptions for a Straight Line Regression Analysis

In performing least squares estimation, we did not use a probability model; we were doing geometry. Hypothesis testing requires some assumptions and a probability model.

Assumptions:
- The separate observations Y_1, Y_2, ..., Y_n are independent.
- The values of the predictor variable X are fixed and measured without error.
- For each value of the predictor variable X = x, the distribution of values of Y follows a normal distribution with mean equal to μ_{Y|X=x} and common variance equal to σ²_{Y|x}.
- The separate means μ_{Y|X=x} lie on a straight line; that is, μ_{Y|X=x} = β0 + β1·x.

Schematically,
here is what the situation looks like (courtesy of Stan Lemeshow):

[Figure: for each value of x, the values of y are normally distributed around μ_{Y|x} on the line, with the same variance for all values of x but different means μ_{Y|x1}, μ_{Y|x2}, μ_{Y|x3}, ...]

Here, σ²_{Y|x1} = σ²_{Y|x2} = σ²_{Y|x3} = σ²_{Y|x}.

With these assumptions, we can assess the significance of the variance explained by the model:

    F = msq(model)/msq(residual),  with df = 1, n−2

When β1 = 0:
- "Due model" (MSR) has expected value σ²_{Y|x}.
- "Due residual" (MSE) has expected value σ²_{Y|x}.
- F = MSR/MSE will be close to 1.

When β1 ≠ 0:
- "Due model" (MSR) has expected value σ²_{Y|x} + β1²·Σ(X_i − X̄)².
- "Due residual" (MSE) has expected value σ²_{Y|x}.
- F = MSR/MSE will be LARGER than 1.

We obtain the analysis of variance table for the model of Y = LOGWT on X = AGE. The following is in SAS, with annotations:

    Analysis of Variance
                               Sum of       Mean
    Source            DF       Squares      Square        F Value    Pr > F
    Model              1       4.22106      4.22106       5355.60    <.0001   <- F = msq(Regression)/msq(Residual)
    Error              9       0.00709      0.00078816
    Corrected Total   10       4.22815

    Root MSE           0.02807    R-Square   0.9983   <- SSQ(regression)/SSQ(total)
    Dependent Mean    -0.53445    Adj R-Sq   0.9981   <- R² adjusted for n and # predictors
    Coeff Var         -5.25286

This output corresponds to the following:

    Source              df          Sum of Squares                   Mean Square
    Regression          1           SSR = Σ(Ŷ_i − Ȳ)² = 4.22106     4.22106
    Error               n − 2 = 9   SSE = Σ(Y_i − Ŷ_i)² = 0.00709   7.8816E-04
    Total, corrected    n − 1 = 10  SST = Σ(Y_i − Ȳ)² = 4.22815

Other information in this output:
- R-SQUARED = (sum of squares regression)/(sum of squares total) is the proportion of the total that we have been able to explain with the fit of the straight line model. Be careful! As predictors are added to the model, R-SQUARED can only increase. Eventually we need to adjust this measure to take this into account; see ADJUSTED R-SQUARED.
- We also get an overall F-test of the null hypothesis that the simple linear model does not explain significantly more variability in LOGWT than the average LOGWT:

    F = msq(Regression)/msq(Residual) = 4.22106/0.00078816 = 5355.60, with df = 1, 9

Achieved significance < 0.0001. Reject H0. Conclude that the fitted line is a significant improvement over the average LOGWT.

5. Hypothesis Testing

Straight line model: Y = β0 + β1·X.
1. Overall F-test
2. Test of slope
3. Test of intercept

1. Overall F-Test

Research question: Does the fitted model, Ŷ, explain significantly more of the total variability of the Y's about Ȳ than does Ȳ alone?

Assumptions: As before.

H0 and HA:
    H0: β1 = 0
    HA: β1 ≠ 0

Test statistic:

    F = msq(regression)/msq(residual),  df = 1, n−2

Evaluation rule: When the null hypothesis is true, the value of F should be close to 1. Alternatively, when β1 ≠ 0, the value of F will be LARGER than 1. Thus, our p-value calculation answers: What are the chances of obtaining our value of F, or one that is larger, if we believe the null hypothesis that β1 = 0?

Calculations: For our data, we obtain

    p-value = Pr[F_{1,9} ≥ 5355.60] << 0.0001

Evaluate: Under the null hypothesis that β1 = 0, the chances of obtaining a value of F as far from its expected value of 1 as 5355.60 are less than 1 chance in 10,000. This is a very small likelihood.

Interpret: We have learned that, at the least, the fitted straight line model does a much better job of explaining the variability in LOGWT than a model that allows only for the average LOGWT. (Later, in BE640 Intermediate Biostatistics, we'll see that the analysis does not stop here.)

2. Test of the Slope, β1

Some interesting notes:
- The overall F-test and the test of the slope are equivalent.
- The test of the slope uses a t-score approach to hypothesis testing.
- It can be shown that (t-score for slope)² = overall F.

Research question: Is the slope β1 = 0?

Assumptions: As before.

H0 and HA:
    H0: β1 = 0
    HA: β1 ≠ 0

Test statistic: To compute the t-score, we need an estimate of the standard error of β̂1:

    sê(β̂1) = sqrt[ msq(residual) / Σ(X_i − X̄)² ]

Our t-score is therefore:

    t = (observed − expected)/sê(expected) = (β̂1 − 0)/sê(β̂1),  df = n − 2

We can find this information in our output. The following is in SAS, with annotations:

    Parameter Estimates
                              Parameter   Standard
    Variable   Label      DF  Estimate    Error      t Value   Pr > |t|
    Intercept  Intercept   1  -2.68925    0.03064    -87.78    <.0001
    age        Age (X)     1   0.19589    0.00268     73.18    <.0001
                               ^ for the slope, t Value = Estimate/Error = 0.19589/0.00268

Recall what we mean by a t-score: t = 73.18 says the estimated slope is estimated to be 73.18 standard error units away from its expected value of zero. Check that (t-score)² = overall F: (73.18)² = 5355.3, which is close to 5355.60.

Evaluation rule: When the null hypothesis is true, the value of t should be close to zero. Alternatively, when β1 ≠ 0, the value of t will be DIFFERENT from 0. Here, our p-value calculation answers: What are the chances of obtaining our value of t, or one farther away from 0, if we believe the null hypothesis that β1 = 0?

Calculations: For our data, we obtain

    p-value = 2·Pr[ t_9 ≥ |β̂1 − 0|/sê(β̂1) ] = 2·Pr[ t_9 ≥ 73.18 ] << 0.0001

Evaluate: Under the null hypothesis that β1 = 0, the chances of obtaining a t-score value that is 73.18 or more standard error units away from the expected value of 0 are less than 1 chance in 10,000.

Interpret: The inference is the same as that for the overall F-test. The fitted straight line model does a much better job of explaining the variability in LOGWT than the sample mean.

3. Test of the Intercept, β0

This pertains to whether or not the straight line relationship passes through the origin. It is rarely of interest.

Research question: Is the intercept β0 = 0?

Assumptions: As before.

H0 and HA:
    H0: β0 = 0
    HA: β0 ≠ 0

Test statistic: To compute the t-score for the intercept, we need an estimate of the standard error of β̂0:

    sê(β̂0) = sqrt[ msq(residual) · (1/n + X̄²/Σ(X_i − X̄)²) ]

Our t-score is therefore:

    t = (observed − expected)/sê(expected) = (β̂0 − 0)/sê(β̂0),  df = n − 2

We can find this information in our output. The following is in SAS, with annotations:
    Parameter Estimates
                              Parameter   Standard
    Variable   Label      DF  Estimate    Error      t Value   Pr > |t|
    Intercept  Intercept   1  -2.68925    0.03064    -87.78    <.0001
    age        Age (X)     1   0.19589    0.00268     73.18    <.0001
                               ^ for the intercept, t Value = Estimate/Error = -2.68925/0.03064

This t = -87.78 says: the estimated intercept is estimated to be 87.78 standard error units away from its expected value of zero.

Evaluation rule: When the null hypothesis is true, the value of t should be close to zero. Alternatively, when β0 ≠ 0, the value of t will be DIFFERENT from 0. Our p-value calculation answers: What are the chances of obtaining our value of t, or one farther away from 0, if we believe the null hypothesis that β0 = 0?

Calculations:

    p-value = 2·Pr[ t_9 ≥ |β̂0 − 0|/sê(β̂0) ] = 2·Pr[ t_9 ≥ 87.78 ] << 0.0001

Evaluate: Under the null hypothesis that β0 = 0, the chances of obtaining a t-score value that is 87.78 or more standard error units away from the expected value of 0 are less than 1 chance in 10,000.

Interpret: The inference is that the straight line relationship between Y = LOGWT and X = AGE does not pass through the origin.

6. Confidence Interval Estimation

Straight line model: Y = β0 + β1·X.

Recall (Topic 6, Estimation) that there are 3 elements of a confidence interval:
1. Best single guess (estimate)
2. Standard error of the best single guess, SE(estimate)
3. Confidence coefficient
   - These will be percentiles from the t-distribution with df = n − 2.
   - For a 95% confidence interval, this will be a 97.5th percentile.
   - For a (1 − α)·100% confidence interval, this will be a (1 − α/2)·100th percentile.

The generic form of a confidence interval is then:

    Generic Form of Confidence Interval, Straight Line Model Y = β0 + β1·X
    Lower limit = estimate − (confidence coefficient)·SE(estimate)
    Upper limit = estimate + (confidence coefficient)·SE(estimate)

We might want confidence interval estimates of the following 4 parameters:
1. Slope
2. Intercept
3. Mean of the subset of the population for whom X = x0
4. Individual response for a person for whom X = x0

1. SLOPE:
    estimate = β̂1
    sê(β̂1) = sqrt[ msq(residual) / Σ(X_i − X̄)² ]

2. INTERCEPT:

    estimate = β̂0
    sê(β̂0) = sqrt[ msq(residual) · (1/n + X̄²/Σ(X_i − X̄)²) ]

3. MEAN at X = x0:

    estimate = Ŷ = β̂0 + β̂1·x0
    sê(Ŷ) = sqrt[ msq(residual) · (1/n + (x0 − X̄)²/Σ(X_i − X̄)²) ]

4. INDIVIDUAL with X = x0:

    estimate = Ŷ = β̂0 + β̂1·x0
    sê(Ŷ) = sqrt[ msq(residual) · (1 + 1/n + (x0 − X̄)²/Σ(X_i − X̄)²) ]

Illustration, for the model which fits Y = LOGWT to X = AGE. Recall that we obtained the following fit:

    Parameter Estimates
                              Parameter   Standard
    Variable   Label      DF  Estimate    Error      t Value   Pr > |t|
    Intercept  Intercept   1  -2.68925    0.03064    -87.78    <.0001
    age        Age (X)     1   0.19589    0.00268     73.18    <.0001

95% Confidence Interval for the Slope, β1:
1. Best single guess (estimate): β̂1 = 0.19589
2. Standard error of the best single guess: sê(β̂1) = 0.00268
3. Confidence coefficient: 97.5th percentile of Student's t, t_{.975, df=9} = 2.26

95% confidence interval for slope β1 = estimate ± (confidence coefficient)·SE = 0.19589 ± (2.26)(0.00268) = (0.1898, 0.2019)

95% Confidence Interval for the Intercept, β0:
1. Best single guess (estimate): β̂0 = -2.68925
2. Standard error of the best single guess: sê(β̂0) = 0.03064
3. Confidence coefficient: 97.5th percentile of Student's t, t_{.975, df=9} = 2.26

95% confidence interval for intercept β0 = -2.68925 ± (2.26)(0.03064) = (-2.7585, -2.6200)

Confidence Intervals for Predictions

Code:

    proc reg data=temp alpha=0.05;   * alpha=0.05 is the type I error;
      title 'Regression of Y=Weight on X=Age';
      model wt=age / cli clm;        * cli for individual, clm for mean;
    run; quit;

Output:

    Output Statistics
         Dependent  Predicted  Std Error
    Obs  Variable   Value      Mean Predict   95% CL Mean         95% CL Predict      Residual
     1   -1.5380    -1.5139    0.0158        -1.5497  -1.4781    -1.5868  -1.4410     -0.0241
     2   -1.2840    -1.3180    0.0136        -1.3489  -1.2871    -1.3886  -1.2474      0.0340
     3   -1.1020    -1.1221    0.0117        -1.1485  -1.0957    -1.1909  -1.0534      0.0201
     4   -0.9030    -0.9262    0.0100        -0.9489  -0.9036    -0.9937  -0.8588      0.0232
     5   -0.7420    -0.7303    0.008878      -0.7504  -0.7103    -0.7970  -0.6637     -0.0117
     6   -0.5830    -0.5345    0.008465      -0.5536  -0.5153    -0.6008  -0.4681     -0.0485
     7   -0.3720    -0.3386    0.008878      -0.3586  -0.3185    -0.4052  -0.2720     -0.0334
     8   -0.1320    -0.1427    0.0100        -0.1653  -0.1200    -0.2101  -0.0752      0.0107
     9    0.0530     0.0532    0.0117         0.0268   0.0796    -0.0156   0.1220     -0.000218
    10    0.2750     0.2491    0.0136         0.2182   0.2800     0.1785   0.3197      0.0259
    11    0.4490     0.4450    0.0158         0.4092   0.4808     0.3721   0.5179      0.0040

7. Introduction to Correlation

Definition of Correlation
A correlation coefficient is a measure of the association between two paired random variables (e.g., height and weight). The Pearson product moment correlation, in particular, is a measure of the strength of the straight line relationship between the two random variables. Another correlation measure (not discussed here) is the Spearman correlation. It is a measure of the strength of the monotone (increasing or decreasing) relationship between the two random variables. The Spearman correlation is a non-parametric (meaning model-free) measure. It is introduced in PubHlth 640, Intermediate Biostatistics.

Formula for the Pearson Product Moment Correlation, ρ
- The population parameter designation is "rho", written as ρ.
- The estimate of ρ based on information in a sample is represented using r.
- Some preliminaries:
  1. Suppose we are interested in the correlation between X and Y.
  2. côv(X,Y) = Σ(X_i − X̄)(Y_i − Ȳ)/(n − 1) = S_xy/(n − 1). This is the covariance of X and Y.
  3. vâr(X) = Σ(X_i − X̄)²/(n − 1) = S_xx/(n − 1), and similarly
  4. vâr(Y) = Σ(Y_i − Ȳ)²/(n − 1) = S_yy/(n − 1).

Formula for the estimate of the Pearson product moment correlation from a sample:

    r = côv(X,Y)/sqrt[ vâr(X)·vâr(Y) ] = S_xy/sqrt( S_xx·S_yy )

If you absolutely have to do it by hand, an equivalent, more calculator-friendly formula is:

    r = [ΣX_i·Y_i − (ΣX_i)(ΣY_i)/n] / sqrt{ [ΣX_i² − (ΣX_i)²/n]·[ΣY_i² − (ΣY_i)²/n] }

- The correlation r can take on values between -1 and +1 only.
- Thus, the correlation coefficient is said to be dimensionless: it is independent of the units of X or Y.
- Sign of the correlation coefficient (positive or negative) = sign of the estimated slope.

There is a relationship between the slope of the straight line, β̂1, and the estimated correlation r. Because

    β̂1 = S_xy/S_xx   and   r = S_xy/sqrt( S_xx·S_yy ),

a little algebra reveals that

    r = β̂1 · sqrt( S_xx/S_yy )

Thus, beware! It is possible to
have a very large (positive or negative) r accompany a slope that is barely non-zero, inasmuch as:
- A very large r might reflect a very large S_xx, all other things equal.
- A very large r might reflect a very small S_yy, all other things equal.

8. Hypothesis Test of Correlation

The null hypothesis of zero correlation is equivalent to the null hypothesis of zero slope.

Research question: Is the correlation ρ = 0? Is the slope β1 = 0?

Assumptions: As before.

H0 and HA:
    H0: ρ = 0
    HA: ρ ≠ 0

Test statistic: A little algebra (not shown) yields a very nice formula for the t-score that we need:

    t = r·sqrt(n − 2) / sqrt(1 − r²),  df = n − 2

We can find this information in our output. Recall the first example and the model of Y = LOGWT on X = AGE. The Pearson correlation r is ±sqrt(R-Square) in the output:

    Root MSE           0.02807    R-Square   0.9983
    Dependent Mean    -0.53445    Adj R-Sq   0.9981
    Coeff Var         -5.25286

    Pearson correlation r = sqrt(0.9983) = 0.9991

Substitution into the formula for the t-score yields

    t = r·sqrt(n − 2)/sqrt(1 − r²) = (0.9991)·sqrt(9)/sqrt(1 − 0.9983) = 2.9974/0.0412 = 72.69

(Note: the value 0.9991 in the numerator is r = sqrt(R²) = sqrt(0.9983) = 0.9991.)

This is very close to the value of the t-score that was obtained for testing the null hypothesis of zero slope (73.18). The discrepancy is probably rounding error: I did the calculations on my calculator using 4 significant digits; SAS probably used more significant digits.
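The rounding-error remark can be checked numerically. Below is a Python sketch (Python stands in for SAS here; the variable names are my own) that computes r and its t-score at full precision for the chick-embryo LOGWT data. At full precision, the correlation t-score and the slope t-score agree exactly, both about 73.2, so the 72.69 above is indeed just 4-digit rounding.

```python
import math

# Chick-embryo data: X = AGE, Y = LOGWT (Kleinbaum, Kupper, and Muller, 1988)
age = list(range(6, 17))
logwt = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(age)
xbar, ybar = sum(age) / n, sum(logwt) / n

# Preliminary sums of squares
sxx = sum((x - xbar) ** 2 for x in age)
syy = sum((y - ybar) ** 2 for y in logwt)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(age, logwt))

# Pearson correlation and its t-score, df = n - 2
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Slope t-score: b1 / se(b1), where se(b1) = sqrt(MSE / Sxx)
b1 = sxy / sxx
sse = syy - sxy ** 2 / sxx        # residual sum of squares
mse = sse / (n - 2)               # msq(residual)
t_slope = b1 / math.sqrt(mse / sxx)

print(round(r, 4), round(t_corr, 2), round(t_slope, 2))
```

This also illustrates the algebraic identity behind Section 8: testing ρ = 0 and testing β1 = 0 are the same test.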