# Intermediate Biostatistics — PUBHLTH 640

UMass


These 43 pages of class notes were uploaded by Agustin Bechtelar on Friday, October 30, 2015. The notes belong to PUBHLTH 640 at the University of Massachusetts, taught by Staff in Fall 2015.


# BE540W Regression and Correlation — Topic 9

Topic outline:

1. Definition of the Linear Regression Model
2. Estimation
3. The Analysis of Variance Table
4. Assumptions for the Straight Line Regression
5. Hypothesis Testing
6. Confidence Interval Estimation
7. Introduction to Correlation
8. Hypothesis Test for Correlation

## 1. Definition of the Linear Regression Model

In the last unit (Topic 8) the setting was that of two categorical (discrete) variables, such as smoking and low birth weight, and the use of chi-square tests of association and homogeneity. In this unit (Topic 9) our focus is the setting of two continuous variables, such as age and weight. This topic is an introduction to simple linear regression and correlation.

**Linear regression** models the mean $\mu$ of one random variable as a linear function of one or more other variables that are treated as fixed. The estimation and hypothesis testing involved are extensions of ideas and techniques that we have already seen. In linear regression:

- We observe an outcome (dependent) variable $Y$ at several levels of the independent (predictor) variable $X$. There may be more than one predictor $X$, as seen later.
- A linear regression model assumes that the values of the predictor $X$ have been fixed in advance of observing $Y$.
- However, this is not always the reality. Often $Y$ and $X$ are observed jointly and are both random variables.

**Correlation** considers the association of two random variables.

- The techniques of estimation and hypothesis testing are the same for linear regression and correlation analyses.
- Exploring the relationship begins with fitting a line to the points.
- We develop the linear regression model analysis for a simple example involving one predictor and one outcome.

**Example** (source: Kleinbaum, Kupper, and Muller, 1988). Available are pairs of observations of age and weight for n = 11 chicken embryos.
| WT (Y) | AGE (X) | LOGWT (Z) |
|-------:|--------:|----------:|
| 0.029 | 6 | −1.538 |
| 0.052 | 7 | −1.284 |
| 0.079 | 8 | −1.102 |
| 0.125 | 9 | −0.903 |
| 0.181 | 10 | −0.742 |
| 0.261 | 11 | −0.583 |
| 0.425 | 12 | −0.372 |
| 0.738 | 13 | −0.132 |
| 1.130 | 14 | 0.053 |
| 1.882 | 15 | 0.275 |
| 2.812 | 16 | 0.449 |

We'll use a familiar notation:

- The data are 11 pairs $(X_i, Y_i)$, where X = AGE and Y = WT: $(X_1, Y_1) = (6, 0.029)$ through $(X_{11}, Y_{11}) = (16, 2.812)$; and, equivalently,
- 11 pairs $(X_i, Y_i)$, where X = AGE and Y = LOGWT: $(X_1, Y_1) = (6, -1.538)$ through $(X_{11}, Y_{11}) = (16, 0.449)$.

Though simple, it helps to be clear about the research question: Does weight change with age? In the language of analysis of variance, we are asking: Can the variability in weight be explained, to a significant extent, by variations in age? And: What is a good functional form that relates age to weight?

We begin with a plot of X = AGE versus Y = WT.

*[Figure: scatter plot of WT (0 to 3.0) versus AGE (6 to 16).]*

We check and learn about the following:

- The average and median of X
- The range and pattern of variability in X
- The average and median of Y
- The range and pattern of variability in Y
- The nature of the relationship between X and Y
- The strength of the relationship between X and Y
- The identification of any points that might be influential

For these data:

- The plot suggests a relationship between AGE and WT.
- A straight line might fit well, but another model might be better.
- We have adequate ranges of values for both AGE and WT.
- There are no outliers.

We might have gotten any of a variety of plots:

*[Figures: no relationship between X and Y; a linear relationship between X and Y; a non-linear relationship between X and Y; and three scatters each containing an outlying point (marked with an arrow). Depending on where the outlier falls, the fit of a linear model will yield an estimated slope that is spuriously non-zero, spuriously near zero, or spuriously high.]*
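The checklist above is easy to run in software. The notes use SAS throughout; as a cross-check, here is a small Python sketch of my own (not part of the original notes) computing the center, spread, and range items for the embryo data:

```python
from statistics import mean, median, stdev

# Chicken embryo data from the notes (Kleinbaum, Kupper, and Muller, 1988)
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]            # X = AGE
wt = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
      0.425, 0.738, 1.130, 1.882, 2.812]                  # Y = WT

# Center, spread, and range of each variable, as in the checklist
print(f"AGE: mean={mean(age):.1f} median={median(age):.1f} "
      f"sd={stdev(age):.2f} range={min(age)}-{max(age)}")
print(f"WT:  mean={mean(wt):.3f} median={median(wt):.3f} "
      f"sd={stdev(wt):.3f} range={min(wt)}-{max(wt)}")
```

Note that the mean weight (about 0.701) sits well above the median (0.261) — a first numeric hint, matching the plot, that weight grows faster than linearly with age.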
The bowl shape of our scatter plot suggests that perhaps a better model relates the logarithm of WT to AGE.

*[Figure: scatter plot of LOGWT (−1.6 to 0.5) versus AGE (6 to 16).]*

We'll investigate two models:

1. $WT = \beta_0 + \beta_1 \cdot AGE$
2. $LOGWT = \beta_0 + \beta_1 \cdot AGE$

Recall the meaning of the terms in a line $y = \beta_0 + \beta_1 x$:

- $\beta_0$ = "y-intercept" = the value of $y$ when $x = 0$;
- $\beta_1$ = "slope" = $\Delta y / \Delta x$ = the change in $y$ per unit change in $x$.

You might recall how to get a feel for the slope:

*[Figure: three lines illustrating slope > 0, slope = 0, and slope < 0.]*

**Definition of the straight line model** $Y = \beta_0 + \beta_1 X$:

| Population | Sample |
|---|---|
| $Y = \beta_0 + \beta_1 X + \varepsilon$ | $Y = \hat{\beta}_0 + \hat{\beta}_1 X + e$ |

- $Y = \beta_0 + \beta_1 X$ is the relationship in the population. It is measured with error $\varepsilon$ (measurement error). We do NOT know the values of $\beta_0$, $\beta_1$, or $\varepsilon$.
- $\hat{\beta}_0$, $\hat{\beta}_1$, and $e$ are our guesses of $\beta_0$, $\beta_1$, and $\varepsilon$; $e$ is called the residual. We do have the values of $\hat{\beta}_0$, $\hat{\beta}_1$, and $e$.
- The values of $\hat{\beta}_0$, $\hat{\beta}_1$, and $e$ are obtained by the method of least squares estimation.
- To see whether $\hat{\beta}_0 \approx \beta_0$ and $\hat{\beta}_1 \approx \beta_1$, we perform regression diagnostics. (This is not discussed in this course; see PUBHLTH 640, Intermediate Biostatistics.)

A little notation (sorry):

- $Y$ = the outcome or dependent variable
- $X$ = the predictor or independent variable
- $\mu_Y$ = the expected value of $Y$ for all persons in the population
- $\mu_{Y|X=x}$ = the expected value of $Y$ for the subpopulation for whom $X = x$
- $\sigma_Y^2$ = the variability of $Y$ among all persons in the population
- $\sigma_{Y|X=x}^2$ = the variability of $Y$ for the subpopulation for whom $X = x$

## 2. Estimation

We will use the method of least squares to obtain guesses of $\beta_0$ and $\beta_1$.

Goal: from the many possible lines through the scatter of points, choose the one line that is "closest" to the data.

What do we mean by "close"?

- We'd like the vertical distance between each observed $Y_i$ and its corresponding fitted $\hat{Y}_i$ to be as small as possible.
- It is not possible to choose $\hat{\beta}_0$ and $\hat{\beta}_1$ so as to minimize each individual distance $(Y_i - \hat{Y}_i)^2$ separately.
So instead we choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the total

$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 = \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$$

"Closest" is thus meant in the sense of least squares: $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen so that the sum of the squared vertical distances $\sum_{i=1}^{n} d_i^2$ is minimized. For each observed value $x_i$ we have an observed $y_i$ and the predicted value $\hat{y}_i$ on the line; the vertical distance is $d_i = y_i - \hat{y}_i$.

The expression to be minimized, $\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$, has a variety of names:

- residual sum of squares,
- sum of squares about the regression line,
- sum of squares due error, and
- SSE.

For the calculus lover: a little calculus yields the solution for the guesses $\hat{\beta}_0$ and $\hat{\beta}_1$. Consider $SSE = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$.

- Step 1: Differentiate with respect to $\hat{\beta}_1$. Set the derivative equal to 0 and solve.
- Step 2: Differentiate with respect to $\hat{\beta}_0$. Set the derivative equal to 0, insert $\hat{\beta}_1$, and solve.

For the non-calculus lover, here are the estimates of $\beta_0$ and $\beta_1$:

- $\beta_1$ is the slope; its estimate is denoted $\hat{\beta}_1$ or $b_1$.
- $\beta_0$ is the intercept; its estimate is denoted $\hat{\beta}_0$ or $b_0$.

Some very helpful preliminary calculations:

- $S_{xx} = \sum (X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2$
- $S_{yy} = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2$
- $S_{xy} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - n\bar{X}\bar{Y}$

Note: these expressions use the summation ("S") notation. In $S_{xy}$, the first subscript $x$ refers to the deviations $(X_i - \bar{X})$ and the second subscript $y$ refers to the deviations $(Y_i - \bar{Y})$.

Slope:

$$\hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\widehat{\mathrm{cov}}(X, Y)}{\widehat{\mathrm{var}}(X)} = \frac{S_{xy}}{S_{xx}}$$

Intercept:

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Prediction of $Y$:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X = b_0 + b_1 X$$

Do these estimates make sense?

- Slope: $\hat{\beta}_1 = S_{xy}/S_{xx}$ measures the linear movement in $Y$ with linear movement in $X$, relative to the variability in $X$.
  - $\hat{\beta}_1 = 0$ says: with a unit change in $X$, overall there is a 50–50 chance that $Y$ increases versus decreases.
  - $\hat{\beta}_1 \neq 0$ says: with a unit increase in $X$, $Y$ increases also ($\hat{\beta}_1 > 0$) or $Y$ decreases ($\hat{\beta}_1 < 0$).
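The slope and intercept formulas are easy to check numerically. The notes do this in SAS; the following is a minimal Python sketch of my own applying them to the LOGWT data:

```python
# Least squares estimates for the LOGWT-on-AGE fit, using the S-notation
# formulas: b1 = Sxy / Sxx and b0 = ybar - b1 * xbar.
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
logwt = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(age)
xbar = sum(age) / n
ybar = sum(logwt) / n

Sxx = sum((x - xbar) ** 2 for x in age)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(age, logwt))

b1 = Sxy / Sxx          # slope
b0 = ybar - b1 * xbar   # intercept

print(f"b1 = {b1:.5f}, b0 = {b0:.5f}")
```

The printed values reproduce the SAS estimates quoted in these notes (0.19589 and −2.68925).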
- Intercept: $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$. If the linear model is incorrect, or if the true model does not have a linear component, we obtain $\hat{\beta}_1 = 0$ and hence $\hat{\beta}_0 = \bar{Y}$ as our best guess of an unknown $Y$.

**Illustration in SAS.** Code:

```sas
data temp;
  input wt age logwt;
  label wt = "Weight (Y)";
cards;
0.029  6 -1.538
0.052  7 -1.284
0.079  8 -1.102
0.125  9 -0.903
0.181 10 -0.742
0.261 11 -0.583
0.425 12 -0.372
0.738 13 -0.132
1.130 14  0.053
1.882 15  0.275
2.812 16  0.449
;
run; quit;

proc reg data=temp simple;  /* option simple produces simple descriptives */
  title "Regression of Y=Weight on X=Age";
  model wt = age;
run; quit;
```

Partial listing of output:

```
                          Parameter Estimates

                               Parameter    Standard
Variable   Label         DF    Estimate     Error      t Value   Pr > |t|
Intercept  Intercept      1    -1.88453     0.52584    -3.58     0.0059
age        Age (X)        1     0.23507     0.04594     5.12     0.0006
```

Here −1.88453 is the estimated intercept $\hat{\beta}_0$ and 0.23507 is the estimated slope $\hat{\beta}_1$. The fitted line is therefore:

$$\widehat{WT} = -1.88453 + 0.23507 \cdot AGE$$

Let's overlay the fitted line on our scatterplot.

*[Figure: scatter plot of WT versus AGE with the fitted straight line overlaid.]*

- As we might have guessed, the straight line model may not be the best choice.
- The bowl shape of the scatter plot does have a linear component, however.
- Without the plot, we might have believed the straight line fit is okay.

Let's try a straight line model fit to Y = LOGWT versus X = AGE. Partial listing of output:

```
                          Parameter Estimates

                               Parameter    Standard
Variable   Label         DF    Estimate     Error      t Value   Pr > |t|
Intercept  Intercept      1    -2.68925     0.03064    -87.78    <.0001
age        Age (X)        1     0.19589     0.00268     73.18    <.0001
```

Thus the fitted line is:

$$\widehat{LOGWT} = -2.68925 + 0.19589 \cdot AGE$$

Now the scatterplot with the overlay of the fitted line looks much better.

*[Figure: scatter plot of LOGWT versus AGE with the fitted line overlaid.]*
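The same two lines of algebra reproduce the first (untransformed) fit as well. This Python sketch, my own check rather than part of the notes, recovers the SAS estimates for the WT-on-AGE model:

```python
# Check of the WT-on-AGE least squares fit against the SAS output in the notes.
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
wt = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
      0.425, 0.738, 1.130, 1.882, 2.812]

n = len(age)
xbar, ybar = sum(age) / n, sum(wt) / n
Sxx = sum((x - xbar) ** 2 for x in age)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(age, wt))

b1 = Sxy / Sxx           # slope: SAS reports 0.23507
b0 = ybar - b1 * xbar    # intercept: SAS reports -1.88453
print(f"WT-hat = {b0:.5f} + {b1:.5f} * AGE")
```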
### Now You Try: Prediction of Weight from Height

(Source: Dixon and Massey, 1969.)

| Individual | Height (X) | Weight (Y) |
|---:|---:|---:|
| 1 | 60 | 110 |
| 2 | 60 | 135 |
| 3 | 60 | 120 |
| 4 | 62 | 120 |
| 5 | 62 | 140 |
| 6 | 62 | 130 |
| 7 | 62 | 135 |
| 8 | 64 | 150 |
| 9 | 64 | 145 |
| 10 | 70 | 170 |
| 11 | 70 | 185 |
| 12 | 70 | 160 |

It helps to do the preliminary calculations:

- $\bar{X} = 63.833$, $\bar{Y} = 141.667$
- $\sum X_i^2 = 49{,}068$, $\sum Y_i^2 = 246{,}100$, $\sum X_i Y_i = 109{,}380$
- $S_{xx} = 171.667$, $S_{yy} = 5{,}266.667$, $S_{xy} = 863.333$

Slope:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{863.333}{171.667} = 5.0291$$

Intercept:

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 141.667 - 5.0291 \times 63.833 = -179.357$$
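A quick way to check your hand work: the following Python sketch (mine, not part of the notes) recomputes the preliminary sums and the estimates. Small differences in the intercept's last digits relative to the hand value (−179.357) come from rounding the slope and mean during the hand calculation.

```python
# Check of the "Now You Try" hand calculations (Dixon and Massey, 1969 data).
height = [60, 60, 60, 62, 62, 62, 62, 64, 64, 70, 70, 70]   # X
weight = [110, 135, 120, 120, 140, 130, 135, 150, 145, 170, 185, 160]  # Y

n = len(height)
# Calculator-friendly forms: Sxx = sum(x^2) - (sum x)^2 / n, etc.
Sxx = sum(x * x for x in height) - sum(height) ** 2 / n
Syy = sum(y * y for y in weight) - sum(weight) ** 2 / n
Sxy = sum(x * y for x, y in zip(height, weight)) - sum(height) * sum(weight) / n

b1 = Sxy / Sxx                              # slope, about 5.0291
b0 = sum(weight) / n - b1 * sum(height) / n # intercept, about -179.36
print(f"Sxx={Sxx:.3f} Syy={Syy:.3f} Sxy={Sxy:.3f} b1={b1:.4f} b0={b0:.4f}")
```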
## 3. The Analysis of Variance Table

In Topic 1 (Summarizing Data) we learned that the numerator of the sample variance of the $Y$ data is $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$. In regression settings, where $Y$ is the outcome variable, this same quantity is appreciated as the "total variance" of the $Y$'s. As we will see, other names for this are "total sum of squares", "total, corrected", and "SSY". (Note: "corrected" has to do with subtracting the mean before squaring.)

An analysis of variance table is all about partitioning the total variance of the $Y$'s (corrected) into two components:

1. Due residual: the variability of the individual $Y_i$ about the individual prediction $\hat{Y}_i$.
2. Due regression: the variability of the prediction $\hat{Y}_i$ about the overall mean $\bar{Y}$.

Aside: why are we interested in such a partition? We'd like to know whether, within the data, there exists the suggestion of a linear relationship — signal — that can be discerned from chance variability — noise.

1. Noise: the leftover variability of the observed $Y_i$ about the predicted $\hat{Y}_i$.
2. Signal: the explained variability of the predicted $\hat{Y}_i$ about the overall mean $\bar{Y}$.

Here is the partition. (Look closely and you'll see that both sides are the same.)

$$\left( Y_i - \bar{Y} \right) = \left( Y_i - \hat{Y}_i \right) + \left( \hat{Y}_i - \bar{Y} \right)$$

Some algebra (not shown) reveals a nice partition of the total variability:

$$\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 + \sum_{i=1}^{n} \left( \hat{Y}_i - \bar{Y} \right)^2$$

Total sum of squares = due error sum of squares + due model sum of squares.

A closer look:

- $(Y_i - \bar{Y})$ = the deviation of $Y_i$ from $\bar{Y}$ that is to be explained;
- $(\hat{Y}_i - \bar{Y})$ = "due model" (signal, systematic, due regression);
- $(Y_i - \hat{Y}_i)$ = "due error" (noise, residual).

In the world of regression analyses, we seek to explain the total variability $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$:

| What happens when $\hat{\beta}_1 \neq 0$? | What happens when $\hat{\beta}_1 = 0$? |
|---|---|
| A straight line relationship is helpful. | A straight line relationship is not helpful. |
| Best guess is $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$. | Best guess is $\hat{Y} = \hat{\beta}_0 = \bar{Y}$. |
| Due model is LARGE, because $\hat{Y}_i - \bar{Y} = \hat{\beta}_1 (X_i - \bar{X})$ is non-zero. | Due error is nearly the TOTAL, because $\hat{Y}_i = \bar{Y}$, so $\sum (Y_i - \hat{Y}_i)^2 \approx \sum (Y_i - \bar{Y})^2$. |
| Due error has to be small; (due model)/(due error) will be large. | Due regression has to be small; (due model)/(due error) will be small. |

How to partition the total variance:

1. The "total" or "total, corrected" refers to the variability of $Y$ about $\bar{Y}$: $\sum (Y_i - \bar{Y})^2$ is called the total sum of squares; its degrees of freedom are df = n − 1; division of the total sum of squares by its df yields the total mean square.
2. The "residual" or "due error" refers to the variability of $Y$ about $\hat{Y}$: $\sum (Y_i - \hat{Y}_i)^2$ is called the residual sum of squares; df = n − 2; division by its df yields the residual mean square.
3. The "regression" or "due model" refers to the variability of $\hat{Y}$ about $\bar{Y}$: $\sum (\hat{Y}_i - \bar{Y})^2$ is called the regression sum of squares; df = 1; division by its df yields the regression (or model) mean square. This is an example of a variance component.

| Source | df | Sum of Squares | Mean Square |
|---|---|---|---|
| Regression | 1 | $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ | SSR/1 |
| Error | n − 2 | $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ | SSE/(n − 2) |
| Total, corrected | n − 1 | $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ | |

Hint: mean square = (sum of squares)/df.

Be careful! The question we may ask from an analysis of variance table is a limited one: Does the fit of the straight line model explain a significant portion of the variability of the individual $Y$ about $\bar{Y}$? Is this better than using $\bar{Y}$ alone? We are NOT asking: Is the choice of the straight line model correct? Would another functional form be a better choice?
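The partition identity can be verified numerically for the LOGWT fit. This Python sketch is my own check, not part of the notes:

```python
# Verify SST = SSR + SSE for the LOGWT-on-AGE fit.
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
logwt = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(age)
xbar, ybar = sum(age) / n, sum(logwt) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(age, logwt))
      / sum((x - xbar) ** 2 for x in age))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in age]

SST = sum((y - ybar) ** 2 for y in logwt)              # total, corrected
SSE = sum((y - f) ** 2 for y, f in zip(logwt, yhat))   # due error
SSR = sum((f - ybar) ** 2 for f in yhat)               # due regression

print(f"SST={SST:.5f}  SSR={SSR:.5f}  SSE={SSE:.5f}")
```

For these data the three sums come out to about 4.22815, 4.22106, and 0.00709, and the partition SST = SSR + SSE holds exactly (up to floating point).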
We'll use a hypothesis test approach and the method of proof by contradiction:

- We begin with a null hypothesis that says $\beta_1 = 0$ (no linear relationship).
- Evaluation will focus on the comparison of the due regression mean square to the residual mean square.
- Recall that we reasoned the following: if $\beta_1 \neq 0$, then (due regression)/(due residual) will be LARGE; if $\beta_1 = 0$, then (due regression)/(due residual) will be SMALL.
- Our p-value calculation will answer the question: if $\beta_1 = 0$ truly, what are the chances of obtaining a value of (due regression)/(due residual) as large or larger than that observed?

To calculate chances we need a probability model. So far, we have not needed one.

## 4. Assumptions for a Straight Line Regression Analysis

In performing least squares estimation we did not use a probability model; we were doing geometry. Hypothesis testing requires some assumptions and a probability model. Assumptions:

- The separate observations $Y_1, Y_2, \ldots, Y_n$ are independent.
- The values of the predictor variable $X$ are fixed and measured without error.
- For each value of the predictor variable $X = x$, the distribution of values of $Y$ follows a normal distribution with mean equal to $\mu_{Y|X=x}$ and common variance equal to $\sigma_{Y|x}^2$.
- The separate means $\mu_{Y|X=x}$ lie on a straight line; that is, $\mu_{Y|X=x} = \beta_0 + \beta_1 x$.

Here is what the situation looks like (courtesy of Stan Lemeshow):

*[Figure: for each value of $x$, the values of $y$ are normally distributed around $\mu_{y|x}$ on the line, with the same variance for all values of $x$ but different means. Here $\sigma_{Y|X_1}^2 = \sigma_{Y|X_2}^2 = \sigma_{Y|X_3}^2 = \sigma_{Y|X}^2$.]*

With these assumptions, we can assess the significance of the variance explained by the model using $F = \dfrac{\text{msq(model)}}{\text{msq(residual)}}$ with df = 1, n − 2:

| | $\beta_1 = 0$ | $\beta_1 \neq 0$ |
|---|---|---|
| Due model (MSR) has expected value | $\sigma_{Y|X}^2$ | $\sigma_{Y|X}^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$ |
| Due residual (MSE) has expected value | $\sigma_{Y|X}^2$ | $\sigma_{Y|X}^2$ |
| So F = MSR/MSE will be | close to 1 | LARGER than 1 |
We obtain the analysis of variance table for the model of Y = LOGWT on X = AGE. The following is the SAS output:

```
                        Analysis of Variance

                              Sum of       Mean
Source            DF         Squares      Square      F Value    Pr > F
Model              1         4.22106      4.22106     5355.60    <.0001
Error              9         0.00709   0.00078816
Corrected Total   10         4.22815

Root MSE            0.02807     R-Square    0.9983
Dependent Mean     -0.53445     Adj R-Sq    0.9981
Coeff Var          -5.25286
```

Annotations: F Value = MSQ(regression)/MSQ(residual); R-Square = SSQ(regression)/SSQ(total); Adj R-Sq is R² adjusted for n and the number of predictors.

This output corresponds to the following:

| Source | df | Sum of Squares | Mean Square |
|---|---|---|---|
| Regression | 1 | SSR = 4.22106 | 4.22106 |
| Error | n − 2 = 9 | SSE = 0.00709 | 7.8816 × 10⁻⁴ |
| Total, corrected | n − 1 = 10 | SST = 4.22815 | |

Other information in this output:

- R-SQUARED = (sum of squares regression)/(sum of squares total) is the proportion of the total that we have been able to explain with the fit of the straight line model. Be careful: as predictors are added to the model, R-SQUARED can only increase. Eventually we need to adjust this measure to take this into account; see ADJUSTED R-SQUARED.
- We also get an overall F test of the null hypothesis that the simple linear model does not explain significantly more variability in LOGWT than the average LOGWT:

$$F = \frac{\text{MSQ(regression)}}{\text{MSQ(residual)}} = \frac{4.22106}{0.00078816} = 5355.60 \quad \text{with df} = 1, 9$$

The achieved significance is < 0.0001. Reject H0; conclude that the fitted line is a significant improvement over the average LOGWT.

## 5. Hypothesis Testing

Straight line model: $Y = \beta_0 + \beta_1 X$.

1. Overall F-test
2. Test of the slope
3. Test of the intercept

### 1. Overall F-Test

Research question: Does the fitted model — the $\hat{Y}$ — explain significantly more of the total variability of the $Y$ about $\bar{Y}$ than does $\bar{Y}$ alone?

Assumptions: as before.

H0 and HA:

$$H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0$$

Test statistic:

$$F = \frac{\text{msq(regression)}}{\text{msq(residual)}} \quad \text{with df} = 1, n - 2$$

Evaluation rule: when the null hypothesis is true, the value of F should be close to 1; alternatively, when $\beta_1 \neq 0$, the value of F will be LARGER than 1. Thus our p-value calculation answers: what are the chances of obtaining our value of the F, or one that is larger, if we believe the null hypothesis that $\beta_1 = 0$?
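The F statistic and R-squared reported in the ANOVA output can be reproduced from the raw data. This Python sketch is my own cross-check; tiny differences from SAS in the last digits are rounding in the three-decimal data:

```python
# Reproduce the overall F statistic and R-squared for the LOGWT-on-AGE fit.
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
logwt = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(age)
xbar, ybar = sum(age) / n, sum(logwt) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(age, logwt))
      / sum((x - xbar) ** 2 for x in age))
b0 = ybar - b1 * xbar

SST = sum((y - ybar) ** 2 for y in logwt)
SSE = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, logwt))
SSR = SST - SSE

F = (SSR / 1) / (SSE / (n - 2))   # SAS reports 5355.60
R2 = SSR / SST                    # SAS reports 0.9983
print(f"F = {F:.2f} (df 1,{n - 2}), R-squared = {R2:.4f}")
```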
Calculations: for our data we obtain the p-value

$$p = \Pr\left( F_{1,9} \geq \frac{\text{msq(model)}}{\text{msq(residual)}} \;\middle|\; \beta_1 = 0 \right) = \Pr(F_{1,9} \geq 5355.60) < 0.0001$$

Evaluate: under the null hypothesis that $\beta_1 = 0$, the chances of obtaining a value of F so far away from its expected value of 1 — a value of 5355.60 — are less than 1 chance in 10,000. This is a very small likelihood.

Interpret: we have learned that, at the least, the fitted straight line model does a much better job of explaining the variability in LOGWT than a model that allows only for the average LOGWT. (Later — PUBHLTH 640, Intermediate Biostatistics — we'll see that the analysis does not stop here.)

### 2. Test of the Slope, β1

Some interesting notes:

- The overall F test and the test of the slope are equivalent.
- The test of the slope uses a t-score approach to hypothesis testing.
- It can be shown that (t-score for slope)² = overall F.

Research question: Is the slope $\beta_1 = 0$?

Assumptions: as before.

H0 and HA:

$$H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0$$

Test statistic: to compute the t-score, we need an estimate of the standard error of $\hat{\beta}_1$:

$$\widehat{SE}(\hat{\beta}_1) = \sqrt{\frac{\text{msq(residual)}}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

Our t-score is therefore:

$$t = \frac{\text{observed} - \text{expected}}{\widehat{SE}} = \frac{\hat{\beta}_1 - 0}{\widehat{SE}(\hat{\beta}_1)} \quad \text{with df} = n - 2$$

We can find this information in our output:

```
                          Parameter Estimates

                               Parameter    Standard
Variable   Label         DF    Estimate     Error      t Value   Pr > |t|
Intercept  Intercept      1    -2.68925     0.03064    -87.78    <.0001
age        Age (X)        1     0.19589     0.00268     73.18    <.0001
```

Here t Value = Estimate/(Standard Error); for the slope, 0.19589/0.00268 ≈ 73.1, and SAS's more precise internal values give 73.18.

Recall what we mean by a t-score: t = 73.18 says the estimated slope is 73.18 standard error units away from its expected value of zero. Check that (t-score)² = overall F: 73.18² = 5355.3, which matches F = 5355.60 up to rounding.

Evaluation rule: when the null hypothesis is true, the value of t should be close to zero; alternatively, when $\beta_1 \neq 0$, the value of t will be DIFFERENT from 0. Here our p-value calculation answers: what are the chances of obtaining our value of the t, or one farther away from 0, if we believe the null hypothesis that $\beta_1 = 0$?
Calculations: for our data we obtain the p-value

$$p = 2 \Pr\left( t_{n-2} \geq \frac{\hat{\beta}_1 - 0}{\widehat{se}(\hat{\beta}_1)} \right) = 2 \Pr(t_9 \geq 73.18) \ll 0.0001$$

Evaluate: under the null hypothesis that $\beta_1 = 0$, the chances of obtaining a t-score value 73.18 or more standard error units away from the expected value of 0 are less than 1 chance in 10,000.

Interpret: the inference is the same as that for the overall F test. The fitted straight line model does a much better job of explaining the variability in LOGWT than the sample mean.

### 3. Test of the Intercept, β0

This pertains to whether or not the straight line relationship passes through the origin. It is rarely of interest.

Research question: Is the intercept $\beta_0 = 0$?

Assumptions: as before.

H0 and HA:

$$H_0: \beta_0 = 0 \qquad H_A: \beta_0 \neq 0$$

Test statistic: to compute the t-score for the intercept, we need an estimate of the standard error of $\hat{\beta}_0$:

$$\widehat{SE}(\hat{\beta}_0) = \sqrt{\text{msq(residual)} \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]}$$

Our t-score is therefore:

$$t = \frac{\hat{\beta}_0 - 0}{\widehat{SE}(\hat{\beta}_0)} \quad \text{with df} = n - 2$$

From the same SAS output, t Value = Estimate/(Standard Error) = −2.68925/0.03064 = −87.78. This says the estimated intercept is 87.78 standard error units away from its expected value of zero.

Evaluation rule: when the null hypothesis is true, the value of t should be close to zero; when $\beta_0 \neq 0$, the value of t will be DIFFERENT from 0. Our p-value calculation answers: what are the chances of obtaining our value of the t, or one farther away from 0, if we believe the null hypothesis that $\beta_0 = 0$?

Calculations:

$$p = 2 \Pr\left( t_{n-2} \geq \left| \frac{\hat{\beta}_0 - 0}{\widehat{se}(\hat{\beta}_0)} \right| \right) = 2 \Pr(t_9 \geq 87.78) \ll 0.0001$$
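Both standard error formulas in this section can be applied directly to the data. The Python sketch below, my own check rather than part of the notes, reproduces the SAS Standard Error and t Value columns:

```python
import math

# Reproduce the standard errors and t statistics for the LOGWT-on-AGE fit.
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
logwt = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(age)
xbar, ybar = sum(age) / n, sum(logwt) / n
Sxx = sum((x - xbar) ** 2 for x in age)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(age, logwt)) / Sxx
b0 = ybar - b1 * xbar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, logwt))
mse = sse / (n - 2)                                   # msq(residual)

se_b1 = math.sqrt(mse / Sxx)                          # SAS: 0.00268
se_b0 = math.sqrt(mse * (1 / n + xbar ** 2 / Sxx))    # SAS: 0.03064
t_b1, t_b0 = b1 / se_b1, b0 / se_b0                   # SAS: 73.18 and -87.78
print(f"se(b1)={se_b1:.5f} t={t_b1:.2f};  se(b0)={se_b0:.5f} t={t_b0:.2f}")
```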
Evaluate: under the null hypothesis that $\beta_0 = 0$, the chances of obtaining a t-score value 87.78 or more standard error units away from the expected value of 0 are less than 1 chance in 10,000.

Interpret: the inference is that the straight line relationship between Y = LOGWT and X = AGE does not pass through the origin.

## 6. Confidence Interval Estimation

Straight line model: $Y = \beta_0 + \beta_1 X$.

Recall (Topic 6, Estimation) that there are three elements of a confidence interval:

1. the best single guess (estimate);
2. the standard error of the best single guess, SE(estimate);
3. the confidence coefficient. These will be percentiles from the t distribution with df = n − 2. For a 95% confidence interval this will be a 97.5th percentile; for a $(1 - \alpha) \cdot 100\%$ confidence interval this will be a $(1 - \alpha/2) \cdot 100$th percentile.

The generic form of a confidence interval (straight line model) is then:

- Lower limit = estimate − (confidence coefficient) × SE(estimate)
- Upper limit = estimate + (confidence coefficient) × SE(estimate)

We might want confidence interval estimates of the following four parameters: (1) the slope; (2) the intercept; (3) the mean of the subset of the population for whom $X = x_0$; (4) the individual response for a person for whom $X = x_0$.

1. SLOPE: estimate = $\hat{\beta}_1$,

$$\widehat{se}(\hat{\beta}_1) = \sqrt{\text{msq(residual)} \cdot \frac{1}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

2. INTERCEPT: estimate = $\hat{\beta}_0$,

$$\widehat{se}(\hat{\beta}_0) = \sqrt{\text{msq(residual)} \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]}$$

3. MEAN at $X = x_0$: estimate = $\hat{Y}_{x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$,

$$\widehat{se}(\hat{Y}_{x_0}) = \sqrt{\text{msq(residual)} \left[ \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]}$$

4. INDIVIDUAL with $X = x_0$: estimate = $\hat{Y}_{x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$,

$$\widehat{se}(\hat{Y}_{x_0}) = \sqrt{\text{msq(residual)} \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]}$$

Illustration for the model which fits Y = LOGWT to X = AGE. Recall that we obtained the fit $\hat{\beta}_0 = -2.68925$ (SE 0.03064) and $\hat{\beta}_1 = 0.19589$ (SE 0.00268).

95% confidence interval for the slope, β1:

1. Best single guess (estimate): $\hat{\beta}_1 = 0.19589$.
2. Standard error of the best single guess: $\widehat{se}(\hat{\beta}_1) = 0.00268$.
3. Confidence coefficient: the 97.5th percentile of Student's t, $t_{.975,\, df=9} = 2.26$.
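These three elements combine through the generic form "estimate ± coefficient × SE". Here is a Python sketch of my own for the slope interval (the same pattern covers the other three cases). Python's standard library has no t-quantile function, so the critical value 2.262 (97.5th percentile of t with 9 df, from a t table) is entered as a constant:

```python
import math

# 95% CI for the slope of the LOGWT-on-AGE fit:
#   estimate +/- (confidence coefficient) * SE(estimate)
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
logwt = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(age)
xbar, ybar = sum(age) / n, sum(logwt) / n
Sxx = sum((x - xbar) ** 2 for x in age)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(age, logwt)) / Sxx
b0 = ybar - b1 * xbar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, logwt)) / (n - 2)

t975_df9 = 2.262                 # 97.5th percentile of t, 9 df (t table)
se_b1 = math.sqrt(mse / Sxx)
lo, hi = b1 - t975_df9 * se_b1, b1 + t975_df9 * se_b1
print(f"95% CI for the slope: ({lo:.4f}, {hi:.4f})")
```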
The 95% confidence interval for the slope $\beta_1$ is then:

$$\text{estimate} \pm (\text{confidence coefficient}) \times SE = 0.19589 \pm (2.26)(0.00268) = (0.1898,\ 0.2019)$$

95% confidence interval for the intercept, β0:

1. Best single guess (estimate): $\hat{\beta}_0 = -2.68925$.
2. Standard error of the best single guess: $\widehat{se}(\hat{\beta}_0) = 0.03064$.
3. Confidence coefficient: $t_{.975,\, df=9} = 2.26$.

The 95% confidence interval for the intercept $\beta_0$ is:

$$-2.68925 \pm (2.26)(0.03064) = (-2.7585,\ -2.6200)$$

**Confidence intervals for predictions.** Code:

```sas
proc reg data=temp alpha=0.05;  /* alpha=0.05 is the type I error */
  title "Regression of Y=Weight on X=Age";
  model wt = age / cli clm;     /* cli for individual, clm for mean */
run; quit;
```

Output (the dependent variable here is LOGWT):

```
                              Output Statistics

     Dependent  Predicted  Std Error
Obs   Variable      Value  Mean Predict    95% CL Mean        95% CL Predict     Residual
  1    -1.5380    -1.5139    0.0158     -1.5497  -1.4781   -1.5868  -1.4410      -0.0241
  2    -1.2840    -1.3180    0.0136     -1.3489  -1.2871   -1.3886  -1.2474       0.0340
  3    -1.1020    -1.1221    0.0117     -1.1485  -1.0957   -1.1909  -1.0534       0.0201
  4    -0.9030    -0.9262    0.0100     -0.9489  -0.9036   -0.9937  -0.8588       0.0232
  5    -0.7420    -0.7303    0.008878   -0.7504  -0.7103   -0.7970  -0.6637      -0.0117
  6    -0.5830    -0.5345    0.008465   -0.5536  -0.5153   -0.6008  -0.4681      -0.0485
  7    -0.3720    -0.3386    0.008878   -0.3586  -0.3185   -0.4052  -0.2720      -0.0334
  8    -0.1320    -0.1427    0.0100     -0.1653  -0.1200   -0.2101  -0.0752       0.0107
  9     0.0530     0.0532    0.0117      0.0268   0.0796   -0.0156   0.1220    -0.000218
 10     0.2750     0.2491    0.0136      0.2182   0.2800    0.1785   0.3197       0.0259
 11     0.4490     0.4450    0.0158      0.4092   0.4808    0.3721   0.5179       0.0040
```

## 7. Introduction to Correlation

Definition of correlation: a correlation coefficient is a measure of the association between two paired random variables (e.g., height and weight). The Pearson product moment correlation, in particular, is a measure of the strength of the straight line relationship between the two random variables. Another correlation measure (not discussed here) is the Spearman correlation, a measure of the strength of a monotone (increasing or decreasing) relationship between the two random variables. The Spearman correlation is a non-parametric (meaning model-free) measure; it is introduced in PUBHLTH 640, Intermediate Biostatistics.
Formula for the Pearson product moment correlation ρ:

- The population parameter designation is rho, written as $\rho$.
- The estimate of $\rho$ based on information in a sample is represented using $r$.

Some preliminaries. Suppose we are interested in the correlation between $X$ and $Y$:

$$\widehat{\mathrm{cov}}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} = \frac{S_{xy}}{n - 1} \quad \text{(the covariance of X and Y)}$$

$$\widehat{\mathrm{var}}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{S_{xx}}{n - 1}, \qquad \widehat{\mathrm{var}}(Y) = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1} = \frac{S_{yy}}{n - 1}$$

Formula for the estimate of the Pearson product moment correlation from a sample:

$$r = \hat{\rho} = \frac{\widehat{\mathrm{cov}}(X, Y)}{\sqrt{\widehat{\mathrm{var}}(X)\,\widehat{\mathrm{var}}(Y)}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

If you absolutely have to do it by hand, an equivalent, more calculator-friendly formula is:

$$r = \frac{\sum X_i Y_i - \frac{\left(\sum X_i\right)\left(\sum Y_i\right)}{n}}{\sqrt{\left[ \sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n} \right]\left[ \sum Y_i^2 - \frac{\left(\sum Y_i\right)^2}{n} \right]}}$$

- The correlation $r$ can take on values between −1 and +1 only.
- Thus the correlation coefficient is said to be dimensionless: it is independent of the units of $X$ and $Y$.
- The sign of the correlation coefficient (positive or negative) = the sign of the estimated slope.

There is a relationship between the slope of the straight line, $\hat{\beta}_1$, and the estimated correlation $r$. Because

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}},$$

a little algebra reveals that

$$r = \hat{\beta}_1 \sqrt{\frac{S_{xx}}{S_{yy}}}$$

Thus, beware: it is possible for a very large (positive or negative) $r$ to accompany a quite modest slope, and vice versa, inasmuch as:

- a very large $r$ might reflect a very large $S_{xx}$, all other things equal; and
- a very large $r$ might reflect a very small $S_{yy}$, all other things equal.

## 8. Hypothesis Test of Correlation

The null hypothesis of zero correlation is equivalent to the null hypothesis of zero slope.

Research question: Is the correlation $\rho = 0$? Is the slope $\beta_1 = 0$?

Assumptions: as before.

H0 and HA:

$$H_0: \rho = 0 \qquad H_A: \rho \neq 0$$

Test statistic: a little algebra (not shown) yields a very nice formula for the t-score that we need:

$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \quad \text{with df} = n - 2$$

We can find this information in our output. Recall the first example and the model of Y = LOGWT on X = AGE.
The Pearson correlation $r$ is $\pm\sqrt{R\text{-squared}}$ from the output, taking the sign of the estimated slope — positive here:

```
Root MSE            0.02807     R-Square    0.9983
Dependent Mean     -0.53445     Adj R-Sq    0.9981
Coeff Var          -5.25286
```

Pearson correlation: $r = +\sqrt{0.9983} = 0.9991$.

Substitution into the formula for the t-score yields

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.9991\sqrt{9}}{\sqrt{1 - 0.9983}} = \frac{2.9974}{0.0412} \approx 72.7$$

Note: the value 0.9991 in the numerator is $r = \sqrt{R^2} = \sqrt{0.9983}$. This t-score is very close to the value (73.18) obtained for testing the null hypothesis of zero slope. The discrepancy is rounding error: the hand calculation here uses 4 significant digits, while SAS uses more.
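Carrying full precision removes that rounding discrepancy entirely, since the correlation t-test and the slope t-test are algebraically identical. This closing Python sketch (mine, not part of the notes) computes $r$ and its t-score from the raw data:

```python
import math

# Pearson correlation for the LOGWT-on-AGE data and its t test:
#   t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2.
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
logwt = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(age)
xbar, ybar = sum(age) / n, sum(logwt) / n
Sxx = sum((x - xbar) ** 2 for x in age)
Syy = sum((y - ybar) ** 2 for y in logwt)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(age, logwt))

r = Sxy / math.sqrt(Sxx * Syy)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(f"r = {r:.4f}, t = {t:.2f}")
```

At full precision the t-score agrees with the slope test's 73.18, as the equivalence of the two tests requires.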
