STATISTICAL METHODS II
STATISTICAL METHODS II BIOS 544
Virginia Commonwealth University
This 25 page Class Notes was uploaded by Priscilla Rau on Wednesday October 28, 2015. The Class Notes belongs to BIOS 544 at Virginia Commonwealth University taught by Alvin Best in Fall. Since its upload, it has received 18 views. For similar materials see /class/230638/bios-544-virginia-commonwealth-university in Biostatistics at Virginia Commonwealth University.
Date Created: 10/28/15
BIOS 544 Section 8
From Correlation and Regression to Multiple Regression

Dataset used: The Son et al. (2002) Korean adult child caregivers subset is Caregivers.JMP.

Table of contents
  Overview
  Scatterplots and Correlation
  Simple Linear Regression
  Slope-intercept
  Least squares
  Variance accounted for
  Correlation and RSquare
  Assumptions
  Correlation or Regression
  Step by Step
    Phase 1: State the Question
    Phase 2: Decide How to Answer the Question
    Phase 3: Answer the Question
    Phase 4: Communicate the Answer to the Question
  JMP Notes
  Exercises

Overview

We're heading for multiple regression. Multiple regression extends regression, and recall that regression is the same as correlation. Regression and correlation are simply a way of looking at relationships between two continuous variables. One of the two variables is Y, the response, and the other is X, the predictor. Multiple regression extends this in the following way: we still have one Y, but in multiple regression we have more than one X (multiple predictors).

But first we must do some review. We're going to begin by looking at relationships between two continuous variables. We'll start with scatterplots and correlation. When the relationship between the two variables follows a straight line, there are three ways to think about it: by using a correlation coefficient as a descriptive statistic, by looking at the significance of the slope, and by comparing models. These three points of view all cast light, from different directions, on something that turns out to be a very general way to look at modeling the relationship between variables.

Remember: with multiple Xs, things will get more complex, so you'll need to be firmly grounded in the simple case of one Y and one X.

Preparation
• Daniel's Chapter 9: Simple Linear Regression and Correlation
• Daniel's Chapter 10: Multiple Regression and Correlation

The Example Data

Throughout we use the data in Son, Wykle & Zauszniewski's (2002) Korean adult child caregivers of
older adults with dementia, Journal of Gerontological Nursing, January, 19-28. A subset (100 of N = 117 observations) appears online as Caregivers.JMP. Note that we've added rows with no data to the data table. They are excluded from analysis and are used for prediction in a later section.

Scatterplots and Correlation

We use scatterplots to show the relationship between two continuous variables measured on the same individuals. The values of one variable appear on the horizontal axis (the X variable in JMP) and the values of the other variable appear on the vertical axis (the Y variable in JMP). Each individual appears as a point.

To interpret a scatterplot, look first at the overall pattern. The figure below shows a positive linear relationship.

[Figure: Bivariate Fit of Burden By Memory/behavior problems (scatterplot with Bivariate Normal Ellipse, P = 0.950)]

Correlation
Variable                    Mean     Std Dev    Correlation   Signif Prob   Number
Memory/behavior problems    26.301   15.29684   0.502062      <.0001        100
Burden                      69.24    20.05755

Looking for a relationship

If there is a relationship, there are three things to look for:
• The direction of the relationship
• The strength of the relationship
• The form of the relationship

Copyright Al Best, 18 December 2008. All rights reserved.

Interpretation of the r statistic

The term correlation refers to the strength and direction of a linear relationship. Here are the basic facts you need to know to interpret a correlation, r:
• It makes no difference which variable you call X and which you call Y; you'll get the same value for r.
• The units of X and the units of Y don't matter. The correlation r has no unit of measurement; it's just a number, always between -1 and 1.
• A positive r indicates a positive relationship: as one variable's values increase, the other variable's values also increase.
• A negative r indicates a negative relationship: as one variable's values increase, the other variable's values decrease.
• A value near zero indicates a very weak linear relationship: as one
variable's values increase, we have no idea what the other variable's values will do.
• The number only indicates the strength and direction of a straight-line (linear) relationship. Correlation does not describe a curved relationship, no matter how strong it is.
• The p-value from a correlation does not test whether the relationship is best described by a line.
• Just like the mean and standard deviation, correlation can be strongly affected by outliers.

Outliers

Here is an example of outliers: r = -0.03 (p-value = 0.74) or r = -0.24 (p-value = 0.017).

[Figure: scatterplot of Burden by income, with one outlying income value]

Choose one summary statement:
• There is no association between income and caregiving burden (r = -0.03, n = 100).
• There is a negative relationship (r = -0.24, n = 99) after excluding the outlier value.

Moral of the story: Always (every time, really) look at your data. No kidding.

Form of the Relationship

Regression and correlation always assume that the form of the relationship is linear. The Pearson correlation coefficient r only indicates the strength and direction of a straight-line (linear) relationship. Correlation does not describe a curved relationship, no matter how strong it is.

Rule: There is no substitute for plotting the data.

Simple Linear Regression

Regression refers to the situation when we have continuous measurements on two variables and we wish to use one to predict the other. For instance:

Research Question 1: "What are the effects of care recipient impairments (e.g., physical dependency in ADLs, memory and behavioral problems, cognitive impairments) and caregivers' perceived health on caregiver burden?" (page 24)

Distributions

First, as always, look at the data.

Quantiles
                   ADL      Memory/behavior   Cognitive     Burden
                            problems          impairment
100.0% maximum     90.000   66.000            27.000        115.00
 99.5%             90.000   66.000            27.000        115.00
 97.5%             90.000   64.475            27.000        112.00
 90.0%             88.000   49.900            23.000
 75.0% quartile    78.500   34.000            19.000        86.50
 50.0% median      56.000   24.000            15.000        69.50
 25.0% quartile    42.000   14.000             7.000        52.25
 10.0%                       9.000             2.100
  2.5%             25.050    4.000             0.000        29.58
  0.5%             22.000    3.000             0.000        28.00
  0.0% minimum     22.000    3.000             0.000        28.00

Moments
                   ADL         Memory/behavior   Cognitive     Burden
                               problems          impairment
Mean               57.85       26.301            13.69         69.24
Std Dev            20.159653   15.296844         7.7299039     20.057554
Std Err Mean       2.0159653   1.5296844         0.7729904     2.0057554
Upper 95% Mean     61.850113   29.336226         15.223781     73.219854
Lower 95% Mean     53.849887   23.265774         12.156219     65.260146
N                  100         100               100           100

Scatterplot

Looking at the relationship between burden and memory problems, consider:
• Does higher or lower burden seem to go with higher or lower memory problems?
• Is the correlation between burden and memory zero?
• Fitting a straight line through the data, is the slope zero? Is the line flat?
• Does knowing the level of memory problems allow us to predict the level of burden beyond just a chance level?

It turns out that these are all the same question. If we answer one, we've answered the others.

[Figure: Bivariate Fit of Burden By Memory/behavior problems (scatterplot; vertical axis Burden, horizontal axis Memory/behavior problems)]

Informally, we'd venture a guess that there may be a moderate relationship between the memory/behavioral problems and the caregiver burden. It's not a strong relationship, but clearly a high level of memory problems seems to go with a high burden, and vice versa. So we've answered one of our
questions: Does higher or lower burden seem to go with higher or lower memory problems? Let's look at each of the other forms of the question.

Slope-intercept

One of the assumptions we are making is that a straight-line relationship makes sense. If it does not, then there are alternatives. But in this case, linearity does seem to make sense. That is, the form of the relationship is:

$$Y = \text{intercept} + \text{slope} \cdot X + \text{error}$$

This is a straight line that crosses the vertical (Y) axis at the intercept and has a trend indicated by the slope. The slope is the average increase in Y for every unit increase in X. For example, the best-fitting line for the example data is shown below.

[Figure: Bivariate Fit of Burden By Memory/behavior problems (scatterplot with Linear Fit)]

Linear Fit
Burden = 51.925681 + 0.6583141 × Memory/behavior problems

Summary of Fit
RSquare                       0.252066
RSquare Adj                   0.244434
Root Mean Square Error        17.43469
Mean of Response              69.24
Observations (or Sum Wgts)    100

Parameter Estimates
Term                        Estimate     Std Error   t Ratio   Prob>|t|
Intercept                   51.925681    3.480878    14.92     <.0001
Memory/behavior problems    0.6583141    0.11455     5.75      <.0001

See the Parameter Estimates report for the estimated intercept and slope. The line intercepts the vertical axis at about 51.9. That is, if there were zero memory problems, then the line predicts that the Burden is 51.9. The slope is approximately 0.66. That is, for every increase of one unit in Memory/behavior problems, the line predicts that there will be an increase of 0.66 units of Burden.

Testing the Slope

One of our questions was: When fitting a straight line through the data, is the slope zero? The Parameter Estimates report answers this. The estimated slope is approximately 0.66 (SE = 0.11) and it is significantly different from zero (t = 5.75, df = 98, p-value < 0.0001). Recall df = N - 2.

It's proper to make this inference if all the assumptions of the model are met. We haven't covered these assumptions yet. But we will.
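The slope test above can be re-derived by hand from the Parameter Estimates report. The following is a minimal sketch, not JMP output: it only redoes the arithmetic using the estimates quoted in the report. The variable names and the helper `predict_burden` are made up for illustration.

```python
# Estimates copied from the Parameter Estimates report above.
intercept = 51.925681
slope = 0.6583141
se_slope = 0.11455
n = 100  # observations used in the fit


def predict_burden(memory_problems):
    """Height of the fitted line at a given Memory/behavior problems score.

    Hypothetical helper, named here only for illustration.
    """
    return intercept + slope * memory_problems


# t ratio for H0: slope = 0, with df = N - 2.
t_ratio = (slope - 0) / se_slope
df = n - 2

print(f"t = {t_ratio:.2f} on {df} df")
print(f"predicted Burden at 0 problems:  {predict_burden(0):.1f}")
print(f"predicted Burden at 10 problems: {predict_burden(10):.1f}")
```

Raising the predictor by one unit raises the prediction by exactly the slope, which is the interpretation given above. The p-value itself comes from comparing the t ratio to a t distribution with 98 df, which this sketch leaves to the software.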
JMP notes

To look at relationships between two continuous variables:
• Choose Analyze > Fit Y by X and identify the X and Y columns in the dialog. A dot plot will appear. Making choices in the Bivariate popup below the figure modifies the default scatterplot.
• To fit a straight line through the data, choose Bivariate ▼ Fit Line. A new popup (Linear Fit) will appear below the figure.
• To remove the line from the scatterplot, choose Linear Fit ▼ Remove Fit.

Least squares

Recall that we are assuming the form of the relationship between MemoryProb and Burden is linear. This means that the generic equation describing the relationship is:

$$Y = \text{intercept} + \text{slope} \cdot X + \text{error}$$

We estimate the intercept and slope, and then we can predict Y by the vertical height of the line. The predicted value is denoted $\hat{Y}$ (say "Y hat"):

$$\hat{Y} = \text{intercept}_{\text{estimate}} + \text{slope}_{\text{estimate}} \cdot X$$

or

predicted Burden = 51.926 + 0.658 × MemoryProb

Consider how this works for one observation, the first. Memory problems in obs 1 is 4.

[Data table excerpt: first row of Caregivers.JMP]

The observed Burden value for obs 1 is 28. This is the value that we're trying to predict using a line.

[Figure: scatterplot of Burden by Memory/behavioral problems with the fitted line]

predicted Burden = 51.926 + 0.658 × MemoryProb = 51.926 + 0.658 × 4 = 54.56

The difference between the observed value of Y and $\hat{Y}$ is called the error or the residual:

$$Y - \hat{Y} = \text{error}$$

Since the actual Y value was 28, the error is 28 - 54.56 = -26.56.

[Figure: scatterplot of Burden by Memory/behavioral problems showing the residual for the first observation]

Least Squares: How do we choose the best-fitting line? The slope and intercept are chosen so that the sum of the squared errors is as small as possible. No other slope and intercept will give a smaller squared error.

Variance accounted for

Here is an important descriptor: the squared correlation is the proportion of variance accounted for. That is, r² (say "R squared") tells us how much of the variance in the response variable is accounted
for by the explanatory variable. The histogram and moments reports for the response variable (Burden), the predicted values, and the errors are shown below.

Distributions: Moments
                   Burden       Predicted Burden   Residuals Burden
Mean               69.24        69.24              4.832e-15
Std Dev            20.057554    10.070128          17.346411
Std Err Mean       2.0057554    1.0070128          1.7346411
Upper 95% Mean     73.219854    71.238132          3.4419043
Lower 95% Mean     65.260146    67.241868          -3.441904
N                  100          100                100

Notice that there is more spread in Burden than in Pred Burden. That is, the estimated SD of the Y variable is 20.1 and the estimated SD of $\hat{Y}$ is 10.1. And the SD of error is 17.3. These SDs appear in the first row of the table below. Recall that variance is the squared SD; these values appear in the second row. Notice that the variance of the observed Burden values (≈402) is equal to the variance of the predicted values (≈101) plus the variance of the error values (≈301). That is, the predicted values accounted for some fraction of the total variance in the observed values.

             Observed    Predicted    Error      Total
SD           20.058      10.070       17.346
Variance     402.305     101.407      300.898    402.305

Ratio: 101.407 / 402.305 = r² = 0.2521 ("proportion of variance accounted for")
r = 0.5021 ("correlation")

The ratio of the predicted to observed variance is the proportion of the variance in the response variable that is accounted for by the explanatory variable. This ratio is 0.252 here. That is, r² = 0.25.

Correlation and RSquare

The RSquare value is already available in the Linear Fit report. To calculate the correlation we could take the square root of RSquare, or use JMP to estimate the correlation directly. One of our questions was: Is there a correlation between memory problems and Burden? The Bivariate Fit report answers that question.

[Figure: Bivariate Fit of Burden By Memory/behavior problems (scatterplot with Linear Fit and Bivariate Normal Ellipse, P = 0.950)]

Linear Fit
Burden = 51.925681 + 0.6583141 × Memory/behavior problems

Summary of Fit
RSquare                       0.252066
RSquare Adj                   0.244434
Root Mean Square Error        17.43469
Mean of Response              69.24
Observations (or Sum Wgts)    100

Parameter Estimates
Term                        Estimate     Std Error   t Ratio   Prob>|t|
Intercept                   51.925681    3.480878    14.92     <.0001
Memory/behavior problems    0.6583141    0.11455     5.75      <.0001

Correlation
Variable                    Mean     Std Dev    Correlation   Signif Prob   Number
Memory/behavior problems    26.301   15.29684   0.502062      <.0001        100
Burden                      69.24    20.05755

Note: No one shows the ellipse in a publication.

JMP notes

• To estimate the correlation between the Y and X variables in the scatterplot, choose Bivariate ▼ Density Ellipse > .95. An ellipse, a bivariate report, and a new popup will appear.
• To remove these, choose Bivariate Normal Ellipse ▼ Remove Fit.
• To add the standardized beta, right-click the Parameter Estimates report.

Note that the RSquare value in the report is the square of the correlation. Note that the Root Mean Square Error is essentially the SD of the residuals (it is computed with df = N - 2, which is why it is 17.43 here rather than 17.35). We'll have occasion to use the Beta value (called in JMP the Std Beta). What is the interpretation of this value?

Write up

The correlation between memory/behavior problems and caregiver burden is positive (r = 0.50, df = 98, p-value < 0.0001). And your description of the relationship between memory problems and burden is?

Assumptions

Inspecting a straight-line fit is one way to interpret the relationship between the Y and X columns. Another is the bivariate normal ellipse. The word "normal" reminds us that there are assumptions underlying all of what we've talked about so far. That is, under some circumstances a correlation estimate makes sense, a straight-line fit makes sense, and talking about variance accounted for makes sense. And in other situations they don't.

The assumptions behind a correlation

For a correlation estimate to make sense, these are the assumptions that need to hold [1]:
• The X and Y values have a
bivariate normal distribution.

If these assumptions hold, then the bivariate normal ellipse can be thought of as a confidence bound. That is, a 95% bivariate normal density ellipse will enclose about 95% of the data points if these assumptions hold.

The assumptions behind a straight-line fit

For a straight-line fit to make sense, these are the assumptions that need to hold [2]:
• The observations (rows) must be representative and independent.
• The form of the relationship between Y and X must be linear.
• The values of X are measured without error (or at least the measurement error is negligible).
• For every fixed X, the variance of the Y values is the same.
• The error (residual) values have a normal distribution.

[1] Also see Daniel, page 431.
[2] Also see Daniel, page 402.

If these assumptions hold, then the statistical test for the slope is appropriate. Most of these are easy to assess in JMP. Linearity is largely assessed visually. But there are also important clues about nonlinearity available by assessing the last two assumptions.

Homogeneous variance

One misconception about this assumption is that it means that the variance of the X variable must equal the variance of the Y variable. No. The SD of Memory is 15.3 and the SD of Burden is 20.1, and these are not required to be the same.

Distributions: Moments
                   Memory/behavior problems      Burden
Fitted normal      Normal(26.301, 15.2968)       Normal(69.24, 20.0576)
Mean               26.301                        69.24
Std Dev            15.296844                     20.057554
Std Err Mean       1.5296844                     2.0057554
Upper 95% Mean     29.336226                     73.219854
Lower 95% Mean     23.265774                     65.260146
N                  100                           100

How do we assess whether the variance of the Y values is the same for every fixed X? First, visually. Look at the amount of spread above and below the line. Is there equal variance here?

[Figure: scatterplot of Triglycerides by Cholesterol with a fitted line]

See the fan spread? For low cholesterol values there is less spread
above and below the line than with larger cholesterol values. The spread around the line at cholesterol 200 is much wider than at 100.

Secondly, look at a plot of the residuals versus the X values.

[Figure: Residual by Cholesterol]

This plot clearly shows that the variance of the residuals is not equal across the X values.

To see the plot of the residuals for any fit, go to the submenu popup for that fit:
• Choose Linear Fit ▼ Plot Residuals.

A residual plot like that below is clear evidence of equal variance: no pattern whatsoever.

[Figure: Residual by Memory/behavioral problems]

Normality of residuals

[Figure: Normal Quantile Plot of Residual Burden]

In practice, it's usually sufficient to inspect the residuals vs. predictor plot.

Note that you do not care whether the raw Y values are normal. You don't care whether the X values are normal, although it usually is a good sign if they are. You do care about the normality of the residuals at a fixed point on the X axis.

Thinking about a linear fit

Again, we're considering predicting a caregiver's Burden from the level of memory problems. We do this by fitting a straight line through the points in the scatterplot and interpreting the slope. If there is no relationship, what will the slope of this line be? Said another way: if we're trying to predict Burden from something that is absolutely useless as a predictor, what Burden for a subject would we predict?

[Figure: example scatterplots of Y by X]

If we knew no other useful information, our best single predictor of the Burden for a caregiver would be the mean Burden. It would be best in the sense that it would minimize our expected error. That is, to give a total score for how good or bad our
predictions are, calculate the sum of all the squared errors for each caregiver (the $i$th one). Our total error is:

$$\mathrm{SST} = \sum_i \left(\mathrm{Burden}_i - \overline{\mathrm{Burden}}\right)^2$$

The general definition of SST is:

$$\mathrm{SST} = \sum_i \left(y_i - \bar{y}\right)^2$$

Predicting the mean is the best we can do if we have no other useful information. That is, the simplest possible model for prediction is to simply predict the mean. Predicting the mean will give us the least squared error. We call the score when using this model the Total Sum of Squares (SST). It is the total variability of the y-values.

Since this is the best we can do with no other information, this is the baseline used to compare all other potential predictors. That is, if a set of predicted values does not come out with an error less than SST, then it's clearly worse than the simplest possible model.

Say we have a potential way to guess the Burden. In our case, our model is a linear regression model: we're using the MemProb values to predict the Burden. It gives us a predictedBurden for all the i = 1 to 100 subjects. We calculate the sum of all the squared deviations of the predicted values from the mean:

$$\mathrm{SSR} = \sum_i \left(\mathrm{predictedBurden}_i - \overline{\mathrm{Burden}}\right)^2$$

The general definition of SSR is:

$$\mathrm{SSR} = \sum_i \left(\hat{y}_i - \bar{y}\right)^2$$

This is called the Sum of Squares Regression, or the explained sum of squares. It's also called the Model Sum of Squares, since it describes how good our model is. It is the variability explained by the model. The better this model is, the larger SSR is. The worse it is, the smaller SSR is.

The SSR tells us how much of SST the model explains. What's left over is the Sum of Squared Error:

$$\mathrm{SSE} = \sum_i \left(\mathrm{Burden}_i - \mathrm{predictedBurden}_i\right)^2$$

The general definition of SSE is:

$$\mathrm{SSE} = \sum_i \left(y_i - \hat{y}_i\right)^2$$

The SSE is the unexplained variability, the variability not explained by the model. It turns out that SST = SSR + SSE, since:

$$\sum_i \left(y_i - \bar{y}\right)^2 = \sum_i \left(\hat{y}_i - \bar{y}\right)^2 + \sum_i \left(y_i - \hat{y}_i\right)^2$$

So we have some portion of the total variability (SST) explained by a model (SSR), with some portion left unexplained, the error of the model (SSE). This brings us back to the
notion of variance accounted for. Another definition for RSquare is:

$$r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}}$$

This has a number of uses because it ties together a lot of what we've done in this class and will tie together things we'll cover in the future.

In the Mean Fit report, which occurs when we just fit a flat regression line, we see how bad this model is: it has SSE = 39828.24. This will be the total error all other models will be compared against.

Fit Mean
Mean              69.24
Std Dev [RMSE]    20.05755
Std Error         2.005755
SSE               39828.24

In the Analysis of Variance report, which occurs when we fit a line, we see how bad this model is: it has SSE = 29788.900.

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio
Model        1    10039.340         10039.3        33.0276
Error       98    29788.900         304.0          Prob > F
C. Total    99    39828.240                        <.0001

Recall that SST = SSR + SSE:

39828.240 = 10039.340 + 29788.900

And so r² = SSR / SST = 10039.340 / 39828.240 = 0.252.

So this gives us another way to look at correlation. It's a way to compare models. We have one very simple model (predict the mean) and we have a model we think is better (the line). How much better is it? An r² of zero would tell us it's no better. An r² of 1 would tell us it does a perfect job of predicting the data.

Confidence Bounds

This is visualized very well in JMP by looking at the confidence band around our straight-line model. In the figure below we see both the Linear Fit and its 95% confidence bound. We also see the Mean Fit.

[Figure: two scatterplots of Burden showing the Linear Fit with its 95% confidence band and the Mean Fit]

So, recall, one of our questions was: Does knowing the level of memory problems allow us to predict the Burden beyond just a chance level? Now we have a way to answer this question. The "beyond just a chance level" prediction would be: predict the mean Burden, about 69.2. Our model competes with that simplest possible model. Is it any better? Answer that question by looking at the confidence bounds. Do the red curved confidence bounds include the flat mean line? If the blue (flat mean) line is inside the confidence band, then the model is not statistically significant
(actually, the p-value will be > 0.05). If the flat line cuts through the confidence band, then the model is statistically significant and the p-value will be < 0.05.

JMP notes

To see the 95% confidence band for the mean predicted value for any fit, go to the submenu popup for that fit:
• Choose Linear Fit ▼ Confid Curves Fit.

Random Predictor

And the plot on the right shows what we'll see with a useless predictor. We'd get just about the same plot if we'd simply generated a random variable and used it in a scatterplot. The regression happened to have a slight increase, but as you can see, its confidence bounds include the mean prediction. The p-value here was 0.06, not significant. The fact that it is not a significant predictor is reflected in the above figure.

Correlation or Regression

If there is a clear predictor variable and a clear response variable (you want to use the values in the first as a basis for guessing unknown future values of the second), then the regression model is appropriate. If both are random and could be thought of on equal terms, then the correlation model is appropriate.

We've been using MemProb to predict Burden. Would it make sense to do the reverse?

[Figure: Bivariate Fit of Burden By Memory/behavior problems, and Bivariate Fit of Memory/behavior problems By Burden, each with Linear Fit and Bivariate Normal Ellipse, P = 0.950]

Bivariate Fit of Burden By Memory/behavior problems

Linear Fit: Burden = 51.925681 + 0.6583141 × Memory/behavior problems

Summary of Fit
RSquare                       0.252066
RSquare Adj                   0.244434
Root Mean Square Error        17.43469
Mean of Response              69.24
Observations (or Sum Wgts)    100

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio
Model        1    10039.340         10039.3        33.0276
Error       98    29788.900         304.0          Prob > F
C. Total    99    39828.240                        <.0001

Parameter Estimates
Term                        Estimate     Std Error   t Ratio   Prob>|t|
Intercept                   51.925681    3.480878    14.92     <.0001
Memory/behavior problems    0.6583141    0.11455     5.75      <.0001

Bivariate Fit of Memory/behavior problems By Burden

Linear Fit: Memory/behavior problems = -0.210723 + 0.3828961 × Burden

Summary of Fit
RSquare                       0.252066
RSquare Adj                   0.244434
Mean of Response              26.301
Observations (or Sum Wgts)    100

Analysis of Variance
Source      Sum of Squares    Mean Square
Model       5839.194          5839.19
Error       17326.156         176.80
C. Total    23165.3

Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    -0.210723    4.800971    -0.04     0.9651
Burden       0.3828961    0.066626    5.75      <.0001

Correlation (identical in both reports)
Variable                    Mean     Std Dev    Correlation   Signif Prob   Number
Memory/behavior problems    26.301   15.29684   0.502062      <.0001        100
Burden                      69.24    20.05755

What's the same? What's different?

[Figure: scatterplot of Burden by Memory/behavioral problems]

Step by Step

It's not necessary to follow through the details of every step each time we look at a set of data. But until we gain the experience to be comfortable that we'll end up in the right place, it's a good idea to take one step at a time.

Phase 1: State the Question

1. Evaluate and describe the data

The first two questions are: Where did this data come from? What are the observed statistics? The univariate distribution of each variable should be inspected and suitable descriptive statistics recorded. Preliminary assessment of normality is useful, although not critical. Investigate outliers.

2. Review assumptions

Then the bivariate scatterplot should be inspected. The linearity and equal variance assumptions should be informally assessed. There are two questions:
• Is the process used in this study likely to yield data that is representative of the population? A simple random sample is the best way to assure this. Otherwise, look for possible sources of bias.
• Is each observation in the sample independent of the others? The classic example of a situation where this assumption is violated is when there are repeated observations on the sampled individual.

The two above assumptions we can assess now. If they are questionable, then proceeding is, at
best, risky. If the above assumptions are met, then we can proceed, keeping in mind three crucial assumptions:
• Is the form of the relationship linear?
• For every fixed X, is the variance of the Y values around the fitted line constant? That is, is the variance homogeneous?

However, if it's obvious from the inspection of the scatterplot that linearity is suspect or that equal variance does not hold, then don't bother proceeding. We'll cover some methods to handle these situations later.

• Are the residuals normally distributed?

If the sample size is sufficient, then we can often assume that the errors will be normally distributed. The test statistic we use, the t statistic, is not sensitive to moderate departures from normality. Thus, unless the distribution is seriously skewed, the actual calculated p-values and confidence intervals will be close to the levels for exact normality. With large samples, the normality assumption is rarely a practical concern.

Usually the assessment of these last three assumptions will have to wait until the residual errors are determined. This occurs after we fit a straight-line model.

3. State the question in the form of hypotheses

There are three equivalent ways to state the null hypothesis:
• H0: r = 0 (zero correlation) vs. HA: r ≠ 0 (either a positive or negative correlation)
• H0: slope = 0 (flat line) vs. HA: slope ≠ 0 (a trend)
• H0: $Y = \bar{Y} + \text{error}$ (predict the mean) vs. HA: $Y = \text{intercept} + \text{slope} \cdot X + \text{error}$ (predict a trended value)

Answer one of the three and you have answered the other two. They are absolutely and completely the same question. We may prefer one over the other in terms of ease of explanation, but the decision-making process is identical.

However, since the correlation is best thought of as a descriptive statistic, it's best not to use r to do hypothesis testing. Actually, r is not normally distributed, so hypothesis testing on it is best avoided. So we proceed by testing hypotheses on the slope. Thus the hypotheses stated
for testing purposes are H0: slope = 0 vs. HA: slope ≠ 0.

Phase 2 Decide How to Answer the Question

4. Decide on a summary statistic that reflects the question.

There are two test statistics we can use. In this situation, where there is only one Y and one X, the following two statistics give identical p-values:

$$t = \frac{\text{slope} - 0}{SE_{\text{slope}}}$$

$$F = \frac{SSR / df_{\text{model}}}{SSE / df_{\text{error}}}$$

The t value has df = n - 2. The F distribution has two df parameters: the numerator df and the denominator df. Since df_model = 1, the F value has (1, n - 2) df. In the situation where the numerator df = 1, t² = F.

If it helps, you can think of it as though the t-test tests whether the slope is zero and the F-test tests whether predicting values with a line is better than predicting the mean of Y. Whichever conceptualization is easiest for you, use it. Know, however, that at bottom these are the same thing. You get the same p-value and make the same decision either way.

5. How could random variation affect that statistic?

What values of t will we see if the null hypothesis is true? If the null hypothesis is not true?

• For H0: slope = 0, we'll see t close to zero.
• For HA: slope ≠ 0, we'll see t-values different from zero. Recall the rough interpretation that t's larger than 2 in absolute value are remarkable.

What values of F will we see if the null hypothesis is true? If the null hypothesis is not true?

• For H0: Y = Ȳ + error, we'll see small F values (since F = t², a t near zero gives an F near zero).
• For HA: Y = intercept + slope × X + error, we'll see larger F values. From the rough interpretation of t, we'd guess that an F larger than 4 is remarkable.

Either statistic yields a p-value when compared to its reference distribution with n - 2 error df.

6. State a decision rule using the statistic to answer the question.

The universal decision rule: Reject H0 if p-value < α.

Phase 3 Answer the Question

7. Calculate the statistic.

Ask the software to fit a line. Inspect the Parameter Estimates report for the slope estimate, SE, t, and p-value. The df corresponds to df Error.

Parameter Estimates
Term                        Estimate    Std Error   t Ratio   Prob>|t|
Intercept                   51.925681   3.480878    14.92     <.0001
Memory/behavior problems    0.6583141   0.11455     5.75      <.0001

Or inspect the Analysis of Variance report for the F ratio, df Model and df Error, and p-value.

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model       1   10039.340        10039.3       33.0276
Error      98   29788.900        304.0         Prob > F
C. Total   99   39828.240                      <.0001

Calculation note: The F ratio corresponds to

$$F = \frac{SSR / df_{\text{model}}}{SSE / df_{\text{error}}} = \frac{MSR}{MSE}$$

It's a ratio of the mean square of the model (which compares the straight-line predicted value to the mean predicted value) to the mean square error (which compares the straight-line predicted value to the observed value).

The correlation may be a useful descriptive statistic; it is available in the Bivariate report. The p-value for the correlation is identical to either p-value above.

Correlation
Variable                    Mean    Std Dev   Correlation   Signif Prob   Number
Memory/behavior problems    26.30   15.30     0.502         <.0001        100
Burden                      69.24   20.06

8. Make a statistical decision.

In this example, since p-value < 0.0001, we reject the null hypothesis. But at this point it is crucial to determine whether the statistical decision is defensible. Are the assumptions met? In order of importance:

• Are the data representative?
• Is each observation independent?
• Is the relationship linear?
• For every fixed X, is the variance of the Y values the same?
• Are the residuals normally distributed?

If the first three are not clearly met, then inference is questionable. If the error variance is non-constant, then other alternatives should be considered. Normal residuals and equal variance usually go hand in hand. With a sufficiently large sample size, normality is usually a safe assumption. That is, unless there is clear evidence of skewness or of severe nonnormality, we're usually safe.

Normal Residuals

Usually the previous figure is enough information to judge the normality assumption. Usually it is not necessary to inspect the residuals.
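As a cross-check on the reports in step 7, the following is a minimal sketch (in Python rather than JMP, which is an assumption of this aside and not part of the course workflow) that recomputes the t ratio, the F ratio, and the correlation from the reported slope, standard error, and sums of squares. The variable names are illustrative only.

```python
import math

# Values copied from the JMP reports in step 7 (Parameter Estimates and
# Analysis of Variance). Variable names are hypothetical, chosen for clarity.
slope, se_slope = 0.6583141, 0.11455
ss_model, df_model = 10039.340, 1
ss_error, df_error = 29788.900, 98

t_ratio = (slope - 0) / se_slope          # t = (slope - 0) / SE_slope
ms_model = ss_model / df_model            # MSR = SSR / df_model
ms_error = ss_error / df_error            # MSE = SSE / df_error
f_ratio = ms_model / ms_error             # F = MSR / MSE

# Share of the total variation explained by the line, and the correlation,
# which takes the sign of the slope in simple linear regression.
r_squared = ss_model / (ss_model + ss_error)
r = math.copysign(math.sqrt(r_squared), slope)

print(f"t = {t_ratio:.2f}, t^2 = {t_ratio ** 2:.2f}, F = {f_ratio:.2f}")
print(f"R^2 = {r_squared:.3f}, r = {r:.3f}")
```

With one predictor, t squared and F should agree up to rounding in the reported inputs, and r should match the value in the Correlation report. This is why answering one of the three hypotheses answers the other two.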
Unless there is clear nonnormality, don't be concerned.

9. State the substantive conclusion.

There is a positive relationship between caregiver Burden and memory/behavioral problems.

Phase 4 Communicate the Answer to the Question

10. Document our understanding with text, tables, or figures.

There are any number of ways to document our understanding. They would all begin with background information on each variable. We'd then describe the methods of how the data were obtained. And then:

Table 1. Summary Values
Summary                     n     Mean   SD     95% CI
Memory/behavior problems    100   26.3   15.3   23.3 to 29.3
Burden                      100   69.2   20.1   65.3 to 73.2

Correlation

If the intent of the question is to inquire whether there was a significant correlation between the two variables, the following paragraph would be an adequate description:

Caregiver burden and the level of memory or behavioral problems were recorded from n = 100 caregivers of adults with dementia. The average values are described in Table 1. There was found to be a significant positive linear correlation between the memory/behavioral problems and caregiver burden (r = 0.50, df = 98, p-value < 0.0001). That is, higher problems went with higher burden.

Straight line

If the intent of the question is to describe a straight line through the pairs of points, the following paragraph would be adequate:

In data obtained from ... (described in Table 1). The relationship between memory/behavioral problems and caregiver burden was found to be linear, and the positive trend was significantly different from zero (t = 5.75, df = 98, p-value < 0.0001). See Table 2 for the estimated parameters.

Table 2. Regression Parameters for Predicting Burden from Mem/Beh Problems
            estimate   SE      95% CI
intercept   51.926     3.481   45.018 to 58.833
slope       0.658      0.115   0.431 to 0.886

The predicted burden for a caregiver was found to be 0.66 × Problems + 51.9. This linear relationship accounts for 25% of the variance of the caregiver burden. Figure 1 illustrates the linear trend and shows the 95% confidence
bound when predicting an individual caregiver's burden from a given problem level. That is, if an adult has a memory/behavior problem level of 50, our study found that the burden for the caregiver is predicted to be 84.8 (95% prediction interval between 49.7 and 120.0).

Figure 1. The Relationship between problems and burden
[Scatterplot: Burden (vertical axis) vs. Memory/behavior problems (horizontal axis), with the fitted line and 95% prediction bounds]

Figure 2 shows the trend with the 95% confidence bound when predicting the average burden level from a given problem level. That is, if a sample of adults has an average memory/behavior problem level of 50, our study found that the average burden for caregivers is predicted to be 84.8 (95% confidence interval between 78.4 and 91.2).

Figure 2. The Relationship between problems and burden
[Scatterplot: Burden (vertical axis) vs. Memory/behavior problems (horizontal axis), with the fitted line and 95% confidence bounds for the mean]

Note: In clinical situations we are almost always dealing with n = 1, and so the prediction interval is the one you want. The problem with the PI is that it's often way wider than the CI. So people often incorrectly show the CI, which is narrower and looks better, when they really should show the PI.

Model comparison

If the intent of the question is to determine whether a model can be developed to predict one variable from another, the results can be stated as with the straight-line paragraphs above. Probably the only thing that would be different is that instead of saying "The relationship between problem level and burden was found to be linear and the trend was significantly different from zero (t ...)", we say "A significant linear model was found that described the relationship between the two variables (F(1, 98) = 33.0, p-value < 0.0001)."

Details

Note how the df for the F value is either included in parentheses or written as subscripts: F(1, 98) = 33.0 or $F_{1,98} = 33.0$. Only one decimal place is needed for the
F value, perhaps two. t values are reported to one or two decimal places. Correlations (r) are reported to at most two decimal places. The number of decimal places for slope and intercept estimates should be the same as the number of decimal places you use for the mean or SD, or perhaps one more decimal.

Mistakes to Avoid

See the figure below. It is way too busy. It looks like you've simply chosen most of the options available in hopes that maybe one is the correct one. No one ever publishes the flat mean line. No one ever shows the ellipse; it only appears because that is the way that JMP wants you to go when you want a correlation. Note also that it is not correct to show the ellipse and say "see the 95% confidence bound." The ellipse is neither the CI nor the PI. And never show both the CI and the PI; choose one and only one.

[Figure: scatterplot of Burden vs. Memory/behavior problems overlaid with Linear Fit, Fit Mean, and Bivariate Normal Ellipse (P = 0.950)]

Bottom line: Do NOT use a figure like this. Choose one and only one of the figures on the previous page to illustrate the relationship between two continuous variables.

Exercises

In the CaregiverJMP dataset, the relationship between the JMP column names and the entries in the various tables in the paper is summarized in the variable table below. In all of these exercises, turn in only your summary paragraph and any necessary summary tables. A scatterplot with fitted line and one of the two confidence bounds is probably useful.

1. Describe these N = 100 subjects. In a sentence, describe where the data came from. In a table, summarize the predictor variables used in Table 4 and the Burden outcome variable. Nothing fancy: n, Mean, SD should be sufficient.
Turn in: One short paragraph with text referring to the table. One table.

2. Describe the correlations among the independent and dependent variables. In other words, reproduce the paper's Table 2 using JMP's Multivariate platform. The independent variables
are the variables listed in Table 4. Burden is the dependent variable we'll be concerned with (ignore Satisfaction). Note that you will not get the same correlations since your N is different. Do a brief visual inspection and comment if (1) the correlation you get has a different sign when compared with Table 2 or (2) the correlation you get is quite different (say, by more than ± 0.1). Do not do any significance testing.
Turn in: A sentence or two referring to the table. A table showing the correlation matrix.

Variable                     Table 2   Table 3   Table 4
Predictors
  Caregiver Age              1                   1a
  Caregiver Income           3                   1c
  Duration of caregiving     2                   1b
  ADL                        4         1         2a
  Memory/behavior problems   5         2         2b
  Cognitive impairment       6         3         2c
  Health                     7         4         2d
  Perceived social support   8                   3
  Sex
  Relationship
  Education
Outcomes
  Burden                     9
  Satisfaction               10

3. Answer this question: Is there a bivariate correlation between the eight predictor variables (listed as 1. Age through 8. Perceived social support in Table 2) and the outcome variable Burden? That is, test whether the correlations listed in the "9. Burden" row of Table 2 are significant. Note that this is