by: Shane Marks

L. Hendrix

These 72 pages of class notes were uploaded by Shane Marks on Monday, October 26, 2015. The notes belong to STAT 205 at University of South Carolina - Columbia, taught by L. Hendrix in Fall.



Date Created: 10/26/15
Dependent (Paired) Samples: Inference on $\mu_1 - \mu_2$

Example 9.1: The compound m-chlorophenylpiperazine (mCPP) is thought to affect appetite and food intake in humans. In a study of this effect, 9 moderately obese women were given mCPP in a double-blind, placebo-controlled experiment. Some of the women took mCPP for two weeks, then took nothing for two weeks ("wash-out" period), and then took the placebo for two weeks. The other women took the placebo during the first two weeks, had their two-week "wash-out" period, then took mCPP for the last two weeks. The results, measured as weight loss after treatment (kg), are summarized below (note that weight gain is denoted by a negative value).

TABLE 9.2 Weight Loss (kg) for 9 Women

Subject   mCPP (y1)   Placebo (y2)   Difference (d = y1 - y2)
1          1.1          0.0            1.1
2          1.3         -0.3            1.6
3          1.0          0.6            0.4
4          1.7          0.3            1.4
5          1.4         -0.7            2.1
6          0.1         -0.2            0.3
7          0.5          0.6           -0.1
8          1.6          0.9            0.7
9         -0.5         -2.0            1.5
Mean       0.91        -0.09           1.00
SD         0.74         0.88           0.72

So we have two measurements for each individual. We'd like to compare the true mean weight loss with mCPP to the true mean weight loss with placebo.

Question: Are these samples independent?
Answer:

Let's suppose we want to make inference on the difference between the two population means. We can use the paired data model.

Paired Data Model: Suppose $Y_{i1} \sim N(\mu_1, \sigma_1^2)$ is paired with $Y_{i2} \sim N(\mu_2, \sigma_2^2)$ at each observation $i = 1, 2, \ldots, n$. Then if $d_i = Y_{i1} - Y_{i2}$, we have $E(d_i) = E(Y_{i1} - Y_{i2}) = E(Y_{i1}) - E(Y_{i2}) = \mu_1 - \mu_2 = \mu_d$. Also $d_i \sim N(\mu_d, \sigma_d^2)$, where $\sigma_d^2$ is a complicated function of the model parameters.

Question: What is the distribution of $\bar{d}$?

So we can make inferences on $\mu_d = \mu_1 - \mu_2$ using $\bar{d} = \bar{Y}_1 - \bar{Y}_2$ with the applications (hypothesis test and confidence interval) we already learned with the t-distribution. The usual assumptions apply. Only our interpretations will change.

Example 9.3: Construct and interpret a 95% confidence interval for the mean difference in weight loss between the two groups.

Example 9.6: If you walk toward a squirrel that is on the ground, it will eventually run for the nearest tree. A researcher wondered if he
could get closer to the squirrel than the squirrel was to the nearest tree before the squirrel would run. He made 11 observations. The data are summarized below.

TABLE 9.4 Distances (in Inches) from Person and from Tree When Squirrel Started to Run

Squirrel   From Person (y1)   From Tree (y2)   Difference (d = y1 - y2)
1           81                 137              -56
2          178                  34              144
3          202                  51              151
4          325                  50              275
5          238                  54              184
6          134                 236             -102
7          240                  45              195
8          326                 293               33
9           60                 277             -217
10         119                  83               36
11         189                  41              148
Mean       190                 118               72
SD          89                 101              148

Construct and interpret a 95% confidence interval for the mean difference in distance from person to squirrel and squirrel to tree.

Conduct a test of hypothesis at the $\alpha = 0.05$ significance level to determine if there is a significant difference in these distances.

Based on the 95% confidence interval you computed for the squirrel data, you could have determined the decision for the hypothesis test we just computed. How?

Check the assumption that can be checked by the QQ plot for valid t-based inference for the squirrel data examples.

[Figure: Normal QQ Plot of the differences (Sample Quantiles vs. Theoretical Quantiles)]

Chi-Square for Contingency Tables

2 x 2 Case: A test for $p_1 = p_2$

We have learned a confidence interval for $p_1 - p_2$, the difference in the population proportions. We want a hypothesis testing procedure for this difference.

Definitions: A contingency table is a tabular arrangement of count data representing how the row factor frequencies relate to the column factor. We call a contingency table with "r" rows and "c" columns an r x c contingency table. Each category in a contingency table is called a cell.

Example: Consider a 2 x 2 contingency table with the row factor denoting a success versus failure and the column factor denoting Group 1 or Group 2, where the samples for both Group 1 and Group 2 are independent of each other. Then the contingency table looks like this:

           Group 1   Group 2
Success    Y1        Y2
Failure

Recall Example 10.37
regarding effectiveness of Timolol on angina status. The contingency table would be as follows:

                    Timolol   Placebo
Angina free         44        19
Not angina free     116       128

We have already used this data to construct a 95% confidence interval for the difference in the proportion of angina free for the Timolol versus the Placebo conditions.

Let $p_1$ denote the probability (or population proportion) of success for Group 1. Let $p_2$ denote the probability (or population proportion) of success for Group 2. To test $H_0: p_1 = p_2$, we'll introduce Pearson's $X^2$ (Chi-square) statistic.

Definition: Pearson's $X^2$ statistic is $X_s^2 = \sum \frac{(O - E)^2}{E}$, where the sum is over all the cells in the table, O denotes the observed value in each cell, and E denotes the value we'd expect to see if $H_0$ were true.

Now, we have the observed values (the data we collected). What are the E's? Remember, we conduct hypothesis tests under the assumption that the null hypothesis is true. If the null hypothesis were true, then $p_1 = p_2$. So then $\hat{p}_1$ and $\hat{p}_2$ would be estimating a common p, i.e., the probability of a success would be the same under Group 1 or Group 2 in our example. Then we could estimate this common p by using a weighted (pooled) estimator:

$\hat{p}_{pool} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{n_1 (Y_1 / n_1) + n_2 (Y_2 / n_2)}{n_1 + n_2} = \frac{Y_1 + Y_2}{n_1 + n_2}$

Little Sidebar: Suppose you are flipping an unfair coin where the probability of a heads is 0.3 and the probability of a tails is 0.7. How many heads would you expect to see if you were to flip this unfair coin ten times?

Now apply this thought process to get the expected successes for Group 1.

And compute the expected successes for Group 2.

Fill out the "Expected" table for the Group 1/Group 2 success/failure contingency table:

           Group 1   Group 2
Success
Failure

Things to remember:
- The E's (expected counts) need not be integers, and we do not round them.
- The row and column totals are the same for observed and expected tables; this is a good way to check your calculations.
- For the Chi-square test we'll begin implementing in just a moment to be valid, we need
each $E \ge 1$ and the average $E \ge 5$.

Calculating P-values under the $\chi^2$ distribution

The $\chi^2$ distribution is a right-skewed distribution. The values of a $\chi^2$ random variable are greater than or equal to 0. The $\chi^2$ distribution has degrees of freedom. The degrees of freedom for a $\chi^2$ test with a contingency table are df = (# of rows - 1)(# of columns - 1).

[Figure: Examples of Chi-square Distributions]

For a nondirectional alternative, $P = P(\chi^2_{df} \ge X_s^2)$.

If df = 1, we have the option of performing a directional alternative. In this case,
- $P = \frac{1}{2} P(\chi^2_1 \ge X_s^2)$ if the data deviate in the direction specified by $H_A$;
- $P > 0.5$ otherwise.

TI-83/84: MATRIX (2nd, x^-1) > scroll over to EDIT > ENTER > enter your matrix. STAT > scroll over to TESTS > scroll down to $\chi^2$-Test > ENTER > make sure your observed values are in the matrix specified (the expected matrix will be calculated for you and stored in the matrix specified) > Calculate > ENTER.

Example: Using the table below, conduct a test of hypothesis at the $\alpha = 0.01$ significance level to determine whether there is a significant difference in the probability of being angina free under Timolol or placebo.

                    Timolol   Placebo
Angina free         44        19
Not angina free     116       128

What if the researchers wanted to know whether the probability of being angina free is greater under Timolol than under placebo?

What if the researchers wanted to detect whether the probability of being angina free under Timolol is less than under placebo?

A Test for Association

The workup of all the previous examples assumed we had two independent samples and we were observing those two samples for the outcome of one variable. Many times we are in the situation where we observe one sample for two explanatory factors.

                       Factor 1
                       Level 1   Level 2
Factor 2   Level 1     Y1        Y2
           Level 2

In the case where we have one sample and we're observing it for two
explanatory factors, we'll test the hypothesis of association. The test for $H_0$: there is no association is numerically equivalent to that of $H_0: p_1 = p_2$, but the hypotheses and interpretations are different.

Example 10.21: To study the association of hair color and eye color in a German population, an anthropologist observed a sample of 6800 men.

                      Hair Color
                      Dark    Light
Eye Color   Dark      726     131
            Light     3129    2814

Test at the $\alpha = 0.05$ significance level whether hair color is associated with eye color in this population of German men.

General r x c Case

The ideas presented in the 2 x 2 cases just presented can be easily extended to general r x c contingency tables.

For the case where we have c different samples (your columns) and we're checking each sample for different levels of the row factor, the hypothesis will change slightly. Here we'll test whether the distributions are the same for each sample. Think about it: if we have more than a success and a failure, then for each column we'll have P(level 1), P(level 2), ..., P(level r). And then the null hypothesis would be testing whether $p_{11} = p_{12} = \cdots = p_{1c}$ and $p_{21} = p_{22} = \cdots = p_{2c}$, etc. This is called a compound hypothesis.

For the case where we have one sample and we're checking that one sample for different levels of two different factors, we'll still be testing association.

Example 10.31: The following table shows the observed distribution of A, B, AB, and O blood types in three samples of African Americans living in different locations.

       Florida   Iowa   Missouri
A      122       1781   353
B      117       1351   269
AB     19        289    60
O      244       3301   713

Test at the $\alpha = 0.05$ level of significance whether the distribution of blood type for African Americans is different across the three regions.

Example 10.33: To study the association of hair color and eye color in a German population, an anthropologist observed a sample of 6800 men (this is the same study as that of Example 10.21).
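The expected-count and $X^2$ calculations used throughout this section can be checked with a short script before working the larger table. The following is a minimal Python sketch, not part of the original notes: the function names `chi_square_table` and `p_value_df1` are made up for illustration, and the row/column orientation of the Example 10.21 counts is an assumption (the statistic does not depend on it). The P-value helper is only valid for df = 1, where $P(\chi^2_1 \ge x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math


def chi_square_table(observed):
    """Pearson's X^2 for an r x c table of observed counts (a list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under H0: (row total * column total) / grand total.
    # Expected counts are not rounded.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


def p_value_df1(stat):
    """Nondirectional P-value for df = 1 only: P(chi^2_1 >= stat)."""
    return math.erfc(math.sqrt(stat / 2))


# Counts from Example 10.21; orientation assumed (rows = eye color, columns = hair color).
hair_eye = [[726, 131], [3129, 2814]]
stat, df, expected = chi_square_table(hair_eye)
p = p_value_df1(stat)
print(f"X^2 = {stat:.1f}, df = {df}, P = {p:.3g}")
```

For tables larger than 2 x 2 the same function returns the statistic and degrees of freedom; the P-value would then come from a chi-square table or the calculator's $\chi^2$-Test, since the df = 1 shortcut no longer applies.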
                              Hair Color
                              Brown   Black   Fair   Red
Eye Color   Brown             438     288     115    16
            Grey or Green     1387    746     946    53
            Blue              807     189     1768   47

Test at the $\alpha = 0.05$ significance level whether hair color is associated with eye color in this population of German men.

Final Notes on Chi-Square for Contingency Tables

- Remember, your calculator gives P-values for a nondirectional alternative.
- We can have a directional alternative when we're in the 2 x 2 table, and when $H_A$ is directional, one must check whether the data deviate in the direction specified by $H_A$.
  - If yes, cut the P-value in half.
  - If no, P > .5 and fail to reject $H_0$.
- Degrees of freedom for an r x c table are (# rows - 1)(# columns - 1).
- Pearson's $X^2$ statistic for contingency tables uses the approximation $X_s^2 \sim \chi^2_{df}$, so in order to be a valid approximation, a standard rule of thumb is to require $E \ge 1$ for each cell and the average $E \ge 5$, and observations independent of one another.
- If expected counts are small and the data form a 2 x 2 table, Fisher's exact test may be appropriate.
- By contrast, Example 10.21 illustrates that $X_s^2$ is very sensitive with large sample sizes.
- For r x c tables we have the following two hypotheses:
  - c samples and we're checking for r levels of a row factor: then we're testing whether the distributions are the same for the groups (your columns);
  - one sample and we're checking for r levels of a row factor and c levels of a column factor: then we're testing for an association of the row and column factors.

Wilcoxon-Mann-Whitney Test

In the independent samples case, we first learned to apply the independent sample t test when we had normality of the sample mean for each sample. If there is a violation of this assumption, we'll apply a non-parametric test for independent samples: the Wilcoxon-Mann-Whitney test. This test uses the relative position of the data in a rank ordering, unlike the t test, which uses the actual values.

To use the Wilcoxon-Mann-Whitney test:
- It must be reasonable to regard
the data as a random sample from their respective populations.
- Observations within each sample must be independent of one another.
- The two samples must be independent of one another.

Rank Sum Algorithm
1. Arrange (rank) the observations in each sample from smallest to largest.
2. Count the number of observations in each group that are smaller than each observation in the other group.
   a. Look at the first observation in the first group and count the number of observations in the second group that are smaller than it.
   b. Record that count.
   c. Repeat for all observations in the sample.
   d. Do a, b, and c for the second column.
3. Add the counts you recorded from Step 2 for the first column and call this sum $K_1$. Add the counts you recorded from Step 2 for the second column. Call this sum $K_2$. Check your calculations: $K_1 + K_2 = n_1 n_2$.
4. The test statistic is $U_s = \max(K_1, K_2)$.
5. Use the test statistic $U_s$ to bracket the P-value from Table 6.

Example 7.39: Soil respiration is a measure of microbial activity in soil, which affects plant growth. In one study, soil cores were taken from two locations in a forest: under an opening in the forest canopy ("gap" location) and a nearby location under heavy tree growth ("growth" location). The amount of carbon dioxide given off by each core was measured (mol CO2/g soil/hr).

TABLE 217 Soil Respiration Data (mol CO2/g soil/hr) from Example 7.38

Growth:  17, 20, 170, 315, 22, 190, 64
Gap:     22, 29, 13, 16, 15, 18, 14, 6

[Figures: Normal QQ Plots (Sample Quantiles vs. Theoretical Quantiles) for the Gap and Growth samples]

The researcher would like to compare soil respiration under the two conditions (gap and growth), but there is a clear violation of normality, and with small sample sizes the CLT can't help us out. An attempt to transform the growth data, maybe with log(growth), would make the skewness of the gap distribution even worse.

Employ the Wilcoxon-Mann-Whitney test at the $\alpha = 0.05$ significance level to see whether the distributions are shifted in some way from one another.

Number of Gap Observations   Y1            Y2         Number of Growth Observations
That Are Smaller             Growth Data   Gap Data   That Are Smaller
                             17            6
                             20            13
                             22            14
                             64            15
                             170           16
                             190           18
                             315           22
                                           29
K1 =                                                  K2 =

Binomial Distribution

Let's refresh our memory on computing probability with an example. Consider the experiment where we toss a fair coin 3 times. Find the probability distribution for flipping heads in this experiment.

Now think about finding probability distributions associated with flipping a fair coin, say, 6 times. And then consider the experiment where the coin is not fair. The calculations get unwieldy fast. We need a more convenient method.

Definition: The independent trials model occurs when
(i) n independent trials are studied,
(ii) each trial results in a single binary observation,
(iii) each trial's success has constant probability P(success) = p.

Notice that if P(success) = p, then P(failure) = 1 - p.

Your text calls this the BInS (Binary, Independent, n is constant, Same p) setting, but it is commonly referred to as a Binomial Experiment. In a BInS setting, if we let Y = # successes, then Y has a binomial distribution.

NOTATION: $Y \sim Bin(n, p)$

The binomial probability function is $P(Y = j) = {}_nC_j \, p^j (1 - p)^{n - j}$, $j = 0, 1, \ldots, n$, where ${}_nC_j = \frac{n!}{j!(n - j)!}$ with $j! = j(j-1)(j-2) \cdots 2 \cdot 1$, and we define $0! = 1$.

Example: Use the binomial probability function to find P(exactly 1 head) in the experiment where a fair coin is flipped 3 times.

Find P(at least one head).

The TI calculators will compute binomial probabilities.
- For $P(Y = j)$: choose 2nd and VARS to bring up the DISTR menu > scroll down to binompdf > ENTER > binompdf(n,p,j)
- For $P(Y \le j)$: choose 2nd and VARS to bring up the DISTR menu > scroll down to binomcdf > ENTER > binomcdf(n,p,j)

Example 3.43: Suppose that 39% of the individuals in a large population have a certain mutant trait and that a random sample of 5 individuals is chosen from the population. Find P(at least 1 and at most 4 mutants in the sample).

The following is a probability histogram of the distribution from Example 3.43.

[Figure: probability histogram; horizontal axis: Number of mutants, 0 to 5]

Figure 3.15 Binomial
distribution with n = 5 and p = .39

Binomial mean and variance: If $Y \sim Bin(n, p)$, the population mean and variance are $\mu_Y = np$ and $\sigma_Y^2 = np(1 - p)$.

Example: Find the mean and standard deviation for the number of mutants out of five selected individuals from the population described in Example 3.43.

Simple Linear Regression

We have been introduced to the notion that a categorical variable could depend on different levels of another variable when we discussed contingency tables. We'll extend this idea to the case of predicting a continuous response variable from different levels of another variable. We say the variable Y is the response variable, dependent on an explanatory (predictor) variable X.

There are many examples in the life sciences of such a situation: height to predict weight, dose of an algaecide to predict algae growth, skinfold measurements to predict total body fat, etc. Oftentimes, several predictors are used to make a prediction of one variable (e.g., height, weight, age, smoking status, and gender can all be used to predict blood pressure). We focus on the special case of using one predictor variable for a response, where the relationship is linear.

Example 12.3: In a study of a free-living population of the snake Vipera bertis, researchers caught and measured nine adult females. The goal is to predict weight (Y) from length (X). The data and a scatterplot of the data are below. Notice this data comes in pairs. For example, $(x_1, y_1) = (60, 136)$.

Snake   Length (cm)   Weight
1       60            136
2       69            198
3       (illegible)   (illegible)
4       64            140
5       54            93
6       (illegible)   (illegible)
7       (illegible)   (illegible)
8       65            174
9       63            145

[Figure: Scatterplot of Weight vs. Length, female Vipera bertis; horizontal axis: Length (cm)]

First we look at a scatterplot of the data. We'd like to fit a straight line to the data. Why linear? Does fitting a straight line seem reasonable?

Simple Linear Model (Regression Equation): The simple linear model relating Y and X is $Y = b_0 + b_1 X$.
- $b_0$ is the intercept, the point where the line crosses the Y axis.
- $b_1$ is the slope, the change in Y over the
change in X (rise over run).

Definition: A predicted value (or fitted value) is the predicted value of $y_i$ for a given $x_i$ based on the regression equation $b_0 + b_1 x_i$. Notation: $\hat{y}_i = b_0 + b_1 x_i$.

Residual: A residual is the departure of an observed y from its fitted value. Notation: $resid_i = y_i - \hat{y}_i$.

[Figure 12.6: $\hat{y}$ and the residual for a typical data point (x, y)]

Which line do we fit? We will fit a line that goes through the data in the best way possible, based on the least squares criterion.

Definition: The residual sum of squares (a.k.a. SS(resid) or SSE) is
$SS(resid) = SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

The least squares criterion states that the optimal fit of a model to data occurs when the SS(resid) is as small as possible. Note that under our model
$SS(resid) = SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$

Refer to the applet at http://standards.nctm.org/document/eexamples/chap7/7.4

Using calculus to minimize the SSE, we find the coefficients for the regression equation:
$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, $\quad b_0 = \bar{y} - b_1 \bar{x}$

TI-83/84: Enter the data into two lists. STAT > TESTS > LinRegTTest. We'll go over the options in class.

Example 12.3: Find the linear regression of weight (Y) on length (X).

[Figure: Scatterplot of Weight vs. Length with Fitted Regression Line; horizontal axis: Length (cm)]

Interpret the slope, $b_1$, in the context of the setting.

Can we interpret the meaning of the Y intercept, $b_0$, in this setting?

Definition: An extrapolation occurs when one uses the model to predict a y value corresponding to an x value which is not within the range of the observed x's.

A Measure of Variability: $s_{Y|X}$

Once we fit a line to our data and use it to make predictions, it is natural to ask how far off our predictions are in general.

Definition: The residual standard deviation is
$s_{Y|X} = \sqrt{\frac{SS(resid)}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$

Caution: This is not to be confused with $s_Y$. Recall $s_Y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$.

[Figure: Scatterplot of Weight vs. Length with Fitted Regression Line; horizontal axis: Length (cm)]

Determine and
interpret $s_{Y|X}$ for the regression of female Vipera bertis weight on length.

The Linear Statistical Model

Definition: A conditional mean is the expected value of a variable conditional on another variable. Notation: $\mu_{Y|X}$

Definition: A conditional standard deviation is the standard deviation of a variable conditional on another variable. Notation: $\sigma_{Y|X}$

The linear regression model of Y on X assumes $Y = \mu_{Y|X} + \varepsilon$, where the conditional mean is linear with $\mu_{Y|X} = \beta_0 + \beta_1 X$, and $\mu_\varepsilon = 0$ and $\sigma_\varepsilon = \sigma_{Y|X}$.

We use $b_0$ to estimate $\beta_0$, $b_1$ to estimate $\beta_1$, and $s_{Y|X}$ to estimate $\sigma_{Y|X}$. Then we can estimate or predict $\mu_{Y|X=x}$ at any x, so that $\hat{\mu}_{Y|X=x} = b_0 + b_1 x$.

Assuming the linear model is appropriate here, find estimates of the mean and standard deviation of female Vipera bertis weight at a length of 65 cm.

Should we estimate female Vipera bertis weight at a length of 75 cm? Why or why not?

Inference on $\beta_1$

Normal Error Model: In our discussion on the linear statistical model, we stated that the linear regression model of Y on X assumes a linear conditional mean with the errors having mean 0 and standard deviation $\sigma_{Y|X}$. To make inference on $\beta_1$, we need to update the conditions on this model to include a normal distribution on the errors:

$Y = \mu_{Y|X} + \varepsilon$, $\quad \mu_{Y|X} = \beta_0 + \beta_1 X$, $\quad \varepsilon \sim N(0, \sigma_\varepsilon^2)$ with $\sigma_\varepsilon = \sigma_{Y|X}$

[Figure 12.9]

Assumptions to check:
- $\varepsilon_i$ must be independent
- $\varepsilon_i$ must be normally distributed
- $\varepsilon_i$ must have equal variance
- $\varepsilon_i$ must have mean zero

How to Check Assumptions
- $\varepsilon_i$ independent:
- $\varepsilon_i$ normally distributed:
- $\varepsilon_i$ must have equal variance:
- $\varepsilon_i$ must have mean zero:

Looking at the Residuals vs. Predicted (or X) Plot

Check the assumptions that can be checked for the female Vipera bertis regression using the plots below.

[Figures: Normal QQ plot of residuals; Residuals vs. Length]

Confidence Interval for $\beta_1$

Under the normal error model, $b_1$ is unbiased
for $\beta_1$ with
$SE_{b_1} = \frac{s_{Y|X}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

This confidence interval uses a t critical point, $t_{\alpha/2,\, df = n-2}$.

The TI-84 and TI-89 calculators have a menu option to compute this interval called LinRegTInt, found in STAT > TESTS. For the TI-83, use the LinRegTTest option and a t critical point. From LinRegTTest we can get the standard error of $b_1$:
$t_s = \frac{b_1}{SE_{b_1}} \;\Rightarrow\; SE_{b_1} = \frac{b_1}{t_s}$

Note: The test statistic $t_s$ returned by your calculator is the test statistic for a hypothesis test, not a critical point. You'll have to compute the critical point on your own or be given one.

Then the CI is $b_1 \pm t_{\alpha/2,\, df = n-2} \, SE_{b_1}$.

Compute and interpret a 95% confidence interval for $\beta_1$ in the female Vipera bertis regression. Use $t_{.025,\, df = 7} = 2.365$.

Hypothesis Test for $\beta_1$

Similar to the development of the confidence interval for $\beta_1$, we can use the t distribution to conduct a hypothesis test for $\beta_1$ with
$t_s = \frac{b_1}{SE_{b_1}}$
Under $H_0$, $t_s \sim t_{df = n-2}$. We'll test $H_0: \beta_1 = 0$ against $H_A: \beta_1 \ne 0$ (or a directional alternative).

Use the LinRegTTest option in your calculator to conduct a test of hypothesis at the $\alpha = 0.05$ significance level whether true mean female Vipera bertis weight tends to increase with increase in female Vipera bertis length.

Coefficient of Determination: $r^2$

Recall SS(resid) is a measure of the unexplained variability in Y (the variation in Y not explained by X through the regression model) and is given by
$SS(resid) = SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

[Figure: Scatterplot of Weight vs. Length with Fitted Regression Line]

Definition: SS(total) measures the total variability in Y and is given by
$SS(total) = SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$

Definition: SS(reg) measures the variability in Y that is explained by X through the regression model and is given by
$SS(reg) = SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

Then SS(total) = SS(reg) + SS(resid), or the total variability in Y is the variability explained by the regression model plus the unexplained (residual) variation.

Definition: The coefficient of determination is the ratio between SS(reg) and SS(total) and is given by
$r^2 = \frac{SS(reg)}{SS(total)} = \frac{\left[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}$
and is
interpreted as the proportion or percentage of variability in Y that is explained by the linear regression of Y on X.

Coefficient of determination: $r^2$

Find and interpret $r^2$ for the regression of female Vipera bertis weight on length.

Reading Regression Output from Standard Statistical Software

Example (Fertility Enhancer Data): The data below concerns the prevalence of a so-called fertility enhancer and the population of Oldenburg, Germany, in thousands of people, between 1930 and 1936. The original data can be found in ... 44, No. 2, Jahrgang 1936, Berlin, and 48, No. 1, Jahrgang 1940, Berlin, and Statistisches Jahrbuch Deutscher Gemeinden, 27-33, Jahrgang 1932-1938, Gustav Fischer, Jena. The actual data below is estimated from the graph in Box, Hunter and Hunter's Statistics for Experimenters.

[Figures: Simple Linear Regression applet output: fitted regression line; Normal QQ plot of residuals; Residuals vs. x]

[Minitab output: Regression Analysis: Population versus x. Regression equation; coefficient table (Predictor, Coef, SE Coef, T, P) for Constant and x; S and R-Sq; analysis of variance table. Numeric values are not legible.]

Correlation

The linear regression model assumes the X's are measured with negligible error. Think about the snake data here: the researcher measured length to predict weight. Why not the other way around? I mean, if we go out to collect snake measurements, I am not volunteering to get the length. I'm volunteering to hook the snake and throw it in a bag to weigh it. But if we tried to use the weight to predict length, the variability in weight due to eating, pregnancy, etc. could lead to bad predictions of length. For instance, a snake in our data set that just ate
a mouse would have a shorter length than what would be predicted for a snake that actually weighed (little snake + food = big snake) pounds.

In other words, $\mu_{Y|X}$ is the mean of Y given X. We use this type of model to make predictions of Y based on our model for a given value of X.

For the situation where we'd like to make statements about the joint relationship of X and Y, we'll need for X and Y to both be random. When we're interested in examining the joint relationship of two random variables, we are interested in their joint distribution. The joint distribution of two random variables is called a bivariate distribution.

Definition: The bivariate random sampling model views the pairs $(X_i, Y_i)$ as joint random variables with population means $\mu_X, \mu_Y$, population standard deviations $\sigma_X, \sigma_Y$, and a correlation parameter $\rho$.

In this model, $\rho$ measures the level of dependence between two random variables X and Y.
- $-1 \le \rho \le 1$
- $\rho \to \pm 1 \iff$ X & Y become more correlated
- $\rho \to 0 \iff$ X & Y become uncorrelated

We'll measure the sample correlation coefficient, called r:
$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$

Notice what is hiding inside of r.

[Figure 12.15: Data sets to illustrate the correlation coefficient, panels (a) through (f)]

Properties of r
- $r = \pm\sqrt{r^2}$
- as $n \to \infty$, $r \approx \rho$
- related to the LS regression coefficient: $b_1 = r \, \frac{s_Y}{s_X}$
- the test of $H_0: \beta_1 = 0$ is numerically equivalent to the test of $H_0: \rho = 0$: $t_s = \frac{b_1}{SE_{b_1}} = r \sqrt{\frac{n - 2}{1 - r^2}}$

Example 12.19: Is calcium in blood related to blood pressure?
Y = calcium concentration in blood platelets
X = blood pressure (average of systolic and diastolic)

[Figure 12.14: Blood pressure and platelet calcium for 38 persons with normal blood pressure]

What do you think r is for these data?

Plots Depicting the Sensitivity of r to Outliers

[Figures: three scatterplots showing how r changes with outliers]
Confidence Interval for $\rho$

We can build a $(1 - \alpha)$ confidence interval on $\rho$ if we extend the bivariate model to include bivariate normality. We'll assume $X \sim N(\mu_X, \sigma_X^2)$, $Y \sim N(\mu_Y, \sigma_Y^2)$ with $Corr(X, Y) = \rho$.

The following figures depict bivariate standard normal distributions with different correlations.

[Figures: bivariate normal densities]

Unfortunately, there is no easy way to build good intervals directly on $\rho$. Instead, we transform between different scales for $\rho$.

Definition: The Fisher Z transform is defined as
$Z_r = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right)$
Under the bivariate normal model, $Z_r \sim N\left(Z_\rho, \frac{1}{n - 3}\right)$ (approximately). The inverse function is
$r = \frac{e^{2 Z_r} - 1}{e^{2 Z_r} + 1}$

Compute a confidence interval for $Z_\rho$ and then invert using the inverse formula to get a $(1 - \alpha)$ confidence interval on $\rho$.

CI on $Z_\rho$: $Z_r \pm z_{\alpha/2} \sqrt{\frac{1}{n - 3}}$

CI on $\rho$: $\frac{e^{2\left(Z_r - z_{\alpha/2}/\sqrt{n-3}\right)} - 1}{e^{2\left(Z_r - z_{\alpha/2}/\sqrt{n-3}\right)} + 1} < \rho < \frac{e^{2\left(Z_r + z_{\alpha/2}/\sqrt{n-3}\right)} - 1}{e^{2\left(Z_r + z_{\alpha/2}/\sqrt{n-3}\right)} + 1}$

Compute a 95% interval for $\rho$, the true correlation of platelet calcium and blood pressure.

Some Final Notes on Regression and Correlation

- Use (conditional) regression analysis when prediction of Y from X is desired.
  - Random sampling from the conditional distribution of Y|X is required if $b_0$, $b_1$, and $s_{Y|X}$ are to be viewed as estimates of the parameters $\beta_0$, $\beta_1$, and $\sigma_{Y|X}$.
  - Y must be random, and X need not be random.
- Use correlation analysis when association between X and Y is under study.
  - The bivariate random sampling model is required if r is to be viewed as an estimate of the population parameter $\rho$.
  - X and Y both must be random.
- Always plot the data. Why? Because r is very sensitive to extreme observations and outliers, so BE CAREFUL.
- r is also known as the Pearson Product-Moment Correlation Coefficient; a distribution-free version of r exists, known as Spearman's Rank Correlation Coefficient.

Independent Samples: Inference on $\mu_1 - \mu_2$

We learned how to make inference on one population mean, $\mu$, through a confidence
interval and then through a hypothesis test. We then learned to make inference on the difference between two population means, $\mu_1 - \mu_2$, when we are in the special case of the paired (dependent) samples model. We would like to make inference on the difference between two population means, $\mu_1 - \mu_2$, when we have independent samples.

Recall, in the dependent samples setting, we took the difference of the columns of data and worked only with the difference column. This collapsed the two-sample problem into a one-sample problem. In the independent samples case, we cannot employ this same strategy; we might not even have equal sample sizes.

For the hypothesis test on $\mu_1 - \mu_2$, we'll need our point estimate and the standard error of the point estimate. For the confidence interval, we'll also need a critical point.

Point Estimate of $\mu_1 - \mu_2$: $\bar{Y}_1 - \bar{Y}_2$ is unbiased for $\mu_1 - \mu_2$.

Standard Error of the Point Estimate: In the special case where we know we have equal population variances, $\sigma_1^2 = \sigma_2^2$, we'll calculate $SE_{\bar{Y}_1 - \bar{Y}_2}$ a little differently. Recall that in the case where each population can have its own variance, we have the following:

General Case: $SE_{\bar{Y}_1 - \bar{Y}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

In the case where we know $\sigma_1^2 = \sigma_2^2 = \sigma^2$, we'll estimate that common variance by weighting the sample variances by the degrees of freedom available for each column in the following way:

$SE_{\bar{Y}_1 - \bar{Y}_2} = \sqrt{\frac{s_{pool}^2}{n_1} + \frac{s_{pool}^2}{n_2}} = s_{pool} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, where $s_{pool}^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$

Critical Point

General Case: We cannot find the exact degrees of freedom, so we use a highly accurate approximation to degrees of freedom called the Welch-Satterthwaite approximation:

$df = \frac{\left(SE_{\bar{Y}_1}^2 + SE_{\bar{Y}_2}^2\right)^2}{\frac{SE_{\bar{Y}_1}^4}{n_1 - 1} + \frac{SE_{\bar{Y}_2}^4}{n_2 - 1}}$, where $SE_{\bar{Y}_i} = \frac{s_i}{\sqrt{n_i}}$

- If normality of the sample mean holds for each sample, then this approximation to degrees of freedom yields a highly accurate approximate confidence interval.
- If normality does not hold, the central limit theorem may still apply, but sample sizes must grow much larger ($n_1, n_2 \ge 20$).

Equal Population Variances ($\sigma_1^2 = \sigma_2^2 = \sigma^2$): $df = n_1 + n_2 - 2$
- If normality
of the sample mean holds for each sample and $\sigma_1^2 = \sigma_2^2 = \sigma^2$, then using the standard error with the pooled standard deviation yields an exact confidence interval.

TI-83/84: STAT > TESTS > 2-SampTInt (or 2-SampTTest for a hypothesis test) > choose Data or Stats > enter the appropriate information. The option "Pooled: No Yes" asks whether you want to pool the variances; choose Yes only if you have reason to believe the two populations have equal variances.

TI-89: APPS > Stats/List Editor > F7 for a confidence interval or F6 for a hypothesis test. The option "Pooled: No Yes" asks whether you want to pool the variances; choose Yes only if you have reason to believe the two populations have equal variances.

Example (Exercise 7.15): A study was conducted to determine the psychoactive effect of the drug Pargyline on the feeding behavior of the black blowfly Phormia regina. The experimenters used two groups of flies to conduct this experiment: one group injected with Pargyline and a control group injected with saline. The accompanying table summarizes the sucrose consumption (mg) in 30 minutes.

             Control   Pargyline
$\bar{y}$    14.9      46.5
$s$          5.4       11.7
$n$          900       905

Construct and interpret a 99% confidence interval for the difference in population means. You may proceed as though the assumptions were checked and deemed acceptable.

Example: A researcher investigated the effect of green light in comparison to red light on the growth rate of bean plants. The following table shows data on the growth of plants (inches from the soil to the first branching stem) two weeks after germination.

[Table and figure: summary statistics and Normal Q-Q plots for the Green and Red groups (Sample Quantiles vs. Theoretical Quantiles)]

Use the Q-Q plots to check the assumption of normality.

Test whether there is evidence that the mean growth is different under the green and red light conditions.

Chi-Square for Goodness of Fit

Pearson's original idea was to use $\chi^2_s$ to assess
divergence in the "O"s against a modeled value for E. Any model could be proposed, not just one for $r \times c$ tables. In this sense $\chi^2_s$ measures the goodness of fit of the model for E. We'll use the usual $\chi^2_s$ statistic we used in the analysis of contingency tables, with intuitive expected counts and df = (number of categories) - 1.

Example 10.1: In genetics we believe that offspring characters appear in regular ratios. E.g., in snapdragons, the offspring of two pink hybrid parents produce P(red) = 1/4, P(pink) = 1/2, P(white) = 1/4. Another way to write this model is to say snapdragon offspring appear in a 1:2:1 red to pink to white ratio. Is there any evidence this model is not correct? Suppose a sample of n = 234 snapdragons crossed from pink parents yields:

Red   Pink   White
54    122    58

Conduct a goodness of fit test at the $\alpha = 0.10$ level of significance.

Compound Null Hypothesis

When the chi-square test for contingency tables was introduced, we discussed the notion of a compound null hypothesis. Let us examine the compound null hypothesis in the goodness of fit setting. Consider the snapdragon example from Example 10.1. The null hypothesis is

$H_0$: Pr{Red} = 0.25, Pr{Pink} = 0.5, Pr{White} = 0.25

How many independent assertions are in this $H_0$?

This is a compound null hypothesis since it makes more than one independent assertion. In the case of a compound null hypothesis, the alternative hypothesis is necessarily a nondirectional alternative. For a directional alternative, which direction would each category go? The chi-square test in this case is detecting deviations in any direction and does not have the capability to do otherwise. Then the $H_A$ is necessarily

$H_A$: Pr{Red} $\neq$ 0.25 and/or Pr{Pink} $\neq$ 0.5 and/or Pr{White} $\neq$ 0.25

Densities and Random Variables

As was mentioned before in class, a random variable is the measured outcome of some random process. When the random variable is quantitative, we can have either a discrete or a continuous random variable.

When we have a discrete distribution of a random variable, we can list the probability
associated with each possible outcome. As an example, consider a certain population of the freshwater sculpin Cottus rotheus. The distribution of the random variable Y = number of tail vertebrae is shown in the following table:

$y_i$           20    21    22    23
$P(Y = y_i)$    .03   .51   .40   .06

We've listed out the entire probability distribution (all the probabilities add up to one). We could graphically represent a discrete distribution with a frequency histogram.

Construct a frequency histogram for the number of tail vertebrae in this population of sculpin.

In the case where we have a continuous random variable, we want a different tool to represent this type of distribution. We call this representation of a continuous random variable's distribution a density curve.

Consider the distribution of blood glucose levels measured one hour after a subject from a certain population of women drinks 50 mg of glucose dissolved in water. Example 3.27 depicts this distribution with bin width set to 10, 5, and "0", respectively.

[Figure: blood glucose distribution drawn with bin widths 10 and 5, and as a smooth density curve]

Notice the probability density curve is like having a probability histogram where we're squeezing the bin width down to 0 (an infinite number of bins). Then the way we get probabilities associated with continuous random variables is still an area, just like in the frequency histogram. But we need area under a curve, so we need calculus.

[Figure: density curve with shaded area = proportion of Y values between a and b]

Integration from a to b of the function that results in the probability density curve will give us the probability of being between those two values under the specified distribution. Fortunately, your TI calculator will be doing the integration for you in this class. More on that later.

Now think back to your math classes.

Questions: What is the length of a single point? What is the area of a line?

Answers: A single point does not have any length. A line has no area.

The following facts pertaining to probabilities associated with a continuous random variable are consequences of the fact that a line doesn't have area:

$P(Y = a) = 0$ and $P(Y = b) = 0$, since the
area of a line is zero.

So $P(Y \leq a) = P(Y < a) + P(Y = a) = P(Y < a)$.

And $P(a \leq Y \leq b) = P(a < Y \leq b) = P(a \leq Y < b) = P(a < Y < b)$.

Mean and Variance of a Discrete Random Variable

The population mean of a discrete random variable Y is given by

$\mu_Y = \sum y_i P(Y = y_i)$

The mean of a random variable Y is also known as the expected value of Y, denoted $E(Y)$.

The population variance of a discrete random variable Y is given by

$\sigma_Y^2 = \sum (y_i - \mu_Y)^2 P(Y = y_i)$

One can show that $\sigma_Y^2 = E(Y^2) - [E(Y)]^2 = E(Y^2) - \mu_Y^2$.

Then the population standard deviation is $\sigma_Y = \sqrt{\sigma_Y^2}$.

Example 3.35: For the number of tail vertebrae in the population of freshwater sculpin Cottus rotheus, find $\mu_Y$ and $\sigma_Y$.

$y_i$           20    21    22    23
$P(Y = y_i)$    .03   .51   .40   .06

Introduction to Hypothesis Testing: A Hypothesis Test for $\mu$

Heuristic: Hypothesis testing works a lot like our legal system. In the legal system, the accused is innocent until proven guilty. After examining the evidence, he is found either guilty or "not guilty" by a jury of his peers. How much evidence does there need to be to convict? The answer to this is different for every jury. Also, this is not a perfect process, meaning mistakes are made. A mistake can be made by sending an innocent man to prison or by letting a guilty man go free. Let's put these ideas into the framework of hypothesis testing.

Statistical: Let's say a researcher has reason to believe the population mean is different from what has been accepted. The belief that has been around for some time (the status quo) will be called the null hypothesis, denoted by $H_0$. The belief that the true mean may actually be different from this null hypothesized belief is called the alternative hypothesis, denoted by $H_A$.

Stating the hypothesis: We'll state our null hypothesis in the following way:

$H_0: \mu = \mu_0$ (the "=" sign always goes with $H_0$)

Then the alternative hypothesis can be one of the following three statements:

$H_A: \mu \neq \mu_0$
$H_A: \mu > \mu_0$
$H_A: \mu < \mu_0$

Finding the evidence: We'll use $\bar{X}$ and knowledge of its distribution to gather our evidence. Intuitively, we know that the further away $\bar{X}$ is from $\mu_0$, the
more evidence we have that the null hypothesis is not true. $t_s$ will be our test statistic. Let

$t_s = \frac{\bar{X} - \mu_0}{SE_{\bar{X}}}$, where $SE_{\bar{X}} = s/\sqrt{n}$

If the null hypothesis were true (remember: innocent until proven guilty), then $t_s \sim t_{df = n - 1}$ and we can compute probabilities associated with it.

o When $\bar{X}$ is close to $\mu_0$, then $t_s$ is close to 0.
o When $\bar{X}$ is larger than $\mu_0$, then $t_s$ is positive.
o When $\bar{X}$ is smaller than $\mu_0$, then $t_s$ is negative.

Diagram of finding the evidence for the three possible tests:

$H_0: \mu = \mu_0$ vs. $H_A: \mu > \mu_0$
$H_0: \mu = \mu_0$ vs. $H_A: \mu < \mu_0$
$H_0: \mu = \mu_0$ vs. $H_A: \mu \neq \mu_0$

Definition: The P-value of a test statistic is the probability, given that the null hypothesis is true, of observing a test statistic that extreme or more extreme in the direction of the alternative hypothesis.

The Decision: So the P-value quantifies how extreme our test statistic would be, given that the null hypothesis is true. This is evidence against the null hypothesis.

Question: How much evidence is needed to conclude the null hypothesis is incorrect?

Answer: This varies from researcher to researcher, and we'll set a prespecified cutoff, $\alpha$, before we conduct the test of hypothesis. We call this the significance level of the test.

We reject $H_0$ when $P \leq \alpha$.
We fail to reject $H_0$ when $P > \alpha$.

Steps for Carrying Out a Hypothesis Test
1. Set $\alpha$ (significance level)
2. State hypotheses
3. Compute test statistic
4. Compute P-value
5. Make decision
6. State conclusion in context of the setting

TI-83/84: STAT > scroll over to TESTS > scroll down to T-Test > ENTER > choose Data if you have the data set, or Stats if you have the sample mean, SD, and size > $\mu_0$ is from $H_0$; "$\mu$:" is your $H_A$ > choose Calculate > ENTER

TI-89: APPS > Flash Apps > Stats/List Editor > F6 (2nd F1) > scroll down to T-Test > enter the appropriate information in the prompts

Example: The National Center for Health Statistics reports that the mean systolic blood pressure for males aged 35-44 is 128 mmHg. A medical researcher believes the mean systolic blood pressure for male executives in this group is lower than 128 mmHg. A random sample of 72 male
executives in this age group results in a sample mean of 126.1 mmHg and a standard deviation of 15.2 mmHg. Is there evidence to support the researcher's claim? Test this hypothesis at the 0.05 level of significance.

Compute the P-value for $H_A: \mu \neq \mu_0$ and for $H_A: \mu > \mu_0$.

Errors

When we make a decision (reject or fail to reject $H_0$), are we always correct? We can make two types of errors in hypothesis testing.

Definition: The False Positive Rate (aka the Type I Error Rate) of a test is the probability of rejecting $H_0$ when it is true.
NOTATION: $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$

Definition: The False Negative Rate (aka the Type II Error Rate) of a test is the probability of failing to reject $H_0$ when it is false.
NOTATION: $\beta = P(\text{fail to reject } H_0 \mid H_0 \text{ false})$

TABLE 7.10 Possible Outcomes of Testing $H_0$

                                True Situation
Our Decision            $H_0$ true        $H_0$ false
Do not reject $H_0$     Correct           Type II error
Reject $H_0$            Type I error      Correct

Choosing $\alpha$

If we think of $\alpha$, the significance level of a test, as the probability of rejecting the null hypothesis given that the null hypothesis is actually true, then we would certainly want to choose a very small $\alpha$ to guard against this type of error. Right? It turns out we cannot simultaneously minimize both $\alpha$ and $\beta$. Traditionally we attend to $\alpha$.

o If a false positive error is worse than a false negative, drive $\alpha$ very low (0.01, 0.005).
o If a false negative error is worse than a false positive, let $\alpha$ rise (0.10 or even 0.15).

If you're not sure or can't distinguish, then a traditional middle ground is $\alpha = 0.05$.

Example: Suppose some sort of immunotherapy is being proposed as an effective therapy against cancer. Suppose the immunotherapy is tested on cancer patients who are already taking chemotherapy, and some sort of measure of change in response (change in tumor size) is being measured, with

$H_0$: no effect of immunotherapy
$H_A$: beneficial effect of immunotherapy

A Type I Error would waste a lot of patients' money on useless immunotherapy.
A Type II Error would dismiss an effective
cure as useless.

Deciding which type of error is worse isn't always easy to determine.

Power

Definition: The power of a test is the probability of rejecting $H_0$ when it is false.
NOTATION: Power $= P(\text{reject } H_0 \mid H_0 \text{ false})$

Notice: $P(\text{reject } H_0 \mid H_0 \text{ false}) = 1 - P(\text{fail to reject } H_0 \mid H_0 \text{ false}) = 1 - \beta$

So power is the complement of false negative error. We can estimate the power of a hypothesis testing procedure in advance (which is beyond the scope of this course), and often we try to design experiments so that power $= 1 - \beta \geq 0.80$.

Sampling Distributions

Suppose we use the sample mean $\bar{Y}$ as our "best guess" of the population mean $\mu$. We assume it will be somewhat close to the target, but not exactly on target every time. It is also intuitive that if we were to take another sample from the same population and calculate the sample mean for that sample, it would not only be off the target but also be slightly different from the first sample mean.

Definition: The variability among random samples from the same population is called sampling variability.

Definition: A probability distribution that characterizes some aspect of sampling variability is called a sampling distribution.

A sampling distribution tells us how close the resemblance between the sample and population is likely to be.

You can imagine describing the sampling distribution of a statistic by repeatedly taking samples from the same population over and over again, computing the statistic, and plotting all the statistics on a probability histogram. This distribution would have a shape, a mean, a standard deviation, etc.

We'll study the sampling distributions of $\hat{p}$ and $\bar{Y}$.

Sampling Distribution of $\hat{p}$

Recall: if $Y \sim Bin(n, p)$, we can use $\hat{p} = Y/n$ to estimate the population proportion p if it is unknown. Since Y is random, so is $\hat{p}$.

Example 5.4: Suppose P(superior vision [20/15]) = 0.3, and let Y denote the number of people with superior vision. Suppose we have a random sample of 2 people. Find the sampling distribution of $\hat{p}$.

Notice what happens as the sample size gets larger.

Figure 5.5: Sampling distributions of $\hat{p}$
for various values of n.

Sampling Distribution of $\bar{Y}$

There are two important theorems we'll be applying throughout the rest of the semester.

Sampling Distribution of $\bar{Y}$: Given a random sample $Y_1, Y_2, \ldots, Y_n$ where $E(Y_i) = \mu$ and $Var(Y_i) = \sigma^2$, we have
(i) $\mu_{\bar{Y}} = E(\bar{Y}) = \mu$
(ii) $\sigma_{\bar{Y}}^2 = Var(\bar{Y}) = \frac{\sigma^2}{n}$
(iii) If $Y_i \sim$ iid $N(\mu, \sigma^2)$, then $\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

Example 5.9: Let Y denote the weight of seeds, with $Y \sim N(500, 14400)$. Find $P(\bar{Y} > 550)$ for a random sample with n = 4.

Central Limit Theorem (CLT): Given a random sample $Y_1, Y_2, \ldots, Y_n$ where $E(Y_i) = \mu$ and $Var(Y_i) = \sigma^2$,

$\bar{Y} \to N\left(\mu, \frac{\sigma^2}{n}\right)$ as $n \to \infty$

The CLT is approximately true for finite n, and the approximation improves as n gets larger. Sometimes we only need a few observations for it to kick in; sometimes we need more. This depends on how "far from normal" the data are to begin with.

[Figure 5.16: Sampling distributions of $\bar{Y}$ for samples of various sizes n from the time score population]

Example 5.13: Y denotes the number of eye facets in a fruit fly. Clearly Y is not continuous and hence cannot be normally distributed. Figure 5.14 illustrates how the CLT still kicks in.

[Figure 5.14: Sampling distributions of $\bar{Y}$ for samples from the Drosophila eye-facet population]

Standard Errors

Definition: If an estimator $\hat{\theta}$ of an unknown parameter $\theta$ has the property that $E(\hat{\theta}) = \theta$, we say $\hat{\theta}$ is an unbiased estimator for $\theta$. If an estimator is not unbiased, we say it is biased.

For example, since $E(\bar{Y}) = \mu$, $\bar{Y}$ is unbiased for $\mu$.

Definition: The standard error of a point estimator is the estimated standard deviation of the point estimator: $SE_{\hat{\theta}} = \sqrt{\widehat{Var}(\hat{\theta})}$

Definition: The Standard Error of the Mean (SEM) is $SE_{\bar{Y}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$

Example 6.2: Let Y denote stem length of soybean plants (cm).

TABLE 6.1 Stem Length of Soybean Plants
Stem Length (cm): 20.2 22.9 23.3 20.0 19.4 22.0 22.1 22.0 21.9 21.5 19.7 21.5 20.9
$\bar{y} = 21.34$ cm, $s^2 = 1.486$ cm$^2$, $n = 13$

Find $SE_{\bar{Y}}$.

DO NOT confuse the SE with the SD. In Example 6.2 the SD of
the sample was $s = \sqrt{1.486} = 1.22$ cm, but the SEM was $1.22/\sqrt{13} = 0.34$ cm.

Notice here again that as n increases, the SEM decreases: more precision in larger samples.

95% Interval for Difference in Population Proportions $p_1 - p_2$ (Agresti-Caffo)

Returning to the independent two-sample case, suppose now the data are from a binomial setting, with $Y_1 \sim Bin(n_1, p_1)$ independent of $Y_2 \sim Bin(n_2, p_2)$. Of interest is the difference $p_1 - p_2$.

A good point estimator for $p_1 - p_2$ is the difference in sample proportions. But again we will use the previous AC strategy for this confidence interval. Presented here is a 95% confidence interval.

Point Estimator: Add one success and one failure to each sample.

$\tilde{p}_1 - \tilde{p}_2$, where $\tilde{p}_1 = \frac{Y_1 + 1}{n_1 + 2}$ and $\tilde{p}_2 = \frac{Y_2 + 1}{n_2 + 2}$

Critical Point: $z_{\alpha/2} = 1.96$

Standard Error of Point Estimator:

$SE_{\tilde{p}_1 - \tilde{p}_2} = \sqrt{\frac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 + 2}}$

95% Confidence Interval for $p_1 - p_2$:

$(\tilde{p}_1 - \tilde{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 + 2}}$, where $\tilde{p}_1 = \frac{Y_1 + 1}{n_1 + 2}$ and $\tilde{p}_2 = \frac{Y_2 + 1}{n_2 + 2}$

Note: Presented here is a 95% confidence interval; generalizations exist for other levels of confidence.

Example 10.37: Angina pectoris is a chronic condition in which the sufferer has periodic attacks of chest pain. In a study to evaluate the effectiveness of the drug Timolol in preventing angina attacks, patients were randomly allocated to receive a daily dosage of either Timolol or placebo for 28 weeks.

TABLE 10.4 Response to Angina Treatment

                     Treatment
                     Timolol   Placebo
Angina free          44        19
Not angina free      116       128
Total                160       147

Let $Y_1$ = number of patients angina free after Timolol.
Let $Y_2$ = number of patients angina free after placebo.

Compute and interpret a 95% confidence interval for the difference in the proportion angina free under Timolol versus placebo.

Assumptions and Relation between Confidence Interval and Hypothesis Test, on t-based Inference for $\mu$

Assumptions for Validity of Confidence Interval and Hypothesis Test for $\mu$
o Data must be from a random sample from a large population
o Observations in the sample must be independent of each other
o n small: population
distribution must be approximately normal
o n large: population need not be approximately normal (CLT kicks in)

A statistical procedure is said to be robust if the results of the procedure are not affected very much when the conditions for validity are violated. The t procedures are fairly robust to non-normality, except in the case of outliers or strong skewness. Why?

The following are some loose guidelines:

Relationship between Confidence Interval and Hypothesis Test

Draw two pictures: the hypothesis test corresponding to $H_A: \mu \neq \mu_0$ when we (1) reject $H_0$ and (2) fail to reject $H_0$.

When we fail to reject, we have the following inequality:

$-t_{\alpha/2} < \frac{\bar{y} - \mu_0}{SE_{\bar{Y}}} < t_{\alpha/2}$, which rearranges to $\bar{y} - t_{\alpha/2} SE_{\bar{Y}} < \mu_0 < \bar{y} + t_{\alpha/2} SE_{\bar{Y}}$

And this should look familiar. So the events that lead to the decision to fail to reject $H_0$ for the two-sided test are exactly the events that form the $(1 - \alpha)$ confidence interval for $\mu$.

The Moral: If the confidence interval contains $\mu_0$, then we would fail to reject $H_0$ for the two-sided test of $H_0: \mu = \mu_0$ against $H_A: \mu \neq \mu_0$, and vice versa.

Sign Test

Suppose we have two dependent samples and we'd like to carry out a hypothesis test to detect a difference. After we check assumptions, we could conduct a t-test and make inference on the difference between the population means, $\mu_1 - \mu_2$.

Question: What can we do if the assumption of normality is violated?

Answer: In practice we would typically try a transformation first. If the transformed data allow the normality of the sample mean assumption to be met, then we can conduct a t-test on the transformed data.

If a transformation doesn't work, i.e., there is still a violation of the normality assumption even on the transformed data, then we could apply a nonparametric (aka distribution-free) test. We'll learn the sign test.

Note: We try a transformation first since the t-test has more power than the sign test.

Forming Hypotheses: Since the sign test is nonparametric (no parameters), the hypothesis is written in words. Our $H_0$ will be that the distributions are the same, i.e., there
is no difference, and $H_A$ will be that the distributions are shifted in some way.

Test Statistic: The sign test uses the sign of the differences, unlike the paired t-test, which uses the sign and magnitude of the differences. In the case of the sign test we'll throw out differences that are 0, and for the remainder of the data set we have

$B_s = N_+$ if $H_A: Y_1 > Y_2$
$B_s = N_-$ if $H_A: Y_1 < Y_2$
$B_s = \max(N_+, N_-)$ if $H_A: Y_1 \neq Y_2$

where $N_+$ and $N_-$ are the numbers of positive and negative differences.

P-value: To find the P-value for the sign test, we first need to find the distribution of the test statistic $B_s$. Remember, we find P-values under the assumption that the null hypothesis is true. So if the null hypothesis were true, then we'd expect each difference to be ______.

Notice each difference can be positive or negative (we threw out the "0" differences).

What is P(difference being positive) under the null hypothesis of no difference?

What is P(difference being negative) under the null hypothesis of no difference?

Then $B_s \sim$ ______, and

$P =$ ______ if $H_A: Y_1 > Y_2$
$P =$ ______ if $H_A: Y_1 < Y_2$
$P =$ ______ if $H_A: Y_1 \neq Y_2$

Example 9.12:

TABLE 9.7 Skin Graft Survival Times
Y = skin graft survival in days
Group 1: HL-Antigen compatibility "close"
Group 2: HL-Antigen compatibility poor

            HL-A Compatibility
Patient     Close ($Y_1$)   Poor ($Y_2$)   Sign of $d = Y_1 - Y_2$
1           37              29
2           19              13
3           57              15
4           93              26
5           16              11
6           …               …
7           …               …
8           63              43
9           29              …
10          60              42
11          18              19

Patients 3 and 10 died before completing the observation of skin graft survival time. This data set has the "censored" feature, since the patient's death prevents us from observing past a certain time point.

[Figure: Normal Q-Q plot of the noncensored data (Sample Quantiles vs. Theoretical Quantiles)]

A Q-Q plot of the noncensored data shows a clear violation of the normality assumption, and we cannot invoke the CLT. Why?

Conduct a sign test at the $\alpha = 0.05$ significance level to determine whether skin grafts with close HL-Antigen compatibility tend to survive longer.

Background Definitions and Graphical Displays

Motivation: Why analyze data?
o Clinical trials/drug development: compare existing treatments with new methods
o Agriculture:
enhance crop yields, improve pest resistance
o Ecology: study how ecosystems develop in response to environmental impacts
o Lab studies: learn more about biological tissue/cellular activity

Statistics is the science of collecting, summarizing, analyzing, and interpreting data. Our goal is to understand the underlying biological phenomena that generate the data.

Random Variables

Data are generated by some underlying random process or phenomenon. Any datum (data point) represents the outcome of a random variable. We represent random variables with capital letters, usually X, Y, and Z.

Example: Let X = weight of a newborn baby. Let Y = weight of the baby at one week old.

Types of Random Variables
Qualitative
o Categorical or nominal
o Ordinal
Quantitative
o Discrete
o Continuous

Definition: A sample is a collection of subjects upon which we measure one or more variables.

Definition: The sample size is the number of subjects in a sample. The sample size in a study is almost always denoted by the lowercase letter "n".

Definition: The observational unit is the type of subject being sampled. Observational units could be a baby, a moth, a Petri dish, etc.

Definition: An observation is a recorded outcome of a variable from a random sample. We represent observations with lowercase letters. For example, suppose we are measuring the outcome of the random variable X = weight for 10 newborn babies. Our observations would be denoted by $x_1, x_2, \ldots, x_{10}$.

Notation: $x_1$ is the first observation.

Example: For the following setting, identify the (i) variable(s) in the study, (ii) type of variable, (iii) observational unit, and (iv) sample size.

(From Exercise 2.5) In a study of schizophrenia, researchers measured the activity of the enzyme monoamine oxidase (MAO) in the blood platelets of 18 patients. The results were recorded as nmols benzylaldehyde product per $10^8$ platelets.

Definition: A frequency distribution is a summary display of the frequencies of occurrence of each value in a sample.

Definition: A relative frequency is a raw frequency divided by n (the sample size).

Example 2.4: Y =
number of piglets surviving 21 days (litter size at 21 days). What is the sample size?

[Table: Number of piglets and Frequency (number of sows)]

Graphical Displays

After we collect data, we hope that it tells us something. We look at it in order to learn something about the process that generated it. One way to summarize or describe the data is through a graphical display.

A graphical display should always be as clear as possible. It should be well labeled, with a title, a key if necessary, and labels on the graphical display itself; units should be clear from your display, and the sample size should be clear. Do not overlabel your graphical display.

A dot plot is a graphical display where dots indicate observed data in a sample.

[Figure 2.4: Surviving Piglets at 21 Days (n = 36 sows); dot plot, horizontal axis: Number of piglets, 5 to 14]

A histogram is a graphical display where bars or bins replace the dots from a dot plot.

[Figure 2.5: Surviving Piglets at 21 Days (n = 36 sows); histogram, vertical axis: Frequency, horizontal axis: Number of piglets, 5 to 14]

Warning: A histogram can be a very misleading display; we'll see an example of this in a minute.

A stemplot, or stem and leaf plot, is a lot like a dot plot, usually turned on its side. We use a stemplot when we have more detailed information to replace the dots with. The stems are the core values of the data, and the leaves are the last digits of the data points. We'll put the leaves in numerical order in this class. The resulting plot is called an ordered stemplot, but we'll never use the unordered kind, so we'll refer to ours simply as a stemplot. A stemplot should always include a key with units.

Example (Exercise 2.5, modified): In a study of schizophrenia, researchers measured the activity of the enzyme monoamine oxidase (MAO) in the blood platelets of 18 patients. The results were recorded as nmols benzylaldehyde product per $10^8$ platelets.

41 52 68 73 74 78 78 84 87 97 99 106 107 119 127 142 145
188

Create a stemplot for these data.

Describing the shape of a frequency distribution

We can see the shape of a frequency distribution by looking at an appropriate graphical display. The following are some basic words we use to describe the shape of a frequency distribution:
o symmetric, asymmetric
o bell-shaped, skewed left, skewed right
o unimodal, bimodal

Some examples:

Example (From Exercise 2.13): Trypanosomes are parasites which cause disease in humans and animals. In an early study of trypanosome morphology, researchers measured the lengths (μm) of 500 individual trypanosomes taken from the blood of a rat. The results are summarized in the accompanying frequency distribution.

The following is the default histogram returned by a statistical software package (not well labeled yet). Describe the shape of the frequency distribution.

[Figure: default histogram of trypanosome length; horizontal axis: length, 15 to 35]

This next histogram is returned by the same software package for the trypanosome data with the bin width changed. How would you describe the shape of the distribution now? Discuss the changes.

[Figure: histogram of trypanosome length with the bin width changed]
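The effect of bin width on the apparent shape of a distribution can be tried out on a small data set. The trypanosome lengths are not listed in these notes, so the sketch below uses the 18 MAO values from the stemplot example instead, copied as printed (if decimal points were lost, every value simply scales by 10 and the shape is unchanged). The function names `bin_counts` and `text_histogram` and the two bin widths are illustrative choices, not part of the course materials.

```python
from collections import Counter

# MAO activity values for the 18 patients, copied as printed in the notes.
# Assumption: if decimal points were lost (41 -> 4.1), every value scales
# by 10 and the shape of the display is unchanged.
mao = [41, 52, 68, 73, 74, 78, 78, 84, 87, 97, 99,
       106, 107, 119, 127, 142, 145, 188]

def bin_counts(values, width):
    """Count observations per bin [left, left + width), keyed by left edge.

    Only non-empty bins are returned.
    """
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

def text_histogram(values, width):
    """Return one text row per non-empty bin, one '*' per observation."""
    rows = []
    for left, count in bin_counts(values, width).items():
        rows.append(f"{left:>3}-{left + width - 1:<3} {'*' * count}")
    return "\n".join(rows)

# Same data, two bin widths (hypothetical choices for illustration).
for width in (20, 50):
    print(f"bin width = {width}")
    print(text_histogram(mao, width))
```

With the narrower bins, the rows suggest a main cluster with a tail to the right and a gap before the largest value; with the wider bins, most of that detail collapses into one tall bin. This is the same kind of change the two trypanosome histograms ask you to discuss.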

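Returning to the snapdragon goodness of fit test (Example 10.1), the following is a minimal sketch of the arithmetic behind the test, using the observed counts 54, 122, 58 and the 1:2:1 null model from the notes. The function names are illustrative. The P-value helper is an assumption-limited shortcut: it relies on the fact that for df = 2 the upper-tail area of the chi-square distribution is exactly exp(-x/2), so it does not apply to other degrees of freedom, where the calculator or a table would be used.

```python
import math

# Observed counts from Example 10.1 (n = 234 snapdragon offspring).
observed = {"red": 54, "pink": 122, "white": 58}
# Null model: red : pink : white = 1 : 2 : 1.
null_probs = {"red": 0.25, "pink": 0.50, "white": 0.25}

def chi_square_gof(observed, null_probs):
    """Return (statistic, df) for the chi-square goodness of fit test."""
    n = sum(observed.values())
    stat = 0.0
    for category, obs in observed.items():
        expected = n * null_probs[category]  # E = n * hypothesized probability
        stat += (obs - expected) ** 2 / expected
    return stat, len(observed) - 1  # df = number of categories - 1

def chi_square_pvalue_df2(stat):
    """Upper-tail area for df = 2 ONLY, where it equals exp(-x / 2)."""
    return math.exp(-stat / 2)

stat, df = chi_square_gof(observed, null_probs)
p_value = chi_square_pvalue_df2(stat)
alpha = 0.10

print(f"chi-square = {stat:.3f}, df = {df}, P = {p_value:.3f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

The decision line follows the rule from the hypothesis testing section: reject $H_0$ when $P \leq \alpha$, otherwise fail to reject.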

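The Agresti-Caffo interval for $p_1 - p_2$ can also be checked by direct computation. The sketch below applies the "add one success and one failure to each sample" recipe from the notes to the Timolol data in Example 10.37 (44 of 160 angina free under Timolol, 19 of 147 under placebo). The function name `agresti_caffo_95` is illustrative, and the interval is the 95% version only, with the critical point fixed at 1.96 as in the notes.

```python
import math

def agresti_caffo_95(y1, n1, y2, n2):
    """95% Agresti-Caffo confidence interval for p1 - p2.

    One success and one failure are added to each sample before the
    proportions and the standard error are computed.
    """
    p1 = (y1 + 1) / (n1 + 2)
    p2 = (y2 + 1) / (n2 + 2)
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    z = 1.96  # critical point for 95% confidence
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Example 10.37: Timolol (sample 1) versus placebo (sample 2).
lower, upper = agresti_caffo_95(44, 160, 19, 147)
print(f"95% CI for p1 - p2: ({lower:.3f}, {upper:.3f})")
```

If the resulting interval lies entirely above 0, that is consistent with a higher proportion angina free under Timolol; the interpretation in context is still left as part of the example.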