STAT METH RESEARCH 1
This 115 page Class Notes was uploaded by Golden Bernhard on Friday September 18, 2015. The Class Notes belongs to STA 6166 at University of Florida taught by Staff in Fall. Since its upload, it has received 7 views. For similar materials see /class/206575/sta-6166-university-of-florida in Statistics at University of Florida.
Date Created: 09/18/15
ANOVA III

Randomized Complete Block Designs (RCBD)

Defn: A Randomized Complete Block Design is a variant of the completely randomized design that we recently learned. In this design, blocks of experimental units are chosen so that the units within a block are more similar to each other (homogeneous) than to units in other blocks. In a complete block design there are at least t experimental units in each block, where t is the number of treatments.

Examples of blocks:
1. a litter of animals could be considered a block, since littermates have similar genetic structure, similar prenatal/parental care, etc.
2. a field or pasture that can be divided into quadrants, since soil properties, environmental conditions, etc. are similar within a field
3. a greenhouse with multiple benches, since environmental conditions are usually more similar within a greenhouse than between greenhouses
4. a year in which the experiment is performed, since environmental conditions are similar within a year

Example of an RCBD: A nutritionist is interested in comparing the effect of three diets on weight gain in piglets. To perform the experiment, the researcher chooses 10 litters, each with at least three healthy and similarly sized piglets that have just been weaned. In each litter, three piglets are selected and one treatment is randomly assigned to each piglet. Diets are labeled A, B, or C.

  Litter   Diets in randomized order
  1        (order not recoverable)
  2        B  C  A
  ...
  10       C  B  A

In a design without blocking, the researcher would pick 30 piglets from different litters and randomly assign treatments to them. This is known as unrestricted randomization. Blocking designs have restricted randomization, since the treatments are randomly assigned WITHIN each block.

An RCBD has two factors: the factor of interest, which includes the treatments to be studied, and the Blocking Factor, which identifies the blocks used in the experiment.

There are several forms of blocking designs:
1. the RCBD that we will study
2. incomplete block designs, in which not every block has t experimental units
3. block designs in which the blocks have more
than t experimental units that are used in the experiment
4. Latin square designs, which have very specific forms of randomization of treatments within blocks (the usual example relates to the time ordering of treatments)

Assumptions of the RCBD
1. Sampling:
   a. The blocks are independently sampled.
   b. The treatments are randomly assigned to the experimental units within a block.
2. Homogeneous Variance: the treatments all have the same variability, i.e. they all have the same variance.
3. Approximate Normality: each population is normally distributed.

Hypotheses
As we will see, the blocking factor is included in the study only as a way of explaining some of the variation in the responses (Y) of the experimental units. As such, we are not interested in testing hypotheses about the blocking factor. Instead, just as in a one-way ANOVA, we restrict our attention to the other factor (the research factor). So hypothesis testing proceeds much as it did for the one-way ANOVA. The two differences are the calculation of the error variance (MSE) and a calculation of the effect of the blocking factor (MSB).

Notation
  $t$ : the number of treatments of interest in the research factor
  $b$ : the number of blocks containing t experimental units
  $N = t \times b$ : the total sample size
  $y_{ij}$ : observed value for the experimental unit in the jth block assigned to the ith treatment, $j = 1, 2, \ldots, b$ and $i = 1, 2, \ldots, t$
  $\bar{y}_{i\cdot} = \frac{1}{b}\sum_{j=1}^{b} y_{ij}$ : the sample mean of the ith treatment
  $\bar{y}_{\cdot j} = \frac{1}{t}\sum_{i=1}^{t} y_{ij}$ : the sample mean of the jth block
  $\bar{y}_{\cdot\cdot} = \frac{1}{tb}\sum_{i=1}^{t}\sum_{j=1}^{b} y_{ij}$ : the overall sample mean of the combined treatments

Example: piglet diet experiment with three litters. The marginal means below are computed from the data values used in the analysis later in these notes.

                                      Diet
  Block (Litter)    A                 B                 C                 Mean
  1                 $y_{A1} = 54.3$   $y_{B1} = 53.1$   $y_{C1} = 59.7$   $\bar{y}_{\cdot 1} = 55.7$
  2                 $y_{A2}$          $y_{B2}$          $y_{C2}$          $\bar{y}_{\cdot 2}$
  3                 $y_{A3} = 55.2$   $y_{B3} = 57.1$   $y_{C3} = 67.2$   $\bar{y}_{\cdot 3} = 59.8$
  Treatment Mean    $\bar{y}_{A\cdot} = 54.4$   $\bar{y}_{B\cdot} = 54.2$   $\bar{y}_{C\cdot} = 62.2$   Grand Mean $\bar{y}_{\cdot\cdot} = 56.9$

Model
  $Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$
where
  - $\mu$ is the overall (grand) mean,
  - $\alpha_i$ is the effect due to the ith treatment,
  - $\beta_j$ is the effect due to the jth block, and
  - $\varepsilon_{ij}$ is the error term,
where the error terms are independent observations from
an approximately Normal distribution with mean 0 and constant variance $\sigma_\varepsilon^2$.

Total variability of all of the $Y_{ij}$ is
  $TSS = \sum_i \sum_j (y_{ij} - \bar{y}_{\cdot\cdot})^2$
which can be broken up into three parts, $TSS = SST + SSB + SSE$:
  $SST = b \sum_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$ is the sum of squares for treatments,
  $SSB = t \sum_j (\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2$ is the sum of squares for blocks,
  $SSE = \sum_i \sum_j (y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot})^2$ is the sum of squares for error.

As before, we are interested in the Mean Squares:
  $MST = \frac{SST}{t-1}$, the Mean Square Treatments
  $MSB = \frac{SSB}{b-1}$, the Mean Square Blocks
  $MSE = \frac{SSE}{(t-1)(b-1)}$, the Mean Square Error
Here $E(MST) = \sigma_\varepsilon^2 + \frac{b}{t-1}\sum_{i=1}^{t}\alpha_i^2$ and $E(MSE) = \sigma_\varepsilon^2$.

ANOVA Table for a Randomized Complete Block Design

  Source      Sum of Squares   Degrees of Freedom   Mean Square   F-stat
  Treatment   SST              t - 1                MST           F = MST/MSE
  Block       SSB              b - 1                MSB
  Error       SSE              (t - 1)(b - 1)       MSE
  Total       TSS              tb - 1

Again, the test of a treatment effect,
  $H_0: \mu_1 = \mu_2 = \cdots = \mu_t$
  $H_A$: at least one mean differs,
uses the statistic $F = MST/MSE$. If the null hypothesis is true, then F has an F-distribution with numerator degrees of freedom $t - 1$ and denominator degrees of freedom $(t-1)(b-1)$. Beyond the F-test of equality of treatment means, the tests and comparisons of individual treatment means are also done exactly as before.

Example: piglet experiment

  data pigsblocked;
    input litter diet $ gain;   /* diet is a character variable (I, II, III) */
    datalines;
  1 I   54.3
  2 I   53.6
  3 I   55.2
  1 II  53.1
  2 II  52.4
  3 II  57.1
  1 III 59.7
  2 III 59.7
  3 III 67.2
  ;
  run;

  proc glm data=pigsblocked;
    class diet litter;
    model gain = diet litter;
  quit;

  Dependent Variable: gain
                      Sum of
  Source     DF      Squares    Mean Square   F Value   Pr > F
  diet        2       125.39          62.69     19.02   0.0091
  litter      2        38.46          19.23      5.83   0.0652
  Error       4        13.18           3.30
  C Total     8       177.04

Means comparisons using JMP v5

  Least Squares Means Table (gain LS Means, diet)
  Level   Least Sq Mean   Std Error
  I           54.366667   1.0481907
  II          54.200000   1.0481907
  III         62.200000   1.0481907

  LSMeans Differences Tukey HSD, Alpha = 0.050, Q = 3.564
  Each cell: Mean[i] - Mean[j], with Std Err Dif below it
            I           II          III
  I         0           0.16667     -7.8333
            0           1.48237     1.48237
  II        -0.1667     0           -8
            1.48237     0           1.48237
  III       7.83333     8           0
            1.48237     1.48237     0

  Level        Least Sq Mean
  III   A          62.200000
  I       B        54.366667
  II      B        54.200000
  Levels not connected by
same letter are significantly different.

Same experiment, ignoring the litter effect

  proc glm data=pigsblocked;
    class diet litter;
    model gain = diet;
  quit;

                      Sum of
  Source     DF      Squares    Mean Square   F Value   Pr > F
  Model       2       125.39          62.69      7.28   0.0248
  Error       6        51.65           8.61
  C Total     8       177.04

  Least Squares Means Table
  Level   Least Sq Mean   Std Error
  I           54.366667     1.69389
  II          54.200000     1.69389
  III         62.200000     1.69389

  LSMeans Differences Tukey HSD, Alpha = 0.050, Q = 3.06815

  Level        Least Sq Mean
  III   A          62.200000
  I       B        54.366667
  II      B        54.200000
  Levels not connected by same letter are significantly different.

Advantages of the RCBD as compared to the CRD
1. It reduces the error variance by explaining or identifying one source of some of the variability in the observations (a book refers to this as filtering out some of the variation).
2. The design is easy to construct; i.e., when there are natural or obvious blocks with at least t experimental units, the restricted randomization is easy to achieve.

Disadvantages
1. Homogeneous blocks are needed for the blocking factor to be effective.
2. The effect of the treatments in the factor under study must be the same in every block; i.e., the effect of a treatment cannot depend on which block it is applied to.
   E.g., consider an experiment to compare the unused red light time for five different traffic light signal sequences during morning rush hour. A traffic engineer chose several intersections and performed the different sequences at each intersection in random order. Suppose the effect of a particular sequence depends on which intersection you are studying; e.g., in intersections with heavy traffic the average unused red light time is greater than the average time at intersections with lighter traffic (maybe). This is known as interaction of factors.

Choosing Variables On Which To Block
We want the experimental units within each block to be as similar as possible to each other with respect to any characteristic that can affect or influence the response variable Y. So if a study relates to weight gain, we
want each block to have similar characteristics with respect to growth, such as starting weight, metabolic rates, etc.

Which is better, an RCBD or a CRD?
We can check using Relative Efficiency, which compares the variance of the estimate of the ith treatment mean $\mu_i$ (namely $\bar{y}_{i\cdot}$) under the two experimental designs. The comparison is based on the MSE that would have been obtained if the experiment had been conducted as a CRD without any blocking:

  $RE(RCBD, CRD) = \frac{MSE_{CRD}}{MSE_{RCBD}} = \frac{(SSB_{RCBD} + SSE_{RCBD}) / [t(b-1)]}{MSE_{RCBD}} = \frac{(b-1)\,MSB_{RCBD} + (b-1)(t-1)\,MSE_{RCBD}}{t(b-1)\,MSE_{RCBD}}$

If the blocking was not helpful (MSB equal to MSE), then the relative efficiency equals 1. The larger the relative efficiency, the more efficient the blocking was at reducing the error variance. The value can be interpreted as the ratio $r/b$, where r is the number of experimental units that would have to be assigned to each treatment if a CRD had been performed instead of an RCBD.

Example: in the piglet experiment, $SSB_{RCBD} = 38.46$, $SSE_{RCBD} = 13.18$, $t = 3$, $b = 3$, and $MSE_{RCBD} = 3.30$, so

  $RE(RCBD, CRD) = \frac{(38.46 + 13.18) / [3(2)]}{3.30} = \frac{8.61}{3.30} = 2.61$

This implies that it would have taken more than 2.5 times as many experimental units per treatment to get the same MSE as we got using the litters as blocks. I.e., we would have needed approximately $8 \approx 2.61(3)$ piglets per treatment in a CRD experiment testing the three diets.

ANOVA I

Inferences for More Than 2 Means from Independent Populations

Examples:
1. Suppose we want to compare species diversity of microfauna in three different habitats: desert, caves, and arctic tundra.
   a. Hypotheses and inferences are related to determining whether there are differences in diversity among the three habitats and, if so, determining how they differ.
2. Suppose we want to compare the average growth rate in oysters exposed to one of 4 levels of dermo (a protozoan parasite, Perkinsus marinus) in combination with 2 levels of salinity.
   a. The hypothesis might be that growth rate decreases as dermo level increases, but the amount of decrease depends on salinity level.

Completely Randomized Designs (CRD)

Defn: A Completely Randomized Design is an experimental design in which the experimental units are either randomly selected from each of the populations or are randomly assigned to one of the populations.

Defn: A Factor is the variable of interest. It separates the experimental units into their respective populations.

Defn: A Treatment is one level of the Factor under study. If more than one factor is of interest, then a treatment is a combination of levels of the factors.

Example: Oyster experiment
  Factors:
  Levels of Factors:
  Treatments:
  Populations under Study:

Defn: A One-Way Analysis of Variance (1-way ANOVA or AOV) is the statistical method for testing and comparing means from 2 or more independent populations. Here we'll use it with only one Factor.

Defn: An Observational Study is one in which we cannot control the type of treatment performed on the experimental units.

Defn: A Planned Experiment is one in which the type of treatment is randomly allocated or assigned to each experimental unit.

For example, consider the Oyster experiment. Assume that it will be performed at only one salinity level. Then the experiment reduces to comparing 4 levels of the disease Dermo. There is now one factor of interest (Dermo Level) with four treatments (4 levels). We could do this experiment two different ways:
1. a field study in which we collect oysters and measure their dermo level. Each oyster is then identified as belonging to one of the four Dermo categories we are interested in (none, low, medium, high disease intensity).
2. a lab experiment in which we manipulate the level of dermo in the water in which the oysters are placed.

Assumptions of the CRD
1. Sampling:
   a. For observational studies, random samples are taken from each of the populations of interest.
   b. For planned experiments, the treatments are randomly assigned to the randomly chosen experimental units (the objects on which the experiment is to be performed). Here the populations refer to conceptual ones in which there is one population
for each of the treatments in the experiment.
   c. Samples are independent. Example 2: independent sampling here would mean that oysters were randomly selected for the experiment (no clumps of oysters were taken and then separated, oysters were taken from different locations, oysters were not selected by size, etc.) and further that, if the experiment was planned, the oysters were randomly assigned treatment levels.
2. Homogeneous Variance: we shall assume that the populations of interest all have the same variability, i.e. they all have the same variance (this can be relaxed, but that is for a later time).
3. Approximate Normality: we assume that each population is normally distributed (this can be relaxed, but again that is for a later time).

Example 1: three habitats (desert, caves, arctic tundra)
The variable of interest is species diversity, Y. The experimental unit might be a sq. km randomly selected from a map showing the spatial extent of the habitat. The population for a habitat is the set of species diversity values for every possible sq. km in the habitat within the larger geographic region under study. Hence there are three populations, one for each habitat.

By assumption:
1. Species diversity in each habitat has a normal distribution.
2. The three populations (habitats) have the same variance, that is,
   $\sigma^2_{desert} = \sigma^2_{caves} = \sigma^2_{arctic} = \sigma^2_\varepsilon$

Our interest is in testing whether the means of these three populations differ; that is, our claim is that either $\mu_{desert} \ne \mu_{caves} \ne \mu_{arctic}$ or at least some subset of the means is not equal.

Hypothesis:
  $H_0: \mu_{desert} = \mu_{caves} = \mu_{arctic}$
  $H_A$: at least one mean differs from the others

In a picture, one possible version of the alternative hypothesis might look like:
[Figure: three population density curves with different means]

The populations under the null hypothesis would look like:
[Figure: three population density curves with a common mean]

Notation
  $t$ : the number of populations of interest (also the number of treatments)
  $n_i$ : sample size for the ith treatment or population, $i = 1, 2, \ldots, t$
  $N = \sum_{i=1}^{t} n_i$ : the total sample size
  $y_{ij}$ : observed value for the jth experimental unit sampled from the ith population, $j = 1, 2, \ldots, n_i$ and $i = 1, 2, \ldots, t$
  $\bar{y}_{i\cdot} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$ : the mean of the ith sample
  $\bar{y}_{\cdot\cdot} = \frac{1}{N}\sum_{i=1}^{t}\sum_{j=1}^{n_i} y_{ij}$ : the overall mean of the combined samples
  $MSE = \frac{1}{N - t}\sum_{i=1}^{t}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2$ : the sample estimator of the variance $\sigma_\varepsilon^2$, called the Mean Squared Error

Example 8.2 (page 335, text): Compare three methods for reducing hostility levels in students known to have a certain level of hostility. A total of twenty-four students were randomly assigned to one of the three methods. Method 1 was assigned to eight students, method 2 to seven students, and method 3 to nine students. After treatment, each student was given a test and the scores were recorded. The data ($y_{ij}$, with i the method and j the student) are:

  j              Method 1   Method 2   Method 3
  1                    96         77         66
  2                    79         76         73
  3                    91         74         69
  4                    85         73         66
  5                    83         78         77
  6                    91         71         73
  7                    82         80         71
  8                    87                    70
  9                                          74
  Sample mean       86.75      75.57      71.00
  Sample size           8          7          9

  Experimental unit:
  Treatments:
  Factor:

Model
  $Y_{ij} = \mu + \alpha_i + \varepsilon_{ij} = \mu_i + \varepsilon_{ij}$
where
  - $\mu$ is the overall (grand) mean,
  - $\mu_i$ is the ith treatment mean,
  - $\alpha_i = \mu_i - \mu$ is the deviation of the ith treatment mean from the overall mean, and
  - $\varepsilon_{ij} = Y_{ij} - \mu_i$ is called the error term, i.e. it is the deviation of the jth observation $Y_{ij}$ from the ith treatment mean.

Assumptions (rewritten): the error terms $\varepsilon_{ij}$ are independent observations from an approximately Normal distribution with mean 0 and constant variance $\sigma_\varepsilon^2$.

This model says that any observation from the ith treatment is the sum of three terms:
[Figure: an observation decomposed into the overall mean, the treatment deviation, and the error]

This means that we can decompose the variability of the $Y_{ij}$ into parts that are associated with each of the terms in the model.

Total variability of all of the $Y_{ij}$ is called the total sum of squares and is defined as
  $TSS = \sum_i \sum_j (y_{ij} - \bar{y}_{\cdot\cdot})^2$
It can be broken up into two parts, $TSS = SST + SSE$, where
  $SST = \sum_i \sum_j (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 = \sum_i n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$ is the sum of squares between treatments, and
  $SSE = \sum_i \sum_j (y_{ij} - \bar{y}_{i\cdot})^2$ is the sum of squares for error.

Note that SST is the sum of the squared deviations of the sample means from the overall mean, and is part of the estimation
of the contribution of the $\alpha_i$ to our model. SSE is the sum of the squared deviations of the individual observations around their treatment means, and is part of the estimation of the contribution of the $\varepsilon_{ij}$ to the model.

We don't use these raw sums, but instead use averages of them: MST, the Mean Square Treatments, and MSE, the Mean Square Error, where
  - MST measures the variability of the sample means around the overall mean of the combined samples, and
  - MSE measures the variability of individual observations around their own sample mean (an average within-sample variance estimate, assuming the populations all have the same variance). It is the estimator of $\sigma_\varepsilon^2$.

For Example 8.2 we have that
  $TSS = \sum_i \sum_j (y_{ij} - \bar{y}_{\cdot\cdot})^2 = 1477.8$
  $SST = \sum_i n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 = 8(86.75 - 77.58)^2 + 7(75.57 - 77.58)^2 + 9(71 - 77.58)^2 = 1090.66$
  $MST = SST/(t - 1) = 1090.66/(3 - 1) = 545.33$
  $SSE = \sum_i \sum_j (y_{ij} - \bar{y}_{i\cdot})^2 = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2 = 387.14$
  $MSE = SSE/(N - t) = 387.14/(24 - 3) = 18.43$

We compare these two quantities, MST and MSE, to decide if we should reject the null hypothesis that the treatment means are all equal. The reason is that their true (unknown) mean values, also called the expected mean squares, are
  $\mu_{MST} = E(MST) = \sigma_\varepsilon^2 + \frac{1}{t-1}\sum_{i=1}^{t} n_i \alpha_i^2$
  $\mu_{MSE} = E(MSE) = \sigma_\varepsilon^2$
If the means are all equal, every $\alpha_i$ is 0, so MST and MSE estimate the same quantity and should be about equal. If one or more means differ, MST should be larger than MSE. The more different the treatment means are, the larger MST should be. So we look at the ratio
  $F = \frac{MST}{MSE}$

We summarize all of this in a table known as the ANOVA table.

ANOVA Table for a One-Way Analysis of Variance

  Source            Sum of Squares   Degrees of Freedom   Mean Square   F-stat
  Model (Between)   SST              t - 1                MST           F = MST/MSE
  Error (Within)    SSE              N - t                MSE
  Total             TSS              N - 1

To Test the Hypotheses
  $H_0: \mu_{desert} = \mu_{caves} = \mu_{arctic}$
  $H_A$: at least one mean differs
Test Statistic: $F = \frac{MST}{MSE}$
If the null hypothesis is true, then F has an F-distribution with numerator degrees of freedom $t - 1 = df_1$ and denominator degrees of freedom $N - t = df_2$.
Decision Rule: reject the null hypothesis if the p-value $< \alpha$.

Example: photomorphogenetic studies of plants are often done using a lamp called a safelight, whose
effect on certain plant properties is the same as the effect of darkness. The effect of 2 sources of safelight (A and B), each at 2 intensities of light (L = low and H = high), was compared with the effect of darkness (D) on plant height after 20 weeks of exposure to the light regime. Altogether there were five treatments. The experimenter chose 20 identical seedlings and randomly allocated the treatments among the seedlings so that each treatment was given to 4 plants. Height was measured at the end of 20 weeks.

  data aov;
    input tmt $ @;          /* read the treatment label and hold the line */
    do i = 1 to 4;
      input height @;       /* read the four heights for this treatment */
      output;
    end;
    drop i;
    datalines;
  D  32.94 35.98 34.76 32.40
  AL 30.55 32.64 32.37 32.04
  AH 31.23 31.09 30.62 30.42
  BL 34.41 34.88 34.07 33.87
  BH 35.61 35.00 33.65 32.91
  ;
  proc print;
  run;

  title 'One way Analysis of Variance';
  proc glm;
    class tmt;
    model height = tmt;
    lsmeans tmt;
  quit;

Hypothesis:
  $H_0: \mu_D = \mu_{AL} = \mu_{AH} = \mu_{BL} = \mu_{BH}$
  $H_A$: at least one mean differs

  One way Analysis of Variance
  Dependent Variable: height
                       Sum of
  Source    DF        Squares    Mean Square   F Value   Pr > F
  Model      4       41.08077       10.27019      9.41   0.0005
  Error     15       16.37655        1.09177
  Total     19       57.45732

  R-Square    Coeff Var    Root MSE    height Mean
  0.714979    3.159404     1.044878    33.07200

  Least Squares Means
  tmt    height LSMEAN
  AH        30.8400000
  AL        31.9000000
  BH        34.2925000
  BL        34.3075000
  D         34.0200000

Topic 8: POPULATION DISTRIBUTIONS

So far:
  - We've seen some ways to summarize a set of data, including numerical summaries.
  - We've heard a little about how to sample a population effectively in order to get good estimates of the population quantities of interest (e.g. taking a good sample and calculating the sample mean as a way of estimating the true but unknown population mean value).
  - We've talked about the ideas of probability and independence.

Now we need to start putting all this together in order to do Statistical Inference: the methods of analyzing data and interpreting the results of those analyses with respect to the populations of interest.

The Probability Distribution for a random variable can be a table, a graph, or an equation.
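As a small illustration of the table and equation forms of a probability distribution, here is a minimal Python sketch. The fair six-sided die is a hypothetical example that does not come from these notes, and the names `pmf_table` and `pmf_equation` are made up for illustration.

```python
from fractions import Fraction

# Table form: each possible value mapped to its probability.
# Assumption: a fair six-sided die (hypothetical, not from the notes).
pmf_table = {face: Fraction(1, 6) for face in range(1, 7)}


def pmf_equation(x):
    """Equation form of the same distribution: Pr(X = x) = 1/6 for x = 1..6."""
    return Fraction(1, 6) if x in range(1, 7) else Fraction(0)


# A valid probability distribution has probabilities that sum to 1.
total_probability = sum(pmf_table.values())

# The population mean is the probability-weighted sum of the values.
population_mean = sum(x * p for x, p in pmf_table.items())

# The table and the equation describe the same distribution.
forms_agree = all(pmf_table[x] == pmf_equation(x) for x in pmf_table)

print(total_probability, population_mean, forms_agree)
```

Exact fractions are used instead of floating point so that the check that the probabilities sum to 1 is not affected by rounding.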
Let's start by reviewing the ideas of frequency distributions for populations, using categorical variables.

QUALITATIVE (NON-NUMERIC) VARIABLES
For a random variable that takes on values of categories, the probability distribution is a table showing the likelihood of each value.

EXAMPLE: Tree species found in a boreal forest. For each possible species there would be a probability associated with it. E.g., suppose there are 4 species, of which three are very rare and one is very common. A probability table might look like:

  Species          Probability
  (rare)                  0.01
  (rare)                  0.03
  (rare)                  0.08
  (very common)           0.88
  All                     1.00

We interpret these values as the probability that a random selection would result in observing that species. We could also draw a bar chart, but it would be fairly uninformative in this instance, since one value is so much larger than the others. An equation cannot be developed, since the values that the variable takes on are non-numeric.

QUANTITATIVE (NUMERICAL) VARIABLES

A. Discrete Random Variables
Recall that a discrete random variable is one that takes on values only from a set of isolated, specific numbers. The relative frequency distribution for a discrete random variable (also sometimes called a probability mass function) is a list of probabilities for each possible value that the variable can take on.

BERNOULLI DISTRIBUTION: Suppose the scientist studying the tree species overlaid a grid of square quadrats over the region of interest and then recorded whether any tree was in the quadrat or not. Hence the random variable is binary, i.e. it has only two outcomes: presence (1) or absence (0). The Bernoulli distribution describes the probability of each outcome:
  $\Pr(X = 1) = \pi$
  $\Pr(X = 0) = 1 - \pi$
The mean for a Bernoulli variable is $\pi$ and the variance is $\pi(1 - \pi)$.

POISSON DISTRIBUTION: Suppose the scientist studying the tree species overlaid a grid of square quadrats over the region of interest and then counted the number of hickory trees in each quadrat. The histogram of the number of trees per
quadrat, for all of the quadrats, might look like:

[Figure: histogram of trees per quadrat; horizontal axis 0 to 11, vertical axis relative frequency 0.10 to 0.40]

  Quantiles                        Moments
  100.0%  maximum   11.000         Mean              2.999
   99.5%             8.000         Std Dev           1.750
   97.5%             7.000         Std Error Mean    0.025
   90.0%             5.000         Upper 95% Mean    3.048
   75.0%  quartile   4.000         Lower 95% Mean    2.951
   50.0%  median     3.000         N              5000.000
   25.0%  quartile   2.000         Sum Weights    5000.000
   10.0%             1.000
    2.5%             0.000
    0.5%             0.000
    0.0%  minimum    0.000

Since we have sampled the entire population (the set of counts for every quadrat in the region), this histogram represents the probability distribution of the random variable X = number of trees/quadrat.

In general, the Poisson distribution is a common probability distribution for counts per unit time, unit area, or unit volume. The graph can also be described using an equation, known as the Poisson Distribution Probability Mass Function. It gives the probability of observing a specific count x in any randomly selected quadrat as
  $\Pr(X = x) = \frac{e^{-\mu}\mu^x}{x!}$
where $x! = x(x-1)(x-2)\cdots 1$ and $x = 0, 1, 2, \ldots$

In order for this to be a valid probability distribution, we require that the total probability over all possible values equal 1, and that every possible value have a non-negative probability associated with it:
  $\sum_{x=0,1,2,\ldots} \Pr(X = x) = \sum_{x=0,1,2,\ldots} \frac{e^{-\mu}\mu^x}{x!} = 1$
and
  $\Pr(X = x) = \frac{e^{-\mu}\mu^x}{x!} \ge 0$ for every x.
The mean of the Poisson distribution is $\mu$, and the variance is $\mu$ as well.

DISCRETE UNIFORM DISTRIBUTION: every discrete value that the random variable can take on has the same probability of occurring.
For example, suppose a researcher is interested in whether the number of setae on the first antennae of an insect is random or not. Further, the researcher believes that there must be at least 1 seta and at most 8. Then she is postulating that every value between 1 and 8 is equally likely to be observed in a random draw of an insect from the population, or equivalently, that there are equal numbers of insects with 1, 2, ..., or 8 setae in the population. Such a distribution is known as the Discrete Uniform Distribution. Let K be the total number of
distinct values that the random variable can take on (e.g. the set {1, 2, ..., 8} contains K = 8 distinct values). Then
  $\Pr(X = x) = \frac{1}{K}$ for $x = 1, 2, \ldots, 8$
In addition, the mean for this particular discrete uniform is
  $\mu = \sum_x \frac{x}{K} = 4.5$
and the variance is
  $\sigma^2 = \sum_x \frac{(x - 4.5)^2}{K} = 5.25$
Also, it is easy to see that the probabilities sum to 1, as required. Finally, the graph of the distribution looks like a rectangle.

B. Continuous Random Variables
Recall that a continuous random variable is one that can take on any value from an interval on the number line. Now for relative frequency distributions:

Fact 1: They show the frequencies of the values of the variable of interest in a set of data, where the data have been assigned to specific groupings (bins or categories).
[Figure: histogram of TIME; Mean = 71.0, Std Dev = 12.80, N = 222]
The height of each bar is proportional to the relative frequency in the data set of the group it represents. Multiplying the heights by the widths of the bars and adding all the areas gives the total area in the bars (red). The area under any one bar divided by the total area equals Pr(an observation falls in that grouping).

Fact 2: For a continuous variable and an extremely large population, the number of bars is very large and the heights of the bars approach a smooth curve. This curve is often referred to as a DENSITY CURVE, or the probability distribution. The curve describes the shape of the distribution and also depends on the mean and standard deviation of the population under study.
[Figure: density curve; horizontal axis: blood glucose]
[Figure: Normal Distributions with Different Means and Standard Deviations]

Fact 3: When the curve is describing the frequency distribution of the population, every observation must fall within the limits of the distribution. Hence 100% of the observations are accounted for.

When we combine these three facts, we get that the density curve describing the
frequency distribution of values of a quantitative variable
1. has a total area under the curve of 1 (analogous to 100%), and
2. the area over a range of values equals the relative frequency of that range in the population, i.e. the area equals the probability of observing a value within that range.
[Figure: density curve with two vertical lines; the area in between these two lines is the probability that X falls between the values of 5 and 8]

There are many standard, common density curves.

UNIFORM DISTRIBUTION: every subset interval of the same length is equally likely.
For example, suppose we randomly selected a number from the number line [0, 10]. Then the probability distribution is given by
  $\Pr(a < X < b) = \frac{b - a}{U - L}$ for $X \in [L, U]$ with $L \le a < b \le U$
(Uniform), e.g. $\Pr(3 < X < 4) = \frac{4 - 3}{10 - 0} = 0.1$. The mean of a Uniform distribution is $\mu = \frac{L + U}{2}$ and the variance is $\sigma^2 = \frac{(U - L)^2}{12}$.

NORMAL DISTRIBUTION (Bell Curve or Gaussian Distribution): symmetric, unimodal, and bell-shaped.
Some interesting facts about the NORMAL DISTRIBUTION:
1. mean = median = mode
2. the shape is perfectly symmetric with equal-sized tails
3. the Empirical Rule has an exact form:
   68.26% of the values fall within $\mu \pm \sigma$
   95.44% of the values fall within $\mu \pm 2\sigma$
   99.74% of the values fall within $\mu \pm 3\sigma$
4. the endpoints of the interval $\mu \pm \sigma$ fall exactly at the inflection points of the curve
5. it's the most common distribution for natural phenomena that take on continuous values

Calculating Probabilities Of Events For A Normal Distribution
EXAMPLE: IQ, as measured by the Stanford-Binet test, has a mean of $\mu = 100$ and a standard deviation of $\sigma = 15$.
1. What proportion of the US adult population has an IQ above 100? I.e., find Pr(IQ > 100).
2. What proportion of the population has an IQ between 85 and 115? I.e., find Pr(85 < IQ < 115).

Question: What do we do when the value of interest in the probability phrase does NOT fall exactly at the standard deviation cutoffs? E.g., find Pr(IQ < 110).
Answer: Convert the value to a Z-score and use it and a look-up table or a
computer program to calculate the probability.

Recall: the Z-SCORE for a value is the number of standard deviations that value is from the mean:
  $z = \frac{x - \mu}{\sigma}$
e.g. for an IQ of 110, $z = \frac{110 - \mu}{\sigma} = \frac{110 - 100}{15} = 0.667$

Defn: When X is normally distributed, the Z-score has a STANDARD NORMAL DISTRIBUTION. The Standard Normal distribution is a normal distribution with a mean of $\mu = 0$ and a standard deviation of $\sigma = 1$.

  Position             mu-3s   mu-2s   mu-1s   mu   mu+1s   mu+2s   mu+3s
  Original IQ score       55      70      85  100     115     130     145
  Equivalent Z-score      -3      -2      -1    0       1       2       3
  (mu = mean, s = standard deviation)

So the important point here is that we need to do the conversion
  $\Pr(X < a) = \Pr\left(\frac{X - \mu}{\sigma} < \frac{a - \mu}{\sigma}\right) = \Pr(Z < z)$
in order to find probabilities of events under a normal distribution, e.g.
  $\Pr(IQ < 110) = \Pr\left(\frac{IQ - \mu}{\sigma} < \frac{110 - \mu}{\sigma}\right) = \Pr\left(\frac{IQ - 100}{15} < \frac{110 - 100}{15}\right) = \Pr(Z < 0.667)$
Next, look up the area (i.e. the probability) on a table: $\Pr(Z < 0.667) = 0.7486$, so approximately 75% of the population has an IQ less than 110.

Areas Under the Normal Curve (each entry is Pr(Z < z); the row gives z to one decimal place and the column gives the second decimal place)

     z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
  -3.4   .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0002
  -3.3   .0005  .0005  .0005  .0004  .0004  .0004  .0004  .0004  .0004  .0003
  -3.2   .0007  .0007  .0006  .0006  .0006  .0006  .0006  .0005  .0005  .0005
  -3.1   .0010  .0009  .0009  .0009  .0008  .0008  .0008  .0008  .0007  .0007
  -3.0   .0013  .0013  .0013  .0012  .0012  .0011  .0011  .0011  .0010  .0010
  -2.9   .0019  .0018  .0017  .0017  .0016  .0016  .0015  .0015  .0014  .0014
  -2.8   .0026  .0025  .0024  .0023  .0023  .0022  .0021  .0021  .0020  .0019
  -2.7   .0035  .0034  .0033  .0032  .0031  .0030  .0029  .0028  .0027  .0026
  -2.6   .0047  .0045  .0044  .0043  .0041  .0040  .0039  .0038  .0037  .0036
  -2.5   .0062  .0060  .0059  .0057  .0055  .0054  .0052  .0051  .0049  .0048
  -2.4   .0082  .0080  .0078  .0075  .0073  .0071  .0069  .0068  .0066  .0064
  -2.3   .0107  .0104  .0102  .0099  .0096  .0094  .0091  .0089  .0087  .0084
  -2.2   .0139  .0136  .0132  .0129  .0125  .0122  .0119  .0116  .0113  .0110
  -2.1   .0179  .0174  .0170  .0166  .0162  .0158  .0154  .0150  .0146  .0143
  -2.0   .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
  -1.9   .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
  -1.8   .0359  .0351  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294
  -1.7   .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
  -1.6   .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455
  -1.5   .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559
  -1.4   .0808  .0793  .0778  .0764  .0749  .0735  .0722  .0708  .0694  .0681
  -1.3   .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
  -1.2   .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985
  -1.1   .1357  .1335  .1314  .1292  .1271  .1251  .1230  .1210  .1190  .1170
  -1.0   .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379
  -0.9   .1841  .1814  .1788  .1762  .1736  .1711  .1685  .1660  .1635  .1611
  -0.8   .2119  .2090  .2061  .2033  .2005  .1977  .1949  .1922  .1894  .1867
  -0.7   .2420  .2389  .2358  .2327  .2296  .2266  .2236  .2206  .2177  .2148
  -0.6   .2743  .2709  .2676  .2643  .2611  .2578  .2546  .2514  .2483  .2451
  -0.5   .3085  .3050  .3015  .2981  .2946  .2912  .2877  .2843  .2810  .2776
  -0.4   .3446  .3409  .3372  .3336  .3300  .3264  .3228  .3192  .3156  .3121
  -0.3   .3821  .3783  .3745  .3707  .3669  .3632  .3594  .3557  .3520  .3483
  -0.2   .4207  .4168  .4129  .4090  .4052  .4013  .3974  .3936  .3897  .3859
  -0.1   .4602  .4562  .4522  .4483  .4443  .4404  .4364  .4325  .4286  .4247
  -0.0   .5000  .4960  .4920  .4880  .4840  .4801  .4761  .4721  .4681  .4641

Areas Under the Normal Curve (continued)

     z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
   0.0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
   0.1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
   0.2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
   0.3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
   0.4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
   0.5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
   0.6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
   0.7   .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
   0.8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
   0.9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
   1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
   1.1   .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
   1.2   .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
   1.3   .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
   1.4   .9192  .9207  .9222  .9236  .9251  .9265  .9278  .9292  .9306  .9319
   1.5   .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
   1.6   .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
   1.7   .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
   1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
   1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
   2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
   2.1   .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
   2.2   .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
   2.3   .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
   2.4   .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
   2.5   .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
   2.6   .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
   2.7   .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
   2.8   .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
   2.9   .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
   3.0   .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
   3.1   .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
   3.2   .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
   3.3   .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
   3.4   .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998

Some practice (which also uses the rules for probability that we learned earlier):
1. Find Pr(IQ > 92).
2. Find Pr(70 < IQ < 120).

Finding Quantiles for the Normal Distribution
Most often used to find extreme values in the very highest or lowest percentages.
EXAMPLE: Suppose adult male heights are normally distributed with a mean of 69" and a standard deviation of 3.5". We have learned how to answer questions like "What proportion of the population are taller than 6'2"?" How do we answer a question like "Find the range of likely heights for the shortest 5% of the male population", i.e. what height is the 5th percentile of the population? Here we are being asked to find the
value of a that makes the following probability statement true:

$$\Pr(\text{Height} < a) = 0.05$$

We know that $\Pr(\text{Height} < a) = \Pr(Z < z_0)$. So we'll start by solving $\Pr(Z < z_0) = 0.05$ for $z_0$. Now we'll use the fact that $z_0 = \frac{a - \mu}{\sigma}$ and our knowledge of the values of $\mu$ and $\sigma$ to solve for a. From the table, $z_0 \approx -1.645$, so $a = \mu + z_0\sigma \approx 69 - 1.645(3.5) \approx 63.2$ inches.

ANOVA II: One-Way Analysis of Variance II

Two issues still to be dealt with:
a. checking the assumptions of the model, and
b. inference on individual means or combinations of means.

1. Estimation

a. Predicted Values or LSMEANS (Least Squares Means)
   i. The best estimators of the cell means $\mu_i$ are the sample means $\bar{y}_{i\cdot}$.
   ii. The variance of the mean estimators is estimated by $MSE/n_i$, so the standard error is $SE(\bar{y}_{i\cdot}) = \sqrt{MSE/n_i}$.

b. Residuals
   i. The estimators of the error terms $\varepsilon_{ij}$ are the residuals $e_{ij} = y_{ij} - \bar{y}_{i\cdot}$.
   ii. The residuals always sum to 0, i.e. $\sum_i \sum_j e_{ij} = 0$, and have variance estimated by MSE.
   iii. Under the assumptions, the residuals have a Normal distribution with mean 0 and constant variance.

Example: the comparisons of the effects of safelights on plant height.

    proc mixed;
      class tmt;
      model height = tmt / outp=resids;
    quit;

    proc print data=resids;
      var tmt height pred resid stderrpred;
    quit;

    Obs  tmt  height    Pred      Resid    StdErrPred
      1  D    32.94    34.0200   -1.0800    0.52244
      2  D    35.98    34.0200    1.9600    0.52244
      3  D    34.76    34.0200    0.7400    0.52244
      4  D    32.40    34.0200   -1.6200    0.52244
      5  AL   30.55    31.9000   -1.3500    0.52244
      6  AL   32.64    31.9000    0.7400    0.52244
      7  AL   32.37    31.9000    0.4700    0.52244
      8  AL   32.04    31.9000    0.1400    0.52244
      9  AH   31.23    30.8400    0.3900    0.52244
     10  AH   31.09    30.8400    0.2500    0.52244
     11  AH   30.62    30.8400   -0.2200    0.52244
     12  AH   30.42    30.8400   -0.4200    0.52244
     13  BL   34.41    34.3075    0.1025    0.52244
     14  BL   34.88    34.3075    0.5725    0.52244
     15  BL   34.07    34.3075   -0.2375    0.52244
     16  BL   33.87    34.3075   -0.4375    0.52244
     17  BH   35.61    34.2925    1.3175    0.52244
     18  BH   35.00    34.2925    0.7075    0.52244
     19  BH   33.65    34.2925   -0.6425    0.52244
     20  BH   32.91    34.2925   -1.3825    0.52244

2. Checking the Assumptions of the Model

a. Constant Variance
   Graphically: do box plots of the residuals for each treatment and look for similar variabilities.
   Hypothesis testing of the
sample variances $s_i^2$ using Levene's test or Hartley's test.

b. Normality
   i. Graphically:
      1. do a stem-and-leaf plot, a histogram, or something similar using the residuals to check for the shape of the distribution and for outliers;
      2. do a normal probability plot of the residuals.
   NOTE: usually normality is NOT reviewed or tested until after any problems with variance are corrected. Obviously, if the variances are unequal, it is highly likely that the distribution of the residuals will look non-normal (pooled residuals from groups with different variances tend to look heavy-tailed).

c. Independence and Random selection/allocation
   This is something that is controlled and decided by the scientist when planning and executing the experiment. Important points to consider: in addition to randomly selecting experimental units for inclusion in the study and randomly allocating those units to treatments, one should also randomly order the laboratory analyses of the units after the experiment is over. For example, in the study of height of plants as affected by light regime, the scientist should measure the plants in random order rather than take plants from the same treatment sequentially. Subtle changes in the way measurements are done could be occurring that might influence the results.

Remedial Measures

Many different methods:
   i. change the model to account for the nonindependence;
   ii. change the model to account for the unequal variance;
   iii. do a transformation of the data for unequal variance and non-normality;
   iv. use a nonparametric test for severely nonnormal data:
      a. Kruskal-Wallis test
      b. Bootstrapping

Estimation in a One-Way ANOVA

Once we have rejected the null hypothesis that all means are equal, and we have checked the assumptions of the testing procedure, we usually wish to do some specific tests that can elucidate the relationships among the means. These tests are variously called multiple comparisons, contrasts, or estimation of linear combinations of means.

A priori Hypotheses: hypotheses about population means that are decided during the planning of the experiment. They
are the reason for performing the experiment.

A posteriori Hypotheses: hypotheses generated as a result of looking at the data after the experiment has been performed. Also called data snooping or data dredging. This is almost ALWAYS inappropriate and to be avoided. The only valid reason for doing so is as an exploratory analysis that will guide future experimentation.

Example (a posteriori testing): suppose a 1-way ANOVA is performed and the results are obtained. The analyst looks over the results and decides to test 2 means because they appear to be very different. Now the effect could be due to a real difference in population means, or to random occurrence due to sampling that makes them appear different. Investigating only comparisons for which the effect appears large implies that, when there is no difference, the true confidence level for a conclusion is lower than the stated confidence level. In other words, you are more likely to reject H0 (no difference) when it is true. It can be shown that the actual confidence is 60% when 6 levels are used in an experiment and the statistical analysis always includes testing the difference between the largest and smallest means using a stated 95% confidence (note that these means need not be the same treatment means each time).

There are times when it is possible to do a posteriori testing, BUT the statistical method needs to be modified appropriately to account for the data snooping (see later).

1. Estimation of a Treatment Mean

The population mean for the ith treatment, $\mu_i$, is estimated using the sample mean $\bar{y}_{i\cdot}$, with a standard error of $SE(\bar{y}_{i\cdot}) = \sqrt{MSE/n_i}$.

Under our assumptions of normality and random sampling, the $(1-\alpha)100\%$ Confidence Interval of the population mean is

$$\bar{y}_{i\cdot} \pm t_{\alpha/2,\,N-t}\; SE(\bar{y}_{i\cdot})$$

where $t_{\alpha/2,\,N-t}$ is the critical value for the upper tail of a t-distribution on $N - t = N - I$ df (t, also written I in these notes, is the number of treatments).

Hypothesis testing is done using a t-test, as is usual for a single population mean.

2. Estimation of the Difference Between 2 Treatment Means

The unbiased estimator of the difference between 2
population means, $D_{ik} = \mu_i - \mu_k$, is

$$\hat{D}_{ik} = \bar{y}_{i\cdot} - \bar{y}_{k\cdot}$$

which has a standard error of

$$SE(\hat{D}_{ik}) = \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_k}\right)}$$

assuming the variances are homogeneous.

Under our assumptions of normality and random sampling, a $(1-\alpha)100\%$ Confidence Interval of the difference of the population means is

$$\hat{D}_{ik} \pm t_{\alpha/2,\,N-t}\; SE(\hat{D}_{ik})$$

where $t_{\alpha/2,\,N-t}$ is the critical value for the upper tail of a t-distribution on $N - t = N - I$ degrees of freedom.

Hypothesis testing is done using the t-test for two independent samples that we reviewed earlier this semester.

Example: Rehabilitation Therapy

A researcher is interested in the relationship between physical fitness in persons prior to knee surgery and the time required in physical therapy after surgery to obtain successful rehabilitation. 24 male subjects with similar knee surgery during the past year were randomly selected from the patient records at the rehabilitation center, and the number of days required for successful rehabilitation and prior physical fitness status were recorded. The patients were categorized into one of three levels of fitness. The hypotheses of interest are:
1. the mean time to recovery will differ among the three groups,
2. the above average fitness group will have a shorter recovery period than the below average and average groups, and
3. the average group will have a shorter recovery than the below average group.

    data fitness;
      input tmt $ @;          /* read the group label and hold the line */
      do i = 1 to 8;
        input days @;         /* read the 8 recovery times on the same line */
        output;
      end;
      drop i;
      datalines;
    below   29 42 38 40 43 40 30 42
    average 30 35 39 28 31 31 29 35
    above   26 32 21 20 23 22 25 23
    ;

    proc mixed data=fitness;
      class tmt;
      model days = tmt / outp=resids;
      lsmeans tmt / pdiff;
    quit;

    Covariance Parameter Estimates
    Cov Parm    Estimate
    Residual    19.4048

    Type 3 Tests of Fixed Effects
    Effect   Num DF   Den DF   F Value   Pr > F
    tmt         2       21      20.42    <.0001

    Least Squares Means
    Effect  tmt       Estimate   Standard Error   DF   t Value   Pr > |t|
    tmt     above     24.0000       1.5574        21    15.41    <.0001
    tmt     average   32.2500       1.5574        21    20.71    <.0001
    tmt     below     38.0000       1.5574        21    24.40    <.0001

    Differences of Least Squares Means
    Effect  tmt       _tmt      Estimate   Standard Error   DF   t Value   Pr > |t|
    tmt     above     average    -8.25        2.2025        21    -3.75    0.0012
    tmt     above     below     -14.00        2.2025        21    -6.36    <.0001
    tmt     average   below      -5.75        2.2025        21    -2.61    0.0163

3. Multiple Comparisons

The procedures we've just seen have 2 IMPORTANT limitations:
a. The confidence $1 - \alpha$ applies ONLY to a particular estimate or test, not to the series of estimates or tests.
b. The confidence $1 - \alpha$ is appropriate ONLY if the estimate or test was not suggested by the data.

We typically perform multiple tests in order to piece the results together to draw a more complete conclusion. This is sometimes referred to as a family of statements or tests, and it is important to provide some assurance that all of the statements in the family are correct. We call this assurance the familywise or experimentwise confidence. The experimentwise error rate is denoted $\alpha_E$.

If we are interested in only specific hypotheses and they are not used in any combination in order to draw an overall conclusion, then we refer to that as the individual or comparisonwise confidence, and the individual error rate $\alpha_I$ refers to that specific estimate or test ONLY.

There is a problem with repeated t-tests, however. Suppose there are ten means and each t-test for comparing two means is performed at the 0.05 level. There are 10(10 - 1)/2 = 45 pairs of means to compare, each with a 0.05 probability of a type 1 error (a false rejection of the null hypothesis). The chance of making at least one type 1 error in the set of 45 tests is much higher than 0.05. It is difficult to calculate the exact probability, but you can derive a pessimistic approximation by assuming that the comparisons are independent, giving an upper bound to the probability of making at least one type 1 error:

$$\Pr(\text{at least one Type 1 error}) = \alpha_E \le 1 - (1 - \alpha_I)^m = 1 - (1 - 0.05)^{45} \approx 0.9006$$

As can be seen, as the number of tests increases, the chance of making at least one Type 1 error approaches 1.

When we test only a priori hypotheses, then we have a limited set of statements for which we need an
experimentwise error rate. On the other hand, when we go data snooping and construct tests a posteriori, then the family of tests (statements) should be the ENTIRE set of possible tests or estimates that could be performed, rather than just the ones actually performed.

Example A: Suppose that we are examining the response to a drug. The experiment includes a control (zero dosage level) and four drug dosage levels in the relevant range. The objective of the study is to determine if the drug produces an effect on the response variable. This is not a comparisonwise hypothesis, since the objective of this study would be answered by rejection of the overall null hypothesis of no differences among the drug treatment means. The objective does not ask which level produces the greatest dose effect.

Example B: Suppose we are examining five diets that are thought to improve the absorption of a certain nutrient from the digestive system. The objective of the study is to identify the diet, if any, that results in the greatest absorption of the nutrient. This is not a single experimentwise question; that is, to conclude that not all treatment means are equal would not answer the question. Rather, the objective of the study dictates that pairs of individual means must be compared. And so we are interested in controlling the comparisonwise error rate in such a way that the overall experimentwise error rate is kept at an acceptable level.

Methods for Pair-Wise Comparisons

1. Bonferroni Method

Bonferroni showed that $\alpha_E \le m\,\alpha_I$. Hence, suppose we want to perform m = 3 pairwise comparisons of means and control the experimentwise error rate at $\alpha_E = 0.05$. Then we should run each individual test at a comparisonwise error rate that satisfies $3\alpha_I \le 0.05$, i.e. we should use $\alpha_I = 0.05/3 = 0.0167$ for each of the three comparisons.

Example: Problem 8.1, page 337 of text. 4 types of devices for determining pH in soil samples. Are there differences among the devices in their readings? 6
soil samples of known pH are used with each device; recorded is the difference between the known pH and the device measurement. Data and some summary stats are shown on page 338 of the text.

Oneway Analysis of pH By Device

    Analysis of Variance Table
    Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
    Device      3     0.5837482       0.194583     4.8470     0.0108
    Error      20     0.8028957       0.040145
    C Total    23     1.3866438

    Means for Oneway ANOVA
    Level   Number     Mean      Std Error   Lower 95%   Upper 95%
    A         6      -0.16050    0.08180     -0.3311      0.01013
    B         6       0.09467    0.08180     -0.0760      0.26529
    C         6       0.12267    0.08180     -0.0480      0.29329
    D         6       0.27350    0.08180      0.1029      0.44413
    Std Error uses a pooled estimate of error variance.

    Means Comparisons: Dif = Mean[i] - Mean[j]
             D          C          B          A
    D     0.00000    0.15083    0.17883    0.43400
    C    -0.15083    0.00000    0.02800    0.28317
    B    -0.17883   -0.02800    0.00000    0.25517
    A    -0.43400   -0.28317   -0.25517    0.00000

Bonferroni: comparisons for each pair use Student's t, but we adjust the individual error rate in order to control the familywise error rate at $\alpha_E = 0.05$. With m = 6 and $\alpha_E = 0.05$, $\alpha_I = \alpha_E/m = 0.0083$.

       t        df    alpha    SE(Dif)
    2.87031     20    0.0083   0.11568

Ran the analysis using SAS and obtained the following table:

    Differences of Least Squares Means
    Effect   Dev   _Dev   Estimate   Std Error   DF   t Value   Pr > |t|
    Device   A     B      -0.2552     0.1157     20    -2.21    0.0393
    Device   A     C      -0.2832     0.1157     20    -2.45    0.0237
    Device   A     D      -0.4340     0.1157     20    -3.75    0.0013
    Device   B     C      -0.0280     0.1157     20    -0.24    0.8112
    Device   B     D      -0.1788     0.1157     20    -1.55    0.1378
    Device   C     D      -0.1508     0.1157     20    -1.30    0.2071

Comparing each p-value with $\alpha_I = 0.0083$, only the A versus D difference (p = 0.0013) is significant.

2. Fisher's Least Significant Difference (LSD) Method

Basically this is the Student's t-test approach for comparing two means, where we control ONLY the individual pairwise error rate and there is no control of the experimentwise error rate. It is known as the protected Fisher's LSD if one performs the testing only if the F-test rejects the null hypothesis.

Calculate a statistic known as the least significant difference, i.e. the minimum difference that is required to reject the null hypothesis $H_0: \mu_i = \mu_j$ for a given $\alpha$. This statistic is

$$LSD = t_{\alpha/2,\,N-t}\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} = t_{\alpha/2,\,N-t}\; SE(\hat{D}_{ij})$$

If the absolute value of the difference between 2 means exceeds LSD, then reject the null hypothesis with a type I error of $\alpha$, i.e. if $|\bar{y}_{i\cdot} - \bar{y}_{j\cdot}| - LSD > 0$, then reject $H_0: \mu_i = \mu_j$ and accept the alternative $H_A: \mu_i \ne \mu_j$.

Example: Problem 8.1, Fisher's LSD

       t        df    alpha    SE(Dif)      LSD
    2.08596     20    0.05     0.11568    0.24130

The differences Mean[i] - Mean[j] are those in the Means Comparisons table above.

    Abs(Dif) - LSD
             D          C          B          A
    D    -0.24130   -0.09047   -0.06247    0.19270
    C    -0.09047   -0.24130   -0.21330    0.04186
    B    -0.06247   -0.21330   -0.24130    0.01386
    A     0.19270    0.04186    0.01386   -0.24130
    Positive values show pairs of means that are significantly different.

3. Tukey's W or honestly significant difference (HSD) Method

A method that controls the experimentwise error rate and is used only when we have equal sample sizes in each treatment (all $n_i$ are the same). An alternative called Tukey-Kramer HSD is used when we have unequal sample sizes.

Tukey's W uses an approach that considers the statistic

$$\frac{\max_i \bar{y}_{i\cdot} - \min_i \bar{y}_{i\cdot}}{\sqrt{MSE/n}}$$

where n is the sample size in each treatment. This statistic has a distribution known as the Studentized range distribution, $q(I, \nu)$, where I is the number of treatments under study and $\nu$ is the degrees of freedom associated with MSE, i.e. $\nu = N - I$.

Tukey's W is given by

$$W = q_\alpha(I, \nu)\sqrt{\frac{MSE}{n}}$$

where $q_\alpha(I, \nu)$ is the upper-tail critical value of the Studentized range distribution with I treatments and $\nu$ degrees of freedom (see Table 11 of text).

Again, to do the actual tests: if $|\bar{y}_{i\cdot} - \bar{y}_{j\cdot}| - W > 0$, then reject $H_0: \mu_i = \mu_j$ and accept the alternative $H_A: \mu_i \ne \mu_j$.

A major difference between this approach and the LSD test is that Tukey's W controls the experimentwise error rate, while Fisher's LSD controls the comparisonwise error rate.

Example: Problem 8.1. In this problem $n_i = n_j$ for all treatments, so we can do Tukey's W with n = 6.

    Tukey HSD
       q        df    alpha_E   Sqrt(MSE/n)      W
    2.79894     20     0.05      0.10018      0.28040

The differences Mean[i] - Mean[j] are again those in the Means Comparisons table above.

    Abs(Dif) - W
             D          C          B          A
    D    -0.28040   -0.12957   -0.10157    0.15360
    C    -0.12957   -0.28040   -0.25240    0.00277
    B    -0.10157   -0.25240   -0.28040   -0.02523
    A     0.15360    0.00277   -0.02523   -0.28040
    Positive values show pairs of means that are significantly different.

    Level             Mean
    D       A        0.2735000
    C       A        0.1226667
    B       A B      0.0946667
    A         B     -0.1605000
    Levels not connected by same letter are significantly different.

SOME EXTRA STUFF IF YOU ARE INTERESTED

4. Estimation of a Linear Combination of Means

Pairwise comparisons are actually special cases of the more general comparison known as a linear combination of means,

$$L = \sum_{i=1}^{t} c_i \mu_i$$

where, when $\sum_{i=1}^{t} c_i = 0$, L is called a contrast.

Now the unbiased estimator of L is

$$\hat{L} = \sum_{i=1}^{t} c_i \bar{y}_{i\cdot}$$

with standard error

$$SE(\hat{L}) = \sqrt{MSE \sum_{i=1}^{t} \frac{c_i^2}{n_i}}$$

if the variances are homogeneous.

Under our assumptions of normality and random sampling, a $(1-\alpha)100\%$ Confidence Interval of the linear combination of the population means is

$$\hat{L} \pm t_{\alpha/2,\,df}\; SE(\hat{L})$$

where $t_{\alpha/2,\,df}$ is the critical value for a t-distribution on df degrees of freedom. If the variances are equal, df = N - I.

Example: Plant heights experiment. Set AH = 1, AL = 2, BH = 3, BL = 4, and D = 5.

Suppose we have 5 levels (AH, AL, BH, BL, D) and wish to compare the average height for safelight A with the average height for safelight B:

    mean for safelight A: $\mu_A = (\mu_1 + \mu_2)/2$
    mean for safelight B: $\mu_B = (\mu_3 + \mu_4)/2$

We may want to test the following hypothesis:

$$H_0: \frac{\mu_1 + \mu_2}{2} = \frac{\mu_3 + \mu_4}{2} \qquad H_A: \frac{\mu_1 + \mu_2}{2} \ne \frac{\mu_3 + \mu_4}{2}$$

The corresponding contrast is $L = 0.5\mu_1 + 0.5\mu_2 - 0.5\mu_3 - 0.5\mu_4 + 0\mu_5$, which has the unbiased estimator

$$\hat{L} = 0.5\bar{y}_{1\cdot} + 0.5\bar{y}_{2\cdot} - 0.5\bar{y}_{3\cdot} - 0.5\bar{y}_{4\cdot} + 0\,\bar{y}_{5\cdot}$$

with standard error

$$SE(\hat{L}) = \sqrt{MSE \sum_{i=1}^{t} \frac{c_i^2}{n_i}} = \sqrt{MSE\left(\frac{0.5^2}{n_1} + \frac{0.5^2}{n_2} + \frac{(-0.5)^2}{n_3} + \frac{(-0.5)^2}{n_4} + \frac{0^2}{n_5}\right)}$$

Testing of hypotheses is done as follows:

    $H_0: \sum_i c_i \mu_i = C_0$, where $C_0$ is a hypothesized value
    $H_A: \sum_i c_i \mu_i \ne$ (or > or <) $C_0$

Test statistic:

$$t = \frac{\sum_i c_i \bar{y}_{i\cdot} - C_0}{SE\left(\sum_i c_i \bar{y}_{i\cdot}\right)} \quad \text{with df} = N - I$$

Decision Rule: reject H0 if the p-value < $\alpha$.

Note: all of these tests and CIs require that the assumptions of the ANOVA be met.

Example: $H_0: \frac{\mu_1 + \mu_2}{2} = \frac{\mu_3 + \mu_4}{2}$ vs. $H_A: \frac{\mu_1 + \mu_2}{2} \ne \frac{\mu_3 + \mu_4}{2}$

Test statistic:

$$t = \frac{0.5\bar{y}_{1\cdot} + 0.5\bar{y}_{2\cdot} - 0.5\bar{y}_{3\cdot} - 0.5\bar{y}_{4\cdot} + 0\,\bar{y}_{5\cdot} - 0}{\sqrt{MSE\left(\frac{0.5^2}{4} + \frac{0.5^2}{4} + \frac{(-0.5)^2}{4} + \frac{(-0.5)^2}{4} + \frac{0^2}{4}\right)}}$$

Definition of some symbols. Please note the
similarities and differences.

N: Population size = number of elements in the population (1)
n: Sample size = number of units in the sample
$\mu$: Mean of the population of X's (2):
    $\mu_X = E(X) = \sum_{\text{all } x} x\,P(X = x)$ if X is a discrete r.v.
    $\mu_X = E(X) = \int x\,f(x)\,dx$ if X is a continuous r.v.
$\bar{X}$: Mean of the sample from the population of X's: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
$\mu_{\bar{X}}$: Mean of the population of sample means: $E(\bar{X}) = \mu$ by rule 1
    $= \sum_{\text{all } \bar{x}} \bar{x}\,P(\bar{X} = \bar{x})$ if $\bar{X}$ is a discrete r.v.
    $= \int \bar{x}\,f(\bar{x})\,d\bar{x}$ if $\bar{X}$ is a continuous r.v.
$\sigma^2$: Variance of the population of X's:
    $\sigma_X^2 = \sum_{\text{all } x} (x - \mu_X)^2\,P(X = x)$ if X is a discrete r.v.
    $\sigma_X^2 = \int (x - \mu_X)^2 f(x)\,dx$ if X is a continuous r.v.
$\sigma_{\bar{X}}^2$: Variance of the population of sample means: $\sigma^2/n$ by rule 2
    $= \sum_{\text{all } \bar{x}} (\bar{x} - \mu)^2\,P(\bar{X} = \bar{x})$ if $\bar{X}$ is a discrete r.v.
    $= \int (\bar{x} - \mu)^2 f(\bar{x})\,d\bar{x}$ if $\bar{X}$ is a continuous r.v.
$\pi$: Proportion of successes in the population = (number of successes in the population) / N
p: Proportion of successes in the sample = (number of successes in the sample) / n
$\mu_p$: Mean of the population of sample proportions: $E(p) = \pi$ by rule 6
$\sigma_p^2$: Variance of the population of sample proportions: $\pi(1 - \pi)/n$ by rule 9

(1) Assumed to be infinity in many theoretical studies, but finite in almost all real-life problems.
(2) The subscript indicating the population (X, Y, etc.) will be dropped when there is no possible confusion.

Some rules that will be used

Before you use any rule or theory, ALWAYS first check to make sure that conditions are satisfied.

    Rule   Conditions that MUST be satisfied                Result
     1     Always true                                      THEN $\mu_{\bar{X}} = \mu_X$
     2     Always true                                      THEN $\sigma_{\bar{X}} = \sigma_X/\sqrt{n}$
     3     IF $n \ge 30$ (Central Limit Theorem)            THEN $\bar{X} \approx N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$
     4     IF $X \sim N(\mu, \sigma^2)$                     THEN $\bar{X} \sim N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$
     5     General case: IF a random variable (r.v.)        THEN $Z = \frac{\text{r.v.} - \text{mean of r.v.}}{\text{st. dev. of r.v.}} \sim N(0, 1)$
           has a normal distribution
     6     Always true                                      THEN $\mu_p = \pi$
     7     IF $X \sim B(n, \pi)$                            THEN $\mu_X = E(X) = n\pi$
     8     IF $X \sim B(n, \pi)$                            THEN $\sigma_X^2 = Var(X) = n\pi(1 - \pi)$
     9     Always true                                      THEN $\sigma_p = \sqrt{\pi(1 - \pi)/n}$
    10     IF $n\pi \ge 10$ and $n(1 - \pi) \ge 10$         THEN $p \approx N\left(\pi, \frac{\pi(1 - \pi)}{n}\right)$
    11     IF $X \sim B(n, \pi)$ and $n\pi \ge 10$          THEN $X \approx N(n\pi,\, n\pi(1 - \pi))$
           and $n(1 - \pi) \ge 10$                          (Normal Approximation to the Binomial)

Formula for Confidence Intervals and Conditions

Questions to ask YOURSELF: what are the conditions that MUST be satisfied? Check to see if they are satisfied before you use the formulas. These conditions also apply to the corresponding test of hypothesis.

One population mean, $\sigma$ known:
    Confidence interval: $\bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
    Conditions: population standard deviation $\sigma$ is known, plus Normal population or large sample.

One population mean, $\sigma$ unknown:
    Confidence interval: $\bar{X} \pm t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}$
    Conditions: population standard deviation $\sigma$ is unknown, plus almost Normal population.

One population proportion:
    Confidence interval: $p \pm z_{\alpha/2}\sqrt{\frac{p(1 - p)}{n}}$
    Conditions: $np \ge 10$ and $n(1 - p) \ge 10$.

Difference of two means, standard deviations known:
    Confidence interval: $(\bar{X} - \bar{Y}) \pm z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}$
    Conditions: population standard deviations $\sigma_X$ and $\sigma_Y$ are known, plus Normal populations or large samples.

Difference of two means, standard deviations unknown and unequal:
    Confidence interval: $(\bar{X} - \bar{Y}) \pm t_{\alpha/2,\,df}\sqrt{\frac{S_X^2}{n_X} + \frac{S_Y^2}{n_Y}}$
    Conditions: population standard deviations $\sigma_X$ and $\sigma_Y$ are unknown (unequal), plus almost normal populations; df = smaller of $n_X - 1$ and $n_Y - 1$.

Mean of paired differences:
    Confidence interval: $\bar{D} \pm t_{\alpha/2,\,n-1}\,\frac{S_D}{\sqrt{n}}$
    Conditions: the D's have a normal distribution.

Difference of two proportions:
    Confidence interval: $(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$
    Conditions: $n_1 p_1 \ge 10$ and $n_1(1 - p_1) \ge 10$, and $n_2 p_2 \ge 10$ and $n_2(1 - p_2) \ge 10$.

Revised on July 29, 2010

CATEGORICAL DATA - Chi-Square Tests for Univariate Data

Recall that a categorical variable is one in which the possible values are categories or groupings. We've seen one such variable: the binary variable with only two possible outcomes, success or failure. In this topic we explore testing hypotheses about categorical variables with MORE than two outcomes.

EXAMPLE: Consider an experiment in which two different tomato phenotypes are crossed and the resulting offspring observed. The parent types are tall cut-leaf tomatoes and dwarf potato-leaf tomatoes.

Variable: Offspring Phenotype
Possible Values: (1) tall cut-leaf, (2) tall potato-leaf, (3) dwarf cut-leaf, and (4) dwarf potato-leaf.

If Mendel's laws of inheritance hold, the resulting population proportions in the offspring would be (1) 9/16, (2) 3/16, (3) 3/16, and (4) 1/16. One might hypothesize that Mendel's Laws don't hold for these genes. In an experiment to test that, the researcher observed the proportions (1) 0.575, (2) 0.179, (3) 0.182, and (4) 0.065, based on a sample of 1611 offspring.

EXAMPLE: Consider an
observational study in which the types of insects that feed on the nectar from a certain flower are studied. The scientist randomly selects hours during the day over several days during the summer season and selects several different plants. She counts the number of different kinds of insects that feed at the plant during the study.

Variable: Insect Family
Possible Values: (1) bees, (2) wasps, or (3) flies.

One might hypothesize that this flower attracts the different insect families in unequal proportions.

Important Point: Testing procedures for hypotheses of this form are called Goodness-of-Fit tests. These tests compare the sample proportions to the hypothesized proportions to see how good the fit is.

Important Point: These categories must be mutually exclusive and exhaustive.

Notation: k = number of possible categories that the variable of interest can have.

    Category   True Population   Sample             Hypothesized Population
               Proportion        Proportion         Proportion
    1          $\pi_1$           $\hat{\pi}_1$      $\pi_1^0$
    2          $\pi_2$           $\hat{\pi}_2$      $\pi_2^0$
    ...
    k          $\pi_k$           $\hat{\pi}_k$      $\pi_k^0$

Exhaustive means that $\sum_i \pi_i = 1$, $\sum_i \hat{\pi}_i = 1$, and $\sum_i \pi_i^0 = 1$.

EXAMPLE: tomatoes and Mendel's Laws, k = 4

    Category             True Population   Sample                    Hypothesized Population
                         Proportion        Proportion                Proportion
    tall cut-leaf        $\pi_1$           $\hat{\pi}_1 = 0.575$     $\pi_1^0 = 9/16$
    tall potato-leaf     $\pi_2$           $\hat{\pi}_2 = 0.179$     $\pi_2^0 = 3/16$
    dwarf cut-leaf       $\pi_3$           $\hat{\pi}_3 = 0.182$     $\pi_3^0 = 3/16$
    dwarf potato-leaf    $\pi_4$           $\hat{\pi}_4 = 0.065$     $\pi_4^0 = 1/16$

Now, for a sample of size n and a set of hypothesized proportions under the null hypothesis, I can calculate how many sample units should be in each category (if there was no sampling variability, of course). These numbers are called the EXPECTED CELL COUNTS under the null hypothesis and are calculated as n × (hypothesized value $\pi^0$ for that category/cell). The OBSERVED CELL COUNTS are the actual counts seen in each category during the experiment.

    Category   Expected Count   Observed Count
    1          $n\pi_1^0$       $n\hat{\pi}_1$
    2          $n\pi_2^0$       $n\hat{\pi}_2$
    ...
    k          $n\pi_k^0$       $n\hat{\pi}_k$

Important Point: This test procedure is valid only if the sample sizes
and hypothesized proportions are such that virtually every cell has an expected count of 5 or more. If they aren't, you must use a different test procedure.

EXAMPLE: Tomatoes & Mendel's Laws, n = 1611

    Category             Expected Count                        Observed Count
    tall cut-leaf        $n\pi_1^0 = 1611(9/16) = 906.2$       $n\hat{\pi}_1 = 926$
    tall potato-leaf     $n\pi_2^0 = 1611(3/16) = 302.1$       $n\hat{\pi}_2 = 288$
    dwarf cut-leaf       $n\pi_3^0 = 1611(3/16) = 302.1$       $n\hat{\pi}_3 = 293$
    dwarf potato-leaf    $n\pi_4^0 = 1611(1/16) = 100.7$       $n\hat{\pi}_4 = 104$

Hypotheses:
    $H_0: \pi_1 = 9/16,\ \pi_2 = 3/16,\ \pi_3 = 3/16,$ and $\pi_4 = 1/16$
    $H_A$: not H0 (H0 is not true)

Important Point: Note how uninformative the alternative hypothesis is in a goodness-of-fit test. These tests compare the sample data against a specific set of hypothesized proportions. If the null hypothesis is rejected, one cannot tell what the true proportions are, only that they are not the ones listed in the null hypothesis.

Significance Level: let's choose $\alpha = 0.04$.

Test Statistic: a summary of the comparison of the observed and expected cell counts. The actual form is

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

This is called the CHI-SQUARE or GOODNESS OF FIT STATISTIC.

Important Point: the closer the expected and observed counts are to each other, the smaller the value of $X^2$. Small values of $X^2$ support the null hypothesis and large values support $H_A$.

EXAMPLE: tomatoes and Mendel's Laws

    Category             Expected Count         Observed Count          $(n\hat{\pi}_i - n\pi_i^0)^2 / n\pi_i^0$
    tall cut-leaf        $n\pi_1^0 = 906.2$     $n\hat{\pi}_1 = 926$    0.433
    tall potato-leaf     $n\pi_2^0 = 302.1$     $n\hat{\pi}_2 = 288$    0.658
    dwarf cut-leaf       $n\pi_3^0 = 302.1$     $n\hat{\pi}_3 = 293$    0.274
    dwarf potato-leaf    $n\pi_4^0 = 100.7$     $n\hat{\pi}_4 = 104$    0.108

So $X^2 = 0.433 + 0.658 + 0.274 + 0.108 = 1.473$.

P-value: under the null hypothesis, the test statistic $X^2$ has a sampling distribution known as the CHI-SQUARE DISTRIBUTION. Like the t-distribution, the shape of the Chi-Square Distribution depends on the degrees of freedom. Here, df = k - 1.

Important Point: the degrees of freedom for the Chi-Square Goodness of Fit test are the number of categories (k) minus 1, NOT
the sample size minus 1.

The p-value is the area under the Chi-square distribution to the right of the test statistic value.

To find the P-value, first calculate $X^2$ and the df. Then go to Table 8, page 686 of the text. Find the row labeled with the df you have for your test. Go across the values in the row until you find two values that bracket your $X^2$ value. Read the P-value from the tops of the columns containing the two bracketing values.

EXAMPLE: tomatoes. df = 4 - 1 = 3 and $X^2 = 1.473$. So on page 686, go to the row labeled df = 3 and find the closest value to 1.47. It's bracketed by the values 0.5844 to the left and 6.251 to the right. The column headers for these two values are 0.90 (left) and 0.10 (right). This says that the P-value falls between 0.10 and 0.90.

Conclusion: since the P-value > 0.1 >> $\alpha$ = 0.04, do not reject H0. There is insufficient evidence to suggest that something other than Mendel's law of inheritance is working for the two tomato phenotypes that were crossed.

GOODNESS-OF-FIT TEST PROCEDURE FOR UNIVARIATE CATEGORICAL DATA

Null Hypothesis: $H_0: \pi_1 = \pi_1^0,\ \pi_2 = \pi_2^0,\ \ldots,\ \pi_k = \pi_k^0$, where $\pi_i^0$ is the hypothesized population proportion of the ith category and $\sum_i \pi_i^0 = 1$.

Alternative Hypothesis: $H_A$: H0 is not true.

Test Statistic:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

where the expected count in the ith category is $n\pi_i^0$.

P-value: area to the right of the observed $X^2$ value under the Chi-Square distribution with k - 1 degrees of freedom. Use Table 8 to get an approximate value for the P-value.

Assumptions:
1. the sample was random;
2. the sample size is sufficiently large, and the hypothesized cell proportions are such, that the expected cell counts are all 5 or more.

EXAMPLE: It is hypothesized that when homing pigeons are disoriented in a particular manner, they exhibit no preference in direction of flight after takeoff. To test this, 120 pigeons were disoriented,
then let loose, and their direction observed as wedges representing an eighth of 360°. The results are given below. Use a significance level of 0.10 to test the hypothesis that direction of takeoff is equally likely.

Hypotheses:
    $H_0: \pi_1 = \pi_2 = \pi_3 = \cdots = \pi_8 = 1/8$
    $H_A$: not H0

    Direction    Observed     Expected              $(n\hat{\pi}_i - n\pi_i^0)^2 / n\pi_i^0$
                 Frequency    Frequency
    0-45°           18        120(0.125) = 15       0.600
    45-90°          20        15                    1.667
    90-135°         20        15                    1.667
    135-180°        15        15                    0
    180-225°        13        15                    0.267
    225-270°         8        15                    3.267
    270-315°         7        15                    4.267
    315-360°         9        15                    2.400

Test Statistic:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = 0.6 + 1.667 + \cdots + 2.400 = 14.135$$

P-value: df = k - 1 = 8 - 1 = 7. From the table, 0.025 < p-value < 0.05.

Conclusion: since the P-value < $\alpha$, reject H0 and conclude that there exists evidence to indicate that the pigeons show directional preferences when disoriented before being allowed to fly.

NOTE: This is an example of a test in which the null hypothesis is the claim, which goes against what we have learned this semester. Most goodness-of-fit tests are that way; that is, the null hypothesis is the distribution of interest. It is assumed to be true unless the data indicate otherwise. But we cannot show that the null hypothesis is true, only that there is no evidence to not believe it.

... in a data set and to present that information in a convenient form.

Qualitative Data

Qualitative data are nonnumerical.

Class Frequency: the number of observations in the data set that fall into a particular class.

Class Relative Frequency: class frequency divided by the total number of observations in the data set:

    class relative frequency = class frequency / n

Example: Field Inventory (A. chilensis, N. alpina, N. dombeyi)

Bar graph: the classes are represented by bars, where the height of each bar is either the class frequency, class relative frequency, or class percentage.
[Figure: bar graph of the Field Inventory example for A. chilensis, N. alpina, and N. dombeyi]

Pie chart: the classes are represented by slices of a pie. The size of each slice is proportional to the class relative frequency.
[Figure: pie chart of the Field Inventory example: A. chilensis 20.00%, N. alpina 53.33%, N. dombeyi 26.67%]

Misleading plots
[Figures: a chart comparing lakes (Michigan, Huron, Titicaca, Erie, Superior), SOURCE: Los Angeles Times, August 5, 1979; a chart with SOURCE: GraphJam.com]

Qualitative Data: multiple variables

Contingency Table: cross-tabulation of units based on measurements of two qualitative variables simultaneously.

Stacked Bar Graph: bar chart with one variable represented on the horizontal axis and the second variable as subcategories within bars.

... side-by-side comparisons within major groupings.

[Figures: "Acute Mountain Syndrome Among Himalayan Trekkers"; "2002 Winter Olympics Medals" (Gold, Silver, Bronze; vertical axis: Number of Medals; horizontal axis: Countries)]

Quantitative Data: single variable

Dot plots display a dot for each observation along a horizontal axis. The dots reflect the shape of the distribution. Good for small data sets.

[Figure: stem-and-leaf display for Age, 15 observations]

The box plot is a graph representing information about certain percentiles for a data set and can be used to identify outliers.

Histogram: the frequencies or relative frequencies are displayed on the vertical axis.
[Figure: histogram of Age (years); vertical axis: Number. Forms of distribution: Symmetric, Right Skewed, Left Skewed]

Quantitative Data: multiple variables

Scatterplot: shows the relationship between two quantitative variables. The response variable y is placed on the vertical (up-down) axis and the explanatory variable x is placed on the horizontal (left-right) axis.
[Figure: scatterplot; horizontal axis: Crimes Occurring Per 100,000 People]

Time series data: a single variable measured on a single unit at different time points. When measurements are made at equally spaced time points, the goal is often
to describe temporal variation.
[Figures: time series plots; one chart titled "United but not unified" shows GDP change on a year earlier, 2003-2005]

Complete, clear, brief.

Summation notation: we use a summation symbol often:

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$$

This tells us to add all the values of variable x from the first ($x_1$) to the last ($x_n$).

Example: if $x_1 = 1$, $x_2 = 2$, $x_3 = 3$ and $x_4 = 4$, then $\sum_{i=1}^{4} x_i = 1 + 2 + 3 + 4 = 10$.

Measures of Central Tendency: describe the "location" or center of a set of measurements; the value or values around which the data tend to cluster.

Statistics: numeric descriptive measures based on samples of measurements.

    Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$        Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

In practice we only observe the sample and use $\bar{x}$ to estimate $\mu$.

Example: $\bar{x} = \frac{1 + 2 + 3 + 4}{4} = \frac{10}{4} = 2.5$

Median (M): the value with 50% of the measurements below it and 50% above it.

    Lowest Value |------ 50% ------| Median |------ 50% ------| Highest Value

If n is odd, M is the middle number. If n is even, M is the average of the middle two numbers.
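The summation, sample mean, and median rules above can be checked with a short Python sketch. This is only an illustration, not part of the course materials: the helper name `sample_median` and the second data set `[1, 2, 3, 4, 50]` are hypothetical, chosen to show the odd-n case and the median's resistance to an extreme value.

```python
import statistics

# The four values from the summation example: x1=1, x2=2, x3=3, x4=4.
x = [1, 2, 3, 4]

total = sum(x)           # summation: x1 + x2 + ... + xn
x_bar = total / len(x)   # sample mean: sum of the values divided by n


def sample_median(values):
    """Median M: the middle value if n is odd, the average of the
    middle two values if n is even (hypothetical helper name)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


m_even = sample_median(x)                 # n = 4: average of 2 and 3
m_odd = sample_median([1, 2, 3, 4, 50])   # n = 5: the middle value

# Cross-check against the standard library implementation.
library_median = statistics.median(x)

print(total, x_bar, m_even, m_odd, library_median)
```

With an extreme value such as 50 in the data, the mean moves to 12 while the median stays at 3, which is one reason both measures are usually reported.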