# Analysis of Behavioral Data PSYC 610

These 58 pages of class notes for PSYC 610 (Analysis of Behavioral Data) at Radford University, taught by Thomas Pierce in the Fall, were uploaded by Gerardo Little on Monday, October 19, 2015.


# One-Way Repeated Measures ANOVA

Dr. Tom Pierce, Department of Psychology, Radford University

## 1. Differences between between-subjects and within-subjects independent variables

### Review of confounds, and the control of confounds, in between-subjects designs

I know you've heard all this before. It won't hurt you a bit to hear it again. Much.

Let's say that the design of your study goes like this: there are three levels of one independent variable, and ten subjects are assigned to each of these three levels. What kind of design is this? The comparison being made is between three separate groups of subjects. That makes it a between-subjects design. The independent variable in this case is an example of a between-subjects factor.

So what kinds of confounds should you be especially worried about when the independent variable is a between-subjects factor? In this design, each level of the independent variable is made up of different subjects. A confound, in general, is any reason other than your independent variable for why there are differences among the means for your independent variable. The independent variable is a difference between the conditions that the investigator put there. When (a) the statistical analysis shows that there are significant differences among the means, and (b) the only possible explanation for those differences is the manipulation of the independent variable, the investigator is justified in inferring that the differences in the means for the dependent variable must have been caused by the manipulation of the independent variable. If there is any other possible explanation for why the means ended up different from each other, you can't be sure that it was your IV that had the effect. If you can't tell whether the effect was caused by your IV or not, your data are officially worthless. You have a confound in your design. The confound is a competing explanation for why you got a significant effect.

Confounds in between-subjects designs usually take the form of individual difference
variables. The independent variable being manipulated might be the dosage of caffeine subjects receive. But if you observe that the subjects getting the high dosage are all in their 60s and 70s while the subjects getting the low dosages are all in their 20s and 30s, there is no way to tell whether the significant effect you observe is due to the independent variable you wanted to manipulate (dosage of caffeine) or to the group difference in age. Age is a way in which subjects differ from each other. Age, as an individual difference variable, constitutes a confound in this particular example. Other individual difference variables that investigators often worry about are gender, race, socioeconomic status, years of education, and employment history. There are probably hundreds or thousands of other ways in which subjects could differ from each other (Thomas W. Pierce, 2004). Every one of them is a potential reason for why the means in a given study are not all the same.

Investigators try to remove the threat of confounds in their designs through (a) random assignment to groups and (b) matching. Random assignment to groups works to control confounds due to individual differences by making sure that every subject in the study has an equal chance of being assigned to every treatment condition. The only way that the subjects in one group could be significantly older than the subjects in another group would be by chance. The strategy of random assignment to groups is appealing because it automatically controls for every possible individual difference variable. The disadvantage is that it is always possible that there are differences between the groups on a particular variable just by chance. And if you never measured this variable (and there's no way to measure every possible confounding variable), you'd have to live with the possibility that a confound is present in the design just by chance. Random assignment to groups leaves it up to chance as to whether or not there are confounds due to individual differences in the design.
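Random assignment is easy to sketch in code. This is a minimal illustration, not anything prescribed by the notes; the subject IDs, the group count, and the `seed` are all hypothetical:

```python
import random

def randomly_assign(subjects, n_groups, seed=None):
    """Shuffle the subjects, then deal them out into n_groups equal groups."""
    rng = random.Random(seed)
    pool = list(subjects)
    rng.shuffle(pool)                   # every subject has an equal chance
    return [pool[i::n_groups] for i in range(n_groups)]

# 30 hypothetical subjects dealt into three treatment conditions
groups = randomly_assign(range(30), n_groups=3, seed=42)
print([len(g) for g in groups])         # [10, 10, 10]
```

Any age difference (or any other individual difference) between the resulting groups can arise only by chance, which is exactly the property described above.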
Matching allows the investigator the luxury of fixing it so that a particular variable will not be a confound in the design. Unfortunately, without extremely large sample sizes it is difficult to match on more than three or four possible confounding variables. To sum this bit up: confounds in between-subjects designs are due to the unwanted influence of individual difference variables. Strategies for controlling these confounds try to make sure that the only thing that makes the subjects in one group different from the subjects in another group is the one variable that the investigator has control over: the independent variable.

### Controlling for confounds in within-subjects designs

In the search for ways to control for the confounding effects of individual difference variables, an alternative strategy is to simply use the same subjects in each treatment condition of the independent variable. Think about it. If you used the same subjects in Condition 1 that you did in Condition 2, Condition 3, and so on, could the subjects who contributed data to the different treatment conditions possibly differ from each other on age? No, because you're using the same subjects in each group. The mean age of the subjects giving you data for Condition 1 is going to be exactly the same as the mean age of the subjects giving you data for Condition 2, Condition 3, and so on. You've automatically and perfectly controlled for the effects of age. You've automatically and perfectly controlled for the effects of gender, because the ratio of male to female subjects is going to be identical at every level of the IV. Testing the same subjects under every level of the IV guarantees that confounds due to *any* individual difference variable will not be present in the design. That's the primary advantage of having every subject receive every level of the independent variable. Because comparisons between the levels of the IV are conducted
within one group of subjects, the IV is referred to as a within-subjects factor. When the design of the study includes one independent variable that is within-subjects, the design is known as a within-subjects design.

So, manipulating the IV within one group of subjects eliminates individual difference variables as potential confounds. Does this mean that there are no confounds associated with a within-subjects IV? No. The subjects in the various treatment conditions are the same, yet these subjects obviously could not have participated in the different conditions at the same time. Within-subjects factors are vulnerable to confounds due to changes in the testing conditions over time, or to changes in the state of the subject over time. Changes in the testing conditions that are not related to the independent variable being manipulated are referred to as the effects of time, or history. Changes in the state of the subject that are not related to the independent variable being manipulated are referred to as practice effects. Let's deal with confounds due to historical factors first; then we'll look at practice effects.

### Confounds due to the effects of historical events

Let's start with a concrete example. You're interested in the effects of age on cognitive function. You administer a battery of cognitive tests, perhaps from the WAIS and the Halstead-Reitan Neuropsychological Test Battery. One of these tests is the Digit Symbol subtest from the WAIS. The measure obtained from this subtest is the number of items completed in 90 seconds. You administer this battery of tests to a group of five subjects every five years. (I know this is a pathetically low sample size, but it's just an example. Get off my back.) You now have data from subjects when they were 65 years old, 70 years old, and 75 years old. The data for this study are presented below.

| Subject | 65 Years Old | 70 Years Old | 75 Years Old |
|---------|--------------|--------------|--------------|
| 1       | 55           | 51           | 45           |
| 2       | 63           | 60           | 59           |
| 3       | 49           | 51           | 47           |
| 4       | 51           | 44           | 44           |
| 5       | 44           | 39           | 34           |

When we calculate the means for the three
levels of Age, we find that the mean level of performance on the Digit Symbol test seems to go down as the subjects get older. But what if you knew that one year before the third time of testing, when the subjects were 74 years of age, there was a massive food shortage, and pretty much everyone had to go without the recommended daily allowance of Vitamin Z? Even if we get to say that, statistically, the mean of the scores at Time 3 is significantly lower than the mean at Time 2, do we get to say that it was the increase of five years of age that caused the scores to go down? No, because it's also possible to argue that it was the lack of Vitamin Z that caused the decline in the scores, not the increase in age. The historical event (the food shortage) is a confound in this design. The food shortage provides a competing explanation for why the scores at the third level of the IV were different from the scores at the second level of the IV.

### Confounds due to practice effects

Let's think about a different study. You do an experiment in which you manipulate the amount of caffeine subjects get and observe the effects on reaction time. You ask subjects to come in at 8:00 in the morning to begin the experiment. You obtain their informed consent and then give them a pill that you've told them contains a high dose of caffeine, a low dose, or no caffeine at all. The subjects don't know what dosage of caffeine they're getting, and the experimenter doing the testing doesn't know either; that's what makes this a double-blind study. Twenty minutes after the subjects get the pill, the experimenter has them do what's called a choice reaction time task. On each trial, the subject is asked to press one button as fast as they can if they see a digit appear on a computer screen, and another button as fast as they can if they see a letter. There are 400 trials, and it takes approximately half an hour to complete the task. After the subject has completed the
reaction time task, they are asked to wait an hour to give the caffeine a chance to wash out of their system. Then they go through the same procedure with a second dosage of caffeine: get the pill, wait 20 minutes, do the 400-trial RT task. After completing the task a second time, they go through the identical procedure again with a third dosage of caffeine. By the time the subject has ended their participation in the study, they have taken all three dosages of caffeine and performed the same RT task three different times.

Is there any potential for a confound here? What if you did the same excruciatingly boring RT task three times in the same day? Do you think your performance the third time you did the task would be the same as doing the task for the first time? No way. What's potentially different about doing the task that third time? Well, for one thing, you'd probably be pretty sick and tired of doing the task. It's a pretty good bet that you'd be much more tired and a lot more bored the last time you did the task that day than the first time you did it. That's a potential explanation for why performance at the third time of testing could be different from performance at the first time of testing. Fatigue and boredom have a potentially confounding effect in the design. If you find that performance in the high-dose condition (third time of testing) is significantly different from performance in the no-caffeine condition (first time of testing), there's no way to tell whether it's the caffeine that's causing the effect or the onset of boredom or fatigue. In addition, it's possible that performance at the third time of testing is better because of the extensive amount of practice the subjects have already received in doing the task. Practice, fatigue, and boredom are all included under the heading of practice effects. Practice effects represent confounds in a within-subjects design that constitute changes in the state of the research subject across repeated administrations of the same procedures
for getting a score on the dependent variable.

So how do experimenters control for the confounding effects of practice? The answer is that there is no way to eliminate practice effects. You simply cannot prevent people from gaining experience with a task, or from becoming bored or tired. However, it is possible to make sure that whatever effects there are of practice, fatigue, or boredom, these effects are spread out evenly over the various levels of the independent variable. This is accomplished by varying the order in which the levels of the independent variable are administered to subjects. For example, the table below presents a scheme for making sure that, in a set of three subjects, every level of the independent variable is administered at every time of testing exactly once. Varying the order of administration of levels of the independent variable in order to control for practice effects is known as counterbalancing.

| Subject | Time 1    | Time 2    | Time 3    |
|---------|-----------|-----------|-----------|
| 1       | No Caff   | Low Dose  | High Dose |
| 2       | Low Dose  | High Dose | No Caff   |
| 3       | High Dose | No Caff   | Low Dose  |

Counterbalancing works to control for practice effects by making sure that no one condition is any more influenced by practice, fatigue, or boredom than any other condition. After all, each of the three dosages of caffeine occurs exactly once at Time 3, when the effects of practice should be most evident. In this set of three subjects, the three levels of the independent variable occur one time each at each of the three times of testing. Counterbalancing doesn't eliminate practice effects; it spreads the effects of practice out evenly across every level of the within-subjects independent variable.

### The error term for a one-way between-subjects ANOVA

In a one-way ANOVA in which the independent variable is a between-subjects factor, where does the error term come from? When you get a Mean Square not-accounted-for in a between-subjects ANOVA, what makes this number "error"? Numerically, the number is based on how far each raw score is from the mean of its group.
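That number can be computed directly. The groups and scores below are hypothetical, chosen only to keep the arithmetic small; the point is just where the error term comes from:

```python
# Hypothetical scores for three groups of a between-subjects IV
# (tiny groups of four, just to keep the sketch short).
groups = [
    [4, 5, 6, 5],
    [7, 8, 9, 8],
    [5, 7, 6, 6],
]

# SS within-groups: squared deviation of each raw score from its own group
# mean, summed over all groups. The IV cannot explain any of this variability.
ss_within = sum(
    sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
)
df_within = sum(len(g) - 1 for g in groups)
ms_error = ss_within / df_within    # the Mean Square not-accounted-for
print(ss_within, df_within)         # 6.0 9
```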
Because everyone in a particular group was treated exactly alike with respect to the independent variable, differences between the raw scores and the means of their groups can't possibly be due to the effect of the independent variable. The sum of squared deviations between the raw scores and the group means is referred to as error because this is an amount of variability that the independent variable cannot explain. And in data analysis, anything that is due to chance, or that can't be explained, is referred to as error. Again, it's not that anyone made a mistake. Error is just something that we don't have an explanation for. Error is variability that the independent variable should account for, but can't. The error for a between-subjects factor comes from the fact that there is variability in the scores within each group, but the independent variable can't explain why.

What variability should a within-subjects IV be able to explain, but can't? That amount of variability is the error term that we ought to be using. Below are the data for the five subjects we mentioned earlier in the study on the effects of caffeine on reaction time.

[Table: RT scores for each of the five subjects (S1 to S5) at the three levels of Caffeine Dosage (a1 = No Caff, a2 = Low Dose, a3 = High Dose); only fragments of this table survived transcription.]

In the graph, what happens to Subject 1 as they go from the no-caffeine condition to the low-dose condition to the high-dose condition? They start out with a score of 250, then go to 400 msec, and then to 480 msec. What is the effect of changing the dosage on RT for this subject? The effect of the IV is the change in the scores that you can attribute to changing the conditions. For this subject, if you change the conditions from 0 mg to 4 mg, you see an increase of 150 msec in RT. This 150 msec increase is the effect of giving the subject an additional 4 mg of caffeine. What's the effect of going from 4 mg of caffeine to 8 mg? Again, the effect is a further increase of 80 msec, going from 400 msec to 480 msec. That's the
effect of Caffeine Dosage on Subject 1: a 150 msec increase going from no caffeine to a low dose, and an 80 msec increase going from a low dose to a high dose.

What's the effect of changing the dosage on Subject 2? For Subject 2, going from no caffeine to a low dose resulted in an increase of 110 msec. Going from a low dose to a high dose was associated with a decrease of 30 msec. Is the effect of caffeine the same for Subject 1 as it is for Subject 2? No. Giving Subject 1 more caffeine was associated with substantially greater increases in RT than for Subject 2. Why? Every subject was treated exactly alike with respect to the independent variable. Everyone was administered exactly the same treatment in exactly the same way. So why didn't the two subjects respond in exactly the same way to the different dosages of caffeine? We don't know. We have no explanation for this. If the IV were the only explanation for why scores on the dependent variable change over time, every subject in the study ought to display the identical pattern of change in their scores as you go from one dosage of caffeine to the next. But clearly they don't. The effect of the IV is not the same for the various subjects in the study. This is the error for a within-subjects factor: the degree to which the effect of the IV is not the same for every subject.

So how do we put a number on the error attributable to a within-subjects factor? It turns out that we've already talked about the basic idea of computing this source of variability. Think about the definition of the error for the within-subjects factor: the degree to which the effect of the IV is not the same for every subject. What does this remind you of? Let me say it again: THE DEGREE TO WHICH THE EFFECTS OF THE WITHIN-SUBJECTS FACTOR ARE NOT THE SAME FOR EVERY SUBJECT. This sounds like the definition of an interaction. That's because it is. But wait a minute. How can I have an interaction between two independent variables when there's only one independent variable? That's a
fair question. The answer is that the interaction we're dealing with here is between the independent variable that we're interested in (Dosage of Caffeine) and another variable that we can treat as a second independent variable. The mysterious second IV is nothing other than Subject. Think about it for a second. Someone might theoretically be interested in seeing whether there are differences among the subjects when you average over the three dosages of caffeine. You can do this because we have data for every combination of a level of Caffeine Dosage and a level of Subject, because every subject went through every level of Caffeine Dosage. The appropriate error term used to test the effect of Caffeine Dosage is the degree to which Caffeine Dosage interacts with Subject.

Look at the plot of the data for each subject as they provided data at each level of Caffeine Dosage. Are the lines parallel with each other? No, not really. Clearly, when you increase the dosage you don't get exactly the same effect for each of the five subjects. The job of the independent variable Caffeine Dosage is to explain why people's scores change as you go from one time of testing to another. If everyone displayed the same pattern of change, it would be fair to say that there would be no error as far as the effect of Caffeine Dosage is concerned. If this were the case, the lines for all of the subjects would have to be perfectly parallel to each other. The degree to which these lines are not parallel to each other is variability that Caffeine Dosage cannot explain. Why is it that, when every subject was treated exactly alike, the subjects didn't all display exactly the same pattern of change over time? Knowing how much caffeine people got at each time of testing can't help us to answer this question. That's what makes A×S the error term for a one-way within-subjects ANOVA.

The calculation of the sum of squares accounted-for is exactly the same as in the between-subjects ANOVA. It's based on the deviation between
the treatment means (the means for each Caffeine Dosage) and the grand mean (the mean of all the scores in the data set). The calculation of the error term, A×S, is the same idea as calculating the sum of squares for A×B that we just got done talking about. It helps if you think of this study as a design that has three levels of one IV (Caffeine Dosage) and five levels of the other IV (Subject). The only odd thing about the design is that there is only one subject at each combination of a level of Caffeine Dosage and a level of Subject. Because the number of subjects in each cell (a combination of a level of A and a level of S) is one, there is no variability of scores within cells. This means that there's no equivalent of S/AB. There's no variability of subjects within cells; you can't have variability when there's only one subject. This means that there are only three sources of variability that contribute to the sum of squares total: (1) the main effect of Caffeine Dosage, (2) the main effect of Subject, and (3) the interaction between Caffeine Dosage and Subject. The interaction between Caffeine Dosage and Subject is the error term used to test the two main effects. For the purposes of the researcher, the only effect that it really makes sense to test is the effect of Caffeine Dosage. You certainly could test the effect of Subject if you wanted to, but what would it tell you? You already know that people are different from each other. Big deal.

Below is the ANOVA table for the data discussed above. Notice that the degrees of freedom are calculated in exactly the same way as in the two-way between-subjects ANOVA.

| Source | SS        | df | MS        | F observed | F critical |
|--------|-----------|----|-----------|------------|------------|
| A      | 70,333.33 | 2  | 35,166.67 | 16.29      | 4.46       |
| S      | 53,293.33 | 4  |           |            |            |
| A×S    | 17,266.67 | 8  | 2,158.33  |            |            |

The observed value of F for the effect of Caffeine Dosage is greater than the critical value of F, so we have a significant overall effect of Caffeine Dosage. What does this tell us? It tells us that there are differences among the means for the three levels of Caffeine Dosage. It doesn't tell us where these differences are.
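The partitioning behind a table like this can be checked with a short script. Since the raw caffeine RT scores did not survive transcription, this sketch reuses the fully legible Digit Symbol data from the aging example earlier in the notes:

```python
# One-way repeated measures ANOVA: partition SS total into A, S, and A x S.
# Digit Symbol scores for 5 subjects tested at ages 65, 70, and 75.
scores = [
    [55, 51, 45],
    [63, 60, 59],
    [49, 51, 47],
    [51, 44, 44],
    [44, 39, 34],
]

n_subj, n_lvls = len(scores), len(scores[0])
flat = [x for row in scores for x in row]
grand = sum(flat) / len(flat)

col_means = [sum(row[j] for row in scores) / n_subj for j in range(n_lvls)]
row_means = [sum(row) / n_lvls for row in scores]

ss_a = n_subj * sum((m - grand) ** 2 for m in col_means)   # main effect of A
ss_s = n_lvls * sum((m - grand) ** 2 for m in row_means)   # main effect of S
ss_total = sum((x - grand) ** 2 for x in flat)
ss_axs = ss_total - ss_a - ss_s                            # A x S: the error term

df_a, df_axs = n_lvls - 1, (n_lvls - 1) * (n_subj - 1)
f_obs = (ss_a / df_a) / (ss_axs / df_axs)
print(round(f_obs, 2))                                     # 10.61
```

With df = (2, 8), this F exceeds the same critical value of 4.46 used above, so the effect of Age would also be judged significant.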
So what are we supposed to do? You guessed it: comparisons among treatment means.

### Comparisons among treatment means for a one-way within-subjects ANOVA

Back in the good old days, when life was simple and the truth self-evident, when we had an independent variable that was between-subjects we always used the same error term. This error term might have been labeled S/A or S/AB; it didn't matter. It meant the same thing: the variability of subjects within groups. The error being measured was the degree to which subjects treated exactly alike with respect to the IV ended up having different scores. The reason we got to keep using the same error term, no matter which scores were being compared to which other scores, was that we assumed the variability of the scores in the different groups is always the same. The mean square for S/AB is basically the average variance (s²) for every group of subjects in the study. For a design with two levels of A and three levels of B (a 2 × 3 design), the MS S/AB is the mean variance for the six groups in the study. Because the values for s² are assumed to be the same for all six groups, it doesn't matter whether you're only looking at the subjects involved in the simple effect of B at a1: the average variance for the three groups involved in the simple effect will be the same number as the average variance of all six groups, because the group variances are all assumed to be the same. The assumption of homogeneity of variance saved us from having to calculate a new error term for every effect.

Unfortunately, the error we're dealing with in a within-subjects design is not the variability of the scores within levels of the independent variable. It's the interaction between the independent variable and subjects. There is absolutely no guarantee that the sum of squares for the interaction of A and S for one comparison will be anywhere close to the sum of squares for A×S for another comparison.

Let's say that we want to do two comparisons for this example. We want to
test the prediction that subjects will perform less well with a low dose of caffeine than with no caffeine. Second, we want to test the prediction that subjects getting a high dose of caffeine will perform less well than subjects getting a low dose of caffeine. These comparisons are a1 vs. a2 and a2 vs. a3. Obviously, the only subjects that are relevant to a1 vs. a2 are the subjects in the no-caffeine and low-dose conditions. Let's plot their data alone.

[Plot: RT for each subject in the No Caffeine and Low Dose conditions.]

Let me ask you to rate the degree to which Caffeine Dosage interacts with Subjects for this comparison. How much do the lines deviate from being parallel? How about on a scale of one (no interaction at all) to ten (a perfect interaction)? You might reasonably give this a three. This means that this comparison deserves to have an error term that is a relatively small number. Now let's plot the data for the second comparison, a2 vs. a3.

[Plot: RT for each subject in the Low Dose and High Dose conditions.]

Rate this interaction on the same scale of one to ten. What would you give this one? You might reasonably give this interaction an eight. The lines are clearly pretty far from being parallel. The effect of caffeine dosage is a great deal less consistent across subjects for this second comparison than for Comparison One. This means that the second comparison deserves to have an error term that is a relatively large number. If we kept using the error term for the overall effect of Caffeine Dosage, we'd have used an error term that was quite a bit different from the number that was really appropriate. If we used the error term for the overall effect to test the first comparison, we'd use a number that's a lot larger than it needs to be. We'd end up with an observed value for F that's smaller than it deserves to be, and we'd be less likely to reject the null hypothesis. If we used the error term for the overall effect to test the second comparison, we'd use a number that's a lot lower than it needs to be. In this case we'd end up with an observed value for F that's larger than it
really deserves to be. Each comparison should be tested using an error term that reflects just the amount of error that exists among the scores that are relevant to that comparison. There is no assumption of homogeneity of variance to keep these error terms relatively close to each other. The A×S for one comparison can be whoppingly different from the A×S for another comparison. And it doesn't violate ANY assumption. That's just the way it is. The implication of all this is that every effect involving the within-subjects factor has to have its own tailor-made error term to go with it. This means that the comparison of a1 vs. a2 is essentially a one-way within-subjects ANOVA with only two levels of the independent variable. It's conducted using only the scores in levels one and two.

# Comparisons Between Treatment Means in ANOVA

Dr. Tom Pierce, Department of Psychology, Radford University

Let's say that a researcher wants to test the effects of an independent variable (say, the amount of tutoring) on a dependent variable (achievement test scores), and that the independent variable has three levels. The researcher decides to use Analysis of Variance to do this and does an F-test. As discussed in the last chapter, this F-test is designed to answer the question: are there any differences among the three sample means? This is a yes-or-no question. Either all three sample means are estimates of the same population mean or they're not. The researcher finds that this F-test is significant, so they conclude that there is a significant effect of tutoring on achievement test scores.

Okay, so what do you know so far? You know that the independent variable has an effect on the dependent variable. You know that if you change the amount of tutoring, you change the scores on the achievement test. That's something. But this is a very general piece of information. The F-test doesn't tell you whether people who get a lot of tutoring perform better than people who only get some tutoring. It
doesn't tell you whether people who get some tutoring perform better than people who get no tutoring. It only tells you whether or not there is *an* effect of tutoring on achievement test scores. It gives you no information about what tutoring does to the scores. Because this F-test allows the researcher to make a decision about whether there are differences among any of the three sample means, it's usually referred to as an overall F-test, or an omnibus F-test. When the overall F-test is significant, you've only completed the first step in figuring out what's going on with your data. Now, in order to say exactly what tutoring does to people's scores, you've got to determine which groups are different from which other groups. You've got to be able to perform a set of comparisons among the treatment means in the study.

This overall F-test is only the first step. In order to get more specific information about which groups are different from which other groups, the researcher needs to perform an additional set of tests to see which of these differences are significant. These tests are referred to as comparisons among treatment means. For example, one interesting comparison might test the difference between the mean for people who get no tutoring and the mean for people who get some tutoring. If you've got three treatment means, there are three possible comparisons of one mean with another mean: Group 1 vs. Group 2, Group 2 vs. Group 3, and Group 1 vs. Group 3.

### Planned versus unplanned comparisons

One important distinction when conducting a set of comparisons is whether the comparisons are considered to be planned or unplanned (Thomas W. Pierce, 10/23/2008). This is really a question of whether or not the investigator intended to test a certain comparison before the data were collected. Comparisons that the researcher intended to make before they collected the data are referred to as planned comparisons. Comparisons that the researcher decides to
make after they get the data are referred to as unplanned comparisons. The reason the distinction is important is that the techniques for testing planned and unplanned comparisons are different.

Although we'll get into greater detail on this later, the basic issue here concerns the consequences of conducting a large number of statistical tests. I mean, think about it. When you test the difference between two means, what is the risk of making a Type I error? If the researcher uses an alpha level of .05, they're saying that they're willing to accept a five percent chance of making a Type I error. Okay, so how about if the researcher does three comparisons? What's the risk of making a Type I error somewhere in that set of comparisons? Well, there's a five percent chance of making a Type I error every time the researcher does a test, so the risk of making a Type I error anywhere in the set would have to be roughly the number of comparisons (three) multiplied by the risk of making a Type I error for each comparison (five percent). This tells us that the risk of making a Type I error somewhere in that set of three comparisons is not the comfortably modest value of .05; it's really about .15. It turns out that there's a difference between the risk of making a Type I error when conducting a single comparison and the risk of making a Type I error anywhere in a set of comparisons. The risk of making a Type I error when testing a single comparison is referred to as the per-comparison alpha level. The risk of making a Type I error anywhere in a set of comparisons is referred to as the familywise alpha level. The different methods for conducting planned and unplanned comparisons allow the researcher some choice in terms of how they want to handle the problem of taking on added risk of a Type I error with every additional comparison they perform. We'll talk about methods for conducting planned comparisons first.

### Methods for conducting planned comparisons

The investigator in the tutoring and achievement
test scores example has found that there is a significant overall effect of tutoring on achievement test scores. Anticipating this significant overall effect, let's say that the investigator made two predictions before collecting the data. First, they predicted that students who get a lot of tutoring will have significantly higher scores than students who get a moderate amount of tutoring. Second, the investigator predicted that students who get tutoring, regardless of the amount, will have significantly higher scores on the achievement test than students who did not get tutoring.

Independent samples t-test

Without question, a researcher should be allowed to conduct a number of planned comparisons without having to make any type of adjustment for familywise error. In other words, the researcher deserves an answer to the question of why the overall effect was significant, and it's going to take several comparisons to address that question. A familywise alpha level of .10 or .15 is just the price that has to be paid for this information. If the researcher has two to four planned comparisons in mind, they should just go ahead and test these comparisons as regular old t-tests. There's no need to make it more difficult to reject the null hypothesis for any of these t-tests. In SPSS, the way to get the results for these t-tests is through the Contrasts option at the bottom of the One-Way ANOVA window. You will often see comparisons reported as t-tests in results sections.

Contrasts reported as F-tests

An alternative to reporting comparisons between treatment means as t-tests is to report them as F-tests. If you think about it for a second, what's the difference between comparing the means of groups 1 and 2 using a t-test and testing the difference between the means for these two groups using an F-test? No difference. When there are only two groups, an F-test gives you exactly the same information that a t-test does. You can see this clearly from the probability levels for the F- and t-tests: they're
identical, which means that they're both equally likely to yield a significant effect. Doing a t-test is the same thing as doing an F-test. One is no better or worse than the other, although the F-test route carries a bit more flexibility in the types of questions you can ask, as we'll see shortly. The relationship between a value for t and a value for F is a very simple one. Let's say that an investigator tests the difference between two means using a t-test and then an F-test. As we mentioned in the previous paragraph, the probability levels will be the same, and the value for F will be equal to the value for t squared: F = t².

You'll often see comparisons between treatment means reported as F-tests. If you were doing your comparisons using F-tests in real life, you'd probably use a program like SPSS to do the calculations for you. But let's say that you wanted to do the F-test for a comparison by hand. One reason for showing you this is that it helps to show what the job of a set of comparisons really is. We've already talked our way through the ANOVA table for the overall effect. There's a Sum of Squares Between-Groups (accounted for) and a Sum of Squares Within-Groups (not accounted for). The Within-Groups Sum of Squares represents variability that we don't have an explanation for. The Sum of Squares Between-Groups represents variability among the scores that we attribute to the effect of changing the level of the independent variable as we go from one group to the next. It represents the variability among all of the groups. This sum of squares accounted for is something that we can take apart. There's a certain amount of variability that we can attribute to a difference between the scores in group 1 and the scores in group 2. This latter amount of variability is a part of the total amount of variability accounted for. And we can test this specific amount of variability to see if it's significantly greater than what we might expect to get just by chance.
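Before going further, the F = t² claim from a moment ago can be verified numerically. Here is a sketch in Python using scipy, with invented scores for two groups (hypothetical data, not part of the original example):

```python
# Verifying that F = t-squared (and that the p-values match) for a
# two-group comparison. The scores below are made up for illustration.
from scipy import stats

group1 = [12, 14, 11, 15, 13]
group2 = [8, 9, 7, 10, 6]

t, p_from_t = stats.ttest_ind(group1, group2)
F, p_from_F = stats.f_oneway(group1, group2)

print(abs(t**2 - F) < 1e-9)         # True: t squared equals F
print(abs(p_from_t - p_from_F) < 1e-9)   # True: identical p-values
```

Either test leads to exactly the same decision about the null hypothesis, which is the point being made in the text.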
In other words, is the variance (mean square) accounted for by a difference between the means for groups 1 and 2 significantly greater than the variance (mean square) not accounted for (the mean square within-groups)? All we have to do is calculate a sum of squares for the comparison we're doing, then divide it by the appropriate number of degrees of freedom to get the mean square for that comparison. Then we take the mean square for the comparison and divide it by the mean square within-groups that we already have to get an F-ratio. One way to help think about this process is to just add an extra row to the ANOVA table we generated in the last chapter. The only difference is that this extra row will be dedicated to calculating an F-ratio for our comparison.

Source        SS    df    MS    F(observed)   F(critical)
Between      250     2   125        50           3.89
  a1 vs a2
Within        30    12   2.5

All right, so now let's calculate the sum of squares for the comparison. This process starts by calculating a value for D (according to the notation in Howell, 2002). One way of thinking about this value for D is that it basically represents the difference between the means being compared. A value of zero for D indicates that there isn't any difference at all between the means. The further the value for D is from zero, the bigger the difference between the means. To calculate the value for D, you first list the means for all of the levels of the independent variable in order. So the means for groups one, two, and three are:

13   8   3

The next step is to multiply each mean by a weighting, or coefficient, that reflects the contribution of that mean to the comparison being made. The rules for generating a set of coefficients for a particular comparison are that (a) the coefficients have to add up to zero and (b) the pattern of the coefficients as you move across the various levels of the IV has to reflect, or code, the comparison being made. One should be able to look at the coefficients multiplied by the means and know what the comparison is. In terms of
trying to explain how to generate these coefficients, it's easier just to show you through a couple of examples than to define some elaborate set of rules that makes it harder than it really is. Here goes. We want to compare the mean of Group 1 to the mean of Group 2. Does Group 3 have anything to do with this comparison? No. Okay, so how much does this group contribute to the comparison? Nothing. Okay, so what coefficient do you think the mean for this group ought to get? Zero. Right. So the mean of 3 gets multiplied by a zero:

13   8   3(0)

Next, you know that the coefficients have to add up to zero. So if you make the coefficient applied to group 1 equal to +1, what are you going to have to make the coefficient for group 2? It's going to have to be −1. So now we've got:

13(1)   8(−1)   3(0)

The set of coefficients (1, −1, 0) is said to code the comparison of the mean of group 1 to the mean of group 2. Now, to generate the value for D, all you have to do is multiply the means by their coefficients and then add these numbers up:

D = 13(1) + 8(−1) + 3(0) = 13 − 8 + 0 = 5

The value for D is 5. Obviously, there is a five-point difference between the mean for group 1 and the mean for group 2. Next, we take this number and plug it into an equation that gives us the sum of squares for the comparison:

SS(comparison) = nD² / Σc²

The top part of the equation is easy: n refers to the number of people in each group, which is 5. So nD² becomes 5(5²) = 5(25) = 125. The bottom part of the equation, Σc², refers to the number you get when you take each of the coefficients, square them, and then add these squared numbers up. So here we've got 0² + 1² + (−1)² = 0 + 1 + 1 = 2. The number crunching for the equation ends up looking like this:

SS(comparison) = nD² / Σc² = 5(5²) / 2 = 125 / 2 = 62.5

The sum of squares for this particular comparison is 62.5. Now let's plug it into the ANOVA table and see what happens:

Source        SS    df    MS    F(observed)   F(critical)
Between      250     2   125        50           3.89
  a1 vs a2  62.5
Within        30    12   2.5

We want the F-ratio for the comparison. We've got the sum of
squares. How many degrees of freedom should we divide this sum of squares by? Well, how many groups are involved in the comparison? Two. And what's the equation for determining the number of degrees of freedom for the accounted-for term? It's the number of levels of the independent variable minus one (a − 1). If we've got two means being compared, then two levels minus one leaves us with one degree of freedom. It turns out that because every comparison is the comparison of one group of scores to another group of scores, every comparison has one degree of freedom associated with it. If the comparison has one degree of freedom, that means that the Mean Square for the comparison is equal to 62.5 divided by one, which just leaves us with 62.5.

Source        SS    df    MS    F(observed)   F(critical)
Between      250     2   125        50           3.89
  a1 vs a2  62.5     1  62.5
Within        30    12   2.5

The last step is to determine the F-ratio for the comparison. The F-ratio for the comparison is computed by taking the mean square for the effect being tested (the comparison) and dividing it by the mean square for the error term (the mean square within-groups), which is 2.5. This gives us an F-ratio of 25.

Source        SS    df    MS    F(observed)   F(critical)
Between      250     2   125        50           3.89
  a1 vs a2  62.5     1  62.5       25            4.75
Within        30    12   2.5

The only thing we need to know now is the critical value used to test this F-ratio for significance. Just like before, the critical value for F is based on the number of degrees of freedom for the numerator and the number of degrees of freedom for the denominator. Here we need to look up the critical value for F when there is one degree of freedom in the numerator and 12 degrees of freedom in the denominator. The number we get is 4.75. The observed value of 25 is greater than the critical value of 4.75, so our decision is to reject the null hypothesis that there is no difference between the two means. Our conclusion is that the mean of students getting a lot of tutoring is significantly greater than the mean of students getting a moderate amount of tutoring, F(1, 12) = 25, p < .05.
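The by-hand procedure above can be sketched in a few lines of Python. The helper function here is hypothetical (not from the text), but the numbers reproduce the worked example: n = 5 per group, means of 13, 8, and 3, and a mean square within-groups of 2.5.

```python
# F-test for a planned comparison: D from the contrast coefficients,
# then SS = n * D^2 / sum(c^2), MS (df = 1), and F = MS / MS-within.
def contrast_F(means, coeffs, n, ms_within):
    D = sum(c * m for c, m in zip(coeffs, means))
    ss = n * D**2 / sum(c**2 for c in coeffs)
    ms = ss / 1                      # every comparison has 1 df
    return D, ss, ms / ms_within

# Comparison 1: a1 vs a2, coded by the coefficients (1, -1, 0).
D, ss, F = contrast_F([13, 8, 3], [1, -1, 0], n=5, ms_within=2.5)
print(D, ss, F)   # 5 62.5 25.0
```

The same helper works for any set of coefficients that sums to zero, which is what makes the coefficient notation convenient.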
Now, on to the second planned comparison: people who get tutoring compared to people who do not get tutoring. Who's getting compared in this comparison? Remember, any comparison is between two groups of people. Obviously, there are two different groups who got tutoring (a lot of tutoring, some tutoring) and one group that didn't get tutoring. In this comparison, it doesn't matter how much tutoring people get. We're really comparing the average score for everybody who got tutoring (i.e., people in a1 or a2) to the average score for people who didn't get tutoring (a3). As shorthand, you could represent this comparison as (a1 + a2) vs. a3. This is an example of a complex comparison, because more than one group is represented on at least one side of the comparison. The first comparison (a1 vs. a2) is referred to as a pairwise comparison, because only a pair of means is involved.

So how do you get the coefficients? Remember that every comparison has two sides to it. Also, the coefficients have to add up to zero. Whatever amount of weight you give to the positive side of the comparison, you'll have to give the same amount of weight to the negative side. So if you make the coefficients on the positive side add up to +2, you've got to have the negative coefficients add up to −2. We've got two means on the positive side of the comparison (a1 and a2), and we want to give equal weight to both means, so we could assign a coefficient of +1 to a1 and a coefficient of +1 to a2. That way, the total for the two coefficients comes up to +2. The only mean on the negative side of the comparison is a3, so we could assign a coefficient of −2 to a3. That makes the entire set of coefficients (1, 1, −2). When we use these coefficients to calculate the value of D for this second comparison, we get:

D = 13(1) + 8(1) + 3(−2) = 13 + 8 − 6 = 15

Now we take this value for D of 15 and plug it into the equation for the sum of squares for a comparison:

SS(comparison) = nD² / Σc² = 5(15²) / 6 = 5(225) / 6 = 1125 / 6 = 187.5

In the denominator of the equation, we need the sum of the squared coefficients, which is 1² + 1² + (−2)² = 1 + 1 + 4 = 6. So the sum of squares for the second comparison is 187.5. When we add this second comparison to the ANOVA table, we get:

Source             SS    df     MS    F(observed)   F(critical)
Between           250     2    125       50            3.89
  a1 vs a2       62.5     1   62.5       25            4.75
  a1+a2 vs a3   187.5     1  187.5       75            4.75
Within             30    12    2.5

This second comparison has 1 degree of freedom. This may seem strange, because all three groups are involved in the comparison, but it's still the case that this comparison is testing the mean of one group of people (people who got tutoring) against the mean of a second group of people (people who did not get tutoring). Two levels of the independent variable minus one leaves you with one degree of freedom. The critical value for this second comparison stays the same, because the degrees of freedom for the numerator and the denominator of the F-ratio stay the same. So this second comparison is significant, and we can conclude that students who get tutoring, regardless of the amount, perform significantly better than students who do not get tutoring, F(1, 12) = 75.0, p < .05.

Orthogonal versus non-orthogonal comparisons

Now take a look at the sums of squares for the two comparisons. When you add them up, what do you get? 250. That's the sum of squares accounted for. How did that happen? It turns out that when the independent variable has three groups, it only takes two comparisons to provide an explanation for where the sum of squares accounted for came from. It's no accident that this number of comparisons corresponds to the number of degrees of freedom for the overall effect. The combined information from these two comparisons is able to explain why the overall effect was significant. Together, they explain exactly as much variability as there was to explain: no more and no less.
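The second comparison, and the decomposition just described, can be checked the same way. This is a sketch (the helper function is illustrative, not from the text):

```python
# With three groups, two orthogonal comparisons split the between-groups
# sum of squares exactly: 62.5 + 187.5 = 250.
def contrast_SS(means, coeffs, n):
    D = sum(c * m for c, m in zip(coeffs, means))
    return n * D**2 / sum(c**2 for c in coeffs)

means, n = [13, 8, 3], 5
ss1 = contrast_SS(means, [1, -1, 0], n)   # a1 vs a2
ss2 = contrast_SS(means, [1, 1, -2], n)   # (a1 + a2) vs a3
print(ss1, ss2, ss1 + ss2)   # 62.5 187.5 250.0
print(ss2 / 2.5)             # F for the second comparison: 75.0
```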
Now let's say that we wanted to conduct a third comparison, one that compares the mean of people who get a lot of tutoring to the mean of people who get some tutoring or no tutoring. This is the comparison of a1 versus (a2 + a3). Coefficients to test this comparison might be (2, −1, −1). When you plug these coefficients into the Contrasts option in SPSS, get the observed value for t, square it to get the observed value for F, and then figure out the sum of squares for this comparison, you get 187.5. When you add the sums of squares for comparisons 1 and 2 together, you get exactly 250. When you add the sums of squares for comparisons 2 and 3 together (187.5 + 187.5), you get a number that's a lot greater than 250. It looks like comparisons 2 and 3, taken together, account for more variability than there was to account for in the first place. Why is that?

It's because of a distinction that one can make among sets of comparisons: whether the comparisons are said to be orthogonal to each other or non-orthogonal to each other. Two comparisons are orthogonal to each other when they don't overlap at all in terms of the information they provide. In other words, when the comparisons you're talking about are orthogonal, they're answering completely separate questions about the effects of the independent variable on the dependent variable. Let's think about comparisons 1 and 2 for a second. In comparison 1, the only thing happening is that we're looking at the difference between groups 1 and 2. Group 3 has nothing to do with this first comparison. In the second comparison, we're not looking at the difference between groups 1 and 2. We're averaging over groups 1 and 2; we're treating the people in groups 1 and 2 like they come from the same group. And we're comparing these people (people who got at least some tutoring) to people who didn't get any tutoring. In comparison 1, group 3 wasn't involved at all. In comparison 2, group 3 is the only level of the independent variable on one side of the comparison. Comparisons 1 and 2
are said to be orthogonal to each other because they are addressing completely separate questions. On the other hand, think about what comparisons 2 and 3 are doing. What do comparisons 2 and 3 have in common? They both have group 1 on one side of the comparison and group 3 on the other side. In a very real sense, comparisons 2 and 3 are answering similar questions. They're not identical questions, but the two comparisons both provide information about the difference between the means for groups 1 and 3. That means that when you test comparisons 2 and 3, you're getting some of the same information twice. That's why, when you add up the sums of squares for comparisons 2 and 3, you get a number greater than 250: to some extent, you're testing the same questions twice. Some statisticians think that researchers should be restricted to testing orthogonal comparisons. They feel this way because testing the same variability twice means that the researcher has two chances to make a Type I error when testing that variability. There are two chances to make the one mistake. That's a situation where the researcher is in double jeopardy of saying that there's something there when there really isn't. Other statisticians feel that it's okay to conduct non-orthogonal comparisons as long as these comparisons provide answers to interesting and meaningful questions.

One quick way to tell whether two comparisons are orthogonal is to start by stacking the coefficients for the comparisons on top of each other. For comparisons 1 and 2 we'd have:

 1   −1    0
 1    1   −2

Now multiply the coefficients going down each column:

 1   −1    0
 1    1   −2
 1   −1    0

After you've got these numbers at the bottom of each column, you just add them up: 1 + (−1) + 0 gives you a value of zero. Using this little trick, if you get zero, the two comparisons are orthogonal. If you get anything other than zero, the comparisons are not orthogonal. This method tells us that comparisons 1 and 2 are orthogonal to each other.
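The stacking trick is just the dot product of the two sets of coefficients. A small sketch:

```python
# Two comparisons are orthogonal when the products of their
# coefficients, summed across groups, come to zero.
def orthogonal(c1, c2):
    return sum(a * b for a, b in zip(c1, c2)) == 0

comp1 = [1, -1, 0]    # a1 vs a2
comp2 = [1, 1, -2]    # (a1 + a2) vs a3
comp3 = [2, -1, -1]   # a1 vs (a2 + a3)

print(orthogonal(comp1, comp2))   # True  (1 - 1 + 0 = 0)
print(orthogonal(comp2, comp3))   # False (2 - 1 + 2 = 3)
```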
Now let's do the same thing for comparisons 2 and 3. When we stack the coefficients on top of each other, we get:

 1    1   −2
 2   −1   −1

Now, multiplying the coefficients going down each column, we get:

 1    1   −2
 2   −1   −1
 2   −1    2

Adding these numbers up, 2 + (−1) + 2 gives you 3. This number is something other than zero, so we know that these two comparisons are not orthogonal to each other. This method can tell the researcher whether their comparisons are orthogonal before they even collect their data.

Corrections for familywise risk of a Type I error: the Bonferroni adjustment

So how many planned comparisons should a researcher be allowed to do? Obviously, each comparison answers a different question regarding the effect of the independent variable on the dependent variable. On the face of it, why should there be any limit to the number of questions that a researcher is allowed to ask? It turns out that there is quite a bit of disagreement among statisticians on this. What I'm going to do is tell you about the basic issue the statisticians are wrestling with, and then I'll describe a couple of approaches to addressing it. Let's say that a researcher tests one planned comparison and uses an alpha level of .05. What are the chances that the researcher is going to commit a Type I error if they decide to reject the null hypothesis? Five percent, obviously. Okay. Now let's say that the researcher conducts a second planned comparison and uses an alpha level of .05. What are the odds that they've committed a Type I error if they reject the null hypothesis for this second comparison? Again, five percent. All right, now, in doing those two planned comparisons, what are the odds of committing a Type I error in either comparison? What are the odds of committing a Type I error in the set, or the family, of comparisons? Well, they did two comparisons, and each comparison carried a five percent risk of making a Type I error. So, overall, the odds were .10 that the researcher would commit a Type I error in the set
of two comparisons. It turns out that the more comparisons the researcher does, the higher the overall risk of making a Type I error gets. There is an important distinction between the risk of making a Type I error for a single comparison and the risk of making a Type I error in a family, or set, of comparisons. The per-comparison alpha level refers to the risk of making a Type I error for a single comparison. The familywise alpha level refers to the risk of making a Type I error anywhere in a family, or set, of comparisons. The per-comparison alpha level is the alpha level the researcher has decided to use to test an individual comparison. The size of the familywise alpha level is a function of two things: the alpha level used to test each individual comparison and the number of comparisons being tested.

I'm going to show you two equations for the familywise alpha level. The first equation gives the actual risk of making a Type I error in a set of comparisons. This equation is awkward to use, but it gives you the correct answer. The second equation only gives you an approximation of the actual familywise alpha level; however, it is much easier to use. Because answers from this second equation are very close to the real thing, especially when the number of comparisons is less than five or so, this is the equation we're going to work with. The equation to calculate the actual familywise risk of making a Type I error is:

α(FW) = 1 − (1 − α(PC))^c

where c is the number of comparisons. If the researcher decides to use a per-comparison alpha level of .05 and wants to conduct three planned comparisons, they'll end up with the following familywise alpha level:

α(FW) = 1 − (1 − .05)³
α(FW) = 1 − (.95)³
α(FW) = 1 − .857 = .143

According to this equation, the odds of committing a Type I error anywhere in the set of three planned comparisons are .143. Okay, here's the second equation. It's based on the same reasoning we talked our way through above. It basically says that if the per-comparison alpha level is .05, then the researcher takes on 5% worth of risk for
every comparison they do. This equation says that the familywise alpha level is equal to the per-comparison alpha level the researcher is using multiplied by the number of comparisons the researcher has decided to do. In symbol form:

α(FW) = α(PC) × c

or, for our example:

α(FW) = .05 × 3 = .15

According to this second and far easier equation, the familywise risk of a Type I error is .15, which is pretty close to the actual value of .143. If anything, this second equation slightly overestimates the actual familywise alpha level.

Okay, so that's how a researcher can calculate the familywise risk of making a Type I error. The critical question here is whether there should be a limit to the size of the familywise alpha level. In other words, just how much total risk of making a Type I error is too much? Unfortunately, this is one of those questions where statisticians disagree. I'll give you two ways of handling the situation and leave it to you to decide which makes the most sense.

A number of statisticians feel that researchers should be able to test as many comparisons as it takes to answer their important questions (e.g., Keppel, Saufley, & Tokunaga, 1992). If this means that the researcher tests five comparisons and ends up with a 25% chance of committing a Type I error, then so be it. The benefits of getting answers to five important questions outweigh the potential costs that result from making one or more Type I errors. This is a rather liberal approach to statistical decision making.

A person taking a slightly more conservative approach might make the following argument. The job of a set of comparisons is to explain why the overall ANOVA was significant. It takes as many orthogonal planned comparisons to do this as you have degrees of freedom for the overall effect. Therefore, the researcher should be able to perform at least this many comparisons without having to worry that the familywise risk of making a Type I error is getting too large. However, if the researcher wants to do more planned comparisons than the number of degrees of freedom for the overall effect, the researcher is now at the point where they have to worry that the familywise alpha level is getting too high.
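Both familywise equations are easy to compute. A sketch comparing the exact and approximate values for our three comparisons:

```python
# Exact familywise alpha: 1 - (1 - alpha)^c.
# Approximate familywise alpha: alpha * c (a slight overestimate).
def familywise_exact(alpha_pc, c):
    return 1 - (1 - alpha_pc) ** c

def familywise_approx(alpha_pc, c):
    return alpha_pc * c

print(round(familywise_exact(.05, 3), 3))    # 0.143
print(round(familywise_approx(.05, 3), 3))   # 0.15
```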
For example, if the independent variable has three levels, then the researcher should be allowed to conduct two planned comparisons without any kind of penalty. This means that the researcher is allowed to let the familywise alpha level get as high as .10, but they can't let it get any higher than this. The highest the familywise alpha level can get is thus:

Maximum α(FW) = df × α(PC, original) = 2 × .05 = .10

There is nothing wrong at all with wanting to test more than two planned comparisons in this example, but how could the researcher do this and still not let the familywise alpha level get above .10? Let's say they want to test four planned comparisons. Obviously, if the researcher wants to test more than two planned comparisons, they're going to have to use a lower per-comparison alpha level. In this case, if the researcher used a per-comparison alpha level of .025, they could test their four planned comparisons and still keep the familywise alpha level from going over .10. The Bonferroni Adjustment provides an equation for calculating an adjusted per-comparison alpha level to use when the number of planned comparisons is larger than the number of degrees of freedom for the overall effect. This adjusted per-comparison alpha level is basically just the maximum familywise alpha level the researcher is allowed to have, divided by the number of planned comparisons the researcher actually wants to do:

Adjusted α(PC) = (df × α(PC, original)) / c

The value for c in this case represents the number of comparisons. Applying the Bonferroni Adjustment to this example, we get:

Adjusted α(PC) = (2 × .05) / 4 = .10 / 4 = .025

The strategy employed in the Bonferroni Adjustment is used widely in data analysis in situations where the researcher is faced with conducting a large number of different tests.
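A sketch of the adjustment with the numbers from this example (the function name is mine, not standard terminology):

```python
# Bonferroni-style adjustment as described in the text: cap the
# familywise alpha at df * alpha, then spread that cap across the
# number of planned comparisons actually being tested.
def bonferroni_adjusted_alpha(alpha_pc, df_effect, n_comparisons):
    max_familywise = df_effect * alpha_pc     # 2 * .05 = .10
    return max_familywise / n_comparisons     # .10 / 4 = .025

print(bonferroni_adjusted_alpha(.05, df_effect=2, n_comparisons=4))  # 0.025
```

Testing each of the four comparisons at .025 keeps the approximate familywise level at 4 × .025 = .10, the allowed maximum.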
Unplanned comparisons

When conducting unplanned comparisons, the investigator is prepared to look in a large number of places to detect differences between means. The researcher therefore needs to guard against the risk of making a Type I error anywhere in the set of all possible locations where they could look. Because the number of all possible comparisons is larger than a limited number of planned comparisons, methods for testing unplanned comparisons make it far more difficult to reject the null hypothesis for any one of these comparisons. We'll look at two strategies for conducting unplanned comparisons: the Scheffé test and the Tukey test.

Scheffé test

The Scheffé test is the most conservative method for conducting unplanned comparisons. It allows the researcher to test every possible comparison, pairwise and complex. SPSS provides a table that reports the results for all pairwise comparisons conducted using the Scheffé method. Unfortunately, the critical value is adjusted based on the assumption that the investigator is testing every possible comparison, including all of the complex comparisons. So SPSS doesn't give you access to all of the comparisons you're paying for. For this reason, it doesn't make much sense to look at SPSS's Scheffé output without being willing to test at least some of the complex comparisons through the method outlined above, using the Contrasts option. (SPSS's Tukey output gives you the same pairwise comparisons but tests them using a lower critical value.) The adjusted critical value for the Scheffé test is:

F(S) = (degrees of freedom for the effect) × (critical value of F for the overall effect)

Plugging in the numbers from our example, the critical value for F for the Scheffé test becomes:

F(S) = (2)(3.89) = 7.78

You'll notice that this critical value is quite a bit higher than the one used to test a planned comparison (4.75). And with four or five groups, the critical value would go much higher than that. This is what makes the Scheffé test the most conservative method for testing comparisons among treatment means, and it is what makes it the method of choice for the researcher who is especially concerned about committing a Type I error.
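If you don't have an F table handy, the Scheffé-adjusted critical value can be reproduced with scipy. A sketch (scipy.stats.f.ppf looks up the critical F directly):

```python
# Scheffe-adjusted critical F = (df for the effect) * (critical F for
# the overall effect), here for alpha = .05 with 2 and 12 df.
from scipy import stats

F_crit_overall = stats.f.ppf(0.95, dfn=2, dfd=12)
F_scheffe = 2 * F_crit_overall

print(round(F_crit_overall, 2))   # 3.89
print(round(F_scheffe, 2))        # 7.77 (7.78 if you round 3.89 first)
```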
With the Scheffé test, you've got permission to test any and all comparisons without having to worry about an inflated familywise risk of a Type I error. One way of thinking about the Scheffé test is that it's the all-you-can-eat buffet of testing comparisons: you're paying a high price for being able to go back for as many F-tests as you can stomach.

Tukey method

The Tukey method assumes that the researcher is going to test all possible pairwise comparisons. A pairwise comparison is one treatment mean compared to one other treatment mean (e.g., a1 vs. a2). Tests of unplanned comparisons using the Tukey method are conducted using an adjusted critical value for F. The more possible pairwise comparisons there are, the larger this critical value will be. Everything about obtaining the observed value for F is the same as for a planned comparison; the only thing that changes is the critical value for F. The equation for calculating this adjusted critical value is:

F(T) = q² / 2

The adjusted critical value for F for the Tukey test, F(T), is equal to the square of a statistic known as the Studentized Range, q, divided by 2. It doesn't get much easier than that. The value for q can be found in the Critical Values for the Studentized Range table. To know which row to look in to find the value for q, you need to know the number of degrees of freedom for the sum of squares within-groups (the denominator of the overall effect). To know which column to look in, you need to know the number of levels of the independent variable. In the table we're using, the number of levels of the independent variable is referred to as a value for K. At 12 degrees of freedom for the sum of squares within-groups and three levels of the independent variable, we get a value for q of 3.77. Plugging this number into the equation, we get:

F(T) = q² / 2 = 3.77² / 2 = 14.21 / 2 = 7.11

This is the value to use as the adjusted critical value for F when conducting unplanned comparisons using the Tukey method.
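In code, the Tukey-adjusted critical value is one line, using the table value q = 3.77 for K = 3 groups and 12 within-groups degrees of freedom:

```python
# Tukey-adjusted critical F = q^2 / 2, with q taken from the
# Studentized Range table (K = 3 levels, 12 df within-groups).
q = 3.77
F_tukey = q**2 / 2
print(round(q**2, 2), round(F_tukey, 2))   # 14.21 7.11
```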
Take a moment to compare the adjusted critical values for F for the Scheffé and Tukey methods. The critical value for the Scheffé test is 7.78, and the critical value for the Tukey test is 7.11.

Two-Way ANOVA: Analysis of Data from Studies with Two Non-Repeated Independent Variables
Dr. Tom Pierce, Department of Psychology, Radford University

In one type of study that I do, I measure people's blood pressure. It may not seem like a very glamorous variable, but if you think about it, the hemodynamic properties of... okay, forget it. Blood pressure is not a very exciting variable for psychologists. But that's what I do. Anyway, I measure people's blood pressure. And if you're going to measure people's blood pressure, you have to do that while they're sitting in a nice big comfy chair. Let's say that, one at a time, I bring a group of ten research participants into my lab and tell them that I'm going to measure their pulse rate and blood pressure for ten minutes while they're sitting in the room quietly by themselves. At the end of this ten-minute period, a little egg timer is going to go off, and when it does, I'd like them to turn over a piece of paper that's sitting on a table beside them and fill out the questionnaire. The questionnaire is a measure of how anxious they feel at that moment. Then I show them the device I'm going to use to get the readings on pulse rate and blood pressure. The device can take blood pressure automatically, as often as I like. In fact, the machine is made for psychologists, because to start it all you have to do is press the GREEN BUTTON. To stop it, you press the RED BUTTON. Not bad, eh? It would have been nice to only have to worry about one button, but I think we can handle this. I get their informed consent and then leave the room while the machine records their blood pressure every two minutes; they fill out the questionnaire ten minutes later. For a second group of participants, I bring them in and tell them I'm going to have the machine measure their pulse rate
and blood pressure every two minutes for ten minutes, and that after ten minutes an egg timer is going to go off, at which point I'd like them to fill out the questionnaire. Everything is exactly the same as for the first group of participants, except that for these ten people I have an undergraduate research assistant go into the room five minutes into the ten-minute period. The assistant walks over to the other side of the room and sits down at a desk, facing away from the participant. As they walk in, the assistant tells the participant that they need to finish filling out some forms from the last participant and that the participant should just try to ignore them. If the participant says anything, the experimenter asks them not to talk, because talking will just make their blood pressure go up. In this second condition, the only thing that's different is that there is a stranger in the room with the participant. This is usually considered to be a more stressful situation than being in the room by yourself. One variable that I'm interested in studying is stress. In this study, I want to know what will happen to people's self-reported scores for anxiety if I change the level of stress they're experiencing.

Let's say that I bring in a third group of participants. For these people, the instructions are that I'm going to measure their blood pressure for ten minutes and that after ten minutes I'd like them to fill out the questionnaire. For this third group of participants, I tell them that halfway through this ten-minute period, after five minutes, a research assistant is going to come into the room and ask them to do mental arithmetic problems for the rest of the ten-minute period. So I've randomly assigned participants to one of three groups: (1) they're in the room by themselves for ten minutes, (2) they're in the room with a stranger but not interacting with them, and (3) they're doing mental arithmetic. By changing the conditions in this way, I'm
trying to provide an experimental manipulation of the amount of stress that people are experiencing. I'm assuming that these conditions represent increasing amounts of stress. The independent variable in this study is Level of Stress and the dependent variable is Self-Reported Anxiety. The logic of this experiment is exactly the same as in the other experiments we've talked about: you change the conditions as far as the independent variable goes and then you see what happens in terms of the dependent variable.

I've got ten people in each of the three levels of stress. If I had to analyze the data right now, I could do a one-way ANOVA like we've already learned about. But let's say that there's one additional piece of information that I have about every participant in the study: I know whether each participant is an introvert or an extrovert. It turns out that half of the people in the study are introverts and half are extroverts. In fact, exactly half of the people assigned to each level of stress are introverts and half are extroverts. This means that of the ten people in the low stress condition (alone for ten minutes), five are introverts and five are extroverts. And it's the same way for the moderate stress (stranger in the room) and high stress (mental arithmetic) conditions. Introversion/Extroversion represents a second independent variable. And I've got every combination of a level of stress that a participant could have been in and whether they were an introvert or an extrovert. So, in effect, what I really have are six different groups, because I have six combinations of the three levels of stress and the two levels of Introversion/Extroversion. Below is a table showing the means for the six groups in the study. Possible scores for the measure of anxiety range from one to twelve, with higher scores reflecting higher levels of anxiety.

                                   Introversion/Extroversion (A)     (n = 5 per cell)
                                   Introverts (a1)   Extroverts (a2)
            Low Stress (b1)            M = 2.0           M = 2.0
Level of    Moderate Stress (b2)       M = 5.0           M = 2.0
Stress (B)  High Stress (b3)           M = 9.0           M = 3.0

So what questions could you answer using these data? Well, the question we started out with was whether stress has an effect on scores for anxiety. What treatment means could you look at to see if changing the level of stress results in changes in anxiety? To answer this question you'd have to ask what the mean level of anxiety is like for people in the low stress condition, for the people in the moderate stress condition, and for people in the high stress condition. What's the average score for people in the low stress condition? Well, the average for the five introverts under low stress is 2.0 and the average for the five extroverts under low stress is 2.0. The average of these two means is 2.0, so the average for all ten people in the low stress condition is 2.0. Taking the same approach, the mean score for the ten people in the moderate stress condition is 3.5 and the average score for the ten people in the high stress condition is 6.0. These means are displayed in the margin on the right-hand side of the table.

                                   Introversion/Extroversion (A)
                                   Introverts (a1)   Extroverts (a2)
            Low Stress (b1)            M = 2.0           M = 2.0        M = 2.0
Level of    Moderate Stress (b2)       M = 5.0           M = 2.0        M = 3.5
Stress (B)  High Stress (b3)           M = 9.0           M = 3.0        M = 6.0

These means of 2.0, 3.5, and 6.0 are referred to as the marginal means for the effect of stress. These are the means that you get when you average over the effect of introversion/extroversion. In other words, just for a moment, we're pretending that we don't care whether people are introverted or extroverted. We just had ten people giving us scores in each of three levels of stress. From looking at these means, does it seem like there's an effect of stress on anxiety? It does. A little later we'll do the actual statistical test to see if there are statistically significant differences among these means. In this design it's possible to examine the effect of stress all by itself. When we do this, a statistician would say that we're examining the main
effect of stress on anxiety. As a definition, a main effect represents the effect of an independent variable on the dependent variable when you average over the effects of a second independent variable. In this design, not only can we look at the effect of stress all by itself, we can also look at the effect of introversion/extroversion all by itself. In other words, do introverts look different from extroverts in terms of anxiety when you average over the three levels of stress? Now take a look at the means at the bottom of each column. You'll notice that the independent variable Introversion/Extroversion has been assigned the letter A and that the independent variable Stress Level has been assigned the letter B. Just like the situation when we had one independent variable, the upper-case letter represents the independent variable. A lower-case letter with a number subscripted beneath it represents a particular level of the independent variable. For example, a1 represents a group of people. Here, to identify a particular group of people, you have to specify the level of both of the independent variables, A and B.

                                   Introversion/Extroversion (A)
                                   Introverts (a1)   Extroverts (a2)
            Low Stress (b1)            M = 2.0           M = 2.0
Level of    Moderate Stress (b2)       M = 5.0           M = 2.0
Stress (B)  High Stress (b3)           M = 9.0           M = 3.0
                                       M = 5.33          M = 2.33

The mean of 5.33 represents the average score for the 15 introverts. The mean of 2.33 represents the average score for the 15 extroverts in the study. Because these means show up in the bottom margin of the table, they're referred to as the marginal means for introversion/extroversion. If we can show that there's a significant difference between these two means, we'll be able to say that there is a significant main effect for introversion/extroversion. So one way of looking at the study is that you get two sets of results for the price of one study. When you look for differences among the means in the right-hand margin, you're testing the effect of stress all by itself on
anxiety. When you look for differences among the means in the bottom margin, you're testing the effect of introversion/extroversion all by itself on anxiety. Two for the price of one sounds like a pretty good deal. But it turns out that the deal is even better than that. Not only do you get to look at the effects of each independent variable separately, you get an additional piece of information that you couldn't possibly get from a study that had just a single independent variable. Think about it for a second. When you look at the means for the main effect for stress, it seems like higher levels of stress result in higher levels of self-reported anxiety. But is that always true? Is it true for everybody included in the study? NO! Take a look at the column for the introverts. The means go from 2.0 to 5.0 to 9.0 as the level of stress increases. It certainly looks like there's an effect of stress on anxiety when you only look at the introverts. However, when you look at the means in the column for the extroverts, you get a very different story. Here the means go from 2.0 to 2.0 to 3.0 as the level of stress increases. When you only look at the extroverts (half the participants in the study), the researcher increased the level of stress but next to nothing happened to the scores. There's no effect of stress when you only look at the extroverts. There is an effect of stress when you only look at the introverts. That's the piece of information you can only get when you look at the independent variables in combination with each other. When you get this type of pattern in the results, when the effect of one independent variable on the dependent variable doesn't look the same at every level of a second independent variable, we're able to say that there is an interaction between the two independent variables. So there are three major questions that you can answer using this type of design. You can see whether there's an effect of stress all by itself on anxiety. That's the main effect of Stress. You can
look at the effect of I/E all by itself on anxiety. That's the main effect of I/E. And you can see whether the two independent variables interact with each other.

Okay, so how do you test each of these three questions? By using the same kind of strategy as in the one-way ANOVA example. In this study there is a certain amount of variability that needs to be accounted for. In other words, if you took the 30 scores for anxiety and had SPSS calculate the sum of squares for those scores, you'd get 226.17. That's all the variability there is that needs to be explained. In the one-way ANOVA example we had SPSS calculate the amount of variability that our one possible explanation, our independent variable, could explain. And we could see how much variability it couldn't explain. The SS Total equaled the SS accounted-for plus the SS not-accounted-for. When we had one independent variable there was only one source of variability accounted-for. It's the same thing here, except that now, instead of having one source of variability accounted-for, we've got three: the main effect of A, the main effect of B, and the interaction between A and B (A X B). These sources of variability are orthogonal to each other. In other words, they provide completely independent pieces of information. Having a main effect for A has nothing to do with having a main effect for B or with having an interaction between A and B. If we (a) calculate the amount of variability (in sums of squares units) that we can attribute to each of these three sources of variability and then (b) add these three sums of squares up, we'll have all of the variability that's accounted-for in this study. The way a statistician would say it is that we can partition the variability that's accounted for (between-groups variability) into three pieces:

SS Total = SS Between-Groups + SS Within-Groups

SS Total = SS A + SS B + SS A X B + SS S/AB

This means that there are now three overall effects. When you use the GLM (general linear model) module in SPSS, the program will give you an ANOVA table that provides separate F-tests for each of these three overall effects. As we talked about, A, B, A X B, and S/AB each have a sum of squared deviations associated with them. When you add these sums of squares up, you get the sum of squares total. To calculate an observed value for F for each of the three overall effects, you have to take a Mean Square for each effect and divide it by the Mean Square Within-Groups. To get these Mean Squares, you've got to take each sum of squares and divide it by the appropriate number of degrees of freedom. The ANOVA table below shows the sums of squares for the effects and how these degrees of freedom are calculated:

df A X B = (a - 1)(b - 1) = (1)(2) = 2
df A = a - 1 = 2 - 1 = 1
df B = b - 1 = 3 - 1 = 2
df S/AB = ab(n - 1) = (2)(3)(4) = 24

Source          SS       df
Stress X I/E    45.00     2
I/E             67.50     1
Stress          81.67     2
Error           32.00    24
Total          226.17    29

Once you take each sum of squares and divide it by the appropriate number of degrees of freedom, you get the Mean Square for each effect. Then, to calculate the F-ratio for each effect, you divide the Mean Square for each effect by the Mean Square for the error term (MS S/AB). A researcher could then look up the critical value associated with each effect.

df Total = total number of participants - 1 = 30 - 1 = 29

Source          SS       df    MS      F       F-critical
Stress X I/E    45.00     2   22.50   16.88      3.40
I/E             67.50     1   67.50   50.63      4.26
Stress          81.67     2   40.83   30.63      3.40
Error           32.00    24    1.33
Total          226.17    29

It appears from the resulting ANOVA table that all three overall effects are significant. One thing to notice about this ANOVA table is that all three F-ratios are based on the same denominator. The mean square for each effect being tested is divided by the MS S/AB of 1.33. This number, 1.33, represents the average of the variances of the scores in each of the six groups. In other words, if you calculated S² for each of the six groups and then got the average of these six numbers, you'd end up with 1.33. A second thing to notice about the numbers in the ANOVA table is that if you add the sums of squares for the main effect for A, the main effect for B, the A X B interaction, and the error term
you get the sum of squares total. Interestingly, if you add up the degrees of freedom for the same four terms, you get the total number of degrees of freedom, 29.

Okay, so what do we do now? The significant main effects tell us (1) that there's an effect of stress when you average over the two levels of I/E, and (2) that there's an overall effect of I/E when you average over the three levels of stress. But doesn't it seem like these main effects are misleading? I mean, the significant main effect of stress leads you to think that there's always an effect of stress, or that there's an effect of stress for everyone. But is that really true? Well, no. When you look at the means, it seems like there's only an effect of stress when you look at the introverts. And there's no effect of stress when you only look at the extroverts. The main effect is leading you to accept a conclusion that's just not accurate. The presence of a significant interaction between stress and I/E tells you that the effect of stress on anxiety is not the same for introverts as it is for extroverts. If you know that the effect of stress is not the same for introverts as it is for extroverts, WHY WOULD YOU AVERAGE OVER THE TWO LEVELS OF I/E? You wouldn't. It turns out that when the interaction between the two independent variables is significant, the researcher has to use a great deal of caution when interpreting the main effects. When the interaction is significant, it's telling you that the effect of one independent variable is not the same at the various levels of the second independent variable. Knowing this, it follows that the logical course of action would be to look at the effect of stress separately for introverts and extroverts. In other words, you do a test to see whether there's an effect of stress on anxiety when you only include the data for the introverts. Then, after you've got an answer to that question, you do a test to see if there's an effect of stress on anxiety when you only look at the data for the extroverts. When
you do this, you're performing a set of simple effects tests. A simple effect is the effect of one independent variable on the dependent variable at a single level of the second independent variable. This is in contrast to a main effect, which is the effect of an independent variable when you average over every level of the second independent variable.

Let's say that we're particularly interested in the effects of stress, so we decide to test the simple effects of stress at each level of I/E. We can do this by first using the Select Cases option in SPSS to include only the extroverts (level a2). Then do a one-way ANOVA with stress as the independent variable and anxiety as the dependent variable. The output tells us that the sum of squares for this effect is 3.33. This simple effect has 2 degrees of freedom associated with it (because stress still has three levels), and the Mean Square for this simple effect is 1.67. The ANOVA table for the output also gives us an F-ratio and a significance level. The problem with using this F-ratio is that it's using the wrong denominator. Most statisticians consider the appropriate error term to be the Mean Square S/AB. Essentially, this is the average of all six group variances. The error term in the ANOVA table from SPSS for the simple effect is the average of the group variances of only the three groups that are relevant to this particular simple effect. Statisticians consider the average of six group variances to be more stable and accurate than the average of only three, so they want us to go with the same denominator that we've been using: 1.33. So, to get the correct F-ratio, all we have to do is take the Mean Square for the simple effect that SPSS gave us (1.67) and divide it by the correct error term (1.33). We end up with an F-ratio of 1.25.

Then all you have to do is use the Select Cases option to include only the introverts (level a1) and run the one-way ANOVA again. The sum of squares for this simple effect is 123.33. It has 2 degrees of freedom, which gives us a Mean Square for this effect of 61.67. Now put the information from this numerator into the ANOVA table. When you divide this Mean Square by the correct error term of 1.33, you get an F-ratio of 46.37. The expanded ANOVA table that includes the two simple effects is provided below.

Source                  SS       df    MS      F       F-critical
Stress X I/E (A X B)     45.00    2   22.50   16.88      3.40
I/E (A)                  67.50    1   67.50   50.63      4.26
Stress (B)               81.67    2   40.83   30.63      3.40
B at a1 (introverts)    123.33    2   61.67   46.37      3.40
B at a2 (extroverts)      3.33    2    1.67    1.25      3.40
Error (S/AB)             32.00   24    1.33
Total                   226.17   29

From the ANOVA table you see that there is no significant simple effect of stress on anxiety for extroverts. However, there is a significant simple effect of stress for introverts. This combination of simple effects does its job, which is to explain why we had a significant interaction in the first place.

Variability accounted for by a set of simple effects

So what do these simple effects explain? I mean, the reason we did the simple effects was to explain why we had a significant interaction, so you'd think that they explain the same variability that the interaction does. But if you add the sum of squares for B at a2 (3.33) to the sum of squares for B at a1 (123.33), you get 126.66. The sum of squares for the interaction is only 45.00. It looks like the simple effects are accounting for more variability than there was to account for. It turns out that a set of simple effects is able to explain more than just the interaction between the two independent variables. Take a look at the sums of squares for the overall effects. If you take the sum of squares for the interaction (45.00) and then add the sum of squares for the main effect for stress (81.67), you get 126.67, the same number (within rounding) that we got when we added up the sums of squares for all of the simple effects of Stress at levels of Introversion/Extroversion (126.66). The simple effects of B at every level of A are able to explain both the A X B interaction and the main effect for B. They explain the interaction because they show us how the effect of B looks different as you go from one level of A to the next. As far as the main effect of B goes, think of it this way. When we tested the simple effect of B at a2, we were testing the effect of B; it's just that we did this using only the 15 people who were extroverts. The test of B at a2 is a test of B for half of the participants in the study. The simple effect of B at a1 is a test of the effect of B for the 15 introverts; it's a test of the effect of B for the other half of the participants in the study. But by the time you've tested the effects of B at a1 and B at a2, you've tested the effect of B using every participant in the study. And that's the same thing that the main effect of B does. So the simple effects of B at every level of A get two things done: they get the same thing done as the main effect for B, and they get the same thing done as the A X B interaction.

Simple comparisons

The simple effect of B at a2 is not significant. Is there anything more to do in terms of investigating the effects of stress on anxiety when you only look at the extroverts? No. The non-significant simple effect tells us that there are no significant differences among the group means of 2.0, 2.0, and 3.0. The simple effect of B at a1 is significant. What does this tell us? It tells us that there are differences among the group means of 2.0, 5.0, and 9.0. But that's a pretty general piece of information. The simple effect doesn't tell us where these differences are. Just like in a one-way ANOVA, we need to do a set of comparisons among the three group means that are relevant to this simple effect.

Let's say that before the investigator had collected the data, they'd anticipated that the simple effect of B at a1 would be significant. Because of this, they decided to conduct two planned comparisons. They decided to test the prediction that introverts in the high stress condition would have significantly higher anxiety scores than introverts in the moderate stress condition. They also decided to test the prediction that introverts under moderate or high stress have significantly higher anxiety scores than introverts under low stress. Because these comparisons regarding the effects of one independent variable are being conducted at a single level of a second independent variable, they are referred to as simple comparisons. To get the Mean Squares for these simple comparisons, again you need to use SPSS to Select Cases to include only the data for introverts. Then, using the Contrasts option, the researcher could enter coefficients of 0, 1, and -1 to test the first simple comparison and coefficients of -2, 1, and 1 to test the second simple comparison. When the researcher runs the one-way ANOVA that includes the test of these two sets of coefficients, SPSS provides observed values for t. If the researcher wanted to report these comparisons as F-tests, it seems like all they'd have to do is square these values for t. But the problem with these values for F is that, just like with the simple effects, they are based on the incorrect denominator. The procedure for calculating the Mean Square for each simple comparison is described in the SPSS handout, but basically what you have to do is remember that the Mean Square we need must have been divided by the same (incorrect) error term that was used in the SPSS output for the simple effect of B at a1. When we multiply the error term in the output for this simple effect by the preliminary value for F we got by squaring the value for t, we get a value of 39.99. That is the correct Mean Square for this simple comparison. When we divide it by the correct error term, we get an observed value for F of 30.07, which is greater than the critical value of 4.26. This tells us that introverts under high stress have significantly higher levels of anxiety than introverts under moderate stress. Using the same procedure for the simple comparison of b1 vs. b2 + b3 at a1, we get a Mean Square for the simple comparison of 83.32 and an observed value for F of 62.65. Both simple comparisons are significant. This tells us that introverts under moderate or high stress have significantly higher levels of anxiety than introverts under low stress. The expanded ANOVA table that includes the two simple comparisons is presented below.

Source                  SS       df    MS      F       F-critical
Stress X I/E (A X B)     45.00    2   22.50   16.88      3.40
I/E (A)                  67.50    1   67.50   50.63      4.26
Stress (B)               81.67    2   40.83   30.63      3.40
B at a1 (introverts)    123.33    2   61.67   46.37      3.40
B at a2 (extroverts)      3.33    2    1.67    1.25      3.40
b2 vs b3 at a1           39.99    1   39.99   30.07      4.26
b1 vs b2 + b3 at a1      83.32    1   83.32   62.65      4.26
Error (S/AB)             32.00   24    1.33
Total                   226.17   29

One more thing about this ANOVA table. Notice that when you add up the sums of squares for the two simple comparisons, you get 123.33 (within rounding). This is the same number as the sum of squares for the simple effect of B at a1. The two comparisons we did are orthogonal to each other, and together they account for all of the variability associated with that simple effect. One way of thinking about it is that the job of a set of simple comparisons is to account for the variability explained by the simple effect they follow up.

Introduction to Statistical Inference
Dr. Tom Pierce
Department of Psychology, Radford University

What do you do when there's no way of knowing for sure what the right thing to do is? That's basically the problem that researchers are up against. I mean, think about it. Let's say you want to know whether older people are more introverted, on average, than younger people. To really answer the question, you'd have to compare all younger people to all older people on a valid measure of introversion/extroversion, which is impossible. Nobody's got the time, the money, or the patience to test 30 million younger people and 30 million older people. So what do you do? Obviously, you do the best you can with what you've got. And what researchers can reasonably get their hands on are samples of people. In my research, I might compare the data from 24 older people to the data from 24 younger people. And the cold, hard truth is that when I try to say that what I've learned about those 24 older people also applies to all older adults in the population, I might be wrong. As we said in the chapter on descriptive statistics, samples don't have to give you perfect
information about populations. If, on the basis of my data, I say there's no effect of age on introversion/extroversion, I could be wrong. If I conclude that older people are different from younger people on introversion/extroversion, I could still be wrong. Looking at it this way, it's hard to see how anybody learns anything about people. The answer is that behavioral scientists have learned to live with the fact that they can't prove anything or get at the truth about anything. You can never be sure whether you're wrong or not. But there is something you can know for sure. Statisticians can tell us exactly what the odds are of being wrong when we draw a particular conclusion on the basis of our data. This means that you might never know for sure that older people are more introverted than younger people, but your data might tell you that you can be very confident of being right if you draw this conclusion. For example, if you know that the odds are something like one in a thousand of making a mistake if you say there's an age difference in introversion/extroversion, you probably wouldn't lose too much sleep over drawing this conclusion. This is basically the way data analysis works. There's never a way of knowing for sure that you made the right decision, but you can know exactly what the odds are of being wrong. We can then use these odds to guide our decision making. For example, I can say that I'm just not going to believe something if there's more than a 5% chance that I'm going to be wrong. The odds give me something concrete to go on in deciding how confident I can be that the data support a particular conclusion. When a person uses the odds of being right or wrong to guide their decision making, they're using statistical inference.

Statistical inference is one of the most powerful tools in science. Practically every conclusion that behavioral scientists draw is based on the application of a few pretty simple ideas.

(9/15/08, Thomas W. Pierce, 2008)

Once you get used to them, and they do take some getting
used to, you'll see that these ideas can be applied to practically any situation where researchers want to predict and explain the behavior of the people they're interested in. All of the tests we'll talk about (t-tests, analysis of variance, the significance of correlation coefficients, etc.) are based on a common strategy for deciding whether the results came out the way they did by chance or not. Understanding statistical inference is just a process of recognizing this common strategy and learning to apply it to different situations.

Fortunately, it's a lot easier to give you an example of statistical inference than it is to define it. The example deals with a decision that a researcher might make about a bunch of raw scores, which you're already familiar with. Spend some time thinking your way through this next section. If you're like most people, it takes hearing it a couple of times before it makes perfect sense. Then you'll look back and wonder what the fuss was all about. Basically, if you're okay with the way statistical inference works in this chapter, you'll understand how statistical inference works in every chapter to follow.

An example of a statistical inference using raw scores

The first thing I'd like to do is give you an example of a decision that one might make using statistical inference. I like this example because it gives us the flavor of what making a statistical decision is like without having to deal with any real math at all. One variable that I use in a lot of my studies is reaction time. We might typically have 20 younger adults that do a reaction time task and 20 older adults that do the same task. Let's say the task is a choice reaction time task where the participants are instructed to press one button if a stimulus on a computer screen is a digit and another button if the stimulus is a letter. This task might have 400 reaction time trials. From my set of older adults, I'm going to have 400 trials from each of 20 participants. That's 8,000 reaction
times from this group of people. Now let's say, for the sake of argument, that this collection of 8,000 reaction times is normally distributed. The mean reaction time in the set is .6 seconds and the standard deviation of reaction times is .1 seconds. A graph of this hypothetical distribution is presented in Figure 3.1.

[Figure 3.1: a normal distribution of the 8,000 reaction times, centered at .6 seconds.]

One problem that I run into is that the reaction times for three or four trials out of the 8,000 trials are up around 1.6 seconds. The question I need to answer is whether to leave these reaction times in the data set or to throw them out. They are obviously outliers, in that these are scores that are clearly different from almost all the other scores, so maybe I'm justified in throwing them out. However, data is data. Maybe this is just the best that these subjects could do on these particular trials, so, to be fair, maybe I should leave them in. One thing to remember is that the instructions I gave people were to press the button on each trial as fast as they could while making as few errors as they could. This means that when I get the data, I only want to include the reaction times for trials when this is what was happening, when people were doing the best they could, when nothing went wrong that might have gotten in the way of their doing their best. So now I've got a reaction time out there at 1.6 seconds and I have to decide between two options, which are:

1. The reaction time of 1.6 seconds belongs in the data set because this is a trial where nothing went wrong. It's a reaction time where the person was doing the task the way I assumed they were. Option 1 is to keep the RT of 1.6 seconds in the data set. What we're really saying is that the reaction time in question is really a member of the collection of 8,000 other reaction times that makes up the normal curve. Alternatively,

2. The reaction time does not belong in the data set because this was a trial where the subject wasn't doing the task the way I assumed that they were. Option 2 is to throw it out. What we're saying here is that the RT of 1.6 seconds does NOT belong with the other RTs in the set. This means that the RT of 1.6 seconds must belong to some other set of RTs, a set of RTs where the mean of that set is quite a bit higher than .6 seconds.

In statistical jargon, Option 1 is called the null hypothesis. The null hypothesis says that our one event only differs from the mean of all the other events by chance. If the null hypothesis is really true, this says there was no reason or cause for the reaction time on this trial to be this slow. It just happened by accident. The symbol H0 is often used to represent the null hypothesis. In statistical jargon, the name for Option 2 is the alternative hypothesis. The alternative hypothesis says that our event didn't just differ from the mean of the other events by chance or by accident; it happened for a reason. Something caused that reaction time to be a lot slower than most of the other ones. We may not know exactly what that reason is, but we can be pretty confident that SOMETHING happened to give us a really slow reaction time on that trial; the event didn't just happen by accident. The alternative hypothesis is often symbolized as H1.

Now, of course, there is no way for both the null hypothesis and the alternative hypothesis to be true at the same time. We have to pick one or the other. But there's no information available to help us know for sure which option is correct. This is something that we've just got to learn to live with. Psychological research is never able to prove anything or figure out whether an idea is true or not. We never get to know for sure whether the null hypothesis is true or not. There is nothing in the data that can tell us for sure whether that RT of 1.6 seconds really belongs in our data set or not. It is certainly possible that someone could have a reaction time of 1.6 seconds just by accident. There's no way of telling for sure what the right answer is. So we're just going to have to do the best
we can with what we've got. We've got to accept the fact that, whichever option we pick, we could be wrong. The choice between Options 1 and 2 basically comes down to whether we're willing to believe that we could have gotten a reaction time of 1.6 seconds just by chance. If the RT was obtained just by chance, then it belongs with the rest of the RTs in the distribution and we should decide to keep it. If there's any reason other than chance for how we could have ended up with a reaction time that slow, if there was something going on besides the conditions that I had in mind for my experiment, then the RT wasn't obtained under the same conditions as the other RTs, and I should decide to throw it out.

So what do we have to go on in deciding between the two options? Well, it turns out that the scores in the data set are normally distributed. And we know something about the normal curve. We can use the normal curve to tell us exactly what the odds are of getting a reaction time this much slower than the mean reaction time of .6 seconds. For starters, if you convert the RT of 1.6 seconds to a standard score, what do you get? Obviously, if we convert the original raw score (a value of X) to a standard score (a value of Zx), we get a value of 10.0. The reaction time we're making our decision about is 10.0 standard deviations above the mean. That seems like a lot. (The symbol Zx translates to "the standard score for a raw score for variable X.") So what does this tell us about the odds of getting a reaction time that far away from the mean just by chance alone? Well, you know that roughly 95% of all the reaction times in the set will fall between the standard scores of -2 and +2. Ninety-nine percent will fall between -3 and +3. So, automatically, we know that the odds of getting a reaction time with a standard score of 3 or higher must be less than 1%. And our reaction time is ten standard deviations above the mean. If the normal curve table went out far enough, it would show us that the odds of getting a reaction time with a
standard score of 10.0 is something like one in a million. Our knowledge of the normal curve, combined with our knowledge of where our raw score falls on the normal curve, gives us something solid to go on when making our decision. We know that the odds are something like one in a million that our reaction time belongs in the data set. What would the odds have to be to make you believe that the score doesn't belong in the data set?

An alpha level is a set of odds that the investigator decides to use when deciding whether one event belongs with a set of other events. For example, an investigator might decide that they're just not willing to believe that a reaction time really belongs in the set (i.e., that it's different from the mean RT in the set just by chance) if the odds of this happening are less than 5%. If the investigator can show that the odds of getting a certain reaction time are less than 5%, then it's different enough from the mean for them to bet that they didn't get that reaction time just by chance. It's different enough for them to bet that the reaction time must have been obtained when the null hypothesis was false.

So how far away from the center of the normal curve does a score have to be before it's in the 5% of the curve where a score is least likely to be? In other words, how far above or below the mean does a score have to be before it fits the rule the investigator has set up for knowing when to reject the null hypothesis? So far, our decision rule for knowing when to reject the null hypothesis is: Reject the null hypothesis when the odds that it's true are less than 5%.

Our knowledge of the normal curve gives us a way of translating a decision rule stated in terms of odds into a decision rule that's expressed in terms of the scores that we're dealing with. What we'd like to do is to throw reaction times out if they look a lot different from the rest of them. One thing that our knowledge of the normal curve allows us to do is to express our decision rule in standard score units.
For example, if the decision rule for rejecting the null hypothesis is that we should reject the null hypothesis when the odds of its being true are 5% or less, this means that we should reject the null hypothesis whenever a score falls in the outer 5% of the normal curve. In other words, we need to identify the 5% of scores that one is least likely to get when the null hypothesis is true. How many standard deviations away from the center of the curve do we have to go until we get to the start of this outer 5%? In standard score units, you have to go 1.96 standard deviations above the mean and 1.96 standard deviations below the mean to get to the start of the extreme 5% of values that make up the normal curve. So if the standard score for a reaction time is above a positive 1.96 or is below a negative 1.96, the reaction time falls in the 5% of the curve where you're least likely to get reaction times just by chance. The decision rule for this situation becomes: Reject H0 if Z_X ≥ +1.96 or if Z_X ≤ -1.96.

The reaction time in question has a standard score of 10.0, so the decision would be to reject the null hypothesis. The conclusion is that the reaction time does not belong with the other reaction times in the data set and should be thrown out.

The important thing about this example is that it boils down to a situation where one event (a raw score in this case) is being compared to a bunch of other events that occurred when the null hypothesis was true. If it looks like the event in question could belong with this set, we can't say we have enough evidence to reject the null hypothesis. If it looks like the event doesn't belong with a set of other events collected when the null hypothesis was true, that means we're willing to bet that it must have been collected when the null hypothesis is false. You could think of it this way: every reaction time deserves a good home. Does our reaction time belong with this family of other reaction times or not?

The Z Test

The example in the last section was one where we were comparing one
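The outlier decision rule above is easy to sketch in a few lines of Python. This is just an illustration, not part of the original notes; the mean of 6 seconds comes from the example, and the standard deviation of 1 second is an assumption that makes the RT of 16 work out to a standard score of 10.0.

```python
# Sketch of the reaction-time decision rule.
# Assumed values: mean RT = 6 seconds, SD = 1 second (inferred from the
# example, since (16 - 6) / 1 = 10.0); alpha = .05 two-tailed, so the
# critical standard scores are +/-1.96.

def z_score(x, mean, sd):
    """Convert a raw score to a standard score."""
    return (x - mean) / sd

def reject_null(x, mean, sd, critical=1.96):
    """Reject H0 if the score falls in the outer 5% of the normal curve."""
    return abs(z_score(x, mean, sd)) >= critical

# The RT of 16 seconds is 10.0 standard deviations above the mean of 6,
# far beyond the +/-1.96 cutoffs, so it gets thrown out.
print(z_score(16, 6, 1))       # 10.0
print(reject_null(16, 6, 1))   # True: reject H0, toss the score
print(reject_null(7.5, 6, 1))  # False: within +/-1.96, keep it
```

The same two functions reappear in spirit later in the chapter; the only thing that changes for the z test is that the "score" becomes a sample mean and the "SD" becomes the standard error of the mean.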
raw score to a bunch of other raw scores. Now let's try something a little different. Let's say you've been trained in graduate school to administer the IQ test. You get hired by a school system to do the testing for that school district. On your first day at work, the principal calls you into their office and tells you that they'd like you to administer the IQ test to all 25 seventh graders in a classroom. The principal then says that all you have to do is answer one simple question: Are the students in that classroom typical/average seventh graders or not?

Now, before you start, what would you expect the IQ scores in this set to look like? The IQ test is set up so that the mean IQ for all of the scores in the population is 100 and the standard deviation of all the IQ scores for the population is 15. So if you were testing a sample of seventh graders from the general population, you'd expect the mean to be 100.

Now let's say that you test all 25 students. You get their IQ scores and you find that the mean for this group of 25 seventh graders is 135. 135! Do you think that these were typical/average seventh graders or not? Given what you know about IQ scores, you probably don't. But why not? What if the mean had turned out to be 106? Are these typical/average seventh graders? Probably. How about if the mean were 112? Or 118? Or 124? At what point do you change your mind from "yes, they were typical/average seventh graders" to "no, they're not"? What do you have to go on in deciding where this cutoff point ought to be?

At this point in our discussion, your decision is being made at the level of intuition. But this intuition is informed by something very important. It's informed by your sense of the odds of getting the results that you did. Is it believable that you could have gotten a mean of 135 when the mean of the population is 100? It seems like the odds are pretty low that this could have happened. Whether you've realized it or not, your decisions in situations like these are based on the odds of
what really happened. In an informal way, you were making a decision using statistical inference. Tools like t-tests work in exactly the same way. The only thing that makes them different is the degree of precision involved in knowing the relevant odds. Instead of knowing that it was "pretty unlikely" that you'd tested a group of typical/average seventh graders, a tool like a t-test can tell you exactly how unlikely it is that you tested a group of typical/average seventh graders.

Just like in the example with the reaction time presented above, the first step in the decision process is defining the two choices that you have to pick between: the null and alternative hypotheses. In general, the null hypothesis is that the things being compared are just different from each other by accident, or that the difference is just due to chance. There was no reason for the difference, really. It just happened by accident. In this case, the null hypothesis would be that the mean of the sample of 25 seventh graders and the population mean of 100 are just different from each other by accident. The alternative hypothesis is the logical opposite of this. The alternative hypothesis is that there is something going on other than chance that's making the two means different from each other. It's not an accident. The means are different from each other for a reason.

So how do you pick between the null and the alternative hypotheses? Just like in the example with the reaction times, it turns out that the only thing we can know for sure are the odds of the null hypothesis being true. We have to decide just how unlikely the null hypothesis would have to be before we're just not willing to believe that it's true anymore. Let's say you decide that if we can show that the odds are less than 5% that the null hypothesis is true, you'll decide that you just can't believe it anymore. When you decide to use these odds of 5%, this means that you've decided to use an alpha level of .05. So how do you figure out whether or not the
odds are less than 5% that the null hypothesis is true? The place to start is by remembering where the data came from. They came from a sample of 25 students. There's an important distinction in data analysis between a sample and a population. A population is every member of the set of people, animals, or things (etc.) that you want to draw a conclusion about. If you only wanted to draw a conclusion about the students that attend a certain school, the students that go to that school make up the population of people we're interested in. If you're interested in drawing a conclusion about all older adults in the United States, then you might define this population as every person in the U.S. at or above the age of 65.

A sample is a representative subset of the population you're interested in. A sample often consists of just a tiny portion of the whole population. The assumption is that the people that make up the sample have roughly the same characteristics as the characteristics seen in the whole population. If the population is 65% female and 35% male, then the sample should be 65% female and 35% male. The sample should look like the population. If it doesn't, the sample may not be a representative subset of the population. The whole point of using a sample is that you can use the information you get from a sample to tell you about what's going on in the whole population. Investigators do whatever they can to make sure that their samples are unbiased, that is, that the samples don't give a distorted picture of what the people that make up the population look like.

Samples are supposed to tell you about populations. This means that the numbers you get that describe the sample are intended to describe the population. The descriptive statistics you get from samples are assumed to be unbiased estimates of the numbers you'd get if you tested everyone in the population. Let's look at that phrase: unbiased estimate. The "estimate" part comes from the fact that every time you calculate a descriptive statistic
from a sample, it's supposed to give you an estimate of the number for everyone in the whole population. By "unbiased" we mean that the descriptive statistic you get from a sample has an equal chance of being too high or too low. No one assumes that estimates have to be perfect. But they're not supposed to be systematically too high or too low. So a sample mean has only one job: to give you an unbiased estimate of the mean of a population.

In the situation presented above, we've got the mean of a sample (135) and we're using it to decide whether a group of 25 seventh graders are members of a population of typical/average seventh graders. The mean of the population of all typical/average seventh graders is 100. So the problem basically comes down to a yes or no decision. Is it reasonable to think that we could have a sample mean of 135 when the mean of the population is 100? Just how likely is it that we were sampling from the population of typical/average seventh graders and ended up with a sample mean of 135 JUST BY CHANCE? If the odds aren't very good of this happening, then you might reasonably decide that your sample mean wasn't an estimate of this population mean. Which says that you're willing to bet that the children in that sample aren't members of that particular population.

This is exactly the kind of question that we were dealing with in the reaction time example. We started out with one reaction time (a raw score) and we decided that if we could determine that the odds were less than five percent that this one raw score belonged with a collection of other reaction times (a group of other raw scores), we'd bet that it wasn't a member of this set. The basic strategy in the reaction time example was to compare one event (a raw score) to a bunch of other events (a bunch of other raw scores) to see if it belonged with that collection or not. If the odds were less than five percent that it belonged in this collection, our decision would be to reject the null hypothesis (i.e., the event belongs in
the set) and accept the alternative hypothesis (i.e., the event does not belong in the set).

So how do we extend that strategy to this new situation? How do we know whether the odds are less than 5% that the null hypothesis is true? The place to start is to recognize that the event we're making our decision about here is a sample mean, not a raw score. Before, we compared one raw score to our collection of other raw scores. If we're going to use the same strategy, it seems like we would have to compare one sample mean to a bunch of other sample means, and that's exactly how it works. BUT WAIT A MINUTE! I've only got just the one sample mean. Where do I get the other ones? That's a good question, but it turns out that we don't really need to collect a whole bunch of other sample means, sample means that are all estimates of that population mean of 100. What the statisticians tell us is that because we know the mean and standard deviation of the raw scores in the population, we can imagine what the sample means would look like if we kept drawing one sample after another, with every sample having 25 students in it.

For example, let's say that you could know for sure that the null hypothesis was true, that the 25 students in a particular class were drawn from the population of typical/average students. What would you expect the mean IQ score for this sample to be? Well, if they're typical/average students, and the mean of the population of typical/average seventh graders is 100, then you'd expect that the mean of the sample will be 100. And in fact, that's the single most likely thing that would happen. But does that sample mean have to be 100? If the null hypothesis is true and you've got a sample of 25 typical/average seventh graders, does the sample mean have to come out to 100? Well, that sample mean is just an estimate. It's an unbiased estimate, so it's equally likely to be greater than or less than the number it's trying to estimate, but it's still just an estimate. And estimates don't have to be perfect. So
the answer is no. The mean of this sample doesn't necessarily have to be equal to 100 when the null hypothesis is true. The sample mean could be, and probably will be, at least a little bit different from 100 just by accident, just by chance.

So let's say that, hypothetically, you can know for sure that the null hypothesis is true, and you go into a classroom of 25 typical/average seventh graders and obtain the mean IQ score for that sample. Let's say that it's 104. We can put this sample mean where it belongs on a scale of possible sample means. See Figure 3.2.

Figure 3.2 [a scale of possible sample means, running from roughly 85 to 110, with the sample mean of 104 marked]

Now, hypothetically, let's say that you go into a second classroom of typical/average seventh graders. The null hypothesis is true (the students were drawn from a population that has a mean IQ score of 100), but the mean for this second sample of students is 97. Now we've got two estimates of the same population mean. One was four points too high. The other was three points too low. The locations of these two sample means are displayed in Figure 3.3.

Figure 3.3 [the same scale of possible sample means, with the two sample means of 97 and 104 marked]

Now let's say that you go into fifteen more classrooms. Each classroom is made up of 25 typical/average seventh graders. From each of these samples of people you collect one number: the mean of that sample. Now we can see where these sample means fall on the scale of possible sample means. In Figure 3.4, each sample mean is represented by a box. When there's another sample mean that shows up at the same point on the scale (i.e., the same sample mean), we just stack that box up on top of the one that we got before. The stack of boxes presented in Figure 3.4 represents the frequency distribution of the 17 sample means that we've currently got. The shape of this frequency distribution looks more or less like the normal curve.

Figure 3.4 [a frequency distribution of the 17 sample means, with stacked boxes forming a roughly normal shape]

Now let's say that, hypothetically, you went into classroom after classroom after classroom. Every classroom has 25 typical/average seventh graders, and from every classroom
you obtain the mean IQ score for this sample of students. If you were to get means from hundreds of these classrooms, thousands of these classrooms, and then put each of these sample means where they belong on the scale, the shape of this distribution of numbers would look exactly like the normal curve. The center of this distribution is 100. The average of all of the sample means that make up this collection is 100. This sort of makes sense, because every one of those sample means was an estimate of that one population mean of 100. Those sample means might not all have been perfect estimates, but they were all unbiased estimates. Half of those estimates were too high, half of them were too low, but the average of all of these sample means is exactly equal to 100. All of these numbers, these sample means, were obtained under a single set of conditions: when the null hypothesis is true.

Now we've got a collection of other sample means to compare our one sample mean to. The sample means in this collection show us how far our estimates of the population mean of 100 can be off by just by chance. This collection of sample means is referred to as the sampling distribution of the mean. It's the distribution of a bunch of sample means collected when the null hypothesis is true, when all of the sample means that make up that set are estimates of the one population mean we already know about, in this case 100.

So how does the sampling distribution of the mean help us to make our decision? Well, the fact that the shape of this distribution is normal makes the situation exactly like the one we had before, when we were trying to decide whether or not to toss a raw score out of a data set of other raw scores. If you remember: We made a decision about a single number, in this case a single raw score. The decision was about whether that raw score was collected when the null hypothesis was true or whether the null hypothesis was false. The only thing we had to go on were the odds that the raw score was
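The classroom-after-classroom thought experiment is easy to simulate. The sketch below is not from the notes; it just draws thousands of hypothetical samples of 25 "students" from a population with mean 100 and SD 15 and shows that the sample means pile up around the population mean, with a spread close to the standard error the Central Limit Theorem predicts (15 / √25 = 3.0).

```python
# Simulating the sampling distribution of the mean.
# Assumed values from the example: population mean = 100, SD = 15,
# N = 25 students per classroom.
import random

random.seed(0)  # fixed seed so the run is repeatable
population_mean, population_sd, n = 100, 15, 25

# Draw thousands of "classrooms" and record each classroom's mean IQ.
sample_means = []
for _ in range(10_000):
    classroom = [random.gauss(population_mean, population_sd) for _ in range(n)]
    sample_means.append(sum(classroom) / n)

# The average of all the sample means lands very close to 100, and
# their spread is close to the standard error of the mean, 3.0.
grand_mean = sum(sample_means) / len(sample_means)
spread = (sum((m - grand_mean) ** 2 for m in sample_means)
          / len(sample_means)) ** 0.5
print(round(grand_mean, 1))  # close to 100
print(round(spread, 1))      # close to 3.0
```

In other words, the collection of sample means the notes ask you to imagine really does behave the way the Central Limit Theorem says it will.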
obtained when the null hypothesis was true. We had these odds to work with for two reasons. First, we had a collection of other raw scores to compare it to, raw scores that were all obtained under one set of conditions: when the null hypothesis is true. Second, the shape of the frequency distribution for these raw scores was normal. We said that unless we had reason to think otherwise, we'd just have to go with the null hypothesis, that the raw score really did belong in the collection. We would only decide to reject the idea that the reaction time belonged in the set when we could show that the odds were less than 5% that it belonged in that set. We said that if we converted all of the raw scores in the collection to standard scores, we could use the normal curve to determine how far above or below a standard score of zero you'd have to go until you hit the start of the extreme 5% of this distribution. In other words, how far above or below zero do you have to go until you hit the start of the least likely 5% of reaction times that really do belong in that collection? The decision rule for knowing when to reject the null hypothesis became: Reject the null hypothesis if Z_X is greater than or equal to +1.96 or if Z_X is less than or equal to -1.96. The only thing left was to take that raw score and convert it to a standard score.

Now we've got exactly the same kind of situation. We're making a decision about a single number. The only difference is that now this number is the mean of a sample of raw scores rather than a single raw score. The decision is whether that number was collected when the null hypothesis was true or when it was false. There's no way of knowing for sure whether this sample mean was collected when the null hypothesis was true or not. The only thing that we can know for sure are the odds that the sample mean was collected when the null hypothesis was true. We can know these odds because we have a collection of other sample means to compare our one sample mean to. These
sample means were all collected under the same circumstances: when the null hypothesis was true. Unless we have reason to think otherwise, we'll have to assume that our sample mean was collected when the null hypothesis was true, when the mean of the sample really was an estimate of that population mean of 100.

• Specifically, we can decide to reject the null hypothesis only if we can show that the odds are less than 5% that it's true. We can decide to reject the idea that the sample mean belongs in the set only when we can show that the odds are less than 5% that it belongs in our hypothetical collection of other sample means.

• If we imagine that we're able to convert all of the sample means that make up our normal curve to standard scores, we can use our knowledge of the normal curve to determine how far above or below a standard score of zero you'd have to go until you hit the start of the most extreme 5% of this distribution. In other words, how far above or below zero do you have to go until you hit the start of the least likely 5% of sample means that you're likely to get when the null hypothesis is true?

• If we take our one sample mean and convert it to a standard score, our knowledge of the normal curve now tells us that if this standard score is greater than or equal to +1.96, or this standard score is less than or equal to -1.96, we'll know that our sample mean falls among the 5% of sample means that you're least likely to get when the null hypothesis is true. We would know that the odds are less than 5% that our 25 seventh graders (remember them?) were members of a population of typical/average seventh graders. See Figure 3.5.

• The only thing left at this point is to convert our sample mean to a standard score.

Figure 3.5 [the normal curve with the rejection regions beyond standard scores of -1.96 and +1.96 shaded]

To convert a number to a standard score, you take that number, subtract the mean of all the other numbers in the set, and then divide this deviation score by the standard deviation of all the scores in the set. A
standard score is the deviation of one number from the mean, divided by the average amount that numbers deviate from their mean. The equation to convert a raw score to a standard score was:

Z_X = (X - μ) / σ

It's the same thing to take a sample mean and convert it to a standard score. We need to take the number that we're converting to a standard score (our sample mean of 135), subtract the mean of all the sample means in the set, and then divide this number by the standard deviation of all the sample means in the set. The equation for the standard score we need here becomes:

Z_M = (M - μ) / σ_M

Z_M represents the standard score for a particular sample mean. M is the sample mean that is being converted to a standard score. μ represents the mean of the population (or the average of all the sample means). σ_M represents the standard deviation of all the sample means.

So what's the average of all the sample means that you could collect when the null hypothesis is true? Well, you know that every one of those sample means was an estimate of the same population mean. They were all estimates of the population mean of 100. If we assume that these sample means are unbiased estimates, then half of those estimates end up being less than 100 and half of those estimates end up being greater than 100, so the mean of all those estimates, all those sample means, is 100! The average of all of the values that make up the sampling distribution of the mean is also the mean of the population.

Okay, that was easy. Now how about the standard deviation of all of the sample means? Well, we've got a collection of numbers and we know the mean of all of those numbers. So we should be able to calculate the average amount that those numbers deviate from their mean. Unfortunately, the sampling distribution of the mean contains the means of (hypothetically) every possible sample that has a particular number of people in it. So doing that is kind of out. But this is where those spunky little statisticians come in handy. Some smart person, probably on a
Friday night when everyone else was out having fun, nailed down the idea that the standard deviation of a bunch of sample means is influenced by two things: (1) the standard deviation of the raw scores for the population, and (2) the number of people that make up each individual sample. The Central Limit Theorem tells us that to calculate the standard deviation of the sample means, you take the standard deviation of the raw scores in the population (sigma) and then divide it by the square root of the sample size. The equation for calculating the standard deviation of the sample means (σ_M) becomes:

σ_M = σ / √N

The symbol σ_M simply reflects the fact that we need the standard deviation of a bunch of sample means. The term for this particular standard deviation is the Standard Error of the Mean. One way of thinking about it is to say that it's the average or standard amount of sampling error you get when you're using the scores from a sample to give an estimate of what's going on with everyone in the population.

From this equation it's easy to see that the larger the standard deviation of all the raw scores in the population (that is, the more spread out the raw scores are around their mean), the more spread out the sample means get. Also, the more people you have in each sample, the less spread out the sample means will be around their average. This makes sense because the sample means are just estimates. The more scores that contribute to each estimate, the more accurate they ought to be, and the closer they ought to be, on average, to the mean of the population.

So in our example, the standard error of the mean becomes:

σ_M = 15 / √25 = 15 / 5 = 3.0

With a standard deviation of the raw scores of 15 and sample sizes of 25, the average amount that sample means are spread out around the number they're trying to estimate is 3.0. So now we've got everything we need to convert our sample mean to a standard score:

Z_M = (135 - 100) / 3.0 = 35 / 3.0 = 11.67

Our sample mean is 11.67 standard deviations above the population mean of 100. Our decision rule already
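The arithmetic above is small enough to check by hand, but a short sketch makes the whole chain explicit in one place. The values are the ones from the example (μ = 100, σ = 15, N = 25, sample mean = 135); nothing else is assumed.

```python
# One-sample z test for the IQ example.
# Values from the notes: population mean 100, population SD 15,
# N = 25 seventh graders, observed sample mean 135.
import math

mu, sigma, n, sample_mean = 100, 15, 25, 135

# Central Limit Theorem: the standard error of the mean is sigma / sqrt(N).
standard_error = sigma / math.sqrt(n)    # 15 / 5 = 3.0

# Convert the sample mean to a standard score on the sampling distribution.
z = (sample_mean - mu) / standard_error  # (135 - 100) / 3.0

print(standard_error)  # 3.0
print(round(z, 2))     # 11.67

# Non-directional decision rule at alpha = .05: reject H0 if |Z| >= 1.96.
print(abs(z) >= 1.96)  # True: reject the null hypothesis
```

Swapping in a different sample mean (say, 106) shows how the same rule leads to retaining the null hypothesis instead: (106 - 100) / 3.0 = 2.0, which is only just past the cutoff, while 103 gives z = 1.0 and no rejection.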
told us that we'd be willing to reject the null hypothesis if the sample mean has a standard score that is greater than or equal to 1.96. So our decision is to reject the null hypothesis. This means that we're willing to accept the alternative hypothesis, so our conclusion for this decision is that the seventh graders in this classroom are not typical/average seventh graders. That's a lot to think about to get to say one sentence.

Directional (one-tailed) versus nondirectional (two-tailed) tests

Now let's say that we change the question a little bit. Instead of asking whether the 25 kids in the class are typical/average seventh graders, let's say that the researcher wants to know whether this is a class of gifted and talented seventh graders. In other words, instead of asking whether the mean IQ of the sample is significantly different from the population mean of 100, we're now asking whether the mean of the sample is significantly greater than the population mean of 100. How does this change the problem?

It doesn't change anything about the number crunching. The standard error of the mean is still 3.0. The sample mean of 135 is still 11.67 standard deviations above the population mean of 100. However, the conclusion the researcher can draw on the basis of this standard score is going to change, because the research question has changed. One way to think about this is to consider that the only reason for getting our value for Z is to help us decide between two statements: the null hypothesis and the alternative hypothesis. You're doing the statistical test to see which of the two statements you're going to be willing to believe. The alternative hypothesis is the prediction made by the researcher. The null hypothesis is the opposite of this prediction. So obviously, if you change the prediction, you change both the null and alternative hypotheses. In the example in the previous section, the prediction was that the mean of the sample would be significantly different from the mean of the population, so that was
the alternative hypothesis. The null hypothesis was the logical opposite of the alternative hypothesis: the mean of the sample of 25 seventh graders is not significantly different from the population mean of 100. This is said to be a nondirectional prediction because the statement could be true no matter whether the sample mean was a lot larger or a lot smaller than the population mean of 100.

In the current example, the prediction is that the mean of the sample is significantly greater than the mean of the population. From this it follows that the alternative hypothesis is that the mean of the sample of 25 seventh graders is significantly greater than the population mean of 100. The null hypothesis is the logical opposite of this statement: the mean of the sample of 25 seventh graders is not significantly greater than the population mean of 100. The researcher is said to have made a directional prediction because they're being specific about whether they think the sample mean will be above or below the mean of the population. The null and alternative hypotheses for the directional version of the test are stated below:

H0: The mean of the sample of 25 seventh graders is not significantly greater than the population mean of 100.

H1: The mean of the sample of 25 seventh graders is significantly greater than the population mean of 100.

Okay, so if the null and alternative hypotheses change, how does this change the decision rule that tells us when we're in a position to reject the null hypothesis? Do we have to change the alpha level? No. We can still use an alpha level of .05. We can still say that we're not going to be willing to reject the null hypothesis unless we can show that there's less than a 5% chance that it's true. Do we have to change the critical values? Yes, you do. And here's why. Think of it this way. With the nondirectional version of the test, we were placing a bet ("don't reject the null hypothesis unless there's less than a 5% chance that it's true") and we split the 5% we had to bet with
across both sides of the normal curve. We put half (2.5%) on the right side and half (2.5%) on the left side, to cover both ways in which a sample mean could be different from the population mean: a lot larger or a lot smaller than the population mean. In the directional version of the test, a sample mean below 100 isn't consistent with the prediction of the experimenter. It just doesn't make sense to think that a classroom of gifted and talented seventh graders has a mean IQ below 100. The question is whether our sample mean is far enough above 100 to get us to believe that these are gifted and talented seventh graders. So if we still want to use an alpha level of .05, we don't need to put half on one side of the curve and half on the other side. We can put all 5% of our bet on the right side of the curve. See Figure 3.6.

Figure 3.6 [the normal curve with the entire 5% rejection region on the right side]

If we put all 5% on the right side of the normal curve, how many critical values are we going to have to deal with? Just one. The decision rule changes so that it tells us how far the standard score for our sample mean has to be above a standard score of zero before we can reject the null hypothesis: If Z_M ≥ (some number), reject H0.

The only thing left is to figure out what the critical value ought to be. How about 1.96? Think about it. Where did that number come from? That was how many standard deviations above zero you had to go to hit the start of the upper 2.5% of the values that make up the normal curve. But that's not what we need here. We need to know how many standard deviations you have to go above zero before you hit the start of the upper 5% of the values that make up the normal curve. How do you find that? Use the normal curve table. If 45% of all the values that make up the curve are between a standard score of zero and the standard score that we're interested in, that means that the standard score for our critical value is 1.645. So the decision rule for our directional test becomes: If Z_M ≥ 1.645, reject H0. Obviously, the standard score for our sample mean is greater than
the critical value of 1.645, so our decision is to reject the null hypothesis. This means that we're willing to accept the alternative hypothesis. Our conclusion, therefore, is that the mean of the 25 seventh graders in the class is significantly greater than the population mean for typical/average seventh graders.

Advantages and disadvantages of directional and nondirectional tests

The decision of whether to conduct a directional or nondirectional test is up to the investigator. The primary advantage of conducting the directional test is that, as long as you've got the direction of the prediction right, the critical value to reject the null hypothesis will be a lower number (e.g., 1.645) than the critical value you'd have to use with the nondirectional version of the test (e.g., 1.96). This makes it more likely that you're going to be able to reject the null hypothesis.

So why not always do a directional test? Because if your prediction about the direction of the result is wrong, there is no way of rejecting the null hypothesis. In other words, if you predict ahead of time that a bunch of seventh graders are going to have an average IQ score that is significantly greater than the population mean of 100, and then you find that their mean IQ is 70, can you reject the null hypothesis? No. It doesn't matter how many standard deviations below zero the standard score for that sample mean is. In this case, no standard score below zero is consistent with the prediction that the students in that class have an average IQ that is greater than 100. Basically, if you perform a directional test and guess the direction wrong, you lose the bet. You're stuck with having to say that the sample mean is not significantly greater than the mean of the population.

What the researcher absolutely should not do is change their bet after they have a chance to look at the data. The decision about predicting a result in a particular direction is made before the data are collected. After you place your bet, you just
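The two critical values discussed above (1.96 for the nondirectional test, 1.645 for the directional test at alpha = .05) come straight from the normal curve. As a sketch, the inverse-CDF of the standard normal distribution in Python's standard library reproduces them; this stands in for the printed normal curve table the notes refer to.

```python
# Critical standard scores for directional vs. nondirectional tests.
# Uses the standard normal distribution from Python's statistics module
# in place of the printed normal curve table.
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1
alpha = 0.05

# Nondirectional (two-tailed): split alpha across both tails, 2.5% each.
two_tailed_cutoff = z.inv_cdf(1 - alpha / 2)
print(round(two_tailed_cutoff, 2))   # 1.96

# Directional (one-tailed): put the whole 5% in the upper tail.
one_tailed_cutoff = z.inv_cdf(1 - alpha)
print(round(one_tailed_cutoff, 3))   # 1.645
```

The directional cutoff is lower, which is exactly the advantage described above: a result in the predicted direction clears the bar more easily, but a result in the wrong direction can never reject the null hypothesis.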
have to live with the consequences. So: if a theory predicts that the result should be in a particular direction, use a directional test. If previous research gives you a reason to be confident of the direction of the result, use a directional test. Otherwise, the safe thing to do is to go with a non-directional test.

There are some authors who feel that there is something wrong with directional tests. Apparently, their reasoning is that directional tests aren't conservative enough. It is certainly true that directional tests can be misused, especially by researchers who really had no idea what the direction of their result would be, but went ahead and essentially cheated by using the lower critical value from a directional test. However, the logic of a directional test is perfectly sound. An alpha level of .05 is an alpha level of .05, no matter whether the investigator has used that alpha level in the context of a directional or a non-directional test. If you've got 5% worth of bet to place, it ought to be up to the researcher to distribute it in the way they want, as long as they're honest enough to live with the consequences. I personally think it's rather self-defeating to test a directional question but use a critical value based on having a rejection region of only 2.5% on that side of the curve. The reality of doing that is that the researcher has done their directional test using an alpha level of .025, which puts the researcher at an increased risk of missing the effect they're trying to find, a concept we'll discuss in the next section.

Errors in decision making

When you make a decision like the one we made above, what do you know for sure? Do you know that the null hypothesis is true? Or whether the alternative hypothesis is true? No. We don't get to know the reality of the situation. But we do get to know what our decision is. You know whether you picked the null hypothesis or the alternative hypothesis. So in terms of the outcome of your decision, there are four ways that it could turn out:

| Your Decision     | Reality: H0 False | Reality: H0 True |
|-------------------|-------------------|------------------|
| Reject H0         |                   |                  |
| Fail to Reject H0 |                   |                  |

There are two ways that you could be right, and there are two ways that you could be wrong. If you decide to reject the null hypothesis, and in reality the null hypothesis is false, you made the right choice; you made a correct decision. There was something there to find and you found it. Some people would refer to this outcome as a "Hit".

| Your Decision     | Reality: H0 False        | Reality: H0 True |
|-------------------|--------------------------|------------------|
| Reject H0         | Correct Decision ("Hit") |                  |
| Fail to Reject H0 |                          |                  |

If you decide not to reject the null hypothesis, and in reality the null hypothesis is true, then again you made the right choice; you made a correct decision. In this case, there was nothing there to find and you said just that.

| Your Decision     | Reality: H0 False        | Reality: H0 True |
|-------------------|--------------------------|------------------|
| Reject H0         | Correct Decision ("Hit") |                  |
| Fail to Reject H0 |                          | Correct Decision |

Now let's say that you decide to reject the null hypothesis, but the reality of the situation is that the null hypothesis is true. In this case, you made a mistake. Statisticians refer to this type of mistake as a Type I error. Basically, a Type I error is saying that there was something there when in fact there wasn't. Some people refer to this type of error as a "False Alarm". So what are the odds of making a Type I error? This is easy. The investigator decides how much risk of making a Type I error they're willing to run before they even go out and collect their data. The alpha level specifies just how unlikely the null hypothesis would have to be before we're not willing to believe it anymore. An alpha level of .05 means that we're willing to reject the null hypothesis when there is still a 5% chance that it's true. This means that even when you get to reject the null hypothesis, you're still taking on a 5% risk of making a mistake, of committing a Type I error.

| Your Decision     | Reality: H0 False        | Reality: H0 True             |
|-------------------|--------------------------|------------------------------|
| Reject H0         | Correct Decision ("Hit") | Type I Error ("False Alarm") |
| Fail to Reject H0 |                          | Correct Decision             |

Finally, let's say that you decide that you can't reject the null hypothesis, but the reality of the situation is that the null hypothesis is false. In this case, there was something there to find, but you missed it. The name for this type of mistake is a Type II error. Some people refer to this type of mistake as a "Miss".

| Your Decision     | Reality: H0 False        | Reality: H0 True             |
|-------------------|--------------------------|------------------------------|
| Reject H0         | Correct Decision ("Hit") | Type I Error ("False Alarm") |
| Fail to Reject H0 | Type II Error ("Miss")   | Correct Decision             |

So what are the odds of committing a Type II error? This one's not as easy. The one thing that it's not is 95%. Just because the risk of making a Type I error is 5%, that doesn't mean that we've got a 95% chance of making a Type II error. But one thing that we do know about the risk of a Type II error is that it is inversely related to the risk of a Type I error. In other words, when an investigator changes their alpha level, they're not only changing the risk of a Type I error; they're changing the risk of a Type II error at the same time.

If an investigator changes their alpha level from .05 to .01, the risk of making a Type I error goes from 5% to 1%. They're changing the test so that it's more difficult to reject the null hypothesis. If you make it more difficult to reject the null hypothesis, you're making it more likely that there might really be something there, but you miss it. If you lower the alpha level to reduce the risk of making a Type I error, you'll automatically increase the risk of making a Type II error.

If an investigator changes their alpha level from .05 to .10, the risk of making a Type I error will go from 5% to 10%. They're changing the test so that it's easier to reject the null hypothesis. If you make it easier to reject the null hypothesis, you're making it less likely that there could really be something out there to detect, but you miss it. If you raise the alpha level and increase the risk of making a Type I error, you'll
automatically lower the risk of making a Type II error.

The graph below shows where the risks of both a Type I error and a Type II error come from. The example displayed in the graph is for a directional test.

[Figure 3.7: two overlapping normal curves, one labeled "H0 True" and one labeled "H0 False", with the critical value marked on the horizontal axis between them and the rejection and Type II error regions shaded]

The curve on the left is the sampling distribution of the mean we talked about before. Remember, this curve is made up of sample means that were all collected when the null hypothesis is true. The critical value is where it is because this point is how far above the mean of the population you have to go to hit the start of the 5% of sample means that you're least likely to get when the null hypothesis is true. Notice that 5% of the area under the curve on the left is in the shaded region. The percentage of the curve on the left that is in the shaded region represents the risk of committing a Type I error.

Now for the risk of committing a Type II error. Take a look at the curve on the right that is labeled "H0 False". This curve represents the distribution of a bunch of sample means that would be collected if the null hypothesis is false. That's why it's labeled "H0 False". Let's say that the reality of the situation is that the null hypothesis is false. In other words, when we collected our sample mean, it really belonged in this collection of other sample means. Now let's say that the standard score for our sample mean turned out to be 1.45. What would your decision be: reject H0 or don't reject H0? Of course, the observed value is less than the critical value, so your decision is going to be that you fail to reject the null hypothesis. Are you right or wrong? We just said that the null hypothesis is false, so you'd be wrong. Any time the null hypothesis really is false, but you get a standard score for your sample mean that is less than 1.645, you're going to be wrong. Look at the shaded region under the "H0 False" curve. The percentage of area that falls in the shaded region under this curve represents the odds of committing a Type II error. All of the sample means that fall in this shaded region produce situations where the researcher will decide to keep the null hypothesis when they should reject it.

Now let's say that the researcher had decided to go with an alpha level of .025. The researcher has done something to make it more difficult to reject the null hypothesis. How does that change the risks for both the Type I and the Type II error? Well, if you perform a directional test using an alpha level of .025, what will the critical value be? 1.96, of course. On the graph, the critical value will move to the right. What percentage of area under the curve labeled "H0 True" now falls to the right of the critical value? 2.5%. The risk of committing a Type I error has gone down. And if the critical value moves to the right, what happens to the risk of committing a Type II error? Well, now the percentage of area under the curve on the right, the one labeled "H0 False", has gone way up. This is why, when a researcher uses a lower alpha level, the risk of making a Type II error goes up. When you move the alpha level, when you move the critical value, you change the risk for both the Type I and the Type II error.

The choice of alpha level

Okay. So why is 90-something percent of research in the behavioral sciences conducted using an alpha level of .05? That alpha level of .05 means that the researcher is willing to live with a five percent chance that they could be wrong when they reject the null hypothesis. Why should the researcher have to accept a 5% chance of being wrong? Why not change the alpha level to .01, so that now the chance of making a Type I error is only 1%? For that matter, why not change the alpha level to .001, giving them a one-in-one-thousand chance of making a Type I error? Or .000001? The answer is that if the risk of a Type I error is the only thing they're worried about, that's exactly what they should do. But of course, we just spent some time saying that the choice of an alpha level determines the levels of risk for making both a
Type I or a Type II error. Obviously, if one uses a very conservative alpha level like .001, the odds of committing a Type I error will only be one in one thousand. However, the investigator has decided to use an alpha level that makes it so hard to reject the null hypothesis that they're practically guaranteeing that if an effect really is there, they won't be able to say so; the risk of committing a Type II error will go through the roof. It turns out that in most cases an alpha level of .05 strikes a happy medium in terms of balancing the risks for both types of errors. The test will be conservative, but not so conservative that it'll be impossible to detect an effect if it's really there.

In general, researchers in the social and behavioral sciences tend to be a little more concerned about making a Type I error than a Type II error. Remember, the Type I error is saying that there's something there when there really isn't. The Type II error is saying there's nothing there when there really is. In the behavioral sciences, there are often a number of researchers who are investigating more or less the same questions. Let's say that 20 different labs all do pretty much the same experiment. And let's say that in this case the null hypothesis is true; there's nothing there to find. But if all 20 labs conduct their test using an alpha level of .05, what's likely to happen? The alpha level of .05 means that the test is going to be significant one out of every 20 times just by chance. So if 20 labs do the experiment, one lab will find the effect by accident and publish the result. The other 19 labs will correctly not get a significant effect, and they won't get their results published, because null results don't tend to get published. The one false positive gets into the literature and takes the rest of the field off on a wild goose chase. To guard against this scenario, researchers tend to use a rather conservative alpha level, like .05, in their tests. The relatively large risk of making a Type II error in a particular test is offset by the fact that if one lab misses an effect that's really there, one of the other labs will find it. Chance determines which labs win and which labs lose, but the field as a whole will still learn about the effect.

Now, just to be argumentative for a moment: can you think of a situation where the appropriate alpha level ought to be something like .50? .40? That means the researcher is willing to accept a 40% risk of saying that there's something there when there really isn't: a 40% chance of a false alarm. The place to start in thinking about this is to consider the costs associated with both types of errors. What's the cost associated with saying that there's something there when there really isn't (the Type I error)? What's the cost associated with missing something when it's really there (the Type II error)?

Consider this scenario. You're a pharmacologist searching for a chemical compound to use as a drug to cure AIDS. You have a potential drug on your lab bench, and you test it to see if it works. The null hypothesis is that people with AIDS who take the drug don't get any better. The alternative hypothesis is that people with AIDS who take the drug do get better. Let's say that people who take this drug get very sick to their stomach for several days. What's the cost of committing a Type I error in this case? If the researcher says the drug works when it really doesn't, at least two negative things will happen: (a) a lot of people with AIDS are going to get their hopes up for nothing, and (b) people who take the drug are going to get sick to their stomachs without getting any benefit from the drug. There are certainly significant costs associated with a Type I error.

But how about the costs associated with making a Type II error? A Type II error in this case would mean that the researcher had a cure for AIDS on their lab bench, they had it in their hands, they tested it, and they decided it didn't work. Maybe this is the ONLY drug that would ever be effective against AIDS. What are the costs of saying the drug doesn't work when it really does? The costs are devastating. Without that drug, millions of people are going to die. So which type of mistake should the researcher try not to make? The Type II error, of course. And how can you minimize the risk of a Type II error? By setting the alpha level so high that the risk of a Type II error drops down to practically nothing.

It seems like the researcher's choice of an alpha level ought to be based on an assessment of the costs associated with each type of mistake. If the Type I error is more costly, use a low alpha level. If the Type II error is more costly, use a higher alpha level. Using an alpha level of .05 out of habit, without thinking about it, strikes me as an oversight on the part of investigators in many fields where costly Type II errors could easily have been avoided through a considered use of alpha levels of .10 or higher. The one-size-fits-all approach to selecting an alpha level is particularly unfortunate considering the ease with which software packages for data analysis allow researchers to adopt whatever alpha level they wish. When you read the results section of a paper, ask yourself what the costs of each type of error would be.
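The quantitative claims in this section can be checked with a few lines of code: the one-tailed critical value of 1.645 versus the two-tailed 1.96, the way the Type II error risk (beta) climbs as alpha drops, and the chance that at least one of 20 labs gets a false alarm. This is an illustrative sketch, not part of the original notes; in particular, the assumption that the "H0 False" curve sits 2.5 standard errors above the "H0 True" curve is a made-up number chosen only to show how beta moves with alpha.

```python
from statistics import NormalDist

z = NormalDist()  # the standard normal curve

# Critical values for an alpha level of .05
one_tailed = z.inv_cdf(1 - 0.05)      # directional test
two_tailed = z.inv_cdf(1 - 0.05 / 2)  # non-directional test
print(round(one_tailed, 3))  # 1.645
print(round(two_tailed, 2))  # 1.96

# Risk of a Type II error (beta) for a directional test: the area of the
# "H0 False" curve that falls below the critical value. The 2.5-standard-error
# separation between the two curves is a hypothetical effect size.
effect = 2.5
for alpha in (0.10, 0.05, 0.01):
    crit = z.inv_cdf(1 - alpha)   # critical value moves right as alpha drops
    beta = z.cdf(crit - effect)   # shaded region under the "H0 False" curve
    print(f"alpha={alpha:.2f}  critical value={crit:.3f}  beta={beta:.3f}")

# Chance that at least one of 20 independent labs gets a significant
# result by accident when the null hypothesis is true everywhere:
print(round(1 - (1 - 0.05) ** 20, 2))  # 0.64
```

Running the loop shows beta rising as alpha falls (roughly .11 at alpha = .10 versus roughly .43 at alpha = .01 for this hypothetical effect), which is exactly the tradeoff described above.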
