Statistics for Psychology
Statistics for Psychology PSYC 233
This 41-page set of class notes was uploaded by Marlene Abernathy DDS on Monday, October 12, 2015. The notes belong to PSYC 233 at Fayetteville State University, taught by David Wallace in Fall.
Date Created: 10/12/15

Lesson 15
ANOVA (analysis of variance)

Outline
Variability
   between-group variability
   within-group variability
   total variability
   F-ratio
Computation
   sums of squares (between/within/total)
   degrees of freedom (between/within/total)
   mean square (between/within)
   F (ratio of between to within)
Example Problem

Note: The formulas detailed here vary a great deal from the text. I suggest using the notation I have outlined here, since it will coincide more with what we have already done, but you might look at the text version as well. Use whatever method you find easiest to understand.

Variability
Please read about this topic on the web page by locating the ANOVA demonstration.

Between-group variability and within-group variability are both components of the total variability in the combined distributions. What we are doing when we compute between and within variability is partitioning the total variability into its between and within components. So:

Between variability + within variability = total variability

Hypothesis Testing
Again, with ANOVA we are testing hypotheses that involve comparisons of two or more populations. The overall test, however, will only indicate that a difference exists somewhere among the groups. It will not specify which two groups differ, or whether some or all of the groups differ. Instead, we will conduct a separate test to determine which specific means differ. Because of this fact, the research hypothesis will state simply that "at least two of the means differ." The null will still state that there are no significant differences between any of the groups (insert as many mu's as you have groups):

$$H_0: \mu_1 = \mu_2 = \mu_3$$

Critical values are found using the F-table in your book. The table is discussed in the example below.

Computation
How do we measure variability in a distribution? That is, how do we measure how different the scores in the distribution are from one another? You should know that we use variance as a measure of variability. With ANOVA, or analysis of variance, we
compute a ratio of variances: between variance to within variance. Recall that variance is the average squared deviation of scores about the mean. We will compute the same value here, but as the definition suggests, it will be called the mean square for the computations. So we are computing variance. Recall that when we compute variance, we first find the sum of the squared deviations and then divide by n - 1, the degrees of freedom for a sample:

$$s^2 = \frac{SS}{n - 1} = \frac{\text{sums of squares}}{\text{degrees of freedom}}$$

When we compute the mean square (variance) in order to form the F-ratio, we will do the exact same thing: compute the sums of squares and divide by degrees of freedom. Don't let the formulas intimidate you. Keep in mind that all we are doing is finding the variance for our between factor and dividing that by the variance for the within factor. These two variances will be computed by finding each sum of squares and dividing it by its respective degrees of freedom.

Sums of Squares
We will use the same basic formula for sums of squares that we used with variance. While we will only use the between variance and within variance to compute the F-ratio, we will still compute the total sums of squares (all values) for completeness.

Total Sums of Squares

$$SS_{TOT} = \sum X_{TOT}^2 - \frac{\left(\sum X_{TOT}\right)^2}{N_{TOT}}$$

Note that it is the same formula we have been using. The subscript TOT stands for the total. It indicates that you perform the operation for ALL values in your distribution (all subjects in all groups).

Within Sums of Squares

$$SS_{WITHIN} = \left[\sum X_1^2 - \frac{\left(\sum X_1\right)^2}{n_1}\right] + \left[\sum X_2^2 - \frac{\left(\sum X_2\right)^2}{n_2}\right] + \cdots + \left[\sum X_k^2 - \frac{\left(\sum X_k\right)^2}{n_k}\right]$$

Notice that each segment is the same formula for sums of squares we used in the formula for variance and for the total sums of squares above. What is different here is that you consider each group separately. So the first segment, with the subscript 1, means you compute the sum of squares for the first group. Group two is labeled with a 2, but notice that after that we have group k instead of a number. This notation indicates that you continue to
find the sums of squares as you did for the first two groups for however many groups you have in the problem. So k could be the third group, or if you have four groups, you would do the same sums of squares computation for the third and fourth groups.

Between Sums of Squares

$$SS_{BETWEEN} = \frac{\left(\sum X_1\right)^2}{n_1} + \frac{\left(\sum X_2\right)^2}{n_2} + \cdots + \frac{\left(\sum X_k\right)^2}{n_k} - \frac{\left(\sum X_{TOT}\right)^2}{N_{TOT}}$$

We have the same k notation here. Again, you perform the same operation for each separate group in your problem. However, with this formula, once we compute the value for each group we must subtract one more term at the final step. This term is the second half of the formula we used for the total sums of squares.

Degrees of Freedom
Again, we will first compute the sums of squares for each source of variance and then divide the values by degrees of freedom in order to get the two mean square values we need to form the F-ratio. Degrees of freedom, however, is different for each source of variability.

Total degrees of freedom: $df_{TOTAL} = N - 1$ (this N value is the total number of values in all groups)
Within degrees of freedom: $df_{WITHIN} = N - K$ (K is the number of categories or groups; N is still the total N)
Between degrees of freedom: $df_{BETWEEN} = K - 1$

We will also use degrees of freedom to locate the critical value on the F-table (see page A29 for alpha = .05 and A30 for alpha = .01). The numerator of the F-ratio is the between factor, so we will use the degrees of freedom between along the top of your table. The denominator of the F-ratio is the within factor, so we will use degrees of freedom within along the left margin of the table.

Mean Square
Now we divide each sum of squares by its respective degrees of freedom. Don't let the formulas intimidate you. All we are doing is matching up degrees of freedom with the sums of squares to get the mean square (variance).

Between Mean Square: $MS_{BETWEEN} = \frac{SS_{BETWEEN}}{df_{BETWEEN}}$
Within Mean Square: $MS_{WITHIN} = \frac{SS_{WITHIN}}{df_{WITHIN}}$

F-ratio
The final step is to divide our between variance by our within variance to see if the effect (between) is large compared to the error (within).

$$F = \frac{MS_{BETWEEN}}{MS_{WITHIN}}$$

Example
A
therapist wants to examine the effectiveness of 3 therapy techniques on phobias. Subjects are randomly assigned to one of three treatment groups. Below are the rated fear of spiders after therapy. Test for a difference at alpha = .05.

Therapy A            Therapy B            Therapy C
    5                    3                    1
    2                    3                    0
    5                    0                    1
    4                    2                    2
    2                    2                    1
$\sum X_1 = 18$      $\sum X_2 = 10$      $\sum X_3 = 5$
$\sum X_1^2 = 74$    $\sum X_2^2 = 26$    $\sum X_3^2 = 7$

STEP 1: State the null and alternative hypotheses.
H1: at least one mean differs
$H_0: \mu_1 = \mu_2 = \mu_3$

STEP 2: Set up the criteria for making a decision. That is, find the critical value. You might do this step after Step 3, since that is where you compute the degrees of freedom.
$df_{BETWEEN} = K - 1 = 3 - 1 = 2$
$df_{WITHIN} = N - K = 15 - 3 = 12$
$F_{critical} = 3.88$

STEP 3: Compute the appropriate test statistic. Although in this example I have given the summary values, for some problems you might have to compute the sum of X and the sum of squared X's yourself.

$$\sum X_{TOT} = 18 + 10 + 5 = 33, \qquad \left(\sum X_{TOT}\right)^2 = 33^2 = 1089, \qquad \sum X_{TOT}^2 = 74 + 26 + 7 = 107$$

$$SS_{TOT} = \sum X_{TOT}^2 - \frac{\left(\sum X_{TOT}\right)^2}{N_{TOT}} = 107 - \frac{1089}{15} = 107 - 72.6 = 34.4$$

$$SS_{WITHIN} = \left[74 - \frac{18^2}{5}\right] + \left[26 - \frac{10^2}{5}\right] + \left[7 - \frac{5^2}{5}\right] = (74 - 64.8) + (26 - 20) + (7 - 5) = 9.2 + 6 + 2 = 17.2$$

$$SS_{BETWEEN} = \frac{18^2}{5} + \frac{10^2}{5} + \frac{5^2}{5} - \frac{33^2}{15} = \frac{324}{5} + \frac{100}{5} + \frac{25}{5} - \frac{1089}{15} = 64.8 + 20 + 5 - 72.6 = 17.2$$

Note that anytime you compute two of the sums of squares you can derive the third one without using its formula, because Between + Within = Total.

$df_{TOTAL} = N - 1 = 15 - 1 = 14$
$df_{BETWEEN} = K - 1 = 3 - 1 = 2$
$df_{WITHIN} = N - K = 15 - 3 = 12$

$$MS_{BETWEEN} = \frac{SS_{BETWEEN}}{df_{BETWEEN}} = \frac{17.2}{2} = 8.6 \qquad MS_{WITHIN} = \frac{SS_{WITHIN}}{df_{WITHIN}} = \frac{17.2}{12} = 1.43$$

$$F = \frac{MS_{BETWEEN}}{MS_{WITHIN}} = \frac{8.6}{1.43} \approx 6.0$$

Once we have computed all the values, very often we place them in a source table (below). Putting the values in a table like this one may make it easier to think about the statistic. Notice that once we get the sums of squares on the table, we divide those values by the df in the next column. Once we get the two mean squares, we divide those to get F.

Source             SS      df     MS      F
Between (group)    17.2     2     8.6     6.0
Within (error)     17.2    12     1.43
Total              34.4    14

STEP 4: Evaluate the null hypothesis based on your answers to the above steps.
Reject the null. The computed F is larger than the critical value of 3.88.

STEP 5: Based on your evaluation of the null hypothesis, what is your conclusion?
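Before stating the conclusion, the Step 3 arithmetic can be double-checked with a short script. The sketch below is only an illustration in Python and is not part of the course materials; the function name `one_way_anova` and the dictionary keys are made up for this example. It applies the same computational formulas for the sums of squares to the raw scores in the table above.

```python
def one_way_anova(groups):
    """Hypothetical helper: one-way ANOVA using the notes' computational formulas."""
    all_scores = [x for group in groups for x in group]
    n_tot = len(all_scores)          # N: total number of scores in all groups
    k = len(groups)                  # K: number of groups

    # (sum of all X)^2 / N_TOT, the term subtracted in SS total and SS between
    correction = sum(all_scores) ** 2 / n_tot

    ss_total = sum(x * x for x in all_scores) - correction
    # One "sum X^2 - (sum X)^2 / n" segment per group, added together
    ss_within = sum(
        sum(x * x for x in group) - sum(group) ** 2 / len(group)
        for group in groups
    )
    ss_between = sum(sum(group) ** 2 / len(group) for group in groups) - correction

    df_between = k - 1
    df_within = n_tot - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within

    return {
        "ss_total": ss_total,
        "ss_within": ss_within,
        "ss_between": ss_between,
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
    }


# Rated fear of spiders after therapy, taken from the example table
therapy_a = [5, 2, 5, 4, 2]
therapy_b = [3, 3, 0, 2, 2]
therapy_c = [1, 0, 1, 2, 1]

result = one_way_anova([therapy_a, therapy_b, therapy_c])
for name, value in result.items():
    print(name, round(value, 2))
```

Run on the therapy data, the script should reproduce the source table: sums of squares of about 17.2 (between), 17.2 (within) and 34.4 (total), and an F of about 6.0. That is larger than the critical value of 3.88 from Step 2, which is why the null was rejected in Step 4 and why the conclusion reads as follows.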
There is at least one group that is different from at least one other group.

Lesson 19
Chi Square

Outline
Categorical Data
Goodness of Fit Test
   observed frequency
   expected frequency
   $\chi^2$ statistic
Example
   hypothesis testing

Categorical Data
As mentioned at the start of the lesson on correlation, all of the data we have been working with so far involve measurement data. We actually took measurements from units in our sample to create our distribution. Oftentimes, however, we will want to analyze categorical or qualitative data as well. For categorical data we will not have a measure of individual units in the sample. Instead, we will analyze frequencies, or counts of people falling into different categories or groups. When analyzing categorical data, we say the test is nonparametric. Thus all the tests we have learned before this point were parametric tests.

Chi Square Goodness of Fit Test
We will learn two different chi-square tests. The first of these is the goodness-of-fit test. With the goodness-of-fit test we will test whether the data fit well with what we would expect if only chance factors were operating. For example, if I measured the number of insurance claims for different car types, I might have the following data:

High Performance    Compact    Mid Size    Full Size
       20              14          7           9

Notice that our data are now frequency values, or how many values in our sample fit into different categories. The test will tell us whether there is a difference in how many values fall at different levels of the single variable, car type. Is there a difference in the number of claims for different car types? The values we observe in our sample are the observed frequencies ($f_o$). What we want to know is whether they differ from the frequencies we would observe by chance. The values we would expect if there really were no difference in the number of claims made for different car types are what we call the expected frequencies ($f_e$). If there really were no difference in the frequencies for each level of the variable, then we would expect
equal numbers of claims for each car type. Since there are a total of 50 claims in our sample and there are 4 different levels of the variable, we would expect 50/4 = 12.5 claims for each car type. Thus:

            High Performance    Compact    Mid Size    Full Size
Observed           20              14          7           9
Expected          12.5            12.5       12.5        12.5

What the chi-square statistic does is compare the values we observe to those we would expect if there were no difference. If what we observe varies a good bit from the values we would expect if there were no difference, then there must be a difference. If there really were no difference in the number of insurance claims for this example, then we would expect the number of claims to be close to the expected frequencies.

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

Notice that we subtract each expected value from each observed value, square the difference, and divide by the expected frequency. We then sum up all of the values we computed. Let's take a look at the example we have been working on within the context of hypothesis testing. We will continue the problem with alpha set to .05.

Step 1: Write the hypotheses for the test.
$H_1: f_o \neq f_e$
$H_0: f_o = f_e$
Here we are stating that the observed frequencies are the same as the expected for the null.

Step 2: Find the critical value.
Again, we will use Appendix A to find the critical value (see page A34). For our test, degrees of freedom is equal to C - 1, where C is the number of categories.
$df = 4 - 1 = 3$
$\chi^2_{critical} = 7.81$

Step 3: Run the statistical test.
We have already computed the expected values, so we just need to plug the numbers into the formula.

            High Performance    Compact    Mid Size    Full Size
Observed           20              14          7           9
Expected          12.5            12.5       12.5        12.5

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = \frac{(20 - 12.5)^2}{12.5} + \frac{(14 - 12.5)^2}{12.5} + \frac{(7 - 12.5)^2}{12.5} + \frac{(9 - 12.5)^2}{12.5}$$

$$\chi^2 = \frac{(7.5)^2}{12.5} + \frac{(1.5)^2}{12.5} + \frac{(-5.5)^2}{12.5} + \frac{(-3.5)^2}{12.5} = \frac{56.25}{12.5} + \frac{2.25}{12.5} + \frac{30.25}{12.5} + \frac{12.25}{12.5}$$

$$\chi^2 = 4.50 + 0.18 + 2.42 + 0.98 = 8.08$$

Step 4: Make a decision about the null.
Reject the null. The value we computed in Step 3 is larger than the critical value in Step 2, so we reject the null.

Step 5: Conclusion.
Since we rejected the null, we say there is a difference in the number of claims
made for different car types.

Lesson 14
Independent Samples t-test

Outline
No Population Values
Changes in Hypotheses
Changes in Formula
   standard error
Pooled Standard Error
   weighted averages
Critical Values
   df
Sample Problem

No Population Values
With the independent samples t-test we finally reach the point where we have no population values. This fact is important because when we test hypotheses, we are usually testing an idea about a population that we know nothing about. Think about the kinds of scientific discoveries you hear about often. New treatments for diseases, new drugs, or new techniques for treating depression all involve testing a population created by the treatment, drug, or technique. So with the independent samples t-test we will compare two sample values directly. Note that we are still making the inference about the populations from which the samples are drawn.

Changes in Hypotheses
All hypotheses from this point on in the course will be two-tailed. In addition, since we no longer know any population values, we will use mu ($\mu$) to represent both populations. So, for example:

$H_0: \mu_{diet} = \mu_{placebo}$
$H_1: \mu_{diet} \neq \mu_{placebo}$

Formula Changes
Recall the formula for the t-test we have been using:

$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}} \quad \text{where} \quad s_{\bar{X}} = \frac{s}{\sqrt{n}}$$

The numerator will now have two sample values ($\bar{X}_1 - \bar{X}_2$) instead of one sample and one population value. The denominator, recall, is the standard error: the standard deviation divided by the square root of the sample size. Our standard error (denominator) was $s_{\bar{X}} = \frac{s}{\sqrt{n}}$. Remember that the standard error measures the variability we expect to see among samples. Now that we have two samples, we will want to include the estimate of variability from both. Thus we will have to take into account the standard deviations and sample sizes of both samples. We will compute the standard error term separately for each sample and then add the two together. Because of the formula we will develop, it will be easier if we switch from using the standard deviation to the variance. In this way we can eliminate the radical in the denominator.
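A small numeric check may make the switch from standard deviation to variance more concrete. The Python sketch below is only an illustration: the function names and the sample values are made up, and it assumes that "adding them together" means summing the two variance terms (s squared over n) under one radical, which is the usual form of the standard error of a difference between two independent means.

```python
import math


def standard_error(s, n):
    """Standard error from a standard deviation: s / sqrt(n)."""
    return s / math.sqrt(n)


def standard_error_from_variance(variance, n):
    """Same quantity written with the variance: sqrt(s^2 / n)."""
    return math.sqrt(variance / n)


def standard_error_of_difference(var1, n1, var2, n2):
    """Combined standard error for two independent samples (assumed form)."""
    return math.sqrt(var1 / n1 + var2 / n2)


# Made-up values for illustration only
s, n = 4.0, 16
se_sd = standard_error(s, n)
se_var = standard_error_from_variance(s ** 2, n)

# Both versions of the one-sample formula give the same number
print(se_sd, se_var, math.isclose(se_sd, se_var))

# Two made-up samples: variances 16 and 9, sample sizes 16 and 9
se_diff = standard_error_of_difference(16.0, 16, 9.0, 9)
print(se_diff)
```

Note that in this sketch the two per-sample terms are added as variances (s squared over n) before the square root is taken; adding s over the square root of n for each sample directly would give a different, larger number.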
The two formulas, $\frac{s}{\sqrt{n}}$ and $\sqrt{\frac{s^2}{n}}$, are equivalent. Since we are adding the two separate standard error terms together, we have:

$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Notice that we now denote the combined standard error with $s_{\bar{X}_1 - \bar{X}_2}$. Again, it's just a way to symbolize the final value we will divide into the numerator.

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}} \quad \text{where} \quad s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Pooled Standard Error
Note that the formulas I present in this section differ from your text. The above formula is useful when our sample sizes are the same. However, in situations where our sample sizes are different, we cannot simply add the two standard error terms together. Instead, we have to give more weight to the larger sample.

Weighted Averages
Let's say I have one sample with N = 20 and S = 1.5. Another sample has N = 100 and S = 15. Although this example is extreme, you can see that you would not want to simply average the two groups together in order to get the average of S. If we did that, we would have the average of two groups rather than the average of all 120 people.

Simple average of two groups:
$$\frac{1.5 + 15}{2} = \frac{16.5}{2} = 8.25$$

Weighted average of 120 people:
$$\frac{20(1.5) + 100(15)}{20 + 100} = \frac{30 + 1500}{120} = \frac{1530}{120} = 12.75$$

For the weighted average we multiply each value of S by its sample size to get a sum for all 120 people, and in the final step we divide by the total number of people. You can see that if I have 100 people with such a large S, then the average of those people plus 20 more with a small S should land closer to the larger group's value. The weighted average (12.75) does this; the simple average (8.25) does not.

When we have unequal sample sizes, we will want to use a similar process to average, or pool, the variances from our two samples. Below is the formula that does just that. Notice that we are doing the same process we used for the weighted average above. We multiply the variances by the sample sizes and divide by the total number of people. The value n - 1, or degrees of freedom, is used to represent the sample size.

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

Here $s_p^2$ is the symbol we use for the pooled variance. Once we compute that
value, we plug it into the same formula we used with equal sample sizes, but now denote the variance as pooled:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}} \quad \text{where} \quad s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$$

Critical Values
We will use the same table to find the critical values as we did with the one-sample t-test. However, degrees of freedom are now computed from two samples, so $df = n_1 + n_2 - 2$.

Sample Problem
A new program of imagery training is used to improve the performance of basketball players shooting free-throw shots. The first group did an hour of imagery practice and then shot 30 free-throw shots, with the number of shots made recorded. A second group received no special practice and also shot 30 free-throw shots. The data are below. Did the imagery training make a difference? Set alpha = .05.

 X1      X2
 15       5
 17       6
 20      10
 25      15
 26      18
 27      20

$\bar{X}_1 = 21.66$      $\bar{X}_2 = 12.33$
$s_1^2 = 25.46$      $s_2^2 = 39.46$
$n_1 = 6$      $n_2 = 6$

Step 1: Write the hypotheses in words and symbols.
H1: The population receiving imagery practice will make a different number of baskets than the population receiving no imagery practice.
H0: The population receiving imagery practice will make the same number of baskets as the population receiving no imagery practice.
$H_1: \mu_{imagery} \neq \mu_{no\ imagery}$
$H_0: \mu_{imagery} = \mu_{no\ imagery}$

Step 2: Find the critical value for the test.
Since alpha is .05, the test is two-tailed, and $df = n_1 + n_2 - 2 = 6 + 6 - 2 = 10$:
$t_{critical} = \pm 2.228$

Step 3: Run the test.
Since we have equal sample sizes (n's) for each group, we can use the first, shorter formula:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}} \quad \text{where} \quad s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

All the values are given above, so you just have to plug in and compute:

$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{25.46}{6} + \frac{39.46}{6}} = \sqrt{4.24 + 6.58} = \sqrt{10.82} \approx 3.29$$

$$t = \frac{21.66 - 12.33}{3.29} = \frac{9.33}{3.29} \approx 2.84$$

Note that we could have used the longer formula here as well, because it will work for equal or unequal sample sizes.

Step 4: Make a decision about the Null.
Reject the Null: since the value we computed in Step 3 is more extreme than the critical value in Step 2, we reject the idea that they are from the same population.

Step 5: Write a conclusion.
The population of players with imagery training made a different number of baskets compared to
those with no training.

Lesson 10
Steps in Hypothesis Testing

Outline
Writing Hypotheses
   research (H1)
   null (H0)
   in symbols
Steps in Hypothesis Testing
   step 1: write the hypotheses
   step 2: find the critical value
   step 3: conduct the test
   step 4: make a decision about the null
   step 5: write a conclusion

Writing Hypotheses
Before we can start testing hypotheses, we must first write the hypotheses in a formal way. We will be writing two hypotheses: the research (H1) and the null (H0) hypothesis. The research hypothesis matches what the researcher is trying to show is true in the problem. The null is a competing hypothesis. Although we would like to directly test the research hypothesis, we actually test the null. If we disprove the null, then we indirectly support the research hypothesis, since it competes directly with the null. We will discuss this fact in more detail later in the lesson.

Again, the research hypothesis matches the research question in the problem. Let's take a look at a sample problem:

Suppose some species of plant grows at 2.3 cm per week with a standard deviation of 0.3 ($\mu = 2.3$, $\sigma = 0.3$). I take a sample plant and genetically alter it to grow faster. The new plant grows at 3.2 cm per week ($X = 3.2$). Did the genetic alteration cause the plant to grow faster than the general population? Set alpha = .05.

Let's focus on writing hypotheses rather than any other steps we have learned for now. In order to write the research hypothesis, look at what the researcher is trying to prove. Here we are trying to show that the genetically altered plant grows at a faster rate than unaltered plants. That's what we want the research hypothesis to say. However, when you write your hypotheses, be sure to include three elements:
1. Explicitly state the populations you wish to compare. For now, one will be a treatment population and the other will always be the general population.
2. State the dependent variable. We have to be explicit about the scale on which we expect to find differences.
3. State the type or direction of the effect. Are we
predicting the treatment population will be greater or less than the general population (1-tail)? Or are we looking for differences in either direction at the same time (2-tail)?

The above problem is one-tail, since we are looking for a growth rate higher than the average. Look for words that indicate a direction in the problem for a one-tail test (e.g. higher/lower, more/less, better/worse). It would be two-tailed if the problem had stated that we expected a different growth rate than the general population. Different could mean higher, or it could mean lower. The current example is easy to translate into a hypothesis, but check the homework packet, because the wording is not always so obvious.

For the research hypothesis, denoted by H1:
H1: The population of genetically altered plants grows faster than the general population.

You could vary the wording a bit as long as you include the three elements. Notice that we state both the treatment population and the population we will compare it to, the general population. Growth rate is the dependent variable, and we indicate the direction by saying it will grow faster.

The null hypothesis, denoted by H0, is a competing hypothesis. It's basically the opposite of the research hypothesis. In general, it states that there is no effect for our treatment or no difference between our populations. For this example:
H0: The population of genetically altered plants grows at the same or lower rate as the general population.

I've included the "same or lower" wording for the one-tail test because we want to cover all the possible outcomes of the test. We only want to show that the treatment population grows faster. If they end up growing slower, it won't support the research hypothesis, so we include the leftover outcomes with the null. For two-tail tests, substitute "different" for the word "faster" in the research hypothesis. The two-tail null would say the groups do not differ.

In Symbols
We can also write the hypotheses in notational form. We will restate both the null and
research hypotheses in the symbols we have been using for our formulas. Thus:

$H_1: \mu_{\text{gen.alt.}} > 2.3$
$H_0: \mu_{\text{gen.alt.}} \leq 2.3$

Notice that we represent the treatment population with a mu ($\mu$). We do this because we want to make inferences about the population, not the single-value sample I am using to test the hypothesis. Our inference will be that the entire population the plant comes from grows at a faster rate. The value of 2.3 is the general population mean we are comparing against. Although it is represented with a mu in the problem, we don't use the symbol because we know the exact value for that population. For a two-tail test we simply change the direction arrows to equal/not-equal signs: an = sign for the null and a ≠ sign for the research hypothesis.

Steps in Hypothesis Testing
Now we can put what we have learned together to complete a hypothesis test. The steps will remain the same for each subsequent statistic we learn, so it is important to understand now how one step follows from another. Let's continue with the example we have already started:

Suppose some species of plant grows at 2.3 cm per week with a standard deviation of 0.3 ($\mu = 2.3$, $\sigma = 0.3$). I take a sample plant and genetically alter it to grow faster. The new plant grows at 3.2 cm per week ($X = 3.2$). Did the genetic alteration cause the plant to grow faster than the general population? Set alpha = .05.

Step 1: Write the hypotheses in words and symbols.
H1: The population of genetically altered plants grows faster than the general population.
H0: The population of genetically altered plants grows at the same or lower rate as the general population.
$H_1: \mu_{\text{gen.alt.}} > 2.3$
$H_0: \mu_{\text{gen.alt.}} \leq 2.3$

Step 2: Find the critical value for the test.
Since alpha is .05 and it is a one-tail test (because we think our treatment will produce plants that grow faster than the general population):
$z_{critical} = 1.64$

Step 3: Run the test.
Here we find out how likely the value is by computing the z-score:

$$z = \frac{X - \mu}{\sigma} = \frac{3.2 - 2.3}{0.3} = \frac{0.9}{0.3} = 3.0$$

Step 4: Make a decision about the Null.
"Reject the Null" or "Fail to Reject the Null" (retain the null)
are the only two possible answers here. Since the value we computed for the z-test is more extreme than the critical value, we reject the Null. Note that we have not proven or disproven that the plant comes from the same population or from a different population; what we have shown is that it is unlikely to come from the same population. It may seem like a matter of semantics, but indulge me. For this example we conclude: The population of genetically altered plants grows at a different rate than the general population.

Step 5: Write a conclusion.
Although we have a conclusion in Step 4, write a conclusion here in plain language, without any statistical jargon. What did our test show? If you reject the null, then there was a difference, or the treatment had an effect. The research hypothesis is your conclusion, and you can simply rewrite it from Step 1. If you fail to reject the null, then the null hypothesis is your conclusion; again, you can just rewrite it from Step 1.

Lesson 3
Data Displays

Outline
Frequency Distributions
Grouped Frequency Distributions
   class interval and frequency
   cumulative frequency
   relative percent
   cumulative relative percent
   interpretations
Histograms/Bar Graphs

Frequency Distributions
We often form frequency distributions as a way to abbreviate the values we are dealing with in a distribution. With frequency distributions we simply record the frequency, or how many values fall at a particular point on the scale. For example, if I record the number of trips out of town (X) a sample of FSU students makes, I might end up with the following data:

0 2 5 3 2 4 3 1 0 2 6 0 4 7 0 1 2 4 3 5 4 3 1 6 1 0 5 3

Instead of having a jumbled set of numbers, we can record how many of each value (f) there are for the entire X distribution. Below is a simple frequency distribution where the X column represents the number of trips and the corresponding value for f indicates how many people in the sample gave us that particular response.

X    f
0    5
1    4
2    4
3    5
4    4
5    3
6    2
7    1

From the table we can see that five people took no trips out of town, four people took one trip out of town, four people took two trips out of town, and so on. It is important not to confuse the f-value and the
X-value. The f-values are just a count of how many. So you can reverse the process as well. In some examples it might also be helpful to go from a frequency distribution back to the original data set, especially if the distribution causes confusion. In the following example I start with a frequency distribution and go backward to find all the original values in the distribution.

X    f
4    2
3    3
2    4
1    3
0    2

What is the most frequent score? The answer is two, because we will have four twos in our distribution:

0 0 1 1 1 2 2 2 2 3 3 3 4 4

Grouped Frequency Distributions
The above examples used discrete measures, but when we measure a variable it is often on a continuous scale. In turn, few of the values we measure will fall at exactly the same point on the scale. In order to build the frequency distribution, we will group several values on the scale together and count any measurements we observe in that range for the frequency. For example, if we measure the running time of rats in a maze, we might obtain the following data. Notice that if I tried to count how many values fall at any single point on the scale, my frequencies would all be one.

3.25  3.95  4.61  5.92  6.87  7.12  7.58  8.25  8.69  9.56
9.67  10.24  10.95  10.99  11.34  11.59  12.34  13.45  14.53  14.86

We will begin by forming the class interval. This will be the range of values on the scale we include for each interval. There are many rules we could use to determine the size of the interval, but for this course I will always indicate how big the interval should be. In the end we want to construct a display that has between 5 and 15 intervals. Thus:

Class Interval
3–5
6–8
9–11
12–14

Once we have the class intervals, we will count how many values fall within the range of each interval. Since there is a gap between class intervals, we will actually be counting any values that would get rounded down or up into a particular interval. For example, with the above data the value 8.25 would be rounded down into the 6–8 class interval. The value 8.69 would be rounded up into the 9–11 class interval. We will include a column to
indicate the real limits of the class interval. These are the limits of the interval including any rounded values.

Real Limits    Class Interval    f
-0.5–2.5       0–2               0
2.5–5.5        3–5               3
5.5–8.5        6–8               5
8.5–11.5       9–11              7
11.5–14.5      12–14             5

Notice that my real limits cover half the distance of the gap between class intervals. Most of the time this value will be 0.5, since most scales have one-unit values and 0.5 is half the distance. So real limits have no gap, but the class intervals do. If a value falls exactly on one of the real limits, we could randomly choose its group.

Cumulative Frequency
Once we have formed the basic grouped frequency distribution above, we can add more columns for more detailed information. The first of these is the cumulative frequency column. With this column we keep a running count of the frequency column as we move down the class intervals.

Real Limits    Class Interval    f    Cum f
-0.5–2.5       0–2               0    0
2.5–5.5        3–5               3    3
5.5–8.5        6–8               5    8
8.5–11.5       9–11              7    15
11.5–14.5      12–14             5    20

So at the first interval we have zero frequency, so cumulatively we have zero values. For the second interval we have three, so cumulatively we have three. For the third interval we have five values, so cumulatively we have 8. That includes the five for the third interval plus the three from the previous intervals. We continue this process until the last interval. Notice that when we reach the last interval, we have all the values in the distribution represented. So the bottom cumulative frequency is N, or the total number of values in the distribution (20 here).

Relative Percent
Another column will tell us the proportion of total values that fall at each interval. That is, we will express the frequency column as a percentage of the total. To convert the frequency to a percentage, take the frequency (f) and divide by the number of values (N). This gives us the proportion of values for that particular interval. Move the decimal over two places (or multiply by 100) to change the proportion into a percent. For example, the 3–5 interval has 3/20 = .15, or 15%. Thus:

Real Limits    Class Interval    f    Cum f    Rel %
-0.5–2.5       0–2               0    0        0
2.5–5.5        3–5               3    3        15
5.5–8.5        6–8               5    8        25
8.5–11.5       9–11              7    15       35
11.5–14.5      12–14             5    20       25

Cumulative Relative Percent
For a final column, we will keep a running count of the relative percent column in the same way we did with the cumulative frequency. Keep in mind we are counting relative percentages now as we move down the display.

Real Limits    Class Interval    f    Cum f    Rel %    Cum Rel %
-0.5–2.5       0–2               0    0        0        0
2.5–5.5        3–5               3    3        15       15
5.5–8.5        6–8               5    8        25       40
8.5–11.5       9–11              7    15       35       75
11.5–14.5      12–14             5    20       25       100

Notice that we can keep a running count of the relative percent column, but we could also obtain the same numbers by computing the percentage for each cumulative frequency.

Interpretations
The data display gives a good deal of information about where values in the sample fall. One good piece of information is about percentiles. A percentile is the percentage at or below a certain score. You often get percentile information when you get your SAT or ACT test scores back. Percentile information is found in the cumulative relative percent column. Each value in that column tells us the percentage of the distribution at that point or less on the scale. Since we will be rounding values down into a certain interval based on the real limits, we will indicate where on the scale a certain percentile is based on its corresponding upper real limit. For example, what score corresponds with the 75th percentile? The answer is 11.5, because any values of 11.5 or less are within the bottom 75% of the distribution. Similarly, what percentile is associated with a score of 8.5? We would use the cumulative relative percent that corresponds to 8.5, which is 40. So the score 8.5 corresponds with the bottom 40%, or the 40th percentile, of the distribution.

Other interpretations from the table can be made as well. For example, we might be interested in how many people fall at a particular interval, or at or below a certain interval. How many scored between 3 and 5? The answer is found in the frequency column, or three. How many scored 8.5 or less? The answer for this question is in
the cumulative frequency column: eight.

Histograms/Bar Graphs

We can also take the frequency information in our frequency or grouped frequency distribution and form a graph. In the graph we form a simple x-y axis. On the x-axis we place values from our scale, and on the y-axis we plot the frequency for each point on the scale. For grouped frequency distributions, we use the midpoint of each interval to indicate the different points on the scale. We will continue with our previous example, but notice that I have created a new column (MP) that indicates the center, or midpoint, of each interval. We use this value to graph the display.

Real Limits     Class Interval   MP   f   Cum f   Rel %   Cum Rel %
-0.5 to 2.5     0-2              1    0   0       0       0
2.5 to 5.5      3-5              4    3   3       15      15
5.5 to 8.5      6-8              7    5   8       25      40
8.5 to 11.5     9-11             10   7   15      35      75
11.5 to 14.5    12-14            13   5   20      25      100

[Histogram: frequency (y-axis) by Running Time (x-axis), plotted at the interval midpoints 1, 4, 7, 10, and 13]

Note that the bars are touching. The bars touch like this when we are dealing with continuous data rather than discrete data. When the scale measures discrete values, we call the display a bar graph and the bars do not touch. For example, if I counted how many people in a sample fell into each of several distinct categories, we would use a bar graph if we wanted to create a data display.

Lesson 17 Pearson's Correlation Coefficient

Outline
Measures of Relationships
Pearson's Correlation Coefficient (r)
    types of data
    scatter plots
    measure of direction
    measure of strength
Computation
    covariation of X and Y
    unique variation in X and Y
    measuring variability
Example Problem
    steps in hypothesis testing
    r

Note that some of the formulas I use differ from your text. Both sets of formulas are in the homework packet, and you should use the formulas you feel most comfortable using.

Measures of Relationships

Up to this point in the course, our statistical tests have focused on demonstrating differences in a dependent variable produced by an independent variable. In this way we could infer that by changing the independent variable we could have a direct effect on
the dependent variable. With the statistics we have learned so far, we can make statements about causality.

Pearson's Correlation Coefficient (r)

Types of data

For the rest of the course we will focus on demonstrating relationships between variables. Although we will know whether there is a relationship between variables when we compute a correlation, we will not be able to say that one variable actually causes changes in another variable. The statistics that reveal relationships between variables are more versatile, but not as definitive, as those we have already learned.

Although correlation will only reveal a relationship and not causality, we will still be using measurement data. Recall that measurement data come from a measurement we make on some scale. The type of data a statistic uses is one way we will distinguish these types of measures, so keep it in mind for the next statistic we learn (chi-square).

One feature of the data that does differ from prior statistics is that we will have two values from each subject in our sample. So we will need both an X distribution and a Y distribution to express the two values we measure from the same unit in the population. For example, if I want to examine the relationship between the amount of time spent studying for an exam (X, in hours) and the score that person makes on the exam (Y), we would have a pair of values, X and Y, for each person.

Scatter plots

An easy way to get an idea about the relationship between two variables is to create a scatter plot of the relationship. With a scatter plot we graph our values on an X-Y coordinate plane. For example, say we measure the number of hours a person studies (X) and plot that against their resulting number of correct answers on a trivia test (Y).

X   Y
0   0
1   1
1   2
2   3
3   5
4   5
5   6

Plot each X and Y point by drawing an X-Y axis and placing the X variable on the x-axis and the Y variable on the y-axis. So when we are at 0 on the x-axis for the first person, we are at 0 on the y-axis. The next person is at 1 on the x-axis and 1 on the y-axis. Plot each point this way to
form a scatter plot.

[Scatter plot: Number of Correct Answers (y-axis) by Number of Hours Studying (x-axis)]

In the resulting graph you can see that an increase in values on the x-axis corresponds to an increase on the y-axis. For a scatter plot like this one, we say that the relationship, or correlation, is positive. For positive correlations, as values on the x-axis increase, values on the y-axis increase also. So as the number of hours of study increases, the number of correct answers on the exam increases. The opposite is true as well: if one variable goes down, the other goes down too. Both variables move in the same direction.

Let's look at the opposite type of effect. In this example, the X variable is the number of alcoholic drinks consumed and the Y variable is the number of correct answers on a simple math test.

[Scatter plot: Number of Correct Answers (y-axis) by Number of Drinks (x-axis)]

This scatter plot represents a negative correlation. As the values on X increase, the values on Y decrease. So as the number of drinks consumed increases, the number of correct answers decreases. The variables are moving in opposite directions.

Measures of Strength

Scatter plots give us a good idea about the direction of the relationship between two variables. They also give a good idea of how strongly related two variables are to one another. Notice in the above graphs that you could draw a straight line to represent the direction in which the plotted points move.

[Scatter plot with a straight line drawn through the points: Number of Correct Answers (y-axis) by Number of Drinks (x-axis)]

The closer the points come to a straight line, the stronger the relationship. We express the strength of the relationship with a number between 0 and 1. A zero indicates no relationship, and a one indicates a perfect relationship. Most values will be a decimal value in between the two numbers. Note that the number is independent of the direction of the effect, which is carried by the sign. So a value of -1 indicates a strong correlation because of the number, and a negative relationship because of the sign. A value of 0.3 would be a weak correlation because the
number is small, and it would be a positive relationship because the sign is positive. Here are some more examples of scatter plots with estimated correlation (r) values.

[Figure: three scatter plots labeled Graph A, Graph B, and Graph C]

Graph A represents a strong positive correlation because the points are very close together. Graph B represents a weak positive correlation (r = 0.3). Graph C represents a strong negative correlation (r = -0.9).

Computation

When we compute the correlation, it will be the ratio of the covariation in the X and Y variables to the individual variability in X and the individual variability in Y. By covariation we mean the amount that X and Y vary together. So the correlation looks at how much the two variables vary together relative to the amount they vary individually. If the covariation is large relative to the individual variability of each variable, then the relationship, and the value of r, is strong.

[Figure: Unique Variation in X and Unique Variation in Y]

A simple example might be useful to illustrate the concept. In this example, X is population density and Y is the number of babies born.

Unique variation in X

We can think of a lot of different reasons why population density might vary by itself. People live in more densely populated areas for many reasons, including job opportunities, family reasons, or climate.

Unique variation in Y

We can also think of a lot of reasons why birth rate may vary by itself. People may be influenced to have children because of personal reasons, war, or economic reasons.

Covariation of X and Y

For this example it is easy to see why we would expect X and Y to vary together as well. No matter what the birth rate might happen to be, we would expect that more people would yield more babies being born.

When we compute the correlation coefficient, we don't have to think of all the reasons for variables to vary or covary, but simply measure the variability. How do we measure variability in a distribution? I hope you know the answer to that question by now. We measure variability with sums of squares (often expressed as variance). So when we compute the correlation, we insert the sums of squares for X and Y in the denominator. The numerator is the covariation of X and Y. For this value we could multiply the variability in the X variable times the variability in the Y variable, but the formula below gives an easier computation.

$$
r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{n}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right)}}
$$

The only new component here is
the sum of the products of X and Y. Since each unit in our sample has both an X and a Y value, you multiply these two numbers together for each unit in your sample. Then add up the values you multiplied together. See the example below as well.

Example Problem

The following example includes the changes we need to make for hypothesis testing with the correlation coefficient, as well as an example of how to do the computations. Below are the data for six participants, giving their number of years in college (X) and their subsequent yearly income (Y). Income here is in thousands of dollars, but this fact does not require any changes in our computations. Test whether there is a relationship, with alpha = .05.

# of Years of College   Income
X         Y          X²        Y²          XY
0         15         0         225         0
1         15         1         225         15
3         20         9         400         60
4         25         16        625         100
4         30         16        900         120
6         35         36        1225        210
ΣX = 18   ΣY = 140   ΣX² = 78  ΣY² = 3600  ΣXY = 505

Notice that I have included the computations for obtaining the summary values for completeness. Be sure you know how to obtain all the summed values, as they will not always be given on the exam.

Step 1: State the Hypotheses in Words and Symbols

H1: The correlation between years of education and income is not equal to zero in the population.
H0: The correlation between years of education and income is equal to zero in the population.

As usual, the null states that there is no effect or no relationship, and the research hypothesis states that there is an effect. When we write them in symbols, we use the Greek letter rho (ρ) to indicate the correlation in the population. Thus:

H1: ρ ≠ 0
H0: ρ = 0

Step 2: Find the Critical Value

Again we use a table to find the critical value, in Appendix A of your book. Locate the table and find the degrees of freedom for the appropriate test to find the critical value. For this test, df = n - 2, where n is the number of pairs of scores we have.

df = 6 - 2 = 4
r critical = ±0.811

Step 3: Run the Statistical Test

$$
\begin{aligned}
r &= \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{n}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right)}}
   = \frac{505 - \dfrac{(18)(140)}{6}}{\sqrt{\left(78 - \dfrac{18^2}{6}\right)\left(3600 - \dfrac{140^2}{6}\right)}} \\
  &= \frac{505 - \dfrac{2520}{6}}{\sqrt{\left(78 - \dfrac{324}{6}\right)\left(3600 - \dfrac{19600}{6}\right)}}
   = \frac{505 - 420}{\sqrt{(78 - 54)(3600 - 3266.67)}} \\
  &= \frac{85}{\sqrt{(24)(333.33)}}
   = \frac{85}{\sqrt{7999.92}}
   = \frac{85}{89.44}
   = .95
\end{aligned}
$$

Step 4: Make a Decision about the Null

Reject the null. Since the value we computed in Step 3 (.95) is larger than the critical value in Step 2 (.811), we reject the null.

Step 5: Write a Conclusion

There is a relationship between years spent in college and income. The more years of school, the higher the subsequent income.

r²

Often we square the r value we compute in order to get a measure of the size of the effect. Just as with eta-square in ANOVA, we compute the percentage of variability in Y that is accounted for by X. For the current example, r² = .90, so 90% of the variability in income is accounted for by education.

Sampling Distributions

Introduction

Sampling distributions represent a troublesome topic for many students. However, they are important because they are the basis for making statistical inferences about a population from a sample. One problem sampling distributions solve is to provide a logical basis for using samples to make inferences about populations. Sampling distributions also provide a measure of variability among a set of sample means. This measure of variability will, in turn, allow one to estimate the likelihood of observing a particular sample mean collected in an experiment.

At the simplest level, when testing a hypothesis, one is testing whether an obtained sample comes from a known population (usually the general population). If the sample value is likely for the known population, then it probably comes from the known population. If the sample value is unlikely for the known population, then it likely does not come from the known population, and it can then be inferred that it instead comes from a different, unknown population. If some treatment is performed, like giving a drug to improve patient recovery rates, then a sample value from the treated group will allow a test of the idea that the treatment had some effect (here, on recovery rates). Does giving patients this new drug in effect create a new and different
population, a population using the drug? If the average recovery rate of the treated group is very similar to, or likely for, the known population of patients that do not take the drug (the general population), then the treatment likely had no effect. If the average recovery rate is very different from, or very unlikely for, the known population of patients not taking the drug, then the treatment likely had an effect and created a new population of patients with different outcomes. Thus, some way to judge how likely a value is for the known population is needed.

The common formula used to find the probability, or the likelihood, of a value for a known population (solving z-score problems) is

$$
z = \frac{X - \mu}{\sigma}
$$

In the above formula, the standard deviation, sigma (σ), gives information about how much variability exists in the population. Knowing how much variability exists in the population (the width of the distribution of scores) allows one to know how likely a single x-value is for that population. Since most values in a distribution lie close to the mean, less likely values fall farther from the mean. The wider the distribution of scores, the less pronounced any specific difference between a value and the mean will be. For example, if the difference between an x-value and the population mean remains constant in the numerator, then that difference will be much more likely if the population has a very wide distribution (large denominator) than in a very narrow distribution (small denominator). So any factor, like a decreased spread in the distribution of scores, that increases the relative difference between a value and the mean will lower our estimate of how likely the value is for the distribution.

However, a hypothesis test is never based on a single x-value. Instead, a sample of values is used, from which the average is computed. If the average, or mean, value tested is very different from the known population, then it can be assumed that the population the sample represents is not the same as
the known population (mean μ). The problem in using the above formula is that sigma gives information about how much individual values vary within a population, but nothing about how much sample means vary. Sampling distributions provide an explanation of how to measure variability in samples, and thus the probability of observing a particular sample mean.

Sampling Distribution of the Mean

Sampling distributions are theoretical and not actually computed. However, examining the process of computing one is necessary. There are many types of sampling distributions, and a sampling distribution can be formed for any statistic. For the current discussion, the sampling distribution of the mean is most relevant. To form a sampling distribution:

1. Sample repeatedly and exhaustively from the population.
2. Calculate the statistic of interest (the mean) for each sample.
3. Form a distribution of the set of means obtained from the samples.

The set of values taken from the population to form a sample can be any specific size, but every possible sample of that size from the population must be taken. Then an average of each sample is computed in order to examine this new set of scores. The set of means obtained from the samples forms a new distribution, a sampling distribution. In this case, where the mean is computed as the statistic, it is the sampling distribution of the mean. Every possible combination of values from the population is sampled to form a true sampling distribution. Since most populations are very large, it is impractical to actually go through the process, which is why sampling distributions remain theoretical.

The first important fact learned from the sampling distribution of the mean is that the mean of the population and the mean of the sampling distribution of means have exactly the same value. That is, the average of the entire population of single x-values is exactly the same as the average value of the set of sample means from the sampling distribution. This fact is important to hypothesis
testing because, when testing a hypothesis based on a sample, even though a single sample will not likely be exactly like the population, it will be on average. Thus, repeated experiments will yield sample means that on average equal the population mean. Using a sample to make an inference about a population is therefore a logical and reasonable proposition.

The next important piece of information obtained from the sampling distribution of the mean is a measure of variability among sample means. Recall that some way to measure how much variability exists in a set of sample means is needed, so that there is some way to gauge how likely it is to obtain a particular sample mean collected in an experiment. If the value obtained in a sample is unlikely for the known general population, then the population it comes from is probably different from the known population it is compared against. If the value obtained in the sample is likely for, or similar to, the known population, then it is likely that there is no difference between it and the known population. If there is a large amount of variability from sample to sample, an individual sample mean obtained to test a hypothesis will have to differ much more from the known population in order to stand out distinctively than if there is little variability.

The standard deviation, just as with other distributions, is the measure used to indicate the spread or dispersion of a distribution. The standard deviation of a population measures the amount of variability in the population by how much individual scores deviate from the average, and variability in the set of sample means is computed the same way. When measuring the average deviation of a set of sample means in a sampling distribution, the amount of variability from sample to sample is being measured. This standard deviation of a distribution of sample means is called the standard error and is
symbolized as $\sigma_{\bar{X}}$.

Since the sampling distribution of the mean is theoretical, there is no need to actually calculate the standard error from it every time an inferential test is conducted. Instead, an estimate is made from the population or the sample. The formula to estimate the standard error from the population is

$$
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
$$

The above formula can be used if the population standard deviation is known. If so, it forms the denominator for a z-score hypothesis test:

$$
z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}
$$

If the population standard deviation is not known, which is usually the case, the population standard deviation must be estimated from the obtained sample. Although the standard deviation estimated from a sample is calculated slightly differently than when all the population values are known, the computation of the standard error is essentially the same. In such cases the standard error is usually represented with Roman instead of Greek letters. So the standard error is represented as $s_{\bar{X}}$ and is computed with the formula

$$
s_{\bar{X}} = \frac{s}{\sqrt{n}}
$$

In addition, when using the sample to estimate the standard deviation, we are no longer computing a z-test but a t-test instead:

$$
t = \frac{\bar{X} - \mu}{s_{\bar{X}}}
$$

Notice that the denominator is an estimate of the standard error whether we are computing a z-test or a t-test. The distance a sample mean falls from the mean of the population is judged relative to how much variability there is from sample to sample. If it is relatively unlikely to observe a certain sample (p < .05 for alpha = .05), then we can conclude that the sample did not come from the known population.

Finally, sampling distributions also yield information about how large a sample needs to be in order to test a hypothesis. The shape of the sampling distribution of the mean approaches normal regardless of the shape of the population distribution. Whether the population distribution has a normal, positively or negatively skewed, unimodal, or bimodal shape, the sampling distribution of the mean will tend toward a normal, unimodal, and symmetric shape. That's because when a
distribution of sample means is formed, each value in the distribution is derived from a sample that contains a variety of scores from the population. Because each value in the sampling distribution is an average of these values from the population, most of the sample means lie close to the mean of the population and create a unimodal and symmetric distribution, even if the single x-values in the population do not.

Recall that when values are used to form a sampling distribution, samples of any size can be used. However, the larger the number of values in each sample taken from the population to form the sampling distribution, the more normal the sampling distribution will be. That's because there will be a larger variety of values from the population in any individual sample, and the more likely it is that the average from each sample will approximate the average value of the entire population. As it turns out, at around 30 values there is enough variety contained in a sample for those values to average out very close to the average of the population. However, the larger the number of values we take in a sample, the closer we get to the average of the population. Since an estimate of the standard error is usually made from a sample, the sample size needs to be around 30 in order to approximate the value that would be obtained if the standard error were computed from the population. Thus the minimum number of values needed to approximate the population with a sample is usually close to 30, and it is best to have this minimum number in any sample used for hypothesis testing.

David S. Wallace

Cross references
See also: Central Limit Theorem; Hypothesis Testing; Normal Distribution; Standard Error of the Mean; t Test, One Sample; Variance

Further Readings
Gravetter, F. J., & Wallnau (2002). Essentials of statistics for the behavioral sciences (4th ed.). Pacific Grove, CA: Wadsworth.
Hays, W. (1994). Statistics (5th ed.). Orlando, FL: Harcourt Brace.
Howell, D. C. (1999). Fundamental statistics for the behavioral
sciences (4th ed.). Pacific Grove, CA: Duxbury Press.

Lesson 1 Introduction

Outline
Statistics
    Descriptive versus inferential statistics
    Population versus Sample
    Statistic versus Parameter
Simple Notation
Summation Notation

Statistics

What are statistics? What do you think of when you think of statistics? Can you think of some examples where you have seen statistics used? You might think about where in the real world you see statistics being used, or think about how statistics are used in your major. Statistics are divided into two main areas: descriptive and inferential statistics.

Descriptive statistics: These are numbers that are used to consolidate a large amount of information. Any average, for example, is a descriptive statistic. So batting averages, average daily rainfall, and average daily temperature are good examples of descriptive statistics.

Inferential statistics: Inferential statistics are used when we want to draw conclusions, for example when we want to determine whether some treatment is better than another, or whether there are differences in how two groups perform. A good book definition is "using samples to draw inferences about populations." More on this once we define samples and populations.

Population: Any set of people or objects with something in common. Anything could be a population. We could have a population of college students. We might be interested in the population of the elderly. Other examples include single-parent families, people with depression, or burn victims. For anything we might be interested in studying, we could define a population.

Very often we would like to test something about a population. For example, we might want to test whether a new drug might be effective for a specific group. Most of the time it is impossible to give everyone a new treatment to determine whether it worked. Instead, we commonly give it to a group of people from the population to see if it is effective. This subset of the population is called a sample.

When we measure something in a population it is
called a parameter. When we measure something in a sample it is called a statistic. For example, if I got the average age of all parents in single-family homes, the measure would be called a parameter. If I measured the age of a sample of these same individuals, it would be called a statistic. Thus, a population is to a parameter as a sample is to a statistic.

This distinction between samples and populations is important because this course is about inferential statistics. With inferential statistics we want to draw inferences about populations from samples. Thus, this course is mainly concerned with the rules, or logic, of how a relatively small sample from a large population can be tested and the results of those tests inferred to be true for everyone in the population. For example, if we want to test whether Bayer aspirin is better than Tylenol at relieving pain, we could not give these drugs to everyone in the population. It's not practical, since the general population is so large. Instead, we might give them to a couple of hundred people and see which one works better with them. With inferential statistics we can infer that what was true for a few hundred people is also true for a very large population of hundreds of thousands of people.

The symbols we write for populations and samples differ too. With populations we use Greek letters to symbolize parameters. When we symbolize a measure from a sample (a statistic), we use the letters you are familiar with (Roman letters). Thus, if I measure the average age of a population, I'd indicate the value with the Greek letter mu (μ = 24). If I were to measure the same value for a subset of the population, or a sample, then I would indicate the value with a Roman letter (X̄ = 24).

Simple Notation

You might think about descriptive statistics as the vocabulary of the "language" of statistics. If this is true, then summation notation can be thought of as the alphabet of that language. Notation and summation notation are just a shorthand way
of representing information we have collected and mathematical operations we want to perform. For example, if I collect data on a variable, say the amount of time (in minutes) several people spent waiting at a bus stop, I can represent that group of numbers with the variable X. The variable X represents all of the data that I collected.

Amount of Time (X)
5
111
89
35
123
156

With subscripts I can also represent an individual data point within the variable set we have labeled X. For example, the third data point, 89, is the X3 data point. The fifth data point, X5, is the number 123. Very often when we want to represent ALL of the data points in a variable set we use X by itself, but we may also add the subscript i. Whenever you see the subscript i, you can assume that we are referring to all the numbers for the variable X. Thus Xi is all of the numbers in the data set, or 5, 111, 89, 35, 123, 156.

There are other common symbols we will use besides X. Sometimes we will have two data sets to deal with, and refer to one distribution as X and the other distribution as Y. It is also necessary for many formulas to know how many data points are in a data set. The symbol for the number of data points in a set is N. For the data set above, the number of data points is N = 6. In addition, we will use the average, or mean, value a good deal. We indicate the mean, as noted above, differently for the population (μ) than for the sample (X̄).

Summation Notation

Another common symbol we will use is the summation sign (Σ). This symbol does not represent anything about our data itself, but instead is an operation we must perform. Whenever you see this symbol, it means to add up whatever appears to the right of the sign. Thus ΣX or ΣXi tells us to add up all of the data points in our data set. For our example above it would be 5 + 111 + 89 + 35 + 123 + 156 = 519.

You will see the summation sign with other mathematical operations as well. For example, ΣX² tells us to add all the squared X values. Thus, for our example, ΣX² = 5² + 111² + 89² + 35² + 123² + 156², or 25 + 12321 + 7921 + 1225 +
15129 + 24336 = 60957.

A few more examples of summation notation are in order, since the summation sign will be central to the formulas we write. The following examples should give you a better idea about how the summation sign is used. Be sure you recall the order of operations needed to solve mathematical expressions. You will find a review on the course web page.

For the examples below we will use a new distribution:

X: 1, 2, 3, 4
Y: 5, 6, 7, 8

ΣX² ≠ (ΣX)²

For this expression we are saying that the sum of the squared X's is not equal to the sum of the X's, squared. Notice that on the right-hand side we perform the operation in parentheses first and then the exponent, while on the left-hand side we square each value first and then add. Thus:

$$
\begin{aligned}
\sum X^2 &\neq \left(\sum X\right)^2 \\
1^2 + 2^2 + 3^2 + 4^2 &\neq (1 + 2 + 3 + 4)^2 \\
1 + 4 + 9 + 16 &\neq 10^2 \\
30 &\neq 100
\end{aligned}
$$

For the next expression we show that, as in algebra, the distributive law applies to the summation sign as well. Again, what is important is to get a feel for how the summation sign works in equations.

$$
\begin{aligned}
\sum (X + Y) &= \sum X + \sum Y \\
(1 + 5) + (2 + 6) + (3 + 7) + (4 + 8) &= (1 + 2 + 3 + 4) + (5 + 6 + 7 + 8) \\
6 + 8 + 10 + 12 &= 10 + 26 \\
36 &= 36
\end{aligned}
$$
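The two summation examples above can also be checked with a few lines of code. The following is only a minimal sketch in Python (the notes themselves use no programming language); the variable names are made up for illustration and the data are the X and Y distributions from the examples.

```python
# Data from the two summation examples above.
X = [1, 2, 3, 4]
Y = [5, 6, 7, 8]

# Sum of the squared X's: square each value first, then add.
sum_of_squares = sum(x ** 2 for x in X)

# Sum of the X's, squared: add first (the parentheses), then square.
square_of_sum = sum(X) ** 2

# Sum of (X + Y): add each X to its paired Y, then add the results.
sum_of_pairs = sum(x + y for x, y in zip(X, Y))

# Sum of X plus sum of Y: add each distribution separately, then combine.
sum_separately = sum(X) + sum(Y)

print(sum_of_squares, square_of_sum)  # 30 100
print(sum_of_pairs, sum_separately)   # 36 36
```

The first line of output shows that squaring before adding and adding before squaring give different answers, which is why the order of operations matters. The second line shows that adding the pairs and adding the two distributions separately give the same total.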