# INTRO STATISTICS STAT 2000

UGA

These 232 pages of class notes were uploaded by Ethel Hermiston on Saturday, September 12, 2015. The notes belong to STAT 2000 at University of Georgia, taught by Morse in Fall. Since the upload, they have received 89 views. For similar materials see /class/202532/stat-2000-university-of-georgia in Statistics at University of Georgia.

Date Created: 09/12/15

### Probability

Probability is the likelihood of a particular outcome occurring:

$$P(\text{event}) = \frac{\text{number of ways the given event can occur}}{\text{total number of possible outcomes}}$$

- Example: the probability of drawing a club from a deck of cards.
- Example: in an urn full of 10 red marbles and 12 black marbles, the probability of drawing a black marble is $\frac{12}{10 + 12} = \frac{12}{22} \approx 0.5455$.

### Law of Large Numbers

As we repeat an experiment a large number of times, the ratio of the number of successes in the sample to the total number of trials will approach the probability of the event.

Example: the proportion of heads should go to 0.50.

| Coins | Heads | Tails | Proportion of heads |
|---|---|---|---|
| 10 | 7 | 3 | 0.7 |
| 100 | 53 | 47 | 0.53 |
| 1000 | 498 | 502 | 0.498 |
| 10000 | 5002 | 4998 | 0.5002 |

### Types of Probabilities

- Classical / relative frequency: $P = \frac{\text{number of ways the given event can occur}}{\text{total number of possible outcomes}}$
- Subjective: the likelihood of an event is based on your own personal judgment.

### Probability Definitions

- A complement ($A^c$): all possible events that are not in $A$.
- Example: $A$ = it's snowing, $A^c$ = it's not snowing.
- $P(A^c) = 1 - P(A)$

### Example of a Complement

We have a total of 50 seats in a room, and a shaded square in the slide's figure represents a taken seat. How many people are in this room? It's easier to instead count the number of empty seats, of which there are 4. Therefore the number of people is $50 - 4 = 46$. So the complement here is all empty seats, which gives us the needed answer more easily.

### Probability (HW 5.1-5.3)

The game Scrabble contains 100 lettered tiles: 56 consonants, 42 vowels, and 2 blanks. The vowels can be further classified as follows:

| A | E | I | O | U |
|---|---|---|---|---|
| 9 | 12 | 9 | 8 | 4 |

Suppose we reach into a bag containing all the tiles and select one.

- What's the probability of drawing a tile with a letter on it? There are 98 such tiles, so $98/100 = 0.98$.
- If we draw a vowel, what's the probability it's in the upper half of the alphabet (meaning A-M)? Out of 42 vowels, the A's, E's, and I's count, so $\frac{9 + 12 + 9}{42} = \frac{30}{42} \approx 0.714285$.
- If the tile drawn contains a letter on it, what's the probability that letter is not a vowel? Out of 98 letters, 56 of them are not vowels: $\frac{56}{98} \approx 0.57143$.

### Probability (HW 5.1-5.3)

We have an urn full of marbles of the following colors: 5 green, 4 blue, 3 yellow, 2 red, 1 white (15 total).

What's the probability of
drawing a primary color (red, yellow, or blue)? $\frac{2 + 3 + 4}{15} = \frac{9}{15} = 0.6$

- If the marble drawn was a primary color, what's the probability it was either red or yellow? Out of 9 primary color marbles, we have 2 red and 3 yellow, so $\frac{2 + 3}{9} = \frac{5}{9} \approx 0.55556$.
- If the marble drawn was not a primary color, what's the probability it was not green? There are 6 non-primary color marbles (5 green, 1 white). Out of those 6 marbles, 1 is not green (the white), so $\frac{1}{6} \approx 0.16667$.

### Probability (HW 5.1-5.3)

| | Chocolate | Strawberry | Total |
|---|---|---|---|
| Male | 490 | 871 | 1361 |
| Female | 513 | 302 | 815 |
| Total | 1003 | 1173 | 2176 |

- Find the probability that an individual selected is female and prefers strawberry ice cream. Out of all 2176, we have 302 that are both female and like strawberry, so $302/2176 \approx 0.13879$.
- Find the probability that an individual selected is female or prefers strawberry ice cream. Out of 2176 again, we count the number who are either female or like strawberry, or both. As long as they have at least one of the two characteristics, we count them. This is everybody except the males who like chocolate, so the answer is $\frac{513 + 302 + 871}{2176} \approx 0.77482$.
- Given that the person selected is male, find the probability they prefer strawberry. Out of 1361 males, 871 prefer strawberry, so the answer is $871/1361 \approx 0.63997$.
- Given that the person prefers strawberry, find the probability they are male. Out of 1173 people that like strawberry, 871 of them are male, so $871/1173 \approx 0.74254$.

### Random Variables

- Discrete: a countable, whole-number set of possible values.
  - The number of words in a magazine article.
  - The number of clubs in a drawing of 10 cards.
- Continuous: an uncountable, infinite number of possible values, with decimals allowed.
  - The weight of an athlete.
  - The time taken to complete a race.

### Discrete Probability Distribution

Two requirements:

1. Each individual $p(x)$ is between 0 and 1, inclusive.
2. All probabilities sum to 1.

The mean of a discrete distribution is $\text{mean} = \sum x \, p(x)$, also called the average or expected value.

### Mean of a Distribution (HW 6.1-6.2)

You are hiking
through the Fire Swamp, and along the way you have to battle Rodents of Unusual Size. You will meet certain numbers with various probabilities. Find the missing probability of an inconceivable 5 Rodents.

| Rodents | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Probability | 0.05 | 0.12 | 0.15 | 0.36 | 0.32 |

The missing probability is $1 - (0.05 + 0.12 + 0.15 + 0.36) = 0.32$.

- Find the probability of battling at least 3 Rodents. How about more than 3?
  - At least three: $0.15 + 0.36 + 0.32 = 0.83$
  - More than three: $0.36 + 0.32 = 0.68$
- As you wish, work out how many Rodents you should expect to battle on one pass through the Fire Swamp: $1(0.05) + 2(0.12) + 3(0.15) + 4(0.36) + 5(0.32) = 3.78$. On average you will encounter 3.78 Rodents.

### Discrete Mean (HW 6.1-6.3)

The Cash 4 lottery involves picking 4 numbers, each 0 to 9, hence 10000 combinations. If you pick the correct combination, you win $1200. Also suppose the ticket costs $2.

- On a single bet, what's the probability you win? $1/10000 = 0.0001$
- Make a probability distribution for this problem. $X$ is the profit you make, taking into account the ticket cost. If you lose, you paid $2 and won nothing, so your profit is $-2$. If you win, you win $1200 but still had to pay $2, so your profit is $1198$.

| $X$ | $P(X)$ |
|---|---|
| $-2$ | $0.9999$ ($= 1 - 0.0001$) |
| $1198$ | $0.0001$ |

- Find the mean (expected) profit on this lottery: $\text{mean} = (-2)(0.9999) + (1198)(0.0001) = -1.88$
- On one play, how much should we expect to lose? This is the interpretation of an expected value. Since this profit is negative, it means we expect to lose on average $1.88.
- How about on 100 plays? $100 \times (-1.88) = -188$, so on 100 plays we expect to lose on average $188.

### Normal Curve: Empirical Rule

What's the mean and sd? [Figure: normal curve]

### Comparing Normal Curves

Which has a higher mean? Which has a higher sd? [Figure: two pairs of normal curves; answers: right has the higher mean, right has the higher sd]

### Normal Distribution

Continuous density. Rules:

- Total probability (the area under the normal curve) is 1.
- The normal curve is symmetric.

In StatCrunch, the x value (z-score) goes in the left box and the probability goes in the right box. [Figure: StatCrunch normal calculator]

### Two Types of Problems: Normal Distributions

1. Given an x value, find the probability above or below it.
2. Given a probability, find the x
value that gives that probability above or below it.

DRAW A SKETCH!

### Normal Probability

Draw a sketch of a normal curve if we were looking for the probability to the left of $z = 1.5$. [Figure: StatCrunch normal calculator]

Answer: $0.93319$

### Normal Probability

Draw a sketch to help find what z-value has probability 0.4 to the right. Here 0.4 is a probability (the area of the blue region). We are looking for $z$, so we are going backwards. [Figure: StatCrunch normal calculator]

Answer: $z = 0.25335$

### Area Between Two Lines

Find the probability within 1.2 standard deviations from the mean. Make a sketch of this. Try to think of a strategy for this.

1st Method: instead find the area in one tail, $0.11507$. Answer: $1 - 2(0.11507) = 0.76986$

2nd Method: we want $P(-1.2 < Z < 1.2)$.

- First get the probability left of $z = 1.2$: $P(Z < 1.2) = 0.88493$
- Next get the probability left of $z = -1.2$: $P(Z < -1.2) = 0.11507$
- The answer is the difference: $0.88493 - 0.11507 = 0.76986$

### Normal (HW 6.1-6.3)

The weight of a platypus is normally distributed with mean 4.4 and sd 0.4 pounds. Suppose the probability that a platypus weight falls above 5 pounds is 0.0668. Find the probability that a randomly selected weight is between 4.4 and 5. Make a sketch as well.

Answer: $0.50 - 0.0668 = 0.4332$

### Normal (HW 6.1-6.3)

Same problem: $\mu = 4.4$ and $\sigma = 0.4$.

- What is the z-score for a platypus with a weight of 5.5 pounds? $z = \frac{x - \mu}{\sigma} = \frac{5.5 - 4.4}{0.4} = 2.75$
- What is the z-score for a platypus whose weight is 2.61 standard deviations to the left of the mean? A z-score is the number of deviations away from the mean a point lies. To be 2.61 deviations below the mean is to have a z-score of $-2.61$. Negative because we're below.
- What weight corresponds to the above z-score? $\frac{x - 4.4}{0.4} = -2.61 \Rightarrow x = -2.61(0.4) + 4.4 \Rightarrow x = 3.356$

### Percentiles

The Pth percentile is the x that gives P% below on the normal curve. Always below.

Example: here the x is the 30th percentile because 30% of the data falls below x. [Figure: normal curve with 30% shaded to the left of x]

### Percentiles (HW 6.1-6.3)

The average high temperature in New York has mean 78 and sd 9 degrees.

- Make a sketch to help in determining the 75th percentile. 84.070404 is the resulting answer from the normal calculator.
- What is the percentile for a day with a high temperature of 75? It must be a percentage and a whole number: $0.369 \rightarrow 36.9\% \rightarrow 37$.

### 3 Types of Distributions: Means Problem

1. Population: distribution of all points in the population.
2. Sample Data (Data): distribution of one particular sample.
3. Sampling Distribution of Sample Means: distribution of the sample means of a given size $n$. If you repeatedly draw samples, note their sample averages, then plot these averages on a new graph, that's the sampling distribution.

| | Mean | St Dev |
|---|---|---|
| Population | $\mu$ | $\sigma$ |
| Sample data (data) | $\bar{x}$ | $s$ |
| Sampling dist | $\mu$ | $\sigma / \sqrt{n}$ |

### Distribution Shapes

- The Sample Data (Data) Distribution is the same shape as the population for large $n$.
  - If the population is skewed, so is the sample data.
  - If the population is normal, so is the sample data.
- The Sampling Distribution of Sample Means is normal if:
  - the population is normal, or
  - $n > 30$ (by the Central Limit Theorem).
  - Otherwise, no conclusion about shape.

### Sampling Distribution (HW 7.1-7.2)

16 subjects are from a population skewed right with mean 40 and sd 8.

- Shape of the sample data: skewed right (population).
- Mean of the sampling distribution of means: 40 (population).
- St dev of the sampling distribution of means: $8 / \sqrt{16} = 2$
- Shape of the sampling distribution of means, and why: no conclusion (too small a sample, and the population shape isn't normal).

### Sampling Distribution (HW 7.1-7.2)

100 subjects are from a population skewed left with mean 40 and sd 8.

- Shape of the sample data: skewed left (population).
- Mean of the sampling distribution of means: 40 (population).
- St dev of the sampling distribution of means: $8 / \sqrt{100} = 0.8$
- Shape of the sampling distribution of means, and why?
Approximately normal (Central Limit Theorem).

### Mean & Standard Error Properties

As the sample size $n$ increases:

- The mean of the sampling distribution does not change.
- The standard error (sd of the sampling distribution) decreases. Example: $\frac{1}{2} > \frac{1}{4}$ (larger denominator, smaller overall fraction).

Similarly, as the sample size decreases:

- The mean of the sampling distribution does not change.
- The standard error increases (the opposite).

### Distributions (HW 7.1-7.2)

The size of a badger colony follows a skewed left distribution with mean 20 and sd 4. A sample of 36 colonies is selected, and this sample has mean 16 and sd 5.5.

- What is the center and spread for the population? $\mu = 20$, $\sigma = 4$
- What is the center and spread for the sample data? $\bar{x} = 16$, $s = 5.5$
- What shape is the sample data? Skewed left, same as the population.
- What is the center and spread for the sampling distribution of the sample means with size 36? $\mu = 20$, $\frac{\sigma}{\sqrt{n}} = \frac{4}{\sqrt{36}} \approx 0.66667$
- What shape is the sampling distribution of the sample means? Approximately normal, since $n > 30$ (CLT).
- Suppose now we adjust the sample size, and the new resulting sampling distribution has a standard error of 1. Did we use a larger or smaller sample size? Smaller, since standard error increases with a smaller sample size.
- How will the mean of the new sampling distribution change? It stays the same; it does not depend on sample size.

### StatCrunch (HW 7.1-7.2)

The average household temperature in Vancouver is 67.6 degrees and the sd is 4.2. A sample of 51 households is selected. What's the probability the average of this sample will be above 68.1? Fill in the boxes (Mean, Std Dev, Prob).

Sampling distribution with $n = 51$:

- Mean $= 67.6$ (the population's $\mu$)
- SD (standard error) $= \frac{\sigma}{\sqrt{n}} = \frac{4.2}{\sqrt{51}} \approx 0.58812$

[Figure: StatCrunch normal calculator with mean 67.6 and std dev 0.58812; the resulting probability is approximately 0.198]

### StatCrunch (HW 7.1-7.2)

Same question. What's the probability the average of the sample will be within 1.5 degrees of the population mean?

- $67.6 + 1.5 = 69.1$
- $67.6 - 1.5 = 66.1$

[Figure: StatCrunch normal calculator with mean 67.6 and std dev 0.58812]

Answer: $1 - 2(0.00538) = 0.98924$
### Proportions Problem

| | Mean | St Error |
|---|---|---|
| Population | $p$ | |
| Sample data (data) | $\hat{p}$ | |
| Sampling dist | $p$ | $\sqrt{\frac{p(1 - p)}{n}}$ |

The sampling distribution of the sample proportion $\hat{p}$ is normal when $np \ge 15$ and $n(1 - p) \ge 15$.

### Computing Square Roots

It is very important to type in square roots on your calculator correctly, especially with the proportions' standard errors.

Example: compute $\sqrt{\frac{0.20(1 - 0.20)}{64}}$.

One safe way is to first compute $0.20(1 - 0.20)/64 = 0.0025$, then square root the result: $\sqrt{0.0025} = 0.05$. Otherwise, use parentheses wisely.

- Correct: `sqrt(.20(1 - .20)/64) = .05`
- Correct: `sqrt(.20(.80)/64) = .05`
- Incorrect: `sqrt(.20)(1 - .20)/64 = .00559`
- Incorrect: `sqrt(.20(1 - .20))/64 = .00625`

### Proportions (HW 7.1-7.3)

60% of students at an academy in Vancouver are female. In a random sample of 55 students, 26 of them are female. Let 1 = female and 0 = male.

Identify the population distribution of gender.

| $X$ | $P(X)$ |
|---|---|
| 1 | $0.60$ |
| 0 | $0.40$ ($= 1 - 0.60$) |

Identify the data distribution of gender.

| $X$ | $P(X)$ |
|---|---|
| 1 | $26/55 \approx 0.47273$ |
| 0 | $1 - 0.47273 = 0.52727$ |

- What is the mean & standard error of the sampling distribution of the sample proportion? Mean $= 0.60$, se $= \sqrt{\frac{0.60(0.40)}{55}} \approx 0.06606$
- Is the sampling distribution approximately normal? $np = 55(0.60) = 33$ and $n(1 - p) = 55(0.40) = 22$, so yes.

### Practice Problems

1. A researcher enrolled 1952 healthy men and women aged 18 to 49 in a study to compare the effectiveness of the traditional injected flu vaccine with a relatively new nasal spray flu vaccine. The researchers randomly assigned the participants to receive either the injection or the nasal spray. Participants were tracked throughout the flu season, and suspected cases were confirmed by cell culture or a PCR assay, or both. At the end of the study, 119 people had confirmed cases of the flu. Those who got the injection had a 50% reduced chance of getting the flu compared with the group that got the nasal spray. When the researchers looked only at the prevention of influenza Type A, the most common flu to infect the participants, the injection was 72% effective and the nasal spray was just 29% effective. In similar studies, researchers have used statistical methods to predict that the
nasal spray is more effective than the injection in healthy children under the age of 6.

   - a. What part of this example refers to Design? Randomly assigning healthy men and women between the ages of 18 and 49 to be given the injected flu vaccine or the nasal spray flu vaccine.
   - b. What part of this example refers to Descriptive Statistics? For Type A influenza, the injection was 72% effective and the nasal spray was just 29% effective. Those who got the injection had a 50% reduced chance of getting the flu compared with the group that got the nasal spray.
   - c. What part of this example refers to Inferential Statistics? The nasal spray is more effective than the injection in healthy children under the age of 6.
   - d. In this example, is the percent effective (72%) a statistic or a parameter? Statistic.
   - e. The sample of 1952 was selected from what population? All healthy men and women ages 18-49.

2. Consider the population of all students at your school. A certain proportion support mandatory national service (MNS) following high school. Your friend randomly samples 20 students from the school and uses the sample proportion who support MNS to predict the population proportion at the school. You take your own separate random sample of 20 students and find the sample proportion that supports MNS.

   - a. For the two studies, are the populations the same? Yes.

3. Answers:

   - a. Subject: a new Honda Accord.
   - b. Variable: gasoline mileage, pollution emission.
   - c. Sample: the new Honda Accords that are chosen for the study.
   - d. Population: all new Honda Accords.

4. Answers:

   - a. The entire city of Athens is the population.
   - b. The 200 citizens from the researcher's data is the sample.
   - c. Which symbol below would denote the value of 42350 in

### Variables

- Categorical (Qualitative): classifies subject by an attribute or characteristic.
  - Hair color, type of professor, make of car.

### Quantitative Variables

- Discrete: a countable number of whole-numbered values (no decimals).
  - Number of people entering a shop per hour (whole number)
  - Number of spades in a poker hand (0, 1, 2, 3, 4, 5)
  - Number of balls a juggler is currently juggling
- Continuous: any numerical value, including decimals.
  - The weight of an athlete (150, 150.01, 181.312, etc.)
  - The time taken to complete a lap (in seconds, minutes, etc.)
  - The current speed of an airplane (in mph, Mach, etc.)
  - The speed of an angry fire ant (in cm/second, say)

Quantitative variables in general give numerical measures of subjects, for example response time or number of miles traveled to work.

### Examples (HW 1.1-2.1)

Which of these are categorical or quantitative? For the latter, which are discrete or continuous?

- Length of an earthworm in mm
- Region of US (Southeast, West, etc.)
- Literary genre
- Number of times in one month the Creswell fire alarm goes off

### Important Terms

- Population: total set of subjects in which we are interested.
- Sample: a subset of the population for which we have data.
- Subject: entities we measure (individuals).

### Important Terms

- Parameter: a numerical value summarizing the population data. Ex: number of freshmen out of all STAT 2000 students.
- Statistic: a numerical value summarizing the sample data. Ex: number of freshmen out of a sample of 100 STAT 2000 students.

Parameter & Population both begin with P. Statistic and Sample both begin with S.

### Notation

We use different letters for population parameters versus sample statistics.

- $\mu$ = population mean, $\bar{x}$ = sample mean
- $\sigma$ = population st dev, $s$ = sample st dev
- $p$ = population proportion, $\hat{p}$ = sample proportion

### Example (HW 1.1-2.1)

A college dean wants to know the average age of the faculty. She takes a random sample of 10 faculty members and averages their ages. Identify each of the following:

- Population
- Sample
- Subject
- Parameter
- Statistic

### Descriptive vs Inferential

- Descriptive statistic: summary of the data in the sample. Ex: majority of students in a sample of 1000 attend UGA football games.
- Inferential statistic: a conclusion or prediction about the population based on the sample data. Ex: majority of all UGA students attend UGA football games, based on the sample.

### Sampling Methods

- Simple Random Sampling: each subject everywhere has an equally likely chance of being selected. Often done with a random number table. Ex: choosing a company somewhere in the US.
- Systematic: selecting every kth subject. Ex: surveying every 10th person we meet downtown.
- Convenience: individuals are easily found (e.g. internet surveys). Often the laziest way, so less reliable answers.

### Sampling Methods

- Stratified Sampling: taking some subjects from all possible groups.
- Cluster Sampling: taking all subjects from some possible groups.

### Sampling (HW 4.1-4.4)

- A researcher takes 3 possible classifications of companies, each of which contains 1000 businesses, and draws 100 random subjects from all three. What type of sampling is this?
- Suppose instead she draws 200 businesses at random from the whole population of companies. What type of sampling is this?
- The same researcher instead randomly selects 2 of the 3 possible classifications and then surveys all businesses in those groups. What type of sampling is this?
- Suppose instead she gets an alphabetical list of all these companies, starts with #4, and selects every 100th after that for her sample.

### Random Table (HW 4.1-4.4)

A study will assign subjects numbered 1-8 into one of two groups, four in each. Use the table to decide who goes into the first group. Start with the top left and answer in numeric order.

30494 17011

### Frequencies

$$\text{proportion} = \frac{\text{frequency}}{\text{total number of observations}}$$

$$\text{percentage} = \frac{\text{frequency}}{\text{total number of observations}} \times 100$$

Example: 18 cookies out of a random sample of 32 are chocolate chip.

- proportion $= \frac{18}{32} = 0.5625$
- percentage $= \frac{18}{32} \times 100 = 56.25\%$

### Frequencies (HW 2.1-2.2)

Results from the question of how many children a family has had. Fill in the answers.

| Children | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Count | 786 | 460 | 662 | 489 |
| Proportion | | | | |
| Percentage | | | | |

Total number: 2397 families.

### Types of Charts: Categorical

- Bar Graph: categories on the horizontal axis, frequency on the vertical axis; the height of each rectangle is the frequency. [Figure: bar graph]
- Pareto Graph: a bar graph arranged with bars in descending order of frequency. [Figure: Pareto graph]

### Types of Charts: Categorical

Pie Chart: a circle divided
into slices, each slice representing a category of a variable. [Figure: pie chart, "Regional Distribution of Weather Stations"]

- The size of a slice represents the overall percentage.
- To determine the mode, it is easier to use a bar chart.

### Categorical Data (HW 2.2)

Consider the following table of 240 animal tracks that were found in a certain park.

| Raccoon | Opossum | Squirrel | Deer | Chipmunk |
|---|---|---|---|---|
| 83 | 19 | 90 | 28 | 20 |

- What proportion of tracks were not from raccoons or deer?
- If we were to make a pie chart, which animals would have the largest and smallest slices?
- Can we find the mean, median, mode, and range from this data? If so, find them.

### Types of Charts: Quantitative

- Dot Plot: places a dot for every data value above a number line. [Figure: dot plot on a number line from 90 to 100]
- Histogram: a bar graph for quantitative data. [Figure: histogram, "Test Scores"]

A's on a test: 90, 90, 91, 93, 95, 95, 95, 98, 98, 99

### Histogram Interpretation (HW 2.2)

[Figure: histogram, "IQ's of 7th Graders", frequency against score interval, with classes 90-99, 100-109, 110-119, 120-129]

- How many total students were sampled?
- Which class has the highest/lowest frequency? What are those frequencies?
- How many students have an IQ between 100 and 119?

### Stem-And-Leaf Plot

- A bar chart on its side.
- The stem is all digits except the last one.
- The last digit is the leaf.
- Ascending order.
- No commas.
- If there is nothing in a row, write the row but leave it blank.

Example (HW 2.1-2.2), eBay selling prices: 199, 210, 210, 223, 225, 225, 225, 228, 232, 235

### Skewness

- Symmetric: mean = median
- Skewed left: mean < median
- Skewed right: mean > median

### Outliers

- The mean is sensitive to outliers.
- The median is resistant to outliers.
- When outliers are present, it is best to use the median as the measure of central tendency.

Examples:

- Earthquake magnitudes on the Richter Scale: skewed right, since there are some, but very few, big earthquakes.
- Ages of MENSA members at the time they joined: skewed left, since most were adults but a few children had high enough IQs.

### Outliers

Example: miles traveled on public transportation: 0, 0, 3, 0, 0, 0, 9, 0, 5, 0. Mean = 1.7, median = 0.

Now introduce a new data point, 90: 0, 0, 3, 0, 0, 0, 9, 0, 5, 0, 90. Mean =
9.72727, median = 0.

### Mean & Median (HW 2.3-2.4)

The number of trains a British person takes to get from one town to another in England can be modeled as follows. 200 Brits were sampled and the results are listed. Compute the mean and median. What can you say about the distribution's shape?

| Trains Taken | Frequency |
|---|---|
| 1 | 30 |
| 2 | 102 |
| 3 | 36 |
| 4 | 32 |
| Total | 200 |

### Standard Deviation

- The average distance between any data point and the mean of the data.
- Measures how much/little the data distribution is spread out.

Which has the larger and smaller st dev? [Figure: two frequency plots]

### Standard Deviation (HW 2.3-2.4)

Which has the largest and smallest standard deviation? [Figure: frequency plots]

### StatCrunch Commands

Summary Stats:

1. Enter data in one column.
2. Stat > Summary Stats > Columns
3. Select column (var1).
4. Calculate.

Regression:

1. Enter data in two columns (same order).
2. STAT > REGRESSION > SIMPLE LINEAR
3. Select columns (var1 and var2).
4. Calculate.

### Summary Stats

- Mean: average of the data set.
- Standard Deviation (23.8581): average spread in the data set.
- Q1 (45): 25% of data lie below this.
- Median, sometimes Q2 (69): 50% of data lie below and above this value.
- Q3 (92): 75% of data lie below this.
- Range (75): difference between maximum (112) and minimum (37).

### Example: BoxPlot (HW 2.5-2.6)

From StatCrunch, distribution of taxes (in cents):

- Minimum: 26
- Q1: 31
- Median: 55
- Q3: 105
- Maximum: 206

[Figure: boxplot with 25% of the data in each of the four sections]

- What proportion of states have taxes greater than 31 cents? Greater than 105 cents?
- Between what two values are the middle 50% of the data found?
- Find and interpret the interquartile range.

### New BoxPlot (HW 2.5-2.6)

Computer drive use (in kilobytes):

- Min: 4
- Q1: 256
- Median: 530
- Q3: 1105
- Max: 320000

- Is this bell-shaped or skewed?
- Use the 1.5 IQR rule to test for outliers.

### Empirical Rule

Only used for bell-shaped distributions.

- Within one standard deviation from the mean, we have 68% of all data points.
- Within two standard deviations from the mean, we have 95% of all data points.
- Within three standard deviations from the mean, we have almost all data points. Anything else is an outlier.

SUMMARY:

- 1 s: 68%
- 2 s: 95%
- 3 s: almost all

### Example (HW 2.3-2.4)

The weight of a zebra is bell-shaped with an average of 700 pounds and a standard deviation of 70 pounds.

- Give an interval within which about 95% of the data fall.
- Approximately what percentage of the data is between 560 and 770 pounds?
- Approximately what percentage of the data is between 630 and 770 pounds?
- Find the weight of a zebra that is three standard deviations above the mean.

### Z-Score

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{\text{value} - \bar{x}}{s} \quad \text{or} \quad \frac{\text{value} - \mu}{\sigma}$$

- A z-score is the number of standard deviations above/below the mean the data point lies.
- If negative, the data point is below the mean.
- If positive, the data point is above the mean.
- A data point is an outlier if z-score > 3 or z-score < -3.

### Z-Score (HW 2.5-2.6)

For 261 heights, the mean was 65.8 inches and the standard deviation was 3.0 inches. The shortest person in this sample had a height of 56 inches.

- Calculate the z-score for this person.
- Interpret the z-score.

### Z-Score (HW 2.5-2.6)

For 261 heights, the mean was 65.8 inches and the standard deviation was 3.0 inches.

- What is the z-score for someone whose height is 2.4 standard deviations below the mean?
- Find the height corresponding to the above z-score.

### Percentiles

The 20th percentile, for example, is the cutoff such that 20% of the subjects have a score falling beneath that cutoff. So x% of subjects fall beneath the xth percentile.

Example: we have 200 subjects. To find the number falling beneath the 20th percentile, we take 20% of 200, which is $200 \times 0.20 = 40$. Therefore 40 subjects out of 200 fall below the 20th percentile.

QUESTION:

- For 200 subjects, how many fall above the 45th percentile?
- For 200 subjects, what is the percentile for the person who's 52nd from the top?

### Variable Types

- Response: determined by another variable; the y-variable, on the vertical axis (scatter plots).
- Explanatory: explains or affects
the response variable; the x-variable, on the horizontal axis (scatter plots).

A contingency table is a table that relates two categorical variables.

- Explanatory variable on the side.
- Response variable on the top.

### Variables (HW 3.1)

Consider a study in which you are interested in any connections between gender and preference for dessert (chocolate cake or ice cream). Which is the explanatory variable? Which is the response?

| | Good Adjustment | Bad Adjustment | Total |
|---|---|---|---|
| Orientation | 72 | 14 | 86 |
| No Orientation | 28 | 45 | 73 |
| Total | 100 | 59 | 159 |

This is a chart of students that took freshmen orientation and students that did not, and whether they adjusted well or poorly to college.

- 86/159 did orientation.
- 59/159 adjusted poorly.
- 14/86 is the proportion of the orientation students that did not adjust well (as opposed to all students surveyed).

### Relative Risk (HW 3.1)

$$\text{relative risk} = \frac{\text{conditional proportion for one group (larger number)}}{\text{conditional proportion for another group (smaller number)}}$$

Relative risk tells us how many times more likely the outcome is for one group than the other group. The following three facts therefore follow:

1. Relative risk $\ge 1$.
2. When the numerator and denominator proportions are very similar, relative risk will be very close to 1.
3. However, when the numerator is quite a bit larger, then relative risk will be quite a bit greater than 1.

| | Good Adjustment | Bad Adjustment | Total |
|---|---|---|---|
| Orientation | 72 | 14 | 86 |
| No Orientation | 28 | 45 | 73 |
| Total | 100 | 59 | 159 |

- Find the proportion of orientation students that adjusted well.
- Find the proportion of no-orientation students that adjusted well.
- Find the relative risk of adjusting well to college for both groups of students.
- Students that ___ are ___ times more likely to adjust well to college than students that ___.

### Scatter Plots

[Figure: two scatter plots, one showing a strong correlation and one showing a weak correlation]

### Correlation r (HW 3.2-3.4)

- $-1 \le r \le 1$
- If $r$ is positive, then so is the slope. Same if $r$ is negative.
- The closer $r$ is to 1 or $-1$, the stronger the correlation.
- The closer $r$ is to 0, the weaker the correlation.
- $r$ is unitless.
- $r$ does not change if we flip variables.
- $r$ measures only LINEAR relationship.
- A
strong correlation is not proof that one variable causes the other.

Which of the following has the strongest and weakest correlation? .67, .34, .11, .92

### Lurking Variables

Example:

- $x$ = number of ounces of coffee drunk the day before an exam
- $y$ = score on that exam

Strong correlation does not prove that drinking more coffee causes an exam score to increase; there could be lurking variables:

- Number of hours reviewing
- GPA

### Least-Squares Regression

$$\hat{y} = a + bx$$

- $x$ = given data point
- $\hat{y}$ = predicted response
- $a$ = intercept: the predicted response when $x = 0$. May not always have a practical interpretation.
- $b$ = slope: how much the predicted response increases or decreases for every unit increase in $x$.

### Regression (HW 3.2-3.4)

We want to predict average monthly car insurance payments ($y$) given the number of accidents ($x$) the client has had within the past three years.

$$\hat{y} = 137.11 + 39.82x$$

- What's the predicted payment for someone who's had 2 accidents?
- Interpret the slope and intercept.
- Is the correlation positive or negative?

### Example (HW 3.2-3.4)

The predicted number of visitors in Destin during the summer is to be modeled. For every 1 degree increase in Fahrenheit temperature, the predicted number of beach visitors increases by 265. The y-intercept is 15000. Using this information, write down the regression equation.

### Regression (HW 3.2-3.4)

A shop owner wants to assign a new price for dog biscuit packets. She is curious how the price per packet ($x$, in dollars) affects the number sold per day ($y$). She studies previous years' data and gets $\hat{y} = 98 - 18x$.

- Interpret the slope.
- Interpret the intercept.

### Variables

- Categorical (Qualitative): classifies subject by an attribute or characteristic.
  - Hair color, type of professor, make of car.
- Quantitative: gives numerical measures of subjects.
  - Weight, height, response time, number of miles traveled to work.

### Quantitative Variables

- Discrete: a countable number of whole-numbered values (no decimals).
  - Number of people entering a shop per hour (whole number)
  - Number of living grandparents (0, 1, 2, 3, 4 only)
  - Number of spades in a poker hand (0, 1, 2, 3, 4, 5 only)
  - Number of balls a juggler is
currently juggling
- Continuous: can take on any numerical value, including decimals, on an interval.
  - The weight of an athlete (150, 150.01, 181.312, etc.)
  - The time taken to complete a lap (in seconds, minutes, etc.)
  - The current speed of an airplane (in mph, Mach, etc.)
  - The speed of a nerve impulse (in cm/second, say)

### Examples (HW 1.1-2.1)

Which of these are categorical or quantitative? For the latter, which are discrete or continuous?

- Length of an earthworm in mm: quantitative, continuous
- Region of US (Southeast, West, etc.): categorical
- Literary genre: categorical
- Number of times in one month the Creswell fire alarm goes off: quantitative, discrete

### Important Terms

- Population: total set of subjects in which we are interested.
- Sample: a subset of the population for which we have data.
- Subject: entities we measure (individuals).

### Important Terms

- Parameter: a numerical value summarizing the population data. Ex: number of freshmen out of all STAT 2000 students.
- Statistic: a numerical value summarizing the sample data. Ex: number of freshmen out of a sample of 100 STAT 2000 students.

Parameter & Population both begin with P. Statistic and Sample both begin with S.

### Notation

We use different letters for population parameters versus sample statistics.

- $\mu$ = population mean, $\bar{x}$ = sample mean
- $\sigma$ = population standard deviation, $s$ = sample standard deviation
- $p$ = population proportion, $\hat{p}$ = sample proportion

Mnemonic Alert: since we know nothing about the population, it is all Greek to us.

### Example (HW 1.1-2.1)

A college dean wants to know the average age of the faculty. She takes a random sample of 10 faculty members and averages their ages.

- Population: all faculty members
- Sample: the 10 faculty members selected
- Subject: an individual faculty member
- Parameter: average age of all faculty members
- Statistic: average age of the 10 selected

### Descriptive vs Inferential

- Descriptive statistic: summary of the data in the sample. Ex: majority of students in a sample of 1000 attend UGA football games.
- Inferential statistic: a conclusion or prediction
about the population based on the sample data. Ex: the majority of all UGA students attend UGA football games (based on the sample).

Sampling Methods
o Simple Random Sampling: each subject everywhere has an equally likely chance of being selected. Often done with a random number table. Ex: choosing a company somewhere in the US.
o Systematic: selecting every kth subject. Ex: surveying every 10th person we meet downtown.
o Convenience: individuals are easily found (e.g., internet surveys). Often the laziest way, so less reliable answers.

Sampling Methods
o Stratified Sampling: taking some subjects from all possible groups.
o Cluster Sampling: taking all subjects from some possible groups.

Sampling (HW 4.1-4.4)
o A researcher takes 3 possible classifications of companies, each of which contains 1000 businesses, and draws 100 random subjects from all three. What type of sampling is this? Stratified.
o Suppose instead she draws 200 businesses at random from the whole population of companies. What type of sampling is this? Simple random sampling.
o The same researcher instead randomly selects 2 of the 3 possible classifications and then surveys all businesses in those groups. What type of sampling is this? Cluster.
o Suppose instead she gets an alphabetical list of all these companies, starts with number 4, and selects every 100th after that. Systematic.

Random Table (HW 4.1-4.4)
A study will assign subjects numbered 1-8 into one of two groups, four in each. Use the table to decide who goes into the first group. Start with the top left and answer in numeric order.

30494 17011 22368 46573

o Skip 0 and 9 because there are only eight subjects.
o Skip the second 4 because it was already picked.
o Persons 1, 3, 4, 7 will go into the first group, so persons 2, 5, 6, 8 will be the second group.

Frequencies
$\text{proportion} = \frac{\text{frequency}}{\text{total number of observations}}$
$\text{percentage} = \frac{\text{frequency}}{\text{total number of observations}} \times 100$
Example: 18 cookies out of a random sample of 32 are chocolate chip.
proportion = 18/32 = .5625; percentage = .5625 × 100 = 56.25%

Frequencies
Result from the question of how
many children a family has had. Fill in the answer.

Children     0        1        2        3
Count        786      460      662      489
Proportion   .32791   .19191   .27618   .20401
Percentage   32.791   19.191   27.618   20.401
Total number: 2397 families

Types of Charts: Categorical
o Bar Graph: categories on the horizontal axis, frequency on the vertical axis; the height of each rectangle is the frequency.
o Pareto Graph: a bar graph arranged with bars in descending order of frequency.
[Figure: example bar graph and Pareto graph]

Types of Charts: Categorical
o Pie Chart: a circle divided into slices, with each slice representing a category of a variable.
o The size of a slice represents the overall percentage.
o To determine the mode, it is easier to use a bar chart.

Categorical Data (HW 2.2)
Consider the following table of 240 car brands we saw downtown.

Toyota   Chevrolet   Honda   Ford   Nissan
83       19          90      28     20

o What proportion of cars were not Toyota or Honda? We have 240 - 83 - 90 = 67, so 67/240 = .2792.
o If we were to make a pie chart, which car brands would have the largest and smallest slices? Largest: Honda. Smallest: Chevrolet.
o Can we find the mean, median, mode, and range from this data? If so, find them. We can't compute the mean, median, or range since this data is categorical. The best we can do is the mode, which is Honda, since it has the highest frequency.

Types of Charts: Quantitative
o Dot Plot: places a dot for every data value above a number line.
o Histogram: a bar plot for quantitative data.
Data (A's on a test): 90 90 91 93 95 95 95 98 98 99
[Figure: dot plot and histogram of the test scores]

Histogram Interpretation (HW 2.2)
[Figure: histogram of IQ scores by class, with frequency on the vertical axis]
o How many total students were sampled? 60 + 80 + 60 + 40 = 240.
o Which class has the highest and lowest frequency? What are those frequencies? Highest: "100-109" with 80. Lowest: "120-129" with 40.
o How many students have an IQ between 100 and 119? 80 + 60 = 140.

Stem-and-Leaf Plot
o A bar chart on its side.
o The "stem" is all digits except the last one. The last digit is the "leaf".
o Ascending order. No commas.
o If nothing in a row, write the stem but leave
the leaf blank.

Example (HW 2.1-2.2), selling prices: 199 210 210 223 225 225 225 228 232 235

19 | 9
20 |
21 | 00
22 | 35558
23 | 25

Skewness
o Symmetric: mean = median
o Skewed left: mean < median. The < looks like an L, as in Left Skewed.
o Skewed right: mean > median. The > looks like part of an R, as in Right Skewed.

Outliers
o The mean is sensitive to outliers.
o The median is resistant to outliers.
o When outliers are present, it is best to use the median as the measure of center.
Examples:
o Earthquake magnitudes on the Richter scale: skewed right, since there are some, but very few, big earthquakes.
o Ages of MENSA members at the time they joined: skewed left, since most were adults but a few children had high enough IQ.

Outlier Examples
o Miles put on vehicle to get to campus: 0 0 3 0 0 0 9 0 5 0. Mean = 1.7, Median = 0.
o Now introduce a new data point of 90: 0 0 3 0 0 0 9 0 5 0 90. Mean = 9.72727, Median = 0.

Mean & Median (HW 2.3-2.4)
The number of subway trains a New Yorker takes to get from home to work can be modeled as follows. 200 New Yorkers were sampled and the results are listed. Compute the mean and median. What can you say about the distribution's shape?
[Table of results not recovered; row 1 holds 30 people and row 2 holds 102.]
o For the median, find half the total count (about 100), so we need to find where person 100 is. It's not in row 1, since we have the first 30 only. After row 2 we have 30 + 102 = 132 people. Median = 2, since person 100 falls in row 2.
o Mean = 2.35.
o Mean > median, so the distribution is somewhat skewed right.

Standard Deviation
o The average distance between any data point and the mean of the data.
o Measures how much (or little) the data distribution is spread out.
Which has the largest and smallest standard deviation?
[Figure: two distributions, one labeled "smaller standard deviation" and the other "larger standard deviation"]
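The outlier example and the idea of standard deviation as a measure of spread can be checked with a short Python sketch. It reuses the miles-to-campus data from the notes; the function name `summarize` is made up for illustration, and the sketch assumes the sample standard deviation (dividing by n - 1), which the notes do not state explicitly.

```python
import statistics

# Miles put on a vehicle to get to campus (data from the outlier example).
miles = [0, 0, 3, 0, 0, 0, 9, 0, 5, 0]
# Same data with the extra point of 90 added.
with_outlier = miles + [90]


def summarize(data):
    """Return the mean, median, and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "sample_sd": statistics.stdev(data),
    }


before = summarize(miles)
after = summarize(with_outlier)

# The mean jumps from 1.7 to about 9.73, while the median stays at 0.
print(round(before["mean"], 5), before["median"])
print(round(after["mean"], 5), after["median"])
# The single outlier also makes the spread much larger.
print(after["sample_sd"] > before["sample_sd"])
```

The mean and standard deviation both react strongly to the one extreme value, while the median does not move, which is why the median is the preferred measure of center when outliers are present.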
Which has the largest and smallest standard deviation?
[Figure: three distributions to be ranked by spread (smallest, middle, largest)]

StatCrunch Commands
Summary Stats:
1. Enter data in one column
2. Stat > Summary Stats > Columns
3. Select column (var1)
4. Calculate
Regression:
1. Enter data in two columns (same order)
2. Stat > Regression > Simple Linear
3. Select columns (var1 and var2)
4. Calculate

Summary Stats Example (from StatCrunch)
o Mean = 70.9333: average of the data set
o Standard deviation = 23.8581: average spread in the data set
o Q1 = 47: 25% of data lie below this
o Median (sometimes Q2) = 69: 50% of data lie below and above this value
o Q3 = 92: 75% of data lie below this
o Range = 75: difference between maximum (112) and minimum (37)

Box Plot (HW 2.5-2.6)
[Figure: box plot of state taxes in cents]
o What proportion of states have taxes greater than 31 cents? 75%. Greater than 105 cents? 25%.
o Between what two values are the middle 50% of the data found? 31 and 105.
o Find and interpret the interquartile range. IQR = Q3 - Q1 = 105 - 31 = 74. This is the range for the middle half of the data.

New Box Plot (HW 2.5-2.6)
o Is this bell shaped or skewed? Notice the median is closer to the left side of the middle box, suggesting right skewness. The right line (from Q3 to the highest point) is much longer than the left line. So the distribution is skewed right.
o Use the 1.5 × IQR rule to test for outliers.
  IQR = Q3 - Q1 = 110.5 - 25.6 = 84.9
  Q1 - 1.5 × IQR = 25.6 - 1.5(84.9) = -101.75, so we have no lower outliers.
  Q3 + 1.5 × IQR = 110.5 + 1.5(84.9) = 237.85, and we have an upper outlier.

Empirical Rule
o Only used for bell-shaped distributions.
o Within one standard deviation of the mean, we have 68% of all data points.
o Within two standard deviations of the mean, we have 95% of all data points.
o Within three standard deviations of the mean, we have almost all data points. Anything else is an outlier.
SUMMARY: ±1s: 68%; ±2s: 95%; ±3s: almost all

Example (HW 2.3-2.4)
The average SAT score of an incoming UGA student is
bell-shaped with an average of 610 and a standard deviation of 50. Give an interval within which about 95% of the data fall.
$\bar{x} = 610$ and $s = 50$. 95% means we go 2 standard deviations to the right and left of the center.
Lower limit: $\bar{x} - 2s = 610 - 2(50) = 510$
Upper limit: $\bar{x} + 2s = 610 + 2(50) = 710$
So the interval is (510, 710).

Example (HW 2.3-2.4)
The average SAT score of an incoming UGA student is bell-shaped with an average of 610 and a standard deviation of 50.
o Approximately what percentage of the data is between 560 and 660? Notice 660 - 610 = 50 and 610 - 560 = 50. We have therefore gone out 50 units, which is 1 deviation from the mean. By the Empirical Rule, 1 deviation has about 68% of the data.
o Find the score of an incoming student who is three standard deviations above the mean: $\bar{x} + 3s = 610 + 3(50) = 760$.

Example (HW 2.3-2.4)
The weight of a zebra is bell-shaped with an average of 700 pounds and a standard deviation of 70 pounds. Approximately what percentage of the data is between 560 and 770?
o We have 68% between 630 and 770, so 16% in each tail.
o We have 95% between 560 and 840, so 2.5% in each tail.
o That means we have 2.5% below 560 and 16% above 770, and therefore 81.5% is in the middle: 100 - 16 - 2.5 = 81.5.

Z-Score
$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \bar{x}}{s}$ or $\frac{x - \mu}{\sigma}$
o A z-score is the number of standard deviations above or below the mean the data point lies.
o If negative, the data point is below the mean. If positive, the data point is above the mean.
o A data point is an outlier if z-score > 3 or z-score < -3.

Z-Score (HW 2.5-2.6)
For 261 heights, the mean was 65.8 inches and the standard deviation was 3.0 inches. The shortest person in this sample had a height of 56 inches.
o Calculate the z-score for this person: $z = \frac{56 - 65.8}{3.0} = -3.26667$
o Interpret the z-score. This person's height is 3.26667 standard deviations below the mean (below because it is negative). It is less than -3, so it is an unusual observation (outlier).

Z-Score (HW 2.5-2.6)
For 261 heights, the mean was 65.8 inches and the standard deviation was 3.0 inches.
o What is the z-score for someone whose height is
2.4 standard deviations below the mean? z = -2.4 (negative because below the mean).
o Find the height corresponding to the above z-score: $-2.4 = \frac{x - 65.8}{3.0} \Rightarrow x = -2.4(3.0) + 65.8 = 58.6$ in.

Percentiles
The 20th percentile, for example, is the cutoff such that 20% of the subjects have a score falling beneath that cutoff. So x% of subjects fall beneath the xth percentile.
Example: We have 200 subjects. To find the number falling beneath the 20th percentile, we take 20% of 200, which is 200 × .20 = 40. Therefore 40 subjects out of 200 fall below the 20th percentile.

Percentiles: Questions
o For 200 subjects, how many fall above the 45th percentile? 200 × .45 = 90 fall below the 45th percentile. Therefore 200 - 90 = 110 fall above. Alternatively, 200 × (1 - .45) = 110; just a different method.
o For 200 subjects, what is the percentile for the person who's 52nd from the top? The 52nd from the top is the 200 - 52 = 148th person, so 148/200 = .74. Therefore it's the 74th percentile. Answer: 74 (a whole number, so round to the nearest if necessary).

Variable Types
o Response: determined by another variable. The y-variable, on the vertical axis (scatter plots).
o Explanatory: explains or affects the response variable. The x-variable, on the horizontal axis (scatter plots).
o A contingency table is a table that relates two categorical variables: explanatory variable on the side, response variable on the top.

Variables
Consider a study in which you are interested in any connections between gender and preference for dessert (chocolate cake or ice cream).
o Explanatory: gender. Response: dessert preference.
o Could your gender possibly determine your preference for dessert? Sounds reasonable.
o Could your preference for dessert possibly determine your gender? Not very realistic.

Relative Risk (HW 3.1)
$\text{relative risk} = \frac{\text{conditional proportion for first group}}{\text{conditional proportion for second group}}$ (the first group is the larger of the two proportions)
o Relative risk tells us how many times more likely the outcome is for one group than the other group.
o The following three facts therefore follow:
  1. Relative risk ≥ 1.
  2. When
the numerator and denominator proportions are very similar, relative risk will be very close to 1.
  3. However, when the numerator is quite a bit larger, relative risk will be quite a bit greater than 1.

Relative Risk (HW 3.1)
This is a chart of students that took freshman orientation and students that did not, and whether they adjusted well or poorly to college.

                 Good adjustment   Poor adjustment   Total
Orientation      72                14                86
No orientation   28                45                73
Total            100               59                159

o 86/159 did orientation.
o 59/159 adjusted poorly.
o 14/86 is the proportion of the orientation students that did not adjust well, as opposed to all students surveyed.

Relative Risk (HW 3.1)
o Find the proportion of orientation students that adjusted well: 72/86 = .83721
o Find the proportion of no-orientation students that adjusted well: 28/73 = .38356
o Find the relative risk of adjusting well to college for both groups of students. Look at the "Good Adjustment" proportions: larger / smaller = .83721 / .38356 = 2.18272. Students that did orientation are 2.18272 times more likely to adjust well to college than students that did not do orientation.

Correlation (HW 3.2-3.4)
o $-1 \le r \le 1$
o If r is positive, then so is the slope. Same if r is negative.
o The closer r is to 1 or -1, the stronger the correlation. The closer r is to 0, the weaker the correlation.
o r is unitless and does not change if we flip variables.
o r measures only LINEAR relationships.
o A strong correlation is not proof that one variable causes the other.
Which of the following has the strongest and weakest correlation? .80, .67, .34, .11, .92
Strongest: .92 (closest to 1 in absolute value). Weakest: .11 (closest to 0).

Scatter Plots
[Figure: two scatter plots, one showing strong positive correlation and one showing weak negative correlation]

Lurking Variables
Examples:
o x = # of ounces of coffee drunk the day before an exam
o y = score on that exam
o Strong correlation does not prove that drinking more coffee causes an exam score to increase; there could be lurking variables, such as number of hours reviewing.

Least-Squares Regression
$\hat{y} = a + bx$
o x = given data point
o $\hat{y}$ = predicted response
o a = intercept: predicted response when x = 0. May not always have a practical interpretation.
o b = slope: how much the predicted response increases or decreases for
every unit increase in x.

Regression (HW 3.2-3.4)
We want to predict average monthly car insurance payments (y) given the number of accidents (x) the client has had within the past three years: $\hat{y} = 137.11 + 39.82x$.
o What's the predicted payment for someone who has had 2 accidents? $\hat{y} = 137.11 + 39.82(2) = 216.75$
o Interpret the slope and intercept. For every additional accident, the payment is expected to increase by 39.82 dollars. The expected payment for someone with no accidents is 137.11 dollars.
o Is the correlation positive or negative? Positive, because the slope is positive.

Regression (HW 3.2-3.4)
The predicted number of visitors in Destin during the summer is to be modeled. For every 1 degree Fahrenheit increase in temperature, the predicted number of beach visitors increases by 265. The y-intercept is 15000. Using this information, write down the regression equation: $\hat{y} = 15000 + 265x$

Regression (HW 3.2-3.4)
A shop owner wants to assign a new price for dog biscuit packets. She is curious how the price per packet (x, in dollars) affects the number sold per day (y). She studies previous years' data and gets $\hat{y} = 98 - 18x$.
o Interpret the slope. For every dollar increase in price, the number of dog biscuit packets sold per day is expected to decrease by 18.
o Interpret the intercept. Literally, when the price is 0 (free), the number sold per day is about 98 packets. Nonsense, so the intercept has no interpretation here.

Regression (HW 3.2-3.4)
We want to predict the number of misprints (y) in a novel that is x pages long, in hundreds. For instance, x = 2.5 is a 250 page novel. The regression equation is $\hat{y} = 51 + 32x$.
Interpret the intercept (choose the best answer):
1. For every additional 100 pages, the predicted number of misprints goes up by 51.
2. The number of misprints in a novel 0 pages long is about 51.
3. The intercept has no practical interpretation.
Interpret the slope (choose the best answer):
1. For every additional 32 pages, the predicted number of misprints goes up by 1.
2. A novel 400 pages long can be expected to have 32 more misprints than a novel 300 pages long.
3. The slope has no practical
interpretation.

Regression Output
Simple linear regression results:
Dependent Variable: var2
Independent Variable: var1
var2 = 30.761627 - 0.13430233 var1   (regression equation)
Sample size: 5
R (correlation coefficient) = -0.9956   (correlation)
R-sq = 0.9911769   (do not use)
Estimate of error standard deviation: 0.6068089
o This shows where the regression equation and the correlation coefficient are located.
o Do not use the R-sq for correlation.

Residuals
o A residual is the difference between an actual observation at a point x and what was predicted using the regression equation.
o A positive value indicates that what was observed was larger than what was predicted, so we underestimated.
o A negative value indicates that what was observed was smaller than what was predicted, so we overestimated.
residual = observed - predicted = $y - \hat{y}$
[Figure: fitted line with a residual marked]

Residuals (HW 3.2-3.4)
The car insurance question again. The predicted payment for someone with 2 recent accidents was 216.75. Suppose someone with 2 accidents had an actual payment of 201. Compute this person's residual.
$y = 201$, $\hat{y} = 216.75$, $y - \hat{y} = -15.75$. Negative because the actual payment was less (below the regression line).
The model is based on people with between 0 and 6 accidents. Can we use it to predict the payment for someone with 13 recent accidents? No, the model is only good between x = 0 and 6. Who knows what happens outside that range? This is extrapolation.

Probability
Probability is the likelihood of a particular outcome occurring.
$P(\text{event}) = \frac{\text{number of ways the given event can occur}}{\text{number of possible outcomes}}$
o Example: probability of drawing a club from a deck of cards.
o Example: in an urn full of 10 red marbles and 12 black marbles, the probability of drawing a black marble is 12/(10 + 12) = 12/22.

Law of Large Numbers
As we repeat an experiment a large number of times, the ratio of the number of successes in the sample to the total number of trials will approach the probability of the event.
Example: the proportion of heads should go to .50.
o 10 coins: 7 H, 3 T
o 100 coins: 53 H, 47 T
o 1000 coins: 498 H, 502 T (p = .498)
o 10000 coins: 5002 H, 4998 T (p = .5002)

Types of Probabilities
o Classical/Relative
Frequency: $\frac{\text{number of ways the given event can occur}}{\text{number of possible outcomes}}$
o Subjective: the likelihood of an event is based on your own personal judgment.

Probability Definitions
A complement ($A^c$): all possible events that are not in A.
Example: A = it's snowing; $A^c$ = it's not snowing. $P(A^c) = 1 - P(A)$

Example of a Complement
[Figure: seating grid with taken seats shaded]
We have a total of 50 seats in a room, and a shaded square above represents a taken seat. How many people are in this room? It's easier to instead count the number of empty seats, of which there are 4. Therefore the number of people is 50 - 4 = 46. So the complement here is all empty seats, which gives us the needed answer more easily.

Probability (HW 5.1-5.3)
The game Scrabble contains 100 lettered tiles: 56 consonants, 42 vowels, and 2 blanks. The vowels can be further classified as follows:

A   E    I   O   U
9   12   9   8   4

Suppose we reach into a bag containing all the tiles and select one.
o What's the probability of drawing a tile with a letter on it? There are 98 such tiles, so 98/100 = .98.
o If we draw a vowel, what's the probability it's in the upper half of the alphabet (meaning A-M)? Out of 42 vowels, the A's, E's, and I's count, so (9 + 12 + 9)/42 = 30/42 = .714285.
o If the tile drawn contains a letter on it, what's the probability that letter is not a vowel? Out of 98 letters, 56 of them are not vowels: 56/98 = .57143.

Probability (HW 5.1-5.3)
We have an urn full of marbles of the following colors: 5 green, 4 blue, 3 yellow, 2 red, 1 white (15 total).
o What's the probability of drawing a primary color (red, yellow, or blue)? (2 + 3 + 4)/15 = 9/15 = .6
o If the marble drawn was a primary color, what's the probability it was either red or yellow? Out of 9 primary color marbles we have 2 red and 3 yellow, so (2 + 3)/9 = 5/9 = .55556.
o If the marble drawn was not a primary color, what's the probability it was not green? There are 6 non-primary color marbles (5 green, 1 white). So out of 6 marbles, 1 is not green (the white): 1/6 = .16667.

Probability (HW 5.1-5.3)

         Chocolate   Strawberry   Total
Male     490         871          1361
Female   513         302          815
Total    1003        1173         2176

o Find the probability that an individual selected is female and prefers strawberry ice cream. Out of all 2176 we have 302
that are both female and like strawberry, so 302/2176 = .13879.
o Find the probability that an individual selected is female or prefers strawberry ice cream. Out of 2176 again. We count the number who are either female or like strawberry or both. As long as they have at least one of the two characteristics, we count them. This is everybody except the males who like chocolate. So the answer is (513 + 302 + 871)/2176 = .77482.

Probability (HW 5.1-5.3)

         Chocolate   Strawberry   Total
Male     490         871          1361
Female   513         302          815
Total    1003        1173         2176

o Given that the person selected is male, find the probability they prefer strawberry. Out of 1361 males, 871 prefer strawberry, so the answer is 871/1361 = .63997.
o Given that the person prefers strawberry, find the probability they are male. Out of 1173 people that like strawberry, 871 of them are male, so 871/1173 = .74254.

Random Variables
o Discrete: a countable (whole-number) set of possible values. Examples: the number of words in a magazine article; the number of clubs in a drawing of 10 cards.
o Continuous: an uncountable, infinite number of possible values, with decimals allowed. Examples: the weight of an athlete; the time taken to complete a race.

Discrete Probability Distribution
Two requirements:
1. Each individual p(x) is between 0 and 1, inclusive.
2. All probabilities sum to 1.
The mean of a discrete distribution: $\text{mean} = \sum x \, p(x)$. Also called the average or expected value.

Mean of a Distribution (HW 6.1-6.2)
You are hiking through the Fire Swamp, and along the way you have to battle Rodents of Unusual Size. You will meet certain numbers with various probabilities.

Rodents       1     2     3     4     5
Probability   .05   .12   .15   .36   ?

o Find the missing probability of an inconceivable 5 Rodents: 1 - (.05 + .12 + .15 + .36) = .32.
o Find the probability of battling at least 3 Rodents. How about more than 3? At least three: .15 + .36 + .32 = .83. More than three: .36 + .32 = .68.
o As you wish, work out how many Rodents you should expect to battle on one pass through the Fire Swamp: 1(.05) + 2(.12) + 3(.15) + 4(.36) + 5(.32) = 3.78. On average, you will encounter 3.78 Rodents.

Discrete Mean (HW 6.1-6.3)
The Cash 4 lottery involves picking 4 numbers, each 0 to 9, hence 10000 combinations. If you
pick the correct combination, you win 1200 dollars. Also suppose the ticket costs 2 dollars.
o On a single bet, what's the probability you win? 1/10000 = .0001
o Make a probability distribution for this problem. X is the profit you make, taking into account the ticket cost. If you lose, you paid 2 and won nothing, so your profit is -2. If you win, you win 1200 but still had to pay 2, so your profit is 1198.

X       -2      1198
P(X)    .9999   .0001

Discrete Mean (HW 6.1-6.3)
o Find the mean (expected) profit on this lottery: mean = -2(.9999) + 1198(.0001) = -1.88
o On one play, how much should we expect to lose? This is the interpretation of an expected value. Since this profit is negative, it means we expect to lose, on average, 1.88 dollars.
o How about on 100 plays? 100 × (-1.88) = -188, so on 100 plays we expect to lose, on average, 188 dollars.

Normal Curve: Empirical Rule
What are the mean and sd?
[Figure: normal curve with a labeled axis]

Comparing Normal Curves
Which has a higher mean? Which has a higher sd?
[Figure: two normal curves; the answer labels read "right: higher mean" and "right: higher sd"]

Normal Distribution
Three rules:
1. Continuous.
2. Total probability (area under the normal curve) is 1.
3. The normal curve is symmetric.
StatCrunch: the x value (z-score) goes in the left box; the probability goes in the right box.
[Figure: StatCrunch normal calculator]

Normal Distributions: Two types of problems
1. Given an x value, find the probability above or below it.
2. Given a probability, find the x value that gives that probability above or below it.
DRAW A SKETCH!

Normal Probability
Draw a sketch of a normal curve if we were looking for the probability to the left of z = 1.5.
[Figure: StatCrunch normal calculator with the area left of 1.5 shaded]
Answer: .93319

Normal Probability
Draw a sketch to help find what z-value has probability .4 to the right. Here .4 is a probability (the area of the blue region). We are looking for z, so we are going backwards.
[Figure: StatCrunch normal calculator with an area of .4 shaded on the right]
Answer: z = .25335

Area Between Two Lines
Find the probability within 1.2 standard deviations from the
mean. Make a sketch of this. Try to think of a strategy for this.

Area Between Two Lines: 1st Method
Instead find the area in the tail to the left of -1.2, which is .11507. By symmetry, the two tails together hold 2(.11507).
Answer: 1 - 2(.11507) = .76986
[Figure: normal curve with the left tail below -1.2 shaded]

Area Between Two Lines: 2nd Method
We want P(-1.2 < Z < 1.2).
o First get the probability left of z = 1.2: P(Z < 1.2) = .88493
o Next get the probability left of z = -1.2: P(Z < -1.2) = .11507
o The answer is the difference: P(Z < 1.2) - P(Z < -1.2) = .88493 - .11507 = .76986

Normal (HW 6.1-6.3)
The weight of a platypus is normally distributed with mean 4.4 and sd 0.4 pounds. Suppose the probability that a platypus weight falls above 5 pounds is .0668. Find the probability that a randomly selected weight is between 4.4 and 5. Make a sketch as well.
Half of the area lies above the mean of 4.4, so the answer is .50 - .0668 = .4332.

Normal (HW 6.1-6.3)
Same problem: $\mu = 4.4$ and $\sigma = 0.4$.
o What is the z-score for a platypus with a weight of 5.5 pounds? $z = \frac{5.5 - 4.4}{0.4} = 2.75$
o What is the z-score for a platypus whose weight is 2.61 standard deviations to the left of the mean? A z-score is the number of deviations away from the mean a point lies. To be 2.61 deviations below the mean is to have a z-score of -2.61. Negative because we're below.
o What weight corresponds to the above z-score? $-2.61 = \frac{x - 4.4}{0.4} \Rightarrow x = -2.61(0.4) + 4.4 = 3.356$

Percentiles
The Pth percentile is the x that gives P% below on the normal curve. Always below!
Example: here the x is the 30th percentile because 30% of the data falls below.
[Figure: normal curve with 30% of the area shaded to the left of x]

Percentiles (HW 6.1-6.3)
The average high temperature in New York has mean 78 and sd 9 degrees.
o Make a sketch to help in determining the 75th percentile. x = 84.0704 is the resulting answer from the normal calculator.
o What is the percentile for a day with a high temperature of 75? It must be a percentage and a whole number: .369 = 36.9%, which rounds to the 37th percentile.

3 Types of Distributions
1. Population: distribution of all points in the population.
2. Sample Data:
distribution of one particular sample.
3. Sampling Distribution of Sample Means: distribution of the sample means of a given size n. If you repeatedly draw samples, note their sample averages, then plot these averages on a new graph, that's the sampling distribution.

Means Problem

DATA            MEAN        ST DEV
POPULATION      $\mu$       $\sigma$
SAMPLE DATA     $\bar{x}$   $s$
SAMPLING DIST   $\mu$       $\sigma/\sqrt{n}$

Distribution Shapes
o Sample Data: the data distribution is the same shape as the population for large n. If the population is skewed, so is the sample data. If the population is normal, so is the sample data.
o Sampling Distribution of Sample Means: (approximately) normal if the population is normal, or if n > 30 by the Central Limit Theorem. Otherwise, no conclusion about shape.

Sampling Distribution (HW 7.1-7.2)
16 subjects are from a population skewed right with mean 40 and sd 8.
o Shape of sample data: skewed right (population)
o Mean of sampling distribution of means: 40 (population)
o St dev of sampling distribution of means: 8/sqrt(16) = 2
o Shape of sampling distribution of means, and why: no conclusion; too small a sample, and the population shape isn't normal.

Sampling Distribution (HW 7.1-7.2)
100 subjects are from a population skewed left with mean 40 and sd 8.
o Shape of sample data: skewed left (population)
o Mean of sampling distribution of means: 40 (population)
o St dev of sampling distribution of means: 8/sqrt(100) = .8
o Shape of sampling distribution of means, and why: approximately normal (Central Limit Theorem)

Mean & Standard Error Properties
As the sample size n increases:
o The mean of the sampling distribution does not change.
o The standard error (sd of the sampling distribution) decreases. Example: 1/2 > 1/4; a larger denominator gives a smaller overall fraction.
Similarly, as the sample size decreases:
o The mean of the sampling distribution does not change.
o The standard error increases (the opposite).

Distributions (HW 7.1-7.2)
The size of a badger colony follows a skewed left distribution with mean 20 and sd 4. A sample of 36 colonies is selected, and this sample has mean 16 and sd 5.5.
o What is the center and spread for the population? $\mu = 20$, $\sigma = 4$
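As a brief aside, the standard error calculations in the two sampling-distribution examples above (8/sqrt(16) and 8/sqrt(100)), together with the n > 30 shape rule, can be checked with a small Python sketch. The function names `standard_error` and `sampling_shape` are made up for illustration and are not part of StatCrunch or the course materials.

```python
import math


def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the sample mean."""
    return sigma / math.sqrt(n)


def sampling_shape(n, population_is_normal):
    """Shape rule as stated in the notes: normal population or n > 30."""
    if population_is_normal or n > 30:
        return "approximately normal"
    return "no conclusion"


# Population sd of 8 with sample sizes 16 and 100, as in the examples.
se_16 = standard_error(8, 16)
se_100 = standard_error(8, 100)

# A larger sample size gives a smaller standard error.
print(se_16, sampling_shape(16, population_is_normal=False))
print(se_100, sampling_shape(100, population_is_normal=False))
```

The mean of the sampling distribution is not computed here because it always equals the population mean, whatever the sample size.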
o What is the center and spread for the sample data? $\bar{x} = 16$, $s = 5.5$
o What shape is the sample data? Skewed left, same as the population.

Distributions (HW 7.1-7.2)
o What is the center and spread for the sampling distribution of the sample means with size 36? $\mu = 20$, $\sigma/\sqrt{n} = 4/\sqrt{36} = .66667$
o What shape is the sampling distribution of the sample means? Approximately normal, since n > 30 (CLT).
o Suppose now we adjust the sample size and the new resulting sampling distribution has a standard error of 1. Did we use a larger or smaller sample size? Smaller, since standard error increases with a smaller sample size.
o How will the mean of the new sampling distribution change? It stays the same; it does not depend on sample size.

StatCrunch (HW 7.1-7.2)
The average household temperature in Vancouver is 67.6 degrees and the sd is 4.2. A sample of 51 households is selected. What's the probability the average of this sample will be above 68.1? Fill in the boxes (Mean, Std. Dev., and Prob(X ≥ ...)).

StatCrunch (HW 7.1-7.2)
Sampling distribution with n = 51:
o Mean = 67.6 (population's)
o SD (standard error) = $\sigma/\sqrt{n} = 4.2/\sqrt{51} = .58812$
[Figure: StatCrunch normal calculator with Mean 67.6 and Std. Dev. .58812; the area above 68.1 is about .198]

StatCrunch (HW 7.1-7.2)
Same question. What's the probability the average of the sample will be within 1.5 degrees of the population mean?
o 67.6 + 1.5 = 69.1
o 67.6 - 1.5 = 66.1

StatCrunch (HW 7.1-7.2)
[Figure: StatCrunch normal calculator with Mean 67.6 and Std. Dev. .58812; each tail beyond 66.1 and 69.1 has area .00538]
Answer: 1 - 2(.00538) = .98924

Proportions Problem

DATA            MEAN        ST ERROR
POPULATION      $p$
SAMPLE DATA     $\hat{p}$
SAMPLING DIST   $p$         $\sqrt{p(1-p)/n}$

The sampling distribution of the sample proportion $\hat{p}$ is normal when $np \ge 15$ and $n(1-p) \ge 15$.

Computing Square Roots
It is very important to type in square roots on your calculator correctly, especially with the proportions standard errors.
Example: compute $\sqrt{\frac{.20(1 - .20)}{64}}$.
One safe way is to first compute .20(1 - .20)/64 = .0025, then square root the result: sqrt(.0025) = .05. Otherwise, use parentheses wisely:
o Correct: sqrt(.20(1 - .20)/64) = .05
o Correct: sqrt(.20(.80)/64) = .05
o Incorrect: sqrt(.20)(1 - .20)/64 = .00559
o Incorrect: sqrt(.20(1 - .20))/64 = .00625

Proportions (HW 7.1-7.3)
60% of students at an academy
in Vancouver are female. In a random sample of 55 students, 26 of them are female. Let 1 = female and 0 = male.
o Identify the population distribution of gender.

X      1     0
P(X)   .60   1 - .60 = .40

Proportions (HW 7.1-7.3)
o Identify the data distribution of gender.

X      1                0
P(X)   26/55 = .47273   1 - .47273 = .52727

o What is the mean & standard error of the sampling distribution of the sample proportion? Mean = .60; se = $\sqrt{.60(.40)/55} = .06606$
o Is the sampling distribution approximately normal? np = 55(.60) = 33 and n(1 - p) = 55(.40) = 22, so yes.

Chapter Four: Gathering Data

4.1 Should we experiment or observe?

There are two basic ways to gather data:
1. Observational Study
2. Experiment

Difference between experiments and observational studies:

Experiments attempt to manipulate or influence the subjects in an experiment. Properly designed experiments can be used to prove causation (that one variable CAUSES the other to change). In using experiments, subjects can be randomly assigned to groups.

Observational studies simply measure characteristics of the subjects without attempting to manipulate or influence the subjects. Observational studies cannot be used to prove causation; they can only say that the variables are related to one another. In using an observational study, subjects cannot be randomly assigned to groups.

The advantages of using an experiment are that you can prove causation and apply randomization, while an observational study can only prove association of two variables and not causation.

Example: Decide whether the following are experiments or observational studies.

1. Rats with cancer are divided into 2 groups. One group receives 5 mg a day of an experimental drug that is thought to fight cancer; the other group receives 10 mg a day of the same drug. After 2 years, the spread of cancer is measured in both groups.
Subjects:
Experiment or observational study?

2. A poll is conducted in which 500 people are asked whom they plan to vote for in the upcoming election.
Subjects:
Experiment or observational study?

4.2-4.3 Observational Study vs. Experiment

TYPES
OF OBSERVATIONAL STUDIES

Cross-Sectional Study: a study that attempts to take a cross section of the population at the current time.
Ex: A survey that asks, "Who is your current favorite singer?"

Retrospective Study: a study that is backward looking.
Ex: A study to see if there is an association between cell phone usage and brain cancer. We can form a sample of subjects with brain cancer and a sample of subjects without brain cancer and compare the past use of cell phones for both groups.

Prospective Study: a study that is forward looking.
Ex: A study that asks "How many hours a week do you watch television?" knowing that they will ask the same people this question in one year.

Case-control study: a retrospective study in which subjects who have a response outcome of interest (the cases) and subjects who have the other response outcome (the controls) are compared on an explanatory variable.
Ex: The brain cancer/cell phone usage example.

Sources of Potential Bias in Sample Surveys

1. Sampling Bias: occurs from using nonrandom samples or not using a large enough sampling frame (undercoverage).
Example: In an election poll, if we just took the first 30 people that walked up to us rather than randomly selecting people to poll, this might result in sampling bias. This would be called convenience sampling, which we will talk more about later. Or, if we just took samples at a few of the voting stations and did not sample from all of them, this would be sampling bias.

2. Nonresponse Bias: occurs when some sampled subjects cannot be reached, or refuse to participate, or fail to answer some questions.
Example: If we send out 500 surveys and only get 52 back, this could result in nonresponse bias. These 52 don't give us a large enough sample size and might not be representative of the population.

3. Response Bias: occurs when the subject gives an incorrect response, or the wording of the question is confusing/misleading, or the way the interviewer asks the questions is confusing/misleading.
Example: Asking the question on a
survey this way might result in response bias: "Would you rather vote A candidateb"

4.3-4.4 Experiments

An experiment is a controlled study in which one or more treatments are applied to experimental units. The experimenter then observes the effect of varying these treatments on a response variable.

A lot of new terms were used in the above definition. Let's now define these terms:

1. Experimental unit (or subject): the person, object, or some other well-defined item upon which a treatment is applied.
2. Treatment: a condition applied to the experimental unit (for example, a new drug is administered to patients).
3. Response variable: a quantitative or categorical variable that represents our variable of interest.

The goal in an experiment is to determine the effect the treatment has on the response variable.

Think back to our example with the rats. The experimental units are the rats with cancer. The explanatory variable is the amount of the drug, and there are two levels of treatment: 5 mg and 10 mg. The response variable is the spread of cancer for each rat.

Many designed experiments are double blind. An experiment is double blind if neither the experimental unit nor the experimenter knows what treatment is being given to the experimental unit.

For example, in a lot of medical experiments that are testing the effects of new drugs, researchers often administer to each patient either a dose of the new drug OR a placebo. A placebo is a dummy treatment (in this case a fake pill), so that the patient does not know if they are getting the real medication or not. If this is to be double blind, the researchers will not know which patients are getting the real medication, and the patients also will not know if they are getting the real medication. If the researcher does know but the patient does not, then it is a single blind experiment.

Experimental Designs

Completely randomized design: The experimental units are randomly assigned to the treatments.

Matched-Pairs Design: A matched-pairs design is
one in which the experimental units are somehow related or matched before the experiment takes place. For example: the same person before and after a treatment, twins, husband and wife, etc.

Example: One twin receives some treatment and the other twin receives some other treatment. Not only can we measure the overall groups that received different treatments, we can also look at the difference in the results for each matched pair of twins.

Often the measure of a response variable for an experimental unit is taken before a treatment is applied, and then a measure is taken from the same unit after the treatment. Here the individual is matched against itself.

CROSSOVER DESIGN

Think about an experiment where subjects could take one treatment (drug A) the first time they get a headache and the other treatment (drug B) the second time they get a headache. The response variable is whether the subject's pain is relieved. For each person we would have a matched pair of observations, because they both correspond to the same person.

Subject   Drug A      Drug B
1         Relief      No Relief
2         Relief      Relief
3         No Relief   No Relief

A matched-pairs design like this, in which subjects cross over during the experiment from using one treatment to using another treatment, is called a crossover design.

In matched-pairs experiments, a set of matched experimental units is referred to as a block. So in the headache study above, each person is a block.

To reduce possible bias, treatments are usually randomly assigned within a block. In the headache study, the order in which the treatments are taken would be randomized. Bias could occur if all subjects received one treatment before another.

A block design with random assignment of treatments to units within blocks is called a randomized block design.

Example: A pharmaceutical company has developed an experimental drug meant to cure a deadly disease. The company randomly selects 300 males aged 25-29 years old with the disease and randomly divides them into two groups.
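The random division into two groups described here can be sketched in a few lines of Python. This is only an illustration, not part of the course's StatCrunch workflow: the function name `randomly_assign`, the subject IDs 1 to 300, and the seed are assumptions made for the example.

```python
import random

def randomly_assign(subject_ids, seed=None):
    """Completely randomized design: shuffle the subjects, then split them in half.

    subject_ids and seed are illustrative inputs; the notes do not specify them.
    """
    rng = random.Random(seed)          # seeded only so the example is repeatable
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)              # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# 300 subjects numbered 1..300 (hypothetical labels)
group1, group2 = randomly_assign(range(1, 301), seed=1)
print(len(group1), len(group2))
```

Because the shuffle decides who lands in which group, neither the researcher nor the subjects choose the treatment, which is what makes the design completely randomized.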
Group 1 is given the experimental drug, while Group 2 is given a placebo. After one year of treatment, the white blood cell count for each male is recorded.

a) Identify the experimental units.
b) What is the response variable in this experiment?
c) What is the explanatory variable? What are the two levels of treatments?
d) Is the experiment design completely randomized?
e) Does this experiment use matched-pairs design?
f) If the researcher knows which patients are getting which drugs, is this a double or single blind experiment?

Example: Researchers at UGA wished to determine whether there was a connection between listening to classical music and reasoning skills. To test the research question, 36 college students listened to one of Mozart's sonatas for 10 minutes and then took a reasoning test using the Stanford-Binet intelligence scale. The same students were also administered the test after sitting in a room for 10 minutes in complete silence. The mean score on the test following the Mozart piece was 119, while the mean test score following the silence was 110. The researchers concluded that subjects performed better on the reasoning tests after listening to Mozart.

a) What is the response variable in the experiment?
b) What is the explanatory variable? Describe each level of treatment.
c) Does this experiment use matched-pairs design?
d) Does this experiment use crossover design?

Let's take a look at one more example to make sure we understand matched-pairs design. Let's say we administer two tests to students and want to compare grades on the two tests. Let's say we want to compare the mean for 20 Test A scores versus the mean for 20 Test B scores.

If we were to select 20 people and compare their Test A and Test B scores, then this would be an example of a matched-pairs design. We would be matching the Test A scores against the same students' Test B scores. The experimental units in each group are related because they are the same students in each group. We could not only compare the
Test A scores to the Test B scores as a group, but we could also compare them for each individual.

However, if we were to select 40 people and compare 20 students' Test A scores versus 20 different students' Test B scores, then this would NOT be matched-pairs design, because there is no relation between one student's Test A score and a different student's Test B score. We could only compare the Test A scores to the Test B scores as a group.

4.5 What are different ways to sample?

Sampling: obtaining subjects that are representative of the population to participate in a certain study, so that accurate information about the population can be obtained.

Simple Random Sampling: every subject has an equally likely chance of being selected for the sample. When performing simple random sampling, usually samples are chosen using a random number table.

Example of Simple Random Sampling: The following table lists the 43 presidents of the United States. Obtain a simple random sample of size five using the table in Chapter 4, which is a random number table. We will use row 1.

HW Chapter 4 Simple Random Sampling Problem

A randomized experiment investigates whether an herbal treatment is better than a placebo in treating subjects suffering from depression. Unknown to the researchers, the herbal treatment has no effect: subjects have the same score on a rating scale for depression (for which higher scores represent worse depression) no matter which treatment they take.

a) The study will use eight subjects, numbered 1 to 8. Using random numbers, pick the four subjects who will take the herbal treatment. Use the first row in the table.

Line/Col   1       2       3
1          10480   15011   01536
2          22368   46573   25595
3          24130   48360   22527
4          42167   93093   06243

Identify the four who will take the herbal treatment. List in numerical order.

Types of Sampling

1. Simple Random Sampling: every subject has an equally likely chance of being selected for the sample. Usually samples are
chosen using a random number table.
2. Stratified Sampling: the population is divided into nonoverlapping groups called strata, and a simple random sample is then obtained from each group.
3. Cluster Sampling: the population is divided into nonoverlapping groups, and all individuals within a randomly selected group or groups are sampled.
4. Convenience Sampling: sampling where the individuals are easily obtained. Internet surveys are convenience samples. Studies that use convenience sampling generally have results that are suspect.
5. Systematic Sampling: selecting every kth subject from the population.

The difference between stratified and cluster sampling is that stratified sampling samples some individuals from all groups, where cluster sampling samples all individuals from some groups.

Example: Identify the type of sampling used below.

In order to determine the average IQ of ninth-grade students, a school psychologist obtains a list of all high schools in the local public school system. She randomly selects five of these schools and administers an IQ test to all ninth-grade students at the selected schools.

A member of Congress wishes to determine her county's opinion regarding estate taxes. She divides her county into three income classes: low-income households, middle-income households, and upper-income households. She then takes a random sample of households from each income class.

A radio station asks its listeners to call in their opinion regarding the use of American forces in peacekeeping missions.

In an effort to identify whether an advertising campaign has been effective, a marketing firm conducts a nationwide poll by randomly selecting individuals from a list of known users of the product.

A lobby has a list of the 100 senators of the US. In order to determine the Senate's position regarding farm subsidies, they decide to talk with every seventh senator on the list, starting with the third.

Chapter 8: Statistical Inference: Confidence Intervals

8.1 What are Point and Interval
Estimates of Population Parameters?

When we first began our discussion of statistics, we mentioned that there were two branches of statistics: descriptive and inferential. The inferential branch uses sample information to draw conclusions about the population.

One of the most common uses of the inferential branch is to use sample statistics such as \(\bar{x}\) to estimate population parameters such as \(\mu\). It makes sense that if we take a large enough sample, \(\bar{x}\) should be pretty close to the actual value of \(\mu\). But the chances are pretty small that \(\bar{x}\) turns out to be exactly \(\mu\). This is why we call \(\bar{x}\) a point estimate of \(\mu\).

The key here is that sample statistics estimate population parameters. For example, \(\bar{x}\) is a point estimate of \(\mu\) and \(\hat{p}\) is a point estimate of \(p\).

So we know the value of \(\bar{x}\) is probably pretty close to \(\mu\), but we want to get even closer. So rather than just say \(\bar{x}\) is close to \(\mu\), we are going to build an interval around \(\bar{x}\) and then say that \(\mu\) probably lies somewhere on this interval. We call this an interval estimate. Here's an example.

Suppose you were asked to estimate the average age of all the students in our class. You might survey 10 students and find their average age to be 20. This sample mean of 20 would be a point estimate of \(\mu\). BUT you could also express your guess by giving a range of ages centered around your sample mean. So your guess could be 20, give or take 2 years. This "give or take 2 years" part is what we call the margin of error, which we will talk about more later. So mathematically your guess would be 20 ± 2, which would be the interval estimate.

Suppose you were then asked how confident you were that \(\mu\), the mean age of all students, was within your interval estimate of 18 to 22 years old. You might say, "I am 95% confident that the mean age of all students is within 18 to 22 years old."

In statistics, we construct intervals for the population mean that are centered around an estimate. This estimate is \(\bar{x}\), the sample mean. Since we can't get the full population mean, we go for the next best
thing. We take a sample and calculate a sample mean. And what we add and subtract from the sample mean to get the interval estimate is the margin of error. When we construct these interval estimates, we call them confidence intervals.

So a confidence interval of a parameter consists of an interval of numbers, and this interval is our point estimate ± our margin of error. Just as in our example above, where our interval was 20 ± 2, or in other words 18 to 22: 20 is our point estimate and 2 is our margin of error.

We call the value we obtain when we take the point estimate minus the margin of error (in our example 20 - 2, or 18) the lower limit or lower bound. And we call the value we obtain when we take the point estimate plus the margin of error (in our example 20 + 2, or 22) the upper limit or upper bound.

You will also see the notation of the lower and upper limit in parentheses for confidence intervals. In our above example, the confidence interval may be written as (18, 22).

It is also important for us to note the level of confidence of a confidence interval. In our example before, our level of confidence would have been that we were 95% confident that the mean age of all students in the class was somewhere on our interval. So the level of confidence describes how often intervals built this way contain the population parameter, in this case \(\mu\): about 95% of the intervals from repeated random samples would capture it. We will see in examples that as we increase our level of confidence, we will get wider and wider intervals.

We will be constructing two different types of confidence intervals:
1. In Section 8.2 we will be calculating the confidence interval for the population proportion \(p\).
2. In Section 8.3 we will be calculating the confidence interval for the population mean \(\mu\), like our classroom age example.

Before we get to these sections, let's make sure we understand the terms in the example on the next page. In this example the confidence interval will already be constructed for us. In Sections 8.2 and 8.3 we will actually learn how to construct these confidence intervals.
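As a small illustration of "point estimate ± margin of error", here is a Python sketch that builds the limits from the classroom-age numbers (20 give or take 2) and also goes the other way, recovering the point estimate and margin of error from a finished interval. The function names are made up for this example and are not from the notes or from StatCrunch.

```python
def interval_from_estimate(point_estimate, margin_of_error):
    """Return (lower limit, upper limit) = point estimate -/+ margin of error."""
    lower = point_estimate - margin_of_error
    upper = point_estimate + margin_of_error
    return lower, upper

def estimate_from_interval(lower, upper):
    """Recover the point estimate (the midpoint) and the margin of error (half the width)."""
    point_estimate = (lower + upper) / 2
    margin_of_error = (upper - lower) / 2
    return point_estimate, margin_of_error

# Classroom-age example: 20 give or take 2 years
print(interval_from_estimate(20, 2))   # (18, 22)
print(estimate_from_interval(18, 22))  # (20.0, 2.0)
```

The second function relies on the interval being centered on the point estimate, which holds for the intervals built in this chapter.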
Example: Suppose a farmer is trying to estimate the average number of peaches per tree in his orchard. He does not want to count every peach on every tree, so he takes a random sample of a few trees and calculates a 95% confidence interval based on the sample. That 95% confidence interval for the mean yield of a new variety of peaches in an orchard is 112 to 148 peaches per tree. This means that we are 95% confident that the population mean \(\mu\) for the number of peaches per tree is somewhere between 112 and 148 peaches per tree.

What is the lower limit?
What is the upper limit?
What is the level of confidence?
What is the width of the confidence interval?
What is the sample mean? Remember, the sample mean is always the middle of the confidence interval. The sample mean \(\bar{x}\) will always be on the confidence interval, but the population mean \(\mu\) may or may not be on the confidence interval.
What is the margin of error?

8.2 How Can We Construct a Confidence Interval to Estimate a Population Proportion?

Recall from Section 8.1 that confidence intervals can be written in the general format:

point estimate ± margin of error

The point estimate and margin of error change depending on what parameter is being estimated. For example, we looked at an example of a confidence interval for \(\mu\), so our point estimate was \(\bar{x}\).

Now we will consider the format of the confidence interval for the population proportion \(p\). The point estimate for this type of confidence interval is the sample proportion \(\hat{p} = x/n\), where \(x\) is the number of individuals in the sample with the desired characteristic and \(n\) is the sample size.

So we know what goes before the ± (the point estimate), and we can calculate that easily. Now we need to know how to calculate what goes after the ± (the margin of error).

The margin of error will always be a multiple of the standard error. In Section 8.2 we discuss confidence intervals for population proportions, so the standard error will be

\[ \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Why is it now \(\hat{p}\) in the formula and not \(p\)?

So the
margin of error will always be some number times the standard error we see above. The number we multiply the standard error by to get the margin of error TOTALLY depends on the level of confidence.

The general formula for a confidence interval for the population proportion is

\[ \hat{p} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

You can see that the margin of error is this Z value times the standard error. Later on in this chapter we will see how to get this Z value, because this Z value TOTALLY depends on our level of confidence: how confident we want to be that the population proportion is on our interval. For now we will just focus on 95% confidence intervals, where this Z value equals 1.96.

For a 95% confidence interval, the margin of error is 1.96 × standard error, in other words \(1.96\sqrt{\hat{p}(1-\hat{p})/n}\). So the formula for a complete 95% confidence interval (point estimate ± the margin of error) equals

\[ \hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

The lower limit is \(\hat{p} - 1.96\sqrt{\hat{p}(1-\hat{p})/n}\).
The upper limit is \(\hat{p} + 1.96\sqrt{\hat{p}(1-\hat{p})/n}\).

Think way back to the Empirical Rule and we can see why this 1.96 makes sense. Using the Empirical Rule, we said approximately 95% of the data values are within two standard deviations (or standard errors) of the parameter. Now we are starting with \(\hat{p}\), and we want to add and subtract something from that value and get an interval, and we want to be 95% confident that the interval contains \(p\), the true population proportion. So using the same Empirical Rule logic, if we start with \(\hat{p}\) and add and subtract close to 2 standard errors (1.96 to be exact), it makes sense that we are going to be 95% confident that the \(p\) value will be within that interval.

Example 3 from Section 8.2: We asked n = 1154 Americans, "Would you be willing to pay $6 per gallon of gas?" In our random sample, 518 said they would be willing to pay $6 per gallon of gas.

a) Find a 95% confidence interval for the population proportion of Americans willing to pay $6 per gallon of gas.

First we need our sample proportion:
\(\hat{p} = x/n\) = (who said yes)/(total number in the sample) = 518/1154 = 0.45

Next we need the standard error:
Standard error = \(\sqrt{\hat{p}(1-\hat{p})/n} = \sqrt{(0.45)(0.55)/1154}\) = 0.015

The 95% confidence interval formula: \(\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}\)

So the interval is 0.45 ± 1.96(0.015) = 0.45 ± 0.03.
So the lower limit is 0.45 - 0.03 = 0.42.
And the upper limit is 0.45 + 0.03 = 0.48.
So our 95% confidence interval is (0.42, 0.48).

b) Interpret the interval.
We are 95% confident that the proportion of ALL Americans that are willing to pay $6 per gallon of gas is between 0.42 and 0.48, in other words between 42% and 48%.

c) EXTRA QUESTION: Does it appear likely that 50% of ALL Americans are willing to pay $6 per gallon of gas?
No. 50%, or 0.5, is not on our interval, so it does not appear likely that 50% of all Americans are willing to pay $6 per gallon of gas. Our interval went between 42% and 48%, so it appears that less than 50% of ALL Americans are willing to pay $6 per gallon of gas.

Sample Size Needed for Validity of Confidence Interval for a Proportion

For these confidence intervals to be valid, we need to check some requirements, as we did back when we were determining the sampling distribution for the sample proportion. The following two things must be true:

\(n\hat{p} \ge 15\) AND \(n(1-\hat{p}) \ge 15\)

Also, we must make sure we take a random sample. Check that this is true in the above gas example.

Example: The drug Lipitor is meant to lower cholesterol levels. In a clinical trial of 884 randomly selected patients who received 10 mg doses of Lipitor daily, 221 reported a headache as a side effect.

a) Obtain a point estimate for the population proportion of users who will experience a headache.
b) Verify that the requirements for constructing a confidence interval about \(p\) are satisfied.
c) Construct a 95% CI for the population proportion of users who will have a headache.
d) Interpret this interval.

How Can We Use a Confidence Level Other Than 95%?

So far we have just been creating 95% confidence intervals, so our margin of error has been 1.96 × standard error. But where does this 1.96 come from? And what if we want something different than a 95% confidence interval? We can never have a 100% confidence interval, because we can never be 100% sure that
the population proportion is on the interval if we don't know it. But what if we want a 90% confidence interval or a 99% confidence interval?

Here is how we get the 1.96 for a 95% confidence interval. First, when you think of a 95% confidence interval, think of a normal curve with 95% shaded in the middle.

[Figure: normal curve with the middle 95% shaded]

If 95%, or 0.95, is in the middle, then what is the area in each of the tails?
1 - 0.95 = 0.05 and 0.05/2 = 0.025, so 0.025 is in each of the tails.

Now put 0.025 as the little tail area to the right in your StatCrunch calculator with mean 0 and standard deviation 1, hit Compute, and you get 1.96.

This 1.96 is the Z-score that matches up with a 95% confidence interval. This is why it was important for us to find those probabilities involving Z-scores before. This Z-score value is what we now take and multiply the standard error by to get the margin of error for a confidence interval, and it will always be a positive Z-score.

We did a lot of work to get there; let's work through it again and see what the Z-score would be for a 90% confidence interval.

First draw the curve with 90% in the middle and find the area of both tails. Then put the area of the tail in the StatCrunch Normal Calculator with mean 0 and standard deviation 1.

The z-score you get = 1.645

So to get the margin of error for a 90% confidence interval, you multiply the standard error by 1.645. Let's use this in the following example.

Example: A study of 70 randomly selected people in Atlanta was conducted to estimate the proportion of Atlantans that owned dogs. The study revealed that 42 of the 70 people were dog owners.

a) Obtain a point estimate for the population proportion of dog owners in Atlanta.
b) Verify that the requirements for constructing a confidence interval about \(p\) are satisfied.
c) Construct a 90% confidence interval for the proportion of Atlantans that are dog owners.
d) Interpret the confidence interval.

(Page 30 of 53)

Now, using the same example as above, construct a 99% confidence interval. Let's see how the interval changes if we increase the
confidence level. We have the point estimate and the standard error, so we just need the new Z-score for this confidence interval.

First draw the curve with 99% in the middle and find the area of both tails. Then put the area of the tail in the StatCrunch Normal Calculator with mean 0 and standard deviation 1.

The Z-score =

Now create the 99% confidence interval.

Notice that the 99% confidence interval is wider than the 90% confidence interval.

(Page 31 of 53)

In this example we saw that:
As the level of confidence increases, the margin of error increases and the confidence interval gets wider.
ALSO, as the level of confidence decreases, the margin of error decreases and the confidence interval gets narrower.

This applies to all confidence intervals, like in the picture below.

[Figure: 99% Confidence Interval vs. 95% Confidence Interval]

Why is this true? With a 95% confidence interval, we want to be 95% confident that the population parameter is on the interval. But with a 99% confidence interval, we want to be even more confident, 99% confident, that the population parameter is on the interval. So to be that much more sure the proportion is on the interval, we need a wider interval.

We have seen what happens when we change the confidence level; what about if we change the sample size?

As the sample size increases, the margin of error decreases and the confidence interval gets narrower.
As the sample size decreases, the margin of error increases and the confidence interval gets wider.

So the opposite happens when we increase the sample size: the confidence interval gets narrower.

Why is this true? As we increase our sample size, the sample statistic we obtain (whether we are looking for a mean or a proportion) is a better representation of the population. So as we increase our sample size, our point estimate is a better and better estimate, and we don't need such a wide confidence interval.

RECAP: The following symbols go along with the following terms when calculating the confidence interval for the population proportion:
Term                  Symbol
Point Estimate        \(\hat{p}\)
Margin of Error       \(z\sqrt{\hat{p}(1-\hat{p})/n}\)
Standard Error        \(\sqrt{\hat{p}(1-\hat{p})/n}\)
Confidence Interval   \(\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}\)

HOW CAN STATCRUNCH CALCULATE THESE CONFIDENCE INTERVALS FOR US?

Look back at our example where we wanted to get a 90% confidence interval for the population proportion of ALL Atlantans that own dogs, on page 30 of our notes. We got the 90% confidence interval, which has a lower limit of 0.50369 and an upper limit of 0.69631. On page 31 we got the 99% confidence interval, which has a lower limit of 0.44917 and an upper limit of 0.75083.

Guess what? STATCRUNCH can get these values for us.

Go to Stat > Proportions > One Sample > With Summary.

Here we can type in how many Atlantans owned dogs in our sample. In our sample, 42 of 70 Atlantans owned dogs. Put those numbers in just like this and hit Next.

[StatCrunch dialog: Number of successes: 42; Number of observations: 70]

On the next screen choose Confidence Interval, and we want a 90% confidence interval, so change the 0.95 to 0.90.

[StatCrunch dialog: Confidence Interval selected; Level: 0.90; Method: Standard-Wald]

Hit Calculate and here is what we get:

[StatCrunch output: 90% confidence interval results; p: proportion of successes for population; Method: Standard-Wald]

It tells us the Sample Proportion, which is the point estimate, which is 0.6, the same thing we got in part a. It also tells us the lower limit, 0.50369, and the upper limit, 0.69631, the same values we calculated. Notice it also gives us the standard error. The only values it does not give us are the margin of error and the Z-score used in the formula, so we still would need to know how to get those by hand.

Now get the 99% confidence interval and check it against our answers of (0.44917, 0.75083).

Section 8.3 How Can We Construct a Confidence Interval to Estimate a Population Mean?

Recall from Section 8.1 that confidence intervals
can be written in the general format:

point estimate ± margin of error

Remember, the point estimate is a single number that is our best guess for the parameter. What single number is the best guess for a population mean if we only have a sample from the population? The sample mean!

So the sample mean is the point estimate part of the confidence interval formula. It is the center of the confidence interval, so now we need to know the margin of error. We need to know what to add and subtract from the point estimate to get the lower and upper limits of our confidence interval.

Just like in Section 8.2, the margin of error will be some number times the standard error. But the formula for the standard error when we are talking about means is:

Standard error = \(s/\sqrt{n}\), where \(s\) = standard deviation from our sample

We saw the formula for the standard error back in Chapter 7 was \(\sigma/\sqrt{n}\), but we don't know anything about the population, so we don't know \(\sigma\), so we have to use the standard deviation from our sample, \(s\).

So we are this far into our formula for the confidence interval for the population mean:

\[ \bar{x} \pm (\text{some number}) \cdot \frac{s}{\sqrt{n}} \]

All we have left to find is the "some number." We saw in confidence intervals for the population proportion that this "some number" ended up being a Z-score that corresponded with the level of confidence. For confidence intervals for the population mean, the "some number" still corresponds with the level of confidence, but it is from a new distribution that we call the T-distribution.

So if you look on Stat > Calculators, you will see a calculator just called T. Before we see how to get these T values, let's talk about the properties of this T-distribution and how the T-distribution (or T-curve) is different from the normal distribution (or normal curve).

Properties of the T-Distribution

1. The T-distribution is centered at 0 and is symmetric about 0, like the standard normal distribution.
2. The total area under the curve is 1. The area to the right of 0 is 0.5 and the area to the left of 0 is 0.5, like
the standard normal distribution.
3. The T-distribution is different for different values of \(n\), our sample size.
4. The area in the tails of the T-distribution is a little greater than the area in the tails of the normal distribution.
5. As the sample size \(n\) increases, the T-curve looks more and more like the normal curve.

Since the T-distribution looks different for different values of \(n\), we always have to type in what we call the degrees of freedom on the T calculator. The degrees of freedom we have to put in the T calculator = n - 1.

The degrees of freedom on the T calculator is abbreviated as DF. So DF in StatCrunch = n - 1, and we always have to put that into the T Calculator.

Try some different DF values in StatCrunch and see how the T-distribution changes for different sample sizes. Again, it is Stat > Calculators > T. Try DF = 5. Then try DF = 500; this one looks more like our normal curve.

So our confidence interval formula for the population mean is:

Lower limit = \(\bar{x} - T \cdot s/\sqrt{n}\)
Upper limit = \(\bar{x} + T \cdot s/\sqrt{n}\)

These intervals are valid when we 1) use a random sample AND 2) either use a sample size > 30 OR are sampling from a normal population.

So we can get the sample mean, sample standard deviation, and \(n\) value, but we haven't yet talked about what the T value is that we want from the T Calculator.

To get the T value is just the same as getting the Z value when we were doing confidence intervals for the population proportion in Section 8.2. The only difference is that the T value depends on BOTH the confidence level and the sample size.

Let's find the T value for a 95% confidence interval if the sample size we used is n = 32.

First draw a curve with 95% in the middle and find the area of both tails.

Next put in the right tail area, 0.025, in the T Calculator AND put DF = 32 - 1 = 31.

[StatCrunch T calculator: DF = 31, right-tail area 0.025]

Hit Compute and you get T = 2.0395.

Let's do a few more. These are the same thing they are asking you to get on
Homework 8384.

a) Find the t-score for a 99% confidence interval for a population mean with 5 observations in our sample.

First draw a curve with 99% in the middle and find the area of both tails.

Next put in the right tail area, 0.005, in the T Calculator AND put DF = 5 - 1 = 4.

Hit Compute and you get T =

So now we can construct confidence intervals for the population means. We can get all the symbols in these formulas:

Lower limit = \(\bar{x} - T \cdot s/\sqrt{n}\)
Upper limit = \(\bar{x} + T \cdot s/\sqrt{n}\)

Finally, let's do some examples.

Example 7 in Section 8.3: iPods are sold all the time on eBay. We have the prices for a random sample of seven iPods that recently sold on eBay:

235 225 225 240 250 250 210

We will assume these prices are normally distributed. We want to find the 95% confidence interval for the population mean price of iPods sold. In other words, we want to construct an interval of numbers and be 95% confident that if we averaged the price of ALL iPods sold on eBay, the average price would be on our interval.

We need to find:

Lower limit = \(\bar{x} - T \cdot s/\sqrt{n}\)
Upper limit = \(\bar{x} + T \cdot s/\sqrt{n}\)

Let's break it down:

n = 7 because we are using a sample of 7 iPods sold on eBay.

How do we get \(\bar{x}\) and \(s\)? Easy: we can list our seven prices in StatCrunch, go to Stat > Summary Stats > Columns, choose our column with the data, and get:

[StatCrunch output: Column Statistics, summary statistics]

So \(\bar{x}\) = 233.57143 and s = 14.6385.

Now finally, we need to get the T score.

First draw a curve with 95% in the middle and find the area of both tails.

Next put in the right tail area, 0.025, in the T Calculator AND put DF = 7 - 1 = 6.

[StatCrunch T calculator: DF = 6, right-tail area 0.025]

Hit Compute and you get T = 2.44691.

Now we have everything we need; we can now construct the lower and upper limits of the 95% confidence interval:

Lower limit = \(\bar{x} - T \cdot s/\sqrt{n}\) = 233.57 - 2.447(14.64/\(\sqrt{7}\)) = 220.03
Upper limit = \(\bar{x} + T \cdot s/\sqrt{n}\) = 233.57 + 2.447(14.64/\(\sqrt{7}\)) = 247.11

So we are 95% confident that the mean price of ALL iPods sold on eBay is somewhere between 220.03 and 247.11.

EXTRA QUESTION: According to our confidence interval, is it likely that the population mean
price of ALL iPods sold on eBay = 250?
No. 250 is not on our confidence interval, so therefore it is not a likely mean price for ALL iPods sold on eBay. We think that mean price should be somewhere between 220.03 and 247.11.

EXTRA QUESTION 2: According to our confidence interval, is it possible that the population mean price of ALL iPods sold on eBay = 225?
Yes, 225 is a possible mean price because it is on our interval.

Using STATCRUNCH to construct confidence intervals

Whenever we have actual data, like in the above eBay example, we can put this data into StatCrunch and StatCrunch will actually calculate these intervals for us.

First put the seven eBay prices in a column on StatCrunch.

Go to Stat > T Statistics > One Sample > with data.

Choose the column you have put the data in and hit Next.

Choose Confidence Interval and type in 0.95.

Hit Calculate and here are our results:

[StatCrunch output: One sample T statistics, 95% confidence interval results]

These are the same amounts we got before:
Lower limit of the confidence interval = 220.03
Upper limit of the confidence interval = 247.11

Let's do an example like this where we have to calculate the limits using the summary statistics and not the actual data.

Example: Suppose a sample of 16 test scores is taken from a normal population. If the sample mean \(\bar{x}\) = 782 and the sample standard deviation s = 255, then construct a 90% confidence interval for the population mean of all test scores.

Let's do this by hand first.

Now let's use StatCrunch to create this confidence interval for us.

Go to Stat > T Statistics > One Sample > With summary.

Put in our sample mean, sample standard deviation, and sample size just like this:

[StatCrunch dialog: Sample mean: 782; Sample std. dev.: 255; Sample size: 16]

Hit Next and choose a 90% confidence interval.

Hit Calculate and here are the results:

[StatCrunch output: One sample T statistics, 90% confidence interval results; μ: population mean]

These are the same values we calculated by hand.

The following symbols go
along with the following terms for construction of a confidence interval for a population mean:

Term: Symbol
Point Estimate: \(\bar{x}\)
Margin of Error: \(t \cdot \frac{s}{\sqrt{n}}\)
Standard Error: \(\frac{s}{\sqrt{n}}\)
Confidence Interval: \(\bar{x} \pm t \cdot \frac{s}{\sqrt{n}}\)

Section 8.4: How Do We Choose the Sample Size for a Study?

Sometimes, before setting up an experiment or survey, we know that we want the margin of error to be a certain amount. For example, if we are trying to get results for an election, we will be getting a sample proportion for the proportion of people who will vote for candidate A. Maybe we know that whatever sample proportion we get, we want it to be within 3% of the true population proportion for ALL voters. So we know we want the margin of error = 3%. We can use a formula to tell us what sample size we need to take so that our margin of error will be 3%, or in other words, so we can be sure, with a certain level of confidence, that whatever sample proportion we get will be within 3% of the true population proportion.

Here is the formula for choosing the sample size when estimating a population proportion:

\(n = \frac{\hat{p}(1-\hat{p})\,z^2}{m^2}\)

where \(\hat{p}\) is a guess at the value we think we might get for the sample proportion. If no guess is given, then use \(\hat{p} = 0.5\). The \(m\) is the margin of error. The z-score is again based on the level of confidence, just like with the confidence intervals. It represents how confident we want to be that the sample proportion we get will be that close to the true population proportion.

Example 9 in Section 8.4: An election is expected to be close, and we are going to take a sample of people to obtain a sample proportion of the people who voted for candidate A. How large should the sample size be for the margin of error of a 95% confidence interval to equal 0.02?

What we are saying here is that we want to take a sample and get a sample proportion. We then want to create a 95% confidence interval around that sample proportion, and we want the margin of error for that confidence interval to equal 0.02.

Sample size formula: \(n = \frac{\hat{p}(1-\hat{p})\,z^2}{m^2}\), with \(\hat{p} = 0.5\)
because we aren't given a guess for \(\hat{p}\) to use, \(m = 0.02\), and \(z\) = the z-score based on the 95% level of confidence. First draw the curve with 95% in the middle and find the area of both tails. Then put the area of the right tail in the StatCrunch Normal Calculator with mean 0 and standard deviation 1. The z-score you get = 1.96, just like before.

\(n = \frac{0.5(1-0.5)(1.96)^2}{(0.02)^2} = 2401\)

We would need to take a sample of 2401 people to get a sample proportion that we can be 95% confident will be within 0.02 of the true population proportion.

What if we are not dealing with a population proportion example but a population mean example? That is, we want to know what sample size we need so that the sample mean we get is close enough to the true population mean. For example, maybe we want to estimate the income for an entire company. We want to take a sample of their employees and get a sample mean of their income, and we want this sample mean income to be within $5,000 of the entire company's mean income with 95% confidence. We can determine what sample size is needed so that whatever sample mean income we get, it will be within $5,000 of the population mean income, and we can be 95% confident of that.

Here is the formula we use to determine the sample size for estimating the population mean:

\(n = \frac{\sigma^2 z^2}{m^2}\)

where \(\sigma\) is the provided standard deviation, \(m\) is the margin of error, and \(z\) is obtained just like before.

Chapter Five: Probability

5.1 What is probability?

Outcome: a particular result.

Probability: a measure of the likelihood of an outcome. Probability is the proportion of times an outcome occurs in a large number of trials.

If I toss a coin ten times, I could get 5 heads and 5 tails, or I could get 7 heads and 3 tails, or I could get all 10 heads and 0 tails. All these things could happen in a small number of trials. BUT if I toss a coin 10,000 times, what we will see is that we will get very close to 5,000 heads and 5,000 tails. This is called the Law of Large Numbers, and you will run through an example like this in your lab.

Law of Large
Numbers: If the number of times an experiment is repeated is increased, the ratio of the number of successful occurrences to the number of trials will tend to approach the probability of the event.

This law of large numbers explains why casinos make money. For a short time a gambler may be lucky and make money, but in the long run the casino knows that it will come out ahead.

Two Types of Probabilities

1. Classical Probability or Relative Frequency Probability, where

\(P(A) = \frac{\text{Number of ways an event can occur}}{\text{Number of possible outcomes}}\)

Example: What is the probability of rolling an even number?

2. Subjective Probability: probability based on subjective judgment.

Example: What do you think the probability is that you will get an A on the next test? This probability would be totally subjective, based on what you think.

In this class we will only deal with classical probabilities.

5.2 How can we find probabilities?

To find probabilities, the first step is to list all the possible outcomes.

Sample Space: the set of all possible outcomes of an experiment, denoted by S.

What is the sample space for one coin toss? What is the sample space for one die toss?

Event: any subset of a sample space. Events are denoted using letters.

If you are rolling a die, what is the event of rolling an even number?

We define the probability of an event A, denoted \(P(A)\), as the likelihood of the event occurring. And the idea that with any roll of a die we could roll a 1, 2, 3, 4, 5, or 6 is called randomness.

Properties of Probabilities

1. The probability of any event must be between 0 and 1 inclusive. That is, \(0 \le P(A) \le 1\).
2. The total of all the individual probabilities equals 1.

Check that these rules hold for the six possible outcomes from rolling a die.

Example: A university held a blood pressure screening clinic for its professors. The results are summarized in the table below by age group and blood pressure level.

[Table: blood pressure level by age group]

(a) ... age? (b) What is the probability that a randomly selected professor has high blood pressure?
(c) Which is more likely?

Complement of an event A: all outcomes in the sample space that are not in event A. We use \(A^c\) to denote the complement of A. Sometimes you will also see the complement written with a different notation.

\(P(A^c) = 1 - P(A)\)

Example: In rolling a die, let's say event A is rolling an even number. What is the complement of A, and what is the probability of A?

Example: According to the National Gambling Impact Study Commission, 52% of Americans have played state lotteries. What is the probability that a randomly selected American has not played a state lottery?

AND and OR Probabilities with Contingency Tables

We calculate AND and OR probabilities by simply looking at the counts in a contingency table.

Example: If a person is selected at random, what is the probability that the person is male AND left-handed, or in other words, a left-handed male?

Example: If a person is selected at random, what is the probability that the person is male OR left-handed?

AND and OR probabilities

Compound events are formed by combining two or more events. We will study the following two compound events:

1. The probability of A AND B, which consists of the outcomes that are in both A and B.
2. The probability of A OR B, which consists of the outcomes that are in A or B or both. In probability, "A or B" denotes that A occurs or B occurs or both occur.

Example: What is the probability of rolling a 4 OR rolling a 6?

5.3 Conditional Probability: What's the probability of A, given B?

Conditional Probability: the probability of event A given that B has already occurred.

Example: If you roll a die and you know you have rolled an even number, what is the probability you have rolled a 6? This is a conditional probability. What we are looking for is the probability of rolling a 6 given that you know you have rolled an even number. Let's find this probability using common sense.

Think back to when we were doing conditional proportions for contingency tables in Section 3.1. We were finding
conditional probabilities. We looked at the following data. We were asked this question and answered it: What proportion of the males is left-handed? We could reword this as: Given that we have selected a male, what is the probability that he is left-handed?

Let's do one more: Given that we have selected a right-handed person, what is the probability that the person selected is female?

Chapter Six: Probability Distributions

6.1 How can we summarize possible outcomes and their probabilities?

Random Variable: a unique value for each outcome of a random phenomenon.

Ex: Let Y = qualifying speed for the Daytona 500. Y is a random variable and can equal any speed in mph.

There are two types of random variables:

Discrete Random Variables: a countable number of possible values. Ex: The number of heads in three flips of a coin.

Continuous Random Variables: an uncountable, infinite number of possible values. Just as before, these values are on an interval. Ex: Qualifying speed in mph for the Daytona 500.

Probability Distribution of a discrete random variable: a table, graph, or mathematical equation that provides the possible values of the random variable and their corresponding probabilities.

Example: Roll a fair die. Let X = number of dots showing on the die.

Independent Events

A and B are said to be independent events if knowing that A happens doesn't change the probability that B happens. A and B are dependent events if knowing that A happens does change or affect the probability that B happens.

For the following pairs of events A and B, indicate whether they are independent or dependent:

1. A = you study for Test 2; B = you earn an A on Test 2
2. A = your car has a flat tire one morning; B = you are late for work that morning
3. A = you earn more than $50,000 a year; B = you are born in the month of July

Properties of a Discrete Probability Distribution

1. For each x, the probability \(P(x)\) falls between 0 and 1.
2. The sum of the probabilities for all the possible x values = 1.

So in the previous die example, \(P(1) = 1/6\), \(P(2) = 1/6\), and so on up to \(P(6) = 1/6\).
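The two properties above can be checked mechanically. The following is a minimal sketch, not part of the original notes: the function name `is_valid_distribution` is hypothetical, and exact fractions are used (an assumption made here) so that the sum is compared to 1 without rounding error.

```python
from fractions import Fraction


def is_valid_distribution(probs):
    """Check the two properties of a discrete probability distribution.

    probs maps each possible value x to its probability P(x).
    (Hypothetical helper, written for illustration only.)
    """
    values = list(probs.values())
    # Property 1: every P(x) falls between 0 and 1 inclusive.
    each_in_range = all(0 <= p <= 1 for p in values)
    # Property 2: the probabilities of all possible x values sum to 1.
    sums_to_one = sum(values) == 1
    return each_in_range and sums_to_one


# Fair die: P(1) = P(2) = ... = P(6) = 1/6
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(is_valid_distribution(die))
```

With decimal probabilities read from a table (such as the Red Sox example), the exact comparison to 1 would likely need to be replaced by a tolerance check such as `math.isclose`.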
Example: Is the example on the previous page a probability distribution?

Example from Section 6.1: Let X = the number of home runs the Red Sox hit in a game in 2004. The table below displays the probability distribution for X. Is this a valid probability distribution? What is the probability of at least 3 home runs in a game?

Mean of a Probability Distribution for a Discrete Random Variable

The mean of a probability distribution for a discrete random variable is given by the formula

\(\text{MEAN} = \mu = \sum x \cdot P(x)\)

where \(x\) is the value of the random variable, or outcome, and \(P(x)\) is the probability of observing that particular value or outcome.

What we are really doing is calculating an average, or mean, outcome using all the possible outcomes and their corresponding probabilities. We call it a weighted average.

Example: Think about the example of rolling a die. Let's say we want to calculate the average roll of a die. This would be the same as computing the mean of the discrete random variable X below.

There is actually a new term we will use for this mean of a discrete random variable, and it is on the next page.

Example: There are three envelopes in a game; one contains $45 and the other two contain nothing. For the player, how many different outcomes to the game are there? For each outcome, how much would the player gain? What is the probability for each outcome? WHAT IS THE EXPECTED VALUE FOR THIS GAME?

Expected Value

The mean of a random variable is the long-run average outcome of the experiment. In this sense, as the number of trials of the experiment increases, the average result of the experiment gets closer to the mean of the random variable. For this reason, the mean of a random variable is often called the expected value.

Example: Given the probability distribution we saw before for the Red Sox, find the mean (expected value) of this probability distribution and interpret it.

Example: Suppose someone gives me a lottery ticket for my birthday. It says on the ticket that the probability that this is the
winning lottery ticket is 0000001, but if it is the winning ticket I get 360000. What is the expected value of this lottery ticket?

Example: Suppose a life insurance company sells a 27-year-old woman a $5,000 one-year life insurance policy for $300. If the probability that she dies in the coming year is 0056, what is the expected gain for the insurance company?

A continuous random variable has possible values that form an interval. Each interval has a probability between 0 and 1, and the probability is the area under the curve above that interval. The interval containing all possible values has probability equal to 1, so the total area under the curve is equal to 1.

Example: Let's say that the curve that represents the heights for all UGA students is bell-shaped and looks like the following:

[Figure: bell-shaped curve of heights in inches]

We can find the proportion of students that are over 70 inches in height. Shade that interval in the graph below.

[Figure: bell-shaped curve of heights in inches]

We can also find the probability that, if we randomly selected a student, they would be between 50 and 60 inches tall. Shade that interval in the graph below.

[Figure: bell-shaped curve of heights in inches]

In the next section we will see how we can actually use StatCrunch to calculate these probabilities for us.

Probability Distributions for Continuous Random Variables

So far we have been finding the probability of discrete variables, where we have a countable number of outcomes, like the probability that the Red Sox hit 3 home runs in a game. We have been using probability distributions. We have looked at data like the following:

[Figure: Probability distribution for the Red Sox example; relative frequency by number of home runs in a game]

When we have data that is continuous, which can include numerous decimal places, the graph of our data transforms from a histogram to a curve like the following. Think of an example like the heights for all UGA students.

Probabilities for continuous random variables are determined by finding areas under the curve.

PROBABILITY = AREA UNDER THE CURVE

6.2 How can we
find probabilities for bell-shaped distributions?

The most well-known and most often used probability distribution is the normal distribution. This distribution is symmetrical and bell-shaped.

Recall the Empirical Rule:

1. \(\mu \pm 1\sigma\) (−1 to 1) ≈ 68% of the area. In other words, approximately 68% of the area under the normal curve is within one standard deviation of the mean.
2. \(\mu \pm 2\sigma\) (−2 to 2) ≈ 95% of the area. In other words, approximately 95% of the area under the normal curve is within two standard deviations of the mean.
3. \(\mu \pm 3\sigma\) (−3 to 3) ≈ 99.7% of the area. In other words, approximately 99.7% of the area under the normal curve is within three standard deviations of the mean.

[Figure: The Normal Distribution]

Sketch the graph of a normal distribution with \(\mu = 50\) and \(\sigma = 10\). Label one, two, and three standard deviations away.

For a normal distribution with mean 50 and standard deviation 10, we can use the Empirical Rule to state that the probability between 40 and 60 is approximately 68%, because these values, 40 and 60, are within one standard deviation of the mean.

By the Empirical Rule, what values will have approximately the middle 95% between them?

We will be able to use the StatCrunch calculator to find probabilities for many other values. For example, in this distribution with \(\mu = 50\) and \(\sigma = 10\), what if we wanted to find the probability below 65? We can do this by going to Stat → Calculators → Normal in StatCrunch.

Let's do a few more problems like this.

Example 8: SAT Math scores are normally distributed with a mean \(\mu = 500\) and a standard deviation \(\sigma = 100\). If your SAT Math score is a 600, what percentage of SAT scores were higher than yours? Hint: Draw the curve and shade the area you are looking for, then use StatCrunch to get the answer.

Here is how we do it. Enter Mean = 50 and Standard Deviation = 10, just like in the picture below. Select probability to the
left, Prob(X <), and enter 65. Hit Compute.

[StatCrunch Normal calculator: Mean 50, Std. Dev. 10, Prob(X < 65) = 0.9331928]

We can see that StatCrunch labels the values that are 1, 2, and 3 standard deviations from the mean on the horizontal x-axis. So the probability below 65 is 0.9331928. Or we could say that 0.9331928, or 93.31928%, of the data values in this distribution are less than 65.

How can you find the probability above 65?

Example 9: On a midterm exam, students who score between 80 and 90 receive a grade of B. Suppose the scores on the exam have a normal distribution with mean 83 and standard deviation 5. What proportion of students get a B? First draw the curve to see what area we are looking for. Then use StatCrunch to obtain the shaded area.

We can also calculate these probabilities according to how many standard deviations away from the mean a value is. Let's go back to the example where \(\mu = 50\) and \(\sigma = 10\). We wanted to find the probability below 65. We know we can just plug these numbers into StatCrunch and get the answer, but let's do it a different way.

How many standard deviations from the mean is 65? To get this, we just have to calculate the z-score for 65, because the z-score tells us how far away a value is from the mean in terms of the standard deviation.

\(z\text{-score} = \frac{65 - 50}{10} = 1.5\)

So 65 is 1.5 standard deviations above the mean.

Draw a curve showing the actual population mean and standard deviation, and shade in the probability below 65.

Now draw a curve with mean 0 and standard deviation 1 (we call this the standard normal distribution) and shade in the probability below 1.5, because that is the z-score we calculated.

It is the same probability.

This second way seems more tedious, but there is a reason it is important. If we didn't have StatCrunch, we would have to use tables to look up these probabilities, and these tables are based on z-scores. So, lucky for us, we have StatCrunch.

Let's do a few more so we understand them.

EXAMPLE: Let's use the StatCrunch
calculator to check that one of the Empirical Rule probabilities is correct. Go to Stat → Calculators → Normal. Let's check if the probability (area) within two standard deviations of the mean really is close to 95%. This means we are trying to check if the probability between Z = −2 and Z = 2 is close to 95%, or 0.95. Draw this curve below and shade in the area (probability) we are looking for.

Now let's check that this probability is the same as the one we got earlier, that the probability below 65 is 0.9331928. We want to see if the probability below 1.5 on a standard normal distribution is the same. So to get this area, go to Stat → Calculators → Normal. Since we are trying to find a probability in terms of a z-score, all we have to do is change the mean to 0 and the standard deviation to 1, and here is the result:

[StatCrunch Normal calculator: Mean 0, Std. Dev. 1, Prob(X < 1.5) = 0.9331928]

It's the same probability as before: 0.9331928.

So there are two ways to get these probabilities:

1. By putting in the actual mean and standard deviation and finding the probability associated with the value.
2. By putting in mean = 0 and standard deviation = 1, converting your value into a z-score, and finding the probability associated with that z-score.

StatCrunch won't find the area in between two values, so we just have to find the probability to the left of −2 and the probability to the right of 2. This distribution is symmetrical.

Now we know that the probability to the left of −2 is 0.02275, and the probability to the right of 2 is 0.02275. The probability between −2 and 2 is 1 − 0.02275(2) = 0.9545. So the Empirical Rule is correct: approximately 95% of the probability does lie within two standard deviations of the mean.

Let's do some more of these types of calculations.

Find the probability that an observation is more than 2.27 standard deviations above the mean, i.e., the area to the right of Z = 2.27.

[StatCrunch Normal calculator: Mean 0, Std. Dev. 1, probability to the right of 2.27]

This probability equals 0.0116.

Here are some tougher probabilities associated with z-scores that are similar to questions you will see on the homework.

For the normal probability shown below, find the probability that an observation falls in the shaded region.

For a normal distribution, find the probability that an observation falls within 2.58 standard deviations of the mean.

So far we have been doing calculations where we are given z-scores and we want an area or probability associated with those z-scores. Now let's do some examples where we are given the area (probability) and we want to find the z-score associated with it.

1. Find the z-score that has area (probability) 0.2912 below it. The z-score with area (probability) 0.2912 below it = −0.54988.
2. Find the z-score that is the 90th percentile. The z-score = 1.28155.

Now that we have done many, many, many examples of probabilities (areas) under the normal curve involving z-scores, let's actually get to some more real data examples. REMEMBER: ALWAYS DRAW PICTURES OF THE CURVES.

Example 7: IQ scores are normally distributed with a mean of 100 and a standard deviation of 16.

(a) What IQ score is the 98th percentile?
(b) How many standard deviations above the mean is the 98th percentile? This is the same as asking: Find the z-score that is the 98th percentile.

Check that the z-score in (b) matches up with the IQ score in (a).

Example 8: SAT Math scores are normally distributed with a mean \(\mu = 500\) and a standard deviation \(\sigma = 100\).

(a) If your SAT Math score is a 600, how many standard deviations from the mean was it?
(b) What percentage of SAT scores were higher than yours? Do this using the standard normal distribution in StatCrunch and the z-score you got in part (a). Then check it against the easy way in StatCrunch by actually putting in the population mean, standard deviation, and your score. We did this earlier in the notes.

We will be able to do this using our StatCrunch Normal Calculator, but instead of looking at the distribution of the
individual values like we looked at before, we need the distribution of the possible sample means we could get from samples of a certain size. We just need to create a curve of all the possible sample means we could get.

Chapter 7: Sampling Distributions

Chapter 7 is about two distributions:

1. Sampling Distribution of the Sample Mean
2. Sampling Distribution of the Sample Proportion

These sound complicated, but they really aren't that bad. Before we get to them, let's review what we learned in 6.2. We learned how to find probabilities like this:

Test scores are normally distributed with a population mean of 82 and a population standard deviation of 10. What is the probability that a randomly selected student scored higher than an 85?

To do this we would use the Normal Calculator in StatCrunch. Notice in this example that we are finding the probability for just one student.

Soon we will learn how to solve the following problem:

Test scores are normally distributed with a population mean of 82 and a population standard deviation of 10. A random sample of five students is taken. What is the probability that the average test score for these five students is above an 85?

So we don't just care about finding the probability that one student scored above 85; we want to find the probability that the average score for five students would be above 85.

Let's talk through an example. Let's say that the population we are interested in is everyone in this class, and let's say we are interested in the average number of siblings for the class.

What if I told you that I got everyone's information and the average number of siblings for everyone in the class is \(\mu = 1.5\)? This would be a population mean, because it is the average for the whole class.

Now what if I took a sample of ten people, asked them how many siblings they had, and averaged the values for these 10? The average from these 10 people would be a sample mean, because it is just for these 10 people, not everyone in the class. And the calculated sample
mean for this sample is \(\bar{x}\) = ____.

For 10 more people: \(\bar{x}\) = ____.

Sometimes we will get a sample mean above the population mean, sometimes we will get a sample mean below the population mean, and occasionally we might get a sample mean right at the population mean.

Here's the idea: let's say I went and did this for every combination of 10 people and calculated a sample mean for every combination of 10 people. What we are interested in is seeing what the distribution of all these possible sample means would look like. That distribution is what we call the Sampling Distribution of the Sample Mean.

7.1–7.2 Sampling Distribution of the Sample Mean

The general idea behind obtaining the sampling distribution of the sample mean is:

1. Obtain a simple random sample of size n.
2. Compute the sample mean \(\bar{x}\).
3. Assuming that we are sampling from a finite population, repeat steps 1 and 2 until all simple random samples of size n have been obtained.

This is easy to do with a finite population with very few values. Let's look at an example.

Example: Draw all possible samples of size 2 from the population {2, 4, 6, 8}. Construct the sampling distribution of the sample mean. What is the probability that we would get a sample mean of 5 from a sample of size 2 from this population?

However, most of the time we will not have all the values from a population, because most populations we look at are very large.

To find probabilities involving these sample means, like the probability that a sample of ten people is going to have an average of more than 2 siblings per person, we will have to use the StatCrunch Normal Calculator. So we will need three things:

1. What is the mean, or in other words, the overall average of all these possible sample means?
2. What is the standard deviation, or in other words, the spread of the values for all these possible sample means?
3. What is the shape of the distribution of these sample means? To use the Normal Calculator, they need to be normally distributed.

We will talk about these
three things and how to calculate these probabilities in Chapter 7.

So to find probabilities involving sample means like we mentioned before, we need to know what the distribution looks like. We need to know the mean, the standard deviation, and the shape.

1. Mean: The average, or mean, of all the possible sample means we could get will ALWAYS be equal to the overall population mean. In other words, \(\mu_{\bar{x}} = \mu\).

Check that this is true in the example on the previous page.

2. Standard Deviation: When we are talking about the spread, or standard deviation, of a sampling distribution, we call it the standard error. Many students get confused by this term, but just think of the standard error as a type of standard deviation. It measures the spread of a sample statistic, like the sample mean. It places a value on the spread of all the possible sample mean values.

When we are talking about the sampling distribution of the sample mean, the standard error = \(\frac{\sigma}{\sqrt{n}}\).

Again, we don't go through proofs of why this is true, but think about why it makes sense. Individual values in a population are going to be all over the place. Some will be high, some will be low, and we calculate their standard deviation to be \(\sigma\). However, sample means are not going to be as spread out. Yes, we might have some high values and we might have some low values in a sample, but for the most part these values are going to average out close to the mean. So the standard error, or spread, isn't going to be as big as \(\sigma\); it is actually \(\frac{\sigma}{\sqrt{n}}\).

Based on all these things we have talked about, we can now find probabilities involving sample means in StatCrunch. We just have to make the following changes:

1. Make sure the sampling distribution of the sample mean is normally distributed. We do this by checking if the population we are sampling from is normal OR if we are using n > 30.
2. When we want to find probabilities (areas under the curve) involving sample means, we will still put in \(\mu\) for the Mean in StatCrunch, but we will now
put in \(\frac{\sigma}{\sqrt{n}}\) for the standard deviation. Other than that, everything else will be the same.

We use this new standard deviation, called the standard error, because sample means are not as spread out as individual data values, so the new standard deviation is \(\frac{\sigma}{\sqrt{n}}\) rather than just \(\sigma\).

3. Shape: To find these probabilities of sample means that we are looking for, we need to make sure that the distribution of these sample means is bell-shaped, or normal. To check for this, we just need to make sure one of these two things is true:

1. If the population is normally distributed, then the sampling distribution of the sample mean is normally distributed regardless of sample size.
2. If we are using a large enough sample size (usually we say n greater than 30), the sampling distribution of the sample mean is approximately normal regardless of the distribution of the population. This is the Central Limit Theorem.

Example: Suppose a single value is selected from a normal population with mean \(\mu = 5\) and standard deviation \(\sigma = 1\). Use StatCrunch to find the probability that the value is greater than 5.5.

Now suppose a sample of size 25 is selected from this population. Let's use StatCrunch to find the probability that the sample mean for these 25 values is greater than 5.5. What do we need to change to do this?

Notice that the standard deviation that we use in our StatCrunch calculator is now the standard error.

Example: Suppose a simple random sample of size n = 36 is obtained from a population with \(\mu = 30\) and \(\sigma = 12\).

(a) Describe the shape of the sampling distribution of the sample mean.
(b) What is the probability that the sample mean is greater than ____?
(c) What is the probability that the sample mean is less than 28?

7.3 How can we make predictions about a population?

Basically, we are looking at three distributions:

1. Population Distribution: the entire distribution from which we take the sample.
2. Sample Data Distribution: the distribution of the sample data for each given sample.
The shape of the sample mirrors the population.
3. Sampling Distribution: the probability distribution of a sample statistic, such as a sample mean. It is a distribution of all the possible values for the sample statistic.

Example: Suppose the scores on Test 1 have a mean \(\mu = 82\) and standard deviation \(\sigma = 10\). Suppose we take a sample of n = 25 students.

(a) If we took many samples of 25 students and computed a sample mean test score for each sample, what would the standard deviation (spread) for these sample mean test scores be? What do we call this?

So basically, if we were to observe sample mean test scores for many different samples of n = 25 students, the sample mean test scores would vary around 82 with a spread described by the standard error of approximately 2.

(b) If we wanted to calculate probabilities involving these sample means, what must be true regarding the population?

(c) Assuming the population is approximately normal, what is the probability that the average test score for n = 25 randomly selected students is higher than 83? We are asking: what is the probability we will get a sample mean above 83?

Example 8 in Chapter 7: The distribution of family size in a particular tribe of people is skewed to the right, with population mean \(\mu = 5.2\) and population standard deviation \(\sigma\). These values are not known to an anthropologist who samples families in this society to estimate family size. For a random sample of 36 families, she gets a mean of 4.6 and a standard deviation of 3.2.

(a) What are the mean, standard deviation, and shape of the population distribution?
(b) What are the mean, standard deviation, and shape of the anthropologist's sample?
(c) What are the mean, standard error, and shape of the sampling distribution of the sample mean?

RECAP

Sample standard deviation measures the spread of data values. Standard error is a type of standard deviation that measures the spread of the possible sample statistic values. For example, standard error measures how
spread out the possible sample means are from different samples.

Now what if I took a sample of ten people, found their gender, and found the proportion of women for these 10 people? The proportion from these 10 people would be a sample proportion, because it is just for these 10 people, not everyone in the class. And the calculated sample proportion for this sample is \(\hat{p}\) = ____. This is the notation we use to denote a sample proportion.

7.1 SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION

There is one other sampling distribution we are interested in: the Sampling Distribution of the Sample Proportion. Instead of looking at an average from a sample, maybe we want the proportion of a sample that fits a certain category. Here's an example.

Let's say again that the population we are interested in is everyone in this class, and let's say we are interested in the proportion of women in the class. What if I told you that I got everyone's information and the proportion of women in the class is p = 0.55? This would be a population proportion, because it is the proportion for the whole class, and we use the letter p to represent this population proportion.

For 10 more people: \(\hat{p}\) = ____.

Sometimes we will get a sample proportion above the population proportion, sometimes we will get a sample proportion below the population proportion, and occasionally we might get a sample proportion right at the population proportion.

Here's the idea: let's say I went and did this for every combination of 10 people and calculated a sample proportion for every combination of 10 people. What we are interested in is seeing what the distribution of all these possible sample proportions would look like. That distribution is what we call the Sampling Distribution of the Sample Proportion.

Example: Select all possible samples of size 2 from the population {1, 2, 3, 4}. Calculate the proportion of even numbers in each sample.

The sampling distribution of the sample proportion:

1. Mean: \(\mu_{\hat{p}} = p\)
2. Standard
Deviation or Standard Error 3 Shape The sampling distribution ofj will be approximately normal when n is large How large We need np Z 15 and n1p Z 15 Example Find 3 in the example Where we selected all possible samples of size 2 from the population 1234 using the work we did and test to see if it is equal to p the population proportion Calculate 63 in the example where we selected all possible samples ofsize 2 from the population 1234 Page 55 of58 Notation Just like We have a population mean p we have a Population Proportion p Example In the dataset 1234 calculate the population proportion for the proportion of even numbers in the dataset Just like we have a sample mean E we have a Sample Proportion 13 Example For the sample 23 which we found in the example on the page before our x would be 1 because We have one even number and our n Would be 2 because our sample size is 2 So our would be 12 or 05 Page 54 of58 Example Consider a very large population of adults where approximately 45 of the adults favor the death penalty Suppose samples of size 275 are selected from this population and the value of is recorded for each sample a Will the sampling distribution of be approximately normal b If so what are the mean and standard error for this distribution In other words find pf an 613 b What is the probability of getting a sample proportion of 5 or higher where 50 or more of the sample favors the death penalty from a random sample on7 people LLJD X n a u o5 n 5 x Mean o5 sm new US v 5 p n m77an352 SEW Em Java Applet wmgw Page 56 of58 Chapter Four Gathering Data 4 Should we experilnenl or ohserw There are two basic ways to gather data 1 Observational Study 2 Experiment Difference between experiments and observational studies Experiments attempt to manipulate or in uence the subjects in an experiment Properly designed experiments can be used to prove causation that one variable CAUSES the other to change In using experiments subjects can be randomly assigned to groups 
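Random assignment, the feature that lets an experiment support causal claims, can be sketched in a few lines of Python. This is only an illustration and not part of the course materials: the function name `randomly_assign`, the subject labels, and the seed are all hypothetical choices.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into two treatment groups.

    Hypothetical helper for illustration; any fair shuffling scheme would do.
    """
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(subjects)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Twenty made-up subject labels, split into two groups of ten.
subjects = [f"subject_{i}" for i in range(1, 21)]
group_a, group_b = randomly_assign(subjects, seed=2000)
print("Group A:", group_a)
print("Group B:", group_b)
```

Because chance alone decides who lands in which group, other characteristics of the subjects tend to balance out between the groups, which is what an observational study cannot guarantee.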
Observational studies simply measure characteristics of the subjects without attempting to manipulate or influence the subjects. Observational studies cannot be used to prove causation; they can only say that the variables are related to one another. In using an observational study, subjects cannot be randomly assigned to groups.

The advantages of using an experiment are that you can prove causation and apply randomization, while an observational study can only prove association of two variables and not causation.

TYPES OF OBSERVATIONAL STUDIES

Cross-Sectional Study: a study that attempts to take a cross section of the population at the current time.
Ex: A survey that asks "Who is your current favorite singer?"

Retrospective Study: a study that is backward looking.
Ex: A study to see if there is an association between cell phone usage and brain cancer. We can form a sample of subjects with brain cancer and a sample of subjects without brain cancer and compare the past use of cell phones for both groups.

Prospective Study: a study that is forward looking.
Ex: A study that asks "How many hours a week do you watch television?" knowing that they will ask the same people this question in one year.

Case-control study: a retrospective study in which subjects who have a response outcome of interest (the cases) and subjects who have the other response outcome (the controls) are compared on an explanatory variable.
Ex: The brain cancer and cell phone usage example.

Example: Decide whether the following are experiments or observational studies.

1. Rats with cancer are divided into 2 groups. One group receives 5 mg a day of an experimental drug that is thought to fight cancer; the other group receives 10 mg a day of the same drug. After 2 years, the spread of cancer is measured in both groups.
Subjects:
Experiment or observational study?

2. A poll is conducted in which 500 people are asked whom they plan to vote for in the upcoming election.
Subjects:
Experiment or observational study?

Sources of Potential Bias in Sample Surveys

1. Sampling Bias occurs from using nonrandom samples or from not using a large enough sampling frame (undercoverage).
Example: In an election poll, if we just took the first 30 people that walked up to us rather than randomly selecting people to poll, this might result in sampling bias. This would be called convenience sampling, which we will talk more about later. Or if we just took samples at a few of the voting stations and did not sample from all of them, this would be sampling bias.

2. Nonresponse Bias occurs when some sampled subjects cannot be reached, refuse to participate, or fail to answer some questions.
Example: If we send out 500 surveys and only get 52 back, this could result in nonresponse bias. These 52 don't give us a large enough sample size and might not be representative of the population.

3. Response Bias occurs when the subject gives an incorrect response, or the wording of the question is confusing or misleading, or the way the interviewer asks the questions is confusing or misleading.
Example: Asking the question on a survey this way might result in response bias: "Would you rather vote for CANDIDATE A or [remainder of the question illegible]"

Experiments

An experiment is a controlled study in which one or more treatments are applied to experimental units. The experimenter then observes the effect of varying these treatments on a response variable. A lot of new terms were used in the above definition. Let's now define these terms:

1. Experimental unit (or subject): the person, object, or some other well-defined item upon which a treatment is applied.
2. Treatment: a condition applied to the experimental unit (e.g., a new drug is administered to patients).
3. Response variable: a quantitative or categorical variable that represents our variable of interest.

The goal in an experiment is to determine the effect the treatment has on the response variable.

Experimental Designs

Completely randomized design: The experimental units are randomly assigned to the treatments.

Matched-Pairs Design: A matched-pairs design
is one in which the experimental units are somehow related or matched before the experiment takes place. For example: the same person before and after a treatment, twins, husband and wife, etc.

Example: One twin receives some treatment and the other twin receives some other treatment. Not only can we compare the overall groups that received different treatments, we can also look at the difference in the results for each matched pair of twins.

Often the measure of a response variable for an experimental unit is taken before a treatment is applied, and then a measure is taken from the same unit after the treatment. Here the individual is matched against itself.

Think back to our example with the rats. The experimental units are the rats with cancer. The explanatory variable is the amount of the drug, and there are two levels of treatment: 5 mg and 10 mg. The response variable is the spread of cancer for each rat.

Many designed experiments are double blind. An experiment is double blind if neither the experimental unit nor the experimenter knows what treatment is being given to the experimental unit. For example, in a lot of medical experiments that test the effects of new drugs, researchers administer to each patient either a dose of the new drug OR a placebo. A placebo is a dummy treatment, in this case a fake pill, so that the patient does not know if they are getting the real medication or not. If this is to be double blind, the researchers will not know which patients are getting the real medication, and the patients also will not know if they are getting the real medication. If the researcher does know but the patient does not, then it is a single blind experiment.

CROSSOVER DESIGN

Think about an experiment where subjects could take one treatment (drug A) the first time they get a headache and the other treatment (drug B) the second time they get a headache. The response variable is whether the subject's pain is relieved. For each person we would have a matched pair of observations
because they both correspond to the same person. A matched-pairs design like this, in which subjects cross over during the experiment from using one treatment to using another treatment, is called a crossover design.

In matched-pairs experiments, a set of matched experimental units is referred to as a block. So in the headache study above, each person is a block. To reduce possible bias, treatments are usually randomly assigned within a block. In the headache study, the order in which the treatments are taken would be randomized. Bias could occur if all subjects received one treatment before another. A block design with random assignment of treatments to units within blocks is called a randomized block design.

Example: A pharmaceutical company has developed an experimental drug meant to cure a deadly disease. The company randomly selects 300 males aged 25 to 29 years old with the disease and randomly divides them into two groups. Group 1 is given the experimental drug while Group 2 is given a placebo. After one year of treatment, the white blood cell count for each male is recorded.
a) Identify the experimental units.
b) What is the response variable in this experiment?
c) What is the explanatory variable? What are the two levels of treatment?
d) Is the experiment design completely randomized?
e) Does this experiment use a matched-pairs design?
f) If the researcher knows which patients are getting which drugs, is this a double or single blind experiment?

Let's take a look at one more example to make sure we understand matched-pairs design. Let's say we administer two tests to students and want to compare grades on the two tests. Let's say we want to compare the mean for 20 Test A scores versus the mean for 20 Test B scores.

If we were to select 20 people and compare their Test A and Test B scores, then this would be an example of a matched-pairs design. We would be matching the Test A scores against the same students' Test B scores. The experimental units in each group are related because they are the same students
in each group. We could not only compare the Test A scores to the Test B scores as a group, but we could also compare them for each individual.

However, if we were to select 40 people and compare 20 students' Test A scores versus 20 different students' Test B scores, then this would NOT be a matched-pairs design, because there is no relation between one student's Test A score and a different student's Test B score. We could only compare the Test A scores to the Test B scores as a group.

Example: Researchers at UGA wished to determine whether there was a connection between listening to classical music and reasoning skills. To test the research question, 36 college students listened to one of Mozart's sonatas for 10 minutes and then took a reasoning test using the Stanford-Binet intelligence scale. The same students were also administered the test after sitting in a room for 10 minutes in complete silence. The mean score on the test following the Mozart piece was 119, while the mean test score following the silence was 110. The researchers concluded that subjects performed better on the reasoning tests after listening to Mozart.
a) What is the response variable in the experiment?
b) What is the explanatory variable? Describe each level of treatment.
c) Does this experiment use a matched-pairs design?
d) Does this experiment use a crossover design?

Chapter 10: Significance Tests Comparing Two Groups

In Chapter 10 we will continue our discussion of significance tests, but we will be considering two populations instead of one. Instead of testing if a population parameter is equal to a certain value, we will be testing whether a population parameter is equal to another population parameter.

So let's say we have a population 1 with a population mean $\mu_1$, and a population 2 with population mean $\mu_2$. We can test if $\mu_1 = \mu_2$ by taking samples and seeing if we are getting big differences in our samples. This type of analysis depends on whether the two samples representing the two populations were selected
independently or not, so let's see what we mean by dependent and independent samples.

Suppose we want to determine whether or not students perform better on Test 1 than Test 2. That is, our claim is that the average of the Test 1 scores is higher than the average of the Test 2 scores.

If we randomly select 5 scores from Test 1 and then randomly select 5 scores from Test 2, we will have independent samples. If we randomly select 5 students and look at their scores on Test 1 and Test 2, then we will have dependent samples.

We will begin with comparing two means using paired data (two dependent samples) in Section 10.4.

Section 10.4: How Can We Analyze Dependent Samples?

Example: Suppose below are the Test 1 and Test 2 scores for our random sample of 5 students. Let's test our claim that the average of the Test 1 scores is higher than the average of the Test 2 scores at the $\alpha = 0.05$ significance level.
[Table of Test 1 and Test 2 scores not recoverable from the source]

We will denote the differences with the letter d. For comparing Test 1 and Test 2 scores, d = Test 1 score - Test 2 score. Although the test is exactly the same as for a single population mean in Section 9.3, you will see different notation. The subscript d indicates the values refer to a difference:
d: the differences (your data values)
$\bar{x}_d$: the average difference in our sample
$s_d$: the standard deviation of the differences in our sample

So in our test example, we want to test whether or not students perform better on Test 1 than Test 2. What would our hypotheses be?

We actually test for this, but we move things around algebraically so that instead of testing $\mu_1 = \mu_2$ we test whether $\mu_1 - \mu_2 = 0$. Algebraically, you can see that these are the same. So our hypothesis test looks like the following. Our null hypothesis will always be
$H_0: \mu_1 - \mu_2 = 0$
and, depending upon our test, our alternative hypothesis may be
$H_A: \mu_1 - \mu_2 < 0$ or $H_A: \mu_1 - \mu_2 > 0$ or $H_A: \mu_1 - \mu_2 \ne 0$.
Many times you will see $\mu_1 - \mu_2$ written as $\mu_d$, which just stands for the difference between the two means.

Steps in the Significance Test

1. Assumptions (same as before): Were the students taken as a random sample? Are the differences in test scores normally distributed? We will assume so.

2. Hypotheses: The null hypothesis is that there is no difference between the two tests, that the average on Test 1 is the same as the average on Test 2:
$H_0: \mu_1 = \mu_2$, or $\mu_1 - \mu_2 = 0$.
The alternative hypothesis is that the average of the Test 1 scores is higher than the average of the Test 2 scores:
$H_A: \mu_1 > \mu_2$, or $\mu_1 - \mu_2 > 0$.

3. Test Statistic:
$t = \dfrac{\bar{x}_d - 0}{s_d/\sqrt{n}}$
where $\bar{x}_d$ is the average difference in our sample, $s_d$ is the standard deviation of the differences in our sample, and n is the number of matched pairs in the sample. Here n = 5, and we can get $\bar{x}_d$ and $s_d$ by putting the data into StatCrunch and using Summary Stats. Put the differences in a column in StatCrunch, go to Stat > Summary Stats > Columns, and pick the column with the differences.
[StatCrunch Summary Statistics screenshot]
So
$t = \dfrac{2}{4.47214/\sqrt{5}} = 1$.
In our sample, students are performing 2 points better on Test 1 than Test 2, and this sample mean difference of 2 is only 1 standard error above 0.

4. P-value: Now we need to see the probability that we got an average difference in our sample this far above 0. Since $H_A$ is ">", this is a right-tailed test and we need the area in the T calculator above 1.
[StatCrunch T calculator screenshot]
p-value = 0.18695. So there was a 0.18695, or 18.695%, chance that we would get an average difference of 2 or more in our sample if the test averages are the same. We didn't get an average difference that far away from 0, so we probably won't reject the null hypothesis that says there is zero difference between the population means. We probably don't have enough evidence.

5. Conclusion: 0.18695 > 0.05. Our p-value is greater than the level of significance, so we will not reject $H_0$ and we will not accept our claim $H_A$ that Test 1 scores are significantly higher than Test 2 scores. In our sample it did not look overwhelmingly like Test 1 scores were higher than Test 2 scores, because we did not get that large of an average difference in our sample. So really there might be close to zero difference between the means of the two test scores.

Now let's see how StatCrunch can do these problems for us.

Dependent Sample Significance Tests in StatCrunch

First enter the Test 1 scores in column 1 and the Test 2 scores in column 2.
[StatCrunch data table screenshot]
Go to Stat > T Statistics > Paired and fill in the dialog with var1 as Sample 1 and var2 as Sample 2.
[StatCrunch Paired T statistics dialog screenshot]
Hit Next. Then choose Hypothesis Test, with the null hypothesis being that the mean difference equals zero and the alternative hypothesis being ">", because we think var1 (Test 1) scores are higher than var2 (Test 2) scores.
[StatCrunch Paired T statistics hypothesis test options screenshot]
Hit Calculate and the results are:
[StatCrunch hypothesis test results screenshot]
Look, we get the same test statistic (1) and the same p-value.

Could we have done this test using a confidence interval? Yes, and here is the formula we would have used:
$\bar{x}_d \pm t^{*}\,\dfrac{s_d}{\sqrt{n}}$
where $t^{*}$ is the t-score for the chosen confidence level with n - 1 degrees of freedom.

The reason we want a 95% confidence interval is that we are running this test at the 0.05 significance level, so there can be a 5% chance we are wrong: 1 - 0.05 = 0.95, or 95%. StatCrunch reports the 95% confidence interval for the mean difference as (-3.55289, 7.55289). We are 95% confident that $\mu_1 - \mu_2$ is between -3.55289 and 7.55289.

Again, we can construct this interval very easily in StatCrunch. Put the data in the two columns like before. Go to Stat > T Statistics > Paired and fill in the dialog as before.
[StatCrunch Paired T statistics dialog screenshot]
Hit Next.
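The paired-sample numbers quoted above (a mean difference of 2, a standard deviation of the differences of 4.47214, and n = 5 pairs) can be cross-checked without StatCrunch. The sketch below is not from the course materials: it uses only the Python standard library, and the helper names `t_pdf`, `t_right_tail`, and `t_critical` are made up for this illustration. The tail area is obtained by simple numerical integration, which is an assumption of convenience rather than how StatCrunch computes it.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_right_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(conf, df):
    """Bisection search for t* such that P(-t* < T < t*) = conf."""
    target = (1 - conf) / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_right_tail(mid, df) > target:
            lo = mid          # tail still too big, so t* lies further right
        else:
            hi = mid
    return (lo + hi) / 2

# Summary statistics taken from the worked example in the notes.
mean_d, sd_d, n = 2.0, 4.47214, 5

se = sd_d / math.sqrt(n)                 # standard error of the mean difference
t_stat = (mean_d - 0) / se               # roughly 1
p_value = t_right_tail(t_stat, n - 1)    # right-tailed test, roughly 0.18695
t_star = t_critical(0.95, n - 1)
ci = (mean_d - t_star * se, mean_d + t_star * se)  # roughly (-3.553, 7.553)

print(f"t = {t_stat:.5f}, p-value = {p_value:.5f}")
print(f"95% CI: ({ci[0]:.5f}, {ci[1]:.5f})")
```

With the raw Test 1 and Test 2 scores in hand, a statistics library would do the same job in one call; the summary-statistic route is used here only because the score table did not survive in these notes.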
On the next screen, select "Confidence Interval" instead of "Hypothesis Test" and create a 95% confidence interval for the population mean difference.

Now all we have to do is check to see if 0 is on the interval. If 0 is on the interval, then it is possible that $\mu_1 - \mu_2 = 0$ and we can NOT reject $H_0$ that $\mu_d = 0$, or in other words $\mu_1 - \mu_2 = 0$. There really may be zero difference between the population means because 0 is on our interval. In our example above, 0 is between -3.55289 and 7.55289, so $\mu_1 - \mu_2$ may really equal 0.

If 0 is not on our interval, then we can reject $H_0$ and say that there is some significant difference between the means, because 0 is not a possible difference in the population means. If 0 is not on the interval, then we don't think $\mu_1 - \mu_2 = 0$; we think there is a significant difference between the population means.

Section 10.2: How Can We Compare Two Means Using Independent Samples?

In this section we will once again compare two population means, except now we will be using independent samples to compare these means. In Section 10.4 we saw how to compare two means using dependent samples (paired data). Whenever we run tests to compare two means, this should be our first question: are we using dependent or independent samples?

Example: Let's say I want to look at the average price of iPods on eBay. I want to see if the average price of an iPod sold using Buy It Now is different from the average price of an iPod sold using just a regular bidding auction. Let $\mu_1$ = average iPod price using Buy It Now and $\mu_2$ = average iPod price using regular auction. Set up the null and alternative hypotheses for the test. We will run this test at the 0.05 level of significance.

Now remember what our hypotheses really are:
$H_0: \mu_1 - \mu_2 = 0$
$H_A: \mu_1 - \mu_2 \ne 0$

Let's say we have the following data from some past auctions:
Buy It Now: 249, 240, 220, 245, 220
Regular Auction: 226, 244, 232, 209, 250, 208, 228, 229

Are these independent or dependent samples?

What is the average price in our Buy It Now sample?
(249 + 240 + 220 + 245 + 220)/5 = 234.8
We will call this $\bar{x}_1$ because it is the sample mean associated with $\mu_1$.

What is the average price in our Regular Auction sample?
(226 + 244 + 232 + 209 + 250 + 208 + 228 + 229)/8 = 228.25
We will call this $\bar{x}_2$ because it is the sample mean associated with $\mu_2$.

What should we use to estimate $\mu_1 - \mu_2$? We should use $\bar{x}_1 - \bar{x}_2$, which is 234.8 - 228.25 = 6.55. So the point estimate of $\mu_1 - \mu_2$ is 6.55, the difference between our sample means.

Now we need to see if 6.55 is a significant difference or not. If there is a significant difference between our sample means, we would reject the null hypothesis and accept our claim that there is a significant difference in price between Buy It Now auctions and regular auctions. If there is not a significant difference between our sample means, we would not reject the null hypothesis. There may not be a significant difference in price between Buy It Now auctions and regular auctions.

Steps in the Significance Test

1. Assumptions (same as before): Do we have a random sample? Are the populations normally distributed?

2. Hypotheses:
$H_0: \mu_1 - \mu_2 = 0$
$H_A: \mu_1 - \mu_2 \ne 0$

3. Test Statistic: We want to see how different our sample means really are. To do this, we see how far $\bar{x}_1 - \bar{x}_2$ is from 0 using this formula:
$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - 0}{\text{standard error}}$
If we ever ask you to calculate this test statistic, we will give you the standard error. Or, if you have the actual data, we will see that you can use StatCrunch. For our eBay example we know $\bar{x}_1 - \bar{x}_2 = 6.55$. If I told you the standard error is 8.10615, then you could get the test statistic by doing
$t = \dfrac{6.55}{8.10615} = 0.80803$.
So our test statistic is t = 0.80803. The difference between our sample means is only 0.80803 standard errors above zero. There is not that much difference between our sample means, so we probably wouldn't think there is much difference between our population means. We probably won't reject the null hypothesis because we did not get a very extreme test statistic, BUT we still have to get the p-value for this test statistic.

4. P-value: The p-value is the probability that we got a sample mean difference this far away from a 0 difference, assuming the two population means are equal (have zero difference). We are comparing two means, so we need to use the t calculator. If we ever want you to do these by hand, we will also have to give you the degrees of freedom. We have to give you this because we are using samples that have different sample sizes. For this problem we will use degrees of freedom = 9.
[StatCrunch T calculator screenshot]
But we are performing a two-tailed test, so our p-value = 2(0.21995) = 0.4399.

5. Conclusion: 0.4399 > 0.05. Our p-value is greater than the level of significance, so we would not reject the null hypothesis. The 6.55 difference that we got between our sample means was not a significant difference. So it looks like there may be no difference between the prices of iPods whether you are using a Buy It Now auction or a regular auction. We cannot reject $H_0: \mu_1 - \mu_2 = 0$.

Independent Mean Significance Tests in StatCrunch

First enter the data in two columns, just like before, and fill in the two-sample T dialog with var1 as Sample 1 and var2 as Sample 2.
[StatCrunch data table and Two sample T statistics dialog screenshot]
MAKE SURE YOU UNCHECK "Pool variances". Hit Next.
[StatCrunch Two sample T statistics hypothesis test options screenshot]
We want a two-tailed test, so when we hit Calculate here is our output:
[StatCrunch hypothesis test results screenshot: $H_0: \mu_1 - \mu_2 = 0$, $H_A: \mu_1 - \mu_2 \ne 0$]
Look, the same test statistic (0.80803) and basically the same p-value (0.4398).

We can also run this test using confidence intervals. Again, all we have to do is check if 0 is on the constructed confidence interval or not. Let's use a 95% confidence interval. Go to Stat > T Statistics > Two Sample and fill in the dialog as before, remembering to uncheck "Pool variances".
[StatCrunch Two sample T statistics dialog screenshot]
Hit Next and select a 95% confidence interval.
[StatCrunch confidence interval options and results screenshot]
The interval runs from -11.77014 to 24.87014. We are 95% confident that if we actually found $\mu_1 - \mu_2$ it would be on this interval. Since 0 is on the interval, $\mu_1 - \mu_2$ may equal zero, so we cannot reject the null hypothesis. There really may be zero difference between the two population means. They may be equal.

If 0 were not on the interval, then we could reject the null hypothesis and say that one mean was significantly higher than the other. In other words, that one type of auction got higher mean sales than the other type of auction.

Section 10.1: Comparing Two Proportions

In this section we will be able to test whether two population proportions are equal to each other. Here is an example. Let's say there is an election coming up. It is thought that the proportion of women that are going to vote for candidate A is less than the proportion of men that are going to vote for candidate A. We can take a random sample from each group and compare sample proportions to test this.

We will be comparing proportions in this example. We want to see if the proportion of women who will vote for candidate A (we will call this $p_1$) is less than the proportion of men who will vote for candidate A (we will call this $p_2$). Let's set up these hypotheses.

Example: Let's run this test at the 0.05 level of significance with the following data.

In our random sample of 10000 women, 4200 of them said they would vote for candidate A. This means $\hat{p}_1 = 4200/10000 = 0.42$. So in our sample of women, 0.42 or 42% of them said they would vote for candidate A.

In our sample of 10000 men, 4600 of them said they would vote for candidate A. This means $\hat{p}_2 = 4600/10000 = 0.46$. So in our sample of men, 0.46 or 46% of them said they would vote for candidate A.

So now we want to see if these sample proportions, 42% and 46%, are significantly different. Really, we are checking to see if the sample proportion of women who would vote for candidate A (0.42) is significantly lower than the sample proportion of men who would vote for candidate A
(0.46). So now we would simply like to see if 0.04 is a significant difference. If 0.04 is a significant difference, we can reject the null hypothesis and accept our claim that $p_1$ is less than $p_2$. If 0.04 is NOT a significant difference, we cannot reject the null hypothesis. In that case, there really may be zero difference between the two population proportions.

Put another way: if 0.42 is significantly lower than 0.46, then we can reject the null hypothesis and accept our claim that a lower proportion of women will vote for candidate A than men. But if 0.42 is not significantly lower than 0.46, then we cannot reject the null hypothesis, and so maybe there is no significant difference between the proportion of men and the proportion of women who will vote for candidate A.

Again, though, rather than checking
$H_0: p_1 = p_2$
$H_A: p_1 < p_2$
we will be checking the difference in the proportions to see if there is really zero difference between the proportions or not. So here are the real hypotheses we are testing:
$H_0: p_1 - p_2 = 0$
$H_A: p_1 - p_2 < 0$

We can't get these population proportions, so we use the next best thing: our sample proportions. In our samples, $\hat{p}_1 - \hat{p}_2 = 0.42 - 0.46 = -0.04$.

Well, again, the good news is that StatCrunch can run this test for us. The only thing we need to check before we run this through StatCrunch is the assumptions:
1. Categorical response variable
2. Independent random samples
3. The sample sizes are large enough so that there are at least five successes and five failures in each group

Significance Test for Two Proportions in StatCrunch

Let's run our test in StatCrunch. Go to Stat > Proportions > Two Sample > with summary.
[StatCrunch Stat menu screenshot]
Here we will put in our sample proportions. For the women's sample we got 4200/10000. For the men's sample we got 4600/10000. We would put these numbers in like this:
[StatCrunch Two sample Proportion with summary dialog screenshot: Sample 1 has 4200 successes out of 10000 observations; Sample 2 has 4600 successes out of 10000 observations]
Hit Next, and we want a Hypothesis Test. The null hypothesis will be that the proportion difference = 0, stating that there is zero difference between the proportions. The alternative hypothesis is our claim that women are less likely to vote for candidate A, so the proportion of women voting for candidate A ($p_1$) should be less than the proportion of men voting for candidate A ($p_2$), so put in "<" for the alternative hypothesis.
[StatCrunch hypothesis test options and results screenshot]
The output shows the sample difference of -0.04, and we get that this sample difference is 5.69803 standard errors below 0 (test statistic = -5.69803), which is really far below 0. Because the p-value is less than the level of significance, we will reject the null hypothesis and accept our claim: the proportion of women voting for candidate A does appear to be significantly less than the proportion of men voting for candidate A.

Using Confidence Intervals to Run These Significance Tests

Again, we can construct confidence intervals to run these significance tests, and we are looking for the same results as before. We will be creating a confidence interval around our sample proportion difference (-0.04), and once we have created it, we will be confident at a certain level that $p_1 - p_2$ will be on the interval.

If 0 is on our confidence interval, then there really may be ZERO difference between the proportions, and we can NOT reject our null hypothesis that the two proportions are the same (have no difference). If 0 is NOT on our confidence interval, then there is enough evidence that there is a significant difference between the proportions, and we can reject the null hypothesis.

So here is how we create these confidence intervals in StatCrunch. Let's use the example
where $p_1$ = the proportion of women voting for candidate A and $p_2$ = the proportion of men voting for candidate A. Remember: $\hat{p}_1 = 4200/10000 = 0.42$ and $\hat{p}_2 = 4600/10000 = 0.46$.

Go to Stat > Proportions > Two Sample > with summary. Put our data in just like before.
[StatCrunch Two sample Proportion with summary dialog screenshot: Sample 1 has 4200 successes out of 10000 observations; Sample 2 has 4600 successes out of 10000 observations]
Hit Next, and now let's create a 95% confidence interval.
[StatCrunch confidence interval options screenshot: Level 0.95]
Hit Calculate and here are our calculations:
[StatCrunch 95% confidence interval results screenshot]
You can see we get the same sample difference of -0.04. And now look at our confidence interval: it goes from -0.05375 to -0.02625. It does not contain 0, so we reject the null hypothesis. Because all of the possible values for the proportion difference are below 0, we can infer that the difference between the two proportions is negative. This study provides evidence that the population proportion of women who will vote for candidate A does appear to be smaller than the population proportion of men who will vote for candidate A.

Chapter 11: Analyzing the Association between Categorical Variables

Sections 11.1-11.2, Part 1: Goodness of Fit Test

We have already seen in Chapter 9 how we can test whether a proportion is equal to a value (like p = 0.4), and we have seen in Chapter 10 how we can test whether two proportions are equal (like $p_1 = p_2$). Now we are going to see how we can test many proportions at one time using only one test. This is called a Goodness of Fit Test, because we would like to test and see how well the stated proportions fit.

Example: The MARS company that makes M&M's states that the percentage of each color made is: blue 24%, orange 20%, green 16%, yellow 14%, red 13%, brown 13%.
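A goodness of fit test compares what was observed with what the stated percentages predict, so a natural first step is turning those percentages into expected counts. The short Python sketch below does that for the M&M's percentages listed above. It is only an illustration: the helper name `expected_counts` is made up, and the bag size of 60 is an assumed sample size chosen for the demonstration.

```python
# Stated color proportions from the MARS company, as quoted in the notes.
stated = {
    "blue": 0.24, "orange": 0.20, "green": 0.16,
    "yellow": 0.14, "red": 0.13, "brown": 0.13,
}

def expected_counts(proportions, n):
    """Expected count per category: sample size times the stated proportion."""
    return {category: n * p for category, p in proportions.items()}

# The stated proportions should account for the whole population.
total = sum(stated.values())

bag_size = 60  # assumed sample size, for illustration only
expected = expected_counts(stated, bag_size)

# Goodness of fit tests usually require every expected count to be at least 5.
all_at_least_5 = all(count >= 5 for count in expected.values())

for color, count in expected.items():
    print(f"{color:>6}: expected {count:.1f}")
print("proportions sum to", round(total, 10))
print("all expected counts at least 5:", all_at_least_5)
```

Note that expected counts do not have to be whole numbers; they are long-run averages, not counts anyone would actually observe in a single bag.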
We can test all these proportions in one test using the Goodness of Fit Test by just taking a random sample of MampM s and seeing how close the proportions in our sample t the proportions above SUMNTARY 1f 0 falls in the confidence interval it is plausible that the two population proportions are equal and we can t reject the null hypothesis If all of the values in a confidence interval for p1 p are positive then 071 p2 gt 0 or 171 gt172 If all of the values in a confidence interval for p1 p2 are negative then 071 172 lt 0 orp1ltp2 Let s say we buy a bag of MampM s and there are 60 in the bag This will be our random sample Let s say here is the breakdown of our bag of MampM s 13 Blues 14 Oranges 10 Greens 11 Yellows 6 Reds and 6 Browns These are what we call the observed values To run our test we need to compare these to the values we would expect to get from a bag of60 MampM s based on the proportions we are testing These are called the expected values Here is a table comparing the observed and expected values Now we can test these proportions by seeing how close or far away our observed counts are from our expected counts the counts we would expect to get Here are the 5 steps for this test Step 1 Assumptions There are only two assumptions for this test 1 The sample must be random 2 The EXPECTED counts must all be at least 5 Are the assumptions met for our MampM s test Step 2 The hypotheses For these tests we do not have symbols in our hypotheses We simply write out in words what we are testing The null hypothesis will be that the proportions we are testing are correct The alternative hypothesis will be that the proportions we are testing are incorrect And we are using our bag of MampM s as our evidence to test these hypotheses H0 The stated proportions given to us by the MARS company are correc HA The stated proportions given to us by the MARS company are incorrect Step 4 Pvalue To nd the pvalue we have to use the Chi Square calculator in StatCrunch We go to STAT9 
Calculators → Chi-Square. Like the t-distribution, we will need a degrees of freedom value. The degrees of freedom will be c − 1, where c is the number of categories we are comparing. In our example the degrees of freedom = 6 − 1 = 5, because we are comparing the proportions for the 6 different colors of M&M's. Also, this Goodness of Fit Test is always a right-tailed test, so you will always find the probability to the right of the test statistic. Lastly, you just have to put in the test statistic, just like the other calculators.

[Screenshot: StatCrunch Chi-Square calculator]

Step 3: Test Statistic

\[ \chi^2 = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}} \]

In Chapters 9 and 10 our test statistics were always Z or T values. Here our test statistic is the symbol \(\chi^2\), called Chi-Square. All it means is that once we calculate the value of this test statistic, instead of plugging it into the Normal or T calculator to get the p-value, we use the Chi-Square calculator in StatCrunch. The calculation measures how far each observed value is from its corresponding expected value. Here is the calculation for our M&M's data:

\[ \chi^2 = \frac{(13 - 14.4)^2}{14.4} + \frac{(14 - 12)^2}{12} + \frac{(10 - 9.6)^2}{9.6} + \frac{(11 - 8.4)^2}{8.4} + \frac{(6 - 7.8)^2}{7.8} + \frac{(6 - 7.8)^2}{7.8} = 2.12164 \]

Step 5: Conclusion
Again, all we have to do is compare the p-value to a level of significance. For this problem let's use α = .05 as our level of significance. 0.83207 > .05, so we will not reject the null hypothesis. Basically, the proportions we got in our bag of M&M's were pretty close to the proportions stated by the MARS company, so we do not have enough evidence to claim those proportions are incorrect.

Just like the other tests, we have seen how to do the five steps by hand. Now let's see how StatCrunch can run the test for us. First we need to input the corresponding observed and expected values into two columns in StatCrunch, like this:

| Row | var1 | var2 |
|-----|------|------|
| 1   | 13   | 14.4 |
| 2   | 14   | 12   |
| 3   | 10   | 9.6  |
| 4   | 11   | 8.4  |
| 5   | 6    | 7.8  |
| 6   | 6    | 7.8  |

And we tell it where the observed values are and where the expected values
are. We hit Calculate and here is what we get:

[StatCrunch output: Chi-Square goodness-of-fit results. Chi-Square = 2.1216426, P-value = 0.8321]

Now suppose someone states that they think that each color is equally likely. Suppose we use a different sample of M&M's to run this test.

Example: Suppose the Red & Black puts out an article that states that for undergraduate students at UGA, 20% of them drive to class, 10% of them bike to class and 70% come to class some other way (for example walk, take the bus, get a ride). Let's test these proportions. Suppose we take a random sample of 200 students and ask them how they get to class. 51 of them drive to class, 15 of them take a bike and the other 134 come to class some other way. Is this enough evidence to claim that the proportions put out by the Red & Black are incorrect at the .05 level of significance?

Are the assumptions met?

H0: The stated proportions given to us by the Red & Black are correct.
HA: The stated proportions given to us by the Red & Black are incorrect.

We put the data in StatCrunch like this (the expected counts are 200 times each stated proportion):

| Row | var1 (observed) | var2 (expected) |
|-----|-----------------|-----------------|
| 1   | 51              | 40              |
| 2   | 15              | 20              |
| 3   | 134             | 140             |

Then we choose Stat → Goodness-of-fit → Chi-Square test.

[Screenshot: Chi-Square goodness-of-fit dialog with Observed and Expected column selectors]

And we tell it where the observed values are and where the expected values are. We hit Calculate and here is what we get:

[StatCrunch output: Chi-Square goodness-of-fit results; values not legible in the source]

Working from the table by hand, \(\chi^2 = \frac{(51-40)^2}{40} + \frac{(15-20)^2}{20} + \frac{(134-140)^2}{140} \approx 4.53\) with 3 − 1 = 2 degrees of freedom, which gives a p-value of about 0.10. Since that is greater than .05, we would not reject the null hypothesis. Based on our data, we do not have enough evidence to reject the proportions given to us by the Red & Black.

Sections 11.1-11.2, Part 2: Independence Test

Remember in Chapter 3 when we looked at this data?

[Table: counts by gender and handedness]

We talked about looking at the conditional proportions, and based on how different they were we could predict whether or not there was an association between gender and whether someone is left- or right-handed. Now we will see a test that we can run to actually test if there is an association between these two variables. It is called the Chi-Squared Test of
Independence. The counts in the above table are what we call the observed counts. To get the expected counts, we use the following formula in each cell:

\[ \text{Expected Count} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Overall Total}} \]

These expected counts are the counts we would expect to get if there is no association between these two variables.

So here are the observed counts, the data we actually collected:

|        | Left-handed | Right-handed | Total |
|--------|-------------|--------------|-------|
| Male   | 160         | 600          | 760   |
| Female | 140         | 560          | 700   |
| Total  | 300         | 1160         | 1460  |

And here are the expected counts, what we would expect if there is no association between gender and whether someone is left- or right-handed:

|        | Left-handed | Right-handed |
|--------|-------------|--------------|
| Male   | 156.16      | 603.84       |
| Female | 143.84      | 556.16       |

Take a look at the above tables: the observed counts aren't that far off from the counts we would expect to get if there is no association between gender and whether someone is left- or right-handed. But now let's see the 5 steps of this test.

Step 1: Assumptions
There are only two assumptions for this test:
1. The sample must be random.
2. The EXPECTED counts must all be at least 5.
Are the assumptions met for our test?

Step 2: The hypotheses
The null hypothesis will be that the variables are independent, or in other words that they are not associated. The alternative hypothesis will be that the variables are dependent, or in other words that they are associated.
H0: There is no association between gender and whether someone is left- or right-handed. These two variables are independent.
HA: There is an association between gender and whether someone is left- or right-handed. These two variables are dependent.

Step 3: Test Statistic

\[ \chi^2 = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}} = \frac{(160 - 156.16)^2}{156.16} + \frac{(600 - 603.84)^2}{603.84} + \frac{(140 - 143.84)^2}{143.84} + \frac{(560 - 556.16)^2}{556.16} = 0.24731 \]

(The value 0.24731 uses the unrounded expected counts.)

Step 4: P-value
To find the p-value we go to Stat → Calculators → Chi-Square. The degrees of freedom will be (r − 1) × (c − 1), where r is the number of rows in our table and c is the number of columns in our table. In our example r = 2, because we have two rows (male and female), and c = 2, because we have two columns (left- and right-handed). So the degrees of freedom = (2 − 1) × (2 − 1) = 1. Also, this Independence Test is always a right-tailed test,
so you will always find the probability to the right of the test statistic. Lastly, you just have to put in the test statistic, like this:

[Screenshot: StatCrunch Chi-Square calculator with DF = 1]

p-value = 0.61898

Step 5: Conclusion
Again, all we have to do is compare the p-value to a level of significance. For this problem let's use α = .05 as our level of significance. 0.61898 > .05, so we will not reject the null hypothesis. Basically, the counts we got in our data were pretty close to the counts we would expect if there was no association present, so we do not have enough evidence to claim these variables are associated.

Just like the other tests, we have seen how to do the five steps by hand. Now let's see how StatCrunch can run the test for us.

We put the data in StatCrunch like this. We have to put in a column that labels the different rows:

| Row | var1 | var2 | var3 |
|-----|------|------|------|
| 1   | M    | 160  | 600  |
| 2   | F    | 140  | 560  |

Then choose Stat → Tables → Contingency → with summary. The columns that actually have the data are var2 and var3, so we choose those, and the row labels are in var1. We make those selections and hit Calculate.

[Screenshot: contingency table dialog with var2 and var3 selected as columns and var1 as row labels]

And here are the results:

[StatCrunch output: contingency table results. Statistic: Chi-square, DF = 1, Value = 0.24730793]

Example: Do you believe that marijuana should be made legal? We are going to run this test to see if someone's political affiliation might be associated with their answer. Test at the .05 level of significance.

[Table of observed counts by opinion and political affiliation; not legible in the source]

Calculate the expected counts if there were no association. Compare the observed and the expected counts. Are the assumptions met?

H0: Someone's opinion about legalization and their political affiliation are not related; they are independent variables.
HA: Someone's opinion about legalization and their political affiliation are related; they are dependent variables.

Review for Test 1
1. Chapter 2: The following data represent the pulse rate of 16 randomly selected males after stepping up and down on a 6-inch platform for 3 minutes. Pulse is measured in beats per minute.

72 86 88 96 115 120 128 129 136 138 138 143 146 151 154 169

a) What is the variable of interest? Is it categorical or quantitative? If quantitative, is it discrete or continuous?
b) Construct a stem-and-leaf plot.
c) What percentage of the pulse rates are between 110 and 149.9 beats per minute?
d) What percentage of the pulse rates are at least 110 beats per minute?
e) Compute the sample mean and median pulse. What do these values tell you about the shape of the distribution?
f) Compute the range and sample standard deviation.
g) Compute Q1, Q3 and IQR.
h) Are there any outliers in the data set, using the quartiles?
i) Calculate the z-score for 72.

2. Chapter 2: The data values below are the years of service for a sample of n = 10 employees at a company, as reported by data entry personnel.

10 13 14 31 11 16 17 19 8 203

Note: the 10th observation was probably recorded incorrectly. In this case, which statistic would be the best measure of center: the mean, the median, or the mode? Why?

3. Chapter 3: Suppose we are looking at three variables x, y and z. If the correlation between x and y is 0.8, the correlation between x and z is 0.82, and the correlation between y and z is 0.20, which pair of variables has the strongest correlation?

4. Chapter 3: Based on data, here is a regression line comparing the variables y = weight of a bar of soap (in grams) and x = number of days the soap has been used:

\[ \hat{y} = 78.5 - 4.2x \]

a) What value is the slope? Interpret the slope.
b) What value is the y-intercept? Interpret it.
c) Predict the weight for a bar of soap that has been used 8 days.
d) Find the residual if the actual weight = 40 for a bar of soap that has been used 8 days.

5. Chapter 2: A random sample of 50 gas stations in Illinois resulted in a mean price per gallon of $1.60 and a standard deviation of $0.07 in 1998. Assume these gas prices follow a bell-shaped
distribution.
a) Determine the percentage of gas stations that have prices between $1.46 and $1.74 according to the Empirical Rule.
b) According to the Empirical Rule, approximately 68% of the gas stations would have prices that fall between ____ and ____.
c) According to the Empirical Rule, approximately what percentage of gas stations would charge more than $1.74?

6. Chapter 2: The average 15-year-old male is 68.2 inches tall, with a standard deviation of 2.8 inches.
a) If the height of a 15-year-old male is 2.64 standard deviations below the mean, what is the corresponding z-score for that male?
b) What height for a 15-year-old male is 2.64 standard deviations below the mean?

7. Chapter 3: A university held a blood pressure screening clinic for its professors. The results are summarized in the table below by age group and blood pressure level.

|                          | Low Blood Pressure | High Blood Pressure | Total |
|--------------------------|--------------------|---------------------|-------|
| Under 50 (Middle-age)    | 64                 | 51                  | 115   |
| 50 and Over (Old)        | 31                 | 73                  | 104   |
| Total                    | 95                 | 124                 | 219   |

a) What proportion of the professors have high blood pressure?
b) What percent of the professors who are 50 and over have high blood pressure?
c) Fill in the blank: A professor who is 50 and over is ____ times more likely to have high blood pressure than a professor who is under 50.

8. Chapter 2: Below is a distribution of grades in STAT 2000 for Fall 2002.

| GPA | Number of students who received that grade |
|-----|--------------------------------------------|
| 0.0 | 30                                         |
| 1.0 | 80                                         |
| 2.0 | 210                                        |
| 3.0 | 650                                        |
| 4.0 | 320                                        |

a) What proportion of the students received higher than a 2.0?
b) What is the mean GPA for STAT 2000 in Fall 2002?
c) What is the median GPA for STAT 2000 in Fall 2002?
d) What GPA is the mode for STAT 2000 in Fall 2002?

Chapter 8: Statistical Inference: Confidence Intervals

8.1 What are Point and Interval Estimates of Population Parameters?

When we first began our discussion of statistics, we mentioned that there were two branches of statistics: descriptive and inferential. The inferential branch uses sample information to draw conclusions about the population. One of the most common uses of the inferential branch is
to use sample statistics, such as \(\bar{x}\), to estimate population parameters, such as \(\mu\). It makes sense that if we take a large enough sample, \(\bar{x}\) should be pretty close to the actual value of \(\mu\). But the chances are pretty small that \(\bar{x}\) turns out to be exactly \(\mu\). This is why we call \(\bar{x}\) a point estimate of \(\mu\). The key here is that sample statistics estimate population parameters. For example, \(\bar{x}\) is a point estimate of \(\mu\), and \(\hat{p}\) is a point estimate of p.

So we know the value of \(\bar{x}\) is probably pretty close to \(\mu\), but we want to get even closer. Rather than just say \(\bar{x}\) is close to \(\mu\), we are going to build an interval around \(\bar{x}\) and then say that \(\mu\) probably lies somewhere on this interval. We call this an interval estimate.

A confidence interval for a parameter consists of an interval of numbers, and this interval is our point estimate ± our margin of error. In the class-age example below, the interval is 20 ± 2, or in other words 18 to 22: 20 is our point estimate and 2 is our margin of error. We call the value we obtain when we take the point estimate minus the margin of error (in that example, 20 − 2, or 18) the lower limit or lower bound. We call the value we obtain when we take the point estimate plus the margin of error (20 + 2, or 22) the upper limit or upper bound. You will also see the lower and upper limits written in parentheses for confidence intervals; in that example, the confidence interval may be written as (18, 22).

Here's an example. Suppose you were asked to estimate the average age of all the students in our class. You might survey 10 students and find their average age to be 20. This sample mean of 20 would be a point estimate of \(\mu\). BUT you could also express your guess by giving a range of ages centered around your sample mean. So your guess could be 20, give or take 2 years. This "give or take 2 years" part is what we call the margin of error, which we will talk about more later. So mathematically your guess would be 20 ± 2, which would be the interval estimate. Suppose
you were then asked how confident you were that \(\mu\), the mean age of all students, was within your interval estimate of 18 to 22 years old. You might say, "I am 95% confident that the mean age of all students is within 18 to 22 years old."

In statistics, we construct intervals for the population mean that are centered around an estimate. This estimate is \(\bar{x}\), the sample mean. Since we can't get the full population mean, we go for the next best thing: we take a sample and calculate a sample mean. What we add to and subtract from the sample mean to get the interval estimate is the margin of error. When we construct these interval estimates, we call them confidence intervals.

It is also important for us to note the level of confidence of a confidence interval. In our example before, our level of confidence was 95%: we were 95% confident that the mean age of all students in the class was somewhere on our interval. So the level of confidence is the probability that an interval constructed this way contains the population parameter, in this case \(\mu\). We will see in examples that as we increase our level of confidence, we get wider and wider intervals.

We will be constructing two different types of confidence intervals:
1. In Section 8.2, we will be calculating the confidence interval for the population proportion, p.
2. In Section 8.3, we will be calculating the confidence interval for the population mean, \(\mu\), like our classroom age example.

Before we get to these sections, let's make sure we understand the terms in the example that follows. In this example, the confidence interval will already be constructed for us. In Sections 8.2 and 8.3 we will actually learn how to construct these confidence intervals.

Example: Suppose a farmer is trying to estimate the average number of peaches per tree in his orchard. He does not want to count every peach on every tree, so he takes a random sample of a few trees and calculates a 95% confidence interval based on the sample. That 95% confidence interval for the mean number of
peaches per tree in the orchard is 112 to 148 peaches per tree. This means that we are 95% confident that the population mean \(\mu\) for the number of peaches per tree is somewhere between 112 and 148 peaches per tree.

What is the lower limit?
What is the upper limit?
What is the level of confidence?
What is the width of the confidence interval?
What is the sample mean \(\bar{x}\)? Remember, the sample mean is always the middle of the confidence interval. The sample mean \(\bar{x}\) will always be on the confidence interval, but the population mean \(\mu\) may or may not be on the confidence interval.
What is the margin of error?

8.2 How Can We Construct a Confidence Interval to Estimate a Population Proportion?

Recall from Section 8.1 that confidence intervals can be written in the general format:

point estimate ± margin of error

The point estimate and margin of error change depending on what parameter is being estimated. For example, we looked at an example of a confidence interval for \(\mu\), so our point estimate was \(\bar{x}\). Now we will consider the format of the confidence interval for the population proportion p. The point estimate for this type of confidence interval is the sample proportion \(\hat{p} = x/n\), where x is the number of individuals in the sample with the desired characteristic and n is the sample size. So we know what goes before the ± (the point estimate), and we can calculate that easily. Now we need to know how to calculate what goes after the ± (the margin of error).

Example of a confidence interval for a proportion: Suppose there is an election coming up and we want to predict what proportion of the votes candidate A will receive. Suppose we took a random sample of 200 voters and found that 112 of these voters said they would vote for candidate A. What proportion of the voters in our sample said they would vote for candidate A? In other words, what is the sample proportion for this sample?

We are trying to predict what proportion of all voters will vote for candidate A by using this sample. What if I told you that the
margin of error for a 95% confidence interval to be used to predict the population proportion is equal to 0.07? Construct and interpret this interval.

Candidate A will win if he/she gets more than 50% of the votes. Using the interval, are we 95% confident that more than half of the voters will vote for candidate A?

The margin of error will always be a multiple of the standard error. In Section 8.2 we discuss confidence intervals for population proportions, so the standard error will be

\[ \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

Why is it now \(\hat{p}\) in the formula and not p?

So the margin of error will always be some number times the standard error we see above. The number we multiply the standard error by to get the margin of error TOTALLY depends on the level of confidence. The general formula for a confidence interval for the population proportion is

\[ \hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

You can see that the margin of error is this Z value times the standard error. Later on in this chapter we will see how to get this Z value, because this Z value TOTALLY depends on our level of confidence (how confident we want to be that the population proportion is on our interval). For now, we will just focus on 95% confidence intervals, where this Z value equals 1.96.

For a 95% confidence interval, the margin of error is 1.96 × (standard error), in other words \(1.96 \sqrt{\hat{p}(1 - \hat{p})/n}\). So the formula for a complete 95% confidence interval (point estimate ± the margin of error) equals

\[ \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

The lower limit is \(\hat{p} - 1.96 \sqrt{\hat{p}(1 - \hat{p})/n}\).
The upper limit is \(\hat{p} + 1.96 \sqrt{\hat{p}(1 - \hat{p})/n}\).

Think way back to the Empirical Rule and we can see why this 1.96 makes sense. Using the Empirical Rule, we said approximately 95% of the data values are within two standard deviations (or standard errors) of the parameter. Now we are starting with \(\hat{p}\), and we want to add and subtract something from that value to get an interval, and we want to be 95% confident that the interval contains p, the true population proportion. So, using the same Empirical Rule logic, if we start with \(\hat{p}\) and add and subtract close to 2 standard errors (1.96 to be exact), it makes sense that we are going to be
95% confident that the value of p will be within that interval.

Example 3 from Section 8.2: We asked n = 1154 Americans, "Would you be willing to pay $5 per gallon of gas?" In our random sample, 518 said they would be willing to pay $5 per gallon of gas.

a) Find a 95% confidence interval for the population proportion of Americans willing to pay $5 per gallon of gas.
First we need our sample proportion: \(\hat{p} = x/n\) = (number who said yes)/(total number in the sample) = 518/1154 = 0.45.
Next we need the standard error: \(\sqrt{\hat{p}(1 - \hat{p})/n} = \sqrt{0.45(0.55)/1154} \approx 0.015\).
The 95% confidence interval formula is \(\hat{p} \pm 1.96 \sqrt{\hat{p}(1 - \hat{p})/n}\).
So the interval is 0.45 ± 1.96(0.015) = 0.45 ± 0.03.
So the lower limit is 0.45 − 0.03 = 0.42, and the upper limit is 0.45 + 0.03 = 0.48.
So our 95% confidence interval is (0.42, 0.48).

b) Interpret the interval.
We are 95% confident that the proportion of ALL Americans that are willing to pay $5 per gallon of gas is between 0.42 and 0.48, in other words between 42% and 48%.

c) EXTRA QUESTION: Does it appear likely that 50% of ALL Americans are willing to pay $5 per gallon of gas?
50%, or 0.5, is not on our interval, so it does not appear likely that 50% of all Americans are willing to pay $5 per gallon of gas. Our interval went between 42% and 48%, so it appears that less than 50% of ALL Americans are willing to pay $5 per gallon of gas.

Sample Size Needed for Validity of Confidence Interval for a Proportion

For these confidence intervals to be valid, we need to check some requirements, as we did back when we were determining the sampling distribution for the sample proportion. The following two things must be true:

\(n\hat{p} \ge 15\) AND \(n(1 - \hat{p}) \ge 15\)

Also, we must make sure we take a random sample. Check that this is true in the above gas example.

Example: The drug Lipitor is meant to lower cholesterol levels. In a clinical trial of 884 randomly selected patients who received 10 mg doses of Lipitor daily, 221 reported a headache as a side effect.
a) Obtain a point estimate for the population proportion of users who will experience a headache.
b) Verify that the requirements for constructing a confidence interval about p are
satisfied.
c) Construct a 95% CI for the population proportion of users who will have a headache.
d) Interpret this interval.

How Can We Use a Confidence Level Other Than 95%?

So far we have just been creating 95% confidence intervals, so our margin of error has been 1.96 × (standard error). But where does this 1.96 come from? And what if we want something different than a 95% confidence interval? We can never have a 100% confidence interval, because we can never be 100% sure that the population proportion is on the interval if we don't know it. But what if we want a 90% confidence interval or a 99% confidence interval?

Here is how we get the 1.96 for a 95% confidence interval. First, when you think of a 95% confidence interval, think of a normal curve with 95% shaded in the middle.

[Figure: normal curve with the middle 95% shaded]

If 0.95, or 95%, is in the middle, then what is the area in each of the tails? 1 − 0.95 = 0.05 and 0.05/2 = 0.025, so 0.025 is in each of the tails. Now put 0.025 as the little tail area to the right in your StatCrunch Normal calculator with mean 0 and standard deviation 1, hit Compute, and you get 1.96.

This 1.96 is the Z-score that matches up with a 95% confidence interval. This is why it was important for us to find those probabilities involving Z-scores before. This Z-score value is what we now take and multiply the standard error by to get the margin of error for a confidence interval, and it will always be a positive Z-score.

Example: A study of 70 randomly selected people in Atlanta was conducted to estimate the proportion of Atlantans that owned dogs. The study revealed that 42 of the 70 people were dog-owners.
a) Obtain a point estimate for the population proportion of dog-owners in Atlanta.
b) Verify that the requirements for constructing a confidence interval about p are satisfied.
c) Construct a 90% confidence interval for the proportion of Atlantans that are dog-owners.
d) Interpret the confidence interval.

We did a lot of work to get there, so let's work through it again and see what the Z-score would be for a 90% confidence interval. First draw the curve with
90% in the middle and find the area of both tails. Then put the area of the right tail in the StatCrunch Normal Calculator with mean 0 and standard deviation 1. The Z-score you get = 1.64485. So to get the margin of error for a 90% confidence interval, you multiply the standard error by 1.64485. Let's use this in the Atlanta dog-owner example.

Now, using the same example, construct a 99% confidence interval. Let's see how the interval changes if we increase the confidence level. We have the point estimate and the standard error, so we just need the new Z-score for this confidence interval. Draw the curve with 99% in the middle and find the area of both tails. Then put the area of the right tail in the StatCrunch Normal Calculator with mean 0 and standard deviation 1. The Z-score is about 2.576. Now create the 99% confidence interval. Notice that the 99% confidence interval is wider than the 90% confidence interval.

In this example we saw that:
As the level of confidence increases, the margin of error increases and the confidence interval gets wider.
ALSO, as the level of confidence decreases, the margin of error decreases and the confidence interval gets narrower.
This applies to all confidence intervals, like in the picture below.

[Figure: a 99% confidence interval drawn wider than a 95% confidence interval]

Why is this true? With a 95% confidence interval, we want to be 95% confident that the population parameter is on the interval. But with a 99% confidence interval, we want to be even more confident, 99% confident, that the population parameter is on the interval. So to be that much more sure the proportion is on the interval, we need a wider interval.

RECAP: The following symbols go along with the following terms when calculating the confidence interval for the population proportion.

[Table of terms and symbols; only the fragments "Error", "Error" and "Interval" are legible in the source]

We have seen what happens when we change the confidence level. What about if we change the sample size?
As the sample size increases, the margin of error decreases and the confidence interval gets narrower.
As the sample size decreases, the margin of
error increases and the confidence interval gets wider.
So the opposite happens when we increase the sample size: the confidence interval gets narrower.

Why is this true? As we increase our sample size, the sample statistic we obtain (whether we are looking for a mean or a proportion) is a better representation of the population. So as we increase our sample size, our point estimate is a better and better estimate, and we don't need such a wide confidence interval.

HOW CAN STATCRUNCH CALCULATE THESE CONFIDENCE INTERVALS FOR US?

Look back at our example where we wanted to get a 90% confidence interval for the population proportion of ALL Atlantans that own dogs. We got the 90% confidence interval, which has a lower limit of 0.50369 and an upper limit of 0.69631. We also got the 99% confidence interval, which has a lower limit of 0.44917 and an upper limit of 0.75083. Guess what: STATCRUNCH can get these values for us.

Go to Stat → Proportions → One Sample → With Summary. Here we can type in how many Atlantans owned dogs in our sample. In our sample, 42 of 70 Atlantans owned dogs. Put those numbers in and hit Next.

[Screenshot: One sample proportion with summary dialog, 42 successes out of 70 observations]

On the next screen choose Confidence Interval, and since we want a 90% confidence interval, change the 0.95 to 0.90.

[StatCrunch output: 90% confidence interval results for p, the proportion of successes in the population; method: Standard-Wald]

It tells us the Sample Proportion, which is the point estimate, 0.6, the same thing we got in part a. It also tells us the lower limit, 0.50369, and the upper limit, 0.69631, the same values we calculated. Notice it also gives us the standard error. The only values it does not give us are the margin of error and the Z-score used in the formula, so we still need to know how to get those by hand.

Now get the 99% confidence interval and check it against our answers of (0.44917, 0.75083).

So we are this far into our formula for the confidence interval for the population mean \(\mu\): \(\bar{x} \pm\) some
number \(\times\ s/\sqrt{n}\). All we have left to find is the "some number." We saw in confidence intervals for the population proportion that this "some number" ended up being a Z-score that corresponded with the level of confidence. For confidence intervals for the population mean, the "some number" still corresponds with the level of confidence, but it is from a new distribution that we call the T-distribution. So if you look under Stat → Calculators, you will see a calculator just called T. Before we see how to get these T values, let's talk about the properties of this T-distribution and how the T-distribution (or T-curve) is different from the normal distribution (or normal curve).

Section 8.3 How Can We Construct a Confidence Interval to Estimate a Population Mean?

Recall from Section 8.1 that confidence intervals can be written in the general format:

point estimate ± margin of error

Remember, the point estimate is a single value that is our best guess for the parameter. What single number is the best guess for a population mean if we only have a sample from the population? The sample mean. So the sample mean is the point estimate part of the confidence interval formula. It is the center of the confidence interval, so now we need to know the margin of error. We need to know what to add to and subtract from the point estimate to get the lower and upper limits of our confidence interval.

Just like in Section 8.2, the margin of error will be some number times the standard error. But the formula for the standard error when we are talking about means is

Standard error = \(s/\sqrt{n}\), where s = standard deviation from our sample.

We saw back in Chapter 7 that the formula for the standard error was \(\sigma/\sqrt{n}\), but we don't know anything about the population, so we don't know \(\sigma\), and we have to use the standard deviation from our sample, s.

Properties of the T-Distribution

The T-distribution is centered at 0 and is symmetric about 0 (like the standard normal distribution).
The total area under the curve is 1. The area to the right of 0 is 0.5 and the area to
the left of 0 is 0.5 (like the standard normal distribution).
The T-distribution is different for different values of n, our sample size.
The area in the tails of the T-distribution is a little greater than the area in the tails of the normal distribution.
As the sample size n increases, the T-curve looks more and more like the normal curve.

Since the T-distribution looks different for different values of n, we always have to type in what we call the degrees of freedom on the T calculator. The degrees of freedom we have to put in the T calculator = n − 1. The degrees of freedom on the T calculator is abbreviated as DF. So DF in StatCrunch = n − 1, and we always have to put that into the T Calculator.

Try some different DF values in StatCrunch and see how the T-distribution changes for different sample sizes. Again, it is Stat → Calculators → T. Try DF = 5. Then try DF = 500 (this one looks more like our normal curve).

So our confidence interval formula for the population mean is:

Lower limit = \(\bar{x} - T \frac{s}{\sqrt{n}}\)
Upper limit = \(\bar{x} + T \frac{s}{\sqrt{n}}\)

These intervals are valid when we
1. use a random sample, AND
2. either use a sample size n > 30 OR sample from a normal population.

So we can get the sample mean, sample standard deviation and n value, but we haven't yet talked about the T value that we want from the T Calculator. Getting the T value is just the same as getting the Z value when we were doing confidence intervals for the population proportion in Section 8.2. The only difference is that the T value depends on BOTH the confidence level and the sample size.

Let's do a few more. These are the same thing they are asking you to get on Homework 8.3-8.4.

a) Find the t-score for a 99% confidence interval for a population mean with 5 observations in our sample.
First draw a curve with 99% in the middle and find the area of both tails. Next put the right tail area, 0.005, in the T Calculator AND put DF = 5 − 1 = 4. Hit Compute and you get T ≈ 4.604.

So now we can construct confidence intervals for the population means. We can
get all the symbols in these formulas:

Lower limit = \(\bar{x} - T \frac{s}{\sqrt{n}}\)
Upper limit = \(\bar{x} + T \frac{s}{\sqrt{n}}\)

Finally, let's do some examples.

Let's find the T value for a 95% confidence interval if the sample size we used is n = 32. First draw a curve with 95% in the middle and find the area of both tails. Next put the right tail area, 0.025, in the T Calculator AND put DF = 32 − 1 = 31.

[Screenshot: StatCrunch T calculator with DF = 31 and right-tail probability 0.025]

Hit Compute and you get T = 2.03951.

Example 7 in Section 8.3: iPods are sold all the time on eBay. We have the prices for a random sample of seven iPods that recently sold on eBay:

235 225 225 240 250 250 210

We will assume these prices are normally distributed. We want to find the 95% confidence interval for the population mean price of iPods sold. In other words, we want to construct an interval of numbers and be 95% confident that if we averaged the price of ALL iPods sold on eBay, the average price would be on our interval. We need to find:

Lower limit = \(\bar{x} - T \frac{s}{\sqrt{n}}\)
Upper limit = \(\bar{x} + T \frac{s}{\sqrt{n}}\)

Let's break it down. n = 7, because we are using a sample of 7 iPods sold on eBay. How do we get \(\bar{x}\) and s? Easy: we can list our seven prices in StatCrunch and go to Stat → Summary Stats → Columns, so \(\bar{x}\) = 233.57143 and s = 14.6385. Now, finally, we need to get the T score.

First draw a curve with 95% in the middle and find the area of both tails. Next put the right tail area, 0.025, in the T Calculator AND put DF = 7 − 1 = 6.

[Screenshot: StatCrunch T calculator with DF = 6 and right-tail probability 0.025]

Hit Compute and you get T = 2.44691.

Using STATCRUNCH to construct confidence intervals

Whenever we have actual data, like in the above eBay example, we can put this data into StatCrunch and StatCrunch will actually calculate these intervals for us. First put the seven eBay prices in a column in StatCrunch.

[Screenshot: the seven prices entered in column var1]

Go to Stat → T Statistics → One Sample → with data. Choose the column you have put the data in and hit Next. Choose Confidence Interval and type in 0.95. Hit Calculate, and StatCrunch displays the 95% confidence interval results.
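As a cross-check on the eBay example, here is a minimal Python sketch of the same lower and upper limit formulas. It is not part of the StatCrunch workflow in these notes: the function name `t_interval` is made up for illustration, and the T value 2.44691 is simply copied from the T Calculator step above rather than computed in code.

```python
from math import sqrt
from statistics import mean, stdev


def t_interval(data, t_value):
    """Return (lower, upper) limits for x-bar +/- T * s / sqrt(n).

    t_value is assumed to be looked up separately (for example in the
    StatCrunch T Calculator with DF = n - 1); it is not computed here.
    """
    n = len(data)
    x_bar = mean(data)          # sample mean, the point estimate
    s = stdev(data)             # sample standard deviation (divides by n - 1)
    margin = t_value * s / sqrt(n)  # margin of error = T * standard error
    return x_bar - margin, x_bar + margin


# The seven eBay prices from Example 7.
prices = [235, 225, 225, 240, 250, 250, 210]

# T for a 95% interval with DF = 7 - 1 = 6, taken from the notes.
T_95_DF6 = 2.44691

lower, upper = t_interval(prices, T_95_DF6)
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

Run as written, this prints limits of 220.03 and 247.11, which can be compared against the StatCrunch results and the by-hand calculation.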
The same amounts we got before:
Lower limit of the confidence interval = 220.03
Upper limit of the confidence interval = 247.11

Now we have everything we need; we can now construct the lower and upper limits of the 95% confidence interval:

Lower limit = $\bar{x} - T \cdot \frac{s}{\sqrt{n}} = 233.57 - 2.447 \cdot \frac{14.64}{\sqrt{7}} = 220.03$
Upper limit = $\bar{x} + T \cdot \frac{s}{\sqrt{n}} = 233.57 + 2.447 \cdot \frac{14.64}{\sqrt{7}} = 247.11$

So we are 95% confident that the mean price of ALL iPods sold on eBay is somewhere between 220.03 and 247.11.

EXTRA QUESTION: According to our confidence interval, is it likely that the population mean price of ALL iPods sold on eBay = 250? No, 250 is not on our confidence interval, so it is not a likely mean price for ALL iPods sold on eBay. We think that mean price should be somewhere between 220.03 and 247.11.

EXTRA QUESTION 2: According to our confidence interval, is it possible that the population mean price of ALL iPods sold on eBay = 225? Yes, 225 is a possible mean price because it is on our interval.

Let's do an example like this where we have to calculate the limits using the summary statistics and not the actual data.

Example: Suppose we are trying to estimate the average of a large population of test scores. Suppose a random sample of 16 test scores is taken from a normal population. If the average of these 16 test scores is $\bar{x} = 78.2$ and the sample standard deviation is $s = 2.55$, then construct a 90% confidence interval for the population mean of all test scores. Let's do this by hand first.

Now let's use StatCrunch to create this confidence interval for us. Go to Stat → T Statistics → One Sample → with summary. Put in our sample mean, sample standard deviation, and sample size (Sample mean: 78.2, Sample std. dev.: 2.55, Sample size: 16). Hit Next and choose a 90% confidence interval. Hit Calculate and here are the results. These are the same values we calculated by
hand.

Section 8.4 How Do We Choose the Sample Size for a Study?

Sometimes, before setup of an experiment or survey, we know that we want the margin of error to be a certain amount. For example, if we are trying to get results for an election and we are going to be getting a sample proportion for the proportion of people who will vote for candidate A, maybe we know that whatever sample proportion we get, we want that to be within 3% of the true population proportion for ALL voters. So we know we want the margin of error = 3%.

We can use a formula to tell us what sample size we need to take so that our margin of error will be 3%, or in other words, so we can be sure that whatever sample proportion we get will be within 3% of the true population proportion with a certain level of confidence. Here is that formula for choosing sample size in estimating a population proportion:

$n = \frac{\hat{p}(1 - \hat{p}) z^2}{m^2}$

where $\hat{p}$ is a guess at the value we think we might get for the sample proportion. If no guess is given, then use $\hat{p} = 0.5$. The $m$ is the margin of error. The Z-score is calculated again based on the level of confidence, just like with the confidence intervals. It represents how confident we want to be that the sample proportion we get will be that close to the true population proportion.

The following symbols go along with the following terms for construction of a confidence interval for a population mean:

Term: Symbol
Point Estimate: $\bar{x}$
Margin of Error: $T \cdot \frac{s}{\sqrt{n}}$
Standard Error: $\frac{s}{\sqrt{n}}$
Confidence Interval: $\bar{x} \pm T \cdot \frac{s}{\sqrt{n}}$

Example 9 in Section 8.4: An election is expected to be close, and we are going to take a sample of people to obtain a sample proportion of the people who voted for candidate A. How large should the sample size be for the margin of error of a 95% confidence interval to equal 0.02?

What we are saying here is that we want to take a sample and get a sample proportion. We then want to create a 95% confidence interval around that sample proportion, and we want the margin of error for that confidence interval to equal 0.02. Sample size
formula: $n = \frac{\hat{p}(1 - \hat{p}) z^2}{m^2}$

$\hat{p} = 0.5$ because we aren't given a guess for $\hat{p}$ to use
$m = 0.02$
$z$ = Z-score based on the 95% level of confidence

First draw the curve with 95% in the middle and find the area of both tails. Then put the area of the right tail in the StatCrunch Normal Calculator with mean 0 and standard deviation 1. The z-score you get = 1.96, just like before.

$n = \frac{0.5(1 - 0.5)(1.96)^2}{(0.02)^2} = 2401$

We would need to take a sample of 2401 people to get a sample proportion that we can be 95% confident will be within 0.02 of the true population proportion.

What if we were given a statement like this: "From a prior estimate, the sample proportion will be about 0.7"? Then

$n = \frac{0.7(1 - 0.7)(1.96)^2}{(0.02)^2} = 2016.84$, which we round up to 2017.

What if we are not dealing with a population proportion example, but a population mean example? That is, we want to know what sample size we need so that the sample mean we get is close enough to the true population mean. For example, maybe we want to estimate the average income for an entire company. We want to take a sample of their employees and get a sample mean of their income. And we want this sample mean income to be within 5000 of the entire company's mean income, with 95% confidence. We can determine what sample size is needed so that whatever sample mean income we get, it will be within 5000 of the population mean income, and we can be 95% confident of that.

Here is the formula we use to determine sample size for estimating the population mean:

$n = \frac{\sigma^2 z^2}{m^2}$

where $\sigma$ is the provided standard deviation, $m$ is the margin of error, and $z$ is obtained just like before.

Chapter 9 Statistical Inference: Significance Tests about Hypotheses (Hypothesis Testing)

What are the steps for performing a Significance Test?

In this section we will introduce the language and steps of significance testing. The procedures will be addressed in later sections of Chapter 9.

Basics of Significance Testing

A statement is made about a population parameter. A claim is made that this statement is incorrect. Evidence (sample data) is collected in order to test the
claim. The data are analyzed in order to support or refute the claim.

Example: A car manufacturer advertises a mean gas mileage of 26 mpg. A consumer group claims that the mean gas mileage is less than 26 mpg. A sample of 33 cars is taken, and the sample mean for these 33 cars is 25.2 mpg.

Significance testing is a procedure, based on sample evidence and probability, used to test claims regarding a characteristic of one or more populations. We use sample data to test hypotheses.

Example: An estimate is needed of the mean height of women in Ontario, Canada. A 95% confidence interval should have a margin of error of 3 inches. A study ten years ago in this province had a standard deviation of 10 inches.
a) About how large a sample of women is needed?
b) About how large a sample of women is needed for a 99% confidence interval to have a margin of error of 3 inches?

The Five Steps of a Significance Test
1. Assumptions
2. Hypotheses
3. Test Statistic
4. P-value
5. Conclusion

1. Assumptions: Each type of test will have certain assumptions that we need to check (ex: is the sample size large enough?).

2. Hypotheses: Each significance test has two hypotheses about a population parameter: the null and alternative hypotheses.

The null hypothesis, denoted $H_0$ (read "H-naught"), is a statement to be tested. The null hypothesis is assumed true until evidence indicates otherwise. In this chapter it will be a statement regarding the value of a population parameter. In our car example, the null hypothesis is $H_0: \mu = 26$ mpg. This is the statement made by the car manufacturer that we have to accept as true before we test the claim.

The alternative hypothesis, denoted $H_A$, is a claim to be tested. Generally this is a statement that says the population parameter has a value different in some way from the value given in the null hypothesis. In experiments we are usually trying to find evidence for the alternative hypothesis. In our car example, the alternative hypothesis is $H_A: \mu < 26$ mpg. This is the claim
made by the consumer group that the mileage is less than what the car manufacturer stated.

Example: Determine whether the significance test is left-tailed, right-tailed, or two-tailed.
a) $H_0: \mu = 26$, $H_A: \mu < 26$
b) $H_0: p = 0.46$, $H_A: p > 0.46$
c) $H_0: \mu = 2$, $H_A: \mu \ne 2$

There are three ways to set up the null and alternative hypotheses:

1. Less than test (left-tailed test)
$H_0$: parameter = some value
$H_A$: parameter < some value
Example: A car manufacturer advertises a mean gas mileage of 26 mpg. A consumer group claims that the mean gas mileage is less than 26 mpg.

2. Greater than test (right-tailed test)
$H_0$: parameter = some value
$H_A$: parameter > some value
Example: A newspaper states that a candidate will receive 46% of the votes in an upcoming election. An analyst believes the percentage will be higher than 46%.

3. Not equal to test (two-tailed test)
$H_0$: parameter = some value
$H_A$: parameter $\ne$ some value
Example: Five years ago the average daily rainfall in a jungle was 2 inches. A scientist thinks it is different now.

We always test about population parameters like $\mu$ and $p$. We never test about sample statistics like $\bar{x}$ and $\hat{p}$, because they change with every sample.

3. Test Statistic: In all of these tests, we will be testing the population parameter based on what we get in a sample. The test statistic tells how far away the sample statistic is from the assumed population parameter. It tells us this information in terms of how many standard errors away the sample statistic we get is from the assumed population parameter.

Think of the car example. The manufacturer states that their cars get 26 mpg. We take a sample of their cars, and in our sample their cars get 25.2 mpg on average. We want to see how far away our sample mean (25.2) is from the assumed population parameter (26) in terms of the standard error.

Example: If the test statistic = -1, then the sample mean of 25.2 was only one standard error below the population mean of 26 mpg. Draw a curve to represent this.

Example: If the test statistic = -2.5, then the sample mean of 25.2
was 2.5 standard errors below the population mean of 26 mpg. Draw a curve to represent this.

We will see the formula for how to calculate this test statistic for different tests in Sections 9.2 and 9.3.

4. P-value: The p-value is the probability of getting a sample statistic as far away or further from the parameter as we did, if the null hypothesis is true. In our car example, the p-value is the probability that we would get a sample mean of 25.2 or lower if the true population mean equals 26.

Example: If the test statistic = -1, then the sample mean of 25.2 was only one standard error below the population mean of 26 mpg. Draw a curve and SHADE IN THE AREA REPRESENTING THE P-VALUE.

Example: If the test statistic = -2.5, then the sample mean of 25.2 was 2.5 standard errors below the population mean of 26 mpg. Draw a curve and SHADE IN THE AREA REPRESENTING THE P-VALUE.

5. Conclusion: At the end of these significance tests, we state our conclusion. Our conclusion will be one of these two:

1. If our sample statistic is significantly different from the stated population parameter (like if our sample mean is way off from the stated population mean), then we reject $H_0$ and accept $H_A$. When the null hypothesis is rejected, we say that there is enough evidence to reject the null hypothesis and accept the alternative hypothesis.

2. If our sample statistic is NOT significantly different from the stated population parameter (like if our sample mean is pretty close to the stated population mean), then we DO NOT reject $H_0$ and we CANNOT accept $H_A$. When the null hypothesis is NOT rejected, we say there is NOT enough evidence to reject the null hypothesis and accept the alternative hypothesis.

NOTE: We NEVER accept $H_0$; we just don't reject $H_0$.

If this p-value is NOT very small, then we got a sample statistic pretty close to the population parameter we were testing, and we will NOT reject that population parameter (like in our first example, where the sample mean was only one standard error below the population mean).
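To make the two examples concrete, here is a minimal sketch that computes the shaded left-tail area for test statistics of -1 and -2.5. It assumes the reference curve is the standard normal; the notes leave the actual calculator choice to Sections 9.2 and 9.3, so this is only an illustration, and the function name `left_tail_p_value` is hypothetical.

```python
from statistics import NormalDist


def left_tail_p_value(test_statistic: float) -> float:
    """Area to the left of the test statistic under a standard normal curve.

    Assumption: a standard normal reference curve (a sketch, not the
    StatCrunch procedure used in the notes).
    """
    return NormalDist(mu=0.0, sigma=1.0).cdf(test_statistic)


# Sample mean one standard error below the assumed mean of 26 mpg.
p_one_se = left_tail_p_value(-1.0)
# Sample mean 2.5 standard errors below the assumed mean of 26 mpg.
p_far = left_tail_p_value(-2.5)

print(f"test statistic -1.0 -> p-value {p_one_se:.4f}")
print(f"test statistic -2.5 -> p-value {p_far:.4f}")
```

Under this assumption, the first p-value comes out near 0.16 (not small) and the second near 0.006 (small), which matches the shading exercise: the farther the test statistic is from 0, the smaller the tail area.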
BUT if this p-value is very small, then we got a sample statistic pretty far away from the population parameter we were testing, and we will reject that population parameter (like in our second example, where the sample mean was 2.5 standard errors below the population mean).

We will see how to use our calculators in StatCrunch to calculate this p-value in Sections 9.2 and 9.3.

Example: According to the American Hotel Association, the average price of a room was 78.62 dollars per night in 1998. An analyst believes that this value has increased since then.
a) Determine the null and alternative hypotheses. What type of test is this?
b) Suppose sample data indicate that we should not reject the null hypothesis. State the conclusion of the researcher.

9.2 Significance Tests about Proportions

Let's look at an example of a significance test about a population proportion before we get to the steps in the test.

Example: A magazine states that 40% of the population will vote for candidate A in the upcoming election. We claim that the proportion is higher than 40%. State the null and alternative hypotheses.

To test this claim, let's say we take a random sample of 200 people and ask those 200 people who they are going to vote for. This will give us a sample proportion: the proportion in our sample of 200 that will vote for candidate A. We will then calculate a test statistic (we will see the formula for this in a few pages) that will tell us how many standard errors away from 0.4 our sample proportion is. If our sample proportion is really far above 0.4, then we will reject the null hypothesis and accept our claim (the alternative hypothesis). If the sample proportion is not that far above 0.4, then we will not reject the null hypothesis.

3. Test Statistic: The test statistic tells us how far the sample proportion we get ($\hat{p}$) falls from the assumed population proportion ($p_0$) if the null hypothesis is true. The test statistic will tell us how many standard errors away from $p_0$ our sample proportion $\hat{p}$ is. To get this test
statistic value, here is the formula:

$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$

where $\hat{p}$ = our sample proportion, $p_0$ = assumed population proportion that we are testing, and $n$ = sample size we used to get our sample proportion.

So again, our test statistic is how far away our sample proportion is from our assumed population proportion ($\hat{p} - p_0$) in terms of the standard error (divided by $\sqrt{p_0(1 - p_0)/n}$). And as you can see, we treat this test statistic as a Z-score. Again, we will calculate these values in examples on the next few pages.

Steps in a Significance Test about a Population Proportion

We will do an example of this in a few pages; for now, let's just talk through the steps.

1. Assumptions: When performing significance tests about a population proportion, we need the following three assumptions to be true:
a) The data is categorical.
b) The data are obtained using randomization (like a random sample).
c) We need the shape of the sampling distribution of the sample proportion to be approximately normal, SO we need $n p_0 \ge 15$ AND $n(1 - p_0) \ge 15$, where $p_0$ = assumed population proportion we are testing.

2. Hypotheses: We set up the hypotheses just like we have seen before. The null hypothesis will be $p$ = value (like $p = 0.4$). If it is a two-tailed test, the alternative hypothesis will be $p \ne$ value (like $p \ne 0.4$). If it is a left- or right-tailed test, the alternative hypothesis will be either $p <$ value or $p >$ value (like $p > 0.4$).

4. P-value: The p-value is the probability that we would get a sample proportion that far away from the assumed population proportion, assuming that the null hypothesis is true (in other words, assuming that the assumed population proportion is correct). Draw an example of a curve showing this.

But we have turned our sample proportion into a test statistic (we have turned it into a Z-score in Step 3) so that we can find this area using the Normal Calculator in StatCrunch. So we now want to find the probability that we would get this extreme a Z-score under our standard normal curve. So we need to look back at our alternative hypothesis
to see what type of test we are using.

If it is a right-tailed test ($p >$ value), then the p-value is the area to the right of this Z-score under the standard normal curve.

If it is a left-tailed test ($p <$ value), then the p-value is the area to the left of this Z-score under the standard normal curve.

If it is a two-tailed test ($p \ne$ value), then the p-value is the sum of the area to the left of the negative Z-score plus the area to the right of the positive Z-score. This sounds complicated, but it's not that bad. We will do examples and see that we can just find the area to the right of the positive Z-score and double it, because the graph is symmetrical.

5. Conclusion: The p-value is going to lead us to our conclusion. In the problems, we will always be given a significance level to compare our p-value to.

If the p-value < significance level, then the probability of us obtaining that sample proportion was really small. So we did get a sample proportion far enough away from the population proportion to reject it. So if the p-value < significance level, we reject the null hypothesis and accept our claim (the alternative hypothesis).

If the p-value > significance level, then the probability of us obtaining that sample proportion was not that small, and the sample proportion is not that far away from the population proportion. So if the p-value > significance level, we do not reject the null hypothesis. We don't have enough evidence based on this sample to reject the null hypothesis.

The last four pages may have gone right by you, but that's okay; these things are much easier to see in examples, so let's take a look at one.

3. Test Statistic: We want to see how far away our sample proportion is from the assumed population proportion in terms of the standard error. With $\hat{p} = 86/200 = 0.43$, $p_0 = 0.40$, and $n = 200$:

$z = \frac{0.43 - 0.40}{\sqrt{\frac{0.40(1 - 0.40)}{200}}} = 0.86603$

So our sample proportion is 0.86603 standard errors above the assumed population proportion.

4. P-value: Because $H_A$ has >, this is a right-tailed test. Draw a graph with the population proportion in the middle and shade the area to the right
of our sample proportion. This shaded area is the p-value. Draw a standard normal curve and shade the area to the right of our Z-score. This shaded area is also the p-value.

Example: A magazine states that 40% of the population will vote for candidate A in the upcoming election. We claim that the proportion is higher than 40%. In a random sample of 200 people, 86 said they would vote for candidate A. Is this sufficient evidence to claim that the proportion is higher than 40% at the 0.05 significance level?

1. Assumptions: Is this a random sample, and is the data categorical? Is $n p_0 \ge 15$ and $n(1 - p_0) \ge 15$?

2. Hypotheses: Set up the null and alternative hypotheses.

Before we get to step 3, what is $\hat{p}$ (the sample proportion)?

We need to find that area to the right of our Z-score. We can do this in StatCrunch by putting in the Z-score and finding the area to the right of it (Normal Calculator with mean 0 and standard deviation 1).

So our p-value = 0.19324. This means that there is a 0.19324 (or a 19.324%) chance that we would get a sample proportion this far off from the population proportion if the assumed population proportion is correct. So if the population proportion is correct, then our sample proportion isn't that rare. There was a 19.324% chance of us getting a sample proportion this far from the population proportion. So we probably won't reject the null hypothesis, but we still need to compare this p-value to the level of significance that they gave us to see if it is significant.

5. Conclusion: $0.19324 > 0.05$. The p-value > the given significance level, so we will not reject $H_0$ and state that there is not sufficient evidence to reject $H_0$ and accept $H_A$ at the 0.05 level of significance.

Interpretation: So we are not rejecting the magazine's statement that the population proportion of people voting for candidate A is 40%. Our sample proportion was not far enough away from 0.4 (or 40%) to reject it. This was done at the 5% level of significance. The level of significance is saying
that if we wanted to reject the null hypothesis, there needed to be less than a 5% chance of us getting the sample proportion that we got, and there wasn't. There was a 19.324% chance of us getting our sample proportion or higher.

Same test statistic (0.86603) and same p-value (0.1932). So much easier!

Now use StatCrunch to check and see how the conclusion would have changed if, in our sample, 96 out of 200 said they were voting for candidate A instead of only 86 out of 200.

Now that we have done all this work and seen the steps in a hypothesis test, let's see how StatCrunch can easily do it. Go to Stat → Proportions → One Sample → with summary. In our sample, 86 out of the 200 people said they would vote for candidate A. So the number of yeses (or successes) = 86, and the number of observations = 200.

Hit Next and choose Hypothesis Test. The magazine states that 40% (or 0.4) of the entire population will vote for candidate A. So enter the null hypothesis that the proportion = 0.4. Our alternative hypothesis is our claim: we think that proportion is really higher than 40% (or 0.4), so choose >. Hit Calculate, and what we get is on the next page.

Example: In a 1998 article, the credit card industry asserted that 50% of college students carry a credit card balance from month to month. In a random sample of 300 college students, 174 carried a balance each month. Is this sufficient evidence to claim that the proportion is different from 50% at the 0.01 significance level?

1. Assumptions: Is this a random sample, and is the data categorical? Is $n p_0 \ge 15$ and $n(1 - p_0) \ge 15$?

2. Hypotheses: Set up the null and alternative hypotheses.

Before we get to step 3, what is $\hat{p}$ (the sample proportion)?

3. Test Statistic: We want to see how far away our sample proportion is from the assumed population proportion in terms of the standard error. With $\hat{p} = 174/300 = 0.58$, $p_0 = 0.50$, and $n = 300$:

$z = \frac{0.58 - 0.50}{\sqrt{\frac{0.50(1 - 0.50)}{300}}} = 2.77128$

So our sample proportion is 2.77128 standard errors above
the assumed population proportion.

4. P-value: Because $H_A$ has $\ne$, this is a two-tailed test. Draw a graph with the population proportion in the middle and shade the area to the right of our sample proportion. Also shade the same symmetrical area on the left. This shaded area of both tails is the p-value. Draw a standard normal curve and shade the area to the right of our positive Z-score and to the left of our negative Z-score. This shaded area is also the p-value.

5. Conclusion: $0.00558 \le 0.01$. The p-value $\le$ the given significance level, so we will reject $H_0$ and state that there is sufficient evidence to reject $H_0$ and accept $H_A$ at the 0.01 level of significance.

Interpretation: We are rejecting the credit card company's statement that the population proportion of college students that carry credit card debt is 50%, and accepting our claim that the population proportion of college students that carry credit card debt is different from 50%. We got a sample proportion far enough away from 50% that we could make this claim. This was done at the 1% level of significance. The level of significance is saying that if we wanted to reject the null hypothesis, there needed to be less than a 1% chance of us getting the sample proportion that we got, and there was. There was only a 0.558% chance of us getting a sample proportion this far from the population proportion.

We need to find the area in both tails because this is a two-tailed test. We can do this on StatCrunch by putting in the positive Z-score and finding the area to the right of it. Our area in the left tail (area to the left of the negative Z-score) will be the same because the graph is symmetrical.

So our p-value = $0.00279 \times 2 = 0.00558$. This means that there is a 0.00558 (or only a 0.558%) chance that we would get a sample proportion this far off from the population proportion if the assumed population proportion is correct.

Again, let's see how StatCrunch can easily do it for us. Go
to Stat → Proportions → One Sample → with summary. In our sample, 174 out of the 300 college students had credit card debt. So the number of yeses (or successes) = 174, and the number of observations = 300.

Hit Next and choose Hypothesis Test. The credit card company states that 50% (or 0.5) of the college students carry credit card debt. So enter the null hypothesis that the proportion = 0.5. Our alternative hypothesis is our claim that we think that proportion is really different from 50% (or 0.5), so choose $\ne$. Hit Calculate, and what we get is on the next page.

Same test statistic (2.77128) and same p-value (0.0056).

SUMMARY: Steps of a Significance Test for a Population Proportion $p$

1. Assumptions
- Categorical variable, with population proportion $p$ defined in context
- Randomization, such as a simple random sample, for gathering data
- Expect at least 15 successes and 15 failures under $H_0$; that is, $n p_0 \ge 15$ and $n(1 - p_0) \ge 15$. This is mainly important for one-sided tests.

2. Hypotheses
Null: $H_0: p = p_0$, where $p_0$ is the hypothesized value (such as $H_0: p = 0.50$)
Alternative: $H_a: p \ne p_0$ (such as $H_a: p \ne 0.50$) (two-sided), or $H_a: p < p_0$ (one-sided), or $H_a: p > p_0$ (one-sided)

3. Test statistic
$z = \frac{\hat{p} - p_0}{se_0}$ with $se_0 = \sqrt{p_0(1 - p_0)/n}$

4. P-value
Alternative hypothesis: P-value
$H_a: p > p_0$: Right-tail probability
$H_a: p < p_0$: Left-tail probability
$H_a: p \ne p_0$: Two-tail probability

5. Conclusion
Smaller P-values give stronger evidence against $H_0$ and supporting $H_a$. If a decision is needed, reject $H_0$ if the P-value is less than or equal to the preselected significance level (such as 0.05). Relate the conclusion to the context of the study.

Section 9.3 Significance Tests about Means

In this section we run significance tests to test values of population means that have been given, like our car manufacturer example.

Example: A car manufacturer advertises a mean gas mileage of 26 mpg. A consumer group claims that the gas mileage is less. A sample of 33 cars is taken, and the sample mean for these 33 cars is 25.2 mpg.

Before we work through examples, we need to see how the 5 steps in the test have changed. We will see the big change is the new formula for the test statistic and the fact that we treat the test statistic as a t-value rather than a Z-score like we used in Section 9.2.

Steps in a Significance Test about a Population Mean

1. Assumptions: When performing significance tests about a population mean, we need the following assumptions to be true:
a) The variable is quantitative.
b) The data are obtained using randomization (like a random sample).
c) The population distribution is approximately normal, or we are using a sample size $\ge 30$.

2. Hypotheses: We set up the hypotheses just like we have seen before. The null hypothesis will be $\mu$ = value (like $\mu = 26$). If it is a two-tailed test, the alternative hypothesis will be $\mu \ne$ value (like $\mu \ne 26$). If it is a one-tailed test, the alternative hypothesis will be either $\mu <$ value or $\mu >$ value (like $\mu < 26$).

3. Test Statistic: Remember, the test statistic tells us how far the sample mean we get ($\bar{x}$) is from the assumed population mean ($\mu_0$) if the null hypothesis is true. The test statistic will tell us how many standard errors away from $\mu_0$ our sample mean $\bar{x}$ is. To get this test statistic value, here is the formula:

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

where $\bar{x}$ = our sample mean, $\mu_0$ = the assumed population mean that we are trying to prove wrong, $s$ = the standard deviation from the sample, and $n$ = sample size we used to get our sample mean.

So again, our test statistic is how far away our sample mean is from our assumed population mean ($\bar{x} - \mu_0$) in terms of the standard error (divided by $s / \sqrt{n}$). And as you can see, we treat this test statistic as a t-value, so we will need to use the T calculator in StatCrunch to get our p-values for these types of significance tests.

5. Conclusion: This is the same as before:
p-value < level of significance: reject $H_0$ and accept $H_A$
p-value > level of significance: do not reject $H_0$ and do not accept $H_A$

Now let's work through some examples.

4. P-value: The p-value is the probability that we would get a sample
mean that far away from the assumed population mean, assuming that the null hypothesis is true (in other words, assuming that the given population mean is correct). Draw a curve to see what this would look like.

But we have turned our sample mean into a test statistic (we have turned it into a t-value in Step 3). So we now want to find the probability that we would get this extreme a t-value under our t-curve. Draw this curve.

This process will be exactly the same, except we are now just using the T calculator in StatCrunch. So we need to look back at our alternative hypothesis to see what type of test we are using.

If it is a > (right-tailed) test, we need the area to the right of our t-value.
If it is a < (left-tailed) test, we need the area to the left of our t-value.
If it is a $\ne$ (two-tailed) test, we need the area to the right of our positive t-value and the area to the left of the negative t-value. But again, just like before, we can just find one of these areas and multiply it by two, because the t-curve is also symmetrical.

Example: A car manufacturer advertises a mean gas mileage of 26 miles per gallon. A consumer group claims that the gas mileage is less. A random sample of 33 cars has a sample mean of 25.2 mpg and a sample standard deviation of 2.9 mpg. Assume the gas mileages are approximately normally distributed. Is there evidence at the significance level of 0.05 to conclude the mpg is less than 26?

1. Assumptions: Is this a random sample, and is the variable quantitative? Is the population normally distributed?

2. Hypotheses: Set up the null and alternative hypotheses.

3. Test Statistic:

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{25.2 - 26}{2.9 / \sqrt{33}} = -1.58471$

We want to see how far away our sample mean is from the assumed population mean in terms of the standard error. So our sample mean is 1.58471 standard errors below the assumed population mean.

4. P-value: Because $H_A$ has <, this is a left-tailed test. Draw a graph with the population mean in the middle and shade the area to the left of our sample mean. This shaded area is the p-value. Draw a t-curve and shade the area to the left of our t-value. This shaded area is also the p-value.

5. Conclusion: $0.06143 > 0.05$. The p-value > the given significance level, so we will NOT reject $H_0$ and state
that there is NOT enough evidence to reject $H_0$ and accept $H_A$ at the 0.05 level of significance.

Interpretation: So we are not rejecting the car manufacturer's statement that the mean gas mileage of all their cars is 26 mpg, and we are not accepting our claim that the mean gas mileage of all their cars is less than 26 mpg, because we got a sample mean that was NOT significantly below 26 mpg. This was done at the 5% level of significance. This means that we did not reject the null hypothesis because there was not less than a 5% chance of us getting a sample mean this far off from the population mean.

We need to find that area to the left of our t-value. We can do this in StatCrunch by putting in the t-value and finding the area to the left of it.

So our p-value = 0.06143. This means that there is a 0.06143 (or a 6.143%) chance that we would get a sample mean this far off from the population mean if the assumed population mean is correct.

Now let's see how StatCrunch can do it for us. Go to Stat → T Statistics → One Sample → with summary. Put in the sample mean, sample standard deviation, and sample size (Sample mean: 25.2, Sample std. dev.: 2.9, Sample size: 33).

Hit Next and choose Hypothesis Test. Put in mean = 26 for the null hypothesis, and since it is a left-tailed test, put in < for the alternative hypothesis. Hit Calculate to see the results.

Problem on Homework Chapter 9: An industrial plant claims to discharge no more than 1000 gallons of wastewater per hour, on the average, into a neighboring lake. An environmental action group decides to monitor the plant in case this limit is being exceeded. Doing so is expensive, and only a small sample is possible. A random sample of four hours is selected over a period of a week. Test at
the 0.05 significance level. Assume the distribution of wastewater is approximately normal. The observations are below:

2000 1500 3000 2500

1. Assumptions: Is this a random sample, and is the variable quantitative? Is the population normally distributed?

2. Hypotheses: Set up the null and alternative hypotheses.

We aren't given the sample mean and standard deviation, but we can put these numbers into StatCrunch (Stat → Summary Stats → Columns) and get that the sample mean = 2250 and the sample standard deviation = 645.49725.

We need to find that area to the right of our t-value. We can do this in StatCrunch by putting in the t-value and finding the area to the right of it.

So our p-value = 0.01523. This means that there is only a 0.01523 (or a 1.523%) chance that we would get a sample mean this far off from the population mean if the assumed population mean is correct.

3. Test Statistic: We want to see how far away our sample mean is from the assumed population mean in terms of the standard error.

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{2250 - 1000}{645.49725 / \sqrt{4}} = 3.87298$

So our sample mean is 3.87298 standard errors above the assumed population mean.

4. P-value: Because $H_A$ has >, this is a right-tailed test. Draw a graph with the population mean in the middle and shade the area to the right of our sample mean. This shaded area is the p-value. Draw a t-curve and shade the area to the right of our t-value. This shaded area is also the p-value.

5. Conclusion: $0.01523 \le 0.05$. The p-value $\le$ the given significance level, so we will reject $H_0$ and state that there is sufficient evidence to reject $H_0$ and accept $H_A$ at the 0.05 level of significance. So we are rejecting the industrial plant's statement that the population mean of the wastewater per hour that the plant puts out is 1000 gallons, and accepting our claim that the population mean is greater than 1000 gallons.

Now let's see how StatCrunch can do it for us. Put in the four values from our data. Go to Stat → T Statistics → One Sample → with data and select the column you
put the data in. Choose Hypothesis Test. Our null hypothesis is that the mean = 1000 gallons, so input 1000 in the first blank. Our claim, the alternative hypothesis, is that it is greater than 1000 gallons, so put in > in the second blank and hit Calculate.

[StatCrunch hypothesis test results: $\mu$ = mean of variable; $H_0$: $\mu = 1000$; $H_A$: $\mu > 1000$]

Look: the same test statistic, 3.87298, and the same p-value, 0.0152.

Section 9.4 Decisions and Types of Errors in Significance Tests

Type I and Type II errors

As we have stated earlier, we use sample data to determine whether or not we will reject the null hypothesis. BUT because we are using only sample information, we could reach an incorrect conclusion.

Four possible outcomes from Significance Testing:
1. We reject $H_0$ when in fact $H_0$ is false and $H_A$ is true. This decision is correct.
2. We do not reject $H_0$ when in fact $H_0$ is true. This decision is correct.
3. We reject $H_0$ when in fact $H_0$ is true. This is incorrect. This is called a Type I error.
4. We do not reject $H_0$ when in fact $H_0$ is false and $H_A$ is true. This is incorrect. This is called a Type II error.

Below is a chart that breaks down these outcomes.

[Chart of the four outcomes not recoverable]

SUMMARY: Steps of a Significance Test for a Population Mean $\mu$

1. Assumptions
   - Quantitative variable, with population mean $\mu$ defined in context
   - Data are obtained using randomization, such as a simple random sample
   - Population distribution is approximately normal
2. Hypotheses
   - Null: $H_0$: $\mu = \mu_0$, where $\mu_0$ is the hypothesized value (such as $H_0$: $\mu = 0$)
   - Alternative: $H_a$: $\mu \ne \mu_0$ (two-sided), or $H_a$: $\mu < \mu_0$ (one-sided), or $H_a$: $\mu > \mu_0$ (one-sided)
3. Test statistic
   $t = \dfrac{\bar{x} - \mu_0}{se}$, where $se = s/\sqrt{n}$
4. P-value: Use the t distribution (Table B) with $df = n - 1$
   - $H_a$: $\mu \ne \mu_0$ gives a two-tail probability
   - $H_a$: $\mu > \mu_0$ gives a right-tail probability
   - $H_a$: $\mu < \mu_0$ gives a left-tail probability
5. Conclusion: Smaller P-values give stronger evidence against $H_0$ and supporting $H_a$. If using a significance level to make a decision, reject $H_0$ if the P-value is less than or equal to the significance level (such as 0.05). Relate the conclusion to the context of the study.
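The steps of the one-sample t test can also be checked in code. The following is a minimal sketch, not StatCrunch's own method: it recomputes the test statistic and the right-tail P-value for the four wastewater observations using only the Python standard library. The helper names (`t_pdf`, `right_tail_p`) and the use of Simpson's rule to integrate the t density are illustrative choices made here; the hypothesized mean of 1000 gallons comes from the example.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def right_tail_p(t, df, steps=10_000):
    """Area to the right of t under the t curve.

    Integrates the density from 0 to |t| with Simpson's rule
    (steps must be even), then subtracts from the half-area 0.5.
    """
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    upper = 0.5 - area_0_to_t            # area beyond |t| on one side
    return upper if t >= 0 else 1 - upper


data = [2000, 1500, 3000, 2500]          # observations from the example
mu0 = 1000                               # hypothesized mean under H0

n = len(data)
xbar = statistics.mean(data)             # sample mean
s = statistics.stdev(data)               # sample standard deviation (n - 1 divisor)
se = s / math.sqrt(n)                    # standard error
t_stat = (xbar - mu0) / se               # test statistic
p_value = right_tail_p(t_stat, df=n - 1) # right-tailed because HA uses ">"

print(f"mean = {xbar}, s = {s:.3f}, t = {t_stat:.5f}, p = {p_value:.5f}")
```

The printed test statistic and P-value should agree with the StatCrunch values quoted in the example (about 3.87298 and 0.01523). For a left-tailed test the P-value would be `1 - right_tail_p(t_stat, df)`, and for a two-sided test it would be `2 * right_tail_p(abs(t_stat), df)`.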
Example: According to the US Department of Justice, the mean age of a death row inmate in 1980 was 36.7 years. A district attorney believes that the mean age of a death row inmate is different today.

a) Determine the null and alternative hypotheses.

b) If a Type I Error was made, which of the following statements would apply?
   i. The null hypothesis was not rejected.
   ii. The null hypothesis was rejected.
   iii. The researcher decided there was enough evidence to indicate a change in the mean age.
   iv. The researcher decided there was not enough evidence to indicate a change in the mean age.
   v. Actually, the mean age has changed significantly.
   vi. Actually, the mean age has not changed significantly.

c) If a Type II Error was made, which of the following statements would apply?
   i. The null hypothesis was not rejected.
   ii. The null hypothesis was rejected.
   iii. The researcher decided there was enough evidence to indicate a change in the mean age.
   iv. The researcher decided there was not enough evidence to indicate a change in the mean age.
   v. Actually, the mean age has changed significantly.
   vi. Actually, the mean age has not changed significantly.

The Level of Significance

We never really know whether the results of a significance test result in an error or not. However, just as we place a level of confidence in the construction of a confidence interval, we can limit the probability of making errors. The symbol we use is $\alpha$ = level of significance. We will always be given the level of significance when we are performing significance tests.

For example, they might say "Run this test at the 5% level of significance," which would mean $\alpha = 0.05$. What they are really saying is "If you are going to reject the null hypothesis, make sure there is less than a 5% chance that you would get this extreme a sample statistic" if the null hypothesis were true.

We then take the calculated p-value and compare it to this level of significance.

1. If our calculated p-value is less than the stated significance level, we can reject the null hypothesis.
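To see what the significance level controls, the sketch below simulates many samples from a population in which the null hypothesis is actually true and counts how often the decision rule rejects it anyway, which is a Type I error. This is an illustration under stated assumptions rather than anything from the notes: it swaps in a one-sample z test with a known population standard deviation so that only the Python standard library is needed, and the population values (`MU0`, `SIGMA`), the sample size, and the number of trials are all hypothetical.

```python
import math
import random
from statistics import NormalDist, mean

ALPHA = 0.05     # level of significance
MU0 = 1000.0     # hypothesized mean; H0 is TRUE in this simulation (assumed value)
SIGMA = 600.0    # population standard deviation, treated as known (assumed value)
N = 25           # sample size (assumed value)
TRIALS = 4000    # number of simulated studies (assumed value)


def reject_h0(p_value, alpha=ALPHA):
    """Decision rule: reject H0 when the p-value is at most alpha."""
    return p_value <= alpha


def right_tailed_z_p(sample, mu0, sigma):
    """P-value for H0: mu = mu0 versus HA: mu > mu0 with sigma known (z test)."""
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return 1 - NormalDist().cdf(z)


rng = random.Random(0)   # fixed seed so the run is repeatable
type_i_errors = 0
for _ in range(TRIALS):
    # Draw a sample from a population where H0 really holds.
    sample = [rng.gauss(MU0, SIGMA) for _ in range(N)]
    if reject_h0(right_tailed_z_p(sample, MU0, SIGMA)):
        type_i_errors += 1   # rejected a true H0: Type I error

type_i_rate = type_i_errors / TRIALS
print(f"Rejected a true H0 in {type_i_rate:.1%} of {TRIALS} simulated studies")
```

The observed rejection rate should land near 5%, which is the sense in which choosing $\alpha = 0.05$ limits the chance of a Type I error. Lowering `ALPHA` makes Type I errors rarer but, for a fixed sample size, makes it harder to reject a false null hypothesis.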
Interpretation of Error

$H_0$: the defendant is innocent
$H_A$: the defendant is guilty

A Type I Error means we reject the null hypothesis (so we don't think the defendant is innocent) when the null hypothesis is really true (the defendant really is innocent). So in this type of error, we have put an innocent person in jail.

A Type II Error means we do not reject the null hypothesis (so we think the defendant is innocent) when the null hypothesis is really false (the defendant really is guilty). So in this type of error, we have let a guilty person go free.

Section 9.5 Limitations of Significance Tests

We can also perform significance tests using a confidence interval, and we have actually already done problems like this. If the confidence interval contains the value we are testing, then we do not reject the null hypothesis. If the confidence interval does not contain the value we are testing, then we do reject the null hypothesis.

So which is better, a significance test or a confidence interval? The consensus is that confidence intervals are actually better. Why? The reason is that confidence intervals actually give us a range of possible values for the population parameter. In a significance test, we can only test if the population parameter is above, below, or not equal to a stated value.

Chapter One: Introduction to Statistics

1.1 What is Statistics?

Statistics: The science of designing studies and analyzing the data that those studies produce. Statistics is the science of learning from data.

Example: Predicting an Election Using an Exit Poll

Example: A college dean is interested in learning about the average age of faculty at the college. The dean takes a random sample of 30 faculty members and averages their 30 ages. Match the following:

A. population   B. sample   C. subject   D. parameter   E. statistic

the average age of all faculty members at the college
30 randomly selected faculty members at the college
a single faculty member from the sample
all faculty members at the college
the average age of the 30 randomly selected faculty members at the college

1.2 We Learn about Populations using Samples

Population: The total set of subjects in which we are interested.
Ex: the entire voting public

Sample: A subset of the population for whom we have data.
Ex: 200 randomly selected voters

Subject: entities that we measure in a study.
Ex: each voter in the sample

Parameter: A numerical value summarizing the population data.
Ex: proportion of voters voting for candidate A in the entire population

Statistic: A numerical value summarizing the sample data.
Ex: proportion of voters voting for candidate A in our sample (the 200 randomly selected voters)

Symbols we use in Statistics

In the previous example, we were interested in the average age of all faculty members at the University of Georgia. Whenever we are interested in an average for a full population, there is a lowercase Greek symbol we use to denote this value: $\mu$. It is pronounced "mu." So in this previous example, we would say $\mu$ = average age for all faculty members at UGA.

Now, most of the time in real life, we will not be able to actually calculate this value, so we try for the next best thing. Like in the previous example, instead of trying to find every single faculty member, we are okay with just getting an average for 30 randomly selected faculty members and using that as our estimate. Whenever we have an average calculated from a sample, like in this case from 30 randomly selected faculty members, there is a symbol we use for that average from the sample: $\bar{x}$. So in this example, $\bar{x}$ = the average age for 30 randomly selected faculty members.

$\mu$ represents an average calculated from a full population.
$\bar{x}$ represents an average calculated from a sample.

More Symbols

In the earlier election example, we were interested in studying proportions, not averages. So when we are studying proportions, there are other symbols we use. If we have the proportion for an entire population, like the proportion of voters voting
for candidate A in the entire population, we use the letter $p$ to denote this population proportion. However, if we have the proportion for just a sample, like the proportion of voters voting for candidate A in our sample (the 200 randomly selected voters), we use the symbol $\hat{p}$ to denote this sample proportion.

$p$ represents a proportion calculated from a full population.
$\hat{p}$ represents a proportion calculated from a sample.

Aspects of Statistics

1. Design: How to obtain the data to answer questions of interest.
   Ex: use a survey, set up an experiment
2. Description: Summarizing the obtained data. Describe the sample data.
   Ex: bar graph
3. Inference: Making decisions and predictions based on the sample data. Predict using the sample data.
   Ex: We predict that for all voters, the proportion voting for candidate A will be higher than the proportion voting for candidate B.

Sampling Methods

Sampling: obtaining subjects from a population to participate in a certain study.

What are different ways to sample?

Types of Sampling

1. Simple Random Sampling: every subject has an equally likely chance of being selected for the sample. Usually samples are chosen using a random number table.
2. Stratified Sampling: the population is divided into nonoverlapping groups called strata, and a simple random sample is then obtained from each group.
3. Cluster Sampling: the population is divided into nonoverlapping groups, and all individuals within a randomly selected group or groups are sampled.
4. Convenience Sampling: sampling where the individuals are easily obtained. Internet surveys are convenience samples. Studies that use convenience sampling generally have results that are suspect.
5. Systematic Sampling: selecting every kth subject from the population.

The difference between stratified and cluster sampling is that stratified sampling samples some individuals from all groups, whereas cluster sampling samples all individuals from some groups.

Example: Name that sampling method

There are 300
passengers on a flight from Atlanta to Denver. We need to survey a random sample of these passengers. Name each of the sampling methods described below.

- Pick every 10th passenger as people board the plane.
- From the boarding list, randomly choose 5 people flying first class and 25 people flying coach.
- Randomly generate 30 seat numbers and survey the passengers sitting in those seats.
- Select the first 30 that enter the plane.
- Randomly select several rows and survey all of the passengers sitting on those rows.

Chapter Two: Exploring and Summarizing Data

Variable: Characteristic that we are studying.

2.1 What are the Types of Data?

1. Categorical: Classifies subjects based on some attribute or characteristic. Each observation belongs to a set of categories.
   Ex: Gender (male or female), political affiliation (republican, democrat, other)
2. Quantitative: Provides numerical measures of subjects. The variable takes on numerical values.
   Ex: Height, Weight, SAT Score

[Diagram: a variable is either categorical or quantitative; a quantitative variable is either discrete or continuous]

Two Kinds of Quantitative Variables

1. Discrete: a countable number of values.
   Ex: number of students in each class at UGA, number of words on each page of these notes
2. Continuous: an uncountable number of values. Continuous variables are usually variables that can take on all values on an interval.
   Ex: Height, Weight, Temperature

Example: Identify each of the following as categorical or quantitative variables. If quantitative, identify further as discrete or continuous.

1. The length of time in minutes until a pain reliever begins to work
2. The brand of refrigerator found in a home
3. The number of files on a hard drive

Frequency Tables

Frequency: number of occurrences

Frequency table: Lists the number of observations for each category of data.

Ex: Favorite type of cookie of 30 consumers

Sugar, Peanut Butter, Chocolate Chip, Chocolate Chip, Oreo, Chocolate Chip, Oatmeal Raisin, Chocolate Chip, Chocolate Chip, Chocolate Chip, Sugar, Brownie, Oatmeal Raisin, Oreo, Brownie, Peanut Butter, Oatmeal Raisin, Chocolate Chip, Oatmeal Raisin, Oreo, Peanut Butter, Brownie, Chocolate Chip, Oreo
[remaining entries not recoverable]

Instead of frequency, sometimes we are interested in the proportion or percentage of observations within a certain category.

Ex: Calculate the proportion of people that picked Chocolate Chip cookies.

Ex: Calculate the percentage of people that picked Peanut Butter cookies.

2.2 How can we describe data using graphs?

Bar Graphs: graphs constructed by putting the categories on the horizontal axis and the frequency or proportion on the vertical axis. The height of the rectangle for each category is equal to the category's frequency or proportion.

[Figures: two bar graphs titled "Favorite Cookie"; horizontal axis: Cookie Type (Oreo, Chocolate Chip, Oatmeal Raisin, Sugar, Peanut Butter, Brownie); vertical axis: Frequency]

Pareto chart: a bar graph whose bars are drawn in decreasing order of frequency or proportion.

[Figure: Pareto chart titled "Favorite Cookie"; horizontal axis: Cookie Type]

Pie Charts: A pie chart is a circle divided into sectors. Each sector represents a category of data.

[Figure: pie chart titled "Favorite Cookie"]

Graphs for Quantitative Variables

Histogram: a bar graph for quantitative data.

Ex: The table below shows the number of points scored by the UGA football team in the 2002-2003 season. Construct a bar graph by tens. Have the groups be 10-19, 20-29, 30-39, 40-49, and 50-59.

Stem and Leaf Plot: A stem and leaf plot is just a bar graph on its side. The stem consists of all digits except for the final one, which is the leaf.

Ex: The table below shows the number of points scored by the UGA football team in the 2002-2003 season. First place the numbers in ascending order. Then put the numbers into a stem-and-leaf diagram.

Example: The following data represent the length of eruption in seconds for a random sample of eruptions of "Old Faithful," a geyser at Yellowstone National
Park. Draw a stem and leaf plot.

97  75  107  104  114

Shapes of Histograms

Symmetric/Normal: the side of the distribution below the middle is a mirror image of the side above the middle.
Skewed left: the left tail is stretched out longer than the right tail.
Skewed right: the right tail is stretched out longer than the left tail.

NOTE: Many times we will use smooth curves to show the data rather than histograms.

Example: IQs of 7th Graders

[Figure: histogram of IQ scores; class labels and frequencies not recoverable]

How many students were sampled?
Which class has the highest frequency? What is its frequency?
Which class has the lowest frequency? What is its frequency?
What proportion of the students have an IQ between 120 and 129?
Describe the shape of the distribution: is it skewed right, skewed left, or approximately normal?

2.3 How can we describe the center of quantitative data?

Mean (Average): adding up all the values of the variable x and dividing by the number of these values, n.

$\text{Mean} = \dfrac{\sum x}{n}$

Example: What is the mean of 1, 3, 6, 7, 8?

Population Mean: $\mu$, known as "mu"
Sample Mean: $\bar{x}$, known as "x-bar"

Median: The value of the data that occupies the middle position when the data are ranked in ascending order. It separates the bottom 50% of the data from the top 50% of the data.

Steps in Computing the Median of a Data Set

1. Arrange the data from low to high.
2a. If n (the number of values) is odd, there is a unique middle data value. The median is the observation that lies in the $(n + 1)/2$ position.
    Example: What is the median of this dataset? 1, 3, 6, 7, 8
2b. If n is even, the median is the average of the two middle observations in the data set. These two middle observations lie in the $n/2$ and $n/2 + 1$ positions.
    Example: What is the median of this dataset? 1, 3, 4, 6, 7, 8

Mode: The data value that occurs most frequently (has the highest frequency). It is important to point out that the mode is NOT equal to the frequency; the mode is the data value that corresponds with the highest frequency.

Example: 10 bags of M&M's were opened and
the number of M&M's in each of the 10 bags is: 32, 34, 31, 35, 32, 36, 29, 38, 34, 32. What is the mode?

Example: What is the mode in the bar graph below?

[Figure: bar graph titled "Favorite Cookie"; horizontal axis: Cookie Type (Oreo, Chocolate Chip, Oatmeal Raisin, Sugar, Peanut Butter, Brownie); vertical axis: Frequency]

Example: Using the previous UGA football example

Number of Points: 31, 13, 45, 41, 27, 18, 48, 52, 13, 31, 24, 51, 30, 26

What is the sum of all points scored by the Bulldogs that year?
What is the mean number of points scored per game by the Bulldogs that year?
What is the mode of this data? This is an example of a bimodal dataset (has two modes).
What is the median of the data?

Now let's see how StatCrunch can do these calculations for us. Put in the data and go to Stat > Summary Stats > Columns.

The following frequency table shows the number of children in a large daycare, by age, on a given day.

[Frequency table not recoverable]

How many total children attend the daycare on this day?
What is the mean age for children at this daycare on this day?
What is the median age for children at this daycare on this day?
What is the mode for the ages of these children at this daycare on this day?

It is important to note that the mean is sensitive to extreme values in the dataset (either very large or very small numbers). The median, however, is not. The median is resistant to extreme values.

Example: Data set: 13, 12, 16, 10, 18, 17, 15, 10, 600

Find the mean: n = ___, $\sum x$ = ___, mean = ___
Find the median: Put values in order. Median = ___
Find the mode: Mode = ___

If I asked for the number which best describes the middle of the data, what is the best answer? Why?

Mean = Median: The graph is approximately normal/symmetric.

Mean < Median: The graph is skewed left. This is true because a skewed left graph has more low data values on the left. These low data values make the mean lower and less than the median.

Mean > Median: The graph is skewed right. This is true because a skewed right graph has more high data values on the right. These high data values make the mean higher and greater than the median.

Example: Match the histogram to these
summary statistics.

[Histograms and summary statistics not recoverable]

2.4 How can we describe the spread of quantitative data?

1. Range

Range: The difference between the largest and the smallest pieces of data.

Range = Largest value - Smallest value

2. Standard Deviation

Deviation from the Mean: A deviation from the mean, $(x - \bar{x})$, is the difference between the value of x and the mean $\bar{x}$.

Ex: Data Set: 2, 6, 7, 9, 11

$\sum x = 2 + 6 + 7 + 9 + 11 = 35$, so $\bar{x} = 35/5 = 7$

x     Deviation $(x - \bar{x})$
2     2 - 7
6     6 - 7
7     7 - 7
9     9 - 7
11    11 - 7

Sample Variance: the mean of the squared deviations, calculated using $n - 1$ as the divisor. What you are doing when you calculate sample variance is, in a way, averaging all the squared deviations, except you are dividing by $n - 1$ instead of dividing by n.

$\text{Variance} = s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

From the example above:

x     Deviation $(x - \bar{x})$     Squared deviation $(x - \bar{x})^2$
2
6
7
9
11

Variance = ___

Standard Deviation: the positive square root of the variance.

$s = \sqrt{\text{Variance}}$

From the example above: s = ___

StatCrunch can also calculate this sample standard deviation value without having to go through all these steps.

Variance and standard deviation measure how spread apart your data values are. The higher the variance and standard deviation, the more spread apart the data values will be.

Example: We administer Test A and Test B to five students, and their scores were the following:

Test A: 20, 58, 79, 92, 98
Test B: 75, 78, 80, 82, 83

Which test scores would have a larger standard deviation? Why?

Just like we can either be looking at a population mean or a sample mean, depending upon whether we are looking at the entire population or just a sample from the population, we also have symbols to represent population standard deviation and sample standard deviation.

Population Standard Deviation: $\sigma$, known as "sigma"
Sample Standard Deviation: $s$

Whenever we calculate standard deviation using StatCrunch, we are calculating the sample standard deviation s.

Example: Consider the following three data sets:

A: 52, 52, 52
B: 48, 51, 57
C: 32, 50, 74

Use these data sets to practice finding the sample standard deviation. Which
distribution has the smallest standard deviation? Which distribution has the largest standard deviation?

As you can see in the distributions below, the distribution with a larger standard deviation is going to be wider, because its data values are more spread apart. Of data distributions A and B below, which would you guess to have a larger standard deviation?

[Figure: two distribution curves labeled A and B]

Empirical Rule

If a distribution is bell-shaped, we can approximate the percentage of data that lie within one, two, and three standard deviations of the mean.

Mean ± 1 Standard Deviation: about 68% of the data values
Mean ± 2 Standard Deviations: about 95% of the data values
Mean ± 3 Standard Deviations: about all of the data values

[Figure: bell-shaped curve marked at 1, 2, and 3 standard deviations on either side of the mean]

Example: If we have a population of test scores with mean 80 and standard deviation 6 that is bell-shaped, label the test scores that correspond to 1, 2, and 3 standard deviations away on the above curve and interpret those values.

Example: Suppose the body temperature for a random sample of UGA students is collected and recorded. Suppose the average body temperature for the sample is 98.6 and the sample standard deviation is 0.7. A histogram of the data indicates that the data follow a bell-shaped distribution.

a) Draw a curve of these body temperatures.
b) Determine the approximate percentage of students that have body temperature between 97.2 and 100.0 according to the Empirical Rule.
c) Determine the approximate percentage of students that have body temperature higher than 99.3 according to the Empirical Rule.
d) Determine the approximate percentage of students that have body temperature between 97.9 and 98.6 according to the Empirical Rule.

2.5 How can we describe the position of values in quantitative data?

1. Percentiles

The pth percentile is a value such that p% of the observations in the data fall below or at that value. This also means that the other
(100 - p)% of the observations in the data are larger than that value.

A data value's percentile tells you approximately what percent of the data are less than that value. If a value lies at the 30th percentile, then approximately 30% of the data values are less than that value and approximately 70% of the data values are higher than that value.

Example: If John graduated at the 78th percentile in a class of 876, approximately how many students ranked below John?

Quartiles: specific percentiles that are useful. Each set of data has three quartiles.

First Quartile (Q1): the value such that 25% of the data values are smaller than Q1 and 75% are larger. This is also known as the 25th percentile.

Second Quartile (Q2): the value such that 50% of the data values are smaller than Q2 and 50% are larger. This is also known as the median and the 50th percentile.

Third Quartile (Q3): the value such that 75% of the data values are smaller than Q3 and 25% are larger. This is also known as the 75th percentile.

Minimum | 25% | Q1 | 25% | Q2 | 25% | Q3 | 25% | Maximum

NOTE: Q1 and the 25th percentile are the same. Q3 and the 75th percentile are the same. Q2 and the 50th percentile are the same.

Finding Quartiles

1. Arrange the data in order.
2. Find the median. This is the second quartile, Q2.
3. Consider the lower half of the observations. The median of these observations is the first quartile, Q1.
4. Consider the upper half of the observations. The median of these observations is the third quartile, Q3.

Example: The following data represent the hemoglobin (in g/dL) for 20 randomly selected cats.

5.7, 7.7, 7.8, 8.7, 8.9, 9.4, 9.5, 9.6, 9.6, 9.9, 10.0, 10.3, 10.6, 10.7, 11.0, 11.2, 11.7, 12.9, 13.0, 13.4

Determine the quartiles.

Outliers: extreme observations that occur because of error in the measurement of the variable, during data entry, or from errors in sampling.

Steps for Checking for Outliers

1. Determine the first and third quartiles of the dataset.
2. Compute the interquartile range. The interquartile range, or IQR, is the difference between the third and first
quartile: IQR = Q3 - Q1.
3. If a data value is less than Q1 - 1.5(IQR) or greater than Q3 + 1.5(IQR), it is considered an outlier.

Example continued: Hemoglobin in Cats

The following data represent the hemoglobin (in g/dL) for 20 randomly selected cats.

5.7, 7.7, 7.8, 8.7, 8.9, 9.4, 9.5, 9.6, 9.6, 9.9, 10.0, 10.3, 10.6, 10.7, 11.0, 11.2, 11.7, 12.9, 13.0, 13.4

Compute the IQR. Are there any outliers?

The 5-Number Summary and Boxplots

Minimum | 25% | Q1 | 25% | Q2 | 25% | Q3 | 25% | Maximum

This is the 5-number summary: it includes the minimum, Q1, Q2 (or the median), Q3, and the maximum.

Boxplot: a graph of the five-number summary.

Steps in Drawing a Boxplot

1. Determine Q1, Q2, and Q3.
2. Draw vertical lines at Q1, the median (Q2), and Q3. Enclose these vertical lines in a box.
3. Draw a line from Q1 to the smallest data value that is not an outlier. Draw a line from Q3 to the largest data value that is not an outlier.
4. Any data values that are outliers are marked with an asterisk.

Distribution Shape Based upon Boxplot

1. If the median is near the center of the box and each horizontal line is approximately equal length, the distribution is approximately symmetric.
2. If the median is to the left of the center of the box and/or the right line is much longer than the left line, the distribution is skewed right.
3. If the median is to the right of the center of the box and/or the left line is much longer than the right line, the distribution is skewed left.

[Figure: histograms with matching boxplots: (a) Symmetric, (b) Skewed right, (c) Skewed left]

Example: Draw a boxplot for the cat data.

5.7, 7.7, 7.8, 8.7, 8.9, 9.4, 9.5, 9.6, 9.6, 9.9, 10.0, 10.3, 10.6, 10.7, 11.0, 11.2, 11.7, 12.9, 13.0, 13.4

Step 1: Determine Q1, Q2, and Q3.
Step 2: Draw vertical lines at Q1, the median (Q2), and Q3. Enclose these vertical lines in a box.
Step 3: Draw a line from Q1 to the smallest data value that is not an outlier. Draw a line from Q3 to the largest data value that is not an outlier.
Step 4: Any data values that are outliers
are marked with an asterisk.

2. Z-score

Z-score: The position a value has relative to the mean, measured in standard deviations.

$z\text{-score} = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}}$

The z-score is the number of standard deviations a data value is from the mean.

NOTICE: If the value is equal to the mean, then the z-score = 0.

Example: From samples taken, the average 20-29 year-old man is 70.0 inches tall with a standard deviation of 2.8 inches, while the average 20-29 year-old woman is 64.6 inches tall with a standard deviation of 2.6 inches.

Find the z-score for a 75-inch tall man.
Find the z-score for a 70-inch tall woman.
Who is relatively taller: a 75-inch man or a 70-inch woman?

If the heights for males are normally distributed, draw a curve representing these heights. Label where the 75-inch tall man is under this curve and see that it corresponds to his z-score.

What height is exactly two standard deviations below the mean? Calculate the z-score for this height to make sure it does equal -2.

What height is 1.28 standard deviations below the mean?

Using Z-Scores to check for Outliers

Outliers for a bell-shaped curve: A data value in a bell-shaped distribution is regarded as a potential outlier if it falls more than three standard deviations from the mean. In other words, if a value has a z-score less than -3 or a z-score greater than 3, then it is a potential outlier.

Assume that male heights are normally distributed. In the previous example, would a male with a height of 58 inches be considered a potential outlier? What about a male with a height of 62 inches?

3.1 How can we explore the association between two categorical variables?

To do this, we use contingency tables.

Contingency or 2-way table: a table that relates 2 categorical variables. Each box inside the table is referred to as a cell.

Suppose we have the following data:

[Contingency table: rows Female and Male; columns Left-handed and Right-handed; counts not recoverable]

Are these categorical variables? What is the response variable? What is the explanatory variable?

In the examples we use, the explanatory variable will always
be on the side and the response variable will always be on top.

How many right-handed people are there in this data?

Chapter Three: Association: Contingency, Correlation, and Regression

In Chapter 3 we explore the relationships between two variables.

Response variable: a variable that can be explained by, or is determined by, another variable. When the two variables are quantitative, the response variable will be the y-variable (the variable that goes on the vertical axis when graphing data).

Explanatory variable: explains or affects the response variable. When the two variables are quantitative, the explanatory variable will be the x-variable (the variable that goes on the horizontal axis when graphing data).

Ex: The amount you eat affects how much weight you gain. The amount you eat is the explanatory variable, which determines weight gain, the response variable.

Association: an association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

Ex: If the amount we eat is small, then we probably won't notice much gain in weight. However, if the amount we eat is large, then we probably will notice some gain in weight. So there is an association between the amount eaten and weight gain.

Lurking Variable: related to the response or explanatory variable or both, but is not the variable being studied.

Ex: A lurking variable could be frequency of exercise. The amount of exercising can also affect weight gain, the response variable.

We can also calculate proportions from the table by totaling up the rows and columns.

Ex: What proportion of the people in the data is left-handed?

Conditional Proportion: the proportion for a value of a variable, given a specific value of the other variable.

Ex: What proportion of the males is right-handed?
Ex: What proportion of the females is left-handed?

Relative Risk: We can use these conditional proportions to compare how likely an outcome is for each group. Let's create a table with these conditional
proportions for the categories of the response variable.

[Table of conditional proportions: rows Female and Male; columns Left-handed and Right-handed; values not recoverable]

$\text{Relative risk} = \dfrac{\text{conditional proportion for one group}}{\text{conditional proportion for another group}}$

When calculating relative risk, the higher proportion goes in the numerator.

We can use relative risk to see how many times more likely the outcome for one group is than for the other group.

Example: Fill in the blank: A male is ___ times more likely to be left-handed than a female.

You can see this value is close to one. A relative risk close to 1 means it is about the same likelihood for both groups.

Now look at this example:

We asked people to try one of either drug A or B when they get a headache and let us know whether the headache went away within [time period and data table not recoverable].

Fill in the blank: Drug A is ___ times more likely to relieve a headache than Drug B.

We can see that when we get larger differences in the conditional proportions, like here, we get a relative risk value much higher than 1.

3.2 How can we explore the association between two quantitative variables?

Suppose you are engaged and you are looking at diamond rings. You looked at four rings with the following data:

[Table of number of carats and price for four rings; values not recoverable]

We want to see if the two variables, number of carats and price, are associated, and later in this section we will see if we can use the number of carats to predict the cost.

When we have two quantitative variables, the first thing we do is make a scatterplot of the data.

Scatterplot: a graphical display for two quantitative variables. The explanatory variable is on the horizontal axis, the response variable is on the vertical axis, and the points are not connected.

Here is a scatterplot of this data:

[Figure: scatterplot; horizontal axis: Number of carats; vertical axis: Price in dollars]

positive association: as x increases, y increases
negative association: as x increases, y decreases
no association: as x increases, there is no definite shift in the values of y

Are the variables related? Does it look like there is an association between these two variables? What type of association do we see in the scatterplot above?

Example: Estimate
the type of association for the following pairs of variables:

a) weight of a car and miles per gallon it gets
b) speed of a car and distance required to come to a complete stop
c) weight on a bar and number of repetitions a weightlifter can achieve
d) the temperature outside and my grade on a test

So we see the association between two variables, but what if we want to take it one step further and determine if there is a linear relationship between the variables? Then we calculate what we call correlation.

linear correlation: when the data tend to follow a straight-line path. If x increases and y increases, it is positive correlation; if x increases and y decreases, it is negative correlation.

no correlation: as x increases, there is no definite shift in the values of y (no linear relationship between x and y).

Correlation can be:
1. positive, negative, or no correlation
2. strong or weak correlation

Scatter Diagrams and Correlation

[Figure: example scatter diagrams for different types of correlation]

Properties of the Linear Correlation Coefficient r

1. r must always be between -1 and 1: $-1 \le r \le 1$. If r = 1, there is perfect positive correlation. If r = -1, there is perfect negative correlation.
2. r > 0 indicates a positive linear relationship.
3. r < 0 indicates a negative linear relationship.
4. A value of r close to 1 or -1 represents a strong linear relationship, while a value of r close to zero represents a weak linear relationship.

Which of the following is the strongest correlation? r = 0.67, r = 0.34, r = 0, r = 0.92 [signs of the options not recoverable]

StatCrunch can also calculate this r value by putting in the data and going to Stat > Summary Stats > Correlation.

Calculate r for the ring data.

Example: A typical weightlifting bar with no weight on it weighs 45 pounds. Suppose we take a random sample of UGA students and ask them to perform as many repetitions as they can at various weights. Below is how much weight was added to the bar (in addition to the 45-pound bar) and how many reps the student could do.

Weight (x):          40   80   100   120   150
Number of Reps (y):  32   20   18    12    1

1. Draw a scatterplot and comment on the type of relation that appears to exist between x and y. Is it a
negative or positive relation? Does it seem strong or weak?

2. Calculate r for this data using StatCrunch.

3.3-3.4 How to predict the outcome of a variable

To predict the response variable using the explanatory variable, we create what is called a regression line.

Regression line: predicts the value for the response variable y as a straight-line function of the value x of the explanatory variable.

$\hat{y}$: the predicted value of y using the regression line.

The equation for the regression line is $\hat{y} = a + bx$. In this formula, a is called the y-intercept and b is called the slope.

So for each of our four rings we have an actual x value (number of carats) and an actual y value (price). We can also use the regression line to calculate a predicted y value (predicted price) for each ring. The best regression line is going to be the one that has the predicted y values closest to the actual y values. We use the actual data values to create the regression line. We won't need to do this by hand, since StatCrunch can do it for us.

Let's take a look back at our ring data and look at the regression line for that data.

[Scatterplot with regression line: price in dollars against number of carats.]

Residual: the difference between the actual value and the predicted value of y.

$$\text{residual} = \text{actual } y - \text{predicted } y = y - \hat{y}$$

The line that best describes the relation between two variables is the one that makes the residuals as close to zero as possible.

The formula for the regression line using the least squares method is $\hat{y} = a + bx$, where a is the y-intercept and b is the slope.

So for our ring example, here is the regression line: [coefficients garbled in the source as "a 7 939759126409639x"]. We can get this in StatCrunch by inputting the data and going to Stat > Regression > Simple Linear.

Interpretations of y-intercept and slope

y-intercept: the predicted value of y when x = 0. Interpret the y-intercept in the above scenario.

Slope: the amount that the predicted value of y changes when x increases by one unit. Interpret the slope in the above scenario.

Important Terms

- Population: Total set of subjects in which
we are interested.
- Sample: A subset of the population for which we have data.
- Subject: The entities we measure (individuals).

Histogram Interpretation (HW 2.2)

[Histogram: frequency of students by IQ class.]

- How many total students were sampled? $60 + 80 + 60 + 40 = 240$
- Which class has the highest and lowest frequency? What are those frequencies? Highest: "100-109" with 80. Lowest: "120-129" with 40.
- How many students have an IQ between 100 and 119? $80 + 60 = 140$

Stem-and-Leaf Plot

- A bar chart on its side.
- The "stem" is all digits except the last one. The last digit is the "leaf".
- Leaves go in ascending order, with no commas.
- If there is nothing in a row, write the stem but leave the leaf blank.

Example (HW 2.1-2.2), selling prices: 199, 210, 210, 223, 225, 225, 225, 228, 232, 235

| Stem | Leaves |
|------|--------|
| 19   | 9      |
| 20   |        |
| 21   | 00     |
| 22   | 35558  |
| 23   | 25     |

Sampling Methods

- Simple Random Sampling: Each subject everywhere has an equally likely chance of being selected. Often done with a random number table. Example: choosing a company somewhere in the US.
- Systematic: Selecting every kth subject. Example: surveying every 10th person we meet downtown.
- Convenience: Individuals are easily found, e.g. internet surveys. Often the laziest way, so the answers are less reliable.
- Stratified Sampling: Taking some subjects from all possible groups.
- Cluster Sampling: Taking all subjects from some possible groups.

Skewness

- Symmetric: mean = median.
- Skewed left: mean < median. The < looks like an L, as in Left Skewed.
- Skewed right: mean > median. The > looks like part of an R, as in Right Skewed.

Outliers

- The mean is sensitive to outliers.
- The median is resistant to outliers.
- When outliers are present, it is best to use the median as the measure of center.

Examples:

- Earthquake magnitudes on the Richter Scale: skewed right, since there are some, but very few, big earthquakes.
- Ages of MENSA members at the time they joined: skewed left, since most were adults but a few children had high enough IQs.

Standard Deviation

- Roughly the typical (average) distance between a data point and the mean of the data.
- Measures how much or how little the
data distribution is spread out.

Summary Stats Interpretation

- Mean: average of the data set
- Standard deviation: average spread in the data set
- Q1: 25% of data lie below this
- Median (sometimes Q2): 50% of data lie below and above this value
- Q3: 75% of data lie below this
- Maximum: largest value in the data set
- Minimum: smallest value in the data set
- Range: difference between maximum and minimum

Box Plot (HW 2.5-2.6)

[Box plot: each of the four sections contains 25% of the data.]

- What proportion of states have taxes greater than 31 cents? 75%. Greater than 105 cents? 25%.
- Between what two values are the middle 50% of the data found? 31 and 105.
- What is the range? Range = maximum - minimum = $206 - 2.6 = 203.4$.

Box Plot Outliers (HW 2.5-2.6)

- Any point lying above $Q_3 + 1.5 \times IQR$ is an outlier.
- Any point lying below $Q_1 - 1.5 \times IQR$ is also an outlier.
- Are there any outliers on this boxplot?

$IQR = Q_3 - Q_1 = 110.5 - 25.6 = 84.9$

$Q_1 - 1.5 \times IQR = 25.6 - 1.5(84.9) = -101.75$: we have no lower outliers.

$Q_3 + 1.5 \times IQR = 110.5 + 1.5(84.9) = 237.85$: we have an upper outlier.

Mean & Median (HW 2.3-2.4)

This chart shows the number of grams of protein in various brands of loaves of bread. Compute the mean and median of the data set. What can you say about the shape of the distribution?

| Protein (g) | Count |
|-------------|-------|
| 0           | 15    |
| 1           | 16    |
| 2           | 21    |
| 3           | 4     |
| Total       | 56    |

For the median, find half the total count, about 28, so we need to find where bread 28 is. It's not in row 0, since that row holds only the first 15. After row 1 we have $15 + 16 = 31$ loaves. Median = 1, since bread 28 falls in row 1.

$$\text{mean} = \frac{0(15) + 1(16) + 2(21) + 3(4)}{56} = \frac{70}{56} = 1.25$$

Mean > median, so the distribution is somewhat skewed right.

Empirical Rule

- Only used for bell-shaped distributions.
- Within one standard deviation of the mean, we have about 68% of all data points.
- Within two standard deviations of the mean, we have about 95% of all data points.
- Within three standard deviations of the mean, we have almost all data points.
- Anything else is an outlier.

Summary: $\bar{x} \pm 1s$ covers 68%, $\bar{x} \pm 2s$ covers 95%, and $\bar{x} \pm 3s$ covers almost all.

Z-Score

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{\text{value} - \bar{x}}{s} \quad \text{or} \quad \frac{\text{value} - \mu}{\sigma}$$

A z-score is the number of standard deviations above or below the mean the data point lies. If negative, the data point is below the
mean. If positive, the data point is above the mean. A data point is an outlier if $z > 3$ or $z < -3$.

Example (HW 2.3-2.4)

The weight of a house cat is bell-shaped with mean 14 pounds and standard deviation 2.5 pounds.

- Find an interval within which about 95% of house cat weights will fall. By the Empirical Rule we go out 2 standard deviations from the mean: $(14 - 2 \times 2.5,\ 14 + 2 \times 2.5) = (9, 19)$.
- What z-score represents a house cat that is 2.8 standard deviations to the right of the mean? What weight is that? $z = 2.8$, and $z = \frac{x - \mu}{\sigma}$ gives $2.8 = \frac{x - 14}{2.5}$, so $x = 14 + 2.8 \times 2.5 = 21$ pounds.

Relative Risk (HW 3.1)

$$\text{relative risk} = \frac{\text{conditional proportion for first group}}{\text{conditional proportion for second group}}$$

The first group is the one with the larger of the two proportions.

Relative risk tells us how many times more likely the outcome is for one group than for the other group. The following three facts therefore follow:

1. Relative risk $\ge 1$.
2. When the numerator and denominator proportions are very similar, relative risk will be very close to 1.
3. However, when the numerator is quite a bit larger, relative risk will be quite a bit greater than 1.

Relative Risk (HW 3.1)

- Find the proportion of people who are at least 30 that are HV. $39/855 = 0.0456$
- Find the proportion of people under 30 who are HV. $18/641 = 0.0281$
- Find the relative risk of being HV for both groups. Look at the HV proportions: larger / smaller $= 0.0456 / 0.0281 = 1.623$. People who are at least 30 are 1.623 times more likely to be HV than people who are under 30.

Relative Risk (HW 3.1)

- Find the probability that someone selected did orientation and adjusted well. $72/159$
- Find the probability that someone selected did orientation or adjusted well. $(72 + 28 + 14)/159$
- If a subject selected adjusted poorly, what's the probability he/she did not do orientation? $45/59$

Correlation

- $-1 \le r \le 1$
- If r is positive, then so is the slope. The same holds if r is negative.
- The closer r is to 1 or -1, the stronger the correlation.
- The closer r is to 0, the weaker the correlation.
- r is unitless and does not change if we flip the variables.
- r measures only LINEAR relationships.
- A strong correlation is not proof that one variable causes the other.

Scatter
Plots

[Figure: two example scatter plots, one showing a strong positive correlation and one showing a weak negative correlation.]

Least-Squares Regression

$$\hat{y} = a + bx$$

- x: given data point
- $\hat{y}$: predicted response
- a: intercept. The predicted response when x = 0. May not always have a practical interpretation.
- b: slope. How much the predicted response increases or decreases for every unit increase in x.

$$\text{residual} = \text{observed} - \text{predicted} = y - \hat{y}$$

Regression (HW 3.2-3.4)

Analysis says that we can use the length of an alligator, in feet, to predict its weight in pounds. The equation is given: $\hat{y} = 10 + 40x$.

- Find the expected weight of an alligator that's 10 feet long. $\hat{y} = 10 + 40(10) = 410$ pounds.
- Suppose an alligator that's 10 feet long actually weighs 402 pounds. Calculate the residual. Observed - predicted $= 402 - 410 = -8$, so we overestimated.
- Interpret the slope. For every additional foot in length, an alligator's weight is expected to increase by 40 pounds.
- Interpret the intercept. Literally, an alligator with length 0 will weigh 10 pounds, which makes no sense. So the intercept has no interpretation here.

Probability

Probability is the likelihood of a particular outcome occurring.

$$\text{probability} = \frac{\text{number of desired outcomes}}{\text{total possible outcomes}}$$

Example: the probability of drawing a club from a deck of cards is

$$p = \frac{13 \text{ clubs}}{52 \text{ total cards}} = \frac{1}{4} = 0.25$$

A complement: all possible events that are not in A. Example: A = it's snowing, $A^c$ = it's not snowing.

Complement probability: $P(A^c) = 1 - P(A)$

Probability (HW 5.1-5.3)

We have an urn full of 12 blue, 10 red, and 8 black marbles. We reach in and draw a marble at random.

- What's the probability of drawing a marble that's one of UGA's colors? There are 30 total marbles, and $10 + 8 = 18$ of them are red or black, so $18/30 = 3/5 = 0.6$.
- If the marble drawn was a UGA color, what's the probability it was not red? Out of 18 marbles of UGA colors (red or black), 8 of them are black (so not red), so $8/18 = 0.4444$.
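The worked answers in these review notes can be double-checked with a few lines of Python. The sketch below is only an illustration. The helper names (`relative_risk`, `z_score`, `predict`) are made up for this example and are not part of StatCrunch or the course materials, and the numbers are the ones quoted in the homework examples above.

```python
from fractions import Fraction
from statistics import median


def relative_risk(p1, p2):
    """Ratio of two conditional proportions, larger one in the numerator."""
    return max(p1, p2) / min(p1, p2)


def z_score(value, mean, sd):
    """Standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


def predict(a, b, x):
    """Predicted response from the regression line y-hat = a + b*x."""
    return a + b * x


# Relative risk: 39 of 855 people aged 30+ versus 18 of 641 people under 30.
# Unrounded proportions give about 1.624; the notes round the proportions
# to 0.0456 and 0.0281 first, which gives 1.623.
rr = relative_risk(39 / 855, 18 / 641)

# Empirical Rule and z-score for house cats (mean 14, standard deviation 2.5).
interval_95 = (14 - 2 * 2.5, 14 + 2 * 2.5)
z_for_21_pounds = z_score(21, 14, 2.5)

# Alligator regression: y-hat = 10 + 40x, observed weight 402 at x = 10.
predicted_weight = predict(10, 40, 10)
alligator_residual = 402 - predicted_weight

# Bread protein frequency table: grams of protein -> number of loaves.
protein_counts = {0: 15, 1: 16, 2: 21, 3: 4}
loaves = [grams for grams, n in protein_counts.items() for _ in range(n)]
protein_mean = sum(loaves) / len(loaves)
protein_median = median(loaves)

# Urn with 12 blue, 10 red, 8 black marbles; UGA colors are red and black.
p_uga = Fraction(10 + 8, 12 + 10 + 8)
p_not_red_given_uga = Fraction(8, 10 + 8)

print(round(rr, 3), interval_95, z_for_21_pounds)
print(predicted_weight, alligator_residual)
print(protein_mean, protein_median)
print(p_uga, p_not_red_given_uga)
```

`Fraction` is used for the urn probabilities so that they stay as exact reduced fractions instead of rounded decimals. Expanding the frequency table into a flat list is a simple choice that works for a table this small; for large counts a weighted formula would be more economical.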
