Statistics for Scientists
Statistics for Scientists STAT 3000
Utah State University
This 101 page Class Notes was uploaded by Geovanny Lakin on Wednesday October 28, 2015. The Class Notes belongs to STAT 3000 at Utah State University taught by Staff in Fall. Since its upload, it has received 17 views. For similar materials see /class/230503/stat-3000-utah-state-university in Statistics at Utah State University.
Date Created: 10/28/15
Introduction

Statistics: the science that deals with data.
Data: facts with context.

67, red, 7.5

Numbers and words on their own tell us almost nothing. Add some context:
• My height is 67 inches.
• I have a red car.
• The baby's weight was 7.5 lbs.
Now we have DATA.

CHAPTER 6 SAMPLE STATISTICS

Descriptive Statistics

1 Population and Sample

1.1 Basic Definitions

Data set → collection of data. A data set consists of observations.
Example: student data set. Each student → one observation (record).
All potential observations → population.
Group of observations obtained in a particular study → sample.

The definition of population and sample is flexible: it depends on what we want to study.

Want to study: body weight of young males (age 18-30) at USU.
• Population: all males of age 18-30 at USU
• Sample: all males of age 18-30 enrolled in STAT 3000

Want to study: are the students enrolled in STAT 3000 on average younger than last year?
• Two populations: (1) all STAT 3000 students in Fall 2000; (2) all STAT 3000 students in Spring 2001
• Two samples: (1) 20 STAT 3000 students from Fall 2000; (2) 20 STAT 3000 students from Spring 2001

Characteristics of the population are often unknown. The whole population cannot be recorded because:
• The population is too large
• It is impossible to record all observations
• It is too expensive to include all observations

We analyze a SAMPLE to gain knowledge about the POPULATION. The process of drawing conclusions about the population based on the information gained from the sample is called STATISTICAL INFERENCE.

Representative sample: random sample.
Random sample:
• Each observation has an equal (unbiased) chance of being selected.
• Selection of one observation does not change the chance of any other observation being taken into the sample (independent).

Important: good sampling design.
Bad sampling design:
• Opinion poll: telephone survey in SLC in 1950
• Studying average GPA at USU. Sample: the Dean's list

Non-random and/or biased samples can result in WRONG CONCLUSIONS!

1.2 Data and Variables

STAT 3000 student data set
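A data set of this kind, and the idea of drawing a random sample from it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`gender`, `age`, `eye_color`, `height_in`) and the helper `simple_random_sample` are all invented here and are not the course data set. Each dictionary is one observation, each key is one variable, and the sample is drawn so that every observation has the same chance of being selected.

```python
import random

# Hypothetical records: every value below is invented for illustration
# and is not taken from the course data set.
students = [
    {"gender": "F", "age": 19, "eye_color": "blue", "height_in": 64.0},
    {"gender": "M", "age": 22, "eye_color": "brown", "height_in": 70.5},
    {"gender": "F", "age": 20, "eye_color": "green", "height_in": 66.0},
    {"gender": "M", "age": 25, "eye_color": "brown", "height_in": 72.0},
    {"gender": "F", "age": 21, "eye_color": "blue", "height_in": 62.5},
    {"gender": "M", "age": 19, "eye_color": "other", "height_in": 68.0},
    {"gender": "F", "age": 23, "eye_color": "brown", "height_in": 65.0},
    {"gender": "M", "age": 20, "eye_color": "blue", "height_in": 71.0},
]


def simple_random_sample(population, size, seed=None):
    """Draw `size` distinct observations; each has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, size)


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Treat the eight records as the whole population and sample four of them.
population_mean = mean(s["height_in"] for s in students)
sample = simple_random_sample(students, size=4, seed=2015)
sample_mean = mean(s["height_in"] for s in sample)

print(f"population mean height: {population_mean:.2f} in")
print(f"sample mean height:     {sample_mean:.2f} in")
```

The sample mean will usually differ somewhat from the population mean; how much it can differ is the kind of question statistical inference addresses.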
Observations: students (each student - one observation)
Variables: Gender, Marital status, Number of Sibs, Age, Color of Eyes, Height, Weight
Variables describe an observation.

1.2.1 Types of Variables

Categorical (nominal) variables: place an observation into categories.
Numerical (quantitative) variables: take numerical values; arithmetic operations are possible.
• Discrete variables: take only whole (integer) values → counts
• Continuous variables: take any value within a given interval → measurements

The type of the variable determines the type of analysis.
The distribution of the variable tells us what values it takes and how often these values occur.

2 Exploratory data analysis

Exploratory data analysis → analyze what we see. Explore patterns in the data:
• With graphs → data presentation
• With numbers → descriptive statistics

2.1 Displaying data with graphs

Visualization of data → explain what we see.

2.1.1 Visual display of categorical variables

Goal: compare the size of categories of the data.

2.1.1.1 Bar chart

Quickly compares the size of categories.
How to do it:
• List categories
• Count observations falling in each category
• Calculate the proportion (or percent)

Variable: Color of eyes

Graphical display:
x-axis: categories
y-axis: frequencies

2.1.1.2 Pie chart

When the data are given in percents it is useful to construct a pie chart. A pie chart shows what part of the whole circle each group forms.
• Each category → one slice
• Angle of each slice → proportional to relative frequency

$\text{Angle} = \dfrac{n_i}{n} \times 360^\circ$

where $n_i$ → frequency (number of observations in each category) and $n$ → total number of observations.

Blue angle = ____
Brown angle = ____
Green angle = ____
Other angle = ____

Graphical display: start at 12 o'clock.

Bar chart vs pie chart
• A bar chart is more flexible: it can compare only those categories of interest (e.g. number of students with blue and brown eyes). A pie chart has to include all categories, because the slices must add up to 100%.
• A bar chart can compare categories from different data sets, e.g.
number of students with blue eyes from Section 001 and Section 002. A pie chart can only compare categories from the same whole.

2.1.2 Visual display of numerical variables

Goal: determine the distribution of the data.

2.1.2.1 Stem and leaf plot

A working tool. It lets us see the distribution quickly (fast and easy).
How to do it:
• Separate each observation into two parts: stem (all but the last digit) and leaf (last digit)
• Write stems in a vertical column from the smallest to the largest (or vice versa). Do not skip any missing stem.
• Write each leaf in a row to the right of its stem, in increasing order

Body weight of STAT 3000 female students

Observations that do not follow the general pattern of the distribution are called OUTLIERS.
Causes of outliers:
• Typos
• Measurement errors
• Natural variation (too small data set)

2.1.2.2 Histogram

A nicer presentation of data. Similar to a bar chart, but for numerical data: x-axis → numerical scale.
How to do it:
• Divide the whole range of the data into intervals. The number of intervals depends on the data and must be optimal:
  too few intervals → can't see the distribution
  too many intervals → spiky histogram
  Rule of thumb: number of intervals ≈ $\sqrt{n}$ ($n$ = number of observations)
  For construction by hand choose convenient intervals (round width). Always use the same width for all intervals.
• (optional) Determine midpoints
• Count observations falling into each interval
• Construct the histogram: x-axis → intervals (numerical values); y-axis → frequency

Body weight of STAT 3000 female students
Range of the data: ____
Take "convenient" intervals of ____ pounds

Intervals | Midpoints | Count (frequency) | Proportion (rel. frequency)
Total | | | 1.0000

Note:
• no spaces between intervals
• all intervals have the same width
• y-axis always starts from 0

Unit 4 Probability distributions

5.0 Theoretical Probability Distribution

So far we have learned how to:
• Construct a probability mass function for a discrete r.v.
• Construct a probability density function for a continuous r.v.
• Construct a cumulative distribution function
• Calculate
various probabilities using the pmf, pdf or cdf
• Calculate expected value, variance etc. of a r.v.

But if we do a data analysis, obtaining all these things would be difficult, even impossible, because we have a sample, i.e. a limited number of observations, and thus we do not know the exact distribution in the population.

However, we can assume that our data follow a certain THEORETICAL PROBABILITY DISTRIBUTION. Often observations generated by different statistical experiments or sampling have the same type of behavior → they can be described by the same probability distribution. Therefore we usually do not have to calculate complicated integrals etc., because for theoretical probability distributions there exist simple formulas and/or tables that we can use to calculate different probabilities, expected values, variances etc.

A theoretical probability distribution is described by numbers called PARAMETERS. Expectations and variances of a theoretical distribution (or of a random variable that follows that distribution) are functions of these parameters.

In this course we will learn about the following theoretical probability distributions:
• Discrete probability distributions: binomial distribution
• Continuous probability distributions: normal distribution, t distribution, $\chi^2$ distribution, F distribution

Discrete Probability Distributions

5.1 Binomial distribution

5.1.1 Introduction

Consider the following "random experiments" and their outcomes:
• Coin toss: head / tail
• Gender of a baby: boy / girl
• Result of a test: pass / fail
• Survival of a cancer patient 2 months after the therapy started: dead / alive

Each of these random experiments has only two possible outcomes. One of them can be labeled as SUCCESS, the other one as FAILURE.
• Success → what we are interested in
• Failure → everything else

Such a random experiment that has only two possible outcomes is called a Bernoulli trial. We are interested in the occurrence of the outcome, not in the magnitude of the outcome.

We can assign numerical values to the outcomes of a
Bernoulli trial:
• Success = 1
• Failure = 0

Such a random variable that has only two possible values, 0 and 1, is called a Bernoulli random variable.

Let X be a random variable indicating the gender of a baby, with the values 1 if the baby is a boy ("success") and 0 if the baby is a girl ("failure").

$X \sim \text{Bern}(p)$
p → probability of having a boy (probability of success)
p → parameter of the Bernoulli distribution

Assume that the probability of having a boy is 0.49, i.e. p = 0.49. Then $X \sim \text{Bern}(0.49)$.
• What is the probability distribution of X?
• What is the expected value of X?
• What is the variance of X?

Now let's assume that we have more children, i.e. we repeat this random experiment (having a baby) n times. Then in this situation:
• The gender of each baby is either boy or girl (success or failure).
• The probability p of having a boy is the same in every trial.
• All trials are independent: the gender of one baby does not influence the gender of other babies, i.e. if your first baby is a boy it will not change the probability that your second baby is a boy too.

Therefore we have n Bernoulli trials:
• All n trials are independent
• Each trial: success or failure only
• Probability p of success is the same in each trial
This is the binomial setting (binomial trial).

Now we are interested in the number of boys among n children. Let X be a random variable indicating the number of boys, i.e. the number of successes, among n children.

$X \sim B(n, p)$

X has a binomial distribution described by two parameters:
n: total number of trials
p: probability of success in each trial
X is called a binomial random variable.

5.1.2 Calculating binomial probabilities

Imagine you plan to have 5 children. What is the probability distribution of the number of boys among your 5 children?

n = 5
X → number of boys out of 5 children
p = 0.49
$X \sim B(5, 0.49)$
Possible values of X are: ____

$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$ ⇒ probability mass function of X

$\binom{n}{x}$ → "n choose x" ⇒ BINOMIAL COEFFICIENT → number of possibilities of getting x successes out of n trials

$\binom{n}{x} = \dfrac{n!}{x!\,(n-x)!}$, where $n! = n(n-1)(n-2)\cdots 1$

• What is the
probability of having 0, 1, 2, 3, 4 or 5 boys out of 5 children?

P(X = 0) = ____
P(X = 1) = ____
P(X = 2) = ____
P(X = 3) = ____
P(X = 4) = ____
P(X = 5) = ____
$\sum_x P(X = x)$ = ____

[Figure: line graph of the pmf of X → number of boys out of 5 children]

• What is the probability of having not more than 3 boys out of 5 children?
• What is the probability of having between 2 and 4 boys out of 5 children?

5.1.2.1 Computer tools for calculating binomial probabilities

Microsoft Excel
• Statistical function BINOMDIST(x, n, p, cumulative)
  cumulative = 0 (no) → pmf will be calculated, P(X = x)
  cumulative = 1 (yes) → cdf will be calculated, P(X ≤ x)

Internet tools (Java applets)
• Distribution/Density Calculators, Plotters and RNG's: httpwww statncla J 39 39 df
• Probability calculators within Hyperstat: httpnlavfair stanf nrd edn Nnarasis 39 html

5.1.3 Expected value and variance of the binomial distribution

Recall the binomial setting:
• n independent events (trials)
• Each trial with only 2 outcomes (success or failure)
• Probability p of success is the same in each trial
$X \sim B(n, p)$ → binomial r.v. X that counts the number of successes

A binomial r.v. X is a sum of n Bernoulli r.v.'s $X_i$, $i = 1, \dots, n$:

$X = X_1 + X_2 + \dots + X_n$, with $E(X_i) = p$ and $\mathrm{Var}(X_i) = p(1-p)$

Therefore the expected value of X is $E(X) = np$,
and the variance of X (the trials being independent) is $\mathrm{Var}(X) = np(1-p)$.

• What is the expected number of boys in a family of 5 children?
• What is the variance of the number of boys in a family of 5 children?

Note:
• If p = 0.5, the binomial distribution is symmetric
• If p < 0.5, the binomial distribution is skewed to the right
• If p > 0.5, the binomial distribution is skewed to the left

5.1.4 Proportion of successes

Sometimes we are interested in estimating an unknown success probability in a population (population proportion). We estimate this proportion using a sample (sample proportion).

Y → proportion of successes, $Y = \dfrac{X}{n}$
$X \sim B(n, p)$, $E(X) = np$, $\mathrm{Var}(X) = np(1-p)$

Therefore $E(Y) = \dfrac{E(X)}{n} = p$ and $\mathrm{Var}(Y) = \dfrac{\mathrm{Var}(X)}{n^2} = \dfrac{p(1-p)}{n}$.

• What is the expected value and the variance of the proportion of boys in a family with 5 children? X → number of boys, $X \sim B(5, 0.49)$
• Which value of p would maximize this variance?

5.2 Normal distribution

The normal distribution is a natural distribution of many naturally occurring phenomena: physical measurements, errors in scientific
measurements.
- It is used as a base for many statistical inference methods.
- It is used to model sampling distributions and error distributions.
- Many other important distributions have been derived from the normal distribution and, on the other hand, the normal distribution is widely used to approximate other distributions.
- Normal distribution: mother of all other distributions.

The normal distribution is also called the Gaussian or "bell-shaped" curve ⇒ NORMAL CURVE.

$X \sim N(\mu, \sigma^2)$
μ → mean
σ² → variance
μ and σ² are the parameters of the normal distribution.

The probability density function of the normal distribution is

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$

[Figure: plot of the probability density function of a normal distribution]

5.2.1 Properties of a normal curve

1. Symmetric about μ.
2. Mean = median = mode: the point on the x-axis where the function reaches its maximum.
3. Points of inflection (change of direction) at $x = \mu \pm \sigma$.
4. Concave downward if $\mu - \sigma < x < \mu + \sigma$, concave upward otherwise.
5. Asymptotically approaches the x-axis in both directions.

The total area under the normal curve is equal to one. However, we cannot easily show that $\int_{-\infty}^{\infty} f(x)\,dx = 1$, because the integral of the pdf does not have a closed form.

If $X \sim N(\mu, \sigma^2)$, then $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$.

The parameters μ and σ² (or σ) determine the shape of the normal curve.
• When μ increases, the distribution moves to the right but the shape does not change.
• When σ² increases, the distribution becomes flatter and more spread out, but the central location does not change.

5.2.2 Standard normal distribution

The standard normal distribution is a normal distribution with μ = 0 and σ² = 1. If a r.v. X follows a standard normal distribution, this is denoted $X \sim N(0, 1)$.

The standard normal distribution is a very special distribution; because of that, its pdf and cdf have their own notation, the Greek letter phi ($\phi$ for the pdf, $\Phi$ for the cdf).

The formula of the pdf of a standard normal distribution is

$\phi(x) = \dfrac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad -\infty < x < \infty$

[Figure: Standard Normal Distribution (pdf)]

The cdf is

$\Phi(x) = \int_{-\infty}^{x} \phi(y)\,dy$

This integral does not have a closed-form solution. To calculate $\Phi(x)$ we use tables.

If $X \sim N(0, 1)$, then $\Phi(x) = P(X \le x)$.

The cdf of a standard normal distribution is an S-shaped line.

[Figure: CDF of Standard Normal Distribution]

$\Phi(0) = 0.5$ → median, because N(0, 1) is symmetric about 0. Also, x = 0 is the point of inflection of the
cdf curve: the curve is convex if x < 0 and concave if x > 0.

5.2.3 Probability calculation using the normal distribution

Example 1
Suppose that $Z \sim N(0, 1)$. Find:

a) P(Z ≤ 1.15)
$P(Z \le 1.15) = \Phi(1.15)$
To find $\Phi(1.15)$ we use Table I in the Hayter book. This table contains values of the cumulative distribution function $\Phi(x)$, i.e. the area under the pdf of the standard normal distribution to the left of x.
To read the value of $\Phi(1.15)$ we scan down the first column (labeled x) in Table I until we locate the row containing the unit and the first decimal (in this case x = 1.1). Next we read across this row until we intersect the column that contains the second decimal of x = 1.15 (in this case 0.05).
The value of $\Phi(1.15)$ is ____

b) P(Z ≥ 1.15)
c) 13010232 115
d) 130212 115 115
e) 1302 115
f) The value of x for which P(Z ≤ x) = 0.75
g) The value of x for which P 2 S x 075

Example 2
Real-life example (pig production): average daily gain (ADG) of pigs is one of the main factors that influence the efficiency of production. High ADG is desirable, and pig breeders try to improve it through selection of pigs with the highest ADG for further breeding.
Mr. K is a pig breeder. He has a large herd of pigs in which ADG is normally distributed with mean 800 g/day and standard deviation 75 g/day. He is interested in improving ADG in his herd and asks you for help. Here are his questions:

a) "If I keep for further breeding only pigs with ADG of 900 g/day and higher, what proportion of pigs will be kept?"
The first step in solving this problem is to standardize the normal random variable of interest:
if $X \sim N(\mu, \sigma^2)$ and $Z = \dfrac{X - \mu}{\sigma}$, then $Z \sim N(0, 1)$ and we can use Table I to calculate the probabilities.

b) "A certain buyer is interested in buying pigs that have ADG between 650 g/day and 850 g/day. What proportion of my pigs could I sell to him?"

c) "If I select only 10% of my pigs with the highest ADG for further breeding, what will be the minimum ADG of the selected pigs?"

d) "What is the interval of ADG values centered at the mean that contains 80% of my pigs?"

5.2.4 Critical points of the standard normal distribution

In the previous examples we saw how we can use Table I to
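determine quantiles or percentiles of the standard normal distribution:
-1.28 → 10th percentile, or 0.1 quantile
1.28 → 90th percentile, or 0.9 quantile

Some of the percentiles of the standard normal distribution are used frequently in statistical testing. They are called CRITICAL POINTS and have their own notation, $z_\alpha$:
α is the probability that the standard normal random variable takes values as large as or larger than $z_\alpha$:

$P(Z \ge z_\alpha) = \alpha$, i.e. $\Phi(z_\alpha) = 1 - \alpha$

The most important critical points are shown in Table I:
$z_{0.05} = 1.645$
$z_{0.01} = 2.326$

When you conduct a statistical test (e.g. a z-test) you calculate a test statistic from the sample in order to test your hypothesis. If the calculated test statistic is larger than the critical point $z_\alpha$ for a chosen α, you will reject your hypothesis at the α significance level (more about it in chapter 8).

5.2.5 Linear combination of normal random variables

5.2.5.1 Linear function of a normal r.v.

$X \sim N(\mu, \sigma^2)$, $E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$
$Y = aX + b$
$E(Y) = aE(X) + b = a\mu + b$
$\mathrm{Var}(Y) = a^2\,\mathrm{Var}(X) = a^2\sigma^2$
$Y \sim N(a\mu + b,\ a^2\sigma^2)$

5.2.5.2 Sum of two normal r.v.

$X_1 \sim N(\mu_1, \sigma_1^2)$, $E(X_1) = \mu_1$, $\mathrm{Var}(X_1) = \sigma_1^2$
$X_2 \sim N(\mu_2, \sigma_2^2)$, $E(X_2) = \mu_2$, $\mathrm{Var}(X_2) = \sigma_2^2$
$X_1$ and $X_2$ → independent
$Y = X_1 + X_2$
$E(Y) = E(X_1) + E(X_2) = \mu_1 + \mu_2$
$\mathrm{Var}(Y) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) = \sigma_1^2 + \sigma_2^2$
$Y \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$

5.2.5.3 Sum of n normal r.v.

Population: $X \sim N(\mu, \sigma^2)$
Sample of size n: $X_1, X_2, X_3, \dots, X_n$ → n iid normal r.v.'s
Each $X_i \sim N(\mu, \sigma^2)$, $E(X_i) = \mu$, $\mathrm{Var}(X_i) = \sigma^2$
Sum of n iid normal r.v.'s: $Y = X_1 + X_2 + X_3 + \dots + X_n$
$E(Y) = E(X_1) + E(X_2) + \dots + E(X_n) = \mu + \mu + \dots + \mu = n\mu$
$\mathrm{Var}(Y) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \dots + \mathrm{Var}(X_n) = \sigma^2 + \sigma^2 + \dots + \sigma^2 = n\sigma^2$
$Y \sim N(n\mu,\ n\sigma^2)$

5.2.5.4 Average of n normal r.v.

Population: $X \sim N(\mu, \sigma^2)$
Sample of size n: $X_1, X_2, X_3, \dots, X_n$ → n iid normal r.v.'s
Each $X_i \sim N(\mu, \sigma^2)$, $E(X_i) = \mu$, $\mathrm{Var}(X_i) = \sigma^2$
Average of n iid normal r.v.'s: $\bar{X} = \dfrac{X_1 + X_2 + \dots + X_n}{n}$
$E(\bar{X}) = \dfrac{1}{n}\left[E(X_1) + E(X_2) + \dots + E(X_n)\right] = \dfrac{1}{n}\, n\mu = \mu$
$\mathrm{Var}(\bar{X}) = \dfrac{1}{n^2}\left[\mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \dots + \mathrm{Var}(X_n)\right] = \dfrac{1}{n^2}\, n\sigma^2 = \dfrac{\sigma^2}{n}$
$\bar{X} \sim N\!\left(\mu,\ \dfrac{\sigma^2}{n}\right)$

Example
A bottling company is using a filling machine to fill plastic bottles with cola. The content of the bottles varies according to a normal distribution with mean μ = 298 ml and standard deviation σ = 3 ml.

a) What is the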
probability that an individual bottle contains less than 295 ml?
b) What are the mean and the variance of the total content of the bottles in a 6-pack?
c) What are the mean and the variance of the average content of the bottles in a 6-pack?
d) What is the probability that the average content of the bottles in a 6-pack is less than 295 ml?
e) What is the probability that exactly one bottle in a six-pack contains less than 295 ml?
f) What is the expected number of bottles containing less than 295 ml in a six-pack?

5.2.6 Normal distribution software

5.2.6.1 Microsoft Excel functions

Function wizard button → statistical
• NORMDIST: calculates f(x) or F(x) for a given x
  NORMDIST(x, mean, standard_dev, cumulative), where cumulative → 0 for f(x) or 1 for F(x)
• NORMINV: calculates x for a given value of F(x)
  NORMINV(probability, mean, standard_dev)
• NORMSDIST: calculates $\Phi(z)$ for a given z, for the standard normal distribution
  NORMSDIST(z)
• NORMSINV: calculates z for a given value of $\Phi(z)$, for the standard normal distribution
  NORMSINV(probability)

5.2.6.2 Normal Distribution on the Web

• Z curve applet for the standard normal distribution
  um u 39 perstatz table html
  Computes: area below z, area in tails, area from mean to z
• VassarStats calculator for the standard normal distribution
  httpfacultyvassareduNlowmzphtml
  Calculates:
  area under the curve to the left of z → P(Z ≤ z)
  area under the curve to the right of z → P(Z ≥ z)
  area in tails → P(|Z| ≥ z)
  area in the middle → P(|Z| ≤ z)
• Distribution/Density calculators
  httpwww stat 11012 J 39 39 cdf
  Normal distribution: pdf, cdf, plots, probabilities
• Probability calculators within Hyperstat
  httpnlavfair stanf nrd edn 4141 39 39 39 html
  For any normal distribution. Calculates: area to the left of the first value, area to the right of the second value, area between two values
• Statlets: probability distribution calculators, plotters
  httpwwwstatletscomfreepdisthtm
  Calculates and plots probability distributions. You can enter up to 5 different distributions. Interesting plots of pdf and cdf, tail areas, critical values.
• Seeing
Statistics: normal probability calculation demonstrations
  httppsvch cnlnradn edn UULJL 39 39 html
  Normal distribution: probabilities
  Normal distribution: more interactive applet
  z scores: calculates one-tailed, two-tailed, cumulative and middle probabilities; also does standardization automatically

5.2.7 "Sigma" rules

Using the "z scores" applet from the "Seeing Statistics" electronic textbook, fill in the blanks in the following table.

Middle area between:
μ | σ | μ - σ and μ + σ | μ - 2σ and μ + 2σ | μ - 3σ and μ + 3σ
0 | 1 | ____ | ____ | ____
800 | 75 | ____ | ____ | ____
4 | 2 | ____ | ____ | ____

For any normal distribution:
• 68.3% of values are between μ - σ and μ + σ
• 95.4% of values are between μ - 2σ and μ + 2σ
• 99.7% of values are between μ - 3σ and μ + 3σ
These are the so-called sigma rules.

[Figure: probability density of the normal distribution]

5.2.8 Approximating distributions with normal distribution

The normal distribution provides a good approximation to many other distributions. The most popular is the normal approximation to the binomial distribution.

Example
It is very tedious to obtain this probability (that a student has at least 20 correct answers) by hand. The answer that we get using software, e.g. the Microsoft Excel function BINOMDIST (unknown probability = 1 - BINOMDIST(19, 60, 0.25, 1)), is 0.0925.

But for hand calculation we need another way to obtain this probability.

Expected value of X: $E(X) = \mu = np$
Variance of X: $\mathrm{Var}(X) = \sigma^2 = np(1-p)$
Thus, approximately, $X \sim N(np,\ np(1-p))$.

Now we can use the normal distribution to calculate the probability that our student has at least 20 correct answers.

• The approximation improves when n gets larger.
• The approximation is good when p = 0.5 (symmetric distribution).
• Problems may arise if p is close to 0 or 1 and if n is small.
• Generally, a relatively good approximation can be expected if both the number of successes and the number of failures are at least 10, i.e. $np \ge 10$ and $n(1-p) \ge 10$.

How can we improve this approximation?

5.2.8.1 Continuity correction

[Figure: line graph of a binomial pmf, P(X = x), for x = 0, ..., 6]

Let's connect the lines in this line graph and construct a histogram ("probability histogram").
Now instead of P(X = 1) we have P(0.5 ≤ X ≤ 1.5), instead of P(X = 2) we have P(1.5 ≤ X ≤ 2.5), etc. Instead of P(X = 0) we have P(X ≤ 0.5), and instead of P(X = 6) we have P(X ≥ 5.5). This
is called continuity correction (CC).

Example 1
What is P(X ≤ 1)? Calculate the exact binomial probability, the normal probability without CC and the normal probability with CC.

Example 2
Calculate the following probabilities both exactly and by using a normal approximation with CC:
a) P(X ≥ 5)
b) P(3 ≤ X ≤ 4)

Formulas
$X \sim B(n, p)$ ⇒ approximately $X \sim N(np,\ np(1-p))$

$P(X \le x) \approx P\!\left(Z \le \dfrac{x + 0.5 - np}{\sqrt{np(1-p)}}\right) = \Phi\!\left(\dfrac{x + 0.5 - np}{\sqrt{np(1-p)}}\right)$

$P(X \ge x) \approx P\!\left(Z \ge \dfrac{x - 0.5 - np}{\sqrt{np(1-p)}}\right) = 1 - \Phi\!\left(\dfrac{x - 0.5 - np}{\sqrt{np(1-p)}}\right)$

Applets on the Internet
httpwwwrufriceeduNlanestat simbinom demohtml
• Illustrates a binomial distribution and its normal approximation
• No probability calculation
httpwwwrufriceeduNlanestat simnormal approxindexhtml
• Graphics of the binomial distribution and its normal approximation
• Calculates exact binomial probabilities and the corresponding normal probabilities with continuity correction

Example 3
a) What is the probability that a student who guesses blindly has at least 20 correct answers on a multiple-choice test consisting of 60 questions, each with 4 possible answers? Calculate this probability using a normal approximation with CC.
b) How many questions are needed in order to be 99% confident that the student scores no more than 35% on the test?

5.2.8.2 Central limit theorem

Recall: $X_1, X_2, X_3, \dots, X_n$ → n iid r.v.'s, $X_i \sim N(\mu, \sigma^2)$
$\bar{X} \sim N\!\left(\mu,\ \dfrac{\sigma^2}{n}\right)$ → average of n random variables
$\sum_{i=1}^{n} X_i \sim N(n\mu,\ n\sigma^2)$ → sum of n random variables

If n random variables are normally distributed, their average and their sum will be normally distributed as well.
Moreover, if n is large, the average and the sum of n iid random variables will be approximately normally distributed regardless of what the original distribution of each X is.
This is called the CENTRAL LIMIT THEOREM (CLT).

If there is a large number of r.v.'s, their average, sum etc. will be normally distributed (at least approximately). That is why a normal approximation to the binomial distribution works better when n is large.

CLT on the Internet
Rolling dice from GASP: httpwwwstatsceduNwestjavahtmlCLThtml
Sampling distribution simulation: httpwwwruf rice eduNlane stat sim sampling distindeXhtml

5.2.9 Distributions related to normal distribution

5.2.9.1 Chi-square distribution

If the random variable X follows a standard normal distribution, i.e. if $X \sim N(0, 1)$, then the random variable $Y = X^2$ follows a chi-square distribution with one degree of freedom, i.e. $Y \sim \chi^2_1$.

If there are ν independent standard normal r.v.'s, then the random variable Y which is the sum of their squares, i.e. $Y = X_1^2 + X_2^2 + \dots + X_\nu^2$, follows a chi-square distribution with ν degrees of freedom:

$Y \sim \chi^2_\nu$, defined for $Y \ge 0$

Number of degrees of freedom (df) → number of independent pieces of information that go into the estimate of the parameter. In the Hayter book the symbol used for the number of df is ν (Greek letter nu).

$X \sim \chi^2_\nu$ → X follows a chi-square distribution with ν degrees of freedom

[Figure: Chi-Square Distribution]

ν is the parameter of the chi-square distribution and determines its shape:
- smaller ν → peaked distribution, skewed to the right
- larger ν → flatter, more symmetric distribution
- very large ν → close to a normal distribution (central limit theorem!) with mean μ = ν and variance σ² = 2ν

The most important thing regarding the chi-square distribution are its critical points.
Denoted by $\chi^2_{\alpha,\nu}$, defined as $P(X \ge \chi^2_{\alpha,\nu}) = \alpha$

Critical points of the chi-square distribution are given in Table II (p. 909) for different α's. The α-values at the top of each column indicate the area in the right side ("upper tail areas") of the particular chi-square distribution with ν df.

For example, the critical point of the chi-square distribution with 5 df for α = 0.05, i.e. $\chi^2_{0.05,5}$ = ____
This means that a random variable that follows a chi-square distribution with 5 df has a 5% probability of taking values equal to or larger than $\chi^2_{0.05,5}$.

Microsoft Excel functions
• CHIDIST: calculates the chi-square probability P(X ≥ x) for a given x
  CHIDIST(x, deg_freedom)
• CHIINV: calculates x for a given probability
  CHIINV(probability, deg_freedom)

Applications of chi-square distribution
• Test for "goodness of fit": test if observed proportions significantly deviate from expected proportions. Chi-square test → used in genetics
• Test for independence in two-way tables
• Likelihood
Ratio Test
• The distribution of the sample variance is a scaled chi-square distribution

5.2.9.2 The t-distribution (Student's t-distribution)

The t-distribution, or Student's t-distribution, was named after a statistician, William S. Gosset, an employee of Guinness Breweries in Ireland, who was interested in making inferences about the mean when the standard deviation was unknown. Employees were not permitted to publish research work under their own names, so Gosset adopted the pseudonym "A Student".

If X is a r.v. with a standard normal distribution, i.e. $X \sim N(0, 1)$, and Y is a chi-square r.v., independent of X, which is divided by its number of df, i.e. $Y \sim \dfrac{\chi^2_\nu}{\nu}$, then

$T = \dfrac{X}{\sqrt{Y}} \sim t_\nu$, defined for $-\infty < t < \infty$

The t-distribution is similar to the standard normal distribution: it is symmetric about its mean, which is zero, just a little bit flatter.

ν, the number of degrees of freedom, is the parameter of the t-distribution and determines its shape:

[Figure: Student's t Distribution]

- smaller ν → flatter distribution
- very large ν (ν → ∞) → the t-distribution converges to the standard normal distribution

Again, the most important thing regarding the t distribution are its critical points.
Denoted by $t_{\alpha,\nu}$, defined as $P(X \ge t_{\alpha,\nu}) = \alpha$

Critical points of the t distribution are given in Table III (p. 910) for different α's. Again we read the α's in the first row and the number of df ν in the first column.

For example, the critical point of the t distribution with 5 df for α = 0.05, i.e. $t_{0.05,5}$ = ____

Note that the critical points become smaller with an increasing number of df (for a given α); this is because the distribution becomes less flat (closer to the standard normal) with a larger number of df. The last row in Table III corresponds to the standard normal distribution ($z_\alpha$).

One-tailed and two-tailed probabilities
Because the t distribution is symmetric around its mean (0), we distinguish between:
• One-tailed probability: $P(X \ge t_{\alpha,\nu}) = \alpha$ or $P(X \le -t_{\alpha,\nu}) = \alpha$
• Two-tailed probability: $P(|X| \ge t_{\alpha/2,\nu}) = \alpha$, i.e.
  $P(X \ge t_{\alpha/2,\nu} \text{ or } X \le -t_{\alpha/2,\nu}) = \alpha$
  $P(X \ge t_{\alpha/2,\nu}) + P(X \le -t_{\alpha/2,\nu}) = \alpha$
  or $2P(X \ge t_{\alpha/2,\nu}) = \alpha$, or $2P(X \le -t_{\alpha/2,\nu}) = \alpha$

Microsoft Excel functions
• TDIST: calculates the one-tailed or two-tailed probability for a given x
  TDIST(x, deg_freedom, tails)
  Hint: x must be positive
• TINV: calculates x for a given two-tailed probability
  TINV(probability_two_tailed, deg_freedom)
  To obtain x for a one-tailed probability, enter double the probability

Applications of t distribution
• Construction of confidence intervals for μ
• Hypothesis tests for means (t test):
  comparing a sample mean with a known population mean (one-sample t-test)
  comparing two sample means (two-sample t-test)

5.2.9.3 The F distribution

If $X_1$ and $X_2$ are two independent chi-square r.v.'s, both divided by their respective numbers of df, i.e.

$X_1 \sim \dfrac{\chi^2_{\nu_1}}{\nu_1}$ and $X_2 \sim \dfrac{\chi^2_{\nu_2}}{\nu_2}$

then the ratio F of these two r.v.'s, $F = \dfrac{X_1}{X_2}$, has an F distribution with ν1 and ν2 degrees of freedom:

$F \sim F_{\nu_1,\nu_2}$

An F distribution has two numbers of df:
ν1 → "numerator df"
ν2 → "denominator df"

The F-distribution is defined for x ≥ 0. It is unimodal and skewed to the right.
ν1 and ν2 are the parameters of the F-distribution: they determine its shape.

[Figure: F Distribution]

The variance decreases with larger ν1 and ν2 → the distribution becomes sharply spiked about 1.
The mean of the F-distribution depends on ν2:

$E(F) = \dfrac{\nu_2}{\nu_2 - 2}$ for $\nu_2 > 2$; $E(F) \to 1$ if ν2 is very large

Again, the critical points are the most important when working with an F-distribution.
Denoted by $F_{\alpha,\nu_1,\nu_2}$, defined as $P(X \ge F_{\alpha,\nu_1,\nu_2}) = \alpha$

The critical points of the F-distribution can be found in Table IV for α = 0.1 (p. 911), α = 0.05 (p. 912) and α = 0.01 (p. 913). Locate the numbers of df:
ν1 → first row
ν2 → first column

F1052 = ____
F 0555 = ____

Microsoft Excel functions
• FDIST: calculates the probability P(X ≥ x) for a given x
  FDIST(x, deg_freedom1, deg_freedom2)
• FINV: calculates x for a given probability
  FINV(probability, deg_freedom1, deg_freedom2)

Applications of F distribution
• The F distribution is used in statistical testing to compare the variances of two samples (F test for variances) and in the analysis of variance (ANOVA), which actually tests differences among means.

Chapter 1

3 Probability theory

3.0 Introduction

Nothing in life is certain. In everything we do, we gauge the chances of successful outcomes, from business to medicine to the weather. Probability is the formal study of the laws of chance.
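One way to see a "law of chance" at work is to simulate many tosses of a fair coin and track the running proportion of heads. The following is a minimal sketch, not part of the original notes: the function name `running_proportion_of_heads`, the seed and the number of tosses are choices made here for illustration, and a pseudo-random generator stands in for a real coin.

```python
import random


def running_proportion_of_heads(n_tosses, seed=0):
    """Simulate n_tosses fair-coin tosses; return the proportion of heads after each toss."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    proportions = []
    for toss_number in range(1, n_tosses + 1):
        if rng.random() < 0.5:  # treat values below 0.5 as "head"
            heads += 1
        proportions.append(heads / toss_number)
    return proportions


props = running_proportion_of_heads(100_000)
for n in (1, 10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>6} tosses: proportion of heads = {props[n - 1]:.4f}")
```

After a single toss the proportion is necessarily 0 or 1; as the number of tosses grows, the printed proportions should settle close to 0.5, although any individual run will wobble on the way there.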
Probability, or chance, is all around us. Sometimes chance results from human design, as in the casino's games of chance and the statistician's random samples. Sometimes nature uses chance, as in choosing the gender of a child. Sometimes the reasons for chance behavior are mysterious, as when the number of deaths each year in a large population is as regular as the number of heads in many tosses of a coin.

Probability helps us to answer questions like the following ones:
• How can gambling, which depends on the unpredictable fall of dice and cards, be a profitable business for a casino?
• If you buy a lottery ticket every day for many years, how much will each ticket win on average?
• Give a test for the AIDS virus to the employees of a small company. What is the chance of at least one positive test if all the people tested are free of the virus?
• Gregor Mendel found that inheritance is governed by chance. What do the laws of probability say about the color of Mendel's peas or the gender of a couple's children?
• What are the chances that your sister's first son is color blind if you know that your father was color blind?

The study of probability started with gambling. Gambling supposedly began in ancient Egypt, where people used four-sided "astragali" made of animal heel bones. There is some evidence that the Roman Emperor Claudius wrote the first known book about gambling. However, the use of probability in gambling began later, in the Middle Ages, when gambling became popular among French aristocrats. The official beginning of probability as a science (a branch of mathematics) is connected to the correspondence between two famous French mathematicians in the 17th century, Blaise Pascal and Pierre de Fermat, who wanted to predict the outcome in games of chance in order to increase the amount of money won.

Today probability is a branch of mathematics dealing with uncertainty. Probability provides the basis for statistical inference.

3.1 Probability and randomness

Toss a coin, or a die. The result of this
action is unpredictable, but after several repetitions of this action we can notice a certain regular pattern in the outcomes.

Example: What is the probability of getting a head when tossing a coin?
If we toss a coin once, we get either a head or a tail. If we get a head, the proportion of heads so far is P(head) = 1. If we get a tail, P(head) = 0. If we toss the coin again, the outcome may be a head again, then P(head) = 1, or a tail, then P(head) = 1/2. If we toss the coin many, many times (thousands), we will see that the proportion of heads gets closer to 1/2. In other words, after many, many tosses the proportion of heads converges to 1/2. Therefore, the proportion of heads in the long run is 1/2.

Probability → proportion in the long run

Random ≠ haphazard. In statistics, "random" means that a result is not known in advance, but it shows a certain pattern after many repetitions, i.e. in the long run.

Probability → long-term relative frequency

3.2 Basic definitions
Random experiment: a process of observing the outcome of a random event.
Elementary outcome or sample point (O): a single possible outcome of a random experiment.
Sample space (S): the set or collection of all possible elementary outcomes of a random experiment.
Event (A, B, C, ...): a subset (one or multiple sample points) of a sample space S.

3.3 Probability values
Outcomes and events have probabilities.
Probability of an outcome O → p
Probability of an event A → P(A)

3.3.1 Facts
• Any probability is a number between 0 and 1. Higher numbers mean higher probabilities, i.e. a greater chance of occurring. Probability 0 → the event never occurs (impossible event). Probability 1 → the event always occurs (certain event).
  $0 \le P(A) \le 1$
  P(A) = 0 → A is impossible
  P(A) = 1 → A is certain
• The sum of the probabilities of all outcomes in a sample space S must be one.
  $S = \{O_1, O_2, O_3, \ldots, O_n\}$, where $p_i$ is the probability of outcome $O_i$
  $p_1 + p_2 + p_3 + \cdots + p_n = 1$, or $\sum_{i=1}^{n} p_i = 1$, so P(S) = 1
  If the probability of one outcome in a sample space is one, the probability of all other outcomes in this sample space must be zero.
• If there are n mutually exclusive events (1) $A_1, \ldots, A_n$ within the sample space S, then
the probability of all these events will be $P(A_1) + P(A_2) + \cdots + P(A_n) = \sum_{i=1}^{n} P(A_i)$.

(1) Mutually exclusive events are events that do not have outcomes in common, i.e. events that cannot happen at the same time. For example, if we choose a person at random and record their gender, then the events A = male and B = female are mutually exclusive, since a person cannot be male and female at the same time.

3.3.2 Assigning probabilities
Consider the following examples:
1. Tossing a fair coin (fair coin → head and tail are equally likely)
2. Tossing a fair die (fair die → all 6 outcomes are equally likely)

1. Outcome:      Head   Tail
   Probability:  1/2    1/2

2. Outcome:      1     2     3     4     5     6
   Probability:  1/6   1/6   1/6   1/6   1/6   1/6

If there are k outcomes and all k outcomes are equally likely, the probability of each outcome is 1/k.
The probability of event A is
$P(A) = \frac{\#\text{ of outcomes in } A}{\#\text{ of outcomes in } S} = \frac{\#\text{ of outcomes in } A}{\text{total } \#\text{ of outcomes}}$ → relative frequency

Examples
1. Toss one coin. What is the probability of getting a head or a tail?
2. Toss two coins. What is the probability of getting at least one tail?
3. Choose a student from the STAT 3000 class at random (assumption: each student has an equal chance of being chosen). What is the probability that this student is male and single?

3.4 Graphical presentation of events
Venn diagram → uses "bubbles" to represent events
Sample space S → rectangle
Individual outcomes → dots
Events → bubbles

[Figure: Venn diagram of two disjoint bubbles. A and B are mutually exclusive (disjoint) events → no outcomes in common. Example: S = STAT 3000 class, A = male students, B = female students.]
[Figure: Venn diagram of two overlapping bubbles. A and B are not mutually exclusive events → they have outcomes in common. Example: S = STAT 3000 class, A = male students, B = single students.]

A Venn diagram enables simple calculation of the probabilities of events and combinations of events, just by adding the probabilities of the outcomes contained in the "bubbles" of interest.

3.5 Events and complements
An event is a subset of the possible outcomes of the sample space S.
$A \subset S$ → "A belongs to S"
P(A) = sum of the probabilities of all outcomes contained in A
Everything within the sample space which is not contained
in A is called the complement of A.
Complement → $A'$, $A^C$, "not A"

Complement rule:
$P(A') = 1 - P(A)$
$P(A) + P(A') = 1 = P(S)$
For example, if P(A) = 0.53, then P(A') = 1 - 0.53 = 0.47.
If event A comprises the whole sample space, then P(A) = 1 and P(A') = 0. The complement of A is then the empty set $\emptyset$. The probability of the empty set is $P(\emptyset) = 0$.

Examples of events and complements
1. Toss one coin. A = head or tail
2. Toss two coins. A = at least one tail
3. Choose a student from the class at random. A = male, single

3.6 Combination of events

3.6.1 Intersection of events
A = male students, B = single students
$A \cap B$ = male AND single students
$A \cap B$ → "A and B" → intersection of events A and B
$P(A \cap B)$ = sum of the probabilities of all outcomes that are in both A and B

The intersection of event A with its complement A' is the empty set: $A \cap A' = \emptyset$. Therefore $P(A \cap A') = P(\emptyset) = 0$.
The intersection of A' with B is $A' \cap B$ → everything in B which is not in A.
The intersection of A with B' is $A \cap B'$ → everything in A which is not in B.
Mutually exclusive events: A = male students, B = female students, $A \cap B = \emptyset$.

[Figure: Venn diagrams of $A \cap B$, $A' \cap B$ and $A \cap B'$ for A = male students and B = single students]

3.6.2 Union of events
A = male students, B = single students
$A \cup B$ = male OR single students
$A \cup B$ → "A or B" → union of events A and B

The outcomes in the event $A \cup B$ can be
• in A but not in B: $A \cap B'$
• in B but not in A: $A' \cap B$
• in both A and B: $A \cap B$
These three events are mutually exclusive. For A = male and B = single:
$A \cap B'$ = male and married
$A \cap B$ = male and single
$A' \cap B$ = female and single

Because these three events are mutually exclusive, the probability of $A \cup B$ will be
$P(A \cup B) = P(A \cap B') + P(A \cap B) + P(A' \cap B)$
We can write these probabilities differently:
$P(A \cap B') = P(A) - P(A \cap B)$
$P(A' \cap B) = P(B) - P(A \cap B)$
Substituting $P(A \cap B')$ and $P(A' \cap B)$ in the equation above, we have
$P(A \cup B) = P(A) - P(A \cap B) + P(B) - P(A \cap B) + P(A \cap B)$

Addition rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
P(male or single) = ____

If events A and B are mutually exclusive, $A \cap B$ will be $\emptyset$ and $P(A \cap B)$ will be 0, thus
$P(A \cup B) = P(A) + P(B)$ → addition rule for mutually exclusive events
P(male or female) = ____
P(Biol major or ComSci major) = ____

$(A \cup B)'$ can be written as $A' \cap B'$ → outcomes that are neither in A nor in B.
$A \cup B$ = male or single; $(A \cup B)'$ = neither male nor single

[Figure: Venn diagram of $(A \cup B)'$]

Chapter 7 Inferences on
Population Mean

7.0 Introduction
Statistical inference → the process of drawing conclusions about the population based on the results from a sample.
Statistical inference consists of
• confidence intervals
• hypothesis testing

7.1 Confidence Intervals

7.1.1 z-confidence intervals
The SAT tests are widely used measures of readiness for college study. There are two parts: one for verbal reasoning ability (SAT-V) and one for mathematical reasoning ability (SAT-M). In 1995, the mean SAT scores nationwide were 500 points with a standard deviation of 100 points. Last year the SAT-M test was given to a random sample of 500 high school students from California. They obtained an average score of 461 points. Based on the results of this sample, what can we say about the SAT-M scores in the whole population of high school students?

Population (reference values): $\mu = 500$, $\sigma = 100$
Sample (observed data): $n = 500$, $\bar{x} = 461$, $se(\bar{x}) = \sigma/\sqrt{n} = 100/\sqrt{500} \approx 4.47$

According to the sigma rules, there is a probability of about 95% that $\bar{x}$ lies within 9 points (2 standard deviations of the sample mean) of the population mean:
$P(\mu - 9 \le \bar{x} \le \mu + 9) \approx 0.95$
This can also be written as
$P(\bar{x} - 9 \le \mu \le \bar{x} + 9) \approx 0.95$
i.e. the population mean $\mu$ lies within 9 points of the sample mean $\bar{x}$.

In other words, if we take a large number of samples of 500 students from the population of all students, about 95% of these samples will capture the true population mean between $\bar{x} - 9$ points and $\bar{x} + 9$ points.
Based on our sample ($\bar{x} = 461$), we can be about 95% confident that the true population mean $\mu$ lies between 461 - 9 = 452 points and 461 + 9 = 470 points:
$P(\mu \in (452, 470)) \approx 0.95$
This is an interval with probability $1 - \alpha$. The probability $1 - \alpha$ is called the CONFIDENCE LEVEL.
Confidence level: $1 - \alpha = 0.95$, or 95%

[Figure: normal curve with central area $1 - \alpha$ and area $\alpha/2$ in each tail]

The most important confidence intervals (CI) are
• 90% CI: $1 - \alpha = 0.90$
• 95% CI: $1 - \alpha = 0.95$
• 99% CI: $1 - \alpha = 0.99$

$(1 - \alpha)$ CI → $\bar{x} \pm$ critical point $\times\ se(\bar{x})$, where critical point $\times\ se(\bar{x})$ is the margin of error.

95% CI for the mean SAT score:
$1 - \alpha = 0.95$, so $\alpha = 0.05$ and $\alpha/2 = 0.025$
Lower bound of $\mu$: $\bar{x} - z_{0.025}\,\frac{\sigma}{\sqrt{n}}$
Upper bound of $\mu$: $\bar{x} + z_{0.025}\,\frac{\sigma}{\sqrt{n}}$

Because we were interested in both the lower and the upper bound of $\mu$, and we assumed that the standard deviation $\sigma$ is known, so that we could use the standard normal distribution and its critical
points $z_\alpha$, we call this confidence interval a two-sided z-confidence interval.

We might be interested in the lower or the upper bound only. Then we talk about a one-sided z-confidence interval.

What is the 95% one-sided z-confidence interval for the lower bound, i.e. the minimum possible value of the mean SAT score?
Now we have to find the $(1 - \alpha)$ one-sided confidence interval for the lower bound of $\mu$, which is
$\mu \in \left(\bar{x} - z_{\alpha}\,\frac{\sigma}{\sqrt{n}},\ \infty\right)$
In our example, the 95% one-sided z-confidence interval for the lower bound of $\mu$ is ____

We are 95% confident that the mean SAT score is not below this value. Individual students' scores can be lower than this lower limit, but the mean score of all students will not be below this value in 95% of all cases.

7.1.2 t-confidence intervals
Situations in which we can assume that the standard deviation in the population, $\sigma$, is known are very rare, almost unrealistic. More often we only know s, the sample standard deviation calculated from our data. Therefore we have to calculate the standard error of the mean using s, which implies that the distribution of standardized sample means is no longer a standard normal distribution but a t distribution with n - 1 degrees of freedom (n = sample size).

Using s (a statistic calculated from the sample, which varies with each new sample) instead of $\sigma$ (a parameter, i.e. a fixed value for a given population) brings more uncertainty. That is why the distribution of sample means standardized using s is "sloppier" than a standard normal distribution.

Example: Find the 90% confidence interval for the mean body weight of all female students at USU, using the data we collected in this class at the beginning of the course.

Population: all female students at USU ($\mu$, $\sigma$)
Sample: female students from the STAT 3000 class ($n$, $\bar{x}$, $s$)

The standardized sample mean: $t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
Distribution of standardized sample means: $t_{n-1}$

Two-sided $(1 - \alpha)$ t-confidence interval:
$(1 - \alpha)$ CI → $\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$
We find the critical point $t_{\alpha/2,\,n-1}$ in Table III (critical points of the t distribution).

For the mean
female students' body weight, the two-sided 90% t-confidence interval is:
$1 - \alpha = 0.90$, so $\alpha = 0.10$ and $\alpha/2 = 0.05$
Lower bound of $\mu$: $\bar{x} - t_{0.05,\,n-1}\,\frac{s}{\sqrt{n}}$
Upper bound of $\mu$: $\bar{x} + t_{0.05,\,n-1}\,\frac{s}{\sqrt{n}}$

What is the 90% one-sided t-confidence interval for the upper bound of the mean of female students' body weight?
The $(1 - \alpha)$ one-sided confidence interval for the upper bound of $\mu$ is
$\mu \in \left(-\infty,\ \bar{x} + t_{\alpha,\,n-1}\,\frac{s}{\sqrt{n}}\right)$
In our example, the 90% one-sided t-confidence interval for the upper bound of $\mu$ is
$\bar{x} + t_{0.10,\,n-1}\,\frac{s}{\sqrt{n}}$ = ____
Again, it means that in 90% of all samples the mean body weight will not exceed this value, although individual students' weights can be higher.

7.1.3 Length of confidence interval
$\mu \in \left(\bar{x} - t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}\right)$
L → length of the confidence interval
$L = 2 \times t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} = 2 \times$ margin of error

Example: Calculate and compare the lengths of the 90%, 95% and 99% two-sided t-confidence intervals for the mean of the female students' body weight.
L (90% CI) = ____
L (95% CI) = ____
L (99% CI) = ____

The length of the confidence interval depends on the chosen confidence level.
Higher confidence level → higher probability that $\mu$ is contained within this CI
Higher confidence level → longer CI → lower accuracy
We want both: a high confidence level and high accuracy.

7.1.4 Effect of sample size on confidence interval
How can we reduce L for a given confidence level?
• Reduce the sample standard deviation: usually impossible, or it produces biased results.
• Increase the sample size: larger samples → shorter CIs → better accuracy.
$L = 2 \times t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$ → L is inversely proportional to $\sqrt{n}$

How big should our sample be to have a CI not longer than a certain "desired" length $L_0$?
$L_0 \ge 2 \times t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$, so $n \ge 4 \left(\frac{t_{\alpha/2,\,n-1}\,s}{L_0}\right)^2$, where $L_0$ is the "desired" length of the CI.

How many female students should I have in a sample to have a 90% CI for the mean body weight not larger than 10 lbs?
$n \ge 4 \left(\frac{t_{\alpha/2,\,n-1}\,s}{L_0}\right)^2$ = ____
If I already have $n_1$ ($n_1$ = ____) female students in my sample and I need n students (n = ____), how many additional female students should I sample to get a CI not larger than 10 lbs?

The t critical point in this example is based on my previous experience: the number of df is the number of students I had in the
first sample minus 1 ($n_1 - 1$). If I have no previous experience, I can assume that
1. I can sample infinitely. In this case I will take the critical t point for the chosen confidence level and $\infty$ df (the critical z point).
2. I can sample at most n individuals. In this case I will take the critical t point for the given confidence level and n - 1 df.

For example:
a) Assume I can sample female students infinitely. How many female students should I have in a sample to have a 90% CI for the mean body weight not larger than 10 lbs?
b) Assume I can sample up to 50 female students due to funding and time limitations. How many female students should I have in a sample to have a 90% CI for the mean body weight not larger than 10 lbs?

If there is a large difference between the n allowed and the n required, I have to
• try to get more funding for my research, or
• lower my accuracy level.

One more example: A random sample of 41 glass sheets is obtained and their thickness is measured. The sample mean is $\bar{x} = 3.04$ mm and the sample standard deviation is s = 0.124 mm.
• Construct a 99% two-sided t-interval for the mean glass thickness.
• Do you think it is plausible that the mean glass thickness is 2.90 mm?
• How many additional glass sheets do you think should be sampled in order to construct a 99% two-sided t-confidence interval for the average sheet thickness with a length not larger than $L_0 = 0.05$ mm?

7.2 Hypothesis testing

7.2.0 Introduction
Confidence intervals give us information about the accuracy of our estimate of a certain parameter. Hypothesis testing, another aspect of statistical inference, gives us information about the plausibility or credibility of a specific statement or hypothesis.

Hypothesis → in statistics, a hypothesis means a certain theory, claim, assertion or statement about a particular parameter in the population.
Hypothesis testing → a procedure (statistical test) that enables us to make inferences about a population parameter by analyzing differences between the results we observe (our sample statistic) and the results we expect to obtain if some
underlying hypothesis is actually true.

In other words, we conduct a statistical test to check whether our hypothesis is plausible or not. Based on the results of this test, we decide to accept or to reject the hypothesis.

Because in this chapter we compare one sample mean with the "known" population mean, we talk about one-sample tests.

7.2.1 z-test

7.2.1.1 Two-sided z-test
SAT scores example: In 1995, the mean SAT scores nationwide were 500 points with a standard deviation of 100 points. Last year the SAT-M test was given to a random sample of 500 high school students from California. They obtained an average score of 461 points. Based on our sample (the data we observe), can we believe that the mean SAT score is really 500 points?

The first step in hypothesis testing is to state the hypotheses.
The hypothesis we want to test: $\mu = 500$
This hypothesis, that the population parameter is equal to the claimed value, is called the null hypothesis, $H_0$.
In our example, $H_0: \mu = 500$
Generally, $H_0: \mu = \mu_0$, where $\mu_0$ is some specific fixed value.

Note that even though we actually work with a sample, the null hypothesis is written in terms of the population parameter. This is because we are interested in the entire population of high school students taking the SAT-M test.

If the null hypothesis is judged false or implausible, something else must be plausible. Therefore, whenever we state a null hypothesis, we have to state an alternative hypothesis, $H_A$, the one that must be plausible if the null hypothesis is found implausible.
In our example, $H_A: \mu \ne 500$
Generally, $H_A: \mu \ne \mu_0$, where $\mu_0$ is the same value as in $H_0$.

We test the null hypothesis that the mean SAT score is 500 points against the alternative hypothesis that the mean SAT score is not equal to 500 points, i.e.
$H_0: \mu = 500$
$H_A: \mu \ne 500$
This is a two-sided hypothesis testing problem because $H_A$ states that $\mu$ is different from 500, either smaller or larger.
Generally, in a two-sided problem we test $H_0: \mu = \mu_0$ against $H_A: \mu \ne \mu_0$.

The second step in hypothesis testing is obtaining the test
statistic. A test statistic is calculated from the data. It quantifies the difference between the stated value of the parameter ($\mu_0$) and the point estimate calculated from our data ($\bar{x}$).
We calculate the test statistic by standardizing the obtained sample mean: we subtract the value stated in $H_0$ and divide the result by the standard deviation of the sample means. For the SAT score problem, we obtain the test statistic as follows:
$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ = ____
The test statistic we calculate here is called a z-statistic because we assume that the population standard deviation is known. Thus, this test will be a two-sided (two-tailed) z-test.

The third step in hypothesis testing is obtaining the p-value.
If $H_0$ is plausible, we expect the point estimate obtained from the data to take values very close to the value stated in $H_0$, with the difference due just to sampling error. If $H_0$ is plausible, the calculated test statistic (the "standardized" sample mean) will be very small, close to zero.
But if there are large discrepancies between the point estimate obtained from the data and the value stated in $H_0$, i.e. if the absolute value of the calculated test statistic is large, this might indicate that $H_0$ is wrong.

[Figure, left panel: distribution of sample means from the distribution with $\mu = 500$, obtained from many samples of size n = 500. Right panel: distribution of "standardized" sample means from the distribution with $\mu = 0$, obtained from many samples of size n = 500.]

We have to determine how likely the test statistic we obtained from the sample is if $H_0$ is really true, i.e. if the population mean is really 500.
The p-value is the probability of obtaining a test statistic equal to or more extreme than the one we observed, if $H_0$ is plausible.

[Figure: large p-value versus small p-value]

Obtaining the p-value for a two-sided z-test:
p-value $= P(|Z| \ge |z|)$ → area in the tails
$= P(Z \le -|z| \text{ or } Z \ge |z|)$
$= P(Z \le -|z|) + P(Z \ge |z|)$
$= 2 \times P(Z \le -|z|)$
$= 2 \times \Phi(-|z|)$
For our SAT score problem, the p-value is
p-value $= 2 \times \Phi(-|z|)$ = ____

Interpretation of the p-value
A large p-value means that the observed data, i.e. the
parameter estimate obtained from our sample, is quite plausible for the distribution of sample means when the true $\mu = \mu_0$.
A small p-value means that the observed data, i.e. the parameter estimate obtained from our sample, is very unlikely when the true $\mu = \mu_0$, which means that
• our sample comes from a different population, or
• the information about $\mu$ is not correct: $H_0$ is wrong and $H_A$ is plausible.

Finally, the fourth step in hypothesis testing is making a decision about $H_0$, whether to accept or to reject $H_0$. This decision is based on the obtained p-value.

General "classification" of p-values
• p-value $\ge 0.05$ → the difference between our sample's mean and the population mean $\mu_0$ stated in $H_0$ is just due to chance, i.e. it is just a result of sampling error; in fact there is no difference → there is not enough evidence to reject $H_0$ → $H_0$ is accepted.
• $0.05 >$ p-value $\ge 0.01$ → there is some evidence that the mean of the population that our sample comes from is different from $\mu_0$; however, this evidence is not overwhelming → $H_0$ is rejected at the 5% significance level.
• p-value $< 0.01$ → there is strong evidence that the mean of the population is different from $\mu_0$ → $H_0$ is rejected at the 1% significance level.

In our SAT score problem:
p-value = ____
Decision: ____

7.2.1.2 One-sided z-test
In our SAT score problem, the sample mean we obtained is much smaller than the value of 500 stated in $H_0$. Is there enough evidence to assume that the population mean, i.e. the mean SAT-M score of all students, is smaller than 500 points?

1. State hypotheses
$H_0: \mu \ge 500$
$H_A: \mu < 500$
This is a one-sided hypothesis testing problem because we test only the lower bound of $\mu$.

2. Calculate test statistic
$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ = ____

3. Obtain p-value
The p-value is the probability of obtaining the observed value of z or an even smaller one if the population mean is really 500:
p-value $= P(Z \le z) = \Phi(z)$ = ____

4. State conclusions (make a decision about $H_0$)
p-value = ____
Decision: ____

7.2.1.3 Some comments on significance level
The significance level $\alpha$ is the probability of wrongly rejecting $H_0$ when in fact $H_0$ is plausible.

[Figure: regions of rejection and nonrejection for two-sided and one-sided tests]

The significance level is usually chosen in advance. It is usually 5% ($\alpha = 0.05$) or 1% ($\alpha = 0.01$), and sometimes 0.1% ($\alpha = 0.001$). The level of significance we choose indicates the risk of committing a Type I error when making decisions about $H_0$.
• $\alpha = 0.05$ → the probability of error is 5%, which is relatively high, so we will choose this significance level only when being wrong will not have great consequences. Usually, if we reject $H_0$ at the 5% significance level but do not reject it at the 1% significance level, we will, whenever possible, try to collect more data to get a clearer picture.
• $\alpha = 0.01$ → the probability of error is 1%, which indicates a smaller risk of error. We will choose this level when making an error can cost time and money.
• $\alpha = 0.001$ → the probability of error is 0.1%; the risk of being wrong is very small. We will choose this level of significance if making an error can be life threatening (medical research).

Based on our sample data, we obtain the p-value and compare it with the chosen $\alpha$. The p-value obtained from the data is referred to as the observed level of significance, which is the smallest level at which $H_0$ can be rejected for a given set of data.
The decision rules for rejecting $H_0$ are as follows:
• If the p-value is greater than or equal to $\alpha$, the null hypothesis is accepted (not rejected) at the $\alpha$ significance level.
• If the p-value is smaller than $\alpha$, the null hypothesis is rejected at the $\alpha$ significance level.

Sometimes, however, we cannot obtain the exact p-value (no Table I and no software available). In such situations we can obtain an approximate p-value by comparing the calculated test statistic (z-statistic) with the critical points of the standard normal distribution, $z_\alpha$, for the chosen $\alpha$. In most situations we will compare the calculated z-statistic with
• $z_{0.025}$ and $z_{0.005}$ → for a two-sided problem
• $z_{0.05}$ and $z_{0.01}$ → for a one-sided problem
The decision rules for rejecting $H_0$ are then as follows:

Two-sided problem
$H_0: \mu = \mu_0$, $H_A: \mu \ne \mu_0$
$|z| \le z_{\alpha/2}$ → accept $H_0$
$|z| > z_{\alpha/2}$ → reject $H_0$

One-sided problem
Test for lower bound: $H_0: \mu \ge \mu_0$, $H_A: \mu < \mu_0$
$z \ge -z_{\alpha}$ → accept $H_0$
$z < -z_{\alpha}$ → reject $H_0$
Test for upper bound: $H_0: \mu \le \mu_0$, $H_A: \mu > \mu_0$
$z \le z_{\alpha}$ → accept $H_0$
$z > z_{\alpha}$ → reject $H_0$

p-values
For $H_A: \mu > \mu_0$ the p-value is $P(Z \ge z)$
For $H_A: \mu < \mu_0$ the p-value is $P(Z \le z)$
For $H_A: \mu \ne \mu_0$ the p-value is $2 \times P(Z \ge |z|)$

7.2.1.4 Summary
Four-step recipe for hypothesis testing:
1. State hypotheses
2. Calculate test statistic
3. Obtain p-value
4. State conclusions

If we can obtain the p-value exactly, we can proceed directly to step 4. If we cannot obtain the p-value exactly and have to compare the calculated test statistic with the critical points of the appropriate distribution instead, proceed as follows:

Compare your calculated z with the critical point for $\alpha = 0.05$, that is $z_{0.05}$ (one-sided problem) or $z_{0.025}$ (two-sided problem).
• If $|z| \le z_{0.025}$ (two-sided) or $z \ge -z_{0.05}$ (one-sided, lower bound) or $z \le z_{0.05}$ (one-sided, upper bound), then
  → accept $H_0$
  → there is not sufficient evidence to reject $H_0$
  → end of the problem
• If $|z| > z_{0.025}$ (two-sided) or $z < -z_{0.05}$ (one-sided, lower bound) or $z > z_{0.05}$ (one-sided, upper bound), then
  → reject $H_0$ at the 5% significance level
  → and continue as follows:

Compare your calculated z with the critical point for $\alpha = 0.01$, that is $z_{0.01}$ (one-sided problem) or $z_{0.005}$ (two-sided problem).
• If $|z| \le z_{0.005}$ (two-sided) or $z \ge -z_{0.01}$ (one-sided, lower bound) or $z \le z_{0.01}$ (one-sided, upper bound), then
  → do not reject $H_0$ at the 1% significance level
  → there is some evidence that $H_0$ is wrong, but the evidence is not very strong
  → end of the problem (in practice, try to collect more data if possible)
• If $|z| > z_{0.005}$ (two-sided) or $z < -z_{0.01}$ (one-sided, lower bound) or $z > z_{0.01}$ (one-sided, upper bound), then
  → reject $H_0$ at the 1% significance level
  → there is strong evidence that $H_0$ is wrong
  → end of the problem

When we accept $H_0$, it means that $H_0$ is plausible. When we reject $H_0$, it means that $H_0$ is judged wrong (not plausible).
We cannot prove that $H_0$ is true because there are many plausible $H_0$'s. We can only show that $H_0$ is wrong.

Hypothesis testing is similar to parentage testing:
→ In parentage testing, we can only exclude a certain man as the father of the child, but we cannot prove that this man is really the father of the child.
→ In hypothesis testing, we can exclude a certain H0
and show that it is not plausible, but we cannot prove that this $H_0$ is really true.

7.2.2 t-test
In many situations we do not know the population standard deviation $\sigma$ and have to work with the sample standard deviation s that we obtain from our sample data. When we calculate the test statistic, which is in fact a standardized sample mean, we have to use s instead of $\sigma$. Thus, the calculated test statistic is no longer a z-statistic; it is a t-statistic:
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
The t-statistic follows a "sloppier" t distribution with n - 1 degrees of freedom (n = sample size). Because we calculate a t-statistic and use a t distribution, we talk about a t-test.

7.2.2.1 Two-sided t-test
Example 1: Weight of a candy bar (adapted from a STAT 3000 project, Fall 1999)
Is the weight of a candy bar really as stated on the package? A group of STAT 3000 students decided to check how reliable the weight information stated on the package is. For this exercise I have chosen their results from the analysis of Almond Joy candy bars.
The net weight stated on the package of Almond Joy candy bars is 19.0 g. Our students purchased 20 such bars and obtained their net weight using a precise scale (then they ate the sample). From this sample of 20 bars they obtained an average weight of 20.053 g with a standard deviation of 0.514 g.

a) Is there enough evidence to conclude that the weight of Almond Joy candy bars is different from that stated on the package?

1. State hypotheses
$H_0$: ____
$H_A$: ____

2. Calculate test statistic
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ = ____

3. Obtain p-value
In this case we have to obtain the p-value based on a t distribution with n - 1 degrees of freedom:
p-value $= P(|T| \ge |t|)$, where T has a t distribution with n - 1 = 20 - 1 = 19 df
$= P(T \ge |t| \text{ or } T \le -|t|)$
$= 2 \times P(T \ge |t|)$ → two-tailed probability
We can obtain this p-value using Microsoft Excel:
p-value = TDIST(|t|, n - 1, 2)
If we do not have software at hand, we have to work with Table III, which contains the critical points of the t distribution. Then we have to obtain an approximate p-value by comparing the calculated t-statistic with the critical points of the
t distribution with n - 1 df for $\alpha = 0.05$ and $\alpha = 0.01$.
For $\alpha = 0.05$: $t_{0.025,\,19}$ = ____
For $\alpha = 0.01$: $t_{0.005,\,19}$ = ____

4. State conclusions

b) Construct a 99% two-sided confidence interval for the mean weight of Almond Joy candy bars.
c) How does the confidence interval constructed in part b) provide the answer to part a)?

7.2.2.2 One-sided t-test
Example 2: Water hardness (adapted from the Final Exam, Spring 2000)
For their out-of-class STAT project, Nicole, Troy and Terry from STAT 3000 Section 002 decided to analyze the hardness of water in the Widstoe Chemistry Building at USU. They analyzed nine (9) 50 ml samples of tap water and obtained an average hardness of 205.22 ppm CaCO3 and a standard deviation of 2.17 ppm CaCO3.
a) Construct a two-sided 95% confidence interval for the mean hardness of water in the Widstoe Chemistry Building.
b) Would a 99% two-sided confidence interval for mean water hardness be shorter or longer than the 95% confidence interval found in a)? You do not have to construct this interval, just answer the question.
c) The reported water hardness in Logan and surroundings is 190 ppm CaCO3. Based on the results of our students' experiment, is there any evidence that the hardness of water in the Widstoe Chemistry Building is greater than that reported for Logan and surroundings?

7.2.2.3 One-sample t-test: Summary

Two-sided problem
$H_0: \mu = \mu_0$, $H_A: \mu \ne \mu_0$
$|t| \le t_{\alpha/2,\,n-1}$ → accept $H_0$
$|t| > t_{\alpha/2,\,n-1}$ → reject $H_0$

One-sided problem
Test for lower bound: $H_0: \mu \ge \mu_0$, $H_A: \mu < \mu_0$
$t \ge -t_{\alpha,\,n-1}$ → accept $H_0$
$t < -t_{\alpha,\,n-1}$ → reject $H_0$
Test for upper bound: $H_0: \mu \le \mu_0$, $H_A: \mu > \mu_0$
$t \le t_{\alpha,\,n-1}$ → accept $H_0$
$t > t_{\alpha,\,n-1}$ → reject $H_0$

p-values
For $H_A: \mu > \mu_0$ the p-value is $P(T \ge t)$
For $H_A: \mu < \mu_0$ the p-value is $P(T \le t)$
For $H_A: \mu \ne \mu_0$ the p-value is $2 \times P(T \ge |t|)$

7.3 Statistical software useful for confidence intervals and hypothesis testing

1. Rice Virtual Lab in Statistics (RVLS) http://www.ruf.rice.edu/~lane/rvls.html
→ Analysis Lab http://www.ruf.rice.edu/~lane/stat_analysis
90%, 95% and 99% two-sided t-intervals; one-sample two-sided t-test for the mean

2. Statlets httpwwwstatletscomfree1 1WebStathtm1
→ Analyze → One sample → One variable analysis → t-test
arbitrary one- and two-sided t-intervals; one-sample one- and two-sided t-test
→ Analyze → One sample → Hypothesis test (Mean and Sigma) → t-test
arbitrary one- and two-sided t-intervals; one-sample one- and two-sided t-test when only sample statistics (no data) are available

3. Webstat http://www.stat.sc.edu/webstat
→ Analyze → One sample → One variable analysis → t-test
arbitrary one- and two-sided t- and z-intervals; one-sample one- and two-sided t- and z-test

4. Microsoft Excel
→ Tools → Data analysis → Descriptive statistics → Confidence level for mean
margin of error for the mean

The variance and standard deviation are good measures of spread when the distribution is symmetric and without outliers. However, when the distribution is skewed and/or there are outliers, the variance and standard deviation are not the best measures of spread because they are sensitive to skewness and outliers. In the case of a skewed distribution and/or outliers, both variance and standard deviation are "inflated", i.e. they are much larger than the real spread of the data. In such situations we need more robust measures of spread.

2.2.2.5 Quantiles and percentiles
Quantiles are measures that divide the data set into several equal parts. The pth quantile, $Q_p$, is the value such that the proportion p of the observations has values smaller than $Q_p$ and the proportion 1 - p of the observations has values larger than $Q_p$.

[Figure: ordered data set with proportion p of the observations below $Q_p$ and proportion 1 - p above $Q_p$]

For example, $Q_{0.10}$ is the value such that the proportion 0.10 of the observations have smaller values and the proportion 0.90 have larger values than $Q_{0.10}$.

Usually, instead of quantiles, we use percentiles. The pth percentile, $P_p$, is the value such that p% of the observations are smaller and (100 - p)% of the observations are larger than $P_p$. For example, $P_{10}$ is the value such that 10% of the observations have smaller values and 90% of the observations have larger values than $P_{10}$.

2.2.2.6 Quartiles
The most important quantiles are quartiles, the values that split the data set into
four equal parts. The quartiles are:
• The first (lower) quartile, or $Q_1$: the value such that 25% of the observations have smaller values and 75% of the observations have larger values than $Q_1$.
• The second quartile, or $Q_2$: the value such that 50% of the observations have smaller values and 50% of the observations have larger values than $Q_2$. The second quartile is better known as the median.
• The third (upper) quartile, or $Q_3$: the value such that 75% of the observations have smaller values and 25% of the observations have larger values than $Q_3$.
In addition, the smallest observation in the data set (min) is referred to as $Q_0$, and the largest observation in the data set (max) is referred to as $Q_4$.

Obtaining quartiles $Q_1$ and $Q_3$
There are several methods to obtain the lower and the upper quartile.

1. When the data set is small:
• Sort the data in ascending order (from the smallest to the largest observation).
• Divide the sorted data list into two halves: the "lower half" (observations smaller than the median) and the "upper half" (observations larger than the median). If n is odd, the median is a data point; exclude the median.
• Find the median of the lower half → this is the lower quartile $Q_1$.
• Find the median of the upper half → this is the upper quartile $Q_3$.

Lower and upper quartile of body weight of STAT 3000 female students: ____

2. When the data set is large:
• Sort the data in ascending order (from the smallest to the largest observation).
• Approximate the quartiles by using the following positioning point formulas:
  $Q_1$ = value corresponding to the $\frac{n+1}{4}$th ordered observation, i.e. $Q_1 = x_{\left(\frac{n+1}{4}\right)}$
  $Q_3$ = value corresponding to the $\frac{3(n+1)}{4}$th ordered observation, i.e. $Q_3 = x_{\left(\frac{3(n+1)}{4}\right)}$
The following rules are then used to obtain the quartile values:
• If the resulting positioning point is an integer, the quartile will take the value of the corresponding observation.
• If the resulting positioning point is halfway between two integers, the quartile will take the average value of these two corresponding observations.
• If the resulting positioning point is neither an
integer nor a value halfway between two integers, a simple rule is to round off to the nearest integer and to take the value of the corresponding observation.

$Q_1$ and $Q_3$ of body weight of STAT 3000 female students obtained using the approximate formulas: ____

3. When we want to calculate quartiles (and any other quantile) exactly:
There are two main formulas used to determine the position of a particular observation in the data set. The quantile corresponding to $x_{(i)}$, the ith observation in the ordered data list, is obtained as either
(1) $\frac{i - 0.5}{n}$ or (2) $\frac{i}{n+1}$
where i = the position of the observation $x_{(i)}$ in the ordered data list and n = the number of observations in the sample.
Therefore, when Formula 1 is used, $x_{(i)}$ is the $\frac{i - 0.5}{n}$th quantile, or $Q_{\frac{i - 0.5}{n}} = x_{(i)}$. Similarly, when Formula 2 is used, $x_{(i)}$ is the $\frac{i}{n+1}$th quantile, or $Q_{\frac{i}{n+1}} = x_{(i)}$.

As the quartiles usually fall between two data points, we use linear interpolation to calculate the exact values of the quartiles. The values of the quartiles are then calculated as the appropriately weighted average of the values of the two observations between which the particular quartile lies.

Example (Figure 6.37 from the Hayter book, p. 316): This data set has n = 20 observations. The data list is ordered. Find the lower and the upper quartile of this data set.

  i    x(i)   (i - 0.5)/n   i/(n + 1)
  1    0.9    0.0250        0.0476
  2    1.3    0.0750        0.0952
  3    1.8    0.1250        0.1429
  4    2.5    0.1750        0.1905
  5    2.6    0.2250        0.2381
                                       ← Q1
  6    2.8    0.2750        0.2857
  7    3.6    0.3250        0.3333
  8    4.0    0.3750        0.3810
  9    4.1    0.4250        0.4286
 10    4.2    0.4750        0.4762
 11    4.3    0.5250        0.5238
 12    4.3    0.5750        0.5714
 13    4.6    0.6250        0.6190
 14    4.6    0.6750        0.6667
 15    4.6    0.7250        0.7143
                                       ← Q3
 16    4.7    0.7750        0.7619
 17    4.8    0.8250        0.8095
 18    4.9    0.8750        0.8571
 19    4.9    0.9250        0.9048
 20    5.0    0.9750        0.9524

Finding $Q_1$ and $Q_3$ using Formula 1
The lower quartile $Q_{0.25}$ must lie somewhere between $x_{(5)}$ and $x_{(6)}$, because $x_{(5)}$ corresponds to $Q_{0.225}$ and $x_{(6)}$ corresponds to $Q_{0.275}$. $Q_{0.25}$ must be obtained by linear interpolation and calculated as a weighted average of $x_{(5)}$ and $x_{(6)}$. These weights depend on the distance of $Q_{0.25}$ relative to the
quantiles corresponding to $x_{(5)}$ and $x_{(6)}$ → the smaller the distance between $x_{(5)}$ and the unknown $Q_{0.25}$, the greater the weight for $x_{(5)}$ and the smaller the weight for $x_{(6)}$ in the calculation of $Q_{0.25}$.

Calculating weights:

Weight for $x_{(5)}$: $W_1 = 1 - \frac{\text{distance between } Q_{0.25} \text{ and } x_{(5)}}{\text{distance between } x_{(6)} \text{ and } x_{(5)}} = 1 - \frac{0.25 - 0.225}{0.275 - 0.225} = 1 - \frac{0.025}{0.05} = 0.5$

Weight for $x_{(6)}$: $W_2 = 1 - W_1 = 1 - 0.5 = 0.5$

Thus the lower quartile will be calculated as

$Q_1 = W_1 \times x_{(5)} + W_2 \times x_{(6)} = 0.5 \times 2.6 + 0.5 \times 2.8 = 2.7$

Similarly, the upper quartile $Q_{0.75}$ must lie somewhere between $x_{(15)}$ and $x_{(16)}$, because $x_{(15)}$ corresponds to $Q_{0.725}$ and $x_{(16)}$ corresponds to $Q_{0.775}$. $Q_{0.75}$ will be obtained by linear interpolation and calculated as a weighted average of $x_{(15)}$ and $x_{(16)}$.

Weight for $x_{(15)}$: $W_1 = 1 - \frac{\text{distance between } Q_{0.75} \text{ and } x_{(15)}}{\text{distance between } x_{(16)} \text{ and } x_{(15)}} = 1 - \frac{0.75 - 0.725}{0.775 - 0.725} = 1 - \frac{0.025}{0.05} = 0.5$

Weight for $x_{(16)}$: $W_2 = 1 - W_1 = 1 - 0.5 = 0.5$

Thus the upper quartile will be calculated as

$Q_3 = W_1 \times x_{(15)} + W_2 \times x_{(16)} = 0.5 \times 4.6 + 0.5 \times 4.7 = 4.65$

In this example, when the quantiles are calculated using Formula 1, the lower quartile lies exactly in the middle between $x_{(5)}$ and $x_{(6)}$, and the upper quartile lies exactly in the middle between $x_{(15)}$ and $x_{(16)}$. Therefore, we could also have obtained these quartiles by calculating the arithmetic average of the two observations flanking $Q_1$ and $Q_3$, respectively.

Finding $Q_1$ and $Q_3$ using Formula 2

When Formula 2 is used to calculate the corresponding quantiles for the observations in the data set, the lower quartile $Q_{0.25}$ lies somewhere between $x_{(5)}$ and $x_{(6)}$, because $x_{(5)}$ corresponds to $Q_{0.2381}$ and $x_{(6)}$ corresponds to $Q_{0.2857}$. $Q_{0.25}$ must be obtained by linear interpolation and calculated as a weighted average of $x_{(5)}$ and $x_{(6)}$.

Calculating weights:

Weight for $x_{(5)}$: $W_1 = 1 - \frac{\text{distance between } Q_{0.25} \text{ and } x_{(5)}}{\text{distance between } x_{(6)} \text{ and } x_{(5)}} = 1 - \frac{0.25 - 0.2381}{0.2857 - 0.2381} = 1 - \frac{0.0119}{0.0476} = 0.75$

Weight for $x_{(6)}$: $W_2 = 1 - W_1 = 1 - 0.75 = 0.25$

Thus the lower quartile will be calculated as

$Q_1 = W_1 \times x_{(5)} + W_2 \times x_{(6)} = 0.75 \times 2.6 + 0.25 \times 2.8 = 2.65$

Similarly, the upper quartile $Q_{0.75}$ lies somewhere between $x_{(15)}$ and $x_{(16)}$, because $x_{(15)}$ corresponds to $Q_{0.7143}$ and $x_{(16)}$ corresponds to $Q_{0.7619}$. $Q_{0.75}$ will be obtained by linear interpolation and calculated
as a weighted average of $x_{(15)}$ and $x_{(16)}$.

Weight for $x_{(15)}$: $W_1 = 1 - \frac{\text{distance between } Q_{0.75} \text{ and } x_{(15)}}{\text{distance between } x_{(16)} \text{ and } x_{(15)}} = 1 - \frac{0.75 - 0.7143}{0.7619 - 0.7143} = 1 - \frac{0.0357}{0.0476} = 0.25$

Weight for $x_{(16)}$: $W_2 = 1 - W_1 = 1 - 0.25 = 0.75$

Thus the upper quartile will be calculated as

$Q_3 = W_1 \times x_{(15)} + W_2 \times x_{(16)} = 0.25 \times 4.6 + 0.75 \times 4.7 = 4.675$

Note: Formula 2 is evidently the one used in the Hayter book.

These formulas are usually used in statistical software packages. There is no reason to prefer one formula to the other, because when the data set is large the results are very similar. When the data set is large, very similar results can also be obtained by a simple calculation using the approximate formula. For example, for our data set with 20 observations, the first quartile obtained using the approximate formula will be

$Q_1 = x_{\left(\frac{20+1}{4}\right)} = x_{(5.25)} \approx x_{(5)} = 2.6$

and the third quartile will be

$Q_3 = x_{\left(\frac{3(20+1)}{4}\right)} = x_{(15.75)} \approx x_{(16)} = 4.7$

Thus the approximate formulas are preferred for calculations by hand. The main advantage of using the exact formulas (either Formula 1 or Formula 2) is that we can exactly calculate any quantile for a given data set.

Example: Calculate $Q_{0.60}$ for the data set used in the previous example using Formula 2.

Solution: When Formula 2 is used, $Q_{0.60}$ lies somewhere between $x_{(12)}$ and $x_{(13)}$, since $x_{(12)}$ corresponds to $Q_{0.5714}$ and $x_{(13)}$ corresponds to $Q_{0.6190}$. Thus $Q_{0.60}$ must be calculated as a weighted average of $x_{(12)}$ and $x_{(13)}$.

Weight for $x_{(12)}$: $W_1 = 1 - \frac{\text{distance between } Q_{0.60} \text{ and } x_{(12)}}{\text{distance between } x_{(13)} \text{ and } x_{(12)}} = 1 - \frac{0.60 - 0.5714}{0.6190 - 0.5714} = 1 - \frac{0.0286}{0.0476} = 0.3992$

Weight for $x_{(13)}$: $W_2 = 1 - W_1 = 1 - 0.3992 = 0.6008$

Thus $Q_{0.60}$ will be calculated as

$Q_{0.60} = W_1 \times x_{(12)} + W_2 \times x_{(13)} = 0.3992 \times 4.3 + 0.6008 \times 4.6 = 4.480$

2.2.2.7 Interquartile range (IQR)

The interquartile range (IQR) is the difference between the upper and the lower quartile:

$IQR = Q_3 - Q_1$

Example: IQR of body weight of STAT 3000 female students

IQR → robust measure of variability: it is not influenced by skewness and outliers. The IQR is a better measure of variability than the variance or standard deviation when the data are skewed and/or contain outliers.

2.2.2.8 Box plot

A box plot (or box-and-whisker plot) is a graphical
presentation of the median, quartiles, minimum and maximum of the data. To construct a box plot we need to obtain a five-number summary. The five-number summary consists of:

Min, $Q_1$, M, $Q_3$, Max

The box plot looks as follows:

[Figure: box plot. The box spans $Q_1$ to $Q_3$ with the median M marked inside it; one whisker extends from Min to $Q_1$ and the other from $Q_3$ to Max. 50% of the data lie within the box and 50% outside the box.]

Looking at the box plot we can determine the shape of the distribution:

• If the median is in the middle of the box and the whiskers are equally long, the distribution is symmetric.
• If the median is not in the middle of the box and the whiskers are not equally long, the distribution is skewed.

Rule for outliers: If a data point is more than 1.5 × IQR below or above the box (i.e. more than 1.5 × IQR smaller than $Q_1$ or larger than $Q_3$), it is considered an outlier. Outliers are presented as asterisks outside the box and the whiskers.

Example: Box plot of body weight of STAT 3000 female students

Measures of shape

The shape of the data is compared to the ideal normal distribution.

2.2.3.1 Kurtosis

Kurtosis describes the relative peakedness or flatness of a distribution (the peak as compared to the tails) compared with the normal distribution.

• Kurtosis = 0 → normal distribution
• Kurtosis > 0 → relatively peaked distribution
• Kurtosis < 0 → relatively flat distribution

[Figure: three density curves labeled Mesokurtic (Kurtosis = 0), Leptokurtic (Kurtosis > 0) and Platykurtic (Kurtosis < 0).]

meso = middle, lepto = slim, platy = flat

Formula (for formula-lovers only):

$\text{Kurtosis} = \left[ \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 \right] - \frac{3(n-1)^2}{(n-2)(n-3)}$

2.2.2.9 Skewness

• Skewness = 0 → symmetric distribution
• Skewness > 0 → right-skewed (positively skewed) distribution
• Skewness < 0 → left-skewed (negatively skewed) distribution

[Figure: three density curves labeled Skewness = 0, Skewness > 0 and Skewness < 0.]

Formula (for formula-lovers only): Skewness = [formula illegible in source]

2.3 Some additional comments on graphical presentation of data

2.3.1 Other types of graphs

2.3.1.1 Pictogram

A pictogram is similar to a bar chart, but it uses pictures related to the topic of the graph.

• Computers in schools: each computer icon represents 200,000 computers.
• Milk cartons: a picture of a milk carton is used instead of a bar
chart.

When using or reading a pictogram, extreme caution is required. Pictograms can easily become misleading:

• Milk cartons: our eye sees the area of the milk carton and not only its height → the difference looks huge.
• Diploma example: our perception is such that we see the area of the diplomas and not only the height. An attempt to keep realistic proportions → misleading. An attempt to keep the right dimensions → the graph is not nice.

If problems arise, it is recommended to use a simple bar chart instead of a fancy pictogram.

2.3.1.2 Line graph

Line graphs are usually used to display how a variable changes over time:

• x-axis: time axis
• y-axis: value of the variable

A line graph is used to display a time series → the value of the variable is measured at regular time intervals over a longer period of time. The line enables us to see:

• Long-term changes → TREND
• Short-term changes → SEASONAL FLUCTUATIONS

2.3.1.3 Scatterplot

A scatterplot is useful to present the relationship between two quantitative variables. It is more difficult to read than a line graph, but it contains more information:

• It shows the strength and direction of the relationship between two variables.
• It shows outliers.
• It shows the degree of variability that exists for one variable at each location of the other variable.

Constructing a scatterplot is the first step in regression and correlation analysis.

2.3.2 Use and abuse of graphs

Potential dangers:

• Truncated axes (axes that do not start at zero): the differences among categories might be overemphasized (bar chart). Sometimes it makes sense not to start from zero (line graph, scatterplot), but sometimes it is statistically incorrect and misleading (bar chart).
• Using 2- and 3-dimensional effects to represent one-dimensional data: the human eye perceives what it sees → using volume or area to represent one-dimensional changes might be deceptive.

Other problems:

• Gaps in labeling
• Different categories
• Two different scales on the same graph
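Computational recap of the quartile calculations from earlier in this chapter: the short Python sketch below is one possible way to reproduce the Formula 1 and Formula 2 interpolation, the IQR and the 1.5 × IQR outlier rule on the 20-observation data set from the Hayter example. The function names (position, quantile, outliers) and the variable hayter are illustrative assumptions, not part of the notes, and statistical packages may use other interpolation rules.

```python
# Sketch only: names below are hypothetical helpers, not from the notes.

def position(i, n, formula):
    """Quantile level assigned to the i-th ordered observation (i = 1..n)."""
    if formula == 1:
        return (i - 0.5) / n      # Formula 1: (i - 0.5) / n
    if formula == 2:
        return i / (n + 1)        # Formula 2: i / (n + 1)
    raise ValueError("formula must be 1 or 2")


def quantile(data, p, formula):
    """p-th quantile by linear interpolation between the two flanking observations."""
    xs = sorted(data)
    n = len(xs)
    levels = [position(i, n, formula) for i in range(1, n + 1)]
    if not levels[0] <= p <= levels[-1]:
        raise ValueError("p lies outside the range covered by the data")
    for k in range(n - 1):
        lo, hi = levels[k], levels[k + 1]
        if lo <= p <= hi:
            w1 = 1 - (p - lo) / (hi - lo)   # weight for the lower observation
            return w1 * xs[k] + (1 - w1) * xs[k + 1]
    return xs[-1]                           # only reached when n == 1


def outliers(data, formula=1):
    """Points more than 1.5 x IQR below Q1 or above Q3."""
    q1 = quantile(data, 0.25, formula)
    q3 = quantile(data, 0.75, formula)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]


# Ordered data from the worked example (n = 20).
hayter = [0.9, 1.3, 1.8, 2.5, 2.6, 2.8, 3.6, 4.0, 4.1, 4.2,
          4.3, 4.3, 4.6, 4.6, 4.6, 4.7, 4.8, 4.9, 4.9, 5.0]

print(round(quantile(hayter, 0.25, 1), 3))   # 2.7
print(round(quantile(hayter, 0.75, 1), 3))   # 4.65
print(round(quantile(hayter, 0.25, 2), 3))   # 2.65
print(round(quantile(hayter, 0.75, 2), 3))   # 4.675
print(round(quantile(hayter, 0.60, 2), 3))   # 4.48
print(outliers(hayter))                      # []
```

The printed values agree with the hand calculations in the worked example; the value for $Q_{0.60}$ is computed with unrounded weights (0.4 and 0.6), which gives the same result to three decimals.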