# Mathematical Statistics MT 427

This 158 page Class Notes was uploaded by Mr. Halie Wilkinson on Saturday October 3, 2015. The Class Notes belongs to MT 427 at Boston College taught by Jenny Baglivo in Fall. Since its upload, it has received 153 views. For similar materials see /class/218064/mt-427-boston-college in Mathematics (M) at Boston College.


Date Created: 10/03/15

MT427 Notebook 4, prepared by Professor Jenny Baglivo. Copyright 2009 by Jenny A. Baglivo. All Rights Reserved.

## 4 MT427 Notebook 4

Contents:

- 4.1 Kth Order Statistics and Their Distributions
  - 4.1.1 Definitions
  - 4.1.2 Distribution of the Kth Order Statistic
  - 4.1.3 Approximate Mean and Variance of Order Statistic Distributions
  - 4.1.4 Graphical Analysis: Probability Plots
- 4.2 Estimation and Hypothesis Testing Methods
  - 4.2.1 Large Sample Theory: Sample Median
  - 4.2.2 Approximate Confidence Interval Procedure for the Median
  - 4.2.3 Exact Confidence Interval Procedures for Quantiles
  - 4.2.4 Procedures for Endpoint Parameters
- 4.3 Sample Quantiles
  - 4.3.1 Definitions
  - 4.3.2 Graphical Analysis: Box Plots

This notebook is concerned with the use of order statistics to answer questions about quantiles of continuous distributions. The notes include material from Chapter 3 (joint distributions) and Chapter 10 (summarizing data) of the Rice textbook.

### 4.1 Kth Order Statistics and Their Distributions

#### 4.1.1 Definitions

Let $X_1, X_2, \ldots, X_n$ be a random sample from the continuous distribution whose PDF is $f(x)$ and whose CDF is $F(x) = P(X \le x)$ for all real numbers $x$, and let $k$ be an integer between 1 and $n$ ($k \in \{1, 2, \ldots, n\}$).

1. Kth Order Statistic: The $k$th order statistic, $X_{(k)}$, is the $k$th observation in order: $X_{(k)}$ is the $k$th smallest of $X_1, X_2, \ldots, X_n$.
2. Sample Maximum/Minimum: The largest observation, $X_{(n)}$, is called the sample maximum and the smallest observation, $X_{(1)}$, is called the sample minimum.
3. Sample Median: The sample median is the middle order statistic when $n$ is odd, and the average of the two middle order statistics when $n$ is even:

$$\text{Sample Median} = \begin{cases} X_{((n+1)/2)} & \text{when } n \text{ is odd} \\ \tfrac{1}{2}\left(X_{(n/2)} + X_{(n/2+1)}\right) & \text{when } n \text{ is even} \end{cases}$$

For example, if the following numbers were observed and ordered,

138, 170, 210, 245, 318, 395, 452, 476,

then the observed sample minimum is 138, the observed sample maximum is 476, and the observed sample median is \_\_\_\_.

#### 4.1.2 Distribution of the Kth Order Statistic

Let $X_{(k)}$ be the $k$th order statistic of a random sample of size $n$ from a continuous distribution with PDF $f(x)$ and CDF $F(x)$, and let $k$ be an integer between
1 and $n$. Then:

1. CDF: The cumulative distribution function of $X_{(k)}$ has the following form:

$$F_{(k)}(x) = \sum_{j=k}^{n} \binom{n}{j} (F(x))^{j} (1 - F(x))^{n-j} \quad \text{for all real numbers } x.$$

2. PDF: The probability density function of $X_{(k)}$ has the following form:

$$f_{(k)}(x) = \frac{d}{dx} F_{(k)}(x) = \frac{n!}{(k-1)!\,(n-k)!} (F(x))^{k-1} f(x) (1 - F(x))^{n-k}$$

(after simplification), whenever the derivative exists.

To demonstrate that the formula for $F_{(k)}(x) = P(X_{(k)} \le x)$ is correct, first note that the event that $X_{(k)} \le x$ is equivalent to the event that $k$ or more of the $X_i$'s are $\le x$. Now complete the demonstration.

Exercise. Let $X$ be a uniform random variable on the interval $[1, 5]$, and let $n = 4$.

(a) [Statement partly illegible in source; it concerns the case $x \in [1, 5]$.]

(b) In each case, write the formula for the PDF when $x \in [1, 5]$; simplify, if possible.

[Figure: four panels showing the PDFs of the first, second, third, and fourth order statistics.]

(c) Find $P(X_{(k)} \le 3)$ for $k = 1, 2, 3, 4$.

(d) Find $P(2 \le X_{(2)} \le 3)$.

Exercise. Let $X_{(4)}$ be the 4th order statistic of a random sample of size 5 from the exponential distribution with parameter $\lambda$ [value illegible in source]. Find $P(10 \le X_{(4)} \le 20)$.

[Figure: PDF of $X_{(4)}$.]

Exercise (Sample Maximum/Minimum). The sample maximum and sample minimum of a random sample of size $n$ from the $X$ distribution are used in many computations.

(a) Simplify the general formulas for $F_{(n)}(x)$ and $f_{(n)}(x)$ as much as possible.

(b) Simplify the general formulas for $F_{(1)}(x)$ and $f_{(1)}(x)$ as much as possible.

Exercise (Quantiles of Sample Maximum/Minimum for uniform distributions). Let $X$ be a uniform random variable on the interval $[a, b]$, and consider the sample maximum and sample minimum of a random sample of size $n$ from the $X$ distribution.

(a) Find a general formula for the $p$th quantile of the $X_{(n)}$ distribution.

(b) Find a general formula for the $p$th quantile of the $X_{(1)}$ distribution.

(c) Use your answers to parts (a) and (b) to find the median of the sample maximum and sample minimum distributions when $n = 4$ and $[a, b] = [1, 5]$.

#### 4.1.3 Approximate Mean and Variance of Order Statistic Distributions

Let $X$ be a continuous random variable with PDF $f(x)$. In addition, let

1. $X_{(k)}$ be the $k$th order statistic of a random
sample of size $n$ from the $X$ distribution, and
2. $\theta$ be the $p$th quantile of the $X$ distribution, where $p = \frac{k}{n+1}$.

If $f(\theta) \ne 0$, then

$$E(X_{(k)}) \approx \theta \quad \text{and} \quad \mathrm{Var}(X_{(k)}) \approx \frac{p(1-p)}{(n+2)\,(f(\theta))^2}.$$

Notes:

1. The theorem tells us that the $n + 1$ intervals $(-\infty, E(X_{(1)}))$, $(E(X_{(1)}), E(X_{(2)}))$, $\ldots$, $(E(X_{(n-1)}), E(X_{(n)}))$, $(E(X_{(n)}), \infty)$ are approximately equally likely. That is, the probability that $X$ falls in any one of these intervals is approximately $\frac{1}{n+1}$.
2. The formulas given in the theorem are exact for uniform distributions.

Exercise. Let $X_{(3)}$ be the third order statistic of a random sample of size 4 from the uniform distribution on the interval [endpoints illegible in source].

(a) Use the theorem above to find approximate values for $E(X_{(3)})$ and $\mathrm{Var}(X_{(3)})$.

(b) Demonstrate that the approximate values obtained in part (a) are exact. That is, compute the mean and variance starting with the PDF of the third order statistic, and check that your final answers are the same as the answers in part (a).

#### 4.1.4 Graphical Analysis: Probability Plots

Let $X$ be a continuous random variable and $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ be the observed values of the order statistics of a random sample of size $n$ from the $X$ distribution. A probability plot is a plot of pairs of the form

$$\left(\tfrac{k}{n+1}\text{st model quantile},\; x_{(k)}\right), \quad k = 1, 2, \ldots, n.$$

The theorem in the last section tells us that the ordered pairs in a probability plot should lie roughly on the line $y = x$.

For example, I used the computer to generate a random sample of size 95 from the normal distribution with mean 0 and standard deviation 10.

1. Probability Plot: The first plot is a probability plot of pairs of the form $\left(\tfrac{k}{96}\text{th model quantile},\; x_{(k)}\right)$ for $k = 1, 2, \ldots, 95$. In the plot, the observed order statistic (vertical axis) is plotted against its approximate expected value (horizontal axis).
2. Comparison Plot: The second plot shows an empirical histogram of the same sample, superimposed on the density curve for a normal distribution with mean 0 and standard deviation 10. Twelve subintervals of equal length were used to construct the empirical histogram.

[Figures: probability plot (expected vs. observed) and comparison plot (histogram with normal density).]

Footnotes: If $n$ is large, then both plots give good graphical comparisons of model and data. But if $n$ is small to moderate, then the probability plot
may be a better way to compare model and data, since the shape of the empirical histogram may be very different from the shape of the density function of the continuous model.

### 4.2 Estimation and Hypothesis Testing Methods

#### 4.2.1 Large Sample Theory: Sample Median

The following theorem tells us that the sampling distribution of the sample median is approximately normal when $n$ is a large odd integer.

Theorem (Sampling Distribution). Let $X$ be a continuous random variable with density function $f(x)$ and with median $\theta$. Further, let $n$ be an odd integer and $\hat{\theta} = X_{((n+1)/2)}$ be the sample median of a random sample of size $n$. If $f(\theta) \ne 0$ and $n$ is large, then the distribution of $\hat{\theta}$ is approximately normal with mean $\theta$ and variance $\dfrac{1}{4n\,(f(\theta))^2}$.

#### 4.2.2 Approximate Confidence Interval Procedure for the Median

Under the conditions of the theorem above, an approximate $100(1-\alpha)\%$ confidence interval for the median of the $X$ distribution has the following form:

$$\hat{\theta} \pm z(\alpha/2) \sqrt{\frac{1}{4n\,(\hat{f}(\hat{\theta}))^2}},$$

where $\hat{\theta} = X_{((n+1)/2)}$ and $z(\alpha/2)$ is the $100(1-\alpha/2)\%$ point of the standard normal distribution. In this formula, $\hat{f}(\hat{\theta})$ is the estimate of $f(\theta)$ obtained by substituting the sample median for $\theta$.

Exercise. Let $X$ be a Cauchy random variable with center $\theta$ and spread 1. The density function of $X$ is

$$f(x) = \frac{1}{\pi\left(1 + (x - \theta)^2\right)} \quad \text{for all real numbers } x.$$

The median of the Cauchy distribution is $\theta$; the mean is indeterminate.

(a) Evaluate the variance formula $\dfrac{1}{4n\,(f(\theta))^2}$.

(b) [Data illegible in source: the values of a random sample from a Cauchy distribution with center $\theta$ and spread 1.] Construct an approximate confidence interval for $\theta$ [confidence level illegible in source].

Exercise. Let $X$ be the continuous random variable with density function

$$f(x) = \frac{1}{10} e^{-(x - \tau)/10} \quad \text{when } x > \tau, \text{ and 0 otherwise.}$$

Note that this $X$ has a shifted exponential distribution with shift parameter $\tau$ and scale parameter 10. Let $\theta$ be the median of the $X$ distribution.

(a) Demonstrate that $\theta = \tau + 10\ln(2)$, and find $f(\theta)$.

(b) If $\hat{\theta}$ is an estimator of $\theta$, then $\hat{\theta} - 10\ln(2)$ is an estimator of $\tau$. Further, both estimators have the same
variance. Use these facts to develop an approximate confidence interval procedure for $\tau$ when the sample size $n$ is a large, odd integer.

[Data illegible in source: the values of a random sample from a shifted exponential distribution with shift parameter $\tau$ and scale parameter 10.]

#### 4.2.3 Exact Confidence Interval Procedures for Quantiles

Let $X$ be a continuous random variable, and let $\theta$ be the $p$th quantile of the $X$ distribution for some proportion $p \in (0, 1)$. Let $X_{(k)}$ be the $k$th order statistic of a random sample of size $n$ from the $X$ distribution. Then:

1. Intervals: The $n$ order statistics divide the real line into $n + 1$ intervals (ignoring the endpoints):

   $$(-\infty, X_{(1)}),\; (X_{(1)}, X_{(2)}),\; \ldots,\; (X_{(n-1)}, X_{(n)}),\; (X_{(n)}, \infty).$$

2. Binomial Probabilities: The probability that $\theta$ lies in a given interval follows a binomial distribution with parameters $n$ and $p$. Specifically:

   (a) First Interval: The event $\theta \in (-\infty, X_{(1)})$ is equivalent to the event that all $X_i$'s are greater than $\theta$. Thus, $P(\theta \in (-\infty, X_{(1)})) = (1-p)^n$.

   (b) Middle Intervals: The event $\theta \in (X_{(k)}, X_{(k+1)})$ is equivalent to the event that exactly $k$ $X_i$'s are less than $\theta$. Thus, $P(\theta \in (X_{(k)}, X_{(k+1)})) = \binom{n}{k} p^k (1-p)^{n-k}$.

   (c) Last Interval: The event $\theta \in (X_{(n)}, \infty)$ is equivalent to the event that all $X_i$'s are less than $\theta$. Thus, $P(\theta \in (X_{(n)}, \infty)) = p^n$.

These facts can be used to prove the following theorem.

Quantile Confidence Interval Theorem. Under the conditions above, if indices $k_1$ and $k_2$ are chosen so that

$$P(\theta < X_{(k_1)}) = \sum_{j=0}^{k_1 - 1} \binom{n}{j} p^j (1-p)^{n-j} = \frac{\alpha}{2},$$

$$P(X_{(k_1)} < \theta < X_{(k_2)}) = \sum_{j=k_1}^{k_2 - 1} \binom{n}{j} p^j (1-p)^{n-j} = 1 - \alpha,$$

$$P(\theta > X_{(k_2)}) = \sum_{j=k_2}^{n} \binom{n}{j} p^j (1-p)^{n-j} = \frac{\alpha}{2},$$

then the interval $[X_{(k_1)}, X_{(k_2)}]$ is a $100(1-\alpha)\%$ confidence interval for $\theta$.

Note that, in practice, $k_1$ and $k_2$ are chosen to make the sums in the theorem as close as possible to the values shown on the right.

Exercise. Suppose that we are interested in finding an interval estimate for the median of a continuous distribution by using 10 independent observations.

Binomial probabilities when $n = 10$ and $p = 1/2$:

| $j$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Probability | 0.001 | 0.010 | 0.044 | 0.117 | 0.205 | 0.246 | 0.205 | 0.117 | 0.044 | 0.010 | 0.001 |

Use the table of binomial probabilities to find $k_1$ and $k_2$ so that $[X_{(k_1)}, X_{(k_2)}]$ is a 90%
confidence interval (or as close as possible) for the median. Give the exact confidence level.

Exercise (Source: Shea er et al, 1996). The following table shows the total yearly rainfall (in inches) for Los Angeles in the 10 year period from the beginning of 1983 to the end of 1992:

| Year | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 |
|---|---|---|---|---|---|---|---|---|---|---|
| Rainfall | 34.04 | 8.90 | 8.92 | 18.00 | 9.11 | 11.57 | 4.56 | 6.49 | 15.07 | 22.56 |

Assume these data are the values of a random sample from a continuous distribution. Construct a 90% (or as close as possible) confidence interval for the median rainfall.

#### 4.2.4 Procedures for Endpoint Parameters

Let $X$ be a continuous random variable.

1. Upper Endpoint/Sample Maximum: If the range of $X$ has upper endpoint $\theta$, then the sample maximum can be used in statistical procedures concerning $\theta$.
2. Lower Endpoint/Sample Minimum: If the range of $X$ has lower endpoint $\theta$, then the sample minimum can be used in statistical procedures concerning $\theta$.

The following multipart exercise illustrates the use of the sample minimum.

Exercise. Let $X$ be the continuous random variable with density function $f(x)$ [formula illegible in source] when $x > \theta$, and 0 otherwise. Note that this $X$ has a shifted exponential distribution with shift parameter $\theta$ and scale parameter [value illegible in source]. Let $X_{(1)}$ be the sample minimum of a random sample of size $n$ from the $X$ distribution.

(a) Completely specify the CDF of $X_{(1)}$: $F_{(1)}(x) = P(X_{(1)} \le x)$.

(b) Find a general formula for the $p$th quantile of the $X_{(1)}$ distribution.

(c) Let $x_p$ and $x_{1-p}$ be the $p$th and $(1-p)$th quantiles of the $X_{(1)}$ distribution. Use the fact that

$$1 - 2p = P(x_p \le X_{(1)} \le x_{1-p})$$

to fill in the following blanks: $P(\underline{\qquad} \le \theta \le \underline{\qquad}) = 1 - 2p$.

(d) Assume the following data are the values of a random sample from the $X$ distribution:

25.8277, 28.5495, 30.1759, 35.8057, 36.6871, 57.257

Use the result of part (c) to construct a 90% confidence interval for $\theta$.

(e) Let $n = 8$. Find the value of $c$ so that the test with decision rule

Reject $\theta = 15$ in favor of $\theta < 15$ when $X_{(1)} \le c$

is a 5% lower tail test of the null hypothesis $\theta = 15$.

(f) Assume the following data are the values of a random sample from the $X$
distribution:

15.2564, 15.9165, 17.9189, 20.2458, 23.221, 25.3941, 27.2703, 29.8149

Would you accept or reject $\theta = 15$ in this case?

### 4.3 Sample Quantiles

#### 4.3.1 Definitions

Let $X$ be a continuous random variable, let $\theta_p$ be the $p$th quantile of the $X$ distribution, and let $X_{(k)}$ be the $k$th order statistic of a random sample of size $n$, for $k = 1, 2, \ldots, n$.

1. $p$th Sample Quantile: For $p \in \left[\frac{1}{n+1}, \frac{n}{n+1}\right]$, the $p$th sample quantile $\hat{\theta}_p$ is defined as follows:

   (a) if $p = \frac{k}{n+1}$ for some $k$, then $\hat{\theta}_p = X_{(k)}$;

   (b) if $p \in \left(\frac{k}{n+1}, \frac{k+1}{n+1}\right)$ for some $k$, then $\hat{\theta}_p = X_{(k)} + \left((n+1)p - k\right)\left(X_{(k+1)} - X_{(k)}\right)$.

   With this definition, the point $(\hat{\theta}_p, p)$ is on the piecewise linear curve connecting the successive points $\left(X_{(1)}, \frac{1}{n+1}\right), \left(X_{(2)}, \frac{2}{n+1}\right), \ldots, \left(X_{(n)}, \frac{n}{n+1}\right)$.

2. Sample Quartiles/Median/IQR: The sample quartiles are the estimates of the 25th, 50th, and 75th percentiles:

   $$q_1 = \hat{\theta}_{0.25}, \quad q_2 = \hat{\theta}_{0.50}, \quad q_3 = \hat{\theta}_{0.75}.$$

   The sample median is $q_2$, and the sample interquartile range is the difference $q_3 - q_1$.

For example, suppose that the 5 numbers 1.5, 7.5, 9.4, 10.2, 18.0 are observed. Then

- the sample median is $x_{(3)} = 9.4$,
- the sample first and third quartiles are $q_1 = x_{(1)} + 0.50(x_{(2)} - x_{(1)}) = 4.5$ and $q_3 = x_{(4)} + 0.50(x_{(5)} - x_{(4)}) = 14.1$,
- the sample interquartile range is $q_3 - q_1 = 9.6$.

Exercise (Source: Rice textbook, Chapter 11). As part of a study on the effects of an infectious disease on the lifetimes of guinea pigs, more than 400 animals were infected. The data below are the lifetimes (in days) of 45 animals given a low exposure to the disease:

33 44 56 59 74 77 93 100 102 105 107 107 108 108 109 115 120 122 124 136 139 144 153 159 160 163 163 168 171 172 195 202 215 216 222 230 231 240 245 251 253 254 278 458 555

(a) Find the sample quartiles and sample interquartile range.

(b) Find the sample 90th percentile.

#### 4.3.2 Graphical Analysis: Box Plots

A box plot is a graphical display of a data set that shows the sample median, the sample interquartile range, and the presence of possible outliers (numbers that are far from the center). Box plots were introduced by John Tukey in the 1970's.

Let $q_1$, $q_2$, and $q_3$ be the sample quartiles. To construct a box plot:

1. Box: A box is drawn from $q_1$ to $q_3$.
2. Bar: A bar is drawn at the sample median, $q_2$.
3. Whiskers: A whisker is drawn from $q_3$ to
the largest observation that is less than or equal to $q_3 + 1.50(q_3 - q_1)$. Another whisker is drawn from $q_1$ to the smallest observation that is greater than or equal to $q_1 - 1.50(q_3 - q_1)$.

4. Outliers: Observations outside the interval $[q_1 - 1.50(q_3 - q_1),\; q_3 + 1.50(q_3 - q_1)]$ are drawn as separate points. These observations are called the outliers.

For example, a box plot of the lifetimes data from the last exercise is shown below.

[Figure: box plot of the lifetimes data.]

For the lifetimes data, the interval $[q_1 - 1.50(q_3 - q_1),\; q_3 + 1.50(q_3 - q_1)]$ is \_\_\_\_ and the outliers are \_\_\_\_.

Exercise. [Statement and data largely illegible in source: measurements for two groups of children, compared using side-by-side box plots.] Find the sample quartiles $q_1$, $q_2$, $q_3$ and the interval $[q_1 - 1.50(q_3 - q_1),\; q_3 + 1.50(q_3 - q_1)]$ for each data set.

Example (Source: Rice textbook, Chapter 10). Consider side-by-side box plots of the following lifetimes (in days) of guinea pigs given low, medium, and high exposure to an infectious disease:

1. Low Exposure:

   33 44 56 59 74 77 93 100 102 105 107 107 108 108 109 115 120 122 124 136 139 144 153 159 160 163 163 168 171 172 195 202 215 216 222 230 231 240 245 251 253 254 278 458 555

   Sample Summaries: $n = 45$, median $= 153$, iqr $= 112$.

2. Medium Exposure:

   10 45 53 56 56 58 66 67 73 81 81 81 82 83 88 91 91 92 92 97 99 99 102 102 103 104 107 109 118 121 128 138 139 144 156 162 178 179 191 198 214 243 249 380 522

   Sample Summaries: $n = 45$, median $= 102$, iqr $= 69$.

3. High Exposure:

   15 22 24 32 33 34 38 38 43 44 54 55 59 60 60 60 61 63 65 65 67 68 70 70 76 76 81 83 87 91 96 98 99 109 127 129 131 143 146 175 258 263 341 341 376

   Sample Summaries: $n = 45$, median $= 70$, iqr $= 63.5$.

[Figure: side-by-side box plots for the Low, Medium, and High exposure groups.]

The graph shows a strong relationship between level of exposure and lifetime. Both the median lifetime and the sample IQR decrease with increasing exposure. In addition, as exposure increases, the sample distributions become more skewed.

Exercise. I used the computer to draw side-by-side box plots of samples from the standard normal distribution [details illegible in source].

[Figure: side-by-side box plots of the simulated samples.]

Notice that the boxes are all reasonably symmetric and approximately centered at 0. Notice also that the boxes have few, if any, outliers. [Text lost in extraction] the box
and whiskers if sampling is done from the standard normal distribution. Specifically, let $Z$ be the standard normal random variable,

$$w_1 = z_{0.25} - 1.50(z_{0.75} - z_{0.25}) \quad \text{and} \quad w_2 = z_{0.75} + 1.50(z_{0.75} - z_{0.25}),$$

where $z_p$ is the $p$th quantile of the standard normal distribution. Find $P(w_1 \le Z \le w_2)$.

MT427 Notebook 5, prepared by Professor Jenny Baglivo. Copyright 2009 by Jenny A. Baglivo. All Rights Reserved.

## 5 MT427 Notebook 5

Contents:

- 5.1 Two Sample Analysis: Difference in Means
  - 5.1.1 Introduction: Notation and Model Summaries
  - 5.1.2 Exact Methods for Normal Distributions
  - 5.1.3 Approximate Methods
  - 5.1.4 Transformations to Normality
- 5.2 Two Sample Analysis: Ratio of Variances
  - 5.2.1 F Ratio Distribution
  - 5.2.2 Sampling Distribution of Ratio of Sample Variances
  - 5.2.3 Exact Methods for Normal Distributions
- 5.3 Nonparametric Methods for Two Sample Analysis
  - 5.3.1 Definitions
  - 5.3.2 Wilcoxon Rank Sum Statistic
  - 5.3.3 Wilcoxon Rank Sum Distribution and Methods
  - 5.3.4 Mann-Whitney U Statistic
  - 5.3.5 Mann-Whitney U Distribution and Methods
  - 5.3.6 Hodges-Lehmann (HL) Estimator of Shift Parameter
  - 5.3.7 Exact Confidence Interval Procedure for Shift Parameter
- 5.4 Sampling Models
  - 5.4.1 Population Model
  - 5.4.2 Randomization Model

This notebook is concerned with parametric and nonparametric methods for two sample analysis. The notes include material from Chapter 11 (comparing two samples) and Chapter 6 (distributions derived from the normal distribution) of the Rice textbook.

### 5.1 Two Sample Analysis: Difference in Means

In many statistical applications, interest focuses on comparing two probability distributions. For example:

1. An education researcher might be interested in determining if the distributions of standardized test scores for students in public and private schools are equal.
2. A medical researcher might be interested in determining if the distributions of mean blood pressure levels are the same in patients on two different treatment protocols.

In this section, we focus on comparing two distributions by comparing their means.

#### 5.1.1 Introduction: Notation and Model Summaries

1. X Sample: Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a distribution with mean $\mu_x$ and standard deviation $\sigma_x$.
2. Y Sample: Let $Y_1, Y_2, \ldots, Y_m$ be a random sample of size $m$ from a distribution with mean $\mu_y$ and standard deviation $\sigma_y$.
3. Independent Samples: Assume that the samples were chosen independently. Thus, the combined sample $X_1, X_2, \ldots, X_n, Y_1, Y_2, \ldots, Y_m$ is a list of $n + m$ mutually independent random variables, where the first $n$ are IID with the same distribution as $X$ and the last $m$ are IID with the same distribution as $Y$.

Difference in means: Let $\delta = \mu_x - \mu_y$ be the difference in means of the distributions. The difference in sample means, $\bar{X} - \bar{Y}$, is used to estimate $\delta$. Since the samples were chosen independently, summary values are easy to compute.

Theorem (Sample Summaries). Under the conditions stated above,

$$E(\bar{X} - \bar{Y}) = \delta \quad \text{and} \quad \mathrm{Var}(\bar{X} - \bar{Y}) = \frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}.$$

#### 5.1.2 Exact Methods for Normal Distributions

If $X$ and $Y$ are normal random variables, then $\bar{X} - \bar{Y}$ has a normal distribution. There are two situations where this fact can be used to construct exact methods for analyzing the difference in means:

1. $\sigma_x$, $\sigma_y$ Known: Statistical methods use the fact that the standardized difference

   $$Z = \frac{(\bar{X} - \bar{Y}) - \delta}{\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}}$$

   has a standard normal distribution.

2. $\sigma_x = \sigma_y$ Estimated: Statistical methods use the fact that the approximately standardized difference

   $$T = \frac{(\bar{X} - \bar{Y}) - \delta}{\sqrt{S_p^2\left(\dfrac{1}{n} + \dfrac{1}{m}\right)}}$$

   has a Student t distribution with $n + m - 2$ df. In this formula, $S_p^2$ is the pooled estimator of the common variance,

   $$S_p^2 = \frac{(n-1)S_x^2 + (m-1)S_y^2}{n + m - 2},$$

   where $S_x^2$ and $S_y^2$ are the sample variances for the X and Y samples.

Note that, in order to get an exact Student t distribution in the second situation, we need to assume that the unknown standard deviations are equal.

To illustrate the computation for estimating a common variance, suppose that $n = 8$, $m = 6$, $s_x^2 = 8.58$ and $s_y^2 = 12.35$ are observed. Then the estimate of the common variance is \_\_\_\_.

Exercise. Let $\sigma^2 = \sigma_x^2 = \sigma_y^2$ be the common variance of the X and Y distributions. Under the assumptions of this section, demonstrate that $S_p^2$ is an unbiased estimator of $\sigma^2$.
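The pooled estimate of a common variance described above can be sketched in a few lines of Python. This is only an illustration: the function name `pooled_variance` is hypothetical, and the inputs assume the sample variances in the illustration read 8.58 and 12.35 (the decimal points are not visible in the extracted text, so those two values are an assumption).

```python
def pooled_variance(n, s2_x, m, s2_y):
    """Pooled estimate of a common variance from two independent samples.

    n, m       : sample sizes (each at least 2)
    s2_x, s2_y : sample variances of the X and Y samples
    """
    if n < 2 or m < 2:
        raise ValueError("each sample needs at least two observations")
    # Weighted average of the two sample variances,
    # with weights equal to their degrees of freedom.
    return ((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2)


# Assumed reading of the illustration: n = 8, m = 6, s_x^2 = 8.58, s_y^2 = 12.35.
sp2 = pooled_variance(8, 8.58, 6, 12.35)
print(round(sp2, 4))  # (7 * 8.58 + 5 * 12.35) / 12, roughly 10.15
```

Because the weights are the degrees of freedom $n - 1$ and $m - 1$, the pooled value always lies between the two sample variances and sits closer to the variance of the larger sample.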
Confidence interval procedures. The following tables give $100(1-\alpha)\%$ confidence interval procedures for the difference in means parameter, $\delta = \mu_x - \mu_y$.

1. $\sigma_x$, $\sigma_y$ Known:

   $$(\bar{X} - \bar{Y}) \pm z(\alpha/2) \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}},$$

   where $z(\alpha/2)$ is the $100(1-\alpha/2)\%$ point of the standard normal distribution.

2. $\sigma_x = \sigma_y$ Estimated:

   $$(\bar{X} - \bar{Y}) \pm t_{n+m-2}(\alpha/2) \sqrt{S_p^2\left(\frac{1}{n} + \frac{1}{m}\right)},$$

   where $t_{n+m-2}(\alpha/2)$ is the $100(1-\alpha/2)\%$ point on the Student t distribution with $n + m - 2$ df.

Hypothesis testing procedures. The following tables give size $\alpha$ tests of the null hypothesis that the difference in means parameter is a fixed value, $H_0: \delta = \delta_0$.

| | 1. $\sigma_x$, $\sigma_y$ Known | 2. $\sigma_x = \sigma_y$ Estimated |
|---|---|---|
| Test Statistic | $Z = \dfrac{(\bar{X} - \bar{Y}) - \delta_0}{\sqrt{\sigma_x^2/n + \sigma_y^2/m}}$ | $T = \dfrac{(\bar{X} - \bar{Y}) - \delta_0}{\sqrt{S_p^2\,(1/n + 1/m)}}$ |
| RR for $H_a: \delta < \delta_0$ | $Z \le -z(\alpha)$ | $T \le -t_{n+m-2}(\alpha)$ |
| RR for $H_a: \delta > \delta_0$ | $Z \ge z(\alpha)$ | $T \ge t_{n+m-2}(\alpha)$ |
| RR for $H_a: \delta \ne \delta_0$ | $\lvert Z \rvert \ge z(\alpha/2)$ | $\lvert T \rvert \ge t_{n+m-2}(\alpha/2)$ |

Exercise. Assume the following data are the values of independent random samples from normal distributions with common standard deviation [illegible in source].

1. X Sample ($n = 8$, $\bar{x} = 10.1$):

   6.07 7.00 9.49 9.76 11.19 11.31 12.96 13.02

2. Y Sample ($m = 12$, $\bar{y} = 6.83$):

   3.86 4.52 5.14 5.23 5.33 6.32 7.21 7.56 7.94 8.19 9.07 11.59

(a) Construct a 95% confidence interval for the difference in means, $\mu_x - \mu_y$.

(b) Consider testing $\mu_x - \mu_y = 4$ versus $\mu_x - \mu_y \ne 4$ using the information provided above. Would the null hypothesis be accepted or rejected at the 5% significance level? State the conclusion and report the observed significance level (p value).

Exercise (Source: Shoemaker, JSE, 1995). Normal body temperatures of 148 subjects were taken several times over two consecutive days. A total of 130 values are reported below.

1. X Sample: 65 temperatures (in degrees Fahrenheit) for women:

   96.4 96.7 96.8 97.2 97.2 97.4 97.6 97.7 97.7 97.8 97.8 97.8 97.9 97.9 97.9 98.0 98.0 98.0 98.0 98.0 98.1 98.2 98.2 98.2 98.2 98.2 98.2 98.3 98.3 98.3 98.4 98.4 98.4 98.4 98.4 98.5 98.6 98.6 98.6 98.6 98.7 98.7 98.7 98.7 98.7 98.7 98.8 98.8 98.8 98.8 98.8 98.8 98.8 98.9 99.0 99.0 99.1 99.1 99.2 99.2 99.3 99.4 99.9 100.0 100.8

   Sample summaries: $n = 65$, $\bar{x} = 98.3938$, $s_x = 0.7435$.

2. Y Sample: 65 temperatures (in degrees Fahrenheit) for men:

   96.3 96.7 96.9 97.0 97.1 97.1 97.1 97.2 97.3 97.4 97.4 97.4 97.4 97.5 97.5 97.6 97.6 97.6 97.7 97.8 97.8 97.8 97.8 97.9 97.9 98.0 98.0 98.0 98.0 98.0 98.0 98.1 98.1 98.2 98.2 98.2 98.2 98.3 98.3 98.4 98.4 98.4 98.4 98.5 98.5 98.6 98.6 98.6 98.6 98.6 98.6 98.7 98.7 98.8 98.8 98.8 98.9 99.0 99.0 99.0 99.1 99.2 99.3 99.4 99.5

   Sample summaries: $m = 65$, $\bar{y} = 98.1046$, $s_y = 0.6988$.

[Figures: side-by-side box plots (Women, Men) and an enhanced normal probability plot (expected vs. observed).]

1. Left Plot: Side-by-side box plots of the two samples are shown on the left. The sample distributions are approximately symmetric.
2. Right Plot: A normal probability plot of standardized temperatures is shown on the right, where (a) each $x$ value is replaced by $(x - \bar{x})/s_x$, (b) each $y$ value is replaced by $(y - \bar{y})/s_y$, and (c) the 130 ordered standardized values (vertical axis, observed) are plotted against the $k/131$st quantiles of the standard normal distribution (horizontal axis, expected).

The normal probability plot has been enhanced to include the results of 100 simulations from the standard normal distribution. For each $k = 1, \ldots, 130$, the minimum and maximum value of the 100 simulated $k$th order statistics are plotted.

Assume these data are the values of independent random samples from normal distributions with a common variance.

- Test $\mu_x = \mu_y$ versus $\mu_x \ne \mu_y$ at the 5% level.
- Construct a 95% confidence interval for the difference in means, $\mu_x - \mu_y$.
- Comment on the analyses.

Exercise (Source: Larsen & Marx, 1985). Electroencephalograms are records showing fluctuations of electrical activity in the brain. Among the several different kinds of brain waves produced, the dominant ones are usually alpha waves. These have a characteristic frequency of anywhere from 8 to 13 cycles per second.

As part of a study to determine if sensory deprivation over an extended period of time has any effect on alpha-wave pattern, 20 male inmates in a Canadian prison were randomly split into two equal-sized groups. Members of one group (control group) were allowed to remain in their cells, while members of the other group (treated group) were placed in solitary confinement. After seven days, alpha-wave frequencies were measured in all 20 men.

1. X Sample: Average number of cycles per second for members of the control group:

   9.6 10.3 10.4 10.4 10.5 10.7 10.7 10.9 11.1 11.2

   Sample summaries: $n = 10$, $\bar{x} = 10.58$, $s_x = 0.4590$.

2. Y Sample: Average number of cycles per second for members of the treated group:

   9.0 9.2 9.3 9.5 9.6 9.7 9.9 10.3 10.4 10.9

   Sample summaries: $m = 10$, $\bar{y} = 9.78$, $s_y = 0.5978$.

[Figures: side-by-side box plots (Control, Treated) and an enhanced normal probability plot.]

1. Left Plot: Side-by-side box plots shown on the left suggest that population means for non-confined and solitary-confined prisoners are different.
2. Right Plot: The enhanced normal probability plot of standardized averages suggests that normal theory methods are reasonable, although the plot tells us nothing about whether the assumption of a common variance is reasonable.

Assume these data are the values of independent random samples from normal distributions with a common variance.

- Test $\mu_x = \mu_y$ versus $\mu_x \ne \mu_y$ at the 5% level.
- Construct a 95% confidence interval for the difference in means, $\mu_x - \mu_y$.
- Comment on the analyses.

#### 5.1.3 Approximate Methods

In addition to the exact methods given in the last section, there are approximate methods we can use to answer questions about the difference in means parameter, $\delta = \mu_x - \mu_y$.

1. $\sigma_x \ne \sigma_y$ Estimated, Normal Samples: Assume that $X$ and $Y$ are normal random variables and that $\sigma_x \ne \sigma_y$. Statistical methods use the fact that the approximate standardization

   $$T = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\dfrac{S_x^2}{n} + \dfrac{S_y^2}{m}}}$$

   has an approximate Student t distribution with degrees of freedom as follows:

   $$\mathrm{df} = \frac{\left(S_x^2/n + S_y^2/m\right)^2}{\dfrac{(S_x^2/n)^2}{n-1} + \dfrac{(S_y^2/m)^2}{m-1}}.$$

2. $\sigma_x \ne \sigma_y$ Estimated, Large Samples: Assume that $n$ and $m$ are large. Statistical methods use the fact that the approximate standardization

   $$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\dfrac{S_x^2}{n} + \dfrac{S_y^2}{m}}}$$

   has an approximate standard normal distribution.

Notes:

1. Pooled versus Welch t Methods: Exact methods for normal samples when $\sigma_x = \sigma_y$ is estimated using pooled information are called pooled t methods. Approximate methods for normal samples where $\sigma_x$ and $\sigma_y$ are separately estimated are called Welch t methods, after the mathematician who proved in the 1940's that the sampling distribution was approximately Student t.
2. Computing the Degrees of Freedom: To apply the formula for df developed by Welch for the first situation above, you would round the
expression on the right to the closest whole number. The computed df satisfies the following inequality:

   $$\min(n, m) - 1 \le \mathrm{df} \le n + m - 2.$$

   A quick by-hand method is to use the lower bound for df instead of Welch's formula.

3. Central Limit Theorem: The central limit theorem can be used to demonstrate that the difference in sample means, $\bar{X} - \bar{Y}$, is approximately normally distributed when both $n$ and $m$ are large enough. Thus, the $Z$ given in the second situation above has an approximately standard normal distribution when both $n$ and $m$ are large enough.

Confidence interval procedures. The following tables give approximate $100(1-\alpha)\%$ confidence interval procedures for the difference in means parameter, $\delta = \mu_x - \mu_y$.

1. $\sigma_x \ne \sigma_y$ Estimated, Normal Samples:

   $$(\bar{X} - \bar{Y}) \pm t_{\mathrm{df}}(\alpha/2) \sqrt{\frac{S_x^2}{n} + \frac{S_y^2}{m}},$$

   where $t_{\mathrm{df}}(\alpha/2)$ is the $100(1-\alpha/2)\%$ point on the Student t distribution with df degrees of freedom.

2. $\sigma_x \ne \sigma_y$ Estimated, Large Samples:

   $$(\bar{X} - \bar{Y}) \pm z(\alpha/2) \sqrt{\frac{S_x^2}{n} + \frac{S_y^2}{m}},$$

   where $z(\alpha/2)$ is the $100(1-\alpha/2)\%$ point of the standard normal distribution.

Hypothesis testing procedures. The following tables give approximate size $\alpha$ tests of the null hypothesis that the difference in means parameter is a fixed value, $H_0: \delta = \delta_0$.

| | 1. $\sigma_x \ne \sigma_y$ Estimated, Normal Samples | 2. $\sigma_x \ne \sigma_y$ Estimated, Large Samples |
|---|---|---|
| Test Statistic | $T = \dfrac{(\bar{X} - \bar{Y}) - \delta_0}{\sqrt{S_x^2/n + S_y^2/m}}$ | $Z = \dfrac{(\bar{X} - \bar{Y}) - \delta_0}{\sqrt{S_x^2/n + S_y^2/m}}$ |
| RR for $H_a: \delta < \delta_0$ | $T \le -t_{\mathrm{df}}(\alpha)$ | $Z \le -z(\alpha)$ |
| RR for $H_a: \delta > \delta_0$ | $T \ge t_{\mathrm{df}}(\alpha)$ | $Z \ge z(\alpha)$ |
| RR for $H_a: \delta \ne \delta_0$ | $\lvert T \rvert \ge t_{\mathrm{df}}(\alpha/2)$ | $\lvert Z \rvert \ge z(\alpha/2)$ |

Exercise (Source: Stukel, 1998, FTP: lib.stat.cmu.edu/datasets/). Several studies have suggested that low levels of plasma retinol (vitamin A) are associated with increased risk of certain types of cancer. As part of a study to investigate the relationship between personal characteristics and cancer incidence, data were gathered on 315 subjects.

This exercise compares mean plasma levels of retinol in nanograms per milliliter (ng/ml) for 35 women and 35 men who participated in the study. Data summaries are as follows:

- Women: $n = 35$, $\bar{x} = 600.943$, $s_x = 157.103$
- Men: $m = 35$, $\bar{y} = 673.457$, $s_y = 267.37$

[Figure: side-by-side box plots for Women and Men.]

Assume the information above is a summary of independent random samples from normal
distributions. Construct an approximate 95% confidence interval for the difference in means parameter, $\delta = \mu_x - \mu_y$, and comment on your analysis.

#### 5.1.4 Transformations to Normality

Methods based on sampling from normal distributions are popular and easy to apply. For this reason, researchers often transform their data to achieve approximate normality, and then use normal theory methods on the transformed scale.

For example, the left plot below shows side-by-side box plots of samples taken from skewed positive distributions, and the right plot shows an enhanced normal probability plot of combined standardized values.

[Figures: side-by-side box plots (X Sample, Y Sample) and an enhanced normal probability plot.]

Notice that the boxes are asymmetric, there are large outliers, and the normal probability plot has a pronounced bend. By contrast, plots based on a log transformation of the data suggest that normal theory methods could be applied to the log-transformed data.

[Figures: side-by-side box plots (Log X Sample, Log Y Sample) and an enhanced normal probability plot.]

Footnotes: Although the use of transformations is attractive, there are many drawbacks. For example, it may be difficult to find an appropriate transformation, or it may be difficult to interpret the results back on the original scale. In Section 5.3 of these notes, we will study methods that can be used for a broad range of distributions.

### 5.2 Two Sample Analysis: Ratio of Variances

Assume that $X$ and $Y$ are normal random variables. This section develops methods for answering statistical questions about the ratio of variances parameter, $r = \sigma_x^2 / \sigma_y^2$, for normal distributions. The ratio of sample variances, $S_x^2 / S_y^2$, is used to estimate $r$. The sampling distribution of the ratio of sample variances is related to the f ratio distribution, which is introduced first.

#### 5.2.1 F Ratio Distribution

Let $U$ and $V$ be independent chi-square random variables with $n_1$ and $n_2$ degrees of freedom, respectively. Then

$$F = \frac{U/n_1}{V/n_2}$$

is said to be an f ratio random variable, or to have an f ratio distribution with $n_1$ and $n_2$ degrees of freedom. The PDF of $F$ is as follows:

$$f(x) = \frac{\Gamma\left((n_1 + n_2)/2\right)}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \left(\frac{n_1}{n_2}\right)^{n_1/2} x^{n_1/2 - 1} \left(\frac{n_2}{n_2 + n_1 x}\right)^{(n_1 + n_2)/2}$$
when $x > 0$, and 0 otherwise.

Typical forms for the PDF and CDF of $F$ are shown below. The location of the median, $f_{0.5}$, has been labeled in each plot.

[Figures: typical PDF and CDF of an f ratio distribution, with the median marked.]

Notes:

1. Fisher Ratio Distribution: The "f" in "f ratio distribution" is for R.A. Fisher, who pioneered its use in analyzing the results of comparative studies, that is, in analyzing the results of studies comparing two or more samples.
2. Shape: Both parameters govern shape and scale. If $n_2 > 2$, then $E(F) = \dfrac{n_2}{n_2 - 2}$; otherwise, the mean is indeterminate. Note that $E(F) \to 1$ as $n_2 \to \infty$. If $n_2 > 4$, then

   $$\mathrm{Var}(F) = \frac{2 n_2^2 (n_1 + n_2 - 2)}{n_1 (n_2 - 4)(n_2 - 2)^2};$$

   otherwise, the variance is indeterminate.
3. Reciprocal: If $F$ has an f ratio distribution with $n_1$ and $n_2$ degrees of freedom, then the reciprocal of $F$ has an f ratio distribution with $n_2$ and $n_1$ degrees of freedom.
4. Quantiles: The notation $f_p$ is used to denote the $p$th quantile ($100p$th percentile) of the f ratio distribution. The Rice textbook includes tables for $p = 0.90$ (page A10), $p = 0.95$ (page A11), $p = 0.975$ (page A12), and $p = 0.99$ (page A13). The $p = 0.10, 0.05, 0.025, 0.01$ quantiles can be computed using reciprocals. Specifically,

   $$f_p \text{ on } n_1, n_2 \text{ df} = \frac{1}{f_{1-p} \text{ on } n_2, n_1 \text{ df}}.$$

To illustrate the use of the tables in the textbook, let $n_1 = 8$ and $n_2 = 10$. Then:

1. When $p = 0.90$, $0.95$, $0.975$ and $0.99$, the values are read from the tables:

   $$f_{0.90} = 2.38, \quad f_{0.95} = 3.07, \quad f_{0.975} = 3.85, \quad f_{0.99} = 5.06.$$

2. When $p = 0.10$, $0.05$, $0.025$, and $0.01$, the quantiles are computed using reciprocals. Specifically, since

   $$P(F \le x) = P\left(\frac{1}{F} \ge \frac{1}{x}\right) \quad \text{for every } x > 0,$$

   to obtain the 0.10, 0.05, 0.025, and 0.01 quantiles of the distribution with 8 degrees of freedom in the numerator and 10 degrees of freedom in the denominator, we use the reciprocals of the 0.90, 0.95, 0.975, and 0.99 quantiles of the f ratio distribution with 10 degrees of freedom in the numerator and 8 degrees of freedom in the denominator. Thus,

   $$f_{0.10} = \frac{1}{2.54} \approx 0.39, \quad f_{0.05} = \frac{1}{3.35} \approx 0.30, \quad f_{0.025} = \frac{1}{4.30} \approx 0.23, \quad f_{0.01} = \frac{1}{5.81} \approx 0.17.$$

#### 5.2.2 Sampling Distribution of Ratio of Sample Variances

Let $X$ be a normal random variable with mean $\mu_x$ and standard deviation $\sigma_x$, and let $Y$ be a normal random variable with mean $\mu_y$ and standard deviation $\sigma_y$. The
following theorem tells us about the sampling distribution of the ratio of sample variances when samples are chosen independently from the X and Y distributions.

Theorem (Sampling Distribution). Let $S_x^2$ and $S_y^2$ be the sample variances of independent random samples of sizes n and m, respectively, from the X and Y distributions. Then

$$F = \frac{S_x^2/S_y^2}{\sigma_x^2/\sigma_y^2}$$

has an f ratio distribution with $n-1$ and $m-1$ degrees of freedom, where the numerator is the ratio of sample variances and the denominator is the ratio of model variances.

To demonstrate that the conclusion of the theorem is correct, first note that

1. $U = (n-1)S_x^2/\sigma_x^2$ has a chi-square distribution with $n-1$ df, and
2. $V = (m-1)S_y^2/\sigma_y^2$ has a chi-square distribution with $m-1$ df.

Now, please complete the demonstration.

5.2.3 Exact Methods for Normal Distributions

Let X and Y be normal random variables. Under the conditions of the last section, the following tables give exact confidence interval and hypothesis test methods for the ratio of variances parameter, $r = \sigma_x^2/\sigma_y^2$.

1. $100(1-\alpha)\%$ CI for $r = \sigma_x^2/\sigma_y^2$ when $\mu_x$ and $\mu_y$ are estimated:

$$\left[ \frac{S_x^2/S_y^2}{f_{n-1,m-1}(\alpha/2)},\ \frac{S_x^2/S_y^2}{f_{n-1,m-1}(1-\alpha/2)} \right]$$

where $f_{n-1,m-1}(p)$ is the $100(1-p)\%$ point of the f ratio distribution with $n-1$ and $m-1$ df.

2. $100\alpha\%$ tests of $H_0: r = r_0$ when $\mu_x$ and $\mu_y$ are estimated:

| Item | Definition |
|---|---|
| Test statistic | $F = \dfrac{S_x^2/S_y^2}{r_0}$ |
| RR for $H_a: r < r_0$ | $F \le f_{n-1,m-1}(1-\alpha)$ |
| RR for $H_a: r > r_0$ | $F \ge f_{n-1,m-1}(\alpha)$ |
| RR for $H_a: r \ne r_0$ | $F \le f_{n-1,m-1}(1-\alpha/2)$ or $F \ge f_{n-1,m-1}(\alpha/2)$ |

To illustrate the confidence interval procedure, assume the following information summarizes the values of independent random samples from normal distributions:

n = 16, $\bar{x} = 78.05$, $s_x = 8.56$; m = 13, $\bar{y} = 69.13$, $s_y = 4.33$,

and that we would like to construct a 95% confidence interval for the ratio $r = \sigma_x^2/\sigma_y^2$. Since the 97.5% point on the f ratio distribution on (15, 12) df is 3.18, and the 2.5% point is 1/2.96 (using the table entry for (12, 15) df), the confidence interval is

$$\left[ \frac{8.56^2/4.33^2}{3.18},\ \frac{8.56^2/4.33^2}{1/2.96} \right] = [1.229,\ 11.568].$$

First-step analysis. F ratio methods are often used as a first step in an analysis of the difference in means.

Exercise, continued. For example, in the alpha waves exercise (beginning on page 9), a confidence interval for the difference in means was constructed under the assumption that the population variances were equal. To demonstrate that this assumption is justified, we test

$$\frac{\sigma_x^2}{\sigma_y^2} = 1 \quad \text{versus} \quad \frac{\sigma_x^2}{\sigma_y^2} \ne 1$$

at the 5% significance level. The rejection region for the test is

$$F \le f_{9,9}(0.975) = \frac{1}{4.03} = 0.25 \quad \text{or} \quad F \ge f_{9,9}(0.025) = 4.03,$$

and the observed value of the test statistic is $s_x^2/s_y^2 = 0.589$. Since the observed value of the test statistic is in the acceptance region, the hypothesis of equal variances is accepted.

5.3 Nonparametric Methods for Two Sample Analysis

This section focuses on broadly applicable two sample analysis methods.

5.3.1 Definitions

1. Parametric/Nonparametric Methods. Statistical methods that require strong assumptions about the shapes of distributions (for example, uniform or exponential) and ask questions about parameter values are called parametric methods. By contrast, nonparametric methods (also known as distribution-free methods) make mild assumptions, such as "the distributions are continuous" or "the continuous distributions are symmetric around their centers."

2. Stochastically Larger/Smaller. Let V and W be continuous random variables. V is stochastically larger than W (correspondingly, W is stochastically smaller than V) if $P(V \ge x) \ge P(W \ge x)$ for all real numbers x, with strict inequality (that is, where ">" replaces "$\ge$") for at least one x.

3. Shift Model. The random variables V and W are said to satisfy a shift model if $V - \Delta$ and W have the same distribution, where $\Delta$ is the difference in medians: $\Delta = \text{Median}(V) - \text{Median}(W)$.

4. Shift Parameter. The parameter $\Delta$ from above is called the shift parameter.

Example (Quantile confidence interval procedure). Most of the statistical methods we have worked with so far have been parametric methods. An example of a nonparametric method is the quantile confidence interval procedure from Section 4.2.3 of these notes. Let X be a continuous random variable, $\theta$ be the pth quantile of the X distribution for some proportion $p \in (0,1)$, and $X_{(k)}$ be the kth order statistic of a random
sample of size n from the X distribution. Then $[X_{(k_1)}, X_{(k_2)}]$ is a $100(1-\alpha)\%$ confidence interval for $\theta$, where the indices $k_1$ and $k_2$ are chosen so that

$$P(\theta < X_{(k_1)}) = \sum_{j=0}^{k_1-1} \binom{n}{j} p^j (1-p)^{n-j} = \frac{\alpha}{2},$$

$$P(X_{(k_1)} < \theta < X_{(k_2)}) = \sum_{j=k_1}^{k_2-1} \binom{n}{j} p^j (1-p)^{n-j} = 1-\alpha,$$

$$P(\theta > X_{(k_2)}) = \sum_{j=k_2}^{n} \binom{n}{j} p^j (1-p)^{n-j} = \frac{\alpha}{2}.$$

Illustration (stochastically larger/smaller random variables). To illustrate the definition of stochastically larger/smaller, consider the following plots of the PDFs (left plot) and the CDFs (right plot) of two random variables: V (solid blue) and W (dashed gray).

[Figure: PDFs (left) and CDFs (right) of V and W.]

V is stochastically larger than W; correspondingly, W is stochastically smaller than V. Note that if V is stochastically larger than W, then their CDFs satisfy the inequality $F_V(x) \le F_W(x)$ for all x, with strict inequality for at least one x.

Example (Random variables satisfying shift models). If V and W satisfy a shift model with shift parameter $\Delta$, then their distributions must have the same shape. Here are two examples.

1. Normal Distribution, $\sigma = 5$. If V is a normal random variable with mean 10 and standard deviation 5, and W is a normal random variable with mean 3 and standard deviation 5, then V and W satisfy a shift model with shift parameter $\Delta = 7$. Since $\Delta > 0$, V is stochastically larger than W.

2. Shifted Exponential Distribution, $\lambda = 1/10$. Let V be an exponential random variable with parameter 1/10, and let W be a shifted exponential random variable with PDF as follows:

$$f(x) = \frac{1}{10} e^{-(x-8)/10} \text{ when } x > 8 \text{ and 0 otherwise.}$$

Then V and W satisfy a shift model with shift parameter $\Delta = -8$. Since $\Delta < 0$, W is stochastically larger than V.

5.3.2 Wilcoxon Rank Sum Statistic

In the 1940s, Wilcoxon developed a nonparametric method for testing the null hypothesis that two continuous distributions are equal versus the alternative hypothesis that one distribution is stochastically larger than the other.

Given independent random samples, $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$, from the X and Y distributions, Wilcoxon rank sum statistics for the X sample ($R_1$) and for the Y sample ($R_2$) are computed as follows:

1. Pool and sort the $n+m$ observations.
2. Replace each
observation by its rank (or position) in the sorted list.
3. Let $R_1$ equal the sum of the ranks for observations in the X sample, and $R_2$ equal the sum of the ranks for observations in the Y sample.

For example,

1. If n = 4, m = 6, and the data are as follows:

   1.1, 2.5, 3.2, 4.1 and 2.8, 3.6, 4.0, 5.2, 5.8, 7.2

   then the sorted combined list of $n+m = 10$ observations is

   1.1, 2.5, 2.8, 3.2, 3.6, 4.0, 4.1, 5.2, 5.8, 7.2

   The observed value of $R_1$ is ____. The observed value of $R_2$ is ____.

2. If n = 9, m = 5, and the data are as follows:

   12.8, 15.6, 15.7, 17.3, 18.5, 22.9, 27.5, 29.7, 35.1 and 8.2, 12.6, 16.7, 21.6, 32.4

   then the sorted combined list of $n+m = 14$ observations is

   8.2, 12.6, 12.8, 15.6, 15.7, 16.7, 17.3, 18.5, 21.6, 22.9, 27.5, 29.7, 32.4, 35.1

   The observed value of $R_1$ is ____. The observed value of $R_2$ is ____.

Notes:

1. Fixed Sum. Recall from calculus that the sum of the first N positive integers is $N(N+1)/2$. Thus,

   $$R_1 + R_2 = \frac{(n+m)(n+m+1)}{2},$$

   and tests based on $R_1$ are equivalent to tests based on $R_2$. We will use the $R_1$ statistic.

2. Wilcoxon/Mann-Whitney Methods. Another equivalent statistical method was developed by Mann & Whitney in the 1940s. Their approach can be used to develop a confidence interval procedure for the shift parameter in a shift model.

5.3.3 Wilcoxon Rank Sum Distribution and Methods

The following theorem gives us information about the distribution of the Wilcoxon rank sum statistic for the X sample under the null hypothesis that the X and Y distributions are equal.

Theorem (Rank Sum Distribution). Let X and Y be continuous distributions, and $R_1$ be the Wilcoxon rank sum statistic for the X sample based on independent random samples of sizes n and m, respectively, from the X and Y distributions. If the distributions of X and Y are equal, then

1. The range of $R_1$ is $\frac{n(n+1)}{2},\ \frac{n(n+1)}{2}+1,\ \ldots,\ nm + \frac{n(n+1)}{2}$.
2. $E(R_1) = \frac{n(n+m+1)}{2}$ and $Var(R_1) = \frac{nm(n+m+1)}{12}$.
3. The distribution of $R_1$ is symmetric around its mean. In particular, $P(R_1 = x) = P(R_1 = n(n+m+1) - x)$ for each x.
4. If n and m are large, then the distribution of $R_1$ is approximately normal. (If both are greater than 20, then the approximation is reasonably good.)

For example, if n = 9 and m = 5, then $R_1$ has range $R = \{45, 46, \ldots, 90\}$ and summary values $E(R_1) = 67.5$ and $Var(R_1) =$
$56.25$. The sampling distribution of $R_1$ is obtained by considering all $\binom{14}{9} = 2002$ subsets of size 9 chosen from the set of ranks $\{1, 2, \ldots, 14\}$. If the X and Y distributions are equal, then each choice of subset is equally likely.

Exercise. Let n = 2 and m = 4.

(a) List all 15 subsets of size 2 from $\{1, 2, 3, 4, 5, 6\}$.
(b) Use your answer to part (a) to completely specify the PDF of $R_1$.
(c) Find the mean and variance of $R_1$.

Finding p values. Let $r_{obs}$ be the observed value of $R_1$ for a given set of data. Then observed significance levels (p values) are obtained as follows:

| Alternative Hypothesis | P Value |
|---|---|
| X is stochastically larger than Y | $P(R_1 \ge r_{obs})$ |
| X is stochastically smaller than Y | $P(R_1 \le r_{obs})$ |
| One random variable is stochastically larger or smaller than the other | If $r_{obs} > E(R_1)$, then $2P(R_1 \ge r_{obs})$; if $r_{obs} < E(R_1)$, then $2P(R_1 \le r_{obs})$; if $r_{obs} = E(R_1)$, then the p value is 1 |

For example, let n = 9 and m = 5:

| $x$ | $P(R_1 = x)$ | $P(R_1 \le x)$ | $x$ | $P(R_1 = x)$ | $P(R_1 \le x)$ | $x$ | $P(R_1 = x)$ | $P(R_1 \le x)$ |
|---|---|---|---|---|---|---|---|---|
| 45 | 0.0005 | 0.0005 | 61 | 0.0370 | 0.2188 | 77 | 0.0250 | 0.9051 |
| 46 | 0.0005 | 0.0010 | 62 | 0.0405 | 0.2592 | 78 | 0.0215 | 0.9266 |
| 47 | 0.0010 | 0.0020 | 63 | 0.0440 | 0.3032 | 79 | 0.0175 | 0.9441 |
| 48 | 0.0015 | 0.0035 | 64 | 0.0465 | 0.3497 | 80 | 0.0145 | 0.9585 |
| 49 | 0.0025 | 0.0060 | 65 | 0.0490 | 0.3986 | 81 | 0.0115 | 0.9700 |
| 50 | 0.0035 | 0.0095 | 66 | 0.0504 | 0.4491 | 82 | 0.0090 | 0.9790 |
| 51 | 0.0050 | 0.0145 | 67 | 0.0509 | 0.5000 | 83 | 0.0065 | 0.9855 |
| 52 | 0.0065 | 0.0210 | 68 | 0.0509 | 0.5509 | 84 | 0.0050 | 0.9905 |
| 53 | 0.0090 | 0.0300 | 69 | 0.0504 | 0.6014 | 85 | 0.0035 | 0.9940 |
| 54 | 0.0115 | 0.0415 | 70 | 0.0490 | 0.6503 | 86 | 0.0025 | 0.9965 |
| 55 | 0.0145 | 0.0559 | 71 | 0.0465 | 0.6968 | 87 | 0.0015 | 0.9980 |
| 56 | 0.0175 | 0.0734 | 72 | 0.0440 | 0.7408 | 88 | 0.0010 | 0.9990 |
| 57 | 0.0215 | 0.0949 | 73 | 0.0405 | 0.7812 | 89 | 0.0005 | 0.9995 |
| 58 | 0.0250 | 0.1199 | 74 | 0.0370 | 0.8182 | 90 | 0.0005 | 1.0000 |
| 59 | 0.0290 | 0.1489 | 75 | 0.0330 | 0.8511 | | | |
| 60 | 0.0330 | 0.1818 | 76 | 0.0290 | 0.8801 | | | |

1. If the alternative hypothesis is "X is stochastically smaller than Y" and the observed value of $R_1$ is 60, then the observed significance level is ____.
2. If the alternative hypothesis is "X is stochastically larger than Y" and the observed value of $R_1$ is 74, then the observed significance level is ____.
3. If the alternative hypothesis is "One random variable is stochastically larger than the other" and the observed
value of $R_1$ is 48, then the observed significance level is ____.

Example (Source: Rice textbook, Chapter 11). An experiment was performed to determine whether two forms of iron (Fe2+ and Fe3+) are retained differently. (If one form of iron were retained especially well, it would be the better dietary supplement.) The investigators divided 108 mice randomly into 6 groups of 18 each; three groups were given Fe2+ in three different concentrations, 10.2, 1.2, and 0.3 millimolar, and three groups were given Fe3+ at the same concentrations. The mice were given the iron orally; the iron was radioactively labeled so that a counter could be used to measure the initial amount given. At a later time, another count was taken for each mouse, and the percentage of iron retained was calculated.

Results for the second concentration (1.2 millimolar) are reported below.

1. X sample (18 observations), percent retention for mice given Fe2+:

   4.04, 4.16, 4.42, 4.93, 5.49, 5.77, 5.86, 6.28, 6.97, 7.06, 7.78, 9.23, 9.34, 9.91, 13.46, 18.40, 23.89, 26.39

2. Y sample (18 observations), percent retention for mice given Fe3+:

   2.20, 2.93, 3.08, 3.49, 4.11, 4.95, 5.16, 5.54, 5.68, 6.25, 7.25, 7.90, 8.85, 11.96, 15.54, 15.89, 18.3, 18.59

The left plot below shows side-by-side box plots of the percent retention for each group, and the right plot is an enhanced normal probability plot of the 36 standardized values.

[Figure: side-by-side box plots (left); enhanced normal probability plot of the 36 standardized values (right).]

These plots suggest that the X and Y distributions are not approximately normal.

The equality of the X and Y distributions will be tested using the Wilcoxon rank sum test, a two-sided alternative, and 5% significance level. The sampling distribution of $R_1$ has range $R = \{171, 172, \ldots, 495\}$, and is centered at 333. The observed value of $R_1$ is 362, and the p value is $2P(R_1 \ge 362) \approx 0.372$.

[Figure: sampling distribution of $R_1$, with the horizontal axis labeled at 252, 333, and 405.]

Thus, (please complete) ____.

Handling equal observations. Continuous data are often rounded to a fixed number of decimal places, causing two or more observations to be equal.

1. Tied Observations. Equal observations are said to be tied at a given value.
2. Midranks. If two or more observations are tied at a given
value, then their average rank (or midrank) is used to compute the rank sum statistic. For example, if the two smallest observations are equal, they would each be assigned rank $(1+2)/2 = 1.5$.
3. Sampling Distribution. To obtain the sampling distribution of $R_1$, we use a simple urn model: Imagine writing the $n+m$ midranks on separate slips of paper and placing the slips in an urn. After thoroughly mixing the urn, choose a subset of size n and compute the sum of the values on the chosen slips. If each choice of subset is equally likely, then the resulting probability distribution is the distribution of $R_1$ for the given collection of midranks.

Example (Source: Rice textbook, Chapter 11). Two methods, A and B, were used in a determination of the latent heat of fusion of ice (Natrella 1963). The investigators wished to find out by how much the methods differed. The following table gives the change in total heat from ice at -0.72°C to water 0°C in calories per gram of mass.

1. X sample (13 observations), calories/gram using Method A:

   79.97, 79.98, 80.00, 80.02, 80.02, 80.02, 80.03, 80.03, 80.03, 80.04, 80.04, 80.04, 80.05

2. Y sample (8 observations), calories/gram using Method B:

   79.94, 79.95, 79.97, 79.97, 79.97, 79.98, 80.02, 80.03

The following table shows the ordered values and corresponding midranks:

| | Observation | Midrank | | Observation | Midrank |
|---|---|---|---|---|---|
| 1 | 79.94 | 1.0 | 12 | 80.02 | 11.5 |
| 2 | 79.95 | 2.0 | 13 | 80.02 | 11.5 |
| 3 | 79.97 | 4.5 | 14 | 80.03 | 15.5 |
| 4 | 79.97 | 4.5 | 15 | 80.03 | 15.5 |
| 5 | 79.97 | 4.5 | 16 | 80.03 | 15.5 |
| 6 | 79.97 | 4.5 | 17 | 80.03 | 15.5 |
| 7 | 79.98 | 7.5 | 18 | 80.04 | 19.0 |
| 8 | 79.98 | 7.5 | 19 | 80.04 | 19.0 |
| 9 | 80.00 | 9.0 | 20 | 80.04 | 19.0 |
| 10 | 80.02 | 11.5 | 21 | 80.05 | 21.0 |
| 11 | 80.02 | 11.5 | | | |

The observed value of $R_1$ is ____. The observed value of $R_2$ is ____.

The equality of the X and Y distributions will be tested using the Wilcoxon rank sum test, a two-sided alternative, and 5% significance level. The $R_1$ statistic takes whole-number and half-number values between 91 and 195, and is centered at 143. The observed value of $R_1$ is ____, and the observed significance level is $2P(R_1 \ge \text{\_\_\_\_}) \approx 0.005$. Thus, (please complete) ____.

5.3.4 Mann-Whitney U Statistic

In the 1940s, Mann & Whitney developed a nonparametric
two-sample test for the null hypothesis that two continuous distributions are equal versus the alternative hypothesis that one distribution is stochastically larger than the other.

Given independent random samples, $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$, from the X and Y distributions, Mann-Whitney U statistics for the X sample ($U_1$) and for the Y sample ($U_2$) are defined as follows:

1. $U_1$ Statistic. The $U_1$ statistic equals the number of times an X observation is greater than a Y observation:

   $$U_1 = \#(X_i > Y_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} I(X_i > Y_j),$$

   where $I(X_i > Y_j) = 1$ if the inequality is true and 0 otherwise.

2. $U_2$ Statistic. The $U_2$ statistic equals the number of times a Y observation is greater than an X observation:

   $$U_2 = \#(Y_j > X_i) = \sum_{j=1}^{m} \sum_{i=1}^{n} I(Y_j > X_i),$$

   where $I(Y_j > X_i) = 1$ if the inequality is true and 0 otherwise.

Note: If all $n+m$ observations are distinct, then $U_1 + U_2 = nm$.

For example,

1. If n = 4, m = 6, and the data are as follows:

   1.1, 2.5, 3.2, 4.1 and 2.8, 3.6, 4.0, 5.2, 5.8, 7.2

   then the sorted combined list of 10 observations (with the x values underlined) is

   _1.1_, _2.5_, 2.8, _3.2_, 3.6, 4.0, _4.1_, 5.2, 5.8, 7.2

   The observed value of $U_1$ is ____. The observed value of $U_2$ is ____.

2. If n = 5, m = 7, and the data are as follows:

   4.9, 7.3, 9.2, 11.0, 17.3 and 0.5, 0.7, 1.5, 2.7, 5.6, 8.7, 13.4

   then the sorted combined list of 12 observations (with the x values underlined) is

   0.5, 0.7, 1.5, 2.7, _4.9_, 5.6, _7.3_, 8.7, _9.2_, _11.0_, 13.4, _17.3_

   The observed value of $U_1$ is ____. The observed value of $U_2$ is ____.

5.3.5 Mann-Whitney U Distribution and Methods

The following theorem gives us information about the Mann-Whitney U statistic for the X sample under the null hypothesis that the X and Y distributions are equal, and relates the Mann-Whitney and Wilcoxon statistics.

Theorem (U Statistic Distribution). Let X and Y be continuous distributions, and let $U_1$ and $R_1$ be the Mann-Whitney and Wilcoxon statistics for the X sample based on independent random samples of sizes n and m from the X and Y distributions. If the distributions of X and Y are equal, then

1. $U_1 = R_1 - \frac{n(n+1)}{2}$.
2. The range of $U_1$ is $0, 1, 2, \ldots, nm$.
3. $E(U_1) = \frac{nm}{2}$ and $Var(U_1) = \frac{nm(n+m+1)}{12}$.
4. The distribution of $U_1$ is symmetric around its mean. In particular, $P(U_1 = x) = P(U_1 = nm - x)$ for each x.
5. If n
and m are large, then the distribution of $U_1$ is approximately normal. (If both are greater than 20, then the approximation is reasonably good.)

For example, if n = 9 and m = 5, then $U_1$ has range $R = \{0, 1, \ldots, 45\}$ and summary values $E(U_1) = 22.5$ and $Var(U_1) = 56.25$. The sampling distribution of $U_1$ is obtained by considering all $\binom{14}{9} = 2002$ assignments of 9 observations to the first sample, with the remaining observations being assigned to the second sample. If the X and Y distributions are equal, then each assignment is equally likely.

It is instructive to prove the first part of the sampling distribution theorem, assuming that each observation can be written with infinite precision. Let $r_i$ be the rank of the ith order statistic of the X sample, $X_{(i)}$, for $i = 1, 2, \ldots, n$. A total of $r_i - 1$ observations precede $X_{(i)}$ in the ordered combined list. Of these, $i - 1$ observations are from the X sample, and $r_i - i$ observations are from the Y sample. Now, please complete the proof.

5.3.6 Hodges-Lehmann (HL) Estimator of Shift Parameter

Recall that the random variables X and Y are said to satisfy a shift model if $X - \Delta$ and Y have the same distribution, where $\Delta$ (the shift parameter) is the difference in medians: $\Delta = \text{Median}(X) - \text{Median}(Y)$.

Notes. If X and Y satisfy a shift model, then

1. Stochastically larger/smaller. We can use the shift parameter when answering questions about whether one distribution is stochastically larger than the other, as summarized in the following table:

   | Value of $\Delta$ | Comparison of Distributions |
   |---|---|
   | $\Delta = 0$ | X and Y have the same distribution |
   | $\Delta > 0$ | X is stochastically larger than Y |
   | $\Delta < 0$ | X is stochastically smaller than Y |

2. Treatment Effects. In studies comparing a treatment group to a no treatment group (often called a control group), where the effect of the treatment is additive, the shift parameter is referred to as the treatment effect.

Estimating the shift parameter. In the 1960s, Hodges & Lehmann developed a method to estimate the shift parameter in a shift model. Given independent random samples of sizes n and m from the X and Y distributions,

1. Walsh Differences. The following list of nm differences,

   $$X_i - Y_j \quad \text{for } i = 1, 2, \ldots, n,\ j = 1, 2, \ldots, m,$$

   are called the Walsh differences. Note that the Walsh differences are a list of nm dependent random variables from the distribution of $X - Y$.

2. Hodges-Lehmann Estimator. The Hodges-Lehmann (HL) estimator of $\Delta$ is the median of the list of nm Walsh differences. Note that the HL estimator of $\Delta$ is not necessarily equal to the difference of sample medians of the separate X and Y samples.

For example, if n = 5 and m = 7, and the data are as follows:

4.9, 7.3, 9.2, 11.0, 17.3 and 0.5, 0.7, 1.5, 2.7, 5.6, 8.7, 13.4

then the following 5-by-7 table gives the 35 Walsh differences:

| | 0.5 | 0.7 | 1.5 | 2.7 | 5.6 | 8.7 | 13.4 |
|---|---|---|---|---|---|---|---|
| 4.9 | 4.4 | 4.2 | 3.4 | 2.2 | -0.7 | -3.8 | -8.5 |
| 7.3 | 6.8 | 6.6 | 5.8 | 4.6 | 1.7 | -1.4 | -6.1 |
| 9.2 | 8.7 | 8.5 | 7.7 | 6.5 | 3.6 | 0.5 | -4.2 |
| 11.0 | 10.5 | 10.3 | 9.5 | 8.3 | 5.4 | 2.3 | -2.4 |
| 17.3 | 16.8 | 16.6 | 15.8 | 14.6 | 11.7 | 8.6 | 3.9 |

The HL estimate of $\Delta$ is ____.

5.3.7 Exact Confidence Interval Procedure for Shift Parameter

Let X and Y be continuous distributions satisfying a shift model with shift parameter $\Delta$, and let $D_{(1)} < D_{(2)} < \cdots < D_{(nm)}$ be the ordered Walsh differences based on independent random samples of sizes n and m. Then

1. Intervals. The nm Walsh differences divide the real line into $nm + 1$ intervals,

   $$(-\infty, D_{(1)}),\ (D_{(1)}, D_{(2)}),\ \ldots,\ (D_{(nm-1)}, D_{(nm)}),\ (D_{(nm)}, \infty),$$

   where the endpoints are ignored.

2. Mann-Whitney Probabilities. The probability that $\Delta$ lies in a given interval follows the distribution of the Mann-Whitney statistic for the X sample. Specifically, if we let $D_{(0)} = -\infty$ and $D_{(nm+1)} = \infty$ for convenience, then

   $$P(D_{(k)} < \Delta < D_{(k+1)}) = P(U_1 = k) \quad \text{for } k = 0, 1, 2, \ldots, nm.$$

Note that the main ideas needed to use Mann-Whitney probabilities are the following:

(a) If X and Y satisfy a shift model with shift parameter $\Delta$, then the samples

   Sample 1: $X_1 - \Delta, X_2 - \Delta, \ldots, X_n - \Delta$
   Sample 2: $Y_1, Y_2, \ldots, Y_m$

   are independent random samples from the same distribution. Thus, the distribution of $U_1$ can be tabulated assuming that each assignment of n observations to the first sample (with the remaining m observations being assigned to the second sample) is equally likely.

(b) Under the assumptions of part (a), the $U_1$ distribution is
symmetric around $nm/2$.

Note also that these facts can be used to prove the following theorem.

Shift Parameter Confidence Interval Theorem. Under the assumptions above, if k is chosen so that the null probability $P(U_1 < k) = \alpha/2$, then $[D_{(k)}, D_{(nm-k+1)}]$ is a $100(1-\alpha)\%$ confidence interval for $\Delta$.

Exercise. Suppose that we are interested in finding an interval estimate of $\Delta$ based on independent random samples of sizes n = 4 and m = 6.

Mann-Whitney probabilities when n = 4, m = 6 and $k = 0, 1, \ldots, 10$:

| $k$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $P(U_1 = k)$ | 0.0048 | 0.0048 | 0.0095 | 0.0143 | 0.0238 | 0.0286 | 0.0429 | 0.0476 | 0.0619 | 0.0667 | 0.0762 |

Use the table of Mann-Whitney probabilities to find k so that $[D_{(k)}, D_{(25-k)}]$ is a 90% confidence interval for the shift parameter (or as close as possible). Give the exact confidence level.

Exercise. Assume that n = 4, m = 6, and the data are as follows:

4.1, 12.5, 12.9, 13.9 and 10.6, 12.2, 15.5, 16.7, 17.0, 20.6

Assume these data are the values of independent random samples from continuous distributions satisfying a shift model, with $\Delta = \text{Median}(X) - \text{Median}(Y)$.

(a) Find the HL estimate of the shift parameter, $\Delta$.
(b) Find a 90% (as close as possible) confidence interval for $\Delta$.

5.4 Sampling Models

The methods of this chapter assume that the measurements under study are the values of independent random samples from continuous distributions.

In most applications, simple random samples of individuals are drawn from finite populations, and measurements are made on these individuals. If population sizes are large enough, then the resulting measurements can be treated as if they were the values of independent random samples.

Recall that a simple random sample of size n from a population of size N is a subset of n individuals chosen in such a way that each choice of subset is equally likely.

5.4.1 Population Model

If simple random samples are drawn from sufficiently large populations of individuals, then sampling is said to be done under a population model. Under a population model, measurements can be treated as if they were the values of independent random samples.

When comparing two
distributions, sampling can be done in many different ways, including:

1. Sampling from Separate Subpopulations. Individuals can be sampled from separate subpopulations. For example, a researcher interested in comparing achievement test scores of girls and boys in the fifth grade might sample separately from the subpopulations of fifth grade girls and fifth grade boys.

2. Sampling from a Total Population Followed by Splitting. Individuals can be sampled from a total population and then separated. For example, the researcher interested in comparing achievement scores might sample from the total population of fifth graders, and then split the sample into subsamples of girls and boys.

3. Sampling from a Total Population Followed by Randomization. Individuals can be sampled from a total population and then randomized to one of two treatments. For example, a medical researcher interested in determining if a new treatment to reduce serum cholesterol levels is more effective than the standard treatment in a population of women with very high levels of cholesterol might do the following:

   (a) Choose a simple random sample of $n+m$ subjects from the population of women with very high levels of serum cholesterol.
   (b) Partition the $n+m$ subjects into distinguishable subsets (or groups) of sizes n and m.
   (c) Administer the standard treatment to each subject in the first group for a fixed period of time, and the new treatment to each subject in the second group for the same fixed period of time.

   By randomly assigning subjects to treatment groups, the effect is as if sampling was done from two subpopulations: the subpopulation of women with high cholesterol who have been treated with the standard treatment for a fixed period of time, and the subpopulation of women with high cholesterol who have been treated with the new treatment for a fixed period of time. Note that, by design, the subpopulations differ in treatment only.

5.4.2 Randomization Model

The following is a common research scenario: A researcher is interested in
comparing two treatments, and has $n+m$ subjects willing to participate in a study. The researcher randomly assigns n subjects to receive the first treatment; the remaining m subjects will receive the second treatment.

Treatments could be competing drugs for reducing cholesterol (as above) or competing methods for teaching multivariable calculus.

If the $n+m$ subjects are not a simple random sample from the study population, but the assignment of subjects to treatments is one of $\binom{n+m}{n}$ equally likely assignments, then sampling is said to be done under a randomization model.

Under a randomization model for the comparison of treatments, chance enters into the experiment only through the assignment of subjects to treatments. The results of experiments conducted under a randomization model cannot be generalized to a larger population of interest, but may still be of interest to researchers.

The Wilcoxon rank sum test is an example of a method that can be used to analyze data sampled under either the population model or the randomization model.

Mathematical Statistics I Notes 4, prepared by Professor Jenny Baglivo. Copyright 2004 by Jenny A. Baglivo. All Rights Reserved.

9 Order statistics and quantiles
9.1 Definitions: kth order statistic, sample min, max, median
9.1.1 Distribution of the sample maximum
9.1.2 Distribution of the sample minimum
9.1.3 Distribution in the general case
9.1.4 Approximate mean, variance; probability plots
9.2 Estimation and hypothesis testing methods
9.2.1 Approximate confidence procedure for median when n is odd
9.2.2 Exact confidence procedures for quantiles
9.2.3 Some Mathematica commands
9.2.4 Procedures for endpoint parameters
9.3 Sample quantiles
9.3.1 Sample quartiles, sample IQR
9.3.2 Box plots, outliers
9.3.3 Some Mathematica commands

9 Order statistics and quantiles

9.1 Definitions: kth order statistic, sample min, max, median

Let $X_1, X_2, \ldots, X_n$ be a random sample from the continuous distribution with PDF $f(x)$ and CDF $F(x) = P(X \le x)$, and let k be an integer in the interval $1 \le k \le n$. Then the kth order statistic, $X_{(k)}$, is the kth observation in order: $X_{(k)}$ is the kth smallest of $X_1, X_2, \ldots, X_n$.

The largest observation, $X_{(n)}$, is called the sample maximum, and the smallest observation, $X_{(1)}$, is called the sample minimum.

Sample median. The sample median is the middle observation when n is odd, and the average of the two middle observations when n is even:

$$\text{Sample median} = \begin{cases} X_{\left(\frac{n+1}{2}\right)} & \text{when } n \text{ is odd} \\ \frac{1}{2}\left(X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}\right) & \text{when } n \text{ is even} \end{cases}$$

9.1.1 Distribution of the sample maximum

Let $X_{(n)}$ be the sample maximum. The CDF and PDF of $X_{(n)}$ are as follows:

$$F_{(n)}(x) = (F(x))^n \text{ for all real numbers } x,$$

$$f_{(n)}(x) = n (F(x))^{n-1} f(x) \text{ for all real numbers } x.$$

To demonstrate that the formula for $F_{(n)}$ is correct, observe that

$$F_{(n)}(x) = P(X_{(n)} \le x) = P(X_1 \le x, X_2 \le x, \ldots, X_n \le x) = \prod_{i=1}^{n} P(X_i \le x) \text{ (by independence)} = (F(x))^n.$$

The formula for the PDF is obtained by applying the chain rule.

For example, let $X_{(8)}$ be the sample maximum of a random sample of size 8 from a uniform distribution on the interval [10, 50]. The following graph shows the PDF of $X_{(8)}$.

[Figure: density of $X_{(8)}$.]

The distribution of $X_{(8)}$ is concentrated near the upper endpoint of the interval.

Exercise 1. Let $X_{(n)}$ be the sample maximum of a random sample of size n from the uniform distribution on the interval [a, b].

1. Find a general formula for the pth quantile of the $X_{(n)}$ distribution.
2. Let n = 8 and [a, b] = [10, 50]. Find the 25th, 50th, and 75th
percentiles of the distribution of the sample maximum.
3. Let n = 8 and [a, b] = [10, 50]. Find $P(30 \le X_{(n)} \le 45)$.

9.1.2 Distribution of the sample minimum

Let $X_{(1)}$ be the sample minimum. The CDF and PDF of $X_{(1)}$ are as follows:

$$F_{(1)}(x) = 1 - (1 - F(x))^n \text{ for all real numbers } x,$$

$$f_{(1)}(x) = n (1 - F(x))^{n-1} f(x) \text{ for all real numbers } x.$$

Exercise 2. Demonstrate that the formula for $F_{(1)}(x)$ is correct.

For example, let $X_{(1)}$ be the sample minimum of a random sample of size 8 from a uniform distribution on the interval [10, 50]. The following graph shows the PDF of $X_{(1)}$.

[Figure: density of $X_{(1)}$.]

Note that the values of $X_{(1)}$ are concentrated near the lower endpoint of the interval.

Exercise 3. Let $X_{(1)}$ be the sample minimum of a random sample of size n from the uniform distribution on the interval [a, b].

1. Find a general formula for the pth quantile of the $X_{(1)}$ distribution.
2. Let n = 8 and [a, b] = [10, 50]. Find the 25th, 50th, and 75th percentiles of the distribution of the sample minimum.
3. Let n = 8 and [a, b] = [10, 50]. Find $P(15 \le X_{(1)} \le 30)$.

9.1.3 Distribution in the general case

Let $X_{(k)}$ be the kth order statistic, $1 < k < n$. The CDF and PDF of $X_{(k)}$ are as follows:

$$F_{(k)}(x) = \sum_{j=k}^{n} \binom{n}{j} (F(x))^j (1 - F(x))^{n-j} \text{ for all real numbers } x,$$

$$f_{(k)}(x) = \frac{n!}{(k-1)!\,(n-k)!} (F(x))^{k-1} f(x) (1 - F(x))^{n-k} \text{ for all real numbers } x.$$

To demonstrate that the formula for $F_{(k)}$ is correct, first note that

$$F_{(k)}(x) = P(X_{(k)} \le x) = P(k \text{ or more } X_i\text{'s are} \le x).$$

The probability that exactly j observations are $\le x$ is a binomial probability,

$$\binom{n}{j} p^j (1-p)^{n-j}, \text{ where } p = P(X \le x) = F(x).$$

Thus, the formula for $F_{(k)}$ is the sum of binomial probabilities. The demonstration of the formula for $f_{(k)}$ uses the chain rule and the product rule.

Exercise 4. Let X be a uniform random variable on the interval [10, 50], and $X_{(3)}$ be the 3rd order statistic of a random sample of size 8 from the X distribution.

[Figure: density of $X_{(3)}$.]

Exercise 5. Let X be an exponential random variable with parameter 1/10, and $X_{(5)}$ be the 5th order statistic of a random sample of size 7 from the X distribution.

[Figure: density of $X_{(5)}$.]

9.1.4 Approximate mean, variance; probability plots

The following theorem gives useful approximate formulas for the mean and variance of the
kth order statistic.

Theorem 6. Let X be a continuous distribution with PDF $f(x)$, $X_{(k)}$ be the kth order statistic of a random sample of size n from the X distribution, and $\theta$ be the pth quantile of the X distribution, where $p = \frac{k}{n+1}$. If $f(\theta) \ne 0$, then

$$E(X_{(k)}) \approx \theta \quad \text{and} \quad Var(X_{(k)}) \approx \frac{p(1-p)}{(n+2)(f(\theta))^2}.$$

Exercise 7. Let $X_{(3)}$ be the 3rd order statistic of a random sample of size 4 from the uniform distribution on the interval [0, 10]. Demonstrate that the approximate formulas for $E(X_{(3)})$ and $Var(X_{(3)})$ are exact in this case.

Probability plots. The theorem above implies that the kth order statistic is an estimator of the $\frac{k}{n+1}$st quantile, for $k = 1, 2, \ldots, n$. An interesting graphical comparison of model with data uses this result. A probability plot is a plot of pairs of the form

$$\left( \tfrac{k}{n+1}\text{st model quantile},\ x_{(k)} \right), \quad k = 1, 2, \ldots, n.$$

Note that the $x_{(k)}$'s are the observed order statistics.

Example 8. Let X be a normal random variable with mean 0 and standard deviation 10. The plots below show two different comparisons of the values of a random sample of size 95 from the X distribution with the normal model.

1. The left plot is a probability plot of the pairs $\left( \tfrac{k}{96}\text{th model quantile},\ x_{(k)} \right)$, $k = 1, 2, \ldots, 95$ ("obs" is for observed, and "exp" is for expected).
2. The right plot shows the density function of X (filled plot) with an empirical histogram of the data superimposed.

[Figure: probability plot (left); density function with empirical histogram superimposed (right).]

If n is large, then both plots give good graphical comparisons of model and data. If n is small to moderate, then the probability plot may be better, since the empirical histogram used in the comparison plot may have a much different shape than the density function of the model.

9.2 Estimation and hypothesis testing methods

Let X be a continuous random variable with CDF $F(x) = P(X \le x)$ and PDF $f(x)$.

9.2.1 Approximate confidence procedure for median when n is odd

The following theorem says that, under certain conditions, the sampling distribution of the sample median is approximately normal with mean equal to the population median.

Theorem 9. Let $\theta$ be the median of the X distribution and $\hat{\theta} = X_{\left(\frac{n+1}{2}\right)}$, where n is an odd positive integer. If n is large and $f(\theta) \ne 0$, then the
distribution of $\hat{\theta}$ is approximately normal with mean $\theta$ and variance $\frac{1}{4n(f(\theta))^2}$.

Approximate confidence intervals for the median. Under the conditions of the theorem above, an approximate $100(1-\alpha)\%$ confidence interval for the median is

$$\hat{\theta} \pm z(\alpha/2) \sqrt{\frac{1}{4n\,(f(\hat{\theta}))^2}}, \quad \text{where } \hat{\theta} = X_{\left(\frac{n+1}{2}\right)}$$

and $z(\alpha/2)$ is the $100(1-\alpha/2)\%$ point of the standard normal distribution. In this formula, $f(\hat{\theta})$ is the estimate of $f(\theta)$ obtained by substituting the sample median for $\theta$.

Note that the demonstration of this procedure is similar to the demonstration of the approximate confidence interval procedure for ML estimators.

Exercise 10. Let X be the continuous random variable with PDF and CDF

$$f(x) = \frac{1}{\pi\left(1 + (x-\theta)^2\right)} \quad \text{and} \quad F(x) = \frac{1}{2} + \frac{1}{\pi}\arctan(x - \theta) \quad \text{for all real numbers } x.$$

Note that X is a Cauchy random variable with center $\theta$ and spread 1. The X distribution is symmetric around $\theta$. Further, the median is $\theta$.

1. Evaluate the variance formula $\frac{1}{4n(f(\theta))^2}$.

2. Assume the following data are the values of a random sample from the X distribution:

   -88.043, -15.284, -11.278, -1.451, 1.830, 1.838, 2.571, 2.987, 3.056, 3.188, 3.275, 4.145, 4.610, 4.681, 4.705, 4.839, 4.909, 4.922, 4.978, 5.048, 5.165, 5.193, 5.302, 5.308, 5.380, 5.401, 5.412, 5.443, 5.457, 5.636, 5.652, 5.792, 5.794, 5.822, 5.826, 5.935, 5.944, 5.985, 5.991, 6.059, 6.170, 6.185, 6.239, 6.255, 6.301, 6.350, 6.351, 6.411, 6.448, 6.449, 6.485, 6.515, 6.521, 6.556, 6.571, 6.600, 6.784, 6.846, 6.866, 6.869, 6.885, 7.215, 7.302, 7.693, 7.776, 7.833, 7.844, 8.455, 8.828, 9.258, 10.984, 11.232, 11.270, 11.365, 37.091

   Use the normal approximation to the distribution of the sample median to construct an approximate 90% confidence interval for $\theta$.

Exercise 11. Assume that X has a shifted exponential distribution with PDF and CDF as follows:

$$f(x) = \frac{1}{10} e^{-(x-\theta)/10} \quad \text{and} \quad F(x) = 1 - e^{-(x-\theta)/10} \quad \text{when } x > \theta \text{ (and 0 otherwise)},$$

where the parameter $\theta$ is a real number.

1. Find a general formula for the median of the X distribution.
2. Develop an approximate confidence procedure for $\theta$ based on the confidence procedure for the median. Simplify your answer as much as possible.
3. Assume the following data are the values of a random sample from the X
distribution:

15.004, 15.036, 15.253, 15.285, 15.422, 15.495, 15.506, 15.574, 15.721, 15.794, 15.934, 16.445, 16.527, 16.632, 16.667, 16.771, 16.834, 16.877, 16.909, 16.935, 16.940, 17.320, 17.343, 17.498, 17.534, 18.042, 18.084, 18.423, 18.535, 18.699, 18.713, 18.968, 19.097, 19.134, 19.592, 19.985, 20.379, 20.867, 21.079, 21.564, 21.571, 21.657, 21.730, 21.795, 21.869, 22.296, 22.300, 22.414, 22.588, 23.021, 23.459, 23.557, 24.206, 24.362, 24.907, 24.958, 25.000, 25.249, 25.249, 25.804, 26.332, 26.418, 26.938, 27.059, 27.233, 27.778, 27.886, 27.933, 28.836, 28.924, 29.949, 30.316, 31.824, 32.315, 32.676, 33.621, 35.634, 40.237, 40.380, 41.144, 41.800, 44.685, 54.009, 54.638, 60.460

Use your answer to step 2 to construct an approximate 90% confidence interval for $\theta$.

9.2.2 Exact confidence procedures for quantiles

Let $X_1, X_2, \ldots, X_n$ be a random sample from a continuous distribution, and let $\theta$ be the $p$th quantile of the $X$ distribution. The $n$ order statistics $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ divide the real line into $n+1$ intervals

$$(-\infty, X_{(1)}),\; (X_{(1)}, X_{(2)}),\; \ldots,\; (X_{(n-1)}, X_{(n)}),\; (X_{(n)}, \infty)$$

(ignoring the endpoints). The probability that $\theta$ lies in a given interval follows a binomial distribution with parameters $n$ and $p$. Specifically,

$$P(\theta < X_{(1)}) = (1-p)^n,$$

$$P(X_{(k)} < \theta < X_{(k+1)}) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 1, 2, \ldots, n-1,$$

$$P(\theta > X_{(n)}) = p^n.$$

These facts can be used to prove the following theorem.

Theorem 12 (Quantile confidence intervals). Under the assumptions of this section, if $k_1$ and $k_2$ are chosen so that

$$P(\theta < X_{(k_1)}) = \sum_{j=0}^{k_1-1} \binom{n}{j} p^j (1-p)^{n-j} = \frac{\alpha}{2},$$

$$P(X_{(k_1)} < \theta < X_{(k_2)}) = \sum_{j=k_1}^{k_2-1} \binom{n}{j} p^j (1-p)^{n-j} = 1 - \alpha,$$

$$P(\theta > X_{(k_2)}) = \sum_{j=k_2}^{n} \binom{n}{j} p^j (1-p)^{n-j} = \frac{\alpha}{2},$$

then the interval $[X_{(k_1)}, X_{(k_2)}]$ is a $100(1-\alpha)\%$ confidence interval for $\theta$.

Note that in applications of this theorem, $k_1$ and $k_2$ are chosen so that $P(\theta < X_{(k_1)}) \approx P(\theta > X_{(k_2)}) \approx \alpha/2$. For moderate to large data sets, it is best to let the computer do the work. The method is valid for any continuous distribution with $p$th quantile $\theta$ and any sample size. It does not require precise knowledge of the $X$ distribution.

Exercise 13. The following table shows the total yearly rainfall in inches for Los Angeles in the 10-year period from 1983 to 1992 (Scheaffer et al., 1996).

| Year | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 |
|---|---|---|---|---|---|---|---|---|---|---|
| Rainfall | 34.04 | 8.90 | 8.92 | 18.00 | 9.11 | 11.57 | 4.56 | 6.49 | 15.07 | 22.56 |

Assume these data are the values of a random sample from a continuous distribution. Construct a 90% (or as close as possible) confidence interval for the median rainfall.

Note: For convenience, the binomial probabilities when $n = 10$ and $p = 0.50$,

$$\binom{10}{j}(0.50)^j(0.50)^{10-j} = \binom{10}{j}(0.50)^{10}, \quad j = 0, 1, \ldots, 10,$$

are given in the following table.

| $j$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Probability | 0.001 | 0.010 | 0.044 | 0.117 | 0.205 | 0.246 | 0.205 | 0.117 | 0.044 | 0.010 | 0.001 |

Example 14. Using the data from Exercise 10 and the methods of this section, computer analyses produced the following confidence intervals, with confidence level as close to 90% as possible, for the 25th, 50th, and 75th percentiles of the $X$ distribution.

| Percentile | Confidence Interval | Confidence Level |
|---|---|---|
| 25th | $[x_{(13)}, x_{(25)}] = [4.610, 5.380]$ | $\sum_{j=13}^{24} \binom{75}{j}(0.25)^j(0.75)^{75-j} = 0.8912$ |
| 50th | $[x_{(31)}, x_{(45)}] = [5.652, 6.301]$ | $\sum_{j=31}^{44} \binom{75}{j}(0.50)^j(0.50)^{75-j} = 0.8945$ |
| 75th | $[x_{(51)}, x_{(63)}] = [6.485, 7.302]$ | $\sum_{j=51}^{62} \binom{75}{j}(0.75)^j(0.25)^{75-j} = 0.8912$ |

9.2.3 Some Mathematica commands

The Mathematica command `QuantileCI[data, p, ConfidenceLevel -> 1 - α]` returns a confidence interval for the $p$th quantile with confidence level $1 - \alpha$ or more. The procedure uses the method of the previous section and adjusts for the possibility that some observed values may be equal.

9.2.4 Procedures for endpoint parameters

The sample maximum can be used in statistical procedures concerning the upper endpoint of an interval. Similarly, the sample minimum can be used in statistical procedures concerning the lower endpoint of an interval. The following exercise illustrates the use of the sample minimum.

Exercise 15. Assume that $X$ has a shifted exponential distribution with PDF as follows:

$$f(x) = \frac{1}{10}\,e^{-(x-\theta)/10} \quad \text{when } x > \theta, \text{ and 0 otherwise},$$

where the parameter $\theta$ is a real number. Note that for this random variable $E(X) = 10 + \theta$ and $\mathrm{Var}(X) = 100$. Further, based on a random sample of size $n$, the ML estimator of $\theta$ is the sample minimum.

1. Completely specify the CDF of $X_{(1)}$: $F_{(1)}(x) = P(X_{(1)} \le x)$.

2. Find a general formula for the $p$th quantile of the $X_{(1)}$ distribution.

3. Let $x_p$ and $x_{1-p}$ be the
$p$th and $(1-p)$th quantiles of the $X_{(1)}$ distribution. Use the fact that $1 - 2p = P(x_p \le X_{(1)} \le x_{1-p})$ to fill in the following blanks:

$$P(\,\_\_\_\_ \le \theta \le \_\_\_\_\,) = 1 - 2p.$$

4. Assume the following data are the values of a random sample from the $X$ distribution: 258277 285495 3017597 3580577 3668717 572539. Use the result of step 3 to construct a 90% confidence interval for $\theta$.

5. Let $n = 8$. Find the value of $c$ so that the test with decision rule "Reject $\theta = 15$ in favor of $\theta < 15$ when $X_{(1)} \le c$" is a 5% lower tail test of the null hypothesis $\theta = 15$.

6. Assume the following data are the values of a random sample from the $X$ distribution: 15.2564, 15.9165, 17.9189, 20.2458, 23.221, 25.3941, 27.2703, 29.8149. Would you accept or reject $\theta = 15$ in this case?

9.3 Sample quantiles

Let $X_1, X_2, \ldots, X_n$ be a random sample from a continuous distribution, and let $\theta$ be the $p$th quantile of the distribution, where $\frac{1}{n+1} \le p \le \frac{n}{n+1}$. Then the $p$th sample quantile $\hat\theta$ is defined as follows:

$$\hat\theta = \begin{cases} X_{(k)} & \text{when } p = \frac{k}{n+1}, \\[4pt] X_{(k)} + \left((n+1)p - k\right)\left(X_{(k+1)} - X_{(k)}\right) & \text{when } \frac{k}{n+1} < p < \frac{k+1}{n+1}, \end{cases}$$

where $X_{(k)}$ is the $k$th order statistic for $k = 1, 2, \ldots, n$.

When $p = k/(n+1)$, the $p$th sample quantile is the $k$th order statistic; otherwise, it is defined so that the following three points lie on a single line:

$$\left(X_{(k)}, \tfrac{k}{n+1}\right), \quad \left(\hat\theta, p\right), \quad \left(X_{(k+1)}, \tfrac{k+1}{n+1}\right).$$

Note that when $p = 0.50$, the definition given above reduces to the definition of the sample median given earlier.

9.3.1 Sample quartiles, sample IQR

Let $X$ be a continuous random variable. Recall that the quartiles of the $X$ distribution are the quantiles $x_{0.25}$, $x_{0.50}$, and $x_{0.75}$, and that the interquartile range of the $X$ distribution is the difference $IQR = x_{0.75} - x_{0.25}$.

Estimates of $x_{0.25}$, $x_{0.50}$, and $x_{0.75}$ are called the sample quartiles and are denoted by $q_1$, $q_2$, and $q_3$, respectively; $q_2$ is also the sample median. The difference $q_3 - q_1$ is called the sample interquartile range (sample IQR).

Exercise 16 (Bjerkdal, Amer. J. Hygiene 1970, 72:130-148; Rice, Duxbury Press, 1995, p. 349). As part of a study on the effects of an infectious disease on the lifetimes of guinea pigs, more than 400 animals were infected. The data below are the lifetimes in days of 45 animals given
a low exposure to the disease:

33 44 56 59 74 77 93 100 102 105 107 107 108 108 109 115 120 122 124 136 139 144 153 159 160 163 163 168 171 172 195 202 215 216 222 230 231 240 245 251 253 254 278 458 555

- Find the sample quartiles and sample interquartile range.
- Find the sample 90th percentile.

9.3.2 Box plots, outliers

A box plot is a graphical display of a data set that shows the sample median, the sample interquartile range, and the presence of possible outliers (numbers that are far from the center). Box plots were introduced by John Tukey in the 1970s.

To construct a box plot:

1. A box is drawn from the sample 25th percentile ($q_1$) to the sample 75th percentile ($q_3$).
2. A bar is drawn through the box at the sample median ($q_2$).
3. A whisker is drawn from $q_3$ to the largest observation that is less than or equal to $q_3 + 1.50(q_3 - q_1)$. Another whisker is drawn from $q_1$ to the smallest observation that is greater than or equal to $q_1 - 1.50(q_3 - q_1)$.
4. Observations outside the interval $[q_1 - 1.50(q_3 - q_1),\; q_3 + 1.50(q_3 - q_1)]$ are drawn as separate points. These observations are called the outliers.

Example 17. A box plot of the lifetimes data from the last exercise is shown below.

[Figure: box plot of the lifetimes data; horizontal axis: days, 100 to 600]

For these data, the interval $[q_1 - 1.50(q_3 - q_1),\; q_3 + 1.50(q_3 - q_1)] =$ ____ and the outliers are ____.

Exercise 18. The following data are birthweights in ounces for two groups of children.

1. Children whose mothers visited their doctors five or fewer times during pregnancy: 49 52 82 93 96 101 108 110 114 114 114 116 120 134
2. Children whose mothers visited their doctors six or more times during pregnancy: 87 93 97 98 106 108 110 113 116 119 119 129 131 153

[Figure: side-by-side box plots for the "5 or Fewer" and "6 or More" groups; horizontal axis: weight, 60 to 160]

In each case, report

- the sample quartiles $q_1$, $q_2$, $q_3$, and
- the interval $[q_1 - 1.50(q_3 - q_1),\; q_3 + 1.50(q_3 - q_1)]$.

Example 19. Consider side-by-side box plots of lifetimes in days of guinea pigs given low, medium, and high exposure to an infectious disease, with 45 animals per group.

1. Low exposure: 33 44 56 59 74 77 93 100 102 105 107 107 108 108 109 115 120 122 124 136 139 144 153 159 160 163 163 168 171 172 195 202 215 216 222 230 231 240 245 251 253 254 278 458 555
2. Medium exposure: 10 45 53 56 56 58 66 67 73 81 81 81 82 83 … 91 91 92 92 … 99 99 102 102 103 104 107 109 118 … 128 138 139 144 156 162 178 179 191 198 214 … 249 380 …
3. High exposure: 15 22 24 32 33 34 38 38 43 44 54 55 59 60 60 60 61 63 65 65 67 68 70 70 76 76 81 83 87 91 96 98 99 109 127 129 131 143 146 175 258 263 341 341 376

[Figure: side-by-side box plots for the Low, Medium, and High exposure groups; vertical axis: days, 0 to 600]

The plot suggests a strong relationship between level of exposure and lifetime. For the low, medium, and high exposure groups, the estimated median lifetimes are 153 days, 102 days, and 70 days, respectively. In each case, there are large outliers. In addition, as exposure increases, the sample distributions become more skewed. In each case, the distance between the first and second sample quartiles is smaller than the distance between the second and third sample quartiles. As the exposure increases, the differences are more pronounced.

Exercise 20. The following plot shows side-by-side box plots of 15 random samples, each of size 100, from the standard normal distribution.

[Figure: side-by-side box plots of samples 1 through 15]

Note that the boxes are reasonably symmetric, centered at approximately 0, and have very few, if any, outliers.

Let $Z$ be the standard normal random variable and

$$w_1 = z_{0.25} - 1.50\,(z_{0.75} - z_{0.25}) \quad \text{and} \quad w_2 = z_{0.75} + 1.50\,(z_{0.75} - z_{0.25}),$$

where $z_p$ is the $p$th quantile of the standard normal distribution. Find $P(w_1 \le Z \le w_2)$.

9.3.3 Some Mathematica commands

The command `Quantile[model, p]` returns the $p$th quantile of the continuous model distribution. For example, the following commands initialize the exponential distribution with parameter 1/10 and return the median of the distribution, $10\ln(2)$: `model = ExponentialDistribution[1/10];` `Quantile[model, 1/2]`

The command `SampleQuantile[sample, p]` returns the $p$th sample quantile using the method discussed in this chapter. Similarly, the commands `SampleQuartiles[sample]` and
`SampleInterquartileRange[sample]` return estimates based on the methods of this chapter.

Mathematical Statistics I Notes

prepared by Professor Jenny Baglivo

Copyright 2004 by Jenny A Baglivo. All Rights Reserved.

6 Transition to statistics
6.1 Reference univariate distributions
6.1.1 Discrete distributions
6.1.2 Continuous distributions
6.1.3 Table: discrete and continuous distributions
6.1.4 Table: standard normal distribution
6.2 Chi-square distribution
6.2.1 Table: quantiles of chi-square distributions
6.3 Student t distribution
6.3.1 Table: quantiles of Student t distributions
6.4 F ratio distribution
6.5 Some Mathematica commands
6.6 Random sample from a normal distribution
6.6.1 Distributions: sample mean, sample variance
6.6.2 Approximate standardization of the sample mean
6.7 Multinomial experiments
6.7.1 Multinomial distribution
6.7.2 Goodness-of-fit test: known model
6.7.3 Goodness-of-fit test: estimated model
6.7.4 Some Mathematica commands
6.8 Independent random samples from normal distributions
6.8.1 Distribution: ratio of sample variances

6 Transition to statistics

6.1 Reference univariate distributions

This section contains information on the standard discrete and continuous distributions.

6.1.1 Discrete distributions

1. Discrete uniform distribution. Let $n$ be a positive integer. The random variable $X$ is said to be a discrete uniform random variable, or to have a discrete uniform distribution with parameter $n$, when its PDF is

$$p(x) = \frac{1}{n} \quad \text{when } x = 1, 2, \ldots, n, \text{ and 0 otherwise}.$$

There are $n$ equally likely outcomes ($1, 2, \ldots, n$).

2. Hypergeometric distribution. Let $n$, $M$, and $N$ be integers with $0 < M < N$ and $0 < n < N$. The random variable $X$ is said to be a hypergeometric random variable, or to have a hypergeometric distribution with parameters $n$, $M$, and $N$, when its PDF is

$$p(x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$$

for integers $x$ between $\max(0, n + M - N)$ and $\min(n, M)$ (and equals 0 otherwise).

Hypergeometric distributions are used to model urn experiments, where $N$ is the number of objects in the urn, $M$ is the number of "special" objects, $n$ is
the size of the subset chosen from the urn, and $X$ is the number of special objects in the chosen subset. If each choice of subset is equally likely, then $X$ has a hypergeometric distribution.

3. Bernoulli distribution. A Bernoulli experiment is a random experiment with two outcomes. The outcome of chief interest is called "success" and the other outcome "failure". Let $p$ equal the probability of success. You run a Bernoulli experiment once and let $X$ equal 1 if a success occurs and 0 if a failure occurs. Then $X$ is said to be a Bernoulli random variable, or to have a Bernoulli distribution with parameter $p$. The PDF of $X$ is as follows:

$$p(1) = p, \quad p(0) = 1 - p, \quad \text{and } p(x) = 0 \text{ otherwise}.$$

(Unfortunately, $p$ is used in two different ways.)

4. Binomial distribution. Let $X$ be the number of successes in $n$ independent trials of a Bernoulli experiment with success probability $p$. Then $X$ is said to be a binomial random variable, or to have a binomial distribution with parameters $n$ and $p$. The PDF of $X$ is

$$p(x) = \binom{n}{x} p^x (1-p)^{n-x} \quad \text{when } x = 0, 1, 2, \ldots, n, \text{ and 0 otherwise}.$$

$X$ can be thought of as the sum of $n$ independent Bernoulli random variables. Thus, by the central limit theorem, the distribution of $X$ is approximately normal when $n$ is large.

5. Geometric distribution on 0, 1, 2, .... Let $X$ be the number of failures before the first success in a sequence of independent Bernoulli experiments with success probability $p$. Then $X$ is said to be a geometric random variable, or to have a geometric distribution with parameter $p$. The PDF of $X$ is

$$p(x) = (1-p)^x\, p \quad \text{when } x = 0, 1, 2, \ldots, \text{ and 0 otherwise}.$$

Note: An alternative definition of the geometric random variable is: $X$ is the trial number of the first success in a sequence of independent Bernoulli experiments with success probability $p$.

6. Negative binomial distribution on 0, 1, 2, .... Let $X$ be the number of failures before the $r$th success in a sequence of independent Bernoulli experiments with success probability $p$. Then $X$ is said to be a negative binomial random variable, or to have a negative binomial distribution with parameters $r$ and $p$. The PDF of $X$ is

$$p(x) = \binom{r + x - 1}{x} (1-p)^x\, p^r \quad \text{when } x = 0, 1, 2, \ldots, \text{ and 0 otherwise}.$$

$X$ can be thought of as the sum of $r$ independent geometric random variables. Thus, by the central limit theorem, the distribution of $X$ is approximately normal when $r$ is large.

Note: An alternative definition of the negative binomial random variable is: $X$ is the trial number of the $r$th success in a sequence of independent Bernoulli experiments with success probability $p$.

7. Poisson distribution. Let $\lambda$ be a positive real number ($\lambda > 0$). The random variable $X$ is said to be a Poisson random variable, or to have a Poisson distribution with parameter $\lambda$, when its PDF is

$$p(x) = e^{-\lambda}\,\frac{\lambda^x}{x!} \quad \text{when } x = 0, 1, 2, \ldots, \text{ and 0 otherwise}.$$

If events occurring over time follow an approximate Poisson process with an average of $\lambda$ events per unit time, then $X$ is the number of events observed in one unit of time. The distribution of $X$ is approximately normal when $\lambda$ is large.

6.1.2 Continuous distributions

8. Continuous uniform distribution. Let $a$ and $b$ be real numbers with $a < b$. The random variable $X$ is said to be a uniform random variable, or to have a uniform distribution on the interval $[a, b]$, when its PDF is

$$f(x) = \frac{1}{b - a} \quad \text{when } a \le x \le b, \text{ and 0 otherwise}.$$

The constant density for the continuous uniform random variable takes the place of the equally likely outcomes for the discrete uniform random variable.

9. Exponential distribution. Let $\lambda$ be a positive real number ($\lambda > 0$). The random variable $X$ is said to be an exponential random variable, or to have an exponential distribution with parameter $\lambda$, when its PDF is

$$f(x) = \lambda e^{-\lambda x} \quad \text{when } x \ge 0, \text{ and 0 otherwise}.$$

An important application of the exponential distribution is to Poisson processes. Specifically, the time to the first event, or the time between events, of a Poisson process with rate $\lambda$ has an exponential distribution with parameter $\lambda$.

10. Gamma distribution. Let $\alpha$ and $\beta$ be positive real numbers ($\alpha > 0$, $\beta > 0$). The continuous random variable $X$ is said to be a gamma random variable, or to have a gamma distribution with parameters $\alpha$ and $\beta$, when its PDF is

$$f(x) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, x^{\alpha - 1} e^{-x/\beta} \quad \text{when } x > 0, \text{ and 0 otherwise}.$$
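The link between the exponential and gamma densities can be checked numerically. The following is a minimal sketch, not part of the original notes: it is written in Python rather than Mathematica, the helper names `gamma_pdf`, `exponential_pdf`, and `trapezoid_area` are hypothetical, and it assumes the gamma density is parametrized by shape $\alpha$ and scale $\beta$, that is, $f(x) = x^{\alpha-1}e^{-x/\beta}/(\beta^{\alpha}\Gamma(\alpha))$ for $x > 0$. Under that assumption, setting $\alpha = 1$ and $\beta = 1/\lambda$ should reproduce the exponential density with parameter $\lambda$.

```python
import math


def gamma_pdf(x, alpha, beta):
    """Gamma density, assuming shape alpha and scale beta (an assumed parametrization)."""
    if x <= 0:
        return 0.0
    return x ** (alpha - 1) * math.exp(-x / beta) / (beta ** alpha * math.gamma(alpha))


def exponential_pdf(x, lam):
    """Exponential density with rate parameter lam."""
    if x < 0:
        return 0.0
    return lam * math.exp(-lam * x)


def trapezoid_area(f, a, b, steps=20000):
    """Approximate the integral of f over [a, b] with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


lam = 0.1  # illustrative rate, chosen only for this example

# With alpha = 1 and beta = 1/lam the gamma density should match the exponential density.
special_case_gap = abs(gamma_pdf(2.0, 1.0, 1.0 / lam) - exponential_pdf(2.0, lam))

# A density should integrate to about 1; the upper limit 400 truncates a negligible tail.
area = trapezoid_area(lambda x: gamma_pdf(x, 3.0, 10.0), 0.0, 400.0)

# Numerical mean, to compare with alpha * beta = 30.
mean = trapezoid_area(lambda x: x * gamma_pdf(x, 3.0, 10.0), 0.0, 400.0)

print(special_case_gap, area, mean)
```

The numerical integration is deliberately crude (a fixed-step trapezoid rule over a truncated range); it is meant only as a sanity check on the formula, not as a general-purpose integrator.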
The definition uses the Euler gamma function:

$$\Gamma(r) = \int_0^{\infty} x^{r-1} e^{-x}\,dx \quad \text{when } r > 0.$$

Two properties of the gamma function are:

1. If $r$ is a positive real number, then $\Gamma(r + 1) = r\,\Gamma(r)$.
2. If $r$ is a positive integer, then $\Gamma(r) = (r - 1)!$. (The gamma function interpolates factorials.)

An important application of the gamma distribution is to Poisson processes. Specifically, the time to the $r$th event of a Poisson process with rate $\lambda$ has a gamma distribution with parameters $\alpha = r$ and $\beta = 1/\lambda$.

When $\alpha = r$, $X$ can be thought of as the sum of $r$ independent exponential random variables. Thus, by the central limit theorem, the distribution of $X$ is approximately normal when $r$ is large.

11. Cauchy distribution. Let $a$ be a real number and $b$ be a positive real number ($b > 0$). The continuous random variable $X$ is said to be a Cauchy random variable, or to have a Cauchy distribution with center $a$ and spread $b$, when its PDF is

$$f(x) = \frac{b}{\pi\,(b^2 + (x - a)^2)} \quad \text{for all real numbers } x.$$

The $X$ distribution is symmetric around its center $a$. The 25th, 50th, and 75th percentiles of the $X$ distribution are $a - b$, $a$, and $a + b$, respectively. The expectation and variance of $X$ are indeterminate (the integrals do not converge).

12. Normal or Gaussian distribution. Let $\mu$ be a real number and $\sigma$ be a positive real number ($\sigma > 0$). The continuous random variable $X$ is said to be a normal random variable, or to have a normal distribution with mean $\mu$ and standard deviation $\sigma$, when its PDF is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \quad \text{for all real numbers } x.$$

The curve $y = f(x)$ is the classic "bell-shaped curve". The normal distribution is important in applications as a model for stochastic experiments and as an approximate sampling distribution (e.g., the sample mean has an approximate normal distribution in many situations).

6.1.3 Table: discrete and continuous distributions

| Distribution | Model | $E(X)$ | $\mathrm{Var}(X)$ |
|---|---|---|---|
| Discrete Uniform $n$ | $p(x) = 1/n$, $x = 1, 2, \ldots, n$ | $\frac{n+1}{2}$ | $\frac{n^2 - 1}{12}$ |
| Hypergeometric $n, M, N$ | $p(x) = \binom{M}{x}\binom{N-M}{n-x} / \binom{N}{n}$, $x = \max(0, n+M-N), \ldots, \min(n, M)$ | $n\frac{M}{N}$ | $n\frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N-n}{N-1}$ |
| Bernoulli $p$ | $p(1) = p$, $p(0) = 1 - p$ | $p$ | $p(1-p)$ |
| Binomial $n, p$ | $p(x) = \binom{n}{x}p^x(1-p)^{n-x}$, $x = 0, 1, \ldots, n$ | $np$ | $np(1-p)$ |
| Geometric $p$ | $p(x) = (1-p)^x p$, $x = 0, 1, 2, \ldots$ | $\frac{1-p}{p}$ | $\frac{1-p}{p^2}$ |
| Negative Binomial $r, p$ | $p(x) = \binom{r+x-1}{x}(1-p)^x p^r$, $x = 0, 1, 2, \ldots$ | $\frac{r(1-p)}{p}$ | $\frac{r(1-p)}{p^2}$ |
| Poisson $\lambda$ | $p(x) = e^{-\lambda}\lambda^x / x!$, $x = 0, 1, 2, \ldots$ | $\lambda$ | $\lambda$ |
| Uniform $a, b$ | $f(x) = \frac{1}{b-a}$, $a \le x \le b$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
| Exponential $\lambda$ | $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ |
| Gamma $\alpha, \beta$ | $f(x) = \frac{1}{\beta^{\alpha}\Gamma(\alpha)} x^{\alpha-1} e^{-x/\beta}$, $x > 0$ | $\alpha\beta$ | $\alpha\beta^2$ |
| Cauchy $a, b$ | $f(x) = \frac{b}{\pi(b^2 + (x-a)^2)}$, $-\infty < x < \infty$ | indeterminate | indeterminate |
| Normal $\mu, \sigma$ | $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$, $-\infty < x < \infty$ | $\mu$ | $\sigma^2$ |

6.1.4 Table: standard normal distribution

Let $Z$ be the standard normal random variable ($\mu = 0$, $\sigma = 1$), and let $\Phi(z) = P(Z \le z)$ be the cumulative distribution function of $Z$.

- The following table gives $\Phi(z)$ for $z \ge 0$, where $z$ = Row Value + Column Value.
- If $z < 0$, then $\Phi(z) = 1 - \Phi(-z)$.

| $z$ | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.5000 | 0.5040 | 0.5080 | 0.5120 | 0.5160 | 0.5199 | 0.5239 | 0.5279 | 0.5319 | 0.5359 |
| 0.1 | 0.5398 | 0.5438 | 0.5478 | 0.5517 | 0.5557 | 0.5596 | 0.5636 | 0.5675 | 0.5714 | 0.5753 |
| 0.2 | 0.5793 | 0.5832 | 0.5871 | 0.5910 | 0.5948 | 0.5987 | 0.6026 | 0.6064 | 0.6103 | 0.6141 |
| 0.3 | 0.6179 | 0.6217 | 0.6255 | 0.6293 | 0.6331 | 0.6368 | 0.6406 | 0.6443 | 0.6480 | 0.6517 |
| 0.4 | 0.6554 | 0.6591 | 0.6628 | 0.6664 | 0.6700 | 0.6736 | 0.6772 | 0.6808 | 0.6844 | 0.6879 |
| 0.5 | 0.6915 | 0.6950 | 0.6985 | 0.7019 | 0.7054 | 0.7088 | 0.7123 | 0.7157 | 0.7190 | 0.7224 |
| 0.6 | 0.7257 | 0.7291 | 0.7324 | 0.7357 | 0.7389 | 0.7422 | 0.7454 | 0.7486 | 0.7517 | 0.7549 |
| 0.7 | 0.7580 | 0.7611 | 0.7642 | 0.7673 | 0.7704 | 0.7734 | 0.7764 | 0.7794 | 0.7823 | 0.7852 |
| 0.8 | 0.7881 | 0.7910 | 0.7939 | 0.7967 | 0.7995 | 0.8023 | 0.8051 | 0.8078 | 0.8106 | 0.8133 |
| 0.9 | 0.8159 | 0.8186 | 0.8212 | 0.8238 | 0.8264 | 0.8289 | 0.8315 | 0.8340 | 0.8365 | 0.8389 |
| 1.0 | 0.8413 | 0.8438 | 0.8461 | 0.8485 | 0.8508 | 0.8531 | 0.8554 | 0.8577 | 0.8599 | 0.8621 |
| 1.1 | 0.8643 | 0.8665 | 0.8686 | 0.8708 | 0.8729 | 0.8749 | 0.8770 | 0.8790 | 0.8810 | 0.8830 |
| 1.2 | 0.8849 | 0.8869 | 0.8888 | 0.8907 | 0.8925 | 0.8944 | 0.8962 | 0.8980 | 0.8997 | 0.9015 |
| 1.3 | 0.9032 | 0.9049 | 0.9066 | 0.9082 | 0.9099 | 0.9115 | 0.9131 | 0.9147 | 0.9162 | 0.9177 |
| 1.4 | 0.9192 | 0.9207 | 0.9222 | 0.9236 | 0.9251 | 0.9265 | 0.9279 | 0.9292 | 0.9306 | 0.9319 |
| 1.5 | 0.9332 | 0.9345 | 0.9357 | 0.9370 | 0.9382 | 0.9394 | 0.9406 | 0.9418 | 0.9429 | 0.9441 |
| 1.6 | 0.9452 | 0.9463 | 0.9474 | 0.9484 | 0.9495 | 0.9505 | 0.9515 | 0.9525 | 0.9535 | 0.9545 |
| 1.7 | 0.9554 | 0.9564 | 0.9573 | 0.9582 | 0.9591 | 0.9599 | 0.9608 | 0.9616 | 0.9625 | 0.9633 |
| 1.8 | 0.9641 | 0.9649 | 0.9656 | 0.9664 | 0.9671 | 0.9678 | 0.9686 | 0.9693 | 0.9699 | 0.9706 |
| 1.9 | 0.9713 | 0.9719 | 0.9726 | 0.9732 | 0.9738 | 0.9744 | 0.9750 | 0.9756 | 0.9761 | 0.9767 |
| 2.0 | 0.9772 | 0.9778 | 0.9783 | 0.9788 | 0.9793 | 0.9798 | 0.9803 | 0.9808 | 0.9812 | 0.9817 |
| 2.1 | 0.9821 | 0.9826 | 0.9830 | 0.9834 | 0.9838 | 0.9842 | 0.9846 | 0.9850 | 0.9854 | 0.9857 |
| 2.2 | 0.9861 | 0.9864 | 0.9868 | 0.9871 | 0.9875 | 0.9878 | 0.9881 | 0.9884 | 0.9887 | 0.9890 |
| 2.3 | 0.9893 | 0.9896 | 0.9898 | 0.9901 | 0.9904 | 0.9906 | 0.9909 | 0.9911 | 0.9913 | 0.9916 |
| 2.4 | 0.9918 | 0.9920 | 0.9922 | 0.9925 | 0.9927 | 0.9929 | 0.9931 | 0.9932 | 0.9934 | 0.9936 |
| 2.5 | 0.9938 | 0.9940 | 0.9941 | 0.9943 | 0.9945 | 0.9946 | 0.9948 | 0.9949 | 0.9951 | 0.9952 |
| 2.6 | 0.9953 | 0.9955 | 0.9956 | 0.9957 | 0.9959 | 0.9960 | 0.9961 | 0.9962 | 0.9963 | 0.9964 |
| 2.7 | 0.9965 | 0.9966 | 0.9967 | 0.9968 | 0.9969 | 0.9970 | 0.9971 | 0.9972 | 0.9973 | 0.9974 |
| 2.8 | 0.9974 | 0.9975 | 0.9976 | 0.9977 | 0.9977 | 0.9978 | 0.9979 | 0.9979 | 0.9980 | 0.9981 |
| 2.9 | 0.9981 | 0.9982 | 0.9982 | 0.9983 | 0.9984 | 0.9984 | 0.9985 | 0.9985 | 0.9986 | 0.9986 |
| 3.0 | 0.9987 | 0.9987 | 0.9987 | 0.9988 | 0.9988 | 0.9989 | 0.9989 | 0.9989 | 0.9990 | 0.9990 |
| 3.1 | 0.9990 | 0.9991 | 0.9991 | 0.9991 | 0.9992 | 0.9992 | 0.9992 | 0.9992 | 0.9993 | 0.9993 |
| 3.2 | 0.9993 | 0.9993 | 0.9994 | 0.9994 | 0.9994 | 0.9994 | 0.9994 | 0.9995 | 0.9995 | 0.9995 |
| 3.3 | 0.9995 | 0.9995 | 0.9995 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9997 |
| 3.4 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9998 |
| 3.5 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998 |
| 3.6 | 0.9998 | 0.9998 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 |
| 3.7 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 |
| 3.8 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 |

6.2 Chi-square distribution

Let $Z_1, Z_2, \ldots, Z_m$ be independent standard normal random variables. Then

$$V = Z_1^2 + Z_2^2 + \cdots + Z_m^2$$

is said to be a chi-square random variable, or to have a chi-square distribution, with $m$ degrees of freedom. The PDF of $V$ is as follows:

$$f(x) = \frac{1}{2^{m/2}\,\Gamma(m/2)}\, x^{(m/2) - 1} e^{-x/2} \quad \text{when } x > 0, \text{ and 0 otherwise}.$$

For example, let $V$ be the chi-square random variable with 5 df. A graph of $y = f(x)$ is shown below.

[Figure: density of the chi-square distribution with 5 df]

If $V$ is a chi-square random variable with $m$ degrees of freedom, then
$E(V) = m$ and $\mathrm{Var}(V) = 2m$.

Since $V$ is the sum of $m$ IID random variables, the central limit theorem implies that the distribution of $V$ is approximately normal when $m$ is large enough.

If $V_1$ and $V_2$ are independent chi-square random variables with $m_1$ and $m_2$ degrees of freedom, respectively, then the sum $V_1 + V_2$ has a chi-square distribution with $m_1 + m_2$ degrees of freedom.

If $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$, then

$$V = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2}$$

is a chi-square random variable with $n$ degrees of freedom.

The notation $\chi^2_p$ is used to denote the $p$th quantile ($100p$th percentile) of the chi-square distribution. A table with quantiles corresponding to $p$ = 0.005, 0.010, 0.025, 0.050, 0.100, 0.900, 0.950, 0.975, 0.990, 0.995 for various degrees of freedom is given on page A8 (Table 3) in the book. An extended table is given below.

6.2.1 Table: quantiles of chi-square distributions

Let $V$ be a chi-square random variable with df degrees of freedom.

1. The following table gives selected quantiles of the $V$ distribution when df = 1:

| $p$ | 0.005 | 0.010 | 0.025 | 0.050 | 0.100 | 0.900 | 0.950 | 0.975 | 0.990 | 0.995 |
|---|---|---|---|---|---|---|---|---|---|---|
| $\chi^2_p$ | 0.000039 | 0.00016 | 0.00098 | 0.0039 | 0.016 | 2.71 | 3.84 | 5.02 | 6.63 | 7.88 |

2. The following tables give selected quantiles of the $V$ distribution when df > 1:

| df | $\chi^2_{0.005}$ | $\chi^2_{0.01}$ | $\chi^2_{0.025}$ | $\chi^2_{0.05}$ | $\chi^2_{0.10}$ | $\chi^2_{0.90}$ | $\chi^2_{0.95}$ | $\chi^2_{0.975}$ | $\chi^2_{0.99}$ | $\chi^2_{0.995}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.01 | 0.02 | 0.05 | 0.10 | 0.21 | 4.61 | 5.99 | 7.38 | 9.21 | 10.60 |
| 3 | 0.07 | 0.11 | 0.22 | 0.35 | 0.58 | 6.25 | 7.81 | 9.35 | 11.34 | 12.84 |
| 4 | 0.21 | 0.30 | 0.48 | 0.71 | 1.06 | 7.78 | 9.49 | 11.14 | 13.28 | 14.86 |
| 5 | 0.41 | 0.55 | 0.83 | 1.15 | 1.61 | 9.24 | 11.07 | 12.83 | 15.09 | 16.75 |
| 6 | 0.68 | 0.87 | 1.24 | 1.64 | 2.20 | 10.64 | 12.59 | 14.45 | 16.81 | 18.55 |
| 7 | 0.99 | 1.24 | 1.69 | 2.17 | 2.83 | 12.02 | 14.07 | 16.01 | 18.48 | 20.28 |
| 8 | 1.34 | 1.65 | 2.18 | 2.73 | 3.49 | 13.36 | 15.51 | 17.53 | 20.09 | 21.95 |
| 9 | 1.73 | 2.09 | 2.70 | 3.33 | 4.17 | 14.68 | 16.92 | 19.02 | 21.67 | 23.59 |
| 10 | 2.16 | 2.56 | 3.25 | 3.94 | 4.87 | 15.99 | 18.31 | 20.48 | 23.21 | 25.19 |
| 11 | 2.60 | 3.05 | 3.82 | 4.57 | 5.58 | 17.28 | 19.68 | 21.92 | 24.72 | 26.76 |
| 12 | 3.07 | 3.57 | 4.40 | 5.23 | 6.30 | 18.55 | 21.03 | 23.34 | 26.22 | 28.30 |
| 13 | 3.57 | 4.11 | 5.01 | 5.89 | 7.04 | 19.81 | 22.36 | 24.74 | 27.69 | 29.82 |
| 14 | 4.07 | 4.66 | 5.63 | 6.57 | 7.79 | 21.06 | 23.68 | 26.12 | 29.14 | 31.32 |
| 15 | 4.60 | 5.23 | 6.26 | 7.26 | 8.55 | 22.31 | 25.00 | 27.49 | 30.58 | 32.80 |
| 16 | 5.14 | 5.81 | 6.91 | 7.96 | 9.31 | 23.54 | 26.30 | 28.85 | 32.00 | 34.27 |
| 17 | 5.70 | 6.41 | 7.56 | 8.67 | 10.09 | 24.77 | 27.59 | 30.19 | 33.41 | 35.72 |
| 18 | 6.26 | 7.01 | 8.23 | 9.39 | 10.86 | 25.99 | 28.87 | 31.53 | 34.81 | 37.16 |
| 19 | 6.84 | 7.63 | 8.91 | 10.12 | 11.65 | 27.20 | 30.14 | 32.85 | 36.19 | 38.58 |
| 20 | 7.43 | 8.26 | 9.59 | 10.85 | 12.44 | 28.41 | 31.41 | 34.17 | 37.57 | 40.00 |
| 21 | 8.03 | 8.90 | 10.28 | 11.59 | 13.24 | 29.62 | 32.67 | 35.48 | 38.93 | 41.40 |
| 22 | 8.64 | 9.54 | 10.98 | 12.34 | 14.04 | 30.81 | 33.92 | 36.78 | 40.29 | 42.80 |
| 23 | 9.26 | 10.20 | 11.69 | 13.09 | 14.85 | 32.01 | 35.17 | 38.08 | 41.64 | 44.18 |
| 24 | 9.89 | 10.86 | 12.40 | 13.85 | 15.66 | 33.20 | 36.42 | 39.36 | 42.98 | 45.56 |
| 25 | 10.52 | 11.52 | 13.12 | 14.61 | 16.47 | 34.38 | 37.65 | 40.65 | 44.31 | 46.93 |
| 26 | 11.16 | 12.20 | 13.84 | 15.38 | 17.29 | 35.56 | 38.89 | 41.92 | 45.64 | 48.29 |
| 27 | 11.81 | 12.88 | 14.57 | 16.15 | 18.11 | 36.74 | 40.11 | 43.19 | 46.96 | 49.64 |
| 28 | 12.46 | 13.56 | 15.31 | 16.93 | 18.94 | 37.92 | 41.34 | 44.46 | 48.28 | 50.99 |
| 29 | 13.12 | 14.26 | 16.05 | 17.71 | 19.77 | 39.09 | 42.56 | 45.72 | 49.59 | 52.34 |
| 30 | 13.79 | 14.95 | 16.79 | 18.49 | 20.60 | 40.26 | 43.77 | 46.98 | 50.89 | 53.67 |

| df | $\chi^2_{0.005}$ | $\chi^2_{0.01}$ | $\chi^2_{0.025}$ | $\chi^2_{0.05}$ | $\chi^2_{0.10}$ | $\chi^2_{0.90}$ | $\chi^2_{0.95}$ | $\chi^2_{0.975}$ | $\chi^2_{0.99}$ | $\chi^2_{0.995}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 31 | 14.46 | 15.66 | 17.54 | 19.28 | 21.43 | 41.42 | 44.99 | 48.23 | 52.19 | 55.00 |
| 32 | 15.13 | 16.36 | 18.29 | 20.07 | 22.27 | 42.58 | 46.19 | 49.48 | 53.49 | 56.33 |
| 33 | 15.82 | 17.07 | 19.05 | 20.87 | 23.11 | 43.75 | 47.40 | 50.73 | 54.78 | 57.65 |
| 34 | 16.50 | 17.79 | 19.81 | 21.66 | 23.95 | 44.90 | 48.60 | 51.97 | 56.06 | 58.96 |
| 35 | 17.19 | 18.51 | 20.57 | 22.47 | 24.80 | 46.06 | 49.80 | 53.20 | 57.34 | 60.27 |
| 36 | 17.89 | 19.23 | 21.34 | 23.27 | 25.64 | 47.21 | 51.00 | 54.44 | 58.62 | 61.58 |
| 37 | 18.59 | 19.96 | 22.11 | 24.07 | 26.49 | 48.36 | 52.19 | 55.67 | 59.89 | 62.88 |
| 38 | 19.29 | 20.69 | 22.88 | 24.88 | 27.34 | 49.51 | 53.38 | 56.90 | 61.16 | 64.18 |
| 39 | 20.00 | 21.43 | 23.65 | 25.70 | 28.20 | 50.66 | 54.57 | 58.12 | 62.43 | 65.48 |
| 40 | 20.71 | 22.16 | 24.43 | 26.51 | 29.05 | 51.81 | 55.76 | 59.34 | 63.69 | 66.77 |
| 41 | 21.42 | 22.91 | 25.21 | 27.33 | 29.91 | 52.95 | 56.94 | 60.56 | 64.95 | 68.05 |
| 42 | 22.14 | 23.65 | 26.00 | 28.14 | 30.77 | 54.09 | 58.12 | 61.78 | 66.21 | 69.34 |
| 43 | 22.86 | 24.40 | 26.79 | 28.96 | 31.63 | 55.23 | 59.30 | 62.99 | 67.46 | 70.62 |
| 44 | 23.58 | 25.15 | 27.57 | 29.79 | 32.49 | 56.37 | 60.48 | 64.20 | 68.71 | 71.89 |
| 45 | 24.31 | 25.90 | 28.37 | 30.61 | 33.35 | 57.51 | 61.66 | 65.41 | 69.96 | 73.17 |
| 46 | 25.04 | 26.66 | 29.16 | 31.44 | 34.22 | 58.64 | 62.83 | 66.62 | 71.20 | 74.44 |
| 47 | 25.77 | 27.42 | 29.96 | 32.27 | 35.08 | 59.77 | 64.00 | 67.82 | 72.44 | 75.70 |
| 48 | 26.51 | 28.18 | 30.75 | 33.10 | 35.95 | 60.91 | 65.17 | 69.02 | 73.68 | 76.97 |
| 49 | 27.25 | 28.94 | 31.55 | 33.93 | 36.82 | 62.04 | 66.34 | 70.22 | 74.92 | 78.23 |
| 50 | 27.99 | 29.71 | 32.36 | 34.76 | 37.69 | 63.17 | 67.50 | 71.42 | 76.15 | 79.49 |
| 55 | 31.73 | 33.57 | 36.40 | 38.96 | 42.06 | 68.80 | 73.31 | 77.38 | 82.29 | 85.75 |
| 60 | 35.53 | 37.48 | 40.48 | 43.19 | 46.46 | 74.40 | 79.08 | 83.30 | 88.38 | 91.95 |
| 65 | 39.38 | 41.44 | 44.60 | 47.45 | 50.88 | 79.97 | 84.82 | 89.18 | 94.42 | 98.11 |
| 70 | 43.28 | 45.44 | 48.76 | 51.74 | 55.33 | 85.53 | 90.53 | 95.02 | 100.43 | 104.21 |
| 75 | 47.21 | 49.48 | 52.94 | 56.05 | 59.79 | 91.06 | 96.22 | 100.84 | 106.39 | 110.29 |
| 80 | 51.17 | 53.54 | 57.15 | 60.39 | 64.28 | 96.58 | 101.88 | 106.63 | 112.33 | 116.32 |
| 85 | 55.17 | 57.63 | 61.39 | 64.75 | 68.78 | 102.08 | 107.52 | 112.39 | 118.24 | 122.32 |
| 90 | 59.20 | 61.75 | 65.65 | 69.13 | 73.29 | 107.57 | 113.15 | 118.14 | 124.12 | 128.30 |
| 95 | 63.25 | 65.90 | 69.92 | 73.52 | 77.82 | 113.04 | 118.75 | 123.86 | 129.97 | 134.25 |
| 100 | 67.33 | 70.06 | 74.22 | 77.93 | 82.36 | 118.50 | 124.34 | 129.56 | 135.81 | 140.17 |
| 105 | 71.43 | 74.25 | 78.54 | 82.35 | 86.91 | 123.95 | 129.92 | 135.25 | 141.62 | 146.07 |
| 110 | 75.55 | 78.46 | 82.87 | 86.79 | 91.47 | 129.39 | 135.48 | 140.92 | 147.41 | 151.95 |
| 115 | 79.69 | 82.68 | 87.21 | 91.24 | 96.04 | 134.81 | 141.03 | 146.57 | 153.19 | 157.81 |
| 120 | 83.85 | 86.92 | 91.57 | 95.70 | 100.62 | 140.23 | 146.57 | 152.21 | 158.95 | 163.65 |
| 125 | 88.03 | 91.18 | 95.95 | 100.18 | 105.21 | 145.64 | 152.09 | 157.84 | 164.69 | 169.47 |
| 130 | 92.22 | 95.45 | 100.33 | 104.66 | 109.81 | 151.05 | 157.61 | 163.45 | 170.42 | 175.28 |
| 135 | 96.43 | 99.74 | 104.73 | 109.16 | 114.42 | 156.44 | 163.12 | 169.06 | 176.14 | 181.07 |
| 140 | 100.65 | 104.03 | 109.14 | 113.66 | 119.03 | 161.83 | 168.61 | 174.65 | 181.84 | 186.85 |
| 145 | 104.89 | 108.35 | 113.56 | 118.17 | 123.65 | 167.21 | 174.10 | 180.23 | 187.53 | 192.61 |
| 150 | 109.14 | 112.67 | 117.98 | 122.69 | 128.28 | 172.58 | 179.58 | 185.80 | 193.21 | 198.36 |

6.3 Student t distribution

Assume that $Z$ is a standard normal random variable, $V$ is a chi-square random variable with $m$ degrees of freedom, and $Z$ and $V$ are independent. Then

$$T = \frac{Z}{\sqrt{V/m}}$$

is said to be a Student t random variable, or to have a Student t distribution, with $m$ degrees of freedom. The PDF of $T$ is
as follows:

$$f(x) = \frac{\Gamma((m+1)/2)}{\sqrt{m\pi}\;\Gamma(m/2)} \left(\frac{m}{m + x^2}\right)^{(m+1)/2} \quad \text{for all real numbers } x.$$

For example, let $T$ be the Student t random variable with 5 df. The plot below is the graph of $y = f(x)$.

[Figure: density of the Student t distribution with 5 df]

If $T$ is a Student t random variable with $m$ df, then

- the distribution is symmetric around zero,
- $E(T) = 0$ when $m > 1$, and
- $\mathrm{Var}(T) = m/(m - 2)$ when $m > 2$.

Further, the distribution of $T$ is approximately standard normal when $m$ is large.

The notation $t_p$ is used to denote the $p$th quantile ($100p$th percentile) of the Student t distribution. A table with quantiles corresponding to $p$ = 0.60, 0.70, 0.80, 0.90, 0.95, 0.975, 0.99, 0.995 for various degrees of freedom is given in the book on page A9 (Table 4). An extended table is given below. Since the Student t distribution is symmetric around zero, $t_{1-p} = -t_p$.

6.3.1 Table: quantiles of Student t distributions

Let $T$ be a Student t random variable with df degrees of freedom.

- The following tables give selected quantiles of the $T$ distribution when $p > 0.50$.
- For $p < 0.50$, use $t_p = -t_{1-p}$.
- The df = $\infty$ row of the table corresponds to quantiles of the standard normal distribution.

| df | $t_{0.60}$ | $t_{0.70}$ | $t_{0.80}$ | $t_{0.90}$ | $t_{0.95}$ | $t_{0.975}$ | $t_{0.99}$ | $t_{0.995}$ |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.325 | 0.727 | 1.376 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657 |
| 2 | 0.289 | 0.617 | 1.061 | 1.886 | 2.920 | 4.303 | 6.965 | 9.925 |
| 3 | 0.277 | 0.584 | 0.978 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 |
| 4 | 0.271 | 0.569 | 0.941 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 |
| 5 | 0.267 | 0.559 | 0.920 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 |
| 6 | 0.265 | 0.553 | 0.906 | 1.440 | 1.943 | 2.447 | 3.143 | 3.707 |
| 7 | 0.263 | 0.549 | 0.896 | 1.415 | 1.895 | 2.365 | 2.998 | 3.499 |
| 8 | 0.262 | 0.546 | 0.889 | 1.397 | 1.860 | 2.306 | 2.896 | 3.355 |
| 9 | 0.261 | 0.543 | 0.883 | 1.383 | 1.833 | 2.262 | 2.821 | 3.250 |
| 10 | 0.260 | 0.542 | 0.879 | 1.372 | 1.812 | 2.228 | 2.764 | 3.169 |
| 11 | 0.260 | 0.540 | 0.876 | 1.363 | 1.796 | 2.201 | 2.718 | 3.106 |
| 12 | 0.259 | 0.539 | 0.873 | 1.356 | 1.782 | 2.179 | 2.681 | 3.055 |
| 13 | 0.259 | 0.538 | 0.870 | 1.350 | 1.771 | 2.160 | 2.650 | 3.012 |
| 14 | 0.258 | 0.537 | 0.868 | 1.345 | 1.761 | 2.145 | 2.624 | 2.977 |
| 15 | 0.258 | 0.536 | 0.866 | 1.341 | 1.753 | 2.131 | 2.602 | 2.947 |
| 16 | 0.258 | 0.535 | 0.865 | 1.337 | 1.746 | 2.120 | 2.583 | 2.921 |
| 17 | 0.257 | 0.534 | 0.863 | 1.333 | 1.740 | 2.110 | 2.567 | 2.898 |
| 18 | 0.257 | 0.534 | 0.862 | 1.330 | 1.734 | 2.101 | 2.552 | 2.878 |
| 19 | 0.257 | 0.533 | 0.861 | 1.328 | 1.729 | 2.093 | 2.539 | 2.861 |
| 20 | 0.257 | 0.533 | 0.860 | 1.325 | 1.725 | 2.086 | 2.528 | 2.845 |
| 21 | 0.257 | 0.532 | 0.859 | 1.323 | 1.721 | 2.080 | 2.518 | 2.831 |
| 22 | 0.256 | 0.532 | 0.858 | 1.321 | 1.717 | 2.074 | 2.508 | 2.819 |
| 23 | 0.256 | 0.532 | 0.858 | 1.319 | 1.714 | 2.069 | 2.500 | 2.807 |
| 24 | 0.256 | 0.531 | 0.857 | 1.318 | 1.711 | 2.064 | 2.492 | 2.797 |
| 25 | 0.256 | 0.531 | 0.856 | 1.316 | 1.708 | 2.060 | 2.485 | 2.787 |
| 26 | 0.256 | 0.531 | 0.856 | 1.315 | 1.706 | 2.056 | 2.479 | 2.779 |
| 27 | 0.256 | 0.531 | 0.855 | 1.314 | 1.703 | 2.052 | 2.473 | 2.771 |
| 28 | 0.256 | 0.530 | 0.855 | 1.313 | 1.701 | 2.048 | 2.467 | 2.763 |
| 29 | 0.256 | 0.530 | 0.854 | 1.311 | 1.699 | 2.045 | 2.462 | 2.756 |
| 30 | 0.256 | 0.530 | 0.854 | 1.310 | 1.697 | 2.042 | 2.457 | 2.750 |

| df | $t_{0.60}$ | $t_{0.70}$ | $t_{0.80}$ | $t_{0.90}$ | $t_{0.95}$ | $t_{0.975}$ | $t_{0.99}$ | $t_{0.995}$ |
|---|---|---|---|---|---|---|---|---|
| 31 | 0.256 | 0.530 | 0.853 | 1.309 | 1.696 | 2.040 | 2.453 | 2.744 |
| 32 | 0.255 | 0.530 | 0.853 | 1.309 | 1.694 | 2.037 | 2.449 | 2.738 |
| 33 | 0.255 | 0.530 | 0.853 | 1.308 | 1.692 | 2.035 | 2.445 | 2.733 |
| 34 | 0.255 | 0.529 | 0.852 | 1.307 | 1.691 | 2.032 | 2.441 | 2.728 |
| 35 | 0.255 | 0.529 | 0.852 | 1.306 | 1.690 | 2.030 | 2.438 | 2.724 |
| 36 | 0.255 | 0.529 | 0.852 | 1.306 | 1.688 | 2.028 | 2.434 | 2.719 |
| 37 | 0.255 | 0.529 | 0.851 | 1.305 | 1.687 | 2.026 | 2.431 | 2.715 |
| 38 | 0.255 | 0.529 | 0.851 | 1.304 | 1.686 | 2.024 | 2.429 | 2.712 |
| 39 | 0.255 | 0.529 | 0.851 | 1.304 | 1.685 | 2.023 | 2.426 | 2.708 |
| 40 | 0.255 | 0.529 | 0.851 | 1.303 | 1.684 | 2.021 | 2.423 | 2.704 |
| 41 | 0.255 | 0.529 | 0.850 | 1.303 | 1.683 | 2.020 | 2.421 | 2.701 |
| 42 | 0.255 | 0.528 | 0.850 | 1.302 | 1.682 | 2.018 | 2.418 | 2.698 |
| 43 | 0.255 | 0.528 | 0.850 | 1.302 | 1.681 | 2.017 | 2.416 | 2.695 |
| 44 | 0.255 | 0.528 | 0.850 | 1.301 | 1.680 | 2.015 | 2.414 | 2.692 |
| 45 | 0.255 | 0.528 | 0.850 | 1.301 | 1.679 | 2.014 | 2.412 | 2.690 |
| 46 | 0.255 | 0.528 | 0.850 | 1.300 | 1.679 | 2.013 | 2.410 | 2.687 |
| 47 | 0.255 | 0.528 | 0.849 | 1.300 | 1.678 | 2.012 | 2.408 | 2.685 |
| 48 | 0.255 | 0.528 | 0.849 | 1.299 | 1.677 | 2.011 | 2.407 | 2.682 |
| 49 | 0.255 | 0.528 | 0.849 | 1.299 | 1.677 | 2.010 | 2.405 | 2.680 |
| 50 | 0.255 | 0.528 | 0.849 | 1.299 | 1.676 | 2.009 | 2.403 | 2.678 |
| 55 | 0.255 | 0.527 | 0.848 | 1.297 | 1.673 | 2.004 | 2.396 | 2.668 |
| 60 | 0.254 | 0.527 | 0.848 | 1.296 | 1.671 | 2.000 | 2.390 | 2.660 |
| 65 | 0.254 | 0.527 | 0.847 | 1.295 | 1.669 | 1.997 | 2.385 | 2.654 |
| 70 | 0.254 | 0.527 | 0.847 | 1.294 | 1.667 | 1.994 | 2.381 | 2.648 |
| 75 | 0.254 | 0.527 | 0.846 | 1.293 | 1.665 | 1.992 | 2.377 | 2.643 |
| 80 | 0.254 | 0.526 | 0.846 | 1.292 | 1.664 | 1.990 | 2.374 | 2.639 |
| 85 | 0.254 | 0.526 | 0.846 | 1.292 | 1.663 | 1.988 | 2.371 | 2.635 |
| 90 | 0.254 | 0.526 | 0.846 | 1.291 | 1.662 | 1.987 | 2.368 | 2.632 |
| 95 | 0.254 | 0.526 | 0.845 | 1.291 | 1.661 | 1.985 | 2.366 | 2.629 |
| 100 | 0.254 | 0.526 | 0.845 | 1.290 | 1.660 | 1.984 | 2.364 | 2.626 |
| 105 | 0.254 | 0.526 | 0.845 | 1.290 | 1.659 | 1.983 | 2.362 | 2.623 |
| 110 | 0.254 | 0.526 | 0.845 | 1.289 | 1.659 | 1.982 | 2.361 | 2.621 |
| 115 | 0.254 | 0.526 | 0.845 | 1.289 | 1.658 | 1.981 | 2.359 | 2.619 |
| 120 | 0.254 | 0.526 | 0.845 | 1.289 | 1.658 | 1.980 | 2.358 | 2.617 |
| 125 | 0.254 | 0.526 | 0.845 | 1.288 | 1.657 | 1.979 | 2.357 | 2.616 |
| 130 | 0.254 | 0.526 | 0.844 | 1.288 | 1.657 | 1.978 | 2.355 | 2.614 |
| 135 | 0.254 | 0.526 | 0.844 | 1.288 | 1.656 | 1.978 | 2.354 | 2.613 |
| 140 | 0.254 | 0.526 | 0.844 | 1.288 | 1.656 | 1.977 | 2.353 | 2.611 |
| 145 | 0.254 | 0.526 | 0.844 | 1.287 | 1.655 | 1.976 | 2.352 | 2.610 |
| 150 | 0.254 | 0.526 | 0.844 | 1.287 | 1.655 | 1.976 | 2.351 | 2.609 |
| $\infty$ | 0.253 | 0.524 | 0.842 | 1.282 | 1.645 | 1.960 | 2.326 | 2.576 |

6.4 F ratio distribution

Let $U$ and $V$ be independent chi-square random variables with $n_1$ and $n_2$ degrees of freedom, respectively. Then

$$F = \frac{U/n_1}{V/n_2}$$

is said to be an f ratio random variable, or to have an f ratio distribution, with $n_1$ and $n_2$ degrees of freedom. The PDF of $F$ is as follows:

$$f(x) = \frac{\Gamma((n_1 + n_2)/2)}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \left(\frac{n_1}{n_2}\right)^{n_1/2} x^{(n_1/2) - 1} \left(\frac{n_2}{n_2 + n_1 x}\right)^{(n_1 + n_2)/2} \quad \text{when } x > 0, \text{ and 0 otherwise}.$$

For example, let $F$ have an f ratio distribution with 8 and 10 df. The plot below is a graph of $y = f(x)$.

[Figure: density of the f ratio distribution with 8 and 10 df]

The "f" in f ratio distribution is for R.A. Fisher, who pioneered its use in analyzing the results of comparative studies (i.e., studies comparing two or more samples).

If $F$ has an f ratio distribution with $n_1$ and $n_2$ degrees of freedom, then

- if $n_2 > 2$, then $E(F) = n_2/(n_2 - 2)$, and
- if $n_2 > 4$, then $\mathrm{Var}(F) = \dfrac{2 n_2^2\,(n_1 + n_2 - 2)}{n_1 (n_2 - 4)(n_2 - 2)^2}$.

Note that as $n_2 \to \infty$, $E(F)$ approaches 1.

If $F$ has an f ratio distribution with $n_1$ and $n_2$ degrees of freedom, then

- $1/F$ has an f ratio distribution with $n_2$ and $n_1$ degrees of freedom.

The notation $f_p$ is used to denote the $p$th quantile ($100p$th percentile) of the f ratio distribution. The book includes tables for $p = 0.90$ (page A10), $p = 0.95$ (page A11), $p = 0.975$ (page A12), and $p = 0.99$ (page A13).

The $p$ = 0.10, 0.05, 0.025, 0.01 quantiles can be computed using reciprocals. Specifically,

$$f_p \text{ on } n_1, n_2 \text{ df} = \frac{1}{f_{1-p} \text{ on } n_2, n_1 \text{ df}}.$$

To demonstrate the computation of quantiles, let $n_1 = 8$ and $n_2 = 10$. Then

$$f_{0.90} = 2.38, \quad f_{0.95} = 3.07, \quad f_{0.975} = 3.85, \quad f_{0.99} = 5.06.$$

Since $P(F \le x) = p$ if and only if $P(1/F \ge 1/x) = p$, to obtain the 0.10, 0.05, 0.025, and 0.01 quantiles of the distribution with 8
degrees of freedom in the numerator and 10 degrees of freedom in the denominator, we use the reciprocals of the 0.90, 0.95, 0.975, and 0.99 quantiles of the f ratio distribution with 10 degrees of freedom in the numerator and 8 degrees of freedom in the denominator. Thus,

$$f_{0.10} = \frac{1}{2.54} \approx 0.39, \quad f_{0.05} = \frac{1}{3.35} \approx 0.30, \quad f_{0.025} = \frac{1}{4.30} \approx 0.23, \quad f_{0.01} = \frac{1}{5.81} \approx 0.17.$$

6.5 Some Mathematica commands

The following distributions are included:

Distributions related to the Normal Distribution

    ChiSquareDistribution[m]
    StudentTDistribution[m]
    FRatioDistribution[n1, n2]

If $X$ has the model distribution, then PDF[model, x] returns the value of the PDF at $x$. For example, the following commands initialize the chi-square distribution with 5 df and return $f(3.5) = 0.151313$:

    model=ChiSquareDistribution[5];
    PDF[model, 3.5]

Similarly, CDF[model, x] returns the value of the CDF at $x$. For the chi-square model above, CDF[model, 3.5] returns $P(X \le 3.5) = 0.376612$.

If $X$ has the model distribution, then Mean[model], Variance[model], and StandardDeviation[model] return the values of the mean, variance, and standard deviation of $X$, respectively. For the chi-square model above, Mean[model] returns 5, Variance[model] returns 10, and StandardDeviation[model] returns $\sqrt{10}$.

If $X$ has the model distribution, then Quantile[model, q] returns the $q$th quantile of the $X$ distribution. For the chi-square model above, Quantile[model, 0.50] returns 4.35146.

6.6 Random sample from a normal distribution

Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Recall that the sample mean $\bar{X}$ and sample variance $S^2$ are defined as follows:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2,$$

and that the sample standard deviation $S$ is the positive square root of the sample variance, $S = \sqrt{S^2}$.

6.6.1 Distributions: sample mean, sample variance

Theorem 1. Let $\bar{X}$ be the sample mean and $S^2$ be the sample variance of a random sample of size $n$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Then

1. $\bar{X}$ is a normal random variable with mean $\mu$ and standard deviation $\sqrt{\sigma^2/n}$.
2. $V = (n-1)S^2/\sigma^2$ is a chi-square random variable with $n - 1$
df.
3. $\bar{X}$ and $S^2$ are independent random variables.

Exercise 2. Let $\bar{X}$ be the sample mean of a random sample of size $n$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Further, let $z_p$ be the $p$th quantile of the standard normal random variable.

1. Find an expression for the $p$th quantile of the $\bar{X}$ distribution.
2. Let $n = 40$, $\mu = 75$, $\sigma = 10$. Find the 10th and 90th percentiles of the $\bar{X}$ distribution.

Exercise 3. Let $S^2$ be the sample variance of a random sample of size $n$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Further, let $\chi^2_p$ be the $p$th quantile of the chi-square distribution with $n - 1$ df.

1. Find expressions for $E(S^2)$ and $\mathrm{Var}(S^2)$.
2. Find an expression for the $p$th quantile of the $S^2$ distribution.
3. Let $n = 40$, $\mu = 75$, $\sigma = 10$. Find the 10th and 90th percentiles of the $S^2$ distribution.

6.6.2 Approximate standardization of the sample mean

Since $\bar{X}$ is a normal random variable with mean $\mu$ and standard deviation $\sqrt{\sigma^2/n}$, the standardized sample mean

$$Z = \frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}}$$

is a standard normal random variable. An approximation is obtained by substituting the sample variance $S^2$ for the true variance $\sigma^2$.

Theorem 4. Let $\bar{X}$ be the sample mean and $S^2$ be the sample variance of a random sample of size $n$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Then

$$T = \frac{\bar{X} - \mu}{\sqrt{S^2/n}}$$

has a Student t distribution with $n - 1$ df.

To demonstrate this theorem using the previous theorem, note that $Z = (\bar{X} - \mu)/\sqrt{\sigma^2/n}$ is a standard normal random variable, $V = (n-1)S^2/\sigma^2$ is a chi-square random variable with $n - 1$ degrees of freedom, and $Z$ and $V$ are independent. Thus,

$$T = \frac{Z}{\sqrt{V/(n-1)}} = \frac{(\bar{X} - \mu)/\sqrt{\sigma^2/n}}{\sqrt{S^2/\sigma^2}} = \frac{\bar{X} - \mu}{\sqrt{S^2/n}}$$

has a Student t distribution with $n - 1$ degrees of freedom.

Exercise 5. Let $\bar{X}$ be the sample mean and $S^2$ be the sample variance of a random sample of size $n$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Let $t_p$ and $t_{1-p}$ be the $p$th and $(1-p)$th quantiles of the Student t distribution with $n - 1$ df.

(a) Use the fact that $P(t_p \le T \le t_{1-p}) = 1 - 2p$ to find expressions to fill in each of the following blanks:

$$P\big(\ \underline{\qquad} \le \mu \le \underline{\qquad}\ \big) = 1 - 2p.$$

(b) Evaluate the
endpoints of the interval from part (a) using $n = 15$, $\bar{x} = 36.5$, $s^2 = 3.75$, and $p = 0.05$.

6.7 Multinomial experiments

A multinomial experiment is an experiment with exactly $k$ outcomes. The probability of the $i$th outcome is $p_i$, $i = 1, 2, \ldots, k$. The outcomes of a multinomial experiment are often referred to as categories or groups.

6.7.1 Multinomial distribution

Let $X_i$ be the number of occurrences of the $i$th outcome in $n$ independent trials of the experiment, $i = 1, 2, \ldots, k$. Then the random $k$-tuple $(X_1, X_2, \ldots, X_k)$ is said to have a multinomial distribution with parameters $n$ and $(p_1, p_2, \ldots, p_k)$. The joint PDF for the $k$-tuple is

$$f(x_1, x_2, \ldots, x_k) = \binom{n}{x_1, x_2, \ldots, x_k} p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$$

when $x_1, x_2, \ldots, x_k = 0, 1, \ldots, n$ and $\sum_{i} x_i = n$ (and zero otherwise). The multinomial coefficient

$$\binom{n}{x_1, x_2, \ldots, x_k} = \frac{n!}{x_1!\, x_2! \cdots x_k!}$$

is the number of ways to partition the $n$ outcomes into exactly $x_1$ of type 1, $x_2$ of type 2, etc.

- For each $i$, $X_i$ is a binomial random variable with parameters $n$ and $p_i$.
- For each $i \ne j$, $(X_i, X_j)$ has a trinomial distribution with parameters $n$ and $(p_i, p_j, 1 - p_i - p_j)$. In particular, $\mathrm{Cov}(X_i, X_j) = -n p_i p_j$. Each pair is negatively correlated.
- If $X$ is a binomial random variable with parameters $n$ and $p$, then $(X_1, X_2) = (X, n - X)$ has a multinomial distribution with parameters $n$ and $(p_1, p_2) = (p, 1 - p)$.
- If $(X, Y)$ has a trinomial distribution with parameters $n$ and $(p_1, p_2, p_3)$, then $(X_1, X_2, X_3) = (X, Y, n - X - Y)$ has a multinomial distribution with parameters $n$ and $(p_1, p_2, p_3)$.

Exercise 6. Assume that M&M's candies come in four colors (brown, red, green, and yellow) and that each bag is filled in the following proportions: 40% brown, 10% red, and 25% each of green and yellow. Let $X_1$ be the number of brown candies, $X_2$ the number of red candies, $X_3$ the number of green candies, and $X_4$ the number of yellow candies in a bag of 10, and assume $(X_1, X_2, X_3, X_4)$ has a multinomial distribution. Find the probability of getting:

(a) Exactly 4 brown, 1 red, 2 green, and 3 yellow candies.
(b) Five or more brown candies.
(c) At least one red and at least one yellow candy.

6.7.2 Goodness-of-fit test: known model

In 1900, Karl Pearson developed a quantitative method to determine if observed data are
consistent with a given multinomial model. If $(X_1, X_2, \ldots, X_k)$ has a multinomial distribution with parameters $n$ and $(p_1, p_2, \ldots, p_k)$, then Pearson's statistic is defined as follows:

$$X^2 = \sum_{i=1}^{k} \frac{(X_i - n p_i)^2}{n p_i}.$$

For each $i$, the observed frequency $X_i$ is compared to the expected frequency $n p_i$ under the multinomial model. If each observed frequency is close to expected, then the value of $X^2$ will be close to zero. If at least one observed frequency is far from expected, then the value of $X^2$ will be large, and the appropriateness of the given multinomial model will be called into question. A test can be developed using the following distribution theorem.

Theorem 7 (Pearson's Theorem). Under the assumptions above, if $n$ is large, the distribution of $X^2$ is approximately chi-square with $k - 1$ degrees of freedom.

Pearson's goodness-of-fit test. For a given $k$-tuple $(x_1, x_2, \ldots, x_k)$, let

$$x^2_{\text{obs}} = \sum_{i=1}^{k} \frac{(x_i - n p_i)^2}{n p_i}$$

be the observed value of Pearson's statistic. Compute the p value $P(X^2 \ge x^2_{\text{obs}})$. Then:

1. If $P(X^2 \ge x^2_{\text{obs}}) > 0.10$, the fit is judged to be good (the observed data are judged to be consistent with the multinomial model).
2. If $0.05 < P(X^2 \ge x^2_{\text{obs}}) < 0.10$, the fit is judged to be fair (the observed data are judged to be marginally consistent with the multinomial model).
3. If $P(X^2 \ge x^2_{\text{obs}}) < 0.05$, the fit is judged to be poor (the observed data are judged to be not consistent with the multinomial model).

The p value measures the strength of the evidence against the given multinomial model. Pearson's theorem tells us that, when $n$ is large enough, the chi-square distribution with $k - 1$ degrees of freedom can be used to compute the p value. The chi-square approximation is adequate when $E(X_i) = n p_i \ge 5$ for $i = 1, 2, \ldots, k$.

When conducting a test by hand, you need to refer to the chi-square table of quantiles. Then:

- If $x^2_{\text{obs}} < \chi^2_{0.90}$, the fit is judged to be good.
- If $\chi^2_{0.90} < x^2_{\text{obs}} < \chi^2_{0.95}$, the fit is judged to be fair.
- If $x^2_{\text{obs}} > \chi^2_{0.95}$, the fit is judged to be poor.

Analysis of standardized residuals. For a given $k$-tuple $(x_1, x_2, \ldots, x_k)$, the list of standardized residuals

$$r_i = \frac{x_i - n p_i}{\sqrt{n p_i}}, \quad i = 1, 2, \ldots, k,$$

serve as diagnostic values
for the goodness-of-fit test. When $n$ is large, the $r_i$'s are approximate values from a standard normal distribution. Values outside the interval $[-2, 2]$ are considered to be unusual and deserve comment in your analysis. Further, the value of Pearson's statistic is $\sum_{i=1}^{k} r_i^2$.

Exercise 8. The table below gives the age ranges for adults (18 years of age or older) and proportions in each age range according to the 1980 census.

| Age group | 18-24 | 25-34 | 35-44 | 45-64 | 65+ |
|---|---|---|---|---|---|
| 1980 proportion | 0.18 | 0.23 | 0.16 | 0.27 | 0.16 |

In a recent survey of 250 adults, there were 40, 52, 43, 59, and 56 individuals in ranges 18-24, 25-34, 35-44, 45-64, and 65+, respectively. Assume this information summarizes 250 independent trials of a multinomial experiment with five outcomes. Of interest is whether or not these data are consistent with the 1980 census model.

- Conduct a goodness-of-fit analysis using Pearson's statistic. Comment on any unusual standardized residuals.

6.7.3 Goodness-of-fit test: estimated model

In many practical situations, certain parameters of the multinomial model need to be estimated from the sample data. R.A. Fisher proved a generalization of Pearson's theorem to handle this case.

Theorem 9 (Fisher's Theorem). Suppose that $(X_1, X_2, \ldots, X_k)$ has a multinomial distribution with parameters $n$ and $(p_1, p_2, \ldots, p_k)$, and that the list of probabilities has $e$ free parameters. Then, under smoothness conditions and when $n$ is large, the distribution of the statistic

$$X^2 = \sum_{i=1}^{k} \frac{(X_i - n \hat{p}_i)^2}{n \hat{p}_i}$$

is approximately chi-square with $k - 1 - e$ degrees of freedom, where $\hat{p}_i$ is an appropriate estimate of $p_i$ for $i = 1, 2, \ldots, k$.

The smoothness conditions mentioned in the theorem, and methods for estimating free parameters in models, are studied in detail in Chapter 7 of these notes.

Pearson's goodness-of-fit test is conducted in the same way as before, with estimated expected frequencies taking the place of expected frequencies and $k - 1 - e$ degrees of freedom taking the place of $k - 1$ degrees of freedom.

Example 10 (Berkson, Wiley, 1966; Rice, Duxbury Press, 1995, p. 240). Experimenters recorded
emissions of alpha particles from the radioactive source americium 241. They observed the process for more than 3 hours. Of interest was whether or not these data were consistent with a Poisson model.

1. Let $X$ be the number of particles observed in a ten-second period. In the sample of $n = 1207$ ten-second periods, an average of 8.392 particles per period were observed.

2. To obtain a multinomial model, group the observations by the following events: $X \le 2$, $X = 3$, $X = 4$, ..., $X = 16$, $X \ge 17$.

3. The probabilities in the multinomial model are estimated using the Poisson distribution with parameter 8.392: $\hat{p}_1 = P(X \le 2)$, $\hat{p}_2 = P(X = 3)$, ..., $\hat{p}_{15} = P(X = 16)$, $\hat{p}_{16} = P(X \ge 17)$. One free parameter has been estimated.

4. The following table summarizes the important information needed in the goodness-of-fit analysis:

| Event | Observed Frequency | Expected Frequency | Standardized Residual | Component of $X^2$ |
|---|---|---|---|---|
| $x \le 2$ | 18 | 12.204 | 1.659 | 2.753 |
| $x = 3$ | 28 | 26.950 | 0.202 | 0.041 |
| $x = 4$ | 56 | 56.540 | -0.072 | 0.005 |
| $x = 5$ | 105 | 94.897 | 1.037 | 1.076 |
| $x = 6$ | 126 | 132.730 | -0.584 | 0.341 |
| $x = 7$ | 146 | 159.124 | -1.040 | 1.082 |
| $x = 8$ | 164 | 166.921 | -0.226 | 0.051 |
| $x = 9$ | 161 | 155.645 | 0.429 | 0.184 |
| $x = 10$ | 123 | 130.617 | -0.666 | 0.444 |
| $x = 11$ | 101 | 99.649 | 0.135 | 0.018 |
| $x = 12$ | 74 | 69.688 | 0.517 | 0.267 |
| $x = 13$ | 53 | 44.986 | 1.195 | 1.428 |
| $x = 14$ | 23 | 26.966 | -0.764 | 0.583 |
| $x = 15$ | 15 | 15.087 | -0.022 | 0.000 |
| $x = 16$ | 9 | 7.913 | 0.386 | 0.149 |
| $x \ge 17$ | 5 | 7.084 | -0.783 | 0.613 |

Note that all estimated standardized residuals are in the interval $[-2, 2]$ and that the sum of their squares is 9.03693.

5. Pearson's goodness-of-fit test uses the chi-square distribution with 14 degrees of freedom. The p value is $P(X^2 \ge 9.03693) \approx 0.829$.

6. Since the p value is much larger than 0.10, the alpha particle emissions data are judged to be consistent with a Poisson model.

Example 11 (Terman, Houghton Mifflin, 1919; Olkin et al., Macmillan, 1994, p. 387). It is often said that intelligence quotient (IQ) scores are well approximated by the normal distribution. The data for this example are from one of the first studies of IQ scores.

A study was conducted using the Stanford-Binet Intelligence Scale to determine the intelligence quotients of children in five kindergarten classes in San Jose and San Mateo,
California. There were

- 112 children (64 boys and 48 girls),
- ranging in age from 3 to 7 years old.

The majority of the kindergarteners were from the middle class, and all were native born.

1. Let $X$ be the IQ score of a randomly chosen kindergarten student. In the sample of $n = 112$ children, a sample mean of $\bar{x} = 104.455$ and a sample standard deviation of $s = 16.3105$ were observed.

2. To obtain a multinomial model, group the observations by the following events:

$$X < x_{0.05}, \quad x_{0.05} \le X < x_{0.10}, \quad x_{0.10} \le X < x_{0.15}, \quad \ldots, \quad x_{0.90} \le X < x_{0.95}, \quad X \ge x_{0.95},$$

where $x_p$ is the $p$th quantile of the normal distribution with mean 104.455 and standard deviation 16.3105.

3. The estimated multinomial model has 20 equally likely outcomes: $\hat{p}_i = 0.05$ for each $i$. Two free parameters have been estimated.

4. The following table summarizes the important information needed in the analysis:

| Event | Observed Frequency | Expected Frequency | Standardized Residual | Component of $X^2$ |
|---|---|---|---|---|
| IQ < 77.63 | 5 | 5.6 | -0.25 | 0.06 |
| 77.63 ≤ IQ < 83.55 | 7 | 5.6 | 0.59 | 0.35 |
| 83.55 ≤ IQ < 87.55 | 6 | 5.6 | 0.17 | 0.03 |
| 87.55 ≤ IQ < 90.73 | 5 | 5.6 | -0.25 | 0.06 |
| 90.73 ≤ IQ < 93.45 | 6 | 5.6 | 0.17 | 0.03 |
| 93.45 ≤ IQ < 95.90 | 2 | 5.6 | -1.52 | 2.31 |
| 95.90 ≤ IQ < 98.17 | 9 | 5.6 | 1.44 | 2.06 |
| 98.17 ≤ IQ < 100.32 | 4 | 5.6 | -0.68 | 0.46 |
| 100.32 ≤ IQ < 102.41 | 7 | 5.6 | 0.59 | 0.35 |
| 102.41 ≤ IQ < 104.46 | 3 | 5.6 | -1.10 | 1.21 |
| 104.46 ≤ IQ < 106.50 | 4 | 5.6 | -0.68 | 0.46 |
| 106.50 ≤ IQ < 108.59 | 6 | 5.6 | 0.17 | 0.03 |
| 108.59 ≤ IQ < 110.74 | 9 | 5.6 | 1.44 | 2.06 |
| 110.74 ≤ IQ < 113.01 | 8 | 5.6 | 1.01 | 1.03 |
| 113.01 ≤ IQ < 115.46 | 6 | 5.6 | 0.17 | 0.03 |
| 115.46 ≤ IQ < 118.18 | 3 | 5.6 | -1.10 | 1.21 |
| 118.18 ≤ IQ < 121.36 | 8 | 5.6 | 1.01 | 1.03 |
| 121.36 ≤ IQ < 125.36 | 5 | 5.6 | -0.25 | 0.06 |
| 125.36 ≤ IQ < 131.28 | 5 | 5.6 | -0.25 | 0.06 |
| IQ ≥ 131.28 | 4 | 5.6 | -0.68 | 0.46 |

Note that all estimated standardized residuals are in the interval $[-2, 2]$ and that the sum of their squares is 13.3571.

5. Pearson's goodness-of-fit test uses the chi-square distribution with 17 degrees of freedom. The p value is $P(X^2 \ge 13.3571) \approx 0.712$.

6. Since the p value is much larger than 0.10, the IQ scores data are judged to be consistent with a normal distribution.

6.7.4 Some Mathematica commands

The command GOFTest[model, observed, df] returns (1) the observed
value of Pearson's statistic, (2) the p value based on the chi-square distribution with df degrees of freedom, and (3) a plot of standardized residuals. In the GOFTest command, model is the multinomial model under consideration and observed is the list of observed frequencies. For example, the following commands initialize the information needed for the 1980 census model exercise:

    probabilities={0.18,0.23,0.16,0.27,0.16};
    model=MultinomialDistribution[250,probabilities];
    observed={40,52,43,59,56};

The following commands can be used to return the list of standardized residuals:

    expected=Mean[model];
    N[(observed-expected)/Sqrt[expected]]

In this command, model is the multinomial model under consideration. For the 1980 census model exercise, the commands above returned

    {-0.745356, -0.725319, 0.474342, -1.03459, 2.52982}

When trying to fit one of the standard continuous models to sample data, it is common practice to group the outcomes into $k$ equally likely intervals, where the boundaries are determined using the $i/k$ quantiles of the continuous distribution (or estimated continuous distribution). In the IQ scores example, $k = 20$ intervals were chosen so that the estimated expected frequencies were $\ge 5$ and close to 5.

1. The IQ scores list (iqscores) is initialized, and the sample mean (xbar) and sample standard deviation (sd) computed:

    iqscores={61,72,75,76,77,79,80,80,80,80,81,82,84,85,85,85,86,86,88,90,
      90,90,90,91,91,92,93,93,93,94,94,96,96,96,96,97,97,98,98,98,99,
      100,100,100,101,101,102,102,102,102,102,103,103,103,105,106,106,
      106,107,107,107,107,108,108,109,109,109,109,109,110,110,110,110,
      111,111,112,112,113,113,113,113,114,114,114,114,114,114,117,117,
      118,119,119,120,121,121,121,121,121,122,123,124,124,125,126,126,
      129,130,130,136,142,146,152};
    {xbar,sd}=N[{Mean[iqscores],StandardDeviation[iqscores]}]

The sample mean is 104.455 and the sample standard deviation is 16.3105.

2. The estimated normal model (nmodel) and quantiles list (quantiles) are initialized using the 0.05, 0.10, ..., 0.95 quantiles. Finally, the RangeCounts command returns the list of observed frequencies, {5,7,6,5,6,2,9,4,7,3,4,6,9,8,6,3,8,5,5,4}:

    nmodel=NormalDistribution[xbar,sd];
    quantiles=Table[Quantile[nmodel,p],{p,0.05,0.95,0.05}];
    observed=RangeCounts[iqscores,quantiles]

6.8 Independent random samples from normal distributions

Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a normal distribution with mean $\mu_x$ and standard deviation $\sigma_x$, and let $Y_1, Y_2, \ldots, Y_m$ be a random sample of size $m$ from a normal distribution with mean $\mu_y$ and standard deviation $\sigma_y$. Further, assume the $X$ and $Y$ samples are independent.

6.8.1 Distribution: ratio of sample variances

Theorem 12. Let $S_x^2$ and $S_y^2$ be the sample variances of independent random samples of sizes $n$ and $m$, respectively, from normal distributions. Then

$$F = \frac{S_x^2 / S_y^2}{\sigma_x^2 / \sigma_y^2}$$

has an f ratio distribution with $n - 1$ and $m - 1$ degrees of freedom, where the numerator is the ratio of sample variances and the denominator is the ratio of model variances.

To demonstrate this theorem, note that $U = (n-1)S_x^2/\sigma_x^2$ and $V = (m-1)S_y^2/\sigma_y^2$ are independent chi-square random variables with $n - 1$ and $m - 1$ df, respectively. Thus,

$$F = \frac{U/(n-1)}{V/(m-1)} = \frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} = \frac{S_x^2/S_y^2}{\sigma_x^2/\sigma_y^2}$$

is an f ratio random variable with $n - 1$ and $m - 1$ degrees of freedom.

Exercise 13. Let $S_x^2$ be the sample variance of a random sample of size $n$ from a normal distribution with variance $\sigma_x^2$, let $S_y^2$ be the sample variance of a random sample of size $m$ from a normal distribution with variance $\sigma_y^2$, and assume the samples are independent. Let $f_p$ be the $p$th quantile of the f ratio distribution with $n - 1$ and $m - 1$ df.

(a) Find an expression for the $p$th quantile of the $S_x^2/S_y^2$ distribution.
(b) Find the 10th and 90th percentiles of the $S_x^2/S_y^2$ distribution when $n = 8$, $m = 10$, $\sigma_x^2 = 16.50$, and $\sigma_y^2 = 5.50$.

Exercise 14. Let $S_x^2$ and $S_y^2$ be the sample variances of independent random samples of sizes $n$ and $m$, respectively, from normal distributions. Let $f_p$ and $f_{1-p}$ be the $p$th and $(1-p)$th quantiles of the f ratio distribution with $n - 1$ and $m - 1$ df.

(a) Use the fact that $P(f_p \le F \le f_{1-p}) = 1 - 2p$ to find expressions to fill in each of the following blanks:

$$1 - 2p = P\left(\ \underline{\qquad} \le \frac{\sigma_x^2}{\sigma_y^2} \le \underline{\qquad}\ \right).$$

(b) Evaluate the endpoints of the interval from part (a) using $n = 8$, $m = 10$, $s_x^2 = 18.75$, $s_y^2 = 3.45$, and $p = 0.05$.

Mathematical Statistics I, Notes 3, prepared by Professor Jenny Baglivo. Copyright 2004
by Jenny A. Baglivo. All Rights Reserved.

8 Hypothesis testing
8.1 Definitions: hypothesis, simple and compound hypotheses, test
8.1.1 Neyman-Pearson framework, null and alternative hypotheses
8.1.2 Test statistic, rejection region (RR), acceptance region (AR)
8.1.3 Equivalent tests
8.2 Properties of tests
8.2.1 Errors of type I and II, size, significance level, power
8.2.2 Observed significance level, p value
8.2.3 Power function, uniformly more powerful, UMPT
8.3 Example: Normal distribution
8.3.1 Tests of $\mu = \mu_0$
8.3.2 Tests of $\sigma^2 = \sigma_0^2$
8.3.3 Some Mathematica commands
8.4 Example: Bernoulli/binomial distribution
8.4.1 Small sample tests of $p = p_0$
8.4.2 Large sample tests of $p = p_0$
8.5 Example: Poisson distribution
8.5.1 Small sample tests of $\lambda = \lambda_0$
8.5.2 Large sample tests of $\lambda = \lambda_0$
8.6 Likelihood ratio tests
8.6.1 Likelihood ratio statistic, Neyman-Pearson lemma
8.6.2 Generalized likelihood ratio tests
8.6.3 Large sample theory: approximate tests
8.6.4 Example: comparing Bernoulli parameters
8.6.5 Example: comparing Poisson parameters
8.6.6 Example: multinomial goodness-of-fit

8 Hypothesis testing

8.1 Definitions: hypothesis, simple and compound hypotheses, test

An hypothesis is an assertion about the distribution of a random variable.

A simple hypothesis specifies the distribution of $X$ completely. For example,

- H: $X$ is a Bernoulli random variable with parameter $p = 0.45$.
- H: $X$ is an exponential random variable with parameter 15.

A compound hypothesis does not specify the distribution of $X$ completely. For example,

- H: $X$ is a Bernoulli random variable with parameter $p \ge 0.45$.
- H: $X$ is an exponential random variable.
- H: $X$ is not an exponential random variable.

A test is a decision rule allowing the user to choose between competing assertions.

8.1.1 Neyman-Pearson framework, null and alternative hypotheses

In the Neyman-Pearson framework of hypothesis testing, there are two competing assertions:

1. The null hypothesis, $H_0$, and
2. The alternative hypothesis, $H_a$.

The null hypothesis is accepted as true unless sufficient
evidence is provided to the contrary; then the null hypothesis is rejected in favor of the alternative hypothesis.

Example 1. Suppose that the standard treatment for a given medical condition is effective in 45% of patients. A new treatment promises to be effective in more than 45% of patients. In testing the efficacy of the new treatment, the hypotheses could be set up as follows:

- $H_0$: The new treatment is no more effective than the standard treatment.
- $H_a$: The new treatment is more effective than the standard treatment.

If $p$ is the proportion of patients for whom the new treatment would be effective, then the hypotheses could be set up as follows: $H_0$: $p = 0.45$ versus $H_a$: $p > 0.45$.

Example 2. In testing whether or not an exponential model is a reasonable model for sample data, the hypotheses would be set up as follows:

- $H_0$: The distribution of $X$ is exponential.
- $H_a$: The distribution of $X$ is not exponential.

If Pearson's goodness-of-fit test is used, then the data would be grouped (using $k$ ranges of values for some $k$) and the hypotheses would be set up as follows:

- $H_0$: The data are consistent with the grouped exponential model.
- $H_a$: The data are not consistent with the grouped exponential model.

8.1.2 Test statistic, rejection region (RR), acceptance region (AR)

Let $X_1, X_2, \ldots, X_n$ be a random sample from the $X$ distribution. To set up a test:

1. A test statistic, $T = T(X_1, \ldots, X_n)$, is chosen, and
2. The range of $T$ is subdivided into the rejection region (RR) and the complementary acceptance region (AR).
3. If the observed value of $T$ is in the acceptance region, then the null hypothesis is accepted. Otherwise, the null hypothesis is rejected in favor of the alternative.

Note that the test statistic and the acceptance and rejection regions are chosen so that the probability that $T$ is in the rejection region is small (that is, near 0) when the null hypothesis is true. Hopefully, the probability that $T$ is in the rejection region is large (that is, near 1) when the alternative hypothesis is true, but this is not guaranteed.

Example 3. Let $X_1, X_2, \ldots, X_{25}$ be a
random sample from a Bernoulli distribution with success probability $p$, and let $Y$ be the sample sum. Consider the following decision rule for a test of the null hypothesis that $p = 0.45$ versus the alternative hypothesis that $p > 0.45$:

Reject $p = 0.45$ in favor of $p > 0.45$ when $Y \ge 16$.

- If the null hypothesis is true, then
$$P(Y \ge 16 \text{ when } p = 0.45) = \sum_{y=16}^{25} \binom{25}{y} 0.45^y\, 0.55^{25-y} \approx 0.044.$$
- If the actual success probability is $p = 0.70$, then
$$P(Y \ge 16 \text{ when } p = 0.70) = \sum_{y=16}^{25} \binom{25}{y} 0.70^y\, 0.30^{25-y} \approx 0.811.$$

The following graph shows the distribution of $Y$ under the null hypothesis that $p = 0.45$ (in gray) and under the alternative hypothesis when the actual success probability is $p = 0.70$ (in black). A vertical dashed line is drawn at $y = 16$.

[Figure: probability histograms of $Y$ when $p = 0.45$ and when $p = 0.70$]

Upper and lower tail tests. The test above is an example of an upper tail test. In an upper tail test, the null hypothesis is rejected when the test statistic is in the upper tail of distributions satisfying the null hypothesis. A test which rejects when the test statistic is in the lower tail of distributions satisfying the null hypothesis is called a lower tail test. Upper tail and lower tail tests are also called one sided tests.

Example 4. Let $X_1, X_2, \ldots, X_{16}$ be a random sample from a normal distribution with mean $\mu$ and standard deviation 10, and let $\bar{X}$ be the sample mean. Consider the following decision rule for a test of the null hypothesis that $\mu = 85$ versus the alternative hypothesis that $\mu \ne 85$:

Reject $\mu = 85$ in favor of $\mu \ne 85$ when $\bar{X} \le 80.3$ or $\bar{X} \ge 89.7$.

- If the null hypothesis is true, then
$$P(\bar{X} \le 80.3 \text{ or } \bar{X} \ge 89.7 \text{ when } \mu = 85) = P(\bar{X} \le 80.3 \text{ when } \mu = 85) + P(\bar{X} \ge 89.7 \text{ when } \mu = 85) \approx 0.03 + 0.03 = 0.06.$$
- If the actual mean is $\mu = 78$, then
$$P(\bar{X} \le 80.3 \text{ or } \bar{X} \ge 89.7 \text{ when } \mu = 78) = P(\bar{X} \le 80.3 \text{ when } \mu = 78) + P(\bar{X} \ge 89.7 \text{ when } \mu = 78) \approx 0.821 + 0.000 = 0.821.$$

The following graph shows the distribution of $\bar{X}$ under the null hypothesis that $\mu = 85$ (in gray) and under the alternative hypothesis when the actual mean is $\mu = 78$ (in black). Vertical dashed lines are drawn at $x = 80.3$ and $x = 89.7$.

[Figure: densities of $\bar{X}$ when $\mu = 85$ and when $\mu = 78$]

Two tailed tests. The test above is an example of a two tailed test.
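The probabilities quoted in Examples 3 and 4 can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the helper names are hypothetical, and for Example 4 it assumes that the sample mean has standard deviation $10/\sqrt{16} = 2.5$, the value implied by the quoted probabilities.

```python
from math import comb, erf, sqrt


def binomial_upper_tail(n, p, c):
    """Return P(Y >= c) for Y ~ Binomial(n, p)."""
    return sum(comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(c, n + 1))


def normal_cdf(x, mean, sd):
    """Return P(X <= x) for X ~ Normal(mean, sd)."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))


def two_tailed_rejection_prob(mu, sd_of_mean, lower, upper):
    """Return P(Xbar <= lower or Xbar >= upper) when the true mean is mu."""
    return normal_cdf(lower, mu, sd_of_mean) + 1.0 - normal_cdf(upper, mu, sd_of_mean)


# Example 3: upper tail test, reject p = 0.45 when Y >= 16 (n = 25).
size_ex3 = binomial_upper_tail(25, 0.45, 16)   # probability of rejecting a true null
power_ex3 = binomial_upper_tail(25, 0.70, 16)  # probability of rejecting when p = 0.70

# Example 4: two tailed test, reject mu = 85 when Xbar <= 80.3 or Xbar >= 89.7.
# Assumption: the sample mean has standard deviation 10 / sqrt(16) = 2.5.
sd_of_mean = 10 / sqrt(16)
size_ex4 = two_tailed_rejection_prob(85, sd_of_mean, 80.3, 89.7)
power_ex4 = two_tailed_rejection_prob(78, sd_of_mean, 80.3, 89.7)

print(round(size_ex3, 3), round(power_ex3, 3))
print(round(size_ex4, 3), round(power_ex4, 3))
```

Under these assumptions the four printed values should agree, up to rounding, with the 0.044, 0.811, 0.06, and 0.821 quoted in the two examples.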
In a two tailed test, the null hypothesis is rejected if the test statistic is either in the upper tail or the lower tail of distributions satisfying the null hypothesis. A two tailed test is also called a two sided test.

8.1.3 Equivalent tests

Consider two tests, each based on a random sample of size $n$:

1. A test based on statistic $T$ with rejection region $RR_T$, and
2. A test based on statistic $W$ with rejection region $RR_W$.

The tests are said to be equivalent if

$$T \in RR_T \iff W \in RR_W.$$

That is, given the same information, either both tests accept the null hypothesis or both reject the null hypothesis. Equivalent tests have the same properties.

For example, if $X_1, X_2, \ldots, X_{16}$ is a random sample from a normal distribution with mean $\mu$ and standard deviation 10, $\bar{X}$ is the sample mean, and

$$Z = \frac{\bar{X} - 85}{10/\sqrt{16}}$$

is the standardized mean when $\mu = 85$, then the test with decision rule

Reject $\mu = 85$ in favor of $\mu \ne 85$ when $\bar{X} \le 80.3$ or $\bar{X} \ge 89.7$

is equivalent to the test with decision rule

Reject $\mu = 85$ in favor of $\mu \ne 85$ when $|Z| \ge 1.88$.

8.2 Properties of tests

Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with parameter $\theta$. Let $\Omega$ be the set of parameter values under consideration, and assume that the null and alternative hypotheses are set up as follows:

$$H_0: \theta \in \omega_0 \quad \text{versus} \quad H_a: \theta \in \Omega - \omega_0,$$

where $\omega_0$ is a subset of $\Omega$ ($\omega_0 \subset \Omega$).

For example, if $X$ is a Bernoulli random variable with success probability $p$, and we are interested in testing the null hypothesis $p \le 0.30$ versus the alternative hypothesis $p > 0.30$, then $\Omega = \{p : 0 \le p \le 1\}$ and $\omega_0 = \{p : 0 \le p \le 0.30\}$.

8.2.1 Errors of type I and II, size, significance level, power

When carrying out a test, two types of errors can occur:

| | Accept $H_0$ | Reject $H_0$ |
|---|---|---|
| $H_0$ is true | No error | Type I error |
| $H_0$ is false | Type II error | No error |

- An error of type I occurs when a true null hypothesis is rejected.
- An error of type II occurs when a false null hypothesis is accepted.

Size, significance level. The size or significance level of the test with decision rule

Reject $\theta \in \omega_0$ in favor of $\theta \in \Omega - \omega_0$ when $T \in RR$

is defined as follows:

$$\alpha = \sup_{\theta \in \omega_0} P(T \in RR \text{ when the true parameter is } \theta).$$

A test with size $\alpha$ is called a "$100\alpha\%$ test."

The size or significance level is the maximum type I error (or the least upper bound of type I errors, if a maximum does not exist).

If the significance level is $\alpha$ and the observed data lead to rejecting the null hypothesis, then the result is said to be statistically significant at level $\alpha$. If the observed data do not lead to rejecting the null hypothesis, then the result is not statistically significant at level $\alpha$.

Power. The power of the test with decision rule

Reject $\theta \in \omega_0$ in favor of $\theta \in \Omega - \omega_0$ when $T \in RR$

at $\theta \in \Omega$ is defined as follows:

Power at $\theta$ equals $P(T \in RR \text{ when the true parameter is } \theta)$.

If $\theta \in \omega_0$, then the power at $\theta$ is the same as the type I error. If $\theta \in \Omega - \omega_0$, then the power corresponds to the test's ability to correctly reject the null hypothesis in favor of the alternative hypothesis.

Exercise 5. Let $X_1, X_2, \ldots, X_{12}$ be a random sample from a Bernoulli distribution with success probability $p$, and let $Y$ be the sample sum.

1. Find the rejection region for a 5% test (or as close as possible) of the null hypothesis $p = 0.45$ versus the alternative hypothesis $p > 0.45$ using $Y$ as test statistic. State the exact size of the test.

Note: For convenience,

$$P(Y = y \text{ when } p = 0.45) = \binom{12}{y} 0.45^y\, 0.55^{12-y}, \quad y = 0, 1, 2, \ldots, 12,$$

is given in the following table.

2. For the test developed in step 1, find:

- The type II error when $p = 0.60$, $p = 0.80$.
- The power when $p = 0.70$, $p = 0.90$.

Exercise 6. Let $X_1, X_2, X_3, X_4, X_5$ be a random sample from a Poisson distribution with parameter $\lambda$, and let $Y$ be the sample sum.

1. Find the rejection region for a 2% test (or as close as possible) of the null hypothesis $\lambda = 2.0$ versus the alternative hypothesis $\lambda < 2.0$ using $Y$ as test statistic. State the exact size of the test.

Note: For convenience,

$$P(Y = y \text{ when } \lambda = 2.0) = e^{-10}\, \frac{10^y}{y!}, \quad y = 0, 1, \ldots, 26,$$

is given in the following table.

2. For the test developed in step 1, find:

- The type II error when $\lambda = 1.5$, $\lambda = 0.75$.
- The power when $\lambda = 1.0$, $\lambda = 0.5$.

8.2.2 Observed significance level, p value

The observed significance level, or p value,
is the minimum significance level for which the null hypothesis would be rejected. The p value measures the strength of the evidence against the null hypothesis.

Example 7. Let $X_1, X_2, \ldots, X_{12}$ be a random sample from a Bernoulli distribution with success probability $p$, and let $Y$ be the sample sum. Assume we are interested in testing $p = 0.45$ versus $p > 0.45$. If 8 successes are observed, then the p value is

$$P(Y \ge 8 \text{ when } p = 0.45) = \sum_{y=8}^{12} \binom{12}{y} 0.45^y\, 0.55^{12-y} \approx 0.1118.$$

Example 8. Let $X_1, X_2, X_3, X_4, X_5$ be a random sample from a Poisson distribution with parameter $\lambda$, and let $Y$ be the sample sum. Assume we are interested in testing $\lambda = 2.0$ versus $\lambda < 2.0$. If 2 events are observed, then the p value is

$$P(Y \le 2 \text{ when } \lambda = 2.0) = \sum_{y=0}^{2} e^{-10}\, \frac{10^y}{y!} \approx 0.0028.$$

Example 9. Let $X_1, X_2, \ldots, X_{16}$ be a random sample from a normal distribution with mean $\mu$ and standard deviation 10, and let $\bar{X}$ be the sample mean. Assume we are interested in testing $\mu = 85$ versus $\mu \ne 85$.

- If a sample mean of 79.55 is observed, then the p value is $2 P(\bar{X} \le 79.55 \text{ when } \mu = 85) \approx 0.0292575$.
- If a sample mean of 87.23 is observed, then the p value is $2 P(\bar{X} \ge 87.23 \text{ when } \mu = 85) \approx 0.372393$.

For a size $\alpha$ test:

- If the observed significance level is $> \alpha$, then the null hypothesis is accepted.
- If the observed significance level is $\le \alpha$, then the null hypothesis is rejected in favor of the alternative hypothesis.

8.2.3 Power function, uniformly more powerful, UMPT

The power function of the test with decision rule

Reject $\theta \in \omega_0$ in favor of $\theta \in \Omega - \omega_0$ when $T \in RR$

is the function

$$\text{Power}(\theta) = P(T \in RR \text{ when the true parameter is } \theta) \quad \text{for } \theta \in \Omega.$$

For example, if $X_1, X_2, \ldots, X_{16}$ is a random sample from a normal distribution with mean $\mu$ and standard deviation 10, and $\bar{X}$ is the sample mean, then the test with decision rule

Reject $\mu = 85$ in favor of $\mu \ne 85$ when $\bar{X} \le 80.3$ or $\bar{X} \ge 89.7$

has the following power function:

$$\text{Power}(\mu) = 1 - \int_{80.3}^{89.7} \frac{1}{\sqrt{2\pi}\,(2.50)}\, e^{-(x - \mu)^2/12.5}\, dx.$$

The following graph is a plot of $y = \text{Power}(\mu)$.

[Figure: plot of the power function $\text{Power}(\mu)$]

Note that power increases as $\mu$ gets farther from 85 in either direction.

Uniformly more powerful. Consider two $100\alpha\%$
tests of $\theta \in \omega_0$ versus $\theta \in \Omega - \omega_0$, each based on a random sample of size $n$:

1. A test based on statistic $T$ with power function $\text{Power}_T(\theta)$, and
2. A test based on statistic $W$ with power function $\text{Power}_W(\theta)$.

Then $T$ is said to be uniformly more powerful than $W$ if

$$\text{Power}_T(\theta) \ge \text{Power}_W(\theta) \quad \text{for all } \theta \in \Omega - \omega_0,$$

with strict inequality ($>$) for at least one $\theta \in \Omega - \omega_0$.

It is possible that the test based on $T$ is more powerful than the one based on $W$ for some values of $\theta \in \Omega - \omega_0$ and that the test based on $W$ is more powerful than the one based on $T$ for other values of $\theta \in \Omega - \omega_0$. If the test based on $T$ is uniformly more powerful than the one based on $W$, then $T$ has a greater or equal chance of rejecting the false null hypothesis for each model satisfying the alternative hypothesis. Thus, we would prefer to use the test based on $T$.

Uniformly most powerful. The test based on $T$ is the uniformly most powerful test (UMPT) if it is uniformly more powerful than any competitor test.

An interesting and difficult problem in the field of statistics is that of determining when a UMPT exists. We will consider this problem in later sections.

8.3 Example: Normal distribution

Let $\bar{X}$ be the sample mean and $S^2$ be the sample variance of a random sample of size $n$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$.

8.3.1 Tests of $\mu = \mu_0$

If the value of $\sigma^2$ is known, then the standardized mean when $\mu = \mu_0$,

$$Z = \frac{\bar{X} - \mu_0}{\sqrt{\sigma^2/n}},$$

can be used as test statistic. The following table gives the rejection regions for one sided and two sided $100\alpha\%$ tests:

| Alternative Hypothesis | Rejection Region |
|---|---|
| $\mu < \mu_0$ | $Z \le -z(\alpha)$ |
| $\mu > \mu_0$ | $Z \ge z(\alpha)$ |
| $\mu \ne \mu_0$ | $\lvert Z \rvert \ge z(\alpha/2)$ |

where $z(p)$ is the $100(1-p)\%$ point of the standard normal distribution.

These are examples of z tests. A z test is a test based on a statistic with a standard normal distribution under the null hypothesis.

If the value of $\sigma^2$ is estimated from the data, then the approximate standardization of the sample mean when $\mu = \mu_0$,

$$T = \frac{\bar{X} - \mu_0}{\sqrt{S^2/n}},$$

can be used as test statistic. The following table gives the rejection regions for one sided and two sided
$100\alpha\%$ tests:

| Alternative Hypothesis | Rejection Region |
|---|---|
| $\mu < \mu_0$ | $T \le -t_{n-1}(\alpha)$ |
| $\mu > \mu_0$ | $T \ge t_{n-1}(\alpha)$ |
| $\mu \ne \mu_0$ | $\lvert T \rvert \ge t_{n-1}(\alpha/2)$ |

where $t_{n-1}(p)$ is the $100(1-p)\%$ point of the Student t distribution with $n - 1$ degrees of freedom.

These are examples of t tests. A t test is a test based on a statistic with a Student t distribution under the null hypothesis.

8.3.2 Tests of $\sigma^2 = \sigma_0^2$

If the value of $\mu$ is known, then the sum of squared deviations from $\mu$ divided by the hypothesized variance,

$$V = \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\sigma_0^2},$$

can be used as test statistic. The following table gives the rejection regions for one sided and two sided $100\alpha\%$ tests:

| Alternative Hypothesis | Rejection Region |
|---|---|
| $\sigma^2 < \sigma_0^2$ | $V \le \chi^2_n(1 - \alpha)$ |
| $\sigma^2 > \sigma_0^2$ | $V \ge \chi^2_n(\alpha)$ |
| $\sigma^2 \ne \sigma_0^2$ | $V \le \chi^2_n(1 - \alpha/2)$ or $V \ge \chi^2_n(\alpha/2)$ |

where $\chi^2_n(p)$ is the $100(1-p)\%$ point of the chi-square distribution with $n$ df.

If the value of $\mu$ is estimated from the data, then the sum of squared deviations from the sample mean divided by the hypothesized variance,

$$V = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma_0^2},$$

can be used as test statistic. The following table gives the rejection regions for one sided and two sided $100\alpha\%$ tests:

| Alternative Hypothesis | Rejection Region |
|---|---|
| $\sigma^2 < \sigma_0^2$ | $V \le \chi^2_{n-1}(1 - \alpha)$ |
| $\sigma^2 > \sigma_0^2$ | $V \ge \chi^2_{n-1}(\alpha)$ |
| $\sigma^2 \ne \sigma_0^2$ | $V \le \chi^2_{n-1}(1 - \alpha/2)$ or $V \ge \chi^2_{n-1}(\alpha/2)$ |

where $\chi^2_{n-1}(p)$ is the $100(1-p)\%$ point of the chi-square distribution with $n - 1$ df.

These are examples of chi-square tests. A chi-square test is a test based on a statistic with a chi-square distribution under the null hypothesis.

Exercise 10. The mean yield of corn in the United States is about 120 bushels per acre. A survey of 40 farmers this year gives a sample mean yield of 123.8 bushels per acre. We want to know whether this evidence is sufficient to say that the national mean has changed.

Assume the information above summarizes the values of a random sample from a normal distribution with mean $\mu$ and standard deviation 10. Test the null hypothesis $\mu = 120$ versus the two sided alternative $\mu \ne 120$ at the 5% significance level. Report the conclusion and the observed significance level (p value).

Exercise 11 (Hand et al., Chapman & Hall, 1994, p. 229). The table below gives information on a
before-and-after experiment of a standard treatment for anorexia.

| Before | After | After - Before | Before | After | After - Before |
|---|---|---|---|---|---|
| 70.5 | 81.8 | 11.3 | 72.3 | 88.2 | 15.9 |
| 74.0 | 86.3 | 12.3 | 75.1 | 86.7 | 11.6 |
| 77.3 | 77.3 | 0.0 | 77.5 | 81.2 | 3.7 |
| 77.6 | 77.4 | -0.2 | 78.1 | 76.1 | -2.0 |
| 78.1 | 81.4 | 3.3 | 78.4 | 84.6 | 6.2 |
| 79.6 | 81.4 | 1.8 | 79.7 | 73.0 | -6.7 |
| 80.6 | 73.5 | -7.1 | 80.7 | 80.2 | -0.5 |
| 81.3 | 89.6 | 8.3 | 84.1 | 79.5 | -4.6 |
| 84.4 | 84.7 | 0.3 | 85.2 | 84.2 | -1.0 |
| 85.5 | 88.3 | 2.8 | 86.0 | 75.4 | -10.6 |
| 87.3 | 75.1 | -12.2 | 88.3 | 78.1 | -10.2 |
| 88.7 | 79.5 | -9.2 | 89.0 | 78.8 | -10.2 |
| 89.4 | 80.1 | -9.3 | 91.8 | 86.4 | -5.4 |

After - Before summaries: $n = 26$, $\bar{x} = -0.45$, $s^2 = 63.8194$.

Twenty-six young women suffering from anorexia were enrolled in the study. The table gives their weights (in pounds) before treatment began and at the end of the fixed treatment period. We analyze the differences data (after - before).

Assume the differences data are the values of a random sample from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Use these data to test the null hypothesis $\mu = 0$ (the treatment has not affected mean weight) versus the two-sided alternative $\mu \ne 0$ (the treatment has changed the mean, for better or for worse) at the 5% significance level. Report your conclusion and comment on your analysis.

Exercise 12. In recent years, the yield (in bushels per acre) of corn in the United States has been approximately normally distributed with a mean of 120 bushels per acre and with a standard deviation of 10 bushels per acre. In a recent survey of 16 farmers, the following yields were reported:

| | | | | | | | |
|---|---|---|---|---|---|---|---|
| 151.401 | 132.122 | 104.008 | 121.201 | 102.828 | 109.120 | 112.403 | 142.736 |
| 124.752 | 121.322 | 120.030 | 105.243 | 105.567 | 128.974 | 138.625 | 115.289 |

Sample summaries: $n = 16$, $\bar{x} = 120.976$, $s^2 = 217.523$.

Assume these data are the values of a random sample from a normal distribution. Test separately the null hypotheses $\mu = 120$ and $\sigma = 10$. In each case, use a two-sided alternative and the 5% significance level. Comment on the computations.

8.3.3 Some Mathematica commands

The commands MeanTest and VarianceTest can be used to construct $100\alpha\%$ tests for the mean and variance of a normal distribution when both parameters are unknown. For example, the following commands initialize a
data list and return the p value for the two-sided test of $\mu = 105$ versus $\mu \ne 105$ as a Mathematica rule (TwoSidedPValue -> 0.00229669):

    data = {104.5, 97.39, 103.26, 95.56, 104.44, 101.92, 102.12, 99.04, 94.15, 98.66};
    MeanTest[data, 105, TwoSided -> True]

Similarly,

    VarianceTest[data, 28, TwoSided -> True]

returns the p value for the two-sided test of $\sigma^2 = 28$ versus $\sigma^2 \ne 28$ as a Mathematica rule (TwoSidedPValue -> 0.227937).

Given the data list above and using the 5% significance level, we would reject the hypothesis that $\mu = 105$ and accept the hypothesis that $\sigma^2 = 28$.

8.4 Example: Bernoulli/binomial distribution

Let $Y$ be the sample sum of a random sample of size $n$ from a Bernoulli distribution with parameter $p$. $Y$ is a binomial random variable with parameters $n$ and $p$.

8.4.1 Small sample tests of $p = p_0$

Rejection regions for one-sided and two-sided $100\alpha\%$ tests are as follows:

| Alternative Hypothesis | Rejection Region |
|---|---|
| $p < p_0$ | $Y \le c$, where $c$ is chosen so that $\alpha = P(Y \le c \text{ when } p = p_0)$ |
| $p > p_0$ | $Y \ge c$, where $c$ is chosen so that $\alpha = P(Y \ge c \text{ when } p = p_0)$ |
| $p \ne p_0$ | $Y \le c_1$ or $Y \ge c_2$, where $c_1$ and $c_2$ are chosen so that $\alpha = P(Y \le c_1 \text{ when } p = p_0) + P(Y \ge c_2 \text{ when } p = p_0)$ (and the two probabilities are approximately equal) |

8.4.2 Large sample tests of $p = p_0$

The standardized sample sum when $p = p_0$,

$$Z = \frac{Y - n p_0}{\sqrt{n p_0 (1 - p_0)}},$$

can be used as test statistic. Since, by the central limit theorem, the distribution of $Z$ is approximately standard normal when $n$ is large, rejection regions for approximate one-sided and two-sided $100\alpha\%$ tests are as follows:

| Alternative Hypothesis | Rejection Region |
|---|---|
| $p < p_0$ | $Z \le -z(\alpha)$ |
| $p > p_0$ | $Z \ge z(\alpha)$ |
| $p \ne p_0$ | $\lvert Z \rvert \ge z(\alpha/2)$ |

where $z(p)$ is the $100(1-p)\%$ point of the standard normal distribution.

Exercise 13 (Approximate analysis). In a test of $p = 0.45$ versus $p > 0.45$ using 250 independent trials of a Bernoulli experiment, 51.2% (128/250) of the trials ended in success. Use an approximate analysis to determine if the results are significant at the 0.01 level. Clearly state the conclusion and report the observed significance level (p value).

8.5 Example: Poisson distribution

Let $Y$ be the sample sum of a random sample of size $n$ from a
Poisson distribution with parameter $\lambda$. $Y$ is a Poisson random variable with parameter $n\lambda$.

8.5.1 Small sample tests of $\lambda = \lambda_0$

Rejection regions for one-sided and two-sided $100\alpha\%$ tests are as follows:

| Alternative Hypothesis | Rejection Region |
|---|---|
| $\lambda < \lambda_0$ | $Y \le c$, where $c$ is chosen so that $\alpha = P(Y \le c \text{ when } \lambda = \lambda_0)$ |
| $\lambda > \lambda_0$ | $Y \ge c$, where $c$ is chosen so that $\alpha = P(Y \ge c \text{ when } \lambda = \lambda_0)$ |
| $\lambda \ne \lambda_0$ | $Y \le c_1$ or $Y \ge c_2$, where $c_1$ and $c_2$ are chosen so that $\alpha = P(Y \le c_1 \text{ when } \lambda = \lambda_0) + P(Y \ge c_2 \text{ when } \lambda = \lambda_0)$ (and the two probabilities are approximately equal) |

8.5.2 Large sample tests of $\lambda = \lambda_0$

The standardized sample sum when $\lambda = \lambda_0$,

$$Z = \frac{Y - n\lambda_0}{\sqrt{n\lambda_0}},$$

can be used as test statistic. Since, by the central limit theorem, the distribution of $Z$ is approximately standard normal when $n$ is large, rejection regions for approximate one-sided and two-sided $100\alpha\%$ tests are as follows:

| Alternative Hypothesis | Rejection Region |
|---|---|
| $\lambda < \lambda_0$ | $Z \le -z(\alpha)$ |
| $\lambda > \lambda_0$ | $Z \ge z(\alpha)$ |
| $\lambda \ne \lambda_0$ | $\lvert Z \rvert \ge z(\alpha/2)$ |

where $z(p)$ is the $100(1-p)\%$ point of the standard normal distribution.

Exercise 14 (Approximate analysis). You decide to test $\lambda = 2.0$ versus $\lambda \ne 2.0$ using 80 independent observations from a Poisson distribution.

(a) Design an approximate 5% test of $\lambda = 2.0$ vs $\lambda \ne 2.0$.

(b) Using the test from part (a), find the approximate power when $\lambda = 1.60$.

(c) The results are now in: a total of 189 events were recorded. Is the null hypothesis accepted or rejected at the 5% level? What is the observed significance level (p value)?

8.6 Likelihood ratio tests

Likelihood ratio tests were introduced by J. Neyman and E. Pearson in the 1930s. In many practical situations, likelihood ratio tests are uniformly most powerful. In situations where no uniformly most powerful test (UMPT) exists, likelihood ratio tests are popular choices because they have good statistical properties.

8.6.1 Likelihood ratio statistic; Neyman-Pearson lemma

Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with parameter $\theta$, and let $\mathrm{Lik}(\theta)$ be the likelihood function based on this sample. Consider testing the null hypothesis $\theta = \theta_0$ versus the alternative hypothesis $\theta = \theta_a$, where $\theta_0$ and $\theta_a$ are constants. The
likelihood ratio statistic $\Lambda$ is the ratio of the likelihood functions,

$$\Lambda = \frac{\mathrm{Lik}(\theta_0)}{\mathrm{Lik}(\theta_a)}.$$

A likelihood ratio test based on $\Lambda$ is a test whose decision rule has the following form:

Reject $\theta = \theta_0$ in favor of $\theta = \theta_a$ when $\Lambda \le c$.

Note that if the null hypothesis is true, then the value of the likelihood function in the numerator will tend to be larger than the value in the denominator. If the alternative hypothesis is true, then the value in the denominator will tend to be larger than the value in the numerator. Thus, it is reasonable to "reject when $\Lambda$ is small."

The following theorem, proven by Neyman and Pearson, states that the likelihood ratio test is a uniformly most powerful test for a simple null hypothesis versus a simple alternative hypothesis.

Theorem 15 (Neyman-Pearson Lemma). Given the situation above, if $c$ is chosen so that $P(\Lambda \le c \text{ when } \theta = \theta_0) = \alpha$, then the test with decision rule

Reject $\theta = \theta_0$ in favor of $\theta = \theta_a$ when $\Lambda \le c$

is a uniformly most powerful test of size $\alpha$.

In general, a likelihood ratio test is not implemented as shown above. Instead, an equivalent test with a simpler statistic and rejection region is used.

Example 16 (Bernoulli distribution, upper tail). Let $Y$ be the sample sum of a random sample of size $n$ from a Bernoulli distribution, and consider testing $p = p_0$ versus $p = p_a$, where $p_a > p_0$. The computations below demonstrate that the test with decision rule

Reject $p = p_0$ in favor of $p = p_a$ when $\Lambda \le c$

is equivalent to the test with decision rule

Reject $p = p_0$ in favor of $p = p_a$ when $Y \ge k$,

where $c$ and $k$ satisfy $P(\Lambda \le c \text{ when } p = p_0) = P(Y \ge k \text{ when } p = p_0) = \alpha$ for some $\alpha$.

1. Since $\mathrm{Lik}(p) = p^{Y}(1-p)^{n-Y}$, the likelihood ratio statistic is

$$\Lambda = \frac{\mathrm{Lik}(p_0)}{\mathrm{Lik}(p_a)} = \frac{p_0^{Y}(1-p_0)^{n-Y}}{p_a^{Y}(1-p_a)^{n-Y}} = \left(\frac{p_0}{p_a}\right)^{Y}\left(\frac{1-p_0}{1-p_a}\right)^{n-Y}.$$

2. Then

$$\begin{aligned}
\Lambda \le c &\iff \log\Lambda \le \log c \\
&\iff Y\log\left(\frac{p_0}{p_a}\right) + (n-Y)\log\left(\frac{1-p_0}{1-p_a}\right) \le \log c \\
&\iff Y\left[\log\left(\frac{p_0}{p_a}\right) - \log\left(\frac{1-p_0}{1-p_a}\right)\right] \le \log c - n\log\left(\frac{1-p_0}{1-p_a}\right) \\
&\iff Y\log\left(\frac{p_0(1-p_a)}{p_a(1-p_0)}\right) \le \log c - n\log\left(\frac{1-p_0}{1-p_a}\right) \\
&\iff Y \ge k, \quad \text{where } k = \left[\log c - n\log\left(\frac{1-p_0}{1-p_a}\right)\right] \Big/ \log\left(\frac{p_0(1-p_a)}{p_a(1-p_0)}\right).
\end{aligned}$$

The inequality switches since $p_a > p_0$ implies that the ratio $\frac{p_0(1-p_a)}{p_a(1-p_0)}$ is less than 1, and its logarithm is a negative number.

Thus, by
the Neyman-Pearson lemma, the test based on $Y$ is uniformly most powerful.

In fact, we can now conclude that the test with decision rule

Reject $p = p_0$ in favor of $p > p_0$ when $Y \ge k$

is a uniformly most powerful size $\alpha$ test of the null hypothesis $p = p_0$ versus the one-sided alternative $p > p_0$, where $k$ is chosen so that $P(Y \ge k \text{ when } p = p_0) = \alpha$. The specific value of $p_a > p_0$ is needed only in the $\Lambda$ form of the test.

Exercise 17 (Bernoulli distribution, lower tail). Let $Y$ be the sample sum of a random sample of size $n$ from a Bernoulli distribution. Use the Neyman-Pearson lemma (and the example above) to demonstrate that the test with decision rule

Reject $p = p_0$ in favor of $p < p_0$ when $Y \le k$

is a uniformly most powerful size $\alpha$ test of the null hypothesis $p = p_0$ versus the one-sided alternative $p < p_0$, where $k$ is chosen so that $P(Y \le k \text{ when } p = p_0) = \alpha$.

Exercise 18 (Exponential distribution, upper tail). Let $Y$ be the sample sum of a random sample of size $n$ from an exponential distribution with parameter $\theta$ and PDF

$$f(x) = \frac{1}{\theta}\, e^{-x/\theta} \quad \text{when } x > 0, \text{ and } 0 \text{ otherwise}.$$

Use the Neyman-Pearson lemma to demonstrate that the test with decision rule

Reject $\theta = \theta_0$ in favor of $\theta > \theta_0$ when $Y \ge k$

is a uniformly most powerful size $\alpha$ test of the null hypothesis $\theta = \theta_0$ versus the one-sided alternative $\theta > \theta_0$, where $k$ is chosen so that $P(Y \ge k \text{ when } \theta = \theta_0) = \alpha$.

8.6.2 Generalized likelihood ratio tests

The methods in this section generalize the approach above to compound hypotheses and to multiple parameter families. Generalized likelihood ratio tests are not guaranteed to be uniformly most powerful. In fact, in many situations (e.g., two-tailed tests) uniformly most powerful tests do not exist.

Let $X$ be a distribution with parameter $\theta$, where $\theta$ is a single parameter or a $k$-tuple of parameters. Assume that the null and alternative hypotheses can be stated in terms of values of $\theta$ as follows:

$$H_0\colon \theta \in \omega_0 \quad \text{versus} \quad H_a\colon \theta \in \Omega - \omega_0,$$

where $\Omega$ represents the full set of parameter values under consideration and $\omega_0 \subset \Omega$.

For example, if $X$ is a normal random variable with
unknown mean and variance, the null hypothesis is $\mu = 120$, and the alternative hypothesis is $\mu \ne 120$, then $\theta = (\mu, \sigma^2)$,

$$\Omega = \{(\mu, \sigma^2) : -\infty < \mu < \infty,\ \sigma^2 > 0\} \quad \text{and} \quad \omega_0 = \{(120, \sigma^2) : \sigma^2 > 0\}.$$

Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with parameter $\theta$, and let $\mathrm{Lik}(\theta)$ be the likelihood function based on this sample. The generalized likelihood ratio statistic $\Lambda$ is the ratio of the maximum value of the likelihood function for models satisfying the null hypothesis to the maximum value of the likelihood function for all models under consideration,

$$\Lambda = \frac{\max_{\theta \in \omega_0} \mathrm{Lik}(\theta)}{\max_{\theta \in \Omega} \mathrm{Lik}(\theta)},$$

and a generalized likelihood ratio test based on this statistic is a test whose decision rule has the following form:

Reject $\theta \in \omega_0$ in favor of $\theta \in \Omega - \omega_0$ when $\Lambda \le c$.

Note that the value in the denominator of $\Lambda$ is the value of the likelihood at the maximum likelihood estimator, and the value in the numerator is less than or equal to the value in the denominator. Thus, $\Lambda \le 1$. Further, if the null hypothesis is true, then the numerator and denominator values will be close (and $\Lambda$ will be close to 1); otherwise, the numerator is likely to be much smaller than the denominator (and $\Lambda$ will be close to 0). Thus, it is reasonable to "reject when $\Lambda$ is small."

In general, a likelihood ratio test is not implemented as shown above. Instead, an equivalent test with a simpler statistic and rejection region is used. We often drop the word "generalized" and call these tests likelihood ratio tests (LRTs).

Example 19. Let $\bar{X}$ be the sample mean of a random sample of size $n$ from a normal distribution with known standard deviation $\sigma$. Consider testing $\mu = \mu_0$ versus $\mu \ne \mu_0$ using a generalized likelihood ratio test. Then

$$\Lambda = \frac{\max_{\theta \in \omega_0} \mathrm{Lik}(\theta)}{\max_{\theta \in \Omega} \mathrm{Lik}(\theta)} = \frac{\mathrm{Lik}(\mu_0)}{\mathrm{Lik}(\bar{X})},$$

since $\theta = \mu$, $\Omega = \{\mu : -\infty < \mu < \infty\}$, $\omega_0 = \{\mu_0\}$, and $\bar{X}$ is the ML estimator of $\mu$. Further, since

$$\Lambda \le c \iff \cdots \iff \left\lvert \frac{\bar{X} - \mu_0}{\sqrt{\sigma^2/n}} \right\rvert \ge k$$

for appropriately chosen $c$ and $k$, the test of $\mu = \mu_0$ versus the two-sided alternative $\mu \ne \mu_0$ we used earlier is a generalized likelihood ratio test.

Example 20. Let $\bar{X}$ be the sample mean and $S^2$ be the sample variance of a
random sample of size $n$ from a normal distribution with both parameters unknown. Consider testing $\mu = \mu_0$ versus $\mu \ne \mu_0$ using a generalized likelihood ratio test. Then

$$\Lambda = \frac{\max_{\theta \in \omega_0} \mathrm{Lik}(\theta)}{\max_{\theta \in \Omega} \mathrm{Lik}(\theta)} = \frac{\mathrm{Lik}\left(\mu_0,\ \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_0)^2\right)}{\mathrm{Lik}\left(\bar{X},\ \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\right)},$$

since $\theta = (\mu, \sigma^2)$, $\Omega = \{(\mu, \sigma^2) : -\infty < \mu < \infty,\ \sigma^2 > 0\}$, $\omega_0 = \{(\mu_0, \sigma^2) : \sigma^2 > 0\}$, and the ML estimators are as shown above. Further, since

$$\Lambda \le c \iff \cdots \iff \left\lvert \frac{\bar{X} - \mu_0}{\sqrt{S^2/n}} \right\rvert \ge k$$

for appropriately chosen $c$ and $k$, the test of $\mu = \mu_0$ versus the two-sided alternative $\mu \ne \mu_0$ we used earlier is a generalized likelihood ratio test.

Complete demonstrations of the equivalences above are long and tedious. Similar methods could be used to show that other two-sided tests mentioned earlier are generalized likelihood ratio tests (or approximate generalized likelihood ratio tests).

8.6.3 Large sample theory; approximate tests

In many situations, the exact distribution of $\Lambda$ (or an equivalent form) is not known. The theorem below, proven by S. S. Wilks in the 1930s, provides a useful large sample approximation to the distribution of $-2\log\Lambda$ (where log is the natural logarithm function) under the smoothness conditions of the Cramér-Rao lower bound and Fisher's theorems.

Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with parameter $\theta$, and let $\mathrm{Lik}(\theta)$ be the likelihood function based on this sample. Consider testing the null hypothesis $\theta \in \omega_0$ versus the alternative hypothesis $\theta \in \Omega - \omega_0$ using the likelihood ratio test.

Theorem 21 (Wilks Theorem). Given the situation above, under smoothness conditions on the $X$ distribution and when $n$ is large, the distribution of $-2\log\Lambda$ is approximately chi-square with $r - r_0$ degrees of freedom, where $r$ is the number of free parameters in $\Omega$, $r_0$ is the number of free parameters in $\omega_0$, and log is the natural logarithm function.

Approximate likelihood ratio tests. Under the conditions of Wilks Theorem, an approximate $100\alpha\%$ test of $\theta \in \omega_0$ versus $\theta \in \Omega - \omega_0$ has the following decision rule:

Reject $\theta \in \omega_0$ in favor of $\theta \in \Omega - \omega_0$ when $-2\log\Lambda \ge \chi^2_{r-r_0}(\alpha)$,

where $\chi^2_{r-r_0}(\alpha)$ is the $100(1-\alpha)\%$ point on the chi-square distribution with $r - r_0$ degrees
of freedom, $r$ is the number of free parameters in $\Omega$, and $r_0$ is the number of free parameters in $\omega_0$. Rejecting when $\Lambda$ is small is equivalent to rejecting when $-2\log\Lambda$ is large.

8.6.4 Example: comparing Bernoulli parameters

Let $Y_i$ be the sample sum of a random sample of size $n_i$ from a Bernoulli distribution with parameter $p_i$, for $i = 1, 2, \ldots, k$. Consider testing the null hypothesis that the $k$ Bernoulli parameters are equal versus the alternative that not all parameters are equal. Under the null hypothesis, the combined sample is a random sample from a Bernoulli distribution with parameter $p = p_1 = p_2 = \cdots = p_k$. The parameter sets for this test are

$$\Omega = \{(p_1, p_2, \ldots, p_k) : 0 \le p_1, p_2, \ldots, p_k \le 1\} \quad \text{and} \quad \omega_0 = \{(p, p, \ldots, p) : 0 \le p \le 1\}.$$

There are $k$ free parameters in $\Omega$ and 1 free parameter in $\omega_0$. The statistic $-2\log\Lambda$ simplifies to

$$-2\log\Lambda = \sum_{i=1}^{k}\left[\, 2\,Y_i \log\left(\frac{Y_i}{n_i \hat{p}}\right) + 2\,(n_i - Y_i)\log\left(\frac{n_i - Y_i}{n_i(1-\hat{p})}\right) \right],$$

where $\hat{p}$ is the estimate of the common parameter under the null hypothesis,

$$\hat{p} = \frac{Y_1 + Y_2 + \cdots + Y_k}{n_1 + n_2 + \cdots + n_k},$$

and log is the natural logarithm function. If each $n_i$ is large, then $-2\log\Lambda$ has an approximate chi-square distribution with $k - 1$ df.

Example 22 (Hand et al., Chapman & Hall, 1994, p. 237). As part of a study on depression in adolescents (ages 12 through 18), researchers collected information on 465 individuals (both men and women) who were seriously emotionally disturbed (SED) or learning disabled (LD). The following table summarizes one aspect of the study:

| Group | Severely Depressed | Individuals | % Severely Depressed |
|---|---|---|---|
| 1. LD/Male | 41 | 219 | 18.72 |
| 2. LD/Female | 26 | 102 | 25.49 |
| 3. SED/Male | 13 | 95 | 13.68 |
| 4. SED/Female | 17 | 49 | 34.69 |

To determine if the level of severe depression is the same in the four groups, we conduct a test of equality of parameters $p_1 = p_2 = p_3 = p_4$, where $p_i$ equals the probability that an adolescent in the $i$th group is severely depressed, using the generalized likelihood ratio test and the 5% significance level.

1. The estimate of the common proportion is $\hat{p} = 97/465 = 0.2086$.

2. The following table lists the components of $-2\log\Lambda$:

| Group | Component of $-2\log\Lambda$ |
|---|---|
| 1. LD/Male | 0.623 |
| 2. LD/Female | 1.260 |
| 3. SED/Male | 3.273 |
| 4. SED/Female | 5.000 |

The sum of the components is
10.156.

3. The rejection region for the test is $-2\log\Lambda \ge \chi^2_3(0.05) = 7.815$.

Since the observed value of the test statistic is in the rejection region, we reject the hypothesis of equal probabilities in favor of the alternative that not all probabilities are equal.

- The p value for the example above is $P(-2\log\Lambda \ge 10.156) \approx 0.0173$.
- Any comments on the analysis above?

8.6.5 Example: comparing Poisson parameters

Let $Y_i$ be the sample sum of a random sample of size $n_i$ from a Poisson distribution with parameter $\lambda_i$, for $i = 1, 2, \ldots, k$. Consider testing the null hypothesis that the $k$ Poisson parameters are equal versus the alternative that not all parameters are equal. Under the null hypothesis, the combined sample is a random sample from a Poisson distribution with parameter $\lambda = \lambda_1 = \lambda_2 = \cdots = \lambda_k$. The parameter sets for this test are

$$\Omega = \{(\lambda_1, \lambda_2, \ldots, \lambda_k) : \lambda_1, \lambda_2, \ldots, \lambda_k \ge 0\} \quad \text{and} \quad \omega_0 = \{(\lambda, \lambda, \ldots, \lambda) : \lambda \ge 0\}.$$

There are $k$ free parameters in $\Omega$ and 1 free parameter in $\omega_0$. The statistic $-2\log\Lambda$ simplifies to

$$-2\log\Lambda = \sum_{i=1}^{k} 2\,Y_i \log\left(\frac{Y_i}{n_i \hat{\lambda}}\right),$$

where $\hat{\lambda}$ is the estimate of the common parameter under the null hypothesis,

$$\hat{\lambda} = \frac{Y_1 + Y_2 + \cdots + Y_k}{n_1 + n_2 + \cdots + n_k},$$

and log is the natural logarithm function. If each mean ($n_i\lambda_i$) is large, then $-2\log\Lambda$ has an approximate chi-square distribution with $k - 1$ df.

Example 23 (Waller et al., in Lange et al., Wiley, 1994, pp. 3-24). As part of a study of incidence of childhood leukemia in upstate New York, data were collected on the number of children contracting the disease in the five-year period from 1978 to 1982. The following table summarizes the results using geographic regions of equal total population (175,000 people per area):

| | Region 1 | Region 2 | Region 3 | Region 4 | Region 5 | Region 6 |
|---|---|---|---|---|---|---|
| Cases | 89 | 86 | 97 | 96 | 120 | 102 |

The geographic regions run from west to east. To determine if the incidence of leukemia is the same in each geographic area, we conduct a test of the equality of rates $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = \lambda_5 = \lambda_6$, where $\lambda_i$ equals the average number of childhood leukemia patients in region $i$ for a five-year period, using the generalized likelihood ratio test (with $n_i = 1$ for all $i$) and the 5% significance level.

1. The
estimated common rate is $\hat{\lambda} =$ ____ .

2. The following table lists the components of $-2\log\Lambda$:

| | Region 1 | Region 2 | Region 3 | Region 4 | Region 5 | Region 6 |
|---|---|---|---|---|---|---|
| Component of $-2\log\Lambda$ | | | | | | |

The sum of the components is ____ .

3. The rejection region for the test is $-2\log\Lambda \ge \chi^2_5(0.05) = 11.07$.

Since (please complete)

8.6.6 Example: multinomial goodness-of-fit

Assume that $(X_1, X_2, \ldots, X_k)$ has a multinomial distribution with parameters $n$ and $(p_1, p_2, \ldots, p_k)$. Consider testing the null hypothesis $p_i = p_{i0}$ for $i = 1, 2, \ldots, k$ versus the alternative that the given model for probabilities does not hold. Then

$$\Omega = \left\{(p_1, p_2, \ldots, p_k) : 0 \le p_1, p_2, \ldots, p_k \le 1,\ \sum_{i=1}^{k} p_i = 1\right\}$$

and $r = k - 1$. There are two cases to consider.

Case 1. If the model for probabilities is known, then $\omega_0$ contains a single $k$-tuple and has 0 free parameters. The statistic $-2\log\Lambda$ simplifies to

$$-2\log\Lambda = 2\sum_{i=1}^{k} X_i \log\left(\frac{X_i}{n\,p_{i0}}\right),$$

where log is the natural logarithm function. If $n$ is large, then $-2\log\Lambda$ has an approximate chi-square distribution with $k - 1$ df.

Case 2. If $e$ parameters need to be estimated, then $\omega_0$ has $e$ free parameters. The statistic $-2\log\Lambda$ simplifies to

$$-2\log\Lambda = 2\sum_{i=1}^{k} X_i \log\left(\frac{X_i}{n\,\hat{p}_{i0}}\right),$$

where log is the natural logarithm function and $\hat{p}_{i0}$ is the estimated value of $p_{i0}$ for $i = 1, 2, \ldots, k$. If $n$ is large, then $-2\log\Lambda$ has an approximate chi-square distribution with $k - 1 - e$ df.

Example 24 (Plackett, Macmillan, 1974, p. 134). In a study of spatial dispersion of houses in a Japanese village, the number of homes in each of 1200 squares of side 100 meters was recorded. There were a total of 911 homes. The following table gives the numbers of squares with 0, 1, 2, and 3 or more homes:

| Number of homes | 0 | 1 | 2 | 3+ |
|---|---|---|---|---|
| Frequency | 584 | 398 | 168 | 50 |

To determine if a Poisson model is reasonable for these data, we conduct a generalized likelihood ratio test at the 5% significance level.

1. The estimated parameter of the Poisson model is $911/1200 = 0.7592$, and the estimated probabilities for the multinomial model (a grouped Poisson model) are:

| Number of homes | 0 | 1 | 2 | 3+ |
|---|---|---|---|---|
| Estimated probability | 0.468 | 0.355 | 0.135 | 0.042 |

2. The following table lists the components of $-2\log\Lambda$:

| Number of homes | 0 | 1 | 2 | 3+ |
|---|---|---|---|---|
| Component of $-2\log\Lambda$ | | | | |

The sum of the components is ____ .

3. The rejection region for the test is $-2\log\Lambda \ge \chi^2_2(0.05) = 5.99$.

Since (please complete)

As a final exercise, we will demonstrate that Pearson's goodness-of-fit statistic,

$$X^2 = \sum_{i=1}^{k} \frac{(X_i - n\,\hat{p}_{i0})^2}{n\,\hat{p}_{i0}},$$

is a 2nd order Taylor approximation to the $-2\log\Lambda$ statistic for multinomial goodness-of-fit. In large samples, the values of the two statistics are very close. Thus, Pearson's test is an approximate generalized likelihood ratio test.

Recall that a 2nd order Taylor approximation to $f(x)$ at $a$ is given by the expression on the right:

$$f(x) \approx f(a) + f'(a)(x - a) + \frac{f''(a)}{2}(x - a)^2.$$

Exercise 25. To demonstrate that Pearson's statistic is a 2nd order Taylor approximation to the $-2\log\Lambda$ statistic for multinomial goodness-of-fit:

1. Let $a$ be a positive constant. Use calculus to demonstrate that

$$f(x) = 2x\ln\left(\frac{x}{a}\right) \approx 2(x - a) + \frac{(x - a)^2}{a}.$$

2. Use the result of step 1 to demonstrate that $-2\log\Lambda \approx X^2$.
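
The closeness of the two statistics claimed above can also be checked numerically. The following is a minimal Python sketch, not part of the original notes: it evaluates both $-2\log\Lambda$ and Pearson's $X^2$ on the frequencies and the (rounded) estimated probabilities quoted in Example 24. The function names `lrt_statistic` and `pearson_statistic` are illustrative choices, and because the probabilities are rounded to three places, the values may differ slightly from a hand computation with unrounded probabilities.

```python
import math

# Observed frequencies from Example 24: squares with 0, 1, 2, and 3+ homes.
observed = [584, 398, 168, 50]
# Estimated grouped-Poisson probabilities as quoted in Example 24 (rounded).
probs = [0.468, 0.355, 0.135, 0.042]


def lrt_statistic(counts, p):
    """-2 log(Lambda) = 2 * sum of X_i * log(X_i / (n * p_i)).

    Cells with a zero count are skipped, since x * log(x) tends to 0.
    """
    n = sum(counts)
    return 2.0 * sum(
        x * math.log(x / (n * pi)) for x, pi in zip(counts, p) if x > 0
    )


def pearson_statistic(counts, p):
    """Pearson's X^2 = sum of (X_i - n * p_i)^2 / (n * p_i)."""
    n = sum(counts)
    return sum((x - n * pi) ** 2 / (n * pi) for x, pi in zip(counts, p))


g = lrt_statistic(observed, probs)
x2 = pearson_statistic(observed, probs)
print(f"-2 log Lambda = {g:.3f}")
print(f"Pearson X^2   = {x2:.3f}")
print(f"difference    = {g - x2:.3f}")
```

With 1200 observations, the two printed values should agree closely, which is what the Taylor argument in Exercise 25 predicts when each $X_i$ is near its expected count.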
