# APP NONPARAMETRIC STAT MATH 324

JMU

These 342 pages of class notes were uploaded by Eunice Schoen on Saturday, September 26, 2015. The notes belong to MATH 324 at James Madison University, taught by Steven Garren in the fall. Since their upload, they have received 29 views.

Date Created: 09/26/15

Section 5.1: A Permutation Test for Correlation and Slope (January 8, 2009)

## 5 Tests for Trends and Associations

### 5.1 A Permutation Test for Correlation and Slope

Sample data pairs $(X_i, Y_i)$ for $i = 1, \ldots, n$.

The population correlation coefficient is defined by

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \quad \text{where } \sigma_{xy} = E[(X - \mu_x)(Y - \mu_y)].$$

$\rho$ measures the strength of the linear relationship between two variables.

The Pearson sample correlation coefficient is defined by

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}.$$

Examples (interpreting $r$): For the following models, generate 100 pairs of observations, determine the sample correlation coefficient, and plot the data.

(a) $X \sim N(0, 1)$, $Y = X + \varepsilon$, where the $\varepsilon$ are independent $N(0, 0.5)$.
(b) $X \sim N(0, 1)$, $Y = -X + \varepsilon$, where the $\varepsilon$ are independent $N(0, 0.5)$.
(c) $X \sim N(20, 1)$, $Y = X + \varepsilon$, where the $\varepsilon$ are independent $N(30, 0.5)$.
(d) $X \sim N(0, 1)$, $Y = X + \varepsilon$, where the $\varepsilon$ are independent $N(0, 0.2)$.
(e) $X \sim N(0, 1)$, $Y = 5X + 37$.
(f) $X \sim N(0, 1)$, $Y = -5X + 37$.
(g) $X \sim N(20, 5)$, $Y = X + \varepsilon$, where the $\varepsilon$ are independent $N(60, 5)$. How much variability in $Y$ can be explained by the linear relationship between $X$ and $Y$?
(h) $X \sim N(20, 5)$, $Y = \varepsilon$, where the $\varepsilon$ are independent $N(60, 5)$.
(i) $X \sim \text{Unif}(10, 30)$, $Y = 5(X - 20)^2 + \varepsilon$, where the $\varepsilon$ are independent $N(900, 25)$. □

Problem 5.1.1 (crying babies and IQ scores): This data set is from Karelitz et al. (1964), Child Development. To test whether children who cry more actively as babies later tend to have higher IQs, a cry count was taken for a sample of 38 children aged five days and was later compared with their Stanford-Binet IQ scores at age three, with the results shown below.

| Cry count | 10 | 20 | 17 | 12 | 12 | 15 | 19 | 12 | 14 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|
| IQ score | 87 | 90 | 94 | 94 | 97 | 100 | 103 | 103 | 103 | 103 |

| Cry count | 15 | 14 | 13 | 27 | 17 | 12 | 18 | 15 | 15 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|
| IQ score | 104 | 106 | 106 | 108 | 109 | 109 | 109 | 112 | 112 | 113 |

| Cry count | 16 | 21 | 16 | 12 | 9 | 13 | 19 | 18 | 19 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|
| IQ score | 114 | 114 | 118 | 119 | 119 | 120 | 120 | 124 | 132 | 133 |

| Cry count | 22 | 31 | 16 | 17 | 26 | 21 | 27 | 13 |
|---|---|---|---|---|---|---|---|---|
| IQ score | 135 | 135 | 136 | 141 | 155 | 157 | 159 | 162 |

(a) Plot the data.

    > z = scan2( "problem511", T, T )

(b) Determine the sample correlation coefficient.
(c) What is an interesting hypothesis test for this data set? □

Define

$$t_{\text{corr}} = r \sqrt{\frac{n - 2}{1 - r^2}} \quad \text{(need not memorize)}.$$

If $\rho = 0$ and if the $(x, y)$ are based on a simple random sample from a bivariate normal distribution, then $t_{\text{corr}}$ is $t$-distributed with $n - 2$ degrees of freedom.

Revisit Problem 5.1.1 (crying babies and IQ scores; Karelitz et al., 1964; cry counts at age five days and Stanford-Binet IQ scores at age three for 38 children).

(a) State the null and alternative hypotheses.
(b) Determine the value of the standardized test statistic for the parametric test using hand calculations.
(c) Determine the asymptotic p-value of this test using hand calculations, and state the conclusion.
(d) Plot the asymptotic distribution of your standardized test statistic under $H_0$, and shade in the appropriate region corresponding to the p-value.
(e) Determine the asymptotic p-value of this test using cor.test.
(f) What is the 95% lower confidence bound on $\rho$?
(g) Does association imply causation? □

#### 5.1.2 Slope of the Least Squares Line

A simple linear regression model is defined by

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,$$

where the $\varepsilon_i$ are independent and identically distributed random variables with mean zero and finite variance. However, if the $\varepsilon_i$ are normal, then $X$ and $Y$ are bivariate normal.

We define the least squares estimates of $\beta_0$ and $\beta_1$ by

$$\hat{\beta}_1 = r \, \frac{s_y}{s_x} \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \quad \text{(memorize these two formulas)}.$$

The least squares line, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, is the line which minimizes the sum of the squares of the vertical distances between the observations and the line.

For large $n$, what are the asymptotic values in the formula $\hat{\beta}_1 = r \, s_y / s_x$?

Revisit Problem 5.1.1 (crying babies and IQ scores).

(a) Which variable should be $X$ and which should be $Y$?
(b) Determine the least squares equation using hand calculations.
(c) Plot the least squares line on the scatterplot.
(d) Predict the child's IQ at age three years if the cry count at age five days was 25.
(e) Determine the least squares equation using the macro lm.
(f) Verify that the least squares equation goes through the sample means.
(g) Regarding the simple linear regression model, what is an interesting hypothesis test? □

#### 5.1.3 The Permutation Test

Specifically, this permutation correlation test is for nonzero, negative, or positive Pearson correlation or slope.

If the simple linear regression model holds and the null hypothesis of $\beta_1 = 0$ holds, what can we say about $X$ and $Y$?

How should a permutation test be performed?

Problem 5.1.2: In the data set below, test for negative Pearson correlation.

| $x$ | 11 | 15 | 21 |
|---|---|---|---|
| $y$ | 139 | 143 | 87 |

(a) State the null and alternative hypotheses.
(b) Determine the permutation distribution of $\hat{\beta}_1$ and $r$.

With $X_1 = 11$, $X_2 = 15$, $X_3 = 21$ held fixed:

| $Y_1$ | $Y_2$ | $Y_3$ | $\hat{\beta}_1$ | $r$ |
|---|---|---|---|---|
| 87 | 139 | 143 | 5.21 | 0.839 |
| 87 | 143 | 139 | 4.74 | 0.763 |
| 139 | 87 | 143 | 1.11 | 0.178 |
| 139 | 143 | 87 | -5.53 | -0.890 |
| 143 | 87 | 139 | 0.316 | 0.051 |
| 143 | 139 | 87 | -5.84 | -0.941 |

| $\hat{\beta}_1$ | $r$ | probability |
|---|---|---|
| -5.84 | -0.941 | 1/6 |
| -5.53 | -0.890 | 1/6 |
| 0.316 | 0.051 | 1/6 |
| 1.11 | 0.178 | 1/6 |
| 4.74 | 0.763 | 1/6 |
| 5.21 | 0.839 | 1/6 |
| | | sum = 1 |

(c) Determine the p-value of the permutation correlation test using hand calculations.
(d) What would be the p-value for a two-sided test?
(e) Obtain the simulated p-value using the macro permcortest.
(f) What assumptions were needed to perform this test from part (e)?
(g) Using the one-sided hypothesis test and the parametric approach, determine the p-value.
(h) What assumptions were needed to perform this test from part (g)?
(i) Are the assumptions needed to perform this test reasonable? □

In general, for $n$ data pairs, how many permutations (groupings) exist when performing the permutation correlation test?

#### 5.1.4 Large-Sample Approximation for the Permutation Distribution of $r$

Suppose $n$ independent pairs of observations $(X, Y)$ are sampled from distributions with finite variances, such that $X$ and $Y$ are independent.

What is $\rho$, the population Pearson correlation coefficient?

What is $r$, the sample Pearson correlation coefficient?

Consider the permutation distribution of $r$. What is $E(r)$, where the expectation is taken conditionally on the permutation distribution?

$\operatorname{var}(r) = 1/(n - 1)$, conditional on the permutation distribution.

Example: Sample $n$ independent pairs of observations $(X, Y)$ from distributions with finite variances, such that $X$ and $Y$ are independent, where $n$ is large.

(a) Determine $r_n$, a fixed function of $n$, such that there is approximately a 95% chance that $r$, the sample correlation coefficient, will be between $-r_n$ and $r_n$.
(b) Set $n = 100$ in part (a).
(c) Set $n = 400$ in part (a).
(d) Set $n = 1000$ in part (a).
(e) Set $n = 10{,}000$ in part (a). □

### 5.2 Spearman Rank Correlation

Problem 5.2.1: Consider the following data set.

| $X$ | 0 | 2 | 3 | 4 | 6 |
|---|---|---|---|---|---|
| $Y$ | 0 | 16 | 81 | 256 | 1296 |

(a) Determine the Pearson correlation coefficient.

    > x = c( 0, 2, 3, 4, 6 )
    > y = c( 0, 16, 81, 256, 1296 )
    > cor( x, y )

(b) Is $r$, the Pearson correlation coefficient, a reasonable measure?
(c) Compare the ranks of $X$ with the ranks of $Y$. □

#### 5.2.1 Statistical Test for Spearman Rank Correlation

The Spearman rank correlation, $r_s$, is the Pearson correlation coefficient applied to ranks. Thus, the Spearman rank correlation is not heavily influenced by outliers.

The Spearman rank correlation measures the association between two variables.
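The statement that the Spearman rank correlation is the Pearson correlation applied to ranks, and the idea of enumerating every permutation from Section 5.1.3, can be sketched in base R as follows. This is only an illustration under stated assumptions: `all_perms` and `perm_cor_pvalue` are hypothetical helper names written for this sketch, not the course macros, and the p-value shown is the lower-tailed one (a test for negative correlation).

```r
# Problem 5.2.1 data: y = x^4, a monotone but nonlinear relationship.
x <- c(0, 2, 3, 4, 6)
y <- c(0, 16, 81, 256, 1296)

r_pearson  <- cor(x, y)                       # Pearson r on the raw data
r_on_ranks <- cor(rank(x), rank(y))           # Pearson r applied to the ranks
r_spearman <- cor(x, y, method = "spearman")  # built-in Spearman correlation

# Hypothetical helper: list all permutations of a vector by recursion.
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in all_perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Hypothetical helper: exact lower-tailed permutation p-value, i.e. the
# proportion of permutations of y whose correlation with x is at most
# the observed correlation. Only practical for small n (n! permutations).
perm_cor_pvalue <- function(x, y, method = "pearson") {
  r_obs  <- cor(x, y, method = method)
  r_perm <- sapply(all_perms(y), function(yp) cor(x, yp, method = method))
  mean(r_perm <= r_obs + 1e-12)  # small tolerance for floating-point ties
}

# Problem 5.1.2 data (three pairs, so 3! = 6 permutations).
x2 <- c(11, 15, 21)
y2 <- c(139, 143, 87)
p_pearson  <- perm_cor_pvalue(x2, y2, method = "pearson")
p_spearman <- perm_cor_pvalue(x2, y2, method = "spearman")
```

With only six permutations the resulting p-values are coarse multiples of 1/6, which is one reason simulated or large-sample p-values are used for realistic sample sizes.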
Revisit Problem 5.2.1.

(a) Determine the Spearman rank correlation for this data set.

    > x = c( 0, 2, 3, 4, 6 )
    > y = c( 0, 16, 81, 256, 1296 )

(b) Test if the association between the two variables is positive, using the Spearman rank correlation and hand calculations.
(c) Test if the population Spearman rank correlation is positive, using the macro cor.test. □

Revisit Problem 5.1.2: In the data set below, use the Spearman rank correlation.

| $x$ | 11 | 15 | 21 |
|---|---|---|---|
| $y$ | 139 | 143 | 87 |
| rank($x$) | 1 | 2 | 3 |
| rank($y$) | 2 | 3 | 1 |

(a) State the null and alternative hypotheses.
(b) Determine the permutation distribution of $r_s$.

With rank($X_1$) = 1, rank($X_2$) = 2, rank($X_3$) = 3 held fixed:

| rank($Y_1$) | rank($Y_2$) | rank($Y_3$) | $r_s$ |
|---|---|---|---|
| 1 | 2 | 3 | 1 |
| 1 | 3 | 2 | 0.5 |
| 2 | 1 | 3 | 0.5 |
| 2 | 3 | 1 | -0.5 |
| 3 | 1 | 2 | -0.5 |
| 3 | 2 | 1 | -1 |

| $r_s$ | probability |
|---|---|
| -1 | 1/6 |
| -0.5 | 2/6 |
| 0.5 | 2/6 |
| 1 | 1/6 |
| | sum = 1 |

(c) Determine the p-value of the Spearman correlation test using hand calculations.
(d) Determine the p-value of the Spearman correlation test using the macro cor.test.
(e) What would be the p-value for a two-sided test?
(f) Obtain the simulated p-value using the macro permcortest.
(g) What assumptions were needed to perform this test? □

#### 5.2.2 Large-Sample Approximation

Recall from section 5.1.4 that for large $n$ and independent $X$ and $Y$, the distribution of

$$Z = r \sqrt{n - 1}$$

is approximately standard normal, where $r$ is the sample Pearson correlation coefficient.

Adjustment for ties: Two options.

- Apply the Pearson correlation to the ranks adjusted for ties.
- Use the normal approximation.

Revisit Problem 5.1.1 (crying babies and IQ scores; Karelitz et al., 1964; cry counts at age five days and Stanford-Binet IQ scores at age three for 38 children).

(a) Determine the sample Spearman correlation coefficient.
(b) State the null and alternative hypotheses.
(c) Estimate the exact p-value of this test using simulations, and state the conclusion.
(d) Determine the value of the standardized test statistic for the large-sample test using hand calculations.
(e) Determine the asymptotic p-value of this test using hand calculations.
(f) Plot the asymptotic distribution of your standardized test statistic under $H_0$, and shade in the appropriate region corresponding to the p-value.
(g) Determine the asymptotic p-value of this test using cor.test. □

Caution in Using the Pearson or Spearman Correlation

Caution: Time-dependent data may invalidate the independence between the $(X, Y)$ data pairs.

Example: Suppose $x$ is the year and $y$ is the average ocean temperature for the year. □

### 5.4 Permutation Tests for Contingency Tables

Scenario: We have two categorical variables, and we enter the data into a table.

#### 5.4.1 Hypotheses to be Tested and the Chi-Square Statistic

Problem 5.4.1 (gender and handedness, hypothetical data): The following hypothetical data are based on gender and the preferred hand for writing among 10-year-old children.

| Observed | Left | Right | total |
|---|---|---|---|
| Girls | 5 | 83 | 88 |
| Boys | 7 | 65 | 72 |
| total | 12 | 148 | 160 |

The hypothesis test will be a test for association.

(a) What is the hypothesis test of interest?
(b) Estimate the expected (i.e., average) number of left-handed girls under $H_0$.
(c) Estimate the expected (i.e., average) number of right-handed girls under $H_0$.
(d) Estimate the expected (i.e., average) number of left-handed boys under $H_0$.
(e) Estimate the expected (i.e., average) number of right-handed boys under $H_0$.

| Expected | Left | Right | total |
|---|---|---|---|
| Girls | 6.6 | 81.4 | 88 |
| Boys | 5.4 | 66.6 | 72 |
| total | 12 | 148 | 160 |

(f) What is the general rule for computing these expected cell frequencies under $H_0$?

Formula: The chi-square test statistic is defined as

$$V = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}.$$

(g) Compute the chi-square test statistic for this data set.

Asymptotic result: For large sample sizes (i.e., if $e_{ij} \ge 5$ for all cells), the test statistic $V$ has approximately a $\chi^2$ distribution with degrees of freedom equal to (number of rows $- 1$) $\times$ (number of columns $- 1$).

(h) How many degrees of freedom are associated with this test?
(i) Determine the asymptotic p-value associated with this test.
(j) Plot the asymptotic distribution of your test statistic under $H_0$, and shade in the appropriate region corresponding to the p-value.
(k) Determine the asymptotic p-value using the macro chisq.test. □

Example (gender and handedness, real data): In a survey of Scottish school children, aged approximately ten to twelve years, the teacher observed whether the pupil wrote with the left or right hand, with the following results (Clark, 1957, Left Handedness, University of London Press, London).

| Observed | Left | Right | Percentage left |
|---|---|---|---|
| Girls | 1478 | 25045 | 5.57 |
| Boys | 991 | 12629 | 7.28 |

□

Problem 5.4.2 (Roosevelt): The following results were obtained in a 1948 study of the 1944 Presidential election in Elmira, New York (McCarthy, 1957, Introduction to Statistical Reasoning, McGraw-Hill).

| 1944 Pres. vote | Interviewed on first call | Interviewed on second or later call | Total | Percentage reached on first call |
|---|---|---|---|---|
| Roosevelt | 138 | 217 | 355 | 38.9 |
| Dewey | 124 | 200 | 324 | 38.3 |
| Did not vote | 90 | 142 | 232 | 38.8 |
| Other or too young | 39 | 78 | 117 | 33.3 |
| Total | 391 | 637 | 1028 | 38.0 |

Test the hypothesis that the distribution of responses is the same for individuals reached on the first call as for those interviewed on the second or later calls. □

#### 5.4.2 Permutation Chi-Square Test

Herein, we use the same chi-square test statistic,

$$V = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}.$$

However, we do NOT use the $\chi^2$ distribution to approximate the p-value. Instead, the permutation distribution of the test statistic is determined exactly or is simulated.

Determining the permutation distribution:

Revisit Problem 5.4.1 (gender and handedness, hypothetical data): The following hypothetical data are based on gender and the preferred hand for writing among 10-year-old children.

| Observed | Left | Right | total |
|---|---|---|---|
| Girls | 5 | 83 | 88 |
| Boys | 7 | 65 | 72 |
| total | 12 | 148 | 160 |

- Fix all of the margin totals (12, 148, 88, 72, 160). Is our observed table rare under $H_0$? Under these fixed margins, what are other possible values of the table? Specifically, what are the possible values for $X$, the number of left-handed girls? Based on a permutation involving, say, 10 left-handed girls, complete the rest of the table.
- How many degrees of freedom are associated with this test?
- The chi-square test statistic $V$ may be determined for each value of $X$, the number of left-handed girls, for $X = 0, 1, \ldots, 12$.
- To obtain the permutation distribution of $V$ under these fixed margins, the probabilities of $V$ (or $X$) may be determined based on the hypergeometric distribution. A hypergeometric distribution is similar to a binomial distribution, except that a hypergeometric distribution is based on sampling withOUT replacement.

  Example (aside): Suppose a classroom has 10 female students and 7 male students. Sample 5 students withOUT replacement at random, and let $W$ be the number of female students in the sample. Then $W$ has a hypergeometric distribution. □

- The p-value for this permutation distribution is based on either all possible permutations of $V$ or a large number of simulated permutations of $V$. □

### 5.5 Fisher's Exact Test for a 2 × 2 Contingency Table

Fisher's exact test is similar to the permutation chi-square test, except the permutations are based on $X$ (say, the number of left-handed girls) rather than $V$, where

$$V = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}.$$
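The permutation chi-square test of Section 5.4.2 can be sketched in base R for the hypothetical handedness table. This is an illustrative sketch rather than the course's own macro: `chisq_stat` and the other variable names are assumptions made here, and `dhyper` is used with the 88 girls as "successes", the 72 boys as "failures", and the 12 left-handers as the draws made without replacement.

```r
# Fixed margins from Problem 5.4.1 (hypothetical data).
n_girls <- 88
n_boys  <- 72
n_left  <- 12

# Possible values of X, the number of left-handed girls.
x_vals <- 0:n_left

# Hypothetical helper: chi-square statistic V for the table implied by X = x.
chisq_stat <- function(x) {
  obs <- matrix(c(x,          n_girls - x,
                  n_left - x, n_boys - (n_left - x)),
                nrow = 2, byrow = TRUE)
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # row total * column total / n
  sum((obs - expd)^2 / expd)
}

V_vals <- sapply(x_vals, chisq_stat)

# Permutation (hypergeometric) probabilities of each X under the fixed margins.
probs <- dhyper(x_vals, m = n_girls, n = n_boys, k = n_left)

# Observed table has X = 5 left-handed girls.
V_obs <- chisq_stat(5)

# Exact permutation p-value: total probability of tables at least as extreme in V.
p_perm <- sum(probs[V_vals >= V_obs - 1e-12])

# The classroom aside: W = number of female students among 5 sampled
# from 10 females and 7 males without replacement.
w_probs <- dhyper(0:5, m = 10, n = 7, k = 5)
```

Because $V$ is a function of $X$ once the margins are fixed, the same vector `probs` drives both the permutation chi-square test and a test based directly on $X$.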
The permutation probabilities of $X$ are again determined by the hypergeometric distribution, under the fixed margin totals.

Revisit Problem 5.4.1 (gender and handedness, hypothetical data): The following hypothetical data are based on gender and the preferred hand for writing among 10-year-old children.

| Observed | Left | Right | total |
|---|---|---|---|
| Girls | 5 | 83 | 88 |
| Boys | 7 | 65 | 72 |
| total | 12 | 148 | 160 |

Let $X$, the test statistic, be the number of left-handed girls.

What does a large value of $X$ suggest?

What does a small value of $X$ suggest?

Test if the proportion of girls who are left-handed is smaller than the proportion of boys who are left-handed. □

Example: Bush vs. Gore, Election of 2000

Summary: The Presidential Election between George W. Bush and Albert Gore took place on November 7, 2000. The vote was quite close in Florida, the winner of which would win the election. On November 8, the count resulted in a small lead for Bush. Gore sought a manual recount in several Florida counties, so the process of recounting votes began, as permitted by the Florida Supreme Court. Bush argued that recounting only certain counties violated the equal protection clause of the fourteenth amendment to the U.S. Constitution, and Bush also argued that Florida's electors should be selected by the December 12 deadline. On December 12, a 5-4 majority of the U.S. Supreme Court ruled that no constitutionally valid recount could be completed by the December 12 deadline, effectively ending the recounts.

Is there statistically significant evidence that the U.S. Supreme Court Justices tended to favor their own political party (i.e., according to the political party of the President who appointed the Justice)? Use Fisher's exact test.

| Justice | Appointed by President | Decision |
|---|---|---|
| William Rehnquist | Reagan | End recount |
| Sandra Day O'Connor | Reagan | End recount |
| Antonin Scalia | Reagan | End recount |
| Anthony Kennedy | Reagan | End recount |
| Clarence Thomas | G. H. W. Bush | End recount |
| John Paul Stevens | Ford | Continue recount |
| David Souter | G. H. W. Bush | Continue recount |
| Ruth Ginsburg | Clinton | Continue recount |
| Stephen Breyer | Clinton | Continue recount |

Is the proportion of Republican-appointed Justices who voted to end the recount significantly large? □

Example: sampling and the U.S. Census, 1999

The U.S. Census contains error, in that some individuals are not counted, quite often in regions dominated by Democrats. The statistical technique of sampling could greatly reduce this error, and consequently could affect the apportionment in the House of Representatives in favor of the Democrats. The Clinton administration wanted to use sampling, but the Republicans opposed the use of sampling for determining seats in the House of Representatives. On January 25, 1999, the U.S. Supreme Court ruled 5 to 4 against the use of sampling in the Census for the purpose of apportioning seats in the House of Representatives among the states.

Is there statistically significant evidence that the U.S. Supreme Court Justices tended to favor their own political party (i.e., according to the political party of the President who appointed the Justice)? Use Fisher's exact test.

## 0 Preliminaries

First introductory question: When do we use nonparametric statistics?

Example: Construct a 95% confidence interval on the mean household income of your community. Test if mean income equals 50,000 versus the alternative that mean income does not equal 50,000.

1. How is this done?
2. What test or table is often used?
3. What assumptions are needed?

Second introductory question: What are nonparametric statistics?

Details on nonparametric methods begin in section 0.5.

### 0.1 Cumulative Distributions and Probability Density Functions

A population consists of all possible values for some variable.

A sample consists of a subset of the population, and is often drawn randomly.

A random variable denotes an observation selected randomly from the population.

A cumulative distribution function (cdf) of a random variable $X$ is $P(X \le x)$ for all real $x$.

Example: Suppose $X$ denotes the income of a randomly selected household. What does the cumulative distribution of $X$ evaluated at 100,000 represent? □

If the random variable is continuous, then probabilities may be illustrated by the area under a curve, known as the probability density function (pdf).

What are some continuous distributions which have names?

### 0.2 Common Continuous Probability Distributions

Using R (or Splus): see www.r-project.org

To download R: Click Download CRAN; UNC Chapel Hill (or some other site); Windows; base; R-2.8.1-win32.exe. Click Save, and then Save again. The saving should take less than five minutes using high-speed internet. Click Open and OK, and continue to click Next until the process is complete.

Alternatively, use Rweb by going to http://bayes.math.montana.edu/Rweb/Rweb.general.html

Normal distribution, $N(\mu, \sigma)$

The probability density function is

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)}, \quad -\infty < x < \infty \quad \text{(need not memorize)}.$$

What is $f$ for a standard normal distribution?

Example: Use R regarding the normal distribution for the following.

(a) Graph the probability density function (pdf) of a standard normal distribution, say from $-3$ to $3$.

    > x <- 0:100                         # Generate numbers from 0 to 100.
    > x = 0:100                          # Generate numbers from 0 to 100. Alternatively, use
    > x = c(0:100)                       # "c" means combine.
    > x = x / 100 - 0.5                  # Generates numbers between -0.5 and 0.5.
    > x = x * 6                          # Generates numbers between -3 and 3. Alternatively, type
    > x = 6 * ( c(0:100) / 100 - 0.5 )
    > help( dnorm )                      # Help menu for dnorm.
    > ?dnorm                             # Again, the help menu for dnorm.
    > y = dnorm( x )                     # i.e., probability density function for a standard normal r.v.
    > # Note that different means and standard deviations may be used.
    > plot( x, y )
    > plot( x, dnorm(x) )
    > plot( x, dnorm(x), type="s" )      # Use stair step for a nicer looking graph.
    > x = 6 * ( 0:1000 / 1000 - 0.5 )
    > plot( x, dnorm(x), type="s" )

(b) Graph the probability density function (pdf) of a $N(\mu = 0.5, \sigma = 0.3)$ random variable from $-3$ to $3$.

    > plot( x, dnorm( x, mean=0.5, sd=0.3 ), type="s" )   # different mean and sd
    > plot( x, dnorm( x, 0.5, 0.3 ), type="s" )           # different mean and sd

(c) Graph the cumulative distribution function (cdf) of a standard normal random variable from $-3$ to $3$.

(d) Use the macro plotdist to graph the probability density function (pdf) of a $N(\mu = 50, \sigma = 5)$ random variable from 35 to 65.

    > source( "http://www.math.jmu.edu/~garrenst/math324.dir/Rmacros" )   # Read in the macros.
    > ls()                               # List the functions (macros) and variables.
    > plotdist                           # To read the source code and comments.
    > plotdist( dnorm, 50, 5, xmin=35, xmax=65 )

(e) Use the macro plotdist to graph the cumulative distribution function (cdf) of a $N(\mu = 50, \sigma = 5)$ random variable from 35 to 65.

(f) Use the macro plotdist to graph the probability density function (pdf) and the cumulative distribution function (cdf) of a standard normal random variable.

(g) What will the following R commands generate?

    > pnorm( 1.96 )
    > pnorm( -1.96 )
    > pnorm( 1.96 ) - pnorm( -1.96 )
    > qnorm( 0.975 )                     # "q" in qnorm means quantile.
    > pnorm( 319.6, 300, 10 )
    > qnorm( 0.975, 300, 10 )

(h) Generate 1000 independent observations from a $N(80, 10)$ distribution. Determine the sample mean $\bar{x}$ and the sample standard deviation $s$, and graph these observations in a histogram.

    > x1 = rnorm( 1000, 80, 10 )         # The "r" in rnorm means random.
    > mean( x1 )
    > sd( x1 )
    > hist( x1 )
    > history()                          # To view the history.
    > q()                                # To quit.
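The relationship among the `pnorm` and `qnorm` calls in the normal-distribution example can be checked with a short base R sketch. The variable names below are chosen for this illustration only; the sketch shows that `qnorm` inverts `pnorm` and that a general $N(\mu, \sigma)$ probability reduces to a standard normal one after standardizing with $z = (x - \mu)/\sigma$.

```r
# Standard normal: pnorm gives P(Z <= z); qnorm is its inverse.
p_upper <- pnorm(1.96)
p_lower <- pnorm(-1.96)
central <- p_upper - p_lower        # P(-1.96 < Z < 1.96)
z_975   <- qnorm(0.975)             # 97.5th percentile of N(0, 1)
round_trip <- qnorm(pnorm(1.96))    # applying qnorm after pnorm recovers 1.96

# General normal with mean 300 and sd 10: standardizing
# x = 319.6 gives z = (319.6 - 300) / 10 = 1.96.
p_general <- pnorm(319.6, mean = 300, sd = 10)
q_general <- qnorm(0.975, mean = 300, sd = 10)   # equals 300 + 10 * z_975

# Symmetry of the normal pdf about zero: P(Z <= -z) = 1 - P(Z <= z).
symmetry_gap <- p_lower - (1 - p_upper)
```

The same `d`, `p`, `q`, `r` prefixes apply to the other distributions in this section (for example `dunif`, `punif`, `qunif`, `runif`).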
Central Limit Theorem: If $n$ independent observations are sampled from a population with mean $\mu$ and finite standard deviation $\sigma$, then the sample mean $\bar{X}$ has approximately a $N(\mu, \sigma/\sqrt{n})$ distribution for large $n$.

Furthermore, if $n$ independent observations are sampled from an approximately $N(\mu, \sigma)$ population, then the sample mean $\bar{X}$ has approximately a $N(\mu, \sigma/\sqrt{n})$ distribution.

A parameter is an (often unknown) quantity which is a characteristic of a population.

More on standardizing: In general, if $X$ is any random variable, and $a$ and $b$ are constants such that the distribution of $(X - a)/b$ does not depend on $a$ or $b$, then $a$ is a location parameter and $b$ is a scale parameter. Then the probability density function (pdf) of $X$ can be written

$$f(x) = \frac{1}{b} \, h\!\left( \frac{x - a}{b} \right),$$

where $h$ is the pdf of $(X - a)/b$.

Example: Let $f$ (or $f_{\mu,\sigma}(x)$) be the pdf of a $N(\mu, \sigma)$ distribution,

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)}, \quad -\infty < x < \infty.$$

Let $h(z)$ be the pdf of a $N(0, 1)$ distribution. Then

$$h(z) = \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2}, \quad -\infty < z < \infty.$$

Uniform distribution

The pdf for a standard uniform distribution is

$$h_1(z) = 1, \quad 0 < z < 1.$$

Example: Use R regarding the uniform distribution for the following.

(a) Graph the pdf of a standard uniform distribution.

    > ?runif

(b) Graph the cdf of a standard uniform distribution.
(c) Graph the pdf and cdf of a uniform distribution with endpoints 30 and 40.
(d) What is the mean of a uniform random variable?
(e) Sample 1000 observations from a Uniform(30, 40) distribution, calculate the sample mean, and plot the histogram.

    > x = runif( 1000, 30, 40 )
    > x[ 1:50 ]                          # Observe the first 50 observations.

□

Example: Applying the Central Limit Theorem to the uniform distribution using R

Let $U \sim \text{Uniform}(40, 60)$. Let $\bar{U}_n$ be the sample mean of $n$ independent realizations of $U$.

(a) Plot the pdf and cdf of $U$.
(b) Sample 30 independent observations of $\bar{U}_2$, using the macro replicate.

    > ubar = replicate( 30, mean( runif( 2, 40, 60 ) ) )

(c) Construct a histogram of 10000 independent observations of $\bar{U}_2$. Interpret your graph.
(d) Repeat part (c) for $\bar{U}_5$, $\bar{U}_{20}$, and $\bar{U}_{100}$.

    > hist( replicate( 1e4, mean( runif( 5, 40, 60 ) ) ) )

Exponential distribution

An exponential random variable is continuous and lacks memory.

Example: Suppose that a particular type of computer chip, running continuously, has lifetime distributed according to an exponential distribution. Suppose that an old computer chip, which has been running for two years, is compared to a new computer chip, which has been running only one day. Which computer chip is better? Can the lifetime of a computer chip be negative? □

The pdf for a standard exponential distribution is

$$h_2(z) = e^{-z}, \quad z > 0,$$

and has mean and standard deviation equal to 1.

    > # Graph the pdf of a standard exponential distribution.
    > plotdist
    > ?dexp
    > # Graph the pdf of an exponential distribution, using rate equal to 0.1.
    > # What are the new mean and standard deviation?
    > # Generate 10000 observations from an exponential distribution, using rate equal to 0.1.
    > # Construct a histogram of the data.
    > # Determine the sample mean of x.
    > # Determine the sample standard deviation of x.

□

Laplace (or Double Exponential) distribution

A Laplace distribution is symmetric about its mean, such that each half of the pdf is exponential.

The pdf for a standard Laplace distribution is

$$h_3(z) = \frac{1}{\sqrt{2}} \, e^{-\sqrt{2}\,|z|}, \quad -\infty < z < \infty.$$

The standard Laplace distribution has mean zero and standard deviation one.

    > # Graph the pdf of the standard Laplace distribution.
    > dlaplace
    > # Graph the pdf of the Laplace distribution with mean equal to 2 and standard deviation equal to 3.
    > # Generate 10000 observations from a Laplace distribution with mean equal to 2 and standard deviation equal to 3.
    > # Construct a histogram of the data.
    > # Determine the sample mean of x.
    > # Determine the sample standard deviation of x.

□
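Base R has no built-in Laplace functions, so the Laplace exercise above depends on the course macros. As a stand-in, the following is a minimal sketch assuming the parametrization by mean and standard deviation used in these notes. The names `dlaplace_sketch` and `rlaplace_sketch` are hypothetical and are not the course's `dlaplace`; the sketch relies on two facts: a Laplace distribution with scale $b$ has variance $2b^2$, and the difference of two independent exponential random variables with rate $1/b$ is Laplace with scale $b$.

```r
# Hypothetical helper: Laplace pdf parametrized by mean and standard deviation.
dlaplace_sketch <- function(x, mean = 0, sd = 1) {
  b <- sd / sqrt(2)                  # scale parameter, since variance = 2 * b^2
  exp(-abs(x - mean) / b) / (2 * b)
}

# Hypothetical helper: Laplace random numbers as a difference of exponentials.
rlaplace_sketch <- function(n, mean = 0, sd = 1) {
  b <- sd / sqrt(2)
  mean + rexp(n, rate = 1 / b) - rexp(n, rate = 1 / b)
}

set.seed(324)                        # arbitrary seed, for reproducibility only
x <- rlaplace_sketch(10000, mean = 2, sd = 3)

x_bar <- mean(x)                     # should be close to 2
x_sd  <- sd(x)                       # should be close to 3

# Histogram on the density scale with the pdf drawn on top.
hist(x, breaks = 60, freq = FALSE, main = "Laplace(mean = 2, sd = 3)")
grid <- seq(min(x), max(x), length.out = 400)
lines(grid, dlaplace_sketch(grid, mean = 2, sd = 3))
```

With the defaults `mean = 0` and `sd = 1`, `dlaplace_sketch` reduces to the standard Laplace pdf $h_3(z)$ given above.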
Distributions January 87 2009 Cauchy distribution A Cauchy distribution is symmetric and has heavy tails The pdf for a standard Cauchy distribution is 1 MW W Note the typo in the textbook on p 3 OOltZltOO The mean and the standard deviation of a Cauchy dis tribution DO NOT EXIST gt Graph the pdf of the standard Cauchy distribu tion gt plotdist gt dcauchy gt Graph the pdf of a Cauchy distribution with lo cation pararneter 40 and scale parameter 5 gt If X has the above pdf deterrnine PX lt 40 using R Section 02 Common Continuous Probability Distributions January 87 2009 16 gt 7 Determine PX lt 35 gt shadedist gt shadedist 35 dcauchy 40 5 gt Determine PX lt 45 D Comparing Normal and Cauchy distributions Ignoring the constants the probability density func tions of the normal and Cauchy distributions are 6 22 and 1 221 Which of these converges to zero faster as 2 gt 00 or z gt oo If Z is standard normal deterrnine P 1 lt Z lt 1 gt shadedist c 1 1 dnorrn lowertai1F If X is standard Cauchy deterrnine P 1 lt X lt Section 02 Common Continuous Probability Distributions January 87 2009 17 1 Generate 100 observations from a standard normal distribution and 100 observations from a standard Cauchy dis tribution gt Z rnorm100 X rcauchy100 Look at the observations Are there any outliers Determine the minimum and maximum for both Z and x gt minz maxz gt minx maxx Plot the histograms May want to try gt trunchist x gt trunchist X 3 3 7 Construct a truncated his togram excluding values outside 3 and 3 D Section 02 Common Continuous Probability Distributions January 87 2009 18 Lighttailed distributions such as uniform and nor mal rarely produce outliers Heavytailed distributions such as exponential Laplace and Cauchy tend to produce outliers The Cauchy distribution has very heavy tails What is the distribution of the sample mean of n in dependent N a a random variables What is the distribution of the sample mean of n in dependent Cauchy6 1 random variables Homework C021 LetX Nu 500 0 
200 Let U be a uniform random variable with end points 50 and 60 Let Y N Cauchy location 500 scale 200 Let X be the sample mean of 100 independent realizations of X Let U be the sample mean of 100 independent realizations of U Let Y be the sample mean of 100 independent real izations of Y Let Y be the sample median of 100 independent observations of Y Use R for all graphs Section 02 Common Continuous Probability Distributions January 87 2009 below and show both your source code and out put CLEARLY LABEL THE VARIOUS PARTS a b c etc a Graph the pdf of X b Graph the pdf of U c Graph the pdf of Y 01 What distribution ie the shape location and scale does X have There is no need to use R for this part e Graph the pdf of X After completing part e sample 10000 independent realizations of X f Compute the sample mean and sample standard deviation of your 10000 values of X g Graph your 10000 values of X in a histogram h Are your results from parts f and g consistent With your answer from part d Explain Section 02 Common Continuous Probability Distributions January 87 2009 Next sarnple 10000 independent realizations of U i Graph your 10000 values of U in a histogram j What is the approximate shape of the distribution of U Next sample 10000 independent realizations of 17 k Graph your 10000 values of 17 in a histogram 1 Graph your 10000 values of 17 in a truncated his tograrn m What distribution ie the shape location and scale does 17 have There is no need to use R for this part n When trying to estimate the population rnedian ie location parameter of a Cauchy distribution which is more informative one observation from the Cauchy distribution OR the sample mean based on 100 independent observations from the Cauchy dis tribution Explain Next sarnple 10000 independent realizations of 17 Section 03 The Binomial Distribution January 87 2009 21 0 Graph your 10000 values of i7 in a histogram p When trying to estimate the population median ie location parameter of a Cauchy distribution 
which is more informative: the sample mean or the sample median? Explain.

End of Homework C0.2.1

0.3 The Binomial Distribution

Example: Toss a coin ten times, where the probability of heads is 40%. Let $X$ be the number of heads. Then $X$ is a binomial random variable with parameters $n = 10$ and $p = 0.4$.

Example: Sample ten people independently from a population consisting of 40% Democrats. Let $X$ be the number of Democrats. Then $X$ is a binomial random variable with parameters $n = 10$ and $p = 0.4$.

For both above examples, what are the possible values of $X$?

In general: A Bernoulli trial may result in a success or a failure. Suppose $X$ represents the number of successes from $n$ independent Bernoulli trials, where $n$ is fixed (not random) and $p = P(\text{success})$. Then $X \sim \text{Binomial}(n, p)$. The probability density function (pdf) of $X$ can be shown to be

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, 2, \ldots, n.$$

Note well: The definition of the pdf of a discrete random variable differs from that of a continuous random variable.

In the above example regarding political affiliation: What is the probability that exactly 30% of the sample consists of Democrats?

Alternatively, in the above example regarding coin tosses: What is the probability that exactly 30% of the ten coin tosses result in heads?
> dbinom( 3, 10, 0.4 )

Determine $P(X = x)$ for all $x = 0, 1, 2, \ldots, 10$.

On average, how many heads do we expect from the 10 coin tosses? On average, how many Democrats do we expect from the 10 people sampled? In general, what is the mean of $X$?

The variance of $X$ can be shown to be $np(1 - p)$.

A special case of the Central Limit Theorem: If $X \sim \text{Binomial}(n, p)$ and $\hat{p} = X/n$, where $np \ge 10$ and $n(1 - p) \ge 10$, then $X$ is approximately $N(\mu_X = np,\ \sigma_X = \sqrt{np(1 - p)})$ and $\hat{p}$ is approximately $N(\mu_{\hat{p}} = p,\ \sigma_{\hat{p}} = \sqrt{p(1 - p)/n})$. Hence, for sufficiently large sample sizes, a binomial random variable or a sample proportion may be reasonably approximated by a normal random variable.

Example: Graph the pdf of a Binomial($n$, $p = 0.3$)
random variable for various values of $n$, say $n$ = 1, 2, 5, 10, 30, 50, and 100.
> plotdist( dbinom, 1, 0.3 )

Homework C0.3.1: Let $X \sim \text{Binomial}(n, p = 0.01)$.
(a) Graph the pdf of $X$ for $n$ = 10, 100, 500, and 1000. Make sure that you select appropriate domains for $X$, so that your four graphs are neat.
(b) State for which of these graphs in part (a) a normal approximation seems reasonable.
End of Homework C0.3.1

Homework C0.3.2: Let $X \sim \text{Binomial}(n = 500, p = 0.8)$. CLEARLY LABEL THE VARIOUS PARTS (a), (b), (c), etc.
(a) Determine exactly $P(X \le 390)$.
(b) Determine the mean and standard deviation of $X$.
(c) Determine the z-score (i.e., standard normal score) corresponding to $x = 390$ (or you may use $x = 390.5$).
(d) Using a normal approximation to the binomial, compute $P(X \le 390)$. Use R for this computation, not a standard normal table.
(e) Compare your answers from parts (a) and (d).
(f) Sample 1000 independent realizations of $X$. Determine the sample mean $\bar{x}$ and the sample standard deviation $s$, and graph these observations in a histogram.
(g) Compare your answers from part (b) with $\bar{x}$ and $s$ from part (f).
End of Homework C0.3.2

0.4 Confidence Intervals and Tests of Hypotheses

Suppose (a) the observations from a population are independent, and (b) the original population is approximately normal, or the sample size $n$ is large and $\sigma < \infty$. Let $\bar{X}$ be the sample mean, and let $s$ be the sample standard deviation. Then an approximate confidence interval on the population mean $\mu$ (if finite) is

$$\bar{X} \pm t_{n-1}\, s / \sqrt{n}.$$

Hypothesis testing on $\mu$ involves converting the standardized test statistic $T = (\bar{X} - \mu) / (s / \sqrt{n})$ to a p-value.

Example: Consider the following data set regarding birthweight of humans, in grams:
3615 3105 4217 3234 3551 4023 3098

(a) Graph the data in a Q-Q plot (normal probability plot).
> qqnorm
> x <- c( 3615, 3105, 4217, 3234, 3551, 4023, 3098 )

(b) Construct a 90% confidence interval on $\mu$, the population mean birthweight.
> t.test
> t.test( x,
conf.level = 0.9 )

(c) Test at level $\alpha = 0.05$: $H_0$: $\mu = 3400$ g versus $H_a$: $\mu > 3400$ g.
> t.test( x, alternative = "greater", mu = 3400 )

0.5 Parametric versus Nonparametric Methods

Parametric methods often require that the population be approximately normal or the sample size be large. When $n$ is large and the standard deviation is finite, normal theory often may be used. Much theory has been developed for the normal distribution.

Random variables from other distributions, whose theory has also been developed quite thoroughly, arise from normal populations. What is an example?

• Some populations are nonnormal but may be transformed to normality by taking a transformation. For example, a data set generated by a lognormal random variable may be transformed to a normal random variable. How?
• Other distributions, in addition to the normal distribution, also have theory that is fairly well developed. For example, data from an exponential population may be analyzed using techniques based on the exponential distribution.
• How can we determine whether or not we believe a data set is from an approximately normal population? How well can we distinguish normal data from exponential data for small sample sizes?
• With parametric methods, such as the t-test, typically the sample size needs to be large, or the shape of the population is assumed (such as normality), and some or all parameters (such as $\mu$ and $\sigma$) are assumed unknown.
• Nonparametric methods generally do not require that the shape of the population be known or that the sample size be large.

With nonparametric methods, normality is not assumed. The assumptions required on the distributions for nonparametric methods tend to be quite weak (e.g., the population is continuous). Nonparametric methods are sometimes called distribution-free methods, since use of such methods is not
restricted to particular (e.g., normal) distributions.

• Typically, when the distribution is heavy-tailed and the sample size is small, the nonparametric methods are more valid than the parametric methods.

Suppose the population really is normal. Should parametric or nonparametric methods be used?

• Even when the population is normal, nonparametric methods often perform well, even in comparison to parametric methods.

0.6 Classes of Nonparametric Methods

Nonparametric methods are based on four statistical ideas.

Methods Based on the Binomial Distribution (chapter 1)

Suppose we perform hypothesis testing or construct confidence intervals on the median of a continuous population. We sample $n$ independent observations from the distribution. An observation above the median might be labeled as a success, whereas an observation below the median might be labeled as a failure. Let $Y$ be the number of successes. Then $Y$ has what distribution?

Permutation (Shuffling, Scrambling) Methods (chapters 2 through 7)

Example: Is the new drug better than the old drug for lowering cholesterol? Suppose there are 20 volunteers for the study. Obtain some statistic to account for the difference between the two groups regarding cholesterol change. How do we convert our statistic to a p-value?

Permute (shuffle at random) the 20 observations, and compute a permuted statistic. Repeat the permuting many times to produce many (perhaps 10000) permuted statistics. Compare your original statistic to the 10000 permuted statistics. Interpret: Suppose the original statistic falls far into the appropriate tail of the histogram of the permuted statistics.

Bootstrap (Resampling) Methods (chapter 8 and section 9.1)

Sample 100 independent observations from an unknown population with unknown finite mean and unknown median. $\bar{X}$ estimates $\mu$, and the sample median estimates the
population median.

What is the standard deviation of $\bar{X}$? From Math 220, what is a reasonable estimator of the standard deviation of $\bar{X}$?

What is the standard deviation of the sample median? What is a reasonable estimator of the standard deviation of the sample median? One method for estimating it is to use the bootstrap.

Bootstrap:
1. Sample (i.e., RESAMPLE) 100 observations WITH replacement from the original 100 observations, and obtain a bootstrapped sample median.
2. Repeat step 1 many times to produce many (perhaps 10000) bootstrapped sample medians.
3. Compute the sample standard deviation of the 10000 bootstrapped sample medians. That is your estimate.

Smoothing and Non-Least-Squares Methods

Smoothing (sections 10.1 and 10.2)

Example using one variable:
> x <- c( rnorm(1000), rnorm(1000, 6, 1.5) )
> # Think of x as just being some data set of 2000 numbers.
> hist( x )

5 Tests for Trends and Associations

5.1 A Permutation Test for Correlation and Slope

Sample data pairs $(X_i, Y_i)$, for $i = 1, \ldots, n$.

The population correlation coefficient is defined by $\rho = \sigma_{xy} / (\sigma_x \sigma_y)$, where $\sigma_{xy} = E[(X - \mu_x)(Y - \mu_y)]$. $\rho$ measures the strength of the linear relationship between two variables.

The Pearson sample correlation coefficient is defined by

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}.$$

Examples interpreting $r$: For the following models, generate 100 pairs of observations, determine the sample correlation coefficient, and plot the data.
(a) $X \sim N(0, 1)$, $Y = X + \varepsilon$, where the $\varepsilon$ are independent $N(0, 0.5)$.
(b) $X \sim N(0, 1)$, $Y = -X + \varepsilon$, where the $\varepsilon$ are independent $N(0, 0.5)$.
(c) $X \sim N(20, 1)$, $Y = X + \varepsilon$, where the $\varepsilon$ are independent $N(30, 0.5)$.
(d) $X \sim N(0, 1)$, $Y = X + \varepsilon$, where the $\varepsilon$ are independent $N(0, 0.2)$.
(e) $X \sim N(0, 1)$, $Y = 5X + 3$.
(f) $X \sim N(0, 1)$, $Y = -5X + 3$.
(g) $X \sim N(20, 5)$, $Y = X + \varepsilon$, where the $\varepsilon$
are independent $N(60, 5)$. How much variability in $Y$ can be explained by the linear relationship between $X$ and $Y$?
(h) $X \sim N(20, 5)$, $Y = \varepsilon$, where the $\varepsilon$ are independent $N(60, 5)$.
(i) $X \sim \text{Unif}(10, 30)$, $Y = -5(X - 20)^2 + \varepsilon$, where the $\varepsilon$ are independent $N(900, 25)$.

Problem 5.1.1 (crying babies and IQ scores): This data set is from Karelitz et al. (1964), Child Development. To test whether children who cry more actively as babies later tend to have higher IQs, a cry count was taken for a sample of 38 children aged five days and was later compared with their Stanford-Binet IQ scores at age three, with the results shown below.

Cry count   10   20   17   12   12   15   19   12   14   23
IQ score    87   90   94   94   97  100  103  103  103  103

Cry count   15   14   13   27   17   12   18   15   15   23
IQ score   104  106  106  108  109  109  109  112  112  113

Cry count   16   21   16   12    9   13   19   18   19   16
IQ score   114  114  118  119  119  120  120  124  132  133

Cry count   22   31   16   17   26   21   27   13
IQ score   135  135  136  141  155  157  159  162

(a) Plot the data.
> z <- scan2( "problem5.1.1", T, T )
(b) Determine the sample correlation coefficient.
(c) What is an interesting hypothesis test for this data set?

Define

$$t_{corr} = r \sqrt{\frac{n - 2}{1 - r^2}}$$

(need not memorize). If $\rho = 0$, and if the $(X_i, Y_i)$ are based on a simple random sample from a bivariate normal distribution, then $t_{corr}$ is t-distributed with $n - 2$ degrees of freedom.

Revisit problem 5.1.1 (crying babies and IQ scores): This data set is from Karelitz et al. (1964), Child Development. To test whether children who cry more actively as babies later tend to have higher IQs, a cry count was taken for a sample of 38 children aged five days and was later compared with their Stanford-Binet IQ scores at age three.

(a) State the null and alternative hypotheses.
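As a hedged illustration of the parametric correlation test defined above, the following R sketch computes $r$ and $t_{corr}$ for the crying-babies data and converts $t_{corr}$ to a p-value using the t distribution with $n - 2$ degrees of freedom. The one-sided alternative ($\rho > 0$) is an assumption based on the wording of the problem, and the variable names (`cry`, `iq`, `t_corr`, `check`) are illustrative, not taken from the notes. Only base R functions are used.

```r
# Crying babies and IQ data, copied from the table in problem 5.1.1.
cry <- c(10, 20, 17, 12, 12, 15, 19, 12, 14, 23,
         15, 14, 13, 27, 17, 12, 18, 15, 15, 23,
         16, 21, 16, 12,  9, 13, 19, 18, 19, 16,
         22, 31, 16, 17, 26, 21, 27, 13)
iq  <- c( 87,  90,  94,  94,  97, 100, 103, 103, 103, 103,
         104, 106, 106, 108, 109, 109, 109, 112, 112, 113,
         114, 114, 118, 119, 119, 120, 120, 124, 132, 133,
         135, 135, 136, 141, 155, 157, 159, 162)

n <- length(cry)                          # number of data pairs
r <- cor(cry, iq)                         # Pearson sample correlation
t_corr <- r * sqrt((n - 2) / (1 - r^2))   # standardized test statistic

# Assumed alternative: rho > 0, so the p-value is the upper tail area
# of the t distribution with n - 2 degrees of freedom.
p_value <- pt(t_corr, df = n - 2, lower.tail = FALSE)
print(c(r = r, t_corr = t_corr, p_value = p_value))

# Cross-check against the built-in test.
check <- cor.test(cry, iq, alternative = "greater")
print(check)
```

The statistic and p-value reported by `cor.test` should agree with the values computed by hand, which gives a quick check on the formula for $t_{corr}$.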
(b) Determine the value of the standardized test statistic for the parametric test, using hand calculations.
(c) Determine the asymptotic p-value of this test using hand calculations, and state the conclusion.
(d) Plot the asymptotic distribution of your standardized test statistic under $H_0$, and shade in the appropriate region corresponding to the p-value.
(e) Determine the asymptotic p-value of this test using cor.test.
(f) What is the 95% lower confidence bound on $\rho$?
(g) Does association imply causation?

5.1.2 Slope of the Least Squares Line

A simple linear regression model is defined by

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,$$

where the $\varepsilon_i$ are independent and identically distributed random variables with mean zero and finite variance. However, if the $\varepsilon_i$ are normal (and the $X_i$ are also normal), then $X$ and $Y$ are bivariate normal.

We define the least squares estimates of $\beta_0$ and $\beta_1$ by

$$\hat{\beta}_1 = r\, s_y / s_x \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

(memorize these two formulas).

The least squares line, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, is the line which minimizes the sum of the squares of the vertical distances between the observations and the line.

For large $n$, what are the asymptotic values in the formula $\hat{\beta}_1 = r\, s_y / s_x$?

Revisit problem 5.1.1 (crying babies and IQ scores): This data set is from Karelitz et al. (1964), Child Development. To test whether children who cry more actively as babies later tend to have higher IQs, a cry count was taken for a sample of 38 children aged five days and was later compared with their Stanford-Binet IQ scores at age three.

(a) Which variable should be $X$, and which should be $Y$?
(b) Determine the least squares equation using hand calculations.
(c) Plot the least squares line on the scatterplot.
(d) Predict the child's IQ at age three years, if the cry count at age five days was 25.
(e) Determine the least squares equation using the macro lm.
(f) Verify
that the least squares equation goes through the sample means.
(g) Regarding the simple linear regression model, what is an interesting hypothesis test?

5.1.3 The Permutation Test

Specifically, this permutation correlation test is for nonzero, negative, or positive Pearson correlation or slope.

If the simple linear regression model holds, and the null hypothesis of $\beta_1 = 0$ holds, what can we say about $X$ and $Y$? How should a permutation test be performed?

Problem 5.1.2: In the data set below, test for negative Pearson correlation.

x   11   15   21
y  139  143   87

(a) State the null and alternative hypotheses.
(b) Determine the permutation distribution of $\hat{\beta}_1$ and $r$.

X1 = 11   X2 = 15   X3 = 21
   Y1        Y2        Y3      beta1-hat      r
   87       139       143        5.21       0.839
   87       143       139        4.74       0.763
  139        87       143        1.11       0.178
  139       143        87       -5.53      -0.890
  143        87       139        0.316      0.051
  143       139        87       -5.84      -0.941

beta1-hat      r       probability
  -5.84     -0.941        1/6
  -5.53     -0.890        1/6
   0.316     0.051        1/6
   1.11      0.178        1/6
   4.74      0.763        1/6
   5.21      0.839        1/6
                        sum = 1

(c) Determine the p-value of the permutation correlation test, using hand calculations.
(d) What would be the p-value for a two-sided test?
(e) Obtain the simulated p-value using the macro permcortest.
(f) What assumptions were needed to perform this test from part (e)?
(g) Using the one-sided hypothesis test and the parametric approach, determine the p-value.
(h) What assumptions were needed to perform this test from part (g)?
(i) Are the assumptions needed to perform this test reasonable?

In general, for $n$ data pairs, how many permutations (groupings) exist when performing the permutation correlation test?

5.1.4 Large-Sample Approximation for the Permutation Distribution of $r$

Suppose $n$ independent pairs of observations $(X, Y)$ are sampled from distributions with finite variances, such that $X$ and $Y$ are
independent. What is $\rho$, the population Pearson correlation coefficient? What is $r$, the sample Pearson correlation coefficient?

Consider the permutation distribution of $r$. What is $E(r)$, where the expectation is taken conditionally on the permutation distribution?

$\text{var}(r) = 1/(n - 1)$, conditional on the permutation distribution.

Example: Sample $n$ independent pairs of observations $(X, Y)$ from distributions with finite variances, such that $X$ and $Y$ are independent, where $n$ is large.
(a) Determine $r_n$, a fixed function of $n$, such that there is approximately a 95% chance that $r$, the sample correlation coefficient, will be between $-r_n$ and $r_n$.
(b) Set $n = 100$ in part (a).
(c) Set $n = 400$ in part (a).
(d) Set $n = 1000$ in part (a).
(e) Set $n = 10{,}000$ in part (a).

5.2 Spearman Rank Correlation

Problem 5.2.1: Consider the following data set.

x   0    2    3     4      6
y   0   16   81   256   1296

(a) Determine the Pearson correlation coefficient.
> x <- c( 0, 2, 3, 4, 6 )
> y <- c( 0, 16, 81, 256, 1296 )
> cor( x, y )
(b) Is $r$, the Pearson correlation coefficient, a reasonable measure?
(c) Compare the ranks of $x$ with the ranks of $y$.

5.2.1 Statistical Test for Spearman Rank Correlation

The Spearman rank correlation, $r_s$, is the Pearson correlation coefficient applied to ranks. Thus, the Spearman rank correlation is not heavily influenced by outliers. The Spearman rank correlation measures the association between two variables.

Revisit problem 5.2.1.
(a) Determine the Spearman rank correlation for this data set.
> x <- c( 0, 2, 3, 4, 6 )
> y <- c( 0, 16, 81, 256, 1296 )
(b) Test if the association between the two variables is positive, using the Spearman rank correlation and hand calculations.
(c) Test if
the population Spearman rank correlation is positive, using the macro cor.test.

Revisit problem 5.1.2: In the data set below, use the Spearman rank correlation.

x   11   15   21
y  139  143   87

(a) State the null and alternative hypotheses.
(b) Determine the permutation distribution of $r_s$.

rank(X1) = 1   rank(X2) = 2   rank(X3) = 3
  rank(Y1)       rank(Y2)       rank(Y3)      r_s
     1              2              3           1
     1              3              2           0.5
     2              1              3           0.5
     2              3              1          -0.5
     3              1              2          -0.5
     3              2              1          -1

 r_s     probability
 -1         1/6
 -0.5       2/6
  0.5       2/6
  1         1/6
          sum = 1

(c) Determine the p-value of the Spearman correlation test, using hand calculations.
(d) Determine the p-value of the Spearman correlation test, using the macro cor.test.
(e) What would be the p-value for a two-sided test?
(f) Obtain the simulated p-value using the macro permcortest.
(g) What assumptions were needed to perform this test?

5.2.2 Large-Sample Approximation

Recall from section 5.1.4 that for large $n$ and independent $X$ and $Y$, the distribution of $Z = r \sqrt{n - 1}$ is approximately standard normal, where $r$ is the sample Pearson correlation coefficient.

Adjustment for Ties: Two options
• Apply the Pearson correlation to the ranks adjusted for ties.
• Use the normal approximation.

Revisit problem 5.1.1 (crying babies and IQ scores): This data set is from Karelitz et al. (1964), Child Development. To test whether children who cry more actively as babies later tend to have higher IQs, a cry count was taken for a sample of 38 children aged five days and was later compared with their Stanford-Binet IQ scores at age three.

(a) Determine the sample Spearman correlation coefficient.
(b) State the null and alternative hypotheses.
(c) Estimate the exact p-value of this test using simulations, and state the conclusion.
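The permutation distribution of $r_s$ tabulated for the three-pair data set of problem 5.1.2 can be reproduced by brute force. The R sketch below is a minimal illustration, assuming a one-sided test for negative association (the direction stated in problem 5.1.2); the six orderings are written out by hand, and names such as `perms`, `rs_perm`, and `r_obs` are illustrative rather than part of any course macro.

```r
x <- c(11, 15, 21)
y <- c(139, 143, 87)

rx <- rank(x)              # ranks of x: 1, 2, 3
ry <- rank(y)              # ranks of y: 2, 3, 1
r_obs <- cor(rx, ry)       # observed Spearman rank correlation

# All 3! = 6 orderings of three items, written out by hand.
perms <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
               c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# Spearman correlation for every rearrangement of the y ranks.
rs_perm <- apply(perms, 1, function(p) cor(rx, ry[p]))

# Permutation distribution: each ordering has probability 1/6.
perm_dist <- table(round(rs_perm, 10)) / nrow(perms)
print(perm_dist)

# Assumed alternative: negative association, so the p-value is the
# proportion of orderings with r_s at or below the observed value.
# The small tolerance guards against floating-point ties.
p_value <- mean(rs_perm <= r_obs + 1e-12)
print(p_value)
```

With only three pairs the smallest attainable one-sided p-value is 1/6, so no ordering of this data set could be significant at the 0.05 level; the enumeration is useful mainly for seeing how the distribution is built.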
(d) Determine the value of the standardized test statistic for the large-sample test, using hand calculations.
(e) Determine the asymptotic p-value of this test using hand calculations.
(f) Plot the asymptotic distribution of your standardized test statistic under $H_0$, and shade in the appropriate region corresponding to the p-value.
(g) Determine the asymptotic p-value of this test using cor.test.

Caution in Using the Pearson or Spearman Correlation

Caution: Time-dependent data may invalidate the independence between the $(X, Y)$ data pairs.

Example: Suppose $x$ is the year, and $y$ is the average ocean temperature for the year.

5.4 Permutation Tests for Contingency Tables

Scenario: We have two categorical variables, and we enter the data into a table.

5.4.1 Hypotheses to be Tested and the Chi-Square Statistic

Problem 5.4.1 (gender and handedness, hypothetical data): The following hypothetical data are based on gender and the preferred hand for writing among 10-year-old children.

Observed   Left   Right   total
Girls         5      83      88
Boys          7      65      72
total        12     148     160

The hypothesis test will be a test for association.

(a) What is the hypothesis test of interest?
(b) Estimate the expected (i.e., average) number of left-handed girls under $H_0$.
(c) Estimate the expected (i.e., average) number of right-handed girls under $H_0$.
(d) Estimate the expected (i.e., average) number of left-handed boys under $H_0$.
(e) Estimate the expected (i.e., average) number of right-handed boys under $H_0$.

Expected   Left   Right   total
Girls       6.6    81.4      88
Boys        5.4    66.6      72
total        12     148     160

(f) What is the general rule for computing these expected cell frequencies under $H_0$?

Formula: The chi-square test statistic is defined as

$$V = \sum_{i} \sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}},$$

where the sums run over all rows $i$ and all columns $j$ of the table.
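The expected counts and the chi-square statistic for the hypothetical gender and handedness table can be checked with a few lines of base R. This is a sketch under the usual rule that each expected count equals (row total × column total) / grand total; the object names (`obs`, `expected`, `V`, `dof`) are illustrative, and the continuity correction in `chisq.test` is switched off so that its statistic matches the formula for $V$.

```r
# Observed 2 x 2 table from problem 5.4.1 (hypothetical data).
obs <- matrix(c(5, 83,
                7, 65),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Girls", "Boys"), c("Left", "Right")))

# Expected counts under H0: (row total * column total) / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
print(expected)

# Chi-square statistic V and its degrees of freedom.
V <- sum((obs - expected)^2 / expected)
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Asymptotic p-value: upper tail of the chi-square distribution.
p_value <- pchisq(V, df = dof, lower.tail = FALSE)
print(c(V = V, dof = dof, p_value = p_value))

# Cross-check with the built-in test, without continuity correction.
check <- chisq.test(obs, correct = FALSE)
print(check)
```

The printed expected counts should match the "Expected" table in the notes (6.6, 81.4, 5.4, 66.6), and the built-in test should return the same statistic and p-value as the manual computation.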
(g) Compute the chi-square test statistic for this data set.

Asymptotic result: For large sample sizes (i.e., if $e_{ij} > 5$ for all cells), the test statistic $V$ has approximately a $\chi^2$ distribution with degrees of freedom equal to (number of rows − 1) × (number of columns − 1).

(h) How many degrees of freedom are associated with this test?
(i) Determine the asymptotic p-value associated with this test.
(j) Plot the asymptotic distribution of your test statistic under $H_0$, and shade in the appropriate region corresponding to the p-value.
(k) Determine the asymptotic p-value using the macro chisq.test.

Example (gender and handedness, real data): In a survey of Scottish school children, aged approximately ten to twelve years, the teacher observed whether the pupil wrote with the left or right hand, with the following results (Clark, 1957, Left Handedness, University of London Press, London).

Observed    Left     Right   Percentage left
Girls      1,478    25,045        5.57
Boys         991    12,629        7.28

Problem 5.4.2 (Roosevelt): The following results were obtained in a 1948 study of the 1944 Presidential election in Elmira, New York (McCarthy, 1957, Introduction to Statistical Reasoning, McGraw-Hill).

                          Individual interviewed on             Percentage reached
1944 Pres. vote        First call   Second or later call  Total    on first call
Roosevelt                  138             217             355         38.9
Dewey                      124             200             324         38.3
Did not vote                90             142             232         38.8
Other or too young          39              78             117         33.3
Total                      391             637            1028         38.0

Test the hypothesis that the distribution of responses is the same for individuals reached on the first call as for those interviewed on the second or later calls.

5.4.2 Permutation Chi-Square Test

Herein, we use the same chi-square
test statistic,

$$V = \sum_{i} \sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}.$$

However, we do NOT use the $\chi^2$ distribution to approximate the p-value. Instead, the permutation distribution of the test statistic is determined exactly or is simulated.

Determining the permutation distribution

Revisit problem 5.4.1 (gender and handedness, hypothetical data): The following hypothetical data are based on gender and the preferred hand for writing among 10-year-old children.

Observed   Left   Right   total
Girls         5      83      88
Boys          7      65      72
total        12     148     160

• Fix all of the margin totals: 12, 148, 88, 72, 160. Is our observed table rare under $H_0$? Under these fixed margins, what are other possible values of the table? Specifically, what are the possible values for $X$, the number of left-handed girls? Based on a permutation involving, say, 10 left-handed girls, complete the rest of the table.
• How many degrees of freedom are associated with this test?
• The chi-square test statistic, $V$, may be determined for each value of $X$, the number of left-handed girls, for $X = 0, 1, \ldots, 12$.
• To obtain the permutation distribution of $V$ under these fixed margins, the probabilities of $V$ (or $X$) may be determined based on the hypergeometric distribution.

A hypergeometric distribution is similar to a binomial distribution, except that a hypergeometric distribution is based on sampling WITHOUT replacement.

Example (aside): Suppose a classroom has 10 female students and 7 male students. Sample 5 students WITHOUT replacement at random, and let $W$ be the number of female students in the sample. Then $W$ has a hypergeometric distribution.

The p-value for this permutation distribution is based on either all possible permutations of $V$ or a large number of simulated permutations of $V$.

5.5 Fisher's Exact Test for a 2 × 2 Contingency Table

Fisher's exact test is similar to the permutation chi-square
test, except the permutations are based on $X$ (say, the number of left-handed girls), rather than $V$, where

$$V = \sum_{i} \sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}.$$

The permutation probabilities of $X$ are again determined by the hypergeometric distribution under the fixed margin totals.

Revisit problem 5.4.1 (gender and handedness, hypothetical data): The following hypothetical data are based on gender and the preferred hand for writing among 10-year-old children.

Observed   Left   Right   total
Girls         5      83      88
Boys          7      65      72
total        12     148     160

Let $X$, the test statistic, be the number of left-handed girls. What does a large value of $X$ suggest? What does a small value of $X$ suggest?

Test if the proportion of girls who are left-handed is smaller than the proportion of boys who are left-handed.

Example (Bush vs. Gore, Election of 2000)

Summary: The Presidential Election between George W. Bush and Albert Gore took place on November 7, 2000. The vote was quite close in Florida, the winner of which would win the election. On November 8, the count resulted in a small lead for Bush. Gore sought a manual recount in several Florida counties, so the process of recounting votes began, as permitted by the Florida Supreme Court. Bush argued that recounting only certain counties violated the "equal protection" clause of the fourteenth amendment to the U.S. Constitution, and Bush also argued that Florida's electors should be selected by the December 12 deadline. On December 12, a 5-4 majority of the U.S. Supreme Court ruled that no constitutionally valid recount could be completed by the December 12 deadline, effectively ending the recounts.

Is there statistically significant evidence that the U.S. Supreme Court Justices tended to favor
their own political party (i.e., according to the political party of the President who appointed the Justice)? Use Fisher's exact test.

Justice                 Appointed by President   Decision
William Rehnquist       Reagan                   End recount
Sandra Day O'Connor     Reagan                   End recount
Antonin Scalia          Reagan                   End recount
Anthony Kennedy         Reagan                   End recount
Clarence Thomas         G. H. W. Bush            End recount
John Paul Stevens       Ford                     Continue recount
David Souter            G. H. W. Bush            Continue recount
Ruth Ginsburg           Clinton                  Continue recount
Stephen Breyer          Clinton                  Continue recount

Is the proportion of Republican-appointed Justices who voted to end the recount significantly large?

Example (sampling and the U.S. Census, 1999): The U.S. Census contains error, in that some individuals are not counted, quite often in regions dominated by Democrats. The statistical technique of sampling could greatly reduce this error, and consequently could affect the apportionment in the House of Representatives in favor of the Democrats. The Clinton administration wanted to use sampling, but the Republicans opposed the use of sampling, for determining seats in the House of Representatives. On January 25, 1999, the U.S. Supreme Court ruled 5 to 4 against the use of sampling in the Census for the purpose of apportioning seats in the House of Representatives among the states.

Is there statistically significant evidence that the U.S. Supreme Court Justices tended to favor their own political party (i.e., according to the political party of the President who appointed the Justice)? Use Fisher's exact test.

Justice                 Appointed by President   Use sampling
William Rehnquist       Reagan                   no
Sandra Day O'Connor     Reagan                   no
Antonin Scalia          Reagan                   no
Anthony Kennedy         Reagan                   no
Clarence Thomas         G. H. W. Bush            no
John Paul Stevens       Ford                     yes
David Souter            G. H. W. Bush            yes
Ruth Ginsburg           Clinton                  yes
Stephen Breyer          Clinton                  yes

1 One-Sample Methods

This chapter discusses nonparametric hypothesis tests and confidence intervals based on the binomial distribution.

1.1 A Nonparametric Test of Hypothesis and Confidence Interval for the Median

Notation: $\theta_{0.5}$ is the unknown population median of a continuous distribution. Interpret $\theta_{0.5}$.

What do heavy-tailed distributions tend to produce?

Example:
> x <- c( 20:28, 100 )
> mean( x )
> mean( x[1:9] )
> median( x )
> median( x[1:9] )

1.1.1 Binomial Test

Suppose that under the null hypothesis $H_0$, the population median is 75. Sample 40 independent observations, and let $B$ be the number of observations larger than 75. What is the distribution of $B$ under $H_0$?

Example 1.1.1 (p. 11): Suppose a certain food product is advertised to contain 75 mg of sodium per serving, but preliminary studies indicate that servings may contain more than that amount. The amount of sodium in the product varies from one serving to another. Test if the median amount of sodium per serving is 75 mg versus the alternative that the median is greater than 75 mg, at level $\alpha = 0.05$.

$H_0$: $\theta_{0.5} = 75$ mg, $H_a$: $\theta_{0.5} > 75$ mg

Suppose an observation results in exactly 75 mg. Should this particular observation support $H_0$ or $H_a$?

Note: Some textbooks recommend discarding ties for the binomial test, since such rounded observations are not informative.

Under $H_0$, what proportion of the observations do we expect on average to be strictly greater than 75 mg?

Download the data set.
> x <- scan( "httpwwwmathjmuedugarrenstmath324dirdatasetstablel11" )
> x <- scan2( "table1.1.1" )
> x
> length( x )

Is this a one-tailed test or a two-tailed test?

Does 26 fall far enough into the tail of the Binomial($n = 40$, $p = 0.5$) distribution to reject $H_0$ at level $\alpha = 0.05$?

Solve the problem again, this time excluding
ties (i.e., discard all observations which are 75).
> n <- sum( x != 75 )

Does 26 fall far enough into the tail of the Binomial($n = 39$, $p = 0.5$) distribution to reject $H_0$? Determine the p-value.

R has a built-in macro called binom.test. binom.test does not request the original data set, but rather the number of successes and the number of trials.
> binom.test( 26, 39, alternative = "greater" )

Textbook uses normal approximation to the binomial.

Brief review: If the distribution is symmetric, how do the population mean and population median compare?

The binomial test typically should be used in place of the t-test when the distribution is symmetric and the tails are heavy.

1.1.2 Confidence Interval

Explain how confidence intervals and hypothesis tests are related via the following:
$H_0$: $\theta_{0.5} = \theta_H$, $H_a$: $\theta_{0.5} \ne \theta_H$, with $\alpha = 0.05$
95% confidence interval on $\theta_{0.5}$

Textbook gives a normal approximation for constructing this confidence interval regarding the binomial test. However, we can compute this confidence interval exactly using a macro, which effectively performs a hypothesis test (based on the binomial test) on each of the observations, using exact binomial probabilities, not normal approximations to binomials.

Example: Consider the same data set on sodium content from Table 1.1.1, p. 12, in Example 1.1.2. Construct a 95% confidence interval on the population median, where the only assumption on the population is continuity.
> x <- scan2( "table1.1.1" )
> cibinomtest

We are 95% confident that the population median sodium content per serving is between 75.0 mg and 77.1 mg.

Homework C1.1.1 (Understanding nonparametric hypothesis tests and confidence intervals): A simple random sample of size $n = 26$ is drawn from a population, whose only assumption is continuity.
(a) Suppose one is interested in testing $H_0$: $\theta_{0.5} = 80$ versus $H_a$: $\theta_{0.5} \ne 80$ at level $\alpha = 0.05$. Clearly
explain in detail, using more words than mathematics, how to test these hypotheses, based on the binomial test. In other words, explain HOW to use any testing procedures and the NAMES of any random variables used.

(b) Suppose one is interested in constructing a 95% confidence interval on the population median, based on the binomial test. Clearly explain in detail, using more words than mathematics, how to construct this confidence interval, based on the binomial test.

End of Homework C1.1.1

1.2 Estimating the Population cdf and Percentiles

1.2.1 Confidence Interval for the Population cdf

Recall: The cumulative distribution function of a random variable $X$ is $F(x) = P(X \le x)$ for all real $x$.

In most practical situations, the population is unknown, so $F$ is also unknown. Take a simple random sample from this unknown population. For any fixed value of $x$, how can we estimate $F(x)$?

The empirical cdf is denoted by $\hat{F}$.

Example 1.2.1 (p. 15): Testing of electrical and mechanical devices often involves an action such as turning a device on and off or opening and closing a device many times. The interest is in the distribution of the number of on-off or open-close cycles that occur before the device fails. The hypothetical data in Table 1.2.1 are the number of cycles (in thousands) that it takes for 20 door latches to fail.

(a) Evaluate the empirical cdf at 25.
> x = scan2("table1.2.1")
> length(x)
> x <= 25
> sum(x <= 25)
> mean(x <= 25)

(b) Using computer software, graph the empirical cdf for this data set.
> empiricalcdf

(c) Suppose we were able to sample more and more observations from this population. Explain how the empirical cdf should appear.

Recall: The empirical cdf is a special case of a sample proportion. What do we know about sample proportions for large sample sizes?
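To illustrate why the empirical cdf at a fixed point is just a sample proportion, here is a small sketch. The data vector is hypothetical (it is NOT the door-latch data of Table 1.2.1), and the helper name `Fhat_at` is an assumption introduced only for this illustration.

```r
# Hypothetical failure times in thousands of cycles; not the values in Table 1.2.1.
x <- c(7, 11, 15, 16, 20, 22, 24, 25, 29, 33)

# Empirical cdf at a fixed point t: the proportion of observations <= t
Fhat_at <- function(x, t) mean(x <= t)

phat <- Fhat_at(x, 25)

# The built-in ecdf() returns the whole step function; it agrees at t = 25
Fhat <- ecdf(x)
same <- isTRUE(all.equal(phat, Fhat(25)))

print(phat)
print(same)
```

Because `phat` is a proportion of independent yes/no outcomes, the usual large-sample results for sample proportions apply to it.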
Revisit Example 1.1.2 (p. 12) regarding sodium content.

(a) Determine the empirical cdf evaluated at 75.
> x = scan2("table1.1.1")

(b) Graph the empirical cdf.

(c) Is a normal approximation valid for constructing a 95% confidence interval on $F(75)$?
> n = length(x)

(d) Construct a 95% confidence interval on $F(75)$, based on the normal approximation.
> phat = mean(x <= 75)   # empirical cdf at 75, as in part (a)
> c( phat - qnorm(0.975) * sqrt( phat * (1-phat) / n ), phat + qnorm(0.975) * sqrt( phat * (1-phat) / n ) )

(e) Is the above confidence interval wide or narrow?

1.2.2 Inference for Percentiles

Question: What is the 25th percentile of a distribution governed by the random variable $X$?

Find the 25th percentile of a $N(\mu = 500, \sigma = 100)$ distribution.
> shadedist( 432.551, dnorm, 500, 100 )
> shadedist( qnorm(0.25, 500, 100), dnorm, 500, 100 )

What does the 50th percentile represent?

Notation: $\theta_p$ represents the 100p-th percentile, for $0 \le p \le 1$.

Example: Sample 50 observations from an unknown continuous population.

(a) Clearly explain, using more words than mathematics, how we might test $H_0: \theta_{0.3} = 200$ versus $H_a: \theta_{0.3} \ne 200$ at level $\alpha = 0.05$.

(b) Suppose one is interested in constructing a 95% confidence interval on $\theta_{0.3}$, based on the binomial test. Clearly explain in detail, using more words than mathematics, how to construct this confidence interval, based on the binomial test.

Textbook discusses using the normal approximation to the binomial or sample proportion for inference on percentiles.

Example 1.2.4 (expanded): Refer to the "door latch" data in Table 1.2.1.

(a) Determine a point estimate of the 25th percentile, i.e., determine the 25th sample percentile.
> x = scan2("table1.2.1")
Sort the observations from smallest to largest.
> length(x) * 0.25
> ?quantile

(b) Construct a 90% confidence interval on the 25th percentile.
> cibinomtest

(c) Construct a 95% confidence interval on the
25th percentile.

(d) Construct a 90% confidence interval on the 50th percentile.

(e) Construct a 90% confidence interval on the 75th percentile.

1.3 A Comparison of Statistical Tests

Compare the t-test with the binomial test under specific distributions and sample sizes.

A type I error is the event that $H_0$ is rejected when $H_0$ is true. We denote $\alpha = P(\text{type I error})$, and often we set $\alpha = 0.05$.

A type II error is the event that $H_0$ is NOT rejected when $H_a$ is true.

power $= 1 - P(\text{type II error})$

Power is the probability that $H_0$ is rejected when $H_a$ is true.

Do we want $\alpha$ to be large or small?
Do we want power to be large or small?
What happens to power as $\alpha$ gets small?
How can we decrease $\alpha$ and increase power simultaneously?

In general, power is good.

Steps for computing power of a t-test:
Compute the power of the t-test, where the alternative distribution is $N(\mu_a, \sigma)$, and we are testing $H_0: \mu = \mu_0$ versus $H_a: \mu > \mu_0$.

1. Find the rejection region in terms of $T^*$ such that $H_0$ is rejected if and only if $T^*$ is in the rejection region, where $T^* = (\bar{X} - \mu_0)/(s/\sqrt{n})$, for a specific value of $\alpha$.

2. Compute the probability that $T^*$ is in the rejection region, under the assumption that the population is $N(\mu_a, \sigma)$.

Note: Step 1 involves using the t table, and step 2 involves using the noncentral t table with noncentrality parameter $(\mu_a - \mu_0)/(\sigma/\sqrt{n})$. This table is easily tabulated by R.

Note: Your textbook loses accuracy by approximating the t distribution with a normal distribution twice, and names the test the CLT test. Hence, in homework exercises, replace the term "CLT test" by "t-test".
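The two steps above can be sketched in base R as follows. The numerical values ($n = 40$, $\mu_0 = 75$, $\mu_a = 75.8$, $\sigma = 2.5$) are illustrative inputs chosen to match the sodium setting, and the variable names are assumptions made for this sketch only.

```r
# Power of the one-sided, one-sample t-test via the noncentral t distribution.
n     <- 40     # sample size
mu0   <- 75     # mean under H0
mu_a  <- 75.8   # mean under the alternative
sigma <- 2.5    # population standard deviation under the alternative
alpha <- 0.05

# Step 1: rejection region T* > t_crit, from the central t with n - 1 df
t_crit <- qt(1 - alpha, df = n - 1)

# Step 2: probability of the rejection region under the noncentral t
ncp   <- (mu_a - mu0) / (sigma / sqrt(n))
power <- pt(t_crit, df = n - 1, ncp = ncp, lower.tail = FALSE)

# Cross-check against the built-in power.t.test
check <- power.t.test(n = n, delta = mu_a - mu0, sd = sigma, sig.level = alpha,
                      type = "one.sample", alternative = "one.sided")$power

print(c(t_crit = t_crit, ncp = ncp, power = power, check = check))
```

Computing the power by hand and then comparing it with `power.t.test` shows that the built-in routine follows the same two-step recipe.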
Example 1.3.1 (p. 19), first part: t-test with normal alternative. Refer to Example 1.1.2 regarding sodium content of 40 servings of a food product. The hypothesis test is $H_0: \mu = 75$ mg versus $H_a: \mu > 75$ mg, using $\alpha = 0.05$. Using the following steps, compute the power of the t-test under the alternative that the sodium content is distributed as $N(\mu = 75.8 \text{ mg}, \sigma = 2.5 \text{ mg})$.

(a) Define the standardized test statistic.

(b) Determine the exact rejection region for your test statistic in part (a).

(c) Graph the pdf of your standardized test statistic under $H_0$, and shade in the region corresponding to $\alpha$.

(d) Determine the noncentrality parameter.

(e) Compute the power of the t-test under the alternative that the sodium content is distributed as $N(\mu = 75.8 \text{ mg}, \sigma = 2.5 \text{ mg})$, without using "power.t.test".

Below is the graph of the pdf of $T^*$ under $H_a$.
> shadedist( qt(0.95, 39), "dt", 39, ncp, F )

Below is the graph of the pdf of $T^*$ approximated by a $N(\mu = 2.023858, \sigma = 1)$ under $H_a$.
> shadedist( qnorm(0.95), "dnorm", ncp, 1, F )
> plotdist( "dnorm", ncp, 1, "dt", 39, ncp )   # Compare normal to noncentral t.

(f) Graph the pdf of $T^*$ simultaneously under $H_0$ and under $H_a$.
> plotdist( "dt", 39, NULL, "dt", 39, ncp )

Example 1.3.1 (p. 19), second part: t-test with Normal and Laplace alternatives. Refer to Example 1.1.2 regarding sodium content of 40 servings of a food product. The hypothesis test is $H_0: \mu = 75$ mg versus $H_a: \mu > 75$ mg, using $\alpha = 0.05$.

(a) Compute the power of the t-test under the alternative that the sodium content is distributed as $N(\mu = 75.8 \text{ mg}, \sigma = 2.5 \text{ mg})$.
> ?power.t.test
Note well: Use "delta" > 0.
> power.t.test( 40, 0.8, 2.5, type = "one.sample", alternative = "one.sided" )

(b) Compute the power of the t-test under the alternative that the sodium content is distributed as $N(\mu = 76.1 \text{ mg}, \sigma = 2.5 \text{ mg})$. What should happen to the power?

(c) What power should we obtain by plugging in zero for "delta" in "power.t.test"?

(d) Compute the exact power of the t-test under the alternative that the sodium content is distributed as $N(\mu = \mu_a, \sigma = 2.5 \text{ mg})$, for values of $\mu_a$ equal to 74.8, 75, 75.2, 75.4, 75.6, and 75.8.
> delta = c(74.8, 75, 75.2, 75.4, 75.6, 75.8) - 75
> power.t.test( 40, delta, 2.5, type = "one.sample", alternative = "one.sided" )

(e) Next, graph the Laplace($\mu = 75$, $\sigma = 2.5$) probability density function.

(f) Compute the approximate power of the t-test under the alternative that the sodium content is distributed as Laplace($\mu = \mu_a$, $\sigma = 2.5$ mg), for values of $\mu_a$ equal to 74.8, 75, 75.2, 75.4, 75.6, and 75.8.

Computing power of a binomial test

Example 1.3.1 (p. 19), third part: binomial test with normal alternative. Refer to Example 1.1.2 regarding sodium content of 40 servings of a food product. The hypothesis test is $H_0: \mu = 75$ mg versus $H_a: \mu > 75$ mg, using $\alpha = 0.05$. Using the following steps, compute the exact power of the binomial test under the alternative that the sodium content is distributed as $N(\mu = 75.8 \text{ mg}, \sigma = 2.5 \text{ mg})$.

(a) Define the test statistic.

(b) Graph the pdf of your test statistic under $H_0$.

(c) Determine the exact rejection region for your test statistic in part (a).

(d) Determine the exact size of this test.
> shadedist( 25.5, dbinom, 40, 0.5, F )

(e) Determine the probability that a randomly sampled $N(\mu = 75.8 \text{ mg}, \sigma = 2.5 \text{ mg})$ observation is larger than 75.

(f) Graph the pdf of a $N(\mu = 75.8, \sigma = 2.5)$ distribution, and shade in the region corresponding to the above probability.

(g) Let $B$ be the number of observations larger than 75. What is the distribution of $B$ under the alternative that the sodium content is distributed as $N(\mu = 75.8 \text{ mg}, \sigma = 2.5 \text{ mg})$?

(h) Compute the exact power of the binomial test under the alternative that the sodium content is distributed as $N(\mu = 75.8 \text{ mg}, \sigma = 2.5 \text{ mg})$, without using "powerbinomtest".
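Steps (c) through (h) above can be strung together in a few lines of base R. This is a hedged sketch rather than the course macro: the variable names are assumptions, and the cutoff rule (smallest cutoff whose null tail probability does not exceed $\alpha$) is one standard way to define the exact rejection region.

```r
# Exact power of the one-sided binomial test under a normal alternative.
n      <- 40
alpha  <- 0.05
theta0 <- 75      # median under H0
mu_a   <- 75.8    # mean of the normal alternative
sigma  <- 2.5     # standard deviation of the normal alternative

# Smallest cutoff c_crit with P(B >= c_crit) <= alpha when B ~ Binomial(n, 0.5)
c_crit <- qbinom(1 - alpha, n, 0.5) + 1

# Exact size of the test: the null probability of the rejection region
size <- pbinom(c_crit - 1, n, 0.5, lower.tail = FALSE)

# Probability that a single observation exceeds theta0 under the alternative
p_a <- pnorm(theta0, mean = mu_a, sd = sigma, lower.tail = FALSE)

# Under the alternative B ~ Binomial(n, p_a), so power = P(B >= c_crit)
power <- pbinom(c_crit - 1, n, p_a, lower.tail = FALSE)

print(c(c_crit = c_crit, size = size, p_a = p_a, power = power))
```

Swapping `pnorm` for the cdf of another alternative changes only `p_a`; the rejection region and the size depend on the null distribution alone.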
Example 1.3.1 (p. 19), fourth part: binomial test with Laplace alternative. Refer to Example 1.1.2 regarding sodium content of 40 servings of a food product. The hypothesis test is $H_0: \mu = 75$ mg versus $H_a: \mu > 75$ mg, using $\alpha = 0.05$. Using the following steps, compute the exact power of the binomial test under the alternative that the sodium content is Laplace distributed with $\mu = 75.8$ mg and $\sigma = 2.5$ mg.

(a) Define the test statistic.

(b) Determine the exact rejection region for your test statistic in part (a).

(c) Determine the probability that a randomly sampled Laplace($\mu = 75.8$ mg, $\sigma = 2.5$ mg) observation is larger than 75.

(d) Graph the pdf of a Laplace($\mu = 75.8$, $\sigma = 2.5$) distribution, and shade in the region corresponding to the above probability.

(e) Let $B$ be the number of observations larger than 75. What is the distribution of $B$ under the alternative that the sodium content is distributed as Laplace($\mu = 75.8$ mg, $\sigma = 2.5$ mg)?

(f) Compute the exact power of the binomial test under the alternative that the sodium content is distributed as Laplace($\mu = 75.8$ mg, $\sigma = 2.5$ mg), without using "powerbinomtest".

Example 1.3.1 (p. 19), fifth part: summary comparing t-test and binomial test. Refer to Example 1.1.2 regarding sodium content of 40 servings of a food product. The hypothesis test is $H_0: \mu = 75$ mg versus $H_a: \mu > 75$ mg, using $\alpha = 0.05$.

(a) Determine the exact power of the t-test under the alternative that the sodium content is distributed as $N(\mu = 75.8 \text{ mg}, \sigma = 2.5 \text{ mg})$.
> power.t.test( 40, 0.8, 2.5, type = "one.sample", alternative = "one.sided" )

(b) Determine the exact power of the binomial test under the alternative that the sodium content is distributed as $N(\mu = 75.8 \text{ mg}, \sigma = 2.5 \text{ mg})$.
> powerbinomtest
> powerbinomtest( 40, 0.05, "greater", 75, pnorm, 75.8, 2.5 )

(c) Which test, the t-test or the binomial test, is better, i.e., more powerful, under this normal alternative?

(d) Determine the approximate power of the t-test under the alternative that the sodium content is distributed as Laplace($\mu = 75.8$ mg, $\sigma = 2.5$ mg).

(e) Determine the exact power of the binomial test under the alternative that
the sodium content is distributed as Laplace($\mu = 75.8$ mg, $\sigma = 2.5$ mg).

(f) Which test, the t-test or the binomial test, is better, i.e., more powerful, under this Laplace alternative?

Summary:
• Often, the t-test is more powerful than the binomial test for light-tailed distributions such as the normal distribution.

2 Two-Sample Methods

Review of the two-sample t-test

Problem 2.1.1 (two diets): Nieman, Groot, and Jansen (1952), The nutritive value of butter fat compared with that of vegetable fats, I, Koninkl. Ned. Akad. Van Wetenschap., Ser. C, 55, 588-604. In a comparison of the effect on growth of two diets A and B, a number
of growing rats were placed on these two diets, and the following growth figures were observed after 7 weeks.

A: 156 183 120 113 138 145 142
B: 109 107 119 162 121 123 76 111 130 115

Use the two-sample t-test to test if the mean growth rate differs for the two diets, at level $\alpha = 0.05$. Do not assume that the two populations have equal variances.
> z = scan("http://www.math.jmu.edu/~garrenst/math324.dir/datasets/problem2.1.1", comment.char = "#")
> z = scan2("problem2.1.1")
> x = z[1:7] ; y = z[8:length(z)]

$H_0: \mu_x = \mu_y$, $H_a: \mu_x \ne \mu_y$

$T = \dfrac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{s_x^2/m + s_y^2/n}}$

The estimated number of degrees of freedom is

$\dfrac{\left( s_x^2/m + s_y^2/n \right)^2}{\dfrac{s_x^4}{m^2 (m-1)} + \dfrac{s_y^4}{n^2 (n-1)}}$

You do NOT need to memorize this formula for estimated degrees of freedom.
> mean(x)
> mean(y)
> sd(x)
> sd(y)
> ?t.test

What assumptions were necessary for performing this two-sample t-test?
> stripchart( x, "stack" )   # For a dotplot of x.
> stripchart( y, "stack" )   # For a dotplot of y.
> linegraph(x)   # For a line graph of x.
> linegraph(y)   # For a line graph of y.

2.1.1 The Permutation Test

Revisit example (two diets). Test for inequality of means at level 0.05.

A: 156 183 120 113 138 145 142
B: 109 107 119 162 121 123 76 111 130 115

Diet A has 7 observations, and diet B has 10 observations. If $\bar{X} - \bar{Y}$ is sufficiently large or small, then reject $H_0$ in favor of $H_a$.

First, compute $\bar{X} - \bar{Y}$. Next, permute the 17 observations, and assign 7 of them to diet A and the remaining 10 of them to diet B. Compute the permuted value of $\bar{X} - \bar{Y}$.

How many combinations exist for the 17 observations, with 7 of them assigned to diet A and 10 of them assigned to diet B?

Compute $\bar{X} - \bar{Y}$ for each permutation. Determine the proportion of times that the permuted values of $|\bar{X} - \bar{Y}|$ are at least as large as the observed value of $|\bar{X} - \bar{Y}|$. That will be your p-value!
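The procedure just described can be carried out exhaustively in base R for the two-diet data. This is a minimal sketch, not the course's own macro. The variable names are illustrative, and measuring "at least as extreme" through the absolute difference is the two-sided convention stated above.

```r
# Exact two-sided permutation test based on the difference in means.
a <- c(156, 183, 120, 113, 138, 145, 142)                 # diet A
b <- c(109, 107, 119, 162, 121, 123, 76, 111, 130, 115)   # diet B

pooled   <- c(a, b)
m        <- length(a)
observed <- mean(a) - mean(b)

# Each column of idx is one way of choosing which 7 observations go to diet A
idx <- combn(length(pooled), m)

# Difference in means for every grouping
perm_diff <- apply(idx, 2, function(i) mean(pooled[i]) - mean(pooled[-i]))

# Proportion of groupings at least as extreme as the observed difference
# (the small tolerance guards against floating-point ties)
p_value <- mean(abs(perm_diff) >= abs(observed) - 1e-12)

print(ncol(idx))     # number of groupings, choose(17, 7)
print(observed)
print(p_value)
```

Full enumeration is feasible here because there are only `choose(17, 7)` groupings. For larger samples the same statistic would be evaluated on a random subset of groupings instead.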
Problem 2.1.2 (simpler example, smaller sample sizes): Consider the following mutually independent observations from two different continuous populations.

A: 35 25
B: 50 30 70

(a) Test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, using the difference in means as the test statistic.

Compute the observed value of $\bar{X} - \bar{Y}$.
> mean( c(35, 25) ) - mean( c(50, 30, 70) )

How many combinations exist for the 5 observations, with 2 of them assigned to diet A and 3 of them assigned to diet B?

List all possible groupings of the 5 observations, and compute the permuted value of $\bar{X} - \bar{Y}$, along with $\sum_{i=1}^{2} X_i$.

Permuted Samples
      Pop A    Pop B       $\bar{X} - \bar{Y}$    $\sum_{i=1}^{2} X_i$
 1    25 30    35 50 70        -24.17                  55
 2    25 35    30 50 70        -20                     60
 3    25 50    30 35 70         -7.5                   75
 4    25 70    30 35 50          9.17                  95
 5    30 35    25 50 70        -15.83                  65
 6    30 50    25 35 70         -3.33                  80
 7    30 70    25 35 50         13.33                 100
 8    35 50    25 30 70          0.83                  85
 9    35 70    25 30 50         17.5                  105
10    50 70    25 30 35         30                    120

List the permutation distribution of $\bar{X} - \bar{Y}$.

$\bar{X} - \bar{Y}$:   -24.17   -20   -15.83   -7.5   -3.33   0.83   9.17   13.33   17.5   30   | sum
probability:            0.1     0.1    0.1     0.1    0.1    0.1    0.1    0.1     0.1   0.1  | 1

Determine the proportion of times that the permuted values of $\bar{X} - \bar{Y}$ are at least as small as the observed value of $\bar{X} - \bar{Y}$.

(b) Test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, using the sample sum from population A as the test statistic.

(c) Test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x \ne \mu_y$ at level 0.05, using the difference in means as the test statistic. Is the p-value $2 \times 0.2 = 0.4$?
> linegraph( c(-24.17, -20, -7.5, 9.17, -15.83, -3.33, 13.33, 0.83, 17.5, 30) )

(d) Test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, using the difference in means as the test statistic, and using the macro "permtest".
> permtest( c(25, 35), c(30, 50, 70), "less" )
> x = c(25, 35) ; y = c(30, 50, 70)
> permtest( x, y, "less" )

(e) Test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x \ne \mu_y$ at level 0.05, using the difference in means as the test statistic, and using the macro "permtest".

Revisit Problem 2.1.1 (two diets). Test for inequality of means at level 0.05.

$H_0: \mu_x = \mu_y$, $H_a: \mu_x \ne \mu_y$

A: 156 183 120 113 138 145 142
B: 109 107 119 162 121 123 76 111 130 115

How many combinations are required to calculate this exact p-value?

When using "permtest", the parameter "numsim" is an upper limit on the number of combinations to be computed.
> permtest( x, y, numsim = 2e4 )

Example: Consider the nonparametric test based on the difference between two means, and also consider the nonparametric test based on the sample sum from population A. Are these two tests equivalent? Justify your answer mathematically.

2.2 Permutation Tests Based on the Median and Trimmed Means

Recall: Heavy-tailed distributions tend to produce outliers, which have large impacts on means but not on medians.

Hence, when sampling from heavy-tailed distributions, a nonparametric test based on the difference between the two medians might be more appropriate than one based on the difference between the two means.

2.2.1 A Permutation Test Based on Medians

Revisit Problem 2.1.2 (Section 2.1): Consider the following mutually independent observations from two different populations.

A: 35 25
B: 50 30 70

(a) Test the null hypothesis that the two population medians are equal versus the alternative that the median of population A is less than the median of population B, at level 0.05, using the difference in medians as the test statistic.

Compute the observed value of the difference in sample medians.

Permuted Samples
      Pop A    Pop B       med($X_i$) - med($Y_i$)
 1    25 30    35 50 70        -22.5
 2    25 35    30 50 70        -20
 3    25 50    30 35 70          2.5
 4    25 70    30 35 50         12.5
 5    30 35    25 50 70        -17.5
 6    30 50    25 35 70          5
 7    30 70    25 35 50         15
 8    35 50    25 30 70         12.5
 9    35 70    25 30 50         22.5
10    50 70    25 30 35         30

> x = c(25, 35) ; y = c(30, 50, 70)
> permtest( x, y, "less", stat = median )

(b) Test the null hypothesis that the two population medians are equal versus the alternative that they differ, at level 0.05, using the difference between medians as the test statistic.
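The table of permuted median differences can be regenerated, and the exact p-values read off, with a short base R sketch. This is an illustration under stated assumptions rather than the course macro: the variable names are made up here, and the two-sided p-value uses the absolute-value convention, which is one common choice.

```r
# Enumerate every grouping for Problem 2.1.2, using the difference in medians.
x <- c(25, 35)        # population A
y <- c(30, 50, 70)    # population B

pooled <- sort(c(x, y))
idx    <- combn(length(pooled), length(x))   # choose(5, 2) = 10 groupings

# Permuted statistic for each grouping, in the same order as the table
perm_stat <- apply(idx, 2, function(i) median(pooled[i]) - median(pooled[-i]))
observed  <- median(x) - median(y)

# One-sided ("less") and two-sided exact p-values
p_less      <- mean(perm_stat <= observed + 1e-12)
p_two_sided <- mean(abs(perm_stat) >= abs(observed) - 1e-12)

print(perm_stat)
print(c(p_less = p_less, p_two_sided = p_two_sided))
```

Replacing `median` with `mean` inside the two statistic lines reproduces the earlier table of mean differences, so the same enumeration serves both test statistics.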
of medians at level 0.05, using the difference in medians as the test statistic.

$H_0$: the two population medians are equal, $H_a$: the two population medians differ.

A: 156 183 120 113 138 145 142
B: 109 107 119 162 121 123 76 111 130 115

2.2.2 Trimmed Means

A trimmed mean is a compromise between a sample mean and a sample median.

For example, a 20% trimmed mean is the mean of the middle 80% of the observations. In other words, we delete the smallest 10% and the largest 10% of the observations, and then average the remaining 80% of the observations to obtain our 20% trimmed mean.

Technically, when trimmed sample means are used, the hypothesis test is based on a trimmed population mean. However, if the population is symmetric and has finite mean, then the population mean $\mu$ is equal to the trimmed population mean.

Problem 2.2.1 (hypnosis): Agosti and Camerota (1965), "Some effects of hypnotic suggestion on respiratory function," Intern. J. Clin. Exptl. Hypnosis, 13, 149-156.

At the beginning of a study of the effect of hypnotism, the following measurements of ventilation were taken on eight treatment subjects (to be hypnotized) and eight controls.

Control:   469 419 399 421 484 454 548 464
Treatment: 552 436 508 520 478 574 467 516

(a) Construct a linegraph and a Q-Q normal probability plot for each of the two data sets.

> z = scan2(problem221)

(b) Using the two-sample t test, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means of the control and treatment groups, respectively.

(c) Using the permutation test based on means, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means of the control and treatment groups, respectively. Compute the exact p-value.

> perm.test(x, y, "less", num.sim=2e4)
> perm.test(x, y, "less", num.sim=2e4, plot=T)   # To plot the permutation distribution.

(d) Using the permutation test based on medians, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the
population medians of the control and treatment groups, respectively. Compute the exact p-value.

(e) Using the permutation test based on a 25% trimmed mean, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means of the control and treatment groups, respectively. Compute the exact p-value.

Backup: Below is the sorted table.

Control:   399 419 421 454 464 469 484 548
Treatment: 436 467 478 508 516 520 552 574

> mean(sort(x)[2:7])
> ?mean

Value of test statistic is:

> perm.test
> perm.test(x, y, "less", num.sim=2e4, trim=0.125)

2.3 Random Sampling the Permutations

In the previous example (problem 2.2.1, hypnosis), the permutation tests involved computations based on $\binom{16}{8} = 12{,}870$ groupings (or combinations) of the 16 subjects.

> choose(16, 8)

Suppose we had 10 subjects in each group?

Suppose we had 20 subjects in each group?

Suppose we had 25 subjects in each group?

2.3.1 An Approximate p-Value Based on Random Sampling the Permutations

Recall section 0.6, regarding permutation methods.

Revisit problem 2.2.1 (hypnosis). Using the permutation test based on means, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means of the control and treatment groups, respectively. Approximate the p-value based on 10,000 randomly sampled permutations.

Control:   469 419 399 421 484 454 548 464
Treatment: 552 436 508 520 478 574 467 516

Steps for approximating the p-value:

Use the permutation test based on means to test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means. Approximate the p-value based on 10,000 randomly sampled permutations.

1. Compute $\bar{X} - \bar{Y}$, i.e., the difference in sample means for the observed data.

2. Randomly shuffle
the 16 observations to reassign 8 observations to the control group and 8 observations to the treatment group.

3. Compute $\bar{X} - \bar{Y}$ for this permuted data set.

4. Repeat steps 2 and 3 many times, such that all selected permutations of 16 observations are independent, until 10,000 independent permuted values of $\bar{X} - \bar{Y}$ are generated.

5. Determine the proportion of the permuted values of $\bar{X} - \bar{Y}$ which are at least as small as the observed value of $\bar{X} - \bar{Y}$. That is your approximated p-value!

> perm.test
> perm.test(x, y, "less", all.perms=F)

Accuracy of the Procedure

Suppose $p$ is the true exact p-value based on $\bar{X} - \bar{Y}$, when testing $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$.

What is the probability that a randomly selected permutation of the data produces a value of $\bar{X} - \bar{Y}$ at least as small as the observed value of $\bar{X} - \bar{Y}$, conditional on the original data set?

Let $\hat{p}$ be the sample proportion of times that randomly selected permutations of the data produce values of $\bar{X} - \bar{Y}$ at least as small as the observed value of $\bar{X} - \bar{Y}$.

What is the approximate distribution of $\hat{p}$, conditional on the original data set?

What is an approximate 95% confidence interval on $p$, conditional on the original data set?

Revisit problem 2.2.1 (hypnosis). Using the permutation test based on means, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means of the control and treatment groups, respectively. Construct a 95% confidence interval on the true p-value based on 10,000 randomly sampled permutations.

> pvalue = perm.test(x, y, "less", all.perms=F)$p.value
> J = 1e4
> pvalue - qnorm(0.975) * sqrt(pvalue * (1 - pvalue) / J)
> pvalue + qnorm(0.975) * sqrt(pvalue * (1 - pvalue) / J)

How can we decrease the width of the confidence interval?

2.4 Wilcoxon Rank-Sum Test

For a data set $X_1, X_2, \ldots, X_m$, the rank of an observation $X_i$ is
$R(X_i)$ = number of $X$'s $\le X_i$.

Revisit problem 2.1.2, section 2.1. Consider the following mutually independent observations from two different populations:

A: 35 25
B: 50 30 70

> z = c(35, 25, 50, 30, 70)
> rank(z)

Let $W_i$ be the sum of the ranks of the observations in treatment $i$, $i = 1, 2$.

Compare $W_1$ to all permuted values of $W_1$. The p-value is the proportion of permuted values of $W_1$ which are at least as extreme as the observed value of $W_1$.

How many ways can we divide the five observations into two groups, where group A has 2 observations and group B has 3 observations?

(a) Use hand calculations (i.e., calculations may use R but not any macros containing the word "test") and the Wilcoxon rank-sum test to test at level 0.05, $H_0$: the populations A and B are the same, versus $H_a$: the values in population A tend to be smaller than the values in population B.

Permuted samples
No.   Pop A    Pop B       R(X1)   R(X2)   W1
 1    25 30    35 50 70      1       2      3
 2    25 35    30 50 70      1       3      4
 3    25 50    30 35 70      1       4      5
 4    25 70    30 35 50      1       5      6
 5    30 35    25 50 70      2       3      5
 6    30 50    25 35 70      2       4      6
 7    30 70    25 35 50      2       5      7
 8    35 50    25 30 70      3       4      7
 9    35 70    25 30 50      3       5      8
10    50 70    25 30 35      4       5      9

(b) Repeat part (a) using the macro wilcox.test.

> ?wilcox.test
> wilcox.test(x, y, "less")

(c) Use the Wilcoxon rank-sum test to test at level 0.05, $H_0$: the populations A and B are the same, versus $H_a$: the populations A and B are different.

Under $H_0$, is $W_1$ symmetric?

(d) Repeat part (c) using the macro wilcox.test.

> ?wilcox.test

2.4.2 Comments on the Use of the Wilcoxon Rank-Sum Test

Let $W_1$ be the sum of the ranks of the $m$ observations in treatment 1, and let $W_2$ be the sum of the ranks of the $n$ observations in treatment 2.

What does $W_1/m - W_2/n$ represent?

Example: Is a test based on $W_1/m - W_2/n$ equivalent to a test based on $W_1$? Justify your answer mathematically.
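The hand calculation in section 2.4 and the question in section 2.4.2 can be checked numerically. The following is only a sketch in base R, not the course macro: it enumerates the 10 groupings with `combn`, computes the one-sided p-value for the problem 2.1.2 data, and compares $W_1/m - W_2/n$ with a linear function of $W_1$. The variable names (`w1.obs`, `p.less`, `max.gap`, and so on) are illustrative choices, not names used in the notes.

```r
# Data from problem 2.1.2 (section 2.1)
a <- c(35, 25)        # population A, m = 2
b <- c(50, 30, 70)    # population B, n = 3
m <- length(a); n <- length(b); N <- m + n

r <- rank(c(a, b))                 # ranks in the combined sample
w1.obs <- sum(r[1:m])              # observed rank sum for group A

# Every way of choosing which m of the N ranks belong to group A
groups <- combn(N, m)              # one column per grouping
w1 <- apply(groups, 2, function(idx) sum(r[idx]))
w2 <- sum(r) - w1                  # W1 + W2 = N(N+1)/2 when there are no ties

# One-sided p-value: proportion of permuted W1 at least as small as observed
p.less <- mean(w1 <= w1.obs)

# Compare W1/m - W2/n with the line intercept + slope * W1
d <- w1 / m - w2 / n
slope <- 1 / m + 1 / n
intercept <- -N * (N + 1) / (2 * n)
max.gap <- max(abs(d - (intercept + slope * w1)))
```

Since `max.gap` is zero up to rounding and the slope is positive, the two statistics order the groupings identically, which is one way to see why tests based on them give the same p-value.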
2.4.3 A Statistical Table for the Wilcoxon Rank-Sum Test AND 2.4.4 Computer Analysis

Since R provides high precision for p-values and confidence intervals for virtually any practical sample size, we will disregard Table A3, whose values of $m$ and $n$ cannot exceed 10, and whose list of $\alpha$ values is limited to 0.01, 0.025 and 0.05.

Homework C2.4.1: Perform the two-sided Wilcoxon rank-sum test at level $\alpha = 0.05$ for the following data:

A: 97 64 51
B: 45 73 32 21

(a) using ONLY hand calculations (i.e., you may use R, but not wilcox.test).
(b) using ONLY the R macro wilcox.test.

End of Homework C2.4.1

2.5 Wilcoxon Rank-Sum Test Adjusted for Ties

Example: Modify the data from problem 2.1.2, section 2.1. Consider the following mutually independent observations from two different populations:

A: 35 25
B: 50 35 70

The adjusted rank is the average rank of the tied observations.

What rank should be assigned to the two values of 35?

> z = c(35, 25, 50, 35, 70)
> rank(z)

Use hand calculations and the Wilcoxon rank-sum test to test at level 0.05, $H_0$: the populations A and B are the same, versus $H_a$: the values in population A tend to be smaller than the values in population B.

Permuted samples
No.   Pop A    Pop B       R(X1)   R(X2)   W1
 1    25 35    35 50 70     1       2.5     3.5
 2    25 35    35 50 70     1       2.5     3.5
 3    25 50    35 35 70     1       4       5
 4    25 70    35 35 50     1       5       6
 5    35 35    25 50 70     2.5     2.5     5
 6    35 50    25 35 70     2.5     4       6.5
 7    35 70    25 35 50     2.5     5       7.5
 8    35 50    25 35 70     2.5     4       6.5
 9    35 70    25 35 50     2.5     5       7.5
10    50 70    25 35 35     4       5       9

2.6 Mann-Whitney Test and a Confidence Interval

2.6.1 The Mann-Whitney Statistic

Consider observations $X_1, X_2, \ldots, X_m$ from population A, and consider observations $Y_1, Y_2, \ldots, Y_n$ from population B.

The Mann-Whitney statistic is defined to be

$U$ = number of pairs $(X_i, Y_j)$ for which $X_i < Y_j$.

The null distribution of $U$ may be determined, under the null hypothesis that
the two populations are the same.

Compute the empirical probability that the observed value of $U$ falls in the appropriate tail of the null distribution of $U$ to determine what?

In other words, compute the proportion of the permuted values of $U$ which are at least as extreme as the observed value of $U$ to determine what?

What is the total number of pairings of $(X_i, Y_j)$?

Revisit problem 2.1.2, section 2.1. Consider the following mutually independent observations from two different populations:

A: 35 25
B: 50 30 70

Use hand calculations and the Mann-Whitney test to test at level 0.05, $H_0$: the populations A and B are the same, versus $H_a$: the values in population A tend to be smaller than the values in population B.

Should we reject $H_0$ for large or small values of $U$?

For how many values of $j$ do we have $35 < Y_j$?

For how many values of $j$ do we have $25 < Y_j$?

What is the observed value of $U$?

List all combinations of the five observations such that two observations are in group A, and compute $U$ for each combination.

Permuted samples
No.   Pop A    Pop B       Y's > X1   Y's > X2   U
 1    25 30    35 50 70       3          3       6
 2    25 35    30 50 70       3          2       5
 3    25 50    30 35 70       3          1       4
 4    25 70    30 35 50       3          0       3
 5    30 35    25 50 70       2          2       4
 6    30 50    25 35 70       2          1       3
 7    30 70    25 35 50       2          0       2
 8    35 50    25 30 70       1          1       2
 9    35 70    25 30 50       1          0       1
10    50 70    25 30 35       0          0       0

What proportion of the permuted values of $U$ are at least as large as our observed value of $U = 5$?

How should ties be handled with the Mann-Whitney test?

2.6.2 Equivalence of Mann-Whitney and Wilcoxon Rank-Sum Statistics

A monotone increasing (in fact, linear) relationship between the Mann-Whitney statistic and the Wilcoxon rank-sum statistic exists. With no ties, the rank sum of the $Y$ sample satisfies $W_2 = U + n(n+1)/2$; equivalently, $W_1 = mn + m(m+1)/2 - U$.

What must be true regarding the p-value obtained from the Mann-Whitney statistic compared to the p-value obtained from the Wilcoxon rank-sum statistic?

Revisit problem 2.2.1 (hypnosis). Use the Mann-Whitney test to test at
level 0.05, $H_0$: the control and treatment groups are the same, versus $H_a$: the values in the control group tend to be smaller than the values in the treatment group.

Control:   469 419 399 421 484 454 548 464
Treatment: 552 436 508 520 478 574 467 516

> z = scan2(problem221)

2.6.3 A Confidence Interval for a Shift Parameter $\Delta$ and the Hodges-Lehmann Estimate

Suppose we sample mutually independent observations from two distributions, which may differ by merely the location parameter.

Draw probability density functions of $N(2, 1)$ for $X$ and $N(0, 1)$ for $Y$ on the same graph.

Draw probability density functions of Cauchy(2, 1) for $X$ and Cauchy(0, 1) for $Y$ on the same graph.

However, the only assumption is that the distributions are identical except for the location parameters.

Estimating the shift parameter $\Delta$:

Let $X$ be an observation from a distribution with location $\mu$. Let $Y$ be an observation (independent of $X$) from the same distribution but with location $\mu - \Delta$.

What is $P(X - \Delta < Y)$, for continuous $X$ and $Y$?

What is $P(X - Y < \Delta)$, for continuous $X$ and $Y$?

What is the population median of $X - Y$, for continuous $X$ and $Y$?

How should we estimate the population median of $X - Y$ based on our sample of mutually independent observations $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$, for continuous $X$ and $Y$?

This estimate of $\Delta$ is called the Hodges-Lehmann estimate.

Theoretical Basis for the Confidence Interval:

Consider the new data set based on all possible values of $X_i - Y_j$.

Idea: Construct a 95% confidence interval on the population median, based on this new DEPENDENT data set.

The confidence interval on $\Delta$ is based on the Mann-Whitney statistic, which was defined in section 2.6.1 to be

$U$ = number of pairs $(X_i, Y_j)$ for which $X_i < Y_j$.

The textbook suggests using Table A4, Lower and Upper Critical Values for Mann-Whitney Statistic, to
construct the confidence interval, whereas we will use R.

Problem 2.6.1 (shocking rats): Solomon and Coles (1954), "A case of failure of generalization of imitation across drives and across situations," J. Abnorm. Soc. Psychol., 49, 7-13.

From a group of nine rats available for a study of the transfer of learning, five were selected at random and were trained to imitate leader rats in a maze. They were then placed together with four untrained control rats in a situation where imitation of the leaders enabled them to avoid receiving an electrical shock. The results (the number of trials required to obtain ten correct responses in ten consecutive trials) were as follows:

Controls X:     110 70 53 51
Trained rats Y: 78 64 75 45 82

(a) Determine the Hodges-Lehmann estimate of $\Delta$, without using the macro wilcox.test. Assume that the two above populations are identical in distribution except for the shift parameter $\Delta$, where $\Delta$ is the median of population A minus the median of population B.

Pairings
  X     Y     X - Y
 110    78    110 - 78 =  32
 110    64    110 - 64 =  46
 110    75    110 - 75 =  35
 110    45    110 - 45 =  65
 110    82    110 - 82 =  28
  70    78     70 - 78 =  -8
  70    64     70 - 64 =   6
  70    75     70 - 75 =  -5
  70    45     70 - 45 =  25
  70    82     70 - 82 = -12
  53    78     53 - 78 = -25
  53    64     53 - 64 = -11
  53    75     53 - 75 = -22
  53    45     53 - 45 =   8
  53    82     53 - 82 = -29
  51    78     51 - 78 = -27
  51    64     51 - 64 = -13
  51    75     51 - 75 = -24
  51    45     51 - 45 =   6
  51    82     51 - 82 = -31

> x = c(110, 70, 53, 51)
> y = c(78, 64, 75, 45, 82)
> u1 = x[1] - y
> u2 = x[2] - y
> u3 = x[3] - y
> u4 = x[4] - y
> u = c(u1, u2, u3, u4)
> median(u)

(b) Construct the 90% confidence interval on $\Delta$, using the macro wilcox.test. Again, assume that the two above populations are identical in distribution except for the location parameters.

> wilcox.test(x, y, conf.int=T, conf.level=0.9)

(c) Without assuming normality, test at level 0.05, $H_0: \Delta = 0$ versus $H_a: \Delta \neq 0$, where the two
populations are identical under $H_0$.

Two-sample t test with pooled standard deviation:

What assumptions are needed to construct a two-sample t test or two-sample t confidence interval?

When constructing the Wilcoxon rank-sum test, the two populations are identical under $H_0$, and the two populations may differ only by the location parameter when constructing confidence intervals.

To appropriately compare the Wilcoxon rank-sum test with the t test (or confidence intervals), what assumptions are needed on the populations?

The "pooled" standard deviation is

$s_p = \sqrt{\dfrac{(m-1) s_1^2 + (n-1) s_2^2}{m + n - 2}}$.

The appropriate confidence interval on $\mu_1 - \mu_2$ is

$\bar{X} - \bar{Y} \pm t \, s_p \sqrt{\dfrac{1}{n} + \dfrac{1}{m}}$,

where the t critical value is based on $m + n - 2$ degrees of freedom.

Revisit problem 2.6.1 (shocking rats).

Controls X:     110 70 53 51
Trained rats Y: 78 64 75 45 82

Assume that the data are mutually independent observations from two normal populations, $N(\mu_1, \sigma)$ and $N(\mu_2, \sigma)$.

(a) Are there any outliers?

(b) Compute the difference in sample means.

(c) Construct the 90% confidence interval on $\Delta = \mu_1 - \mu_2$.

> t.test(x, y, var.equal=T, conf.level=0.9)   # For pooled standard deviation.

(d) Test at level 0.05, $H_0: \Delta = 0$ versus $H_a: \Delta \neq 0$.

> t.test(x, y, var.equal=T)

2.7 Scoring Systems

Example: Suppose we have observations 35, 47, 15, 21, 96, 34, 52. Then, their respective ranks are 4, 5, 1, 2, 7, 3, 6. These ranks may be viewed as scores of the original data set.

Suppose the data are believed to be from some particular distribution. How should the scores be selected?

Suppose a data set of size $N$ is believed to be from a Uniform$(0, N+1)$ distribution.

> plotdist("dunif", 0, 11)   # For a sample of size 10.

On average, what is the expected value (mean) of the smallest observation?

> x = replicate(1e4, min(runif(10, 0, 11))) ; mean(x)

On average, what is the expected value (mean) of the 2nd smallest observation?

On average, what is the expected value (mean) of the 3rd smallest observation?

Hence, if we score the original observations under the assumption of a Uniform$(0, N+1)$ distribution, then we are effectively ranking the data.

Therefore a permutation test for two populations, based on uniform scores and the difference between means (or the sample sum of the scores from group 1), is equivalent to a Wilcoxon rank-sum test.

Thus, the Wilcoxon rank-sum test is most appropriate for uniform distributions.

2.7.1 Three Common Scoring Systems

Normal Scores:

Instead of scoring the observations according to ranks (or the expected value (mean) of the order statistics from a uniform distribution), we could score the observations according to the expected value (mean) of the order statistics from a normal distribution, to obtain the normal scores.

Normal scores are often used for constructing Q-Q normal probability plots.

Using normal scores with the permutation test is reasonable for populations which are approximately normal.

For 10 observations with no ties, the normal scores based on $N(0, 1)$ are -1.539, -1.001, -0.656, -0.376, -0.123, 0.123, 0.376, 0.656, 1.001, 1.539.

> mean(replicate(1e4, min(rnorm(10))))   # N = 10

Van der Waerden Scores:

When the data are believed to be approximately normal, an alternative to using normal scores is using van der Waerden scores, which are based on the quantiles of a $N(0, 1)$ distribution.

Specifically, for sample size $N$, these quantiles (with no ties) correspond to the situation where the standard normal cdf is equal to $1/(N+1), 2/(N+1), 3/(N+1), \ldots, N/(N+1)$.

General example (without specific observations): Suppose we have nine ordered observations with no ties, where we believe that the observations are from a normal population, perhaps under a null hypothesis.

(a) Determine the van
der Waerden scores.

> N = 9
> p = (1:N) / (N + 1)   # These are the values of the cdf.
> q = qnorm(p)

(b) Observe some of the van der Waerden scores graphically.

> shadedist(q[1])   # The default distribution is dnorm.

The textbook lists the van der Waerden scores for 12 ordered observations with no ties in table 2.7.1 on p. 51.

> qnorm((1:12) / 13)   # These results should match table 2.7.1.

Exponential (or Savage) Scores:

Recall: If the lifetime of something is memoryless and continuous, then the lifetime has an exponential distribution.

When the data are believed to be approximately exponential, we could score the observations according to the expected value (mean) of the order statistics from an exponential distribution, to obtain the exponential scores.

Letting $N$ be the sample size, these exponential scores are

$\dfrac{1}{N}, \quad \dfrac{1}{N} + \dfrac{1}{N-1}, \quad \dfrac{1}{N} + \dfrac{1}{N-1} + \dfrac{1}{N-2}, \quad \ldots$

> mean(replicate(1e4, min(rexp(10))))   # N = 10

The Savage scores are the exponential scores minus one, and have mean zero. Hence, tests based on Savage scores are equivalent to tests based on exponential scores.

General example (without specific observations): Suppose we have twelve ordered observations with no ties, where we believe that the observations are from an exponential population, perhaps under a null hypothesis.

Obtain the exponential scores.

> N = 12
> 1/N ; 1/N + 1/(N-1) ; 1/N + 1/(N-1) + 1/(N-2)   # And so on.
> cumsum(1 / (N:1))

Scoring and permutation tests:

Problem 2.7.1 (live vs. TV): MacLachlan (1965), "Variations in learning behavior in two social situations according to personality type," unpublished master's thesis at University of California at Berkeley.

In a business administration course, a set of lectures was given televised to one group and live to another. In each case an examination was given prior to the lectures and immediately following them. The differences between the two examination scores for the women in the two groups
were as follows:

Live: 20.3 23.5 4.7 21.9 15.6 20.3 26.6 21.9 -9.4 4.7 -1.6 25.0
TV:   6.2 15.6 25.0 4.7 28.1 17.2 14.1 31.2 12.6 9.4 17.2 23.4

Test at level 0.05 whether or not the two groups differ in their change in scores. Let $\Delta$ be the population median for the live group minus the population median for the TV group.

(a) Construct linegraphs to view the two data sets.

> z = scan2(problem271)

(b) Use the Wilcoxon rank-sum test.

(c) Use the permutation test based on means.

(d) Use the permutation test based on van der Waerden scores.

> z = score(x, y)
> perm.test(z$x, z$y)
> # Alternative method for generating scores:
> z = score(c(x, y))
> perm.test(z[1:12], z[13:24])

(e) Use the permutation test based on exponential scores.

> z = score(x, y, expon=T)
> perm.test(z$x, z$y)
> # Alternative method for generating scores:
> z = score(c(x, y), expon=T)
> perm.test(z[1:12], z[13:24])

(f) Use the t test with unequal variances.

> t.test(x, y)

(g) Use the t test with equal variances.

> t.test(x, y, var.equal=T)

2.8 Tests for Equality of Scale Parameters and an Omnibus Test

Idea: Based on samples from two populations, we compare the spreads of the two populations.

Example: According to the rules of the United States Tennis Association, "The tennis ball shall have a mass of more than 56.0 grams and less than 59.4 grams." Suppose a manufacturer produces tennis balls whose median mass is 58 grams, but variability in individual masses is quite large.

Would these tennis balls conform to the standards of the United States Tennis Association?

Sample mutually independent observations $X_i$, $i = 1, \ldots, m$, and $Y_j$, $j = 1, \ldots, n$, from the models

$X_i = \mu + \sigma_1 \varepsilon_{ix}$ and $Y_j = \mu + \sigma_2 \varepsilon_{jy}$,

such that the $\varepsilon$'s are identically distributed with median 0.

Note that $\mu$ is the same for the two distributions.
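To make the tennis-ball question concrete, here is a small base R sketch of how the proportion of conforming balls shrinks as the scale grows while the center stays at 58 grams. It assumes normally distributed masses, which the notes do not state; the function name `prop.conforming` and the trial values in `sigmas` are hypothetical choices for illustration.

```r
# Assumption (not from the notes): masses are normal with median 58 grams.
# sigma is a hypothetical standard deviation, in grams.
prop.conforming <- function(sigma, center = 58, lower = 56.0, upper = 59.4) {
  # Probability that a single ball has mass strictly inside the allowed range
  pnorm(upper, mean = center, sd = sigma) - pnorm(lower, mean = center, sd = sigma)
}

sigmas <- c(0.25, 0.5, 1, 2)            # illustrative scale values
props <- sapply(sigmas, prop.conforming)
round(props, 3)
```

Under this normal assumption the conforming proportion is about 0.90 when sigma is 1 gram and about 0.60 when sigma is 2 grams, even though the center is correct in both cases. This is why a test on the scale parameter matters separately from a test on location.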
What does $\mu$ represent?

What do $\sigma_1$ and $\sigma_2$ represent?

Plot two Cauchy distributions with common median 50 but different scales, 5 and 10.

> plotdist(dcauchy, 50, 5, dcauchy, 50, 10)

GOAL: Test $H_0: \sigma_1 = \sigma_2$ versus a one-sided or two-sided alternative.

Siegel-Tukey test (section 2.8.1):

1. Order all $m + n$ observations from smallest to largest.

2. Assign rank 1 to the smallest observation, rank 2 to the largest observation, rank 3 to the 2nd largest observation, rank 4 to the 2nd smallest observation, rank 5 to the 3rd smallest observation, rank 6 to the 3rd largest observation, rank 7 to the 4th largest observation, and so on.

Note: The sample with the smallest ranks tends to have larger variability than the sample with the largest ranks.

3. Apply the Wilcoxon rank-sum test, replacing the original values of $X$ and $Y$ by their assigned ranks.

Example: Assume the above models for $X_i$ and $Y_j$. Use the Siegel-Tukey test to test at level $\alpha = 0.1$ for equality of the two scale parameters versus the alternative that the scale parameter of population A is smaller than the scale parameter of population B, based on the following data:

Population A: 57 45
Population B: 60 20 80 40

(a) Perform calculations using neither the macro wilcox.test nor the macro siegel.test.

> x = c(57, 45)
> y = c(60, 20, 80, 40)
> # Step 1: Order the observations from smallest to largest.
> z = sort(c(x, y))
> # Step 2: Assign the ranks.
> # Apply the Wilcoxon rank-sum test using hand calculations, replacing the
> # original values of X and Y by their assigned ranks.

How many combinations will be used in the Wilcoxon rank-sum test?

Permuted samples
No.   Pop A   Pop B      W1
 1    1 2     3 4 5 6     3
 2    1 3     2 4 5 6     4
 3    1 4     2 3 5 6     5
 4    1 5     2 3 4 6     6
 5    1 6     2 3 4 5     7
 6    2 3     1 4 5 6     5
 7    2 4     1 3 5 6     6
 8    2 5     1 3 4 6     7
 9    2 6     1 3 4 5     8
10    3 4     1 2 5 6     7
11    3 5     1 2 4 6     8
12    3 6     1 2 4 5     9
13    4 5     1 2 3 6     9
14    4 6     1 2 3 5    10
15    5 6     1 2 3 4    11

Which permuted sample corresponds to the original data set?

Determine the p-value for this one-sided test.

Perform the two-sided test at level 0.1.

(b) Perform the one-sided test using the macro wilcox.test, but not the macro siegel.test.

> wilcox.test(c(5, 6), c(1, 2, 3, 4), "greater")

(c) Perform the one-sided test using the macro siegel.test.

> siegel.test
> siegel.test(x, y, "less")

Suppose that with the Siegel-Tukey test, the ranks were assigned beginning with the largest observation rather than the smallest observation. Would we obtain the same p-value?

Example 2.8.1, p. 53: The amount of soda dispensed into 16-ounce bottles might be correctly centered at 16 ounces. However, if variability is large, then some bottles would be overfilled while others would be underfilled. Table 2.8.1 below contains data on the amounts of liquid in randomly selected 16-ounce beverage containers before and after the filling process has been repaired. Use the Siegel-Tukey test to test at level $\alpha = 0.05$ whether or not the repairs were successful. Assume the above models for $X_i$ and $Y_j$. In other words, $X_i = \mu + \sigma_1 \varepsilon_{ix}$ and $Y_j = \mu + \sigma_2 \varepsilon_{jy}$, such that the $\varepsilon$'s are identically distributed with median 0, where all observations are mutually independent.

Treatment 1 (before process repair): 16.55 15.36 15.94 16.43 16.01
Treatment 2 (after process repair):  16.05 15.98 16.10 15.88 15.91

(a) State the null and alternative hypotheses.

(b) Perform the hypothesis test without using the macro siegel.test.

> z = scan2(table281)
> sort(z)
[1]   15.36 15.88 15.91 15.94 15.98 16.01 16.05 16.10 16.43 16.55
rank    1     4     5     8     9    10     7     6     3     2

What are the ranks associated with treatment 1?

What are the ranks associated
with treatment 2?

> wilcox.test(c(1, 2, 3, 8, 10), c(4, 5, 6, 7, 9), "less")

(c) Perform the hypothesis test using the macro siegel.test.

> siegel.test(x, y, "greater")

(d) Perform the hypothesis test reversing the rankings (e.g., assigning rank 1 to the largest observation, and so on), without using the macro siegel.test.

[1]   15.36 15.88 15.91 15.94 15.98 16.01 16.05 16.10 16.43 16.55
rank    2     3     6     7    10     9     8     5     4     1

What are the ranks associated with treatment 1?

What are the ranks associated with treatment 2?

> wilcox.test(c(1, 2, 4, 7, 9), c(3, 5, 6, 8, 10), "less")

(e) Perform the hypothesis test, reversing the rankings (e.g., assigning rank 1 to the largest observation, and so on), using the macro siegel.test.

> siegel.test(x, y, "greater", T)

Which method is preferred, assigning rank 1 to the smallest observation or to the largest observation?

To overcome this ambiguity, one may use the Ansari-Bradley test below.

Ansari-Bradley test (section 2.8.1):

The Ansari-Bradley test averages the ranks from the forward direction (assigning rank 1 to the smallest observation) with the ranks from the reverse direction (assigning rank 1 to the largest observation).

However, the new rank sum does not follow the Wilcoxon distribution. Instead of deriving this distribution, we will simply use the macro in R.

> ?ansari.test
> ansari.test(x, y, "greater")

2.8.2 Tests for Deviances

This model allows for different location parameters, unlike the model from section 2.8.1, where the Siegel-Tukey and Ansari-Bradley tests were used.

Let $X_i = \mu_1 + \sigma_1 \varepsilon_{ix}$ and $Y_j = \mu_2 + \sigma_2 \varepsilon_{jy}$, such that the $\varepsilon$'s are identically distributed with median 0, where all observations are mutually independent.

The deviances for the data set are $X_i - \mu_1$ and $Y_j - \mu_2$.

Since $\mu_1$ and $\mu_2$ typically are unknown, how should we estimate them?

First, define $X'_i = X_i - \hat{\mu}_1$ and $Y'_j = Y_j - \hat{\mu}_2$, where $\hat{\mu}_1$ and $\hat{\mu}_2$ are the sample medians of the original data sets.

The test statistic is based on the
ratio of mean absolute value of deviances (RMD), and is defined by the following:

$\mathrm{RMD} = \dfrac{\sum_{i=1}^{m} |X'_i| / m}{\sum_{j=1}^{n} |Y'_j| / n}$

To determine one-sided p-values, the observed value of RMD is compared to the values of RMD under either all permutations or a large number of simulated permutations of the deviances $X'$ and $Y'$.

To determine two-sided p-values, the test statistic is defined as the following:

$\mathrm{RMD}_{2\text{-sided}} = \dfrac{\max\left( \sum_{i=1}^{m} |X'_i| / m, \ \sum_{j=1}^{n} |Y'_j| / n \right)}{\min\left( \sum_{i=1}^{m} |X'_i| / m, \ \sum_{j=1}^{n} |Y'_j| / n \right)}$

and p-values are determined by the right tail only.

Problem 2.8.1 (diabetes): In a study designed to determine whether middle-aged and old subjects with maturity-onset diabetes respond to exercise by producing high levels of fasting serum growth hormone, A. P. Hansen (1973, Diabetes) collected the following data regarding hormone level (in nanograms per milliliter). Assume the model $X_i = \mu_1 + \sigma_1 \varepsilon_{ix}$ and $Y_j = \mu_2 + \sigma_2 \varepsilon_{jy}$, such that the $\varepsilon$'s are identically distributed with median 0, where all observations are mutually independent. We are interested in testing at level 0.05 whether or not the scale parameters are equal.

Controls X:
1.4 1.6 1.4 4.1 2.6 1.1 0.4 1.8 2.2 0.3 1.3 1.7 1.0 1.2 1.4 0.5
1.1 1.5 1.1 3.3 2.6 0.7 0.1 1.6 2.5 0.7 1.7 0.3 1.9 0.0 0.5

Diabetics Y:
1.2 0.2 0.3 0.9 4.2 0.9 0.3 0.7 0.9 1.1 3.0 0.9 2.3 1.3 0.2 1.2
1.5 2.1 7.7 20.0 1.2 3.4 2.2 0.1 4.3 0.7 0.7 1.3 1.3 9.8 0.9 4.7
0.0 0.4 21.0 12.0 4.2 2.7 1.7 0.5 1.0 0.9 2.1 0.1 1.7 1.0 3.9 1.0
0.5 0.7 0.2 0.9 0.9 0.8 0.5 1.5 1.1 1.1 1.6 1.5 4.0 4.7 0.9

(a) Construct a linegraph of X and a linegraph of Y.

> z = scan2(problem281)
> x = z[1:31]
> y = z[32:94]

(b) Determine the value of the RMD test statistic (not the p-value), without using the macro rmd.test.

> rmdonesided = mean(abs(x - median(x))) / mean(abs(y - median(y)))
> rmdtwosided = 1 / rmdonesided

(c) Describe verbally how the p-value is obtained, without using the macro rmd.test.

(d) Obtain the p-value.

> rmd.test
> rmd.test(x, y)

Suppose the $\varepsilon$'s are approximately normally distributed. Which parametric test typically is used on the scale parameters?

Is such a
test valid for our diabetes example?

2.8.3 Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is an omnibus test; i.e., the null hypothesis is that the two distributions of interest are the same, versus the alternative hypothesis that the two distributions are different.

Plot the probability density functions of $N(90, 20)$ and Cauchy(150, 30) random variables.

> plotdist("dnorm", 90, 20, "dcauchy", 150, 30)

Example: Suppose a huge sample is obtained from a $N(0, 1)$ distribution, and another huge sample is obtained from a Laplace(0, 1) distribution, such that all observations are mutually independent. Would the two-sided t test for equality of means or the F test for variances be powerful in detecting that the two distributions differ; i.e., will these two tests typically produce small p-values?

Plot the probability density functions (pdfs) of the $N(0, 1)$ and Laplace(0, 1) random variables.

> plotdist(dnorm, 0, 1, "dlaplace")

Plot the cumulative distribution functions (cdfs) of the $N(0, 1)$ and Laplace(0, 1) random variables.

> plotdist(pnorm, 0, 1, "plaplace")

The largest vertical difference between the two cdfs will be estimated by the Kolmogorov-Smirnov test statistic.

How may we estimate the cdf of a population, when given a data set?

The Kolmogorov-Smirnov test statistic is the maximum difference between the two empirical cdfs. Hence, the Kolmogorov-Smirnov test statistic is

$KS = \max_w |\hat{F}_1(w) - \hat{F}_2(w)|$,

where $\hat{F}_1(w)$ and $\hat{F}_2(w)$ are the empirical cdfs of the two populations of interest.

To determine the p-value, the original Kolmogorov-Smirnov test statistic is compared to the values of $KS$ under either all permutations or a large number of simulated permutations.

Example: Mutually independent observations are sampled from two populations, such that the first sample is 42, 60, 12, 23 and the second sample is 31,
56, 4, 85, 7.

a. Construct the empirical cdfs on the same graph.
> x = c(42, 60, 12, 23)
> y = c(31, 56, 4, 85, 7)

b. Determine the Kolmogorov-Smirnov test statistic, without using the macro ks.test.

c. Determine the Kolmogorov-Smirnov test statistic and the p-value, using the macro ks.test.
> ?ks.test
> ks.test(x, y)
□

Revisit problem 2.8.1 (diabetes). Test at level 0.05 whether or not the diabetics and control populations differ in terms of fasting serum growth hormone levels after exercise.
> z = scan2("problem281")
> x = z[1:31]
> y = z[32:94]
> ks.test(x, y)
Keep alternative set to "two.sided".

2.9 Selecting Among Two-Sample Tests

Recall: Power is defined to be the probability of rejecting $H_0$ given that $H_a$ is true.

For which distributions do we prefer the t test over nonparametric tests?

For which distributions do we prefer nonparametric tests over the t test?

Recall from section 1.3.2 an example where the binomial test was more powerful than the t test for Laplace alternatives, but less powerful for normal alternatives.

In this section we assume two populations which are identical except for possibly the location parameter; i.e., the cdfs satisfy $F_1(x) = F_2(x - \Delta)$. Test $H_0: \Delta = 0$ versus a one-sided or two-sided alternative.

2.9.1 The t Test

When distributions $F_1$ and $F_2$ are allowed to differ only by the location parameter (i.e., equal variances are assumed), the t test we consider is based on the pooled sample standard deviation.

When the two populations are normal and have equal variances, the pooled t test for level $\alpha$ is the most powerful among all tests with level no larger than $\alpha$, for all sample sizes for one-sided tests.

Is the pooled t test valid for nonnormal populations with equal finite variances and large sample sizes?

When the two
populations are nonnormal but have equal finite variances, is the pooled t test for level $\alpha$ the most powerful among all tests with level no larger than $\alpha$, for large sample sizes for one-sided tests?

2.9.2 The Wilcoxon Rank-Sum Test versus the t Test

The textbook compares the powers of the Wilcoxon rank-sum test and t test for different sample sizes and alternative distributions. The small sample sizes are for m = n = 12. The moderate sample sizes are from m = n = 36 to m = n = 108.

Construct graphs to show how the normal pdf compares with the uniform, Laplace, and exponential pdfs, using identical means and identical standard deviations.
> plotdist("dunif", -1, 1, "dnorm", 0, 1/sqrt(3))
> plotdist("dlaplace", 0, 1, "dnorm")
> plotdist("dexp", 1, NULL, "dnorm", 1)

Recall, when using qqnorm, the difficulty in distinguishing between normal and exponential data for small sample sizes, specifically for n = 7.

The textbook shows comparisons of powers in table 2.9.1.

Conclusions:
• When the alternative distribution is uniform, the t test tends to be more powerful than the Wilcoxon rank-sum test for small and moderate sample sizes.
• When the alternative distribution is Laplace, the t test tends to be more powerful than the Wilcoxon rank-sum test for small sample sizes but less powerful for moderate sample sizes.
• When the alternative distribution is exponential, the t test tends to be somewhat equal in power to the Wilcoxon rank-sum test for small sample sizes but much less powerful for moderate sample sizes.

When the alternative distribution is Cauchy, which test is better, the t test or the Wilcoxon rank-sum test?

2.9.3 Relative Efficiency

Instead of small or moderate sample sizes, we now consider large sample sizes. The textbook again discusses hypothesis testing on $\Delta$ and defines asymptotic efficiency to compare two tests with certain sample sizes.

Definition: Let $m_t = n_t$ be the sample size required for the two-sample
t test to achieve the same power as the two-sample Wilcoxon rank-sum test with a sample size of $m_W = n_W$, for large sample sizes. The asymptotic efficiency of the Wilcoxon rank-sum test to the t test is $(m_t + n_t)/(m_W + n_W)$.

Table 2.9.2, p. 63

Distribution   Uniform   Normal   Laplace   Exponential   Cauchy
Efficiency     1.0       0.955    1.5       3.0           ∞

Conclusions:
• When the alternative distribution is uniform, a t test with sample size 1000 has approximately the same power as a Wilcoxon rank-sum test also with sample size 1000.
• When the alternative distribution is normal, a t test with sample size 955 has approximately the same power as a Wilcoxon rank-sum test with sample size 1000.
• When the alternative distribution is Laplace, a t test with sample size 1500 has approximately the same power as a Wilcoxon rank-sum test with sample size 1000.
• When the alternative distribution is exponential, a t test with sample size 3,000 has approximately the same power as a Wilcoxon rank-sum test with sample size 1,000.
• When the alternative distribution is Cauchy, a t test with any arbitrarily large sample size has less power than a Wilcoxon rank-sum test with sample size 1,000.

2.9.4 Power of Permutation Tests

Analysis of table 2.9.3 (permutation test vs. t test)

Here we compare the permutation test based on the difference between two means with the t test for normal alternatives.

What is the test statistic associated with the permutation test?

What is the test statistic associated with the t test with pooled sample standard deviation?

Which test statistic is heavily influenced by outliers?

Which test statistic should perform well when the alternative distribution is normal?

When the two populations are normal and have equal variances, the pooled t test for level $\alpha$ is the most powerful among all tests with level no larger than $\alpha$, for all sample sizes for one-sided tests, as
already mentioned in section 2.9.1.

For large sample sizes, what is the approximate distribution of the permutation statistic?

For large sample sizes, what is the approximate distribution of the t statistic?

Table 2.9.3 compares the power of the permutation test with the pooled t test under normal alternatives with m = n = 10 or 20. Even for these small sample sizes, the t test is only slightly more powerful than the permutation test.

Analysis of table 2.9.4 (permutation tests vs. Wilcoxon rank-sum test)

Here we compare the permutation test based on the difference between two means or two medians with the Wilcoxon rank-sum test, under normal, Laplace, and Cauchy alternatives, for small sample sizes (m = n = 10).

Recall from section 2.9.2: When comparing the Wilcoxon rank-sum test to the t test, which is more powerful?

First consider means.
• A permutation test based on the difference between two means is similar to which famous test?
• For Laplace and Cauchy alternatives with m = n = 10 or 20, which typically is more powerful, the permutation test based on the difference between two means or the Wilcoxon rank-sum test?
• For normal alternatives with m = n = 10 or 20, which is more powerful, the permutation test based on the difference between two means or the Wilcoxon rank-sum test?

Next consider medians.
• Which nonparametric test statistic is more susceptible to outliers, the permutation test based on the difference between two medians or the Wilcoxon rank-sum test?
• For Laplace and Cauchy alternatives with m = n = 10 or 20, which typically is more powerful, the permutation test based on the difference between two medians or the Wilcoxon rank-sum test?
• For normal alternatives with m = n = 10 or 20, which typically is more powerful, the permutation test based on the difference between two medians or the Wilcoxon rank-sum test?
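The power comparisons discussed in section 2.9.4 can be explored by simulation. The following is a minimal sketch in base R, not the textbook's procedure: the location shift of one standard deviation, the 300 simulated data sets, the 300 sampled permutations per data set, and the helper name perm_pvalue are all assumptions chosen only for illustration.

```r
# Illustrative power simulation (assumed settings, not taken from the textbook).
set.seed(324)
m = 10; n = 10        # small sample sizes, as in the comparisons above
shift = 1             # assumed location shift under the alternative
alpha = 0.05
reps = 300            # number of simulated data sets (assumption)

# Hypothetical helper: sampled-permutation p-value for Ha: mean(x) < mean(y),
# using the difference in sample means as the test statistic.
perm_pvalue = function(x, y, B = 300) {
  pooled = c(x, y)
  observed = mean(x) - mean(y)
  permuted = replicate(B, {
    s = sample(pooled)                                   # random shuffle
    mean(s[seq_along(x)]) - mean(s[-seq_along(x)])
  })
  mean(permuted <= observed)
}

# For each simulated data set, record whether each test rejects H0.
rejections = replicate(reps, {
  x = rnorm(m, mean = 0)
  y = rnorm(n, mean = shift)
  c(pooled_t    = t.test(x, y, alternative = "less", var.equal = TRUE)$p.value <= alpha,
    permutation = perm_pvalue(x, y) <= alpha,
    wilcoxon    = wilcox.test(x, y, alternative = "less")$p.value <= alpha)
})

power = rowMeans(rejections)   # estimated power of each test
print(round(power, 3))
```

Replacing rnorm(m, mean = 0) and rnorm(n, mean = shift) with rcauchy(m) and rcauchy(n, location = shift) gives a rough feel for how the ordering of the tests may change under a heavy-tailed alternative. With so few replications the estimates are noisy, so they should be read as indicative only.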
2.10 Large-Sample Approximations

2.10.1 Sampling Formulas

When sampling a large number of independent observations from a population with finite variance, what is the approximate distribution of the sample mean?

When sampling a large number of independent observations from a population with finite variance, what is the approximate distribution of the sample sum?

2.10.2 Application to the Wilcoxon Rank-Sum Test

When performing the Wilcoxon rank-sum test, what is the test statistic?

Is the variance or standard deviation of these ranks finite?

Are these ranks independent?

Letting N = m + n, what is the average rank?

What is the mean of the Wilcoxon rank-sum statistic, W, under the null hypothesis that the two continuous populations are the same?

The variance of the Wilcoxon rank-sum statistic under the null hypothesis that the two continuous populations are the same is $\mathrm{var}(W) = mn(N+1)/12$ (need not memorize), as derived in the textbook.

How do we obtain a p-value based on the asymptotic distribution of W under $H_0$?

Improvement: Use a continuity correction, since W, an integer, is being approximated by a continuous (in fact, normal) distribution.

This normal approximation is fairly accurate even for the small sample sizes of m = n = 6, as shown in table 2.10.1, p. 67.

Exercise 2.18, p. 75. A biologist examined the effect of a fungal infection on the eating behavior of rodents. Infected apples were offered to a group of eight rodents, and sterile apples were offered to a group of four. The amounts consumed (grams of apple/kilogram of body weight) are listed in the table. Assume that the two populations of eating behavior may differ only by a location parameter. Using the Wilcoxon rank-sum test, we wish to test at level 0.05 whether or not these two location parameters are equal.

Experimental Group: 11 33 48 34 112 369 64 44
Control Group: 177 80 141 332

a. Compute the exact p-value.
> q = scan2("exercise218", F, T)
> x = q[1:8]
> y = q[9:12]

b. Using hand calculations (i.e., you may use R, but not wilcox.test), determine the asymptotic p-value based on the normal approximation with continuity correction.
> rank(c(x, y))
> W = sum(rank(c(x, y))[1:8])
> m = length(x)
> n = length(y)
> N = m + n
> mu = m * (N+1) / 2   # population mean of W
> sigma = sqrt(m * n * (N+1) / 12)   # population sd of W
> # Is W = 41 at the left tail or the right tail?
> z = (41.5 - mu) / sigma
> pvalue = 2 * pnorm(z)
> shadedist(c(z, -z))   # Graph in terms of Z
> shadedist(c(41.5, 62.5), "dnorm", mu, sigma)   # Graph in terms of W

c. Using the macro wilcox.test, determine the asymptotic p-value based on the normal approximation with continuity correction.

Homework C2.10.1: Using the data from exercise 2.4 (i.e., chapter 2, exercise 4, p. 73)

2 Two-Sample Methods

Review of the two-sample t test

Problem 2.1.1 (two diets). Nieman, Groot, and Jansen (1952). The nutritive value of butter fat compared with that of vegetable fats. I. Koninkl. Ned. Akad. Van Wetenschap., Ser. C 55, 588-604.

In a comparison of the effect on growth of two diets A and B, a number of growing rats were placed on these two diets, and the following growth figures were observed after 7 weeks.

A: 156 183 120 113 138 145 142
B: 109 107 119 162 121 123 76 111 130 115

Use the two-sample t test to test if the mean growth rate differs for the two diets at level $\alpha = 0.05$. Do not assume that the two populations have equal variances.
> z = scan("http://www.math.jmu.edu/~garrenst/math324.dir/datasets/problem211", comment.char="#")
> z = scan2("problem211")
> x = z[1:7]
> y = z[8:length(z)]

$H_0: \mu_x = \mu_y$, $H_a: \mu_x \ne \mu_y$

$$T = \frac{\bar{X} - \bar{Y}}{\sqrt{s_x^2/m + s_y^2/n}}$$

The estimated number of degrees of freedom is

$$\frac{\left(s_x^2/m + s_y^2/n\right)^2}{\dfrac{s_x^4}{m^2 (m-1)} + \dfrac{s_y^4}{n^2 (n-1)}}.$$

You do NOT need to memorize this formula for estimated
degrees of freedom.
> mean(x)
> mean(y)
> sd(x)
> sd(y)
> t.test( )

What assumptions were necessary for performing this two-sample t test?
> stripchart(x, "stack")   # For a dotplot of x
> stripchart(y, "stack")   # For a dotplot of y
> linegraph(x)   # For a line graph of x
> linegraph(y)   # For a line graph of y
□

2.1.1 The Permutation Test

Revisit example (two diets). Test for inequality of means at level 0.05.

A: 156 183 120 113 138 145 142
B: 109 107 119 162 121 123 76 111 130 115

Diet A has 7 observations and diet B has 10 observations.

If $\bar{X} - \bar{Y}$ is sufficiently large or small, then reject $H_0$ in favor of $H_a$.

First compute $\bar{X} - \bar{Y}$.

Next permute the 17 observations, and assign 7 of them to diet A and the remaining 10 of them to diet B. Compute the permuted value of $\bar{X} - \bar{Y}$.

How many combinations exist for the 17 observations, with 7 of them assigned to diet A and 10 of them assigned to diet B?

Compute $\bar{X} - \bar{Y}$ for each permutation.

Determine the proportion of times that the permuted values of $|\bar{X} - \bar{Y}|$ are at least as large as the observed value of $|\bar{X} - \bar{Y}|$. That will be your p-value!
□

Problem 2.1.2 (simpler example, smaller sample sizes). Consider the following mutually independent observations from two different continuous populations.

A: 35 25
B: 50 30 70

a. Test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, using the difference in means as the test statistic.

Compute the observed value of $\bar{X} - \bar{Y}$.
> mean(c(35, 25)) - mean(c(50, 30, 70))

How many combinations exist for the 5 observations, with 2 of them assigned to diet A and 3 of them assigned to diet B?

List all possible groupings of the 5 observations, and compute the permuted value of $\bar{X} - \bar{Y}$ along with $\sum X_i$.

Permuted Samples
No.   Pop A    Pop B       Xbar - Ybar   Sum of X
1     25 30    35 50 70    -24.17         55
2     25 35    30 50 70    -20.00         60
3     25 50    30 35 70     -7.50         75
4     25 70    30 35 50      9.17         95
5     30 35    25 50 70    -15.83         65
6     30 50    25 35 70     -3.33         80
7     30 70    25 35 50     13.33        100
8     35 50    25 30 70      0.83         85
9     35 70    25 30 50     17.50        105
10    50 70    25 30 35     30.00        120

List the permutation distribution of $\bar{X} - \bar{Y}$.

Xbar - Ybar   probability
-24.17        0.1
-20.00        0.1
-15.83        0.1
 -7.50        0.1
 -3.33        0.1
  0.83        0.1
  9.17        0.1
 13.33        0.1
 17.50        0.1
 30.00        0.1
sum           1.0

Determine the proportion of times that the permuted values of $\bar{X} - \bar{Y}$ are at least as small as the observed value of $\bar{X} - \bar{Y}$.

b. Test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, using the sample sum from population A as the test statistic.

c. Test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x \ne \mu_y$ at level 0.05, using the difference in means as the test statistic. Is the p-value 2 × 0.2 = 0.4?
> linegraph(c(-24.17, -20, -7.5, 9.17, -15.83, -3.33, 13.33, 0.83, 17.5, 30))

d. Test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, using the difference in means as the test statistic and using the macro perm.test.
> perm.test(c(25, 35), c(30, 50, 70), "less")
> x = c(25, 35); y = c(30, 50, 70)
> perm.test(x, y, "less")

e. Test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x \ne \mu_y$ at level 0.05, using the difference in means as the test statistic and using the macro perm.test.
□

Revisit problem 2.1.1 (two diets). Test for inequality of means at level 0.05.

$H_0: \mu_x = \mu_y$, $H_a: \mu_x \ne \mu_y$

A: 156 183 120 113 138 145 142
B: 109 107 119 162 121 123 76 111 130 115

How many combinations are required to calculate this exact p-value?

When using perm.test, the parameter num.sim is an upper limit on the number of combinations to be computed.
> perm.test(x, y, num.sim=2e4)
□

Example. Consider the nonparametric test based on the difference between two means, and also consider the nonparametric test based on the sample sum from population A. Are these two tests equivalent? Justify your answer mathematically.
□

2.2 Permutation Tests Based on the Median and Trimmed Means

Recall: Heavy-tailed distributions tend to produce outliers, which have large impacts on means but not on medians.

Hence, when sampling from heavy-tailed distributions, a nonparametric test based on the
difference between the two medians might be more appropriate than one based on the difference between the two means.

2.2.1 A Permutation Test Based on Medians

Revisit problem 2.1.2 (section 2.1). Consider the following mutually independent observations from two different populations.

A: 35 25
B: 50 30 70

a. Test $H_0: \mathrm{med}_x = \mathrm{med}_y$ versus $H_a: \mathrm{med}_x < \mathrm{med}_y$ at level 0.05, using the difference in medians as the test statistic.

Compute the observed value of the difference in sample medians.

Permuted Samples
No.   Pop A    Pop B       med(X) - med(Y)
1     25 30    35 50 70    -22.5
2     25 35    30 50 70    -20.0
3     25 50    30 35 70      2.5
4     25 70    30 35 50     12.5
5     30 35    25 50 70    -17.5
6     30 50    25 35 70      5.0
7     30 70    25 35 50     15.0
8     35 50    25 30 70     12.5
9     35 70    25 30 50     22.5
10    50 70    25 30 35     30.0

> x = c(25, 35); y = c(30, 50, 70)
> perm.test(x, y, "less", stat=median)

b. Test $H_0: \mathrm{med}_x = \mathrm{med}_y$ versus $H_a: \mathrm{med}_x \ne \mathrm{med}_y$ at level 0.05, using the difference between medians as the test statistic.
□

Revisit problem 2.1.1 (two diets). Test for the inequality of medians at level 0.05, using the difference in medians as the test statistic.

$H_0: \mathrm{med}_x = \mathrm{med}_y$, $H_a: \mathrm{med}_x \ne \mathrm{med}_y$

A: 156 183 120 113 138 145 142
B: 109 107 119 162 121 123 76 111 130 115

2.2.2 Trimmed Means

A trimmed mean is a compromise between a sample mean and a sample median.

For example, a 20% trimmed mean is the mean of the middle 80% of the observations. In other words, we delete the smallest 10% and the largest 10% of the observations, and then average the remaining 80% of the observations to obtain our 20% trimmed mean.

Technically, when trimmed sample means are used, the hypothesis test is based on a trimmed population mean. However, if the population is symmetric and has finite mean, then the population mean $\mu$ is equal to the trimmed population mean.

Problem 2.2.1 (hypnosis). Agosti and Camerota (1965). Some effects of hypnotic suggestion on respiratory function. Intern. J. Clin. Exptl. Hypnosis 13,
149-156.

At the beginning of a study of the effect of hypnotism, the following measurements of ventilation were taken on eight treatment subjects (to be hypnotized) and eight controls.

Control:   469 419 399 421 484 454 548 464
Treatment: 552 436 508 520 478 574 467 516

a. Construct a linegraph and a Q-Q normal probability plot for each of the two data sets.
> z = scan2("problem221")
> x = z[1:8]
> y = z[9:16]

b. Using the two-sample t test, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means of the control and treatment groups, respectively.

c. Using the permutation test based on means, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means of the control and treatment groups, respectively. Compute the exact p-value.
> perm.test(x, y, "less", num.sim=2e4)
> perm.test(x, y, "less", num.sim=2e4, plot=T)   # To plot the permutation distribution

d. Using the permutation test based on medians, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population medians of the control and treatment groups, respectively. Compute the exact p-value.

e. Using the permutation test based on a 25% trimmed mean, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means of the control and treatment groups, respectively. Compute the exact p-value.

Backup: Below is the sorted table.

Control:   399 419 421 454 464 469 484 548
Treatment: 436 467 478 508 516 520 552 574

> mean(sort(x)[2:7])
> mean(sort(y)[2:7])
Value of test statistic is
> ?perm.test
> perm.test(x, y, "less", num.sim=2e4, trim=0.125)
□

2.3 Random Sampling the Permutations

In the previous example (problem 2.2.1, hypnosis), the permutation tests involved computations based on $\binom{16}{8} = 12{,}870$ groupings or combinations of the 16 subjects.
> choose(16, 8)

Suppose we had 10 subjects in each group?

Suppose we had 20 subjects in each
group?

Suppose we had 25 subjects in each group?
□

2.3.1 An Approximate p-Value Based on Random Sampling the Permutations

Recall section 0.6 regarding permutation methods.

Revisit problem 2.2.1 (hypnosis). Using the permutation test based on means, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means of the control and treatment groups, respectively. Approximate the p-value based on 10,000 randomly sampled permutations.

Control:   469 419 399 421 484 454 548 464
Treatment: 552 436 508 520 478 574 467 516

Steps for approximating the p-value:

1. Compute $\bar{X} - \bar{Y}$, i.e., the difference in sample means for the observed data.
2. Randomly shuffle the 16 observations to reassign 8 observations to the control group and 8 observations to the treatment group.
3. Compute $\bar{X} - \bar{Y}$ for this permuted data set.
4. Repeat steps 2 and 3 many times, such that all selected permutations of 16 observations are independent, until 10,000 independent permuted values of $\bar{X} - \bar{Y}$ are generated.
5. Determine the proportion of the permuted values of $\bar{X} - \bar{Y}$ which are at least as small as the observed value of $\bar{X} - \bar{Y}$. That is your approximated p-value!

> ?perm.test
> perm.test(x, y, "less", all.perms=F)
□

Accuracy of the Procedure

Suppose $p$ is the true (exact) p-value based on $\bar{X} - \bar{Y}$ when testing $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$.

What is the probability that a randomly selected permutation of the data produces a value of $\bar{X} - \bar{Y}$ at least as small as the observed value of $\bar{X} - \bar{Y}$, conditional on the original data set?

Let $\hat{p}$ be the sample proportion of times that randomly selected permutations of
the data produce values of $\bar{X} - \bar{Y}$ at least as small as the observed value of $\bar{X} - \bar{Y}$.

What is the approximate distribution of $\hat{p}$, conditional on the original data set?

What is an approximate 95% confidence interval on $p$, conditional on the original data set?

Revisit problem 2.2.1 (hypnosis). Using the permutation test based on means, test $H_0: \mu_x = \mu_y$ versus $H_a: \mu_x < \mu_y$ at level 0.05, where $\mu_x$ and $\mu_y$ are the population means of the control and treatment groups, respectively. Construct a 95% confidence interval on the true p-value based on 10,000 randomly sampled permutations.
> pvalue = perm.test(x, y, "less", all.perms=F)$p.value
> J = 1e4
> pvalue - qnorm(0.975) * sqrt(pvalue * (1 - pvalue) / J)
> pvalue + qnorm(0.975) * sqrt(pvalue * (1 - pvalue) / J)

How can we decrease the width of the confidence interval?

2.4 Wilcoxon Rank-Sum Test

For a data set $X_1, X_2, \ldots, X_m$, the rank of an observation $X_i$ is $R(X_i)$ = number of $X_j$'s $\le X_i$.

Revisit problem 2.1.2 (section 2.1). Consider the following mutually independent observations from two different populations.

A: 35 25
B: 50 30 70

> z = c(35, 25, 50, 30, 70)
> rank(z)

Let $W_i$ be the sum of the ranks of the observations in treatment $i$, $i = 1, 2$.

Compare $W_1$ to all permuted values of $W_1$. The p-value is the proportion of permuted values of $W_1$ which are at least as extreme as the observed value of $W_1$.

How many ways can we divide the five observations into two groups, where group A has 2 observations and group B has 3 observations?

a. Use hand calculations (i.e., calculations may use R, but not any macros containing the word "test") and the Wilcoxon rank-sum test to test at level 0.05 $H_0$: the populations A and B are the same versus $H_a$: the values in population A tend to be smaller than the values in population B.

Permuted Samples
No.   Pop A    Pop B       R(X1)   R(X2)   W1
1     25 30    35 50 70    1       2       3
2     25 35    30 50 70    1       3       4
3     25 50    30 35 70    1       4       5
4     25 70    30 35 50    1       5       6
5     30 35    25 50 70    2       3       5
6     30 50    25 35 70    2       4       6
7     30 70    25 35 50    2       5       7
8     35 50    25 30 70    3       4       7
9     35 70    25 30 50    3       5       8
10    50 70    25 30 35    4       5       9

b. Repeat part (a) using the macro wilcox.test.
> ?wilcox.test
> wilcox.test(x, y, "less")

c. Use the Wilcoxon rank-sum test to test at level 0.05 $H_0$: the populations A and B are the same versus $H_a$: the populations A and B are different.

Under $H_0$, is $W_1$ symmetric?

d. Repeat part (c) using the macro wilcox.test.
> wilcox.test( )
□

2.4.2 Comments on the Use of the Wilcoxon Rank-Sum Test

Let $W_1$ be the sum of the ranks of the $m$ observations in treatment 1, and let $W_2$ be the sum of the ranks of the $n$ observations in treatment 2.

What does $W_1/m - W_2/n$ represent?

Example. Is a test based on $W_1/m - W_2/n$ equivalent to a test based on $W_1$? Justify your answer mathematically.

2.4.3 A Statistical Table for the Wilcoxon Rank-Sum Test AND 2.4.4 Computer Analysis

Since R provides high precision for p-values and confidence intervals for virtually any practical sample size, we will disregard Table A3, whose values of $m$ and $n$ cannot exceed 10, and whose list of $\alpha$ values is limited to 0.01, 0.025, and 0.05.

Homework C2.4.1: Perform the two-sided Wilcoxon rank-sum test at level $\alpha = 0.05$ for the following data:

A: 97 64 51
B: 45 73 32

(a) using ONLY hand calculations (i.e., you may use R, but not wilcox.test).
(b) using ONLY the R macro wilcox.test.

End of Homework C2.4.1
□

2.5 Wilcoxon Rank-Sum Test Adjusted for Ties

Example. Modify the data from problem 2.1.2 (section 2.1). Consider the following mutually independent observations from two different populations.

A: 35 25
B: 50 35 70

The adjusted rank is the average rank of the tied observations.

What rank should be assigned to the two values of 35?
> z = c(35, 25, 50, 35, 70)
> rank(z)

Use hand calculations and the Wilcoxon rank-sum test to test at level 0.05 $H_0$: the populations
A and B are the same versus $H_a$: the values in population A tend to be smaller than the values in population B.

Permuted Samples
No.   Pop A    Pop B       R(X1)   R(X2)   W1
1     25 35    35 50 70    1       2.5     3.5
2     25 35    35 50 70    1       2.5     3.5
3     25 50    35 35 70    1       4       5
4     25 70    35 35 50    1       5       6
5     35 35    25 50 70    2.5     2.5     5
6     35 50    25 35 70    2.5     4       6.5
7     35 70    25 35 50    2.5     5       7.5
8     35 50    25 35 70    2.5     4       6.5
9     35 70    25 35 50    2.5     5       7.5
10    50 70    25 35 35    4       5       9
□

2.6 Mann-Whitney Test and a Confidence Interval

2.6.1 The Mann-Whitney Statistic

Consider observations $X_1, X_2, \ldots, X_m$ from population A, and consider observations $Y_1, Y_2, \ldots, Y_n$ from population B.

The Mann-Whitney statistic is defined to be $U$ = number of pairs $(X_i, Y_j)$ for which $X_i < Y_j$.

The null distribution of $U$ may be determined under the null hypothesis that the two populations are the same.

Compute the empirical probability that the observed value of $U$ falls in the appropriate tail of the null distribution of $U$ to determine what?

In other words, compute the proportion of the permuted values of $U$ which are at least as extreme as the observed value of $U$ to determine what?

What is the total number of pairings of $(X_i, Y_j)$?

Revisit problem 2.1.2 (section 2.1). Consider the following mutually independent observations from two different populations.

A: 35 25
B: 50 30 70

Use hand calculations and the Mann-Whitney test to test at level 0.05 $H_0$: the populations A and B are the same versus $H_a$: the values in population A tend to be smaller than the values in population B.

Should we reject $H_0$ for large or small values of $U$?

For how many values of $Y$ do we have 35 < $Y$?

For how many values of $Y$ do we have 25 < $Y$?

What is the observed value of $U$?

List all combinations of the five observations such that two observations are in group A, and compute $U$ for each combination.
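The enumeration requested for problem 2.1.2 can also be sketched in base R without any macro containing the word "test". This is only an illustration: the names mann_whitney_u, u_obs, and u_perm are hypothetical, and counting pairs with outer() is one of several equivalent ways to obtain $U$.

```r
# Data from problem 2.1.2.
a = c(35, 25)        # population A
b = c(50, 30, 70)    # population B
pooled = c(a, b)

# Hypothetical helper: U = number of pairs (X_i, Y_j) with X_i < Y_j.
mann_whitney_u = function(x, y) sum(outer(x, y, "<"))

u_obs = mann_whitney_u(a, b)
u_obs                # 5

# Each column of idx holds the positions of the two observations assigned to A.
idx = combn(length(pooled), length(a))
u_perm = apply(idx, 2, function(i) mann_whitney_u(pooled[i], pooled[-i]))
sort(u_perm)         # 0 1 2 2 3 3 4 4 5 6

# One-sided p-value: proportion of permuted U at least as large as observed.
p_value = mean(u_perm >= u_obs)
p_value              # 0.2
```

Large values of $U$ support the alternative that values in A tend to be smaller, which is why the upper tail is used here.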
Permuted Samples
No.   Pop A    Pop B       U
1     25 30    35 50 70    3 + 3 = 6
2     25 35    30 50 70    3 + 2 = 5
3     25 50    30 35 70    3 + 1 = 4
4     25 70    30 35 50    3 + 0 = 3
5     30 35    25 50 70    2 + 2 = 4
6     30 50    25 35 70    2 + 1 = 3
7     30 70    25 35 50    2 + 0 = 2
8     35 50    25 30 70    1 + 1 = 2
9     35 70    25 30 50    1 + 0 = 1
10    50 70    25 30 35    0 + 0 = 0

What proportion of the permuted values of $U$ are at least as large as our observed value of $U = 5$?
□

How should ties be handled with the Mann-Whitney test?

2.6.2 Equivalence of Mann-Whitney and Wilcoxon Rank-Sum Statistics

A monotone increasing (in fact, linear) relationship between the Mann-Whitney statistic and the Wilcoxon rank-sum statistic exists.

What must be true regarding the p-value obtained from the Mann-Whitney statistic compared to the p-value obtained from the Wilcoxon rank-sum statistic?

Revisit problem 2.2.1 (hypnosis). Use the Mann-Whitney test to test at level 0.05 $H_0$: the control and treatment groups are the same versus $H_a$: the values in the control group tend to be smaller than the values in the treatment group.

Control:   469 419 399 421 484 454 548 464
Treatment: 552 436 508 520 478 574 467 516

> z = scan2("problem221")
> x = z[1:8]
> y = z[9:16]
□

2.6.3 A Confidence Interval for a Shift Parameter Δ and the Hodges-Lehmann Estimate

Suppose we sample mutually independent observations from two distributions which may differ by merely the location parameter.

Draw probability density functions of N(2, 1) for X and N(0, 1) for Y on the same graph.

Draw probability density functions of Cauchy(2, 1) for X and Cauchy(0, 1) for Y on the same graph.

However, the only assumption is that the distributions are identical except for the location parameters.

Estimating the shift parameter Δ

Let X be an observation from a distribution with location $\mu$. Let Y be an observation, independent of X, from the same distribution but with
location $\mu - \Delta$.

What is $P(X - \Delta < Y)$ for continuous X and Y?

What is $P(X - Y < \Delta)$ for continuous X and Y?

What is the population median of $X - Y$ for continuous X and Y?

How should we estimate the population median of $X - Y$ based on our sample of mutually independent observations $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$, for continuous X and Y?

This estimate of $\Delta$ is called the Hodges-Lehmann estimate.

Theoretical Basis for the Confidence Interval

Consider the new data set based on all possible values of $X_i - Y_j$.

Idea: Construct a 95% confidence interval on the population median based on this new DEPENDENT data set.

The confidence interval on $\Delta$ is based on the Mann-Whitney statistic, which was defined in section 2.6.1 to be $U$ = number of pairs $(X_i, Y_j)$ for which $X_i < Y_j$.

The textbook suggests using Table A4 (Lower and Upper Critical Values for Mann-Whitney Statistic) to construct the confidence interval, whereas we will use R.

Problem 2.6.1 (shocking rats). Solomon and Coles (1954). A case of failure of generalization of imitation across drives and across situations. J. Abnorm. Soc. Psychol. 49, 7-13.

From a group of nine rats available for a study of the transfer of learning, five were selected at random and were trained to imitate leader rats in a maze. They were then placed together with four untrained control rats in a situation where imitation of the leaders enabled them to avoid receiving an electrical shock. The results (the number of trials required to obtain ten correct responses in ten consecutive trials) were as follows:

Controls (X):     110 70 53 51
Trained rats (Y): 78 64 75 45 82

a. Determine the Hodges-Lehmann estimate of $\Delta$ without using the macro wilcox.test. Assume that the two above populations are identical in distribution except for the shift parameter $\Delta$, where $\Delta$ is the median of population A minus the median of population B.
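For part (a), all $m \times n$ pairwise differences can be formed in a single step. The lines below are a compact base R sketch under the assumption that the Hodges-Lehmann estimate is taken to be the median of all $X_i - Y_j$, as described above; the names differences and hl are illustrative only.

```r
# Data from problem 2.6.1 (shocking rats).
x = c(110, 70, 53, 51)        # controls
y = c(78, 64, 75, 45, 82)     # trained rats

# 4 x 5 matrix whose (i, j) entry is X_i - Y_j.
differences = outer(x, y, "-")

# Hodges-Lehmann estimate: median of the 20 pairwise differences.
hl = median(as.vector(differences))
hl                            # -6.5

# Cross-check against base R, which reports its own "difference in location".
wilcox.test(x, y, conf.int = TRUE, conf.level = 0.9)$estimate
```

The two middle ordered differences are -8 and -5, so the median of the 20 values is their average.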
Pairings (X, Y)   X - Y
110, 78           110 - 78 =  32
110, 64           110 - 64 =  46
110, 75           110 - 75 =  35
110, 45           110 - 45 =  65
110, 82           110 - 82 =  28
 70, 78            70 - 78 =  -8
 70, 64            70 - 64 =   6
 70, 75            70 - 75 =  -5
 70, 45            70 - 45 =  25
 70, 82            70 - 82 = -12
 53, 78            53 - 78 = -25
 53, 64            53 - 64 = -11
 53, 75            53 - 75 = -22
 53, 45            53 - 45 =   8
 53, 82            53 - 82 = -29
 51, 78            51 - 78 = -27
 51, 64            51 - 64 = -13
 51, 75            51 - 75 = -24
 51, 45            51 - 45 =   6
 51, 82            51 - 82 = -31

> x = c(110, 70, 53, 51)
> y = c(78, 64, 75, 45, 82)
> u1 = x[1] - y
> u2 = x[2] - y
> u3 = x[3] - y
> u4 = x[4] - y
> u = c(u1, u2, u3, u4)
> median(u)

b. Construct the 90% confidence interval on $\Delta$ using the macro wilcox.test. Again, assume that the two above populations are identical in distribution except for the location parameters.
> wilcox.test(x, y, conf.int=T, conf.level=0.9)

c. Without assuming normality, test at level 0.05 $H_0: \Delta = 0$ versus $H_a: \Delta \ne 0$, where the two populations are identical under $H_0$.
□

Two-sample t test with pooled standard deviation

What assumptions are needed to construct a two-sample t test or two-sample t confidence interval?

When constructing the Wilcoxon rank-sum test, the two populations are identical under $H_0$, and the two populations may differ only by the location parameter when constructing confidence intervals.

To appropriately compare the Wilcoxon rank-sum test with the t test or confidence intervals, what assumptions are needed on the populations?

The pooled standard deviation is

$$s_p = \sqrt{\frac{(m-1) s_x^2 + (n-1) s_y^2}{m + n - 2}}.$$

The appropriate confidence interval on $\mu_1 - \mu_2$ is

$$\bar{X} - \bar{Y} \pm t^{*} s_p \sqrt{\frac{1}{m} + \frac{1}{n}},$$

where the t critical value $t^{*}$ is based on $m + n - 2$ degrees of freedom.

Revisit problem 2.6.1 (shocking rats).

Controls (X):     110 70 53 51
Trained rats (Y): 78 64 75 45 82

Assume that the data are mutually independent observations from two normal populations, $N(\mu_1, \sigma)$ and $N(\mu_2, \sigma)$.

a. Are there any outliers?

b. Compute the difference in sample means.

c. Construct the 90% confidence
interval on $\Delta = \mu_1 - \mu_2$.

> t.test(x, y, var.equal=T, conf.level=0.9) # For pooled standard deviation

(d) Test at level 0.05 $H_0: \Delta = 0$ versus $H_a: \Delta \neq 0$.

> t.test(x, y, var.equal=T) □

2.7 Scoring Systems

Example: Suppose we have observations 35 47 15 21 96 34 52. Then their respective ranks are 4 5 1 2 7 3 6. These ranks may be viewed as scores of the original data set. □

Suppose the data are believed to be from some particular distribution. How should the scores be selected? □

Suppose a data set of size N is believed to be from a Uniform(0, N+1) distribution.

> plotdist(dunif, 0, 11)

For a sample of size 10: On average, what is the expected value (mean) of the smallest observation?

> x = replicate(1e4, min(runif(10, 0, 11))); mean(x)

On average, what is the expected value (mean) of the 2nd smallest observation?

On average, what is the expected value (mean) of the 3rd smallest observation?

Hence, if we score the original observations under the assumption of a Uniform(0, N+1) distribution, then we are effectively ranking the data.

Therefore, a permutation test for two populations, based on uniform scores and the difference between means (or the sample sum of the scores from group 1), is equivalent to a Wilcoxon rank-sum test.

Thus, the Wilcoxon rank-sum test is most appropriate for uniform distributions.

2.7.1 Three Common Scoring Systems

Normal Scores

Instead of scoring the observations according to ranks (or the expected value (mean) of the order statistics from a uniform distribution), we could score the observations according to the expected value (mean) of the order statistics from a normal distribution to obtain the normal scores.

Normal scores are often used for constructing Q-Q (normal probability) plots.

Using normal scores with the permutation test is reasonable for populations which are approximately normal.

For 10 observations with no ties, the normal scores based on $N(0, 1)$ are -1.539,
-1.001, -0.656, -0.376, -0.123, 0.123, 0.376, 0.656, 1.001, 1.539.

> mean(replicate(1e4, min(rnorm(10)))) # N = 10

Van der Waerden Scores

When the data are believed to be approximately normal, an alternative to using normal scores is using van der Waerden scores, which are based on the quantiles of a $N(0, 1)$ distribution.

Specifically, for sample size N, these quantiles (with no ties) correspond to the situation where the standard normal cdf is equal to $1/(N+1),\ 2/(N+1),\ 3/(N+1),\ \ldots,\ N/(N+1)$.

General example without specific observations: Suppose we have nine ordered observations with no ties, where we believe that the observations are from a normal population (perhaps under a null hypothesis).

(a) Determine the van der Waerden scores.

> N = 9
> p = (1:N) / (N+1) # These are the values of the cdf.
> q = qnorm(p)

(b) Observe some of the van der Waerden scores graphically.

> shadedist(q[1]) # The default distribution is dnorm. □

The textbook lists the van der Waerden scores for 12 ordered observations with no ties in table 2.7.1 on p. 51.

> qnorm((1:12) / 13) # These results should match table 2.7.1.

Exponential or Savage Scores

Recall: If the lifetime of something is memoryless and continuous, then the lifetime has an exponential distribution.

When the data are believed to be approximately exponential, we could score the observations according to the expected value (mean) of the order statistics from an exponential distribution to obtain the exponential scores.

Letting N be the sample size, these exponential scores are

$$\frac{1}{N},\quad \frac{1}{N} + \frac{1}{N-1},\quad \frac{1}{N} + \frac{1}{N-1} + \frac{1}{N-2},\quad \ldots$$

> mean(replicate(1e4, min(rexp(10)))) # N = 10

The Savage scores are the exponential scores minus one and have mean zero. Hence, tests based on Savage scores are equivalent to tests based on exponential scores.

General example without specific observations: Suppose we have twelve ordered observations with no ties, where we believe that the observations are from an
exponential population (perhaps under a null hypothesis). Obtain the exponential scores.

> N = 12
> 1/N; 1/N + 1/(N-1); 1/N + 1/(N-1) + 1/(N-2) # And so on.
> cumsum(1 / (N:1)) □

Scoring and permutation tests

Problem 2.7.1 (live vs. TV): MacLachlan (1965), Variations in learning behavior in two social situations according to personality type, unpublished master's thesis at University of California at Berkeley.

In a business administration course, a set of lectures was given televised to one group and live to another. In each case, an examination was given prior to the lectures and immediately following them. The differences between the two examination scores for the women in the two groups were as follows:

Live: 203 235 47 219 156 203 266 219 94 47 16 250
TV: 62 156 250 47 281 172 141 312 126 94 172 234

Test at level 0.05 whether or not the two groups differ in their change in scores. Let $\Delta$ be the population median for the live group minus the population median for the TV group.

(a) Construct linegraphs to view the two data sets.

> z = scan2("problem271")

(b) Use the Wilcoxon rank-sum test.

(c) Use the permutation test based on means.

(d) Use the permutation test based on van der Waerden scores.

> z = score(x, y)
> permtest(z$x, z$y)
> # Alternative method for generating scores
> z = score(c(x, y))
> permtest(z[1:12], z[13:24])

(e) Use the permutation test based on exponential scores.

> z = score(x, y, expon=T)
> permtest(z$x, z$y)
> # Alternative method for generating scores
> z = score(c(x, y), expon=T)
> permtest(z[1:12], z[13:24])

(f) Use the t test with unequal variances.

> t.test(x, y)

(g) Use the t test with equal variances.

> t.test(x, y, var.equal=T) □

2.8 Tests for Equality of Scale Parameters and an Omnibus Test

Idea: Based on samples from two populations, we compare the spreads of the two populations.
Example: According to the rules of the United States Tennis Association, "The tennis ball shall have a mass of more than 56.0 grams and less than 59.4 grams." Suppose a manufacturer produces tennis balls whose median mass is 58 grams, but variability in individual masses is quite large. Would these tennis balls conform to the standards of the United States Tennis Association? □

Sample mutually independent observations $X_i$, $i = 1, \ldots, m$, and $Y_j$, $j = 1, \ldots, n$, from the models

$$X_i = \mu + \sigma_1 \varepsilon_{ix} \quad \text{and} \quad Y_j = \mu + \sigma_2 \varepsilon_{jy},$$

such that the $\varepsilon$'s are identically distributed with median 0.

Note that $\mu$ is the same for the two distributions.

What does $\mu$ represent?

What do $\sigma_1$ and $\sigma_2$ represent?

Plot two Cauchy distributions with common median 50 but different scales, 5 and 10.

> plotdist(dcauchy, 50, 5, dcauchy, 50, 10)

GOAL: Test $H_0: \sigma_1 = \sigma_2$ versus a one-sided or two-sided alternative.

Siegel-Tukey test (section 2.8.1)

1. Order all $m + n$ observations from smallest to largest.

2. Assign rank 1 to the smallest observation, rank 2 to the largest observation, rank 3 to the 2nd largest observation, rank 4 to the 2nd smallest observation, rank 5 to the 3rd smallest observation, rank 6 to the 3rd largest observation, rank 7 to the 4th largest observation, and so on.
   Note: The sample with the smallest ranks tends to have larger variability than the sample with the largest ranks.

3. Apply the Wilcoxon rank-sum test, replacing the original values of X and Y by their assigned ranks. □

Example: Assume the above models for $X_i$ and $Y_j$. Use the Siegel-Tukey test to test at level $\alpha = 0.1$ for equality of the two scale parameters versus the alternative that the scale parameter of population A is smaller than the scale parameter of population B, based on the following data:

Population A: 57 45
Population B: 60 20 80 40

(a) Perform calculations using neither the macro wilcox.test
nor the macro siegeltest.

> x = c(57, 45)
> y = c(60, 20, 80, 40)
> # Step 1: Order the observations from smallest to largest.
> z = sort(c(x, y))
> # Step 2: Assign the ranks.
> # Step 3: Apply the Wilcoxon rank-sum test using hand calculations, replacing the original values of X and Y by their assigned ranks.

How many combinations will be used in the Wilcoxon rank-sum test?

[Table of permuted samples: the ranks assigned to Pop A, the ranks assigned to Pop B, and the rank sum W1 for each combination.]

Which permuted sample corresponds to the original data set?

Determine the p-value for this one-sided test.

Perform the two-sided test at level 0.1.

(b) Perform the one-sided test using the macro wilcox.test but not the macro siegeltest.

> wilcox.test(c(5, 6), c(1, 2, 3, 4), "greater")

(c) Perform the one-sided test using the macro siegeltest.

> siegeltest
> siegeltest(x, y, "less") □

Suppose that, with the Siegel-Tukey test, the ranks were assigned beginning with the largest observation rather than the smallest observation. Would we obtain the same p-value?

Example 2.8.1 (p. 53): The amount of soda dispensed into 16-ounce bottles might be correctly centered at 16 ounces. However, if variability is large, then some bottles would be overfilled while others would be underfilled.

Table 2.8.1 below contains data on the amounts of liquid in randomly selected 16-ounce beverage containers before and after the filling process has been repaired. Use the Siegel-Tukey test to test at level $\alpha = 0.05$ whether or not the repairs were successful.

Assume the above models for $X_i$ and $Y_j$. In other words, $X_i = \mu + \sigma_1 \varepsilon_{ix}$ and $Y_j = \mu + \sigma_2 \varepsilon_{jy}$, such that the $\varepsilon$'s are identically
distributed with median 0, where all observations are mutually independent.

Treatment 1 (before process repair): 16.55 15.36 15.94 16.43 16.01
Treatment 2 (after process repair): 16.05 15.98 16.10 15.88 15.91

(a) State the null and alternative hypotheses.

(b) Perform the hypothesis test without using the macro siegeltest.

> z = scan2("table281")
> x = z[1:5]
> y = z[6:10]
> sort(z)
[1]  15.36 15.88 15.91 15.94 15.98 16.01 16.05 16.10 16.43 16.55
rank     1     4     5     8     9    10     7     6     3     2

What are the ranks associated with treatment 1?

What are the ranks associated with treatment 2?

> wilcox.test(c(1, 2, 3, 8, 10), c(4, 5, 6, 7, 9), "less")

(c) Perform the hypothesis test using the macro siegeltest.

> siegeltest(x, y, "greater")

(d) Perform the hypothesis test, reversing the rankings (e.g., assigning rank 1 to the largest observation, and so on), without using the macro siegeltest.

[1]  15.36 15.88 15.91 15.94 15.98 16.01 16.05 16.10 16.43 16.55
rank     2     3     6     7    10     9     8     5     4     1

What are the ranks associated with treatment 1?

What are the ranks associated with treatment 2?

> wilcox.test(c(1, 2, 4, 7, 9), c(3, 5, 6, 8, 10), "less")

(e) Perform the hypothesis test, reversing the rankings (e.g., assigning rank 1 to the largest observation, and so on), using the macro siegeltest.

> siegeltest(x, y, "greater", T) □

Which method is preferred, assigning rank 1 to the smallest observation or to the largest observation? To overcome this ambiguity, one may use the Ansari-Bradley test, below.

Ansari-Bradley test (section 2.8.1)

The Ansari-Bradley test averages the ranks from the forward direction (assigning rank 1 to the smallest observation) with the ranks from the reverse direction (assigning rank 1 to the largest observation).

However, the new rank sum does not follow the Wilcoxon distribution. Instead of deriving this distribution, we will simply use the macro in R.

> ansari.test
> ansari.test(x, y, "greater")

2.8.2 Tests for Deviances
This model allows for different location parameters, unlike the model from section 2.8.1, where the Siegel-Tukey and Ansari-Bradley tests were used.

Let

$$X_i = \mu_1 + \sigma_1 \varepsilon_{ix} \quad \text{and} \quad Y_j = \mu_2 + \sigma_2 \varepsilon_{jy},$$

such that the $\varepsilon$'s are identically distributed with median 0, where all observations are mutually independent.

The deviances for the data set are $X_i - \mu_1$ and $Y_j - \mu_2$.

Since $\mu_1$ and $\mu_2$ typically are unknown, how should we estimate them?

First, define $dev_{ix} = X_i - \hat{\mu}_1$ and $dev_{jy} = Y_j - \hat{\mu}_2$, where $\hat{\mu}_1$ and $\hat{\mu}_2$ are the sample medians of the original data sets.

The test statistic is based on the ratio of mean absolute value of deviances (RMD) and is defined by the following:

$$\mathrm{RMD} = \frac{\sum_{i=1}^{m} |dev_{ix}| / m}{\sum_{j=1}^{n} |dev_{jy}| / n}.$$

To determine one-sided p-values, the observed value of RMD is compared to the values of RMD under either all permutations or a large number of simulated permutations of the deviances.

To determine two-sided p-values, the test statistic is defined as the following:

$$\mathrm{RMD}_{\text{two-sided}} = \frac{\max\left( \sum_{i=1}^{m} |dev_{ix}| / m,\ \sum_{j=1}^{n} |dev_{jy}| / n \right)}{\min\left( \sum_{i=1}^{m} |dev_{ix}| / m,\ \sum_{j=1}^{n} |dev_{jy}| / n \right)},$$

and p-values are determined by the right tail only.

Problem 2.8.1 (diabetes): In a study designed to determine whether middle-aged and old subjects with maturity-onset diabetes respond to exercise by producing high levels of fasting serum growth hormone, A. P. Hansen (1973, Diabetes) collected the following data regarding hormone level in nanograms per milliliter.

Assume the model $X_i = \mu_1 + \sigma_1 \varepsilon_{ix}$ and $Y_j = \mu_2 + \sigma_2 \varepsilon_{jy}$, such that the $\varepsilon$'s are identically distributed with median 0, where all observations are mutually independent.

We are interested in testing at level 0.05 whether or not the scale parameters are equal.

Controls (X): 14 22 11 25 16 03 15 07 14 13 11 17 41 17 33 03 26 10 26 19 11 12 07 00 04 14 01 05 18 05 16 12 09 15 43 00 10 05 11

Diabetics (Y): 02 11 21 07 04 09 07 11 03 30 77 07 210 21 02 16 09 09 200 13 120 01 09 15 42 23 12 13 42 17 09 40 09 13 34 98 27 10 08 47 03 02 22 09 17
39 05 09 07 12 01 47 05 10 15

(a) Construct a linegraph of X and a linegraph of Y.

> z = scan2("problem281")
> x = z[1:31]
> y = z[32:94]

(b) Determine the value of the RMD test statistic (not the p-value) without using the macro rmdtest.

> rmd.onesided = mean(abs(x - median(x))) / mean(abs(y - median(y)))
> rmd.twosided = 1 / rmd.onesided

(c) Describe verbally how the p-value is obtained without using the macro rmdtest.

(d) Obtain the p-value.

> rmdtest
> rmdtest(x, y) □

Suppose the $\varepsilon$'s are approximately normally distributed. Which parametric test typically is used on the scale parameters?

Is such a test valid for our diabetes example?

2.8.3 Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is an omnibus test; i.e., the null hypothesis is that the two distributions of interest are the same, versus the alternative hypothesis that the two distributions are different.

Plot the probability density functions of $N(90, 20)$ and Cauchy(150, 30) random variables.

> plotdist(dnorm, 90, 20, dcauchy, 150, 30)

Example: Suppose a huge sample is obtained from a $N(0, 1)$ distribution and another huge sample is obtained from a Laplace(0, 1) distribution, such that all observations are mutually independent.

Would the two-sided t test for equality of means or the F test for variances be powerful in detecting that the two distributions differ (i.e., will these two tests typically produce small p-values)?

Plot the probability density functions (pdfs) of the $N(0, 1)$ and Laplace(0, 1) random variables.

> plotdist(dnorm, 0, 1, dlaplace)

Plot the cumulative distribution functions (cdfs) of the $N(0, 1)$ and Laplace(0, 1) random variables.

> plotdist(pnorm, 0, 1, plaplace)

The largest vertical difference between the two cdfs will be estimated by the Kolmogorov-Smirnov test statistic. □

How may we estimate the cdf of a population when given a data set?

The Kolmogorov-Smirnov test statistic is the maximum difference between the two empirical
cdfs. Hence, the Kolmogorov-Smirnov test statistic is

$$KS = \max_{w} \left| \hat{F}_1(w) - \hat{F}_2(w) \right|,$$

where $\hat{F}_1$ and $\hat{F}_2$ are the empirical cdfs of the two populations of interest.

To determine the p-value, the original Kolmogorov-Smirnov test statistic is compared to the values of KS under either all permutations or a large number of simulated permutations.

Example: Mutually independent observations are sampled from two populations, such that the first sample is 42 60 12 23 and the second sample is 31 56 4 85 77.

(a) Construct the empirical cdfs on the same graph.

> x = c(42, 60, 12, 23)
> y = c(31, 56, 4, 85, 77)

(b) Determine the Kolmogorov-Smirnov test statistic without using the macro ks.test. □

(c) Determine the Kolmogorov-Smirnov test statistic and the p-value using the macro ks.test.

> ks.test
> ks.test(x, y) □

Revisit problem 2.8.1 (diabetes): Test at level 0.05 whether or not the diabetics and control populations differ in terms of fasting serum growth hormone levels after exercise.

> z = scan2("problem281")
> x = z[1:31]
> y = z[32:94]
> ks.test(x, y) # Keep alternative set to "two.sided". □

2.9 Selecting Among Two-Sample Tests

Recall: Power is defined to be the probability of rejecting $H_0$ given that $H_a$ is true.

For which distributions do we prefer the t test over nonparametric tests?

For which distributions do we prefer nonparametric tests over the t test?

Recall from section 1.3.2 an example where the binomial test was more powerful than the t test for Laplace alternatives but less powerful for normal alternatives.

In this section we assume two populations which are identical except for possibly the location parameter; i.e., the cdfs satisfy $F_1(x) = F_2(x - \Delta)$.

Test $H_0: \Delta = 0$ versus a one-sided or two-sided alternative.

2.9.1 The t Test

When distributions $F_1$ and $F_2$ are allowed to differ only by the location parameter (i.e., equal variances are assumed), the t test we consider is based on the pooled sample
standard deviation.

When the two populations are normal and have equal variances, the pooled t test for level $\alpha$ is the most powerful among all tests with level no larger than $\alpha$ for all sample sizes for one-sided tests.

Is the pooled t test valid for nonnormal populations with equal finite variances and large sample sizes?

When the two populations are nonnormal but have equal finite variances, is the pooled t test for level $\alpha$ the most powerful among all tests with level no larger than $\alpha$ for large sample sizes for one-sided tests?

2.9.2 The Wilcoxon Rank-Sum Test versus the t Test

The textbook compares the powers of the Wilcoxon rank-sum test and t test for different sample sizes and alternative distributions.

The small sample sizes are for $m + n = 12$.

The moderate sample sizes are from $m + n = 36$ to $m + n = 108$.

Construct graphs to show how the normal pdf compares with the uniform, Laplace, and exponential pdfs, using identical means and identical standard deviations.

> plotdist(dunif, -1, 1, dnorm, 0, 1/sqrt(3))
> plotdist(dlaplace, 0, 1, dnorm)
> plotdist(dexp, 1, NULL, dnorm, 1)

Recall, when using qqnorm, the difficulty in distinguishing between normal and exponential data for small sample sizes, specifically for n = 7.

The textbook shows comparisons of powers in table 2.9.1.

Conclusions:

• When the alternative distribution is uniform, the t test tends to be more powerful than the Wilcoxon rank-sum test for small and moderate sample sizes.

• When the alternative distribution is Laplace, the t test tends to be more powerful than the Wilcoxon rank-sum test for small sample sizes but less powerful for moderate sample sizes.

• When the alternative distribution is exponential, the t test tends to be somewhat equal in power to the Wilcoxon rank-sum test for small sample sizes but much less powerful for moderate sample sizes.
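The kind of power comparison summarized above can be explored with a small simulation. The following is a minimal sketch, not the textbook's study: the helper `estimate_power`, the shift of 1, the sample sizes m = n = 18, the 2000 replications, and the seed are all illustrative assumptions.

```r
# Monte Carlo estimate of power for the pooled t test and the Wilcoxon
# rank-sum test under a location shift (two-sided tests at level alpha).
estimate_power <- function(rdist, shift, m, n, reps = 2000, alpha = 0.05) {
  reject_t <- logical(reps)
  reject_w <- logical(reps)
  for (i in seq_len(reps)) {
    x <- rdist(m) + shift   # population 1: shifted by 'shift'
    y <- rdist(n)           # population 2: unshifted
    reject_t[i] <- t.test(x, y, var.equal = TRUE)$p.value < alpha
    reject_w[i] <- wilcox.test(x, y)$p.value < alpha
  }
  # Proportion of rejections = estimated power of each test
  c(t = mean(reject_t), wilcoxon = mean(reject_w))
}

set.seed(324)  # arbitrary seed, for reproducibility only
power_exp  <- estimate_power(rexp,  shift = 1, m = 18, n = 18)
power_norm <- estimate_power(rnorm, shift = 1, m = 18, n = 18)
print(rbind(exponential = power_exp, normal = power_norm))
```

Swapping `rexp` for another random generator, or changing `m` and `n`, gives a quick check of how the ranking of the two tests depends on the alternative distribution and the sample size.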
When the alternative distribution is Cauchy, which test is better, the t test or the Wilcoxon rank-sum test?

2.9.3 Relative Efficiency

Instead of small or moderate sample sizes, we now consider large sample sizes.

The textbook again discusses hypothesis testing on $\Delta$ and defines asymptotic efficiency to compare two tests with certain sample sizes.

Definition: Let $m_t + n_t$ be the sample size required for the two-sample t test to achieve the same power as the two-sample Wilcoxon rank-sum test with a sample size of $m_w + n_w$, for large sample sizes. The asymptotic efficiency of the Wilcoxon rank-sum test to the t test is

$$\frac{m_t + n_t}{m_w + n_w}.$$

Table 2.9.2 (p. 63)

Distribution    Efficiency
Uniform         1.0
Normal          0.955
Laplace         1.5
Exponential     3.0
Cauchy          $\infty$

Conclusions:

• When the alternative distribution is uniform, a t test with sample size 1000 has approximately the same power as a Wilcoxon rank-sum test also with sample size 1000.

• When the alternative distribution is normal, a t test with sample size 955 has approximately the same power as a Wilcoxon rank-sum test with sample size 1000.

• When the alternative distribution is Laplace, a t test with sample size 1500 has approximately the same power as a Wilcoxon rank-sum test with sample size 1000.

• When the alternative distribution is exponential, a t test with sample size 3000 has approximately the same power as a Wilcoxon rank-sum test with sample size 1000.

• When the alternative distribution is Cauchy, a t test with any arbitrarily large sample size has less power than a Wilcoxon rank-sum test with sample size 1000.

2.9.4 Power of Permutation Tests

Analysis of table 2.9.3 (permutation test vs. t test)

Here we compare the permutation test based on the difference between two means with the t test for normal alternatives.

What is the test statistic associated with the permutation test?

What is the test statistic associated
with the t test with pooled sample standard deviation?

Which test statistic is heavily influenced by outliers?

Which test statistic should perform well when the alternative distribution is normal?

When the two populations are normal and have equal variances, the pooled t test for level $\alpha$ is the most powerful among all tests with level no larger than $\alpha$ for all sample sizes for one-sided tests, as already mentioned in section 2.9.1.

For large sample sizes, what is the approximate distribution of the permutation statistic?

For large sample sizes, what is the approximate distribution of the t statistic?

Table 2.9.3 compares the power of the permutation test with the pooled t test under normal alternatives with sample sizes of 10 or 20. Even for these small sample sizes, the t test is only slightly more powerful than the permutation test.

Analysis of table 2.9.4 (permutation tests vs. Wilcoxon rank-sum test)

Here we compare the permutation test based on the difference between two means or two medians with the Wilcoxon rank-sum test under normal, Laplace, and Cauchy alternatives for small sample sizes of 10.

Recall from section 2.9.2: When comparing the Wilcoxon rank-sum test to the t test, which is more powerful?

First consider means.

• A permutation test based on the difference between two means is similar to which famous test?

• For Laplace and Cauchy alternatives with sample sizes of 10 or 20, which typically is more powerful, the permutation test based on the difference between two means or the Wilcoxon rank-sum test?

• For normal alternatives with sample sizes of 10 or 20, which is more powerful, the permutation test based on the difference between two means or the Wilcoxon rank-sum test?

Next consider medians.

• Which nonparametric test statistic is more susceptible to outliers, the permutation test based on the difference between two medians or the Wilcoxon
rank-sum test?

• For Laplace and Cauchy alternatives with sample sizes of 10 or 20, which typically is more powerful, the permutation test based on the difference between two medians or the Wilcoxon rank-sum test?

• For normal alternatives with sample sizes of 10 or 20, which typically is more powerful, the permutation test based on the difference between two medians or the Wilcoxon rank-sum test?

2.10 Large-Sample Approximations

2.10.1 Sampling Formulas

When sampling a large number of independent observations from a population with finite variance, what is the approximate distribution of the sample mean?

When sampling a large number of independent observations from a population with finite variance, what is the approximate distribution of the sample sum?

2.10.2 Application to the Wilcoxon Rank-Sum Test

When performing the Wilcoxon rank-sum test, what is the test statistic?

Is the variance (or standard deviation) of these ranks finite?

Are these ranks independent?

Letting $N = m + n$, what is the average rank?

What is the mean of the Wilcoxon rank-sum statistic W under the null hypothesis that the two continuous populations are the same?

The variance of the Wilcoxon rank-sum statistic under the null hypothesis that the two continuous populations are the same is

$$\mathrm{var}(W) = \frac{mn(N+1)}{12}$$

(need not memorize), as derived in the textbook.

How do we obtain a p-value based on the asymptotic distribution of W under $H_0$?

Improvement: Use a continuity correction, since W (an integer) is being approximated by a continuous (in fact, normal) distribution.

This normal approximation is fairly accurate even for the small sample sizes of 6, as shown in table 2.10.1 (p. 67).

Exercise 2.18 (p. 75): A biologist examined the effect of a fungal infection on the eating behavior of rodents. Infected apples were offered to a group of eight rodents, and sterile apples were offered to a
group of four. The amounts consumed (grams of apple / kilogram of body weight) are listed in the table. Assume that the two populations of eating behavior may differ only by a location parameter. Using the Wilcoxon rank-sum test, we wish to test at level 0.05 whether or not these two location parameters are equal.

Experimental Group: 11 33 48 34 112 369 64 44
Control Group: 177 80 141 332

(a) Compute the exact p-value.

> q = scan2("exercise218", F, T)
> x = q[1:8]
> y = q[9:12]

(b) Using hand calculations (i.e., you may use R but not wilcox.test), determine the asymptotic p-value based on the normal approximation with continuity correction.

> rank(c(x, y))
> W = sum(rank(c(x, y))[1:8])
> m = length(x)
> n = length(y)
> N = m + n
> mu = m * (N+1) / 2 # population mean of W
> sigma = sqrt(m * n * (N+1) / 12) # population sd of W
> # Is W = 41 at the left tail or the right tail?
> z = (41.5 - mu) / sigma
> pValue = 2 * pnorm(z)
> shadedist(c(z, -z)) # Graph in terms of Z
> shadedist(c(41.5, 62.5), dnorm, mu, sigma) # Graph in terms of W

(c) Using the macro wilcox.test, determine the asymptotic p-value based on the normal approximation.
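Parts (b) and (c) can be cross-checked against each other with a short base R sketch. This is one possible way to organize the calculation, under the assumption that the data are entered exactly as they appear above (only their ranks matter); the names `p_value` and `p_builtin` are illustrative.

```r
# Amounts consumed, as listed in Exercise 2.18
x <- c(11, 33, 48, 34, 112, 369, 64, 44)  # experimental group (m = 8)
y <- c(177, 80, 141, 332)                 # control group (n = 4)

m <- length(x); n <- length(y); N <- m + n
W <- sum(rank(c(x, y))[1:m])              # rank sum of the first sample

mu    <- m * (N + 1) / 2                  # mean of W under H0
sigma <- sqrt(m * n * (N + 1) / 12)       # sd of W under H0

# Continuity correction: move W half a unit toward mu before standardizing
z <- (W - mu - sign(W - mu) * 0.5) / sigma
p_value <- 2 * pnorm(-abs(z))             # two-sided asymptotic p-value

# Base R's own normal approximation with continuity correction
p_builtin <- wilcox.test(x, y, exact = FALSE, correct = TRUE)$p.value
print(c(W = W, mu = mu, z = z, p_value = p_value, p_builtin = p_builtin))
```

Note that `wilcox.test` reports the Mann-Whitney form of the statistic (the rank sum minus m(m+1)/2), so its printed statistic differs from W by a constant even though the p-values agree.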
