Theoretical Statistics STAT 210A
This 94 page Class Notes was uploaded by Floy Kub on Thursday October 22, 2015. The Class Notes belongs to STAT 210A at University of California - Berkeley taught by Staff in Fall. Since its upload, it has received 85 views. For similar materials see /class/226735/stat-210a-university-of-california-berkeley in Statistics at University of California - Berkeley.
Date Created: 10/22/15
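Stat210B: Theoretical Statistics
Lecture Date: February 8, 2007
Lecture 7
Lecturer: Michael I. Jordan
Scribe: Kurt Miller

1 Properties of VC-Classes

1.1 VC preservation

Let $\mathcal{C}$ and $\mathcal{D}$ be VC-classes (i.e. classes with finite VC-dimension). Then so are
• $\mathcal{C}^c = \{C^c : C \in \mathcal{C}\}$
• $\mathcal{C} \sqcup \mathcal{D} = \{C \cup D : C \in \mathcal{C}, D \in \mathcal{D}\}$
• $\mathcal{C} \sqcap \mathcal{D} = \{C \cap D : C \in \mathcal{C}, D \in \mathcal{D}\}$
• $\phi(\mathcal{C})$, where $\phi$ is 1-1
• $\mathcal{C} \times \mathcal{D} = \{C \times D : C \in \mathcal{C}, D \in \mathcal{D}\}$

1.2 Half spaces

Let $\mathcal{G}$ be a finite-dimensional vector space of functions. Let $\mathcal{C} = \{\{g \ge 0\} : g \in \mathcal{G}\}$, or more formally $\mathcal{C} = \{\{\omega : g(\omega) \ge 0\} : g \in \mathcal{G}\}$. Then $V_{\mathcal{C}} \le \dim(\mathcal{G}) + 1$.

1.3 Subgraphs

Definition 1. A subgraph of $f : \mathcal{X} \to \mathbb{R}$ is the subset of $\mathcal{X} \times \mathbb{R}$ given by $\{(x, t) : t \le f(x)\}$. A collection $\mathcal{F}$ is a VC-subgraph class if the collection of subgraphs is a VC class.

2 Covering Number

We now begin to explore a more powerful method of defining complexity than VC-dimension.

2.1 Definitions

Definition 2 (Covering Number; Pollard, 1984, p. 25). Let $Q$ be a probability measure on $S$ and $\mathcal{F}$ be a class of functions in $L_1(Q)$, i.e. $\int |f|\,dQ < \infty$ for all $f \in \mathcal{F}$. For each $\epsilon > 0$ define the $L_1$ covering number $N_1(\epsilon, Q, \mathcal{F})$ as the smallest value of $m$ for which there exist functions $g_1, \ldots, g_m$ (not necessarily in $\mathcal{F}$) such that $\min_j \int |f - g_j|\,dQ \le \epsilon$ for each $f$ in $\mathcal{F}$. For definiteness set $N_1(\epsilon, Q, \mathcal{F}) = \infty$ if no such $m$ exists.

Note that the set $\{g_j\}$ that achieves this minimum is not necessarily unique.

Definition 3 (Metric Entropy). Define $H_1(\epsilon, Q, \mathcal{F}) = \log N_1(\epsilon, Q, \mathcal{F})$ as the $L_1$ metric entropy of $\mathcal{F}$. More generally, $H_p(\epsilon, Q, \mathcal{F})$ uses the $L_p(Q)$ norm, written $\|g\|_{p,Q} = \left(\int |g|^p\,dQ\right)^{1/p}$.

Definition 4 (Totally bounded). A class is called totally bounded if $H_p(\epsilon, Q, \mathcal{F}) < \infty$ for all $\epsilon$.

Another kind of entropy:

Definition 5 (Entropy with bracketing). Let $N_{p,B}(\epsilon, Q, \mathcal{F})$ be the smallest value of $m$ for which there exist pairs of functions $\{[g_j^L, g_j^U]\}_{j=1}^m$ such that $\|g_j^U - g_j^L\|_{p,Q} < \epsilon$ for all $j$, and for every $f \in \mathcal{F}$ there is a $j(f)$ such that $g_{j(f)}^L \le f \le g_{j(f)}^U$. Then we define the entropy with bracketing as $H_{p,B}(\epsilon, Q, \mathcal{F}) = \log N_{p,B}(\epsilon, Q, \mathcal{F})$.

Finally, using $\|g\|_\infty = \sup_x |g(x)|$, let $N_\infty(\epsilon, \mathcal{F})$ be the smallest $m$ such that there exists a set $\{g_j\}_{j=1}^m$ with $\sup_{f \in \mathcal{F}} \min_{j=1,\ldots,m} \|f - g_j\|_\infty < \epsilon$. Then $H_\infty(\epsilon, \mathcal{F}) = \log N_\infty(\epsilon, \mathcal{F})$.

2.2 Relationship of the various entropies

Using the definitions above, we have that
1. $H_1(\epsilon, Q, \mathcal{F}) \le H_{p,B}(\epsilon, Q, \mathcal{F})$ for all $\epsilon > 0$
2. $H_{p,B}(\epsilon, Q, \mathcal{F}) \le H_\infty(\epsilon/2, \mathcal{F})$ for all $\epsilon > 0$

Can these quantities be computed for normal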
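classes of functions? Yes, but you would generally look them up in a big book. We'll look at how to compute one of these quantities here.

2.3 Examples

Example 6. Let $\mathcal{F} = \{f : [0,1] \to [0,1],\ |f'| \le 1\}$, i.e. functions from $[0,1]$ to $[0,1]$ with first derivatives bounded by 1. Then $H_\infty(\epsilon, \mathcal{F}) \le A/\epsilon$, where $A$ is a constant that we will compute.

Proof. Let $0 = a_0 < a_1 < \cdots < a_m = 1$, where $a_k = k\epsilon$ and $k = 0, \ldots, m$. Let $B_1 = [a_0, a_1]$ and $B_k = (a_{k-1}, a_k]$. For each $f \in \mathcal{F}$ define
$$\tilde f = \sum_{k=1}^m \epsilon \left\lfloor \frac{f(a_k)}{\epsilon} \right\rfloor 1_{B_k}.$$
$\tilde f$ takes on values in $\{\epsilon k\}$, where $k$ is an integer. We also have $\|f - \tilde f\|_\infty \le 2\epsilon$, because $|\tilde f(a_k) - f(a_k)| \le \epsilon$ by construction and $|f(x) - f(a_k)| \le \epsilon$ for $x \in B_k$, since $f'$ is bounded by 1.

We now count the number of possible $\tilde f$ obtained by this construction. At $a_0$ there are $\lfloor 1/\epsilon \rfloor + 1$ choices for $\tilde f(a_0)$, since $\tilde f$ only takes on values of the form $\epsilon k$ in $[0,1]$. Furthermore, combining previous results gives us
$$|\tilde f(a_k) - \tilde f(a_{k-1})| \le |\tilde f(a_k) - f(a_k)| + |f(a_k) - f(a_{k-1})| + |f(a_{k-1}) - \tilde f(a_{k-1})| \le 3\epsilon.$$
Therefore, having chosen $\tilde f(a_{k-1})$, $\tilde f$ can take on at most 7 distinct values at $a_k$. Therefore
$$N_\infty(2\epsilon, \mathcal{F}) \le \left(\lfloor 1/\epsilon \rfloor + 1\right) 7^{\lfloor 1/\epsilon \rfloor},$$
which gives us that
$$H_\infty(2\epsilon, \mathcal{F}) \le \frac{1}{\epsilon} \log 7 + \log\left(\lfloor 1/\epsilon \rfloor + 1\right),$$
so the entropy is of order $1/\epsilon$, and the constant multiplying $1/\epsilon$ in this bound on $H_\infty(2\epsilon, \mathcal{F})$ can be chosen as any constant greater than $\log 7$ (for small $\epsilon$). □

A seminal paper in this field is by Birman and Solomjak in 1967. They present other examples of metric entropy calculations, including:

Example 7. Let $\mathcal{F} = \{f : [0,1] \to [0,1],\ \int (f^{(m)}(x))^2\,dx \le 1\}$. Then $H_\infty(\epsilon, \mathcal{F}) \le A \epsilon^{-1/m}$.

Example 8. Let $\mathcal{F} = \{f : \mathbb{R} \to [0,1],\ f \text{ is increasing}\}$. Then $H_{p,B}(\epsilon, Q, \mathcal{F}) \le A/\epsilon$.

Example 9. Let $\mathcal{F} = \{f : \mathbb{R} \to [0,1],\ \int |df| \le 1\}$, the class of functions of bounded variation. Then $H_{p,B}(\epsilon, Q, \mathcal{F}) \le A/\epsilon$.

Lemma 10 (Ball covering lemma). A ball $B_d(R)$ in $\mathbb{R}^d$ of radius $R$ can be covered by $\left(\frac{4R + \epsilon}{\epsilon}\right)^d$ balls of radius $\epsilon$.

Proof. Let $\{c_j\}_{j=1}^M$ be a maximal packing of size $\epsilon$ (Euclidean norm). This implies that balls of radius $\epsilon$ with centers at the $c_j$ cover $B_d(R)$; otherwise we could add more points to the packing. Let $B_j$ be the ball of radius $\epsilon/4$ centered at $c_j$. We must have that $B_i \cap B_j$ is empty for $i \ne j$. Therefore the $B_j$ are disjoint and
$$\bigcup_j B_j \subset B_d(R + \epsilon/4).$$
A ball of radius $\rho$ has volume $C_d \rho^d$, where $C_d$ is a constant that depends on the dimension $d$. Therefore the volume of the union $\bigcup_j B_j$ is $M C_d (\epsilon/4)^d$, and since it is a subset of $B_d(R + \epsilon/4)$, we have
$$M C_d \left(\frac{\epsilon}{4}\right)^d \le C_d \left(R + \frac{\epsilon}{4}\right)^d.$$
With a simple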
manipulation of this equation we get that
$$M \le \left(\frac{4R + \epsilon}{\epsilon}\right)^d. \qquad \Box$$

References
Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York.

Stat210B: Theoretical Statistics
Lecture Date: March 20, 2007
Examples using Le Cam's 3rd Lemma and Rank Statistics
Lecturer: Michael I. Jordan
Scribe: Daniel Ting

1 Examples using Le Cam's 3rd Lemma

1.1 Wilcoxon signed rank statistic

The Wilcoxon signed rank statistic is used to test if the location of a sample is equal to zero under the following assumptions:
• $f$ is a density symmetric about $\theta$
• $X_1, X_2, \ldots, X_n \overset{iid}{\sim} f$
• The null hypothesis is $\theta = 0$ and the alternative is $\theta > 0$.

The Wilcoxon signed rank statistic is
$$W_n = n^{-3/2} \sum_{i=1}^n R_i^+ \,\mathrm{sign}(X_i),$$
where $R_i^+$ is the rank of $|X_i|$ among $|X_1|, \ldots, |X_n|$. Under the null hypothesis, $\mathrm{sign}(X_i) = 1$ with probability 1/2 and $-1$ with probability 1/2. It is easily verified that the sign and rank are independent, and we have $E W_n = 0$.

To get the variance, rewrite $W_n$ as a sum over the ranks rather than a sum over the data points:
$$W_n = n^{-3/2} \sum_{k=1}^n k J_k,$$
where $J_k = 1$ if the observation whose absolute value is the $k$th largest is positive, and $J_k = -1$ otherwise. Again using the independence of sign and rank, we obtain
$$\mathrm{Var}\,W_n = n^{-3}\,\mathrm{Var}\Big(\sum_k k J_k\Big) = n^{-3} \sum_k k^2 = n^{-3}\,\frac{n(n+1)(2n+1)}{6} \to \frac{1}{3}.$$

We now have the asymptotic mean and variance of the statistic and must show it is asymptotically normal. One can show that $W_n$ is an asymptotically linear estimator with
$$W_n = n^{-1/2} \sum_{i=1}^n U_i\,\mathrm{sign}(X_i) + o_P(1),$$
where $U_i = G(|X_i|)$ and $G$ is the cdf of $|X_i|$. Indeed, if we replace $G$ with the empirical cdf we recover the Wilcoxon signed rank statistic. Since $W_n$ is asymptotically linear and the $U_i\,\mathrm{sign}(X_i)$'s are bounded, it follows that $W_n$ is asymptotically normal by the CLT.

1.1.1 Power under shrinking alternatives

We consider a test where we reject the null $\theta = 0$ if $W_n > c$ and examine the power of the test under the alternatives $\theta_n = h/\sqrt{n}$. Le Cam's 3rd Lemma suggests that we look at
$$\left(W_n,\ \log \frac{dP^n_{h/\sqrt{n}}}{dP^n_0}\right).$$
We first consider the case where the density $f$ is normal. In this case
$$\left(W_n,\ \log \frac{dP^n_{h/\sqrt{n}}}{dP^n_0}\right) = \left(n^{-1/2} \sum_i U_i\,\mathrm{sign}(X_i),\ h\,n^{-1/2} \sum_i X_i - \frac{h^2}{2}\right) + o_P(1),$$
which is asymptotically bivariate normal with the
covariance of the cross term
$$\tau = h\,\mathrm{Cov}_0\big(G(|X_1|)\,\mathrm{sign}(X_1),\ X_1\big) = h\,E_0\big[G(|X_1|)\,|X_1|\big] = \frac{h}{\sqrt{\pi}},$$
where the covariance and expectation are taken under the null. The last equality is an exercise in integration. The level-$\alpha$ test rejects when $W_n > z_\alpha/\sqrt{3}$. The asymptotic power of this test under the alternatives is then
$$\lim P_{h/\sqrt{n}}\big(W_n > z_\alpha/\sqrt{3}\big) = \lim P_{h/\sqrt{n}}\Big(\sqrt{3}\,\big(W_n - h/\sqrt{\pi}\big) > z_\alpha - h\sqrt{3/\pi}\Big) = 1 - \Phi\big(z_\alpha - h\sqrt{3/\pi}\big).$$

Dropping the simplifying normal assumption, we can obtain the asymptotic power under quadratic mean differentiability (qmd) if the density is also square integrable. Recall that the qmd score of a location family with density $f(x - \theta)$ is $-f'(x - \theta)/f(x - \theta)$. The covariance term of interest is thus
$$\tau = \mathrm{Cov}_0\left(U_1\,\mathrm{sign}(X_1),\ -h\,\frac{f'(X_1)}{f(X_1)}\right) = 2h \int f^2(x)\,dx,$$
where the last equality follows from integration by parts. Thus under the alternatives $\theta_n = h/\sqrt{n}$,
$$W_n \rightsquigarrow N\left(2h \int f^2,\ \tfrac{1}{3}\right).$$

1.2 Neyman-Pearson statistic

The log likelihood ratio under the qmd regularity condition satisfies
$$\log \frac{dP^n_{\theta + h/\sqrt{n}}}{dP^n_\theta} \rightsquigarrow N\left(-\tfrac{1}{2} h^T I_\theta h,\ h^T I_\theta h\right).$$
In this case the test statistic is the log likelihood ratio, and the covariance of interest is just the asymptotic variance of the log likelihood ratio, i.e. $\tau = h^T I_\theta h$. Thus the asymptotic power of the level-$\alpha$ test that rejects when
$$\log \frac{dP^n_{\theta + h/\sqrt{n}}}{dP^n_\theta} > z_\alpha \sqrt{h^T I_\theta h} - \tfrac{1}{2} h^T I_\theta h$$
is $1 - \Phi\big(z_\alpha - \sqrt{h^T I_\theta h}\big)$ under the alternatives.

2 Rank Statistics (van der Vaart, 1998, Chapter 13)

We first introduce some notation.
• Denote the order statistics of $X_1, X_2, \ldots, X_N$ by $X_{N(1)} \le X_{N(2)} \le \cdots \le X_{N(N)}$, and denote the order statistic vector by $X_{N()}$.
• Let the rank $R_{Ni}$ be the position of $X_i$ in the order statistics, so in the absence of ties $X_i = X_{N(R_{Ni})}$.

Definition 1 (Linear rank statistics). A statistic is a linear rank statistic if it is of the form
$$T_N = \sum_{i=1}^N c_{Ni}\, a_{N, R_{Ni}}.$$
The $c_{Ni}$'s are called the coefficients and the $a_{Nk}$'s are called the scores.

Here are some examples of linear rank statistics.

Example 2 (Two-sample problems). Given two independent samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$, the two-sample problem is to determine if both samples came from the same distribution, i.e. the null hypothesis is that both the $X_i$'s and the $Y_j$'s have the same distribution. Let $N = m + n$ and $R_N$ be
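the rank vector of the pooled sample $X_1, \ldots, X_m, Y_1, \ldots, Y_n$. Note that under the null, rank statistics are distribution free. They also throw away irrelevant properties of the distribution, such as scale. General two-sample problems often use
$$c_{Ni} = \begin{cases} 0 & \text{if } i = 1, \ldots, m \\ 1 & \text{if } i = m+1, \ldots, N. \end{cases}$$

Example 3 (Wilcoxon statistic).
$$W = \sum_{i=m+1}^N R_{Ni}.$$
Note that this is not the same as the Wilcoxon signed rank test statistic. Also note that the Mann-Whitney statistic, defined by
$$U = \sum_{i} \sum_{j} 1\{X_i \le Y_j\},$$
is a U-statistic and is equivalent to the Wilcoxon statistic up to an additive constant.

Example 4 (van der Waerden statistic).
$$V = \sum_{i=m+1}^N \Phi^{-1}\left(\frac{R_{Ni}}{N+1}\right).$$

Lemma 5 (van der Vaart, 1998, Lemma 13.1, p. 174). Let $F$ be a cdf with density $f$ and $X_1, X_2, \ldots, X_N \overset{iid}{\sim} F$. Then the following hold:
1. $X_{N()}$ and $R_N$ are independent.
2. $X_{N()}$ has density $N! \prod_{i=1}^N f(x_i)$ on the set $x_1 < \cdots < x_N$.
3. $X_{N(i)}$ has density $N \binom{N-1}{i-1} F(x)^{i-1} (1 - F(x))^{N-i} f(x)$.
4. $R_N$ is uniform on permutations of $\{1, 2, \ldots, N\}$.
5. $E[T(X_1, \ldots, X_N) \mid R_N = r] = E\,T(X_{N(r_1)}, \ldots, X_{N(r_N)})$.
6. For a linear rank statistic,
$$E\,T_N = N \bar c_N \bar a_N, \qquad \mathrm{Var}\,T_N = \frac{1}{N-1} \sum_i (c_{Ni} - \bar c_N)^2 \sum_i (a_{Ni} - \bar a_N)^2.$$

Theorem 6 (van der Vaart, 1998, Theorem 13.5, p. 176). Let $\phi : [0,1] \to \mathbb{R}$ with $\int \phi^2 < \infty$, and let the scores $a_{Ni}$ be either
(i) $a_{Ni} = \phi\left(\frac{i}{N+1}\right)$ for all $i$, or
(ii) $a_{Ni} = E\,\phi(U_{N(i)})$ for all $i$,
where $U_{N()}$ denotes the order statistics of $N$ iid uniform$[0,1]$ random variables. Define
$$\tilde T_N = N \bar c_N \bar a_N + \sum_i (c_{Ni} - \bar c_N)\,\phi(F(X_i)).$$
Then $E\,\tilde T_N = E\,T_N$ and $\mathrm{Var}(T_N - \tilde T_N)/\mathrm{Var}\,T_N \to 0$.

Remark 7. $\tilde T_N$ is asymptotically normal by Lindeberg's CLT if
$$\frac{\max_i (c_{Ni} - \bar c_N)^2}{\sum_i (c_{Ni} - \bar c_N)^2} \to 0.$$

References
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.

STAT 210A: Theoretical Statistics, Fall 2006
Lecture 22, November 14
Lecturer: Martin Wainwright
Scribe: Henry Lin

These lecture notes have only been mildly proofread.

Outline
• UMP Tests
• Generalized LRTs

Reading
• Keener, Chapters 14, 15, 18
• Bickel and Doksum, Chapter 4

22.1 Existence of a Uniformly Most Powerful Test

Recall from last lecture the concept of a uniformly most powerful (UMP) test. Given that we are trying to test $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$, a hypothesis test $\tilde\phi$ is uniformly most powerful at level $\alpha$ if
• $\sup_{\theta \in \Theta_0} E_\theta[\tilde\phi(X)] \le \alpha$
• $E_\theta[\tilde\phi(X)] \ge E_\theta[\phi(X)]$ for all $\theta \in \Theta_1$ and for all $\phi$ of level $\alpha$.

As we saw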
in the last lecture, we cannot always hope to find a UMP test when testing a composite hypothesis. We need additional structure in order to guarantee the existence of a UMP test. In today's lecture, we establish that together the two conditions below guarantee the existence of a UMP test:
(i) the test is one-sided, e.g. $H_0 : \theta = \theta_0$ versus $H_1 : \theta > \theta_0$;
(ii) appropriate conditions are imposed on the likelihood ratio test (LRT), for example monotonicity.

Definition 1. A family $\{p(x; \theta) : \theta \in \Theta\}$ is said to have monotone likelihood ratio (MLR) in some statistic $T(x)$ if for all $\theta_1 > \theta_0$, the likelihood ratio $p(x; \theta_1)/p(x; \theta_0)$ can be written as a non-decreasing function of $T(x)$, given $\theta_1$ and $\theta_0$ fixed.

Example (Monotonicity in exponential families): consider a 1-d exponential family with probability density
$$p(x; \theta) = h(x) \exp\big(\eta(\theta) T(x) - B(\theta)\big).$$
Hence the log of the likelihood ratio is
$$\log \frac{p(x; \theta_1)}{p(x; \theta_0)} = \big(\eta(\theta_1) - \eta(\theta_0)\big) T(x) + B(\theta_0) - B(\theta_1).$$
For increasing $\eta$, it is easy to see the LRT obeys monotonicity. Given $\theta_1 > \theta_0$, we have $\eta(\theta_1) - \eta(\theta_0) > 0$, which implies the likelihood ratio is an increasing function of $T(x)$. Similarly, if $\eta$ is decreasing, then the LRT is also monotone with respect to the statistic $-T(x)$. As a consequence of this observation, we may conclude that the following distributions all have monotone LRTs:
• Binomial: $X \sim \mathrm{Bin}(n, \theta)$
• Normal location: $X \sim N(\theta, \sigma^2)$
• Poisson: $X \sim \mathrm{Poi}(\theta)$

22.2 Constructing UMP Tests

Let's show how an MLR allows us to construct a UMP test for a one-sided hypothesis test $H_0 : \theta = \theta_0$ versus $H_1 : \theta > \theta_0$ for $X \sim \mathrm{Bin}(n, \theta)$.

First note: for any $\theta_1 > \theta_0$, the LRT is an increasing function of $x$; thus we have an MLR in $x$. Now consider the following generic threshold test:
$$\phi_{c,\gamma}(x) = \begin{cases} 1 & \text{if } x > c \\ \gamma & \text{if } x = c \\ 0 & \text{if } x < c. \end{cases}$$
Given any level $\alpha \in (0,1)$, it is possible to find $c$ and $\gamma$ such that $E_{\theta_0}[\phi_{c,\gamma}(X)] = \alpha$, where $\phi_{c,\gamma}$ is the threshold test defined by $c$ and $\gamma$.

Side comment: Note that we need the extra $\gamma$ parameter and the $x = c$ case in order to achieve all possible levels $\alpha$. To illustrate the necessity of the $\gamma$ parameter, consider a random variable $X \sim \mathrm{Ber}(\theta)$. If we
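only allow non-randomized tests of the form
$$\phi_c(x) = \begin{cases} 1 & \text{if } x \ge c \\ 0 & \text{if } x < c, \end{cases}$$
then the level is
$$\alpha(c) = E_{\theta_0}[\phi_c(X)] = \Pr{}_{\theta_0}(X \ge c) = \begin{cases} 1 & \text{if } c \le 0 \\ \theta_0 & \text{if } c \in (0, 1] \\ 0 & \text{if } c > 1. \end{cases}$$
Without randomization, we can only achieve three values for $\alpha$, rather than all values in $[0,1]$.

Returning to the main thread, let $\tilde\phi$ denote the threshold test $\phi_{c,\gamma}$ with $c$ and $\gamma$ chosen so that $E_{\theta_0}[\tilde\phi(X)] = \alpha$. Now that we have defined our test $\tilde\phi$, we would like to show that $\tilde\phi$ is uniformly most powerful at level $\alpha$. To prove $\tilde\phi$ is UMP, we need to show that for any fixed $\theta_1 > \theta_0$ and any test $\phi$ of level $\alpha$,
$$E_{\theta_1}[\tilde\phi(X)] \ge E_{\theta_1}[\phi(X)].$$
In other words, we would like to show $\tilde\phi$ is most powerful at level $\alpha$ when testing $H_0 : \theta = \theta_0$ versus $H_1 : \theta = \theta_1$, for any fixed $\theta_1 > \theta_0$.

Using the fact that the binomial likelihood ratio is monotone in $x$, we may conclude that $\tilde\phi$ is equivalent to the following test, provided $f(c, \theta_1, \theta_0)$ is defined appropriately:
$$\tilde\phi(x) = \begin{cases} 1 & \text{if } \frac{p(x; \theta_1)}{p(x; \theta_0)} > f(c, \theta_1, \theta_0) \\ \gamma & \text{if } \frac{p(x; \theta_1)}{p(x; \theta_0)} = f(c, \theta_1, \theta_0) \\ 0 & \text{if } \frac{p(x; \theta_1)}{p(x; \theta_0)} < f(c, \theta_1, \theta_0). \end{cases}$$
Consequently, the Neyman-Pearson theorem implies that $\tilde\phi$ is most powerful at level $\alpha$ for any fixed $\theta_1 > \theta_0$. Therefore, $\tilde\phi$ is also uniformly most powerful when testing $H_0 : \theta = \theta_0$ versus $H_1 : \theta > \theta_0$.

Here we found a UMP test for a specific example, but we can repeat the same argument with $T(x)$ in place of $x$ in the hypothesis test to establish the following much more general theorem.

Theorem 22.1. Suppose that a family has MLR in $T(x)$, and consider a one-sided test (e.g. $H_0 : \theta = \theta_0$ versus $H_1 : \theta > \theta_0$). Then the test
$$\tilde\phi(x) = \begin{cases} 1 & \text{if } T(x) > c \\ \gamma & \text{if } T(x) = c \\ 0 & \text{if } T(x) < c \end{cases}$$
is UMP at level $\alpha$, where $c$ and $\gamma$ are constants set such that $E_{\theta_0}[\tilde\phi(X)] = \alpha$.

22.3 Generalized LRTs

Let's take another look at the case when we have a general composite hypothesis test of the form $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$, where $\Theta_0 \cap \Theta_1 = \emptyset$ and $\Theta_0 \cup \Theta_1 = \Theta$. One reasonable approach here is to threshold based on the generalized likelihood ratio
$$G(x) = \frac{\sup_{\theta \in \Theta} p(x; \theta)}{\sup_{\theta \in \Theta_0} p(x; \theta)}.$$
In the expression above, the numerator is the likelihood at the global or unrestricted MLE (maximum likelihood estimator), and the denominator is the likelihood at the restricted MLE on $\Theta_0$. Note $G(x)$ is always $\ge 1$. In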
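general, we would like to choose $H_0$ if $G(x)$ is close to 1, and choose $H_1$ if $G(x) \gg 1$. Note the numerator of the GLR optimizes over all of $\Theta$ rather than $\Theta_1$. We define our GLR in this way because it is often easier to optimize over $\Theta$.

Comment: When a UMP test does exist, in many cases a GLR can be used to construct it.

22.3.1 Generalized LRT: Standard Normal Example

Let's consider the case when $X_i \sim N(\mu, \sigma^2)$ for $i = 1, \ldots, n$. Let $\Theta = \{(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 > 0\}$, and suppose we are trying to test $\Theta_0 = \{\mu = 0, \sigma^2 = 1\}$ versus $\Theta_1 = \Theta \setminus \Theta_0$. In other words, we want to test if the $X_i$ are drawn from a standard normal distribution. The denominator of the GLR is
$$p(x; \mu = 0, \sigma^2 = 1) = (2\pi)^{-n/2} \exp\Big(-\tfrac{1}{2} \sum_{i=1}^n x_i^2\Big),$$
while to compute the numerator, note that the MLEs for $\mu$ and $\sigma^2$ are $\bar x = \frac{1}{n} \sum_{i=1}^n x_i$ and $s^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2$ respectively. Thus the numerator is $p(x; \bar x, s^2)$ and the GLR is
$$G(x) = \frac{(2\pi s^2)^{-n/2} \exp\big(-\frac{1}{2 s^2} \sum_i (x_i - \bar x)^2\big)}{(2\pi)^{-n/2} \exp\big(-\frac{1}{2} \sum_i x_i^2\big)} = (s^2)^{-n/2} \exp\Big(-\frac{n}{2} + \frac{1}{2} \sum_i x_i^2\Big).$$
It follows that $2 \log G(x) = -n \log s^2 - n + \sum_{i=1}^n x_i^2$, and next time we will show that, under $H_0$, $2 \log G(x)$ converges in distribution to a chi-squared limit. Since the null fixes both $\mu$ and $\sigma^2$, the limit is $\chi^2_2$, which is equivalent in distribution to $z_1^2 + z_2^2$ where $z_1, z_2$ are independent $N(0,1)$.

22.3.2 Generalized LRT: Multinomial Example

Let's suppose a random variable $X$ has a multinomial distribution based on $n$ trials with parameters $\theta_1, \ldots, \theta_d$. In other words, given nonnegative integers $x_1, \ldots, x_d$ such that $\sum_{i=1}^d x_i = n$,
$$p(x_1, \ldots, x_d; \theta) = \binom{n}{x_1 \cdots x_d} \prod_{i=1}^d \theta_i^{x_i}.$$
Let $\Theta = \{\theta \in \mathbb{R}^d : \sum_{i=1}^d \theta_i = 1,\ \theta_i \ge 0 \text{ for } i = 1, \ldots, d\}$, and consider testing $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \notin \Theta_0$. In this case, it can be shown that
$$\hat\theta_i = \frac{x_i}{n}, \qquad i = 1, \ldots, d,$$
are the MLEs for the unrestricted problem in the numerator. Writing $\tilde\theta$ for the restricted MLE over $\Theta_0$, we have
$$G(x) = \prod_{i=1}^d \left(\frac{\hat\theta_i}{\tilde\theta_i}\right)^{x_i}.$$
Note that
$$\frac{1}{n} \log G(x) = \sum_{i=1}^d \hat\theta_i \log \frac{\hat\theta_i}{\tilde\theta_i},$$
which is actually equivalent to a Kullback-Leibler divergence $KL(p \,\|\, q)$ if we define $p_i = x_i/n$ and $q_i = \tilde\theta_i$.

Stat210B: Theoretical Statistics
Lecture Date: January 18, 2007
Lecture 2
Lecturer: Michael I. Jordan
Scribe: Ariel Kleiner

Lemma 1 (Fatou). If $X_n \overset{a.s.}{\to} X$ and $X_n \ge Y$ with $E|Y| < \infty$, then $\liminf_n E X_n \ge E X$.

Theorem 2 (Monotone Convergence Theorem). If $0 \le X_1 \le X_2 \le \cdots$ and $X_n \overset{a.s.}{\to} X$, then $E X_n \to E X$.

Note that the Monotone Convergence Theorem can be proven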
from Fatou's Lemma.

Theorem 3 (Dominated Convergence Theorem). If $X_n \overset{a.s.}{\to} X$ and $|X_n| \le Y$ with $E|Y| < \infty$, then $E X_n \to E X$.

Theorem 4 (Weak Law of Large Numbers). If $X_i \overset{iid}{\sim} X$ and $E|X| < \infty$, then $\bar X_n \overset{P}{\to} E X$, where $\bar X_n = \frac{1}{n} \sum_{i=1}^n X_i$.

Theorem 5 (Strong Law of Large Numbers). If $X_i \overset{iid}{\sim} X$ and $E|X| < \infty$, then $\bar X_n \overset{a.s.}{\to} E X$.

Definition 6 (Empirical Distribution Function). Given $n$ iid data points $X_i \overset{iid}{\sim} F$, the empirical distribution function is defined as
$$F_n(x) = \frac{1}{n} \sum_{i=1}^n 1\{X_i \le x\}.$$
Note that $F_n(x) \overset{a.s.}{\to} F(x)$ for each $x$.

Theorem 7 (Glivenko-Cantelli). Given $n$ iid data points $X_i \overset{iid}{\sim} F$,
$$P\Big(\sup_x |F_n(x) - F(x)| \to 0\Big) = 1.$$
That is, the random variable $\sup_x |F_n(x) - F(x)|$ converges to 0 almost surely.

Theorem 8 (Central Limit Theorem). Given $n$ iid random variables $X_i$ from some distribution with mean $\mu$ and covariance $\Sigma$ (which are assumed to exist),
$$\sqrt{n}\,(\bar X_n - \mu) \rightsquigarrow N(0, \Sigma).$$

The following theorem is a generalization of the Central Limit Theorem. It applies to non-iid (i.e. independent but not identically distributed) random variables, as might be arranged in a triangular array as follows, where the random variables within each row are independent:
$$\begin{array}{lll} Y_{11} & & \\ Y_{21} & Y_{22} & \\ Y_{31} & Y_{32} & Y_{33} \\ \vdots & & \end{array}$$

Theorem 9 (Lindeberg-Feller). For each $n$, let $Y_{n1}, Y_{n2}, \ldots, Y_{nk_n}$ be independent random variables with finite variance such that $\sum_{i=1}^{k_n} \mathrm{Var}(Y_{ni}) \to \Sigma$ and
$$\sum_{i=1}^{k_n} E\big[\|Y_{ni}\|^2\, 1\{\|Y_{ni}\| > \epsilon\}\big] \to 0 \quad \text{for all } \epsilon > 0.$$
Then
$$\sum_{i=1}^{k_n} (Y_{ni} - E Y_{ni}) \rightsquigarrow N(0, \Sigma).$$

We now consider an example illustrating application of the Lindeberg-Feller theorem.

Example 10 (Permutation Tests). Consider $2n$ paired experimental units in which we observe the results of $n$ treatment experiments $X_{nj}$ and $n$ control experiments $W_{nj}$. Let $Z_{nj} = X_{nj} - W_{nj}$. We would like to determine whether or not the treatment has had any effect. That is, are the $Z_{nj}$ significantly nonzero? To test this, we condition on $|Z_{nj}|$. This conditioning effectively causes us to discard information regarding the magnitude of $Z_{nj}$ and leaves us to consider only signs. Thus, under the null hypothesis $H_0$, there are $2^n$ possible outcomes, all equally probable. We now consider the test statistic
$$\bar Z_n = \frac{1}{n} \sum_{j=1}^n Z_{nj}$$
and show that under $H_0$,
$$\frac{\sqrt{n}\,\bar Z_n}{\sigma_n} \rightsquigarrow N(0, 1),$$
where $\sigma_n^2 = \frac{1}{n} \sum_{j=1}^n Z_{nj}^2$ and we assume that
$$\frac{\max_j Z_{nj}^2}{\sum_{j=1}^n Z_{nj}^2} \to 0.$$
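As a quick numerical check of the claim in Example 10, the following sketch simulates the conditional null distribution by attaching independent fair random signs to a fixed set of magnitudes. It is an illustration only, not part of the lecture: the magnitudes, the number of draws, the seed, and the helper names (`standardized_sign_flip_stat`, `simulate`) are all assumptions chosen for the example.

```python
import math
import random


def standardized_sign_flip_stat(magnitudes, rng):
    """Draw one value of sqrt(n) * Zbar_n / sigma_n under H0.

    Conditional on the magnitudes |Z_nj|, the null hypothesis makes every
    sign pattern equally likely, so we attach independent fair signs.
    """
    signed = [m if rng.random() < 0.5 else -m for m in magnitudes]
    n = len(signed)
    z_bar = sum(signed) / n
    # sigma_n^2 = (1/n) * sum of Z_nj^2 depends only on the magnitudes.
    sigma_n = math.sqrt(sum(m * m for m in magnitudes) / n)
    return math.sqrt(n) * z_bar / sigma_n


def simulate(magnitudes, draws=20000, seed=0):
    """Return `draws` independent values of the standardized statistic."""
    rng = random.Random(seed)
    return [standardized_sign_flip_stat(magnitudes, rng) for _ in range(draws)]


# Hypothetical magnitudes: bounded values, so max_j Z_nj^2 / sum_j Z_nj^2 is small.
magnitudes = [1.0 + (j % 5) for j in range(200)]
samples = simulate(magnitudes)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# Fraction of draws inside the two-sided 95% standard normal interval.
frac_within = sum(abs(s) <= 1.96 for s in samples) / len(samples)
print(mean, var, frac_within)
```

If the normal approximation is good, the printed mean should be near 0, the variance near 1, and the last fraction near 0.95. Replacing the magnitudes by a set in which one value dominates (so that the max-ratio condition fails) is a simple way to see the approximation break down.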
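Proof. Let
$$Y_{nj} = \frac{Z_{nj}}{\sqrt{\sum_{i=1}^n Z_{ni}^2}}.$$
Note that, under $H_0$, $E Y_{nj} = 0$ because $H_0$ states that $X_{nj}$ and $W_{nj}$ are identically distributed. Additionally, we have $\sum_j \mathrm{Var}(Y_{nj}) = 1$. Now observe that, for all $\epsilon > 0$,
$$\sum_j E\big[|Y_{nj}|^2\, 1\{|Y_{nj}| > \epsilon\}\big] = \sum_j \frac{Z_{nj}^2}{\sum_i Z_{ni}^2}\, 1\left\{\frac{Z_{nj}^2}{\sum_i Z_{ni}^2} > \epsilon^2\right\} \le 1\left\{\frac{\max_j Z_{nj}^2}{\sum_i Z_{ni}^2} > \epsilon^2\right\} \to 0,$$
where the equality in the first line follows from the definition of $Y_{nj}$ and the fact that we are conditioning on the magnitudes of the $Z_{nj}$, thus rendering $Z_{nj}^2$ deterministic. The desired result now follows from application of the Lindeberg-Feller theorem. □

We now move on to Chapter 3 in van der Vaart.

Theorem 11 (Delta Method; van der Vaart, Theorem 3.1). Let $\phi : D_\phi \subseteq \mathbb{R}^k \to \mathbb{R}^m$ be differentiable at $\theta$. Additionally, let $T_n$ be random variables whose ranges lie in $D_\phi$, and let $r_n \to \infty$. Then, given that $r_n(T_n - \theta) \rightsquigarrow T$,
(i) $r_n(\phi(T_n) - \phi(\theta)) \rightsquigarrow \phi'_\theta(T)$
(ii) $r_n(\phi(T_n) - \phi(\theta)) - \phi'_\theta(r_n(T_n - \theta)) \overset{P}{\to} 0$

Proof. Given that $r_n(T_n - \theta) \rightsquigarrow T$, it follows from Prohorov's Theorem that $r_n(T_n - \theta)$ is uniformly tight (UT). Differentiability implies that
$$\phi(\theta + h) - \phi(\theta) - \phi'_\theta(h) = o(\|h\|)$$
(from the definition of the derivative). Now consider $h = T_n - \theta$ and note that $T_n - \theta \overset{P}{\to} 0$ by UT and $r_n \to \infty$. By Lemma 2.12 in van der Vaart, it follows that
$$\phi(T_n) - \phi(\theta) - \phi'_\theta(T_n - \theta) = o_P(\|T_n - \theta\|).$$
Multiplying through by $r_n$, we have
$$r_n\big(\phi(T_n) - \phi(\theta) - \phi'_\theta(T_n - \theta)\big) = o_P(1),$$
thus proving (ii) above. Slutsky now implies that $r_n \phi'_\theta(T_n - \theta)$ and $r_n(\phi(T_n) - \phi(\theta))$ have the same weak limit. As a result, using the fact that $\phi'_\theta$ is a linear operator and the Continuous Mapping Theorem, we have
$$r_n \phi'_\theta(T_n - \theta) = \phi'_\theta(r_n(T_n - \theta)) \rightsquigarrow \phi'_\theta(T),$$
and so $r_n(\phi(T_n) - \phi(\theta)) \rightsquigarrow \phi'_\theta(T)$. □

We now jump ahead to U-statistics.

Definition 12 (U-Statistics). For $X_1, \ldots, X_n$ iid and a symmetric kernel function $h(X_1, \ldots, X_r)$, a U-statistic is defined as
$$U = \binom{n}{r}^{-1} \sum_\beta h(X_{\beta_1}, \ldots, X_{\beta_r}),$$
where $\beta$ ranges over all subsets of size $r$ chosen from $\{1, \ldots, n\}$.

Note that, by definition, $U$ is an unbiased estimator of $\theta = E\,h(X_1, \ldots, X_r)$, i.e. $E U = \theta$.

Example 13. Consider $\theta(F) = E X = \int x\,dF(x)$. Taking $h(x) = x$, $U = \frac{1}{n} \sum_{i=1}^n X_i$.

As an exercise, consider $\theta(F) = \int (x - \mu)^2\,dF(x)$, where $\mu = \int x\,dF(x)$, and identify $h$ for the corresponding U-statistic.

Stat210B: Theoretical Statistics
Lecture Date: April 26, 2007
More Functional Delta Method, Quantiles
Lecturer: Michael I. Jordan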
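Scribe: Chris Haulk

1 Motivation

Last lecture we developed a functional delta method using the notion of Gateaux derivative. With
$$\phi'_P(\delta_x - P) = \frac{d}{dt}\,\phi\big((1 - t)P + t\delta_x\big)\Big|_{t=0} = IF_{\phi,P}(x),$$
we write
$$\sqrt{n}\,\big(\phi(P_n) - \phi(P)\big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n IF_{\phi,P}(X_i) + R_n$$
and hope to show that $E_P\,IF_{\phi,P}(X) = 0$, $\mathrm{Var}_P\,IF_{\phi,P}(X) = \gamma^2 < \infty$, and $R_n = o_P(1)$. Then the CLT gives
$$\sqrt{n}\,\big(\phi(P_n) - \phi(P)\big) \rightsquigarrow N(0, \gamma^2).$$
Showing that $E_P\,IF_{\phi,P}(X) = 0$ should not be too hard, and calculating a variance at some point probably cannot be avoided if we want to show asymptotic normality of $\phi(P_n)$. However, showing that $R_n = o_P(1)$ may be difficult, depending on $\phi$ and $P$. For a delta method that avoids this last step, we will modify our notion of derivative.

2 Delta Method via Hadamard Differentiability

Let $\mathbb{D}$ and $\mathbb{E}$ be normed linear spaces and suppose $\phi : \mathbb{D}_\phi \to \mathbb{E}$, where $\mathbb{D}_\phi \subset \mathbb{D}$. We say that $\phi$ is Hadamard differentiable at $\theta$ if there is a continuous linear map $\phi'_\theta : \mathbb{D} \to \mathbb{E}$ such that
$$\left\|\frac{\phi(\theta + t h_t) - \phi(\theta)}{t} - \phi'_\theta(h)\right\|_{\mathbb{E}} \to 0 \quad \text{as } t \downarrow 0,$$
for every sequence $h_t \to h$ such that $\theta + t h_t \in \mathbb{D}_\phi$ for all sufficiently small $t$. If it is possible to define $\phi'_\theta$ only on a subset $\mathbb{D}_0 \subset \mathbb{D}$, and the sequences $h_t$ above are restricted to have limits $h$ in $\mathbb{D}_0$, then $\phi$ is said to be Hadamard differentiable tangentially to $\mathbb{D}_0$.

Theorem 1 (Delta Method; van der Vaart, 1998, Theorem 20.8). Let $\mathbb{D}$ and $\mathbb{E}$ be normed linear spaces. Let $\phi : \mathbb{D}_\phi \subset \mathbb{D} \to \mathbb{E}$ be Hadamard differentiable at $\theta$ tangentially to $\mathbb{D}_0$. Let $T_n : \Omega_n \to \mathbb{D}_\phi$ be maps such that $r_n(T_n - \theta) \rightsquigarrow T$ for some sequence of numbers $r_n \to \infty$ and a random element $T$ that takes values in $\mathbb{D}_0$. Then
$$r_n\big(\phi(T_n) - \phi(\theta)\big) \rightsquigarrow \phi'_\theta(T).$$

Proof. Define $g_n(h) = r_n\big(\phi(\theta + h/r_n) - \phi(\theta)\big)$ for $h \in \mathbb{D}_n = \{h : \theta + h/r_n \in \mathbb{D}_\phi\}$. By Hadamard differentiability, $g_n(h_n) \to \phi'_\theta(h)$ for every sequence $h_n \to h \in \mathbb{D}_0$. Therefore $g_n(r_n(T_n - \theta)) \rightsquigarrow \phi'_\theta(T)$ by the extended continuous mapping theorem (van der Vaart, Theorem 18.11). □

3 Applications and Examples

Last lecture we used the Gateaux functional delta method to prove asymptotic normality of the Mann-Whitney test statistic. We will prove this fact again using the Hadamard version of the functional delta method.

Lemma 2 (van der Vaart, 1998, Lemma 20.10). Let $\phi : [0,1] \to \mathbb{R}$ be twice continuously differentiable. Then the function $(F_1, F_2) \mapsto \int \phi(F_1)\,dF_2$ is Hadamard differentiable at every pair of functions $(F_1, F_2)$ such that $F_i \in D[-\infty, \infty]$ and $F_i$ has bounded variation. The derivative is
$$(h_1, h_2) \mapsto h_2\,\phi \circ F_1 \Big|_{-\infty}^{\infty} - \int h_{2-}\,d(\phi \circ F_1) + \int \phi'(F_1)\,h_1\,dF_2.$$
Here $h_-$ denotes the left-continuous version of $h$.

Proof. See text. □

Now suppose at time $\nu$ we observe two independent random samples $X_1, \ldots, X_{m_\nu}$ and $Y_1, \ldots, Y_{n_\nu}$ from distributions $F$ and $G$ respectively. Let $N = m_\nu + n_\nu$ and suppose $m_\nu/N \to \lambda \in (0,1)$ as $\nu \to \infty$. By Donsker's theorem and Slutsky's lemma,
$$\sqrt{N}\,(F_m - F,\ G_n - G) \rightsquigarrow \left(\frac{\mathbb{G}_F}{\sqrt{\lambda}},\ \frac{\mathbb{G}_G}{\sqrt{1 - \lambda}}\right)$$
for independent Brownian bridges $\mathbb{G}_F$ and $\mathbb{G}_G$. Let $\phi(x) = x$ and apply Lemma 20.10 together with the functional delta method to see that
$$\sqrt{N}\left(\int F_m\,dG_n - \int F\,dG\right) \rightsquigarrow -\int \frac{\mathbb{G}_{G-}}{\sqrt{1 - \lambda}}\,dF + \int \frac{\mathbb{G}_F}{\sqrt{\lambda}}\,dG.$$
That the limit distribution is Gaussian follows from a generalization of a well-known result for finite-dimensional processes, namely that continuous linear transformations of Gaussian processes are Gaussian. Alternatively, note that Theorem 20.8 implies that the limit variable is the limit in distribution of
$$-\int \sqrt{N}\,(G_n - G)_-\,dF + \int \sqrt{N}\,(F_m - F)\,dG;$$
rewrite the expression above as a difference of scaled, centered sums and apply the usual CLT.

4 Quantiles

The quantile function $F^{-1} : (0,1) \to \mathbb{R}$ of a cumulative distribution function $F$ is $F^{-1}(p) = \inf\{x : F(x) \ge p\}$. The quantile function has some nice properties.

Lemma 3 (van der Vaart, 1998, Lemma 21.1). For $0 < p < 1$ and $x \in \mathbb{R}$:
• $F^{-1}(p) \le x$ if and only if $p \le F(x)$
• $F \circ F^{-1}(p) \ge p$
• $F^{-1} \circ F(x) \le x$
• $F_- \circ F^{-1}(p) \le p$, where $F_-$ denotes the left-continuous version of $F$
• $F^{-1} \circ F \circ F^{-1} = F^{-1}$
• $(F \circ G)^{-1} = G^{-1} \circ F^{-1}$

Proof. Chase definitions, or see the text. □

In the next lecture we will see that $p$th sample quantiles are asymptotically normal whenever $F$ is differentiable with positive derivative $f$ at $F^{-1}(p)$, that is,
$$\sqrt{n}\,\big(F_n^{-1}(p) - F^{-1}(p)\big) \rightsquigarrow N\left(0,\ \frac{p(1 - p)}{f^2(F^{-1}(p))}\right).$$
For now we will merely calculate the influence function of $\phi(F) = F^{-1}(p)$. Assume that $F \circ F^{-1}(p) = p$. Let $F_t = (1 - t)F + t\delta_x$. By the definition of $F_t$, the equality $p = F_t \circ F_t^{-1}(p)$ can be rewritten as
$$p = (1 - t)\,F\big(F_t^{-1}(p)\big) + t\,1\{x \le F_t^{-1}(p)\}.$$
Differentiating both sides with respect to $t$, we get
$$0 = -F\big(F_t^{-1}(p)\big) + (1 - t)\,f\big(F_t^{-1}(p)\big)\,\frac{d}{dt} F_t^{-1}(p) + 1\{x \le F_t^{-1}(p)\} + t\,\frac{d}{dt}\,1\{x \le F_t^{-1}(p)\},$$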
and setting $t = 0$ we can solve for the influence function:
$$\frac{d}{dt} F_t^{-1}(p)\Big|_{t=0} = \frac{p - 1\{x \le F^{-1}(p)\}}{f(F^{-1}(p))}.$$

References
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.

Stat210B: Theoretical Statistics
Lecture Date: April 12, 2007
Power of the LRT, Bartlett Correction, Efficiency of estimators
Lecturer: Michael I. Jordan
Scribe: Daniel Ting

1 Asymptotic Power of the LRT (van der Vaart, 1998, Sec. 16.4)

Recall the setup from last time. Consider a distribution $P_\theta$ such that $P_{\theta + h/\sqrt{n}}$ satisfies local asymptotic normality, e.g. the distribution has a density that is qmd. Then the likelihood ratio statistics $\Lambda_n$ converge in distribution to a statistic in the limit experiment $N(h, I_\theta^{-1})$. Let $X$ be from the limit experiment. So in the limit we may write
$$X \sim N(h, I_\theta^{-1}),$$
$$\Lambda_n \rightsquigarrow \Lambda = \big\|I_\theta^{1/2} X - I_\theta^{1/2} H_0\big\|^2 = \big\|Z + I_\theta^{1/2} h - I_\theta^{1/2} H_0\big\|^2,$$
where $Z = I_\theta^{1/2}(X - h)$ is a standard multidimensional normal and the norm is understood to be taken as the infimum over all values in the null hypothesis $H_0$. So we see that $\Lambda$ is the squared norm of a (shifted) standard $k$-dimensional normal projected onto a $(k - l)$-dimensional subspace, where $k$ is the dimension of $X$ and $l$ is the dimension of $H_0$. Thus $\Lambda_n$ converges to a chi-squared distribution with $k - l$ degrees of freedom and noncentrality parameter $\delta = \|I_\theta^{1/2} h - I_\theta^{1/2} H_0\|$.

Under the null hypothesis $\delta = 0$, so the asymptotic level-$\alpha$ test is to reject when $\Lambda_n > \chi^2_{k-l,\alpha}$. The asymptotic power function under the alternatives is given by
$$\pi_n(\theta + h/\sqrt{n}) = P_{\theta + h/\sqrt{n}}\big(\Lambda_n > \chi^2_{k-l,\alpha}\big) \to P_h\big(\Lambda > \chi^2_{k-l,\alpha}\big) = P\big(\chi^2_{k-l}(\delta) > \chi^2_{k-l,\alpha}\big),$$
where $\alpha = P\big(\chi^2_{k-l} > \chi^2_{k-l,\alpha}\big)$.

Example 1 (Power and eigenvalues of the Fisher matrix). Consider the simple null hypothesis $H_0 : h = 0$. Then the noncentrality parameter is simply
$$\delta = \sqrt{h^T I_\theta h}.$$
Consider the alternatives in the direction of an eigenvector $h_e$ of $I_\theta$, i.e. let $h = \mu h_e$. Taking $h_e$ to be a unit eigenvector with eigenvalue $\lambda_e$, we get $\delta = \mu\sqrt{\lambda_e}$ and the limiting power is
$$P\big(\chi^2_k(\mu\sqrt{\lambda_e}) > \chi^2_{k,\alpha}\big).$$
We see that the power is the greatest in the direction of the eigenvector with the biggest eigenvalue, since the corresponding noncentrality parameter is the largest.

2 Bartlett Correction (van der Vaart, 1998, Sec. 16.5)

In the previous section we applied the properties of the asymptotic distribution of the test
statistic $\Lambda$ directly to $\Lambda_n$. However, the properties of $\Lambda_n$ for any particular $n$ are different from those of $\Lambda$, and we may consider correcting $\Lambda_n$ to make it more similar to $\Lambda$. In particular, we may consider correcting the mean. If $\Lambda \sim \chi^2_r$ then it has mean $r$. We may try to correct $\Lambda_n$ by taking
$$\frac{r\,\Lambda_n}{E_{\theta_0} \Lambda_n}.$$
Note that $\theta_0$ is the $\theta$ of the previous section. However, $E_{\theta_0} \Lambda_n$ is generally hard to compute. We may try to replace it with some expansion
$$\frac{E_{\theta_0} \Lambda_n}{r} = 1 + \frac{b(\theta_0)}{n} + \cdots$$
Note that this series is typically divergent. However, for a given truncation of the series it is often accurate for a range of values before diverging. This expansion is related to Edgeworth expansions and saddlepoint approximations. If we have an estimator $\hat b_n$ for $b(\theta_0)$, we obtain the corrected statistic
$$\frac{\Lambda_n}{1 + \hat b_n/n}.$$

3 Estimation and Efficiency of Estimation (van der Vaart, 1998, Ch. 8)

Outline
• estimate $\psi(\theta)$ with a sequence of estimators $T_n$
• derive a Gaussian limit as the best within a minimax framework
• First consider an easier problem: asymptotic relative efficiency

3.1 Asymptotic Relative Efficiency (ARE)

Consider an estimator that satisfies
$$\sqrt{n}\,\big(T_n - \psi(\theta)\big) \rightsquigarrow N(0, \sigma^2(\theta)).$$
Let us rescale time to get a $N(0,1)$ limit. Let $\nu$ denote time and let $n_\nu$ observations be taken at time $\nu$, so that
$$\sqrt{\nu}\,\big(T_{n_\nu} - \psi(\theta)\big) \rightsquigarrow N(0, 1).$$
Then we have
$$\sqrt{\nu}\,\big(T_{n_\nu} - \psi(\theta)\big) = \sqrt{\frac{\nu}{n_\nu}}\,\sqrt{n_\nu}\,\big(T_{n_\nu} - \psi(\theta)\big), \quad \text{so that} \quad \frac{n_\nu}{\nu} \to \sigma^2(\theta).$$
We see that $n_\nu$ represents how many samples we need to take in order to achieve a fixed level of accuracy. As with Pitman efficiency of tests, we can compare estimators by taking a ratio of the $n_\nu$'s. Define the asymptotic relative efficiency to be
$$ARE = \lim_{\nu \to \infty} \frac{n_{\nu,2}}{n_{\nu,1}} = \frac{\sigma_2^2(\theta)}{\sigma_1^2(\theta)}.$$

Example 2 (ARE of median). Consider a location family with density $f$, where $f$ is symmetric about 0, and iid draws from the family, $X_i \sim f(x - \theta)$. Then
$$\sqrt{n}\,(\bar X_n - \theta) \rightsquigarrow N(0, \sigma^2), \qquad \sqrt{n}\,(\tilde X_n - \theta) \rightsquigarrow N\left(0, \frac{1}{4 f(0)^2}\right),$$
where $\tilde X_n$ denotes the median. We now consider the ARE of the median relative to the mean under the normal location and Laplace location families.

Under the normal location family with $\sigma^2 = 1$, we have $1/(4 f(0)^2) = \pi/2$, so the ARE is
$$ARE = \frac{\sigma^2}{1/(4 f(0)^2)} = \frac{2}{\pi}.$$
Under the Laplace
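location family we have $f(x) = \tfrac{1}{2} e^{-|x|}$,
$$\sigma^2 = \int_{-\infty}^{\infty} \tfrac{1}{2} x^2 e^{-|x|}\,dx = \int_0^\infty x^2 e^{-x}\,dx = \Gamma(3) = 2, \qquad \frac{1}{4 f(0)^2} = 1,$$
so the ARE is 2, and the median requires half the number of samples as the mean.

3.2 Hodges' estimator and superefficiency (van der Vaart, 1998, Example 8.1)

Consider $X_i \overset{iid}{\sim} N(\theta, 1)$ and $T_n = \bar X_n$. Define Hodges' estimator to be
$$S_n = \begin{cases} T_n & \text{if } |T_n| \ge n^{-1/4} \\ 0 & \text{else.} \end{cases}$$
We have $\sqrt{n}\,(T_n - \theta) \rightsquigarrow N(0, 1)$, but for $S_n$ we have
1. $r_n S_n \overset{P}{\to} 0$ for any sequence $r_n$, if $\theta = 0$
2. $\sqrt{n}\,(S_n - \theta) \rightsquigarrow N(0, 1)$, if $\theta \ne 0$

In other words, for any $\theta \ne 0$ the asymptotic behavior of $S_n$ is the same as that of $T_n$, and for $\theta = 0$, $S_n$ converges arbitrarily fast to the truth.

To show 2, note that
$$P_\theta\big(T_n \in (\theta - M/\sqrt{n},\ \theta + M/\sqrt{n})\big) = L_\theta\big((\theta - M,\ \theta + M)\big),$$
where $L_\theta$ is the measure for a $N(\theta, 1)$. Note that we may choose $M$ large to make $L_\theta((\theta - M, \theta + M))$ arbitrarily close to 1. If $\theta \ne 0$, then the intervals $(\theta - M/\sqrt{n},\ \theta + M/\sqrt{n})$ and $(-n^{-1/4}, n^{-1/4})$ are eventually disjoint, and hence $P(T_n = S_n) \to 1$.

To show 1, note that for $\theta = 0$ the interval $(-M/\sqrt{n},\ M/\sqrt{n}) \subset (-n^{-1/4}, n^{-1/4})$ eventually, so $P(S_n = 0) \to 1$.

References
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.

Stat210B: Theoretical Statistics
Lecture Date: January 25, 2007
Lecture 4
Lecturer: Michael I. Jordan
Scribe: Mike Higgins

1 Recap

Define the following:
$$h_c(x_1, \ldots, x_c) = E\,h(x_1, \ldots, x_c, X_{c+1}, \ldots, X_r),$$
$$\zeta_c = \mathrm{Var}\,h_c(X_1, \ldots, X_c).$$
Now consider a U-statistic
$$U = \binom{n}{r}^{-1} \sum_\beta h(X_{\beta_1}, \ldots, X_{\beta_r}),$$
where $E(U) = \theta$.

1.1 Rao-Blackwellization

Note that we can write $U_n = E[h(X_1, \ldots, X_r) \mid X_{(1)}, \ldots, X_{(n)}]$. Thus we have the following inequality:
$$E\big(E[h(X_1, \ldots, X_r) \mid X_{(1)}, \ldots, X_{(n)}]\big)^2 \le E\big(E[h^2(X_1, \ldots, X_r) \mid X_{(1)}, \ldots, X_{(n)}]\big) = E\,h^2(X_1, \ldots, X_r).$$

2 Projections

Define $L_2(P)$ as the set of functions that have finite second moment, and let $T$ and $\{S : S \in \mathcal{S}\}$ belong to $L_2(P)$.

Definition 1. $\hat S \in \mathcal{S}$ is a projection of $T$ on $\mathcal{S}$ if and only if $E(T - \hat S)S = 0$ for all $S \in \mathcal{S}$.

Corollary 2 (from van der Vaart, Chapter 11). $E T^2 = E(T - \hat S)^2 + E \hat S^2$.

Now consider a sequence of statistics $T_n$ and spaces $\mathcal{S}_n$ (that contain constant random variables) with projections $\hat S_n$.

Theorem 3. If $\mathrm{Var}(T_n)/\mathrm{Var}(\hat S_n) \to 1$, then
$$\frac{T_n - E T_n}{\mathrm{stdev}(T_n)} - \frac{\hat S_n - E \hat S_n}{\mathrm{stdev}(\hat S_n)} \overset{P}{\to} 0.$$

Proof. Let
$$A_n = \frac{T_n - E T_n}{\mathrm{stdev}(T_n)} - \frac{\hat S_n - E \hat S_n}{\mathrm{stdev}(\hat S_n)}.$$
Note that $E A_n = 0$ and
$$\mathrm{Var}(A_n) = 2 - 2\,\frac{\mathrm{Cov}(T_n, \hat S_n)}{\mathrm{stdev}(T_n)\,\mathrm{stdev}(\hat S_n)}.$$
Since $T_n - \hat S_n$ is orthogonal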
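to $\mathcal{S}_n$, we have $E\,T_n \hat S_n = E\,\hat S_n^2$, hence $\mathrm{Cov}(T_n, \hat S_n) = \mathrm{Var}(\hat S_n)$, hence $\mathrm{Var}(A_n) \to 0$ and $A_n \overset{P}{\to} 0$. □

2.1 Conditional Expectations are Projections

Let $\mathcal{S}$ be the linear space of all measurable functions $g(Y)$ of $Y$. Define $E(X \mid Y)$ as a measurable function of $Y$ that satisfies
$$E\big[(X - E(X \mid Y))\,g(Y)\big] = 0 \quad \text{for all } g(Y) \in \mathcal{S}.$$
As a consequence, we have the following:
• Setting $g \equiv 1$, $E(X - E(X \mid Y)) = 0$, i.e. $E X = E\,E(X \mid Y)$.
• $E(f(Y) X \mid Y) = f(Y)\,E(X \mid Y)$, because $E\big[(f(Y) X - f(Y) E(X \mid Y))\,g(Y)\big] = E\big[(X - E(X \mid Y))\,f(Y) g(Y)\big] = 0$.
• $E\big(E(X \mid Y, Z) \mid Y\big) = E(X \mid Y)$.

2.2 Hájek Projections

Let $X_1, X_2, \ldots, X_n$ be independent, and let $\mathcal{S} = \{\sum_{i=1}^n g_i(X_i) : g_i \in L_2(P)\}$. $\mathcal{S}$ is a Hilbert space.

Lemma 3 (11.10 in van der Vaart). Let $T$ have a finite 2nd moment. Then
$$\hat S = \sum_{i=1}^n E(T \mid X_i) - (n - 1)\,E T.$$

Proof.
$$E\big(E(T \mid X_i) \mid X_j\big) = \begin{cases} E T & \text{if } i \ne j \\ E(T \mid X_j) & \text{if } i = j, \end{cases}$$
so
$$E(\hat S \mid X_j) = (n - 1)\,E T + E(T \mid X_j) - (n - 1)\,E T = E(T \mid X_j).$$
Thus we have that
$$E\big[(T - \hat S)\,g_j(X_j)\big] = E\big[E(T - \hat S \mid X_j)\,g_j(X_j)\big] = 0,$$
and we conclude $T - \hat S \perp \mathcal{S}$. □

3 Asymptotic Normality of U-Statistics

Assume $E h^2 < \infty$. Take the Hájek projection of $U_n - \theta$ onto $\{\sum_{i=1}^n g_i(X_i) : g_i \in L_2(P)\}$. Define
$$\hat U = \sum_{i=1}^n E(U_n - \theta \mid X_i).$$
We have that
$$E\big(h(X_{\beta_1}, \ldots, X_{\beta_r}) - \theta \mid X_i = x\big) = \begin{cases} h_1(x) & \text{if } i \in \beta \\ 0 & \text{otherwise,} \end{cases}$$
where $h_1(x) = E\,h(x, X_2, \ldots, X_r) - \theta$. Now
$$E(U_n - \theta \mid X_i) = \binom{n}{r}^{-1} \sum_\beta E\big(h(X_{\beta_1}, \ldots, X_{\beta_r}) - \theta \mid X_i\big) = \frac{\binom{n-1}{r-1}}{\binom{n}{r}}\,h_1(X_i) = \frac{r}{n}\,h_1(X_i),$$
so that $\hat U = \frac{r}{n} \sum_{i=1}^n h_1(X_i)$. Note that $E \hat U = 0$ and
$$\mathrm{Var}(\hat U) = \frac{r^2}{n}\,\mathrm{Var}(h_1(X_1)) = \frac{r^2}{n}\,\zeta_1.$$
And so we have
$$\frac{\mathrm{Var}(U_n)}{\mathrm{Var}(\hat U)} \to 1.$$
By our previous theorem we have that
$$\frac{U_n - \theta}{\mathrm{stdev}(U_n)} - \frac{\hat U}{\mathrm{stdev}(\hat U)} \overset{P}{\to} 0.$$
By Slutsky we have
$$\sqrt{n}\,(U_n - \theta - \hat U) \overset{P}{\to} 0.$$
By the CLT we have
$$\sqrt{n}\,\hat U \rightsquigarrow N(0, r^2 \zeta_1).$$
And by Slutsky again we have
$$\sqrt{n}\,(U_n - \theta) \rightsquigarrow N(0, r^2 \zeta_1).$$

STAT 210A: Theoretical Statistics, Fall 2006
Lecture 28, December 7
Lecturer: Martin Wainwright
Scribe: Ying Xu

These scribe notes have only been mildly proofread.

Outline
• Normal means model and regression
• Analysis of James-Stein
• Stein's risk estimator

28.1 Normal means model and regression

Recall the following normal means model that we discussed in the last lecture. Fix $\theta \in \mathbb{R}^n$:
$$Y_i = \theta_i + \sigma_n W_i, \qquad W_i \sim N(0, 1), \qquad i = 1, 2, \ldots, n.$$
We are interested in estimating $\theta$ under quadratic loss:
$$R(\hat\theta, \theta) = E\|\hat\theta - \theta\|^2 = E \sum_{i=1}^n (\hat\theta_i - \theta_i)^2.$$
Note also the non-trivial scaling: $n$ observations, $n$ parameters.

28.1.1 Connection to nonparametric regression

Before turning to analysis of the James-Stein shrinkage estimator, let's complete our exploration of how the normal means model is related to non-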
parametric problemsiin particular regression In the problem of non parametric regression we observe noisy samples of an unknown function of f E L201 say of the form 239 Z f7 0613 6 N N01 2391 n n We now convert this to the normal sequence model as follows Choose 1 2 3 to be an othonormal basis of L20 1 In terms of this orthonormal basis we can write f as 95 WWW 28 1 STAT 210A Lecture 28 7 December 7 Fall 2006 where the coef cients 19 are projections that is 19 ltf7rjgt fol f7rjd Remarks a As discussed last time many natural smoothness constraints on f have simple rep A C7 V resentations in terms of the coef cients 01192 As an example the smoothness constraint of f is m times differentiable and Hfm H2 g 02 known as a Sobelev ball is equivalent to having the coef cients of f in the Fourier basis satisfy 13072 g 02 with 717 ifj is even a 7rj71m ifj is odd By Parseval7s identity HfHZ 072 lt 00 Since the weights aj decay according to 17 N j one example of a smooth function is f de ned according to 6 coef cients decaying as 19 N j m l Different bases other than Fourier can also be of interest depending on the relevant smoothness properties in the application One important class of bases are wavelets the Haar basis a family of dilated and translated step functions is one special case of a wavelet family Returning to the main thread note that our observations Z1 Zn are independent Gaussian variables with Z N Nf 02 Therefore if de ne L n 1 239 YjZZ j for j1n then is Gaussian with Eco gfgwgw fltzgt jltzgtdz6jltf jgt varY7 2and call this an It can also be shown that for large n we have covYYj 0 for all 239 31 j Thus for large n we have 6jUng j 1 n where is a standard normal distribution which corresponds to a version of the normal means model Similarly other non parametric problems eg density estimation etc can be cast into the normal means framework 282 Analysis of JamesStein Given 71 samples of the form Y i 0a3W withWN01 28 2 STAT 210A Lecture 28 7 December 7 Fall 2006 
consider the risk function associated with quadratic loss,

$$R(\hat\theta, \theta) = E\,\|\hat\theta - \theta\|^2 = E \sum_{i=1}^n (\hat\theta_i - \theta_i)^2.$$

Also consider the two estimates

$$\hat\theta_{MLE} = Y, \qquad \hat\theta_{JSE} = \Big(1 - \frac{(n-2)\,\sigma_n^2}{\|Y\|^2}\Big) Y.$$

We now show that for $n \ge 3$, $\hat\theta_{JSE}$ dominates $\hat\theta_{MLE}$.

28.2.1 Perspective from empirical Bayes

One way to gain some intuition for the James-Stein estimator is from the following empirical Bayes viewpoint. Indeed, say that we had the Bayesian model $Y = \theta + \sigma_n W$ with a $\theta \sim N(0, \tau^2 I)$ prior. If $\tau^2$ were known, then the optimal Bayes estimate under quadratic loss would be the conditional mean

$$\hat\theta_{Bayes} = \frac{n\tau^2}{n\tau^2 + n\sigma_n^2}\,Y = \Big(1 - \frac{n\sigma_n^2}{n\tau^2 + n\sigma_n^2}\Big) Y,$$

which is a linear shrinkage estimator. If $\tau^2$ were not known, then one can imagine using the data to try to estimate it. This approach, in which data is used to estimate parameters in a Bayesian prior (or hyperprior), is known as an empirical Bayes method. For instance, under our Bayesian model we have $E\,\|Y\|^2 = n\tau^2 + n\sigma_n^2$. Thus, we can think of the James-Stein shrinkage factor as arising from an empirical Bayes approach, with $\|Y\|^2$ standing in for $n\tau^2 + n\sigma_n^2$.

28.2.2 Different regimes

It is also helpful to consider the performance of the James-Stein estimator in different regimes. First suppose that $\theta = 0$, in which case we are observing $Y = \sigma_n W$. The random variable $\|Y\|^2$ is a scaled $\chi^2_n$ and can be shown to be tightly concentrated around its mean, so that $\|Y\|^2 \approx n\sigma_n^2$ for large $n$. As a consequence we have

$$\hat\theta_{JSE} = \Big(1 - \frac{(n-2)\sigma_n^2}{\|Y\|^2}\Big) Y \approx \Big(1 - \frac{n-2}{n}\Big) Y = \frac2n\,Y \approx 0$$

for large $n$. Thus, the shrinkage is extremely beneficial when $\theta$ is zero or close to it.

At the other extreme, let us imagine that $\|\theta\|^2$ is very large relative to the noise variance. In this case, again by concentration for the non-central $\chi^2$, we have $\|Y\|^2 \approx \|\theta\|^2 + n\sigma_n^2$. If $\|\theta\|^2 \gg n\sigma_n^2$, then the shrinkage factor is very close to 1, so that the JSE is close to the MLE.

28.2.3 Analysis of James-Stein

We now show that the JSE dominates the MLE. A useful result (and of independent interest) is the following:

Proposition 28.1 (Stein's unbiased risk estimator, SURE). Say $X \sim N(\theta, \sigma^2 I)$ and $\hat\theta$ is some estimator of $\theta$. Assume $h(x) = x - \hat\theta(x)$ is
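differentiable. Then

$$\hat R = n\sigma^2 + \|h(X)\|^2 - 2\sigma^2\,\mathrm{trace}\big(\nabla h(X)\big)$$

is an unbiased estimator of $R(\theta, \hat\theta)$.

To clarify notation in this statement, note that $\hat\theta : \mathbb R^n \to \mathbb R^n$ and $h : \mathbb R^n \to \mathbb R^n$, and define

$$\|h(x)\|^2 = \sum_{i=1}^n h_i(x)^2 \qquad \text{and} \qquad \mathrm{trace}\big(\nabla h(x)\big) = \sum_{i=1}^n \frac{\partial h_i}{\partial x_i}(x).$$

The proof of the SURE formula is based on the following lemma:

Lemma 28.2 (Stein's identity). Let $X \sim N(\theta, \sigma^2 I)$. If $h : \mathbb R^n \to \mathbb R^n$ satisfies an integrability condition of the form $E\,|\partial h_i(X)/\partial x_i| < \infty$, then

$$E\big[(X - \theta)^T h(X)\big] = \sigma^2\,E\big[\mathrm{trace}\big(\nabla h(X)\big)\big].$$

Recall that we proved the scalar case ($n = 1$) of this identity in a previous homework.

Let us now apply the SURE technique to the James-Stein estimator (here $\sigma^2$ plays the role of the per-coordinate noise variance, $\sigma_n^2$ in the model above), for which we have

$$h(y) = y - \hat\theta_{JSE}(y) = \frac{(n-2)\sigma^2}{\|y\|^2}\,y, \qquad \frac{\partial h_i}{\partial y_i}(y) = (n-2)\sigma^2 \Big(\frac{1}{\|y\|^2} - \frac{2 y_i^2}{\|y\|^4}\Big),$$

$$\mathrm{trace}\big(\nabla h(y)\big) = \frac{(n-2)^2 \sigma^2}{\|y\|^2} \ \text{(see Keener)}, \qquad \|h(y)\|^2 = \frac{(n-2)^2 \sigma^4}{\|y\|^2}.$$

Finally, plugging these calculations into the SURE formula yields

$$R(\theta, \hat\theta_{JSE}) = n\sigma^2 - (n-2)^2 \sigma^4\,E\Big[\frac{1}{\|Y\|^2}\Big] < n\sigma^2 = R(\theta, \hat\theta_{MLE}).$$

STAT 210A: Theoretical Statistics, Fall 2006. Lecture 4: September 19. Lecturer: Martin Wainwright. Scribe: Tanya Roosta.

4.1 Convexity and Randomization

The outline of this lecture is:
o convexity and randomization/sufficiency
o Rao-Blackwell theorem
o Exponential family

Last time we defined the notion of a randomized estimator. This leads to an operational characterization of sufficiency. Recall that one characterization of sufficiency is that, given the value of the sufficient statistic $T$ and a random number generator, it is possible to generate a new surrogate sample $\tilde X$ that has the same distribution as the original sample $X$. The key here is that this can be achieved without knowledge of the true underlying parameter $\theta$.

Proposition 1. Say $X \sim P_\theta$ and $T$ is sufficient for $\mathcal P = \{P_\theta, \theta \in \Theta\}$. Then for any estimator $\delta(X)$ of some $g(\theta)$, there exists a (possibly randomized) estimator $\eta$ based on $T$ with the same risk function as $\delta$, i.e. $R(\theta, \delta) = R(\theta, \eta)$.

Proof. By sufficiency, given $T$ and a random number generator, we can generate a surrogate sample $\tilde X \sim P_\theta$ without knowledge of $\theta$. Now we apply the estimator $\delta$ to the surrogate data, giving $\delta(\tilde X)$. Since $\delta(\tilde X) = \delta(X)$ in distribution, the risk functions must agree. $\square$

This proposition implies that we can always do at least as well as any estimator using only a sufficient statistic.

4.2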
Convexity and Rao-Blackwell Theorem

Convexity plays a central role in many statistical settings.

Definition: A set $C \subseteq \mathbb R^n$ is convex if for all $x, y \in C$ and $\alpha \in [0,1]$, we have $\alpha x + (1-\alpha) y \in C$.

Definition: Given $C \subseteq \mathbb R^n$ convex, a function $f : C \to \mathbb R$ is convex if for all $x \neq y \in C$ and $\alpha \in (0,1)$,

$$f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y).$$

The function is strictly convex if the inequality holds strictly. An illustration of a convex function is shown in Figure 4.1.

[Figure 4.1: $f$ is a convex function.]

Examples:
1. Quadratic loss function (strictly convex): $L(\theta, a) = (\theta - a)^2$.
2. $\ell_1$ loss function (convex but not strictly convex): $L(\theta, a) = |\theta - a|$.
3. Robust loss, Huber 1967 (convex but not strictly):
$$L(\theta, a) = \begin{cases} (\theta - a)^2 & \text{if } |\theta - a| < k, \\ 2k\,|\theta - a| - k^2 & \text{otherwise.} \end{cases}$$
4. Many so-called "divergence" measures between probability distributions $p, q$. They naturally arise from convex functions, such as the Kullback-Leibler divergence, defined to be
$$D(p \,\|\, q) = E_p \log \frac{p(X)}{q(X)} = \int p(x) \log \frac{p(x)}{q(x)}\,d\mu(x).$$
The KL divergence is non-negative, and it is equal to 0 iff $p = q$. A bit more generally, the class of $f$-divergences is defined by $D_f(p \,\|\, q) = E_p\,f\big(q(X)/p(X)\big)$, where $f$ is strictly convex (Ali & Silvey 1966, Csiszar 1967) and $p(x), q(x) > 0$.

Theorem 4.1 (Jensen's Inequality). Say that $f$ is convex on some open convex set $C$ and $X$ is a random variable with $P(X \in C) = 1$ and $E\,|X| < \infty$. Then $f(E X) \le E f(X)$, and if $f$ is strictly convex then the inequality is strict unless $X$ is constant almost everywhere.

A special case of Jensen's inequality is when $X$ has the support $\{x_1, \ldots, x_n\}$. Then this reduces to

$$f(p_1 x_1 + \cdots + p_n x_n) \le p_1 f(x_1) + \cdots + p_n f(x_n),$$

where $p_i = P(X = x_i)$. The proof is via induction; the case where $n = 2$ is true from the definition. An illustration of the above inequality is the arithmetic-geometric mean inequality, for $x_i > 0$:

$$\frac1n \sum_{i=1}^n x_i \ge \Big(\prod_{i=1}^n x_i\Big)^{1/n}.$$

Recap: At the start of the lecture we saw that the risk function of any estimator can be matched using only the sufficient statistics. For problems with convex loss, Rao-Blackwell provides something stronger, i.e. any estimator that is not a function of a
sufficient statistic can be improved.

Theorem 4.2 (Rao-Blackwell). Let $X \sim P_\theta$ and let $T$ be sufficient for $\mathcal P = \{P_\theta, \theta \in \Theta\}$. Say $\delta(X)$ is an estimator of $g(\theta)$ and the loss $L(\theta, \cdot)$ is strictly convex for each $\theta \in \Theta$. Assume that $\delta$ has finite mean and risk $R(\theta, \delta) = E_\theta L(\theta, \delta(X))$, and define $\eta(t) = E_\theta[\delta(X) \mid T = t]$. Then the estimator $\eta(T)$ satisfies $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta = \eta$ a.e.

Note: Observe that $\eta(t)$ does not depend on $\theta$, since we condition on a sufficient statistic.

Proof. Essentially, we need to note that from the definition of the risk function and Jensen's inequality we can pull out $E_\theta$ and find an upper bound:

$$L(\theta, \eta(t)) = L\big(\theta, E_\theta[\delta(X) \mid T = t]\big) \le E_\theta\big[L(\theta, \delta(X)) \mid T = t\big].$$

Taking expectations over $T$ on both sides gives $R(\theta, \eta) \le R(\theta, \delta)$, with strict inequality by strict convexity unless $\delta = \eta$ a.e.

4.3 Exponential Families

Exponential families provide a unified treatment of various aspects of statistical theory.

Definition: A family $\mathcal P = \{P_\theta, \theta \in \Theta\}$ is an exponential family if each $P_\theta$ has a density (mass function) of the form

$$p(x; \theta) = h(x) \exp\Big\{\sum_i \eta_i(\theta) T_i(x) - B(\theta)\Big\}$$

on a common support $\mathcal X$, with measurable functions $\eta_i : \Theta \to \mathbb R$ and sufficient statistics $T_i : \mathcal X \to \mathbb R$. The factor $B(\theta)$ ensures that the distribution normalizes to one.

Exponential families arise as the solution to the maximum entropy problem, and in physics they are referred to as Gibbs distributions. The important point to note is that the support of an exponential family does not depend on $\theta$.

Stat210B: Theoretical Statistics. Lecture Date: February 20, 2007. Lecture 10. Lecturer: Michael I. Jordan. Scribe: Alex Shyr.

In empirical process theory, the notion of a sequence of stochastic processes converging to another process is important. The scalar analogy of this convergence is the CLT. This lecture is an introduction to Donsker's Theorem, one of the fundamental theorems of empirical process theory.

1 Weak Convergence (aka Conv. in Law, Conv. in Distribution)

Given the usual probability space $(\Omega, \mathcal F, P)$, consider a random element $X : \Omega \to \mathcal X$. Let $\mathcal A$ be a $\sigma$-field on $\mathcal X$. Define $C(\mathcal X; \mathcal A)$ to be the class of bounded continuous functions on $\mathcal X$ that are measurable with respect to $\mathcal A$. A sequence of probability measures $Q_n$ converges weakly to $Q$ if

$$Q_n f \to Q f \qquad \forall f \in C(\mathcal X; \mathcal A).$$

Note that $\mathcal A$ must be smaller than
the Borel $\sigma$-field $\mathcal B(\mathcal X)$. An alternative $\sigma$-field that works is the projection $\sigma$-field, generated by the coordinate projection maps.

2 Continuous Mapping Theorem (van der Vaart, 1998, Chapter 18)

Since weak convergence does not hold for all probability measures, we need conditions on the set $C$ on which the limiting random element concentrates.

Definition 1. A set $C$ is separable if it has a countable dense subset. A point $x$ in $\mathcal X$ is regular if for every neighborhood $V$ of $x$ there exists a uniformly continuous $g$ with $g(x) = 1$ and $g \le 1_V$.

Theorem 2. Let $H$ be an $\mathcal A / \mathcal A'$-measurable map from $\mathcal X$ into another metric space $\mathcal X'$. If $H$ is continuous at each point of some separable, $\mathcal A$-measurable set $C$ of regular points, then

$$X_n \rightsquigarrow X \ \text{ and } \ P(X \in C) = 1 \quad \Rightarrow \quad H(X_n) \rightsquigarrow H(X).$$

Some useful notes:
o A common function space $\mathcal X$ is $D[0,1]$, which is the set of all $\mathbb R$-valued cadlag functions.
o $d(x, y) = \sup_{0 \le t \le 1} |x(t) - y(t)|$ defines a metric, and closed balls for $d$ generate the projection $\sigma$-field.
o Every point of $D[0,1]$ is regular, but $D[0,1]$ is not separable.
o BUT the limit processes we will talk about concentrate on $C[0,1]$, which is separable.

Theorem 3 (fidi convergence and tightness). Let $X, X_1, X_2, \ldots$ be random elements of $D[0,1]$. Suppose that $P(X \in C) = 1$ for some separable $C$. Then $X_n \rightsquigarrow X$ iff
(i) Fidi convergence of $X_n$ to $X$, i.e. $\pi_S X_n \rightsquigarrow \pi_S X$ for every finite $S \subseteq [0,1]$;
(ii) for all $\epsilon > 0$, $\delta > 0$ there exists a grid $0 = t_0 < t_1 < \cdots < t_m = 1$ such that
$$\limsup_n P\Big(\max_i \sup_{t \in J_i} |X_n(t) - X_n(t_i)| > \delta\Big) < \epsilon, \qquad \text{where } J_i = [t_i, t_{i+1}).$$

3 Donsker's Theorem for the standard empirical process

The first version of Donsker's theorem deals with the convergence of the empirical process $U_n$ of random variables drawn uniformly from the unit interval, where

$$U_n(t) = \frac{1}{\sqrt n} \sum_{i=1}^n \big(1\{\xi_i \le t\} - t\big) \qquad \text{and} \qquad \xi_i \sim U[0,1].$$

Definition 4 (Brownian Bridge). $U$ is a Brownian Bridge iff for every finite subset $S \subset [0,1]$, $\pi_S U$ is Gaussian with zero mean and covariances $E\,U(s) U(t) = s(1-t)$ for all $0 \le s \le t \le 1$, and $U$ only has continuous sample paths.

Theorem 5. $U_n \rightsquigarrow U$, where $U$ is a Brownian Bridge.

Proof. First check (i) of Theorem 3. For $s \le t$, the cross terms vanish by independence, so

$$E[U_n(s) U_n(t)] = \frac1n \sum_{i=1}^n E\big[(1\{\xi_i \le s\} - s)(1\{\xi_i \le t\} - t)\big] = s - st = s(1-t).$$

References

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.

Stat210B
Theoretical Statistics Lecture Date May 37 2007 Lecture 29 Continuation of Bootstrap Discussion Lecturer Michael I Jordan Sc be Mike Higgins 1 Theory of Bootstrap Oftentimes we will have a statistic in the form of instead of F and we will want to estimate performance measures in this setting Examples of this include o CDF MF 7 BMW 7 ltFgtgt a Bias Mp Egg 7 Mp Variance Mp EF n 7 Mp The basic idea of the bootstrap method is to replace F with Example 1 Suppose AnF 7 S a Replace F with n throughout thus n becomes a function of data Xi XS i i X sampled from Fnl So AnFn P137 7 S a Example 2 UStatistic Let n M371 Ekj XiXjl We have shown that AnF 4127 722 where 7 E X1X21JX1Xg and 722 E X1X22 and so AnF 7gt MF 47 On the other hand we have that Siffh q yf where via 51g Ej Ek XiXj1JXiXk and 2 Ej XiXj2l Let 7 E Xi Xi2l If we have that 7f27 2 and 7 are all nite then we have consistency An 7gt MF 47 However we will show that if 7 00 we may not have consistencyl Let Xi be ilild Uniform0l variables and de ne 11 so that when i f j XiXjl S M for some real number M lt 00 and XiXZ exp 1 For divergence of Andi we need P1 62 gt A 71 1 1 1 X2 for all A gt 0 Since 672 2 maxiexiz we can prove divergence by showing Pmaxie Tz S An2 L n L ltPex1 S 1473 7gt 0 To show this note PeXz S An2 PXi gt W 17 my Since 1 W 2 i for suf ciently large n and l 7 7gt 0 it follows that Pmaxi 672 S An2 7gt 0 and we have divergence of the bootstrap estimatorl 11 Comparing weak convergencebased approximations and boostrap Suppose AnF L A which is independent of F We can use A as an approximation to AnF or we can use If we suppose An F A O TF 0n 1 where a is a coef cient depending on the distribution then An A 0 7 0n 1 Additionally if we suppose that 7 aF is tight then we have mail aF 0171 and so AnF 0pn 1 This is better than our Opn 1 result obtained from using 2 Lecture 29 Continuation of Bootstrap Discussion If on the other hand A is not independent of F we get A on 1 which implies AF 7 Mp 7 MP1 7 Mp gouty 7 aF 
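$= O_P(n^{-1/2})$, since $\lambda(\hat F_n) - \lambda(F)$ is $O_P(n^{-1/2})$.

Example 3. Suppose $\theta = \sigma^2$. Then $\hat\theta_n = \frac1n \sum_i (X_i - \bar X_n)^2 = M_2$, where $M_i$ is the $i$th central sample moment. Let

$$\lambda_n(F) = \mathrm{Var}(\sqrt n\,M_2) = \mu_4 - \mu_2^2 - \frac{2(\mu_4 - 2\mu_2^2)}{n} + \frac{\mu_4 - 3\mu_2^2}{n^2},$$

where $\mu_i$ is the $i$th central moment. The classical estimator is $M_4 - M_2^2$, but the bootstrap estimator is

$$M_4 - M_2^2 - \frac{2(M_4 - 2M_2^2)}{n} + \frac{M_4 - 3M_2^2}{n^2}.$$

For both estimators the error is $(M_4 - M_2^2) - (\mu_4 - \mu_2^2) + O_P(n^{-1})$, which is $O_P(n^{-1/2})$ because $M_i = \mu_i + O_P(n^{-1/2})$.

Note that $E\,M_2 = \frac{n-1}{n} \sigma^2$, and let $\lambda_n(F)$ be the bias of $M_2$, that is, $\lambda_n(F) = \frac{n-1}{n} \sigma^2 - \sigma^2 = -\sigma^2/n$. We have $\lambda_n(F) \to \lambda = 0$, which is independent of $F$, and so it is possible that the bootstrap estimator will converge faster than the classical estimator. We will now show that this is the case. Note that the bootstrap estimator is $\lambda_n(\hat F_n) = -M_2/n = -\sigma^2/n + O_P(n^{-3/2})$, which implies $\lambda_n(\hat F_n) - \lambda_n(F) = O_P(n^{-3/2})$, which beats the $O(n^{-1})$ rate of the classical estimator.

1.2 Bootstrap Confidence Intervals

Define a root $R_n(X_n, \theta(P))$ as a quantity that can be inverted to obtain a confidence interval. The classical example of a root is $R_n(X_n, \theta(P)) = (\hat\theta_n - \theta(P))/s_n$, where $s_n$ is some estimate of the standard deviation.

To obtain confidence intervals based on $R_n$ we need the distribution of $R_n$, which we will call $\lambda_n(P)$. That is, $\lambda_n(P)(t) = P\big(R_n(X_n, \theta(P)) \le t\big)$. The simplest case occurs when $\lambda_n$ is independent of $P$, in which case $R_n$ is called a pivot.

Example 4. Suppose $X_i \overset{iid}{\sim} N(\theta, \sigma^2)$. Then $R_n = \sqrt n\,(\bar X_n - \theta)/s_n \sim t_{n-1}$, which is independent of $\theta$ and $\sigma^2$. In this instance $R_n$ is a pivot.

In general, if $R_n$ is a pivot and there is a $t$ such that $P(|R_n| \le t) = 1 - \alpha$ for all $P$, then $[\hat\theta_n - t s_n,\ \hat\theta_n + t s_n]$ is a $1 - \alpha$ confidence interval for $\theta(P)$, independent of $P$.

In the case of the bootstrap, we approximate $\lambda_n(P)$ by $\lambda_n(\hat P_n)$, and we consider the set

$$B_n(1 - \alpha, X_n) = \big\{\theta \in \Theta : \lambda_n^{-1}(\alpha/2, \hat P_n) \le R_n(X_n, \theta) \le \lambda_n^{-1}(1 - \alpha/2, \hat P_n)\big\}.$$

We can use a Monte Carlo method to estimate $\lambda_n^{-1}(\cdot, \hat P_n)$.

Lemma 5 (van der Vaart, 1998, Lemma). Assume $(\hat\theta_n - \theta)/\hat\sigma_n \rightsquigarrow T$ and $(\hat\theta_n^* - \hat\theta_n)/\hat\sigma_n^* \rightsquigarrow T$ conditionally. Then the bootstrap confidence intervals are asymptotically consistent.

Theorem 6 (Sample means; van der Vaart, 1998, Theorem 23.4). Suppose $X_i$ are i.i.d. with mean $\mu$ and covariance $\Sigma$. Then conditionally on $X_1, X_2, \ldots$, $\sqrt n\,(\bar X_n^* - \bar X_n) \rightsquigarrow N(0, \Sigma)$ for almost every sequence $X_1, X_2, \ldots$.

Theorem 7 (Delta method for bootstrap; van der Vaart, 1998, Theorem 23.5). Let $\phi$ be differentiable in a neighborhood of $\theta$, let $\hat\theta_n \xrightarrow{a.s.} \theta$, and let $\sqrt n\,(\hat\theta_n - \theta) \rightsquigarrow T$ and $\sqrt n\,(\hat\theta_n^* - \hat\theta_n) \rightsquigarrow T$ conditionally almost surely. Then $\sqrt n\,(\phi(\hat\theta_n) - \phi(\theta)) \rightsquigarrow \phi'_\theta(T)$ and $\sqrt n\,(\phi(\hat\theta_n^*) - \phi(\hat\theta_n)) \rightsquigarrow \phi'_\theta(T)$ conditionally almost surely.

References

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.

Stat210B: Theoretical Statistics. Lecture Date: January 16, 2007. Lecture 1. Lecturer: Michael I. Jordan. Scribe: Karl Rohe.

Reading: Chapter two of van der Vaart's book Asymptotic Statistics.

1 Convergence

There are four types of convergence that we will discuss.

Definition 1. Weak convergence, also known as convergence in distribution or law, is denoted $X_n \xrightarrow{d} X$. A sequence of random variables $X_n$ converges in law to a random variable $X$ if $P(X_n \le x) \to P(X \le x)$ for all $x$ at which $P(X \le x)$ is continuous.

Definition 2. $X_n$ is said to converge in probability to $X$ if for all $\epsilon > 0$, $P(d(X_n, X) > \epsilon) \to 0$. This is denoted $X_n \xrightarrow{P} X$.

Definition 3. $X_n$ is said to converge in $r$th mean to $X$ if $E\big(d(X_n, X)^r\big) \to 0$. This is denoted $X_n \xrightarrow{r} X$.

Definition 4. $X_n$ is said to converge almost surely to $X$ if $P(\lim_n d(X_n, X) = 0) = 1$. This is denoted $X_n \xrightarrow{a.s.} X$.

Theorem 5.
o Almost sure convergence implies convergence in probability.
o Convergence in $r$th mean also implies convergence in probability.
o Convergence in probability implies convergence in law.
o $X_n \xrightarrow{d} c$ implies $X_n \xrightarrow{P} c$, where $c$ is a constant.

Theorem 6 (The Continuous Mapping Theorem). Let $g$ be continuous on a set $C$ where $P(X \in C) = 1$. Then
1. $X_n \xrightarrow{d} X \Rightarrow g(X_n) \xrightarrow{d} g(X)$
2. $X_n \xrightarrow{P} X \Rightarrow g(X_n) \xrightarrow{P} g(X)$
3. $X_n \xrightarrow{a.s.} X \Rightarrow g(X_n) \xrightarrow{a.s.} g(X)$

Example 7. Let $X_n \xrightarrow{d} X$, where $X \sim N(0, 1)$. Define the function $g(x) = x^2$. The CMT says $g(X_n) \xrightarrow{d} g(X)$. But $X^2 \sim \chi^2_1$. So $g(X_n) \xrightarrow{d} \chi^2_1$.

Example 8. Let $X_n = 1/n$ and $g(x) = 1\{x > 0\}$. Then $X_n \xrightarrow{d} 0$ and $g(X_n) \xrightarrow{d} 1$. But $g(0) = 0 \neq 1$, so continuity of $g$ cannot be dropped.

Theorem 9 (Slutsky's Theorems).
1. $X_n \xrightarrow{d} X$ and $X_n - Y_n \xrightarrow{P} 0$ together imply $Y_n \xrightarrow{d} X$.
2. $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ together imply $(X_n, Y_n) \xrightarrow{d} (X, c)$.
3. $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ together imply $X_n + Y_n \xrightarrow{d} X + c$.
4. $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ together imply $X_n Y_n \xrightarrow{d} cX$.
5. $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ together imply $X_n / Y_n \xrightarrow{d} X / c$ when $c \neq 0$.

Example 10. Let $X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. From the Weak Law of Large Numbers we know the sample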
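mean $\bar X_n \xrightarrow{P} \mu$. Similarly, $\frac1n \sum_{i=1}^n X_i^2 \xrightarrow{P} E X^2$. By Slutsky's Theorem we know $S_n^2 = \frac1n \sum_{i=1}^n X_i^2 - \bar X_n^2 \xrightarrow{P} \sigma^2$. Together with the CMT, this implies $S_n \xrightarrow{P} \sigma$. From the CLT, $\sqrt n\,(\bar X_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$. Together these facts imply

$$\frac{\sqrt n\,(\bar X_n - \mu)}{S_n} = \frac{\sqrt n\,(\bar X_n - \mu)}{\sigma} \cdot \frac{\sigma}{S_n} \xrightarrow{d} N(0, 1),$$

where the convergence is due to Slutsky. So the t-statistic is asymptotically normal.

Definition 11. $X_n = o_P(R_n)$, pronounced "$X_n$ is little-oh-pee-$R_n$", means $X_n = Y_n R_n$, where $Y_n \xrightarrow{P} 0$.

Definition 12. $X_n = O_P(R_n)$, pronounced "$X_n$ is big-oh-pee-$R_n$", means $X_n = Y_n R_n$, where $Y_n = O_P(1)$. Here $O_P(1)$ denotes a sequence $Z_n$ such that for any $\epsilon > 0$ there exists an $M$ such that $P(|Z_n| > M) < \epsilon$ for all $n$.

Lemma 13. Let $R : \mathbb R^k \to \mathbb R$ and $R(0) = 0$. Let $X_n = o_P(1)$. Then, as $h \to 0$, for all $p > 0$,
1. $R(h) = o(\|h\|^p)$ implies $R(X_n) = o_P(\|X_n\|^p)$;
2. $R(h) = O(\|h\|^p)$ implies $R(X_n) = O_P(\|X_n\|^p)$.

To prove this, apply the CMT to $g(h) = R(h)/\|h\|^p$, with $g(0) = 0$.

o Any random variable is tight, i.e. for all $\epsilon > 0$ there exists an $M$ such that $P(\|X\| > M) < \epsilon$.
o $\{X_\alpha : \alpha \in A\}$ is called uniformly tight (UT) if for all $\epsilon > 0$ there exists an $M$ such that $\sup_\alpha P(\|X_\alpha\| > M) < \epsilon$.

Theorem 14 (Prohorov's theorem; cf. Heine-Borel).
1. If $X_n \xrightarrow{d} X$, then $\{X_n\}$ is UT.
2. If $\{X_n\}$ is UT, then there exists a subsequence $X_{n_j}$ with $X_{n_j} \xrightarrow{d} X$ as $j \to \infty$ for some $X$.

As we move on in the course, we will wish to describe weak convergence for things other than random variables. At that point our previous definition will not make sense. We can then use the following theorem as a definition.

Theorem 15 (Portmanteau). $X_n \xrightarrow{d} X \iff E f(X_n) \to E f(X)$ for all bounded continuous $f$.

In this theorem, "bounded and continuous" can be replaced with:
o "continuous and vanishes outside of compacta"
o "bounded and measurable such that $P(X \in C(g)) = 1$", where $C(g)$ is the set of $g$'s continuity points
o "bounded Lipschitz"
o "$f(x) = e^{i t^T x}$". This is the next theorem.

Theorem 16 (Continuity theorem). $X_n \xrightarrow{d} X \iff E \exp(i t^T X_n) \to E \exp(i t^T X)$ for all $t$.

Example 17. To demonstrate why $f$ must be bounded, observe what happens if $g(x) = x$ and

$$X_n = \begin{cases} n & \text{w.p. } 1/n, \\ 0 & \text{otherwise.} \end{cases}$$

Then $X_n \xrightarrow{d} 0$ and $E\,g(X_n) = 1$, but $g(0) = 0$.

Example 18. To demonstrate why $f$ must be continuous, observe what happens if $X_n = 1/n$ and

$$g(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{if } x \le 0. \end{cases}$$

Theorem 19 (Scheffe). For random variables $X_n \ge 0$, if $X_n \xrightarrow{a.s.} X$ and $E X_n \to E X$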
$< \infty$, then $E\,|X_n - X| \to 0$. For densities: if $f_n(x) \to g(x)$ for all $x$, then $\int |f_n(x) - g(x)|\,dx \to 0$.

Stat210B: Theoretical Statistics. Lecture Date: February 21, 2008. Weak Convergence in General Metric Spaces. Lecturer: Michael I. Jordan. Scribe: Yueqing Wang.

1 General Metric (Norm) Space

The objects of interest are functions from a sample space to a general metric space, where each point is a function. Then we can try to use statistical properties, e.g. goodness of fit, to test certain assumptions.

Example 1 (Cramer-von Mises). Let $P_n$ be the empirical probability measure of a random sample $X_1, \ldots, X_n$ of real-valued random variables. The Cramer-von Mises statistic for testing the null hypothesis that the underlying probability measure is a given $P$ is given by

$$\int (P_n f - P f)^2\,dP,$$

which can be considered as a measure of the distance between $P_n$ and $P$. If the distribution of this statistic is known, we can test the hypothesis. $P$ can be very complex. But if the class $\mathcal F$ of measurable functions is $P$-Donsker, the Cramer-von Mises statistic converges to a functional of a Brownian Bridge.

Definition 2 (Uniform Norm). The uniform norm on function spaces is defined as

$$\|z\|_T = \sup_{t \in T} |z(t)|.$$

Example 3. Some commonly used general metric spaces:
o $C[a,b]$: all the continuous functions on $[a,b] \subset \mathbb R$.
o $D[a,b]$ (cadlag functions): all the functions that have limits from the left and are continuous from the right.
o $\ell^\infty[a,b]$: all bounded functions.
And we have $C[a,b] \subseteq D[a,b] \subseteq \ell^\infty[a,b]$.

Note: $C[a,b]$ is separable, i.e. it has a countable dense subset. $D[a,b]$ isn't separable. Hence $\ell^\infty[a,b]$ is not separable either. Most of the empirical processes are in $D[a,b]$ because of the jumps; most limiting processes are in $C[a,b]$.

2 Weak Convergence

Definition 4 (Random Element). The Borel $\sigma$-field on a metric space $D$ is the smallest $\sigma$-field that contains the open sets (and then also the closed sets). A function defined relative to (one or two) metric spaces is called Borel measurable if it is measurable relative to the Borel $\sigma$-field(s). A Borel-measurable map $X : \Omega \to D$ defined on a probability space $(\Omega, \mathcal U, P)$ is referred to
as a random element with values in $D$.

Definition 5. Random elements $X_n$ converging weakly to the random element $X$ means $E f(X_n) \to E f(X)$ for all bounded and continuous functions $f$.

Note: For random elements, the Continuous Mapping Theorem still holds. If random elements $X_n \xrightarrow{d} X$ and functions $g_n \to g$ are continuous, it follows that $g_n(X_n) \xrightarrow{d} g(X)$.

Definition 6. A random element $X$ is tight if for all $\epsilon > 0$ there exists a compact set $K$ such that $P(X \notin K) \le \epsilon$.

Definition 7. $X = \{X_t : t \in T\}$ is a collection of random variables, where $X_t : \Omega \to \mathbb R$ is defined on $(\Omega, \mathcal U, P)$. A sample path is defined as $t \mapsto X_t(\omega)$.

Theorem 8 (Converge Weakly to a Tight Random Element). A sequence of maps $X_n : \Omega \to \ell^\infty(T)$ converges weakly to a tight random element iff
o Fidi convergence: $(X_{n,t_1}, \ldots, X_{n,t_k})$ converges weakly in $\mathbb R^k$ for each finite set $t_1, \ldots, t_k$.
o Asymptotic partition: for all $\epsilon, \eta > 0$ there exists a partition of $T$ into finitely many sets $T_1, \ldots, T_k$ such that
$$\limsup_{n \to \infty} P\Big(\sup_i \sup_{s,t \in T_i} |X_{n,s} - X_{n,t}| > \epsilon\Big) < \eta.$$

3 The Donsker Theorems

Theorem 9 (Classical Donsker Theorem). Let $X_1, X_2, \ldots$ be i.i.d. random variables with distribution function $F$, where $F$ is the uniform distribution function on $[0,1]$, and let $F_n$ be the empirical distribution functions, $F_n(t) = \frac1n \sum_{i=1}^n 1\{X_i \le t\}$. Then for fixed $t_1, \ldots, t_k$ it follows that

$$\sqrt n\,\big(F_n(t_1) - F(t_1), \ldots, F_n(t_k) - F(t_k)\big) \xrightarrow{d} \big(G_F(t_1), \ldots, G_F(t_k)\big),$$

where the $G_F(t_i)$ are zero-mean Gaussian with covariance $t_i \wedge t_j - t_i t_j$.

Theorem 10 (Donsker). If $X_1, X_2, \ldots$ are i.i.d. random variables with distribution function $F$, then the sequence of empirical processes $\sqrt n\,(F_n - F)$ converges in distribution in the space $D[-\infty, \infty]$ to a tight random element $G_F$ (i.e. a Brownian Bridge), whose marginal distributions are zero-mean normal with covariance function

$$E\,G_F(t_i) G_F(t_j) = F(t_i \wedge t_j) - F(t_i) F(t_j).$$

Denote empirical processes as follows: $G_n = \sqrt n\,(P_n - P)$, and thus $G_n f = \sqrt n\,(P_n f - P f)$.

Definition 11 ($P$-Donsker). $\mathcal F$ is $P$-Donsker if $G_n$ converges weakly to a tight limit process in $\ell^\infty(\mathcal F)$, which is a $P$-Brownian Bridge $G_P$ with zero mean and covariance function $E\,G_P f\,G_P g = P f g - P f\,P g$.

Definition 12. Define the bracketing integral as

$$J_{[\,]}(\delta, \mathcal F, L_2(P)) = \int_0^\delta \sqrt{\log N_{[\,]}(\epsilon, \mathcal F, L_2(P))}\,d\epsilon.$$

Theorem 13. If $J_{[\,]}(1, \mathcal F, L_2(P)) < \infty$, then $\mathcal F$ is $P$-Donsker.

Example 14. $\mathcal F = \{1_{(-\infty, t]} : t \in \mathbb R\}$. By
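calculating the bracketing number, it follows that $\log N_{[\,]}(\epsilon, \mathcal F, L_2(P)) \lesssim \log(1/\epsilon)$. Hence $J_{[\,]}(1, \mathcal F, L_2(P))$ is finite. By the above theorem, we know that this function class is $P$-Donsker, and the empirical processes will converge to a Brownian Bridge.

Example 15 (Lipschitz classes are $P$-Donsker). Let $\mathcal F = \{f_\theta : \theta \in \Theta \subset \mathbb R^d\}$ be a Lipschitz function class, i.e. given $x$ fixed, if

$$|f_{\theta_1}(x) - f_{\theta_2}(x)| \le m(x)\,\|\theta_1 - \theta_2\| \qquad \forall \theta_1, \theta_2,$$

then

$$N_{[\,]}\big(\epsilon \|m\|_{P,2}, \mathcal F, L_2(P)\big) \le K \Big(\frac{\mathrm{diam}\,\Theta}{\epsilon}\Big)^d,$$

where $K$ is a certain constant.

Proof. The brackets $[f_\theta - \epsilon m,\ f_\theta + \epsilon m]$ have size smaller than $2\epsilon \|m\|_{P,2}$. And they cover $\mathcal F$, because $f_{\theta_1} - \epsilon m \le f_{\theta_2} \le f_{\theta_1} + \epsilon m$ if $\|\theta_1 - \theta_2\| \le \epsilon$. Hence we need at most $(\mathrm{diam}\,\Theta / \epsilon)^d$ cubes of size $\epsilon$ to cover $\Theta$, and then use balls to cover the cubes. $\square$

References

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.

Stat210B: Theoretical Statistics. Lecture Date: March 07, 2007. Asymptotics of Empirical Processes. Lecturer: Michael I. Jordan. Scribe: Wei-Chun Kao.

1

Lemma 1 (van der Vaart, 1998, Lemma 19.24). Let $\mathcal F$ be $P$-Donsker and $\hat f_n$ be a random sequence of functions taking values in $\mathcal F$ such that

$$\int \big(\hat f_n(x) - f_0(x)\big)^2\,dP(x) \xrightarrow{P} 0$$

for some $f_0$ in $L_2(P)$. Then we have $G_n(\hat f_n - f_0) \xrightarrow{P} 0$ and $G_n \hat f_n \rightsquigarrow G_P f_0$, with $G_n = \sqrt n\,(P_n - P)$.

Proof sketch. Uses uniform continuity of the sample paths of $G_P$ together with the CMT. $\square$

2 An Example: Mean Absolute Deviation

Define the mean absolute deviation

$$M_n = \frac1n \sum_{i=1}^n |X_i - \bar X_n|.$$

Let $F$ denote the unknown CDF. W.l.o.g. let $F x = 0$ (mean zero). Let $\mathcal F = \{|x - \theta| : \theta \in \Theta\}$. If $F x^2 < \infty$ and if $\theta \in \Theta$ for a compact $\Theta$, then $\mathcal F$ is $F$-Donsker (van der Vaart, 1998, Example 19.7). Moreover,

$$F\big(|x - \bar X_n| - |x|\big)^2 \le |\bar X_n|^2 \xrightarrow{P} 0.$$

By Lemma 1 we have

$$G_n |x - \bar X_n| - G_n |x| \xrightarrow{P} 0. \qquad (1)$$

Consider

$$\sqrt n\,(M_n - F|x|) = \sqrt n\,\big(F|x - \bar X_n| - F|x|\big) + G_n |x| + o_P(1). \qquad (2)$$

Assume that $\theta \mapsto F|x - \theta|$ is differentiable at $\theta = 0$. Differentiating $F|x - \theta|$ at $\theta = 0$, we have the derivative $2F(0) - 1$.

Apply the Delta Method to $\sqrt n\,\big(F|x - \bar X_n| - F|x|\big)$; we have

$$\sqrt n\,\big(F|x - \bar X_n| - F|x|\big) = (2F(0) - 1)\,\sqrt n\,(\bar X_n - 0) + o_P(1) \qquad (3)$$
$$= (2F(0) - 1)\,\sqrt n\,\bar X_n + o_P(1) \qquad (4)$$
$$= (2F(0) - 1)\,\sqrt n\,(F_n - F)x + o_P(1) \qquad (5)$$
$$= (2F(0) - 1)\,G_n x + o_P(1). \qquad (6)$$

Therefore we have

$$\sqrt n\,(M_n - F|x|) = G_n\big((2F(0) - 1)x + |x|\big) + o_P(1) \qquad (7)$$
$$\rightsquigarrow G_F\big((2F(0) - 1)x + |x|\big). \qquad (8)$$

Thus $\sqrt n\,(M_n - F|x|)$ is AN with mean 0 and variance equal to the variance of $(2F(0) - 1)X_1 + |X_1|$. We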
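pay for not knowing the mean of $X$ through the $(2F(0) - 1)X$ term. When the mean and the median are the same, $2F(0) - 1 = 0$, in which case we don't incur any extra variance by having to estimate the location parameter.

3 AN of Z-estimators

Definition 2. A function $\psi_\theta(x)$ is Lipschitz if there exists a function $\dot\psi(x)$ such that

$$\|\psi_{\theta_1}(x) - \psi_{\theta_2}(x)\| \le \dot\psi(x)\,\|\theta_1 - \theta_2\| \qquad \forall \theta_1, \theta_2 \text{ in some neighborhood of } \theta_0,$$

and $P \dot\psi^2 < \infty$.

Theorem 3 (van der Vaart, 1998, Theorem 5.21). For each $\theta$ in an open subset of Euclidean space, let $\psi_\theta(x)$ be Lipschitz. Assume $P \|\psi_{\theta_0}\|^2 < \infty$ and that $\theta \mapsto P \psi_\theta$ is differentiable at $\theta_0$ with derivative $V_{\theta_0}$ (note that this is different from "$\psi_\theta$ is differentiable"). Let

$$P_n \psi_{\hat\theta_n} = o_P(n^{-1/2}) \quad \text{("near zero")}. \qquad (9)$$

Assume $\hat\theta_n \xrightarrow{P} \theta_0$ (consistency). Then we have

$$\sqrt n\,(\hat\theta_n - \theta_0) = -V_{\theta_0}^{-1}\,\frac{1}{\sqrt n} \sum_{i=1}^n \psi_{\theta_0}(X_i) + o_P(1),$$

i.e. $\sqrt n\,(\hat\theta_n - \theta_0)$ is AN with zero mean and covariance $V_{\theta_0}^{-1}\,P \psi_{\theta_0} \psi_{\theta_0}^T\,(V_{\theta_0}^{-1})^T$.

Proof. van der Vaart (1998, Example 19.7) shows that Lipschitz classes are $P$-Donsker. Applying Lemma 1, we have $G_n \psi_{\hat\theta_n} - G_n \psi_{\theta_0} \xrightarrow{P} 0$. By the assumption that $\sqrt n\,P_n \psi_{\hat\theta_n} = o_P(1)$ we have

$$G_n \psi_{\hat\theta_n} = -\sqrt n\,P \psi_{\hat\theta_n} + o_P(1) \qquad (10)$$
$$= \sqrt n\,P(\psi_{\theta_0} - \psi_{\hat\theta_n}) + o_P(1), \qquad (11)$$

with $P \psi_{\theta_0} = 0$ by definition. Applying the Delta Method (or van der Vaart, 1998, Lemma 2.12), we have

$$\sqrt n\,V_{\theta_0}(\theta_0 - \hat\theta_n) + \sqrt n\,o_P(\|\hat\theta_n - \theta_0\|) = G_n \psi_{\theta_0} + o_P(1). \qquad (12)$$

By invertibility of $V_{\theta_0}$ we have

$$\sqrt n\,\|\hat\theta_n - \theta_0\| \le \|V_{\theta_0}^{-1}\|\,\sqrt n\,\|V_{\theta_0}(\hat\theta_n - \theta_0)\| \qquad (13)$$
$$= O_P(1) + o_P\big(\sqrt n\,\|\hat\theta_n - \theta_0\|\big). \qquad (14)$$

(14) is obtained by plugging (12) into (13) and using the triangle inequality. Therefore we have

$$\hat\theta_n \text{ is } \sqrt n\text{-consistent.} \qquad (15)$$

By (12) and (15) we have

$$\sqrt n\,V_{\theta_0}(\theta_0 - \hat\theta_n) = G_n \psi_{\theta_0} + o_P(1). \qquad (16)$$

Multiply both sides by $V_{\theta_0}^{-1}$ to get the result. $\square$

References

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.

Stat210B: Theoretical Statistics. Lecture Date: March 15, 2007. Change of Measure and Contiguity. Lecturer: Michael I. Jordan. Scribe: Aria Haghighi.

In the last lecture, we discussed contiguity of measures as the analogue of absolute continuity for asymptotic statistics. In this lecture we will use contiguity to establish change-of-measure results for statistical hypothesis testing. We briefly recall the definition of contiguity here.

Definition 1 (Contiguity). Let $Q_n$ and $P_n$ be sequences of measures. We say that $Q_n$ is contiguous w.r.t. $P_n$, denoted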
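$Q_n \triangleleft P_n$, if for each sequence of measurable sets $A_n$ [1], we have that $P_n(A_n) \to 0 \Rightarrow Q_n(A_n) \to 0$.

We also showed that $Q_n \triangleleft P_n$ if and only if whenever the Radon-Nikodym derivative $dQ_n/dP_n$ converges weakly under $P_n$ to a random variable $V$, i.e. $dQ_n/dP_n \overset{P_n}{\rightsquigarrow} V$, then we have $E V = 1$ [2]. We also saw that a distribution being in the Quadratic Mean Derivative (QMD) family implied contiguity for shrinking alternatives in statistical testing. Formally, for QMD families $P_\theta$ we have that $P^n_{\theta + h/\sqrt n} \triangleleft P^n_\theta$, by Theorem 7.2 in van der Vaart (1998, pg. 94).

[1] Here measurable means with respect to the underlying Borel sets of $\Omega_n$, which may change with $n$.
[2] Note that by Prohorov's theorem, $dQ_n/dP_n$ has a convergent subsequence, so the theorem isn't vacuous.

We now state an important result regarding the joint distribution of test statistics and the likelihood ratio.

Lemma 2 (Theorem 6.6 in van der Vaart, 1998, pg. 90). Let $P_n$ and $Q_n$ be sequences of measures such that $Q_n \triangleleft P_n$. Let $X_n$ be a sequence of test statistic random variables. Suppose that we have

$$\Big(X_n, \frac{dQ_n}{dP_n}\Big) \overset{P_n}{\rightsquigarrow} (X, V)$$

for limiting random variables $X$ and $V$. Then we have that $L(B) = E\,1_B(X)\,V$ defines a measure. Furthermore, $X_n \overset{Q_n}{\rightsquigarrow} L$.

Proof. By contiguity, we have that $E V = 1$, which in turn implies that $L$ must be a probability measure. Using Portmanteau's lemma and a standard induction over measurable functions gives that $X_n \overset{Q_n}{\rightsquigarrow} L$. $\square$

Typically we have that $(X, \log V)$ is bivariate normal. In this case, we have a very appealing result about the asymptotic distribution of the test statistic under $Q_n$.

Lemma 3 (Le Cam's Third Lemma; pg. 90, van der Vaart, 1998). Suppose that

$$\Big(X_n, \log \frac{dQ_n}{dP_n}\Big) \overset{P_n}{\rightsquigarrow} N\left(\begin{pmatrix} \mu \\ -\sigma^2/2 \end{pmatrix}, \begin{pmatrix} \Sigma & \tau \\ \tau^T & \sigma^2 \end{pmatrix}\right),$$

where $\tau$ and $\sigma$ are scalars [3]. Then we have that $X_n \overset{Q_n}{\rightsquigarrow} N(\mu + \tau, \Sigma)$.

[3] Note that the mean of $\log dQ_n/dP_n$ must be $-\sigma^2/2$.

This lemma shows that under the alternative distribution $Q_n$, the limiting distribution of the test statistic $X_n$ is also normal but has a mean shifted by $\tau = \lim_{n \to \infty} \mathrm{Cov}\big(X_n, \log \frac{dQ_n}{dP_n}\big)$.

Proof. Suppose that $(X, W)$ has the limiting distribution on the RHS of the above. By the continuous mapping theorem we have that

$$\Big(X_n, \frac{dQ_n}{dP_n}\Big) \overset{P_n}{\rightsquigarrow} (X, e^W).$$

Since $W \sim N(-\sigma^2/2, \sigma^2)$, we have that $Q_n \triangleleft P_n$. By Lemma 2, $X_n$ then converges weakly to $L$ under $Q_n$, where $L(B) = E\,1_B(X)\,e^W$. We are going to determine the distribution of $L$ via its characteristic function [4]:

$$\int e^{i t^T x}\,dL(x) = E\,e^{i t^T X} e^{W} = E\,e^{i t^T X + i(-i) W}$$
$$= \exp\Big(i t^T \mu - \frac{\sigma^2}{2} - \frac12 \begin{pmatrix} t^T & -i \end{pmatrix} \begin{pmatrix} \Sigma & \tau \\ \tau^T & \sigma^2 \end{pmatrix} \begin{pmatrix} t \\ -i \end{pmatrix}\Big)$$
$$= \exp\Big(i t^T (\mu + \tau) - \frac12 t^T \Sigma t\Big),$$

so $L \sim N(\mu + \tau, \Sigma)$, where the last line is obtained by recognizing the form of the RHS of the previous equation as the characteristic function of the normal distribution. $\square$

[4] Which uniquely determines a distribution.

Example 4 (Asymptotically Linear Statistics). Suppose that $P_\theta$ is a family of QMD measures. We are interested in the asymptotic behavior of $\sqrt n\,(\hat\theta_n - \theta_0)$. We will consider the following setting:

$$\sqrt n\,(\hat\theta_n - \theta_0) = \frac{1}{\sqrt n} \sum_{i=1}^n \psi_{\theta_0}(X_i) + o_{P_{\theta_0}}(1),$$

where $\mathrm{Var}_{\theta_0}\,\psi_{\theta_0}(X) = \tau^2 < \infty$ and $E_{\theta_0} \psi_{\theta_0} = 0$. Furthermore, under $H_0$, i.e. when $\theta = \theta_0$, we have by the CLT that $\sqrt n\,(\hat\theta_n - \theta_0) \rightsquigarrow N(0, \tau^2)$. Since $P_\theta$ is in the QMD family, we have the following expression:

$$\log \frac{dP^n_{\theta_0 + h/\sqrt n}}{dP^n_{\theta_0}} = \frac{1}{\sqrt n} \sum_{i=1}^n h^T \dot\ell_{\theta_0}(X_i) - \frac12 h^T I_{\theta_0} h + o_P(1).$$

Using the bivariate CLT, we have that the RHS above converges, jointly with $\sqrt n\,(\hat\theta_n - \theta_0)$, to a normal distribution, where the covariance between $\sqrt n\,(\hat\theta_n - \theta_0)$ and the log likelihood ratio is given by $\tau = \mathrm{Cov}_{\theta_0}\big(\psi_{\theta_0}(X),\ h^T \dot\ell_{\theta_0}(X)\big)$.

Our next example builds upon the previous one.

Example 5 (t-Statistic for Location Families). Suppose that $f(x - \theta)$ is a density for a QMD location family. We are interested in testing $\theta_0 = 0$. We define the t-statistic as

$$t_n = \frac{\sqrt n\,\bar X_n}{S_n} = \frac{1}{\sqrt n} \sum_{i=1}^n \frac{X_i}{\sigma} + o_{P_0}(1),$$

where the second equality uses a delta method argument. This yields that the t-statistic is an asymptotically linear statistic as in Example 4. We are interested in the behavior of $t_n$ under the alternative $\theta = h/\sqrt n$. Recall that $\dot\ell_\theta(x) = -f'(x - \theta)/f(x - \theta)$. Using Example 4 and the fact that $\psi_{\theta_0}(X) = X/\sigma$, we have that

$$\tau = h\,\mathrm{Cov}_0\Big(\frac{X}{\sigma},\ -\frac{f'(X)}{f(X)}\Big) = -\frac{h}{\sigma} \int x f'(x)\,dx = \frac{h}{\sigma} \int f(x)\,dx = \frac{h}{\sigma},$$

using integration by parts. We therefore have that under shrinking alternatives, $t_n \rightsquigarrow N(h/\sigma, 1)$.

Example 6 (Sign Test for Location Families). We suppose again that $f(x - \theta)$ is a density for a QMD family of distributions. We also suppose that $f$ is continuous at the origin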
and that $P_0(X > 0) = 1/2$. We define the sign statistic

$$s_n = \frac{1}{\sqrt n} \sum_{i=1}^n \Big(1\{X_i > 0\} - \frac12\Big).$$

We again suppose we are interested in testing whether $\theta_0 = 0$. Under the alternative hypothesis $\theta = h/\sqrt n$ we have

$$\tau = h\,\mathrm{Cov}_0\Big(1\{X > 0\},\ -\frac{f'(X)}{f(X)}\Big) = -h \int_0^\infty f'(x)\,dx = h f(0).$$

Under the alternative hypothesis, the asymptotic distribution of $s_n$ is normal with mean $h f(0)$.

References

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.

STAT 210A: Theoretical Statistics, Fall 2006. Lecture 12: Thursday, October 5. Lecturer: Martin Wainwright. Scribe: Martin Wainwright.

Outline
o Information inequality: motivation, basic theory, examples
o Reading: Bickel and Doksum 3.4.2; Keener, Chapter 8

12.1 Motivation and basic theory

In today's lecture, we discuss the information inequality, also known as the Cramer-Rao bound. It provides a useful benchmark for assessing the variance of any unbiased estimator. In particular, it plays a very important role in asymptotic theory, via the notion of asymptotic efficiency.

12.1.1 Basic lower bound

Let $\delta$ be any unbiased estimator of $g(\theta)$, so that $g(\theta) = E_\theta\,\delta(X)$ for each $\theta \in \Theta$. Also let $\psi(X, \theta)$ be any random variable with finite second moment (i.e., $E_\theta\,\psi^2(X, \theta) < \infty$ for each $\theta \in \Theta$). Applying the Cauchy-Schwarz inequality with covariance as the inner product yields that

$$\mathrm{var}_\theta(\delta) \ge \frac{[\mathrm{cov}_\theta(\delta, \psi)]^2}{\mathrm{var}_\theta(\psi)}. \qquad (12.1)$$

In general, this bound is not helpful, since the covariance term on the RHS depends on $\delta$, the estimator whose performance we are trying to lower bound. However, equation (12.1) actually describes an infinite family of bounds, one for every choice of function $\psi$. Thus, we would like to choose $\psi$ judiciously, in such a way that this dependence vanishes.

12.1.2 Hammersley-Chapman-Robbins inequality

We illustrate the use of the bound (12.1) with a classical bound due to Hammersley, Chapman and Robbins. Suppose that $X$ is distributed with a strictly positive density $p(x; \theta) > 0$ for all $x \in \mathbb R$. Given some $\theta$ in the interior of the parameter space $\Theta$, suppose that $\Delta$ is sufficiently small that $\theta + \Delta \in \Theta$ as well. Then let us consider the
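function

$$\psi(x, \theta) = \frac{p(x; \theta + \Delta)}{p(x; \theta)} - 1.$$

A straightforward calculation yields that $E_\theta[\psi(X, \theta)] = 0$ for all fixed $\Delta$, where $E_\theta$ denotes expectation under $p(x; \theta)$. Thus,

$$\mathrm{cov}_\theta(\delta, \psi) = E_\theta\left[\delta(X)\left(\frac{p(X; \theta + \Delta)}{p(X; \theta)} - 1\right)\right] = E_{\theta+\Delta}[\delta(X)] - E_\theta[\delta(X)] = g(\theta + \Delta) - g(\theta),$$

where we have recalled that $g(\theta) = E_\theta[\delta(X)]$ due to the unbiasedness of $\delta$ as an estimator of $g(\theta)$. Thus, we conclude from equation (12.1) that

$$\mathrm{var}_\theta(\delta) \ge \frac{[g(\theta + \Delta) - g(\theta)]^2}{E_\theta\left[\left(\frac{p(X; \theta + \Delta)}{p(X; \theta)} - 1\right)^2\right]}. \qquad (12.2)$$

12.1.3 Cramér-Rao bound

Under appropriate regularity conditions (to be specified), the information inequality is obtained by letting $\Delta \to 0$ in the bound (12.2). In particular, we have

$$\mathrm{var}_\theta(\delta) \ge \lim_{\Delta \to 0}\frac{[g(\theta + \Delta) - g(\theta)]^2/\Delta^2}{E_\theta\left[\left(\frac{p(X; \theta + \Delta) - p(X; \theta)}{\Delta}\right)^2\frac{1}{p(X; \theta)^2}\right]} = \frac{[g'(\theta)]^2}{E_\theta\left[\left(\frac{\partial}{\partial\theta}\log p(X; \theta)\right)^2\right]}.$$

The following lemma yields a useful alternative form of this inequality.

Lemma 12.1. Assuming that we have sufficient regularity to interchange integration and differentiation, the random variable $\psi(X, \theta) = \frac{\partial}{\partial\theta}\log p(X; \theta)$ has zero mean.

Proof. We note that

$$E_\theta\left[\frac{\partial}{\partial\theta}\log p(X; \theta)\right] = \int\frac{\partial}{\partial\theta}p(x; \theta)\,dx = \frac{\partial}{\partial\theta}\int p(x; \theta)\,dx = 0.$$

Using this lemma, the information inequality can be written as

$$\mathrm{var}_\theta(\delta) \ge \frac{[g'(\theta)]^2}{\mathrm{var}_\theta\left(\frac{\partial}{\partial\theta}\log p(X; \theta)\right)}. \qquad (12.3)$$

The quantity $I(\theta) = \mathrm{var}_\theta\left(\frac{\partial}{\partial\theta}\log p(X; \theta)\right)$ is known as the Fisher information. In fact, we can derive the information inequality (12.3) directly by choosing $\psi(x, \theta) = \frac{\partial}{\partial\theta}\log p(x; \theta)$ in the basic lower bound (12.1). Note that

$$\psi(x, \theta) = \frac{\partial}{\partial\theta}\log p(x; \theta) = \frac{\frac{\partial}{\partial\theta}p(x; \theta)}{p(x; \theta)}$$

measures the relative rate at which the density changes as a function of $\theta$.

12.1.4 Some illustrative examples

Example: Normal location model. Suppose that $X \sim N(\theta, \sigma^2)$, where $\theta \in \mathbb{R}$ is a location parameter and $\sigma^2 > 0$ is known. Then $\log p(x; \theta) = C - \frac{1}{2\sigma^2}(x - \theta)^2$, where $C$ is a constant independent of $\theta$. Thus $\frac{\partial}{\partial\theta}\log p(x; \theta) = (x - \theta)/\sigma^2$ and $I(\theta) = 1/\sigma^2$. Thus, the Fisher information decreases monotonically with the noise variance.

Example: General location model. Let us consider the location model more generally, say $X \sim f(x - \theta)$, where $f$ is some fixed known density satisfying $f(x) > 0$ for all $x \in \mathbb{R}$, and $\theta$ is the unknown location parameter. Here a simple calculation (exercise) yields that

$$I(\theta) = \int_{-\infty}^{\infty}\frac{[f'(x)]^2}{f(x)}\,dx.$$

Thus, the Fisher information is independent of $\theta$ for any location family.

Example: one-dimensional canonical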
exponential family. Consider a one-dimensional exponential family in the canonical form

$$p(x; \theta) = h(x)\exp\{\theta T(x) - A(\theta)\}.$$

Suppose that we are interested in estimating the mean parameter $\mu = E_\theta[T(X)] = g(\theta)$. Note that $g$ is differentiable, and we have

$$g'(\theta) = A''(\theta) = \mathrm{var}_\theta(T(X)),$$

since $A'(\theta) = g(\theta)$ from the cumulant generating properties of the function $A$. We can also calculate

$$\frac{\partial}{\partial\theta}\log p(x; \theta) = T(x) - A'(\theta).$$

Combining these two pieces, we have for any unbiased estimator $\delta$ (i.e., such that $E_\theta[\delta] = \mu = g(\theta)$ for all $\theta$) the lower bound

$$\mathrm{var}_\theta(\delta) \ge \frac{[A''(\theta)]^2}{\mathrm{var}_\theta(T(X) - E_\theta T(X))} = A''(\theta) = \mathrm{var}_\theta(T(X)).$$

Note that the usual estimator $\delta(X) = T(X)$ achieves the lower bound with equality in this case.

Example: Estimating canonical parameters. Continuing the previous example, suppose that we are now interested in estimating the canonical parameter, so that $g(\theta) = \theta$. In this case, we have

$$\mathrm{var}_\theta(\delta) \ge \frac{1}{\mathrm{var}_\theta(T(X))}$$

for any unbiased estimator, $E_\theta[\delta(X)] = \theta$. Note the inverse relationship between the information bounds associated with estimating the mean parameter $\mu$ and the canonical parameter $\theta$.

Example: Special case, Bernoulli. As a particular illustration of the preceding development, consider the Bernoulli as a one-dimensional exponential family. We have $X \sim \mathrm{Ber}(\mu)$ and $p(x; \theta) = 1\{x \in \{0, 1\}\}\exp\{\theta x - A(\theta)\}$. We calculate $\mathrm{var}(T) = \mathrm{var}(X) = \mu(1 - \mu)$, where $\mu = E[X] = P[X = 1]$. Thus, for estimating $\mu = g(\theta)$, we have the bound $\mathrm{var}_\theta(\delta) \ge g(\theta)(1 - g(\theta))$. For estimating $\theta$, we have the bound $\mathrm{var}_\theta(\delta) \ge 1/[g(\theta)(1 - g(\theta))]$.

12.1.5 Additivity of information

An important property of the Fisher information is its additivity under independent sampling.

(a) If $X_i$ has Fisher information $I_i(\theta)$ about $\theta$, then the sample $(X_1, \ldots, X_n)$ with independent components has Fisher information $I_{1:n}(\theta) = \sum_{i=1}^n I_i(\theta)$ for estimating $\theta$.

(b) Supposing furthermore that the $X_i$ are i.i.d. with $I_i(\theta) \equiv I(\theta)$, then we have $I_{1:n}(\theta) = nI(\theta)$. Thus, under i.i.d. sampling models, we anticipate that $\mathrm{var}_\theta(\delta) \ge \frac{1}{nI(\theta)}$ is the information bound for estimating $\theta$, with $1/n$ scaling in terms of the number of samples.

12.2 Regularity and attainment

12.2.1 Formal statement

We now provide a formal statement of a version of the information bound theorem.
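Before the formal statement, the Bernoulli example and the additivity property above can be checked numerically. The following is a minimal sketch that is not part of the original notes: it assumes the canonical Bernoulli family with $A(\theta) = \log(1 + e^\theta)$, and the function names (`mean_param`, `fisher_information`, `cr_bound_mean`, `cr_bound_canonical`) as well as the values of `theta` and `n` are illustrative choices.

```python
import math

def mean_param(theta):
    # g(theta) = A'(theta) for the assumed A(theta) = log(1 + e^theta)
    return 1.0 / (1.0 + math.exp(-theta))

def score(x, theta):
    # d/dtheta log p(x; theta) = T(x) - A'(theta), with T(x) = x
    return x - mean_param(theta)

def fisher_information(theta):
    # I(theta) = E_theta[score^2], by exact enumeration over x in {0, 1}
    mu = mean_param(theta)
    return (1 - mu) * score(0, theta) ** 2 + mu * score(1, theta) ** 2

def cr_bound_mean(theta, n=1):
    # [g'(theta)]^2 / (n I(theta)), using g'(theta) = A''(theta) = mu (1 - mu)
    mu = mean_param(theta)
    g_prime = mu * (1 - mu)
    return g_prime ** 2 / (n * fisher_information(theta))

def cr_bound_canonical(theta, n=1):
    # Bound for unbiased estimators of theta itself: 1 / (n I(theta))
    return 1.0 / (n * fisher_information(theta))

theta, n = 0.4, 25                      # illustrative values
mu = mean_param(theta)
var_sample_mean = mu * (1 - mu) / n     # exact variance of the mean of n Bernoulli draws
print(fisher_information(theta), cr_bound_mean(theta, n), var_sample_mean)
```

Under these assumptions the bound for the mean parameter with $n$ i.i.d. draws coincides with the exact variance of the sample mean, and the single-observation bounds for $\mu$ and for $\theta$ are reciprocals of one another.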
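Theorem 12.2. Let $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ be a regular family, where $\Theta$ is an open set in $\mathbb{R}$. Suppose that each member has a density or PMF $p(\cdot\,; \theta)$ that is differentiable with respect to $\theta$. Consider an estimator $\delta$ of $g(\theta)$ that is unbiased, viz. $E_\theta[\delta(X)] = g(\theta)$ for all $\theta \in \Theta$. Assume moreover that

(a) $E_\theta\left[\frac{\partial}{\partial\theta}\log p(X; \theta)\right] = 0$;
(b) $E_\theta[\delta^2(X)] < \infty$;
(c) $g'(\theta) = \mathrm{cov}_\theta\left(\delta(X), \frac{\partial}{\partial\theta}\log p(X; \theta)\right)$.

Under these conditions, we have

$$\mathrm{var}_\theta(\delta) \ge \frac{[g'(\theta)]^2}{I(\theta)} \quad \text{for all } \theta \in \Theta,$$

where $I(\theta) = \mathrm{var}_\theta\left(\frac{\partial}{\partial\theta}\log p(X; \theta)\right) = E_\theta\left[\left(\frac{\partial}{\partial\theta}\log p(X; \theta)\right)^2\right]$.

Note: Without such regularity conditions, the information inequality may fail to hold. A simple example is the uniform distribution $\mathrm{Uni}[0, \theta]$, where a calculation (see homework) shows that the bound fails to hold.

12.2.2 Attainment of the bound

Since the information inequality is based on the Cauchy-Schwarz inequality, equality will hold when the functions $\delta(x)$ and $\psi(x, \theta) = \frac{\partial}{\partial\theta}\log p(x; \theta)$ are co-linear, namely when

$$\delta(x) = a(\theta)\frac{\partial}{\partial\theta}\log p(x; \theta) + b(\theta)$$

for some functions $a(\theta)$ and $b(\theta)$. However, for $\delta$ to be a valid estimator, we need the RHS of this relation to be independent of $\theta$, so that the choice of $a(\theta)$ and $b(\theta)$ must cancel out any $\theta$-dependence in $\frac{\partial}{\partial\theta}\log p(x; \theta)$.

Example. Suppose that $X \sim \mathrm{Bin}(n, \theta)$. Then we compute

$$\delta(x) = a(\theta)\frac{\partial}{\partial\theta}\left[x\log\theta + (n - x)\log(1 - \theta)\right] + b(\theta) = a(\theta)\,\frac{x - n\theta}{\theta(1 - \theta)} + b(\theta).$$

For this to be a valid estimator, we need $a(\theta) = \theta(1 - \theta)$ and $b(\theta) = n\theta$, so that $\delta(x) = x$. This is the efficient estimator of $n\theta$; equivalently, $x/n$, the sample mean of the underlying Bernoulli trials, is the efficient estimator of $\theta$.

12.2.3 Multivariate extensions

When $\theta \in \Theta \subseteq \mathbb{R}^d$, we have a $d \times d$ Fisher information matrix $I(\theta)$, where element $(i, j)$ is given by (under suitable regularity conditions as before)

$$I_{ij}(\theta) = E_\theta\left[\frac{\partial}{\partial\theta_i}\log p(X; \theta)\,\frac{\partial}{\partial\theta_j}\log p(X; \theta)\right] = \mathrm{cov}_\theta\left(\frac{\partial}{\partial\theta_i}\log p(X; \theta), \frac{\partial}{\partial\theta_j}\log p(X; \theta)\right),$$

so that the Fisher information matrix is positive semidefinite. One generalization of the information inequality is as follows: if $\delta$ is an unbiased estimator of $g(\theta)$, then

$$\mathrm{var}_\theta(\delta) \ge \nabla g(\theta)^T[I(\theta)]^{-1}\nabla g(\theta),$$

where $\nabla g(\theta) \in \mathbb{R}^d$ is the gradient vector.

STAT 210A Theoretical Statistics, Fall 2006
Lecture 23: November 17
Lecturer: Martin Wainwright. Scribe: Wei-Chun Kao.

23.1 Comments on UMP tests

One side comment before proceeding: last lecture we discussed uniformly most powerful (UMP)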
tests in a single dimension. There exist analogous results for UMP tests in higher dimensions; for more discussion of this issue, please see Chapter 15 in Keener. For instance, one example considered by Keener is the $d$-dimensional exponential family

$$p(x; \theta, \eta) = h(x)\exp\{\theta U(x) + \langle\eta, T(x)\rangle - A(\theta, \eta)\},$$

where $\theta \in \mathbb{R}$ is the parameter of interest and the vector $\eta \in \mathbb{R}^{d-1}$ is a collection of nuisance parameters. We might be interested in the test $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$, in which case it is natural to condition on the sufficient statistics. Keener presents useful results about exponential families under conditioning that can be used to analyze this approach.

23.2 Asymptotics of the Generalized Likelihood Ratio (GLR)

The main goal of today's lecture is to explore the asymptotic behavior of the generalized likelihood ratio (GLR) test. Recall that for the test $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta_1$, where $\Theta = \Theta_0 \cup \Theta_1$, we define the GLR

$$G(X) = \frac{\sup_{\theta \in \Theta}p(X; \theta)}{\sup_{\theta \in \Theta_0}p(X; \theta)}. \qquad (23.1)$$

Note that the numerator is the likelihood at the unrestricted MLE and the denominator is the likelihood at the restricted MLE. In general, the statistic $G(X)$ does not have an easily computable null distribution (i.e., under $H_0$). So in practice it is desirable to know the asymptotics of the GLR in order to set the critical region.

Consider $\Theta \subseteq \mathbb{R}^d$ and $\Theta_0 = \{\theta : \theta_1 = \theta_2 = \cdots = \theta_r = 0,\ \theta_j \text{ free for } j = r + 1, \ldots, d\}$. Generally, we can reparametrize any suitably smooth $\Theta_0$ into this form by defining a reparametrization $g$ such that $g_1(\theta) = g_2(\theta) = \cdots = g_r(\theta) = 0$.

Theorem 23.1 (Wilks, 1938). Under the regularity conditions ensuring asymptotic normality (AN) of the MLE, we have, under $H_0$,

$$2\log G(X) \xrightarrow{d} \chi^2_r. \qquad (23.2)$$

Remark: To provide some intuition, if we consider the $d - r$ unrestricted components of the parameter vector, it can be shown that the restricted and unrestricted MLEs are approximately equal, up to terms of order smaller than $1/\sqrt{n}$. The only substantial difference comes from the first $r$ (restricted) components, where the behavior is asymptotically normal. By a Taylor series expansion, the GLR statistic looks like a quadratic form in an $r$-dimensional Gaussian, which is asymptotically $\chi^2_r$.

Proof. Here we
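provide a very rough sketch; for a somewhat more complete argument, please refer to Ch. 18 of Keener. From equation (23.1), we have

$$2\log G(X) = 2\left[\log p(X; \hat\theta_n) - \log p(X; \tilde\theta_n)\right], \qquad (23.3)$$

where $\hat\theta_n$ is the unrestricted MLE under $\Theta$ and $\tilde\theta_n$ is the restricted MLE under $\Theta_0$. By Taylor series, we can expand (23.3) around $\hat\theta_n$ to obtain

$$2\log G(X) = -2\nabla\ell(\hat\theta_n)^T(\tilde\theta_n - \hat\theta_n) - (\tilde\theta_n - \hat\theta_n)^T\nabla^2\ell(\theta^*_n)(\tilde\theta_n - \hat\theta_n), \qquad (23.4)$$

where $\ell(\theta) = \log p(X; \theta)$ and $\theta^*_n = t\hat\theta_n + (1 - t)\tilde\theta_n$ with $t \in [0, 1]$. Since $\hat\theta_n$ is the unrestricted MLE, $\nabla\ell(\hat\theta_n) = 0$ w.p. one as $n \to \infty$, and hence

$$2\log G(X) = -(\tilde\theta_n - \hat\theta_n)^T\nabla^2\ell(\theta^*_n)(\tilde\theta_n - \hat\theta_n). \qquad (23.5)$$

In the easier case $r = d$, we have $\tilde\theta_n \equiv \theta_0 = (0, \ldots, 0)$ deterministically. We know that $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, I^{-1}(\theta_0))$. As in our earlier work on proving asymptotic normality of MLEs, suitable regularity conditions ensure that $-\frac{1}{n}\nabla^2\ell(\theta^*_n) \to I(\theta_0)$ converges uniformly. Hence $\sqrt{n}\,I^{1/2}(\theta_0)(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, I_d)$, and

$$2\log G(X) \xrightarrow{d} z^Tz \qquad (23.6)$$
$$\sim \chi^2_d, \qquad (23.7)$$

where $z \sim N(0, I_d)$ and $d = r$. For the more general case $r < d$, see Keener for the outline of a proof. □

23.3 An Example

Say $X_i \sim \mathrm{Poi}(\theta_1)$, $Y_i \sim \mathrm{Poi}(\theta_2)$, $Z_i \sim \mathrm{Poi}(\theta_3)$ for $i = 1, \ldots, n$, and $\Theta = \{\theta \in \mathbb{R}^3 : \theta > 0\}$. We are interested in testing the hypothesis $H_0: \theta_3 = \theta_1 + \theta_2$ (i.e., $\theta \in \Theta_0$) against $H_1: \theta \in \Theta \setminus \Theta_0$. In this case $r = 1$ and $d = 3$. From Theorem 23.1, $2\log G \xrightarrow{d} \chi^2_1$. The likelihood function is

$$p(X, Y, Z; \theta) = \frac{\theta_1^{\sum_i X_i}\,\theta_2^{\sum_i Y_i}\,\theta_3^{\sum_i Z_i}\,e^{-n(\theta_1 + \theta_2 + \theta_3)}}{\prod_{i=1}^n X_i!\,Y_i!\,Z_i!}. \qquad (23.8)$$

Thus, we have the unrestricted MLE $\hat\theta_1 = \bar X_n$, $\hat\theta_2 = \bar Y_n$, $\hat\theta_3 = \bar Z_n$, and the restricted MLE

$$\tilde\theta_1 = \frac{\bar X}{\bar X + \bar Y}\cdot\frac{\bar X + \bar Y + \bar Z}{2}, \qquad \tilde\theta_2 = \frac{\bar Y}{\bar X + \bar Y}\cdot\frac{\bar X + \bar Y + \bar Z}{2}, \qquad \tilde\theta_3 = \frac{\bar X + \bar Y + \bar Z}{2},$$

so that

$$\log G(X, Y, Z) \propto \bar X\log\frac{\hat\theta_1}{\tilde\theta_1} + \bar Y\log\frac{\hat\theta_2}{\tilde\theta_2} + \bar Z\log\frac{\hat\theta_3}{\tilde\theta_3}.$$

If we think about a normalized partition of $(\bar X, \bar Y, \bar Z)$, then this statistic again looks like a KL divergence between the Poisson distribution with the unrestricted MLE and that with the restricted MLE.

23.4 General Observation

In fact, the structure of the preceding problem is a special case of the following more general phenomenon. Suppose that we are given a $d$-dimensional canonical exponential family. Consider the test $H_0: B\theta = 0$ (i.e., $\theta \in \Theta_0$) for some matrix $B \in \mathbb{R}^{r \times d}$, against $H_1: \theta \in \Theta \setminus \Theta_0$. Then some calculations show that

$$\frac{1}{n}\log G(x) = D\left(p_{\hat\theta}\,\|\,p_{\tilde\theta}\right),$$

where $D$ denotes the Kullback-Leibler divergence, $p_{\hat\theta}$ is the unrestricted likelihood, and $p_{\tilde\theta}$ is the restricted likelihood. You are strongly encouraged to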
work through this example as an exercise, as the special structure of exponential families allows you to gain intuition for the asymptotics easily.

23.5 Another Example: Contingency Tables

Contingency tables are suitable for modeling vectors of discrete (multinomial) data. In this example, we test whether the answers of two different students are independent, which could be useful for detecting cheaters (say, if students were given random tests, so that their answers, assuming no cheating, should be independent). Suppose that the random variables $X_{11}, X_{12}, X_{21}, X_{22}$ represent the number of correctly and incorrectly answered questions by each student according to the table

                          Student 1
                      Correct   Incorrect
  Student 2 Correct     X11        X12
            Incorrect   X21        X22

Letting $n$ denote the number of questions in the test, we have $X_{11} + X_{12} + X_{21} + X_{22} = n$. We model $(X_{11}, X_{12}, X_{21}, X_{22})$ as $\mathrm{Multinomial}(n, P_{11}, P_{12}, P_{21}, P_{22})$. The corresponding likelihood function is

$$p(X_{11}, X_{12}, X_{21}, X_{22}; P) = \binom{n}{X_{11}\ X_{12}\ X_{21}\ X_{22}}P_{11}^{X_{11}}P_{12}^{X_{12}}P_{21}^{X_{21}}P_{22}^{X_{22}}.$$

To test the hypothesis $H_0$: independence against $H_1$: dependence, one approach might be to examine the quantity

$$\rho = \frac{P_{11}P_{22}}{P_{12}P_{21}}:$$

  $\rho > 1$: attractive (positive) dependence
  $\rho = 1$: independence
  $\rho < 1$: negative dependence

In this formulation, we would test $H_0: \rho = 1$ against $H_1: \rho \ne 1$.

Alternatively, we can test for independence directly by introducing the marginal distributions

$$P_{i\cdot} = \sum_{j=1}^2 P_{ij}, \qquad (23.9)$$
$$P_{\cdot j} = \sum_{i=1}^2 P_{ij}. \qquad (23.10)$$

An independence test then entails comparing $H_0: P_{ij} = P_{i\cdot}P_{\cdot j}$ versus $H_1: P_{ij} \ne P_{i\cdot}P_{\cdot j}$, where $i = 1, 2$ and $j = 1, 2$. In this formulation, we have the unrestricted MLE $\hat P_{ij} = X_{ij}/n$ and the restricted MLE $\tilde P_{ij} = \hat P_{i\cdot}\hat P_{\cdot j}$, where $\hat P_{i\cdot} = \sum_j\hat P_{ij}$ and $\hat P_{\cdot j} = \sum_i\hat P_{ij}$.

The parameters in this characterization of the multinomial are defined over the probability simplex $\Theta = \{P \ge 0 : P_{11} + P_{12} + P_{21} + P_{22} = 1\}$. We take $(P_{\cdot 2}, P_{2\cdot}, P_{22})$ as the reparametrization of the model. The new parameter space is a full-dimensional subset $\Theta \subseteq \mathbb{R}^3$, and the hypothesis becomes $H_0: P_{22} = P_{\cdot 2}P_{2\cdot}$ vs. $H_1: P_{22} \ne P_{\cdot 2}P_{2\cdot}$. From Theorem 23.1, we have

$$2n\,D\left(\hat P_{ij}\,\|\,\hat P_{i\cdot}\hat P_{\cdot j}\right) \xrightarrow{d} \chi^2_1.$$

Stat210B Theoretical Statistics. Lecture Date: April 10, 2007.
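Signed Rank Test and Likelihood Ratio
Lecturer: Michael I. Jordan. Scribe: Jinn Ding.

1 Signed Rank Statistics

Let

$$T_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n a_{n, R_i^+}\,\mathrm{sign}(X_i),$$

where $R_i^+$ is the rank of $|X_i|$; e.g., $a_{n,i} = E\,\phi(U_{n(i)})$, where $U_{n(1)}, \ldots, U_{n(n)}$ are the order statistics of a $\mathrm{Unif}[0, 1]$ sample. Let $F^+(x) = 2F(x) - 1$ be the c.d.f. of $|X|$. (Recall that we assume the distribution is symmetric.) Then

$$T_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n\phi\left(F^+(|X_i|)\right)\mathrm{sign}(X_i) + o_P(1).$$

Choose $\phi$ in a smart way: set

$$\phi(u) = -\frac{f'}{f}\left((F^+)^{-1}(u)\right);$$

then

$$T_n = -\frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{f'}{f}(|X_i|)\,\mathrm{sign}(X_i) + o_P(1) = -\frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{f'}{f}(X_i) + o_P(1),$$

which satisfies Theorem 2 in Lecture 21 (Apr. 5) (Thm. 15.4, p. 221 in van der Vaart (1998)). Thus the signed rank test is asymptotically optimal.

Example 1 (Laplace). $f(x) \propto e^{-|x|}$. This implies $T_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n\mathrm{sign}(X_i)$, which is a sign test.

2 Likelihood Ratio Test

$$\lambda_n = 2\log\frac{\sup_{\theta \in \Theta_1}\prod_i p_\theta(X_i)}{\sup_{\theta \in \Theta_0}\prod_i p_\theta(X_i)}, \qquad \Lambda_n = 2\log\frac{\sup_{\theta \in \Theta}\prod_i p_\theta(X_i)}{\sup_{\theta \in \Theta_0}\prod_i p_\theta(X_i)},$$

where $\Theta = \Theta_0 \cup \Theta_1$. So $\Lambda_n$ clips $\lambda_n$ at 0, which does not matter because we reject for large $\Lambda_n$.

LAN approach:

• Introduce local parameter spaces: $H_n = \sqrt{n}(\Theta - \vartheta)$ and $H_{n,0} = \sqrt{n}(\Theta_0 - \vartheta)$.

• Write $\Lambda_n$ using $H_n$ and $H_{n,0}$:

$$\Lambda_n = 2\sup_{h \in H_n}\log\frac{\prod_i p_{\vartheta + h/\sqrt{n}}(X_i)}{\prod_i p_\vartheta(X_i)} - 2\sup_{h \in H_{n,0}}\log\frac{\prod_i p_{\vartheta + h/\sqrt{n}}(X_i)}{\prod_i p_\vartheta(X_i)}.$$

• The LAN result gives asymptotic expansions for the logs, e.g., under qmd.

• Under a suitable notion of convergence of sets, we expect $\Lambda_n$ to converge to

$$\Lambda = \inf_{h \in H_0}(X - h)^TI_\vartheta(X - h) - \inf_{h \in H}(X - h)^TI_\vartheta(X - h),$$

in which $X \sim N(h, I_\vartheta^{-1})$.

• The distribution of $\Lambda_n$ under $\vartheta$ corresponds to the distribution of $\Lambda$ under $h = 0$.

• Under $h = 0$, $I_\vartheta^{1/2}X \sim N(0, I)$.

Lemma 2 (Lemma 16.6 in van der Vaart (1998)). If $X \sim N_k(0, I)$ and $H_0$ is an $l$-dimensional linear subspace of $\mathbb{R}^k$, then $\|X - H_0\|^2 \sim \chi^2_{k-l}$.

• If $\vartheta$ is an interior point of $\Theta$, then $H$ is all of $\mathbb{R}^k$ and the second term in $\Lambda$ vanishes.

• If $\sqrt{n}(\Theta_0 - \vartheta)$ converges to a linear subspace $H_0$ of dimension $l$, the asymptotic null distribution of $\Lambda_n$ is $\chi^2_{k-l}$.

Theorem 3 (Theorem 16.7 in van der Vaart (1998)). Suppose:
1. $P_\theta$ is qmd (makes LAN work);
2. $|\log p_{\theta_1}(x) - \log p_{\theta_2}(x)| \le \dot\ell(x)\|\theta_1 - \theta_2\|$ for some $\dot\ell$ such that $P_\vartheta\dot\ell^2 < \infty$ and $\theta_1, \theta_2$ in a neighborhood of $\vartheta$ (makes EPT work);
3. the MLEs $\hat\theta_{n,0}$ and $\hat\theta_n$ are consistent;
4. $H_{n,0}$ and $H_n$ converge to sets $H_0$ and $H$.
Then $\Lambda_n \rightsquigarrow \Lambda$ for $X \sim N(h, I_\vartheta^{-1})$.

What about power?

References
van der Vaart, A. W. (1998).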
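Asymptotic Statistics. Cambridge University Press, Cambridge.

STAT 210A Theoretical Statistics, Fall 2006
Lecture 11: October 3
Lecturer: Martin Wainwright. Scribe: Simone Gambini.

Warning: These scribe notes have only been mildly proofread.

11.1 Characterization of minimax estimators and least favorable priors

Theorem 11.1. Suppose that $\Lambda$ is a prior on $\Theta$ and that $\delta_\Lambda$ is a Bayes estimator w.r.t. $\Lambda$ such that

$$r(\Lambda, \delta_\Lambda) \overset{(a)}{=} \int R(\theta, \delta_\Lambda)\,d\Lambda(\theta) \overset{(b)}{=} \sup_\theta R(\theta, \delta_\Lambda).$$

Note that equality (a) is simply the definition of the Bayes risk, whereas equality (b) is a substantive assumption, which we refer to as the equalized risk property. Under these conditions, we may conclude that:
(a) $\delta_\Lambda$ is minimax.
(b) The prior $\Lambda$ is least favorable.
(c) If $\delta_\Lambda$ is unique Bayes, then it is also unique minimax.

Proof. (a) For any estimator $\delta$, we have $\sup_\theta R(\theta, \delta) \ge \int R(\theta, \delta)\,d\Lambda(\theta)$. Since $\delta_\Lambda$ is Bayes for $\Lambda$, we have that $\int R(\theta, \delta)\,d\Lambda(\theta) \ge \int R(\theta, \delta_\Lambda)\,d\Lambda(\theta)$. Finally, by assumption we have $\int R(\theta, \delta_\Lambda)\,d\Lambda(\theta) = \sup_\theta R(\theta, \delta_\Lambda)$, so that the minimax property of $\delta_\Lambda$ is established by taking the infimum over $\delta$ on both sides.

(b) For any prior $\Lambda'$,

$$r(\Lambda', \delta_{\Lambda'}) = \int R(\theta, \delta_{\Lambda'})\,d\Lambda'(\theta) \le \int R(\theta, \delta_\Lambda)\,d\Lambda'(\theta),$$

where the last step follows because $\delta_{\Lambda'}$ is Bayes for $\Lambda'$, so it minimizes the Bayes risk. We conclude by using the equalized risk property:

$$\int R(\theta, \delta_\Lambda)\,d\Lambda'(\theta) \le \sup_\theta R(\theta, \delta_\Lambda) = r(\Lambda, \delta_\Lambda),$$

which shows that $\Lambda$ is least favorable.

(c) Follows from (a) by replacing inequalities with strict inequalities.

11.2 Examples

Let's consider some examples to illustrate the utility of this theorem in actually finding minimax estimators.

11.2.1 Bernoulli with finite parameter space

Estimate the probability of success $\theta$ in $X \sim \mathrm{Ber}(\theta)$ with a single observation. Suppose that $\theta \in \Theta = \{a, b\} = \{1/3, 2/3\}$, and we use the loss function $L(\theta, a) = (\theta - a)^2$. Due to the discrete nature of $\Theta$, any decision rule $\delta$ is specified by a pair $(\delta_0, \delta_1)$, and the prior on $\theta$ by a single number $\pi_a = P(\theta = a)$. The frequentist risk takes the form

$$R(\theta, \delta) = (1 - \theta)(\delta_0 - \theta)^2 + \theta(\delta_1 - \theta)^2.$$

From the preceding theorem, we can obtain a minimax estimate by finding the Bayes estimate for a least favorable prior. For a given prior $\pi_a$, the Bayes risk $r(\pi_a, \delta)$ equals $\pi_aR(a, \delta) + (1 - \pi_a)R(b, \delta)$.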
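The search for a least favorable prior in this two-point problem can also be carried out numerically. The sketch below is not from the lecture: it assumes $a = 1/3$, $b = 2/3$, uses the posterior mean as the Bayes rule under squared-error loss, and scans a grid of priors; the names `risk`, `bayes_rule`, `bayes_risk` and the grid resolution are illustrative choices.

```python
from fractions import Fraction as F

A, B = F(1, 3), F(2, 3)   # the two points of the parameter space

def risk(theta, d0, d1):
    # R(theta, delta) = (1 - theta)(d0 - theta)^2 + theta (d1 - theta)^2
    return (1 - theta) * (d0 - theta) ** 2 + theta * (d1 - theta) ** 2

def bayes_rule(pi_a):
    # Posterior mean of theta given X = 0 and given X = 1, for prior P(theta = a) = pi_a
    pi_b = 1 - pi_a
    d0 = (pi_a * (1 - A) * A + pi_b * (1 - B) * B) / (pi_a * (1 - A) + pi_b * (1 - B))
    d1 = (pi_a * A * A + pi_b * B * B) / (pi_a * A + pi_b * B)
    return d0, d1

def bayes_risk(pi_a):
    # Bayes risk of the Bayes rule for the prior pi_a
    d0, d1 = bayes_rule(pi_a)
    return pi_a * risk(A, d0, d1) + (1 - pi_a) * risk(B, d0, d1)

# Scan priors on an (assumed) grid and keep the one with the largest Bayes risk.
grid = [F(k, 100) for k in range(1, 100)]
pi_star = max(grid, key=bayes_risk)
d0, d1 = bayes_rule(pi_star)
print(pi_star, d0, d1, risk(A, d0, d1), risk(B, d0, d1))
```

On this grid the scan selects $\pi_a = 1/2$ with $(\delta_0, \delta_1) = (4/9, 5/9)$, and the two frequentist risks are both $2/81$, which is the equalized risk property of Theorem 11.1.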
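Taking derivatives w.r.t. $\delta_0, \delta_1$ and setting them equal to 0, one finds

$$\delta_0 = \frac{\pi_a\,a(1 - a) + (1 - \pi_a)\,b(1 - b)}{\pi_a(1 - a) + (1 - \pi_a)(1 - b)}, \qquad (11.1)$$
$$\delta_1 = \frac{\pi_a\,a^2 + (1 - \pi_a)\,b^2}{\pi_a\,a + (1 - \pi_a)\,b}. \qquad (11.2)$$

For the preceding theorem to apply, we need the frequentist risk to be constant over the parameter space. For this case, this is equivalent to

$$(a + b)\left[2(\delta_0 - \delta_1) + 1\right] - 2\delta_0 - \delta_0^2 + \delta_1^2 = 0.$$

For $a = 1 - b = 1/3$ this gives $\delta_0 = 1 - \delta_1$, and substituting into the expressions above one finds that $\pi_a = 1/2$ is the least favorable prior, while the minimax decision rule is $\delta_0 = 4/9$, $\delta_1 = 5/9$.

11.2.2 Bernoulli with continuous parameter space

In this case assume $\theta \in [0, 1]$ and $X \sim \mathrm{Ber}(\theta)$. We estimate $\theta$ under quadratic loss. As before, the frequentist risk is

$$R(\theta, \delta) = \theta^2\left[1 + 2(\delta_0 - \delta_1)\right] + \theta\left(\delta_1^2 - \delta_0^2 - 2\delta_0\right) + \delta_0^2.$$

The Bayes risk is

$$r(\Lambda, \delta) = m_2\left[1 + 2(\delta_0 - \delta_1)\right] + m_1\left(\delta_1^2 - \delta_0^2 - 2\delta_0\right) + \delta_0^2,$$

where $m_1 = E[\theta]$ and $m_2 = E[\theta^2]$. The Bayes decision rule is found by minimizing w.r.t. $(\delta_0, \delta_1)$:

$$\delta_1 = \frac{m_2}{m_1}, \qquad \delta_0 = \frac{m_1 - m_2}{1 - m_1}.$$

Furthermore, $R(\theta, \delta)$ does not depend on $\theta$ if and only if $\delta_1 - \delta_0 = 1/2$ and $\delta_1^2 - \delta_0^2 - 2\delta_0 = 0$, or $\delta_0 = 1/4$, $\delta_1 = 3/4$. Solving for $m_1, m_2$, one finds $m_1 = 1/2$, $m_2 = 3/8$. A $\mathrm{Beta}(1/2, 1/2)$ prior satisfies this constraint. The associated Bayes (and hence minimax) risk is $1/16$. In general, a $\mathrm{Beta}(\sqrt{n}/2, \sqrt{n}/2)$ prior is least favorable on $n$ observations, and the associated minimax risk is $\frac{1}{4(1 + \sqrt{n})^2}$. For the sample mean estimator $\bar X_n$, we have the risk function $R(\theta, \bar X_n) = \theta(1 - \theta)/n$. These risks are plotted in Figure 11.1 for different values of $n$.

11.3 Minimax theorem

It is natural to look at the minimax decision problem in terms of game theory. Here a two-player game is played between nature and the statistician. Nature picks a prior $\Lambda$, the statistician a decision rule $\delta$. Then the statistician pays to nature the amount $r(\Lambda, \delta) = \int R(\theta, \delta)\,d\Lambda(\theta)$. Nature gains the same amount the statistician loses, so that the game is zero-sum. As defined previously, the following two quantities are important:

$$\underline{v} = \sup_\Lambda\inf_\delta r(\Lambda, \delta), \qquad (11.3a)$$
$$\bar{v} = \inf_\delta\sup_\Lambda r(\Lambda, \delta), \qquad (11.3b)$$

called respectively the lower and the upper values of the game. These quantities have the following interpretation: $\underline{v}$ is the amount the player pays when he is told what prior $\Lambda$ nature chose before he chooses $\delta$. Conversely, $\bar{v}$ is the amount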
the player pays when nature is told what $\delta$ the player chose before choosing $\Lambda$.

Figure 11.1: Minimax risk compared to risk of sample mean.

Theorem 11.2. Suppose that
(a) the parameter space $\Theta = \{\theta_1, \ldots, \theta_k\}$ is finite, and
(b) the risk set is closed and bounded from below.
Then
(a) the game has a value, that is, $\underline{v} = \bar{v}$;
(b) there exists a probability vector $\lambda \in \mathbb{R}^k$ that is a least favorable prior.

Remark: A special case of this set-up, and the one originally considered by von Neumann, is the case when, in addition to the assumptions given above, the space of possible decision rules is finite, say $\{\delta_1, \ldots, \delta_m\}$. Under these conditions, define a $k \times m$ matrix $R$ with elements $R_{ij} = R(\theta_i, \delta_j)$, corresponding to the risk incurred by using rule $\delta_j$ when the state of nature was $\theta_i$. Due to the finite nature of the parameter space, a prior can be defined by a non-negative row vector $\lambda \in \mathbb{R}^k$ such that $\sum_i\lambda_i = 1$; that is, $\lambda$ belongs to the probability simplex in $\mathbb{R}^k$. The Bayes risk of the non-randomized decision rule $\delta_j$ is given by $r(\lambda, \delta_j) = \lambda Re_j$, where $e_j$ is a unit vector in $\mathbb{R}^m$. Randomized decision rules have a Bayes risk that is given by $r(\lambda, \delta_\gamma) = \lambda R\gamma$, where $\gamma$ is a non-negative vector such that $\sum_j\gamma_j = 1$. That is, $\gamma$ is a member of the probability simplex in $\mathbb{R}^m$. The von Neumann version of the minimax theorem then states that

$$\min_\gamma\max_\lambda\lambda R\gamma = \max_\lambda\min_\gamma\lambda R\gamma,$$

where $\lambda$, respectively $\gamma$, ranges over the probability simplex in $\mathbb{R}^k$, respectively $\mathbb{R}^m$. This statement can be viewed as a particular case of duality theory for convex-concave functions over compact sets (in particular, for linear programs in this particular case).

11.3.1 Proof of minimax theorem

Let's prove the previously stated minimax theorem. It is known that $\underline{v} \le \bar{v}$; therefore, it suffices to show that $\bar{v} \le \underline{v}$ in order to establish claim (a). Given a scalar $\alpha \in \mathbb{R}$, define the lower rectangular sets of the form $B_\alpha = \{y \in \mathbb{R}^k \mid y_i \le \alpha\ \forall i = 1, \ldots, k\}$. In addition, define $\gamma = \inf\{\alpha : B_\alpha \cap S \ne \emptyset\}$, where $S$ denotes the risk set. By definition of $\gamma$, for each natural integer $n = 1, 2, \ldots$ there exists a decision rule $\delta_n$
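with $R(\theta_j, \delta_n) \le \gamma + \frac{1}{n}$ for all $j = 1, \ldots, k$. Therefore, for any prior $\Lambda$ on $\Theta = \{\theta_1, \ldots, \theta_k\}$, we have $r(\Lambda, \delta_n) \le \gamma + \frac{1}{n}$. Taking a sup over the choice of priors yields that $\sup_\Lambda r(\Lambda, \delta_n) \le \gamma + \frac{1}{n}$, and hence $\inf_\delta\sup_\Lambda r(\Lambda, \delta) \le \gamma + \frac{1}{n}$. This inequality holds for each natural $n$, so taking the limit as $n \to \infty$ yields that $\bar{v} \le \gamma$.

Now we are going to use the separating hyperplane theorem to construct a vector $\lambda$ that can be viewed as a least favorable prior, thereby establishing that part (b) holds. It will also show that $\underline{v} \ge \gamma$, which in conjunction with the previously shown $\bar{v} \le \gamma$ establishes part (a). Considering the lower rectangle $B_\gamma$, we observe that its interior $\mathrm{int}(B_\gamma)$ and the risk set $S$ are two disjoint convex sets in $\mathbb{R}^k$; the separating hyperplane theorem guarantees the existence of some nonzero vector $\lambda \in \mathbb{R}^k$ and a constant $c$ such that

$$\lambda^Tx \ge c\ \ \forall x \in S \qquad \text{and} \qquad \lambda^Tx \le c\ \ \forall x \in \mathrm{int}(B_\gamma).$$

We first claim that $\lambda_i \ge 0$ for all $i = 1, \ldots, k$. This can be easily proven by contradiction: say $\lambda_i < 0$; then we can let $y_i \to -\infty$ while still staying inside the set $\mathrm{int}(B_\gamma)$. This yields a vector $y$ such that $\lambda^Ty$ is indefinitely large, i.e., larger than any given constant $c$. Thus, we must have $\lambda \ge 0$. Since $\lambda \ne 0$, we can normalize it to sum to one, so that it can be interpreted as a valid prior. Consider the vector $(\gamma, \ldots, \gamma)^T$; it is in the closure of $\mathrm{int}(B_\gamma)$, so that we must have

$$\lambda^T(\gamma, \ldots, \gamma)^T = \gamma \le c, \qquad (11.4)$$

where we have used the normalization property of $\lambda$. Now let $\delta$ be an arbitrary decision rule, with risk vector $(R(\theta_i, \delta))_{i = 1, \ldots, k}$. We then have

$$r(\lambda, \delta) = \lambda^TR(\cdot, \delta) \ge c \ge \gamma,$$

since the risk vector belongs to $S$ by definition. Since $\delta$ was arbitrary, we have $\inf_\delta r(\lambda, \delta) \ge \gamma$. Taking the supremum over priors, we find that

$$\sup_\Lambda\inf_\delta r(\Lambda, \delta) = \underline{v} \ge \gamma \ge \bar{v}, \qquad (11.5)$$

which completes the proof of part (a). Furthermore, the $\lambda$ that we constructed is the least favorable prior of part (b).

Stat210B Theoretical Statistics. Lecture Date: February 6th, 2007.
Uniformly Strong Law of Large Numbers
Lecturer: Michael I. Jordan. Scribe: Jitm Ding.

In this lecture we try to generalize the Glivenko-Cantelli theorem. Let $\xi_1, \xi_2, \ldots \sim P$ be an i.i.d. sequence. We define $Pf = E f(X)$, in which $X \sim P$. We also define $P_nf$ with respect to the empirical measure $P_n$, which puts mass $1/n$ at each $\xi_i$. Notice that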
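by definition $P_nf = \frac{1}{n}\sum_{i=1}^n f(\xi_i)$. We point out that $P_nf - Pf$ is an object of interest, and $\sup_{f \in \mathcal{F}}|P_nf - Pf|$ is of even more interest. For example, let $\mathcal{F} = \{1_{(-\infty, t]} : t \in \mathbb{R}\}$; then $P_nf - Pf$ becomes $F_n(t) - F(t)$ and $\sup_{f \in \mathcal{F}}|P_nf - Pf|$ becomes $\sup_t|F_n(t) - F(t)|$. In general, we are interested in statistics defined on a family of stochastic processes with index set $\mathcal{F}$.

Uniform Law of Large Numbers

Define $\|P_n - P\| = \sup_{f \in \mathcal{F}}|P_nf - Pf|$. Recalling the discussion in the last lecture, we get

$$P(\|P_n - P\| > \epsilon) \le 2P\left(\|P_n - P_n'\| > \tfrac{\epsilon}{2}\right) \le 4P\left(\|P_n^\circ\| > \tfrac{\epsilon}{4}\right),$$

where $P_n^\circ$ is a signed measure putting mass $\sigma_i/n$ at $\xi_i$. Again, each $\sigma_i$ independently picks a value uniformly on $\{1, -1\}$.

Specialize $\mathcal{F}$ to indicators. Let $I_j = (-\infty, t_j]$, where the $t_j$ lie between the ordered points $\xi_{(j)}$, i.e., $t_0 < \xi_{(1)} < t_1 < \xi_{(2)} < t_2 < \cdots$. Consider

$$P\left(\|P_n^\circ\| > \tfrac{\epsilon}{4}\,\Big|\,\xi\right) = P\left(\bigcup_{j=0}^n\left\{|P_n^\circ I_j| > \tfrac{\epsilon}{4}\right\}\Big|\,\xi\right) \le \sum_{j=0}^n P\left(|P_n^\circ I_j| > \tfrac{\epsilon}{4}\,\Big|\,\xi\right) \le (n + 1)\max_j P\left(|P_n^\circ I_j| > \tfrac{\epsilon}{4}\,\Big|\,\xi\right).$$

Recall Hoeffding's inequality: let $Y_i$ be independent with mean 0 and $a_i \le Y_i \le b_i$. Then

$$P(|Y_1 + Y_2 + \cdots + Y_n| > \eta) \le 2\exp\left(-\frac{2\eta^2}{\sum_i(b_i - a_i)^2}\right).$$

We apply this to $Y_i = \sigma_i1\{\xi_i \le t\}$ and conclude

$$P\left(|P_n^\circ(-\infty, t]| > \tfrac{\epsilon}{4}\,\Big|\,\xi\right) \le 2\exp\left(-\frac{2(n\epsilon/4)^2}{4n}\right) = 2\exp\left(-\frac{n\epsilon^2}{32}\right).$$

Notice that this is independent of $\xi$, so

$$P(\|P_n - P\| > \epsilon) \le 8(n + 1)\exp\left(-\frac{n\epsilon^2}{32}\right).$$

We get the Uniform Law of Large Numbers in probability, and also almost surely by Borel-Cantelli. The conclusion, namely the Glivenko-Cantelli theorem, is not new. However, this method can be generalized to richer classes of functions immediately.

VC Classes

Consider a collection $\mathcal{C}$ of subsets of some set $\mathcal{X}$, and consider points $\xi_1, \ldots, \xi_n$ from $\mathcal{X}$. Define

$$\Delta^{\mathcal{C}}(\xi_1, \ldots, \xi_n) = \#\{C \cap \{\xi_1, \ldots, \xi_n\} : C \in \mathcal{C}\},$$
$$m^{\mathcal{C}}(n) = \max_{\xi_1, \ldots, \xi_n}\Delta^{\mathcal{C}}(\xi_1, \ldots, \xi_n),$$
$$V^{\mathcal{C}} = \min\{n : m^{\mathcal{C}}(n) < 2^n\}.$$

Examples:
1. $\mathcal{X} = \mathbb{R}$, $\mathcal{C} = \{(-\infty, t]\}$. Then $V^{\mathcal{C}} = 2$.
2. $\mathcal{X} = \mathbb{R}$, $\mathcal{C} = \{(s, t] : s < t\}$. Then $V^{\mathcal{C}} = 3$.
3. $\mathcal{X} = \mathbb{R}^d$, $\mathcal{C} = \{(-\infty, t] : t \in \mathbb{R}^d\}$. Then $V^{\mathcal{C}} = d + 1$.
4. Rectangles in $\mathbb{R}^d$: $V^{\mathcal{C}} = 2d + 1$.

Sauer's Lemma

Lemma 1. $m^{\mathcal{C}}(n) \le \sum_{j=0}^{V^{\mathcal{C}} - 1}\binom{n}{j}$, which grows only polynomially in $n$.

This suggests

$$P\left(\|P_n^\circ\| > \tfrac{\epsilon}{4}\,\Big|\,\xi\right) \le m^{\mathcal{C}}(n)\max_j P\left(|P_n^\circ I_j| > \tfrac{\epsilon}{4}\,\Big|\,\xi\right),$$

where the $I_j$ are indicators of the subsets that achieve $\Delta^{\mathcal{C}}(\xi_1, \ldots, \xi_n)$. Then if $\mathcal{F}$ is a VC class, i.e., $V^{\mathcal{C}} < \infty$, then

$$P(\|P_n - P\| > \epsilon) \le (\text{polynomial in } n) \times \exp(-cn).$$

STAT 210A Theoretical Statistics, Fall 2006
Lecture 20: November 7
Lecturer: Martin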
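Wainwright. Scribe: Chuohao Yeo.

These scribe notes have only been mildly proofread.

20.1 Hypothesis testing

We can think of hypothesis testing as a special case of decision theory. Here, we are interested in testing

$$H_0: \theta \in \Theta_0 \quad \text{vs.} \quad H_1: \theta \in \Theta_1,$$

where the parameter space is $\Theta = \Theta_0 \cup \Theta_1$ and $\Theta_0 \cap \Theta_1 = \emptyset$. We define a test as the following map:

$$\delta: \mathcal{X} \to [0, 1].$$

Note that this allows for randomized tests. In other words, given $X = x$, we would declare $H_1$ with probability $\delta(x)$. The level of a test $\delta$ is the probability of incorrectly declaring $H_1$ when $H_0$ is true, or, for $\theta_0 \in \Theta_0$,

$$\alpha(\theta_0) = E_{\theta_0}[\delta(X)].$$

The power of a test $\delta$ is the probability of correctly declaring $H_1$ when $H_1$ is true, or, for $\theta_1 \in \Theta_1$,

$$\beta(\theta_1) = E_{\theta_1}[\delta(X)].$$

Ideally, we would like to have $\alpha(\theta_0) = 0$ uniformly over $\Theta_0$ and $\beta(\theta_1) = 1$ uniformly over $\Theta_1$.

20.1.1 Simple hypothesis tests

In simple hypothesis tests, we want to decide between the following alternatives:

$$H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta = \theta_1.$$

In other words, $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$. Furthermore, we also have the following:

$$\alpha = E_{\theta_0}[\delta(X)] \equiv E_0[\delta(X)], \qquad \beta = E_{\theta_1}[\delta(X)] \equiv E_1[\delta(X)].$$

In one formulation of this problem, we might want to specify the maximum tolerance of a type I error, $\bar\alpha$. Then, we would be interested in the following optimization problem:

$$\max_\delta E_1[\delta(X)] \quad \text{s.t.} \quad E_0[\delta(X)] \le \bar\alpha.$$

20.1.2 Geometry of simple hypothesis tests

Consider the following set:

$$S = \{(\alpha, \beta) \in [0, 1]^2 : \alpha = E_0[\delta(X)],\ \beta = E_1[\delta(X)] \text{ for some } \delta\}.$$

Since we allow for randomized tests, the set $S$ is always convex.

[Figure: the set $S$ in the $(\alpha, \beta)$ plane, with labels "Desirable operating space", "Achievable with $\delta(x) \equiv 1$" and "Achievable with $\delta(x) \equiv 0$".]

20.2 Likelihood ratio tests

A likelihood ratio test (LRT) is specified by a threshold $t \in [0, \infty)$ and takes the form

$$\delta_t(x) = \begin{cases}1 & \text{if } p(x; \theta_1) > t\,p(x; \theta_0),\\ \gamma & \text{if } p(x; \theta_1) = t\,p(x; \theta_0),\\ 0 & \text{if } p(x; \theta_1) < t\,p(x; \theta_0).\end{cases}$$

Here is a motivation for why we might use a LRT from a large-sample perspective. We know that the likelihood ratio is

$$L(X) = \frac{p(X; \theta_1)}{p(X; \theta_0)}.$$

Say $X_1, \ldots, X_n$ are drawn i.i.d. from $P_{\theta_1}$. Consider the following statistic:

$$Z_n = \frac{1}{n}\log L(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^n\log\frac{p(X_i; \theta_1)}{p(X_i; \theta_0)}.$$

From the WLLN, we have the following result:

$$Z_n \xrightarrow{p} E_{\theta_1}\left[\log\frac{p(X; \theta_1)}{p(X; \theta_0)}\right] = D(P_{\theta_1}\,\|\,P_{\theta_0}) > 0.$$

Conversely, if the $X_i$ were drawn i.i.d. from $P_{\theta_0}$, we have $Z_n \xrightarrow{p} -D(P_{\theta_0}\,\|\,P_{\theta_1}) < 0$.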
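In other words, $Z_n$ converges to something positive if $\theta = \theta_1$, and $Z_n$ converges to something negative if $\theta = \theta_0$.

20.3 Neyman-Pearson Lemma

Let us consider the following optimization problem:

$$\max_\delta\beta(\delta) \quad \text{s.t.} \quad \alpha(\delta) \le \bar\alpha. \qquad (20.1)$$

We would like to show that the optimum solution to problem (20.1) is always achieved by a LRT. We call a solution to this problem most powerful (MP) at level $\bar\alpha$.

Theorem 20.1 (Neyman-Pearson).
(i) For any $\bar\alpha \in [0, 1]$, there exists a threshold $t$ such that $E_0[\delta_t(X)] = \bar\alpha$.
(ii) If a LRT $\delta_t$ satisfies $E_0[\delta_t(X)] = \bar\alpha$, then it is MP at level $\bar\alpha$.
(iii) Any MP test at level $\bar\alpha$ can be written as a LRT.

Proof. Let $p_i(x) = p(x; \theta_i)$, $i = 0, 1$.

(i) Define $\alpha(t) = P_0[p_1(X) > t\,p_0(X)]$. We can then express it as $\alpha(t) = P_0[Z > t] = 1 - F_Z(t)$, where $Z = p_1(X)/p_0(X)$. Since $1 - \alpha$ is a CDF, it is right-continuous and non-decreasing, with $\alpha(t - 0) - \alpha(t) = P_0[Z = t]$, where $\alpha(t - 0) = \lim_{s \uparrow t}\alpha(s)$. Given any $\bar\alpha \in (0, 1)$, choose $t_0$ such that $\alpha(t_0) \le \bar\alpha \le \alpha(t_0 - 0)$ (in the continuous case, choose $t_0 = F_Z^{-1}(1 - \bar\alpha)$). Now, set

$$\delta_{t_0}(x) = \begin{cases}1 & \text{if } p_1(x) > t_0\,p_0(x),\\[2pt] \dfrac{\bar\alpha - \alpha(t_0)}{\alpha(t_0 - 0) - \alpha(t_0)} & \text{if } p_1(x) = t_0\,p_0(x),\\[2pt] 0 & \text{if } p_1(x) < t_0\,p_0(x).\end{cases}$$

Checking, we see that

$$E_0[\delta_{t_0}(X)] = P_0[Z > t_0] + \frac{\bar\alpha - \alpha(t_0)}{\alpha(t_0 - 0) - \alpha(t_0)}P_0[Z = t_0] = \alpha(t_0) + \bar\alpha - \alpha(t_0) = \bar\alpha.$$

(ii) Say $\delta_t$ is a LRT of level $\bar\alpha$. Let $\phi$ be any other test of level $\bar\alpha$. We need to show that $E_1[\delta_t(X)] \ge E_1[\phi(X)]$. Define the following sets:

$$S^+ = \{x : \delta_t(x) - \phi(x) > 0\}, \qquad S^- = \{x : \delta_t(x) - \phi(x) < 0\}.$$

On $S^+$ we must have $\delta_t(x) > 0$, so that $p_1(x) \ge t\,p_0(x)$; on $S^-$ we must have $\delta_t(x) < 1$, so that $p_1(x) \le t\,p_0(x)$. Hence

$$\int_{S^+}\left(\delta_t(x) - \phi(x)\right)\left(p_1(x) - t\,p_0(x)\right)dx \ge 0,$$

and similarly on $S^-$. However, we note that

$$\int_{\mathcal{X}}\left(\delta_t(x) - \phi(x)\right)\left(p_1(x) - t\,p_0(x)\right)dx = E_1[\delta_t(X) - \phi(X)] - tE_0[\delta_t(X) - \phi(X)] \ge 0,$$

so that

$$E_1[\delta_t(X)] - E_1[\phi(X)] \ge t\left(E_0[\delta_t(X)] - E_0[\phi(X)]\right) \ge 0,$$

with the last inequality following since both $\delta_t$ and $\phi$ are of level $\bar\alpha$, i.e., $E_0[\delta_t(X)] = \bar\alpha$ and $E_0[\phi(X)] \le \bar\alpha$ by assumption.

(iii) Say $\phi$ is MP at level $\bar\alpha$. From (ii), we can find a LRT $\delta_t$ that is also MP at level $\bar\alpha$. Define $T = (S^+ \cup S^-) \cap \{x : p_1(x) \ne t\,p_0(x)\}$. We note that $\phi$ and $\delta_t$ differ on the set $S^+ \cup S^-$, and $p_1(x) - t\,p_0(x) \ne 0$ on the set $\{p_1 \ne t\,p_0\}$. Let $f(x) = (\delta_t(x) - \phi(x))(p_1(x) - t\,p_0(x))$. On the set $T$, $f(x) > 0$: when $p_1(x) > t\,p_0(x)$ we have $\delta_t(x) = 1$, so $\delta_t(x) - \phi(x) = 1 - \phi(x) > 0$ if $\phi$ and $\delta_t$ differ; similarly for the case $p_1(x) < t\,p_0(x)$. Hence, if the set $T$ does not have zero measure, then $\int_T f(x)\,dx > 0$. Now we have that $\int_T f(x)\,dx = \int_{S^+ \cup S^-}f(x)\,dx$, since $f = 0$ wherever $p_1(x) =$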
$t\,p_0(x)$. But if $\int_{S^+ \cup S^-}f(x)\,dx > 0$, it would contradict the assumption that $\phi$ is MP: from the proof of (ii), this integral evaluates to

$$E_1[\delta_t(X) - \phi(X)] - tE_0[\delta_t(X) - \phi(X)] \le E_1[\delta_t(X) - \phi(X)],$$

since both $\delta_t$ and $\phi$ are of level $\bar\alpha$, so a strictly positive integral would give $E_1[\delta_t(X)] > E_1[\phi(X)]$. Hence the desired result follows. □

STAT 210A Theoretical Statistics, Fall 2006
Lecture 22: November 14
Lecturer: Martin Wainwright. Scribe: Henry Lin.

These lecture notes have only been mildly proofread.

Outline
• UMP Tests
• Generalized LRTs

Reading
• Keener, Chapters 14, 15, 18
• Bickel and Doksum, Chapter 4

22.1 Existence of a Uniformly Most Powerful Test

Recall from last lecture the concept of a uniformly most powerful (UMP) test. Given that we are trying to test $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$, a hypothesis test $\phi$ is uniformly most powerful at level $\alpha$ if
• $\sup_{\theta \in \Theta_0}E_\theta[\phi(X)] \le \alpha$;
• $E_\theta[\phi(X)] \ge E_\theta[\delta(X)]$ for all $\theta \in \Theta_1$ and for all $\delta$ of level $\alpha$.

As we saw in the last lecture, we cannot always hope to find a UMP test when testing a composite hypothesis. We need additional structure in order to guarantee the existence of a UMP test. In today's lecture, we establish that together the two conditions below guarantee the existence of a UMP test:
(i) the test is one-sided (e.g., $H_0: \theta = \theta_0$ versus $H_1: \theta > \theta_0$);
(ii) appropriate conditions are imposed on the likelihood ratio (for example, monotonicity).

Definition 1. A family $\{p(\cdot\,; \theta) \mid \theta \in \Theta\}$ is said to have monotone likelihood ratio (MLR) in some statistic $T(x)$ if for all $\theta_1 > \theta_0$, the likelihood ratio $\frac{p(x; \theta_1)}{p(x; \theta_0)}$ can be written as a non-decreasing function of $T(x)$ (given $\theta_1$ and $\theta_0$ fixed).

Example: Monotonicity in exponential families. Consider a 1-d exponential family with probability density

$$p(x; \theta) = h(x)\exp\{\eta(\theta)T(x) - B(\theta)\}.$$

Hence the log of the likelihood ratio is

$$\log\frac{p(x; \theta_1)}{p(x; \theta_0)} = \left(\eta(\theta_1) - \eta(\theta_0)\right)T(x) - \left[B(\theta_1) - B(\theta_0)\right].$$

For increasing $\eta$, it is easy to see the likelihood ratio obeys monotonicity: given $\theta_1 > \theta_0$, we have $\eta(\theta_1) - \eta(\theta_0) > 0$, which implies the likelihood ratio is an increasing function of $T(x)$. Similarly, if $\eta$ is decreasing, then the likelihood ratio is also monotone with respect to the statistic $-T(x)$. As a
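consequence of this observation, we may conclude that the following distributions all have monotone likelihood ratios:
• Binomial: $X \sim \mathrm{Bin}(n, \theta)$
• Normal location: $X \sim N(\theta, \sigma^2)$
• Poisson: $X \sim \mathrm{Poi}(\theta)$

22.2 Constructing UMP Tests

Let's show how a MLR allows us to construct a UMP test for a one-sided hypothesis test $H_0: \theta = \theta_0$ versus $H_1: \theta > \theta_0$ for $X \sim \mathrm{Bin}(n, \theta)$. First note: for any $\theta_1 > \theta_0$, the likelihood ratio is an increasing function of $x$, thus we have a MLR in $x$. Now consider the following generic threshold test:

$$\phi(x) = \begin{cases}1 & \text{if } x > c,\\ \gamma & \text{if } x = c,\\ 0 & \text{if } x < c.\end{cases}$$

Given any level $\alpha \in (0, 1)$, it is possible to find $c$ and $\gamma$ such that $E_{\theta_0}[\phi(X)] = \alpha$, where $\phi(x)$ is the threshold test defined by $c$ and $\gamma$.

Side comment: Note that we need the extra $\gamma$ parameter and the $x = c$ case in order to achieve all possible levels $\alpha$. To illustrate the necessity of the $\gamma$ parameter, consider a random variable $X \sim \mathrm{Ber}(\theta)$, and suppose we only allow non-randomized tests of the form

$$\delta_c(x) = \begin{cases}1 & \text{if } x \ge c,\\ 0 & \text{if } x < c.\end{cases}$$

Without randomization, we can only achieve three values for $\alpha$ rather than all values in $[0, 1]$:

$$\alpha(c) = E_{\theta_0}[\delta_c(X)] = P_{\theta_0}(X \ge c) = \begin{cases}1 & \text{if } c \le 0,\\ \theta_0 & \text{if } c \in (0, 1],\\ 0 & \text{if } c > 1.\end{cases}$$

Returning to the main thread: now that we have defined our test $\phi$, we would like to show that $\phi$ is uniformly most powerful at level $\alpha$. To prove $\phi$ is UMP, we need to show that for any fixed $\theta_1 > \theta_0$ and any test $\delta$ of level $\alpha$, $E_{\theta_1}[\phi(X)] \ge E_{\theta_1}[\delta(X)]$. In other words, we would like to show $\phi$ is most powerful at level $\alpha$ when testing $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$, for any fixed $\theta_1 > \theta_0$. Using the fact that the binomial likelihood ratio is monotone in $x$, we may conclude that $\phi$ is equivalent to the following test (provided $f(c, \theta_1, \theta_0)$ is defined appropriately):

$$\phi(x) = \begin{cases}1 & \text{if } \frac{p(x; \theta_1)}{p(x; \theta_0)} > f(c, \theta_1, \theta_0),\\[2pt] \gamma & \text{if } \frac{p(x; \theta_1)}{p(x; \theta_0)} = f(c, \theta_1, \theta_0),\\[2pt] 0 & \text{if } \frac{p(x; \theta_1)}{p(x; \theta_0)} < f(c, \theta_1, \theta_0).\end{cases}$$

Consequently, the Neyman-Pearson theorem implies that $\phi$ is most powerful at level $\alpha$ for any fixed $\theta_1 > \theta_0$. Therefore, $\phi$ is also uniformly most powerful when testing $H_0: \theta = \theta_0$ versus $H_1: \theta > \theta_0$. Here we found a UMP test for a specific example, but we can repeat the same argument with $T(x)$ in place of $x$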
in the hypothesis test to establish the following much more general theorem.

Theorem 22.1. Suppose that a family has MLR in $T(x)$, and consider a one-sided test (e.g., $H_0: \theta = \theta_0$ versus $H_1: \theta > \theta_0$). Then the test

$$\phi(x) = \begin{cases}1 & \text{if } T(x) > c,\\ \gamma & \text{if } T(x) = c,\\ 0 & \text{if } T(x) < c\end{cases}$$

is UMP at level $\alpha$, where $c$ and $\gamma$ are constants set such that $E_{\theta_0}[\phi(X)] = \alpha$.

22.3 Generalized LRTs

Let's take another look at the case when we have a general composite hypothesis test of the form $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$, where $\Theta_0 \cap \Theta_1 = \emptyset$ and $\Theta_0 \cup \Theta_1 = \Theta$. One reasonable approach here is to threshold based on the generalized likelihood ratio

$$G(x) = \frac{\sup_{\theta \in \Theta}p(x; \theta)}{\sup_{\theta \in \Theta_0}p(x; \theta)}.$$

In the expression above, we call the maximizer in the numerator the global or unrestricted MLE (maximum likelihood estimator), and the maximizer in the denominator the restricted MLE on $\Theta_0$. Note that $G(x)$ is always $\ge 1$. In general, we would like to choose $H_0$ if $G(x)$ is close to 1, and choose $H_1$ if $G(x) \gg 1$. Note that the numerator of the GLR optimizes over all of $\Theta$ rather than $\Theta_1$. We define our GLR in this way because it is often easier to optimize over $\Theta$.

Comment 1. When a UMP test does exist, in many cases a GLR can be used to construct it.

22.3.1 Generalized LRT: Standard Normal Example

Let's consider the case when $X_i \sim N(\mu, \sigma^2)$ for $i = 1, \ldots, n$. Let $\Theta = \{(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 > 0\}$, and suppose we are trying to test $\Theta_0 = \{\mu = 0, \sigma^2 = 1\}$ versus $\Theta_1 = \Theta \setminus \Theta_0$. In other words, we want to test whether the $X_i$ are drawn from a standard normal distribution. The denominator of the GLR is

$$p(x; \mu = 0, \sigma^2 = 1) = (2\pi)^{-n/2}e^{-\frac{1}{2}\sum_{i=1}^n x_i^2},$$

while to compute the numerator, note that the MLEs for $\mu$ and $\sigma^2$ are $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$ and $s^2 = \frac{1}{n}\sum_{i=1}^n(x_i - \bar x)^2$, respectively. Thus, the numerator is $p(x; \bar x, s^2) = (2\pi s^2)^{-n/2}e^{-n/2}$, and the GLR is

$$G(x) = \frac{(2\pi s^2)^{-n/2}e^{-n/2}}{(2\pi)^{-n/2}e^{-\frac{1}{2}\sum_i x_i^2}} = (s^2)^{-n/2}\exp\left(-\frac{n}{2} + \frac{1}{2}\sum_{i=1}^n x_i^2\right).$$

It follows that $2\log G(x) = -n\log s^2 - n + \sum_{i=1}^n x_i^2$, and next time we will show that under $H_0$, $2\log G(x)$ converges in distribution to $\chi^2_2$ (two parameters are restricted under $H_0$), where $\chi^2_2$ is the distribution of $z_1^2 + z_2^2$ for independent $z_1, z_2 \sim N(0, 1)$.

22.3.2 Generalized LRT: Multinomial Example

Let's suppose a random variable $X$ has a multinomial distribution based on $n$ trials with parameters
$\theta_1, \dots, \theta_d$. In other words, given nonnegative integers $x_1, \dots, x_d$ such that $\sum_{i=1}^d x_i = n$,
$$p(x_1, \dots, x_d; \theta) = \binom{n}{x_1, \dots, x_d} \prod_{i=1}^d \theta_i^{x_i}.$$
Let $\Theta = \{\theta \in \mathbb{R}^d : \sum_{i=1}^d \theta_i = 1, \ \theta_i \ge 0 \text{ for } i = 1, \dots, d\}$ and consider testing $H_0: \theta \in \Theta_0 = \{(\theta^0_1, \dots, \theta^0_d)\}$ versus $H_1: \theta \notin \Theta_0$. In this case it can be shown that
$$\hat{\theta}_i = \frac{x_i}{n}, \quad i = 1, \dots, d,$$
are the MLEs for the unrestricted problem in the numerator. Thus, we have
$$G(x) = \prod_{i=1}^d \Big(\frac{\hat{\theta}_i}{\theta^0_i}\Big)^{x_i}.$$
Note that
$$\frac{1}{n} \log G(x) = \sum_{i=1}^d \hat{\theta}_i \log \frac{\hat{\theta}_i}{\theta^0_i},$$
which is actually equivalent to a Kullback-Leibler divergence $KL(p \| q)$ if we define $p_i = x_i / n$ and $q_i = \theta^0_i$.

Stat210B: Theoretical Statistics
Lecture Date: February 27, 2007
Lecture 12: Stochastic Equicontinuity and Chaining
Lecturer: Michael I. Jordan
Scribe: Guilherme V. Rocha

1 Stochastic Equicontinuity

Definition 1 (Stochastic Equicontinuity). A collection of stochastic processes $\{Z_n(t) : t \in T\}$ is said to be stochastically equicontinuous at $t_0 \in T$ if for all $\eta > 0$ and all $\epsilon > 0$ there exists a neighborhood $U$ of $t_0$ such that
$$\limsup_n P\Big(\sup_{t \in U} |Z_n(t) - Z_n(t_0)| > \eta\Big) < \epsilon.$$

One application of stochastic equicontinuity will be in proving results of the following kind.

Lemma 2. Suppose $\{Z_n(t)\}$ is stochastically equicontinuous at $t_0 \in T$. Let $\tau_n$ be a sequence of random elements of $T$ known to satisfy $\tau_n \xrightarrow{P} t_0$. It follows that
$$Z_n(\tau_n) - Z_n(t_0) \xrightarrow{P} 0.$$

Proof. Fix $\eta > 0$ and $\epsilon > 0$. From the stochastic equicontinuity of $Z_n$ we know that there exists a neighborhood $U$ of $t_0$ such that
$$\limsup_n P\Big(\sup_{t \in U} |Z_n(t) - Z_n(t_0)| > \eta\Big) < \frac{\epsilon}{2}.$$
Since $\tau_n \xrightarrow{P} t_0$, it follows that $\limsup_n P(\tau_n \notin U) < \epsilon/2$. From the assumptions we have that
$$\{|Z_n(\tau_n) - Z_n(t_0)| > \eta\} \subseteq \{\tau_n \notin U\} \cup \Big\{\sup_{t \in U} |Z_n(t) - Z_n(t_0)| > \eta\Big\}.$$
Now using the union bound on the union of these two events yields the result. ∎

We now move on to results on asymptotic normality (AN) of M-estimators. We will cover the results in Pollard (1984) based on stochastic equicontinuity.

2 Chaining

We now develop chaining arguments leading to stochastic equicontinuity. Chaining arguments are based on building a multiresolution grid on the space of functions we are interested in. We bound the fluctuations of the process by
controlling the fluctuations along short paths on the grid. To control the fluctuations along the path in the grid we need to bound the covering integral defined below.

We are given:
• a stochastic process $\{Z(t) : t \in T\}$;
• a semimetric on $T$: $d(s, t)$, $s, t \in T$;
• a pointwise exponential inequality (e.g. the Hoeffding bound).

We want to find conditions on $Z(t)$ ensuring that the pointwise inequality can be upgraded to a uniform inequality. One important quantity in getting such results is the covering integral.

Definition 3.
$$J(\delta, d, T) = \int_0^\delta \Big[2 \log \frac{N(u)^2}{u}\Big]^{1/2} du,$$
where $N(\delta)$ is the smallest integer $m$ such that there exist points $t_1, \dots, t_m$ with $\min_{1 \le i \le m} d(t, t_i) \le \delta$ for all $t \in T$. Here we implicitly restrict $t_1, t_2, \dots, t_m$ to be points of $T$, which is different from the definition of covering number in earlier lectures.

Our next lemma establishes that finiteness of the covering integral $J(\delta, d, T)$ is sufficient to ensure that the differences between points close to one another are unlikely to be larger than a quantity related to the covering integral.

Lemma 4 (Pollard, 1984, Lemma 9). Suppose that $J(\delta, d, T) < \infty$ and there exists $D$ such that, for all $s, t \in T$ and $\eta > 0$,
$$P\big(|Z(s) - Z(t)| > \eta \, d(s, t)\big) \le 2 \exp\Big(-\frac{\eta^2}{2 D^2}\Big).$$
Then
$$P\big(|Z(s) - Z(t)| > 26 D J(\epsilon, d, T) \text{ for some } s, t \in T \text{ with } d(s, t) < \epsilon\big) \le 2\epsilon.$$

Proof. Let $\delta_i = \epsilon / 2^i$ and define
$$H(u) = \Big[2 \log \frac{N(u)^2}{u}\Big]^{1/2}.$$
Now construct a $2\delta_i$-net at each scale by following these steps, writing $M_i = m_1 + \dots + m_i$ for the number of points chosen up to scale $i$:
1. Pick an arbitrary $t_1 \in T$.
2. For $k$ going from 2 to $m_1 = N(\delta_1)$:
   (a) pick $t_k$ such that $d(t_k, t_j) > 2\delta_1$ for all $j < k$.
3. Let $T_1 = \{t_1, \dots, t_{m_1}\}$.
4. Pick $t_{m_1 + 1}$ such that $d(t_{m_1 + 1}, t_j) > 2\delta_2$ for all $j < m_1 + 1$.
5. For $k$ going from $m_1 + 2$ to $m_1 + m_2$ with $m_2 = N(\delta_2)$:
   (a) pick $t_k$ such that $d(t_k, t_j) > 2\delta_2$ for all $j < k$.
6. Let $T_2 = \{t_{m_1 + 1}, \dots, t_{m_1 + m_2}\}$.
7. ...
8. Pick $t_{M_{i-1} + 1}$ such that $d(t_{M_{i-1} + 1}, t_j) > 2\delta_i$ for all $j < M_{i-1} + 1$.
9. For $k$ going from $M_{i-1} + 2$ to $M_{i-1} + m_i$ with $m_i = N(\delta_i)$:
   (a) pick $t_k$ such that $d(t_k, t_j) > 2\delta_i$ for all $j < k$.
10. Let $T_i = \{t_{M_{i-1} + 1}, \dots, t_{M_{i-1} + m_i}\}$.
11. ...
12. Let $T^* = \bigcup_i T_i$.

We now define the set $A_i$ on which something bad happens at scale $i$, i.e. a set where the observed difference $Z(s) - Z(t)$ is large for a pair of points on the grid at scale $\delta_i$. Formally, $A_i$ is the set of outcomes $\omega \in \Omega$ for which $|Z(\omega, s) - Z(\omega, t)|$ exceeds
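$D \, d(s, t) \, H(\delta_i)$ for some $s, t \in T_i$.

Now notice that $A_i$ is the union of at most $N(\delta_i)^2$ events whose probabilities can be controlled using the pointwise bound, and conclude
$$P(A_i) \le 2 N(\delta_i)^2 \exp\Big(-\frac{H(\delta_i)^2}{2}\Big) = 2\delta_i.$$
It follows that
$$P\Big(\bigcup_{i=1}^\infty A_i\Big) \le \sum_{i=1}^\infty P(A_i) \le 2\epsilon.$$

Now we want to extend this result from the points in the grids $T_i$ to the entire set $T$. Let $s, t \in T$ be such that $d(s, t) < \epsilon$. Find $n$ such that $\delta_n \le d(s, t) \le 2\delta_n$. Now link $s = s_{m+1}, s_m, \dots, s_n$ such that $s_i \in T_i$, choosing the closest point at each step. By construction, $d(s_{i+1}, s_i) \le 2\delta_i$. Similarly build $t_n, \dots, t_m, t_{m+1}$ for $t = t_{m+1}$. Now, using the triangle inequality,
$$|Z(s) - Z(t)| \le |Z(s_n) - Z(t_n)| + \sum_{i=n}^{m} \big(|Z(s_{i+1}) - Z(s_i)| + |Z(t_{i+1}) - Z(t_i)|\big).$$
Now on $A_{i+1}^c$ we have
$$|Z(s_{i+1}) - Z(s_i)| \le D \cdot d(s_{i+1}, s_i) \cdot H(\delta_{i+1}) \le 2 D \cdot \delta_i \cdot H(\delta_{i+1}),$$
and substituting this back into the inequality above yields that on $(\bigcup_i A_i)^c$,
$$|Z(s) - Z(t)| \le D \, d(s_n, t_n) \, H(\delta_n) + 2 \sum_{i=n}^{m} 2 D \, \delta_i \, H(\delta_{i+1}).$$
The distance along the chain is such that
$$d(s_n, t_n) \le d(s, t) + \sum_{i=n}^{m} \big(d(s_{i+1}, s_i) + d(t_{i+1}, t_i)\big) \le 2\delta_n + 4 \sum_{i \ge n} \delta_i \le 10 \delta_n.$$
As a result, since $\delta_i = 4(\delta_{i+1} - \delta_{i+2})$, we have on $(\bigcup_i A_i)^c$:
$$|Z(s) - Z(t)| \le 10 D \delta_n H(\delta_n) + 4 D \sum_{i=n}^{m} 4 (\delta_{i+1} - \delta_{i+2}) H(\delta_{i+1}) \le 10 D \delta_n H(\delta_n) + 16 D \sum_{i=n}^{m} \int_{\delta_{i+2}}^{\delta_{i+1}} H(u) \, du \le 26 D J(d(s, t)).$$
The last two steps use the fact that $H$ is decreasing, as pictured in Figures 1 and 2. ∎

[Figure 1: Pictorial argument for $(\delta_{i+1} - \delta_{i+2}) H(\delta_{i+1}) \le \int_{\delta_{i+2}}^{\delta_{i+1}} H(u) \, du$.]

[Figure 2: Pictorial argument for $\delta_n H(\delta_n) \le \int_0^{\delta_n} H(u) \, du$.]

3 Symmetrization, Equicontinuity and Chaining

Recall we constructed $P_n^\circ$ as a signed measure assigning mass $\pm n^{-1}$ to each of the observed points $\xi_1, \xi_2, \dots, \xi_n$. We will now define rescaled versions of $P_n$ and $P_n^\circ$ as
$$E_n f = \sqrt{n} (P_n f - P f), \qquad E_n^\circ f = \sqrt{n} P_n^\circ f = n^{-1/2} \sum_{i=1}^n \sigma_i f(\xi_i),$$
where the $\sigma_i$ are the random signs of the symmetrization.

Let $E$ denote a Brownian bridge process and let $E f = \int f \, dE$. We will be extending the theory to establish conditions on a class of functions $\mathcal{F}$ so that convergence of $E_n f$ to $E f$ is obtained for every $f \in \mathcal{F}$.

Coming up next

In the coming classes we will be covering:
• Pollard (1984), chapter 7;
• Pollard (1984), Theorem 13;
• Pollard (1984), Lemma 15;
• Pollard (1984), Theorem 21: use stochastic equicontinuity to get $E_n \rightsquigarrow E$.

References

Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York.

Stat210B: Theoretical Statistics
Lecture Date: May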
17, 2007
Functional Delta Method and Bootstrap
Lecturer: Michael I. Jordan
Scribe: Arash Ali Amini

1 Functional Delta Method

Example 1 (quantile function, continued). Recall the definition of the quantile function, $F^{-1}(p) = \inf\{x : F(x) \ge p\}$. Let's assume for simplicity that $p = F(F^{-1}(p))$. To obtain the influence function we need to differentiate implicitly using this relation. We have done this in the previous lecture and obtained
$$IF_F(x) = \frac{p - 1_{(-\infty, F^{-1}(p)]}(x)}{f(F^{-1}(p))}, \qquad (1)$$
where $f$ is the density associated with $F$. In deriving (1), we have also assumed differentiability of $F$ and positivity of $f$ at the quantile. The graph of the influence function is given in Figure 1.

[Figure 1: Influence function of the $p$th quantile.]

It is easy to show that $E \, IF_F(X) = 0$. Let's compute the variance:
$$\gamma^2(F) = \mathrm{Var} \, IF_F(X) = \int IF_F(x)^2 \, dF(x) = \frac{1}{f^2(F^{-1}(p))} \int \Big[1_{(-\infty, F^{-1}(p)]}(x)^2 - 2p \, 1_{(-\infty, F^{-1}(p)]}(x) + p^2\Big] dF(x) = \frac{F(F^{-1}(p)) - 2p F(F^{-1}(p)) + p^2}{f^2(F^{-1}(p))},$$
where the last equality follows because $1_{(-\infty, y]}(x) = 1\{x \le y\} = 1_{[x, \infty)}(y)$, so the indicator equals its own square and integrates against $dF$ to $F(y)$. Reusing our assumption $p = F(F^{-1}(p))$, we get
$$\gamma^2(F) = \frac{p(1 - p)}{f^2(F^{-1}(p))},$$
which correctly suggests that
$$\sqrt{n}\big(F_n^{-1}(p) - F^{-1}(p)\big) \rightsquigarrow N\Big(0, \frac{p(1 - p)}{f^2(F^{-1}(p))}\Big). \qquad (2)$$

Let's now derive this result rigorously using Hadamard differentiability. We will need the following lemma, which we state without proof.

Lemma 2 (van der Vaart, 1998, Lemma 21.3, p. 306). Let $F$ be differentiable at a point $\xi_p \in (a, b)$ such that $F(\xi_p) = p$, with $F'(\xi_p) > 0$. Then $\phi(F) = F^{-1}(p)$ is Hadamard-differentiable at $F$ tangentially to the set of functions $h \in D[a, b]$ that are continuous at $\xi_p$, with derivative
$$\phi'_F(h) = -\frac{h(\xi_p)}{F'(\xi_p)}.$$

Using a variation of the functional delta method, namely the second part of Thm 20.8 from van der Vaart (1998, p. 297), we conclude that $\sqrt{n}(F_n^{-1}(p) - F^{-1}(p))$ is asymptotically equivalent to $\phi'_F$ evaluated at $\sqrt{n}(F_n - F)$. Using the lemma, this means
$$\sqrt{n}\big(F_n^{-1}(p) - F^{-1}(p)\big) = -\frac{\sqrt{n}\big(F_n(\xi_p) - F(\xi_p)\big)}{f(F^{-1}(p))} + o_P(1).$$
Expanding $F_n$ and rearranging, we get
$$\sqrt{n}\big(F_n^{-1}(p) - F^{-1}(p)\big) = -\frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{1\{X_i \le F^{-1}(p)\} - p}{f(F^{-1}(p))} + o_P(1).$$
Note that the influence function appears again, i.e. we have got the expansion $\frac{1}{\sqrt{n}} \sum_{i=1}^n IF_F(X_i) + o_P(1)$. It only remains to apply the CLT and Slutsky's lemma to get (2).

2 Bootstrap

Bootstrap is a plug-in methodology introduced by Brad Efron
for estimating performance measures associated with statistics. The bootstrap estimate of a performance measure is obtained by replacing every occurrence of the true unknown distribution $F$ with the empirical distribution $F_n$ (e.g. $E_F$ is replaced by $E_{F_n}$, and likewise for other functionals of $F$), and by replacing the original sample $\{X_i\}_{i=1}^n$ with a bootstrap sample $\{X_i^*\}_{i=1}^n$ obtained by resampling with replacement from $F_n$.

In practice, computing expectations w.r.t. $F_n$ is difficult if not impossible. Instead, the bootstrap estimate is usually obtained by computer simulation, i.e. by generating multiple bootstrap samples and approximating $E_{F_n}$ by the frequentist's average $\frac{1}{B} \sum_{b=1}^B$, where $B$ is the number of bootstrap samples. For example, a bootstrap estimate of the variance of the median could be
$$\frac{1}{B} \sum_{b=1}^B \big(\hat{\theta}^*_b - \bar{\theta}^*\big)^2, \qquad \bar{\theta}^* = \frac{1}{B} \sum_{b=1}^B \hat{\theta}^*_b,$$
where $\hat{\theta}^*_b$ is the sample median of the $b$th bootstrap sample.

We will demonstrate the idea by several examples. First let's consider a toy example to show that the bootstrap estimate may be computed based only on the knowledge of the original sample, without any resampling.

Example 3. Let $\theta = E_F X$ and $n = 2$. Also let $X_{(1)} = c < X_{(2)} = d$ be the order statistics of the original sample. Then the bootstrap sample $(X_1^*, X_2^*)$ can take on one of the following values: $(c, c)$, $(c, d)$, $(d, c)$, $(d, d)$, each with probability $1/4$. It follows that $\hat{\theta}^*$, the bootstrap sample mean, takes on the values
$$c, \ \frac{c + d}{2}, \ d \quad \text{w.p.} \quad \frac{1}{4}, \ \frac{1}{2}, \ \frac{1}{4}.$$
Thus performance measures (e.g. variance, bias, etc.) can be computed and no sampling is needed.

Example 4 (bias of the median). Let $\theta = \mathrm{median}(F)$. Define the performance measure $\lambda_n(F) = E_F(\hat{\theta}_n - \theta)$, where $\hat{\theta}_n$ is the sample median. Bootstrap replaces $F$ with $F_n$, which gives the following estimate of the bias:
$$\lambda_n(F_n) = E_{F_n}(\hat{\theta}^*_n - \hat{\theta}_n),$$
where $\hat{\theta}^*_n$ is the sample median of the bootstrap sample. Also note that we have replaced $\theta$ with $\hat{\theta}_n = \mathrm{median}(F_n)$.

Now let's consider, e.g., the case $n = 3$. Let $X_{(1)} = b$, $X_{(2)} = c$ and $X_{(3)} = d$. Then $(X_1^*, X_2^*, X_3^*)$ can take on 27 different values. The resulting distribution on the ordered bootstrap sample $(X^*_{(1)}, X^*_{(2)}, X^*_{(3)})$ is given by

value:        bbb    bbc    bbd    bcc    bcd    ...    ddd
probability:  1/27   3/27   3/27   3/27   6/27   ...    1/27

One concludes that $X^*_{(2)}$ is distributed according to the probabilities $7/27$, $13/27$, $7/27$ on
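$b$, $c$, $d$ respectively, which gives the following formula for the bootstrap estimate of the bias:
$$\lambda_n(F_n) = E_{F_n}\big(X^*_{(2)}\big) - X_{(2)} = \frac{7}{27} X_{(1)} + \frac{13}{27} X_{(2)} + \frac{7}{27} X_{(3)} - X_{(2)} = \frac{14}{27} \Big(\frac{X_{(1)} + X_{(3)}}{2} - X_{(2)}\Big).$$
Again note that we got the bias estimate in terms of the original sample without any resampling. This hopefully illustrates the idea of bootstrap. In general, bootstrap may be summarized as "plug-in + Monte Carlo integration", but the second step is not always required.

Now we show by an example that bootstrap does not always work, i.e. our estimate may not converge to the true value of the performance measure. In other words, we may not always get consistency.

Example 5 (U-statistic). In this example $\lambda_n(F_n)$ is or is not consistent depending on the assumptions. Let $\lambda_n(F) = E_F \, n \hat{\theta}_n^2$, where $\hat{\theta}_n$ is a U-statistic,
$$\hat{\theta}_n = \frac{1}{n(n - 1)} \sum_{i \ne j} \psi(X_i, X_j).$$
We have calculated moments of U-statistics before (cf. the final part of Thm 12.3 of van der Vaart, 1998):
$$\lambda_n(F) = \frac{4(n - 2)}{n - 1} \gamma_1^2 + \frac{2}{n - 1} \gamma_2^2,$$
where
$$\gamma_1^2 = E \, \psi(X_1, X_2) \psi(X_1, X_3), \qquad \gamma_2^2 = E \, \psi^2(X_1, X_2).$$
It follows that $\lambda_n(F) \to \lambda(F) = 4\gamma_1^2$. We also obtain
$$\lambda_n(F_n) = \frac{4(n - 2)}{n - 1} \hat{\gamma}_1^2 + \frac{2}{n - 1} \hat{\gamma}_2^2,$$
where
$$\hat{\gamma}_1^2 = \frac{1}{n^3} \sum_i \sum_j \sum_k \psi(X_i, X_j) \psi(X_i, X_k), \qquad \hat{\gamma}_2^2 = \frac{1}{n^2} \sum_i \sum_j \psi^2(X_i, X_j).$$
Note that $\hat{\theta}_n$ depends on $\psi(X_i, X_j)$ only for $i \ne j$, but $\lambda_n(F_n)$ depends also on $\psi(X_i, X_i)$. The situation thus depends on the diagonal of the kernel, $\psi(x, x)$.

• We obtain consistency when $\gamma_1^2$, $\gamma_2^2$ and $\gamma_3^2$ are all finite, where $\gamma_3^2 = E \, \psi^2(X, X)$. Indeed, because of the finiteness assumption we can use the law of large numbers [1] (LLN) to conclude $\hat{\gamma}_1^2 \xrightarrow{P} \gamma_1^2$ and $\hat{\gamma}_2^2 \xrightarrow{P} \gamma_2^2$, which in turn imply $\lambda_n(F_n) \xrightarrow{P} 4\gamma_1^2$.

• But what if $\gamma_3^2 = \infty$? Note that
$$\hat{\gamma}_2^2 = \frac{1}{n^2} \sum_{i \ne j} \psi^2(X_i, X_j) + \frac{1}{n^2} \sum_i \psi^2(X_i, X_i).$$
The first term converges in probability to $\gamma_2^2$. But the second term may diverge or converge. An example of a setting in which the second term diverges is $X_i$ i.i.d. $\mathrm{Unif}(0, 1)$ with a kernel $\psi(x, y)$ whose diagonal values are $\psi(x, x) = e^{1/x}$.

References

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.

[1] We are using both the usual LLN and also the LLN for U-statistics.

STAT 210A: Theoretical Statistics, Fall 2006
Lecture 21, November 9
Lecturer: Martin Wainwright
Scribe: Benjamin I. P. Rubinstein

These scribe notes have only been mildly proofread.

Outline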
• ROC curves, risk and Bayesian formulations;
• Error rates, Stein's Lemma; and
• Uniformly Most Powerful tests.

Reading: Keener Chapters 14 and 15, and Bickel & Doksum Chapter 4.

21.1 Geometry of Simple Binary Hypothesis Testing

Consider a simple binary hypothesis test of the form $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$. Recall the following probabilities:

Level: $\alpha = E_0[\delta(X)]$
Power: $\beta = E_1[\delta(X)]$

The level is the probability of a type I error (rejecting $H_0$ under the null hypothesis) and is also known as the "false alarm rate". The power is the probability of correctly deciding to reject $H_0$ under the alternate hypothesis, and in the Engineering literature is referred to as the "hit rate".

Plotting achievable $(\alpha, \beta)$ pairs produces a Receiver Operating Characteristic (ROC) curve, as depicted in Fig. 21.1. Note that the points $(0, 0)$ and $(1, 1)$ in ROC space are always achievable by the trivial constant decision procedures $\delta(X) = 0$ and $\delta(X) = 1$ respectively. Furthermore, allowing randomization corresponds to taking the convex hull of a ROC curve, hence the general shape of the curve in Fig. 21.1.

[Figure 21.1: A generic ROC curve for a randomized family of decision procedures for simple binary hypothesis testing.]

Receiver Operating Characteristic curves were first used in the Signal Processing community; there $H_0$ and $H_1$ are used to represent "noise/no signal" versus "signal + noise" respectively, i.e. $Y = W$ versus $Y = \theta + W$ where $W \sim N(0, \sigma^2)$.

21.1.1 Binary Hypothesis Test Risk Sets

If we define the 0-1 loss as
$$L(\theta_i, a) = \begin{cases} 0 & \text{if } a = i \\ 1 & \text{otherwise,} \end{cases}$$
then it is easy to check that
$$R_0 = E_0[L(\theta_0, \delta(X))] = E_0[\delta(X)] = \alpha, \qquad R_1 = E_1[L(\theta_1, \delta(X))] = E_1[1 - \delta(X)] = 1 - \beta.$$
Ranging over decision procedures then produces risk sets as depicted in Fig. 21.2. Thus, the relationship to ROC space arises from the transformation $(R_0, R_1) \leftrightarrow (\alpha, 1 - R_1)$.

[Figure 21.2: A risk set $\{(R_0, R_1) : (R_0, R_1) \text{ achieved by some } \delta\}$. The lower boundary corresponds to Neyman-Pearson.]

Consider the natural Bayesian formulation for this setup, where $\lambda = (\lambda_0, 1 - \lambda_0)$ places a prior
on $(H_0, H_1)$. Then the Bayes risk for $1 > \lambda_0 > 0$ is
$$r(\delta) = \lambda_0 R_0 + (1 - \lambda_0) R_1 = \lambda_0 E_0[\delta(X)] + (1 - \lambda_0)\big(1 - E_1[\delta(X)]\big).$$
Then choosing
$$\delta_{Bayes} \in \arg\min_\delta r(\delta) \qquad (21.1)$$
is equivalent to minimizing $R_1 + \frac{\lambda_0}{1 - \lambda_0} R_0$, an optimization problem for minimizing $R_1$ (maximizing $\beta$) subject to a constraint on how large $R_0 = \alpha$ should be.

[Figure 21.3: Finding $\delta_{Bayes}$ by sliding lines of procedures of equal $r(\delta)$ down to the lower boundary of the risk set, where risk is minimized.]

Varying $\lambda_0$ corresponds to sweeping out the entire lower boundary of the risk set by sliding iso-performance lines that are perpendicular to $\lambda$ in $(R_0, R_1)$-space (see Fig. 21.3).

We will show in a homework exercise that optimal solutions to (21.1) take the form of likelihood ratio tests with threshold $t = F(\lambda_0)$ for some $F$. In particular, when minimizing a linear function on a convex set one can prove that an optimum is found on the boundary, and that you can get to this with an LRT.

21.2 Error Rates

Suppose that we have an i.i.d. sample $X_1, \dots, X_n \sim P_\theta$ where $\theta = \theta_0$ or $\theta = \theta_1$. Any reasonable test $\delta(X_1, \dots, X_n)$ should have $\alpha(\delta) \to 0$ under $P_0$ and $1 - \beta(\delta) \to 0$ under $P_1$ as $n \to \infty$ (in the homework we will see that Bayes rules have this property). That is, we would like the test to be asymptotically error free. A more refined question is: how fast is this convergence to 0? This question leads to the study of Large Deviations. Our intuition tells us that letting $\alpha(\delta) \to 0$ arbitrarily slowly should allow exponentially fast convergence of $1 - \beta$ to 0. The next result shows that this rate is governed by the Kullback-Leibler divergence between $P_0$ and $P_1$.
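The claim about exponential decay can be checked numerically in a simple case. The sketch below is an illustration only and is not taken from the notes: it assumes the Gaussian shift model $P_0 = N(0, 1)$ versus $P_1 = N(\mu, 1)$ with a fixed level $\alpha$, and the function names (`type_two_error`, `error_exponent`, `kl_gaussian_shift`) are hypothetical. In this model the Neyman-Pearson test rejects when $\sqrt{n}\,\bar{X}$ exceeds the $(1 - \alpha)$ normal quantile, so the type II error $1 - \beta$ has a closed form, and $-\frac{1}{n}\log(1 - \beta)$ can be compared with the divergence $\mu^2 / 2$.

```python
import math
from statistics import NormalDist


def type_two_error(n: int, mu: float = 1.0, alpha: float = 0.05) -> float:
    """Type II error of the level-alpha test of N(0,1) versus N(mu,1), mu > 0.

    The test rejects H0 when sqrt(n) * (sample mean) exceeds the (1 - alpha)
    standard normal quantile z, so the miss probability under H1 is
    Phi(z - sqrt(n) * mu).
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    x = z - math.sqrt(n) * mu
    # Phi(x) = erfc(-x / sqrt(2)) / 2; erfc keeps precision far in the tail,
    # where 1 + erf(...) would round to zero.
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def error_exponent(n: int, mu: float = 1.0, alpha: float = 0.05) -> float:
    """Empirical decay rate -(1/n) * log(type II error)."""
    return -math.log(type_two_error(n, mu, alpha)) / n


def kl_gaussian_shift(mu: float) -> float:
    """KL divergence between N(0,1) and N(mu,1), which equals mu^2 / 2."""
    return 0.5 * mu * mu


if __name__ == "__main__":
    # The printed rates should increase with n and stay below the KL value.
    for n in (10, 100, 1000):
        print(n, error_exponent(n), "KL:", kl_gaussian_shift(1.0))
```

The approach to the limit is slow (the gap shrinks roughly like $1/\sqrt{n}$ in this model), so the rates approach but do not reach $\mu^2 / 2$ for moderate $n$. Sample sizes much beyond a few thousand would underflow double precision in `type_two_error`; working with log-probabilities would be the natural fix.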