Class notes (22 pages) for STAT 551 (Intro to Linear Stat Mod), University of Pennsylvania, taught by L. Brown, Fall; uploaded to StudySoup by Orval Funk, September 28, 2015.
Shrinkage Estimators for Estimation and Prediction in Multiple Regression
(revised and corrected for Stat 551, 2005). Part I.

References: to be added. A good general reference for this is Berger, Statistical Decision Theory and Bayesian Analysis, 2nd edition; see especially Section 5.4. But note that there is an error in one of Berger's formulas, as noted below. The following references are cited there.

Stein's Lemma

The key ingredient for all that follows is the basic integration-by-parts equality, as used by Stein (1973, 1981).

Lemma: If X ~ N_p(θ, Σ), where Σ is a diagonal matrix with diagonal elements σ_ii², and ℓ: ℝ^p → ℝ is differentiable (or, more generally, is absolutely continuous), then

(1)  E[(X_i − θ_i) ℓ(X)] = E[σ_ii² (∂ℓ/∂x_i)(X)],

so long as the expectations exist. See Brown (1985) for the definition of absolute continuity and for slightly more precisely stated regularity conditions. In some of the following, including Theorem 1, a little care must be used when applying the usual regularity conditions; however, this can be done, and all the following applications of (1) are valid.

Corollary: If Σ = σ²I in the above lemma, and ℓ: ℝ^p → ℝ^p, then

(1′)  E[(X − θ)′ℓ(X)] = E[σ² ∇·ℓ(X)].

In (1′) the symbol ∇· denotes divergence, according to the usual definition

∇·ℓ(x) = Σ_i ∂ℓ_i(x)/∂x_i.

Proof of (1) and (1′): These are just integration by parts. Note for (1) that

∂/∂x_i e^{−(x_i − θ_i)²/(2σ_ii²)} = −((x_i − θ_i)/σ_ii²) e^{−(x_i − θ_i)²/(2σ_ii²)}.

The argument from here is a little simpler and cleaner if we assume that ℓ is bounded and everywhere differentiable; we will do so. Also, we'll consider the case i = 1 in (1), for notational convenience. Then integration by parts (in the x_1 coordinate) yields

(3)  E[σ_11² (∂ℓ/∂x_1)(X)] = ∫···∫ σ_11² (∂ℓ/∂x_1)(x) φ_Σ(x − θ) dx_1 ··· dx_p
      = ∫···∫ [0 + (x_1 − θ_1) ℓ(x) φ_Σ(x − θ)] dx_1 ··· dx_p
      = E[(X_1 − θ_1) ℓ(X)],

where φ_Σ denotes the N_p(0, Σ) density. The 0 in the second line of (3) is from the boundary term involved in the integration by parts. This proves (1). The expression (1′) results from summing (1) over i in the special case where σ_ii² = σ². ∎
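As a quick numerical sanity check (my addition, not part of the original notes), identity (1) can be verified by Monte Carlo in one dimension. The test function ℓ(x) = sin x and the constants below are arbitrary choices; the point is only that the two expectations in (1) agree.

```python
import numpy as np

# Monte Carlo check of Stein's identity (1) in one dimension:
# for X ~ N(theta, sigma^2) and a smooth bounded l,
#   E[(X - theta) l(X)] = sigma^2 E[l'(X)].
# Here l(x) = sin(x), so l'(x) = cos(x).
rng = np.random.default_rng(0)
theta, sigma = 1.5, 2.0
x = rng.normal(theta, sigma, size=2_000_000)

lhs = np.mean((x - theta) * np.sin(x))
rhs = sigma**2 * np.mean(np.cos(x))
```

With this many replicates the two averages agree to roughly two decimal places, as the identity predicts.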
Setting

Let

(4)  X ~ N_p(θ, Σ), with Σ positive definite and known,

let M be a known positive definite matrix, and suppose it is desired to estimate the vector θ under the general quadratic loss

(5)  L(θ, d) = (d − θ)′M(d − θ) = Tr[(d − θ)(d − θ)′M].

The risk is, of course, R(θ, δ) = E_θ L(θ, δ(X)).

The following special case of spherical symmetry is the most important setting:

(4′)  X ~ N_p(θ, σ²I), with σ² known, and

(5′)  L(θ, d) = (d − θ)′(d − θ)/σ².

Later we'll also consider the case (4′) with σ² unknown but estimable.

Fact: δ₀(x) = x is a minimax estimator for this problem. It has constant risk R(θ, δ₀) = Tr(ΣM).

James-Stein Estimator

Here is the basic application of (1′), involving Stein's unbiased estimate for the difference in risk; see the expression for Δ below. I first give the statement and proof for the special case of (4′) and (5′), and later for the more general case of (4) and (5). The James-Stein paper (1961) involved only the case of (4′) and (5′); their original proof was rather different from the one in Theorem 1. The following proof appeared in Stein (1973, 1981).

Theorem 1: Let X ~ N_p(θ, σ²I) and consider the problem of estimating θ under ordinary squared error loss (5′). Let

(6)  0 < C ≤ 2(p − 2)  (hence p ≥ 3),

and let

(7)  δ^{JS}(x) = (1 − Cσ²/‖x‖²) x.

Then δ^{JS} is minimax. In fact, R(θ, δ^{JS}) < R(θ, δ₀) for all θ, except when C = 2(p − 2); when C = 2(p − 2), R(θ, δ^{JS}) = R(θ, δ₀) for all θ.

Note: See Figure 1 for a plot of the risk function of this estimator.

Proof: Note that ∇·(x/‖x‖²) = (p − 2)/‖x‖². Using (1′) in the third equality below then yields

(8)  Δ ≡ R(θ, δ₀) − R(θ, δ^{JS})
      = σ⁻² E[ 2Cσ² (X − θ)′X/‖X‖² − C²σ⁴/‖X‖² ]
      = σ⁻² E[ (2C(p − 2)σ⁴ − C²σ⁴)/‖X‖² ]
      = E[ {(2C(p − 2) − C²) σ²/‖X‖²} ] ≥ 0,

with strict inequality unless C = 2(p − 2). ∎

Remarks:
• The expression in curly brackets on the right of (8) is called Stein's unbiased estimator of risk. (More technically, it's an unbiased estimator of the difference in risks.) This expression, and its manifold variants for other situations, has proved to be a valuable tool for computing risks, and for some related theoretical and methodological purposes.
• Examination of (8) shows that the best constant is given by C = p − 2. However, for the positive-part estimators to be defined later in (13), any constant satisfying p − 2 ≤ C ≤ 2(p − 2) is a viable candidate.
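The risk calculation in Theorem 1 is easy to see by simulation. The sketch below (my addition, not from the notes) uses p = 10, σ² = 1, C = p − 2, and θ = 0, where the theoretical risks are p = 10 for δ₀ and p − (p − 2) = 2 for δ^{JS}.

```python
import numpy as np

# Monte Carlo risks of delta_0(x) = x and the James-Stein estimator (7)
# with C = p - 2, sigma^2 = 1, p = 10, under the loss (5').
# theta = 0 is where the risk gap in Theorem 1 is largest.
rng = np.random.default_rng(1)
p, n_rep = 10, 40_000
theta = np.zeros(p)
x = rng.normal(theta, 1.0, size=(n_rep, p))

shrink = 1.0 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)
delta_js = shrink * x

risk_mle = np.mean(np.sum((x - theta)**2, axis=1))        # theory: p = 10
risk_js = np.mean(np.sum((delta_js - theta)**2, axis=1))  # theory: 2 at theta = 0
```

Moving θ away from the origin pushes the James-Stein risk back up toward p, in the shape shown in Figure 1.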
General covariance and quadratic loss

The original James-Stein setting is surely the most important single one for this type of phenomenon. However, there are many situations where the covariance and/or the loss have a more general structure. Results are obtainable for these situations, though they are not quite as pleasing. I proceed in two steps: first for diagonal matrices, and then for more general nonsingular situations.

Theorem 2: Let X ~ N_p(θ, I), and let D be a positive definite diagonal matrix with diagonal elements d_i. Consider the problem of estimating θ under the loss

L_D(θ, d) = (d − θ)′D(d − θ).

(This is the same as (5), except that I've substituted the symbol D for M, for convenience later on and in order to emphasize that D is diagonal here.) As before, let

(6)  0 < C ≤ 2(p − 2)  (hence p ≥ 3).

Then let δ be the estimator with coordinates

(9)  δ_i(x) = (1 − C min_j d_j/(d_i ‖x‖²)) x_i.

Then δ is minimax. In fact, R_D(θ, δ) < R_D(θ, δ₀) for all θ, except when C = 2(p − 2); when C = 2(p − 2) and all the d_i are equal, R_D(θ, δ) = R_D(θ, δ₀) for all θ.

Proof: (1) yields

Δ ≡ R_D(θ, δ₀) − R_D(θ, δ)
  = E[ Σ_i d_i { 2C (min_j d_j/d_i) (X_i − θ_i)X_i/‖X‖² − C² (min_j d_j/d_i)² X_i²/‖X‖⁴ } ]
  = E[ 2C (min_j d_j)(p − 2)/‖X‖² − C² (min_j d_j)² Σ_i X_i²/(d_i ‖X‖⁴) ]   (by (1))
  ≥ E[ 2C (min_j d_j)(p − 2)/‖X‖² − C² (min_j d_j)/‖X‖² ]   (since Σ_i X_i²/d_i ≤ ‖X‖²/min_j d_j)
  = (2C(p − 2) − C²)(min_j d_j) E[1/‖X‖²] ≥ 0.

This completes the proof. ∎

It is now possible to give the general result for the setting (4)-(5).

Corollary 2: Consider the setting (4)-(5). Let C satisfy (6), and now define

(10)  δ(x) = (I − (C min-eig(ΣM)/(x′Σ⁻¹x)) M⁻¹Σ⁻¹) x,

where min-eig(ΣM) denotes the smallest eigenvalue of ΣM. Then δ is minimax in this setting; in fact, R(θ, δ) < R(θ, δ₀) for all θ, except possibly when C = 2(p − 2).

Proof: The proof is easiest to complete by first noting the following.

Lemma 2 (Problem 5.52 in Berger): Let B be a known nonsingular p×p matrix. Consider the linearly transformed problem with X* = BX, θ* = Bθ, and let M* = (B⁻¹)′MB⁻¹ and L*(θ*, d*) = (d* − θ*)′M*(d* − θ*). For any procedure δ(x), let δ*(x*) = Bδ(B⁻¹x*). Then

R*(θ*, δ*) = R(θ, δ).

Proof of Lemma: Exercise A.

To complete the proof of the corollary, first rewrite (9) in matrix form as

(11)  δ(x) = (I − (C min_j d_j/(x′x)) D⁻¹) x.

To apply Theorem 2 to the setting of (4)-(5), let B be a matrix such that BΣB′ = I and (B⁻¹)′MB⁻¹ = D, diagonal. Then X* = BX ~ N_p(θ*, I) as in Theorem 2, and L* is the same as L_D. Hence the estimator

(12)  δ(x) = B⁻¹δ*(Bx)

will be minimax. Substituting (11) into (12) gives the desired result, after noting that B′B = Σ⁻¹ (so that (Bx)′(Bx) = x′Σ⁻¹x), that B⁻¹D⁻¹B = (ΣM)⁻¹ = M⁻¹Σ⁻¹, and that the eigenvalues of D = (B⁻¹)′MB⁻¹ are the same as those of ΣM (since ΣM = B⁻¹DB). ∎
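The loss identity underlying Lemma 2 is a one-line algebraic fact and is easy to confirm numerically. The sketch below (my addition; the matrices are arbitrary choices, not from the notes) checks that with M* = (B⁻¹)′MB⁻¹, the transformed loss L*(Bθ, Bd) equals L(θ, d) pointwise, which is what makes the risks equal.

```python
import numpy as np

# Pointwise check of the loss identity behind Lemma 2:
#   (Bd - Btheta)' M* (Bd - Btheta) = (d - theta)' M (d - theta)
# where M* = (B^-1)' M B^-1.
rng = np.random.default_rng(2)
p = 4
A = rng.normal(size=(p, p))
M = A @ A.T + p * np.eye(p)                  # a positive definite loss matrix
B = rng.normal(size=(p, p)) + 3 * np.eye(p)  # nonsingular (for this seed)
B_inv = np.linalg.inv(B)
M_star = B_inv.T @ M @ B_inv

theta = rng.normal(size=p)
d = rng.normal(size=p)

loss = (d - theta) @ M @ (d - theta)
loss_star = (B @ d - B @ theta) @ M_star @ (B @ d - B @ theta)
```

Since the losses agree at every (θ, d), the risk functions of δ and δ* = Bδ(B⁻¹·) agree as well.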
Remark: The form (9), or (10), is not the only possible general shape for a minimax estimator in the general setting. For example, one can also show that when 2 max_i d_i < Σ_i d_i there is a constant C, depending on the d_i, such that the spherically symmetric estimator δ(x) = (1 − C/‖x‖²) x dominates δ₀; see Brown (1973). Exercise B: Use the unbiased estimate of risk, and the fact that Σ_i d_i x_i² ≤ (max_i d_i)‖x‖², to verify the implications of this result.

Positive-part versions

The performance of these estimators can be improved by taking their positive-part versions. The best-known result is for the James-Stein estimator, so consider the spherically symmetric case, as in (4′)-(5′).

Theorem 3: Define the positive-part estimator

(13)  δ^{JS+}(x) = (1 − Cσ²/‖x‖²)₊ x,

where (t)₊ = max(t, 0). Then R(θ, δ^{JS+}) < R(θ, δ^{JS}) for all θ.

Proof: This result is heuristically obvious. However, there doesn't seem to be a proof of it that doesn't involve (or hide) some annoying manipulations. In particular, there doesn't seem to be a proof that builds on Stein's unbiased estimate of risk in the way that Theorem 1 does. Here are the main steps of an argument to prove the theorem.

Begin with the familiar fact that

(14)  β > 0 and g > 0  ⇒  e^{βg} − e^{−βg} > 0.

From this one obtains a corresponding inequality, (15), between the two exponentially weighted quadratic terms that appear in (17) below.

Next observe that the loss function and the estimators are spherically symmetric. Hence it suffices to prove the theorem in the special case where θ = (θ₁, 0, …, 0)′ with θ₁ = ‖θ‖ ≥ 0, and we assume this form of θ in the following calculations. Now write

(16)  E[‖δ^{JS} − θ‖²] − E[‖δ^{JS+} − θ‖²] ≥ E[(δ^{JS}_1 − θ₁)² − (δ^{JS+}_1 − θ₁)²],

since for j = 2, …, p either δ^{JS+}_j = δ^{JS}_j or δ^{JS+}_j = 0 = θ_j, and in the latter case the j-th coordinate contributes a nonnegative amount to the left side. The right-hand side of (16) can then be written, by conditioning on the shrinkage event {‖X‖² ≤ Cσ²} (on which δ^{JS+} differs from δ^{JS}) and pairing the contributions from X₁ = ±|x₁| (which enter with the respective weights e^{θ₁x₁/σ²} and e^{−θ₁x₁/σ²}), as

(17)  an expectation of the form E[C*·(…)] ≥ 0,

where C* > 0 and the remaining factor do not depend on the sign of X₁. C* has a somewhat messy exact form, but using (14) and (15) it can fairly easily be shown to be positive. The desired result follows. ∎

With some additional care, a similar result can be shown to hold for the setting of Theorem 2. Define δ⁺ to have coordinates

(18)  δ⁺_i(x) = (1 − C min_j d_j/(d_i ‖x‖²))₊ x_i.

Corollary 3: R_D(θ, δ⁺) < R_D(θ, δ). Hence δ⁺ is minimax under the condition (6).

Proof: Omitted.
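Theorem 3 is also easy to see by simulation. The sketch below (my addition, not from the notes) compares the risks of δ^{JS} and δ^{JS+} at a θ near the origin, where the improvement from the positive part is largest; p = 5, σ² = 1, and C = p − 2 are arbitrary choices within the allowed range.

```python
import numpy as np

# Monte Carlo comparison of the James-Stein estimator (7) and its
# positive-part version (13), with C = p - 2 and sigma^2 = 1.
# Near theta = 0 the positive part kicks in often and the gap is visible.
rng = np.random.default_rng(3)
p, n_rep = 5, 200_000
theta = np.full(p, 0.3)
x = rng.normal(theta, 1.0, size=(n_rep, p))

factor = 1.0 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)
delta_js = factor * x
delta_jsp = np.clip(factor, 0.0, None) * x   # (1 - C/||x||^2)_+ x

risk_js = np.mean(np.sum((delta_js - theta)**2, axis=1))
risk_jsp = np.mean(np.sum((delta_jsp - theta)**2, axis=1))
```

The same draws are used for both estimators, so the comparison is paired and the estimated risk ordering is quite stable.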
Remarks:
• Corollary 3, along with Lemma 2 and Corollary 2, makes clear how to define a positive-part version of the estimator given in (10). Unfortunately, I do not have a convenient matrix expression for this that does not explicitly involve the matrix B of Lemma 2 above. Maybe there is such an expression and I've just missed it.
• An alternative minimax positive-part estimator is given by

(19)  δ′(x) = (I − min{1, C min-eig(ΣM)/(x′Σ⁻¹x)} M⁻¹Σ⁻¹) x,

and minimaxity of this can be proved in a manner similar to the above arguments. This would start by using (1) to prove minimaxity of

δ′_D(x) = (I − min{1, C min_j d_j/(x′x)} D⁻¹) x

in the setting of Theorem 2. I think (but haven't written out a proof) that the positive-part version of δ of (10) is better than δ′.
• Berger gives the estimator (19) as (5.32) in his book, except that the necessary factor min-eig(ΣM) is left out of his (5.32). (Berger's Q is my M, and take μ = 0 in his (5.32), w.l.o.g.)

Plot of the Risk

The following plot is for the case of spherical symmetry and p = 10. The qualitative relationship of the risks in the spherically symmetric case is similar in all dimensions we have investigated, so this plot gives the correct heuristic picture for all those cases.

The plot shows the risks of several estimators as a function of ‖θ‖, under the assumptions (4′) and (5′) of spherical symmetry. Take σ² = 1 without loss of generality. The risk of the usual estimator δ₀(x) = x is 10 for all ‖θ‖. The other risk functions shown in the plot are:

Dotted curve: δ^{JS}.
Dashed curve: δ^{JS+}.
Solid curve: the formal Bayes procedure for the harmonic prior, as discussed in the handout on Bayes and empirical Bayes procedures.

Note that all curves are asymptotic to p = 10 from below as ‖θ‖ → ∞. Note also that the dashed curve is always below the dotted curve, as Theorem 3 says it must be. Also, the solid curve appears to always be below the dotted curve; hence it appears that the harmonic-prior Bayes procedure is always better in risk than δ^{JS}. Juntian expects to be able to prove this is so for all p ≥ 3.

[Figure 1: Risks of the three estimators when p = 10; risk (vertical axis, asymptote at 10) plotted against ‖θ‖. See the above discussion for more details.]

Bayes and Empirical-Bayes Estimation of Multivariate Normal Means
Stat 551, L. Brown. 3/23/05; rev. 3/29/05.
Background

(a) Block matrix inversion. Consider a partitioned positive definite symmetric matrix

(1)  Ω = [ Ω₁₁  Ω₁₂ ; Ω₂₁  Ω₂₂ ],   Ω⁻¹ = [ Ω¹¹  Ω¹² ; Ω²¹  Ω²² ],

where

(2)  Ω¹¹ = (Ω₁₁ − Ω₁₂Ω₂₂⁻¹Ω₂₁)⁻¹  and  Ω¹² = −Ω¹¹Ω₁₂Ω₂₂⁻¹.

The other entries of Ω⁻¹ follow from this by symmetry.

(b) Conditional normal distributions. Let Z = (Z₁′, Z₂′)′ with Z ~ N(0, Ω), naturally partitioned as in (1) above, where Ω is assumed to be invertible and Z₁ is q × 1. Then

(3)  Z₁ | Z₂ = z₂  ~  N_q( Ω₁₂Ω₂₂⁻¹ z₂ ,  Ω₁₁ − Ω₁₂Ω₂₂⁻¹Ω₂₁ ).

If Z ~ N(μ, Ω), then Z₁|Z₂ is again normal, with the same covariance as in (3), and the corresponding formula for its distribution can be derived from (3) after noting that Z − μ ~ N(0, Ω).

Proof of (3): The density of Z can be written as

(4)  f_Z(z) = C₁(Ω) exp(−½ z′Ω⁻¹z)
      = C₂(Ω, z₂) exp(−½ (z₁′Ω¹¹z₁ + 2 z₁′Ω¹²z₂))
      = C₃(Ω, z₂) exp(−½ (z₁ − Mz₂)′ Ω¹¹ (z₁ − Mz₂)).

Comparison of the second and third lines of (4), and substitution from (2), yields

(5)  M = −(Ω¹¹)⁻¹Ω¹² = Ω₁₂Ω₂₂⁻¹.

The desired result then follows, since (4) shows that the conditional distribution is normal with mean Mz₂ and covariance (Ω¹¹)⁻¹ = Ω₁₁ − Ω₁₂Ω₂₂⁻¹Ω₂₁. ∎

Posterior distributions for normal priors. Let Y ~ N_p(η, Σ), and suppose that η has the normal prior distribution η ~ N_p(0, Γ). Then the posterior distribution of η given Y = y is

(6)  η | Y = y  ~  N_p( Γ(Γ + Σ)⁻¹ y ,  Γ − Γ(Γ + Σ)⁻¹Γ ).

Proof of (6): Let Z = (η′, y′)′. Note that y = η + W, where W ~ N(0, Σ) independently of η. Hence

(7)  Z ~ N( 0 , [ Γ  Γ ; Γ  Γ + Σ ] ).

The desired result now follows from (3). ∎

Note that a similar formula can be derived if η ~ N_p(μ, Γ), by using the fact that Y − μ = (η − μ) + W, with η − μ ~ N_p(0, Γ).

Bayes estimates for normal priors. Consider the above situation and suppose it is desired to estimate η under a quadratic loss function of the form

(8)  L(η, d) = (d − η)′M(d − η) = Tr[(d − η)(d − η)′M],

where M is positive definite. Then the Bayes estimator is

(9)  E(η | Y) = Γ(Γ + Σ)⁻¹ Y.

This follows immediately from (6), and the fact that the posterior mean is the Bayes estimator for a quadratic loss function like that in (8). Note, for later use, that the Bayes estimator does not depend on the value of M in (8).
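Formula (6) can be cross-checked against the equivalent "precision" form of the posterior, covariance (Γ⁻¹ + Σ⁻¹)⁻¹ and mean (Γ⁻¹ + Σ⁻¹)⁻¹Σ⁻¹y, which many texts state instead. The sketch below (my addition; the matrices are arbitrary positive definite choices) confirms numerically that the two forms agree.

```python
import numpy as np

# Compare the posterior mean/covariance of (6),
#   Gamma (Gamma + Sigma)^-1 y  and  Gamma - Gamma (Gamma + Sigma)^-1 Gamma,
# with the equivalent precision form
#   (Gamma^-1 + Sigma^-1)^-1 Sigma^-1 y  and  (Gamma^-1 + Sigma^-1)^-1.
rng = np.random.default_rng(4)
p = 3
A1, A2 = rng.normal(size=(p, p)), rng.normal(size=(p, p))
Gamma = A1 @ A1.T + np.eye(p)   # prior covariance of eta
Sigma = A2 @ A2.T + np.eye(p)   # covariance of Y given eta
y = rng.normal(size=p)

post_mean = Gamma @ np.linalg.solve(Gamma + Sigma, y)
post_cov = Gamma - Gamma @ np.linalg.solve(Gamma + Sigma, Gamma)

alt_cov = np.linalg.inv(np.linalg.inv(Gamma) + np.linalg.inv(Sigma))
alt_mean = alt_cov @ np.linalg.solve(Sigma, y)
```

The agreement is an instance of the block-inversion identities in (2): both forms describe the same conditional normal distribution (3).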
Empirical Bayes estimation (parametric empirical Bayes). Consider only the spherically symmetric case, where Σ = σ²I_p and Γ = τ²I_p. Assume that σ² is known, but that τ² is not known. Since σ² is known, we will set σ² = 1 for convenience in the following. Motivation for this setting will be discussed in lecture.

Although τ² is not known, it can be estimated from the data. Note that, marginally, Y = η + W ~ N_p(0, (1 + τ²)I). Hence

(10)  E‖Y‖² = p(1 + τ²).

It follows that the best unbiased estimator of τ² is

(11)  τ̂² = ‖Y‖²/p − 1.

From the marginal distribution of Y it follows that ‖Y‖² is a sufficient statistic for this family of distributions, and has distribution ‖Y‖²/(1 + τ²) ~ χ²_p. Hence the MLE is

(12)  τ̂²_MLE = (‖Y‖²/p − 1)₊.

If we plug τ̂² of (11) into the Bayes estimation formula (9), which here reads E(η|Y) = (1 − 1/(1 + τ²)) Y, we get

(13)  δ^{EB}(Y) = (1 − 1/(1 + τ̂²)) Y = (1 − p/‖Y‖²) Y.

Plugging in (12) instead yields the slightly more appealing formula

(14)  δ^{EB,MLE}(Y) = (1 − p/‖Y‖²)₊ Y.

James-Stein shrinkage estimator. Empirical Bayes is a general concept that can be dressed in asymptotic justifications. However, there is nothing axiomatic about the details of the plug-in step. In particular, with equal justification, the preceding can be done slightly differently, in order to lead to a slightly different formula.

One can view the normal distribution of η as being parameterized by its precision, ψ = 1/(1 + τ²). Since this is a 1-1 transformation, this provides an equally good representation of the family of distributions of η, and it also has a long pedigree as a useful representation of the normal family. Since ‖Y‖²/(1 + τ²) ~ χ²_p,

(15)  E[(p − 2)/‖Y‖²] = 1/(1 + τ²) = ψ.

Hence a UMVU estimator of ψ exists whenever p ≥ 3, and has the formula

(16)  ψ̂ = (p − 2)/‖Y‖².

Plugging this back into the Bayes estimation formula for known ψ, namely E(η|Y) = (1 − ψ)Y, now yields the empirical Bayes estimate

(17)  δ^{JS}(Y) = (1 − (p − 2)/‖Y‖²) Y.

This is the well-known James-Stein shrinkage estimator. The preceding derivation was not the original motivation; the original motivation for this estimator was rather different, and we will study it in a little while. It is possible to modify (16) in an ad-hoc fashion, and hence arrive at the slightly more pleasing and better alternative estimator, called the positive-part James-Stein estimator:

(18)  δ^{JS+}(Y) = (1 − (p − 2)/‖Y‖²)₊ Y.
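The chain of plug-in estimates (11)-(14) and (16)-(18) is short enough to compute directly. The sketch below (my addition; the values of p and τ² are arbitrary) evaluates all of them for a single simulated Y, and makes visible that the James-Stein estimate (17) shrinks less aggressively than (13), since it uses p − 2 in place of p.

```python
import numpy as np

# Compute the empirical-Bayes estimates (13), (14) and the James-Stein
# estimates (17), (18) for one observed vector Y, with sigma^2 = 1.
rng = np.random.default_rng(5)
p, tau2 = 10, 4.0
eta = rng.normal(0.0, np.sqrt(tau2), size=p)
y = eta + rng.normal(size=p)

s = np.sum(y**2)                       # ||Y||^2, the sufficient statistic
tau2_hat = s / p - 1.0                 # (11), unbiased
tau2_mle = max(tau2_hat, 0.0)          # (12), MLE

delta_eb = (1.0 - p / s) * y                       # (13)
delta_eb_mle = max(1.0 - p / s, 0.0) * y           # (14)
delta_js = (1.0 - (p - 2) / s) * y                 # (17)
delta_jsp = max(1.0 - (p - 2) / s, 0.0) * y        # (18)
```

Note that (13) and the plug-in of (11) into (9) are algebraically identical, since 1 + τ̂² = ‖Y‖²/p.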
Supplemental Material: Calculation of Bayes Procedures

Here is a formula that can often be used to facilitate calculation of Bayes procedures. Consider the situation above, where Y ~ N_p(η, σ²I). We will assume σ² is known, but will not assume σ² = 1, so that the following can easily be used with different values of σ².

Let φ_σ(t) = σ^{−p}φ(t/σ) denote the normal density with variance σ² (per coordinate), and let G be any prior measure on ℝ^p. If G is a probability distribution, then the posterior density of η given Y = y is

(19)  G(dη | y) = φ_σ(η − y) G(dη) / ∫ φ_σ(η − y) G(dη).

This expression makes formal sense so long as the denominator is finite, and in that case it satisfies ∫ G(dη|y) = 1. Hence, when G is a nonnegative measure, we define (19) to be the formal posterior distribution corresponding to G.

For a prior distribution, the Bayes estimator under squared error loss (8) is

(20)  δ_G(y) = E(η | y) = ∫ t G(dt | y) = ∫ t φ_σ(t − y) G(dt) / ∫ φ_σ(t − y) G(dt).

Hence we define (20) to be the formal Bayes estimator corresponding to G, so long as the denominator exists. Let

(21)  g(y) = ∫ φ_σ(t − y) G(dt)

denote the formal marginal density corresponding to G. This can also be interpreted as the convolution of the normal density φ_σ with G.

Theorem: Assume g(y) is finite for all y ∈ ℝ^p. Then the formal Bayes estimator satisfies

(22)  δ_G(y) = y + σ² ∇g(y)/g(y) = y + σ² ∇ log g(y).

Proof: The expression (21) for g(y) can be differentiated under the integral sign (see Lehmann, TSH; or Bickel and Doksum, Mathematical Statistics; or Brown, Fundamentals of Statistical Exponential Families). This yields

∇g(y) = ∫ ((t − y)/σ²) φ_σ(t − y) G(dt),

since ∇_y e^{−‖t−y‖²/(2σ²)} = ((t − y)/σ²) e^{−‖t−y‖²/(2σ²)}. Substitution into (20) then yields the desired result. ∎

Corollary: If Y ~ N_p(η, Σ), as in (6), let φ_Σ denote the corresponding normal density. Define g as in (21), but with φ_Σ in place of φ_σ. Assume g is finite. Then the formal Bayes estimator is

(23)  δ_G(y) = y + Σ ∇g(y)/g(y) = y + Σ ∇ log g(y).

Proof: Exercise.

As a simple example, consider the situation discussed in (6). As noted there, it is easy to derive that the marginal distribution is N(0, Γ + Σ). Hence (23) yields

(24)  δ_G(y) = y + Σ ∇ log g(y) = y − Σ(Γ + Σ)⁻¹ y = Γ(Γ + Σ)⁻¹ y.

This is the same formula as in (9).
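Formula (22) is easy to test numerically in the conjugate one-dimensional case, where the marginal g is N(0, τ² + σ²) and the Bayes estimate (9) reduces to (τ²/(τ² + σ²))y. The sketch below (my addition; the constants are arbitrary) computes ∇ log g by a central finite difference and compares.

```python
import numpy as np

# Numerical check of (22): delta_G(y) = y + sigma^2 * d/dy log g(y),
# in one dimension with prior G = N(0, tau2), so that the marginal is
# g = N(0, tau2 + sigma2) and the Bayes rule is tau2/(tau2+sigma2) * y.
sigma2, tau2 = 2.0, 3.0
y = 1.7

def log_g(t):
    # log of the N(0, tau2 + sigma2) marginal density at t
    v = tau2 + sigma2
    return -0.5 * np.log(2 * np.pi * v) - t**2 / (2 * v)

h = 1e-5
grad = (log_g(y + h) - log_g(y - h)) / (2 * h)  # numerical d/dy log g(y)
delta_tweedie = y + sigma2 * grad               # formula (22)
delta_bayes = tau2 / (tau2 + sigma2) * y        # formula (9), scalar case
```

The same recipe, with g estimated from data instead of known, is the engine behind many empirical Bayes methods.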
The Harmonic Prior is an interesting formal prior. This is the prior measure with density, relative to Lebesgue measure, given by

(25)  g_H(η) = 1/‖η‖^{p−2}.

In the case where p is even, k = (p − 2)/2 is an integer. Let T_k(x) = Σ_{j=0}^{k} x^j/j! denote the k-th partial sum of the Taylor expansion of e^x, and let Σ = I. (The following formulas are easily generalized to the case where Σ = σ²I.) Then, for p even with p ≥ 4, Juntian Xu has shown that

(26)  g(y) ∝ ‖y‖^{−(p−2)} (1 − e^{−‖y‖²/2} T_{k−1}(‖y‖²/2)).

It then directly follows that

(27)  δ_G(y) = ( 1 − ((p − 2)/‖y‖²) · (1 − e^{−‖y‖²/2} T_k(‖y‖²/2)) / (1 − e^{−‖y‖²/2} T_{k−1}(‖y‖²/2)) ) y.

Since e^{−‖y‖²/2} T_k(‖y‖²/2) → 0 as ‖y‖ → ∞, it follows that δ_G here closely resembles the positive-part James-Stein estimator when ‖y‖ is large, or even moderate.

Here's a plot that illustrates the comparison of the two estimators when p = 8. In this plot, write the estimators as δ(y) = m(‖y‖²) y; the plot shows m(‖y‖²) for the two estimators.

[Figure: the multipliers m(‖y‖²) of the James-Stein estimator and the Harmonic Prior estimator, plotted against ‖y‖² over 0 ≤ ‖y‖² ≤ 20, for p = 8.]

There are analogous formulas for the case of odd p, but they're not as easy to derive, nor are they quite as convenient.

Examples of Linear Models

1. Ordinary Linear Regression (a.k.a. Simple Linear Regression)

Study of the number of car trips to office buildings, as a function of the office space of the building. Suburban and semi-urban mid-Atlantic areas.

Goal: Learn how to predict Y for given values of x.

Statistical Model includes:
(a) E(Y_i) = β₀ + β₁x_i, or E(Y) = Xβ, where β = (β₀, β₁)′;
(b) the Y_i are independent;
(c) Var(Y_i) is constant.

(a)-(c) are often summarized as

Y_i = β₀ + β₁x_i + σε_i,  ε_i independent with Var(ε_i) = 1.

The model often also includes more precisely specified assumptions about the distribution of the Y_i, most usually:
(d) ε_i ~ i.i.d. N(0, 1).

The vector-matrix form for part (a) of this model is E(Y) = Xβ. In writing this general vector-matrix form for linear models, we customarily write β as a vector; thus the usual way of writing the coordinates of β is β = (β₁, …, β_p)′. There is one embarrassing feature to this representation: in the model (a) we have β = (β₀, β₁)′. Thus we have created a notational monster in which "β₁" = β₀ and "β₂" = β₁. This doesn't seem to bother the authors of our text (R&D), nor most other authors.
It won't bother you either, if you don't let it do so.

P.S. A better way to proceed would have been to write the model with a different letter, e.g.

Y_i = α₀ + α₁x_i + σε_i,  ε_i independent with Var(ε_i) = 1.

Then it would have been true that α = (α₀, α₁)′, which is OK.

Here are Y and X for the usual vector and matrix form. Y_i = no. of AM car trips/day; X_i1 = 1; X_i2 = occupied sq ft (in 1000s):

Y (AM trips)   X_i1   X_i2
   99.00        1      60.00
  142.20        1      60.00
  176.80        1      65.00
  151.98        1      77.94
  148.13        1      78.79
   67.03        1      79.80
  152.06        1      80.03
  145.89        1      97.26
  172.05        1      93.00
  225.35        1     102.90
  159.76        1     105.80
  114.49        1     107.00
  143.88        1      59.95
  166.06        1      59.95
  112.80        1     112.80
  159.14        1     109.75
  161.34        1     109.75
  150.36        1     109.75
  348.00        1     120.00
  172.92        1     124.40
  253.84        1     128.20
  211.09        1     130.30
  105.67        1     103.60
  171.00        1     150.00
  355.20        1     160.00
  248.78        1     162.60
  252.03        1     162.60
  224.40        1     165.00
  227.70        1     165.00
  200.10        1     174.00
  331.66        1     161.00
  133.46        1     175.60
  164.00        1     200.00
  362.45        1     198.06
  235.69        1     102.03
  352.24        1     200.13
  387.90        1     255.20
  320.00        1     256.00
  400.86        1     262.00
  435.10        1     263.70
  401.10        1     219.18
  449.55        1     333.00
  243.76        1     136.18
  532.00        1     350.00
  318.78        1     414.00
  606.51        1     427.12
  460.01        1     479.18
  618.24        1     471.94
  951.83        1     509.00
 1119.96        1     549.00
 1131.29        1     440.19
 1419.48        1     586.56

Standard Regression output:

Bivariate Fit of AM Trips By Occup Sq Ft/1000
[Figure: scatterplot of AM Trips (0 to 1500) against Occup Sq Ft/1000 (0 to 600), with fitted line.]

Linear Fit: AM Trips = 7.939 + 1.6967 · (Occup Sq Ft/1000)

Summary of Fit
RSquare                      0.800
RSquare Adj                  0.797
Root Mean Square Error       122.73
Mean of Response             286.67
Observations (or Sum Wgts)   61

Analysis of Variance
Source    DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model      1      3556946.5       3556946     236.1615    <.0001
Error     59       888628.7         15062
C Total   60      4445575.1

Parameter Estimates
Term               Estimate   Std Error   t Ratio   Prob>|t|
Intercept          7.939      24.79        0.32     0.7499
Occup Sq Ft/1000   1.69679     0.1109     15.37     <.0001

Here's an alternate, different analysis. (We'll later discuss this one, and others.)

Bivariate Fit of AM Trips By Occup Sq Ft/1000, Weight = "weight"
[Figure: scatterplot of AM Trips (0 to 1500) against Occup Sq Ft/1000 (0 to 600), with weighted fitted line.]

Linear Fit: AM Trips = 25.948 + 1.4789 · (Occup Sq Ft/1000)

Summary of Fit
RSquare                      0.8201
RSquare Adj                  0.8171
Root Mean Square Error       0.5617
Mean of Response             6.8091
Observations (or Sum Wgts)   0.027342

Analysis of Variance
Source    DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model      1      84.85679        84.8568     268.9703    <.0001
Error     59      18.61377         0.3155
C Total   60     103.47056

Parameter Estimates
Term               Estimate    Std Error   t Ratio   Prob>|t|
Intercept          25.9483     4.259315     6.09     <.0001
Occup Sq Ft/1000    1.4789008  0.090175    16.40     <.0001
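The least-squares computation behind output like the above can be sketched in a few lines. The data below are made-up illustrative numbers (not the car-trips data); the point is that the vector-matrix solution β̂ = (X′X)⁻¹X′Y reproduces the familiar scalar slope and intercept formulas for simple regression.

```python
import numpy as np

# Least squares in vector-matrix form, E(Y) = X beta, for a simple
# regression with X_i1 = 1 and X_i2 = x.  Illustrative data only.
x2 = np.array([60.0, 65.0, 80.0, 103.0, 150.0, 260.0, 420.0])
y = np.array([99.0, 177.0, 152.0, 226.0, 171.0, 401.0, 606.0])

X = np.column_stack([np.ones_like(x2), x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # (intercept, slope)

# The same fit from the scalar textbook formulas:
slope = np.sum((x2 - x2.mean()) * (y - y.mean())) / np.sum((x2 - x2.mean())**2)
intercept = y.mean() - slope * x2.mean()
```

The weighted analysis shown above corresponds to replacing `lstsq` on (X, y) with a fit of (W^{1/2}X, W^{1/2}y) for a diagonal weight matrix W.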
2. Multiple Linear Regression

The data involve scores on a statewide Texas student proficiency exam in math and English. The dependent variables in the data set are the percent, by school, passing the math test, the English test, and both tests. (Not all students take both tests.) The possible predictor variables (covariates) measure either the demographic character of the school district (these are marked with a *) or features of the school's organization and financing:

avg_teacher_salary, avg_class_size, avg_teacher_experience, pct_ltd_english*, pct_econ_disadv* (in the school), grade3_enrollment (in the school), pct_special_ed, pct_gifted, pct_black*, pct_hispanic*, per_pupil_expend.

After preliminary analysis (which we'll examine later), I decided that most of the differences in the math score could be explained by 4 of the independent variables. This yields a model of the type

Y_i = β₀ + β₁x_{1i} + β₂x_{2i} + β₃x_{3i} + β₄x_{4i} + σε_i,

with the X's as below.

Summary of Fit
RSquare                      0.2267
Root Mean Square Error       1.2106
Mean of Response             82.45
Observations (or Sum Wgts)   3421

Analysis of Variance
Source    DF     Sum of Squares   Mean Square   F Ratio   Prob > F
Model        4       1467.54        366.885      250.34    <.0001
Error     3416       5006.21          1.466
C Total   3420       6473.75

Parameter Estimates
Term                     Estimate    Std Error   t Ratio   Prob>|t|
Intercept                81.395      2.4681       32.98    <.0001
avg_teacher_salary        0.000253   0.000074      3.41    0.0006
avg_teacher_experience    0.3555     0.08062       4.41    <.0001
pct_econ_disadv          −0.2014     0.00729     −27.63    <.0001
pct_black                −0.1301     0.01417      −9.18    <.0001

NOTE that two of these explanatory variables are demographic, and two relate to the character of the school.

3. Multi-way ANOVA models

In pure models of this type, the predictor variables are all categorical. The data used for last term's term project provide a good example.
This comes from the U.S. Bank Call Center. The Y variable is the log service time (in seconds) for an individual call handled by an agent at the service center. The covariates are the day of the week (MON through FRI) and the agent's name (recorded as an ID number). The data are for a very small sample of the calls handled in one particular week (7/15/02 to 7/19/02) by those agents who worked every day that week. Calls lasting less than 11 seconds were omitted from the data.

One-way model. If one is interested only in the day-of-week effect, you could use a basic model of the form

Y_i = μ + day_i + ε_i,

where here Y_i represents the Y value of the i-th observation. This is more conventionally written by letting Y_ij denote the Y value of the j-th observation on day i. This yields a model of the form

E(Y_ij) = μ + α_i,  i = 1, …, I;  j = 1, …, J_i.

The design matrix for such a model will be discussed in class. Here is some of the output from such an analysis.

Oneway Analysis of Log(SerTime) By Day of Week
[Figure: plot of Log(SerTime) by Day of Week.]
Excluded Rows: 51

Oneway Anova
Summary of Fit
RSquare                      0.00730
Root Mean Square Error       0.3918
Mean of Response             2.072
Observations (or Sum Wgts)   1271

Analysis of Variance
Source        DF    Sum of Squares   Mean Square   F Ratio   Prob > F
Day of Week      4       1.4281       0.357033      2.3259    0.0545
Error         1266     194.33589      0.153504
C Total       1270     195.76402

Means for Oneway Anova
Level   Number   Mean      Std Error   Lower 95%   Upper 95%
FRI     246      2.08801   0.02498     2.0390      2.1370
MON     278      2.10743   0.02350     2.0613      2.1535
THU     268      2.03210   0.02393     1.9852      2.0791
TUE     253      2.03307   0.02463     1.9847      2.0814
WED     226      2.1022    0.02606     2.0511      2.1533

Std Error uses a pooled estimate of error variance.

Two- and higher-way analyses. For examining the joint effect of day and server, one could use an additive model of the form

E(Y_ijk) = μ + α_i + β_j,  i = 1, …, I;  j = 1, …, J;  k = 1, …, K_ij.

In principle, some values of K_ij could be 0, with the obvious interpretation. OR, one could use an additive model with interactions, of the form

E(Y_ijk) = μ + α_i + β_j + γ_ij,  i = 1, …, I;  j = 1, …, J;  k = 1, …, K_ij.

Later we'll discuss some alternative ways of modeling data such as this, involving random-effects models.
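The one-way ANOVA table above rests on the between/within decomposition of the total sum of squares. The sketch below (my addition; the tiny unbalanced data set is made up) computes that decomposition and the F ratio from group means, exactly as in the "Analysis of Variance" table.

```python
import numpy as np

# One-way ANOVA, E(Y_ij) = mu + alpha_i, on a tiny made-up unbalanced
# data set with three groups ("days").  Between + within = total.
groups = [np.array([2.1, 1.9, 2.3, 2.0]),
          np.array([2.6, 2.4, 2.7]),
          np.array([1.8, 2.0, 1.9, 2.1, 2.2])]

all_y = np.concatenate(groups)
grand = all_y.mean()

ss_between = sum(len(g) * (g.mean() - grand)**2 for g in groups)
ss_within = sum(np.sum((g - g.mean())**2) for g in groups)
ss_total = np.sum((all_y - grand)**2)

df_between = len(groups) - 1               # I - 1
df_within = len(all_y) - len(groups)       # N - I
F = (ss_between / df_between) / (ss_within / df_within)
```

Here ss_within/df_within is the pooled error-variance estimate that the "Std Error" column of the means table uses.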
Here is some output from the model with interaction.

Response: Log(SerTime)
Summary of Fit
RSquare                      0.0875
Root Mean Square Error       0.387
Mean of Response             2.072
Observations (or Sum Wgts)   1271

Analysis of Variance
Source     DF    Sum of Squares   Mean Square   F Ratio   Prob > F
Model        79      17.12885      0.216821      1.4456    0.0079
Error      1191     178.63517      0.149988
C Total    1270     195.76402

Effect Tests
Source                  Nparm   DF   Sum of Squares   F Ratio   Prob > F
Day of Week                 4    4       1.035081      1.7253    0.1420
Server ID                  15   15       3.585552      1.5937    0.0686
Server ID*Day of Week      60   60      12.443047      1.3827    0.0305