ADAPTIVE LASSO FOR SPARSE HIGH-DIMENSIONAL REGRESSION MODELS

Jian Huang (1), Shuangge Ma (2), and Cun-Hui Zhang (3)
(1) University of Iowa, (2) Yale University, (3) Rutgers University

November 2006
The University of Iowa, Department of Statistics and Actuarial Science, Technical Report No. 374 (Revision 1, June 2007)

Summary. We study the asymptotic properties of the adaptive Lasso estimators in sparse, high-dimensional linear regression models when the number of covariates may increase with the sample size. We consider variable selection using the adaptive Lasso, where the L1 norms in the penalty are re-weighted by data-dependent weights. We show that, if a reasonable initial estimator is available, then under appropriate conditions the adaptive Lasso correctly selects covariates with nonzero coefficients with probability converging to one, and that the estimators of the nonzero coefficients have the same asymptotic distribution they would have if the zero coefficients were known in advance. Thus the adaptive Lasso has an oracle property in the sense of Fan and Li (2001) and Fan and Peng (2004). In addition, under a partial orthogonality condition in which the covariates with zero coefficients are weakly correlated with the covariates with nonzero coefficients, marginal regression can be used to obtain the initial estimator. With this initial estimator, the adaptive Lasso has the oracle property even when the number of covariates is much larger than the sample size.

Key words and phrases: Penalized regression, high-dimensional data, variable selection, asymptotic normality, oracle property, zero-consistency.

Short title: Adaptive Lasso. AMS 2000 subject classification: primary 62J05, 62J07; secondary 62E20, 60F05.

1. Introduction

Consider the linear regression model

    y = Xβ + ε,  ε ∈ R^n,    (1)

where X is an n × p_n design matrix, β is a p_n × 1 vector of unknown coefficients, and ε is a vector of i.i.d. random variables with mean zero and finite variance σ². We note that p_n, the length of β, may depend on the sample size n. We assume that the response and the covariates are centered, so the intercept term is zero. We are interested in estimating β when p_n is large, or even larger than n, and the regression parameter is sparse in the sense that many of its elements are zero. Our motivation comes from studies that try to correlate a certain phenotype with high-dimensional genomic data. With such data the dimension of the covariate vector can be much larger than the sample size; the traditional least squares method is not applicable, and regularized or penalized methods are needed.

The Lasso (Tibshirani 1996) is a penalized method similar to ridge regression (Hoerl and Kennard 1970), but it uses the L1 penalty Σ_{j=1}^{p_n} |β_j| instead of the L2 penalty Σ_{j=1}^{p_n} β_j². The Lasso estimator is therefore the value that minimizes

    L_n(β) = ||y − Xβ||² + 2 λ_n Σ_{j=1}^{p_n} |β_j|,    (2)

where λ_n is the penalty parameter. An important feature of the Lasso is that it can be used for variable selection. Compared with classical variable selection methods such as subset selection, the Lasso has two advantages. First, the selection process in the Lasso is continuous and hence more stable than subset selection. Second, the Lasso is computationally feasible for high-dimensional data, whereas computation in subset selection is combinatorial and not feasible when p_n is large.
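As a purely numerical illustration of criterion (2), the following sketch fits a Lasso on simulated data with scikit-learn. The data, the penalty level, and the package choice are our own illustrative assumptions, not part of the paper; note that scikit-learn's Lasso minimizes (1/(2n))||y − Xb||² + α Σ_j |b_j|, which is (2) divided by 2n, so the paper's λ_n corresponds to α = λ_n/n.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 200                       # a p > n setting of the kind studied here
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                   # center the covariates
X /= np.sqrt((X ** 2).mean(axis=0))   # scale so that (1/n) * sum_i x_ij^2 = 1
beta0 = np.zeros(p)
beta0[:5] = 2.5                       # an illustrative sparse truth
y = X @ beta0 + 0.5 * rng.standard_normal(n)
y -= y.mean()                         # center the response, so no intercept is needed

lam = 1.5 * np.sqrt(n * np.log(p))    # an illustrative penalty level lambda_n
# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha * sum_j |b_j|,
# i.e. objective (2) divided by 2n, so alpha = lambda_n / n.
fit = Lasso(alpha=lam / n, fit_intercept=False, max_iter=50_000).fit(X, y)
print("number of nonzero estimated coefficients:", int(np.sum(fit.coef_ != 0)))
```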
Several authors have studied the properties of the Lasso. When p_n is fixed, Knight and Fu (2000) showed that, under appropriate conditions, the Lasso is consistent for estimating the regression parameter, and that its limiting distribution can have positive probability mass at 0 when the true value of the parameter is zero. Leng, Lin and Wahba (2006) showed that the Lasso is in general not path consistent, in the sense that (a) with probability greater than zero, the whole Lasso path may not contain the true parameter value, and (b) even if the true parameter value is contained in the Lasso path, it cannot be attained by using prediction accuracy as the selection criterion. For fixed p_n, Zou (2006) further studied the variable-selection and estimation properties of the Lasso. He showed that the positive probability mass at 0 of a Lasso estimator, when the true value of the parameter is 0, is in general less than 1, which implies that the Lasso is in general not variable-selection consistent. He also provided a condition on the design matrix for the Lasso to be variable-selection consistent. This condition was discovered by Meinshausen and Buhlmann (2006) and Zhao and Yu (2007); in particular, Zhao and Yu (2007) called it the irrepresentable condition on the design matrix. Meinshausen and Buhlmann (2006) and Zhao and Yu (2007) allowed the number of variables to go to infinity faster than n. They showed that, under the irrepresentable condition, the Lasso is consistent for variable selection, provided that p_n is not too large and the penalty parameter grows sufficiently fast. Specifically, p_n is allowed to be as large as exp(n^a) for some 0 < a < 1 when the errors have Gaussian tails. However, the value of λ_n required for variable-selection consistency over-shrinks the nonzero coefficients, which leads to asymptotically biased estimates. Thus the Lasso is variable-selection consistent under certain conditions, but not in general; moreover, if the Lasso is variable-selection consistent, then it is not efficient for estimating the nonzero parameters. These studies therefore confirm the suggestion that the Lasso does not possess the oracle property (Fan and Li 2001; Fan and Peng 2004). Here, the oracle property of a method means that it can correctly select the nonzero coefficients with probability converging to one, and that the estimators of the nonzero coefficients are asymptotically normal with the same means and covariances that they would have if the zero coefficients were known in advance. On the other hand, Greenshtein and Ritov (2004) showed that the Lasso has a certain persistence property for prediction, and, under a sparse Riesz condition, Zhang and Huang (2006) proved that the Lasso possesses the right order of sparsity and selects all coefficients of order greater than λ_n k_n^{1/2}/n, where k_n is the number of nonzero coefficients.

In addition to the Lasso, other penalized methods have been proposed for the purpose of simultaneous variable selection and shrinkage estimation. Examples include the bridge penalty (Frank and Friedman 1993) and the SCAD penalty (Fan 1997; Fan and Li 2001). For the SCAD penalty, Fan and Li (2001) and Fan and Peng (2004) studied asymptotic properties of penalized likelihood methods and showed that there exist local maximizers of the penalized likelihood that have the oracle property. Huang, Horowitz and Ma (2006) showed that the bridge estimator in a linear regression model has the oracle property under appropriate conditions if the bridge index is strictly between 0 and 1; their result also permits a divergent number of regression coefficients. While the SCAD and bridge estimators enjoy the oracle property, the objective functions with the SCAD and bridge penalties are not convex, so these estimators are more difficult to compute.
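The strong irrepresentable condition discussed above can be checked numerically for a given design and sparsity pattern. The sketch below (our own generic check, not code from the paper) computes the quantity max_j |e_j' C21 C11^{-1} sgn(β_01)| over the zero index set, where C11 and C21 are blocks of the Gram matrix X'X/n; the condition asks this quantity to stay bounded away from 1. The example design is an arbitrary AR(1)-type construction.

```python
import numpy as np

def irrepresentable_stat(X, beta0):
    """max_j |e_j' C21 C11^{-1} sgn(beta_01)| over the zero set, where C = X'X/n is split
    over the nonzero (1) and zero (2) index sets; the strong irrepresentable condition
    requires this quantity to stay bounded away from 1."""
    n = X.shape[0]
    C = X.T @ X / n
    J1 = np.flatnonzero(beta0 != 0)
    J2 = np.flatnonzero(beta0 == 0)
    C11 = C[np.ix_(J1, J1)]
    C21 = C[np.ix_(J2, J1)]
    return float(np.max(np.abs(C21 @ np.linalg.solve(C11, np.sign(beta0[J1])))))

# Illustration with an AR(1)-type correlation design (assumed, for demonstration only)
rng = np.random.default_rng(1)
n, p, r = 200, 50, 0.5
cov = r ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
beta0 = np.zeros(p)
beta0[:5] = 1.0
print("irrepresentable statistic:", irrepresentable_stat(X, beta0))
```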
Another interesting estimator for high-dimensional settings, the Dantzig selector, was proposed and studied by Candes and Tao (2005) for the estimation of β. This estimator achieves a loss within a logarithmic factor of the ideal mean squared error and can be computed by solving a convex minimization problem.

An approach to obtaining a convex objective function that yields oracle estimators is to use a weighted L1 penalty with weights determined by an initial estimator (Zou 2006). Suppose that an initial estimator β̃_n is available. Let

    w_{nj} = |β̃_{nj}|^{-1},  j = 1, ..., p_n.    (3)

Denote

    L_n(β) = ||y − Xβ||² + 2 λ_n Σ_{j=1}^{p_n} w_{nj} |β_j|.    (4)

The value β̂_n that minimizes L_n is called the adaptive Lasso estimator (Zou 2006). By allowing a relatively higher penalty for zero coefficients and a lower penalty for nonzero coefficients, the adaptive Lasso aims to reduce the estimation bias and improve variable-selection accuracy, compared with the standard Lasso.

For fixed p_n, Zou (2006) proved that the adaptive Lasso has the oracle property. We consider the case when p_n → ∞ as n → ∞. We show that the adaptive Lasso has the oracle property under an adaptive irrepresentable condition and other regularity conditions, and in particular that this can be achieved with marginal regression as the initial estimator under a partial orthogonality condition on the covariates. This result allows p_n = O(exp(n^a)) for some constant 0 < a < 1, where a depends on the regularity conditions. Thus the number of covariates can be larger than the sample size if a proper initial estimator is used in the adaptive Lasso.

When p_n > n, the regression parameter is in general not identifiable without further assumptions on the covariate matrix. However, if there is suitable structure in the covariate matrix, it is possible to achieve consistent variable selection and estimation. We consider a partial orthogonality condition in which the covariates with zero coefficients are only weakly correlated with the covariates with nonzero coefficients. We show that, for p_n > n and under the partial orthogonality and certain other conditions, the adaptive Lasso achieves selection consistency and estimation efficiency when the marginal regression estimators are used as the initial estimators, even though these initial estimators are not themselves consistent estimators of the regression parameters. The partial orthogonality condition is reasonable in microarray data analysis, where the genes that are correlated with the phenotype of interest may be in different functional pathways from the genes that are not related to the phenotype (Bair et al. 2006). The partial orthogonality condition was also discussed in the context of bridge estimation by Huang et al. (2006). Fan and Lv (2006) studied univariate screening in high-dimensional regression problems and provided conditions under which it can be used to reduce the exponentially growing dimensionality of a model. A new contribution of the present article is that we also investigate the effect of the tail behavior of the error distribution on the properties of the marginal regression estimators in high-dimensional settings.

The rest of the paper is organized as follows. In Section 2 we state the results on variable-selection consistency and asymptotic normality of the adaptive Lasso estimator. In Section 3 we show that, under the partial orthogonality and certain other regularity conditions, marginal regression estimators can be used in the adaptive Lasso to yield the desired selection and estimation properties. In Section 4 we present results from simulation studies and a real data example. Some concluding remarks are given in Section 5. The proofs of the results stated in Sections 2 and 3 are provided in the online supplement to this article.
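Given any initial estimates, the weighted criterion (4) can be minimized with ordinary Lasso software by rescaling each column of X by 1/w_{nj} and un-scaling the solution. The sketch below implements this standard reduction; the data, the choice of marginal regression as the initial estimator, the penalty level, and the use of scikit-learn are illustrative assumptions on our part, not a prescription from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, beta_init, lam, eps=1e-8):
    """Minimize ||y - Xb||^2 + 2*lam*sum_j w_j|b_j| with w_j = 1/|beta_init_j|,
    via the usual column-rescaling reduction to an ordinary Lasso problem."""
    n = X.shape[0]
    w = 1.0 / np.maximum(np.abs(beta_init), eps)   # weights w_nj = |initial estimate|^{-1}
    X_tilde = X / w                                # scale column j by 1/w_j
    # scikit-learn's objective is (1/(2n))||y - Xz||^2 + alpha*||z||_1, so alpha = lam/n
    z = Lasso(alpha=lam / n, fit_intercept=False, max_iter=50_000).fit(X_tilde, y).coef_
    return z / w                                   # undo the rescaling

# Illustrative use, with marginal regression estimates as the initial estimator
rng = np.random.default_rng(2)
n, p = 100, 300
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X /= np.sqrt((X ** 2).mean(axis=0))
beta0 = np.zeros(p)
beta0[:10] = 1.5
y = X @ beta0 + rng.standard_normal(n)
y -= y.mean()
beta_marginal = X.T @ y / n                        # x_j'y/n, as discussed in Section 3
beta_hat = adaptive_lasso(X, y, beta_marginal, lam=2.0 * np.sqrt(n * np.log(p)))
print("selected covariates:", np.flatnonzero(beta_hat != 0))
```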
2. Variable-selection consistency and asymptotic normality

Let the true parameter value be β_0 = (β_{01}, ..., β_{0p_n})', with dimension p = p_n. For simplicity of notation, write β_0 = (β_{10}', β_{20}')', where β_{10} is a k_n × 1 vector and β_{20} is an m_n × 1 vector. Suppose that β_{10} ≠ 0 and β_{20} = 0, where 0 is the vector of appropriate dimension with all components zero. Thus k_n is the number of nonzero coefficients and m_n is the number of zero coefficients in the regression model; it is of course unknown to us which coefficients are nonzero and which are zero. Most quantities and data objects in our discussion are functions of n, but this dependence on n is often made implicit, especially for n-vectors and matrices with n rows.

We center the response y = (y_1, ..., y_n)' and standardize the covariates so that

    Σ_{i=1}^n y_i = 0,  Σ_{i=1}^n x_{ij} = 0,  and  (1/n) Σ_{i=1}^n x_{ij}² = 1,  j = 1, ..., p_n.    (5)

Let x_j = (x_{1j}, ..., x_{nj})' be the j-th column of the design matrix X = (x_1, ..., x_{p_n}). The regression model is written as

    y = Σ_{j=1}^{p_n} β_j x_j + ε = Xβ + ε,    (6)

with the error vector ε = (ε_1, ..., ε_n)'. Let J_{n1} = {j : β_{0j} ≠ 0}, and define X_1 = (x_1, ..., x_{k_n}) and Σ_{n11} = n^{-1} X_1'X_1. Let τ_{n1} be the smallest eigenvalue of Σ_{n11}. For any vector x = (x_1, x_2, ...)', denote its sign vector by sgn(x) = (sgn(x_1), sgn(x_2), ...)', with the convention sgn(0) = 0. Following Zhao and Yu (2007), we write β_n =_s β if and only if sgn(β_n) = sgn(β). Let b_{n1} = min{|β_{0j}| : j ∈ J_{n1}}.

We assume the following conditions.

(A1) The errors ε_1, ε_2, ... are independent and identically distributed random variables with mean zero, and for certain constants 1 ≤ d ≤ 2, C > 0 and K, the tail probabilities of ε_i satisfy P(|ε_i| > x) ≤ K exp(−C x^d) for all x ≥ 0 and i = 1, 2, ....

(A2) The initial estimators β̃_{nj} are r_n-consistent for the estimation of certain η_{nj}:

    r_n max_{j ≤ p_n} |β̃_{nj} − η_{nj}| = O_P(1),  r_n → ∞,    (7)

where the η_{nj} are unknown constants depending on β_0 and satisfy

    max_{j ∈ J_{n1}} 1/|η_{nj}| ≤ M_{n1}  and  ( Σ_{j ∉ J_{n1}} η_{nj}² )^{1/2} ≤ M_{n2}.

(A3) (Adaptive irrepresentable condition.) For s_{n1} = (|η_{nj}|^{-1} sgn(β_{0j}), j ∈ J_{n1})' and some constant α < 1,

    n^{-1} | x_j' X_1 Σ_{n11}^{-1} s_{n1} | ≤ α / |η_{nj}|,  for all j ∉ J_{n1}.

(A4) The constants k_n, m_n, λ_n, M_{n1}, M_{n2}, b_{n1} are balanced so that, in particular,

    (log k_n)^{1/d} / (n^{1/2} b_{n1}) → 0,  k_n^{1/2} / (r_n b_{n1}) → 0,  λ_n M_{n1} k_n^{1/2} / (n b_{n1}) → 0,  and  n^{1/2} (log m_n)^{1/d} (M_{n2} + 1/r_n) / λ_n → 0.

(A5) There exists a constant τ_1 > 0 such that τ_{n1} ≥ τ_1 for all n.

Condition (A1) is standard for variable selection in linear regression. Condition (A2) assumes that the initial estimator β̃_{nj} actually estimates some proxy η_{nj} of β_{0j}, so that the weight w_{nj} ≈ |η_{nj}|^{-1} is not too large for β_{0j} ≠ 0 and not too small for β_{0j} = 0. The adaptive irrepresentable condition (A3) becomes the strong irrepresentable condition for the sign consistency of the Lasso if the η_{nj} are identical for all j ≤ p_n; it weakens the strong irrepresentable condition by allowing larger |η_{nj}| in J_{n1} (and hence a smaller s_{n1}) and smaller |η_{nj}| outside J_{n1}. If sgn(η_{nj}) = sgn(β_{0j}) in (A2), we say that the initial estimates are zero-consistent with rate r_n; in this case (A3) holds automatically, and one can take M_{n2} = 0 in (A2). Condition (A4) restricts the numbers of covariates with zero and nonzero coefficients, the penalty parameter, and the smallest nonzero coefficient. The number of covariates permitted depends on the tail behavior of the error terms: for sub-Gaussian tails the model can include more covariates, while for exponential tails the number of covariates allowed is smaller. We often have n^{δ−1/2} r_n → ∞ for some small δ > 0 and λ_n ≍ n^a for some 0 < a < 1. In this case, the number m_n of zero coefficients can be as large as exp(n^{d(a−δ)}), but the number of nonzero coefficients allowed is of the order min(n^{1−a}, n^{1−2δ}), assuming 1/b_{n1} = O(1) and M_{n1} = O(k_n^{1/2}). Condition (A5) assumes that the eigenvalues of Σ_{n11} are bounded away from zero. This is reasonable, since the number of nonzero covariates is small in a sparse model.
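The quantities entering conditions (A1)–(A5) are straightforward to compute for any given design and coefficient vector. The sketch below, an illustration of the definitions rather than code from the paper, standardizes the design as in (5) and evaluates Σ_{n11}, its smallest eigenvalue τ_{n1} (condition (A5)), and the minimal signal b_{n1}; the function and variable names are ours.

```python
import numpy as np

def design_quantities(X, beta0):
    """Standardize X as in (5), then return tau_n1, the smallest eigenvalue of
    Sigma_n11 = X1'X1/n over the support J_n1 (see (A5)), and b_n1 = min_{J_n1} |beta_0j|."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Xc = Xc / np.sqrt((Xc ** 2).mean(axis=0))     # (1/n) * sum_i x_ij^2 = 1, as in (5)
    J1 = np.flatnonzero(beta0 != 0)               # support J_n1
    Sigma_n11 = Xc[:, J1].T @ Xc[:, J1] / n
    tau_n1 = float(np.linalg.eigvalsh(Sigma_n11).min())
    b_n1 = float(np.abs(beta0[J1]).min())
    return tau_n1, b_n1

# Illustrative use on arbitrary simulated data
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 50))
beta0 = np.zeros(50)
beta0[:5] = 1.0
print(design_quantities(X, beta0))
```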
Among conditions (A1)–(A5), (A3) is the most critical one and is in general difficult to establish, since it assumes that we can estimate certain η_{nj} satisfying the condition. On the other hand, this task can be reduced to establishing simpler and stronger properties under the partial orthogonality condition described in Section 3.

Theorem 1. Suppose that conditions (A1)–(A5) hold. Then

    P( β̂_n =_s β_0 ) → 1.

The proof of this theorem can be found in the online supplement to this article.

Theorem 2. Suppose that conditions (A1)–(A5) are satisfied. Let s_n² = σ² a_n' Σ_{n11}^{-1} a_n for any k_n × 1 vector a_n satisfying ||a_n|| ≤ 1. If M_{n1} k_n / n^{1/2} → 0, then

    n^{1/2} s_n^{-1} a_n' ( β̂_{n1} − β_{10} ) = n^{-1/2} s_n^{-1} Σ_{i=1}^n ε_i a_n' Σ_{n11}^{-1} x_{1i} + o_P(1) →_D N(0, 1),    (8)

where x_{1i}' is the i-th row of X_1 and o_P(1) is a term that converges to zero in probability uniformly with respect to a_n.

This theorem can be proved by verifying the Lindeberg conditions in the same way as in the proof of Theorem 2 of Huang et al. (2006); we therefore omit the proof here.

3. Zero-consistency, partial orthogonality and marginal regression

For the adaptive Lasso estimator to be variable-selection consistent and have the oracle property, it is crucial to have an initial estimator that is zero-consistent or satisfies the weaker condition (A3). When p_n ≤ n, the least squares estimator is consistent, and therefore zero-consistent, under certain conditions on the design matrix and regression coefficients, and in that case we can use the least squares estimator as the initial estimator for the weights. However, when p_n > n, which is the case in many microarray gene expression studies, the least squares estimator is no longer feasible. In this section we show that the marginal regression estimators are zero-consistent under a partial orthogonality condition.

With the centering and scaling given in (5), the estimated marginal regression coefficient is

    β̃_{nj} = x_j' y / n,  j = 1, ..., p_n.    (9)

We take the η_{nj} in (A2) to be E β̃_{nj}. Since μ_0 ≡ E y = X β_0,

    η_{nj} = E β̃_{nj} = x_j' X β_0 / n = Σ_{k=1}^{k_n} β_{0k} x_j' x_k / n.    (10)

It is also possible to consider β̃_{nj} = |x_j' y / n|^γ and η_{nj} = |x_j' X β_0 / n|^γ for certain γ > 0, but we focus on the simpler (9) and (10) here. We make the following assumptions.

(B1) Condition (A1) holds.

(B2) (Partial orthogonality.) The covariates with zero coefficients and those with nonzero coefficients are only weakly correlated:

    | x_j' x_k | / n ≤ ρ_n,  j ∉ J_{n1}, k ∈ J_{n1},    (11)

where ρ_n is small enough that, with α given in (A3) and τ_{n1} in (A5),

    ρ_n k_n^{1/2} ||μ_0|| ( Σ_{j ∈ J_{n1}} η_{nj}^{-2} )^{1/2} / ( n^{1/2} τ_{n1} ) ≤ α.

(B3) The minimum h_{n1} = min{ |η_{nj}| : j ∈ J_{n1} } satisfies h_{n1} r_n → ∞, where r_n → ∞ is the rate, of order n^{1/2} up to the logarithmic factors (log m_n)^{1/d} and (log n)^{1/d}, at which the marginal estimators in (9) concentrate around the η_{nj}.

Condition (B2) is the weak partial orthogonality assumption: it requires that the covariates with zero coefficients have weaker correlation with the mean μ_0 = E y than those with nonzero coefficients, in an average sense. For η_{nj} ≡ 1, (B2) is the corresponding condition for the Lasso; thus the adaptive Lasso has advantages only when ρ_n < 1. Condition (B3) requires that the nonzero coefficients be bounded away from zero at certain rates depending on the growth of k_n and m_n.

Theorem 3. Suppose that conditions (B1)–(B3) hold. Then (A2) and (A3) hold for the η_{nj} in (10); that is, the β̃_{nj} in (9) are r_n-consistent for the η_{nj}, and the adaptive irrepresentable condition holds.

The proof of this theorem is given in the online supplement to this article. Theorem 3 provides justification for using the marginal regression estimator as the initial estimator in the adaptive Lasso under the partial orthogonality condition. Under (B1)–(B3), condition (A4) follows from an additional rate condition, (B4), involving (log k_n + log m_n)^{1/d}, λ_n, k_n, ρ_n, r_n and b_{n1}, where b_{n2} = max{|β_{0j}| : j ∈ J_{n1}} so that b_{n1} ≤ |β_{0j}| ≤ b_{n2} for j ∈ J_{n1}. Thus, under (B1)–(B4) and (A5), we can first use marginal regression to obtain the initial estimators and then use them as weights in the adaptive Lasso to achieve variable-selection consistency and oracle efficiency.

A special case of Theorem 3 occurs when ρ_n = o(n^{-1/2}), that is, when the covariates with nonzero and zero coefficients are essentially uncorrelated. Then we can take η_{nj} = 0 for j ∉ J_{n1}, and (11) is satisfied. Consequently, the marginal regression estimator β̃_n in (9) is zero-consistent with rate r_n, and the adaptive irrepresentable condition (A3) is automatically satisfied.
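The behavior described in Theorem 3 is easy to visualize numerically. The sketch below (an illustration under assumptions of our own: independently generated signal and noise covariates, so that the sample version of (B2) holds approximately, with correlations of order n^{-1/2}) computes the marginal estimates (9) and their targets (10), and compares their magnitudes over the signal and zero index sets.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 200, 400, 5
# Signal and noise covariates drawn independently, so sample cross-correlations are
# only of order n^{-1/2}; an approximate stand-in for the partial orthogonality (B2).
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X /= np.sqrt((X ** 2).mean(axis=0))
beta0 = np.zeros(p)
beta0[:k] = 2.0
y = X @ beta0 + rng.standard_normal(n)
y -= y.mean()

beta_tilde = X.T @ y / n            # marginal regression estimates, display (9)
eta = X.T @ (X @ beta0) / n         # their means eta_nj, display (10)
print("max |beta_tilde - eta| (r_n-consistency):", np.abs(beta_tilde - eta).max())
print("smallest |beta_tilde| over the signal set:", np.abs(beta_tilde[:k]).min())
print("largest  |beta_tilde| over the zero set  :", np.abs(beta_tilde[k:]).max())
# Under (B1)-(B3) the signal-set values stay bounded away from zero while the zero-set
# values are of smaller order, which is what makes w_nj = 1/|beta_tilde_nj| useful weights.
```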
4. Numerical studies

We conduct simulation studies to evaluate the finite-sample performance of the adaptive Lasso estimate and use a real data example to illustrate the application of the method. Because our main interest is in the case when p_n is large, and Zou (2006) has conducted simulation studies of the adaptive Lasso in low-dimensional settings, we focus on the case when p_n > n.

4.1 Simulation study

The adaptive Lasso estimate can be computed by a simple modification of the LARS algorithm (Efron et al. 2004); the computational algorithm is omitted here. In the simulation study we are interested in (1) the accuracy of variable selection and (2) prediction performance measured by mse (mean squared error). For (1), we compute the frequency of correctly identifying zero and nonzero coefficients in repeated simulations. For (2), we compute the median prediction mse, calculated from the predicted and observed values of the response on independent data not used in model fitting. We also compare the results from the adaptive Lasso with those from the standard Lasso estimate.

We simulate data from the linear model y = Xβ + ε, ε ~ N(0, σ²I). Eight examples with p_n > n are considered. In each example the covariate vector is generated as normally distributed with mean zero and covariance matrix specified below. The value of X is generated once and then kept fixed; replications are obtained by simulating the values of ε from N(0, σ²I) and then setting y = Xβ + ε for the fixed covariate value X. The sample size used in estimation is n = 100. Summary statistics are computed based on 500 replications. The eight examples we consider are as follows (a sketch of the data-generating mechanism for Example 1 is given after the list).

1. p = 200 and σ = 1.5. The n rows of X are independent. For the i-th row, the first 15 covariates (x_{i1}, ..., x_{i,15}) and the remaining 185 covariates (x_{i,16}, ..., x_{i,200}) are independent. The pairwise correlation between the k-th and the j-th components of (x_{i1}, ..., x_{i,15}) is r^{|k−j|}, with r = 0.5, k, j = 1, ..., 15. The pairwise correlation between the k-th and the j-th components of (x_{i,16}, ..., x_{i,200}) is r^{|k−j|}, with r = 0.5, k, j = 16, ..., 200. The first 5 components of β are 2.5, components 6–10 are 1.5, components 11–15 are 0.5, and the rest are zero. The covariate matrix has the partial orthogonal structure.

2. The same as Example 1, except that r = 0.95.

3. The same as Example 1, except that p = 400.

4. The same as Example 2, except that p = 400.

5. p = 200 and σ = 1.5. The predictors are generated as follows: x_{ij} = Z_{1i} + e_{ij}, j = 1, ..., 5; x_{ij} = Z_{2i} + e_{ij}, j = 6, ..., 10; x_{ij} = Z_{3i} + e_{ij}, j = 11, ..., 15; and x_{ij} = Z_{ij}, j = 16, ..., 200, where the Z_{ij} are i.i.d. N(0, 1) and the e_{ij} are i.i.d. N(0, 1/100). The first 15 components of β are 1.5; the remaining ones are zero.

6. The same as Example 5, except that p = 400.

7. p = 200 and σ = 1.5. The pairwise correlation between the k-th and the j-th components of (x_{i1}, ..., x_{i,200}) is r^{|k−j|}, with r = 0.5, k, j = 1, ..., 200. Components 1–5 of β are 2.5, components 11–15 are 1.5, components 21–25 are 0.5, and the rest are zero.

8. The same as Example 7, except that r = 0.95.

The partial orthogonality condition is satisfied in Examples 1–6. Examples 1 and 3 represent cases with moderately correlated covariates; Examples 2 and 4 have strongly correlated covariates; and Examples 5 and 6 have the grouping structure of Zou and Hastie (2005), with three equally important groups in which covariates within the same group are highly correlated. Examples 7 and 8 represent cases where the partial orthogonality assumption is violated: covariates with nonzero coefficients are correlated with the rest.
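For concreteness, the following sketch generates one data set following the Example 1 design described above. The function name, the random seed, and the use of numpy are our own choices; as stated above, in the simulations X is drawn once and held fixed, with only the errors redrawn across the 500 replications.

```python
import numpy as np

def make_example1(n=100, p=200, r=0.5, sigma=1.5, seed=0):
    """Generate (X, y, beta) following the Example 1 design: two independent covariate
    blocks (sizes 15 and p-15), each with correlation r^{|k-j|}, and coefficients
    (2.5 x 5, 1.5 x 5, 0.5 x 5, then zeros)."""
    rng = np.random.default_rng(seed)

    def ar_block(size):
        cov = r ** np.abs(np.subtract.outer(np.arange(size), np.arange(size)))
        return rng.multivariate_normal(np.zeros(size), cov, size=n)

    X = np.hstack([ar_block(15), ar_block(p - 15)])   # the two independent blocks
    beta = np.zeros(p)
    beta[:5], beta[5:10], beta[10:15] = 2.5, 1.5, 0.5
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```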
In each example, the simulated data consist of a training set and a testing set, each of size 100. For both the Lasso and the adaptive Lasso, tuning parameters are selected by V-fold cross-validation using the training set only, with V = 5. After tuning-parameter selection, the Lasso and adaptive Lasso estimates are computed using the training set. We then compute the prediction MSE for the testing set based on the training-set estimate. Specifically, in each of the 500 replications, let ŷ_i be the fitted value based on the training data and let y_i be the response value in the testing data whose corresponding covariate value is the same as that associated with ŷ_i. The prediction MSE for this data set is then n^{-1} Σ_{i=1}^n (ŷ_i − y_i)², where n = 100. The PMSE reported in Table 1 is the median of the prediction MSEs over the 500 replications.

Summary statistics of the variable-selection and PMSE results are shown in Table 1. It can be seen that, for Examples 1–6, the adaptive Lasso yields smaller models with better prediction performance. However, due to the very large number of covariates, the number of covariates identified by the adaptive Lasso is still larger than the true value 15. When the partial orthogonality condition is not satisfied (Examples 7 and 8), the adaptive Lasso still yields smaller models with satisfactory prediction performance comparable to the Lasso. Extensive simulation studies with other values of p and different marginal and joint distributions of x_{ij} yield similarly satisfactory results. We show in Figures 1 and 2 the frequencies with which individual covariate effects are properly classified as zero versus nonzero. For a better view, we only show the first 100 coefficients, which include all the nonzero coefficients; the patterns of the results for the remaining coefficients are similar.
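The tuning and evaluation procedure just described (5-fold cross-validation on the training set, prediction MSE on the independent test set) can be sketched as follows. The candidate penalty grid, the use of the ordinary Lasso as the fitted method, and the scikit-learn utilities are illustrative assumptions; the same scheme applies to the adaptive Lasso once its weights are fixed.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def cv_choose_alpha(X, y, alphas, n_splits=5, seed=0):
    """Pick the penalty by V-fold cross-validation (V = 5, as in Section 4.1)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    cv_err = []
    for a in alphas:
        errs = []
        for tr, va in kf.split(X):
            f = Lasso(alpha=a, fit_intercept=False, max_iter=50_000).fit(X[tr], y[tr])
            errs.append(np.mean((y[va] - f.predict(X[va])) ** 2))
        cv_err.append(np.mean(errs))
    return alphas[int(np.argmin(cv_err))]

def prediction_mse(X_train, y_train, X_test, y_test, alphas):
    """Tune on the training set only, then report the test-set prediction MSE."""
    a = cv_choose_alpha(X_train, y_train, alphas)
    f = Lasso(alpha=a, fit_intercept=False, max_iter=50_000).fit(X_train, y_train)
    return float(np.mean((y_test - f.predict(X_test)) ** 2))
```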
4.2 Data example

We use the data set reported in Scheetz et al. (2006) to illustrate the application of the adaptive Lasso in high-dimensional settings. In this data set, F1 animals were intercrossed and 120 twelve-week-old male offspring were selected for tissue harvesting from the eyes and microarray analysis. The microarrays used to analyze the RNA from the eyes of these F2 animals contain over 31,042 different probe sets (Affymetrix GeneChip Rat Genome 230 2.0 Array). The intensity values were normalized using the RMA (robust multi-chip averaging; Bolstad et al. 2003, Irizarry et al. 2003) method to obtain summary expression values for each probe set. Gene expression levels were analyzed on a logarithmic scale.

For the 31,042 probe sets on the array, we first excluded probes that were not expressed in the eye or that lacked sufficient variation. The definition of expressed was based on the empirical distribution of RMA-normalized values: for a probe to be considered expressed, the maximum expression value observed for that probe among the 120 F2 rats was required to be greater than the 25th percentile of the entire set of RMA expression values. For a probe to be considered sufficiently variable, it had to exhibit at least 2-fold variation in expression level among the 120 F2 animals. A total of 18,976 probes met these two criteria.

We are interested in finding the genes whose expression is correlated with that of the gene TRIM32. This gene was recently found to cause Bardet-Biedl syndrome (Chiang et al. 2006), a genetically heterogeneous disease of multiple organ systems including the retina. The probe for TRIM32 is 1389163_at, which is one of the 18,976 probes that are sufficiently expressed and variable. One approach to finding the probes among the remaining 18,975 probes that are most related to TRIM32 is to use regression analysis. Here the sample size is n = 120 (i.e., there are 120 arrays from 120 rats) and the number of probes is 18,975. It is also expected that only a few genes are related to TRIM32, so this is a sparse, high-dimensional regression problem. We use the proposed approach in the analysis. We first standardize the probes so that they have mean zero and standard deviation 1, and then carry out the following steps (a sketch of steps 1–3 is given at the end of this subsection):

1. Select the 3,000 probes with the largest variances.
2. Compute the marginal correlation coefficients of the 3,000 probes with the probe corresponding to TRIM32.
3. Select the top 200 covariates with the largest correlation coefficients. This is equivalent to selecting the covariates based on marginal regression, since the covariates are standardized.
4. Compute the adaptive Lasso and Lasso estimates; the estimation and prediction results are reported below.

Table 2 lists the probes selected by the adaptive Lasso. For comparison, we also used the Lasso; the Lasso selected 5 more probes than the adaptive Lasso. To evaluate the performance of the adaptive Lasso relative to the Lasso, we use cross-validation and compare the predictive mean squared errors (MSEs). Table 3 gives the results when the number of covariates is p = 100, 200, 300, 400 and 500. We randomly partition the data into a training set and a test set: the training set consists of 2/3 of the observations and the test set consists of the remaining 1/3. We then fit the model with the training set and calculate the prediction MSE for the test set. We repeat this process 300 times, each time making a new random partition. The values in Table 3 are the medians of the results from the 300 random partitions. In the table, "cov" is the number of covariates being considered; "nonzero" is the number of covariates in the final model; "corr" is the correlation coefficient between the predicted value based on the model and the observed value; and "coef" is the slope of the regression of the fitted values of Y against the observed values of Y, which shows that the shrinkage effects of the two methods are similar. Overall, we see that the performance of the adaptive Lasso and the Lasso is similar. However, there is some improvement of the adaptive Lasso over the Lasso in terms of prediction MSE. Notably, the number of covariates selected by the adaptive Lasso is smaller than that selected by the Lasso, yet the prediction MSE of the adaptive Lasso is smaller.
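The pre-screening steps (1)–(3) above reduce the 18,976 expressed and variable probes to 200 candidate covariates via variance filtering followed by marginal correlation with the TRIM32 probe. The sketch below outlines that pipeline; the variable names (expr, trim32) and the plain-numpy implementation are placeholders of ours, since the paper does not publish code.

```python
import numpy as np

def screen_probes(expr, trim32, n_var=3000, n_keep=200):
    """Steps (1)-(3) of Section 4.2: keep the n_var probes with the largest variances,
    then the n_keep of those most correlated (marginally) with the TRIM32 probe.
    `expr` is an (n_arrays x n_probes) matrix of log expression values; `trim32`
    is the length-n_arrays vector for probe 1389163_at."""
    top_var = np.argsort(expr.var(axis=0))[::-1][:n_var]      # step (1): variance filter
    Z = expr[:, top_var]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)                  # standardize, as in the text
    t = (trim32 - trim32.mean()) / trim32.std()
    corr = np.abs(Z.T @ t) / len(t)                           # step (2): marginal correlations
    keep = top_var[np.argsort(corr)[::-1][:n_keep]]           # step (3): top 200 covariates
    return keep
```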
5. Concluding remarks

The adaptive Lasso is a two-step approach. In the first step an initial estimator is obtained; then a penalized optimization problem with a weighted L1 penalty must be solved. The initial estimator does not need to be consistent, but it must put more weight on the zero coefficients and less on the nonzero ones, in an average sense, to improve upon the Lasso. Under the partial orthogonality condition, a simple initial estimator can be obtained from marginal regression. Compared with the Lasso, the theoretical advantage of the adaptive Lasso is that it has the oracle property. Compared with the SCAD and bridge methods, which also have the oracle property, the advantage of the adaptive Lasso is its computational efficiency: given the initial estimator, the computation of the adaptive Lasso estimate is a convex optimization problem whose computational cost is the same as that of the Lasso. Indeed, the entire regularization path of the adaptive Lasso can be computed with the same computational complexity as the least squares solution, using the LARS algorithm (Efron et al. 2004). Therefore, the adaptive Lasso is a useful method for analyzing high-dimensional data.

We have focused on the adaptive Lasso in the context of linear regression models. The method can be applied in a similar way to other models, such as the generalized linear and Cox models. It would be interesting to generalize the results of this paper to these more complicated models.

Acknowledgements. The authors wish to thank two anonymous referees and an Associate Editor for their helpful comments.

REFERENCES

Bair, E., Hastie, T., Paul, D. and Tibshirani, R. (2006). Prediction by supervised principal components. J. Amer. Statist. Assoc. 101, 119-137.

Bolstad, B. M., Irizarry, R. A., Astrand, M. and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics 19, 185-193.

Candes, E. and Tao, T. (2005). The Dantzig selector: statistical estimation when p is much larger than n. Preprint, Department of Computational and Applied Mathematics, Caltech. Accepted for publication, Ann. Statist.

Chiang, A. P., Beck, J. S., Yen, H.-J., Tayeh, M. K., Scheetz, T. E., Swiderski, R., Nishimura, D., Braun, T. A., Kim, K.-Y., Huang, J., Elbedour, K., Carmi, R., Slusarski, D. C., Casavant, T. L., Stone, E. M. and Sheffield, V. C. (2006). Homozygosity mapping with SNP arrays identifies a novel gene for Bardet-Biedl syndrome (BBS10). Proceedings of the National Academy of Sciences USA 103, 6287-6292.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32, 407-499.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348-1360.

Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32, 928-961.

Fan, J. and Lv, J. (2006). Sure independence screening for ultra-high dimensional feature space. Preprint, Department of Operational Research & Financial Engineering, Princeton University.

Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35, 109-148.

Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10, 971-988.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67.

Huang, J., Horowitz, J. L. and Ma, S. G. (2006). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Technical Report No. 360, Department of Statistics and Actuarial Science, University of Iowa. Accepted for publication, Ann. Statist.

Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U. and Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249-264.

Knight, K. and Fu, W. J. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28, 1356-1378.

Leng, C., Lin, Y. and Wahba, G. (2006). A note on the Lasso and related procedures in model selection. Statistica Sinica 16, 1273-1284.

Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34, 1436-1462.

Scheetz, T. E., Kim, K.-Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A., Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant, T. L., Sheffield, V. C. and Stone, E. M. (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences 103, 14429-14434.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58, 267-288.

Van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.
Zhang, C.-H. and Huang, J. (2006). Model selection consistency of the LASSO in high-dimensional linear regression. Technical Report 2006-003, Department of Statistics, Rutgers University.

Zhao, P. and Yu, B. (2007). On model selection consistency of Lasso. J. Machine Learning Res. 7, 2541-2567.

Zou, H. (2006). The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 101, 1418-1429.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B 67, 301-320.

Department of Statistics and Actuarial Science, University of Iowa, Iowa City, Iowa 52242. E-mail: jian@stat.uiowa.edu

Division of Biostatistics, Department of Epidemiology and Public Health, Yale University, New Haven, Connecticut 06520-8034. E-mail: shuangge.ma@yale.edu

Department of Statistics, 504 Hill Center, Busch Campus, Rutgers University, Piscataway, NJ 08854-8019. E-mail: cunhui@stat.rutgers.edu

Figure 1. Simulation study, Examples 1-4: frequency of individual covariate effects being correctly identified (circle: Lasso; triangle: adaptive Lasso; panels: Examples 1-4, covariate index versus frequency). Only the results for the first 100 coefficients are shown in the plots.

Figure 2. Simulation study, Examples 5-8: frequency of individual covariate effects being correctly identified (circle: Lasso; triangle: adaptive Lasso; panels: Examples 5-8, covariate index versus frequency). Only the results for the first 100 coefficients are shown in the plots.

Table 1. Simulation study: comparison of the adaptive Lasso with the Lasso. PMSE: median of the prediction MSEs (standard deviations in parentheses). Covariate: median number of covariates with nonzero coefficients.

                 Lasso                       Adaptive Lasso
Example   PMSE            Covariate   PMSE            Covariate
1         3.829 (0.769)      58       3.625 (0.695)      50
2         3.548 (0.636)      54       2.955 (0.551)      33
3         3.604 (0.681)      50       3.369 (0.631)      43
4         3.304 (0.572)      50       2.887 (0.499)      33
5         3.148 (0.557)      48       2.982 (0.540)      40
6         3.098 (0.551)      42       2.898 (0.502)      36
7         3.740 (0.753)      59       3.746 (0.723)      53
8         3.558 (0.647)      55       3.218 (0.578)      44

Table 2. The probe sets identified by the Lasso and the adaptive Lasso as correlated with TRIM32, with the corresponding estimates; a blank entry indicates that the probe was not selected by that method.

Probe ID         Lasso     Adaptive Lasso
1369353_at       0.021     0.028
1370429_at       0.012
1371242_at       0.025     0.015
1374106_at       0.027     0.026
1374131_at       0.018     0.011
1389584_at       0.056     0.054
1393979_at       0.004     0.007
1398255_at       0.022     0.009
1378935_at       0.009
1379920_at       0.002
1379971_at       0.038     0.041
1380033_at       0.030     0.023
1381787_at       0.007     0.007
1382835_at       0.045     0.038
1383110_at       0.023     0.034
1383522_at       0.016     0.01
1383673_at       0.010     0.02
1383749_at       0.041     0.045
1383996_at       0.082     0.081
1390788_a_at     0.013     0.001
1393382_at       0.006     0.004
1393684_at       0.008     0.003
1394107_at       0.004
1395415_at       0.004
Table 3. Prediction results using cross-validation. 300 random partitions of the data set are made; in each partition, the training set consists of 2/3 of the observations and the test set consists of the remaining 1/3. The values in the table are medians of the results from the 300 random partitions. Here "cov" is the number of covariates being considered; "nonzero" is the number of covariates in the final model; "corr" is the correlation coefficient between the fitted and observed values of Y; and "coef" is the slope of the regression of the fitted values of Y against the observed values of Y, which shows the shrinkage effect of the methods.

            Lasso                               Adaptive Lasso
cov   nonzero   mse     corr    coef     nonzero   mse     corr    coef
100     20      0.005   0.654   0.486      18      0.006   0.659   0.469
200     19      0.005   0.676   0.468      17      0.005   0.678   0.476
300     18      0.005   0.669   0.443      17      0.005   0.671   0.462
400     22      0.005   0.676   0.442      19      0.005   0.686   0.476
500     25      0.005   0.665   0.449      22      0.005   0.670   0.463

ADAPTIVE LASSO FOR SPARSE HIGH-DIMENSIONAL REGRESSION MODELS: ONLINE SUPPLEMENT

Jian Huang (University of Iowa), Shuangge Ma (Yale University) and Cun-Hui Zhang (Rutgers University)

In this supplement we prove Theorems 1 and 3. Let ψ_d(x) = exp(x^d) − 1 for d ≥ 1. For any random variable X, its ψ_d-Orlicz norm ||X||_{ψ_d} is defined as ||X||_{ψ_d} = inf{ C > 0 : E ψ_d(|X|/C) ≤ 1 }. Orlicz norms are useful for obtaining maximal inequalities; see Van der Vaart and Wellner (1996), hereafter referred to as VW (1996).

Lemma 1. Suppose that ε_1, ..., ε_n are i.i.d. random variables with E ε_i = 0 and Var(ε_i) = σ², and suppose that their tail probabilities satisfy P(|ε_i| > x) ≤ K exp(−C x^d), i = 1, 2, ..., for constants C and K and for 1 ≤ d ≤ 2. Then, for all constants a_i satisfying Σ_{i=1}^n a_i² = 1,

    || Σ_{i=1}^n a_i ε_i ||_{ψ_d} ≤ K_d { σ + (1 + K)^{1/d} C^{-1/d} },   1 < d ≤ 2,
    || Σ_{i=1}^n a_i ε_i ||_{ψ_1} ≤ K_1 { σ + (1 + K) C^{-1} (1 + log n) max_{i ≤ n} |a_i| },   d = 1,

where K_d is a constant depending on d only. Consequently,

    sup { P( | Σ_{i=1}^n a_i ε_i | > t ) : Σ_{i=1}^n a_i² = 1 } ≤ q(t),   t > 0,

where q(t) = exp(−t^d/M) for 1 < d ≤ 2 and q(t) = exp(−t/(M(1 + log n))) for d = 1, for a certain constant M depending on (σ, K, C, d) only.

Proof. Because ε_i satisfies P(|ε_i| > x) ≤ K exp(−C x^d), its Orlicz norm satisfies ||ε_i||_{ψ_d} ≤ {(1 + K)/C}^{1/d} (Lemma 2.2.1, VW 1996). Let d̃ be the conjugate index given by 1/d + 1/d̃ = 1. By Proposition A.1.6 of VW (1996), there exists a constant K_d such that

    || Σ_i a_i ε_i ||_{ψ_d} ≤ K_d { E| Σ_i a_i ε_i | + ( Σ_i ||a_i ε_i||_{ψ_d}^{d̃} )^{1/d̃} }
                           ≤ K_d { ( Σ_i a_i² σ² )^{1/2} + (1 + K)^{1/d} C^{-1/d} ( Σ_i |a_i|^{d̃} )^{1/d̃} }.

For 1 < d ≤ 2, d̃ = d/(d − 1) ≥ 2; thus Σ_i |a_i|^{d̃} ≤ ( Σ_i a_i² )^{d̃/2} = 1, and the first bound follows. For d = 1, by Proposition A.1.6 of VW (1996) there exists a constant K_1 such that

    || Σ_i a_i ε_i ||_{ψ_1} ≤ K_1 { E| Σ_i a_i ε_i | + || max_{i ≤ n} |a_i ε_i| ||_{ψ_1} }
                           ≤ K_1 { σ + (1 + K) C^{-1} (1 + log n) max_{i ≤ n} |a_i| }.

The last inequality follows from the bound on the Orlicz norm of a maximum of n variables, which contributes the factor 1 + log n, in view of the definition of ||X||_{ψ_1}. The tail bound q(t) then follows from the definition of the Orlicz norm. □

Lemma 2. Let s̃_{n1} = ( w_{nj} sgn(β_{0j}), j ∈ J_{n1} )' and s_{n1} = ( |η_{nj}|^{-1} sgn(β_{0j}), j ∈ J_{n1} )'. Suppose (A2) holds. Then

    max_{j ∈ J_{n1}} w_{nj} ≤ (1 + o_P(1)) M_{n1}   and   || s̃_{n1} − s_{n1} || = o_P(1) M_{n1} k_n^{1/2}.    (12)

Proof. By the r_n-consistency in (A2), max_{j ∈ J_{n1}} |β̃_{nj} − η_{nj}| = O_P(1/r_n) = o_P(1) min_{j ∈ J_{n1}} |η_{nj}|, so that w_{nj} = |β̃_{nj}|^{-1} ≤ (1 + o_P(1)) |η_{nj}|^{-1} ≤ (1 + o_P(1)) M_{n1} uniformly over j ∈ J_{n1}. For the second part of (12),

    || s̃_{n1} − s_{n1} ||² ≤ Σ_{j ∈ J_{n1}} | w_{nj} − |η_{nj}|^{-1} |² ≤ k_n ( max_{j ∈ J_{n1}} | w_{nj} − |η_{nj}|^{-1} | )² = o_P( M_{n1}² k_n ),    (13)

since max_{j ∈ J_{n1}} | w_{nj} − |η_{nj}|^{-1} | = o_P(M_{n1}). □

Proof of Theorem 1. Let J_{n1} = {j : β_{0j} ≠ 0}. It follows from the Karush-Kuhn-Tucker conditions that β̂_n = (β̂_{n1}, ..., β̂_{np_n})' is the unique solution of the adaptive Lasso if

    x_j'( y − X β̂_n ) = λ_n w_{nj} sgn(β̂_{nj})   for β̂_{nj} ≠ 0,
    | x_j'( y − X β̂_n ) | < λ_n w_{nj}            for β̂_{nj} = 0,    (14)

and the vectors { x_j : β̂_{nj} ≠ 0 } are linearly independent. Let s̃_{n1} = ( w_{nj} sgn(β_{0j}), j ∈ J_{n1} )' and

    β̂_{n1} = ( X_1'X_1 )^{-1} ( X_1' y − λ_n s̃_{n1} ) = β_{10} + ( X_1'X_1 )^{-1} ( X_1' ε − λ_n s̃_{n1} ).    (15)

If β̂_{n1} =_s β_{10}, then the equality in (14) holds for β̂_n = (β̂_{n1}', 0')'. Thus, since X β̂_n = X_1 β̂_{n1} for this β̂_n and the x_j, j ∈ J_{n1}, are linearly independent,

    β̂_n =_s β_0   if   β̂_{n1} =_s β_{10}  and  | x_j'( y − X_1 β̂_{n1} ) | < λ_n w_{nj}  for all j ∉ J_{n1}.    (16)

This is a variation of Proposition 1 of Zhao and Yu (2007). Let H_n = I_n − X_1 ( X_1'X_1 )^{-1} X_1' be the projection onto the orthogonal complement of the column space of X_1. It follows from (15) that

    y − X_1 β̂_{n1} = ε − X_1 ( β̂_{n1} − β_{10} ) = H_n ε + X_1 ( X_1'X_1 )^{-1} λ_n s̃_{n1},

so that, by (16),

    β̂_n =_s β_0   if   sgn(β_{0j}) ( β̂_{n1,j} − β_{0j} ) > −|β_{0j}|  for all j ∈ J_{n1},  and
                       | x_j' H_n ε + x_j' X_1 ( X_1'X_1 )^{-1} λ_n s̃_{n1} | < λ_n w_{nj}  for all j ∉ J_{n1}.    (17)

Thus, by (17) and (15), for suitable constants α < α' < 1 and small ε > 0,

    P( β̂_n ≠_s β_0 ) ≤ P( | e_j' Σ_{n11}^{-1} X_1' ε / n | ≥ |β_{0j}| / 2  for some j ∈ J_{n1} )
                      + P( λ_n | e_j' Σ_{n11}^{-1} s̃_{n1} | / n ≥ |β_{0j}| / 2  for some j ∈ J_{n1} )
                      + P( | x_j' H_n ε | ≥ (1 − α − ε) λ_n w_{nj}  for some j ∉ J_{n1} )
                      + P( | x_j' X_1 Σ_{n11}^{-1} s̃_{n1} | λ_n / n ≥ (α + ε) λ_n w_{nj}  for some j ∉ J_{n1} )
                      = P(B_{n1}) + P(B_{n2}) + P(B_{n3}) + P(B_{n4}),  say,    (18)

where e_j is the unit vector in the direction of the j-th coordinate. Since || e_j' Σ_{n11}^{-1} X_1' ||² ≤ n / τ_1 and |β_{0j}| ≥ b_{n1} for j ∈ J_{n1},

    P(B_{n1}) ≤ k_n q( τ_1^{1/2} n^{1/2} b_{n1} / 2 ),

with the tail probability q(t) from Lemma 1. Thus P(B_{n1}) → 0 by (A1), Lemma 1, (A4) and (A5). By Lemma 2 and conditions (A4) and (A5),

    λ_n || Σ_{n11}^{-1} s̃_{n1} || / n = O_P( λ_n M_{n1} k_n^{1/2} / ( n τ_1 ) ) = o_P( b_{n1} ),

which gives P(B_{n2}) → 0. For B_{n3}, note that for j ∉ J_{n1} we have |β̃_{nj}| ≤ |η_{nj}| + O_P(1/r_n), so that w_{nj} is bounded below by a constant multiple of 1 / ( M_{n2} + 1/r_n ) with probability tending to one; since || x_j' H_n || ≤ || x_j || = n^{1/2},

    P(B_{n3}) ≤ m_n q( c λ_n / ( n^{1/2} ( M_{n2} + 1/r_n ) ) )

for a constant c > 0, and so, by Lemma 1 and (A4), P(B_{n3}) → 0. Finally, for B_{n4}, Lemma 2, condition (A5) and the bound || x_j' X_1 / n || ≤ 1 imply that | x_j' X_1 Σ_{n11}^{-1} s̃_{n1} | / n exceeds | x_j' X_1 Σ_{n11}^{-1} s_{n1} | / n by at most o_P(1) M_{n1} k_n^{1/2} / τ_1 = o_P( w_{nj} ); since | x_j' X_1 Σ_{n11}^{-1} s_{n1} | / n ≤ α / |η_{nj}| by (A3), we have P(B_{n4}) → 0. □

Proof of Theorem 3. Let μ_0 = E y = Σ_j x_j β_{0j} = X β_0. Then β̃_{nj} = x_j' y / n = η_{nj} + x_j' ε / n, with η_{nj} = x_j' μ_0 / n. Since || x_j ||² / n = 1, Lemma 1 implies that, for all ε > 0,

    P( r_n max_{j ≤ p_n} | β̃_{nj} − η_{nj} | > ε ) = P( r_n max_{j ≤ p_n} | x_j' ε | / n > ε ) ≤ p_n q( ε n^{1/2} / r_n ) → 0,

by the choice of r_n in (B3). This verifies the first part of (A2). For the second part of (A2), with M_{n1} = max_{j ∈ J_{n1}} 1/|η_{nj}| ≤ 1/h_{n1}, the bound on M_{n1} follows from (B3); and for j ∉ J_{n1}, (10) and (11) give |η_{nj}| ≤ ρ_n Σ_{k ∈ J_{n1}} |β_{0k}|, so the bound on M_{n2} follows from the partial orthogonality condition (B2). To verify (A3), note that (11) implies || x_j' X_1 / n || ≤ ρ_n k_n^{1/2} for j ∉ J_{n1}, so that

    | x_j' X_1 Σ_{n11}^{-1} s_{n1} | / n ≤ ρ_n k_n^{1/2} || s_{n1} || / τ_{n1},

and |η_{nj}| = | x_j' μ_0 | / n ≤ || μ_0 || / n^{1/2} for all j ∉ J_{n1}. Thus, for such j, (B2) implies

    |η_{nj}| · n^{-1} | x_j' X_1 Σ_{n11}^{-1} s_{n1} | ≤ ρ_n k_n^{1/2} || μ_0 || ( Σ_{k ∈ J_{n1}} η_{nk}^{-2} )^{1/2} / ( n^{1/2} τ_{n1} ) ≤ α.

The proof is complete. □
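The proof of Theorem 1 rests on the Karush-Kuhn-Tucker characterization (14) of the adaptive Lasso solution. For a computed estimate, that characterization can be verified numerically; the sketch below is such a check, with the tolerance and the way the estimate and weights are obtained left as illustrative assumptions.

```python
import numpy as np

def check_kkt(X, y, beta_hat, lam, w, tol=1e-6):
    """Check the KKT conditions (14) for the objective ||y - Xb||^2 + 2*lam*sum_j w_j|b_j|:
    x_j'(y - Xb) = lam*w_j*sgn(b_j) when b_j != 0, and |x_j'(y - Xb)| <= lam*w_j when b_j = 0."""
    g = X.T @ (y - X @ beta_hat)
    active = beta_hat != 0
    ok_active = np.all(np.abs(g[active] - lam * w[active] * np.sign(beta_hat[active])) <= tol)
    ok_inactive = np.all(np.abs(g[~active]) <= lam * w[~active] + tol)
    return bool(ok_active and ok_inactive)
```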

