ST 790A

These 70 pages of class notes were uploaded by Jordane Kemmer on Thursday, October 15, 2015. The notes belong to ST 790A at North Carolina State University, taught by Staff in Fall. Since upload, they have received 15 views. For similar materials see /class/223961/st-790a-north-carolina-state-university in Statistics at North Carolina State University.

Model Selection
IMS Lecture Notes - Monograph Series (2001), Volume 38

The Practical Implementation of Bayesian Model Selection

Hugh Chipman, Edward I. George and Robert E. McCulloch
The University of Waterloo, The University of Pennsylvania and The University of Chicago

Abstract

In principle, the Bayesian approach to model selection is straightforward. Prior probability distributions are used to describe the uncertainty surrounding all unknowns. After observing the data, the posterior distribution provides a coherent post-data summary of the remaining uncertainty, which is relevant for model selection. However, the practical implementation of this approach often requires carefully tailored priors and novel posterior calculation methods. In this article, we illustrate some of the fundamental practical issues that arise for two different model selection problems: the variable selection problem for the linear model and the CART model selection problem.

Hugh Chipman is Associate Professor of Statistics, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada (email: hachipman@uwaterloo.ca). Edward I. George is Professor of Statistics, Department of Statistics, The Wharton School of the University of Pennsylvania, 3620 Locust Walk, Philadelphia, PA 19104-6302, USA (email: edgeorge@wharton.upenn.edu). Robert E. McCulloch is Professor of Statistics, Graduate School of Business, University of Chicago, 1101 East 58th Street, Chicago, IL, USA (email: RobertMcCulloch@gsb.uchicago.edu). This work was supported by NSF grant DMS-9803756 and Texas ARP grant 003658690.

Contents

1 Introduction
2 The General Bayesian Approach
  2.1 A Probabilistic Setup for Model Uncertainty
  2.2 General Considerations for Prior Selection
  2.3 Extracting Information from the Posterior
3 Bayesian Variable Selection for the Linear Model
  3.1 Model Space Priors for Variable Selection
  3.2 Parameter Priors for Selection of Nonzero β_i
  3.3 Calibration and Empirical Bayes Variable Selection
  3.4 Parameter Priors for Selection Based on Practical Significance
  3.5 Posterior Calculation and Exploration for Variable Selection
    3.5.1 Closed Form Expressions for p(Y | γ)
    3.5.2 MCMC Methods for Variable Selection
    3.5.3 Gibbs Sampling Algorithms
    3.5.4 Metropolis-Hastings Algorithms
    3.5.5 Extracting Information from the Output
4 Bayesian CART Model Selection
  4.1 Prior Formulations for Bayesian CART
    4.1.1 Tree Prior Specification
    4.1.2 Parameter Prior Specifications
  4.2 Stochastic Search of the CART Model Posterior
    4.2.1 Metropolis-Hastings Search Algorithms
    4.2.2 Running the MH Algorithm for Stochastic Search
    4.2.3 Selecting the Best Trees
5 Much More to Come

1 Introduction

The Bayesian approach to statistical problems is fundamentally probabilistic. A joint probability distribution is used to describe the relationships between all the unknowns and the data. Inference is then based on the conditional probability distribution of the unknowns given the observed data, the posterior distribution. Beyond the specification of the joint distribution, the Bayesian approach is automatic. Exploiting the internal consistency of the probability framework, the posterior distribution extracts the relevant information in the data and provides a complete and coherent summary of post-data uncertainty. Using the posterior to solve specific inference and decision problems is then straightforward, at least in principle.

In this article, we describe applications of this Bayesian approach for model uncertainty problems where a large number of different models are under consideration for the data. The joint distribution is obtained by introducing prior distributions on all the unknowns, here the parameters of each model and the models themselves, and then combining them with the distributions for the data. Conditioning on the data then induces a posterior distribution of model uncertainty that can be used for model selection and other inference and decision problems. This is the essential idea, and it can be very powerful.
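The essential idea, a prior over models updated by the data to a posterior over models, can be made concrete with a small numerical sketch. This is a hypothetical toy example (the two fixed-mean normal models, the data, and the uniform model prior are illustrative assumptions, not from the article):

```python
import math

# Toy setup: data assumed i.i.d. N(mu, 1); model M1 fixes mu = 0, model M2
# fixes mu = 1 (point-mass parameter "priors" keep the marginal likelihood
# available in closed form).

def log_marginal(y, mu):
    """log p(Y | Mk) for i.i.d. N(mu, 1) observations under model Mk."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2 for yi in y)

def posterior_model_probs(y, mus, model_priors):
    """Bayes' rule over models: p(Mk | Y) proportional to p(Y | Mk) p(Mk)."""
    logs = [log_marginal(y, mu) + math.log(pr)
            for mu, pr in zip(mus, model_priors)]
    top = max(logs)                          # log-sum-exp for stability
    weights = [math.exp(l - top) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]

y = [0.1, -0.2, 0.3, 0.05]                   # observations lying near 0
probs = posterior_model_probs(y, mus=[0.0, 1.0], model_priors=[0.5, 0.5])
# Data concentrated near 0 shift posterior probability toward the mu = 0 model.
```

The same computation carries over unchanged when the marginal likelihoods come from integrating out parameters rather than from point masses; only `log_marginal` changes.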
Especially appealing is its broad generality, as it is based only on probabilistic considerations. However, two major challenges confront its practical implementation: the specification of the prior distributions and the calculation of the posterior. This will be our main focus.

The statistical properties of the Bayesian approach rest squarely on the specification of the prior distributions on the unknowns. But where do these prior distributions come from, and what do they mean? One extreme answer to this question is the pure subjective Bayesian point of view that characterizes the prior as a wholly subjective description of initial uncertainty, rendering the posterior as a subjective post-data description of uncertainty. Although logically compelling, we find this characterization to be unrealistic in complicated model selection problems, where such information is typically unavailable or difficult to precisely quantify as a probability distribution. At the other extreme is the objective Bayesian point of view, which seeks to find semi-automatic prior formulations or approximations when subjective information is unavailable. Such priors can serve as default inputs, which makes them attractive for repeated use by non-experts.

Prior specification strategies for recent Bayesian model selection implementations, including our own, have tended to fall somewhere between these two extremes. Typically, specific parametric families of proper priors are considered, thereby reducing the specification problem to that of selecting appropriate hyperparameter values. To avoid the need for subjective inputs, automatic default hyperparameter choices are often recommended. For this purpose, empirical Bayes considerations, either formal or informal, can be helpful, especially when informative choices are needed. However, subjective considerations can also be helpful, at least for roughly gauging prior location and scale and for putting small probability on implausible values. Of course, when substantial prior information is available, the Bayesian model selection implementations provide a natural environment for introducing realistic and important views.

By abandoning the pure subjective point of view, the evaluation of such Bayesian methods must ultimately involve frequentist considerations. Typically, such evaluations have taken the form of average performance over repeated simulations from hypothetical models, or of cross-validations on real data. Although such evaluations are necessarily limited in scope, the Bayesian procedures have consistently performed well compared to non-Bayesian alternatives. Although more work is clearly needed on this crucial aspect, there is cause for optimism, since by the complete class theorems of decision theory, we need look no further than Bayes and generalized Bayes procedures for good frequentist performance.

The second major challenge confronting the practical application of Bayesian model selection approaches is posterior calculation, or perhaps more accurately, posterior exploration. Recent advances in computing technology, coupled with developments in numerical and Monte Carlo methods, most notably Markov chain Monte Carlo (MCMC), have opened up new and promising directions for addressing this challenge. The basic idea behind MCMC here is the construction of a sampler which simulates a Markov chain that is converging to the posterior distribution. Although this provides a route to calculation of the full posterior, such chains are typically run for a relatively short time and used to search for high posterior models or to estimate posterior characteristics. However, constructing effective samplers and using such methods can be a delicate matter, involving problem-specific considerations such as model structure and the prior formulations. This very active area of research continues to hold promise for future developments.

In this introduction, we have described our overall point of view to provide context for the
implementations we are about to describe. In Section 2, we describe the general Bayesian approach in more detail. In Sections 3 and 4, we illustrate the practical implementation of these general ideas for Bayesian variable selection for the linear model and Bayesian CART model selection, respectively. In Section 5, we conclude with a brief discussion of related recent implementations for Bayesian model selection.

2 The General Bayesian Approach

2.1 A Probabilistic Setup for Model Uncertainty

Suppose a set of K models M = {M_1, ..., M_K} is under consideration for data Y, and that under M_k, Y has density p(Y | \theta_k, M_k), where \theta_k is a vector of unknown parameters that indexes the members of M_k. (Although we refer to M_k as a model, it is more precisely a model class.) The Bayesian approach proceeds by assigning a prior probability distribution p(\theta_k | M_k) to the parameters of each model, and a prior probability p(M_k) to each model. Intuitively, this complete specification can be understood as a three-stage hierarchical mixture model for generating the data Y: first, the model M_k is generated from p(M_1), ..., p(M_K); second, the parameter vector \theta_k is generated from p(\theta_k | M_k); and third, the data Y is generated from p(Y | \theta_k, M_k).

Letting Y_f be future observations of the same process that generated Y, this prior formulation induces a joint distribution

p(Y_f, Y, \theta_k, M_k) = p(Y_f, Y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, p(M_k).

Conditioning on the observed data Y, all remaining uncertainty is captured by the joint posterior distribution p(Y_f, \theta_k, M_k | Y). Through conditioning and marginalization, this joint posterior can be used for a variety of Bayesian inferences and decisions. For example, when the goal is exclusively prediction of Y_f, attention would focus on the predictive distribution p(Y_f | Y), which is obtained by marginalizing out both \theta_k and M_k. By averaging over the unknown models, p(Y_f | Y) properly incorporates the model uncertainty embedded in the priors. In effect, the predictive distribution sidesteps the problem of model selection, replacing it by model averaging.
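The three-stage hierarchical mixture just described can be sketched as a simulation. All specific choices below (two normal-mean models, a uniform model prior, normal parameter priors, normal data) are illustrative assumptions, not prescriptions from the article:

```python
import random

random.seed(1)

MODEL_PRIORS = [0.5, 0.5]   # p(M1), p(M2): a uniform prior over two models
PRIOR_MEANS = [0.0, 2.0]    # prior mean of theta under M1 vs. M2 (toy contrast)

def generate(n=5):
    # Stage 1: draw the model index from p(M1), ..., p(MK).
    k = 0 if random.random() < MODEL_PRIORS[0] else 1
    # Stage 2: draw the parameter theta_k from p(theta_k | Mk).
    theta = random.gauss(PRIOR_MEANS[k], 1.0)
    # Stage 3: draw the data Y from p(Y | theta_k, Mk).
    y = [random.gauss(theta, 1.0) for _ in range(n)]
    return k, theta, y

k, theta, y = generate()
# Repeating this many times draws (Mk, theta_k, Y) from the joint distribution;
# conditioning on an observed Y then yields the joint posterior described above.
```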
However, sometimes interest focuses on selecting one of the models in M for the data Y, the model selection problem. One may simply want to discover a useful, simple model from a large speculative class of models. Such a model might, for example, provide valuable scientific insights, or perhaps a less costly method for prediction than the model average. One may instead want to test a theory represented by one of a set of carefully studied models.

In terms of the three-stage hierarchical mixture formulation, the model selection problem becomes that of finding the model in M that actually generated the data, namely the model that was generated from p(M_1), ..., p(M_K) in the first step. The probability that M_k was in fact this model, conditionally on having observed Y, is the posterior model probability

p(M_k \mid Y) = \frac{p(Y \mid M_k)\, p(M_k)}{\sum_k p(Y \mid M_k)\, p(M_k)},   (2.1)

where

p(Y \mid M_k) = \int p(Y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k   (2.2)

is the marginal or integrated likelihood of M_k. Based on these posterior probabilities, pairwise comparison of models, say M_1 and M_2, is summarized by the posterior odds

\frac{p(M_1 \mid Y)}{p(M_2 \mid Y)} = \frac{p(Y \mid M_1)}{p(Y \mid M_2)} \times \frac{p(M_1)}{p(M_2)}.   (2.3)

This expression reveals how the data, through the Bayes factor, updates the prior odds to yield the posterior odds.

The model posterior distribution p(M_1 | Y), ..., p(M_K | Y) is the fundamental object of interest for model selection. Insofar as the priors p(\theta_k | M_k) and p(M_k) provide an initial representation of model uncertainty, the model posterior summarizes all the relevant information in the data Y and provides a complete post-data representation of model uncertainty. By treating p(M_k | Y) as a measure of the "truth" of model M_k, a natural and simple strategy for model selection is to choose the most probable M_k, the one for which p(M_k | Y) is largest. Alternatively, one might prefer to report a set of high posterior models along with their probabilities to convey the model uncertainty. More formally, one can motivate selection strategies based on the posterior using a decision-theoretic framework where the goal is to
maximize expected utility (Gelfand, Dey and Chang 1992, and Bernardo and Smith 1994). More precisely, let \alpha represent the action of selecting M_k, and suppose that \alpha is evaluated by a utility function u(\alpha, \Delta), where \Delta is some unknown of interest, possibly Y_f. Then the optimal selection is that \alpha which maximizes the expected utility

\int u(\alpha, \Delta)\, p(\Delta \mid Y)\, d\Delta,   (2.4)

where the predictive distribution of \Delta given Y,

p(\Delta \mid Y) = \sum_k p(\Delta \mid M_k, Y)\, p(M_k \mid Y),   (2.5)

is a posterior-weighted mixture of the conditional predictive distributions

p(\Delta \mid M_k, Y) = \int p(\Delta \mid \theta_k, M_k)\, p(\theta_k \mid M_k, Y)\, d\theta_k.   (2.6)

It is straightforward to show that if \Delta identifies one of the M_k as the true state of nature, and u(\alpha, \Delta) is 0 or 1 according to whether a correct selection has been made, then selection of the highest posterior probability model will maximize expected utility.

However, different selection strategies are motivated by other utility functions. For example, suppose \alpha entails choosing p(\Delta | M_k, Y) as a predictive distribution for a future observation \Delta, and this selection is to be evaluated by the logarithmic score function u(\alpha, \Delta) = log p(\Delta | M_k, Y). Then the best selection is that \alpha which minimizes the posterior-weighted logarithmic divergence

\sum_j p(M_j \mid Y) \int p(\Delta \mid M_j, Y)\, \log \frac{p(\Delta \mid M_j, Y)}{p(\Delta \mid M_k, Y)}\, d\Delta   (2.7)

(San Martini and Spezzaferri 1984). However, if the goal is strictly prediction and not model selection, then expected logarithmic utility is maximized by using the posterior-weighted mixture p(\Delta | Y) in (2.5). Under squared error loss, the best prediction of \Delta is the overall posterior mean

E(\Delta \mid Y) = \sum_k E(\Delta \mid M_k, Y)\, p(M_k \mid Y).   (2.8)

Such model averaging or mixing procedures incorporate model uncertainty, and have been advocated by Geisser (1993), Draper (1995), Hoeting, Madigan, Raftery and Volinsky (1999), and Clyde, Desimone and Parmigiani (1995). Note, however, that if a cost of model complexity is introduced into these utilities, then model selection may dominate model averaging.

Another interesting modification of the decision theory setup is to allow for the possibility that the true model is not one of the
M_k, a commonly held perspective in many applications. This aspect can be incorporated into a utility analysis by using the actual predictive density in place of p(\Delta | Y). In cases where the form of the true model is completely unknown, this approach serves to motivate cross-validation types of training sample approaches; see Bernardo and Smith (1994), Berger and Pericchi (1996), and Key, Pericchi and Smith (1998).

2.2 General Considerations for Prior Selection

For a given set of models M, the effectiveness of the Bayesian approach rests firmly on the specification of the parameter priors p(\theta_k | M_k) and the model space prior p(M_1), ..., p(M_K). Indeed, all of the utility results in the previous section are predicated on the assumption that this specification is correct. If one takes the subjective point of view that these priors represent the statistician's prior uncertainty about all the unknowns, then the posterior would be the appropriate update of this uncertainty after the data Y has been observed. However appealing, the pure subjective point of view here has practical limitations. Because of the sheer number and complexity of unknowns in most model uncertainty problems, it is probably unrealistic to assume that such uncertainty can be meaningfully described.

The most common and practical approach to prior specification in this context is to try and construct noninformative, semi-automatic formulations, using subjective and empirical Bayes considerations where needed. Roughly speaking, one would like to specify priors that allow the posterior to accumulate probability at or near the actual model that generated the data. At the very least, such a posterior can serve as a heuristic device to identify promising models for further examination.

Beginning with considerations for choosing the model space prior p(M_1), ..., p(M_K), a simple and popular choice is the uniform prior

p(M_k) \equiv 1/K,   (2.9)

which is noninformative in the sense of favoring all models equally. Under this prior, the model
posterior is proportional to the marginal likelihood, p(M_k | Y) \propto p(Y | M_k), and posterior odds comparisons in (2.3) reduce to Bayes factor comparisons. However, the apparent noninformativeness of (2.9) can be deceptive. Although uniform over models, it will typically not be uniform on model characteristics such as model size. A more subtle problem occurs in setups where many models are very similar and only a few are distinct. In such cases, (2.9) will not assign probability uniformly to model neighborhoods, and may bias the posterior away from good models. As will be seen in later sections, alternative model space priors that dilute probability within model neighborhoods can be meaningfully considered when specific model structures are taken into account.

Turning to the choice of parameter priors p(\theta_k | M_k), direct insertion of improper noninformative priors into (2.1) and (2.2) must be ruled out, because their arbitrary norming constants are problematic for posterior odds comparisons. Although one can avoid some of these difficulties with constructs such as intrinsic Bayes factors (Berger and Pericchi 1996) or fractional Bayes factors (O'Hagan 1995), many Bayesian model selection implementations, including our own, have stuck with proper parameter priors, especially in large problems. Such priors guarantee the internal coherence of the Bayesian formulation, allow for meaningful hyperparameter specifications, and yield proper posterior distributions, which are crucial for the MCMC posterior calculation and exploration described in the next section.

Several features are typically used to narrow down the choice of proper parameter priors. To ease the computational burden, it is very useful to choose priors under which rapidly computable closed form expressions for the marginal p(Y | M_k) in (2.2) can be obtained. For exponential family models, conjugate priors serve this purpose and so have been commonly used. When such priors are not used, as is sometimes necessary outside the exponential family, computational efficiency may be
obtained with the approximations of p(Y | M_k) described in Section 2.3. In any case, it is useful to parametrize p(\theta_k | M_k) by a small number of interpretable hyperparameters. For nested model formulations, which are obtained by setting certain parameters to zero, it is often natural to center the priors of such parameters at zero, further simplifying the specification. A crucial challenge is setting the prior dispersion. It should be large enough to avoid too much prior influence, but small enough to avoid overly diffuse specifications that tend to downweight p(Y | M_k) through (2.2), resulting in too little probability on M_k. For this purpose, we have found it useful to consider subjective inputs and empirical Bayes estimates.

2.3 Extracting Information from the Posterior

Once the priors have been chosen, all the needed information for Bayesian inference and decision is implicitly contained in the posterior. In large problems, where exact calculation of (2.1) and (2.2) is not feasible, Markov chain Monte Carlo (MCMC) can often be used to extract such information by simulating an approximate sample from the posterior. Such samples can be used to estimate posterior characteristics or to explore the posterior, searching for models with high posterior probability.

For a model characteristic \eta, MCMC entails simulating a Markov chain, say \eta^{(1)}, \eta^{(2)}, ..., that is converging to its posterior distribution p(\eta | Y). Typically, \eta will be an index of the models M_k, or an index of the values of (\theta_k, M_k). Simulation of \eta^{(1)}, \eta^{(2)}, ... requires a starting value \eta^{(0)} and proceeds by successive simulation from a probability transition kernel p(\eta | \eta^{(j)}); see Meyn and Tweedie (1993). Two of the most useful prescriptions for constructing a kernel that generates a Markov chain converging to a given p(\eta | Y) are the Gibbs sampler (GS) (Geman and Geman 1984, Gelfand and Smith 1990) and the Metropolis-Hastings (MH) algorithms (Metropolis 1953, Hastings 1970). Introductions to these methods can be found in Casella and George
(1992) and Chib and Greenberg (1995). More general treatments that detail precise convergence conditions (essentially irreducibility and aperiodicity) can be found in Besag and Green (1993), Smith and Roberts (1993), and Tierney (1994).

When \eta \in R^p, the GS is obtained by successive simulations from the full conditional component distributions p(\eta_i | \eta_{-i}), i = 1, ..., p, where \eta_{-i} denotes the most recently updated component values of \eta other than \eta_i. Such GS algorithms reduce the problem of simulating from p(\eta | Y) to a sequence of one-dimensional simulations.

MH algorithms work by successive sampling from an essentially arbitrary probability transition kernel q(\eta | \eta^{(j)}) and imposing a random rejection step at each transition. When the dimension of \eta remains fixed, an MH algorithm is defined by:

1. Simulate a candidate \eta' from the transition kernel q(\eta' | \eta^{(j)}).
2. Set \eta^{(j+1)} = \eta' with probability

\alpha(\eta' \mid \eta^{(j)}) = \min\left\{1,\ \frac{q(\eta^{(j)} \mid \eta')}{q(\eta' \mid \eta^{(j)})}\, \frac{p(\eta' \mid Y)}{p(\eta^{(j)} \mid Y)}\right\}.

Otherwise, set \eta^{(j+1)} = \eta^{(j)}.

This is a special case of the more elaborate reversible jump MH algorithms (Green 1995), which can be used when the dimension of \eta is changing. The general availability of such MH algorithms derives from the fact that p(\eta | Y) is only needed up to the norming constant for the calculation of \alpha above.

There are endless possibilities for constructing Markov transition kernels p(\eta | \eta^{(j)}) that guarantee convergence to p(\eta | Y). The GS can be applied to different groupings and reorderings of the coordinates, and these can be randomly chosen. For MH algorithms, only weak conditions restrict considerations of the choice of q(\eta | \eta^{(j)}), which can also be considered componentwise. The GS and MH algorithms can be combined and used together in many ways. Recently proposed variations such as tempering, importance sampling, perfect sampling, and augmentation offer a promising wealth of further possibilities for sampling the posterior. As with prior specification, the construction of effective transition kernels, and how they can be exploited, is meaningfully guided
by problem-specific considerations, as will be seen in later sections. Various illustrations of the broad practical potential of MCMC are described in Gilks, Richardson and Spiegelhalter (1996).

The use of MCMC to simulate the posterior distribution of a model index \eta is greatly facilitated when rapidly computable closed form expressions for the marginal p(Y | M_k) in (2.2) are available. In such cases, p(Y | \eta) p(\eta) \propto p(\eta | Y) can be used to implement GS and MH algorithms. Otherwise, one can simulate an index of the values of (\theta_k, M_k), or at least M_k and the values of parameters that cannot be eliminated analytically. When the dimension of such an index is changing, MCMC implementations for this purpose typically require more delicate design; see Carlin and Chib (1995), Dellaportas, Forster and Ntzoufras (2000), Geweke (1996), Green (1995), Kuo and Mallick (1998), and Phillips and Smith (1996).

Because of the computational advantages of having closed form expressions for p(Y | M_k), it may be preferable to use a computable approximation for p(Y | M_k) when exact expressions are unavailable. An effective approximation for this purpose, when h(\theta_k) \equiv \log[p(Y | \theta_k, M_k)\, p(\theta_k | M_k)] is sufficiently well behaved, is obtained by Laplace's method (see Tierney and Kadane 1986) as

p(Y \mid M_k) \approx (2\pi)^{d_k/2}\, |\tilde{H}(\tilde{\theta}_k)|^{1/2}\, p(Y \mid \tilde{\theta}_k, M_k)\, p(\tilde{\theta}_k \mid M_k),   (2.10)

where d_k is the dimension of \theta_k, \tilde{\theta}_k is the maximum of h(\theta_k), namely the posterior mode of p(\theta_k | Y, M_k), and \tilde{H}(\tilde{\theta}_k) is minus the inverse Hessian of h evaluated at \tilde{\theta}_k. This is obtained
212 where n is the sample size This last approximation was successfully implemented for model averaging in a survival analysis problem by Raftery Madigan and Volinsky 1996 Although it does not explicitly depend on a parameter prior 212 may be considered an implicit approximation to pY l M under a unit information prior77 Kass and Wasser man 1995 or under a normalized Jeffreys prior Wasserman 2000 It should be emphasized that the asymptotic justi cation for these successive approximations 210 211 212 may not be very good in small samples see for example McCulloch and Rossi 1991 3 Bayesian Variable Selection for the Linear Model Suppose Y a variable of interest and X1 Xp a set of potential explanatory variables or predictors are vectors of n observations The problem of variable selection or subset selection as it often called arises when one wants to model the relationship between Y and a subset of X1 Xp but there is uncertainty about which subset to use Such a situation is particularly of interest when p is large and X1 X1 is thought to contain many redundant or irrelevant variables The variable selection problem is usually posed as a special case of the model selec tion problem where each model under consideration corresponds to a distinct subset of X1 Xp This problem is most familiar in the context of multiple regression where at tention is restricted to normal linear models Many of the fundamental developments in variable selection have occurred in the context of the linear model in large part because its analytical tractability greatly facilitates insight and computational reduction and because it provides a simple rst order approximation to more complex relationships Furthermore many problems of interest can be posed as linear variable selection prob lems For example for the problem of nonparametric function estimation the values of the unknown function are represented by Y and a linear basis such as a wavelet basis or 76 H Chipman E I George and R E McCulloch 
a spline basis, is represented by X_1, ..., X_p. The problem of finding a parsimonious approximation to the function is then the linear variable selection problem. Finally, when the normality assumption is inappropriate, such as when Y is discrete, solutions for the linear model can be extended to alternatives such as generalized linear models (McCullagh and Nelder 1989).

We now proceed to consider Bayesian approaches to this important linear variable selection problem. Suppose the normal linear model is used to relate Y to the potential predictors X_1, ..., X_p:

Y \sim N_n(X\beta, \sigma^2 I),   (3.1)

where X = (X_1, ..., X_p), \beta is a p x 1 vector of unknown regression coefficients, and \sigma^2 is an unknown positive scalar. The variable selection problem arises when there is some unknown subset of the predictors with regression coefficients so small that it would be preferable to ignore them. In Sections 3.2 and 3.4, we describe two Bayesian formulations of this problem, which are distinguished by their interpretation of how small a regression coefficient must be to ignore X_i. It will be convenient throughout to index each of these 2^p possible subset choices by the vector \gamma = (\gamma_1, ..., \gamma_p)', where \gamma_i = 0 or 1 according to whether \beta_i is small or large, respectively. We use q_\gamma \equiv \gamma'1 to denote the size of the \gammath subset. Note that here \gamma plays the role of the model identifier M_k described in Section 2.

We will assume throughout this section that X_1, ..., X_p contains no variable that would be included in every possible model. If additional predictors Z_1, ..., Z_r were to be included in every model, then we would assume that their linear effect had been removed by replacing Y and X_1, ..., X_p with (I - Z(Z'Z)^{-1}Z')Y and (I - Z(Z'Z)^{-1}Z')X_i, i = 1, ..., p, where Z = (Z_1, ..., Z_r). For example, if an intercept were to be included in every model, then we would assume that Y and X_1, ..., X_p had all been centered to have mean 0. Such reductions are simple and fast, and can be motivated from a formal Bayesian perspective by integrating out the coefficients corresponding to Z_1, ..., Z_r with respect to an improper uniform prior.

3.1 Model Space Priors for Variable Selection

For the specification of the model space prior, most Bayesian variable selection implementations have used independence priors of the form

p(\gamma) = \prod_i w_i^{\gamma_i} (1 - w_i)^{1 - \gamma_i},   (3.2)

which are easy to specify, substantially reduce computational requirements, and often yield sensible results; see, for example, Clyde, Desimone and Parmigiani (1996), George and McCulloch (1993, 1997), Raftery, Madigan and Hoeting (1997), and Smith and Kohn (1996). Under this prior, each X_i enters the model independently of the other coefficients, with probability p(\gamma_i = 1) = 1 - p(\gamma_i = 0) = w_i. Smaller w_i can be used to downweight X_i which are costly or of less interest.

A useful reduction of (3.2) has been to set w_i \equiv w, yielding

p(\gamma) = w^{q_\gamma} (1 - w)^{p - q_\gamma},   (3.3)

in which case the hyperparameter w is the a priori expected proportion of X_i's in the model. In particular, setting w = 1/2 yields the popular uniform prior

p(\gamma) \equiv 2^{-p},   (3.4)

which is often used as a representation of ignorance. However, this prior puts most of its weight near models of size q_\gamma = p/2, because there are more of them. Increased weight on parsimonious models, for example, could instead be obtained by setting w small. Alternatively, one could put a prior on w. For example, combined with a beta prior w \sim Beta(\alpha, \beta), (3.3) yields

p(\gamma) = \frac{B(\alpha + q_\gamma, \beta + p - q_\gamma)}{B(\alpha, \beta)},   (3.5)

where B(., .) is the beta function. More generally, one could simply put a prior h(q_\gamma) on the model dimension and let

p(\gamma) = \binom{p}{q_\gamma}^{-1} h(q_\gamma),   (3.6)

of which (3.5) is a special case. Under priors of the form (3.6), the components of \gamma are exchangeable but not independent, except for the special case (3.3).

Independence and exchangeable priors on \gamma may be less satisfactory when the models under consideration contain dependent components, such as might occur with interactions, polynomials, lagged variables, or indicator variables (Chipman 1996). Common practice often rules out certain models from consideration, such as a model with an X1X2 interaction but no X1 or X2 linear terms. Priors on \gamma can encode such preferences. With interactions, the prior for \gamma can capture the
dependence relation between the importance of a higher order term and those lower order terms from which it was formed. For example, suppose there are three independent main effects A, B, C, and three two-factor interactions AB, AC, and BC. The importance of an interaction such as AB will often depend only on whether the main effects A and B are included in the model. This belief can be expressed by a prior for \gamma = (\gamma_A, \gamma_B, \gamma_C, \gamma_{AB}, \gamma_{AC}, \gamma_{BC}) of the form

p(\gamma) = p(\gamma_A)\, p(\gamma_B)\, p(\gamma_C)\, p(\gamma_{AB} \mid \gamma_A, \gamma_B)\, p(\gamma_{AC} \mid \gamma_A, \gamma_C)\, p(\gamma_{BC} \mid \gamma_B, \gamma_C).   (3.7)

The specification of terms like p(\gamma_{AC} | \gamma_A, \gamma_C) in (3.7) would entail specifying four probabilities, one for each of the values of (\gamma_A, \gamma_C). Typically, p(\gamma_{AC} | 0, 0) < p(\gamma_{AC} | 1, 0), p(\gamma_{AC} | 0, 1) < p(\gamma_{AC} | 1, 1). Similar strategies can be considered to downweight or eliminate models with isolated high order terms in polynomial regressions, or isolated high order lagged variables in ARIMA models. With indicators for a categorical predictor, it may be of interest to include either all or none of the indicators, in which case p(\gamma) = 0 for any \gamma violating this condition.

The number of possible models using interactions, polynomials, lagged variables, or indicator variables grows combinatorially as the number of variables increases. In contrast to independence priors of the form (3.2), an advantage of priors for dependent component models, such as (3.7), is that they concentrate mass on plausible models when the number of possible models is huge. This can be crucial in applications such as screening designs, where the number of candidate predictors may exceed the number of observations (Chipman, Hamada and Wu 1997).

Another, more subtle shortcoming of independence and exchangeable priors on \gamma is their failure to account for similarities and differences between models due to covariate collinearity or redundancy. An interesting alternative in this regard are priors that dilute probability across neighborhoods of similar models, the so-called dilution priors (George 1999). Consider the following simple example.
Suppose that only two uncorrelated predictors X1 and X2 are considered, and that they yield the following posterior probabilities:

Variables in γ | X1 | X2 | X1, X2
p(γ | Y) | 0.3 | 0.4 | 0.2

Suppose now a new potential predictor X3 is introduced, and that X3 is very highly correlated with X2, but not with X1. If the model prior is elaborated in a sensible way, as is discussed below, the posterior may well look something like:

Variables in γ | X1 | X2 | X3 | X1, X2
p(γ | Y) | 0.3 | 0.13 | 0.13 | 0.06

The probability allocated to X1 remains unchanged, whereas the probability allocated to X2 and (X1, X2) has been diluted across all the new models containing X3. Such dilution seems desirable because it maintains the allocation of posterior probability across neighborhoods of similar models. The introduction of X3 has added proxies for the models containing X2, but not any really new models. The probability of the resulting set of equivalent models should not change, and it is dilution that prevents this from happening. Note that this dilution phenomenon would become much more pronounced when many highly correlated variables are under consideration.

The dilution phenomenon is controlled completely by the model space prior p(γ), because p(γ | Y) ∝ p(Y | γ) p(γ) and the marginal p(Y | γ) is unaffected by changes to the model space. Indeed, no dilution of neighborhood probabilities occurs under the uniform prior (3.4), where p(γ | Y) ∝ p(Y | γ). Instead, the posterior probability of every γ is reduced, while all pairwise posterior odds are maintained. For instance, when X3 is introduced above and a uniform prior is used, the posterior probabilities become something like:

Variables in γ | X1 | X2 | X3 | X1, X2
p(γ | Y) | 0.15 | 0.2 | 0.2 | 0.1

If we continued to introduce more proxies for X2, the probability of the X1-only model could be made arbitrarily small and the overall probability of the X2-like models could be made arbitrarily large, a disturbing feature if Y was strongly related to X1 and unrelated to X2. Note that any independence prior (3.2), of which (3.4) is a special case,
will also fail to maintain probability allocation within neighborhoods of similar models, because the addition of a new X_j reduces all the model probabilities by the factor w_j for models in which X_j is included, and by (1 − w_j) for models in which X_j is excluded.

What are the advantages of dilution priors? Dilution priors avoid placing too little probability on good, but unique, models as a consequence of massing excess probability on large sets of bad, but similar, models. Thus dilution priors are desirable for model averaging over the entire posterior, to avoid biasing averages such as (2.8) away from good models. They are also desirable for MCMC sampling of the posterior, because such Markov chains gravitate towards regions of high probability. Failure to dilute the probability across clusters of many bad models would bias both model search and model averaging estimates towards those bad models. That said, it should be noted that dilution priors would not be appropriate for pairwise model comparisons, because the relative strengths of two models should not depend on whether another is considered. For this purpose, Bayes factors (corresponding to selection under uniform priors) would be preferable.

3.2 Parameter Priors for Selection of Nonzero β_γ

We now consider parameter prior formulations for variable selection where the goal is to ignore only those X_i for which β_i = 0 in (3.1). In effect, the problem then becomes that of selecting a submodel of (3.1) of the form

p(Y | β_γ, σ², γ) = N_n(X_γ β_γ, σ² I),   (3.8)

where X_γ is the n × q_γ matrix whose columns correspond to the γth subset of X_1, …, X_p, β_γ is a q_γ × 1 vector of unknown regression coefficients, and σ² is the unknown residual variance. Here, (β_γ, σ²) plays the role of the model parameter θ_k described in Section 2. Perhaps the most useful and commonly applied parameter prior form for this setup, especially in large problems, is the normal-inverse-gamma, which consists of a q_γ-dimensional normal prior on β_γ,

p(β_γ | σ², γ) = N_{q_γ}(β̄_γ, σ² Σ_γ),
(3.9)

coupled with an inverse gamma prior on σ²,

p(σ² | γ) = p(σ²) = IG(ν/2, νλ/2),   (3.10)

which is equivalent to νλ/σ² ~ χ²_ν. For example, see Clyde, DeSimone and Parmigiani (1996), Fernandez, Ley and Steel (2001), Garthwaite and Dickey (1992, 1996), George and McCulloch (1997), Kuo and Mallick (1998), Raftery, Madigan and Hoeting (1997) and Smith and Kohn (1996). Note that the coefficient prior (3.9), when coupled with p(γ), implicitly assigns a point mass at zero for coefficients in (3.1) that are not contained in β_γ. As such, (3.9) may be thought of as a point-normal prior. It should also be mentioned that in one of the first Bayesian variable selection treatments of the setup (3.8), Mitchell and Beauchamp (1988) proposed spike-and-slab priors. The normal-inverse-gamma prior above is obtained by simply replacing their uniform slab by a normal distribution.

A valuable feature of the prior combination (3.9) and (3.10) is analytical tractability: the conditional distribution of β_γ and σ² given γ is conjugate for (3.8), so that β_γ and σ² can be eliminated by routine integration from p(Y, β_γ, σ² | γ) = p(Y | β_γ, σ², γ) p(β_γ | σ², γ) p(σ² | γ) to yield

p(Y | γ) ∝ |X'_γ X_γ + Σ_γ^{−1}|^{−1/2} |Σ_γ|^{−1/2} (νλ + S²_γ)^{−(n+ν)/2},   (3.11)

where

S²_γ = Y'Y − Y'X_γ (X'_γ X_γ + Σ_γ^{−1})^{−1} X'_γ Y.   (3.12)

As will be seen in subsequent sections, the use of these closed form expressions can substantially speed up posterior evaluation and MCMC exploration. Note that the scale
of the prior (3.9) for β_γ depends on σ², and this is needed to obtain conjugacy. Integrating out σ² with respect to (3.10), the prior for β_γ conditionally only on γ is

p(β_γ | γ) = T_{q_γ}(ν, β̄_γ, λ Σ_γ),   (3.13)

the q_γ-dimensional multivariate T distribution centered at β̄_γ with ν degrees of freedom and scale λΣ_γ.

The priors (3.9) and (3.10) are determined by the hyperparameters β̄_γ, Σ_γ, λ and ν, which must be specified for implementations. Although a good deal of progress can be made through subjective elicitation of these hyperparameter values in smaller problems when substantial expert information is available (for example, see Garthwaite and Dickey 1996), we focus here on the case where such information is unavailable and the goal is roughly to assign values that minimize prior influence.

Beginning with the choice of λ and ν, note that (3.10) corresponds to the likelihood information about σ² provided by ν independent observations from a N(0, λ) distribution. Thus, λ may be thought of as a prior estimate of σ², and ν may be thought of as the prior sample size associated with this estimate. By using the data and treating s²_Y, the sample variance of Y, as a rough upper bound for σ², a simple default strategy is to choose ν small, say around 3, and λ near s²_Y. One might also go a bit further, by treating s²_FULL, the traditional unbiased estimate of σ² based on a saturated model, as a rough lower bound for σ², and then choosing λ and ν so that (3.10) assigns substantial probability to the interval (s²_FULL, s²_Y). Similar informal approaches based on the data are considered by Clyde, Desimone and Parmigiani (1996) and Raftery, Madigan and Hoeting (1997). Alternatively, the explicit choice of λ and ν can be avoided by using p(σ² | γ) ∝ 1/σ², the limit of (3.10) as ν → 0, a choice recommended by Smith and Kohn (1996) and Fernandez, Ley and Steel (2001). This prior corresponds to the uniform distribution on log σ², and is invariant to scale changes in Y. Although improper, it yields proper marginals p(Y | γ) when combined with (3.9), and so can be used formally.

Turning to (3.9), the usual default for the prior mean β̄_γ has been β̄_γ = 0, a neutral choice reflecting indifference between positive and negative values. This specification is also consistent with standard Bayesian approaches to testing point null hypotheses, where, under the alternative, the prior is typically centered at the point null value. For choosing the prior covariance matrix Σ_γ, the specification is substantially simplified by setting Σ_γ = c V_γ, where c is a scalar and V_γ is a preset form such as V_γ = (X'_γ X_γ)^{−1} or V_γ = I_{q_γ}, the q_γ × q_γ identity matrix. Note that under such V_γ, the conditional priors (3.9) provide a consistent description of uncertainty in
the sense that they are the conditional distributions of the nonzero components of β given γ when β ~ N_p(0, c σ² (X'X)^{−1}) or β ~ N_p(0, c σ² I), respectively. The choice V_γ = (X'_γ X_γ)^{−1} serves to replicate the covariance structure of the likelihood, and yields the g-prior recommended by Zellner (1986). With V_γ = I_{q_γ}, the components of β_γ are conditionally independent, causing (3.9) to weaken the likelihood covariance structure. In contrast to V_γ = (X'_γ X_γ)^{−1}, the effect of V_γ = I_{q_γ} on the posterior depends on the relative scaling of the predictors. In this regard, it may be reasonable to rescale the predictors in units of standard deviation to give them a common scaling, although this may be complicated by the presence of outliers or skewed distributions.

Having fixed V_γ, the goal is then to choose c large enough so that the prior is relatively flat over the region of plausible values of β_γ, thereby reducing prior influence (Edwards, Lindman and Savage 1963). At the same time, however, it is important to avoid excessively large values of c, because the prior will eventually put increasing weight on the null model as c → ∞, a form of the Bartlett-Lindley paradox (Bartlett 1957). For practical purposes, a rough guide is to choose c so that (3.13) assigns substantial probability to the range of all plausible values for β_γ. Raftery, Madigan and Hoeting (1997), who used a combination of V_γ = I_{q_γ} and V_γ = (X'_γ X_γ)^{−1} with standardized predictors, list various desiderata along the lines of this rough guide which lead them to the choice c = 2.85². They also note that their resulting coefficient prior is relatively flat over the actual distribution of coefficients from a variety of real data sets. Smith and Kohn (1996), who used V_γ = (X'_γ X_γ)^{−1}, recommend c = 100, and report that performance was insensitive to values of c between 10 and 10,000. Fernandez, Ley and Steel (2001) perform a simulation evaluation of the effect of various choices for c, with V_γ = (X'_γ X_γ)^{−1}, p(σ² | γ) ∝ 1/σ² and p(γ) ≡ 2^{−p}, on the posterior probability of the true model.
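To make the role of c concrete, the following small Python sketch (our own illustration, not code from the paper; the simulated data, function name and default hyperparameter values are our choices) evaluates log g(γ) ∝ log[p(Y | γ) p(γ)] under the g-prior Σ_γ = c (X'_γ X_γ)^{−1} with β̄_γ = 0, the inverse gamma prior (3.10), and the independence prior (3.3), and then enumerates all 2^p models for a toy problem:

```python
import itertools
import numpy as np

def log_g(Y, X, gamma, c=100.0, w=0.5, nu=3.0, lam=1.0):
    """log g(gamma), proportional to log p(gamma | Y), for the point-normal
    setup with Sigma_gamma = c (X_g' X_g)^{-1} (Zellner's g-prior) and
    sigma^2 ~ IG(nu/2, nu*lam/2), using the closed form for p(Y | gamma).
    Hyperparameter defaults here are illustrative only."""
    n, p = X.shape
    q = int(sum(gamma))
    S2 = float(Y @ Y)  # residual term for the null model
    if q > 0:
        Xg = X[:, [i for i in range(p) if gamma[i] == 1]]
        bhat = np.linalg.solve(Xg.T @ Xg, Xg.T @ Y)
        # With the g-prior, S^2_gamma = Y'Y - (c/(1+c)) Y'X_g (X_g'X_g)^{-1} X_g'Y
        S2 -= (c / (1.0 + c)) * float(Y @ (Xg @ bhat))
    return (-0.5 * q * np.log(1.0 + c)
            - 0.5 * (n + nu) * np.log(nu * lam + S2)
            + q * np.log(w) + (p - q) * np.log(1.0 - w))

# Exhaustive posterior over all 2^p models for a small simulated problem
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.standard_normal((n, p))
Y = 2.0 * X[:, 0] + rng.standard_normal(n)   # only X1 matters

models = list(itertools.product([0, 1], repeat=p))
lg = np.array([log_g(Y, X, g) for g in models])
post = np.exp(lg - lg.max())
post /= post.sum()                            # normalized p(gamma | Y)
```

Rerunning with, say, c = 10 versus c = 10,000 shows directly how larger c shifts posterior mass toward smaller models, the kind of effect studied in the evaluations cited above.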
Noting how the effect depends on the true model and noise level, they recommend c = max(p², n).

3.3 Calibration and Empirical Bayes Variable Selection

An interesting connection between Bayesian and non-Bayesian approaches to variable selection occurs when the special case of (3.9) with β̄_γ = 0 and V_γ = (X'_γ X_γ)^{−1}, namely

p(β_γ | σ², γ) = N_{q_γ}(0, c σ² (X'_γ X_γ)^{−1}),   (3.14)

is combined with

p(γ) = w^{q_γ} (1 − w)^{p − q_γ},   (3.15)

the simple independence prior in (3.3); for the moment, σ² is treated as known. As shown by George and Foster (2000), this prior setup can be calibrated by choices of c and w so that the same γ maximizes both the model posterior and the canonical penalized sum-of-squares criterion

SS_γ/σ² − F q_γ,   (3.16)

where SS_γ = β̂'_γ X'_γ X_γ β̂_γ, β̂_γ = (X'_γ X_γ)^{−1} X'_γ Y, and F is a fixed penalty. This correspondence may be of interest because a wide variety of popular model selection criteria are obtained by maximizing (3.16) with particular choices of F and with σ² replaced by an estimate σ̂². For example, F = 2 yields Cp (Mallows 1973) and, approximately, AIC (Akaike 1973); F = log n yields BIC (Schwarz 1978); and F = 2 log p yields RIC (Donoho and Johnstone 1994, Foster and George 1994). The motivations for these choices are varied: Cp is motivated as an unbiased estimate of predictive risk, AIC by an expected information distance, BIC by an asymptotic Bayes factor, and RIC by minimax predictive risk inflation.

The posterior correspondence with (3.16) is obtained by reexpressing the model posterior under (3.14) and (3.15) as

p(γ | Y) ∝ w^{q_γ} (1 − w)^{p − q_γ} (1 + c)^{−q_γ/2} exp{ −(Y'Y − SS_γ)/(2σ²) − SS_γ/(2σ²(1 + c)) }
       ∝ exp{ [c/(2(1 + c))] [SS_γ/σ² − F(c, w) q_γ] },   (3.17)

where

F(c, w) = ((1 + c)/c) { 2 log[(1 − w)/w] + log(1 + c) }.   (3.18)

The expression (3.17) reveals that, as a function of γ for fixed Y, p(γ | Y) is increasing in (3.16) when F = F(c, w). Thus, both (3.16) and (3.17) are simultaneously maximized by the same γ when c and w are chosen to satisfy F(c, w) = F. In this case, Bayesian model selection based on p(γ | Y) is equivalent to model selection based on the criterion (3.16).

This correspondence between seemingly different approaches to model
selection provides additional insight and interpretability for users of either approach. In particular, when c and w are such that F(c, w) = 2, log n or 2 log p, selecting the highest posterior model (with σ² set equal to σ̂²) will be equivalent to selecting the best AIC/Cp, BIC or RIC models, respectively. For example, F(c, w) ≈ 2, log n and 2 log p are obtained when c ≈ 3.92, n and p², respectively, all with w = 1/2. Similar asymptotic connections are pointed out by Fernandez, Ley and Steel (2001) when p(σ² | γ) ∝ 1/σ² and w = 1/2. Because the posterior probabilities are monotone in (3.16) when F = F(c, w), this correspondence also reveals that the MCMC methods discussed in Section 3.5 can also be used to search for large values of (3.16) in large problems where global maximization is not computationally feasible. Furthermore, since c and w control the expected size and proportion of the nonzero components of β, the dependence of F(c, w) on c and w provides an implicit connection between the penalty F and the profile of models for which its value may be appropriate.

Ideally, the prespecified values of c and w in (3.14) and (3.15) will be consistent with the true underlying model. For example, large c will be chosen when the regression coefficients are large, and small w will be chosen when the proportion of nonzero coefficients is small. To avoid the difficulties of choosing such c and w when the true model is completely unknown, it may be preferable to treat c and w as unknown parameters, and use empirical Bayes estimates of c and w based on the data. Such estimates can be obtained, at least in principle, as the values of c and w that maximize the marginal likelihood under (3.14) and (3.15), namely

L(c, w | Y) ∝ Σ_γ p(γ | w) p(Y | γ, c) ∝ Σ_γ w^{q_γ} (1 − w)^{p − q_γ} (1 + c)^{−q_γ/2} exp{ c SS_γ / (2σ²(1 + c)) }.   (3.19)

Although this maximization is generally impractical when p is large, the likelihood (3.19) simplifies considerably when X is orthogonal, a setup that occurs naturally in nonparametric function estimation with orthogonal bases such as
wavelets. In this case, letting t_i = b_i v_i / σ, where v_i² is the ith diagonal element of X'X and b_i is the ith component of β̂ = (X'X)^{−1} X'Y, (3.19) reduces to

L(c, w | Y) ∝ ∏_{i=1}^{p} [ (1 − w) e^{−t_i²/2} + w (1 + c)^{−1/2} e^{−t_i²/(2(1+c))} ].   (3.20)

Since many fewer terms are involved in the product in (3.20) than in the sum in (3.19), maximization of (3.20) is feasible by numerical methods, even for moderately large p. Replacing σ² by an estimate σ̂², the estimators ĉ and ŵ that maximize the marginal likelihood L above can be used as prior inputs for an empirical Bayes analysis under the priors (3.14) and (3.15). In particular, (3.17) reveals that the γ maximizing the posterior p(γ | Y, ĉ, ŵ) can be obtained as the γ that maximizes the marginal maximum likelihood criterion

C_MML = SS_γ/σ̂² − F(ĉ, ŵ) q_γ,   (3.21)

where F(c, w) is given by (3.18). Because maximizing (3.19) to obtain ĉ and ŵ can be computationally overwhelming when p is large and X is not orthogonal, one might instead consider a computable empirical Bayes approximation, the conditional maximum likelihood criterion

C_CML = SS_γ/σ̂² − q_γ { 1 + log₊(SS_γ/(σ̂² q_γ)) } + 2 [ q_γ log(q_γ/p) + (p − q_γ) log(1 − q_γ/p) ],   (3.22)

where log₊(·) is the positive part of log(·). Selecting the γ that maximizes C_CML provides an approximate empirical Bayes alternative to selection based on C_MML.

In contrast to criteria of the form (3.16), which penalize SS_γ/σ̂² by F q_γ with F constant, C_MML uses an adaptive penalty F(ĉ, ŵ) that is implicitly based on the estimated distribution of the regression coefficients. C_CML also uses an adaptive penalty, but one that can be expressed in a rapidly computable closed form, and that can be shown to act like a combination of a modified BIC penalty F = log n, which gives it the same consistency property as BIC, and a modified RIC penalty F = 2 log p. Insofar as maximizing C_CML approximates maximizing C_MML, these interpretations at least roughly explain the behavior of the C_MML penalty F(ĉ, ŵ) in (3.21).

George and Foster (2000) proposed the empirical Bayes criteria C_MML and C_CML, and provided simulation evaluations demonstrating substantial performance advantages
over the fixed penalty criteria (3.16): selection using C_MML delivers excellent performance over a much wider portion of the model space, and C_CML performs nearly as well. The superiority of empirical Bayes methods was confirmed in the context of wavelet regression by Johnstone and Silverman (1998) and Clyde and George (1999). Johnstone and Silverman (1998) demonstrated the superiority of using maximum marginal likelihood estimates of c and w with posterior median selection criteria, and proposed EM algorithms for implementation. Clyde and George (1999) also proposed EM algorithms for implementation, extended the methods to include empirical Bayes estimates of σ², and considered both model selection and model averaging.

Finally, a fully Bayes analysis, which integrates out c and w with respect to some noninformative prior p(c, w), could be a promising alternative to empirical Bayes estimation of c and w. Indeed, the maximum marginal likelihood estimates ĉ and ŵ correspond to the posterior mode estimates under a Bayes formulation with independent uniform priors on c and w, a natural default choice. As such, the empirical Bayes methods can be considered as approximations to fully Bayes methods, but approximations which do not fully account for the uncertainty surrounding c and w. We are currently investigating the potential of such fully Bayes alternatives and plan to report on them elsewhere.

3.4 Parameter Priors for Selection Based on Practical Significance

A potential drawback of the point-normal prior (3.9) for variable selection is that, with enough data, the posterior will favor the inclusion of X_i for any β_i ≠ 0, no matter how small. Although this might be desirable from a purely predictive standpoint, it can also run counter to the goals of parsimony and interpretability in some problems, where it would be preferable to ignore such negligible β_i. A similar phenomenon occurs in frequentist hypothesis testing, where, for large enough sample sizes, small departures from a point null become statistically significant,
even though they are not practically significant or meaningful.

An alternative to the point-normal prior (3.9) which avoids this potential drawback is the normal-normal formulation used in the stochastic search variable selection (SSVS) procedure of George and McCulloch (1993, 1996, 1997). This formulation builds in the goal of excluding X_i from the model whenever |β_i| < δ_i for a given δ_i > 0. The idea is that δ_i is a "threshold of practical significance" that is prespecified by the user. A simple choice might be δ_i = ΔY/ΔX_i, where ΔY is the size of an insignificant change in Y, and ΔX_i is the size of the maximum feasible change in X_i. To account for the cumulative effect of changes of other X's in the model, one might prefer the smaller conservative choice δ_i = ΔY/(p ΔX_i). The practical potential of the SSVS formulation is nicely illustrated by Wakefield and Bennett (1996).

Under the normal-normal formulation of SSVS, the data always follow the saturated model (3.1), so that

p(Y | β, σ², γ) = N_n(Xβ, σ² I)   (3.23)

for all γ. In the general notation of Section 2, the model parameters here are the same for every model, θ_k ≡ (β, σ²). The γth model is instead distinguished by a coefficient prior of the form

p(β | σ², γ) = p(β | γ) = N_p(0, D_γ R D_γ),   (3.24)

where R is a correlation matrix and D_γ is a diagonal matrix with ith diagonal element

(D_γ)_ii = √v_{0i} when γ_i = 0, and √v_{1i} when γ_i = 1.   (3.25)

Under the model space prior p(γ), the marginal prior distribution of each component of β is

p(β_i) = (1 − w_i) N(0, v_{0i}) + w_i N(0, v_{1i}),   (3.26)

a scale mixture of two normal distributions.

Although β is independent of σ² in (3.24), the inverse gamma prior (3.10) for σ² is still useful, as are the specification considerations for it discussed in Section 3.2. Furthermore, R ∝ (X'X)^{−1} and R = I are natural choices for R in (3.24), similarly to the commonly used choices for Σ_γ in (3.9).

To use this normal-normal setup for variable selection, the hyperparameters v_{0i} and v_{1i} are set "small and large" respectively, so that N(0, v_{0i}) is concentrated and N(0, v_{1i}) is diffuse. The general idea is that
when the data support γ_i = 0 over γ_i = 1, then β_i is probably small enough so that X_i will not be needed in the model. For a given threshold δ_i, higher posterior weighting of those γ values for which |β_i| > δ_i when γ_i = 1 can be achieved by choosing v_{0i} and v_{1i} such that p(β_i | γ_i = 0) = N(0, v_{0i}) > p(β_i | γ_i = 1) = N(0, v_{1i}) precisely on the interval (−δ_i, δ_i). This property is obtained by any v_{0i} and v_{1i} satisfying

δ_i² = log(v_{1i}/v_{0i}) (v_{0i}^{−1} − v_{1i}^{−1})^{−1}.   (3.27)

By choosing v_{1i} such that N(0, v_{1i}) is consistent with plausible values of β_i, v_{0i} can then be chosen according to (3.27). George and McCulloch (1997) report that computational problems, and difficulties with v_{1i} too large, will be avoided whenever v_{1i}/v_{0i} ≤ 10,000, thus allowing for a wide variety of settings.

Under the normal-normal setup above, the joint distribution of β and σ² given γ is not conjugate for (3.1), because (3.24) excludes σ². This prevents analytical reduction of the full posterior p(β, σ², γ | Y), which can severely increase the cost of posterior computations. To avoid this, one can instead consider the conjugate normal-normal formulation using

p(β | σ², γ) = N_p(0, σ² D_γ R D_γ),   (3.28)

which is identical to (3.24) except for the insertion of σ². Coupled with the inverse gamma prior (3.10) for σ², the conditional distribution of β and σ² given γ is conjugate. This allows for the analytical margining out of β and σ² from p(Y, β, σ² | γ) = p(Y | β, σ², γ) p(β | σ², γ) p(σ² | γ) to yield

p(Y | γ) ∝ |X'X + (D_γ R D_γ)^{−1}|^{−1/2} |D_γ R D_γ|^{−1/2} (νλ + S̃²_γ)^{−(n+ν)/2},   (3.29)

where

S̃²_γ = Y'Y − Y'X (X'X + (D_γ R D_γ)^{−1})^{−1} X'Y.   (3.30)

As will be seen in Section 3.5, this simplification confers strong advantages for posterior calculation and exploration.

Under (3.28), (3.10), and a model space prior p(γ), the marginal distribution of each component of β is

p(β_i) = (1 − w_i) T₁(ν, 0, λ v_{0i}) + w_i T₁(ν, 0, λ v_{1i}),   (3.31)

a scale mixture of t distributions, in contrast to the normal mixture (3.26). As with the nonconjugate prior, the idea is that v_{0i} and v_{1i} are to be set "small and large" respectively, so that when the data supports γ_i = 0 over γ_i = 1,
then β_i is probably small enough so that X_i will not be needed in the model. However, the way in which v_{0i} and v_{1i} determine "small and large" is affected by the unknown value of σ², thereby making specification more difficult and less reliable than in the nonconjugate formulation. For a chosen threshold of practical significance δ_i, the pdf p(β_i | γ_i = 0) = T₁(ν, 0, λ v_{0i}) is larger than the pdf p(β_i | γ_i = 1) = T₁(ν, 0, λ v_{1i}) precisely on the interval (−δ_i, δ_i) when v_{0i} and v_{1i} satisfy

δ_i² = νλ (r_i − 1) (v_{0i}^{−1} − r_i v_{1i}^{−1})^{−1},  where r_i = (v_{1i}/v_{0i})^{1/(ν+1)}.   (3.32)

By choosing v_{1i} such that T₁(ν, 0, λ v_{1i}) is consistent with plausible values of β_i, v_{0i} can then be chosen according to (3.32).

Another potentially valuable specification of the conjugate normal-normal formulation can be used to address the problem of outlier detection, which can be framed as a variable selection problem by including indicator variables for the observations as potential predictors. For such indicator variables, the choice v_{0i} = 0 and v_{1i} = K > 0 yields the well-known additive outlier formulation; see, for example, Petit and Smith (1985). Furthermore, when used in combination with the previous settings for ordinary predictors, the conjugate prior provides a hierarchical formulation for simultaneous variable selection and outlier detection. This has also been considered by Smith and Kohn (1996). A related treatment has been considered by Hoeting, Raftery and Madigan (1996).

3.5 Posterior Calculation and Exploration for Variable Selection

3.5.1 Closed Form Expressions for p(Y | γ)

A valuable feature of the previous conjugate prior formulations is that they allow for analytical margining out of β and σ² from p(Y, β, σ² | γ) to yield the closed form expressions in (3.11) and (3.29), which are proportional to p(Y | γ). Thus, when the model prior p(γ) is computable, this can be used to obtain a computable closed form expression g(γ) satisfying

g(γ) ∝ p(Y | γ) p(γ) ∝ p(γ | Y).   (3.33)

The availability of such g(γ) can greatly facilitate posterior calculation and estimation. Furthermore, it
turns out that for certain formulations, the value of g(γ) can be rapidly updated as γ is changed by a single component. As will be seen, such rapid updating schemes can be used to speed up algorithms for evaluating and exploring the posterior p(γ | Y).

Consider first the conjugate point-normal formulation (3.9) and (3.10), for which p(Y | γ), proportional to (3.11), can be obtained. When Σ_γ = c (X'_γ X_γ)^{−1}, a function g(γ) satisfying (3.33) can be expressed as

g(γ) = (1 + c)^{−q_γ/2} (νλ + Y'Y − (c/(1 + c)) W'W)^{−(n+ν)/2} p(γ),   (3.34)

where W = T'^{−1} X'_γ Y for upper triangular T such that T'T = X'_γ X_γ, obtainable by the Cholesky decomposition. As noted by Smith and Kohn (1996), the algorithm of Dongarra, Moler, Bunch and Stewart (1979) provides fast updating of T, and hence of W and g(γ), when γ is changed one component at a time. This algorithm requires O(q²_{γ'}) operations per update, where γ' is the changed value.

Now consider the conjugate normal-normal formulation (3.28) and (3.10), for which p(Y | γ), proportional to (3.29), can be obtained. When R = I holds, a function g(γ)
After computing T7 W and g39y for an initial 39y value7 subsequent values of T7 W and g39y can be obtained with the appropriate fast updating scheme by proceeding in the Gray Code order Using this approach7 this exhaustive calculation is feasible for p less than about 25 352 MCMC Methods for Variable Selection MCMC methods have become a principal tool for posterior evaluation and exploration in Bayesian variable selection problems Such methods are used to simulate a sequence 39y139y2 336 that converges in distribution to p39le ln formulations where analytical simpli cation of pm 02 39y 1 Y is unavailable7 336 can be obtained as a subsequence of a simulated Markov chain of the form 5lt1gt7 Ult1gt7 ylt1gt7 52 32 ylt2gt7 337 that converges to pm 02 39y 1 Y However7 in conjugate formulations where B and 02 can be analytically eliminated form the posterior7 the availability of g39y cx p y 1 Y allows for the exible construction of MCMC algorithms that simulate 336 directly as a Markov chain Such chains are often more useful7 in terms of both computational and convergence speed In problems where the number of potential predictors p is very small7 and g39y cx p y 1 Y is unavailable7 the sequence 336 may be used to evaluate the entire poste rior p y 1 Y lndeed7 empirical frequencies and other functions of the 39y values will be 90 H Chipman E I George and R E McCulloch consistent estimates of their values under p y l Y In large problems where exhaustive calculation of all 21 values Ofp y l Y is not feasible the sequence 336 may still provide useful information Even when the length of the sequence 336 is much smaller than 21 it may be possible to identify at least some of the high probability 39y since those 39y are expected to appear more frequently In this sense these MCMC methods can be used to stochastically search for high probability models In the next two subsections we describe various MCMC algorithms which may be useful for simulating 336 These algorithms are 
obtained as variants of the Gibbs sampler GS and Metropolis Hastings algorithms described in Section 23 353 Gibbs Sampling Algorithms Under the nonconjugate normal normal formulation 324 and 310 for SSVS the posterior p 0239y l Y is p dimensional for all 39y Thus a simple GS that simulates the full parameter sequence 337 is obtained by successive simulation from the full conditionals 193 l 027 M Y pltaz l mm pltaz l W 338 low l 30277Y m l 37 i 17 72 where at each step these distributions are conditioned on the most recently generated parameter values These conditionals are standard distributions which can be simulated quickly and efficiently by routine methods For conjugate formulations where g y is available a variety of MCMC algorithms for generating 336 directly as a Markov chain can be conveniently obtained by applying the GS with g y The simplest such implementation is obtained by generating each 39y value componentwise from the full conditionals WWWV 2 17277p7 339 ya m w 39y1 VH1 39yp where the y may be drawn in any xed or random order By margining out 6 and 02 in advance the sequence 336 obtained by this algorithm should converge faster than the nonconjugate Gibbs approach rendering it more effective on a per iteration basis for learning about p y l Y see Liu Wong and Kong 1994 The generation of the components in 339 in conjunction with g y can be obtained trivially as a sequence of Bernoulli draws Furthermore if g y allows for fast updating as in 334 or 335 the required sequence of Bernoulli probabilities can be computed Practical Bayes Model Selection 91 faster and more ef ciently To see this7 note that the Bernoulli probabilities are simple functions of the ratio p39Yi1739i W 9Vi1739i39 33940 PW 07 w W 9 07 Wm At each step of the iterative simulation from 3397 one of the values ofg39y in 340 will be available from the previous component simulation Since v has been varied by only a single component7 the other value of g y can then be obtained by using the 
appropriate updating scheme Under the point normal prior 39 with EA c XZYXA 17 the fast updating of 334 requires Oq27 operations7 whereas under the conjugate normal normal prior formulation 328 with R 1 fast updating of 335 requires 0p2 operations Thus7 GS algorithms in the former case can be substantially faster when p y Y is concentrated on those 39y for which q is small7 namely the parsimonious models This advantage could be pronounced in large problems with many useless predictors Simple variants of the componentwise GS can be obtained by generating the com ponents in a different xed or random order Note that in any such generation7 it is not necessary to generate each and every component once before repeating a coordinate Another variant of the GS can be obtained by drawing the components of 39y in groups7 rather than one at a time Let 1k k 12m be a partition of 17 27 719 so that7 k g 1727719 Ulk 1727719 andlkl lkz 0 for 1 7g k2 Let yIk E 1k and yak w 1k The grouped GS generates 336 by iterative simulation from wk wIw k12m 341 Fast updating of g y7 when available7 can also be used to speed up this simulation by computing the conditional probabilities of each wk in Gray Code order The potential advantage of such a grouped GS is improved convergence of 336 This might be achieved by choosing the partition so that strongly correlated y are contained in the same lk thereby reducing the dependence between draws in the simulation lntuitively7 clusters of such correlated 39yi should correspond to clusters of correlated Xi which7 in practice7 might be identi ed by clustering procedures As before7 variants of the grouped GS can be obtained by generating the wk in a different xed or random order 354 MetropolisHastings Algorithms The availability ofg39y cx p39yY also facilitates the use of MH algorithms for direct simu lation of 336 By restricting attention to the set of 39y values7 a discrete space7 the simple MH form described in Section 23 can be used Because g yg y p 
y Yp39y Y7 such MH algorithms are here of the form 1 Simulate a candidate 39y from a transition kernel q39y 39y7 92 H Chipman E I George and R E McCulloch 2 Set W11 V with probability a i 739 min MM V W qvlvjgvquot71 39 33942 Otherwise 711 WU When available fast updating schemes for g39y can be exploited Just as for the Gibbs sampler the MH algorthims under the point normal formulations 39 with 27 c X Xnyf1 will be the fastest scheme when p y 1 Y is concentrated on those 39y for which q is small A special class of MH algorithms the Metropolis algorithms are obtained from the class of transition kernels q39y1 1 yo which are symmetric in V1 and 39y0 For this class the form of 342 simpli es to QM 739 min QWT v lv TWUJ 343 Perhaps the simplest symmetric transition kernel is 1 0 p 0 1 cm W 119 if Elm wl 1 344 1 This yields the Metropolis algorithm 1 Simulate a candidate 39y by randomly changing one component of 39ym 2 Set 711 f with probability 04M39y l 39y7 Otherwise 711 39ym This algorithm was proposed in a related model selection context by Madigan and York 1995 who called it M03 It was used by Raftery Madigan and Hoeting 1997 for model averaging and was proposed for the SSVS prior formulation by Clyde and Parmigiani 1994 The transition kernel 344 is a special case of the class of symmetric transition kernels of the form 17 m1 W0 qd if 21v Will d 345 1 Such transition kernels yield Metropolis algorithms of the form 1 Simulate a candidate 39y by randomly changing 1 components of 39ym with proba bility qd 2 Set 711 f with probability 04M39y l 39y7 Otherwise 711 39ym Practical Bayes Model Selection 93 Here qd is the probability that 39y will have 1 new components By allocating some weight to qd for larger d7 the resulting algorithm will occasionally make big jumps to different 39y values In contrast to the algorithm obtained by 344 which only moves locally7 such algorithms require more computation per iteration Finally7 it may also be of interest to consider asymmetric 
transition kernels such as

q(γ¹ | γ⁰) = q_d  if  Σ_(i=1)^p (γ_i¹ − γ_i⁰) = d.   (3.46)

Here q_d is the probability of generating a candidate value γ* which corresponds to a model with d more variables than γ^(j). When d < 0, γ* will represent a more parsimonious model than γ^(j). By suitable weighting of the q_d probabilities, such Metropolis–Hastings algorithms can be made to explore the posterior in the region of more parsimonious models.

3.5.5 Extracting Information from the Output

In nonconjugate setups, where g(γ) is unavailable, inference about posterior characteristics based on (3.36) ultimately relies on empirical frequency estimates from the visited γ values. Although such estimates of posterior characteristics will be consistent, they may be unreliable, especially if the size of the simulated sample is small in comparison to 2^p, or if there is substantial dependence between draws. The use of empirical frequencies to identify high probability γ values for selection can be similarly problematic. However, the situation changes dramatically in conjugate setups where g(γ) ∝ p(γ | Y) is available. To begin with, g(γ) provides the relative probability of any two values γ⁰ and γ¹ via g(γ⁰)/g(γ¹), and so can be used to definitively identify the higher probability γ's in the sequence (3.36) of simulated values. Only minimal additional effort is required to obtain such calculations, since g(γ) must be calculated for each of the visited γ values in the execution of the MCMC algorithms described in Sections 3.5.3 and 3.5.4.

The availability of g(γ) can also be used to obtain estimators of the normalizing constant C in

p(γ | Y) = C g(γ),   (3.47)

based on the MCMC output (3.36), say γ^(1), ..., γ^(K). Let A be a preselected subset of γ values, and let g(A) = Σ_(γ∈A) g(γ), so that p(A | Y) = C g(A). Then, a consistent estimator of C is

Ĉ = (1 / (g(A) K)) Σ_(k=1)^K I_A(γ^(k)),   (3.48)

where I_A(·) is the indicator of the set A (George and McCulloch 1997). Note that if (3.36) were an uncorrelated sequence, then Var(Ĉ) = C² [K p(A | Y)]^(−1) (1 − p(A | Y)), suggesting that the variance of (3.48) is decreasing as p(A
| Y) increases. It is also desirable to choose A such that I_A(γ^(k)) can be easily evaluated. George and McCulloch (1997) obtain very good results by setting A to be those γ values visited by a preliminary simulation of (3.36). Peng (1998) has extended and generalized these ideas to obtain estimators of C that improve on (3.48).

Inserting Ĉ into (3.47) yields improved estimates of the probability of individual γ values,

p̂(γ | Y) = Ĉ g(γ),   (3.49)

as well as an estimate of the total visited probability,

p̂(B | Y) = Ĉ g(B),   (3.50)

where B is the set of visited γ values. Such p̂(B | Y) can provide valuable information about when to stop a MCMC simulation. Since p̂(γ | Y)/p(γ | Y) ≡ Ĉ/C, the uniform accuracy of the probability estimates (3.49) is

|Ĉ/C − 1|.   (3.51)

This quantity is also the total probability discrepancy, since Σ_γ |p̂(γ | Y) − p(γ | Y)| = |Ĉ − C| Σ_γ g(γ) = |Ĉ/C − 1|.

The simulated values (3.36) can also play an important role in model averaging. For example, suppose one wanted to predict a quantity of interest Δ by the posterior mean

E(Δ | Y) = Σ_(all γ) E(Δ | γ, Y) p(γ | Y).   (3.52)

When p is too large for exhaustive enumeration and p(γ | Y) cannot be computed, (3.52) is unavailable and is typically approximated by something of the form

Ê(Δ | Y) = Σ_(γ∈S) E(Δ | γ, Y) p̂(γ | Y, S),   (3.53)

where S is a manageable subset of models and p̂(γ | Y, S) is a probability distribution over S. (In some cases, E(Δ | γ, Y) will also be approximated). Using the Markov chain sample for S, a natural choice for (3.53) is

Ê_f(Δ | Y) = Σ_(γ∈S) E(Δ | γ, Y) p̂_f(γ | Y, S),   (3.54)

where p̂_f(γ | Y, S) is the relative frequency of γ in S (George 1999). Indeed, (3.54) will be a consistent estimator of E(Δ | Y). However, here too it appears that when g(γ) is available, one can do better by using

Ê_g(Δ | Y) = Σ_(γ∈S) E(Δ | γ, Y) p̂_g(γ | Y, S),   (3.55)

where

p̂_g(γ | Y, S) = g(γ)/g(S)   (3.56)

is just the renormalized value of g(γ). For example, when S is an iid sample from p(γ | Y), (3.55) increasingly approximates the best unbiased estimator of E(Δ | Y) as the sample size increases. To see this, note that when S is an iid sample, Ê_f(Δ | Y) is unbiased for E(Δ | Y). Since S together with
g is sufficient, the Rao–Blackwellized estimator E(Ê_f(Δ | Y) | S) is best unbiased. But as the sample size increases, E(Ê_f(Δ | Y) | S) → Ê_g(Δ | Y).

4 Bayesian CART Model Selection

For our second illustration of Bayesian model selection implementations, we consider the problem of selecting a classification and regression tree (CART) model for the relationship between a variable y and a vector of potential predictors x = (x_1, ..., x_p). An alternative to linear regression, CART models provide a more flexible specification of the conditional distribution of y given x. This specification consists of a partition of the x space, and a set of distinct distributions for y within the subsets of the partition. The partition is accomplished by a binary tree T that recursively partitions the x space with internal node splitting rules of the form {x ∈ A} or {x ∉ A}. By moving from the root node through to the terminal nodes, each observation is assigned to a terminal node of T, which then associates the observation with a distribution for y.

Although any distribution may be considered for the terminal node distributions, it is convenient to specify these as members of a single parametric family p(y | θ) and to assume all observations of y are conditionally independent given the parameter values. In this case, a CART model is identified by the tree T and the parameter values Θ = (θ_1, ..., θ_b) of the distributions at each of the b terminal nodes of T. Note that T here plays the role of M_k, the model identifier described in Section 2. The model is called a regression tree model or a classification tree model according to whether y is quantitative or qualitative, respectively.

For regression trees, two simple and useful specifications for the terminal node distributions are the mean shift normal model

p(y | θ_i) = N(μ_i, σ²),  i = 1, ..., b,   (4.1)

where θ_i = (μ_i, σ), and the mean-variance shift normal model

p(y | θ_i) = N(μ_i, σ_i²),  i = 1, ..., b,   (4.2)

where θ_i = (μ_i, σ_i). For classification trees, where y belongs to one of K categories, say C_1, ..., C_K, a natural choice for terminal node
distributions are the simple multinomials

p(y | θ_i) = Π_(k=1)^K p_ik^(I(y ∈ C_k)),  i = 1, ..., b,   (4.3)

where θ_i = p_i ≡ (p_i1, ..., p_iK), p_ik ≥ 0 and Σ_k p_ik = 1. Here, p(y ∈ C_k) = p_ik at the ith terminal node of T.

As illustration, Figure 1 depicts a regression tree model where y ~ N(θ, 2²) and x = (x_1, x_2); x_1 is a quantitative predictor taking values in [0, 10], and x_2 is a qualitative predictor with categories {A, B, C, D}. The binary tree has 9 nodes, of which b = 5 are terminal nodes that partition the x space into 5 subsets. The splitting rules are displayed at each internal node. For example, the leftmost terminal node corresponds to x_1 ≤ 3.0 and x_2 ∈ {C, D}. The θ value identifying the mean of y given x is displayed at each terminal node. Note that in contrast to a linear model, θ decreases in x_1 when x_2 ∈ {A, B}, but increases in x_1 when x_2 ∈ {C, D}.

The two basic components of the Bayesian approach to CART model selection are prior specification and posterior exploration. Prior specification over CART models entails putting a prior on the tree space and priors on the parameters of the terminal node distributions. The CART model likelihoods are then used to update the prior to yield a posterior distribution that can be used for model selection. Although straightforward in principle, practical implementations require subtle and delicate attention to details. Prior formulation must be interpretable and computationally manageable. Hyperparameter specification can be usefully guided by overall location and scale measures of the data. A feature of this approach is that the prior specification can be used to downweight undesirable model characteristics such as tree complexity, or to express a preference for certain predictor variables. Although the entire posterior cannot be computed in nontrivial problems, posterior guided MH algorithms can still be used to search for good tree models. However, the algorithms require repeated restarting or other modifications because of the multimodal nature of the posterior. As the search proceeds, selection based on marginal likelihood rather than
posterior probability is preferable because of the dilution properties of the prior. Alternatively, a posterior weighted average of the visited models can be easily obtained.

CART modelling was popularized in the statistical community by the seminal book of Breiman, Friedman, Olshen and Stone (1984). Earlier work by Kass (1980) and Hawkins and Kass (1982) developed tree models in a statistical framework. There has also been substantial research on trees in the field of machine learning, for example, the C4.5 algorithm and its predecessor ID3 (Quinlan 1986, 1993). Here we focus on the method of Breiman et al. (1984), which proposed a nonparametric approach for tree selection based on a greedy algorithm named CART. A concise description of this approach, which seeks to partition the x space into regions where the distribution of y is homogeneous, and its implementation in S, appears in Clark and Pregibon (1992).

Figure 1. A regression tree where y ~ N(θ, 2²) and x = (x_1, x_2).

Bayesian approaches to CART are enabled by elaborating the CART tree formulation to include parametric terminal node distributions, effectively turning it into a statistical model and providing a likelihood. Conventional greedy search algorithms are also replaced by MCMC algorithms that provide a broader search over the tree model space. The Bayesian CART model selection implementations described here were proposed by Chipman, George and McCulloch (1998) and Denison, Mallick and Smith (1998a), hereafter referred to as CGM and DMS, respectively. An earlier Bayesian approach to classification tree modelling was proposed by Buntine (1992), which, compared to CGM and DMS, uses similar priors for terminal node distributions, but different priors on the space of trees, and deterministic rather than stochastic algorithms for model search. Priors for tree models based on Minimum Encoding ideas were proposed by Quinlan and Rivest (1989) and Wallace and Patrick (1993). Oliver and Hand (1995)
discuss and provide an empirical comparison of a variety of pruning and Bayesian model averaging approaches based on CART. Paass and Kindermann (1997) applied a simpler version of the CGM approach and obtained results which uniformly dominated a wide variety of competing methods. Other alternatives to greedy search methods include Sutton (1991) and Lutsko and Kuijpers (1994), who use simulated annealing; Jordan and Jacobs (1994), who use the EM algorithm; Breiman (1996), who averages trees based on bootstrap samples; and Tibshirani and Knight (1999), who select trees based on bootstrap samples.

4.1 Prior Formulations for Bayesian CART

Since a CART model is identified by (Θ, T), a Bayesian analysis of the problem proceeds by specifying priors on the parameters of the terminal node distributions of each tree, p(Θ | T), and a prior distribution p(T) over the set of trees. Because the prior for T does not depend on the form of the terminal node distributions, p(T) can be generally considered for both regression trees and classification trees.

4.1.1 Tree Prior Specification

A tree T partitions the x space and consists of both the binary tree structure and the set of splitting rules associated with the internal nodes. A general formulation approach for p(T), proposed by CGM, is to specify p(T) implicitly by the following tree-generating stochastic process, which grows trees from a single root tree by randomly splitting terminal nodes:

1. Begin by setting T to be the trivial tree consisting of a single root (and terminal) node, denoted η.

2. Split the terminal node η with probability p_SPLIT(η) = α(1 + d_η)^(−β), where d_η is the depth of the node η, and α ∈ (0, 1) and β ≥ 0 are prechosen control parameters.

3. If the node splits, randomly assign it a splitting rule as follows. First, choose x_i uniformly from the set of available predictors. If x_i is quantitative, assign a splitting rule of the form {x_i ≤ s} vs {x_i > s}, where s is chosen uniformly from the available observed values of x_i. If x_i is qualitative, assign a splitting
rule of the form {x_i ∈ C} vs {x_i ∉ C}, where C is chosen uniformly from the set of subsets of available categories of x_i. Next, assign left and right children nodes to the split node, and apply steps 2 and 3 to the newly created tree with η equal to the new left and then the right children (if nontrivial splitting rules are available).

By "available" in step 3, we mean those predictors, split values and category subsets that would not lead to empty terminal nodes. For example, if a binary predictor was used in a splitting rule, it would no longer be available for splitting rules at nodes below it.

Each realization of such a process can simply be considered as a random draw from p(T). Furthermore, this specification allows for straightforward evaluation of p(T) for any T, and can be effectively coupled with the MH search algorithms described in Section 4.2.1. Although other useful forms can easily be considered for the splitting probability in step 2 above, the choice of p_SPLIT(η) = α(1 + d_η)^(−β) is simple, interpretable, easy to compute and dependent only on the depth d_η of the node η.

The parameters α and β control the size and shape of the binary tree produced by the process. To see how, consider the simple specification p_SPLIT(η) ≡ α < 1 when β = 0. In this case, the probability of any particular binary tree with b terminal nodes (ignoring the constraints of splitting rule assignments) is just α^(b−1) (1 − α)^b, a natural generalization of the geometric distribution. (A binary tree with b terminal nodes will always have exactly b − 1 internal nodes). Setting α small will tend to yield smaller trees, and is a simple, convenient way to control the size of trees generated by the growing process.

The choice of β = 0 above assigns equal probability to all binary trees with b terminal nodes, regardless of their shape. Indeed, any prior that is only a function of b will do this; for example, DMS recommend this with a truncated Poisson distribution on b. However, for increasing β > 0, p_SPLIT(η) is a decreasing function of d_η, making deeper nodes less likely
to split. The resulting prior p(T) puts higher probability on "bushy" trees, those whose terminal nodes do not vary too much in depth. Choosing α and β in practice can be guided by looking at the implicit marginal distributions of characteristics such as b. Such marginals can be easily simulated and graphed.

Turning to the splitting rule assignments, step 3 of the tree growing process represents the prior information that at each node, available predictors are equally likely to be effective, and that for each predictor, available split values or category subsets are equally likely to be effective. This specification is invariant to monotone transformations of the quantitative predictors, and is uniform on the observed quantiles of a quantitative predictor with no repeated values. However, it is not uniform over all possible splitting rules, because it assigns lower probability to splitting rules based on predictors with more potential split values or category subsets. This feature is necessary to maintain equal probability on predictor choices, and essentially yields the dilution property discussed in Sections 2.2 and 3.1. Predictors with more potential split values will give rise to more trees. By downweighting the splitting rules of such predictors, p(T) serves to dilute probability within neighborhoods of similar trees.

Although the uniform choices for p(T) above seem to be reasonable defaults, nonuniform choices may also be of interest. For example, it may be preferable to place higher probability on predictors that are thought to be more important. A preference for models with fewer variables could be expressed by putting greater mass on variables already assigned to ancestral nodes. For the choice of split value, tapered distributions at the extremes would increase the tendency to split more towards the interior range of a variable. One might also consider the distribution of split values to be uniform on the available range of the predictor, and so could
weight the available observed values accordingly. For the choice of category subset, one might put extra weight on subsets thought to be more important.

As a practical matter, note that all of the choices above consider only the observed predictor values as possible split points. This induces a discrete distribution on the set of splitting rules, and hence the support of p(T) will be a finite set of trees in any application. This is not really a restriction, since it allows for all possible partitions of any given data set. The alternative of putting a continuous distribution on the range of the predictors would needlessly increase the computational requirements of posterior search while providing no gain in generality. Finally, we note that the dependence of p(T) on the observed x values is typical of default prior formulations, as was the case for some of the coefficient prior covariance choices discussed in Sections 3.2 and 3.4.

4.1.2 Parameter Prior Specifications

As discussed in Section 2.3, the computational burden of posterior calculation and exploration is substantially reduced when the marginal likelihood, here p(Y | T), can be obtained in closed form. Because of the large size of the space of CART models, this computational consideration is key in choosing the prior p(Θ | T) for the parameters of
the sample variance ma ma of the ith terminal node Y values 25 g 7 2 and g is the sample mean of the ith terminal node Y values In practice the observed Y can be used to guide the choice of hyperparameter values for lll1a Considerations similar to those discussed for Bayesian variable selection in Section 32 are also useful here To begin with because the mean shift model attempts to explain the variation of Y it is reasonable to expect that 02 will be smaller than 5 the sample variance of Y Similarly it is reasonable to expect that 02 will be larger than a pooled variance estimate obtained from a deliberate over tting of the data by a greedy algorithm say 5 Using these values as guides 11 and would then be chosen so that the prior for 02 assigns substantial probability to the interval 5 Once 1 and have been chosen 1 and a would be selected so that the prior for u is spread out over the range of Y values For the more exible mean variance shift model 42 where a can also vary across the terminal nodes the normal inverse gamma form is easily extended to POM 1 MT N017 HZa 48 29M2 1T pW CVZVVQL 49 1 102 H Chipman E I George and R E McCulloch with the pairs 11 01 M74717 independently distributed given T Under this prior analytical simpli cation is still straightforward and yields 17 Pn39l2 Y T x 7139 22 1 VZ l S39t39l WW2 410 291111 ltgtmwmltligt ltgt where s and t are as above Interestingly the MCMC computations discussed in the next section are facilitated by the factorization of this marginal likelihood across nodes in contrast to the marginal likelihood 47 for the equal variance model Here too the observed Y can be used to guide the choice of hyperparameter values for 1 1 a The same ideas above may be used with an additional consideration In some cases the mean variance shift model may explain variance shifts much more so than mean shifts To handle this possibility it may be better to choose 1 and so that 032 is more toward the center rather than the right tail of the 
prior for σ². We might also tighten up our prior for μ about the average y value. In any case, it can be useful to explore the consequences of several different prior choices.

For classification trees with the simple multinomial model (4.3), a useful conjugate prior specification for Θ = (p_1, ..., p_b) is the standard Dirichlet distribution of dimension K − 1 with parameter α = (α_1, ..., α_K), α_k > 0, where p_1, ..., p_b are iid given T with

p(p_i | T) = Dirichlet(p_i | α) ∝ p_i1^(α_1 − 1) ··· p_iK^(α_K − 1).   (4.11)

When K = 2, this reduces to the familiar Beta prior. Under this prior, standard analytical simplification yields

p(Y | T) = ( Γ(Σ_k α_k) / Π_k Γ(α_k) )^b Π_(i=1)^b [ Π_k Γ(n_ik + α_k) / Γ(n_i + Σ_k α_k) ],   (4.12)

where n_ik is the number of ith terminal node Y values in category C_k, n_i = Σ_k n_ik, and k = 1, ..., K in the sums and products above. For a given tree, p(Y | T) will be larger when the Y values within each node are more homogeneous. To see this, note that assignments for which the Y values at the same node are similar will lead to more disparate values of n_i1, ..., n_iK, which in turn will lead to larger values of p(Y | T). The natural default choice for α is the vector (1, ..., 1), for which the Dirichlet prior (4.11) is the uniform. However, by setting certain α_k to be larger for certain categories, p(Y | T) will become more sensitive to misclassification at those categories. This would be desirable when classification into those categories is most important.

One detail of the analytical simplifications yielding the integrated likelihoods (4.7), (4.10) or (4.12) merits attention. Independence of parameters across terminal nodes means that integration can be carried out separately for each node. Normalizing constants in integrals for each node that would usually be discarded, for example a^(b/2) in (4.7), need to be kept, since the number of terminal nodes b varies across trees. This means that these normalizing constants will be exponentiated to a different power for trees of different size.

All the prior specifications above assume that, given the tree T, the parameters in the terminal nodes are independent.
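The Dirichlet-multinomial integrated likelihood (4.12) is straightforward to compute from the per-node category counts. The following sketch (our own illustration, not code from the paper; the function name and interface are hypothetical) evaluates log p(Y | T) on the log scale with log-gamma functions, keeping the per-node normalizing constant as discussed above.

```python
import math

def log_marginal_lik(node_counts, alpha):
    """Log Dirichlet-multinomial marginal likelihood, as in (4.12).

    node_counts: list over terminal nodes of count vectors [n_i1, ..., n_iK]
    alpha: Dirichlet hyperparameter vector (alpha_1, ..., alpha_K)
    """
    a_sum = sum(alpha)
    # Per-node normalizing constant of the Dirichlet prior; it must be kept
    # (not discarded) because b, the number of terminal nodes, varies across trees.
    log_norm = math.lgamma(a_sum) - sum(math.lgamma(a) for a in alpha)
    total = 0.0
    for counts in node_counts:  # independence across nodes: likelihood factors
        n_i = sum(counts)
        total += log_norm
        total += sum(math.lgamma(n + a) for n, a in zip(counts, alpha))
        total -= math.lgamma(n_i + a_sum)
    return total
```

For a single terminal node with binary counts (3, 1) and the uniform choice α = (1, 1), this reduces to the Beta integral ∫ p³(1 − p) dp = 1/20, and for a fixed number of terminal nodes, more homogeneous nodes (e.g. counts (4, 0) and (0, 4) rather than (2, 2) and (2, 2)) give a larger value, consistent with the homogeneity property noted above.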
When terminal nodes share many common parents, it may be desirable to introduce dependence between their θ_i values. Chipman, George and McCulloch (2000) introduce such a dependence for the regression tree model, resulting in a Bayesian analogue of the tree shrinkage methods considered by Hastie and Pregibon (1990) and Leblanc and Tibshirani (1998).

4.2 Stochastic Search of the CART Model Posterior

Combining any of the closed form expressions (4.7), (4.10) or (4.12) for p(Y | T) with p(T) yields a closed form expression g(T) satisfying

g(T) ∝ p(Y | T) p(T) ∝ p(T | Y).   (4.13)

Analogous to the benefits of the availability of g(γ) in (3.33) for Bayesian variable selection, the availability of g(T) confers great advantages for posterior computation and exploration in Bayesian CART model selection.

Exhaustive evaluation of g(T) over all T will not be feasible, except in trivially small problems, because of the sheer number of possible trees. This not only prevents exact calculation of the norming constant, but also makes it nearly impossible to determine exactly which trees have largest posterior probability. In spite of these limitations, MH algorithms can still be used to explore the posterior. Such algorithms simulate a Markov chain sequence of trees

T⁰, T¹, T², ...   (4.14)

which are converging in distribution to the posterior p(T | Y). Because such a simulated sequence will tend to gravitate towards regions of higher posterior probability, the simulation can be used to stochastically search for high posterior probability trees. We now proceed to describe the details of such algorithms and their effective implementation.

4.2.1 Metropolis-Hastings Search Algorithms

By restricting attention to a finite set of trees, as discussed in the last paragraph of Section 4.1.1, the simple MH form described in Section 2.3 can be used for direct simulation of the Markov chain (4.14). Because g(T*)/g(T) = p(T* | Y)/p(T | Y), such MH algorithms are obtained as follows. Starting with an initial tree T⁰, iteratively simulate the transitions from T^j
to T^(j+1) by the two steps:

1. Simulate a candidate T* from the transition kernel q(T* | T^j).

2. Set T^(j+1) = T* with probability

α(T* | T^j) = min{ [q(T^j | T*) g(T*)] / [q(T* | T^j) g(T^j)], 1 }.   (4.15)

Otherwise, set T^(j+1) = T^j.

The key to making this an effective MH algorithm is the choice of transition kernel q(T* | T^j). A useful strategy in this regard is to construct q(T* | T^j) as a mixture of simple local moves from one tree to another, moves that have a chance of increasing posterior probability. In particular, CGM use the following q(T* | T^j), which generates T* from T^j by randomly choosing among four steps:

GROW: Randomly pick a terminal node. Split it into two new ones by randomly assigning it a splitting rule, using the same random splitting rule assignment used to determine p(T).

PRUNE: Randomly pick a parent of two terminal nodes and turn it into a terminal node by collapsing the nodes below it.

CHANGE: Randomly pick an internal node and randomly reassign its splitting rule, using the same random splitting rule assignment used to determine p(T).

SWAP: Randomly pick a parent-child pair which are both internal nodes. Swap their splitting rules, unless the other child has the identical rule. In that case, swap the splitting rule of the parent with that of both children.

In executing the GROW, CHANGE and SWAP steps, attention is restricted to splitting rule assignments that do not force the tree to have an empty terminal node. CGM also recommend further restricting attention to splitting rule assignments which yield trees with at least a small number (such as five) of observations at every terminal node. A similar q(T* | T^j), without the SWAP step, was proposed by DMS. An interesting general approach for constructing such moves was proposed by Knight, Kustra and Tibshirani (1998).

The transition kernel q(T* | T^j) above has some appealing features. To begin with, every step from T to T* has a counterpart that moves from T* to T. Indeed, the GROW and PRUNE steps are counterparts of one another, and the CHANGE and SWAP steps are their own counterparts. This feature guarantees the
irreducibility of the algorithm, which is needed for convergence. It also makes it easy to calculate the ratio q(T^j | T*)/q(T* | T^j) in (4.15). Note that other reversible moves may be much more difficult to implement because their counterparts are impractical to construct. For example, pruning off more than a pair of terminal nodes would require a complicated and computationally expensive reverse step. Another computational feature occurs in the GROW and PRUNE steps, where there is substantial cancellation between g and q in the calculation of (4.15), because the splitting rule assignment for the prior is used.

4.2.2 Running the MH Algorithm for Stochastic Search

The MH algorithm described in the previous section can be used to search for desirable trees. To perform an effective search, it is necessary to understand its behavior as it moves through the space of trees. By virtue of the fact that its limiting distribution is p(T | Y), it will spend more time visiting tree regions where p(T | Y) is large. However, our experience in assorted problems (see the examples in CGM) has been that the algorithm quickly gravitates towards such regions and then stabilizes, moving locally in that region for a long time. Evidently, this is a consequence of a transition kernel that makes local moves over a sharply peaked multimodal posterior. Once a tree has reasonable fit, the chain is unlikely to move away from a sharp local mode by small steps. Because the algorithm is convergent, we know it will eventually move from mode to mode and traverse the entire space of trees. However, the long waiting times between such moves and the large size of the space of trees make it impractical to search effectively with long runs of the algorithm. Although different move types might be implemented, we believe that any MH algorithm for CART models will have difficulty moving between local modes.

To avoid wasting time waiting for mode to mode moves, our search strategy has been to repeatedly restart the algorithm. At each restart, the algorithm tends to move quickly in a direction of higher posterior probability and eventually stabilize around a local mode. At that point the algorithm ceases to provide new information, and so we intervene in order to find another local mode more quickly. Although the algorithm can be restarted from any particular tree, we have found it very productive to repeatedly restart at the trivial single node tree. Such restarts have led to a wide variety of different trees, apparently due to large initial variation of the algorithm. However, we have also found it productive to restart the algorithm at other trees, such as previously visited intermediate trees, or trees found by other heuristic methods. For example, CGM demonstrate that restarting the algorithm at trees found by bootstrap bumping (Tibshirani and Knight 1999) leads to further improvements over the start points.

A practical implication of restarting the chain is that the number of restarts must be
algorithm tends to move quickly in a direction of higher posterior probability and eventually stabilize around a local mode At that point the algorithm ceases to provide new information7 and so we intervene in order to nd another local mode more quickly Although the algorithm can be restarted from any particular tree7 we have found it very productive to repeatedly restart at the trivial single node tree Such restarts have led to a wide variety of different trees7 apparently due to large initial variation of the algorithm However7 we have also found it productive to restart the algorithm at other trees such as previously visited intermediate trees or trees found by other heuristic methods For example7 CGM demonstrate that restarting the algorithm at trees found by bootstrap bumping Tibshirani and Knight 1999 leads to further improvements over the start points A practical implication of restarting the chain is that the number of restarts must be 106 H Chipman E I George and R E McCulloch traded off against the length of the chains Longer chains may more thoroughly explore a local region of the model space while more restarts could cover the space of models more completely In our experience a preliminary run with a small number of restarts can aid in deciding these two parameters of the run If the marginal likelihood stops increasing before the end of each run lengthening runs may be less pro table than increasing the number of restarts It may also be useful to consider the slower burn in77 modi cation of the algorithm proposed by DMS Rather than let their MH algorithm move quickly to a mode DMS intervene forcing the algorithm to move around small trees with around 6 or fewer nodes before letting it move on This interesting strategy can take advantage of the fact that the problems caused by the sharply peaked multimodal posterior are less acute when small trees are constructed Indeed when trees remain small the change or swap steps are more likely to be permissible 
since there are fewer children to be incompatible with, and help move around the model space. Although this "burn-in" strategy will slow down the algorithm, it may be a worthwhile tradeoff if it sufficiently increases the probability of finding better models.

4.2.3 Selecting the Best Trees

As many trees are visited by each run of the algorithm, a method is needed to identify those trees which are of most interest. Because g(T) ∝ p(T | Y) is available for each visited tree, one might consider selecting those trees with largest posterior probability. However, this can be problematic because of the dilution property of p(T) discussed in Section 4.1.1. Consider the following simple example. Suppose we were considering all possible trees with two terminal nodes and a single rule. Suppose further that we had only two possible predictors: a binary variable with a single available splitting rule, and a multilevel variable with 100 possible splits. If the marginal likelihood p(Y | T) was the same for all 101 rules, then the posterior would have a sharp mode on the binary variable, because the prior assigns small probability to each of the 100 candidate splits for the multilevel predictor, and much larger probability to the single rule on the binary predictor. Selection via posterior probabilities is problematic because the relative sizes of posterior modes do not capture the fact that the total posterior probability allocated to the 100 trees splitting on the multilevel variable is the same as that allocated to the single binary tree.

It should be emphasized that the dilution property is not a failure of the prior. By using it, the posterior properly allocates high probability to tree neighborhoods which are collectively supported by the data. This serves to guide the algorithm towards such regions. The difficulty is that relative sizes of posterior modes do not capture the relative allocation of probability to such regions, and so can lead to misleading comparisons of single
trees. Note also that dilution is not a limitation for model averaging. Indeed, one could approximate the overall posterior mean by the average of the visited trees, using weights proportional to p(Y | T) p(T). Such model averages provide a Bayesian alternative to the tree model averaging proposed by Breiman (1996) and Oliver and Hand (1995).

A natural criterion for tree model selection which avoids the difficulties described above is to use the marginal likelihood p(Y | T). As illustrated in CGM, a useful tool in this regard is a plot of the largest observed values of p(Y | T) against the number of terminal nodes of T, an analogue of the Cp plot (Mallows 1973). This allows the user to directly gauge the value of adding additional nodes while removing the influence of p(T). In the same spirit, we have also found it useful to consider other commonly used tree selection criteria, such as residual sums of squares for regression trees and misclassification rates for classification trees.

After choosing a selection criterion, a remaining issue is what to do when many different models are found, all of which fit the data well. Indeed, our experience with stochastic search in applications has been to find a large number of good tree models distinguished only by small differences in marginal likelihood. To deal with such output, in Chipman, George and McCulloch (1998b, 2001a) we have proposed clustering methods for organizing multiple models. We found such clustering to reveal a few distinct neighborhoods of similar models. In such cases, it may be better to select a few representative models rather than a single best model.

5 Much More to Come

Because of its broad generality, the formulation for Bayesian model uncertainty can be applied to a wide variety of problems. The two examples that we have discussed at length, Bayesian variable selection for the linear model and Bayesian CART model selection, illustrate some of the main ideas that have been used to obtain effective practical implementations. However, there have been
many other recent examples. To get an idea of the extent of recent activity, consider the following partial list of some of the highlights just within the regression framework. To begin with, the Bayesian variable selection formulation for the linear model has been extended to the multivariate regression setting by Brown, Vannucci and Fearn (1998). It has been applied and extended to nonparametric spline regression by Denison, Mallick and Smith (1998b,c), Gustafson (2000), Holmes and Mallick (2001), Liang, Truong and Wong (2000), Smith and Kohn (1996, 1997) and Smith, Wong and Kohn (1998), and to nonparametric wavelet regression by Abramovich, Sapatinas and Silverman (1998), Chipman, Kolaczyk and McCulloch (1997), Clyde and George (1999, 2000), Clyde, Parmigiani and Vidakovic (1998), Holmes and Mallick (2000) and Johnstone and Silverman (1998). Related Bayesian approaches for generalized linear models and time series models have been put forward by Chen, Ibrahim and Yiannoutsos (1999), Clyde (1999), George, McCulloch and Tsay (1995), Ibrahim and Chen (2000), Mallick and Gelfand (1994), Raftery (1996), Raftery, Madigan and Volinsky (1996), Raftery and Richardson (1996), Shively, Kohn and Wood (1999), Troughton and Godsill (1997) and Wood and Kohn (1998), for log-linear models by Dellaportas and Forster (1999) and Albert (1996), and to graphical model selection by Giudici and Green (1999) and Madigan and York (1995). Bayesian CART has been extended to Bayesian treed modelling by Chipman, George and McCulloch (2001b), and a related Bayesian partitioning approach has been proposed by Holmes, Denison and Mallick (2000). Alternative recent Bayesian methods for the regression setup include the predictive criteria of Laud and Ibrahim (1995), the information distance approach of Goutis and Robert (1998), and the utility approach of Brown, Fearn and Vannucci (1999) based on the early work of Lindley (1968). An excellent summary of many of the above articles, and additional contributions to Bayesian model selection, can be found in Ntzoufras (1999). Spurred on
by applications to new model classes, refinements in prior formulations, and advances in computing methods and technology, implementations of Bayesian approaches to model uncertainty are widening in scope and becoming increasingly prevalent. With the involvement of a growing number of researchers, the cross-fertilization of ideas is further accelerating developments. As we see it, there is much more to come.

REFERENCES

Abramovich, F., Sapatinas, T. and Silverman, B.W. (1998). Wavelet thresholding via a Bayesian approach. J. Roy. Statist. Soc. Ser. B 60, 725–749.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd Internat. Symp. Inform. Theory (B.N. Petrov and F. Csaki, eds.), 267–281, Akademiai Kiado, Budapest.
Albert, J.H. (1996). The Bayesian selection of log-linear models. Canad. J. Statist. 24, 327–347.
Bartlett, M. (1957). A comment on D. V. Lindley's statistical paradox. Biometrika 44, 533–534.
Bernardo, J.M. and Smith, A.F.M. (1994). Bayesian Theory. Wiley, New York.
Berger, J.O. and Pericchi, L.R. (1996). The intrinsic Bayes factor for model selection
Statist Asso 66 744 748 Chen MH Ibrahim JG and Yiannoutsos C 1999 Prior elicitation variable selec tion and Bayesian computation for logistic regression models J Roy Statist Soc Ser B 61 223 243 Chib S and Greenberg E 1995 Understanding the Metropolis Hastings algorithm The American Statistician 49 327 335 Chipman H 1996 Bayesian variable selection with related predictors Canad J Statist 24 17 36 Chipman H A George E 1 and McCulloch R E 1998a Bayesian CART model search with discussion J Amer Statist Asso 93 935 960 Chipman H A George E l and McCulloch R E 1998b Making sense of a forest of trees Computing Science and Statistics Proc 30th Symp Interface S Weisberg Ed 84 92 Interface Foundation of North America Fairfax VA Chipman H George E 1 and McCulloch R E 2000 Hierarchical priors for Bayesian CART shrinkage Statist Comput 10 17 24 Chipman H George El and McCulloch RE 2001a Managing multiple models Arti cial Intelligence and Statistics 2001 Tommi Jaakkola and Thomas Richardson eds 11 18 Morgan Kaufmann San Francisco CA 110 H Chipman E I George and R E McCulloch Chipman H George E l and McCulloch R E 2001b Bayesian treed models Machine Leaming To appear Chipman H Hamada M and Wu C F J 1997 A Bayesian variable selection approach for analyzing designed experiments with complex aliasing Technometrics 39 372 381 Chipman H Kolaczyk E and McCulloch R 1997 Adaptive Bayesian wavelet shrinkage J Amer Statist Asso 92 1413 1421 Clark L and Pregibon D 1992 TreeBased Models In Statistical Models in S J Chambers and T Hastie Eds 377 419 Wadsworth Clyde MA 1999 Bayesian model averaging and model search strategies with discus sion ln Bayesian Statistics 6 JM Bernardo JO Berger AP Dawid and AFM Smith eds 157 185 Oxford Univ Press Clyde MA DeSimone H and Parmigiani G 1996 Prediction via orthogonalized model mixing J Amer Statist Asso 91 1197 1208 Clyde M and George El 1999 Empirical Bayes estimation in wavelet nonparamet ric regression ln Bayesian Inference in Wavelet Based Models P 
Muller and B Vidakovic eds 309 22 Springer Verlag New York Clyde M and George El 2000 Flexible empirical Bayes estimation for wavelets J Roy Statist Soc Ser B62 681 698 Clyde MA and Parmigiani G 1994 Bayesian variable selection and prediction with mixtures J Biopharmaceutical Statist Clyde M Parmigiani G Vidakovic B 1998 Multiple shrinkage and subset selection in wavelets Biometrika 85 391 402 Denison DGT Mallick BK and Smith AFM 1998a A Bayesian CART algo rithm Biometrika 85 363 377 Denison DGT Mallick BK and Smith AFM 1998b Automatic Bayesian curve tting J Roy Statist Soc Ser B60 333 350 Denison DGT Mallick BK and Smith AFM 1998c Bayesian MARS Statist Comput 8 337 346 Dellaportas P and Foster JJ 1999 Markov Chain Monte Carlo Model determination for hierarchical and graphical log linear Models Biometiika 86 615 634 Dellaportas P Forster JJ and Ntzoufras l 2000 On Bayesian model and variable selection using MCMC Statist Comput To appear Dongarra ll Moler CB Bunch JR and Stewart GW 1979 Linpack Users7 Guide SlAM Philadelphia Practical Bayes Model Selection 111 Donoho DL and Johnstone IM 1994 Ideal spatial adaptation by wavelet shrinkage Biometrika 81 425 56 Draper D 1995 Assessment and propagation ofmodel uncertainty with discusssion J Roy Statist Soc Ser B57 45 98 Edwards W Lindman H and Savage LJ 1963 Bayesian statistical inference for psychological research Psychological Review 70 193 242 Fernandez C Ley E and Steel MPJ 2001 Benchmark priors for Bayesian model averaging J Econometrics 100 381 427 Foster DP and George El 1994 The risk in ation criterion for multiple regression Ann Statist 22 1947 75 GarthwaiteP H and Dickey JM 1992 Elicitation of prior distributions for variable selection problems in regression Ann Statist 20 1697 1719 Garthwaite P H and Dickey JM 1996 Quantifying and using expert opinion for variable selection problems in regression with discussion Chemomet Intel Lab Syst 35 1 34 Gelfand A E Dey D K and Chang H 1992 Model determination using 
predic tive distributions With implementations via sampling based methods In Bayesian Statistics 4 J M Bernardo J O Berger A P Dawid and A F M Smith eds 1477167 Oxford Univ Press Gelfand A and Smith A F M 1990 Sampling based approaches to calculating marginal densities J Amer Statist Asso 85 398 409 George El 1998 Bayesian model selection Encyclopedia of Statistical Sciences Update Volume 5 S Kotz C Read and D Banks eds 39 46 Wiley New York George El 1999 Discussion of Bayesian model averaging and model search strate gies77 by MA Clyde Bayesian Statistics 6 JM Bernardo IO Berger AP Dawid and AFM Smith eds 175 177 Oxford Univ Press George El and Foster DP 2000 Calibration and empirical Bayes variable selection Biometrika 87 731 748 George El and McCulloch RE 1993 Variable selection via Gibbs sampling J Amer Statist Asso 88 881 889 George El and McCulloch RE 1996 Stochastic search variable selection In Markov Chain Monte Carlo in Practice WR Gilks S Richardson and DJ Spiegelhalter eds 203 214 Chapman and Hall London George El and McCulloch RE 1997 Approaches for Bayesian variable selection Statist Sinica 7 339 73 112 H Chipman E I George and R E McCulloch George El McCulloch RE and Tsay R 1995 Two approaches to Bayesian Model se lection with applications In Bayesian Statistics and Econometrics Essays in Honor of Arnold Zellner D Berry K Chaloner and J Geweke eds 339 348 Wiley New York Geisser S 1993 Predictive Inference An Introduction Chapman and Hall London Geman S and Geman D 1984 Stochastic relaxation Gibbs distributions and the Bayesian restoration of images IEEE Trans Pattn Anal Mach Intell 6 721 741 Geweke J 1996 Variable selection and model comparison in regression ln Bayesian Statistics 5 J M Bernardo J O Berger A P Dawid and A F M Smith eds 609 620 Oxford Univ Press Gilks WR Richardson S and Spiegelhalter DJ 1996 Markov Chain Monte Carlo in Practice Chapman and Hall London Giuduci P and Green PJ 1999 Decomposable graphical Gausssian model determi nation 
Biometrika 86 785 802 Goutis C and Robert CR 1998 Model choice in generalized linear models A Bayesian approach via Kullback Leibler projections Biometrika 82 711 732 Green P 1995 Reversible Jump Markov chain Monte Carlo computation and Bayesian model determination Biometrika 82 711 732 Gustafson P 2000 Bayesian regression modelling with interactions and smooth effects J Amer Statist Asso 95 795 806 Hastie T and Pregibon D 1990 Shrinking trees ATampT Bell Laboratories Technical Report Hastings WK 1970 Monte Carlo sampling methods using Markov chain and their applications Biometrika 57 97 109 Hawkins D M and Kass G V 1982 Automatic interaction detection In Topic in Applied Multivariate Analysis D M Hawkins ed 267 302 Cambridge Univ Press Hoeting J A Raftery AB and Madigan D 1996 A method for simultaneous vari able selection and outlier identi cation in linear regression Computational Statistics and Data Analysis 22 251 270 Hoeting J A Madigan D Raftery AE and Volinsky CT 1999 Bayesian Model averaging A tutorial with discussion Statist Sci 144 382 417 Corrected ver sion available at httpwwwstatwashingtoneduwwwresearchonlinehoeting1999pdf Holmes CC Denison DGT and Mallick BK 2000 Bayesian prediction via parti tioning Technical Report Dept of Mathematics Imperial College London Holmes CC and Mallick BK 2000 Bayesian wavelet networks for nonparametric Practical Bayes Model Selection 113 regression IEEE Trans Near Netwrks 107 1217 1233 Holmes7 CC and Mallick7 BK 2001 Bayesian regression with multivariate linear splines J Roy Statist Soc Ser B637 3 18 Ibrahim7 JG and Chen7 M H 2000 Prior Elicitation and Variable Selection for Generalized Linear Mixed Models In Generalized Linear Models A Bayesian Per spective Dey7 DK7 Ghosh7 SK and Mallick7 BK eds 41 537 Marcel Dekker7 New York Johnstone7 lM and Silverman7 BW 1998 Empirical Bayes approaches to mixture problems and wavelet regression Technical Report7 Univ of Bristol Jordan7 Ml and Jacobs7 RA 1994 Mixtures of experts and 
the EM algorithm Neural Computation 67 181 214 Kass7 G V 1980 An exploratory technique for investigating large quantities of cate gorical data Appl Statist 297 119 127 Kass7 R E and Wasserman7 L 1995 A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion J Amer Statist A550 907 928 34 Key7 JT7 Pericchi7 LR and Smith7 AFM 1998 Choosing among models when none of them are true In Proc Workshop Model Selection Special Issue of Rassegna di Metodi Statistici ed Applicazioni Racugno7 ed 333 3617 Pitagora Editrice7 Bologna Knight7 K7 Kustra7 R and Tibshirani7 R 1998 Discussion of Bayesian CART model search by Chipman7 H A7 George7 E 17 and McCulloch7 R E77 J Amer Statist A550 937 950 957 Kuo7 L and Mallick7 B 1998 Variable selection for regression models Sankhya Ser B 607 6581 Laud7 PW and Ibrahim7 JG 1995 Predictive model selection J Roy Statist Soc Ser B 577 247262 Leblanc7 M and Tibshirani7 R 1998 Monotone shrinkage of trees J Comput Graph Statist 77 417 433 Liang F7 Truong7 YK and Wong7 WH 2000 Automatic Bayesian model averaging for linear regression and applications in Bayesian curve tting Technical Report7 Dept of Statistics7 Univ of California7 Los Angeles Lindley7 DV 1968 The choice of variables in regression J Roy Statist Soc Ser B 307 31 66 Liu7 JS7 Wong7 WH7 and Kong7 A 1994 Covariance structure and convergence rate of the Gibbs sampler with applications to the comparisons of estimators and 114 H Chipman E I George and R E McCulloch augmentation schemes Biometriha 81 27 40 Lutsko J F and Kuijpers B 1994 Simulated annealing in the construction of near optimal decision trees In Selecting Models from Data AI and Statistics IV P Cheeseman and R W Oldford eds 453 462 Madigan D and York J 1995 Bayesian graphical models for discrete data Internat Statist Rev 63 215 232 Mallick BK and Gelfand AB 1994 Generalized linear models with unknown num ber of components Biometriha 81 237 245 Mallows C L 1973 Some Comments on Op Technometrics 
15 661 676 McCullagh P and Nelder JA 1989 Generalized Liner Models 2nd Ed Chapman and Hall New York McCulloch RE and Rossi P 1991 A Bayesian approach to testing the arbitrage pricing theory J Econometrics 49 141 168 Metropolis N Rosenbluth AW Rosenbluth MN Teller AH and TellerE 1953 Equations of state calculations by fast computing machines J Chem Phys 21 1087 1091 Meyn SP and Tweedie RL 1993 Markov Chains and Stochastic Stability Springer Verlag New York Mitchell TJ and Beauchamp JJ 1988 Bayesian variable selection in linear regres sion with discussion J Amer Statist Asso 83 1023 1036 Ntzoufras l 1999 Aspects of Bayesian model and variable selection using MCMC PhD dissertation Dept of Statistics Athens Univ of Economics and Business Athens Greece O Hagan A 1995 Fractional Bayes factors for model comparison with discussion Jour of the Roy Statist Soc Ser B57 99 138 Oliver JJ and Hand DJ 1995 On pruning and averaging decision trees Proc Internat Machine Learning Conf 430 437 Paass G and Kindermann J 1997 Describing the uncertainty of Bayesian predictions by using ensembles of models and its application 1997 Real World Comput Symp 118 125 Real World Computing Partnership Tsukuba Research Center Tsukuba Japan Petit Ll and Smith A F M 1985 Outliers and in uential observations in linear models In Bayesian Statistics 2 JM Bernardo MH DeGroot DV Lindley and AFM Smith eds 473 494 North Holland Amsterdam Peng L 1998 Normalizing constant estimation for discrete distribution simulation Practical Bayes Model Selection 115 PhD dissertation Dept MSIS Univ of Texas Austin Phillips D B and Smith A F M 1995 Bayesian model comparison viajump diffu sions In Practical Markov Chain Monte Carlo in Practice WR Gilks S Richardson and DJ Spiegelhalter eds 215 239 Chapman and Hall London Quinlan J R 1986 Induction of decision trees Machine Learning 1 81 106 Quinlan J R 1993 045 Tools for Machine Learning Morgan Kauffman San Mateo CA Quinlan JR and Rivest RL 1989 Inferring decision trees 
using the minimum description length principle Information and Computation 80 227 248 Raftery AB 1996 Approximate Bayes factors and accounting for model uncertainty in generalized linear models Biometriha 83 251 266 Raftery A E Madigan D and Hoeting J A 1997 Bayesian model averaging for linear regression models J Amer Statist Asso 92 179 191 Raftery AE Madigan D M and Volinsky CT 1995 Accounting for model un certainty in survival analysis improves predictive performance with discussion In Bayesian Statistics 5 J M Bernardo J O Berger A P Dawid and A F M Smith eds 323 350 Oxford Univ Press Raftery AB and Richardson S 1996 Model selection for generalized linear models via GLIB application to nutrition and breast cancer Bayesian Biostatistics DA Berry and DK Strangl eds 321 353 Marcel Dekker New York San Martini A and Spezzaferri F 1984 A predictive model selection criterion J Roy Statist Soc Ser B46 296 303 Schwarz G 1978 Estimating the dimension of a model Ann Statist 6 461 4 Shively TS Kohn R and Wood S 1999 Variable selection and function estimation in additive nonparametric regression using a data based prior with discussion J Amer Statist Asso 94 777 806 Smith AFM and Roberts GO 1993 Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods J Roy Statist Soc Ser B 55 324 Smith M and Kohn R1996 Nonparametric regression using Bayesian variable selec tion Journal of Econometrics 75 317 344 Smith M and Kohn R 1997 A Bayesian approach to nonparametric bivariate re gression J Amer Statist Asso 92 1522 1535 Smith M Wong CM and Kohn R 1998 Additive nonparametric regression with autocorrelated errors J Roy Statist Soc Ser B 60 311 331 116 H Chipman E I George and R E McCulloch Sutton C 1991 Improving classi cation trees with simulated annealing Computing Science and Statistics Proc 23rd Symp Interface Keramidas ed 396 402 Interface Foundation of North America Tibshirani R and Knight K 1999 Model search by bootstrap bumping J Com put Graph 
Statist 8 671 686 Tierney L 1994 Markov chains for exploring posterior distributions with discussion Ann Statist 22 1674 1762 Tierney L and Kadane JB 1986 Accurate approximations for posterior moments and marginal densities J Amer Statist Asso 81 82 86 Troughton RT and Godsill SJ 1997 Bayesian model selection for time series using Markov Chain Monte Carlo Proc IEEE Internat Conf Acoustics Speech and Signal Processing 3733 3736 Wake eld JC and Bennett JE 1996 The Bayesian modelling of covariates for population pharmacokinetic models J Amer Statist Asso 91 917 928 Wallace 00 and Patrick JD 1993 Coding decision trees Machine Learning 11 7 22 Wasserman L 2000 Bayesian model selection and model averaging J Math Psy chology 44 92 107 Wood S and Kohn R 1998 A Bayesian approach to robust binary nonparametric regression J Amer Statist Asso 93 203 213 Zellner A 1986 On assessing prior distributions and Bayesian regression analysis With g prior distributions ln Bayesian Inference and Decision Techniques Essays in Honor of Bruno de Finietti PK Goel and A Zellner eds 233 43 North Holland Amsterdam M Clyde 117 DISCUSSION M Clyde Duke University I would like to begin by thanking the authors for their many contributions to Bayesian model selection and for providing an excellent summary of the growing body of literature on Bayesian model selection and model uncertainty The development of computational tools such as the Gibbs sampler and Markov chain Monte Carlo approaches has led to an explosion in Bayesian approaches for model selection On the surface Bayesian model averaging BMA and model selection is straightforward to implement one speci es the distribution of the data and the prior probabilities of models and model speci c parame ters Bayes theorem provides the rest As the authors point out the two major challenges confronting the practical implementation of Bayesian model selection are choosing prior distributions and calculating posterior distributions In my experience I 
have found that this is especially true in high dimensional problems such as wavelet regression or non parametric models where subjective elicitation of prior distributions is practically infeasible and enumeration of the number of potential models is intractable Clyde et al 1998 Clyde and George 2000 Choice of Prior Distributions The speci cation of prior distributions is often broken down into two parts 1 elici tation of distributions for parameters speci c to each model such as the distribution for regression coef cients in linear models p l39y 02 and 2 selection of a prior distri bution over models p y For high dimensional problems one cannot specify the prior probability of each 39y directly and practical implementations of Bayesian selection have usually made prior assumptions that the presence or absence of a variable is independent of the presence or absence of other variables As a special case of this the uniform prior distribution over models is appealing in that posterior probabilities of models depend only on the likelihood However this prior distribution may not be sensible for model averaging when there are groups of similar variables and does not provide the proper dilution of posterior mass over similar models see Clyde 1999 Hoeting et al 1999 and discussion therein George 1999 1999a In this regard uniform and independent prior distributions must be used carefully with highly correlated explanatory variables and special consideration should be given to constructing the model space Even with OMerlise Clyde is Associate Professor Institute of Statistics and Decision Sciences Duke University Durham NC 277080251 USA email clydestatDukeEDU 118 Discussion uniform priors on models the posterior distribution over models naturally penalizes adding redundant variables however this may not be enough to lead to the proper rate of dilution with nearly redundant variables One approach to constructing dilution prior distributions is to start with a uniform prior 
over models and use imaginary training data to construct a posterior distribution for 39y based on the training data this posterior would become the prior distribution for 39y for the observed data Selection of the train ing data and hyperparameters are non trivial issues however this approach would likely provide better dilution properties than starting with an independent prior Clearly con struction of prior distributions for models that capture similarities among variables in a sensible way is a difficult task and one that needs more exploration For 1 by far the most common choice is a normal prior distribution for such as in the conjugate setup for point prior selection models section 32 CGM where n N N002X7 Again as one cannot typically specify a separate prior distribution for under each model any practical implementation for Bayesian model uncertainty usually resorts to structured families of prior distributions Another important consideration is whether prior speci cations for are compatible across models Dawid and Lauritzen 2000 For example suppose that Model 1 contains variables 1 and 2 Model 2 contains variables 2 and 3 and Model 3 includes only variable 2 With apologies for possible abuse of notation let g denote the coefficient for variable 2 in each model With completely arbitrary choices for 27 under Model 1 the variance for g given that 1 0 may not be the same as the variance for g given that g 0 derived under Model 2 and both may differ from the variance for g under the prior distribution given Model 3 To avoid this incompatibility choices for E are often obtained from conditional speci cations ie conditional on i 0 for w 0 derived from a prior distribution for under the full model For example Zellner s g prior Zellner 1986 is commonly used which leads to E CX X 1 for the full model and EA CXZXA 1 for model 39y While in many cases conditioning leads to sensible choices the result may depend on the choice of parameterization which can lead to a 
Borel paradox Kass and Raftery 1995 Dawid and Lauritzen 2000 Other structured families may be induced by marginal ization or projections leading to possibly different prior distributions While structured prior distributions may reduce the number of hyperparameters that must be selected ie to one parameter 0 how to specify 0 is still an issue The choice of 0 requires careful consideration as it appears in the marginal likelihoods of the data and hence the posterior model probabilities and in my experience can be in uential In the situation where little prior information is available being too non informative77 about taking 0 large can have the un intended consequence of favoring the null model a posteriori Kass and Raftery 1995 While default prior distributions M Clyde 119 both proper and improper can be calibrated based on information criteria such as AIC Akaike Information Criterion Akaike 1973 BIC Bayes Information Criterion 7 Schwarz 1978 or RIC Risk In ation Criterion 7 Foster and George 1994 there remains the question of which one to use Clyde 2000 George and Foster 2000 Fernandez et al 1998 such decisions may relate more to utilities for model selection rather than prior beliefs although it may be impossible to separate the two issues Based on simulation studies Fernandez et al1998 recommend RIC like prior distributions when n lt p2 and BIC like prior distributions otherwise In wavelet regression where p 71 there are cases where priors calibrated based on BIC have better predictive performance than prior distributions calibrated using RIC and vice versa From simulation studies and asymptotic arguments it is clear that there is no one default choice for c that will perform well for all contingencies Fernandez et al 1998 Hansen and Yu 1999 Empirical Bayes approaches provide an adaptive choice for c One class of problems where BMA has had outstanding success is in non parametric regression using wavelet bases In nonparametric wavelet regression where subjective 
information is typically not available empirical Bayes methods for estimating the hyperparameters have empiri cally led to improved predictive performance over a number of xed hyperparameter speci cations as well as default choices such as AIC BIC and RIC Clyde and George 1999 2000 George and Foster 2000 Hansen and Yu 1999 for both model selection and BMA Because of the orthogonality in the design matrix under discrete wavelet transformations EB estimates can be easily obtained using EM algorithms Clyde and George 1999 2000 Johnstone and Silverman 1998 and allow for fast analytic expres sions for Bayesian model averaging and model selection despite the high dimension of the parameter space p n and model space 2 bypassing MCMC sampling altogether George and Foster 2000 have explored EB approaches to hyperparameter selection for linear models with correlated predictors EM algorithms for obtaining estimates of c and w as in Clyde and George 2000 can be adapted to the non orthogonal case with unknown 02 using the conjugate point prior formulation and independent model space priors equation 32 in CGM For the EM algorithm both model indicators 39y and 02 are treated as latent data and imputed using current estimates of c and w wl wp where wj is the prior probability that variable j is included under the independence prior At iteration i this leads to 29007 629Mwm A0 1 l pltwcltlgtgtpltviwltlgtgt l 139 21 7 7 C S 7 Y Y 71 Ca SSRW 2 where SSRW is the usual regression sum of squares and 3 is a Bayesian version of 120 Discussion residual sum of squares using the current estimate a Values of c and an that maximize the posterior distribution for c and as given the observed data and current values of the latent data are A i1 A 39 w gt Z W lt3 7 such that 771 6039 max 0 23990 1 lt4gt Au Salim u where 1 and are hyperparameters in the inverse gamma prior distribution for 02 CGM equation 310 These steps are iterated until estimates converge EB estimates based on a common as for all 
variables are also easy to obtain For ridgeregression independent priors with E c or other prior distributions for B estimates for wj have the same form but estimates for c are slightly more complicated and require numerical optimization The ratio in the expression for 6 has the form of a generalized F ratio which is weighted by estimates of posterior model probabilities The EM algorithm highlights an immediate difficulty with a common 0 for all models as one or two highly signi cant coefficients may in uence the EB estimate of c For example the intercept may be centered far from 0 and may have an absolute t ratio much larger than the t ratios of other coefficients As the intercept is in all models it contributes to all of the SSRW terms which has the effect of increasing 0 as the absolute t ratio increases Since the same 0 appears in the prior variance of all other coefficients if 0 becomes too large in order to account for the size of the intercept we risk having the null model being favored Bartlett s paradox Kass and Raftery 1995 While one could use a different prior distribution for the intercept even a non informative prior distribution which would correspond to centering all variables the problem may still arise among the other variables if there are many moderate to large coefficients and a few that have extreme standardized values Implicit in the formulation based on a common 0 is that the non zero standardized coefficients follow a normal distribution with a common variance As such this model cannot accommodate one or a few extremely large standardized coefficients without increasing the odds that the remaining coefficients are zero Using a heavy tailed prior distribution for B may result in more robust EB estimates of c Clyde and George 2000 Other possible solutions including adding additional structure into the prior that would allow for different groups of coefficients with a different c in each group In the context of wavelet regression coefficients 
are grouped based on the multi resolution wavelet decomposition in other problems there may not be any natural a priori groupings Related to EB methods is the minimum description length MDL approach to model selection which effectively uses a different an estimated from the data M Clyde 121 for each model Hansen and Yu 1999 While EB methods have led to improvements in performance part of the success depends on careful construction of the modelprior Some of the problems discussed above highlight possible restrictions of the normal prior Unlike the EM estimates for orthogonal regression the EB approach with correlated predictors involves a summation over all models which is clearly impractical in large problems As in the inference problem one can base EB estimation on a sample of models This approach has worked well in moderate sized problems where leaps and bounds Furnival and Wilson 1974 was used to select a subset of models these were then used to construct the EB prior distribution and then estimates under BMA with the estimated prior For larger problems leaps and bounds may not be practical feasible or suitable such as CART models and models must be sampled using MCMC or other methods How to scale the EBEM approach up for larger problems where models must be sampled is an interesting problem In situations where there is uncertainty regarding a parameter the Bayesian approach is to represent that prior uncertainty via a distribution In other words why not add an other level to the hierarchy and specify a prior distribution on 0 rather than using a xed value While clearly feasible using Gibbs sampling and MCMC methods analytic calcu lation of marginal likelihoods is no longer an option Empirical Bayes EB estimation of 0 often provides a practical compromise between the fully hierarchical Bayes model and Bayes procedures where c is xed in advance The EB approach plugs in the modal 0 into g y which ignores uncertainty regarding c while a fully Bayes approach would 
integrate over 0 to obtain the marginal likelihood As the latter does not exist in closed form Monte Carlo frequencies of models provide consistent estimates of posterior model probabilities However in large dimensional problems where frequencies of most models may be only 0 or 1 it is not clear that Monte Carlo frequencies of models pf39le S from implementing MCMC for the fully Bayesian approach are superior to using renormalized marginal likelihoods evaluated at the EB estimate of c When the EB estimate of 0 corresponds to the posterior mode for c renormalized marginal likelihoods g39y evalu ated at the EB estimate of c are closely related to Laplace approximations Tierney and Kadane 1986 for integrating the posterior with respect to c the Laplace approximation would involve a term with the determinant of the negative Hessian of the log posterior A hybrid approach where MCMC samplers are used to identifysample models from the fully hierarchical Bayes model but one evaluates posterior model probabilities for the unique models using Laplace approximations may provide better estimates that account for uncertainty in c 122 Discussion Implementing Sampling of Models In the variable selection problem for linear regression marginal likelihoods are available in closed form at least for nice conjugate prior distributions for generalized linear models and many other models Laplace s method of integration can provide accurate approximations to marginal distributions The next major problem is that the model space is often too large to allow enumeration of all models and beyond 20 25 variables estimation of posterior model probabilities model selection and BMA must be based on a sample of models Deterministic search for models using branch and bounds or leaps and bounds algo rithms Furnival and Wilson 1974 is ef cient for problems with typically fewer than 30 variables For larger problems such as in non parametric models or generalized addi tive models these methods are too 
expensive computationally, or do not explore a large enough region of the model space, producing poor fits (Hanson and Kooperberg, 1999). While Gibbs and MCMC sampling have worked well in high dimensional orthogonal problems, Wong et al. (1997) found in high dimensional problems, such as non-parametric regression using non-orthogonal basis functions, that Gibbs samplers were unsuitable from both a computational efficiency standpoint as well as for numerical reasons, as the sampler tended to get stuck in local modes. Their proposed focused sampler focuses on variables that are more active at each iteration, and in simulation studies provided better MSE performance than other classical non-parametric approaches or Bayesian approaches using Gibbs or reversible jump MCMC sampling. Recently, Holmes and Mallick (1998) adapted perfect sampling (Propp and Wilson, 1996) to the context of orthogonal regression. While more computationally intensive per iteration, this may prove to be more efficient for estimation than Gibbs sampling or MH algorithms in problems where the method is applicable. Whether perfect sampling can be used with non-orthogonal designs is still open.

With the exception of deterministic search, most methods for sampling models rely on algorithms that sample models with replacement. In cases where g(γ | y) is known, model probabilities may be estimated using renormalized model probabilities (Clyde et al., 1996). As no additional information is provided under resampling models, algorithms based on sampling models without replacement may be more efficient. Under random sampling without replacement with equal probabilities, the estimates of model probabilities (CGM, equation 3.56) are ratios of Horvitz-Thompson estimators (Horvitz and Thompson, 1952) and are simulation consistent. Current work (joint with M. Littman) involves designing adaptive algorithms for sampling without replacement, where sampling probabilities are sequentially updated. This appears to be a promising direction for implementation of
Bayesian model selection and model averaging.

Summary

CGM have provided practical implementations for Bayesian model selection in linear and generalized linear models, non-parametric regression, and CART models, as well as spurred advances in research, so that it is feasible to account for model uncertainty in a wide variety of problems. Demand for methods for larger problems seems to outpace growth in computing resources, and there is a growing need for Bayesian model selection methods that "scale up" as the dimension of the problem increases. Guidance in prior selection is also critical, as in many cases prior information is limited. For example, current experiments using gene-array technology result in high dimensional design matrices (p > 7000), however the sample size may only be on the order of 10-100 (Spang et al., 2000). Identifying which genes (corresponding to columns of X) are associated with outcomes (response to treatment, disease status, etc.) is a challenging problem for Bayesian model selection, from both a computational standpoint as well as the choice of prior distributions.

ADDITIONAL REFERENCES

Clyde, M. (2000). Model uncertainty and health effect studies for particulate matter. To appear in Environmetrics.
Dawid, A. and Lauritzen, S. (2000). Compatible prior distributions. Technical Report.
Fernandez, C., Ley, E. and Steel, M.F. (1998). Benchmark priors for Bayesian model averaging. Technical report, Dept. of Econometrics, Tilburg Univ., Netherlands.
Furnival, G.M. and Wilson, Robert W.J. (1974). Regression by leaps and bounds. Technometrics 16, 499-511.
George, E.I. (1999a). Discussion of "Bayesian Model Averaging: A Tutorial" by Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. Statist. Sci. 14, 401-404.
Hansen, M. and Yu, B. (1999). Model selection and the principle of minimum description length. Technical Report, http://cm.bell-labs.com/who/cocteau/papers.
Hanson, M. and Kooperberg, C. (1999). Spline adaptation in extended linear models. Technical Report, http://cm.bell-labs.com/who/cocteau/papers.
Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T.
(1999). Bayesian model averaging: A tutorial (with discussion). Statist. Sci. 14, 382-417.
Holmes, C. and Mallick, B.K. (1998). Perfect simulation for orthogonal model mixing. Technical Report, Dept. of Math., Imperial College, London.
Horvitz, D. and Thompson, D. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47, 663-685.
Kass, R.E. and Raftery, A.E. (1995). Bayes factors. J. Amer. Statist. Assoc. 90, 773-795.
Propp, J. and Wilson, D. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms 9, 223-252.
Spang, R., Zuzan, H., West, M., Nevins, J., Blanchette, C. and Marks, J.R. (2000). Prediction and uncertainty in the analysis of gene expression profiles. Discussion paper, ISDS, Duke Univ.
Wong, F., Hansen, M.H., Kohn, R. and Smith, M. (1997). Focused sampling and its application to nonparametric and robust regression. Bell Labs Technical Report, http://cm.bell-labs.com/who/cocteau/papers.

Dean P. Foster and Robert A. Stine
University of Pennsylvania

We want to draw attention to three ideas in the paper of Chipman, George and McCulloch (henceforth CGM). The first is the importance of an adaptive variable selection criterion. The second is the development of priors for interaction terms. Our perspective is information theoretic rather than Bayesian, so we briefly review this alternative perspective. Finally, we want to call attention to the practical importance of having a fully automatic procedure. To convey the need for automatic procedures, we discuss the role of variable selection in developing a model for credit risk from the information in a large database.

Adaptive variable selection

A method for variable selection should be adaptive. By this we mean that the prior, particularly p(γ), should adapt to the complexity of the model that matches the data, rather than impose an external presumption of the number of variables in the model. One may argue that in reasonable problems the modeler should have a good idea how many predictors are going to
be useful. It can appear that a well-informed modeler does not need an adaptive prior and can use simpler, more rigid alternatives that reflect knowledge of the substantive context. While domain knowledge is truly useful, it does not follow that such knowledge conveys how many predictors belong in a model.

(We are Associate Professors, Department of Statistics, The Wharton School of the University of Pennsylvania, Philadelphia, PA 19104-6302, USA; emails foster@diskworld.wharton.upenn.edu and stine@wharton.upenn.edu.)

The problem is made most transparent in the following, admittedly artificial, setting: a small error in the choice of the basis in an orthogonal regression can lead to a proliferation in the number of required predictors. Suppose that we wish to predict future values of a highly periodic sequence, one dominated by a single sinusoid with frequency ω. If we approach this as a problem in variable selection and use the common Fourier basis to define the collection of predictors, the number of predictors is influenced by how close the frequency of the dominant cycle comes to a Fourier frequency. Fourier frequencies are of the form ωj = 2πj/n, indicating sinusoids that complete precisely j cycles during our n observations. If it so happens that ω = ωj for some j, then our model will likely need but one sinusoid to model the response. If ω is not of this form, however, our model will require many sinusoids from the Fourier basis to fit the data well. For example, with n = 256 and ω = 2π(5.5)/n, it takes 8 sinusoids at Fourier frequencies to capture 90% of the variation in this signal. The optimal basis would need but one sinusoid. Adaptive thresholding (the empirical Bayes approach) is forgiving of such errors, whereas dogmatic methods that anticipate, say, a single sinusoid are not.

Information theory and the choice of priors

A difficult choice in the use of Bayesian models for variable selection is the choice of a prior, particularly a prior for the subspace identifying the predictors. We have
found coding ideas drawn from information theory useful in this regard, particularly the ideas related to Rissanen's minimum description length (MDL). The concreteness of coding offers appealing metaphors for picking among priors that produce surprisingly different selection criteria. In the Bayesian setting, calibration also offers a framework for contrasting the range of variable selection criteria.

The problem we consider from information theory is compression. This problem is simple to state. An encoder observes a sequence of n random variables Y = (Y1, ..., Yn), and his objective is to send a message conveying these observations to a decoder using as few bits as possible. In this context, a model is a completely specified probability distribution, a distribution that is shared by the encoder and decoder. Given that both encoder and decoder share a model P(Y) for the data, the optimal message length (here the so-called "idealized length", since we ignore fractional bits and the infinite precision of real numbers) is

    ℓ(Y) = log2 (1 / P(Y)) bits.

If the model is a good representation for the data, then P(Y) is large and the resulting message length is small. Since the encoder and decoder share the model P(Y), they can use a technique known as arithmetic coding to realize this procedure. But what model should they use?

Common statistical models like the linear model are parametric models P_θq indexed by a q-dimensional parameter θq. For example, suppose that the data Y are generated by the Gaussian linear model

    Y = θ1 X1 + θ2 X2 + ... + θq Xq + ε,   ε ~ N(0, σ²).

To keep the analysis straightforward, we will assume σ² is known (see Barron, Rissanen and Yu, 1998, for the general case). Given this model, the shortest code for the data is obtained by maximizing the probability of Y, namely using maximum likelihood (i.e., least squares) to estimate θq, and we obtain a message with length

    log2 (1 / P_θ̂q(Y)) = RSS(θ̂q) / (2σ² ln 2) + (n/2) log2(2πσ²),

where RSS(θ̂q) is the residual sum of squares. This code length is not realizable, however, since P_θ̂q is not a model in our sense.
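The idealized code length can be made concrete with a toy example. The Bernoulli models and data below are our own illustration, not from the discussion; the point is only that ℓ(Y) = log2 1/P(Y) rewards models that assign the observed data higher probability.

```python
import math

# Idealized code length l(Y) = log2(1/P(Y)) for ten coin flips under two
# candidate models; a model that fits the data better yields a shorter code.
flips = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # 8 heads in 10 flips

def code_length(p_heads, data):
    """Bits needed to encode `data` under an iid Bernoulli(p_heads) model."""
    return sum(-math.log2(p_heads if y == 1 else 1.0 - p_heads) for y in data)

bits_fair = code_length(0.5, flips)   # fair-coin model: exactly 10 bits
bits_fit = code_length(0.8, flips)    # model closer to the data: about 7.2 bits
```

The fair-coin model spends one bit per flip no matter what; the better-fitting model compresses the same data into fewer bits, which is the sense in which "choosing the best-predicting model" and "choosing the shortest code" coincide.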
The normal density for Y with parameters (q, θ̂q(Y)) integrates to more than one,

    C_{n,q} = ∫ P_θ̂q(y)(y) dy > 1.

Once normalized (with the help of some benign constraints that make the integral finite but do not interfere with variable selection; see, e.g., Rissanen, 1986), the code length associated with the model P_θ̂q / C_{n,q} is

    ℓ(Y) = log2 C_{n,q} + log2 (1 / P_θ̂q(Y)).   (1)

The need for such normalization reminds us that coding does not allow improper priors; improper priors generate codes of infinite length. We can think of the first summand in (1) as the length of a code for the parameters θ̂q, thus defining a prior for θq, and the second as a code for the compressed data. This perspective reveals how coding guards against overfitting: adding parameters to the model will increase C_{n,q} while reducing the length for the data.

So far, so good, but we have not addressed the problem of variable selection. Suppose that both the encoder and decoder have available a collection of p possible predictors to use in this q-variable regression. Which predictors should form the code? In this expanded context, our code at this point is incomplete, since it includes q but does not identify the q predictors. It is easy to find a remedy: simply prefix the message with the p bits in γ. Since codes imply probabilities, the use of p bits to encode γ implies a prior, p1 say, for these indicators. This prior is the iid Bernoulli model with probability Pr(γj = 1) = 1/2, for which the optimal code length for γ is indeed p,

    log2 (1 / p1(γ)) = log2 2^p = p.

Since adding a predictor does not affect the length of this prefix (it is always p bits), we add the predictor Xq+1 if the gain in data compression, represented by the reduction in RSS, compensates for the increase in the normalizing constant C_{n,q}. Using a so-called two-part code to approximate the code length (1), we have shown (Foster and Stine, 1996) that this approach leads to a thresholding rule: for orthogonal predictors, this criterion amounts to choosing those predictors whose z-statistic zj = θ̂j / SE(θ̂j) exceeds a threshold near √2.
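As a numerical sketch of this thresholding rule (the z-values below are our own, purely illustrative), the √2 cutoff and the resulting reduction in the data part of the code can be computed directly:

```python
import math

# Illustrative z-statistics for candidate orthogonal predictors.
z = [0.4, 1.2, 1.5, 2.1, 3.0, 5.7]

# Cutoff implied by the fixed p-bit prefix code.
threshold = math.sqrt(2.0)
keep = [zj for zj in z if abs(zj) > threshold]

# Fitting a kept coefficient shrinks the RSS term of the code by about
# zj^2 / (2 ln 2) bits: the "gain in data compression" for that predictor.
savings_bits = [zj**2 / (2 * math.log(2)) for zj in keep]
```

Under this fixed prefix, the prefix cost never changes, so a predictor enters whenever its compression gain outweighs the increase in the normalizing constant, which works out to the |z| > √2 rule above.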
Such a procedure resembles the frequentist selection procedure AIC, which uses a threshold of √2 in this context.

Now suppose that p is rather large. Using the p bits to represent γ seems foolish if we believe but one or two predictors are likely to be useful. If indeed few predictors are useful, we obtain a shorter message by instead forming a prefix from the indices of those γj = 1. Each index now costs us about log2 p bits and implies a different prior for γ. This prior, p2 say, is again iid Bernoulli but with small probability Pr(γj = 1) = 1/p; the optimal code length for γ under p2 is

    log2 (1 / p2(γ)) = q log2 p − (p − q) log2(1 − 1/p) ≈ q log2 p   for q ≪ p.

Notice how this code assigns a higher cost to adding a predictor to the model. Adding a predictor does not affect the length of the prefix given by p1. With p2(γ) as the prior, however, adding a predictor adds both a coefficient as well as its index to the message; the prefix grows by an additional log2 p bits. For orthogonal regression and two-part codes, this approach implies a threshold for |zj| near √(2 log p), which also happens to correspond roughly to another frequentist procedure. This is the well-known Bonferroni method, which retains predictors whose p-value is less than α/p for some 0 ≤ α ≤ 1.

Both of these codes have some appeal, and correspond to frequentist methods as well as Bayesian priors, but neither is adaptive. The prior for the first code, with a fixed p-bit prefix, expects half of the predictors to be useful. The second has a prior that expects only one predictor to enter the model. As codes, both are flawed. The gold standard in coding is to compress the data down to the limit implied by the entropy of the underlying process, whatever that process may be. Both p1(γ) and p2(γ) only approach that limit when they happen to be right, e.g., when in fact only one predictor is needed in the model. Alternatively, so-called universal codes exist for representing binary sequences, and these compress γ almost as well as if the underlying probability
Assuming that the elements W are iid one can drop this condition as well a universal code represents 39yq using about pHqp bits where H denotes the Boolean entropy function H lo 7117 lo 1 0lt lt1 U U U U g2 u g2 1 U7 7 7 Universal codes are adaptive in that they perform well for all values of qp doing almost as well as either of the previous codes when they happen to be right but much better in other cases Returning to the setting of an orthogonal regression a universal code also implies a threshold for adding a predictor The threshold in this case now depends on how many predictors are in the model One adds the predictor Xj to a model that already has q predictors if its absolute z statistic gt W This is essentially the empirical Bayes selection rule discussed by CGM in Section 33 The threshold decreases as the model grows adapting to the evidence in favor of a larger collection of predictors Again this procedure is analogous to a frequentist method namely stepup testing as described for example in Benjamini and Hochberg 1995 Coding also suggests novel priors for other situations when the elements of 39y are not so anonymous For example consider the treatment of interaction terms In the application we discuss in the next section we violate the principle of marginality and treat interactions in a non hierarchical fashion That is we treat them just like any other coefficient Since we start with 350 predictors the addition of interactions raises the number of possible variables to about p 67 000 Since they heavily outnumber the linear terms interactions dominate the predictors selected for our model Coding the model differently leads to a different prior For example consider a variation on the second method for encoding 39y by giving the index of the predictor One could modify this code to handle interactions by appending a single bit to all indices for interactions This one bit would indicate whether the model included the underlying linear terms as well In this way 
the indices for Xj, Xk and Xj·Xk could be coded in 1 + log2 p bits rather than 3 log2 p bits, making it much easier for the selection criterion to add the linear terms.

An application of automatic adaptive selection

Methods for automatic variable selection matter most in problems that confront the statistician with many possibly relevant predictors. If the available data set holds, say, 1000 observations but only 10 predictors, then variable selection is not going to be very important: the fitted model with all 10 of these predictors is going to do about as well as anything. As the number of predictors increases, however, there comes a point where an automatic method is necessary. What constitutes a large number of possible predictors? Probably several thousand or more. Such problems are not simply imaginary scenarios; they are the common fodder for data mining.

Here is one example of such a problem, one that we discuss in detail in Foster and Stine (2000). The objective is to anticipate the onset of bankruptcy for credit card holders. The available data set holds records of credit card usage for a collection of some 250,000 accounts. For each account we know a variety of demographic characteristics, such as place and type of residence of the card holder. When combined with several months of past credit history and indicators of missing data, we have more than 350 possible predictors. (The presence of missing data adds further features, and indeed we have found the absence of certain data to be predictive of credit risk.) Though impressive at first, the challenge of choosing from among 350 features is nonetheless small by data mining standards. For predicting bankruptcy, we have found interactions between pairs or even triples to be very useful. Considering pairwise interactions swells the number of predictors to over 67,000. It would be interesting to learn how to apply a Gibbs sampler to such problems with so many possible features. Though challenging for any methodology, problems
of this size make it clear that we must have automated methods for setting prior distributions.

To handle 67,000 predictors, we use adaptive thresholding and stepwise regression. Beginning from the model with no predictors, we identify the first predictor Xj1 that by itself explains the most variation in the response. We add this predictor to the model if its t-statistic tj1 = β̂j1 / se(β̂j1) in absolute value exceeds the threshold √(2 log p). If Xj1 meets this criterion, we continue and find the second predictor Xj2 that, when combined with Xj1, explains the most variation in Y. Rather than compare the associated t-statistic tj2 to the initial threshold, we reduce the threshold to √(2 log(p/2)), making it now easier for Xj2 to enter the model. This process continues, greedily adding predictors so long as the t-statistic for each exceeds the declining threshold:

    Step q: Add predictor Xjq  ⟺  |tjq| > √(2 log(p/q)).

Benjamini and Hochberg (1995) use essentially the same procedure in multiple testing, where it is known as step-up testing.

This methodology works in this credit modelling in that it finds structure without overfitting. Figure 1 shows a plot of the residual sum of squares as a function of model size; as usual, the RSS decreases as predictors are added. The plot also shows the cross-validation sum of squares (CVSS), computed by predicting an independent sample. (The validation sample for these calculations has about 2,400,000 observations; we scaled the CVSS to roughly match the scale of the RSS.) Our search identified 39 significant predictors, and each of these (with the exception of the small bump) improves the out-of-sample performance of the model. Although the CVSS curve is flat near a model size of 39, it does not show the rapid increase typically associated with overfitting.

[Figure 1: Residual and cross-validation sums of squares for predicting bankruptcy.]

Gibbs sampling and the practical Bayesian methods discussed by CGM offer an interesting alternative to our search and selection procedure. They have
established the foundations, and the next challenge would seem to be the investigation of how their search procedure performs in the setting of real applications such as this. Greedy selection methods such as stepwise regression have been well studied and probably do not find the best set of predictors. Once Xj1 becomes the first predictor, it must be in the model. Such a strategy is clearly optimal for orthogonal predictors, but can be tricked by collinearity. Nonetheless, stepwise regression is fast, and comparisons have shown it to be competitive with all-possible-subsets regression (e.g., see the discussion in Miller, 1990). Is greedy good enough, or should one pursue other ways of exploring the space of models via Gibbs sampling?

ADDITIONAL REFERENCES

Barron, A., Rissanen, J. and Yu, B. (1998). The minimum description length principle in coding and modelling. IEEE Trans. Info. Theory 44, 2743-2760.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57, 289-300.
Foster, D.P. and Stine, R.A. (1996). Variable selection via information theory. Technical Report, Northwestern Univ., Chicago.
Foster, D.P. and Stine, R.A. (2000). Variable selection in data mining: Building a predictive model for bankruptcy. Unpublished manuscript.
Miller, A.J. (1990). Subset Selection in Regression. Chapman and Hall, London.
Rissanen, J. (1986). Stochastic complexity and modelling. Ann. Statist. 14, 1080-1100.

REJOINDER

Hugh Chipman, Edward I. George and Robert E. McCulloch

First of all, we would like to thank the discussants, Merlise Clyde, Dean Foster and Robert Stine, for their generous discussions. They have each made profound contributions to model selection, and this comes through in their insightful remarks. Although there is some overlap in the underlying issues they raise, they have done so from different vantage points. For this reason, we have organized our responses around each of their discussions separately.

Clyde

Clyde raises key
issues surrounding prior selection. For choosing model space priors for the linear model with redundant variables, she confirms the need to move away from uniform and independence priors towards dilution priors. This is especially true in high dimensional problems, where independence priors will allocate most of their probability to neighborhoods of redundant models. Clyde's suggestion to use imaginary training data to construct a dilution prior is a very interesting idea. Along similar lines, we have considered dilution priors for the linear model where p(γ) is defined as the probability that Y* ~ Nn(0, I) is closer to the span of Xγ than the span of any other Xγ'. Here Y* can be thought of as imaginary training data reflecting ignorance about the direction of Y. Further investigation of the construction of effective dilution priors for linear models is certainly needed.

Clyde comments on the typical choices of Σγ for the coefficient prior p(βγ | σ², γ) = Nqγ(β̄γ, σ² Σγ), namely Σγ = c(X'γ Xγ)⁻¹ and Σγ = c I. In a sense, these choices are the two extremes: c(X'γ Xγ)⁻¹ serves to reinforce the likelihood covariance, while c I serves to break it apart. As we point out, the coefficient priors under such Σγ are the natural conditional distributions of the nonzero components of β given γ when β ~ Np(0, c σ² (X'X)⁻¹) and β ~ Np(0, c σ² I), respectively. The joint prior p(β | σ²) then corresponds to a reweighting of the conditional distributions according to the chosen model space prior p(γ). With respect to such joint priors, the conditional distributions are indeed compatible in the sense of Dawid and Lauritzen (2000). Although not strictly necessary, we find such compatible specifications to provide an appealingly coherent description of prior information.

We agree with Clyde that the choice of c can be absolutely crucial. As the calibration result in (3.17) shows, different values of c, and hence different selection criteria such as AIC, BIC and RIC, are appropriate for different states of nature. For larger models with many
small nonzero coefficients, smaller values of c are more appropriate, whereas for parsimonious models with a few large coefficients, larger values of c are better. Of course, when such information about the actual model is unavailable, as is typically the case, the adaptive empirical Bayes methods serve to insure against poor choices. It is especially appealing that, by avoiding the need to specify hyperparameter values, empirical Bayes methods are automatic, a valuable feature for large complicated problems. Similar features are also offered by fully Bayes methods that margin out the hyperparameters with respect to hyperpriors. The challenge for the implementation of effective fully Bayes methods is the selection of hyperpriors that offer strong performance across the model space while avoiding the computational difficulties described by Clyde.

Clyde points out an important limitation of using empirical Bayes methods with conditional priors of the form p(βγ | σ², γ) = Nqγ(β̄γ, σ² c Vγ). When the actual model has many moderate-sized coefficients and but a few very large coefficients, the few large coefficients will tend to inflate the implicit estimate of c, causing the moderate-sized coefficients to be ignored as noise. In addition to the heavy-tailed and the grouped prior formulations she describes for mitigating such situations, one might also consider elaborating the priors by adding a shape hyperparameter.

Finally, Clyde discusses the growing need for fast Bayesian computational methods that "scale up" for very large high dimensional problems. In this regard, it may be useful to combine heuristic strategies with Bayesian methods. For example, George and McCulloch (1997) combined globally greedy strategies with local MCMC search in applying Bayesian variable selection to build tracking portfolios. In our response to Foster and Stine below, we further elaborate on the potential of greedy algorithms for such purposes.

Foster and Stine

Foster and Stine begin by emphasizing the need for adaptive
procedures. We completely agree. The adaptive empirical Bayes methods described in Section 3.3 offer improved performance across the model space while automatically avoiding the need for hyperparameter specification. For more complicated settings, adaptivity can be obtained by informal empirical Bayes approaches that use the data to gauge hyperparameter values, such as those we described for the inverse gamma distribution in Sections 3.2 and 4.1.2. In the sinusoid modelling example of Foster and Stine, a simple adaptive resolution is obtained by a Bayesian treatment with a prior on ω. This nicely illustrates the fundamental adaptive nature of Bayesian analysis. By using priors, rather than fixed arbitrary values, to describe the uncertainty surrounding the unknown characteristics in a statistical problem, Bayesian methods are automatically adaptive. We attribute the adaptivity of empirical Bayes methods to their implicit approximation of a fully Bayes approach.

Foster and Stine go on to discuss some revealing analogies between strategies for minimum length coding and formulations for Bayesian model selection. The key idea is that the probability model for the data, namely the complete Bayesian formulation, also serves to generate the coding strategy. Choosing the probability model that best predicts the data is tantamount to choosing the optimal coding strategy. Foster and Stine note that improper priors are unacceptable because they generate infinite codes. This is consistent with our strong preference for proper priors for model selection. They point out the potential inefficiencies of Bernoulli model prior codes for variable selection, and use them to motivate a universal code that adapts to the appropriate model size. This is directly analogous to our observation in Section 3.3 that different hyperparameter choices for the Bernoulli model prior (3.15) correspond to different model sizes, and that an empirical Bayes hyperparameter estimate adapts to the appropriate model size. It should be the case that the universal prior corresponds to a fully Bayes prior that is approximated by the empirical Bayes procedure. Finally, their coding scheme for interactions is interesting, and clearly effective for parsimonious models. Such a coding scheme would seem to correspond to a hierarchical prior that puts a Bernoulli(1/p) prior on each potential triple (two linear terms and their interaction) and a conditionally uniform prior on the elements of the triple.

The credit risk example given by Foster and Stine raises several interesting issues. It illustrates that, with this large dataset, an automatic stepwise search algorithm can achieve promising results. Figure 1 shows how their adaptive threshold criterion guards against overfitting, although the cross-validation results seem also to suggest that a smaller number of terms, around 20, is adequate for prediction. Another automatic adaptive alternative to consider here would be a stepwise search based on the empirical Bayes criterion CCML in (3.22). It would also be interesting to investigate the potential of one of the Bayesian variable selection approaches using the hierarchical priors described in Section 3.1 to account for potential relationships between linear and interaction terms. As opposed to treating all potential predictors independently, such priors tend to concentrate prior mass in a smaller, more manageable region of the model space. For example, Chipman, Hamada and Wu (1997) considered an 18-run designed experiment with 8 predictors used in a blood glucose experiment. The non-orthogonal design made it possible to consider a total of 113 terms, including quadratic terms and interactions. They found that independence priors of the form (3.2) led to such a diffuse posterior that, in 10,000 steps of a Gibbs sampling run, the most frequently visited model was visited only 3 times. On the other hand, hierarchical priors like (3.7) raised posterior mass on the most probable model to around 15%.
In the same problem, stepwise methods were unable to find all the different models identified by stochastic search. In effect, priors that account for interactions, or other structure such as correlations between predictors (which can lead to the dilution problem discussed in Section 3.1), can narrow the posterior to models which are considered more plausible. We note, however, that the credit risk example is much larger than this example, and because the number of observations there is much larger than the number of predictors, such a hierarchical prior may have only a minor effect.

The credit risk example is also a clear illustration of the everlasting potential of greedy search algorithms on very large problems. At the very least, greedy algorithms can provide a baseline against which MCMC stochastic search results can be compared, and then thrown out if an improvement is not found. Furthermore, greedy algorithms can provide a fast way to get rough estimates of hyperparameter values, and can be used directly for posterior search. Greedy algorithms also offer interesting possibilities for enhancement of stochastic search. At the most basic level, the models identified by greedy algorithms can be used as starting points for stochastic searches. Stochastic search algorithms can also be made more greedy, for example by exponentiating the probabilities in the accept/reject step of the MH algorithms. The use of a wide variety of search algorithms, including MCMC stochastic search, can only increase the chances of finding better models.
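As one concrete baseline of the kind mentioned here, the greedy forward search with Foster and Stine's declining threshold √(2 log(p/q)) can be sketched in a few lines. The simulated data, the seed, and the approximate t-statistic are our own choices; this is an illustrative sketch, not the procedure used in the credit-risk application.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: three true signals among p candidate predictors.
n, p = 500, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [4.0, 3.0, 2.0]
y = X @ beta + rng.normal(size=n)

selected = []
residual = y - y.mean()
q = 0
while q < p:
    # Approximate t-statistic of each candidate against the current residual.
    stats = np.abs(residual @ X) / np.sqrt((X**2).sum(axis=0) * (residual**2).mean())
    stats[selected] = -np.inf                          # already in the model
    j = int(np.argmax(stats))
    if stats[j] <= np.sqrt(2 * np.log(p / (q + 1))):   # declining threshold
        break
    selected.append(j)
    q += 1
    coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)  # refit OLS
    residual = y - X[:, selected] @ coef
```

A run like this typically recovers the three true predictors and stops soon after, and its selected set (or hyperparameter estimates derived from it) could then seed a stochastic search, in the spirit of the hybrid strategies discussed above.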

