Popular in Course
verified elite notetaker
Popular in Statistics
This 19 page Class Notes was uploaded by Miss Sabina Grimes on Monday October 19, 2015. The Class Notes belongs to STAT 631 at Rice University taught by Volkan Cevher in Fall. Since its upload, it has received 24 views. For similar materials see /class/225036/stat-631-rice-university in Statistics at Rice University.
Reviews for GRAPHICAL MODELS
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 10/19/15
ALPHA DIVERGENCE Thursday September 18 2008 Rice University STAT 631 ELEC 633 Graphical Models Scribe Instructor Andrew WATERS Dru Volkan CEVHER Edited by Ahmad Beiramz39 and David Kahle ll REVIEW FROM PREVIOUS CLASS In the previous lecture we discussed the difficulties of carrying out exact infer ence when the posterior density pZ X is either intractable or overly complex1 The graph we had in mind is contained in Figure 1 it is this same graph which continues to be of interest in this article In the situation where the posterior is unwieldy we noted that is often easier to perform approximate inference over a more tractable density 42 which approximates pZ Xi FIGURE 1 Simple Observed Directed Graph In our pursuit we discovered that the logarithm of pX admits the expansion 1 10gPX17 42 KL quPzix7 where 7 M f 42 E ibg 42Z i 7 42 KL qZHpZ Xgt qu log PZXZi The relation turned out to be fundamental in the sense that it provides a means by which we can determine optimal approximations The optimality requires nding the 42 which minimizes the KL term the KullbackLeibler divergencei In this arti cle we discuss a generalization of the KL divergence the so called ozdivergence and various properties which it exhibits In addition our considerations elucidate the meaning of optimality not only for solutions obtained via the minimization of the Kullback Leibler divergence but also the a divergences as wel 2i a DIVERGENCES In previous notes we introduced the KLdivergence and discussed the conse quences of using both KL qZHpZ X and KL pZ XHqZ In order to obtain more freedom in choosing the metric according to which we are approximating a density we introduce a more general class of metrics called adivergences which can be used to obtain distributions for 42 which approximate pZ Xi For ease of notation 1For a more detailed description of notation see the related article on variational Bayes since there is no ambiguity for future reference we will refer to these two densities simply as 4 an pi We de ne the adivergence as follows fap1 1 141 PIl l41l1 dr Ot 2 DaltPll4gt a E 700oo Some of the properties of the adivergence are 1 Daqu is convex with respect to both p and q 2 De allq 2 0 3 De allq 0 when p q as Similar to the KL divergence these properties allow us to minimize the adivergence in order to nd the best approximating distribution 41 in some class of potential approximations There are several special cases for various values of a that are of interest to us The most important cases are 3 DILEDJPHQ KLOIHP 4 algnlDa allQ KLCUHQl Hence the adivergences include the KL divergences as a special case Other special cases are 6 Dawm E Xm 6 D2qu W 7 DPll4 2 M d1 D is known as the Hellinger distance D is a valid distance metric it satis es both the traingle inequality and symmetric properties 3 INVESTIGATING THE BEHAVIOR OF DD FOR DIFFERENT VALUES OF 1 The de nition of the adivergence in 2 exhibits the reduced representation w mWWPJMMWWWW a1 7 a As discussed before we are interested in approximating an intractable prob ability distribution pz with a tractable distribution In this process we introduced adivergence DD qu as a class of pseudometrics which measure the accuracy of our approximationi Thus we minimize Dapl lq over a tractable family of approximating distributions 41 in order to nd the best approximation of 101 in that family In this section we investigate the behavior of the adivergence for different values of ai Sweeping a from 700 to 00 will result in different properties for the resulting approximation We will start with an example and will discuss the main results These results will be proven in the next lecturer We will describe the properties of the approximation using an example Suppose the density 101 is given by 9 101 l1 71l2 75703952 1 e R Here we are interested in nding an approximation to 101 using a Gaussian dis tribution Moreover we are interested in understanding the resulting optimal approximation 41 as a result of different choices of 0A These approximations are demonstrated in Figure I F rst assume that a is a very large negative value In this case the minimization of DapH4 will force 41 to be an exclusive approximation ie the mass of 41 will lie within 101 as shown in Figure 2 for a 711 When a is increased toward zero the approximation starts to lose the exclusivity property The approximation for a S 0 satis es the following if 101 0 at some point 10 the approximation will be such that 41 0 at 10 as demonstrated in Figure 2 When a 0 D0 pH4 KL4Hp the KullbackLeibler divergencei We dis cussed the properties Of KL divergences in the previous lecturesi In particular 41 is such that 101 0 41 0 ie the mechanism by which we obtain solutions is zero forcing F om a 0 to a l the approximation changes from D0pH4 KL4Hp to D1pl l4 KLpl For a 2 l the approximation satis es the following property 101 gt 0 41 gt 0 as demonstrated in Figure 2 As a grows from 1 to in nity the approximation becomes inclusive ie the mass of 41 includes all the mass of As discussed above and demonstrated in Figure 2 the value of 1 plays a sig ni cant role in determining the properties of the approximation by minimizing the aDivergencei REFERENCES 1 CM Bishop and SpringerLink Online service PatteTn Tewgmtion and machine leuTm39ng Springer 2006 RICE UNIVERSITY DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING 6100 MAIN ST HOUSTON TEXAS 77005 E H rlrlm 2 y in beimmmri r111 1 L1 r111 1 FIGURE 2 Optimal Gaussian approximation qx for px for dif ferent values of a EXPECTATION PROPAGATION September 30 2008 Rice University STAT 631 ELEC 633 Graphical Models Scribes Instructor Ahmad Beirami Dr Volkan Cevher Andrew Waters M atthew Nokleby Index terms Approximate inference Expectation propagation exponential family 1 Motivation and Introduction In class we have encountered inference problems that are too complex to solve exactly As a result we have looked at methods such as Laplace s method and variational Bayes that simplify the problem In such methods we simplify the form of the distributions by making them a product of simpler distributions for example while minimizing some error metric such as the KLdivergence Here we focus on the expectation propagation EP algorithm 1 2 EP is a deterministic algorithm which approximates the true distributions with exponentialfamily distributions Speci cally EP applies to probabilistic models where the joint distribution has the factorization pm Hh m where the factors1 fi 6 are M9 109 2 129 10Iii9 3 where z is the observed data and 6 is the latent variable whose distribution we hope to infer So the data xi are conditionally independent given 6 We can compute the posterior according to Bayes rule m 1 Hf m p I i 39 7 101 i I where the evidence 101 is am w m Of course if the factors fi6 are complicated computing the posterior may be dif cult So we approximate p6iz with a mathematically tractable distribution 46 In ER we choose 46 to be of the form q where each is a member of the exponential family We ll review the exponential family in the next section but for now it s enough to say that the exponential family is considerably easier to work with which simpli es the 1Note that the factors fi6 do indeed depend on m inference problem Each factor 19 in the approximation corresponds to one factor fl We want to nd the best exponentialfamily approximation of p according to a meaningful error metric For ER our error metric is KLpl Notice that we do not use KLql lp as we did previously in variational Bayes We previously discussed the differences between the two For example KLqu brings about bad results when the original approximation has multiple modes However even assuming that 46 belongs to the exponential family minimizing KLpl lg is not feasible Instead EP presents iterations which in practice satisfactorily minimize KLpl But there are no guarantees of optimality or even convergence of the algorithm The remainder of the document is organized as follows In Section 2 we brie y review the exponential family In Section 3 we detail the steps of the expectation propagation algorithm We conclude with two examples borrowed from 1 in Section 4 2 Exponential Family In this section we brie y review the exponential family A distribution fi 19 from the exponential family has the form 129 h9977 exp 77319 7 where 77 is a vector of hyperparameters and the functions M19 977 and un are known Although it has a simple form the exponential family is a broad family which includes for example the Gaussian distribution A bivariate Gaussian distribution looks like paw 2 ma a 7 WWW 7 11gt lt8 for z 117 12 and 2 1 A We can rewrite this in the form of 7 W hzgtgn exp nTuzgt 9 where we choose x1 AiiMi A12M2 I2 A22M2 A12m 1112 7 77 A12 10 1 7A11 22 We also choose hr l and 1 1 2 1 2 977 rm XP A111 A12M1M2 A2227 11 which can be rewritten as 1 977 14n4n5 7 77 exp mm 772112 12 where M 7 772773 2771775 1 7 7 4774775 i 77 771773 2772774 2 2x 4774775 i 773 3 Implementation of Expectation Propagation In this section we discuss the implementation of the expectation propagation Our factorization assumes that 9 gt 1 U f lt6 p I 444 39 7 101 Z and we want to approximate 109 1 using 46 as given by lt6 7 1 H M q i Z i z e First we calculate KLqu KLltPH4gt eplog LIPHI 7 logym nTu9 constant To minimize KLp q we form the gradient and set it equal to zero 6K L 67 Vn 10g977 Ep9 0A 13 14 15 16 17 18 Note that since 4 is determined by 77 then optimization with respect to q is equivalent to optimization with respect to 77 Therefore we have Vn 10g 907 Em u9x Since 4 follows an exponential density distribution as given by 16 we have awe 1 907 hlt6gt exp Wanda 1 Now we can differentiate this equation with respect to 77 to obtain won hlt6gt exp Wanda pop hlt6gt exp pTult6gtult6gtdo o 19 20 21 where we have used Leibniz rule It is straightforward to show that if z 11 m In and a a1 m an then we have fx exp zTa Taking derivatives we get 8f I a I aexp zTa afzr 22 m fax Figure 2 Product of approximate individual probability distributions f1 f2 fails to make a good approximation of the product of the multiplications f1 f2 while a good approximation could be obtained for the product itself It is easy to see that the latter term in 21 is Eq9 Therefore our optimality conditions reduce to V 907 W Eqltegtiultegtt 23 Thus the whole problem reduces to Eqelu9l Epelu9l 24 In other words at each step of the optimization we need to match the moments between p and q However solving this equation is intractable itself since it requires calculation of expectation with respect to the original probability distribution function p But we have already decided that be intractable which is why we are using an approximation in the rst place A plausible simpli cation would be to approximate each factor in p using one factor in q That is we could optimize by matching the moments between 126 and fi However by doing so we limit ourselves to a limited subset of the feasible region In other words we eliminate many candidate solutions that effectively minimize KLpl lg but for which the individual moments of f do not match those of f For example consider Figure l where f1 has two modes while f2 has only one By minimizing the KL divergence for each of the distributions f1 and 1Z2 could be obtained as shown in Figure 1 Now consider f1 1 Since f2 m 0 for z in the second mode of f1z f1 f2 only has one mode as shown in Figure 2 and is wellapproximated by minimizing the KLdivergence However f1 f2 brings about a distribution which poorly approximates of f1 f2 1 as demonstrated in Figure 2 So we need something more sophisticated than matching moments factorbyfactor At each step of EP we pick some j and include all the factors except for fj and try to approximate the result That is instead of matching moments for each factor of p and q we match moments for the distributions with fj and 1 missing This way if we have large number of factors we obtain a much better approximation than factorbyfactor matching since the form of the approximation is not as restricted To see this let N be the total number of factors in p In approximating factorbyfactor we introduce N 7 1 extra constraints to the original problem In contrast we only add a single extra constraint by omitting one factor at a time which guarantees a much better result We generate an iterative method to solve the approximate problem Now we separate factor from the approximate distribution form 49 139 H 09 25 1 We de ne qj as the distribution function where factor j is omitted 49 139 aw m Now we de ne 4 as the distribution function in which all the factors are from the original distribution except for 12 which is chosen from the approximating family we Zijfxe g3 if 941 lt6 lt27 Then we will solve this simpler problem by minimizing the KLdivergence between 4 and q qnew argrr1qinKLqquott9llqt97 28 which is easily solved by moment matching Then we can nd the updated 1 from qnew 9 41 9 Unfortunately there are no theoretical guarantees for matchingpursuit algorithms in general So in general we K m cannot say anything about the quality of the approximation produced by ER and there exist examples where EP fails to satisfactorily minimize the KLdivergence However for large N we have seen that EP at least outperforms momentmatching on individual factors And despite the lack of convergence guarantees EP works well in practice 4 Additional Examples 41 Clutter Problem In the clutter problem discussed in l we have Gaussian observations of ddimensional data x which are corrupted by noise and embedded in unrelated clutter This gives us a Gaussian mixture model Pylx 17 wyx71 MW 0101 30 where I is the identity matrix and N y m V denotes the multivariate Gaussian distribution over y with mean m and covariance V The rst term in 30 is the noisecorrupted desired data and the second term is diffuse Gaussian clutter We assume that the data has a Gaussian prior px Nx01001i 31 Presumably we pick a large variance to make the prior as noninformative as possible If we have n independent observations D y1 yn the joint distribution is given by n n pDx 7 W H pltyilxgt 7 H gelx lt32 i1 i0 So the factor f0 is the Gaussian prior and each additional fi are the mixturemodel likelihood functions Using EP we nd a Gaussian approximation to pDx Speci cally we will choose the spherical Gaussian distribution Nm1 1 I where terms are uncorrelated and have the same variance So we need to nd the parameters m1 and 1 that result in the best approximate pD x To nd this approximation with EP we rst initialize the approximate terms si exp 7amp0 7 miTx 7 33 where 5139 vi and m are the parameters of the distribution For f0 we just initialize to the parameters of the prior which is already Gaussian v0 100 so 27rv0 d2 and mo 0 We initialize the data terms such that l v 00 m 0 and s 1 This gives us the globalparameters m1 0 and v1 100 After initialization we iterate until all mi vi 5 converge within some small 6 gt 0 in l e 104 At each iteration we perform the following steps for each 1 g 239 g n note that the notation refers to the set with element 239 removed 1 Remove the factor from the current posterior estimate giving vlir1 7 v1 7 v1 64 my x vgivi Wmi 7 mi 35 2 Recompute m1 v1 from my via moment matching and compute the normalization constant Z 7 lt17 wgtNltym ltvi1gt1gt wAyi 0101 36 That is we compute Z by evaluating the estimated likelihood factor at the observation yi 3 Update 7 v1 7 ltng1 lt37 m my vi viivi 1mz 7 my 38 mad2mm my v v5 39 Finally when iterations terminate we can use the approximated factors to compute the normalizing constant needed to perform inference MD 27 expltB2gtwosi lt40 where me n B as at 7 z 1 41 71 lt gt 42 Bayes Point Machine Minka in 1 discusses the use of EP for the problem of Bayesian point classi cation In this problem we assume that we have some point W that we wish to classify into one of two groups y i1 through the following rule 7 T y 7 s1gnW x 42 Given a training set D Xl7 yl7 i i i Xn7 we can write the likelihood for W as T Lquot xlgti 43 E pltDlW Hummm 11 lt j Nz 07 ldz7 44 where e is an error tolerance parameter It can be seen that 152 becomes a step function as e A 0 Minka derives the EP algorithm for this scenario assuming a multivariate Gaussian posterior on W with Gaussian prior and exponential form for the nal factors He shows that applying EP to the BPM problem yields superior results relative to other approximation methods available previously in the literature The results are shown in 3 and are plotted in computational requirements in FLOPS vs classi cation error probability It can be seen that EP not only performs better against the other algorithms in terms of classi cation error but is also computationally the most ef cient 1 fiW si exp7 21111wai 7 mi2 Initalize with vi 007mi 031 1 2 qw jmw7 Vw Initialize With prior mw 07 Vw I 3 Loop over 239 l7 2 i i n until convergence a Remove filtWgt from the posterior i Vw 2Vw JT Vw39 Vw vivaw m mw Viplxiv1xmw 7 b Recompute mu7 Vw via the following vw vs 7 with vimV 0 Update T 7 ffzxvai XiTVEXi mi ximei vi xiTViuiXiWi I 7 ltz1gt 1vjlx7vhxt 81 7 lt 1 XTVLIXIH gt 8117 5 041 mm 4 Compute B mel Ulmw 7 T3112 101 N lel12expB2H1Si Figure 3 Comparison of BPM using EP With other classical methods References 1 T Minka Expectation propagation for approximate Bayesian inference Proc 17th Conf Uncertainty in Arti cial Intelligence 2001 2 C M Bishop Pattern Recognition and Machine Learning Oxford Springer 2007 Achieving Stationary Distributions in Markov Chains Monday November 17 2008 Rice University STAT 631 ELEC 639 Graphical Models Scribe Instructor Ryan E GUERRA Tahira N SALEEM Dr Volkan CEVHER Terrance D SAVITSKY 1 Motivation Markov Chain Monte Carlo MCMC simulation is a very popular method to produce samples from a known posterior distribution for hidden variables where the form of the distribution is highly complex such that it may not be directly sampled The MCMC algorithm draws samples from a proposal distribution from where we know it will be easy to acquire samples When certain properties are met by both the known posterior and the proposal distributions the MCMC algorithm possesses the pleasing quality that samples drawn in successive iterations will almost surely converge to samples drawn from the known posterior The Markov Chain MC time indexed random process provides the underlying theory that supplies these needed properties that must be met by both the posterior distribution and the proposal distribution so that we may use the MCMC algorithms Speci cally we will enumerate the properties that must be possessed by a properly constructed probability transition probability matrix for a discrete state Markov Chain to ensure that it converges to an invariant stationary distribution as the number of steps in our chain increases 2 Review of Markov Chains An MC is a time indexed random process with the Markov property Having the Markov property means that given the present state future states are independent of the past states Consider a k state markov chain where for n 6 2 the set of all positive integers 79m 19900 77901 PW 8739 gt gt gt gt gt xn72 xn71 x xn1 By D separation 1 for a 77head to tail77 con guration we know mnlmWA 1L 1 2 zn2 We de ne transition matrix H 5nln71 3n71 3 the probability of moving from state 3139 at time n 7 1 to state 5739 at time step 71 We assume the transition matrix P is time homogenous meaning the probabilities do not vary with time or space in the Markov chain More constructively we build our probability transition matrix with De ne Hn as the marginal probability distribution vector that supplies the probabilities of being in each of the states in state space S E S at time n 7 1 Then given the marginal probability vector at time n 7 1 we may compute the same at time 71 EM PHn 71 By iterating this algorithm starting at time n O we may de ne the marginal probability over the state space S at time n by repeated application of the transition matrix nm Pmm 3 Convergence to the Stationary Distribution 31 The Transition Matrix converges to an invariant distribution We will demonstrate that if our Markov Chain satis es certain properties re ected in the construc tion of the transition matrix P that lim P 7gt H taco where H is an invariantconstant marginal distribution independent of time In other words P will converge to a rank 1 matrix with constant columns equal to the stationary distribution 32 Conditions for Convergence of Transition Matrix P For a transition matrix P to converge to an invariant distribution it must possess the following properties 5 o Irreducibility A MC is said to be irreducible if every state in the state space S can be reached from every other state space in a nite number of moves with positive probability This property may also be stated as every state communicates with every other state We may express this property in compact form with 373019 gt0Vij es An irreducible Markov Chain is said to possess a single class If an MC contains multiple classes then the MC may be divided into separate chains each with its own transition matrix 0 Aperiodicity The periodicity of state 3139 measures the minimum number of steps it takes to have positive probability of returning to that state We express as gcdn 2 119 gt 0 where 77gcd77 stands for 7 greatest common divisor77 and represents the step multiple required to return to state i A state is called aperiodic if the minimum number of steps equals 1 An MC is said to be aperiodic if all the states in the MC are aperiodic o Recurrence A state 3 is called recurrent if the chain returns to s with probability 1 in a nite number of steps The MC is recurrent if all the states in S are recurrent When these 3 properties are satis ed by our MC then all the entries of P are 0 lt pij lt 1 Vij E P lntuitively this construction of P says that there is some positive probability of moving to or remaining in any other state from any other state at time n We cannot get stuck in any state The Perron F robenius theorem 4 tells us that for an n x 71 matrix A with positive entries aij gt 0 there is a positive real eigenvalue r of A such that any other eigenvalue satis es M lt r The bound r is referred to as the spectral radius of A The practical signi cance is that on repeated application of A for example in a timeindexed random process the direction of the largest eigenvalue dominates so that starting in any state at time 0 repeated application of A will drive the system to the invariant direction expressed by the eigenvector u of A the largest eigenvalue of A For example if we have an eigenbasis E V1 Vn that spans R for A which means it is diagonalizable then we may express any state vector x E R as x clvl 1 cnvn Then Ax Ac1V1 1 cnvn c1A1V1 annvn Then in repeated application of A over it V L V L steps we have AA c1V1 3237 V2 1 cnf Vn Then we may conclude 1 Anx c1AV1 Returning to our transition matrix P with 0 lt pij lt 1 Vij E P this construction of P ensures that the largest eigenvalue A 1 and all other eigenvalues A 7 1 of P satisfy lAl lt 1 Also in this case there exists a vector having positive entries summing to 1 which is an eigenvector associated to the eigenvalue A 1 Both properties can then be used in combination to show that the limit P00 limknoo Pk exists and is a positive stochastic random matrix of matrix rank one containing the desired stationary distribution H which is the eigenvector associated to A 1 Note that since eigenvectors are unique only up to constants of proportionality we enforce the constraint that the members of the eigenvector must sum to 1 in order to provide the desired unique solution 33 Eigen Decomposition and Diagonalizability Recall for an n x 71 matrix A the eigenvalues are solutions to the characteristic polynomial pAz detzll 7 A 7 A1z 7 A2 2 7 Am O The algebraic multiplicity of A is the number of repeated values in pAz The associated eigenvectors are derived for each A from kerAll7A Then de ne the geometric multiplicity of A as the dimension of this space or EA dim kerAll 7 We are able to de ne an eigenbasis for A and to therefore diagonalize A if the algebraic multiplicity equals the geometric multiplicity for all A In this case we may decompose A PAPA where P is the eigenbasis of A and A is a diagonal matrix with the eigenvalues of A as the diagonal entries 4 Example Finding the Stationary Distribution for a transition matrix7 P Xn where Pmn 874 3i Pni i gt no 02 01 03 P 04 07 03 04 02 04 020 015 021 P2 048 059 045 032 026 034 From the Perron F robenius Theorem and the properties of an irreducible7 aperiodic and recurrent MC7 we know the largest eigenvalue is 1 1 gt 2 gt 2 71 100 PV1V2V3 0 2 0 V1V2Vsiil 0 0 A3 in 0 0 P P 0 A2 0 F 1 0 0 A3 1 0 r 0 1quot1 P V1HV1H u 11u If we calculate the eigenvalue of our matrx P we nd 1 10000 2 03562 3 700562 and u 18 53 29 References 1 C Bishop Pattern Recognition and Machine Learning Cambridge UK Springer Science 2006 2 G Casella and E George Explaining the Gibbs Sampler77 The American Statistician Vol 46 No 3 August 1992 3 S Chib and E Greenberg Understanding the Metropolis Hastings Algorithm77 The American Statistician Vol 49 No 4 November 1995 4 1L Doob Stochastic Processes New York John Wiley and Sons 1953 5 SP Meyn and RL Tweedie Markov Chains and Stochastic Stability London Springer Verlag 1993 Second edition to appear Cambridge University Press 2008 online httpdecision csl uiuc edumeynpagesbook html
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'