VARIATIONAL BAYES APPROXIMATION

Rice University STAT 631 / ELEC 639: Graphical Models
Instructor: Dr. Volkan Cevher
Scribe: David Kahle
Reviewers: Konstantinos Tsianos and Tahira Saleem

1 BACKGROUND

These lecture notes are the seventh in a series of lecture notes taken from a course offered in the Fall of 2008 at Rice University entitled Graphical Models. The course was written and instructed by Dr. Volkan Cevher in the Department of Electrical and Computer Engineering. This particular set of notes was taken by David Kahle on September 16, 2008.

2 INTRODUCTION

In the last lecture we became particularly interested in making inference on the graph displayed in Figure 1. Here Z ∈ R^m and X ∈ R^n are random vectors. For the purposes of this lecture, all random vectors will be assumed to have densities with respect to the Lebesgue measure, denoted p and subscripted by the corresponding random vector. For example, the random vector Z has the density p_Z. The joint density of the random vectors X and Z is written p_{X,Z}(x, z).

[Figure 1: Simple observed directed graph.]

The graphical model presented above is deliberately simple in order to emphasize the fact that we wish to make inference regarding the vector Z provided we have information on the vector X. To that end, recall the crucial formula derived from the definition of conditional probability,

    p_{X,Z}(x, z) = p_{Z|X}(z|x) p_X(x),                                (1)

which is sometimes referred to as the product rule. As we well know, our primary interest lies in the factor p_{Z|X}(z|x). The rest of this lecture is concerned with characterizing this density using the method of variational Bayes (VB) approximation. This is the second example of a deterministic scheme for approximating the conditional density for inferential procedures, the first being the Laplace approximation.

3 MOTIVATION AND THE KULLBACK-LEIBLER DIVERGENCE

In this section we derive a pseudo-distance on probability distributions known as the Kullback-Leibler divergence. The idea is that without some kind of measure of difference between
probability measures, we have no benchmark for comparing the accuracy of different approximations. Our result will be the Kullback-Leibler divergence; it will soon become clear why it is called a divergence instead of a distance. The divergence is derived as follows. Beginning with (1) and taking logs, we obtain

    log p_{X,Z}(x, z) = log p_{Z|X}(z|x) + log p_X(x),                  (2)

which rearranges to

    log p_X(x) = log p_{X,Z}(x, z) − log p_{Z|X}(z|x).                  (3)

Now recall the property of logs which states log(a/b) = log a − log b, provided everything is defined appropriately. Appealing to this fact, we introduce another density q_Z(z), an arbitrary probability density, also with respect to the Lebesgue measure. Then (3) grants

    log p_X(x) = log [ p_{X,Z}(x, z) / q_Z(z) ] − log [ p_{Z|X}(z|x) / q_Z(z) ],       (4)

and multiplying both sides by q_Z(z) we obtain

    q_Z(z) log p_X(x) = q_Z(z) log [ p_{X,Z}(x, z) / q_Z(z) ] − q_Z(z) log [ p_{Z|X}(z|x) / q_Z(z) ].   (5)

Integrating with respect to z (and using the fact that q_Z integrates to one), we have

    log p_X(x) = ∫_{R^m} q_Z(z) log [ p_{X,Z}(x, z) / q_Z(z) ] dz − ∫_{R^m} q_Z(z) log [ p_{Z|X}(z|x) / q_Z(z) ] dz.   (6)

This is the key equation for our search. To summarize it, we note that the left-hand side of (6) is simply log p_X(x), and use shorthand notation for the two terms on the right-hand side, with the second including the negative sign:

    L(q_Z) := ∫_{R^m} q_Z(z) log [ p_{X,Z}(x, z) / q_Z(z) ] dz = E_{q_Z}[ log ( p_{X,Z}(x, Z) / q_Z(Z) ) ],

    KL(q_Z || p_{Z|X}) := −∫_{R^m} q_Z(z) log [ p_{Z|X}(z|x) / q_Z(z) ] dz = E_{q_Z}[ log ( q_Z(Z) / p_{Z|X}(Z|x) ) ],

where E_q[·] denotes the expectation operator with respect to the probability measure with density q. Thus the fundamental relationship is concisely written

    log p_X(x) = L(q_Z) + KL(q_Z || p_{Z|X}).                            (7)

The functional KL(q_Z || p_{Z|X}) is known as the Kullback-Leibler divergence of p_{Z|X} from q_Z and is the pseudo-metric which we are seeking. Two properties of KL are that, for all valid p_{Z|X} and q_Z,

    (i)  KL(q_Z || p_{Z|X}) ≥ 0, and
    (ii) KL(q_Z || p_{Z|X}) = 0 ⟺ q_Z = p_{Z|X} a.e.

Proof of both these facts is provided in Lemma 3.1 of [1]. Note, however, that the Kullback-Leibler divergence is not symmetric, and thus not a true distance metric.
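As a quick numerical illustration of these properties, and of the asymmetry in particular, the KL divergence between two densities can be approximated by numerical integration. The following is a minimal sketch; the normal and log-normal parameter values are assumptions chosen for illustration, not values taken from these notes.

```python
import numpy as np

def normal_pdf(x, m, s):
    """Density of N(m, s^2)."""
    return np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal with log-scale parameters (mu, sigma); zero for x <= 0."""
    out = np.zeros_like(x)
    pos = x > 0
    out[pos] = np.exp(-(np.log(x[pos]) - mu)**2 / (2 * sigma**2)) \
               / (x[pos] * sigma * np.sqrt(2 * np.pi))
    return out

def kl(p, q, grid):
    """KL(p||q) = integral of p log(p/q), via a Riemann sum on a uniform grid,
    restricted to points where both densities are numerically positive."""
    dx = grid[1] - grid[0]
    mask = (p > 1e-300) & (q > 1e-300)
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx)

grid = np.linspace(1e-4, 10.0, 20000)
p = normal_pdf(grid, 2.0, 0.5)               # target density (assumed parameters)
q = lognormal_pdf(grid, np.log(2.0), 0.25)   # one candidate approximation
print(kl(p, q, grid), kl(q, p, grid))        # generally unequal: KL is not symmetric
```

The two printed values differ, which is exactly the asymmetry that prevents KL from being a true metric.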
To conclude this section, it is instructive to see the meaning of the KL divergence with an example. In Figure 2 we plot an attempt to approximate a normal density p = N(2, 0.5) with three different log-normals q_1, q_2, q_3. Observe that the approximation that visually seems most accurate also has the smallest KL divergence. Moreover, it is worth noting that for the case where KL(p||q_2) = 0.1115, we have KL(q_2||p) = 0.1399, another illustration of the asymmetry.

[Figure 2: Three different log-normal approximations of a normal density, with KL divergences 1.1946, 0.1115, and 0.0829. The better the approximation, the smaller the KL divergence.]

4 VARIATIONAL BAYES APPROXIMATION

Recall that the choice of q_Z(z) is entirely arbitrary so long as it is a probability density with respect to the Lebesgue measure. The idea in variational Bayes (VB) approximation is to select a simple approximation q_Z to the complicated conditional density p_{Z|X}. The Kullback-Leibler divergence gives us a measure of the discrepancy between q_Z and p_{Z|X}, so the goal is to find an approximation q_Z which minimizes KL(q_Z || p_{Z|X}).

Further consideration of (7) in light of our new task will prove beneficial. Note that the left-hand side does not vary with q_Z; moreover, in our graph we are considering an experiment where we observe x, so this quantity is fixed.² It is generally referred to as the log marginal likelihood or the log evidence. Since it does not vary with q_Z, it is clear that the functionals L and KL are inversely related. Therefore a minimization of KL amounts to a maximization of L, i.e.,

    q*_Z = argmin_{q_Z ∈ Q} KL(q_Z || p_{Z|X}) = argmax_{q_Z ∈ Q} L(q_Z),            (8)

¹Note that what we label the divergence and what Kullback and Leibler label the divergence are different functionals. Our definition of the KL divergence is consistent with the literature, despite Kullback and Leibler defining it differently.
²In this set of notes, upper-case Roman characters such as X denote random vectors, while lower-case Roman characters such as x (outside a density or integral expression) denote a single observation of X, as is common in the statistics literature.
where q*_Z is our approximation of interest and Q denotes a set of valid probability densities. We will refer to q*_Z as the VB approximation. It is also sometimes useful to note that, by the first property of KL, log p_X(x) ≥ L(q_Z), and thus exp(L(q_Z)) provides a lower bound for the marginal density of X.

Finding q*_Z when Q contains every probability density is in general a difficult task. To make our analysis more tractable we can impose an independence structure on the random vector Z; that is, we will only consider q_Z's which come from the set

    Q = { q_Z : q_Z(z) = ∏_{i=1}^m q_{Z_i}(z_i) },                       (9)

where q_{Z_i} is the probability density of Z_i, the ith element of Z. Sometimes even this task proves difficult, and more restrictions are imposed to make Q even smaller. Such techniques are referred to as restricted variational Bayes (R-VB) techniques. For example, we could add the additional requirement that each q_{Z_i} be in the exponential family. For the rest of these notes we will take Q as defined in (9).

To find (8), we begin by looking at L(q_Z). In particular, it will be beneficial to look at the jth factor q_{Z_j}; for this reason our mathematics is aimed at separating out the factors which depend on q_{Z_j}. Presented in the equations which follow is the derivation in terms of expectations; an equivalent form of the first part of the derivation, in terms of integrals, is provided in the Appendix. So, from (9) and Fubini's theorem,³ we have

    L(q_Z) = E_{q_Z}[ log p_{X,Z}(x, Z) ] − E_{q_Z}[ log q_Z(Z) ]
           = E_{q_Z}[ log p_{X,Z}(x, Z) ] − Σ_{k=1}^m E_{q_{Z_k}}[ log q_{Z_k}(Z_k) ]
           = E_{q_{Z_j}}[ E_{q_{Z_i}, i≠j}[ log p_{X,Z}(x, Z) ] ] − E_{q_{Z_j}}[ log q_{Z_j}(Z_j) ] − Σ_{k≠j} E_{q_{Z_k}}[ log q_{Z_k}(Z_k) ]
           = E_{q_{Z_j}}[ log exp( E_{q_{Z_i}, i≠j}[ log p_{X,Z}(x, Z) ] ) − log q_{Z_j}(Z_j) ] − Σ_{k≠j} E_{q_{Z_k}}[ log q_{Z_k}(Z_k) ]
           = E_{q_{Z_j}}[ log ( exp( E_{q_{Z_i}, i≠j}[ log p_{X,Z}(x, Z) ] ) / q_{Z_j}(Z_j) ) ] − Σ_{k≠j} E_{q_{Z_k}}[ log q_{Z_k}(Z_k) ]
           = −KL( q_{Z_j} || exp( E_{q_{Z_i}, i≠j}[ log p_{X,Z}(x, Z) ] ) ) − Σ_{k≠j} E_{q_{Z_k}}[ log q_{Z_k}(Z_k) ],

where the exponentiated expectation is understood up to its normalizing constant, which does not depend on q_{Z_j}. Now, we know from KL's first property that it is never negative. Thus, to maximize L(q_Z) with respect to q_{Z_j}, we need to minimize the KL term in the
last equation. From KL's second property, we know that it is minimized precisely when the two arguments are equal (a.e.). This gives the formula we know must hold for the probability density q*_{Z_j}, the jth factor of q*_Z:

    q*_{Z_j}(z_j) ∝ exp( E_{q_{Z_i}, i≠j}[ log p_{X,Z}(x, Z) ] ),        (10)

where the expectation is taken over Z_{−j} = (Z_1, ..., Z_{j−1}, Z_{j+1}, ..., Z_m).

³Fubini's theorem states that if ∫_{A×B} |f(x, y)| d(x, y) < ∞, then ∫_A ∫_B f(x, y) dy dx = ∫_B ∫_A f(x, y) dx dy = ∫_{A×B} f(x, y) d(x, y).

5 EXAMPLE: A UNIVARIATE GAUSSIAN

To illustrate the procedure, we show the analytic computations in the case of a univariate Gaussian distribution. Assume we have a data set D = {x_1, ..., x_N} drawn from a distribution with unknown parameters (μ, τ). Given the data, the likelihood function is

    p(D|μ, τ) = (τ/2π)^{N/2} exp( −(τ/2) Σ_{n=1}^N (x_n − μ)² ).

We may have some idea about the distribution of the unknown parameters (μ, τ). To simplify the analysis here, we introduce the following conjugate priors:

    p(μ|τ) = N(μ | μ₀, (λ₀τ)^{−1}),    p(τ) = Gamma(τ | a₀, b₀).

Recall that we are interested in estimating the posterior distribution over the unknown variables. According to the mean-field approximation, we assume it takes a factorized form, which is not how the true posterior factorizes:

    q(μ, τ) = q(μ) q(τ).

To find the optimal factor q(μ), we apply the formula we derived above:

    log q*(μ) = E_τ[ log p(D, μ, τ) ] + const
              = E_τ[ log ( p(D|μ, τ) p(μ|τ) p(τ) ) ] + const
              = E_τ[ log p(D|μ, τ) + log p(μ|τ) ] + const
              = −( E[τ]/2 ) [ λ₀(μ − μ₀)² + Σ_{n=1}^N (x_n − μ)² ] + const.

The next step is to complete the square over μ to obtain the form of a Gaussian, q*(μ) = N(μ | μ_N, λ_N^{−1}), where

    μ_N = ( λ₀μ₀ + N x̄ ) / ( λ₀ + N ),    λ_N = ( λ₀ + N ) E[τ].

A similar analysis for q*(τ) shows that it follows a gamma distribution, Gamma(τ | a_N, b_N), where

    a_N = a₀ + (N + 1)/2,
    b_N = b₀ + (1/2) E_μ[ Σ_{n=1}^N (x_n − μ)² + λ₀(μ − μ₀)² ].

The same analysis, together with more sophisticated examples, can be found in Chapter 10 of [?]. It is important to notice that we can use the above derived formulas to iteratively compute more refined estimates of the model parameters: observe that the parameters of q(μ) depend on the mean value of τ under q(τ), and vice versa. To conclude this section, we refer the reader to two excellent references on variational methods [?, ?].
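The coordinate-ascent iteration just described is short enough to sketch in full. The following is a minimal sketch, assuming synthetic data and illustrative hyperparameter values; the update formulas are exactly the μ_N, λ_N, a_N, b_N expressions above, with E[τ] = a_N/b_N and E[μ] = μ_N, E[μ²] = μ_N² + 1/λ_N under the current factors.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=0.7, size=200)   # synthetic data (assumed parameters)
N, xbar, xsq = len(x), x.mean(), np.sum(x**2)

# Hyperparameters of the conjugate priors (assumed, weakly informative)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = a0 / b0                                 # initial guess for E[tau]
for _ in range(50):
    # Update q(mu) = N(mu | mu_N, 1/lam_N)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    # Update q(tau) = Gamma(tau | a_N, b_N), using expectations under q(mu)
    a_N = a0 + (N + 1) / 2.0
    b_N = b0 + 0.5 * (xsq - 2 * E_mu * N * xbar + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N

print(mu_N, 1.0 / np.sqrt(E_tau))   # should be close to the sample mean and std
```

With this much data the variational posterior concentrates near the sample mean and standard deviation, as expected.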
6 APPENDIX

An equivalent form of the first part of the derivation in Section 4, in terms of integrals:

    L(q_Z) = ∫_{R^m} q_Z(z) log [ p_{X,Z}(x, z) / q_Z(z) ] dz
           = ∫_R ··· ∫_R ( ∏_{i=1}^m q_{Z_i}(z_i) ) [ log p_{X,Z}(x, z) − Σ_{k=1}^m log q_{Z_k}(z_k) ] dz_1 ··· dz_m
           = ∫_R q_{Z_j}(z_j) [ ∫ ∏_{i≠j} q_{Z_i}(z_i) log p_{X,Z}(x, z) dz_{−j} ] dz_j − Σ_{k=1}^m ∫_R q_{Z_k}(z_k) log q_{Z_k}(z_k) dz_k,

where the second equality uses Fubini's theorem to integrate each log q_{Z_k} term against its own factor alone.

REFERENCES

[1] S. Kullback and R. A. Leibler, "On information and sufficiency," Annals of Mathematical Statistics 22 (1951), no. 1, 79-86.

VARIATIONAL LOGISTIC REGRESSION
Sunday, October 12, 2008
Rice University STAT 631 / ELEC 639: Graphical Models
Instructor: Dr. Volkan Cevher
Scribe: Terrance Savitsky

1 Motivation

In the past few classes we have explored the problem of finding tractable approximations to otherwise intractable pdfs. That is, given some probability distribution that is either difficult or impossible to work with, we wish to find an approximating pdf that is easier to work with and yet still yields good results. Thus far we have considered both local approximation methods (the Laplace approximation) as well as global approximation methods (variational Bayes). Global methods attempt to seek an approximation to the entire distribution over all variables, while local methods attempt to find acceptable approximations for individual variables or groups of variables. In this lecture we consider a local variational method and apply it to variational logistic regression, governed by the logistic sigmoid function.

This document is organized as follows. Section 2 details the bounding of the logistic sigmoid function, which will be used in the final regression model. In Section 3 we give a general overview of the framework for using the bound on the logistic sigmoid derived in Section 2. Section 4 will develop the regression model for computing the posterior, and show how to optimize this model using the Expectation-Maximization algorithm. Section 5 will present some additional examples that use similar
techniques for further illustration, and we give some concluding remarks in Section 6.

2 Variational lower bound on the logistic sigmoid σ(z)

We will utilize a lower bound on the logistic sigmoid function σ(z) = 1/(1 + e^{−z}) to provide a normal-normal (quadratic in z) conjugate model when we derive variational logistic regression. Start with

    log σ(z) = −log( 1 + exp(−z) )                                      (1)
             = −log[ exp(−z/2) ( exp(z/2) + exp(−z/2) ) ]               (2)
             = z/2 − log( exp(z/2) + exp(−z/2) ).                       (3)

Now f(z) = −log( exp(z/2) + exp(−z/2) ) is a concave function of z², which one may verify by differentiating twice with respect to z² and noting the result is everywhere negative. Therefore f lies below each of its tangent lines in the variable z², and each tangent gives a lower bound on f that is linear in z². Taking the tangent at z² = ξ², the slope is

    df/d(z²) |_{z=ξ} = −(1/(4ξ)) tanh(ξ/2) =: −λ(ξ),                    (4)

so that f(z) ≥ f(ξ) − λ(ξ)(z² − ξ²), where ξ plays the role of the local variational parameter and

    λ(ξ) = (1/(4ξ)) tanh(ξ/2) = (1/(2ξ)) ( σ(ξ) − 1/2 ).

Combining this with (3), exponentiating, and using f(ξ) = log σ(ξ) − ξ/2, we finally achieve our variational lower bound on σ(z):

    σ(z) ≥ σ(ξ) exp( (z − ξ)/2 − λ(ξ)(z² − ξ²) ).                       (5)

We see that this bound has the form of the exponential of a quadratic function in z, which gives us a local Gaussian approximation of the logistic function.

3 Framework for using the variational lower bound on σ(z)

Suppose we wish to evaluate an integral of the form

    I = ∫ σ(a) p(a) da,

where σ(a) is the logistic likelihood for the data, p(a) is a Gaussian, and a ranges over latent variables/parameters. We employ the variational lower bound σ(a) ≥ f(a, ξ). Then the integrand becomes the product of two exponential-quadratic (Gaussian-like) factors, and we can "complete the square" in order to evaluate the resulting lower bound on I analytically.
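The lower bound on the sigmoid derived above is easy to verify numerically. A small sketch follows; the test points and ξ values are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    """lambda(xi) = tanh(xi/2) / (4 xi); the limiting value at xi = 0 is 1/8."""
    xi = np.asarray(xi, dtype=float)
    return np.where(np.abs(xi) < 1e-8, 0.125, np.tanh(xi / 2.0) / (4.0 * xi))

def jj_lower_bound(z, xi):
    """sigma(z) >= sigma(xi) * exp((z - xi)/2 - lambda(xi) * (z^2 - xi^2))."""
    return sigmoid(xi) * np.exp((z - xi) / 2.0 - lam(xi) * (z**2 - xi**2))

z = np.linspace(-8.0, 8.0, 1001)
for xi in (0.5, 2.0, 5.0):
    bound = jj_lower_bound(z, xi)
    # valid lower bound everywhere on the grid
    assert np.all(bound <= sigmoid(z) + 1e-12)
    # the bound touches the sigmoid at z = xi (and, by symmetry in z^2, at z = -xi)
    assert abs(jj_lower_bound(xi, xi) - sigmoid(xi)) < 1e-12
print("bound verified")
```

The quadratic form of the exponent is what makes the bound conjugate to a Gaussian prior in the next section.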
Note that we have introduced an additional parameter ξ, the variational free parameter, with

    I ≥ ∫ f(a, ξ) p(a) da =: F(ξ).

We now find the value of ξ that maximizes F(ξ), to get the tightest bound. For any single fixed a, σ(a) ≥ f(a, ξ) can be made tight by the right choice of ξ; here, however, we marginalize over all values of a, so ξ is a compromise weighted by p(a).

4 Variational Logistic Regression

4.1 Deriving the regression model from the class of generalized linear models

Suppose we have an observed set of random samples X = (x_1, ..., x_n), where X may be continuous, count, or categorical data. We further suppose that the distribution of X derives from the exponential family of distributions, which has the form

    p(x|η) = h(x) exp( ηᵀ T(x) − c(η) ).

For example, for x_i ~ iid N(μ, σ²), we may factorize the joint distribution of X to demonstrate that the Gaussian distribution is a member of the exponential family:

    f(x_1, ..., x_n) = (2π)^{−n/2} exp( (μ/σ²) Σ_i x_i − (1/(2σ²)) Σ_i x_i² − nμ²/(2σ²) − n log σ ),

where η = ( μ/σ², −1/(2σ²) ) and T = ( Σ_i x_i, Σ_i x_i² ). As an aside, we note the important role played by T = (T_1, T_2) in the exponential family: the n observations x_1, ..., x_n are reduced to 2 statistics (in the case of the Gaussian distribution), independent of n. We refer to T as the sufficient statistics, in that all the relevant information in the data useful for estimating our parameters η is fully captured by T.

Now we utilize the exponential family to regress η against a set of covariates X as

    g(η) = wᵀ φ(X),                                                     (6)

where Shao (2003) notes that g(η) ∈ R is called the "link function" (and g(η) = η is the canonical link function). Here wᵀφ(X) is a linear combination of transformed x (e.g., sin(x)), and w may be viewed as slope parameters we estimate to determine the relationship of φ(X) to g(η).

Now let us develop a regression model specific to binary data. Let t = (t_1, ..., t_k) with t_n ∈ {0, 1} categorical, and π = P(t_n = 1|w) ∈ (0, 1). Define the associated likelihood
    p(t_n|w) = π^{t_n} (1 − π)^{1−t_n} ∝ exp( t_n log( π/(1−π) ) ),

where η = log( π/(1−π) ) ∈ R is our link function, which is a latent response. Let us relabel η as a_n and define our regression function

    a_n = log( π/(1−π) ) = wᵀ φ_n.                                       (7)

We may invert this regression function and solve for π:

    π = 1 / ( 1 + exp(−wᵀφ_n) ) = σ(a_n),    1 − π = σ(−a_n) = 1 / ( 1 + exp(a_n) ).

Then, returning to our likelihood,

    p(t_n|w) = σ(a_n)^{t_n} ( 1 − σ(a_n) )^{1−t_n}                       (8)
             = ( 1/(1 + e^{−a_n}) )^{t_n} ( 1/(1 + e^{a_n}) )^{1−t_n}    (9)
             = exp(a_n t_n) σ(−a_n).                                     (10)

So we note that our likelihood will be approximated by a Gaussian distribution when we introduce the variational parameters ξ_n. Now we complete the specification of our logistic regression model with a Gaussian prior on our slope parameters w:

    p(w) = N(w | m₀, S₀),

where m₀ and S₀ are fixed hyperparameters.

4.2 Variational lower bound for the posterior

We are moving towards defining a tractable approximation for our posterior in w, to achieve a normal-normal model. Recall we derive the posterior on w as

    p(w|t) ∝ p(w) p(t|w) = p(w) ∏_n p(t_n|w).                            (11)

(As an aside, we might then use the posterior to conduct variable selection, e.g. by thresholding the posterior values of w_n based on the false discovery rate.) Returning to our likelihood from equation (10), we introduce a variational parameter ξ_n corresponding to each observation (φ_n, t_n) to approximate σ(−a_n) with the variational lower bound from equation (5), such that

    p(t, w) = p(t|w) p(w) ≥ h(w, ξ) p(w),

where, plugging in a_n = wᵀφ_n,

    h(w, ξ) = ∏_{n=1}^N σ(ξ_n) exp( wᵀφ_n t_n − (wᵀφ_n + ξ_n)/2 − λ(ξ_n)( (wᵀφ_n)² − ξ_n² ) ).

Returning to equation (11), we now take the log of both sides; the inequality is maintained due to the monotonicity of the logarithm. Inserting our lower bound for p(t|w), we obtain log p(t|w) + log p(w) ≥ log h(w, ξ) + log p(w) =: log q(w), and substituting in the Gaussian prior p(w) supplies

    log q(w) = −(1/2)(w − m₀)ᵀ S₀^{−1} (w − m₀)
               + Σ_{n=1}^N [ wᵀφ_n t_n − wᵀφ_n/2 − λ(ξ_n) wᵀ( φ_n φ_nᵀ ) w ] + const.   (12)-(13)

Our variational approximation q(w) is quadratic in w, allowing us to complete the square in w to produce our variational approximation for the
posterior,

    q(w; ξ) = N(w | m_N, S_N),                                           (14)

where

    m_N = S_N ( S₀^{−1} m₀ + Σ_{n=1}^N (t_n − 1/2) φ_n ),                 (15)
    S_N^{−1} = S₀^{−1} + 2 Σ_{n=1}^N λ(ξ_n) φ_n φ_nᵀ.                     (16)

4.3 Optimizing the approximation over the variational parameters ξ

We have obtained a Gaussian approximation for our posterior distribution and can now optimize over the ξ_n to make this a tight bound, starting with

    log p(t) = log ∫ p(t|w) p(w) dw                                      (17)
             ≥ log ∫ h(w, ξ) p(w) dw =: L(ξ).                            (18)

Since L(ξ) is defined via an integration over the latent w, we may use the EM algorithm to maximize it, by maximizing the expected complete-data log likelihood. We implement the EM algorithm from Bishop (2007) for this application with the following steps.

1. E (estimation) step: choose initial values ξ^old and use these values to determine the variational posterior distribution q(w; ξ^old).
2. M (maximization) step: define the optimization objective

    Q(ξ, ξ^old) = E_{q(w)}[ log ( h(w, ξ) p(w) ) ].                       (19)-(23)

Plugging in for log h(w, ξ) (and dropping terms that do not depend on ξ),

    Q(ξ, ξ^old) = Σ_{n=1}^N { log σ(ξ_n) − ξ_n/2 − λ(ξ_n)( φ_nᵀ E[wwᵀ] φ_n − ξ_n² ) } + const.   (24)-(25)

Now differentiate Q with respect to ξ_n and set the derivative to zero to obtain

    0 = λ′(ξ_n) ( φ_nᵀ E[wwᵀ] φ_n − ξ_n² ).                               (26)

Since λ′(ξ_n) ≠ 0, we finally obtain the results for one iteration of the EM algorithm:

    (ξ_n^new)² = φ_nᵀ E[wwᵀ] φ_n = φ_nᵀ ( S_N + m_N m_Nᵀ ) φ_n.           (27)-(28)

5 Additional Examples

5.1 Comparison with the S-L approximation

In Jaakkola and Jordan (2000), variational logistic regression is compared to a sequential approximation method by Spiegelhalter and Lauritzen, there referred to as the S-L approximation; it utilizes a Laplace approximation. The variational methods perform quite well in comparison, with similar results at low variance and better results at high variance. Figure 1 shows the error in posterior mean and standard deviation, as well as the KL divergence between the true and approximated distributions for each method, demonstrating the effectiveness of variational techniques.

5.2 Bayesian logistic regression for image database queries

Ksantini and Ziou (2008) apply variational lower bounds and logistic regression to the dual
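The E and M updates of Section 4.3 above fit in a few lines. The following is a minimal sketch on synthetic data; the design matrix, ground-truth weights, prior, and iteration count are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def lam(xi):    return np.tanh(xi / 2.0) / (4.0 * xi)   # assumes xi > 0

rng = np.random.default_rng(1)
N, D = 300, 3
Phi = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
w_true = np.array([-0.5, 2.0, -1.0])                    # assumed ground truth
t = (rng.uniform(size=N) < sigmoid(Phi @ w_true)).astype(float)

m0, S0_inv = np.zeros(D), np.eye(D)                     # Gaussian prior N(0, I)
xi = np.ones(N)                                         # initial variational parameters
for _ in range(30):
    # E step: Gaussian posterior approximation q(w) = N(m_N, S_N)
    S_N_inv = S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi    # sum of lam_n * phi_n phi_n^T
    S_N = np.linalg.inv(S_N_inv)
    m_N = S_N @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
    # M step: xi_n^2 = phi_n^T (S_N + m_N m_N^T) phi_n
    A = S_N + np.outer(m_N, m_N)
    xi = np.sqrt(np.einsum('nd,de,ne->n', Phi, A, Phi))

print(m_N)   # posterior mean, near w_true up to prior shrinkage
```

Each iteration alternates a closed-form Gaussian posterior with the closed-form ξ update, so no line search or gradient steps are needed.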
problems of image classification and retrieval for image databases. They compare this to classical image classification techniques and find that the variational methods achieve superior results. Their method, dubbed the Bayesian Logistic Regression Model (BLRM), uses an approximation based on the same variational bound on the logistic sigmoid described in this document. This model is used to create feature vectors that are stored in the image database. When the system is queried for an image, a target feature vector is created for the query and comparison metrics are calculated for each of the images in the system; the system then returns a fixed number of images that have the highest metric values.

To evaluate the system, a study was conducted on a group of individuals who entered image search terms into both the BLRM and CLRM (classical logistic regression model) databases.

[Figure 1: Error in the posterior means and standard deviations as a function of the prior mean and prior standard deviation, and KL divergence between the true and approximate posterior distributions, for the variational and S-L methods (Jaakkola and Jordan, 2000).]

The users then evaluated the precision of all images returned. The precision scores for both methods are plotted in Figure 2 against the scope (the total number of images returned), which shows that the BLRM model greatly outperforms the classical model. Additionally, the authors note that the variational approximation used in BLRM also drastically reduces both the training and query time relative to the CLRM system.

6 Conclusion

Variational methods like the one described here rely on bounding functions of interest with often simple convex or concave bounds. Convex bounds are important because they do not have local minima distinct from the global minimum, and thus can be solved using a wide variety of methods that require fewer iterations and have strong convergence guarantees. Concave functions can be converted to convex functions and thus are also sufficient. EM, explored in this paper, is one technique for solving the resulting maximization/minimization problems.

[Figure 2: Comparison between the classical (CLRM) and Bayesian (BLRM) logistic regression models for image retrieval and classification. Precision is scored by the user and is based on the relevance of the images returned by the database; scope is the number of images returned (Ksantini and Ziou, 2008).]

References

Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer.
Jaakkola, T. S. and M. I. Jordan (2000). Bayesian parameter estimation via variational methods. Statistics and Computing 10, 25-37.
Ksantini, R. and D. Ziou (2008). Weighted pseudometric discriminatory power improvement using a Bayesian logistic regression model based on a variational method. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 253-266.
Shao, J. (2003). Mathematical Statistics. Springer.

PROOFS OF ALPHA-DIVERGENCE PROPERTIES
Thursday, September 11, 2008
Rice University STAT 631 / ELEC 639: Graphical Models
Instructor: Dr. Volkan Cevher
Scribe: Ahmad Beirami
Reviewers: Beth Bower and Konstantinos Tsianos

1 MOTIVATION

In many cases we need to define a measure of distance between probability distributions. For example, when trying to approximate a distribution with another of a particular form, we can compute the necessary parameters by minimizing such a distance measure. A general class of distance measures is the family of α-divergences. In general, for any two probability distributions p(x), q(x), the α-divergence is defined as

    D_α(p||q) = ( ∫ [ α p(x) + (1−α) q(x) − p(x)^α q(x)^{1−α} ] dx ) / ( α(1−α) ),

for all α ∈ (−∞, ∞). A previous lecture gives more detail on the exact usage of α-divergences. We devote this lecture to proving the following important properties:

1. D_α(p||q) is convex with respect to both p and
q;
2. D_α(p||q) ≥ 0, and D_α(p||q) = 0 if and only if p(x) = q(x) for all x;¹
3. KL divergence is a special case of the α-divergence:

    lim_{α→1} D_α(p||q) = KL(p||q) = ∫ p(x) ln( p(x)/q(x) ) dx,
    lim_{α→0} D_α(p||q) = KL(q||p) = ∫ q(x) ln( q(x)/p(x) ) dx;

4. for α → −∞, the estimate q that approximates p is exclusive, i.e., q(x) ≤ p(x) for all x;
5. for α → ∞, the estimate q that approximates p is inclusive, i.e., q(x) ≥ p(x) for all x;
6. for α ≤ 0, the estimate q that approximates p is zero-forcing, i.e., p(x) = 0 ⟹ q(x) = 0;
7. for α > 1, the estimate q that approximates p is zero-avoiding, i.e., p(x) > 0 ⟹ q(x) > 0.

¹Since the distributions need not be continuous, this statement (from now on) will refer to all values of x for which the distributions are well defined.

To proceed with proving properties 1-7, first note that since ∫p dx = ∫q dx = 1, the definition of the α-divergence can be simplified to

    D_α(p||q) = ( 1 − ∫ p(x)^α q(x)^{1−α} dx ) / ( α(1−α) ).             (1)

2 PROOFS

Property 1. D_α(p||q) is convex with respect to both p and q.

Proof. Differentiating the integrand of D_α twice with respect to p(x) pointwise,

    ∂²D_α(p||q) / ∂p(x)² = −α(α−1) p(x)^{α−2} q(x)^{1−α} / ( α(1−α) ) = p(x)^{α−2} q(x)^{1−α} ≥ 0.   (2)

Therefore D_α(p||q) is convex with respect to p. It is straightforward to repeat this calculation for q. ∎

Property 2. D_α(p||q) ≥ 0, and D_α(p||q) = 0 if and only if p(x) = q(x) for all x.

Proof. We prove this by proving the following pointwise inequality for any x:

    f(p, q) := ( α p + (1−α) q − p^α q^{1−α} ) / ( α(1−α) ) ≥ 0.         (3)

If this statement is proven, D_α(p||q) ≥ 0 follows by integration of f with respect to x. Now we differentiate with respect to p:

    ∂f/∂p = ( α − α p^{α−1} q^{1−α} ) / ( α(1−α) ) = ( 1 − (p/q)^{α−1} ) / ( 1−α ).   (5)

We can also find the second derivative with respect to p, as in (2):

    ∂²f/∂p² = p^{α−2} q^{1−α} > 0 for all p > 0.                          (6)

It is easy to observe that f(q, q) = 0 and (∂f/∂p)(q, q) = 0, while ∂²f/∂p² > 0 for all p > 0. Therefore f(·, q) has only one minimum with respect to p, attained at p = q, and the value of the minimum is f(q, q) = 0. Hence

    D_α(p||q) = ∫ f( p(x), q(x) ) dx ≥ 0,                                (7)

with equality if and only if p(x) = q(x) for all x. ∎

Property 3. lim_{α→0} D_α(p||q) = KL(q||p) = ∫ q(x) ln( q(x)/p(x) ) dx, and lim_{α→1} D_α(p||q) = KL(p||q) = ∫ p(x) ln( p(x)/q(x) ) dx.
Proof. When α → 0, it is easy to observe that both the numerator and the denominator in equation (1) vanish. Therefore we can use L'Hôpital's rule to obtain the limit, differentiating both with respect to α:

    lim_{α→0} D_α(p||q) = lim_{α→0} ( ∫ p(x)^α q(x)^{1−α} ln( q(x)/p(x) ) dx ) / ( 1 − 2α ) = KL(q||p).   (9)

The proof of the other equality is also straightforward: due to the symmetry between (p, α) and (q, 1−α) in the definition, D_α(p||q) = D_{1−α}(q||p), and the second equality holds. ∎

Property 4. If we are minimizing D_α to find the best q that estimates p, and α → −∞, the estimate q is exclusive, i.e., q(x) ≤ p(x) for all x.

Proof. Consider the behavior as α → −∞. If p and q are continuous functions and q(x₀) > p(x₀) at some point x₀, there exists a neighborhood around x₀ where q(x) > p(x). On that neighborhood, p^α q^{1−α} = q (p/q)^α = q (q/p)^{−α} grows without bound as α → −∞, exponentially in |α|, which dominates the polynomial denominator α(1−α), so D_α(p||q) diverges to infinity. Thus any minimization algorithm will result in a q with q(x) ≤ p(x) for all x. ∎

Property 5. If we are minimizing D_α to find the best q that estimates p, and α → ∞, the estimate q is inclusive, i.e., q(x) ≥ p(x) for all x.

Proof. We take the limit as α → ∞. If p and q are continuous functions and q(x₀) < p(x₀) at some point x₀, there exists a neighborhood of x₀ where q(x) < p(x). There, p^α q^{1−α} = q (p/q)^α → ∞ as α → ∞, and the divergence blows up as in Property 4. Thus any minimization algorithm will result in a q with q(x) ≥ p(x) for all x. ∎

Property 6. If we are minimizing D_α to find the best q that estimates p, and α ≤ 0, then p(x) = 0 ⟹ q(x) = 0.

Proof. We have

    D_α(p||q) = ( 1 − ∫ ( p(x)/q(x) )^α q(x) dx ) / ( α(1−α) ).          (12)

If p and q are continuous functions and p(x₀) = 0 at some point x₀, there exists a neighborhood around x₀ where p(x) < ε. Therefore, if q(x) > 0 there, (p/q)^α (with α ≤ 0) will be very large in that neighborhood, which makes the integral grow very large, and (since α(1−α) < 0 for α < 0) the divergence grows without bound. Thus any minimization algorithm will result in a q with q(x) = 0 wherever p(x) = 0. Note that D_α is convex with respect to p and q. ∎

Property 7. If we are minimizing D_α to find the best q that estimates p, and α > 1, then p(x) > 0 ⟹ q(x) > 0.

Proof. We can achieve this property using the symmetry discussed before. Writing p^α q^{1−α} = p (p/q)^{α−1}, we have

    D_α(p||q) = ( 1 − ∫ p(x) ( p(x)/q(x) )^{α−1} dx ) / ( α(1−α) ).      (13)

If p and q are continuous functions and p(x₀) > 0 at some point x₀, there exists a neighborhood around x₀ where p(x) > k for some k > 0. Therefore, if q(x₀) = 0, then (p/q)^{α−1} will be very large in that neighborhood, which makes the integral grow very large and (since α(1−α) < 0 for α > 1) the divergence grow very large. Thus any minimization algorithm will result in a q with q(x) > 0 wherever p(x) > 0. ∎
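The limits in Property 3 can be checked numerically. The following is a small sketch evaluating D_α on a grid for α near 0 and 1; the two Gaussian densities are arbitrary choices for illustration.

```python
import numpy as np

def normal_pdf(x, m, s):
    return np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

grid = np.linspace(-12.0, 12.0, 40001)
dx = grid[1] - grid[0]
p = normal_pdf(grid, 0.0, 1.0)
q = normal_pdf(grid, 1.0, 1.5)

def alpha_div(p, q, alpha):
    """D_alpha(p||q) = (1/(alpha(1-alpha))) * int[alpha p + (1-alpha) q - p^a q^(1-a)]."""
    integrand = alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha)
    return float(np.sum(integrand) * dx / (alpha * (1 - alpha)))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)) * dx)

# alpha near 1 recovers KL(p||q); alpha near 0 recovers KL(q||p)
print(alpha_div(p, q, 0.999), kl(p, q))
print(alpha_div(p, q, 0.001), kl(q, p))
```

The printed pairs agree to within the discretization error, consistent with the L'Hôpital argument above.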
Keywords: α-divergence, variational methods, KL divergence.

REFERENCES

[1] Thomas Minka, Divergence measures and message passing, Technical report, Microsoft Research Ltd., 2005.

COMPRESSIVE SENSING AND GRAPHICAL MODELS
Volkan Cevher (volkan@rice.edu)
Rice University ELEC 633 / STAT 631 Class
http://www.ece.rice.edu/vc3/eec633

Pressure is on Digital Sensors
- The success of digital data acquisition is placing increasing pressure on signal/image processing hardware and software to support:
  - higher resolution / denser sampling: ADCs, cameras, imaging systems, microarrays;
  - large numbers of sensors: image databases, camera arrays, distributed wireless sensor networks;
  - increasing numbers of modalities: acoustic, RF, visual, IR, UV, x-ray, gamma ray;
  - a deluge of data: how to acquire, store, fuse, and process efficiently?

Sensing by Sampling
- Long-established paradigm for digital data acquisition: uniformly sample data at the Nyquist rate (2x the Fourier bandwidth).

[Figure: sampled waveform illustration.]
Sample, then Compress
- Long-established paradigm for digital data acquisition: uniformly sample data at the Nyquist rate, then compress the data (N >> K):
  sample → compress → transmit/store → receive → decompress (e.g., JPEG, JPEG 2000).

Sparsity / Compressibility
- An N-pixel image with only K << N large wavelet coefficients.
- A wideband signal with only K large Gabor (time-frequency) coefficients.

What's Wrong with this Picture?
- Why go to all the work to acquire N samples, only to discard all but K pieces of data?

Compressive Sensing
- Directly acquire compressed data: replace N samples by M more general measurements, K ≤ M << N:
  compressive sensing → transmit/store → receive → reconstruct.

Compressive Sensing Theory I: Geometrical Perspective

Compressive Sensing
- Goal: recover a sparse or
compressible signal a from measurements y 0 Problem Random projection ltgt not full rank Solution Exploit the sparsitycompressibility geometry of acquired signal a Compressive Sensing Goal Recover a sparse or compressible signal a from measurements y iid Gaussian Problem Random iid Bernoulli j projection ltgt not full rank but satis es Restricted Isometry Property RIP 1 5Kll v S lt1gt93 S1 5Klll v Solution Exploit the sparsitycompressibility geometry of acquired signal a Compressive Sensing Goal Recover a sparse or compressible signal a from measurements y 0 Problem Random projection ltgt not full rank Solution Exploit the model geometry of acquired signal a Concise Signal Structures Sparse signal only K out of N coordinates nonzero model union of K dimensional subspaces aligned w coordinate axes 62K K N sorted index Concise Signal Structures Sparse signal only K out of N coordinates nonzero model a E ZK Compressible signal sorted coordinates decay rapidly to zero powerlaw decay K N sorted index Concise Signal Structures Sparse signal only K out of N coordinates nonzero model a E ZK Compressible signal sorted coordinates decay rapidly to zero model scompressible 0K33p S T81PSK s sorted index Sampling 0 Signal 13 is Ksparse in basisdictionary IJ WLOG assume sparse in space domain IJ Ngtlt1 sparse signal K nonzero BITCHES DIIIIIIIJJZIIIS Sampling 0 Signal 13 is Ksparse in basisdictionary IJ WLOG assume sparse in space domain U I 0 Samples y CI I a N x 1 N X 1 E E sparse measurements signal I I K nonzero entries Compressive Sampling 0 When data is sparsecompressible can directly acquire a condensed representation with noIittle information loss through linear dimensionality reduction 3 33 9 CID a M X 1 N X 1 measurements Sparse Signal M X N K nonzero K lt M N entries How Can It Work Projection I 9 CD not full rank E M lt N and so loses information in general i i Ex Infinitely many CD39S map to the same y How Can It Work 0 Projection cl notqurank MltN and 50 K columns 
loses information in general But we are only interested in sparse vectors How Can It Work CE 0 Projection CI notqura k M lt N and 50 K columns loses information in general But we are only interested in sparse vectors CD is effectively MXK How Can It Work Projection Cl notqurank MltN and 50 K columns loses information in general But we are only interested in sparse vectors Design CID so that each of its MXK submatrices are full rank How Can It Work Goal Design CD so that its 2K Columns MXZK submatrices are full rank difference 331 332 between two K sparse vectors is 2K sparse in general preserve information in K sparse signals Restricted Isometry Property RIP of order 2K Unfortunately 2K columns Goal Design I so that its MXZK submatrices are full rank Restricted Isometry Property RIP Unfortunately a combinatorial NPcomplete design problem 0 Draw I at random iid Gaussian iid Bernoulli dbl 2K columns 0 Then I has the RIP with high probability as long as M 0K ogNK ltlt N MXZK submatrices are full rank stable embedding for sparse signals extends to compressible signals in E bas Compressive Data Acquisition Measurements y random linear combinations of the entries of a WHP does not distort structure of sparse signals no information loss M X 1 N X 1 measurements Sparse Signal K nonzero K lt M N entries CS Signal Recovery 0 Random projection Cl notqurank Recovery problem given y 2 CD51 find Null space 0 So search in null space for the best aquot according to some criterion ex least squares yltlgtat hyperplane t random angle CS Signal Recovery 0 Recovery illposed inverse problem E2 fa st given y CDQ find a sparse f arg min a32 13 yltgt 35 CDTCD1ltgtTy pseudoinverse CS Signal Recovery 0 Recovery given y 2 CD51 illposed inverse problem find a sparse 0 2 fast wrong g arg min yltgtat A T 1 T IIIIIIIJIIIIIIIIJLJII39IIIIIIIIJIJIIJIJIIJII 1 d CD CD y pseudoi nverse Why 62 Doesn t Work for signals sparse in the spacetime domain 5 ill yltlgta3 least squares minimum 2 solution is 
almost never sparse 2yCgtai null space of CD translated to Cl random angle CS Signal Recovery Reconstructiondecoding given y 4313 a illposed inverse problem find 0 62 fast wrong f arg min llll2 yltgtat O f arg min a30 yltgta3 i number of nonzero entries nd sparsest a in translated nulspace CS Signal Recovery Reconstructiondecoding given y 2 CD21 illposed inverse problem find Cl 62 fast wrong f arg min yltgtat 60 correct A only M2A arg g 0 measurements y a A required to reconstruct Ksparse signal quotumber 0f nonzero entries CS Signal Recovery Reconstructiondecoding given y 2 CD21 a illposed inverse problem find a 62 fast wrong f arg min yltgtat 60 correct A only M2 m arg mibn llzvllo 1 measurements y A required to reconstruct b f Ksparse signal quotum era DOHZGFO entries slow NPcomplete algorithm CS Signal Recovery Recovery given y 2 CD33 illposed inverse problem find Cl sparse 2 fast wrong arg min B2 yltgtat 60 correct slow arg min a0 yltgtat 61 correct efficient 53 arg min mild oversampling yzcba Candes Romberg Tao Donoho linear program number of measurements required M 0 39OQUVKD ltlt N quot quot r 77 i r 4 r 739 l j39 r quot1 rquot u quotI 4 i a 39 r I L Ee 39 E a gt J w P l f if a 5 er 7 M I 5 39 pa J w a quot4 quot J i r I 7A Min CS Signal Recovery 0 Recovery given y 4313 illposed inverse problem find Cl sparse o 60 correct slow g arg mil IICCHO y a M 2 2K o 61 correct efficient g arg min mild oversam ling y Candes Romberg Tao Donoho convex optimization CoSaMP correct more efficient iterative greedy IHT mild oversampling Tropp and Needell Blumensath and Davies number of measurements required M 0K IogNK ltlt N CS Recovery via Linear Programming 0 Optimization problem Basis Pursuit BP min x1 subject to y 2 Pa 0 Standard linear program min 0 subject to Ag 2 b a 2 O 0 Make translations m nl nc LAID JJWUI U7 bWJ7J7HWV7 W U I y o Yields minu 1 subject to Cbu CPU 2 y uv 2 O Why 61 Works for signals sparse in the s ace time domain 38 arg min la 1 minimum 51 solution 
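The contrast drawn on these slides can be checked in a few lines. The following sketch is not part of the slides — the matrix sizes, the random seed, and the use of OMP as a stand-in for sparse recovery are all illustrative choices — but it shows the dense minimum-ℓ2 solution next to a greedy sparse solution for the same y = Φx:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 20, 40, 3

# Random Gaussian measurement matrix (satisfies the RIP w.h.p.)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)

# A K-sparse signal and its compressive measurements
x = np.zeros(N)
x[rng.choice(N, size=K, replace=False)] = rng.standard_normal(K)
y = Phi @ x

# Minimum-l2 solution via the pseudoinverse: fast, but almost never sparse
x_l2 = Phi.T @ np.linalg.solve(Phi @ Phi.T, y)

# Orthogonal Matching Pursuit: greedily grow the support, least-squares on it
support, r = [], y.copy()
for _ in range(K):
    support.append(int(np.argmax(np.abs(Phi.T @ r))))    # most correlated column
    coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
    r = y - Phi[:, support] @ coef                       # residual never grows

x_omp = np.zeros(N)
x_omp[support] = coef

# The l2 solution is dense; the greedy solution has at most K nonzero entries
print(int(np.count_nonzero(np.abs(x_l2) > 1e-8)), len(support))
```

The pseudoinverse answer spreads energy over essentially all N coordinates, while the greedy answer lives on a K-term support — the geometric point the slides make about why ℓ2 fails and sparsity-aware recovery succeeds.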
Universality
- Random measurements can be used for signals sparse in any basis: with x = Ψα, y = Φx = ΦΨα = Φ'α, where α is an N×1 sparse coefficient vector with K nonzero entries

Compressive Sensing
- Directly acquire compressed data: replace N samples by M random projections, M = O(K log(N/K)); transmit/store -> receive -> reconstruct (e.g., by linear programming)

Compressive Sensing Recovery Algorithms

CS Recovery Algorithms
- Convex optimization. Noise-free signals: linear programming (Basis Pursuit), FPC, Bregman iteration. Noisy signals: Basis Pursuit De-Noising (BPDN), second-order cone programming (SOCP), Dantzig selector, GPSR
- Iterative greedy algorithms: Matching Pursuit (MP), Orthogonal Matching Pursuit (OMP), StOMP, CoSaMP, Iterative Hard Thresholding (IHT)
- Software: dsp.rice.edu/cs

SOCP
- Standard LP recovery: min ‖x‖₁ subject to y = Φx
- Noisy measurements: y = Φx + n. Second-order cone program: min ‖x‖₁ subject to ‖y − Φx‖₂ ≤ ε (a convex quadratic program)

BPDN
- Standard LP recovery: min ‖x‖₁ subject to y = Φx
- Noisy measurements: y = Φx + n. Basis Pursuit De-Noising: min ½‖y − Φx‖₂² + λ‖x‖₁ (a convex quadratic program)

Matching Pursuit
- Greedy algorithm. Key ideas: (1) the measurement y is composed of a sum of K columns of Φ; (2) identify which K columns sequentially, according to the size of their contribution to y
- For each column φᵢ compute the correlation ⟨y, φᵢ⟩; choose the largest (greedy); update the estimate x̂ by adding in the selected term; form the residual measurement by subtracting that term from y, and iterate until convergence

Orthogonal Matching Pursuit
- Same procedure as Matching Pursuit, except at each iteration: remove the selected column and re-orthogonalize the remaining columns of Φ against it. Converges in K iterations

Compressive Sensing Summary

CS Hallmarks
- CS changes the rules of the data-acquisition game: it exploits a priori signal sparsity information
- Stable: the acquisition/recovery process is numerically stable
- Universal: the same random projections/hardware can be used for any compressible signal class (generic)
- Asymmetrical (most processing at the decoder): conventional = smart encoder, dumb decoder; CS = dumb encoder, smart decoder
- Random projections are weakly encrypted
- Democratic: each measurement carries the same amount of information; robust to measurement loss and quantization; simple encoding
- Example: a wireless streaming application with data loss. Conventional: complicated unequal error protection of compressed data (DCT/wavelet low-frequency coefficients). CS: merely stream additional measurements and reconstruct using those that arrive safely (fountain-like)

Compressive Sensing + Graphical Models

Sparsity
- Assumption: signals are sparse/compressible (wavelet-sparse images; background subtraction)

Sparsity and Structure
- Assumption: sparse/compressible. Reality: sparse/compressible with structure — hidden Markov trees for wavelet images; Markov random field (Ising) models for background subtraction [Duarte, Wakin and Baraniuk 2005, 2008; Cevher, Duarte, Hegde, Baraniuk 2008; La and Do 2005, 2006; Lee and Bresler 2008]

Models for Sparse/Compressible Signals
- General models for diverse data types: restricted or probabilistic unions of subspaces; graphical models — rooted connected trees for wavelet-sparse signals, Markov random field (Ising) models for spatially clustered signals
- What can we expect? Fewer measurements, faster recovery, and increased robustness and stability [Baraniuk, Cevher, Duarte, Hegde, submitted to Trans. on IT; Cevher, Duarte, Hegde, Baraniuk, NIPS 2008]

Model-Sparse Signals
- A K-sparse signal: x ∈ Σ_K. A K-model-sparse signal: x ∈ M_K ⊆ Σ_K (e.g., rooted connected trees for wavelet-sparse signals)
- The RIP requires M = O(K log(N/K)) for K-sparse signals, but only M = O(K) for K-model-sparse signals [Blumensath and Davies, submitted to Trans. on IT]

Model-Compressible Signals
- Defined through the model-based approximation error

Standard CS Recovery
- Convex problem: x̂ = arg min ‖x'‖₁ s.t. y = Φx' [Candès, Romberg, Tao; Donoho], with guarantees of the form
  ‖x − x̂‖₂ ≤ C₁‖x − x_K‖₂ + C₂ K^(−1/2)‖x − x_K‖₁ + C₃‖n‖₂

Model-based CS Recovery
- A non-convex problem (rooted connected trees for wavelet-sparse signals; Markov random field (Ising) models for spatially clustered signals)
- Iterative solution (the algorithm is based on CoSaMP): calculate the current residual; form a signal estimate; calculate a support estimate; estimate the signal on the obtained support; shrink the support of the obtained estimate
- During the iterations, the signal support must agree with the signal's graphical model: the support enlarging and shrinking steps are changed to enforce the signal graphical model
- Performance guarantees similar to the standard case: ‖x − x̂‖₂ ≤ C₁‖x − x_{M_K}‖₂ + C₂ K^(−1/2)‖x − x_{M_K}‖₁ + C₃‖n‖₂

Tree-based Signal Recovery
- HeaviSine test signal, N = 1024, M = 80: CoSaMP RMSE = 1.123; ℓ1 minimization RMSE = 0.751; tree-based recovery RMSE = 0.037

Monte Carlo Sims, Wavelet Trees
[Figure lost in extraction: maximum normalized recovery error ‖x − x̂‖₂/‖x‖₂ vs. M/K (2 to 5) for CoSaMP (M = 5K), ℓ1 minimization (M = 4.5K), and model-based recovery (M = 3.5K).]
- Tree-sparse piecewise-cubic signals with ≤ 5 break points, length 1024, 500 trials, average RMSE: model-based recovery succeeds at M = 3.5K, ℓ1 minimization needs M = 4.5K, and CoSaMP needs M = 5K

Clustered Sparsity
- Markov random field (Ising) model for spatially clustered images [Cevher, Duarte, Hegde, Baraniuk, NIPS 2008]
- A vision application: background subtraction. Target vs. CoSaMP (M = 3K, N = 65536) vs. Lattice Matching Pursuit (LaMP)
- Convergence: CoSaMP roughly 62-65 sec; LaMP iterations 0.9 sec

Summary
- Compressive sensing exploits signal sparsity/compressibility information
- CS via graphical models provides novel research directions in optimization, learning, and information theory; it exploits structure to make CS better, stronger, and faster, and uses efficient iterative algorithms to solve certain classes of model-based CS recovery problems
- dsp.rice.edu/cs

SIMULATION I
October 2, 2008
Rice University STAT 631 / ELEC 633: Graphical Models
Scribes: Konstantinos Tsianos, Ahmad Beirami, Matthew Nokleby. Instructor: Dr. Volkan Cevher
Index terms: Sampling, Jacobian, Cumulative distribution function (CDF), Acceptance/rejection sampling

1 Introduction

We have spent considerable effort in this class approximating intractable probability distributions. These approximations have several uses: they allow us to give a compact but fairly accurate representation of a complicated data set, or to perform prediction or inference that would otherwise be impossible. The last few lectures focused on deterministic methods for approximating probability distributions, such as the Laplace approximation, variational Bayes, and expectation propagation. While these methods give powerful and tractable approximate solutions, they often have no guarantee of convergence or optimality. This lecture starts the discussion of simulation methods. In deterministic methods, we start with a posterior distribution that needs to be approximated; in simulation methods, we try to simulate the distribution. Specifically, we will approximate it by drawing samples from the desired distribution. This scribe is organized as follows. In Section 2 we motivate sampling methods with a simple example. In Section 3 we discuss sampling by using Jacobians, which allow us to define transformations between random variables with different distributions. Finally, in Section 4 we discuss acceptance/rejection sampling, another method for generating samples from arbitrary distributions.

2
Motivation

Suppose, for example, we have a function f(z). Then E[f] = ∫ f(z) p(z) dz, where E[·] denotes the statistical expectation. If p(z) is sufficiently complicated, we cannot compute E[f] in closed form. One possibility is to evaluate the integral numerically by, for example, a Riemann sum. For instance, we can take the value of f at N uniformly spaced points z⁽¹⁾, ..., z⁽ᴺ⁾ and compute

E[f] ≈ (1/N) Σ_{i=1}^{N} f(z⁽ⁱ⁾) p(z⁽ⁱ⁾),   (1)

up to a constant factor given by the length of the integration interval. As long as f is Riemann integrable, this sum converges to the true expectation as N → ∞. It turns out that this method for approximating E[f] is not efficient in general, especially for multidimensional data: the required number of grid points goes up exponentially with the dimensionality of the data. Sometimes this is called the curse of dimensionality. So we need a more efficient method of approximating E[f]. Instead of using points on a grid, we could perform the approximation using samples that are drawn from our distribution p(z):

E[f] ≈ (1/N) Σ_{i=1}^{N} f(z⁽ⁱ⁾),   z⁽ⁱ⁾ ∼ p(z).   (2)

This approach raises an important issue: generating random samples according to a particular probability density function (PDF) is, in general, much harder than evaluating the PDF at a certain point. So we will focus on methods that allow us to transform easy-to-generate samples, such as uniform or Gaussian samples, into samples from arbitrary distributions.

3 Sampling via Jacobians

Assume we have vectors x = (x₁, ..., xₙ)ᵀ and y = (y₁, ..., yₙ)ᵀ such that x = g(y) for some vector function g. Suppose we have the integral I = ∫ f(x) dx and that we want to transform I into an integral over y. To do this we need to compute the Jacobian of g(y):

J = ∂x/∂y = ∂(x₁, ..., xₙ)/∂(y₁, ..., yₙ), with entries J_ij = ∂x_i/∂y_j.   (3)

Now we can write I = ∫ f(g(y)) |J| dy, where |J| is the determinant of the Jacobian. This determinant is essentially a scaling factor: for integration over a small square in the x-space, |J| gives the area of the corresponding patch mapped through g onto the y-space. How can we use this to draw samples from complicated distributions? Given a function that generates samples from some simple distribution, how can we generate samples that follow a different distribution?
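Before constructing such transformations, note what we do with the samples once we have them: the estimate in Eq. (2) is trivial to compute. A quick sanity check with a distribution we can already sample — f and p here are arbitrary illustrative choices, not from the notes — estimates E[z²] = 1/3 for z ∼ U(0, 1):

```python
import random

random.seed(0)
N = 200_000

# Draw z_i ~ p(z) = U(0,1) and average f(z_i), per Eq. (2)
f = lambda z: z ** 2
estimate = sum(f(random.random()) for _ in range(N)) / N

# The sample average converges to the true expectation 1/3 as N grows
print(abs(estimate - 1 / 3) < 0.01)
```

The same one-liner works for any f and any sampler — the entire difficulty, as the text says, is producing the samples z⁽ⁱ⁾ ∼ p(z) in the first place.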
Specifically, suppose we have a random number generator that gives us z ∼ U(0, 1), the uniform distribution over (0, 1). How do we find a function g such that if y = g(z), then y ∼ p_y, i.e., y follows some other desired distribution p_y? The relation between y and z is deterministic through the bijective function g. Therefore y ≤ g(z) ⟺ z ≤ g⁻¹(y), i.e., F_y(y) = F_z(g⁻¹(y)), and we have p_y(y) dy = p_z(z) dz with p_z(z) = 1, since z is uniformly distributed in (0, 1). This yields F_y(y) = z, i.e., y = F_y⁻¹(z), where F_y is the cumulative distribution function (CDF) of y. Therefore, if we have samples from the uniform distribution and we apply the function F_y⁻¹ to them, we will obtain samples that follow the distribution p_y.

3.1 Example 1: Exponential distribution

If p_y(y) = λe^{−λy}, then F_y(y) = 1 − e^{−λy}, and thus y = −(1/λ) log(1 − z), where z ∼ U(0, 1). This example is taken from [1].

3.2 Example 2: Approximating π, and the Gaussian distribution

Assume we have a uniform random sampler inside a square: z₁ and z₂ are independent with z_i ∼ U(0, 1), so (z₁, z₂) ∈ (0, 1) × (0, 1). Also consider the inscribed quarter circle of radius 1. Given that the area of that quarter circle is π/4, we can approximate the value of π by sampling. Specifically, we generate uniform random samples (z₁, z₂) and keep only the pairs that lie inside the unit circle, z₁² + z₂² ≤ 1. The probability of a point (z₁, z₂) being inside the circle is p(z₁² + z₂² ≤ 1) = π/4, following the ratio of the areas of the quarter circle and the square. So we can use this approach to approximate π, where the approximation accuracy increases with the number of sample points.

Furthermore, for the accepted pairs, if we define r² = z₁² + z₂² and

y₁ = z₁ (−2 ln r² / r²)^{1/2},   y₂ = z₂ (−2 ln r² / r²)^{1/2}

(strictly, with z₁ and z₂ drawn uniformly from (−1, 1) so that the accepted points cover the full unit disc — the Box–Muller/polar construction of [1]), then (y₁, y₂) follows a jointly normal distribution:

p(y₁, y₂) = (1/√(2π)) exp(−y₁²/2) · (1/√(2π)) exp(−y₂²/2).

Therefore we can obtain a normal distribution by marginalizing the joint distribution.

3.3 Example 3: Multivariate Gaussian

In this example we assume that z ∼ N(0, I), where I is the identity matrix and N(m, S) is the multivariate normal distribution with mean m and covariance S.
Suppose we want to obtain samples y ∼ N(μ, Σ). Since Σ is necessarily positive semidefinite, we can use the Cholesky decomposition to get Σ = LLᵀ, where L is a lower triangular matrix. Now let y = μ + Lz. Then it is straightforward to show that E[y] = μ. Moreover,

Cov(y) = E[(y − μ)(y − μ)ᵀ] = E[Lzzᵀ Lᵀ] = L E[zzᵀ] Lᵀ = LLᵀ = Σ.

4 Acceptance/Rejection Sampling

In the last section we saw how we can generate samples from an arbitrary distribution by transforming samples from a given distribution — specifically, a given uniform distribution. However, it is not always straightforward to find a suitable transformation for a complicated distribution. We may also encounter cases in which we do not have direct access to the desired distribution, which precludes computing the required transformation. So we examine another technique for generating samples from a desired distribution, called acceptance/rejection sampling. In this method we need much less information about the desired probability distribution.

Suppose a(x) is the distribution from which we wish to sample. Suppose also that there is a distribution b(x), from which it is easy to sample, that satisfies a(x) ≤ c b(x) for some c > 0. So whenever a(x) is nonzero, b(x) has to be nonzero. The acceptance/rejection algorithm proceeds as follows:

1. y ∼ b(y)
2. u ∼ U(0, 1)
3. if u ≤ a(y) / (c b(y)), then x ← y; otherwise go to 1.

Observe that this algorithm makes use of a uniform random sampler and a sampler of b(x), both of which are easy to sample from by hypothesis. To see that this algorithm gives the right result, let's look at the CDF of the resulting samples:

P(x ≤ w) = P(y ≤ w | u ≤ a(y)/(c b(y)))   (4)
= [∫_{−∞}^{w} ∫_{0}^{a(y)/(c b(y))} b(y) du dy] / [∫_{−∞}^{∞} ∫_{0}^{a(y)/(c b(y))} b(y) du dy]   (5)
= [∫_{−∞}^{w} (a(y)/c) dy] / [∫_{−∞}^{∞} (a(y)/c) dy]   (6)
= [∫_{−∞}^{w} a(y) dy] / [∫_{−∞}^{∞} a(y) dy]   (7)
= ∫_{−∞}^{w} a(y) dy   (8)
= F_a(w).   (9)

So the CDF of the sample x is F_a(x), as desired. For any well-behaved a(x) we can always find a c > 0 such that a(x) ≤ c b(x) for a sufficiently simple b(x). But if c is large, the algorithm will perform slowly, since the acceptance probability depends on the gap between c b(x) and a(x). If c is chosen too large, the acceptance/rejection algorithm will have very poor efficiency.
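The three-step algorithm above is only a few lines of code. A minimal sketch, applied to the triangle density of Example 2 below (the proposal b(x) is uniform on [−1, 1], so b(x) = 1/2 and c = 2 gives a(x) ≤ c b(x); the sample size and seed are arbitrary choices):

```python
import random

random.seed(2)

def a(x):                 # target density: triangle on [-1, 1]
    return max(1.0 - abs(x), 0.0)

c = 2.0                   # a(x) <= c * b(x) with b(x) = 1/2 on [-1, 1]

samples, proposals = [], 0
while len(samples) < 50_000:
    y = random.uniform(-1.0, 1.0)          # step 1: y ~ b(y)
    u = random.random()                    # step 2: u ~ U(0, 1)
    proposals += 1
    if u <= a(y) / (c * 0.5):              # step 3: accept w.p. a(y) / (c b(y))
        samples.append(y)

accept_rate = len(samples) / proposals
mean = sum(samples) / len(samples)
print(abs(accept_rate - 1 / c) < 0.02)     # acceptance probability is 1/c
print(abs(mean) < 0.02)                    # triangle density is symmetric about 0
```

The observed acceptance rate near 1/c = 1/2 illustrates the efficiency remark above: a looser bound (larger c) would reject proportionally more proposals.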
4.1 Example 1: Gaussian distribution

Suppose that we only have uniform generators at hand and we need samples drawn from a Gaussian distribution, a(x) = N(0, 1). By letting b(x) ∝ e^{−|x|} (a Laplace density, which we can sample with the inverse-CDF method of Section 3), we can find a c such that a(x) ≤ c b(x). As shown in Figure 1, the histogram of the samples generated in this manner follows the N(0, 1) Gaussian distribution, as desired.

Figure 1: Gaussian distribution: histogram of samples generated by acceptance/rejection sampling.

4.2 Example 2: Triangle distribution

Next consider a simple triangle distribution defined by the density function

a(x) = 1 − |x| for |x| ≤ 1, and 0 otherwise.

We can draw samples from a(x) by using acceptance/rejection sampling. By letting b(x) be the uniform density on [−1, 1], we have a(x) ≤ 2 b(x), as shown in Figure 2. Now we can easily draw samples from b(x) and U(0, 1) and reject them according to the criterion u ≤ a(y)/(c b(y)). Figure 3 shows the histogram of 10⁵ samples generated in this manner. Figures 2 and 3 agree well, showing that acceptance/rejection sampling works well in this case.

Figure 2: Density functions for Example 2 (triangle distribution).
Figure 3: Triangle distribution: histogram of samples generated by acceptance/rejection sampling.

References
[1] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2007.

Approximate Inference for Localization
Monday, November 4, 2008
Rice University STAT 631 / ELEC 639: Graphical Models
Instructor: Dr. Volkan Cevher
Scribes: Stephen Schnelle (sts2@rice.edu), Tahira N. Saleem (ts4@rice.edu), Ryan E. Guerra (war@rice.edu)

1 Motivation

This document outlines a sensor network situation where approximate inference is used for localization. The basic assumptions made for simplification are described; however, the locations are not explicitly denoted. This is the first of several lectures on this topic. The next few lectures revolve around a sensor network example where the goal is to determine the location and number of targets in a field with sound sensors scattered in that field. This particular example can be extended or
modified to apply to virtually every type of sensor data used for localization, with only the mathematical models for propagation needing to be modified. This document is not intended to walk someone through the entire process of an approximate inference problem; rather, our goal here is to set up the framework for thinking about these problems and provide an example of what a problem would look like. The notation outlined here will be followed throughout this document.

2 Problem Setup and Notation

Suppose that we have a 1 km by 1 km square field, with arbitrary dimensions chosen to provide normalized results. There are K targets scattered throughout the field, indexed 1, 2, ..., K, and Q sensors, indexed 1, 2, ..., Q, as shown in Figure 1.

- x_k = (x_{k,h}, x_{k,v})ᵀ: horizontal and vertical components of the kth target location
- s_q = (s_{q,h}, s_{q,v})ᵀ: horizontal and vertical components of the qth sensor location
- y_k(t): time-domain signal emanating from the kth target at time t
- y_k(ω): Fourier transform of y_k(t), in the frequency domain
- y_k = (y_k(t₁), y_k(t₂), ..., y_k(t_T))ᵀ: vector of discrete time samples of the signal emanating from the kth target
- Y_k = (y_k(ω₁), y_k(ω₂), ..., y_k(ω_T))ᵀ: Fourier transform of y_k
- z_q(t): time signal observed by the qth sensor
- z_q(ω): Fourier transform of z_q(t)
- z_q = (z_q(t₁), z_q(t₂), ..., z_q(t_T))ᵀ: vector of discrete time samples of the signal observed by the qth sensor
- Z_q = (z_q(ω₁), z_q(ω₂), ..., z_q(ω_T))ᵀ: vector of discrete frequency samples of the signal observed by the qth sensor
- x = (x₁ᵀ, x₂ᵀ, ..., x_Kᵀ)ᵀ: target locations
- y = (y₁ᵀ, y₂ᵀ, ..., y_Kᵀ)ᵀ: time-domain signal samples from the targets
- z = (z₁ᵀ, z₂ᵀ, ..., z_Qᵀ)ᵀ: time-domain signal samples at the sensors

Figure 1: Example field for localization. Dots are sensors and stars are targets.

3 Propagation Model

The signal obtained at a sensor is not exactly the signal emanating from the target: we must allow for propagation effects. From a target at x to a sensor at s we use the model

z(ω) = (1 / ‖x − s‖^α) e^{−jω‖x−s‖/c} y(ω).   (1)

This equates to the time-domain equation

z(t) = (1 / ‖x − s‖^α) y(t − ‖x − s‖/c).   (2)

We can extend this model to describe the relation between all targets and sensors. Let z represent a concatenation of the signals at sensors 1 to Q.
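Before assembling the concatenated model, the single target-sensor relation in Eq. (1) can be sketched numerically. All of the values below (locations, frequency, acoustic propagation in air with spherical spreading) are hypothetical choices, not from the notes:

```python
import cmath
import math

c = 343.0                   # propagation speed: acoustic, in air (m/s)
alpha = 1.0                 # spherical-spreading propagation constant
x = (100.0, 200.0)          # hypothetical target location (m)
s = (400.0, 600.0)          # hypothetical sensor location (m)
omega = 2 * math.pi * 50.0  # hypothetical angular frequency (rad/s)

d = math.hypot(x[0] - s[0], x[1] - s[1])  # ||x - s|| = 500 m here

# Frequency-domain propagation factor from Eq. (1): the observed spectrum is
# z(omega) = h * y(omega), combining 1/d^alpha attenuation with a d/c delay.
h = cmath.exp(-1j * omega * d / c) / d ** alpha

attenuation = abs(h)                              # equals 1 / d**alpha
undelayed = h * cmath.exp(1j * omega * d / c)     # removing the delay leaves 1/d^alpha
print(attenuation, undelayed.real)
```

Multiplying the target spectrum by h at every frequency bin is exactly what the diagonal matrices A(x_k), introduced next, do for each target-sensor pair.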
Then

z = Σ_{k=1}^{K} A(x_k) y_k + n,

where n describes the noise and A(x_k) is the vertical concatenation of Q diagonal matrices, the qth of which has (1 / ‖x_k − s_q‖^α) e^{−jω_l ‖x_k − s_q‖/c} as its lth diagonal element.

In this model, c represents the propagation speed through our medium. For example, for an electrical signal c is the speed of light, 299,792,458 m/s, whereas acoustic signals travel at roughly 343 m/s in air and 1497 m/s in water. α is a propagation constant used to handle directional issues when waves are modeled propagating through a medium. For example, many signals and sounds are modeled as emitting equally in all directions, i.e., spherical propagation; in this case α = 1.

Figure 2: Comparison of (a) spherical signal propagation, standard for sound in air, and (b) cylindrical signal propagation, as for sound underwater.

On the other hand, if we are far underwater, sound appears to travel mainly horizontally. This is due to a change in density between water layers: as sound moves at an angle from the horizontal, it reflects off the boundary layer above or below, returning to the same altitude while continuing horizontally from the source. Hence the sound waves reinforce each other horizontally, and we model this effect with a smaller propagation constant, α = 1/2. Since we are dividing by ‖x − s‖^α, this increases the strength of the signal at the sensor relative to a spherical model.

Our graphical model can be diagrammed as shown in Figure 3.

Figure 3: Graphical model for the sensor field.

4 Approach to Inference

Graphical model algorithms do not directly handle physical effects; rather, these are incorporated into the solution using equations that explain relations. In the visual graph, each link between nodes indicates a relationship between them, and the propagation equations above describe the link between y_k and z_q. Define M_K = {K, y}; this can be seen as encapsulating the question of model order and model parameters in a single variable. We can safely
assume that there are not an infinite number of sources emitting signals collected by the sensors: K, while unknown, must be finite, which allows us to utilize inference techniques such as reversible-jump Markov chain Monte Carlo. We, however, will instead assume a fixed number of targets and then optimize our results over different potential quantities. We would like to determine the order of our model:

p(K | z) ∝ p(z | K) p(K).

However, we lack p(z | K), because we do not know the source locations and thus cannot accurately describe the propagation of each signal. Fortunately, we can use marginalization to obtain this quantity, starting with p(z | x, M_K). Since the source location is not dependent on the signal it emits, p(x | M_K) = p(x | K). Furthermore, we assume we have no prior information about the location of the targets, and thus p(x | K) ∝ 1, i.e., a constant. From this,

p(z | K) = ∫ p(z | x, M_K) p(x | M_K) dx ∝ ∫ p(z | x, M_K) dx.

Our model is solved using maximization techniques for the approximate K̂ and x̂. We have

K̂ = arg max_K ∫ p(z | x, K) p(K) dx,   (3)
x̂ = arg max_x p(z | x, M_K).

While it is beyond the scope of this paper to define techniques for solving these equations, one should note that these solutions may be only local maxima and not the global maximum. Using the energy function

E(x, y) = ‖z − Σ_{k=1}^{K} A(x_k) y_k‖²,

we can substitute this in as p(z | x, M_K) ∝ e^{−E(x,y)}. The expansion of this can be simplified somewhat because the sources are assumed to be uncorrelated. Additionally, setting the derivative of E with respect to y_k to zero, we find that the minimizer is ŷ_k = A†(x_k) z, where A†(x_k) is the pseudoinverse of A(x_k).

5 Conclusion

This set of notes mainly outlines the sensor network scenario to be discussed in future lectures and the assumptions made for simplification. Details of inference will be addressed in other documents, and references can be found in Bishop. Important things of note are the marginalization over the source locations and the fact that, although the visual model does not expressly define the physical properties of signal propagation, these are incorporated into the conditional distributions.

6 Near-Optimal Spatial Localization
Example

The following example is provided courtesy of Volkan et al. [3]; it is meant here to show visually the results of one method for solving the localization problem. For more detail about the solution, see [3].

To demonstrate the localization framework and the optimization algorithms, we simulate a sensor network in which Q = 30 sensors are randomly deployed over a 150 × 150 m² deployment area to localize two targets. In Fig. 4a the target locations are shown with stars, whereas the sensor locations are shown with circles. In this experiment our sources are actual recorded projectile shots, as shown in Fig. 4b. The power of the first source (top) is approximately three times the power of the second source (bottom).

Figure 4: (a) Sensor network deployment setup. (b) Source signals.

We simulate a decentralized message-passing scheme as discussed in [3]. We compress the dimensions of the observed signals 50-fold from their Nyquist rate using Gaussian random matrices for transmission compression at each sensor. Given the compressive measurements from the other sensors in the network, each sensor proceeds to solve for the best K-term solution locally. Finally, each solution, along with its objective score, is passed across the sensor network; only the solution with the minimum score among all the sensors is kept during transmission.

Figure 5 summarizes the estimation results. Note that the heights of the peaks are approximately proportional to the source powers. In Fig. 5a the locally computed data score values are shown; the scores vary because the dictionaries are built using the observed signals themselves, which include both sources. In Fig. 5b we illustrate the localization result for the sensor with the best local score. Even in the presence of additive noise in the observed signals and the high amount of compression, the resulting location estimates are satisfactory. In Fig. 5c we randomly selected sensor 19 and plotted its
localization output. Given the ground truth, shown with stars in the plot, the localization output of sensor 19 is much better than that of the sensor with the best local score. However, over Monte Carlo runs we expect the data fusion scheme to perform better on average. For completeness, we show the average of all the localization outputs from the sensor network.

Figure 5: Localization results: the locally computed data scores, the output of the sensor with the best local score, the output of sensor 19, and the average of all localization outputs.

Sparse Approximation
October 23, 2008
Rice University STAT 631 / ELEC 639: Graphical Models
Instructor: Dr. Volkan Cevher
Scribe: Ryan E. Guerra (war@rice.edu)

1 Introduction

In graphical models we often find that it is too computationally expensive or memory-intensive to use the full representation of a pdf in our calculations and algorithms. For instance, in belief-propagation schemes [2], the size of the messages being passed can have a large impact on the performance of the algorithm. This motivates the need for a more concise, or compact, representation of the information contained in a discrete pdf. This discussion presents a method for finding a reduced-order representation called the best k-term approximation.

2 Best k-term Approximation

In the data compression literature [1], the general approach to compressing a signal vector x ∈ R^N involves selecting a certain basis set Ψ so that x = Σ_{i=1}^{N} ψ_i s_i, where s contains the basis-set coefficients for the signal. This basis set is selected such that the information contained in the original signal can be concisely retained in a subset of the coefficients s_i. A k-term approximation (k < N)
of s then involves selecting k coefficients in s and discarding the rest as zero. The selection of which coefficients to keep and which to discard is determined by minimizing the best k-term approximation error σ_k. We define the norm

‖a‖_p = (Σ_i |a_i|^p)^{1/p}

for all p as a scalar measure of vector length, where ‖a‖₂ = (Σ_i a_i²)^{1/2} is the Euclidean length of the vector a and ‖a‖_∞ = max_i |a_i|. Assuming, for the sake of notation, that x already represents the coefficients of the signal in the appropriate basis, the best k-term approximation error in a given norm p is

σ_k(x)_p = min_{z ∈ Σ_k} ‖x − z‖_p,   (1)

where Σ_k is the set of all candidate k-term estimates. The best k-term approximation is then the vector x̂_k that minimizes this error:

x̂_k = arg min_{z ∈ Σ_k} ‖x − z‖_p.   (2)

Much more work is focused on choosing appropriate basis sets and a sufficient k in order to obtain a certain magnitude of error; that is beyond the scope of this discussion.

3 Best k-term Approximation to a PDF

Without loss of generality, we limit our discussion to a probability distribution function in one dimension. Let us discretize the support of our pdf on a grid and index each discretized bin with θ₁, ..., θ_N. We then draw random samples from our continuous pdf, using any of the previously discussed methods, and place them in the corresponding bins. After a sufficient number of samples, this gives us a nice discretized representation of our pdf, p = (p_z(θ₁), ..., p_z(θ_N)) ∈ R^N, where each entry is the fraction of the total samples drawn that fall in the corresponding bin. Let us assume that the basis for the approximation is simply the canonical basis; then we define our best k-term approximation as

p̂_k = arg min_{z ∈ Σ_k} ‖p − z‖_p,   (3)

where this Σ_k is the set of all vectors formed by replacing all but k entries in p with zeros, and ‖·‖_p is an arbitrary norm with p ≥ 1. Let p↓ be the vector p with entries ordered by magnitude, largest to smallest. A remarkable result is that the k terms that minimize expression (3), and that therefore uniquely define p̂_k, are the first k terms of p↓. This can be proved easily by contradiction: a selection of any lesser terms would yield a larger error σ_k(p)_p.
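The truncate-and-normalize construction, together with the KL identity proved in the appendix, can be verified in a few lines; the pdf values and k below are arbitrary illustrative choices:

```python
import math

# An arbitrary discretized pdf over N = 6 bins (values chosen for illustration)
p = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
k = 3

# Best k-term approximation: keep the k largest entries, zero out the rest
keep = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
p_hat = [p[i] if i in keep else 0.0 for i in range(len(p))]

# Normalize the truncation to obtain the pdf q
Zq = sum(p_hat)
q = [v / Zq for v in p_hat]

# KL(q || p), summed over the support of q, equals -log(Zq)
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
print(math.isclose(kl, -math.log(Zq)))  # True
```

Since Z_q is the mass captured by the kept bins, keeping the k largest bins maximizes Z_q and therefore minimizes the divergence −log Z_q, which is the intuition stated in the text.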
terms would yield a larger error \sigma_k(p). Since \hat{p}_k is not a pdf, we normalize it by letting q = \hat{p}_k / Z_q. An interesting consequence of this is that when we consider the divergence D(q \| p), in particular the Kullback-Leibler divergence, we find that KL(q \| p) = -\ln Z_q, a proof of which is given in the appendix. Thus selecting the largest k terms in p for the approximation minimizes the KL divergence, a rather desirable result. We can think of this intuitively as keeping the k terms that encode the most mass in p. This best k-term approximation allows us to substantially reduce the amount of memory required to represent a given pdf, which we will find useful when actually trying to implement message-passing schemes. The selection of k itself will depend heavily on the application at hand and one's desired level of error.

References

[1] A. Cohen, W. Dahmen, and R. DeVore, "Compressed sensing and best k-term approximation," preprint, 2006.
[2] C. Bishop, Pattern Recognition and Machine Learning. Cambridge, U.K.: Springer Science, 2006.

APPENDIX

Lemma 1. The Kullback-Leibler divergence of the best k-term approximation to a pdf is simply -\log Z_q.

Proof.¹ Define the vectors

p = [p_1, p_2, \ldots, p_k, p_{k+1}, \ldots, p_N], \qquad p_i \ge 0,

q = \frac{1}{Z_q} [p_1, \ldots, p_k, 0, \ldots, 0].

In the above, p represents an unnormalized distribution and q represents the normalized and truncated version of p: q contains the first k terms of p, and Z_q is the normalization constant that ensures \sum_i q_i = 1. We now wish to show that KL(q \| p) = -\log Z_q. We have defined the vectors such that p and q are sorted (the largest k nonzero terms are at the beginning of the distribution), but this can trivially be extended to the unsorted case. We have

KL(q \| p) = \sum_i q_i \log \frac{q_i}{p_i} = \sum_{1 \le i \le k} q_i \log \frac{q_i}{p_i} + \sum_{k < i \le N} q_i \log \frac{q_i}{p_i}.

¹This proof courtesy of Andrew Waters (andrew.e.waters@rice.edu).

Noting that the right summation only contains terms with q_i = 0, we see that each term in this summation is eliminated. This yields

KL(q \| p) = \sum_{1 \le i \le k} q_i \log \frac{q_i}{p_i}.

We can now substitute q_i = p_i / Z_q and rearrange terms:

KL(q \| p) = \sum_{1 \le i \le k} \frac{p_i}{Z_q} \log \frac{p_i / Z_q}{p_i} = \log \frac{1}{Z_q} \sum_{1 \le i \le k} \frac{p_i}{Z_q} = \log \frac{1}{Z_q} = -\log Z_q. \qquad \square

Laplace Approximation

Thursday, September 11,
2008. Rice University STAT 631 / ELEC 639: Graphical Models. Scribe: Ryan GUERRA. Instructor: Dr. Volkan CEVHER. Reviewers: Beth Bower and Terrance Savitsky.

Index Terms: Laplace Approximation, Approximate Inference, Taylor Series, Chi Distribution, Normal Distribution, Probability Density Function.

When we left off with the Joint Tree Algorithm and the Max-Sum Algorithm last class, we had crafted messages to traverse a tree-structured graphical model in order to calculate marginal and joint distributions. We are interested in finding p(z|x) when p(x|z) is given, as shown below.

FIGURE 1. Graph representing both hidden (clear, z) and observed (shaded, x) variables, with their conditional dependence indicated by the arrow.

In this case x is our observed variable and z is our hidden variable, for which we wish to make some inference. While we would normally wish to make some sort of exact inference about p(z|x), this problem is often either impossible to solve or the required algorithm is intractable. The next few lectures will focus on deterministic approximations to a pdf, and then we will move on to stochastic approximations. The general hierarchy of approximation techniques is given here for reference:

Approximate Inference
o Deterministic Approximations
  (1) Laplace (local)
  (2) Variational Bayes (global)
  (3) Expectation Propagation (both)
o Stochastic Approximations
  (1) Metropolis-Hastings / Gibbs
  (2) SIS

1. LAPLACE APPROXIMATIONS TO A PDF

1.1. Motivation for Representation. The idea here is that we wish to approximate any pdf, such as the one given below, with a nice, simple representation.

FIGURE 2. An example multi-modal distribution that we want to approximate.

The Laplace approximation is a method for using a Gaussian N(\mu, \sigma^2) to represent a given pdf. This is obviously more effective for a single-mode¹ distribution, as many popular distributions can be roughly represented with a Gaussian. As an example of what we mean by "represent," consider that we have some function g(z) of a variable distributed according to the density function p(z), and we wish to get
the expected value through sampling:

E[g(z)] = \int g(z) \, p(z) \, dz.

We wish to calculate this expected value by sampling discrete values z_i from p(z), and thus get an estimate that can be calculated as

E[g] \approx \frac{1}{L} \sum_{i=1}^{L} g(z_i).

¹A mode is a concentration of mass in a pdf. We could imagine a non-nice distribution, like a U-quadratic, that may not have its mass concentrated in a single area and that would be poorly represented by the Laplace approximation.

FIGURE 3. The original distribution we want to represent (blue) with its Gaussian approximation (red) obtained by using the Laplace approximation method. Note that we were only able to capture one of the original distribution's modes.

Often we will find that p(z) cannot be easily sampled, and we wish to find an alternative way to draw samples from it. This is the subject matter of chapter 11 from Bishop [2], but suffice it to say we can instead draw samples z_i from another, nicer distribution q(z), where q(z) is some known pdf and q(z) \neq 0:

E[g(z)] \approx \frac{1}{L} \sum_{i=1}^{L} g(z_i) \frac{p(z_i)}{q(z_i)}.

So far we haven't said anything about the choice of q(z) we could use to represent our pdf, but we'd like to use something simple and computable because, as the dimensions of the problem increase, the required computational memory increases dramatically. This is why we need approximations.

Thus far we have introduced the motivation behind approximation schemes, in particular the method of Laplace approximation. We will proceed by deriving the Laplace approximation using Taylor series expansions. Then we move to a paper by Tierney and Kadane [3] and describe the use of the Laplace approximation to estimate a posterior mean. We conclude with an example of approximating the Chi distribution with a Normal distribution and demonstrate the quality of the approximation through graphics.

1.2. Derivation of the Laplace Approximation. Suppose we wished to approximate a pdf f(x) around a point x_0. Let's look at the Taylor series expansion² of \ln f(x):

\ln f(x) \approx \ln f(x_0) + \frac{\partial \ln f(x)}{\partial x}\Big|_{x=x_0} (x - x_0) + \frac{1}{2} \frac{\partial^2 \ln f(x)}{\partial x^2}\Big|_{x=x_0} (x - x_0)^2. \qquad (1)

²Definition: f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!} (x - x_0)^n.

Let's assume that the higher-order terms are negligible and focus, for now, on the second term in (1):

\frac{\partial \ln f(x)}{\partial x}\Big|_{x=x_0} (x - x_0).

We notice that \partial \ln f / \partial x is zero at local maxima of the pdf. If we find this local maximum and choose to expand our Taylor series around this point x_{max}, we ensure that the second term on the RHS of (1) is always zero. This is done by setting \partial \ln f / \partial x equal to zero and solving for x_{max}, the local maximum of the pdf. Taking the first three terms of the Taylor series expansion around x_0 = x_{max}, (1) becomes

\ln f(x) \approx \ln f(x_{max}) + \frac{1}{2} \frac{\partial^2 \ln f(x)}{\partial x^2}\Big|_{x=x_{max}} (x - x_{max})^2, \qquad (3)

f(x) \approx f(x_{max}) \exp\left( \frac{1}{2} \frac{\partial^2 \ln f(x)}{\partial x^2}\Big|_{x=x_{max}} (x - x_{max})^2 \right), \qquad (4)

\int f(x) \, dx \approx f(x_{max}) \int \exp\left( \frac{1}{2} \frac{\partial^2 \ln f(x)}{\partial x^2}\Big|_{x=x_{max}} (x - x_{max})^2 \right) dx, \qquad (5)

where we took the exponent in (4) and integrated both sides in x. We see that the RHS of (5) contains a bunch of constants and a single term inside the exponent that is quadratic with respect to x. For the sake of simplicity, let L(x) = \ln f(x) and let \sigma^2 = -\left( \frac{\partial^2 L}{\partial x^2}\Big|_{x=x_{max}} \right)^{-1}; we then get a result that looks remarkably like a Gaussian:

\int e^{L(x)} \, dx \approx e^{L(x_{max})} \int \exp\left( -\frac{(x - x_{max})^2}{2\sigma^2} \right) dx. \qquad (6)

This is the result of the Laplace method for integrals, though it is cited with an additional term n and with the traditional notation x_{max} = \hat{x} in Tierney & Kadane [3] as

\int e^{nL(x)} \, dx \approx e^{nL(\hat{x})} \int \exp\left( -\frac{n (x - \hat{x})^2}{2\sigma^2} \right) dx = \sigma \sqrt{2\pi/n} \, e^{nL(\hat{x})}. \qquad (7)

1.3. Application. To elaborate further on what (7) gets us, I'm going to borrow from Tierney & Kadane's paper. Given a smooth, positive function g, we wish to approximate the posterior mean of g:

E[g(X) \mid Y] = \frac{\int g(x) \, e^{\ell(x)} \pi(x) \, dx}{\int e^{\ell(x)} \pi(x) \, dx}, \qquad (8)

where \ell(x) is the log-likelihood function \ln p(Y \mid x), \pi(x) is the prior density, and Y is the observed set of data. As we can see, the forms of the integrals in (8) are very similar to the forms seen in (7), and are then easily estimated by the Laplace approximation of a pdf. But you can read the paper if you're interested.

We now have a step-by-step process for using the Laplace approximation to approximate a single-mode pdf with a Gaussian:

(1) find a local maximum x_{max} of the given pdf f(x);
(2)
calculate the variance \sigma^2 = -\left( \frac{\partial^2 \ln f(x)}{\partial x^2}\Big|_{x=x_{max}} \right)^{-1};
(3) approximate the pdf with p(x) \approx N(x_{max}, \sigma^2).

1.4. Example: Chi distribution.

p(x) = \kappa \, x^{k-1} e^{-x^2/2}, \qquad x > 0.

Note that \kappa is a normalization constant that doesn't depend on x. Apparently most books don't even bother with it, and we'll ignore it here for the sake of convenience. Remember that we're working with the log-likelihood here, so f(x) = \ln p(x):

f(x) = \ln \kappa + (k - 1) \ln x - \frac{x^2}{2},

\frac{\partial f(x)}{\partial x} = \frac{k-1}{x} - x = 0 \quad \Rightarrow \quad x_{max} = \sqrt{k - 1}.

Now that we've found the local maximum, we compute the variance:

\frac{\partial^2 f(x)}{\partial x^2} = -\frac{k-1}{x^2} - 1, \qquad \frac{\partial^2 f(x)}{\partial x^2}\Big|_{x=x_{max}} = -\frac{k-1}{k-1} - 1 = -2, \qquad \sigma^2 = \frac{1}{2}.

FIGURE 4. A plot of four Chi distributions (solid) and the corresponding Normal approximations (dashed).

1.5. Application: A Medical Inference Problem. In the paper "Laplace's Method Approximations for Probabilistic Inference in Belief Networks with Continuous Variables" [1], the authors introduce a medical inference problem. The figure below shows the Bayesian graphical model for the problem. There are two different experimental treatments for a disease. The goal is to estimate the posterior mean of the increase in one-year survival probability. In previous work, the authors ran a Monte Carlo simulation to sample from the posterior to determine the posterior mean. In this example, we compare the Laplace approximation of the posterior to the Monte Carlo sampling method. Below is a graph of the Monte Carlo sampling and the Laplace approximation superimposed.

FIGURE 5. The Laplace approximation vs. the Monte Carlo estimate of the posterior. Comparing the two, we see that the Laplace approximation worked well in this example.

References

[1] A. Azevedo-Filho and R. Shachter. Laplace's method approximations for probabilistic inference in belief networks with continuous variables. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, 1994.
[2] C. Bishop. Pattern Recognition and Machine Learning. Cambridge, U.K.: Springer Science, 2006.
[3] Luke Tierney and Joseph Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393), 1986.
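The three-step recipe above can be checked numerically on the Chi-distribution example. Below is a minimal sketch (in Python; the helper names `log_chi` and `laplace_approx` are ours, not from the notes) that computes the mode x_max = sqrt(k - 1) and the variance sigma^2 = -1/f''(x_max) = 1/2, and verifies the curvature with a finite difference.

```python
import math

def log_chi(x, k):
    # Unnormalized log-density of the Chi distribution: (k-1) ln x - x^2 / 2.
    # The normalization constant kappa only shifts this by a constant, so it
    # affects neither the mode nor the curvature.
    return (k - 1.0) * math.log(x) - 0.5 * x * x

def laplace_approx(k):
    # Step 1: find the mode by setting f'(x) = (k-1)/x - x to zero.
    x_max = math.sqrt(k - 1.0)
    # Step 2: sigma^2 = -1 / f''(x_max), with f''(x) = -(k-1)/x^2 - 1.
    f2 = -(k - 1.0) / x_max**2 - 1.0   # equals -2 for every k > 1
    var = -1.0 / f2
    # Step 3: the approximating pdf is the Gaussian N(x_max, var).
    return x_max, var

k = 4
x_max, var = laplace_approx(k)

# Sanity check: a central finite difference of log_chi at the mode
# should be close to the analytic second derivative, -2.
h = 1e-5
f2_num = (log_chi(x_max + h, k) - 2.0 * log_chi(x_max, k)
          + log_chi(x_max - h, k)) / h**2

print(x_max, var, f2_num)   # mode = sqrt(3), variance = 0.5, f'' near -2
```

Because the curvature at the mode is exactly -2 for every k > 1, all of the Normal approximations in Figure 4 share the same variance of 1/2; only their means differ.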