Stat 535 lecture 2 notes: hidden Markov models (HMMs)

1 Mixture models

The simplest interesting graphical model is the mixture model, with a state variable Q and an output variable Y governed by P(Q, Y) = P(Q) P(Y | Q). This can be viewed as a generative model that generates samples independently of each other by the following process:

1. for sample i, choose state q_i according to P(Q);
2. then choose output y_i according to P(y | q_i).

In typical applications the state q_i is hidden (aka latent), meaning that it is not directly observable but can only be inferred from y_i. If Q ∈ {1, ..., n}, Y ∈ R^d, and P(y | q) ~ N(μ_q, Σ_q), then P(y) is an n-component Gaussian mixture with mixture weights P(q), with pdf

    P(y) = Σ_{q=1..n} P(q) (2π)^{-d/2} |Σ_q|^{-1/2} exp( -(1/2) (y - μ_q)^T Σ_q^{-1} (y - μ_q) ).

For example, an equally weighted 3-component mixture in R^2:

[Figure: probability density; labeled sample; observed samples.]

Mixture models are often used for classification tasks. In training, labeled pairs (q_i, y_i) are used to estimate the distributions P(q) and P(y | q). Then, given an unlabeled point y, one can calculate a distribution over the hidden label (state) using Bayes' rule: P(q | y) ∝ P(q) P(y | q).* For example, one might use a mixture model to classify small n × n binary images of handwritten characters into labels {A, ..., Z, 0, ..., 9}. Although far from optimal, a simple parameterization is P(y_1, ..., y_{n·n} | q) = Π_i P(y_i | q); this "naive Bayes" model is equivalent to the graphical model [figure]. Why is this model "naive"?

* This is playing fast and loose with notation: for discrete Y, P(y | q) and P(y) are probabilities; for continuous Y, P(y | q) and P(y) are probability densities; but the formula still holds.

2 Hidden Markov models

The hidden Markov model is a generalization of the mixture model that allows for dependence between the states of samples. It is typically used when samples have a natural order in time or space. The generative process is:

1. for sample 1, choose state q_1 according to P(Q_1); for subsequent samples, choose q_i according to P(Q_i | q_{i-1});
2. then choose output y_i according to P(y_i | q_i).

Vitally, the probability of state q_i depends on q_{i-1} but not on earlier states: for all i < j < k, Q_i ⊥ Q_k | Q_j.

In the example of handwritten character recognition, the hidden state dependence can be used to model structure in character sequences, for example the tendency for vowels to follow consonants and vice versa. Consider again the 3-component Gaussian mixture, assuming that at every step the state progression is governed by

    P(q_{i+1} | q_i) = 0.8 if q_{i+1} = q_i + 1 (mod 3), and 0.1 otherwise.

In that case, the marginal probability distribution is the same as before (all states are equally likely) but consecutive samples are highly correlated.

[Figure: labeled sample sequence.]

Using knowledge of the state dynamics and the output history y_1, ..., y_i, it is possible to dramatically improve our predictive power. We will see how to calculate P(q_{i+1} | y_1, ..., y_i), and from that P(y_{i+1} | y_1, ..., y_i).

[Figure: P(y_{i+1}) versus P(y_{i+1} | y_1, ..., y_i); last 5 samples shown.]
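The generative process above is easy to simulate. The following is a minimal Python sketch, assuming the 3-state cyclic transition rule just described and invented 2-D Gaussian emission parameters (the notes do not give numerical values for μ_q or Σ_q; the means, covariance and the function name sample_hmm are illustrative only).

    # A minimal sketch of the HMM generative process from Section 2, assuming
    # 3 hidden states, the cyclic transition rule P(q_{i+1}=q_i+1 mod 3) = 0.8,
    # and made-up 2-D Gaussian emission parameters.
    import numpy as np

    rng = np.random.default_rng(0)

    n_states = 3
    # transition matrix: 0.8 to the "next" state (mod 3), 0.1 to each other state
    T = np.full((n_states, n_states), 0.1)
    for q in range(n_states):
        T[q, (q + 1) % n_states] = 0.8
    means = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])   # hypothetical mu_q
    cov = np.eye(2)                                          # hypothetical Sigma_q

    def sample_hmm(n):
        """Draw (q_1..q_n, y_1..y_n): q_1 uniform, q_i | q_{i-1} ~ T, y_i ~ N(mu_q, cov)."""
        q = np.zeros(n, dtype=int)
        y = np.zeros((n, 2))
        q[0] = rng.integers(n_states)
        y[0] = rng.multivariate_normal(means[q[0]], cov)
        for i in range(1, n):
            q[i] = rng.choice(n_states, p=T[q[i - 1]])
            y[i] = rng.multivariate_normal(means[q[i]], cov)
        return q, y

    q, y = sample_hmm(10)
    print(q)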
3 Operations

HMMs are used in many different ways:

- Smoothing. If Q represents a hidden "true" signal that is corrupted by independent noise to produce an observed signal Y, then the distribution P(Q | y) can be used to "de-noise", or smooth, the observation. A simple model of noise is P(y | q) ~ N(q, σ).

- Prediction. Given a sequence of observations y_1, ..., y_t, predict y_{t+1} using the identity P(y_{t+1} | y_1, ..., y_t) = Σ_{q_{t+1}} P(q_{t+1} | y_1, ..., y_t) P(y_{t+1} | q_{t+1}). Such models are commonly used in target tracking, where Q represents a true position and Y noisy measurements of it. In some text entry devices, predictive probabilities are used to alter the size or order of keys to improve entry speed and accuracy.

- Classification. Y represents a sequential input and Q a desired classification, or labeling, of its parts. For example, in an information retrieval application y_i may be a word from a document y_1, ..., y_n and q_i a classification of the word into one of {NameStart, NameMiddle, NameEnd, Other}. Model estimation might take place from labeled examples; extraction of names from documents may be performed by calculating arg max_q P(q | y).

- Transduction. TBD

- Analysis. TBD

4 Conditional Independence in HMMs

Using the chain rule,

    P(q, y) = P(q_1, ..., q_n, y_1, ..., y_n)
            = P(q_1) P(q_2 | q_1) P(q_3 | q_1, q_2) ··· P(q_n | q_1, ..., q_{n-1})
              × P(y_1 | q_1, ..., q_n) P(y_2 | q_1, ..., q_n, y_1) ··· P(y_n | q_1, ..., q_n, y_1, ..., y_{n-1}).

The HMM graphical model implies the conditional independencies

    P(q_i | q_1, ..., q_{i-1}) = P(q_i | q_{i-1})
    P(y_i | q_1, ..., q_n, y_1, ..., y_{i-1}) = P(y_i | q_i).

Therefore

    P(q, y) = P(q_1) Π_{i=2..n} P(q_i | q_{i-1}) Π_{i=1..n} P(y_i | q_i).

Rewriting pairwise conditional probabilities as pairwise joint probabilities, notice

    P(q, y) = [ Π_{i=2..n} P(q_{i-1}, q_i) × Π_{i=1..n} P(q_i, y_i) ] / Π_{i=1..n} P(q_i)^{δ(Q_i) - 1},

where δ(Q_i) is the degree of variable Q_i in the graphical model. Simply relabeling the terms gives

    P(q, y) = [ Π_{i=2..n} Φ_{Q_{i-1} Q_i}(q_{i-1}, q_i) × Π_{i=1..n} Φ_{Q_i Y_i}(q_i, y_i) ] / Π_{i=1..n} Φ_{Q_i}(q_i)^{δ(Q_i) - 1},

an expression that shows directed HMMs are equivalent to undirected graphical models. Thus, all of the following can express the same distributions: [figure]. In fact, we can generalize the expression of the joint probability one step further to

    P(x) = Π_{(s,t) ∈ E} Φ_{st}(x_s, x_t) / Π_{v ∈ V} Φ_v(x_v)^{δ(v) - 1},

where V are the vertices in the graphical model and E are the edges. This expression holds for any acyclic undirected graphical model (tree model), and for directed graphical models where no node has more than one parent.
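As a sanity check on the pairwise-potential form above, the following Python sketch (with invented toy parameters; the parameter names pi0, T, E and all variable names are mine, not the notes') verifies numerically that the directed factorization and the pairwise joint-probability expression agree on every configuration of a small HMM.

    # Numerical check that the directed HMM factorization equals
    #   prod_i P(q_{i-1}, q_i) * prod_i P(q_i, y_i) / prod_i P(q_i)^(deg(Q_i)-1)
    # for a toy HMM with random parameters.
    import itertools
    import numpy as np

    rng = np.random.default_rng(3)
    S, K, n = 2, 3, 4                         # states, output symbols, length
    pi0 = rng.dirichlet(np.ones(S))           # P(q_1)
    T = rng.dirichlet(np.ones(S), size=S)     # T[r, q] = P(q_i = q | q_{i-1} = r)
    E = rng.dirichlet(np.ones(K), size=S)     # E[q, v] = P(y_i = v | q_i = q)

    # marginals P(q_i), needed for the pairwise joint probabilities
    marg = [pi0]
    for _ in range(1, n):
        marg.append(marg[-1] @ T)

    ok = True
    for q in itertools.product(range(S), repeat=n):
        for y in itertools.product(range(K), repeat=n):
            directed = pi0[q[0]] * np.prod([T[q[i-1], q[i]] for i in range(1, n)]) \
                       * np.prod([E[q[i], y[i]] for i in range(n)])
            num = np.prod([marg[i-1][q[i-1]] * T[q[i-1], q[i]] for i in range(1, n)]) \
                  * np.prod([marg[i][q[i]] * E[q[i], y[i]] for i in range(n)])
            deg = [2] + [3] * (n - 2) + [2]   # degree of each Q_i in the graphical model
            den = np.prod([marg[i][q[i]] ** (deg[i] - 1) for i in range(n)])
            ok &= np.isclose(directed, num / den)
    print(ok)   # True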
5 Inference

Let us calculate P(y) = Σ_q P(q, y), a computation that at first blush appears to require a summation over a number of hidden state configurations that is exponential in n:

    P(y) = Σ_q P(q, y)
         = Σ_q P(q_1) Π_{i=2..n} P(q_i | q_{i-1}) Π_{i=1..n} P(y_i | q_i)
         = Σ_{q_1} P(q_1) P(y_1 | q_1) Σ_{q_2..q_n} Π_{i=2..n} P(q_i | q_{i-1}) P(y_i | q_i)
         = Σ_{q_1} P(q_1) P(y_1 | q_1) Σ_{q_2} P(q_2 | q_1) P(y_2 | q_2) Σ_{q_3..q_n} Π_{i=3..n} P(q_i | q_{i-1}) P(y_i | q_i)
         = Σ_{q_1} P(q_1) P(y_1 | q_1) Σ_{q_2} P(q_2 | q_1) P(y_2 | q_2) ··· Σ_{q_n} P(q_n | q_{n-1}) P(y_n | q_n).

This re-expression requires that multiplication left-distributes over addition, c(x + y) = cx + cy; there are many operator combinations ⊗ and ⊕ with this property besides ordinary multiplication and addition of reals. The algorithms developed below can be generalized by choosing other operators.

It's clear that the key terms of the computation are the products P(q_i | q_{i-1}) P(y_i | q_i), so to simplify notation let

    A^(1)(q_1) ≐ P(q_1) P(y_1 | q_1) = P(q_1, y_1)
    A^(i)(q_{i-1}, q_i) ≐ P(q_i | q_{i-1}) P(y_i | q_i) = P(q_i, y_i | q_{i-1})    for 2 ≤ i ≤ n.

Notice that A^(i) is a function of the observation y_i. Then,

    P(y) = Σ_{q_1} A^(1)(q_1) Σ_{q_2} A^(2)(q_1, q_2) ··· Σ_{q_n} A^(n)(q_{n-1}, q_n).

For any i, the inner expression Σ_{q_{i+1}} A^(i+1)(q_i, q_{i+1}) Σ_{q_{i+2}} ··· depends on q_i but, thanks to the conditional independence assumption of Markov models, on no state q_j for j < i. Hence we can write

    P(y) = Σ_{q_1} A^(1)(q_1) ··· Σ_{q_i} A^(i)(q_{i-1}, q_i) [ Σ_{q_{i+1}} A^(i+1)(q_i, q_{i+1}) ··· Σ_{q_n} A^(n)(q_{n-1}, q_n) ]
         = Σ_{q_1} A^(1)(q_1) ··· Σ_{q_i} A^(i)(q_{i-1}, q_i) β^(i)(q_i),

where β^(i)(q_i) ≐ P(y_{i+1}, ..., y_n | q_i) is the bracketed expression. Computationally, we can calculate the backward probability vector β^(i) once and store it, rather than calculating it once per configuration of q_1, ..., q_{i-1}. The conditioning state q_i can be further elevated by moving its summation to the front and capturing the remaining summations over q_1, ..., q_{i-1} as a forward probability α:

    P(y) = Σ_{q_i} [ Σ_{q_1} A^(1)(q_1) ··· Σ_{q_{i-1}} A^(i-1)(q_{i-2}, q_{i-1}) A^(i)(q_{i-1}, q_i) ] β^(i)(q_i)
         = Σ_{q_i} α^(i)(q_i) β^(i)(q_i)
         = Σ_{q_i} P(y_1, ..., y_i, q_i) P(y_{i+1}, ..., y_n | q_i).

Here we see the power of conditional independence: choosing i ≈ n/2, the calculation of P(y) is reduced to two problems of half the size. Thus, instead of a sum of |Q|^n terms, we have 2 sums of |Q|^{n/2} terms. In fact, α and β can both be computed recursively:

    α^(i)(q_i) = Σ_{q_1} A^(1)(q_1) ··· Σ_{q_{i-1}} A^(i-1)(q_{i-2}, q_{i-1}) A^(i)(q_{i-1}, q_i)
               = Σ_{q_{i-1}} α^(i-1)(q_{i-1}) A^(i)(q_{i-1}, q_i),

where α^(1)(q_1) = A^(1)(q_1), and

    β^(i)(q_i) = Σ_{q_{i+1}} A^(i+1)(q_i, q_{i+1}) ··· Σ_{q_n} A^(n)(q_{n-1}, q_n)
               = Σ_{q_{i+1}} A^(i+1)(q_i, q_{i+1}) β^(i+1)(q_{i+1}),

where β^(n)(q_n) = 1.

5.1 Matrix formulation

Let us express the equations so far using matrix notation, treating α^(i) as a row vector, β^(i) as a column vector, and A^(i)(q_{i-1}, q_i) as a |Q_{i-1}| × |Q_i| matrix A^(i). The matrix A^(1) is a 1 × |Q_1| row and the matrix A^(n+1) is a |Q_n| × 1 column of ones. The forward probability calculation is a sequence of vector-matrix products and the backward probability calculation a sequence of matrix-vector products:

    α^(1) = A^(1)
    α^(i) = α^(i-1) A^(i)          for i = 2, ..., n
    β^(n) = A^(n+1)
    β^(i) = A^(i+1) β^(i+1)        for i = n-1, ..., 1.

The calculation of P(y) can be performed using a vector-vector product of forward and backward probabilities at any point i:

    P(y) = Σ_{q_i} P(y_1, ..., y_i, q_i) P(y_{i+1}, ..., y_n | q_i) = α^(i) β^(i)    for any i = 1, ..., n
         = A^(1) A^(2) ··· A^(i) A^(i+1) ··· A^(n) A^(n+1).

Finally, the marginal conditional probabilities γ^(i)(q_i) ≐ P(q_i | y), also known as the complete probabilities, can be calculated from forward and backward probabilities as a simple scalar product with normalization; this is the famous forward-backward algorithm:

    γ^(i)(q_i) ≐ P(q_i | y) = P(y_1, ..., y_i, q_i) P(y_{i+1}, ..., y_n | q_i) / P(y) = α^(i)(q_i) β^(i)(q_i) / (α^(i) β^(i)).

5.2 Trellis diagram

The trellis diagram below depicts the calculation of P(y) for an HMM with 4 steps and 3 possible state values q_i ∈ {1, 2, 3} at each step: P(y) = A^(1) A^(2) A^(3) A^(4) A^(5). Computing this product of matrices is equivalent to summing over all left-to-right paths through the trellis, each path representing a particular hidden state sequence. The calculation is made efficient by summarizing intermediate results in the forward probability vectors α^(i).

[Figure: trellis over Q_1, ..., Q_4 with edge weights A^(i)_{rq}; forward probability vectors α^(1), ..., α^(4) accumulate left to right, ending in P(y).]

A slight modification to the diagram shows how the forward-backward algorithm efficiently calculates γ^(3)(q_3) = P(q_3 | y) = α^(3)(q_3) β^(3)(q_3) / P(y):

[Figure: the same trellis, with forward vectors α^(1), α^(2), α^(3) accumulated from the left and backward vectors β^(3), β^(4) accumulated from the right, meeting at Q_3.]

5.3 The forward-backward algorithm

Here, explicitly, is the forward-backward algorithm for computing the marginal posterior (complete) probabilities γ(q_i) = P(q_i | y) for every i, q. It computes and stores forward and backward probabilities (given input y) for every position and state value, and from them the complete probabilities, using O(n|Q|^2) time and O(n|Q|) space. More efficient algorithms, discussed further below, consume substantially less space.

    (γ, P(y)) = FORWARD-BACKWARD(y = y_1 ... y_n)
        α := new array, indexed by i, q       # forward probs
        β := new array, indexed by i, q       # backward probs
        γ := new array, indexed by i, q       # complete probs
        # compute forward probabilities
        for i = 1 .. n
            α[i] := new array[|Q_i|], initialized to 0
            for q = 1 .. |Q_i|
                if i == 1
                    α[i][q] := P(Q_1 = q) P(y_1 | Q_1 = q)
                else
                    for r = 1 .. |Q_{i-1}|
                        α[i][q] += α[i-1][r] · P(Q_i = q | Q_{i-1} = r) · P(y_i | Q_i = q)
        # compute P(y)
        p := Σ_{q = 1 .. |Q_n|} α[n][q]
        # compute backward and complete probabilities
        for i = n .. 1
            β[i] := new array[|Q_i|], initialized to 0
            γ[i] := new array[|Q_i|]
            for q = 1 .. |Q_i|
                if i == n
                    β[i][q] := 1
                else
                    for r = 1 .. |Q_{i+1}|
                        β[i][q] += β[i+1][r] · P(Q_{i+1} = r | Q_i = q) · P(y_{i+1} | Q_{i+1} = r)
                γ[i][q] := α[i][q] · β[i][q] / p
        return (γ, p)
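The pseudocode above translates almost line for line into Python. The sketch below assumes a time-homogeneous HMM described by an initial distribution pi0, a transition matrix T[r, q] = P(Q_i = q | Q_{i-1} = r) and an emission matrix E[q, v] = P(Y_i = v | Q_i = q); these parameter names are mine, not the notes'. The inner loops over r are vectorized as the matrix-vector products of Section 5.1.

    # A direct Python transcription of the FORWARD-BACKWARD pseudocode above,
    # assuming a time-homogeneous HMM (pi0, T, E are illustrative names).
    import numpy as np

    def forward_backward(y, pi0, T, E):
        """Return (gamma, p) where gamma[i, q] = P(Q_i = q | y) and p = P(y)."""
        n, S = len(y), len(pi0)
        alpha = np.zeros((n, S))          # forward probs  P(y_1..y_i, q_i)
        beta = np.zeros((n, S))           # backward probs P(y_{i+1}..y_n | q_i)
        alpha[0] = pi0 * E[:, y[0]]
        for i in range(1, n):
            alpha[i] = (alpha[i - 1] @ T) * E[:, y[i]]
        p = alpha[-1].sum()               # P(y)
        beta[-1] = 1.0
        for i in range(n - 2, -1, -1):
            beta[i] = T @ (E[:, y[i + 1]] * beta[i + 1])
        gamma = alpha * beta / p          # complete probabilities
        return gamma, p

    # toy usage with made-up parameters
    pi0 = np.array([0.5, 0.5])
    T = np.array([[0.9, 0.1], [0.2, 0.8]])
    E = np.array([[0.7, 0.3], [0.1, 0.9]])
    gamma, p = forward_backward([0, 1, 1, 0], pi0, T, E)
    print(p, gamma.sum(axis=1))           # each row of gamma sums to 1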
5.4 Associativity and order of operations

The matrix formulation of HMM operations lets us use our understanding of matrix multiplication to gain insight into computation in graphical models. For example, from

    P(y) = A^(1) A^(2) ··· A^(n) A^(n+1)

it is clear that, despite starting from a generative definition of HMMs that proceeds left to right, in fact by the associative nature of matrix multiplication the calculation can be performed in many orders:

    ((((((( A^(1) A^(2) ) A^(3) ) A^(4) ) A^(5) ) A^(6) ) A^(7) ) A^(8) )      left to right
    ( A^(1) ( A^(2) ( A^(3) ( A^(4) ( A^(5) ( A^(6) ( A^(7) A^(8) )))))))      right to left
    (( A^(1) A^(2) ) ( A^(3) A^(4) )) (( A^(5) A^(6) ) ( A^(7) A^(8) ))        divide and conquer

Certain orders may have better computational properties than others: the divide-and-conquer algorithm multiplies |Q| × |Q| matrices, an operation that (using standard algorithms on a sequential computer) takes O(n|Q|^3) total time, whereas the traditional left-to-right calculation of forward probabilities α and right-to-left calculation of backward probabilities β summarize results in vectors and perform only matrix-vector products, taking O(n|Q|^2) time, a substantial savings for large |Q|. However, given the possibility of parallel computation, the divide-and-conquer strategy may be preferable, as the computation depth is reduced from n to log n.

5.5 Variable elimination

For HMMs, multivariate conditional distributions over hidden variables can be calculated efficiently by variable elimination, a general inference technique that computes marginal distributions over subsets of variables by constructing a modified graphical model containing only those variables. Consider the marginal distribution P(Q_i, Q_j, Q_k | y) of the hidden variables Q_i, Q_j, Q_k, where i < j < k are not necessarily consecutive, after conditioning on the observation Y = y. Writing A^(i+1..j) ≐ A^(i+1) ··· A^(j),

    P(q_i, q_j, q_k | y) = P(y, q_i, q_j, q_k) / P(y)
                         = [A^(1) ··· A^(i)]_{q_i} [A^(i+1) ··· A^(j)]_{q_i q_j} [A^(j+1) ··· A^(k)]_{q_j q_k} [A^(k+1) ··· A^(n+1)]_{q_k} / P(y)
                         = α^(i)_{q_i} A^(i+1..j)_{q_i q_j} A^(j+1..k)_{q_j q_k} β^(k)_{q_k} / P(y).

Thus, P(Q_i, Q_j, Q_k | y) can be computed efficiently using tables over state pairs constructed by partitioning A^(1), ..., A^(n+1). Furthermore, since

    P(Q_i, Q_j, Q_k | y) = Φ_{Q_i Q_j}(q_i, q_j) Φ_{Q_j Q_k}(q_j, q_k),

where

    Φ_{Q_i Q_j}(q_i, q_j) = α^(i)_{q_i} A^(i+1..j)_{q_i q_j} / P(y)
    Φ_{Q_j Q_k}(q_j, q_k) = A^(j+1..k)_{q_j q_k} β^(k)_{q_k},

P(Q_i, Q_j, Q_k | y) is an undirected graphical model of the form [figure: Q_i - Q_j - Q_k]. It is also easily verified that

    P(Q_i, Q_j, Q_k | y) = P(Q_i | y) P(Q_j | Q_i, y) P(Q_k | Q_j, y),

where

    P(q_i | y) = α^(i)_{q_i} β^(i)_{q_i} / P(y)
    P(q_j | q_i, y) = α^(i)_{q_i} A^(i+1..j)_{q_i q_j} β^(j)_{q_j} / (α^(i)_{q_i} β^(i)_{q_i}) = A^(i+1..j)_{q_i q_j} β^(j)_{q_j} / β^(i)_{q_i}
    P(q_k | q_j, y) = α^(j)_{q_j} A^(j+1..k)_{q_j q_k} β^(k)_{q_k} / (α^(j)_{q_j} β^(j)_{q_j}) = A^(j+1..k)_{q_j q_k} β^(k)_{q_k} / β^(j)_{q_j},

consistent with the directed graphical model [figure: Q_i → Q_j → Q_k]. It is not true for general graphical models that the marginal distribution over a subset of variables takes such a simple form; the hidden variables of HMMs (and, more generally, of tree graphical models) are a special case.
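The partitioned-product construction above is easy to check numerically. The following Python sketch (invented toy parameters; the helper prod and the particular positions i, j, k are mine) builds P(Q_i, Q_j, Q_k | y) from α^(i), the partial products A^(i+1..j) and A^(j+1..k), and β^(k), and compares it against brute-force enumeration over all hidden configurations.

    # Check of the Section 5.5 construction against brute-force enumeration,
    # using a toy HMM with random parameters (all names are illustrative).
    import itertools
    import numpy as np

    rng = np.random.default_rng(2)
    S, n = 2, 6
    pi0 = rng.dirichlet(np.ones(S))
    T = rng.dirichlet(np.ones(S), size=S)        # T[r, q] = P(q' = q | q = r)
    E = rng.dirichlet(np.ones(3), size=S)        # E[q, v] = P(y = v | q)
    y = rng.integers(3, size=n)

    A = [pi0 * E[:, y[0]]] + [T * E[:, y[t]] for t in range(1, n)]   # A^(1)..A^(n)

    def prod(mats):
        out = mats[0]
        for M in mats[1:]:
            out = out @ M
        return out

    i, j, k = 1, 3, 5                            # 0-based positions, i < j < k
    alpha_i = prod(A[:i + 1])                    # row vector alpha^(i)
    A_ij = prod(A[i + 1:j + 1])                  # A^(i+1..j)
    A_jk = prod(A[j + 1:k + 1])                  # A^(j+1..k)
    beta_k = prod(A[k + 1:] + [np.ones(S)])      # column vector beta^(k)
    Py = prod(A + [np.ones(S)])                  # P(y)

    post = np.einsum('i,ij,jk,k->ijk', alpha_i, A_ij, A_jk, beta_k) / Py

    # brute-force marginalization of P(q, y) over all other hidden states
    check = np.zeros((S, S, S))
    for q in itertools.product(range(S), repeat=n):
        p = pi0[q[0]] * E[q[0], y[0]]
        for t in range(1, n):
            p *= T[q[t - 1], q[t]] * E[q[t], y[t]]
        check[q[i], q[j], q[k]] += p
    print(np.allclose(post, check / check.sum()))   # True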
6 Variations

6.1 The Viterbi algorithm

In smoothing, classification and transduction tasks one frequently desires the most probable configuration q* of the hidden variables given an observation:

    q* = arg max_q P(q | y) = arg max_q P(q, y).

The configuration q* is known as the Viterbi sequence. It can be computed efficiently by replacing the summation in the forward (or backward) calculation with maximization, so that instead of the usual

    α^(i)(q_i) ≐ P(y_1, ..., y_i, q_i) = Σ_{q_1, ..., q_{i-1}} P(y_1, ..., y_i, q_1, ..., q_{i-1}, q_i),

the forward probabilities are instead defined

    α̂^(i)(q_i) ≐ max_{q_1, ..., q_{i-1}} P(y_1, ..., y_i, q_1, ..., q_{i-1}, q_i).

In more detail, let the optimal prefix π^(i)(q_i) for a state value q_i be the sequence q_1, ..., q_{i-1} that maximizes the probability of y_1, ..., y_i. It simplifies bookkeeping to store optimal prefixes and their probabilities together in a pair:

    α̂^(i)(q_i) ≐ ⟨ max_{q_1, ..., q_{i-1}} P(y_1, ..., y_i, q_1, ..., q_{i-1}, q_i),  π^(i)(q_i) ⟩.

It remains possible to express the computation of α̂ using matrix notation if the ⊕ operator maximizes probabilities and the ⊗ operator concatenates state sequences:

    Â^(1)(q_1) = ⟨ P(q_1) P(y_1 | q_1),  (q_1) ⟩
    Â^(i)(q_{i-1}, q_i) = ⟨ P(q_i | q_{i-1}) P(y_i | q_i),  (q_i) ⟩
    ⟨p_1, s_1⟩ ⊗ ⟨p_2, s_2⟩ = ⟨ p_1 p_2,  concat(s_1, s_2) ⟩
    ⟨p_1, s_1⟩ ⊕ ⟨p_2, s_2⟩ = ⟨p_1, s_1⟩ if p_1 ≥ p_2, and ⟨p_2, s_2⟩ otherwise.

From this we have

    ⟨ P(y, q*), q* ⟩ = Â^(1) Â^(2) ··· Â^(n+1).

An implementation of the Viterbi algorithm:

    q* = VITERBI(y = y_1 ... y_n)
        α' := nil      # probabilities for state i-1
        π' := nil      # optimal prefixes for state i-1
        for i = 1 .. n
            α := new array[|Q_i|]      # probabilities for state i
            π := new array[|Q_i|]      # optimal prefixes for state i
            for q = 1 .. |Q_i|
                if i == 1
                    α[q] := P(Q_1 = q) P(y_1 | Q_1 = q)
                    π[q] := (q)
                else
                    α[q] := -∞
                    for r = 1 .. |Q_{i-1}|
                        p := α'[r] · P(Q_i = q | Q_{i-1} = r) · P(y_i | Q_i = q)
                        if p > α[q]
                            α[q] := p
                            π[q] := concat(π'[r], q)
            α' := α      # new becomes old
            π' := π      # new becomes old
        # compute best of final states
        p := -∞; q* := nil
        for q = 1 .. |Q_n|
            if α'[q] > p
                p := α'[q]
                q* := π'[q]
        return q*

6.2 States and output

6.3 Variable horizon

7 Practicalities

7.1 Memory consumption

Both the forward-backward algorithm and the Viterbi algorithm require O(n|Q|) storage space if implemented naively.

7.2 Numerical underflow

7.3 Continuous distributions

7.4 Large state space


Stat 535 lecture 2 notes: Variable elimination

These notes don't cover everything from the class on VE, but should be sufficient reference for homework.

Variable elimination is probably the single most important algorithm to understand for manipulating graphical models, because:

- it solves the most important problem, inference: computing the conditional distribution over one set of variables given assignments to another;
- it works for any directed or undirected graphical model over discrete variables;
- it's easy to understand and re-derive;
- with the right elimination order, it's as efficient as any other method that works for arbitrary probability tables, that is, that relies only on conditional independencies and not the numeric values of the probability distribution.

1 Moralization

VE can be done on either directed or undirected graphs, but it's convenient to consider only the undirected (Markov random field) case, and for problems on directed graphs (Bayes' nets) to convert the BN to an MRF prior to elimination. The conversion process is called moralization, since it involves the "marriage" of a child's parents.

For a directed graphical model, let G be the conditional independence graph and P the probability distribution. Let G_m and P_m be the same after moralization. Moralization exactly maintains the probability distribution: P = P_m. But it may weaken the correspondence between the independence graph and the distribution: although G_m remains an I-map of P, some of the conditional independencies encoded in G may be missing from G_m.

Given the directed graph G, we know that P factors as

    P = Π_{X_i ∈ V} P(X_i | pa(X_i)).

And we know that P_m must factor as a product of potentials over cliques of G_m,

    P_m = Π_C Φ_C.

The correspondence P = P_m is achieved by creating a potential for every family in G:

    Φ_{{X_i} ∪ pa(X_i)} = P(X_i | pa(X_i)).

By the Hammersley-Clifford theorem, G_m is an I-map of P_m if the domain of every potential function is a (not necessarily maximal) clique in G_m, or equivalently, if every family in G is a clique in G_m. Thus, two equivalent methods of constructing the edges for G_m = (V, E_m):

- E_m is the union of cliques over all families of G;
- E_m is E after adding edges between the parents of each variable ("marrying" the parents) to create cliques.

G_m may contain cliques that do not correspond to families of G; there are no corresponding potentials in the factorization of P_m.

Here's an example of moralization:

    P   = P(B) P(C) P(E) P(F) P(G) P(D | E, F, G) P(A | B, C, D)
    P_m = Φ_B Φ_C Φ_E Φ_F Φ_G Φ_DEFG Φ_ABCD

[Figure: the directed graph G and its moral graph G_m.]

Notice that G_m has lost numerous conditional independencies: E ⊥ F, F ⊥ G, B ⊥ C, etc.
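Moralization is easy to implement directly from the "marry the parents" description above. The following Python sketch (the function name and graph representation are mine) builds the moral edge set from a dictionary of parent sets, and applies it to my reading of the example above, where D has parents {E, F, G} and A has parents {B, C, D}.

    # An illustrative sketch of moralization: given parent sets for a directed
    # graphical model, build the undirected moral graph by connecting each node
    # to its parents and marrying the parents.
    from itertools import combinations

    def moralize(parents):
        """parents: dict node -> set of parent nodes.
        Returns the set of undirected edges (as frozensets) of the moral graph."""
        edges = set()
        for child, pa in parents.items():
            for p in pa:                         # keep the original edges, undirected
                edges.add(frozenset((child, p)))
            for u, v in combinations(pa, 2):     # marry the parents
                edges.add(frozenset((u, v)))
        return edges

    # the example from Section 1 (as reconstructed above)
    parents = {'A': {'B', 'C', 'D'}, 'B': set(), 'C': set(),
               'D': {'E', 'F', 'G'}, 'E': set(), 'F': set(), 'G': set()}
    for e in sorted(map(sorted, moralize(parents))):
        print('-'.join(e))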
2 Elimination Algorithm

Variable elimination works by eliminating variables from an undirected graphical model one by one, until only the variables of interest are left. As each unobserved variable W is eliminated, the set of potentials involving W is replaced by a single new potential Φ_N, constructed by marginalizing over W. Graphically, this corresponds to eliminating the variable and its edges, and fully connecting its neighbors (since, with the elimination of W, they are generally no longer conditionally independent).

The input to variable elimination:

- V, a set of variables;
- Φ = {φ_i}, a set of potential functions over V;
- O = {(X_j, x_j)}, observations of a subset of V (possibly empty);
- π, the elimination order, an ordered subset of V; for simplicity, π is assumed to include the observation variables O.

The output:

- Φ_Z, a table over Z = V − π, the set of variables not eliminated. Each entry Φ_Z(z) is the joint probability of Z = z and the observations, marginalized over the eliminated variables.

    Φ_Z = ELIMINATE(V, Φ, π = (V_1, ..., V_m), O = {(X_j, x_j)})
        for W in π
            Ψ := {φ ∈ Φ : W ∈ domain(φ)}            # potentials involving W
            if (W, w) ∈ O                           # W is observed: create new potentials by fixing W = w
                for φ ∈ Ψ
                    N := domain(φ) − {W}
                    φ' := new potential over variables N
                    for each config n in N
                        φ'(n) := φ(n, w)
                    Φ := Φ ∪ {φ'}
            else                                    # compute new potential by marginalizing over W
                N := ∪_{φ ∈ Ψ} domain(φ) − {W}      # neighbors of W
                φ' := new potential over variables N
                for each config n in N
                    φ'(n) := Σ_{w ∈ W} Π_{φ ∈ Ψ} φ(n, w)
                Φ := Φ ∪ {φ'}
            # eliminate W and its potentials
            V := V − {W}
            Φ := Φ − Ψ
        # compute table over remaining variables
        Φ_Z := new potential over the remaining variables V
        for each config z in V
            Φ_Z(z) := Π_{φ ∈ Φ} φ(z)
        return Φ_Z

3 Example

Consider the factorization P = P(A) P(B | A) P(C | A) P(D | A, B) P(E | B, C) P(F | E), with directed graph [figure]. Suppose the goal is to compute P(C, D | F = 1); this is accomplished by first computing the joint P(C, D, F = 1) using VE, by eliminating A, B, E and F. After choosing an elimination order (here, arbitrarily, F, E, A, B), VE gives a table over the remaining variables,

    Φ_CD = ELIMINATE({A, B, C, D, E, F}, {Φ_A, Φ_AB, Φ_AC, Φ_ABD, Φ_BCE, Φ_EF}, (F, E, A, B), {(F, 1)}),

and then

    P(C = c, D = d | F = 1) = Φ_CD(c, d) / Σ_{c', d'} Φ_CD(c', d').

The intermediate stages of variable elimination are as follows (the accompanying "new graph" figures are omitted):

    Var   Neigh       New potential                                                           New factorization
    -     -           -                                                                       Φ_A Φ_AB Φ_AC Φ_ABD Φ_BCE Φ_EF
    F     {E}         Φ'_E(e) = Φ_EF(e, 1)                                                    Φ_A Φ_AB Φ_AC Φ_ABD Φ_BCE Φ'_E
    E     {B, C}      Φ'_BC(b, c) = Σ_e Φ_BCE(b, c, e) Φ'_E(e)                                Φ_A Φ_AB Φ_AC Φ_ABD Φ'_BC
    A     {B, C, D}   Φ'_BCD(b, c, d) = Σ_a Φ_A(a) Φ_AB(a, b) Φ_AC(a, c) Φ_ABD(a, b, d)       Φ'_BC Φ'_BCD
    B     {C, D}      Φ'_CD(c, d) = Σ_b Φ'_BC(b, c) Φ'_BCD(b, c, d)                           Φ'_CD
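The ELIMINATE procedure above can be transcribed fairly directly into Python. The sketch below represents a potential as a pair (variables, table), treats all variables as binary, and exercises the code on the Section 3 example using arbitrary random tables in place of real conditional probability tables; the function names and the table representation are mine, not the notes'.

    # A compact Python sketch of ELIMINATE.  A potential is (vars, table), where
    # vars is a tuple of variable names and table maps assignment tuples to values.
    from itertools import product

    def restrict(pot, var, val):
        """Fix var = val in a potential and drop var from its domain."""
        vars_, table = pot
        i = vars_.index(var)
        new_vars = vars_[:i] + vars_[i + 1:]
        new_table = {k[:i] + k[i + 1:]: v for k, v in table.items() if k[i] == val}
        return (new_vars, new_table)

    def multiply_and_sum_out(pots, var, domain=(0, 1)):
        """Multiply all potentials in pots and marginalize out var."""
        all_vars = sorted({v for vars_, _ in pots for v in vars_} - {var})
        table = {}
        for assign in product(domain, repeat=len(all_vars)):
            env = dict(zip(all_vars, assign))
            total = 0.0
            for w in domain:
                env[var] = w
                p = 1.0
                for vars_, t in pots:
                    p *= t[tuple(env[v] for v in vars_)]
                total += p
            table[assign] = total
        return (tuple(all_vars), table)

    def eliminate(potentials, order, observations):
        pots = list(potentials)
        for var in order:
            involved = [p for p in pots if var in p[0]]
            rest = [p for p in pots if var not in p[0]]
            if var in observations:                   # observed: fix its value
                pots = rest + [restrict(p, var, observations[var]) for p in involved]
            else:                                     # unobserved: marginalize it out
                pots = rest + [multiply_and_sum_out(involved, var)]
        # multiply the remaining potentials into one table over the remaining variables
        remaining = sorted({v for vars_, _ in pots for v in vars_})
        table = {}
        for assign in product((0, 1), repeat=len(remaining)):
            env = dict(zip(remaining, assign))
            p = 1.0
            for vars_, t in pots:
                p *= t[tuple(env[v] for v in vars_)]
            table[assign] = p
        return tuple(remaining), table

    # usage on the Section 3 example, with arbitrary tables standing in for CPTs
    import random
    random.seed(0)
    def table_over(names):                 # random table over the given variables
        vars_ = tuple(names)
        return (vars_, {k: random.random() for k in product((0, 1), repeat=len(vars_))})

    pots = [table_over('A'), table_over('AB'), table_over('AC'),
            table_over('ABD'), table_over('BCE'), table_over('EF')]
    vars_, table = eliminate(pots, order='FEAB', observations={'F': 1})
    Z = sum(table.values())
    print(vars_, {k: v / Z for k, v in table.items()})   # normalized table over (C, D)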
4 Computational complexity and induced width

The computational work done in VE is primarily the calculation of new potentials. A potential over k binary variables is a table of 2^k values, each computed by summing a product of potentials over an eliminated variable. With proper implementation, the cost of computing the potential product is a constant, rather than growing linearly with the number of potentials (Q: why?), so the total computational cost of computing a new potential over k variables is O(2^k). If c is the maximum number of variables in any potential under a particular elimination order π, it is clear that the time complexity of variable elimination is O(|V| 2^c). However, finding the elimination order that minimizes c is an NP-complete problem.

Define the induced width of G under π, w(π), to be the maximum size of the neighbor set N in the execution of ELIMINATE for an ordering π over all the variables in the graph (total elimination). Define the elimination width to be the minimum of w(π) over all elimination orders. It is not hard to see that the elimination width of an unconnected graph is 0, of any tree is 1 (achieving this requires the right choice of elimination order), and of a loop is 2. There are a variety of different ways of defining graph complexity that are all equivalent: a graph's elimination width can be shown to be equal to its tree width, a complexity measure related to the decomposition of the graph into a tree of cliques that we will explore as part of the junction tree algorithm.
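The induced width w(π) of a particular ordering is easy to compute by simulating only the graph side of ELIMINATE: record the neighbor-set size, connect the neighbors, and delete the variable. The following Python sketch (the function name and the small examples are mine) does this and reproduces the widths claimed above for a loop and for a tree.

    # Compute the induced width of an undirected graph under an elimination order
    # by simulating the graph-side effect of ELIMINATE.
    def induced_width(edges, order):
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        width = 0
        for v in order:
            nbrs = adj.pop(v, set())
            width = max(width, len(nbrs))
            for u in nbrs:
                adj[u].discard(v)
                adj[u] |= nbrs - {u}          # fully connect the neighbors
        return width

    # a 4-cycle: any elimination order gives induced width 2, as claimed for loops
    cycle = [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'A')]
    print(induced_width(cycle, 'ABCD'))       # 2
    # a chain (a tree): the right order gives width 1
    chain = [('A', 'B'), ('B', 'C'), ('C', 'D')]
    print(induced_width(chain, 'ABCD'))       # 1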

