Stat 535 lecture 2 notes: hidden Markov models (HMMs)

1 Mixture models

The simplest interesting graphical model is the mixture model, with a state variable Q and an output variable Y governed by P(Q, Y) = P(Q) P(Y | Q). This can be viewed as a generative model that generates samples independently of each other by the following process:

1. for sample i, choose state q_i according to P(Q);
2. then choose output y_i according to P(y_i | q_i).

In typical applications the state q_i is hidden or latent, meaning that it is not directly observable but can only be inferred from y_i. If Q \in \{1, ..., n\}, Y \in R^d and P(y | q) ~ N(\mu_q, \Sigma_q), then P(y) is an n-component Gaussian mixture with mixture weights P(q), with pdf

    P(y) = \sum_q P(q) (2\pi)^{-d/2} |\Sigma_q|^{-1/2} \exp(-\tfrac{1}{2} (y - \mu_q)^T \Sigma_q^{-1} (y - \mu_q)).

For example, an equally weighted 3-component mixture in R^2:

[Figure: probability density; labeled samples; observed samples]

Mixture models are often used for classification tasks. In training, labeled pairs (q_i, y_i) are used to estimate the distributions P(q) and P(y | q). Then, given an unlabeled point y, one can calculate a distribution over the hidden label state using Bayes' rule [1]:

    P(q | y) = P(y | q) P(q) / P(y) \propto P(y | q) P(q).

[1] This is playing fast and loose with notation: for discrete Y, P(y | q) and P(y) are probabilities; for continuous Y, P(y | q) and P(y) are probability densities, but the formula still holds.

For example, one might use a mixture model to classify small n x n binary images of handwritten characters into labels {A, ..., Z, 0, ..., 9}. Although far from optimal, a simple parameterization is P(y = (y_1, ..., y_{n x n}) | q) = \prod_i P(y_i | q); this naive Bayes model is equivalent to the corresponding graphical model. Why is this model naive?

2 Hidden Markov models

The hidden Markov model is a generalization of the mixture model that allows for dependence between the states of samples. It is typically used when samples have a natural order in time or space. The generative process is:

1. for sample 1, choose state q_1 according to P(Q_1); for subsequent samples, choose q_i from the distribution P(Q_i | q_{i-1});
2. then choose output y_i from the distribution P(y_i | q_i).

Vitally, the probability of state q_i depends on q_{i-1} but not on earlier states: for all i < j < k, Q_i is independent of Q_k given Q_j.

In the example of handwritten character recognition, the hidden state dependence can be used to model structure in character sequences, for example the tendency for vowels to follow consonants and vice versa.

Consider again the 3-component Gaussian mixture, assuming that the state progression is governed by

    P(Q_{i+1} = q | q_i) = 0.8 if q = q_i, and 0.1 otherwise.

In that case the marginal probability distribution is the same as before (all states are equally likely), but consecutive samples are highly correlated.

[Figure: labeled sample sequence]

Using knowledge of the state dynamics and the output history y_1, ..., y_i, it is possible to dramatically improve our predictive power. We will see how to calculate P(q_{i+1} | y_1, ..., y_i), and from that P(y_{i+1} | y_1, ..., y_i).

[Figure: P(y_{i+1}) versus P(y_{i+1} | y_1, ..., y_i); last 5 samples shown]

3 Operations

HMMs are used in many different ways:

- Smoothing. If Q represents a hidden true signal that is corrupted by independent noise to produce an observed signal Y, then the distribution P(Q | y) can be used to de-noise or "smooth" the observation. A simple model of noise is P(y | q) ~ N(q, \sigma).

- Prediction. Given a sequence of observations y_1, ..., y_t, predict y_{t+1} using P(y_{t+1} | y_1, ..., y_t) = \sum_{q_{t+1}} P(q_{t+1} | y_1, ..., y_t) P(y_{t+1} | q_{t+1}). Such models are commonly used in target tracking, where Q represents a true position and Y noisy measurements of it. In some text entry devices, predictive probabilities are used to alter the size or order of keys to improve entry speed and accuracy.

- Classification/labeling. Y represents a sequential input and Q a desired classification or labeling of its parts. For example, in an information retrieval application, y_i may be a word from a document y_1, ..., y_n and q_i a classification of the word into one of {NameStart, NameMiddle, NameEnd, Other}. Model estimation might take place from labeled examples; extraction of names from documents may be performed by calculating argmax_q P(q | y).

- Transduction. TBD

- Analysis. TBD
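To make the generative process of Section 2 concrete, here is a minimal sketch (in Python with NumPy, not part of the original notes) that samples a state sequence and Gaussian outputs using the 0.8/0.1 transition rule above; the particular means and covariance are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 3
# Transition rule from Section 2: stay in the same state with probability 0.8,
# move to either other state with probability 0.1.
P_trans = np.full((n_states, n_states), 0.1)
np.fill_diagonal(P_trans, 0.8)
P_init = np.full(n_states, 1.0 / n_states)                # equally weighted components

# Hypothetical Gaussian emission parameters (means in R^2, shared identity covariance).
means = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
cov = np.eye(2)

def sample_hmm(n):
    """Sample (q_1..q_n, y_1..y_n) following the HMM generative process."""
    q = np.empty(n, dtype=int)
    y = np.empty((n, 2))
    q[0] = rng.choice(n_states, p=P_init)                 # q_1 ~ P(Q_1)
    y[0] = rng.multivariate_normal(means[q[0]], cov)      # y_1 ~ P(Y | q_1)
    for i in range(1, n):
        q[i] = rng.choice(n_states, p=P_trans[q[i - 1]])  # q_i ~ P(Q_i | q_{i-1})
        y[i] = rng.multivariate_normal(means[q[i]], cov)  # y_i ~ P(Y | q_i)
    return q, y

q, y = sample_hmm(10)
print(q)   # runs of repeated states: consecutive samples are highly correlated
```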
4 Conditional independence in HMMs

    P(q, y) = P(q_1, ..., q_n, y_1, ..., y_n)
            = P(q_1) P(q_2 | q_1) P(q_3 | q_1, q_2) \cdots P(q_n | q_1, ..., q_{n-1})
              \cdot P(y_1 | q_1, ..., q_n) P(y_2 | q_1, ..., q_n, y_1) \cdots P(y_n | q_1, ..., q_n, y_1, ..., y_{n-1}).

Using the conditional independencies implied by the HMM graphical model,

    P(q_i | q_1, ..., q_{i-1}) = P(q_i | q_{i-1})   and   P(y_i | q_1, ..., q_n, y_1, ..., y_{i-1}) = P(y_i | q_i).

Therefore

    P(q, y) = P(q_1) \prod_{i=2..n} P(q_i | q_{i-1}) \prod_{i=1..n} P(y_i | q_i).

Rewriting pairwise conditional probabilities into pairwise joint probabilities, notice

    P(q, y) = \frac{\prod_{i=2..n} P(q_{i-1}, q_i) \prod_{i=1..n} P(q_i, y_i)}{\prod_{i=1..n} P(q_i)^{\delta(Q_i) - 1}},

where \delta(Q_i) is the degree of variable Q_i in the graphical model. Simply relabeling the terms gives

    P(q, y) = \frac{\prod_{i=2..n} D_{Q_{i-1} Q_i}(q_{i-1}, q_i) \prod_{i=1..n} D_{Q_i Y_i}(q_i, y_i)}{\prod_{i=1..n} D_{Q_i}(q_i)^{\delta(Q_i) - 1}},

an expression that shows directed HMMs are equivalent to undirected graphical models. Thus, all of the following can express the same distributions.

[Figure: equivalent directed and undirected graphical models]

In fact, we can generalize the expression of the joint probability one step further to

    P(x) = \frac{\prod_{(s,t) \in E} D_{st}(s, t)}{\prod_{v \in V} D_v(v)^{\delta(v) - 1}},

where V are the vertices in the graphical model and E are the edges. This expression holds for any acyclic undirected graphical model (tree model), and for directed graphical models where no node has more than one parent.

5 Inference

There are many ways to derive the formulas and algorithms for HMM inference. The derivation here emphasizes different things than those in Jordan's book chapter; please read both.

Let us calculate P(y) = \sum_q P(q, y), a computation that at first blush appears to require a summation over an exponential number of hidden state configurations.

    P(y) = \sum_q P(q, y)
         = \sum_q P(q_1) \prod_{i=2..n} P(q_i | q_{i-1}) \prod_{i=1..n} P(y_i | q_i)
         = \sum_{q_1} P(q_1) P(y_1 | q_1) \sum_{q_2, ..., q_n} \prod_{i=2..n} P(q_i | q_{i-1}) P(y_i | q_i)
         = \sum_{q_1} P(q_1) P(y_1 | q_1) \sum_{q_2} P(q_2 | q_1) P(y_2 | q_2) \sum_{q_3, ..., q_n} \prod_{i=3..n} P(q_i | q_{i-1}) P(y_i | q_i)
         = \sum_{q_1} P(q_1) P(y_1 | q_1) \sum_{q_2} P(q_2 | q_1) P(y_2 | q_2) \cdots \sum_{q_n} P(q_n | q_{n-1}) P(y_n | q_n).

This re-expression requires that multiplication left-distributes over addition, c(x + y) = cx + cy; there are many operator combinations ⊗ and ⊕ with this property, and we'll consider others in the future.

It is clear that the key terms of the computation are the products P(q_i | q_{i-1}) P(y_i | q_i), so to simplify notation let

    A_1^y(q_1) = P(q_1, y_1) = P(q_1) P(y_1 | q_1)
    A_i^y(q_{i-1}, q_i) = P(q_i, y_i | q_{i-1}) = P(q_i | q_{i-1}) P(y_i | q_i).

Then

    P(y) = \sum_{q_1} A_1^y(q_1) \sum_{q_2} A_2^y(q_1, q_2) \cdots \sum_{q_n} A_n^y(q_{n-1}, q_n).

For any i, the inner expression \sum_{q_{i+1}} A_{i+1}^y(q_i, q_{i+1}) \sum_{q_{i+2}} \cdots depends on q_i but, thanks to the conditional independence assumption of Markov models, on no state q_j for j < i. Hence we can write

    P(y) = \sum_{q_1} A_1^y(q_1) \sum_{q_2} A_2^y(q_1, q_2) \cdots \sum_{q_i} A_i^y(q_{i-1}, q_i) \beta_i^y(q_i),

where

    \beta_i^y(q_i) \equiv P(y_{i+1}, ..., y_n | q_i) = \sum_{q_{i+1}} A_{i+1}^y(q_i, q_{i+1}) \beta_{i+1}^y(q_{i+1}).

Computationally, we can compute the backward probability vector \beta_i^y once, rather than once per configuration of q_1, ..., q_{i-1}. The conditioning state q_i can be further elevated by moving its summation to the front and expressing the remaining summations as a forward probability \alpha:

    P(y) = \sum_{q_i} [ \sum_{q_1} A_1^y(q_1) \cdots \sum_{q_{i-1}} A_{i-1}^y(q_{i-2}, q_{i-1}) A_i^y(q_{i-1}, q_i) ] \beta_i^y(q_i)
         = \sum_{q_i} \alpha_i^y(q_i) \beta_i^y(q_i)
         = \sum_{q_i} P(y_1, ..., y_i, q_i) P(y_{i+1}, ..., y_n | q_i)
         = \sum_{q_i} P(y, q_i),

where \alpha_i^y(q_i) \equiv P(y_1, ..., y_i, q_i).

Here we see the power of conditional independence: choosing i near n/2, the calculation of P(y) is reduced to two problems of half the size. Thus, instead of a sum of |Q|^n terms, we have 2 sums of |Q|^{n/2} terms. In fact, \alpha_i^y and \beta_i^y can both be computed recursively:

    \alpha_i^y(q_i) = \sum_{q_1} A_1^y(q_1) \cdots \sum_{q_{i-1}} A_{i-1}^y(q_{i-2}, q_{i-1}) A_i^y(q_{i-1}, q_i)
                    = \sum_{q_{i-1}} \alpha_{i-1}^y(q_{i-1}) A_i^y(q_{i-1}, q_i),

where \alpha_1^y(q_1) = A_1^y(q_1), and

    \beta_i^y(q_i) = \sum_{q_{i+1}} A_{i+1}^y(q_i, q_{i+1}) \beta_{i+1}^y(q_{i+1}),

where \beta_n^y(q_n) = 1.
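A minimal numeric sketch of the \alpha and \beta recursions, assuming a small hypothetical HMM with discrete outputs (a made-up emission table rather than the Gaussian example, to keep the code short); it checks that \sum_{q_i} \alpha_i(q_i) \beta_i(q_i) yields the same P(y) at every position i, as the derivation claims.

```python
import numpy as np

# Hypothetical model: 3 hidden states, 2 discrete output symbols.
P_init = np.array([1/3, 1/3, 1/3])
P_trans = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])       # P(q_i | q_{i-1})
P_emit = np.array([[0.9, 0.1],
                   [0.5, 0.5],
                   [0.1, 0.9]])             # P(y_i | q_i)

y = [0, 0, 1, 1, 0]                         # an observed output sequence
n = len(y)

# Forward: alpha_i(q_i) = P(y_1..y_i, q_i).
alpha = np.zeros((n, 3))
alpha[0] = P_init * P_emit[:, y[0]]                       # A_1(q_1) = P(q_1) P(y_1|q_1)
for i in range(1, n):
    alpha[i] = (alpha[i - 1] @ P_trans) * P_emit[:, y[i]]

# Backward: beta_i(q_i) = P(y_{i+1}..y_n | q_i), with beta_n = 1.
beta = np.zeros((n, 3))
beta[n - 1] = 1.0
for i in range(n - 2, -1, -1):
    beta[i] = P_trans @ (P_emit[:, y[i + 1]] * beta[i + 1])

# P(y) = sum_{q_i} alpha_i(q_i) beta_i(q_i), for any position i.
print([float(alpha[i] @ beta[i]) for i in range(n)])      # all entries equal
# Posterior marginals P(q_i | y) = alpha_i(q_i) beta_i(q_i) / P(y).
posterior = alpha * beta / alpha[-1].sum()
```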
Let us express the equations so far using matrix notation: treating \alpha_i^y(q_i) as a row vector \alpha_i^y, \beta_i^y(q_i) as a column vector \beta_i^y, and A_i^y(q_{i-1}, q_i) as a |Q_{i-1}| x |Q_i| matrix A_i^y, we have

    \alpha_1^y = A_1^y
    \alpha_i^y = \alpha_{i-1}^y A_i^y
    \beta_n^y = 1
    \beta_i^y = A_{i+1}^y \beta_{i+1}^y
    P(y) = \alpha_i^y \beta_i^y = A_1^y A_2^y \cdots A_n^y 1

and

    \alpha_i^y(q_i) \equiv P(y_1, ..., y_i, q_i)
    \beta_i^y(q_i) \equiv P(y_{i+1}, ..., y_n | q_i)
    \alpha_i^y(q_i) \beta_i^y(q_i) = P(y, q_i)
    \alpha_i^y(q_i) \beta_i^y(q_i) / (\alpha_i^y \beta_i^y) = P(q_i | y).

These expressions let us use the properties of matrix multiplication to understand computation in this graphical model. For example, from (taking n = 8)

    P(y) = A_1^y A_2^y A_3^y A_4^y A_5^y A_6^y A_7^y A_8^y 1

it is clear that despite starting from a generative definition of HMMs that proceeds left to right, in fact by the associative nature of matrix multiplication the calculation can be performed in many orders:

    ((((((((A_1^y A_2^y) A_3^y) A_4^y) A_5^y) A_6^y) A_7^y) A_8^y) 1)      (left to right)
    A_1^y (A_2^y (A_3^y (A_4^y (A_5^y (A_6^y (A_7^y (A_8^y 1)))))))        (right to left)
    (((A_1^y A_2^y)(A_3^y A_4^y)) ((A_5^y A_6^y)(A_7^y A_8^y))) 1          (divide and conquer)

Certain orders may have better computational properties than others: the divide-and-conquer algorithm performs matrix multiplies on |Q| x |Q| matrices, an algorithm that on a sequential computer takes O(n |Q|^3) total time, whereas the traditional left-to-right calculation of forward probabilities \alpha and right-to-left calculation of backward probabilities \beta summarize results in vectors and perform only matrix-vector multiplies, taking only O(n |Q|^2) time, a substantial savings for large |Q|. Given the possibility of parallel computation, the divide-and-conquer strategy may be preferable, as the computation depth is reduced from n to log n.

Associativity also makes clear how multivariate conditional distributions can be calculated efficiently by variable elimination, a general technique used to collapse parts of a graphical model that are irrelevant for a particular inference:

    P(q_i, q_j, q_k | y) = P(y, q_i, q_j, q_k) / P(y),

where, for i < j < k,

    P(y, q_i, q_j, q_k) = [A_1^y \cdots A_i^y]_{q_i} [A_{i+1}^y \cdots A_j^y]_{q_i, q_j} [A_{j+1}^y \cdots A_k^y]_{q_j, q_k} [A_{k+1}^y \cdots A_n^y 1]_{q_k},

with each bracketed product indexed by the indicated entries.
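A quick sketch of the ordering argument, reusing the hypothetical discrete HMM from the previous block: it builds the matrices A_i^y and evaluates A_1^y A_2^y \cdots A_n^y 1 both left to right with matrix-vector products and by a recursive divide-and-conquer matrix product; the two orderings agree on P(y).

```python
import numpy as np

P_init = np.array([1/3, 1/3, 1/3])
P_trans = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])
P_emit = np.array([[0.9, 0.1],
                   [0.5, 0.5],
                   [0.1, 0.9]])
y = [0, 0, 1, 1, 0]

# A_1^y is a row vector; for i >= 2, A_i^y(q_{i-1}, q_i) = P(q_i|q_{i-1}) P(y_i|q_i).
A = [P_init * P_emit[:, y[0]]] + [P_trans * P_emit[:, yi] for yi in y[1:]]
ones = np.ones(3)

# Left to right: only matrix-vector products, O(n |Q|^2) total work.
alpha = A[0]
for Ai in A[1:]:
    alpha = alpha @ Ai
p_left = alpha @ ones

# Divide and conquer: matrix-matrix products, O(n |Q|^3) work but O(log n) depth.
def dandc(mats):
    if len(mats) == 1:
        return mats[0]
    mid = len(mats) // 2
    return dandc(mats[:mid]) @ dandc(mats[mid:])

p_dandc = A[0] @ dandc(A[1:]) @ ones
print(p_left, p_dandc)   # identical up to floating point
```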
Stat 535 lecture 10 notes: beyond posterior probabilities

We could summarize the class so far as:

- Conditional independence, as expressed by a graph G, lets us factor P(X_1, ..., X_n) into a product of potentials \Phi_{C_i}(x_{C_i}), where the C_i are cliques of G:

      R - S - T   <=>   R \perp T | S   <=>   P(R, S, T) = \Phi_{RS}(r, s) \Phi_{ST}(s, t).

- The distributive law lets us build efficient algorithms for marginalizing joint probabilities:

      P(r) = \sum_{s,t} P(r, s, t) = \sum_{s,t} \Phi_{RS}(r, s) \Phi_{ST}(s, t) = \sum_s \Phi_{RS}(r, s) ( \sum_t \Phi_{ST}(s, t) ).

- We solve the inference problem P(X_A | X_B = \bar{x}_B) by entering the evidence \bar{x}_B into the potential functions and marginalizing (eliminating) all variables not in X_A or X_B.

It is interesting at this point to consider two questions:

- Are there ways of generalizing the methods we've developed to solve problems outside the domain of probability?
- Are there probabilistic operations beyond inference that we can perform efficiently on graphical models?

1 Functions over configurations

Many optimization and counting (combinatoric) problems can be cast in the following form:

- There is a configuration space X over a finite set of variables, X = X_1 x \cdots x X_n.
- Constraints between variables are expressed using logical or arithmetic functions over subsets of the variables, for example X_1 \vee X_2, or X_1 + X_2 + X_3 >= 1.
- One or more value functions f are defined over X that decompose into the sum or product of functions over subsets of the variables; we will use ⊗ to denote the value composition operator: f(x) = ⊗_c f_c(x_c).
- One or more aggregation functions (such as sum, min, max, list) combine the values f(x) across the space X, or a subset of variables, to create a summary Z. We will use ⊕ to denote the aggregation function: Z = ⊕_x f(x).

Probability calculations over graphical models can be cast in this framework, where the value function is P(X) = \prod_c \Phi_c (thus ⊗ = x) and the aggregation function is ⊕ = sum, or ⊕ = max if the goal is to find the maximum a posteriori (MAP) probability. Hence we can hope to generalize algorithms for manipulating graphical models to solve problems from these other fields, typically by introducing different operators ⊗ and ⊕ that have the same algebraic properties as x and +.

1.1 Algebra

In our derivations of the FB, VE, JT and SP algorithms, we relied on the fact that x and + over R form a semi-ring. A semi-ring is a set R and two operators ⊗ and ⊕ such that:

- (R, ⊕) is a commutative monoid with identity 0:
      (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c),    a ⊕ b = b ⊕ a,    a ⊕ 0 = a.
- (R, ⊗) is a monoid with identity 1:
      (a ⊗ b) ⊗ c = a ⊗ (b ⊗ c),    a ⊗ 1 = 1 ⊗ a = a.
- ⊗ distributes over ⊕:
      a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c).

The distributive property is what lets us write

    \sum_{s,t} \Phi_{RS}(r, s) \Phi_{ST}(s, t) = \sum_s \Phi_{RS}(r, s) ( \sum_t \Phi_{ST}(s, t) ).

In some algorithms we also relied on ⊗ being commutative (for example, in the VE algorithm, which allows us to eliminate in any order); a commutative semi-ring adds the axiom a ⊗ b = b ⊗ a.

Two other very useful aggregation functions that form a commutative semi-ring with ⊗ = x and with ⊗ = +, so long as values are non-negative, are ⊕ = min and ⊕ = max. The simple substitution of max for sum in all our algorithms lets us find the probability of the most probable configuration, as per the Viterbi algorithm. In many problems it is more convenient to assume the value composition operator is + rather than x, for example when dealing with costs or energies rather than probabilities, or when manipulating log probabilities to avoid underflow issues. Again, min and max distribute over +.

Frequently one wishes to consider multiple values at once; clearly (using elementwise operations) semi-rings over values can be extended to semi-rings over vectors of values.
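To illustrate Section 1.1, here is a small sketch (my own, under the same hypothetical HMM as before) of a chain elimination routine parameterized by the two semi-ring operators: with (⊕, ⊗) = (sum, x) it recovers P(y), with (max, x) it gives the probability of the most probable state sequence as in the Viterbi algorithm, and with (min, +) on negative-log potentials it gives the corresponding minimal cost.

```python
import numpy as np

P_init = np.array([1/3, 1/3, 1/3])
P_trans = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])
P_emit = np.array([[0.9, 0.1],
                   [0.5, 0.5],
                   [0.1, 0.9]])
y = [0, 0, 1, 1, 0]

def eliminate_chain(A1, As, agg, compose):
    """Compute the aggregate over q_1..q_n of A_1(q_1) (x) A_2(q_1,q_2) (x) ... (x) A_n(q_{n-1},q_n),
    eliminating states right to left using distributivity of (x) over (+)."""
    beta = agg(As[-1], axis=1)                              # eliminate q_n
    for A in reversed(As[:-1]):
        beta = agg(compose(A, beta[None, :]), axis=1)       # eliminate q_{i+1}
    return agg(compose(A1, beta))                           # eliminate q_1

A1 = P_init * P_emit[:, y[0]]
As = [P_trans * P_emit[:, yi] for yi in y[1:]]

# (sum, *): P(y), exactly the backward recursion of lecture 2.
p_y = eliminate_chain(A1, As, np.sum, np.multiply)
# (max, *): probability of the most probable state sequence, as in Viterbi.
p_map = eliminate_chain(A1, As, np.max, np.multiply)
# (min, +) over costs -log(potential): the minimal cost is -log of the Viterbi probability.
C1, Cs = -np.log(A1), [-np.log(A) for A in As]
min_cost = eliminate_chain(C1, Cs, np.min, np.add)

print(p_y, p_map, np.exp(-min_cost))   # the last two agree
```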
2 Total enumeration

Suppose we wish to enumerate the complete set of configurations, each annotated with its value as defined by P = \prod_c \Phi_c(x_c) (or equivalently, f = ⊗_c f_c(x_c)). Consider the following operators and potential functions, defined over sets of pairs ⟨v, c⟩ where v is a partial value and c is a partial configuration:

    \Psi_C(x_C) = { ⟨\Phi_C(x_C), x_C⟩ }
    A ⊗ B = { ⟨v_a v_b, c_a c_b⟩ : ⟨v_a, c_a⟩ \in A, ⟨v_b, c_b⟩ \in B }
    A ⊕ B = A \cup B

Then Z = ⊕_x ⊗_C \Psi_C is the complete set of valued configurations. Solving using VE for potentials (\Phi_A, \Phi_{AB}, \Phi_B) over two binary variables, one starts from

    \Psi_A = { ⟨\Phi_A(0), A=0⟩, ⟨\Phi_A(1), A=1⟩ }
    \Psi_B = { ⟨\Phi_B(0), B=0⟩, ⟨\Phi_B(1), B=1⟩ }
    \Psi_{AB} = { ⟨\Phi_{AB}(0,0), A=0 B=0⟩, ⟨\Phi_{AB}(0,1), A=0 B=1⟩, ⟨\Phi_{AB}(1,0), A=1 B=0⟩, ⟨\Phi_{AB}(1,1), A=1 B=1⟩ }

and elimination of A and then B gives

    Z = { ⟨\Phi_A(0) \Phi_B(0) \Phi_{AB}(0,0), A=0 B=0⟩,
          ⟨\Phi_A(0) \Phi_B(1) \Phi_{AB}(0,1), A=0 B=1⟩,
          ⟨\Phi_A(1) \Phi_B(0) \Phi_{AB}(1,0), A=1 B=0⟩,
          ⟨\Phi_A(1) \Phi_B(1) \Phi_{AB}(1,1), A=1 B=1⟩ },

though of course the output is of exponential size and takes exponential time to compute.

3 Example: paths in directed graphs

Consider problems related to paths in a directed, acyclic graph G = (V, E) with initial vertex S and terminal vertex T. The space of valid paths can be expressed by introducing binary variables X_v for every vertex v \in V and X_{u,v} for every edge (u,v) \in E, and constraints:

    X_S = 1                                        (initial vertex is in the path)
    X_T = 1                                        (terminal vertex is in the path)
    for all v != S:  X_v = \sum_{(u,v) \in E} X_{u,v}   (flow into v is conserved)
    for all v != T:  X_v = \sum_{(v,w) \in E} X_{v,w}   (flow out of v is conserved)

Let there be two additive functions over the edges of a path, which can therefore be written in terms of the edge variables:

    f(X) = \sum_{e \in E} X_e f_e,    g(X) = \sum_{e \in E} X_e g_e,

where f_e and g_e are per-edge constants. If the graph represents a flight network, then f_e and g_e might be the cost and duration of flight e. Finally, let there be a probability distribution h defined by potentials over the edge and vertex variables.

All of the following problems, and many more, can be solved through suitable reformulations of graphical model inference algorithms:

- count the paths in the graph
- find the shortest path in the graph, as measured by f
- enumerate k paths in order of f
- calculate E_h[f], E_h[f^2], E_h[fg], and more general moments E_h[f^k g^l]
- sample a random path according to h
- find all possible values of (f, g) with their path counts
- for each edge e or vertex v, compute the minimal f for any path involving e or v

3.0.1 Counting paths

If the flow constraints are expressed using 0-1 potentials

    \Phi_S = \delta(X_S, 1)
    \Phi_T = \delta(X_T, 1)
    \Phi_v^{in} = \delta(X_v, \sum_{(u,v) \in E} X_{u,v})
    \Phi_v^{out} = \delta(X_v, \sum_{(v,w) \in E} X_{v,w})

then P(X) = \prod \Phi(X) is an unnormalized probability distribution over X with value 1 for valid paths and 0 otherwise. Thus the number of paths is given by the normalization constant Z = \sum_X P(X), easily computed by variable elimination.

3.0.2 Shortest path

To compute the additive f, take ⊗ = + and create potentials \Phi_e(X_e) = X_e f_e. To aggregate to the shortest path, take ⊕ = min. The change of the value composition operator from x to + causes problems for our representation of the flow constraints, but these are solved by taking logs of the flow potentials, replacing the value 0 with the cost \infty and the value 1 with the cost 0.

4 Moments

Consider calculating the expected value of a function f over a distribution defined by a graphical model:

    E[f] = \sum_x P(x) f(x).

If f is arbitrary, there is no more efficient way to compute the expectation than by summing over the entire domain. However, if f = \sum_i f_i(x_i), then by linearity of expectation

    E[f] = \sum_i E[f_i] = \sum_i \sum_{x_i} P(x_i) f_i(x_i),

and the expectation is easily computed by using the FB, JT or SP algorithms to compute the marginal posterior probability of each variable value and then summing. This extends to functions that decompose over cliques, f = \sum_c f_c(x_c), since each of the FB, JT and SP algorithms can equally efficiently compute the posterior probabilities of clique configurations.

It is a natural question to ask whether expectations can be computed as an integral part of probability calculations, rather than as a post-processing step. Is there an easy way to extend the FB, VE, JT or SP algorithms to return E[f] directly?

Consider calculating E[f], with f = f_R(r) + f_T(t), over the distribution P(r, s, t) = \Phi_{RS}(r, s) \Phi_{ST}(s, t) / Z. Then

    E[f] = (1/Z) \sum_{r,s,t} (f_R(r) + f_T(t)) \Phi_{RS}(r, s) \Phi_{ST}(s, t)
         = (1/Z) \sum_s [ (\sum_r f_R(r) \Phi_{RS}(r, s)) (\sum_t \Phi_{ST}(s, t)) + (\sum_r \Phi_{RS}(r, s)) (\sum_t f_T(t) \Phi_{ST}(s, t)) ]
         = \frac{\sum_s [ E_{RS}(s) Z_{ST}(s) + Z_{RS}(s) E_{ST}(s) ]}{\sum_s Z_{RS}(s) Z_{ST}(s)},

where

    Z_{RS}(s) = \sum_r \Phi_{RS}(r, s),         Z_{ST}(s) = \sum_t \Phi_{ST}(s, t),
    E_{RS}(s) = \sum_r f_R(r) \Phi_{RS}(r, s),   E_{ST}(s) = \sum_t f_T(t) \Phi_{ST}(s, t).

Picking this apart, the table ⟨Z_{RS}(s), E_{RS}(s)⟩ is a sufficient statistic after the elimination of R, and similarly ⟨Z_{ST}(s), E_{ST}(s)⟩ after the elimination of T. This suggests replacing the numeric values Z in the summary tables of our algorithms with pairs ⟨Z, E⟩, and extending the sum and product operators:

    \Psi_c(x_c) = ⟨\Phi_c(x_c), \Phi_c(x_c) f_c(x_c)⟩
    ⟨Z_1, E_1⟩ ⊕ ⟨Z_2, E_2⟩ = ⟨Z_1 + Z_2, E_1 + E_2⟩
    ⟨Z_1, E_1⟩ ⊗ ⟨Z_2, E_2⟩ = ⟨Z_1 Z_2, E_1 Z_2 + Z_1 E_2⟩

It is easy to verify that E[f] = E / Z, where ⟨Z, E⟩ = ⊕_{r,s,t} \Psi_{RS} ⊗ \Psi_{ST}.

More generally, to compute E_P[f], with f = \sum_i f_i and P = \prod_j \Phi_j / Z, define extended potentials \Psi as follows:

    \Psi_i = ⟨1, f_i⟩          (function potentials)
    \Psi_j = ⟨\Phi_j, 0⟩       (probability potentials)

and then calculate ⟨Z, E⟩ by using any of the algorithms we have studied, defined using the ⊕ and ⊗ operators in place of + and x [1].
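Here is a small sketch of the extended ⟨Z, E⟩ operators on the R-S-T chain example; the potential tables and the functions f_R, f_T are made-up numbers, and the result E/Z is checked against a brute-force sum over all configurations.

```python
import numpy as np
from itertools import product

# Hypothetical potentials on the chain R - S - T (all variables binary),
# with f(r, s, t) = f_R(r) + f_T(t).
Phi_RS = np.array([[1.0, 2.0], [0.5, 1.5]])     # Phi_RS(r, s)
Phi_ST = np.array([[2.0, 1.0], [1.0, 3.0]])     # Phi_ST(s, t)
f_R = np.array([0.0, 1.0])
f_T = np.array([2.0, 5.0])

def oplus(a, b):    # <Z1,E1> (+) <Z2,E2> = <Z1 + Z2, E1 + E2>
    return (a[0] + b[0], a[1] + b[1])

def otimes(a, b):   # <Z1,E1> (x) <Z2,E2> = <Z1 Z2, E1 Z2 + Z1 E2>
    return (a[0] * b[0], a[1] * b[0] + a[0] * b[1])

psi_f = lambda fi: (1.0, fi)       # function potential   <1, f_i>
psi_p = lambda ph: (ph, 0.0)       # probability potential <Phi_j, 0>

# Eliminate R and T: the per-s tables <Z_RS(s), E_RS(s)> and <Z_ST(s), E_ST(s)>
# are sufficient statistics, exactly as in the derivation above.
ZE_RS = [(0.0, 0.0), (0.0, 0.0)]
ZE_ST = [(0.0, 0.0), (0.0, 0.0)]
for s in range(2):
    for r in range(2):
        ZE_RS[s] = oplus(ZE_RS[s], otimes(psi_p(Phi_RS[r, s]), psi_f(f_R[r])))
    for t in range(2):
        ZE_ST[s] = oplus(ZE_ST[s], otimes(psi_p(Phi_ST[s, t]), psi_f(f_T[t])))

# Eliminate S to get <Z, E>, with E_P[f] = E / Z.
Z, E = (0.0, 0.0)
for s in range(2):
    Z, E = oplus((Z, E), otimes(ZE_RS[s], ZE_ST[s]))
print(E / Z)

# Brute-force check: sum_x P(x) f(x).
P = np.einsum('rs,st->rst', Phi_RS, Phi_ST)
P /= P.sum()
print(sum(P[r, s, t] * (f_R[r] + f_T[t]) for r, s, t in product(range(2), repeat=3)))
```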
4.1 Arbitrary moments and generating functions

What about higher moments, such as E[f^2] (useful for calculating \sigma_f^2 = E[f^2] - E[f]^2)? Can one calculate E[f^k] efficiently? You might at this point want to try for a moment to come up with an algorithm for computing E[f^2]. While it is fairly straightforward to derive extensions of the ⊕ and ⊗ operators to compute E[f^k] directly from the expansion of f^k = (\sum_i f_i)^k, a cleaner and more general derivation comes from considering the moment generating function for f:

    F(\lambda) = E[e^{\lambda f}] = E[1 + \lambda f + (\lambda f)^2 / 2 + \cdots].

Notice that ...

[1] For VE, ⟨Z, E⟩ is the final result after eliminating all variables; for the FB algorithm, ⟨Z, E⟩ = ⊕_{q_i} \alpha(q_i) ⊗ \beta(q_i) for any position i; for the JT algorithm, ⟨Z, E⟩ = ⊕_S \Psi_S(s) for any separator S. The junction-tree algorithm divides values during absorption, so one must also define division on pairs; verify that

    ⟨Z_n, E_n⟩ ⊘ ⟨Z_d, E_d⟩ = ⟨Z_n / Z_d, (E_n Z_d - Z_n E_d) / Z_d^2⟩

ensures (A ⊘ B) ⊗ B = A when Z_B != 0.
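As a quick numeric check of the division rule stated in the footnote (with the pair ⊗ defined in Section 4, and arbitrary made-up test values):

```python
def otimes(a, b):   # <Z1,E1> (x) <Z2,E2> = <Z1 Z2, E1 Z2 + Z1 E2>
    return (a[0] * b[0], a[1] * b[0] + a[0] * b[1])

def odiv(n, d):     # <Zn,En> (/) <Zd,Ed> = <Zn/Zd, (En Zd - Zn Ed) / Zd^2>
    return (n[0] / d[0], (n[1] * d[0] - n[0] * d[1]) / d[0] ** 2)

A, B = (3.0, 5.0), (2.0, 7.0)
print(otimes(odiv(A, B), B))   # recovers A = (3.0, 5.0), as required, when Z_B != 0
```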

