Advanced Language Technologies
CS 6740
These 7 pages of class notes were uploaded by Lacey Collier on Saturday, September 26, 2015. The notes belong to CS 6740 (Advanced Language Technologies) at Cornell University, taught by staff in Fall.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
John Lafferty, Andrew McCallum, Fernando Pereira
Presenter: Yejin Choi

MEMMs (recap)
- Arcs: observations X; nodes: outputs Y.
- Recall that MEMMs model the local conditionals P(y_i | y_{i-1}, x_i), normalized at each node.

The Label Bias Problem
- Two example sentences with their tag sequences (state IDs in parentheses):
  1. "Herbivores like merino sheep": NNS (0), IN (1), JJ (2), NN (3)
  2. "Carnivores like eating sheep": NNS (0), VB (4), VBG (5), NN (3)
- Suppose the NNS -> IN transition is more frequent than NNS -> VB.
- Suppose that from VB the only possible transition is VB -> VBG.
- Then P(y_i = VBG | y_{i-1} = VB, x_i) = 1 for any x_i: once the model is in VB, the observation is ignored entirely. For instance, P(VBG | VB, "merino") = 1 even though "merino" is never a gerund.

So how do we fix this?
- Do not normalize at each node; normalize over the entire sequence instead. This motivates the global normalization scheme of CRFs.
- Other approaches: Cohen and Carvalho (2005); Sutton and McCallum (2005).

Wait, what about HMMs?
- Recall that HMMs model P(x_i | y_i) and P(y_i | y_{i-1}), so that P(x, y) = prod_i P(y_i | y_{i-1}) P(x_i | y_i).
- Here the emission P("merino" | VBG) = 0 zeroes out the bad branch: a generative model still pays for an unlikely observation, so HMMs do not exhibit label bias.

Label Bias vs. Observation Bias
- Observation bias: the observation explains the current state so well that the previous state is effectively ignored.
- Both problems stem from local conditional normalization.

So: let's normalize globally. But how?
- What we need is the global conditional distribution p(y | x), where y = (y_1, ..., y_n) and x = (x_1, ..., x_n).
- We do not want distributions on individual nodes, i.e. P(y_i | y_{i-1}, x_i). Instead we want non-probabilistic potential functions, e.g. psi(y_i, y_{i-1}, x).

Problem with directed graphs (e.g. Bayesian networks)
- A probability distribution must be given for each node; the joint is then p(y) = prod_i p(y_i | parents(y_i)).
- (By the way: what is parents(y_i) for an MEMM?)

Markov Random Field (also: Markov network, random field)
- Forget about conditioning on X for now.
- Let G = (Y, E) be a graph where each vertex v carries a random variable Y_v.
- Suppose P(y_v | all other y) = P(y_v | neighbors(y_v)); then Y is a Markov random field.
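The label-bias story above can be made concrete with a few lines of code. The numbers below are made up for illustration (they are not from the lecture); the point is only that a locally normalized model passes each branch's full probability mass forward regardless of the observation, while an HMM's emission factor can kill a branch outright.

```python
# Label-bias demo with toy numbers (hypothetical, not from the lecture).
# MEMM: each step is a locally normalized conditional P(y_i | y_{i-1}, x_i),
# so every branch forwards all of its mass no matter what is observed.
p_memm = {
    ("NNS", "like"):   {"IN": 0.7, "VB": 0.3},   # NNS -> IN more frequent
    ("IN",  "eating"): {"JJ": 1.0},              # forced: mass cannot leave
    ("VB",  "eating"): {"VBG": 1.0},             # forced as well
}
path_in = p_memm[("NNS", "like")]["IN"] * p_memm[("IN", "eating")]["JJ"]
path_vb = p_memm[("NNS", "like")]["VB"] * p_memm[("VB", "eating")]["VBG"]
print(path_in, path_vb)  # 0.7 0.3 -- the observation "eating" never got a vote

# HMM: the emission P(x_i | y_i) is a genuine factor in the joint, so an
# impossible observation (P("eating" | JJ) = 0) zeroes out the wrong branch.
emit = {("JJ", "eating"): 0.0, ("VBG", "eating"): 0.4}
hmm_in = 0.7 * 1.0 * emit[("JJ", "eating")]
hmm_vb = 0.3 * 1.0 * emit[("VBG", "eating")]
print(hmm_in, hmm_vb)  # ~0.0 0.12 -- the VB branch wins
```

Note that the MEMM prefers the IN path purely on the strength of the first transition, which is exactly the failure the slides describe.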
- Example: P(Y_5 | all other Y) = P(Y_5 | Y_4, Y_6).
- But how do we compute the global joint distribution P(Y) out of this? Besides, we don't want to have to compute P(y_v | neighbors(y_v)) directly.

Hammersley-Clifford theorem (1971)
- P(Y) = (1/Z) prod_{cliques C} psi_C(Y_C), where Z = sum_Y prod_C psi_C(Y_C).
- The theorem states the equivalence of two views:
  1. An MRF G = (Y, E) such that P(Y_v | Y_w, w != v) = P(Y_v | nbh(Y_v)).
  2. Potentials psi_C(Y_C) > 0 for every clique C in G (cliques may overlap; cliques need not be maximal).
- This implies we don't need to compute P(Y_v | nbh(Y_v)) to get P(Y).

Definition of CRFs
- From the paper: "Let G = (V, E) be a graph such that Y = (Y_v), v in V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: P(Y_v | X, Y_w, w != v) = P(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G."
- Almost identical to the definition of MRFs, except that CRFs condition on X; notice that X is not part of V.
- The Hammersley-Clifford theorem can be extended to the case conditional on X as well.

p(y | x) for linear-chain CRFs
- Let psi_i(y_i, x) = exp(sum_k mu_k g_k(y_i, x)) and psi_i(y_i, y_{i-1}, x) = exp(sum_k lambda_k f_k(y_i, y_{i-1}, x)).
- Then p(y | x) = (1/Z(x)) prod_i psi_i(y_i, x) psi_i(y_i, y_{i-1}, x) = (1/Z(x)) exp(lambda . F(y, x)), where F(y, x) collects the feature sums over all positions and Z(x) = sum_y exp(lambda . F(y, x)).

Objective function for training
- Given training data D = {(x^(j), y^(j))}, j = 1..N, and p(y | x) = (1/Z(x)) exp(lambda . F(y, x)).
- Objective: the conditional likelihood L(lambda) = L(lambda | D) = P(D | lambda) = prod_j p(y^(j) | x^(j)).
- Equivalently, optimize the conditional log-likelihood:
  l(lambda) = log L(lambda)
            = sum_j log p(y^(j) | x^(j))
            = sum_j [ lambda . F(y^(j), x^(j)) - log Z(x^(j)) ]
            = sum_j [ lambda . F(y^(j), x^(j)) - log sum_y exp(lambda . F(y, x^(j))) ].
- (Slide diagram from William Cohen's slides: a lattice over the labels {name, nonName}, illustrating how l(lambda) is computed for linear-chain CRFs.)
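The quantity log Z(x) in the objective above is a sum over exponentially many label sequences, but for a linear chain it can be computed by the forward algorithm over the lattice. Below is a minimal sketch with a hypothetical scoring function (`step_score` stands in for lambda . f; its values are invented), checked against brute-force enumeration on a short input.

```python
import itertools
import math

# Minimal linear-chain CRF sketch: score(y, x) = sum_i s(y_{i-1}, y_i, x, i)
# and Z(x) = sum over ALL label sequences of exp(score). The feature scores
# here are hypothetical stand-ins, not the lecture's.
LABELS = ["name", "nonName"]

def step_score(prev, cur, x, i):
    # Stand-in for lambda . f(y_i, y_{i-1}, x, i): any real-valued
    # potential works -- note there is no local normalization anywhere.
    s = 1.0 if (cur == "name") == x[i].istitle() else -1.0
    if prev == cur:
        s += 0.3  # hypothetical transition bonus for label continuity
    return s

def log_Z_forward(x):
    # alpha[y] = log of the summed exp-scores of all prefixes ending in y.
    alpha = {y: step_score(None, y, x, 0) for y in LABELS}
    for i in range(1, len(x)):
        alpha = {y: math.log(sum(math.exp(alpha[p] + step_score(p, y, x, i))
                                 for p in LABELS))
                 for y in LABELS}
    return math.log(sum(math.exp(v) for v in alpha.values()))

def log_Z_brute(x):
    # Direct enumeration of all |LABELS|^n sequences, for checking only.
    return math.log(sum(
        math.exp(sum(step_score(None if i == 0 else y[i - 1], y[i], x, i)
                     for i in range(len(x))))
        for y in itertools.product(LABELS, repeat=len(x))))

x = ["John", "likes", "Mary"]
print(abs(log_Z_forward(x) - log_Z_brute(x)) < 1e-9)  # True
```

The forward pass is O(n * |LABELS|^2) instead of O(|LABELS|^n), which is what makes the training objective tractable for chains.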
Parameter Estimation for CRFs: how to compute argmax_lambda l(lambda)?
- Iterative scaling algorithms:
  - Generalized iterative scaling (GIS)
  - Improved iterative scaling (IIS)
  - Easy to implement, but both are really slow to converge.
- Gradient-based methods:
  - CG (conjugate gradient)
  - L-BFGS (limited-memory Newton's method)
  - Much harder to implement, but lots of code is available; you only need to provide l(lambda) and its gradient. Much faster to converge.
- All of these optimize the conditional likelihood. It is also possible to train discriminatively with the voted perceptron (Collins, 2002).
- See Sha and Pereira (2003), Table 3: runtime of the various training methods in minutes, on 375k examples.

argmax_y P(y | x) for linear-chain CRFs
- argmax_y p(y | x) = argmax_y log p(y | x) = argmax_y [ lambda . F(y, x) - log Z(x) ] = argmax_y lambda . F(y, x).
- Since log Z(x) does not depend on y, decoding reduces to Viterbi search over the lattice (illustrated on the {name, nonName} lattice).

Other CRFs
- Factorial CRFs (Sutton et al., 2004)
- Tree CRFs (Cohn and Blunsom, 2005)
- Skip-chain CRFs (Sutton and McCallum, 2004)

Graphical representation: HMM, MEMM, CRF
- (Diagrams of the three model families over Y and X, from Klein and Taskar's slides.)

Maximum Entropy II (from Klein and Taskar's slides)
Max vs. Soft-Max Margin; also regularization (smoothing)
- SVMs: minimize ||w||^2 subject to the margin constraints (hard margin), or the penalized soft-margin version with slack variables.
- Maxent: maximize likelihood, i.e. minimize log-loss:
  min_w lambda ||w||^2 + sum_i [ -w . f(x_i, y_i) + log sum_y exp(w . f(x_i, y)) ].
- Very similar: both try to make the true score better than a function of the other scores. The SVM tries to beat the augmented runner-up; the maxent classifier tries to beat the soft-max.
- Connection to the maximum entropy principle: we might want to do a good job of being uncertain on noisy cases; in practice, though, posteriors are pretty peaked.
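The decoding identity above (log Z(x) is constant in y, so argmax_y p(y | x) = argmax_y lambda . F(y, x)) can be sketched as a small Viterbi routine. The potentials are made-up illustrations, not the lecture's features.

```python
# Viterbi decoding sketch for a linear-chain CRF. Because log Z(x) does
# not depend on y, the best labeling maximizes the unnormalized score,
# found by dynamic programming over the {name, nonName} lattice.
LABELS = ["name", "nonName"]

def step_score(prev, cur, x, i):
    # Hypothetical potential: capitalized tokens prefer "name".
    s = 1.0 if (cur == "name") == x[i].istitle() else -1.0
    if prev == "name" and cur == "name":
        s += 0.5  # hypothetical bonus for contiguous name spans
    return s

def viterbi(x):
    # best[y] = (score, path) of the best prefix labeling ending in y.
    best = {y: (step_score(None, y, x, 0), [y]) for y in LABELS}
    for i in range(1, len(x)):
        best = {y: max(((s + step_score(p, y, x, i), path + [y])
                        for p, (s, path) in best.items()),
                       key=lambda t: t[0])
                for y in LABELS}
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi(["John", "Smith", "sneezed"]))  # ['name', 'name', 'nonName']
```

Like the forward algorithm, this runs in O(n * |LABELS|^2); the only change is replacing the log-sum-exp with a max (and keeping backpointers, here stored as explicit paths for brevity).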
Soft-Max and Log-Loss (from Klein and Taskar's slides)
- Soft-max: max(a, b) ~= log(exp(a) + exp(b)).
- If we view maxent as a minimization problem:
  min_w sum_i [ -w . f(x_i, y_i) + log sum_y exp(w . f(x_i, y)) ].
- This minimizes the log-loss on each example:
  -w . f(x_i, y_i) + log sum_y exp(w . f(x_i, y)) = -log P(y_i | x_i).
- Log-loss bounds zero-one loss.

Loss Functions
- Zero-one loss: 1 if y_i != argmax_y w . f(x_i, y), else 0.
- Hinge loss: max_y w . f(x_i, y) - w . f(x_i, y_i).
- Log loss: log sum_y exp(w . f(x_i, y)) - w . f(x_i, y_i).
- (Slide photo: merino sheep.)
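The three losses from the slide can be put side by side in a few lines. The label scores below are invented for illustration; the assertion at the end checks the soft-max relationship the slide uses (soft-max >= max, hence log-loss >= hinge on every example).

```python
import math

# The three losses from the slide, for one example with hypothetical
# scores w . f(x_i, y) over a small label set; the true label is y_i.
def zero_one(scores, yi):
    return 0.0 if max(scores, key=scores.get) == yi else 1.0

def hinge(scores, yi):
    return max(scores.values()) - scores[yi]

def log_loss(scores, yi):
    return math.log(sum(math.exp(s) for s in scores.values())) - scores[yi]

scores = {"JJ": 2.0, "VBG": 1.0, "NN": -0.5}  # made-up scores; y_i = "JJ"
print(zero_one(scores, "JJ"), hinge(scores, "JJ"))  # 0.0 0.0
# soft-max >= max, so log-loss upper-bounds hinge loss on every example:
assert log_loss(scores, "JJ") >= hinge(scores, "JJ")
```

Note that log-loss stays positive even when the classifier is correct, which is exactly the "tries to beat the soft-max" behavior contrasted with the SVM above.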