Class Notes for CMPSCI 585 at UMass
Noisy Channel, N-grams & Smoothing
Lecture 9, Introduction to Natural Language Processing
CMPSCI 585, Fall 2007
University of Massachusetts Amherst
Andrew McCallum

Today's Main Points
- Application of the noisy channel model.
- Markov models, including the definition of the Markov property.
- Smoothing: Laplace, Lidstone's, held-out, Good-Turing.

Noisy Channel Model
Message W (from a finite alphabet) -> Encoder -> Input to channel X -> Channel p(y|x) -> Output from channel Y -> Decoder -> W-hat (attempt to reconstruct the message based on the output).
- Optimize the encoder for throughput and accuracy: compression (remove all redundancy) vs. accuracy (adding controlled redundancy).
- Capacity: the rate at which one can transmit information through the channel with arbitrarily low probability of error: C = max over p(X) of I(X; Y).

Noisy Channel in NLP
The decoder reconstructs the hidden message from the observed channel output:
  y-hat = argmax_y p(y|x) = argmax_y p(x|y) p(y) / p(x) = argmax_y p(x|y) p(y)
Applications (input; output; p(input) prior; p(output|input) channel model):
- Machine translation: L1 word sequences; L2 word sequences; p(L1) language model; translation model.
- Optical character recognition: actual text; text with OCR errors; probability of language text; model of OCR errors.
- Part-of-speech tagging: POS tag sequences; English word sequences; probability of POS sequences; p(word|tag).
- Speech recognition: word sequences; acoustic speech signal; probability of word sequences; acoustic model.
- Document classification: class label; word sequence (document); p(class) prior; p(words|class).

Probabilistic Language Modeling
- Assigns a probability p(t) to a word sequence t = w1 w2 w3 w4 w5 w6.
- Chain rule, via joint and conditional probabilities for text t:
  p(t) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... w(n-1)) = product over i of p(wi | w1 ... w(i-1))
  where p(wk | w1 ... w(k-1)) = p(w1 ... wk) / p(w1 ... w(k-1)).
- The chain rule leads to a history-based model: we predict following things from past things.

n-gram models
- The classic example of a statistical model of language.
- Each word is predicted according to a conditional distribution based on limited context, stored in a Conditional Probability Table (CPT), e.g. p(X | both):
  p(of | both) = 0.066
  p(to | both) = 0.041
  p(in | both) = 0.038
- Also known as Markov chains: sequences of random variables in which the future variable is determined by the present variable, but is independent of the way in which the present state arose from its predecessors.

Bigram (2-gram) model: w1 -> w2 -> w3 -> w4
- A first-order Markov model, P(wt | w(t-1)): the simplest linear graphical model.
- Words are random variables; arrows are direct dependencies between them (CPTs).
- These simple engineering models have been amazingly successful.

nth-order Markov models
- First-order Markov assumption (bigram): p(wk | w1 ... w(k-1)) ≈ p(wk | w(k-1)), so p(t) ≈ product over k of p(wk | w(k-1)).
- Similarly for the nth-order Markov assumption. Most common is the trigram (2nd order): p(wk | w(k-2) w(k-1)), so p(t) ≈ product over k of p(wk | w(k-2) w(k-1)).

Andrei Andreyevich Markov (1856-1922)
- Graduate of Saint Petersburg University (1878), where he became a professor in 1886.
- Mathematician, teacher, political activist.
- In 1913, when the government celebrated the 300th anniversary of the House of Romanov, Markov organized a counter-celebration of the 200th anniversary of Bernoulli's discovery of the Law of Large Numbers.
- Markov was also interested in poetry, and he made studies of poetic style.

Markov's Model
Took 20,000 characters from Pushkin's Eugene Onegin to see if it could be approximated by a simple chain of characters:
             vowel   consonant
  vowel      0.128   0.872
  consonant  0.663   0.337

Markov Approximations to English (from Shannon's original paper)
- Zero-order approximation, P(c): XFOML RXKXRJFFUJ ZLPWCFWKCRJ FFJEWKCQSGHYD QPAAMKBZAACIBZLHJQD
- First-order approximation, P(c|c): OCRO HL RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBI IVA
- Second-order approximation, P(c|c,c): ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
- Third-order approximation, P(c|c,c,c): IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTABIN IS REGOACTIONA OF CRE

Markov Random Field
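Shannon-style character approximations like the ones above are easy to reproduce. The sketch below (my illustration, not course code; the toy corpus string is made up) trains a first-order character model P(c_t | c_{t-1}) by counting character bigrams, then samples from it.

```python
import random
from collections import Counter, defaultdict

def train_char_bigrams(text):
    """Count character bigrams: counts[c1][c2] = number of times c2 follows c1."""
    counts = defaultdict(Counter)
    for c1, c2 in zip(text, text[1:]):
        counts[c1][c2] += 1
    return counts

def sample_first_order(counts, start, length, seed=0):
    """Generate `length` characters from the first-order model P(c_t | c_{t-1})."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        followers = counts[out[-1]]
        chars, weights = zip(*followers.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

# Toy corpus (hypothetical); a real run would use a large text file.
corpus = "the man on the sidewalk strode into the center of the road "
model = train_char_bigrams(corpus)
print(sample_first_order(model, "t", 40))
```

With a large enough corpus and higher-order contexts, the output drifts from gibberish toward word-like strings, exactly as in Shannon's progression of approximations.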
with 1000 features:
WAS REASER IN THERE TO WILL WAS BY HOMES THING BE RELOVERATED THER WHICH CONSISTS AT FORES ANDITING WITH PROVERAL THE CHESTRAING FOR HAVE TO INTRALLY OF QUT DIVERAL THIS OFFECT INATEVER THIFER CONTRANDED STATER
(Della Pietra, Della Pietra & Lafferty 1997)

Word-based Approximations
- First-order approximation: representing and speedily is an good apt or come can different natural here he the a in came the to of to expert gray come to furnishes the line message had be
- Second-order approximation: the head and in frontal attack on an English writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected
- Shannon's comment (1948): "It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage."

n-gram models
- The core language model for the engineering task of better predicting the next word: speech recognition, OCR, context-sensitive spelling correction.
- It is only recently that improvements have been made on these tasks (Alshawi 96; Wu 97).
- But linguistically they are appallingly simple and naive.

Why might n-gram models not work?
Relationships, say between subject and verb, can be arbitrarily distant and convoluted, as linguists love to point out: "The man on the sidewalk, without pausing to look at what was happening down the street, and quite oblivious to the situation that was about to befall him, confidently strode into the center of the road."

Why do they work?
That kind of thing doesn't happen much. Collins (1997): 74% of dependencies in the Penn Treebank (WSJ) are with an adjacent word; 95% are with a word less than 5 words away, once one treats simple NPs as units.

Evaluation of language models
- The best evaluation of a probability model is task-based.
- As a substitute, for evaluating one component we standardly use corpus per-word cross entropy:
  H(X) = -(1/n) log2 p(w1 ... wn)
- Or perplexity (units: the average number of choices, as for a uniform distribution; high = unpredictable):
  PP(X) = 2^H(X) = p(w1 ... wn)^(-1/n)

Parameter Estimation
Maximum Likelihood Estimate = relative frequency. Makes the training data as probable as possible; overfits:
  p_MLE(wk | w(k-1)) = C(w(k-1) wk) / C(w(k-1))

Limitations of the Maximum Likelihood Estimator
- Problem: the model is infinitely surprised when an unseen word appears: P(unseen) = 0. And this happens commonly.
- Probabilities of zero-count words are too low; probabilities of non-zero-count words are too high.
- Estimates for high-count words are fairly accurate; estimates for low-count words are unstable.
- We need smoothing.

Sparsity
- How often does an everyday word like "kick" occur in a million words of text? About 10 (it depends vastly on genre, of course). "wrist": about 5.
- Normally we want to know about something bigger than a single word, like how often you "kick a ball", or how often the dative alternation "he kicked the baby a toy" occurs. How often can we expect that to occur in 1 million words? Almost never.
- There's no data like more data. (It must be of the right domain.)

Severity of the sparse data problem
  count      2-grams       3-grams
  1           8,045,024    53,737,350
  2           2,065,469     9,229,958
  3             970,434     3,654,791
  >4          3,413,290     8,728,789
  >0         14,494,217    75,349,888
  possible   6.8 x 10^10   1.7 x 10^16
(Vocabulary size 260,741 words; 365M words of training.)

The Zero Problem
- Necessarily some zeros: a trigram model has 1.7 x 10^16 parameters but only 2.6 x 10^6 words of training data.
- How should we distribute some probability mass over all the possibilities in the model?
- Optimal situation: even the least frequent trigram would occur several times, in order to distinguish its probability from that of other trigrams. The optimal situation cannot happen, unfortunately; how much data would we need?
- Two kinds of zeros: p(w|h) = 0, or even p(h) = 0.

Laplace smoothing
  p_Lap(w | h) = (C(h w) + 1) / (C(h) + V)
where V is the vocabulary size (assume a fixed, closed vocabulary).
This is the Bayesian maximum a posteriori estimator you get by assuming a uniform prior on multinomials, a Dirichlet prior.
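A minimal sketch of the estimators just described (my illustration; the toy corpus and vocabulary are made up): the MLE assigns zero probability to unseen bigrams, while add-one (Laplace) smoothing reserves mass for them, paid for by the seen events.

```python
import math
from collections import Counter

def laplace_bigram(tokens, vocab):
    """p_Lap(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V)."""
    big = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens[:-1])          # count of w1 as a bigram history
    V = len(vocab)
    return lambda w2, w1: (big[(w1, w2)] + 1) / (uni[w1] + V)

def perplexity(p, tokens):
    """PP = 2^H where H = -(1/n) sum_i log2 p(w_i | w_{i-1})."""
    logp = sum(math.log2(p(w2, w1)) for w1, w2 in zip(tokens, tokens[1:]))
    return 2 ** (-logp / (len(tokens) - 1))

train = "the cat sat on the mat".split()
vocab = set(train) | {"dog"}            # closed vocabulary, V = 6
p = laplace_bigram(train, vocab)

print(p("cat", "the"))                  # seen:   (1+1)/(2+6) = 0.25
print(p("dog", "the"))                  # unseen: (0+1)/(2+6) = 0.125
print(perplexity(p, train))
```

Note how much of p(. | the) now sits on unseen words; the slides' later point is precisely that add-one gives unseens too much mass when V is large.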
Dirichlet Distribution
- A multinomial is a die: a distribution over a finite alphabet of outcomes.
- A Dirichlet is a "die generator": a distribution over multinomials:
  Dirichlet(a1 ... ak): p(theta1 ... thetak) proportional to product over i of thetai^(ai - 1)
- It is a conjugate prior for multinomials.

Dirichlet Examples
(Figure: density plots of example Dirichlet distributions for several parameter settings; axis values omitted.)

Laplace Smoothing: Problem
- It gives too much probability mass to unseens. Not good for a large vocabulary with comparatively little data (i.e., NLP).
- E.g., a 10,000-word vocabulary and 1,000,000 words of training data, in which "comes across" occurs 10 times; of those, 8 times the next word is "as":
  P_MLE(as | comes across) = 0.8
  P_Lap(as | comes across) = (8 + 1) / (10 + 10,000) ≈ 0.0009
- Quick fix: Lidstone's law (Mitchell's (1997) m-estimate), with 0 < lambda < 1:
  P_Lid(w | h) = (C(h w) + lambda) / (C(h) + lambda V)

How much mass to allocate to unseens?
For Laplace smoothing in P(. | comes across), 10,000/10,010 of the probability mass is given to unseen events. How do we know that this is too much?

Absolute discounting
- Idea: discount the counts of seen things a little, and reallocate this probability mass to unseens.
- By subtracting a fixed count, probability estimates for commonly seen things are scarcely affected, while probabilities of rare things are greatly affected.
- If the discount is around 0.60-0.75, then seeing something once is not so different from not having seen it at all.
- With discount delta, B possible n-grams, N0 of them unseen, and r = C(w1 ... wn):
  P_abs(w1 ... wn) = (r - delta) / N            if r > 0
  P_abs(w1 ... wn) = (B - N0) delta / (N0 N)    otherwise

Held-Out Estimator
- How do you know how likely you are to see a new word (type) in the future, in a certain context? Examine some further text and find out: empirical held-out ("validation") estimators.
- Divide the data into two pots: training data and validation data.
  N  = number of (non-unique) bigrams in training
  Nr = number of unique bigrams with frequency r in training
  Tr = number of times that all bigrams appearing r times in the training data appeared in the validation data
  P_ho(w1 w2) = Tr / (Nr N), where r = C(w1 w2)
- Tr / Nr is an improved estimate for r.

Pots of data for estimating and testing models
- Major error: testing on your training data. Overfitting: expecting future events to be too much like the events on which the model was trained, rather than allowing sufficiently for other possibilities.
- Pots: training data; validation data; testing data (development test data and final test data).
- Don't report results on just one test set, but the average of many, and report the variance; use a test of statistical significance.

Held-out Estimator with Cross-validation
- Reshuffle the training data several times into pots of training and validation data; calculate P_ho(w1 w2) for each split, then average them. With two splits (deleted estimation):
  P_del(w1 w2) = (Tr^01 + Tr^10) / (N (Nr^0 + Nr^1)), where r = C(w1 w2)
- Extreme case of cross-validation: leave-one-out. Train on N-1 of the words, validate on 1 word. How much probability mass would be reserved for the unseen words in this case?

Good-Turing Smoothing: Derivation
The derivation reflects leave-one-out estimation:
- For each word token in the data, call it the validation set; the remaining data is the training set.
- The validation-set word has count r in the training set.
- See how often any word has r counts in the training set: how many different words have count r? This happens every time the word left out has r+1 counts in the original data.
- So the total count mass of r-count words is assigned from the mass of (r+1)-count words: N(r+1) x (r+1).
- Apply to low counts; it is not needed (and is harmful) for high-count words.

Good-Turing smoothing
- All words with the same count get the same probability, as before.
- The count mass of words with r+1 occurrences is assigned to words with r occurrences.
- r* is the corrected frequency estimate for a word occurring r times:
  r* = (r + 1) N(r+1) / Nr

Estimated frequencies in AP newswire (Church & Gale 1991)
  r = f_MLE   f_empirical   f_Lap      f_del   f_GT
  0           0.000027      0.000137   0.000037  0.000027
  1           0.448         0.000274   0.396     0.446
  2           1.25          0.000411   1.24      1.26
  3           2.24          0.000548   2.23      2.24
  4           3.23          0.000685   3.22      3.24
  5           4.21          0.000822   4.22      4.22
  6           5.23          0.000959   5.20      5.19
  7           6.21          0.00109    6.21      6.21
  8           7.21          0.00123    7.18      7.24
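The count-of-counts correction can be sketched in a few lines (my illustration; the toy word list is made up). It computes N_r from observed counts and applies r* = (r+1) N_{r+1} / N_r to low counts only, leaving high counts alone as the slides advise.

```python
from collections import Counter

def good_turing_counts(counts, max_r=5):
    """Good-Turing corrected counts r* = (r+1) * N_{r+1} / N_r,
    applied only to low counts r <= max_r (the correction is
    unreliable, even harmful, for high counts)."""
    Nr = Counter(counts.values())       # N_r: number of types seen r times
    adjusted = {}
    for w, r in counts.items():
        if r <= max_r and Nr[r + 1] > 0:
            adjusted[w] = (r + 1) * Nr[r + 1] / Nr[r]
        else:
            adjusted[w] = float(r)      # leave high counts unchanged
    return adjusted, Nr

words = "a a a b b c c d e f".split()
counts = Counter(words)                 # a:3, b:2, c:2, d:1, e:1, f:1
adj, Nr = good_turing_counts(counts)
print(adj)                              # r=1 -> 2*2/3 ≈ 1.33, r=2 -> 3*1/2 = 1.5
print(Nr[1] / len(words))               # mass reserved for unseens: N_1/N = 0.3
```

As in the Church & Gale table, singletons are discounted most (1 -> about 0.45 in their data, 1 -> 1.33 here on the toy counts), and the freed mass N_1/N is what gets spread over unseen events.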
Differentiating based on history
- So far, the methods considered have all used nothing but the raw frequency of an n-gram: for "the large", P(large | the); for "the mauve", P(mauve | the). If C(the large) = C(the mauve) (e.g., both 0), then P(large | the) = P(mauve | the). This doesn't seem right.
- Also use the frequency of the (n-1)-gram: p(large), p(mauve).

Linear Interpolation
- Estimate the probability of an n-gram from a weighted average of lower-order and higher-order n-grams:
  P_li(wn | w(n-2) w(n-1)) = lambda1 P1(wn) + lambda2 P2(wn | w(n-1)) + lambda3 P3(wn | w(n-2) w(n-1))
  where the lambdas sum to 1.
- Set the lambdas by hand, or from held-out data; they can be functions of equivalence-classed histories.
- Also known as "shrinkage" (Stein 1957). Works surprisingly well.

Assigning Probability to the Unseen Event
- At test time we see a word u that wasn't seen at training time: what is P(u)?
- Replace all singleton word tokens in the training data with a special token <UNK>.

Smoothing: The Rest of the Story
Other methods:
- Backoff (Katz 1987): try the fourgram; if zero count, try the trigram; if zero count, try the bigram.
- Kneser and Ney (1995): backoff n-gram counts proportional not to the frequency of the n-gram in the training data, but to the expectation of how often it should occur in a novel trigram (since one only uses the backoff estimate when the trigram is not found).
- Witten-Bell discounting.
- Smoothed maximum entropy models.
- See Chen and Goodman (1998) for a survey.

Statistical Language Modeling
- Noam Chomsky (1969): "But it must be recognized that the notion of 'probability of a sentence' is an entirely useless one, under any known interpretation of the term."
- Fred Jelinek (while at the IBM speech group, 1988): "Anytime a linguist leaves the group, the speech recognition rate goes up."

Progress in the field is often dominated not by the need to create fancier, more complex models, but by the need to do a good job of estimating parameters for the simpler models we already have. Real benefit comes from targeted enhancements and a sharp tool set of excellent estimation techniques.

Distinctiveness of NLP as an ML problem
- Most structure is hidden.
- Relational, constraint-satisfaction nature.
- Long pipelines with cascading errors.
- Large and strange sparse discrete distributions.
- Large scale.
- Feature-driven, performance-driven.

HW4
As usual, your choice:
- Naive Bayes classifier: spam vs. ham; English vs. French vs. Spanish vs. Klingon; sliding-window part-of-speech tagger.
- N-gram language model: train it and generate language; use it for spelling correction ("there" vs. "their").

HW4 Help: Accuracy Evaluation
Result of running a classifier on a test set, one line per document:
  filename trueclass predclass p(predclass | doc)

              true spam   true ham
  pred spam   TP          FP
  pred ham    FN          TN

  Accuracy  = (TP + TN) / (TP + TN + FP + FN)
  Precision = TP / (TP + FP)
  Recall    = TP / (TP + FN)
  F1 = harmonic mean of Precision and Recall

HW4 Help: Precision/Recall Curve and Accuracy/Coverage Curve
These use the same classifier output and the same definitions as above.

HW4 Help: Working with log-probabilities
  p(c | d) proportional to p(c) x product over i of p(wi | c)
  log p(c | d) = log p(c) + sum over i of log p(wi | c), up to a constant
Getting back to p(c | d): subtract a constant to make all values non-positive, exponentiate, then normalize.
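The last step, subtract a constant, exponentiate, normalize, is the log-sum-exp trick. A small sketch (my illustration; the class names and log scores are hypothetical):

```python
import math

def posteriors_from_log_scores(log_scores):
    """Given unnormalized log p(c) + sum_i log p(w_i | c) per class,
    recover p(c | d) without underflow: subtract the max (making all
    values non-positive), exponentiate, then normalize."""
    m = max(log_scores.values())
    exp = {c: math.exp(s - m) for c, s in log_scores.items()}
    Z = sum(exp.values())
    return {c: v / Z for c, v in exp.items()}

# Hypothetical log scores for a two-class spam/ham Naive Bayes model;
# naively exponentiating -1041 would underflow to 0.0.
scores = {"spam": -1041.2, "ham": -1043.9}
post = posteriors_from_log_scores(scores)
print(post)                             # spam ≈ 0.937, ham ≈ 0.063
```

Only the differences between log scores matter, so subtracting the maximum changes nothing mathematically while keeping every exponentiated value in [0, 1].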