Word Sense Disambiguation
Lecture 8, Introduction to Natural Language Processing
CMPSCI 585, Fall 2004
University of Massachusetts Amherst
Andrew McCallum

Words and their meaning: three lectures
- Last time: collocations — multiple words together, with a meaning different from the sum of the parts.
- Today: word sense disambiguation — one word, multiple meanings — and Expectation Maximization.
- Future: word clustering — multiple words, same meaning.

Today's main points
- What is word sense disambiguation (WSD), and why is it useful?
- Homonymy and polysemy.
- Other, similar NLP problems.
- Methods for performing WSD: supervised (naive Bayes) and unsupervised (Expectation Maximization).

Word sense disambiguation
- The task: determine which of the various senses of a word is invoked in context.
  - "True annuals are PLANTS grown from seed that blossom, set new seed, and die in a single year."
  - "Nissan's Tennessee manufacturing PLANT beat back a United Auto Workers organizing effort with aggressive tactics."
- This is an important problem: most words are ambiguous, i.e. have multiple senses.

Problem statement
- A word is assumed to have a finite number of discrete senses.
- Make a forced choice for each word usage, based on some limited context around the word.
- The converse problem — different words or senses that mean almost the same thing (image, likeness, portrait, facsimile, picture) — is the subject of the next lecture.

WSD is important for...
- Translation: "The spirit is willing but the flesh is weak" famously came back from machine translation as "The vodka is good but the meat is spoiled."
- Information retrieval: query "wireless mouse" vs. a document about the "Australian long-tail hopping mouse".
- Computational lexicography: automatically identifying the multiple definitions to be listed in a dictionary.
- Parsing: giving preference to parses with correct use of senses.

There isn't generally one way to divide the uses of a word into a set of non-overlapping categories; senses depend on the task (Kilgarriff 1997).

Many other cases are harder. Consider "title":
- name/heading of a book, statute, work of art or music, etc.
- material at the start of a film
- the right of legal ownership (of land)
- the document that is evidence of this right
- an appellation of respect attached to a person's name
- a written work

WSD: types of problems
- Homonymy: the meanings are unrelated (bank of a river vs. bank as financial institution).
- Polysemy: related meanings (as on the previous slide: title of a book vs. title material at the start of a film).
- Systematic polysemy: standard ways of extending a meaning, such as from an organization to the building where it is housed:
  - the Speaker of the legislature
  - the legislature decided today
  - he opened the door and entered the legislature
- A word frequently takes on further related meanings through systematic polysemy or metaphor.

Upper and lower bounds on performance
- Upper bound: human performance.
  - How often do human judges agree on the correct sense assignment?
  - Particularly interesting if you give the humans only the same input context given to the machine method — a good test for any NLP method.
  - Gale (1992): show judges pairs of words in context and ask whether they carry the same sense. Agreement was about 91% for words with clear senses, but only 65-70% for polysemous words.
- Lower bound: a simple baseline algorithm.
  - Always pick the most common sense for each word.
  - Its accuracy depends greatly on the sense distribution!
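The lower-bound baseline is simple enough to implement in a few lines. A minimal sketch in Python (not from the lecture; the toy sense-tagged pairs and function names are illustrative):

```python
from collections import Counter

def train_baseline(labeled_instances):
    """labeled_instances: iterable of (word, sense) pairs from a sense-tagged corpus."""
    counts = {}
    for word, sense in labeled_instances:
        counts.setdefault(word, Counter())[sense] += 1
    # For each word, remember only its most frequent sense.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def baseline_disambiguate(word, most_frequent_sense):
    # Ignore the context entirely: always predict the word's most common sense.
    return most_frequent_sense.get(word)

# "plant" was tagged 'factory' twice and 'flora' once in training,
# so the baseline always answers 'factory', whatever the context.
mfs = train_baseline([("plant", "factory"), ("plant", "factory"), ("plant", "flora")])
print(baseline_disambiguate("plant", mfs))  # -> factory
```

As the slide notes, how well this baseline does depends entirely on how skewed each word's sense distribution happens to be.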
Senseval competitions
- Senseval 1: September 1998. Results appeared in Computers and the Humanities 34(1-2) (OUP); used the Hector corpus.
- Senseval 2: first half of 2001; WordNet senses. http://www.itri.brighton.ac.uk/events/senseval

WSD: automated method performance
- Varies widely depending on how difficult the disambiguation task is.
- Accuracies over 90% are commonly reported on the classic, often fairly easy, word disambiguation tasks (pike, star, interest).
- Senseval brought careful evaluation of difficult WSD (many senses, different parts of speech).
- Senseval 1 (fine-grained senses, wide range of types): overall about 75% accuracy; nouns about 80%; verbs about 70%.

WSD solution #1: expert systems (Small 1980, Hirst 1988)
- Most early work used semantic networks, frames, logical reasoning, or expert-system methods for disambiguation based on contexts.
- The problem got quite out of hand: "The word expert for 'throw' is currently six pages long, but should be ten times that size." (Small and Rieger 1982)

WSD solution #2: dictionary-based (Lesk 1986)
- A word's dictionary definitions are likely to be good indicators of the senses they define.
- Assume one sense per dictionary definition, and look for overlap between the words in each definition and the words in the context at hand.
- Word "ash":
  - sense 1: a tree of the olive family
  - sense 2: the solid residue left when combustible material is burned
- "This cigar burns slowly and creates a stiff ash": sense 1 scores 0, sense 2 scores 1.
- "The ash is one of the last trees to come into leaf": sense 1 scores 1, sense 2 scores 0.
- There is insufficient information in definitions; accuracy is 50-70%.
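Lesk's overlap idea can also be sketched compactly. A minimal version, using the two-sense "ash" dictionary from the slide; the stopword list and the crude suffix-stripping (so that "burns" in the context can match "burned" in a definition) are my own simplifications, not part of the lecture:

```python
STOPWORDS = {"a", "an", "and", "the", "of", "is", "this", "when", "to"}

def tokens(text):
    """Lowercase, drop stopwords, and crudely strip -ing/-ed/-s endings."""
    out = set()
    for w in text.lower().split():
        if w in STOPWORDS:
            continue
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[:-len(suffix)]
                break
        out.add(w)
    return out

def lesk(context, sense_definitions):
    """Pick the sense whose definition shares the most words with the context."""
    ctx = tokens(context)
    return max(sense_definitions,
               key=lambda sense: len(ctx & tokens(sense_definitions[sense])))

ash = {"tree":    "a tree of the olive family",
       "residue": "the solid residue left when combustible material is burned"}

print(lesk("This cigar burns slowly and creates a stiff ash", ash))     # -> residue
print(lesk("The ash is one of the last trees to come into leaf", ash))  # -> tree
```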
WSD solution #3: thesaurus-based (Walker 1987, Yarowsky 1992)
- Occurrence of a word under multiple thesaurus subject codes is a good indicator of its senses.
- Count the number of times the context words appear among the entries for each possible subject code.
- Increase coverage of rare words and proper nouns by also looking in the thesaurus for words that co-occur with the context words more often than chance; e.g., "Hawking" co-occurs with "cosmology" and "black hole".

  Word | Sense        | Roget category | Accuracy
  star | space object | UNIVERSE       | 96%
  star | celebrity    | ENTERTAINER    | 95%
  star | star-shaped  | INSIGNIA       | 82%

An extra trick: global constraints (Yarowsky 1995)
- One sense per discourse: the sense of a word is highly consistent within a document.
- This gives you many more context words, because the contexts of multiple occurrences are combined.
- True for topic-dependent words; not so true for other items such as adjectives and verbs (e.g., "make", "take").

Other similar disambiguation problems
- Sentence boundary detection: "I live on Palm Dr. Smith lives downtown." The period is only really ambiguous when (a) the word before it is an abbreviation that can end a sentence (not something like a title), and (b) the word after it is capitalized and can be a proper name; otherwise it must be a sentence end.
- Context-sensitive spelling correction: "I know their is a problem with there account."

WSD solution #4: supervised classification
- Gather a lot of labeled data: words in context, hand-labeled into the different sense categories.
- Use naive Bayes document classification, with the context serving as the "document".
- A straightforward classification problem, and a simple, powerful method — but it requires hand-labeling a lot of data.
- Can we still use naive Bayes, but without labeled data?

WSD solution #5: unsupervised disambiguation
- Supervised case: each word-context is labeled according to its sense; train one multinomial per class via maximum likelihood (what you just did for HW1).
- Unsupervised case: the word-contexts are unlabeled; the label is missing.

28 years ago: [slide reproduces the first page of A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", read before the Royal Statistical Society in December 1976: a broadly applicable algorithm for computing maximum likelihood estimates from incomplete data, with theory showing monotone behaviour of the likelihood and convergence, and applications including missing values, grouped/censored/truncated data, finite mixture models, variance component estimation, and factor analysis.]

Filling in missing labels with EM (Dempster et al. '77; Ghahramani & Jordan '95; McLachlan & Krishnan '97)
- Expectation Maximization is a class of iterative algorithms for maximum-likelihood estimation with incomplete data.
- E-step: use the current estimates of the model parameters to guess the values of the missing labels.
- M-step: use the current guesses for the missing labels to calculate new estimates of the model parameters.
- Repeat the E- and M-steps until convergence.
- EM finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.

Recall: naive Bayes
- Pick the most probable class given the evidence: c* = \arg\max_{c_j} \Pr(c_j \mid d), where c_j is a class (like "Planning") and d is a document (like "language intelligence proof").
- By Bayes' rule and the naive independence assumption, with w_{d_i} the i-th word in d (like "proof"):

  \Pr(c_j \mid d) = \frac{\Pr(c_j)\,\Pr(d \mid c_j)}{\Pr(d)}
                  = \frac{\Pr(c_j)\,\prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)}{\sum_k \Pr(c_k)\,\prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_k)}

Recall: parameter estimation in naive Bayes (Laplace-smoothed maximum likelihood)
- Estimate of P(c_j):

  \hat{P}(c_j) = \frac{1 + \sum_k \mathbf{1}[d_k \in c_j]}{|C| + |D|}

- Estimate of P(w_t | c_j):

  \hat{P}(w_t \mid c_j) = \frac{1 + \sum_{k:\,d_k \in c_j} \mathrm{Count}(w_t, d_k)}{|V| + \sum_{s=1}^{|V|} \sum_{k:\,d_k \in c_j} \mathrm{Count}(w_s, d_k)}
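These estimates translate directly into code. A minimal supervised sketch (the toy training corpus is illustrative; log-probabilities are used to avoid numerical underflow when multiplying many small Pr(w|c) terms):

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Laplace-smoothed multinomial naive Bayes, per the estimates above.
    docs: list of token lists; labels: parallel list of class names."""
    vocab = {w for d in docs for w in d}
    classes = sorted(set(labels))
    log_prior, log_pwc = {}, {}
    for c in classes:
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        log_prior[c] = math.log((1 + len(class_docs)) / (len(classes) + len(docs)))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        log_pwc[c] = {w: math.log((1 + counts[w]) / (len(vocab) + total))
                      for w in vocab}
    return log_prior, log_pwc, vocab

def classify(doc, log_prior, log_pwc, vocab):
    """argmax_c [ log P(c) + sum_i log P(w_i|c) ], ignoring out-of-vocabulary words."""
    scores = {c: log_prior[c] + sum(log_pwc[c][w] for w in doc if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

docs = [["skate", "ice", "jump"], ["bat", "pitch", "ball"], ["ice", "rink"]]
labels = ["skating", "baseball", "skating"]
model = train_nb(docs, labels)
print(classify(["ice", "jump"], *model))  # -> skating
```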
EM recipe
- Initialization: create an array P(c|d) for each document and fill it with random, normalized values; set P(c) to the uniform distribution.
- M-step (likelihood maximization): calculate maximum-likelihood estimates for the parameters P(w|c), using the current P(c|d) values as fractional label counts:

  \hat{P}(w_t \mid c_j) = \frac{1 + \sum_k \mathrm{Count}(w_t, d_k)\,P(c_j \mid d_k)}{|V| + \sum_{s=1}^{|V|} \sum_k \mathrm{Count}(w_s, d_k)\,P(c_j \mid d_k)}

- E-step (missing-value estimation): using the current parameters, calculate a new P(c|d) the same way you would at test time:

  P(c_j \mid d) = \frac{\Pr(c_j)\,\Pr(d \mid c_j)}{\sum_k \Pr(c_k)\,\Pr(d \mid c_k)}

- Loop back to the M-step until convergence: stop when the maximum change in any parameter P(w|c) falls below some threshold. (A runnable sketch of this loop appears at the end of these notes.)

EM: remarks
- We could simply have written down the likelihood, taken derivatives, and solved — but unlike the complete-data case it is not solvable in closed form, so we must use an iterative method such as gradient ascent. EM is another form of ascent on this likelihood surface.
- Convergence speed and local maxima are both issues.
- If you make hard 0-versus-1 assignments in P(c|d), you get the K-means algorithm.
- The likelihood will always be highest with more classes; use a prior over the number of classes, or just pick one arbitrarily.

EM: pros and cons
- Some good things about EM: no learning-rate parameter; very fast in low dimensions; each iteration is guaranteed to improve the likelihood; adapts unused units rapidly.
- Some bad things about EM: can get stuck in local maxima; ignores model cost (how many classes?); both steps require considering all explanations of the data (all classes).

Semi-supervised document classification
- Training data: some web pages with class labels (the user says they are interesting or uninteresting), plus many web pages the user hasn't seen or said anything about — available at training time, but without class labels.
- Can we use the unlabeled documents to increase accuracy?
- Procedure: build a classification model using the limited labeled data; use the model to estimate the labels of the unlabeled documents; then use all the documents to build a new classification model, which is often more accurate because it is trained using more data.

An example: [slide shows a toy corpus with labeled documents about Baseball and Ice Skating, plus unlabeled documents mentioning Tara Lipinski; after EM, Pr(Lipinski | Ice Skating) is much larger than Pr(Lipinski | Baseball), even though "Lipinski" never occurred in the labeled documents.]

WebKB data set: 4 classes, 4,199 documents from CS academic departments.

Word-vector evolution with EM: [slide shows the highest-probability words in each class's multinomial over successive EM iterations, converging toward coherent course-related word lists such as "intelligence, artificial" and "homework, assignment, exam, handout"; "D" in the lists stands for a digit.]

EM as clustering: [scatter-plot figure of EM clustering unlabeled points.]
EM as clustering gone wrong: [scatter-plot figure of EM converging to a poor solution.]

20 Newsgroups data set: 20 class labels, 20,000 documents, 62k unique words.

Newsgroups classification accuracy, varying the number of labeled documents: [figure plots accuracy against the number of labeled documents (10 to 5,000), with two curves: one using 10,000 additional unlabeled documents and one using no unlabeled documents.]
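Putting the recipe together: below is a sketch of EM for a mixture of multinomials over unlabeled documents, following the initialization, M-step, E-step, and convergence test described in the EM recipe above. The Laplace smoothing, tolerance, and toy corpus are illustrative choices, not prescribed by the lecture.

```python
import math, random
from collections import Counter

def em_cluster(docs, num_classes, tol=1e-4, seed=0):
    """EM for unlabeled documents. docs: list of token lists.
    Returns P(c|d) for every document."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    # Initialization: random normalized P(c|d); uniform P(c).
    pcd = []
    for _ in docs:
        r = [rng.random() for _ in range(num_classes)]
        pcd.append([x / sum(r) for x in r])
    pc = [1.0 / num_classes] * num_classes
    pwc = None
    while True:
        # M-step: maximum-likelihood (Laplace-smoothed) estimates of P(w|c)
        # and P(c), using the current P(c|d) as fractional label counts.
        new_pwc = []
        for j in range(num_classes):
            counts = Counter()
            for d, p in zip(docs, pcd):
                for w in d:
                    counts[w] += p[j]
            total = sum(counts.values())
            new_pwc.append({w: (1 + counts[w]) / (len(vocab) + total) for w in vocab})
        pc = [(1 + sum(p[j] for p in pcd)) / (num_classes + len(docs))
              for j in range(num_classes)]
        # Converged when the maximum change in any parameter P(w|c) is small.
        if pwc is not None and max(abs(new_pwc[j][w] - pwc[j][w])
                                   for j in range(num_classes) for w in vocab) < tol:
            return pcd
        pwc = new_pwc
        # E-step: recompute P(c|d) exactly as at test time (in log space).
        for i, d in enumerate(docs):
            logp = [math.log(pc[j]) + sum(math.log(pwc[j][w]) for w in d)
                    for j in range(num_classes)]
            m = max(logp)
            probs = [math.exp(l - m) for l in logp]
            pcd[i] = [p / sum(probs) for p in probs]

docs = [["ice", "skate"], ["ice", "rink"], ["bat", "ball"], ["pitch", "ball"]]
for d, p in zip(docs, em_cluster(docs, num_classes=2)):
    print(d, [round(x, 2) for x in p])
```

As the remarks above note, different random initializations (the `seed` argument here) can land in different local maxima — the "EM as clustering gone wrong" situation.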