Class Note for CMPSCI 585 at UMass(13)
This 4 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 17 views.
Date Created: 02/06/15
Collocations
Lecture 5
Introduction to Natural Language Processing
CMPSCI 585, Fall 2004
University of Massachusetts Amherst
Andrew McCallum

Today's Main Points
- What is a collocation?
- Why do people care?
- Three ways of finding them automatically.

Words and their meaning
Some upcoming lectures:
- Word disambiguation: one word, multiple meanings.
- Word clustering: multiple words, same meaning.
- Collocations (this lecture): multiple words together, with a meaning different from the sum of its parts.
Simple measures on text, yielding interesting insights into language, meaning, and culture.

Collocations
An expression consisting of two or more words that corresponds to some conventional way of saying things.
Characterized by limited compositionality. (Compositional: the meaning of the expression can be predicted from the meaning of its parts.)
Examples:
- "strong tea"
- "rich in calcium"
- "weapons of mass destruction"
- "kick the bucket"
- "hear it through the grapevine"

Collocations are important for
- Terminology extraction: finding special phrases in technical domains.
- Natural language generation: to make natural output.
- Computational lexicography: to automatically identify phrases to be listed in a dictionary.
- Parsing: to give preference to parses with natural collocations.
- Study of social phenomena: like the reinforcement of cultural stereotypes through language (Stubbs 1996).

Contextual Theory of Meaning
In contrast with "structural linguistics", which emphasizes abstractions and properties of sentences, the Contextual Theory of Meaning emphasizes the importance of context:
- context of the social setting (not an idealized speaker)
- context of discourse (not a sentence in isolation)
- context of surrounding words
Firth: "a word is characterized by the company it keeps".
Example (Halliday):
- strong: tea, coffee, cigarettes
- powerful: drugs, heroin, cocaine
Important for idiomatically correct English, but also for the social implications of language use.

Method 1: Frequency
Most frequent bigrams by raw count; almost all are pairs of function words.

Count   Bigram
80871   of the
58841   in the
26430   to the
21842   on the
21839   for the
18568   and the
16121   that the
15630   at the
15494   to be
13899   in a
13689   of a
13361   by the
13183   with the
12622   from the
11428   New York
10007   he said

Method 1: Frequency with POS Filter
Keep only candidates whose part-of-speech pattern is one of: A N, N N, A A N, A N N, N A N, N N N, N P N (A = adjective, N = noun, P = preposition).

Count   Bigram            Pattern
11487   New York          A N
 7261   United States     A N
 5412   Los Angeles       N N
 3301   last year         A N
 3191   Saudi Arabia      N N
 2699   last week         A N
 2514   vice president    A N
 2378   Persian Gulf      A N
 2161   San Francisco     N N
 2106   President Bush    N N
 2001   Middle East       A N
 1942   Saddam Hussein    N N
 1867   Soviet Union      A N
 1850   White House       A N
 1633   United Nations    A N
 1328   oil prices        N N
 1210   next year         A N
 1074   chief executive   A N
 1073   real estate       A N

Method 2: Mean and Variance
Some collocations are not of adjacent words, but of words in a more flexible distance relationship:
- she knocked on his door
- they knocked at the door
- 100 women knocked on Donaldson's door
- a man knocked on the metal front door
Not a constant distance relationship, but enough evidence that "knock" is better than "hit", "punch", etc.

Sentence: Stocks crash as rescue plan teeters
Time-shifted bigrams:

Offset 1       Offset 2        Offset 3
stocks crash   stocks as       stocks rescue
crash as       crash rescue    crash plan
as rescue      as plan         as teeters

To ask about the relationship between "stocks" and "crash", gather many such pairs and calculate the mean and variance of their offset.
mean 0, variance 5

[Figure: histogram of the position of "strong" relative to "opposition"; x-axis: position, -4 to 4; y-axis: count, 0 to 80; mean = 1.15, deviation = 0.67]
[Figure: histogram of the position of "strong" relative to "support"; same axes; mean = 1.45, deviation = 1.07]
[Figure: histogram of the position of "strong" relative to "for"; same axes; mean = 1.12, deviation = 2.15]

Method 2: Mean and Variance (example data)
A low deviation indicates a fixed-distance relationship; a high deviation with a mean near 0 indicates no interesting relationship.

Deviation   Mean   Count   Word 1        Word 2
0.43        0.97   11657   New           York
0.48        1.83      24   previous      games
0.15        2.98      46   minus         points
0.49        3.87     131   hundreds      dollars
4.03        0.44      36   editorial     Atlanta
4.03        0.00      78   ring          New
3.96        0.19     119   point         hundredth
3.96        0.29     106   subscribers   by

Method 3: Likelihood Ratios
Determine which of two probabilistic models is more appropriate for the data.
- $H_1$: hypothesis of model 1
- $H_2$: hypothesis of model 2
Likelihood ratio: $\log \lambda = \log \frac{L(H_1)}{L(H_2)}$

Hypothesis 1 (independence): $P(w_2 \mid w_1) = p = P(w_2 \mid \neg w_1)$
Hypothesis 2 (dependence): $P(w_2 \mid w_1) = p_1 \neq p_2 = P(w_2 \mid \neg w_1)$

Data:
- $N$: total count of all words
- $c_1$: count of word 1
- $c_2$: count of word 2
- $c_{12}$: count of the bigram word 1 word 2

With the binomial $b(k; n, x) = \binom{n}{k} x^k (1 - x)^{n - k}$:

                                                         H1                            H2
$P(w_2 \mid w_1)$                                        $p = c_2 / N$                 $p_1 = c_{12} / c_1$
$P(w_2 \mid \neg w_1)$                                   $p = c_2 / N$                 $p_2 = (c_2 - c_{12}) / (N - c_1)$
$c_{12}$ out of $c_1$ bigrams are $w_1 w_2$              $b(c_{12}; c_1, p)$           $b(c_{12}; c_1, p_1)$
$c_2 - c_{12}$ out of $N - c_1$ bigrams are $\neg w_1 w_2$   $b(c_2 - c_{12}; N - c_1, p)$   $b(c_2 - c_{12}; N - c_1, p_2)$

$\log \lambda = \log \frac{L(H_1)}{L(H_2)} = \log \frac{b(c_{12}; c_1, p)\, b(c_2 - c_{12}; N - c_1, p)}{b(c_{12}; c_1, p_1)\, b(c_2 - c_{12}; N - c_1, p_2)}$

Method 3: Likelihood Ratio (example data)

$-2 \log \lambda$   $C(w_1)$   $C(w_2)$   $C(w_1 w_2)$   $w_1$          $w_2$
1291                12593      932        150            most           powerful
  99                  379      932         10            politically    powerful
  82                  932      934         10            powerful       computers
  80                  932     3424         13            powerful       force
  57                  932      291          6            powerful       symbol
  51                  932       40          4            powerful       lobbies
  51                  171      932          5            economically   powerful
  51                  932       43          4            powerful       magnet
  50                 4458      932         10            less           powerful
  50                 6252      932         11            very           powerful
  49                  932     2064          8            powerful       position
  48                  932      591          6            powerful       machines
  47                  932     2339          8            powerful       computer
  43                  932      396          5            powerful       magnets

Collocation studies helping lexicography
Want to help dictionary writers bring out the differences between "strong" and "powerful". Understand the meaning of a word by the company it keeps.
Church and Hanks (1989) concluded through statistical analysis that it is a matter of intrinsic vs. extrinsic quality:
- "strong" support from a demographic group means committed, but the group may not have capability.
- a "powerful" supporter is one who actually has the capability to change things.
There are also additional subtleties, which help us analyze cultural attitudes: strong tea versus powerful drugs.

Method 1: strong versus powerful
Nouns w occurring most often after "strong" and after "powerful".

w             C(strong w)   w           C(powerful w)
support       50            force       13
safety        22            computers   10
sales         21            position     8
opposition    19            men          8
showing       18            computer     8
sense         18            man          7
message       15            symbol       6
defense       14            military     6
gains         13            country      6
criticism     13            weapons      5
possibility   11            post         5
feelings      11            people       5
demand        11            forces       5
challenges    11            chip         5
challenge     11            nation       5
case          10            Germany      5
supporter     10            senators     4
signal         9            neighbor     4

Likelihood Ratios across different corpora from different times
Model 1: model for NYTimes 1989
Model 2: model for NYTimes 1990

[Table: bigrams with the most extreme likelihood ratios between the two years; the numeric columns did not survive extraction. Bigrams listed: Karim Obeid, East Berliners, Miss Manners, 17 earthquake, HUD officials, East Germans, Prague Spring]

1989:
- Muslim cleric Sheik Abdul Karim Obeid abducted
- disintegration of communist Eastern Europe
- scandal in HUD
- October 17 earthquake in San Francisco
- Miss Manners no longer carried by NYTimes in 1990
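
The two statistical methods in these notes can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names are made up for this example, the offsets for "knocked ... door" assume that "Donaldson's" is tokenized as two tokens, and the corpus size N in the usage lines is a hypothetical value, since the notes do not state it. The binomial coefficients are left out of the likelihoods because they cancel in the ratio.

```python
import math
from statistics import mean, stdev


def offset_stats(offsets):
    """Method 2: sample mean and sample deviation of signed word offsets."""
    return mean(offsets), stdev(offsets)


def log_binomial_kernel(k, n, x):
    """Return log(x^k * (1 - x)^(n - k)).

    The binomial coefficient is omitted on purpose: it is identical in
    L(H1) and L(H2), so it cancels in the likelihood ratio.
    Terms with a zero exponent are skipped so that 0 * log(0) counts as 0.
    """
    total = 0.0
    if k > 0:
        total += k * math.log(x)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - x)
    return total


def log_likelihood_ratio(c1, c2, c12, n):
    """Method 3: log lambda = log L(H1) / L(H2) for the bigram w1 w2.

    c1, c2 are the counts of w1 and w2, c12 the bigram count,
    n the total number of words (requires 0 < c1 < n).
    """
    p = c2 / n                    # H1: P(w2 | w1) = P(w2 | not w1)
    p1 = c12 / c1                 # H2: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)    # H2: P(w2 | not w1)
    log_h1 = (log_binomial_kernel(c12, c1, p)
              + log_binomial_kernel(c2 - c12, n - c1, p))
    log_h2 = (log_binomial_kernel(c12, c1, p1)
              + log_binomial_kernel(c2 - c12, n - c1, p2))
    return log_h1 - log_h2


if __name__ == "__main__":
    # Offsets of "door" from "knocked" in the four example sentences
    # (assumes "Donaldson's" splits into two tokens).
    print(offset_stats([3, 3, 5, 5]))

    # Counts for "most powerful" from the example table;
    # ASSUMED_N is a hypothetical corpus size, not taken from the notes.
    ASSUMED_N = 14_000_000
    print(-2 * log_likelihood_ratio(12593, 932, 150, ASSUMED_N))
```

Because p1 and p2 are the maximum-likelihood estimates under H2, log lambda is never positive, so -2 log lambda is non-negative and larger values indicate stronger evidence that the two words form a collocation.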