Class Note for CMPSCI 585 at UMass(4)
Class Note for CMPSCI 585 at UMass(4)
Popular in Course
Popular in Department
This 19 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 14 views.
Reviews for Class Note for CMPSCI 585 at UMass(4)
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 02/06/15
What is Information Extraction Information Extraction Coreference and Relation Extraction AS a family information Extraction Lecture 22 of techniques segmentation classification association clustering Introduction to Natural Language Processing October 14 2002 400 am PT CMPSCI 585 Spring 2004 For yearsMicr9 2f1 2rmLati2n QEQ Elli Un versily olMassachusetfs Amherst 9315 railed against the economic philosophy Microsoft Corporation f of open source software with Orwellian fervor 3 g 2 denouncing its communal licensing as a Bill Gates cancer that stifled technological innovation g g 2 2 Lu Today W claims to quotlovequot the open source concept by which software code is MlcrOSO 3 made public to encourage improvement and Gates 2 development by outside programmers 9315 8 n 3 himself says Wwill gladly disclose its V U gt Lquot crown jewels the coveted code behind the l eg e a Windows operating system to select MICfOSO E customers a JJ Andrew McCalum quotWe can be open source We love the concept g g a of shared sourcequot said muggmg a Rlchard Stanman i3 g g 39 quotTh t39 39 it t h39ft iii for us in te Is ofiosd 222slTpo an s I founder I 5 Frat Qn 39warp Foundation 55 55 g Bichand tallman founder of the lite Winn countered saying IE in Context Main Points Createontoiogy Coreference How to cast as classification Cardie Measures of string similarity Cohen Scaling up McCallum et aI Spider D Filter by relevance quotE D D D D D D Segment Classify Associate Cluster Relation extraction V th augmented grammar Miller et al 2000 V th joint inference Roth amp Yih Load DB Train extraction models Query Search Label training data Data mine SemisuperVised Brin 3 4 Coreference Resolution InSIde the Traditional Solution AKA quotrecord linkagequot quotdatabase record deduplicationquot quotcitation matchingquot quotobject correspondencequot quotidentity uncertaintyquot Pa39r39w39se Aff39mty metr39c Mention 3 Mention 4 In ut Out it YIN p p MrPowell Powell News article Number of entities N 3 with namedentity quotmentionsquot tagged 1 Today Secretary of State Colin Powell Y one word In common 13 met with gcretary Of State collquot Powell Y quotNormalizedquot mentions are string identical 39 bdrid l Zz h e ML Powell Y Capitalized word in common 17 II I I ivir39f o39w39oii39 he 39 Powell Y gt 50 character quotquot939am quotequotap 19 sixtiogiaont39ons39TfYY I39I 2 Y Insame sentence 9 B h ice Condoleezza Rice Y Within two sentences 8 Y quotHobbs Distancenlt3 3 President Bush Eggu39 atcms 1399 Bus OVERALL SCORE 98 gt threshold0 5 6 Noun Phrase Coreference Identify all noun phrases that refer to the same entity Queen Elizabeth set about tmnsforming her husband King George VI into a viable monarch Logue a renowned speech therapist was summoned to help the King overcome his speech impediment Noun Phrase Coreference Identify all noun phrases that refer to the same entity Queen Elizabeth set about tmnsforming her husband King George VI into a viable monarch Logue a renowned speech therapist was summoned to help the King overcome his speech impediment Noun Phrase Coreference Identify all noun phrases that refer to the same entity Queen Elizabeth set about tmnsforming her husband King George VI into a viable monarch Logue a renowned speech therapist was summoned to help the King overcome his speech impediment Noun Phrase Coreference Identify all noun phrases that refer to the same entity Queen Elizabeth set about tmnsforming her husband King George VI into a viable monarch Logue a renowned speech rherapist was summoned to help the King overcome his speech impediment IE Example Input Text SAN SALVADOR 15 JAN 90 ACANEFE TEX39H ARMANDO CALDERON SOL PRESIDENT OF THE NATIONALIST REPUBLICAN ALLIANCE ARENA THE RULING SALVADORAN PARTY TODAY CALLED FOR AN INVESTIGATION INTOANY POSSIBLE CONNECTION BETWEEN THE MILITARY PERSONNEL IMPLICATED IN THE ASSASS NATION OF JESUIT PRIESTS quotIT IS SOMETHING SO HORRENDOUS SO MONSTROUS THAT WE MUST INVESTIGATE THE POSSIB LITY THAT THE FMLN FARABUNDO MARTI NATIONAL LIBERATION FRONT STAGED THESE MURDERS TO DISCREDIT THE GOVERNMENTquot CALDERON SOLSA D SALVADORAN PRESIDENTALFREDO CRISTIANI IMPLICATED FOUR OFFICERS INCLUDING ONE COLONEL AND FIVE MEMBERS OF THE ARMED FORCES IN THE ASSASSINATION OF SIX JESUIT PRIESTS AND TWO WOMEN ON 16 NOVEMBER ATTHE CENTRAL AMERICAN UNIVERSITY H IE Example Output Template DATE LOCATION TYPE STAGE OF EXECUTION INCIDENT CATEGORY PERP INDIV DUAL D mwbu m 7 PERP ORGANIZATION ID 8 PERP CONF DENCE 9 HUM TGT DESCRIPTION 10 HUM TGT TYPE 11 HUMTGT NUMBER 12 EFFECT OF INCIDENT 16 NOV 90 EL SALVADOR CENTRALAMERICAN UNIVERSITY MURDER ACCOMPLISHED TERRORIST ACT quotFOUR OFFICERSquot quotONE COLONELquot quotFIVE MEMBERS OF THE ARMED FORCESquot quotARMED FORCESquot quotFMLNquot REPORTED AS FACT ACCUSED BY GOVT quotJESUITSquot CIV ILIAN JESU IT quot 2 quotWOMENquot DEATH quotJESUITSquot DEATH quotWOMENquot IE Example Coreference SAN SALVADOR 15 JAN 90 ACANEFE TEXT ARMANDO CALDERON SOL PRESIDENT OF THE NATIONALIST REPUBLICAN ALLIANCE ARENA THE RULING SALVADORAN PARTY TODAY CALLED FOR AN INVESTIGATION INTOANY POSSIBLE CONNECTION BETWEEN THE MILITARY PERSONNEL IMPLICATED IN THE ASSASS NATION OF JESUIT PRIESTS quotIT IS SOMETHING SO HORRENDOUS SO MONSTROUS THAT WE MUST INVESTIGATE THE POSSIB LITY THAT THE FMLN FARABUNDO MARTI NATIONAL LIBERATION FRONT STAGED THESE MURDERS TO DISCREDIT THE GOVERNMENTquot CALDERON SOLSA D SALVADORAN PRESIDENTALFREDO CRISTIANI IMPLICATED FOUR OFFICERS INCLUDING ONE COLONEL AND FIVE MEMBERS OF THE ARMED FORCES IN THE ASSASSINATION OF SIX JESUIT PRIESTS AND TWO WOMEN ON 16 NOVEMBER ATTHE CENTRAL AMERICAN UNIVERSITY Why It s Hard Many sources of information play a role head noun matches IBM executives the executives syntactic constraints John heiped himseifto IJ thhh helped him to number and gender agreement discourse focus recency syntactic parallelism semantic class world knowledge Why It s Hard No single source is a completely reliable indicator number agreement he assassination these murders Identifying each of these features automatically accurately and in context is hard Coreference resolution subsumes the problem of pronoun resolution A Machine Learning Approach Classi cation given a description of two noun phrases NP and NP classify the pair as coreferenf or not coreferent coref 2 coref 2 IE Queen Elizabeth set about transforming her husband I t not coref 2 I394 McCarthy 3 Lennen i995 A Machine Learning Approach Clustering coordinates pairwise coreference decisions Queen Elizabeth Queen Elizabeth coref her Queen Elizabeth set about transforming King George VI m h 1 39 husband coref er King Gmgg VI husband mng not coref his Logue Lugue a lennwned speech lhemplsl Machine Learning Issues Training data creation Instance representation Learning algorithm Clustering algorithm Supervised Inductive Learning Examples ofN39P pairs features class ML Algorithm 1 novel Pair of NPS Concept description features progmm label Training Data Creation Creating training instances texts annotated with coreference information one instance IVLYMNPI NPj for each pair ofNPs assumption NP precedes NP feature vector describes the two NPs and context ciass vaiue mref pairs on the same coreference chain not mref otheiWise 2ti Instance Representation 25 features perinstance e lexical3 string matchingrorpronounspropernarheseorhrhon nouns e grammaticali8 pronoun demonstrative the this inde nite itis raining numoer genoer animaev apposinve george the hingpreoieate nominative a horse isamammal bindingcunstrairtts simplecurttrarindaingconstraints spannaxirhainp e sernantic2 samevvoroneteiass aiias e positionaim oistanee betweenthe NPs interms orxor sentences 7 knowledgebased naive pronoun resolution algorithm Learning Algorithm 0 RIPPER Cohen 1995 045 Quinlan 1994 rule learners input set oftraining instances output coreference classi er Learned classifier input test instance represents pair ofNPs output classi ca ion con dence of classi cation Clustering Algorithm 0 Bestfirst singlelink clustering 7 Mark each NP as belonging to its own class NP E 7 Proceed through the NPS in lefttoright order For each NP NPj create test instances imtNP NPJQ for all ofits preceding NPs NP Select as the antecedent for NPj the highestcon dence coreferent NP NPf according to the coreference classi er or none if all have below 5 con dence Merge Cj and Sf 23 Evaluation MUG6 and MUO7 coreference data sets documents annotated wrt coreference 30 30 training texts dry run 30 20 test texts formal evaluation scoring program recall precision Fmeasure 2PRPR System output 24 Baseline Results Problem 1 Coreference is a rare relation skewed class distributions 2 positive instances remove some negative instances farthest antecedent 26 Worst lllllC System 36 44 40 52 5 21 4 Best lllllC System 59 72 65 561 688 618 25 Problem 2 Coreference is a discourselevel problem different solutions for different types ofNPs proper names string matching and aliasing inclusion of hardquot positive training instances positive example selection selects easy positive training instances cf Harabaglu eta 2001 Queen Elizabeth set about tmnsforming her llll tI m 3 I King George VI into a viable monarch Logue I I I the renowned speech therapist Was summoned to help J meleg overcorne Problem 3 Coreference is an equivalence relation loss oftransitivity need to tighten the connection between classi ca ion and clustering prune learned rules wzt the clusteringlevel corelerence scoring function coref coref l Queen Elizabeth set about tmnsforming her husband l 1 not coref 23 Results NEGSELECT 465 678 562 374 597 460 POSSELECT 53 808 641 41 1 780 538 NEGSELEmPOSSELECT 634 763 693 595 561 E72 Ultimately large increase in Fmeasure due to gains in recall 29 Comparison with Best MUC Systems MUC 6 MUC 7 R P F R P F NEGSELECTPOSSELEmRULESELECT 633 769 695 542 763 634 EestllllJCSystem 59 72 65 561 688 613 an Supervised ML for NP Coreference cood performance compared to otner systems but lots ofroom for improvement 7 Common nouns lt pronouns lt proper nouns e Tighter connection between classification and clustering l5 possible Rich Caruana s ensemble methods Statistical methods for learning probabilistic relational models cetoor et ai 2001 Lafferty et al 2001 Taskar et ai 2003 McCallum and Wellner 2003 7 Need additional data sets New release of ACE data from Penn s LDC General problem reliance on manually annotated data at Main Points Coreference How to cast as classification Cardie Measures of string similarity Cohen Scaling up McCallum et al Relation extraction V th augmented grammar Miller et al 2000 V th joint inference Roth amp Yih Semisupervised Brin 32 Record linkage definition Record linkage determine if pairs of data records describe the same entity e nd record pairs that are coreferent Entities usually people or organizations or Data records names addresses job titles birth dates Main applications Joining two heterogeneous relations Removing duplicates from a single relation 33 Record linkage terminology The term record linkage is possibly co referent with For DB people data matching mergepurge duplicate detection data cleansing ETL extraction transfer and loading de duping For AlML people reference matching database hardening object consolidation In NLP coreferenceanaphora resolution Etoltal 3 matching cluotelmd lantiuuue ITIDIJE ill IC Finding a technical paper 0 1995 Start with citation quot Experience Vilith aLeaming Personal Assistantquot TM Mitche11R CaruanaD Freitag McDennott andD Zabowski Communications 0fch ACM Vol 37No 7pp 8191u1y1994 Find authors institution w INSPEC Find web host w NETFIND Find author s home page and hopefully the paper by browsing 35 The data integration problem infirm hm lls itHlllnl rantMiniquot mmputrr srit uru in Illlivt l39 ll y nl39 mlil mm liitvnl tt 1 ng rmtmtnmltmlu mmpulvr sritriiri it li rilllr39le staiil nrtl lilliu rsi rulil nrnizt lwsi ncl Dtpl ol L39umpui Sci Crilil39nrni lfni ullnil gn Lu Jnllu CA L min alt INsr39ttcl Dr l annnlpul 39 Slniil39nrtl Unit IA USA as String distance metrics overview Termbased eg TFIDF as in WHIRL Distance depends on set of words contained in both 5 and t Editdistance metrics Distance is shortest sequence of edit commands that transform 5 to t Pair HMM based metrics Probabilistic extension of edit distance Other metrics 37 String distance metrics termbased Termbased eg TFIDF as in WHIRL Distance between 5 and tbased on set of words appearing in both 5 and t Order of words is not relevant Eg Cohen Williamquot VWIiam Cohenquot and James Joyce Joyce Jamesquot Words are usually weighted so common words count less Eg Brown counts less than Zubinsky Analogousto FelligiSunter s Method1 BE Jaccard Distance en V lliam Cohen University en n V lliam Cohen E j quoti ll qlw Jaccard Score 39 String distance metrics termbased Advantages Exploits frequency information Ef ciency Finding t simtsgtkis sublinear Alternative word orderings ignored William Cohen vs Cohen V liam Disadvantages Sensitive to spelling errors William Cohen Sensitive to abbreviations Univ vs University Alternative word orderings ignored James Joyce vs Joyce James City National Bank vs National City Bank 4n String distance metrics Levenshtein Editdistance metrics Distance is shortest sequence of edit commands that transform 5 to t Simplest set of operations Copy character from s over to t Delete a character in 5 cost 1 Insert a character in tcost 1 Substitute one character for another cost 1 This is Levenshtein distance 41 Levenshtein distance example distance V iam Cohen Willliam Cohon 5 WI LLgaplAMCOHEN alignment tWI LLLAMCOHON lll OPCCCCICCCCCCCSC 50500001111111122 42 Computing Levenshtein distance 1 Dij score of best alignment from svi to t1tj Diljl ifsitj copy Di1j11 if silaj vubvtitute mm Di1j1 imerf Dij11 delete 43 Computing Levenshtein distance 2 Dij score of best alignment from svi to t1tj Di1j1 dsitj vubvtcopy min Di1j1 imert Dij11 delefe simplify by letting dcd0 if cd 1 else also let Di0i for i inserts and D0jj Computing Levenshtein distance 3 Di1j1 dsitj vubvfcopy Computing Levenshtein distance 4 Di1j1 dsitj vubvfcopy Dijmin Di1j1 inverf Dij11 delefe H E A trace indicates 3 Where the min value came from and can be used to nd edit operations andor a best alignment may be more than 1 3 46 1314an Di1j1 i39mert Dij11 delete c o H E N M 1 2 3 4 5 c 1 2 3 4 5 c 2 2 3 4 5 o 3 2 3 4 5 H 4 3 2 3 4 N 5 4 3 3 3 DW 45 NeedlemanWunch distance Diljl vubvfcopy Dij min Di1j imert Dij1 delefe G gap cost dcd is an arbitraty distance function on characters e g related to typo frequencies William Cohen Wukkudm Cigeb amino acid substitutibility etc 47 SmithWaterman distance Instead of looking at each sequence in its entirety this compares segments of all possible lengths and chooses whichever maximise the similarity measure Thus it is a generalization of longest common subsequence For every cell the algorithm calculates all possible paths leading to it These paths can be of any length and can contain insertions and deletions 43 SmithWaterman distance 0 start over Di1j1 dsitj rubvtcopy 1314 max Di1j G r39mert Dij1 G delete C O H E N G 1 M o o o o o dw 2 c 2 0 o o o 0 0 0 0 dcd 1 c 2 O 0 4 3 0 0 H 0 3 6 5 3 N 0 2 5 5 7 49 SmithWaterman distance Monge amp Elkan s WEBFIND 1996 mlerliel Imsi trislrlulmii rsiicsdedu computer science department university oi r39nlii orniii sari lingo cssluiilbi39dedu coiiipulei39 science department stanl39otd university palu alto California IVSPEC Dept oi iiomput Sci 39faliibrnin Univ San Diego La Julia TA USA INSPEC Dept oi39Camput Sci sonnetti L39uiv CA USA Table i Exmnpie ul NETFIND and lNSPEL llcltls SmithWaterman distance in Monge amp Elkan s WEBFIND 1996 Used a standard version of SmithWaterman with handtuned weights for inserts and character substitutions Split large text fields by separators like commas etc and found minimal cost over all possible pairings of the subfields since SW assigns alarge cost to large tmnspositions Result competitive with plausible competitors 51 Results SW from Monge amp Elkan Saw a UCS t t ruminant 52 Affine gap distances SmithWaterman fails on some pairs that seem quite similar William W Cohen William W Don t call me Dubya ohen Intuitively single long insertions are cheaper than a lot of short insertions Affine gap distances 2 Idea Current cost of a gap of 11 characters nG Make this cost A n1B where A is cost of opening a gap and B is cost of continuing a gap Affine gap distances 3 Di1j1 dsitj Dij max ISI1j1 dsitj ITI1j1 dsitj 13071 max DiIlj A Best score in which 5139 IS1 l B is aligned with a gap Best score in which tj quot D1J1A Ll max id4 B IS aligned With a gap 55 Affine gap distances as automata B A 3 4m Generative version of af ne gap automata BilenkoampMooney TechReport 02 HMM emits pairs ad in state M pairs 5 in state D and pairs d in state I For each state there is a multinomial distribution on pairs The HMM can tmined with EM from a sample of pairs ofmatched strings gt Estep is forwardbackward Mstep uses some ad hoe smoothing W 56 Affine gap editdistance learning experiments results Bilenko amp Mooney Table 2 sin plu mum l39ccuizlk ll39oln m RizsmuruNr Mum munc M m vlmic min Squ mo Dell l ml m n mm M rm m lzmumlm Dcllmlmxm semi u muc Dell no Scmmt me New m CH lzmuxm Dun mm Smublcdulvllcvlc letmus rm lle M nuns mum m mm unless my m4 hml n ix llimn 3 l tmmu 1M hurlng mm x we rmnni u Experimental method parse records into elds append a few key elds together sort by similarity pick a threshold T and call all pairs With distancev t lt T duplicates picking Tto maximize Fmeasure 58 Affine gap editdistance learning experiments results Bilenko amp Mooney nkmuc Rrsmunwrmmo I lAH CllNlllcln 0870 084 Learned Levenshtem 1 001 0886 Af ne H917 H883 Lcmucd Affine 0971 0967 Duluuuo rucll REsTAuRANTaddmss MAILlNQname l MAILINGaddress cvcmlvlmn 0950 0867 Learned Lev U 975 0899 Af ne l 870 092 Lcai ncd A i 0929 0959 Affine gap editdistance learning experiments results Bilenko amp Mooney Marislnv c as 055 G7 m5 0 was Us 095 Precisionrecall for MAILING dataset duplicate detection EU Affine gap distances experiments from McCallumNigamUngar KDDZOOO Goal is to match data like this Fahlman Scoit x LehiereWl9 The cascade correlaiion learning architect Tourerzky D editor Adian s in Neural Information Processing Systems volume 2 pp 524 532 San Mateo CA Morgan Kanrrnann Fanlman SE and Lehiere Th Cascade Correlation Learning Architecture NLPs 2 pp 532 Morgan Kantmann 1990 Fanlmann s E and Lehiere c 1989 The cascade correlation learning architecture n ces in Neural In formation Processing Systems mm 2 Denver Colorado Figure 2 Three sample citations to the same paper Affine gap distances experiments from McCallumNigamUngar KDDZOOO Handtuned edit distance Lower costs for affine gaps Even lower cost for affine gaps near a HMMbased normalization to group title author booktitle etc into fields 62 Affine gap distances experiments TF DF Edit Adaptive Distance Cora 0751 0839 0945 0721 0964 OrgNamel 0925 0 633 0 923 0 366 0950 0 776 Orgl iameZ 0958 0 571 0958 0 778 0 912 0984 Restaurant 0 981 0 827 1000 0967 0 867 0 950 Parks 0 976 0 967 0984 0967 0 967 0967 W as String distance metrics outline Termbased eg TFIDF as in VVlllRL Distance depends on set of words contained in both 5 and t Editdistance metrics Distance is shortest sequence of edit commands that transform 5 to t Pair HMM based metrics Probabilistic extension of edit distance 39 Other metrics Jaro metric Jaro metric is apparently tuned for personal names Given st de ne cto be common in st if it sic tjc and i jiltmrnrisr2 De ne cdto be a transposition if cdare common and Cd appear in different orders in s and t Jarost average of commons commonI and 05transposiiionscommon Variant weight errors early in string more heavily Easy to compute note edit distance is Ost NB Tnis l5 my interpretation ofWinkler s description Jaro metric w i a L A it w 1 ti a i n n a l n n n i n 1 i n n m i n ii i L i u r u n u L i i r 1 El n u i i r 0 1 r i r o i i n i i i n i n n t tliiisrmvrur mire inmiomir Darn mm neon riwmnm meant m e Hm imiiinnwiniuiniine iniiiiiniarii niainiiirriiioiw inninnniii m ilwnr l39lltl IIl Jumsl m rummou iii u and l y Iquot In em TN 7 no Mtranspnsrtmni for 5 and r1 66 Soundex metric Soundex is a coarse phonetic indexing scheme widely used in genealogy Every Soundex code consists ofa letter and three numbers between 0 and 6 eg B536 for Bender The letter is always the rst letter ofthe surname The numbers hash together the rest ofthe name Vowels are generally ignored eg Lee Lu gt LOOO Later later consonants in a name are ignored Similarsounding letters eg B P F V are not differentiated nor are doubled letters There are lots of Soundex variants 67 Ngram metric Idea split every string 5 into a set of all character n grams that appear in s for nltk Then use tenn based approaches eg COHEN gt 00HENC00HHEENCOHOHEHEN For n4 or 5 this is competitive on retrieval tasks It doesn t seem to be competitive with small values ofn on matching tasks but it s useful as a fast approximate matching scheme 68 Main Points Coreference How to cast as classification Cardie Measures of string similarity Cohen Scaling up McCallum et al Relation extraction V th augmented grammar Miller et al 2000 V th joint inference Roth Semisupervised Brin ES Reference Matching Fahlrnan Scott amp Lebiere Christian 1989 The cascadencorrelation learning architecture in Touretzky D eortor Advances rn Neural information Processing Systems volume 2 pp 524532 San Mateo CA Morgan Kaufmann Fanlrnan S E and Leblere C Tne Cascade Correlation Learning Arcn1tecture NlPSVol 2 pp 524632 Morgan Kaufrnann1990 Fahlrnan s E 1991The recurrent cascadencorrelationlearning architecture in Lipprnan R P Moody J E and Touretzky D s editors NlPS 31907205 7n The Citation Clustering Data Over 1000000 citations About 100000 unique papers About 100000 unique vocabulary words Over 1 trillion distance calculations 71 The Canopies Approach Two distance metrics cheap amp expensive First Pass very inexpensive distance metric create overlapping canopies Second Pass expensive accurate distance metric canopies determine which distances calculated 72 Illustrating Canopies 7s Creating canopies with two thresholds Put all points in D Loop Pick a point X 39om D Put points within Kloose ofX in canopy Light Remove points wi hin Ktight ofX 39om D 39 75 Overlapping Canopies Using canopies with Greedy Agglomerative Clustering Calculate expensive distances between points in the same canopy All other distances default to infinity Sort nite distances and iteratively merge closest 7e Computational Savings inexpensive metric ltlt expensive metric canopies per data point f small but gt 1 number of canopies 0 large complexity reduction 77 The Experimental Dataset All citations for authors Michael Kearns Robert Schapire Yoav Freund 1916 citations 121 unique papers Similar dataset used for parameter tuning 7E Inexpensive Distance Metric for Text Wordlevel matching TFIDF Inexpensive using an inverted index aardvark ant 39Lm ZOO 79 Expensive Distance Metric for Text Stringeditdistance S e c a Compute with Dynamic Programming 000001quot 3 Costsforcharacter 5 11 14 1 Z serf39m c 14 07 10 14 1 eletion substitution 0 21 11 1 2 t 28 14 21 t 35 18 24 doFalilmanvsFalman Eu Extracting Fields using HMMs Fahlman SE and Lebiere C The Cascade Correlation Learning Architecture NIPS Vol 2 pp 524532 Morgan Kaufmann 1990 Author Faniman s E and Leoiere c Title The Cascade Correlation LeamingArchitecture Venue NiPS Year 1990 B1 Experimental Results F1 Minutes Canopies GAC 0838 765 Complete GAC 0835 13409 Existing Cora 0784 003 AuthorYear 0697 003 Add precision recall along side F1 132 Main Points Coreference How to cast as classification Cardie Measures of string similarity Cohen Scaling up McCallum et al Relation extraction With augmented grammar Miller et al 2000 V th joint inference Roth amp Yih Semisupervised Brin as Information Extraction Named Entity Recognition INPUT Pro ts soared a1 Boeing Co easily topping forecasts on Wall Street as their CEO Alan Mulally announced rst quaner results OUTPUT Proms soared at iCompany Boeing Co easily topping forecats on iLooarion Wall Street as their CEO iPerson Alan Mulallyj announced rst quarter results Relationships between Entities INPUT Boeing is located in Seattle Alan Mulally is the CEO OUTPUT Relaiorship Companylocaion Relationship EmployerrEmployee Company Boeing Employer Boeing Co Looarion Seariei Employee Alan Mulally Extraction From Entire Documents Hi PERSDN Ted and PlazSON Hill Inst a reminder that the game move will need to be entered I39JMEtonight We will need data on operations rawmaterials ordering and details of the bond to be sold PERSON Hill Iwill be in the H u ll ll m lobby after the class at Two pm how about we meet in the t in l I 39s lobbyt around that time ie when both our classes are over PERSON Ted Let me know how you are going to provide the bond related input information We tan either meet in the l 9 2mm lobbyl around TIME 530 pm or you can email me the info Thanks PERSON Ajay TIME 9 pm 18th September TIME 530 pm 18th September LOCATION Lobby Building N39EAS LOCATION Lobby Building NE43 PERSON David Hill Ajay Sinclair PERSON Ted Jones Ajay Sinclair TOPIC data on operations TOPIC bond related input information 10TH DEGREE is a fttll service advertising agency specializing in direct and interactive marketing Iocated in Irvine CA 10TH DEGREE is looking for an Assistant Interactive Account Manager to help manage and coordinate interact ve marketing initiatives for a marquee automotive account Experience in online marketing automotive andor the advertising agency eld is a plus Assistant Account Manager Responsibilities Ensures smooth implementation of programs and initiatives Helps manage the delivery of projects and key client deliverables Compensation 50000 7 0000 Hiring Organization 10TH DEGREE Principals only Recruiters please don39t contact this job poster Please no phone calls about this job Please do not contact job poster about other services products or commercial interests Reposting this message elsewhere is NOT OK this is in or around Orange County Irvine U INDUSTRY Advertising POSITION Aistant Account Manager LOCATION Irvine CA COMPANY 10th Dgree SALARY 50XJO 80XJO Relationship Extraction Miller et a zone An example Donald M Goltktein a historian at the University of Pittsburgh o Entity information to be extracted Named entity boundaries Organizaiors people and locations Person descriptors a historian a the University ofPittsburghquotrefers to Donald M Goltkteinquot Entity relationships to be extracted EmployerEmployee relaiors eg Galdrtein is employed at University lifPiImburgh Companyproduct relations Organizaionheadquartersrlocation relation Relationship Extraction Annotation Another example Nance who is a paid consultant to ABC News said e The following information was annotated mi r as a person ABC News as an organization cpoid consultant to ABC News as a descriptor A Carecreme link between Junk and upaid consultant I0 ABC News An emplayererelatilm link from it pm39d Comldlant to ABC News to ABC News WI 39lnlitl haw inn vu hlld l modal with h rum WK lhk hlurlnll illl The Basic Approach Build astatistical parsing model which simultaneously recovers syntactic relaion and the information extraoion informaion To do this Step 1 annotate training sentences for entities descriptors coreference links and relaion links Step 2 train a parser on the Penn treebank and apply it to the new training sentences Force the parser to pmduoe parses that are consistent with the entitydescriptor etc boundaries Step 3 enhance the parse trees to include the information extraoion informaion we39 11 come to this soon 0 Step 4 mtnu39n the parser on the new training data and with the new annotations ill ill SEAR llill at ti w who A VEZ ill l is ill ll TO ill DT VEN NN t l l l to illll lllll a paid consultant l l ABC News NP39 NPquot SEAR my late v71 w who VEZ NP ls A PF NP i TO NP u DT VEN NN l l l t a pawl consultant NNl In NNl Zu ABC News 39rlltl unurlllu Illux xllmun IlIL U ulli 6 m organizaion m person mg I organizaion reportablequotoomplae n person ieportable complae N Hper NPpen SEA R NNl Dper late w w who VEZ NP u ls NP PF DT VEN NN T O NPorgquot l l l a pald consultant NNl Worg NNl Iorg ABC News ltld servantt Hp dummy Intrpm pvt mu erson descriptor m one 1 person descriptor reportable complete NPper NHpenr SEAerrlk w NNliZper SEAR Nailoe WHNP Wm u r we W hO VEZ NPperrdescrr ls N Hpendesc pp TO NPorgrr DT VEN NN l l l l to a pawl consultant NNTorg NNl jlorg A E c News mr Meme ap gum lml mum m nL Ila lemuuu m 1 nl persondescriptor link m 1n 1 pelsowdescnpm pointer N Hper Nope SEA ernkperrdeserof NNliZper SEA Rpairdeserpttr rule WHNP VPpardeserpttr we Wiho VEZ NPperrdeserr N Hpendesc Plllnk l PP DT VEN NN l l l TO NPorgrr a pald consultant 0 NNPorg NNPorg l l nJll luak 39xthum uupllgn Hxlp l39u I IIIJIUII ABC News employeeroflink m m employeeof pointer N NP SEA ernkperrdeserof NNiz SEA Rpeirdescrpttr rule WHNP VPperrdeserptr w p lilo VEZ NPperrdeserr s NHpendesc Plilnkanlncy PPoligrptr DT VEN NN l at no NP 0 NNP NNP Aisc Neiws W H Hquot ulm PmSON desaiptot link DESCRIPIOR EMPLOYERVOF zeliuorl u cl u Building a Parser a We now have contextifree rules where each nonitemrirlal in the grammar has A e A semantic label A headiwordheaditag 1 l tll39rlzlsr 3 l per7descirconsultantNN 39l per7desccorlsultantNN l l IllkCmp0fIOTO It39s possible to modify syntactic parsers to estimate rule probabilities in this case Summary Goal build a parser that reoovers syntactic structure named entities descriptors and relations Annotation mark entity boundaries descriptor boundaries links between entities and descriptors Enriched parse trees given annotation and a parse tree form a new enriched parsetree The staistical model nonrterminals now include syntadic caegory semantic label head word head lg Rule probabilities are estimaed using similar methods to syntactic parsers Results precision 81 recall 64 in recovering relaions employeremployee companyproduct oompanyheadquarterslocation 1 Association with Graphical Models Capture arbitrarydistance Rmhg ym 20 dependencies among M M Predlctwns 1 Random variable PlEiDO overthe class of quotquot relation between entity 2 and 1 quot RM igbgfot39wfm39 Random variable over the class of entity 1 eg over person location new Local language models contnbute ellldence to entity classification Local language models contribute ellidence to relation classification Dependencies between classes of entities and relations piano lnrerence with loopy belier propagation Main Points Coreference How to cast as classification Cardie Measures of string similarity Cohen Scaling up McCallum et al Relation extraction V th augmented grammar Miller et al 2000 With joint inference Roth amp Yih Semisupervised Brin BE Also capture long distance dependencies among predictions Random variable overthe class of entity 1 eg over 1 Association with Graphical Models Roth ampYl39h 20172 ill M M Ralidom variable overthe class of relation between entity 2 and 1 Ru e9 overllvesltn m lsbossof person location n l l l lEtLX Ox L a o l 9 L4 m ocdelsa39c guntragbeme g m1 0 ellldenceto relation Kl classification L all V m zcdelsuc gumarglsme quot M evidencetio entity classification Dependencies between classes mm of entities and relations lnrerence with loopy belier propagation m 1 Association with Graphical Models Also capture long distance lRothampWh 2002 dependencies among Pred39c quot5 l Random variable l lEil O overthe class of relation between Random variable over the class of entity 1 eg over person location to 7 entity 2 and 1 t lt a U4 eg overlvesltri Local language lsbossof models contribute Local language euldencetio entity models contribute classification mm ellidence to relation classification Dependencies between classes orentities and relations a lnrerence with loopy belier propagation lDl Main Points Coreference How to cast as classification Cardie Measures of string similarity Cohen Scaling up McCallum et al Relation extraction V th augmented grammar Miller et al 2000 V th joint inference Roth amp Yih Semisupervised Brin lDz Partially Supervised Approaches to Relation Extraction a Last lecture introduced a partially supervised method for named entity classi cation a Basic observation redundancy in that either spelling or context of an entity is often suf cient to determine its type 0 Lead to cotmirlirtg approaches where two classi ers bootstrap each other from a small number of seed rules a Can we apply these kind of methods to relation extraction From Brin 1998 The World Wide Web provides a vast source of information of almost all types ranging from DNA databases to resumes to lists of favorite restaurants However this informaion is o en scatered among many web servers and hosts using many different formats If these chunks of information could be extracted from the World Wide Web and integraed into a structure form they would form an unprecedented source of informaion It would include the largest intemaional directory of people the largest and most diverse databases of products the greaest bibliography of amdemic works and many other useful resources From Brin 1998 For data we used a repository of 24 million web pages totalling 147 gigabytes This data is part of the Stanford WebBase and is used for the Google search engine BP and other research projects As a part of the search engine we have built an inverted index of the ertire repository The repository spars many disks and several machines It takes a considerable amount of time to make jist one pass over the data even without doing any substantial processing Therefore in these sic we only made pases oversubsets of the repository on any given iteraion BP Sergey Erin and Larry Page Google search engine http goog1e Stanford edu 0 From Brin 1998 authorsbookatitles data web data seeds are Isaac Asimov The Robots of Dawn David Brin Startide Rising Jams Gleik Chaos Making aNew Science Charles Dickens Great Expeaaiors William Shakespeare The Comedy of Errors DIPRE B in 1998 A pattern is a 5 tuple Order author preceding title or visa versa URLepre x a pre x of the URL of the page of the pattern a pre x up to 10 characters preceding the authortitle pair a middle the characters between the author and title c m ix up to 10 characters following the authortitle pair DIPRE Inducing Patterns from Data Find all instances ofseeds on web pages Basic question how do we induce patterns from these examples The Overall Algorithm 0 Answer Following procedure 1 Group all occurrences togetherwhich have the same wlues for order 1 Use the seed examples to label some data middle 2 For any gm 59 I VEix 0 be long common Pm x 0f the 2 Induce patterns from the labeled examples using method group39s URLs pre x to be the longest common pre x of the group u ix to be the longest common suf x 3 For each group39s patem calculaue its speci city as described on the previous slide 3 Apply the patterns to data to get a new set of authortitle pairs Spf l p illmirldlt HilrlrthY mllrttf7rllxilff where n is the number of examples in the group rl is the length of 4 Return to step 2 and iterate n in characers 4 1r speci city exceeds some threshold include the patem s Else Ifall patterns occur on same webpage reject the patem 5 Else create new subgroups grouped by characters in the urls which is one past urlrpre x and repeat the procedure in step 2 for these new subgroups DIPRE Inducmg Patterns from Data 1 Association using Parse Tree The patterns found m the rm iteration Simultaneously POS tag parse extract amp associate Miller et a 20W www sff netlocuec e ltLlgtltE aneE byeerhmlt dns cltynetcomlmannawards hugos1 984 htnl ltlrzr12l by eerhmlt 7mm suuaux dolphlnupenn edudcummlnstexts s awardhtn author H mlei e y W WM WMW j lnererse Since er parse ennstnutes In include o The 5 seeds produced 199 labeled rnsances grvrng the 3 patterns above enlltyznu remnnuns rmnern n Applying the three patterns gave 4047 new book insances r m t s s n an h mm mm mm cm mneme mnslnuem eatenmy Searching 5 million web pages gave 3972 occurrences ofthese books fir gggggmnm w wnm This gave 105 patems 24 applied to more than one URL I in New W s g 0 Applied to 2 million URLS produced 9359 unique authontitle pairs myquot ile eww Swimwman l f meow mimeter roerlmnpenmmd 0 Manual inervention removed 242 bogus iems where the author was W a r WWW P W i39quotquot ll39 wquot l Conclusion i J new i J Li l o Frnal rterahon ran over 155000 documents whrch conarned the word rv m m r r K This is also a great example books Induced 345 pal Ber 15r257a 1h0rr epalr5 PTquot i V T m m quot m TM f of extraction using a tree rrmdel i m W t rim r rule with n or he on
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'