Class Note for CMPSCI 646 at UMass(9)
These 16 pages of class notes were uploaded by a notetaker on Friday, February 6, 2015. The notes belong to a course at the University of Massachusetts taught in Fall. Since upload, they have received 24 views.
Information Retrieval: Retrieval Models (Introduction, Boolean, and Language Models)
James Allan, University of Massachusetts Amherst
CMPSCI 646, Fall 2007
All slides copyright James Allan

What is a retrieval model?
- A model is an idealization or abstraction of an actual process (here, retrieval)
- Mathematical models are used to study the properties of the process, draw conclusions, make predictions
- Conclusions derived from a model depend on whether the model is a good approximation of the actual situation
- Statistical models represent repetitive processes, make predictions about frequencies of interesting events, and use probability as the fundamental tool

What is a retrieval model? (cont.)
- Retrieval models can describe the computational process
  - e.g., how documents are ranked
  - note that how documents or indexes are stored is implementation
- Retrieval models can attempt to describe the human process
  - e.g., the information need, interaction
  - few do so meaningfully
- Retrieval variables: queries, documents, terms, relevance judgments, users, information needs
- Retrieval models have an explicit or implicit definition of relevance

Models we'll consider
- Boolean (exact match)
  - exact-match retrieval
  - dominant model for years
  - still popular
- Statistical language models
  - ranking (best-match) model
  - current "hot" model for information retrieval
  - effective
- Will consider other models later

Exact vs. best match
- Exact match
  - query specifies precise retrieval criteria
  - every document either matches or fails to match the query
  - result is a set of documents (unordered in pure exact match)
- Best match
  - query describes a good or "best" matching document
  - every document matches the query to some degree
  - result is a ranked list of documents
- Popular approaches often provide some of each
  - e.g., some type of ranking of the result set ("best of both worlds")
  - e.g., a best-match query language that incorporates exact-match operators
Exact-match pros and cons
- Advantages of exact match
  - can be very efficiently implemented
  - predictable, easy to explain
  - structured queries for pinpointing precise documents
  - works well when you know exactly (or roughly) what the collection contains and what you're looking for
- Disadvantages of exact match
  - query formulation difficult for most users; difficulty increases with collection size
  - indexing vocabulary must be the same as the query vocabulary
  - acceptable precision generally means unacceptable recall
  - ranking models consistently shown to be better
  - hard to compare best- and exact-match in a principled way (why?)

Unranked Boolean retrieval
- The Boolean model is the most common exact-match model
  - queries are logic expressions with document features as operands
  - in the pure Boolean model, retrieved documents are not ranked
  - most implementations provide some sort of ranking
  - query formulation is difficult for novice users
- Boolean queries
  - used by the Boolean model, and in other models (a Boolean query is not the same as the Boolean model)
- Pure Boolean operators: AND, OR, AND-NOT
- Most systems have proximity operators
- Most systems support simple regular expressions as search terms, to match spelling variants

Example: WESTLAW (information is slightly dated but approximately correct)
- Part of Thomson Legal and Regulatory
- Serves the legal and professional market: legal materials, news, financial
- Total collection size more than 50 terabytes
  - 16,700 databases of stored content as of 2007
  - more than 1,000 servers; 700,000 users as of 2002
- Preferred by about 60% of searchers according to a 2007 survey
  - claimed 47% of legal searchers as of 2002
- 120,000 daily online users (54% of online users); 2 million searches per week; 84 million transactions per day
- In operation since 1974; 51 countries as of 2002
- Best-match and free-text queries added in 1992 (WIN: "West is Natural")

WESTLAW Boolean features (1 of 2)
- Boolean operators
  - OR is implicit
  - AND using the & operator
  - BUT-NOT using the % operator: a BUT-NOT b is written a % b
- Proximity operators
  - phrases: "two fish"
  - same sentence: brillig /s toves; to force order to be honored: brillig +s toves
  - same paragraph: manxome /p foe (similarly, +p forces order)
  - word proximity: uffish /5 stood, or uffish +5 stood to force order
- Operator ordering: phrase, then space (OR), /n, /s, /p, &, %
  - parentheses permitted to override the order

WESTLAW Boolean language features (2 of 2)
- Document structure (fields)
  - query expression restricted to a named document component
  - e.g., title(language models) & cites(Ponte) & date(after 1998)
- Metadata restrictions
  - e.g., DATE(AFTER 1992 & BEFORE 1995)
- Term expansion
  - universal character * matches any one character: man*ome
  - root expander !: gimb!
  - automatic expansion of plurals and possessive forms

Features to note about queries
- Queries are developed incrementally
  - change the query until a reasonable number of documents is retrieved, e.g.:
    "language models"
    "language models" /s statistical
    "language models" /s stat!
    "language models" /s stat! "A toolkit"
    "language models" /s stat! "A toolkit" HMM
  - a form of implicit relevance feedback
- Queries are complex
  - proximity operators used very frequently
  - implicit OR used for synonyms
  - NOT is rare
- Queries are long: average 9-10 words (not typical Internet queries of 1-2 words)
- Response time tuned to be fairly consistent, at about 7 seconds

WESTLAW query examples
- What is the statute of limitations in cases involving the federal tort claims act?
  LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
- What factors are important in determining what constitutes a vessel for purposes of determining liability of a vessel owner for injuries to a seaman under the Jones Act (46 USC 688)?
  FACTOR ELEMENT STATUS FACT! /P VESSEL SHIP BOAT /P (46 +3 688) "JONES ACT" /P INJUR! /S SEAMAN CREWMAN WORKER
- Are there any cases which discuss negligent maintenance or failure to maintain aids to navigation such as lights,
  buoys, or channel markers?
  NEGLECT! FAIL! NEGLIG! /5 MAINT! REPAIR! /P NAVIGAT! /5 AID EQUIP! LIGHT BUOY "CHANNEL MARKER"
- What cases have discussed the concept of excusable delay in the application of statutes of limitations or the doctrine of laches involving actions in admiralty or under the Jones Act or the Death on the High Seas Act?
  EXCUS! /3 DELAY /P (LIMIT! /3 STATUTE ACTION) LACHES /P "JONES ACT" "DEATH ON THE HIGH SEAS ACT" (46 +3 761)

Westlaw search screen (Boolean)
[Screenshot: the search screen supports two types of queries (Boolean terms-and-connectors and natural language), plus database selection, a thesaurus, and help. From "Using westlaw.com", West e-search guide, Feb. 2007.]

Westlaw results screen
[Screenshot: a results list of retrieved cases with ResultsPlus suggestions. From "Using westlaw.com", West e-search guide, Feb. 2007.]
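The exact-match evaluation described above, in which every document either satisfies the Boolean expression or does not and the result is an unordered set, can be sketched with an inverted index and set operations. This is a minimal illustration, not WESTLAW's implementation: the toy documents and the query are invented, and BUT_NOT stands in for WESTLAW's % operator.

```python
# Sketch of Boolean exact-match retrieval over a toy inverted index.
# Documents and query are invented for illustration only.

docs = {
    1: "the statute of limitations in tort claims",
    2: "federal tort claim actions and limitations",
    3: "maritime law and vessel owner liability",
}

# Build an inverted index: term -> set of doc ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    return index.get(term, set())

def AND(a, b): return a & b
def OR(a, b): return a | b
def BUT_NOT(a, b): return a - b   # analogue of WESTLAW's % operator

# (tort AND claim) BUT-NOT maritime: every document either matches or not;
# the result is an unordered set, as in the pure Boolean model.
result = BUT_NOT(AND(postings("tort"), postings("claim")), postings("maritime"))
print(sorted(result))
```

Note that the efficiency advantage claimed on the earlier slide comes from exactly this structure: only postings lists for query terms are touched, and non-matching documents are rejected early.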
Boolean query languages still used
- Many users prefer Boolean, especially professional searchers
  - many WESTLAW and DIALOG searches still use Boolean
  - control
  - understandability
- For some queries or collections, Boolean often works better (e.g., using AND on the Web)
- Boolean and free text find different documents
- Need retrieval models that support both
  - extended Boolean, vector space
  - probabilistic, inference network
- Need interfaces that provide good cognitive models for ranking

Best-match retrieval
- Retrieving documents that satisfy a Boolean expression constitutes the Boolean (exact-match) retrieval model; best-match (ranking) models are now more common
- Advantages
  - significantly more effective than exact match
  - uncertainty is a better model than certainty
  - easier to use (supports full-text queries)
  - similar efficiency, based on inverted file implementations
- Disadvantages
  - more difficult to convey an appropriate cognitive model (control)
  - full text does not mean natural language understanding (no "magic")
  - efficiency is always less than exact match (cannot reject documents early)
- Boolean or structured queries can be part of a best-match retrieval model

Outline
- Statistical language models

What is a language model?
- A probability distribution over strings of text
  - how likely is a given string (observation) in a given language?
  - for example, consider probabilities for the following four strings:
    p1 = P(a quick brown dog)
    p2 = P(dog quick a brown)
    p3 = P(un chien brun rapide)
    p4 = P(un brown dog rapide)
  - English: p1 > p2 > p3 > p4
  - depends on what language we are modeling
  - in most of IR, assume that p1 = p2
  - for some applications we will want p3 to be highly probable

Before continuing: a quick review of probabilities
- P(s) is the probability of some event s, e.g., P(retrieval)
- P(s|M) is the probability of some event s given M: P(retrieval | information) > P(retrieval | octopus)
- The sum of P(s) over all s
  must be 1, and P(not s) = 1 - P(s)
- Independence: if events w1 and w2 are independent, then P(w1 AND w2) = P(w1) P(w2)
- Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)

Language modeling notation
- It is convenient to make explicit what we are modeling:
  - M = the language we are trying to model
  - s = an observation (a string of tokens from some vocabulary)
  - P(s|M) = the probability of observing s in language M
- M can be thought of as a source or generator: a mechanism that can spit out strings that are legal in the language
- P(s|M) = the probability of getting s during random sampling from M

How can we use LMs in IR?
- Task: given a query, retrieve relevant documents
- Use LMs to model the process of query generation: the user thinks of some relevant document and picks some keywords to use as the query
  [Figure: relevant documents about tropical storms; the user samples the query "tropical storms" from one of them]

Language modeling for IR
- Every document in a collection defines a "language"
  - consider all possible sentences (strings) that the author could have written down when creating some given document
  - some are perhaps more likely to occur than others, subject to topic, writing style, language
  - P(s|MD) = probability that the author would write down string s
  - think of running the process of writing a million variations of the document and counting how many times we get s
- Now suppose Q is the user's query
  - what is the probability that the author would write down Q?
- Rank documents D in the collection by P(Q|MD), the probability of observing Q during random sampling from the language model of document D

Major issues in applying LMs
- What kind of language model should we use?
  - unigram or higher-order models?
  - multinomial or multiple-Bernoulli?
- How can we estimate model parameters?
  - basic models, translation models, aspect models, non-parametric models
- How can we use the model for ranking?
  - query-likelihood, document-likelihood, divergence of query and document models
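The "rank documents by P(Q|MD)" idea above can be sketched by estimating each document's model with simple counting and multiplying per-word probabilities. This is a minimal sketch with invented toy texts; no smoothing is applied yet, so unseen query words zero out a document's score (part of the estimation issue flagged under "major issues").

```python
from collections import Counter

# Sketch: each document defines a unigram language model M_D, estimated by
# simple counting; documents are ranked by P(Q|M_D), the probability of
# sampling the query from the model. Toy documents are invented.

def unigram_model(text):
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, model):
    # P(q1..qk | M_D) = product of per-word probabilities (word independence).
    p = 1.0
    for w in query.split():
        p *= model.get(w, 0.0)  # no smoothing: unseen words zero the score
    return p

d1 = unigram_model("tropical storms form over warm tropical oceans")
d2 = unigram_model("language models assign probabilities to strings")

q = "tropical storms"
# d1's "author" is far more likely to write down this query than d2's.
print(query_likelihood(q, d1) > query_likelihood(q, d2))  # True
```

Because word order is ignored, this model gives the same score to "tropical storms" and "storms tropical", matching the p1 = p2 assumption on the earlier slide.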
Unigram language models
- Words are sampled independently of each other
  - metaphor: randomly pulling words from an urn, with replacement
  - the joint probability decomposes into a product of marginals: P(q1 q2 q3 q4 | M) = P(q1|M) P(q2|M) P(q3|M) P(q4|M)
  - estimation of probabilities: simple counting
  [Figure: urn of nine colored balls; the probability of a four-ball draw is the product of the individual draws, e.g., 4/9 x 2/9 x 4/9 x 3/9]

Higher-order models
- The unigram model assumes word independence
  - cannot capture surface form: P(brown dog) = P(dog brown)
- Higher-order models
  - n-gram: condition on preceding words
  - cache: condition on a window
  - grammar: condition on a parse tree
- Are they useful?
  - no improvements from n-gram or grammar-based models
  - some research on cache-like models (proximity, passages, etc.)
  - parameter estimation is prohibitively expensive

Multinomial or multiple-Bernoulli?
- The most popular model is the multinomial
  - fundamental event: what is the identity of the i'th query token?
  - the observation is a sequence of events, one for each query token:
    P(q1 ... qk | M) = prod_{i=1..k} P(qi | M)
- The original model (Ponte 1998) is multiple-Bernoulli
  - fundamental event: does the word w occur in the query?
  - the observation is a vector of binary events, one for each possible word:
    P(q1 ... qk | M) = prod_{w in q1..qk} P(w|M) x prod_{w not in q1..qk} (1 - P(w|M))

Multinomial or multiple-Bernoulli? (cont.)
- The two models are fundamentally different
  - entirely different event spaces (random variables involved)
  - both assume word independence, though it has different meanings
  - both use smoothed relative-frequency counting for estimation
- Multinomial
  - can account for multiple word occurrences in the query
  - well understood: lots of research in related fields, and now in IR
  - possibility for integration with ASR/MT/NLP (same event space)
- Multiple-Bernoulli
  - better suited to IR: directly checks presence of query terms
  - provisions for explicit negation of query terms ("A but not B")
  - no issues with observation length
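The contrast between the two event spaces can be made concrete. In the sketch below (the vocabulary and its probabilities are invented for illustration), the multinomial multiplies one probability per query token, while the multiple-Bernoulli multiplies one factor per vocabulary word and therefore cannot see repeated query words.

```python
# Sketch contrasting the multinomial and multiple-Bernoulli event spaces
# on one toy unigram model. Probabilities and queries are invented.

model = {"quick": 0.2, "brown": 0.1, "dog": 0.3, "cat": 0.4}  # P(w|M), sums to 1

def multinomial(query_tokens, m):
    # One event per query token position: P(q1..qk|M) = prod_i P(qi|M).
    p = 1.0
    for q in query_tokens:
        p *= m.get(q, 0.0)
    return p

def multiple_bernoulli(query_tokens, m):
    # One binary event per vocabulary word: does w occur in the query?
    present = set(query_tokens)
    p = 1.0
    for w, pw in m.items():
        p *= pw if w in present else (1.0 - pw)
    return p

# The multinomial accounts for repeated query words; the multiple-Bernoulli,
# seeing only the set of query words, does not.
repeat = multinomial(["dog", "dog"], model)    # P(dog|M) squared
single = multinomial(["dog"], model)
print(repeat < single)                                        # True
print(multiple_bernoulli(["dog", "dog"], model)
      == multiple_bernoulli(["dog"], model))                  # True
```

This is exactly the "can account for multiple word occurrences" vs. "no issues with observation length" trade-off listed on the slide.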
Ranking with language models
- Standard approach: query-likelihood
  - estimate a language model MD for every document D in the collection (assume that we know how to do this)
  - rank docs by the probability of generating the query:
    P(q1 ... qk | MD) = prod_{i=1..k} P(qi | MD)
- Drawbacks
  - no notion of relevance in the model: everything is random sampling
  - user feedback / query expansion is not part of the model
    - examples of relevant documents cannot help us improve the language model MD
    - the only option is augmenting the original query Q with extra terms
    - however, we could make use of sample queries for which D is relevant
  - does not directly allow weighted or structured queries

Ranking: document-likelihood
- Flip the direction of the query-likelihood approach
  - estimate a language model MQ for the query Q
  - rank docs D by the likelihood of being a random sample from MQ
  - MQ is expected to predict a typical relevant document:
    P(D | MQ) = prod_{w in D} P(w | MQ)
- Problems
  - with different document lengths, probabilities are not comparable
  - favors documents that contain frequent, low-content words
  - a document containing only the highest-probability query word is "ideal"

Ranking: likelihood ratio
- Try to fix document-likelihood
  - Bayes: the likelihood that MQ was the source, given that we observed D:
    P(MQ | D) = P(D | MQ) P(MQ) / P(D), which is proportional to prod_{w in D} P(w | MQ) / P(w | GE)
  - GE is "general English", a generic model of documents
  - allows relevance feedback, query expansion, etc.
  - can benefit from complex estimation of the query model MQ
- Cons
  - does not provide for modeling on the document side
  - the "ideal" document contains only the query word with the highest likelihood ratio

Ranking: model comparison
- Combine the advantages of the two ranking methods
  - estimate a model of both the query (MQ) and the document (MD)
  - directly compare the similarity of the two models
  - a natural measure of similarity is cross-entropy (others exist):
    H(MQ || MD) = - sum_w P(w|MQ) log P(w|MD)
  - the number of bits we would need to encode MQ using MD
  - equivalent to Kullback-Leibler divergence: KL(MQ || MD) = H(MQ || MD) - H(MQ || MQ)
  - equivalent to query-likelihood if MQ is simply the counts of words in Q
- Cross-entropy is not symmetric: use H(MQ || MD)
  - the reverse works consistently worse (favors "different" documents)
  - use the reverse if ranking multiple queries with respect to one document
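Model-comparison ranking can be sketched as follows: score each document by the cross-entropy H(MQ || MD), lower being better. The linear interpolation against a background model is an assumption added here (these slides defer estimation to later); it keeps log P(w|MD) finite for words missing from a document. The toy texts are invented.

```python
import math
from collections import Counter

# Sketch of model-comparison ranking: score documents by cross-entropy
#   H(M_Q || M_D) = -sum_w P(w|M_Q) log P(w|M_D),   lower is better.
# Linear smoothing against a background model is assumed here so that
# log P(w|M_D) is finite even for words absent from a document.

def mle(text):
    c = Counter(text.split())
    n = sum(c.values())
    return {w: k / n for w, k in c.items()}

def cross_entropy(m_q, m_d, background, lam=0.5):
    h = 0.0
    for w, pq in m_q.items():
        pd = lam * m_d.get(w, 0.0) + (1 - lam) * background.get(w, 1e-6)
        h -= pq * math.log(pd)
    return h

docs = {
    "d1": "tropical storms form over warm oceans",
    "d2": "retrieval models rank documents for queries",
}
background = mle(" ".join(docs.values()))   # stand-in for "general English"
m_q = mle("tropical storms")                # M_Q from query word counts

scores = {d: cross_entropy(m_q, mle(t), background) for d, t in docs.items()}
best = min(scores, key=scores.get)          # lowest cross-entropy ranks first
print(best)
```

With MQ taken as plain query counts, as here, the ranking reduces to smoothed query-likelihood; richer estimates of MQ (feedback, expansion) are what make the model-comparison view more general.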
Summary of LM choices
- Use unigram models
  - no consistent benefit from using higher-order models
  - estimation is much more complex (e.g., bigrams from a 3-word query)
- Use multinomial models
  - well studied, consistent with other fields that use LMs
  - extends the multiple-Bernoulli model to non-binary events
- Use model comparison for ranking
  - allows feedback, expansion, etc., through estimation of MQ and MD
  - use H(MQ || MD) for ranking multiple documents against a query
- Estimation of MQ and MD is a crucial step
  - very significant impact on performance (more than the other choices)
  - key to cross-language, cross-media, and other applications
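As one concrete instance of the estimation step the summary calls crucial, the sketch below smooths the document model against a collection model. Dirichlet smoothing, the mu value, and the toy texts are assumptions for illustration; the slides themselves do not commit to a particular estimator.

```python
from collections import Counter

# Sketch of Dirichlet-smoothed document-model estimation, one common recipe
# for the "crucial" estimation step (formula and mu=2000 are assumptions):
#   P(w|M_D) = (c(w; D) + mu * P(w|C)) / (|D| + mu)

def collection_model(texts):
    c = Counter(w for t in texts for w in t.split())
    n = sum(c.values())
    return {w: k / n for w, k in c.items()}

def dirichlet_model(text, coll, mu=2000):
    counts = Counter(text.split())
    dlen = sum(counts.values())
    return lambda w: (counts.get(w, 0) + mu * coll.get(w, 0.0)) / (dlen + mu)

texts = ["tropical storms form over warm oceans",
         "retrieval models rank documents for queries"]
coll = collection_model(texts)
p_d1 = dirichlet_model(texts[0], coll)

# A word unseen in d1 but present in the collection gets non-zero mass...
print(p_d1("retrieval") > 0)                # True
# ...while a word d1 actually contains still scores higher.
print(p_d1("storms") > p_d1("retrieval"))   # True
```

Plugging such smoothed models into query-likelihood or cross-entropy ranking removes the zero-probability problem noted for the unsmoothed models above.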