Information Retrieval (CMPSCI 646)

Notes by: Roman McCullough

Instructor: James Allan


About this Document

These 244-page class notes were uploaded by Roman McCullough on Friday, October 30, 2015. The notes belong to CMPSCI 646 at the University of Massachusetts, taught by James Allan in Fall. Since upload, they have received 146 views. For similar materials see /class/232258/cmpsci-646-university-of-massachusetts in Computer Science at the University of Massachusetts.


Date Created: 10/30/15
Information Retrieval: Retrieval Models (Introduction, Boolean, and Language Models)
James Allan, University of Massachusetts Amherst
CMPSCI 646, Fall 2007
All slides copyright James Allan

What is a retrieval model?

- A model is an idealization or abstraction of an actual process (here, retrieval).
- Mathematical models are used to study the properties of the process, draw conclusions, and make predictions.
- Conclusions derived from a model depend on whether the model is a good approximation of the actual situation.
- Statistical models represent repetitive processes, make predictions about frequencies of interesting events, and use probability as the fundamental tool.

What is a retrieval model? (continued)

- Retrieval models can describe the computational process, e.g., how documents are ranked. (Note that how documents or indexes are stored is implementation.)
- Retrieval models can attempt to describe the human process, e.g., the information need and interaction. Few do so meaningfully.
- Retrieval variables: queries, documents, terms, relevance judgements, users, information needs.
- Retrieval models have an explicit or implicit definition of relevance.

Models we'll consider

- Boolean (exact match): exact-match retrieval; the dominant model for years; still popular.
- Statistical language models: a ranking (best-match) model; the current "hot" model for information retrieval; effective.
- We will consider other models later.

Exact vs. best match

- Exact match: the query specifies precise retrieval criteria; every document either matches or fails to match the query; the result is a set of documents (unordered in pure exact match).
- Best match: the query describes a "good" or "best" matching document; every document matches the query to some degree; the result is a ranked list of documents.
- Popular approaches often provide some of each ("best of both worlds"), e.g., some type of ranking of the result set, or a best-match query language that incorporates exact-match operators.
Exact match: pros and cons

Advantages of exact match:
- Can be very efficiently implemented.
- Predictable and easy to explain.
- Structured queries for pinpointing precise documents.
- Works well when you know exactly (or roughly) what the collection contains and what you're looking for.

Disadvantages of exact match:
- Query formulation is difficult for most users, and the difficulty increases with collection size.
- The indexing vocabulary must be the same as the query vocabulary.
- Acceptable precision generally means unacceptable recall.
- Ranking models are consistently shown to be better.
- It is hard to compare best-match and exact-match in a principled way (why?).

Unranked Boolean retrieval

- The Boolean model is the most common exact-match model: queries are logic expressions with document features as operands.
- In the pure Boolean model, retrieved documents are not ranked; most implementations provide some sort of ranking.
- Query formulation is difficult for novice users.
- Boolean queries are used by the Boolean model, and in other models as well (Boolean query is not the same as Boolean model).
- Pure Boolean operators: AND, OR, AND NOT. Most systems have proximity operators, and most support simple regular expressions as search terms to match spelling variants.

Example: WESTLAW

(This information is slightly dated but approximately correct.)
- Part of Thomson Legal and Regulatory; serves the legal and professional market (legal materials, news, financial).
- Total collection size more than 50 terabytes; 16,700 databases of stored content as of 2007; more than 1,000 servers.
- 700,000 users as of 2002; preferred by about 60% of searchers according to a 2007 survey (claimed 47% of legal searchers as of 2002).
- 120,000 daily online users (54% of online users); 2 million searches per week; 84 million transactions per day.
- In operation since 1974; in 51 countries as of 2002.
- Best-match and free-text queries were added in 1992 (WIN: "West is Natural").
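The unranked Boolean model above can be made concrete with a small sketch. This is an illustrative toy (not WESTLAW's implementation, and all documents here are made up): each term maps to a sorted postings list of document IDs, and AND / OR / AND NOT become merges of those lists.

```python
# Toy unranked Boolean retrieval over an inverted index.
# AND uses the classic two-pointer merge of sorted postings lists.

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted list of doc_ids}."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort()
    return index

def AND(p1, p2):
    """Intersect two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def OR(p1, p2):
    return sorted(set(p1) | set(p2))

def AND_NOT(p1, p2):
    exclude = set(p2)
    return [d for d in p1 if d not in exclude]

docs = {1: "insider trading law", 2: "securities law bill",
        3: "insider trading sanctions act"}
idx = build_index(docs)
# (insider AND trading) AND NOT sanctions -> an unordered result *set*
result = AND_NOT(AND(idx["insider"], idx["trading"]), idx.get("sanctions", []))
print(result)  # -> [1]
```

Note that the output is a set of matching documents, not a ranking: document 1 is neither better nor worse than any other match, which is exactly the pure exact-match behavior described above.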
WESTLAW Boolean features (1 of 2)

Boolean operators:
- OR is implicit (a space between terms).
- AND uses the & operator.
- BUT NOT uses the % operator: a % b means a AND NOT b.

Proximity operators:
- Phrases: "two fish".
- Same sentence: brillig /s toves. To force order to be honored: brillig +s toves.
- Same paragraph: manxome /p foe. Similarly, +p forces order.
- Word proximity: uffish /5 stood, or uffish +5 stood.

Operator ordering: phrase, then space (OR), then /n, /s, /p, &, %. Parentheses are permitted to override the order.

WESTLAW Boolean language features

- Document structure (fields): a query expression can be restricted to a named document component, e.g., title(language models) & cites(Ponte) & date(after 1998).
- Metadata restrictions, e.g., DATE(AFTER 1992 & BEFORE 1995).
- Term expansion: the universal character (matches any one character), e.g., m*nxome; the root expander, e.g., gimb!.
- Automatic expansion of plurals and possessive forms.

Features to note about queries

- Queries are developed incrementally: change the query until a reasonable number of documents is retrieved, e.g., "language models"; "language models" /s statistical; ... /s stat!; adding "A toolkit"; adding HMM (implicit relevance feedback).
- Queries are complex: proximity operators are used very frequently, implicit OR is used for synonyms, and NOT is rare.
- Queries are long, averaging 9-10 words (not typical Internet queries of 1-2 words).
- Response time is tuned to be fairly consistent, at about 7 seconds.

WESTLAW query examples

- "What is the statute of limitations in cases involving the federal tort claims act?"
  LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
- "What factors are important in determining what constitutes a vessel for purposes of determining liability of a vessel owner for injuries to a seaman under the Jones Act (46 USC 688)?"
  FACTOR ELEMENT STATUS FACT! /P VESSEL SHIP BOAT /P (46 +3 688) "JONES ACT" /P INJUR! /S SEAMAN CREWMAN WORKER
- "Are there any cases which discuss negligent maintenance or failure to maintain aids to navigation such as lights, buoys, or channel markers?"
  NOT NEGLECT! FAIL! NEGLIG! /S MAINT! REPAIR! /P NAVIGAT! /S AID EQUIP! LIGHT BUOY CHANNEL MARKER!
- "What cases have discussed the concept of excusable delay in the application of statutes of limitations or the doctrine of laches involving actions in admiralty or under the Jones Act or the Death on the High Seas Act?"
  EXCUS! /3 DELAY /P LIMIT! /3 STATUTE ACTION LACHES /P "JONES ACT" "DEATH ON THE HIGH SEAS ACT" (46 +3 761)

[Screenshot: Westlaw search screen, showing the two types of queries (Boolean "Terms and Connectors" vs. natural language) and the query entry box.]

[Screenshot: Westlaw results screen, showing a ranked document list with highlighted query terms and links to related documents.]

Boolean query languages are still used

- Many users prefer Boolean, especially professional searchers: many WESTLAW and DIALOG searches still use Boolean, for control and understandability.
- For some queries or collections, Boolean often works better (e.g., using AND on the Web).
- Boolean and free-text queries find different documents.
- We need retrieval models that support both: extended Boolean, vector space, probabilistic, inference network.
- We need interfaces that provide good cognitive models for ranking.

Best-match retrieval

Retrieving documents that satisfy a Boolean expression constitutes the Boolean (exact-match) retrieval model. Best-match (ranking) models are now more common.

Advantages:
- Significantly more effective than exact match.
- Uncertainty is a better model than certainty.
- Easier to use; supports full-text queries.
- Similar efficiency, based on inverted file implementations.

Disadvantages:
- More difficult to convey an appropriate cognitive model (control).
- Full text does not mean natural-language understanding ("no magic").
- Efficiency is always less than exact match (cannot reject documents early).

Boolean or structured queries can be part of a best-match retrieval model.

Outline: statistical language models

What is a language model?

A probability distribution over strings of text: how likely is a given string (observation) in a given language? For example, consider the probabilities of the following four strings:
- p1 = P(a quick brown dog)
- p2 = P(dog quick a brown)
- p3 = P(un chien brun rapide)
- p4 = P(un brown dog rapide)
For English, p1 > p2 > p3 > p4, but this depends on what language we are modeling. In most of IR we assume that p1 = p2; for some applications we will want p3 to be highly probable.

Before continuing, a quick review of probabilities

- P(s) is the probability of some event s, e.g., P(retrieval).
- P(s|M) is the probability of some event s given M: P(retrieval | information) > P(retrieval | octopus).
- The sum of P(s) over all s must be 1, and P(not s) = 1 - P(s).
- Independence: if events w1 and w2 are independent, then P(w1 AND w2) = P(w1) P(w2).
- Bayes' rule: P(A|B) = P(B|A) P(A) / P(B).

Language modeling notation

It is convenient to make explicit what we are modeling:
- M: the language we are trying to model.
- s: an observation (a string of tokens from some vocabulary).
- P(s|M): the probability of observing s in language M.
M can be thought of as a source or generator: a mechanism that can "spit out" strings that are legal in the language. P(s|M) is then the probability of getting s during random sampling from M.
How can we use LMs in IR?

Task: given a query, retrieve relevant documents. Use LMs to model the process of query generation: the user thinks of some relevant document and picks some keywords to use as the query.

Language modeling for IR

- Every document in a collection defines a "language": consider all possible sentences (strings) that the author could have written down when creating that document; some are more likely to occur than others, subject to topic, writing style, and language.
- P(s|M_D) is the probability that the author would write down string s. (Think of variations of a document: writing it many times and counting how often each string appears.)
- Now suppose Q is the user's query. What is the probability that the author would write down Q?
- Rank the documents D in the collection by P(Q|M_D): the probability of observing Q during random sampling from the language model of document D.

Major issues in applying LMs

- What kind of language model should we use? Unigram or higher-order models? Multinomial or multiple-Bernoulli?
- How can we estimate model parameters? Basic models, translation models, aspect models, non-parametric models.
- How can we use the model for ranking? Query-likelihood, document-likelihood, or divergence of query and document models.

Unigram language models

- Words are sampled independently of each other; the metaphor is randomly pulling words from an urn, with replacement.
- The joint probability decomposes into a product of marginals, and estimation of the probabilities is simple counting.
- For example, for a four-word query drawn from a 9-word urn:
  P(q1 q2 q3 q4 | M) = P(q1) P(q2) P(q3) P(q4) = 4/9 * 2/9 * 4/9 * 3/9
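The urn metaphor above is easy to reproduce in code. This is a minimal sketch, with a made-up 9-token "document" chosen so the counts mirror the 4/9, 2/9, 3/9 example; the estimator is plain maximum likelihood (count / total), with no smoothing yet.

```python
# Unigram language model: MLE probabilities are relative frequencies,
# and a query's probability is the product of its words' marginals
# (the word-independence assumption).
from collections import Counter
from fractions import Fraction

doc = "dog dog dog dog cat cat bird bird bird".split()  # 9 tokens
counts = Counter(doc)          # dog: 4, cat: 2, bird: 3
total = len(doc)               # 9

def p(word):
    """MLE estimate: count(word) / total tokens."""
    return Fraction(counts[word], total)

def query_likelihood(query):
    prob = Fraction(1)
    for w in query.split():
        prob *= p(w)           # independence: product of marginals
    return prob

print(query_likelihood("dog cat dog bird"))  # 4/9 * 2/9 * 4/9 * 3/9 = 32/2187
```

Note the weakness this exposes: any query containing a word absent from the urn gets probability exactly zero, which is why estimation (smoothing) is flagged later as a crucial step.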
Higher-order models

- The unigram model assumes word independence and cannot capture surface form: P(brown dog) = P(dog brown).
- Higher-order models: n-gram (condition on preceding words), cache (condition on a window), grammar (condition on a parse tree).
- Are they useful? There have been no improvements from n-gram or grammar-based models; there is some research on cache-like models (proximity, passages, etc.); parameter estimation is prohibitively expensive.

Multinomial or multiple-Bernoulli?

- The most popular model is the multinomial. The fundamental event is "what is the identity of the i-th query token?"; the observation is a sequence of events, one for each query token:
  P(q1 ... qk | M) = prod over i = 1..k of P(qi | M)
- The original model (Ponte, 1998) is multiple-Bernoulli. The fundamental event is "does the word w occur in the query?"; the observation is a vector of binary events, one for each possible word:
  P(q1 ... qk | M) = prod over w in q of P(w | M) * prod over w not in q of (1 - P(w | M))

Multinomial or multiple-Bernoulli? (continued)

- The two models are fundamentally different: entirely different event spaces (random variables involved); both assume word independence (though it has different meanings); both use smoothed relative-frequency counting for estimation.
- Multinomial: can account for multiple word occurrences in the query; well understood (lots of research in related fields, and now in IR); possibility for integration with ASR/MT/NLP (same event space).
- Multiple-Bernoulli: arguably better suited to IR (directly checks presence of query terms); provisions for explicit negation of query terms ("A but not B"); no issues with observation length.

Ranking with language models

The standard approach is query-likelihood:
- Estimate a language model M_D for every document D in the collection (assume that we know how to do this).
- Rank documents by the probability of generating the query:
  P(Q | M_D) = P(q1 ... qk | M_D) = prod over i = 1..k of P(qi | M_D)

Drawbacks:
- There is no notion of relevance in the model: everything is random sampling.
- User feedback and query expansion are not part of the model: examples of relevant documents cannot help us improve the language model M_D; the only option is augmenting the original query Q with extra terms.
  (However, we could make use of sample queries for which D is relevant.)
- It does not directly allow weighted or structured queries.

Ranking: document-likelihood

Flip the direction of the query-likelihood approach:
- Estimate a language model M_Q for the query Q.
- Rank documents D by the likelihood of being a random sample from M_Q; M_Q is expected to predict a typical relevant document:
  P(D | M_Q) = prod over w in D of P(w | M_Q)

Problems:
- With different document lengths, the probabilities are not comparable.
- It favors documents that contain frequent (low-content) words.
- A document containing only the highest-probability query word is "ideal".

Ranking: likelihood ratio

Try to fix document-likelihood. By Bayes' rule, the likelihood that M_Q was the source, given that we observed D, is
  P(M_Q | D) = P(D | M_Q) P(M_Q) / P(D), ranked by the ratio prod over w in D of P(w | M_Q) / P(w | GE)
where GE is "general English", a generic model of documents.
- This allows relevance feedback, query expansion, etc., and can benefit from complex estimation of the query model M_Q.
- Cons: it does not provide for modeling on the document side; the ideal document still contains only the query word with the highest likelihood ratio.

Ranking: model comparison

Combine the advantages of the two ranking methods:
- Estimate a model of both the query (M_Q) and the document (M_D), and directly compare the similarity of the two models.
- A natural measure of similarity is cross-entropy (others exist):
  H(M_Q || M_D) = - sum over w of P(w | M_Q) log P(w | M_D)
  the number of bits we would need to encode M_Q using M_D.
- This is rank-equivalent to the Kullback-Leibler divergence, since KL(M_Q || M_D) = H(M_Q || M_D) - H(M_Q || M_Q), and equivalent to query-likelihood if M_Q is simply the counts of words in Q.
- Cross-entropy is not symmetric: use H(M_Q || M_D). The reverse works consistently worse (it favors different documents); use the reverse if ranking multiple queries with respect to one document.

Summary of LM choices

- Use unigram models: there is no consistent benefit from higher-order models, and their estimation is much more complex (e.g., bigrams from a 3-word query).
- Use multinomial models: well studied, and consistent with other fields that use LMs; they extend the multiple-Bernoulli model to non-binary events.
- Use model comparison for ranking: it allows feedback, expansion, etc., through estimation of M_Q and M_D; use H(M_Q || M_D) for ranking multiple documents against a query.
- Estimation of M_Q and M_D is a crucial step: it has a very significant impact on performance (more than the other choices) and is key to cross-language, cross-media, and other applications.
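The model-comparison ranking can be sketched in a few lines. This is a toy illustration with made-up two-document data; the document models are smoothed against the collection model (Jelinek-Mercer interpolation with an assumed lambda of 0.5, one simple instance of the estimation step the notes call crucial) so that log P(w|M_D) stays finite for unseen words, and M_Q is just the query's word counts.

```python
# Rank documents by cross-entropy H(M_Q || M_D):
# lower cross-entropy = fewer bits to encode M_Q using M_D = better match.
import math
from collections import Counter

docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "boolean retrieval with exact match operators".split(),
}
collection = [w for d in docs.values() for w in d]
cmodel, clen = Counter(collection), len(collection)
LAM = 0.5  # assumed Jelinek-Mercer smoothing parameter

def p_w_given_d(w, d):
    """Smoothed document model: mix of document MLE and collection MLE."""
    tf = Counter(docs[d])
    return LAM * tf[w] / len(docs[d]) + (1 - LAM) * cmodel[w] / clen

def cross_entropy(query, d):
    q = Counter(query.split())
    qlen = sum(q.values())
    return -sum((n / qlen) * math.log(p_w_given_d(w, d)) for w, n in q.items())

query = "language models retrieval"
ranking = sorted(docs, key=lambda d: cross_entropy(query, d))
print(ranking)  # -> ['d1', 'd2']
```

Because M_Q here is simply the relative counts of the query words, this ranking is rank-equivalent to smoothed query-likelihood, as the notes point out.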
Information Retrieval: Relevance Feedback (Real and Assumed)
James Allan, University of Massachusetts Amherst
CMPSCI 646, Fall 2007

First, some intuition (courtesy of Fernando Diaz)

Suppose you search for "canine".

[Figures: a sequence of example pages with term-count vectors over (canine, dog, cat) and their retrieval scores for the query "canine". A page with canine = 10, dog = 0, cat = 0 scores 10; pages about dogs or cats that rarely or never use the word "canine" score near 0, even when they are relevant.]

Embed the pages in a vector space model, with one axis per term (canine, cat, dog).

[Figures: documents plotted in the (canine, cat, dog) vector space. Retrieval with the query "canine" places the query vector along the canine axis and retrieves the nearest documents.]

[Figures: relevance feedback. The user marks a dog page as a relevant document; the new query moves toward it, mixing "canine" and "dog". With pseudo-relevance feedback, the top-ranked documents are assumed relevant instead.]

The big picture

Relevance feedback: adjust the query with interaction.
- The user looks at the returned list of documents and provides feedback; the system returns a revised ranked list.
- Can a better query be created automatically by analyzing relevant and non-relevant documents?

Pseudo-relevance feedback: adjust the query without interaction.
- Generate a ranked list but do not present it; use that information to create a new ranked list that is presented.
- Can a better query be created automatically by assuming that some documents are relevant?

This occurs in different models:
- Vector space is used most often; we'll focus on it.
- Language modeling: excellent success with assumed relevance (relevance models); less obviously good results for real feedback.
Relevance feedback in the vector space

Goal: move the new query closer to the relevant documents.

Approach: the new query is a weighted average of the original query and the relevant and non-relevant document vectors:

  Q' = Q + (alpha / |R|) * sum over Dj in R of Dj - (beta / |NR|) * sum over Dj in NR of Dj

where alpha and beta are constants that represent the relative importance of positive and negative feedback.

Variations:
- Different values of alpha and beta.
- Vector length: the number of terms added to the query.
- Which documents are used for training: all, best, uncertain, etc.

Example:
- Original query Q = (5, 0, 3, 0, 1).
- Document D1 = (2, 1, 2, 0, 0), judged relevant; document D2 = (1, 0, 0, 0, 2), judged non-relevant.
- With alpha = 0.50 and beta = 0.25:
  Q' = Q + 0.5 D1 - 0.25 D2
     = (5, 0, 3, 0, 1) + 0.5 (2, 1, 2, 0, 0) - 0.25 (1, 0, 0, 0, 2)
     = (5.75, 0.50, 4.00, 0.00, 0.50)
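The feedback formula and the worked example above can be checked directly. This is a minimal sketch; term vectors are plain lists of weights, and the default constants match the example (alpha = 0.5 positive, beta = 0.25 negative).

```python
# Vector-space relevance feedback:
#   Q' = Q + (alpha/|R|) * sum(relevant) - (beta/|NR|) * sum(non-relevant)

def feedback(query, relevant, nonrelevant, alpha=0.5, beta=0.25):
    """All vectors are equal-length lists of term weights.
    Assumes at least one relevant and one non-relevant document."""
    def vec_sum(vectors):
        return [sum(col) for col in zip(*vectors)]
    r, nr = vec_sum(relevant), vec_sum(nonrelevant)
    return [q + alpha * ri / len(relevant) - beta * ni / len(nonrelevant)
            for q, ri, ni in zip(query, r, nr)]

Q  = [5, 0, 3, 0, 1]
D1 = [2, 1, 2, 0, 0]  # judged relevant
D2 = [1, 0, 0, 0, 2]  # judged non-relevant
print(feedback(Q, [D1], [D2]))  # -> [5.75, 0.5, 4.0, 0.0, 0.5]
```

The fifth component drops from 1 to 0.5: the term appears in the non-relevant document, so negative feedback pushes the query away from it, while the weights on D1's terms all rise.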
TREC example: the TREC topic (an information need)

Original TREC topic (excerpts):
  <num> Number: 106
  <dom> Domain: Law and Government
  <title> Topic: U.S. Control of Insider Trading
  <desc> Description: Document will report proposed or enacted changes to U.S. laws and regulations designed to prevent insider trading.
  <con> Concepts: securities law, bill, legislation, regulation, rule; Insider Trading Sanctions Act, Insider Trading and Securities Fraud Enforcement Act; Securities and Exchange Commission (SEC), Commodity Futures Trading Commission (CFTC), National Association of Securities Dealers (NASD)
  <fac> Factor(s): <nat> Nationality: U.S.

TREC example: InQuery ad hoc query

Automatically processed query:
  WSUM( 1.0
    (terms from the <title> field)
    2.0 UW50( Control of Insider Trading )  2.0 PHRASE( USA Control )  5.0 PHRASE( Insider Trading )
    (terms from the <con> field)
    2.0 PHRASE( securities law )  2.0 bill  2.0 legislation  2.0 regulation  2.0 rule
    2.0 3( Insider Trading Sanctions Act )
    2.0 3( Insider Trading and Securities Fraud Enforcement Act )
    2.0 3( Securities and Exchange Commission )  2.0 SEC
    2.0 3( Commodity Futures Trading Commission )  2.0 CFTC
    2.0 3( National Association of Securities Dealers )  2.0 NASD
    (terms from the <desc> field)
    1.0 proposed  1.0 enacted  1.0 changes  1.0 PHRASE( USA laws )
    1.0 regulations  1.0 designed  1.0 prevent
    2.0 NOT( FOREIGNCOUNTRY ) )

TREC example: relevance feedback query

[Query listing: the automatically modified query after the top 10 documents were judged. Each original term or operator is reweighted (stemmed forms such as "inside trade", "secure law", "legislate"), and new terms learned from the judged documents are added: drexel, fine, subcommittee, surveillance, markey, senate, manipulate, pass, scandal, edward.]

Relevance feedback in the vector space: variations

Term selection: none (original query terms only), all terms, the most common terms, or the most highly weighted terms.

Weighting:
- Ide: alpha = beta = 1; don't normalize by the number of judged documents.
- Ide dec-hi: alpha = beta = 1, but use only the highest-ranked non-relevant document; don't normalize by the number of judged documents.
- Rocchio: choose alpha and beta such that alpha > beta and alpha + beta = 1.

Relevance feedback in the vector space: common choices

- Ide dec-hi is effective when there are a few judged documents.
- Rocchio (alpha = 0.75, beta = 0.25) is effective when there are many judged documents.
- Expanding by all terms is best, but selecting the most common terms also works well; this depends somewhat on the retrieval model.
- Coping with negatively weighted terms: the vector space does not allow negative weights for cosine similarity, so terms that end up negatively weighted are usually dropped; one can instead create a "not like this" vector consisting of the negative terms. It is difficult to balance these issues correctly.

Relevance feedback: machine learning approaches

- An unstructured vector query is a linear discriminator, e.g., w1 t1 + w2 t2 + ... + wn tn.
- The goal is to learn weights that separate the relevant documents from the non-relevant documents.
- If the documents are linearly separable, a learning algorithm can be chosen that is guaranteed to converge to an optimal query; if they are not, a learning algorithm can be chosen that minimizes the total amount of error.

Relevance feedback: simple methods for adding structure

Basic process:
- Generate candidate operators (Boolean, phrase, proximity, etc.) using exhaustive, greedy, or selective algorithms.
- Add some or all candidates to the document representations, weighted like other terms.
Effectiveness: extremely effective for proximity operators and Boolean.

Relevance feedback on documents

- Relevance feedback could also modify the document representation: document space modification, or connectionist learning (changing weights in a network).
- Assumptions: a person's relevance judgements are consistent, and modifications for one person are meaningful for another.
- This has never been shown to be effective consistently; it is an old idea that periodically resurfaces (e.g., recommender systems). It is difficult to figure out how searchers should use it.

Relevance feedback: summary

- Relevance feedback can be very effective; effectiveness depends on the number of judged documents.
- It significantly outperforms the best human queries given enough judged documents, but results can be unpredictable with fewer than five judged documents.
- It is not used often in production systems (e.g., the Web): consistent mediocre performance is preferred to inconsistently good/great results. The "stick with documents like this one" variant does appear.
- An area of active research, with many open questions.
Relevance feedback: using it

- Relevance feedback is not widely used, and few studies explore the user side of feedback. (They don't necessarily answer that question, but are still interesting.)
- Jürgen Koenemann and Nick Belkin looked at this: "A Case for Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness" (CHI 1996).
- A user study of 64 users, presented with three styles of relevance feedback:
  - Opaque: relevance feedback is "magic", behind the scenes.
  - Transparent: same as opaque, but users are shown the expansion terms.
  - Penetrable: the user is given a chance to edit the list of terms before the query is rerun.

[Screenshot: the base system (Rutgers INQUERY interface) with no feedback. A query of terms such as automobil*, manufactur*, car*, defect*, recall* retrieves 6,747 documents; the top document is a Wall Street Journal story about GM recalling 1988-89 model cars with the Quad 4 engine to fix defective fuel lines.]
[Screenshot: the penetrable interface, which shows the user the list of candidate expansion terms (e.g., automobil*, manufactur*, car*, defect*, recall*, Oldsmobil*) to keep or discard, with "Use All" and "Continue Query Run" controls, before the query is rerun.]

Experimental procedure

- Two query construction approaches: first without relevance feedback, then with one of the three RF approaches, randomly assigned.
- The task is to construct a good long-term query; evaluation is based on the effectiveness of the final query.
- There was no difference between users on the first task.

Effectiveness with feedback

- Measured by precision at 30 documents.
- Clear improvements from the use of RF; opaque and transparent are the same by design; penetrable is best.
- The only statistically significant difference is between penetrable and the base system.
- Results are comparable for precision at 100 documents.

[Figure: mean precision at 30 documents by condition (base, opaque, transparent, penetrable).]

Behavior with feedback

- The task was to build a good query: how many attempts do people make?
- For some reason, the transparent interface encouraged an extra iteration; the penetrable interface took one less than normal. It is not clear what this means.

[Figure: mean number of iterations by condition.]

Was feedback used by the searcher?

Where did the words they chose come from: copied from the lists provided by feedback, or added automatically by the system?
- Users typed short queries; feedback added many terms.
- The penetrable system encouraged fewer terms, but resulted in more effective queries, faster.

[Table: mean number of query terms by source (user-typed, copied from the RF list, added by the RF system) for each topic and condition.]
Subjective reactions

- Subjects liked the penetrable version; subjects in the opaque condition expressed a desire to see and control what happened.
- Subjects commented that feedback made them "lazy": the task of generating terms changed to a task of selecting terms.

Pseudo-relevance feedback

- True relevance feedback is supervised: feedback is based on genuine user annotations.
- What happens if we try to guess what is relevant? Assume many top-ranked documents are relevant (and optionally find a collection of probably non-relevant documents), modify the query on that assumption, rerun the new query, and show the results to the user. What happens?
- Also called blind relevance feedback, or local feedback.

History

Attar and Fraenkel, JACM 1977, "Local Feedback in Full-Text Retrieval Systems":
- Small databases: 75 U.S. patents in electronics, 7 queries; some work with 2,500 documents in Hebrew.
- Showed a clear gain from local (dynamic) feedback; it helped to weight candidate terms by proximity.
- The idea seems to have been largely ignored for 16 years, in published results at least.

History: modern

Rediscovered in TREC-2 (1993) by Robertson, Walker, et al., City University of London ("Okapi at TREC-2"); also done by UCLA and CMU/Claritech.

Procedure:
- Used the top 30 retrieved documents.
- Added as many terms as half the length of the query (TREC-2 had very long queries, 30-50 words).
- Example, topic 112 (funding biotechnology): added pharmaceut*, financ*, partner, drug, investor, technolog*, company, industri*, and develop*.

Results:
- Average precision went from 0.373 to 0.407; not huge, but noticeable in an environment like TREC.
- Note: 25 queries improved, 18 got worse, and 7 didn't change.

[Figure: example of impact, from Evans and Lefferts, "Design and Evaluation of the CLARIT TREC-2 System".]

History, continued

Widely adopted in TREC-3:
- "Okapi at TREC-3" (Robertson et al.): added 11 to 40 terms, averaging around 20 additions.
- "Automatic Query Expansion Using SMART" (Buckley et al.): massive expansion, 500 terms and 10 phrases.
- "TREC-2 Using PIRCS" (Kwok et al.): spreading activation triggered by the six top-ranked documents; on average added 11 new terms.
- UMass Amherst (CIIR)'s run was still a holdout: it used a global type of query expansion (InFinder). By TREC-5 it had local context analysis, a passage-focused variation on expansion.
- The approach continues in TREC-4, but a new ranking function dominates: the Okapi "tf" function appeared in TREC-3.

Why in 1993 but not 1977?

- Attar and Fraenkel showed suggestive results, but the techniques generally went awry. They still did in 1993, but helped more than they hurt. (A standard class project in some IR classes in the 1980s "didn't work well".)
- The nature of the TREC collections was different: more relevant documents per query increases the chance of a hit; better retrieval algorithms also increase the chance of a hit; longer documents mean more term usage to help find good ones.
- Still widely used: in vector space model systems; language modeling approaches incorporate similar ideas as part of query smoothing techniques, and it is also explicitly considered as part of relevance models.

Local Context Analysis (LCA)

Xu and Croft, 1996. Assumed relevance feedback.

Observations:
- Existing techniques improved queries on average, but some queries had a serious drop in effectiveness.
- The top-ranked documents were not always right, often caused by a match of a single query word.
- Not every word is useful to add to queries.

These observations inspired the creation of LCA. The major focus is on getting better terms for expansion: finding terms to consider, selection of terms, and weighting of selected terms.

Finding candidate terms

- Run the query to retrieve passages. This is similar to most assumed-relevance work, but passage retrieval was unique to LCA at the time; it uses 300-word passages.
- Select expansion concepts from the retrieved set.
- Why passages? They minimize spurious concepts that occur in lengthy documents.
spurious concepts that occur in lengthy documents

Selecting candidate terms
Parse document collection
Generate part-of-speech tagging
- The/AT .../... has/HVZ been/BEN reworked/VBN since/CS it/PPS was/BEDZ introduced/VBN ... in/IN some/DT ... it/PPS would/MD impose/VB ... on/IN much/AP of/IN the/AT ...
Select only noun phrases
- Shown to be critical in most retrieval systems
- Generally particularly useful for expansion
- Could easily be extended if useful (adjective-noun phrases, verbs)
- Note that tagging is automated, so makes mistakes

Weighting terms
Want concepts that occur near query words
- The more query words they occur near, the better
Count co-occurrences in 300-word windows of text (passages)
- To avoid coincidental co-occurrence in a large document
Uses the following ad hoc function to weight concepts:

  f(c,Q) = prod over w in Q of ( 0.01 + codegree(c,w) )^idf(w)

  codegree(c,w) = log10( co(c,w) + 1 ) * idf(c) / log10(n)   <- measure co-occurrence
  idf(w) = min( 1.0, log10( N / N_w ) / 5 )                  <- importance of word
- Floor the IDF component; slow its growth

Using expanded query
Developed using Inquery
Incorporate using weighted sum
- Weight original query and expansion query equally
  Q_new = #wsum( 1.0, 1.0 Q_original, 1.0 Q_lca )
  Q_lca = #wsum( 1.0, 1.0 c_1, 1.0 c_2, 1.0 c_3, ... )
Variations
- Lower weight on each subsequent term
- More important the more terms that are added
- Weight original query equally with a single expansion concept
  Only works when query is not very reliable

Example of expansion concepts
TREC query 213
- "As a result of DNA testing, are more defendants being absolved or convicted of crimes?"
Expansion concepts
- dna pattern, dna testing, lifecodes, dna test results, dna lab, dna evidence, dna test, dna profile, ... lawyer peter neufeld, new york city murder case, ... supermarket merchandise, thomas caskey, procedures lifecode, lifecodes corp, tests reliability, maine case, rape conviction,
dna strand, ...

Does it work?
TREC-3 and TREC-4 ad hoc queries
With and without LCA expansion
[Bar charts: precision change per query, for 50 TREC-3 queries and 49 TREC-4 queries]

Summary
Relevance feedback
- Real or assumed
Real relevance feedback
- Usually improves effectiveness significantly
- Not always stable with very few documents judged
- Difficult to incorporate into a usable system
- "Find similar" is a simple instance that does appear
Pseudo-relevance feedback
- Also called "blind relevance feedback" or "local feedback"
  (or "quasi-relevance feedback," or ...)
- Rocchio-based approaches effective but unstable
- LCA comparably effective, maybe better, and more stable
- Relevance models (soon) provide a formal framework

Information Retrieval
Question Answering
James Allan
University of Massachusetts Amherst
CMPSCI 646 Fall 2007
All slides copyright James Allan

Question answering motivation
IR typically retrieves or works with documents
- Find documents that are relevant
- Group documents on the same topic
People often want a sentence fragment or phrase as the answer to their question
- Who was the first man to set foot on the moon?
- What is the moon made of?
- How many members are in the U.S. Congress?
- What is the dark side of the moon?
Move IR from document retrieval to answer retrieval
- Document retrieval is still valuable
- Extends breadth of active IR research

Some TREC History
QA begun in TREC-8 ('99) and was similar in 2000
First focused on "factoid" questions from an unrestricted domain
Now includes other classes of questions (definitions, lists)
Run against a large collection of newswire
Guaranteed that answer exists in the collection
Return short text passage that contains and supports answer
- 250- or 50-byte passages
Return 5 answers (passages) ranked by chance of having answer
Evaluation based on mean reciprocal rank of first correct answer:

  MRR = (1/|Q|) * sum over q of s_q,
  where s_q = 1/R_q if R_q <= 5, and 0 otherwise
  (R_q is the rank of the first correct answer to question q)

Judgment issues
Correctness of answer not always obvious
Applied several rules to simplify problem
Lists of possible answers ("answer stuffing")
- Not considered correct, even if correct answer in there
Answer had to be responsive
- If "$500" was the correct answer, then "500" was incorrect
- If "5.5 billion" was correct, then "5 billion" was not
Ambiguous references refer to famous one
- "What is the height of the Matterhorn?" means the one in the Alps
- "What is the height of the Matterhorn at Disneyland?" is the other

TREC 2001 version
Allow only 50-byte responses
- 250-byte problem was too easy
Main task
- No guarantee answer is in corpus ("nil" is correct response then)
List task
- "Who are 6 actors who have played Tevye in Fiddler on the Roof?"
- Answer will cross multiple documents
- Scores on accuracy within number of items requested
Context task
- Series of questions where context must be preserved
- No interesting technical results here, so task dropped
- Though current QA track has sequences of connected questions

Perceived problems
"Correct" answers to "What river in the US is known as the Big Muddy?"
- the Mississippi
- known as Big Muddy, the Mississippi is the longest
- as Big Muddy, the Mississippi is the longest
- messed with. Known as Big Muddy, the Mississip
- Mississippi is the longest river in the US
- the Mississippi is the longest river in the US
- The Mississippi is the longest river;Mississippi
- has brought the Mississippi to its lowest
- ipes.In Life on the Mississippi, Mark Twain wrote t
- Southeast;Mississippi;Mark Twain;officials began
- Known;Mississippi;US;Minnesota;Gulf Mexico;Mud Island;Mississippi;The history;Memphis
Clearly top answers are better
"Answer stuffing"

TREC 2002
Repeated main and list tasks
Must return exact answer without extra information
- So some of the examples on the previous slide would be wrong
"Answer stuffing" forbidden again
- Answers judged right, wrong, inexact, or
unsupported
Used new AQUAINT corpus
- Documents from AP ('98-'00), NYT ('98-'00), and Xinhua English ('96-'00)
- About 1,033,000 documents in 3Gb
75 submitted runs from 34 participating sites
- 67 in the main task
- 8 in the list task

List task
Some constraints on answers
- System must provide exact answer
- Must be supported in the text
Systems also know
- Sufficient answers exist in corpus to answer question
- Questions created by NIST assessors, not by mining a search log
25 questions were created
Sample questions
- Name 22 cities that have a subway system
- What are 5 books written by Mary Higgins Clark?
- List 13 countries that export lobster
- What are 12 types of dams?
- Name 21 Godzilla movies

Main task
500 questions
No definition questions (needed a pilot study first)
No answers required (49 of 500 ended up with no answer)
Taken from MSN Search and AskJeeves logs donated in 2001
Some spelling errors in questions corrected, but not all
- When to stop? Is a misplaced apostrophe a spelling error?
Requirements on answers
- Precisely one answer required, not five like before
- System must indicate confidence in answer
- Could optionally submit a justification string
Evaluation is confidence-weighted average precision
- Rank answers to all questions by confidence

  CWS = (1/|Q|) * sum for i = 1..|Q| of (number correct in first i ranks) / i

TREC 2003 QA tasks
Main task: factoid question answering
- 413 questions posed against AQUAINT corpus
- 54 runs from 25 groups (also did next two types)
- Scored by fraction of responses that were correct (accuracy)
List task
- 37 questions, with no specification of how many answers in list
- "List the names of chewing gums"; "What Chinese provinces have a McDonald's restaurant?"
- Scored by instance recall/precision and F1 measure
Definition task
- 50 questions
- Facet-based recall measure, length-based precision measure
Passages task
- 250-byte extract containing answer, or nil if none exists
- 21 runs from 11 groups
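The confidence-weighted score just defined can be computed directly; this sketch assumes one (confidence, correct?) pair per question, with the pairing format itself an illustrative choice.

```python
def confidence_weighted_score(answers):
    """TREC 2002-style confidence-weighted score.

    answers: one (confidence, is_correct) pair per question. Sort by
    descending confidence, then average (number correct in the first
    i ranks) / i over all ranks i.
    """
    ordered = sorted(answers, key=lambda a: a[0], reverse=True)
    correct_so_far = 0
    total = 0.0
    for i, (_, is_correct) in enumerate(ordered, start=1):
        correct_so_far += bool(is_correct)
        total += correct_so_far / i
    return total / len(ordered)
```

Because early ranks are counted in every partial-precision term, getting the most confident answers right matters far more than overall accuracy, which is why the measure rewards good confidence estimation.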
General QA approach
[Diagram: pipeline from Question to Answers]

Key points for success
Good passage retrieval
- QA included evaluation specifically on passage retrieval too
Recognizing question type is critical
- Requires having the ability to recognize those entities
Some sample entities that a system might find
- person, organization, location, time, date, money, percent
- duration, frequency, age, number, decimal, ordinal, equation
- weight, length, temperature, angle, area, capacity, speed, rate
- product, software, address, email, phone, fax, telex, www
- subtypes: company, government agency, school
Better performing systems almost always have better entity recognizers and large numbers of entity types

Passage retrieval
Not every system depends on this, but most do
Given query, find passages likely to contain answer
Most successful approaches use question patterns to find alternative ways to phrase things
- To greatly increase recall
Start with a question and a known answer
- "When was Bill Clinton elected President?" (1992)
Look for all occurrences of that answer and declarative forms of the question throughout text
- "Bill Clinton was elected president in 1992"
- "The election was won by Bill Clinton in 1992"
- "Clinton defeated Bush in 1992"
- "Clinton won the electoral college in 1992"
Extract patterns that occur frequently
Now more likely to be able to answer similar questions
- "When did George Bush become president?"

Query expansion
Question expansion
- Process that adds related words to a query
- Improves recall
- Relevant documents using slightly different vocabulary
Seems appropriate here, and it does work
Difficulty is the need for answer justification

Case study: UMass TREC-8
"Who is the 16th president of the United States?"
One of UMass' top documents contained the answer (Abraham Lincoln)
- But document was about the Gettysburg Address
- Document did not mention that Lincoln was president, let alone
that he was the 16th president
Assessors were forced to accept this as a valid answer
- They hated that
Why did it happen?
- Query expansion added these features:
  Lincoln, Abraham Lincoln, Mr. Speaker, Gettysburg Address
- Also added some strange ones:
  Kermit, stereotypical teenager, Disneyland

Case study: UMass (cont.)
Or maybe it isn't all that strange:
- "Disneyland operators said visitors and park employees alike reacted angrily to reports that the robot replica of the nation's 16th President was being removed to make room for a new Muppets attraction." (LA Times, 8/24/90)
Had serendipitous effect of getting answer (Lincoln)
But all of the top retrieved passages were about the Muppets
This is one reason that justification was required from then on

BBN's Answer Selection (TREC 2002 and 2003)
First, get candidate answers
- Use IR system to find passages
- Extract all items of correct type, given type of question
- Rank candidates by similarity of containing passages to query; choose best
Now use constraints to re-rank candidate answers
- If the question asks for a number, check whether the answer quantifies the same noun in the answer context as in the question
- If question looks for a type of location, ensure answer is the correct type
  (e.g., a country, a state, a city)
  Note that BBN's original candidate tagging does not distinguish types of locations
- Check that answer satisfies verb arguments
  For "Who killed X?", preferred candidate should be subject of verb "killed" and X an object
  Used parse trees of the question and sentences in the corpus
Answers that satisfy constraints are preferred
Resulted in 2% absolute improvement in effectiveness

BBN's confidence estimation (TREC 2002, not used in 2003)
Goal is estimating P(correct | Q, A)
Used several features that seemed predictive
- Does the verb match the noun (as in answer re-ranking)?
  P(correct | match) = 0.49; P(correct | no match) = 0.23
- Number of words in common (n) and number of content words in query (m)
  As the number increases,
probability of correct does too
[Table: estimated P(correct | m, n) for various m and n, e.g. 0.12, 0.34; most values unreadable here]
- System's general ability with that type of question (T)
All estimated using training data from past QA tasks

Putting those all together
Want to estimate P(correct | Q, A)
They did this by a mixture model:

  P(correct | Q, A) = lambda1 * P(correct | VS) + lambda2 * P(correct | m, n) + lambda3 * P(correct | T)

Easy to look up values in tables built from training

BBN's use of the Web (TREC 2002 and 2003)
Several systems used the Web to help
- Huge source of text that might answer the question
BBN formed two queries
- One rewrites the question into a declarative sentence
- Another just uses the content-based words
Mine the returned snippets (rather than pages, for efficiency) for candidate answers
- Must be of correct type
Select best answer (next slide)
To get justification, find a TREC document that contains the selected answer
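The "putting those all together" step above is a linear mixture of table lookups; a sketch follows, where the lambda weights and the probability arguments are placeholders for the numbers BBN trained on past QA data, not published values.

```python
def answer_confidence(p_vs, p_mn, p_type, lambdas=(0.4, 0.3, 0.3)):
    """P(correct | Q, A) as a mixture of simpler table-lookup estimates:
    p_vs   -- P(correct | verb-argument structure match)
    p_mn   -- P(correct | m content words in query, n words in common)
    p_type -- P(correct | question type)
    The lambda weights (illustrative values here) must sum to 1.
    """
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_vs + l2 * p_mn + l3 * p_type
```

Each component probability comes from a lookup table built on training data, so the whole estimator is cheap to apply at answer time.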
answers it 3 am 35 quot Provides numerous sources of evidence 7 Earl answers re uire justi cation in corpus Combines candidate answers 131 iv Axuwruiu dismimumr CMI SCI 646 Copyright James Allan Waterloo s AnswerDB 0 Collection of tables with information on a bunch of topics CMPSCI 646 Copynght James Allan Con dence at Waterloo By means that answer was found ranked as follows 7 Results from early answering system Continent currency lake ocean planet province Color country season state year Anniversary date length mountain person place proper thing Long time Code large speed Number rate temperature Money Other categories Uncategorized questions Unanswered questions suggested nil as answer CMPSCI 646 Cupyngqre JamesAllan Waterloo results TREC 2002 run Lag uHmtBS uwmtB2 uilmtBl uwmtBO um Carly answering y r n u uses Altavista n v u cull correct NIST d l P gmquot 3 con dcnccrwcightcd score a 1 correct 44 n39 39 439 MultiTcxl judgments uwmtB3 good performer just above BBN 7 Though BBN s accuracy much worse BBN got 142 right 284 Waterloo got 184 right 368 7 BEN seems to have better con dence estimator Web didn t help much Early answering helped slightly 7 Except that con dence score went way up because they were always right CMPSCI 646 Cupynghua JamesAllan 3347 a 1 0441 n p it 44 con denccrwcightcd score 0514 0610 0509 n IBM also combines evidence Architecture diagram TREC 2002 shows complexity Uses search the Web and Wordnet Also uses Cyc the common knowledgequot database A m L47 In Mugs Elmmd Aukwry Am l Male 0 Allum Kym l gem4 Arman us i AM Mm 1mm r anmlm m Annn ken CMPSC1646 Copyright JamesAllan Deciding if nil is the best answer IBM TREC 2002 and 2003 Use of large external knowledge base 7 Ask the database for answer to question 7 If it knows the answer and automatic process got a different answer on TREC corpus 7 Then answer nil Also use training data to guess when it s unlikely that system has the correct answer NIL CORRECT o 35 xxxxxxxxxx xx xxxxxxxxxxxxxx x xx 
xx Note how 4 3 often nil is i i 2 17 correct 5 n E 15 answer at a g 9 5 bottom of 13 5 x correctly enswered question incorrectly answered question 7 NIL questlon accordlny to NIST CMPSC1646 Copyright JamesAllan Use of Cyc knowledge base Used for answer sanity checking Have system generate answer and ask Cyc if answer seems reasonable If answer i 10 of Cyc s best guess then sane Only helped once 7 What is population of Maryland 7 Answer 50000 7 Justi cation Maryland s population is 50000 and growing rapidly something called 7 Valid on the surface except that it had to do Nutria a rodent raised for its fur not people answer was about 51million so second best though less highly ranked answer was accepted because it was sane Followup work has made better use of Cyc 7 Didn t help at all in TREC 2003 though CMPSC1646 cepyngqi JamesAllan Top performing systems at TREC 2002 Con dence Correct weighted Answers Number Nl39L Accuracy un Score nexart Prec Recall LCCmain2002 0856 3 0 8 0 0804 actans r 0691 0848 pd52002 0610 0891 ST02D1 0589 0217 v l39BMPQSQACYC 0588 0630 nuwmtBS 0512 0000 BBN20020 0499 0087 39 39 2 0498 0109 lnmsiQalirZ 0497 0196 ali2002b 049 0848 Ibmsanc 0455 0239 ilv02wt 0450 0040 FDUTllQAl 0434 0957 araneaOZa 043 0174 nuslamp2002 0396 0000 CMPSC1646 Cupyngqt JamesAllan Ability of systems to estimate con dence ennmemmlmum Scare CMIPSCI 646 n 100 mu m Nnmim mm Cupyngm James Allen De nition task TREC 2003 0 Sample questions 7 Who is Colin Powell 7 What is mold Drawn from search engine logs so they re realistic 750 730 710 710 questions had a person as target Vlad the Impaler Ben Hur had an organization Freddie Mac Bausch amp Lomb had something else golden parachute feng shui TB 0 Answerto a definition has an implicit context 7 Adult native speaker of English average reader of US news 7 Has come across a term they want more information about 7 Has some basic ideas already eg Grant was a president 7 Not looking for esoteric details CMIPSCI 646 Cupyngm James Allen 
Judging de nitions Phase one creating truth 7 Assessor created a list of information nuggets 7 Used own question research 7 Combined with judgments of submitted answers 7 Vital nuggets those that must appear selected Phase two judging 7 Look at each system response 7 Note where each nugget appeared 7 If nugget returned more than once only one instance is counted CMPSCI 646 Copynght 1ames Allan Example judging What is a golden parachute CMPSCI 646 Agwc Mmcl Swings The arrangement whrch lncludes lucratrve stock optrons a hefty almy and a vghlaem parachutequot 1F wrmn 1 men on Eaton also has a new goloen parachute clause rn hrs con7 tract But some rnclunrng many of Bqu s top executrves julned the 215 and casheo rn therr golden parachutequot severance packages The blg payment that as a quotgolden parachut Cotsakos39 contract rncluoeo a goloen parachute hrg enough to lrhe Eyler recerven rn January was rntenneo aquot ake a future sale or the company more ly syno n t en parachute for procluctron companres But f r ts o rsmr sed ourrn two years after the mer r e w e rd 524 4 mrllron wrth Dalmler7 e golden parachute t to hrm and th are to t39 U Nelll coulo lect a golden parachute package provrorng three years of e salary ano ohuses stuck and other hen flts arter the ta eover as 175 orsappeareo ano EofA39s stock tumbled solo out hrs 1 many saw hrm as a humoler y o had ng w oen parachute ham walkl away rth a gol that glves hrm 55 mrllron a year for the rest of hrs lrte And aftel EDfA dlsclosed that he hao a goloen parachute agreement g vr g so e s50 mrllron to slam m rllr on 1f n wrng the rger me e sen marl message to bank employees that he rntenoeo to stay Scoring a returned definition Recall is proportion of nuggets returned Precision based on expected length Use F measure with 35 7 Puts much more emphasis on recall than precision arm ncll m A impuns in mm ml bill on mu Hugger it lot 139 bu the numth ul39vlml nu bu lliu numb Allowed 100 bytes per nugget be x lnn buo slung 
llmluctl aw i Tllcll Unusual 1 lllcultullnuance precision rmwmm lmirillmvtrlrn 6 ML 1 i cn 39 measure CMPSCI 646 Cupyngqroiarmsmian An approach BBN TREC 2003 Decide if definition is a who or what question Extract target Use IR to find top 1000 matching documents Apply heuristics to sentences to find those that mention targ Extract kernel facts eg phrases i imize irrelevant material in answer Facilitate redundancy detection Rank kernel facts b t e and similarit to profile Pro le is word centroid that models importance 0 words Heuristically look for redundant kernel facts Output nonredundant kernel facts subject to length constrain WN 01 01 CON CMPSCI 646 Cupyngqroiarmsmian Kernel facts Appositives and Copulas from a parse tree 7 George Bush the US President 7 George Bush is the US President Propositions 7 Approximation of verbargument structure 7 Person was born on date Structured patterns 7 50 handcrafted rules to find patterns that define things 7 ltTERMgt islwas also ltRBgt callednamedknown as ltNPgt Relations 7 Spouse of employee of etc Full sentences fallback CMPSCI 545 Copynghl James Allan Ranking and trimming kernel facts Create a pro le in one ofthree ways 1 Search for existing definition elseWnere 7 WordNet glossaries dictionar enclclo edia wiki edia biogra h dictionary Google eg George Bush biographyquot 2 For who questions use centroid of 17000 short bios vwvwsgcom 7 Essentially creates a language model of biographies 3 For whatquestions use centroid of extracted kernel facts Remove redundancy 7 For propositions look for duplicates and remove extras 7 For patterns choose only one from each rule 7 Else if 70 of words have already occurred call it redundant CMPSCI 545 Copynghl James Allan Results for definitions Table shows results of definitions for B5 Also shows what different values ofB do Note how good sentence baseline does 7 Return all sentences that mention the target eg lt golden parachute 7 But reduce it slightly by a mi eliminating sentences that 
amimMi overlap too muc fff H thunk 39 Ulnwiit mm 7 Prowded by BBN mm m ll lllZU U Does best when recall will lEMllJO b is heaVIly weighted muwm CMPSCI 646 Copynght 1ames Allan BBN s results Did okay except for about 10 questions Several result of faulty assumption of target quesionssoed by F F score 7 What is Ph in Biology Assumed Ph in Biologyquot was a object 7 Who is Akbar the Great Assumed Great was his last name Some errors caused by redundancy checking 7 Ari Fleischer Dole s former spokesman who now works for Bush 7 Ari Fleischer a Bush spokesman This was redundant because of previous kernel CMPSCI 646 Copyright James Allan Final scores of systems TREC 2003 Three types factoid list and definition Final score is a linear combination 7 Kde nition score Doesn t match balance of questions 7 413 factoid r 37 list 7 50 de nitions Reflects desire to force work on lists and definitions But to keep factoid questions important CMFSCI 646 CapyngthhmesA an Scores of top 15 systems TREC 2003 sn lumilcr Nmfanm Uim vuilv m Smgmxmc lmy z LCME m Llianyvl 56mm aliluinm m mamme BBN NH 3 um Mmmclmsuus mum nl39Twluwlugl u mus llwlnmc lBM Rcm h ll lf39ingml llvmumlz Llimcmh al39 llvimy rnurlzon l39litlmiLluncml Unnch 61 Ammmm LinnCM ul siltmen quotlin Mullun Uimmsm rimiml lt u mAumcinynl Scicnccs ASrlt ll Llnncmlvul mm lldngur CMFSCI 646 CapyngthhmesA an 21 What about that topperforming system LCC Language Computer Corporation does a consistently great job at this task Very complex system with lots of Allike technology 7 Attempts to prove candidate answers from text 7 Lots of feedback loops 7 Lots of sanitychecking that can reject answers or require additional checking Attempts to replicate results have failed 7 System is so complex it s hard to know where to start 7 LCC is a company and probably isn t telling us everything Until their highquality results are understood they remain an outlier albeit a really good outlier CMPSCI 545 Copynghl James Allan Summary Question answering 
is a hot area right now Has been explored numerous times in the past 7 Perhaps time was ripe So far focus has been on simpler questions 7 Factoid questions lists definitions 7 TREC tries to make things more difficult each year Part of an AQUAINT program looking at problem 7 Much more complex types of questions being explored in research rogram Dialogue situations crosslanguage against rapidly changing data so answer might c ange Some efforts require heavy knowledge bases eg Cyc Exciting and active area of research at the moment CMPSCI 545 Copynghl James Allan 22 Information Retrieval Information Retrieval of Web pages James Allan University of Massachusetts Amherst CMPSCI 646 Fall 2007 All slides copyright James Allan What makes Google work well According to them 7 The heart of our software is PageRankTM a system for ranking web I aes develo ed b our founders And While we have dozens of engineers working to improve every aspect of Google on a daily basis PageRank continues to play a central role in many of our Web SeaI39Ch tomsH httpwww google corntechnology 7 Look at 2002 version for a hint of what we ll find continues to provide the basis for all of our web search toolsquot CMPSCI 545 Copynghl James Allan CMPSCI 646 And what is PageRank Paraphrased from Google s press A link from page S to page T is a vote by S for T 7 uniquely democratic nature of the W Votes from better important by PageRank pages count more Important highquality sites receive a higher PageRank PageRank combined with sophisticated text matching techniques 7 far beyond the number of times a term appears on a page 7 dozens of aspects of the page s content 7 and the contents of the pages linking to it Copynghl James Allan CMPSCI 646 Stepping back LinkBased Retrieval Use structure of the Web to identify popularity or authority Similar to citation analysis that started in 1970 s 7 ldentifing authoritative authors articles 7 Only similar Web links are not citations Intuition is that sites with many 
inlinks or backlinks are more popular Simple counting can lead to problems 7 Which links 7 Are all links the same 7 Spamming Kleinberg s HITS algorithm started new era of Web link analysis Google s PageRank algorithm popularized the approach Copynghl James Allan HITS algorithm Hypertext Induced Topic Search 7 Authoritative Sources in a Hyperlinked Environment Jon Kleinberg 1997 Developed as part of IBM s Clever search project An effort to extract useful information out of the link structure of the Web HITS and PageRank developed at about the same time 7 HITS stayed mostly a research project 7 PageRank was incorporated in a commercial product 7 More information available about HITS CMPSCI 646 Copyright James Allan Basic idea behind HITS Start with a query Use a search engine to get the top ranked set of documents 7 Call that the root set That set is likely to be relevant but may miss important pages that do not contain the search term Expand to base set by adding 7 Pages pointed to by pages in root set 7 Pages that point to pages in the root set 7 But limit the number of each so that no page brings in too many additional pages CMPSCI 646 Copyright James Allan HITS cont Two classes of links 7 Transverse links cross domains from umassedu to ibmcom 7 Intrinsic links are domain eg csumassedu Delete intrinsic links 7 Hypothesis is that most of those are navigation links up home Could apply other heuristics 7 Look for potential spam linking created by collusion 7 Strip out stoplinks bestviewed by Use resulting graph to nd interesting sites CMPSCI 545 Copynghl James Allan Hubs and Authorities Basic idea is to consider recommended pages 7 These are likely to be authoritative and recommending pages 7 These may be hubs pointing to a good set of material n n I hi CMPSCI 545 Copynghl James Allan Iterative algorithm Maintain arrays ofweights a e Xp is authority wt of page p e Yp is huo weight ofpage p u Maintain invariant on each 0 O quotW quotFleming Mummy 1 Initially give all 
pages equal weight 0 qr o lterate ad nauseum e Propagate huo weight to affect i ht O authorit we r O Propagate authority weight to affect ights hub we n1an 7 Rehorrhaiize vectors quot iztgivl l im q Showed it will always converge 7 But 20 iterations is usually ehough CMFSCI 646 Copyrigth imsAiian Google s Page Rank A page has high rank if the sum of the ranks of its backlinks is high 7 ihciudes ooth rhahy backlii iks and a few highly ranked backlii iks Mu RV Where 7 u is a w o page 7 a is the set of pages u points to forward lll ikS not used here e 5 is the set ofpages that point to u backlii iks 7 Nu i Ful is the Humberoflll ik from u e ClSaidllu Ull rahiltofaii Recursive formula Can be computed by starting with any set of ranks and iterating until convergence Very similar to half oleTS CMFSCI 646 Copyrigth imsAiian CMPSCI 646 Rank Propagation Start with link structure Pages have some starting rank value Split it among outgoing links 7 Equal votes for each link Sum weights of incoming links to get new rank ofa Page Repeat ad nauseum Copynghl James Allan CMPSCI 646 PageRank Can also be formulated as an eigenvector problem 7 R cAR where R is a vector over web pages AW 1Nu if there is a link from u to vand 0 if not Simple algorithm has a problem with rank sinks 7 Imagine a page that only points to itself 7 And has at least one other page pointing to it 7 Solve by creating a source of rank E to keep adding rank in Random surfer model Other issues 7 dangling links to unavailable sites 7 intrasite and navigation links 7 stoplinks eg Amazon Internet Explorer Could apply heuristics mentioned in HITS Copynghl James Allan Searching with PageRank Google uses a number of factors to rank search results including 7 tfidf content search 7 Proximity of query terms in page 7 Location of query terms within page 7 Anchor text from links to the page 7 PageRank Standard IR scores and PageRank scores are merged What is value of each component Trade secret CMPSCI 545 Copynghl JamesAllan 
TREC Web tracks Considering Web documents since 1997 lnitial questions were largescale retrieval 7 Very large corpus VLC track 7 Same problems as adhoc retrieval but larger corpus e 1998 100Gb collection Of cially a Web track starting 1998 7 Initially typical adhoc retrieval 19982000 7 Movement toward sitefinding queries 2000 No bene t to linkbased measures in any ofthese 7 In 2001 add homepage finding links finally help In 2002 evaluate Webspeci c search tasks 7 Topic distillation e Named page findin e 23 research groups participated Continued through 2004 then morphed to ef ciency CMPSCI 545 Copynghl JamesAllan TREC 2002 Topic distillation task Task 7 Put together a short but complete list of pages that are a good resource for a topic 7 Topics are similarto adhoc topics from the past Corpus 7 January 2002 crawl of gov domain 7 125 million pages approximately 18Gb 39 39 num PDF P3 and 39 Images not included in corpus but were available n g L th MW 4 a a a Judgments on task 7 Is page a good key resource or not 7 NB Page may be a good resource but not relevant per se Evaluation is by precision at 10 documents returned 7 Emphasize conciseness CMPSCI 545 Copynght James Allan Topic distillation what is it Premise some quality beyond relevance is important for Web searching 7 Authority quality definitiveness etc 7 Want to strike balance between quality and relevance Look for key pages 7 One worthy of appearing in a short list of important URLs Compare what Yahoo or DMOZ lists have to decide upon 7 Selected pages will be relevant and high quality 71 submissions from 17 groups 7 56650 pages judged for relevancequality 7 1574 of those judged to be key pages CMPSCI 545 Copynght James Allan Queryindependent properties Look at indegree and length ofURL Do the redict likelihood that URL would be chosen for inclusion in Yahoo and MOZ Graph is an ROC curve 7 False alarm orl X agtltl5 True posulves 7 Good ls upperlelt Implications e Lorlg URLs less valuable 7 Pages llrlllted to 
often are valuable CMFSCl 646 Capynghz hmesAlan True Dmlllves Queryindependent properties The names as l l a M n m False Ptlslllves Directory pages Key pages Not as predictive U J 6 6 False Pusllles CMFSCl 646 Capynghz hmesAlan Distribution of key pages per topic Fiequemy iiiiiHthti Hi i H i i Some have quite o few we Niimiwei reiewm uer tunic 150 CMPSCI 646 Copyright Iames Allan Distillation results Top 8 groups by Angreo10 of best run 03 A O V 2 02 VI 396 A L 0 g 01 U L I gt lt o n d E quotE 2 13 E S E i N 2 3 8 k 2 3 quot 9 1 3 S 39E E i 5 2 3 CMPSCI 646 Copynghi 1ames Allan Conclusions of results Anchor text useful Outdegree not useful Siteuniting helped with avg prec but not prec10 Content based retrieval can work as well Eliminating documents wo query words in title helped Removing nearduplicates by similarity hurt Removing true duplicates was important Query expansion hurt PageRank hurt Link analysis helped CMPSCI 646 Copy ght James Allan City University London Used standard retrieval system with Okapi weighting K1 impacts importance of term frequency B affects importance of document length 75ft K1 1 DLZ39 tf2Kl 1 BBZjDLj N no 2 mm Tuned parameters for this task K1 15 normal value B 08 heavy emphasis on document length Results were second best Prec10 0241 Angrec 0190 If set B 02 less emphasis on document length Prec10 0200 Angrec 0143 Conclusion Seems that elaborate processing of Web data is not critical Runs counter to perceived wisdom CMPSCI 646 Copy ght James Allan 11 Glasgow University Use an absorbing model for PageRanklike results Static absorbing model Calculate authority score Tor page Combine it with content score using trained parameter RSV 10 RSVd 01 authSAM Dynamic absorbing model Applied to topranked pages set B Assume top of that set A is most authoritative Break links from A to BA but not from BA to A Calculate authority score for each set Note that A s scores are boosted by links from BA Scores in BA do not get any of credit from set A 
Maintains the authoritativeness of the very top-ranked pages.

Glasgow (cont.) – spreading activation: A portion of a page's score is propagated from its children – here done only when the pages are in the same domain – and only when the source of the link (the child) is deeper within the domain: RSV′_d = RSV_d + β · Σ_{s∈S} RSV_s. Query-biased spreading activation – be more cautious with activation for ambiguous queries – calculate the specificity of the query – based partly on the position of the query words in the WordNet hierarchy – replace β with the query-scope value: RSV′_d = RSV_d + queryscope · Σ_{s∈S} RSV_s.

Glasgow in implementation: Static absorbing model – retrieve the top 1000 pages – re-rank them to incorporate the authority score. Dynamic absorbing model – retrieve the top 1000 pages – re-rank the top 50 (i.e., B = 50) – assume the top 10 are high quality (i.e., A = 10) – numbers are small; recall that precision at 10 is the official measure. Spreading activation – give a weight of 0.1 to the linked pages. Other indexing – double the weight of terms appearing in titles – add in anchor text from all pages that link to this page (can treat it differentially, or as if it is all one bag of words).

Glasgow results: Content of pages is the most important part of the process – content-only scored 0.2694, compared to 0.2224 for the best other run. Can make DAM provide slightly better results – 0.2776 on a post-hoc evaluation run – setting A = 10 was too small; results improved with A = 20 – PageRank consistently underperforms DAM. Use of anchor and title text was detrimental. Query-biased spreading activation was better than static – but both were worse than not using it at all. Query expansion hurt effectiveness.

IBM Haifa – Knowledge Agents: Maintain a knowledge base (KAB) – stores information about a domain (user-determined) – main pages in the KAB: root pages from search engines, pages inserted by the user, seed pages (relevant pages); KAB size is limited, so only the best pages are maintained; a page's score is updated every time a search
occurs, so long-lived KABs are more stable. Also satellite pages – pages that KAB pages point to, and that point to them – provides anchor-text information – frequent terms in those pages, and terms that co-occur with them.

IBM Haifa (cont.): The Web task is an opportunity to explore the capability for creating the initial KAB – no user interaction in the TREC evaluation. Given a query, build up the KAB automatically – form a root set by running a search on .gov pages – get the forward (out-link) and backward (in-link) sets – compute a topology score, weighting the links: links from anchor text that is similar to the query are heavily weighted; links that cross the KAB boundary are weighted more (connect a domain-central page in the KAB to a query-focused page) – run PageRank-like algorithms to find hub and authority scores – combine the content score and the topology score.

IBM Haifa – other techniques: Site compression – do not permit more than three results from the same "logical site" to appear in the top 10 – a logical site is basically everything between "www" and "gov". Title filtering – pages with no query words in the title are considered "frail" – the three frail pages with the lowest scores in the KAB are replaced by the three non-frail pages outside the KAB with the highest scores. Duplicate elimination – threshold-based vector-space similarity – only applied to the top 10 pages because of the effectiveness measure – applied after the previous two filters.

Breaking down effects: What is the impact of site compression, the title filter, and duplicate elimination? Is the KA (dynamic) algorithm better than PageRank (static)? [Chart comparing the H&A agent and PR agent across conditions (basic, SC, TF, DE) and number of pages replaced.]

What's up with duplicate elimination? Seems to be an error in the evaluation. Duplicate pages are all considered equal in value – they're all relevant, after all. So getting three duplicates of the same page in the top 10 is better than getting only one. Except that the evaluation specification said: "Penalizing duplication: if
your top ten consists of ten pages from the same site, judges might only reward the home page." Suggests that duplication is bad – but the evaluation doesn't seem to honor that – so removing duplicates of a key page can hurt – though other sites found that removing duplicates was helpful, perhaps because removing non-key duplicates is critical.

Named page finding: A generalization of home-page finding from 2001. Retrieve a ranked list of the top 50 pages for 150 requests. The topic is a single phrase – "US password renewal" – "child labor stamp". Evaluation by MRR of the first correct page – mean reciprocal rank: if r is the rank of the first correct page, get a score of 1/r. Judgments – for 2 topics, found 5 correct pages – for 16 topics, found 2 correct pages – for the other 132, found exactly 1 correct page.

Named page finding results: [Chart of mean reciprocal rank and percentage of topics with no page found, by group.]

Glasgow University – named pages: Retrieve the top 1000 pages. Re-rank them based on whether the page contains the query as a phrase: RSV′_d = γ · RSV_d, with γ = 1.3 if the page contains the phrase and 1.0 otherwise. Run DAM with smaller sets (B = 10, A = 5) – here the goal is to find a single named page. Try various ways of using title and anchor text.

Glasgow results: DAM resulted in a slight improvement over content – the difference is marginal (suggestive only). Anchor text significantly improved reciprocal precision – with anchor text: 0.654, 85% of named pages in the top 10, 9% missed – without anchor text: 0.552, 71% in the top 10, and 15% missing – recall that it did not help with the topic distillation task. Query expansion hurts here too. Interesting to note that different tasks yield to different approaches – not discussed in detail, but the groups also used different core retrieval measures to find the RSV.

Web searching summary: URL depth and in-link count are both useful – at least for predicting whether
a link will appear in a directory. PageRank and HITS are effective. Anchor text is important – very valuable for named page finding – less clearly valuable for topic distillation (key pages). No evidence any of that is useful for content search – question: are topic distillation and named page finding sufficient? PageRank is only a small part of what Google does – don't be fooled by their PR. Following the Web track and its ilk, work has confirmed these results.

Evaluation for the Web – moving toward NDCG (Järvelin and Kekäläinen, TOIS 2002): What makes a better search result? Cumulative gain assumptions – highly relevant documents are better than marginally relevant ones – the lower the rank of a relevant document, the less useful it is. Comparison to recall/precision measures – binary vs. graded relevance – the rank assumption is consistent, though it will be treated differently.

Cumulative gain: Assume graded relevance – here 0 for non-relevant; 1, 2, or 3 up to highly relevant. In a ranked list, replace each document with its relevance score – G = <3, 2, 3, 0, 0, 1, 2, 2, 3, 0> – several good documents, two non-relevant, one marginal. Cumulative gain at rank i is the sum of the values – CG[1] = G[1] – CG[i] = CG[i−1] + G[i] for i > 1 – CG = <3, 5, 8, 8, 8, 9, 11, 13, 16, 16>.

Discounted cumulative gain: In CG, value is constant regardless of rank; want to provide diminishing value. Use the log of the rank, base b – DCG[i] = CG[i] for i < b – DCG[i] = DCG[i−1] + G[i]/log_b(i) otherwise. Comparison – G = <3, 2, 3, 0, 0, 1, 2, 2, 3, 0> – CG = <3, 5, 8, 8, 8, 9, 11, 13, 16, 16> – DCG (b = 2) = <3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61>. Why that discounting function? How to choose b?

Incorporating some notion of recall: Consider the ideal gain vector. If we know the relevance score for every document – sort the gain vector by score, with the highest first – e.g., with 3 highly relevant, 3 less relevant, and 4 marginally relevant documents: <3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, …>. Given that, we could calculate the ideal CG and DCG – CG_I = <3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, …> – DCG_I = <3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 11.21, 11.53,
11.83, 11.83, 11.83, …>.

Normalized DCG: DCG and CG are difficult to compare across queries – a score of 10.5 might be good for one query and bad for another. Address by normalizing with respect to the ideal score – normalize each entry by the ideal possible entry. Example – DCG = <3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61> – DCG_I = <3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 11.21, 11.53, 11.83, …> – NDCG = <1, 0.83, 0.87, 0.78, 0.75, 0.76, 0.80, 0.86, 0.83, 0.83>.

Some experiments: TREC-7 ad hoc retrieval task – used 20 topics – used the submitted runs of 5 manual systems – non-binary relevance judgments for those runs: four levels; graded relevance weightings 0, 1, 10, 100; binary conversions: "any degree of relevance is relevant" and "only highly relevant is relevant". Used log base 2; base 10 showed similar results.

[Fig. 2(a): discounted cumulated gain (DCG) curves, binary weighting. Fig. 2(b): discounted cumulated gain (DCG) curves, non-binary weighting. Fig. 4(a): normalized discounted cumulated gain (nDCG) curves, non-binary weighting. Fig. 4(b): normalized discounted cumulated gain (nDCG) curves, binary weighting.]

Single-number comparisons: Average values over ranks 1 to 200 of a topic – average values over topics – can do statistical significance testing. [Table II: nDCG averages over topics and significance-test results for five TREC runs, at p < 0.01 and p < 0.05.]

Variations of NDCG: Use log(rank + 1) – means the log(1) = 0 problem never occurs. Ignore the base of the log – boosts high-ranking documents – but the lower-ranked ones are still discounted relatively. Very popular for Web-based evaluation – heavily favors the first page of results – favors the very top of the ranked list. Optimization
function for some learning approaches.

Totally different topic – Pig: We talked about the map-reduce framework. One implementation is Hadoop – Yahoo-generated, now on Apache. Pig is a language that sits on top of it – largely a type of relational algebra – designed to compile into map-reduce operations. Homework 4 will use Pig. Unfortunately, word-count examples are difficult.

Information Retrieval – Document Clustering. James Allan, University of Massachusetts Amherst, CMPSCI 646, Fall 2007. All slides copyright James Allan.

Clustering basics: Clustering algorithms are used to group similar objects – in IR, typically documents or terms, occasionally queries. Many different reasons – in IR, for efficiency or for effectiveness. Many different algorithms and approaches – e.g., graph theoretic, nearest means. Clustering is typically based on pairwise comparisons of objects using a similarity measure – e.g., comparing documents, or comparing cluster means and documents. Many possible similarity measures – both general and domain-specific.

Clustering example: Given a set of documents – an entire collection, or query results – what types of documents are in it? Here, the top 250 encyclopedia articles found in response to the query "star", grouped into 5 clusters. Can see – the number of articles in each cluster – the titles of some of the articles – frequent words in each cluster. Useful for – disambiguation (e.g., astronomy vs. biology) – browsing.

Clustering example 2: The Clusty search engine from Vivisimo – provides a ranked list as well as clusters. [Screenshot for the query "information retrieval".]

Broad types of clustering algorithms: Several axes, based on the relationships between – clusters (the groups being created) – objects (the things being placed in groups) – properties (the way we represent those things). Relation between properties and clusters –
Monothetic: all items in a cluster share the same properties – polythetic: all items in a cluster share most properties. Relation between objects and clusters – exclusive (partition): an object goes in precisely one cluster – overlapping: an object can belong to multiple clusters. Relation between clusters and clusters – ordered: some clusters are within others (hierarchic) – unordered: all clusters created equal.

Properties of documents: A document is described by a set of features (properties). Content represented by word features – individual words – phrases. Can leverage annotations – names of people, locations, organizations – events – relationships. Can use metadata – date written, author – genre – popularity – manually assigned keywords or subject headings. The same types of features appeared in indexing models.

Monothetic v. polythetic: Monothetic – all objects in the cluster have a set of features in common. Polythetic – objects in the cluster share many features: every object in the cluster has many features in common with the others, and every feature appears in many of the documents, but no feature is necessarily in every document. [Grid of objects vs. features.] The focus in IR is generally on polythetic clustering.

Measures of similarity: Want to compare two objects. Desirable properties of a similarity function S(X,Y) – S(X,Y) grows as X and Y share more features – bounded between 0 and 1, with S(X,X) = 1 – usually S(X,Y) = 0 when X and Y share no features – often want S(X,Y) = S(Y,X). Numerous possible similarity functions – in general, no measure is universally best. Note that one could also compute dissimilarity.

Sample similarity measures: Simplest – just measure overlap: |X ∩ Y| for sets, or X·Y (inner product) for weighted vectors. Fails to account for the sizes of X and Y, so instead can use variations on these – Dice coefficient: 2|X ∩ Y| / (|X| + |Y|), or weighted 2Σx_iy_i / (Σx_i² + Σy_i²) – cosine coefficient: |X ∩ Y| / (|X|^½ |Y|^½), or weighted Σx_iy_i / sqrt(Σx_i² · Σy_i²) – Jaccard coefficient: |X ∩ Y| / (|X| + |Y| − |X ∩ Y|), or weighted Σx_iy_i / (Σx_i² + Σy_i² − Σx_iy_i). Copyright James
Allan.

Beyond object–object similarity: We have shown object–object similarity S(X,Y). What about cluster–object similarity S(C,X)? Or cluster–cluster similarity S(C1,C2)? Can sometimes treat a cluster as a document – in a bag-of-words model, a cluster could be the contents of all the bags. Different decisions result in different algorithms. Will consider several classes – graph theoretic – means- or representative-based – neighbor-based.

Graph-theoretic clustering: Graph representation of documents – each document is a node – an edge is placed between documents in the same cluster. Approaches are based on the similarity matrix – similarity between all pairs of documents – requires at least O(n²) comparisons to calculate. Approaches vary on when to create an edge. Naturally hierarchic – good formal properties – reflects the structure of the data.

Example – single link: Objects 1–6. [Similarity matrix; graph for threshold 0.95.] Resulting dendrogram: [cluster hierarchy over objects 1, 4, 5, 6, 2, 3 at levels 0.65, 0.75, 0.85.]

Single link: Place all documents in singleton clusters. Find the highest cluster–cluster similarity: S(C1,C2) = max_{u∈C1, v∈C2} S(u,v). Put an edge between those two clusters. Repeat until a stopping criterion is met – highest similarity below a threshold – some target number of clusters created. In practice, single link produces long straggly strings that are not good clusters – only a single link is required to connect – compare to the game of "telephone": each link is okay, but end-to-end makes no sense.

Complete link: Force clusters to be much tighter – disallow joining a cluster unless similar to all members. Same algorithm, different cluster–cluster similarity (recall that single link used the max): S(C1,C2) = min_{u∈C1, v∈C2} S(u,v). Complete link produces good clusters, but too few of them – many singletons – if you continue to apply the algorithm, it usually collapses suddenly at the end.

Average link: Attempt to find a balance. Require substantial similarity to other
members – use the average similarity: S(C1,C2) = (1 / (|C1||C2|)) · Σ_{u∈C1, v∈C2} S(u,v). For both searching and browsing applications, average-link clustering has been shown to produce the best overall effectiveness. Efficient algorithms are possible – a little more difficult than single link.

Algorithms v. property: We sketched algorithms for generating XXX-link clusterings; technically, XXX-link is a property of a clustering. Single link – a node in a cluster is more similar to some node in the cluster than to any node in any other cluster. Complete link – when considering the minimum similarity a node in a cluster has with the remaining nodes in its cluster, that similarity is greater than its minimum pairwise similarity with any other cluster. Average link – a node in a cluster has a greater average similarity to the remaining members of the cluster than it does to all members of any other cluster.

What makes a good clustering? Criteria of adequacy (CvR '79): (1) the method produces a clustering which is unlikely to be altered drastically when further objects are incorporated – stable under growth; (2) the method is stable in the sense that small errors in the description of objects lead to small changes in the clustering; (3) the method is independent of the initial ordering of the objects. Methods which satisfy these criteria may not, for other reasons, be the best for a particular application. So far, only single link satisfies them – criterion 1 is very likely to break.

Representative-based clustering: Attempts to avoid O(n²) comparisons – define cluster representatives – require only O(n log n) or even O(n) computations. Usually produce partitions (no overlapping clusters). Tend to impose structure not implicit in the data – e.g., the number of clusters. Can have undesirable properties – order dependence is very common.

Single pass (fast partition): Run through the data sequentially and cluster. Algorithm – assign the document D1 as the
representative (centroid, mean) for C1 – for each Di, calculate the similarity S with the representative of each existing cluster – if Smax is greater than a threshold value S_T, add the document to the corresponding cluster and recalculate the cluster representative (average) – otherwise, use Di to initiate a new cluster – while a document remains to be clustered, repeat.

K-means (fast partition): Select K documents as cluster representatives (centroids) – for i = 1 to N, assign Di to the most similar centroid – for j = 1 to K, recalculate the cluster centroid Cj – repeat the above steps until there is little or no change in cluster membership. Numerous variations on this basic method – cluster splitting and merging strategies – criteria for cluster coherence – seed selection (choosing the K representatives). Concern: how should K be chosen?

Nearest-neighbor clustering: Cluster each document with its k nearest neighbors. Produces overlapping clusters. Called "star" clusters by Sparck Jones. Can be used to produce hierarchic clusters. Cf. "documents like this" in Web search.

Why cluster? Efficiency; effectiveness by clustering a priori; effectiveness by clustering in response to a search.

Clustering for efficiency: Clustering was initially studied by Salton as an efficiency device – cluster documents, represent clusters by a mean or average document, and compare the query to the cluster representatives – faster than sequential search, but not as fast as an optimized inverted file – an inverted list is also a form of cluster. Clustering on disk is still an efficiency device in database systems – though disks that do their own layout partially defeat that.

Clustering for effectiveness: By transforming the representation, clustering may also result in more effective retrieval – retrieval of clusters makes it possible to retrieve documents that may not have many terms in common with the query (e.g., LSI). The Cluster
Hypothesis proposed that closely associated documents tend to be relevant to the same queries – validated for many collections – also invalid for many collections – can be viewed as a test of the representation – generally holds in the retrieved set of documents. [Plot of dissimilarity values.]

Ranking clusters: Bottom-up searching – create an inverted file of the lowest-level clusters and rank them. After the clusters are retrieved in order, the documents in those clusters are ranked – can discard the cluster information at that point if desired. Cluster search produces a similar level of effectiveness to document search, but finds different relevant documents – an alternative strategy. The same idea is the basis for spreading-activation ideas.

Hierarchical clustering and search: An alternative to writing a query. Top-down searching – start at the top of the cluster hierarchy and choose one or more of the best-matching clusters to expand at the next level – hard to figure out where to go – works better if the limit is set by some query.

Top-down, query limited: [Screenshots: browsing encyclopedia articles for the query "star" by selecting clusters down the hierarchy.]

Human clustering: Is there a clustering that people will agree on? Is clustering something that people do consistently? Yahoo suggests there is value in creating categories – a fixed hierarchy that people like. What about unsupervised clustering, with no fixed categories? Human performance on clustering Web pages – Macskassy, Banerjee, Davison, and Hirsh (Rutgers) – KDD 1998 and an extended technical report. Copyright
James Allan.

Experiment: Ten people; Web pages from Rutgers University. Each person was given the results from each of five queries – the most frequent queries with disjoint sets of terms (up to January 1998) – "accounting", "career services", "employment", "library", and "off campus housing" – which returned 15, 16, 16, 11, and 10 pages (somewhat small sets). Six people were given just the title and URL; four were given the full text as well. Asked to cluster the documents to the best of their ability – overlapping clusters acceptable but not required – no time limit.

Analysis: Singleton clusters – generally used as "garbage" clusters, not knowing where else to put items – deleted from the analysis (over 13% of documents deleted). Questions to consider – how big were the clusters? – how much agreement was there across subjects? – how many clusters were generated for each topic?

Cluster size: Generally preferred smaller clusters. [Histogram of cluster sizes, title/URL vs. full-text subjects.]

Number of clusters: [Table of the number of clusters per subject per topic.] The counts varied, though the small size of the collection makes it hard to decide if there is significance.

Clustering agreement: If A, B, C are put together by one person, how often are they put together by everyone else? – look at (A,B), (A,C), (B,C), and (A,B,C). Little agreement. [Charts of agreement between subjects.]

Cluster overlap: Some people never created overlapping clusters; others had a fair amount. [Chart of overlap per subject, full-document vs. title/URL.]

Conclusions of the study: People are diverse in how they cluster – note the small starting set, though. Little similarity between different people, across queries. People like to create fairly small clusters. Having the full text has an impact in building clusters. Perhaps clustering in general is not useful.
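The agreement question above — if one subject puts pages A, B, and C in one cluster, how often do the pairs (A,B), (A,C), (B,C) end up co-clustered by another subject? — reduces to comparing which pairs of pages each subject grouped together. A minimal sketch of that pairwise computation (my own formulation, not necessarily the paper's exact statistic):

```python
# Sketch: pairwise co-clustering agreement between two subjects'
# clusterings of the same set of pages.
from itertools import combinations

def co_clustered_pairs(clustering):
    """clustering: list of clusters, each a set of page ids.
    Returns the set of unordered pairs placed in a common cluster."""
    pairs = set()
    for cluster in clustering:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs

def pairwise_agreement(c1, c2):
    """Fraction of pairs co-clustered by subject 1 that subject 2
    also co-clustered (0 if subject 1 co-clustered nothing)."""
    p1, p2 = co_clustered_pairs(c1), co_clustered_pairs(c2)
    return len(p1 & p2) / len(p1) if p1 else 0.0
```

For example, with c1 = [{1,2,3},{4,5}] and c2 = [{1,2},{3,4,5}], only the pairs (1,2) and (4,5) are shared, giving an agreement of 0.5 — the kind of partial overlap the study reports.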
Perhaps clustering cannot be done without Clustering for distributed retrieval Xu and Croft 2000 We have viewed retrieval as from a single collection Instead suppose there are multiple collections For now assume collections do not overlap Fan query out to all collections or select a subset of them Integrate resulting lists How are documents assigned to collections Administrative legacy etc By topic if have control How to find topics if they re not provided Document clustering is obvious possibility Discover the topics CMPSCI 646 Copy ght James Allan Representing a topic cluster Topic models using language modeling approach 2 fD mi 0 01 1 ll D 00171 n is Pi Compare query to topic with KLQtopic KLQT Z MIOQW owgt720 39Q39 pquot u 1 Ch CMPSCI 646 Copy ght James Allan l9 Clustering documents Used kmeans clustering Requires selecting a value for k of course Document is compared to a cluster using fd wz39 fd7 w d C Z Og fdwz o ldl 1quot6710239 fd wiC Idl Note that cluster is assumed to include document That is how close is document to cluster with that document in it CMPSCI 646 Copyright James Allan Clustering and distributed retrieval Single system baseline everything together Heterogeneous collections typical in distributed IR Build a model for each existin39 collection reexistin39 clusters Global clustering Put everything in a single repository and cluster it Build a model for each cluster Assumes have access to all collections Local clustering Cluster each of the heterogeneous collections Maintain models for each cluster within each collection Models point to a cluster within a collection Multitopic representation Cluster each of the heterogeneous collections Gather models for each collection together Models point to a collection CMPSCI 646 Copyright James Allan 20 Evaluation collections TREC3 7 50 queries 345 words each 7 22Gb of text 740K documents 100 collections 7 196 relevant docs per query TREC4 7 49 queries 75 words each 7 2Gb of text 570K documents 100 
collections – 133 relevant docs per query. TREC-6 – 50 queries (2.6 words each) – 2.2GB of text, 551K documents, 100 collections – 92 relevant docs per query.

Cluster creation: Global clustering (centralized) – create 100 clusters using K-means. Local clustering (TREC-4 only) – create a total of 100 clusters – the number per collection proportional to the number of documents. Multi-topic representation (TREC-4 only) – use the natural 10 collections of TREC-4 (AP88, AP90, FR88, etc.) – on average, each collection represented by 10 clusters (actually allocated in proportion to the number of documents).

Processing steps: Rank the collections against the query – using the KL measure. Select the top n collections. Retrieve the top N documents from each of those – used InQuery, modified to support a global IDF – local IDF values could impact results wildly. Merge based on the InQuery scores.

Effectiveness of global clustering (evaluating at the interactive part of the R-P curve):
TREC-3, centralized / 100-col-by-source / 100-col-global –
5 docs: 0.6760 / 0.5000 (−26.0%) / 0.6680 (−1.2%); 10 docs: 0.6080 / 0.4520 (−25.7%) / 0.6080 (0.0%); 15 docs: 0.5840 / 0.4067 (−30.4%) / 0.5707 (−2.3%); 20 docs: 0.5490 / 0.3770 (−31.3%) / 0.5320 (−3.1%); 30 docs: 0.5120 / 0.3240 (−36.7%) / 0.4920 (−3.9%).
TREC-4, centralized / 100-col-by-source –
5 docs: 0.5918 / 0.4245 (−28.3%); 10 docs: 0.4918 / 0.3816 (−22.4%); 15 docs: 0.4612 / 0.3469 (−24.8%); 20 docs: 0.4337 / 0.3122 (−28.0%); 30 docs: 0.3714 / 0.2769 (−25.4%).
TREC-6, centralized / 100-col-by-source / 100-col-global –
5 docs: 0.4440 / 0.3360 (−24.3%) / 0.4520 (+1.8%); 10 docs: 0.3920 / 0.2940 (−25.0%) / 0.3820 (−2.6%); 15 docs: 0.3573 / 0.2560 (−28.4%) / 0.3533 (−1.1%); 20 docs: 0.3330 / 0.2350 (−29.4%) / 0.3220 (−3.3%); 30 docs: 0.2907 / 0.1973 (−32.1%) / 0.2780 (−4.4%).
Clustering is as good as one large collection – sometimes better – while selecting by the natural source collections hurts badly.

Why is clustering better? It seems to concentrate the relevant documents at the start of the ranked list of collections – true by the optimal
ranking of clusters (optimal by number of relevant documents) – also true by the algorithmically generated ranking of clusters. [Plot of the percentage of relevant documents reachable vs. the number of collections selected, for global vs. by-source orderings, optimal and algorithmic rankings.]

Other experiments: More clusters – 1000 clusters has slightly worse performance – some clusters are probably too small. Better choice of K seeds – previously used the first K documents – now cluster 2000 random documents and create seeds from them – no difference. Local clustering – recall: cluster within each source rather than globally – more realistic in some settings – slightly worse than global clustering, but not much.

Multi-topic representation: Recall that here the original source collections are selected – multiple cluster models are available, but they all point to the same collection. Better than the single source model – but does not find the optimal possible ranking – much worse than the global or local clustering approaches – probably the distribution of relevant docs is bad.
TREC-4, centralized / 10-col-by-source baseline / 10-col-by-source multiple-topic / 10-col-by-source optimal –
5 docs: 0.5918 / 0.4163 (−29.7%) / 0.4898 (−17.2%) / 0.5469 (−7.6%); 10 docs: 0.4918 / 0.3673 (−25.3%) / 0.3980 (−19.1%) / 0.4673 (−5.0%); 15 docs: 0.4612 / 0.3347 (−27.4%) / 0.3646 (−20.9%) / 0.4463 (−3.2%); 20 docs: 0.4337 / 0.2969 (−31.5%) / 0.3255 (−24.9%) / 0.4041 (−6.8%); 30 docs: 0.3714 / 0.2537 (−31.7%) / 0.2850 (−23.3%) / 0.3503 (−5.7%).

Lessons learned: Distributed IR benefits from topic clustering – best if it can be done globally, with similar documents near each other. If that is not possible, then local clustering is almost as good – preserves the administrative collection distribution – requires more overhead at the local site. Multiple topics pointing to a single collection are okay – better than a single collection model – means the local site only has to cluster and return models – no further overhead at the collection's site.

Clustering summary: Defined clustering. Ways documents are commonly clustered – graph-based:
single, complete, average link – K-means, kNN. Reasons to cluster – efficiency (not a big deal any longer) – effectiveness (can help). Human clustering ability – not clear humans cluster consistently – suggests that "true" clusters are meaningless without a task. Clustering applied to the distributed retrieval task – can provide effectiveness equivalent to a monolithic collection.

Information Retrieval – Retrieval Models: Vectors. James Allan, University of Massachusetts Amherst, CMPSCI 646, Fall 2007. All slides copyright James Allan.

Outline: Vector space retrieval model and Latent Semantic Indexing (this set of slides).

Vector space model: Variations – vector space retrieval model – Latent Semantic Indexing. Key idea: everything (documents, queries, terms) is a vector in a high-dimensional space. Example systems – SMART (G. Salton and students at Cornell, starting in the 60's; still among the top-performing IR systems) – Lucene (popular open-source engine written in Java) – most Web search engines are similar.

Vector space issues: How to select the basis vectors (dimensions). How to convert objects into vectors – terms – documents – queries. How to select the magnitude along a dimension. How to compare objects in the vector space – comparing queries to documents.

Vector space and basis vectors: Formally, a vector space is defined by a set of linearly independent basis vectors. Basis vectors – correspond to the dimensions or directions in the vector space – determine what can be described in the vector space – and must be orthogonal, i.e., linearly independent: a value along one dimension implies nothing about a value along another. [Diagrams of basis vectors for 2 and 3 dimensions.]

Basis vector selection: What should be the basis vectors for IR? Core concepts of discourse – orthogonal by definition – a relatively static vector space ("there are no new ideas") – probably not
too many dimensions – but difficult to determine (philosophy, cognitive science). Terms that appear – easy to determine – not at all orthogonal, but it may not matter much – a constantly growing vector space (new vocabulary) – a huge number of dimensions. Ease of identifying the bases is critical – the other issues can be dealt with, we hope.

Mapping to basis vectors – terms: How do basis vectors relate to terms? Each term is represented as a linear combination of basis vectors. Example dictionary, with basis terms "Active" and "Independent": cat = 0.25·Active + 0.75·Independent; dog = 0.75·Active + 0.25·Independent; gravity = … .

Mapping to basis vectors – documents: How are documents represented? A document is represented as the sum of its term vectors – e.g., Doc1 = dog + cat.

Mapping to basis vectors – queries: How are queries represented? The same way that documents are – e.g., Query = 2·dog.

Vector coefficients: The coefficients (vector lengths, term weights) represent term presence, importance, or "aboutness" – the magnitude along each dimension. The model gives no guidance on how to set term weights. Some common choices – binary: 1 if the term is present in the document, 0 if not – tf: the frequency of the term in the document – tf·idf: idf indicates the discriminatory power of the term. Tf·idf is far and away the most common – numerous variations.

Example: 3-word vocabulary, tf weights. [Documents plotted by their counts of "cat", "dog", and "lion".]

Term weighting functions: Lucene's weighting function is roughly w = sqrt(tf) · (log(N/df_t) + 1), normalized by the square root of the number of tokens in d in the same field as t. SMART supports a number of functions, named "XYZ" – X expresses the term frequency component – Y expresses the inverse document frequency component – Z expresses the length normalization component – e.g., "atc" = augmented tf, idf, cosine: w_i = (0.5 + 0.5 · tf_i/max_tf) · log(N/n_i), cosine-normalized by dividing by sqrt(Σ_j w_j²).

Vector space similarity: Similarity is
inversely related to the angle between the vectors
Doc2 is the most similar to the Query
Rank the documents by their similarity to the Query
[Figure: query and document vectors in term space]

Vector Space Similarity: Weighted Features Example
Q = 2 dog; D1 = 3 cat, 1 dog, 4 lion; D2 = 8 cat, 2 dog, 6 lion
As term vectors:
  D1 = 3T1 + 1T2 + 4T3
  Q = 0T1 + 2T2 + 0T3
  D2 = 8T1 + 2T2 + 6T3

Correlated Terms                  Orthogonal Terms
        cat   dog   lion                  cat   dog   lion
  cat   1.00  0.20  0.50           cat    1.00  0.00  0.00
  dog   0.20  1.00  0.40           dog    0.00  1.00  0.00
  lion  0.50  0.40  1.00           lion   0.00  0.00  1.00

Sim(D1,Q), correlated terms:
  (3T1 + 1T2 + 4T3) . (2T2) = 6(T1.T2) + 2(T2.T2) + 8(T3.T2)
                            = 6(0.2) + 2(1) + 8(0.4) = 1.2 + 2 + 3.2 = 6.4
Sim(D1,Q), orthogonal terms:
  3(0) + 1(2) + 4(0) = 2

Vector Space Similarity: Common Measures
                  Binary term vectors                Weighted term vectors
Inner product     |X ∩ Y|                            Σ x_i y_i
Dice coefficient  2|X ∩ Y| / (|X| + |Y|)             2 Σ x_i y_i / (Σ x_i² + Σ y_i²)
Cosine            |X ∩ Y| / sqrt(|X| |Y|)            Σ x_i y_i / sqrt(Σ x_i² Σ y_i²)
Jaccard           |X ∩ Y| / (|X| + |Y| - |X ∩ Y|)    Σ x_i y_i / (Σ x_i² + Σ y_i² - Σ x_i y_i)

Vector Space Similarity: Cosine Coefficient Example
D1 = 0.5T1 + 0.8T2 + 0.3T3;  Q = 1.5T1 + 1.0T2 + 0T3
Sim(D1,Q) = (0.5 x 1.5 + 0.8 x 1) / (sqrt(0.5² + 0.8² + 0.3²) sqrt(1.5² + 1²))
          = 1.55 / sqrt(0.98 x 3.25) = 0.868

Cosine and vector lengths
Angle is independent of vector lengths
Can normalize all vectors to length one
- Inner product then equals cosine of angle
- Inner product more efficient to compute
Very common to normalize vector lengths in index

Example again, normalized
D1 = (0.5T1 + 0.8T2 + 0.3T3) / sqrt(0.98) = 0.51T1 + 0.81T2 + 0.30T3
Q  = (1.5T1 + 1.0T2) / sqrt(3.25) = 0.83T1 + 0.55T2
Sim(D1,Q) = 0.51 x 0.83 + 0.81 x 0.55 ≈ 0.868 (same as the earlier slide, with full precision)

Other comparisons: Lucene
- User-specified boost
- tf.idf from document
- Proportion of query matched
- Length-normalized query weight
- Term normalization is the square root of the number of tokens in d that are in the same field as t

Standard vector space summary
Very simple
- Map everything to a vector
- Compare using angle between vectors
Challenge is mostly finding good
weighting scheme
- Variants on tf.idf are most common
- Model provides no guidance
Another challenge is the comparison function
- Cosine comparison is most common
- Generic inner product (without unit vectors) also occurs
- Model provides no guidance

Outline
Latent Semantic Indexing

Latent Semantic Indexing (LSI)
Variant of the vector space model
Use Singular Value Decomposition (SVD), a dimensionality reduction technique, to identify uncorrelated, significant basis vectors or "factors"
- Rather than non-independent terms
Replace original words with a subset of the new factors (say, 100) in both documents and queries
Compute similarities in this new space
Computationally expensive; uncertain effectiveness
Following figures taken from Deerwester, Dumais, Furnas, Landauer and Harshman, "Indexing by Latent Semantic Analysis," JASIS 41(6):391-407, 1990

[Figure (Deerwester et al. 1990): the decomposition X = T S D' of a term-by-document matrix, with T holding term vectors, D holding document vectors, and S the diagonal matrix of singular values]

[Figure (Deerwester et al. 1990): the example collection of nine titles - five on human-computer interaction, four on graphs - and its 12-term by 9-document matrix; text garbled in transcription]

LSI example
[The term-by-document matrix decomposition for the example; numeric values garbled in transcription]
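The SVD machinery LSI relies on can be sketched with numpy. This is a minimal illustration, not the Deerwester et al. example: the matrix X, the choice k = 2, and the query vector are all made up for the sketch, and the fold-in step shown here is the standard Q-hat = Q' T_k S_k^(-1) transformation described on the following slides.

```python
import numpy as np

# Toy term-document matrix X (terms x documents); illustrative values only
X = np.array([[1., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 1., 1., 1.],
              [0., 0., 1., 1.]])

# Thin SVD: X = T S D'  (numpy returns D' directly)
T, s, Dt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values ("factors")
k = 2
Tk, Dk = T[:, :k], Dt[:k, :]

# Fold a query into the k-dimensional space: q_hat = q' Tk Sk^-1
# (dividing componentwise by the singular values applies Sk^-1)
q = np.array([1., 1., 0., 0.])      # query uses the first two terms
q_hat = (q @ Tk) / s[:k]

# Rank documents (columns of Dk) by cosine similarity to the folded query
doc_vecs = Dk.T                      # one row per document
sims = doc_vecs @ q_hat / (np.linalg.norm(doc_vecs, axis=1)
                           * np.linalg.norm(q_hat))
ranking = np.argsort(-sims)          # best-matching document first
```

The same fold-in transformation is what lets new documents be added to the database without recomputing the SVD, at some cost in accuracy.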
X = T S D': reduced singular value decomposition of the term-document matrix X, where
- T is the matrix of term vectors; its columns are orthonormal
- D is the matrix of document vectors; its columns are orthonormal
- S is the diagonal matrix of singular values
- t is the number of rows (terms) of X
- d is the number of columns (documents) of X
- m is the rank of X (<= min(t, d))
- k is the chosen number of dimensions in the reduced model (k < m)

LSI example
[Matrices T, S and D for the 12-term by 9-document example of Deerwester et al. (1990); numeric values garbled in transcription]

LSI example 2
[The rank-2 reconstruction X_hat = T_2 S_2 D_2' of the example matrix; numeric values garbled in transcription]

Comparing original and LSI
[Table (Deerwester et al. 1990): the original term-document matrix beside its two-dimensional LSI reconstruction; terms related to a document's vocabulary pick up weight even where the original entry was 0. Values garbled in transcription]

[Figure (Deerwester et al. 1990): terms and documents plotted in the two-dimensional LSI space, with the query "human computer interaction" marked]

Using LSI
[The T, S and D matrices of the rank-2 example; numeric values garbled in transcription]
D is the new doc vectors (2 dimensions here)
T provides term vectors
Given Q = (q1, q2, ..., qt), want to
compare to docs
Convert Q from t dimensions to 2:
Q_hat = Q' T_k S_k^(-1)
(1 x k) = (1 x t) (t x k) (k x k)
Can now compare to doc vectors
Same basic approach can be used to add new docs to the database

Example of using LSI
Query is "human computer"
Q' = (1 0 1 0 0 0 0 0 0 0 0 0)
Q' T_k = (0.22 + 0.24, -0.11 + 0.04) = (0.46, -0.07)
Q' T_k S_k^(-1) = (0.46 / 3.34, -0.07 / 2.54) = (0.14, -0.03)
[Figure (Deerwester et al. 1990): the folded-in query plotted with the terms and documents in the two-dimensional LSI space]

Is LSI any good?
Decomposes language into basis vectors
- In a sense, is looking for core concepts
In theory this means that the system will retrieve documents using synonyms of your query words
- The "magic" that appeals to people
From a demo at lsi.research.telcordia.com
- They hold the patent (on LSI)
Query "manna" on Bible verses (312 dimensions); hits include:
- 5. Exodus 12:20 "Ye shall eat nothing leavened; in all your habitations shall ye eat unleavened bread."
- 6. Genesis 31:54 "Then Jacob offered sacrifice upon the mount, and called his brethren to eat bread: and they did eat bread, and tarried all night in the mount."
Things like this are the major claim of LSI techniques

Magic can be confusing
Top 5 hits for query "apple" (312 dimensions):
- Song of Songs 8:5 "Who is this that cometh up from the wilderness, leaning upon her beloved? I raised thee up under the apple tree: there thy mother brought thee forth: there she brought thee forth that bare thee."
- Psalms 47:3 "He shall subdue the people under us, and the nations under our feet."
- Song of Songs 2:3 "As the apple tree among the trees of the wood, so is my beloved among the sons. I sat down under his shadow with great delight, and his fruit was sweet to my taste."
- Zechariah 3:10 "In that day, saith the LORD of hosts, shall ye call every man his neighbour under the vine and under the fig tree."
- Ecclesiastes 4:7 "Then I returned, and saw vanity under the sun."

Vector Space Retrieval Model Summary
Standard vector space
- Each dimension corresponds to a term in the vocabulary
- Vector elements are real-valued, reflecting term importance
- Any vector (document, query) can be compared to any other
- Cosine correlation is the similarity metric used most often
Latent Semantic Indexing (LSI)
- Each dimension corresponds to a basic concept
- Documents and queries mapped into basic concepts
- Same as standard vector space after that
- Whether it's good depends on what you want

Vector Space Model Disadvantages
Assumed independence relationship among terms
- Though this is a very common retrieval model assumption
Lack of justification for some vector operations
- e.g. choice of similarity function
- e.g. choice of term weights
Barely a retrieval model
- Doesn't explicitly model relevance, a person's information need, language models, etc.
Assumes a query and a document can be treated the same (symmetric)
Lack of a cognitive or other justification

Vector Space Model Advantages
Simplicity
Ability to incorporate term weights
- Any type of term weights can be added
- No model that has to justify the use of a weight
Ability to handle distributed term representations
- e.g. LSI
Can measure similarities between almost anything
- documents and queries
- documents and documents
- queries and queries
- sentences and sentences
- etc.

Information Retrieval: Indexing a Collection - File Organization
James Allan
University of Massachusetts Amherst
CMPSCI 646, Fall 2007

Scoring documents: brute force
Sketch of approach
- For each document in collection: score the document
- Sort the scores and select the top few
Unusual in modern IR systems
- Was common in old IR systems with few documents
- Still feasible for small (relative to memory) collections
Not really discussed further

Scoring documents: narrowing
Two passes
- Rapidly find plausibly relevant documents to score
Sketch of approach
- Select approximate set of documents somehow
- For each document in set: score the document
- Sort the scores and select the top few
Reasonable approach
- If set is good enough, might even skip rescoring
Not all that common, but occurs

Scoring documents: as needed
Don't score documents that would get the default score
- For VS, if score would be zero, don't consider
- For LM, if score would be P(q|GE), don't consider
Similar to narrowing, but in a single pass
Sketch of approach
- For each feature that might signal relevance:
  for each document containing that feature, generate and accumulate partial score
- Sort the scores and select the top few
Most common approach
Implemented with some form of index
- Inverted file is the most common

Overview
Bitmaps
Signature files
Inverted files

Bitmaps
Create a bitmap for every term/feature in vocabulary
Bitmap is N bits long
- N is the number of documents in the collection
If feature is present in document i, set bit i to 1
For one-word query, retrieve documents with corresponding bit set to 1
For two-word queries
- Bitwise AND or OR the two features' bitmaps
- Retrieve documents corresponding to resulting bits that are 1

Bitmaps
Advantages
- Boolean operations very fast
- Space efficient for common terms
- Good compression with run-length encoding
Limitations
- Bag-of-words only
- Space inefficient for rare terms
- Difficult to add/delete documents
Not used much

Overview
Bitmaps
Signature files
Inverted files

Signature files
Consider document bitmap vectors
- Bit set for each feature occurring in document
- Cf. bit set for each document containing feature
Most features do not occur in a given document
- Very wasteful
So why not reuse some of the bits?
- Let "elephant" and "telemetry" share the same bit
- Could save much space
- Could result in false alarms: perhaps clean them up on a second pass if needed
Leads to signature files (aka superimposed coding)

Signature file basics
Bag-of-words
only
For each term, allocate a fixed-size s-bit vector (signature)
Define hash function(s)
- Single function: term maps to an s-bit pattern (sets all the bits at once)
- Multiple functions: term maps to 1..s; each selects which bit to set
Each term has an s-bit signature
- May not be unique

Signature File Example (16-bit signatures)
Term       Hash string
cold       1000 0000 0010 0100
days       0010 0100 0000 1000
hot        0000 1010 0000 0000
in         0000 1001 0010 0000
it         0000 1000 1000 0010
like       0100 0010 0000 0001
nine       0010 1000 0000 0100
old        1000 1000 0100 0000
pease      0000 0101 0000 0001
porridge   0100 0100 0010 0000
pot        0000 0010 0110 0000
some       0100 0100 0000 0001
the        1010 1000 0000 0000

Documents in signature file
How to represent documents?
Bitwise OR the term signatures to form the document signature
Long documents are a problem (why?)
- Usually segment them into smaller pieces
Document  Text                                     Descriptor
1         Pease porridge hot, pease porridge cold  1100 1111 0010 0101
2         Pease porridge in the pot                1110 1111 0110 0001
3         Nine days old                            1010 1100 0100 1100
4         Some like it hot, some like it cold      1100 1110 1010 0111
5         Some like it in the pot                  1110 1111 1110 0011
6         Nine days old                            1010 1100 0100 1100

Signature file querying
At query time
- Look up signature for query (how?)
- If all corresponding 1-bits are on in the document signature, the document probably contains that term
- How can this be implemented efficiently?
Vary s to control P(false alarm)
- Note space tradeoff
- Optimal s changes as collection grows
Many variations; widely studied
Not widely used

Signature file trivia: setup
False positive
- Something identified as true when it is actually false
- For example, something retrieved that is not relevant
Also called a "false drop" (but why?)
Some definitions of "false drop" on the web:
- "document that is retrieved by a search but is not relevant to the searcher's needs. False drops occur because of words that are written the same but have different meanings (for example, 'squash' can refer to a game, a vegetable, or an action)"
- "A web page retrieved from a search engine or directory which is not relevant to the query used" (wwwmbgjorggossaryseterms htm)
(Courtesy of Google's "define:" feature)
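A minimal signature-file sketch of the scheme above. The 16-bit width, the 3 hash probes per term, and the use of md5 as the hash family are all illustrative choices, not anything prescribed by the slides:

```python
import hashlib

S = 16   # signature width in bits (illustrative choice)
K = 3    # hash functions, i.e. bits set per term (illustrative choice)

def term_signature(term):
    """Hash the term K times; each hash selects one of the S bits to set."""
    sig = 0
    for i in range(K):
        h = int(hashlib.md5(f"{term}:{i}".encode()).hexdigest(), 16)
        sig |= 1 << (h % S)
    return sig

def doc_signature(text):
    """Document signature = bitwise OR of its term signatures."""
    sig = 0
    for term in text.split():
        sig |= term_signature(term)
    return sig

def maybe_contains(doc_sig, term):
    """True if every bit of the term's signature is on in the document
    signature. Can be a false drop, so an exact system re-checks the text."""
    t = term_signature(term)
    return (doc_sig & t) == t
```

A term actually present always passes the test; an absent term usually fails, but shared bits can make it pass — exactly the false drops discussed above.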
which is not relevant to the query used wwwmbgjorggossaryseterms htm Courtesy of Google s de ne feature 14 Blah Signature le trivia long long ago Punch card as card cataloguequot 20 bits here Punch out bits of signature 7 Tolkein 4 11 amp 14 7 Harper 41419 7 Lee 11 18 To nd an item 7 Calculate its signature 7 Run rods through bits 7 Harper Lee is 5 bits Tolkein is 3 bits 7 False drop Overview Bitmaps Signature files Inverted les Inverted Lists Source file Collection organized by document Inverted file Collection organized by term or other features One record per term listing locations where term occurs Location can have varying granularity During evaluation traverse lists for each query term OR the union of component lists AND an intersection of component lists Proximity an intersection of component lists SUM the union of component lists each entry has a score 17 Inverted Files H Document Text l Pease porridge hot pease porridge cold 2 Pease porridge in the pot 3 Nine days old 4 Some like it hot some like it cold 5 Some like it in the pot 6 Nine days old Example text each line is one document 18 Blah Blah Inverted Files H Document Text l Pease porridge hot pease porridge cold 2 Pease porridge in the pot I Number Term Documents 3 Nine days old 4 Some like it hot some like it cold 1 cold 1 4 5 Some like it in the pot 2 days 3 6 6 Nine days old 3 hot 1 4 4 in 2 5 5 it 4 5 6 like 4 5 7 nine 3 6 8 old 3 6 9 pease 1 2 10 porridge 1 2 1 l pot 2 5 12 some 4 5 13 the 2 5 19 WordLevel Inverted Flle H Document Text l Pease porridge hot pease porridge cold 2 Pease porridge in the pot 3 N39 d 1d 4 self hot some like it cold Bum Tm mm ME 5 Some like it in the pot 1 1d 6 Nine days old 2 22 s 3 H 3 l 3 4 4 4 In 2 3 5 4 5 it 4 3 7 5 3 5 like 4 2 6 5 2 7 mm 3 1 6 l 3 01d 3 3 6 3 9 Pease 11421 10 Pom39dsc l 2 5 2 2 11 Dot 2 5 56 12 some 4 1 5 s 1 13 the 2 4 5 5 20 10 Blah Using indexes for LM approach Assume query likelihood approach k PCI1QkMD H PCIiIMD i1 JelinekMercer smoothing 
P(qi | M_D) = λ P_ML(qi | M_D) + (1 - λ) P(qi | GE)
Probably use logs to avoid excessively tiny numbers:
log P(q1 ... qk | M_D) = Σ_{i=1..k} log P(qi | M_D)

Brute force document-based approach
For each document D in collection
- Calculate log P(Q | M_D)
Sort scores
Drawbacks
- Most documents have no query terms
- Very slow

Use inverted list to narrow
Simple approach to using inverted list
- Use list to find documents containing any query term
- All others assumed to have low and constant probability
For each document in that pool
- Calculate log P(Q | M_D)
Sort scores
Better: only plausible documents considered
Still requires accessing entire document

Better use of inverted lists
Recall score being calculated:
log P(q1 ... qk | M_D) = Σ_{i=1..k} log P(qi | M_D)
Can be done in parts
- Do q1 for every document; then q2, then q3, then ...
- Keep array Score[] with cumulative scores

Specifically
For each query word qi
- Fetch its inverted list
- For each document Dk in list
  - Calculate log P(qi | Dk)
  - Add to accumulator Score[k]
Sort array
- Or keep a running list of the top n documents
Why do the calculation?
- Store precomputed log P(qi | Dk) in list
- Or store partial results if necessary
- May have to smooth at query time (why?)

Document-at-a-time processing
One inverted list in memory at a time
- Examples of using inverted lists so far were term-at-a-time
- Cumulative score array maintained for all documents
- Unranked Boolean queries: all scores incomplete until finished
Some queries benefit from early scoring
- Queries retrieving everything above a threshold
Document-at-a-time
- Fetch all inverted lists into memory at the same time
- Find score of first matching document, then second, and so on

Efficient support for doc-at-a-time
Fetch inverted lists into memory
- For each word, gives a sorted list of documents containing it
Easy to handle query "world" or query "series"
How is the query "world AND series" processed?
- Linear in the size of the lists, please
What is needed to find "world series" as a phrase?
- I.e., words immediately next to each other in the document
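The term-at-a-time accumulator loop described above, with Jelinek-Mercer smoothing folded in, might look like the sketch below. The toy collection and λ = 0.8 are arbitrary choices. Note the query-time smoothing point the slides raise: a document that misses a query term must still get that term's background probability (1 - λ)P(q|GE), so each candidate starts from an all-background score and the walk over each inverted list only adjusts it.

```python
import math
from collections import Counter, defaultdict

docs = {1: "pease porridge hot pease porridge cold",
        2: "pease porridge in the pot",
        3: "nine days old",
        4: "some like it hot some like it cold"}

index = defaultdict(dict)          # term -> {doc_id: term frequency}
doc_len = {}
for doc_id, text in docs.items():
    counts = Counter(text.split())
    doc_len[doc_id] = sum(counts.values())
    for term, tf in counts.items():
        index[term][doc_id] = tf

coll = Counter(w for text in docs.values() for w in text.split())
coll_len = sum(coll.values())      # background ("general English") model

def rank(query, lam=0.8):
    """Term-at-a-time scoring of log P(Q|M_D), with
    P(q|M_D) = lam*P_ML(q|D) + (1-lam)*P(q|GE).
    Assumes every query term occurs somewhere in the collection."""
    terms = query.split()
    # Score of a document matching no query terms at all
    default = sum(math.log((1 - lam) * coll[t] / coll_len) for t in terms)
    scores = defaultdict(lambda: default)   # accumulator per document
    for t in terms:
        p_bg = (1 - lam) * coll[t] / coll_len
        for doc_id, tf in index.get(t, {}).items():   # walk t's list
            p = lam * tf / doc_len[doc_id] + p_bg
            scores[doc_id] += math.log(p) - math.log(p_bg)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

For the query "pease hot", document 1 (which contains both terms) outranks documents 2 and 4 (one term each); document 3 never appears in either inverted list and is never scored.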
Scaling to large collections
What if inverted lists are immense?
Skip lists
- A "table of contents" to the inverted list
- Embedded pointers that jump ahead n documents
- Why is this useful?
Separating presence information from location information
- Many operators only need presence information
- Location information takes substantial space (I/O)
- If split: reduced I/O for presence operators, increased I/O for location operators, or a larger index
- Common in CD-ROM implementations

Accessing inverted lists
Given a term, how is a file of inverted lists accessed?
B-Tree, B+Tree, B*Tree, etc.
- Supports exact-match and range-based lookup
- O(log n) lookups to find a list
- Usually easy to expand
Hash table
- Supports exact-match lookup
- O(1) lookups to find a list
- May be complex to expand
Will examine efficiency issues later

Supporting wildcards
X* is probably easy (why? when not?)
What about *X, *X*, X*Y?
Permuterm index
- Prefix each term X with a marker (e.g. ^)
- Rotate each augmented term cyclically, with wraparound, by one character to produce n new terms
- Append a marker (e.g. $) to the end of each word form: not absolutely needed, but handy for some searches
- Insert all forms in the dictionary; all point to the same inverted list
Example
- "term" indexed as ^term$, $^term, m$^ter, rm$^te, erm$^t, term$^
- "team" indexed as ^team$, $^team, m$^tea, am$^te, eam$^t, team$^

Wildcard lookups
Exactly X: search for ^X$
- ^team$ matches "team"; the only time the end marker is useful
Rest require prefix matching
X*: search for all terms beginning with ^X
- te* uses ^te, to match ^term$, ^team$, ...
*X: search for all terms beginning with X$
- *am uses am$, to match am$^te, ...
*X*: search for all terms beginning with X
- *r* uses r, to match rm$^te, ...
X*Y: search for all terms beginning with Y$^X
- t*m uses m$^t, to match m$^ter, m$^tea

Building indexes
For each document to be indexed
- Tokenize, normalize, stem, weight
- For each token (term): fetch the list for the term so far; add new information to the end, sorted by document accession number
Numerous efficiency issues
- Repeated fetching of common lists should be avoided
- Stage stuff in memory, dump to disk as needed, merge later
Will cover some later

Updating indexes
Indexes expensive to update; usually done in batches
Typical build/update procedure
- One or more documents arrive to be added / updated
- Documents parsed to generate index modifications
- Each inverted list updated for all documents in the batch
Concurrency control required
- To synchronize changes to documents and index
- To prevent readers and writers from colliding
Common to split index into static + dynamic components
- All updates go to dynamic components
- Search both static and dynamic components
- Periodically merge dynamic into static

Summary
[Table comparing Inverted Files, Signature Files, and Bitmaps on: ease of update (edit a doc), query evaluation speed, uncompressed space efficiency, compressed space efficiency, index fidelity, and whether word positions can be stored; cell values garbled in transcription]

Information Retrieval: Overview and Introduction
James Allan
University of Massachusetts Amherst
CMPSCI 646, Fall 2007
Some examples copyright Dr. Seuss
All slides copyright James Allan

Outline
What is (and isn't) Information Retrieval
Core idea of IR-related work
Simple model of IR to get started

What is Information Retrieval?
Google, Yahoo!
Document (Web page) retrieval in response to a query
Quite effective (at some things)
Highly visible (mostly)
Commercially successful (some of them)
But what goes on behind the scenes?
How do they work?
What happens beyond the Web?

IR is not just document retrieval
Automatic organization (e.g., clustering)
Cross-language mechanisms
Question answering
Document provenance
Agents, filtering, tracking, routing
Recommender systems
Leveraging XML and other metadata
Text mining
Novelty identification
Personal information management
Metasearch, multi-database searching
Summarization
Convera Eurospider 7 Hummingbird EMC Documentum A9 Amazon 7 Lemur lndri Terrier Zettair mg Okapi Smart Database systems 7 Oracle lnformix Access Web search and lnhouse systems 7 West LEXISNEXIS Dialog 7 Lycos txcite Yanoo Uoogle Live Northern Light Ieoma HotBot Baidu 7 Askcom Ask Jeeves 7 eLibrary GOVResearchcenter lnquira And countless others Copynghl James Allan In this course we ask What makes a system like Google Yahoo or Live Search tick How does it gather information 7 Whattricks does it use Extending beyond the Web How can those approaches be made better Natural language understanding User interactions What can we do to make things work quickly Faster computers Caching 7 Compression How do we decide whether it works well For all queries For special types of queries 7 On every collection ofinformation What else can we do with the same approach 7 Other media 7 Other tasks CMPSCI 545 Copynghl James Allan Outline Core idea of lRrelated work Simple model of IR to get started CMPSCI 545 Copynghl James Allan CMPSCI 646 Basic Approach to IR Most successful approaches are statistical 7 Directly or an effort to capture and use probabilities Why not natural language understanding 7 Le computer understands documents and query and matches them 7 State of the art is brittle in unrestricted domains 7 Can be highly successful in predictable settings though eg information extraction on terrorismtakeovers MUC Medical or legal settings with restricted vocabulary Could use manually assigned headings 7 eg Library of Congress headings Dewey Decimal headings 7 Human agreement is not good 7 Hard to predict what headings are interesting 7 Expensive Copynghl James Allan CMPSCI 646 Relevant Items are Similar Much of IR depends upon idea that similar vocabulary gt relevant to same queries Usually look for documents matching query words Similar can be measured in many ways 7 String matchingcomparison 7 Same vocabulary used 7 Probability that documents arise from same model 7 Same 
meaning of text Copynghl James Allan Bag of Words An effective and popular approach Compares words without regard to order Consider reordering words in a headline 7 Random beating takes points falling another Dow 355 7 Alphabetical 355 another beating Dow falling points 7 Interesting Dow points beating falling 355 another 7 Actual Dow takes another beating falling 355 points CMPSCI 545 Copynghl James Allan What is this about 16 X said 14 X McDonalds 12 x fat 11 x fries 8 X new 6 X company french nutrition 5 X TOOC on percent reduce taste I uesoay 4 X amount change health Henstenburg make obesity 3 X acids consumer fatty polyunsaturated US 2 X amounts artery Beemer cholesterol clogging director down eat estimates expert fast formula impact initiative moderate plans restaurant saturated trans win 1 X added addition adults advocate affect afternoon age Americans Asia battling beef bet brand Britt Brook Browns calorie center chain chemically crispy customers cut vegetable weapon weeks Wendys Wootan worldwide years York CMPSCI 545 Copynghl James Allan The start of the original text McDonald39s slims down spuds Fastfood chain to reduce certain types of fat in its french fries With new cooking oil NEW YORK CNNMoney McDonald39s Corp is cutting the amount of quotbadquot fat in its french fries nearly in half the fastfood chain said Tuesday as it moves to make all its fried menu items healthier But does that mean the popular shoestring fries Won39t taste the same The company says no quotIt39s a WinWin for our customers because they are getting the same great frenchfry taste along with an even healthier nutrition profile said Mike Roberts president of McDonald39s USA But others are not so sure McDonald39s Will not specifically discuss the kind of oil it plans to use but at least one nutrition expert says playing with the formula could mean a different taste Shares of Oak Brook Illbased McDonald39s MCD down 054 to 2322 Research Estimates Were lower 39Iuesday a ernoon It Was 
unclear Tuesday Whether competitors Burger King and Wendy39s International WEN down 080 to 3491Research Estimates wouldfollow suit Neither company could immediately be reached for comment CMPSCI 545 Copyright James Allan The Point Basis of most IR is a very simple approach 7 find words in documents 7 compare them to words in a query 7 this approach is very effective Other types of features are often used 7 phrases 7 link structure 7 named entities people locations organizations 7 special features chemical names product names difficult to do in general usually require hand building Focus of research is on improving accuracy speed and on extending ideas elsewhere CMPSCI 545 Copyright James Allan Outline Simple model of IR to get started CMPSCI 545 Copynghl JamesAllan Simple flow of retrieval process Teltt ijects InformationNemi 7 CMPSCI 545 Copynghl JamesAllan CMPSCI 646 Statistical language model Text ij ects r Q 39 I d x do bjects Document comes from a topic Topic unseen describes how words appear in documents on the topic Use document to guess what the topic looks like 7 Words common in document are common in topic 7 Words not in document much less likely Assign probability to words based on document 7 PwTopic m PwD tfwD lenD Index estimated topics Copynghl James Allan CMPSCI 646 Example Small document One fish two fish red fish blue fish Black fish blue fish old fish new fish lenD 16 Pfisth 816 05 PblueD 216 0125 F oneID 116 00625 A toplc 39F5leggsD 016 o Copynghl James Allan What about queries Assume information need corresponds to a topic Assume queries are small sample from that topic Very short Just a few high probability words I Information Need CMPSCI 646 Copyright James Allan Comparison Document came from a topic Did query come from this document s topic For each document find probability its topic could have generated the query PQITD W PQD 1391 7 RID Independence 75 assumption H Pq cf naive Bayes 2 i1 CMPSCI 646 Copyright James Allan 10 Example This one I 
think is called a Yink D1 He likes to he likes to He likes to an and D2 39 The thing he likes to i Thhelikestoisink D2 He likes toand m ink CMPSCI 646 Copynghl James Allan What does LM look like implemented Hypothesis of statistical language model 7 Documents with highly probable topic models more likely to be relevant Index collection in advance 7 Convert documents into set of PtiD 7 Store in an appropriate data structure for fast access Query arrives 7 Convert it to set PqiD 7 Calculate PQTD for all documents 7 Sort documents by their topics probability 7 Present ranked list CMPSC1646 Copynghl James Allan Is that how Google works No Google is more closely related to the vector space model 7 Simpler formal model 7 Can more readily accommodate a range of features 7 No justification required by model for features or how used But intuition is similar CMPSCI 545 Copynghl James Allan Some issues that arise in IR Text representation 7 what makes a good representation 7 how is a re resentation generated from text 7 what are retrievable objects and how are they organized Representing information needs 7 what is an appropriate query language 7 how can interactive query formulation and refinement be supported Comparing representations 7 what is a good model of retrieval 7 how is uncertainty represented Evaluating effectiveness of retrieval 7 what are good metrics 7 what constitutes a good experimental test bed CMPSCI 545 Copynghl James Allan Topics this class might cover Language model Vector space model Probabilistic model Crosslanguage OCR documents Spoken documents Indexing Multimedia documents Compression Distributed Efficiency Peertopeer Clustering Statistics of text Interaction Evaluation Relevance feedback Significance tests Filtering Web search CMPSCI 646 Copynghl James Allan Conclusion Information Retrieval 7 Indexing retrieving and organizing text by probabilistic or statistical techniques that reflect semantics without actually understanding 7 Search engines and 
more Core idea 7 Bag of words captures much of the meaning 7 Objects that use vocabulary the same way are related Statistical language model 7 Documents used to estimate a topic model 7 Query reflects a topic too 7 Documents of topics that are likely to produce the query are most likely to be relevant CMPSCI 545 Copynghl JamesAllan Information Retrieval Compression and Statistics of text James Allan University of Massachusetts Amherst CMPSCI 646 Fall 2007 All slides copyrlgil James Allan Encoding and Compression Encoding transforms data from one representation to anothe r Data Encoded Data l Data Compression is an encoding that takes less space 7 eg to reduce load on memory disk lO network Lossess decoder can reproduce message exactly Lossy can reproduce message approximately Measuring compression 7 Compression ratio EncOrig 9 25Mb125Mb 20 7 Or sometimes OrigEnc 9125Mb25Mb 500 7 Or sometimes OrigEncOrig 9 100Mb125Mb 80 7 Or sometimes OrigEncEnc 9 100Mb25Mb 400 Copynghl JamesAllan Compression Advantages of Compression 7 Save space in memory eg compressed cache 7 Save space when storing eg disk CDDVD 7 Save time When accessing eg HO 7 Save time When communicating eg over network Disadvantages of Compression 7 Costs time and computation to compress and uncompress 7 Complicates or prevents random access 7 May involve loss of information eg JPEG 7 Makes data corruption much more costly Small errors may make all of the data inaccess ble Copynghl James Allan Text vs data compression Text compression TC predates most work on general data compression DC TC is a kind of data compression optimized for text 7 Le based on a language and a language model TC can be faster or simpler than general DC 7 Assumptions made about the data Text compression assumes a language and model 7 Data compression typically learns the model on the fly 7 Some forms of text compression are really data compression Text compression is effective when assumptions met ata compression is effective on 
Compression and statistics: Outline
Introduction
Fixed-Length Codes
- Short bytes, bigrams, n-grams
Restricted Variable-Length Codes
- Basic method, extension for larger symbol sets, UTF-8
Some diversions
- Statistics of text: Zipf, Heaps
- Information theory
Variable-Length Codes
- Huffman codes / canonical Huffman codes
- Lempel-Ziv: LZ77 (Gzip), LZ78, LZW (Unix compress)
Synchronization
Compressing inverted files
Compression for block-level retrieval

Preliminaries (probably not needed)
Text is character based; a common encoding uses 8 bits per character
- Can encode values from 0 (00000000 in binary) to 255 (11111111)
- 'A' is ASCII 65, or 01000001
Encoding "Fast"
- Characters: F, then a, then s, then t
- Values: 70, then 97, then 115, then 116
- Bits: 01000110, then 01100001, then 01110011, then 01110100
- Or: 01000110 01100001 01110011 01110100
Each 0/1 is a bit; each 8 bits is a byte and encodes one symbol. So: 4 bytes, or 32 bits, to encode "Fast" in ASCII.

Fixed-Length Compression: Short Bytes
If the alphabet has <= 32 symbols, use 5 bits per symbol
- Saves 3 bits per character
If the alphabet has > 32 symbols and <= 60
- use 1-30 for the most frequent symbols (base case)
- use 1-30 for the less frequent symbols (shift case)
- use 0 and 31 to shift back and forth (e.g., a typewriter)
Example: Fast -> <F><down><a><s><t>, 5x5 = 25 bits rather than 4x8 = 32 bits
Works only when shifts do not occur often:
LaTeX -> <L><down><a><up><T><down><e><up><X>, 9x5 = 45 bits rather than 5x8 = 40 bits
Optimizations: just one shift (toggle); temporary shift and shift-lock
Variations: multiple cases

Fixed-Length Compression: Bigrams
Use a byte (8 bits) as the storage unit, values 0-255
Use values 0-87 for blank, upper case, lower case, digits, and 25 special characters
Use values 88-255 for common bigrams (master x combining)
- Master (8): blank A E I O N T U
- Combining (21): blank plus everything except J K Q X Y Z
- Total codes: 88 + 8x21 = 88 + 168 = 256
Example: Fast -> <F><as><t>, saves one byte
Pro: simple, fast, requires little memory
Con: based on a small symbol set
Con: maximum compression is 50%
- The average is lower, ~33%
Variation: 128 ASCII characters and 128 bigrams
Extension: an escape character for ASCII 128-255

Fixed-Length Compression: n-grams
Similar to bigrams, but extended to cover sequences of 2 or more characters
- Can select common n-grams in advance
- Could learn them for specific applications
The goal is that each encoded unit of length > 1 occur with very high and roughly equal probability
Popular today for
- OCR data: scanning errors make bigram assumptions less applicable
- Asian languages: two- and three-symbol words are common; longer n-grams can capture phrases and names

Fixed-length compression: summary
Three methods presented. All are
- simple
- very effective when their assumptions are correct
All are based on a small symbol set, to varying degrees
- some only handle a small symbol set
- some handle a larger symbol set but compress best when a few symbols comprise most of the data
All are based on a strong assumption about the choice of language (English here)
Bigram and n-gram methods are also based on strong assumptions about common sequences of symbols
- Intuitions about which characters/sequences are frequent

Compression and statistics: Outline (next: Restricted Variable-Length Codes - basic method, extension for larger symbol sets, UTF-8)

Restricted Variable-Length Codes
An extension of multi-case encodings ("shift key")
Different code lengths are used for each case
- Only a few code lengths are chosen, to simplify encoding and decoding
Use the first bit to indicate the case:
the 8 most frequent characters fit in 4 bits (0xxx); 128 less frequent characters fit in 8 bits (1xxxxxxx)
In English
- The most frequent characters are 65% of occurrences
- In Origin of Species the most frequent seven are e t a i o n s: 59% of letters but only 47% of characters (58% / 46% if case is preserved)
Expected code length is approximately 5.4 bits per character
- 32.8% compression ratio
Average code length on WSJ89 is 5.8 bits per character
- 27.9% compression ratio

Generalization for More Symbols
Use more than 2 cases
- 1xxx for the 2^3 = 8 most frequent symbols, and
- 0xxx 1xxx for the next 2^6 = 64 symbols, and
- 0xxx 0xxx 1xxx for the next 2^9 = 512 symbols, and so on
Average code length on WSJ89 is 6.2 bits per symbol
- 23.0% compression ratio
Pro: variable number of symbols
Con: only 72 symbols in 1 byte or less

Example: Unicode as UTF-n
Encode Unicode into variable-length pieces
- Unicode allows 1,114,112 code values (0 to 0x10FFFF, or 21 bits): 17 planes (0 to 16) of 65,536 coding values; the Basic Multilingual Plane is 0 to 0xFFFF
All UTF-n are equally valid ways to encode Unicode
- It is possible to map between them without loss of information
- All Unicode code values can be represented
Consider UTF-32
- Each Unicode value is encoded in 32 bits directly
- Fixed-length and fixed-location characters
- Simple to find character offsets and starting positions
- Wasteful: 11 bits are never used (remember, Unicode uses only 21 bits); in some cases many other bits are wasted (e.g., 7-bit ASCII)

UTF-16
Represent Unicode in 16-bit (2-byte) units
- Prior to v2.1, Unicode was limited to 16 bits
Unicode values 0x0000 to 0xFFFF (common characters)
- Represent directly as a 16-bit code value
- Two blocks of 1,024 (2^10) values are unused in that range
For values 0x10000 to 0x10FFFF
- Subtract 0x10000, so the range is 0x0 to 0xFFFFF (needs 20 bits)
- Encode the first 10 bits as a number in the high 16-bit unused range
- Encode the remaining 10 bits as a number in the low 16-bit unused range
During decoding
- If the 16-bit value is in an unused block, it is part of a 32-bit encoding; which block (high/low) determines which of the parts it is
- Else it is a 16-bit encoding

UTF-8
The base unit is a single byte
For Unicode values 0 to 0x7F (127)
- Encode directly in a single byte using bits 0xxxxxxx
- Corresponds to ASCII values
For all other values, the first byte indicates how long the encoding is
- Encode the number of bytes needed in unary (count the 1s)
- 110xxxxx means two bytes and holds the first 5 bits of the number
- 1110xxxx for 3 bytes; 11110xxx means 4 bytes
- The maximum needed is 4 bytes, though more can be encoded in theory
All other bytes include the next 6 bits
- Bytes are of the form 10xxxxxx
- Why not 7 bits (1xxxxxxx) or 8 bits (xxxxxxxx)?

UTF-8 templates
0xxxxxxx
- 7 bits, uses 8 (encodes 0x00 to 0x7F)
110xxxxx 10xxxxxx
- 11 bits, uses 16 (encodes 0x80 to 0x7FF)
1110xxxx 10xxxxxx 10xxxxxx
- 16 bits, uses 24 (encodes 0x800 to 0xFFFF)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- 21 bits, uses 32 (encodes 0x10000 to 0x1FFFFF)

UTF-8 examples
The copyright sign is Unicode 169 (0xA9)
- Binary is 1010 1001, so it needs at least 8 bits
- Template: 110xxxxx 10xxxxxx (11 bits)
- 11000010 10101001 = 0xC2 0xA9
The not-equal sign is Unicode 0x2260
- Binary is 0010 0010 0110 0000, so it needs at least 14 bits
- Template: 1110xxxx 10xxxxxx 10xxxxxx (16 bits)
- 11100010 10001001 10100000 = 0xE2 0x89 0xA0

Notes on UTF-8 formatting
A number must be encoded in the shortest possible encoding
- Copyright is 0xA9 = 10101001
- Valid: 11000010 10101001
- Invalid: 11100000 10000010 10101001 (and so on)
One-byte characters and those that are part of a sequence are distinguishable
- The high-order bit is always clear in the former and always set in the latter
- True for UTF-16 too, but for a different reason

Non-overlapping aspects of UTF-n
Consider older mixed-width encodings (e.g., Windows code pages): one character's bytes could be a subsequence of another's
- Causes problems during string search
Imagine looking for "D"
- Is the 0x44 a "D", or the end of the 0x414 character?
- To check, look backward one character
- Could it be that this is a 0x442 character?
- To check, look backward again: to the start of the text in the worst case.
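The UTF-8 templates above are simple enough to implement by hand and check against a real encoder. This is a sketch (the function name is ours, and it does no validation of surrogates or out-of-range input); the two assertions reproduce the copyright-sign and not-equal-sign examples from the slide:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a Unicode code point using the UTF-8 byte templates."""
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert utf8_encode(0xA9) == b"\xC2\xA9"        # copyright sign
assert utf8_encode(0x2260) == b"\xE2\x89\xA0"  # not-equal sign
```

Because each threshold picks the shortest template that fits, the "shortest possible encoding" rule falls out of the structure automatically.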
Pros and cons of various UTF-n
All are equally valid encodings
- UTF-32: simpler, trivial to encode
- UTF-16: captures the most commonly used Unicode characters simply
- UTF-8: most compact for ASCII-like text
  - At a disadvantage for some East Asian text
A binary sort of UTF-8 matches a sort by Unicode value
Big-/little-endian storage is not an issue for UTF-8
- For the others it is partly addressed by the unused characters 0xFFFE and 0xFEFF
- Put that byte-order marker at the start of the text to flag the endian choice

Back to restricted variable-length codes
Generalization for numeric data
- 1xxxxxxx for the 2^7 = 128 most frequent symbols
- 0xxxxxxx 1xxxxxxx for the next 2^14 = 16,384 symbols, and so on
- Con: totally fails to compress text
  - On WSJ89 it uses 8.0 bits per symbol, for a 0% compression ratio(!)
- Pro: can be used for integer data
  - Examples: word frequencies (usually small), inverted lists
Word-based encoding
- Build a dictionary sorted by word frequency
- Represent each word as an offset/index into the dictionary
- Pro: a vocabulary of 20,000-50,000 words with a Zipf distribution requires 12-13 bits per word, compared with 10-11 bits for a completely variable-length code
- Con: the decoding dictionary is large compared with other methods

Restricted variable-length codes: summary
Four methods presented. All are
- simple
- very effective when their assumptions are correct
No assumptions about language or language models
- Though they work best when the language follows a pattern
All require an (unspecified) mapping from symbols to numbers: a dictionary
All but the basic method can handle any size dictionary

Compression and statistics: Outline (next diversion: statistics of text - Zipf, Heaps - and information theory)

Zipf's Law
A few words occur very often
- The 2 most frequent words can account for 10% of occurrences
- The top 6 words are 20%; the top 50 words are 50%
Many words are infrequent
Principle of Least Effort
- It is easier to repeat words than to coin new ones
Rank x Frequency ~ Constant:
  r * p_r = A
- p_r = (number of occurrences of the word of rank r) / N
- N = total word occurrences
- p_r = probability that a word chosen randomly from the text will be the word of rank r
- For D unique words: sum_{r=1..D} p_r = 1
- A ~ 0.1

Example of Frequent Words
Frequent word | Number of occurrences | Percentage of total
the    7,398,934   5.9
of     3,893,790   3.1
to     3,364,653   2.7
and    3,320,687   2.6
in     2,311,785   1.8
is     1,559,147   1.2
for    1,313,561   1.0
The    1,144,860   0.9
that   1,066,503   0.8
said   1,027,713   0.8
Frequencies from 336,310 documents in the 1 GB TREC Volume 3 corpus; 125,720,891 total word occurrences; 508,209 unique words.

Examples of Zipf
[Table: top 50 words from 423 short TIME magazine articles (243,836 word occurrences, lowercased, punctuation removed, 1.6 MB); each row lists frequency, rank, percentage of occurrences, and the product r * p_r, which stays roughly constant, between about 0.06 and 0.11]
[Table: top 50 words from 84,678 Associated Press 1989 articles; 37,309,114 word occurrences, lowercased, punctuation removed, 266 MB]
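The claim A ~ 0.1 can be checked directly: if p_r = A/r and the probabilities over a D-word vocabulary must sum to 1, then A = 1 / sum_{r=1..D}(1/r), i.e., the reciprocal of the D-th harmonic number. A small sketch (D = 10,000 is our illustrative choice, roughly the scale of a small corpus vocabulary):

```python
def zipf_constant(D: int) -> float:
    """A such that the probabilities A/r sum to 1 over ranks 1..D."""
    harmonic = sum(1.0 / r for r in range(1, D + 1))
    return 1.0 / harmonic

A = zipf_constant(10_000)
# A comes out near 0.10, matching the slide's A ~ 0.1
```

Because the harmonic number grows only logarithmically, A changes slowly with vocabulary size, which is why a single constant works across quite different collections.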
Predicting Occurrence Frequencies
Since r * p_r = A, a word that occurs n times has rank r_n = AN/n.
Several words may occur n times; assume the rank given by r_n applies to the last of the words that occur n times.
- r_n words occur n times or more (ranks 1..r_n)
- r_{n+1} words occur n+1 times or more
- Note r_n > r_{n+1}, since words that occur frequently are at the start of the list (lower rank)
The number of words that occur exactly n times is
  I_n = r_n - r_{n+1} = AN/n - AN/(n+1) = AN / (n(n+1))
The highest-ranking term occurs once and has rank D = AN/1 = AN.
The proportion of words with frequency n is I_n / D = 1 / (n(n+1)).
The proportion of words occurring once is 1/2.

Example of Occurrence Frequencies
Number of occurrences n | Predicted proportion 1/(n(n+1)) | Actual proportion | Actual number of words occurring n times
1    .500   .402   204,357
2    .167   .132    67,082
3    .083   .069    35,083
4    .050   .046    23,271
5    .033   .032    16,332
6    .024   .024    12,421
7    .018   .019     9,766
8    .014   .016     8,200
9    .011   .014     6,907
10   .009   .012     5,893
Frequencies from 336,310 documents in the 1 GB TREC Volume 3 corpus (125,720,891 total word occurrences; 508,209 unique words).

Does Real Data Fit Zipf's Law? (Foundations of Statistical NLP, Manning and Schutze, 2000)
A law of the form y = k x^c is called a power law; Zipf's law is a power law with c = -1.
From r * p_r = A we get n = (AN) r^(-1), where AN is a constant for a fixed collection.
On a log-log plot, power laws give a straight line with slope c:
  log y = log k + c log x
  log n = log(AN) + (-1) log r
Zipf is quite accurate except for very high and very low rank.

[Figure: fit to Zipf's law for the Brown corpus (Manning and Schutze, 2000), with k = AN = 100,000]

Mandelbrot (1954) Correction (Foundations of Statistical NLP, Manning and Schutze, 2000)
The following more general form gives a somewhat better fit
- It adds a constant inside the power:
  y = k (x + c)^b
[Figure: Mandelbrot's function on the Brown corpus, with k = 10^5.4, b = -1.15, c = 100]
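The Predicted column of the occurrence-frequency table above is just I_n/D = 1/(n(n+1)); a two-line sketch regenerates it exactly:

```python
# Proportion of vocabulary predicted to occur exactly n times, n = 1..10.
predicted = [round(1.0 / (n * (n + 1)), 3) for n in range(1, 11)]
# [0.5, 0.167, 0.083, 0.05, 0.033, 0.024, 0.018, 0.014, 0.011, 0.009]
```

Half of the vocabulary is predicted to be hapax legomena (words seen once), and the observed .402 is not far off.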
Explanations for Zipf's Law
Zipf's explanation was his principle of least effort
- A balance between the speaker's desire for a small vocabulary and the hearer's desire for a large one
Debate (1955-61) between Mandelbrot and H. Simon over the explanation
Li (1992) shows that just random typing of letters (including a space) will generate "words" with a Zipfian distribution
- http://linkage.rockefeller.edu/wli/zipf
- Short words are more likely to be generated

Size Distribution of Term Lists
Impact of Zipf's law on inverted lists:
[Figure: cumulative percentage of inverted lists by list size (1 byte to 1 MB, log scale) for the 1.2 GB TIPSTER collection (420 MB inverted file, 627,072 lists); most lists are tiny, while a few are very large]

Characteristics of Query Terms
Does query term usage follow Zipf's law?
[Figure: cumulative distribution of inverted-list sizes accessed by the TIPSTER query set (list sizes from 100 bytes to 10^7 bytes); queries tend to use terms with large inverted lists]

Vocabulary Growth: Heaps' Law
How does the size of the overall vocabulary (the number of unique words) grow with the size of the corpus?
- The vocabulary has no upper bound, due to proper names, typos, etc.
- New words occur less frequently as the vocabulary grows
If V is the size of the vocabulary and n is the length of the corpus in words:
  V = K n^B, with 0 < B < 1
Typical constants: K ~ 10-100, B ~ 0.4-0.6 (approximately the square root of n)
Heaps' law can be derived from Zipf's law by assuming documents are generated by randomly sampling words from a Zipfian distribution.

Heaps' Law Data
Collection | N (words) | V (vocabulary) | K | B
AP2     40,998,865   237,160   63.11   0.47
FR2     35,971,021   170,532   25.44   0.44
WSJ2    41,066,428   208,723   94.88   0.44
ZIFF2   30,148,165   209,846   30.29   0.51
[Figure: words in vocabulary (thousands) vs. words in collection (millions) for the four collections]
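Heaps' law is easy to evaluate against the table above. A sketch, using the AP2 row's constants (K = 63.11, B = 0.47, as parsed from the table):

```python
def heaps_vocabulary(n: float, K: float, beta: float) -> float:
    """Predicted vocabulary size V = K * n**beta for a corpus of n words."""
    return K * n ** beta

v_pred = heaps_vocabulary(40_998_865, K=63.11, beta=0.47)
# close to the 237,160 unique words actually observed for AP2
```

In practice K and B are fitted per collection, which is why each row of the table carries its own constants.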
Compression and statistics: Outline (diversion continues: information theory)

Information Theory
Shannon studied theoretical limits for data compression and transmission rate
- Compression limits are given by Entropy (H)
- Transmission limits are given by Channel Capacity (C)
A number of language tasks have been formulated as a noisy channel problem
- i.e., determine the most likely input given the noisy output
- OCR
- Speech recognition
- Question answering
- Machine translation

Shannon Game
The President of the United States is George W. ___
The best web search engine is ___
Mary had a little ___
The horse raced past the barn ___
- Period (end of sentence)?
- "Whinnied"? (a garden-path sentence)

Information Theory
The information content of a message depends on the receiver's prior knowledge as well as on the message itself
- How much of the receiver's uncertainty (entropy) is reduced?
- How predictable is the message?

Information Theory
Information content is defined as a decreasing function H(p) of the a priori probability p with which the message could be predicted
- If the receiver predicts the message with probability 1, the information content is zero: H(1) = 0
- If prediction of the message has probability 0, the message would have infinite information content: H(0) is undefined
- Information content should be additive: H(p1 * p2) = H(p1) + H(p2)
Use H(p) = -log p
With logs in base 2, the unit of information content (entropy) is 1 bit
A message that is a priori predictable with probability 0.5 has information content -log2(1/2) = log2(2) = 1 bit
- Just have to tell you which of the two possibilities it is: 0 or 1

Information Theory
Given n messages, the average (expected) information content gained through receipt of one of the n possible messages is
  H = - sum_{i=1..n} p_i log p_i
Average entropy is a maximum when messages are equally probable
- e.g., the average entropy associated with characters, assuming equal probabilities: for a 26-letter alphabet, entropy is log2(26) = 4.7 bits
Taking actual character probabilities into account, entropy is 4.14 bits
Using bigram probabilities reduces entropy to 3.56 bits
Experiments with people give values around 1.3 bits
- One can predict the next letter with about a 40% chance of accuracy
Better models reduce the entropy

Information Theory and Zipf
For words, r * p_r = A:
  H = - sum_{r=1..D} (A/r) log(A/r)
where D is the number of unique words.
Approximating the sum gives: for D = 10,000, H = 9.5 bits; for 50,000, 10.9 bits; for 100,000, 11.4 bits.
The equiprobable case gives H = 13.3, 15.6, and 16.6 bits respectively.

Information Theory
Another take on the principle of least effort: consider the word-probability distribution p_r that produces the smallest mean number of letters per word for a particular value of entropy H.
That is, minimize sum_r p_r m_r, where m_r is the length of the word with rank r, subject to
  sum_r p_r = 1 and H = - sum_r p_r log p_r = constant
This gives p_r = A (r + B)^(-C), where A, B, and C are fixed for a given subject vocabulary.
Look familiar? This is Mandelbrot's derivation.
Information theory has been used for compression, term weighting, and evaluation measures.

Compression and statistics: Outline (next: Variable-Length Codes - Huffman codes / canonical Huffman codes, Lempel-Ziv)

Huffman codes: basics
Gather probabilities for symbols
- characters, words, or a mix
Build a tree as follows
- Take the 2 least frequent symbols/nodes and join them with a parent node
- Label the less probable branch 0, the other branch 1
- P(node) = sum of P(child)
- Continue until the tree contains all nodes and symbols
The path from the root to a leaf indicates its code
- Frequent symbols are near the root, giving them short codes
- Less frequent symbols are deeper, giving them longer codes
[Figure: an example Huffman code tree (Managing Gigabytes, Figure 2.6)]
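The tree-building procedure just described can be sketched with a heap in a dozen lines. This is illustrative only (the probabilities are toy values that sum to 1, not a real corpus distribution), but it exhibits the two properties the slides rely on: the result is prefix-free, and more probable symbols get shorter codes:

```python
import heapq
from itertools import count

def huffman_code(probs: dict) -> dict:
    """Repeatedly join the two least probable nodes; prepend branch bits."""
    tick = count()  # tie-breaker so equal probabilities never compare dicts
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # least probable -> branch "0"
        p2, _, c2 = heapq.heappop(heap)   # next least      -> branch "1"
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

code = huffman_code({"the": .27, "of": .17, "and": .14, "to": .10,
                     "a": .09, "in": .07, "that": .05, "is": .04,
                     "it": .04, "on": .03})
```

Because every merge creates a full binary node, the codeword lengths satisfy the Kraft equality sum(2^-len) = 1, i.e., the tree wastes no code space.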
Huffman Codes
Huffman codes are "prefix-free": no code is a prefix of another.
Many codes are not assigned to any symbol, limiting the amount of compression possible.
English text with symbols for characters is approximately 5 bits per character (37.5% compression).
English text with symbols for characters and 800 frequent words yields 4.8-4.0 bits per character (40-50% compression).
Con: needs a bit-by-bit scan of the stream for decoding.
Con: looking up codes is somewhat inefficient
- The decoder must store the entire tree
- Traversing the tree involves chasing pointers (little locality)
Variation: adaptive models learn the distribution on the fly.
Variation: can be used on words as opposed to characters.

Huffman coding example
Encoding unit | Occurrence probability (%) | Code | Code length
the    2.70   01      2
of     1.70   001     3
and    1.37   111     3
to     0.99   110     3
a      0.88   100     3
in     0.74   0001    4
that   0.52   1011    4
is     0.43   1010    4
it     0.40   00001   5
on     0.33   00000   5

Canonical Huffman Codes
A form of Huffman code optimized to make decoding efficient; it isn't necessary to store the tree for decoding.
Use the usual algorithm to assign code lengths; use a different algorithm to assign the code words (reference: Managing Gigabytes):
- assign so that all code words of a given length occur sequentially
Decoding then requires only a dictionary and a "table of contents" containing the first entry of each code length.

[Table: a canonical Huffman code (Managing Gigabytes, Table 2.2). Rare words such as "zephyr" and "zigzag" get 17-bit codes (e.g., 00001101010101000, 00001101010101001); mid-frequency words such as "were" and "which" get 8-bit codes (10100110, 10100111); common function words ("at", "for", "had", "he", "her", "his", "said", ...) get 7-bit codes (1010101, 1010110, 1010111, 1011000, ...); and the most common words ("a", "and", "of") get 5- and 6-bit codes (11010, 11011, 11100). Within each length, the code words are consecutive binary numbers.]

Canonical Huffman Codes
Encoding
- "said" is the 10th seven-bit codeword; the first seven-bit code is 1010100
- The code word for "said" is obtained by adding nine to that binary value: 1010100 + 1001 = 1011101
Decoding
- Assume the bits to be decoded are 1100000101...
- The next code word must come at or after the first seven-bit word ("as" = 1010100) and before the first six-bit word (110010)
- Read the next seven bits (1100000) and calculate the difference from the first seven-bit value
- The difference is 12, so the word is 12 positions after "as": it must be "with"

Lempel-Ziv
An adaptive, dictionary-based approach to variable-length coding.
Use the text already encountered to build the dictionary
- If the text follows Zipf's laws, a good dictionary is built
No need to store the dictionary: the encoder and decoder each know how to build it on the fly.
Some variants: LZ77 (Gzip), LZ78, LZW (Unix compress).
Variants differ on
- how the dictionary is built,
- how pointers are represented (encoded), and
- limitations on what pointers can refer to.

Variable-Length Encoding: LZ77 (LZ1)
Data is encoded as a sequence of tuples
  <number of characters back, length, next character>
Example
- String: abaababbbbbbbbbbba
- Encoding: <0,0,a> <0,0,b> <2,1,a> <3,2,b> <1,10,a>
Recursive references are allowed: advance both pointers.
Optimizations
- Limit the size of back-pointers (e.g., 8K = 13 bits)
- Restrict the length of phrases (e.g., 15 characters = 4 bits)
- Variable-length encode the pointers and lengths
- Include the character only if necessary (1-bit flag)

Variable-Length Encoding: LZ77 (LZ1)
Encoding data structures
- Trie, hash table, or binary search tree
Characteristics
- Very fast decoding
- Low memory overhead
- The decoder is sufficiently small to include in the compressed data: self-expanding archives, typically found on PCs.
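An LZ77 decoder for the tuple stream above fits in a few lines. Note how the last tuple, <1,10,a>, copies ten characters from a one-character history: copying one character at a time lets the copy run past the old end of the output, which is exactly the recursive-reference case the slide mentions (this is a sketch, not Gzip's implementation):

```python
def lz77_decode(tuples):
    """Decode <back, length, next char> tuples; overlapping copies allowed."""
    out = []
    for back, length, char in tuples:
        start = len(out) - back
        for i in range(length):          # copy one char at a time so a
            out.append(out[start + i])   # pointer may overlap new output
        out.append(char)
    return "".join(out)

decoded = lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 1, "a"),
                       (3, 2, "b"), (1, 10, "a")])
# decoded reproduces the slide's example string "abaaba" + 11 b's + "a"
```

The encoder is the expensive direction (it must search the history for the longest match); decoding is just this blind copy loop, which is why LZ77 decompression is so fast.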
Variable-Length Encoding: Gzip
Gzip is a variant of LZ77.
The encoder locates previous strings using a hash table (on three characters, then a linked list).
User preferences (speed vs. space) determine the list length.
For maximal compression, it uses lookahead instead of a simple greedy search for string matches.
It uses one Huffman code for offsets, another for lengths and characters.
The Huffman codes are semi-static
- Text is processed in chunks of up to 64K
- Each chunk has its own Huffman code
- The Huffman codes are stored in the compressed text

Variable-Length Encoding: LZ78 (LZ2)
Data is encoded as a sequence of tuples <phrase ID, next character>.
Instead of looking backwards for substrings, use a phrase dictionary.
Example: abaababaababbbbbbbbbb
- encoder output: <0,a> <0,b> <1,a> <2,a> <4,a> <4,b> <2,b> <7,b> <8,b>
- decoder output: a  b  aa  ba  baa  bab  bb  bbb  bbbb
- phrase numbers: 1  2  3   4   5    6    7   8    9

Variable-Length Encoding: LZ78 (LZ2)
The phrase length does not need to be stored in the tuple.
Phrase IDs can take up less space than back-pointers.
The phrase dictionary grows until a memory limit is reached; when full, the dictionary
- is re-initialized,
- is partially rebuilt, or
- becomes static.
Encodes faster than LZ77; decodes more slowly.

Variable-Length Encoding: LZW
Data is encoded as a sequence of tuples <phrase ID>: only phrase IDs are written out.
Initialize the phrase dictionary with single characters (ASCII 0-127).
Form a new phrase as the old phrase plus the first character of the next phrase.
Example: abaababbaabaabaa
- encoder input (segments): a  b  a  ab  ab  ba  aba  abaa
- encoder output:           97 98 97 128 128 129 131  134
- new phrases: ab=128  ba=129  aa=130  aba=131  abb=132  baa=133  abaa=134

Variable-length encoding: UNIX compress
A variant of LZW.
Gradually increases the number of bits used to encode phrases: few when there are few phrases, more as more phrases are built.
When the memory limit is reached, the dictionary becomes static.
If compression deteriorates, flush the dictionary.
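The LZW trace above can be reproduced exactly with a few lines. This sketch omits the dictionary-size cap that the compress slide describes, and emits plain integers rather than the growing bit-width codes a real implementation would write:

```python
def lzw_encode(text: str):
    """LZW: emit dictionary codes only; start from single-character phrases."""
    dictionary = {chr(i): i for i in range(128)}  # ASCII seed phrases
    next_code = 128
    w, out = "", []
    for ch in text:
        if w + ch in dictionary:
            w += ch                       # keep growing the current phrase
        else:
            out.append(dictionary[w])     # emit the longest known phrase
            dictionary[w + ch] = next_code
            next_code += 1
            w = ch
    out.append(dictionary[w])             # flush the final phrase
    return out

codes = lzw_encode("abaababbaabaabaa")
# codes == [97, 98, 97, 128, 128, 129, 131, 134], as on the slide
```

Note that only codes appear in the output: the decoder rebuilds the same phrase dictionary by forming "previous phrase + first character of current phrase" as it goes.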
Compression and statistics: Outline (next: Synchronization)

Synchronization
It is difficult to access encoded text randomly.
With bit-level encoding (e.g., Huffman codes), it is difficult to know where one code ends and another begins.
With adaptive methods, the dictionary depends upon the prior encoded text.
Synchronization points can be inserted into an encoded message, from which decoding can begin
- For example, pad Huffman codes to the next byte, or restart an adaptive dictionary
- Compression effectiveness is reduced in proportion to the number of synchronization points

Self-Synchronizing Codes
In a self-synchronizing code, the decoder can start in the middle of a message and eventually synchronize (figure out the code)
- This also means it will eventually recover from corrupt data
It may not be possible to guarantee how long it will take the decoder to synchronize.
Most variable-length codes are self-synchronizing to some extent.
Fixed-length codes are not self-synchronizing, but boundaries are known synchronization points.
Adaptive codes are not self-synchronizing.

Example: Huffman Codes
[Figure (Managing Gigabytes, Figure 2.41):
a) chillier but that wasn't to be expected just now
b) chillier bio that wasn't to be expected just now
   chillier b2 that wasn't to be expected just now
   chillier b~t that wasn't to be expected just now
   chillier bit that wasn't to be expected just now
   chillier m~t that wasn't to be expected just now
   chillier bugse eonasn't to be expected just now
   chillier bugea aieonasn't to be expected just now
   chillier bun that wasn't to be expected just now
   chillier butap eonasn't to be expected just now
c) chillier but thaswhrs eree'maem th't otaedgsrkeh...
(a) Original text; (b) Huffman code with one bit flipped (nine different single bits) - the decoder emits brief garbage and then resynchronizes; (c) arithmetic coding with one bit flipped - the decoder never recovers.]

Compression and statistics: Outline (next: Compressing inverted files)

Inverted List Indexes
Inverted lists are usually compressed.
Inverted files with word locations are about the size of the raw data.
The distribution of the numbers is skewed
- Most numbers are small (e.g., word locations, term frequency)
The distribution can be made more skewed easily
- Delta encoding: 5 8 10 17 -> 5 3 2 7
Simple compression techniques are often the best choice
- Simple algorithms are nearly as effective as complex algorithms
- Simple algorithms are much faster than complex algorithms
- Goal: time saved by reduced I/O > time required to uncompress

Inverted List Indexes
The longest lists, which take up the most space, hold the most frequent (probable) words.
Compressing the longest lists would save the most space.
The longest lists should compress easily because they contain the least information (why?).
Algorithms
- Delta encoding
- Variable-length encoding
- Unary codes
- Gamma codes
- Delta codes

Delta Encoding (Storing Gaps)
Reduces the range of the numbers.
Produces a more skewed distribution: it increases the probability of smaller numbers.
Stemming also increases the probability of smaller numbers. Why?

Variable-length codes: Restricted Fixed-Length Codes
Review the numeric-data generalization of restricted variable-length codes.
Advantages
- Effective
- Global
- Non-parametric

Unary Code
Represent a number n > 0 as (n-1) 1-bits and a terminating 0
- 1 -> 0
- 2 -> 10
- 3 -> 110
Great for small numbers; terrible for large numbers.
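The gap transformation and the unary code above are each a one-liner; a sketch using the slide's own numbers:

```python
def gaps(postings):
    """Delta-encode an ascending posting list as first value plus gaps."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def unary(n: int) -> str:
    """n > 0 encoded as (n-1) one-bits and a terminating zero."""
    return "1" * (n - 1) + "0"

g = gaps([5, 8, 10, 17])   # [5, 3, 2, 7], as on the slide
bits = unary(3)            # "110"
```

Storing gaps and then unary-coding them rewards exactly the lists the slides single out: a very frequent term has closely spaced postings, hence tiny gaps, hence short unary codes.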
Gamma Code
A combination of unary and binary codes.
The unary code stores the number of bits needed to represent n in binary; the binary code stores the information necessary to reconstruct n.
- The unary part stores 1 + floor(log2 n)
- The binary part stores n - 2^floor(log2 n)
Example: n = 9
- floor(log2 9) = 3, so the unary code is 1110
- 9 - 2^3 = 9 - 8 = 1, so the binary code is 001
- The complete encoded form is 1110001 (7 bits)
This method is superior to a binary encoding.

Delta Code
A generalization of the Gamma code
- Not related to delta encoding (storing gaps)
Encode the length portion of a Gamma code in a Gamma code.
Gamma codes are better for small numbers; Delta codes are better for large numbers.
Example
- For gamma codes, the number of bits is 1 + 2 floor(log2 n)
- For delta codes, the number of bits is floor(log2 n) + 1 + 2 floor(log2(1 + floor(log2 n)))

Comparing codings for various numbers (code lengths in bits)
n | RVL | Gamma | Delta
16          8    9   10
100         8   13   12
129        16   15   14
1,000      16   19   16
10,000     16   27   20
16,385     24   29   21
100,000    24   33   25
1,000,000  24   39   28

Inverted File Compression Comparison
Method (bits per pointer) | Bible | GNUbib | Comact | TREC
Global methods
Unary                 264     920     490    1719
Binary               15.00   16.00   18.00   20.00
Bernoulli             9.67   11.65   10.58   12.61
Gamma                 6.55    5.69    4.48    6.43
Delta                 6.26    5.08    4.36    6.19
Observed frequency    5.92    4.83    4.21    5.83
Local methods
Bernoulli             6.13    6.17    5.40    5.73
Hyperbolic            5.77    5.17    4.65    5.74
Skewed Bernoulli      5.68    4.71    4.24    5.28
Batched frequency     5.61    4.65    4.03    5.27
Compression of inverted files in bits per pointer.

Compression and statistics: Outline (next: Compression for block-level retrieval)

Background
Applying and extending compression within IR; a different IR architecture leads to different approaches.
Based on
- E. Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 2000.
- N. Ziviani, E. Moura, G. Navarro, and R. Baeza-Yates. Compression: A key for next-generation text retrieval systems. Computer, 2000.

Block-level retrieval
Interested in string search as well as classic IR.
Develop an architecture where string matching is needed
- The ideas are still interesting in the absence of that architecture
Idea
- Index and retrieve blocks of text
- A block should span many documents
- Compress documents to get the most into a block
Perhaps useful for efficiency
- Compression of the original text
- Matching disk/cache block sizes
Approach
- Find blocks that contain matches
- Scan the documents in the block to see if the match really occurred, and where

Compression issues
Want to minimize the number of blocks
- Compress documents as much as possible
- Any good compression algorithm is acceptable
Need to be able to search the documents rapidly
- Most algorithms require uncompressing the text, then searching
- Can we search the compressed text directly instead?
- Can we search only in the area where the search terms matched?
Consider the algorithms we've talked about
- Fixed and restricted variable length: not very good compression for general data
- Huffman (standard or canonical): requires searching at the bit level
- Lempel-Ziv approaches: require starting at the beginning of the text to resolve phrases/look-backs

Modify Huffman symbols
We will focus on fixing Huffman encoding.
Use a "spaceless word" encoding. Text is made up of words
- Separated by a space or special characters (punctuation)
- Most words are followed by a space rather than punctuation
Symbols consist of
- Words
- Special character sequences that separate words
Encode symbols as usual, except
- If a word is followed by a space, do not encode the space
- The decoder inserts a space unless the next character is a special character
- Some special cases arise where a space must be encoded.
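The spaceless-word symbol stream can be sketched as a tokenizer: word and separator symbols alternate, and a separator that is exactly one space is dropped because the decoder reinserts it by default. This is our own illustrative sketch (a simple `re`-based splitter, not the paper's tokenizer):

```python
import re

def spaceless_symbols(text: str):
    """Split text into word/separator symbols, dropping single spaces."""
    symbols = []
    for token in re.findall(r"\w+|\W+", text):
        if token == " ":
            continue                 # decoder inserts the default space
        symbols.append(token)
    return symbols

syms = spaceless_symbols("for each rose, a rose")
# ['for', 'each', 'rose', ', ', 'a', 'rose'] - five spaces cost nothing
```

Since most words really are followed by a single space, most separator symbols vanish from the stream, which is where the encoding gets its extra compression.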
Tagged byte Huffman encoding 7 Reserve one bit to indicate it starts an encoding 7 128 branches at each node Copynght 1ames Allan 42 Do they work 1 so i l l l l l l l 48 Plain Hu man uncompressed vocabulary F 7 Plain Huffmamconlpressed vocabulary 45 Compress Va 7 Gzlp lt Compression Railol Va 7D 80 90 100 40 50 50 File sizelmeganyles Fi Compression mlioz frlx39 LhP WSJ le llmpressed by G511 Cunlpm and plain Huifman with and withuut compressing the Vocabulary Cnpyngm James Allen Do they work 2 Table ll c mm39 nn Rm l in bclbmer wherk quotEhlx39oliyquot Rpl39c rs m plimal ling Tle pm 9 591 tn slnl fi in nmbulary is inclurlnxl in ma Huffman mmprrssmn lnlins ALhim39ml by ml rent L39om l e Fillvs Nulhull DOE 1 6h Pllnn39 llumnnn Tagged Huffman l 1 Cnpyngm James Allen Searching compressed text Text is compressed as a sequence of bytes Every symbol takes up complete bytes 7 Symbols are not sequences of bits that span byte boundaries Convert query similarly r Isolate symbols 7 Convert each symbol to a sequence ofbytes Search compressed text for sequence of bytes 7 Use your favorite search algorithm 7 Extensions available to support wildcards or other complex patterns Cupyngm JamesAllan Why tagged version Can get false matches by using parts of different Word Code real 35 132 o gmanexi realword word 229 12 99 V l l Com ressed Text 85l32 22E i2 99 ghost 132229 l2 9 ghost Tagged approach makes first character special Match can only succeed at start of word 7 Recall that in Huffman encoding no character s sequence is a pre x of another chara er s Problem solved Cupyngm JamesAllan Conclusion blocklevel encoding Motivated by blocklevel indexing Extended Huffman coding to bytelevel Simpler than bitlevel decoding Little drop in amount of compression Combined with wordlevel encoding to make string matching easy Copynghl James Allan Compression and statistics Recap Introduction Fixed Length Codes 7 Shortbtes binams n grams Restricted VariableLength Codes 7 Basic method extension for 
larger symbol sets, UTF-8
- Some diversions
  - Statistics of text: Zipf, Heaps
  - Information theory
- Variable-length codes
  - Huffman codes / canonical Huffman codes
  - Lempel-Ziv: LZ77 (gzip), LZ78, LZW (Unix compress)
  - Synchronization
- Compressing inverted files
- Compression for block-level retrieval

Information Retrieval: Massive parallelism (MapReduce, Hadoop, PIG, etc.)
James Allan, University of Massachusetts Amherst, CMPSCI 646, Fall 2007

Indexing
- Given a set of files
- Create an inverted file
  - word: d001, d002, ...
- Write some pseudocode to do that

Indexing pseudocode

    for each document D
        for each word W in D
            fetch list for W (create it if it does not exist)
            append D to end of list
        end
    end

- Now suppose we have many, many machines
- Fan documents out to processors
- Coalesce the inverted lists
- Write pseudocode for that

Pseudo-pseudocode
- Fan out
  - Do the same work as before on each node
- Coalesce
  - Gather all partial inverted lists on one machine
  - Sort by the word (the key for the list)
  - Merge the lists together
- Notes
  - Requires globally unique docids
  - Could coalesce hierarchically

Massive parallelism
- Suppose we have thousands of processors; how best to use them?
- Build custom applications
  - Code for fanning out, coalescing
  - Dealing with errors, processor failures
- Build a software framework
  - Bundle/bury all the administrative work
  - Programmer deals with the interesting part

MapReduce
- Google's MapReduce probably best known
  - Jeff Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Symposium on OS Design and Implementation, 2004
- Goals of the work
  - Automatic parallelism and distribution
  - Fault tolerance
  - I/O scheduling
  - Status and monitoring
- Origin of the terms
  - Map applies a function to a list and returns a list
  - Reduce combines data to get a return value

General overview
- Source data fanned out to processors
- Processor gets a list of things, e.g., text lines
- Processor maps the list to <key, value> pairs
- Additional processors coalesce pairs by key
- Processors reduce the lists to new <key, value> pairs
- Those get written to disk

Example from the paper
- Count all words in a set of documents; emit <word, count> pairs

    map(String key, String value):
        // key: document name; value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word; values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

- Actual code is in the appendix of the paper

What happens?
- Documents sent to worker processors
  - Appropriately sized
- Each generates <word, 1> pairs on local disk
- Put in a file specific to the reduce destination
  - hash(word) mod R
- Master process coordinates
- Reduce processes started on worker processors
  - Number R specified
  - Reads the map output files from where they are
  - Calls reduce with each unique key and its value pairs
  - Writes output pairs to global storage

View of the data/process flow
[Figure from http://labs.google.com/papers/mapreduce.html: input splits flow to map workers, which emit (k, v) pairs; pairs are partitioned by key to reduce workers, which write the output files]

How many processors?
- Want many map tasks, more than machines
  - Granularity of a task often hard to predict
  - Makes it easier to load balance
  - Smaller tasks easier to restart after a failure
- Paper claims that for 5,000 machines:
  - 200,000 map tasks
  - 5,000 reduce tasks

Challenge: LM statistics
- Want to count frequencies of n-grams
  - Sequences of n words
- Use MapReduce to count 5-grams. How?

Some extra features of MapReduce
- Your own partitioning function
  - The function that maps pairs to a reduce process
- Order guarantee
  - Reduce process gets keys in ascending order
- Combiner function
  - In some cases can partly reduce at the map process
- Roll-your-own input types
  - Create a reader if none of the existing types works
- Side effects
  - Map or Reduce can generate auxiliary output
- Failure support
  - Detects and skips bad records when code
crashes
- Counters
  - Can declare and increment counters, returned to the code at the end

Variations on MapReduce
- Google's MapReduce
  - Implemented in C++ with Python and Java interfaces
- Yahoo's Hadoop
  - MapReduce implemented in Java
- Microsoft's Dryad
  - Seems to be more oriented toward SQL-like stuff

Beyond MapReduce
- All that mapping and reducing is tedious; how about script-like languages?
- Sawzall (Google, MapReduce)
  - Looks vaguely like C
- PIG (Yahoo, Hadoop)
  - Looks vaguely like SQL
  - There's also Grunt, a shell for piggy things
- Nebula (Microsoft, Dryad)
  - No idea what it looks like

MapReduce application
- Build an IR system using MapReduce
- Language modeling
  - Tokenize, etc.
  - Calculate P(w|D)
  - Build the inverted file
  - Parse the query and rank documents
- Different for a vector-space query with cosine similarity?

Some other problems
- K-means clustering of a set of documents
  - Will require multiple passes
- Count the links into all pages
  - Each page is a document
  - Assume you can identify links
- Grep for a string
- Find all anchor text associated with pages
  - <a href="...">this is anchor text</a>
- Find query words associated with a page by click
  - Log <query, page-clicked-on>
- Build an <English, Arabic> parallel dictionary
  - From parallel corpora

Information Retrieval: Relevance Models
James Allan, University of Massachusetts Amherst, CMPSCI 646, Fall 2007

Recall latent topic models
- Describe things in terms of underlying topics
- pLSI: P(d, w) = P(d) sum_z P(w|z) P(z|d)
- LDA: [the LDA formula and the accompanying graphical-model figure are garbled in this transcript]

Challenges of latent models
- Is it reasonable to assume that everything can be represented by a fixed set of topics?
  - Very coarse areas with few topics
  - Finer with many (1K?) topics
  - What is the correct number of topics?
- Aspect models capture a large, interesting part of the corpus
  - Tend to ignore or
downplay outliers
- What if queries are about unusual topics/events?
  - Many IR topics are on the fringes

Relevance models
- Try to model the set of relevant documents
  - Not modeling topics in the corpus
  - Just modeling relevance to the query at hand
  - P(w|R): the probability of seeing w in some relevant document
- Assume relevant documents are samples drawn from P(w|R)
  - Like a typical LM if we have known relevant documents: use maximum likelihood, etc.
  - What if there are no known relevant samples?
  - Where can we find a very small relevant doc?

Estimation
- Look for a model M (a distribution) from which we can estimate the probabilities P(w, q1, ..., qk)
- More likely, find a finite universe of models M:

    P(w, q1, ..., qk) = sum_{M in M} P(M) P(w, q1, ..., qk | M)

- Assume independence; then

    P(w, q1, ..., qk) = sum_{M in M} P(M) P(w|M) P(q1, ..., qk | M)

Estimation (cont.)
- After some more fiddling:

    P(w | q1, ..., qk) = P(w, q1, ..., qk) / P(q1, ..., qk)
                       = sum_{M in M} P(w|M) P(M) P(q1, ..., qk | M) / P(q1, ..., qk)
                       = sum_{M in M} P(w|M) P(M | q1, ..., qk)

- Now we just have to figure out the M's

A good place for models
- Looking for co-occurrence of w and the query terms:

    P(M | q1, ..., qk) = P(M) P(q1, ..., qk | M) / P(q1, ..., qk)

- Models with good estimates of those probabilities are preferred
  - So we want models estimated from samples that contain all or most of those query terms
- Where/how can we get such samples? Document retrieval!

Relevance model
- Use documents to estimate the models:

    P(w|R) ~ P(w|Q) = sum_i P(w | M_Di) P(Di | Q)

- But we know that query likelihood is more effective than document likelihood:

    P(Di | Q) = P(Q | Di) P(Di) / P(Q), which is proportional to P(Q | Di) P(Di)
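The estimate just derived, a query-likelihood-weighted mixture of document models, can be sketched in a few lines. The toy documents, the Jelinek-Mercer smoothing, and the final renormalization over the vocabulary are illustrative assumptions, not details from the lecture:

```python
import math
from collections import Counter

def doc_model(text, coll_probs, lam=0.5):
    """Smoothed unigram document model P(w | M_D)."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return lambda w: ((1 - lam) * counts.get(w, 0) / total
                      + lam * coll_probs.get(w, 1e-9))

def relevance_model(query, docs, coll_probs):
    """P(w|R) ~ sum_i P(w | M_Di) * P(Q | Di), renormalized."""
    q_terms = query.split()
    models = [doc_model(d, coll_probs) for d in docs]
    # Query likelihood P(Q | Di) = product over j of P(q_j | M_Di)
    weights = [math.prod(m(t) for t in q_terms) for m in models]
    vocab = {w for d in docs for w in d.split()}
    pwr = {w: sum(m(w) * wt for m, wt in zip(models, weights)) for w in vocab}
    z = sum(pwr.values())
    return {w: v / z for w, v in pwr.items()}

# Tiny made-up collection: two on-topic documents and one off-topic one
docs = ["space shuttle launch", "shuttle mission in space",
        "election results tonight"]
coll = Counter(w for d in docs for w in d.split())
coll_probs = {w: c / sum(coll.values()) for w, c in coll.items()}
pwr = relevance_model("space shuttle", docs, coll_probs)
```

Documents that score well on the query dominate the mixture, so words from the two shuttle documents end up with much higher P(w|R) than words from the election document.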
model for each document Ways to rank documents PDlMR has different values for different lengths PDMR PDlGE has same bias as discussed in LM slides prefers terms with high ratio Better compare models directly KullbackLeibler divergence relative entropy PwR PwlD KLRD ZPMR log Copyright James Allan Page 7 Does it work Queryfocused topic models TDT2 corpus of newswire articles Queries drawn from TDT2 topics Similarto topics pLSILDA find but tuned to a query rather than overall corpus quotMonica Lewinsk Cmquot quotIsmcii Pulcslinmn Raidsquot quotRats in 51mmquot quotJohn Glennquot Linubmnbcr w P w 11 mm m P w w P w w mumk3 41077 m 0032 JCZ39Hkai mnnim 01155 space H030 nimbmnbci my m 4 shuttle 00211 11 ic gr 0033 columbm 01110 i 1111 con dant H027 bmm 11411 0010 ml lk 4mm mmon 001 l 01m 1 CM mil mo n01i 0 712 thcuuurc lwi39cxitlcm min mu n01i oh ii 012 1011113 tlinlon u mu 1 tcm 00111 oclnbcr 0011 decide 1mm mrr mill loop mucus it my 0011 guilty Copyngql o IamesAllan Adhoc retrieval Clear gains in retrieval Over Okapi and query likelihood Also better than expansion of LM Ave rage Precision AP FT LA WSJ Copyngqt lamesAllan 1 Page 8 RM beyond adhoc retrieval Crosslanguage retrieval 7 Need P chineseword englishquery 7 Use parallel corpus 7 Estimate with RM approach 7 More effective than translation model Crossmedia retrieval 7 Need P picture englighquery 7 Need something akin to a parallel corpus 7 Very effective at retrievingannotating pictures and handwriting Copynghl JamesAllan RM v TM for IR Topic models find trends in corpus Relevance models are queryspecific topics IR is about answering queries Topic models are too broad on their own Effective TMlR systems incorporate RM too TMlR provides some gains over pure RM 7 Possibly particularly when TMs overlap query Copynghl JamesAllan Page 10 Information Retrieval Spoken document retrieval James Allan University of Massachusetts Amherst CMPSCI 646 Fall 2007 Background Automatic Speech Recognition ASR Process for converting a 
speech signal into a text representation of the words being used
- Dragon NaturallySpeaking is a commercial product
- Quality: widely variable
  - Measured by error rate: E = (S + I + D) / N x 100 (substitutions, insertions, deletions over the number of reference words)
  - Very, very good for digit recognition
  - Quite good for read speech with good-quality microphones: E as low as a few percent
  - Not good for spontaneous conversational speech: E around 50%

ASR is not easy
- Sources of problems:
  - Difficult teasing out background noise, music, static
  - Different than written language: dialogue-oriented rather than essay-like; disfluencies (um's, restarts, etc.); weird grammar and sentence structure
  - No spaces between words in speech: "can you wreck a nice beach" vs. "can you recognize speech"
  - Different speakers create different voice signals: gender, accent, dialects
  - Do not know how to use context/meaning: "a long tail" vs. "a long tale"
  - Cannot use body language to interpret

Example of ASR output
- CNN story from June 6, 2006:
  - "this vote uh was carried some three miles by the storm just plunk down right here on a residential street uh and and I'm hardly anyone is living on the street mean all the houses are empty people's possessions are all still in a renders a the gold reject videotape from someone's home there's a losing a child's doll from or Disneyland we find this kind of stuff all Skoda and go down any street here in Saint Bernard Parish and and uh you'll see all the all the destruction"
- Hard to believe that IR is even possible!

Table of Contents
- Thesis
- Spoken document retrieval
  - TREC (1997, 1998, 1999, and 2000)
  - TDT (1998)
- Spoken queries
- Where now?
- Conclusion

Thesis
- Most ASR-related IR tasks are easy
- Some, but little, value in compensating for ASR errors
  - Same reason that heavy NLP rarely helps IR
- Lots of words gives lots of context
  - Disambiguation
  - Redundancy
- Incorrect processing of some words usually has minimal impact on the final result
- ASR/IR research interesting primarily for snippets of speech

Simple example: ambiguity
- Query: "fly"
  - The insect fly; flies for fishing
  - The fly on a pair of pants
  - The verb fly: on a plane, like a bird, to move quickly
- Add a single word:
  - fly zipper
  - fly airplane
  - fly buzz
- Ambiguity nearly gone

Simple example: redundancy
- Consider this text:
  - "Fly Fisherman is written and edited by fly fishers for beginning to expert fly fishers. It provides you with year-round advice from fly fishing's top experts, saltwater fly patterns, fly tying, fly fishing destinations."
- The word "fly" occurs in numerous forms
- Related words (fisherman, fishers, tying) also provide a safety net

Is it really that easy?
- Will talk about experiences that support it
  - TREC SDR
  - TDT ASR
  - Spoken queries
- Will try to suggest where the thesis breaks down
  - I.e., where are the open research questions?

Table of Contents
- Thesis
- Spoken document retrieval
  - TREC (1997, 1998, 1999, and 2000)
  - TDT (1998)
- Spoken queries
- Where now?
- Conclusion

TREC SDR track
- Successor to TREC's confusion track (OCR)
- Spoken Document Retrieval
  - Ran from TREC-6 through TREC-9
  - Discontinued ("sunset") after TREC-9
- Conclusions
  - SDR is a "solved problem"
- Useful article
  - Garofolo, Auzanne, and Voorhees, "The TREC spoken document retrieval track: A success story"
  - Several of the following graphs come from that paper

TREC-6 SDR track (1997)
- Genesis of the track
  - ASR systems still slow; only a small corpus possible
  - 100-hour HUB-4 dataset, split; 50 hours (1,451 stories) in the evaluation set
  - 50 known-item queries
    - Simpler for systems to process
    - Simpler for NIST to evaluate
- Designed to stress IR based on ASR
  - Several classes of queries targeting specific documents
- How well did it work?

TREC-6 results
[Figure: average percent retrieved at rank 1 and percent retrieved at rank <= 10, averaged across systems, for the Reference, Baseline, and Speech conditions, broken out by query type (easy speech, difficult speech, difficult query, entire test); roughly a 10% drop for speech]

TREC-7 SDR (1998)
- Larger corpus
  - 87 hours, 2,866 stories
  - Still small by IR standards
- Task is ranked retrieval
  - 23 topics averaging 14.7 words
- Cross-recognizer runs
  - Sites provided ASR output to other sites
  - Allows comparison of
ASR errors to effectiveness
- Question
  - Was the success of SDR because "known item" retrieval is too easy, or because SDR is easy?

TREC-7 results
[Figure: mean average precision vs. mean story WER for several recognizers, over roughly 0-65% SWER; mean correlation coefficient about 0.87]
- Clear relationship between ASR errors and SWER
- Small drop in AvgPrec even with a big drop in SWER

On to TREC-8 SDR (1999)
- Make the corpus reasonable by IR standards
  - 550 hours of TDT; 21,754 stories
  - Closed captions (CCAP) used as the reference, estimated 7-15% WER
- Query style unchanged
- Cross-recognizer runs still encouraged
- Story-boundary-unknown condition added
- 50 ranked-retrieval topics

TREC-8 results, boundaries
[Figure: SDR'99 retrieval vs. recognition; mean slope -0.0016]
- WER has little impact on AvgPrec
- Results compare well to earlier years, even with the bigger, higher-WER corpus

TREC-8 results, boundaryless
[Figure: SDR'99 retrieval vs. recognition, story boundaries unknown, 0-85% SWER]
- Essentially the same results, except AvgPrec is much lower
- Problem is coping with the absence of boundaries, not ASR

Finally, TREC-9 SDR (2000)
- Same as TREC-8
- Shorter queries
  - Surely that will mess up ASR?
  - Short: "Name some countries which permit their citizens to commit suicide with medical assistance."
  - Terse: "assisted suicide"
- Stronger emphasis on the lack of boundaries
  - Now the required condition

TREC-9 results
- Short worse than terse
[Figure: results for the known-story-boundaries conditions]

TREC-9 results, boundaryless
[Figure: average MAP over all topics vs. SWER for the story-boundaries-unknown, short condition; Cambridge and Sheffield recognizers; roughly 15-30% SWER]

Table of Contents
- Thesis
- Spoken document retrieval
  - TREC (1997, 1998, 1999, and 2000)
  - TDT (1998)
- Spoken queries
- Where now?
- Conclusion

Topic Detection and Tracking
- Event-based organization of broadcast news
- Tasks are applied to a stream of news
  - Similar in spirit to information filtering
- Corpora include radio and television
  - TDT-2 (Jan-Jun 1998) and TDT-3 (Oct-Dec 1998)
  - Audio is available
- 5/6 of TDT-2 used for TREC 8 & 9
  - CCAP available as the reference
  - Recognizer output also provided (BBN's system)
  - Multilingual as well as multimedia
- Evaluation done with contrasting conditions
  - English-only vs. English plus Mandarin, etc.
  - CCAP vs. ASR

TDT tasks
- Tracking
  - Given Nt on-topic stories, find the rest in the stream (Nt is small, typically 1, 2, or 4)
  - Training stories always given with boundaries
- First story detection
  - Monitor the stream and identify any story that talks about a news event not previously discussed
- Cluster detection
  - Partition arriving stories into news topics
- Other tasks
  - Story link detection
  - Segmentation

TDT 1998: Dragon
[Figure: detection error tradeoff curves]
- Error tradeoff curve shows a modest effect of ASR errors
- Boundaries make a big difference
- Yamron et al., "Topic Tracking in a News Stream," in Proceedings of the DARPA Broadcast News Workshop, 1999

TDT 1998: CMU
- TDT's error cost measure not very sensitive to ASR errors
[Table 3: official tracking results (Nt = 4, known-boundary condition): DTree runs on ASR and closed-caption sources, with and without story boundaries; the boundary condition does make a difference in ASR]
- Carbonell et al., "CMU Report on TDT-2: Segmentation, Detection, and Tracking," in Proceedings of the DARPA Broadcast News Workshop, 1999

TDT 1998 results: UMass
[Figure: CCAP vs. ASR results]
- Cluster detection task
- Some techniques very sensitive to ASR errors
  - Later research has reduced that sensitivity

JHU CLSP WS'99: Novelty Detection
- First story detection task
- Task is very difficult; many errors
- ASR corpus substantially worse than the CCAP version where we care about it
[Figure: miss probability vs. false alarm probability]

Story link detection
- ASR errors have no impact with story comparison
- Story link detection: are two stories on the same topic?
- Graph suggests there is no impact from ASR
- Graph is an average; what about the details?

Impact of ASR on link
detection
- How does the score change per story pair?
[Figure: log-scale distribution of the change in score using ASR; a small "ASR helps" region]
- Break down by yes/no pairs
[Figure: log-scale distributions of the change in score using ASR, separated by pair type]

Example when ASR errors hurt
[Story pair, badly garbled in this transcript: a closed-caption story mentioning "charges of murder, torture" (CNN, 1998-11-25) paired with an ASR version mentioning "pinochet augusto ... because he's the head of ... charges of murder torture" (CNN, 1998-10-30)]

Example when ASR errors help
- Closed caption: "UN Secretary Kofi Annan is in Libya; he will try to persuade Gadhafi to hand over the suspects. 270 were killed in that ..." (CNN, 1998-12-05); "... when a terrorist [bomb knocked] the plane out of the sky" (CNN, 1998-12-21)
- ASR version: "and u n secretary general healthy and it is in libya where he will try to persuade libyan leader ... hand over the sus[pects] ... were killed ... the plane out of the sky" (CNN, 1998-12-21)
- Fuller pair (Pan Am): closed caption: "UN Secretary Kofi Annan is in Libya, where he will try to persuade Gadhafi to hand over the suspects. 270 people were killed in that bombing." ASR: "and u n secretary general healthy and it is in libya where he will try to persuade libyan leader moammar qaddafi hand over the suspects in the nineteen eighty eight bombing of pan am jet over scotland two hundred seventy people were killed in that bombing"

TREC/TDT SDR conclusions
- ASR errors have minimal average impact on effectiveness for all tasks considered
  - Document retrieval
  - Tracking, new event detection, story link detection
- Surprisingly high WER causes only a small drop
  - Not explored in TDT; extrapolated from TREC
  - Of course there is an impact in all cases
- OOV words can cause catastrophic errors
- Closed caption as the reference has some odd effects
- TREC SDR declared a "solved problem"

Table of Contents
- Thesis
- Spoken document retrieval
  - TREC (1997, 1998, 1999, and 2000)
  - TDT (1998)
- Spoken queries
- Where now?
- Conclusion

ASR and spoken queries
- Document retrieval fairly robust to ASR errors
- Hypothesis
  - Long documents alleviate OOV and recognition problems because of redundant/related words
- Queries are generally shorter
  - Do errors there have a larger impact?
  - How short is a problem?

Some experiments
- From Barnett et al., "Experiments in spoken queries for document retrieval," Proceedings of Eurospeech 1997
- Two experiments
  - 35 long TREC queries of 50-60 words
  - 10 queries of three lengths used: 2-4 content words, 5-8 words, 10-15 words
- Constructed queries
  - ... of top 5 articles relevant
  - 5 of top 20 relevant
  - More than 15 of top 20 relevant
- Run against 24,630 articles from the Boston Globe from 1996
- Alternate ASR error rates artificially created by altering the parameters of the beam search

Long queries
- Modest impact of ASR error rate, even up to 50% WER
- Very robust to ASR errors in the top several documents
- Correlation between WER and AvgPrec was weak
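The word error rate reported throughout these results combines substitutions, insertions, and deletions against a reference transcript. A minimal sketch computes it with a word-level Levenshtein alignment; the example uses the lecture's "wreck a nice beach" illustration:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + I + D) / N, computed as the word-level Levenshtein
    distance between reference and hypothesis, over the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("can you recognize speech",
                      "can you wreck a nice beach")
# After "can you", nothing aligns: 2 substitutions plus 2 insertions over
# 4 reference words, so WER comes out to 1.0 (100%)
```

This is also why WER can exceed 100%: insertions are counted against a fixed reference length, as with the high-SWER recognizers in the TREC-8 plots.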

