Class Note for CMPSCI 646 at UMass(1)
This 13 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 20 views.
Information Retrieval: Overview and Introduction
James Allan
University of Massachusetts Amherst
CMPSCI 646, Fall 2007
Some examples copyright Dr. Seuss. All slides copyright James Allan.

Outline
- What is (and isn't) Information Retrieval
- Core idea of IR-related work
- Simple model of IR to get started

What is Information Retrieval?
[Logos: Google, Yahoo!]
- Document (Web page) retrieval in response to a query
  - Quite effective (at some things)
  - Highly visible (mostly)
  - Commercially successful (some of them)
- But what goes on behind the scenes?
  - How do they work?
  - What happens beyond the Web?

IR is not just document retrieval
- Automatic organization (e.g., clustering)
- Cross-language mechanisms
- Question answering
- Document provenance
- Agents: filtering, tracking, routing
- Recommender systems
- Leveraging XML and other metadata
- Text mining
- Novelty identification
- Personal information management
- Metasearch (multi-database searching)
- Summarization

IR is not databases

IR is not just for the web
- IR systems
  - FAST, Autonomy, Convera, Eurospider
  - Hummingbird, EMC (Documentum), A9 (Amazon)
  - Lemur, Indri, Terrier, Zettair, MG, Okapi, Smart
- Database systems
  - Oracle, Informix, Access
- Web search and in-house systems
  - West, LEXIS-NEXIS, Dialog
  - Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu
  - Ask.com (Ask Jeeves)
  - eLibrary, GOVResearchCenter, Inquira
- And countless others

In this course we ask
- What makes a system like Google, Yahoo, or Live Search tick?
  - How does it gather information?
  - What tricks does it use?
  - Extending beyond the Web
- How can those approaches be made better?
  - Natural language understanding?
  - User interactions?
- What can we do to make things work quickly?
  - Faster computers? Caching?
  - Compression?
- How do we decide whether it works well?
  - For all queries? For special types of queries?
  - On every collection of information?
- What else can we do with the same approach?
  - Other media?
  - Other tasks?

Outline
- Core idea of IR-related work
- Simple model of IR to get started

Basic Approach to IR
- Most successful approaches are statistical
  - Directly, or an effort to capture and use probabilities
- Why not natural language understanding?
  - I.e., the computer understands documents and query and matches them
  - State of the art is brittle in unrestricted domains
  - Can be highly successful in predictable settings, though
    - e.g., information extraction on terrorism/takeovers (MUC)
    - Medical or legal settings with restricted vocabulary
- Could use manually assigned headings
  - e.g., Library of Congress headings, Dewey Decimal headings
  - Human agreement is not good
  - Hard to predict what headings are "interesting"
  - Expensive

Relevant Items are Similar
- Much of IR depends upon the idea that similar vocabulary implies relevant to same queries
- Usually look for documents matching query words
- "Similar" can be measured in many ways
  - String matching/comparison
  - Same vocabulary used
  - Probability that documents arise from same model
  - Same meaning of text

Bag of Words
- An effective and popular approach
- Compares words without regard to order
- Consider reordering words in a headline
  - Random: beating takes points falling another Dow 355
  - Alphabetical: 355 another beating Dow falling points
  - "Interesting": Dow points beating falling 355 another
  - Actual: Dow takes another beating, falling 355 points

What is this about?
16 x: said
14 x: McDonalds
12 x: fat
11 x: fries
8 x: new
6 x: company, french, nutrition
5 x: food, oil, percent, reduce, taste, Tuesday
4 x: amount, change, health, Henstenburg, make, obesity
3 x: acids, consumer, fatty, polyunsaturated, US
2 x: amounts, artery, Beemer, cholesterol, clogging, director, down, eat, estimates, expert, fast, formula, impact, initiative, moderate, plans, restaurant, saturated, trans, win
1 x: added, addition, adults, advocate, affect, afternoon, age, Americans, Asia, battling, beef, bet,
  brand, Britt, Brook, Browns, calorie, center, chain, chemically, crispy, customers, cut, ..., vegetable, weapon, weeks, Wendys, Wootan, worldwide, years, York

The start of the original text
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down 0.54 to 23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down 0.80 to 34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
(Source: CNN/Money)

The Point
- Basis of most IR is a very simple approach
  - find words in documents
  - compare them to words in a query
  - this approach is very effective
- Other types of features are often used
  - phrases
  - link structure
  - named entities (people, locations, organizations)
  - special features (chemical names, product names)
    - difficult to do in general; usually require hand building
- Focus of research is on improving accuracy and speed
- ...and on extending ideas elsewhere

Outline
- Simple model of IR to get started

Simple flow of retrieval process
[Figure: retrieval flow diagram; legible labels: Text objects, Indexed objects, Retrieved objects]
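The word-matching idea from the slides above (find words in documents, compare them to words in a query, ignore word order) can be sketched roughly as follows. This is only an illustration, not code from the course: the function names `bag_of_words` and `overlap_score`, the whitespace tokenization, the second toy document, and the query are all assumptions made for the example.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count words (order is discarded)."""
    return Counter(text.lower().split())


def overlap_score(query, document):
    """Count how many query word occurrences also appear in the document."""
    q, d = bag_of_words(query), bag_of_words(document)
    # Counter & Counter keeps the minimum count of each shared word.
    return sum((q & d).values())


actual = "Dow takes another beating falling 355 points"
shuffled = "beating takes points falling another Dow 355"

# Reordering the headline leaves its bag of words unchanged.
same_bag = bag_of_words(actual) == bag_of_words(shuffled)

# Hypothetical two-document collection and query, for illustration only.
docs = {
    "headline": actual,
    "fries": "McDonalds to reduce fat in its french fries",
}
query = "Dow falling"
ranked = sorted(docs, key=lambda name: overlap_score(query, docs[name]), reverse=True)
print(same_bag, ranked)
```

Raw overlap counting is about the simplest possible matching function; it is used here only because it makes the bag-of-words assumption visible, not because the slides prescribe it.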

Statistical language model
- Document comes from a topic
- Topic (unseen) describes how words appear in documents on the topic
- Use document to guess what the topic looks like
  - Words common in document are common in topic
  - Words not in document much less likely
- Assign probability to words based on document
  - $P(w \mid \text{Topic}) \approx P(w \mid D) = \mathrm{tf}(w, D) / \mathrm{len}(D)$
[Figure: retrieval flow diagram; legible labels: Text objects, Indexed objects; the index holds estimated topics]

Example
Small document D: "One fish, two fish, red fish, blue fish. Black fish, blue fish, old fish, new fish."
- $\mathrm{len}(D) = 16$
- $P(\text{fish} \mid D) = 8/16 = 0.5$
- $P(\text{blue} \mid D) = 2/16 = 0.125$
- $P(\text{one} \mid D) = 1/16 = 0.0625$
- $P(\text{eggs} \mid D) = 0/16 = 0$

What about queries?
- Assume information need corresponds to a topic
- Assume queries are a small sample from that topic
  - Very short
  - Just a few high-probability words

Comparison
- Document came from a topic
- Did query come from this document's topic?
- For each document, find the probability its topic could have generated the query $Q = q_1 \ldots q_k$
  - $P(Q \mid T_D) \approx P(Q \mid D) = P(q_1, \ldots, q_k \mid D) = \prod_{i=1}^{k} P(q_i \mid D)$
  - The last step is an independence assumption (cf. naive Bayes)

Example
"This one, I think, is called a Yink. He likes to wink, he likes to drink. He likes to drink and drink and drink. The thing he likes to drink is ink. The ink he likes to drink is pink. He likes to wink and drink pink ink."
[Figure: the passage split into example documents (labels D and D2 legible) with a table of values; remaining content illegible]

What does LM look like implemented?
- Hypothesis of statistical language model
  - Documents with highly probable topic models are more likely to be relevant
- Index collection in advance
  - Convert documents into set of $P(t_i \mid D)$
  - Store in an appropriate data structure for fast access
- Query arrives
  - Convert it to set $P(q_i \mid D)$
  - Calculate $P(Q \mid T_D)$ for all documents
  - Sort documents by their topic's probability
  - Present ranked list

Is that how Google works?
- No
- Google is more closely related to the vector space model
  - Simpler formal model
  - Can more readily accommodate a range of
    features
  - No justification required by model for features or how used
- But intuition is similar

Some issues that arise in IR
- Text representation
  - what makes a "good" representation?
  - how is a representation generated from text?
  - what are retrievable objects and how are they organized?
- Representing information needs
  - what is an appropriate query language?
  - how can interactive query formulation and refinement be supported?
- Comparing representations
  - what is a "good" model of retrieval?
  - how is uncertainty represented?
- Evaluating effectiveness of retrieval
  - what are good metrics?
  - what constitutes a good experimental test bed?

Topics this class might cover
- Language model
- Cross-language
- Vector space model
- OCR documents
- Probabilistic model
- Spoken documents
- Indexing
- Multimedia documents
- Compression
- Distributed
- Efficiency
- Peer-to-peer
- Clustering
- Statistics of text
- Interaction
- Evaluation
- Relevance feedback
- Significance tests
- Filtering
- Web search

Conclusion
- Information Retrieval
  - Indexing, retrieving, and organizing text by probabilistic or statistical techniques that reflect semantics without actually understanding
  - Search engines and more
- Core idea
  - Bag of words captures much of the "meaning"
  - Objects that use vocabulary the same way are related
- Statistical language model
  - Documents used to estimate a topic model
  - Query reflects a topic, too
  - Documents of topics that are likely to produce the query are most likely to be relevant
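The statistical language model summarized above (estimate word probabilities from each document, then rank documents by how likely they are to produce the query) might be sketched as below. This is a minimal sketch under stated assumptions: the function names `word_probs`, `query_likelihood`, and `rank` are hypothetical, tokenization is plain whitespace splitting, the second toy document is invented for the example, and the estimate is the unsmoothed tf/len ratio from the fish example, so any query word absent from a document drives that document's score to zero.

```python
from collections import Counter


def word_probs(document):
    """Estimate P(w|D) = tf(w, D) / len(D) from a whitespace-tokenized document."""
    words = document.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def query_likelihood(query, probs):
    """P(Q|D) as the product of P(q_i|D); unseen words get probability 0."""
    p = 1.0
    for q in query.lower().split():
        p *= probs.get(q, 0.0)
    return p


def rank(query, docs):
    """Index each document as P(w|D), then sort names by P(Q|D), best first."""
    index = {name: word_probs(text) for name, text in docs.items()}
    return sorted(index, key=lambda n: query_likelihood(query, index[n]), reverse=True)


fish = "one fish two fish red fish blue fish black fish blue fish old fish new fish"
probs = word_probs(fish)
print(probs["fish"], probs["blue"], probs["one"], probs.get("eggs", 0.0))

# Hypothetical two-document collection, for illustration only.
docs = {"fish": fish, "yink": "he likes to wink he likes to drink"}
ranking = rank("blue fish", docs)
print(ranking)
```

Building the per-document probability tables once in `rank` mirrors the "index collection in advance" step; a real system would store them in an index rather than recompute them for every query.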