# Information Retrieval & Extraction (LING 467)

These 45 pages of class notes were uploaded by Agustina Bartoletti on Monday, October 12, 2015. The notes belong to LING 467 at Georgetown University, taught by Anthony Davis in the Fall. Since upload, they have received 67 views.

## CS276: Information Retrieval and Web Search. Lecture 11: Probabilistic Information Retrieval

### Recap of the last lecture
- Improving search results, especially for high recall. E.g., searching for "aircraft" so it matches with "plane", or "thermodynamic" with "heat".
- Options for improving results:
  - Global methods: query expansion (thesauri, automatic thesaurus generation), global indirect relevance feedback.
  - Local methods: relevance feedback, pseudo relevance feedback.

### Probabilistic relevance feedback
- Rather than reweighting in a vector space: if the user has told us some relevant and some irrelevant documents, we can build a probabilistic classifier, such as a Naive Bayes model:
  - P(t_k | R) = |D_rk| / |D_r|
  - P(t_k | NR) = |D_nrk| / |D_nr|
  - where t_k is a term, D_r is the set of known relevant documents, D_rk is the subset that contain t_k, D_nr is the set of known irrelevant documents, and D_nrk is the subset that contain t_k.

### Why probabilities in IR?
- Understanding of the user need is uncertain, and the system can only make an uncertain guess of whether a document has relevant content.
- In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms.
- Probabilities provide a principled foundation for uncertain reasoning.

### Probabilistic IR topics
- Classical probabilistic retrieval model: probability ranking principle, etc.
- Naive Bayesian text categorization.
- Bayesian networks for text retrieval.
- Language model approach to IR: an important emphasis in recent work.
- Probabilistic methods are one of the oldest, but also one of the currently hottest, topics in IR. Traditionally neat ideas, but they've never won on performance. It may be different now.

### The document ranking problem
- We have a collection of documents; the user issues a query; a list of documents needs to be returned.
- The ranking method is the core of an IR system: in what order do we present documents to the user? We want the best document first, the second best second, etc.
- Idea: rank by the probability of relevance of the document w.r.t. the information need: P(relevant | document_i, query).
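As a concrete sketch of the Naive Bayes estimates used in probabilistic relevance feedback above (the function and variable names here are illustrative, not from the notes), the conditional term probabilities can be computed directly from the two judged document sets:

```python
def term_probabilities(term, relevant_docs, nonrelevant_docs):
    """Estimate P(t_k|R) = |D_rk|/|D_r| and P(t_k|NR) = |D_nrk|/|D_nr|.

    Each document is represented as a set of terms.
    """
    d_rk = sum(1 for d in relevant_docs if term in d)      # |D_rk|
    d_nrk = sum(1 for d in nonrelevant_docs if term in d)  # |D_nrk|
    return d_rk / len(relevant_docs), d_nrk / len(nonrelevant_docs)

# Tiny example: "aircraft" occurs in both relevant docs and in no irrelevant one.
relevant = [{"aircraft", "engine"}, {"plane", "aircraft"}]
irrelevant = [{"heat", "plane"}, {"thermodynamic"}]
print(term_probabilities("aircraft", relevant, irrelevant))  # → (1.0, 0.0)
```

With real feedback data these raw ratios would normally be smoothed to avoid zero probabilities, a point the lecture returns to later.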
### Recall a few probability basics
For events a and b:
- Bayes' Rule: p(a|b) = p(b|a) p(a) / p(b), where p(b) = p(b|a) p(a) + p(b|¬a) p(¬a); the posterior for a is built from the prior p(a) and the likelihood p(b|a).
- Odds: O(a) = p(a) / p(¬a) = p(a) / (1 - p(a)).

### The Probability Ranking Principle
"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data."
- 1960s/1970s: S. Robertson, W. S. Cooper, M. E. Maron; van Rijsbergen (1979: 113); Manning & Schütze (1999: 538).

### Probability Ranking Principle: the math
Let x be a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance. We need to find p(R|x), the probability that a document x is relevant:
- p(R|x) = p(x|R) p(R) / p(x) and p(NR|x) = p(x|NR) p(NR) / p(x)
- p(R), p(NR): the prior probability of retrieving a relevant or non-relevant document
- p(x|R), p(x|NR): the probability that, if a relevant (non-relevant) document is retrieved, it is x
- p(R|x) + p(NR|x) = 1

### Probability Ranking Principle (PRP)
- Simple case: no selection costs or other utility concerns that would differentially weight errors.
- Bayes Optimal Decision Rule: x is relevant if p(R|x) > p(NR|x).
- PRP in action: rank all documents by p(R|x).
- Theorem: using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss. Provable if all probabilities are correct, etc. (e.g., Ripley 1996).

### Probability Ranking Principle: retrieval costs
- More complex case: retrieval costs. Let d be a document, C the cost of retrieving a relevant document, and C' the cost of retrieving a non-relevant document.
- Probability Ranking Principle: if C p(R|d) + C' (1 - p(R|d)) ≤ C p(R|d') + C' (1 - p(R|d')) for all documents d' not yet retrieved, then d is the next document to be retrieved.
- We won't further consider loss/utility from now on.
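The Bayes' rule and odds manipulations above can be sketched numerically (the function names are illustrative):

```python
def posterior_relevance(p_x_given_r, p_x_given_nr, p_r):
    """Bayes' rule: p(R|x) = p(x|R)p(R) / [p(x|R)p(R) + p(x|NR)p(NR)],
    with p(NR) = 1 - p(R)."""
    num = p_x_given_r * p_r
    return num / (num + p_x_given_nr * (1 - p_r))

def odds(p):
    """O(a) = p(a) / (1 - p(a))."""
    return p / (1 - p)

# If relevant documents generate x four times as often as non-relevant ones,
# a 10% prior of relevance rises to roughly 31% after observing x.
p = posterior_relevance(0.8, 0.2, 0.1)
```

Ranking by p(R|x) and ranking by O(R|x) give the same ordering, since odds are a monotone function of probability; this is why the derivation below can freely switch to odds.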
### Probability Ranking Principle: caveats
- How do we compute all those probabilities? We do not know the exact probabilities and have to use estimates.
- Binary Independence Retrieval (BIR), which we discuss later today, is the simplest model.
- Questionable assumptions:
  - The relevance of each document is independent of the relevance of other documents. (Really, it's bad to keep on returning duplicates.)
  - Boolean model of relevance: assumes a single-step information need. (Seeing a range of results might let the user refine the query.)

### Probabilistic Retrieval Strategy
- Estimate how terms contribute to relevance: how do things like tf, df, and length influence your judgments about document relevance? One answer is the Okapi formulae (S. Robertson).
- Combine the estimates to find the document relevance probability.
- Order documents by decreasing probability.

### Probabilistic Ranking
Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes' Theorem, it is possible to derive weights theoretically." (van Rijsbergen)

### Binary Independence Model
- Traditionally used in conjunction with the PRP.
- "Binary" (= Boolean): documents are represented as binary incidence vectors of terms (cf. lecture 1): x = (x_1, ..., x_n), where x_i = 1 iff term i is present in document x.
- "Independence": terms occur in documents independently.
- Different documents can be modeled as the same vector.
- This is the Bernoulli Naive Bayes model (cf. text categorization).

### Binary Independence Model
- Queries are binary term incidence vectors too.
- Given query q, for each document d we need to compute p(R|q,d). Replace this with computing p(R|q,x), where x is the binary term incidence vector representing d; we are interested only in ranking.
- Use odds and Bayes' Rule:
  O(R|q,x) = p(R|q,x) / p(NR|q,x) = [p(R|q) p(x|R,q) / p(x|q)] / [p(NR|q) p(x|NR,q) / p(x|q)]
- So O(R|q,x) = O(R|q) · p(x|R,q) / p(x|NR,q), where O(R|q) = p(R|q) / p(NR|q) is constant for a given query.
- Using the independence assumption:
  p(x|R,q) / p(x|NR,q) = ∏_{i=1..n} p(x_i|R,q) / p(x_i|NR,q),
  so O(R|q,d) = O(R|q) · ∏_{i=1..n} p(x_i|R,q) / p(x_i|NR,q).
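The binary incidence representation the model relies on can be sketched in a few lines (the vocabulary and its ordering are an assumption for illustration):

```python
def incidence_vector(doc_terms, vocabulary):
    """Binary term incidence vector: x_i = 1 iff term i is present in the
    document. Different documents with the same term set map to the same
    vector, as the BIM assumes."""
    return tuple(1 if t in doc_terms else 0 for t in vocabulary)

vocab = ["aircraft", "heat", "plane", "thermodynamic"]
x = incidence_vector({"plane", "aircraft"}, vocab)  # → (1, 0, 1, 0)
```

Note what is lost: term frequency and document length disappear, which is exactly the "binary word-in-doc weights" disadvantage listed later in these notes.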
### Binary Independence Model
Since each x_i is either 0 or 1:

O(R|q,d) = O(R|q) · ∏_{x_i=1} [p(x_i=1|R,q) / p(x_i=1|NR,q)] · ∏_{x_i=0} [p(x_i=0|R,q) / p(x_i=0|NR,q)]

Let p_i = p(x_i=1|R,q) and r_i = p(x_i=1|NR,q). Assume that for all terms not occurring in the query (q_i = 0), p_i = r_i; this can be changed, e.g., in relevance feedback. Then:

O(R|q,x) = O(R|q) · ∏_{x_i=q_i=1} (p_i / r_i) · ∏_{x_i=0, q_i=1} [(1 - p_i) / (1 - r_i)]
(first product: matching terms; second: non-matching query terms)

= O(R|q) · ∏_{x_i=q_i=1} [p_i (1 - r_i)] / [r_i (1 - p_i)] · ∏_{q_i=1} [(1 - p_i) / (1 - r_i)]
(first product: all matching terms; second: all query terms)

The second product is constant across documents, so the only quantity to be estimated for ranking is the Retrieval Status Value:

RSV = log ∏_{x_i=q_i=1} [p_i (1 - r_i)] / [r_i (1 - p_i)] = Σ_{x_i=q_i=1} log { [p_i (1 - r_i)] / [r_i (1 - p_i)] }

So everything boils down to computing RSV = Σ_{x_i=q_i=1} c_i, where c_i = log { [p_i (1 - r_i)] / [r_i (1 - p_i)] }.

### Binary Independence Model: estimating the RSV coefficients
For each term i, look at this table of document counts:

| Documents | Relevant | Non-relevant | Total |
|---|---|---|---|
| x_i = 1 | s | n - s | n |
| x_i = 0 | S - s | (N - n) - (S - s) | N - n |
| Total | S | N - S | N |

Estimates (for now, assume no zero counts; more next lecture):
- p_i ≈ s / S
- r_i ≈ (n - s) / (N - S)
- c_i ≈ K(N, n, S, s) = log { [s / (S - s)] / [(n - s) / ((N - n) - (S - s))] }

### Estimation: the key challenge
- If non-relevant documents are approximated by the whole collection, then r_i (the probability of occurrence in non-relevant documents for the query) is n/N, and log [(1 - r_i) / r_i] = log [(N - n) / n] ≈ log (N / n) = IDF!
- p_i (the probability of occurrence in relevant documents) can be estimated in various ways:
  - from relevant documents, if we know some (relevance weighting can be used in a feedback loop);
  - as a constant (Croft and Harper combination match): then we just get idf weighting of terms;
  - proportional to the probability of occurrence in the collection, or, more accurately, to the log of this (Greiff, SIGIR 1998).

### Iteratively estimating p_i
1. Assume p_i is constant over all x_i in the query: p_i = 0.5 (even odds) for any given document.
2. Determine a guess of the relevant document set: V is a fixed-size set of the highest-ranked documents on this model (note: now a bit like tf.idf!).
3. Improve the guesses for p_i and r_i using the distribution of x_i in the documents in V. Let V_i be the set of documents in V containing x_i: p_i = |V_i| / |V|. Assume that documents not retrieved are not relevant: r_i = (n_i - |V_i|) / (N - |V|).
4. Go to 2 until the ranking converges, then return it.
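The RSV computation from the contingency-table estimates above can be sketched as follows. The formulas follow the slide (including the assumption of no zero cells, i.e. no smoothing); the function names and the `stats` layout are my own:

```python
import math

def rsv_coefficient(s, n, S, N):
    """c_i = log [p_i (1 - r_i)] / [r_i (1 - p_i)], with the table estimates
    p_i = s/S and r_i = (n - s)/(N - S); assumes no zero cells."""
    p = s / S
    r = (n - s) / (N - S)
    return math.log((p * (1 - r)) / (r * (1 - p)))

def rsv(doc_terms, query_terms, stats, S, N):
    """Sum c_i over terms that occur in both the document and the query.
    stats maps term -> (s, n): counts of relevant docs and of all docs
    containing the term; S, N are the relevant and total document counts."""
    return sum(rsv_coefficient(*stats[t], S, N)
               for t in doc_terms & query_terms if t in stats)

# S = 10 relevant docs out of N = 100; "aircraft" occurs in 8 relevant, 20 total.
stats = {"aircraft": (8, 20)}
score = rsv({"aircraft", "engine"}, {"aircraft"}, stats, S=10, N=100)  # log 26
```

Worked through by hand: p = 0.8, r = 12/90, so c = log[(0.8 · 13/15) / ((2/15) · 0.2)] = log 26 ≈ 3.26, matching the code.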
### Probabilistic Relevance Feedback
1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents V, as above.
2. Interact with the user to refine the description: learn some definite members of R and NR.
3. Re-estimate p_i and r_i on the basis of these. (Or combine the new information with the original guess, using a Bayesian prior: p_i^(2) = (|V_i| + κ p_i^(1)) / (|V| + κ), where κ is the prior weight.)
4. Repeat, generating a succession of approximations to R.

### PRP and BIR
- Getting reasonable approximations of the probabilities is possible, but it requires restrictive assumptions:
  - term independence;
  - terms not in the query don't affect the outcome;
  - Boolean representation of documents/queries/relevance;
  - document relevance values are independent.
- Some of these assumptions can be removed.
- Problem: we either require partial relevance information, or can only derive somewhat inferior term weights.

### Removing term independence
- In general, index terms aren't independent, and the dependencies can be complex.
- van Rijsbergen (1979) proposed a model of simple tree dependencies: exactly Friedman and Goldszmidt's Tree Augmented Naive Bayes (AAAI 13, 1996), in which each term depends on at most one other.
- In the 1970s, estimation problems held back the success of this model.

### Food for thought
- Think through the differences between standard tf.idf and the probabilistic retrieval model in the first iteration.
- Think through the differences between vector space (pseudo) relevance feedback and probabilistic (pseudo) relevance feedback.

### Good and Bad News
- Standard vector space model: empirical for the most part (success measured by results); few properties provable.
- Probabilistic model advantages: based on a firm theoretical foundation; a theoretically justified optimal ranking scheme.
- Disadvantages: making the initial guess to get V; binary word-in-doc weights (not using term frequencies); the independence of terms (can be alleviated); the amount of computation; and it has never worked convincingly better in practice.

### Bayesian Networks for Text Retrieval (Turtle and Croft 1990)
- The standard probabilistic model assumes you can't estimate P(R|D,Q); instead, it assumes independence and uses P(D|R).
- But maybe you can, with a Bayesian network.
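The Bayesian-prior combination step in the probabilistic relevance feedback procedure above can be sketched directly (the function name is illustrative; κ is the prior weight):

```python
def updated_p(v_i, v, p_prev, kappa=5.0):
    """p_i^(2) = (|V_i| + kappa * p_i^(1)) / (|V| + kappa): blend the
    feedback counts (v_i docs containing the term, out of v judged docs)
    with the previous estimate p_prev."""
    return (v_i + kappa * p_prev) / (v + kappa)

# 3 of 10 feedback docs contain the term; the previous estimate was 0.5.
p2 = updated_p(3, 10, 0.5)  # (3 + 2.5) / 15 ≈ 0.367
```

Larger κ trusts the prior estimate more; with no feedback documents at all, the update simply returns the previous estimate unchanged.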
### What is a Bayesian network?
- A directed acyclic graph.
- Nodes: events or variables, which take on values; for our purposes, all Boolean.
- Links: model direct dependencies between nodes.
- Bayesian networks model causal relations between events (e.g., propositions/events a → b → c).

### Inference in Bayesian nets
- Given probability distributions for the root nodes and the conditional probabilities, we can compute the a priori probability of any instance.
- Fixing assumptions (e.g., "b was observed") will cause belief updating in the rest of the network.

### Toy example
Finals (f) causes No Sleep (n); Finals and Project Due (d) together cause Gloom (g); Gloom causes Triple Latte (t). Conditional probability tables, reconstructed from the slide:

| | f | ¬f |
|---|---|---|
| n | 0.9 | 0.3 |
| ¬n | 0.1 | 0.7 |

| | f, d | f, ¬d | ¬f, d | ¬f, ¬d |
|---|---|---|---|---|
| g | 0.99 | 0.9 | 0.8 | 0.3 |
| ¬g | 0.01 | 0.1 | 0.2 | 0.7 |

| | g | ¬g |
|---|---|---|
| t | 0.99 | 0.1 |
| ¬t | 0.01 | 0.9 |

### Independence assumptions
- Independence assumption: P(t | g, f) = P(t | g).
- Joint probability: P(f, d, n, g, t) = P(f) P(d) P(n|f) P(g|f,d) P(t|g).

### Chained inference
- Evidence: a node takes on some value.
- Inference: compute beliefs (probabilities) of other nodes, conditioned on the known evidence. There are two kinds of inference: diagnostic and predictive.
- Computational complexity: for a general network, inference is NP-hard; tree-like networks are easily tractable. There is much other work on efficient exact and approximate Bayesian network inference: clever dynamic programming, and approximate inference such as loopy belief propagation.

### Model for text retrieval
- Goal: given a user's information need (the evidence), find the probability that a document satisfies the need.
- Retrieval model: model the documents in a document network; model the information need in a query network.

### Bayesian nets for IR: idea
- A document network is connected to a query network, which ends in a single "information need" goal node.
- Construct the document network once. For each query, construct the best query network and attach it to the document network. Find the subset of documents which maximizes the probability value of the goal node (the best subset), and retrieve these documents as the answer to the query.
- The layers run: documents → document concepts → query concepts → query operators (AND/OR/NOT) → information need.
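The joint probability factorization for the toy network above can be checked in code. The CPT values follow the slide; the priors P(f) and P(d) are not given in the notes, so the values below are assumptions for illustration:

```python
# Conditional probability tables from the toy example.
P_N = {True: 0.9, False: 0.3}                    # P(n=true | f)
P_G = {(True, True): 0.99, (True, False): 0.9,   # P(g=true | f, d)
       (False, True): 0.8, (False, False): 0.3}
P_T = {True: 0.99, False: 0.1}                   # P(t=true | g)

def bern(p_true, value):
    """Probability of a Boolean variable taking `value`, given P(True)."""
    return p_true if value else 1.0 - p_true

def joint(f, d, n, g, t, p_f=0.5, p_d=0.5):      # p_f, p_d: assumed priors
    """P(f, d, n, g, t) = P(f) P(d) P(n|f) P(g|f,d) P(t|g)."""
    return (bern(p_f, f) * bern(p_d, d) * bern(P_N[f], n)
            * bern(P_G[(f, d)], g) * bern(P_T[g], t))
```

Summing `joint` over all 32 Boolean assignments gives 1.0, a quick sanity check that the factorization defines a proper distribution; conditioning on evidence (e.g., t observed true) is then a matter of summing and renormalizing over the remaining variables.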
### Link matrices and probabilities
- Prior document probability: P(d) = 1/n.
- P(c|d): a 1-to-1 mapping, a thesaurus, or within-document term frequency; tf.idf-based.
- P(q|c): canonical forms of the query operators. Always use things like AND and NOT; never store a full CPT (conditional probability table).
- (The slide also sketches a small example with terms such as "reason" and "trouble" linking a document network to a query network.)

### Extensions
- Prior probabilities don't have to be 1/n.
- The user's information need doesn't have to be a query: it can be words typed in, documents read, any combination.
- Phrases; inter-document links.
- Link matrices can be modified over time: user feedback, and the promise of personalization.

### Computational details
- The document network is built at indexing time; the query network is built and scored at query time.
- Representation: link matrices from documents to any single term are like the postings entry for that term; canonical link matrices are efficient to store and compute.
- Attach evidence only at the roots of the network; then a single pass from roots to leaves suffices.

### Bayes nets in IR
- Flexible ways of combining term weights, which can generalize previous approaches: the Boolean model, the binary independence model, and probabilistic models with weaker assumptions.
- Efficient large-scale implementation: the InQuery text retrieval system from U. Mass (Turtle and Croft 1990; the commercial version is defunct).
- Need approximations to avoid intractable inference, and need to estimate all the probabilities by some means, whether more or less ad hoc.
- Much new Bayes net technology is yet to be applied.

### Resources
- S. E. Robertson and K. Spärck Jones. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Sciences 27(3): 129–146.
- C. J. van Rijsbergen. 1979. Information Retrieval, 2nd ed. London: Butterworths, chapter 6. (Most details of the math.) http://www.dcs.gla.ac.uk/Keith/Preface.html
- N. Fuhr. 1992. Probabilistic Models in Information Retrieval. The Computer Journal 35(3): 243–255. (Easiest read, with BNs.)
- F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. "Is This Document Relevant? ... Probably": A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4): 528–552. http://www.acm.org/pubs/citations/journals/surveys/1998-30-4/p528-crestani (Adds very little material that isn't in van Rijsbergen or Fuhr.)
- H. R. Turtle and W. B. Croft. 1990. Inference Networks for Document Retrieval. Proc. ACM SIGIR: 1–24.
- E. Charniak. 1991. Bayesian nets without tears. AI Magazine 12(4): 50–63. http://www.aaai.org/Library/Magazine/Vol12/12-04/vol12-04.html
- D. Heckerman. 1995. A Tutorial on Learning with Bayesian Networks. Microsoft Technical Report MSR-TR-95-06. http://www.research.microsoft.com/~heckerman
- N. Fuhr. 2000. Probabilistic Datalog: Implementing Logical Information Retrieval for Advanced Applications. Journal of the American Society for Information Science 51(2): 95–110.
- R. K. Belew. 2001. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge UP. (MIR 2.5.4, 2.8.)
