Class Note for CMPSCI 646 at UMass(3)
Class Note for CMPSCI 646 at UMass(3)
Popular in Course
Popular in Department
This 24 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 18 views.
Reviews for Class Note for CMPSCI 646 at UMass(3)
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 02/06/15
Information Retrieval Information Retrieval of Web pages James Allan University of Massachusetts Amherst CMPSCI 646 Fall 2007 All slides copyright James Allan What makes Google work well According to them 7 The heart of our software is PageRankTM a system for ranking web I aes develo ed b our founders And While we have dozens of engineers working to improve every aspect of Google on a daily basis PageRank continues to play a central role in many of our Web SeaI39Ch tomsH httpwww google corntechnology 7 Look at 2002 version for a hint of what we ll find continues to provide the basis for all of our web search toolsquot CMPSCI 545 Copynghl James Allan And what is PageRank Paraphrased from Google s press A link from page S to page T is a vote by S for T 7 uniquely democratic nature of the W Votes from better important by PageRank pages count more Important highquality sites receive a higher PageRank PageRank combined with sophisticated text matching techniques 7 far beyond the number of times a term appears on a page 7 dozens of aspects of the page s content 7 and the contents of the pages linking to it CMPSCI 545 Copynghl James Allan Stepping back LinkBased Retrieval Use structure of the Web to identify popularity or authority Similar to citation analysis that started in 1970 s 7 ldentifing authoritative authors articles 7 Only similar Web links are not citations Intuition is that sites with many inlinks or backlinks are more popular Simple counting can lead to problems 7 Which links 7 Are all links the same 7 Spamming Kleinberg s HITS algorithm started new era of Web link analysis Google s PageRank algorithm popularized the approach CMPSCI 545 Copynghl James Allan HITS algorithm Hypertext Induced Topic Search 7 Authoritative Sources in a Hyperlinked Environment Jon Kleinberg 1997 Developed as part of IBM s Clever search project An effort to extract useful information out of the link structure of the Web HITS and PageRank developed at about the same time 7 HITS stayed mostly a research project 7 PageRank was incorporated in a commercial product 7 More information available about HITS CMPSCI 646 Copyright James Allan Basic idea behind HITS Start with a query Use a search engine to get the top ranked set of documents 7 Call that the root set That set is likely to be relevant but may miss important pages that do not contain the search term Expand to base set by adding 7 Pages pointed to by pages in root set 7 Pages that point to pages in the root set 7 But limit the number of each so that no page brings in too many additional pages CMPSCI 646 Copyright James Allan HITS cont Two classes of links 7 Transverse links cross domains from umassedu to ibmcom 7 Intrinsic links are domain eg csumassedu Delete intrinsic links 7 Hypothesis is that most of those are navigation links up home Could apply other heuristics 7 Look for potential spam linking created by collusion 7 Strip out stoplinks bestviewed by Use resulting graph to nd interesting sites CMPSCI 545 Copynghl James Allan Hubs and Authorities Basic idea is to consider recommended pages 7 These are likely to be authoritative and recommending pages 7 These may be hubs pointing to a good set of material CMPSCI 545 Copynghl James Allan CMFSCI 646 Iterative algorithm Maintain arrays of weights e Xp is authority wt of page p e Yp is huo weight ofpage p 2 Maintain invariant on each 0 9quot FlemingWWW 3 O Oqi 03912 V xiii2 V Initially give all pages equal weight Iterate ad nauseum e Propagate huo weight to affect authoht Weight 7 Propagate authority weight to affect hub weights 7 Rehorrhaiize vectors Showed it will always converge 7 But 20 iterations is usually enough 0 use yini 3 nuts iaiairiuiiiia Iahyv Copyrigth IamzsAllan CMFSCI 646 Google s Page Rank A page has high rank if the sum of the ranks of its backlinks is high 7 ihciudes both rhahy backlii iks and a few highly rahilted backlii iks Mu RV Where 7 u is a web page 7 a is the set of pages u points to forward Ill ikS not used here e 5 is the set ofpages that point to u backlii iks 7 Nu i BI is the Humberoflll ik from u e cis a factorused for horrhaiizatioh total rahilt ofaii web pages is constant Recursive formula Can be computed by starting with any set of ranks and iterating until convergence Very similar to half of HITS Copyrigth IamzsAllan Rank Propagation Start with link structure Pages have some starting rank value 0 Split it among outgoing links 7 Equal votes for each link Sum weights of incoming 3 links to get new rank ofa Page 100 53 50 o w 50 Repeat ad nauseum CMPSCI 545 Copynghl James Allan PageRank Can also be formulated as an eigenvector problem 7 R cAR where R is a vector over web pages AW 1Nu if there is a link from u to vand 0 if not Simple algorithm has a problem with rank sinks 7 Imagine a page that only points to itself 7 And has at least one other page pointing to it 7 Solve by creating a source of rank E to keep adding rank in Random surfer model Other issues 7 dangling links to unavailable sites 7 intrasite and navigation links 7 stoplinks eg Amazon Internet Explorer Could apply heuristics mentioned in HITS CMPSCI 545 Copynghl James Allan Searching with PageRank Google uses a number of factors to rank search results including 7 tfidf content search 7 Proximity of query terms in page 7 Location of query terms within page 7 Anchor text from links to the page 7 PageRank Standard IR scores and PageRank scores are merged What is value of each component Trade secret CMPSCI 545 Copynghl JamesAllan TREC Web tracks Considering Web documents since 1997 Initial questions were largescale retrieval 7 Very large corpus VLC track 7 Same problems as adhoc retrieval but larger corpus e 1998 100Gb collection Of cially a Web track starting 1998 7 Initially typical adhoc retrieval 19982000 7 Movement toward sitefinding queries 2000 No bene t to linkbased measures in any ofthese 7 In 2001 add homepage finding links finally help In 2002 evaluate Webspeci c search tasks 7 Topic distillation e Named page finding 7 23 research groups participated Continued through 2004 then morphed to ef ciency CMPSCI 545 Copynghl JamesAllan TREC 2002 Topic distillation task Task 7 Put together a short but complete list of pages that are a good resource for a topic 7 Topics are similarto adhoc topics from the past Corpus 7 January 2002 crawl of gov domain 7 125 million pages approximately 18Gb Includes extracted text from PDF P3 and Word documents Images not included in corpus but were available Pages les bigger than 100Kb were discarded else it would be SSGb Judgments on task 7 Is page a good key resource or not 7 NB Page may be a good resource but not relevant per se Evaluation is by precision at 10 documents returned 7 Emphasize conciseness CMPSCI 545 Copynght James Allan Topic distillation what is it Premise some quality beyond relevance is important for Web searching 7 Authority quality definitiveness etc 7 Want to strike balance between quality and relevance Look for key pages 7 One worthy of appearing in a short list of important URLs Compare what Yahoo or DMOZ lists have to decide upon 7 Selected pages will be relevant and high quality 71 submissions from 17 groups 7 56650 pages judged for relevancequality 7 1574 of those judged to be key pages CMPSCI 545 Copynght James Allan Queryindependent properties Look at indegree and length of URL Do the redict likelihood that URL would be chosen for inclusion in Yahoo and DMOZ Graph is an ROC curve 7 False alarm on x axls e Recallorl Years 7 Good ls upperlell Implications 7 Long URLs less valuable 7 Pages llrlllted lo often are valuable rm pmmves cMPsci A CapyngthhmesA an r l Falsc r oslllvcs True Dmlllves Queryindependent properties us M J ls n4 3 i ll U 2 ll w 7 H lll 39 ll l litre 0 02 Cl 06 la 1 Hll l M quot7 HA False pawlas False puslllves Directory pages Key pages Not as predictive CMFSCI 646 CapyngthJmesA an Distribution of key pages per topic Fiequem i 4 2 i iiiiHiHFWiF w Some have quite at few a ma isn 2m Nimmei reiwm uer tonic CMPSCI 646 Copyright James Allan Distillation results Top 8 groups by Angreo10 of best run 03 A O H V E 2 02 VI 3923 D L 0 g 01 6 L a gt lt 0 E E E E 2 2 i N S t m s 2 S quot 2 7 f in S 39E E D a g CMPSC1646 Copynght 1ames Allan 10 Conclusions of results Anchor text useful Outdegree not useful Siteuniting helped with avg prec but not prec10 Content based retrieval can work as well Eliminating documents wo query words in title helped Removing nearduplicates by similarity hurt Removing true duplicates was important Query expansion hurt PageRank hurt Link analysis helped CMPSCI 646 Copy ght James Allan City University London Used standard retrieval system with Okapi weighting K1 impacts importance of term frequency B affects importance of document length 75ft K1 1 DLZ39 tf2Kl 1 BBZjDLj N no 2 mm Tuned parameters for this task K1 15 normal value B 08 heavy emphasis on document length Results were second best Prec10 0241 Angrec 0190 If set B 02 less emphasis on document length Prec10 0200 Angrec 0143 Conclusion Seems that elaborate processing of Web data is not critical Runs counter to perceived wisdom CMPSCI 646 Copy ght James Allan 11 Glasgow University Use an absorbing model for PageRanklike results Static absorbing model Calculate authority score Tor page Combine it with content score using trained parameter RSV 10 RSVd 01 authSAM Dynamic absorbing model Applied to topranked pages set B Assume top of that set A is most authoritative Break links from A to BA but not from BA to A Calculate authority score for each set Note that A s scores are boosted by links from BA Scores in BA do not get any of credit from set A Maintains authoritativeness of verytop ranked pages ClVlPSCI 646 Copyright James Allan Glasgow cont Spreading activation A portion of a page s score is propagated from its children Here done only when pages are in the same domain And only when source of link child is deeper within the domain Rsvcg RSVd 5 Z RSVS SES Querybiased spreading activation Be more cautious with activation for ambiguous queries Calculate specificity of query 39 Based partly on posmon OT query words In VVordNet hierarchy Change 3 to be the query scope value RSV RSVd queryscope Z RSVS SES ClVlPSCI 646 Copyright James Allan 12 Glasgow in implementation Static absorbing modeling 7 Retrieve top 1000 pages 7 Rerankthem to incorporate authority score Dynamic absorbing model 7 Retrieve top 1000 pages 7 Reranktop 50 Le B50 7 Assume top 10 are high quality ie A10 7 Numbers small recall that precision at 10 is official measure Spreading activation 7 Give a wei ht of 01 to the linked ages Other indexing 7 Double weight of terms appearing in titles 7 Add in anchor text from all pages that link to this page Can treat differentially or as if it s all one bag of words CMPSCI 545 Copynghl James Allan Glasgow results Content of pages is most important part of process 7 Contentonly scored 02694 compared to 02224 for best other run Can make DAM provide slightly better results 7 02776 on posthoc evaluation run 7 Setting A10 was too small results improved with A20 7 PageRank consistently underperforms DAM Use of anchor and title text was detrimental Querybiased spreading activation betterthan static 7 But both we muse than not using it at all Query expansion hurt effectiveness CMPSCI 545 Copynghl James Allan 13 IBM Haifa Knowledge Agents Maintain a knowledge base KAB 7 Stores information about a domain userdetermined 7 Main pages in the KAB Root pages from search engines Pages inserted by the user Seed pages relevant pages KAB size is limited so only best pages are maintained Page s score updated everytime a search occurs Longlived KABs are more stable 7 Also satellite pages Pages that they point to and that point tothem Provides anchortext information 7 Frequent terms in those pages and terms that cooccur with them CMPSCI 545 Copynghl James Allan IBM Haifa cont Web task is opportunity to explore capability for creating the initial KAB 7 No user interaction in the C evaluation Given query build up KAB automatically 7 Form root set by running a search on gov pages 7 Get the forward outlinks and backward inlinks sets 7 Compute topology score weight the links Links from anchor text that is similar to query heavily weighted Links that cross the KAB boundary weighted more Connect domaincentral page KAB to queryfocused page Run PageRanklike algorithms to find hub and authority scores 7 Combine content score and topology score CMPSCI 545 Copynghl JamesAllan 14 IBM Haifa other techniques Site compression 7 Do not permit more than three results from the same logical sitequot to appear in top 10 7 Logical site is basically everything between www and gov Title filtering 7 Pages with no query words in title considered frail 7 Three frail pages with lowest scores in KAB replaced by three non frail pages outside the KAB with highest scores Duplicate elimination r Thresholdbased vectorspace similarity 7 Only applied to top 10 pages because of effectiveness measure 7 Applied a er the previous two lters CMPSC1646 Cupyngqi James Allan Breaking down effects What is impact of site compression title filter and duplicate elimination Is the KA dynamic algorithm better than PageRank static 0 3 0 25 S o 2 7 015 7 n39 01 e 0 us O Basic sc TFK TFi TFI10 DE l I Number of pages replaced CMPSC1646 cupyngm James Allan 15 What s up with duplicate elimination Seems to be an error in evaluation Duplicate pages all considered equal in value 7 They re all relevant after all So getting three duplicates ofthe same page in top 10 is better than getting only one Except that evaluation specification said 7 Penalizing duplication If your top ten consists of ten pages from the same site judges might only reward the home page Suggests that duplication is bad 7 But evaluation doesn t seem to honor that 7 So removing duplicates of a key page can hurt 7 Though other sites found that removing duplicates was helpful Perhaps because removing nonkey duplicates is critical CMPSCI 545 Copynght James Allan Named page finding Generalization of homepage nding from 2001 Retrieve ranked list oftop 50 pages for 150 requests Topic is a single phrase 7 US password renewal 7 Child labor stamp Evaluation by MRR of rst correct page 7 Mean reciprocal rank lfr is rank of rst correct page get a score of 1r Judgments 7 lor 2 topics round 5 correct pages 7 For 16 topics found 2 correct pages 7 For other 132 found exactly 1 correct page CMPSCI 545 Copynght James Allan 16 Named page finding results Maori Reciprocal Ron lk Dquot U H Cl H 1 m J 1 LL Ci 1 3339 EL a m E EL U C m z 3 J L m E Z E E 3 m E c E 3 gain 3 j E CNIPSCI 646 Copyright James Allan 3 0 topicss with no page found Glasgow University named pages Retrieve top 1000 pages Rerank them based on whether page contains the query as a phrase 13 if page contains phrase RSVC TRSVd 7 Z 10 else Run DAM with smaller sets B10 A5 Here goal is to find a single named page Try various ways of using title and anchor text CNIPSCI 646 Copyright James Allan 17 Glasgow results DAM resulted in slight improvement over content 7 Different is marginal suggestive only Anchor text signi cantly improved reciprocal precision 7 With anchor text 0654 85 named page in top 10 missed 9 7 Without anchor text 0552 71 in top 10 and 15 missing 7 Recall that it did not help with the topic distillation task Query expansion hurts here too Interesting to note different tasks yield to different approaches 7 Not discussed in detail but also used different core retrieval measures to find RSV CMPSCI 545 Copynghl James Allan Web searching summary URL depth and inlink count both useful 7 At least for predicting Whether a link will appear in a directory PageRank and HITS are effective Anchor text important 7 Very valuable for named page finding 7 Less clearly valuable for topic distillation key pages No evidence any ofthat is useful for content search 7 Question are topic distillation and named page finding sufficient PaieRank is onl a small part of what Goo le does 7 Don t be fooled by their PR Following Web track and its ilk work has con rmed results CMPSCI 545 Copynghl James Allan 18 Evaluation for Web Moving toward NDCD Jarvelin and Kekaloinen T015 2002 What makes a better search result Cumulative gain assumptions 7 Highly relevant documents better than marginally relevant ones 7 The lower the rank of a relevant document the less useful it is Comparison to recallprecision measures 7 Binary vs graded relevance 7 Rank assumption consistent though will be treated differently CMPSCI 545 Copynghl JamesAllan Cumulative gain Assume graded relevance 7 Here 0 for nonrelevant 1 2 or 3 for highly relevant ln ranked list replace document with its relevance score 7 G lt3 2 3 0 0 1 2 2 3 0 gt 7 Several good documents two nonrel one marginal Cumulative gain at rank i is sum ofvalues 7 com G1 7 com cop 11 Gi for igt1 CGlt3 5 8 8 8 9 11 13 16 16 gt CMPSCI 545 Copynghl JamesAllan 19 Discounted cumulative gain Value is constant regardless of rank Want to provide diminishing value Use log of the rank base b 7 DCGi com for iltb 7 DCGi DCGi1 Gi logbi else Comparison 7 G lt3 2 3 0 0 1 2 2 3 0 gt 7 CG lt3 5 8 8 8 9 11 13 16 16 gt 7 DCG b2 35 we we w W me 3M 3M Why that discounting function How to choose b CMPSCI 545 Copynghl James Allan Incorporating some notion of recall Consider ideal gain vector lfwe know relevance score for every document 7 Sort gain vector by score with highest first 7 Eg with 3 highly rel 3 less rel and 4 marginally rel lt3 3 3 2 2 2 1 1 1 10 0 0 0 gt Given that could calculate ideal CG and DCG 7 CG1lt369111315161718191919 gt 7 DCG1 lt36 789 889 975 1052 1088 1121 1153 1183 1183 1183 gt CMPSCI 545 Copynghl James Allan 20 Normalized DCG DCG and CG are difficult to compare across queries 7 A score of 105 might be good for one query and bad for another Address by normalizing wrt ideal score 7 Normalize each entry by ideal possible entry Examples 7 DCG lt3 5 689 689 728 799 866 961961 gt 7 DCG lt36 789 889 975 1052 1088 1121 1153 1183 1183 1183 gt 7 NDCG lt1 083 087 078 075 076 080 086 083 083 gt CMPSCI 545 Copynghl James Allan Some experiments TREC7 adhoc retrieval task 7 Used 20 topics 7 Used submitted runs of 5 manual systems 7 Nonbinary relevance judgments for those runs Four levels Graded relevance 0 110 100 Binary conversion any degree if relevance is relevant only highly relevant is relevant Used lo39 base 2 10 showed similar results CMPSCI 545 Copynghl James Allan 21 Discounted cumulative gain aioiii u A hn lllrllm L O D 300 E IDEAI A 500 B 39 o n g 400 i inmi 8 30 r 200 HH 0 0 20 30 40 SH 6U 70 W 90 WU Rimk D Fig2km Discountedcum1ac2dgainDCGcurvesbmzu39yweighung n H 2n 31 40 5 on 7k K0 00 m Rank Fig 2m Discuunted cumulatmi gain DCGV wives nonbinmy weighning CMPSCI 646 Copyright James Allan inioiiui u on lnHIlII AIS c 9 F 04 g i0 n m an H m 50 m m XI no mu Rank Fig in Normalized Liscunnbed cumulated gain mom curves nonhinary weighting u in 10 m 40 lt0 an 70 EU on ma Rank Fig 4m Normalized discounted cumuiacgci gain nDCG cums binary weighting CMPSCI 646 Copyright James Allan 22 Singlenumber comparisons Average values over ranks 1 to 200 of a topic Average values over topics Can do statistical signi cance testing Table 11 nl DCG Averages ovex39Topics and Smuisdcsll Signi cance cite Results for Five mace runs c D E 5m Sign 11CG lCHrlrll uDCG 10717171 0293 0313 0343 0257 0335 0331 nCG l01e10e100l nDCG 10717107100 0313 0316 0342 DEgtA 0236 0279 0247 DgtA uCG 0707071 11DCGlOlt0r0717 0300 0309 0320 0223 0259 0224 a N p a 0 01 139 o 05 amount we CMPSCI A Cowhi hmzsmlm Variations of NDCG Use logrank1 7 Means the log l problem neveroccurs Ignore base of log 7 Boosts high ranklrlg documents 7 But the lower ranked ones are still dlscourlted relativelv Very popular for Webbased evaluation 7 Heavilv favors rst page of results 7 Favors verv top ofrarlked list Optimization function for some learning CMPSCI A Cowhi hmzsmlm 23 Totally different topicPlG We talked about mapreduce framework One implementation is Hadoop 7 Yahoogenerated now on Apache PIG is a language that sits on top ofit Largely a type of relational algebra Designed to compile into mapreduce operations Homework 4 will use PIG Unfortunately word count examples dif cult CMPSCI 545 Copynghl James Allan 24
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'