Information Retrieval: Clustering for IR
James Allan, University of Massachusetts Amherst
CMPSCI 646, Fall 2007
All slides copyright James Allan

Think and write
- How can clusters be used in document retrieval?
- How might you evaluate cluster-based IR?
  - Need to compare it to not clustering (straight document retrieval)

Clustering and retrieval
- Incorporating clustering into retrieval: two general approaches
  - Retrieve clusters directly
  - Use clusters to improve document representation
- Will look at three papers here:
  - Liu and Croft, "Cluster-based retrieval using language models," SIGIR 2004
    - Explicit clustering by partition
  - Hofmann, "Probabilistic latent semantic indexing," SIGIR 1999
    - Nice updating of vector-space LSI methods
  - Wei and Croft, "LDA-based document models for ad-hoc retrieval," SIGIR 2006
    - IR experiments with Blei, Ng, and Jordan's LDA model (JMLR 2003)

Evaluation
- Use clusters to adjust document representation
  - No changes needed: straight document retrieval evaluation
- Retrieve clusters directly
  - Need to convert to a ranked list of documents
  - Typically rank clusters
    - By score of the cluster
    - By maximum/average/minimum score of a member document
  - Within a cluster, rank documents by their scores

Cluster, then retrieve
- Recall the LM query-likelihood approach for documents:
  P(Q|D) = \prod_{i=1}^{|Q|} P(q_i|D)
- Estimate word probabilities using a smoothed model (linear interpolation):
  P(w|D) = \lambda P_{ml}(w|D) + (1 - \lambda) P_{ml}(w|Coll)
- \lambda might be constant (Jelinek-Mercer) or scaled by length (Dirichlet):
  \lambda = \frac{|D|}{|D| + \mu}

Extend to clusters
- Suppose we have clusters C rather than documents:
  P(Q|C) = \prod_{i=1}^{|Q|} P(q_i|C)
- Still need to smooth:
  P(w|C) = \lambda P_{ml}(w|C) + (1 - \lambda) P_{ml}(w|Coll)
- Can still use the different approaches for \lambda: Jelinek-Mercer, or Dirichlet with \lambda = \frac{|C|}{|C| + \mu}

Query-time clustering
- Retrieve the top 1000 documents (QL) and cluster them
  - Group average, single linkage, complete linkage, centroid, Ward's
  - All are hierarchical agglomerative approaches
  - All vector approaches; they differ in the similarity function
    - Centroid maintains a central representative
    - Ward's tries to keep clusters as tight as possible by considering the sum of the squared deviations from the mean of a cluster
- Rank the resulting clusters using the C(QL) approach
- Convert to a ranked list of documents
  - Documents from the first cluster, in document-QL order
  - Then those from the second cluster, and so on
- Compare the QL ranked list and the CQL-derived ranked list by mean average precision

Does it work?
[Table: mean average precision by collection for QL(DM) and for cluster retrieval with group average, single linkage, complete linkage, centroid, and Ward's; values illegible in the original.]
- In training, decided that
  - Dirichlet smoothing with \mu = 1000 worked best, for both QL and CQL
  - The clustering threshold t was generally set to 0.8
- No effective difference: roughly matches the effectiveness of document retrieval
  - Similar to other research results in that regard
- No particular value to clustering this way
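To make the scoring above concrete, here is a minimal sketch of Dirichlet-smoothed query likelihood, applied both to a single document (QL) and to a cluster treated as one concatenated pseudo-document (CQL). The toy corpus, the function name `dirichlet_ql_score`, and the variable names are illustrative, not from the papers; only mu = 1000 follows the slides.

```python
import math
from collections import Counter

def dirichlet_ql_score(query_terms, text_counts, text_len, coll_counts, coll_len, mu=1000.0):
    """Log query likelihood log P(Q|text) with Dirichlet smoothing:
    P(w|text) = (|text|/(|text|+mu)) * P_ml(w|text) + (mu/(|text|+mu)) * P_ml(w|Coll)."""
    lam = text_len / (text_len + mu)
    score = 0.0
    for w in query_terms:
        p_ml = text_counts.get(w, 0) / text_len
        p_coll = coll_counts.get(w, 0) / coll_len
        p = lam * p_ml + (1.0 - lam) * p_coll
        if p == 0.0:            # query term absent from the entire collection
            return float("-inf")
        score += math.log(p)
    return score

# Toy collection: each document is a list of tokens.
docs = {
    "d1": "space shuttle flight crew launch".split(),
    "d2": "hollywood love story actor".split(),
    "d3": "space station orbit crew".split(),
}
coll_counts = Counter(w for toks in docs.values() for w in toks)
coll_len = sum(coll_counts.values())
query = "space crew".split()

# QL: score each document directly.
doc_scores = {d: dirichlet_ql_score(query, Counter(toks), len(toks), coll_counts, coll_len)
              for d, toks in docs.items()}

# CQL: score a cluster by concatenating its members into one pseudo-document.
cluster = ["d1", "d3"]
cluster_toks = [w for d in cluster for w in docs[d]]
cluster_score = dirichlet_ql_score(query, Counter(cluster_toks), len(cluster_toks),
                                   coll_counts, coll_len)
print(sorted(doc_scores.items(), key=lambda kv: -kv[1]), cluster_score)
```

To build the CQL ranked list, one would rank clusters by this score and then emit each cluster's documents in their document-QL order, as described above.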
Another way to use clusters
- Recall document smoothing:
  P(w|D) = \lambda P_{ml}(w|D) + (1 - \lambda) P_{ml}(w|Coll)
- The intent is to get better estimates of word probabilities
- A cluster contains similar documents; perhaps we should give them priority during smoothing
- Mix all three:
  1. P(w|D), the document itself
  2. P(w|C), documents that are a lot like it
  3. P(w|Coll), everything else
- Different approaches yield slightly different results

Mixing three estimates
- Linear combination ("topic document model," TDM); set \lambda_1 + \lambda_2 + \lambda_3 = 1, a variation of Jelinek-Mercer:
  P(w|D) = \lambda_1 P_{ml}(w|D) + \lambda_2 P_{ml}(w|C) + \lambda_3 P_{ml}(w|Coll)
- Bias the collection model ("cluster-based document model," CBDM):
  P(w|D) = \lambda P_{ml}(w|D) + (1 - \lambda) [\beta P_{ml}(w|C) + (1 - \beta) P_{ml}(w|Coll)]
- Smooth the document model (same effect, so ignore it):
  P(w|D) = \lambda [\alpha P_{ml}(w|D) + (1 - \alpha) P_{ml}(w|C)] + (1 - \lambda) P_{ml}(w|Coll)

Does it work?
- First, consider in the same context as before
  - Retrieve 1000 documents and cluster them (Ward's)
  - Re-rank the 1000 documents based on those clusters and the TDM or CBDM
- TDM does not improve on the baseline QL(DM)
- Clear win for CBDM over the baseline, though not everywhere
  - Huge win on the FR collection
[Table: mean average precision for QL(DM), TDM, and CBDM by collection; values illegible in the original.]

Static, a priori clustering
- Same idea, but cluster without the query
- Efficiency is now an issue: O(n^2) algorithms are too much
- Use k-means
  - A value of k = 2000 appeared optimal in training
  - Used k = 1000 for the FR collection
- Rank documents by CBDM (bias the collection model)
  - Each document is smoothed by the cluster it ends up in
- Uniform advantage over the baseline
- Also gains compared to the classic Okapi model
[Table: baseline vs. CBDM (and Okapi) mean average precision by collection; values illegible in the original.]

Static clustering, continued
- Again, CBDM beats TDM
- Less gain compared to a more effective method, relevance models (RM)
  - RM uses something vaguely similar to clustering
[Tables: CBDM vs. TDM, and CBDM vs. RM, across collections including AP and FR; values illegible in the original.]

Cluster, then retrieve: conclusions
- Retrieving clusters did not improve document ranking
  - Generally matched performance
- Using clusters to smooth the model is effective
  - Bias the collection model
  - Similar results if smoothing the document model
- Treating document, cluster, and collection equally hurt
  - Primarily because there is no place for Dirichlet smoothing
  - With JM only, the best performance should be comparable
- Only clustering the top 1K documents was also a wash
  - Needed complete clustering of the collection
  - Clusters ranged from 50-100 documents, depending on the collection

Non-partitioned clustering
- The previous examples partitioned the data to build clusters
- Documents may be about multiple topics
  - Those algorithms force such a document into a single topic
- Could redo the previous experiments with that in mind: allow overlapping clusters
- Instead, revisit LM approaches with clusters
- First, a review

Generative models
- Language models are generative
  - The model describes a process for generating documents
  - P(Q|M_D) is the probability that model M_D might generate Q
- Classic generative approach: for each document D_i (1 \le i \le M) in a collection:
  - Pick the number of words in the document, N
  - For each of the N words, generate a word with probability P(w)
- Can be depicted by a graphical-model ("plate") representation
- Corpus likelihood:
  P(Coll) = \prod_{i=1}^{M} \prod_{n=1}^{N_i} P(w_{i,n}|Coll)

Generative topics
- Suppose documents are drawn from a topic (cluster)
  - A document could come from any topic, but with a certain probability of coming from each one
  - Each topic could have its own distribution of words
- Generative model: for each document D
  - Choose a topic z
  - Choose a length N
  - For each word w_i, generate it with a probability derived from the topic
    - Different topics give different probabilities of generation
- Document likelihood:
  P(D) = \sum_z P(z) \prod_{i=1}^{N} P(w_i|z)
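A minimal numpy sketch of this one-topic-per-document generative story, plus the likelihood P(D) = \sum_z P(z) \prod_i P(w_i|z) it assigns to a document. The tiny vocabulary, the topic count, and all parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["space", "shuttle", "crew", "love", "actor", "star"]
K = 2                                    # number of topics/clusters
p_z = np.array([0.5, 0.5])               # P(z): prior over topics
p_w_given_z = np.array([                 # P(w|z): one word distribution per topic
    [0.30, 0.25, 0.25, 0.05, 0.05, 0.10],   # a "space" topic
    [0.05, 0.05, 0.05, 0.35, 0.25, 0.25],   # a "movies" topic
])

def generate_document(n_words):
    """One-topic-per-document story: pick z once, then draw every word from P(w|z)."""
    z = rng.choice(K, p=p_z)
    words = rng.choice(vocab, size=n_words, p=p_w_given_z[z])
    return z, list(words)

def doc_likelihood(words):
    """P(D) = sum_z P(z) * prod_i P(w_i|z)."""
    idx = [vocab.index(w) for w in words]
    per_topic = p_z * np.prod(p_w_given_z[:, idx], axis=1)
    return per_topic.sum()

z, doc = generate_document(6)
print(z, doc, doc_likelihood(doc))
```

The limitation the next slides address is visible in `generate_document`: every word in the document is forced through the same z, so a document genuinely about two topics is modeled badly.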
Generative multi-topic documents: pLSI
- Here is one way to extend the model to allow multiple topics
- Generative process:
  - Pick a document d with probability P(d)
  - Pick a topic z with probability P(z|d) — every word has its own topic
  - Generate a word w with probability P(w|z)
- Probability of generating a word in a document:
  P(d, w) = P(d) \sum_z P(w|z) P(z|d)
- How do we get the distributions P(d), P(w|z), and P(z|d)? Kind of like clustering

Topics within the collection
- Find topics in documents using overlapping clustering
- But here we have a generative framework
- Want the clustering most likely to generate the collection:
  P(Coll) = \prod_{d \in Coll} \prod_{w \in W} P(d, w)^{n(d,w)} = \prod_{d \in Coll} \prod_{w \in W} \Big[ P(d) \sum_z P(w|z) P(z|d) \Big]^{n(d,w)}

Estimate with EM
- Select initial values for P(w|z), P(d|z), and P(z)
- E-step:
  P(z|d, w) = \frac{P(z) P(d|z) P(w|z)}{\sum_{z'} P(z') P(d|z') P(w|z')}
- M-step:
  P(w|z) = \frac{\sum_d n(d, w) P(z|d, w)}{\sum_{d, w'} n(d, w') P(z|d, w')}
  P(d|z) = \frac{\sum_w n(d, w) P(z|d, w)}{\sum_{d', w} n(d', w) P(z|d', w)}
  P(z) = \frac{1}{\sum_{d, w} n(d, w)} \sum_{d, w} n(d, w) P(z|d, w)

Result of fitting to the collection
- For each topic:
  - Can show the word distribution P(w|z)
  - Can show which documents are in it, and to what degree: P(d|z)
  - Can show its prior likelihood P(z)
[Figure: words from four example clusters ("plane," "space shuttle," "family," "Hollywood"); two clusters for each word show variant meanings of words. The two aviation/space topics have high P(flight|z); the family and Hollywood topics (mother, love, happy, actor, star, entertainment, ...) have high P(love|z).]

pLSI for retrieval: first cut
- Estimate P(w|d) using topics and proceed as normal:
  P(Q|D) = \prod_{i=1}^{|Q|} P(q_i|D)
  - (The pLSI paper actually uses a vector-space comparison)
- Note that EM has not calculated P(w|d), but
  P(w|d) = \sum_z P(w|z) P(z|d)
  - Use Bayes' rule to get P(z|d) from P(d|z)
  - Use the topic distribution to get P(d)
  - Then P(w|d) can be calculated

pLSI for retrieval: second cut
- Compare the P(z|d) and P(z|q) distributions
- Complication: we do not have P(z|q); q was not one of the original documents
- "Fold" the query in as a new document
  - Do EM as before
  - Only re-estimate P(z|q) during the E-step
- This generally outperforms the other model
- Even better if combined with multiple clusterings
  - Various values of K, the number of "latent" topics
- Caveat: the baselines in the paper are low

pLSI to LDA
- pLSI cannot estimate the probability of an unseen string
- It has too many parameters, so it tends to overfit
  - Requires "tempering" of EM
- The alternate proposal is Latent Dirichlet Allocation (LDA)
  - Has consistent generative semantics; can generate unseen strings
  - Fewer parameters, so less prone to overfitting
  - Key idea: select a mixture of topics before generating words
- Both require a value K, the number of latent topics
  - Need other approaches to avoid that, e.g., relevance models

LDA's generative process
- Pick a multinomial \beta_z for each topic z
  - From a Dirichlet distribution parameterized by \eta
- Choose the length of the document, N
- Choose a multinomial distribution \theta over topics
  - From a Dirichlet distribution parameterized by \alpha
- For each of the N word positions n:
  - Choose a topic z_n from the distribution \theta
  - Choose a word with probability P(w|z_n, \beta)
  - \beta is a multinomial conditioned on the topic z
- Likelihood of generating a corpus looks like:
  P(Coll|\alpha, \eta) = \int \prod_z P(\beta_z|\eta) \prod_{d \in Coll} \int P(\theta_d|\alpha) \prod_{n=1}^{N_d} \sum_{z_n} P(z_n|\theta_d) P(w_{d,n}|z_n, \beta) \, d\theta_d \, d\beta
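A minimal numpy sketch of this generative process, sampling one toy corpus. The dimensions and hyperparameters (K topics, vocabulary size V, \alpha, \eta) are invented for illustration; in real use \beta and \theta are fit by approximate inference rather than sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 8, 3               # vocabulary size, number of topics (illustrative values)
alpha = np.full(K, 0.5)   # Dirichlet prior over each document's topic mixture
eta = np.full(V, 0.1)     # Dirichlet prior over each topic's word distribution

# One multinomial beta_z per topic, drawn from Dirichlet(eta): shape (K, V).
beta = rng.dirichlet(eta, size=K)

def generate_document(n_words):
    theta = rng.dirichlet(alpha)         # this document's topic mixture, chosen first
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)       # a topic per word position
        w = rng.choice(V, p=beta[z])     # a word from that topic's multinomial
        words.append(w)
    return theta, words

corpus = [generate_document(rng.poisson(20) + 1) for _ in range(5)]
for theta, words in corpus:
    print(np.round(theta, 2), words)
```

The contrast with pLSI: \theta is drawn from a smooth Dirichlet prior rather than being a free parameter tied to each training document, which is what lets LDA assign probability to unseen documents.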
LDA for IR
[Figure: plate diagram of the LDA-based retrieval model.]
- Need approximate inference again
  - This paper used Gibbs sampling
  - Blei, Ng, and Jordan show several other possibilities
- If done right, about as expensive as k-means
- Seems to need fewer topics than CBDM: used 800, compared to 2000
- Preliminary experiments showed LDA is poor alone
- Combine with other models (the "LDA-based document model," LBDM; see the sketch at the end of these notes):
  P(w|D) = \lambda \left[ \frac{|D|}{|D| + \mu} P_{ml}(w|D) + \frac{\mu}{|D| + \mu} P_{ml}(w|Coll) \right] + (1 - \lambda) P_{lda}(w|D)

Results
- LBDM solidly outperforms the collection QL baseline
- However, on 4 of 5 collections it is beaten by the relevance model (RM) approach
  - Reworked indexing, so the baselines are inconsistent
[Tables: QL vs. LBDM mean average precision with % change, and QL vs. RM vs. LBDM, for collections including AP, SJMN, and LA; values illegible in the original.]

Conclusion
- Clustering as partitions
  - CQL: cluster-query results
  - CBDM: static clustering
- Clustering as latent topics
  - pLSI: each word has a topic distribution
  - LDA: each document has a topic distribution
- Latent models work better here
  - Perhaps primarily because they allow overlapping clusters
- Still require specifying K
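To close, a minimal sketch of the LBDM combination referenced in the "LDA for IR" slide: Dirichlet-smoothed query likelihood mixed with an LDA-derived word probability P_lda(w|D) = \sum_z P(z|D) P(w|z). The fitted quantities `theta_d` and `beta`, the toy data, and the \lambda, \mu values are stand-ins; in the paper they come from Gibbs sampling and tuning.

```python
import math
from collections import Counter

import numpy as np

def lbdm_score(query, doc_counts, doc_len, coll_counts, coll_len,
               theta_d, beta, word_index, lam=0.7, mu=1000.0):
    """log P(Q|D) under
    P(w|D) = lam * [Dirichlet-smoothed P(w|D)] + (1 - lam) * P_lda(w|D),
    where P_lda(w|D) = theta_d @ beta[:, w]. lam and mu are illustrative."""
    k = doc_len / (doc_len + mu)
    score = 0.0
    for w in query:
        p_dir = k * doc_counts.get(w, 0) / doc_len \
              + (1 - k) * coll_counts.get(w, 0) / coll_len
        p_lda = float(theta_d @ beta[:, word_index[w]]) if w in word_index else 0.0
        p = lam * p_dir + (1 - lam) * p_lda
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

# Toy inputs: 2 topics over a 4-word vocabulary.
vocab = ["space", "crew", "love", "star"]
word_index = {w: i for i, w in enumerate(vocab)}
beta = np.array([[0.50, 0.40, 0.05, 0.05],   # P(w|z) for each topic
                 [0.05, 0.05, 0.50, 0.40]])
theta_d = np.array([0.9, 0.1])               # P(z|D) for this document
doc = "space crew space".split()
coll = "space crew love star crew space love".split()
print(lbdm_score("space crew".split(), Counter(doc), len(doc),
                 Counter(coll), len(coll), theta_d, beta, word_index))
```

Because \theta_d spreads probability over all of a document's topics, LBDM gives a document credit for query words it never contains but that its topics make likely, which is the overlapping-cluster effect the conclusion credits.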