Class Note for CMPSCI 646 at UMass(5)
Class Note for CMPSCI 646 at UMass(5)
Popular in Course
Popular in Department
This 18 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 21 views.
Reviews for Class Note for CMPSCI 646 at UMass(5)
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 02/06/15
Information Retrieval Retrieval Models Vectors James Allan University of Massachusetts Amherst CMPSCI 646 Fall 2007 All slides copynght lamesAllan Outline Last class Vector space Latent Semantic Indexing This set of slides CMPSCI 545 Copynght lamesAllan Vector Space Model Variations 7 Vector space retrieval model 7 Latent Semantic Indexing Key idea Everything documents queries terms is a vector in a highdimensional space Example systems 7 SMART G Salton and students at Cornell starting in the 60 s Still among the topperforming IR systems 7 Lucene popular open source engine written in Java 7 Most Web search engines are similar CMPSC1646 Copynght lamesAllan Vector space issues How to select basis vectors dimensions How to convert objects into vectors 7 Terms 7 Documents 7 Queries How to select magnitude along a dimension How to compare objects in vector space 7 Comparing queries to documents CMPSCI 545 Copynght lamesAllan CMPSCI 646 Vector Space and Basis Vectors Formally a vector space is defined by a set of linearly independent basis vectors Basis vectors correspond to the dimensions or directions in the vector space determine What can be described in the vector space and must be orthogonal or linearly independent Le a value along one dimension implies nothing about a value along another LL Basis Vectors B asis Vectors for 2 dimensions for 3 dimensions Copyright lamesAllan Basic Vector Selection What should be the basis vectors for IR Core concepts ofdiscourse orthogonal by definition a relatively static vector space there are no new ideas probably not too many dimensions But difficult to determine Philosophy Cognitive science Terms that appear easy to determine But not at all orthogonal but it may not matter much a constantly growing vector space newvocabulary huge number of dimensions Ease of identifying bases critical o CMPSC ther issues can be dealt with we hope I 545 Copyngl1L JamesAllan Mapping to basis vectors terms How do basis vectors relate to terms Each term is represented as a linear combination of basis vectors quotago H Dictionary Basis terms g Term Active Independent m cat 025 075 dog 075 025 lt quotcat gravity Independent CMPSC1646 Copyright lamesAllan Mapping to basis vectors documents How are documents represented A document is represented as the sum of its term vectors D001 dog C at Active Independent CMPSCI 545 Copynght lamesAllan Mapping to basis vectors queries How are queries represented Same way that documents are Query dog cu 2 y u lt Independent CMPSCI 646 Copynght lamesAllan Vector Coefficients The coefficients vector lengths term weights represent term presence importance or aboutness 7 Magnitude along each dimension Model gives no guidance on how to set term weights Some common choices 7 Binary l term is present 0 term not present in document 7 if The frequency of the term in the document 7 tf idf idfindicates the discriminatory power of the term Tfidf is far and away the most common 7 Numerous variations CMPSCI 545 Copynght lamesAllan Example 3word vocabulary tf weights cat cat cat cat cat cat cat cat lion lion cat lion cat lion dog cat cat lion dog dog CMFSCI 646 Copyrigth James A an Term weighting functions Lucene weighting function tf39 log1Vd 1 w M xnumber of tokens in d in the same field as f Smart supports a number of functions XYZ r X expresses term 39equency component 7 Y expressed inverse document 39equency component 7 Z expresses length normalization component 7 eg atc augmented t df cosine 1 1 Wu N lt5 imam 3909 R 0 5 CMFSCI 646 Copyrigth James A an Vector Space Similarity Similarity is inversely related to the angle between the vectors E g DOC2 is the most similar to the Query I Rank the documents 5 I by their similarity to the Query II I I CMPSCI 646 Copyright lames Allan Vector Space Similarity Weighted Features Example D13mt1dog4 on Q2dog D28mt2dog6lion d D13T11T24T3 Q 0T1 2T20T3 D2 8T2 2T2 6T3 Correlated Terms Orthogonal Terms Term cat dog lion Term cat dog lion T1 cat 100 020 050 cat 100 000 000 TZ dog 020 100 040 dog 000 100 000 T3 lion 050 040 100 lion 000 000 100 SimD1Q 3T1 1T2 4T3 2T2 SimD1Q 3390 1 392 4390 6T1 T2 2T2 T2 8T3 T2 2 6 022 1 8004 12 2 32 24 CMPSCI 545 Copyngl1t lamesAllan Vector Space Similarity Common Measures Simix Y Binarv Term Vectors Weiqhted Term Vectors Inner product iXin Zxxyx D39 e 2iXin 229W coeffICIent X MY Zxxz Jrsz Cosine iX Yi 216 coefficient i X NW izxfzy Jaccard amp X YiX Y xf ff xx coeffICIent M m i 2 2y 2 y CMPSC1646 Copyngit lamesAllan Vector Space Similarity Cosine Coefficient Correlation Example D1 05T1 08T2 03T3 Q 15T1 1T2 0T3 05x1508xl Simml Q 1I 051 082 031i15112 155 d98gtlt325 868 CMPSCI 545 Copynght lamesAllan Cosine and vector lengths Angle is independent of vector lengths Can normalize all vectors to length one Inner product equals cosine of angle Inner product more efficient to compute Very common to normalize vector lengths in index CMFSCI 646 Copyrigth James A an Example again normalized D0573O8T203T3 Qu 5T11T20T3 D 0513 08 1 2i camv098 Q 165739 1139 orgmes z 051m 082T2 0 31T3 e 083T 05555 SimDlQ SimD 1Q 051 X 083 082 X 0555 o512 0822 0312oe32 05552 051 x 083 082 X 0555 0878 0868 from earlier slide ll ZZ CMFSCI 646 Copyrigth James A an Other comparisons Lucene Userspeci ed boost tfidf from document overlapq a Proportion of query matched Lengthnormalized uer wei ht q y 9 Term normalization Is square root of number oftokens in d that are in the same eld as t CMFSCI 646 Cnpynghta James A an Standard vector space summary Very simple 7 Map everything to a vector 7 Compare using angle between vectors Challenge is mostly finding good weighting scheme 7 Variants on tfidf are most common 7 Model provides no guidance Another challenge is comparison function r Cosine comparison is most common 7 Generic inner product without unit vectors also occurs 7 Model provides no guidance CMFSCI 646 Copyrigth James A an CMPSCI 646 Outline Latent Semantic Indexing Copynght lamesAllan CMPSCI 646 Latent Semantic Indexing LSI Variant ofthe vector space model Use Singular Value Decomposition a dimensionality reduction technique to identify uncorrelated signi cant basis vectors or factors 7 Ratherthan nonindependentterms Replace original words with a subset ofthe new factors say 100 in both documents and queries Compute similarities in this new space Computationally expensive uncertain effectiveness Following gures taken from Deerwester Dumais Furnas Landauer and Harshman Indexing by Latent Semantic Analysis JASIS 416391407 1990 Copynght lamesAllan mm LSI umumems x Ta 50 on m x m m x a l x a x x m x c In 5 on qumav vamummsmn a In em x dummnnl mamx x Wm rn nasonnogunalum12nghuumn D Tn K u an hasnvlhugnml umlrmngm cuvumns 00 n n 50 s ms aagnnal Mann a 5mm varms mm mmmmmwsm x 645mg numhyo wmmni m x Mm mm x u mug beerwesmr mm mm CMFSCI 545 Capynght James Alhn 1mm Mm umyh me u Hurmv machme war M NI a mo WM pphulmm c A mnn nfmrr ommun afrmwvm Wm m a m m m rim g Ilavilequ nxu39m a mm m hm mm mgmeenng wig m an L5 mm yrmqmm Wm W m um quotlawnmum m m genemnmv m quotNam mm quotmm quotm ml m mlumsmun Wm f 1th m m m cm H mm M quota m mu q mm m4 amphmm A mnn rm Bowman u 6 ml quot7 m m4 mm a 0 u u u o u u u a a u my a w o n m u v o H mm H v u n W n a u n m uuph o 0 u o 0 n bmwesm 2v a mu CMFSCI 545 Capynght James Alhn LSI example Tquot 022 nu my run III 4m m rum 7m nvum quoturns m Armmnruulruu 5 uza om ruxcrusg run 702 mo 006 w W m nmrau nm 013 an my nw um quot54 u r n Mb Ilurum n u rum um um 35 017 mm quotm uuxran nurmn mm W 17 quotan4 I107 nuswn uzxrunzruns W Man on alu an 17 viroazrnn Iquot an 21 run rum rm m run in 7m W mm m 01 an nsgmnyruza mm W n m 1121 umrum nu uminw n21 quot3 u n4 oynruulrnw nu 14 w mu m uznram nu rue on row Hm am as 0l7ruwrom ullrnurua quotm 0 m on m um qu mare um on 054 7021 057 027 rnxl 4117 an rum 7mm m mam ms nu um ammonium am 019 um am mmnmn warm om nu 19 on you Insane rue no on ml or 015 mu n1 5 a m 053 um 03 rm m rum 4m ms bmwesm 2v a mu CMFSCI 545 Capynght James Alhn documemi x I s n39 Icms kxk kxa xxa W x z r s D Rmucl mum we Gec mpusmnnmmcmm x cucumenl mamx x Whey T has unhngcma ummengm columns 0 r n has mmogona mm eng m co umns m u s 5 the magma mawx m smgu nrvames Ms we number m ms at x a vs m2 numbzv mw umns m x m 5 me tank cl x 3 mmkm u 5 ma chasequot numbev nia meusmns m mu manned mode x lt m 7 bmwesmqumn CMFSCI 545 Capynght James Alhn LSI example X T s D 022 7011 316 020 061 046 054 021 000 002 002 0014 020 007 154 006 017 7013 02 011 019 0 062 05 02 004 m m 064 017 017 011 027 011 030 7014 021 027 001 049 003 062 003 045 beerwesmr 2m1199n CMFSCI A Capynght hmesA1an LSI example 2 016 040 038 047 013 A005 7012 016 7009 014 037 033 040 016 003 7007 gt010 DO4 015 051 036 041 024 002 006 009 012 126 084 061 070 039 003 008 012 019 045 123 105 127 056 7007 7015 7021 005 016 058 033 042 028 006 013 019 022 016 058 033 042 028 006 013 010 022 022 055 051 063 024 007 0 A 7020 011 010 053 023 021 027 014 031 044 042 4106 023 014 7027 014 024 055 077 066 7006 034 7015 7030 020 031 069 098 085 7004 025 70104121 015 022 050 071 061 beerwesmr 2m1199n CMFSCI A Capynght hmesA1an Comparing original and LSI krmmn Irm39ry rrm wank rm39mar 1 II39 1 1 1 I 114mm 18 005 012 016 009 inmrfaw 003 007 010 004 mmpmw 006 009 012 mgr 008 012 019 51395er 015 021 005 rl39spunsy 013 019 022 time 013 019 022 EPS 014 020 011 surrey Hm 006 023 13mph 006 034 ml39nun 004 025 CNIPSCI 646 C0py1ight James Allan Deer wes rer39 6139 0 1990 Dimension 2 LSI 11 graph l m31011123 2 quot14191112 mime 12mm r Query human computer interaction n39m2im11 l 9 suw ey fm jm a quot 39 j 2 c Ear K J 5 re 50 E j D 3 tuc rripuliar I 1 usequot 1 sagas 9 quot cums x 9 imamlace P gnEFS 139 c31245ai 7 N I 5 system L39 c4i1sai Dimen i 1 Deer39wes rer39 et al 1990 CNIPSCI 646 C0py1ight James Allan Using LSI X r s D 022 70 316 020 061 046 054 DIE 000 002 um DUE 020 007 154 006 DH 70 02 0H 019 044 062 053 01 004 I I I 040 m D Is new doc vectors 2 dimensions here 064 I I7 w m T prowdes term vectors Given Qq1q2qt want to compare to docs 321 027 Convert Q from tdimensions to 2 mm 049 Q QT T 34 0114 052 y 7 Q QT T s1 a M m m W m Can now compare to doc vectors Same basic approach can be used to add new docs to the database CMFSCI A CnpynghtalamesAnan Example of usrng LSI X 3 r s D39 02170 314 020 061 016 09 023 000 002 D02 0011 020 007 154 006 0 fall ADI 011 0 044 062 953 024 am 040 006 H H W 797 Query Is human computer 7039 QT101000000000 ml 0 030 7014 39239 392 QTT 10 22124 10111004 mm 049 39 m 052 046 007 aux ms QT Ts1 046334 007254 014 003 Dime15inquot 2 CMFSCI 646 LSI n r n weigh la a magi142i iggememm Query human mmputer mtem My mum gsuw 4 mm I lad5s75l an 6cm I 4 ammpm um always 39 wigs was 5 I mm cm jal I X Drmensmnt beerwesmr moi mu Capynght James Alhn Is LSI any good Decomposes language into basis vectors 7 In a sense is looking for core concepts In theory this means that system will retrieve documents using synonyms of your query words 7 The magic that appeals to people From a demo at lsiresearchtelcordiacom 7 They hold the patent on LSI Query manna on Bible verses 312 dimensions 7 5 Exodus 1220 Ye shall eat nothing leavened in all your habitations shall ye eat unleavened bread 7 6 Genesis 3154 Then Jacob offered sacri ce upon the mount and called his brethren to eat bread and they did eat bread and tarried all night in the mount Things like this are major claim of LSI techniques CMFSCI 646 Capynght James Alhn To CMPSCI Magic can be confusing p 5 hits for query apple 312 dimensions SongofSongs 85 Who is this that cometh up from the wilderness leaning upon her beloved I raised thee up under the apple tree there thy mother brought thee forth there she brought thee forth that bare thee Psalms 473 He shall subdue the people under us and the nations under ourfeet SongofSongs 23 As the apple tree among the trees of the wood so is my beloved among the sons sat down under his shadow with great delight and his fruit was sweet to my taste Zecharaiah 310 In that day saith the LORD of hosts shall ye call every man his neighbour under the vine and under the fig tree Magic Ecclesiastes 47 Then I returned and sawvanity under the sun 545 Copynght lamesAllan Vector Space Retrieval Model Summary Standard vector space Each dimension corresponds to a term in the vocabulary Vector elements are realvalued reflecting term importance Any vector documentquery can be compared to any other Cosine correlation is the similarity metric used most often Latent Semantic Indexing LSI CMPSCI 7 Each dimension corresponds to a basic concept 7 Documents and queries mapped into basic concepts 7 Same as standard vector space after that 7 Whether it s good depends on what you want 545 Copynght lamesAllan
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'