Class Note for CMPSCI 646 at UMass(10)
Class Note for CMPSCI 646 at UMass(10)
Popular in Course
Popular in Department
This 9 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 29 views.
Reviews for Class Note for CMPSCI 646 at UMass(10)
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 02/06/15
Information Retrieval Massive parallelism MapReduce Hadoop PIG etc James Allan University of Massachusetts Amherst CMPSCI 646 Fall 2007 Copynghl JamesAllan Indexing Given a set of files Create an inverted file 7 word d001 d002 Write some pseudocode to do that Copynghl JamesAllan Page 1 Indexing pseudocode for each document D for each word W in D fetch list for W create it is it does not exist append D to end of list end end Now suppose have many many machines Fan documents out to processors Coalesce inverted lists Write pseudocode for that Copynghl JamesAllan Pseudo pseudocode Fan out 7 Do same work as before on each node Coalesce 7 Gather all partial inverted lists on one machine 7 Sort by the word key forthe list 7 Merge the lists together Notes 7 Requires globally unique docids 7 Could coalesce hierarchically Copynghl JamesAllan Page 2 Massive parallelism Suppose have thousands of processors How best to use them Build custom applications 7 Code for fanning out coalescing 7 Dealing with errors processor failures Build software framework 7 Bundlebury all administrative work 7 Programmer deals with interesting part Copynghl JamesAllan MapReduce Google s MapReduce probably best known 7 Jeff Dean and Sanjay Ghemawat MapReduce Simplified Data Processing on Large Clusters Symposium on OS Design and Implementation 2004 Goals ofthe work 7 Automatic parallelism and distribution 7 Faulttolerance 7 HO scheduling 7 Status and monitoring Origin of terms 7 Map applies a function to a list and returns a list 7 Reduce combines data to get a return value Copynghl JamesAllan Page 3 General overview Source data fanned out to processors Processor gets a list of things eg text lines Processor maps list to ltkeyvaluegt pairs Additional processors coalesce pairs by key Processors reduce lists to new ltkeyvaluegt s Those get written to disk Copynghl JamesAllan Example from paper Count all words in a set of documents Emit ltword countgt pairs mapString key String value reduce String key lteratorvalues key document name key a word value contents values a list of counts for each word w in value int result 0 Emitlntermediatew 1 for each v in values result Parselntv Emit AsStringresult Actual code is in appendix ofpaper Copynghl JamesAllan Page 4 What happens Documents sent to Reduce processes worker processors started on worker 7 Appropriately sized Processors Chunks 7 Number R specified EaCh generates Reads map output ltWord1gt palrs on les from where they local disk are Put in le speci c to Cans reduce with reduce destination each unique key hasmword mOd R Gets value pairs MaSter Process Writes output pairs to coordinates global storage Cupynght James Allan View of dataprocess ow 1 a r Map Task 1 r 39 kliklvklx l le mum khldx Pamtmning Funmun kdv l mum Pmmanmg Fundmn mmmmg rmmmn 5m and Ginup mi iuw it Sm and claw nme Reduce Task Reduce Task zJ mp Mains guugle cumpapersmapreducehlrnl Cupynght JamesAllan shag a Page 5 How many processors Want many map tasks more than machines 7 Granularity oftask often hard to predict 7 Makes it easierto load balance 7 Smallertasks easier to restart if failure Paper claims that for 5000 machines 7 200000 map tasks 7 5000 reduce tasks Copynghi JamesAllan Challenge LM statistics Want to count frequencies of ngrams 7 Sequences of n words Use MapReduce to count 5grams How Copynghi JamesAllan Page 6 Some extra features of MapReduce Your own partitioning function 7 The function that maps pairs to reduce process Order guarantee 7 Reduce process gets keys in ascending order Combiner function 7 In some cases can partly reduce at map process Rollyourown input types 7 Create a reader if one of existing types doesn t work Sideeffects 7 Ma or Reduce can generate auxiliary out ut Failure support 7 Detects and skips bad records when code crashes Counters 7 Can declare and increment counters returned to code at end Copynghl James Allan Variations on MapReduce Google s MapReduce 7 Implemented in C with Python and Java interfaces Yahoo s Hadoop 7 MapReduce implemented in Java Microsoft s Dryad 7 Seems to be more oriented toward SQLlike stuff Copynghl James Allan Page 7 Beyond MapReduce All that mapping and reducing is tedious How about scriptlike languages Savvzall Google MapReduce 7 Looks vaguely like C PIG YahooHadoop 7 Looks vaguely like SQL 7 There s also Grunt a shell for piggy things Nebula Microsoft Dryad 7 No idea what it looks like Copynghl JamesAllan MapReduce application Build an IR system using MapReduce Language modeling 7 Tokenize etc 7 Calculate PwD 7 Build inverted le 7 Parse query and rank documents Different for VS query with cosine similarity Copynghl JamesAllan Page 8 Some other problems Kmeans clustering of a set of documents 7 Will require multiple passes Count the links into all pages 7 Each page is a document 7 Assume you can identify links Grep for a string Find all anchor text associated with pages 7 lta href quotgtthis is anchor textltagt Find query words associated with page by click 7 Log ltquery pageclickedongt Build ltengisharabicgt parallel dictionary 7 From parallel corpora Copynghl JamesAllan Page 9
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'