New User Special Price Expires in

Let's log you in.

Sign in with Facebook


Don't have a StudySoup account? Create one here!


Create a StudySoup account

Be part of our community, it's free to join!

Sign up with Facebook


Create your account
By creating an account you agree to StudySoup's terms and conditions and privacy policy

Already have a StudySoup account? Login here


by: Mrs. Maximo Lueilwitz


Marketplace > Oregon State University > ComputerScienence > CS 519 > TOPICS IN COMPUTER SCIENCE
Mrs. Maximo Lueilwitz
GPA 3.95


Almost Ready


These notes were just uploaded, and will be ready to view shortly.

Purchase these notes here, or revisit this page.

Either way, we'll remind you when they're ready :)

Preview These Notes for FREE

Get a free preview of these Notes, just enter your email below.

Unlock Preview
Unlock Preview

Preview these materials now for free

Why put in your email? Get access to more of this material and other relevant free materials for your school

View Preview

About this Document

Class Notes
25 ?




Popular in Course

Popular in ComputerScienence

This 3 page Class Notes was uploaded by Mrs. Maximo Lueilwitz on Monday October 19, 2015. The Class Notes belongs to CS 519 at Oregon State University taught by Staff in Fall. Since its upload, it has received 28 views. For similar materials see /class/224502/cs-519-oregon-state-university in ComputerScienence at Oregon State University.

Similar to CS 519 at OSU

Popular in ComputerScienence




Report this Material


What is Karma?


Karma is the currency of StudySoup.

You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 10/19/15
Indexing Transformations A data structure that allows you to rapidly Case folding locate documents containing one or more terms index terms Want to efficiently store and query the Translate all uppercase to lowercase Stemming Translate all files to morphological roots index Run running runner Stop words A the it etc Special for applications dollar stock etc Inverted File Visual Example of Inverted File Sometimes called postings file or concordance Special purpose alternatives include bitmaps or signature file In most apps inverted files offer better performance Requires a lexicon list of all terms that appear in the text vocabulary Simplest form List of strings and disk addresses Visual Example of Word Granularity of Index Granularity Most specific location of word that can be pinpointed by index Eg Document sentence word Granularity Coarse Block of documents on disk Medium Document Fine sentence word or byte Tradeoff Coarse data must be searched at runtime Coarse data uses less space Supporting Proximity Searches in Document Granularity Proximity Phrase NEAR WITHIN operators Query the index using keywords get a list of documents Scan the full text of results at query time Looking for occurrence of phrase Could take time If lots of docs or really long docs Word Level Granularity Enables index to answer precise proximity queries Requires 2 3x space of document granularity Many more pointers one forevery appearance or a word Pointers are larger there are many words Document granularity is the most common Proximity queries can be handled also by scanning text at runtime Unless lots of phrase searches are expected Size of Uncompressed Index file 50 100 of indexed text Imagine Average word 5 characters plus 1 space 6 bytes 32 bit document pointers up to 4 billion docs 4 bytes 16 bit word pointers 2 bytes Result 6 bytes of index for every six bytes of text Can We Compress this Well Each row has this form lt6 1 5 10 20 22 40gt What are we doing when we compress data What does this have to do with probability distributions Example of Compression 4 symbol language how many bits do we need 7 2 blts If n characters and we use 2 bits to represent each of the four sym o s Now assume zipf distribution with k1 What is probability of each symbol appearing 7 Assume most probable symbol appears wlth prob p PpZp4pE PEE 4E zra 1E1 PE4Z1E egtpE15 e Relatlve probabllltles 15415215115 Simple Code Each symbol different ofbits e n 7 1D 7 11D 7 111B Size of random string of length n Prob 815 415 215 115 18152415321541l15n 8 8 6 4n15 26n15 174n 174n2n 87 Savings of 13 Can We Compress this Well Each row has this form 7 lt6 1 5 10 20 22 40gt Symbols are document ids Do certain document ids appear more than others Maybe if certain documents are larger But not in a big way But with a small transformation we can make this much more compressible One Approach to Compressing Index Files DGaps Transform doc ids pointers to documents to difference 39om previous doc id Example Orig lt6 i 5 10 20 22 40gt New lt6 4 5 10 2 18gt Haven t necessarily made the pointers smaller Terms that appear in many documents Are going to naye more smallergaps some repeats Plus gaps Wlll repeat across rows terms What Codes for Text caps don t follow tne Zipr dlstrlbutlorl Example decent code rorgaps cammacode 7 l El 7 2 inn 7 3 ml 7 4 llElElEl 7 5 llElEll 7 E llEllEl 7 7 Hull 7 E lllElElElEl 7 a lllElElEll 7 in lllElEllEl Reference 7 Managing Gigabytes by Wltterl Murrat and Bell 7 claims cumpresslurl to am or original size on TREC data At Query Time You have a keyword in a query Start by accessing the lexicon sorted list of words 7 Witn document rre uerlcy or eacn 7 Arid pointerto associated lrldegtlt row Access T39me 7 Binary searcn rorsorted lists Simplest way to re resent 7 Array Witn xed size strings Lots ofways to reduce memory 7 Concatenate all strings 7 Front coding 7 Hasning Now have a list of documents with the query term Boolean Conjunctive Queries Oregon AND State AND University d term t For each stemme 7 Search legtltlcorl record f arld address of lrthe lrlyerted flle erltry fort Picktwith smallest ft Read It Set 0 all docs in It For each remaining termt 7 Read l 7 Foreach d in Q luukup d in l lr d is not in if tnen remoye itrrom c 7 lrc empty stop Return 0 backto user


Buy Material

Are you sure you want to buy this material for

25 Karma

Buy Material

BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.


You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

Why people love StudySoup

Steve Martinelli UC Los Angeles

"There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

Jennifer McGill UCSF Med School

"Selling my MCAT study guides and notes has been a great source of side revenue while I'm in school. Some months I'm making over $500! Plus, it makes me happy knowing that I'm helping future med students with their MCAT."

Bentley McCaw University of Florida

"I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"


"Their 'Elite Notetakers' are making over $1,200/month in sales by creating high quality content that helps their classmates in a time of need."

Become an Elite Notetaker and start selling your notes online!

Refund Policy


All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email


StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here:

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.