Internet Algorithmics CSC 6400
Popular in Course
Popular in ComputerScienence
verified elite notetaker
verified elite notetaker
verified elite notetaker
verified elite notetaker
HSTA 101H - 00
verified elite notetaker
verified elite notetaker
This 47 page Class Notes was uploaded by Harmon Jones on Wednesday October 21, 2015. The Class Notes belongs to CSC 6400 at Tennessee Tech University taught by Martha Kosa in Fall. Since its upload, it has received 27 views. For similar materials see /class/225736/csc-6400-tennessee-tech-university in ComputerScienence at Tennessee Tech University.
Reviews for Internet Algorithmics
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 10/21/15
1 Tennessee Technological University vilnformation Retrieval in Large Scale File Systems 080 6400 Internet Algorithm Xin Chen 2009420 1 Tennessee I chnological University Introduction lie Tennessee Technological Universz zy Outline Online Index Construction and Maintenance Fast Online Index Construction by Geometric Partitioning Efficient Online Index Maintenance by Dynamic Balancing Tree Metadata Search in LargeScale Systems Fast Scalable Metadata Search Ranking in File Systems Summary m Tennessee Technological Universizj Online Index Construction and Maintenance Fast OnLine Index Construction by Geometric Partitioning Nicholas Lester Alistair Moffat and Justin Zobel CIKM 05 Proceedings of the 14th ACM international conference on Information and knowledge management Efficient Online Index Maintenance for Dynamic Text Collections by Using Dynamic Balancing Tree Ruijie Guo Xueqi Cheng Hongbo Xu and Bin Wang CIKM 07 Proceedings of the sixteenth ACM conference on Conference on information and knowledge management EU Tennessee Technological Universizj Problems In a truly dynamic search environment documents may be added to and removed from the collection at any point in time How to build an inverted index when the underlying data still must be continuously queryable Eg News search products search and file system search EU Tennessee Technological Universizj Challenges The complexity of index construction A range of tensions are introduced such as querying throughput document insertionremove rate and disk space Insertion vs querying operation How to balance insertion and querying operations Scale amp Performance iiU39 Tennessee Technological Universizj Solutions A geometric partitioning mechanism that offers a range of tradeoffs between costs and can be adapted to different balances of insertion and querying operations is proposed by Lester et al I A dynamic balancing tree DBT is proposed by Guo et al to achieve a good indexing and query processing performance by always merging indices with similar sizes m Tennessee Technological Universizj Offline Index construction and Maintenance Techniques lnplace Contiguous ondisk posting lists maximize query processing performance but require frequent relocations of most lists in the index Noncontiguous ondisk posting lists increase index maintenance performance but decrease query processing performance Remerge This strategy maximizes the query processing performance however for every merge event the entire index must be processed m Tennessee Technological Universizj Geometric Partitioning The key idea oz Break the index into a tightly controlled number of partitions At any given point the index is the concatenation of the set of partitions Each partition is a partial index for a contiguous subset of the documents in the collection O inverted lists Partition 3 Partition 2 Partition 1 old documents new documents largest smallest iiU39 Tennessee Technological Universizj The key issue How best to manage the sequence of mergings so as to minimize the total merging cost without allowing the number of partitions to grow excessively Requirements If a bufferload contains b pointers the first partial index cannot exceed r1b pointers where k is a parameter In general the kth partial index is no more than r1 quot7b pointers At level k the partition is either empty or contains at least rk397b pointers 10 w Tennessee Technological University Partition 3 Partition 2 Partition 1 Partition 0 l I 0 0 0 i m l 1quot 1 quot391 J kg 1 l l I ll 1 ll V4 2 l I e l I 1 II 1 J 3 If gt i 1 1 1 i l 1 1 1 I 4 aw 2 139 1 I Iquot l 6 0 l 1 1 1 Ni l 1 1 l 1 at 1 I 2 l I IV quot 39 l n A 9 f 0 0 l Bufferloads of index painters generated by the in memory process The merging pattern established when r 3 After nine bufferloads have been generated by the inmemory part of the index process the first index is placed into partition 3 All numbers listed represent multiples of b the size of each bu e oad 11 m Tennessee Technological University r Build cost Access cosl 2 039 35 Small r is likely to s Cl3l 31 harm query cost 4 032 29 but helpful for the 6 034 35 insertion operation 035 24 12 0391 22 16 042 20 Table l index construction cosl trillions oi operations calcu lsled39 and He expected number of disllt accesses per query term calculated ldrn 10 l 10311 3 x 1I J 7si1ddiiieredl values of r 12 w Tennessee Technological University Partition 2 Partition 1 Partition 0 0 O 5 Fix the number of partitions j g p and determine r g accordingly rquot quotx El Kg Querying performance vs 39 x1 l 71339 1 construction performance 4 1quot 0 i g 4 1 I g 3 gt 4quot 39 a 1 4quot I 3 l 111 21 g X 2 I 398 v v quot E g 7 5quot o Vii17 g EL quot 7quot ca 1 r 2 l L111 12 2 e Vquot m 1 quotquotquot quot i 1 3 quotquot I 11 5quot o I 13 EB Tennessee Technological University Evaluation Construction Performance Time kiloseconda 150 5Q rem erge geometric r3 39 geometric p22 2 bulk construction 39 4 Pointers billions Figure 4 Time required to build an index for the 100g collec tion using 99 hufferloads each containing approximately 200 MB of pointers The line denoted bulk constructionquot shows the com parable cost of an offline index construction algorithm for that amount of data 14 EB Tennessee Technological University Evaluation Querying Performance Stacunde per query 05 geometrim r23 U E gammatrim PE r remerge r rl 104 I x I I F I kit 3 P2 MEL M I 39 I I Pointers bil liuns 15 EU Tennessee Technological Universizj Dynamic Balancing Tree DBT not only deals with document insertion but also takes care of document deletion The impact of document deletion A large number of posting lists of term have to be decompressed and reconstructed The complexity of index maintenance is largely increased 16 liU39 Tennessee Technological Universizj A DBT is an mway tree Each node of which is a subindex The nodes of the tree is divided into H layers At layer k the number of nodes is either zero or is less than m m gt 2 The node in layer k1 is roughly 0 times bigger than the node in layer k Any two nodes in each layer are not significantly in size If a DBT is balanced and subindex merge operation is only performed on one layer When a tree is unbalanced a node is pushed down to a lower layer to make the tree to be balanced Garbage Collection a bitmap is used to identify the deleted documents and filter out those deleted documents at query processing time A threshold p is checked to determine whether a garbage collection is integrated into the merging process or not 17 3 f 2 TI 2 2 d Eggs magi E um Em E 2 2 a z 2 2 Hu x g cag A n is A in H m 5525 353 25 3 E E 352533 a 35 HAHHAV HH c 53 Hs m H r V 1 2i a f E ME 3 a i a a m2 33 E E5 H c H o 3 Nw mw ES 3 H m NEE HE 3E bEmgtEb NSBMQNeS mN mmwnmtt qiEJ Tennessee Technological University Evaluation Index maintenance Performance 850 o IDET mL lsLp l p 300 m InnnmwLsLpM A U U I I 39DDTtmx 5r1pr 9 a 800 quot39Dquot39LIy thmicMnge lg H0 Ilulll trhh E 39 quotquotquot39 ezln1ric tn1mm E 5 ul G 39 I C n I t 39v IL 0 3 39 E lquot g 100 750 Ef uh h 392 u 39 39 X m 39 U 39 a x E 2 t m 2 150 39 3 33 1 39I l39 I 700 a t 55 E 72 t a 100 39 o 39DBTIch3sF0H x quot5 Er x 39DBTID FL 3SLF05 396 650 TE 39DBTmc3 slH9 E E E I I In I I ILngarithmic Marge r I0 I I I I I IGmmUic Partitioning I I I I U l l D 011 03 05 07 09 D 01 03 0 07 09 Relative number of deleted documents Rulatwe number of 1116th documents tunnde maintenance performance b Query processingper rrmancs 19 3 Tennessee Technological University Evaluation Indexing performance Table 1 Exact reduetien percentage of indexing perfernlanee fer the results shown in Figure 4 cem pare to Logarithmic B Ierge and Gemnetric Par t it inning st rategy i garbage cellectien The reductien is caused by Leg I w39Ierge Gee Part I B T h Ierge m 13 3 s 1 a 1 1 1 5 5 1 2 3 DEBT M erge m 2 623 s 1 pl l 1 1 9 1 3931 1 E 1 BEST h Ierge I m r 5 47115 1 39j 5 1 DB 391 Itierge r39r12r3s 11 p 3 5391 1 7 Q 335 DBT IVIerge m C 3 5 1 p 2119 1 D 1 1 DBT h IEI QEIW12E3S11p 9 26 20 iiU39 Tennessee Technological Universz zy Comments Good solutions to address online index construction and maintenance issues I The evaluations seem insufficient Most experiments are based on static setting However it would be better to check the performance of the two solutions on frequently changed document insertiondeletion rate and query rate 21 lie Tennessee Technological Universz zy Outline Online Index Construction and Maintenance Fast Online Index Construction by Geometric Partitioning Efficient Online Index Maintenance by Dynamic Balancing Tree Metadata Search in LargeScale Systems Fast Scalable Metadata Search Ranking in File Systems Summary 22 m Tennessee Technological Universizj Metadata Search in LargeScale Systems Spyglass Fast Scalable Metadata Search for LargeScale Storage Systems Andrew W Leung Minglong Shao Timothy Bisson Shankar Pasupathy and Ethan L Miller FAST 09 7th USENIX Conference on File and Storage Technologies 23 m Tennessee Technological Universizj P ro blems The scale of today s storage systems is difficult to manage Where are the most recently modified files Which files can be moved to second tier storage Which application s and user s file are consuming the most space I There is no good solutions to the above issue Existing enterprise search tools Apple Spotlight server FAST Microsoft Google Enterprise Too expensive slow and cumbersome 24 EU Tennessee Technological Universizj Metadata Search Metadata is data about data oz File create time oz File size File permission Metadata search involves indexing file metadata allows point range topk and aggregation search over file properties Facilitate complex ad hoc queries about the files being stored 25 m Tennessee Technological Universizj Searching Challenges Cost amp Resources Existing solutions require dedicated hardware Metadata Collection Slow to gather metadata from the storage system Scale amp Performance Search and update is not fast and scalable 26 m Tennessee Technological Universnj How to address those challenges Leverage metadata search properties Metadata query properties Conducted user survey over 30 users and IT administrators File metadata properties Collected and analyzed file metadata snapshots from realworld storage systems at NetApp 27 m Tennessee Technological Universizj Search and Metadata Characteristics Search Characteristics Multiple attributes 9 Use all attributes in a query execution Localized 9 Include namespace knowledge Backintime 9 Version index change Metadata Characteristics Spatial locality 9 Allow index control using namespace Skewed frequency 9 Query execution using intersections 28 m Tennessee Technological University Spyglass Overview f Storage New metadata search system techniques query 39 results Hierarchical partitioning Signature files Snapshotbased metadata collection method 39 fraigiiig 39 Partition versioning i v Internal File System k J 29 m Tennessee Technological University Hierarchical Partitioning To exploit metadata locality and improve scalability the Spyglass index is partitioned into a collection of separate smaller indexes Hierarchical partitioning is based on the storage system s namespace Separate parts of the namespace is encapsulated into separate partitions Partitions are stored sequentially on the disk U Spyglass index m 0 5quot layout 30 lie Tennessee Technological Universz zy More Information Spyglass partitions are kept small on the order of 100000 files Each partition is a subtree partition the Spyglass index is a tree of subtree partitions that reflects the hierarchical ordering of the storage namespace Each partition has a main partition index in which file metadata for the partition is stored Partition metadata keeps information about the partition and pointers to child partitions 31 EU Tennessee Technological Universizj Signature Files Searching performance is tied to the number of partitions searched how do we determine which partition to search Signature files are bit arrays that serve as compact summaries of a partition s contents exploit metadata locality to prune a query s search space a signature file and an associated hashing function are created for each attribute indexed in the partition 32 m Tennessee Technological Universizj hash le extension mod b I hash le size 1 9 1 391 quot9 Qquot 39 w l 1 t l t t t t iv doc xls C W I h mpgmov ltl l4 53I 32 I28 256 IOOMB gt5ooMB Ppt ipg Ii 255 5 I 500MB Only search a partition if all tested bits are 1 False positive can be reduced by Vary the size of signature file Change the hashing function The number of signature files that have to be tested can be reduced by utilizing the tree structure of the Spyglass index 33 3 Tennessee Technological Universizj Metadata Collection The Spyglass crawler takes advantage of NetApp Snapshot technology in the NetApp WAFL file system Spyglass calculates the difference between two snapshots which represents all of the file metadata changes between the two snapshots Only the metadata of changed files is recrawled Snapshot 1 Snapshot 2 Blow Blocks Block Block4j Blocks lnode Change 34 qiEJ Tennessee Technological University Partition Versioning To improve update performance and enable backintime search Spyglass uses a technique called partition versioning Batches index updates Treat each batch as a new wider I m I l I quotquotquot 39 I I I TO I I 7 I Baseline Incremental Index Indexes incremental index version 35 lie Tennessee Technological Universizj Evaluation The Spyglass prototype is implemented as a userspace process on Linux An RPCbased interfaces to WAFL gathers metadata changes using the snapshotbased crawler Metadata collection performance Index update performance Search performance 36 m Tennessee Technological University Metadata Collection Performance Spyglass s snapshotbased crawler SB vs a strawman design SM 0 Baseline crawl 0 Incremental crawl Reads entire inode file 0 2 5 and 0 change IOX faster that SM Finishes in less than 45 min 350 I I I I 350 I I I I 300 300 g E 253 g 250 gr 3 200 200 If gt 4 h E 150 mx 2 15 quot r i l 100 100 I 50 WWII 5390 g 0 ELMquotquoti39ITE EH 0 0 20 40 60 so 100 0 2 40 f0 80 1 00 Files Millions Flles Millions SM10 SB10 EI SM l SM5 SB5 SB SM2 SB2 e 111 Tennessee Technological University Index update performance Spyglass vs PostgreSQL MySQL 250000 25000 2500 250 Update Time s 25 Web Eng Home l Spyglass I PostgreSQL Table I MySQL Table I PostgreSQL Index I MySQLlndex 38 3 Tennessee Technological University Search Performance Spyglass vs PostgreSQL MySQL Set Search Metadata Attributes Set 1 Which user and application les consume the most space Sum sizes for les using owner and ext Set 2 How much space in this part do les from query 1 consume Use query 1 with an additional directmy path I L 13 u was 0 quot3 08 quot is ifquot if 393 08 quot 03 E5 06 39 E5 06 O kii E 04 I E 04 E n i quotquotquot quot I 0 39 100ms13 53 103 255 1003 100msis 53 103 253 1003 E SpyglassW Postgres 39539 MySQI Spyglassm Postgres e MySQI Query Execution Time Query Execution Time 39 lie Tennessee Technological Universizj Comments I A good solution is presented to address metadata search issues in large scale systems I The guide of the design is convincible However the discussion only focused on indexing and crawling on a single storage server did not mention about index distribution 40 lie Tennessee Technological Universz zy Outline Online Index Construction and Maintenance Fast Online Index Construction by Geometric Partitioning Efficient Online Index Maintenance by Dynamic Balancing Tree Metadata Search in LargeScale Systems Fast Scalable Metadata Search Ranking in File Systems Summary 41 m Tennessee Technological Universizj Ranking in File Systems Efficiency vs Effectiveness in TerabyteScale Information Retrieval Stefan Buttcher and Charles L A Clarke TREC Text REtrieval Conference TREC 2005 Proceedings On ranking techniques for desktop search Sara Cohen Carmel Domshlak and Naama Zwerdling ACM Transactions on Information Systems T018 2008 42 lie Tennessee Technological Universizj Problem Users trend to store huge amounts of files of various formats as a result finding a specific desired file within the file system is a challenging task Efficiency vs Effectiveness 43 iiU39 Tennessee Technological Universizj Techniques Buttcher et al integrated term proximity into the scoring function to improve the search results and used index pruning to increase retrieval efficiency Cohen et al exploited different ranking techniques for file system z Basic ranking techniques z Two learningbased ranking schemes z A novel ranking technique 44 lie Tennessee Technological Universz zy Outline Online Index Construction and Maintenance Fast Online Index Construction by Geometric Partitioning Efficient Online Index Maintenance by Dynamic Balancing Tree Metadata Search in LargeScale Systems Fast Scalable Metadata Search Ranking in File Systems Summary 45 lie Tennessee Technological Universizj Summary Reviewed two Online Index Construction and Maintenance geometric partitioning and dynamic balancing tree Studied a design of a metadata search system Spyglass Reviewed techniques of ranking in file systems 46 lie Tennessee Technological University References Fast OnLine Index Construction by Geometric Partitioning Nicholas Lester Alistair Moffat and Justin Zobel CIKAI 05 Proceedings ofthe 14th ACM international conference on Information and knowledge management Efficient OnIine Index Maintenance for Dynamic Text Collections by Using Dynamic Balancing Tree Ruijie Guo Xueqi Cheng Hongbo Xu and Bin Wang CIKAI 07 Proceedings of the sixteenth ACM conference on Conference on information and knowledge management Spyglass Fast Scalable Metadata Search for LargeScale Storage Systems Andrew W Leung Minglong Shao Timothy Bisson Shankar Pasupathy and Ethan L Miller FAST 09 7th USENLX Conference on File and Storage Technologies Efficiency vs Effectiveness in TerabyteScale Information Retrieval Stefan Buttcher and Charles L A Clarke TREC Text REtrieval Conference TREC 2005 Proceedings On ranking techniques for desktop search Sara Cohen Carmel Domshlak and Naama Zwerdling ACM Transactions on Information Systems TOIS 2008 47
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'