Class Notes for CMPSCI 645 at UMass (2)
Hash-Based Indexes
UMass Amherst, March 6, 2008
Slides courtesy of R. Ramakrishnan and J. Gehrke

Introduction

- As for any index, there are 3 alternatives for data entries k*:
  - the actual data record, with search key value k
  - <k, rid of data record with search key value k>
  - <k, list of rids of data records with search key k>
  The choice is orthogonal to the indexing technique.
- Hash-based indexes are best for equality selections; they cannot support range searches.
- Static and dynamic hashing techniques exist, with trade-offs for dynamic data.

Static Hashing

[Figure: h(key) mod N directs each data entry to one of N primary bucket pages (0 .. N-1), each with a chain of overflow pages.]

- h(k) mod N = the bucket to which the data entry with key k belongs. Distinct keys k1 != k2 can lead to the same bucket.
- Buckets are static (N is fixed):
  - Primary pages are allocated sequentially and never deallocated.
  - Overflow pages are added if needed.

Static Hashing (Contd.)

- The hash function works on the search key field of record r. It must distribute values over the range 0 .. N-1.
  - h(key) = (a * key + b) mod N usually works well; a and b are constants, and a lot is known about how to tune h.
- Buckets contain data entries.
- Long overflow chains can develop and degrade performance. Extendible and Linear Hashing are dynamic techniques to fix this problem.

Extendible Hashing

- Situation: a bucket (primary page) becomes full. Why not reorganize the file by doubling the number of buckets? Because reading and writing all pages is expensive.
- Idea: use a directory of pointers to buckets. Double the number of buckets by (1) doubling the directory and (2) splitting just the bucket that overflowed.
  - The directory is much smaller than the file, so doubling it is much cheaper.
  - Only one page of data entries is split. No overflow pages.
  - The trick lies in how the hash function is adjusted.

Example

[Figure: a directory of size 4 (global depth D = 2) points to buckets A = {4, 12, 32, 16}, B = {1, 5, 21, 13}, C = {10}, D = {15, 7, 19}; each bucket carries a local depth L.]

- The directory is an array of size 4, so global depth D = 2. Each bucket has a local depth L.
- To find the bucket for r: compute h(r) and take its last "global depth" bits. E.g., if h(r) = 5 = binary 101, take the last 2 bits (01) and go to the bucket pointed to by directory entry 01.
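The last-bits lookup described above can be sketched in a few lines of Python. This is a toy, assuming the example's identity hash h(k) = k; the bucket contents come from the figure, and `find_bucket` is an illustrative name, not code from the slides:

```python
GLOBAL_DEPTH = 2

# Directory entry i points to the bucket for hash values whose last
# GLOBAL_DEPTH bits equal i (00, 01, 10, 11), per the example figure.
bucket_A = [4, 12, 32, 16]   # last 2 bits of h(k) are 00
bucket_B = [1, 5, 21, 13]    # last 2 bits are 01
bucket_C = [10]              # last 2 bits are 10
bucket_D = [15, 7, 19]       # last 2 bits are 11
directory = [bucket_A, bucket_B, bucket_C, bucket_D]

def find_bucket(key, directory, global_depth, h=lambda k: k):
    """Use the last `global_depth` bits of h(key) as a directory index."""
    index = h(key) & ((1 << global_depth) - 1)
    return directory[index]

# h(5) = 5 = binary 101; the last 2 bits are 01, so 5 lives in bucket B.
assert find_bucket(5, directory, GLOBAL_DEPTH) is bucket_B
```

Note that the directory never stores data entries itself; it is purely an array of pointers, which is what makes doubling it cheap.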
Inserts

- If the bucket is full, split it: allocate a new page and redistribute the entries.
- If necessary, double the directory. Whether to double can be decided by comparing the global depth with the local depth of the split bucket: double the directory if local depth = global depth; don't otherwise.

Insert r with h(r) = 20

[Figure: the directory of size 4 (global depth 2) with buckets A = {4, 12, 32, 16}, B = {1, 5, 21, 13}, C = {10}, D = {15, 7, 19}; bucket A is full, so inserting h(r) = 20 causes doubling.]

Insert h(r) = 20: Causes Doubling

[Figure: after the insert, the directory has size 8 (global depth 3); bucket A = {32, 16} and its 'split image' A2 = {4, 12, 20} both have local depth 3, while B, C, D keep local depth 2.]

Points to Note

- 20 = binary 10100. The last 2 bits (00) tell us that r belongs in A or A2; the last 3 bits are needed to tell which.
  - Global depth of the directory: the max number of bits needed to tell which bucket an entry belongs to.
  - Local depth of a bucket: the number of bits used to determine whether an entry belongs to this bucket.
- When does a bucket split cause directory doubling? Before the insert, the local depth of the bucket = global depth. The insert causes the local depth to become > global depth, so the directory is doubled by copying it over and fixing the pointer to the split-image page.
- Use of the least significant bits enables efficient doubling via copying of the directory.

Directory Doubling

[Figure: doubling a directory from size 4 to size 8 when inserting an entry with hash binary 1000, shown once indexing by least significant bits and once by most significant bits; with least significant bits, the new directory is simply two copies of the old one.]

Comments on Extendible Hashing

- If the directory fits in memory, an equality search is answered with one disk access; else two.
  - Example: a 100 MB file with 100-byte records and 4 KB pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory.
  - The directory grows in spurts, and if the distribution of hash values is skewed, the directory can grow large.
  - Entries with the same key value (duplicates) need overflow pages!
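The full insert logic above, including the split rule and directory doubling, can be sketched as a small in-memory Python class. This is a toy with illustrative names (not the slides' code), assuming h(k) = k and a capacity of four data entries per bucket; starting from an empty table, inserting 4, 12, 32, 16 and then 20 reproduces the example's end state, with A = {32, 16} and split image A2 = {4, 12, 20}:

```python
class ExtendibleHash:
    """Minimal extendible-hashing sketch. Buckets hold at most
    `capacity` entries; the hash function is the identity, h(k) = k."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.global_depth = 1
        self.directory = [{"local_depth": 1, "entries": []},
                          {"local_depth": 1, "entries": []}]

    def _index(self, key):
        # The last global_depth bits of h(key) pick the directory slot.
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket["entries"]) < self.capacity:
            bucket["entries"].append(key)
            return
        # Double the directory only when the overflowing bucket's local
        # depth already equals the global depth (the rule on the slide).
        if bucket["local_depth"] == self.global_depth:
            self.directory += self.directory      # copy the directory over
            self.global_depth += 1
        # Split the bucket: one more distinguishing bit, plus a split image.
        bucket["local_depth"] += 1
        bit = 1 << (bucket["local_depth"] - 1)
        image = {"local_depth": bucket["local_depth"], "entries": []}
        for i in range(len(self.directory)):
            # Fix the directory pointers that now reference the split image.
            if self.directory[i] is bucket and (i & bit):
                self.directory[i] = image
        old, bucket["entries"] = bucket["entries"], []
        for k in old:                             # redistribute on the new bit
            (image if k & bit else bucket)["entries"].append(k)
        self.insert(key)                          # retry; may split again
```

The recursive retry matters: a split does not always separate the entries (e.g. 4, 12, 32, 16 all agree on their last two bits), so one insert may trigger several splits and doublings in a row.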
Delete

- Deletion removes a data entry from its bucket.
- If the bucket becomes empty, it can be merged with its split image.
- If each directory element points to the same bucket as its split image, the directory can be halved.

Summary

- Hash-based indexes are best for equality searches; they cannot support range searches.
- Static hashing can lead to long overflow chains.
- Extendible hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it (but duplicates may still require overflow pages).
  - A directory keeps track of the buckets and doubles periodically.
  - It can get large with skewed data; there is an additional I/O if it does not fit in main memory.

Tree-Structured Indexes
CMPSCI 645, Mar 6, 2008
Slides courtesy of R. Ramakrishnan and J. Gehrke

Review

- As for any index, there are 3 alternatives for data entries k*:
  - the actual data record, with search key value k
  - <k, rid of data record with search key value k>
  - <k, list of rids of data records with search key k>
- The choice is orthogonal to the indexing technique used to locate data entries k*.
- Tree-structured indexing techniques support both range searches and equality searches.

B+ Tree: The Most Widely Used Index

- Inserts and deletes keep the tree height-balanced: log_F N cost, where F = fanout and N = number of leaf pages.
- Minimum 50% occupancy (except for the root). Each node contains d <= m <= 2d entries, where d is called the order of the tree.
- Supports equality and range-searches efficiently.

[Figure: index entries at the top direct the search; the data entries at the leaf level form the "sequence set".]

Example B+ Tree

- Search begins at the root, and key comparisons direct it to a leaf.
- Search for 5*, for 15*, and for all data entries >= 24*.

[Figure: root <13, 17, 24, 30> over leaves <2* 3* 5* 7*>, <14* 16*>, <19* 20* 22*>, <24* 27* 29*>, <33* 34* 38* 39*>.]

- Based on the search for 15*, we know it is not in the tree.

B+ Trees in Practice

- Typical order: 100 (so up to 200 entries per node). Typical fill-factor: 67%, giving an average fanout of 133.
- Typical capacities:
  - Height 4: 133^4 = 312,900,721 records
  - Height 3: 133^3 = 2,352,637 records
- Can often hold the top levels in the buffer pool:
  - Level 1 = 1 page = 8 KB
  - Level 2 = 133 pages = 1 MB
  - Level 3 = 17,689 pages, roughly 138 MB
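The fanout arithmetic above is worth checking directly; a few lines of Python confirm the capacities and the size of each cached level (assuming the slides' figures of average fanout 133 and 8 KB pages):

```python
FANOUT = 133            # order 100 at a 67% fill factor
PAGE_BYTES = 8 * 1024   # 8 KB pages

# Records reachable through a tree of the given height:
assert FANOUT ** 3 == 2_352_637        # height 3
assert FANOUT ** 4 == 312_900_721      # height 4

# Pages at each of the top levels, if cached in the buffer pool:
pages_per_level = [FANOUT ** level for level in range(3)]
assert pages_per_level == [1, 133, 17_689]

level3_mb = pages_per_level[2] * PAGE_BYTES / 2**20
print(f"Level 3 occupies about {level3_mb:.0f} MB")   # about 138 MB
```

The point of the arithmetic: caching only three small levels (about 140 MB) lets a lookup over hundreds of millions of records cost a single disk I/O for the leaf.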
Inserting a Data Entry into a B+ Tree

- Find the correct leaf L.
- Put the data entry onto L.
  - If L has enough space, done!
  - Else, must split L into L and a new node L2:
    - Redistribute entries evenly, copy up the middle key.
    - Insert an index entry pointing to L2 into the parent of L.
- This can happen recursively. To split an index node, redistribute entries evenly, but push up the middle key (contrast with leaf splits).
- Splits "grow" the tree; a root split increases the height. Tree growth: it gets wider, or one level taller at the top.

Previous Example, Inserting 8*

[Figure: root <13, 17, 24, 30> over leaves <2* 3* 5* 7*>, <14* 16*>, <19* 20* 22*>, <24* 27* 29*>, <33* 34* 38* 39*>.]

Inserting 8* into the Example B+ Tree

- Observe how minimum occupancy is guaranteed in both leaf and index page splits.
- Note the difference between copy-up and push-up; be sure you understand the reasons for this.

[Figure: the overfull leaf <2* 3* 5* 7* 8*> splits into <2* 3*> and <5* 7* 8*>; the entry 5 is to be inserted into the parent node. Note that 5 is copied up and continues to appear in the leaf.]

[Figure: the overfull index node <5, 13, 17, 24, 30> splits into <5, 13> and <24, 30>; the entry 17 is to be inserted into the parent node. Note that 17 is pushed up and appears only once in the index. Contrast this with a leaf split.]

Example B+ Tree After Inserting 8*

[Figure: new root <17> over index nodes <5, 13> and <24, 30>, over leaves <2* 3*>, <5* 7* 8*>, <14* 16*>, <19* 20* 22*>, <24* 27* 29*>, <33* 34* 38* 39*>.]

- Notice that the root was split, leading to an increase in height.
- In this example, we could avoid the split by redistributing entries between siblings, but that is not usually done in practice.

Deleting a Data Entry from a B+ Tree

- Start at the root, find the leaf L where the entry belongs.
- Remove the entry.
  - If L is at least half-full, done!
  - If L has only d-1 entries:
    - Try to redistribute, borrowing from a sibling (an adjacent node with the same parent as L).
    - If redistribution fails, merge L and the sibling.
- If a merge occurred, must delete the entry (pointing to L or the sibling) from the parent of L.
- A merge could propagate to the root, decreasing the height.

Current B+ Tree: Delete 19*, Then Delete 20*

[Figure: root <17> over index nodes <5, 13> and <24, 30>, over leaves <2* 3*>, <5* 7* 8*>, <14* 16*>, <19* 20* 22*>, <24* 27* 29*>, <33* 34* 38* 39*>.]

Example Tree After Inserting 8*, Then Deleting 19* and 20*

[Figure: root <17> over index nodes <5, 13> and <27, 30>, over leaves <2* 3*>, <5* 7* 8*>, <14* 16*>, <22* 24*>, <27* 29*>, <33* 34* 38* 39*>.]

- Deleting 19* is easy.
- Deleting 20* is done with redistribution; notice how the middle key is copied up.
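The contrast between copy-up and push-up can be made concrete with two tiny helper functions operating on nodes represented as sorted lists. This is an illustrative sketch (the names `split_leaf` and `split_index` are not from the slides), replayed on the example's insertion of 8*:

```python
def split_leaf(entries):
    """Leaf split: redistribute evenly; the middle key is COPIED UP,
    so it goes to the parent AND stays in the new right leaf."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]          # right[0] is the copied-up key

def split_index(keys):
    """Index split: redistribute evenly; the middle key is PUSHED UP,
    so it moves to the parent and appears in neither half."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]

# Inserting 8* overfills the leaf <2* 3* 5* 7*>:
left, right, copied = split_leaf([2, 3, 5, 7, 8])
assert (left, right, copied) == ([2, 3], [5, 7, 8], 5)   # 5 stays in a leaf

# The parent <5, 13, 17, 24, 30> then overflows; 17 is pushed up
# into the new root and appears only once in the index:
l, r, pushed = split_index([5, 13, 17, 24, 30])
assert (l, r, pushed) == ([5, 13], [24, 30], 17)
```

The reason for the difference: every data entry must remain reachable at the leaf level (so leaf keys are copied, never removed), whereas index keys only direct traffic and need appear only once.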
And Then Deleting 24*

- Must merge.
- Observe the 'toss' of the index entry (on the right) and the 'pull down' of the index entry (below).

[Figure: the leaves <22*> and <27* 29*> merge into <22* 27* 29*>, tossing the index entry 27; the root entry 17 is then pulled down, giving a one-level-shorter tree with root <5, 13, 17, 30> over leaves <2* 3*>, <5* 7* 8*>, <14* 16*>, <22* 27* 29*>, <33* 34* 38* 39*>.]

Example of Non-leaf Redistribution

- The tree is shown below during the deletion of 24*.
- In contrast to the previous example, we can redistribute an entry from the left child of the root to the right child.

[Figure: root <22> over index nodes <5, 13, 17, 20> and <30>, during the deletion of 24*.]

After Redistribution

- Intuitively, entries are redistributed by 'pushing through' the splitting entry in the parent node.
- It suffices to redistribute the index entry with key 20; for illustration, we've redistributed 17 as well.

[Figure: root <20> over index nodes <5, 13, 17> and <22, 30>, over leaves including <17* 18*>, <20* 21*>, <22* 27* 29*>, <33* 34* 38* 39*>.]

Prefix Key Compression

- Important to increase fan-out. (Why?)
- Key values in index entries only 'direct traffic'; we can often compress them.
  - E.g., if we have adjacent index entries with search key values 'Dannon Yogurt', 'David Smith', and 'Devarakonda Murthy', we can abbreviate 'David Smith' to 'Dav'. (The other keys can be compressed too.)
  - Is this correct? Not quite! What if there is a data entry 'Davey Jones'? Then we can only compress 'David Smith' to 'Davi'.
  - In general, while compressing, we must leave each index entry greater than every key value in any subtree to its left.
- Insert and delete must be suitably modified.

[Figure: an index entry 'David Smith' between a left subtree containing 'Daniel Lee' and a right subtree containing 'Devarakonda': compress to 'Dav' or 'Davi'?]

Bulk Loading of a B+ Tree

- If we have a large collection of records and we want to create a B+ tree on some field, doing so by repeatedly inserting records is very slow.
- Bulk loading can be done much more efficiently.
- Initialization: sort all data entries, and insert a pointer to the first (leaf) page into a new (root) page.

[Figure: a new root pointing to sorted pages of data entries not yet in the B+ tree: <3* 4*>, <6* 9*>, <10* 11*>, <12* 13*>, <20* 22*>, <23* 31*>, <35* 36*>, <38* 41*>, <44*>.]

Summary of Bulk Loading

- Option 1: multiple inserts.
  - Slow, and does not give sequential storage of leaves.
- Option 2: bulk loading.
  - Has advantages for concurrency control.
  - Fewer I/Os during the build.
  - Leaves will be stored sequentially and linked (of course).
  - Can control the fill factor on pages.
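The compression rule, "each index entry must stay greater than every key value in any subtree to its left," can be expressed as a small helper that finds the shortest usable prefix. This is an illustrative sketch (the name `shortest_separator` is not from the slides), replayed on the slide's examples:

```python
def shortest_separator(left_max, right_min):
    """Shortest prefix of right_min that is still strictly greater than
    left_max (the largest key in the subtree to the left), so the
    compressed index entry keeps directing traffic correctly."""
    for i in range(1, len(right_min) + 1):
        prefix = right_min[:i]
        if prefix > left_max:      # lexicographic string comparison
            return prefix
    return right_min               # no shorter prefix works

# With 'Dannon Yogurt' to the left, 'David Smith' compresses to 'Dav' ...
assert shortest_separator("Dannon Yogurt", "David Smith") == "Dav"
# ... but with a data entry 'Davey Jones' there, only 'Davi' is safe.
assert shortest_separator("Davey Jones", "David Smith") == "Davi"
```

A prefix of right_min is always less than or equal to right_min, so the separator still admits every key in the right subtree; the only constraint that can fail is the one against the left subtree, which the loop checks.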
A Note on 'Order'

- The order (d) concept is replaced by a physical space criterion in practice ('at least half-full').
  - Index pages can typically hold many more entries than leaf pages.
  - Variable-sized records and search keys mean different nodes will contain different numbers of entries.
  - Even with fixed-length fields, multiple records with the same search key value (duplicates) can lead to variable-sized data entries (if we use Alternative (3)).

Summary

- Tree-structured indexes are ideal for range searches, and also good for equality searches.
- The B+ tree is a dynamic structure.
  - Inserts and deletes leave the tree height-balanced: log_F N cost.
  - High fanout (F) means the depth is rarely more than 3 or 4.
  - Almost always better than maintaining a sorted file.
  - Typically 67% occupancy on average.
  - If data entries are data records, splits can change rids!

Summary (Contd.)

- Key compression increases fanout and reduces height.
- Bulk loading can be much faster than repeated inserts for creating a B+ tree on a large data set.
- The B+ tree is the most widely used index in database management systems because of its versatility; it is one of the most optimized components of a DBMS.
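The bulk-loading idea summarized above, packing sorted entries into leaves and then building the levels above them bottom-up rather than inserting one record at a time, can be sketched as follows. This is a toy, assuming two data entries per leaf page and fanout 4; `bulk_load` is an illustrative name, and a real implementation would also record the smallest key under each child as a separator:

```python
def bulk_load(entries, leaf_size=2, fanout=4):
    """Pack sorted data entries into leaf pages, then repeatedly group
    nodes into parents until a single root remains. Returns the root
    (a nested list of pages) and the height of the tree."""
    entries = sorted(entries)                  # initialization: sort everything
    level = [entries[i:i + leaf_size]          # sequential, linked leaf pages
             for i in range(0, len(entries), leaf_size)]
    height = 0
    while len(level) > 1:
        # Each parent references up to `fanout` children; no per-record
        # inserts and no page splits are ever needed.
        level = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        height += 1
    return level[0], height

# The slide's sorted pages: 3 4 | 6 9 | 10 11 | 12 13 | 20 22 | 23 31 | ...
data = [3, 4, 6, 9, 10, 11, 12, 13, 20, 22, 23, 31, 35, 36, 38, 41, 44]
root, height = bulk_load(data)
assert height == 2 and root[0][0] == [3, 4]    # 9 leaves, 3 internals, 1 root
```

Because the leaves are written in sorted order in one pass, they come out sequential on disk, which is exactly the property repeated inserts fail to give.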