RELATNL DATABASE MGMT SYS
RELATNL DATABASE MGMT SYS CS 587
Popular in Course
Popular in ComputerScienence
This 45 page Class Notes was uploaded by Orrin Rutherford on Wednesday September 2, 2015. The Class Notes belongs to CS 587 at Portland State University taught by David Archer in Fall. Since its upload, it has received 50 views. For similar materials see /class/168303/cs-587-portland-state-university in ComputerScienence at Portland State University.
Reviews for RELATNL DATABASE MGMT SYS
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 09/02/15
Chapter 10 1 1 Indexes oz 10 Treestructured Indexes I Intuition for tree indexes I ISAM a static structure I B Trees I Operations I Search I Insert I Delete I Special cases I Key Compression I Bulk Loading I Order I Updates may change rids 3232009 PSU s CS 587 ozo 11 Hashbased Indexes Static Hashing Extendible Hashing I Directory I Insert I Delete Linear Hashing I When to split Extendible vs Linear Hashing Learning Objectives oz Describe the pros and cons of ISAM ozo Perform inserts and deletes on Btrees extndible and linear hash indexes oz For Btrees Describe algorithms for bulk loading and key compression ozo Explain why hash indexes are rarely used 3232009 PSU s CS 587 Indexes in Real DBMSs ozo SQLServer Oracle DB2 Btree only oz Postgres Btree Hash discouraged I gist 39 htt istcsberlltele edu 0 Generalized index type 0 Models RTrees and other indexes l 39 htt www5aimsusu me era oddmuse indexc i Gin primarily text search oz MySql Depends on storage engine Mainly Btree some hash Rtree ozo Bottom Line Hash indexes are rare gtvRTrees index rectangles and higher dimensional structures 3232009 PSU s CS 587 101 Intuition oz Find all students with gpa gt 30 If data is in sorted file d0 binary search to find first such student then scan to find others Cost of binary search on disk can be high ozo 2 Simple ideas I Create an index file I Use a large fanout F since each dereference is Page1 Pagez Page3 l lPageN Data File Can do binary search on smaller index le 3232009 PSU S CS 587 4 index entry Po K1P K2P 1 2 000 Kum ozo Index file may still e quite large But we can apply the idea repeatedly Nonleaf Pages 3 CM doll CM din Cl Overflow gt page Primary pages Leafpages contain data entries 3232009 PSU s CS 587 Review oflndexes oz As for any index 3 alternatives for data entries kquot I Data record with key value k I ltk rid of data record with search key value kgt I ltk list of rids of data records with search key kgt ozo Choice is orthogonal to the indexing technique used to locate data entries kquot oz Treestructured indexing techniques support both range searches and equality searches oz 18AM static structure B tree dynamic adjusts gracefully under inserts and deletes 3232009 PSU s CS 587 6 Comments on I SAM 3235 oz File creation Leaf data pages allocated sequentially sorted by search key then index Index Pages pages allocated then space for over ow pages oz Index entries ltsearch key value page idgt they direct search for data en tries in leaf pages Over ow Pages oz Search Start at root use lltey comparisons to go to leaf Cost oc log F N F entriesindex pg N leaf pgs ozo m Find leaf data entry belongs to and put it there I Perhaps in an overflow page oz Delete Find and remove from leaf if empty over ow page deallocate I Thus data pages remain sequential continguous Static tree structure insertsdeletes n ect only lenfpnges 3232009 PSU s CS 587 7 Example SAM Tree oz Each node can hold 2 entries no need for next leaf page pointers Why 3232009 PSU s CS 587 8 After Inserting 23 i 48 41 42 Root 1 l lll k Pages Primary v V Leaf 10 15ll20 27 3337 l4o46 I I51 55 l 3963 I97 Pages Pages 3232009 PSLYSC8587 9 Then Deleting 42 51 gt2 97 Root l lll V H V d 1015l20 27quot333Tquot4045 5563 Note that 51 appears in index levels but not in leaf 3232009 PSU S CS 587 10 Pros and Cons OfISAM ozo Cons I After many inserts and deletes long overflow chains can develop I Overflow records may not be sorted ozo Pros I Inserts and deletes are fast since there s no need to balance the tree I No need to lock nodes of the original tree for concurrent access I If the tree has had few updates then interval queries are fast 3232009 PSU s CS 587 11 B Tree Most Widely Used Iridex oz lnsert delete at log F N cost keep tree height bdldrlced F fanout N leaf pages ozo Minimum 50 occupancy except for root Each node contains 1 lt m lt 2d entries The parameter dis called the order of the tree I This ensures that the height is relatively small oz Supports equality and rangesearches efficiently Index Entries Direct search Data Entries quotSequence setquot 3232009 PSU s CS 587 12 Example B Tree oz Search begins at root and key comparisons direct it to a leaf as in ISAM oz Search for 5 15 all data entries gt 24 IBHWIIMIJWI Root I2 I3I5I7I I14I16I I I I19I20 22I I I24I27I29I I I33I34I38I39I Based on the searehfor 15 3 we know it is not in the tree 3232009 PSU S CS 587 13 Inserting a Data Entry into a 3 Tree ozo Find correct leaf L oz Put data entry onto L If L has enough space done Else must split L into L and a new node L2 39 Redistribute entries evenly copy up middle key 39 Insert index entry pointing to L2 into parent of L ozo This can happen recursively To split index node redistribute entries evenly but push up middle key Contrast with leaf splits ozo Splits grow tree root split increases height Tree growth gets wider or one level taller at top 3 23 2009 PSU s CS 587 14 Inserting 8 into Example B Tree 3 Observe hOW Note that 5 is copied up and minimum occupancy is A continues to appear in the leaf 2 3 5 guaranteed 1n I I both leaf and index p g splits er Entry to be inserted in parent node 7 8 3 Note difference Entry to be inserted in parent node between copy I 5 Note that 17 is pushed up and only appears once in the index Contrast up and pushup this with a leaf split b mew ilsileil l24i3 i Ii I I understand the I reasons for this 3232009 PSU s CS 587 15 Example B Tree After Inserting 8 m m N N N 23 578 1416 1912 22 242729 33343839 1 Notice that root was split leading to increase in height 3 In this example we can avoid split by redistributing entries however this is usually not done in practice 3232009 PSU s CS 587 16 Deleting a Data Entry from a 3 Tree oz Start at root find leaf L where entry belongs ozo Remove the entry If L is at least halffull done If L has only d1 entries 0 Try to redistribute borrowing from sibling adjacent node with same parent as L 0 If redistribution fails merge L and sibling oz If merge occurred must delete entry pointing to L or sibling from parent of L oz Merge could propagate to root decreasing height 3232009 PSU s CS 587 17 Example Tree After Inserting 8 Then Deleting 19 and 20 RookA m m m m A 23 578 1416 IIZZTWI 2729 33343839 ozo Deleting 19 is easy ozo Deleting 20 is done with redistribution Notice how middle key is copied up 3232009 PSU s CS 587 18 And Then Deleting 24 Mustmerge H so H H oz Observe toss of index entry on right A m I222729 llsss4ss39 and pull down of index entry below N 5 1 13H quotH sou m m V m m 23 578 1416 222729 33343839 3232009 PSU s CS 587 19 Example ofNonleafRedistribution ozo Tree is shown below during deletion of 24 What could be a possible initial tree oz In contrast to previous example can redistribute entry from left child of root to right Child Rock 1 w Tit m m m m A PM 578 1416 quot1181 2 21 221271291 331341381391 3 23 2009 PSU s CS 587 20 After Redistribution ozo lntuitively entries are redistributed by pushing through the splitting entry in the parent node oz lt suffices to redistribute index entry with key 20 we ve redistributed 17 as well for illustration Root quotN quotN N quotN quotN quotN 5 23 78 1416 quot1181 2 21 221271291 33343839 3 23 2009 PSU s CS 587 21 B Trees in Practice ozo Typical values for B tree parameters I Page size 8K I Key at most 8 bytes compression later I Pointer at most 4 bytes I Thus entries in index are at most 12 bytes and a page can hold at least 683 entries I Occupancy 67 so a page can hold at least 455 entries estimate that conservatively with 256 28 oz Top two levels often in memory Top level root of tree 1 page 8K bytes Next level 28 pages 28 23K bytes 2 Megabytes 3 23 2009 PSU s CS 587 22 BTrees vs Hash Indexes ozo A typical Btree height is 23 Height 0 supports 28 256 records Height 2 supports 224 32M records Height 3 supports 232 4G records ozo A Btree of height 23 requires 23 I Os Including one I O to access data Assuming top two levels are in memory Assuming alternative 2 or 3 ozo This is why DBMSs either don t support or don t recommend hash indexes on base tables Though hashing is widely used elsewhere 3232009 PSU s CS 587 23 Pre x Key Compression ozo Important to increase fanout Why ozo Key values in index entries only direct traffic can often compress them E g If we have adjacent index entries with search key values Damion Yogurt David Smith and Devarakonda Marthy we can abbreviate David Smith to Dav The other keys can be compressed too 39 Is this correct Not quite What if there is a data entry Davey 011857 Can only compress David Smith to Davi In general while compressing must leave each index entry greater than every key value in any subtree to its left oz lnsert delete must be suitably modified 3 23 2009 PSU s CS 587 24 Balk Loading of a 3 Tree ozo If we have a large collection of records and we want to create a B tree on some field doing so by repeatedly inserting records is very slow oz Balk Loading can be done much more efficiently oz Initialization Sort all data entries insert pointer to first leaf page in a new root page R0 Sorted pages of data entries not yet in B tree 34 691ok112j132042223431l35 3638141144 I PSU s CS 587 3232009 25 Bulk Loading Corlth m1 at a es alwa s Dt t p g Y 6 12 I 23 35 aaen rypages ozo Index entries for leaf entered into right not yet in 8 tree most index page just above leaf level When this fills up it splits Split may go up rightmost path to the root oz Much faster than repeated inserts V especially when one ll e H II 12 H H 23H H H sen considers locking Data entry pages not yet in B tree 3 23 2009 PSU s CS 587 26 Summary of Bulk Loading ozo Option 1 multiple inserts Slow Does not give sequential storage of leaves ozo Option 2 Bulk Loading Has advantages for concurrency control Fewer 1 Os during build Leaves will be stored sequentially and linked of course Can control fill factor on pages 3 23 2009 PSU s CS 587 27 A Note on Order oz Order 1 concept replaced by physical space criterion in practice at least halffull Index pages can typically hold many more entries than leaf pages Variable sized records and search keys mean different nodes will contain different numbers of entries Even with fixed length fields multiple records with the same search key value duplicates can lead to variablesized data entries if we use Alternative 3 23 2009 PSU s CS 587 28 1084 E ect 0f Inserts and Deletes 0n RIDs oz The text raises this problem I Suppose there is an index using alternative 1 I As happens with SQLServer and Oracle if a primary index is declared on a table I RIDs will change with updates and deletes 0 Why Splits and merges I Then pointers in other secondary indexes will be wrong I Text suggests that index pointers can be updated I This is impractical oz What do SQL Server and Oracle do ozo They use logical Rle in secondary indexes 3 23 2009 PSU s CS 587 29 Logical Pointers in Data Entries 9 O What is a logical pointer O 99 A primary key value 9 9 For example an Employee ID 9 9 Thus a data entry for an age index might be lt42C24gt I 42 is the age C24 is the ID of an employee aged 42 I To find that employee with age 42 must use the primary key index 9 O z This approach makes primary key indexes faster alternative 1 instead of 2 but secondary key indexes slower 3 23 2009 PSU s CS 587 30 11 Hashbased lndexes review oz As for any index 3 alternatives for data entries kquot I Data record with key value k I ltk rid of data record with search key value kgt I ltk list of rids of data records with search key kgt Choice orthogonal to the indexing technique oz Hashbased indexes are best for equality selections Cannot support range searches oz Static and dynamic hashing techniques exist tradeoffs similar to ISAM vs B trees 3232009 PSU s CS 587 31 Static Hashing ozo primary pages fixed allocated sequentially never deallocated over ow pages if needed oz hk mod M bucket to which data entry with key k belongs M of buckets hkey mod M Ml Primary bucket pages Overflow pages 3232009 PSU s CS 587 32 Static Hashing Cortth oz Buckets contain data entries ozo Hash fn works on search key field of record r Must distribute values over range 0 M1 hkey a key b usually works well a and b are constants lots known about how to tune h ozo Long over ow chains can develop and degrade performance Extendible and Linear Hashing Dynamic techniques to fix this problem 3232009 PSU s CS 587 33 112 Extendible Hashing oz Situation Bucket primary page becomes full Why not reorganize file by doubling of buckets Reading and writing all pages is expensive ld Use directory of pointers to buckets double of buckets by doublirlgthe directory splitting just the bucket that over owed Directory much smaller than file so doubling it is much cheaper Only one page of data entries is split No over ow page Trick lies in how hash function is adjusted 3232009 PSU s CS 587 34 LOCAL DEPTHl I B k tA Insert gt Example oz Directory is array of size 4 1 5 21 13 BUCKetB ozo To find bucket for r take 10 last global depth bits of 11 Bucketc hr we denote r by hr 39 If 117 5 binary 101 DIRECTORY it is in bucket pointed to 15 7 19 BUCK D by DATA PAGES ozo Insert If bucket is full it allocate new page redistribate oz If necessary double the directory As we will see splitting a bucket does not always require doubling we can tell by comparing global depth with local depth for the split bucket 3232009 PSU s CS 587 35 Insert hr20 Causes Doubling LOCAL 321 BucketA L CAL GLOBAL DEPTH GLOBAL DEPTH BucketA BucketB quotquot12 00 1 5 211 000 1 5 21 BucketB 01 m 001 10 11 BucketC 010 011 BucketC I 100 DIRECTORY 7 19 BucketB 101 2 110 15 7 19 BucketD 111 BucketAZ 4 12 20 splitimag DIRECTORY 4 1220 BucketAZ of Bucket A split image39 of Bucket A 3232009 PSU s CS 587 36 Points to Note oz 20 binary 10100 Last 2 bits 00 tell us r belongs in A or A2 Last 3 bits needed to tell which Global depth of directory Max of bits needed to tell which bucket an entry belongs to Local depth of a bucket of bits used to determine if an entry belongs to this bucket ozo When does bucket split cause directory doubling Before insert local depth of bucket global depth Insert causes local depth to become gt global depth directory is doubled by copying it over and fixing pointer to split image page Use of least significant bits enables efficient doubling via copying of directory 3232009 PSU s CS 587 37 Performance Deletions ozo If directory fits in memory equality search answered with one disk access else two 100MB file 100 bytes rec contains 1000000 records as data entries If pages are 4K then the file requires 25000 directory elements chances are high that directory will fit in memory Directory grows in spurts and if the distribution ofhash values is skewed directory can grow large Multiple entries with same hash value cause problems ozo Delete If removal of data entry makes bucket empty can be merged with split image If each directory element points to same bucket as its split image can halve directory 3232009 PSU s CS 587 38 113 Linear Hashing ozo This is another dynamic hashing scheme an alternative to Extendible Hashing ozo LH handles the problem of long over ow chains without using a directory and handles duplicates oz Id Use a family of hash functions ho h1 h2 hikey hkey mod21N N initial buckets h is some hash function range is not 0 to Nl If N 2510 for some d0 hi consists of applying h and looking at the last di bits where di d0 i h11 doubles the range of hi similar to directory doubling 3232009 PSU s CS 587 39 Linear Hashing Corlth ozo Directory avoided in LH by using over ow pages and choosing bucket to split roundrobin Splitting proceeds in rounds Round ends when all N R initial for round R buckets are split Buckets O to Next1 have been split Next to N R yet to be split Current round number is Level Search To find bucket for data entry r find hLevelr 0 If hLmlU in range Next to N R r belongs here 0 Else r could belong to bucket thelr or bucket hLmlr NR must apply hLevel1r to find out 3 23 2009 PSU s CS 587 40 Overview ofLII File ozo In the middle of a round Bucket to be split Next Buckets that existed at the beginning of this round R this is the range of hLevel gt 3232009 J J PSU s CS 587 Buckets split in this round If h Level search key value is in this range must use h Level search key value to decide if entry is in split image39 bucket split image39 buckets created through splitting of other buckets in this round 41 When to split oz Insert Find bucket by applying hLevel hLevem If bucket to insert into is full 0 Add over ow page and insert data entry 0 Maybe Split Next bucket and increment Next oz Can choose any criterion to trigger39 split oz Since buckets are split roundrobin long overflow chains don t develop ozo Doubling of directory in Extendible Hashing is similar switching of hash functions is implicit in how the of bits examined is increased 3 23 2009 PSU s CS 587 42 Example of Linear Hashing Level0 ozo Ons lit h is used to p 39 Level h h PRIMARY OVERFLOW redlstrlbute entrles 1 0 PAGES PAGES 4 N Level0 N4 000 00 DR h PRIMARY 000 00 010 10 10 39 01 a maxi 11 III 010 10 Primary 100 0 bucket page 011 11 y H II p t c c This info The actual contents is for illustration of the linear hashed only le 3232009 PSU s CS 587 43 I7I7 A89 8 snscl 600Z9Z9 ILIIS II III Izmsz 0 0 Izzws m 0I 0II wzus Is 10 101 III 5 00 6Z IL I n 01 99w 00I a 39 139 I917 HIII IL Isms K II IIO 1X9N Iv 0IISII99 0I 0I0 gt595 00 001 5 gt595 5 II IIO II SZ gt56 I0 100 ogl Ivs 0IISI 99 01 010 IZS 00 000 Szw 0 00 39ch 539 0II III AAO39IJEIEIAO AHVWIEIJ 00 000 09A91 1X9N SEIDVcI SESVJ 0q Iq AAO39IJEIEIAO AHVWIHJ a N c m H 9A91 punog 19 f0 pug aldmnxg LH Described as a Variant of EH oz The two schemes are actually quite similar Begin with an EH index where directory has N elements Use overflow pages split buckets roundrobin First split is at bucket 0 Imagine directory being doubled at this point But elements lt1N1gt lt2N2gt are the same So need only create directory element N which differs from 0 now 39 When bucket 1 splits create directory element N1 etc oz So directory can double gradually Also primary bucket pages are created in order If they are allocated in sequence too so that finding i th is easy we actually don t need a directory Voila LH 3 23 2009 PSU s CS 587 45