Class Note for CMPSCI 645 at UMass(1)
Class Note for CMPSCI 645 at UMass(1)
Popular in Course
Popular in Department
This 23 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 15 views.
Reviews for Class Note for CMPSCI 645 at UMass(1)
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 02/06/15
Overview of Storage and Indexing CMPSCI 645 Feb 28 2008 Slides Courtesy of R Ramakrishnan and Gehrke DBMS Architecture 3 Query Parser Query Rewriter Query Optimizer Query Executor Buffer Manager Disk Space Manager E5 Data 011 External Storage 4 Disks Can retrieve random page at fixed cost I But reading several consecutive pages is much cheaper than reading them in random order 4 Tapes Can only read pages in sequence I Cheaper than disks used for archival storage 4 Page Unit of information read from or written to disk I Size of page DBMS parameter 4KB 8KB oz Disk space manager I Abstraction a collection of pages I Allocate deallocate a page I Read write a page 3 Page 11 O I Pages read from disk and pages written to disk I Dominant cost of database operations Bu er Management 3 Architecture I Data is read into memory for processing I Data is written to disk for persistent storage oz Buffer manager stages pages between external storage and main memory buffer pool 3 Access method layer makes calls to the buffer manager Access Methods 4 Access methods routines to manage various diskbased data structures I Files of records I Various kinds of indexes a File of records I Important abstraction of external storage in a DBMS I Record id rid is sufficient to physically locate a record 4 Indexes I Auxiliary data structures I Given values in index search key fields find the record ids of records with those values File organizations 8 access methods Many alternatives exist each ideal for some situations and not so good in others Heap unordered files Suitable when typical access is a file scan retrieving all records Sorted Files Best if records must be retrieved in some order or only a range of records is needed Indexes Data structures to organize records via trees or hashing Like sorted files they speed up searches for a subset of records based on values in certain search key fields Updates are much faster than in sorted files Indexes 4 An index on a file speeds up selections on the search keyfields for the index Any subset of the fields of a relation can be the search key for an index on the relation Search key is not the same as key minimal set of fields that uniquely identify a record in a relation 439 An index contains a collection of data entries and supports efficient retrieval of all data entries k with a given key value k I Data entry versus data record I Given data entry k we can find record with key k in at most one disk IO Details soon B Tree Indexes Nonleaf J Pages quot J I I r I Leaf 0 Pages Sorted by search key st Leaf pages contain data entries and are Chained prev amp next 4 Nonleaf pages have index entries only used to direct searches index entry I g I Entries lt Roox Example B Tree Iquot k1 7 Note how data entries in leaf level are sorted 13 A l X A Entries gt 27 30 x 2 3 39 6 7 8 14 16 22 24 27 29 33 34 38 39 4 Equality selection find 28 29 4 Range selection find all gt 15 and lt 30 4 Insert delete Find data entry in leaf then change it Need to adjust parent sometimes I And change sometimes bubbles up the tree HashBased Indexes oz Good for equality selections Search key Index is a collection of value k buckets I Bucket primary page plus zero or more over ow pages 39 Buckets contain data entries 3 Hashing mctiorz h hk bucket of data entries of the search key value k I No need for quotindex entries in this scheme 10 Alternatives for Data Entry k in Index 0 In a data entry k we can store I Altl Data record With key value k I Alt2 ltk rid of data record with search key value kgt I Alt3 ltk list of rids of data records With search key kgt 3 Choice of alternative for data entries is orthogonal to indexing technique used to locate data entries with a key value k I Indexing techniques B tree index hash index I Typically indexes contain auxiliary information that directs searches to the desired data entries 11 Alternatives for Data Entries Corlth 00 Alternative 1 Index structure is a file organization for data records instead of a Heap file or sorted file At most one index on a given collection of data records can use Alternative 1 Otherwise data records are duplicated leading to redundant storage and potential inconsistency If data records are very large of pages containing data entries is high Implies size of auxiliary information in the index is also large typically eg B tree 12 Alternatives for Data Entries Corlth oz Alternatives 2 and 3 Data entries with search keys and rids typically much smaller than data records Index structure used to direct search which depends on size of data entries is much smaller than with Alternative 1 80 better than Alternative 1 with large data records especially if search keys are small Alternative 3 more compact than Alternative 2 Alternative 3 leads to variable sized data entries even if search keys are of fixed length Variable sizes records data entries are hard to manage 13 Index Classi cation oz Key Primary key Candidate key 3 Primary index vs secondary index I If search key contains primary key then called primary index I Other indexes are called secondary indexes 3 Unique index Search key contains a candidate key No data entries can have the same value 14 Index Classi cation Corlth 3 Clustered vs unclustered If order of data records is the same as or close to order of data entries then it s a clustered index Alternative 1 implies clustered Alternatives 2 and 3 are clustered only if data records are sorted on the search key field A file can be clustered on at most one search key Cost of retrieving data records through index varies greatly based on Whether index is clustered or not 15 Clustered vs Unclustered Index 3 Suppose that Alternative 2 is used for data entries and that the data records are stored in a Heap file To build clustered index first sort the Heap file with some free space on each page for future inserts Overflow pages may be needed for inserts Thus order of data recs is close to but not identical to the sort order Index entries UNCLUSTERED CLUSTERED direct search for data entries Data entries Data entries quot Index File M X i i mime m Wk Data Records Data Records 16 Cost Model for Our Analysis We ignore CPU costs for simplicity B The number of data pages R Number of records per page D Average time to read or write disk page Measuring number of page lO s ignores gains of prefetching a sequence of pages thus even I 0 cost is only approximated Averagecase analysis based on several simplistic assumptions n Good enough to show the overall trends 17 Comparing File Organizations oz Heap files random order insert at eof 3 Sorted files sorted on ltage salgt t Clustered B tree file Alternative 1 search key ltage salgt Heap file with unclustered B tree index on search key ltage salgt oz Heap file with unclustered hash index on search key ltage salgt 18 Operations to Compare 3 Scan Fetch all records from disk ozo Equality search ozo Range selection ozo Insert a record ozo Delete a record 19 Assumptions in Our Analysis ozo Heap Files Equality selection on key exactly one match 3 Sorted Files Files compacted after deletions 3 Indexes I Alt 2 3 data entry size 10 size of record Hash No overflow chains 80 page occupancy gt File size 125 data size BTree 67 occupancy typical implies file size 15 data size Balanced with fanout F 133 typical at each nonlevel 20 Assumptions contd 3 Scans I Leaf levels of a treeindex are chained I Index dataentries plus actual file scanned for unclustered indexes 3 Range searches I We use tree indexes to restrict the set of data records fetched but ignore hash indexes 21 of leaf pages B0167015B Cost of index IO 015BD of data entries on a leaf page R1006767R Cost of data IO 015B67RDBDR Key unclustered Btree one 10 per data entry of data entry pages B01800125B Cost of index IO 0125BD of data entries on a hash page R100808R Cost of data IO 0125B8RDBDR Key unclustered hash index one 10 per data entry 22 Cost of Operations Scan Equality Range Insert Delete Heap File BD 5BD BD 2D Search D Sorted File BD Dlong E gfs ilg Search Search pages Clustered 15BD DlogF15 231133 Search Search Tree Index B pages g D D Unclustered BDR D1IogF 2311in Search Search Tree Index 15 1513 recs 3D 3D Unclustered BDR 2D BD 4D 4D Hash Index 125 n Seveml assumptions underlie these rough estimates
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'