Class Note for CMPSCI 445 at UMass(3)
Class Note for CMPSCI 445 at UMass(3)
Popular in Course
Popular in Department
This 16 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 12 views.
Reviews for Class Note for CMPSCI 445 at UMass(3)
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 02/06/15
g I Evaluation of Relational Operations CMPSCI 445 Fall 2008 Slides Courtesy of R Ramakrishnan and Gehrke Relational Operations 9 We will consider how to implement Selection 0 Selects a subset of rows from relation Projection 75 Deletes unwanted columns from relation 39 m gtltl Allows us to combine two relations Setdifterence Tuples in reln 1 but not in reln 2 Union U Tuples in reln 1 and in reln 2 Aggregation SUM MIN etc and GROUP BY Order By Returns tuples in specified order rt After we cover the operations we will discuss how to optimize queries formed by composing them Outline 6 Sorting oz Evaluation of joins oz Evaluation of other operations Why Sort 3 A classic problem in computer science oz Important utility in DBMS I Data requested in sorted order e g ORDER BY eg find students in increasing gpa order I Sorting useful for eliminating duplicates e g SELECT DISTINCT I Sortmerge join algorithm involves sorting I Sorting is first step in bulk loading B tree index 3 Problem sort 1Gb of data with 1Mb of RAM 2Way Sort Requires 3 Bu ers oz Pass 1 Read a page sort it write it only one buffer page is used 3 Pass 2 3 etc three buffer pages used I My 11 1 I I Disk Main memory buffers Disk TwoWay External Merge Each pass we read write each page in file 2N N pages in the file gt the number of passes l39log2 N I 1 So total cost is 2Nlog2 N39 1 Idea Divide and conquer sort subfiles and merge Sort 34 26 49 78 56 13 23 47 13 46 89 56 2 23 44 12 67 35 89 6 I 12 23 34 45 66 78 9 I Input file PASS 0 2 1pageruns PASS 1 2page runs PASS 2 4page runs PASS 3 8page runs General External Merge Sort n More than 3 buffer pages How can we utilize them 3 To sort a file with N pages using B buffer pages Pass 0 use B buffer pages Produce N B sorted runs of B pages each Pass 2 etC merge B1 runs Disk B Main memory buffers Disk Cost of External Merge Sort 3 Number of passes 1 1083 1fN B 3 Cost 2N of passes oz Eg with 5 buffer pages to sort 108 page file Pass 0 108 5 22 sorted runs of 5 pages each last run is only 3 pages Pass 1 22 4 6 sorted runs of 20 pages each last run is only 8 pages Pass 2 2 sorted runs 80 pages and 28 pages Pass 3 Sorted file of 108 pages Number of Passes of External Sort N B3 B5 B9 B17 B129 B257 100 7 4 3 2 1 1 1000 10 5 4 3 2 2 10000 13 7 5 4 2 2 100000 17 9 6 5 3 3 1000000 20 10 7 5 3 3 10000000 23 12 8 6 4 3 100000000 26 14 9 7 4 4 1000000000 30 15 10 8 5 4 Blocked 10 for External Merge Sort 3 longer runs often means fewer passes 3 Actually we don t do 10 a page at a time oz In fact read a block of pages sequentially 3 Suggests we should make each buffer input output be a block of pages But this will reduce fanout during merge passes In practice most files still sorted in 23 passes 10 Sorting Records oz Sorting has become highly competitive Parallel sorting is the name of the game oz Datamation sort benchmark Sort 1M records of size 100 bytes in 1985 15 minutes World records 118 seconds 1998 record 39 16 offtheshelf PC each with 2 Pentium processor tow hard disks running NT40 3 New benchmarks proposed Minute Sort How many can you sort in 1 minute Dollar Sort How many can you sort for 100 11 Using B Trees for Sorting 3 Scenario Table to be sorted has B tree index on sorting columns oz Idea Can retrieve records in order by traversing leaf pages oz Is this a good idea oz Cases to consider Good idea Could be a very bad idea B tree is Clustered B tree is not Clustered 12 Clustered B Tree Used for Sorting 4 Cost root to the left 9 If Alternative 2 is used Index most leaf then retrieve Directs search all leaf pages Alternative 1 Data Entries Data Records Additional cost of retrieving data records each page fetched just once n Always better than external sorting 13 Unclustered B Tree Used for Sorting oz Alternative 2 for data entries each data entry contains rid of a data record In general one 10 per data record Worse case 1 O pN p records per page N pages in file Index Directs search 393939 3939 39 Data Entries quotSequence setquot quotquot39quot3939Jquot3939 quot3939quot39 quot W Data Records 14 g I External Sorting vs Unclustered Index N Sorting p1 p10 p100 100 200 100 1000 10000 1000 2000 1000 10000 100000 10000 40000 10000 100000 1000000 100000 600000 100000 1000000 10000000 1000000 8000000 1000000 10000000 100000000 10000000 80000000 10000000 100000000 1000000000 n p of records per page M B1000 and block size32 for sorting n p100 is the more realistic value 15 Summary oz External sorting is important DBMS may dedicate part of buffer pool for sorting 3 External merge sort minimizes disk lO cost Pass 0 Produces sorted runs of size B buffer pages Later passes merge runs of runs merged at a time depends on B and block size Larger block size means less I 0 cost per page Larger block size means smaller runs merged In practice of runs rarely more than 2 or 3 3 Clustered B tree is good for sorting unclustered tree is usually very bad 16
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'