### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# Database Systems COM SCI 143

UCLA

GPA 3.86

### View Full Document

## 94

## 0

## Popular in Course

## Popular in ComputerScienence

This 23 page Class Notes was uploaded by Anastacio Barton on Friday September 4, 2015. The Class Notes belongs to COM SCI 143 at University of California - Los Angeles taught by Staff in Fall. Since its upload, it has received 94 views. For similar materials see /class/177904/com-sci-143-university-of-california-los-angeles in ComputerScienence at University of California - Los Angeles.

## Reviews for Database Systems

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 09/04/15

C8143 Query processing and join algorithms Book Chapters Book Chapter 131 136 Things to Learn 0 Join algorithms Motivation Studentsid7 name7 addr7 age7 GPA Enrollsid7 dept7 cnum7 sec Btree index on sid7 age of Student table 0 QzHow do we process SELECT FROM Student WHERE sid gt 30 o QzHow do we process SELECT FROM Student WHERE sid gt 30 AND age gt 19 o QzHow do we process SELECT FROM Student S Enroll E WHERE Ssid Esid o Joins can be very expensive maybe x R x How can we perform joins efficiently Join algorithms R and S example slide 0 Q How to join R and S What is the simplest algorithm What if we have an index Any other ideas that we can use 7 Four join algorithms Nested loop join lndex join Sort mergejoin Hash join 7 We now learn how they work 1 NestedLoop Join nested loop join slide For each I in R do For each s in S do if rC sC then output rs pair 0 Q If R has 100000 tuples7 how many times the entire S table is scanned o The simplest algorithm It works7 but may not be efficient 2 Index Join index join slide For each I in R do X lt indexlookupSC IC For each s in X do output rs 0 Look up index to nd matching tuples from S c Q Bene t of index join compared to nested loop join 3 SortMerge Join Sort merge join slide 0 Main idea If tables have been sorted by the join attribute7 we need to scan each table only once 7 Maintain one cursor per table and move the cursor forward 0 Sort tables and join them sort merge algorithm slide 1 if R and S not sorted sort them 2 i lt 1 j lt 1 While i lt IRI AND j lt ISI do if Ri C Sj C then outputTuples else if RiC gt SjC then j lt j1 else if RiC lt SjC then i lt i1 Procedure outputTuples While RiC sjc AND i lt IRI do k lt j While RiC SkC AND k lt Isl do output Ri Sk pair 4 Hash Join 0 Main idea If hash values are different7 the tuples will never join7 ie7 if hRC 7 hSC7 then RC 7 SC 0 Join two tuples only if their hash values are the same hash join algorithm slide 1 Hashing stage bucketizing Hash R tuples into G1 Gk buckets Hash S tuples into H1 Hk buckets 2 Join stage For i 1 to k do match tuples in Gi Hi buckets R G1 G2 Comparison of Join Algorithms c Q Which algorithm is better 0 Q What do we mean by better Cost model 0 The ultimate bottom line 7 How long does it take for each algorithm to nish for a particular data 0 Need of cost model 7 We need a cost model77 to estimate the performance of different algorithms 0 Our cost model Total number of disk blocks that have been readwritten 7 Not very realistic lgnore random7 sequential lO issues7 CPU cost7 etc 7 Yet simple to analyze and doable in class More sophisticated models are too complex to analyze in class 7 Good approximation given that disk lOs dominate the cost Most algorithms that we will study do mostly sequential scan 7 A better algorithm smaller number of disk block access 7 Ignore the last 10s for result writing the same for every algorithm Example to use 0 Two tables R7 S o R 17 000 tuples7 S 107 000 tuples7 10 tuplesblock 0 b3 100 blocks7 b5 1000 blocks 0 Memory buffer for 22 blocks Cost of join stage of sortmerge join 0 Usage of main memory blocks for join 1 Available memory buffers Disk blocks of each table Memory IIII R 100 blocks Em CECE S 1000blocks 22 blocks 2 We need to read R talole7 S table and write the output 7 Disk transfer unit is one block gt At least one memory buffer block to read R7 read S and write output gt Three memory blocks used for these tasks Memory IIII R 7m S 3 We sequentially read R and S blocks one block at a time7 and join them using the join algrothm c Q How many disk lOs block readswrites for R and S during join stage 0 Q Under our cost metric7 can we make it more efficient by allocating more buffers for reading R and S For example7 Memory E III Ei m s 10 blocks per table NestedLoop Join naive nested loop join algorithm slide for reminder join diagram Memory t t 7 Ill em s c Q How many disk blocks are read 0 Q Can we do any better Optimization 1 Blocknested 100p join Once we read a block from R7 join everything in the block in one scan of S gt reduces the number of scans of S table 0 Q What is the cost 0 Q Can we do any better Optimization 2 Read as many blocks of R and join them togeter in one scan of S gt reduces the number of scans of S table 0 Q What is the maximum of blocks that we can read in one batch from R c Q What is the cost 0 Q What is general cost for b3 b5 and M c Q What if we read S rst Would it be any different 7gt Use smaller table for the outer loop 0 Summary 7 Always use block nested loop not the naive algorithm 7 Read as many blocks as we can for the left table in one iteration 7 Use the smaller table on the left or outer loop Hash Join hash join slide for reminder two stages hashing stage and join stage 0 Hashing stage Read R or S table and hash them into different buckets buckets G1 6 Ee i Illl R 75 7 Q One block for reading R7 other blocks for bucketizing How many buckets 7 Q Assuming random hashing7 how many blocks per bucket 7 Q During loucketizing7 R table is read once and written once How many disk IOs read or write 7 Repeat the same for S 0 Join stage Join H1 with G1 G1 from R H1 from S DD 5 blocks 48 blocks 7 Q 5 blocks for G1 48 blocks for H1 How should we join G1 and H1 7 Q How many disk 10s 7 Q Total disk 10s 7 Q What if R is large and G1 gt 20 Recursive partitioning b of bucketizing steps logy11 General hash join cost bR lt 5 2bR b5 liogM1 Min 2N big bs Index join index join slide for reminder 0 Q How many disk 10s 0 Q What should the system do to perform index join Index join cost 7 IQ for R scanning 7 IQ for index look up 7 IQ for tuple read from S 0 Example 1 7 15 blocks for index 1 root7 14 leaf 7 On average7 1 matching S tuples per an R tuple Q How many disk lOs How should we use memory Q Any better way 0 Example 2 7 40 blocks for index 1 root7 39 leaf 7 On average7 10 matching tuples in S Q How many disk lOs How should we use memory 0 General cost 3 lRl C J 7 C average index look up cost 7 J matching tuples in S for every R tuple 7 lRl tuples in R c Q How can we compute J 7 Example R MRCSC S lSl 107 VCR 1000 Uniform distribution for C values How many tuples in S with C c SortMerge Join 0 Two stage algorithm 1 Sort stage Sort R and S 2 Merge stage Merge sorted R and S o of disk lOs during merge stage 3 5 100 17 000 1100 c Q How many disk lOs during sort stage Merge sort algorithm SCI R 77 DD 100 blocks c Q How many blocks can we sort in main memory 7 Q Do we need to allocate one block for output 0 Q How many sorted runs after sorting R in chunk of 22 blocks sorted runs 22 blocks R 7a El 77 I pm 100 blocks 22 blocks 0 Q What should we do with 5 sorted runs c Q How many disk 10s 7 Q During rst stage sorting 7 Q During second stage merging Repeat it for S table of 1000 blocks Show that now we need three stages 0 ln general7 the number of passes for 3 and M llogM1bRMl 1 7 Verify it at home on your own 7 Total of 10s for sorting 2 bRllogM1bRMl 1 Total sortmerge join cost In total 400 6000 1100 7500 o In general 2bRllogM1bRMl 1 2b5llogM1bsMl 1 bR b5 lOs Summary of join algorithms 0 Nested loop join ok for small relations relative to memory size 0 Hash join usually best for equi join 7 if relations not sorted and no index 0 Merge join for sorted relations 7 Sort merge join good for non equi join 0 Consider index join if index exists 0 To pick the best DBMS maintains statistics on data Highlevel query optimization Tables RA B SB C TC D c Q How can we process the following query SELECT FROM R S T WHERE RB SB AND SC TC AND RA 10 AND TD lt 30 7 Many different ways Show a couple of logical query trees 0 Q For now focus on R M S gtlt1 T How many different ways to execute it C5143 Join Algorithms o Iteration Join conceptual For each r e R do For each s e 5 do if rC sC then output rs pair 0 Merge Join conceptual 1 if R and S not sorted sort them 2 i e 1 j e 1 While i g R j g S do if RiC SjC then outputTupIes else if RiC gt SjC then j e j1 else if RiC lt SjC then i e i1 OutputTuples for duplicates While RiC SjC i s R do J39J39 lt J39 While RiCSjCjjSS do output Ri Sj J39J39 lt J39J391 i lt i1 o Hash Join conceptual Merge JOIn Hash function h range 1 a k Both R 5 sorted by C Buckets for R G1 Gk Buckets for S H1 Hk Algorithm 1 Hash R tuples into GlGk buckets 2 Hash S tuples into H1Hk buckets 3 Fori 1tokdo match tuples in Gi Hi buckets Hash Join Step 1 0 Index Join conceptual 1 If not create an index for SC 2 For each r e R do X indeXIookupSC rC For each s e X do output rs Index Join Memory R R s Cost Analysis 0 Cost model of disk 105 o Assumption 10 tuplesblock R 10000 Br S 5000 B5 Memory buffers for 102 blks NestedLoop Join For each r e R do For each s e 5 do if rC sC then output rs pair How many disk IOs Block NestedLoop Join o Read 100 blocks of R into memory 0 Read all ofS using 1 block and output joined tuples using 1 block 0 Repeat until done How many disk IOs General Rules for NestedLoop Join 0 Use block nestedloop join 0 If no table fits into memory Smaller table on the left 0 In each iteration Read as many blocks in the left outer table as possible in ltM 2i BS BR What if a table fits into memory Merge Join 0 Both R S ordered by C relations contiguous Memory 0 Merge Join Algorithm i e 1j e 1 While i g R j g S do if RiC SjC then outputTuples else if RiC gt SjC then j ej1 else if RiC lt SjC then i e i1 How many disk 105 if R and S have been already sorted Merge Sort Algorithm i For each 102 blk chunk of R Read chunk Sort in memory Write to disk 102 blocks Memory E r Salted chunks EEl Em Merge Sort Algorithm 2 Merge Read one block from each 102 blk chunk to memory buffer While some buffer not empty a Output the smallest tuple b If a buffer is empty Read the next block from the chunk if not EOF Ex E l E Sorted Sgrlted chunks l e 39 Memory How many disk 10s for sorting Merge Join Cost 0 Very efficient if tables are sorted BR BS 0 Cost for sorting R Further Improvement for Merge Join Hint do we really need fully sorted tables sorted buckets m Join gt zBRilogM1BR Mgti1 Hash Join 0 Step 1 Memo buckets E GI 39 E GZ Gk 0 Step 2 R S G1 7 El 62 E 2 G3 Gi Hi memo 1 Hash Join 1 Bucketize R into 101 buckets Read R hash write 3 l Mll 101buckeis g 5 5 E m l lt gt How many R Similarly for S Hash Join 2 Join each bucket a Read one R bucket build memory hash table b Read corresponding S bucket hash probe How many disk IOs Recursive Partitioning Basic idea Two tuples can join only if they agree on every hash function 1 Bucketize R into 101 buckets Read R hash write EEmi R a E G1201 buckeis 1o5 blks E1 5 a a m l 990 blks Recursive Partitioning 2 Bucketize each Gi using another hash function Read Gi hash write vG11 98 blks k Fiis in memory Repeat until each bucket fits in memory Hash Join Cost No recursive partitioning 3 BR BS Recursive partitioning 209R BS VlogMilAIBf 2 BR BS Index Join Index Join Algorithm Read a block B from R For each r e B do X indexIookupSC rC For each s e X do output rs Repeat to the end of R How many disk IOs Factors to Consider 0 Index look up cost How many blocks for index How many levels 0 Number of matching tuples C8143 Query Optimization Ch 14 Basic Steps in Query Processing 1Parsing and translation Translation of SQL queries into naive RA 20ptimization of RA expressions b 3 Evaluation Optimization of RA Expression Pushing Selection and Projection Estimating costs 0 Join Order with minimal cost 0 Much more Introduction 0 Pushing selection and also projection H EIISIOINELHWM H cuslomeLnam 6 binnchiyzliwoklyn l N branch M G branchzmyBmoklyn M account depositor brunch account depositor 4 0 An evaluation plan pe nes exactly what algorithm is used for each operation and how the execution of the operations is coordinnlnrl H customeyimme sort to remove duphcates M hash join M merge join deposi tor Pipeline Pipeline brancnzciiy Brooklyn balance lt 1000 use index 1 use linear scan brunch account Optimizers for DB Queries vs PL Compilers o In compilers optimization is pattern based eg statements are taken out of nested loops whenever possible In DBMS this kind of greedy optimization is performed when pushing selection and projection into the the RA expression For other operators joins the optimization is exhaustive all possible implementations are evaluated and their cost is estimated Basically an exponential computation on the number ofjoins Cost estimates requires keeping statistics on DB population Join Ordering o rim r2 r2 W1 but their cost is not the same 0 Join of three relations r1 r2 and r3 01er N5 1940294 3 Joins are associative and commutative f r2 lgt4 r3 is quite large and r11gtq r is small we first compute r1 m r2 ampjoin the result with r3 0 But if r1 D45 is larger than r2 M r3 we compute r2 gt4 r3 andjoin the result with r1 0 Also r1 M r3 first should not be excluded unless this is actually a cartesian product Statistical Information for Cost Estimation n number of tuples in a relation r b number of blocks containing tuples of r l size of a tuple of r f blocking factor of r ie the number of tuples of r that fit into one block VA r number of distinct values that appear in rfor f 7 A0 attribute A same as the size 0 7 7 n 7r7 br77 7 vrv Estimation of the Size of Joins o If R i7 5 ms is a foreign key in S referencing R then the number of tuples in r gt4 sis the same as the number of tuples in s o If R 7 Sis a key for R then a tuple of swill join with at most one tuple from r therefore the number of tuples in r N s is no greater than the number of tuples in s o VAr the number of distinct values for attribute A in relation r Estimation of the Size of Joins 0 Estimate size of depositor Ncustomer using the information about foreign keys 0 Depositor has 5000 tuples and customer has 10000 tuples Estimation of Size of Joins R NS R S A is not a key for Ror S If we assume that every tuple t in R produces tuples in R 1X15 the number of tuples in R MS is estimated to be 11 7 n5 VAs If the reverse is true the estimate obtained will be H 7 n5 VAr The lower of these two estimates is probably the more accurate one Can improve on above if histograms are available 7 Use formula Similar to above for each cell of histograms on e atiors the twor i Estimation of the Size of Joins 0 Estimate size of depositor Ncustomerwithout using information about foreign keys Depositor has 5000 tuples and customer has 10000 tuples Vcustomername depositor 2500 and Vcustomer name customer 10000 The two estimates are 5000 100302500 20000 and 5000 10113010000 00 We choose the lower estimate which in this case is the same as our earlier computation using foreign keys

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"

#### "I bought an awesome study guide, which helped me get an A in my Math 34B class this quarter!"

#### "There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

#### "It's a great way for students to improve their educational experience and it seemed like a product that everybody wants, so all the people participating are winning."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.