Class Note for CMPSCI 645 at UMass (6)
This 22-page set of class notes was uploaded on Friday, February 6, 2015. The notes belong to a course at the University of Massachusetts taught in Fall.
Parallel DBMS
CMPSCI 645. Slide content due to Ramakrishnan, Gehrke, Hellerstein, and Gray. Figures taken from DeWitt and Gray, "Parallel Database Systems: The Future of High Performance Database Systems," CACM 1992.

Parallel vs. Distributed DBMSs
- Parallel database systems: improve performance by parallelizing various operations, such as loading data, building indexes, and evaluating queries. Data may be distributed, but purely for performance reasons.
- Distributed database systems: data is physically stored across various sites, each of which runs a DBMS and can function independently. Data distribution is determined by local ownership and availability, in addition to performance.

Why Parallel Access to Data?
At 10 MB/s, scanning a terabyte takes 1.2 days; with 1,000-way parallelism, the same scan takes 1.5 minutes. Parallelism: divide a big problem into many smaller ones to be solved in parallel.

DBMS: The Success Story
- DBMSs are the most successful (perhaps the only successful) application of parallelism: Teradata and Tandem vs. Thinking Machines and KSR. Every major DBMS vendor has some parallel server.
- Reasons for success: bulk processing (partition parallelism); natural pipelining; inexpensive hardware can do the trick; users and application programmers don't need to think in parallel.

Some Terminology
- Speed-up: more resources means proportionally less time for a given amount of data (problem size constant, system grows).
- Scale-up: if resources are increased in proportion to the increase in data size, time is constant (problem size and system both grow).
[Figures: throughput (transactions/sec) vs. degree of parallelism, showing the ideal linear speed-up and scale-up curves.]

Enemies of Good Speed-up and Scale-up
- Start-up: if thousands of processes must be started, start-up can dominate the actual computation time.
- Interference: the slowdown each new process imposes on all others when accessing shared resources.
- Skew: variance in the size of the jobs given to each process; the service time for the whole job is the service time of its slowest step.

Architecture Issue: Shared What?
[Figure: a shared-memory multiprocessor (global shared memory), a shared-disk multiprocessor, and a shared-nothing architecture connected by an interconnection network.]
Alternative architectures:
- Shared memory: all processors share a common global memory and have access to all disks.
- Shared disk: all processors have private memory but direct access to all disks.
- Shared nothing: each memory/disk pair is owned by a processor, which acts as a server for that data.

Shared-Nothing Advantages
- Minimizes interference by minimizing shared resources.
- Exploits commodity processors and memory.
- Disk and memory accesses are local; traffic on the interconnection network is minimized.

Different Types of DBMS Parallelism
- Intra-operator parallelism: get all machines working to compute a given operation (scan, sort, join).
- Inter-operator parallelism: each operator may run concurrently on a different site (exploits pipelining).
- Inter-query parallelism: different queries run on different sites.
- We'll focus on intra-operator parallelism.

Limits of Pipelined Parallelism in DBMSs
- Relational pipelines are usually not very long.
- Some relational operators block (e.g., sorting, aggregation).
- The execution cost of one operator may be much higher than another's (an example of skew).
- As a result, partitioned parallelism is the key to achieving speed-up and scale-up.

Automatic Data Partitioning
Ways of partitioning a table:
- Round-robin: good to spread load.
- Hash: good for equijoins.
- Range: good for equijoins, range queries, and group-by.
Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning.

Parallel Query Processing
Example: two relational scans consume input relations A and B and feed their outputs to a join operator, which in turn produces a data stream C. To run the join on three sites: split each A scan's output into 3 streams; split each B scan's output into 3 streams; merge the 3 join input streams at each join node; each join node performs 1/3 of the join.
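As a rough illustration (not from the notes), the three partitioning schemes above can be sketched in Python; the table contents, key names, and site count are made-up examples:

```python
# Minimal sketch of round-robin, hash, and range partitioning.
# All data, column names, and the site count are illustrative assumptions.

def round_robin_partition(rows, n_sites):
    """Assign row i to site i mod n: spreads load evenly, but gives
    no help locating rows for a selection or an equijoin."""
    parts = [[] for _ in range(n_sites)]
    for i, row in enumerate(rows):
        parts[i % n_sites].append(row)
    return parts

def hash_partition(rows, n_sites, key):
    """Route each row by a hash of its key: rows with equal keys land
    on the same site, which is exactly what an equijoin needs."""
    parts = [[] for _ in range(n_sites)]
    for row in rows:
        parts[hash(row[key]) % n_sites].append(row)
    return parts

def range_partition(rows, boundaries, key):
    """Route each row by comparing its key to split points: each site
    holds a contiguous key range, so range queries and group-by can
    be directed to only the relevant sites."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        site = sum(row[key] > b for b in boundaries)
        parts[site].append(row)
    return parts

rows = [{"id": i, "name": chr(65 + i % 26)} for i in range(12)]
rr = round_robin_partition(rows, 3)       # 4 rows per site
hp = hash_partition(rows, 3, "id")        # equal ids colocate
rp = range_partition(rows, [3, 7], "id")  # sites hold id<=3, 4..7, >7
```

A selection like `id <= 3` must visit every site under round-robin, exactly one site under this range partitioning, and (for an equality predicate) one site under hash partitioning, which is the trade-off the slide's "good for ..." annotations summarize.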
Parallel Scans
- Scan in parallel and merge the results.
- A selection may not require all sites (for range or hash partitioning).
- Indexes can be built at each partition.

Parallel Hash Join
[Figure: the partitioning phase — the original relation is read from disk into an input buffer, a hash function routes each tuple to one of B main-memory output buffers, and full buffers are flushed to disk as partitions.]
- In the first phase, partitions are distributed to different sites; a good hash function automatically distributes work evenly.
- The second phase (build and probe) is done at each site.
- Almost always the winner for equijoin.

Dataflow Network for a Parallel Join
[Figure: scans of the partitions of A and B feed split operators, whose output streams are merged at each join site to compute the local join of A_i and B_i.]
Good use of split and merge operators makes it easier to build parallel versions of sequential join code.

Complex Parallel Query Plans
- Complex queries: inter-operator parallelism, with pipelining between operators.
- Note that sort and phase 1 of hash join block the pipeline.
[Figure: a bushy plan tree with sub-plans assigned to different site groups, e.g., sites 1-4 and sites 5-8 feeding a final join on sites 1-8.]

Parallel Query Optimization Issues
- Cost estimation in a parallel environment.
- Considering bushy plans means a much larger plan space.
- Some parameters are only known at runtime: the number of free processors, the available buffer space.

Sequential vs. Parallel Optimization
- The best serial plan is not necessarily the best parallel plan.
- A trivial counterexample: a table is partitioned across two nodes, each with a local secondary index, and a range query matches all of node 1's partition but only 1% of node 2's. Node 1 should do a scan of its partition; node 2 should use its secondary index.
  SELECT * FROM telephone_book WHERE name < "NoGood"

Parallel DBMS Summary
- Parallelism is natural to query processing: both pipeline and partition parallelism.
- Shared-nothing vs. shared-memory (shared-disk too, but less standard): shared-memory is easy but costly and doesn't scale up; shared-nothing is cheap and scales well, but is harder to implement.
- Intra-operator, inter-operator, and inter-query parallelism are all possible.

DBMS Summary (cont.)
- Data layout choices are important.
- Most DB operations can be done in a partitioned fashion: sort, sort-merge join, hash join.
- Complex plans allow for pipeline parallelism, but sorts and hashes block the pipeline; partition parallelism is achieved via bushy trees.

DBMS Summary (cont.)
- The hardest part of the equation: optimization. Two-phase optimization is simplest, but can be ineffective; more complex schemes are still at the research stage.
- We haven't said anything about transactions and logging: easy in a shared-memory architecture; takes some care in shared-nothing.
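The two-phase partitioned hash join described above can be sketched as a single-process simulation: `split` plays the role of the split operators in the dataflow figure, and the per-partition loop stands in for the sites. The relations, key names, and site count are illustrative assumptions, not part of the notes.

```python
# Sketch of a partitioned (parallel) hash join.
# Phase 1: split both relations by a hash of the join key, so matching
# tuples land in the same partition (in a real shared-nothing system,
# each partition would be shipped to a different site).
# Phase 2: each "site" runs an ordinary build-and-probe hash join on
# its own partition pair; the outputs are merged.

def split(rows, n_sites, key):
    """Route each tuple to a partition by hashing its join key."""
    parts = [[] for _ in range(n_sites)]
    for row in rows:
        parts[hash(row[key]) % n_sites].append(row)
    return parts

def local_hash_join(r_part, s_part, key):
    """Phase 2 at one site: classic build-and-probe hash join."""
    table = {}
    for r in r_part:                       # build a hash table on R
        table.setdefault(r[key], []).append(r)
    out = []
    for s in s_part:                       # probe with S
        for r in table.get(s[key], []):
            out.append({**r, **s})
    return out

def parallel_hash_join(r_rows, s_rows, n_sites, key):
    r_parts = split(r_rows, n_sites, key)
    s_parts = split(s_rows, n_sites, key)
    out = []                               # merge the per-site outputs
    for i in range(n_sites):
        out.extend(local_hash_join(r_parts[i], s_parts[i], key))
    return out

R = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 3, "a": "z"}]
S = [{"k": 2, "b": "p"}, {"k": 3, "b": "q"}, {"k": 3, "b": "r"}]
result = parallel_hash_join(R, S, 3, "k")
```

Because the same hash function partitions both inputs, no site ever needs tuples held by another site during phase 2; this independence is why the notes call partitioned hash join "almost always the winner" for equijoins, and why a skewed key distribution (one overloaded partition) directly hurts it.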