Database Sys Implement
Database Sys Implement CS 4420
Popular in Course
Popular in ComputerScienence
This 0 page Class Notes was uploaded by Alayna Veum on Monday November 2, 2015. The Class Notes belongs to CS 4420 at Georgia Institute of Technology - Main Campus taught by Staff in Fall. Since its upload, it has received 21 views. For similar materials see /class/234142/cs-4420-georgia-institute-of-technology-main-campus in ComputerScienence at Georgia Institute of Technology - Main Campus.
Reviews for Database Sys Implement
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 11/02/15
College of Comput39 CS 4420 Database System Implementation Failure Recovery Ling Lin Assncizte Prnfessnr Cnllege nf Cnmpnting Genrgjz Tech Lecture Outline l L ast L ecture ry Optimization Architecture o Failure Reeove ry 9 Database Consistency and Con 39nts 9C0nsistent Database State and Transactions 0 Checkpoints DBM Failure Modes I Erroneous data entry caught by DBMS through triggers key foreign key value and tuple based constraints I Media failures 0 Local caught by parity checks on sectors main memory 0 Disk crash handled by RAID I System Crashes 0 Problem is that memory loss wipes out process state and data 7 what if the process had already made some modi cations to the database DBM Consistency Principle I Transaction o A process that reads or modi es the DB I DB State and Consistency o The database has a state values of all DB elements e g tuple relation block 0 A state is consistent if it meets all constraints both explicit part of schema and implicit o A transaction if allowed to complete with no interference from other transactions or systems errors will take the DB from a consistent state to another consistent state DBM How can constraints be violated I Transaction bug I DBMS bug I Hardware failure e g disk crash alters balance of account I Data sharing eg Tl give 10 raise to programmers T2 change programmers gt systems analysts DBM How can we prevent x Violations I Chapter 8 due to failures only I Chapter 9 due to data sharing only I Chapter 10 due to failures and sharing DBM Failure Recovery I DB Operations and Transactions I Atomic transaction I Transaction Failure due to system crashes or media failure I Recovery Techniques 0 Logging 0 Logging schemes 9 Undo Log WAL 9Redo Log 9UndoRedo Log DBM Storage hierarchy El Memory Disk Operations I Input X block with X gt memory I Output X block with X gt disk I Read Xt do inputX if necessary t value of X in block I Write Xt do inputX if necessary value of X in block t Transaction collection of actions that preserve consistency Consistent DB Big assumption If T starts with consistent state T executes in isolation gt T leaves consistent state Each transaction sees a consistent DB Correctness informally I If we stop running transactions DB left consistent I Each transaction sees a consistent DB DBM Key problem Un nished transaction Example Constraint AB T1 A Ax 2 B lt B x 2 DBM T1 Read At tlt tx2 Write At Read Bt tlt tx2 Write Bt Output A 39 Output B faIIUI E 9 Need atomicity execute all actions of a transaction or none at all A8 16 B8 16 memory disk DBM Logbased Techniques Database Log Every action of a transaction must not only perform the action but must also write a log record to an appendonly le Old New stable database stable database Lo A log le maintained by the DBMS system and residing on a stable storage When a change is made to the database a record containing values of the updated item is written to the log le DBM Log Information The log contains information used by the recovery process to restore the consistency of a system The type of log records include 0 start transaction T has started 0 read transaction T has read data item X 0 write transaction T has changed the old value before image of X to a new value after image 0 abort o commit mum c344 39 College of Comput 20 Database System Implementation Query Processing and Query Optimization Ling Lin cizte Prnfessn Assn r Cuuege ur Cnmputing Geurgiz Tech warm Hugul mum Revie Lecture Outline w of Last Leeture ary of Multidimensional Indexing Techniques o Query Processing Part1 Query Processing Steps lgebr formauorr 5 Logical Query Plans Today sLec cont Based Opurruzatror o transformauorrs gtCostrbased Query Opurruzauor trutermedrate Result Esumauor tIO Blocks Estrruauor warm Hugul DB09 Introduction Query Processing Phases I Relational algebra level 0 Algebraic transformations 0 good transformations I Detailed query plan level 0 estimate costs intermediate results size 10 blocks 0 generate and compare plans Prepared by Lrng Lu 3 DB09 Introduction Relational Query Optimization I Relational algebra level 0 Using Heuristics for good transformations to reduce the search space of local query plans 0 Typical Heuristics 9 Push Selection and Projection closer to the base relations in a SQL query tree 9 Reduce intermediate result size as much as possible I Physical Executable query plan level 0 Consider various indexes and their performance characteristics when estimating costs of alternative query plans 0 Pick the most economical physical query plan for query execution Prepared by Lrng Lu 4 College of Computing Query Processing amp Optimization Ling Liu mm Query Access mm QueryOpt The System Catalog I Store the meta information that describes each database including a description of o conceptual database schema logical data model 9 Relations Attributes Keys Indexes Views 0 internal schema 0 external schema I Store information needed by specific DBMS modules 0 query optimization module 0 security and authorization ng Lm 3 QueryOpt Relational DBMS Catalog I All meta information is stored as relations I Example RelationiKeys I Rel name IKeV number I Member attr I Relationilndexes Reliname Index name Member attr Indexitype Attrino AsciDesc Viewiqueries View nam e Query Viewiattrs ng Lm 4 ng Lru QueryOpt Uses of System Catalog I DDL Compilers 0 Correct de nition of relations and attributes I DlVlL Query Compiler o DML Parser 9 guided by the description of DML syntax and the schema information in the catalog generates a query tree after parser o Optimizer 9 generates access paths that is relatively optimal for executing a queryDML command by accessing the database structure information schemas and mapping highlevel SQL queries into lowlevel le access commands ng Lru QueryOpt Database Access User Program A Lanouaoe DBMS System Buffer external schema used by user roram A 2 T Database Ling Lm 31 D C9 G G QueryOpt Database Access User program A sends to DBMS an invoke command to retrieve a set of record DBMS analyzes the external schema of the user program A and nds the database description of the record DBMS checks with the schema to get the data types and location information of record DBMS checks with the physical schema to find out which device the record is in and what access methods can be used According to 4 DBMS sends OS a read command to execute the search Ling Lm G 9 C98 QueryOpt Database Access OS issues the page invoke command to the correspond device and then puts the page fetched into the system buffer DBMS uses the schema and the external schema to infer the logical structure of the retrieving record DBMS places the relevant data to the UWA and provides the status information at the program invocation exit QueryOpt Query Processing Methodology Highlevel Calculusbased Query Execution Schedule file access plan Ling LHJ 9 Conceptual Schema I Describes the meaning of data in the universe of discourse o emphasizes on general conceptually relevant and often timeinvariant structural aspects of the universe of discourse I Excludes the physical organization and access aspects of the data CUSTOMER LH Vg LHJ 10 QueryOpt External Schema I Describes parts of the information in the conceptual schema in a form convenient to a particular user group s view I Is derived from the conceptual schema MALE CUS TOMER MALECUSTOMERX Y CUSTOMERX Y s A WHERE SEXM CUS TOMER l NAME ADDR SEX AGE Llng LHJ 11 QueryOpt Internal Schema I Describes how the information described in the conceptual schema is physically represented in a database to provide the overall best performance CUSTOMER NAME I ADDR SEX AGE CUS TOMER index on NAME Block Size 100 Lll lg LHJ 12 Ling Lm QueryOpt Query Preprocessing Input Calculus query on base relations I Normalization o manipulate query quanti ers and quali cation I Analysis 0 detect and reject incorrect queries 0 possible for only a subset of relational calculus I Simplification o eliminate redundant predicates I Restructuring o calculus query 3 algebraic query 0 more than one translation is possible 0 use transformation rules Ling Lm QueryOpt Normalization I Lexical and syntactic analysis similar to compilers 0 check validity 0 check for attributes and relations 0 type checking on the quali cation I Put into normal form 0 Conjunctive normal form pllvplzvvp1n pmlvpmzvvpmn o Disjunctive normal form P11AP12A P1n V V pmlAmeAquot39pmn o OR39s mapped into union AND s mapped into join or selection QueryOpt Refute incorrect queries I Refute incorrect queries I Incorrect o disjoint components are useless 0 multiple relations missing joins 7 may not be incorrect but may indicate Cartesian product I Contradictory o quali cation can not be satis ed by any tuple DUR gt 27 AND DUR lt 25 Ling LHJ 15 QueryOpt Simplification I Why simplify o The simpler the query the less work there is and the better the performance I How Use transformation rules 0 elimination of redundancy 9 idempotency rules p1 pl false P1 sz P1 P2 p1 false pl 9 application of transitivity 9 use of integrity rules I Example 0 XgtaandXgtb Ling LHJ 16 QueryOpt Re structunn g I Convert relational calculus to relational algebra RENAME Project I Make use of query trees I Example Find the names of employees other than 6DUFH2 OR DUR24AND Select Doe who worked on the CADCAM prOJect NAMEICADCAM AND for either 1 or 2 years SELECT ENAME FROM E w P WHERE EENOWENO JNO AND WJNOPJNO N ENO Join AND EENAME J Doequot AND PJNAME CADCAM P W AND WDUR12 OR WDUR24 Ling Lm 17 QueryOpt Query Optimization I Not really optimizing but planning to avoid bad execution strategies I Models 0 Heuristicsbased 9Apply transformation rules according to a general strategy 0 Costbased 9Minimize a cost function IO cost CPU cost subject to a set of constraints Ling Lm Query Optimization An Example SELECT ENAME FROM EW WHERE EENO WENO AND WRESP quotManager I Algebraic Optimization 0 Two steps 9 Rewriting 9 Transformation 0 May produce multiple optimal plans I Costbased Optimization 0 Decide which of the alternatives is the cheapest to execute bug on 19 QueryOpt 39 n Pro39ect RENAME PrOJect ENTE J 6 6E ENO w ENO AND EENO WENO AND S I t Select WRESP quotManager eec WRESP quotManager lgtltlENo lgtlt1ENo Join Join E W E bug on Moving selection downward using Cascade of selections E 6 Commutativity of selections WRESP quotManager and Commuting selection With binary operations W Ling Llu 21 Algebraic Transformation RENAME EENO WENO QueryOpt QueryOpt How to decide which is the best plan Plan 1 Plan 2 TEENWE D4 E ENO w ENO E 6W RESP quotManager RENAME EENO WENO 6WRESP quotManage r E l W Ling m QueryOpt Query Optimization An Example I Example SELECT ENAME FROM EW WHERE EENO WENO AND WRESP quotManagerquot I Strategy 1 Algebraic rewriting ENAME6RESPquotManagerquotE ENoG EN0E X ENAME6RESPquotManagerquotE ENoG EN0W X I Strategy 2 an algebraic transformation quotENAMEE gtltl EN00RE51HManaguquotm quotENAME0RE51HManaguquotW N ENO E ng on 23 QueryOpt Query Optimization An Example I Why is it important SELECT ENAME FROM EW WHERE EENOWENO AND WRESPquotManagerquot I Strategy 1 nENAME6RESP Manager EENOGENOE X W I Strategy 2 ENAME E N ENO6RESPquotManager W I Strategy 3 ENAME GRESP Manager W x ENO E Llng LHJ Z4 QueryOpt Cost of Alternatives I Assume o cardE 4000 cardW10000 o 10 oftuples in W satisfy RESPquotManagerquot selection generates 1000 tuples execution time proportional to the sum of the cardinalities of the temporary relations I searching is done by sequential scanning Strategy 1 Strategy 2 Cartesian prod 40000000 Selection over W 10000 Search over all 40000000 Join40001000 4000000 80000000 4010000 ng LHJ ZS QueryOpt Cost of Alternatives I Assume o cardKE 4000 cardW10000 o 10 of tuples in W satisfy RESPquotManager selection generates 1000 tuples I execution time proportional to the sum of the cardinalities of the temporary relations I searching is done by sequential scanning Strategy 1 Strategy 2 Strategy 3 Cartesian p40000000 Selection on W10000 Selection on E4000 Search 40 000 000 J0in400010004QQOQQO Join40001000 80000000 4 010 000 4 004 000 Ling LHJ 26 Relational Query Optimization I High level Transformationbased Optimization Logical Query Plans 0 Algebraic Transformation or Query rewritting 0 Goal Reduce the search space of logical query plans 0 Mechanisms Heuristics for good transformations 9 Push Selection and Projection closer to the base relations in a SQL query tree 9 Reduce intermediate result size as much as possible I Low Level costbased Optimization Executable Query Plans 0 Mechanism 9 Cost estimation using intermediate result size estimation and 10 cost estimation 9 Consider various indexes and their performance characteristics When estimating costs of alternative query ans 0 Goal Pick the most economical physical query plan for query execution Ling Lm QueryOpt ng Lm QueryOpt Heurlstlcs Query Optimizatron I Optimization Heuristics for the relational algebra 0 Performing selection and projection before join 0 Combining several selections over a single relation into one selection 0 Find common subeXpressions o Algebraic rewritingtransformation rules I General transformation rules for relational algebra also called equivalencepreserving algebraic rewriting rules QueryOpt Algebraic Transformation Rules I Cascade of selections cpl and p2 and and pnR E 6p16p239 6pnRquot39 I Cascade of projections A1 and A2 and and AnR E 7TA17TA27TAHR E nA1R ifAlg A2g g An I CommutatiVity of selections 6p16 pzRD E 6p26p1R o SP10r pzR E 6p1R U 6p2R I CommutatiVity of selection with projection A1A2An6pR E 6pnA1A2AnR Ling LHJ Z9 QueryOpt Albegraic Transformation Rules I CommutatiVity of binary operators 0 R lgtltlASESgtltIAR o RXSES gtltR o RUSESUR o RnSESnR I AssociatiVity of binary operations oRgtltSgtltTERgtltSgtltT o R1gtltSgtltTERgtltSgtltI T ORUSUSERUSUT o RmSrSERnSmT Ling LHJ 30 QueryOpt Algebraic I Commuting selection with binary operations 0 6pAR X S E 6PAR X S where A is de ned on R only 5pltAiR lgtltl qAiBk S E 6pAiR N qAiBk S Where Al is de ned on R and Bk is defined on S o 6PltAiR U T E 6PltAiR U 6PltAiT where A is de ned on both R and T o 6PltAiR n T E 6PltAiR n 6PltAiT where A is de ned on both R and T I Commuting projection with binary operations MR x s 2 MR x ms 0 nCR gtlt1qAijk S E nARgtlt qAijk11BS o ncR u s a nA R u 13 s where RA and SB C E A39 U B where A39 g A B Q B Ling LHJ 31 QueryOpt Heuristicsbased Algebraic Optimization Outline I Using cascade of selections rule break up any selections with conjunctive conditions into a cascade of selections o it allows more freedom in moving selections down different branches of the tree I Using commutativity of selections with other operations rules move each selection down the query tree as far as possible Ling LHJ 32 QueryOpt Heuristicsbased Algebraic Optimization Outline I Using associativity of binary operations rearrange the leaf nodes so that the most restrictive selections are executed first 0 the fewer tuples the resulting relation contains the more restrictive the selection is o reducing the size of intermediate results improves performance I If possible combine a Cartesian product with a selection into a join ng LHJ 33 QueryOpt Heuristicsbased Algebraic Optimization Outline I Using cascade of projections and commutativity of projections with other operations break down and move lists of projections down the tree as far as possible I Identify subtrees that represent groups of operations that can be executed by a single algorithm ng LHJ 34 QueryOpt Heuristic Optimization Example RENAME Canomcal query tree at the end of query preprocessing phase 6DUR12 OR DUR24AND JNAME CADCAM AND ENAME J DOE Ling Lm QueryOpt Heuristic Optimization Example TEEN ME GDUR12 RDUR24 GJNAME CADCAM Use cascading of selections GENAMEOU DOE rule to decompose selecnons I lgtltJNO lgtlt WK p Ling Lm Ling Lm QueryOpt Heuristic Optimization Example TEEN ME GDUR12 RDUR24 Push selection down using commutativity of GJNAME CADCAM Mm selection over join lgtltl E O GENAME J Doe P W E Ling Lm QueryOpt Heuristic Optimization Example TEEN ME Push selection down using commutativity of GDURZWZ TR DUR24 selection over join NJNO GJNAME CADCAM lgtltl ENO P GENAME J Doe QueryOpt Heuristic Optimization Example RENAME I 1gtlt1 K Push selection down lgtltl ENO GJNAME CADCAM GDUR m DUR24 GENAME J Doe P W E Ling Lm 39 QueryOpt Heuristic Optimization Example nETAME lgtlt1 JNO D0 earl ro39ection TEJNO ENAME y p J N ENO TEJNO TEJNO ENO TEJNO ENAME GJNAME CADCAM G G DUR 12x DURZM ENAME J Doe Ling LHJ 40 QueryOpt Heuristic Optimization Example TEE AME T Identify subtrees that can be implemented in one algorithm JNO nJNO ENAME TEJNO TEJNO ENO TEJNO ENAME GJNAME CADCAM GDUR m DURZ4 GENAME J Doe W ng Lm 41 QueryOpt Heuristics Query Optimization Summary I First apply operations that reduce the size of intermediate results 0 Move selections and projections down the tree as far as possible 9 early selections reduce the number of tuples 9 early projections reduce the number of attributes o The selection and join operations are the most restrictive thus they should be executed before other similar operations 9 This is done by reordering the leaf nodes of the tree among themselves and adjusting the rest of the tree appropriately ng LHJ 42 CS 4420 Data Mining and Data Warehousing Today s Agenda I Data Mining 0 Overview and Examples 0 Data Mining Methods AssacmnanPr2dxctmn C1ass catxanC1ustenng mar anonDiscrimination o Visualization 0 Privacy I Data Warehousing and OLAP Knowledge Discovery I Knowledge Discovery 0 Technology to extract interesting knowledge rules patterns regularities constraints from a large collection of data 0 Process of nontrivial extraction of implicit previously unknown and potentially useful information from large collection of data I Data Mining 0 A critical step in the knowledge discovery process 0 Extracts implicit information from a large dataset Steps in a KDD Process I Learning the application domain goals I Gathering cleaning and integrating data I Data transformationconsolidation I Data Mining 0 Choosing the mining methods and algorithms 0 Mining search for patterns or rules of interest I Analysis and evaluation of the mining results I Use of discovered knowledge in decision making What can Data Mining do for us I Descriptive Mining 0 Discover and describe general properties I Predictive Mining 0 Infer interesting properties based on available data Successful Applications I Diapers and beer Observation that customers who buy diapers are more likely to by beer than average allowed supermarkets to place beer and diapers nearby knowing many customers would walk between them Placing potato chips between increased sales of all three items Sports 25 ofthe NBA39s 29 teams use IBM39s Advanced Scout product to help them manage and analyze game data Game data shots rebounds etc is synchronized with game video Coaches can query the database and look for interesting patterns using data mining A query might ask quotWho are our best and worst shooters and under what circumstancesquot Successful Applications I Gamblin Gary Loveman a onetime Harvard Business School professor and now the CEO of Harrah s has built a Total Rewards database of 26 million players Preanalysis Harrah s was collecting 36 of its customers39 gaming dollars The take has increased to 42 in four years Astronomy Skycat and Sloan Sky Survey clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies nearby stars and many other kinds of celestial objects I Fraud detection Wachovia example Why is Data Mining Popular I Technology Push 0 Technology for collecting large quantity of data 9 Bar code scanners satellites cameras RFID tags sensors 0 Technology for storing large collections of data 9 Dambases data warehouses 9 Variety of data repositories such as virtual worlds digital media World Wide Web Spatial Databases Multimedia databases Time series data Geographical and Satellite Dam W Quick History Lesson I Originally data mining Was a statisticians term for overnsing data to draw invalid inferences I David Rhine a parapsychologist at Duke in the 19505 tested studenw for extrasensory perception by asking them to guess 10 cards 7 red or black He found about 11000 ofthern guessed all 10 and instead ofrealizing that that is What you39d expect from random guessing declared them to have ES P When he retested them he found they did no better than average His conclusion telling people they have ESP causes them to lose it Source J Uzzmh Data Mining Methods 1 I Association I Discover the frequency of items occurring together in a transaction or an event I Example 80 afmtamm who purchase milk also purchase bread I Prediction I Predicts some unknown or missing information based on available data I Example Forecast the Sale value afelectramc praduetsfar the next quarter based on the available datafarthepast three quarters Association Rules I Motivated by market analysis I Rules of the form 0 item1 A item2 A A itemk 9 itemk1 A A itemn I Example 0 beer A soft drink 9popcorn I Problem discovering all interesting association rules in a large database I Issues 0 Interestingness Completeness Ef ciency I Basic measurement for association rules 0 Support ofthe rule 0 Con dence of the rule Basic Measurement Support I Support for an item set I in a transaction database D of transactions in D containing I 81 total of transactions in D I Semantics o The percentage of transactions that contain all of the items inl I Support for a rule A 9 B o is the support for the itemset A I If the support of a rule is low the rule may have been discovered simply by chance Basic Measurement Confidence The con dence for a rule A9B o is the percentage of such transactions that contain all items in both A and B of transactions in D containing A B CA B of transactions in D containing A I Semantics o It indicates the degree of correlation between purchases of these two sets of items If a rule has a low con dence then the purchase of A itemset is not highly correlated with the purchase of B itemset Strong Association Rules I UserApplication gives a minimum support threshold minisup and a minimum con dence threshold miniconf I An association rule A9B is w or interesting for the database D if 0 SA9B support gt minisup o CA B con dence gt miniconf I An item setI is a frequent itemset if SI gt minisup I Association Rule Mining is interested only in strong association rules Association Rules An Example I A simple transaction I Support and Con dence database SA 60 SAB 40 Tquot ITEMS SA B SA 40 CA B 23 67 I Let minisup 60 and miniconf 80 A 9C 60100 is a strong association rule AB 9C 40100 is not insuf cient support C 9A 8075 is not insuf cient con dence lnterestingness Measures I Objective Measures 0 based on statistics and structures of patternsrules discovered eg con dence and support I Subjective Measures 0 based on users beliefs in the data novelty unexpectedness etc I A patternrule is interesting if it is o easily understood by humans valid on new test data with some degree of certainty potentially useful validates some hypothesis that a user seeks to con rm Data Mining Methods 2 I Classi cation 0 Determine the class or category of an object based on its properties 0 Example 9C fquot39the na139 I Clustering I quarter 0 Organize a set of multidimensional data objects in groups such that the intergroup similarity is minimized and the intragroup similarity is maximized 0 Example 9 group crime locations to find distribution patterns Classification I Two stages 0 Learning stage construction of a classification function or model 0 Classification stage predication of classes of objects using the function or model I Tools for classification 0 Decision tree 0 Bayesian network 0 Neural network 0 Regression I Problem 0 Given a set of objects whose classes are known training set derive a classification model which can correctly classify future objects An Example I Attributes I Class attribute game play or not I Tramlng set Outlook Temperature Humidity Windy Play y 85 85 false No overcast 83 78 false Yes sunny 80 90 true No sunny 72 95 false No sunny 72 70 false Yes Data Mining Methods 3 I Summarization 0 Characterization of general features of objects in the target class 0 Example 9 Characterize people x buying paiiem on the weekend I Discrimination 0 Comparison of general features of objects between a target class and a contrasting class 0 Example 9 Comparing siudenix in Engineering and in Art Summarization Technique AttributeOriented Induction I Generalization using Concept hierarchy taxonomy brand content size Food Milk Bread Skim Milk 2 Milk White ole Bread Wheat Content Lucerne Dairyland Wonder Safeway Requirements amp Challenges in Data Mining I User Interface and Visualization I Security and Privacy Issues I Mining Methodology Choices I Performance Guarantees I Data sources of complex data types or unstructured data Data Mining Visualization I The gure is a visualization of a set of association rules With exactly one item in the head and one in the body T he item labels are not shown in this picture The different colors denote difference confidence levels The user could have the color denote the support instead IBM Quest Project Data Mining Visualization I 2item rules I This gure runs offa similar data set but the height ofthe bars denotes the con dence while the color denotes the support The Vertical plane can be moved by the user to partially hide all rules whose con dence is below a certain ort and the color the con denc e IBM Quest Data Mining Visualization I 3item rules I This gure Visualizes rules with two items in the body and one in the head You can click on the image to get a fullsize version The bottom plane shows 2item rules as boxes similar to the rst gure The center boxes show 3item rules Users can quot yquot through this space to explore clusters or click on a box to highlight all rules containing any of the items in that box IBM Quest Visualization Store Placement What about Privacy DARPA s Total Information 1N Awareness initiative o Renamed Tzrmris tlnforma on Awareness 3 39 o Scuttled last year Muitistate AntiTerronsrn Information Exchan e ATRIX l Outsourcing Trend I39M WATCHING mm o Maintaining privacy oftax returns so IEHAVE YOURSELF and medical data in fomign inninn Amnuunm countries Privacy Preserving Data Mining I The primary task in data mining development of models about aggregated data I Can We develop aecumte models Without access to precise information in individual data records I R Agrawal and R Srikant Privacy Preserving Data SIGMOD 2000 I Application Web Demogmphics 11012 adapted fmm Ramabrthn amprlam n Web Demographics I Volvo S40 website targets people in 205 o Are visitors in their 205 or 405 0 Which demographic groups likedislike the website Randomization Approach Overview 55039in i 1303 YOK U R gdo z i39 R 39dor iie39 I ss azowsu 2539J50KJ A Reconstruction Problem I Original values X1 X2 X11 0 from probability distribution X unknown I To hide these values we use y1 y2 yn o from probability distribution Y I Given 39 X1Y1 X2Y2 Xnyn o the probability distribution of Y I Estimate the probability distribution of X Intuition Reconstruct single point I Use Bayes rule for density functions Original distribution for Age Probabilistic estimate of original value of V Intuition Reconstruct single point I Use Bayes rule for density functions Original Distribution for Age Probabilistic estimate of original value of V Reconstructing the Distribution I Combine estimates of where point came from for all the points a Gives estimate of original distribution Seems to work well Original Randomized Reconstructed Data Warehousing and OLAP What is a Data Warehouse I A database that is maintained separately from an operational database I Typically aggregates data from multiple databases updates can take place over night I A subj ect oriented integrated timevariant and nonvolatile collection of data in support for management s decision making process WHInmon Building a Data Warehouse Corporate Option 1 data warehousa V Consolidate Data Marts Option 2 Build from scratch Corporate data Three Tier Architecture Q External datasources Operational databases a E D ata malts Data Warehouse Design I Most of data warehouses use a star schema to represent multidimensional data model I Each dimension is represented by a dimension table that provides its multidimensional coordinates and stores measures for those coordinates I A fact table connects to all dimension tables with a multiple join Each tuple in the fact table consists of a pointer to each of the dimensional tables I The links between the fact table and the dimensional tables for a shape like a star Example of Star Schema N Sale Fact Table N Date Product Store x Customer I xquot unitisales Store dollarisales StorelD gquot City State Country Region OnLine Analytical Processing I Data Cube 0 a multidimensional array 0 each attribute is a dimension 0 variable a special dimension for aggregation I Example 0 1quot Product Region a OLAP Example Months Slice on city Atlanta Products Sales Products Sales quot 1 Multidimensional data cube OLAP Example Products Sales Multidimensional data cube College of Comput CS 4420 Database System Implementation Concurrency Control Ling Lin Assncizte Prnfessnr Cnllege nf Compnn39ng Genrgjz Tech Lecture Outline Review ofLast Lectures Concurrency Control o Problems ofeoneurrenta ess o Basre Concepts of Concurrency Control 9 Cun retaenuns 9 Schedules 9 Cun retEqulwlent Schedules 9 Cun retrSenahzzble schedules Deteenm 9 Addmunal Lueks and Lhar rules m c c DB16 Concepts Transaction sequence of riX WiX actions Con icting actions r1A W2A W1A W2A A W Schedule represents chronological order which actions are executed Serial schedule no interleaving of actions or transactions DB16 Example Tl ReadA T2 ReadA A A100 A Ax2 WriteA WriteA ReadB ReadB B B100 B Bx2 WriteB WriteB Constraint AB DB16 Schedule A Serial Schedule A B T1 T2 25 25 ReadA A lt A100 quotquotquotquotquotquotquotquotquotquotquot quot Wr1teA 125 ReadB B B100 WriteB 125 ReadAAlt Agtlt2 WriteA 250 ReadBB Bgtlt2 WriteB 250 250 250 DB16 Schedule B another Serial Schedule A B T1 T2 TY ReadAA E An quotquotquotquotquotquotquotquotquotquotquot 39 WriteA 5 O ReadBB lt BXZ WriteGB ReadA A e mm 50 WriteA ReadB B e B100 150 WriteGB 150 1 50 1 50 DB16 Schedule C A B T1 T2 25 25 ReadA A lt A100 quotquotquotquotquotquotquotquotquotquotquot quot Wr1teA 125 ReadAAlt Agtlt2 Wr1teA 250 ReadB B B100 WriteB 125 ReadBB Bgtlt2 WriteB E 250 250 A concurrent schedule equivalent to serial schedule A DB16 Schedule D A B T1 T2 Z5 25 ReadA A lt AHOO WriteA ReadAA lt AXZ 125 WriteA ReadBB e m 250 WriteGB ReadB B A B100 50 WriteGB 150 250 150 A concurrent schedule that is NOT equivalent to any serial schedule 8 DB16 I Given multiple concurrent schedules what are considered as good schedule regardless of 0 initial state and o transaction semantics I Only look at order of read and writes Example Scr1Aw1Ar2Aw2Ar1Bw1Br2Bw2B DB16 I Conflicting Concurrent Actions on same object start r1A end r1A l l startw2A end fwzA me Nonconflicting Concurrent Actions operate on the same object in the same order in all transactions operate on two different objects DB16 De nition Si 82 are con ict equivalent schedules if SI can be transformed into 82 by a series of swaps on noncon icting actions De nition A schedule is con ict serializable if it is con ict equivalent to some serial schedule DB16 Precedence graph PS S is schedule Nodes transactions in S Arcs Ti gt Tj whenever piA qjA are actions in S PiA lts MA at least one of pi qj is a write DB16 Schedule A Serial Schedule T1 T2 ReadA A A100 WriteA ReadB B B100 WriteB ReadAA Agtlt2 WriteA ReadBB Bgtlt2 WriteB DB16 Exercise What is the dependency graph of Schedule B Tl T2 ReadAA lt AXZ WriteA ReadBB lt BXZ WriteB ReadA A lt Al 00 WriteA ReadB B lt B100 WriteB DB16 Exercise What is the dependency graph of Schedule C T1 T2 ReadA A A100 WriteA ReadAA Agtlt2 WriteA ReadB B B100 WriteB A concurrent schedule equivalent to serial schedule A ReadBB Bgtlt2 WriteB DB16 Exercise Schedule D A concurrent schedule that is NOT equivalent to any serial schedule T1 ReadA A lt A100 WriteA ReadAA lt AXZ WriteA ReadBB lt BXZ WriteB ReadB B lt B100 WriteB Exercise I What is PS for s w3A w2C r1A w1B r1C w2A r4A w4D I Is S serializable DB16 Lemma Si 82 con ict equivalent gt PSiPSz Theorem PSi acyclic Cgt SI con ict serializable College of Comput CS 4420 Database System Implementation Indexing and Hashing Ling Lin snciztePrnfessn As i Cnusgs gr Cnmpnu39ng Gsnigiz Tech managing n i Lecture Outline l Review ofLast Lecture Indexing Convennonai indexes l Today s Lecture 0 Indexing continues Buses v s Sequential les 0 Hashing schemes managing n 1 DB06 Introductj on Conventional index I Basic Ideas 0 Sparse o Dense o MultilevelSparse Index I Duplicate Keys I DeletionInsertion I Secondary indexes Prepared by Ling Lm 3 DB06 Introductj on Secondary Index I Primary Index 0 Controls the placement of relational records on disk I Secondary Index 0 does not control record placement 0 a dense index 0 the search key index field is not a primary key Prepared by Ling Lm 4 DB06 Introductj on Indirect Buckets I To avoid repeating keys in secondary index use a level of indirection called buckets I Additional advantages 0 Allows intersection of sets of records without looking at records themselves 0 Can extend bucket to include role position of word search key value Prepared by Ling LHJ 5 DB06 Introductj on Duplicate values amp secondag indexes buckets Note byHector GarciarMolina 6 Query Get employees in Toy Dept A 2nd oor Dept index EMP Floor index I M gt Intersect toy bucket and 2nd Floor bucket to get set of matching EMP s Note by Hector GarciarMolina 7 DB06 Introductj on Inverted Indexes I Secondary indexes I Similar idea from IR community but 0 Record DB VS Document in IR 0 Search key value of record DB VS keyword in a document IR I Usually used with buckets Kl d1 d2 d5 le K2 d1 Prepared by Ling Lin 8 DB06 Introductj on BTrees I Generalizes Multilevel Index 0 Number of levels varies with size of the data le being indexed but often 3 0 Useful for primary secondary indexes l B Tree 0 All nodes have the same format n keys n1 pointers l Btree rules for tree of order n 1 All leaves at same lowest level balanced tree 2 Pointers in leaves point to records except for sequence pointer Prepared by Ling Lm DB06 Introductj on 3 Number of pointerskeys for Btree Max Max Min Min ptrs keys ptrssdata keys r39ilgrl frlggb 1 n rn12l ln12l 1 nolhq got 1 n Ln12J Ln12J Root n1 n 1 1 Note by Hector GarciarMolina DB06 Introductj on Full node min node f J f H Nonleaf 120 150 180 Leaf counts even if null Note by Hector GarciarMolina DB06 Introductj on B Tree Operations I Lookup 0 Start at root 0 Until reach a leaf follow pointer that could lead to the key you want 0 Search that leaf and leaves to the right if duplicates are allowed I Insertion and Deletion 0 Upon Insertion no room split leaf 0 Upon Deletion if deletion violates the minimum rule merge Via redistribute keys or coalesce with sibling Prepared by Ling nu 12 Imvamawng HI mm College of Comput CS 4420 Database System Implementation Query Optimization Part II Ling Lin Assncizte Prnfessnr Cnuege hr Cnmpuu hg Genrgia Tech C Imvamawng HI Midterm 1 Questions Answers o Chapter 1 mm hapters suggest to read Chapter 2 e Seeaor 21to Section 2 Chapter 3 e Seeaor 3 1 Section 3 2 Chapter 4 Chapter 5 e Seeaor 5 1 to Section 5 3 5 4 Seeuoh 3 45eet16h 3 5 chapter 6 e Seeuoh 61t6 Seeuoh 6 4 Seeuoh 6 65eet16h 6 7 Chapter 7 SEEDOH7 1 to Secnon7 5 3 Secnon7 7t6 Seet16h77 5 DB 1 0 Introductj on Lecture Outline I Review of Last Lecture 9Cost based Query Optimization Intermediate Result Estimation I Today s Lecture 0 Costbased Query Optimization IO Blocks Estimation Choosing the optimal query execution plan Prepared by Lrng Lru 3 DB10 Introductj on Relational Query Optimization I High level Transformationbased Optimization Logical Query Plans 0 Goal Reduce the search space of logical query plans 0 Mechanisms 9 Algebraic Transformation or Query rewritting 9 Heuristics for good transformations w Push Selection and Projection closer to the base relations in a SQL query tree w Reduce intermediate result size as much as possible I Low Level costbased Optimization Executable Query Plans Prepared by Lrng Lru 4 DB 1 0 Introductj on Relational Query Optimization I High level Transformationbased Optimization Logical Query Plans I Low Level costbased Optimization Executable Query Plans 0 Goal Pick the lowest cost most economical physical query plan for query execution o Mechanism 9 Cost estimation using intermediate result size estimation and 10 cost estimation 9C0nsider various indexes and their performance characteristics when estimating costs of alternative query plans Prepared by Ling Ln 5 Costbased Query Optimization I Estimating cost of query plan 0 Estimating size of results done 9 Selection Projection Join 0 Estimating of 10s 4 today s topic 9 Relations Contiguous or not 9 Index or not 9 Hash or not I Generate and compare plans 4 Note byHector GarciarMolina 6 DB 1 0 Introductj on Estimate Intermediate Results 0 TR tuples in R o SR of bytes in each R tuple o VR A distinct values in R for attribute A o BR of blocks containing R tuples Selection W GCONDR TW f TR fAa VRZ f Max RZ a 1 A MaxRZ MinRZ 1 fAgta 12 or 13 Prepared by Ling LHJ DB10 Introductj on Estimate Intermediate Results Projection W TEAR TW TR Join W R1 Dlt1A R2 TW TR2 TR1 MaxVR1AVR2 A Prepared by Ling LHJ m mm College of Comput CS 4420 Database System Implementation Query Processmg Ling Lin Assncizte Prnfessnr Cnuege nr Cnmpming Genrgjz Tech WWW m mm Lecture Outline Remew ofLast Lecture o Hashing scheme o MulnrdAmensxonal Indexing multilevel mdexmg eHashmg Based 39Gnd Pamtxoned Hashing eTeeeBased tKDeTree ReTeee Today s Lecture o PamtxonedHashmg o Summar of Multidimensional Indexing Teehmques o Query Pmeessmg Part1 quotmm ungm DB08 Introduction Review of Last Week s Lectures I Hashing scheme 0 Static Hashing 0 Dynamic Hashing I Multidimensional Indexing a Type of Queries 0 Type of Multidimensional Indexes 9 TreeBased KD tree Multikey Index Quad Tree R Tree 9 HashBay Grid Partitioned Hash Prepared by Lrng Lu 3 Types of Database Queries I Point Queries Partial Match Queries 0 Specify values of 71 dimensions ngtl I Range Queries 0 Specify ranges for 71 dimensions ngtl I Nearest Neighbor Queries 0 Closest point to a given point in a 7 dimensional space I Location Where am I queries 0 given a point nd where e g in which shape it is located Prepared by Lrng Lu 4 Distributed Database Systems A Motivation Growth of data communication networks I Faster cheaper amp reliable Needs of organizations I Multiple sites I Global presence I Interacting organizations I Need for transaction capabilities I Availability and reliability are crucial What is Distributed Database System A Distributed Database DDB is a collection of multiple logicaly mterreatec data bases distributed over a computer network A Distributed Database System is software that manages a distributed database while making the distribution transparent to the user quotDistributed DB Environment ClientServer Applications Applications Applications Client Client Client Services Services Services Communications Communications Communications 1 7 LAN H1gh1eve1 F11tered requests data only Multiple clien single server system Data ase ClientServer Design Issues Task distribution What will the client do What services will the server provide Q Communication method RPC Message passing Unit of communication Objects relations tuples etc Pages Q Caching Whether to do caching or not Unit of caching Cache update policies ClientServer Characteristics Client presents the user interface forms queries or commands in a prede ned language communicates these to the server according to an accepted interprocess communication method performs data analysis on the results returned from the server presents the results to the user gt Server provides a service to clients only responds to queries or commands from clients does not initiate conversation with clients hides the existence of the clientserver system from the client ideally ClientServer Advantages amp A Limitations lAdvantages Efficient division of labor Client access to remote data via standards Full DBMS functionality provided to client workstations Limitations Server forms bottleneck Server forms single point of failure I Database scaling difficult Multiple ClientsMultiple Servers directory caching query decomposition commit protocols Applications Client Services Communications LAN Mommam Database Database SQL Interface Applications programmatic Client Interface Services 0 other application Communications support environments LAN Data ase Data ase Assumptions Datasoftware located at multiple sites Sites are connected through communication network Each site logically consists of a single processor Relaxing leads to parallel database DDBMS is a full fledged database Not a collection of files Not just a TP system quotWhy Distributed Databases Transparent management of distributed fragmented and replicated data Data in organizations originate and are maintained at several locations Improves reliability and availability i Improves performance and scalability Minimizes overload Easier and economical expansion Transparency eti rHighIevel users are shielded from the implementational details such as the location replication of data Users still see a monolithic view of the database Different types of transparencies Distribution transparency Replication transparency Fragmentation transparency n Example ASG ENAME TITLE ENO 1N0 RESP DUR Doe Elect Eng E1 11 Mung 12 M Smrtn Syst Anal E2 11 Analyst 24 A Lee Mech Eng E2 12 Analyst 6 1 Miller Programmer E3 3 Consultant 10 E Casey Syst Anal E3 14 E eer 48 Elect Eng E4 12 Programmer 18 R Davis Mech Eng E5 12 age 24 1 Jones Syst Anal E5 1quot Ma39fage 48 E7 13 Engineer 35 E7 15 Engineer 23 E8 13 Manager 40 PRO SKILL 150000 Elm Eng 135000 Syst Anal 250000 Mech Eng 27000 Transparency Illustration S LECT ENAMESAL F 0M EMPASGSKILL HERE DUR gt 12 ND EMPENO ASSENO Paris Employees Paris Projects Paris Assignments New York Employees Bangalore Employees New York projects Bangalore Projects New York Assignments Bangalore A5539gnments paris Employees New York PrOJects A Reliability amp Performance lt gtDistributed DBMS avoids single point of faYure Only data and software at the failed site are inaccessible Further improvement by replication Performance improvements Proximity of data to points of its use Parallelism in execution o Interquery parallelism Intraquery parallelism DDBMS Additional Issues Distributing and keeping track of data Distributed query processing Distributed transaction management Replication management Distributed recovery Security lt 2Distributed catalog management A Data Fragmentation 13 Horizontal Fragmentation Groups rows of a relation to create subsets of tuples having a logical meaning Horizontal fragments can be specified by a SELECT operation Vertical Fragmentation Each vertical fragment retains certain columns of the relation Primary key has to be included in all fragments Corresponds to PROJECT operation r Hybrid Fragmentation Distributed Query Processing Methodology Calculus query on Distributed Query Decomposition Algebraic query on Distributed relations Control Site Data Localization Fragment Query Global Optimization Optimized Frament Query with Communication Operations at Local Sites Local Optimization ampL gig g Optimized local queries Steps Query Decomposition T Input Calculus query on global relations Output Algebraic query on global relations 39 Transformation rules used 39 Same as Centralized ata Localization Input Algebraic query on global relations Output Algebraic query on fragment relations lt9 Global Optimization Input Algebraic query on fragment relations The best global schedule gt Local optimization Input Best global execution schedule Dillmilquot The nnlimnl nr r pcc nnth fnr ranch lnr nl QIIhf lllPl V Decomposition amp Restructuring Convert relational calculus AENAME to relational algebra PrOJect Make use of query trees 0DUR 12 OR DUR 24 AND 4 Exa mple JNAME CADCAM AND Select Find the names of all employees other ENAM nJDOE n J Doe who worked on CADCAM project for either 1 or 2 years SELECT ENAME X FROM E w P WHERE EENO WENO P AND WJNO PJNO AND EENAME J Doequot AND PJNAME CADCAM AND WDUR 12 or WDUR 24 JNO Join IER E E s Step 2 Data Localization Input Algebraic query on distributed relations ailDetermine which fragments are involved Localization program Substitute for each global query its materialization program Optimize 4 Restructuring AEVTAME 0DUR T OR DUR 24 Assumptions 0 E is fragmented into E1 E2 amp E3 7 JNAMT CADCAM o W is fragmented into W1 amp W2 aENAMf J DOE PSK l E1 E2 E3 W1 W2 Provides Parallelism Step 3 Global Query Optimization Input Fragment Query Find the best global schedule Minimizes a cost function Distributed join processing Which relations to ship where Decide on the use of semijoins Semi joins saves communication at the expense of more local processing Join methods 9 Nested loop Vs Ordered joins CostBased Optimization Solution Space The set of equivalent algebra expressions query trees Cost function in terms of time IO cost CPU cost Communication cost Might have different weights in different distributed environments lt33 Search algorithms How do we nd the best plan Exhaustive search heuristic algorithms iterative deepening simulated annealing etc Cost Functions Total Time Reduce each cost component Do as little of each cost component as possible Optimizes the utilization of resources Response Time Do as many things as possible in parallel May increase total time Total Cost otal cost CPU cost 10 cost communication cost CPU cost unit instruction cost no of instructions IO cost unit disk IO cost no of disk IO Communication cost message initiation data transmission Factors influencing Total Cost Wide area network I Message initiation and data transmission costs are high Local processing cost is low Ratio of communication to IO costs around 201 Q sLocal area network Communication and local processing costs are approximately equal Ratio about 116 A Response Time Elapsed time between the initiation and the completion of a query Response time CPU time IO time communication time CPU time unit instruction time no sequential instructions IO time unit IO time no of sequential IOs Communication time unit msg initiation time no of sequential messages unit transmission time no of sequential bytes Example Assume that only the communication cost is considered Total time 2 message initialization time unit transmission time xy Response time maxtime to send x from 1 to 3 and y from 2 to 3 Time to send x from 1 to 3 message initialization time unit transmission time x Time to send y from 2 to 3 message initialization time unit transmission time y Distributed Transaction Execution f apmkmlinm Be initransaction esults amp User notifications Distribute Transaction Execution Iotlel Replica Control Protocol EOT Abort Distributed Concurrency Control 39ot oco Serializability in Distributed DBMS Somewhat more involved Two histories have to be considered Local histories Global history 9 For global transactions ie global history to be serializable two conditions are necessary Each local history should be serializable Two con icting operations should in the same relative order in all of the local histories where they appear together Global Nonserializability T1 Read X T2 Read X X lt X5 X lt X15 WriteX WriteX Commit Commit The following two local histories are individually serializable in fact serial But the two transactions are not globally serializable LH1 R1X W100 C1 R200 W200 C2 LHz R2X W200 C2 R100 W100 C1 Centralized 2PL Only one 2PL scheduler in the distributed system Lock requests are issued to the central scheduler Data 909855 at Coordinating TM Central site LM partICIpating units amp End 0 fOPeration Rele aSe cksgt ack re Lie Lo ck 6 awed A Distributed 2PL ZPL schedulers are placed at each site lt r Each scheduler handles lock requests for data at that site Schedulers coordinate among themselves Coordinating TM Participating LMs Participating DPs L Operatic n End Of Operation W Deadlock A transaction is deadlocked if it is blocked and will remain blocked until external intenention Locking based CC algorithms may result in deadlocks 13gt Waitfor graph If transaction Ti waits for another transaction TJ to release a lock then Ti 9 T in A Local Vs Global WFG Assume T1 and T2 run at site 1 T3 and T4 at site 2 Also assume T3 waits for a lock held by T4 which waits for a lock held by T1 which waits for a lock held by T2 which in turn waits for a lock held by T3 Site 1 Site 2 Local WFG Deadlock detection Transactions are allowed to wait freely Deadlock detection through cycles in Waitfor graphs Topologies for deadlock detection algorithms Centralized Hierarchical Distributed Centralized Deadlock detection n One site designated as the deadlock detector for the system lt9 Each scheduler periodically sends its local WFG to the central site which merges to obtain a global WFG to determine Cycles lt9 How often to transmit Too often gtHigher communication cost Too late IE9 Higher delays Would be a reasonable choice if CC algorithm is also centralized Hierarchical Deadlock Detection Site 1 Distributed Deadlock Detection Sites cooperate in detection of deadlocks 43 Example The local WFGs are formed at each site and passed on to other sites Each local WFG is modi ed as follows 39 Since each site receives the potential deadlock cycles from other sites these edges are added to local WFGs oThe edges in the local WFG which show that local transactions are waiting for transactions at other sites are joined with the edges in the local WFG which show remote transactions are waiting for the local ones Example Continued Each local deadlock detector Looks for cycles that does not involve an external edge o If it exists there is a local deadlock which can be handled locally I Looks for a cycle involving external edge If it exists it indicates a potential global deadlock Passes on the information to the next site Distributed Reliability Protocols Commit protocols How to execute commit protocols for distributed transactions Issue Ensuring atomicity and durability Termination protocols If failure occurs how should operational sites deal with it Nonblocking Failures should not force sites to wait until the failure is repaired to terminate Recovery protocols How should failure sites deal with it Independent Failed site can determine the outcome of transaction without obtaining remote info TwoPhase Commit 2PC Phase 1 Coordinator gets participants ready to write the results into database Coordinator The process at the site where the transaction originates and which controls its execution 30 Phase2 Everybody writes the results into the database Global Commit Rule Coordinator aborts transaction iff at least one participant votes to abort it Transaction is committed iff all the participants vote to commit it n2Phase Commit Ready I YesNo CommitAbort Committed I ZPC PngEgrcol Actions Partici ant a 2 a quotClassification of DDBMS Homogenous DDBMS Degree of homogeneity Heterogeneous DDBMS No local autonomy Degree of local autonomy Federated databases Multi databases Federated Database Systems High degree of local autonomy Communication autonomy Execution autonomy Association autonomy Each server is an independent autonomous database with local users amp local transactions Global schema is shared by applications Multidatabase systems do not have global schema and interactively construct one A FDBMS Issues Autonomy leads to heterogeneity Differences in data models Differences in constraints Differences in query languages Need for canonical system languages language amp query translators five level schema is used for interactions between autonomous DBs Parallel Database Systems Objectives Avoid problems with uniprocessor von Neumann computer architecture model sequential operating mode memory disk and main memory access bottleneck speeddisk ltlt speedRAM ltlt speedmicroprocessor Exploit parallelism inherent in database processing interquery intraquery intraoperation Increase memory bandwidth through data partitioning and parallelism Parallel DBMS Advantages 2 Much better cost performance than mainframe solution lt0 Highperformance through parallelism high throughput with interquery parallelism low response time with intraoperation parallelism High availability and reliability by exploiting data replication Extensibility with the ideal goals linear speedup linear increase in performance for a constant DB size and proportional increase of the system components linear scaleup sustained performance for a linear increase of DB size and proportional increase of the system components Barriers to Parallelism lt startup the time needed to start a parallel operation may dominate the actual computation time interference when accessing shared resources each new process slows down the others hot spot problem skew the response time of a set of parallel processes is the time of the slowest one Parallel data management techniques intend to overcome these barriers Shared Memory Architecture Examples symmetric multiprocessors Sequent Encore and some mainframes IBM3090 Bull39s DPSS SharedDisk Architecture Examples DEC39S VAXcluster IBM39s IMSVS Data Sharing SharedNothing Architecture Examples Teradata39s DBC Tandem Intel39s Paragon NCR39s 3600 and 3700 College of Comput CS 4420 Database System Implementation Query Optimization Recovery Ling Lin neiate Prnfessnr Ass Cnuege nf Cnmpnting Genrgia Teen Lecture Outline Query Optimization nal remarks o Architectures o Complexity and general methodology Failure Recovery o Database Consistency and Constraints o Consistent Database State and Transactions o Undo logging o Redo logging next lecture o UndoRedo o Checkpoints nextlecture oggmg next lecture DBIZ Query Optimization I Query Processing Algorithms o Iterative Approach nested loop 0 Sortbased o Hashbased o Indexbased I Query Optimization Architecture 0 Pipelining o Materialization DBIZ Pipelining I Many operations or certain implementations of them allow us to use pipeline 0 To accept one or both arguments in a stream without seeing the entire relation before starting I Materialization 0 When pipelining is not possible then the arguments must be materialized sorted on disk if it is large before beginning the processing DBIZ w Pipelined I Not pipelined TESNAME nested 100p city nested 100p M I K SPJ G SCTY like Atlanta 6 JPNAME LIKE Dlglta LIbrary SUPPLIER PROJECT DBIZ Pipelining Summary Project Selection Duplication Removal allow pipelining Nested loop join allows the outer argument to be pipelined but not the inner Index join allows pipelining of the nonindexed argument Sort or hashjoin allows pipelining of either argument but there are problems if both are pipelined o Sortjoin requires sharing of memory for sorting sublists o Hashj oin requires buffers for buckets of both relations in memory Intersection does not allow pipelining DBIZ Example nSNAME hash join 1 City indexed on A city of Supplier N ty K SPJ G SCITY like Atlanta 6 JPNAME LIKE DIgItal LIbrary SUPPLIER PROJECT DBIZ Example TESNAME hash join 1 city indexed on city of Supplier y SPJ G SCITY like Atlanta 6 JPNAME LIKE DIgItal LIbrary SUPPLIER PROJECT DBIZ Query Processing Methodology Highlevel Calculusbased Query SQL EXTERNAL Preprocessmg l Algebraic Query a tree structure Optimization Execution Schedule file access plan DBIZ Query Optimization I Not really optimizing but planning to avoid bad execution strategies I Models 0 Heuristics based Algebraic Rewriting and Algebraic Transformation 9Apply transformation rules according to a general strategy 0 Cost based 9Minimize a cost function IO cost CPU cost subject to a set of constraints DBIZ Query Optimization An Example SELECT ENAME FROM EW WHERE E ENO W ENO AND WRESP quotManager I Algebraic Optimization 0 Two steps 9Rewriting 9Transformation 0 May produce multiple optimal plans I Cost based Optimization 0 Decide which of the alternatives is the cheapest to execute DBIZ Algebraic Rewriting Strategy 1 two possible plans Project TEENFE RENAME Project 6E ENO W ENO AND 6 EENO WENO AND WRESP yyManager salad WRESP quotManager salad l PL ENO ENO Join Join E W W E Plan 2 Plan 1 DBIZ Algebraic Transformation Strategy 2 RENAME EENO WENO 6WRESP quotManager l Plan 3 Moving selection downward using Cascade of selections Com m utativity of selections and Commuting selection with binary operations DBIZ Plan 3 Algebraic Transformation Strategy 2 two possible plans P lan4 TEENWE D4 E ENO w ENO E 6 W RESP quotManager 6W RESP quotManage r l W RENAME EENO WENO E DBIZ Query Optimization An Example I Example SELECT ENAME FROM EW WHERE EENO WENO AND WRESP quotManagerquot I Strategy 1 Algebraic rewriting ENAMEGRESPquotManagerquotE ENoG 1NoE X Different jOi n ENAMEGRESPquotManagerquotE ENoG ENOW X ordering Strategy 2 algebraic transformation quotENAMEE gtltl EN00REs1HManaguquotm Dl grem jom quotENAME0RES1HManaguquotW N ENO E ordering DBIZ Cost of Alternatives I Assume o cardKE 400039 cardW10000 o 10 of tuples in W satisfy RESPquotManager selection generates 1000 tuples ie VW RESP 10 o No index on either E or W I execution time proportional to the sum of the cardinalities of the temporary relations join cost dominates the overall cost I searching is done by sequential scanning Strategy 1 Strategy 2 Cartesian prod 40000000 Selection over W 10000 Search over all 40 000 000 Join40001000 4 000 000 80000000 4010000 DBIZ How to decide which is the best plan Strategy 2 two possible plans Plan 4 Plan 3 RENAME RENAME lgtlt EENO WENO NEENO WENO 6 u E GWRESP Manager WRESP 7 Manager E W W DBIZ CostBased Optimization I Reduce a de ned cost of executing queries I What is involved in the cost of executing a query 0 Access cost to secondary storage 9 Search for data block index 9 ReadWrite index and data blocks 0 Storage cost 9 Index and data blocks 9 Intermediate les 0 Computation cost 9 Query planning 9 Record search sort merge 9 Actual transactionquery operations 0 Communications cost 9 Transfer of results to the user Query Optimization Issues I Search Strategies 0 exhaustive search vs Heuristics I Optimization granularity 0 Single query vs multiple queries I Optimization timing 0 Compile time vs run time DBIZ Query Optimization Issues I Search Strategies 0 exhaustive search 9 optimal 9combinatorial complexity in the number of relations 9costbased o heuristics 9not optimal 9group common subexpressions 9perform selection projection rst 9reorder operations to reduce intermediate relation size 9optimize individual operations DBIZ Query Optimization Issues I Optimization granularity 0 single query at a time 9can not use common intermediate results 0 multiple queries at a time 9efficient if many similar queries 9decision space is much larger DBIZ Query Optimization Issues I Optimization timing 0 static 9 compilation 2 optimize prior to the execution 9 difficult to estimate the size of the intermediate results 2 error propagation 9 can amortize over many executions 0 dynamic 9 run time optimization 9 exact information on the intermediate relation sizes 9 have to reoptimize for multiple executions 0 hybrid 9 compile using a static algorithm 9 if the error in estimate sizes gt threshold reoptimize at run time College of Comput CS 4420 Database System Implementation Indexing and Hashing Part II Ling Lin Assncizte Prnfessnr Cnllege nr Cnmpming Gmgiz Tech WWW Lecture Outline Review of Last Lecture de sional Indexing multilevel indexing gtHashmg B 39Gnd Partitioned Hashing gtTree Based tKDrTree RrTree ased quotmm mu DB07 Introduction Review of Last Week s Lectures I Indexing 0 Conventional indexes Dense vs Sparse o B Trees 0 B trees BTree and Sequential Indexed files I Hashing scheme 0 Static Hashing a Dynamic Hashing 9 Extensible Hashing Prepared by Ling Liu 3 DB07 Introduction Hashing Tables I Hash function h search key 9 0 B l I Buckets are blocks numbered 0Bl I Main Idea 0 If a record with search key K exists then it must be in bucket hK I Advantage o Cuts search down by a factor of B 0 One disk 10 of there is only one block per bucket Extract rom Note by J Ullman 4 DB07 Introduction EXANIPLE 2 recordsbucket INSERT ha 1 hb 2 hc 1 hd 0 he 1 DB07 Introduction Hash Table Operations I Lookup o For records with search key K compute hK search that bucket I Insertion 0 Put in bucket hK if it ts Otherwise create an over ow block 9 Over ow blocks are part of bucket I Deletion 0 Compute hK search bucket for records with key K Extract rom Note by J Ullman 6 DB07 Introduction Efficiency of Hash Table Indexes I Efficient is highest if the following condition holds records lt buckets recordsblock I Good hash function Expected number of keysbucket is the same for all buckets I Potential Problem when the number of records in a le grows 0 too many over ow blocks for some bucket 0 too few records 7 producing more sparse buckets I Need dynamic hashing method to maintain the above relationship 0 Extensible Hashing 9 Double the number of buckets when needed 0 Linear Hashing 9 add one more bucket as appropriate I How does hash table access compare with B tree access Prepared by Ling LHJ DB07 Introduction Efficiency of Hash Table Indexes I Efficient is highest if the following condition holds records lt buckets B recordsblock I Static Hash Table B never change I Problem 0 Performance of a hash table may degrade if there are too many records in one bucket I Solution Dynamic Hash Table 0 Methods to allow graceful growth of number of buckets and to maintain the above relationship 0 Extensible Hashing grows B by doubling it each time 0 Linear Hashing grows B by add one more bucket each time Prepared by Ling LHJ DB07 Introduction Dynamic Hashing Framework I Common Feature 0 Hash function h hashes search key to a long k bitstring 0 Uses only some of the bits at any time to determine placement of keys in buckets I Extensible Hashing 0 Pro 9Can handle growing files with less wasted space and with no full reorganizations 0 Con 91ndirection Not bad if directory in memory 9Directory doubles in size Now it ts now it does not Prepared by Ling Liu 9 DB07 Introduction Dynamic Hashing Framework I Linear Hashing 0 Pro 9 Can handle growing files with less wasted space and with no full reorganizations 9 No indirection like extensible hashing 0 Con Can still have over ow chains LeLy full D D D D D D L Ll J gt Need to move 171 here D Would waste space Very empty Prepared by Ling Liu 10 College of Comput CS 4420 Database System Implementation Query Processing and Query Optimization Ling Lin sncizte Prnfessn A r Cnllege nr Cnmnnu39ng Genrgia Teen Lecture Outline Query Processing Phases Relational algebra level last lecture o Algebraic transform atlons 0 good transformations Detailed query plan level today s lecture o esurnate costs intermediate results size 10 blocks o generate and compare plans Query Processing Methodology Highlevel Calculusbased Query EXTERNAL SCHEMA LOGICAL Algebraic Query a tree structure Execution Schedule file access plan Relational Query Optimization I Relational algebra level 0 Using Heuristics for good transformations to reduce the search space of local query plans 0 Algebraic Rewriting o Algebraic Transformation 9 Typical Heuristics 6 Push Selection and Projection closer to the base relations in a SQL query tree 6 Reduce intermediate result size as much as possible I Physical Executable query plan level 0 Consider various indexes and their performance characteristics when estimating costs of alternative query plans 0 Pick the most economical physical query plan for query execution Query Rewriting Same Query Can be Expressed in two different formats Select BD From RS Where RA c SE 2 RCSC Select BD From RS Where RA c RC Select SC From S Where SE 2 Algebraic Rewriting RENAME Project 6EENO WENO AND WRESP quotManager Select EENO WENO AND WRESP quotManager RENAME Project Select Algebraic Transformation RENAME E EENO WENO 6WRESP quotManager l W Moving selection downward using Cascade of selections Com m utativity of selections and Commuting selection with binary operations How to decide which is the best plan Plan 1 D0 early selection Plan 2 RENAME D4 EENO WENO E 6WRESP quotManager RENAME EENO WENO 6WRESP quotManage r E l W How to decide which is the best plan Do early selection and Plan 3 early projection Plan 4 RENAME RENAME lgtlt gt4 EENO WENO EENO WENO TE TE TEENO TE ENAME ENO Two ENAME EN 6WRESP quotManager 6W RESP quotManage r E E Identify subtrees that can e W W 39 in one algor thm Heuristics Query Optimization Summary I First apply operations that reduce the size of intermediate results 0 Move selections and projections down the tree as far as possible 9 early selections reduce the number of tuples 9 early projections reduce the number of attributes o The selection and join operations are the most restrictive thus they should be executed before other similar operations 9 This is done by reordering the leaf nodes of the tree among themselves and adjusting the rest of the tree appropriately 1 SQL query parse tree I a nswer Pi l logical query plan statistics T improved qp stimate result size qp sizes P1c1P2c2i consider physical plans T P1P2 Note by Hector GarciarMolina n Algebraic Transformation Rules I Cascade of selections cpl and p2 and and pnR E 6p16p239 6pnRquot39 I Cascade of projections A1 and A2 and and AnR E 7TA17TA2 AnR E nA1R if Alg A2g g An I CommutatiVity of selections 6p16 I32R E 6p26p1R o 6P1 or pzR E 6p1R U 6p2R I CommutatiVity of selection with projection A1A2An6pR E 6pnA1A2AnR Algebraic Transformation QA I State if the following transformation is correct or not and explain why shamanismcpgcpmdpgaz E plmdpchszD quotA1 and A2 E quotA1quotA2R E quotA2quotA1CR x E quotA1CR one u 04R ammo 7 A1AZGPCR E 5pquotA1A2CR X E 6p1A1 QA 6pAR X S E 6PAR X S A is de ned on R only 6pAiR 1gtlt qAiBk S E 5pAiR 39gtlt39 qAiBk S Ai is de ned on R and Bk is de ned on S 6pAiR U T E 6pAiR U 6pAiT 5pltAiR n T E 6pAiR 6pAiT Ai is de ned on both R and T ncR x s 2 MR x ms ncR u s a nAR u ms CA39VB39where A gR B Q S mu mum College of Comput CS 4420 Database System Implementation Query Optimization Part II Ling Lin Assncizte Prnfessnr chiiege hr Camputihg Genrgjz Tech whammy r mu mum Midterm 1 QuestionsAnswers Notes Chapters suggest to read 0 Chapter1 o Chapter 2 7 Section 21to Seehor 2 4 o Chapter 3 7 Section 3 1 Section 3 2 Seehor 3 4 Section 3 5 Chapter4 o Chapter 5 7 Section 5 1 to Seehor 5 3 5 o Chapter 6 7 Section s1to Seehor s 4 Section 6 s Seehor s 7 o Chapter7 7 Section 71to Seehor 7 5 3 Section 7 7 to Section 7 7 5 whammy 1 DB10 Introduction Lecture Outline I Review ofLast Lecture 0 Query Processing 9 Query Processing Steps 9 Algebraic Based Optimization w Algebraic Transformation amp Logical Query Plans 6 Good transformations 9 Costbased Query Optimization 6 Intermediate Result Estimation 6 IO Blocks Estimation I Today s Lecture 0 Costbased Query Optimization 6 IO Blocks Estimation 6 Choosing the optimal query execution plan Prepared by Llng LlLl 3 DB10 Introduction 7 lanswer l logical query plan 7 7 statistic Pi improved lqp T stimate res ult siz e P39Gk est P1P2 Note by Hector GarciarMolina 4 DB10 Relational Query Optimization I High level Transformationbased Optimization Logical Query Plans 0 Algebraic Transformation or Query rewritting 0 Goal Reduce the search space of logical query plans 0 Mechanisms Heuristics for good transformations 9 Push Selection and Projection closer to the base relations in a SQL query tree 9 Reduce intermediate result size as much as possible I Low Level costbased Optimization Executable Query Plans 0 Mechanism 9 Cost estimation using intermediate result size estimation and 10 cost estimation 9 Consider various indexes and their performance characteristics When estimating costs of alternative query plan 0 Goal Pick the most economical physical query plan for query execution Prepared by Ling Lm DB10 Introduction Costbased Query Optimization I Estimating cost of query plan 0 Estimating size of results done 0 Estimating of 10s lt today s topic I Generate and compare plans Note by Hector GarciarMolma DB10 Introduction Estimating IO costs lKeep statistics for relation R oTR tuples in R oSR ofbytes in each R tuple 0BR of blocks to hold all R tuples oVR A distinct values in R for attribute A o BR of blocks containing R tuples o KR max of tuples ofR per block 0 M memory blocks available o HTi levels in indexl o Bi of leaf blocks in index i Prepared by Ling Lru DB10 Introduction Estimating cost of query plan 1 Estimating size of results Selection Projection Join 2 Estimating of 10s I Count of disk blocks that must be read or written to execute query plan I Factors considered 0 Relations Contiguous or not Index or not Hash or not Prepared by Ling Lru College of Comput39 9 CS 4420 Database System Implementation Concurrency Control Ling Lin sncizte Prnfessn A r Cnuege nf Cnmpnting Genrgie Tech Lecture Outline Today s Lec 9 Phantums Phenumenun l Review oflectures since midterm 1 o Failure recovery UnduReduUndur edu o Concurrency ontxol e mmmuun 9 Theums 9 Luckmg prutncul Review of Concurrency Control 0 Problems of concurrent access 0 Basic Concepts of Concurrency Control 9 Schedules 9 Con ict actions 9 Con ict Equivalent Schedules 9 Con ictSerializable schedules 0 Locking schemes 9 Two Phase Locking 2PL 9 Share and Exclusive Locks 9 SX locks With Increment Lock 9 SX Locks With Update Lock 9 Lock Table and Locking System 9 Locking Granularity 9 Multiple Granularity Locks IS IX SIX I Today s Lecture 9 Phantoms Phenomenon 9 Treebased CC 9 Validation CC T1IS T2IX e T1S T2IX Multiple granularity Comp Requester IS IXSSIXX IS T T T T F HolderIX S T T F F F SIX T F T F F X T F F F F F F F F F Why is multiple granularity tree useful 9Problems with Insert delete operations Insert Phantoms problem Example relation R Ena1ne constraint E is key use tuple locking R E Name 55 Smith T1 Insert lt99Gore gt into R T2 Insert lt99Bush gt into R T1 T2 8101 58201 8102 38202 Check Constraint ECheck Constraint Insert o399Gore Insert o499Bush Solution I Use multiple granularity tree I Before insert of node Q 9 we IThis approach can be generalized to multiple indexes lock parentQ in X mode Next I Treebased concurrency control I Validation concurrency control Optimistic concurrency control Example 0 all objects accessed through root following pointers T1 lock T1 lock r can we release A lock if we no longer need A Idea traverse like Monke Bars T1 lock T1 lock T1 lock Rules tree protocol exclusive locks 1 First look by Ti may be on any item 2 After that item Q can be locked by Ti parentQ locked by Ti 3 Items may be unlocked at any time 4 After Ti unlocks Q it cannot relock Q only if I Treelike protocols are used typically for Btree concurrency control Rootquot E g during insert do not release parent lock until you are certain that child does not have to split Next I Validation concurrency control Optimistic concurrency control Validation Transactions have 3 phases 1 Read 0 all DB values read 0 writes to temporary storage 0 no locking 2 Validate 0 check if schedule so far is serializable 3 m o if validate 0k write to DB Key idea I Make validation atomic I If T1 T2 T3 is validation order then resulting schedule will be con ict equivalent to Ss T1 T2 T3 To implement validation system keeps two sets I M transactions that have finished phase 3 and are all done I m transactions that have successfully finished phase 2 validation Example of What validation must prevent RST2B R ISOAB 74 I WST2BD 4vsltT3C T2 T3 T2 T3 J start l start l validated lvalidated time allow Example of What validation must prem RST2B R I3AB y I WST2BD WST3C T2 T2 T3 I start Ema l validated lvalidated T2 1 T3 time nish start phase 3 Another thin validation must revent RST2A RST3AB WST2DE WST3CD T2 T3 i validated Jvalidated finish T2 tlme BAD W3D w2D allow Another thing validation must prev RST2A RST3AB WST2DE WST3CD T2 T3 validated validated finish 39 39 T2 T2 me Validation also called optimistic concurrency control is useful in some cases Con icts rare System resources plentiful Have real time constraints Summarv Have studied CC mechanisms used in practice 2 PL Pessimistic CC Multiple granularity Tree index protocols Validation Optimistic CC Review and Exercises Failure Recovery DB Operations and Transactions Atomic transaction Transaction Failure due to system crashes or media failure Recovery Techniques 0 Logging 0 Logging schemes 9 Undo Log WAL 9Redo Log 9Undo Redo Log 0 Checkpointing Exercise 1 Suppose A and B are database elements and their initial values are both 0 After transaction T executed their values are both changed to l The content for log are given below ltSTART TgtltT A0gtltTgtltCOMMIT Tgt What kind of logging was used and what are the values for the two question marks I a Undo and B0 I b Undo and B l I c Redo and B l I d Redo and B0 I e Redo and A 1 Solution A Exercise 2 Examine the schedule given below There are three transactions T1 T2 and T3 Initially salary l and tax 2 The assignments happen within the local memory space of the transactions and the effects of these assignments are not re ected in the database until the WRITE operation lt0 T3 startgt T1 T2 T3 lt3 T1 startgt 0 start lt6 T3 tax 2 3gt gEAEtfai 1 lt7 T3 commitgt 4 thlggD salary lt8 T2 Star 1 g M V M V tax lt12 T2 tax 3 4gt g fatal tax lt13 T2 commitgt 10 RELB salirgalary lt14 checkpoint start T1 is still activegt l3 Whir atxax 13 commlt lt17 T1 salary 1 2gt 1g checkpoint start 16 tREADttagtlt I lt18 checkpoint endgt 17 ax ax afya ary lt19 T1 commitgt 11 checkpoint end a Show the undoredo log le entries that would be generated by this execution For each log entry indicate what line above generates it 28 Exercise 2 Cont b Assume that the undoredo recovery algorithm with checkpoints is being used The database crashes immediately after statement 7 Assume that all the log records up to this point are on disk 0 b1 Which transactions would have to be undone Answer T1 0 b2 Which transactions would have to be redone Answer T3 Exercise 2 Cont c Again assume that the undoredo recovery algorithm with checkpoints is being used but now the databases crashes just after statement 18 Assume that all the log records up to this point are on disk o 01 Which transactions would have to be undone Answer T1 0 c2 Which transactions would have to be redone Answer None Concurrency Control Correctness informally I If we stop running transactions DB left consistent I Each transaction sees a consistent DB I Recovery Problems due to failures m I CC Problems due to data sharing m I CC Recovery Problems due to failures and sharing next lecture Concepts Transaction sequence of riX WiX actions Con icting actions r1A W2A W1A W2Alt r1A ltW2A Schedule represents chronological order in which actions are executed Serial schedule no interleaving of actions or transactions Precedence graph PS S is schedule Nodes transactions in S Arcs Ti gt Tj Whenever piA qjA are actions in S piltAgt lt3 cultAgt at least one of pi qj is a write A39 De nition SI 82 are con ict equivalent schedules if SI can be transformed into 82 by a series of swaps on noncon icting actions De nition A schedule is con ict serializable if it is con ict equivalent to some serial schedule Lemma SI 82 con ict equivalent gt PSlPSZ Theorem PSl acyclic Cgt SI con ict serializable Exercise 3 T1 ReadA A A100 WriteA ReadB B B100 WriteB Constraint AB T2 ReadA A Ax2 WriteA ReadB B Bx2 WriteB Questioma is this schedule con ict serializable Which serial schedule it is equivalent to T1 T2 ReadA A A100 WriteA ReadB B B100 WriteB Answer Yes It is equivalent to serial schedule of T1 ReadAA Agtlt2 proceedlng T2 WriteA ReadBB Bgtlt2 WriteB Question b is this schedule con ict serializable Which serial schedule it is equivalent to Answer Not seria zable and it is NOT equivalent to I T7 any serial schedule T1 ReadA A lt AlOO WriteA ReadAA lt AXZ WriteA ReadBB lt BXZ WriteB ReadB B lt B100 WriteB Exercise 4 I What is PS for s W3A w2C r1A w1B r1C w2A r4A w4D I Is S serializable N0 Locking Schemes I Binary I Shared Exclusive I Extending the SX locking scheme with new lock I Examples 0 Increment Lock 0 Update Lock 0 Intentional Lock IS IX SIX Twophase Locklng 2PL Protocol To guarantee serializability In a transaction all lock operations SiLock or XiLock precede the first unlock operation No locks can be acquired after the first lock is released We call a transaction satisfying twophase locking protocol if it obeys the above rules Twophase execution Growing phase lock acquisition only no unlock Shrinking phase lock release no more lock Lock point divides the two phases Lock point 1 Obtain lock 1 Release lock N0 of IOCKS Phase 1 A Phase 2 BEGIN END Serializability of 2PL Theorem For any legal schedule S o if every transaction in the schedule S satis es the 2PL protocol S is guaranteed to be serializable obviating the need to test for serializability of schedules Note ZPL is a sufficient but not a necessary condition for serializability of schedules may limit the amount of concurrency that can occur in a schedule Potential Problems Livelock Starvation Deadlock Livelock I Livelock occurs 0 when one transaction cannot proceed for an inde nite period of time While other transactions continue normally I Cause 0 due to unfair waiting schemes usually prioritybased ones I Fair schemes 0 FCFS o prioritybased by increasing the priority of a transaction who waits longer DeadLocks Deadlock occurs when each of two transactions is waiting for the other to release the lock on an item T1 T2 XiL ockA I Em R1 A 1 X7LockB yy 7 R203 quot K 4 X7L kB39gt i X LockA39 waltmg waiting In general a deadlock may involve n ngt2 transactions and can be detected by using a Waitfor graph Wait for graph Node transaction Edges pending lock request If Ti is waiting for an object held locked by Tj then there is an edge from Ti to Tj Root is the current lock holder 43 Deadlock Avoidance Conservative 2 PL Conservative 2 PL static requires a transaction T to 39 predeclare all the read set of items and write set of items and lock all the items it accesses before T begins execution 39 If any of the predeclared items can not be locked T does not lock any item at all Instead T waits and tries again until all the items are available for locking 39 Conservative 2 FL is deadlock free No of locks T Obtaln lock Release lock Phase 1 Phase 2 Transaction BEGIN Deriocl of END duration data ilem use 44 Deadlock Avoidance Strategies I No waiting 0 If a transaction cannot obtain a lock it is immediately aborted and restarted later 0 Do not even check for deadlock I Cautious waiting 0 When a transaction Tx tries to obtain a lock that is already held by T J9 T is blocked and wait iij is not already blocked 9 Otherwise abort T I Based on timeouts o If a transaction T waits longer than a systemdefined timeout then the system assumes that T is deadlocked and aborts T regardless of whether T is actually deadlocked or not Strict 2 PL In addition to the Basic 2 PL and Conservative 2 PL there is another version of 2 PL Strict 2 PL A transaction does not release any of its locks until after it terminates by committing Strict 2 PL is not deadlockfree unless combined with conservative 2 PL But it helps to overcome the update lost problem If all transactions in a schedule S follow the strict 2PL protocol S is called a strict schedule NO 0quot 3906 3 Hold locks until the end T l Obtain lock Release lock EEGIN END Transaction duratlon period of data Item use Exercise 5 What are the problems of the following two schedules a 1 and 2 have lost updates problems b 81 and 2 both have dirty read problems c 81 and 2 have dirty read and lost update problem respectively d 81 and 2 have dirty read and fuzzy unrepeatable read respectively Solution D T1 T2 WA RA Rollback Exercise 6 The following schedule observes a the basic two phase locking protocol basic 2PL only b the conservative 2PL c the strict 2PL 1 none of above Solution C College of Comput CS 4420 Database System Implementation Indexm g Ling Lin ncizte Prnfessnr Ass Cnllege nr Cnmpnn39ng Genrgjz Tech mummy n i Lecture Outline Review of Last Lecture o Disk Organization aHowto lay out data on disk r Opnuns fur Staring Recums in blacks l Today s Lecture 0 Indexing continues Conventional indexes gtBrtrees eHashmg schemes mummy n 1 D1305 Introductj on Conventional index I Basic Ideas 0 Sparse o Dense o Multilevel I Duplicate Keys I DeletionInsertion I Secondary indexes Prepared by Ling Lm 3 DBOS Introductj on Dense vs Sparse Index I Dense Index 0 Every key in the data le is represented in the index le I Pro 0 A dense index if ts in the memory is very ef cient in locating a record given a key I Con o A dense index if too big and doesn t t into the memory will be expense when used to nd a record given its key Prepared by Ling Lm 4 DB05 Introducti on Dense Index Sequential File 10 Every key In the data le is represented in the index file Advantage of Indexing e m dew quot es m Prepared by Ling Liu 5 DBOS Introduction Sparse Index I Sparse Index 0 Normally keeps only one key per data block 0 Some keys in the data le will not have an entry in the index file I Pro 0 A sparse index uses less space at the expense of somewhat more time to nd a record given its key 0 Support multilevel indexing structure I Con o Locating a record given a key has different performance for different key values Prepared by Ling Liu 6 DBOS Introductj on Sparse Index Sequential File Normally keeps only one key per data block Prepared by Ling Liu 7 DB05 Introduction Sparse 2nd level Sequential File Prepared by Ling Liu 8 D1305 Introductj on Terms I Index sequential file I Search key st primary key I Primary index on Sequencing field I Secondary index I Dense index all Search Key values in I Sparse index I MultileVel index Prepared by Ling nu 9 DBOS Introductj on Primary Key and Primary Index Relation Students Name 3g deEt SMITH 123 CS Johnson 522 EE Mary 444 CS Primary Key Primary Index Option l Storing the relation as it is Option 2 Storing the relation in the ascending order of id Index Sequential le Note by J Ullman 10 DBOS Search Key Vs Primary Key select from Students where dept EEI Which one is the search key How many Primary Keys a relation can have How many search keys a relation can have Note by J Ullman Introducti on DBOS Dense Index and Sparse Index Relation Students Name id deEt SMITH 123 CS Johnson 522 EE Mary 444 CS Can you build dense index on every attribute of a relation How many sparse index can you build on a relation Note by J Ullman Introducti on Qgggmg w uw39 CWM Database Implememm m Peer to Peer Data Management 339 Emnremaax mam mam gh sums 28 imam Lemme 24 0mm 39 mat WW P1633211quot 5 w ce a a M2311 349315 WW ntmduct an 3 219 Mg H mit rwr wr imi tltiws 9AM mmam mas x g qwam lt gt PEER om hg mmco i m mammam WWWM L N1 de Maw H mi 0 WW P2P OUi vaZ Q IH f gs wae ammgma Com 1 WWW oD gmrDKQM iw y tm mg zim g am y 3 Wmquot A peeramapeer stcmge pmbllem r mLut mcc w mthwgt o 1 ma i mgmd a WnHHiiumg it amd QJnQag r o HZiCw 3 WM Wadi m5quot J NLQ P N b Um mgii wasng mmwgl Wer mm Pmajb am WW i rw fdimx t 1in M J d g wmaw ra Wad Q wwfumg gssvgsi reim39w fx 0 Mud f rar 9 MSWquot l Values W m rgt 4 t J 7quot J gt 0 Q er H Lookupf de39 E f rgkaa Lmkqujp mg 1er g mipbfkaatm 1mm mama Centralized makinpr apster mummy K DBK I A WW 7 r Swag 5w Om mag mb gsmaniE mm 9 rumg muggy 93mg Distaribmed Salu w Fbmlw Cg luj ha M ad gujgp am 11 quoti V gt T Mer mgr W 39 E5 mam d F mnwv o ngi bw a Qm 3 xpg snwy39a 0 1H Wu mmsf igs tm39i wiicgz Obwthcj wmrce m um I um mamqu m mmmry o lr M c teld MEQ n mgw awe V x 1 may 93mm Cmpum ma l Ememy g NQT Main 6min 0 mm mxvxmg ifmasj m 51 I y i m O gtXltp i t mg x s mg 1n gt Og g l g me mamy ouag m pr it m om mzojmy oammym iry mnwv Why metamemer Reducing w 1 lt61 Mi icad W b awmg NO addi ma l than 2 Adtmimwig 1f vmag mwnmmg 3a Swimer lunggmaidmmg Emma 9mm 19me 451 ESSHWQJHc mmmii 1 f z21mngm Faumta diewm by damn gum me C mm0n issues 0 wggm 2 9 mg wmm wc lgly mammmk 9 mm an 0 mm WMMQQ o www9me MM 0 R w 111gtm1mmmmmg o R W H m im 0 L 1 i j mngiixvxm k pm m y ljl lewxaz g m P2P W W y warmmk CHEW laan Napster Hismry o Si hgwm Fam mg N1rfdm 1m OES iml 65 M M j by Z 034mm MWm rim 2 X l o Hi iL7Wif Empl39lmm i m o wm icax l f j 2 d o mmmmwx39tgl akg g axemw hr w pfmioE e TEE m MWF111 m1 1 P2P damn E Aw 1143mm Gnute la Hlsmry 0 rngfmgg INMIHW LE mam FWQWK U mrd T m P glamw wm ifg i l mgmgm m mxr mmg muii OGWMUQ Us mitgd f u 3 may o m li Lfc7lf j j md tremfmmg 0quot Pw P W J 49 f am IWK GS CREW anv P2P ewe study a Gnute a 0 P2P Was h jmmg 1 tmia m it w f m LEVI l ry rmrgi vmm o IM 6 gt pm WWW WWW m o lNkcadkzgg msa m fm p m TG mm 0 Qty earns b dim b 1 WWW o w Mme Wm i mm mm o g1 mg mmw HID 0 WWW raga WWWw ca ng yjn f gummy mmpxg L S 114 CEAIEEIE CWW Cumm Tecmjiqu Gnutelma Bm hal f m Search BPS W m V J WW 39 a F O 3 r V L mux gmng gt f 6 56mm E mem gt massgmm 349315 114331105343 Metrics 4 Lw ragng ggjemg olagm wcd ih w OPIF ULW mer39 0 Quality 1i ommmw F wr tzsmms og fgiacc i m l w in mwm gt5 fangs h m OTTWK if g tfgf am39tm mnw Gnutell a File ae mg g Nli Hil 9E3 h H gw gm Wr o mu cai as im mw WM OSLE rr H Lt Wig3 Hawks W ae s i mrg warQ Ws r 1nm gm ms J JCeJth tm Q pw 98am W W mg 1tr Hmmy 114331105343 w o GWMEUH mngr g ft by 6 A mam tsggv o H H y f m if m Seamh Pmmml o Sarg u rcdm S o Um rgm Rasmwa ii H N o Wm m m T 1an WW gem em mm o gggm g A s N 139 UU i l 1 Wigwam o Flmmmg g mi mnw Flimd mg Pmch 6 y g1 W a l 9 H11 d n i ham am ne 0613va 17 f WW pasx r mejma 9W 1mm rm 7 1 f tam wafmx a tfgs fill bxmmwm x mfg mw 0 l n o 9 mm by hxax dkgmg x t v Q oUmhcqugvg HEW g m mm m Mtgg 4 F oH p mm mm mm famardmg Pmtm l 349319 Pm o B RW mw M A S N T o M 3 15 may W N Mr Tim B mp 11 de Emlm gm Whom 0 B H IK w s Hm lm gm n f mggsw itg Emmi B rm J N R 5MH AC 0 H1 m mm H 0Ry o B 3 s N 71 w m 139 m mmmax me f tam3 mm A mg 1m 1am w i N 0 Wh gt m B 1Cc3 w f p lt 17fi h1 f nmw R w r 5mm m 156 i i m g h gtf mmw mg gm A1 El imam mum Gmup Membership 0 mm wi39rf m g1 PUN am sm e Uf o R M5 mmcc 131m PUNK in m glmwrg o Fi ixv 1r WM wit 931th UP mmagvmg ifgmrcmj a P1 n d if rrrgbgmw mime 0 PUNCng QJEJDW o WKQU HWWNN mm mm W mammal mm wr mcdm mam Nodes m the largest network component thousands 200 l Figure l Network growthfhe number ofnodes in the largest con nected component In the network grew by about 25 times over the seven months ofour study S Mesnges per second MN 33E 2 N Tlme mmures Fi ure 2 Generated traffic In November 2000 overhead traf c on a randomly chosen link accounted for more than 50 percent ofthe total whereas user query messages were only 36 percent Nem a Diameter Reachable node pm percent 1 2 3 4 S 6 7 Q 9 IO Nodertornode shortest path hops Figure 3 Node tonode shortest paths More than 95 percent of node pairs could be reached within 7 hops mum WV a A a SQme Stamgmg 0 Nammk 5179 gmwm Gm 7 awning 0 Rm 2 tr A N cgngm TWE biw ir y o alum me 2mm o fA lQm 26 a mxokevs We mm mm 2241 h ws O 1 16 th m Alumna gmaj of muui i 91 139 21mm lamaer 3mg W PW ELa W Hymtmgis o ggi mirggm irzimg i K iNMtWKg miikwv ig k i i JK Ni v 39 Nimk s mam L sham mmmmi M oiyii39a my Mi mm mi Hiram A aw with warm W o ime f m iwiogti klt i39 YONiZ i im owmags Lh x Ma ah rg vma h u n i mm mm I N dl Cmmc v ty Nodes Jog scnie 00 Links og scaiei Figure 6 Connectivin distribution March May 200 IThere are too few nodes with low connectivity to form a pure powerlaw network TEES mamai Malay Network Mapng Figure 7 Mapping the overlay network topology to the network infrastructure a With perfect ma ping a message inserted into the network by nodeA travels physical link DAE only once to reach all other nodes b With inefficient mapping the same message traverses the link six times 3 Neath minkquot Ramde Clusters Hypatihesmg o imgi eilie insensitive iglnysiioei igmoxai quot 0 imiteiira dimmers iipiptwr iril39n td tonniiy d ntoxexsim mg o 1 dig sitemimggi ii quotOiLUi mi 39 meE w 5 0 Egg o 05ij o www ggg mg 3amp5 ammmu Egmigg E36 VQ gEQwQO A853 kw Qgtg wgw E NNQZ E Q O g g g g m M6 0 Ar E a ag gg g ai g MEQ wg gt635 E E ggio Q E mgmw gg i o gwggw ME Ewm i 33 gammy E Aw P2P Search 0 Dgtvj V1i m m m 15639 U x39a yg amg ma 0 13mm 0 gmmyra W fro mgim mgm 9 armng mag o R m gmmw m timmgmwy waw umm 0 me o Mme m n H 0 mm mgmfwm Mag to p 1 gyyn Qummbm 9 mm 2 m xvgmiagp r Im mwmw 2 ww mwg P2P Search Qutil ne W 0 Tltimm mw 0 13mm Wat gm m 131 o Mimiim mmm mg RIDES o Rmmm Wa k o l mmmmm wmmm 1313ng o Hilgmd mg Pm J39H mg1mmiy o SWWQW E r mi o Sdm m n vg M 1 kw Tr 1 o g a R M1I ltslim m 3419319 Search Tmhmique Immiw W Deepening Hm GM my 0 TSVt was g gra ig c o 39w f b glmm m w 1mm an M dgn 0 Q WLF H mmmr i f ww sj bgtim w jfnuy y mfg mg W WLG 0 H1 M sw ags i fs a l Mr W f 2 13W mmwmm Hm PHCOJPS Er WW 6 v Am m11 1 xm yg W 5am mmmegm ui Imwmbgf mwmtwar w f Mm m 1 mm micsm ccmg gtgt Lmumamg f 1 mllp J 63 S35 Seamh Tmhnique 119mb ab l s tc Wquot Bmadm mg GE Mug 0 mm m mwmbear f mag Qc l m TTJLQ o Mam m a rm ii lf 1 m fwmm Q fi k mg ghbmga o N m f mg Um mxm i o lW iM 7 ifi hrg mlgmmr o ff gwf mm ww o mm mmm 1m hng 195 mm mm mammal 11 mwm mm 4A i i 31 mw mimmmmawv Mummiwv mam um mats07m 9 a Mimmdmgmwmwwm npn m m Wm w um msan WAR 7 Nahumi ihunm mame u an l m mim av it mum wmmmmm mam 1311 clam mam mmmxewhw WNW In MW unaware CE AQEID O O Pea Hetemgene ty P r by leqnf mmm gm qwal m gammy HH wf mii ngwa L Rm mg mml M t nr39 f m lt m 1 1 iw l klt laggmnm h high maJ S LL Vs ia aung cvmwmx c tjkam Smmam r mi NW5 aamgs 21 m g mm Umwi TTL Wm oB i tH mtagklt Wme t m ii TC fi mmwgwmw mum Mg a gmad WW Wk 0 NME mama f f Elite A mam IFTJL 397 o qreirgw mwbm 4 Mb 33 mli m MW 22 gt 4L u 4 Wm quay 2 31m 1mm 1g mm 72 0 Wm mm 22 4 dwji nsm by 4 gm CEAIEEIE wmraw Camm maawam netwmk mpdagieg i 7Q m f y LLyU EH1 L nggg w ejf ln m xvsamgbur mnmmm o L VH a r g WKQWQ mm mm 0 N YAV I k MD 01 39 H gi 25mm 1 n O o L y swrw o I Hmr x iamhw m a Mo 0 mm xiimam mm mm mm 21m Kym CWW Laan T0p ll gyz Layered Dense ng ii mmm gm 6W mmsco m WWW win 9 t5 mmwmm WW by m fm l rr mw m wg Aw VD A Q mmglsy mammal wmraw gmm y r w 3 M L i MWng Layered Spame a WWW mm mm sawinxcgn HL mg m mmmadm o mgr exam lye gmnmac m mccmr thaw Loam mar lbxal r g mg lava HL Y7 I A Y 1 t l g AL mum me 0 O L M r 0 as s mmgw 3L quot mwr mzegxrg ff mm a7 EWQ GQ lrwxgtijrte W n m Emmi a Q 1 b mg g it HL 392 n um Saw y Amiga 11V lt3 THEM m m u an m mama 3L V 31 1 my v ihs mlmy mm m t mw w 7W I 3233 L m mum w u H V 6 3mm mm Dams mm 6 13W an 6 But its wlga mm mm m infammk a a F a 3mm 11mm Li W mm m mart m a maxilla mam if 3 W Ewmam y gs may mam 1 3ng 1143mm Spmmm 0f Mtg7 0 H ym o m mzzlt di mm P22 We gum WWW a Smpwgr mpgxsjr lt 0 fax WWW o Plums o Mm ifmam m my i w a vm gdl P7 53 kman Security in Pwr m Peer Systems a pp xfdwm ihi 0 am m rmmmmm Wb tdh am mil gmmxe g m 116 fwdkwl Wm Um mmgvmiw L rtaqpu m r mid p mmwr lawnx T39hmi o anmmym try lm W mmm mil 51be 3g m lmwm mm mgdlay WWWWK w a my Li Emmi l millfwbm m 39ir mjm Ekmgw mid a mmaa O quot College of Computing CS 4420 Database Security Ling Liu Associate Professor College of Computing Georgia Tech 2003 CSIFBI Computer Securitv Survev Sauna mp wwwgucs1 mm Attack Sophistication vs Intruder Technical Knowledge Cross site scripting lmrl erd stealthquotladvanced i TOOIS quoto 9 9 scanning techniques 39 i lstaged packet spoo ng denial of service l i I ldistrihuted l attack tools I www attacks I i i automated probesscans iGUI ck doorsi disabling audits i etwork mgmt diagnostics i Ehijacking Attack burglaries isessions 39 39 i a i 39 39 39 a i password cracking i selfreplicating code LOW i password guessing Attackers 1980 1985 1990 1995 2000 Copyright CERT zooo Goals of DB Security Integrity Only authorized users should be allowed to modify data 0 Availability Making data available to the authorized users and application programs Secrecy or Confidentiality Protection of data from unauthorized disclosure Security Objectives Preventdetectdeter improper Disclosure of information Preventdetectdeter Improper modi cation of information I A Security System Assurance How well the security system meets the protection requirements and executes the expected functions Preventdetectdeter improper Denial of access to services Security Mechanisms oz Access control which data users can access oz Information ow control What users can do with the accessed data oz Inference control how the accessed data can be used to infer additional information ozo Encryption Access Control oz Ensures that all direct accesses to object are authorized ozo Protects against accidental and malicious threats by regulating the read write and execution of data and programs oz Need Proper user identi cation Information specifying the access rights is protected form modification Access Control ozoAccess control components Access control policy specifies the authorized accesses of a system Access control mechanism implements and enforces the policy Access Control Basics oz Subject active entity that requests access to an object eg user or program oz Object passive entity accessed by a subject eg record relation le oz Access right privileges how a subject is allowed to access an object eg subject 3 can read object o Access Control Models Discretionary Access Control DAC grants privileges to users including the capability to access speci c data les records or elds in a speci c mode such as read insert delete or update Mandatory Access Control MAC classi es users and data into multiple levels of security and then enforces appropriate rules RoleBased Access Control RBAC Discretionary Access Control DAC oz For each subject access right to the objects are defined oz User based oz Grant and Revoke oz Problems Propagation of access rights Revocation of propagated access rights Access Matrix Model Proposed by Harrison Ruzzo Ullman 1976 Outline S Set of subjects 0 Set of objects A The access matrix rows correspond to the subjects columns correspond to the objects Access Matrix Model cont Conditions can be associated with access authorization Datadependent Specifying constraints on the value of accessed data Timedependent Specifying constraints on the time when an access can take place Contextdependent Specifying constraints on combinations of data which can be accessed Historydependent Specifying constraints dependent on previously performed accesses Access Matrix Model cont An Example Object types Relations views rows records columns operations Subject types Users accounts programs 39ect subject 01 OI Om 51 ASlOI ASlOi ASlOm 52 Aszo1 ASZOi ASZOm Sn ASnOl ASnOi ASnOm Access Modes Access Modes Read Write Append and Execute Operations ASO is an entry in the access matrix Mechanisms ltcreategt ltsubject Si ltobject Ojgt ltdestroygt ltsubject Si ltobject Ojgt ltenter deletegt ltoperation x into from ASi Oj Discretionary Access Control GRANT SCHEMA DB schema name AUTHORIZATION users GRANT privileges ON obj ect TO users WITH GRANT OPTION REVOKE GRANT OPTION FOR privileges ON obj ect FROM users CASCADE RESTRICT Privileges SELECT INSERT DELETE UPDATE REFERENCES EMPLOYEE l NAME I EMPIDIBDATEI ADDRESSI SEXI SALARYI DEPTNO l GRANT SELECT ON EMPLOYEE TO user3 GRANT SELECT ON EMPLOYEE TO user3 WITH GRANT OPTION GRANT INSERT ON EMPLOYEE NAME SSN TO user3 GRANT UPDATE ON EMPLOYEE SALARY TO user3 REVOKE SELECT ON EMPLOYEE FROM user3 CASCADE Access Control Using Views The owner of a relation can create a View containing only limited columns andor tuples then grants the View to other users Example CREATE VIEW User3 EMPLOYEE AS SELECT EMP ID NAME DEPTNO FROM EMPLOYEE DAC by Views Emplqyae relation Name Dept Salary Manager J Black Toy 25000 S Red S Red Toy 43000 K Brown L White Candy 28000 GR Green CREATE VIEW toyidept AS SELECT NameSalary Manager FROM Employee WHERE Dept Toy royidept View Name Salary jlfanger J Black 257000 S Red S Red 437000 K Brown DAC by Grant and Revoke GRANT SELECT ON Employe TO Black WITH GRANT OPTIO 9 rown revokes grant given to Black Brown does not want Red to access the Employee relation GRANT UPDATESal v 0 Employee TO White Weakness of Discretionary Access Control System userl SELECT ON EMPLOYEE WITH GRANT OPTION gt Granted Privilege n Revoked Privilege Authorization Graph Summary Types of discretionary privileges The Account level The Relation level Features Specifying privileges using views Revoking privileges Propagation of privileges using the GRANT option Specifying limits on propagation of privileges Horizontal propagation limits Vertical Propagation limits Advantage flexible Disadvantage No real assurance Information leak and Vulnerable to attacks like Trojan Horses Mandatory Access Control Mandatory Access Control Each data object is assigned a security class 0 Each subject user or program is assigned a clearance for a security class 0 Security classes could be Top Secret TS Secret S Confidential C Unclassified U Multilevel Relation and Polyinstantiation EMPID NAME SALARY DEPTNO SECURITYCLASS 1 smith 100000 5 S 2 brown 80000 4 C 1 smith null 5 C MAC ozo Access rights defined by comparing the security classification of the requested objects with the security clearance of the subject ozo If access control rules are satisfied access is permitted ozo Otherwise access is rejected ozo Granularity of access rights MAC BellLaPadula BLP Model ozo Single security property a subject S is allowed a read access to an object 0 only if labelS dominates labelO ozo Starproperty a subject S is allowed a write access to an object 0 only if labelO dominates labelS No direct ow of information from high security objects to low security objects Axioms Simple security 83 property An object may have read or write access to an object only if the clearance of the subject dominates the security level of the object Star property A subject can only read objects at or above their level A subject can only write objects at or below their level Tranquility principle No subject can modify the classification of an active object Mandatory Security Mechanism Restriction Axiom 1 t no subject S can read an object O ifthe object s security class is higher than the subject s security clearance S can read 0 only ifclassS 2 class0 t no subject may write an object that has lower security class than the subject s security clearance thus preventing information ow from higher classi cation to lower classi cation S can have write access to 0 if and only ifclassS S Class0 Flow Control Flow control checks that information contained in some data objects does not ow explicitly or implicitly into less protected objects A clearance for a security class can be assigned to each application program Like a DB user each application program is subjected to the same readwrite restrictions Role Based Access Control RBAC Motivation ozo Express organizational policies Separation of duties Delegation of authority a Flexible easy to modify to meet new security requirements 2 Supports Leastprivilege Separation of duties Data abstraction o RoleBased Access Control Mandatory access control is rigid because the security class should be assigned to each subject and data object In the real world access privileges are associated with the role of the person in the organization example bank teller Each role is created and is grantedrevoked privileges Each user is grantedrevoked roles Inference Control Inference Control Must prohibit the retrieval of individual data through statistical aggregate operations on the database Example SELECT MAXSalary FROM EMPLOYEE WHERE Dept CSE AND Address LIKE Cincinnati Note What if only few employees live in Cincinnati Solutions for Inference Control No statistical queries are permitted Whenever the number of tuples in the selected population is smaller than a certain number Prohibit a sequence of queries that refer to the same population of tuples repeatedly Partition the database into groups larger than certain size and queries can refer to any complete group or set of groups but never to a subset of a group Statistical Database Security Statistical Database Security Statistical databases are used to produce statistics on various populations Features are individual information is considered confidential users may allow to access statistical information on the population ie applying statistic functions to a population of tuples Techniques for protecting privacy of individual information solutions are illustrated by examples Person name ssn income address city state zip sex lastidegree Suppose we are allowed to retrieve only the statistical information over this relation by using SUNl AVG MIN MAX COUNT etc Statistical Database Security The following two queries are valid Q l find the total number of women who have phD and live in Calgary Alberta select COUNT rom Person where lastidegree phD and sex F and city Calgary and state Albe rta Qlifind the average income ofwomen who havephD and live in Calgary Alberta select AVG income from Person where last degree phD and sex 2 F and city Calgary a state Alberta If we know Mary Black is phD and live in Calgary and we want to know her income we may use the above two queries When query Q1 returns one the result of Q2 is the income ofMary Otherwise we may issue a number of subsequent queries using MAX and MIN we may easily know the close range of Mary s income Statistical Database Security Example 2 If the query writer say Connie also lives in Calgary and has phD then using the following queries she can easily get the income of Mary Q3 select SUM1ncome rom Person where lastidegree phD and ex F and city Calgary and state 7 Alberta and name ltgt Mary select SUM1ncome rom Person where lastidegree phD and sex F and city Calgary and state Alberta ltgt Connie an na 7 Because Connie knows her own income by Q4 Q3 Connie s income she knows Mary s income Statistical Database Security A number of restrictions can be used to reduce the possibility of deducing individual information from statistical queries No statistical queries are permitted whenever the number of tuples in the population specified by the selection condition falls below some threshold Restricting the number of tuples in the intersection of subsequent query results Another technique is to prohibit sequences of queries that refer repeatedly to the same population of tuples Data Encryption Data Encryption Example Let the plaintext be the string AS KlNGFl SHERS CATCH FIRE Let the encryption key be the string ELIOT The encryption algorithm might be as follows Step 1 ASbKl NGFI S HERSb CATCH bFlRE Step 2 let b 00 A01 Z26 0119001109 1407060919 0805181900 0301200308 0006091805 Step 3 Encode the encryption key 0512091520 Step 4 Replace each character by the sum of its integer encoding and the integer encoding of the encryption key with modulo 27 Data Encryption cont 0119001109 1407060919 0805181900 0512091520 0512091520 0512091520 0604092602 1919152412 1317000720 0301200308 0006091805 0512091520 0512091520 0813021801 0518180625 Step 5 decode the integer in the result of Step 4 by its character equivalent FDIZB S SOXL MQbGTHMBRAERRFY The encryption procedure will return the following string FDIZBSSOXLMQ GTHMBRAERRFY as the encoded text for AS IGNGFISHERS CATCH FIRE The decryption procedure is straightforward given the key Encryption and PKI Public Key Infrastructure Each user generates a pair of keys a public key and a private key for encryption and decryption of messages Public key and private key are interchangeable a message encrypted using one key can be decrypted by the other key The public key of the pair is made public for others to use whereas the private key is kept by the owner Since the keys are generated by using eXponentiation and modulo functions it is hard to crack them PKI Continued If a sender Wishes to send a private message to a receiver the sender encrypts the message using the receiver s public key When the receiver receives the message he or she decrypts it using the receiver s private key No other recipient can decrypt the message because only the receiver knows his or her private key Digital Signatures Like a handwritten signature a digital signature is a means of associating a mark unique to a person with a body of text The message sender generates the digital signature by hashing the message The sender encrypts the digital signature using hisher private key rst then encrypts it using the public key of the receiver The receiver decrypts the digital signature using hisher private key rst then decrypts it using the public key of the sender To validate the message itself the receiver hashes the message and compare the hash value with the decrypted digital signature College of Comput CS 4420 Database System Implementation More TP Ling Lin ncizte Prnfessnr Ass Cnuege nf Cnmpmjng Genrgjz Tech Lecture Outline l Review of Last Lectures Concurrency Control 9Tvm Phase Ludqng2FL Binary Lucks Share and Exeiuswe Luck egXieeks with inmantLunk Update Leek 9Mmuple Granu anty LucksUS ix SIX e emne Awid Esmdlng rullbank Sum Schedule eNes39ee Transammns 9VXEW Sanalizztnlity Chapters to Read I Chapter 8 read all sections I Chapter 9 0 skim 98 I Chapter 10 o Read 101 102 103 0 skim 104 105106107 DB18 Locking Schemes I Binary I Shared Exclusive I Extending the SX locking scheme with new lock I Examples 0 Increment Lock 0 Update Lock 0 Intentional Lock IS IX SIX Con ictSerializable Schedule no no DB18 Two phase Locking 2PL Protocol To guarantee serializability 39 In a transaction all lo ck operations SiLock or XiLock precede the first unlock operation No locks can be acquired after the first lock is released We call a transaction satisfying twophase locking protocol if it obeys the above rules Twophase execution Growing phase lock acquisition only no unlock Shrinking phase lock release no more lock Lock point divides the two phases Lock point 1 Obtain lock 1 Release look No of locks Phase 1 Phase 2 BEGIN END DB18 Deadlock Avoidance Conservative 2 PL Conservative 2 PL static requires a transaction T to 39 predeclare all the read set of items and write set of items and lock all the items it accesses before T begins execution 39 If any of the predeclared items can not be locked T does not lock any item at all Instead T waits and tries again until all the items are available for locking 39 Conservative 2 PL is deadlock free No of Iocia T Obtain lock 1 Release lock Phase 1 Phase 2 Transaction BEGIN Perio 0r END duration data item use DB18 Strict 2 PL In addition to the Basic 2 PL and Conservative 2 PL there is another version of 2 PL Strict 2 PL A transaction does not release any of its locks until after it terminates by committing Strict 2 PL is not deadlockfree unless combined with conservative 2 PL But it helps to overcome the update lost problem If all transactions in a schedule S follow the strict 2PL protocol S is called a strict schedule N0 of IOC 8 Hold locks until the end l Obtain lock T 1 Release lock Transaction BEGIN END period or u ration data Item USS College of Comput CS 4420 Database System Implementation Concurrency Control Ling Lin ncizte Prnfessnr Ass College nf Compnn39ng Georgia Tech Lecture Outline Review ofLast Lectures o Fallure Recovery scheme eLoggrng Record o UndolngRedolng Logglng o cn mg Polnt Backup Concurrency Control o Problems ofeoneurrent aeeess o Basle Concepts of Conenrreney Control 9 Cun letaenuns 9 Schedules 9 Cun letEqulwlent Schedules 9 Cun letrSenallzzble schedules 9 ZPL DBIS Loggmg Schemes I Undo only 0 Log entries contain only old values 0 A crash before transaction nishes the log tells us how to restore old values for any DB elements changed on is 0 Important Principle WAL 9 log records for an item X must be on disk before any DB modification to X is ushed to disk 9Before commit record ushed to disk all DB modifications must be ushed to disk I Weakness o Undo log cannot bring backup DB copies to date DBIS Logging Schemes I Redo only 0 Log entries contain only new values 0 Before writing modifications to disk transaction must be committed and COMMIT record must be written to log 0 Before writing modifications to disk ush all log records involving X including commit to disk I Weakness o Redo logging need to keep all modified blocks in memory until commit DBIS Checkpointing I Problem In principle recovery requires looking at entire log I Simple solution occasional checkpoint operation 0 Stop accepting new transactions 0 Wait for all current transactions commit or abort o Flush log to disk and all DB updates in memory buffer to disk 0 Enter checkpoint record in the log record and ush to disk I Advantage o If recovery is necessary after this point all transactions prior to a checkpoint have committed and need not be undone DBIS Exercise 1 what to do at recovery Undo 10g 1in cannot bring backup DB copies up to date A u A a g g a g a lt CD U FT 8 g N 8 m Crash I F g I I v U v 1 v v v Redo need to keep all DB blocks in memory until commit Redo log disk A Q E A 13 A 2 E a D E N lt E 3 m E 0 Crash vI LL u 11 LL m I H 4 N I v u v v v v DBIS Nonquiescent Checkpointing I Problem we may not want to stop transactions from entering system I Solution 0 Write ltSTART CKPTT1 Tk record to log where Ti are active transactions 0 Allow active transactions to commit but do not prohibit new transactions 0 Write END CKPT record to log I Advantage o If recovery is necessary after END CKPT recover the transactions that began after END CKPT o If crash occurs between START CKPT and END CKPT need to undo all transactions associated with START CKPT with no COMMIT T DBIS Exercise 2 what to do at recovery time noTl commit L 0 T1 Ckpt Ckpt T1 G ax 39 T1 end b I Undo T1 undo ab 2 With T1 commit L 0 T1 ckpt s T1 ckpt T1 T1 G a T1 b end cmt I Redo T1 redo bc CS 4420 Database System Implementation Concurrency Control Ling Lin Assncizte Prnfessnr Cnllege nf Cnmpnting Genrgjz Tech Lecture Outline l Review ofLast Lectures Concurrency c 9Tvm Phase Luck 951m and Exame Lacks 9 SX luck with In ck Today s Lec e 9 SX u m Updat uck 9Luk TableandLunkmg System Luckm G amy aMuluple Granmanty Lunks s ix SIX 9Fhanmmthennmenun Concurrency Control Methods 9 Pessimistic Concurrency Control I Locking Techniques for Concurrency Control I Ensuring Serializability by Two Phase Locking 9 Optimistic Concurrency Control Locking Scheme I Binary Locks 0 Lock Unlock I Shared and Exclusive Locks ReadWrite Locks 0 S Lock X Lock Unlock I Enhanced SX schemes 0 Increment Lock 0 Update Lock 0 Intentional S lock and Intentional X lock ISIX SIX I Treebased Locking Scheme Pessimistic Concurrency Control DB17 Three Rules 1 2 3 m DB17 Rule 1 Well formed transactions Ti l SlA r1A u1A Ti l X1A W1A u1A DB17 Rule 2 Legal scheduler S lS1A lliA no lXjA S lX1A lliA no lXjA no lSjA New request Compatibility matrix s X Lock S true false already X false false held In DB17 Rule 3 2PL transactions No change except for upgrades I If upgrade gets more locks eg S gt S X then no change II If upgrade releases read shared lock eg S gt X then it can be allowed in growing phase DB17 Twophase Locking 2PL Protocol To guarantee serializability In a transaction all lock operations SiLock or XiLock precede the first unlock operation No locks can be acquired after the first lock is released We call a transaction satisfying twophase locking protocol if it obeys the above rules Twophase execution Growing phase lock acquisition only no unlock Shrinking phase lock release no more lock Lock point divides the two phases Lock point 3 1 Obtain lock U 2 1 Release lock 5 039 Z Phase 1 Phase 2 BEGIN 9 DB17 Ser1al1zab1l1ty of 2PL l l leorem For any legal schedule S o if every transaction in the schedule S satis es the 2PL protocol S is guaranteed to be serializable obviating the need to test for serializability of schedules Note ZPL is a sufficient but not a necessary condition for serializability of schedules may limit the amount of concurrency that can occur in a schedule Jo mi iia LDW Mia Livelock Starvation Deadlock DB17 Livelock I Livelock occurs 0 when one transaction cannot proceed for an inde nite period of time while other transactions continue normally I Cause 0 due to unfair waiting schemes usually prioritybased ones I Fair schemes 0 FCFS o prioritybased by increasing the priority of a transaction who waits longer DB17 DeadLocks Deadlock occurs when each of two transactions is waiting for the other to release the lock on an item Tl TZ XiL ockA I R R1 A 2 XiL ockB l R203 R J X7L kB39gt A X LockA39 A waltmg Waiting In general a deadlock may involve n ngt2 transactions and can be detected by using a Waitfor graph Wait for graph Node transaction Edges pending lock request If Ti is waiting for an object held locked by Tj then there is an edge from Ti to Tj Root is the current lock holder 12 DB17 Deadlock Avoidance Conservative 2 PL Conservative 2 PL static requires a transaction T to 39 predeclare all the read set of items and write set of items and lock all the items it accesses before T begins execution 39 If any of the predeclared items can not be locked T does not lock any item at all Instead T waits and tries again until all the items are available for locking 39 Conservative 2 FL is deadlock free No of locks T Obtain lock 1 Release lock Phase 1 Phase 2 Transaction BEGIN period or END duration data item use 13 DB17 Deadlock Avoidance Strategies I No waiting 0 If a transaction cannot obtain a lock it is immediately aborted and restarted later 0 Do not even check for deadlock I Cautious waiting 0 When a transaction Tx tries to obtain a lock that is already held by J 9 T is blocked and wait iij is not already blocked 9 Otherwise abort T I Based on timeouts o If a transaction T waits longer than a systemdefined timeout then the system assumes that T is deadlocked and aborts T regardless of whether T is actually deadlocked or not DB17 Strlct 2 PL In addition to the Basic 2 PL and Conservative 2 PL there is another version of 2 PL Strict 2 PL A transaction does not release any of its locks until after it terminates by committing Strict 2 PL is not deadlockfree unless combined with conservative 2 PL But it helps to overcome the update lost problem If all transactions in a schedule S follow the strict 2PL protocol S is called a strict schedule No of Ian 3 Hold locks until the end A 1 Obtain lock T 39 1 Release lock EEGIN END Tranaactlon period of duration data item use DB17 Other Locking Schemes I Goal Further Enhancement of concurrent execution I General Mechanism 0 Extending the SX locking scheme with new lock I Examples 0 Increment Lock 0 Update Lock 0 Intentional Lock IS IX SIX
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'