Introductn to Database Systems
Introductn to Database Systems EECS 647
Popular in Course
Popular in Elect Engr & Computer Science
verified elite notetaker
This 210 page Class Notes was uploaded by Melissa Metz on Monday September 7, 2015. The Class Notes belongs to EECS 647 at Kansas taught by Jun Huan in Fall. Since its upload, it has received 38 views. For similar materials see /class/184508/eecs-647-kansas in Elect Engr & Computer Science at Kansas.
Reviews for Introductn to Database Systems
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 09/07/15
EECS 647 Introduction to Database Systems Instructor Luke Huan Spring 2009 Today s Topic o Guarantee con ictserializable schedule 4262009 Luke Huan Univ of Kansas Locking 0 Rules 0 If a transaction wants to read an object it must first request a shared lock S mode on that object o If a transaction wants to modify an object it must first request an exclusive lock X mode on that object 0 Allow one exclusive lock or multiple shared locks Mode of the lock requested S X MOde Of100k5 S Y N Grant the lock currently held x N N by other transactions Compatibility matrix 4262009 Luke Huan Univ of Kansas Basic locking is not enough Add 1 t0 bothA and B T1 preserve AB lockXA Read 100 rA Write 1001 unlockA Possible schedule under locking But still not con ictserializable lockXB Read 200 rB T2 Multiply both A and B by 2 preserves AB lockXA 1 A Realel 9 WA Write 1012 unlockA lockXB rB Read 100 W 3 Write 1002 unlockB A Bl 4262009 Write 2001 W B unlock e r Univ of Kansas Twophase locking 2PL o All lock requests precede all unlock requests a Phase I obtain locks phase 2 release locks T l lockXA rA WA lockXB unlockA rB lockXA lgt rA WA lockXB rB WB 2PL guarantees a con ictserializable schedule T1 rA WA rB WB Cannot obtain the lock on B WB 4262009un100kB until T1 unlocks Luke Huan Univ of Kansas rA WA rB WB Problem of 2PL 0 T2 has read uncommitted data T1 T2 written by T1 2431 0 If T1 aborts then T 2 must W rA abort as well r03 MA 0 Cascading aborts are possible wB if other transactions have read rB data written by T 2 wB Abort 0 Even worse what if T 2 commits before T1 0 Schedule is not recoverable if the system crashes right after T 2 commits 4262009 Luke Huan Univ of Kansas 6 Strict 2PL 0 Only release locks at commitabort time o A writer will block all other readers until the writer commits or aborts Used in most commercial DBMS 4262009 Luke Huan Univ of Kansas Next o A few examples 4262009 Luke Huan Univ of Kansas NonZPL A 1000 B2000 Output L0ckXA ReadA A A50 WriteA UnlockA L0ckXB ReadB B B 50 WriteB L0ckS A ReadA UnlockA L0ckS B ReadB UnlockB PRINTAB 2PL A 1000 B2000 Output LockXA ReadA LockSA A A50 WriteA LockXB UnlockA ReadA LockSB ReadB B B 50 WriteB UnlockB UnlockA ReadB UnlockB PRINTAB 4262009 Luke Huan Univ of Kansas 10 Strict 2PL A 1000 B2000 Output LockXA ReadA LockSA A A50 WriteA LockXB ReadB B B 50 WriteB UnlockA UnlockB ReadA LockS B ReadB PRINTAB UnlockA 4262009 Luke Huan Univ of Kansas 11 Lock Management 0 Lock and unlock requests handled by Lock Manager 0 LM keeps an entry for each currently held lock 0 Entry contains 0 List of xacts currently holding lock 0 Type of lock held shared or exclusive Queue of lock requests Lock Management cont a When lock request arrives 0 Does any other xact hold a con icting lock 0 If no grant the lock 0 If yes put requestor into wait queue 0 Lock upgrade 0 Shared lock can request to upgrade to exclusive Deadlocks o Deadlock Cycle of transactions waiting for locks to be released by each other 0 Two ways of dealing with deadlocks o prevention 0 detection 0 Many systems just punt and use Timeouts o What are the dangers with this approach Deadlock Detection o Create and maintain a waitsfor graph 0 Periodically check for cycles in graph Deadlock Detection Continued Example T1 SA 31 33 T2 XB XC T3 SD SC XA T4 XB Deadlock Deadlock Prevention Assign priorities based on timestamps Say Ti wants a lock that Tj holds Two policies are possible WaitDie If Ti has higher priority Ti waits for Tj otherwise Ti aborts Woundwait If Ti has higher priority Tj aborts otherwise Ti waits Why do these schemes guarantee no deadlocks Important detail If a transaction restarts make sure it gets its original timestamp Why Summary o Correctness criterion for isolation is serializability o In practice we use conflict serializability which is somewhat more restrictive but easy to enforce a Two Phase Locking and Strict 2PL Locks implement the notions of con ict directly 0 The lock manager keeps track of the locks issued 0 Deadlocks may arise can either be prevented or detected Recovery Motivation o Atomicity o Transactions may abort Rollback o Durability 0 What if DBMS stops running Causes oz Desired state after system restarts T1 amp T3 should be durable T2 T4 amp T5 should be aborted effects not seen crash T1 Commit I T2 39l It I T3 omml I T4 39 I T5 Handling Failures 0 System crashes in the middle of a transaction T partial effects of T were written to disk 0 How do we undo T atomicity 0 System crashes right after a transaction T commits not all effects of T were written to disk 0 How do we complete T durability 4262009 Luke Huan Univ of Kansas 20 Na39I39ve approach 0 Force When a transaction commits all writes of this transaction must be re ected on disk 0 Without force if system crashes right after T commits effects of T will be lost rs Problem Lots of random writes hurt performance 0 No steal Writes of a transaction can only be ushed to disk at commit time 9 With steal if system crashes before T commits but after some writes of T have been ushed to disk there is no way to undo these writes Problem Holding on to all dirty blocks requires lots of memory 4262009 Luke Huan Univ of Kansas 21 Logging 0 Log 0 Sequence of log records recording all changes made to the database 0 Written to stable storage eg disk during normal operation 0 Used in recovery 0 Hey one change turns into two bad for performance 0 But writes are sequential append to the end of log a Can use dedicated disks to improve performance 4262009 Luke Huan Univ of Kansas 22 Undoredo logging rules Record values before and after each modi cation h Ti X oldvalue0fX newvalue0fX i A transaction TI is committed when its commit log record h T commit i is written to disk Writeahead logging WAL Before X is modi ed on disk the log record pertaining to X must be ushed 9 Without WAL system might crash afterX is modified on disk but before its log record is written to disk no way to undo No force A transaction can commit even if its modified memory blocks have not be written to disk since redo information is logged Steal Modified memory blocks can be ushed to disk anytime since undo information is logged 4262009 Luke Huan Univ of Kansas 23 Undoredo logging example T1 balance transfer of 100 from A to B readA a a a 100 Memory writeA a readB b b b 100 A 866 700 writeB b BM 500 commlt ltT1 startgt Steal can ush ltT1 A 800 700gt before commit ltT1 B 400 500gt ltT1 commitgt No force can ush after commit Moogestriction on when memogubloyglg canshould be ushed 24 Summary 0 Goal 0 Support Concurrency with Isolation Protocol Isolation Recover Deadlock Starvation ability 2PL Strict 2PL Strict 2PL with priority 4262009 Luke Huan Univ of Kansas 25 Summary o Concurrency control a Serial schedule no interleaving o Con ictserializable schedule no cycles in the precedence graph equivalent to a serial schedule 0 2PL guarantees a con ictserializable schedule 0 Strict 2PL also guarantees recoverability 0 Recovery undoredo logging with fuzzy checkpointing a Normal operation writeahead logging no force steal a Recovery rst redo forward and then undo backword 4262009 Luke Huan Univ of Kansas 26 EECS 647 Introduction to Database Systems Instructor Luke Huan Spring 2009 Summary of SQL Features 0 Query 0 SELECTFROMWHERE statements Set and bag operations Table expressions subqueries Aggregation and grouping Ordering o Outerjoins 0 Table creation constraints and content updates 41 2009 Luke Huan Univ of Kansas Today s Topic o Subqueries 0 Active databases 0 View 0 Indexing 41 2009 Luke Huan Univ of Kansas Quantified subqueries A quanti ed subquery can be used as a value in a WHERE condition Universal quanti cation for all WHERE x 0p ALL subquery 0 True iff for all t in the result of subquery x 0p t Existential quanti cation exists WHERE x 0p ANY subquery 0 True iff there exists some t in the result of subquery such that x 0p t amp Beware o In common parlance any and all seem to be synonyms o In SQL ANY really means some Alsomayuse EXISTS NOT EXISTS IN NOT IN 41 2009 Luke Huan Univ of Kansas 4 Examples of quantified subqueries Which employees have the highest salary Employee Sid Name Salary 0 SELECT FROM Employee WHERE Salary gt ALL SELECT Salary FROM Employee How about the lowest salary a SELECT FROM Employee WHERE Salary lt ALL SELECT salary FROM Employee 41 2009 Luke Huan Univ of Kansas More ways of getting the highest Salary 0 Who has the highest Salary 0 SELECT FROM Employee e WHERE NOT EXISTS SELECT FROM Employee WHERE Salary gt eSalary a SELECT FROM Employee WHERE Eid NOT IN SELECT elSID FROM Employee el Employee e2 WHERE elSalary lt e2Salary 41 2009 Luke Huan Univ of Kansas Nested Queries Scoping c To nd out which table a column belongs to 0 Start with the immediately surrounding query a If not found look in the one surrounding that repeat if necessary 0 Use tablename columnname notation and AS renaming to avoid confusion 41 2009 Luke Huan Univ of Kansas Anlixanu e SELECT FROM Student S WHERE EXISTS r SELECT FROM Enrol e WHERE SSID AND EXISTS SELECT F FRDM Enroll WHERESID sS1p AND CID ltgteCIDP Students who are taking at least two courses 41 2009 Luke Huan Univ of Kansas Nested Queries o Nested queries do not add expression power to SQL 0 For convenient 0 Write intuitive SQL queries 0 Can always use SQL queries Without nesting to complete the same task though sometime it is hard 41 2009 Luke Huan Univ of Kansas 9 Database Design Active database View Indexing Thinking about the following realworld situation You need to design a database for students to record student name gpa gt0 id email address home department and images Constraints student id is the primary key student gpa can not be negative a student may take up to 10 courses in a semester unless they are senior and whenever a new honor student is entered in the table this student is automatically enrolled a honor course BiolOl How J ayhawk y 41 2009 Luke Huan Univ of Kansas Active Database o Constraints are ef cient ways to improve the quality of the database 0 We have covered domain constraint 0 We will talk about 0 Check 0 Assertion o Trigger 41 2009 Luke Huan Univ of Kansas 11 Check 0 Student GPA must be positive 0 CREATE TABLE Student SId charlO Sname char20 email charlO gpa float CHECK gpa gt 00 41 2009 Luke Huan Univ of Kansas 12 Assertion 0 Make sure each student can take up to 10 course projects unless the student is a senior level gt4 CREATE ASSERTTON uplimitCourse CHECK SELECT MAX COUNT Sid FROM student enrollment WHERE level lt 4 GROUP BY Sid lt 10 41 2009 Luke Huan Univ of Kansas 13 Trigger example CREATE TRIGGER EECSl68AutoRecruit AETER ENSERT ON Student Ewmn REEERENCING NEW ROW As newStudent FOR EACH ROW WHEN lnewStudent isHonor l C0nditi0n INSERT INTO Enroll VALUES newStudent SID BIOlOl 1 Action 41 2009 Luke Huan Univ of Kansas 14 Views o A View is like a Virtual table 0 De ned by a query which describes how to compute the View contents on the y 0 DBMS stores the View de nition query instead of View contents 0 Can be used in queries just like a regular table 41 2009 Luke Huan Univ of Kansas 15 Why use views c To hide data from users 0 To hide complexity from users 0 Logical data independence 0 If applications deal with Views we can change the underlying schema Without affecting applications 0 Recall physical data independence change the physical organization of data Without affecting applications 0 To provide a uniform interface for different implementations or sources Real database applications use tons of Views 41 2009 Luke Huan Univ of Kansas 16 Creating and dropping views 0 Example EECS647roster o CREATE VIEW EECS647Roster AS SELECT SID Wauedubasetablesn FROM Student I WHERE SID IN SELECT SID FROM Enroll WHERE CID EECS64739 c To drop a View 0 DROP VIEW vwwLname 41 2009 Luke Huan Univ of Kansas 17 Using views in queries 0 Example nd the average GPA of EECS647 students 0 SELECT AVGGPA FROM EECS647Roster a To process the query replace the reference to the View by its de nition a SELECT AVGGPA FROM SELECT SID name age GPA FROM Student WHERE SID IN SELECT SID FROM Enroll WHERE CID EECS647 41 2009 Luke Huan Univ of Kansas 18 Modifying views Does not seem to make sense since Views are Virtual But does make sense if that is how users see the database Goal modify the base rabies such that the modi cation would appear to have been accomplished on the View Be careful c There may be one way to modify 0 There may be many ways to modify 0 There may be no way to modify 41 2009 Luke Huan Univ of Kansas 19 A simple case CREATE VIEW StudentGPA AS SELECT STD GPA FROM Student DELETE FROM StudentGPA WHERE STD 123 translates to DELETE FROM Student WHERE STD 123 41 2009 Luke Huan Univ of Kansas 20 An impossible case CREATE VIEW HighGPAStudent AS SELECT SID GPA FROM Student WHERE GPA gt 37 INSERT INTO HighGPAStudent VALUES987 25 o No matter What you do on Student the inserted row will not be in H ighGPAStudent 41 2009 Luke Huan Univ of Kansas 21 A case with too many possibilities CREATE VIEW AverageGPAGPA AS SELECT AVGGPA FROM Student 0 Note that you can rename columns in View de nition UPDATE AverageGPA SET GPA 25 0 Set everybody s GPA to 25 0 Adjust everybody s GPA by the same amount 0 Just lower Lisa s GPA 41 2009 Luke Huan Univ of Kansas 22 SQL92 updateable views o More or less just singletable selection queries 0 No join 0 No aggregation o No subqueries o Arguably somewhat restrictive 0 Still might get it wrong in some cases a See the slide titled An impossible case 0 Adding WITH CHECK OPTION to the end ofthe View de nition will make DBMS reject such modi cations 41 2009 Luke Huan Univ of Kansas Indexes for Performance Tuning 0 An index is an auxiliary persistent data structure a Search tree eg Btree Rtree 0 Lookup table eg hash table 0 An index 011 RA can speed up accesses 0f the form 0 RA value lockup 0 RA gt value range query 0 An index 0nRA1 RA can speed up a RA1 value1 E E RA value a RA1 RAngt valuel value More on indexes in the second half of this course 41 2009 Luke Huan Univ of Kansas 24 Creating and dropping indexes o CREATE UNIQUE INDEX indexname ON tablename columnname1 columnnamen a With UNIQUE the DBMS will also enforce that columnnamel columnnamen is a key of tablename o eg CREATE INDEX nameindeX ON STUDENT SName 0 DROP INDEX indexname 0 Typically the DBMS will automatically create indexes for PRIMARY KEY and UNIQUE constraint declarations 41 2009 Luke Huan Univ of Kansas 25 Choosing indexes to create More indexes better performance 0 Indexes take space 0 Indexes need to be maintained when data is updated 0 Indexes have one more level of indirection Optimal index selection depends on both query and update workload and the size of tables 41 2009 Luke Huan Univ of Kansas 26 What s next 0 Database meets Concurrency 0 Transaction will be covered after we talk about transaction management 41 2009 Luke Huan Univ of Kansas EECS 647 Introduction Database Systems 39 Query optimization 0 One logical plan best physical plan 0 Questions 0 How to enumerate possible plans 0 How to estimate costs a How to pick the best one o Often the goal is not getting the optimum plan but instead avoiding the horrible ones Any of these will do I O O i 1 second 1 minute 1 hour 4152009 Luke Huan Univ of Kansas Cost estimation PROJECT title I Physical plan example MERGEJQIN CID SORI CID SCAN Course Input to SORTCID MERGEJ lNng FILTER name Bart SORTSID I SCAN Enroll SCAN Student c We have cost estiihatmnfereacheperator 0 Example SORTCID takes log2Binput gtlt Binput 0 But What is Bz nput c We need size of intermediate results 4152009 Luke Huan Univ of Kansas Review DBMS Architecture UserWeb FormsApplicationsDBA query transaction Query Parser Query Rewriter l Files amp Access Methods Lock Tables Buffer Manager Main Memory Storage Manager 4152009 Luke Huan Univ of Kansas 4 LOCK TABLE BUFFERS BUFFER POOL DBMS a set of cooperating software modules Luke Huan Univ of Kansas 5 Transactions 0 A program may carry out many operations on the data retrieved from the database 0 However the DBMS is only concerned about What data is readwritten fromto the database 0 database a xed set of relations A B C o transaction a sequence of read and write operations readA writeB o DBMS s abstract View of a user program 4152009 Luke Huan Univ of Kansas 6 Correctness criteria The ACID properties A tomicity A11 actions in the Xact happen or none happen C onsistency If each Xact is consistent and the DB starts consistent it ends up consistent I solation Execution of one Xact is isolated from that of other Xacts D urability If a Xact commits its effects persist An Example about SQL Transaction 0 Consider two transactions Xacts T1 BEGIN AA100 BB100 END T2 BEGIN A106A B106B END 1st xact transfers 100 from B s account to A s 2nd credits both accounts with 6 interest Assume at first A and B each have 1000 What are the legal outcomes of running T1 and T2 1100 106 1166 There is no guarantee that T1 will execute before T2 or Viceversa if both are submitted together But the net effect must be equivalent to these two transactions running serially in some order 4152009 Luke Huan Univ of Kansas 8 Example Contd Legal outcomes A1166B954 or A1160B960 Consider a possible interleaved schedule T1 AA100 BB1OO T2 A106A B106B ozo This is OK same as T1 T2 But what about T1 AA100 BB1OO T2 A106A B106B Result A1166 B960 AB 2126 bank loses 6 The DBMS s view of the second schedule T1 RA WA 123 WB T2 RA w 4152009 Luke Huan Univ of Kansas SQL transactions Syntax in pgSQL BEGIN ltdatabase operationsgt COMMIT ROLLBACK A transaction is automatically started when a user executes an SQL statement begin is optional Subsequent statements in the same session are executed as part of this transaction 9 Statements see changes made by earlier ones in the same transaction 0 Statements in other concurrently running transactions do not see these changes COMMI T command commits the transaction ushing the update to disk ROLLBACK command aborts the transaction all effects are undone 4152009 Luke Huan Univ of Kansas 10 Atomicity 0 Partial effects of a transaction must be undone when 0 User explicitly aborts the transaction using ROLLBACK o Eg application asks for user con rmation in the last step and issues COMMIT or ROLLBACK depending on the response 0 The DBMS crashes before a transaction commits 0 Partial effects of a modification statement must be undone when any constraint is violated 0 However only this statement is rolled back the transaction continues 0 How is atomicity achieved 0 Logging to support undo 4152009 Luke Huan Univ of Kansas Isolation o Transactions must appear to be executed in a serial schedule with no interleaving operations 0 For performance DBMS executes transactions using a serializable schedule 0 In this schedule only those operations that can be interleaved are executed concurrently 0 Those that can not be interleaved are in a serialized way 0 The schedule guarantees to produce the same effects as a serial schedule 4152009 Luke Huan Univ of Kansas 12 Concurrency control 0 Goal ensure the I isolation in ACID T1 T 2 readA readA writeA writeA readB readC writeB writeC commit commit 4 ABC 4152009 Luke Huan Univ of Kansas 13 Good versus bad schedules Good T1 1 A WA 1 03 WB 41 52009 T2 rA WA 1 C WC Bad Good But Why T1 T 2 T1 T 2 1 A 1 A Read 400 IA WA wnmwm W400 M 400 100 Write rB 400 50 rB 1 C 1 C WB WB WC WC Luke Huan Univ of Kansas Serial schedule o Execute transactions in order with no interleaving of operations T1rA T1wA T1rB T1wB T2rA T2WA T2rC T2WC T2rA T2wA T2rC T2wC T1rA T1WA T1rB TlwB Isolation achieved by de nition 0 Problem no concurrency at all 0 Question how to reorder operations to allow more concurrency 4152009 Luke Huan Univ of Kansas Conflicting operations 0 Two operations on the same data item con ict if at least one of the operations is a write a rX and WX con ict a wX and rX con ict a wX and WX con ict a rX and rX do not a rWX and rWY do not 0 Order of con icting operations matters a Eg if T1rA precedes T 2WA then conceptually T1 should precede T 2 4152009 Luke Huan Univ of Kansas 16 Precedence graph o A node for each transaction 0 A directed edge from T I to if an Operation of TI precedes and con icts with an Operation of in the schedule T1 i T1 i rA a T04 0 e e wA wA rB Good rB Bad M n0 cycle M cycle WB WB 4152009 WC Luke Huan Univ of Kansa WC 17 Conflictserializable schedule o A schedule is con ictserializable iff its precedence graph has no cycles 0 A con ictserializable schedule is equivalent to some serial schedule and therefore is good 0 In that serial schedule transactions are executed in the topological order of the precedence graph 0 You can get to that serial schedule by repeatedly swapping adjacent noncon icting operations from different transactions 4152009 Luke Huan Univ of Kansas Summary 0 Transaction View of DBMS o ReadX o WriteX o ACHD o Atomicity TX s are either completely done or not done at all 0 Consistency TX s should leave the database in a consistent state 0 Isolation TX s must behave as if they are executed in isolation o Durability Effects of committed TX s are resilient against failures 0 SQL transactions Begins implicitly SELECT m UPDATE m ROLLBACK I COMMIT 4152009 Luke Huan Univ of Kansas EECS 647 Introduction to Database Systems Instructor Luke Huan Spring 2009 Today s Topic o How to locate data in a le fast 0 Go with a simple example rst a ISAM Indexed sequence access method 0 An overview of the indexing strategy 0 Treebased indexes o Btree 3312009 Luke Huan Univ of Kansas Basics 0 Given a value locate the records with this value equality search SELECT FROM R WHERE A value SELECT FROM R S WHERE RA SB o Other search criteria eg 0 Range search SELECT FROM R WHERE A gt value 0 Keyword search 3312009 Luke Huan Univ of Kansas TreeStructured Indexes Introduction o Treestructured indexing techniques support both range search and equalin search 0 ISAM 1ndexed equential Access Method 0 static structure early index technology Motivation for Index 0 Find all students with gpa gt 30 o If data le is sorted do binary search 0 Cost of binary search in a database can be quite high Why a Simple idea Create an indeX le Index File Km Pm Page 1 Page 3 Page N Data File Page 2 Can do binary search on smaller indexfile ISAM o What if an index is still too big 0 Put a another sparse index on top of that a ISAM Index Sequential Access Method more or less Example look up 1 9 7 IL 01 m 99 Indexblocks 100 123 1 00 100 10 123 12 192 19 200 204 901 9039 996 9939 119 12 39 39 m Data blocks 6 3312009 Luke Huan Univ of Kansas Updates with ISAM Example insert 107 00 200 90 Example delete 12 9 IndeXblocks 100 123 1 oo o1 99 192 1939 200 204 901 90quot 996 9939 100 101 123t 119 12 1 O 7 Data blocks err ow block 0 Over ow chains and empty data blocks degrade performance 0 Worst case most records go into one long chain 3312009 Luke Huan Univ of Kansas A Note of Caution o ISAM is an oldfashioned idea 0 Btrees are usually better as we ll see 0 But ISAM is a good place to start to understand the idea of indexing o Upshot 0 Don t brag about being an ISAM expert on your resume a Do understand how they work and tradeoffs with B trees 3312009 Luke Huan Univ of Kansas Overview of Indexes o Dense vs sparse indexes 0 Primary vs secondary indexes o Clustered vs unclustered indexes o Treebased indexes vs hashbased indexes 3312009 Luke Huan Univ of Kansas Dense and sparse indexes o Dense one index entry for each search key value a The index entry include page id record id 0 Sparse one index entry for each block 0 Records must be clustered according to the search key a The index entry include only page id 3312009 Luke Huan Univ of Kansas Dense versus sparse indexes 0 Index size 0 Sparse index is smaller 0 Requirement on records 0 Records must be clustered for sparse index 0 Lookup a Sparse index is smaller and may t in memory a Dense index can directly tell if a record exists 0 Update a Easier for sparse index 3312009 Luke Huan Univ of Kansas 11 Primary and secondary indexes 0 Primary index 0 Created for the primary key of a table omeewaxG wam m kammmmuomQMmmymw 0 Secondary index 0 Jsua yldense 0 SQL 0 PRIMARY KEY declaration automatically creates a primary index UNIQUE key automatically creates a secondary index 0 Additional secondary index can be created on nonkey attributes CREATE INDEX StudentGPAIndeX ON StudentGPA 3312009 Luke Huan Univ of Kansas 12 Clustered Indexes 0 An index is clustered if the entry order in the index le is almost the same as that in the data le otherwise it is unclustered 3312009 Luke Huan Univ of Kansas 13 Hashbased Indexes 0 Based on hash table o Organized as a set of ltkey valuegt tuples 0 Key is the attribute for indexing 0 Value is ltpage id record idgt From 3312009 Luke Huan Univ of Kansas Example 0 We may have multiple different indexes per le a Eg Employee EID name age salary a Primary key is EID name is unique age and salary are nonkey attributes 0 We may or not sort the le by EID with a Btree index on EID primary name secondary and age secondary and a hash index on salary 3312009 Luke Huan Univ of Kansas 15 Recap 0 An treebased index structure built for the primary key of a le that is sorted according to the primary key PRIMARY SPARSE CLUSTERED I n Index File a a x as 3312009 Luke Huan Univ of Kansas 16 Recap 0 An treebased index structure built for nonkey attribute of a le that is sorted according to the primary key SECONDARY DENSE UNCLUSTERED Index File i V Data me E m m 3312009 Luke Huan Univ of Kansas 17 Btree A treebased index le could be primary or secondary could be sparse or dense could be clustered or unclustered A hierarchy of intervals Balanced more or less good performance guarantee Diskbased one node per block large fanout Root 0 i 2 degree 2 180 120 150 k 7 f O m C C C C C KO OW C C m m C C N m L0 L0 l I C I I l l l l l l Ell200939 I I I luke Hllan Univ oflansal I I I I 3918 Sample Btree nodes to keys 100 k degree 2 C C C Nonleaf N L0 00 l l l t0 keys to keys to keys to keys lOOklt120 120klt150 150klt180 180k C m to records with these k values or store records directly in leaves Luke Huan Univ of Kansas 39gt to next leaf node in sequence C Leaf S 3312009 Btree balancing properties 0 Height constraint all leaves at the same lowest level 0 Fanout constraint all nodes at least half full except root Max Max Min Min pointers kevs active pointers kevs Nonleaf 2d1 2d d1 1 Root 2d1 2d 2 1 Leaf 2d 2d d d 3312009 Luke Huan Univ of Kansas 20 3312009 Luke Huan Univ iii 3911 20 60 65 l l 100 101 III 110 120 13o 12o 150 180 150 156 179 degree2 Ill w l I 180 200 I I SELECT FROM R WHERE k 32 SELECT FROM R WHERE k 179 Lookups Range query SELECT FROM R WHERE k gt 32 AND C lt 179 degree 2 5 39ll lt 3 3312009 30 And follow nextleaf pointers Luke Huan Univ of Kansas Insertion 0 Insert a record with search key value 32 3 degree 2 H C C C Look up Where the S 3 3 inserted key k 7 shouldgo mmsas E Ill 1 Illll III II And insert it right there 3312009 Luke Huan Univ of Kansas 23 Performance analysis o How many IO s are required for each operation h the height of the tree more or less Plus one or two to manipulate actual records Plus 0h for reorganization should be very rare if f is large Minus one if we cache the root in memory 0 How big is h 3312009 Roughly logfmout N Where N is the number of records Btree properties guarantee that fanout is least f 2 for all nonroot nodes Fanout is typically large in hundreds many keys and pointers can fit into one block A 4level Btree is enough for typical tables Luke Huan Univ of Kansas 24 Btree in practice 0 Complex reorganization for deletion often is not implemented eg Oracle InformiX 0 Leave nodes less than half full and periodically reorganize 0 Most commercial DBMS use Btree instead of hashing based indexes because Btree handles range queries 3312009 Luke Huan Univ of Kansas 25 The Halloween Problem 0 Story from the early days of System R UPDATE Payroll SET salary salary ll WHERE salary gt 100000 c There is a Btree index on PayroleaZary o The update never stopped Why 0 Solutions 0 Scan index in reverse a Before update scan index to create a complete todo list 0 During update maintain a done list 0 Tag every row with transaction statement id 3312009 Luke Huan Univ of Kansas 26 Btree versus ISAM o ISAM is more static Btree is more dynamic o ISAM is more compact at least initially 0 Fewer levels and IO s than Btree o Overtime ISAM may not be balanced 0 Cannot provide guaranteed performance as Btree does 3312009 Luke Huan Univ of Kansas Btree versus Btree o Btree Why not store records or record pointers in nonleaf nodes 0 These records can be accessed with fewer IO s 0 Problems 0 Storing more data in a node decreases fanout and increases h 0 Records in leaves require more IO s to access a Vast majority of the records live in leaves 3312009 Luke Huan Univ of Kansas Smnmaw 0 Two types of queries 0 Keysearch o Rangequery o Btree operations 0 Search 0 Insert Sphtch d 0 Delete 0 Redistribution o Btree sorting 0 Next week with other diskbased sorting algorithms 3312009 Luke Huan Univ of Kansas 29 EECS 647 Introduction to Database Systems Instructor Luke Huan Spring 2009 Administrative 0 Final proj ect is a team proj eet with team size 2 0 We have 33 students and I already have one volunteer for the singlemember team 0 You need to nd your partner asap if you haven t done so 3132009 Luke Huan Univ of Kansas 2 Administrative o For those who want to know basic tree data structures the following are some useful information sources a 268 notes about binary search tree a A introductory book Data Structures and Algorithm Analysis in C 2nd edition by Weiss published by Addison Wesley 0 Chapter 4 o KU library has several copies 0 My Online 560 notes 0 httppeop1eeecskuedujhuanEECSS60F08 3132009 Luke Huan Univ of Kansas Review Database Design Miniworld REQUIREMENTS COLLECTION AND ANALYSIS Functional Requirements Data Requirements HighLevel Transaction Conceptual Schema T Specification In a highlevel data model DBMSindependent I LOGICAL DESIGN DBMSspecific DATA MODEL MAPPING ical Concepiual Schema V Log APPLICATION PROGRAM In the data model of a specific DBMS DESIG N l 1 PHYSICAL DESIGN TRANSACTION lt Internal Schema IMPLEMENTATION Application Prog rams 3132009 Luke Huan Univ of Kansas A DBMS Preview Casual Users Application Parametric Users Users DBA Staff Programmers DDL Privileged Interactive Application Statements ands Query Programs I I I V V DDL Query Comp ler Comp er Cue y r Optirnizer DML Compiled Compiler Transactions DBA Commands Queries and Transactions Manager Concurrency Control Backu p Recovery Us Stored Database InputOutput from Database Query and Transaction Execution 3132009 Luke Huan Univ of Kansas Outline 0 It s all about disks 8 0 That s why we always draw databases as 0 And why the single most important metric in database processing is the number of disk IO s performed 0 Storing data on a disk 0 Page organization a Record layout 0 Block layout 3132009 Luke Huan Univ of Kansas The Storae Hierarchy Smaller Faster oMain memory RAM for currently used data oDisk for the main database secondary storage oTapes for archiving older versions of the data tertiary storage i i Source Opemung Systems Concepts 5m Eamon Bigger Slower Jim Gray s Storage Latency Analogy How Far Awav is the Data Andromeda 109 Tape Optica 2000 Years R0 bot 39 106 Disk 2 Years 15 hr 100 Memory 10 On Board Cache 10 min 2 On Chip Cache 39 1 Registers My Head 1 min 3132009 Luke Huan Univ of Kansas A typical disk Tracks Platter Arm movement Spindle rotation MOVlng parts9 are 510W 1 I I 3132009 Luke Huan Univ of Kansas 9 Top view Higherdensity sectors on inner tracks andor more sectors on outer tracks A block is a logical unit of transfer consisting of one or more sectors 3132009 Luke Huan Univ of Kansas 10 Disk access time Sum of 0 Seek time time for disk heads to move to the correct cylinder 0 Rotational delay time for the desired block to rotate under the disk head 0 Transfer time time to readwrite data in the block time for disk to rotate over the block 3132009 Luke Huan Univ of Kansas 11 Random disk access Seek time rotational delay transfer time 0 Average seek time a Time to skip one half of the cylinders 0 Not quite should be time to skip a third of them why a Typical value 5 ms 0 Average rotational delay a Time for a half rotation a function of RPM 0 Typical value 42 ms 7200 RPM 0 Typical transfer time o O8msec per 8K block 3132009 Luke Huan Univ of Kansas 12 Sequential Disk Access Improves Performance Seek time rotational delay transfer time 0 Seek time o 0 assuming data is on the same track 0 Rotational delay 0 0 assuming data is in the next block on the track 0 Easily an order of magnitude faster than random disk access 3132009 Luke Huan Univ of Kansas 13 Performance tricks Disk layout strategy 0 Keep related things What are they close together same sectorblock same track same cylinder adjacent cylinder Double buffering 0 While processing the current block in memory prefetch the next block from disk overlap 10 with processing Disk scheduling algorithm Track buffer 0 Readwrite one entire track at a time Parallel IO 0 More disk heads working at the same time 3132009 Luke Huan Univ of Kansas 14 Files 0 Blocks are the interface for IO but 0 Higher levels of DBMS operate on records and les of records c M A collection of pages each containing a collection of records Must support a insertdeletemodify record a fetch a particular record speci ed using record id a scan all records possibly with some conditions on the records to be retrieved Unordered Heap Files o Simplest le structure contains records in no particular order 0 As le grows and shrinks disk pages are allocated and deallocated c To support record level operations we must a keep track of the pages in a le a keep track of free Space on pages 0 keep track of the records on a page 0 There are many alternatives for keeping track of this a We ll consider 2 Heap File Implemented as a List Full Pages Header Page m m m m Data Data Data 39 gt Page Page Page VVU Pages with Free Space o The header page id and Heap le name must be stored someplace Database catalog 0 Each page contains 2 pointers plus data Heap File Using a Page Directory W a Header Page DIRECTORY o The entry for a page can include the number of free bytes on the page 0 The directory is a collection of pages linked list implementation is just one alternative 0 Much smaller than linked list ofall HF pages Record layout Record row in a table 0 Variableformat records a Rare in DBMS table schema dictates the format 0 Relevant for semistructured data such as XML 0 Focus on xedformat records a With xedlength elds only or a With possible variablelength elds 3132009 Luke Huan Univ of Kansas 19 Record Formats Fixed Length F 1 F 2 F 3 F 4 lt L l gt L 2 L 3 L 4 Base address B Address BL1L2 o All eld lengths and offsets are constant a Computed from schema stored in the system catalog 0 Finding i th eld done Via arithmetic Fixedlength fields 0 Example CREATE TABLE StudentSID TNT name CHAR20 age TNT GPA FLOAT O 4 24 28 36 Il42 Bart padded with space 10 I 2 3 0 Watch out for alignment 0 May need to pad reorder columns if that helps 0 What about NULL 0 Add a bitmap at the beginning of the record 3132009 Luke Huan Univ of Kansas 21 Record Formats Variable Length 0 Two alternative formats elds is xed F1 F2 F3 F4 Fields Delimited by Special Symbols A F1 F2 F3 F4 JKK TF Array of Field Offsets E Second offers direct access to i th field efficient storage of nulls special don t know value small directory overhead Large Object LOB fields 0 Example CREATE TABLE StudentSID INT name CHAR20 age INT GPA FLOAT picture BLOB32000 0 Student records get declustered a Bad because most queries do not involve picture 0 Decomposition automatically done by DBMS and transparent to the user 0 StudentSID name age GPA o StudentPictureSID picture 3132009 Luke Huan Univ of Kansas Block layout How do you organize records in a block 0 Fixed length records 0 Variable length records 0 NSM Nary Storage Model is used in most commercial DBMS 3132009 Luke Huan Univ of Kansas 24 Page Formats Fixed Length Records Slot 1 Slot 2 Free Space Slot N 1011LM number M 3 2 1 number PACKED of records UNPACKED BITM AP of slots E Recard id page I d Slat gt In rst alternative moving records for free space management changes rid may not be acceptable NSM 0 Store records from the beginning of each block 0 Use a directory at the end of each block 0 To locate records and manage free space 0 Necessary for variablelength records 142 hart 2123 hiilhouse 11 3 43 7 Ralph i di 23 Why store at two diff a Both can g 3132009 Luke Huan Univ of Kansas Options o Reorganize after every update delete to avoid fragmentation gaps between records 0 Need to rewrite half of the block on average 0 What if records are xedlength o Reorganize after delete 0 Only need to move one record 0 Need a pointer to the beginning of free space a Do not reorganize after update 0 Need a bitmap indicating which slots are in use 3132009 Luke Huan Univ of Kansas System Catalogs For each relation 0 name file location file structure eg Heap file 0 attribute name and type for each attribute 0 index name for each index 0 integrity constraints For each index 0 structure eg B tree and search key fields For each View 0 View name and definition Plus statistics authorization buffer pool size etc Catalogs are themselves stored as relations AttrCatattrname rename type position attrname relname type position attrname AttributeCat string 1 relname AttributeCat string 2 type AttributeCat string 3 position AttributeCat integer 4 sid Students string 1 name Students string 2 login Students string 3 age Students integer 4 gpa Students real 5 fid Faculty string 1 fname Faculty string 2 sal Faculty real 3 Indexes a sneak preview o A Heap le allows us to retrieve records 0 by specifying the rid or o by scanning all records sequentially 0 Sometimes we want to retrieve records by specifying the values in one or more elds eg 0 Find all students in the CS department 0 Find all students with a gpa gt 3 o Indexes are le structures that enable us to answer such valuebased queries efficiently Summary o Disks provide cheap nonvolatile storage 0 Random access but cost depends on location of page on disk important to arrange data sequentially to minimize seek and rotation delays Summary Contd o DBMS vs OS File Support a DBMS needs features not found in many OS s eg forcing a page to disk controlling the order of page writes to disk files spanning disks ability to control prefetching and page replacement policy based on predictable access patterns etc 0 Variable length record format with eld offset directory offers support for direct access to i th field and null values EECS 647 Introduction to Database Systems Instructor Luke Huan Spring 2009 Administrative o I am trying to get supporting team to agree to keep your pgSQL account for another semester until Dec 2009 0 Which means 0 Your online database application will be alive for a while a You could do a demon to your potential employers if they are interested o A few things to keep in mind a No backup students are responsible for their own data a No uptime availability guarantee system may be updated in anytime There is no access during update 4262009 Luke Huan Univ of Kansas 2 Administrative o Midterm II is scheduled on April 27th 0 Senior survey today 4262009 Luke Huan Univ of Kansas Review Database Design Miniworld REQUIREMENTS COLLECTION AND ANALYSIS Functional Requirements Data Requirements HighLevel Transaction onceptual Schema Specification In a highlevel data model I DBMSindependent I LOGICAL DESIGN DBMSspecific DATA MODEL MAPPING gical Conceptual Schema 7 Lo APPLICATION PROGRAM 39In the data model of a specific DBMS DESIGN PHYSICAL DESIGN TRANSACTION lt Internal Schema IMPLEMENTATION Application Programs 4262009 Luke Huan Univ of Kansas Review DBMS Architecture UserWeb F ormsApplicationsDBA query transaction Query Parser Query Rewriter lFiles amp Access Methods Lock Tables I Buffer Manager Main Memory Storage Manager 4262009 Luke Huan Univ of Kansas 5 Today s Topic 0 Data Mining transform data to knowledge a knowledge is power 4262009 Luke Huan Univ of Kansas What Is Data Mining 0 Data mining knowledge discovery from data 0 Extraction of interesting nontrivial implicit previously unknown and potentially useful patterns or knowledge from huge amount of data 0 Data mining a misnomer 0 Alternative names 0 Knowledge discovery mining in databases KDD knowledge extraction datapattern analysis data archeology data dredging information harvesting business intelligence etc a Watch out Is everything data mining 0 Simple search and query processing A T o Deductive expert systems gymw 4262009 Luke Huan Univ of Kansas 7 Why Mine Data Commercial Viewpoint 0 Lots of data is being collected and warehoused o Web data ecommerce o purchases at department grocery stores 0 BankCredit Card transactlons 0 Computers have become cheaper and more poweful 0 Competitive Pressure is Strong 0 Provide better customized services for an edge eg in Customer Relationship Management 4262009 Luke Huan Univ ofKanSaS Why Mine Data Scientific Viewpoint 0 Data collected and stored at enormous speeds GBhour 0 Traditional techniques infeasible for raw data 0 Data mining may help scientists 4262009 remote sensors on a satellite telescopes scanning the skies microarrays generating gene expression data scienti c simulations generating terabytes of data in classifying and segmenting data in Hypothesis Formation Luke Huan Univ of Kansas Mining Large Data Sets Motivation 0 There is often information hidden in the data that is not readily evident 0 Human analysts may take weeks to discover useful information 0 Much of the data is never analyzed at all 4000000 3500000 3000000 2500000 2000000 otal new disk since 1995 1500000 1000000 Number of 500000 0 1995 1996 1997 1998 1999 From R Grossman C Kamath V Kumar Data Mining for Scienti c and Engineering Applications Knowledge Discovery KDD Process 0 Dwannmng MHeof knowledge discovery Pattern E process Taskrelevant D Data Wareh on I I I Data Cleaning I A w I 39 39 39 39 39 39 quot i Data Integration E 4 u Databases 4262009 Luke Huan Univ of Kansas 11 What is not Data Mining o What is not Data Mining Look up phone number in phone directory Query a Web search engine for information about Amazon 4262009 0 What is Data Mining Certain names are more prevalent in certain US locations O Brien O Rurke O Reilly in Boston area Group together similar documents returned by search engine according to their context eg Amazon rainforest Amazoncom Luke Huan Univ of Kansas 12 Origins of Data Mining o Draws ideas from machine learningAI pattern recognition statistics and database systems 0 Traditional Techniques may be unsuitable due to o Enormity of data 0 High dimensionality of data 0 Heterogeneous distributed nature of data 4262009 Luke Huan Univ of Kansas 13 Data Mining Tasks 0 Prediction Methods 0 Use some variables to predict unknown or future values of other variables 0 Underline assumption we could not gure out if the mechanism that generates the data but we 1 could nd a and previous of the near future r 5ii 7 mmmm based on current ham 0 Description Methods 0 Find humaninterpretable patterns that describe the data From Fayyad etal Advances in Knowledge Discovery and Data Mining 1996 4262009 Luke Huan Univ of Kansas 14 Data Mining Tasks Classi cation Predictive Cluster Descriptive Association Rule Discovery Descriptive Sequential Pattern Discovery Descriptive Regression Predictive Deviation Detection Predictive 4262009 Luke Huan Univ of Kansas Classification Definition 7 fat it39ll 0 Given a collection of records M 0 Each record contains a set of attributes one of the attributes is the class 0 Find a for class attribute as a function of the values of other attributes 0 Goal previously unseen records should be assigned a class as accurately as possible 0 A test set is used to determine the accuracy of the model Usually the given data set is divided into training and test sets with training set used to build the model and test set used to validate it 0 Also known as the supervised learning in machine learning literatures 4262009 Luke Huan Univ of Kansas 16 Classification Example Tid Refund Marital Yes No e n09 n09 6 0 a o 00 o are Taxable Marital Taxable Status Income Cheat Status Income Single 125K NO Single Married 100K No Married Single 70K No Married Married 120K No Divorced DiVOrCed 95K Yes Single Married 60K NO Married Divorced 220K No set Single 85K Yes 1 Married 75K No Single 90K Yes 4262009 Luke Huan Univ of Kansas 17 Classification Application 1 0 Direct Marketing 0 Goal Reduce cost of mailing by targeting a set of consumers likely to buy a new cellphone product 0 Approach 0 Use the data for a similar product introduced before 0 We know which customers decided to buy and which decided otherwise This buy don t buy decision forms the class attribute 0 Collect various demographic lifestyle and companyinteraction related information about all such customers Type of business where they stay how much they earn etc 0 Use this information as input attributes to learn a classi er model From Berry amp Linoff Data Mining Techniques 1997 4262009 Luke Huan Univ of Kansas 18 Classification Application 2 0 Fraud Detection 0 Goal Predict fraudulent cases in credit card transactions 0 Approach 0 Use credit card transactions and the information on its accountholder as attributes When does a customer buy What does he buy how often he pays on time etc 0 Label past transactions as fraud or fair transactions This forms the class attribute 0 Learn a model for the class of the transactions 0 Use this model to detect fraud by observing credit card transactions on an account 4262009 Luke Huan Univ of Kansas 19 Clustering Definition 0 Given a set of data points each having a set of attributes and a similarity measure among them nd clusters such that a Data points in one cluster are more similar to one another 0 Data points in separate clusters are less similar to one another 0 Similarity Measures 0 Euclidean Distance if attributes are continuous o Other Problemspeci c Measures 4262009 Luke Huan Univ of Kansas 20 Illustrating Clustering Euclidean Distance Based Clustering in 3D space 4262009 Luke Huan Univ of Kansas 21 Clustering Application 1 a Market Segmentation 0 Goal subdivide a market into distinct subsets of customers Where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix 0 Approach 0 Collect different attributes of customers based on their geographical and lifestyle related information 0 Find clusters of similar customers 0 Measure the clustering quality by observing buying patterns of customers in same cluster vs those from different clusters 4262009 Luke Huan Univ of Kansas Clustering Application 2 0 Document Clustering 0 Goal To nd groups of documents that are similar to each other based on the important terms appearing in them a Approach To identify frequently occurring terms in each document Form a similarity measure based on the frequencies of different terms Use it to cluster 0 Gain Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents 4262009 Luke Huan Univ of Kansas 23 Illustrating Document Clustering o Clustering Points 3204 Articles of Los Angeles Times 0 Similarity Measure How many words are common in these documents after some word ltering Category Total Correctly Articles Placed Financial 555 364 Foreign 341 260 National 273 36 Metro 943 746 Sports 738 573 Entertainment 354 278 4262009 Luke Huan Univ of Kansas 24 Clustering of SampP 500 Stock Data 3 Observe Stock Movements every day 3 Clustering points Stock UPDOWN at Similarity Measure Two points are more similar if the events described by them frequently happen together on the same day 3 We used association rules to quantify a similarity measure AppliedMatlDOW NBa yNetworkD0Wn 3COMD OWN CabletronSys DOWNCIS COD OWNJIPD OWN DSC COmmDOWNINTELDOWNLSI Logic DOWN I MicronTechD OWN3Te xas qut DownTe llabs Inc Down Tee mm 10 gyl DO 39 Natl Semic OnductDOWNOra c 1D OWN S GI DOW N SunDOWN Fannie Mae DOWNFe d Ho me Lo an DOW N MBNA C01p DOWNM0 rg an Stanley D OWN Financ ia1DOWN 4262009 Luke Huan Univ of Kansas 25 Assomatlon RUIe Discovery Definition 0 Given a set of records each of which contain some number of items from a given collection 0 Produce dependency rules which will predict occurrence of an item based on occurrences of other items 4262009 Bread Coke Milk Beer Bread Rules Discovered Milk gt Coke Diaper Milk gt Beer Beer Diaper Beer Bread Diaper Milk 9 Luke Huan Univ of Kansas 26 Association Rule Discovery Application 1 0 Marketing and Sales Promotion 0 Let the rule discovered be Bagels gt Potato Chips 0 Potato Chips as consequent gt Can be used to determine what should be done to boost its sales 0 Bagels in the antecedent gt Can be used to see which products would be affected if the store discontinues selling bagels 0 Bagels in antecedent and Potato chips in consequent gt Can be used to see what products should be sold with Bagels to promote sale of Potato chips 4262009 Luke Huan Univ of Kansas 27 Association Rule Discovery Application 2 o Supermarket shelf management 0 Goal To identify items that are bought together by suf ciently many customers 0 Approach Process the pointofsale data collected with barcode scanners to nd dependencies among items a A classic rule o If a customer buys diaper and milk then he is very likely to buy beer 0 So don t be surprised if you nd sixpacks stacked next to diapers 4262009 Luke Huan Univ of Kansas 28 Association Rule Discovery Application 3 0 Inventory Management 9 Goal A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households 0 Approach Process the data on tools and parts required in previous repairs at different consumer locations and discover the co occurrence patterns 4262009 Luke Huan Univ of Kansas 29 Sequential Pattern Discovery Definition 0 Given is a set of objects with each object associated with its own timeline of events nd rules that predict strong sequential dependencies among different events A B C D E 0 Rules are formed by rst disovering patterns Event occurrences in the patterns are governed by timing constraints A B C D E lt 9 33919 lt ws lt ms 4262009 Luke Huan Univ of Kansas 30 Sequential Pattern Discovery Examples o In telecommunications alarm logs o InverterProblem ExcessiveLineCurrent Recti erAlarm gt FireAlarm o In pointof sale transaction sequences 0 Computer Bookstore IntroToVisua1C CPrimer gt PerlfordummiesTclTk 0 Athletic Apparel Store Shoes Racket Racketball gt SportsJacket 4262009 Luke Huan Univ of Kansas 31 Regression o Predict a value of a given continuous valued variable based on the values of other variables assuming a linear or nonlinear model of dependency o Greatly studied in statistics neural network elds 0 Examples 0 Predicting sales amounts of new product based on advetising expenditure 0 Predicting Wind velocities as a function of temperature humidity air pressure etc 0 Time series prediction of stock market indices 4262009 Luke Huan Univ of Kansas 32 DeviationAnomaly Detection o Detect signi cant deviations from normal behavior 0 Applications 4 m 0 Credit Card Fraud Detection I i R 1quot l a 0 Network Intrusion Detection Typical network traf c at University level may reach over 100 million connections per day 4262009 Luke Huan Univ of Kansas 33 First Step The Nature of Your Data o What is data 0 What are in your data 0 What can we do about data 4262009 Luke Huan Univ of Kansas What is Data 0 Collection of data objects and their attributes 0 An attribute is a property or characteristic of an object 0 Examples eye color of a person temperature etc o Attribute is also known as variable field characteristic or feature Objectslt o A collection of attributes describe an object 0 Object is also known as record point case sample 4262009 entity or instance Luke Huari Univ of Kansas Attributes A Tid Refund Marital Taxable Status Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Attribute Values o Attribute values are numbers or symbols assigned to an attribute o Distinction between attributes and attribute values 0 Same attribute can be mapped to different attribute values 0 Example height can be measured in feet or meters 0 Different attributes can be mapped to the same set of values 0 Example Attribute values for ID and age are integers 0 But properties of attribute values can be different ID has no limit but age has a maximum and minimum value 4262009 Luke Huan Univ of Kansas 36 Types of Attributes c There are different types of attributes 0 Nominal 0 Examples ID numbers eye color zip codes 0 Ordinal 0 Examples rankings eg taste of potato chips on a scale from 110 grades height in tall medium short 0 Interval 0 Examples calendar dates temperatures in Celsius or Fahrenheit 0 Ratio 0 Examples temperature in Kelvin length time counts 4262009 Luke Huan Univ of Kansas 37 Structured vs Unstructured Data o Structured Data 0 Data in a relational database 0 Semistructured data 0 Graphs trees sequences 0 Unstructured data 0 Image text 4262009 Luke Huan Univ of Kansas What is in Your Data o What kinds of data quality problems 0 How can we detect problems with the data 0 What can we do about these problems 0 Examples of data quality problems 0 Noise and outliers 0 missing and duplicated data 4262009 Luke Huan Univ of Kansas 39 Noise 0 Noise refers to modi cation of original values 0 Examples distortion of a person s voice When talking on a poor phone and snow on television screen 1 1 10 05 5 0 0 5 05 10 1 i i i 1 i i i i 0 02 04 06 08 0 02 04 06 08 1 Time seconds Time seconds Two Sine Waves Two Sine Waves Noise 4262009 Luke Huan Univ of Kansas 40 Mapping Data to a New Space o Fourier transform 0 Wavelet transform 0 02 04 06 08 390 02 04 06 08 1 10 20 30 40 50 60 70 80 90 Time seconds Time seconds TWO Sine Waves Two Sine Waves Noise Frequency 4262009 Luke Huan Univ of Kansas 41 Outliers o Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set 0 PE 313955 O O quot int 33321 0 One person s outlier can be another one s treasure 4262009 Luke Huan Univ of Kansas Missing Values 0 Reasons for missing values 0 Information is not collected eg people decline to give their age and weight o Attributes may not be applicable to all cases eg annual income is not applicable to children 0 Handling missing values Eliminate Data Objects Estimate Missing Values Ignore the Missing Value During Analysis Replace with all possible values weighted by their probabilities 4262009 Luke Huan Univ of Kansas 43 Duplicate Data 0 Data set may include data objects that are duplicates or almost duplicates of one another D Major issue when merging data from heterogeous sources 0 Examples 9 Same person with multiple email addresses 0 Data cleaning a Process of dealing with duplicate data issues 4262009 Luke Huan Univ of Kansas 44 Challenges of Data Mining o Scalability 0 Dimensionality 0 Complex and Heterogeneous Data 0 Data Quality 0 Data Ownership and Distribution 0 Privacy Preservation o Streaming Data 4262009 Luke Huan Univ of Kansas 45 EECS 647 Introduction to Database Systems Instructor Luke Huan Spring 2009 V Swat FW Full Q Stating Points o A database 0 A database management system c A miniworld o A data model 0 Conceptual model a Relational model 2242009 Luke Huan Univ of Kansas So What Database Schemas A database schema is a description of a database using a given data model relational model by default External schemas user Views l M describe how users see the o o data V1ew 1 V1ew 2 V1ew 3 Conceptual schema de nes logical structure Im emal schema describes the F ya g gage Strucmre Of Ilnternal Schema I IConceptual Schemal DB 2242009 Luke Huan Univ of Kansas 3 Data Independence 0 Applications insulated from how data is structured and stored 0 Logical data independence the capacity to change the conceptual schema Without having to change the external schema 0 Physical data independence the capacity to change the internal schema Without having to change the conceptual schema 0 Q Why is this particularly importanifor DBMS OK So how to model a miniworld 0 Using conceptual model such as ER model 2242009 Luke Huan Univ of Kansas Where does a conceptual model lead us to o A tabular View 2242009 Luke Huan Univ of Kansas 6 Now you have a draft how do you improve your design 0 Using integrity constraints a Attributes value must come from its domain a Every relation mush have a primary key a The primary key value in a tuple can not be NULL 0 The foreign key value in a referenced tuple must exist or the foreign key value in the referencing tuple is NULL 2242009 Luke Huan Univ of Kansas How do you improve relational data base design cont 0 Using functional dependency o Nonkey FD always leads to redundancy o BCNF normal form 0 Pick up a nonkey FD do binary decomposition 0 First normal form no attribute may be composite or multiValued 0 2nd normal form and 3rd normal form are coming 2242009 Luke Huan Univ of Kansas What could we do about relations o Relational algebra expression 0 Using temporary variable 0 Identifying information source a Pay attention to non monotonic operation 2242009 Luke Huan Univ of Kansas Review 0 SELECT a list of attributes FROM a list of relations WHERE condition 0 Condition may have logical operators AND OR NOT 0 Condition may have comparison operators lt lt ltgt gt2 gt 0 String comparison may use exactly match or LIKE matching with regular expressions 9 9 o Arithmetic expressions of attributes are allowed 2242009 Luke Huan Univ of Kansas 10 Examples of bag operations Bagl Bag2 fruit fruit Apple Apple Apple Orange Orange Orange Bagl UNION ALL Bag2 Bagl INTERSECT ALL Bag2 fruit Apple frult Bagl EXCEPT ALL Bag2 mmm Apple Apple fruit Orange Apple Orange Orange Orange 2242009 Luke Huan Univ of Kansas 11 Exercise 0 SELECT Sid 2009 age FROM STUDENT WHERE name LIKE J0hn OR GPA gt 36 sid name age gpa 1234 John Smith 21 35 1123 Mary Carter 19 38 1011 Bob Lee 22 26 1204 Susan Wong 22 34 1306 Kevin Kim 18 29 asdfsadf 2242009 Luke Huan Univ of Kansas 12 Aggregates 0 Standard SQL aggregate functions COUNT SUM AVG MIN MAX 0 Example number of students under 18 and their average GPA 0 SELECT COUNT AVGGPA FROM Student WHERE age lt 18 o COUNT counts the number of rows 2242009 Luke Huan Univ of Kansas 13 Aggregates with DISTINCT 0 Example HOW many students are taking classes 0 SELECT COUNT SID FROM Enroll a SELECT COUNTDISTINCT SID FROM Enroll 2242009 Luke Huan Univ of Kansas GROUP BY 0 SELECT H FROM H WHERE GROUP BY list0fcolumns 0 Example nd the average GPA for each age group 0 SELECT age AVGGPA FROM Student GROUP BY age 2242009 Luke Huan Univ of Kansas Operational semantics of GROUP BY SELECT FROM WHERE GROUP BY 0 Compute FROM x 0 Compute WHERE o 0 Compute GROUP BY group rows according to the values of GROUP BY columns 0 Compute SELECT for each group TE 0 For aggregation functions with DI ST INCT inputs rst eliminate duplicates Within the group Number of groups number of rows in the nal output 2242009 Luke Huan Univ of Kansas 16 Example of computing GROUP BY SELECT age AVGGPA FROM Student GROUP BY age sid name age gpa 1234 JohnSnn 1 21 35 rows accordmg to the 1123 Mary Carter 19 38 values of GROUP BY 1011 Bob Lee 22 26 c lumnq 1204 Susan Wong 22 34 Sid name age gpa 1306 Kevin Kim 19 29 Compute GROUP BY group Compute SELECT for each 2242009 Luke Huan Univ of Kansas 17 Aggregates with no GROUP BY 0 An aggregate query with no GROUP BY clause represent a special case Where all rows go into one group FROM Student Compute aggregate over the group SELECT AVG GPA 2242009 sid name age gpa sid name age gpa 1234 John Smith 21 35 7 a 1123 Mary Carter 19 38 1011 Bob Lee 22 26 1204 Susan Wong 22 34 1306 Kevin Kim 19 29 Group all I39OWS into one 18 Luke Huan Univ of Kansas Restriction on SELECT o If a query uses aggregation group by then every column referenced in SELECT must be either 0 Aggregated or o A GROUP BY column This restriction ensures that any SELECT expression produces only one value for each group 2242009 Luke Huan Univ of Kansas 19 Examples of invalid queries o SELECT Wage EROM Student GROUP BY age 0 Recall there is one output row per group 0 There can be multiple SID values per group 0 SELECT M MAX GPA FROM Student 0 Recall there is only one group for an aggregate query with no GROUP BY clause 0 There can be multiple SID values 0 Wishful thinking that the output SID value is the one associated with the highest GPA does NOT work 2242009 Luke Huan Univ of Kansas 20 HAVING 0 Used to lter groups based on the group properties eg aggregate values GROUP BY column values 0 SELECT FROM H WHERE GROUP BY HAVING condition 2242009 Compute FROM x Compute WHERE 6 Compute GROUP BY group rows according to the values of GROUP BY columns Compute HAVING another 6 over the groups Compute SELECT TE for each group that passes HAVING Luke Huan Univ of Kansas 21 HAVING exam ples c Find the average GPA for each age group over 10 0 SELECT age AVGGPA FROM Student GROUP BY age HAVING age gt 10 0 Can be written using WHERE Without table expressions 0 List the average GPA for each age group with more than a hundred students 0 SELECT age AVGGPA FROM Student GROUP BY age HAVING COUNT gt 100 0 Can be written using WHERE and table expressions 2242009 Luke Huan Univ of Kansas 22 A First Touch of Subqueries 0 Use query result as a table 0 In set and bag operations FROM clauses etc o A way to nest queries 0 Example names of students who are in more clubs than classes SELECT DISTINCT name FROM Student SELECT SID FROM ClubMember EXCEPT ALL SELECT SID FROM Enroll AS S WHERE StudentSID SSID 2242009 Luke Huan Univ of Kansas 23 EECS 647 Introduction to Database Systems Instructor Luke Huan Spring 2009 Midterm 2 Histogram 14 12 7 510 C cu 8quot g 6 77 Frequency d L 4 2 0 1 1 1 lt70 7079 8089 gt90 Bin 572009 Luke Huan Univ of Kansas Administrative o Homework 6 is assigned Without due day 0 Everyone has full credit for it 572009 Luke Huan Univ of Kansas Administrative o In this fall I am teaching EECS 730 Introduction to Bioinforrnatics 572009 Luke Huan Univ of Kansas Clients 1 g 5 What s the world in your eyes Application servers Database 86 Ne FS What s the world in a data engineer eyes 572009 Luke Huan Univ of Kansas Review Clients V v a 9 Web server Application servers Database servers 572009 Luke Huan Univ of Kansas Review 572009 Luke Huan Univ of Kansas 7 Review of What We Learned Since Midterm o Concurrency control ACID Precedence graph Locking 2PL Strict2PL Deadlock and starvation Recovery 0 Data Mining 572009 Data mining overview Decision tree Law of large number Association rule and frequent item set mining Clustering Luke Huan Univ of Kansas Welcome to the Real DB World 0 Where is the frontier o The ACM Special Interest Group on Management of Data 0 ACM SIGMOD wwwsigm0d org 0 Very large databases 0 VLDB httpWWWVldborgdblpdbconfVldb 0 IEEE International Conference on Data Engineering 0 IEEE ICDE httpdblpunitrierdedbconficde 572009 Luke Huan Univ of Kansas 9
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'