#### Transcript Using SQL to Build New Aggregates and Extenders for Object

User-Defined Aggregates for Advanced Database Applications Haixun Wang [email protected] Computer Science Dept. University of California, Los Angeles State of the Art: the big picture Relational Databases: Single data model of spartan simplicity Logic-based database languages Commercial success and dominance through SQL standards (in the 80’s) A new wave of DB applications (in the 90’s) Knowledge discovery, Data Mining, OLAP, Time Series analysis, Spatial/Temporal DBs, XML, … Underscores limitations of RDBMS Prompted major extensions leading to SQL3 standards (SQL99 is a subset of SQL3) 2 State of the Art: R&D highlights Deductive DBs OO-DBs Rule based syntax Logic based formalization of query language semantics – e.g., nonmonotonic reasoning and stratification Recursive queries Complex data types / inheritance Expressive power by merging PL and query languages OR-DBs Rich data types / Path Expression (SQL) UDFs and Database Extenders (Data Blades) 3 State of the Art: the seamy side A patchwork of major extensions Still not powerful enough DBMSs have become more powerful but much hard and complex to build and use Data Mining: clustering, classification, association New language constructs not helping either Limited expressive power in other applications Bill of Materials (BoM) type of applications Temporal reasoning 4 This thesis: Many of the problems can be solved by UDAs User Defined Aggregates (UDAs): insufficient support in commercial world and DB standards Our claim: UDAs provide a more general and powerful mechanism for DB extensions AXL – a system to make it easier to define UDAs AXL – where SQL and Data Mining intersect 5 A Brief History of AXL (and of my thesis) Logic formalization of aggregates SQL-AG: Simple Aggregate Definition Language [ICDE’00] using SQL to define new aggregates easy to use, but with limited performance and power AXL: Implementing and extending SQL3 UDAs [DBPL’99] Implemented on DB2 using extended user-defined aggregates with ‘early returns’ SADL: [DDLP’98, LBAI’00] Early returns, monotonic aggregates: used freely in recursive queries. Extensions of LDL++ (Logic Database Language) Aggregate eXtension Language [VLDB’00] powerful, efficient and still SQL-based 6 Defining UDAs in SQL3 AGGREGATE FUNCTION myavg(val NUMBER) RETURN NUMBER STATE state INITIALIZE myavg_init ITERATE myavg_iterate TERMINATE myavg_return INITIALIZE: gives an initial value to the aggregate ITERATE : computes the intermediate aggregate value for each new record TERMINATE: returns the final value computed for the aggregate myavg_init, myavg_iterate, myavg_return are 3 functions that the user must write in a procedural programming language 7 Limitation of SQL3 UDAs UDAs in SQL3, Postgres, Informix, and early versions of LDL++ share the same limitations: Aggregates can not be used inside recursion No support for early returns and on-line aggregation Also Ease of Use is a major issue (except for LDL++) 8 Ease of Use THE PROBLEM: UDFs are very hard to write and debug. In “unfenced mode” they jeopardize the integrity of the system. UDAs defined using several UDFs are prone to the same problem. A SOLUTION: Use a high-level language for defining UDAs. But there are many potential problems with any new language. THE IDEAL SOLUTION: use SQL to define new aggregates. Substantial benefits: Users are already familiar with SQL No impedance mismatch of data types and programming paradigms DB advantages: scalability, data independence, optimizability, parallelizability 9 Simple Aggregates AGGREGATE avg(value INT) : REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state (value, 1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } } 10 Avoiding Multiple Scans Show the average salary of senior managers who make 3 times more than the average employees. SQL: SELECT avg(salary) FROM employee WHERE title = ‘senior manager’ AND salary > 3 * (SELECT avg(salary) FROM employee) Two scans of the employee table required With AXL UDAs: SELECT sscan(title, salary) FROM employee 11 AXL: Using a Single Scan AGGREGATE sscan(title CHAR(20), salary INT) : REAL { TABLE state(sum INT, cnt INT) AS VALUES (0,0); TABLE seniors(salary INT); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+salary, cnt=cnt+1; INSERT INTO seniors VALUES(salary) WHERE title = ‘senior manager’; } TERMINATE: { INSERT INTO RETURN SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > 3 * (SELECT sum/cnt FROM state); } } 12 Ordered Sequences and Time Series We have a sequence of events, each of which is active during a certain interval (from, end). Find out at which point of time we have the largest number of active events. SQL: Group-by on the start time of each interval, and count! With AXL UDAs: SELECT density(from, end) FROM events 13 AXL: Using a Single Scan AGGREGATE density(start TIME, end TIME) : (time TIME, count INT) { TABLE state(time TIME, count INT) AS (0, 0); TABLE active(endpoint TIME); INITIALIZE: ITERATE: { DELETE FROM active WHERE endpoint < start; INSERT INTO active VALUES(end); UPDATE state SET time=start, count =count + 1 WHERE count < (SELECT count(*) FROM active); } TERMINATE: { INSERT INTO RETURN SELECT time, count FROM state; } } 14 Early Returns AVG normally converges early: an early approximation is all is needed in several applications Online aggregation means that early returns are produced during the computation Many applications: e.g., find the local max and min in a sequence of values, various temporal aggregates, rollups, etc. These might depend on the order – same as new OLAP extensions in SQL3 16 Return avg for Every 100 Records AGGREGATE olavg(value INT): REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state VALUES (value,1); } ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; INSERT INTO RETURN SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; } } 17 Temporal Coalescing AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME) { TABLE state(cFrom TIME, cTo TIME); INITIALIZE: { INSERT INTO state VALUES (from, to); } ITERATE: { UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to; INSERT INTO RETURN SELECT cFrom, cTo FROM state WHERE cTo < from; UPDATE state SET cFrom = from, cTo = to WHERE cTo < from; } TERMINATE: { INSERT INTO RETURN SELECT cFrom, cTo FROM state; } } 18 Recursive Aggregates In AXL, aggregates can call other aggregates. Particularly, an aggregate can call itself recursively. AGGREGATE alldesc(P CHAR(20)): CHAR(20) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(P); INSERT INTO RETURN SELECT alldesc(Child) FROM children WHERE Parent = P; } } Find all the descendents of Tom: SELECT alldesc(Child) FROM children WHERE Parent = ‘Tom’; 19 Check Point Simple applications: AXL UDAs provide a solution with better performance and good ease of use. Many advanced database applications Time Series, Temporal Database, Spatial Database… In particular data mining applications 20 Data Mining and Database Systems Current Approach: Cache-Mine: Cursor-based: loose-coupling, stored procedures UDFs: ease of use problems Using DBMSs as containers of data Many attempts to closely integrate data mining functions into DBMS have shown major problems 21 What we need … SQL-aware Data Mining Systems Surajit Chaudhuri “Data Mining and Database Systems: Where is the Intersection?” IEEE Data Engineering Bulletin, 1998 Decision Tree Classifiers Outlook Sunny Sunny Overcast Rain Rain Rain Overcast Sunny Sunny Rain Sunny Overcast Overcast Rain Temp Hot Hot Hot Mild Cool Cool Cool Mild Cool Mild Mild Mild Hot Mild Humidity High High High High Normal Normal Normal High Normal Normal Normal High Normal High Training set: tennis Wind Weak Strong Weak Weak Weak Strong Strong Weak Weak Weak Strong Strong Weak Strong PlayTennis No Yes Yes Yes Yes Yes No No Yes Yes Yes Yes Yes no RecId 1 1 1 1 2 2 2 2 ColId Value 1 Sunny 2 Hot 3 High 4 Weak 1 Sunny 2 Hot 3 High 4 Strong PlayTennis No No No No Yes Yes Yes Yes 14 14 14 14 1 2 3 4 Rain Mild High Strong No No No No Stream of Column/Value Pairs (together with RecId and Category) 25 Convert training set to column/value pairs AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT) : (col INT, val INT, YorN INT) { INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(1, v1, yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn); } } CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT) SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis) FROM tennis; SELECT classify(recId, col, val, YorN) FROM col-val-pairs; 26 Categorical Classifier in AXL [ 1] AGGREGATE classify(RecId INT, iNode INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(RecId INT,Node INT, Col INT, Val REAL, YorN INT); [ 3] TABLE summary(Col INT, Value INT, Yc INT, Nc INT, KEY {Col, Value}); [ 4] TABLE mincol(Col INT, MinGini REAL); [ 5] TABLE ginitable(Col INT, Gini REAL); [ 6] INITIALIZE: ITERATE: { [ 7] INSERT INTO treenodes VALUES (RecId, iNode, iCol, iValue, iYorN); [ 8] UPDATE summary [ 9] SET Yc=Yc+iYorN, Nc=Nc+1-iYorN WHERE Col = iCol AND Value = iValue; [10] INSERT INTO summary SELECT iCol, iValue, iYorN, 1-iYorN WHERE SQLCDE=0; [11] } [12] TERMINATE: { [13] INSERT INTO ginitable SELECT Col, sum((Yc*Nc)/(Yc+Nc))/sum(Yc+Nc) [14] FROM summary GROUP BY Col; [15] INSERT INTO mincol SELECT minpointvalue(Col, Gini) FROM ginitable; [16] INSERT INTO result SELECT iNode, Col FROM mincol; [17] SELECT classify(t.RecId, t.Node*MAXVALUE+m.Value+1, t.Col, t.Value, t.YorN) [18] FROM treenodes AS t, [19] (SELECT tt.RecId RecId, tt.Value Value FROM treenodes tt, mincol m [20] WHERE tt.Col=m.Col AND m.MinGini>0 ) AS m [21] WHERE t.RecId = m.RecId GROUP BY m.Value; [22] } [23] } Performance SPRINT Algorithm: AXL vs. C Categorical Classifier: AXL vs. C 28 SPRINT Algorithm in AXL [ 1] AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value)); [ 3] TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT); [ 4] TABLE split(Rec INT, LeftOrRight INT, KEY (RecId)); [ 5] TABLE mincol(Col INT, Val REAL, Gini REAL); [ 6] TABLE node(Node INT) AS VALUES(iNode); [ 7] INITIALIZE: ITERATE: { [ 8] INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN); [ 9] UPDATE summary [10] SET Yc=Yc+iYorN, Nc=Nc+1-iYorN, (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal) [11] WHERE Col=iCol; [12] } [13] TERMINATE: { [14] INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary; [15] INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n; [16] INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m [17] WHERE t.Col = m.Col AND m.Gini > 0; [18] SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN) [19] FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec [20] GROUP BY s.LeftOrRight; [21] } [22] } Comparison with Other Architectures Integrating Association Rule Mining with Relational Database Systems S. Sarawagi et al SIGMOD 98 Time in Sec 4000 3000 Pass Pass Pass Pass 2000 1000 4 3 2 1 0 Cache S-Proc UDF SQL Datasets with 6.6 million records, support level .25%. 30 Implementation of AXL Standalone Mode Berkeley DB DB2 Add-on Mode .axl .axl .cc .cc .exe UDFs DB2 .lib SQL DB2 31 Implementation of AXL Open interface of physical data model. Limited Optimization Currently using Berkeley DB as our storage manager In memory tables Using B+-Tree indexes to support equality/range query Predicate push-down / push-up User Defined Aggregates Hash based Return multiple rows: ‘early return’ Return multiple columns: employee’s name and salary 32 Implementation of AXL (cont’d) Non-blocking aggregation Keeping the state of aggregation between calls to the aggregate routines Local tables defined inside aggregation are passed as parameters to the aggregates Explicit sorting (and implicit hash-based aggregation) AXL V1.2: above 30,000 lines of code 33 Check Point Simple applications: AXL UDAs provide a solution with better performance and good ease of use. Data Mining applications Formal Semantics of Aggregates and Monotonic Aggregation 34 Aggregates in Recursion Stratification: shaves(barber, X) :- villager(X), shaves(X, X). villager(barber). Aggregates: p count(p) =0 Aggregates in many applications are actually monotonic (and should be allowed inside recursion). 35 Beyond Stratification Significant previous efforts… I. S. Mumick, H. Pirahesh and R. Ramakrishnan, “The magic of duplicates and aggregates”, VLDB 1990 A. Van Gelder, “Foundations of aggregates in deductive databases”, DOOD 1993 K. A. Ross and Y. Sagiv, “Monotonic aggregates in deductive databases”, JCSS 1997 S. Greco and C. Zaniolo, “Greedy algorithms in Datalog with choice and negation”, JICSLP 1998 36 Formal Semantics of Aggregates choice((X),(Y)) Ordering a domain Enforcing functional dependency. FD: X->Y Multiplicity of stable models, monotonic transformation Once (X,Y) is generated, choice ensures this is the only arc leaving source node X and entering sink node Y Formal semantics of UDA … return(Y,V) :- ordered(X,Y), ordered(Y,_), terminate(Y,V). … 37 Early Returns Monotonic Aggregates Aggregates with only ‘early returns’ and no ‘final returns’ are monotonic w.r.t. set containment: AGGREGATE mcount(): INT { TABLE state(cnt INT) AS VALUES (0); INITIALIZE: ITERATE: { UPDATE state SET cnt=cnt+1; INSERT INTO RETURN SELECT cnt FROM state; } } 38 Early Returns Monotonic Aggregates SELECT mcount(*) FROM employee; v.s. SELECT count(*) FROM employee; Name John Mary Tom Name John Mary Tom Jerry {1,2,3} count{John, Mary, Tom} 3 mcount{John, Mary, Tom} {1,2,3,4} count{John, Mary, Tom, Jerry} 4 mcount{John, Mary, Tom, Jerry} 39 Return sum at the nth value AGGREGATE sumat(value INT, n INT): INT { TABLE state (sum INT, cnt INT) AS VALUES (0,0); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt +1; INSERT INTO RETURN SELECT sum FROM state WHERE cnt = n; } } 40 Monotonic Aggregation Monotonic aggregates can be used without any restriction and without changing the underlying implementation. This solves the problem that had eluded database researchers since the introduction of relational systems: BoM, Company Control, Join-the-Party… Greedy optimization algorithms, such as Dijkstra’s single source shortest path. 42 Join-the-Party Problem Some people will come to the party no matter what, and their names are stored in a sure(PName) relation. But many other people will join only after they know that at least K=3 of their friends will be there. WITH wllcm(Name) AS ((SELECT Pname FROM sure) UNION ALL (SELECT f.Pname FROM friend AS f, wllcm AS w WHERE w.Name =f.Fname GROUP BY f.Pname HAVING mcount()=3)) SELECT Name FROM wllcm; Density-based Clustering [M. Ester et al. KDD 96] 43 BoM: Cost of Parts assembly 001 1 5 1 3 002 004 003 100 2 5 10 005 8 3 Part 001 001 001 001 002 003 003 004 004 004 Qty 5 1 1 100 10 2 5 3 8 3 part-cost Basic Part 005 006 006 Subpart 002 003 004 005 005 006 005 006 005 003 Cost 10 15 fan-out Part 001 002 003 004 ChC 4 1 2 3 44 BoM: Using AXL WITH cst(part, cost) AS ((SELECT part, cost FROM part-cost) UNION ALL (SELECT a.part, sumat(a.qty * c.cost, p.ChC) FROM assembly AS a, cst AS c, fan-out AS p WHERE a.subpart = c.part AND p.part = a.part GROUP BY a.part)) SELECT part, cost FROM cst; Bottom up solution. Computes the cost for each part once and only once. Monotonic sumat(cost, n) returns sum when exactly n items are aggregated. Works in DB2 after AXL rewrites callings of sumat() to callings of automatically generated UDFs. 45 BoM: Using Recursive SQL WITH mpath(subpart, qty) AS ((SELECT subpart, qty FROM assembly WHERE part = ‘001’) UNION ALL (SELECT c.subpart, m.qty * c.qty FROM mpath m, assembly c WHERE m.subpart = c.part)) SELECT sum(m.qty * c.cost) FROM mpath m, part_cost c WHERE m.subpart = c.part ; Top down solution: computes the cost of part ‘001’ Explosion: all edges that descend from part ‘001’ are “stored” in mpath What if we want to compute the cost of each part? 46 Check Point Simple applications Data Mining and Decision Support Formal Semantics & Monotonic aggregates OLAP and other aggregate extensions SUCH THAT CUBE, ROLLUP, GROUPING SET OLAP Functions 47 “Such That” For each division, show the average salary of senior managers who make 3 times more than the average employees, and the average salary of senior engineers who make 2 times more than the average employees (in the same output record). D. Chatziantoniou, Kenneth Ross, VLDB 1996 SELECT division, avg(X.salary), avg(Y.salary) FROM employee GROUP BY division: X, Y SUCH THAT X.title = ‘senior manager’ AND X.salary > 3*avg(salary) AND Y.title = ‘senior engineer’ AND Y.salary > 2*avg(salary) 48 Expressing “Such That” in AXL TABLE seniors(salary INT); AGGREGATE sscan2(title CHAR(20), salary INT, qtitle CHAR(20), ratio INT): REAL { TABLE state(sum INT, cnt INT) AS VALUES (0,0); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+salary, cnt=cnt+1; INSERT INTO seniors VALUES (salary) WHERE title = qtitle; } TERMINATE: { SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > ratio * (SELECT sum/cnt FROM state); } } 49 Using UDA sscan2 SELECT division, sscan2(title, salary, ‘senior manager’, 3), sscan2(title, salary, ‘senior engineer’, 2) FROM employee GROUP BY division; No joins or sub-queries required. One pass through the employee relation (standard SQL requires at least 2 passes). 50 Other Aggregate Extensions GROUPING SETS, ROLL-UP, CUBE New OLAP extensions Windows containing a partitioning, an ordering of rows and an aggregate group “… every standard must be prepared to tackle new issues that arise as the market evolves. If SQL does not respond positively to this challenge, SQL risks becoming irrelevant …” -- F. Zemke, K. Kulkarni, A. Witkowski, B. Lyle Introduction to OLAP Functions 51 Contributions Adding extended UDAs to O-R systems Tightly couple data mining functions with DBMS high level language, minimal additions to SQL monotonic aggregates, recursive aggregates designed for general purpose applications SPRINT Algorithm, Categorical Classifier, … Performance More efficient than cursor-based languages like PL/SQL, JDBC and UDF-based approaches … 56 Future Directions Parallelization Extenders/Data Blades Decision Support build on top of UDAs instead of UDFs the Apriori algorithm Windows and OLAP functions Spatial/Temporal extensions the TENOR system 57 Future Direction: Parallelization Current parallel aggregation algorithms valid for AXL: Inter-Partition parallelism: all tuples of the same group-by value are in one node Two phase algorithm: user provides a COMBINE routine Unlike SQL3, AXL’s aggregate routines are written in SQL, so we can apply traditional query parallelization techniques to INITIALIZE, ITERATE, and TERMINATE Since aggregate routines are written in SQL, the COMBINE routine can be generated automatically by the system for simple UDAs 58