Class Note for EECS 647 with Professor Huan at KU 3
Class Note for EECS 647 with Professor Huan at KU 3
Popular in Course
Popular in Department
This 45 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at Kansas taught by a professor in Fall. Since its upload, it has received 15 views.
Reviews for Class Note for EECS 647 with Professor Huan at KU 3
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 02/06/15
EECS 647 Introduction to Database Systems Instructor Luke Huan Spring 2009 Administrative o I am trying to get supporting team to agree to keep your pgSQL account for another semester until Dec 2009 0 Which means 0 Your online database application will be alive for a while a You could do a demon to your potential employers if they are interested o A few things to keep in mind a No backup students are responsible for their own data 9 No uptime availability guarantee system may be updated in anytime There is no access during update 4262009 Luke Huan Univ of Kansas 2 Administrative o Midterm II is scheduled on April 27th 0 Senior survey today 4262009 Luke Huan Univ of Kansas Review Database Design Miniworld REQUIREMENTS COLLECTION AND ANALYSIS Iv Functional Requirements Data Requirements FUNCTIONAL ANALYSIS CONCEPTUAL DESIGN HighLevel Transaction Conceptual Schema Specification In a highlevel data model I LOGICAL DESIGN DATA MODEL MAPPING I D E MSindependent i D B MSspecific Logical Concepiual Schema 39In the data model of a specific DBMS DESIGN 1 PHYSICAL DESIGN TRANSACTION lt Internal Schema IM PLEM ENTATION APPLICATION PROGRAM Application Programs 4262009 Luke Huan Univ of Kansas Review DBMS Architecture UserWeb F ormsApplicationsDBA query Query Parser Query Rewriter Files amp Access Methods Buffer Manager Storage Manager transaction Lock Tables Main Memory 4262009 Luke Huan Univ of Kansas 5 Today s Topic 0 Data Mining transform data to knowledge 6 knowledge is power 4262009 Luke Huan Univ of Kansas What Is Data Mining 0 Data mining knowledge discovery from data 0 Extraction of interesting nontrivial implicit previously unknown and potentially useful patterns or knowledge from huge amount of data 0 Data mining a misnomer 0 Alternative names 0 Knowledge discovery mining in databases KDD knowledge extraction datapattern analysis data archeology data dredging information harvesting business intelligence etc a Watch out Is everything data mining a Simple search and query processing A T o Deductive expert systems gymw 4262009 Luke Huan Univ of Kansas 7 Why Mine Data Commercial Viewpoint 0 Lots of data is being collected and warehoused o Web data ecommerce o purchases at department grocery stores 0 BankCredit Card transactions 7 r LJ 0 Computers have become cheaper and more powerful 0 Competitive Pressure is Strong 0 Provide better customized services for an edge eg in Customer Relationship Management 4262009 Luke Huan Univ ofKanSaS E Why Mine Data Scientific Viewpoint 0 Data collected and stored at enormous speeds GBhour 0 remote sensors on a satellite 0 telescopes scanning the skies o microarrays generating gene expression data 9 scienti c simulations generating terabytes of data 0 Traditional techniques infeasible for raw data 0 Data mining may help scientists o in classifying and segmenting data 0 in Hypothesis Formation 4262009 Luke Huan Univ of Kansas Mining Large Data Sets Motivation 0 There is often information hidden in the data that is not readily evident 0 Human analysts may take weeks to discover useful information 0 Much of the data is never analyzed at all 4000000 3500000 3000000 r 2500000 2000000 Total new disk TB since 1995 1500000 1000000 Number of 500000 analysts 0 1995 1996 1997 1998 1999 From R Grossman C Kamath V Kumar Data Mining for Scienti c and Engineering Applications Knowledge Discovery KDD Process o Dwannmng MHeof knowledge discovery Pattern E lua process 7 H m animating7 Taskrelevant D Data Wareh on Data Cleaning b 4 Integration Databases 4262009 Luke Huan Univ of Kansas 11 What is not Data Mining o What is not Data Mining Look up phone number in phone directory Query a Web search engine for information about Amazon 4262009 0 What is Data Mining Certain names are more prevalent in certain US locations O Brien O Rurke O Reilly in Boston area Group together similar documents returned by search engine according to their context eg Amazon rainforest Amazoncom Luke Huan Univ of Kansas 12 Origins of Data Mining o Draws ideas from machine learningAI pattern recognition statistics and database systems 0 Traditional Techniques may be unsuitable due to o Enormity of data 0 High dimensionality of data 0 Heterogeneous distributed nature of data 4262009 Luke Huan Univ of Kansas 13 Data Mining Tasks 0 Prediction Methods 0 Use some variables to predict unknown or future values of other variables 0 Underline assumption we could not gure out the mechanism that generates the data but we could nd a 1le wormmrimz based on current and previous rare to a level that we could make of the near future 0 Description Methods 0 Find humaninterpretable patterns that describe the data From Fayyad etal Advances in Knowledge Discovery and Data Mining 1996 4262009 Luke Huan Univ of Kansas 14 Data Mining Tasks Classi cation Predictive Clustering Descriptive Association Rule Discovery Descriptive Sequential Pattern Discovery Descriptive Regression Predictive Deviation Detection Predictive 4262009 Luke Huan Univ of Kansas 15 Classification Definition 0 Given a collection of records 0 Each record contains a set of attributes one of the attributes is the class 0 Find a for class attribute as a function of the values of other attributes 0 Goal previously unseen records should be assigned a class as accurately as possible 0 A test set is used to determine the accuracy of the model Usually the given data set is divided into training and test sets with training set used to build the model and test set used to validate it 0 Also known as the supervised learning in machine learning literatures 4262009 Luke Huan Univ of Kansas 16 Classification Example Tid Refund Marital Aromximoihmm Yes No No Yes No No Yes No No No C 0 000 nego 00 lt90 e Go 00 o 65 Taxable Status Income Cheat Single 125K No M arried 1 00K No Single 70K No Married 120K No DiVOrCed 95K Yes Married 60K No Divorced 220K No Single 85K Yes Married 75K No Single 90K Yes 4262009 Refund Marital Taxable Status Income Cheat No Single 75K Yes Married 50K 239 No Married 150K Yes Divorced 90K No Single 40K No Married 80K 17 Luke Huan Univ of Kansas Classification Application 1 0 Direct Marketing 0 Goal Reduce cost of mailing by targeting a set of consumers likely to buy a new cellphone product 0 Approach 0 Use the data for a similar product introduced before 0 We know which customers decided to buy and which decided otherwise This buy don t buy decision forms the class attribute 0 Collect various demographic lifestyle and companyinteraction related information about all such customers Type of business where they stay how much they earn etc 0 Use this information as input attributes to learn a classi er model From Berry amp Linoff Data Mining Techniques 1997 4262009 Luke Huan Univ of Kansas 18 Classification Application 2 0 Fraud Detection a Goal Predict fraudulent cases in credit card transactions a Approach 0 Use credit card transactions and the information on its accountholder as attributes When does a customer buy What does he buy how often he pays on time etc 0 Label past transactions as fraud or fair transactions This forms the class attribute 0 Learn a model for the class of the transactions 0 Use this model to detect fraud by observing credit card transactions on an account 4262009 Luke Huan Univ of Kansas 19 Clustering Definition 0 Given a set of data points each having a set of attributes and a similarity measure among them nd clusters such that a Data points in one cluster are more similar to one another 0 Data points in separate clusters are less similar to one another 0 Similarity Measures 0 Euclidean Distance if attributes are continuous o Other Problemspeci c Measures 4262009 Luke Huan Univ of Kansas 20 Illustrating Clustering Euclidean Distance Based Clustering in 3D space 4262009 Luke Huan Univ of Kansas 21 Clustering Application 1 a Market Segmentation 0 Goal subdivide a market into distinct subsets of customers Where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix 0 Approach 0 Collect different attributes of customers based on their geographical and lifestyle related information 0 Find clusters of similar customers 0 Measure the clustering quality by observing buying patterns of customers in same cluster vs those from different clusters 4262009 Luke Huan Univ of Kansas 22 Clustering Application 2 0 Document Clustering a Goal To nd groups of documents that are similar to each other based on the important terms appearing in them Q Approach To identify frequently occurring terms in each document Form a similarity measure based on the frequencies of different terms Use it to cluster 0 Gain Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents 4262009 Luke Huan Univ of Kansas 23 Illustrating Document Clustering Clustering Points 3204 Articles of Los Angeles Times Similarity Measure How many words are common in these documents after some word ltering Category Total Correctly Articles Placed Financial 555 364 Foreign 341 260 National 273 36 Metro 943 746 Sports 738 573 Entertainment 354 278 4262009 Luke Huan Univ of Kansas 24 Clustering of SampP 500 Stock Data 3 Observe Stock Movements every day Clustering points Stock UPDOWN 3K Similarity Measure Two points are more similar if the events described by them frequently happen together on the same day at We used association rules to quantify a similarity measure Applie dMatl DOW NBayNejworkDOWn 3CQMDOWN Cabletron Sy s DOW NCIS COD OWNZHPLD OWN DSCCOmmDOWNINTELDOWNLSI Logic DOWN 39 Micro n Te ch D OWN T e xas Ins tvDown Te llab s Inc Down Tee hno lo gyl DO 39 Natl Se mic 0ndu39ctDOWNIOra c 1D OWN s GI DOW N SlinDOWN Fannie Mae DOWNFe d Ho me Lo an DQW N MBNA C01p DOWNM0 rg an Stanley D OWN Financ ia1DOWN 4262009 Luke Huan Univ of Kansas 25 Assomatlon RUIe Discovery Definition 0 Given a set of records each of which contain some number of items from a given collection 0 Produce dependency rules which will predict occurrence of an item based on occurrences of other items TID Items 1 Bread COke Milk Rules Discovered 2 Beer Bread Milk gt coke 3 Beer Coke Diaper Milk Diaper Milk gt Beer 4 Beer Bread Diaper Milk 5 Coke Diaper Milk 4262009 Luke Huan Univ of Kansas 26 Association Rule Discovery Application 1 0 Marketing and Sales Promotion 0 Let the rule discovered be Bagels gt Potato Chips 0 Potato Chips as consequent gt Can be used to determine what should be done to boost its sales 0 Bagels in the antecedent gt Can be used to see which products would be affected if the store discontinues selling bagels 0 Bagels in antecedent and Potato chips in consequent gt Can be used to see what products should be sold with Bagels to promote sale of Potato chips 4262009 Luke Huan Univ of Kansas 27 Association Rule Discovery Application 2 o Supermarket shelf management 0 Goal To identify items that are bought together by suf ciently many customers 0 Approach Process the pointofsale data collected with barcode scanners to nd dependencies among items 0 A classic rule o If a customer buys diaper and milk then he is very likely to buy beer 0 So don t be surprised if you nd sixpacks stacked next to diapers 4262009 Luke Huan Univ of Kansas 28 Association Rule Discovery Application 3 0 Inventory Management 0 Goal A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households 0 Approach Process the data on tools and parts required in previous repairs at different consumer locations and discover the co occurrence patterns 4262009 Luke Huan Univ of Kansas 29 Sequential Pattern Discovery Definition Given is a set of objects with each object associated with its own timeline of events nd rules that predict strong sequential dependencies among different events A B governed by timing constraints C D E Rules are formed by rst disovering patterns Event occurrences in the patterns are A B lt xg A C D E ltws 23919 L A Ti lt ms A 4262009 Luke Huan Univ of Kansas 30 Sequential Pattern Discovery Examples o In telecommunications alarm logs o InverterProblem ExcessiveLineCurrent Recti erAlarm gt FireAlarm o In pointof sale transaction sequences 0 Computer Bookstore IntroToVisua1C CPrimer gt PerlfordummiesTclTk 0 Athletic Apparel Store Shoes Racket Racketball gt SportsJacket 4262009 Luke Huan Univ of Kansas 31 Regression o Predict a value of a given continuous valued variable based on the values of other variables assuming a linear or nonlinear model of dependency o Greatly studied in statistics neural network elds 0 Examples 0 Predicting sales amounts of new product based on advetising expenditure 0 Predicting Wind velocities as a function of temperature humidity air pressure etc 0 Time series prediction of stock market indices 4262009 Luke Huan Univ of Kansas 32 DeviationAnomaly Detection o Detect signi cant deviations from normal behavior 0 Applications 0 Credit Card Fraud Detection 0 Network Intrusion Detection Typical network traf c at University level may reach over 100 million connections per day 4262009 Luke Huan Univ of Kansas 33 First Step The Nature of Your Data o What is data 0 What are in your data 0 What can we do about data 4262009 Luke Huan Univ of Kansas 34 What is Data 0 Collection of data objects and Attributes their attributes A f 0 An attr1bute1s a property or Tid Refund Marital Taxable Ch t St t ea characteristic of an object a quot5 quot me E 1 1 f r 1 Yes Single 125K No Q xamp 6539 eye CO or 0 a 2 No Married 100K No person temperature etc 3 No Single W No 0 Attribute is also known as 4 Yes Married 120K No variable eld Objects lt 5 No Divorced 95K Yes 39 39 6 No Married 60K No characteristic or feature 7 Yes Divorced 220K No 0 A collection of attributes 8 No Single 85K Yes describe an object 9 No Married 75K No 0 Object is also known as K 10 N0 Sing39e 90K Yes record point case sample entity or instance 4262009 Luke Huan Univ of Kansas 35 Attribute Values o Attribute values are numbers or symbols assigned to an attribute o Distinction between attributes and attribute values 0 Same attribute can be mapped to different attribute values 0 Example height can be measured in feet or meters 0 Different attributes can be mapped to the same set of values 0 Example Attribute values for ID and age are integers 0 But properties of attribute values can be different ID has no limit but age has a maximum and minimum value 4262009 Luke Huan Univ of Kansas 36 Types of Attributes c There are different types of attributes 0 Nominal 0 Examples ID numbers eye color zip codes a Ordinal 0 Examples rankings eg taste of potato chips on a scale from 110 grades height in tall medium short a Interval 0 Examples calendar dates temperatures in Celsius or Fahrenheit 0 Ratio 0 Examples temperature in Kelvin length time counts 4262009 Luke Huan Univ of Kansas 37 Structured vs Unstructured Data o Structured Data 6 Data in a relational database 0 Semistructured data 0 Graphs trees sequences 0 Unstructured data 0 Image text 4262009 Luke Huan Univ of Kansas 38 What is in Your Data o What kinds of data quality problems 0 How can we detect problems with the data 0 What can we do about these problems 0 Examples of data quality problems 0 Noise and outliers 9 missing and duplicated data 4262009 Luke Huan Univ of Kansas 39 Noise 0 Noise refers to moc i cation of original values 0 Examples distortion of a person s voice when talking on a poor phone anc snow on television screen 1 15 10 05 5 0 0 5 05 10 1 i i i 15 i i i i 0 02 04 06 08 1 0 02 04 06 08 1 Time seconds Time seconds Two Sine Waves Two Sine Waves Noise 4262009 Luke Huan Univ of Kansas 40 Mapping Data to a New Space 0 Fourier transform 0 Wavelet transform 1 i i 15 0 10 0 05 5 0 0 0 0 5 0 05 10 0 i 0 02 04 06 08 390 02 04 06 08 1 0 10 20 30 40 50 60 70 80 90 Time seconds Time seconds TWO Sine Waves Two Sine Waves Noise Frequency 4262009 Luke Huan Univ of Kansas 41 Outliers o Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set 0 One person s outlier can be another one s treasure 4262009 Luke Huan Univ of Kansas 42 Missing Values 0 Reasons for missing values 0 Information is not collected eg people decline to give their age and weight a Attributes may not be applicable to all cases eg annual income is not applicable to children 0 Handling missing values 0 Eliminate Data Objects 0 Estimate Missing Values 0 Ignore the Missing Value During Analysis 0 Replace with all possible values weighted by their probabilities 4262009 Luke Huan Univ of Kansas 43 Duplicate Data 0 Data set may include data objects that are duplicates or almost duplicates of one another a Major issue when merging data from heterogeous sources 0 Examples Q Same person with multiple email addresses 0 Data cleaning Q Process of dealing with duplicate data issues 4262009 Luke Huan Univ of Kansas 44 Challenges of Data Mining o Scalability 0 Dimensionality 0 Complex and Heterogeneous Data 0 Data Quality 0 Data Ownership and Distribution 0 Privacy Preservation o Streaming Data 4262009 Luke Huan Univ of Kansas 45
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'