Data Mining & Statistical Learning (ISYE 7406)
These class notes for ISYE 7406 at Georgia Institute of Technology - Main Campus, taught by Kwok-Leung Tsui in Fall, were uploaded by Maryse Thiel on Monday, November 2, 2015.
Review & Analysis of the Mahalanobis-Taguchi System
William H. Woodall and Rachelle Koudelik (Virginia Tech), Kwok-Leung Tsui and Seoung Bum Kim (Georgia Tech), Zachary G. Stoumbos (Rutgers University), Christos P. Carvounis, MD (State University at Stony Brook / Nassau University Medical Center). Presented 3/4/2009.

Primary MTS References
- Taguchi, G., and Rajesh, J. (2000), "New Trends in Multivariate Diagnosis," Sankhya: The Indian Journal of Statistics, 62, 233-248.
- Taguchi, G., Chowdhury, S., and Wu, Y. (2001), The Mahalanobis-Taguchi System, New York: McGraw-Hill.
- Taguchi, G., and Rajesh, J. (2002), a new book on MTS.

P.C. Mahalanobis
- Very influential in large-scale sample survey methods.
- Founder of the Indian Statistical Institute in 1931.
- Architect of India's industrial strategy; advisor to Nehru and friend of R.A. Fisher.

Genichi Taguchi
- Japanese quality engineer.
- Deming Prize in Japan 4 times; Rockwell Medal 1986. Citation: "Combined engineering & statistical methods to achieve rapid improvements in costs and quality by optimizing product design and manufacturing processes."
- 1978-79: Ford / Bell Labs teams "discover" the method.
- 1980: first US experiences (Xerox / Bell Labs).
- 1990: Taguchi methods (or DOE) well recognized by all industries for improving product or manufacturing-process design.

MTS is said to be:
- A groundbreaking new philosophy for data mining from multivariate data.
- A process of recognizing patterns and forecasting results.
- Used by Fuji, Nissan, Sharp, Xerox, Delphi Automotive Systems, Ford, GE, and others.
- Beyond theory: intended to create an atmosphere of excitement for management, engineering, and academia.

Applications include: patient monitoring, medical diagnosis, weather and earthquake forecasting, fire detection, manufacturing inspection, clinical trials, and credit scoring.

MTS Overview
- Similar to a classification method using a discriminant-type function.
- Based on multivariate observations from a normal and an abnormal group.
- Used to develop a scale measuring how abnormal an item is, while matching a
pre-specified or estimated scale. The MTS scale is used for variable selection, diagnosis, forecasting, and classification.

MTS Procedure

Stage 1
- Identify p variables V_i, i = 1, 2, ..., p, that measure the normality of an item.
- Collect multivariate data on the normal group, j = 1, 2, ..., m.
- Standardize each variable to obtain the vectors Z_j.
- Calculate the Mahalanobis distances (MDs) for the m observations:

    MD_j = (1/p) Z_j' S^{-1} Z_j,   j = 1, ..., m,

  where S is the sample correlation matrix of the Z's for the normal group.

Stage 2
- Collect data on t abnormal items X_i, i = m+1, m+2, ..., m+t.
- Standardize each variable using the normal-group means and standard deviations.
- Calculate the MD values MD_i, i = m+1, m+2, ..., m+t.
- According to the MTS, the scale is good if the MD values for the abnormal items are higher than those for the normal items (good separation).

Stage 3
- Identify the useful variables using orthogonal arrays (OAs) and signal-to-noise (S/N) ratios.
- The MTS uses a design-of-experiments approach as an optimization tool to choose the variables that maximize the average S/N ratio.

Use of DOE for Variable Selection
- Design an OA experiment using all variables.
- For each row of the OA (a given subset of variables), compute MD_i for each observation in the abnormal groups.
- Determine an M_i value (the true severity level, or a working average) for each abnormal group.
- Compute the S/N ratio based on the MDs and the M's.
- Determine significant variables using main-effect analysis with the S/N ratio as the response.

An Example of an OA (1 = variable included, 0 = variable excluded)

    Run   V1  V2  V3  ...  V17  | S/N ratio
      1    .   .   .   .    .   |   SN1
      2    .   .   .   .    .   |   SN2
    ...                         |   ...
     32    .   .   .   .    .   |   SN32

Dynamic S/N ratio (multiple abnormal groups): first regress Y_i = sqrt(MD_i) on M_i to obtain the slope estimate beta-hat; then define the S/N ratio as

    eta = 10 log10( beta-hat^2 / MSE ).

Larger-is-better S/N ratio (single abnormal group): for the t abnormal observations,

    eta = -10 log10( (1/t) * sum_i 1/MD_i ).

Main-Effect Analysis
- Compute the level averages of the S/N ratios for each variable.
- Keep only the variables with positive, significant estimated main effects.
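Stages 1-2 and the larger-is-better S/N ratio can be sketched in code. The following is a minimal Python illustration (mine, not the authors'; NumPy and synthetic data assumed): it standardizes items with the normal group's means and standard deviations, computes MD = z'S^{-1}z/p from the normal group's correlation matrix, and evaluates the S/N ratio on the abnormal MDs.

```python
import numpy as np

def mts_distances(normal, items):
    """Scaled Mahalanobis distances MD = z' S^{-1} z / p, where z is
    standardized with the normal group's means/SDs and S is the normal
    group's sample correlation matrix (MTS Stages 1-2)."""
    normal = np.asarray(normal, dtype=float)
    items = np.asarray(items, dtype=float)
    mean = normal.mean(axis=0)
    sd = normal.std(axis=0, ddof=1)
    s = np.corrcoef(normal, rowvar=False)     # sample correlation matrix
    s_inv = np.linalg.inv(s)
    z = (items - mean) / sd                   # standardize each variable
    p = normal.shape[1]
    return np.einsum('ij,jk,ik->i', z, s_inv, z) / p

rng = np.random.default_rng(0)
normal = rng.normal(size=(200, 3))            # synthetic "healthy" group
abnormal = rng.normal(loc=4.0, size=(10, 3))  # shifted "abnormal" group
md_norm = mts_distances(normal, normal)
md_abn = mts_distances(normal, abnormal)
# Larger-is-better S/N ratio over the abnormal MDs:
eta = -10.0 * np.log10(np.mean(1.0 / md_abn))
# Good separation: abnormal MDs should exceed the normal ones on average.
print(md_norm.mean() < md_abn.mean())
```

Because the normal group is used to build the scale, its own MDs average to about 1, while clearly abnormal items score far higher.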
Stage 4
- Based on the chosen variables, use the MD scale for diagnosis and forecasting.
- A threshold is chosen such that the losses due to the two types of classification errors are balanced in some sense.

A Medical Case Study
- Medical diagnosis of liver disease.
- 200 healthy patients and 17 unhealthy patients (10 with a mild level of disease and 7 with a medium level).
- Age, gender, and 15 blood-test variables.
- The data are made available.

Case Study Blood-Test Variables with Normal Ranges

    Variable (acronym)                                          | Symbol | Normal range                          | Taguchi et al. (2001) normal
    Total protein in blood (TP)                                 | V3     | 6.0-8.3 gm/dL                         | 6.5-7.5 gm/dL
    Albumin in blood (Alb)                                      | V4     | 3.4-5.4 gm/dL                         | 3.5-4.5 gm/dL
    Cholinesterase / pseudocholinesterase (ChE)                 | V5     | depends on technique; 8-18 U/mL       | 0.60-1.00
    Glutamate-O transaminase / aspartate aminotransferase (GOT) | V6     | 10-34 IU/L                            | 2-25 units
    Glutamate-P transaminase / alanine transaminase (GPT)       | V7     | 6-59 U/L                              | 0-22 units
    Lactic dehydrogenase (LDH)                                  | V8     | 105-333 IU/L                          | 130-250 units
    Alkaline phosphatase (Alp)                                  | V9     | 0-250 U/L normal; 250-750 U/L moderate elevation |
    gamma-Glutamyl transpeptidase / transferase (gamma-GPT)     | V10    | 0-51 IU/L (serum)                     | 0-68 units
    Leucine aminopeptidase (LAP)                                | V11    | male 80-200 U/mL; female 75-185 U/mL  |
    Total cholesterol (TCh)                                     | V12    | <200 desirable; 200-239 borderline high; 240+ high |
    Triglyceride (TG)                                           | V13    | 10-190 mg/dL                          |
    Phospholipid (PL)                                           | V14    | (platelet: 150,000-400,000/mm^3)      |
    Creatinine (Cr)                                             | V15    | 0.8-1.4 mg/dL                         |
    Blood urea nitrogen (BUN)                                   | V16    | 7-20 mg/dL                            |
    Uric acid (UA)                                              | V17    | 4.1-8.8 mg/dL                         |

Some Results and Conclusions
- Largest MD in the healthy group: 2.36.
- Lowest MD in the unhealthy group: 7.73.
- Thus there is a lot of separation between the healthy and the unhealthy group.
- The M values are estimated from averages of MD values.

An OA32 experiment (1 = variable included, 0 = excluded) over V1-V17 gives an S/N ratio SN1, ..., SN32 for each of the 32 runs, as in the earlier OA table.

Average S/N Ratio by Variable Combination
- All variables: 6.25
- MTS combination: 4.27
- OA optimal combination: 3.34
- Overall optimal combination: 1.76

Thus, the proposed method does not
yield the optimum combination; the MTS average S/N ratio was at about the 95th percentile.

MDs for the Unhealthy Group for Various Combinations of Variables

    Subject | Disease level | All     | MTS     | OA optimal | Optimal
    1       | Mild          | 7.727   | 13.937  | 8.058      | 13.329
    2       | Mild          | 8.416   | 14.726  | 7.485      | 8.616
    3       | Mild          | 10.291  | 17.342  | 9.498      | 8.002
    4       | Mild          | 7.204   | 10.804  | 4.951      | 12.311
    5       | Mild          | 10.590  | 18.379  | 9.367      | 12.042
    6       | Mild          | 10.557  | 8.605   | 6.643      | 6.139
    7       | Mild          | 13.317  | 13.896  | 7.794      | 6.139
    8       | Mild          | 14.812  | 27.910  | 8.162      | 22.666
    9       | Mild          | 15.693  | 28.110  | 10.278     | 26.000
    10      | Mild          | 18.911  | 35.740  | 20.992     | 14.422
    11      | Medium        | 12.610  | 20.828  | 16.517     | 20.833
    12      | Medium        | 12.256  | 18.578  | 14.607     | 19.312
    13      | Medium        | 19.655  | 34.127  | 35.229     | 44.614
    14      | Medium        | 43.039  | 85.564  | 13.105     | 32.720
    15      | Medium        | 78.639  | 74.175  | 9.560      | 28.560
    16      | Medium        | 97.268  | 104.424 | 29.201     | 31.810
    17      | Medium        | 135.698 | 123.022 | 44.742     | 57.226

[Dotplots of the MDs for the unhealthy group, Mild and Medium panels, under the All, MTS, OA Optimal, and Optimal variable combinations.]

The blood-test-variables table with normal ranges (Taguchi et al. 2001) is repeated here for reference.

Variables for
Unhealthy Patients: Variables Well Outside the Normal Ranges

    Subject | Variable number(s)
    1       | 12, 13
    2       | none
    3       | none
    4       | 13
    5       | 10
    6       | 7
    7       | 7
    8       | 13
    9       | 12, 13
    10      | 4, 12
    11      | 10, 12
    12      | 10
    13      | 10
    14      | 10, 13
    15      | 6, 7, 13
    16      | 3, 6, 7, 10, 12
    17      | 6, 7, 8, 10, 13

Medical Analysis
- V4, V6, V7, V9, and V10 are crucial for liver-disease diagnosis and classification.
- Medical diagnosis shows that patients 15-17 exhibit some chronic liver disorder.
- Cluster analysis on V4, V6, V7, V9, and V10 yields only two groups; only patients 15-17 are classified as abnormal.
- This result is consistent with the medical diagnosis.

[Dotplots of V4 (Alb), V6 (GOT), V7 (GPT), V9 (Alp), and V10 (gamma-GPT) by group (Normal, Mild, Medium); subjects 15-17 stand apart in the Medium group.]

Tree Classification Methods

Classification Trees
- The CART (Classification And Regression Trees) methodology, also known as binary recursive partitioning. For more detailed information on CART, see Breiman, Friedman, Olshen & Stone (1984), Classification and Regression Trees.
- C4.5 is a decision-tree learning system introduced by Quinlan (Quinlan, J. Ross, 1993, C4.5: Programs for Machine Learning). The software is available at http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html

Tree from S-plus

[Tree diagram: root split V5 < 381.5; the "yes" branch splits on V10 < 63 and the "no" branch on V6 < 37.5, giving four terminal nodes.]

- Variables actually used in tree construction: V5, V10, and V6.
- Number of terminal nodes: 4.
- Misclassification error rate: 0.01382 (3/217).
- Classification matrix (learning sample; rows = actual class, columns = predicted class 1, 2, 3):

    Actual 1:  200   0   0
    Actual 2:    0   8   2
    Actual 3:    1   0   6

Tree from C4.5

[Tree diagram: root split V5 < 364; the "yes" branch splits on V10 < 63 and then on V6 < 26, giving four terminal nodes.]

Tree from C4.5 (summary)
- Variables actually used in tree construction: V5, V10, and V6.
- Number of terminal nodes: 4.
- Misclassification error rate: 0.0046 (1/217).
- Classification matrix (learning sample; rows = actual class, columns = predicted class 1, 2, 3):

    Actual 1:  200   0   0
    Actual 2:    0  10   0
    Actual 3:    1   0   6

[Scatter plots of V5 (ChE) vs. V10 (gamma-GPT) vs. V6 (GOT), and the pairwise plots V5 vs. V6, V5 vs. V10, and V10 vs. V6, with the Normal, Mild, and Medium groups marked; subjects 15-17 separate clearly.]

[Dotplots of V5 (ChE), V6 (GOT), and V10 (gamma-GPT) by group (Normal, Mild, Medium).]

MDs for the Unhealthy Group for Various Combinations of Variables (with Trees)

    Subject | Disease level | All     | MTS     | OA optimal | Optimal | Trees
    1       | Mild          | 7.727   | 13.937  | 8.058      | 13.329  | 7.366
    2       | Mild          | 8.416   | 14.726  | 7.485      | 8.616   | 18.789
    3       | Mild          | 10.291  | 17.342  | 9.498      | 8.002   | 9.068
    4       | Mild          | 7.204   | 10.804  | 4.951      | 12.311  | 6.517
    5       | Mild          | 10.590  | 18.379  | 9.367      | 12.042  | 29.864
    6       | Mild          | 10.557  | 8.605   | 6.643      | 6.139   | 10.869
    7       | Mild          | 13.317  | 13.896  | 7.794      | 6.139   | 10.869
    8       | Mild          | 14.812  | 27.910  | 8.162      | 22.666  | 8.222
    9       | Mild          | 15.693  | 28.110  | 10.278     | 26.000  | 9.155
    10      | Mild          | 18.911  | 35.740  | 20.992     | 14.422  | 16.420
    11      | Medium        | 12.610  | 20.828  | 16.517     | 20.833  | 42.681
    12      | Medium        | 12.256  | 18.578  | 14.607     | 19.312  | 38.523
    13      | Medium        | 19.655  | 34.127  | 35.229     | 44.614  | 86.796
    14      | Medium        | 43.039  | 85.564  | 13.105     | 32.720  | 28.252
    15      | Medium        | 78.639  | 74.175  | 9.560      | 28.560  | 208.102
    16      | Medium        | 97.268  | 104.424 | 29.201     | 31.810  | 228.428
    17      | Medium        | 135.698 | 123.022 | 44.742     | 57.226  | 199.304

[Dotplots of the MDs for the unhealthy group, Mild and Medium panels, under the All, MTS, OA Optimal, Optimal, and Trees combinations.]
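The binary recursive partitioning behind these trees can be sketched as follows. This is a toy Python illustration of a single Gini-impurity split search, not the S-plus or C4.5 implementation used in the study; the data values and names are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(x, y):
    """Return (threshold, weighted_gini) for the best binary split x < t,
    scanning midpoints between sorted distinct values of x."""
    xs = sorted(set(x))
    best = (None, float('inf'))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [label for v, label in zip(x, y) if v < t]
        right = [label for v, label in zip(x, y) if v >= t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if w < best[1]:
            best = (t, w)
    return best

# Toy data mimicking one split: low "V5" values go with disease.
v5 = [100, 150, 200, 250, 500, 550, 600, 650]
cls = ['sick', 'sick', 'sick', 'sick', 'well', 'well', 'well', 'well']
t, impurity = best_split(v5, cls)
print(t, impurity)  # prints 375.0 0.0 (a pure split between 250 and 500)
```

A full CART-style tree would apply this search over every variable and recurse on each child node until a stopping rule is met.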
Conclusion
- The MD values and dotplots show that only the MD scale based on the variables used by the classification trees (i.e., V5, V6, and V10) does a good job of discriminating between patients with mild-level disease and patients with medium-level disease.
- Maybe MD is a good measure for multivariate data.

Comparison with Medical Analysis
- V4, V6, V7, V9, and V10 are crucial for liver-disease diagnosis and classification.
- Medical diagnosis shows that patients 15-17 exhibit some chronic liver disorder.
- Cluster analysis on V4, V6, V7, V9, and V10 yields only two groups; only patients 15-17 are classified as abnormal, consistent with the medical diagnosis.

Correlations: Medically Crucial Variables vs. Variables in the Classification Trees

             V5       V6      V10
    V4     0.501    0.505   -0.184
    V6     0.370    1        0.507
    V7    -0.365    0.905    0.485
    V9     0.305    0.197    0.269
    V10   -0.189    0.507    1

[Dotplots of V4 (Alb), V7 (GPT), and V9 (Alp) by group, repeated from earlier slides.]

Case Study Summary
- The OA and main-effect analysis do not give the overall-optimum MTS discriminant function.
- The S/N ratios do not separate the two unhealthy groups.
- The variables selected by the MTS are not appropriate for detecting liver disease, based on the medical diagnosis.
- Tree methods separate the two unhealthy groups.
- MD may be a good distance measure for multivariate data.
- Results are based on the current data and training error.

Discussion
- The MTS ignores considerable previous work in application areas such as medical diagnosis and in classification methods.
- The MTS ignores sampling variation and discounts variation between units.
- The use of OAs cannot be justified.
- The MTS is not a well-defined approach.
- Traditional statistical approaches may work better in many cases.
- Despite its flaws, we expect the MTS to be used in many companies.
Introduction to Data Mining
Kwok-Leung Tsui, Industrial & Systems Engineering, Georgia Institute of Technology (1/12/2007)

What is Data Mining?
- Data mining is the extraction of meaningful/useful/interesting patterns from a large volume of data sources: signal, image, time series, transaction, text, web, etc.
- Data mining is one of the top ten emerging technologies (MIT's Technology Review, 2004).

DM Fields & Backgrounds: data mining is an emerging multidisciplinary field drawing on
- Statistics (especially multivariate statistics)
- Machine learning
- Application background (e.g., biology)
- Pattern recognition
- Databases
- Visualization, OLAP, and data warehousing, etc.

Commonly Used Language in Data Mining
- Data Mining (DM)
- Knowledge Discovery in Databases (KDD)
- Massive Data Sets (MD)
- Very Large Data Bases (VLDB)
- Data Analysis (DA)

DM vs. MD vs. DA
- DM is not the same as MD, and DM is not the same as DA; roughly, DA + MD = DM ("statistical DM").
- DM implies computationally feasible algorithms with little or no human intervention.
- Money issue: DA software runs about $5-10K; DM software about $100K.

Statistical Data Mining: "Data mining is exploratory data analysis with little or no human interaction using computationally feasible techniques, i.e., the attempt to find unknown interesting structure." (Source: Ed Wegman.)

Data Mining: "Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules." (Source: Mastering Data Mining, Berry and Linoff, 2000.)

KDD: "Knowledge discovery in databases (KDD) is a multidisciplinary research field for the nontrivial extraction of implicit, previously unknown, and potentially useful knowledge from data." (Source: Data Mining, Adriaans and Zantinge, 1996.)

KDD & Data Mining
- KDD process vs. DM process: the process of using data-mining methods (algorithms) to extract knowledge according to the specifications of measures and thresholds, using a database, along with any necessary preprocessing or transformations.
- Data
Mining & Modeling: a step in the knowledge-discovery process consisting of particular algorithms/methods that, under some acceptable objective, produce particular patterns or knowledge from the data.
- Text mining, web mining, etc.
- Some people treat DM and KDD equivalently.

Data Mining = Statistics + CS (Friedman)
- Data miners: extract useful information from large amounts of raw data.
- Statisticians: support data mining with mathematical theory and statistical methods.
- Computer scientists: support data mining with computational algorithms and relevant software.

Applications
- Bioinformatics
- Sales and marketing
- Health care / medical diagnosis
- Supply-chain management
- Process control
- Network intrusion detection
- Astronomy
- Sports and entertainment

Examples of DM Applications
- Finance: forecast stock price or movement using neural networks or time series.
- Telecom: predict churn rate and customer usage using trees, logistic regression, and activity monitoring.
- Retail: identify cross-selling using association rules (e.g., market-basket analysis, RFM analysis).
- Pharmaceutical: segment customers into different behavior groups using clustering and classification.
- Banking: customer relationship management (CRM) using clustering and association.
algorithm and statistical methods Knowledge in business or subject matter ask important business questions understand and verify discovered knowledge 1122007 15 1122007 Data Mining Process 16 1122007 Data Mining KDD Process Determine Business Objectives 1 Data Preparation l Mining amp Modeling 1 Consolidation and Application 17 Formulate Business Objectives DDDDDDD w Examples of a telecom company Identify important customer traits to keep profitable customers and to predict fraudulent behavior credit risks and customer churn Improve programs in target marketing marketing channel management micromarketing and cross selling Meet effectively the challenges of new product development 1122007 18 Data Preparation Business Source Objectives Systems Legacy systems External systems Model Discovery File Identify Data Extract data Cleanse and Needed and gt From source Aggregate Sources systems data 1122007 Model Evaluation File 19 Mining amp Modeling Model Discovery File Explore Data Construct Model Mom Evaluation Ideas File 1122007 Evaluate Model gt Transform Model into Usable Format Reports Consolidation and Applicatio mmmmmmm m Ideas Communicate Make Business Reports mansport Knowledge EXtraCt Decisions and Knowledge Database KnOWIedge Improve model 1122007 21 Effort in Data Mining 1122007 Effort CD O 01 O i A O i 00 O i 20 Business Data Preparation Data 1VIining Objective Determination Consolidate Discovered Knowledge 22 1122007 Data Preparation 23 1122007 Databases amp Data Warehouses Relational Database Object Oriented Database Transactional Database Time Series Spatial Database Data Warehouse Data Cube SQL Structured Query Language OLAP OnLine Analytical Processing MOLAP Multidimensional OLAP Fundamental data object for MOLAP is the Data Cube ROLAP Relational OLAP using extended SQL 24 Data Preparation Sources of Noises Faulty data collection instruments eg sensors Transmission errors eg intermittent errors from satellite or internet transmissions Data entry error 
- Technology limitation errors.
- Naming conventions misused (e.g., same names but different meanings).
- Incorrect classification.

Data Preparation: Redundant Data
- Variables have different names in different databases.
- A raw variable in one database is a derived variable in another.
- Changes in a variable over time are not reflected in the database.
- Irrelevant variables destroy speed; dimension reduction is needed.

Data Preparation: the Problem of Missing Data
- Missing values occur in massive data sets, though massive data sets acquired by instrumentation may have few missing values.
- Missing data may be irrelevant to the desired result.
- Impute missing data manually or statistically (missing-value plot).
- Traditional methods are limited to small data sets.

Data Preparation: the Problem of Outliers
- Outliers are easy to detect in low dimensions, but a high-dimensional outlier may not show up in low-dimensional projections.
- Clustering and other statistical modeling can be used; the Fisher information matrix and convex-hull peeling are more feasible but still too complex for massive data sets.
- Outlier detection is generally difficult for massive data; traditional methods are limited to small data sets.

Data Cleaning
- Duplicate removal (tool-based).
- Missing-value imputation (manual and statistical).
- Identify and remove data inconsistencies.
- Identify and refresh stale data.
- Create a unique record (case) ID.

Database Sampling
- KDD systems must be able to assist in selecting the appropriate parts of a database to be examined.
- For sampling to work, the data must satisfy certain conditions (e.g., no systematic biases).
- Sampling can be a very expensive operation, especially when the sample is taken from data stored in some databases.

Database Reduction
- Data-cube aggregation.
- Dimension reduction: eliminate irrelevant and redundant attributes.
- Data compression: encoding mechanisms, quantization, wavelet transforms, principal components, etc.

Data Preparation Using R

Introduction to R
- A software package suitable for data analysis
and graphic representation.
- Free and open source.
- Implements many modern statistical methods.
- Flexible and customizable.

Software Download of R
- Go to http://www.cran.r-project.org
- Click on "Windows (95 and later)", then on "base".
- Click on R-2.3.0-win32.exe and press Save to download the file to your computer.
- Install R by double-clicking the downloaded file and following the instructions.

Using R
- To invoke R: go to Start > Programs > R.
- To quit R: type q() at the R prompt (>) and press the Enter key, or simply close the window.
- A good introduction to R is available at http://www.cran.r-project.org (click on "Manuals" under "Documentation" at the left).
- A few important commands: help.start() for a web-based interface to the help system; help(topic) or ?topic for help on a topic; help.search("pattern") for help pages containing a pattern.

Data Preparation: the Data Matrix
Traditionally, a data set is arranged in matrix form, with each row as a record (or observation) and each column as an attribute (or variable). This matrix is called a data matrix and is represented as (x_ij), i = 1, ..., n, j = 1, ..., p, where n is the number of records and p is the number of attributes. In practice n is usually much larger than p. The variables can be of different types, which can be classified as follows.

Data Preparation: Variable Types
- Nominal: just labels to distinguish one class from another (e.g., zip code, gender); meaningful operations: =, !=.
- Ordinal: values that can be ordered (e.g., opinion scales, grades); operations: <, >.
- Interval: differences between values are meaningful, but the origin is arbitrary (e.g., calendar dates, temperature in Celsius or Fahrenheit); operations: +, -.
- Ratio: both differences and ratios are meaningful, with a true zero (e.g., length, temperature in Kelvin); operations: *, /.
Nominal and ordinal variables are discrete (categorical), while interval and ratio variables are continuous. An interval or ratio variable can be treated as an ordinal variable, and an ordinal variable can be treated as a nominal one, but not conversely.

Data Preparation: Missing Values
In
real data sets we often encounter missing values as well. The easiest way to handle missing values is to throw away all the incomplete observations; however, this may lead to a biased sample if the values are non-randomly missing. Let us illustrate this with a simple simulation in R. First we simulate 5,000 random numbers from N(0, 1) and arrange them into a 500 x 10 data matrix:

    set.seed(12345)                    # set the random seed
    d <- matrix(rnorm(5000), 500, 10)  # generate d

Now we replace all the values in d that are less than -2 with NA (missing value) and save the result as x1; similarly, we replace values greater than 2 with NA and save the result as x2; finally, we replace values either less than -2 or greater than 2 with NA and save the result as x3:

    x1 <- replace(d, d < -2, NA)      # left truncation
    x2 <- replace(d, d > 2, NA)       # right truncation
    x3 <- replace(d, abs(d) > 2, NA)  # truncation at both ends

Now we select all the complete cases in x1, x2, and x3 and save them as c1, c2, and c3, respectively:

    c1 <- x1[complete.cases(x1), ]    # complete cases
    c2 <- x2[complete.cases(x2), ]
    c3 <- x3[complete.cases(x3), ]

Finally, we compute the column means and standard deviations of c1, c2, and c3, e.g., apply(c1, 2, mean) and apply(c1, 2, sd).

[Output: column means and standard deviations for c1, c2, and c3.]
From the output we can see that the left truncation (c1) overestimates the column means and the right truncation (c2) underestimates them, while c3, truncated at both ends, leaves the means near zero but underestimates the standard deviations.

Data Preparation: Data Dissimilarities (Distances)
Measures of dissimilarity, or distances, between objects are important, since many DM techniques, such as k-means clustering, nearest-neighbor classification, and outlier detection, are based on such a measure. Since a data set may contain many different types of variables, the distance or dissimilarity between objects should be defined according to the variable type. Suppose that we have p variables, and objects i and j are denoted by x_i = (x_i1, ..., x_ip) and x_j = (x_j1, ..., x_jp).

Continuous variables. If all the variables are continuous, we can use any of the following distance measures:

    Euclidean:   d(x_i, x_j) = sqrt( (x_i1 - x_j1)^2 + ... + (x_ip - x_jp)^2 )
    City-block:  d(x_i, x_j) = |x_i1 - x_j1| + ... + |x_ip - x_jp|
    Minkowski:   d(x_i, x_j) = ( |x_i1 - x_j1|^q + ... + |x_ip - x_jp|^q )^(1/q)

Ordinal variables. If all the variables are ordinal, we can rescale each observation to [0, 1] by z_ik = (r_ik - 1) / (M_k - 1), where variable k has M_k levels with ranks r_ik in {1, ..., M_k}, and then treat the z's like continuous variables.

Binary variables. If all the variables are binary, we can use either of the following measures, based on the 2x2 table of counts for objects i and j (q = number of variables equal to 1 for both, t = number equal to 0 for both, r and s = numbers of mismatches, with q + r + s + t = p):

    Simple matching coefficient:  d(i, j) = (r + s) / (q + r + s + t)
    Jaccard coefficient:          d(i, j) = (r + s) / (q + r + s)

Note that the Jaccard coefficient ignores the count of variables that are zero for both i and j; it is used in situations where t is much bigger than q, e.g., transaction records of items in a supermarket.
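The continuous-variable distances above can be sketched in code. This is an illustrative Python snippet (mine, not from the slides): the Minkowski distance with q = 2 reduces to Euclidean, and with q = 1 to city-block; the helper also shows the ordinal rescaling.

```python
def minkowski(x, y, q=2):
    """Minkowski distance: (sum_k |x_k - y_k|^q)^(1/q)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def rescale_ordinal(rank, m_levels):
    """Map rank r in {1, ..., M} to [0, 1] via z = (r - 1) / (M - 1)."""
    return (rank - 1) / (m_levels - 1)

x, y = (0.0, 3.0), (4.0, 0.0)
print(minkowski(x, y, q=2))   # Euclidean: 5.0
print(minkowski(x, y, q=1))   # city-block: 7.0
print(rescale_ordinal(3, 5))  # middle of 5 levels: 0.5
```

Varying q between 1 and 2 trades off how strongly large per-coordinate differences dominate the total distance.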
Nominal variables. If all the variables are nominal, we can use d(i, j) = (p - m) / p, where m is the number of matches among the p variables.

Mixed types, missing values, and scaling of data. In practice, a data set usually contains missing values and variables of mixed type. The easiest way to handle missing data is to delete all the records with missing values; R has a built-in function, complete.cases(), to select all the complete cases. However, using only the complete cases throws away a lot of information and, more seriously, may lead to a biased subsample. A better but more complicated way to handle missing values is to use the available variables to compute the distance between two records.

Data Preparation: Scaling of Mixed-Type Data
Another issue is that the ranges of the variables may differ a lot; using the original scales may put more weight on the variables with large ranges. We therefore rescale the variables into the [0, 1] interval before performing, e.g., clustering analysis. Define

    d(i, j) = ( sum_k delta_ij^(k) d_ij^(k) ) / ( sum_k delta_ij^(k) ),

where

    delta_ij^(k) = 0 if x_ik or x_jk is missing, and 1 otherwise;
    if variable k is binary or nominal: d_ij^(k) = 0 if x_ik = x_jk, and 1 otherwise;
    if variable k is continuous: d_ij^(k) = |x_ik - x_jk| / (max_h x_hk - min_h x_hk);
    if variable k is ordinal: first compute z_ik = (r_ik - 1) / (M_k - 1), then treat it as continuous.

Note that all the d_ij^(k) above take values in [0, 1], so all the variables are treated equally when computing the dissimilarities or distances.

Data Preparation: Outlier Detection (Grubbs' test / z-score rule; non-resistant)
Data sets often contain outliers. Since many statistical techniques are sensitive to outliers, we usually detect and delete them before applying DM techniques. Usually we can standardize a variable by z = (x - xbar) / s, where xbar and
s are the sample mean and standard deviation of x, respectively, and flag a value with |z| >= 3 as an outlier. However, this can only detect part of the outliers, since it considers only the marginal distribution of each variable.

Data Preparation: Outlier Detection, Other Rules
- CV rule: CV = coefficient of variation = SD/mean; flag a variable if the CV exceeds a certain threshold.
- Resistant rule: resistant score = (x - median) / MAD, where MAD is the median absolute deviation; flag a value as an outlier if the score exceeds a certain threshold, say 5.

Data Preparation: Mahalanobis Distance (multidimensional data)
Suppose X_1, ..., X_n are i.i.d. N_p(mu, Sigma). According to distribution theory, the Mahalanobis distance

    D_i^2 = (X_i - Xbar)' S^{-1} (X_i - Xbar)

is distributed approximately as a chi-square distribution with p degrees of freedom, where Xbar is the p x 1 sample mean vector and S is the p x p sample covariance matrix of the X's. This means that if observation X_i has D_i^2 greater than a pre-assigned level (say, the 99th percentile of the chi-square distribution with p degrees of freedom), then this observation is flagged as an outlier.
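The Mahalanobis chi-square rule can be sketched as follows (an illustration only, with NumPy and synthetic data assumed; for p = 2 variables the chi-square quantile has the closed form -2 ln(alpha), which avoids any extra dependency):

```python
import math
import numpy as np

def mahalanobis_outliers(x, alpha=0.01):
    """Flag rows whose squared Mahalanobis distance
    D^2 = (x - xbar)' S^{-1} (x - xbar) exceeds the (1 - alpha)
    chi-square quantile; the closed-form quantile used here,
    -2 ln(alpha), is exact only for p = 2 variables."""
    x = np.asarray(x, dtype=float)
    assert x.shape[1] == 2, "closed-form quantile below is for p = 2"
    mean = x.mean(axis=0)
    s_inv = np.linalg.inv(np.cov(x, rowvar=False))
    diff = x - mean
    d2 = np.einsum('ij,jk,ik->i', diff, s_inv, diff)  # per-row D^2
    cutoff = -2.0 * math.log(alpha)  # chi-square(2) 99th pctile ~ 9.21
    return d2 > cutoff

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2))
data[0] = [8.0, -8.0]                 # plant one gross outlier
flags = mahalanobis_outliers(data)
print(bool(flags[0]))                 # the planted point is flagged: True
```

With alpha = 0.01, about 1% of genuinely normal points are also expected to be flagged, which is the usual false-alarm cost of this rule.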
quotCLNUquot quotDEB39I ENCquot select and wrnplmc 390sz to dc display dimension of d 52 Data Preparation 3 Remove outliers 3a Separate the data into two groups gt ti1lt ucttlui3AD l hdlt i t 59ith gt LiillMLii it nut that there is tint Silt runs and 13 columns 1 300 13 gt uUlt39uclchBADU select BADz gt dimthl note that thsre is uniy 3 64 runs and 13 coiumns 13064 13 3b Compute Mahalanobis distance for each group mdixtC functiuntx tltusmutrixx pltdimitJ2 mltupplytt2meunl sltvurt muhuunubistmsl trunsfurm A to u matrix find the Cuiumn dimension 0f x compute uolumn mcun compute sample covariunrt matrix using built in muhuiunobis function tti39citititt gt sourccquotmdistr t lead the tiie contuins the muixt function gt mtiiltmdi ttitirLib compute distance and exthuts columns 156 gt md1ltmdistd1 rl5 t Lnd suvc results to muU and mdl 1122007 53 Data Preparation 30 Remove outliers in each of the two groups gt couchistgl 19tifii gt u I 13 2092 sciuct observations with mdUltc from d0 sticct ohwervutions with md1ltc from di x0 uuntuins 2759 ruse gt xilt dmtiiltc gt x1ltstt1md1ltc gt dimix i J 278 15 gt dimxl 1 273 13 us x1 contuins 37 unses it 3d Combine the two groups of data amp write to file gt xltrbinux0xl combine the mutrirs A0 and x1 f0WWiFE gt dimtx x have 5Uh7 rows und 15 columns 11 jun 13 gt writetahiethile izmculcxv sED I39G ii39tmm s39 r HM F To th L39lCFquot 1122007 54