Class Note for EECS 690 at KU
This 7-page set of class notes was uploaded on Friday, February 6, 2015. It belongs to a course at the University of Kansas taught in Fall, and has received 38 views since its upload.
EECS 690: Introduction to Bioinformatics
Microarray Data Analysis III
Anne Y. Zhang, Electrical Engineering and Computer Science
http://people.eecs.ku.edu/yazhang

Outline
- Classification
  - K-nearest neighbor classifier
  - Decision tree learning
  - Naive Bayes classifier

Review: Self-Organizing Map (SOM)
- The SOM works both as a projection (visualization) method and as a clustering method.
- It maps high-dimensional input data onto a low-dimensional (usually two-dimensional) output space while preserving the topological relationships between the input data.
- It is a constrained version of K-means clustering:
  - The prototypes lie on a one- or two-dimensional manifold.
  - K prototypes; rectangular grid or hexagonal grid.

Review: Principal Components Analysis (PCA)
- PCA: project the high-dimensional points into a low-dimensional space while preserving the essence of the data.
  - A better coordinate system for the data.
- PCA objectives:
  - To reduce dimensionality
  - To reduce redundancies in the data
  - To choose the most useful variables (features)
  - To visualize multidimensional data
  - To identify groups of objects (e.g., genes/samples)
  - To identify outliers

Microarray Data Analysis
- Unsupervised learning (clustering, class discovery): finding previously unrecognized structure.
  - Hierarchical clustering, K-means clustering, GMM clustering, SOM, PCA
- Supervised learning (classification, class prediction): finding distinctions between groups.
  - K-nearest neighbor, decision tree learning, naive Bayes classifier, support vector machine

Example: Tumor Classification
- Current methods for classifying human tumors rely mainly on a variety of morphological, clinical, and molecular variables.
- Microarrays allow the characterization of molecular variations among tumors by monitoring gene expression.
- Hope: microarrays will lead to more reliable tumor classification, and therefore to more appropriate treatments and better outcomes.
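The PCA projection reviewed above can be sketched in a few lines of NumPy. This is a minimal sketch, not the course's implementation; the expression matrix below is randomly generated for illustration.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components.

    Minimal PCA sketch: center the features, take the SVD of the centered
    matrix, and keep the directions of largest variance as the new
    coordinate system.
    """
    Xc = X - X.mean(axis=0)                    # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T            # low-dimensional coordinates

# Hypothetical expression matrix: 6 samples x 100 genes
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 100))
Z = pca_project(X, n_components=2)
print(Z.shape)  # (6, 2)
```

Plotting the two columns of `Z` against each other gives the usual 2-D PCA view used to spot groups of samples and outliers.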
Example: Tumor Classification
- Data: gene expression of tumor tissue from cancer patients, and gene expression of normal tissue from healthy persons.
- Task: learn a model from the above data that can be used to predict the status of a person based on his or her expression profile.

Classification
- Data: a set of data records (also called examples, instances, or cases), each described by
  - k attributes A1, A2, ..., Ak, and
  - a class label: each example is labelled with a predefined class.
- Goal: learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
- Supervised learning: the classes are predefined; use a training (learning) set of labeled objects to form a classifier for classifying future observations.

General model of classification
- Given input X and output Y: Y = f(X) + ε, where ε is independent of X and f is unknown.
- We want to estimate the function f based on a known data set (the training data).

Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes.
  - Each sample is assumed to belong to a predefined class, as determined by its class label.
  - The set of samples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
- Model usage: classifying future or unknown objects.
  - Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model.
  - The accuracy rate is the percentage of test-set samples that are correctly classified by the model.
  - The test set is independent of the training set.

Approaches
- K-nearest neighbor
- Decision tree learning
- Naive Bayes

K Nearest Neighbor
- Majority vote within the k nearest neighbors:
  ŷ = (1/k) Σ_{x_i ∈ N_k(x)} y_i
- kNN requires a parameter k and a distance metric.
- For k = 1, the training error is zero, but the test error could be large.
- As k increases, the training error tends to increase, but the test error tends to decrease first and then increase.
- For a reasonable k, both the training and test errors could be smaller than those of a linear decision boundary.
(Figure: kNN decision boundaries for k = 1 (brown) and k = 3 (green).)

K Nearest Neighbor Example: K vs. Misclassification Error
- How to choose k? Cross-validation.
- The test error could be larger or smaller than the training error.
- Small k: more flexible estimates and more variance → overfitting.
- Large k: more stable estimates, but not flexible.
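A minimal sketch of the majority-vote rule and of choosing k by leave-one-out cross-validation; the 2-D points and labels below are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (Euclidean metric)."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loocv_error(data, k):
    """Leave-one-out cross-validation error of kNN for a given k."""
    wrong = sum(knn_predict(data[:i] + data[i + 1:], x, k) != y
                for i, (x, y) in enumerate(data))
    return wrong / len(data)

# Toy data: tumor (1) vs. normal (0) in a two-gene expression space
data = [((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.1), 0), ((1.4, 1.2), 0),
        ((3.0, 3.0), 1), ((3.2, 2.9), 1), ((2.8, 3.1), 1), ((2.6, 3.3), 1)]

print(knn_predict(data, (1.1, 1.0), k=3))        # 0
errors = {k: loocv_error(data, k) for k in (1, 3, 5, 7)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

On this tiny set the smaller k values tie at zero error, while k = 7 forces every vote to cross clusters and misclassifies everything; real data would show the U-shaped test-error curve described above.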
Decision Tree Learning
- Decision tree learning is one of the most widely used techniques for classification.
- The classification model is a tree, called a decision tree.
- A flowchart-like tree structure:
  - An internal node denotes a test on an attribute.
  - A branch represents an outcome of the test.
  - Leaf nodes represent class labels or class distributions.
- C4.5, by Ross Quinlan, is perhaps the best-known system; it can be downloaded from the Web.

A decision tree for "buys_computer"
- The tree consists of decision nodes and leaf nodes (classes).
(Figure: decision tree with attribute tests such as credit rating = fair/excellent and yes/no leaves.)

From a decision tree to a set of rules
- A decision tree can be converted to a set of rules: each path from the root to a leaf is a rule.

Decision tree generation
- Decision tree generation consists of two phases:
  - Tree construction: at the start, all the training examples are at the root; then partition the examples recursively, based on selected attributes.
  - Tree pruning: identify and remove branches that reflect noise or outliers.
- Use of a decision tree: classify an unknown sample by testing its attribute values against the decision tree.

Algorithm for decision tree learning
- Basic algorithm: a greedy divide-and-conquer algorithm.
  - Assume the attributes are categorical (continuous attributes can be handled too).
  - The tree is constructed in a top-down, recursive manner.
  - At the start, all the training examples are at the root.
  - Examples are partitioned recursively based on selected attributes.
  - Attributes are selected on the basis of an impurity function, e.g., information gain.
- Conditions for stopping partitioning:
  - All examples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning (the majority class becomes the leaf).
  - There are no examples left.

Choose an attribute to partition the data
- The key to building a decision tree: which attribute to choose in order to branch?
- The objective is to reduce the impurity or uncertainty in the data as much as possible.
  - A subset of the data is pure if all instances belong to the same class.
- The heuristic in C4.5 is to choose the attribute with the maximum information gain or gain ratio, based on information theory.

Information theory: the entropy measure
- The entropy formula:
  entropy(D) = − Σ_{j=1}^{|C|} Pr(c_j) log2 Pr(c_j),  with Σ_{j=1}^{|C|} Pr(c_j) = 1,
  where Pr(c_j) is the probability of class c_j in data set D.
- We use entropy as a measure of the impurity or disorder of data set D (or as a measure of information in a tree).
- As the data become purer and purer, the entropy value becomes smaller and smaller. This is useful to us.

Information gain
- Given a set of examples D, we first compute its entropy.
- If we make attribute A, with v values, the root of the current tree, this will partition D into v subsets D_1, D_2, ..., D_v. The expected entropy if A is used as the current root is
  entropy_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × entropy(D_j)

Information gain (cont.)
- The information gained by selecting attribute A to branch (to partition the data) is
  gain(D, A) = entropy(D) − entropy_A(D)
- We choose the attribute with the highest gain to branch/split the current tree.

An example: building the final tree
- entropy(D) = −(6/10) log2(6/10) − (4/10) log2(4/10) = 0.971
- entropy_gene1(D) = (7/10) × entropy(D_1) + (3/10) × entropy(D_2) = (7/10) × 0.985 + (3/10) × 0 = 0.6896
- entropy_gene2(D) = (6/10) × entropy(D_1) + (4/10) × entropy(D_2) = (6/10) × 0.9183 + (4/10) × 0 = 0.551
- gain(D, gene1) = entropy(D) − entropy_gene1(D) = 0.2814
- gain(D, gene2) = entropy(D) − entropy_gene2(D) = 0.42
- gene2 has the higher gain, so we split on gene2 at the threshold (≤ 2 vs. > 2).
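The worked example can be reproduced in a few lines. The per-branch class counts below are one reconstruction consistent with the slide's numbers (6 vs. 4 overall; gene1 splits 7:3 and gene2 splits 6:4, each with a pure second branch):

```python
from math import log2

def entropy(counts):
    """entropy(D) = -sum_j Pr(c_j) * log2 Pr(c_j), from raw class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def expected_entropy(partitions):
    """entropy_A(D) = sum_j |D_j|/|D| * entropy(D_j)."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * entropy(p) for p in partitions)

D = (6, 4)                   # 6 positive, 4 negative examples
gene1 = [(3, 4), (3, 0)]     # reconstructed counts for the two branches
gene2 = [(2, 4), (4, 0)]

gain1 = entropy(D) - expected_entropy(gene1)
gain2 = entropy(D) - expected_entropy(gene2)
print(round(entropy(D), 3), round(gain1, 4), round(gain2, 2))  # 0.971 0.2813 0.42
```

gene2 indeed has the larger gain, matching the slide's choice of gene2 for the split (the slide's 0.2814 comes from rounding the intermediates 0.971 and 0.985 first).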
Use the decision tree
- A new patient's expression profile is dropped down the tree; the leaf it reaches gives the predicted class (yes/no).

Is the decision tree unique?
- No. A simpler tree may fit the same data.
- We want a tree that is both small and accurate: it is easier to understand and tends to perform better.
- Finding the best tree is NP-hard; all current tree-building algorithms are heuristic algorithms.

Avoid overfitting in classification
- Overfitting: a tree may overfit the training data.
  - Good accuracy on the training data but poor accuracy on test data.
  - Symptoms: the tree is too deep and has too many branches, some of which may reflect anomalies due to noise or outliers.
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early. This is difficult to decide, because we do not know what may happen subsequently if we keep growing the tree.
  - Postpruning: remove branches or subtrees from a fully grown tree. This method is commonly used. C4.5 uses a statistical method to estimate the error at each node for pruning; a validation set may be used for pruning as well.

Naive Bayes Classifier
- Bayes rule:
  P(Y | X) = P(X | Y) P(Y) / P(X)
- Suppose X = (X1, ..., Xn). When Y is Boolean and X is a vector of n Boolean attributes, estimating P(X | Y) directly requires on the order of 2^n parameters.
- Naive Bayes assumes the attributes are conditionally independent given the class.

Conditional Independence
- Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:
  P(X | Y, Z) = P(X | Z)

Naive Bayes Classifier
- Assume a target function f: X → Y, where each instance x is described by attributes (x1, x2, ..., xn). The most probable value of Y is
  v_MAP = argmax_{y_j ∈ V} P(y_j | x1, ..., xn)
        = argmax_{y_j ∈ V} P(x1, ..., xn | y_j) P(y_j) / P(x1, ..., xn)
        = argmax_{y_j ∈ V} P(x1, ..., xn | y_j) P(y_j)
- With the conditional independence assumption:
  v_MAP = argmax_{y_j ∈ V} P(y_j) Π_i P(x_i | y_j)
Naive Bayes Classifier: Parameter Estimation
- Discrete inputs:
  - Prior probability over Y: P(y_j) — K parameters.
  - Conditional probabilities P(x_i | y_j) — on the order of nMK parameters (n attributes with M values each, K classes).
- Algorithm Naive_Bayes_Learn(examples):
  - For each target value y_j, estimate P(y_j).
  - For each value x_i of each attribute X_i, estimate P(x_i | y_j).
- Estimation from counts:
  P̂(y_j) = |D{Y = y_j}| / |D|
  P̂(x_i | y_j) = |D{X_i = x_i, Y = y_j}| / |D{Y = y_j}|
- Naive_Bayes_Classify(x):
  y = argmax_{y_j} P̂(y_j) Π_i P̂(x_i | y_j)

Naive Bayes Classifier: Parameter Estimation (continuous inputs)
- Need to choose a representation for the distribution P(X_i | Y); usually a Gaussian is assumed.
- Mean and standard deviation for each attribute and class: 2nK parameters.
- Prior probability over Y: P(y_j) — K parameters.

Naive Bayes Subtleties
- The conditional independence assumption is often violated, but naive Bayes works surprisingly well anyway.
- Note: we don't need the estimated posteriors P̂(y | x) to be correct; we only need the argmax to be right:
  argmax_{y_j} P̂(y_j) Π_i P̂(x_i | y_j) = argmax_{y_j} P(y_j) P(x1, ..., xn | y_j)
- Naive Bayes posteriors are often unrealistically close to 1 or 0.
- What if none of the training instances with target value y_j have attribute value x_i? Then P̂(x_i | y_j) = 0, and so P̂(y_j) Π_i P̂(x_i | y_j) = 0.
- Typical solution: smooth the estimate, e.g.
  P̂(x_i | y_j) = (|D{X_i = x_i, Y = y_j}| + l) / (|D{Y = y_j}| + lR),
  where l is a small smoothing constant and R is the number of values attribute X_i can take.

Naive Bayes Classifier: Example (Cancer Diagnosis)
- P(y) = 0.6, P(n) = 0.4
- P(gene1 > 2 | y) = 0.5,  P(gene1 > 2 | n) = 0
- P(gene2 > 2 | y) = 0.667,  P(gene2 > 2 | n) = 0
- New patient with gene1 > 2 and gene2 < 2:
  P(y) · P(gene1 > 2 | y) · P(gene2 < 2 | y) = 0.0999
  P(n) · P(gene1 > 2 | n) · P(gene2 < 2 | n) = 0
  → predict y.
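The cancer-diagnosis arithmetic is quick to check directly; the probabilities are the slide's, and the complement rule gives P(gene2 < 2 | class) = 1 − P(gene2 > 2 | class).

```python
# Slide probabilities for the cancer-diagnosis example
p_y, p_n = 0.6, 0.4
p_g1_y, p_g1_n = 0.5, 0.0     # P(gene1 > 2 | y), P(gene1 > 2 | n)
p_g2_y, p_g2_n = 0.667, 0.0   # P(gene2 > 2 | y), P(gene2 > 2 | n)

# New patient: gene1 > 2 and gene2 < 2
score_y = p_y * p_g1_y * (1 - p_g2_y)
score_n = p_n * p_g1_n * (1 - p_g2_n)
prediction = "y" if score_y > score_n else "n"
print(round(score_y, 4), score_n, prediction)  # 0.0999 0.0 y
```

The "n" score is zeroed entirely by the single factor P(gene1 > 2 | n) = 0, which is exactly the zero-count fragility that the smoothed estimate above is meant to avoid.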
Naive Bayes Classifier: Example (Classify DNA Sequences)
- Classify which DNA sequences contain a splice site.
- Classify which DNA sequences encode a protein.
- How do we represent a DNA sequence for naive Bayes?
- Target concept: DNA sequence → {+, −}.
- Represent each sequence by a vector of letters, with one attribute per position in the sequence.
- Learning: use the training examples to estimate P(+), P(−), and the per-position probabilities P(S_i = A | +), P(S_i = A | −), etc.
- Naive Bayes conditional independence assumption (for a length-m sequence):
  P(S_1, ..., S_m | Y) = Π_{i=1}^{m} P(S_i = w_k | Y),  where w_k ∈ {A, T, C, G}.
- Example training sequences:
  CTGTCCTGTG, ACGTTTCAAA, CTACCCTGCG, TGTGACGCGT, GATATGTCTC, AGCGAGGTGT, TGGTGAATTC, CTACCTCAAG
- From the training data, calculate P̂(+), P̂(−), and P̂(x_i | y_j) for each position and letter; classify a new sequence by
  y = argmax_{y_j} P̂(y_j) Π_i P̂(x_i | y_j)
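The per-position model can be sketched end to end. The +/− labels attached to the slide's sequences below are hypothetical (the notes do not preserve the original labels), and an add-l smoothing over the R = 4 letters is used so that unseen letters never zero the product.

```python
from collections import Counter

def learn_dna_nb(examples, l=1.0):
    """Per-position naive Bayes for fixed-length DNA sequences.

    Estimates P(y) and smoothed P(S_i = letter | y), with one attribute
    per sequence position.
    """
    labels = Counter(y for _, y in examples)
    m = len(examples[0][0])
    pos = {(i, y): Counter() for i in range(m) for y in labels}
    for seq, y in examples:
        for i, ch in enumerate(seq):
            pos[(i, y)][ch] += 1
    prior = {y: c / len(examples) for y, c in labels.items()}

    def cond(i, ch, y):
        return (pos[(i, y)][ch] + l) / (labels[y] + l * 4)  # R = 4 letters

    return prior, cond

def classify_dna(prior, cond, seq):
    """y = argmax_y P(y) * prod_i P(S_i = seq[i] | y)."""
    def score(y):
        p = prior[y]
        for i, ch in enumerate(seq):
            p *= cond(i, ch, y)
        return p
    return max(prior, key=score)

# The slide's sequences with made-up labels
examples = [("CTGTCCTGTG", "+"), ("ACGTTTCAAA", "-"), ("CTACCCTGCG", "+"),
            ("TGTGACGCGT", "-"), ("GATATGTCTC", "-"), ("AGCGAGGTGT", "-"),
            ("TGGTGAATTC", "-"), ("CTACCTCAAG", "+")]
prior, cond = learn_dna_nb(examples)
print(classify_dna(prior, cond, "CTACCCTGCG"))  # +
```

Because of the smoothing, every conditional probability is strictly positive, and the per-position probabilities over {A, T, C, G} still sum to one.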