MACHINE LEARNING AND DATA MINING (CS 434)
This 238-page set of class notes was uploaded by Mrs. Maximo Lueilwitz on Monday, October 19, 2015. The notes belong to CS 434 at Oregon State University, taught by Staff in Fall. Since its upload, the set has received 20 views. For similar materials see /class/224507/cs-434-oregon-state-university in Computer Science at Oregon State University.
Lecture 13 (Oct 27, 2007)

Bagging (review)
- Generate T random samples from the training set by bootstrapping.
- Learn a sequence of classifiers h1, h2, ..., hT from them using base learner L.
- To classify an unknown sample x, let each classifier predict, then take a simple majority vote to make the final prediction.
- This simple scheme works well in many situations.

Bias/variance for classifiers
- Bias arises when the classifier cannot represent the true function, i.e., the classifier underfits the data.
- Variance arises when the classifier overfits the data: minor variations in the training set cause it to fit differently.
- Ideally we would like a low-bias, low-variance classifier. Typically, low-bias classifiers (prone to overfitting) have high variance, and high-bias classifiers (prone to underfitting) have low variance. We have a trade-off.

Effect of algorithm parameters on bias and variance
- k-nearest neighbor: increasing k typically increases bias and reduces variance.
- Decision trees of depth D: increasing D typically increases variance and reduces bias.

Why does bagging work?
- Bagging takes the average of multiple models, which reduces variance.
- This suggests that bagging works best with low-bias, high-variance classifiers.

Boosting
- Also an ensemble method: the final prediction is a combination of the predictions of multiple classifiers.
- What is different? It is iterative.
  - Boosting: successive classifiers depend on their predecessors; look at the errors of previous classifiers to decide what to focus on in the next iteration over the data.
  - Bagging: individual classifiers are independent.
- All training examples are used in each iteration, but with different weights: more weight on difficult examples, the ones on which mistakes were made in previous iterations.

AdaBoost illustration
- Start with uniformly weighted data; learn h1(x); update the weights; learn h2(x); update the weights; ...; learn hM(x).
- Final classifier: H(x) = sign( sum_{m=1..M} alpha_m h_m(x) ).
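The bagging procedure reviewed above can be sketched in a few lines. This is a minimal illustration, not the course's code: `majority_learner` is a deliberately weak, hypothetical base learner (it always predicts its bootstrap sample's majority label) used only so the demo is self-contained.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # Sample len(data) points with replacement (~66.7% of points appear on average).
    return [rng.choice(data) for _ in data]

def majority_learner(sample):
    # Toy base learner: returns a classifier that predicts the sample's majority label.
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

def bag(data, base_learner, T, seed=0):
    # Learn T classifiers, one per bootstrap sample.
    rng = random.Random(seed)
    return [base_learner(bootstrap_sample(data, rng)) for _ in range(T)]

def bagged_predict(ensemble, x):
    # Simple majority vote over the ensemble's predictions.
    votes = Counter(h(x) for h in ensemble)
    return votes.most_common(1)[0][0]
```

In practice the base learner would be a decision tree or similar; the wrapper code stays the same.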
The AdaBoost Algorithm
Input: a set S of m labeled examples S = {(x_i, y_i), i = 1, ..., m}, with labels y_i in Y = {1, ..., K}; a learning algorithm Learn; a constant T (number of rounds).
1. for all i: w_i := 1/m  (initialize the weights)
2. for t = 1 to T do
3.   for all i: p_i := w_i / sum_j w_j  (normalized weights)
4.   h_t := call Learn with the normalized weights p
5.   e_t := sum_i p_i [h_t(x_i) != y_i]  (weighted error)
6.   if e_t > 1/2 then T := t - 1; exit loop
7.   beta_t := e_t / (1 - e_t)
8.   for all i: w_i := w_i * beta_t^(1 - [h_t(x_i) != y_i])  (compute new weights: correct examples are down-weighted)
9. Output: h(x) = argmax_{y in Y} sum_{t=1}^{T} log(1/beta_t) [h_t(x) = y]

AdaBoost example (taken from "A Tutorial on Boosting" by Yoav Freund and Rob Schapire)
- Original training set: equal weights on all training samples.
- Round 1: epsilon_1 = 0.30, alpha_1 = 0.42.
- Round 2: epsilon_2 = 0.21, alpha_2 = 0.65.
- Round 3: epsilon_3 = 0.14, alpha_3 = 0.92.
- Final classifier: the weighted combination of the three rounds' classifiers; weighted error is computed at each round.

Base learners with weights
- AdaBoost calls L with a set of pre-specified weights. It is often straightforward to convert a base learner L to take into account an input distribution D: decision trees, k-nearest neighbor, naive Bayes.
- When it is not straightforward, we can resample the training data S according to D and then feed the new data set into the learner.

Boosting decision stumps
- Decision stumps: very simple rules of thumb that test a condition on a single attribute.
- Among the most commonly used base classifiers; truly "weak".
- Boosting with decision stumps has been shown to achieve better performance than unbounded decision trees (steep decrease in error).

Boosting performance
- Comparing C4.5, boosting decision stumps, and boosting C4.5 using 27 UCI data sets (C4.5 is a popular decision tree learner).
[Scatter plots lost in extraction: test error (0-30%) of C4.5 vs. boosting stumps, and C4.5 vs. boosting C4.5.]

Boosting vs. bagging of decision trees: overfitting?
- Boosting drives training error to zero. Will it overfit?
- Curious phenomenon: boosting is often robust to overfitting (not always). Test error continues to decrease even after training error goes to zero.
[Plot lost in extraction: train and test error vs. number of rounds T, from 10 to 1000.]
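As a concrete sketch, here is AdaBoost for the two-class case with labels in {-1, +1}, using the equivalent alpha = (1/2) ln((1 - e)/e) parameterization rather than the beta form above, and a hypothetical exhaustive 1-D decision-stump learner as the weak base learner. It is an illustration under those assumptions, not the lecture's exact pseudocode.

```python
import math

def stump_learner(points, weights):
    # Exhaustive search over 1-D threshold stumps h(x) = s if x > t else -s.
    xs = sorted(x for x, _ in points)
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    best = None
    for t in thresholds:
        for s in (1, -1):
            err = sum(w for (x, y), w in zip(points, weights)
                      if s * (1 if x > t else -1) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    err, t, s = best
    return (lambda x, t=t, s=s: s * (1 if x > t else -1)), err

def adaboost(points, T):
    m = len(points)
    w = [1.0 / m] * m
    ensemble = []
    for _ in range(T):
        h, err = stump_learner(points, w)
        if err >= 0.5:                       # weak learner no better than chance: stop
            break
        err = max(err, 1e-12)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Re-weight: mistakes get heavier, correct examples lighter; then normalize.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, points)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```

On 1-D data that no single stump can classify (two positive regions around a negative one), three boosted stumps suffice to reach zero training error.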
Explanation: margins
- f(x) = sum_t alpha_t h_t(x) / sum_t alpha_t; margin = y * f(x).
- Histogram of the functional margin for the ensemble just after achieving zero training error: the effect of boosting is to maximize the margin, leaving no examples with small margins.
- Even after zero training error, the margin of the examples keeps increasing. This is one reason the generalization error may continue decreasing.

Bias-variance analysis of boosting
- In early iterations, boosting is primarily a bias-reducing method.
- In later iterations, it appears to be primarily a variance-reducing method.

What you need to know about ensemble methods
- Bagging: a randomized algorithm based on bootstrapping. What is bootstrapping? Variance reduction. Which learning algorithms are good candidates for bagging?
- Boosting: combine weak classifiers (i.e., slightly better than random); train on the same data set but with different weights. How to update the weights? How to incorporate weights in learning (DT, KNN, naive Bayes)?
- One explanation for not overfitting: maximizing the margin.

Lecture 8 (Oct 15, 2008)

Bayes classifiers in a nutshell
1. Learn the joint distribution P(x1, x2, ..., xm | Y = v_i) for each value v_i.
2. Estimate P(Y = v_i) as the fraction of records with Y = v_i.
3. For a new example (u1, ..., um), predict
   Y = argmax_v P(Y = v | x1 = u1, ..., xm = um) = argmax_v P(x1 = u1, ..., xm = um | Y = v) P(Y = v).

The joint density estimator overfits
- Typically we don't have enough data to estimate the joint distribution accurately, so we make some bold assumptions to simplify it.

Naive Bayes assumption
- Assume each attribute is independent of all other attributes given the class label:
  P(x1 = u1, ..., xm = um | Y = v) = P(x1 = u1 | Y = v) * ... * P(xm = um | Y = v).

A note about independence
- Assume A and B are Boolean random variables. Then "A and B are independent" if and only if P(A | B) = P(A). This is often notated A ⊥ B.

Independence theorems. Assume P(A | B) = P(A). Then:
- P(A ∧ B) = P(A) P(B)
- P(B | A) = P(B)
- P(A | ¬B) = P(A)
- P(A ∧ ¬B) = P(A) P(¬B)

Examples of independent events
- Two separate coin tosses.
- Consider the following four variables:
  T: Toothache (I have a toothache)
  C: Catch (the dentist's steel probe catches in my tooth)
  A: Cavity
  W: Weather
  Then P(T, C, A, W) = P(T, C, A) P(W), i.e., {T, C, A} ⊥ W.
Conditional independence
- p(x1 | x2, y) = p(x1 | y): x1 and x2 are conditionally independent given y.
- If x1 and x2 are conditionally independent given y, then p(x1, x2 | y) = p(x1 | y) p(x2 | y).

Example of conditional independence
- T: Toothache; C: Catch (the dentist's steel probe catches in my tooth); A: Cavity.
- T and C are conditionally independent given A: P(T, C | A) = P(T | A) P(C | A).
- So events that are not independent of each other might be conditionally independent given some fact.
- It can also happen the other way around: events that are independent might become conditionally dependent given some fact.
  - B: burglar in your house; A: alarm rang in your house; E: earthquake happened.
  - B is independent of E (ignoring some possible connections between them).
  - However, if we know A is true, then B and E are no longer independent. Why? P(B | A) >> P(B | A, E): knowing E is true makes it much less likely for B to be true.

Naive Bayes classifier
- Assume you want to predict output Y, which has arity nY and values v1, v2, ..., v_nY.
- Assume there are m input attributes, X = (x1, x2, ..., xm).
- Learn a conditional distribution p(x | y) for each possible y value (y = v1, v2, ..., v_nY). We do this by:
  - Breaking the training set into nY subsets DS1, DS2, ..., DS_nY based on the y values, i.e., DS_i = records in which Y = v_i.
  - For each DS_i, learning a factored joint distribution of the inputs:
    P(x1 = u1, ..., xm = um | Y = v_i) = P(x1 = u1 | Y = v_i) * ... * P(xm = um | Y = v_i).

Example
- Apply naive Bayes and make a prediction for x = (1, 0, 1). [The training table did not survive extraction; the estimates below are what it implies.]
1. Learn the prior distribution of y: P(y = 0) = 1/2, P(y = 1) = 1/2.
2. Learn the conditional distribution of each x_i given each y value: p(x1 | y = 0), p(x1 | y = 1), p(x2 | y = 0), p(x2 | y = 1), p(x3 | y = 0), p(x3 | y = 1). For example, P(x1 = 1 | y = 0) = 2/3 and P(x1 = 0 | y = 0) = 1/3.
3. To predict for (1, 0, 1), compare
   P(y = 0 | 1, 0, 1) ∝ P(1, 0, 1 | y = 0) P(y = 0)  with
   P(y = 1 | 1, 0, 1) ∝ P(1, 0, 1 | y = 1) P(y = 1).

Final notes about the naive Bayes classifier
- Any density estimator can be plugged in to estimate p(x1, x2, ..., xm | y), or p(x_i | y) for naive Bayes.
- Real-valued attributes can be modeled using simple distributions such as the Gaussian (normal) distribution.
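The counting-based training and prediction above can be sketched as follows. The six-row dataset in the test is hypothetical (the original table was lost), but it is chosen so that P(y=0) = 1/2 and P(x1=1 | y=0) = 2/3 match the slide's numbers.

```python
from collections import Counter, defaultdict

def train_nb(rows):
    # rows: list of (feature_tuple, label).
    prior = Counter(y for _, y in rows)
    cond = defaultdict(Counter)            # (feature index, label) -> value counts
    for x, y in rows:
        for j, v in enumerate(x):
            cond[(j, y)][v] += 1
    return prior, cond, len(rows)

def predict_nb(model, x):
    prior, cond, n = model
    best, best_p = None, -1.0
    for y, cy in prior.items():
        p = cy / n                         # prior P(Y = y)
        for j, v in enumerate(x):
            p *= cond[(j, y)][v] / cy      # naive factor P(x_j = v | Y = y)
        if p > best_p:
            best, best_p = y, p
    return best
```

Note the unsmoothed counts can produce zero factors, which is exactly the problem Laplace smoothing (below) addresses.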
- Zero probabilities are painful for both the joint and naive estimators. A hack called Laplace smoothing can help.
  - Original estimation: P(x1 = 1 | y = 0) = (# examples with y = 0, x1 = 1) / (# examples with y = 0).
  - Smoothed estimation (never estimates a zero probability): P(x1 = 1 | y = 0) = (# examples with y = 0, x1 = 1, plus 1) / (# examples with y = 0, plus k), where k is the number of values x1 can take.
- Naive Bayes is wonderfully cheap and survives tens of thousands of attributes easily.

Bayes classifier is a generative approach
- Generative approach: learn p(y) and p(x | y), then apply Bayes' rule to compute p(y | x) for making predictions.
- This in essence assumes each data point is independently, identically distributed (i.i.d.) and generated by a process governed by p(y) and p(x | y).
  - Bayes classifier: p(y) and p(x | y).
  - Naive Bayes classifier: p(y) and p(x1 | y) * ... * p(xm | y).
- The generative approach is just one type of learning approach used in machine learning. Learning a correct generative model is difficult, and sometimes unnecessary.
- KNN and DT are both what we call discriminative methods. They are not concerned with any generative model; they only care about finding a good discriminative function. For KNN and DT these functions are deterministic, not probabilistic.
- One can also take a probabilistic approach to learning discriminative functions: learn p(y | x) directly, without assuming x is generated from some particular distribution given y (i.e., without modeling p(x | y)). Logistic regression is one such approach.

Logistic regression
- First, let's look at the term "regression". Regression is similar to classification except that the y value we are trying to predict is a continuous value, as opposed to a categorical one.
  - Classification: given income and savings, predict a loan applicant as "high risk" vs. "low risk".
  - Regression: given income and savings, predict a credit score.
- Linear regression: essentially, try to fit a straight line through a cloud of points. Look for w = (w0, w1, ..., wm) such that y' = w0 + w1 x1 + ... + wm xm and y' is as close to y as possible.
- Logistic regression can be thought of as an extension of linear regression to the case where the target value y is binary.
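The Laplace-smoothing correction described above is one line. This sketch uses the slide's form, with +1 in the numerator and +k in the denominator, where k is the number of values the attribute can take; the function name is mine.

```python
def laplace(n_xy, n_y, k):
    # Smoothed P(X = v | Y = y) = (count(y, X=v) + 1) / (count(y) + k).
    # Never exactly zero, and the k smoothed estimates still sum to 1.
    return (n_xy + 1) / (n_y + k)
```

For a binary attribute never seen with value 1 among ten y=0 examples, the estimate becomes 1/12 instead of 0.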
- Because y is binary (0 or 1), we cannot directly use a linear function of x to predict y. Instead, we use a linear function of x to predict the log odds of y = 1:
  log( P(y = 1 | x) / P(y = 0 | x) ) = w0 + w1 x1 + ... + wm xm.
- Or, equivalently, we predict
  P(y = 1 | x) = 1 / (1 + e^-(w0 + w1 x1 + ... + wm xm)),
  the sigmoid function applied to the linear score.

Learning w for logistic regression
- Given a set of training data points, we would like to find a weight vector w such that P(y = 1 | x) = 1 / (1 + e^-(w0 + w1 x1 + ... + wm xm)) is large (close to 1) for positive training examples and small (close to 0) otherwise.
- This is captured in the objective function
  L(w) = sum_i log P(y^i | x^i, w)
       = sum_i [ y^i log P(y^i = 1 | x^i, w) + (1 - y^i) log(1 - P(y^i = 1 | x^i, w)) ],
  where the superscript i indexes the examples in the training set.
- This is called the likelihood function of w; by maximizing it we perform "maximum likelihood estimation" of the parameter w.

Optimizing L(w)
- Unfortunately this has no closed-form solution. Instead, we iteratively search for the optimal w: start with a random w and iteratively improve it (similar to the perceptron).
- Logistic regression learning: given training examples (x^i, y^i), i = 1..N:
  let w <- (0, 0, ..., 0)
  repeat until convergence:
    for i = 1 to N:
      y_hat <- 1 / (1 + e^-(w · x^i))
      error <- y^i - y_hat
      w <- w + alpha * error * x^i      (alpha is the learning rate)
- We predict y = 1 if P(y = 1 | x) > P(y = 0 | x). You can show that this leads to a linear decision boundary.

Reinforcement Learning (Dec 03, 2008)

Large state spaces
- When a problem has a large state space, we can no longer represent the U or Q functions as explicit tables.
- Even if we had enough memory: there is never enough training data, and learning takes too long. What to do?

Function approximation
- Never enough training data: we must generalize what is learned in one situation to other, "similar" new situations.
- Idea: instead of a large table representing U or Q, use a parameterized function with a small number of parameters (generally exponentially fewer parameters than the number of states), and learn the parameters from experience.
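The gradient-ascent loop for logistic regression above can be sketched directly. This is a minimal illustration under my own conventions: each x carries a leading 1 so w[0] plays the role of the bias w0, and the data and hyperparameters are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p1(w, x):
    # P(y = 1 | x); x carries a leading 1 so w[0] is the bias w0.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train_logreg(data, alpha=0.5, epochs=2000):
    # Per-example updates: w <- w + alpha * (y - P(y=1|x)) * x,
    # the gradient of the log-likelihood L(w).
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            err = y - p1(w, x)
            w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
    return w
```

On 1-D separable data, the learned boundary falls between the two classes and the predicted probabilities move toward 0 and 1.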
- When we update the parameters based on observations in one state, the U or Q estimates also change for other, similar states. This facilitates generalization of experience.

Example
- Consider a 7x7 grid problem with no obstacles, deterministic actions U/D/L/R (49 states), and reward 10 at the upper-right corner.
- Features for state s = (x, y): f1(s) = x, f2(s) = y (just 2 features).

Linear function approximation
- Define a set of state features f1(s), ..., fn(s). The features are used as our representation of states; states with similar feature values will be treated similarly.
- A common approximation represents U(s) as a weighted sum of the features, i.e., a linear approximation:
  U_theta(s) = theta_0 + theta_1 f1(s) + theta_2 f2(s) + ... + theta_n fn(s).
- For the grid example: U_theta(s) = theta_0 + theta_1 x + theta_2 y. Is there a good linear approximation?
  - Yes: theta_0 = 10, theta_1 = -1, theta_2 = -1 (note: the upper-right corner is the origin). U(s) = 10 - (x + y) subtracts the Manhattan distance to the goal from the goal reward.
  - Instead of storing a table of 49 entries, we now only need to store 3 parameters.

Function approximation accuracy
- The approximation accuracy is fundamentally limited by the information provided by the features.
- Can we always define features that allow a perfect linear approximation? Yes: assign each state an indicator feature (the i-th feature is 1 iff the i-th state is present, and theta_i represents the value of the i-th state). Of course, this requires far too many features and gives no generalization.

Changed reward: a bad linear approximation
- Move the reward of 10 so the goal sits in the middle of the grid. Now U_theta(s) = theta_0 + theta_1 x + theta_2 y has no good linear approximation.
- But what if we include a new feature z = |3 - x| + |3 - y| (z is the distance to the goal location) and use U_theta(s) = theta_0 + theta_1 x + theta_2 y + theta_3 z? Does this allow a good linear approximation? Yes: theta_0 = 10, theta_1 = theta_2 = 0, theta_3 = -1.

Linear function approximation (cont.)
- More complex functions require more complex features.
- Our goal is to learn good parameter values (feature weights) that approximate the value function well.
- How can we do this? Use TD-based RL, and somehow update the parameters based on each experience.

TD-based RL for linear approximators
1. Start with initial parameter values.
2. Take an action according to an explore/exploit policy (should converge to a greedy policy, i.e., GLIE).
3. Update the estimated model.
4. Perform a TD update for each parameter theta_i.
5. Go to 2.
What is a "TD update" for a parameter?

Aside: gradient descent for squared error
- Suppose we have a sequence of states and target values for each state, (s1, u(s1)), (s2, u(s2)), ..., produced by the TD-based RL loop.
- Our goal is to minimize the squared error between our estimated function and each target value:
  E_j = (1/2) (U_theta(s_j) - u(s_j))^2,
  where u(s_j) is the target value for the j-th state and U_theta(s_j) is our estimated value for it.
- After seeing the j-th state, the gradient descent rule tells us to update all parameters by
  theta_i <- theta_i - alpha * dE_j/dtheta_i = theta_i + alpha * (u(s_j) - U_theta(s_j)) * dU_theta(s_j)/dtheta_i,
  where alpha is the learning rate.
- dU_theta(s_j)/dtheta_i depends on the form of the approximator. For a linear approximation function
  U_theta(s) = theta_0 + theta_1 f1(s) + ... + theta_n fn(s),
  we have dU_theta(s_j)/dtheta_i = f_i(s_j), so the update becomes
  theta_i <- theta_i + alpha * (u(s_j) - U_theta(s_j)) * f_i(s_j).
- For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning-rate schedule.

TD-based RL for linear approximators (with the update filled in)
1. Start with initial parameter values.
2. Take an action according to an explore/exploit policy (should converge to greedy, i.e., GLIE); transition from s to s'.
3. Update the estimated model.
4. Perform the TD update for each parameter:
   theta_i <- theta_i + alpha * (u(s) - U_theta(s)) * f_i(s).
5. Go to 2.
- What should we use for the target value u(s)? Use the TD prediction based on the next state s':
  u(s) = R(s) + gamma * U_theta(s'). This is the same as the previous TD method, only with approximation:
  theta_i <- theta_i + alpha * (R(s) + gamma * U_theta(s') - U_theta(s)) * f_i(s).
- Note that step 2 still requires a model T to select actions.
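The per-parameter TD update above can be sketched on a small grid. This is an illustration under assumptions of mine, not the lecture's setup: a 5x5 grid with the goal at (0,0), a fixed policy that heads straight for the goal, a cost of -1 per move and +10 on reaching the goal, and features [1, x, y] so the true values are exactly linear.

```python
import random

def features(s):
    # Bias term plus the two coordinate features from the grid example.
    x, y = s
    return [1.0, x, y]

def U(theta, s):
    return sum(t * f for t, f in zip(theta, features(s)))

def td_update(theta, s, r, s_next, alpha=0.01, gamma=1.0, terminal=False):
    # theta_i <- theta_i + alpha * (r + gamma*U(s') - U(s)) * f_i(s)
    target = r if terminal else r + gamma * U(theta, s_next)
    delta = target - U(theta, s)
    return [t + alpha * delta * f for t, f in zip(theta, features(s))]

def run_td(episodes=5000, alpha=0.01, seed=1):
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(episodes):
        s = (rng.randint(0, 4), rng.randint(0, 4))
        while s != (0, 0):
            x, y = s
            s2 = (x - 1, y) if x > 0 else (x, y - 1)   # fixed policy: toward goal
            if s2 == (0, 0):
                theta = td_update(theta, s, 10.0, s2, alpha, terminal=True)
            else:
                theta = td_update(theta, s, -1.0, s2, alpha)
            s = s2
    return theta
```

After training, both coordinate weights are negative: value falls off with Manhattan distance from the goal, as in the U(s) = 10 - (x + y) example.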
- To avoid this, we can do the same thing for model-free Q-learning.

Q-learning with linear approximators
  Q_theta(s, a) = theta_0 + theta_1 f1(s, a) + theta_2 f2(s, a) + ... + theta_n fn(s, a)
  (the features are now a function of states and actions).
1. Start with initial parameter values.
2. Take an action according to an explore/exploit policy (should converge to greedy, i.e., GLIE).
3. Perform the TD update for each parameter:
   theta_i <- theta_i + alpha * (R(s) + gamma * max_{a'} Q_theta(s', a') - Q_theta(s, a)) * f_i(s, a).
4. Go to 2.
- For both Q and U, these algorithms converge to the closest linear approximation to the optimal Q or U.

Summary of RL
- MDPs: the definition of an MDP (T, R, S, A); solving an MDP for the optimal policy (value iteration, policy iteration).
- RL: the difference between RL and MDPs; different methods for passive RL (DUE, ADP, TD); different methods for active RL (ADP, Q-learning with TD learning); function approximation for large state/action spaces.

Learning objectives
1. Students are able to apply supervised learning algorithms to prediction problems and evaluate the results.
2. Students are able to apply unsupervised learning algorithms to data analysis problems and evaluate the results.
3. Students are able to apply reinforcement learning algorithms to control problems and evaluate the results.
4. Students are able to take a description of a new problem and decide what kind of problem (supervised, unsupervised, or reinforcement) it is.

Reinforcement Learning (CS 434, Fall 2007)

So far
- Given an MDP model, we know how to find optimal policies (value iteration or policy iteration).
- But what if we don't have any form of model? Like when we were babies: all we could do was wander around the world, observing what happens, getting rewarded and punished. Enter reinforcement learning.

Reinforcement learning in a nutshell
- "Imagine playing a new game whose rules you don't know; after a hundred or so moves, your opponent announces, 'You lose.'" (Russell and Norvig, Artificial Intelligence: A Modern Approach, p. 764)

Reinforcement learning
- An agent is placed in an environment and must learn to behave optimally in it.
- Assume that the world behaves like an MDP,
  except:
  - the agent can act but does not know the transition model, and
  - the agent observes its current state and its reward but doesn't know the reward function.
- Goal: learn an optimal policy.

RL successes
- Elevator scheduling; autonomous helicopters; vehicle routing; robotics (walking Aibos, air hockey, juggling, soccer); games (backgammon, checkers); targeted marketing.

Factors that make RL difficult
- Actions have nondeterministic effects, which are initially unknown and must be learned.
- Rewards and punishments can be infrequent, often coming at the end of long sequences of actions. How do we determine which actions were really responsible for the reward or punishment? (The credit assignment problem.)
- The world is large and complex.

Passive vs. active learning
- Passive learning: the agent acts based on a fixed policy pi and tries to learn how good the policy is by observing the world go by. Analogous to policy evaluation in policy iteration.
- Active learning: the agent attempts to find an optimal (or at least good) policy by exploring different actions in the world. Analogous to solving the underlying MDP.

Model-based vs. model-free RL
- Model-based approach: learn the MDP model (T and R), or an approximation of it, and use it to find the optimal policy.
- Model-free approach: derive the optimal policy without explicitly learning the model.
- We will consider both types of approaches.

Example: passive RL
- Suppose we are given a policy on the 4x3 grid world and want to determine how good it is. Objective: the value function U^pi(s).
- We are given neither the transition matrix nor the reward function.
- Simply follow the policy for many epochs (training sequences), e.g.:
  (1,1) -> (1,2) -> (1,3) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (4,3) +1
  (1,1) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (3,2) -> (3,3) -> (4,3) +1
  (1,1) -> (2,1) -> (3,1) -> (3,2) -> (4,2) -1

Approach 1: Direct Utility Estimation (model-free)
- Estimate U(s) as the average total reward of the epochs containing s,
  calculated from s to the end of each epoch.
- "Reward-to-go" of a state s: the sum of the (discounted) rewards from that state until a terminal state is reached.
- Key: use the observed reward-to-go of the state as direct evidence of the actual expected utility of that state, and average the reward-to-go samples.

Direct utility estimation: example
- Suppose we observe the following trial (assume gamma = 1 for simpler computations):
  (1,1)-0.04 -> (1,2)-0.04 -> (1,3)-0.04 -> (1,2)-0.04 -> (1,3)-0.04 -> (2,3)-0.04 -> (3,3)-0.04 -> (4,3)+1
- The total reward starting at (1,1) is 0.72. We call this a "sample" of the observed reward-to-go for (1,1).
- For (1,2) there are two samples of the observed reward-to-go:
  1. (1,2)-0.04 -> (1,3)-0.04 -> (1,2)-0.04 -> (1,3)-0.04 -> (2,3)-0.04 -> (3,3)-0.04 -> (4,3)+1  (total: 0.76)
  2. (1,2)-0.04 -> (1,3)-0.04 -> (2,3)-0.04 -> (3,3)-0.04 -> (4,3)+1  (total: 0.84)
- Direct utility estimation keeps a running average of the observed reward-to-go for each state. E.g., for state (1,2) it stores (0.76 + 0.84)/2 = 0.80.
- As the number of trials goes to infinity, the sample average converges to the true utility.

Problem with DUE
- Direct utility estimation converges slowly. Why? It doesn't exploit the fact that the utilities of states are not independent: utilities always follow the Bellman equation
  U(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') U(s').
  Note the dependence on neighboring states.
- How can the dependence help? Suppose you know that state (3,3) has high utility, and you are now at (3,2). The Bellman equation tells you that (3,2) is likely to have high utility too, because (3,3) is a neighbor. DUE can't tell you that until the end of the trial.

Adaptive dynamic programming (model-based)
- This method does take advantage of the constraints in the Bellman equation.
- It learns the transition model and the reward function so it can solve the underlying MDP; thus it is considered model-based.
- Based on the learned MDP, we can perform policy evaluation, which is part of policy iteration (previously taught).
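The reward-to-go averaging above can be sketched in a few lines; the function names are mine. Running it on the single trial from the example reproduces the 0.72 and 0.80 figures.

```python
from collections import defaultdict

def reward_to_go(trial, gamma=1.0):
    # trial: list of (state, reward). Produces one reward-to-go sample per visit,
    # accumulated backwards from the end of the epoch.
    samples = defaultdict(list)
    g = 0.0
    for state, reward in reversed(trial):
        g = reward + gamma * g
        samples[state].append(g)
    return samples

def direct_utility(trials, gamma=1.0):
    # Average all reward-to-go samples per state across trials.
    acc = defaultdict(list)
    for trial in trials:
        for s, gs in reward_to_go(trial, gamma).items():
            acc[s].extend(gs)
    return {s: sum(v) / len(v) for s, v in acc.items()}
```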
Remember policy evaluation?
- Policy evaluation computes the utility of each state if policy pi is followed, by solving the system of equations
  U^pi(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') U^pi(s').
- The equations are linear, so they can be solved with linear algebra in time O(n^3), where n is the number of states.

Adaptive dynamic programming
- Make use of policy evaluation to learn the utilities of states.
- In order to use the policy evaluation equation, the agent fills in the learned transition model T(s, pi(s), s') and the reward function R(s). How do we learn these models?
  - Learning the reward function R(s): easy, because it's deterministic. Whenever you see a new state s, store the observed reward value as R(s).
  - Learning the transition model T(s, a, s'): keep track of how often you get to state s' given that you're in state s and take action a. E.g., if you are in s = (1,3), execute Right three times, and end up in s' = (2,3) twice, then T(s, Right, s') = 2/3.

ADP algorithm (sketch)
function PASSIVE-ADP-AGENT(percept) returns an action
  inputs: percept, indicating the current state s' and reward signal r'
  static: pi, a fixed policy; mdp, an MDP with model T, rewards R, discount gamma;
          U, a table of utilities, initially empty;
          N_sa, a table of frequencies for state-action pairs, initially zero;
          N_sas', a table of frequencies for state-action-state triples, initially zero;
          s, a, the previous state and action, initially null
  if s' is new then U[s'] <- r'; R[s'] <- r'                       (update reward function)
  if s is not null then
    increment N_sa[s, a] and N_sas'[s, a, s']
    for each t such that N_sas'[s, a, t] is nonzero do             (update transition model)
      T[s, a, t] <- N_sas'[s, a, t] / N_sa[s, a]
  U <- POLICY-EVALUATION(pi, U, mdp)
  if TERMINAL?[s'] then s, a <- null else s, a <- s', pi[s']
  return a

The problem with ADP
- It needs to solve a system of simultaneous equations, which costs O(n^3). Very hard to do if you have a large state space.
- Can we avoid the computational expense of full policy evaluation?

Temporal difference learning (TD)
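Policy evaluation can also be done by successive approximation rather than the O(n^3) direct linear solve; this sketch (names and the two-state example are mine) repeatedly applies the fixed-policy Bellman equation until it converges.

```python
def policy_evaluation(T, R, gamma=0.9, iters=1000):
    # Iteratively solve U = R + gamma * T U for a fixed policy.
    # T[s][s2] = P(s2 | s, pi(s)); R[s] = reward received in state s.
    n = len(R)
    U = [0.0] * n
    for _ in range(iters):
        U = [R[s] + gamma * sum(T[s][s2] * U[s2] for s2 in range(n))
             for s in range(n)]
    return U
```

For a two-state chain where state 0 pays reward 1 and moves to 0 or 1 with equal probability, and state 1 is absorbing with reward 0, gamma = 0.5 gives U(0) = 4/3 and U(1) = 0, matching the hand-solved fixed point.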
- TD tries to approximate the utility function by local updates of the utility/value function on a per-action basis, without estimating the transition function.
- For each transition from s to s', perform the local update
  U^pi(s) <- U^pi(s) + alpha * (R(s) + gamma * U^pi(s') - U^pi(s)),
  where alpha is the learning rate and gamma the discount factor.
- Intuitively, this moves us closer to satisfying the Bellman constraint
  U^pi(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') U^pi(s'). Why?

Aside: online mean estimation
- Suppose we want to incrementally compute the mean of a sequence of numbers, e.g., to estimate the mean of a random variable from a sequence of samples:
  X_hat_{n+1} = (1/(n+1)) sum_{i=1}^{n+1} x_i = X_hat_n + (1/(n+1)) (x_{n+1} - X_hat_n).
- Given a new sample x_{n+1}, the new mean is the old estimate (the average of n samples) plus the weighted difference between the new sample and the old estimate; the weight 1/(n+1) acts as a learning rate.

Temporal difference learning (TD), continued
- The TD update for a transition from s to s':
  U^pi(s) <- U^pi(s) + alpha * (R(s) + gamma * U^pi(s') - U^pi(s)),
  where R(s) + gamma * U^pi(s') is a new, noisy sample of the utility based on the next state.
- So the update is maintaining a mean of the noisy utility samples.
- If the learning rate decreases appropriately with the number of samples (e.g., 1/n), the utility estimates will eventually converge to the true values
  U^pi(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') U^pi(s').

ADP vs. TD
[Learning-curve plots lost in extraction: utility estimates vs. number of epochs (up to several hundred) for ADP and TD.]

Comparisons
- Direct estimation (model-free): simple to implement; each update is fast; does not exploit the Bellman constraints, so it converges slowly.
- Adaptive dynamic programming (model-based): harder to implement; each update is a full policy evaluation (expensive); fully exploits the Bellman constraints; fast convergence in terms of epochs.
- Temporal difference learning (model-free): update speed and implementation similar to direct estimation; partially exploits the Bellman constraints (adjusts a state to agree with its observed successor, not all possible successors); convergence in between direct estimation and ADP.

Comparison between ADP and TD
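The online mean estimation identity above is worth sketching, since the TD update has exactly this shape; the function name is mine.

```python
def online_mean(samples):
    # Incremental mean: new_mean = old_mean + (1/n) * (sample - old_mean).
    # The 1/n weight plays the role of a decaying learning rate.
    mu = 0.0
    for n, x in enumerate(samples, start=1):
        mu += (1.0 / n) * (x - mu)
    return mu
```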
- Advantages of ADP: requires fewer training epochs; utility estimates don't vary as much from the true utilities.
- Advantages of TD: simpler, with less computation per epoch; a crude but efficient first approximation to ADP; doesn't need to build a transition model in order to perform its updates. This is important because we can interleave computation with exploration, rather than having to wait for the whole model to be built first.
(Weng-Keen Wong, Oregon State University, 2006)

Lecture 11 (Oct 17, 2007)

SVM review
- SVM without soft margin: set the functional margins y_i (w · x_i + b) >= 1; maximize the geometric margin by minimizing ||w||.
- SVM with soft margin: set the functional margins y_i (w · x_i + b) >= 1 - xi_i; minimize ||w||^2 + c * sum_i xi_i.
- Solutions to SVM:
  w = sum_{i=1}^N alpha_i y_i x_i, subject to sum_{i=1}^N alpha_i y_i = 0, and (with soft margin) 0 <= alpha_i <= c.
- For classifying a new input z, compute
  w · z + b = sum_{i=1}^N alpha_i y_i (x_i · z) + b,
  and classify z as + if positive and - otherwise. Note that w need not be formed explicitly: we can classify z by taking inner products with the support vectors.

Nonlinear SVMs: feature spaces
- Mapping the input to a higher-dimensional space can solve linearly inseparable cases.
- General idea: for a data set, the original input space can always be mapped to some higher-dimensional feature space in which the data is linearly separable.

Example: quadratic feature space
- Assume m input dimensions, x = (x1, ..., xm). Map to
  Phi(x) = (1, sqrt(2) x1, ..., sqrt(2) xm, x1^2, ..., xm^2, sqrt(2) x1 x2, ..., sqrt(2) x_i x_j, ..., sqrt(2) x_{m-1} x_m):
  a constant term, linear terms, pure quadratic terms, and quadratic cross terms.
- There are about m^2/2 quadratic terms: the number of dimensions increases rapidly.
- You may be wondering about the sqrt(2)'s. At least they won't hurt anything; you will find out why they are there soon.

Dot product in quadratic feature space
- Phi(a) · Phi(b) = 1 + 2 sum_i a_i b_i + sum_i a_i^2 b_i^2 + 2 sum_{i<j} a_i a_j b_i b_j.
- Now let's look at another interesting function of a and b:
  (a · b + 1)^2 = (a · b)^2 + 2 (a · b) + 1
               = sum_i sum_j a_i a_j b_i b_j + 2 sum_i a_i b_i + 1
               = 1 + 2 sum_i a_i b_i + sum_i a_i^2 b_i^2 + 2 sum_{i<j} a_i a_j b_i b_j.
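The identity derived above is easy to verify numerically: the explicit quadratic feature map's dot product equals (a · b + 1)^2. The function names are mine; the feature ordering is one valid choice.

```python
def quad_features(a):
    # Explicit quadratic map: constant, sqrt(2)*linear, pure squares,
    # and sqrt(2)*cross terms. Dimension: 1 + m + m + m*(m-1)/2.
    m = len(a)
    phi = [1.0]
    phi += [2 ** 0.5 * ai for ai in a]
    phi += [ai * ai for ai in a]
    phi += [2 ** 0.5 * a[i] * a[j] for i in range(m) for j in range(i + 1, m)]
    return phi

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def quad_kernel(a, b):
    # Same inner product, but O(m) instead of O(m^2): no explicit map needed.
    return (dot(a, b) + 1) ** 2
```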
- They are the same! And the latter takes only O(m) to compute.

Kernel functions
- If every data point is mapped into a high-dimensional space via some transformation x -> Phi(x), the dot product we need to compute for classifying a point x becomes Phi(x_i) · Phi(x) for all support vectors x_i.
- A kernel function is a function that is equivalent to an inner product in some feature space: k(a, b) = Phi(a) · Phi(b).
- We have seen the example k(a, b) = (a · b + 1)^2, which is equivalent to mapping to the quadratic space.
- More kernel functions:
  - Linear kernel: k(a, b) = a · b.
  - Polynomial kernel: k(a, b) = (a · b + 1)^d.
  - Radial-basis-function (RBF) kernel: k(a, b) = exp(-||a - b||^2 / (2 sigma^2)). In this case the corresponding mapping Phi(x) is infinite-dimensional.
- Classification with kernels:
  w · Phi(z) + b = sum_{i=1}^N alpha_i y_i (Phi(x_i) · Phi(z)) + b = sum_{i=1}^N alpha_i y_i K(x_i, z) + b.
- Note: we will not get into the details, but the learning of w can be achieved using kernel functions as well.

Nonlinear SVM summary
- Map the input space to a high-dimensional feature space and learn a linear decision boundary in the feature space; the decision boundary will be nonlinear in the original input space.
- There are many possible choices of kernel function. How to choose? The most frequently used method: cross-validation.

Ensemble learning
- So far we have designed learning algorithms that take a training set and output a classifier.
- What if we want more accuracy than current algorithms afford? Develop a new learning algorithm, or improve existing algorithms.
- Another approach is to leverage the algorithms we have via ensemble methods: instead of calling an algorithm just once and using its classifier, call the algorithm multiple times and combine the multiple classifiers.

What is ensemble learning?
- Traditional: one training set and one learning algorithm produce one classifier, y = h(x).
- Ensemble method: different training sets and/or learning algorithms produce classifiers h1(x), ..., hT(x), whose predictions are combined.
Ensemble learning: intuition
- Combining the predictions of multiple classifiers (an ensemble) is often more accurate than a single classifier.
- Justification: it is easy to find quite good "rules of thumb", but hard to find a single highly accurate prediction rule.
- If the training set is small and the hypothesis space is large, there may be many equally accurate classifiers.
- The hypothesis space may not contain the true function, but it may have several good approximations.
- Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers.

How to generate an ensemble
- A variety of methods have been developed. We will look at two of them: bagging and boosting (AdaBoost, "adaptive boosting").
- Both methods take a single base learning algorithm and wrap around it to generate multiple classifiers.

Bagging (Bootstrap Aggregation; Breiman, 1996)
- Generate a random sample from the training set by bootstrapping: sample with replacement. The new sample contains the same number of training points (it may contain repeats); on average, 66.7% of the original points will appear in a random sample.
- Repeat this sampling procedure, getting a sequence of T independent training sets.
- Learn a sequence of classifiers C1, C2, ..., CT, one per training set, using the same base learner.
- To classify an unknown sample x, let each classifier predict, then take a simple majority vote to make the final prediction.

Bagging example
[Figures lost in extraction: the decision boundary of a single CART decision tree vs. 100 bagged trees.]

Lecture 5: Decision Trees (cont.) (Oct 8, 2008)

Review of last lecture
- What is a decision tree? What decision boundaries do decision trees produce?
- Syntactically different trees can represent the same decision boundaries; in such cases we prefer smaller trees. (Decision trees have a flexible hypothesis space.)
- How to learn a decision tree: a greedy approach. At each step, choose the test that reduces the most uncertainty about the class labels.

Choosing the test based on training error
- Perform a 1-step lookahead search and choose the attribute that gives the lowest error rate on the training data.
[The table of training examples over x1, x2, x3 and label y did not survive extraction.]
examples. Unfortunately, this measure does not always work well, because it does not detect cases where we are making progress toward a good tree.

A Better Heuristic from Information Theory. Let X have the following probability distribution: P(X=0) = p0, P(X=1) = p1 (e.g., 0.2 and 0.8). The entropy of X, denoted H(X), is defined as

    H(X) = -p0 log2 p0 - p1 log2 p1

or, if there are k possible values,

    H(X) = -p1 log2 p1 - ... - pk log2 pk.

-log P(X=x) measures the "surprise" of value x: if P(X=x) is small, x is a surprising value to take, so -log P(X=x) is large. Entropy can be considered the average surprise of a random variable, which is also referred to as the uncertainty of the random variable.

Entropy. Entropy is a concave (downward) function. Minimum uncertainty occurs when p0 = 0 or 1.

Mutual Information. If we use entropy to measure uncertainty, we end up measuring the mutual information between a candidate test variable X and the class label Y:

    I(X; Y) = H(Y) - H(Y | X)

the uncertainty of Y minus the remaining uncertainty of Y after knowing the value of X. H(Y|X) is called the conditional entropy of Y given X; it measures the uncertainty of Y after knowing the value of X:

    H(Y|X) = Σx P(X=x) H(Y | X=x) = -Σx Σy P(X=x) P(Y=y | X=x) log P(Y=y | X=x)

where P(X=x) is the probability of X=x, and H(Y|X=x) is the uncertainty of Y when X=x.

Worked example for feature X1: H(Y) = 0.9183; P(X1=0) = 0.6667, P(X1=1) = 0.3333.

    H(Y|X1=0) = -0.6 log 0.6 - 0.4 log 0.4 = 0.9710
    H(Y|X1=1) = -0.8 log 0.8 - 0.2 log 0.2 = 0.7219
    H(Y|X1) = 0.6667 x 0.9710 + 0.3333 x 0.7219 = 0.8880
    I(X1; Y) = 0.9183 - 0.8880 = 0.0304 (approximately)

Information Gain. This is called the information gain criterion: choose the X that maximizes the mutual information between X and Y:

    argmax_j I(Xj; Y) = argmax_j [H(Y) - H(Y|Xj)] = argmin_j H(Y|Xj)

Information gain is just one of the methods for selecting tests in decision tree learning. There are other methods as well, but they use the same general approach, based on different uncertainty measures.

Choosing the Best Feature: A General View. The benefit of a split is U(S) - [pL U(SL) + pR U(SR)]: the current uncertainty minus the expected remaining uncertainty (impurity) after the split, where pL and pR are the fractions of examples sent to the left and right children. Measures of uncertainty: error, min(p, 1-p); entropy, -p log2 p - (1-p) log2 (1-p); Gini index, 2p(1-p).

Issues with Multinomial Features. Multinomial features have more than 2 possible values. Compare two features: one is binary, the other has 100
possible values. Which one do you expect to have higher mutual information with Y? The conditional entropy of Y given the many-valued feature will be low, but is this meaningful? This bias will inherently prefer such multinomial features to binary features. Method 1: to avoid this, we can rescale the conditional entropy by the entropy of the feature itself:

    argmin_j H(Y|Xj) / H(Xj) = argmin_j [Σx P(Xj=x) H(Y|Xj=x)] / [-Σx P(Xj=x) log P(Xj=x)]

Method 2: test one value versus all of the others. Method 3: group the values into two disjoint sets and test one set against the other.

Continuous Features. Test against a threshold. How do we compute the best threshold θ for Xj? Sort the examples according to Xj, move the threshold θ from the smallest to the largest value, and select the θ that gives the best information gain. Trick: we only need to compute information gain at points where the class label changes. Thresholded continuous features allow complex decision boundaries, and this can lead to overfitting.

Overfitting. A decision tree has a very flexible hypothesis space: as the number of nodes increases, we can represent arbitrarily complex decision boundaries. The extra examples a large tree captures may be just noise, but the tree is grown larger to capture them. (Figure: accuracy on training data keeps rising with tree size, while accuracy on test data peaks and then falls.)

Avoid Overfitting. Early stopping: stop growing the tree when a data split does not offer a large benefit. Post-pruning: separate the labeled data into a training set and a validation set; evaluate the impact on the validation set of pruning each possible node, and greedily prune the node that most improves validation-set performance. (Figure: effect of pruning on training and test accuracy as the tree grows.)

Revisit some of the issues. Is a decision tree robust to outliers? Is a decision tree sensitive to irrelevant features? Is a decision tree computationally efficient?

Frequent pattern mining
and association rules. November 8, 2007, CS 434 Fall 2007.

What Is Frequent Pattern Mining? A frequent pattern is a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set. Motivation: finding inherent regularities in data. What products are often purchased together (beer and diapers)? What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents? Broad applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, DNA sequence analysis.

Association Rules. Data: market-basket transactions.

    TID 1: Bread, Milk
    TID 2: Bread, Diaper, Beer, Eggs
    TID 3: Milk, Diaper, Beer, Coke
    TID 4: Bread, Milk, Diaper, Beer
    TID 5: Bread, Milk, Diaper, Coke

Examples of association rules: {Diaper} -> {Beer}; {Milk, Bread} -> {Eggs, Coke}; {Beer, Bread} -> {Milk}. The implication means co-occurrence, not causality. Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Definition: Frequent Itemset. An itemset is a collection of one or more items, e.g., {Milk, Bread, Diaper}; a k-itemset is an itemset that contains k items. The support count σ is the frequency of occurrence of an itemset, e.g., σ({Milk, Bread, Diaper}) = 2 in the table above. The support s is the fraction of transactions that contain the itemset, e.g., s({Milk, Bread, Diaper}) = 2/5. A frequent itemset is an itemset whose support is greater than or equal to a minsup threshold.

Definition: Association Rule. An association rule is an implication expression of the form X -> Y, where X and Y are itemsets; for example, {Milk, Diaper} -> {Beer}. Rule evaluation metrics: support s, the fraction of transactions that contain both X and Y, i.e., P(X and Y); and confidence c, which measures how often items in Y appear in transactions that contain X, i.e., P(Y | X). Using the same transactions: TID Items 1 Bread Milk 2
Bread, Diaper, Beer, Eggs; 3 Milk, Diaper, Beer, Coke; 4 Bread, Milk, Diaper, Beer; 5 Bread, Milk, Diaper, Coke (the same five transactions as above).

Example: {Milk, Diaper} -> {Beer}.

    s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
    c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

Problem definition: association rule mining. Given an itemset alphabet X = {x1, ..., xk}, a transaction database, and thresholds minsup and minconf, output all rules X -> Y having support P(X and Y) >= minsup and confidence P(Y | X) >= minconf. Example database: TID 10: A, B, C; TID 20: A, C; TID 30: A, D; TID 40: B, E, F. With minsup = 50% and minconf = 50%, the output is A -> C (support 50%, confidence 66.7%) and C -> A (support 50%, confidence 100%).

Brute-force solution: list all possible association rules, compute the support and confidence for each rule, and prune the rules that fail the minsup and minconf thresholds. This is computationally prohibitive.

Mining Association Rules. Using the five-transaction table above, example rules from the itemset {Milk, Diaper, Beer}:

    {Milk, Diaper} -> {Beer}   (s = 0.4, c = 0.67)
    {Milk, Beer} -> {Diaper}   (s = 0.4, c = 1.0)
    {Diaper, Beer} -> {Milk}   (s = 0.4, c = 0.67)
    {Beer} -> {Milk, Diaper}   (s = 0.4, c = 0.67)
    {Diaper} -> {Milk, Beer}   (s = 0.4, c = 0.5)
    {Milk} -> {Diaper, Beer}   (s = 0.4, c = 0.5)

Observations: all of the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}. Rules originating from the same itemset have identical support but can have different confidence. Thus we may decouple the support and confidence requirements: we can first find all frequent itemsets that satisfy the support requirement.

Mining Association Rules: a two-step approach. (1) Frequent itemset generation: generate all itemsets whose support >= minsup. (2) Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. Frequent itemset generation is still computationally expensive: given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation: brute-force approach. Each itemset in the lattice is a candidate frequent itemset. Count the support of each candidate by scanning the database, matching each transaction against every candidate. (Figure: transactions on the left matched against a list of candidates on the right.) Complexity is about O(N M w), where N is the number of transactions, M the number of candidates, and w the transaction width. This is expensive, since M = 2^d.

Frequent Itemset Generation Strategies. Reduce the number of candidates M: complete search has M = 2^d; use pruning techniques to reduce M. Reduce the number of transactions N: reduce the size of N as the size of the itemsets increases (used by DHP and vertical-based mining algorithms). Reduce the number of comparisons (N M): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction.

Reducing the Number of Candidates. Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}: every transaction containing {beer, diaper, nuts} also contains {beer, diaper}. The Apriori principle holds due to the following property of the support measure:

    for all X, Y: X ⊆ Y implies s(X) >= s(Y)

The support of an itemset never exceeds the support of its subsets; this is known as the anti-monotone property of support. (Figure: itemset lattice; once an itemset is found to be infrequent, all of its supersets are pruned.)

Illustrating the Apriori Principle. Items (1-itemsets) with counts: Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1. With a minimum support count of 3, Coke and Eggs are infrequent, so there is no need to generate candidate pairs involving Coke or Eggs. Pairs (2-itemsets): {Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3, {Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3. Triplets (3-itemsets): {Bread, Milk, Diaper} 3. If every subset were considered, there would be 6C1 + 6C2 + 6C3 = 41 candidates; with support-based pruning, 6 + 6 + 1 = 13.

Apriori Algorithm. Method: let k = 1 and generate frequent itemsets of length 1. Repeat until no new frequent itemsets are identified: generate length-(k+1) candidate itemsets from the length-k frequent itemsets; prune candidate itemsets containing subsets of length k that are infrequent; count the support of each candidate by scanning the DB; eliminate candidates that
are infrequent, leaving only those that are frequent.

The Apriori Algorithm: an example (minimum support count = 2). Database TDB: 10 {A, C, D}; 20 {B, C, E}; 30 {A, B, C, E}; 40 {B, E}.

    1st scan: C1 = A:2, B:3, C:3, D:1, E:3, so L1 = {A}:2, {B}:3, {C}:3, {E}:3
    C2 (from L1): AB:1, AC:2, AE:1, BC:2, BE:3, CE:2
    2nd scan: L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
    C3 = {B,C,E}; 3rd scan: L3 = {B,C,E}:2

The Apriori Algorithm, pseudocode. Ck: candidate itemsets of size k; Lk: frequent itemsets of size k.

    L1 = {frequent 1-itemsets}
    for (k = 1; Lk is not empty; k++) begin
        Ck+1 = candidates generated from Lk
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with support >= minsup
    end
    return the union of all Lk

Important details of Apriori: how to generate candidates (step 1: self-joining Lk; step 2: pruning), and how to count the supports of candidates. Example of candidate generation: L3 = {abc, abd, acd, ace, bcd}. Self-joining L3 with L3 gives abcd (from abc and abd) and acde (from acd and ace). Pruning: acde is removed because ade is not in L3. So C4 = {abcd}.

How to Generate Candidates. Suppose the items in Lk-1 are listed in an order. Step 1, self-join Lk-1:

    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2, pruning:

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if s is not in Lk-1 then delete c from Ck

Reducing the Number of Comparisons. Candidate counting: scan the database of transactions to determine the support of each candidate itemset. To reduce the number of comparisons, store the candidates in a hash structure; instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets. (Figure: the transactions hashed against buckets of candidates.) TID Items Bread Milk Bread Diaper Beer
Eggs, Milk Diaper Beer Coke, Bread Milk Diaper Beer, Bread Milk Diaper Coke (the same five transactions, hashed into buckets).

Mining Association Rules: the two-step approach, recapped. (1) Frequent itemset generation: generate all itemsets whose support >= minsup. (2) Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset; enumerate all possible rules from the frequent itemset and output those of high confidence.

Frequent-Pattern Mining: Summary. Frequent pattern mining is an important task in data mining. Scalable frequent pattern mining methods: Apriori (candidate generation and test). The Apriori property has also been used in mining other types of patterns, such as sequential and structured patterns. Problem: frequent patterns are not necessarily interesting patterns. "Bread -> milk" is not really interesting, although it has high support and confidence; many other measures exist to address this problem.

What you need to know. What are the support and confidence of a rule? The Apriori property. How to find frequent itemsets using the Apriori property: candidate generation and pruning, and why they are correct.

Lecture 2: Linear Models

Last lecture you learned what machine learning is: supervised learning, unsupervised learning, reinforcement learning. You saw an example learning problem and the general process one goes through to design a learning system, which involves determining: the type of training experience, the target function, the representation of the learned function, and the learning algorithm.

Supervised learning. Let's look at the problem of spam filtering. Some example spam emails:

"Now anyone can learn how to earn 200 943 per day or More. If you can type (hunt and peck is ok to start) and fill in forms you can score big. So don't delay waiting around for the next opportunity, it is knocking now. Start here: httpredbluecruisecomtc381polohooyz37957html"

"Do you Have Poetry that you
think should be worth 1000000 USD? We do. Enter our International Open contest and see if you have what it takes. To see details or to enter your own poem, click the link below: httpesuscribercomimyseOsAoo4q954zYYUoYQampm795314ampl0"

"View my photos: I invite you to view the following photo albums: zakmonth27. Hey, have you seen my new pics yet? Me and my girlfreind would love it if you would come chat with us for a bit. Well, join us if you interested. Join live web cam chat here: httpecommcentralcomimyseOsAoo4q9s4zYYUoYQampm825314ampl0"

Let's look at the design choices. Learning experience: past emails and whether they are considered spam or not (you could also choose to use non-spam or spam emails only, but that would require different choices later on). Target function: email -> spam or not. Representation of the function, and the learning algorithm: we will focus mostly on these two aspects in this class; in some cases you will also need to pay attention to the first two questions.

Continuing with the design choices: representation of the function email -> spam or not. First of all, how do we represent an email? Use a bag-of-words representation. This turns an email into a collection of features, e.g., where each feature describes whether a particular word is present in the email. This gives us the standard supervised classification problem typically seen in textbooks and papers. Training set: a set of examples (instances, objects) with class labels, e.g., positive (spam) and negative (non-spam). Input representation: an example is described by a set of attributes, e.g., whether a given word is present. Given an unseen email and its input representation, predict its label.

Next question: what function forms should we use? Linear Threshold Units (McCulloch and Pitts, 1943). Inputs x1, ..., xn with weights w1, ..., wn, and output

    y = 1 if w0 + Σi wi xi > 0, and -1 otherwise.

Assume each feature xj and each weight wj is a real number. The LTU computes w · x and thresholds it to produce the prediction y. Why a linear model? It is the simplest model, with fewer parameters to learn, and it is visually intuitive: drawing a straight line to separate positive
from negative. (Figure: geometric view; the line where w · x = 0 is referred to as the decision boundary.)

A Canonical Representation. Given a training example (x1, x2, ..., xm), y, transform it to (1, x1, x2, ..., xm), y. The parameter vector is then w = (w0, w1, ..., wm). The dot (inner) product takes two equal-length vectors and returns the sum of their component-wise products, so

    g(x; w) = w0 · 1 + w1 x1 + w2 x2 + ... + wm xm = w · x

or, equivalently, h(x; w) = sign(g(x; w)). Given a training set, we need to learn w. To differentiate the learned function from the true underlying function, it is common to refer to the learned function as a hypothesis; each unique set of parameter values is one hypothesis. A prediction is correct if y g(x; w) > 0, or y h(x; w) > 0. Geometrically, using the canonical representation translates to two things: (1) it increases the input-space dimension by 1, and (2) the decision boundary now always passes through the origin. (Figure: geometric view.)

How to learn: the perceptron algorithm. The equation w0 + w1 x1 + ... + wm xm = 0 defines a linear decision boundary that separates the input space into different decision regions. The goal of learning is to find a weight vector w whose decision boundary correctly separates positive examples from negative examples. How can we achieve this? The perceptron is one approach. It starts with some vector w and incrementally updates w when it makes a mistake. Let wt be the current weight vector, and suppose it makes a mistake on example (x, y), that is, y wt · x < 0. The perceptron update rule is

    wt+1 = wt + y x

Perceptron Algorithm:

    let w = (0, 0, ..., 0)
    repeat:
        accept training example i: (xi, yi)
        ui = w · xi
        if yi ui < 0: w = w + yi xi

Effect of the perceptron updating rule. Mathematically speaking,

    y wt+1 · x = y (wt + y x) · x = y wt · x + y² ‖x‖² > y wt · x

The updating rule makes y wt · x more positive, and thus can potentially correct the mistake. (Figure: geometric view of the update.)

Online vs. Batch. The perceptron algorithm above is an online algorithm: online algorithms perform learning each time they receive a training example. In contrast, batch learning algorithms collect a batch of training examples and learn from them all at once. Batch
Perceptron Algorithm. Given training examples (xi, yi), i = 1, ..., N:

    let w = (0, 0, ..., 0)
    do:
        delta = (0, 0, ..., 0)
        for i = 1 to N do:
            ui = w · xi
            if yi ui < 0: delta = delta + yi xi
        delta = delta / N
        w = w + η delta
    until ‖delta‖ < ε

Good news: if there is a linear decision boundary that correctly classifies all training examples, this algorithm will find it. Formally speaking, this is the convergence property: for linearly separable data (i.e., there exists a linear decision boundary that perfectly separates the positive and negative training examples), the perceptron algorithm converges in a finite number of steps. Why? And in how many steps? If you are mathematically or practically curious, read the following argument; the answers are in there. (The further good news is that you are not required to master this material; it is just for the curious.)

To show convergence, we just need to show that each update moves the weight vector closer to a solution vector by a lower-bounded amount. Let w* be a solution vector and wt be our w at the t-th step. Proof sketch: consider

    cosine(w*, wt) = (w* · wt) / (‖w*‖ ‖wt‖)

For the numerator, w* · wt = w* · (wt-1 + yt xt) = w* · wt-1 + yt (w* · xt). Assume that w* classifies all examples with a margin γ, i.e., y (w* · x) ≥ γ for all examples. Then

    w* · wt = w* · wt-1 + yt (w* · xt) ≥ w* · wt-1 + γ ≥ ... ≥ w* · w0 + tγ = tγ

For the denominator, since an update happens only on a mistake (yt wt-1 · xt < 0),

    ‖wt‖² = ‖wt-1 + yt xt‖² = ‖wt-1‖² + 2 yt (wt-1 · xt) + ‖xt‖² ≤ ‖wt-1‖² + D²

assuming the examples are bounded by D, i.e., ‖xt‖ ≤ D. So ‖wt‖² ≤ ‖wt-2‖² + 2D² ≤ ... ≤ tD². Putting the two together,

    1 ≥ cosine(w*, wt) = (w* · wt) / (‖w*‖ ‖wt‖) ≥ tγ / (‖w*‖ √t D) = √t γ / (‖w*‖ D)

which gives t ≤ D² ‖w*‖² / γ². γ is referred to as the margin. The bigger the margin, the easier the classification problem is, and the faster the perceptron algorithm will likely find a solution. Side story: the bigger the margin, the more confident we are about our prediction, which makes it desirable to find the boundary that gives the maximum margin. Later in the course this concept will be core to one of the most exciting recent developments in the ML field: the support vector machine.

Bad news: what about non-linearly separable cases? In such cases, the algorithm will
never stop. How do we fix this? One possible solution: look for the decision boundary that makes as few mistakes as possible. But that problem is NP-hard (refresh your CS 325 memory).

Fixing the Perceptron, idea one: only go through the data once, or a fixed number of times.

    let w = (0, 0, ..., 0)
    for i = 1 to N:
        take training example i: (xi, yi)
        ui = w · xi
        if yi ui < 0: w = w + yi xi

At least this stops. Problem: the final w might not be good; e.g., right before terminating, the algorithm might perform an update on a total outlier.

Voted Perceptron, idea two: keep the intermediate hypotheses around and have them vote (Freund and Schapire, 1998).

    let w0 = (0, 0, ..., 0); c0 = 0; n = 0
    repeat:
        take example i: (xi, yi)
        ui = wn · xi
        if yi ui < 0:
            wn+1 = wn + yi xi; cn+1 = 0; n = n + 1
        else:
            cn = cn + 1

Store the collection of linear separators w0, w1, ... along with their survival times c0, c1, ...; the c's can be good measures of the reliability of the w's. For classification, take a weighted vote among all separators.

Lecture 3, Sept 28, 2007

Summary of the Perceptron Algorithm. It learns an LTU (linear threshold unit); there are many other approaches to learning LTUs. Start with a random hypothesis (weight vector w) and update when a mistake is made; stop if no more mistakes are made, or when a pre-specified number of steps is finished. It is guaranteed to converge if the data is linearly separable; the number of steps needed to converge depends on the difficulty of the data (larger margin means faster convergence). The voted perceptron stores all old hypotheses and lets them vote when classifying new examples; better ones (longer survival times) get bigger weights for their votes.

The Multiclass Case. We learn one LTU for each class k = 1, ..., c. The training for class k is done on a transformed data set where class-k examples are considered positive and the others negative. Classify x according to hk(x) = wk · x:

    y = argmax_k hk(x)

This is called a linear machine.

Next we will see a classifier that is very simple in design but can produce very complex models.

Nearest Neighbor Algorithm. Remember all training examples. Given a new example x, find its closest training example (xi, yi) and predict yi. How do we measure distance? Euclidean:

    ‖x - xi‖² = Σj (xj - xij)²
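The nearest-neighbor rule just described (and its k-neighbor generalization discussed below) can be sketched in a few lines of Python. This is an illustrative sketch rather than course-provided code; the function names and the toy training set are my own.

```python
from collections import Counter

def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def knn_predict(train, x, k=1):
    """Predict the label of x by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs; k=1 gives the plain
    nearest-neighbor rule described above.
    """
    neighbors = sorted(train, key=lambda ex: squared_euclidean(ex[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two negative points near the origin, two positive points near (1, 1).
train = [((0.0, 0.0), -1), ((0.1, 0.2), -1), ((1.0, 1.0), 1), ((0.9, 0.8), 1)]
print(knn_predict(train, (0.2, 0.1), k=1))  # -> -1
print(knn_predict(train, (0.8, 0.9), k=3))  # -> 1
```

Note that sorting by squared distance gives the same neighbor ordering as sorting by distance, so the square root can be skipped.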
Decision Boundaries: the Voronoi Diagram. Given a set of points, a Voronoi diagram describes the area nearest to each point; these areas can be viewed as zones of control. (Figure: a Voronoi diagram.)

Decision Boundaries: a Subset of the Voronoi Diagram. Each example controls its own neighborhood. Create the Voronoi diagram; the decision boundary is formed by retaining only the line segments separating different classes. The more examples stored, the more complex the decision boundaries can become. With a large number of examples and noise in the labels, the decision boundary can become nasty.

K-Nearest Neighbor. Example with K = 4: for a new example, find the K nearest neighbors and have them vote.

Effect of K. A larger K produces a smoothing effect on the boundary and can reduce the impact of class-label noise. But when K is too large (say K = N), we always predict the majority class. Question: how do we choose K? Choose K to minimize the mistakes that we make on training examples (the training error)? (Figure: training error as K decreases from K = 20 to K = 1.)

Model selection. If we use training error to select models, we will always choose more complex ones. (Figure: error vs. model complexity, showing overfitting.)

Validation data. We can keep part of the labeled data apart as validation data. Evaluate different K's on the validation data and choose the K that minimizes validation error. Once we choose K, we can then use all the data in making predictions for new examples. (Data split: training / validation / testing.)

K-fold Cross Validation. When the labeled set is small, we might not be able to get a big enough validation set (why do we need a large validation set?). With five folds: train on S2, S3, S4, S5 and test on S1 (error ε1); train on S1, S3, S4, S5 and test on S2 (ε2); ...; train on S1, S2, S3, S4 and test on S5 (ε5); then average.

Practical issues with KNN. Suppose we want to build a model to predict a person's shoe size, using the person's height and weight to make the prediction. With P1 = (6, 175), P2 = (5.7, 168), and query PQ = (6.1, 170):

    D(PQ, P1) = sqrt(0.1² + 5²) ≈ 5.0
    D(PQ, P2) = sqrt(0.4² + 2²) ≈ 2.04

Because weight has a much larger range of values, its differences look bigger numerically. Features should be normalized to have the same range of values, e.g., [0, 1];
otherwise features with larger ranges will be treated as more important.

Practical issues with KNN, continued. Our data may also contain the GPAs; should we include this attribute in the calculation? When collecting data, people tend to collect as much information as possible, regardless of whether it is useful for the question in hand. Recognize and remove such attributes when building your classification models.

Other issues. It can be computationally expensive to find the nearest neighbors (speed up the computation using smart data structures, k-d trees, etc.). For large data sets it requires a lot of memory (remove unimportant examples).

Summary. KNN is what we call lazy learning (vs. eager learning): lazy learning only occurs when you see the test example, whereas an eager learner learns a model before seeing the test example, and the training examples can be thrown away after learning. Advantages: conceptually simple, easy to understand and explain; very flexible decision boundaries; not much "learning" at all. Disadvantages: it can be hard to find a good distance measure; irrelevant features and noise can be very detrimental; it typically cannot handle many attributes; and its computational cost is high, requiring a lot of computation and memory.

Lecture 15: Clustering, Oct 31, 2008

Unsupervised learning and pattern discovery. So far our data has been labeled: each example is a feature vector with a label, (x1, x2, x3, ..., xm)(i), y(i), for i = 1, ..., n. We will now be looking at unlabeled data: the same feature vectors (x1, x2, x3, ..., xm)(i) without the y's. What do we expect to learn from such data? "I have tons of data and need to organize it better": e.g., find subgroups; understand it better, e.g., understand interrelationships; find regular trends in it, e.g., "if A and B then C."

(Figure: the web organized by topic into categories in a web directory, with top-level categories such as Arts, Business, Computers, Games, Health, Kids and Teens, News, Recreation, Science, Shopping, Society, Sports, and World, each with subcategories.)

(Figure: hierarchical clustering as a way to organize web pages, illustrated by the Music category and its many subcategories: Bands and Artists, Charts, Instruments, Lyrics, Record Labels, Reviews, and so on.)

(Figure: finding association patterns in data, illustrated by an online bookstore page recommending machine learning textbooks that are frequently bought together.)

Clustering Example. (Figure: consumers plotted by monthly spending vs. monthly income, first unlabeled, then with subgroups marked.)

What is Clustering? In general, clustering can be viewed as an exploratory procedure for finding interesting subgroups in given data. Very commonly we focus on a special kind of clustering: group all given examples into disjoint subsets of clusters
such that examples within a cluster are very similar and examples in different clusters are very different.

Example Applications. Information retrieval: cluster retrieved documents to present more organized and understandable results. Consumer market analysis: cluster consumers into different interest groups so that marketing plans can be specifically designed for each individual group. Image segmentation: decompose an image into regions with coherent color and texture. Vector quantization for data (i.e., image) compression: group vectors into similar groups and use the group mean to represent the group members. Computational biology: group genes into co-expressed families based on their expression profiles across different tissue samples and different experimental conditions.

(Figure: vector quantization for image compression; original image 701,554 bytes vs. 127,292 bytes after compression.)

Important components in clustering. Distance/similarity measure: how do we measure the similarity between two objects? Clustering algorithm: how do we find the clusters based on the distance/similarity measures? Evaluation of results: how do we know that our clustering result is good?

Distance Measures. One of the most important questions in unsupervised learning, often more important than the choice of clustering algorithm. There is a set of commonly used distance measures; let's look at them and think about when each might be appropriate or inappropriate.

Common distance/similarity measures:

    Euclidean distance: D(x, x') = sqrt(Σi (xi - x'i)²)
    Cosine similarity: cos(x, x') = (x · x') / (‖x‖ ‖x'‖)
    More flexible measures: D(x, x') = Σi wi (xi - x'i)², where one can learn the appropriate weights wi given user guidance.

Note: we can always transform between distance and similarity using a monotonically decreasing function. How do we decide which to use? Usually we need to consider the application domain; you need to ask questions such as: what does it mean for two consumers to be similar to each other? Or for two genes to be similar to each other? Ideally, we'd like to learn a distance function from user input: ask users to provide things like "object A is similar to
object B, dissimilar to object C," and learn a distance function that correctly reflects these relationships. This is a more advanced topic that we will not cover in this class, but it is nonetheless important. When we cannot afford to learn a distance measure and don't have a clue about which distance function is appropriate, what should we do? Remember, clustering is an exploratory procedure, and it is important to explore, i.e., try different options.

Now that we have seen different ways one can measure distances or similarities between objects, how can we use them to group the objects into clusters? People have looked at many different approaches, and they can be categorized into two distinct types: hierarchical and non-hierarchical clustering.

Hierarchical clustering builds a tree-based hierarchical taxonomy, a dendrogram. (Figure: a dendrogram for "animal," with leaves such as reptile, mammal, worm, and insect.) A non-hierarchical (flat) clustering produces a single partition of the unlabeled data. Choosing a particular level in the hierarchy produces a flat clustering; recursive application of flat clustering can also produce a hierarchical clustering.

Hierarchical Agglomerative Clustering (HAC). Assumes a distance function for determining the similarity of two instances. (One can also assume a similarity function and reverse many of the operations in the algorithm to make it work with similarities.) HAC starts with each object in a separate cluster and then repeatedly joins the two closest clusters until only one cluster is left. The history of merging forms a binary tree, or hierarchy. (HAC example over objects A B C D E F G H.)

HAC Algorithm:

    start with all objects in their own cluster
    until there is only one cluster:
        among the current clusters, determine the two clusters ci and cj that are closest
        replace ci and cj with a single cluster ci ∪ cj

Problem: our distance function computes the distance between instances, but we also need to compute the distance between clusters. How?

Distance Between Clusters. Assume a distance function D(x, x') that determines the distance of two objects. There
are multiple ways to turn a distance function into a cluster distance function. Single link: the distance of the two closest members of the clusters. Complete link: the distance of the two furthest members. Average link: the average distance.

Single Link Agglomerative Clustering uses the minimum distance over all pairs:

    D(ci, cj) = min over x ∈ ci, x' ∈ cj of D(x, x')

It tends to form straggly clusters ("long islands"). Complete link uses the maximum distance over all pairs:

    D(ci, cj) = max over x ∈ ci, x' ∈ cj of D(x, x')

and makes tight, spherical clusters.

Updating the cluster distances after merging is a piece of cake. After merging ci and cj, the distance of the resulting cluster to any other cluster ck can be computed by:

    single link: D(ci ∪ cj, ck) = min(D(ci, ck), D(cj, ck))
    complete link: D(ci ∪ cj, ck) = max(D(ci, ck), D(cj, ck))

This is constant time given the previous distances.

Average Link. Basic idea: the average similarity between members of the two clusters. Two options are available. Averaged across all ordered pairs in the merged cluster:

    D(ci, cj) = 1 / (|ci ∪ cj| (|ci ∪ cj| - 1)) Σ over x ∈ ci ∪ cj Σ over x' ∈ ci ∪ cj, x' ≠ x of D(x, x')

or averaged across all pairs between the original two clusters:

    D(ci, cj) = 1 / (|ci| |cj|) Σ over x ∈ ci Σ over x' ∈ cj of D(x, x')

There is no clear difference in performance between the two. Compared to single link and complete link, average link is computationally more expensive, but it achieves a compromise between the two.

HAC creates a Dendrogram. The dendrogram draws the tree such that the height of a tree branch equals the distance between the two clusters merged at that particular step. The distances are always monotonically increasing. This can provide some understanding of how many natural groups there are in the data: a drastic height change indicates that we are merging two very different clusters together.

Lecture 12, Oct 24th, 2008

Review. Linear SVM seeks a linear decision boundary that maximizes the geometric margin. Using the concept of a soft margin, we can trade off maximizing the margin against faithfully fitting all training examples, which provides better handling of outliers/noisy training examples and of simple cases that are not linearly separable. By mapping the
Non-linear SVM
The basic idea is to map the data into a new feature space such that the data is linearly separable in that space. However, we do not need to compute the mapping explicitly. Instead we use what we call a kernel function; this is referred to as the kernel trick.

What is a kernel function?
A kernel function is k(a, b) = \langle \Phi(a), \Phi(b) \rangle, where \Phi is a mapping from the input space to the feature space.

Why use a kernel function?
The inner product in the mapped feature space, \langle \Phi(a), \Phi(b) \rangle, is computed by applying the kernel function k(a, b) to the original input vectors. This avoids the computational cost incurred by the mapping.

We have shown previously that using a kernel in the prediction stage is straightforward, but what about learning w, or more directly the \alpha_i's? Does this make any difference in the SVM learning step? Let's look at the optimization problem of the original soft-margin SVM:

\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i
subject to y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N

Knowing that w = \sum_{i=1}^{N} \alpha_i y_i x_i, we can rewrite the optimization problem in terms of inner products:

\min_{\alpha, b, \xi} \; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + C \sum_{i=1}^{N} \xi_i
subject to y_i\left(\sum_{j=1}^{N} \alpha_j y_j \langle x_j, x_i \rangle + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N

In fact, learning for SVMs is typically carried out using only the inner products of the x's, without using the original vectors. Now we just need to replace the inner products with an appropriate kernel function.

What we have seen so far
- Linear SVM: geometric margin vs. functional margin; problem formulation.
- Soft-margin SVM: include the slack variables in the optimization; C controls the trade-off.
- Nonlinear SVM using kernel functions.

Notes on Applying SVM
Many SVM implementations are available and can be found at www.kernel-machines.org/software.html. Handling a multi-class problem with SVMs requires transforming the multi-class problem into multiple binary-class problems: one-against-rest, pairwise, etc.
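The kernel trick described above can be checked numerically. For the degree-2 polynomial kernel k(a, b) = \langle a, b \rangle^2 on R^2, the explicit feature map is \Phi(x) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2); the sketch below (my own illustration, not course code) shows that the kernel computed in the 2-D input space equals the inner product in the 3-D feature space.

```python
import math

def phi(x):
    """Explicit feature map for k(a, b) = <a, b>^2 on R^2."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def kernel(a, b):
    """The same inner product, computed in the original 2-D input space."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot * dot

a, b = (1.0, 2.0), (3.0, 4.0)
explicit = sum(p * q for p, q in zip(phi(a), phi(b)))   # <Phi(a), Phi(b)> in 3-D
assert abs(explicit - kernel(a, b)) < 1e-9              # identical, mapping avoided
```

For higher-degree kernels (or the RBF kernel, whose feature space is infinite-dimensional) the saving is what makes learning tractable: the kernel is always a cheap computation on the original inputs.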
Model Selection for SVM
There are a number of model selection questions when applying SVMs: which kernel function to use, and what C parameter to use (soft margin). You can choose to use the default options provided by the software, but a more reliable approach is to use cross-validation.

Strengths vs. Weaknesses
Strengths:
- The solution is globally optimal.
- It scales well with high-dimensional data.
- It can handle non-traditional data like strings and trees instead of the traditional fixed-length feature vectors. Why? Because as long as we can define a kernel function for such input, we can apply SVM.
Weaknesses:
- Need to specify a good kernel.
- Training time can be long if you use the wrong software package.

[Table 1: training time in CPU-seconds for Pegasos vs. SVM-Light on the CCAT, Covertype, and astro-ph datasets.]

Ensemble Learning
So far we have designed learning algorithms that take a training set and output a classifier. What if we want more accuracy than current algorithms afford? We could develop a new learning algorithm, or improve existing algorithms. Another approach is to leverage the algorithms we have via ensemble methods: instead of calling an algorithm just once and using its classifier, call the algorithm multiple times and combine the resulting classifiers.

What is Ensemble Learning?
Traditional learning: a learner L applied to training set S outputs a single hypothesis h(x). Ensemble learning: different training sets and/or learning algorithms produce multiple hypotheses h_1(x), ..., h_T(x), whose predictions are combined.

Intuition: combining the predictions of multiple classifiers (an ensemble) is more accurate than a single classifier. Justification:
- It is easy to find quite good "rules of thumb"; it is hard to find a single highly accurate prediction rule.
- If the training set is small and the hypothesis space is large, there may be many equally accurate classifiers.
- The hypothesis space may not contain the true function, but it may have several good approximations of it.
- Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers.

How to Generate an Ensemble
There are a variety of methods; we will look at two of them:
- Bagging
- Boosting (AdaBoost, adaptive boosting)

Both of these methods take a single learning algorithm (we will call it the base learner) and use it multiple times to generate multiple classifiers.

Bagging (Bootstrap Aggregation) [Breiman, 1996]
- Generate a random sample from the training set S by a random resampling technique called bootstrapping.
- Repeat this sampling procedure, getting a sequence of T training sets S_1, S_2, ..., S_T.
- Learn a sequence of classifiers h_1, h_2, ..., h_T, one for each of these training sets, using the same base learner.
- To classify an unknown point x, let each classifier predict, e.g. h_1(x) = 1, h_2(x) = 1, h_3(x) = 0, ..., h_T(x) = 1, and take a simple majority vote to make the final prediction: predict the class that gets the most votes from all the learned classifiers.

Bootstrapping
S' = {}
For i = 1, ..., N (N is the total number of points in S): draw a random point from S and add it to S'.
Return S'.
This procedure samples with replacement: each time a point is drawn, it is not removed, so we can have multiple copies of the same data point in the sample. The new training set contains the same number of points as the original training set (possibly with repeats). On average, about 63.2% (i.e., 1 - 1/e) of the original points will appear in a random sample.

Bagging Example
[Figure: a 2-D data set with the true decision boundary.] The decision boundary found by the CART decision tree algorithm: note that the decision tree has trouble representing this boundary. 100 bagged trees do better: by averaging 100 trees we achieve a better approximation of the boundary, together with information regarding how confident we are about our prediction.
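The bootstrapping procedure and the majority vote above can be sketched as follows. This is a minimal sketch of my own (the function names and the {-1, +1} label convention are assumptions, not course code); it also checks empirically that roughly 1 - 1/e of the original points appear in a bootstrap sample.

```python
import random

def bootstrap(S, rng=random):
    """Sample |S| points from S uniformly with replacement."""
    return [rng.choice(S) for _ in range(len(S))]

def bagged_predict(classifiers, x):
    """Simple majority vote over {-1, +1} predictions (ties broken toward +1)."""
    return 1 if sum(h(x) for h in classifiers) >= 0 else -1

# Fraction of distinct original points in one bootstrap sample: about 0.632.
rng = random.Random(0)
S = list(range(10000))
frac = len(set(bootstrap(S, rng))) / len(S)
```

A full bagging run would call `bootstrap` T times, train the base learner on each sample, and classify new points with `bagged_predict`.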
Empirical Results for Bagging Decision Trees [Freund & Schapire]
[Figure: error rate of C4.5 vs. error rate of bagged C4.5 across benchmark data sets.]

Why can bagging improve the classification accuracy? The concept of bias and variance.

Bias/Variance for Classifiers
- Bias arises when the classifier cannot represent the true function; that is, the classifier underfits the data.
- Variance arises when the classifier overfits the data: minor variations in the training set cause the classifier to overfit differently.
Clearly you would like to have a low-bias and low-variance classifier. Typically, low-bias classifiers (overfitting) have high variance, and high-bias classifiers (underfitting) have low variance: we have a tradeoff.

Effect of Algorithm Parameters on Bias and Variance
- k-nearest neighbor: increasing k typically increases bias and reduces variance.
- Decision trees of depth D: increasing D typically increases variance and reduces bias.

(CS 434, Fall 2007)

Why Does Bagging Work?
Bagging takes the average of multiple models, which reduces the variance. This suggests that bagging works best with low-bias, high-variance classifiers.

Lecture 13, Oct 22, 2007

Critical Points of Boosting So Far
Boosting boosts the performance of a weak learner L (error rate slightly below 0.5). It does so by feeding the same training data to L, but with different weights. Each round the weights are updated: the weights of correct examples go down, and the weights of misclassified examples go up. Given the weights, the learner L now tries to minimize the weighted error.

Weighted Error
It is often straightforward to convert a base learner to take into account an input distribution D (the weights): decision trees, k-nearest neighbor, naive Bayes. When it is not straightforward, we can resample the training data S according to D and then feed the new data set into the learner.
[Slides: worked examples of a decision tree and naive Bayes trained on four examples with weights 0.2, 0.5, 0.1, 0.2.]

Overfitting
Boosting drives the training error to zero; will it overfit? A curious phenomenon:
[Figure: training and test error (%) vs. the number of rounds T, from 10 to 1000; the test error keeps decreasing after the training error reaches zero.]
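The round-by-round weight update described under "Critical points of boosting" can be sketched as follows. The notes do not spell out the update rule, so this uses the standard AdaBoost form with \alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}; treat the function name and that choice as my assumptions.

```python
import math

def boost_round(weights, correct):
    """One AdaBoost weight update (a sketch).

    weights: current distribution over training examples (sums to 1).
    correct: booleans, True where the weak learner classified correctly.
    Returns (alpha, new_weights).
    """
    eps = sum(w for w, c in zip(weights, correct) if not c)   # weighted error
    assert 0.0 < eps < 0.5, "weak learner must beat random guessing"
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Scale mistakes up by e^alpha, correct examples down by e^-alpha, renormalize.
    scaled = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    z = sum(scaled)
    return alpha, [w / z for w in scaled]
```

For example, with uniform weights over four examples and one mistake (\epsilon = 0.25), the misclassified example's weight rises to 0.5 after one round, so the next weak learner must pay attention to it.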
Boosting is often robust to overfitting (though not always): the test error continues to decrease even after the training error goes to zero.

Explanation with Margins
f(x) = \frac{\sum_t w_t h_t(x)}{\sum_t w_t}, \qquad \text{margin} = y \cdot f(x)
[Figure: histogram of the functional margin for the ensemble just after achieving zero training error.]

Effect of Boosting: Maximizing the Margin
[Figure: after further rounds, no examples have small margins.] Even after zero training error, the margin of the examples increases. This is one reason the generalization error may continue decreasing.

Bias/Variance Analysis of Boosting
In the early iterations, boosting is primarily a bias-reducing method; in later iterations, it appears to be primarily a variance-reducing method.

What You Need to Know About Ensemble Methods
- Bagging: a randomized algorithm based on bootstrapping. What is bootstrapping? Variance reduction. What learning algorithms will be good for bagging?
- Boosting: combine weak classifiers (slightly better than random); training uses the same data set but different weights. How to update the weights; how to incorporate weights into learning (decision trees, kNN, naive Bayes). One explanation for not overfitting: maximizing the margin.

Feature Selection
(Notes adapted from Ben Blum's notes, http://www.cs.berkeley.edu/….)

So far we have assumed that we use all attributes/features available to us in learning. Is this always a good idea?

What is Feature Selection?
Reducing the feature space by throwing out some of the features; also called attribute selection. Motivating idea: try to find a simple model. Occam's razor: the simplest explanation that accounts for the data is best.

Example 1. Task: classify whether a document is about cats. Data: counts of each word in the document (cat, and, it, kitten, electric, trouble, then, several, feline, while, lemon, ...). Reduced x: (cat, kitten, feline).

Example 2. Task: predict the chances of lung disease. Data: a medical history survey with features Vegetarian, Plays video games, Family history, Athletic, Smoker, Sex, Lung capacity, Hair color, ...; e.g. x = (No, Yes, No, No, Yes, Male, 5.8 L,
Red, Audi, 185 lbs). Reduced x: (Family history, Smoker) = (No, Yes).

Why Do It? Two Cases
- Case 1: We are interested in the features themselves; we want to know which are relevant. If we fit a model, it should be interpretable.
- Case 2: We are interested in prediction; the features are not interesting in themselves, we just want to build a good classifier (or other kind of predictor).

Why Do It? Case 1
We want to know which features are relevant; we don't necessarily want to do prediction.
- What causes lung cancer? Features are aspects of a patient's medical history; the binary response variable is whether the patient developed lung cancer. Which features best predict whether lung cancer will develop? We might want to legislate against those features.
- What causes a program to crash? [Alice Zheng '03, '04, '05] Features are aspects of a single program execution: which branches were taken, what values functions returned. The binary response variable is whether the program crashed. Features that predict crashes well are probably bugs.

Why Do It? Case 2
We want to build a good predictor.
- Text classification: features for all 10^5 English words, and maybe all word pairs. Common practice: throw in every feature you can think of and let feature selection get rid of the useless ones. Training is too expensive with all features, and the presence of irrelevant features may cause overfitting.
- Classification of leukemia tumors from microarray gene expression data [Xing, Jordan & Karp '01]: 72 patients (data points), 7130 features (expression levels of different genes).
- Embedded systems with limited resources: the classifier must be compact. Voice recognition on a cell phone; branch prediction in a CPU (4K code limit).

Filtering
Simple techniques for weeding out irrelevant features without considering the classifier that we are using. Basic idea: assign a score to each feature f indicating how related x_f and y are. Intuition: if x_{if} = y_i for all i, then f is good no matter what our classifier is. There are many popular scores, including one we already know: information gain. Then somehow pick a fixed number of the highest-scoring features to
keep.

Comparison of Filtering Methods for Text Categorization [Yang & Pedersen, '97]
[Figure 1: average precision of kNN vs. the number of features (unique words) kept, from 16,000 down to 0, for several scoring methods including information gain (IG), document frequency (DF), term strength (TS), CHI-max, and MI-max.]

Filtering: Advantages and Disadvantages
Advantages:
- Very fast.
- Simple to apply.
Disadvantages:
- Doesn't take into account which learning algorithm will be used.
- Doesn't take into account interactions between features. This can be a significant disadvantage (why?).
Suggestion: use filtering for initial screening.
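The filtering recipe above (score each feature, keep the top k) can be sketched for binary features and binary labels, using information gain as the score. This is my own minimal sketch; the function names are assumptions, and a real system would handle multi-valued features and smoothing.

```python
import math

def entropy(labels):
    """H(y) for binary labels given as a list of 0/1 values."""
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(column, labels):
    """IG(y; x_f) = H(y) - sum_v P(x_f = v) * H(y | x_f = v), binary feature."""
    n = len(labels)
    gain = entropy(labels)
    for v in (0, 1):
        subset = [y for x, y in zip(column, labels) if x == v]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

def filter_features(X, y, k):
    """Score every feature column by information gain; keep the top-k indices."""
    scores = [(information_gain([row[f] for row in X], y), f)
              for f in range(len(X[0]))]
    scores.sort(reverse=True)
    return [f for _, f in scores[:k]]
```

Note how this exhibits exactly the disadvantage mentioned above: each feature is scored on its own, so two features that are only informative jointly (e.g., an XOR pattern) would both score zero and be discarded.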