# MACHINE LEARNING CSCI 5622

This 98-page set of class notes was uploaded by Allie West II on Thursday, October 29, 2015. The notes belong to CSCI 5622 at the University of Colorado at Boulder, taught by Gregory Grudic in Fall. Since its upload, it has received 24 views. For similar materials see /class/231984/csci-5622-university-of-colorado-at-boulder in Computer Science at the University of Colorado at Boulder.


Date Created: 10/29/15

## Machine Learning Class Projects (Greg Grudic)

### General Info
- You must choose a project by next week.
- The project is due on December 9. Submission is via email:
  - a 7–8 page writeup in NIPS format (see nips.cc),
  - source code used, if any.

### Density Estimation Algorithm in Homework 4
- Implement:
  - sparse models,
  - alpha optimization,
  - extension to hundreds of thousands of training examples.
- Test on standard datasets and compare to published results.

### Mobile Robot Path Following in Outdoor Environments: Project 1
- Apply many algorithms to more data of the type used in Homework 4.
- Use the WEKA toolkit, and perhaps LIBSVM.
- Report on:
  - which gives the best error rates,
  - which model is the fastest to evaluate.
- Close the loop with the best model on the real robot (limited test): I just want you to take a model and put it into the loop.

### Mobile Robot Path Following in Outdoor Environments: Project 2
- Close the loop using the algorithm in Homework 4. Here I expect more tests, because you are using the algorithm you developed in Homework 4.
- Experiment with various types of features extracted from the image:
  - grey scale,
  - normalized color,
  - other color representations (YUV),
  - how big should the window be?
- Report on which features work best.

### Mobile Robot Path Following in Outdoor Environments: Project 3
- Build a regression model that maps images to steering angles.
  - Stick to one or two algorithms for learning the model; maybe just the polynomial cascade algorithm.
- Use human examples (i.e., you teleoperating the robot) as training data.
- Close the loop on the real robot.
  - Report on at least 2 different experiments.
- Experiment with various types of features extracted from the image.

### Your Own Project 1
- Pick a paper from a recent NIPS conference (nips.cc has all proceedings) and implement the algorithm.
  - The algorithm should not have source code available.
- Repeat experiments done in the paper (perhaps not all) and add at least one more.
- Give me your opinion of the algorithm.

### Your Own Project 2
- Start with novel data.
- Apply at least 5 different learning algorithms to the dataset and determine which is best.
- Use WEKA or other toolkits you can download from the web.
### Reinforcement Learning for Mobile Robot Path Following in Outdoor Environments
- Learning the Q function, starting from human examples.

## Support Vector Machines (Greg Grudic; notes borrowed from Bernhard Schölkopf)

### Today's Lecture Goals
- Support Vector Machine classification.
- A good text on SVMs: Bernhard Schölkopf and Alex Smola, *Learning with Kernels*, MIT Press, Cambridge, MA, 2002.

### Support Vector Machine Classification
- Classification as a problem of finding optimal canonical linear hyperplanes.
- Optimal linear separating hyperplanes:
  - in feature space,
  - in kernel space.

### Linear Separating Hyperplanes
- How many lines can separate these points? Which line should we use? (Figure: two linearly separable point sets.)
- Given data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$, finding a separating hyperplane can be posed as a constraint satisfaction problem: find $\mathbf{w}$ such that, for all $i \in \{1,\ldots,N\}$,
$$\mathbf{w}^T\mathbf{x}_i + b \ge +1 \ \text{ if } y_i = +1, \qquad \mathbf{w}^T\mathbf{x}_i + b \le -1 \ \text{ if } y_i = -1,$$
or equivalently
$$y_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) - 1 \ge 0 \quad \forall i.$$

### Margin for the Canonical Hyperplane
- Take $\mathbf{x}_1$ on the hyperplane $\langle\mathbf{w},\mathbf{x}\rangle + b = +1$ and $\mathbf{x}_2$ on $\langle\mathbf{w},\mathbf{x}\rangle + b = -1$. Subtracting the two equations gives
$$\langle\mathbf{w},\, \mathbf{x}_1 - \mathbf{x}_2\rangle = 2 \;\Rightarrow\; \left\langle \frac{\mathbf{w}}{\|\mathbf{w}\|},\, \mathbf{x}_1 - \mathbf{x}_2 \right\rangle = \frac{2}{\|\mathbf{w}\|},$$
so the margin (the distance between the two hyperplanes) is $2/\|\mathbf{w}\|$.

### How Do SVMs Choose the Boundary?
- Find the boundary with the maximum margin.

### SVM Constraint Optimization Problem
- Given data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$, minimize $\|\mathbf{w}\|^2$ subject to
$$y_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) - 1 \ge 0 \quad \forall i.$$

### The Lagrange Function Formulation
- Introduce Lagrange multipliers $\alpha_i \ge 0$ and a Lagrangian
$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i\left(\langle\mathbf{w},\mathbf{x}_i\rangle + b\right) - 1 \right).$$
- $L$ has to be minimized with respect to the primal variables $\mathbf{w}$ and $b$, and maximized with respect to the dual variables $\alpha_i$.
- If a constraint is violated, then $y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) - 1 < 0$, so $\alpha_i$ will grow to increase $L$; meanwhile $\mathbf{w}$ and $b$ want to decrease $L$, i.e., they have to change such that the constraint is satisfied. If the problem is separable, this ensures that $\alpha_i < \infty$.
- Similarly, if $y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) - 1 > 0$, then $\alpha_i = 0$; otherwise $L$ could be increased by decreasing $\alpha_i$ (KKT conditions). (B. Schölkopf, Canberra, February 2002.)
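The margin identity above is easy to verify numerically. Below is a minimal sketch in Python/NumPy (an illustrative choice on my part; the course itself uses toolkits like WEKA and LIBSVM), with a hand-picked canonical hyperplane and points placed on the two margin hyperplanes:

```python
import numpy as np

# Canonical separating hyperplane <w, x> + b = 0 in R^2.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0

# x1 sits on the +1 margin hyperplane (<w, x1> + b = +1),
# x2 sits on the -1 margin hyperplane (<w, x2> + b = -1).
x1 = np.array([1.0, 0.0])
x2 = np.array([1.0, -0.5])
assert np.isclose(w @ x1 + b, 1.0)
assert np.isclose(w @ x2 + b, -1.0)

# Subtracting the two margin equations gives <w, x1 - x2> = 2 ...
assert np.isclose(w @ (x1 - x2), 2.0)

# ... so projecting onto the unit normal gives the margin 2 / ||w||.
margin = (w / np.linalg.norm(w)) @ (x1 - x2)
assert np.isclose(margin, 2.0 / np.linalg.norm(w))
```

Here the margin comes out to $2/5 = 0.4$, independent of which pair of points on the two hyperplanes we choose.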
### Derivation of the Dual Problem
- At the saddle point (extremum),
$$\frac{\partial}{\partial b} L(\mathbf{w},b,\boldsymbol{\alpha}) = 0, \qquad \frac{\partial}{\partial \mathbf{w}} L(\mathbf{w},b,\boldsymbol{\alpha}) = 0,$$
which gives the conditions
$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad \mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i.$$
- Substitute into $L(\mathbf{w},b,\boldsymbol{\alpha})$ to get the dual problem.

### Dual Problem
- Maximize
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle\mathbf{x}_i, \mathbf{x}_j\rangle$$
subject to
$$\alpha_i \ge 0, \quad i = 1,\ldots,N, \qquad \sum_{i=1}^{N}\alpha_i y_i = 0.$$

### Support Vector Expansion (1)
$$\mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i$$
- Either $y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) > 1$, in which case $\alpha_i = 0$ and $\mathbf{x}_i$ is irrelevant, or $y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) = 1$ (on the margin) and $\mathbf{x}_i$ is a support vector.

### Support Vector Expansion (2)
$$f(\mathbf{x}) = \operatorname{sgn}\left(\langle\mathbf{w},\mathbf{x}\rangle + b\right).$$
Substituting $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$:
$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i \langle\mathbf{x}_i,\mathbf{x}\rangle + b\right).$$

### What Are the Support Vectors?
- (Figure: the points lying on the two margin hyperplanes of the maximized margin.)

### Why Do We Want a Model with Only a Few SVs?
- Leaving out an example that does not become an SV gives the same solution.
- Theorem (Vapnik and Chervonenkis, 1974): let $SV(N)$ be the number of SVs obtained by training on $N$ examples randomly drawn from $P(X,Y)$, and $E$ be an expectation. Then
$$E\left[\text{Prob(test error)}\right] \le \frac{E[SV(N)]}{N}.$$

### What Happens When Data Is Not Separable: Soft Margin SVM
- Add a slack variable:
$$\xi_i = \begin{cases} 0 & \text{if } \mathbf{x}_i \text{ is correctly classified,} \\ \text{distance to margin} & \text{otherwise.} \end{cases}$$

### Soft Margin SVM Constraint Optimization Problem
- Given data $(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_N,y_N)$, minimize
$$\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i$$
subject to
$$y_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) \ge 1 - \xi_i, \quad \forall i = 1,\ldots,N.$$

### Dual Problem (Non-separable Data)
- Maximize
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \langle\mathbf{x}_i,\mathbf{x}_j\rangle$$
subject to
$$0 \le \alpha_i \le C, \quad i=1,\ldots,N, \qquad \sum_{i=1}^{N}\alpha_i y_i = 0.$$

### Same Decision Boundary
$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i \langle\mathbf{x}_i,\mathbf{x}\rangle + b\right).$$

### Mapping into Nonlinear Space
$$\Phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x_1, x_2) \mapsto (z_1, z_2, z_3) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right).$$
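For a tiny toy problem the dual can be solved directly. The sketch below (my own illustrative NumPy construction, not code from the course) uses two points; the equality constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2$, so a one-dimensional gradient ascent suffices, after which the support vector expansion recovers $\mathbf{w}$ and $b$:

```python
import numpy as np

# Two-point toy problem: x1 = (0,0) with y1 = -1, x2 = (1,0) with y2 = +1.
X = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([-1.0, 1.0])

# Dual: maximize W(a) = sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
# subject to a_i >= 0 and sum_i a_i y_i = 0 (which here forces a1 = a2 = t).
G = (y[:, None] * X) @ (y[:, None] * X).T     # G_ij = y_i y_j <x_i, x_j>

t = 0.0
for _ in range(1000):
    a = np.array([t, t])
    grad = np.ones(2) - G @ a                  # dW/da_i
    t = max(0.0, t + 0.1 * grad.sum())         # ascend along a1 = a2 = t
alpha = np.array([t, t])

# Support vector expansion: w = sum_i a_i y_i x_i.
w = (alpha * y) @ X
# b from a point on the margin: y2 (<w, x2> + b) = 1.
b = y[1] - w @ X[1]
```

Both points end up with $\alpha_i > 0$: with only two examples, both are support vectors, and they sit exactly on the margin hyperplanes $\langle\mathbf{w},\mathbf{x}\rangle + b = \pm 1$.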
### Nonlinear Data
- In the feature space the boundary is linear:
$$f(\mathbf{x}) = \operatorname{sgn}\left(w_1 x_1^2 + w_2\sqrt{2}\,x_1 x_2 + w_3 x_2^2 + b\right).$$
(Figure: a data set that is not linearly separable in $\mathbb{R}^2$ becomes linearly separable in $\mathbb{R}^3$.)

### Nonlinear SVMs: Key Idea
- Note that both the decision boundary and the dual optimization formulation rely on dot products in input space only, $\langle\mathbf{x}_i,\mathbf{x}\rangle$:
$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i \langle\mathbf{x}_i,\mathbf{x}\rangle + b\right),$$
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \langle\mathbf{x}_i,\mathbf{x}_j\rangle.$$

### Kernel Trick
- Replace $\langle\mathbf{x}_i,\mathbf{x}_j\rangle$ with
$$K(\mathbf{x}_i,\mathbf{x}_j) = \langle\Phi(\mathbf{x}_i),\Phi(\mathbf{x}_j)\rangle.$$
- We can use the same algorithms in nonlinear kernel space.

### Nonlinear SVMs
- Maximize
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i,\mathbf{x}_j).$$
- Boundary:
$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}) + b\right).$$

### Need Mercer Kernels
$$K(\mathbf{x}_i,\mathbf{x}_j) = \langle\Phi(\mathbf{x}_i),\Phi(\mathbf{x}_j)\rangle = \langle\Phi(\mathbf{x}_j),\Phi(\mathbf{x}_i)\rangle = K(\mathbf{x}_j,\mathbf{x}_i).$$

### Gram (Kernel) Matrix
- Training data $(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_N,y_N)$:
$$K = \begin{pmatrix} K(\mathbf{x}_1,\mathbf{x}_1) & \cdots & K(\mathbf{x}_1,\mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ K(\mathbf{x}_N,\mathbf{x}_1) & \cdots & K(\mathbf{x}_N,\mathbf{x}_N) \end{pmatrix}$$
- Properties: positive definite matrix, symmetric, positive on the diagonal, $N$ by $N$.

### Commonly Used Mercer Kernels
- Polynomial: $K(\mathbf{x}_i,\mathbf{x}_j) = \left(\langle\mathbf{x}_i,\mathbf{x}_j\rangle + c\right)^d$
- Sigmoid: $K(\mathbf{x}_i,\mathbf{x}_j) = \tanh\left(\kappa\langle\mathbf{x}_i,\mathbf{x}_j\rangle + \theta\right)$
- Gaussian: $K(\mathbf{x}_i,\mathbf{x}_j) = \exp\left(-\dfrac{\|\mathbf{x}_i-\mathbf{x}_j\|^2}{2\sigma^2}\right)$

### Soft Margin SVMs
- C-SVM: for $C > 0$, minimize
$$\tau(\mathbf{w},\boldsymbol{\xi}) = \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{m}\sum_{i=1}^{m}\xi_i$$
subject to $y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ (margin $2/\|\mathbf{w}\|$).
- $\nu$-SVM: for $0 \le \nu < 1$, minimize
$$\tau(\mathbf{w},\boldsymbol{\xi},\rho) = \frac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m}\xi_i$$
subject to $y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) \ge \rho - \xi_i$, $\xi_i \ge 0$ (margin $2\rho/\|\mathbf{w}\|$).

### Duals Using Kernels
- C-SVM dual: maximize
$$W(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, k(\mathbf{x}_i,\mathbf{x}_j)$$
subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.
- $\nu$-SVM dual: maximize
$$W(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, k(\mathbf{x}_i,\mathbf{x}_j)$$
subject to $0 \le \alpha_i \le \frac{1}{m}$, $\sum_i \alpha_i y_i = 0$, and $\sum_i \alpha_i \ge \nu$.
- In both cases, decision function:
$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_i \alpha_i y_i\, k(\mathbf{x},\mathbf{x}_i) + b\right).$$

### SVM Training
- Naive approach: the complexity of maximizing
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j\, k(\mathbf{x}_i,\mathbf{x}_j)$$
scales with the third power of the training set size $m$.
- Only SVs are relevant, so only compute $k(\mathbf{x}_i,\mathbf{x}_j)$ for SVs; extract them iteratively by cycling through the training set in chunks.
- In fact, one can use chunks which do not even contain all SVs; maximize over these sub-problems using your favorite optimizer.
- The extreme case: by making the sub-problems very small (just two points), one can solve them analytically.
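The listed Gram-matrix properties can be checked directly. A short sketch (Python/NumPy, my illustrative choice) that builds the Gaussian-kernel Gram matrix for random data and verifies symmetry, positive diagonal, and positive semi-definiteness:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))       # 20 training points in R^3
sigma = 1.5

# Gaussian (RBF) Mercer kernel: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Properties of a Gram matrix built from a Mercer kernel:
assert np.allclose(K, K.T)                      # symmetric
assert np.allclose(np.diag(K), 1.0)             # positive diagonal (= 1 for RBF)
assert np.linalg.eigvalsh(K).min() > -1e-10     # positive (semi-)definite
```

The eigenvalue check is exactly Mercer's condition in finite-sample form: any matrix of pairwise kernel evaluations must have no negative eigenvalues.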
### Handwritten Character Benchmark: MNIST — an SVM Success Story
- 60,000 training and 10,000 test images; dimension $d = 28 \times 28$.

### Results on Test Data

| Classifier | Test error |
| --- | --- |
| Linear classifier | 8.4% |
| 3-nearest-neighbour | 2.4% |
| SVM | 1.4% |
| Tangent distance | 1.1% |
| LeNet4 | 1.1% |
| Boosted LeNet4 | 0.7% |
| Translation-invariant SVM | 0.56% |

- The SVM used a polynomial kernel of degree 9.

### SVM Model Structure
- (Figure: classification $f(\mathbf{x}) = \operatorname{sgn}\left(\sum_i \lambda_i\, k(\mathbf{x},\mathbf{x}_i) + b\right)$ drawn as a network: an input vector is compared with the support vectors via a kernel, e.g. $k(\mathbf{x},\mathbf{x}_i) = \langle\mathbf{x},\mathbf{x}_i\rangle^d$ or $k(\mathbf{x},\mathbf{x}_i) = \exp\left(-\|\mathbf{x}-\mathbf{x}_i\|^2/c\right)$, and the results are combined by weights $\lambda_i$.)

## Machine Learning CSCI 5622, Fall 2004 (Greg Grudic)

### Admin Stuff
- Course textbook: *The Elements of Statistical Learning* by Hastie, Tibshirani, and Friedman.
- Grading:
  - Homework: 45%
  - Project: 25%
  - Class participation: 10%
  - Midterm exam: 20%
- Course workload outside of class: 4 to 5 hours per week.
- Midterm: Nov 17. Test-based; it will consist of general questions on the machine learning algorithms covered in class. You will not be required to derive algorithms or prove theorems.
- Project: pick by Nov 3, due Dec 10. Implementation of an algorithm not covered in class, from a recent research paper; you will be required to summarize the algorithm and evaluate it empirically.
- Class participation: this consists of showing up for class and asking questions (questions by email also count as class participation).

### Goal of the Course
- A fundamental understanding of the basic concepts behind machine learning: what does it mean for a machine to learn?
- You will be able to read current research papers in ML after completing this course.
- Why is machine learning important? ML algorithms are at the heart of many modern computer applications.

### Where Can ML Be Found?
- Marketing: who should a company target for advertising?
- Profiling: is a given passenger likely to attack a plane?
- User interfaces: making it easier to interact with a PC by anticipating what I am doing.
- Document characterization: searching the web.
- Bioinformatics: the human genome project; which genes are responsible for the cancer in my family?
- Data mining: data doubles every year, and databases grow with it; ML algorithms are used to make sense of this data.
- Also: economics, medical diagnosis, robotics, computer vision, manufacturing, inventory control, elevator operation.

### ML Is Part of Artificial Intelligence
What is AI?
- "The automation of activities that we associate with human thinking, activities such as decision-making, problem solving, learning." (Bellman, 1978)
- "The study of mental faculties through the use of computational models." (Charniak and McDermott, 1985)
- "The study of how to make computers do things at which, at the moment, people are better." (Rich and Knight, 1991)
- "The branch of computer science that is concerned with the automation of intelligent behavior." (Luger and Stubblefield, 1993)

### My Personal View of AI
- I want to build a robot that will:
  - clean my house,
  - cook when I don't want to,
  - fix my car, or take it to be fixed,
  - i.e., do the things that I don't feel like doing.
- Therefore, AI is to me the science of building machines (agents) that act rationally with respect to a goal.

### What Is a Rational Agent?
- An agent is an entity that perceives and acts.
- A rational agent is one that does the right thing: that which is expected to maximize goal achievement, given the available information.
- This is not a new idea. Aristotle, *Nicomachean Ethics*: "Every art and every inquiry, and similarly every action and pursuit, is thought to aim at some good."
### Elements of AI
- Representation, Reasoning, Learning. (Figure: learning drawn as encompassing the other elements.)

### Why Does Learning Encompass Reasoning?
- How can I reason rationally about a world I know nothing about? I can gain knowledge about a world by sampling it and learning from those samples.
- Fundamental lesson of AI, learned in the 1980s: one cannot hand-code knowledge about anything but trivial problem domains.
  - Expert systems largely failed because an expert (e.g., a doctor) doesn't know how to formalize and code what makes her an expert.
  - For example: I'm an expert on chairs, but I can't (and no one can!) write a program that identifies chairs in an image. ML techniques can.

### What Is Machine Learning?
- "The goal of machine learning is to build computer systems that can adapt and learn from their experience." — Tom Dietterich
- What does this mean? When are ML algorithms NOT needed?

### A Generic System
- A system with:
  - input variables $\mathbf{x} = (x_1, x_2, \ldots, x_N)$,
  - hidden variables $\mathbf{h} = (h_1, h_2, \ldots, h_K)$,
  - output variables $\mathbf{y} = (y_1, y_2, \ldots)$.

### Another Definition of Machine Learning
- Machine learning algorithms discover the relationships between the variables of a system (input, output, and hidden) from direct samples of the system.
- These algorithms originate from many fields: statistics, mathematics, theoretical computer science, physics, neuroscience, etc.

### When Are ML Algorithms NOT Needed?
- When the relationships between all system variables (input, output, and hidden) are completely understood.
- This is NOT the case for almost any real system.

### Main Subfields of Machine Learning
- Supervised learning:
  - classification,
  - regression.
- Semi-supervised (transduction) learning.
- Active learning.
- Reinforcement learning.
- Unsupervised learning.

### Supervised Learning
- Given training examples $(\mathbf{x}_1, f(\mathbf{x}_1)), (\mathbf{x}_2, f(\mathbf{x}_2)), \ldots, (\mathbf{x}_P, f(\mathbf{x}_P))$ of some unknown function (system) $y = f(\mathbf{x})$:
- Find $\hat{f}(\mathbf{x})$, i.e., an approximation.
- Predict $y' = \hat{f}(\mathbf{x}')$, where $\mathbf{x}'$ is not in the training set.
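The supervised-learning setup above can be made concrete in a few lines. A minimal sketch (Python/NumPy, my own illustrative construction): the "unknown" function is only observed through samples, a crude approximation $\hat{f}$ is built from them, and a prediction is made at a point not in the training set:

```python
import numpy as np

# The "unknown system" y = f(x); in practice we only ever see samples of it.
f = lambda x: np.sin(3 * x)

rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=200)
y_train = f(X_train)

# A minimal approximation f_hat: 1-nearest-neighbour lookup in the samples.
def f_hat(x_new):
    return y_train[np.argmin(np.abs(X_train - x_new))]

# Predict y' = f_hat(x') where x' is not in the training set.
x_prime = 0.37
err = abs(f_hat(x_prime) - f(x_prime))   # small, since the samples are dense
```

Nearest-neighbour lookup is about the simplest possible $\hat{f}$; the point is only that the learner never sees $f$ itself, just the training pairs.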
### Two Types of Supervised Learning
- Classification: $y \in \{c_1, c_2, \ldots, c_N\}$.
  - The model output is a prediction that the input belongs to some class.
  - If the input is an image, the output might be "face", "dog", "boat", etc.
- Regression: $y \in \mathbb{R}$.
  - The output has infinitely many values.
  - If the input is stock features, the output could be a prediction of tomorrow's stock price.

### Learning Classification Models
- Collect training data; build a model (e.g., separating "happy" from "mad" regions of a high-dimensional feature space); make a prediction.

### Learning Regression Models
- Collect training data; build a model (e.g., stock value as a function of the feature space); make a prediction.

### Examples of Supervised Learning
- Credit risk assessment: $\mathbf{x}$ = properties of customer and proposed purchase; $f(\mathbf{x})$ = approve purchase or not.
- Disease diagnosis: $\mathbf{x}$ = properties of patient (symptoms, lab tests); $f(\mathbf{x})$ = disease (or maybe recommended therapy).
- Face recognition: $\mathbf{x}$ = image of a person's face; $f(\mathbf{x})$ = name of the person.
- Automated vehicle driving: $\mathbf{x}$ = image of the road; $f(\mathbf{x})$ = throttle, brake, and steering commands.

### Appropriate Applications for Supervised Learning
- Situations where there is no human expert: $\mathbf{x}$ = bond graph for a new molecule; $f(\mathbf{x})$ = predicted binding strength to the AIDS protease molecule.
- Situations where humans can perform the task but can't describe how they do it: $\mathbf{x}$ = bitmap picture of a handwritten character; $f(\mathbf{x})$ = ASCII code of the character.
- Situations where the desired function is changing frequently: $\mathbf{x}$ = description of stock prices; $f(\mathbf{x})$ = recommended stock transactions.
- Situations where each user needs a customized function $f$: $\mathbf{x}$ = incoming email message; $f(\mathbf{x})$ = importance score for presenting to the user, or deleting without presenting.
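The classification/regression distinction is purely about the output type: the same inputs can feed either. A small sketch (Python/NumPy, illustrative; the data and targets are invented) fits a least-squares regression model and then derives a two-class label from the same quantity:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))

# Regression target: infinitely many possible values (y in R).
y_reg = 2.0 * X[:, 0] - 1.0 * X[:, 1]
# Classification target: one of a finite set of class labels.
y_cls = np.where(y_reg > 0, "up", "down")

# Least-squares linear regression via the normal equations (with a bias column).
A = np.c_[X, np.ones(len(X))]
beta, *_ = np.linalg.lstsq(A, y_reg, rcond=None)

x_new = np.array([1.0, 1.0, 1.0])          # a new (augmented) input
pred_reg = x_new @ beta                     # regression: a real number
pred_cls = "up" if pred_reg > 0 else "down" # classification: a label
```

Since `y_reg` here is exactly linear in the inputs, the fitted coefficients recover the generating weights $(2, -1, 0)$; thresholding the regression output turns it into a classifier.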
### Semi-Supervised Learning (Transduction)
- Given training examples $(\mathbf{x}_1, f(\mathbf{x}_1)), \ldots, (\mathbf{x}_P, f(\mathbf{x}_P))$ of some unknown function (system) $y = f(\mathbf{x})$, and examples of inputs that require classification, $\mathbf{x}'_1, \ldots, \mathbf{x}'_M$:
- Predict $y'_1 = f(\mathbf{x}'_1), \ldots, y'_M = f(\mathbf{x}'_M)$.
- Reference: "Learning with Local and Global Consistency", Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf, NIPS 2003. (Figure: toy data, two moons, RBF kernel.)

### Active Learning
- Premise: data is expensive to collect (e.g., most experiments in biology). (Figure: a dataset in a high-dimensional feature space.)

### Reinforcement Learning (RL)
- An autonomous agent learns to act optimally, without human intervention.
- The agent learns by stochastically interacting with its environment, getting infrequent rewards.
- Goal: maximize the infrequent reward.
- RL addresses the temporal credit assignment problem: delayed reward is a HARD problem.
- Successful RL applications:
  - TD-gammon (Tesauro),
  - packing containers (Moore),
  - elevator dispatch (Crites and Barto).

### RL in Robotics
- Hit an obstacle: get a negative reward. Reach the goal: get a positive reward. Reach the goal faster: get a bigger positive reward.
- Even simple robotic tasks are difficult to program by hand.
- Does ML work on actual robots? Yes. (Figures: a simple robotics problem and a more complex example.)

### Unsupervised Learning
- Studies how input patterns can be represented to reflect the statistical structure of the overall collection of input patterns.
- No outputs are used (unlike supervised learning and reinforcement learning); the unsupervised learner brings to bear prior biases as to what aspects of the structure of the input should be captured in the output.

### Unsupervised Learning Example
- Collect training data (e.g., consumer info); build a model grouping things that are similar in a high-dimensional feature space.
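A standard way to "group things that are similar" with no labels at all is k-means clustering (my choice of algorithm for illustration; the slide does not name one). A minimal NumPy sketch on synthetic two-group "consumer" data:

```python
import numpy as np

# Unlabelled data drawn from two well-separated groups; no outputs are
# given -- the learner must find the structure on its own.
rng = np.random.default_rng(3)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([group_a, group_b])

# Minimal k-means (k = 2): alternate point assignment and centroid update.
centroids = X[[0, -1]].copy()               # one seed from each region
for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

centers_sorted = centroids[centroids[:, 0].argsort()]
# The recovered cluster centers sit near the true group means (0,0) and (5,5).
```

The algorithm never sees which group a point came from, yet the converged centroids land on the two group means: the "prior bias" here is simply that clusters are compact in Euclidean distance.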
### Locally Linear Embedding (LLE)
- (Figure: LLE example.)

### Conclusion
- How many people have I scared away?
- Class schedule.
- Who's in my class? Please email me the survey on the class web page by the end of the week (a text file, please).

## Support Vector Machines (Greg Grudic; notes borrowed from Bernhard Schölkopf)

### Today's Lecture Goals
- Support Vector Machine classification:
  - same assumptions on data,
  - different optimization problem.
- Support Vector Machine regression:
  - same assumptions on data,
  - different optimization problem.
- A good text on SVMs: Bernhard Schölkopf and Alex Smola, *Learning with Kernels*, MIT Press, Cambridge, MA, 2002.

### Support Vector Machine Classification
- The classification slides in this lecture (separating hyperplanes, the margin $2/\|\mathbf{w}\|$, the Lagrangian and its dual, the support vector expansion, soft margins, the kernel trick, Mercer kernels and the Gram matrix, C-SVM and $\nu$-SVM duals, chunking for training, and the MNIST results) repeat the material of the earlier SVM lecture above.

### Learning Regression Models
- Collect training data; build a model (stock value as a function of the feature space); make a prediction.

### SV Regression: ε-Insensitive Loss
- Goal: generalize SV pattern recognition to regression, preserving the following properties:
  - formulate the algorithm for the linear case, and then use the kernel trick;
  - sparse representation of the solution in terms of SVs.
- ε-insensitive loss:
$$|y - f(\mathbf{x})|_\varepsilon = \max\left\{0,\; |y - f(\mathbf{x})| - \varepsilon\right\}.$$
- Estimate a linear regression $f(\mathbf{x}) = \langle\mathbf{w},\mathbf{x}\rangle + b$ by minimizing
$$\frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{m}\sum_{i=1}^{m}|y_i - f(\mathbf{x}_i)|_\varepsilon.$$
(Figure: ε-SV regression estimation — a tube of width ε around the fit; only points outside the tube incur loss.)

### Formulation as an Optimization Problem
- Estimate a linear regression $f(\mathbf{x}) = \langle\mathbf{w},\mathbf{x}\rangle + b$ with precision $\varepsilon$ by minimizing
$$\tau(\mathbf{w},\boldsymbol{\xi},\boldsymbol{\xi}^*) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}\left(\xi_i + \xi_i^*\right)$$
subject to
$$\left(\langle\mathbf{w},\mathbf{x}_i\rangle + b\right) - y_i \le \varepsilon + \xi_i, \qquad y_i - \left(\langle\mathbf{w},\mathbf{x}_i\rangle + b\right) \le \varepsilon + \xi_i^*, \qquad \xi_i,\ \xi_i^* \ge 0$$
for all $i = 1,\ldots,m$.

### Dual Problem in Terms of Kernels
- For $C > 0$ and $\varepsilon \ge 0$ chosen a priori, maximize
$$W(\boldsymbol{\alpha},\boldsymbol{\alpha}^*) = -\varepsilon\sum_{i=1}^{m}\left(\alpha_i^* + \alpha_i\right) + \sum_{i=1}^{m}\left(\alpha_i^* - \alpha_i\right)y_i - \frac{1}{2}\sum_{i,j=1}^{m}\left(\alpha_i^* - \alpha_i\right)\left(\alpha_j^* - \alpha_j\right)k(\mathbf{x}_i,\mathbf{x}_j)$$
subject to $0 \le \alpha_i, \alpha_i^* \le C$ for $i = 1,\ldots,m$, and $\sum_{i=1}^{m}\left(\alpha_i - \alpha_i^*\right) = 0$.
- The regression estimate takes the form
$$f(\mathbf{x}) = \sum_{i=1}^{m}\left(\alpha_i^* - \alpha_i\right)k(\mathbf{x}_i,\mathbf{x}) + b.$$
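The ε-insensitive loss itself is one line of code. A sketch (Python/NumPy, illustrative) showing that residuals inside the ε-tube contribute nothing, while residuals outside it are penalized linearly:

```python
import numpy as np

def eps_insensitive(y, f_x, eps):
    """|y - f(x)|_eps = max(0, |y - f(x)| - eps): zero inside the tube."""
    return np.maximum(0.0, np.abs(y - f_x) - eps)

eps = 0.5
# Residuals y - f(x) for five hypothetical points; f(x) taken as 0 here.
residuals = np.array([0.1, -0.4, 0.5, 0.8, -1.5])
loss = eps_insensitive(residuals, 0.0, eps)
# First three residuals lie within the tube (|r| <= 0.5): zero loss.
# The last two exceed it by 0.3 and 1.0, which is exactly their loss.
```

This is the mechanism behind sparsity in SV regression: points strictly inside the tube have zero loss (and zero dual coefficients $\alpha_i = \alpha_i^* = 0$), so they do not appear in the solution.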
### ν-SV Regression
- Again use $\nu$ to eliminate another parameter: estimate $\varepsilon$ from the data, such that the $\nu$-property holds.
- Primal problem, for $0 \le \nu \le 1$: minimize
$$\tau(\mathbf{w},\varepsilon) = \frac{1}{2}\|\mathbf{w}\|^2 + C\left(\nu\varepsilon + \frac{1}{m}\sum_{i=1}^{m}|y_i - f(\mathbf{x}_i)|_\varepsilon\right).$$

### ν-SV Regression: Automatic Tube Tuning
- (Figure: identical machine parameters, $\nu = 0.2$, but different amounts of noise in the data; the tube width adapts automatically.)

### ε-SV Regression Run on the Same Data
- (Figure: identical machine parameters, $\varepsilon = 0.2$, but different amounts of noise in the data; the tube width stays fixed.)

### Boston Housing Benchmark
- 506 examples, 13-dimensional; mean squared error (MSE) results on test data:

| Method | MSE |
| --- | --- |
| Bagging regression trees | 11.7 |
| ε-SV regression | 7.6 |

- 100 runs with 25 randomly selected test points.
- The training set is split into an actual training set and a validation set (80 points) for selecting $\varepsilon$, $C$, and kernel parameters.
- Data: ftp.ics.uci.edu, pub/machine-learning-databases/housing.

### Comparison: ν-SVR vs ε-SVR

ν-SVR:

| ν | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| automatic ε | 2.6 | 1.7 | 1.2 | 0.8 | 0.6 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| MSE | 9.4 | 8.7 | 9.3 | 9.5 | 10.0 | 10.6 | 11.3 | 11.3 | 11.3 | 11.3 |
| fraction of errors | 0.0 | 0.1 | 0.2 | 0.2 | 0.3 | 0.4 | 0.5 | 0.5 | 0.5 | 0.5 |
| fraction of SVs | 0.3 | 0.4 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 | 1.0 | 1.0 | 1.0 |

ε-SVR:

| ε | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MSE | 11.3 | 9.5 | 8.8 | 9.7 | 11.2 | 13.1 | 15.6 | 18.2 | 22.1 | 27.0 | 34.3 |
| fraction of errors | 0.5 | 0.2 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| fraction of SVs | 1.0 | 0.6 | 0.4 | 0.3 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |

- RBF kernel; $C$ and $\sigma$ chosen as in the Boston Housing experiment above.

# Introduction to Classification

*Greg Grudic, September 7, 2004*

## 1 Basic Definitions and Probability Formulas
**Probability of an Event.** Let $A$ be an event (e.g., $A$ = "it rains tomorrow"). Then the probability of $A$ occurring is symbolized $\Pr(A)$, where $0 \le \Pr(A) \le 1$.

**Posterior Probability.** Assume there exist two events $A$ and $B$. Then the posterior probability of event $A$ given event $B$ is simply the probability that $A$ will occur given that $B$ has occurred. This is symbolized by $\Pr(A\,|\,B)$, where $0 \le \Pr(A\,|\,B) \le 1$.

**Product Rule.** Assume two events, and call them $A$ and $B$. The probability that both $A$ and $B$ will occur is symbolized by $\Pr(A \cap B)$, where
$$\Pr(A \cap B) = \Pr(A\,|\,B)\Pr(B) = \Pr(B\,|\,A)\Pr(A). \tag{1}$$
This is often referred to as the conjunction of two events.

**Sum Rule.** Assume two events, and call them $A$ and $B$. The probability that either event will occur is symbolized by $\Pr(A \cup B)$, where
$$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B). \tag{2}$$
This is often referred to as the disjunction of two events.

**Bayes' Theorem.** Assume two events, and call them $A$ and $B$. The posterior probability $\Pr(A\,|\,B)$ of $A$ given $B$ is
$$\Pr(A\,|\,B) = \frac{\Pr(B\,|\,A)\Pr(A)}{\Pr(B)}. \tag{3}$$

**Theorem of Total Probability.** Assume an event $B$ and a set of events $A_1, \ldots, A_n$. If $A_1, \ldots, A_n$ are mutually exclusive with $\sum_{i=1}^{n}\Pr(A_i) = 1$, then
$$\Pr(B) = \sum_{i=1}^{n}\Pr(B\,|\,A_i)\Pr(A_i). \tag{4}$$

## 2 Assumptions on Data

Let $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ be a set of $N$ data examples. This data is assumed to be generated from two or more classes, as defined below.

### 2.1 Binary Classification

A point $\mathbf{x} \in D \subseteq \mathbb{R}^d$ can belong to one of TWO classes, which we will label by $y \in \{-1, +1\}$. Assume the prior probability of class $y = +1$ is $p_+$ and the prior probability of class $y = -1$ is $p_-$, where $0 < p_+ < 1$, $0 < p_- < 1$, and $p_+ + p_- = 1$. Assume that class $y = +1$ is independently identically distributed (iid) from the probability density function (pdf) $h_+(\mathbf{x})$, where $h_+(\mathbf{x}) \ge 0$ for all $\mathbf{x} \in D$ and $\int_D h_+(\mathbf{x})\,d\mathbf{x} = 1$. Similarly, assume that class $y = -1$ is iid from the pdf $h_-(\mathbf{x})$, where $h_-(\mathbf{x}) \ge 0$ for all $\mathbf{x} \in D$ and $\int_D h_-(\mathbf{x})\,d\mathbf{x} = 1$.

Then the posterior probability of class $y = +1$ given $\mathbf{x}$ is
$$\Pr(y=+1\,|\,\mathbf{x}) = \frac{\Pr(y=+1)\Pr(\mathbf{x}\,|\,y=+1)}{\Pr(y=+1)\Pr(\mathbf{x}\,|\,y=+1) + \Pr(y=-1)\Pr(\mathbf{x}\,|\,y=-1)} = \frac{p_+ h_+(\mathbf{x})}{p_+ h_+(\mathbf{x}) + p_- h_-(\mathbf{x})}, \tag{5}$$
and the posterior probability of class $y = -1$ given $\mathbf{x}$ is
$$\Pr(y=-1\,|\,\mathbf{x}) = 1 - \Pr(y=+1\,|\,\mathbf{x}) = \frac{p_- h_-(\mathbf{x})}{p_+ h_+(\mathbf{x}) + p_- h_-(\mathbf{x})}.$$
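Equation (5) is easy to evaluate numerically. A small sketch (plain Python; the two Gaussian class densities and equal priors are my illustrative choices, not from the notes) computes the posterior of class $y=+1$ at a few points:

```python
import math

# Class-conditional densities h+ and h- (1-d Gaussians) and priors p+, p-.
def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

p_pos, p_neg = 0.5, 0.5
h_pos = lambda x: gauss(x, +1.0, 1.0)
h_neg = lambda x: gauss(x, -1.0, 1.0)

# Equation (5): Pr(y=+1 | x) = p+ h+(x) / (p+ h+(x) + p- h-(x)).
def posterior_pos(x):
    num = p_pos * h_pos(x)
    return num / (num + p_neg * h_neg(x))

assert abs(posterior_pos(0.0) - 0.5) < 1e-12   # midpoint: classes equally likely
assert posterior_pos(2.0) > 0.9                # deep inside the +1 region
```

With equal priors and symmetric densities, the posterior is exactly $1/2$ midway between the class means, and $\Pr(y=+1\,|\,x) + \Pr(y=-1\,|\,x) = 1$ by construction.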
### 2.2 Multi-Class Classification

Multi-class classification assumes more than two classes. We can symbolize $K$ classes by $y \in \{c_1, \ldots, c_K\}$. The prior probability of each class is symbolized by $p_k$, and class $k$ is assumed to be iid from a pdf distribution $h_k(\mathbf{x})$. Then the posterior probability of class $y = c_k$ given $\mathbf{x}$ is
$$\Pr(y = c_k\,|\,\mathbf{x}) = \frac{p_k h_k(\mathbf{x})}{\sum_{l=1}^{K} p_l h_l(\mathbf{x})}. \tag{6}$$

## 3 Classification Models

The goal of classification is to find a model that minimizes the number of classification errors on future (or test) data, i.e., data that has not been used to build the classification model. A classification model is defined by $M(\mathbf{x})$, and a class prediction for an input $\mathbf{x}$ is given by
$$\hat{y} = M(\mathbf{x}). \tag{7}$$
Given a set of $M$ test points defined by $(\mathbf{x}'_1, y'_1), \ldots, (\mathbf{x}'_M, y'_M)$, the class prediction for each point is given by
$$\hat{y}'_i = M(\mathbf{x}'_i). \tag{8}$$
The model error rate is
$$\text{error} = \frac{1}{M}\sum_{i=1}^{M}\delta_i, \qquad \delta_i = \begin{cases} 1 & \text{if } \hat{y}'_i \ne y'_i, \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$

### 3.1 Generative Models

If the prior for each class, $p_k$, and the pdf $h_k(\mathbf{x})$ are known, then a classifier that minimizes the classification error rate is given by
$$\hat{y} = \arg\max_{c_k}\; p_k h_k(\mathbf{x}).$$
This is called the Maximum A Posteriori (MAP) classifier, and we will cover this in class in more detail when we study Bayesian learning. The MAP classifier can be readily derived from Equation (6).

MAP theory is the main motivation behind generative classification models. A generative model is one that attempts to build models of $p_k$ and $h_k(\mathbf{x})$ from the training data. These approximations will be denoted by $\hat{p}_k$ and $\hat{h}_k(\mathbf{x})$, respectively, and a generative model has the form¹
$$\hat{y} = \arg\max_{c_k}\; \hat{p}_k \hat{h}_k(\mathbf{x}).$$

Estimating the prior $\hat{p}_k$ from training data is relatively simple: take the fraction of training examples belonging to class $k$. However, estimating $\hat{h}_k(\mathbf{x})$ can be very challenging and usually requires many assumptions. One common assumption is to model each class as being generated by a multivariate Gaussian density:
$$\hat{h}_k(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\hat{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2}\left(\mathbf{x}-\hat{\boldsymbol{\mu}}_k\right)^T \hat{\Sigma}_k^{-1} \left(\mathbf{x}-\hat{\boldsymbol{\mu}}_k\right)\right),$$
where $d$ is the dimension of the problem, $\hat{\boldsymbol{\mu}}_k$ is a vector containing the mean values of all the inputs for class $k$, $\hat{\Sigma}_k$ is the covariance matrix of class $k$, and $|\hat{\Sigma}_k|$ is the determinant of $\hat{\Sigma}_k$; $\hat{\boldsymbol{\mu}}_k$ and $\hat{\Sigma}_k$ can both be estimated from data (see Chapter 4.3 of [1]). This Gaussian assumption leads to two commonly used classification models: Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).

¹This is a generative classifier because knowing $\hat{p}_k$ and $\hat{h}_k(\mathbf{x})$ allows us to actually generate data.
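The Gaussian generative recipe can be sketched end to end: estimate $\hat{p}_k$, $\hat{\boldsymbol{\mu}}_k$, and $\hat{\Sigma}_k$ per class from training data, then classify by $\hat{y} = \arg\max_k \hat{p}_k \hat{h}_k(\mathbf{x})$. Below is an illustrative NumPy sketch on synthetic two-class data (the data, class means, and helper names are my own, not from the notes); estimating a full covariance per class makes this QDA-style:

```python
import numpy as np

rng = np.random.default_rng(4)
# Training data drawn from two Gaussian classes (the generative assumption).
X0 = rng.normal(loc=[0.0, 0.0], size=(200, 2))
X1 = rng.normal(loc=[3.0, 3.0], size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

def gauss_density(x, mu, cov):
    """Multivariate Gaussian density h_k(x) with mean mu and covariance cov."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

# Per-class estimates: prior (fraction of examples), mean, covariance.
params = {}
for k in (0, 1):
    Xk = X[y == k]
    params[k] = (len(Xk) / len(X), Xk.mean(axis=0), np.cov(Xk.T))

def map_classify(x):
    # y_hat = argmax_k  p_k * h_k(x)
    return max((0, 1), key=lambda k: params[k][0] * gauss_density(x, *params[k][1:]))
```

Because the model explicitly carries $\hat{p}_k \hat{h}_k(\mathbf{x})$, normalizing these scores as in Equation (6) would also give the posterior probabilities directly, the property the notes highlight for medical applications.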
(LDA) and Quadratic Discriminant Analysis (QDA).

One desirable property of a generative model is that it directly computes estimates of $\Pr(y = C_k \mid \mathbf{x})$. This is important in many applications, especially medical ones (e.g. what is the probability that a patient has cancer, given a set of lab test results?). Another is that if we have good prior knowledge about $h_k(\mathbf{x})$, relatively few training data points are required to obtain good classification models. Because of this, generative models are currently being widely applied in bioinformatics domains, where data is very expensive to generate. (Note: a model of this form is called generative because knowing $\hat{p}_k$ and $\hat{h}_k(\mathbf{x})$ allows us to actually generate data.)

### 3.2 Discriminative Models

A discriminative classification model starts with the premise that modelling $h_k(\mathbf{x})$ is too difficult, and in fact unnecessary: in order to build a classifier, all I need to model is the decision (or discriminating) boundary between the classes; I need not bother with density estimation at all.

(Figure 1: Separating hyperplane for a binary classification problem.)

If we assume a binary classification problem with classes labelled by $y \in \{-1, +1\}$, then a linear classifier can be defined by

$$\hat{y} = M(\mathbf{x}) = \text{sgn}\left(\hat{\beta}_0 + [\hat{\beta}_1, \ldots, \hat{\beta}_d]\, \mathbf{x}^T\right)$$

where

$$\text{sgn}(A) = \begin{cases} 1 & \text{if } A \geq 0 \\ -1 & \text{otherwise} \end{cases}$$

Therefore the boundary between the two classes is defined by

$$\hat{\beta}_0 + [\hat{\beta}_1, \ldots, \hat{\beta}_d]\, \mathbf{x}^T = 0$$

Many algorithms have been proposed for estimating the coefficients $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d$ from training data. In the next section we will discuss one of the earliest.

#### 3.2.1 Rosenblatt's Perceptron Learning Algorithm

The perceptron algorithm dates back to the 1950s and is the motivation behind modern neural network algorithms. The algorithm works by minimizing the distance of misclassified points to the decision boundary, i.e. it minimizes the following function:

$$D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d) = -\sum_{i \in \mathcal{M}} y_i \left(\hat{\beta}_0 + [\hat{\beta}_1, \ldots, \hat{\beta}_d]\, \mathbf{x}_i^T\right)$$

where $\mathcal{M}$ is the set of misclassified points. This function is minimized by
gradient descent on the $\hat{\beta}_0, \ldots, \hat{\beta}_d$ coefficients. The gradients are defined as follows. For $\hat{\beta}_0$:

$$\frac{\partial D}{\partial \hat{\beta}_0} = -\sum_{i \in \mathcal{M}} y_i$$

and for $\hat{\beta}_j$, $j = 1, \ldots, d$:

$$\frac{\partial D}{\partial \hat{\beta}_j} = -\sum_{i \in \mathcal{M}} y_i x_{ij}$$

This leads to the following update rule for the coefficients, applied one misclassified example $i$ at a time:

$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_j \end{pmatrix} \leftarrow \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_j \end{pmatrix} + \rho \begin{pmatrix} y_i \\ y_i x_{ij} \end{pmatrix}$$

where $\rho > 0$ is the learning rate and can be arbitrarily set to 1. This algorithm has a number of interesting properties. If the problem is linearly separable, the algorithm will converge to a separating hyperplane in a finite number of steps. However, it also has some fundamental problems (see [1], chapter 4.5):

1. If the data is not separable, the algorithm will cycle forever.
2. If the data is separable, the finite number of steps can be very large (it depends on the size of the gap between the two classes).
3. There are infinitely many hyperplanes that satisfy the minimization criteria when the data is separable. Which one is best? One solution to this problem is proposed by Support Vector theory, which we will study later.

#### 3.2.2 Nonlinear Separating Boundaries

When the data is not linearly separable, nonlinear models can be used:

$$\hat{y} = M(\mathbf{x}) = \text{sgn}\left(\hat{\beta}_0 + [\hat{\beta}_1, \ldots, \hat{\beta}_p]\, \Phi(\mathbf{x})^T\right)$$

where, as in regression, the $p$ basis functions $\Phi(\mathbf{x}) = (\phi_1(\mathbf{x}), \ldots, \phi_p(\mathbf{x}))^T$ are nonlinear functions of the inputs $\mathbf{x}$. Examples include $\phi_1(\mathbf{x}) = x_1 x_2$, $\phi_2(\mathbf{x}) = x_2^3$, $\phi_3(\mathbf{x}) = \sin(x_3)$, etc. A commonly used basis function is a kernel, which as in regression has the following form:

$$\phi_i(\mathbf{x}) = K(\mathbf{x}_i, \mathbf{x})$$

where $\mathbf{x}_i$, for $i = 1, \ldots, N$, are training examples. Therefore the kernel model looks like

$$\hat{y} = \text{sgn}\left(\hat{\beta}_0 + \sum_{i=1}^{N} \hat{\beta}_i K(\mathbf{x}_i, \mathbf{x})\right)$$

Typically used kernel functions include the Gaussian kernel

$$K(\mathbf{a}, \mathbf{b}) = \exp\left(-\frac{\|\mathbf{a} - \mathbf{b}\|^2}{\sigma}\right)$$

where $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$, the kernel parameter is $\sigma > 0$, and

$$\|\mathbf{a} - \mathbf{b}\|^2 = \sum_{j=1}^{d} (a_j - b_j)^2;$$

the polynomial kernel

$$K(\mathbf{a}, \mathbf{b}) = \left(q + \mathbf{a}^T \mathbf{b}\right)^k$$

where $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$ and the kernel parameters are $k \in \{1, 2, 3, \ldots\}$ and $q \in \mathbb{R}$; and the sigmoid kernel

$$K(\mathbf{a}, \mathbf{b}) = \tanh\left(\kappa\, \mathbf{a}^T \mathbf{b} + \theta\right)$$

where $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$ and the kernel parameters are $\kappa \in \mathbb{R}$ and $\theta \in \mathbb{R}$.

## 4 Practical Considerations

There are many practical considerations when building classification models. These include:

1. Scale all inputs to the same range (usually $-1$ to $1$) before building the model.
2. Get rid of highly correlated inputs.
3. Feature (i.e. input) selection: features that do not improve the model should be discarded.

Does this list seem familiar?

## 5 Other Classification Formulations

We will cover a number of other classification formulations in class. These include Gaussian process classification, support vector machines, classification trees, nearest neighbor classification, bagging, boosting, and random forests.

### References

[1] T. Hastie, R. Tibshirani, and J. Friedman. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. Springer, 2001.

# Introduction to Classification (Greg Grudic)

### Today's Lecture Goals

- Introduction to classification
- Generative models: Fisher linear discriminant analysis, Gaussian mixture models
- Discriminative models: Rosenblatt's perceptron learning algorithm
- Nonlinear extensions

### Last Week: Learning Regression Models

Collect training data, build a model (stock value = F(feature space)), then make a prediction: predict a stock value from a feature (input) space.

### This Class: Learning Classification Models

Collect training data, build a model (happy = F(feature space)), then make a prediction from a high-dimensional feature (input) space.

### Binary Classification

A binary classifier is a mapping from a set of $d$ inputs to a single output which can take on one of TWO values. In the most general setting, inputs $\mathbf{x} \in \mathbb{R}^d$ and output $y \in \{-1, +1\}$. Specifying the output classes as $-1$ and $+1$ is arbitrary; it is often done as a mathematical convenience.

### A Binary Classifier

Given learning data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$, a model is constructed:

$$\mathbf{x} \;\rightarrow\; \text{Classification Model } M(\mathbf{x}) \;\rightarrow\; \hat{y} \in \{-1, +1\}$$

### The Learning Data

Learning algorithms don't care where the data comes from. Here is a toy example from robotics, with inputs from two sonar sensors: sensor 1, $x_1 \in \mathbb{R}$; sensor 2, $x_2 \in \mathbb{R}$. The classification output: robot in Greg's office, $y = +1$; robot NOT in Greg's office, $y = -1$.

### Classification Learning Data

| | $x_1$ | $x_2$ | $y$ |
|---|---|---|---|
| Example 1 | 0.95013 | 0.58279 | 1 |
| Example 2 | 0.23114 | 0.4235 | 1 |
| Example 3 | 0.8913 | 0.43291 | 1 |
| Example 4 | 0.018504 | 0.76037 | 1 |

### The Learning Data

Symbolic representation of $N$ learning examples of $d$-dimensional inputs:

$$(x_{11}, \ldots, x_{1d}, y_1), \;\ldots,\; (x_{N1}, \ldots, x_{Nd}, y_N)$$

(Figure: graphical representation of classification training data.)

### Linear Separating Hyperplanes

How many lines can separate these points? (Figure: a two-class dataset with several candidate separating lines.)

### Linear Separating Hyperplanes: The Model

$$\hat{y} = M(\mathbf{x}) = \text{sgn}\left(\hat{\beta}_0 + [\hat{\beta}_1, \ldots, \hat{\beta}_d]\, \mathbf{x}^T\right), \qquad \text{sgn}(A) = \begin{cases} 1 & \text{if } A > 0 \\ -1 & \text{otherwise} \end{cases}$$

The decision boundary: $\hat{\beta}_0 + [\hat{\beta}_1, \ldots, \hat{\beta}_d]\, \mathbf{x}^T = 0$.

The model parameters are $\hat{\beta}_0, \ldots, \hat{\beta}_d$. The hat on the betas means that they are estimated from the data (in the class notes, sometimes the hat will be there and sometimes it won't). Many different learning algorithms have been proposed for determining $\hat{\beta}_0, \ldots, \hat{\beta}_d$.

### Rosenblatt's Perceptron Learning Algorithm

Dates back to the 1950s and is the motivation behind neural networks. The algorithm:

- Start with a random hyperplane $\hat{\beta}_0, \ldots, \hat{\beta}_d$.
- Incrementally modify the hyperplane so that misclassified points move closer to the correct side of the boundary.
- Stop when all learning examples are correctly classified.

The algorithm is based on the following property: the signed distance of any point $\mathbf{x}$ to the boundary is proportional to $\hat{\beta}_0 + [\hat{\beta}_1, \ldots, \hat{\beta}_d]\, \mathbf{x}^T$. Therefore, if $\mathcal{M}$ is the set of misclassified learning examples, we can push them closer to the boundary by minimizing

$$D(\hat{\beta}_0, \ldots, \hat{\beta}_d) = -\sum_{i \in \mathcal{M}} y_i \left(\hat{\beta}_0 + [\hat{\beta}_1, \ldots, \hat{\beta}_d]\, \mathbf{x}_i^T\right)$$

### Rosenblatt's Minimization Function

This is classic machine learning. First define a cost function in model parameter space; then find an algorithm that modifies $\hat{\beta}_0, \ldots, \hat{\beta}_d$ so that this cost function is minimized. One such algorithm is Gradient Descent.
### Gradient Descent

(Figure: a cost surface over parameter space, with gradient steps descending toward a minimum.)

### The Gradient Descent Algorithm

$$\hat{\beta} \leftarrow \hat{\beta} - \rho \frac{\partial D}{\partial \hat{\beta}}$$

where the learning rate is $\rho > 0$.

### The Gradient Descent Algorithm for the Perceptron

For each misclassified example $i$:

$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta} \end{pmatrix} \leftarrow \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta} \end{pmatrix} + \rho \begin{pmatrix} y_i \\ y_i \mathbf{x}_i^T \end{pmatrix}$$

### The Good Theoretical Properties of the Perceptron Algorithm

If a solution exists, the algorithm will always converge in a finite number of steps. Question: does a solution always exist?

### Linearly Separable Data

Which of these datasets are separable by a linear boundary? (Figure: two datasets, one linearly separable, one not.)

### Bad Theoretical Properties of the Perceptron Algorithm

- If the data is not linearly separable, the algorithm cycles forever and cannot converge. This property stopped research in this area between 1968 and 1984 (*Perceptrons*, Minsky and Papert, 1969).
- There are infinitely many solutions.
- When the data is linearly separable, the number of steps to converge can be very large (it depends on the size of the gap between the classes).

### What about Nonlinear Data?

Data that is not linearly separable is called nonlinear data. Nonlinear data can often be mapped into a nonlinear space where it is linearly separable.

### Nonlinear Models

The linear model:

$$\hat{y} = M(\mathbf{x}) = \text{sgn}\left(\hat{\beta}_0 + \sum_{j=1}^{d} \hat{\beta}_j x_j\right)$$

The nonlinear basis function model:

$$\hat{y} = M(\mathbf{x}) = \text{sgn}\left(\hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j \phi_j(\mathbf{x})\right)$$

Examples of nonlinear basis functions: $\phi_1(\mathbf{x}) = x_1^2$, $\phi_2(\mathbf{x}) = x_1 x_2$, $\phi_3(\mathbf{x}) = \sin(x_5)$, etc.

### Linear Separating Hyperplanes in Nonlinear Basis Function Space

An example. (Figure: data that is not linearly separable in input space becomes separable after a nonlinear transformation.)

### Kernels as Nonlinear Transformations

- Polynomial: $K(\mathbf{x}_i, \mathbf{x}_j) = \left(\langle \mathbf{x}_i, \mathbf{x}_j \rangle + q\right)^k$
- Sigmoid: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left(\kappa \langle \mathbf{x}_i, \mathbf{x}_j \rangle + \theta\right)$
- Gaussian, or Radial Basis Function (RBF): $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$

### The Kernel Model

Training data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$:

$$\hat{y} = M(\mathbf{x}) = \text{sgn}\left(\hat{\beta}_0 + \sum_{i=1}^{N} \hat{\beta}_i K(\mathbf{x}_i, \mathbf{x})\right)$$

The number of basis functions equals the number of training examples.
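The kernel model and the Gram matrix it is built from can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the toy points and the $\hat{\beta}$ values below are made up purely to exercise the model's form (numpy is assumed).

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # RBF kernel from the slides: K(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    return float(np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2)))

def gram_matrix(X, kernel):
    # N x N matrix of pairwise kernel values K(x_i, x_j)
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

def kernel_predict(x, X_train, beta0, betas, kernel):
    # yhat = sgn(beta0 + sum_i beta_i K(x_i, x)): one basis function per training example
    s = beta0 + sum(b * kernel(xi, x) for b, xi in zip(betas, X_train))
    return 1 if s > 0 else -1

# hypothetical toy data: two 2-D points per class
X = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 3.0], [3.5, 3.0]])
K = gram_matrix(X, gaussian_kernel)
print(np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))  # symmetric, ones on the diagonal

# made-up coefficients, just to exercise the model form
yhat = kernel_predict(np.array([0.1, 0.1]), X, 0.0, [1.0, 1.0, -1.0, -1.0], gaussian_kernel)
print(yhat)  # 1 (the query is near the positively weighted training points)
```

Note how the Gram matrix properties listed on the next slide (symmetric, positive on the diagonal) fall out directly: $K(\mathbf{x}_i, \mathbf{x}_j) = K(\mathbf{x}_j, \mathbf{x}_i)$, and for the RBF kernel $K(\mathbf{x}, \mathbf{x}) = e^0 = 1$.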
(Unless some of the $\hat{\beta}_i$ get set to zero.)

### Gram (Kernel) Matrix

Training data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$:

$$K = \begin{pmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix}$$

Properties: positive definite matrix; symmetric; positive on the diagonal; $N$ by $N$.

### Picking a Model Structure

How do you pick the kernels and kernel parameters? These are called learning parameters (or hyperparameters). Two approaches to choosing learning parameters: choose the learning parameters that maximize the probability of correct classification based on prior biases, or take a frequentist approach and use validation data. More on learning parameter selection later.

### Perceptron Algorithm Convergence

Two problems:

- No convergence when the data is not separable in basis function space.
- Infinitely many solutions when the data is separable.

Can we modify the algorithm to fix these problems?

# Decision Trees (Greg Grudic)

Notes borrowed from Thomas G. Dietterich and Tom Mitchell.

### Outline

- Decision tree representations
- ID3 and C4.5 learning algorithms (Quinlan, 1986)
- CART learning algorithm (Breiman et al., 1985)
- Entropy, information gain
- Overfitting

### Training Example

| Day | Outlook | Temp | Humidity | Wind | PlayTennis |
|-----|---------|------|----------|------|------------|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |

### Decision Tree Hypothesis Space

Internal nodes test the value of particular features $x_j$ and branch according to the results of the test; leaf nodes specify the class $h(\mathbf{x})$. (Figure: a tree with root Outlook; Sunny leads to a Humidity test with High → No and Normal → Yes; Overcast leads to Yes; Rain leads to a Wind test with Strong → No and Weak → Yes.)

Suppose the features are Outlook ($x_1$), Temperature ($x_2$), Humidity ($x_3$), and Wind ($x_4$). Then the feature vector $\mathbf{x} = (\text{Sunny}, \text{Hot}, \text{High}, \text{Strong})$ will be classified as No. The Temperature
feature is irrelevant.

If the features are continuous, internal nodes may test the value of a feature against a threshold. (Figure: a tree mixing categorical tests such as Outlook = Sunny/Overcast/Rain with numeric tests such as $> 75$, $< 75$, $> 20$, $< 20$.)

### Decision Tree Decision Boundaries

Decision trees divide the feature space into axis-parallel rectangles and label each rectangle with one of the $K$ classes. (Figure: a 2-D feature space partitioned by tests such as $x_2 < 3$, $x_1 < 4$, $x_1 < 3$, and $x_2 < 4$ into rectangles labelled 0 or 1.)

### Decision Trees Can Represent Any Boolean Function

(Figure: a tree computing XOR with tests $x_1 < 0.5$ and $x_2 < 0.5$.) The tree will, in the worst case, require exponentially many nodes, however.

### Learning Algorithm for Decision Trees

Given $S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, $\mathbf{x} = (x_1, \ldots, x_d)$, with $x_j, y \in \{0, 1\}$:

```
GrowTree(S)
  if (y = 0 for all (x, y) ∈ S) return new leaf(0)
  else if (y = 1 for all (x, y) ∈ S) return new leaf(1)
  else
    choose best attribute x_j
    S0 = all (x, y) ∈ S with x_j = 0
    S1 = all (x, y) ∈ S with x_j = 1
    return new node(x_j, GrowTree(S0), GrowTree(S1))
```

What happens if features are not binary? What about regression?

### Choosing the Best Attribute

Which attribute is best? (Example: a sample with [29+, 35−]; attribute $A_1$ splits it into [21+, 5−] (t) and [8+, 30−] (f), while attribute $A_2$ splits it into [18+, 33−] (t) and [11+, 2−] (f).)

### Choosing Splits Using Entropy

- $p_\oplus$ is the proportion of positive examples in $S$
- $p_\ominus$ is the proportion of negative examples in $S$
- Entropy measures the impurity of $S$:

$$\text{Entropy}(S) \equiv -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$$

(Figure: Entropy(S) as a function of $p_\oplus$, peaking at 1.0 when $p_\oplus = 0.5$.)

### Entropy

$\text{Entropy}(S)$ = the expected number of bits needed to encode the class ($\oplus$ or $\ominus$) of a randomly drawn member of $S$ under the optimal, shortest-length code. Why? Information theory: an optimal-length code assigns $-\log_2 p$ bits to a message having probability $p$. So the expected number of bits to encode $\oplus$ or $\ominus$ of a random member of $S$ is

$$p_\oplus(-\log_2 p_\oplus) + p_\ominus(-\log_2 p_\ominus)$$

### Information Gain

$\text{Gain}(S, A)$ = the expected reduction in entropy due to sorting on $A$:

$$\text{Gain}(S, A) \equiv \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \text{Entropy}(S_v)$$

(Same [29+, 35−] example as above, split by $A_1$ and $A_2$.)
### Selecting the Next Attribute

Which attribute is the best classifier? For $S$: [9+, 5−], $E = 0.940$.

Humidity: High [3+, 4−], $E = 0.985$; Normal [6+, 1−], $E = 0.592$.

$$\text{Gain}(S, \text{Humidity}) = 0.940 - \tfrac{7}{14}(0.985) - \tfrac{7}{14}(0.592) = 0.151$$

Wind: Weak [6+, 2−], $E = 0.811$; Strong [3+, 3−], $E = 1.00$.

$$\text{Gain}(S, \text{Wind}) = 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.0) = 0.048$$

After splitting $\{D1, \ldots, D14\}$ ([9+, 5−]) on Outlook: Sunny → $\{D1, D2, D8, D9, D11\}$ [2+, 3−]; Overcast → $\{D3, D7, D12, D13\}$ [4+, 0−]; Rain → $\{D4, D5, D6, D10, D14\}$ [3+, 2−]. Which attribute should be tested at the Sunny node?

$$\text{Gain}(S_{\text{sunny}}, \text{Humidity}) = 0.970 - \tfrac{3}{5}(0.0) - \tfrac{2}{5}(0.0) = 0.970$$
$$\text{Gain}(S_{\text{sunny}}, \text{Temperature}) = 0.970 - \tfrac{2}{5}(0.0) - \tfrac{2}{5}(1.0) - \tfrac{1}{5}(0.0) = 0.570$$
$$\text{Gain}(S_{\text{sunny}}, \text{Wind}) = 0.970 - \tfrac{2}{5}(1.0) - \tfrac{3}{5}(0.918) = 0.019$$

### Non-Boolean Features

- Features with multiple discrete values: multiway splits; test for one value versus the rest; group values into disjoint sets.
- Real-valued features: use thresholds.
- Regression: splits based on a mean-squared-error metric.

### Hypothesis Space Search

(Figure: the greedy search through the space of possible decision trees.)

### Overfitting

Consider the error of hypothesis $h$ over the training data, $\text{error}_{\text{train}}(h)$, and over the entire distribution $\mathcal{D}$ of data, $\text{error}_{\mathcal{D}}(h)$. Hypothesis $h \in H$ overfits the training data if there is an alternative hypothesis $h' \in H$ such that

$$\text{error}_{\text{train}}(h) < \text{error}_{\text{train}}(h') \quad \text{and} \quad \text{error}_{\mathcal{D}}(h) > \text{error}_{\mathcal{D}}(h')$$

### Overfitting in Decision Trees

(Figure: accuracy on training data keeps rising with tree size, while accuracy on test data peaks and then falls.)

### Validation Data Is Used to Control Overfitting

Prune the tree to reduce error on the validation set.
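The entropy and information-gain calculations used above to select attributes can be sketched in a few lines. This is a minimal sketch, not course-provided code; the counts come from the PlayTennis example ($S$ = [9+, 5−], Humidity splits into [3+, 4−] and [6+, 1−], Wind into [6+, 2−] and [3+, 3−]).

```python
from math import log2

def entropy(pos, neg):
    # Entropy(S) = -p+ log2 p+ - p- log2 p-, with 0 * log2(0) taken as 0
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            e -= p * log2(p)
    return e

def gain(parent, children):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) Entropy(S_v)
    n = sum(p + q for p, q in children)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

print(round(entropy(9, 5), 3))                    # 0.94
print(round(gain((9, 5), [(3, 4), (6, 1)]), 3))   # Humidity: 0.152 (the slides truncate to .151)
print(round(gain((9, 5), [(6, 2), (3, 3)]), 3))   # Wind: 0.048
```

Humidity's larger gain is why it, rather than Wind, is chosen as the test under the Sunny branch.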
# Learning with Weighted Examples (Greg Grudic)

### Weighted Training Examples

Training examples $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$. Associate a weight $w_i > 0$ with each example: $w_1, w_2, w_3, \ldots$ The greater the weight, the greater the importance of the training example. Weights might come from prior knowledge, or from a boosting algorithm.

### Weighted Sum of Least Squares
The loss function being minimized is the weighted RSS:

$$\sum_{i=1}^{N} w_i \left(y_i - \hat{\beta}_0 - \sum_{j=1}^{d} \hat{\beta}_j x_{ij}\right)^2$$

The design matrix and outputs become

$$X = \begin{pmatrix} \sqrt{w_1} & \sqrt{w_1}\, x_{11} & \cdots & \sqrt{w_1}\, x_{1d} \\ \vdots & \vdots & & \vdots \\ \sqrt{w_N} & \sqrt{w_N}\, x_{N1} & \cdots & \sqrt{w_N}\, x_{Nd} \end{pmatrix}, \qquad Y_w = \begin{pmatrix} \sqrt{w_1}\, y_1 \\ \vdots \\ \sqrt{w_N}\, y_N \end{pmatrix}$$

and the solution is

$$\hat{\beta} = (X^T X)^{-1} X^T Y_w$$

### Weighted Ridge Regression

The loss function being minimized is the weighted ridge loss:

$$\sum_{i=1}^{N} w_i \left(y_i - \hat{\beta}_0 - \sum_{j=1}^{d} \hat{\beta}_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{d} \hat{\beta}_j^2$$

With the same $X$ and $Y_w$ as above, the solution is

$$\hat{\beta} = (X^T X + \lambda I_0)^{-1} X^T Y_w$$

where $I_0$ is a $(d+1) \times (d+1)$ identity matrix with the first entry set to zero (so the intercept is not penalized).

### Weighted Kernel Ridge Regression

First calculate the kernel matrix

$$K = \begin{pmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix}$$

Then

$$X = \begin{pmatrix} \sqrt{w_1} & \sqrt{w_1}\, K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & \sqrt{w_1}\, K(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \vdots & & \vdots \\ \sqrt{w_N} & \sqrt{w_N}\, K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & \sqrt{w_N}\, K(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix}, \qquad Y_w = \begin{pmatrix} \sqrt{w_1}\, y_1 \\ \vdots \\ \sqrt{w_N}\, y_N \end{pmatrix}$$

and

$$\hat{\beta} = (X^T X + \lambda I_0)^{-1} X^T Y_w$$

where $I_0$ is an $(N+1) \times (N+1)$ identity matrix with the first entry set to zero.

### Weighted Gradient Descent

Weight each example's contribution to the error gradient. For batch learning,

$$\Delta \beta_j = -\rho \sum_{i=1}^{N} w_i \frac{\partial E_i}{\partial \beta_j}$$

where $E_i$ is the error on training example $i$, and the batch update is $\beta_j \leftarrow \beta_j + \Delta \beta_j$. The perceptron update

$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta} \end{pmatrix} \leftarrow \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta} \end{pmatrix} + \rho\, w_i \begin{pmatrix} y_i \\ y_i \mathbf{x}_i^T \end{pmatrix}$$

becomes incremental.

### Decision Tree Stump

Next homework.
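The weighted ridge solution above can be sketched directly: scale each row of the design matrix (and each output) by $\sqrt{w_i}$, then solve the penalized normal equations with the intercept left unpenalized. This is a minimal sketch assuming numpy; the toy data below is hypothetical, chosen so the exact answer is known.

```python
import numpy as np

def weighted_ridge(X, y, w, lam):
    # Solve beta = (X'X + lam*I0)^{-1} X'Yw, with the rows of [1, x_i] scaled by
    # sqrt(w_i), which minimizes sum_i w_i (y_i - b0 - x_i . b)^2 + lam * sum_j b_j^2
    N, d = X.shape
    sw = np.sqrt(w)
    A = np.hstack([np.ones((N, 1)), X]) * sw[:, None]  # design matrix with sqrt-weighted rows
    yw = sw * y                                        # Y_w = sqrt(w_i) * y_i
    I0 = np.eye(d + 1)
    I0[0, 0] = 0.0                                     # first entry zero: intercept unpenalized
    return np.linalg.solve(A.T @ A + lam * I0, A.T @ yw)

# hypothetical toy data on the line y = 2x + 1, with uniform weights
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
beta = weighted_ridge(X, y, w=np.ones(4), lam=1e-8)
print(np.round(beta, 4))  # approximately [1. 2.]
```

Setting $\lambda \approx 0$ and $w_i = 1$ recovers ordinary least squares; increasing a single $w_i$ pulls the fit toward that example, which is exactly the behavior a boosting algorithm exploits.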
