Introduction to Machine Learning (CSI 5325)
This 93-page set of class notes was uploaded by Melvin Bednar on Saturday, October 3, 2015. The notes belong to CSI 5325 at Baylor University, taught by Gregory Hamerly in Fall.
Date Created: 10/03/15
Intro to machine learning (CSI 5325)
Greg Hamerly
Some content from Tom Mitchell

Outline:
- Concept learning: candidate elimination algorithm recap
- Picking new examples
- The need for inductive bias
- Where we are going

The definition of concept learning

Concept learning is learning a function which has a boolean-valued output: f: X -> {0, 1}. Many machine learning approaches use this simplistic binary view of the world.

Example version space

S: { <Sunny, Warm, ?, Strong, ?, ?> }
   <Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: { <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?> }

Representing version spaces

- The general boundary G of version space VS_{H,D} is the set of its maximally general members.
- The specific boundary S of version space VS_{H,D} is the set of its maximally specific members.
- Every member of the version space lies between these boundaries:

  VS_{H,D} = { h in H | (there exists s in S)(there exists g in G) g >= h >= s }

  where x >= y means "x is more general than or equal to y".

Candidate elimination algorithm

Initialize:
- G <- the maximally general hypotheses in H
- S <- the maximally specific hypotheses in H

For each training example d:
- If d is a positive example, adjust the sets G and S.
- If d is a negative example, adjust the sets G and S.

Candidate elimination algorithm: positive example

For a positive example d:
- Remove from G any hypothesis inconsistent with d.
- For each hypothesis s in S that is not consistent with d:
  - Remove s from S.
  - Add to S all minimal generalizations h of s such that
    - h is consistent with d, and
    - some member of G is more general than h.
- Remove from S any hypothesis that is more general than another hypothesis in S.

Candidate elimination algorithm: negative example

For a negative example d:
- Remove from S any hypothesis inconsistent with d.
- For each hypothesis g in G that is not consistent with d:
  - Remove g from G.
  - Add to G all minimal specializations h of g such that
    - h is consistent with d, and
    - some member of S is more specific than h.
- Remove from G any hypothesis that is less general than another hypothesis in G.
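The positive- and negative-example update rules above translate directly into code. The following is a minimal sketch of my own (not part of the original notes), assuming the standard six-attribute EnjoySport representation: '?' is the "match anything" constraint, None marks the slots of the initial "matches nothing" hypothesis, and the two-valued attribute domains are an assumption for this illustration.

```python
# Minimal candidate-elimination sketch for conjunctive hypotheses.
# Assumed two-valued domains for the six EnjoySport attributes.
DOMAINS = [
    ('Sunny', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
    ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change'),
]

def matches(h, x):
    # h is consistent with instance x if every constraint allows x's value
    return all(hv in ('?', xv) for hv, xv in zip(h, x))

def more_general_or_equal(h1, h2):
    # h1 >= h2 in the general-to-specific ordering; a None slot in h2
    # means "no value allowed", which everything is more general than
    return all(b is None or a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples):
    n = len(DOMAINS)
    S = [(None,) * n]   # maximally specific boundary
    G = [('?',) * n]    # maximally general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if matches(g, x)]
            new_S = []
            for s in S:
                if matches(s, x):
                    new_S.append(s)
                    continue
                # minimal generalization: relax only the disagreeing slots
                h = tuple(xv if sv is None or sv == xv else '?'
                          for sv, xv in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.append(h)
            # drop any member more general than another member of S
            S = [s for s in new_S
                 if not any(s2 != s and more_general_or_equal(s, s2)
                            for s2 in new_S)]
        else:
            S = [s for s in S if not matches(s, x)]
            new_G = []
            for g in G:
                if not matches(g, x):
                    new_G.append(g)
                    continue
                # minimal specializations: pin one '?' slot to a value != x[i]
                for i, gv in enumerate(g):
                    if gv != '?':
                        continue
                    for v in DOMAINS[i]:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if any(more_general_or_equal(h, s) for s in S):
                                new_G.append(h)
            # drop any member less general than another member of G
            G = [g for g in new_G
                 if not any(g2 != g and more_general_or_equal(g2, g)
                            for g2 in new_G)]
    return S, G
```

Running this on Mitchell's four EnjoySport training examples reproduces the version space shown above: S = {<Sunny, Warm, ?, Strong, ?, ?>} and G = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}.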
Tracing the candidate elimination algorithm

G0 = { <?, ?, ?, ?, ?, ?> }

What about noisy data?

What would happen to the candidate elimination algorithm if it encountered an incorrectly labeled example? The algorithm removes every hypothesis that is inconsistent with some training example; therefore the true concept will be removed. What does this show about our assumptions that
- the data has no noise, and
- the hypothesis space contains the correct hypothesis?

What ordering on the training examples?

Does the order of the training data matter
- for the correctness of the algorithm?
- for the efficiency of the algorithm?

[Slide rendered upside down in the original: the current S and G boundaries, with the question "What training example should be next?"]

How should these be classified?

S: { <Sunny, Warm, ?, Strong, ?, ?> }
   <Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: { <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?> }

- <Sunny, Warm, Normal, Strong, Cool, Change>
- <Rainy, Cool, Normal, Light, Warm, Same>
- <Sunny, Warm, Normal, Light, Warm, Same>

What justifies this inductive leap?

Suppose we've seen these two training examples:
- (+) <Sunny, Warm, Normal, Strong, Cool, Change>
- (+) <Sunny, Warm, Normal, Light, Warm, Same>

S: <Sunny, Warm, Normal, ?, ?, ?>

Why should we believe we can classify the unseen instance <Sunny, Warm, Normal, Strong, Warm, Same>?

What is inductive bias?

In short, it's limiting our hypothesis space. What does the word "bias" imply? So far we've limited our hypothesis space to be a conjunction of attribute constraints.

An UNbiased learner

Idea: choose an H that expresses every teachable concept, i.e., H is the power set of X. Consider H' = disjunctions, conjunctions, and negations over the previous H, e.g.

  <Sunny, Warm, Normal, ?, ?, ?> v <?, ?, ?, ?, ?, Change>

What are S and G in this case?

Inductive bias

Consider:
- a concept learning algorithm L
- instances X, target concept c
- training examples D_c = { <x, c(x)> }
- let L(x_i, D_c) denote the classification
assigned to the instance x_i by L after training on data D_c.

Definition: the inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples D_c,

  (for all x_i in X) [(B ^ D_c ^ x_i) |- L(x_i, D_c)]

where A |- B means "A logically entails B".

[Figure: an inductive system (the candidate elimination algorithm, given training examples and a new instance, outputs a classification or "don't know") is modeled by an equivalent deductive system (a theorem prover, given the training examples, the new instance, and the assertion "H contains the target concept", outputs a classification or "don't know"). The added assertion is the inductive bias.]

Three learners with different biases

- Rote learner: store examples; classify x iff it matches a previously observed example.
- Candidate elimination algorithm
- Find-S

Inductive bias and tic-tac-toe

What is the inductive bias of the TTT player you're writing?

Inductive bias and other learning algorithms

As we continue in this course, keep in mind the inductive bias of the different algorithms we discuss. Speaking generally, a learning algorithm with a large inductive bias
- makes larger leaps in logic,
- learns more quickly from examples, and
- is less 'flexible'
than a learning algorithm with a smaller bias.

Inductive bias and curve fitting

Suppose we were trying to learn the regression function f: R -> R. What is the best curve f through these points?

[Figure: a scatter plot of sample points in the plane.]

Induction leads to deduction

Notice that often we want to not just learn a concept, but use that concept to do something new. We typically induce a concept/model/hypothesis from specific training examples. We often then use that learned concept to deduce facts on new data.

Summary points

- Concept learning as search through H
- General-to-specific ordering over H
- Version space / candidate elimination algorithm
- S and G boundaries characterize the learner's
uncertainty
- The learner can generate useful queries
- Inductive leaps are possible only if the learner is biased
- Inductive learners can be modeled by equivalent deductive systems

Next steps

We will look at models which are widely used for learning:
- decision trees
- neural networks
- linear models and basis/kernel expansions
- generative probabilistic models
- etc.

For each, try to keep in mind things like:
- What is the hypothesis space?
- What is the inductive bias?
- Is the algorithm robust to noise?
- How well can the algorithm learn?
- How quickly can the algorithm learn?

Tools for the course

- LaTeX: emphasis on quality writeups of experiments.
- MATLAB: available on PCs and my linux servers (ask me for an account); good for statistics, linear algebra, data wrangling, plotting/graphing, prototyping, and running experiments.
- Weka: Java-based machine learning toolkit.
- Datasets: lots available! I'll post links on the course page as needed.

Intro to machine learning (CSI 5325), Lecture 12: Bayesian learning
Greg Hamerly
Some content from Tom Mitchell

Outline:
- Bayes theorem
- MAP and ML hypotheses
- MAP learners
- ML learners

Two roles for Bayesian methods

Provides practical learning algorithms:
- naive Bayes learning
- Bayesian belief network learning
- combines prior knowledge (prior probabilities) with observed data
- requires prior probabilities

Provides a useful conceptual framework:
- provides a "gold standard" for evaluating other learning algorithms
- additional insight into Occam's razor

Bayes theorem

  P(h|D) = P(D|h) P(h) / P(D)

- P(h): prior probability of hypothesis h
- P(D): prior probability of training data D
- P(h|D): probability of h given D
- P(D|h): probability of D given h

Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis h_MAP:

  h_MAP = argmax_{h in H} P(h|D)
        = argmax_{h in H} P(D|h) P(h)

If we assume P(h_i) = P(h_j) for all i, j, then we can further simplify and choose the maximum likelihood (ML) hypothesis:

  h_ML = argmax_{h in H} P(D|h)

Bayes theorem: does the patient have cancer or not?

A patient takes a lab test and the result
comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

  P(cancer) = 0.008        P(not cancer) = 0.992
  P(+|cancer) = 0.98       P(-|cancer) = 0.02
  P(+|not cancer) = 0.03   P(-|not cancer) = 0.97

Basic formulas for probabilities

- Product rule: the probability P(A ^ B) of a conjunction of two events A and B:
  P(A ^ B) = P(A|B) P(B) = P(B|A) P(A)
- Sum rule: the probability of a disjunction of two events A and B:
  P(A v B) = P(A) + P(B) - P(A ^ B)
- Theorem of total probability: if the events A_1, ..., A_n are mutually exclusive with sum_{i=1}^n P(A_i) = 1, then
  P(B) = sum_{i=1}^n P(B|A_i) P(A_i)

Brute force MAP hypothesis learner

1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis h_MAP with the highest posterior probability:
   h_MAP = argmax_{h in H} P(h|D)

Relation to concept learning

Consider our usual concept learning task:
- instance space X, hypothesis space H, training examples D
- consider the Find-S learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D})

What would Bayes rule produce as the MAP hypothesis? Does Find-S output a MAP hypothesis?

Relation to concept learning

Assume a fixed set of instances <x_1, ..., x_m>, and assume D is the set of classifications D = <c(x_1), ..., c(x_m)>.

Choose P(D|h):
- P(D|h) = 1 if h is consistent with D
- P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution:
- P(h) = 1/|H| for all h in H

Then

  P(h|D) = 1/|VS_{H,D}|  if h is consistent with D
         = 0             otherwise

Evolution of posterior probabilities

[Figure: three plots (a), (b), (c) showing P(h), then P(h|D1), then P(h|D1, D2) over the hypotheses: as data arrives, the posterior concentrates on the consistent hypotheses.]

Learning a real-valued function

With some simple assumptions, maximum likelihood learning can be converted to squared error minimization. This occurs often in regression (i.e., learning real-valued functions) when:
- samples are independent
- the true output signal is corrupted by noise
- the noise
is Gaussian with zero mean
- the noise is i.i.d. with respect to itself
- the noise is independent of the input

Learning a real-valued function

Consider any real-valued target function f. Training examples are <x_i, d_i>, where d_i is a noisy training value:
- d_i = f(x_i) + e_i
- e_i ~ N(0, sigma^2) is a random variable (noise) drawn independently for each x_i according to a Gaussian distribution with mean mu = 0 and unknown variance sigma^2.

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

  h_ML = argmin_{h in H} sum_{i=1}^m (d_i - h(x_i))^2

Derivation:

  h_ML = argmax_{h in H} p(D|h)
       = argmax_{h in H} prod_{i=1}^m p(d_i|h)
       = argmax_{h in H} prod_{i=1}^m (1/sqrt(2 pi sigma^2)) exp(-(d_i - h(x_i))^2 / (2 sigma^2))

Maximize the natural log of this instead:

  h_ML = argmax_{h in H} sum_{i=1}^m [ ln(1/sqrt(2 pi sigma^2)) - (d_i - h(x_i))^2 / (2 sigma^2) ]
       = argmax_{h in H} sum_{i=1}^m -(d_i - h(x_i))^2
       = argmin_{h in H} sum_{i=1}^m (d_i - h(x_i))^2

Intro to machine learning (CSI 5325): computational learning theory
Greg Hamerly
Spring 2008
Some content from Tom Mitchell

Outline:
- Computational learning theory
- Probably approximately correct (PAC) learning
- Vapnik-Chervonenkis dimension

Computational learning theory

What general laws constrain inductive learning? We seek theory to relate:
- the probability of successful learning
- the number of training examples
- the complexity of the hypothesis space
- the accuracy to which the target concept is approximated
- the manner in which training examples are presented

PAC learning

Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.

Definition: C is PAC-learnable by L using H if
- for all c in C, distributions D over X, epsilon such that 0 < epsilon < 1/2, and delta such that 0 < delta < 1/2,
- learner L will, with probability at least (1 - delta), output a hypothesis h in H such that error_D(h) <= epsilon, in time that is polynomial in 1/epsilon, 1/delta, n, and size(c).

Unbiased learning

Given sample space X, class of concepts C, hypothesis space
H = C, and n boolean features for each example:
- |X| = 2^n
- |C| = 2^|X| = 2^(2^n)

Therefore the sample complexity is

  m >= (1/epsilon) (2^n ln 2 + ln(1/delta))

Thus an unbiased hypothesis space is not PAC-learnable, since it has an exponential dependence on n.

Shattering a set of instances

Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.

Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

The Vapnik-Chervonenkis dimension

Definition: the Vapnik-Chervonenkis dimension VC(H) of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) = infinity.

VC dimension of linear decision surfaces

[Figure: panels (a) and (b) showing point configurations in the plane and which of their dichotomies a line can realize.]

Sample complexity from VC dimension

How many randomly drawn examples suffice to epsilon-exhaust VS_{H,D} with probability at least (1 - delta)?

  m >= (1/epsilon) (4 log2(2/delta) + 8 VC(H) log2(13/epsilon))

Compare VC and PAC sample complexity

PAC (assuming a consistent learner):

  m >= (1/epsilon) (ln|H| + ln(1/delta))

VC:

  m >= (1/epsilon) (4 log2(2/delta) + 8 VC(H) log2(13/epsilon))

Compare VC dimension and hypothesis space size

  VC(H) <= log2 |H|

Why?
- Assume VC(H) = d.
- H requires at least 2^d distinct hypotheses to shatter d instances.
- Thus 2^d <= |H|, and d = VC(H) <= log2 |H|.

Examples of VC dimension

[Figure: example hypothesis classes and their VC dimensions.]

VC dimension of a multi-layer perceptron network

Given a multilayer perceptron G with
- n input nodes,
- s >= 2 internal (non-input) nodes,
- at most r inputs to any internal node, and
- a concept class C over R^r with VC(C) = d,

then VC(G) <= 2 d s log(e s), where e = 2.71828. Note d = r + 1 for a linear perceptron.

Limitations:
- applies only to thresholded linear perceptrons
- does not apply to sigmoid-based networks trained with backprop (those have VC dimension at least this big; why?)
- doesn't account for backprop's training method (small to large weights)

VC dimension and number of parameters

Note that the earlier bound of PAC learning depended on the size of the hypothesis space.
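The two sample-complexity bounds can also be compared numerically. A small sketch (my own illustration; the helper names are mine, and the formulas are used exactly as stated above):

```python
from math import ceil, log, log2

def pac_sample_bound(h_size, eps, delta):
    # consistent-learner PAC bound: m >= (1/eps)(ln|H| + ln(1/delta))
    return ceil((log(h_size) + log(1 / delta)) / eps)

def vc_sample_bound(vc_dim, eps, delta):
    # VC-based bound: m >= (1/eps)(4 log2(2/delta) + 8 VC(H) log2(13/eps))
    return ceil((4 * log2(2 / delta) + 8 * vc_dim * log2(13 / eps)) / eps)

# Unbiased learning over n boolean features: |H| = 2^(2^n), so the PAC
# bound's ln|H| term is 2^n ln 2, i.e., exponential in n (the point above).
n = 10
m_unbiased = pac_sample_bound(2 ** (2 ** n), eps=0.1, delta=0.05)
```

For instance, with epsilon = 0.1 and delta = 0.05, the unbiased space over only n = 10 boolean features already demands thousands of examples, and the requirement doubles with each additional feature; a class of fixed VC dimension has no such dependence on n.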
Further, |H| typically depends on the number of parameters d in the model:
- often |H| has an exponential relationship with d
- this leads to the 'curse of dimensionality', which is most evident in bias-free learning
- we saw this in the sample complexity of the n-term bias-free boolean learner

However, the VC dimension often may not depend so directly on d.

Classes with infinite VC dimension

The following family of classifiers has an infinite VC dimension:

  h(x) = [sin(a x) >= 0]

but a finite number of parameters (only one).

[Figure: a rapidly oscillating sine wave; by choosing the frequency a, sin(a x) can shatter arbitrarily large finite sets of points.]

Infinite hypothesis spaces and VC dimension

Note that just because a hypothesis space is infinite does not imply that the VC dimension is infinite:
- e.g., the interval classifier h(x) = [a < x < b]
- |H| = infinity, but VC(H) = 2

However, VC(H) = infinity does imply that |H| = infinity. Both of these results are implied by VC(H) <= log2 |H|.

Intro to machine learning (CSI 5325)
Greg Hamerly
Spring 2008
Some content from Andrew Moore (http://www.cs.cmu.edu/~awm/tutorials)

Outline:
- Kernelizing other algorithms
- Other issues

Distance -> kernel distance

The Euclidean squared distance between a and b:

  d(a, b)^2 = ||a - b||^2 = (a - b) . (a - b) = a.a - 2 a.b + b.b

which is just a few dot products. Replace the dot products with a kernel:

  d_Phi(a, b)^2 = ||Phi(a) - Phi(b)||^2
                = Phi(a).Phi(a) - 2 Phi(a).Phi(b) + Phi(b).Phi(b)
                = K(a, a) - 2 K(b, a) + K(b, b)

This gives the straight-line distance of a and b in the mapped space of the kernel.

Kernelizing other algorithms

Any algorithm that can be expressed entirely with dot products can be kernelized: just replace all dot products with a chosen kernel. Examples: PCA, k-nearest neighbor, k-means, any distance-based learning algorithm.

What is a kernel?

A kernel is a function that represents an inner product in some mapped space.

Mercer's condition: K(a, b) is a kernel representing a basis expansion K(a, b) = sum_i Phi_i(a) Phi_i(b) iff, for any g(a) such that the integral of g(a)^2 da is finite,

  integral of K(a, b) g(a) g(b) da db >= 0.

In other words, K is positive semi-definite.

VC dimension of an SVM

The VC dimension of any SVM
can be bounded in several ways, but one good way to think about it is

  VC(SVM) <= diameter / margin

where
- diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term vectors from the training set, and
- margin is the smallest margin we'll let the SVM use.

We can use the VC dimension with SRM (structural risk minimization) to choose the kernel, the kernel parameters, etc. But it's easier just to use cross-validation.

VC dimension of an SVM

Interestingly, the VC dimension depends on the margin, but not necessarily on the complexity of the kernel. In fact, some kernels have infinite VC dimension themselves: Gaussian (aka RBF) kernels represent an infinite feature space Phi(x) and also have infinite VC dimension.

Another SVM error bound

Another bound on SVM error bounds the number of support vectors:

  E[P(error)] <= E[number of support vectors] / n

where the expectation on the right is over all samples of size n. So the number of support vectors has a relationship with accuracy.

Constructing kernels

We can learn with any data for which there is a kernel:
- vectors in R^d
- simple text documents
- strings
- trees
- graphs (think subgraph isomorphism: expensive!)

Areas of active research are how to choose appropriate kernels, and developing appropriate and efficient kernels for structured data.

Constructing kernels from existing kernels

Suppose
- K1 and K2 are both kernels over X x X, X a subset of R^d,
- f is a real-valued function over X,
- Phi is a feature map from X to R^m,
- K3 is a kernel over R^m x R^m, and
- B is a positive semi-definite d x d matrix.

Then the following are all kernels:
- K(a, b) = K1(a, b) + K2(a, b)
- K(a, b) = alpha K1(a, b)
- K(a, b) = K1(a, b) K2(a, b)
- K(a, b) = f(a) f(b)
- K(a, b) = K3(Phi(a), Phi(b))
- K(a, b) = a^T B b

Multi-class with SVMs

SVMs are excellent for 2 classes, but what about more than that? The standard approach is to choose a method to decompose one m-class problem into multiple 2-class problems:
- 1-versus-all: requires m SVMs
- pairwise classifiers: requires m(m-1)/2 SVMs
- error-correcting output codes:
multiple p-versus-q SVMs

Training is easy: break the initial multi-class dataset into multiple binary-class datasets and train a binary SVM for each. Combining the answers of multiple classifiers for making predictions can be tricky, but voting or looking for the most confident predictions works.

SVMs and probabilities

The SVM is a discriminative model (as opposed to a generative one). Thus it's good at classification, but not at providing probabilities. Some work has been done (Vapnik, Wahba, Platt, and others) to take an SVM and use it to produce P(y = 1|x). Platt's idea: fit a sigmoid to f(x).

Support vector regression

Support vector machines are for classification, but regression is also possible with similar ideas. A key difference from least squares is that SVR cares only about errors that are further than epsilon from the regression line; this is called epsilon-insensitive loss. Such regressors can be sparse, just like SVMs, and can also use kernel functions for non-linear regression.

SVM learning in practice

Don't write the software; use someone else's:
- SVMlight (Joachims)
- Matlab toolkits
- the software listings at www.supportvectormachines.org

Intro to machine learning (CSI 5325), Lecture 5: decision trees
Greg Hamerly
Fall 2008
Some content from Tom Mitchell

Outline:
- Inductive bias in ID3
- Overfitting
- Underfitting

Inductive bias in ID3

Note: H is the power set of instances X. "Unbiased?" Not really:
- preference for short trees, and for those with high-information-gain attributes near the root
- ID3's bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
- Occam's razor: prefer the shortest hypothesis that fits the data

Occam's razor

Why prefer short hypotheses?

Arguments in favor:
- There are fewer short hypotheses than long hypotheses,
  so a short hypothesis that fits the data is unlikely to be coincidence,
  while a long hypothesis that fits the data might be coincidence.

Arguments opposed:
- There are many ways to define small sets of hypotheses,
- e.g., all trees with a prime number of nodes that use
attributes beginning with "Z".
- What's so special about small sets based on the size of the hypothesis?

Overfitting in decision trees

Consider adding noisy training example #15:

  <Sunny, Hot, Normal, Strong>, PlayTennis = No

What effect does it have on the earlier tree?

[Figure: the PlayTennis decision tree. Outlook splits into Sunny / Overcast / Rain; under Sunny, Humidity splits into High (No) and Normal (Yes); Overcast is Yes; under Rain, Wind splits into Strong (No) and Weak (Yes).]

Overfitting

Consider the error of hypothesis h over
- the training data: error_train(h)
- the entire distribution D of data: error_D(h)

Hypothesis h in H overfits the training data if there is an alternative hypothesis h' in H such that

  error_train(h) < error_train(h')  and  error_D(h) > error_D(h')

[Figure: accuracy versus tree size (number of nodes). Accuracy on the training data keeps climbing as the tree grows, while accuracy on the test data peaks and then declines, with the vertical axis running roughly from 0.5 to 0.9.]

Data-centric reasons for overfitting

- Noise in the data: do we want to perfectly model training data with noise?
- Too little data: spurious correlations could occur.

Avoiding overfitting

How can we avoid overfitting in decision trees?
- stop growing when a data split is not statistically significant
- grow the full tree, then post-prune

How to select the best tree?
- measure performance over the training data
- measure performance over a separate validation data set
- MDL: minimize size(tree) + size(misclassifications(tree))

Reduced-error pruning

Split data into a training set and a validation set. Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
2. Greedily remove the one that most improves validation set accuracy.

- This produces the smallest version of the most accurate subtree.
- What if data is limited?

[Figure: accuracy versus tree size again, with a third curve showing accuracy on test data during pruning, which rises above the unpruned test-data curve.]

Rule post-pruning

1. Convert the tree to an equivalent set of rules.
2. Prune each rule independently of the others.
3. Sort the final rules into the desired sequence for use.

Perhaps the most frequently used method (e.g., C4.5).

Converting a tree to rules

[Figure: the PlayTennis decision tree again.]

IF Outlook = Sunny ^ Humidity = High THEN PlayTennis = No
IF Outlook = Sunny ^ Humidity = Normal THEN
PlayTennis = Yes
...

Overfitting is a general problem

Overfitting is not just a problem for decision trees; most machine learning algorithms face overfitting problems. It most commonly occurs when
- the hypothesis space is very large,
- the hypothesis search is not biased toward simple models,
- there is little training data, or
- there is a lot of noise in the training data.

How can we tell when it happens?
- radical differences in train/test accuracy
- an overly complex hypothesis/model (e.g., a large decision tree)

Underfitting can also occur

Underfitting is the opposite of overfitting. It appears in a different way on our accuracy graphs.
- How would you expect an underfit model to do on training data?
- On test data?

Causes of underfitting:
- the hypothesis space is too small
- the search strategy is too biased towards simple models
- the attributes do not correlate with the target

Evaluating hypotheses

In any learning algorithm, we want to know how 'well' it is performing. This is a huge question. We will talk about it as we go along, and then particularly in chapter 6.
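The train/test symptom of overfitting described above is easy to demonstrate. Here is a small self-contained sketch of my own (not from the notes): a "rote" 1-nearest-neighbor learner memorizes noisy training labels perfectly, while a single fixed-threshold hypothesis, a very strongly biased learner, cannot fit the noise at all.

```python
import random

# Illustration only: the concept, noise rate, and sample sizes are my choices.
random.seed(0)

NOISE = 0.15  # fraction of labels flipped at random

def make_data(n):
    # true concept: y = 1 iff x > 0.5, observed through label noise
    data = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else 0
        if random.random() < NOISE:
            y = 1 - y
        data.append((x, y))
    return data

train, test = make_data(50), make_data(1000)

def rote_predict(x):
    # rote learner: copy the label of the nearest memorized training point
    return min(train, key=lambda p: abs(p[0] - x))[1]

def threshold_predict(x):
    # strongly biased learner: the single fixed hypothesis "x > 0.5"
    return 1 if x > 0.5 else 0

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

rote_train_acc = accuracy(rote_predict, train)        # perfect memorization
rote_test_acc = accuracy(rote_predict, test)          # hurt by memorized noise
threshold_test_acc = accuracy(threshold_predict, test)
```

The rote learner scores exactly 1.0 on its own training data (it memorized every noisy label), but on fresh test data it pays for that memorization: a radical train/test difference. The biased threshold hypothesis cannot fit the training noise in the first place, so its training and test accuracies stay close to each other, near the noise floor.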