# Advanced Software Engineering CSI 5V93

Baylor University


This 63-page set of class notes was uploaded by Melvin Bednar on Saturday, October 3, 2015. The notes belong to CSI 5V93 at Baylor University, taught by Staff in Fall. Since its upload, it has received 11 views. For similar materials see /class/217937/csi-5v93-baylor-university in Computer Science at Baylor University.



Date Created: 10/03/15

## Lecture 14: Bayesian learning

CSI 5v93: Introduction to machine learning — Baylor University, Computer Science Department, Dr. Greg Hamerly (http://cs.baylor.edu/~hamerly)

#### Announcements

Homework 4 will be assigned soon.

#### Questions

#### Comments on homework 3

Overall, great work:

- nice tables and graphics
- many good self-posed questions about "why", picking out interesting aspects
- good explanations of experiments and results

Some constructive remarks:

- color can be helpful in graphs; send them electronically if you can't print them
- it is good to ask difficult questions, but answer them carefully
- explanation, graphics, and results should come first, not code
- please put all parts of the document into LaTeX, including graphs, tables, and code; make the document uniform
- be careful using % in LaTeX; make sure you write \%, otherwise % is treated as the comment character

#### Comparing to the base rate

One thing that no one did on the vowel task: report and compare to the "base rate".

- the base error rate is the error rate if you simply predict one class (the largest-population class) every time
- the classes were evenly distributed in the test and training data
- with 11 classes, the base error rate was $1 - 1/11 \approx 90.9\%$
- even though LDA and linear regression don't do great on this task, they do much better than 90.9% error

#### Bayesian learning and naive Bayes

- Bayes' rule and modelling
- naive Bayes
- smoothing
- binning continuous variables
- applications to text
- shaping probabilities

See the handouts from Mitchell as the primary source; also see section 6.6.3 in your book.

#### Bayes' rule

$$\Pr(M \mid X) = \frac{\Pr(X \mid M)\Pr(M)}{\Pr(X)}$$

Breaking it apart:

- $M$: model; $X$: data
- $\Pr(M \mid X)$: probability of the model given the data
- $\Pr(X \mid M)$: probability of the data given the model
- $\Pr(M)$: prior probability of the model
- $\Pr(X)$: prior probability of the data

#### Using Bayes' rule for classification

How do we use this to classify an input $x$? There are different methods, depending on our assumptions:

- MAP: Maximum A Posteriori
- ML: Maximum Likelihood

#### Naive Bayes classifier

A naive Bayes classifier is based on Bayes' rule. It's a simple, effective tool for learning from data. It's called "naive" because of the naive conditional independence assumption.

Assumption: input variable $i$ is independent of input variable $j$, given the class. This is called conditional independence. In probability terms, if input $x$ has features $x_i$ and $x_j$:

$$\Pr(x_i, x_j \mid c) = \Pr(x_i \mid c)\Pr(x_j \mid c)$$

What does this mean?

#### Back to Bayes' rule

The naive assumption affects $\Pr(X \mid M)$, which can be expanded as

$$\Pr(X \mid M) = \Pr(X_1 \mid M)\Pr(X_2 \mid M)\cdots\Pr(X_d \mid M) = \prod_{i=1}^{d}\Pr(X_i \mid M)$$

So instead of having one multivariate model for $d$-dimensional data, we have $d$ univariate models.

#### Naive Bayes probabilities

Starting again with Bayes' rule and substituting our new definition of $\Pr(X \mid M)$, we get the new probability

$$\Pr(M \mid X) = \frac{\Pr(M)\prod_{i=1}^{d}\Pr(X_i \mid M)}{\Pr(X)}$$

#### Naive Bayes classification

MAP classification:

$$m_{MAP} = \arg\max_{m \in M} \Pr(M = m \mid X) = \arg\max_{m \in M} \Pr(M = m)\prod_{i=1}^{d}\Pr(X_i \mid M = m)$$

ML classification:

$$m_{ML} = \arg\max_{m \in M} \Pr(X \mid M = m) = \arg\max_{m \in M} \prod_{i=1}^{d}\Pr(X_i \mid M = m)$$

#### Typical naive Bayes application: discrete data

Suppose that our data is a text document. How would you model this? In other words, we need a probability model for a document:

$$\Pr(X = \text{``John is a computer scientist''} \mid M)$$

If we have this model $\Pr(X \mid M)$, then we can use Bayes' rule to do classification.
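As a small sketch of the MAP rule above — pick the class maximizing $\Pr(m)\prod_i \Pr(x_i \mid m)$ with counting probabilities — here is a minimal naive Bayes classifier. The toy weather data and function names are illustrative, not from the lecture.

```python
from collections import Counter, defaultdict

def train_counts(examples):
    """examples: list of (feature_tuple, label). Returns class and per-feature counts."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # feature_counts[label][(index, value)]
    for features, label in examples:
        class_counts[label] += 1
        for i, v in enumerate(features):
            feature_counts[label][(i, v)] += 1
    return class_counts, feature_counts

def classify_map(features, class_counts, feature_counts):
    """Return argmax_m Pr(m) * prod_i Pr(x_i | m), using raw counts (no smoothing)."""
    n = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, c in class_counts.items():
        score = c / n  # prior Pr(M = m)
        for i, v in enumerate(features):
            score *= feature_counts[label][(i, v)] / c  # Pr(X_i = v | m)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

data = [(("sunny", "hot"), "play"), (("sunny", "mild"), "play"),
        (("rainy", "mild"), "stay"), (("rainy", "hot"), "stay")]
cc, fc = train_counts(data)
print(classify_map(("sunny", "mild"), cc, fc))  # play
```

Note that a single zero count zeroes out the whole product; the smoothing slides later address exactly this.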
#### A bag-of-words model

Suppose that we model a document as a collection ("bag") of words, without order. For example, the following are equivalent documents under the bag-of-words model:

- document a: "John is a computer scientist"
- document b: "scientist a computer John is"
- document c: "computer a is John scientist"

Then we need $\Pr(X = a \mid M)$.

#### A bag-of-words model

How do we model $\Pr(X = \text{``John is a computer scientist''} \mid M)$?

Imagine that we limit the vocabulary to, say, 10 words. Only these words are permitted: at, computer, scientist, is, c, john, baylor, a, he, where.

Then, using the bag-of-words model, this is a 10-dimensional problem. Each dimension can have a 1 (the word is present in the document) or a 0 (the word is not present). The document "John is a computer scientist" would have the feature vector:

| at | computer | scientist | is | c | john | baylor | a | he | where |
|----|----------|-----------|----|---|------|--------|---|----|-------|
| 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |

We ignore duplicate words for now. So now we have a discrete representation, but not yet a probability model.

#### The bag-of-words model

For discrete data, we represent it using a nonparametric model, which means counting. Counting probabilities:

$$\Pr(X = x) = \frac{\text{count}(X = x)}{n}$$

where $\text{count}(a)$ means to count all records where $a$ is true, and $n$ is the total number of records considered in the training data. So to model a document, we need to learn a probability such as the one above: the probability of the individual words appearing together in that document.

#### Modelling the joint bag-of-words probability

Document a: "John is a computer scientist". Classes: EMAIL or SPAM.

$$\Pr(X = a \mid \text{EMAIL}) = \frac{\text{count}(X = a \wedge \text{EMAIL})}{\text{count}(\text{EMAIL})}$$

To calculate this, we would have to count all documents in the class EMAIL and look for documents that have the words "john is a computer scientist". Likewise for SPAM:

$$\Pr(X = a \mid \text{SPAM}) = \frac{\text{count}(X = a \wedge \text{SPAM})}{\text{count}(\text{SPAM})}$$
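The presence/absence representation above is easy to sketch directly. This uses the slide's 10-word vocabulary; the function name is illustrative.

```python
# The 10-word vocabulary from the slide.
VOCAB = ["at", "computer", "scientist", "is", "c", "john", "baylor", "a", "he", "where"]

def bag_of_words(document, vocab=VOCAB):
    """Binary feature vector: 1 if the word appears, 0 otherwise (duplicates ignored)."""
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocab]

v = bag_of_words("John is a computer scientist")
print(v)  # [0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
```

Because the bag is unordered, all three equivalent documents from the previous slide map to this same vector.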
#### 2-minute journal

Please write a response to the following on a piece of paper and hand it in immediately. Please make it anonymous (no names). Write about:

- major points you learned today
- areas not understood or requiring clarification

## Lecture 22: Support Vector Machines

CSI 5v93: Introduction to machine learning — Baylor University, Computer Science Department, Dr. Greg Hamerly (http://cs.baylor.edu/~hamerly)

#### Questions

#### Separating hyperplanes

Here is a two-class classification problem with several linear boundaries. Note that there are many solutions that use a linear boundary to separate the classes.

#### Separable and non-separable data

$$y_i(x_i^T\beta + \beta_0) \ge C(1 - \xi_i), \quad \xi_i \ge 0, \quad \sum_{i=1}^{n}\xi_i \le \text{constant}$$

#### Support vector classifier example

[Figure: support vector classifier on two-class data; training error 0.26 with test error 0.30 in one panel, training error 0.270 with test error 0.288 in the other; Bayes error 0.210.]

#### Next: flexible, nonlinear support vector machines

Question: How do we better handle data that is not linearly separable with support vector machines?

Answer: Transform the data into a higher dimension where the data is more easily separated.

#### Moving to higher dimensions: basis expansions

We want to move the data into a higher-dimensional space with several basis expansion functions $h_1(x), h_2(x), \ldots, h_M(x)$. For example, we might have $x \in \mathbb{R}^2$ and choose $M = 4$ basis functions:

$$h_1(x) = 1, \quad h_2(x) = x_1, \quad h_3(x) = x_2, \quad h_4(x) = x_1 x_2$$

Then we would learn a $\beta$ that lives in this higher-dimensional space and separates the classes.

Key idea: transforming into a higher dimension allows $\beta$ to represent a linear hyperplane in the higher-dimensional space, which is a nonlinear curve in the original dimensions.
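As a quick sketch of a basis expansion (using the illustrative basis $h(x) = (1, x_1, x_2, x_1 x_2)$): a function that is linear in the expanded coordinates is nonlinear in the original ones.

```python
def expand(x):
    """Basis expansion h(x) = (1, x1, x2, x1*x2) for x in R^2."""
    x1, x2 = x
    return (1.0, x1, x2, x1 * x2)

# A linear function in the expanded space: f(x) = beta . h(x).
# With beta = (0, 0, 0, 1), f is the hyperbola x1 * x2 -- a nonlinear
# curve in the original two dimensions.
beta = (0.0, 0.0, 0.0, 1.0)

def f(x):
    return sum(b * h for b, h in zip(beta, expand(x)))

print(f((2.0, 3.0)))  # 6.0
```

A separating hyperplane learned in the 4-dimensional $h$-space therefore traces a curved boundary back in $\mathbb{R}^2$.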
#### Example of moving into higher-dimensional spaces

[Figure]

#### Rewriting in terms of inner products

Note that we have the Lagrangian (dual) form

$$L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \langle h(x_i), h(x_j)\rangle$$

This depends only on the inner product (aka dot product) of $h(x_i)$ and $h(x_j)$, not on $h$ directly. Note also that we could write $f(x)$ as

$$f(x) = h(x)^T\beta + \beta_0 = \sum_{i=1}^{n}\alpha_i y_i \langle h(x), h(x_i)\rangle + \beta_0$$

so this also depends only on the dot product, not on $h$.

#### The kernel trick

We can rewrite all the equations we need for SVMs in terms of dot products in the higher-dimensional space. It turns out that we can compute these dot products in the high-dimensional space without using the basis expansions. All we need to know is the kernel function

$$K(x, y) = \langle h(x), h(y)\rangle$$

Our $h$ could go to many dimensions (even infinitely many), but we don't need to actually go there.

#### Kernel properties

A kernel:

- represents an inner product in some dimension (usually a higher dimension)
- should be positive semidefinite

A kernel doesn't have to have a nice basis-expansion form. All you need for a kernel is to define a positive semidefinite function $K(x_i, x_j)$.

#### The Support Vector Machine: summary

Given a kernel function $K(\cdot,\cdot)$ and an error parameter $\gamma$, maximize

$$L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j)$$

subject to

$$0 \le \alpha_i \le \gamma, \qquad \sum_{i=1}^{n}\alpha_i y_i = 0$$

The decision function for the classifier is

$$\hat{f}(x) = \sum_{i=1}^{n}\hat{\alpha}_i y_i K(x, x_i) + \hat{\beta}_0$$

Where did $\beta$ go?

#### Example kernels

- polynomial kernel of degree $d$: $K(x, x') = (1 + \langle x, x'\rangle)^d$
- radial basis kernel: $K(x, x') = \exp(-\|x - x'\|^2 / c)$
- neural network kernel: $K(x, x') = \tanh(\kappa_1\langle x, x'\rangle + \kappa_2)$

#### Representing the space for the polynomial kernel

Polynomial kernel of degree $d$: $K(x, x') = (1 + \langle x, x'\rangle)^d$.
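The kernel trick can be checked numerically for the degree-2 polynomial kernel on $\mathbb{R}^2$: $K(x,y) = (1 + \langle x, y\rangle)^2$ equals an ordinary dot product under an explicit 6-dimensional expansion $h$. (This standard identity is not spelled out on the slides; the expansion below is the usual one.)

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (1 + <x, y>)^2 on R^2."""
    return (1.0 + x[0] * y[0] + x[1] * y[1]) ** 2

def h(x):
    """Explicit 6-dimensional expansion whose dot product equals poly_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, y), dot(h(x), h(y)))  # both 4.0
```

The kernel evaluation costs a handful of multiplications regardless of how large (or infinite, for the radial basis kernel) the implicit expansion is.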
#### Examples

[Figure: nonlinear SVM decision boundaries; training errors 0.180 and 0.160; Bayes error 0.210.]

#### Using SVMs

You should probably never implement an SVM on your own; they are tricky. Fortunately, there is a lot of good software out there for SVM learning. When using an SVM learner you must choose:

- the kernel
- the error penalty $\gamma$

Problems: it is not intuitive how to choose the kernel or $\gamma$. Advantages: SVMs often perform amazingly well.

#### 2-minute journal

(Same anonymous prompt as before: major points you learned today; areas not understood or requiring clarification.)

## Lecture 8: Variable selection and model verification

CSI 5v93: Introduction to machine learning — Baylor University, Computer Science Department, Dr. Greg Hamerly (http://cs.baylor.edu/~hamerly)

#### Announcements

Homework 2 is due February 8th.

#### Questions

Questions from last time:

- For stepwise selection, why not use the t statistic instead of the F statistic? (See homework exercise 3.1; more examples are coming as we move along.)
- How do we choose the best model using ridge regression? How do we use ridge regression?
- What is cross-validation? How does it work?
- How can correlated variables negate one another?

#### Chapter 3: Linear methods for regression

- 3.1 Introduction
- 3.2 Linear regression models and least squares
- 3.3 Multiple regression from simple univariate regression
- 3.4 Subset selection and coefficient shrinkage
- 3.5 Computational considerations

#### Coefficient shrinkage

Subset selection of variables:

- eliminates variables completely, by testing whether their coefficients are sufficiently close to zero
- trying all subsets is a difficult combinatorial problem

An alternative is to shrink the coefficient values so that variables become less important. This turns the combinatorial (discrete) problem of subset selection into a smoother problem, which can be optimized easily. It also addresses the problem of correlated variables, whose coefficients can negate one another and have high variance.

#### Ridge regression

Ridge regression adds a squared penalty on each coefficient:

$$\hat{\beta}^{ridge} = \arg\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{d}x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{d}\beta_j^2$$

In matrix form, the penalized residual sum of squares is $(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$, and the minimizer is

$$\hat{\beta}^{ridge} = (X^TX + \lambda I)^{-1}X^Ty$$

where $I$ is the $d \times d$ identity matrix.

#### The problem of correlated variables

Let $y = \beta_1 x_1 + \beta_2 x_2$ be our model.

- If $x_1$ and $x_2$ are independent, then both $\beta_1$ and $\beta_2$ are necessary to represent $y$.
- If $x_1 = 2x_2$, or $x_1$ is some multiple of $x_2$, then they are correlated.
- If they are correlated, then EITHER $\beta_1$ or $\beta_2$ could be used to accurately represent $y$: the $\beta_1$ and $\beta_2$ are not unique. This causes instability.

Ridge regression solves this by adding $\lambda$ to the diagonal of $X^TX$.

#### Example of ridge regression

[Figure: ridge coefficient profiles on the prostate cancer data; variables include lcavol, lweight, lbph, gleason, age, and lcp.]

#### Model selection

How do we choose the best model for a machine learning problem? Choosing a model means choosing the model type and the parameters for that model.

There are different model types:

- linear regression
- k-nearest neighbor
- many others we will look at

There are different parameters that we can select:

- use all parameters
- use a subset (subset selection)
- shrink the coefficients
- create new parameters as combinations of the old ones
- other ways to select parameters
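The ridge estimate above, $(X^TX + \lambda I)^{-1}X^Ty$, collapses to a one-line scalar formula in the one-dimensional case, which makes the shrinkage effect easy to see. The toy data here is illustrative.

```python
def ridge_1d(xs, ys, lam):
    """Return beta minimizing sum_i (y_i - beta*x_i)^2 + lam * beta^2.

    This is (X^T X + lam I)^{-1} X^T y with a single column:
    beta = (sum x_i y_i) / (sum x_i^2 + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]           # exactly y = 2x

print(ridge_1d(xs, ys, 0.0))   # 2.0 -- lambda = 0 is ordinary least squares
print(ridge_1d(xs, ys, 14.0))  # 1.0 -- lambda > 0 shrinks beta toward zero
```

As the lecture notes later advise for homework 2, in practice the inputs and output should be centered before applying ridge regression.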
#### Determining the best model

How do we determine which of two models is better? Often we use some measure of the model's accuracy. There are lots of possibilities:

- sum-of-squared error (aka RSS) for regression problems
- sum of absolute error for regression problems
- prediction accuracy for classification problems
- likelihood
- others

All of these metrics could be measured on:

- the training set
- the test set
- a held-out validation set
- other data

#### Example: RSS for model selection

[Figure: cancer dataset, forward stepwise selection; RSS per point vs. number of variables.]

Simple linear regression on the training data with different numbers of parameters. Forward stepwise selection uses RSS on the training data to select the next parameter.

#### Verifying a model with held-out data

A simple and reasonable way to choose a model is based on "holdout" data, or a "validation set". Model selection with a holdout set:

- Given training data $X$, divide it into $X_T$ and $X_V$, the training and validation sets.
- Learn different models on $X_T$.
- Test each model on $X_V$.
- Choose the model with the lowest error / highest accuracy on $X_V$.

What are the advantages and disadvantages of this approach?

#### Cross-validation

Cross-validation takes the holdout set one step further. Divide your training data $X$ into $k$ disjoint parts (aka folds):

$$X = X_1 \cup X_2 \cup \cdots \cup X_k$$

Cross-validation procedure:

- loop over $i = 1, \ldots, k$:
  - train on $X \setminus X_i$
  - validate on $X_i$, obtaining an accuracy or error on $X_i$
- report the $k$ errors

It's common to use $k = n$ (called leave-one-out cross-validation) or $k = 10$ (10-fold cross-validation).

#### Cross-validation and choosing the model

Cross-validation produces $k$ different errors on the $k$ folds for each model. Taking the average of the errors gives an idea of the expected error for other, similar test sets.
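The fold-splitting loop described above can be sketched in a few lines. The "model" here is just the training-fold mean, purely for illustration; the function names are not from the lecture.

```python
def k_fold_errors(data, k, fit, error):
    """Split data into k disjoint folds; train on k-1 folds, score the held-out one."""
    folds = [data[i::k] for i in range(k)]  # k disjoint parts
    errs = []
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = fit(train)
        errs.append(error(model, folds[i]))
    return errs

fit = lambda xs: sum(xs) / len(xs)                        # "model": the training mean
error = lambda m, xs: sum((x - m) ** 2 for x in xs) / len(xs)  # mean squared error

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
errs = k_fold_errors(data, 3, fit, error)
print(len(errs))  # 3 -- one error per fold
```

With `k = len(data)` the same loop performs leave-one-out cross-validation.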
Finding the standard deviation of the errors gives an idea of how much the model varies in its error.

#### Examples of variable selection

[Figure: CV error vs. subset size (all subsets) and CV error vs. degrees of freedom (ridge regression).]

Error bars are obtained by cross-validation. The least complex model within 1 standard error of the best average error is chosen.

#### Cross-validation, nearest neighbors, and repeats

Think about the 1-nearest-neighbor classifier. Imagine that your training data has 2 copies of every input example. What happens if you use leave-one-out cross-validation with this model and data?

#### Practical issues for homework 2

- Center your data (input and output) before using ridge regression; see the book, p. 60.
- The first column of the data is the true label, so you don't want to use that value as an input, because it makes the problem trivial.
- Make sure you understand the ZIP data before you work with it.
- Confusion matrix.

#### 2-minute journal

(Same anonymous prompt as before: major points you learned today; areas not understood or requiring clarification.)

## Lecture 2: Supervised learning introduction

CSI 5v93: Introduction to machine learning — Baylor University, Computer Science Department, Dr. Greg Hamerly (http://cs.baylor.edu/~hamerly)

#### General announcements

- First assignment due January 20th; see the website.
- Get the book ASAP if you don't have it.

#### Questions
#### Chapter 2: Overview of supervised learning

- 2.1 Introduction
- 2.2 Variable types and terminology
- 2.3 Two simple approaches to prediction: least squares and nearest neighbors
- 2.4 Statistical decision theory
- 2.5 Local methods in high dimensions
- 2.6 Statistical models, supervised learning and function approximation
- 2.7 Structured regression models
- 2.8 Classes of restricted estimators
- 2.9 Model selection and the bias-variance tradeoff

#### Least squares linear regression (2.3.1)

A linear model is one which is linear in the inputs. Examples:

- linear model: $f(x) = 3x + 9$
- linear model: $f(x, y) = 3x - 7y + 10$
- nonlinear model: $f(x, y) = x^2 - 4y + 6$
- NOTE: $f(w, w^2y, 1/y) = 12 - w^2y + w - 1/y$ is a linear model, because it is linear in its inputs $w$, $w^2y$, and $1/y$

Linear models are popular because they are:

- simple to understand
- simple to find
- stable (high bias)

Linear models make strong assumptions about the relationship between the inputs and the output.

#### Linear model

Given a vector of inputs $X = (X_1, X_2, \ldots, X_p)$, predict the output $Y$ with the model

$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p}X_j\hat{\beta}_j \qquad (1)$$

Using linear algebra's inner-product (aka dot product) notation, we can write this as

$$\hat{Y} = X^T\hat{\beta} \qquad (2)$$

where $\hat{\beta}$ is a vector of all the coefficients and $X^T$ is the transpose of the vector $X$. (Two-dimensional example on the board.)

#### Finding the coefficients

We defined our model: the output $Y$ equals the inputs $X$ times their respective coefficients $\beta$. Here the learning problem is finding the appropriate coefficients $\beta$ for a given training set. How should we do this?

Our first step is to define an error function, which is a common thing in machine learning. The error function will tell us how well our estimate $\hat{\beta}$ is doing at describing the training data.

#### Least squares error function

The most common method for finding coefficients of a linear function is least squares. Error function:

$$RSS(\beta) = \sum_{i=1}^{n}(y_i - x_i^T\beta)^2 \qquad (3)$$

We want to minimize this error function for a fixed training set. Minimization should remind you of calculus: take the derivative with respect to $\beta$ and set it to zero. Remember that $\beta$ is a vector.

#### Least squares error function

We can rewrite the last equation as

$$RSS(\beta) = (y - X\beta)^T(y - X\beta) \qquad (4)$$

Taking the derivative with respect to $\beta$ and setting it to zero, we get

$$X^T(y - X\beta) = 0 \qquad (5)$$

If $X^TX$ is nonsingular, then the unique solution is given by

$$\hat{\beta} = (X^TX)^{-1}X^Ty \qquad (6)$$

We will derive this later; see section 3.2 in your book.

#### Linear regression application

[Figure: linear regression on two-dimensional inputs; output GREEN or RED, coded as 0/1.]

- Input: 2 dimensions. Output: GREEN or RED, coded as 0/1.
- Note: we have used a regression model for a classification problem. Predict RED if $\hat{Y} > 0.5$, or GREEN if $\hat{Y} \le 0.5$.
- The linear model makes strict assumptions. If the data does not fit a linear model, we have misclassifications.

#### Nearest neighbors (2.3.2)

Instead of using a linear model, let's use the points near the query to classify it:

$$\hat{Y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)}y_i$$

where $N_k(x)$ is the neighborhood of the $k$ points closest to $x$. This is an average of the nearest neighbors. Here we don't assume a linear model, or any model. We do have to choose $k$, however.

#### 15-nearest-neighbor classifier

[Figure: 15-nearest-neighbor classifier.]

We still make misclassifications, but we make fewer than the linear model. Note the shape of the boundary, which is determined by the red/green points.

#### 1-nearest-neighbor classifier

[Figure: 1-nearest-neighbor classifier.]

We can get perfect performance on the training set if we use $k = 1$. However, this does not generalize well.
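The $k$-nearest-neighbor prediction $\hat{Y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i$ can be sketched directly; the one-dimensional toy data here is illustrative.

```python
def knn_predict(x, train, k):
    """train: list of (x_i, y_i) pairs. Average the y of the k closest x_i."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

train = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (10.0, 0.0)]
print(knn_predict(1.4, train, 2))  # averages y at x=1 and x=2 -> 3.0
```

With `k = 1` this memorizes the training set (perfect training accuracy, poor generalization), which is exactly the tradeoff the surrounding slides discuss.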
#### Number of parameters to estimate

[Figure: from few parameters to many parameters — linear regression, 15-nearest-neighbor classifier, 1-nearest-neighbor classifier.]

For linear regression we estimate $d + 1$ parameters (where here $d = 2$). For nearest neighbors we estimate roughly $N/k$ parameters. More parameters means more flexibility, which makes the model harder to learn and requires lots of data.

Machine learning must deal with the tradeoff between model flexibility and model stability/learnability. This is often called the bias-variance tradeoff.

#### Bias and variance

[Figure: linear regression vs. nearest-neighbor classifier on the same data.]

Every model has some bias and some variance.

- A strict model like linear regression has high bias and low variance. Better with smaller training sets.
- A loose model like nearest neighbor has low bias and high variance. Better with larger training sets.

The bias and variance trade off depending on the model; we will see more later.

#### Bias-variance and model complexity (2.9)

[Figure: prediction error vs. model complexity for training and test samples; high bias / low variance on the left, low bias / high variance on the right.]

Another way to look at bias-variance and model complexity: on the left we are underfitting; on the right we are overfitting.

#### Quick sidebar: expectations

Expectation: the weighted mean of a variable, where the weights come from a probability distribution. Expectations for a continuous variable $x$ and a discrete variable $y$:

$$E[x] = \int x\,p(x)\,dx, \qquad E[y] = \sum_y y\,\Pr(y) \qquad (9)$$

Expectations are linear:

$$E[ax + by] = aE[x] + bE[y] \qquad (10)$$

You should be familiar with expectations.

#### Statistical decision theory (2.4)

Given $X \in \mathbb{R}^d$, $Y \in \mathbb{R}$, and joint distribution $P(X, Y)$, we want a function $f(X)$ that predicts $Y$ given $X$. We need a loss function $L(Y, f(X))$ for penalizing prediction errors. The simplest and easiest is squared-error loss:

$$EPE(f) = E\big[(Y - f(X))^2\big] \qquad (11)$$

$$EPE(f) = E_X\,E_{Y|X}\big[(Y - f(X))^2 \mid X\big] \qquad (12)$$

See your book for derivations, but this is minimized when

$$f(x) = E[Y \mid X = x] \qquad (13)$$

Thus the minimum-error solution is when we take the weighted mean of the output variable $Y$ at $X = x$.

#### Averaging over data

Nearest-neighbor methods average over training data in a neighborhood near the input point:

$$\hat{f}(x) = \text{Ave}(y_i \mid x_i \in N_k(x)) \qquad (14)$$

This is an approximation to computing the expected output for the given input. Linear regression also averages over the training data. Both compute averages, but they have dramatic differences:

- least squares assumes $f(x)$ is well-approximated by a globally linear function
- $k$-nearest neighbors assumes $f(x)$ is well-approximated by a locally constant function

#### Classifiers and the Bayes classifier

For a classifier, we define the loss function so that $L(k, l)$ is the cost of classifying an observation of true class $g_k$ as $g_l$. If we assume that the losses are 0 or 1 ($L(k, l) = 0$ if $k = l$; $L(k, l) = 1$ if $k \ne l$), then we obtain the Bayes classifier:

$$\hat{G}(x) = g_k \;\text{ if }\; \Pr(g_k \mid X = x) = \max_{g \in \mathcal{G}}\Pr(g \mid X = x) \qquad (15)$$

The Bayes classifier always chooses the most probable class. The error rate of the Bayes classifier is called the Bayes error rate.

#### Optimal Bayes classifier

[Figure: Bayes optimal classifier.]

The classifier shown here comes from the known distribution that generated the data, not from the data itself. Therefore it is optimal.

#### Dimensionality problems (2.5)

Even though the $k$-nearest-neighbor classifier seems like it will do everything we need, it has problems when the input dimension is high. This is known as the "curse of dimensionality", and it affects every learning algorithm (though some worse than others).

Basic idea: in higher dimensions, everything looks far away. Example explanation: the Euclidean distance formula is

$$d(x, y) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2} \qquad (16)$$

As $d$ increases, there are more added terms. Distances get large.
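The "more added terms" point is easy to see numerically: each extra dimension contributes a nonnegative term under the square root, so distances between random points grow as $d$ increases. This toy experiment is illustrative.

```python
import math
import random

def euclidean(x, y):
    """Euclidean distance sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0

# Distance between two random points in the unit hypercube, as d grows:
random.seed(0)
for d in (1, 10, 100):
    x = [random.random() for _ in range(d)]
    y = [random.random() for _ in range(d)]
    print(d, round(euclidean(x, y), 2))
```

The printed distances climb with the dimension even though every coordinate still lies in $[0, 1]$.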
#### The curse: another example

[Figure: unit cube; required side length vs. fraction of volume captured, for several dimensions.]

The graph shows the side length required to capture a given fraction of the volume in a unit hypercube, for different dimensions. For $d = 10$, capturing just 10% of the volume requires a box with side length $0.1^{1/10} \approx 0.8$.

#### 2-minute journal

(Same anonymous prompt as before: major points you learned today; areas not understood or requiring clarification.)

## Lecture 16: Bayesian learning

CSI 5v93: Introduction to machine learning — Baylor University, Computer Science Department, Dr. Greg Hamerly (http://cs.baylor.edu/~hamerly)

#### Questions

#### Bayesian learning and naive Bayes

- Bayes' rule and modelling
- naive Bayes
- smoothing
- binning continuous variables
- applications to text
- shaping probabilities

See the handouts from Mitchell as the primary source; also see section 6.6.3 in your book.

#### Naive Bayes classifier

A naive Bayes classifier is based on Bayes' rule. Naive assumption: input variable $i$ is independent of input variable $j$, given the class. This is called conditional independence. In probability terms, if input $x$ has features $x_i$ and $x_j$:

$$\Pr(x_i, x_j \mid c) = \Pr(x_i \mid c)\Pr(x_j \mid c)$$

#### Naive Bayes probabilities

Starting again with Bayes' rule and substituting $\Pr(X \mid M) = \prod_{i=1}^{d}\Pr(X_i \mid M)$, we get

$$\Pr(M \mid X) = \frac{\Pr(M)\prod_{i=1}^{d}\Pr(X_i \mid M)}{\Pr(X)}$$

#### Using log-probabilities

If we multiply many small numbers (between 0 and 1), we will get a very small number. Using logarithms helps us fix this problem. With counting probabilities,

$$\log\Pr(X = a \mid M) = \log\prod_{i=1}^{d}\frac{\text{count}(X_i = a_i \wedge M)}{\text{count}(M)} = \sum_{i=1}^{d}\log\text{count}(X_i = a_i \wedge M) - d\log\text{count}(M)$$

#### Using log-probabilities

We can apply the log-probabilities to the entire Bayes' rule:

$$\log\Pr(M \mid X) = \log\Pr(X \mid M) + \log\Pr(M) - \log\Pr(X) = \sum_{i=1}^{d}\log\text{count}(X_i = a_i \wedge M) - d\log\text{count}(M) + \log\Pr(M) - \log\Pr(X)$$

Then the MAP classification is still the class $M$ that gives the largest $\log\Pr(M \mid X)$ (similarly for ML).

#### Prior probabilities Pr(M)

For the tasks we will use, we model $\Pr(M)$ as the frequency of class $M$. Therefore

$$\Pr(M) = \frac{\text{count}(M)}{n}, \qquad \log\Pr(M) = \log\text{count}(M) - \log n$$

where $n$ is the total number of records in the training set and $\text{count}(M)$ is the number of times that class $M$ occurs in the training set.

#### Smoothing zero probabilities

A standard way to deal with zero probabilities is to eliminate them by smoothing. The simplest way to smooth probabilities is by adding a smoothing constant $\lambda$. For example:

$$\Pr(X_i = a_i \mid M) = \frac{\text{count}(X_i = a_i \wedge M) + \lambda}{\text{count}(M) + k\lambda}$$

where $\lambda > 0$ and $k$ is the number of different values of $X_i$ that have been observed.

#### Example of smoothing zero probabilities

We observe the following frequencies in the training set: count(TEMP=HIGH ∧ SUNNY) = 4, count(TEMP=MED ∧ SUNNY) = 2, count(TEMP=LOW ∧ SUNNY) = 0. Then, with $\lambda = 1$ and $k = 3$:

| | before smoothing | after smoothing |
|---|---|---|
| Pr(TEMP=HIGH \| SUNNY) | 4/6 | (4+1)/(6+3) = 5/9 |
| Pr(TEMP=MED \| SUNNY) | 2/6 | (2+1)/(6+3) = 3/9 |
| Pr(TEMP=LOW \| SUNNY) | 0/6 | (0+1)/(6+3) = 1/9 |

#### Naive Bayes probability densities

We have described the naive Bayes classifier as using nonparametric probability models based on counts. If we have knowledge that the data is parametric (e.g. Gaussian), we can use that instead.
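The smoothing example above can be checked numerically; the formula is the add-$\lambda$ rule from the slide, applied to the counts 4, 2, 0 with count(SUNNY) = 6 and $k = 3$ observed values.

```python
def smoothed(count_val, count_class, k, lam=1.0):
    """Add-lambda smoothed probability: (count + lam) / (class_count + k*lam)."""
    return (count_val + lam) / (count_class + k * lam)

probs = [smoothed(c, 6, 3) for c in (4, 2, 0)]
print([round(p, 3) for p in probs])  # [0.556, 0.333, 0.111], i.e. 5/9, 3/9, 1/9
print(abs(sum(probs) - 1.0) < 1e-9)  # True: smoothing still yields a distribution
```

Note that the zero count becomes 1/9 rather than 0, so a single unseen feature value no longer annihilates the whole naive Bayes product.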
learning Lecture l rp lilB Example For example modelling two attributes with naive Bayes for bouncy balls attribute 1 color red green blue orange yellow white black 39 attribute 2 bounce return real number We could model color with a typical nonparametric counting density function We could model bounce returnquot with a Gaussian distribution assuming this is appropriate The joint probability under the naive assumption is still just the product of the two probability functions Prcolorred bounce90 Prcolorred Prbounce90 CSI Eves Introduction to machine learning Lecture l rp l2l8 Binning Most of the time with naive Bayes practitioners use the nonparametric model most general assumption 39 requires most data for learning Suppose that you have realvalued data how do you model this with a nonparametric model that expects discrete values Answer divide up the real space into discrete regions CSI Eves Introduction to machine learning Lecture i rp iGiB Binning methods Equalwidth binning choose a uniform interval width and cover all inputs with bins of that width Equalfrequency binning choose a number of bins and cover all inputs with that many bins each bin having the same number of examples Advantages Disadvantages CSI Eves Introduction to machine learning Lecture i rp i4i8 Shaping probabilities thresholds When classifying an example ac we have the probability that it belongs to each class PrM mIX We classify to the class that gives the largest probability If the number of classes is 2 then we really have only one probability since the two probabilities must sum to 1 The threshold for classifying to class A versus class B is usually 50 In other words classify to A if PrAIX 2 50 If we want to be more sensitive to one class over the other we can change this threshold This is equivalent to modifying the priors CSI Eves Introduction to machine learning Lecture l rp l5l8 Shaping probabilities Usually with naive Bayes the probabilities that it gives are extreme they are very close 
to 0 or 1. This causes the probabilities to be unreliable. How can we deal with this?

- ranking probabilities
- shaping probabilities

### Project discussion

### 2-minute journal

Please write a response to the following on a piece of paper and hand it in immediately. Please make it anonymous (no names). Write about:

- major points you learned today
- areas not understood or requiring clarification

## Lecture 21: Support Vector Machines

CSI 5v93 Introduction to machine learning, Baylor University Computer Science Department, Dr. Greg Hamerly, http://cs.baylor.edu/~hamerly

### Questions

### Separating hyperplanes

Here is a two-class classification problem with several linear boundaries *(figure)*. Note that there are many linear boundaries that separate the classes.

### Optimal separating hyperplanes

The optimal separating hyperplane (Vapnik, 1996):

- separates the two classes
- maximizes the distance to the closest point from either class

This second step is called maximizing the margin. What are the advantages of maximizing the margin?

### Mathematically defining maximum margin

$$\max_{\beta,\ \beta_0,\ \|\beta\| = 1} C \quad \text{subject to} \quad y_i(x_i^T \beta + \beta_0) \ge C, \quad i = 1, \dots, n$$

### Mathematically defining maximum margin

From your textbook (4.5.2), presented on the board:

- incorporate $\|\beta\| = 1$
- rewrite in terms of $\|\beta\|$
- Lagrange constraint form
- derivatives of the Lagrange form
- Karush-Kuhn-Tucker conditions
- support points

### Example of support points

*(figure)*

### Nonseparable data (chapter 12)

For separable data we had, for all $i = 1, \dots, n$:

$$y_i(x_i^T \beta + \beta_0) \ge C$$
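As a quick numeric illustration of the separable-case constraint $y_i(x_i^T \beta + \beta_0) \ge C$: the toy data, $\beta$, and $\beta_0$ below are invented for this sketch ($\beta$ is unit-norm, as the formulation requires), and the achieved margin is just the smallest signed distance to the hyperplane.

```python
# Toy check of the separable-case constraint y_i (x_i^T beta + beta_0) >= C.
# The data, beta, and beta_0 here are made up for illustration only.

# Two well-separated classes in 2 dimensions, with labels y in {-1, +1}.
X = [(0.0, 1.0), (1.0, 2.0), (3.0, -1.0), (4.0, 0.0)]
y = [+1, +1, -1, -1]

# A candidate unit-norm direction and intercept (chosen by hand, not optimal).
beta = (-0.8, 0.6)
beta_0 = 1.0

# Signed distance of each point to the hyperplane; positive means the
# point lies on the correct side for its label.
margins = [yi * (xi[0] * beta[0] + xi[1] * beta[1] + beta_0)
           for xi, yi in zip(X, y)]

# The achieved margin C is the smallest signed distance over all points.
C = min(margins)
print(margins)
print(C)
```

Maximizing that smallest value over all choices of $(\beta, \beta_0)$ is exactly what the optimization above does; this sketch only evaluates one hand-picked hyperplane.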
For nonseparable data we introduce slack variables $\xi_i$:

$$y_i(x_i^T \beta + \beta_0) \ge C(1 - \xi_i), \quad \text{where } \xi_i \ge 0, \quad \sum_{i=1}^{n} \xi_i \le \text{constant}$$

### Separable and nonseparable data

*(figure)*

$$y_i(x_i^T \beta + \beta_0) \ge C(1 - \xi_i), \quad \xi_i \ge 0, \quad \sum_{i=1}^{n} \xi_i \le \text{constant}$$

### Support vector classifier

$$y_i(x_i^T \beta + \beta_0) \ge C(1 - \xi_i), \quad \xi_i \ge 0, \quad \sum_{i=1}^{n} \xi_i \le \text{constant}$$

After many Lagrangian formulations and reformulations (see 12.2.1), we get

$$\hat\beta = \sum_{i=1}^{n} \hat\alpha_i y_i x_i$$

The $x_i$ that have corresponding $\hat\alpha_i > 0$ are called support vectors, and they define $\hat\beta$. The remaining $x_i$ do not have any influence on $\hat\beta$.

### Maximum margin for nonseparable data

From your textbook (12.2.1), presented on the board:

- incorporate the $\xi_i$ constraints
- Lagrange constraint form
- derivatives of the Lagrange form
- dual form
- Karush-Kuhn-Tucker conditions
- parameter $\gamma$

### Support vector classifier example

*(figure: two decision boundaries with their training, test, and Bayes error rates; one uses $\gamma = 10000$)*

### Next: flexible nonlinear support vector machines

Question: How do we better handle data that is not linearly separable with support vector machines?

Answer: Transform the data into a higher dimension where the data is more easily separated.

### 2-minute journal

Please write a response to the following on a piece of paper and hand it in immediately. Please make it anonymous (no names). Write about:

- major points you learned today
- areas not understood or requiring clarification

## Lecture 23: Unsupervised learning

CSI 5v93 Introduction to machine learning, Baylor University Computer Science Department, Dr. Greg Hamerly, http://cs.baylor.edu/~hamerly

### Questions
### Introduction

Until now we have discussed the problem of supervised learning. Supervised learning: having a "training signal," or correct answer, for the training examples. This comes in two forms:

- class labels
- regression values

What if we didn't have the training signal for the examples? Can we learn anything? Answer: yes; we now turn to unsupervised learning.

### Unsupervised learning

Unsupervised learning is learning without a training signal. It can be very different from supervised learning:

- no "correct answer"
- the goal is different: the concept to learn is not given with the training data
- overfitting and underfitting are still possible, but it's not as clear when they occur

### Several views on unsupervised clustering

- Summarization: finding a concise representation of given data
- Density estimation: finding the probability distribution of the data
- Classification: dividing a dataset into distinct subsets
- Pattern discovery: identifying relationships in the data that were unknown

All this without a training signal, given only the input features of the data.

### Types of unsupervised learning

- Finding association rules. For example: "If a customer buys lunch meat and cheese, he will probably also buy bread" (14.2)
- Clustering: grouping or segmenting data according to its similarity or dissimilarity (14.3)
- Finding low-dimensional simple surfaces in high-dimensional data (14.5)

### Data clustering (14.3)

Several views/motivations:

- identify and
group items which are similar into several groups
- identify and divide items which are different into several groups
- summarize the data with several prototypes

We can do all these with the k-means algorithm.

### Data clustering (14.3)

Several views/motivations:

- identify and group items which are similar into several groups
- identify and divide items which are different into several groups
- summarize the data with several prototypes

We can do all these with clustering algorithms. There are several types of clustering algorithms:

- hierarchical bottom-up
- hierarchical top-down
- iterative flat clustering (e.g., k-means and Gaussian EM)
- spectral clustering
- density-seeking (e.g., mean-shift)

Most clustering algorithms are iterative in some fashion.

### The k-means algorithm

A flat, iterative-improvement clustering algorithm. Very popular:

- easy to implement
- fairly fast algorithm
- provides good results

Input: a set of $n$ data points in $d$ dimensions. Goal: find $k$ prototypes, or centers, that represent the data well. The centers are not constrained to be part of the input. Each center represents the data that is nearest to it.

### The k-means algorithm

*(figure: iteration number 20)* This is an example of using k-means with $k = 3$ clusters on 2-dimensional data.

### The k-means algorithm

Basic algorithm:

- choose the number of cluster centers $k$
- position the $k$ centers
- iterate until no change:
  - assign each data point to the cluster center nearest to the point
  - move each cluster center to the average of the points assigned to it

### Example: k-means with k = 3 on 2d data

*(figure panels: initial centroids, initial partition, iteration number 2, iteration number 20)*

### Demonstration of k-means

### What k-means is actually doing

k-means is actually minimizing the within-cluster sum of squared distances. Let $\delta_{ij}$ indicate that data point $x_i$ belongs to cluster center $c_j$. Then k-means minimizes

$$F(X, C) = \sum_{i=1}^{n} \sum_{j=1}^{k} \|x_i - c_j\|^2 \delta_{ij} = \sum_{i=1}^{n} \sum_{j=1}^{k} \delta_{ij} \sum_{m=1}^{d} (x_{im} - c_{jm})^2$$

Some points:

- only finds a local minimum of $F(X, C)$
- may get
stuck in poor solutions
- starting with some solution, it always finds a solution that is better or equal
- because of the last point, it will always terminate

### 2-minute journal

Please write a response to the following on a piece of paper and hand it in immediately. Please make it anonymous (no names). Write about:

- major points you learned today
- areas not understood or requiring clarification
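The basic k-means loop and the objective $F(X, C)$ described in this lecture can be sketched in plain Python. The toy dataset, the initial centers, and the helper names (`kmeans`, `objective`, `sq_dist`) are all invented for illustration:

```python
# Minimal k-means sketch following the lecture's basic algorithm:
# assign each point to its nearest center, move each center to the mean
# of its assigned points, and repeat until no center moves.

def sq_dist(a, b):
    # Squared Euclidean distance between two points of equal dimension.
    return sum((am - bm) ** 2 for am, bm in zip(a, b))

def kmeans(points, centers):
    while True:
        # Assignment step: index of the nearest center for each point.
        assign = [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
                  for x in points]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j, c in enumerate(centers):
            mine = [x for x, a in zip(points, assign) if a == j]
            if mine:  # leave an empty cluster's center where it is
                d = len(mine[0])
                c = tuple(sum(x[m] for x in mine) / len(mine) for m in range(d))
            new_centers.append(c)
        if new_centers == centers:  # converged: no center moved
            return centers, assign
        centers = new_centers

def objective(points, centers, assign):
    # Within-cluster sum of squared distances, F(X, C).
    return sum(sq_dist(x, centers[a]) for x, a in zip(points, assign))

# Two obvious groups in 2 dimensions, clustered with k = 2.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, assign = kmeans(data, centers=[(0.0, 0.0), (1.0, 1.0)])
print(centers)
print(objective(data, centers, assign))
```

Each full iteration either lowers $F(X, C)$ or leaves it unchanged, which is why the loop must terminate, though possibly at a local minimum rather than the global one.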
