Machine Learning CS 5350
These 26 pages of class notes were uploaded by Marian Kertzmann DVM on Monday, October 26, 2015. The notes belong to CS 5350 (Machine Learning) at the University of Utah, taught by Harold Daume in Fall.
Machine Learning CS 5350/CS 6350 (25 Sep 2008)

Margins

Large margin principle: we want a $w$ that separates the classes maximally. We assume that $w$ separates the data and achieves a functional margin of at least 1. That is, $w^T x + b \ge 1$ for all positive $x$ and $w^T x + b \le -1$ for all negative $x$.

We defined the geometric margin based on the normalized weight vector: $\gamma = \min_n y_n (u^T x_n + b/\|w\|)$, where $u = w/\|w\|$.

Compute the margin as a function of the normalized weight vector $u$ for a positive point $x^+$ and a negative point $x^-$:

    $\gamma = \frac{1}{2}\left[\left(u^T x^+ + \frac{b}{\|w\|}\right) - \left(u^T x^- + \frac{b}{\|w\|}\right)\right] = \frac{1}{2\|w\|}\left(w^T x^+ - w^T x^-\right) = \frac{1}{2\|w\|}\left[(1-b) - (-1-b)\right] = \frac{1}{\|w\|}$

This shows that the margin is inversely proportional to the norm of the weight vector, and independent of the bias. Moreover, having a large margin is equivalent to having a small weight vector norm (why does this make intuitive sense?).

Now we write the learning problem as an optimization problem:

    minimize over $w$:  $\frac{1}{2}\|w\|^2$
    subject to:  $y_n (w^T x_n + b) - 1 \ge 0 \quad \forall n$

Why large margins? Because they mean simple solutions.

Machine Learning CS 5350/CS 6350

Kernels

Today we talk about kernels. These are magical beings that allow us to turn nearly any linear model into a nonlinear model!

Consider 1D data that is not linearly separable (say, positives sandwiched between negatives on the line). Now map each location $x$ to the 2D point $(x, x^2)$. Then, in 2D, the data is linearly separable.

We can generalize: for $x \in \mathbb{R}^D$, map

    $\Phi: x \mapsto (x_1^2, \ldots, x_D^2, x_1 x_2, \ldots, x_1 x_D, \ldots, x_{D-1} x_D)$

Or maybe we want a cubic function. This is computationally a nightmare!

Recall the SVM dual optimization:

    maximize over $\alpha$:  $\sum_n \alpha_n - \frac{1}{2} \sum_{n,m} y_n y_m \alpha_n \alpha_m\, x_n^T x_m$
    subject to:  $\sum_n y_n \alpha_n = 0$; $\alpha_n \ge 0 \quad \forall n$

And the SVM prediction:

    $f(x) = \mathrm{sign}\left(\sum_{n \in \mathrm{SV}} \alpha_n y_n\, x_n^T x + b\right)$

In both cases, inputs only ever appear in dot products. So we just replace all the $x$s with $\Phi(x)$s: the dual objective becomes $\sum_n \alpha_n - \frac{1}{2}\sum_{n,m} y_n y_m \alpha_n \alpha_m\, \Phi(x_n)^T \Phi(x_m)$ (same constraints), and the prediction becomes $\mathrm{sign}(\sum_{n \in \mathrm{SV}} \alpha_n y_n\, \Phi(x_n)^T \Phi(x) + b)$.

Insight: for many $\Phi$, we can compute the dot products without actually computing the resulting vectors! Consider the quadratic case in two dimensions, with the addition of some constant and linear terms, i.e., $\Phi(x) = (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The dot product can be computed as

    $\Phi(x)^T \Phi(z) = 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (1 + x^T z)^2$

So the cost of computing the dot product in the high-dimensional space is the same as the cost in the low-dimensional space. Such constructions are called kernel products:

    $K_\Phi(x, z) = \Phi(x)^T \Phi(z)$

(the subscript $\Phi$ is often dropped from $K$ when it is clear from context).

A bit of theory. We write the original space (the input space) as $\mathcal{X}$ and the output space of $\Phi$ (the feature space) as $\mathcal{F}$:

- $\Phi: \mathcal{X} \to \mathcal{F}$
- $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, with $K(x,z) = \Phi(x)^T \Phi(z)$
- The previous line means that $\mathcal{F}$ must be a vector space with a dot product; such spaces are called Hilbert spaces.

Question: can I use any function $K$ as a kernel? Answer: no. There must exist some Hilbert space $\mathcal{F}$ for which $K$ is a dot product. Question: ack! How can I know that? Answer: a sufficient condition is that $K$ be positive definite:

    $\int\!\!\int \mathrm{d}x\, \mathrm{d}z\; f(x)\, K(x,z)\, f(z) \ge 0 \quad \forall f \in L_2$

This is equivalent to saying that the resulting kernel matrix for any data set must be positive definite: define $K_{ij} = K(x_i, x_j)$ and require that $K$ be positive definite (Mercer's condition). Why does this matter? It helps us build kernels.

Building kernels. If $K_1, K_2$ are kernels and $\alpha$ is a positive real number, then the following are also all kernels:

- $K(x,z) = K_1(x,z) + K_2(x,z)$ (direct sum)
- $K(x,z) = \alpha K_1(x,z)$ (scalar product)
- $K(x,z) = K_1(x,z)\, K_2(x,z)$ (direct product)

Combining these, any positive linear combination of kernels is a kernel.

Okay, so where do they come from in the first place? We've already seen two:

- Linear: $K(x,z) = x^T z$
- Quadratic: $K(x,z) = (1 + x^T z)^2$

We can actually generalize the second to a polynomial degree-$d$ kernel: $K(x,z) = (1 + x^T z)^d$. The final standard kernel is the RBF kernel. This one is harder to understand; it is defined by

    $K_{\mathrm{rbf}}(x,z) = \exp\left(-\gamma \|x - z\|^2\right)$

Here $K$ is parameterized by a hyperparameter $\gamma$. This is actually an infinite-dimensional kernel: there is no finite space $\mathcal{F}$ such that $K$ corresponds to a dot product in $\mathcal{F}$. The RBF kernel produces decision boundaries that look like blobs. Think of a kernel as a similarity function, and everything makes more sense (hopefully!).
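The quadratic-kernel identity above is easy to check numerically. Here is a minimal sketch (NumPy assumed; the helper names `phi` and `K_quad` are my own, not from the notes):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-D input:
    (1, sqrt(2) x1, sqrt(2) x2, x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

def K_quad(x, z):
    """The same dot product, computed in the low-dimensional input space."""
    return (1.0 + x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# The kernel value matches the explicit high-dimensional dot product (both 4.0 here).
assert np.isclose(K_quad(x, z), phi(x) @ phi(z))
```

The point of the example is the cost asymmetry: `K_quad` touches two numbers per input, while `phi` builds a six-dimensional vector, and the gap explodes for higher degrees and dimensions.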
Using kernels

For any kernel you want, just plug it in to the SVM optimization and you get a nonlinear classifier. Most software supports this. How do I know the best kernel for my problem? Try them all. Do you have intuition? Usually nonlinear kernels introduce extra hyperparameters to tune. We can kernelize algorithms other than the SVM, too (e.g., the perceptron is also easy).

Machine Learning CS 5350/CS 6350 (15 Nov 2008)

Probabilistic Modeling

The goal of probabilistic modeling is to set up a statistical model $p(\text{data} \mid \text{model})$ that explains our data given a model, and then to try to find the best model.

Suppose we believe there is a functional relationship $y = w^T x$ between some set of inputs $x_1, \ldots, x_N$ and some set of outputs $y_1, \ldots, y_N$. However, in the data that we observe, this relationship is corrupted by noise. That is, the $y$ we observe is not precisely $w^T x$, but is rather corrupted by some noise. We wish to model this noise stochastically; one choice is to say that the noise is Gaussian distributed. Namely,

    $y = w^T x + \epsilon, \qquad \epsilon \sim \mathrm{Nor}(0, \sigma^2)$

Here $\epsilon \sim \mathrm{Nor}(\mu, \sigma^2)$ denotes that $\epsilon$ is drawn from a Gaussian (Normal) distribution with mean $\mu$ and variance $\sigma^2$. This Gaussian has density

    $\mathrm{Nor}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right]$

To think about $y = w^T x + \epsilon$, think about a true linear relationship that is then slightly corrupted. Since we assume that the error model is Gaussian with mean zero, this is the same as saying $y \sim \mathrm{Nor}(w^T x, \sigma^2)$. (This is not hard to verify; if you don't see it, try working it out.)

Now we are ready to talk about $p(\text{data} \mid \text{model})$. Since the inputs are always given to us, we do not need to explicitly model them. Thus, although our data are the pairs $(x_n, y_n)$, the $x$s are always provided. Furthermore, our model in this case is totally specified by $w$. We can therefore consider something of the form

    $p(y_1, \ldots, y_N \mid x_1, \ldots, x_N, w)$

The first step is to apply the chain rule:

    $p(y_1, \ldots, y_N \mid x_1, \ldots, x_N, w) = \prod_{n=1}^N p(y_n \mid x_1, \ldots, x_N, w, y_1, \ldots, y_{n-1})$

Next we make an assumption: conditioned on $x_n$ and $w$, the $y$s are completely independent of each other, and $y_n$ is independent of all $x_m$ for $m \ne n$. Thus

    $p(y_1, \ldots, y_N \mid x_1, \ldots, x_N, w) = \prod_{n=1}^N p(y_n \mid x_1, \ldots, x_N, w) = \prod_{n=1}^N p(y_n \mid x_n, w)$

Now we can substitute in our Gaussian model for $p(y_n \mid x_n, w)$:

    $\prod_{n=1}^N p(y_n \mid x_n, w) = \prod_{n=1}^N \mathrm{Nor}(y_n \mid w^T x_n, \sigma^2)$

Our first method of solving this is the maximum likelihood estimator. Namely, we want to choose the $w$ that maximizes the probability of the data given the model (the probability of data given model is typically called the likelihood). Our standard methods for maximizing or minimizing require us to differentiate, and differentiating that product looks nasty. So instead of maximizing the likelihood, we will maximize the log likelihood:

    $\mathcal{L}(w) = \log p(\text{data} \mid \text{model}) = \log \prod_{n=1}^N \mathrm{Nor}(y_n \mid w^T x_n, \sigma^2) = \sum_{n=1}^N \log \mathrm{Nor}(y_n \mid w^T x_n, \sigma^2)$

We can now plug in the definition of the Gaussian and do some simplification:

    $\mathcal{L}(w) = \sum_{n=1}^N \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_n - w^T x_n)^2\right)\right] = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - w^T x_n)^2$

This looks slightly messy, but remember that we're just treating it as a function of $w$. That means the additive term in front is irrelevant, and the constant $\frac{1}{2\sigma^2}$ is also irrelevant. Thus, to maximize the likelihood we could instead maximize $-\sum_n (y_n - w^T x_n)^2$, or equivalently minimize $\sum_n (y_n - w^T x_n)^2$ (note the missing negative sign). This is exactly the linear regression model we came up with before by trying to minimize squared error! One way to interpret this is that by choosing squared error as our loss function, we are implicitly assuming a Gaussian noise model.
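The claim that least squares is the Gaussian maximum likelihood estimator can be checked numerically. A small sketch (NumPy assumed; all names are illustrative): fit $w$ by ordinary least squares, then verify that no nearby $w$ achieves a higher Gaussian log likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25
# Data generated exactly as in the model: y = w^T x + Gaussian noise.
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=N)

def log_likelihood(w):
    """sum_n log Nor(y_n | w^T x_n, sigma2)."""
    resid = y - X @ w
    return -0.5 * N * np.log(2 * np.pi * sigma2) - (resid @ resid) / (2 * sigma2)

# The least-squares solution minimizes sum_n (y_n - w^T x_n)^2 ...
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# ... and therefore also maximizes the Gaussian log likelihood:
for _ in range(20):
    w_other = w_ls + rng.normal(scale=0.1, size=D)
    assert log_likelihood(w_ls) >= log_likelihood(w_other)
```

Nothing about the additive constant or the $1/(2\sigma^2)$ factor affects which $w$ wins, exactly as argued above.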
We can do the same for classification. We might want to make some model like $y = \mathrm{sign}(w^T x)$ for a binary classification problem with $y \in \{-1, +1\}$. However, this is incapable of dealing with noise. Instead, we transform $w^T x$ via the sigmoid $\sigma(\cdot)$ and use the sigmoid's output as the probability that $y$ is $+1$. In other words,

    $p(y = +1 \mid x, w) = \sigma(w^T x)$
    $p(y = -1 \mid x, w) = 1 - \sigma(w^T x)$

A convenient property of the sigmoid is that $1 - \sigma(z) = \sigma(-z)$. Thus we get

    $p(y \mid x, w) = \sigma(y\, w^T x) = \frac{1}{1 + \exp(-y\, w^T x)}$

Following exactly the same steps as before, we obtain

    $p(\text{data} \mid \text{model}) = \prod_{n=1}^N p(y_n \mid x_n, w) = \prod_{n=1}^N \frac{1}{1 + \exp(-y_n w^T x_n)}$

    $\mathcal{L}(w) = \sum_{n=1}^N \log \frac{1}{1 + \exp(-y_n w^T x_n)} = -\sum_{n=1}^N \log\left(1 + \exp(-y_n w^T x_n)\right)$

This is exactly the log loss that we were optimizing before. So optimizing log loss is the same as assuming a sigmoid probability in a statistical model.

Machine Learning CS 5350/CS 6350 (Math refresher, 11 Jan 2006)

Linear algebra

Why linear algebra? It represents a set of linear equations. The system

    $4x_1 - 5x_2 = 13$
    $-2x_1 + 3x_2 = -9$

can be represented as $Ax = b$ with

    $A = \begin{pmatrix} 4 & -5 \\ -2 & 3 \end{pmatrix}, \qquad b = \begin{pmatrix} 13 \\ -9 \end{pmatrix}$

Now we can solve the set of equations by solving for $x$. It additionally provides convenient notation: rather than often writing $\sum_m w_m x_m$, it is shorter and easier to work with $w^T x$.

Notation:
1. If $x$ is a vector, then $x_m$ is the $m$th element of $x$ (a scalar).
2. For dot products one sees $w^T x$, $\langle w, x \rangle$, or $w \cdot x$; we will stick with the first, but any is okay.
3. Matrices are capitalized: $A \in \mathbb{R}^{M \times N}$ means an $M$-row, $N$-column matrix.
4. Vectors will be column vectors, so that $x \in \mathbb{R}^M$ means $x \in \mathbb{R}^{M \times 1}$. To get a row vector, we write $x^T$.
5. If $A$ is a matrix, then $A_{m,n}$ (or $A_{mn}$ when clear) is the scalar in the $m$th row and $n$th column. $A_{\cdot,n}$ is the column vector from the $n$th column of $A$, and $A_{m,\cdot}$ is the row vector from the $m$th row.

Simple operations:
1. $A \in \mathbb{R}^{M \times N}$ and $\alpha \in \mathbb{R}$: $A + \alpha$ is defined by $(A + \alpha)_{mn} = A_{mn} + \alpha$, and $(\alpha A)_{mn} = \alpha A_{mn}$. Subtraction and division are similar.
2. Vectors $x \in \mathbb{R}^M$ and $y \in \mathbb{R}^M$: $x^T y = \sum_m x_m y_m$ is a scalar.
3. $A \in \mathbb{R}^{M \times N}$ and $B \in \mathbb{R}^{N \times P}$ means $C = AB \in \mathbb{R}^{M \times P}$, with $C_{mp} = \sum_n A_{mn} B_{np} = A_{m,\cdot}\, B_{\cdot,p}$.
4. Vectors $x \in \mathbb{R}^M$ and $y \in \mathbb{R}^N$: $xy^T \in \mathbb{R}^{M \times N}$, defined by $(xy^T)_{mn} = x_m y_n$.
5. Multiplication is associative and distributive, but not commutative: $(AB)C = A(BC)$ and $A(B + C) = AB + AC$, but it is possible that $AB \ne BA$.

Additional definitions:
1. Identity matrix $I \in \mathbb{R}^{N \times N}$, defined by $I_{mn} = \mathbf{1}[m = n]$. For all $A$ of appropriate size, $AI = IA = A$.
2. Diagonal matrix $D = \mathrm{diag}(d) \in \mathbb{R}^{N \times N}$ for $d \in \mathbb{R}^N$, defined by $D_{mn} = \mathbf{1}[m = n]\, d_m$.
3. Transpose: if $A \in \mathbb{R}^{M \times N}$, then $A^T \in \mathbb{R}^{N \times M}$ with $(A^T)_{nm} = A_{mn}$. It holds that $(A^T)^T = A$, $(AB)^T = B^T A^T$, and $(A + B)^T = A^T + B^T$.
4. Symmetric: a square $A$ is symmetric if $A = A^T$. Often $\mathbb{S}^N = \{A \in \mathbb{R}^{N \times N} : A \text{ is symmetric}\}$.
5. Trace: the sum of the diagonal, $\mathrm{tr}\, A = \sum_n A_{nn}$ for $A \in \mathbb{R}^{N \times N}$.

Norms:
1. For a vector $x$, the norm $\|x\|$ is typically the Euclidean (2-)norm $\sqrt{x^T x}$.
2. The $p$-norm is $\left(\sum_n |x_n|^p\right)^{1/p}$.
3. The infinity norm is $\max_n |x_n|$.

Inverses:
1. For invertible $A \in \mathbb{R}^{N \times N}$, $A^{-1}$ is the unique matrix such that $A^{-1}A = AA^{-1} = I$.
2. $A$ is invertible iff all of its rows (or columns) are linearly independent.
3. $(A^{-1})^{-1} = A$; $(AB)^{-1} = B^{-1}A^{-1}$; $(A^{-1})^T = (A^T)^{-1}$.
4. If $Ax = b$, then $x = A^{-1}b$.

Determinants:
1. For $A \in \mathbb{R}^{N \times N}$, $\det A \in \mathbb{R}$ (sometimes denoted $|A|$).
2. $\det I = 1$; $\det A = \det A^T$; $\det(AB) = \det A \det B$.
3. $\det A = 0$ implies that $A$ is singular (non-invertible); otherwise, $\det(A^{-1}) = 1/\det A$.
4. If we multiply one row of $A$ by $\alpha$ to get $A'$, then $\det A' = \alpha \det A$. If we swap two rows of $A$ to get $A'$, then $\det A' = -\det A$.
5. In general (cofactor expansion), $\det A = \sum_n (-1)^{m+n} A_{mn} \det \tilde{A}_{mn}$ for any $m \in \{1, \ldots, N\}$, where $\tilde{A}_{mn}$ is $A$ with row $m$ and column $n$ removed; similarly, $\det A = \sum_m (-1)^{m+n} A_{mn} \det \tilde{A}_{mn}$ for any $n \in \{1, \ldots, N\}$.
6. If $A = \mathrm{diag}(a)$, then $\det A = \prod_n a_n$; other special cases exist.

Positive definite matrices:
1. For a quadratic form $x^T A x = \sum_m \sum_n A_{mn} x_m x_n$, we can assume $A$ is symmetric.
2. If $A \in \mathbb{S}^N$, we say $A$ is positive definite (PD) if for all nonzero $x \in \mathbb{R}^N$, $x^T A x > 0$; we write $A \succ 0$, and the set of such matrices is $\mathbb{S}^N_{++}$.
3. If instead $x^T A x \ge 0$, then $A$ is positive semi-definite (PSD), written $A \succeq 0$; the set is $\mathbb{S}^N_{+}$.
4. PD and ND matrices are always invertible.
5. If $A \in \mathbb{R}^{M \times N}$, then $G = A^T A \in \mathbb{R}^{N \times N}$ is always PSD.

Calculus

If $f: \mathbb{R} \to \mathbb{R}$ is well behaved, then $\partial f/\partial x$ is the first derivative, $\partial^2 f/\partial x^2$ is the second derivative, etc. The gradient is the multivariate extension. Suppose $f: \mathbb{R}^{M \times N} \to \mathbb{R}$ is well behaved; then the gradient is defined by

    $\nabla_A f = \begin{pmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1N}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{M1}} & \cdots & \frac{\partial f}{\partial A_{MN}} \end{pmatrix} \in \mathbb{R}^{M \times N}, \qquad (\nabla_A f)_{mn} = \frac{\partial f}{\partial A_{mn}}$

The usual linearity rules hold. As with standard calculus, $\nabla_A f = 0$ means that $A$ is an extremum of $f$.

The Hessian is a multivariate extension of the second derivative. Suppose $f: \mathbb{R}^N \to \mathbb{R}$; then

    $\nabla_x^2 f = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_N \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_N \partial x_N} \end{pmatrix} \in \mathbb{R}^{N \times N}, \qquad (\nabla_x^2 f)_{mn} = \frac{\partial^2 f}{\partial x_m \partial x_n}$

Examples ...
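A few of the identities from this refresher, checked numerically (NumPy assumed; this sketch is not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# G = A^T A is symmetric and PSD: x^T G x = ||Ax||^2 >= 0 for every x.
A = rng.normal(size=(4, 3))
G = A.T @ A
assert np.allclose(G, G.T)
for _ in range(100):
    x = rng.normal(size=3)
    assert x @ G @ x >= -1e-12  # tiny slack for floating point

# Determinant identities: det(BC) = det(B) det(C), det(B) = det(B^T).
B = rng.normal(size=(3, 3))
C = rng.normal(size=(3, 3))
assert np.isclose(np.linalg.det(B @ C), np.linalg.det(B) * np.linalg.det(C))
assert np.isclose(np.linalg.det(B), np.linalg.det(B.T))

# Transpose of a product reverses the order: (BC)^T = C^T B^T.
assert np.allclose((B @ C).T, C.T @ B.T)
```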
Probability and statistics

A random variable (r.v.) is a value determined by chance, drawn from a probability distribution. Examples:
1. Input data
2. Output data
3. Noise

We will often think in terms of a data-generating model.

Discrete distributions take on discrete values:
1. Bernoulli (coin flipping): $p(1) = \pi$, $p(0) = 1 - \pi$, for $\pi \in [0,1]$. This can also be written $p(x) = \pi^x (1 - \pi)^{1-x}$.
2. Binomial (coin flipping, cont'd): what is the probability of $k$ heads in a sequence of $n$ trials?

    $p(k \mid n, \pi) = \binom{n}{k} \pi^k (1 - \pi)^{n-k}$

3. Multinomial (rolling dice): parameter vector $\theta \in \mathbb{R}^K$, $\theta_k \ge 0$, with $\sum_k \theta_k = 1$. Let $x_k$ be the number of times $k$ comes up in $n$ rolls; then

    $p(x \mid \theta, n) = \frac{n!}{\prod_k x_k!} \prod_k \theta_k^{x_k}$

Continuous distributions take on continuous values. Think of a discrete distribution with the number of values approaching infinity and the probability of each value approaching zero:
1. $p(x) \ge 0$ and $\int \mathrm{d}x\; p(x) = 1$.
2. Events are $p(a < x < b) = \int_a^b \mathrm{d}x\; p(x)$.

Examples:
1. Uniform distribution: $p(x \mid R) = \frac{1}{R}\,\mathbf{1}[0 \le x \le R]$.
2. Univariate normal distribution, $\mu, \sigma^2 \in \mathbb{R}$:

    $p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right]$

3. Multivariate normal distribution, $\mu \in \mathbb{R}^N$, $\Sigma \in \mathbb{S}^N_{++}$:

    $p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^N \det \Sigma}} \exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right]$

   This measures distance with the Mahalanobis distance: points of equal distance have constant density.

Expectations:
1. For discrete $p$: $\mathbb{E}_{x \sim p}[x] = \sum_x p(x)\, x$.
2. For continuous $p$: $\mathbb{E}_{x \sim p}[x] = \int \mathrm{d}x\; p(x)\, x$.
3. Called the mean or center of mass; not in general the same as the mode.
4. Expectations of functions are obtained by replacing $x$: $\mathbb{E}_{x \sim p}[f(x)] = \sum_x p(x)\, f(x)$ (or the analogous integral).
5. If $a$ is independent of $x$, then $\mathbb{E}[ax] = a\,\mathbb{E}[x]$.
6. If $x$ and $y$ are independent, then $\mathbb{E}[xy] = \mathbb{E}[x]\,\mathbb{E}[y]$.
7. The variance is a measure of dispersion: $\mathbb{V}[x] = \mathbb{E}[(x - \mu)^2]$. The covariance is $\mathrm{Cov}(x, y) = \mathbb{E}[(x - \mu_x)(y - \mu_y)]$.

If $p$ is a joint distribution $p(x, y)$, then $\int \mathrm{d}x \int \mathrm{d}y\; p(x,y) = 1$, and we can obtain a marginal by $p(x) = \int \mathrm{d}y\; p(x,y)$. $x$ and $y$ are independent in $p(x,y)$ if $p(x,y) = p(x)\,p(y)$; this simplifies marginalization a lot. Conditionals: $p(y \mid x) = p(x,y)/p(x)$, and $p(x,y) = p(y \mid x)\,p(x) = p(x \mid y)\,p(y)$. If independent, $p(y \mid x) = p(y)$. This yields the chain rule. Bayes' rule:

    $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$

that is, prior times likelihood over marginal.

Machine Learning CS 5350/CS 6350 (27 Mar 2007)

Introduction to Bayesian learning

What is Bayesian learning?
1. A formal model of uncertainty.
2. A method for expressing prior beliefs.
3. A methodology for making inference about data.
4. A paradigm for decision making.

The central difference, on the learning side, between Bayesian and non-Bayesian learning (aka frequentist learning, or learning theory) is that Bayesians treat parameters as true unknowns, i.e., as random variables.

Let's take an example from statistics. Say we have a coin that may be biased: it has probability $\pi \in [0,1]$ of coming up heads. Suppose we flip it once and it comes up tails. How do we infer $\pi$?

The frequentist answer: $\pi = 0$, because this is the maximum likelihood solution. A sort-of frequentist answer: $\pi = 1/3$, because I'll smooth and compute $\pi = (\#\text{heads} + 1)/(\#\text{total flips} + 2)$. These are derived because we assume that we want to find the $\pi$ that maximizes the likelihood of the data, $p(D \mid \pi)$. (Several people have complained that conditioning on $\pi$ is weird, and it is: only random variables should be conditioned on, and in frequentist land a parameter is definitely not a random variable.)

Let's say that we know $\pi \in \{0, 0.25, 0.5, 0.75, 1\}$. Still, the ML solution would give 0. The Bayesian solution is quite different. We don't actually try to find a single value of $\pi$, but rather compute a distribution over possible $\pi$. This comes from a simple application of Bayes' rule:

    $p(\pi \mid D) = \frac{p(\pi)\, p(D \mid \pi)}{p(D)} = \frac{p(\pi)\, p(D \mid \pi)}{\sum_\pi p(\pi)\, p(D \mid \pi)}$

Here $p(\pi)$ is called the prior, $p(D \mid \pi)$ is the likelihood, and $p(D)$ is the marginal (or evidence, or partition function). In our coin flipping example, the likelihood is just $\pi^h (1 - \pi)^t$, where $h$ and $t$ are the counts of heads and tails.

If we think about the frequentist perspective, what happens is that they effectively put a uniform prior over $\pi$ and approximate the posterior by a point distribution centered at the maximum. We will soon see how to justify smoothing in a similar manner. But this entails two weird approximations: maybe we don't want a uniform prior, and maybe we don't want to make this approximation.
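Bayes' rule over a discrete set of parameter values is a one-liner. A minimal sketch (NumPy assumed; the helper name `coin_posterior` is my own), here with a uniform prior over the five candidate biases:

```python
import numpy as np

def coin_posterior(prior, pis, heads, tails):
    """p(pi | D) proportional to p(pi) * pi^h * (1 - pi)^t, normalized over the grid."""
    unnorm = prior * pis ** heads * (1 - pis) ** tails
    return unnorm / unnorm.sum()

pis = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
uniform = np.full(5, 0.2)

post = coin_posterior(uniform, pis, heads=0, tails=1)
# One observed tails rules out pi = 1 entirely ...
assert post[-1] == 0.0
# ... but the ML answer pi = 0 is only the most probable value, not the whole
# story: the posterior keeps mass on every still-plausible bias.
assert post[0] == post.max() and post[1] > 0
```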
Let's say that a priori we believe the five values of $\pi$ have probability $(0.1, 0.2, 0.4, 0.2, 0.1)$, respectively. This basically means that we expect the coin is likely to not be severely biased. This is a valid prior because it sums to one over the range of $\pi$. Now let's revisit the case where we flip once and it comes up tails. This gives us the following unnormalized posterior:

    $p(\pi = 0 \mid D) \propto 0.1 \times 0^0 \times 1^1 = 0.1$
    $p(\pi = 0.25 \mid D) \propto 0.2 \times 0.25^0 \times 0.75^1 = 0.15$
    $p(\pi = 0.5 \mid D) \propto 0.4 \times 0.5^0 \times 0.5^1 = 0.2$
    $p(\pi = 0.75 \mid D) \propto 0.2 \times 0.75^0 \times 0.25^1 = 0.05$
    $p(\pi = 1 \mid D) \propto 0.1 \times 1^0 \times 0^1 = 0$

After normalizing, we get $(0.2, 0.3, 0.4, 0.1, 0)$ as the posterior. Note that the posterior at $\pi = 0.5$ hasn't changed, but the probability of $\pi > 0.5$ has significantly decreased, and we now know that $\pi = 1$ is impossible.

Suppose we flip again and get another tails. This gives

    $p(\pi = 0 \mid D) \propto 0.2 \times 0^0 \times 1^1 = 0.2$
    $p(\pi = 0.25 \mid D) \propto 0.3 \times 0.25^0 \times 0.75^1 = 0.225$
    $p(\pi = 0.5 \mid D) \propto 0.4 \times 0.5^0 \times 0.5^1 = 0.2$
    $p(\pi = 0.75 \mid D) \propto 0.1 \times 0.75^0 \times 0.25^1 = 0.025$
    $p(\pi = 1 \mid D) \propto 0 \times 1^0 \times 0^1 = 0$

Here we have used a technique known as posterior updating: we take the posterior from the first example and treat it as the prior for the second example. After normalizing, we get approximately $(0.31, 0.35, 0.31, 0.04, 0)$. Now we are more sure that $\pi$ should be 0.25, but only by a little. We can repeat this process indefinitely. If we observe an infinite number of flips, we will converge to the true value; this is known as consistency.

Now let's say that we don't want to confine $\pi$ to one of five values, but want to allow it to range continuously. That is, we need a probability distribution $p(\pi)$ with domain $[0,1]$. One could cook up many such distributions, with a bit of integration to ensure proper normalization. However, there is a standard such distribution, known as the beta distribution:

    $\mathrm{Bet}(\pi \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \pi^{a-1} (1 - \pi)^{b-1}$

Here $a$ and $b$ are parameters of the prior, or hyperparameters of the model. Ignore the fraction term for a second; it serves to normalize the beta.

[Figure: nine beta densities, with $a$ varying across the columns and $b$ varying down the rows.]

Now let's consider our posterior updating after observing a single tails. We have

    $p(\pi \mid D) = \frac{1}{Z}\, p(\pi)\, p(D \mid \pi) = \frac{1}{Z}\, \mathrm{Bet}(\pi \mid a, b)\, \pi^0 (1 - \pi)^1 \propto \pi^{a-1}(1 - \pi)^{b-1}(1 - \pi) = \pi^{a-1}(1 - \pi)^{b} \propto \mathrm{Bet}(\pi \mid a, b + 1)$

In general, this works for any number of heads and tails. Suppose we have $h$ heads and $t$ tails; then we get

    $p(\pi \mid D) = \frac{1}{Z}\, p(\pi)\, p(D \mid \pi) \propto \pi^{a-1}(1 - \pi)^{b-1}\, \pi^h (1 - \pi)^t = \pi^{a+h-1}(1 - \pi)^{b+t-1} \propto \mathrm{Bet}(\pi \mid a + h, b + t)$

We don't need to worry about computing $Z$, because we know that the beta distribution is properly normalized.

[Figure: the beta posterior after each of eleven flips, starting from a uniform $a = b = 1$ beta prior; the stars indicate whether each flip was tails or heads.]

The figure shows results after eleven flips with a uniform ($a = b = 1$) beta prior. The flips are THHTTTTHTTT. You can see that over time the distribution tends toward $\pi < 0.5$ and becomes more and more peaked.

It's easy to verify that the value of $\pi$ that maximizes $\mathrm{Bet}(\pi \mid a, b)$ is exactly $(a-1)/(a+b-2)$. This somewhat justifies smoothing: to obtain "add-one" smoothing, we pretend that we start with a beta prior with $a = b = 2$; to get "add-$\alpha$" smoothing, we set $a = b = \alpha + 1$. Then we do "maximum likelihood." Technically, this is called maximum a posteriori, or MAP, since we're choosing a value that maximizes the posterior rather than one that maximizes the likelihood.

Now back to the $\Gamma$ function. This thing appears all the time in normalization terms, and is defined by

    $\Gamma(z) = \int_0^\infty \mathrm{d}t\; t^{z-1} \exp(-t)$

This integral has no closed-form solution; however, it can be computed by standard techniques, available in matlab and many other languages. It has the nice property that it extends the factorial function to the real line: if $n$ is a positive integer, then $\Gamma(n) = (n-1)!$. Given this, we often compute $\log \Gamma(\lambda)$, since $\Gamma$ grows too quickly. The functions in matlab are gamma and gammaln.
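Python's standard library plays the same role here as matlab's gamma/gammaln; a quick check of the factorial property (a sketch, not from the notes):

```python
import math

# Gamma extends the factorial to the real line: Gamma(n) = (n-1)! for positive integers.
for n in range(1, 10):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

# Gamma grows far too fast to store directly, so we work with log Gamma instead;
# math.lgamma is the analogue of matlab's gammaln.
assert math.isclose(math.lgamma(200), math.log(math.factorial(199)))
```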
There are two other distributions you'll need to know about, besides the standard ones (Normal, Multinomial, Binomial, etc.). These are the gamma and the Dirichlet. We'll do the Dirichlet first, since it's basically an extension of the beta.

Remember that a multinomial is just like a more complicated binomial. Instead of having a coin with a single parameter $\pi$, we have a die with a parameter vector $\theta = (\theta_1, \ldots, \theta_K)$, all positive and summing to one. We would like a prior on this. The Dirichlet is a multivariate version of the beta. It is parameterized by a vector $\alpha = (\alpha_1, \ldots, \alpha_K)$, all positive, but which need not sum to one:

    $\mathrm{Dir}(\theta \mid \alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$

In the two-parameter case, this is exactly a beta distribution. We also get the same posterior updating: if the prior was $\mathrm{Dir}(\theta \mid \alpha)$, then after observing $x_1$ rolls of a 1, $x_2$ rolls of a 2, and so on, the posterior becomes $\mathrm{Dir}(\theta \mid \alpha_1 + x_1, \ldots, \alpha_K + x_K)$. Again, we can think of smoothing as MAP inference with a Dirichlet prior.

Finally, we need the gamma distribution. This is a distribution over the positive reals. It will be useful as a prior for the inverse variance of a normal distribution, i.e., $p(1/\sigma^2)$, but for now just think of it as some distribution over $(0, \infty)$:

    $\mathrm{Gam}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} \exp(-\lambda b)$

Note that several definitions of Gam actually exist: sometimes people use $1/b$ instead of $b$, which puts the $b$ in the denominator of the exponent and replaces $b^a$ with $b^{-a}$.

[Figure: examples of gamma densities for several settings of $(a, b)$.]

Now suppose we have a posterior $p(\theta \mid D)$; here $\theta$ is just an arbitrary symbol for parameters. What do we do with it? Well, at the end of the day, sometimes the posterior is of interest in its own right. Often, however, we probably want to make predictions. That is, we may want to predict how many of 100 coin flips will land tails. In general, if there's a quantity $f(\theta)$ that we want to predict, we want to compute

    $\mathbb{E}[f(\theta) \mid D] = \int \mathrm{d}\theta\; p(\theta \mid D)\, f(\theta) = \frac{1}{p(D)} \int \mathrm{d}\theta\; p(\theta)\, p(D \mid \theta)\, f(\theta)$

If $\theta$ is a discrete space, replace the integrals by sums.

For instance, take the coin flipping example. Suppose we have a $\mathrm{Bet}(1,1)$ prior, then observe 9 heads and 19 tails. This gives us a $\mathrm{Bet}(10, 20)$ posterior. Let's say, for simplicity, that we want to know the probability that the next two flips will both come up tails. In this case, writing $\pi$ for $\theta$, we have $f(\pi) = (1 - \pi)^2$. Thus we want to compute

    $\int \mathrm{d}\pi\; \mathrm{Bet}(\pi \mid 10, 20)\, (1 - \pi)^2 = \frac{\Gamma(30)}{\Gamma(10)\Gamma(20)} \int \mathrm{d}\pi\; \pi^{9} (1 - \pi)^{19} (1 - \pi)^2 = \frac{\Gamma(30)}{\Gamma(10)\Gamma(20)} \int \mathrm{d}\pi\; \pi^{9} (1 - \pi)^{21}$

    $= \frac{\Gamma(30)}{\Gamma(10)\Gamma(20)} \cdot \frac{\Gamma(10)\Gamma(22)}{\Gamma(32)} \int \mathrm{d}\pi\; \mathrm{Bet}(\pi \mid 10, 22) = \frac{\Gamma(30)\Gamma(22)}{\Gamma(20)\Gamma(32)} = \frac{29!\; 21!}{19!\; 31!} = \frac{21 \times 20}{31 \times 30} \approx 0.45$
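The predictive computation above can be verified with log-gamma arithmetic. A sketch (Python's `math.lgamma` assumed; `log_beta` is my own helper name):

```python
import math

def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b),
    the log of the beta distribution's normalizing constant."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

# E[(1 - pi)^2] under Bet(pi | 10, 20) is a ratio of beta normalizers:
# integral of Bet(pi | 10, 20) (1 - pi)^2 dpi = B(10, 22) / B(10, 20).
pred = math.exp(log_beta(10, 22) - log_beta(10, 20))

# This matches the hand computation (21 * 20) / (31 * 30), about 0.45.
assert math.isclose(pred, (21 * 20) / (31 * 30))
```

Working in log space is exactly the "compute log Gamma, not Gamma" advice from the notes: $\Gamma(32)$ already overflows intuition, while its logarithm is tame.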
Machine Learning CS 5350/CS 6350 (29 Jan 2008)

Maximum margin classifiers

Large margin principle: we want a $w$ that separates the classes maximally. We assume that $w$ separates the data and achieves a functional margin of at least 1; that is, $w^T x + b \ge 1$ for all positive $x$ and $w^T x + b \le -1$ for all negative $x$. We defined the geometric margin based on the normalized weight vector, $\gamma = \min_n y_n (u^T x_n + b/\|w\|)$ with $u = w/\|w\|$; computing it for a positive point $x^+$ and a negative point $x^-$ as before gives $\gamma = 1/\|w\|$. The margin is inversely proportional to the norm of the weight vector and independent of the bias, so having a large margin is equivalent to having a small weight vector norm. We therefore write the learning problem as an optimization problem:

    minimize over $w$:  $\frac{1}{2}\|w\|^2$
    subject to:  $y_n (w^T x_n + b) - 1 \ge 0 \quad \forall n$

Now we need to figure out how to solve this. Enter convex optimization and Lagrange theory. Introduce Lagrange multipliers $\alpha_{1:N}$, one for each constraint. This leads to the Lagrangian

    $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_n \alpha_n \left[y_n (w^T x_n + b) - 1\right]$

We want to minimize $L$ with respect to $w$ and $b$ (and maximize it with respect to $\alpha \ge 0$). Take derivatives:

    $\nabla_w L = w - \sum_n \alpha_n y_n x_n = 0 \;\Longrightarrow\; w = \sum_n \alpha_n y_n x_n$
    $\frac{\partial L}{\partial b} = -\sum_n \alpha_n y_n = 0 \;\Longrightarrow\; \sum_n \alpha_n y_n = 0$

So, given $\alpha$, $w$ is determined; plug it back in to $L$:

    $L = \frac{1}{2}\sum_{m,n} \alpha_m \alpha_n y_m y_n\, x_m^T x_n - \sum_n \alpha_n y_n \left(\sum_m \alpha_m y_m x_m\right)^T x_n - b\sum_n \alpha_n y_n + \sum_n \alpha_n = \sum_n \alpha_n - \frac{1}{2}\sum_{m,n} \alpha_m \alpha_n y_m y_n\, x_m^T x_n = L(\alpha)$

So now just solve

    maximize over $\alpha$:  $\sum_n \alpha_n - \frac{1}{2}\sum_{n,m} y_n y_m \alpha_n \alpha_m\, x_n^T x_m$
    subject to:  $\sum_n y_n \alpha_n = 0$; $\alpha_n \ge 0 \quad \forall n$

Then compute the bias:

    $b = -\frac{1}{2}\left[\max_{n:\, y_n = -1} w^T x_n + \min_{n:\, y_n = +1} w^T x_n\right]$

This leads to a sparse solution: most $\alpha_n$ are zero. Why? The Karush-Kuhn-Tucker conditions say that at the optimum, $\alpha_n \left[y_n(w^T x_n + b) - 1\right] = 0$ for all $n$. This means that $\alpha_n \ne 0$ only when the point is right on the margin. These points are the support vectors.

Non-linearly-separable data: introduce slack parameters, where $\xi_n$ is how far $(x_n, y_n)$ is on the wrong side of the margin. Then:

    minimize over $w, b, \xi$:  $\frac{1}{2}\|w\|^2 + C\sum_n \xi_n$
    subject to:  $y_n (w^T x_n + b) - 1 + \xi_n \ge 0 \quad \forall n$; $\xi_n \ge 0 \quad \forall n$

High $C$ means "fit the data," while low $C$ means "have a large margin." Following the same dual formulation (with multipliers $\alpha_n$ for the margin constraints and $r_n$ for $\xi_n \ge 0$), we get

    $L(w, b, \xi, \alpha, r) = \frac{1}{2} w^T w + C\sum_n \xi_n - \sum_n \alpha_n \left[y_n (w^T x_n + b) - 1 + \xi_n\right] - \sum_n r_n \xi_n$

Differentiate:

    $\nabla_w L = w - \sum_n y_n \alpha_n x_n = 0 \;\Longrightarrow\; w = \sum_n y_n \alpha_n x_n$
    $\frac{\partial L}{\partial b} = \sum_n y_n \alpha_n = 0$
    $\frac{\partial L}{\partial \xi_n} = C - \alpha_n - r_n = 0$

Thus

    $L(\alpha) = \sum_n \alpha_n - \frac{1}{2}\sum_{n,m} y_n y_m \alpha_n \alpha_m\, x_n^T x_m$

which is the same as before, but now we have constraints $\alpha_n \ge 0$, $r_n \ge 0$, and $C - \alpha_n - r_n = 0$. This means (1) $\alpha_n \le C$, and (2) if $r_n = 0$ then $\alpha_n = C$. Geometrically, the support vectors are now also the noisy points.

Why large margins? Because they mean simple solutions.

Machine Learning CS 5350/CS 6350 (30 Jan 2006)

Non-linear models

We've spent a lot of time talking about linear models: models which are parameterized by a weight vector of dimension equal to that of the input vectors. These are nice, but limited. Here we consider two very different techniques: nearest-neighbor models and decision trees.

Nearest neighbor (kNN)

- 1NN: simple intuition. To classify a new data point, just return the class of the closest training point.
- kNN: instead of the single closest point, average over the $k$ nearest.
- $\epsilon$NN: instead of the $k$ nearest, use as many as fit in a ball of radius $\epsilon$.

What is the VC dimension of such an algorithm? How does one train such an algorithm? Despite its simplicity, kNN is a really, really good classifier, but it is very sensitive to irrelevant features. It can also be used for regression, by averaging responses. See Figs. 2.27(a,b) and 2.28(a,b,c) in PRML.

[Figure: kNN decision regions for $K = 1$, $K = 3$, and $K = 31$, after PRML.]

Decision trees

Idea: suppose we could only use one feature to make a classification decision. Let's choose that feature. Now look at all examples for which this feature is on and choose a single feature to make a classification decision; then look at all examples for which the first feature was off. Recurse until no data is left. Two issues: (a) how to choose a single feature, and (b) how to decide when to stop.

Entropy is a measure of randomness (closeness to uniformity); in particular, how many bits it takes to send a message, on average. If there are four options A, B, C, D, each with probability 1/4, then the best coding is binary, which gives two bits per character. What if $p(a) = 1/2$, $p(b) = 1/4$, and $p(c) = p(d) = 1/8$? We can code this with 1.75 bits/char on average (how?). The minimum number of bits is the entropy:

    $H(X) = -\sum_x p(X = x) \log_2 p(X = x)$

Zero entropy means deterministic; high entropy means close to uniform. $H(Y \mid X)$ is the number of bits needed to send $Y$ given that both the sender and the recipient know $X$. Conditional entropy:

    $H(Y \mid X) = \sum_x p(X = x)\, H(Y \mid X = x) = -\sum_x p(X = x) \sum_y p(Y = y \mid X = x) \log_2 p(Y = y \mid X = x)$

Information gain $IG(Y \mid X)$ is: (bits I must send for $Y$) minus (bits needed if both ends knew $X$), i.e., how many bits I would save:

    $IG(Y \mid X) = H(Y) - H(Y \mid X)$

Idea: choose the feature with the highest information gain. Stopping: use a threshold on either the number of elements in a leaf or the entropy of a leaf. Dealing with real-valued features: if $X$ is real-valued, consider all possible split locations ($X \le z$ versus $X > z$) and find the best $z$ to split on (best = maximum IG). We only need to search over splits that exist in the training data.
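The entropy examples above are easy to verify. A minimal sketch (the helper name `entropy` is my own):

```python
import math

def entropy(ps):
    """H(X) = -sum_x p(x) log2 p(x), in bits; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Four equiprobable symbols need two bits each.
assert entropy([0.25] * 4) == 2.0

# The skewed source p(a)=1/2, p(b)=1/4, p(c)=p(d)=1/8 needs 1.75 bits/char on
# average; the codewords 0, 10, 110, 111 achieve exactly this expected length.
assert entropy([0.5, 0.25, 0.125, 0.125]) == 1.75

# Deterministic means zero entropy.
assert entropy([1.0, 0.0]) == 0.0
```

The same function, applied to label distributions within each branch of a candidate split and weighted by branch probability, gives the conditional entropy used in the information-gain criterion.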
Machine Learning CS 5350/CS 6350 (15 Feb 2007)

Clustering (cont'd)

We can think of k-means in a probabilistic framework. Suppose that we have a Gaussian centered at each of the means; then we can write the probability of the data set as

    $p(x_{1:N} \mid c_{1:N}, \mu) = \prod_n \mathrm{Nor}(x_n \mid \mu_{c_n}, \sigma^2 I)$

Here $\mu_k$ is the mean of cluster $k$; for now, we assume a common variance. We can think of the clustering problem as trying to find good $\mu$s.

Change of notation: instead of $c_n$ being the cluster for data point $n$, let $z_n \in \{0,1\}^K$ be an indicator vector for data point $n$, i.e., $z_{n,k} = 1$ if $x_n$ is in cluster $k$ and $0$ otherwise.

Model: generate each data point by first choosing one of the $K$ clusters, each with probability $\pi_k$; then generate the data point from a Gaussian centered at $\mu_k$. In equations:

    $p(x_{1:N}, z_{1:N} \mid \mu_{1:K}, \sigma^2, \pi) = \prod_n \prod_k \left[\pi_k\, \mathrm{Nor}(x_n \mid \mu_k, \sigma^2 I)\right]^{z_{n,k}} = \prod_n \prod_k \left[\pi_k\, (2\pi\sigma^2)^{-d/2} \exp\left(-\frac{1}{2\sigma^2}\|x_n - \mu_k\|^2\right)\right]^{z_{n,k}}$

From this we get the likelihood of the data by summing over the unknown $z$s:

    $p(x \mid \mu, \sigma^2, \pi) = \sum_{z_{1:N}} \prod_n \prod_k \left[\pi_k\, (2\pi\sigma^2)^{-d/2} \exp\left(-\frac{1}{2\sigma^2}\|x_n - \mu_k\|^2\right)\right]^{z_{n,k}}$

So now we follow our standard recipe of taking logs and derivatives:

    $\log p(x \mid \mu, \sigma^2, \pi) = \sum_n \log \sum_k \pi_k\, (2\pi\sigma^2)^{-d/2} \exp\left(-\frac{1}{2\sigma^2}\|x_n - \mu_k\|^2\right)$

But at this point we get stuck (the log of a sum). If we knew $z$, we could do this easily:

    $\log p(x, z \mid \mu, \sigma^2, \pi) = \sum_n \sum_k z_{n,k} \left[\log \pi_k + \log \mathrm{Nor}(x_n \mid \mu_k, \sigma^2 I)\right]$

We call the value with $z$ the "complete log likelihood" and the value without $z$ the "incomplete log likelihood."

The idea for clustering with GMMs is the same as for k-means: we make an initial guess at $z$ and then try to iteratively refine it. This turns out to be a special case of the expectation-maximization (EM) algorithm, which we will discuss shortly in more generality. In k-means we made hard guesses at the clusters: the $z$ vector had a one in a single location and zeros everywhere else. In Gaussian mixture models we make soft guesses: the $z$ vector will satisfy $z_{n,k} \ge 0$ for all $n, k$ and $\sum_k z_{n,k} = 1$ for all $n$. Thus it's a probabilistic guess at the clustering.

Given some setting of $\mu$, $\sigma^2$, and $\pi$, we can make guesses at $z$ by just looking at their expectations:

    $\mathbb{E}[z_{n,k} \mid x, \mu, \sigma^2, \pi] = 1 \times p(z_{n,k} = 1 \mid x_n, \mu, \sigma^2, \pi) + 0 \times p(z_{n,k} = 0 \mid x_n, \mu, \sigma^2, \pi)$
    $= \frac{p(x_n \mid z_{n,k} = 1, \mu, \sigma^2)\, p(z_{n,k} = 1 \mid \pi)}{\sum_{k'} p(x_n \mid z_{n,k'} = 1, \mu, \sigma^2)\, p(z_{n,k'} = 1 \mid \pi)} = \frac{\pi_k\, \mathrm{Nor}(x_n \mid \mu_k, \sigma^2)}{\sum_{k'} \pi_{k'}\, \mathrm{Nor}(x_n \mid \mu_{k'}, \sigma^2)}$

These expectations give us a soft clustering of each data point into each of the $K$ clusters. Now, using these guesses, we want to maximize the complete-data log likelihood with respect to $\pi$, $\mu$, and $\sigma^2$, so we take the gradient of the complete likelihood with respect to each.

We do $\pi$ first. For this, we actually need to introduce a Lagrange multiplier to ensure that the constraints on $\pi$ are satisfied ($\pi_k \ge 0$ and $\sum_k \pi_k = 1$). This gives us an augmented likelihood function:

    $\ell = \sum_n \sum_k z_{n,k} \log \pi_k - \lambda\left(\sum_k \pi_k - 1\right)$

We differentiate this with respect to $\pi_k$ to get

    $\frac{\partial \ell}{\partial \pi_k} = \frac{1}{\pi_k} \sum_n z_{n,k} - \lambda = 0 \;\Longrightarrow\; \pi_k = \frac{1}{\lambda} \sum_n z_{n,k}$

Summing over all $k$, we get $\sum_n \sum_k z_{n,k} = N = \lambda$, so

    $\pi_k = \frac{1}{N} \sum_n z_{n,k}$

This makes intuitive sense! Next we'll take care of $\mu_k$. These are somewhat easier, since we don't need to worry about constraints, so there are no Lagrange multipliers. Here we take the gradient of the complete log likelihood with respect to $\mu_k$:

    $\nabla_{\mu_k} \log p(x, z \mid \mu, \sigma^2, \pi) = \sum_n z_{n,k} \nabla_{\mu_k} \log \mathrm{Nor}(x_n \mid \mu_k, \sigma^2 I) = -\frac{1}{2\sigma^2} \sum_n z_{n,k} \nabla_{\mu_k} \|x_n - \mu_k\|^2 = \frac{1}{\sigma^2} \sum_n z_{n,k} (x_n - \mu_k)$

We equate this to zero to give

    $\sum_n z_{n,k} (x_n - \mu_k) = 0 \;\Longrightarrow\; \mu_k = \frac{\sum_n z_{n,k}\, x_n}{\sum_n z_{n,k}}$

Again, this result is intuitive. Finally, we deal with $\sigma^2$:

    $\nabla_{\sigma^2} \log p(x, z \mid \mu, \sigma^2, \pi) = \sum_n \sum_k z_{n,k} \nabla_{\sigma^2} \log \mathrm{Nor}(x_n \mid \mu_k, \sigma^2 I) = \sum_n \sum_k z_{n,k} \left[-\frac{d}{2\sigma^2} + \frac{\|x_n - \mu_k\|^2}{2\sigma^4}\right]$

We set this equal to zero to obtain

    $\frac{d}{2\sigma^2} \sum_n \sum_k z_{n,k} = \frac{1}{2\sigma^4} \sum_n \sum_k z_{n,k} \|x_n - \mu_k\|^2 \;\Longrightarrow\; dN\sigma^2 = \sum_n \sum_k z_{n,k} \|x_n - \mu_k\|^2 \;\Longrightarrow\; \sigma^2 = \frac{1}{dN} \sum_n \sum_k z_{n,k} \|x_n - \mu_k\|^2$

Putting this all together, we obtain the following algorithm:

1. Initialize the cluster centers $\mu_{1:K}$, $\pi$, and $\sigma^2$.
2. Iterate $T$ times:
   (a) Compute the expectations of the $z$ variables:

       $\mathbb{E}[z_{n,k}] = \frac{\pi_k\, \mathrm{Nor}(x_n \mid \mu_k, \sigma^2)}{\sum_{k'} \pi_{k'}\, \mathrm{Nor}(x_n \mid \mu_{k'}, \sigma^2)}$

   (b) Compute new values of $\pi$, $\mu$, and $\sigma^2$:

       $\pi_k = \frac{1}{N} \sum_n \mathbb{E}[z_{n,k}], \qquad \mu_k = \frac{\sum_n \mathbb{E}[z_{n,k}]\, x_n}{\sum_n \mathbb{E}[z_{n,k}]}, \qquad \sigma^2 = \frac{1}{dN} \sum_n \sum_k \mathbb{E}[z_{n,k}]\, \|x_n - \mu_k\|^2$

One can obtain a more general solution, where we use full covariance matrices and/or cluster-specific covariance matrices. For the assignment you will have to do both. The D'Souza write-up includes the derivations/updates for the more general case.
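The E- and M-steps above can be sketched directly in code. This is a minimal illustration (NumPy assumed; `em_gmm` is my own name, and the deterministic initialization, one center seeded in each blob, is a demo convenience), not the assignment's required implementation:

```python
import numpy as np

def em_gmm(X, mu0, T=50):
    """EM for a mixture of spherical Gaussians with a shared variance,
    following the updates derived above. mu0: initial K x d cluster centers."""
    N, d = X.shape
    mu = mu0.astype(float).copy()
    K = len(mu)
    pi = np.full(K, 1.0 / K)
    sigma2 = X.var()
    for _ in range(T):
        # E-step: E[z_nk] proportional to pi_k * Nor(x_n | mu_k, sigma2 I),
        # computed in log space for numerical stability.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # N x K squared distances
        logp = np.log(pi) - 0.5 * d * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)
        z = np.exp(logp - logp.max(axis=1, keepdims=True))
        z /= z.sum(axis=1, keepdims=True)                      # soft assignments
        # M-step: the closed-form updates for pi, mu, and sigma2.
        Nk = z.sum(axis=0)
        pi = Nk / N
        mu = (z.T @ X) / Nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # distances to new means
        sigma2 = (z * sq).sum() / (d * N)
    return pi, mu, sigma2

# Two well-separated blobs; seed one initial center in each so the demo is stable.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])
pi, mu, sigma2 = em_gmm(X, mu0=np.vstack([X[0], X[100]]))
centers = sorted(mu[:, 0])
assert abs(centers[0] + 5) < 0.5 and abs(centers[1] - 5) < 0.5
```

Replacing the shared scalar `sigma2` with per-cluster covariance matrices changes only the E-step density and the final M-step line, which is the generalization the notes point to.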