Review Sheet for CMPSCI 689 at UMass(3)
This 32-page set of class notes was uploaded on Friday, February 6, 2015. The notes belong to a course at the University of Massachusetts taught in Fall.
SVM Classification and Regression; Reproducing Kernels
Sridhar Mahadevan (mahadeva@cs.umass.edu), University of Massachusetts
CMPSCI 689

Topics
- SVM review: the linearly separable case
- Soft-margin classifiers: dealing with overfitting
- SVM regression
- Examples of kernels
- Mercer's theorem

Margin-based Classification
[Figure slide.]

Separating Hyperplane between Convex Hulls
[Figure slide.]

Optimal Margin Classification
- Consider the problem of finding a set of weights $w$ that produces a hyperplane with the maximum geometric margin:
$$\max \gamma \quad \text{such that} \quad y_i(\langle w, x_i\rangle + b) \ge \gamma, \; i = 1,\ldots,m, \quad \|w\| = 1$$
- We rescale the weights by $\|w\|$ to eliminate the nonconvex constraint $\|w\| = 1$, and instead look to minimize $\langle w, w\rangle$ while keeping the functional margin at 1:
$$\min \frac{1}{2}\|w\|^2 \quad \text{such that} \quad y_i(\langle w, x_i\rangle + b) \ge 1$$

Lagrange Dual Formulation
The primal optimal margin classification problem can be formulated as
$$\min f(w) \quad \text{such that} \quad g_i(w) \le 0, \; i = 1,\ldots,k \quad \text{and} \quad h_i(w) = 0, \; i = 1,\ldots,l$$
The dual problem can be formulated using Lagrange multipliers as
$$\max_{\alpha \ge 0,\, \beta} L_D(\alpha, \beta), \quad \text{where} \quad L_D(\alpha, \beta) = \min_w \left( f(w) + \sum_{i=1}^k \alpha_i g_i(w) + \sum_{i=1}^l \beta_i h_i(w) \right)$$
Weak Duality Theorem: the dual formulation always produces a solution that is upper bounded by the solution to the primal problem.
Strong Duality Theorem: the solution to the Lagrange dual is exactly the same as the primal solution, assuming that the function $f(w)$ and the constraints $g_i(w)$ are convex and the $h_i(w)$ are affine, meaning $h_i(w) = \langle w, a_i\rangle + b_i$.

Weak Duality Theorem
- Suppose $w$ is a feasible solution to the primal problem, and $(\alpha, \beta)$ is a feasible solution to the dual problem. Then
$$L_D(\alpha, \beta) = \min_u L(u, \alpha, \beta) \le L(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_i \beta_i h_i(w) \le f(w)$$
- The last inequality holds because $g_i(w) \le 0$ and $h_i(w) = 0$ (since $w$ is a feasible solution to the primal), and because $\alpha_i \ge 0$ (since $(\alpha, \beta)$ is a feasible solution to the dual).
- This implies the following condition:
$$\max_{\alpha \ge 0,\, \beta} L_D(\alpha, \beta) \;\le\; \min_w \{\, f(w) : g_i(w) \le 0, \; h_i(w) = 0 \,\}$$
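To make weak and strong duality concrete, the following sketch checks them numerically on an invented toy problem (not one from the slides): minimize $f(w) = w^2$ subject to $g(w) = 1 - w \le 0$, for which the dual function has the closed form $L_D(\alpha) = \alpha - \alpha^2/4$.

```python
# Weak duality on a toy problem: minimize f(w) = w^2 subject to g(w) = 1 - w <= 0.
# The primal optimum is w* = 1 with f(w*) = 1.
# The Lagrangian is L(w, a) = w^2 + a*(1 - w); minimizing over w (at w = a/2)
# gives the dual function L_D(a) = a - a^2 / 4.

def f(w):
    return w * w

def dual(a):
    w = a / 2.0                      # unconstrained minimizer of L(w, a)
    return f(w) + a * (1.0 - w)      # equals a - a^2 / 4

primal_opt = f(1.0)                  # w* = 1 is feasible and optimal

# Weak duality: L_D(a) <= f(w*) for every a >= 0.
alphas = [0.1 * k for k in range(51)]
assert all(dual(a) <= primal_opt + 1e-12 for a in alphas)

# Strong duality holds here (f convex, g affine): the gap closes at a* = 2.
assert abs(dual(2.0) - primal_opt) < 1e-12
print("primal optimum:", primal_opt, " dual at a*=2:", dual(2.0))
```

Because $f$ is convex and $g$ is affine, the duality gap is exactly zero here, as the strong duality theorem predicts.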
Sparsity of Parameters
Corollary: Let $w^*$ be a weight vector that satisfies the primal constraints, and let $(\alpha^*, \beta^*)$ be Lagrangian variables that satisfy the dual constraints, such that $f(w^*) = L_D(\alpha^*, \beta^*)$, where $\alpha^* \ge 0$, $g_i(w^*) \le 0$, and $h_i(w^*) = 0$. Then $\alpha_i^* g_i(w^*) = 0$ for $i = 1,\ldots,k$.
The proof follows easily by noting that the inequality
$$f(w^*) + \sum_i \alpha_i^* g_i(w^*) + \sum_i \beta_i^* h_i(w^*) \le f(w^*)$$
becomes an equality only when $\alpha_i^* g_i(w^*) = 0$ for $i = 1,\ldots,k$.
This is a key result in the SVM context: it implies that the representation will be sparse, since we only need to maintain $\alpha_i$ for the support vectors, that is, the instances that are closest to the separating hyperplane.

Saddle Point Function
[Figure: the saddle point function $x^2 - y^2$.]

Duality Gap and Saddle Points
- If the primal and dual solutions are not equal, this duality gap can be detected in the course of solving the dual and used as a convergence metric.
- Define a saddle point as a triple $(w^*, \alpha^*, \beta^*)$, where $w^* \in \Omega$, $\alpha^* \ge 0$, and
$$L(w^*, \alpha, \beta) \le L(w^*, \alpha^*, \beta^*) \le L(w, \alpha^*, \beta^*)$$
- Theorem: the triple $(w^*, \alpha^*, \beta^*)$ is a saddle point if and only if $w^*$ is a solution to the primal problem, $(\alpha^*, \beta^*)$ is a solution to the dual problem, and there is no duality gap, so $f(w^*) = L_D(\alpha^*, \beta^*)$.
- Strong Duality Theorem: if $f(w)$ is convex and $w \in \Omega$, where $\Omega$ is a convex set, and $g, h$ are affine functions, the duality gap is 0.

Karush-Kuhn-Tucker Conditions
Assume that the function $f(w)$ and the constraints $g_i(w)$ are convex, and the $h_i(w)$ are affine, meaning $h_i(w) = \langle w, a_i\rangle + b_i$. Let there be at least one $w$ such that $g_i(w) < 0$ for all $i$. Then the KKT conditions assure us the duality gap is 0, and the minimizing values $(w^*, \alpha^*, \beta^*)$ satisfy the following conditions:
1. $\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1,\ldots,n$
2. $\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1,\ldots,l$
3. $\alpha_i^* g_i(w^*) = 0, \quad i = 1,\ldots,k$
4. $g_i(w^*) \le 0, \quad i = 1,\ldots,k$
5. $\alpha_i^* \ge 0, \quad i = 1,\ldots,k$
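The KKT conditions can be checked by hand on a small invented problem (not from the slides): minimize $f(w) = w^2$ subject to $g_1(w) = 1 - w \le 0$ and $g_2(w) = w - 5 \le 0$. At the optimum $w^* = 1$, only $g_1$ is active, so complementary slackness forces the multiplier of the inactive constraint to zero.

```python
# Checking the KKT conditions on a toy problem:
# minimize f(w) = w^2 subject to g1(w) = 1 - w <= 0 and g2(w) = w - 5 <= 0.
# The optimum is w* = 1; g1 is active there, g2 is not.

w_star = 1.0
a1, a2 = 2.0, 0.0                 # Lagrange multipliers

g1 = 1.0 - w_star                 # = 0  (active constraint)
g2 = w_star - 5.0                 # = -4 (inactive constraint)

# Condition 1 (stationarity): d/dw [w^2 + a1*(1 - w) + a2*(w - 5)] = 2w - a1 + a2 = 0
assert 2 * w_star - a1 + a2 == 0
# Condition 3 (complementary slackness): a_i * g_i(w*) = 0 -- the multiplier of
# the inactive constraint vanishes, which is the source of SVM sparsity.
assert a1 * g1 == 0 and a2 * g2 == 0
# Conditions 4 and 5 (primal and dual feasibility)
assert g1 <= 0 and g2 <= 0 and a1 >= 0 and a2 >= 0
print("KKT conditions verified at w* =", w_star)
```

In the SVM setting, the inactive constraints correspond to instances away from the margin, which is why their $\alpha_i$ vanish.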
Support Vectors for Optimal Margin Classifiers
- Returning to the original problem, we can write the constraints as $g_i(w) = 1 - y_i(\langle w, x_i\rangle + b) \le 0$, so the problem becomes
$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{such that} \quad g_i(w) \le 0$$
- From the KKT conditions, note that the only instances for which $\alpha_i > 0$ are those whose functional margin is exactly 1, because then $g_i(w) = 0$.
- Note that this functional margin is the smallest of all the margins, which implies that we will only have nonzero $\alpha_i$ for the points closest to the decision boundary. These are called the support vectors.

Dual Form of Optimal Margin Classifier
- We can write the Lagrangian for our optimal margin classifier as
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left[ y_i(\langle w, x_i\rangle + b) - 1 \right]$$
- To solve the dual form, we first minimize with respect to $w$ and $b$, and then maximize with respect to $\alpha$:
$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i, \qquad \nabla_b L(w, b, \alpha) = -\sum_{i=1}^m \alpha_i y_i = 0$$
- Mechanical interpretation: think of each instance $x_i$ as exerting a force $\alpha_i y_i$ on the hyperplane. The forces sum to 0.

Support Vectors for Optimal Margin Classifiers
Using these results, we can simplify the Lagrangian into the following form:
$$\max_\alpha \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \quad \text{s.t.} \quad \alpha_i \ge 0 \;\text{ and }\; \sum_{i=1}^m \alpha_i y_i = 0$$
Given the maximizing $\alpha^*$, we use the equation $w^* = \sum_{i=1}^m \alpha_i^* y_i x_i$ to find the maximizing $w^*$. A new instance $x$ is classified using a weighted sum of inner products over only the support vectors:
$$\langle w, x\rangle + b = \sum_{i=1}^m \alpha_i y_i \langle x_i, x\rangle + b = \sum_{i \in SV} \alpha_i y_i \langle x_i, x\rangle + b$$
The intercept term $b^*$ can be found from the primal constraints:
$$b^* = -\frac{\max_{y_i = -1} \langle w^*, x_i\rangle + \min_{y_i = 1} \langle w^*, x_i\rangle}{2}$$

Geometric Margin and Lagrange Parameters
Theorem: Consider a linearly separable set of instances $(x_1, y_1), \ldots, (x_m, y_m)$, and suppose $(\alpha^*, b^*)$ is a solution to the dual optimization problem. Then the geometric margin can be expressed as
$$\gamma = \frac{1}{\|w\|_2} = \left( \sum_{i \in SV} \alpha_i^* \right)^{-1/2}$$
Proof: Due to the KKT conditions, it follows that for all support vectors $j \in SV$,
$$y_j f(x_j; b^*) = y_j \left( \sum_{i \in SV} \alpha_i^* y_i \langle x_i, x_j\rangle + b^* \right) = 1$$
$$\|w\|_2^2 = \left\langle \sum_{i \in SV} \alpha_i^* y_i x_i,\; \sum_{j \in SV} \alpha_j^* y_j x_j \right\rangle = \sum_{j \in SV} \alpha_j^* y_j \sum_{i \in SV} \alpha_i^* y_i \langle x_i, x_j\rangle = \sum_{j \in SV} \alpha_j^* y_j (y_j - b^*) = \sum_{j \in SV} \alpha_j^*$$
where the last step uses $y_j^2 = 1$ and $\sum_{j \in SV} \alpha_j^* y_j = 0$.

Dealing with Nonseparable Data
[Figure: nonseparable data; the tradeoff between high variance/low bias and high bias/low variance.]
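The dual solution and the geometric margin theorem can be verified end-to-end on a two-point toy dataset (an invented example, not from the slides): $x_1 = +1$ with $y_1 = +1$ and $x_2 = -1$ with $y_2 = -1$ in one dimension, where the dual objective reduces to $W(a) = 2a - 2a^2$ under $\alpha_1 = \alpha_2 = a$, maximized at $a = 1/2$.

```python
# Two-point toy SVM in 1-D: x1 = +1 (y1 = +1), x2 = -1 (y2 = -1).
# With the constraint sum_i a_i y_i = 0, the dual objective
#   W(a) = sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
# becomes W(a) = 2a - 2a^2, maximized at a = 1/2.

xs, ys = [1.0, -1.0], [1.0, -1.0]
alpha = [0.5, 0.5]                                   # closed-form dual solution

assert abs(sum(a * y for a, y in zip(alpha, ys))) < 1e-12   # sum_i a_i y_i = 0

# Recover w* = sum_i a_i y_i x_i, and b* from the primal constraints.
w = sum(a * y * x for a, y, x in zip(alpha, ys, xs))        # w* = 1.0
b = -(max(w * x for x, y in zip(xs, ys) if y == -1)
      + min(w * x for x, y in zip(xs, ys) if y == 1)) / 2   # b* = 0.0

# Both points have functional margin exactly 1: both are support vectors.
assert all(abs(y * (w * x + b) - 1.0) < 1e-12 for x, y in zip(xs, ys))

# Geometric margin theorem: gamma = 1/||w||_2 = (sum over SV of a_i)^(-1/2).
gamma = 1.0 / abs(w)
assert abs(gamma - sum(alpha) ** -0.5) < 1e-12
print("w* =", w, " b* =", b, " margin =", gamma)
```

Here both training points sit exactly on the margin, so both are support vectors and both $\alpha_i$ are nonzero, consistent with the sparsity corollary.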
Soft Margin Classifiers
Let us reformulate the concept of margin to allow errors, so positive or negative instances can lie on the wrong side of the margin. The slack variable $\xi_i$ represents the extent to which a margin constraint is violated:
$$y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \quad \text{where } \xi_i \ge 0, \; i = 1,\ldots,l$$
Similar to ridge regression, define a variable $\Lambda$ which represents the extent to which we want to tolerate errors. A soft-margin classifier solves the following constrained optimization problem:
$$\text{Minimize } \langle w, w\rangle + \Lambda \sum_{i=1}^l \xi_i$$
$$\text{subject to } y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \; i = 1,\ldots,l, \quad \text{where } \xi_i \ge 0, \; i = 1,\ldots,l$$

SVM Regression
[Figure slide.]

$\epsilon$-insensitive loss
[Figure: the $\epsilon$-insensitive loss.]

SVM Regression
- We introduce two slack variables $\xi_i$ and $\hat{\xi}_i$, which represent the penalty for exceeding or being below the target value by more than $\epsilon$.
- The primal problem can be formulated as
$$\text{Minimize } \|w\|^2 + \Lambda \sum_{i=1}^l (\xi_i + \hat{\xi}_i)$$
$$\text{subject to } (\langle w, x_i\rangle + b) - y_i \le \epsilon + \xi_i, \; i = 1,\ldots,l$$
$$\text{and } y_i - (\langle w, x_i\rangle + b) \le \epsilon + \hat{\xi}_i, \; i = 1,\ldots,l, \quad \text{where } \xi_i, \hat{\xi}_i \ge 0, \; i = 1,\ldots,l$$

The Kernel Trick
Note that SVM classification produces estimators of the form
$$\langle w, x\rangle + b = \sum_{i=1}^m \alpha_i y_i \langle x_i, x\rangle + b$$
where we can express the evaluator solely in terms of inner products and parameters that are independent of the input space dimensionality. To map from the original space $X$ into some higher-dimensional space via $\phi$, we only need to replace this by
$$\langle w, \phi(x)\rangle + b = \sum_{i=1}^m \alpha_i y_i \langle \phi(x_i), \phi(x)\rangle + b$$
The key property of kernels we are exploiting is the reproducing property:
$$k(x, y) = \langle \phi(x), \phi(y)\rangle$$
In words, the inner product in the high-dimensional space can be cheaply computed using the kernel function in the original space.

Convolution Kernels
Haussler (1999) defined convolution kernels as a general way of defining a similarity measure on structured objects, e.g. graphs or documents. Consider an object $x = (x_1, \ldots, x_d)$, where each part $x_i \in X_i$. We can define the part-of relation $R(x_1, \ldots, x_d, x)$, which holds if and only if $x_1, \ldots, x_d$ are indeed the parts of $x$. Of course, there may be more than one way to decompose $x$ into its parts: think of subsequences of strings, or subtrees of a tree, etc. Let $R^{-1}(x) = \{(x_1, \ldots, x_d) : R(x_1, \ldots, x_d, x)\}$. Then the convolution kernel $k(x, y)$ is defined as
$$k(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} \prod_{i=1}^d k_i(x_i, y_i)$$
where $k_i(x_i, y_i)$ is a kernel on the $i$th component.
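The kernel-trick identity $k(x, y) = \langle \phi(x), \phi(y)\rangle$ can be checked directly for the homogeneous degree-2 polynomial kernel in $\mathbb{R}^2$; the explicit feature map below is the standard one for that kernel, added here as an illustration rather than taken from the slides.

```python
import math

# The kernel trick in miniature: k(x, y) = <x, y>^2 in R^2 equals the inner
# product under the explicit feature map
#   phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
# so a 3-dimensional inner product is computed with a single 2-D dot product.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, y):
    return dot(x, y) ** 2            # kernel: never leaves the input space

def phi(x):
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)
assert abs(k(x, y) - dot(phi(x), phi(y))) < 1e-9
print("k(x, y) =", k(x, y))          # both sides give the same value
```

For degree-$d$ polynomial kernels in $n$ dimensions the explicit feature space has $O(n^d)$ coordinates, while the kernel evaluation stays $O(n)$, which is the entire point of the trick.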
String Kernels
Watkins (1999) defined string kernels, which can be seen as an instance of a convolution kernel. Consider the set of all subsequences of a word of length $n$; e.g. for $n = 2$, the subsequences of "bat" are "ba", "at", and "bt". The length of a subsequence $u$ is defined as $l(\mathbf{i}) = i_{|u|} - i_1 + 1$, if the subsequence begins at position $i_1$ in a string $s$ and ends at position $i_{|u|}$. Consider the mapping $\phi: \Sigma^* \to \mathbb{R}^{\Sigma^n}$, where $\Sigma$ is an alphabet, $\Sigma^*$ is the set of all strings, and $\Sigma^n$ is the set of all strings of length $n$. Given any subsequence $u \in \Sigma^n$, define
$$\phi_u(s) = \sum_{\mathbf{i}:\, u = s[\mathbf{i}]} \lambda^{l(\mathbf{i})}$$
where $\mathbf{i}$ is the index vector representing the positions at which the subsequence $u$ occurs in string $s$, and $\lambda \in (0, 1)$. The string kernel is defined as
$$K_n(s, t) = \sum_{u \in \Sigma^n} \phi_u(s)\,\phi_u(t) = \sum_{u \in \Sigma^n} \sum_{\mathbf{i}:\, u = s[\mathbf{i}]} \sum_{\mathbf{j}:\, u = t[\mathbf{j}]} \lambda^{l(\mathbf{i}) + l(\mathbf{j})}$$
Clearly $K_n(s, t) = \langle \phi(s), \phi(t)\rangle$, and so this is a valid kernel.

Examples of Kernels
- Multinomial kernel
- Product kernel: $K(x, y) = \langle x, y\rangle$
- TF-IDF kernel: $K(x, y) = \langle \phi(x), \phi(y)\rangle$, where $\phi_i(x) = \frac{t_i d_i}{\tau}$, $t_i$ is the term frequency of word $i$ in document $x$, $d_i$ is the inverse document frequency of word $i$, and $\tau$ is a normalizer to ensure $\|\phi(x)\| = 1$
- Gaussian kernel: $K(x, y) = e^{-\frac{1}{2}\|x - y\|^2 / \sigma^2}$
- RBF kernel: $K(x, y) = e^{-\|x - y\|^2}$
- Polynomial kernel: $K(x, y) = (\langle x, y\rangle + c)^d$
- Neural net kernel: $K(x, y) = \tanh(\langle x, y\rangle + c)$

Fisher Kernels
Jaakkola and Haussler (1998) proposed using a generative model as a kernel in a discriminative, nonprobabilistic classifier; in their case, they used kernel logistic regression. Let $P(X \mid \theta)$ be any generative model, e.g. a hidden Markov model. Consider the Fisher score
$$U_X = \frac{\partial \log P(X \mid \theta)}{\partial \theta}$$
in other words, the gradient of the log-likelihood of a particular input $X$. Define the information matrix $I = E[U_X U_X^T]$, where the expectation is over $P(X \mid \theta)$. The Fisher kernel is
$$K(x, y) = U_x^T I^{-1} U_y$$
The Fisher kernel can be asymptotically approximated as $K(x, y) \approx U_x^T U_y$. In a protein homology similarity problem, they got excellent results using the Fisher kernel with kernel logistic regression.
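For short strings, the gap-weighted string kernel $K_n$ can be computed by brute-force enumeration; the sketch below handles $n = 2$ directly (efficient implementations use dynamic programming, and the decay value $\lambda = 0.5$ is an arbitrary choice for illustration).

```python
from itertools import combinations
from collections import defaultdict

# A naive gap-weighted string kernel for subsequences of length n = 2.
# lam plays the role of the decay parameter lambda in (0, 1); each occurrence
# of a subsequence at positions (i, j) is weighted by lam ** (j - i + 1).

def phi(s, lam):
    """Map s to the feature vector (phi_u(s)) over length-2 subsequences u."""
    feats = defaultdict(float)
    for i, j in combinations(range(len(s)), 2):
        feats[s[i] + s[j]] += lam ** (j - i + 1)
    return feats

def k2(s, t, lam=0.5):
    fs, ft = phi(s, lam), phi(t, lam)
    return sum(fs[u] * ft[u] for u in fs if u in ft)

lam = 0.5
# For "bat": phi_ba = lam^2, phi_bt = lam^3, phi_at = lam^2, so
# K2("bat", "bat") = 2*lam^4 + lam^6.
assert abs(k2("bat", "bat", lam) - (2 * lam**4 + lam**6)) < 1e-12
# "bat" and "cat" share only the subsequence "at".
assert abs(k2("bat", "cat", lam) - lam**4) < 1e-12
print(k2("bat", "bat", lam), k2("bat", "cat", lam))
```

The longer the gap a subsequence spans, the smaller its contribution, so contiguous matches dominate, which is the intended behavior of the decay factor $\lambda$.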
Reproducing Kernel Hilbert Spaces
- Theorem: Every RKHS $H$ on a domain $T$ yields a positive definite function $k(s, t)$ on $T \times T$, and conversely, every positive definite function $k(s, t)$ yields an RKHS.
Proof: Assume $H$ is an RKHS. Then every function evaluation $f(t)$, $f \in H$, has a representer; that is, $f(t) = \langle \eta_t, f\rangle$, where $\eta_t$ is the representer for function evaluation at $t$. Define $k(s, t) = \langle \eta_s, \eta_t\rangle$. Clearly $k(s, t)$ must be a symmetric function. We show below that $k(s, t)$ must be positive definite:
$$\sum_{i,j} \alpha_i \alpha_j k(t_i, t_j) = \sum_i \sum_j \alpha_i \alpha_j \langle \eta_{t_i}, \eta_{t_j}\rangle = \left\langle \sum_i \alpha_i \eta_{t_i},\; \sum_j \alpha_j \eta_{t_j} \right\rangle = \left\| \sum_i \alpha_i \eta_{t_i} \right\|^2 \ge 0$$

Reproducing Kernel Hilbert Spaces
Proof of converse: Given a positive definite function $k(s, t)$ on $T \times T$, we want to show that this defines a unique reproducing kernel Hilbert space $H_K$. Define the function $k_t = k(t, \cdot)$, so that $k_t(s) = k(t, s)$ for all $s, t \in T$; we show below that this is our representer function. Define the pre-Hilbert space $H_0$ to be the linear space comprised of all $f = \sum_i \alpha_i k_{t_i}$, where $t_i \in T$. We can show that the completion of $H_0$ is an RKHS. Consider two elements $f, g \in H_0$, where $f = \sum_{i=1}^n \alpha_i k_{t_i}$ and $g = \sum_{j=1}^m \beta_j k_{s_j}$. Define the inner product in $H_0$ as
$$\langle f, g\rangle = \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j\, k(t_i, s_j)$$

Reproducing Kernel Hilbert Spaces
- From the definition of the inner product in $H_0$, we see that $\langle f, g\rangle = \sum_i \alpha_i\, g(t_i)$.
- Similarly, it follows that $\langle f, g\rangle = \sum_j \beta_j\, f(s_j)$.
- This implies that this inner product gives us the reproducing property of an RKHS, namely that $\langle k_t, f\rangle = f(t)$. It also follows that $\langle k_t, k_{t'}\rangle = k(t, t')$.

Reproducing Kernel Hilbert Spaces
- To complete the proof, we need to show that our definition of $\langle f, g\rangle$ satisfies all the properties of an inner product. For example, $\langle f, f\rangle \ge 0$ holds because $k(s, t)$ is positive definite:
$$\langle f, f\rangle = \sum_{i,j} \alpha_i \alpha_j\, k(t_i, t_j) \ge 0$$
- We also need to show that if $\langle f, f\rangle = 0$ then $f = 0$. To show this, first show that kernels satisfy the Cauchy-Schwarz inequality, that is, $k^2(s, t) \le k(s, s)\,k(t, t)$. Then use the reproducing property to show that $f^2(t) = \langle f, k_t\rangle^2 \le \langle f, f\rangle\, k(t, t)$.
- Finally, we add to the pre-Hilbert space $H_0$ all limits of Cauchy sequences. Note that if $(f_n)$ is a Cauchy sequence, then
$$|f_n(t) - f_m(t)|^2 \le \|f_n - f_m\|^2\, k(t, t)$$
and so the Cauchy sequence must converge pointwise to some $f: T \to \mathbb{R}$. We add all such $f$ to $H_0$ to get our RKHS $H_K$.

Mercer's Theorem: Discrete Case
Let's start with the discrete, simple case of Mercer's theorem. Consider a symmetric kernel function $k(x, z)$ on a finite input space $X = \{x_1, \ldots, x_n\}$. The Gram matrix of $k(x, z)$ is the matrix $G = (k(x_i, x_j))_{i,j=1}^n$. Let us restrict our attention to kernels whose Gram matrices are positive definite, i.e. the eigenvalues are nonnegative. Then we know that
$$G = \lambda_1 v_1 v_1^T + \cdots + \lambda_n v_n v_n^T = \sum_{i=1}^n \lambda_i v_i v_i^T$$
where the $v_i$ are the eigenvectors of $G$. Consider the nonlinear mapping $\phi: x_t \mapsto (\sqrt{\lambda_i}\, v_i(t))_{i=1}^n$. Then we can see that
$$\langle \phi(x_t), \phi(x_s)\rangle = \sum_{i=1}^n \lambda_i\, v_i(t)\, v_i(s) = k(x_t, x_s)$$

Mercer's Theorem: General Case
Let $K(x, y)$ be any symmetric continuous function. Then $K(x, y) = \langle \phi(x), \phi(y)\rangle$ if and only if
$$\iint K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0 \quad \text{for all } g$$
The condition above generalizes the positive definite condition from vectors to the Hilbert space of functions. Mercer's theorem arises in the context of integral equations:
$$f(y) = \int K(x, y)\, g(x)\, dx$$
A function $\phi(x)$ is an eigenfunction of $K(x, y)$ if it solves the corresponding integral operator equation:
$$\lambda\, \phi(y) = \int K(x, y)\, \phi(x)\, dx$$
Note that eigenfunctions are a generalization of eigenvectors, since in general Hilbert spaces, vectors are functions.

Mercer's Theorem: General Case
In the general case, we have to ensure that inner products over infinite vectors converge. We do that by redefining the mapping as $\phi: x \mapsto (\sqrt{\lambda_j}\, \phi_j(x))_j$. If $K(x, y)$ is a kernel that satisfies the conditions of Mercer's theorem, then it can be expanded using the set of eigenfunctions:
$$K(x, y) = \sum_{j=1}^\infty \lambda_j\, \phi_j(x)\, \phi_j(y)$$
where the $\phi_j$ are normalized so that $\|\phi_j\|_{L_2} = 1$. Here the weights $\lambda_j$ are the positive eigenvalues. The condition above generalizes the property that positive matrices can be expanded as a finite sum of projection matrices.
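The discrete case of Mercer's theorem can be verified numerically; the sketch below assumes NumPy is available and uses an invented set of sample points with a Gaussian kernel, building the finite feature map from the eigendecomposition of the Gram matrix.

```python
import numpy as np

# Discrete Mercer's theorem, numerically: build the Gram matrix of a Gaussian
# kernel on a few points, eigendecompose G = V diag(L) V^T, and check that the
# finite feature map phi(x_t) = (sqrt(L_i) * v_i(t))_i reproduces every kernel
# value as an ordinary inner product.

def k(x, z, sigma=1.0):
    return np.exp(-0.5 * (x - z) ** 2 / sigma ** 2)

xs = np.array([-1.0, 0.0, 0.5, 2.0])
n = len(xs)
G = np.array([[k(xs[i], xs[j]) for j in range(n)] for i in range(n)])

L, V = np.linalg.eigh(G)                   # symmetric eigendecomposition
assert np.all(L > -1e-10)                  # Gram matrix is positive semidefinite

Phi = V * np.sqrt(np.clip(L, 0.0, None))   # row t of Phi is phi(x_t)

# <phi(x_t), phi(x_s)> = sum_i L_i v_i(t) v_i(s) = k(x_t, x_s) for all t, s.
assert np.allclose(Phi @ Phi.T, G)
print("max reconstruction error:", np.abs(Phi @ Phi.T - G).max())
```

The general (continuous) case replaces the eigenvectors $v_i$ with eigenfunctions $\phi_j$ and the finite sum with the infinite expansion above, but the mechanism is the same.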