
# Data Mining & Stat Learn ISYE 7406

This set of class notes for ISYE 7406 was uploaded by Maryse Thiel on Monday, November 2, 2015. The course was taught by Kwok-Leung Tsui at Georgia Institute of Technology - Main Campus in Fall. Since its upload, the document has received 17 views. For similar materials see /class/234200/isye-7406-georgia-institute-of-technology-main-campus in Industrial Engineering at Georgia Institute of Technology - Main Campus.


Date Created: 11/02/15

### Support Vector Machines

Kwok-Leung Tsui
Industrial & Systems Engineering
Georgia Institute of Technology
2/26/2009

### Outline

- Support Vector Machines
- Support Vector Tree
- Adjusted Support Vector Machines

### Support Vector Machines (SVM)

The support vector machine has been shown to achieve good generalization performance for classification of high-dimensional data sets, and its training can be framed as solving a quadratic programming problem.

### Some History of SVM

- SVM was first introduced in 1992 [1].
- SVM is inspired by statistical learning theory [2].
- SVM became popular because of its success in handwritten digit recognition.
- SVM is now regarded as an important example of kernel methods, arguably the hottest area in machine learning.

[1] B. E. Boser et al., "A Training Algorithm for Optimal Margin Classifiers," *Proceedings of the Fifth Annual Workshop on Computational Learning Theory* 5, 144-152, Pittsburgh, 1992.
[2] V. Vapnik, *The Nature of Statistical Learning Theory*, 2nd edition, Springer, 1999.

### Separating Hyperplane

*(Figure: a scatter of two document collections separated by a plane; the figure's text did not survive extraction.)*

### Support Vector Machine (SVM): Simplest Case

A linear machine trained on separable data.

### Pattern Classification

Training data $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathbb{R}^p$ and $y_i \in \{+1, -1\}$.

The decision boundary is the hyperplane $w \cdot x + b = 0$, and the classification rule is $y = \operatorname{sign}(w \cdot x + b)$.

### Which Line Is the Best?

Many hyperplanes separate the training data; they differ in how well they generalize.

### Maximal Margin Separating Hyperplane

Optimization problem:

$$\max_{w,\,b:\ \|w\| = 1} C \quad \text{s.t. } y_i (w \cdot x_i + b) \ge C,\ i = 1, \dots, n.$$

The margin is $2C$: intuitively, we maximize the margin between the classes.

### Maximal Margin Separating Hyperplane (cont.)

To get rid of the constraint $\|w\| = 1$, we rephrase the optimization problem:

$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1,\ i = 1, \dots, n.$$

This is a convex optimization problem: a quadratic objective with linear inequality constraints.

### Support Vectors

Let $S = \{\, i : y_i (w \cdot x_i + b) = 1 \,\}$. Then $w = \sum_{i \in S} \alpha_i y_i x_i$, and the classification rule is

$$y = \operatorname{sign}\Big( \sum_{i \in S} \alpha_i y_i\, x_i \cdot x + b \Big).$$

The separating line depends on only a small number of training examples, which are called support vectors.

### Geometric Margin: Maximal Margin Hyperplane

A hyperplane realizing the maximum geometric margin. According to a theorem from learning theory, of all possible linear decision functions, the one that maximizes the margin of the training set will minimize the generalization error. The optimal linear classifier forms the maximal margin hyperplane.

### Geometric Margin

The Euclidean distance of an example $(x, y)$ from the decision boundary is

$$\gamma(x) = \frac{y\,(w \cdot x + b)}{\|w\|}.$$

### Geometric Margin and Hyperplane

Decision plane: $w \cdot x + b = 0$; plus-plane: $w \cdot x + b = +1$; minus-plane: $w \cdot x + b = -1$. Let us compute the margin width $M$ in terms of $w$ and $b$. Let $x^-$ be any point on the minus-plane, and let $x^+$ be the closest plus-plane point to $x^-$. Then $x^+ = x^- + \lambda w$ for some value of $\lambda$, and $M = \|x^+ - x^-\|$.

From $w \cdot x^+ + b = 1$ and $w \cdot x^- + b = -1$,

$$w \cdot (x^- + \lambda w) + b = 1 \ \Rightarrow\ -1 + \lambda\, w \cdot w = 1 \ \Rightarrow\ \lambda = \frac{2}{w \cdot w},$$

so

$$M = \|x^+ - x^-\| = \lambda \|w\| = \frac{2}{w \cdot w}\,\sqrt{w \cdot w} = \frac{2}{\|w\|}.$$
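The margin identity $M = 2/\|w\|$ can be checked numerically on a toy data set. A minimal sketch, assuming scikit-learn (the slides do not prescribe a library); a very large $C$ approximates the hard-margin, separable-case SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in R^2 (illustrative data, not from the slides).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A huge C leaves essentially no room for slack, mimicking the hard margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)   # M = 2 / ||w||
print("w =", w, " b =", b)
print("margin width M =", margin)
print("support vectors:\n", clf.support_vectors_)
```

For this configuration the closest opposing points are $(1,0)$, $(0,1)$, and $(3,3)$, so the reported margin should come out near $5/\sqrt{2} \approx 3.54$.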
### Maximal Margin Separating Hyperplane (restated)

$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1,\ i = 1, \dots, n.$$

This is a convex optimization problem: a quadratic objective with linear inequality constraints.

### Estimating the Maximum Margin Classifier

We need a way to search the space of $w$ and $b$ for the widest margin that matches all the data points:

- gradient descent?
- expectation maximization?
- other numerical search algorithms?
- quadratic programming.

### Quadratic Programming

QP is a well-studied class of optimization algorithms for maximizing a quadratic function of real-valued variables subject to linear constraints. In general form: find

$$\arg\max_{u}\ c + d^T u + \tfrac{1}{2}\, u^T R\, u$$

subject to $n$ linear inequality constraints and $e$ additional linear equality constraints. *(The detailed constraint display on this slide did not survive extraction.)*

### Finding the Decision Boundary

Let $\{x_1, \dots, x_n\}$ be our data set, and let $y_i \in \{1, -1\}$ be the class label of $x_i$. The decision boundary should classify all points correctly:

$$y_i (w^T x_i + b) \ge 1,\ \forall i.$$

The decision boundary can be found by solving the constrained optimization problem

$$\min\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t. } y_i (w^T x_i + b) \ge 1,\ i = 1, \dots, n.$$

The Lagrangian of this optimization problem is

$$L = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \big( y_i (w^T x_i + b) - 1 \big).$$

### Constrained Optimization

Minimize $f(x)$ subject to $c_i(x) \le 0$ for $i = 1, 2, \dots, n$.

Theorem (KKT conditions for differentiable convex problems): a solution to the optimization problem with convex, differentiable $f$ and $c_i$ is given by $\bar{x}$ if there exist $\bar{\alpha}_i \ge 0$, $i = 1, \dots, n$, such that:

- saddle point in $x$: $\partial_x f(\bar{x}) + \sum_i \bar{\alpha}_i\, \partial_x c_i(\bar{x}) = 0$;
- saddle point in $\alpha$;
- vanishing KKT gap: $\sum_i \bar{\alpha}_i\, c_i(\bar{x}) = 0$.

(KKT = Karush-Kuhn-Tucker.)
### Solutions

Construct and minimize the Lagrangian

$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \big( y_i (w^T x_i + b) - 1 \big), \quad \text{subject to } \alpha_i \ge 0,\ i = 1, \dots, N,$$

by taking derivatives w.r.t. $w$ and $b$ and equating them to 0:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{N} \alpha_i y_i x_i, \quad (2)$$

so the parameters are expressed as a linear combination of training points, and

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{N} \alpha_i y_i = 0. \quad (3)$$

The KKT condition $\alpha_i \big( y_i (w^T x_i + b) - 1 \big) = 0$ implies that only support vectors will have nonzero $\alpha_i$. The Lagrange multipliers $\alpha_i$ are called dual variables; each training point has an associated dual variable.

### Example

*(Figure: two classes with the maximal margin hyperplane and its support vectors marked; the figure text is not recoverable.)*

### Solutions (cont.)

Thus $w = \sum_i \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$. Plugging Eqs. (2) and (3) back into the Lagrangian yields the dual formulation, which is solved for $\alpha$ by using a QP solver:

$$\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \alpha_i \ge 0,\ i = 1, \dots, N.$$

The data enter only in the form of dot products. This is a quadratic programming (QP) problem, so a global maximum of $W(\alpha)$ can always be found. The intercept $b$ does not appear in the dual, so it is determined separately from the initial constraints.

### Classifying New Data Points

Once the parameters $(\alpha^*, b^*)$ are found by solving the required quadratic optimization on the training set, the SVM is ready to classify new points. Given a new point $x$, its class membership is $\operatorname{sign}\big(f(x; \alpha^*, b^*)\big)$, where

$$f(x) = w^{*T} x + b^* = \sum_{i=1}^{N} \alpha_i^* y_i\, x_i^T x + b^* = \sum_{i \in SV} \alpha_i^* y_i\, x_i^T x + b^*.$$

This is called a linear SVM (LSVM).

### Solutions: Dot Products

The solution of the SVM (i.e., of the quadratic programming problem with linear inequality constraints) has the nice property that the data enter only in the form of dot products.

Definition (dot product, inner product): given $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the dot product of $x$ and $y$ is $x^T y = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$. This is nice because it allows us to make SVMs nonlinear without complicating the algorithm.

### Characteristics of the Solution

- Many of the $\alpha_i$ are zero: $w$ is a linear combination of a small number of data points. This sparse representation can be viewed as data compression, as in the construction of the k-nearest-neighbor (kNN) classifier.
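The sparsity property, that only the support vectors carry nonzero dual variables $\alpha_i$, can be observed directly on a fitted model. A sketch assuming scikit-learn (an assumption; any SVM package exposing the dual solution would do): `dual_coef_` stores the nonzero $y_i \alpha_i$ and nothing else.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian clouds, 100 points each (illustrative data).
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

n_sv = int(clf.n_support_.sum())
print(f"{n_sv} of {len(X)} training points are support vectors")
# Only support vectors appear in the dual expansion w = sum_i alpha_i y_i x_i:
print("nonzero dual variables stored:", clf.dual_coef_.shape[1])
```

Most $\alpha_i$ are zero: the decision boundary is rebuilt from the handful of points nearest the margin, consistent with the data-compression view of the solution.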
- The $x_i$ with nonzero $\alpha_i$ are called support vectors (SV). The decision boundary is determined only by the SV.
- Let $t_j$, $j = 1, \dots, s$, be the indices of the $s$ support vectors. We can write $w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}$.
- For testing with a new datum $z$: compute $w^T z + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j}\, x_{t_j}^T z + b$, and classify $z$ as class 1 if the sum is positive and as class 2 otherwise.

### Overlapping Cases

It may be impossible to perfectly separate the two classes. Include an error term: instead of maximizing the margin alone, minimize the error plus the inverse margin. Again this involves only quadratic programming.

### Linearly Non-Separable Case

We allow an error $\xi_i$ in classification.

### Linearly Non-Separable Problems

We cannot find $w$ and $b$ that satisfy $y_i(\langle w, x_i \rangle + b) \ge 1$ for all $i$ if a high noise level causes a large overlap of the classes. Introduce slack variables $\xi_i \ge 0$:

$$y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \quad \text{for all } i.$$

By making $\xi_i$ large enough, the constraints can always be met. In order not to obtain the trivial solution where all $\xi_i$ take on large values, we need to penalize them in the objective function. Penalized objective:

$$\min\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t. } y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i,\ \xi_i \ge 0 \ \text{for all } i.$$

### Linearly Non-Separable Problems (cont.)

By the minimization, the slack variables can be obtained as

$$\xi_i = \max\big(0,\ 1 - y_i (w^T x_i + b)\big).$$

Note that $\xi_i = 0$ if there is no error for $x_i$, and $\sum_i \xi_i$ is an upper bound on the number of errors.

### Linearly Non-Separable Problems (cont.)

$C$ is a tradeoff parameter between error and margin, i.e., between two conflicting goals: minimizing the training error and maximizing the margin. The optimization problem becomes

$$\min\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t. } y_i (w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0.$$

### Linearly Non-Separable Problems: The Dual

The dual of the problem is

$$\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t. } \sum_i \alpha_i y_i = 0,\ C \ge \alpha_i \ge 0,\ i = 1, \dots, N,$$

and $w$ is recovered as $w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}$. This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound $C$ on $\alpha_i$. Once again, a QP solver can be used to find the $\alpha_i$.
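The slack variables have a closed form at the optimum, $\xi_i = \max(0,\ 1 - y_i(w^T x_i + b))$, which is exactly the hinge loss of point $i$; the total slack bounds the number of training errors, because every misclassified point has $\xi_i > 1$. A small numpy check on hypothetical margin values (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical values of y_i * (w . x_i + b) for six training points.
yf = np.array([2.3, 1.0, 0.4, -0.2, -1.5, 3.0])

xi = np.maximum(0.0, 1.0 - yf)   # slack = hinge loss per point
errors = int((yf < 0).sum())     # misclassified points have y_i * f(x_i) < 0

print("slacks:", xi)             # -> 0, 0, 0.6, 1.2, 2.5, 0
print(errors, "errors  <=  total slack", xi.sum())
```

Points correctly classified outside the margin get $\xi = 0$; points inside the margin get $0 < \xi \le 1$; misclassified points get $\xi > 1$, so here the total slack 4.3 dominates the error count 2.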
### Strengths of SVMs

- Good generalization in theory
- Good generalization in practice
- Work well with few training instances
- Find the globally best model
- Efficient algorithms
- Amenable to the kernel trick

### Extension to Non-linear Decision Boundary

### Nonlinear Cases

Use another coordinate system such that the curve becomes a line.

### Kernels

$\Phi: x \mapsto \Phi(x)$. Only inner products $x^T y$ are involved in the calculation. Under certain conditions there exists a kernel $K$ such that $K(x, y) = \Phi(x)^T \Phi(y)$; e.g., the polynomial kernel of degree $d$, $K(x, y) = (x^T y + 1)^d$. We replace $x^T y$ by $\Phi(x)^T \Phi(y)$.

### Extension to Non-linear Decision Boundary (cont.)

So far we have only considered a large-margin classifier with a linear decision boundary. How to generalize it to become nonlinear? Key idea: transform $x_i$ to a higher-dimensional space to "make life easier."

- Input space: the space where the points $x_i$ are located.
- Feature space: the space of the $\Phi(x_i)$ after transformation.

Why transform? A linear operation in the feature space is equivalent to a nonlinear operation in the input space, and classification can become easier with a proper transformation.

### Nonlinear Mapping

*(Figure: separation may be easier in higher dimensions; a complicated mapping in low dimensions becomes a simple hyperplane in higher dimensions.)*

### Examples

*(Figure: an example of a binary classification problem mapped into feature space. The true decision boundary is an ellipse in input space (left panel). When mapped into feature space via the nonlinear map $\Phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)$ (right panel), the ellipse becomes a hyperplane, because ellipses can be written as linear equations in the entries of $\Phi(x)$. In feature space, the problem therefore reduces to estimating a hyperplane from the mapped data points; via the polynomial kernel, the dot product in the three-dimensional space can be computed without computing $\Phi$.)*

### Examples (cont.)

*(Figure: value of the discriminant function for class 1 and class 2; not recoverable.)*

### Mapping into a New Feature Space

$$\Phi: x = (x_1, x_2) \ \mapsto\ \Phi(x) = (x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2).$$

Rather than run the SVM on the $x_i$, run it on the $\Phi(x_i)$ and find a nonlinear separator in input space. What if $\Phi(x_i)$ is really big? Use kernels to compute it implicitly.

### Transforming the Data

Computation in the feature space can be costly because it is high-dimensional; the feature space can even be infinite-dimensional. The kernel trick comes to the rescue.

### The Kernel Trick

Recall the SVM optimization problem:

$$\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t. } \sum_i \alpha_i y_i = 0,\ C \ge \alpha_i \ge 0,\ i = 1, \dots, N.$$

The data points appear only as inner products. As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. Define the kernel function $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$.

### Idea of the Kernel Trick

The only way in which the data appear in the training problem is in the form of dot products $x_i \cdot x_j$. First map the data to some other (possibly infinite-dimensional) space using a mapping $\Phi$; the training algorithm then depends on the data only through dot products $\Phi(x_i) \cdot \Phi(x_j)$. If there is a kernel function $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, we would only need to use $K$ in the training algorithm and would never need to know $\Phi$ explicitly.
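The kernel trick can be verified concretely for the degree-2 map from the figure, $\Phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)$: the feature-space inner product equals the squared input-space dot product, so $K$ can stand in for $\Phi$. A quick numpy check (a sketch; the map and kernel follow the example in the text):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def K(x, z):
    """Degree-2 polynomial kernel (no offset): K(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

rng = np.random.default_rng(1)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = float(np.dot(phi(x), phi(z)))   # inner product computed in feature space
rhs = K(x, z)                         # the same number computed in input space
print(lhs, rhs)
assert np.isclose(lhs, rhs)           # identical up to floating-point error
```

This is why the dual only ever needs $K(x_i, x_j)$: the three-dimensional (or infinite-dimensional) feature space is never materialized.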
### Why SVMs Work

The feature space is often very high-dimensional. Why don't we have the curse of dimensionality? A classifier in a high-dimensional space has many parameters and is hard to estimate. Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of the classifier. Another view: the SVM loss function is analogous to ridge regression; the term $\tfrac{1}{2}\|w\|^2$ shrinks the parameters towards zero to avoid overfitting.

### Choosing the Kernel Function

Probably the trickiest part of using an SVM. The kernel function is important because it creates the kernel matrix, which summarizes all the data. Many principles have been proposed: diffusion kernels, Fisher kernels, string kernels, information complexity. In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try.

### Summary: Steps for Classification

1. Prepare the pattern matrix.
2. Select the kernel function to use.
3. Select the parameters of the kernel function and the value of $C$. You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters.
4. Execute the training algorithm and obtain the $\alpha_i$.
5. Unseen data can be classified using the $\alpha_i$ and the support vectors.

### Summary: Multi-Class Problems

- Categorical predictor variables: use dummy variables.
- Multi-class responses:
  - 1-versus-1: $K(K-1)/2$ classifiers, each using only the relevant data.
  - 1-versus-rest: $K$ classifiers.
  - Combine classifiers: count the number of votes, or take averages of conditional probabilities.

### Strengths and Weaknesses of SVM

Strengths:

- Training is relatively easy: there are no local optima, unlike in neural networks.
- It scales relatively well to high-dimensional data.
- The tradeoff between classifier complexity and error can be controlled explicitly.
- Non-traditional data like strings and trees can be used as input to an SVM, instead of feature vectors.
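The multi-class reduction schemes above can be sketched with scikit-learn (an assumed library). For $K = 3$ classes, 1-versus-1 trains $K(K-1)/2 = 3$ pairwise classifiers and combines them by voting, while 1-versus-rest trains $K = 3$ classifiers:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # K = 3 classes

# 1-versus-1: SVC does this internally; 'ovo' exposes the 3 pairwise scores.
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
print("pairwise decision values per sample:", ovo.decision_function(X[:1]).shape)

# 1-versus-rest: one classifier per class, trained against all the others.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print("number of one-vs-rest classifiers:", len(ovr.estimators_))
```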
Weaknesses:

- Need to choose a good kernel function.

### Software

A list of SVM implementations can be found at http://www.kernel-machines.org/software.html. Some implementations, such as LIBSVM, can handle multi-class classification. SVMLight is among the earliest implementations of SVM. Several Matlab toolboxes for SVM are also available.

### The Support Vector Classifier: Mixture Example

*(Figure 12.2 [from Hastie et al.]: the support vector classifier on the mixture data, an example with two overlapping classes; the broken lines indicate the margins. The support points are the points on the wrong side of their margin. In the left panel 62% of the observations are support points; in the right panel, 85%.)*

### Support Vector Machines: Mixture Example

The solution function has the form $\hat{f}(x) = \sum_{i=1}^{N} \hat{\alpha}_i y_i K(x, x_i) + \hat{\beta}_0$.

*(Figure 12.3 [from Hastie et al.]: two nonlinear SVMs for the mixture data. The left panel uses a degree-4 polynomial kernel, the right a radial basis kernel. In each case $C$ was tuned to approximately achieve the best test error performance. The radial basis kernel performs the best, close to the Bayes optimal, as might be expected given that the data arise from mixtures of Gaussians.)*

### Support Vector Tree

Simultaneously realizing the principles of maximal margin and maximal purity.

X. Huo, J. Chen, S. Wang, K. L. Tsui
School of Industrial and Systems Engineering, Georgia Institute of Technology
### Support Vector Tree (SVT)

SVT = SVM + classification tree: maximal margin & maximal purity.

- Maximal margin: better separation.
- Maximal purity: a pure descendant node.

### Type I Topology: Binary Tree Partition

At each non-terminal node, use SVM to find the class most separable from the others, and the corresponding margin-maximizing separating hyperplane.

### Algorithm (Type I)

*(The pseudocode on these slides did not survive extraction. The recoverable outline: at each node, for each class compute the margin-maximizing SVM separating that class from the rest; choose the class whose training data are most separable, assign it to one descendant node and the remaining classes to the other, and recurse.)*

### Type II Topology: Binary Tree Partition

At each non-terminal node, group the data into two super-classes; use SVM to find the two most separable super-classes and the corresponding margin-maximizing separating hyperplane.

### Algorithm (Type II)

*(The pseudocode on these slides did not survive extraction; it enumerates groupings of the classes into two super-classes, runs SVM for each grouping, and keeps the most separable grouping, recursing as in Type I.)*

### Experiments: Data Summary

| Problem | Number of samples | Classes | Attributes |
|---|---|---|---|
| balance | 625 | 3 | 4 |
| car | 1728 | 4 | 6 |
| glass | 214 | 7 | 9 |
| iris | 150 | 3 | 4 |
| segment | 2310 | 7 | 19 |
| waveform | 600 | 3 | 22 |
| wine | 178 | 3 | 13 |

### Experiments: SVT vs. One-versus-One

| Problem | SVT error | SVT #SVs | One-vs-one error | One-vs-one #SVs |
|---|---|---|---|---|
| balance | 0.0957 | 113 | 0.0957 | 117 |
| car | 0.1291 | 371 | 0.1310 | 387 |
| glass | 0.3231 | 112 | 0.3231 | 109 |
| iris | 0 | 15 | 0 | 16 |
| segment | 0.0375 | 193 | 0.0331 | 210 |
| waveform | 0.1657 | 253 | 0.1547 | 252 |
| wine | 0.0185 | 23 | 0.0185 | 21 |

### Experiments: SVT vs. CART

(The CART columns appear twice in the original slide; the garbled header does not say how the two CART trees differ, e.g., size-matched vs. fully grown.)

| Problem | SVT size | SVT error | CART size | CART error | CART size | CART error |
|---|---|---|---|---|---|---|
| balance | 3 | 0.0957 | 3 | 0.3564 | 19 | 0.1968 |
| car | 5 | 0.1291 | 5 | 0.1734 | 67 | 0.025 |
| glass | 7 | 0.3231 | 7 | 0.3846 | 15 | 0.2615 |
| iris | 3 | 0 | 3 | 0.1087 | 4 | 0.0217 |
| segment | 7 | 0.0375 | 7 | 0.134 | 50 | 0.0346 |
| waveform | 3 | 0.1657 | 3 | 0.2652 | 14 | 0.2376 |
| wine | 3 | 0.0185 | 3 | 0.1296 | 4 | 0.0556 |

### Experiments: Simulated Example

Conditional probabilities of classes 1, 2, and 3:

$$f_1(x) = 0.97\,e^{-3x}, \qquad f_2(x) = e^{-2.5 (x - 1.2)^2}, \qquad f_3(x) = 1 - f_1(x) - f_2(x).$$

Training: 1000 points; test: 10,000. SVT is compared with the Bayes classifier. *(The accompanying figure of the class probabilities is not recoverable.)*

### Discussion

SVT is a hierarchical representation of a partition of the state space; it is easy to understand and interpret. Asymptotically, SVT converges to the Bayes classifier under certain conditions.

### Problems

The computation is expensive: for a $K$-class problem, we need to run the SVM $K$ times to find the most separable class and the corresponding optimal hyperplane. A quicker, simpler algorithm is wanted to find the most separable class; the SVM is then used to find the optimal hyperplane.
### Adjusted Support Vector Machines

S. Wang, W. Jiang, K. L. Tsui
School of Industrial and Systems Engineering, Georgia Institute of Technology
Stevens Institute of Technology

### Support Vector Machine (SVM)

Given data $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathbb{R}^p$ and $y_i \in \{1, -1\}$, SVM finds a separating hyperplane $f(x) = x^T w + b = 0$ that maximizes the margin of separation between the two classes $1$ and $-1$.

### Support Vector Machine (SVM), cont.

Mathematically speaking, SVM solves an optimization problem whose loss function has the form

$$L\big(y f(x)\big) = \max\big(0,\ 1 - y f(x)\big),$$

which penalizes a point only if it falls on the wrong side of its boundary, i.e., $y f(x) < 1$. This loss function is referred to as the hinge loss, and points on the wrong side of their boundaries are referred to as support vectors. *(The displayed objective on this slide did not survive extraction.)*

### Concerns

- The maximum-margin separating hyperplane strongly depends on the support vectors.
- Support vectors are often only a few points near the boundaries, which tend to vary dramatically with the training samples, so the SVM solution is unstable, especially when the sample size is small.
- The maximum-margin separating hyperplane is equidistant to the two classes, but the number of training points and the dispersion of the training samples in each class may be different.
- Consequently, SVM may perform poorly when there exist sampling bias and large differences in class variances, as illustrated in the figure. *(Figure: two classes with unequal sample sizes, comparing the SVM solution with the optimal solution; the details are not recoverable.)*

### Motivation

- The loss function depends only on the support vectors: sensitivity to outliers at the boundaries.
- Even if there are no outliers, there is bias due to small sample size, arising from asymmetric variance and/or asymmetric sample size.
- Possible rectification: a new loss function depending on all the data (robust).

### Other Loss Functions for SVM Variants

For example, Lee and Mangasarian (2001). All these loss functions penalize only points with $y f(x) < 1$, and thus share the same concerns as SVM: sensitivity to the training samples and poor performance under certain situations. *(Figure: plots of the hinge loss, the SSVM loss, another variant loss, and the new loss; not recoverable.)*

### A New Loss Function

The new loss function:

- penalizes all training data points, and therefore implicitly considers the number of training points as well as their dispersion in each class;
- is smooth and strictly monotone decreasing in $y f(x)$;
- coincides with the hinge loss for small values of $y f(x)$ and exceeds the hinge loss otherwise (the precise statement on the slide is garbled).

*(The displayed formula for the loss did not survive extraction.)*

### A New Loss Function (cont.)

Under the new loss function, the optimization problem for SVM becomes a penalized fit of the same form, with the hinge loss replaced by the new loss. *(The formula is not recoverable.)* The solution of this optimization problem is expected to be robust against support vectors, as the new loss function penalizes all training data points.

Theorem: under certain regularity conditions (Vapnik), the solution of the optimization problem approaches the Bayes classifier as the sample size approaches infinity.

### Adjusted Support Vector Machine (ASVM)

To maintain the nice properties of SVM, such as easy nonlinear generalization with the kernel trick, ASVM proceeds in two steps:

1. First, run SVM to find the maximum-margin separating hyperplane $w^T x + b = 0$ and the boundaries for each class.
2. Next, adjust the location $b$ of the maximum-margin separating hyperplane within the two boundaries by solving an optimization problem under the new loss. *(The displayed problem did not survive extraction.)*
### A One-Dimensional Example

Compare the generalization performance of ASVM and SVM.

- Training sample: class 1, $n_1$ points randomly drawn from a normal distribution $N(0, \sigma_1^2)$; class 2, $n_2$ points randomly drawn from $N(\mu, \sigma_2^2)$. *(The exact parameter values on this slide are garbled.)*
- Tuning parameters $C_1$ and $C_2$ are chosen using a hold-out sample of size 10,000 (5,000 from class 1 and 5,000 from class 2).
- Testing sample: 20,000 observations, with 10,000 from class 1 and 10,000 from class 2.
- For each setting, run 40 simulations and calculate the average and standard deviation of the error rates.

*(The results figure and the table of average error rates and standard deviations over the range of sample sizes did not survive extraction. Qualitatively, ASVM's average error rates stay near 0.15-0.16 with small standard deviations, while SVM's grow toward roughly 0.24 with much larger standard deviations as the settings become more unbalanced.)*

### A One-Dimensional Example: Observations

- When the variances of the two classes are comparable, the performance of ASVM is very competitive with that of SVM.
- When the variances of the two classes are considerably different, ASVM outperforms SVM whether or not sampling bias exists.
- The larger the difference between $\sigma_1$ and $\sigma_2$, the better ASVM performs relative to SVM, especially when sampling bias is severe; ASVM is more robust against sampling bias than SVM.
- ASVM is much more stable than SVM: the standard deviation of the error rate from ASVM is much less than that of the error rate from SVM when the variances of the two classes are considerably different.

### Experimental Comparisons: Real Data

Two examples from the UCI Machine Learning Repository website: Wisconsin Breast Cancer and PIMA Indian Diabetes. Ten-fold cross-validation was used; ASVM had a smaller error rate than conventional SVM in both examples.

| Dataset | Sample size | # classes | # input variables | SVM error | ASVM error |
|---|---|---|---|---|---|
| Wisconsin Breast Cancer | *(garbled)* | 2 | 9 | *(garbled)* | *(garbled)* |
| PIMA Indian Diabetes | 532 | 2 | 7 | *(garbled)* | *(garbled)* |

### Examples with Real Data Sets

*(Tables 11 and 12, ten-fold cross-validation error rates and standard deviations for, among others, the Iris Plant data, comparing SVM, one-versus-one/one-versus-rest schemes, and ASVM, did not survive extraction.)*

### Conclusion and Future Research

Conclusion:

- Sampling bias and differences between the variances of the two classes have profound impacts on the performance of SVM.
- ASVM is much more efficient and robust under these situations than SVM.

Future research:

- Nonlinear generalization of ASVM.
- Explicit solution of the optimization problem under the proposed new loss function, and its performance compared with ASVM.

### Future Research (cont.)

- Numerical study of small-sample bias; distributions of order statistics:
  - equal variance but unequal sample size;
  - equal sample size but unequal variance;
  - unequal sample size and variance.
- Resampling methods based on non-parametric density estimation to generate data with larger sample sizes.
- Bias correction using resampling methods, e.g., the bootstrap.
### Additional Details on Kernels

### Kernels

Find a kernel $K$ such that $K(x_1, x_2) = \langle \Phi(x_1), \Phi(x_2) \rangle$. Computing $K(x_1, x_2)$ should be efficient, much more so than computing $\Phi(x_1)$ and $\Phi(x_2)$. Use $K(x_1, x_2)$ in the SVM algorithm rather than $\langle \Phi(x_1), \Phi(x_2) \rangle$. Remarkably, this is possible.

### An Example for $\Phi$ and $K$: the Polynomial Kernel with Degree 2

Suppose $\Phi$ is given as follows, for $x_1 = (x_{11}, x_{12})$ and $x_2 = (x_{21}, x_{22})$:

$$\Phi(x_1) = \big(x_{11}^2,\ x_{12}^2,\ \sqrt{2}\,x_{11} x_{12}\big), \qquad \Phi(x_2) = \big(x_{21}^2,\ x_{22}^2,\ \sqrt{2}\,x_{21} x_{22}\big).$$

An inner product in the feature space is

$$\langle \Phi(x_1), \Phi(x_2) \rangle = x_{11}^2 x_{21}^2 + x_{12}^2 x_{22}^2 + 2\, x_{11} x_{12} x_{21} x_{22} = \langle x_1, x_2 \rangle^2.$$

So if we define the kernel function $K(x_1, x_2) = \langle x_1, x_2 \rangle^2$, there is no need to carry out $\Phi$ explicitly. This use of a kernel function to avoid carrying out $\Phi$ explicitly is known as the kernel trick. Note that $\langle x_1, x_2 \rangle = x_{11} x_{21} + x_{12} x_{22}$.

### Examples of Kernel Functions

- Polynomial kernel with degree $d$: $K(x_1, x_2) = \langle x_1, x_2 \rangle^d$ (monomials of degree $d$); variation: $K(x_1, x_2) = (\langle x_1, x_2 \rangle + 1)^d$.
- Gaussian kernel: $K(x_1, x_2) = \exp\big(-\|x_1 - x_2\|^2 / 2\sigma^2\big)$ (radial basis functions).
- Sigmoid kernel: $K(x_1, x_2) = \tanh\big(\kappa \langle x_1, x_2 \rangle + \theta\big)$ (neural networks).

### Modification Due to Kernel Function

Change all inner products to kernel functions. For training:

Original:

$$\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t. } \sum_i \alpha_i y_i = 0,\ C \ge \alpha_i \ge 0,\ i = 1, \dots, N.$$

With kernel function:

$$\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \quad \text{s.t. } \sum_i \alpha_i y_i = 0,\ C \ge \alpha_i \ge 0,\ i = 1, \dots, N.$$

### Modification Due to Kernel Function (cont.)

For testing, the new datum $z$ is classified as class 1 if $f \ge 0$ and as class 2 if $f < 0$.

Original: $w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}$, so $f = w^T z + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j}\, x_{t_j}^T z + b$.

With kernel function: $w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} \Phi(x_{t_j})$, so $f = \langle w, \Phi(z) \rangle + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j}\, K(x_{t_j}, z) + b$.

### Example

Suppose we have five 1-D data points: $x_1 = 1$, $x_2 = 2$, $x_3 = 4$, $x_4 = 5$, $x_5 = 6$, with $\{1, 2, 6\}$ as class 1 and $\{4, 5\}$ as class 2, so $y_1 = 1$, $y_2 = 1$, $y_3 = -1$, $y_4 = -1$, $y_5 = 1$. We use the polynomial kernel of degree 2, $K(x, y) = (xy + 1)^2$, and set $C = 100$. We first find $\alpha_i$, $i = 1, \dots, 5$, by

$$\max_{\alpha}\ \sum_{i=1}^{5} \alpha_i - \frac{1}{2}\sum_{i=1}^{5}\sum_{j=1}^{5} \alpha_i \alpha_j y_i y_j (x_i x_j + 1)^2 \quad \text{s.t. } 100 \ge \alpha_i \ge 0,\ \sum_{i=1}^{5} \alpha_i y_i = 0.$$

By using a QP solver we get $\alpha_1 = 0$, $\alpha_2 = 2.5$, $\alpha_3 = 0$, $\alpha_4 = 7.333$, $\alpha_5 = 4.833$; note that the constraints are indeed satisfied.
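This five-point example can be reproduced with a generic solver. A sketch assuming scipy's SLSQP as the QP solver (the slides only say "a QP solver"); it minimizes the negated dual under the constraints $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le 100$:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])          # training points
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])        # labels
C = 100.0

K = (np.outer(x, x) + 1.0) ** 2                  # K(xi, xj) = (xi*xj + 1)^2
Q = np.outer(y, y) * K                           # Q_ij = yi yj K(xi, xj)

# Dual: maximize sum(a) - 0.5 a'Qa  <=>  minimize 0.5 a'Qa - sum(a).
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               np.zeros(5),
               jac=lambda a: Q @ a - np.ones(5),
               bounds=[(0.0, C)] * 5,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               method="SLSQP", options={"maxiter": 500, "ftol": 1e-12})
a = res.x
print("alpha =", np.round(a, 3))                 # slides report 0, 2.5, 0, 7.333, 4.833

# Recover b from a margin support vector (0 < alpha_i < C): yi * f(xi) = 1.
sv = int(np.argmax((a > 1e-6) & (a < C - 1e-6)))
b = y[sv] - (a * y) @ K[:, sv]
f = (a * y) @ K + b                              # discriminant at the training points
print("b =", round(float(b), 3), " f(x) =", np.round(f, 3))
```

The recovered $b$ should be close to 9, matching the closed form $f(y) = 0.6667\,y^2 - 5.333\,y + 9$ derived on the slides.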
indeed satisfied. The support vectors are {x2 = 2, x4 = 5, x5 = 6}. The discriminant function is
f(y) = 0.6667y² − 5.3333y + 9.
Here b is recovered by solving f(2) = 1, or by f(5) = −1, or by f(6) = 1, as x2, x4, x5 lie on yᵢ(wᵀΦ(xᵢ) + b) = 1; all three give b = 9.

Support Vector Machines
Kwok-Leung Tsui, Industrial & Systems Engineering, Georgia Institute of Technology. 3/7/2007.

Outline: Support Vector Machines; Support Vector Tree; Adjusted Support Vector Machines.

Separating Hyperplane
[Figure: scatter plot of two classes of documents with a separating plane; point labels unreadable in the source.]
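The worked polynomial-kernel example above can be checked numerically. A minimal sketch in Python, using the rounded α values reported by the QP solver:

```python
# Numerical check of the five-point 1-D example with K(x, y) = (x*y + 1)^2.
X = [1, 2, 4, 5, 6]
Y = [1, 1, -1, -1, 1]
alpha = [0.0, 2.5, 0.0, 7.3333, 4.8333]   # rounded QP solution from the slides
b = 9.0

def K(u, v):
    return (u * v + 1) ** 2

def f(z):
    # Discriminant: f(z) = sum_i alpha_i * y_i * K(x_i, z) + b
    return sum(a * y * K(x, z) for a, y, x in zip(alpha, Y, X)) + b

# Support vectors lie on y_i * f(x_i) = 1 (up to rounding of the alphas);
# expanding the sum reproduces f(y) ~ 0.6667*y^2 - 5.3333*y + 9.
for x, y in [(2, 1), (5, -1), (6, 1)]:
    assert abs(y * f(x) - 1) < 0.05
# Non-support points sit strictly outside their margins.
assert f(1) > 1 and f(4) < -1
```

The small tolerances absorb the rounding of α4 and α5 to four digits.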
Support Vector Machine (SVM): Simplest Case
A linear machine trained on separable data.

Pattern Classification
Training data: (xᵢ, yᵢ), i = 1, …, n, with xᵢ ∈ R^d and yᵢ ∈ {−1, 1}.

Pattern Classification
Decision boundary: w·x + b = 0. Classification rule: y = sign(w·x + b).

Which line is the best? Training and generalization.

Maximal Margin Separating Hyperplane
Optimization problem: max over (w, b) of C, subject to ‖w‖ = 1 and yᵢ(w·xᵢ + b) ≥ C. The margin is 2C. Intuitively: maximize the margin between the classes.

Maximal Margin Separating Hyperplane
To get rid of the constraint ‖w‖ = 1, we rephrase the optimization problem:
min over (w, b) of (1/2)‖w‖², subject to yᵢ(w·xᵢ + b) ≥ 1, i = 1, …, n.
This is a convex optimization problem: a quadratic objective with linear inequality constraints.

Support Vectors
S = {xᵢ : yᵢ(w·xᵢ + b) = 1}, and w = Σ_{i∈S} αᵢ yᵢ xᵢ. Classification rule: y = sign(Σ_{i∈S} αᵢ yᵢ xᵢ·x + b). The line depends only on a small number of training examples, which are called support vectors.

Geometric Margin
Maximal margin hyperplane: a hyperplane realizing the maximum geometric margin. According to a theorem from learning theory, of all possible linear decision functions, the one that maximizes the margin of the training set will minimize the generalization error. The optimal linear classifier forms the maximal margin hyperplane.

Geometric Margin
The Euclidean distance of an example (x, y) from the decision boundary is γ = y(⟨w, x⟩ + b)/‖w‖.

Geometric Margin and Hyperplane
Plus-plane: {x : w·x + b = +1}. Minus-plane: {x : w·x + b = −1}. Let's compute the margin M in terms of w and b. Let x⁻ be any point on the minus-plane and x⁺ the closest plus-plane point to x⁻. Then x⁺ = x⁻ + λw for some value of λ, and M = ‖x⁺ − x⁻‖.

Geometric Margin and Hyperplane
M in terms of w and b: w·x⁺ + b = 1 ⇒ w·(x⁻ + λw) + b = 1 ⇒ (w·x⁻ + b) + λ(w·w) = 1 ⇒ −1 + λ(w·w) = 1 ⇒ λ = 2/(w·w). Hence
M = ‖x⁺ − x⁻‖ = ‖λw‖ = λ‖w‖ = 2‖w‖/(w·w) = 2/‖w‖.

Maximal Margin Separating Hyperplane
To get rid of the constraint ‖w‖ = 1, we rephrase the optimization problem:
min over (w, b) of (1/2)‖w‖², subject to
yᵢ(w·xᵢ + b) ≥ 1, i = 1, …, n.
This is a convex optimization problem: a quadratic objective with linear inequality constraints.

Estimating the Maximum Margin Classifier
Now we need a way to search the space of w and b for the widest margin that matches all the data points: gradient descent, expectation-maximization, other numerical search algorithms — or quadratic programming.

Quadratic Programming
QP is a well-studied family of optimization algorithms for maximizing a quadratic function of real-valued variables subject to linear constraints: maximize c + dᵀu + (1/2)uᵀRu over u = (u1, …, um), subject to n linear inequality constraints aᵢᵀu ≤ bᵢ and possibly additional linear equality constraints.

Finding the Decision Boundary
Let {x1, …, xn} be our data set and let yᵢ ∈ {1, −1} be the class label of xᵢ. The decision boundary should classify all points correctly: yᵢ(wᵀxᵢ + b) ≥ 1 for all i. The decision boundary can be found by solving the constrained optimization problem
minimize (1/2)‖w‖², subject to yᵢ(wᵀxᵢ + b) ≥ 1, i = 1, …, n.
The Lagrangian of this optimization problem is
L = (1/2)‖w‖² − Σᵢ αᵢ(yᵢ(wᵀxᵢ + b) − 1), with αᵢ ≥ 0.

Constraint Optimization
Minimize f(x) subject to cᵢ(x) ≤ 0 for i = 1, 2, …, n. Theorem (KKT conditions for differentiable convex problems): a solution to the optimization problem with convex, differentiable f and cᵢ is given by x̄ if there exists some ᾱ ∈ Rⁿ with ᾱᵢ ≥ 0 for all i such that the following conditions are satisfied:
∂ₓL = ∂f(x̄) + Σᵢ ᾱᵢ ∂cᵢ(x̄) = 0 (saddle point in x);
cᵢ(x̄) ≤ 0 (saddle point in α);
Σᵢ ᾱᵢ cᵢ(x̄) = 0 (vanishing KKT gap).
(KKT = Karush–Kuhn–Tucker.)

Solutions
Construct and minimize the Lagrangian
L(w, b, α) = (1/2)‖w‖² − Σᵢ₌₁ᴺ αᵢ(yᵢ(wᵀxᵢ + b) − 1), subject to αᵢ ≥
0, i = 1, …, N.
By taking derivatives with respect to w and b and equating them to 0:
∂L/∂w = 0 ⇒ w = Σᵢ₌₁ᴺ αᵢ yᵢ xᵢ  (2) — the parameters are expressed as a linear combination of training points;
∂L/∂b = 0 ⇒ Σᵢ₌₁ᴺ αᵢ yᵢ = 0  (3).
KKT condition: αᵢ(yᵢ(wᵀxᵢ + b) − 1) = 0, i = 1, 2, …, N — only SVs will have nonzero αᵢ. The Lagrange multipliers αᵢ are called dual variables; each training point has an associated dual variable.

Example
[Figure: a two-class toy example showing the separating line and the αᵢ values; details unreadable in the source.]

Solutions
Thus w = Σᵢ αᵢ yᵢ xᵢ = Σ_{i∈SV} αᵢ yᵢ xᵢ. Plug Eqs. (2) and (3) back into the Lagrangian to obtain the dual formulation, which is solved for α using a QP solver:
maximize W(α) = Σᵢ₌₁ᴺ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ
subject to Σᵢ αᵢ yᵢ = 0 and αᵢ ≥ 0, i = 1, …, N.
The data enters only in the form of dot products. This is a quadratic programming (QP) problem, so a global maximum over the αᵢ can always be found. b does not appear in the dual, so it is determined separately from the initial constraints.

Classifying New Data Points
Once the parameters (α*, b*) are found by solving the quadratic optimization on the training set, the SVM is ready to classify new points. Given a new point x, its class membership is sign(f(x; α*, b*)), where
f(x; α*, b*) = w*·x + b* = Σᵢ₌₁ᴺ αᵢ* yᵢ xᵢ·x + b* = Σ_{i∈SV} αᵢ* yᵢ xᵢ·x + b*.
This is called a linear SVM (LSVM).

Solutions
The solution of the SVM (i.e., of the quadratic programming problem with linear inequality constraints) has the nice property that the data enters only in the form of dot products. Definition of the dot (inner) product: given x = (x1, x2, …, xn) and y = (y1, y2, …, yn), xᵀy = x1y1 + x2y2 + … + xnyn. This is nice because it allows us to make SVMs nonlinear without complicating the algorithm.

Characteristics of the Solution
Many of the αᵢ are zero, so w is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression, as in the construction of a k-nearest-neighbor (kNN) classifier. The xᵢ with
nonzero αᵢ are called support vectors (SV). The decision boundary is determined only by the SVs. Let tⱼ (j = 1, …, s) be the indices of the s support vectors; we can write w = Σⱼ₌₁ˢ α_tj y_tj x_tj. For testing with new data z: compute wᵀz + b = Σⱼ₌₁ˢ α_tj y_tj x_tjᵀz + b, and classify z as class 1 if the sum is positive and class 2 otherwise.

Overlapping Cases
It may be impossible to perfectly separate the two classes. Include an error term: instead of maximizing the margin, minimize error + 1/margin. Again, this involves only quadratic programming.

Linearly Non-Separable Case

Linearly Non-Separable Problems
We allow an error ξᵢ in classification.

Linearly Non-Separable Problems
We cannot find w and b that satisfy yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 for all i if a high noise level causes a large overlap of the classes. Introduce slack variables ξᵢ ≥ 0:
yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ for all i.
By making the ξᵢ large enough, the constraints can always be met; in order not to obtain the trivial solution where all ξᵢ take on large values, we penalize them in the objective function:
minimize ‖w‖²/2 + C Σᵢ ξᵢ, subject to yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i.

Linearly Non-Separable Problems
At the minimum, the slack variables satisfy ξᵢ = max(0, 1 − yᵢ(wᵀxᵢ + b)). Note that ξᵢ = 0 if there is no error for xᵢ, and Σᵢ ξᵢ is an upper bound on the number of errors.

Linearly Non-Separable Problems
C is a tradeoff parameter between error and margin — i.e., between the two conflicting goals of minimizing the training error and maximizing the margin. The optimization problem becomes
minimize (1/2)‖w‖² + C Σᵢ ξᵢ, subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0.

Linearly Non-Separable Problems
The dual of the problem is
maximize W(α) = Σᵢ₌₁ᴺ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ
subject to Σᵢ αᵢ yᵢ = 0 and C ≥ αᵢ ≥ 0, i = 1, …, N.
w is recovered as w = Σⱼ₌₁ˢ α_tj y_tj x_tj. This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the αᵢ. Once again, a QP solver
can be used to find the αᵢ.

Strengths of SVMs
Good generalization in theory; good generalization in practice; works well with few training instances; finds the globally best model; efficient algorithms; amenable to the kernel trick.

Extension to Non-linear Decision Boundary

Nonlinear Cases
Use another coordinate system such that the curve becomes a line.

Kernels
x → Φ(x). Only inner products xᵀy are involved in the calculation. Under certain conditions there exists a kernel K such that K(x, y) = Φ(x)ᵀΦ(y) — e.g., the polynomial of degree d, K(x, y) = (xᵀy + 1)^d. Replace xᵀy by Φ(x)ᵀΦ(y).

Extension to Non-linear Decision Boundary
So far we have only considered large-margin classifiers with a linear decision boundary. How can this be generalized to become nonlinear? Key idea: transform the xᵢ to a higher-dimensional space to "make life easier". Input space: the space where the points xᵢ are located. Feature space: the space of the Φ(xᵢ) after transformation. Why transform? A linear operation in the feature space is equivalent to a nonlinear operation in the input space, and classification can become easier with a proper transformation.

Nonlinear Mapping
[Figure: separation may be easier in higher dimensions — a nonlinear mapping lifts the data from a low-dimensional input space into a higher-dimensional feature space, where a hyperplane separates the classes.]

Examples
[Figure: a toy example of a binary classification problem mapped into feature space. The true decision boundary is an ellipse in input space (left panel); the task of the learning process is to estimate this boundary from empirical training points of both classes (crosses and circles, respectively). When mapped into feature space via the nonlinear map Φ(x) = (z1, z2, z3) = (x1², x2², √2·x1·x2) (right panel), the ellipse becomes a hyperplane — in the present
simple case, parallel to the z3 axis, hence all points are plotted in the (z1, z2) plane. This is due to the fact that ellipses can be written as linear equations in the entries of (z1, z2, z3); therefore, in feature space the problem reduces to estimating a hyperplane from the mapped data points. Note that via the polynomial kernel, the dot product in the three-dimensional space can be computed without computing Φ.]

Examples
[Figure: the value of the discriminant function over the input space, with regions for class 1 and class 2 and the f = 0 contour.]

Mapping into a New Feature Space
Φ: x → Φ(x); e.g., Φ(x1, x2) = (x1, x2, x1², x2², x1x2). Rather than running the SVM on the xᵢ, run it on the Φ(xᵢ): find a nonlinear separator in the input space. But what if Φ(xᵢ) is really big? Use kernels to compute it implicitly.

Transforming the Data
Input space → feature space. Computation in the feature space can be costly because it is high-dimensional — the feature space is typically infinite-dimensional. The kernel trick comes to the rescue.

The Kernel Trick
Recall the SVM optimization problem:
maximize W(α) = Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ
subject to Σᵢ αᵢ yᵢ = 0 and C ≥ αᵢ ≥ 0, i = 1, …, N.
The data points only appear as inner products. As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. Define the kernel function K by K(xᵢ, xⱼ) = Φ(xᵢ)ᵀΦ(xⱼ).

Idea of the Kernel Trick
The only way in which the data appears in the training problem is in the form of dot products xᵢ·xⱼ. First map the data to some other (possibly infinite-dimensional) space using a mapping Φ; the training algorithm then depends on the data only through dot products Φ(xᵢ)·Φ(xⱼ). If there is a kernel function K such that K(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ), we only need to use K in the training algorithm and never need to know Φ explicitly.

Why SVMs Work
The feature space is often very high-dimensional. Why don't we have the curse of dimensionality? A classifier in a high-dimensional
space has many parameters and is hard to estimate. Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of the classifier. Another view: the SVM loss function is analogous to ridge regression — the term (1/2)‖w‖² shrinks the parameters towards zero to avoid overfitting.

Choosing the Kernel Function
Probably the trickiest part of using an SVM. The kernel function is important because it creates the kernel matrix, which summarizes all the data. Many principles have been proposed: the diffusion kernel, the Fisher kernel, string kernels, information complexity, and so on. In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try.

Software
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html. Some implementations, such as LIBSVM, can handle multi-class classification. SVMlight is among the earliest implementations of SVM. Several MATLAB toolboxes for SVM are also available.

Summary: Steps for Classification
Prepare the pattern matrix. Select the kernel function to use. Select the parameters of the kernel function and the value of C — you can use the values suggested by the SVM software, or you can set apart a validation set to determine them. Execute the training algorithm and obtain the αᵢ. Unseen data can then be classified using the αᵢ and the support vectors.

Strengths and Weaknesses of SVM
Strengths: training is relatively easy — there are no local optima, unlike in neural networks; it scales relatively well to high-dimensional data; the tradeoff between classifier complexity and error can be controlled explicitly; non-traditional data such as strings and trees can be used as input to SVM instead of feature vectors. Weaknesses: the need to choose a "good" kernel function.
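The kernels discussed in "Choosing the Kernel Function" can be written directly as small functions; a quick sanity check of the properties the kernel trick relies on (symmetry, and K(x, x) = 1 for the Gaussian/RBF kernel):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, d=2):
    # (<u, v> + 1)^d variant of the polynomial kernel
    return (dot(u, v) + 1) ** d

def rbf_kernel(u, v, sigma=1.0):
    # exp(-||u - v||^2 / (2 sigma^2))
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2 * sigma ** 2))

xs = [(0.0, 1.0), (1.0, 2.0), (-1.5, 0.5)]
for u in xs:
    assert rbf_kernel(u, u) == 1.0                                  # self-similarity
    for v in xs:
        assert abs(poly_kernel(u, v) - poly_kernel(v, u)) < 1e-12   # symmetry
        assert 0.0 < rbf_kernel(u, v) <= 1.0
```

Both functions compute valid Mercer kernels, so either could be dropped into the kernelized dual above in place of the plain dot product.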
The Support Vector Classifier
[Figure 12.2: The support vector boundary for the mixture data example with two overlapping classes, for two values of the tuning parameter C. The broken lines indicate the margins, where f(x) = ±1. The support points (αᵢ > 0) are all the points on the wrong side of their margin; the black solid dots are those support points falling exactly on the margin. In the left panel 62% of the observations are support points; in the right panel, 85%.]

Support Vector Machines
The solution function is f(x) = Σᵢ αᵢ yᵢ K(x, xᵢ) + β0.
[Figure 12.3: Two nonlinear SVMs for the mixture data — a degree-4 polynomial kernel in feature space (left) and a radial kernel in feature space (right), each annotated with its training error, test error, and the Bayes error. In each case C was tuned to approximately achieve the best test-error performance, and C = 1 worked well in both cases. The radial-basis kernel performs the best (close to Bayes optimal), as might be expected given that the data arise from mixtures of Gaussians.]

Support Vector Tree
Simultaneously realizing the principles of maximal margin and maximal purity. X. Huo, J. Chen, S. Wang, K.-L. Tsui, School of Industrial and Systems Engineering, Georgia Institute of Technology.

Support Vector Tree (SVT)
SVT = SVM + classification tree: maximal margin and maximal purity. Maximal margin gives better separation; maximal purity gives a pure descendant node.

Type I: Topology
Binary tree partition. At each non-terminal node, use SVM to find the class most separable from the others, and the corresponding margin-maximizing separating hyperplane.

Algorithm
[The Type I algorithm statement is garbled in this extraction. At each node it solves, for each candidate class, a soft-margin SVM of the form min (1/2)‖w‖² + C Σᵢ ξᵢ separating that class from the rest, selects the most separable class, and recurses on the remaining data until the leaves are pure.]
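The Type I node split can be sketched as follows. The per-class "margin" below is a hypothetical centroid-based stand-in for the SVM margin (a full solve would use the QP above), and `most_separable_class` is an illustrative helper, not code from the slides:

```python
# Sketch of the Type-I split: pick the class most linearly "peelable" from the rest.

def centroid(points):
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]

def one_vs_rest_margin(X, labels, cls):
    pos = [x for x, l in zip(X, labels) if l == cls]
    neg = [x for x, l in zip(X, labels) if l != cls]
    cp, cn = centroid(pos), centroid(neg)
    w = [a - b for a, b in zip(cp, cn)]                 # direction between centroids
    norm = sum(wi * wi for wi in w) ** 0.5
    w = [wi / norm for wi in w]
    mid = [(a + b) / 2 for a, b in zip(cp, cn)]
    bias = -sum(wi * mi for wi, mi in zip(w, mid))

    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + bias

    # Worst-case signed distance: positive iff the class separates cleanly.
    return min(min(score(x) for x in pos), min(-score(x) for x in neg))

def most_separable_class(X, labels):
    return max(set(labels), key=lambda c: one_vs_rest_margin(X, labels, c))

# Toy data: class 'a' sits far from classes 'b' and 'c', which overlap,
# so a Type-I tree would peel 'a' off first.
X = [(0.0,), (0.2,), (5.0,), (5.4,), (5.2,), (5.6,)]
labels = ['a', 'a', 'b', 'b', 'c', 'c']
```

Replacing `one_vs_rest_margin` with an actual SVM margin (2/‖w‖ from the one-vs-rest solve) recovers the per-node search the slides describe.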
Type II: Topology
Binary tree partition. At each non-terminal node, group the data into two super-classes, use SVM to find the two most separable super-classes, and take the corresponding margin-maximizing separating hyperplane.

Algorithm
[The Type II algorithm statement is likewise garbled in this extraction; it assigns the original classes to two super-classes, solves the soft-margin SVM for the super-class split, and recurses until the leaves are pure.]

Experiments: Data Summary
Problem    samples  classes  attributes
balance    625      3        4
car        1728     4        6
glass      214      7        9
iris       150      3        4
segment    2310     7        19
waveform   600      3        22
wine       178      3        13

Experiments: SVT vs. One-versus-One
Problem    SVT error  SVT #SVs  1-vs-1 error  1-vs-1 #SVs
balance    0.0957     113       0.0957        117
car        0.1291     371       0.1310        387
glass      0.3231     112       0.3231        109
iris       0          15        0             16
segment    0.0375     193       0.0331        210
waveform   0.1657     253       0.1547        252
wine       0.0185     23        0.0185        21

Experiments: SVT vs. CART
Problem    SVT size  SVT error  CART size  CART error  CART size  CART error
balance    3         0.0957     3          0.3564      19         0.1968
car        5         0.1291     5          0.1734      67         0.025
glass      7         0.3231     7          0.3846      15         0.2615
iris       3         0          3          0.1087      4          0.0217
segment    7         0.0375     7          0.134       50         0.0346
waveform   3         0.1657     3          0.2652      14         0.2376
wine       3         0.0185     3          0.1296      4          0.0556

Experiments
Conditional probabilities of classes 1, 2, and 3:
f1(x) = 0.97·exp(−3x), f2(x) = exp(−2.5(x − 1.2)²), f3(x) = 1 − f1(x) − f2(x).
Training: 1,000 points; test: 10,000 points.
[Figure: the three conditional class probabilities with the resulting Bayes and SVT decision boundaries.]

Discussion
SVT is a hierarchical representation of a partition of the state space; it is easy to understand and interpret. Asymptotically, SVT converges to the Bayes classifier under certain conditions.

Problems
The computation is expensive: for a K-class problem we need to run the SVM K times to find the most separable class and the corresponding optimal hyperplane. A quicker, simpler algorithm is wanted to find the most separable class; the SVM is then used to find the optimal hyperplane.

Adjusted Support Vector Machines
S. Wang, W. Jiang, K.-L. Tsui, School of Industrial and Systems Engineering, Georgia Institute of Technology; Stevens Institute of Technology.

Support Vector Machine (SVM)
Given data (xᵢ, yᵢ), i = 1, …, n, with xᵢ ∈ R^d and yᵢ ∈ {−1, 1}, SVM finds a separating hyperplane xᵀβ + β0 = 0 that maximizes the margin of separation between the two classes 1 and −1.

Support Vector Machine
Mathematically speaking, SVM solves the following
optimization problem:
min (1/2)‖β‖², subject to yᵢ(xᵢᵀβ + β0) ≥ 1, i = 1, …, n.
SVM uses a loss function of the form
L(y, f(x)) = 1 − y·f(x) if y·f(x) < 1, and 0 if y·f(x) ≥ 1,
which penalizes a point only if it falls on the wrong side of its boundary. This loss function is referred to as the "hinge loss", and points on the wrong side of their boundaries are referred to as "support vectors".

Concerns
The maximum-margin separating hyperplane strongly depends on the support vectors. Support vectors are often only a few points near the boundaries, and hence can vary dramatically with the training sample, so the SVM solution is unstable — especially when the sample size is small. Moreover, the maximum-margin separating hyperplane is equidistant to the two classes, but the number of training points and the dispersion of the training points in each class may be different. SVM may therefore perform poorly when there exist sampling bias and large differences in class variances.
[Figure: two classes with very different numbers of observations; the SVM solution deviates visibly from the optimal solution.]

Other Loss Functions for Variants
Lee and Mangasarian (2001) and Shen, Tseng, Zhang, and Wong (2003) proposed modified loss functions. All of these loss functions still penalize only points with y·f(x) < 1, and thus share the same concerns as SVM: they are sensitive to the training sample and perform poorly under certain situations.

New Loss Function
The new loss function has the form
L(y, f(x)) = 1 − y·f(x) if y·f(x) ≤ 0, and 1/(1 + y·f(x)) if y·f(x) > 0.
It penalizes all training data points, and therefore implicitly considers the number of training points as well as their dispersion in each class. It is smooth and strictly monotone decreasing in y·f(x); it is equal to the hinge loss when y·f(x) ≤ 0 and greater than the hinge loss when y·f(x) > 0.

New Loss Function
Under the new loss function,
the optimization problem for SVM becomes
min (1/2)‖β‖² + C Σᵢ [ (1 − yᵢf(xᵢ))·I(yᵢf(xᵢ) ≤ 0) + (1/(1 + yᵢf(xᵢ)))·I(yᵢf(xᵢ) > 0) ],
where I(·) is an indicator function. The solution of this optimization problem is expected to be robust against support vectors, as the new loss function penalizes all training data points.
Theorem 1: Under certain regularity conditions (Vapnik, 1998), the solution of the optimization problem approaches the Bayes classifier as the sample size approaches infinity.

Adjusted Support Vector Machine (ASVM)
To maintain the nice properties of SVM, such as easier nonlinear generalization with the "kernel trick": first, run SVM to find the maximum-margin separating hyperplane (β, β0) and the boundaries for each class; next, adjust the location b of the maximum-margin separating hyperplane within the two boundaries by minimizing the new loss Σᵢ L(yᵢ, xᵢᵀβ + b) over b, subject to b staying between the two class boundaries.

A One-Dimensional Example
Compare the generalization performance of ASVM and SVM. Training sample: class 1 points randomly drawn from N(0, σ1²) with σ1 = 1, and class 2 points randomly drawn from N(1, σ2²), with n2 varying from 100 down to 30. Tuning parameters C1 and C2 are chosen using a hold-out sample of size 10,000 (5,000 from class 1 and 5,000 from class 2). Testing sample: 20,000 observations, with 10,000 from class 1 and 10,000 from class 2. For each (n2, σ2), run 100 simulations and calculate the average and standard deviation of the error rates.
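The new loss and the intercept-adjustment step can be sketched concretely. The 1/(1 + t) branch and the ±1 search window for b are reconstructions of formulas that are garbled in the source, and the toy data are hypothetical:

```python
# Sketch of the ASVM pieces above (reconstructed forms, illustrative data).

def hinge(t):                 # t = y * f(x)
    return max(0.0, 1.0 - t)

def new_loss(t):
    return 1.0 - t if t <= 0 else 1.0 / (1.0 + t)

# Properties claimed on the slide: equal to the hinge for t <= 0,
# strictly above it (and positive) for t > 0, strictly decreasing everywhere.
ts = [i / 10.0 - 3.0 for i in range(61)]
assert all(new_loss(t) == hinge(t) for t in ts if t <= 0)
assert all(new_loss(t) > hinge(t) for t in ts if t > 0)
assert all(new_loss(a) > new_loss(b) for a, b in zip(ts, ts[1:]))

def total_loss(X, Y, w, b):
    return sum(new_loss(y * (w * x + b)) for x, y in zip(X, Y))

def adjust_b(X, Y, w, b0):
    """Grid-search b near the SVM intercept b0, keeping the direction w fixed."""
    grid = [b0 - 1.0 + 2.0 * k / 200 for k in range(201)]
    return min(grid, key=lambda b: total_loss(X, Y, w, b))

# 1-D toy data: a tight class around 0 and a dispersed class around 3.
X = [-0.1, 0.0, 0.1, 1.5, 3.0, 4.5]
Y = [1, 1, 1, -1, -1, -1]
w, b0 = -1.0, 1.0             # hypothetical SVM solution f(x) = -x + 1
b_adj = adjust_b(X, Y, w, b0)
```

Because every point contributes to `total_loss`, the adjusted intercept accounts for class sizes and dispersion rather than just the nearest support vectors.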
0015 0184 0032 0515 01609 01586 01503 01506 0152 111158 01557 01573 01603 0032 00053 0002 00024 00012 00009 00015 00021 00020 81 I l i l i ii i ll l I 372007 A Ine Dimensienal Example I hseruatiens When the variances ef39 the twe classes are eernparableg the perferrnanee elf ASVM is very eempetitive te that ef SVM When the warianees ef the twe classes are ednsiderably diflrent ASVM eutperferms SVM ne rnatter sampling bias exists er net The larger the difference between and the better ASVM perfern39is than SVMF especially when sampling bias is severe ASVM is mere rebust against sampling bias than SUM ASVM is much mere stable than SUM The standard deviatien ef the errer rate frern ASVM is n1 Lieh less than that ef the errer rate frern SVM when the varianees ef the twe Classes are eensiderably different 82 372007 Experimental Camparisons Real quotata Twa examples fram UCI Machine Laarning REF SlI I website Wisaansin Breast Cancer PIMA Indian Diabetes Tan fald crass ualidatian was used had smaller arrar rats than canuantianal SWIM in bath examples Ava ra gs Er rar Flats Datasst Examplra Size af Class at In put Variables SUM tSVM Eu39slistansin Breast Eanaar g 2 Q 303 IlleE F39IMA Indian Diabetes 532 2 T 32133 ECIIEE 83 I i i i H i It i I Conclusien and Future Research an inclusion Seinplin bias and differenee between variances ef twe cleeeee have prefeu nd impacts en the perfennenee ef ASVM is mueh mere efficient and rehuet under these eituetiene then SVM i Future TEEEEITCh Nenlineer enerelizetien ef ASVM Explieit eelutien ef the eptirnizetien prehlem under the prepeeed new lese funetiem and ite perfennenee eempered with ASVM 372007 84 372007 Kernels Additional Details on Kernels 85 Kernels Find kernel K such that KX1 X2 lt 1301 X2gt Computing Kx1x2 should be efficient much more than computing cDx1 and cDx2 Use Kx1x2 in SVM algorithm rather than ltX1 X2gt Remarkably this is possible 372007 86 An Example for f and K the Polynomial Kernel with degree 2 Suppose is given 
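Finally, the soft-margin objective from the earlier slides can be made concrete with a small self-contained sketch. This full-batch subgradient approach is an alternative to the QP/dual route the slides describe, and the toy data are hypothetical:

```python
# Full-batch subgradient descent on (1/2)||w||^2 + (C/n) * sum of hinge losses --
# a sketch of the soft-margin objective, not the QP solver used in the slides.

def train_soft_margin(X, Y, C=10.0, eta=0.05, epochs=300):
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0                       # gradient of the (1/2)||w||^2 part
        for x, y in zip(X, Y):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:                          # hinge subgradient is active
                for j in range(d):
                    gw[j] -= C * y * x[j] / n
                gb -= C * y / n
        w = [wj - eta * gj for wj, gj in zip(w, gw)]
        b -= eta * gb
    return w, b

# Linearly separable toy data.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (-1.0, -1.0), (-1.5, -2.0), (-2.0, -1.0)]
Y = [1, 1, 1, -1, -1, -1]
w, b = train_soft_margin(X, Y)
pred = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in X]
```

On separable data like this, the iterates settle into a separator whose active points (those with margin < 1) play the role of the support vectors in the QP solution.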
