Computational Data Analy
Computational Data Analy CSE 6740
Popular in Course
Popular in Computational
verified elite notetaker
This 0 page Class Notes was uploaded by Dell Spencer IV on Monday November 2, 2015. The Class Notes belongs to CSE 6740 at Georgia Institute of Technology - Main Campus taught by Alexander Gray in Fall. Since its upload, it has received 24 views. For similar materials see /class/234227/cse-6740-georgia-institute-of-technology-main-campus in Computational at Georgia Institute of Technology - Main Campus.
Reviews for Computational Data Analy
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 11/02/15
CSE 6740 Lecture 19 How Do I Optimize With Constraints Constrained Optimization Alexander Gray agraycc A gatech edu Georgia Institute of Technology A CSE 6740 Lecture 19 p14t Quiz Answers 7 7 1 A convex function has a unique global minimum T 2 Stochastic gradient descent works one data point at a time T L J CSE 6740 Lecture 19 p24t Today 7 7 1 Unconstrained Optimization LatentVariable 2 Constrained Optimization L J CSE 6740 Lecture 19 p34t Unconstrained Optimization LatentVariable The EM algorithm a form of bound optimization Mixture of Gaussians rRecall the mixture of Gaussians model whose hidden variable is the class label PCk 7 Zwk1 1 k fXlCk NMk72i 2 CSE EMU Lecture 19 7 p 54 Mixture of Gaussians 7 rRecall the mixture of Gaussians model whose hidden variable is the class label PCk wk Zwk1 3 k fXlCk NWImZi 4 K K fX Z XlC WHO I k ZMNWImZi 5 k1 k1 L J CSE 6740 Lecture 19 p64l Mixture of Gaussians 7R 7 eca Bayes rule which gives fxlC kPC k f as 39 This value is the probability that a particular component is was responsible for generating the point x and satisfies 251 PC kx 1 We ll use as a shorthand PC 6 wik E 7 L J CSE 6740 Lecture 19 p74l Mixture of Gaussians 7W 7 en consider a simplified case where the covariances are fixed to be diagonal with all dimensions equal 2k 03 so 2 a mm 8 f930k NHl 72k WGXP 20g and K 1 llx Mkll2 f a E 7T Xp 9 1 27T0iD2 20 L J CSE 6740 Lecture 19 p84 Maximum Likelihood Identi ability rW e want to find the parameters 6 7 uk 0k which maximize the likelihood 7 N W H fem 1o i1 or map mp Unfortunately there exist parameter settings for which the likelihood goes to infinity for example when one of the Gaussian components collapses onto one of the data points Also there may be several parameter settings with identical likelihoods We ll just ignore all this and proceed because it turns out to be fine in practice L J CSE 6740 Lecture 19 p94t Minimizing the Negative Loglikelihood VI 7 t is equivalent to minimize the negative loglikelihood N 2 10ng Zlogf9X7 M i1 N K Zlog ZfOCILIC kPC kgt 2gt i1 kzl Since this error function is a smooth differentiable function of the parameters we can employ its derivatives to perform unconstrained optimization on it L J CSE 6740 Lecture 19 p104i Minimizing the Negative Loglikelihood VF 7 or the centers Mk we obtain N 3E Mk 977 I w 13 8M k 0k and for the Oh we obtain N 3E D llxz39 Mk2 E O 14 50k W 0k 3 039 i1 k L J CSE 6740 Lecture 19 p114t Minimizing the Negative Loglikelihood l Optimizing for the mixing parameters 7 must be done subject to the constraints K Zak 1 15 k1 0 S 7 S 1 16 L J CSE 6740 Lecture 19 p124t Minimizing the Negative Loglikelihood rThis can be done by representing the mixing parameters in 7 terms of a set of auxiliary variables wk such that GXPWk O 251 expm Recall that this is the logistic or softmax transformation It ensures for 00 g wk g 00 that the constraints on 7 hold 17 L J CSE 6740 Lecture 19 p134l Minimizing the Negative Loglikelihood WU 7 m1 j k 7Tij 18 tilizing 87k and the chain rule consequence 3E K 3E aw 19 a yk Fla mam we can obtain N 8E 20 8 Em 7 L J CSE 6740 Lecture 19 p144i Minimizing the Negative Loglikelihood rNote that at this point we are armed with derivatives so we7 can use standard optimizers The EM algorithm does not actually require these derivatives in its final form though we will consider these derivatives in our derivation which gets us there L J CSE 6740 Lecture 19 p154l Conditions at the Optimum 7 lt is insightful to see what the maximum likelihood solutions look like when these derivatives are zero we have A wikxq Mk 21 Ziwik 1 2 a Zzwzkllxz Hkll 22 D A 1 77k Nszk 23 Z which represents the intuitively satisfying result that they are the usual mean and standard deviation where the points are weighted by the posterior probabilities of being Lgenerated by each component A CSE 6740 Lecture 19 p164l EM Algorithm for Mixture of Gaussians l These are the final update equations ledx k Z HEGW Z Z 1d 2 013new i wgk Mgewll 25 1d D 1 wgew N Z wggd 26 7L L J CSE 6740 Lecture 19 p174t EM Recurrence Idea rW e can write the change in error in the form 7 Enew EOld foldltxi 2k PewIO kgtPneWltO k P01dlt0 kiwi 10g fewest P01dlt0 way where the last factor is simply the identity L J CSE 6740 Lecture 19 p184i EM Bounding Idea VT 7 0 proceed we make use of Jensen s inequality which states that given a set of numbers Ah 2 0 such that 2k Ak 1 for any numbers 2k log 2 AM 2 Z M logzk 29 k k Since the probabilities P01dO kx in the numerator have these properties they can play the role of the M This yields Enew Eold S 30 fnew c mpnewc k old x7 0 O L 1 g PoldC klxif01dxi 31g CSE 6740 Lecture 19 p194t EM Bound Minimization rW e wish to minimize Enew with respect to the new 7 parameters If we let Q be the righthand side of the previous equation then we have Enew S Eold Q 32 and so E0101 Q represents an upper bound on the value of Ene We can therefore seek to minimize this bound with respect to the new values of the parameters By construction this will necessarily lead to a decrease in the value of Enew unless Enew is already at a local minimum L J CSE 6740 Lecture 19 p204l EM Bound Minimization 7 lfwe now drop terms which depend only on the old parameters we can replace Q by 63 Z Zpddw kx1ogf ewxo mp eww k 7 k 33 The smallest value forthe upper bound is found by minimizing Q with respect to the new parameters by finding the derivatives for them and setting them to zero For the uk and 0 this is straightforward but for the mixing parameters 7 we again have to account for the constraint 2k P 6WC k 1 L J CSE 6740 Lecture 19 p214l EM Bound Minimization 7W 7 e can do this by introducing a Lagrange multiplier A and minimizing the function Q A PneWC k 1 34 k Setting the derivative of PneWO k to zero we obtain P01dC km Z Pnew0k A O 35 After some manipulation of this it can be found that A N L J CSE 6740 Lecture 19 p224t EM Algorithm for Mixture of Gaussians rW HGW Mk cam new 7 e arrive at the following update equations 2239 wglidxi old 36 2239 wzk i E wfzidl I132 M ewl 2 37 m D 2239 1 N Z wfgd 38 i The expectation step E step consists of evaluating the conditional probabilities wZk using the last values of the parameters The maximization step Mstep consists of the updates to the parameters which move toward the localJ aximum CSE 6740 Lecture 19 p234l EM Improvements rEM is based on first derivatives and has linear convergenceT It is good away from the optimum but slow near it We can use the idea of deterministic annealing analogous to simulated annealing to effectively smooth out the likelihood surface so that the steps are only sensitive to major wells We can also make a hybrid method which switches to line search near the minimum L J CSE 6740 Lecture 19 p244i EM Generalizations rA generalized EM algorithm simply takes a step which 7 improves the likelihood but doesn t necessarily maximize the Q function The EM algorithm is an instance of a more general idea variously called optimization transfer bound optimization block relaxation methods and iterative majorization L J CSE 6740 Lecture 19 p254l rConvex Optimization Problems Problems with a convex objective function and if there are constraints convex constraints Linear Programming 7 rWhen the objective and constraint functions are all affine the problem is called a linearprogram LP which has the form Find 513 arg mianRD 0T3 d 39 subject to Ga g h 40 A51 b 41 L J CSE 6740 Lecture 19 p274l Quadratic Programming 7 rWhen the objective function is quadratic and the constraint functions are affine the problem is called a quadratic program QP which has the form Find 93 arg minmERD aTP93 1 r 42 subject to Ga g h 43 A93 b 44 L J CSE 6740 Lecture 19 p284l ratically Constrained Quadratic Progran fthe constraints are also quadratic the problem is called a T quadratically constrained quadratic program QCQP Find 93 arg minmERD aTP93 1 r 45 subject to aTP7a qffx 7 g 0 z39 1 M 46 A51 b 47 L J CSE 6740 Lecture 19 p294l Secondorder Cone Programming 7 WA closely related problem is called a secondorder cone program SOCP which has the form Find 513 arg minmeRD fTa 48 subject to A xb 2 g c fordi z391M 49 Fa g 50 A constraint of this form is called a secondorder cone constraint L J CSE 6740 Lecture 19 p304l Geometric Programming rA geometric program GP is a problem of the form Find 513 erg minmERD fa subject to 315 g 1 z 1 M d75131 2 17N where f and the 7 have the form 10g ealTkHbi C k and the di have the form egz39THhi L 51 52 53 54 55 J CSE 6740 Lecture 19 p314t Semide nite Programming U semidefinite program SDP has the form Find 513 arg minmeRD CTCU 56 subject to 1171 ann G g 0 57 A51 b 58 L J CSE 6740 Lecture 19 p324i rConvex Optimization Methods The interiorpoint method Lagrangian Duality T rConsider again the general form for an optimization problem Find 513 arg mianRD 59 subject to 315 2 O z391M 60 d7a0 z391N 61 L J CSE 6740 Lecture 19 p344i Lagrangian Duality 7W 7 e can rewrite the original problem as the unconstrained optimization problem M N Find 93 argxr lligrllj fx 21040493 Zlmwzm 62 where 00 is the indicatorlike function which takes the value 0 ifthe constraint function is satisfied by a and 00 otherwise L J CSE 6740 Lecture 19 p354l Lagrangian Duality W rThe basic idea of Lagrangian duality is to soften the problem by replacing 00C7a by Acx and 00d7a by 77dx where the A and 77 are positive weights obtain the Lagrangian of the problem M N Mac A 77 f as Atoms meltac 63 The A and 77 are called the Lagrange multipliers and the A and 77 vectors are called the dual variables or Lagrange multiplier vectors associated with the problem L J CSE 6740 Lecture 19 p364l Lagrangian Duality rW e define the dual function g as the minimum value of the Lagrangian over x 7 90M 77 igf x A 77 64 Since the dual function is the pointwise infimum of a family of affine functions of A 77 it is concave even if the original problem is not convex The dual function yields a lower bound on the optimal value g 77 g 513 L J CSE 6740 Lecture 19 p374l Lagrangian Duality T rThe dual function value g 77 is its optimal value over m Now we d like to find the best lower bound that can be obtained from the dual function Find X 77 arg 1r1r1a3gtltv77Lq7 77 65 subject to A 2 O 66 This is called the dual problem while the original problem is called the primal problem This is always a convex optimization problem because the objective function is concave and the constraint is convex even if the primal problem is not convex L J CSE 6740 Lecture 19 p384l Lagrangian Duality lilfthe primal problem is convex usually maximizing the dual1 problem is the same as minimizing the primal problem We call this strong duality Suppose we have strong duality and all the functions are differentiable Since 513 minimizes La X 77 over x it follows that its gradient is zero at 513 M N Vfa Z A V V O 67 L J CSE 6740 Lecture 19 p394l Optimality Conditions 7 rThus we have these conditions which must hold at the optimum 2493 0 68 d x 0 69 A 2 0 7o Aim 0 71 M N VfaZkfvcixnfvci x O 72 These are called the KarushKuhnTucker KKT conditions They are necessary and sufficient at the optimum We can Lthus formulate optimization as solving these equations A CSE 6740 Lecture 19 p404l Logarithmic Barrier Idea 7W 7 e ll now do something related but slightly different with the constraints We rewrite the general problem as Find 513 arg mianRD fa 00c7a 73 subject to Ax b 74 We approximate the indicator function by A 1 1a 10g u 75 where t is a parameter which increases the accuracy of the A approximation as it increases u goes to 00 as u increases to zero but is differentiable and convex L J CSE 6740 Lecture 19 p414l Logarithmic Barrier Idea rW e call the function 7 M W log c x 76 the logarithmic barrier for the problem We ll optimize ax was The barrier method solves a sequence of such problems each of which is convex increasing t on each iteration using Newton s method The solution xt is the starting point for the next value of t It can be shown that the error for each iteration is bounded Lby Mt and thus the error goes to zero CSE 6740 Lecture 19 p424l InteriorPoint Method VT 7 he barrier method is an example of an interiorpoint method Given feasible x t gt O u gt O tolerance e gt 0 repeat 1 Find xt by using Newton s method to minimize tf gb subject to Ax b starting at x 2 a 3 Quit if Mt lt e 4 25 mi Methods called phase methods are used to choose the starting x L J CSE 6740 Lecture 19 p434i InteriorPoint Method 7 rThe interiorpoint method can be used for all of the constrained convex optimization problems In practice despite convergence analysis which relates the number of iterations to M it always takes about 1020 iterations There is a modification of the barrier method called the primaldual interiorpoint method which is often a bit faster and is what is used in practice L J CSE 6740 Lecture 19 p444l SVM Quadratic Program 7 rRecall that the formulation of the support vector machine results in a quadratic program Though generally effective interiorpoint is not sufficient to handle the large number of variables and constraints in an SVM problem An idea called chunking is often used for largescale convex optimization problems which solves smaller problems using subsets of the constraints while adding more constraints until all of them are satisfied L J CSE 6740 Lecture 19 p454l SMO Extreme Chunking rThe sequential minimization optimization algorithm was 7 developed specifically for SVM s It is a special case of chunking which considers only two constraints at a time the minimum number While it can be shown to converge it is based on several heuristic steps Empirically it is much more efficient than interiorpoint or general chunking The exact reason it has an advantage in this setting has not yet been elucidated L J CSE 6740 Lecture 19 p464i Main Things You Should Know r Interpretation of the EM algorithm in terms of bound 7 optimization The different types of constrained optimization problems o The idea of the interior point method L J CSE 6740 Lecture 19 p474t