This 212-page set of class notes was uploaded by Camryn Rogahn on Tuesday, October 20, 2015. The notes are for MATH693A at San Diego State University, taught by P. Blomgren in Fall. Since its upload, it has received 29 views.
Corrections to the First Printing of Nocedal-Wright

To make it easier for the typesetter, we have in many cases included an explicit correction of the offending passage, marked by the tag "Corrected sentence" (or similar), so that he/she can cut and paste from the LaTeX source of this document into the LaTeX file for the book. Note that in this document the labels of equations appearing in the corrected sentences may not print, or may print with the wrong equation numbers, because they use labels defined elsewhere in the book file. When inserted in the book file they will produce the correct labels.

page 24, equation above (2.15): The index of the Hessian should be $k$, not $k+1$. Corrected sentence: When $x_k$ and $x_{k+1}$ lie in a region near the solution $x^*$ within which $\nabla^2 f$ is positive definite, the final term in this expansion is eventually dominated by the $\nabla^2 f(x_k)(x_{k+1}-x_k)$ term, and we can write (...).

page 25: There are some errors in the sentence that includes (2.20). Corrected sentence: In fact, the equivalent formula for (...), applied to the inverse Hessian approximation $H_k = B_k^{-1}$, is (...).

page 26: There are spaces before two commas in the last paragraph before the new subsection.

page 37, first and second lines: Italicize "inexact" instead of "line search".

page 46: We need to introduce a new equation, since some of the references to (3.20) are not correct. Corrected paragraph: For some algorithms, such as conjugate gradient methods, we will not be able to prove the limit (...), but only the weaker result
  $\liminf_{k \to \infty} \|\nabla f_k\| = 0.$  (1)
In other words, just a subsequence of the gradient norms $\|\nabla f_k\|$ converges to zero, rather than the whole sequence (see Appendix ...). This result, too, can be proved by using Zoutendijk's condition (...), but instead of a constructive proof we outline a proof by contradiction. Suppose that (1) does not hold, so that the gradients remain bounded away from zero; that is, there exists $\gamma > 0$ such that $\|\nabla f_k\| \ge \gamma$ for all $k$ sufficiently large. Then from (...) we conclude that $\cos\theta_k \to 0$; that is, the entire sequence $\{\cos\theta_k\}$
converges to zero. To establish (1), therefore, it is enough to show that a subsequence $\{\cos\theta_{k_j}\}$ is bounded away from zero. We will use this strategy in Chapter (...) to study the convergence of nonlinear conjugate gradient methods. By applying this proof technique, we can prove global convergence in the sense of (...) or (1) for a general class of algorithms.

page 49, last sentence before the new section: Theorem 3.4, not 3.3. Corrected sentence: For example, if $\kappa(Q) = 800$, $f(x_1) = 1$, and $f(x^*) = 0$, Theorem (...) suggests that the function value will still be about 0.08 after one thousand iterations of the steepest-descent method.

page 78, line 12: Add the words "$p^*$ is feasible and" after the words "if and only if". Corrected phrase: if and only if $p^*$ is feasible and there is a scalar $\lambda \ge 0$ such that the following conditions are satisfied:

page 79, equation (4.21): Minus signs are missing. Corrected equation:
  $p(\lambda) = -\sum_{j=1}^{n} \frac{q_j^T g}{\lambda_j + \lambda}\, q_j.$

page 90, last line: $\lambda_1 \ge \bar\lambda$ should be $\lambda_1 \le \bar\lambda$. Corrected sentence: We now derive a bound on the right-hand side that holds for all sufficiently small values of $\lambda_1$, that is, for all $\lambda_1 \le \bar\lambda$, where $\bar\lambda$ is to be determined.

page 91, lines 7-10: There is a missing parenthesis after (...), and the $\le$ in the displayed equation should be $\ge$. Corrected passage: Therefore $\rho_k > \frac14$, and so by the workings of Algorithm (...) we have $\Delta_{k+1} \ge \Delta_k$ whenever $\Delta_k$ falls below the threshold $\bar\Delta$. It follows that reduction of $\Delta_k$ by a factor of $\frac14$ can occur in our algorithm only if $\Delta_k \ge \bar\Delta$.

page 118: The last term of equation (5.37) is wrong; likewise for the equation two lines later. Corrected lines: The quadratic (...) defined by (...) is transformed accordingly to (...). If we use Algorithm (...) to minimize (...), or equivalently to solve the linear system (...).

page 166, line 2: Insert the words "a composition of" before the words "elementary arithmetic operations". Corrected sentence: This technique takes the view that the computer code for evaluating the function can be broken down into a
composition of elementary arithmetic operations, to which the chain rule (one of the basic rules of calculus) can be applied.

page 171, lines 15 and 21: The reference to (7.18) should be replaced by a reference to (7.9). Corrected version of line 15: By substituting (...) into (...), we obtain. Corrected version of lines 20-21: A similar argument shows that the nonzero elements of the fourth column of the Hessian can be estimated by substituting (...) into (...); we obtain eventually that

page 176, line 2 of Section 7.2: "exact" should be "analytic". Corrected sentence: Automatic differentiation is the generic name for techniques that use the computational representation of a function to produce analytic values for the derivatives.

page 176, second paragraph of Section 7.2: Delete the following sentence altogether: "Similarly, each line of a computer program usually contains just a few arithmetic operations."

page 180, lines 15-16: Delete the following sentence altogether: "When node $j$ is not a child of node $i$, we have $\partial x_j / \partial x_i = 0$, and so the corresponding term need not appear in the summation."

page 180, fourth and sixth lines from the bottom of the page: For consistency, replace these two instances of (...) by $\partial x_j / \partial x_i$. Corrected sentences: During the forward sweep (the evaluation of $f$), we not only calculate the values of each variable $x_i$, but we also calculate and store the numerical values of each partial derivative $\partial x_j / \partial x_i$. Each of these partial derivatives is associated with a particular arc of the computational graph. The numerical values of $\partial x_j / \partial x_i$ computed during the forward sweep are then used in the formula (...) during the reverse sweep.

page 181, line 7: Node 12 should be replaced by Node 9. Corrected line: Node 9 is the child of nodes 3 and 8, so we use formula (...).

page 208, line 8: We need $0 \le \phi_k \le 1$, not just $\phi_k \ge 0$. Corrected sentence: Also, since BFGS and DFP updating preserve positive definiteness of the Hessian approximations when $s_k^T y_k > 0$, this relation implies that the same property will hold for the Broyden family if
$0 \le \phi_k \le 1$.

page 220, line 3: It should read $\det(I + xy^T) = 1 + y^T x$. Corrected sentence: First, show that $\det(I + xy^T) = 1 + y^T x$, where $x$ and $y$ are $n$-vectors.

page 221, part of (8.9): The reference should be to (8.45) rather than (8.43). Corrected sentence: Now use this relation to establish (...).

page 316, line 25: In this displayed equation, the final part "$k = 0, 1, 2, \ldots$" should be "$k = 1, 3, 5, \ldots$". Corrected display equation:
  $(...) \quad \text{for } k = 1, 3, 5, \ldots$

page 357: Delete the material on lines 6 through 29 inclusive and replace it with the following material. Inserted material: We return to our claim that the set $N$ defined by (...) is a closed set. This fact is needed in the proof of Lemma (...) to ensure that the solution of the projection subproblem (...) is well-defined. Note that $N$ has the following form:
  $N = \{ s \mid s = A\lambda,\ C\lambda \ge 0 \}$
for appropriately defined matrices $A$ and $C$. We outline a proof of closedness of $N$ for the case in which LICQ holds (Definition (...)); that is, the matrix $A$ has full column rank. It suffices to show that whenever $\{s_k\}$ is a sequence of vectors satisfying $s_k \to s^*$, then $s^* \in N$. Because $A$ has full column rank, for each $s_k$ there is a unique vector $\lambda_k$ such that $s_k = A\lambda_k$; in fact, we have $\lambda_k = (A^T A)^{-1} A^T s_k$. Because $s_k \in N$, we must have $C\lambda_k \ge 0$, and since $s_k \to s^*$, we also have
  $\lambda_k = (A^T A)^{-1} A^T s_k \to (A^T A)^{-1} A^T s^* = \lambda^*.$
The inequality $C\lambda^* \ge 0$ is a consequence of $C\lambda_k \ge 0$ for all $k$, whereas $s^* = A\lambda^*$ follows from the facts that $s_k - A\lambda_k = 0$ for all $k$ and $s_k - A\lambda_k \to s^* - A\lambda^*$. Therefore we have identified a vector $\lambda^*$ such that $s^* = A\lambda^*$ and $C\lambda^* \ge 0$, so $s^* \in N$, as claimed. There is a more general proof, using an argument due to Mangasarian and Schumaker and appealing to Theorem (...)(iii), that does not require LICQ. We omit the details here.

page 370, lines 2-4: In this displayed equation, the expression $b^T \lambda$, which appears in four places, should be replaced by $\Delta b^T \lambda$ in all. Corrected display equation: (...)

page 371, Theorem 13.2: Add a third item to this theorem: (iii)
If (13.1) is feasible and bounded, then it has an optimal solution. Corrected theorem statement: (i) If there is a feasible point for (...), then there is a basic feasible point. (ii) If (...) has solutions, then at least one such solution is a basic optimal point. (iii) If (...) is feasible and bounded, then it has an optimal solution.

page 372: Insert a new paragraph just before the end of the proof, to read as follows. Inserted paragraph: The final statement (iii) is a consequence of finite termination of the simplex method. We comment on the latter property in the next section.

page 372, first paragraph in the section headed "Vertices of the Feasible Polytope": Replace the passage "an intersection of half-spaces. The" by "and the". Corrected paragraph: The feasible set defined by the linear constraints is a polytope, and the vertices of this polytope are the points that do not lie on a straight line between two other points in the set. Geometrically, they are easily recognizable (see Figure (...)).

page 373, line 3: Replace "We" by "we", and insert the phrase "for $i = m+1, m+2, \ldots, n$" at the end of this sentence. Corrected sentence: Because of (...) and the fact that $\alpha$ and $1 - \alpha$ are both positive, we must have $x_i = 0$ for $i = m+1, m+2, \ldots, n$.

page 374, first two paragraphs of Section 13.3: A new sentence needs to be added and two phrases inserted. It is easiest to state the new version of these paragraphs, which is as follows: As we just described, all iterates of the simplex method are basic feasible points for (...) and therefore vertices of the feasible polytope. Most steps consist of a move from one vertex to an adjacent one, for which the set of basic indices $\mathcal{B}$ differs in exactly one component. On most steps (but not all), the value of the primal objective function $c^T x$ is decreased. Another type of step occurs when the problem is unbounded: the step is an edge along which the objective function is reduced, and along which we can move infinitely far without ever reaching a vertex.

The major issue at each simplex iteration is to decide which index to
change in the basis set $\mathcal{B}$. Unless the step is a direction of unboundedness, one index must be removed from $\mathcal{B}$ and replaced by another from outside $\mathcal{B}$. We can get some insight into how this decision is made by looking again at the KKT conditions (...) to see how they relate to the algorithm.

page 375, lines 5 and 6: In this bulleted item, replace "the component with index $p$" by "corresponding to $x_p$", and add the following phrase before the semicolon: "or determining that no such component exists (the unbounded case)". Corrected item: keep increasing $x_q$ until one of the components of $x_B$ (corresponding to $x_p$, say) is driven to zero, or determining that no such component exists (the unbounded case);

page 377, statement of Theorem 13.4: After "nondegenerate" add the words "and bounded". Corrected statement: Provided that the linear program (...) is nondegenerate and bounded, the simplex method terminates at a basic optimal point.

page 378: Add a sentence after the end of the proof (which happens at line 4). Inserted sentence: Note that this result gives us a proof of Theorem (...)(iii) for the nondegenerate case.

page 378, between lines 16 and 17: Add two extra lines (details below). On line 17, after "denote the index" add the phrase "of the basic variable", rearrange lines 17 and 18, and delete the parenthesized phrase "($p$ is the leaving index)". Corrected passage from Procedure 13.1 (verbatim; should be ready to be copied into the LaTeX source file):
  >> Solve $B t = A_q$ for $t$;
  >> if $t \le 0$
  >>> STOP; (problem is unbounded)
  >> Calculate $x_q^+ = \min_{i \,:\, t_i > 0} x_{B(i)} / t_i$ and use $p$ to denote the index of the basic
  >>> variable for which this minimum is achieved;\index{Simplex method!leaving index}

page 393, exercise 13.4: Reword this exercise by inserting the following phrase after the word "then": "the matrix $B$ in (13.13) is singular, and therefore". Corrected exercise: Show that when $m \le n$ and the rows of $A$ are linearly dependent in (...), then the matrix $B$ in (...) is singular
and therefore there are no basic feasible points.

page 408, equation (14.28a): Add a minus sign before $S^{-1}X(...)$. Corrected equations:
  (a) $A D^2 A^T \Delta\lambda = (...)$
  (b) $\Delta s = -A^T \Delta\lambda$
  (c) $\Delta x = -x + \sigma\mu S^{-1} e - S^{-1} X \Delta s$

Additional corrections, noted on 2/25/2000:

page 588, equation (A.24): Insert "0" before the (...). Corrected equation: (...)

page 612, reference [26]: The second author should be L. El Ghaoui.

page 614, reference [71]: The first author should be P. Deuflhard, I think. (The sequence of letters "fl" is typeset in a curious way.)

page 622, reference [234]: "QuasiNewton" should be "Quasi-Newton". Also, "Nonlinear Programming J" should be "Nonlinear Programming 3".

Peter Blomgren (blomgren.peter@gmail.com)
Department of Mathematics and Statistics, Dynamical Systems Group, Computational Sciences Research Center, San Diego State University, San Diego, CA 92182-7720
http://terminus.sdsu.edu
Fall 2009

Outline
- Convergence Analysis: BFGS and SR1
  - Global Convergence of BFGS
  - Superlinear Convergence of BFGS

Convergence Analysis of the BFGS and SR1 Methods

We look at some local and global convergence results for the BFGS method. The BFGS results are more general than the results for the SR1 method; hence we omit the SR1 discussion. Since the Hessian approximations evolve by updating formulas, the analysis of quasi-Newton methods is much more complex than the corresponding analysis for the steepest descent and Newton methods.

Quasi-Newton Methods: What We Cannot Prove

We cannot prove the following desirable result: the iterates of a quasi-Newton method approach a stationary point from any starting point and any suitable initial Hessian approximation. In the analysis we must either assume that the objective function is convex, or we must impose restrictions on the iterates. Under reasonable assumptions we will be able to show local superlinear convergence results.
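Before the analysis, it can help to see the update being analyzed in executable form. The following minimal Python sketch of the BFGS matrix update (the 2x2 matrix and the vectors s, y are illustrative choices, not data from the lecture) checks two properties that the proofs below rely on: the secant equation $B_{k+1} s_k = y_k$, and preservation of positive definiteness when the curvature condition $s_k^T y_k > 0$ holds.

```python
# Minimal 2x2 sketch of the BFGS update:
#   B+ = B - (B s s^T B)/(s^T B s) + (y y^T)/(y^T s).
# All data below is illustrative, not from the lecture.

def matvec(B, v):
    return [B[0][0]*v[0] + B[0][1]*v[1], B[1][0]*v[0] + B[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

def bfgs_update(B, s, y):
    Bs = matvec(B, s)
    sBs = dot(s, Bs)          # curvature of the quadratic model along s
    ys = dot(y, s)            # curvature condition: must be positive
    assert ys > 0, "curvature condition s^T y > 0 violated"
    return [[B[i][j] - Bs[i]*Bs[j]/sBs + y[i]*y[j]/ys for j in range(2)]
            for i in range(2)]

B = [[2.0, 0.0], [0.0, 1.0]]   # SPD initial approximation
s = [1.0, 0.5]
y = [1.5, 1.0]                 # s^T y = 2.0 > 0

B1 = bfgs_update(B, s, y)

# Secant equation: B1 s = y (exact for BFGS, up to roundoff)
B1s = matvec(B1, s)
print([round(v, 6) for v in B1s])  # [1.5, 1.0]

# Positive definiteness of the 2x2 result: trace > 0 and det > 0
tr1 = B1[0][0] + B1[1][1]
det1 = B1[0][0]*B1[1][1] - B1[0][1]*B1[1][0]
print(tr1 > 0 and det1 > 0)  # True
```

The same check scales to any dimension; only the 2x2 helpers would need to be generalized.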
Global Convergence of BFGS: Assumptions

In our analysis of global convergence for BFGS in the context of an inexact line search, we need the following assumptions:

1. The objective function $f$ is twice continuously differentiable.
2. The level set $\Omega = \{ x \in R^n : f(x) \le f(x_0) \}$ is convex, and there exist positive constants $m$ and $M$ such that
  $m \|z\|^2 \le z^T \nabla^2 f(x)\, z \le M \|z\|^2$
for all $z \in R^n$ and $x \in \Omega$.

The second assumption implies that the Hessian is positive definite on $\Omega$ and that $f$ has a unique minimizer $x^* \in \Omega$.

Global Convergence of BFGS: Building Blocks

Using Taylor's theorem we can express a relation between the quantities $s_k = x_{k+1} - x_k = \alpha_k p_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$ using the average Hessian $\bar{G}_k$ over the step,
  $\bar{G}_k = \int_0^1 \nabla^2 f(x_k + \tau \alpha_k p_k)\, d\tau,$
to get $y_k = \bar{G}_k \alpha_k p_k = \bar{G}_k s_k$. In combination with the second assumption we get
  $\frac{y_k^T s_k}{s_k^T s_k} = \frac{s_k^T \bar{G}_k s_k}{s_k^T s_k} \ge m.$
Further, since $\bar{G}_k$ is SPD, its square root is well defined, and we can set $z_k = \bar{G}_k^{1/2} s_k$; we have
  $\frac{y_k^T y_k}{y_k^T s_k} = \frac{z_k^T \bar{G}_k z_k}{z_k^T z_k} \le M.$

Global Convergence of BFGS: Tools

In order to estimate the size of the largest and smallest eigenvalues of the generated Hessian approximations $B_k$, we need two linear algebra tools: the determinant and the trace.

Definition (Trace): The trace of an $n \times n$ matrix $A$ is $\mathrm{trace}(A) = \sum_{i=1}^n a_{ii}$.

Both the determinant and the trace are invariant under similarity transformations, and hence:
- The trace is the sum of the eigenvalues of $A$.
- The determinant is the product of the eigenvalues of $A$.

Global Convergence of BFGS: Theorem

Let $B_0$ be any symmetric positive definite initial matrix, and let $x_0$ be a starting point for which the stated assumptions are satisfied. Then the sequence $\{x_k\}$ generated by the BFGS method converges to
the minimizer $x^*$ of $f$.

This theorem can be generalized to the entire restricted Broyden class, except for the DFP method. The analysis can also be extended to show that convergence is rapid enough that
  $\sum_{k=1}^{\infty} \|x_k - x^*\| < \infty;$
this implies superlinear convergence.

Global Convergence of BFGS: Proof

First we define
  $m_k = \frac{y_k^T s_k}{s_k^T s_k} \ge m, \qquad M_k = \frac{y_k^T y_k}{y_k^T s_k} \le M,$
where $m$ and $M$ are the constants in the assumption. We compute the trace and determinant of the BFGS update:
  $\mathrm{trace}(B_{k+1}) = \mathrm{trace}(B_k) - \frac{\|B_k s_k\|^2}{s_k^T B_k s_k} + \frac{\|y_k\|^2}{y_k^T s_k},$
  $\det(B_{k+1}) = \det(B_k)\, \frac{y_k^T s_k}{s_k^T B_k s_k}.$

Further, we define the angle $\theta_k$ between $s_k$ and $B_k s_k$, and the Rayleigh quotient $q_k$:
  $\cos\theta_k = \frac{s_k^T B_k s_k}{\|s_k\|\, \|B_k s_k\|}, \qquad q_k = \frac{s_k^T B_k s_k}{\|s_k\|^2}.$
From this we can obtain
  $\frac{\|B_k s_k\|^2}{s_k^T B_k s_k} = \frac{\|B_k s_k\|^2\, \|s_k\|^2\, (s_k^T B_k s_k)}{(s_k^T B_k s_k)^2\, \|s_k\|^2} = \frac{q_k}{\cos^2\theta_k}.$
We can also write
  $\det(B_{k+1}) = \det(B_k)\, \frac{y_k^T s_k}{s_k^T s_k}\, \frac{s_k^T s_k}{s_k^T B_k s_k} = \det(B_k)\, \frac{m_k}{q_k}.$

We introduce the following function, positive for positive definite matrices $B$:
  $\psi(B) = \mathrm{trace}(B) - \ln\det(B).$
Combining our previous results, we can get an update formula for $\psi$:
  $\psi(B_{k+1}) = \psi(B_k) + (M_k - \ln m_k - 1) + \left[ 1 - \frac{q_k}{\cos^2\theta_k} + \ln\frac{q_k}{\cos^2\theta_k} \right] + \ln\cos^2\theta_k.$
Since the function $h(t) = 1 - t + \ln t$ is non-positive for all $t > 0$, the term inside the square bracket is non-positive.

Using the non-positiveness of $h(t)$ we get
  $0 < \psi(B_{k+1}) \le \psi(B_1) + ck + \sum_{j=1}^{k} \ln\cos^2\theta_j,$
where we can assume that the constant $c = M - \ln m - 1$ is positive. For quasi-Newton methods, $s_k = -\alpha_k B_k^{-1} \nabla f_k$, so that $\theta_k$ is the angle between the steepest-descent direction and the search direction. From the earlier discussion on the convergence of line-search methods we know that, when $\|\nabla f_k\|$ is bounded away from zero, non-convergence is possible only if $\cos\theta_j \to 0$.
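The trace and determinant update identities above can be checked numerically on a small example. The following Python sketch (the 2x2 SPD matrix and the vectors s, y are illustrative assumptions, not data from the lecture) verifies both identities for one BFGS update, and that $\psi(B) = \mathrm{trace}(B) - \ln\det(B)$ is positive for SPD matrices, as claimed.

```python
import math

# Check trace(B+) = trace(B) - |Bs|^2/(s^T B s) + |y|^2/(y^T s)
# and   det(B+)   = det(B) * (y^T s)/(s^T B s)
# for a 2x2 SPD example; data is illustrative, not from the lecture.

def matvec(B, v):
    return [B[0][0]*v[0] + B[0][1]*v[1], B[1][0]*v[0] + B[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

def trace(B):
    return B[0][0] + B[1][1]

def det(B):
    return B[0][0]*B[1][1] - B[0][1]*B[1][0]

def psi(B):
    # psi(B) = trace(B) - ln det(B) = sum_i (lambda_i - ln lambda_i) > 0 for SPD B
    return trace(B) - math.log(det(B))

B = [[2.0, 0.5], [0.5, 1.0]]       # SPD
s = [1.0, -0.5]
y = [1.0, 0.25]                    # y^T s = 0.875 > 0

Bs = matvec(B, s)
sBs = dot(s, Bs)
ys = dot(y, s)

# BFGS update: B1 = B - (B s s^T B)/(s^T B s) + (y y^T)/(y^T s)
B1 = [[B[i][j] - Bs[i]*Bs[j]/sBs + y[i]*y[j]/ys for j in range(2)]
      for i in range(2)]

lhs_tr, rhs_tr = trace(B1), trace(B) - dot(Bs, Bs)/sBs + dot(y, y)/ys
lhs_det, rhs_det = det(B1), det(B)*ys/sBs

print(abs(lhs_tr - rhs_tr) < 1e-9, abs(lhs_det - rhs_det) < 1e-9)  # True True
print(psi(B) > 0, psi(B1) > 0)  # True True
```

The determinant identity is what makes the potential function $\psi$ tractable: the update's effect on $\ln\det(B_k)$ is a single scalar ratio.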
Global Convergence of BFGS: Proof, continued

We prove convergence by contradiction: let us assume that $\cos\theta_j \to 0$. Then there exists $k_1 > 0$ such that for all $j > k_1$ we have $\ln\cos^2\theta_j < -2c$, where $c = M - \ln m - 1$ as above. Plugging this into the update inequality for $\psi$ gives, for $k > k_1$,
  $0 < \psi(B_1) + ck + \sum_{j=1}^{k_1} \ln\cos^2\theta_j + \sum_{j=k_1+1}^{k} (-2c) = \psi(B_1) + \sum_{j=1}^{k_1} \ln\cos^2\theta_j + 2ck_1 - ck.$
However, for large enough $k$ the right-hand side will be negative, resulting in a contradiction. Therefore $\cos\theta_j \not\to 0$, and $\|\nabla f_k\| \to 0$ (Zoutendijk), and $x_k \to x^*$ (convexity).

We can conclude that there exists a subsequence of indices $\{j_k\}$ such that $\cos\theta_{j_k} \ge \delta > 0$. By Zoutendijk's result,
  $\sum_{k=0}^{\infty} \cos^2\theta_k\, \|\nabla f_k\|^2 < \infty,$
this implies that $\liminf \|\nabla f_k\| = 0$. Since the problem is strongly convex, this shows that $x_k \to x^*$. (End of proof.)

Superlinear Convergence of BFGS

We need one further assumption in order to show that, if
  $\sum_{k=1}^{\infty} \|x_k - x^*\| < \infty$
holds, then the rate of convergence is superlinear.

Assumption: The Hessian matrix $\nabla^2 f$ is Lipschitz continuous at $x^*$; that is,
  $\|\nabla^2 f(x) - \nabla^2 f(x^*)\| \le L\, \|x - x^*\|$
for all $x$ near $x^*$, where $L$ is a positive constant.

Superlinear Convergence of BFGS: Pre-Proof

We introduce
  $\tilde{s}_k = \nabla^2 f(x^*)^{1/2} s_k, \qquad \tilde{y}_k = \nabla^2 f(x^*)^{-1/2} y_k, \qquad \tilde{B}_k = \nabla^2 f(x^*)^{-1/2} B_k\, \nabla^2 f(x^*)^{-1/2}.$
Further, we define the angle $\tilde\theta_k$ between $\tilde{s}_k$ and $\tilde{B}_k \tilde{s}_k$, and the Rayleigh quotient $\tilde{q}_k$:
  $\cos\tilde\theta_k = \frac{\tilde{s}_k^T \tilde{B}_k \tilde{s}_k}{\|\tilde{s}_k\|\, \|\tilde{B}_k \tilde{s}_k\|}, \qquad \tilde{q}_k = \frac{\tilde{s}_k^T \tilde{B}_k \tilde{s}_k}{\|\tilde{s}_k\|^2},$
and we let
  $\tilde{M}_k = \frac{\|\tilde{y}_k\|^2}{\tilde{y}_k^T \tilde{s}_k}, \qquad \tilde{m}_k = \frac{\tilde{y}_k^T \tilde{s}_k}{\tilde{s}_k^T \tilde{s}_k}.$

Now, if we pre- and post-multiply the BFGS update formula
  $B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$
by $\nabla^2 f(x^*)^{-1/2}$, we obtain
  $\tilde{B}_{k+1} = \tilde{B}_k - \frac{\tilde{B}_k \tilde{s}_k \tilde{s}_k^T \tilde{B}_k}{\tilde{s}_k^T \tilde{B}_k \tilde{s}_k} + \frac{\tilde{y}_k \tilde{y}_k^T}{\tilde{y}_k^T \tilde{s}_k}.$
Since this has
exactly the same form as the BFGS update, it follows from the argument in the previous proof that
  $\psi(\tilde{B}_{k+1}) = \psi(\tilde{B}_k) + (\tilde{M}_k - \ln\tilde{m}_k - 1) + \left[ 1 - \frac{\tilde{q}_k}{\cos^2\tilde\theta_k} + \ln\frac{\tilde{q}_k}{\cos^2\tilde\theta_k} \right] + \ln\cos^2\tilde\theta_k.$

We recall $y_k = \bar{G}_k \alpha_k p_k = \bar{G}_k s_k$, so that we can write
  $y_k - G_* s_k = (\bar{G}_k - G_*)\, s_k,$
where $G_* = \nabla^2 f(x^*)$; pre-multiplying by $G_*^{-1/2}$ gives
  $\tilde{y}_k - \tilde{s}_k = G_*^{-1/2} (\bar{G}_k - G_*)\, G_*^{-1/2}\, \tilde{s}_k.$
By the assumed Lipschitz continuity of the Hessian we have
  $\|\tilde{y}_k - \tilde{s}_k\| \le \|G_*^{-1/2}\|^2\, \|\tilde{s}_k\|\, \|\bar{G}_k - G_*\| \le \|G_*^{-1/2}\|^2\, \|\tilde{s}_k\|\, L \epsilon_k,$
where $\epsilon_k = \max\{ \|x_{k+1} - x^*\|,\ \|x_k - x^*\| \}$.

It now follows that
  $\|\tilde{y}_k - \tilde{s}_k\| \le \bar{c}\, \epsilon_k\, \|\tilde{s}_k\|$
for some positive constant $\bar{c}$.

Superlinear Convergence of BFGS: Theorem

Suppose that $f$ is twice continuously differentiable, and that the iterates generated by the BFGS algorithm converge to a minimizer $x^*$ at which the Hessian is Lipschitz continuous. Suppose also that
  $\sum_{k=1}^{\infty} \|x_k - x^*\| < \infty$
holds. Then $x_k$ converges to $x^*$ at a superlinear rate.

Superlinear Convergence of BFGS: Proof

We use the result from the pre-proof,
  $\|\tilde{y}_k - \tilde{s}_k\| \le \bar{c}\, \epsilon_k\, \|\tilde{s}_k\|,$
and the triangle inequality to obtain
  $\|\tilde{y}_k\| - \|\tilde{s}_k\| \le \bar{c}\epsilon_k \|\tilde{s}_k\|, \qquad \|\tilde{s}_k\| - \|\tilde{y}_k\| \le \bar{c}\epsilon_k \|\tilde{s}_k\|,$
so that
  $(1 - \bar{c}\epsilon_k)\, \|\tilde{s}_k\| \le \|\tilde{y}_k\| \le (1 + \bar{c}\epsilon_k)\, \|\tilde{s}_k\|.$
If we square the result at the top right, we get
  $\|\tilde{y}_k\|^2 - 2\tilde{y}_k^T \tilde{s}_k + \|\tilde{s}_k\|^2 \le \bar{c}^2 \epsilon_k^2\, \|\tilde{s}_k\|^2.$
We use the left half of the inequality above to obtain
  $2\tilde{y}_k^T \tilde{s}_k \ge (1 - \bar{c}\epsilon_k)^2 \|\tilde{s}_k\|^2 + (1 - \bar{c}^2\epsilon_k^2) \|\tilde{s}_k\|^2 = 2(1 - \bar{c}\epsilon_k)\, \|\tilde{s}_k\|^2.$
Finally, we can say something useful:
  $\tilde{m}_k = \frac{\tilde{y}_k^T \tilde{s}_k}{\|\tilde{s}_k\|^2} \ge 1 - \bar{c}\epsilon_k, \qquad \tilde{M}_k = \frac{\|\tilde{y}_k\|^2}{\tilde{y}_k^T \tilde{s}_k} \le \frac{(1 + \bar{c}\epsilon_k)^2}{1 - \bar{c}\epsilon_k}.$
Since $x_k \to x^*$, we have that $\epsilon_k \to 0$. There must exist a $K$ and $c > \bar{c}$ so that for all $k > K$ we have $\tilde{M}_k \le 1 + c\epsilon_k$.

We use the non-positiveness of the function $h(t) = 1 - t + \ln t$:
  $h\!\left(\frac{1}{1-x}\right) \le 0 \quad \Rightarrow \quad \ln(1-x) \ge \frac{-x}{1-x}.$
For large enough $k$ we can assume $\bar{c}\epsilon_k < \tfrac12$, so that
  $\ln(1 - \bar{c}\epsilon_k) \ge \frac{-\bar{c}\epsilon_k}{1 - \bar{c}\epsilon_k} \ge -2\bar{c}\epsilon_k,$
and together with $\tilde{m}_k \ge 1 - \bar{c}\epsilon_k$ this gives $\ln\tilde{m}_k \ge -2\bar{c}\epsilon_k$.

Putting together $\ln\tilde{m}_k \ge -2\bar{c}\epsilon_k$, $\tilde{M}_k \le 1 + c\epsilon_k$, and the update formula for $\psi(\tilde{B}_{k+1})$ gives
  $0 < \psi(\tilde{B}_{k+1}) \le \psi(\tilde{B}_k) + 3c\epsilon_k + \ln\cos^2\tilde\theta_k + \left[ 1 - \frac{\tilde{q}_k}{\cos^2\tilde\theta_k} + \ln\frac{\tilde{q}_k}{\cos^2\tilde\theta_k} \right].$
We sum this expression over $k$ and make use of $\sum_k \epsilon_k < \infty$ to get
  $\sum_{k=0}^{\infty} \left( -\ln\cos^2\tilde\theta_k - \left[ 1 - \frac{\tilde{q}_k}{\cos^2\tilde\theta_k} + \ln\frac{\tilde{q}_k}{\cos^2\tilde\theta_k} \right] \right) \le \psi(\tilde{B}_0) + 3c \sum_{k=0}^{\infty} \epsilon_k < \infty.$
Since both terms in the sum are non-negative, we must have
  $\lim_{k\to\infty} \ln\cos^2\tilde\theta_k = 0 \quad \text{and} \quad \lim_{k\to\infty} \left[ 1 - \frac{\tilde{q}_k}{\cos^2\tilde\theta_k} + \ln\frac{\tilde{q}_k}{\cos^2\tilde\theta_k} \right] = 0.$
In order for this to be true, we must have
  $\lim_{k\to\infty} \cos\tilde\theta_k = 1, \qquad \lim_{k\to\infty} \tilde{q}_k = 1.$

We are now (not so obviously) done. From the previous proof we had the relation
  $\frac{\|G_*^{-1/2} (B_k - G_*)\, s_k\|^2}{\|G_*^{1/2} s_k\|^2} = \frac{\|(\tilde{B}_k - I)\tilde{s}_k\|^2}{\|\tilde{s}_k\|^2} = \frac{\tilde{q}_k^2}{\cos^2\tilde\theta_k} - 2\tilde{q}_k + 1.$
The right-hand side converges to zero, and we can conclude that
  $\lim_{k\to\infty} \frac{\|(B_k - G_*)\, s_k\|}{\|s_k\|} = 0.$
Close to the solution the step $\alpha_k = 1$ will be accepted, so that $s_k = p_k$; therefore
  $\lim_{k\to\infty} \frac{\|(B_k - \nabla^2 f(x^*))\, p_k\|}{\|p_k\|} = 0,$
which establishes superlinear convergence. (End of proof.) Phew!

Numerical Optimization: Lecture Notes #16
Calculating Derivatives: Finite Differencing

Peter Blomgren (blomgren.peter@gmail
.com), Department of Mathematics and Statistics, Dynamical Systems Group, Computational Science Research Center, San Diego State University, San Diego, CA 92182-7720, http://terminus.sdsu.edu. Fall 2009.

Outline
- Non-Analytic Derivatives: Finite Differencing
  - Taylor's Theorem and Finite Differencing
  - Finite Difference Gradient
  - Finite Difference Hessian
- Finite Differencing: Sparsity and Symmetry
- Project

Derivatives Needed

As we have seen (and will see), algorithms for nonlinear optimization and nonlinear equations require knowledge of derivatives:
- Nonlinear Optimization: Gradient (1st order, vector); Hessian (2nd order, matrix).
- Nonlinear Equations: Jacobian (1st order, matrix).
Often it is quite trivial to provide the code which computes those derivatives, but in some cases the analytic expressions for the derivatives are not available and/or not practical to evaluate. In those cases we need some other way to compute or approximate the derivatives.

Finite Differences: The Return of Taylor's Theorem

We can get an approximation of the gradient $\nabla f$ by evaluating the objective $f$ at $(n+1)$ points, using the forward difference formula
  $\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + \epsilon e_i) - f(x)}{\epsilon}, \qquad i = 1, 2, \ldots, n,$
where $e_i$ is the $i$th unit vector and $\epsilon > 0$ is small. If $f$ is twice continuously differentiable, then by Taylor's theorem
  $f(x + p) = f(x) + \nabla f(x)^T p + \tfrac12\, p^T \nabla^2 f(x + tp)\, p, \qquad t \in (0,1),$
so with $p = \epsilon e_i$,
  $f(x + \epsilon e_i) = f(x) + \epsilon\, e_i^T \nabla f(x) + \tfrac{\epsilon^2}{2}\, e_i^T \nabla^2 f(x + t\epsilon e_i)\, e_i, \qquad i = 1, 2, \ldots, n.$

Forward Differences: Building the Gradient

With a bit of rearrangement we see
  $\frac{\partial f}{\partial x_i}(x) = \underbrace{\frac{f(x + \epsilon e_i) - f(x)}{\epsilon}}_{\text{finite difference approximation}} - \underbrace{\frac{\epsilon}{2}\, e_i^T \nabla^2 f(x + t\epsilon e_i)\, e_i}_{\text{approximation error}}.$
If the Hessian $\nabla^2 f$ is bounded, i.e. $\|\nabla^2 f\| \le L$, then we have
  $\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + \epsilon e_i) - f(x)}{\epsilon},$
where the approximation error is bounded by $\epsilon L / 2$. Since the error is proportional to $\epsilon$, this
is a first-order approximation. Clearly, the smaller the $\epsilon$, the smaller the error. How small can we set $\epsilon$ in finite precision?

Selecting $\epsilon$: Machine Epsilon / Unit Roundoff (1 of 3)

Let $\epsilon_{mach}$ denote the value for machine epsilon, a.k.a. unit roundoff; it is essentially the largest value for which $1.0 + \epsilon_{mach} - 1.0 = 0$ in finite precision. Currently $\epsilon_{mach} \approx 10^{-16}$ in double precision arithmetic (double, and Matlab internals) on typical Intel x86-based systems. $\epsilon_{mach}$ is a measure of how well (or badly) we can represent any number in finite precision, and by extension a measure of the quality of every computation / approximation.

Selecting $\epsilon$ (2 of 3)

If $L_f$ is a bound on the value of $|f|$, then in finite precision we have something like
  $\|\mathrm{computed}\, f(x) - f(x)\| \le \epsilon_{mach} L_f, \qquad \|\mathrm{computed}\, f(x + \epsilon e_i) - f(x + \epsilon e_i)\| \le \epsilon_{mach} L_f.$
Now, if we recall our finite difference approximation, with a slight abuse of notation we find that the total error is
  $\mathrm{error} \approx \frac{2\, \epsilon_{mach} L_f}{\epsilon} + \frac{\epsilon L}{2}.$

Selecting $\epsilon$ (3 of 3)

Minimizing the total error,
  $\frac{d\,(\mathrm{error})}{d\epsilon} = -\frac{2\, \epsilon_{mach} L_f}{\epsilon^2} + \frac{L}{2} = 0 \quad \Rightarrow \quad \epsilon = 2\sqrt{\frac{\epsilon_{mach} L_f}{L}},$
gives us the optimal value for epsilon. Since $L_f$ and $L$ are unknown in general, most software packages tend to select $\epsilon = \sqrt{\epsilon_{mach}}$, which is close to optimal in most cases. Hence the error in the approximated gradient is
  $\mathrm{error} \sim \sqrt{\epsilon_{mach}}.$

Central Differences: $O(\epsilon^2)$ Accuracy

At twice the cost, we can get about 2.67 extra digits of precision in the finite difference approximation by using central differences. More Taylor expansions:
  $f(x + \epsilon e_i) = f(x) + \epsilon \frac{\partial f}{\partial x_i} + \frac{\epsilon^2}{2} \frac{\partial^2 f}{\partial x_i^2} + O(\epsilon^3),$
  $f(x - \epsilon e_i) = f(x) - \epsilon \frac{\partial f}{\partial x_i} + \frac{\epsilon^2}{2} \frac{\partial^2 f}{\partial x_i^2} + O(\epsilon^3),$
  $f(x + \epsilon e_i) - f(x - \epsilon e_i) = 2\epsilon \frac{\partial f}{\partial x_i} + O(\epsilon^3).$
We get
  $\frac{\partial f}{\partial x_i}(x) = \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon} + O(\epsilon^2);$
by arguments similar to the ones for the forward difference formula, we can show the optimal $\epsilon$ and the overall error.

Approximating the Hessian (1 of 5)

The easy case: analytic gradient given. If the analytic gradient is known, then we can get an approximation of the Hessian by applying
forward or central differencing to each element of the gradient vector in turn. When the second derivatives exist and are Lipschitz continuous, Taylor's theorem says
  $\nabla f(x + p) = \nabla f(x) + \nabla^2 f(x)\, p + O(\|p\|^2).$
Again we let $p = \epsilon e_i$, $i = 1, 2, \ldots, n$, and get
  $\nabla^2 f(x)\, e_i \approx \frac{\nabla f(x + \epsilon e_i) - \nabla f(x)}{\epsilon}.$

Approximating the Hessian (2 of 5)

Special case: in Newton-CG methods we do not require full knowledge of the Hessian. Each iteration requires the Hessian-vector product $\nabla^2 f(x)\, p$, where $p$ is the given search direction; this expression can be approximated
  $\nabla^2 f(x)\, p \approx \frac{\nabla f(x + \epsilon p) - \nabla f(x - \epsilon p)}{2\epsilon}.$
This approximation is very cheap: only one (two) extra gradient evaluation(s) is (are) needed.

Approximating the Hessian (3 of 5)

It is worth noting that this is a column-at-a-time process which does not (due to numerical roundoff and approximation errors) necessarily give a symmetric Hessian. It is often necessary to symmetrize the result:
  $H_{sym} = \tfrac12 \left( H_{num} + H_{num}^T \right).$

Approximating the Hessian (4 of 5)

At a price of $\sim n^2$ additional function evaluations (an increase of 33%), we can use the second-order central difference approximation
  $\frac{\partial^2 f}{\partial x_i \partial x_j}(x) \approx \frac{f(x + \epsilon e_i + \epsilon e_j) - f(x + \epsilon e_i - \epsilon e_j) - f(x - \epsilon e_i + \epsilon e_j) + f(x - \epsilon e_i - \epsilon e_j)}{4\epsilon^2}.$
Figure: The second-order 4-point central difference approximation stencil for $\partial^2 f / \partial x_i \partial x_j$; note that the value at the central point of the stencil is not part of the evaluation.

Approximating the Hessian (5 of 5)

The harder case: analytic gradient not given. When the analytic gradient is not given, we must use a finite difference formula using only function values to approximate the Hessian. The first-order forward difference approximation is given by
  $\frac{\partial^2 f}{\partial x_i \partial x_j}(x) \approx \frac{f(x + \epsilon e_i + \epsilon e_j) - f(x + \epsilon e_i) - f(x + \epsilon e_j) + f(x)}{\epsilon^2}.$

Finite Differencing: Sparsity and
Symmetry (1 of 3)

Now that we are paying 4 function evaluations per entry in the Hessian matrix, it is worth taking sparsity and symmetry into account. Ponder the extended Rosenbrock function:

  double rosenbrock(int n, double *x) {
    double f = 0.0;
    for (int i = 0; i < n/2; i++) {
      double t1 = x[2*i+1] - x[2*i]*x[2*i];
      double t2 = 1.0 - x[2*i];
      f += 10.0*t1*t1 + t2*t2;
    }
    return f;
  }

Clearly, there is no interaction between coordinate directions $e_i$ and $e_j$ where $|i - j| > 1$.

Sparsity and Symmetry (2 of 3)

The fill pattern of the Hessian of the extended Rosenbrock function consists of 2x2 diagonal blocks. There are a lot of zero entries in this Hessian. If somehow we have knowledge of the sparsity pattern, then we can exploit this by not computing the zeros.

Sparsity and Symmetry (3 of 3)

By using the fact that the Hessian is symmetric, we can save about half of the work.

Project Extensions: Milestone 2 (with linesearch)

Add the following to your codebase:
- fdhessg: Finite difference approximation to the Hessian using the analytic gradient. Executed when analgrad=TRUE, analhess=FALSE, cheapf=TRUE.
- fdjac: The core call from fdhessg; note that fvec in the pseudo-code corresponds to your analytic gradient.
- fdgrad: Finite difference forward approximation to the gradient. Executed when analgrad=FALSE.
- fdhessf: Finite difference approximation to the Hessian using only function values. Executed when analgrad=FALSE, analhess=FALSE, cheapf=TRUE.

Compare the performance of: (i) everything analytic (from before); (ii) analytic gradient with fdhessg/fdjac; (iii) finite difference everything (fdhessf/fdgrad). Try optimal and non-optimal $\epsilon$. Use 2 test problems from Dennis and Schnabel, Appendix B.

Figure: The entries to the left, $H_{i,j}$, $j \le i$, must be computed, but using symmetry we can
fill in the missing ones, H_ij = H_ji, j > i.

Numerical Optimization, Lecture Notes 4: Unconstrained Optimization; Line Search Methods
Peter Blomgren <blomgren.peter@gmail.com>
Department of Mathematics and Statistics, Dynamical Systems Group, Computational Sciences Research Center, San Diego State University, San Diego, CA 92182-7720
http://terminus.sdsu.edu, Fall 2009

Outline
- Line Search Methods: Recap; Example; Example 2
- Convergence: The Zoutendijk Condition
- Notes & Future Directions
- Homework

Quick Recap

Last time: a quick overview of optimization algorithms, in two categories:
1. Line search: select a search direction, then optimize in that direction.
2. Trust region: build a local, simple model of the objective; optimize the model in the local region where we trust it.
Convergence rates: linear, superlinear, quadratic. More details on line search: search directions (steepest descent: linear convergence; Newton direction: quadratic convergence), and enforcing sufficient decrease in the objective (the Wolfe conditions).

Recall: Algorithm (Backtracking Linesearch)

0. Find a descent direction p_k
1. Set ᾱ > 0, ρ ∈ (0, 1), c ∈ (0, 1); set α = ᾱ
2. While f(x_k + α p_k) > f(x_k) + c α ∇f(x_k)ᵀ p_k
3.   α = ρ α
4. EndWhile
5. Set α_k = α

If an algorithm selects the step lengths appropriately (e.g. backtracking), we do not have to check the second inequality of the Wolfe conditions. The algorithm above is especially well suited for use with the Newton direction p_k^N, where ᾱ = 1. The value of the contraction factor ρ can be allowed to vary at each iteration of the line search. (To be revisited.)

Example: Minimizing f(x) = (x_1 + x_2²)²

In the next few slides we illustrate our findings so far by minimizing f(x) = (x_1 + x_2²)² with x_0 = (1, 1)ᵀ, using two algorithms:
1. Steepest Descent direction with Backtracking Linesearch
2. Newton direction with Backtracking Linesearch
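The backtracking linesearch above is straightforward to implement; the following is a minimal sketch (function names and parameter values are illustrative, not from the notes), applied with the steepest-descent direction to the example objective f(x) = (x_1 + x_2²)²:

```python
import numpy as np

def f(x):
    return (x[0] + x[1]**2)**2

def grad_f(x):
    g0 = 2.0 * (x[0] + x[1]**2)
    return np.array([g0, 2.0 * x[1] * g0])

def backtracking(f, gfk, xk, pk, alpha_bar=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the sufficient-decrease condition holds."""
    alpha = alpha_bar
    while f(xk + alpha * pk) > f(xk) + c * alpha * (gfk @ pk):
        alpha *= rho
    return alpha

x = np.array([1.0, 1.0])
for k in range(100):
    g = grad_f(x)
    p = -g                          # steepest-descent direction
    alpha = backtracking(f, g, x, p)
    x = x + alpha * p

# For this particular start the iterates happen to land exactly on the
# minimizing curve x1 = -x2^2; in general steepest descent is only
# linearly convergent.
print(f(x))
```

Note that once the gradient vanishes, the Armijo test is trivially satisfied with α = ᾱ and the iterates simply stop moving.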
One thing to notice is that the entire curve where x_1 = -x_2² gives f(x) = 0; the minimum is neither isolated nor unique.

Recall the two search directions: steepest descent, p_k = -∇f(x_k), and the Newton direction, p_k = -[∇²f(x_k)]⁻¹ ∇f(x_k).

Steepest Descent on f(x) = (x_1 + x_2²)² with x_0 = (1, 1)ᵀ

[Table: Steepest Descent with backtracking line search applied to the problem f(x) = (x_1 + x_2²)², starting at x_0 = (1, 1)ᵀ; the objective decreases slowly, with ever-shrinking step lengths.]

[Figure: the iterations 1, 2, 3, and 9; convergence to f ≈ 10⁻⁸.]

[Figure: the size of the objective as a function of the iteration number.]

Newton on f(x) = (x_1 + x_2²)² with x_0 = (1, 1)ᵀ

[Table: the Newton direction with backtracking line search applied to the same problem; starting at x_0 = (1, 1)ᵀ we achieve convergence right away, in a single iteration.]

Example 2: Minimizing f(x) = (x_1 + x_2²)² + 0.5 x_1² + x_2²

[Figure: comparison of the Newton direction (left) and the Steepest Descent direction (right), with backtracking line search, applied to the objective f(x) = (x_1 + x_2²)² + 0.5 x_1² + x_2² with x_0 = (1, 1)ᵀ.] In this case the minimum is unique, x* = (0, 0)ᵀ. The Newton algorithm requires 6 iterations to reach the minimum (up to the desired accuracy), and the Steepest Descent algorithm requires 16
iterations.

[Figure: convergence comparison of the Newton direction (blue/solid) and the Steepest Descent direction (red/dash-dotted) with backtracking line search applied to the objective f(x) = (x_1 + x_2²)² + 0.5 x_1² + x_2² with x_0 = (1, 1)ᵀ.] We see that the Steepest Descent algorithm is linearly convergent, and that the Newton algorithm is significantly faster (quadratically convergent).

Convergence of Line Search Methods

All our algorithms rely on the directional derivative ∇f(x_k)ᵀ p_k. One of the key properties in expressing requirements on the search direction is the angle θ_k between p_k and -∇f(x_k). The expression for the cosine of this angle is given by

  cos θ_k = -∇f(x_k)ᵀ p_k / ( ‖∇f(x_k)‖ ‖p_k‖ ).

Convergence of Line Search Methods: the Zoutendijk Condition

Theorem (Zoutendijk). Consider any iteration of the form x_{k+1} = x_k + α_k p_k, where p_k is a descent direction and α_k satisfies the Wolfe conditions:

  f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f(x_k)ᵀ p_k,
  ∇f(x_k + α_k p_k)ᵀ p_k ≥ c_2 ∇f(x_k)ᵀ p_k,   0 < c_1 < c_2 < 1.

Suppose that f is bounded from below in Rⁿ, and that f is continuously differentiable in an open set N containing the level set L = { x : f(x) ≤ f(x_0) }, where x_0 is the starting point of the iteration. Assume also that the gradient ∇f is Lipschitz continuous on N, i.e. there exists a constant L > 0 such that ‖∇f(x) - ∇f(x̃)‖ ≤ L ‖x - x̃‖ for all x, x̃ ∈ N. Then

  Σ_{k≥0} cos² θ_k ‖∇f(x_k)‖² < ∞.

Convergence of Line Search Methods: the Zoutendijk Condition

This result seems, at first (and probably second, too) glance, to be quite technical and obscure. However, the Zoutendijk condition

  Σ_{k=0}^{∞} cos² θ_k ‖∇f(x_k)‖² < ∞

implies that

  lim_{k→∞} cos² θ_k ‖∇f(x_k)‖² = 0.

Now, if we have a method for choosing the search directions p_k so that cos θ_k ≥ δ > 0 for all k, then this shows that ‖∇f(x_k)‖ → 0.

The Zoutendijk Condition: Steepest Descent

For the steepest descent direction we have cos θ_k = 1. Hence, if we use a line search algorithm which
satisfies the Wolfe conditions, it will always converge to a stationary point, under the conditions of the theorem (f bounded below and ∇f Lipschitz continuous). This means that steepest descent is globally convergent, in the sense

  lim_{k→∞} ‖∇f(x_k)‖ = 0.

We cannot guarantee convergence to a minimum, only to a stationary point. In order to guarantee convergence to a minimum, more conditions (for example on the Hessian) are required.

The Zoutendijk Condition: Newton Direction

It can be shown that Newton methods are also globally convergent in this sense, under these additional conditions: the Hessian must be positive definite, and the condition numbers must be uniformly bounded, i.e.

  ‖∇²f(x_k)‖ ‖[∇²f(x_k)]⁻¹‖ ≤ M,  for some positive M.

That is, the ratio of the largest and smallest eigenvalues, λ_max(k)/λ_min(k), must remain bounded. (Think of λ_i(k) as the curvature in the eigen-directions.) The proof for the steepest descent direction is given in NW pp. 43-44, and the key part of the proof for the Newton direction is exercise 3.5, NW p. 62.

Notes & Looking Forward

We note that the Zoutendijk condition gives us global convergence, i.e. a guarantee that we will arrive at a stationary point. It does, however, not say anything about the rate of convergence. Next time we will look at the local convergence rates, i.e. the convergence speed, of Steepest Descent, Newton, and Quasi-Newton methods. The analysis takes a full lecture, and it does not make sense to split it up; hence the relative shortness of this lecture.

Homework

Due at 12:01pm, Friday, October 2, 2009. [NW] 3.1: Program the steepest descent and Newton algorithms using the backtracking line search. Use them to minimize the Rosenbrock function

  f(x) = 100 ( x_2 - x_1² )² + ( 1 - x_1 )².

Set the initial step length α_0 = 1, and print the step length used by each method at each iteration. First try the initial point x_0ᵀ = (1.2, 1.2), and then the more difficult point x_0ᵀ = (-1.2, 1).

Note: The homework is due in Peter's mailbox in GMCS 411, or in
Peter's office, GMCS 587 (slide under the door if I'm not there).

Numerical Optimization: Conjugate Gradient Methods, Linear CG, Part 1
Peter Blomgren <blomgren.peter@gmail.com>
Department of Mathematics and Statistics, Dynamical Systems Group, Computational Sciences Research Center, San Diego State University, San Diego, CA 92182-7720
http://terminus.sdsu.edu, Fall 2009

Outline
- Recap
- Conjugate Gradient Methods: Introduction

Recap: Global Convergence and Enhancements

We looked at some theorems describing the convergence of our algorithms. We noted that there was a bit of a gap between what is generally true and what can be proved: theoretical limit points vs. numerical stopping criteria. Further, we looked at some enhancements, including scaling,

  D = diag(d_1, d_2, …, d_n),  d_i > 0,  with trust regions of the form { x ∈ Rⁿ : ‖D x‖ ≤ Δ },

and the use of non-Euclidean norms; the latter primarily comes in handy in the context of constrained optimization. We now explore an important computational tool which will help us solve problems of realistic size: Conjugate Gradient Methods.

Conjugate Gradient Methods: Introduction

For short: "CG methods." One of the most useful techniques for solving large linear systems of equations A x = b ("Linear CG"). Can be adapted to solve nonlinear optimization problems ("Nonlinear CG"), which are our type of problems. Linear CG is an alternative to Gaussian elimination, well suited for large problems. Performance of linear CG is strongly tied to the distribution of the eigenvalues of A. First, we explore the Linear CG method.

The Linear CG Method: Language and Notation

The linear CG method is an iterative method for solving linear systems of equations

  A x = b,  A ∈ Rⁿˣⁿ, x ∈ Rⁿ, b ∈ Rⁿ,

where the matrix A is symmetric positive definite. Notice that this problem is equivalent to minimizing the quadratic

  φ(x) = ½ xᵀ A x - bᵀ x (+ c),

since ∇φ(x) = A x - b. We refer to r(x) = A x - b as the residual of the linear system. Note that
if x* = A⁻¹ b, then r(x*) = 0; i.e. the residual is a measure of how close (or far) we are from solving the linear system.

Conjugate Vectors

Definition: A set of non-zero vectors {p_0, p_1, …, p_{n-1}} is said to be conjugate with respect to the symmetric positive definite matrix A if

  p_iᵀ A p_j = 0,  ∀ i ≠ j.

A set of conjugate vectors {p_0, p_1, …, p_{n-1}} is linearly independent. Why should we care? We can minimize φ(x) in n steps, by successively minimizing along the directions in a conjugate set.

Conjugate Direction Method

Given a starting point x_0 ∈ Rⁿ and a set of conjugate directions {p_0, p_1, …, p_{n-1}}, we generate a sequence of points {x_k} by setting

  x_{k+1} = x_k + α_k p_k,

where α_k is the minimizer of the quadratic function φ along the line { x_k + α p_k }, i.e. of φ(α) = φ(x_k + α p_k). We have already solved this problem in the context of step-length selection for line search methods (see Lecture 6), so we know that the minimizer is given by

  α_k = - r_kᵀ p_k / ( p_kᵀ A p_k ),  where r_k = r(x_k).

Theorem: For any x_0 ∈ Rⁿ, the sequence {x_k} generated by the conjugate direction algorithm converges to the solution x* of the linear system in at most n steps. (The proof indicates how properties of CG are found.)

Proof, Part 1: Since the directions {p_i} are linearly independent, they must span the whole space Rⁿ. Hence, we can write

  x* - x_0 = σ_0 p_0 + σ_1 p_1 + … + σ_{n-1} p_{n-1}

for some choice of scalars σ_i.

Proof, Part 2: If we are
generating x_k by the conjugate direction method, then we have

  x_k = x_0 + α_0 p_0 + α_1 p_1 + … + α_{k-1} p_{k-1}.

We multiply this by p_kᵀ A:

  p_kᵀ A x_k = p_kᵀ A ( x_0 + α_0 p_0 + α_1 p_1 + … + α_{k-1} p_{k-1} );

using the conjugacy property, we see that all but the first term on the right-hand side are zero:

  p_kᵀ A x_k = p_kᵀ A x_0,  i.e.  p_kᵀ A ( x_k - x_0 ) = 0.

Now we have

  p_kᵀ A ( x* - x_0 ) = p_kᵀ A ( x* - x_k ) = p_kᵀ ( b - A x_k ) = - p_kᵀ r_k.

We have shown p_kᵀ A ( x* - x_0 ) = - p_kᵀ r_k. Notice that the right-hand side is the numerator in

  α_k = - r_kᵀ p_k / ( p_kᵀ A p_k ).

We conclude the proof by showing that the left-hand side is the numerator in an expression for σ_k with the same denominator: we premultiply the expression for x* - x_0 by p_kᵀ A and obtain, using conjugacy,

  p_kᵀ A ( x* - x_0 ) = Σ_{i=0}^{n-1} σ_i p_kᵀ A p_i = σ_k p_kᵀ A p_k.

Hence

  σ_k = p_kᵀ A ( x* - x_0 ) / ( p_kᵀ A p_k ) = - r_kᵀ p_k / ( p_kᵀ A p_k ) = α_k.  □
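The n-step convergence just proved is easy to check numerically. A minimal sketch (the 3x3 SPD system below is made up for illustration): build a conjugate set by Gram-Schmidt in the A-inner product, then minimize φ exactly along each direction.

```python
import numpy as np

# A small SPD system A x = b (illustrative values).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
n = len(b)

# Build a conjugate set by A-orthogonalizing the coordinate directions
# (Gram-Schmidt with the inner product <u, v>_A = u^T A v).
P = []
for e in np.eye(n):
    p = e.copy()
    for q in P:
        p = p - (q @ A @ p) / (q @ A @ q) * q
    P.append(p)

# Conjugate direction method: exact 1-D minimization along each p_k.
x = np.zeros(n)
for p in P:
    r = A @ x - b                     # residual r_k = A x_k - b
    alpha = -(r @ p) / (p @ A @ p)    # minimizer of phi along x_k + alpha p_k
    x = x + alpha * p

print(np.linalg.norm(A @ x - b))      # ~0: converged in n = 3 steps
```

After the n-th direction, the residual is zero up to round-off, exactly as the theorem predicts.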
Conjugate Direction Method: Comments and Interpretation (1 of 2)

Most of the proofs regarding CD and CG methods are argued in a similar way, by looking at optimizers and residuals over subspaces of Rⁿ spanned by some subset of a set of conjugate vectors.

Interpretation: If the matrix A is diagonal, then the contours of φ are ellipses whose axes are aligned with the coordinate directions. In this case we can find the minimizer by performing 1D-minimizations along the coordinate directions e_1, e_2, …, e_n in turn.

Conjugate Direction Method: Comments and Interpretation (2 of 2)

Interpretation, ctd: When A is not diagonal, the contours are still elliptical, but are no longer aligned with the coordinate axes. Successive minimization along the coordinate directions e_1, e_2, …, e_n cannot guarantee convergence in n, or even a finite number of, iterations.

Recovering n-step Convergence for Non-Diagonal A

For non-diagonal matrices A, the n-step convergence can be recovered by transforming the problem. Let S ∈ Rⁿˣⁿ be a matrix with conjugate columns, i.e. if {p_0, p_1, …, p_{n-1}} is a set of conjugate directions with respect to A, then

  S = [ p_0 p_1 … p_{n-1} ].

We introduce a new variable x̂ = S⁻¹ x, and thus get the new quadratic objective, which can be minimized in n steps:

  φ̂(x̂) = φ(S x̂) = ½ x̂ᵀ ( SᵀA S ) x̂ - ( Sᵀ b )ᵀ x̂,  with SᵀA S diagonal.

We note that the matrix SᵀA S is diagonal by the conjugacy property, and that each coordinate direction in x̂-space corresponds to the direction S e_i = p_i in x-space. When the matrix is diagonal, each coordinate minimization determines one of the components of the solution. Hence, after k iterations, the quadratic
has been minimized on the subspace spanned by { e_1, e_2, …, e_k }. If we instead minimize along the conjugate directions, then after k iterations the quadratic has been minimized on the subspace spanned by { p_0, p_1, …, p_{k-1} }.

Before we state a fundamental theorem regarding the conjugate direction method, we show the following lemma.

Lemma: Given a starting point x_0 ∈ Rⁿ and a set of conjugate directions { p_0, p_1, …, p_{n-1} }, if we generate the sequence { x_k } by setting

  x_{k+1} = x_k + α_k p_k,  where  α_k = - r_kᵀ p_k / ( p_kᵀ A p_k ),  r_k = A x_k - b,

then the (k+1)st residual is given by the following expression:

  r_{k+1} = r_k + α_k A p_k.

Proof:

  r_{k+1} = A x_{k+1} - b = A ( x_k + α_k p_k ) - b = ( A x_k - b ) + α_k A p_k = r_k + α_k A p_k.  □

Theorem (Expanding Subspace Minimization): Let x_0 ∈ Rⁿ be any starting point, and suppose that the sequence { x_k } is generated by

  x_{k+1} = x_k + α_k p_k,  where  α_k = - r_kᵀ p_k / ( p_kᵀ A p_k ).

Then

  r_kᵀ p_i = 0,  i = 0, 1, …, k-1,

and x_k is the minimizer of φ over the set

  { x : x = x_0 + span{ p_0, p_1, …, p_{k-1} } }.

Proof, Part 1: First we show that a point x̃ minimizes φ over the set x_0 + span{ p_0, …, p_{k-1} } if and only if r(x̃)ᵀ p_i = 0, i = 0, 1, …, k-1. Let

  h(σ) = φ( x_0 + σ_0 p_0 + σ_1 p_1 + … + σ_{k-1} p_{k-1} );

since h(σ) is a strictly convex quadratic, it has a unique minimizer σ* that satisfies

  ∂h(σ*)/∂σ_i = 0,  i = 0, 1, …, k-1.

By the chain rule, this is equivalent to

  ∇φ( x_0 + σ*_0 p_0 + σ*_1 p_1 + … + σ*_{k-1} p_{k-1} )ᵀ p_i = 0,  i = 0, 1, …, k-1.

We recall that ∇φ(x) = A x - b = r(x); thus we have established

  r(x̃)ᵀ p_i = 0, i = 0, 1, …, k-1  ⟺  x̃ minimizes φ over the set x_0 + span{ p_0, …, p_{k-1} }.

Proof, Part 2: We now show that the residuals r_k satisfy r_kᵀ p_i = 0, i = 0, 1, …, k-1. We use mathematical induction. Since α_k is always the 1D-minimizer, we have r_1ᵀ p_0 = 0, establishing the base case. From the inductive hypothesis that r_{k-1}ᵀ p_i = 0, i = 0, 1, …, k-2, we must show that r_kᵀ p_i = 0, i = 0, 1, …, k-1, in order to complete the proof. From the lemma, we have an expression for r_k:

  r_k = r_{k-1} + α_{k-1} A p_{k-1}.

First off, we have

  p_{k-1}ᵀ r_k = p_{k-1}ᵀ r_{k-1} + α_{k-1} p_{k-1}ᵀ A p_{k-1} = 0,

since this is precisely the (1D-minimizer) optimality condition defining α_{k-1}.

Proof, Part 3: Finally,

  p_iᵀ r_k = p_iᵀ r_{k-1} + α_{k-1} p_iᵀ A p_{k-1} = 0,  i = 0, 1, …, k-2,
since p_iᵀ r_{k-1} = 0, i = 0, 1, …, k-2, by the induction hypothesis, and p_iᵀ A p_{k-1} = 0, i = 0, 1, …, k-2, by conjugacy. This establishes

  r_kᵀ p_i = 0,  i = 0, 1, …, k-1,

which completes the proof.  □

Cliff Hangers

Cliff-hanger questions:
- How can we make this useful?
- Given A, how do we get a set of conjugate vectors? (They are not for sale at Costco.)
- Even if we have them, why is this scheme any better than Gaussian elimination?
- Where is the gradient?

Numerical Optimization: Quasi-Newton Methods, The BFGS Method
Peter Blomgren <blomgren.peter@gmail.com>
Department of Mathematics and Statistics, San Diego State University, San Diego, CA 92182-7720
http://terminus.sdsu.edu, Fall 2009

Outline
- Quasi-Newton Methods: Introduction; The BFGS Method

Quasi-Newton Methods

Quasi-Newton methods require only the gradient of the objective to be computed at each iterate. By successive measurements of the gradient, quasi-Newton methods build a model of the objective function which is sufficiently good that superlinear convergence is achieved. Quasi-Newton methods are much faster than steepest descent and coordinate descent methods. Since second derivatives (the Hessian) are not required, quasi-Newton methods are sometimes more efficient than Newton methods, especially when Hessian evaluation is slow/expensive.

The BFGS Method: Introduction

The BFGS method is named for its discoverers (Broyden, Fletcher, Goldfarb, Shanno) and is the most popular quasi-Newton method. We derive the BFGS (and the DFP, a close relative) and look at some properties and practical implementation details. The derivation starts with the quadratic model

  m_k(p) = f(x_k) + ∇f(x_k)ᵀ p + ½ pᵀ B_k p

at the current iterate x_k; B_k is a symmetric positive definite matrix ("model Hessian") that will be updated in every iteration.
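As a quick sanity check of this quadratic model (a sketch; the values of f(x_k), ∇f(x_k), and B_k below are made up for illustration), the model's minimizer is the stationary point of ∇m_k(p) = ∇f(x_k) + B_k p:

```python
import numpy as np

# Illustrative data for one iterate: an SPD B_k and some gradient.
Bk = np.array([[2.0, 0.5],
               [0.5, 1.0]])
gk = np.array([1.0, -2.0])
fk = 3.0

def m(p):
    """The quadratic model m_k(p) = f_k + g_k^T p + 0.5 p^T B_k p."""
    return fk + gk @ p + 0.5 * p @ Bk @ p

p_star = -np.linalg.solve(Bk, gk)   # stationary point: B_k p = -g_k

# Since B_k is positive definite, p_star beats any nearby perturbation.
rng = np.random.default_rng(0)
for _ in range(100):
    assert m(p_star) <= m(p_star + 0.1 * rng.standard_normal(2))
print(p_star)
```

This is exactly the step p_k = -B_k⁻¹ ∇f(x_k) used in the quasi-Newton iteration.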
The BFGS Method: Introduction

Given this convex quadratic model, we can write down the minimizer p_k explicitly as

  p_k = - B_k⁻¹ ∇f(x_k).

We can compute the search direction p_k using e.g. the Cholesky factorization, or a CG iteration; once we have p_k, we find the new iterate

  x_{k+1} = x_k + α_k p_k,

where we require that the step length α_k satisfies e.g. the Wolfe conditions:

  f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f(x_k)ᵀ p_k,
  ∇f(x_k + α_k p_k)ᵀ p_k ≥ c_2 ∇f(x_k)ᵀ p_k,   c_1 ∈ (0, 1), c_2 ∈ (c_1, 1).

So far we have not really done anything new; the key difference compared with the linesearch Newton method is that we are using an approximate Hessian B_k ≈ ∇²f(x_k). Instead of computing a completely new B_k in each iteration, we will update, B_{k+1} = B_k + (something), using information about the curvature at step k. Thus we get a new model

  m_{k+1}(p) = f(x_{k+1}) + ∇f(x_{k+1})ᵀ p + ½ pᵀ B_{k+1} p.

Clearly, for this to make sense, we must impose some conditions on the update.

The BFGS Method: Conditions on B_{k+1}

We impose two conditions on the new model m_{k+1}: it must match the gradient of the objective function at x_k and at x_{k+1}. The second condition is satisfied by construction, since ∇m_{k+1}(0) = ∇f(x_{k+1}). The first condition gives us

  ∇m_{k+1}( -α_k p_k ) = ∇f(x_{k+1}) - α_k B_{k+1} p_k = ∇f(x_k);

with a little bit of re-arrangement we get

  α_k B_{k+1} p_k = ∇f(x_{k+1}) - ∇f(x_k).

We clean up the notation by introducing

  s_k = x_{k+1} - x_k = α_k p_k,  y_k = ∇f(x_{k+1}) - ∇f(x_k).

We can now express the condition on B_{k+1} in terms of s_k and y_k:

  B_{k+1} s_k = y_k.

We will refer to this equation as the Secant Equation. By pre-multiplying the secant equation by s_kᵀ, we see the curvature condition:

  s_kᵀ B_{k+1} s_k = s_kᵀ y_k  ⇒  s_kᵀ y_k > 0 is required (since B_{k+1} is positive definite).

If we impose
the Wolfe (or strong Wolfe) condition on the line search procedure, the curvature condition will always hold, since ∇f(x_{k+1})ᵀ s_k ≥ c_2 ∇f(x_k)ᵀ s_k by the curvature (second) Wolfe condition, and therefore

  y_kᵀ s_k ≥ ( c_2 - 1 ) α_k ∇f(x_k)ᵀ p_k,

where the right-hand side is positive, since c_2 < 1 and p_k is a descent direction. When the curvature condition is satisfied, the secant equation always has at least one solution B_{k+1}.

The BFGS Method: More Conditions on B_{k+1}

It turns out that there are infinitely many symmetric positive definite matrices B_{k+1} which satisfy the secant equation:

  Degrees of freedom: n(n+1)/2 (symmetric matrix).
  Conditions imposed: n (the secant equation), plus positivity of the principal minors (PD).

To determine B_{k+1} uniquely, we must impose additional conditions; we will select the B_{k+1} that is closest to B_k in some sense:

  B_{k+1} = argmin_B ‖ B - B_k ‖ (some norm),  subject to  B = Bᵀ,  B s_k = y_k.

Each choice of matrix norm in this matrix minimization problem (MMP) gives rise to a different quasi-Newton method. The weighted Frobenius norm

  ‖A‖_W = ‖ W^{1/2} A W^{1/2} ‖_F

allows easy solution of the MMP, and gives rise to a scale-invariant optimization method. The matrix W is chosen to be the inverse Ḡ_k⁻¹ of the average Hessian

  Ḡ_k = ∫_0^1 ∇²f( x_k + τ α_k p_k ) dτ.

The DFP Method

With this weighting matrix and norm, the unique solution of the MMP is

  B_{k+1} = ( I - ρ_k y_k s_kᵀ ) B_k ( I - ρ_k s_k y_kᵀ ) + ρ_k y_k y_kᵀ,  ρ_k = 1 / ( y_kᵀ s_k ).

Note that ρ_k is a scalar, and that y_k s_kᵀ, s_k y_kᵀ, and y_k y_kᵀ are rank-one matrices. This is the original Davidon-Fletcher-Powell (DFP) method, suggested by W.C. Davidon in 1959. The original paper describing this revolutionary idea, the first quasi-Newton method, was not accepted for publication; it later appeared in 1991, in the first issue of the SIAM Journal on Optimization. Fletcher and Powell demonstrated
that this algorithm was much faster and more reliable than the existing methods at the time. This revolutionized the field of nonlinear optimization.

The DFP Method: Cleaning Up

The inverse of B_k is useful for the implementation of the method, since it allows the search direction p_k to be computed using a simple matrix-vector product. We let H_k = B_k⁻¹ and use the Sherman-Morrison-Woodbury formula: if A ∈ Rⁿˣⁿ is non-singular, a, b ∈ Rⁿ, and B = A + a bᵀ, then

  B⁻¹ = A⁻¹ - ( A⁻¹ a bᵀ A⁻¹ ) / ( 1 + bᵀ A⁻¹ a ).

With a little bit of linear algebra, we end up with

  H_{k+1} = H_k - ( H_k y_k y_kᵀ H_k ) / ( y_kᵀ H_k y_k ) + ( s_k s_kᵀ ) / ( y_kᵀ s_k ),

where the two correction terms ("update 1" and "update 2") are rank-one matrices, so that H_k undergoes a rank-2 modification in each iteration. This is the fundamental idea of quasi-Newton updating: instead of recomputing the (approximate) matrix inverse from scratch each time around, we apply a simple modification which combines the most recently observed information about the objective with the existing knowledge embedded in the current Hessian approximation.

The BFGS Method: Improving on DFP

The DFP method is quite effective, but once the quasi-Newton idea was accepted by the optimization community, it was quickly superseded by the BFGS method. BFGS updating is derived by, instead of imposing conditions on the Hessian approximations B_k, imposing conditions directly on the inverses H_k: the updated approximation must be symmetric positive definite, and must satisfy the secant equation in the form

  H_{k+1} y_k = s_k  (compare: B_{k+1} s_k = y_k).

We get a slightly different matrix minimization problem:

  H_{k+1} = argmin_H ‖ H - H_k ‖ (some norm),  subject to  H = Hᵀ,  H y_k = s_k.
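The unique solution of this matrix minimization problem is the BFGS inverse update (stated on the next slide), and its defining properties are easy to verify numerically. A sketch with made-up H_k, s_k, y_k satisfying the curvature condition s_kᵀ y_k > 0:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# Arbitrary SPD inverse-Hessian approximation H_k, and a step/gradient-change
# pair (s_k, y_k) with s_k^T y_k > 0 (the curvature condition).
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if s @ y < 0:
    y = -y

rho = 1.0 / (y @ s)
I = np.eye(n)
H_new = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)

print(np.allclose(H_new @ y, s))     # secant equation H_{k+1} y_k = s_k holds
print(np.allclose(H_new, H_new.T))   # symmetry is preserved
```

The update also preserves positive definiteness whenever s_kᵀ y_k > 0, which is exactly what the Wolfe conditions guarantee.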
If we again choose the weighted Frobenius norm (with the same weight), then we get the unique update

  H_{k+1} = ( I - ρ_k s_k y_kᵀ ) H_k ( I - ρ_k y_k s_kᵀ ) + ρ_k s_k s_kᵀ,  ρ_k = 1 / ( y_kᵀ s_k ),

which, translated back to the Hessian approximation, yields

  B_{k+1} = B_k - ( B_k s_k s_kᵀ B_k ) / ( s_kᵀ B_k s_k ) + ( y_k y_kᵀ ) / ( y_kᵀ s_k ).

The BFGS Method: Starting Value H_0

The initial value for the iteration can be selected in different ways:
- a finite difference approximation at x_0;
- H_0 = I, the identity matrix;
- H_0 = diag(s_1, s_2, …, s_n), where s captures the scaling of the variables (if known).

The BFGS Method: Algorithm

Algorithm (The BFGS Method): Given a starting point x_0, a convergence tolerance ε > 0, and an initial inverse Hessian approximation H_0:

  k = 0
  while ‖∇f(x_k)‖ > ε
    p_k = - H_k ∇f(x_k)
    α_k = linesearch( x_k, p_k )
    x_{k+1} = x_k + α_k p_k
    s_k = x_{k+1} - x_k
    y_k = ∇f(x_{k+1}) - ∇f(x_k)
    ρ_k = 1 / ( y_kᵀ s_k )
    H_{k+1} = ( I - ρ_k s_k y_kᵀ ) H_k ( I - ρ_k y_k s_kᵀ ) + ρ_k s_k s_kᵀ
    k = k + 1
  end while

The BFGS Method: Summary

The cost per iteration is:
- O(n²) arithmetic operations;
- one function evaluation;
- one gradient evaluation.

The convergence rate is superlinear. Newton's method converges quadratically, but the cost per iteration is higher: it requires the solution of a linear system. In addition, Newton's method requires the calculation of second derivatives, whereas the BFGS method does not.

The BFGS Method: Stability and Self-Correction

If at some point ρ_k = 1/(y_kᵀ s_k) becomes large, i.e. y_kᵀ s_k ≈ 0, then from the update formula

  H_{k+1} = ( I - ρ_k s_k y_kᵀ ) H_k ( I - ρ_k y_k s_kᵀ ) + ρ_k s_k s_kᵀ

we see that H_{k+1} becomes large. If, for this or some other reason, H_k becomes a poor approximation of [∇²f(x_k)]⁻¹ for some k, is there any hope of correcting it? It has been shown that the BFGS method has self-correcting properties: if H_k incorrectly estimates the curvature of the objective function, and if this estimate slows
down the iteration, then the Hessian approximation will tend to correct itself within a few steps.

The self-correcting properties stand and fall with the quality of the line search; the Wolfe conditions ensure that the model captures appropriate curvature information. (The DFP method is less effective at self-correcting bad Hessian approximations.)

Practical Implementation Details
- The linesearch should always test α = 1 first, because this step length will eventually be accepted, thus creating superlinear convergence.
- The linesearch can be somewhat sloppy: c_1 = 10⁻⁴ and c_2 = 0.9 are commonly used values in the Wolfe conditions.
- The initial matrix H_0 should not be too large: if H_0 = β I, then the first step is p_0 = -β ∇f(x_0), which may be too long if β is large. Often H_0 is rescaled before the update H_1 is computed:

  H_0 ← ( ( y_0ᵀ s_0 ) / ( y_0ᵀ y_0 ) ) I.

Numerical Optimization, Lecture Notes 13: Practical Newton Methods; Inexact Newton Steps; Linesearch Newton Methods
Peter Blomgren <blomgren.peter@gmail.com>
Department of Mathematics and Statistics, Dynamical Systems Group, Computational Sciences Research Center, San Diego State University, San Diego, CA 92182-7720
http://terminus.sdsu.edu, Fall 2009

Outline
- Recap: Nonlinear Conjugate Gradient Methods
- Practical Newton Methods: Introduction; Newton-CG ("Inexact Newton Methods"); Line Search Newton-CG Method; Dealing with Negative Curvature; Modified Newton's Method

Quick Recap: Nonlinear Conjugate Gradient Methods

Extension of the linear CG to work for non-linear optimization problems. In the first pass (Fletcher-Reeves' algorithm), we simply replaced all instances of the residual r_k by the gradient of the objective, ∇f(x_k), and the step length α_k is calculated by a linesearch. We looked at some modifications and
arrived at the Polak-Ribiere (PR-CG) algorithm, where the β of Fletcher-Reeves is modified:

  β_{k+1}^{FR} = ∇f(x_{k+1})ᵀ ∇f(x_{k+1}) / ( ∇f(x_k)ᵀ ∇f(x_k) ),
  β_{k+1}^{PR} = ∇f(x_{k+1})ᵀ ( ∇f(x_{k+1}) - ∇f(x_k) ) / ( ∇f(x_k)ᵀ ∇f(x_k) ),

and the final β is β_{k+1}⁺ = max( β_{k+1}^{PR}, 0 ). Finally, periodic restarting when

  | ∇f(x_k)ᵀ ∇f(x_{k+1}) | / ‖∇f(x_{k+1})‖² ≥ 0.1

was introduced, in order to get good convergence.

Practical Newton Methods: Introduction

We know, e.g. from Lecture 5, that Newton's method has great local convergence properties: once we get close to the minimizer x*, convergence is quadratic. This convergence requires that we start "close enough" to x*; in regions far away, where the objective is not convex, all bets are off, the behavior can be quite erratic, and we cannot guarantee convergence at all. Our present goal: to design a Newton-based method which is robust and efficient in all cases.

The Newton Step

We get the Newton step from the symmetric n x n linear system (the Newton direction):

  ∇²f(x_k) p_k^N = -∇f(x_k).

For global convergence, the Newton direction must be a descent direction; this is true if the Hessian ∇²f(x_k) is positive definite. If the Hessian is not positive definite, the Newton direction may be an ascent direction and/or extremely long (division by almost-zero). We look at two approaches: the first uses the conjugate gradient method, and gives us the Newton-CG methods (for both line search and trust region methods); the second strategy involves modifying the Hessian so that it becomes "sufficiently positive definite," yielding the modified Newton method.

Computational Cost

As always, we want to keep the computational cost down. In Newton-CG, this is accomplished by terminating the computation before an exact solution to

  ∇²f(x_k) p = -∇f(x_k)

has been found. Thus we get an approximation p_k ≈ p_k^N, hence the name "inexact Newton methods." We would like to exploit any special sparsity structure in the Hessian, in order to solve the linear problem as efficiently as possible. For now, we assume we have access to the Hessian in analytical form; we will cover this final weakness soon.
Newton Methods 7 513 Practical Newton Methods 7 618 lnexact Newton Steps The Forcing Sequence Convergence 1 of 3 I I I Selection of the forcing sequence greatly impacts how fast the overall lf We settle for an 39neXaCt appmx39mate SOlUt39On Of algorithm converges as illustrated in the following results 2 N 7 7 V Xklpk 7 VHXquot Theorem then we need a measure of how close our approximate solution ink is to SUPPOSE that the gradient Vail l5 contquotWWW differentable h a the exact Newton direction Another use of the residual quotEighborhOOdV Uta minMiler 52 ahd assume that the HESS3h V2 fx is positive de nite Consider the iteration it xk l m Where Fk szmm Wait Fk ik satisfies lltkll 3 kllv ikllli 77k 6 071 Usually we do not want the termination condition for the lnexactIsolutlon and assume that m S n for some n E OI 1 Them if the starting point to depend on the SlZe of f hence are interested in the relatlve SlZe of h d I d h I d h h x0 ls sufflCIently nearx the sequence xk converges to x linearly t e re5l ua an say t at an apprOXImate so utlon ls goo enoug w en That SI for allsuf ciently large k we have Termination criterion F lt Vf x E O 1 ll kll Jikll kllli nr 7 llxm ll gellxk7xll The sequence m is known as a forcmg sequence for some constant C E OI1I 5331 3935 Practical Newton Methods 7 713 Practical Newton Methods 7 313 The Forcing Sequence Convergence 2 of 3 The Forcing Sequence Convergence 3 of 3 Theorem The preceding theorem is neither very eXIting nor very useful on its own Suppose that the conditions of the preVIous theorem hold and I assume that the iterates it generated by the inexact Newton The restriction on the forcing sequence is very mild we are method converge to it Then the rate of convergence is basically just requiring that we make some progress in solving the Super near fnk a 0 and quadratic ifnk linear system 2 N Now we know exactly how hard we have to work at solving the V kak Tv xk linear systems in order to achieve certain convergence rates eg 
Specifically,

    η_k = min( 0.5, √||∇f(x_k)|| )  ⇒  superlinear convergence,
    η_k = min( 0.5, ||∇f(x_k)|| )   ⇒  quadratic convergence.

The first result, linear convergence, is good news, but hardly anything that causes us to throw a party; however, by carefully selecting the forcing sequence we get the more exciting rates above.

Note: These results are still local; we still have to figure out how to make our algorithms work if not started close to x*.

Line Search Newton-CG Method (1 of 5)

We now have the pieces necessary to build robust Newton methods with good performance characteristics; first on the menu: the line search Newton-CG method. Getting the search direction: we apply the linear Conjugate Gradient (CG) method to the Newton equations

    ∇²f(x_k) p = −∇f(x_k),

and require that the solution satisfy a termination test of the type

    ||r_k|| ≤ η_k ||∇f(x_k)||,   η_k ∈ (0, 1).

However, if the Hessian is not positive definite, this may break.

(2 of 5) If/when the Hessian is not positive definite, we may enter a region of negative curvature; when we do, the CG iteration is terminated in order to guarantee that the generated p_k is a descent direction. We set A = ∇²f(x_k), b = −∇f(x_k), and then start the CG iteration:

1. The starting point is set to z_0 = 0.
2. If a CG-internal search direction d_j generated by the CG iteration satisfies the negative curvature test d_jᵀ A d_j ≤ 0, then: if j = 0, return the steepest descent direction b = −∇f(x_k); otherwise stop immediately and return z_j.
3. The approximate Newton step is p_k = z_j.
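The scheme just described can be sketched in a few lines. This is an illustrative toy implementation (my own naming; a simple Armijo backtracking loop stands in for a full Wolfe linesearch), not reference code from the notes:

```python
import numpy as np

def cg_core(A, b, tol):
    """CG on A z = b with a negative-curvature bailout (illustrative sketch)."""
    z = np.zeros_like(b)
    r = -b                          # residual of A z = b at z = 0
    d = b                           # first CG-internal direction
    for _ in range(2 * len(b)):
        if np.linalg.norm(r) <= tol:
            break
        curv = d @ A @ d
        if curv <= 0:               # negative curvature test
            return b if np.allclose(z, 0) else z   # steepest descent fallback
        alpha = (r @ r) / curv
        z = z + alpha * d
        r_new = r + alpha * (A @ d)
        d = -r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return z

def newton_cg(f, grad, hess, x, iters=200):
    """Line search Newton-CG; Armijo backtracking stands in for Wolfe."""
    for _ in range(iters):
        g = grad(x)
        gn = np.linalg.norm(g)
        if gn < 1e-8:
            break
        eta = min(0.5, np.sqrt(gn))          # forcing sequence (superlinear)
        p = cg_core(hess(x), -g, eta * gn)
        a = 1.0
        while f(x + a * p) > f(x) + 1e-4 * a * (g @ p):
            a *= 0.5                         # backtrack to sufficient decrease
        x = x + a * p
    return x
```

Run on a nonconvex test function of your choice (the Rosenbrock function is a common stress test, and is my choice here, not the notes'), the negative-curvature fallback and the forcing sequence together give robust behavior far from, and fast convergence near, the minimizer.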
(3 of 5) Algorithm (CG-core):

    Given A, b, and a tolerance tol; set z_0 = 0, r_0 = −b, d_0 = b, j = 0
    while ||r_j|| > tol
        if d_jᵀ A d_j ≤ 0                      (negative curvature test)
            if j = 0, return b                 (steepest descent direction)
            else return z_j
        α_j = r_jᵀ r_j / d_jᵀ A d_j            (store the vector A d_j and the scalar r_jᵀ r_j)
        z_{j+1} = z_j + α_j d_j
        r_{j+1} = r_j + α_j A d_j
        β_{j+1} = r_{j+1}ᵀ r_{j+1} / r_jᵀ r_j  (save the numerator for the next iteration)
        d_{j+1} = −r_{j+1} + β_{j+1} d_j
        j = j + 1
    end while
    return z_j

Note: If the CG-core algorithm encounters a direction of negative curvature in the first iteration, the steepest descent direction is used. The d_j's are CG-internal search directions, not to be confused with the search direction p_k for the optimization algorithm.

(4 of 5) Algorithm (Line Search Newton-CG):

    Given x_0; k = 0
    while x_k is not a minimum
        p_k = CG-core( A = ∇²f(x_k), b = −∇f(x_k), tol = η_k ||∇f(x_k)|| )
        α_k = linesearch(...)
        x_{k+1} = x_k + α_k p_k
        k = k + 1
    end while

where we specify η_k as discussed earlier, and the linesearch is such that α_k satisfies the Wolfe, Strong Wolfe, Goldstein, or Armijo backtracking conditions.

(5 of 5) Comments:
- Nothing is stopping us from basing this on the preconditioned version of CG; in fact, that is probably the right thing to do (see the other comments).
- Line Search Newton-CG (LS-N-CG) is well suited for large problems.
- LS-N-CG has one minor weakness: if/when the Hessian is nearly singular, the Newton-CG direction can be excessively long, resulting in many function evaluations in the linesearch. This weakness is greatly alleviated by preconditioning, i.e. implementing LS-N-PCG.

Modified Newton's Method (1 of 2)

Sometimes it is desirable to use a direct linear algebra technique, i.e. an efficient cousin of Gaussian elimination, to solve the Newton equations ∇²f(x_k) p_k = −∇f(x_k). If/when the Hessian is not positive definite, or close to singular, it can be modified, either before or during the solution process, so that in effect we solve

    [∇²f(x_k) + E_k] p_k = −∇f(x_k),   with ∇²f(x_k) + E_k sufficiently positive definite,

where the Hessian modification E_k is chosen so that the resulting matrix is sufficiently positive definite.

(2 of 2) Algorithm (Line Search Newton with Modification):

    Given x_0; k = 0
    while x_k is not a minimum
        B_k = ∇²f(x_k) + E_k    (factorize)
        solve B_k p_k = −∇f(x_k)
        α_k = linesearch(...)
        x_{k+1} = x_k + α_k p_k
        k = k + 1
    end while

(Projects: standard project, separate handout; or other.)

Here the linesearch is such that α_k satisfies the Wolfe, Strong Wolfe, Goldstein, or Armijo backtracking conditions.
The factorization algorithm is such that E_k = 0 if ∇²f(x_k) is sufficiently positive definite, and otherwise E_k is chosen so that B_k is sufficiently positive definite. We save the details of the Hessian modification for the next lecture.

Numerical Optimization: Nonlinear Least Squares

Peter Blomgren <blomgren.peter@gmail.com>
Department of Mathematics and Statistics, Dynamical Systems Group, Computational Sciences Research Center, San Diego State University, San Diego, CA 92182-7720
http://terminus.sdsu.edu
Fall 2009

Outline:
1. Nonlinear Least Squares: The Problem, Formulation and Basics; The Gauss-Newton Method
2. Nonlinear Least Squares: The Levenberg-Marquardt Method

The Nonlinear Least Squares Problem

Problem (Nonlinear Least Squares):

    x* = argmin_x f(x) = argmin_x (1/2) Σ_{j=1}^m r_j(x)²,   m ≥ n,

where the residuals are of the form r_j(x) = y_j − φ(x; t_j). Here the y_j are the measurements, taken at the locations/times t_j, and φ(x; t_j) is our model.

    The Jacobian:  J(x) = [ ∂r_j/∂x_i ],  j = 1, ..., m,  i = 1, ..., n.
    The Gradient:  ∇f(x) = J(x)ᵀ r(x).
    The Hessian:   ∇²f(x) = J(x)ᵀ J(x) + Σ_{j=1}^m r_j(x) ∇²r_j(x).

Algorithms

We now turn our attention to the solution of the nonlinear least squares problem. Of course, we could just use our unconstrained nonlinear minimization methods as-is. However, as always, we would like to implement the most efficient algorithms possible; in order to achieve this, we should take the special structure of the gradient and Hessian into consideration. The first algorithm carries the name of two of the giants of mathematics: Gauss [1] and Newton [2].

[1] Johann Carl Friedrich Gauss, 30 Apr 1777 - 23 Feb 1855.
[2] Sir Isaac Newton, 4 Jan 1643 - 31 Mar 1727.
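The gradient structure ∇f = Jᵀr is easy to exercise numerically. A small sketch for a toy model φ(x; t) = x1·e^{x2·t} (the model, data values, and names are illustrative assumptions, not from the notes); it verifies J(x)ᵀ r(x) against a finite-difference gradient:

```python
import numpy as np

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])      # measurement locations/times t_j
y = np.array([2.1, 2.6, 3.4, 4.1, 5.3])      # measurements y_j (made up)

def residuals(x):                            # r_j(x) = y_j - phi(x; t_j)
    return y - x[0] * np.exp(x[1] * t)

def jacobian(x):                             # J_{ji} = d r_j / d x_i
    return np.column_stack([-np.exp(x[1] * t),
                            -x[0] * t * np.exp(x[1] * t)])

def f(x):                                    # f(x) = (1/2) sum_j r_j(x)^2
    r = residuals(x)
    return 0.5 * (r @ r)

x0 = np.array([2.0, 0.4])
grad = jacobian(x0).T @ residuals(x0)        # grad f(x) = J(x)^T r(x)
```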
The Gauss-Newton Method

The Gauss-Newton method can be viewed as a modification of Newton's method with line search. Instead of generating the Newton search directions p_k as solutions of the linear systems

    ∇²f(x_k) p = −∇f(x_k),

we use the particular form of the Hessian and exclude the second-order term from it. Thus we get the Gauss-Newton search directions p^GN as the solutions of the linear systems

    [ J(x_k)ᵀ J(x_k) ] p^GN = −J(x_k)ᵀ r(x_k).

This simple modification has many advantages over the standard Newton's method.

Advantages (1 and 2):
1. The approximation ∇²f(x_k) ≈ J(x_k)ᵀ J(x_k) saves the work of computing the m individual residual Hessians ∇²r_j(x). If we compute the Jacobian J(x_k) in the process of evaluating the gradient, ∇f(x_k) = J(x_k)ᵀ r(x_k), then this approximation is essentially free.
2. In many situations the term J(x_k)ᵀ J(x_k) dominates Σ_j r_j ∇²r_j, so that the Gauss-Newton method gives performance similar to that of Newton's method even when the latter term is neglected. This happens when the r_j are small (the "small residual" case), or when the r_j are nearly linear, so that ∇²r_j is small. In practice many least-squares problems fall into the first category, and rapid local convergence of Gauss-Newton is observed.

Advantages (3 and 4):
3. When J(x_k) has full rank and ∇f(x_k) ≠ 0, the Gauss-Newton direction p^GN is a descent direction, and works for linesearch:

    (p^GN)ᵀ ∇f(x_k) = (p^GN)ᵀ J(x_k)ᵀ r(x_k) = −(p^GN)ᵀ J(x_k)ᵀ J(x_k) p^GN = −||J(x_k) p^GN||² ≤ 0,

where the final inequality is strict unless we are at a stationary point.
4. The Gauss-Newton equations are very similar to the normal equations for the linear least squares problem: p^GN is the solution of the linear least squares problem

    p^GN = argmin_{p ∈ Rⁿ} ||J(x_k) p + r(x_k)||².

Therefore we can use our knowledge of solutions to the linear least squares problem to find the Gauss-Newton search direction.
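A minimal Gauss-Newton loop for a toy exponential model (an illustrative sketch: unit step lengths, no linesearch, and a starting guess near the solution, since plain Gauss-Newton is only locally convergent). Each step solves the linear least squares problem min_p ||J_k p + r_k|| directly rather than forming the normal equations:

```python
import numpy as np

t = np.linspace(0.0, 2.0, 8)
x_true = np.array([2.0, 0.7])
y = x_true[0] * np.exp(x_true[1] * t)        # zero-residual synthetic "data"

def residuals(x):                            # r_j = y_j - phi(x; t_j)
    return y - x[0] * np.exp(x[1] * t)

def jacobian(x):                             # J_{ji} = d r_j / d x_i
    return np.column_stack([-np.exp(x[1] * t),
                            -x[0] * t * np.exp(x[1] * t)])

x = np.array([1.8, 0.6])                     # start near the solution
for _ in range(20):
    r, J = residuals(x), jacobian(x)
    p, *_ = np.linalg.lstsq(J, -r, rcond=None)   # GN direction via linear LSQ
    x = x + p                                # unit step (no linesearch)
```

On zero-residual data like this, the neglected term Σ r_j ∇²r_j vanishes at the solution, so the rapid local convergence described above is visible after just a few iterations.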
The Gauss-Newton / Linear Least Squares Connection

The connection to the linear least squares problem shows another motivation for the Gauss-Newton step. Instead of building a quadratic model of the objective f, we form a linear model of the vector function r(x):

    r(x + p) ≈ r(x) + J(x) p.

The step p^GN is obtained by using this model in the expression f(x + p) = (1/2)||r(x + p)||² and minimizing over p.

Searching in the Gauss-Newton Direction: Convergence

Once we have the Gauss-Newton direction, we perform a linesearch, thus identifying a step α_k which satisfies the Wolfe, Strong Wolfe, or Goldstein conditions. The theory from our discussion on linesearch can be applied to guarantee global convergence of the Gauss-Newton method. For the proof we need the following:

Assumption: The singular values of the Jacobians J(x) are bounded away from zero, i.e.

    ||J(x) z|| ≥ γ ||z||,   γ > 0,

for all x in a neighborhood of the level set L = { x ∈ Rⁿ : f(x) ≤ f(x_0) }, where x_0 is the starting point.

Gauss-Newton Convergence Theorem

Theorem: Suppose that each residual function r_j is Lipschitz continuously differentiable in a neighborhood N of the level set L = { x ∈ Rⁿ : f(x) ≤ f(x_0) }, and that the Jacobians satisfy the uniform full-rank condition ||J(x) z|| ≥ γ ||z||, γ > 0. Then, if the iterates x_k are generated by the Gauss-Newton method with step lengths α_k that satisfy the Wolfe conditions, we have

    lim_{k→∞} J(x_k)ᵀ r(x_k) = 0.

Gauss-Newton: Breaking the Convergence Theorem

If the Jacobian is rank deficient, i.e. the condition ||J(x) z|| ≥ γ ||z||, γ > 0 fails, then the coefficient matrix J(x_k)ᵀ J(x_k) is singular. However, the system

    J(x_k)ᵀ J(x_k) p^GN = −J(x_k)ᵀ r(x_k)

still has a least-squares solution, since it is equivalent to the minimization problem

    p^GN = argmin_{p ∈ Rⁿ} ||J(x_k) p + r(x_k)||².
In this case there are infinitely many solutions: adding to p^GN any vector in the null space of J(x_k) leaves the residual unchanged. The convergence result fails, since the search direction may become perpendicular to ∇f(x_k); thus the Zoutendijk condition

    Σ_{k≥0} cos²(θ_k) ||∇f(x_k)||² < ∞

does not show convergence.

Gauss-Newton: Local Convergence Rate

The convergence rate of Gauss-Newton depends on how much the term J(x_k)ᵀ J(x_k) dominates the neglected term in the Hessian. Near x* the step α_k = 1 will be accepted, and we have

    x_k + p^GN − x* = x_k − x* − [J(x_k)ᵀ J(x_k)]⁻¹ J(x_k)ᵀ r(x_k)
                    = [J(x_k)ᵀ J(x_k)]⁻¹ [ J(x_k)ᵀ J(x_k) (x_k − x*) − ∇f(x_k) ].

If we let H(x) = Σ_j r_j(x) ∇²r_j(x) represent the neglected term in the Hessian, we can show, using Taylor's theorem, that

    ||x_{k+1} − x*|| ≈ ||[J(x*)ᵀ J(x*)]⁻¹ H(x*) (x_k − x*)|| + O(||x_k − x*||²).

We therefore have the result

    ||x_{k+1} − x*|| ≲ ||[J(x*)ᵀ J(x*)]⁻¹ H(x*)|| ||x_k − x*|| + O(||x_k − x*||²).

- Unless s* := ||[J(x*)ᵀ J(x*)]⁻¹ H(x*)|| < 1, we cannot expect Gauss-Newton to converge at all.
- In the small residual case it is usually true that s* ≪ 1, and we have very rapid convergence [1] of the Gauss-Newton method.
- When H(x*) = 0, the rate of convergence is quadratic.

[1] Note that the convergence rate is actually linear, but we do not feel the linear term until ||x_k − x*|| is small; if s*·||x_k − x*|| is smaller than the required tolerance, we do not notice the slow-down of the method.

The Levenberg-Marquardt Method: Introduction

- Levenberg and Marquardt are slightly less famous than Gauss and Newton.
- The Levenberg-Marquardt method is the trust-region equivalent of the Gauss-Newton method.
- The Levenberg-Marquardt method avoids one of the weaknesses of the Gauss-Newton method: the global convergence behavior when the Jacobian J(x) is rank-deficient, or nearly rank-deficient (see slide 11).
- Since the second-order Hessian component is still ignored, the local convergence properties of the LM- and GN-methods are similar.
- The original description of the LM-method, first published in 1944, did not make the connection with the trust-region concept; the connection was made by Moré in 1978.

The Levenberg-Marquardt Method

For a spherical trust region, the subproblem to be solved at iteration k is

    p^LM = argmin_{p ∈ Rⁿ} (1/2) ||J(x_k) p + r_k||²   subject to   ||p|| ≤ Δ_k.

This corresponds to the model function

    m_k(p) = (1/2)||r_k||² + pᵀ J(x_k)ᵀ r_k + (1/2) pᵀ J(x_k)ᵀ J(x_k) p,

where we can identify the Hessian approximation as B_k = J(x_k)ᵀ J(x_k).

Levenberg-Marquardt vs. Gauss-Newton

If the Gauss-Newton step p^GN lies inside the trust region, i.e. ||p^GN|| ≤ Δ_k, then p^LM = p^GN. Otherwise there is a λ > 0 such that ||p^LM|| = Δ_k and

    [ J(x_k)ᵀ J(x_k) + λ I ] p^LM = −J(x_k)ᵀ r_k.

This can be interpreted as the normal equations for the linear least squares problem

    p^LM = argmin_{p ∈ Rⁿ} (1/2) || [J(x_k); √λ I] p + [r_k; 0] ||².

As in the Gauss-Newton case, this equivalence yields a way of solving the subproblem without computing the matrix-matrix product J(x_k)ᵀ J(x_k) and its Cholesky factorization.

The Levenberg-Marquardt Method: Implementation

In order to find the value of λ which gives ||p^LM(λ)|| = Δ_k, we can apply the root-finding algorithm from Lecture 9 (Nearly Exact Solutions to the Subproblem). However, due to the special structure B_k = J(x_k)ᵀ J(x_k), the approximate Hessian is always positive semidefinite; hence the Cholesky factorization cannot break down for any λ > 0. The structure of B_k can further be exploited to get the Cholesky factorization

    R_λᵀ R_λ = Jᵀ J + λ I

by means of the QR factorization

    [ R_λ; 0 ] = Q_λᵀ [ J; √λ I ],   Q_λ orthogonal, R_λ upper triangular.
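The two characterizations above are numerically equivalent, which a short sketch (random illustrative data, my own setup) can confirm: for a fixed λ > 0, the step from the augmented least squares problem matches the normal-equations solve:

```python
import numpy as np

rng = np.random.default_rng(0)               # illustrative random J and r
J = rng.standard_normal((10, 3))
r = rng.standard_normal(10)
lam = 0.25                                   # some fixed lambda > 0
n = J.shape[1]

# Augmented linear least squares:  min_p || [J; sqrt(lam) I] p + [r; 0] ||
A = np.vstack([J, np.sqrt(lam) * np.eye(n)])
b = np.concatenate([-r, np.zeros(n)])
p_aug, *_ = np.linalg.lstsq(A, b, rcond=None)

# Normal equations:  (J^T J + lam I) p = -J^T r
p_neq = np.linalg.solve(J.T @ J + lam * np.eye(n), -J.T @ r)
```

The augmented form is the numerically preferable route, since it avoids squaring the condition number by never forming JᵀJ.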
We defer the discussion of the most efficient way of implementing the QR factorization to another forum (like Math 543).

Least squares problems are often poorly scaled; the separation of scales between different parameters can easily be several orders of magnitude. This is a scenario where the use of scale factors and ellipsoidal trust regions

    ||D p|| ≤ Δ_k,   D = diag(d_1, d_2, ..., d_n),

where the diagonal matrix D captures the scales of the various parameters, is warranted (see Lecture 10, Trust-Region Methods: Global Convergence and Enhancements).

In the scaled case, the trust-region subproblem and the two characterizations of the solution become

    p^LM = argmin_{p ∈ Rⁿ} (1/2) ||J(x_k) p + r_k||²   subject to   ||D_k p|| ≤ Δ_k,

    [ J(x_k)ᵀ J(x_k) + λ D_k² ] p^LM = −J(x_k)ᵀ r_k,

    p^LM = argmin_{p ∈ Rⁿ} (1/2) || [J(x_k); √λ D_k] p + [r_k; 0] ||².

- The convergence properties are preserved even though D_k is allowed to change from iteration to iteration (within certain bounds).
- If D_k² = diag( J(x_k)ᵀ J(x_k) ), then the algorithm is invariant under diagonal scaling of x.

Large Residual Problems

In all the previous discussion we have assumed, explicitly or implicitly, that the approximation ∇²f(x) ≈ J(x)ᵀ J(x) works well. However, when the neglected second-order part of the Hessian,

    H(x) = Σ_{j=1}^m r_j(x) ∇²r_j(x),

is large, the approximation does not produce good results. The question looms large: what should be done in this case?

Large Residual Problems: Solution 1, Head in the Sand

In statistical applications the importance of such problems is downplayed; the argument is that if the residuals are large at the solution, then the model is not good enough. However, often large residuals are caused by outliers, caused by anomalous readings. Solving the least squares problem may help us identify the outliers.
These can then be treated in an appropriate way. Both the Gauss-Newton and Levenberg-Marquardt methods perform poorly in the large residual case: the local convergence is linear, and thus slower than general-purpose algorithms for unconstrained optimization.

Solution 2: Hybrid Methods, Type 1. In a hybrid method we start with Gauss-Newton or Levenberg-Marquardt, which have much lower cost per iteration compared to general-purpose methods, and if it turns out that the residuals at the solution are large, we switch to a quasi-Newton or Newton method.

Solution 3: Hybrid Methods, Type 2. It is also possible to combine Gauss-Newton and quasi-Newton ideas to maintain approximations to the second-order part of the Hessian. I.e., we maintain a sequence S_k ≈ Σ_j r_j(x_k) ∇²r_j(x_k) (think BFGS-style), and then use the overall Hessian approximation

    B_k = J(x_k)ᵀ J(x_k) + S_k

in a trust-region or linesearch calculation. This can get arbitrarily complex.

Numerical Optimization: Lecture Notes 17, Calculating Derivatives: Automatic Differentiation

Peter Blomgren <blomgren.peter@gmail.com>
Department of Mathematics and Statistics, Dynamical Systems Group, Computational Science Research Center, San Diego State University, San Diego, CA 92182-7720
http://terminus.sdsu.edu
Fall 2009

Outline:
1. Derivatives: Recap, Finite Differences; Automatic Differentiation
2. Automatic Differentiation: Extended Example; Comments, Extensions, and Limitations

Derivatives Needed, Continued

So far we have looked at using finite difference methods for computing good numerical estimates for missing derivatives. The finite difference approach comes at the price of (i) introducing some numerical error in the derivative expressions, and (ii) a need for extra evaluations of the function (objective) and/or the gradient.
Finite difference approximations work very well, but if the analytical expressions for the gradient and Hessian can be provided, that is the way to go. Next, we look at automatic differentiation techniques.

Automatic Differentiation: Math/CS "White Magic"

The AD tools collection at http://www-unix.mcs.anl.gov/autodiff/AD_Tools says the following about automatic differentiation:

"Automatic differentiation (AD) is a technique for augmenting computer programs with derivative computations. It exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations, such as additions, or elementary functions, such as exp(). By applying the chain rule of derivative calculus repeatedly to these operations, derivatives of arbitrary order can be computed automatically, and accurate to working precision."

The same site provides links and references to usable tools for AD for Fortran 77, Fortran 90, ANSI C, and C++. We will here take a very brief look at AD; even more references and pointers can be found at http://www.autodiff.org.

The Chain Rule: Forgotten Calculus

The chain rule, in its full vector glory (gory?), takes the form:

The Chain Rule. If h: R^m → R and y: R^n → R^m, then for x ∈ R^n we can write

    ∇_x h(y(x)) = Σ_{i=1}^m (∂h/∂y_i) ∇y_i(x).

Automatic differentiation is essentially applying the chain rule at the code level. There are two modes of AD: the forward and the reverse mode.

Automatic Differentiation Example (1 of 12)

We follow the example from Nocedal & Wright. We consider a function f: R³ → R,

    f(x) = ( x1 x2 sin(x3) + e^{x1 x2} ) / x3.

This function can be evaluated in several different ways; certain rules/restrictions apply, e.g. the multiplication x1·x2 must take place before the exponentiation e^{x1 x2}. We illustrate one possibility with a partially ordered computational graph and several intermediate variables.

Some graph-theoretical language: node i is the parent of node j, and hence node j the child of node i, if there is a directed arc from i to j. A node can be evaluated when all its parents are known, so the computation flows left to right (see the next slide); this is known as a forward sweep.
(2 of 12) [Figure: the computational graph associated with the function f(x) = (x1 x2 sin(x3) + e^{x1 x2}) / x3.]

We have introduced the intermediate variables

    x4 = x1 · x2,   x5 = sin(x3),   x6 = e^{x4},   x7 = x4 · x5,   x8 = x6 + x7,   x9 = x8 / x3.

Note: in AD the software, not the user, identifies, either explicitly or implicitly, the intermediate steps.

Forward Mode (3 of 12)

In forward mode, a directional derivative in the direction p ∈ Rⁿ of each intermediate value x_i is evaluated and carried forward simultaneously with the evaluation of x_i itself. Notation: the directional derivative for p ∈ Rⁿ associated with x_i is

    D_p x_i = (∇x_i)ᵀ p = Σ_{j=1}^n (∂x_i/∂x_j) p_j.

In our example n = 3, i = 1, 2, ..., 9, and our goal is to evaluate D_p x9 ≡ ∇f(x)ᵀ p. Note that D_p x1 = p1, D_p x2 = p2, and D_p x3 = p3; p is known as a seed vector.

(4 of 12) As soon as the value of x_i is known at a node, we can find D_p x_i using the chain rule. E.g., when x7 = x4 · x5 is known, we have

    ∇x7 = (∂x7/∂x4) ∇x4 + (∂x7/∂x5) ∇x5 = x5 ∇x4 + x4 ∇x5;

hence the directional derivative

    D_p x7 = x5 D_p x4 + x4 D_p x5

can be evaluated. The accumulation of the directional derivatives follows the forward sweep, and at the end we have D_p x9 = D_p f = ∇f(x)ᵀ p. The key to the practical implementation of forward-mode automatic differentiation is the concurrent evaluation of x_i and D_p x_i.

(5 of 12) To obtain the complete gradient vector, the procedure is carried out simultaneously for the n seed vectors p = e1, ..., en. The computational cost can be significant: for instance, a single division operation w/y induces approximately 2n multiplications and n additions in the gradient calculation. An exact bound on the increase in computation, especially for a complicated expression, is hard to obtain, since we have to take into account the cost of storing and retrieving the extra quantities D_p x_i.
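Forward mode is straightforward to sketch with a "dual number" class that carries the pair (x_i, D_p x_i) through each operation. This toy implementation (my own, not the notes') evaluates the example function and extracts ∂f/∂x1 at (1, 2, π/2) with the seed p = e1:

```python
import math

class Dual:
    """Carries a value and its directional derivative D_p through each op."""
    def __init__(self, v, d):
        self.v, self.d = v, d
    def __add__(self, o):
        return Dual(self.v + o.v, self.d + o.d)
    def __mul__(self, o):                     # product rule
        return Dual(self.v * o.v, self.v * o.d + self.d * o.v)
    def __truediv__(self, o):                 # quotient rule
        return Dual(self.v / o.v, (self.d * o.v - self.v * o.d) / o.v ** 2)

def dsin(a):
    return Dual(math.sin(a.v), math.cos(a.v) * a.d)

def dexp(a):
    return Dual(math.exp(a.v), math.exp(a.v) * a.d)

def f(x1, x2, x3):
    x4 = x1 * x2                              # the graph's intermediate nodes
    x5 = dsin(x3)
    x6 = dexp(x4)
    x7 = x4 * x5
    x8 = x6 + x7
    return x8 / x3                            # x9

# Seed p = e1 extracts df/dx1; three runs with e1, e2, e3 give the gradient.
out = f(Dual(1.0, 1.0), Dual(2.0, 0.0), Dual(math.pi / 2, 0.0))
```

Operator overloading like this is exactly the "transparent extension of data structures and operations" route mentioned below for C++.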
Forward-mode AD can be implemented in terms of a pre-compiler which transforms function-evaluation code into augmented code that evaluates derivatives as well. Alternatively, operator overloading in e.g. C++ can be used to transparently extend data structures and operations to perform the necessary bookkeeping and computations.

Reverse Mode (6 of 12)

In reverse mode AD, the function value f is first computed in a forward sweep; then, in a second reverse sweep, the derivatives of f with respect to each variable x_i (independent and intermediate) are recovered. We associate an adjoint variable x̄_i with each node in the computational graph; in these adjoint variables we accumulate the partial derivative information ∂f/∂x_i during the reverse sweep. They are initialized x̄_1 = x̄_2 = ... = x̄_8 = 0 and x̄_9 = 1.

(7 of 12) The reverse sweep is built on the chain rule:

    x̄_i = ∂f/∂x_i = Σ_{j a child of i} x̄_j (∂x_j/∂x_i).

When a term in the sum becomes known, we accumulate it in x̄_i. Once x̄_i has received contributions from all its children, it is finalized, and can communicate its value to its parents. During the reverse sweep we work directly with numerical values, not with formulas or computer code describing the variables x_i and/or the partial derivatives ∂f/∂x_i. Let's work through the example.

(8 of 12) [Figure: forward sweep, the computational graph evaluated at (x1, x2, x3) = (1, 2, π/2).] Sweep complete:

    x1 = 1,  x2 = 2,  x3 = π/2,  x4 = 2,  x5 = 1,  x6 = e²,  x7 = 2,  x8 = e² + 2,  x9 = (2e² + 4)/π.

After the forward sweep is complete, we initialize the adjoint variables; in particular x̄_9 = 1, and node 9 is finalized (it has no children).

(9 of 12) The reverse sweep: node 9 is the child of nodes 3 and 8; we update the associated adjoint variables, finalizing node 8:

    x̄_3 += x̄_9 (∂x9/∂x3) = −x8/x3² = −(4e² + 8)/π²,
    x̄_8 = x̄_9 (∂x9/∂x8) = 1/x3 = 2/π.
(10 of 12) We now update the parents of node 8, nodes 6 and 7, finalizing both:

    x̄_6 = x̄_8 (∂x8/∂x6) = 2/π,   x̄_7 = x̄_8 (∂x8/∂x7) = 2/π.

(11 of 12) Next we use nodes 6 and 7 to finalize nodes 4 and 5. With x6 = e^{x4} and x7 = x4 x5:

    x̄_4 = x̄_6 e^{x4} + x̄_7 x5 = (2/π) e² + (2/π) · 1 = (2e² + 2)/π,
    x̄_5 = x̄_7 x4 = (2/π) · 2 = 4/π.

(12 of 12) We now have the final pieces to complete the reverse sweep. With x4 = x1 x2 and x5 = sin(x3):

    x̄_1 = x̄_4 x2 = (4e² + 4)/π,
    x̄_2 = x̄_4 x1 = (2e² + 2)/π,
    x̄_3 += x̄_5 cos(x3) = (4/π) cos(π/2) = 0,

so x̄_3 = −(4e² + 8)/π² is unchanged. We have

    ∇f(1, 2, π/2) = ( (4e² + 4)/π,  (2e² + 2)/π,  −(4e² + 8)/π² ).

Reverse Mode Automatic Differentiation: Comments

- Main appeal of reverse-mode AD: low computational complexity for scalar functions; the extra arithmetic is at most 4-5 times the pure function evaluation.
- Main drawback of reverse-mode AD: the entire computational graph must be stored for the reverse sweep. Implementation and access patterns are quite straightforward, but storage can be a problem: if each node can be stored in 20 bytes, then a function that requires one second of evaluation on a 100-megaflop computer may generate a graph which requires 2 GB of storage.
- The storage requirement can be reduced, at the cost of extra arithmetic, by partial forward and reverse sweeps, re-evaluating portions instead of storing the whole graph. This is known as checkpointing.

Extending AD to Vector Functions

The idea of automatic differentiation is quite easily extended to vector-valued functions F: Rⁿ → R^m, so that the Jacobian matrix of first derivatives

    J(x) = [ ∂F_i/∂x_j ],   i = 1, ..., m,  j = 1, ..., n,

can be formed using automatic differentiation. All we need is additional bookkeeping for the m components of F.
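The adjoint sweep worked above can be reproduced with a tiny "tape" that records, for each node, its parent indices and local partial derivatives (an illustrative sketch of the idea, not production AD code):

```python
import math

vals, tape = [], []                 # node values; (parents, partials) per node

def node(v, parents=(), partials=()):
    vals.append(v)
    tape.append((parents, partials))
    return len(vals) - 1

# Forward sweep at (x1, x2, x3) = (1, 2, pi/2), recording local partials.
x1 = node(1.0); x2 = node(2.0); x3 = node(math.pi / 2)
x4 = node(vals[x1] * vals[x2], (x1, x2), (vals[x2], vals[x1]))
x5 = node(math.sin(vals[x3]), (x3,), (math.cos(vals[x3]),))
x6 = node(math.exp(vals[x4]), (x4,), (math.exp(vals[x4]),))
x7 = node(vals[x4] * vals[x5], (x4, x5), (vals[x5], vals[x4]))
x8 = node(vals[x6] + vals[x7], (x6, x7), (1.0, 1.0))
x9 = node(vals[x8] / vals[x3], (x8, x3),
          (1.0 / vals[x3], -vals[x8] / vals[x3] ** 2))

# Reverse sweep: seed the output adjoint, then push adjoints to parents.
adj = [0.0] * len(vals)
adj[x9] = 1.0
for i in reversed(range(len(vals))):
    parents, partials = tape[i]
    for p, d in zip(parents, partials):
        adj[p] += adj[i] * d        # x_bar_p += x_bar_i * (d x_i / d x_p)

grad = (adj[x1], adj[x2], adj[x3])
```

The reverse loop visits nodes in the opposite order of construction, so each adjoint is finalized before it is propagated to its parents, exactly as in the hand computation.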
Extending AD to the Hessian: 2nd Derivatives

The technique can also be adapted to evaluate the Hessian. In forward mode we need two seed vectors, p and q, and we define

    D_{pq} x_i = pᵀ ∇²x_i q;

these 2nd-derivative quantities are propagated using the chain rule, just like the 1st derivatives were propagated in our previous example. At the end of the sweep, the terminal node contains the value of

    D_{pq} f = pᵀ ∇²f(x) q.

Each pair (p, q) = (e_i, e_j) gives the (i, j) entry of the Hessian matrix.

Forward Mode Hessian

Due to symmetry we only need to compute the j ≤ i entries H_ij, and it is quite clear how to exploit symmetry. The increase in the number of arithmetic computations, compared with the evaluation of f alone, is a multiplicative factor of roughly 1 + n + N_z(∇²f), where N_z(∇²f) is the number of entries of the Hessian we decide to compute. In Newton-CG algorithms we only need the effect of the Hessian-vector product ∇²f(x) v; in this case we simply fix the second seed vector q ≡ v and compute the n entries

    e_jᵀ ∇²f(x) v = [ ∇²f(x) v ]_j,   j = 1, ..., n,

using the forward sweep.
can be applied are small lTXl lt e we cannot control the derivative gradient s s successfully In many applications Involvmg highly complex 7x hence errors In AD computed f x can potentially be functions large 9 AD facilitates interpretations of the computed optimal solutions N Perverse code Due to branching if statements etc a function evaluation may be equivalent to the following valid but nasty piece of code if x 710 theuf 00 e15e r x 710 Automatic differentiation would most likely give us fl O which does not seem like a Good ldeaTM m Automatic Differentiation 7 2324 allowing the user to extract more information from the result 9 AD does not absolve the user altogether from the respon sibility of thinking about derivative calculations and cor rectly interpreting and validating the results 5353 Automatic Differentiation 7 2424 i we r tr iiwii w er N v mi 6 s 1 M 1 Hirv39 Peter Blomgren blomgren peter gmail com Department of Mathematics and Statistics Dynamicai Systems Group Computationai Sciences Research Center San Diego State University San Diego CA 9218277720 e r cerminussi uedu M Fall 2009 Peter Blom r fri quotw In irmlu Linn Snarch Math 0 Introduction 0 Recap 0 Fundamentals Rate of Convergence a Line Search Methods 0 Search Direction Steepest Descent Newton or Other7397 0 Step Length Selection 7 1D Minimization 0 Step Length Selection 7 The Wolfe Conditions Peter Blom Line Search Metho Introduction Linn Snarch Methods R tarp Legit mm Some fundamental building blocks of unconstrained optimization l Tm WW For some t 6 01 we have 1 i 3 i J Vfi 7 lvzai rm 3 V 2 gradient Hessa n 4 theorems relating f0 and its derivatives to optimal solutions 1 32 optimal gt Vf 0 2 32 optimal gt Vf 0 and V2fi positive semiideflnite 3 Vfi 0 and V2f positive definite gt 52 optimal 4a f convex and 32 local optimum gt 32 global optimum 4b f convex and Vf 0 gt 52 global optimum Note The complete statement of the theorems require sufficient smoothness eXi tence of derivatives of f 
Rate of Convergence

Suppose the sequence {β_n} converges to zero, and {x̄_n} converges to a point x̄. If there exists K > 0 such that

    ||x̄_n − x̄|| ≤ K β_n   for n > N, i.e. for n large enough,

then we say that {x̄_n} converges to x̄ with a rate of convergence O(β_n) ("big-oh of β_n"); we write x̄_n = x̄ + O(β_n). Note: the sequence {β_n} is usually chosen to be β_n = 1/n^p for some value of p.

Rates of Convergence

Let {x̄_n} be a sequence converging to x̄; the convergence rate is said to be

- Q-linear ("quotient-linear") if there exist r ∈ (0, 1) and K ∈ Z such that

    ||x_{k+1} − x̄|| / ||x_k − x̄|| ≤ r,   for all k ≥ K;

- Q-superlinear if

    lim_{k→∞} ||x_{k+1} − x̄|| / ||x_k − x̄|| = 0;

- Q-quadratic if there exist r ∈ R₊ and K ∈ Z such that

    ||x_{k+1} − x̄|| / ||x_k − x̄||² ≤ r,   for all k ≥ K.

Line Search Methods

We now focus on line search methods, where we pick a search direction p_k and solve the one-dimensional problem

    min_{α > 0} f(x_k + α p_k).

The solution gives us an optimal value for α_k, so the next point is given by x_{k+1} = x_k + α_k p_k, where α_k is known as the step length. In order for a line search method to work well, we need good choices of the direction p_k and the step length α_k.

The Steepest Descent Direction

The intuitive choice for p_k is to move in the direction of steepest descent, i.e. in the negative gradient direction. Going back to the Taylor expansion

    f(x + α p) ≈ f(x) + α pᵀ ∇f(x),

we immediately see that the direction of most rapid decrease is found from

    min_{||p||=1} pᵀ ∇f(x) = min_θ cos(θ) ||p|| ||∇f(x)|| = −||∇f(x)||,

which is achieved when θ = π, i.e.

    p = −∇f(x) / ||∇f(x)||.

(Recall: aᵀ b = cos(θ) ||a|| ||b||, where θ is the angle between the vectors a and b.)
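The Q-rate definitions are easy to see on toy error sequences (my own illustrative choices, not from the notes):

```python
# e_k = 0.5^k: the quotient e_{k+1}/e_k is the constant 1/2 (Q-linear).
linear = [0.5 ** k for k in range(10)]

# e_k with e_{k+1} = e_k^2: the quotient e_{k+1}/e_k^2 stays bounded
# (here exactly 1), the hallmark of Q-quadratic convergence.
quadratic = [0.5 ** (2 ** k) for k in range(5)]

lin_ratios = [linear[k + 1] / linear[k] for k in range(len(linear) - 1)]
quad_ratios = [quadratic[k + 1] / quadratic[k] ** 2
               for k in range(len(quadratic) - 1)]
```

Note how the quadratic sequence roughly doubles the number of correct digits per step, while the linear one gains them at a fixed rate.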
7T wcos 9ii 7ii iiwii m Peter Blomgren blow Search Direclio Sleepesl Descent lnuedl ion Line Search Me Is Newton or Other mm H v Newton Direction If f is smooth enough and the Hessian is positive definite we can select pk to be the Newton direction We write down the second order Taylor expansion fgt39lt 3 z f0 TVfgt39lt if Warm 3 We seek the minimum of the right hand side by computing the derivative width respect to 3 and set the result to zero Vf lt vzf qi 3 0 which gives the Newton direction 3 7 Warm 71 Vf lt 5351 Peter Blomgren Convergence Line Search Methods m Introduction u or Other Line Search Methods Newton Direction As long as the Hessian is positive definite SN is a descent direction 7T NVfb 7VfquotltT rm Vf lt lt 0 hf a Pos Def Note Clearly the Newton direction is more expensive than the steepest descent direction we must compute the Hessian matrix V2fi and invert it Le solve an n x n linear system Note The convergence rate for steepest descent methods is linear and for Newton methods it is quadratic hence there is a lot to gain by finding the Newton direction m Peter Blomgren Convergence Line Search Methods In 39rediiction Line Search Methods Search Direclio eeriest Descent Newton or Other iii 7 i whim w Example NW 22 p 30 Problem Showthat the function fx 8x12yx272y2 has only one stationary point and that it is neither a max imum nor a minimum but a saddle point Sketch the contours for f Solution The gradient of f is 7 82x Vfii1274y which has the stationary point Xy 743 Since the Hessian 2 7 2 0 v r 7 i 0 74 Figure The function fx around has both positive and negative eigenvalues the stationary Mestat onaw pom point must be a saddle point 5351 Search Melho ls Introduction Line Search Methods on or Other Example NW1E39 22 p 30 If we start an iteration in X07y0 07 0 The steepest descent direction is Sbii if 82x if 8 PO W i1274y and the Newton direction is ii liEH Peter Blomgren Convergence Line Search Methods Search Directio Steepest Descent Newton or 
Figure: The Newton and steepest descent directions starting at (0, 0). Note that the Newton method is heading for the saddle point, but the steepest descent method will in general not converge to a non-minimizing stationary point.

Line Search Methods: Search Directions.

  Method           | Search Direction                  | Convergence
  Steepest Descent | p_k = −∇f(x_k)/||∇f(x_k)||        | Linear
  Quasi-Newton     | p_k = −H_k⁻¹ ∇f(x_k)              | Superlinear
  Newton           | p_k = −[∇²f(x_k)]⁻¹ ∇f(x_k)       | Quadratic

Table: Summary of search directions for different schemes.

In quasi-Newton schemes we do not explicitly compute the Hessian ∇²f(x_k) in each iteration; instead, we use an approximation H_k ≈ ∇²f(x_k) which is updated in some clever way. (To be explored in great detail later.)

We will return to the selection of p_k, but let's consider the computation of the step length α_k.

Step Length Selection. Given a descent direction p_k, we would like to find the global minimizer α* of

  min_{α > 0} f(x_k + α p_k).

As this is just one of possibly many steps in the iteration, it is not wise to expend too much time on finding α_k. We are faced with a trade-off: we want an α_k that gives a substantial reduction in the objective, and we want to find α_k fast. In practice we perform an inexact line search, settling for an α_k which gives an adequate reduction in the objective.

What is "adequate reduction"? Simple decrease is not enough. Consider a one-dimensional objective and the iterates

  x_k = 1, −0.8, 0.7, −0.65, 0.625, −0.6125, 0.60625, …,

generated by the alternating descent directions p_k = (−1)^{k+1} and a particular choice of step lengths α_k. This generates a decreasing sequence, f(x_k + α_k p_k) < f(x_k); however, with this choice of α_k the iterates merely oscillate toward ±0.6, and the convergence rate is less than spectacular. Clearly, we need a stronger condition than f(x_k + α_k p_k) < f(x_k).

Figure: The objective values along the alternating iterates decrease monotonically without approaching the minimum.
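The failure mode can be reproduced with a minimal instance of our own (not the slide's exact objective): on f(x) = x², whose minimizer is 0, the alternating iterates x_k = (−1)^k (0.6 + 0.4·2^{−k}) decrease the objective at every step yet stall at ±0.6.

```python
# Monotone decrease without convergence to a minimizer (our own 1-D
# instance): f(x) = x^2, iterates x_k = (-1)^k (0.6 + 0.4 * 2^-k),
# i.e., 1, -0.8, 0.7, -0.65, 0.625, ...

xs = [(-1) ** k * (0.6 + 0.4 * 2.0 ** (-k)) for k in range(12)]
fs = [x * x for x in xs]

monotone = all(f1 < f0 for f0, f1 in zip(fs, fs[1:]))
# |x_k| -> 0.6, so f(x_k) -> 0.36, far above the minimum value 0
gap = abs(abs(xs[-1]) - 0.6)
```

Every step strictly decreases f, yet the limit value 0.36 is nowhere near the minimum; this is exactly why a sufficient-decrease condition is needed.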
The Wolfe Conditions. There are many ways to enforce reduction in the objective. The Armijo condition,

  f(x_k + α p_k) ≤ f(x_k) + c₁ α p_k^T ∇f(x_k),  c₁ ∈ (0, 1),

requires the reduction to be proportional to the step length α as well as to the directional derivative. In practice, c₁ is usually set to be quite small, e.g., c₁ ≈ 10⁻⁴.

To rule out unacceptably short steps, the curvature condition,

  p_k^T ∇f(x_k + α p_k) ≥ c₂ p_k^T ∇f(x_k),  c₂ ∈ (c₁, 1),

is enforced; it prevents us from stopping when more progress can be made by moving further (increasing α). Together, these two conditions are known as the Wolfe conditions.

Figure: The "Armijo line"; the intervals of step lengths for which the Armijo condition holds.
Figure: The intervals of step lengths for which the curvature condition holds.
Figure: Acceptable steps; the intervals of step lengths satisfying both conditions, i.e., the Wolfe conditions.
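A minimal checker for the two conditions (function and variable names are our own; the defaults c₁ = 10⁻⁴ and c₂ = 0.9 are the common choices):

```python
# Minimal Wolfe-conditions checker (a sketch, names are ours).

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def wolfe_ok(f, grad, x, p, a, c1=1e-4, c2=0.9):
    xa = [xi + a * pi for xi, pi in zip(x, p)]
    slope0 = dot(grad(x), p)              # p^T grad f(x_k), < 0 for descent
    armijo = f(xa) <= f(x) + c1 * a * slope0
    curvature = dot(grad(xa), p) >= c2 * slope0
    return armijo and curvature

f = lambda x: x[0] ** 2
grad = lambda x: [2 * x[0]]
full_step = wolfe_ok(f, grad, [1.0], [-1.0], 1.0)    # the exact minimizer of phi
tiny_step = wolfe_ok(f, grad, [1.0], [-1.0], 1e-6)   # curvature condition fails
```

On f(x) = x² from x = 1 with p = −1, the step α = 1 (the exact minimizer along the line) is accepted, while a tiny step α = 10⁻⁶ passes Armijo but is rejected by the curvature condition, which is precisely the behavior the condition is designed to produce.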
The Strong Wolfe Conditions. A step length α may satisfy the Wolfe conditions,

  f(x_k + α p_k) ≤ f(x_k) + c₁ α p_k^T ∇f(x_k),  c₁ ∈ (0, 1),
  p_k^T ∇f(x_k + α p_k) ≥ c₂ p_k^T ∇f(x_k),  c₂ ∈ (c₁, 1),

even though it is far from a minimizer of φ(α) = f(x_k + α p_k). The strong Wolfe conditions,

  f(x_k + α p_k) ≤ f(x_k) + c₁ α p_k^T ∇f(x_k),  c₁ ∈ (0, 1),
  |p_k^T ∇f(x_k + α p_k)| ≤ c₂ |p_k^T ∇f(x_k)|,  c₂ ∈ (c₁, 1),

disallow values of φ'(α) which are too positive, thus excluding points that are far from the stationary points of φ(α) = f(x_k + α p_k).

Existence of Acceptable Step Lengths. It can be shown (see NW 1st ed., pp. 40-41) that there exist step lengths α which satisfy the Wolfe conditions and the strong Wolfe conditions for every function f which is smooth and bounded below. Formally:

Lemma. Suppose f: Rⁿ → R is continuously differentiable. Let p_k be a descent direction at x_k, and assume that f is bounded below along the ray {x_k + α p_k : α > 0}. Then, if 0 < c₁ < c₂ < 1, there exist intervals of step lengths satisfying the Wolfe conditions and the strong Wolfe conditions.

See also the Goldstein conditions, NW 1st ed., p. 41.

Algorithm (Backtracking Line Search).
  0. Find a descent direction p_k.
  1. Set ᾱ > 0, ρ ∈ (0, 1), c ∈ (0, 1); set α = ᾱ.
  2. While f(x_k + α p_k) > f(x_k) + c α p_k^T ∇f(x_k):
  3.   α = ρ α.
  4. End While.
  5. Set α_k = α.

If an algorithm selects the step lengths appropriately, e.g., by backtracking, we do not have to check the second inequality of the Wolfe conditions. The algorithm above is especially well suited for use with the Newton direction p_k^N, where ᾱ = 1. The value of the contraction factor ρ can be allowed to vary at each iteration of the line search. (To be revisited.)

Numerical Optimization: Line Search Methods, Rate of Convergence

Peter Blomgren (blomgren.peter@gmail.com)
Department of Mathematics and Statistics, Dynamical Systems Group, Computational Sciences Research Center, San
Diego State University, San Diego, CA 92182-7720
http://terminus.sdsu.edu
Fall 2009

Outline.
1. Line Search Methods: Recap & Preview.
2. Convergence Analysis: Steepest Descent.
3. Convergence Beyond Steepest Descent: Newton; Quasi-Newton; Coordinate Descent Methods.

Recap. Discussion of sufficient decrease for line searches: the Wolfe conditions; the strong Wolfe conditions; the backtracking line search algorithm, for the steepest descent direction and for the Newton direction, with worked examples on simple quadratic objectives. The Zoutendijk condition: for smooth f, bounded below, the steepest descent direction with step lengths satisfying the Wolfe conditions gives a globally convergent method,

  lim_{k→∞} ||∇f(x_k)|| = 0.

This is also true for the Newton direction, provided the Hessian ∇²f(x_k) is positive definite and its condition number is uniformly bounded.

Preview. Today we take a more careful look at the rates of convergence for steepest descent, Newton, and quasi-Newton methods. We show, using a simple model case, that the resulting rates of convergence are: linear (steepest descent), quadratic (Newton), superlinear (quasi-Newton). Finally, we discuss coordinate descent methods. Note: the convergence analysis builds on Taylor's theorem and linear algebra results applied to quadratic forms f(x) = ½ x^T Q x − b^T x, where Q is symmetric positive definite.

Are We Done? We know the following for nice enough f. If we ensure that p_k is never too close to orthogonal to the negative gradient, i.e., cos θ_k ≥ δ > 0 (compute cos θ_k in each iteration, and turn p_k toward the steepest descent direction when cos θ_k < δ), and α_k satisfies the Wolfe conditions (e.g., via a backtracking line search), then we have a globally convergent algorithm. Therefore, optimization is easy?
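The backtracking line search recapped above can be sketched in a few lines (a sketch; ρ = 1/2 and c = 10⁻⁴ are common defaults, not prescribed by the notes):

```python
# Backtracking line search (Armijo condition only, as in the notes).

def backtracking(f, grad, x, p, a_bar=1.0, rho=0.5, c=1e-4):
    a = a_bar
    fx = f(x)
    slope = sum(gi * pi for gi, pi in zip(grad(x), p))   # p^T grad f(x) < 0
    while f([xi + a * pi for xi, pi in zip(x, p)]) > fx + c * a * slope:
        a *= rho                                          # contract the step
    return a

# On f(x) = x^2 from x = 1 with p = -2, the full step overshoots
# (f(-1) = f(1)), and a single contraction lands on the minimizer x = 0.
a_k = backtracking(lambda x: x[0] ** 2, lambda x: [2 * x[0]], [1.0], [-2.0])
```

Because only sufficient decrease is tested, the loop relies on the contraction to avoid the too-short steps that the curvature condition would otherwise rule out.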
No, We Are Not Done. Angle tests (cos θ_k ≥ δ) and turning of p_k ensure global convergence, but:

1. They may slow down the convergence rate.
2. When the Hessian is ill-conditioned (close to singular), the appropriate search direction may be almost orthogonal to the gradient, and an unlucky choice of δ may rule this direction out.
3. They break quasi-Newton methods.

Algorithmic strategies for rapid convergence are often in direct conflict with the theoretical requirements for global convergence. E.g., steepest descent is the model citizen for global convergence, but it is quite slow (we'll show this today). E.g., Newton converges fast for good initial guesses, but the Newton direction may not even be a descent direction far away from the solution. The goal is the best of both worlds: rapid global convergence.

Convergence Analysis: Steepest Descent. We apply the steepest descent method to the simple quadratic model objective

  f(x) = ½ x^T Q x − b^T x,

where Q is an n × n symmetric positive definite matrix. Further, we idealize the method by using exact line searches. The gradient is ∇f(x) = Q x − b, and hence the minimizer x* = Q⁻¹ b is unique.

The step length: let g_k = ∇f(x_k); then α_k is the α which minimizes

  f(x_k − α g_k) = ½ (x_k − α g_k)^T Q (x_k − α g_k) − b^T (x_k − α g_k).

Expanding the expression, we have

  f(x_k − α g_k) = ½ x_k^T Q x_k − α g_k^T Q x_k + ½ α² g_k^T Q g_k − b^T x_k + α b^T g_k.

Then we differentiate with respect to α and set the result equal to zero:

  0 = α g_k^T Q g_k + g_k^T (b − Q x_k) = α g_k^T Q g_k − g_k^T g_k,

since g_k = ∇f(x_k) = Q x_k − b. Hence

  α_k = g_k^T g_k / (g_k^T Q g_k) = ∇f(x_k)^T ∇f(x_k) / (∇f(x_k)^T Q ∇f(x_k)),

and the iteration is

  x_{k+1} = x_k − [ ∇f(x_k)^T ∇f(x_k) / (∇f(x_k)^T Q ∇f(x_k)) ] ∇f(x_k).

Since, for this model, ∇f(x_k) = Q x_k − b, we now have a complete closed-form expression for the iterations. The figure on the next slide shows a typical convergence pattern for steepest descent methods: a zig-zagged approach to
the optimum, for a particular two-dimensional quadratic model.

Figure: Illustration of the steepest descent convergence pattern (zig-zagging toward the optimum).

Convergence Analysis: Steepest Descent. In order to measure the rate of convergence, we introduce the weighted Q-norm,

  ||x||_Q² = x^T Q x.

Since Q x* = b, we have

  ½ ||x − x*||_Q² = f(x) − f(x*).

Since ∇f(x*) = 0, we note that ∇f(x_k) = Q (x_k − x*). We can now express the iteration in terms of the Q-norm:

  ||x_{k+1} − x*||_Q² = { 1 − (∇f_k^T ∇f_k)² / [ (∇f_k^T Q ∇f_k)(∇f_k^T Q⁻¹ ∇f_k) ] } ||x_k − x*||_Q²,

where ∇f_k = ∇f(x_k). The details are outlined in exercise 3.7 (NW 1st ed., p. 62; 2nd ed., p. 64). This is an exact expression for the decrease in each iteration. It is, however, quite cumbersome to work with; a more useful bound can be found:

Theorem. When the steepest descent method with exact line searches is applied to the convex quadratic function f(x) = ½ x^T Q x − b^T x, the error norm, ½ ||x − x*||_Q² = f(x) − f(x*), satisfies

  ||x_{k+1} − x*||_Q² ≤ [ (λ_n − λ₁) / (λ_n + λ₁) ]² ||x_k − x*||_Q²,

where 0 < λ₁ ≤ λ₂ ≤ … ≤ λ_n are the eigenvalues of Q.

The theorem shows a linear rate of convergence,

  ||x_{k+1} − x*||_Q ≤ [ (λ_n − λ₁) / (λ_n + λ₁) ] ||x_k − x*||_Q.

If λ₁ = λ_n, then only one iteration is needed: in this case Q is a multiple of the identity matrix and the contours are concentric circles, which means that the steepest descent direction points straight at the solution. As the condition number κ(Q) = λ_n/λ₁ increases, the contours become more elongated, which increases the amount of zig-zagging; further, the ratio approaches one, which signals a significant slow-down in the convergence.

Figure: Increasing the condition number of Q increases the amount of zig-zagging in the steepest descent method.
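Both the closed-form step length and the linear-rate bound can be verified numerically on a small diagonal instance of our own choosing (for a diagonal Q the eigenvalues are simply the diagonal entries):

```python
# Steepest descent with the exact step a_k = (g^T g)/(g^T Q g) on
# Q = diag(2, 10), b = (2, 10): minimizer x* = (1, 1), l_1 = 2, l_n = 10.

l1, ln = 2.0, 10.0
b = [2.0, 10.0]
xstar = [1.0, 1.0]
bound = (ln - l1) / (ln + l1)                 # linear-rate bound = 2/3

def err_qnorm(x):
    # ||x - x*||_Q for the diagonal Q
    return (l1 * (x[0] - xstar[0]) ** 2 + ln * (x[1] - xstar[1]) ** 2) ** 0.5

x = [0.0, 0.0]
ratios = []
for _ in range(25):
    g = [l1 * x[0] - b[0], ln * x[1] - b[1]]  # gradient Q x - b
    a = (g[0] ** 2 + g[1] ** 2) / (l1 * g[0] ** 2 + ln * g[1] ** 2)
    x_new = [x[0] - a * g[0], x[1] - a * g[1]]
    ratios.append(err_qnorm(x_new) / err_qnorm(x))
    x = x_new
```

Every observed per-step contraction of the Q-norm error stays below the theorem's bound (λ_n − λ₁)/(λ_n + λ₁) = 2/3, and the iterates zig-zag their way to (1, 1).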
Theorem. Suppose that f: Rⁿ → R is twice continuously differentiable, and that the iterates generated by the steepest descent method with exact line searches converge to a point x* where the Hessian matrix ∇²f(x*) is positive definite. Let r be any scalar satisfying

  r ∈ ( (λ_n − λ₁) / (λ_n + λ₁), 1 ),

where λ₁ ≤ … ≤ λ_n are the eigenvalues of ∇²f(x*). Then, for all k sufficiently large, we have

  f(x_{k+1}) − f(x*) ≤ r² [ f(x_k) − f(x*) ].

Note: The statement of the theorem is different from the statement in Nocedal-Wright, 1st edition, 3rd printing; it has been updated according to Nocedal's posted errata, http://www.ece.northwestern.edu/~nocedal/book/errata.html.

Inexact line searches will not improve the convergence rate; hence the theorem shows that convergence can be quite slow even when the condition number κ(∇²f(x*)) is quite small. E.g., when κ(∇²f(x*)) = 100, a hundred iterations will only reduce the error by a multiplicative factor of about 0.0183.

Convergence Analysis: Newton. We now look at the Newton search direction,

  p_k^N = −[∇²f(x_k)]⁻¹ ∇f(x_k).

This may not always be a descent direction, since the Hessian matrix ∇²f(x_k) may not always be positive definite. We delay the discussion of how to deal with non-descent Newton directions until a later time. For now we focus on the local result, i.e., we start the iteration close enough to the optimum x* that all Hessians along the solution trajectory are positive definite.

Theorem. Suppose that f is twice differentiable, and that the Hessian ∇²f(x) is Lipschitz continuous in a neighborhood of the solution x*, at which the sufficient conditions (∇f(x*) = 0, ∇²f(x*) positive definite) are satisfied. Consider the Newton iteration x_{k+1} = x_k + p_k^N. Then: (1) if the starting point x₀ is
sufficiently close to x*, the sequence of iterates converges to x*; (2) the rate of convergence of {x_k} is quadratic; (3) the sequence of gradient norms {||∇f(x_k)||} converges quadratically to zero.

The proof is in NW 1st ed., pp. 52-53 (2nd ed., pp. 44-45).

Convergence Analysis: Quasi-Newton. We now turn our attention to the middle case, where we have a search direction of the form

  p_k = −B_k⁻¹ ∇f(x_k),

where B_k is symmetric, positive definite, and a clever approximation to the Hessian ∇²f(x_k) (the discussion of how to build and update B_k is postponed until later). Further, we assume that the iteration is based on an inexact line search algorithm where α_k satisfies the Wolfe conditions. We also assume that the line search algorithm always tries the step α = 1 first.

Theorem. Suppose that f: Rⁿ → R is twice continuously differentiable. Consider the iteration x_{k+1} = x_k + α_k p_k, where p_k is a descent direction and α_k satisfies the Wolfe conditions with c₁ ≤ 1/2. If the sequence {x_k} converges to a point x* such that ∇f(x*) = 0 and ∇²f(x*) is positive definite, and if the search direction satisfies

  lim_{k→∞} ||∇f(x_k) + ∇²f(x_k) p_k|| / ||p_k|| = 0,

then: (1) the step length α_k = 1 is admissible for all k sufficiently large; (2) if α_k = 1 for all k sufficiently large, then {x_k} converges to x* superlinearly.

Note: The statement of the theorem has been updated according to the errata.

If c₁ > 1/2, the line search would exclude the minimizer of a quadratic, enforcing α_k < 1. If p_k is a quasi-Newton search direction, then the limit in the theorem is equivalent to

  lim_{k→∞} ||(B_k − ∇²f(x*)) p_k|| / ||p_k|| = 0.

This shows that we can achieve superlinear convergence even if lim_{k→∞} B_k ≠ ∇²f(x*); it is sufficient that B_k converges to the Hessian along the search directions p_k. It turns out that this condition is both necessary and sufficient.
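The quadratic (Newton) and superlinear (quasi-Newton) rates can be contrasted on a one-dimensional example of our own, f(x) = x² + eˣ, whose minimizer solves f'(x) = 2x + eˣ = 0. Newton uses f'' exactly; a secant difference quotient B_k plays the role of a one-dimensional quasi-Newton Hessian.

```python
# Newton vs. a secant ("1-D quasi-Newton") iteration on f(x) = x^2 + exp(x).
import math

def g(x):                         # f'(x)
    return 2 * x + math.exp(x)

def h(x):                         # f''(x) = 2 + exp(x) > 0 everywhere
    return 2 + math.exp(x)

# Newton: gradient norms roughly square at every step (quadratic rate)
xn, newton_gnorms = 0.0, []
for _ in range(6):
    xn -= g(xn) / h(xn)
    newton_gnorms.append(abs(g(xn)))

# Secant: B_k is a difference quotient approximating f''; no second
# derivatives are evaluated, and the rate is superlinear
x_prev, x = 0.0, -1.0
for _ in range(20):
    if abs(g(x)) < 1e-13:
        break
    B = (g(x) - g(x_prev)) / (x - x_prev)
    x_prev, x = x, x - g(x) / B
```

Here B_k converges to f''(x*) along the (one-dimensional) search direction, which is exactly the condition in the theorems above; both iterations reach the same minimizer, the secant version merely needing a few more steps.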
Theorem. Suppose that f: Rⁿ → R is twice continuously differentiable. Consider the iteration x_{k+1} = x_k + p_k (i.e., α_k ≡ 1), where p_k is given by p_k = −B_k⁻¹ ∇f(x_k). If {x_k} converges to a point x* such that ∇f(x*) = 0 and ∇²f(x*) is positive definite, then {x_k} converges superlinearly if and only if

  lim_{k→∞} ||(B_k − ∇²f(x_k)) p_k|| / ||p_k|| = 0.

When we return to the construction of quasi-Newton methods, we will see that satisfying the condition of this theorem is normally not a problem; hence superlinearly convergent quasi-Newton methods are readily available for most objective functions.

Note: The statement of the theorem has been updated according to the errata.

Coordinate Descent Methods. Instead of computing a search direction, why not just cycle through the coordinate directions, i.e., p = e₁, e₂, …, e_n? Once we reach the final direction e_n, we start over from e₁. Unfortunately, this scheme is quite inefficient in practice, and can in fact iterate infinitely without reaching a stationary point.

A cyclic search along any set of linearly independent directions can run into this problem of non-convergence. The gradient ∇f(x_k) may become more and more perpendicular to the coordinate search direction, so that cos θ_k approaches zero rapidly enough that the Zoutendijk condition,

  Σ_{k=0}^∞ cos² θ_k ||∇f(x_k)||² < ∞,

is satisfied even though ||∇f(x_k)|| does not converge to zero.

Even when coordinate descent methods converge, the rate of convergence is slower than that of the steepest descent method, and the slowness increases as the number of variables increases. There are, however, situations in which coordinate descent may be useful, since (1) no calculation of ∇f(x_k) is required, and (2) convergence can be acceptably fast if the variables are loosely coupled.
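A sketch of cyclic coordinate descent with exact one-dimensional minimization on a coupled convex quadratic (our own 2×2 instance; on this convex problem the method does converge, but only linearly, at a rate set by the coupling):

```python
# Cyclic coordinate descent on f(x) = 0.5 x^T Q x, Q = [[2, 1], [1, 2]].
# Each sweep minimizes exactly over x1 (x2 frozen), then over x2 (x1 frozen);
# the minimizer is x* = (0, 0).

Q = [[2.0, 1.0], [1.0, 2.0]]
x = [1.0, 1.0]
second_coords = []
for _ in range(30):
    x[0] = -Q[0][1] * x[1] / Q[0][0]    # df/dx1 = 2 x1 + x2 = 0
    x[1] = -Q[1][0] * x[0] / Q[1][1]    # df/dx2 = x1 + 2 x2 = 0
    second_coords.append(x[1])

# Each full sweep shrinks the error by exactly 1/4 on this instance;
# stronger coupling (a larger off-diagonal entry) would degrade the rate.
rate = second_coords[1] / second_coords[0]
```

No gradient evaluations are needed, matching point (1) above; the constant per-sweep contraction illustrates the linear rate, which worsens as the off-diagonal coupling grows.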
The stronger the coupling, the worse the convergence.

Numerical Optimization, Lecture Notes 2: Unconstrained Optimization Fundamentals

Peter Blomgren (blomgren.peter@gmail.com)
Department of Mathematics and Statistics, Dynamical Systems Group, Computational Sciences Research Center, San Diego State University, San Diego, CA 92182-7720
http://terminus.sdsu.edu
Fall 2009

Outline.
1. Fundamentals of Unconstrained Optimization: Review; Characterizing the Solution; Some Fundamental Theorems and Definitions.
2. Optimality: Necessary vs. Sufficient Conditions; Convexity; From Theorems to Algorithms.

Last Time. We established that our favorite problem for the semester will be of the form

  min_x f(x),

where f is the objective function and x the vector of variables (aka unknowns, or parameters). The problem is unconstrained, since all values of x ∈ Rⁿ are allowed. Further, we established that our initial approach will focus on problems where we do not have any extra factors working against us, i.e., we are considering local optimization, continuous variables, and deterministic techniques on convex domains.

What Are We Looking For? Global Optimizer. A solution to the unconstrained optimization problem is a point x* ∈ Rⁿ such that f(x*) ≤ f(x) for all x ∈ Rⁿ. Such a point is called a global minimizer. In order to find a global optimizer, we need information about the objective on a global scale. Unless we have special information, such as convexity of f, this information is expensive, since we would have to evaluate f at infinitely many points.

What Are We Looking For? Local Optimizers (1 of 3). Most algorithms will take a starting point x₀ and use information about f, and possibly its derivatives, in order to compute a point x₁ which is
closer to optimal than x₀, in the sense that f(x₁) ≤ f(x₀). Then the algorithm will use information about f and its derivatives at x₁, and possibly also at x₀ (this increases the storage requirement), to find x₂ such that f(x₂) ≤ f(x₁) ≤ f(x₀). An algorithm of this type will only be able to find a local minimizer.

Local Optimizers (2 of 3).

Definition (Local Minimizer). A point x* ∈ Rⁿ is a local minimizer if there is a neighborhood N of x* such that f(x*) ≤ f(x) for all x ∈ N.

Note: A neighborhood of x* is an open set which contains x*. Note: A local minimizer of this type is sometimes referred to as a weak local minimizer; a strict (or strong) local minimizer is defined as follows.

Definition (Strict Local Minimizer). A point x* ∈ Rⁿ is a strict local minimizer if there is a neighborhood N of x* such that f(x*) < f(x) for all x ∈ N, x ≠ x*.

Local Optimizers (3 of 3).

Definition (Isolated Local Minimizer). A point x* ∈ Rⁿ is an isolated local minimizer if there is a neighborhood N of x* such that x* is the only local minimizer in N.

Figure: The objective f(x) = x² (2 + cos(1/x)) has a strict local minimizer at x = 0; however, there are strict local minimizers at infinitely many neighboring points x_k, so x* = 0 is not an isolated minimizer.

Recognizing a Local Minimum. If we are given a point x ∈ Rⁿ, how do we know if it is a local minimizer? Do we have to look at all the points in its neighborhood? If (when) the objective function f is differentiable, we can recognize a minimum by looking at the first and second derivatives: the gradient ∇f(x) and the Hessian ∇²f(x). The key tool is the multi-dimensional version of Taylor's theorem, i.e., Taylor expansions.

Illustration: The Gradient ∇f and the Hessian ∇²f. Let x ∈ R³, i.e., x = (x₁, x₂, x₃)^T. Then

  ∇f(x) = ( ∂f/∂x₁, ∂f/∂x₂, ∂f/∂x₃ )^T,

and

  ∇²f(x) = [ ∂²f/∂x₁²    ∂²f/∂x₁∂x₂  ∂²f/∂x₁∂x₃ ;
             ∂²f/∂x₂∂x₁  ∂²f/∂x₂²    ∂²f/∂x₂∂x₃ ;
             ∂²f/∂x₃∂x₁  ∂²f/∂x₃∂x₂  ∂²f/∂x₃²   ].
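The gradient and Hessian layout can be sanity-checked with central finite differences, on a small test function of our own choosing:

```python
# Finite-difference check of hand-computed derivatives on R^3:
# f(x) = x1^2 + x1 x2 + sin(x3).
import math

def f(x):
    return x[0] ** 2 + x[0] * x[1] + math.sin(x[2])

def grad(x):                      # hand-computed gradient
    return [2 * x[0] + x[1], x[0], math.cos(x[2])]

def fd_partial(func, x, i, h=1e-6):
    # central difference approximation of d func / d x_i
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (func(xp) - func(xm)) / (2 * h)

x = [0.4, -0.3, 0.9]
grad_errs = [abs(fd_partial(f, x, i) - grad(x)[i]) for i in range(3)]

# One Hessian entry: d^2 f / dx1 dx2 = 1 (difference component 0 of grad)
hess_12 = fd_partial(lambda y: grad(y)[0], x, 1)
```

Differencing a gradient component reproduces the corresponding Hessian row entry, mirroring the matrix layout displayed above.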
Taylor's Theorem.

Theorem (Taylor's Theorem). Suppose that f: Rⁿ → R is continuously differentiable, and that p ∈ Rⁿ. Then

  f(x + p) = f(x) + ∇f(x + t p)^T p, for some t ∈ (0, 1).

Moreover, if f is twice continuously differentiable, then

  ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + t p) p dt,

and

  f(x + p) = f(x) + ∇f(x)^T p + ½ p^T ∇²f(x + t p) p, for some t ∈ (0, 1).

Optimality: First-Order Necessary Conditions.

Theorem (First-Order Necessary Conditions). If x* is a local minimizer and f is continuously differentiable in an open neighborhood of x*, then ∇f(x*) = 0.

Proof (by contradiction). Suppose ∇f(x*) ≠ 0. Let p = −∇f(x*), and realize that

  p^T ∇f(x*) = −||∇f(x*)||² < 0.

By continuity of ∇f there is a scalar T > 0 such that p^T ∇f(x* + t p) < 0 for all t ∈ [0, T]. Further, for any s ∈ (0, T], by Taylor's theorem,

  f(x* + s p) = f(x*) + s p^T ∇f(x* + t p), for some t ∈ (0, s),

where the last term is negative. Therefore f(x* + s p) < f(x*), which contradicts the fact that x* is a local minimizer. Hence we must have ∇f(x*) = 0. □

Optimality: Language and Notation. If ∇f(x*) = 0, then we call x* a stationary point. Recall from linear algebra:

Definition (Positive Definite Matrix). An n × n matrix A is positive definite if and only if

  x^T A x = Σᵢ Σⱼ aᵢⱼ xᵢ xⱼ > 0, for all x ≠ 0.

Definition (Positive Semi-Definite Matrix). An n × n matrix A is positive semi-definite if and only if

  x^T A x = Σᵢ Σⱼ aᵢⱼ xᵢ xⱼ ≥ 0, for all x.

Optimality: Second-Order Necessary Conditions.

Theorem (Second-Order Necessary Conditions). If x* is a local minimizer of f, and ∇²f is continuous in an open neighborhood of x*, then ∇f(x*) = 0 and ∇²f(x*) is positive semi-definite.

Proof. ∇f(x*) = 0 follows from the previous proof. We show that ∇²f(x*) is positive semi-definite by contradiction. Assume that ∇²f(x*) is not positive semi-definite. Then there must
exist a vector p such that p^T ∇²f(x*) p < 0. By continuity of ∇²f there is a T > 0 such that p^T ∇²f(x* + t p) p < 0 for all t ∈ [0, T]. Now, the Taylor expansion around x* shows that, for all s ∈ (0, T], there exists t ∈ (0, s) such that

  f(x* + s p) = f(x*) + s p^T ∇f(x*) + ½ s² p^T ∇²f(x* + t p) p,

where the middle term is zero and the last term is negative. Hence f(x* + s p) < f(x*), which is a contradiction. □

Optimality: Necessary vs. Sufficient Conditions. The conditions we have outlined so far are necessary: if x* is a minimum, then the conditions must hold. It is more useful to have a set of sufficient conditions, so that if the conditions are satisfied at x*, then x* is a minimum. The second-order sufficient conditions guarantee that x* is a strict local minimizer of f, and the convexity of f guarantees that any local minimizer is a global minimizer.

Optimality: Second-Order Sufficient Conditions.

Theorem (Second-Order Sufficient Conditions). Suppose that ∇²f is continuous in an open neighborhood of x*, and that ∇f(x*) = 0 and ∇²f(x*) is positive definite. Then x* is a strict local minimizer of f.

Proof. Since the Hessian ∇²f(x*) is positive definite, we can find an open ball of positive radius r, D = { y ∈ Rⁿ : ||y − x*|| < r }, so that ∇²f(y) is positive definite for all y ∈ D. Now, for any vector p such that 0 < ||p|| < r, we have x* + p ∈ D, and therefore, by Taylor,

  f(x* + p) = f(x*) + p^T ∇f(x*) + ½ p^T ∇²f(x* + t p) p, for some t ∈ (0, 1),

where the middle term is zero and the last term is positive. Hence it follows that f(x*) < f(x* + p), and so x* must be a strict local minimizer. □

Optimality: Convexity (1 of 3).

Theorem. When the objective function f is convex, any local minimizer x* is also a global minimizer of f. If, in addition, f is differentiable, then any stationary point x* is a global minimizer of f.
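The second-order conditions require testing positive definiteness in practice. Rather than checking the definition over all x, a standard test is to attempt a Cholesky factorization: a symmetric matrix is positive definite iff the factorization succeeds with strictly positive pivots. A pure-Python sketch for small dense matrices:

```python
# Positive-definiteness test via a Cholesky factorization attempt
# (Cholesky-Banachiewicz ordering; a sketch for small symmetric matrices).

def is_positive_definite(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0.0:
                    return False        # non-positive pivot: not PD
                L[i][i] = d ** 0.5
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return True

pd = is_positive_definite([[2.0, 1.0], [1.0, 2.0]])          # eigenvalues 1, 3
saddle = is_positive_definite([[2.0, 0.0], [0.0, -4.0]])     # indefinite
```

The second matrix is the saddle-point Hessian diag(2, −4) from the earlier example, and the test correctly rejects it.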
Optimality: Convexity (2 of 3).

Proof (part 1). Suppose that x* is a local, but not a global, minimizer. Then there must exist a point z ∈ Rⁿ such that f(z) < f(x*). Consider the line segment that joins x* and z,

  x(λ) = λ z + (1 − λ) x*, λ ∈ (0, 1].

Since f is convex, we must have, by definition,

  f(x(λ)) ≤ λ f(z) + (1 − λ) f(x*) < f(x*), for all λ ∈ (0, 1].

Every neighborhood of x* will contain a piece of the line segment; hence x* cannot be a local minimizer, a contradiction. □

Optimality: Convexity (3 of 3).

Proof (part 2). Suppose that x* is a stationary point, but not a global minimizer, and let z be such that f(z) < f(x*). Using convexity and the definition of a directional derivative (NW 2nd ed., p. 628), we have

  ∇f(x*)^T (z − x*) = d/dλ f(x* + λ (z − x*)) |_{λ=0}
    = lim_{λ↓0} [ f(x* + λ (z − x*)) − f(x*) ] / λ
    ≤ lim_{λ↓0} [ λ f(z) + (1 − λ) f(x*) − f(x*) ] / λ
    = f(z) − f(x*) < 0.

Therefore ∇f(x*) ≠ 0, so x* cannot be a stationary point; this contradicts the supposition. □

Optimality: From Theorems to Algorithms. The theorems we have shown, all of which are based on elementary vector calculus, are the backbone of unconstrained optimization algorithms. Since we usually do not have a global understanding of f, the algorithms will seek stationary points, i.e., solve the problem ∇f(x) = 0. When x ∈ Rⁿ, this is a system of n, generally non-linear, equations. Hence, there is a strong connection between the solution of non-linear equations and unconstrained optimization. We will focus on developing an optimization framework, and in the last few weeks of the semester we will use it to solve non-linear equations.

Algorithms: An Overview. The algorithms we study start with an initial, sub-optimal, guess x₀ and generate a sequence of iterates {x_k}, k = 1, …, N. The sequence is terminated when either (success) we have approximated a solution up to the desired accuracy, or (failure) no more progress can be made. Different algorithms make different decisions on how to move from x_k to the next iterate x_{k+1}. Many algorithms are monotone, i.e., f(x_{k+1}) < f(x_k) for all k ≥ 0, but there exist non-monotone algorithms. Even a non-monotone algorithm is required to eventually decrease the objective; how else can we reach a minimum? Typically, f(x_{k+m})
< f(x_k) is required for some fixed value m > 0 and for all k ≥ 0.

Moving from x_k to x_{k+1}: Line Search. Most optimization algorithms use one of two fundamental strategies for finding the next iterate.

1. Line-search-based algorithms replace the n-dimensional optimization problem, min_{x ∈ Rⁿ} f(x), with a one-dimensional problem,

  min_{α > 0} f(x_k + α p_k),

where p_k is a chosen search direction. Clearly, how cleverly we select p_k will affect how much progress we can make in each iteration. (The intuitive choice, the steepest descent direction, gives a slow scheme.)

Moving from x_k to x_{k+1}: Trust Region (1 of 2).

2. Trust-region-based methods take a completely different approach. Using information gathered about the objective f (i.e., function values, gradients, Hessians, etc.) during the iteration, a simpler model function is generated. A good model function m_k approximates the behavior of f in a neighborhood of x_k, e.g., the Taylor expansion

  m_k(x_k + p) = f(x_k) + p^T ∇f(x_k) + ½ p^T H_k p,

where H_k is the full Hessian ∇²f(x_k) (expensive), or a clever approximation thereof.

Moving from x_k to x_{k+1}: Trust Region (2 of 2). The model is chosen simple enough that the optimization problem

  min_{p : x_k + p ∈ N(x_k)} m_k(x_k + p)

can be solved quickly. The neighborhood N(x_k) of x_k specifies the region in which we trust the model. A simple model can only capture the local behavior of f; think about how the Taylor expansion approximates a function well close to the expansion point, but not very well further away. Usually the trust region is a ball in Rⁿ, i.e., N(x_k) = { x : ||x − x_k|| ≤ r }, but elliptical or box-shaped trust regions are sometimes used.
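The "trust the model only locally" point can be quantified: on f(x) = eˣ in one dimension (our own illustration), the quadratic model built at x_k = 0 is accurate for small steps and poor for large ones.

```python
# Error of the quadratic model m_k at x_k = 0 for f(x) = exp(x):
# m_k(p) = f(0) + f'(0) p + 0.5 f''(0) p^2, all derivatives equal 1 here.
import math

xk = 0.0
fx = gx = hx = math.exp(xk)       # f, f', f'' at x_k

def model(p):
    return fx + gx * p + 0.5 * hx * p * p

errors = [abs(math.exp(xk + p) - model(p)) for p in (0.1, 1.0, 3.0)]
```

The model error grows rapidly with the step size p, which is exactly why the subproblem is restricted to a trust region ||p|| ≤ r around x_k.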
Line Search vs. Trust Region.

  Step | Line Search                                   | Trust Region
  1    | Choose a search direction p_k.                | Establish the maximum distance: the size of the trust region.
  2    | Identify the distance, e.g., the step length, | Find the direction (and step) within the
       | in the search direction.                      | trust region.

Table: Line search and trust region methods handle the selection of direction and distance in opposite order.

Next time: Rate of convergence; line search methods, detailed discussion.