# Special Topics CS 8803


These class notes for CS 8803 (Special Topics, Computer Science) at Georgia Institute of Technology - Main Campus, taught by Alexander Gray in Fall, were uploaded by Alayna Veum on Monday, November 2, 2015.

## CS 8803-MDM Lecture 26: Integration and Sampling

Alexander Gray (agray@cc.gatech.edu), Georgia Institute of Technology

### Today

1. Integration and Sampling
2. Monte Carlo / Variance Reduction
3. Markov Chain Monte Carlo

### Integration and Sampling

Why integration, and why sampling?

### Integration

Suppose we want to find

$$I = \int b(x)\,dx. \qquad (1)$$

If $x$ is low-dimensional, we can use standard quadrature techniques. However, quadrature techniques effectively grid up the space, so their cost is exponential in the dimensionality $D$ of $x$.

### Integration

Now suppose we have the form

$$I = \int a(x) f(x)\,dx, \qquad (2)$$

where $f$ is a probability density function. We get this form whenever we want to compute the expected value of a function $a(x)$, where $x \sim f$:

$$E[a(x)] = \int a(x) f(x)\,dx. \qquad (3)$$

### Integration and Sampling

The law of large numbers ensures that the sample mean over iid samples from $f$ converges to the integral:

$$\hat{I} = \frac{1}{S} \sum_{s=1}^{S} a(x_s) \to E[a(x)] \qquad (4)$$

as $S \to \infty$. $\hat{I}$ is an unbiased estimator of $I$. This is called Monte Carlo integration.

### Integration and Sampling

Its error is effectively its variance, which is

$$\sigma_{\hat I}^2 = \frac{1}{S} \int \left(a(x) - E[a(x)]\right)^2 f(x)\,dx = \frac{\sigma_a^2}{S}. \qquad (5)$$

An estimate of this is

$$\hat\sigma^2 = \frac{1}{S^2} \sum_{s=1}^{S} \left(a(x_s) - \hat I\right)^2. \qquad (6)$$

### Integration and Sampling

Expectations are ubiquitous in statistics, but this idea happens to be critical for making Bayesian statistics practicable. Recall that for a dataset $x = \{x_1, \dots, x_N\}$, the likelihood is

$$L(\theta) \equiv f(x|\theta) = \prod_{i=1}^N f(x_i|\theta) \qquad (7)$$

and the posterior is

$$f(\theta|x) = \frac{L(\theta) f(\theta)}{\int L(\theta) f(\theta)\,d\theta} = c^{-1} L(\theta) f(\theta) \propto L(\theta) f(\theta), \qquad (8)$$

where $c = \int f(x|\theta) f(\theta)\,d\theta$.

### Integration and Sampling

Bayesians want to compute the posterior mean

$$E[\theta|x] = \int \theta\, f(\theta|x)\,d\theta. \qquad (9)$$

Note that this has the form we specified, where the density in the integrand is $f(\theta|x)$ and $a(\theta) = \theta$. So if we can draw samples $\theta_1, \dots, \theta_S$ from the posterior $f(\theta|x)$,

$$\frac{1}{S} \sum_{s=1}^S \theta_s \to E[\theta|x] \qquad (10)$$

as $S \to \infty$.

### Integration and Sampling

Bayesians also want to compute the $1-\alpha$ posterior interval $(a, b)$ such that

$$\int_{-\infty}^{a} f(\theta|x)\,d\theta = \int_{b}^{\infty} f(\theta|x)\,d\theta = \frac{\alpha}{2}. \qquad (11)$$
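The basic Monte Carlo estimator (4) and its variance estimate (6) can be sketched numerically. This is a minimal illustration; the target density and integrand here are my own choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[a(x)] for a(x) = x**2 with x ~ N(0, 1) by Monte Carlo.
# The true value is 1 (the variance of a standard normal).
S = 200_000
x = rng.standard_normal(S)
a = x ** 2

I_hat = a.mean()                               # eq. (4): sample mean
sigma2_hat = ((a - I_hat) ** 2).sum() / S**2   # eq. (6): variance estimate

# I_hat should be close to 1, with standard error about sqrt(2/S).
```

The estimated variance shrinks as $1/S$, independent of any notion of dimension, which is the point of the next slides.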
The posterior interval can also be obtained by drawing samples $\theta_s$ from the posterior $f(\theta|x)$: we approximate the $1-\alpha$ posterior interval by $(\theta_{\alpha/2}, \theta_{1-\alpha/2})$, where $\theta_{\alpha/2}$ is the $\alpha/2$ sample quantile of $\theta_1, \dots, \theta_S$.

One common problem: we often cannot directly generate samples from $f(\theta|x)$.

## Monte Carlo / Variance Reduction

General techniques for faster Monte Carlo.

### Integration and Sampling

Note that we can treat any $b(x)$ as if it has the form $a(x) f(x)$, since $b(x) = b(x) \cdot 1$; so we can use the uniform distribution over the domain of $x$ as $f(x)$. This is called crude Monte Carlo.

Note that the error of Monte Carlo integration doesn't depend explicitly on the dimension $D$. However, if we, say, sample uniformly and the function has most of its mass concentrated in a small part of a high-dimensional space, this will be inefficient.

### Stratified Sampling

If we believe we know that the function looks different in disjoint regions $R_1, \dots, R_K$ of the domain $R$, we may profitably use a stratified sampling approach, drawing samples $x_s \sim \mathrm{Unif}(R_k)$ from each region separately and weighting each region's sample mean by that region's share of the domain:

$$\hat I = \sum_{k=1}^{K} \frac{|R_k|}{|R|}\, \frac{1}{S_k} \sum_{s=1}^{S_k} b(x_s^{(k)}) \to E[b]. \qquad (12)$$

For this idea to yield an advantage, the differences between the mean values of $b$ in each region should be greater than the variation within each region. Though it can be very useful, it relies on certain knowledge of the function.

### Importance Sampling

We may have some other function $q(x)$ which we believe is similar to $f(x)$, and we know how to sample from $q(x)$. We can always write

$$I = \int b(x)\,dx = \int \frac{b(x)}{q(x)}\, q(x)\,dx. \qquad (13)$$

The estimator draws samples $x_s \sim q$ instead of $x_s \sim f$, and

$$\hat I = \frac{1}{S} \sum_{s=1}^S \frac{b(x_s)}{q(x_s)} \to E_q\!\left[\frac{b(x)}{q(x)}\right] = I. \qquad (14)$$

### Importance Sampling

For this idea, called importance sampling, to work, we must have $q(x) > 0$ everywhere in the domain. Ideally $q$ overestimates $f$ where $f$ is small. Though often powerful, it relies on having knowledge of a good $q$ for $f$.

### Control Variates

Another idea is that of control variates. If we believe we know another function $q(x)$ which is similar to $b(x)$ and whose integral we know analytically, we can write

$$\int b(x)\,dx = \int q(x)\,dx + \int \left(b(x) - q(x)\right)dx, \qquad (15)$$

where we compute the first part analytically and estimate the second part using samples.
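The control-variate decomposition (15) can be sketched on a toy integral; the choice $b(x) = e^x$ on $[0,1]$ with control $q(x) = 1 + x$ is my own example, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Control variate for I = int_0^1 e^x dx (true value e - 1), using
# q(x) = 1 + x, whose integral over [0, 1] is 3/2 analytically.
S = 100_000
x = rng.random(S)

crude = np.mean(np.exp(x))                 # crude Monte Carlo
cv = 1.5 + np.mean(np.exp(x) - (1.0 + x))  # eq. (15): analytic part + sampled part

true = np.e - 1.0
var_crude = np.var(np.exp(x))              # variance of the crude integrand
var_cv = np.var(np.exp(x) - (1.0 + x))     # variance of the residual integrand
```

Since $e^x - (1 + x)$ varies much less than $e^x$ on $[0,1]$, the control-variate estimator has a markedly smaller variance for the same number of samples.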
## Markov Chain Monte Carlo

Another Monte Carlo method, based on the idea of Markov chains.

### Markov Chains

Consider a sequence of discrete random variables $Z_t$ such that

$$P(Z_{t+1} = z_j \mid Z_0 = z_{i_0}, \dots, Z_t = z_i) = P(Z_{t+1} = z_j \mid Z_t = z_i), \qquad (16)$$

i.e., the transition probabilities between different values, or states, depend only on the variable's current state. We say the random variable $Z$ is a Markov process, and the sequence $\{Z_t\}$ is a Markov chain generated by the process. The transition probability between states,

$$K(z_i \to z_j) \equiv P(Z_{t+1} = z_j \mid Z_t = z_i), \qquad (17)$$

is called the jump kernel or transition kernel. Let $A$ be the matrix whose $(i,j)$ element is $K(i \to j)$.

### Markov Chains

Let $\pi_t(j) \equiv P(Z_t = z_j)$ denote the probability that $Z$ is in state $j$ at time $t$. We start the chain with some setting $\pi_0$. The probability that $Z$ has value $z_i$ at time $t+1$ is

$$\pi_{t+1}(i) = P(Z_{t+1} = z_i) \qquad (18)$$
$$= \sum_k P(Z_{t+1} = z_i \mid Z_t = z_k)\, P(Z_t = z_k) \qquad (19)$$
$$= \sum_k K(k \to i)\, \pi_t(k) = \sum_k A_{ki}\, \pi_t(k), \qquad (20)$$

or, in matrix form, $\pi_{t+1} = \pi_t A$ (21). We see that we can iterate by successively multiplying by $A$:

$$\pi_t = \pi_{t-1} A = \pi_{t-2} A \cdot A = \pi_{t-2} A^2 = \cdots = \pi_0 A^t. \qquad (22)$$

### Markov Chains

Defining the $\tau$-step probability that the process is in state $j$ given that it started in state $i$, $\tau$ steps ago,

$$P_\tau(i \to j) \equiv P(Z_{t+\tau} = z_j \mid Z_t = z_i), \qquad (23)$$

we say that the chain is irreducible if there exists a positive $\tau$ such that $P_\tau(i \to j) > 0$ for each $(i, j)$. In other words, one can always go from any state to any other in some finite number of steps.

We say the chain is aperiodic if the number of steps required to move between any two states is not required to be a multiple of some integer. In other words, the chain is not forced into some cycle of fixed length between certain states.

### Markov Chains

If a Markov chain is irreducible and aperiodic, it will reach a stationary (or equilibrium) distribution $\pi^*$:

$$\pi_0 A^\tau \to \pi^* \qquad (24)$$

as $\tau \to \infty$.
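The iteration $\pi_{t+1} = \pi_t A$ converging to the stationary distribution can be demonstrated directly on a small chain; the 3-state transition matrix below is my own example:

```python
import numpy as np

# A small 3-state transition matrix (each row sums to 1),
# irreducible and aperiodic.
A = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

pi = np.array([1.0, 0.0, 0.0])   # pi_0: start certain to be in state 0

for _ in range(200):             # pi_t = pi_0 A^t, eq. (22)
    pi = pi @ A

# pi is now numerically stationary: pi = pi A, eq. (25)
residual = np.abs(pi @ A - pi).max()
```

After a few hundred multiplications the vector stops changing to machine precision, regardless of the starting $\pi_0$.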
The probability vectors $\pi_\tau$ change less and less, getting closer to $\pi^*$. The stationary distribution stops changing, i.e.,

$$\pi^* = \pi^* A. \qquad (25)$$

If a Markov chain is irreducible and aperiodic and has stationary distribution $\pi^*$, then it is ergodic:

$$P_t(i \to j) \equiv P(Z_t = z_j \mid Z_0 = z_i) \to \pi^*(j) \quad \forall\, i, j, \qquad (26)$$

and we can consider the $Z_t$ to be iid draws from $\pi^*$.

### Markov Chains

A sufficient condition for having a unique stationary distribution is that

$$\pi(i)\, P(i \to j) = \pi(j)\, P(j \to i) \quad \forall\, i, j, \qquad (27)$$

which is called the detailed balance or reversibility condition.

Now, to extend all of this to continuous state spaces, we just need a transition kernel which satisfies $\int K(z, z')\,dz' = 1$; and now

$$\pi_{t+1}(z') = \int \pi_t(z)\, K(z, z')\,dz \qquad (28)$$
$$\pi^*(z') = \int \pi^*(z)\, K(z, z')\,dz. \qquad (29)$$

### Metropolis-Hastings Algorithm

Our goal is to create a Markov chain which converges to a density $f(x) = g(x)/C$ from which we'd like to be able to draw iid samples, where the normalizing constant $C$ may not be known. We can evaluate $g$ at any point $x$, but we cannot directly sample from $g$.

We will create a sequence of points $z_0, \dots, z_T$, and some subset of this sequence will be used as if they were iid samples from $f$. Then we can use these to do Monte Carlo integration. This methodology is called Markov chain Monte Carlo (MCMC).

### Metropolis-Hastings Algorithm

Here's the Metropolis algorithm. We assume the transition kernel is symmetric, i.e., $K(z, z') = K(z', z)$. Most often the Gaussian kernel $K(z, z') = N(z'; z, \sigma)$ is used. Starting with $z_t$ set to some point $z_0$:

1. Obtain a new candidate point $z_c$ from the distribution $K(z_t, \cdot)$.
2. Compute the ratio

$$\alpha = \frac{f(z_c)\, C}{f(z_t)\, C} = \frac{g(z_c)}{g(z_t)}. \qquad (30)$$

3. If $\alpha > 1$ (i.e., the jump increases the density), accept the candidate point, setting $z_{t+1} = z_c$. Otherwise accept it with probability $\alpha$.
4. Repeat.

### Metropolis-Hastings Algorithm

We generalize this to non-symmetric kernels to obtain the Metropolis-Hastings algorithm:

1. Obtain a new candidate point $z_c$ from the distribution $K(z_t, \cdot)$.
2. Accept $z_c$ with probability

$$\alpha = \min\!\left(\frac{g(z_c)\, K(z_c, z_t)}{g(z_t)\, K(z_t, z_c)},\ 1\right). \qquad (31)$$

3. Repeat.
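The Metropolis algorithm (symmetric Gaussian proposal, acceptance ratio (30)) fits in a few lines. As a sketch, the unnormalized target $g(z) = e^{-z^2/2}$ below is my own choice, standing in for a density known only up to its constant:

```python
import numpy as np

rng = np.random.default_rng(2)

def g(z):
    # Unnormalized density g(z) = exp(-z^2/2): a standard normal
    # up to an unknown constant C.
    return np.exp(-0.5 * z * z)

z = 0.0
samples = []
for t in range(20_000):
    z_c = z + rng.normal(0.0, 1.0)       # candidate from symmetric K(z_t, .)
    alpha = g(z_c) / g(z)                # eq. (30): C cancels
    if rng.random() < min(alpha, 1.0):   # accept with probability min(alpha, 1)
        z = z_c
    samples.append(z)

burn_in = 2_000                          # discard the early, unconverged part
chain = np.array(samples[burn_in:])
mean, var = chain.mean(), chain.var()    # should approach 0 and 1
```

Only ratios of $g$ appear, which is exactly why MCMC works when the normalizing constant is unknown.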
After some burn-in period of $T_0$ steps, we assume the chain has converged to its stationary distribution. We keep only the samples after that point and use them as if they were iid samples from $f$.

### Metropolis-Hastings: Convergence

To show that the Metropolis-Hastings algorithm generates a Markov chain whose stationary distribution is $f$, it is sufficient to show that its transition kernel satisfies the detailed balance condition with $f$. The actual transition probability is a combination of the transition kernel and the acceptance step.

### Metropolis-Hastings: Convergence

We can verify that the detailed balance condition

$$P(z \to z')\, g(z) = P(z' \to z)\, g(z')$$

holds by checking each of the three cases for the relative sizes of $g(z)K(z, z')$ and $g(z')K(z', z)$ ($>$, $<$, $=$), to see that the stationary distribution from this kernel corresponds to draws from $f$.

### Metropolis-Hastings: Practical Issues

So far we just know that if enough samples are taken, we can use it to simulate samples from $f$. To use it in practice, we have to address some issues:

1. Where should we start the chain?
2. What should the width of the Gaussian be?
3. When has it converged to the stationary distribution?
4. What is the error of the final integral?

For all of these questions there is essentially no solid theory.

### Metropolis-Hastings: Practical Issues

We try to start the chain at a mode if we can. If there are multiple modes, we may try multiple chains. People argue about whether it is better in general to use a single chain or multiple chains.

The width of the Gaussian is critical for performance: if too small, we can stay trapped in a local mode; if too large, we devolve to uniform sampling from the high-dimensional space. We use trial and error, examining many time-series plots. We say the chain is poorly mixing if it does not seem to be making much progress, which is often reflected in low acceptance rates.
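Mixing can be examined numerically through the autocorrelation of the chain's output. As a sketch, the AR(1) sequence below (with known lag-1 autocorrelation 0.9) is my own stand-in for the sequence $a(z_t)$ produced by an MCMC run:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic AR(1) sequence with lag-1 autocorrelation phi = 0.9,
# standing in for the correlated output of a poorly mixing chain.
phi, n = 0.9, 100_000
a = np.empty(n)
a[0] = 0.0
for t in range(1, n):
    a[t] = phi * a[t - 1] + rng.standard_normal()

def autocorr(x, lag):
    # Sample autocorrelation of x at the given lag.
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

rho1 = autocorr(a, 1)   # should be close to phi = 0.9
```

High autocorrelation at small lags is the quantitative signature of poor mixing; plotting `autocorr` over many lags is the diagnostic the slides describe.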
### Metropolis-Hastings: Practical Issues

Before convergence, there is error due to the fact that the sequence is correlated; an estimate of the Monte Carlo integration error using MCMC is

$$\hat\sigma^2 = \frac{\hat\sigma_a^2}{S}\left(1 + 2 \sum_{l=1}^{\infty} \rho_l(a)\right), \qquad (33)$$

where $\rho_l(a)$ is the lag-$l$ autocorrelation in the sequence $a(z_t)$. Plotting this autocorrelation can be used to examine the progress of MCMC.

## CS 8803-MDM Lecture 24: Parameter Optimization III: Convex

Alexander Gray (agray@cc.gatech.edu), Georgia Institute of Technology

### Today

1. Convex Optimization Problems
2. Convex Optimization Methods

## Convex Optimization Problems

Problems with a convex objective function and, if there are constraints, convex constraints.

### Convex Functions

It used to be that optimization was divided into linear and nonlinear objective functions. Now the distinction is between convex and nonconvex functions.

A convex set $C$ has the property that for any two points $x_1$ and $x_2$ in $C$ and $0 \le \alpha \le 1$ (a line segment),

$$\alpha x_1 + (1 - \alpha) x_2 \in C. \qquad (1)$$

A convex function $f$ has the property that $\mathrm{dom}\, f$, the domain of $f$, is a convex set, and for any two points in $\mathrm{dom}\, f$,

$$f(\alpha x_1 + (1-\alpha) x_2) \le \alpha f(x_1) + (1-\alpha) f(x_2). \qquad (2)$$

This inequality is often called Jensen's inequality. $f$ is concave if $-f$ is convex.

### Convex Functions

Examples of convex functions:

- affine functions $f(x) = Ax + b$
- quadratic functions
- exponential $f(x) = e^{ax}$
- powers $f(x) = x^a$, $a \ge 1$ or $a \le 0$
- negative logarithm $f(x) = -\log x$ (the logarithm itself is concave)
- norms
- maximum $\max(x_1, \dots, x_N)$
- nonnegative weighted sums of convex functions $f = \alpha_1 f_1 + \cdots + \alpha_K f_K$, $\alpha_k \ge 0$
- maximum of convex functions $f(x) = \max(f_1(x), f_2(x))$

### Convex Functions

Suppose $f$ is differentiable (its gradient $\nabla f$ exists at each point in $\mathrm{dom}\, f$). Then $f$ is convex iff $\mathrm{dom}\, f$ is convex and for any two points in $\mathrm{dom}\, f$,

$$f(x_2) \ge f(x_1) + \nabla f(x_1)^T (x_2 - x_1). \qquad (3)$$

Thus if $\nabla f(x) = 0$, then for all $x' \in \mathrm{dom}\, f$, $f(x') \ge f(x)$, i.e., $x$ is a global minimizer of $f$. This is what makes convex functions special.

If $f$ is twice differentiable, $f$ is convex iff $\mathrm{dom}\, f$ is convex and its Hessian is positive semidefinite for all $x \in \mathrm{dom}\, f$: $\nabla^2 f(x) \succeq 0$, or in one dimension, $f''(x) \ge 0$.
### Least Squares

Consider the least-squares problem: find

$$x^* = \arg\min_{x \in \mathbb{R}^D} \|Ax - b\|^2 = x^T A^T A x - 2 b^T A x + b^T b. \qquad (4)$$

This is quadratic and unconstrained. It can be solved analytically as $x^* = (A^T A)^{-1} A^T b$. So this is a special, easy case.

### Unconstrained Convex Optimization

The conditions for optimality of a convex function in the unconstrained case boil down to the necessary and sufficient condition

$$\nabla f(x) = 0. \qquad (5)$$

So for convex functions without constraints, unconstrained optimization methods like Newton's method find the global minimum. If we use $L_1$ or $L_\infty$ minimization, we end up with a constrained optimization problem.

### General Form

From now on, "convex optimization" refers to problems having convex objective functions and convex constraints. For such problems we can find the global minimum in polynomial time. Recall the general form for an optimization problem: find

$$x^* = \arg\min_{x \in \mathbb{R}^D} f(x) \qquad (6)$$
$$\text{subject to } c_i(x) \ge 0, \quad i = 1, \dots, M \qquad (7)$$
$$d_i(x) = 0, \quad i = 1, \dots, N. \qquad (8)$$

The domain is the intersection of the domains of the constraint functions. A point is feasible if it satisfies the constraints.

### Linear Programming

When the objective and constraint functions are all affine, the problem is called a linear program (LP), which has the form: find

$$x^* = \arg\min_{x \in \mathbb{R}^D} c^T x + d \qquad (9)$$
$$\text{subject to } Gx \le h \qquad (10)$$
$$Ax = b. \qquad (11)$$

### Quadratic Programming

When the objective function is quadratic and the constraint functions are affine, the problem is called a quadratic program (QP), which has the form: find

$$x^* = \arg\min_{x \in \mathbb{R}^D} x^T P x + q^T x + r \qquad (12)$$
$$\text{subject to } Gx \le h \qquad (13)$$
$$Ax = b. \qquad (14)$$

If the constraints are also quadratic, the problem is called a quadratically constrained quadratic program (QCQP): find

$$x^* = \arg\min_{x \in \mathbb{R}^D} x^T P_0 x + q_0^T x + r_0 \qquad (15)$$
$$\text{subject to } x^T P_i x + q_i^T x + r_i \le 0, \quad i = 1, \dots, M \qquad (16)$$
$$Ax = b. \qquad (17)$$
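When a QP has only equality constraints, its optimality conditions reduce to one linear system, which makes a compact sketch. The toy problem below (random $P$, $A$, etc.) is my own construction, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(10)

# Equality-constrained QP: minimize 0.5 x^T P x + q^T x subject to A x = b.
# At the optimum: P x + q + A^T nu = 0 (stationarity) and A x = b.
n, m = 5, 2
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)          # symmetric positive definite
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Stack the two conditions into one square linear system and solve.
kkt = np.block([[P, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(kkt, rhs)
x, nu = sol[:n], sol[n:]

feas = np.abs(A @ x - b).max()               # constraints satisfied
stat = np.abs(P @ x + q + A.T @ nu).max()    # Lagrangian gradient is zero
```

This is the "simple quadratic problems can be solved analytically" case in the hierarchy discussed later; inequality constraints are what force iterative methods.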
### Second-Order Cone Programming

A closely related problem is called a second-order cone program (SOCP), which has the form: find

$$x^* = \arg\min_{x \in \mathbb{R}^D} f^T x \qquad (18)$$
$$\text{subject to } \|A_i x + b_i\|_2 \le c_i^T x + d_i, \quad i = 1, \dots, M \qquad (19)$$
$$Fx = g. \qquad (20)$$

A constraint of this form is called a second-order cone constraint.

### Geometric Programming

A geometric program (GP) is a problem of the form: find

$$x^* = \arg\min_{x \in \mathbb{R}^D} f(x) \qquad (21)$$
$$\text{subject to } c_i(x) \le 1, \quad i = 1, \dots, M \qquad (22)$$
$$d_i(x) = 1, \quad i = 1, \dots, N, \qquad (23)$$

where $f$ and the $c_i$ have the form $\log \sum_k e^{a_{ik}^T x + b_{ik}}$ (24), and the $d_i$ have the form $e^{g_i^T x + h_i}$ (25).

### Semidefinite Programming

A semidefinite program (SDP) has the form: find

$$x^* = \arg\min_{x \in \mathbb{R}^D} c^T x \qquad (26)$$
$$\text{subject to } x_1 F_1 + \cdots + x_n F_n + G \preceq 0 \qquad (27\text{-}28)$$
$$Ax = b. \qquad (29)$$

## Convex Optimization Methods

The interior-point method.

### Lagrangian Duality

Consider again the general form for an optimization problem: find

$$x^* = \arg\min_{x \in \mathbb{R}^D} f(x) \qquad (30)$$
$$\text{subject to } c_i(x) \ge 0, \quad i = 1, \dots, M \qquad (31)$$
$$d_i(x) = 0, \quad i = 1, \dots, N. \qquad (32)$$

### Lagrangian Duality

We can rewrite the original problem as the unconstrained optimization problem: find

$$x^* = \arg\min_{x \in \mathbb{R}^D}\ f(x) + \sum_{i=1}^M I_\infty(c_i(x)) + \sum_{i=1}^N I_\infty(d_i(x)), \qquad (33)$$

where $I_\infty$ is the indicator-like function which takes the value $0$ if the constraint function is satisfied by $x$ and $\infty$ otherwise.

### Lagrangian Duality

The basic idea of Lagrangian duality is to soften the problem: replacing $I_\infty(c_i(x))$ by $-\lambda_i c_i(x)$ and $I_\infty(d_i(x))$ by $\nu_i d_i(x)$, where the $\lambda_i$ are positive weights, we obtain the Lagrangian of the problem,

$$L(x, \lambda, \nu) = f(x) - \sum_{i=1}^M \lambda_i c_i(x) + \sum_{i=1}^N \nu_i d_i(x). \qquad (34)$$

The $\lambda_i$ and $\nu_i$ are called the Lagrange multipliers, and the $\lambda$ and $\nu$ vectors are called the dual variables or Lagrange multiplier vectors associated with the problem.

### Lagrangian Duality

We define the dual function $g$ as the minimum value of the Lagrangian over $x$:

$$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu). \qquad (35)$$

Since the dual function is the pointwise infimum of a family of affine functions of $(\lambda, \nu)$, it is concave, even if the original problem is not convex. The dual function yields a lower bound on the optimal value: $g(\lambda, \nu) \le f(x^*)$, for any value of $(\lambda, \nu)$.
### Lagrangian Duality

Now we'd like to find the best lower bound that can be obtained from the dual function: find

$$(\lambda^*, \nu^*) = \arg\max_{\lambda, \nu}\ g(\lambda, \nu) \qquad (36)$$
$$\text{subject to } \lambda \ge 0. \qquad (37)$$

This is called the dual problem, while the original problem is called the primal problem. This is always a convex optimization problem, because the objective function is concave and the constraint is convex, even if the primal problem is not convex.

### Lagrangian Duality

If the primal problem is convex, usually maximizing the dual problem is the same as minimizing the primal problem. We call this strong duality. Suppose we have strong duality and all the functions are differentiable. Since $x^*$ minimizes $L(x, \lambda^*, \nu^*)$ over $x$, it follows that its gradient is zero at $x^*$:

$$\nabla f(x^*) - \sum_{i=1}^M \lambda_i^* \nabla c_i(x^*) + \sum_{i=1}^N \nu_i^* \nabla d_i(x^*) = 0. \qquad (38)$$

### KKT Optimality Conditions

Thus we have these conditions, which must hold at the optimum:

$$c_i(x^*) \ge 0 \qquad (39)$$
$$d_i(x^*) = 0 \qquad (40)$$
$$\lambda_i^* \ge 0 \qquad (41)$$
$$\lambda_i^*\, c_i(x^*) = 0 \qquad (42)$$
$$\nabla f(x^*) - \sum_{i=1}^M \lambda_i^* \nabla c_i(x^*) + \sum_{i=1}^N \nu_i^* \nabla d_i(x^*) = 0. \qquad (43)$$

These are called the Karush-Kuhn-Tucker (KKT) conditions. They are necessary and sufficient at the optimum. We can thus formulate optimization as solving these equations.

### Hierarchy of Problems

There is a kind of hierarchy of problems:

1. Simple quadratic problems can be solved analytically.
2. Newton's method solves a sequence of quadratic problems.
3. Newton's method can be applied to equality-constrained optimization problems by removing the equality constraints to create unconstrained problems.
4. Interior-point methods solve a sequence of equality-constrained optimization problems using Newton's method.

### Logarithmic Barrier Idea

We'll now do something related but slightly different with the constraints. We rewrite the general problem as: find

$$x^* = \arg\min_{x \in \mathbb{R}^D}\ f(x) + \sum_{i=1}^M I_\infty(c_i(x)) \qquad (44)$$
$$\text{subject to } Ax = b. \qquad (45)$$

We approximate the indicator function by

$$\hat I(u) = -\frac{1}{t} \log u, \qquad (46)$$

where $t$ is a parameter which increases the accuracy of the approximation as it increases. $\hat I(u)$ goes to $\infty$ as $u$ decreases to zero, but is differentiable and convex.
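The barrier approximation (46) can be exercised on a one-dimensional toy problem; the problem (minimize $x$ subject to $x \ge 1$, exact solution $x^* = 1$) is my own example, not from the lecture:

```python
# Barrier method sketch: minimize f(x) = x subject to c(x) = x - 1 >= 0.
# Inner problem for each t: minimize t*x - log(x - 1) by Newton's method;
# its exact minimizer is x = 1 + 1/t, so the error shrinks like 1/t.

def newton(t, x, iters=50):
    for _ in range(iters):
        grad = t - 1.0 / (x - 1.0)
        hess = 1.0 / (x - 1.0) ** 2
        step = grad / hess
        # Damp the step so the iterate stays strictly feasible (x > 1).
        while x - step <= 1.0:
            step *= 0.5
        x -= step
    return x

x, t, mu = 2.0, 1.0, 10.0
for _ in range(8):        # increase t each outer iteration; warm-start from x
    x = newton(t, x)
    t *= mu

# The last inner solve used t = 1e7, so x should be within ~1e-7 of 1.
```

The warm start matters: each $x^*(t)$ lies close to the next inner problem's minimizer, which is what keeps the Newton solves cheap in real interior-point methods.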
### Logarithmic Barrier Idea

We call the function

$$\phi(x) = -\sum_{i=1}^M \log c_i(x) \qquad (47)$$

the logarithmic barrier for the problem. We'll optimize $t f(x) + \phi(x)$.

The barrier method solves a sequence of such problems, each of which is convex, increasing $t$ on each iteration, using Newton's method. The solution $x^*(t)$ is the starting point for the next value of $t$. It can be shown that the error for each iteration is bounded by $M/t$, and thus the error goes to zero.

### Interior-Point Method

The barrier method is an example of an interior-point method. Given feasible $x$, $t > 0$, $\mu > 1$, tolerance $\epsilon > 0$, repeat:

1. Find $x^*(t)$ by using Newton's method to minimize $t f + \phi$, subject to $Ax = b$, starting at $x$.
2. $x := x^*(t)$.
3. Quit if $M/t < \epsilon$.
4. $t := \mu t$.

Methods called phase I methods are used to choose the starting (feasible) $x$.

### Interior-Point Method

The interior-point method can be used for all of the constrained convex optimization problems above. In practice, despite convergence analysis which relates the number of iterations to $M$, it always takes about 10-20 iterations. There is a modification of the barrier method called the primal-dual interior-point method, which is often a bit faster and is what is used in practice.

### SVM Quadratic Program

Recall that the formulation of the support vector machine results in a quadratic program. Though generally effective, interior-point is not sufficient to handle the large number of variables and constraints in an SVM problem. An idea called chunking is often used for large-scale convex optimization problems: it solves smaller problems using subsets of the constraints, adding more constraints until all of them are satisfied.

## CS 8803-MDM Lecture 25: Parameter Optimization IV: Linear and Online

Alexander Gray (agray@cc.gatech.edu), Georgia Institute of Technology

### Today

1. Linear Algebraic Optimization
2. Online Convex Optimization
### Computational Linear Algebra

There is one main computation studied in numerical linear algebra: solving a system of linear equations $Ax = b$, where $A$ is $N \times N$; written out, $\sum_j A_{ij} x_j = b_i$ for each $i$. It has some common special cases:

- Inverting a matrix $A$ to obtain $A^{-1}$ ($A A^{-1} = I$); think $x = A^{-1} b$.
- Solving a linear least-squares problem, $\min_x \|Ax - b\|^2$.
- Solving an eigenvalue problem, $Ax = \lambda x$.

All three of these come up commonly in statistics, e.g., respectively, in evaluating a Gaussian density, linear regression, and PCA.

### Types of Matrices

There are several special cases of matrices for which there are often faster custom methods:

- banded (diagonal, tridiagonal), triangular, Hessenberg
- block
- symmetric ($A^T = A$)
- definite (positive, semi-)
- Vandermonde
- Toeplitz

Mostly these allow different matrix decompositions, which lead to different types of solutions.

### Types of Matrices

These lead to implicit storage and matrix multiplication:

- sparse
- kernel

Examples: covariance matrices (symmetric positive definite, sometimes diagonal); AR models (Toeplitz); kernel PCA (kernel matrix, possibly sparse).

### Linear Regression

Recall linear regression:

$$y_i = \beta^T X_i + \epsilon_i = \sum_j \beta_j X_{ij} + \epsilon_i. \qquad (1)$$

We want the parameters which minimize the squared error, or residual sum of squares:

$$RSS(\beta) = \sum_{i=1}^N \left(y_i - \beta^T X_i\right)^2 \qquad (2)$$
$$= (y - X\beta)^T (y - X\beta) \qquad (3)$$
$$= \|y - X\beta\|^2. \qquad (4)$$

### Least Squares Problem

This has the form

$$\min_x \|Ax - b\|^2. \qquad (5)$$

In linear algebra this is the standard linear least-squares problem. This is basically solving $Ax = b$ when the system is overdetermined (more equations than unknowns). It can be put in the form $\bar A \bar x = \bar b$ with square $\bar A$ as

$$A^T A\, \bar x = A^T b, \qquad (6)$$

called the normal equations of the least-squares problem.

### Optimization Form

Consider the linear system $Ax = b$. Assume for the moment that $A$ is symmetric and positive definite (thus square). Let's consider the minimizer of the quadratic form

$$\phi(x) = \tfrac{1}{2} x^T A x - x^T b. \qquad (7)$$

Then the minimizer $x^*$ is exactly the solution to $Ax = b$.
### Optimization Form

To see this, suppose $x^*$ is exactly the solution to $Ax = b$. Then, by completing the square,

$$\phi(x) = \tfrac{1}{2} x^T A x - x^T A x^* \qquad (8)$$
$$= \tfrac{1}{2} x^T A x - x^T A x^* + \tfrac{1}{2} x^{*T} A x^* - \tfrac{1}{2} x^{*T} A x^* \qquad (9)$$
$$= \tfrac{1}{2} (x - x^*)^T A (x - x^*) - \tfrac{1}{2} x^{*T} A x^*. \qquad (10)$$

The last term does not depend on $x$, so $\phi(x)$ is minimized when $(x - x^*)^T A (x - x^*)$ is minimized. Since $A$ is positive definite, we know that this term is positive unless $x - x^* = 0$. So $\phi(x)$ takes its minimum when and only when $x = x^*$.

### Optimization Form

More simply, looking at the derivative, we find that

$$\nabla \phi(x) = Ax - b. \qquad (11)$$

Clearly the only point at which the gradient is zero is the solution of $Ax = b$. So the problem $Ax = b$ can be written as the optimization problem $\min_x \phi(x) = \tfrac{1}{2} x^T A x - x^T b$.

### Steepest Descent

At the current guess $x_k$, the function $\phi$ decreases most rapidly in the direction of the negative gradient, $-\nabla \phi(x_k) = b - A x_k$. If the residual

$$r_k = b - A x_k \qquad (12)$$

of $x_k$ is nonzero, there exists a positive $\alpha$ such that $\phi(x_k + \alpha r_k) < \phi(x_k)$. The residual gives us our search direction, $p_k = r_k$. Now we need the step size $\alpha_k$; determining it is called line search.

### Steepest Descent

Line search is difficult in general, but for our quadratic function we can actually do exact line search, in which we choose $\alpha_k$ so that

$$x_{k+1} = \arg\min_\alpha \phi(x_k + \alpha p_k), \qquad (13)$$

in which case we set

$$\alpha_k = \frac{p_k^T r_k}{p_k^T A p_k} \qquad (14)$$

to obtain steepest descent with exact line search.

### Conjugate Gradient

It turns out that this isn't that efficient: taking the steepest direction from $x_k$ is not the same as taking the direction which goes to the bottom of a long trough with steep sides, say. We will do a smoothing of the directions over the iterations by using

$$p_k = r_k + \beta_k\, p_{k-1}, \qquad (15)$$

where

$$\beta_k = \frac{r_k^T r_k}{r_{k-1}^T r_{k-1}}. \qquad (16)$$

This is called conjugate gradient. We won't go into its theory.
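The updates (14)-(16) assemble into a complete solver in a few lines; the random symmetric positive definite test system below is my own:

```python
import numpy as np

rng = np.random.default_rng(5)

# A random symmetric positive definite system.
N = 50
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)
b = rng.standard_normal(N)

# Conjugate gradient with exact line search.
x = np.zeros(N)
r = b - A @ x                         # residual, eq. (12)
p = r.copy()
for _ in range(N):
    Ap = A @ p                        # the only use of A: one matrix-vector product
    alpha = (r @ r) / (p @ Ap)        # exact line search, eq. (14)
    x += alpha * p
    r_new = r - alpha * Ap
    if np.linalg.norm(r_new) < 1e-12:
        r = r_new
        break
    beta = (r_new @ r_new) / (r @ r)  # eq. (16)
    p = r_new + beta * p              # smoothed direction, eq. (15)
    r = r_new

residual = np.linalg.norm(A @ x - b)
```

Note that $A$ enters only through the product `A @ p`, which is exactly the property the next slide highlights.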
### Conjugate Gradient

What's good about this?

- Inside each iteration, we need only perform a multiplication of the matrix $A$ with a different vector each time. We do not need to make a new matrix the size of $A$. This is good if $A$ is large.
- It turns out this hones in quickly on the top eigenvalues.

It is the method of choice for large sparse matrices.

### Conjugate Gradient

The same basic approach can be derived this way (from the viewpoint of optimization) or from a different viewpoint involving methods called Lanczos methods. These viewpoints can be unified using the framework of Krylov subspaces.

There is a version of this approach for the least-squares problem, called LSQR (also called CGNE); a version for the eigenvalue problem, called Arnoldi iteration; and versions for cases which are not symmetric positive definite.

### Decomposition-Based Methods

The downside of the Krylov-type approaches (also called iterative methods) is that their numerical stability is not the best possible. For full-rank matrices, the methods of choice are Gaussian elimination or orthogonalization approaches such as QR factorization. For rank-deficient or unknown-rank matrices, the method of choice is singular value decomposition.

## Online Optimization Problems

One data point at a time.

### Sherman-Morrison Formula

Suppose you have already obtained, after a lot of work, a matrix inverse $A^{-1}$. Now you want to make a small change to $A$, for example to change one element, one row, or one column:

$$A \to A + u \otimes v, \qquad (17)$$

where $u \otimes v$ is the matrix whose $(i,j)$th element is the product of the $i$th component of $u$ and the $j$th component of $v$. If $u$ is a unit vector $e_i$, this adds $v$ to the $i$th row; if $v$ is a unit vector $e_j$, this adds $u$ to the $j$th column; if both are proportional to unit vectors, then a term is only added to one element.

### Sherman-Morrison Formula

The Sherman-Morrison formula gives the inverse after the change:

$$(A + u \otimes v)^{-1} = A^{-1} - \frac{(A^{-1} u) \otimes (v A^{-1})}{1 + v^T A^{-1} u}. \qquad (18)$$
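The rank-one update formula (18) can be verified numerically against a direct inverse; the well-conditioned random matrix below is my own test case:

```python
import numpy as np

rng = np.random.default_rng(6)

# Verify the Sherman-Morrison update against recomputing the inverse.
N = 30
A = rng.standard_normal((N, N)) + N * np.eye(N)   # well-conditioned
A_inv = np.linalg.inv(A)
u = rng.standard_normal(N)
v = rng.standard_normal(N)

# Rank-one update of the inverse in O(N^2), eq. (18).
Au = A_inv @ u
vA = v @ A_inv
updated_inv = A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

# O(N^3) reference: invert the perturbed matrix from scratch.
direct_inv = np.linalg.inv(A + np.outer(u, v))
max_err = np.abs(updated_inv - direct_inv).max()
```

Only two matrix-vector products, an outer product, and a dot product are needed, which is where the $O(N^2)$ cost comes from.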
The advantage of this is that it only requires two matrix-vector multiplications and a dot product, which is $O(N^2)$, rather than the $O(N^3)$ needed to redo the entire inverse. This can be used for solving

$$(A + u \otimes v)\, \tilde x = b \qquad (19)$$

in an online fashion.

### Stochastic Gradient Descent

Consider the batch optimization problem

$$\arg\min_\theta \sum_{i=1}^N \phi(x_i, \theta), \qquad (20)$$

where $\phi$ is convex. Stochastic gradient descent updates the parameters one datum at a time via

$$\theta^{k+1} = \theta^k - \eta_k\, \nabla_\theta \phi(x_k, \theta^k), \qquad (21)$$

where $\eta_k$ is the step size at step $k$.

## CS 8803-MDM Lecture 27: Generalized N-Body Problems

Alexander Gray (agray@cc.gatech.edu), Georgia Institute of Technology

### Today

1. Basic N-Body Problems
2. Kernel Summation Problems
3. Other N-Body Problems

## Basic N-Body Problems

Basic geometric problems.

### Basic N-Body Problems

Given a query point $x_q$ and a set of reference points $R = \{x_r\}$:

- $k$-nearest-neighbor: $\arg\min^{(k)}_r \|x_q - x_r\|$ (1)
- Range-count: $\sum_r I(\|x_q - x_r\| < h)$ (2)

Where they come up: nearest-neighbor methods, manifold learning, fractal dimension, spatial statistics, spherical kernels, kernel methods.

### kd-trees

(A sequence of figure slides illustrating kd-tree construction; only the titles survive extraction.)

### Distance Bounds

We can compute a lower bound $\delta^{lo}_{qR}$ on the distance between $x_q$ and any $x_r \in R$, as well as an upper bound $\delta^{hi}_{qR}$, using the bounding box of the node $R$, in $O(D)$ time.

### Range-Count with kd-trees

(A sequence of figure slides stepping through the range-count recursion on a kd-tree.)

### Recursive Algorithm

```
RC(x_q, R):
    if can-prune(x_q, R): return
    if leaf(R): RC-base(x_q, R)
    else:
        RC(x_q, R.left)
        RC(x_q, R.right)
```

The base case examines all points in $R$ exhaustively. For nearest-neighbor, we maintain the distance of the best candidate nearest-neighbor so far (starting with $\infty$) and prune based on it.
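The recursive range-count with box-based pruning can be sketched end-to-end; the leaf size, the 2-D uniform data, and the dictionary-based tree are my own choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(7)

def build(points, depth=0):
    # Median-split kd-tree; every node stores its bounding box.
    if len(points) <= 8:
        return {"leaf": True, "pts": points,
                "lo": points.min(axis=0), "hi": points.max(axis=0)}
    d = depth % points.shape[1]
    points = points[np.argsort(points[:, d])]
    mid = len(points) // 2
    return {"leaf": False,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid:], depth + 1),
            "lo": points.min(axis=0), "hi": points.max(axis=0)}

def min_dist(q, node):
    # Lower bound on the distance from q to any point in the node's box.
    d = np.maximum(node["lo"] - q, 0.0) + np.maximum(q - node["hi"], 0.0)
    return np.linalg.norm(d)

def range_count(q, node, h):
    if min_dist(q, node) >= h:   # can-prune: box entirely outside the ball
        return 0
    if node["leaf"]:             # base case: check points exhaustively
        return int(np.sum(np.linalg.norm(node["pts"] - q, axis=1) < h))
    return range_count(q, node["left"], h) + range_count(q, node["right"], h)

X = rng.random((2000, 2))
root = build(X)
q, h = np.array([0.5, 0.5]), 0.2
count = range_count(q, root, h)
brute = int(np.sum(np.linalg.norm(X - q, axis=1) < h))   # exhaustive check
```

The pruned count agrees exactly with the brute-force count, since pruning only skips nodes provably outside the query ball.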
### Runtime

To build a kd-tree: $O(DN \log N)$ time, $O(DN)$ space. Query time: roughly $O(c^D \log N)$ (though not rigorous) with a kd-tree. Motivating recurrence:

$$T(N) = T(N/2) + O(1), \qquad (3)$$

whose solution is $O(\log N)$. Note the curse of dimensionality. In reality it should be $O(c^{D'})$, where $D'$ is some notion of intrinsic dimensionality, though this has not been shown yet.

### Problem Variations

$k$-nearest-neighbor variations: nearest-neighbor classification; $(1+\epsilon)$ approximate nearest-neighbor. Range-count variations: range-search (return the indices of matching points); rectangular, or any polytope, instead of a radial region.

### All-Type Problems

Given a set of query points $Q = \{x_q\}$ and a set of reference points $R = \{x_r\}$:

- All-$k$-nearest-neighbor: $\forall x_q$, $\arg\min^{(k)}_r \|x_q - x_r\|$ (4)
- All-range-count: $\forall x_q$, $\sum_r I(\|x_q - x_r\| < h)$ (5)

If $Q = R$ we call this a monochromatic problem; otherwise it is bichromatic. The basic $k$-nearest-neighbor and range-count problems are special cases of these problems.

### Dual-Tree Algorithms

There are now two sets over which to perform divide-and-conquer. We now build a tree for each set. Our algorithm will traverse both trees simultaneously; note that we can analogously compute $O(D)$ distance bounds between two nodes.

```
AllRC(Q, R):
    if can-prune(Q, R): return
    if leaf(Q) and leaf(R): AllRC-base(Q, R)
    else:
        AllRC(Q.left, R.left);  AllRC(Q.left, R.right)
        AllRC(Q.right, R.left); AllRC(Q.right, R.right)
```

### Runtime and Variations

Query time: roughly $O(c^D N)$ (though not rigorous) with a kd-tree. Motivating recurrence:

$$T(N) = 2T(N/2) + O(1), \qquad (6)$$

whose solution is $O(N)$. Problem variations: closest pair, farthest pair, Hausdorff distance, $n$-point correlations, Euclidean minimum spanning tree.

## Kernel Summation Problems

Now throw in a continuous function.

### Kernel Summation Problems

Given a query point $x_q$ and a set of reference points $R = \{x_r\}$:

$$\Phi(x_q) = \sum_r K(\|x_q - x_r\|), \qquad (7)$$

where $K$ is monotonically decreasing with distance and radially symmetric. Examples: Gaussian $K(u) \propto e^{-u^2/2h^2}$; Epanechnikov $K(u) \propto (1 - u^2/h^2)\, I(u < h)$. Where they come up: kernel estimators, Gaussian process regression, kernelized methods.

### Approximation Error

For infinite-tailed kernels it will be impossible to compute this non-exhaustively without performing some approximation. Ideally we'd like to provide a guarantee that our approximate answer $\hat\Phi(x_q)$ is close to the true $\Phi(x_q)$ in terms of a relative error tolerance $\epsilon$ specified by the user:

$$\forall x_q, \quad \frac{|\hat\Phi(x_q) - \Phi(x_q)|}{\Phi(x_q)} \le \epsilon. \qquad (8)$$

We'll maintain various bounds in order to achieve this.

### Centroid Approximation and Bounds

A simple approximation of the contribution of a node $R$ to the summation for $x_q$ is

$$\Phi(x_q, R) = \sum_{r=1}^{N_R} K(\|x_q - x_r\|) \approx N_R\, K(\|\mu_Q - \mu_R\|), \qquad (9)$$

where $\mu_Q$ is the centroid of the query node and $\mu_R$ is the centroid of the reference node; and simple bounds such that $\Phi^{lo}(q,R) \le \Phi(x_q, R) \le \Phi^{hi}(q,R)$ are

$$\Phi^{lo}(q,R) = N_R\, K(\delta^{hi}_{qR}) \qquad (10)$$
$$\Phi^{hi}(q,R) = N_R\, K(\delta^{lo}_{qR}). \qquad (11)$$

### Approximation Rule

Aside: it is practically equivalent to only compute the lower and upper bounds, then take their midpoint at the end (finite-difference method). Stop recursing and use this approximation when

$$\Phi^{hi}(q,R) - \Phi^{lo}(q,R) \le \epsilon\, \frac{N_R}{N}\, \Phi^{lo}(x_q). \qquad (12)$$

This local rule ensures that the relative error tolerance on the global kernel sum will be met.
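The bounds (10)-(11) follow from monotonicity of the kernel: plugging the node's distance bounds into $K$ brackets the node's true contribution. A sketch with my own synthetic node and bandwidth:

```python
import numpy as np

rng = np.random.default_rng(8)

# Gaussian kernel, decreasing in distance.
h = 0.3
K = lambda u: np.exp(-u * u / (2 * h * h))

X_r = rng.random((500, 2)) * 0.2 + 0.6   # points of a reference node R
x_q = np.array([0.1, 0.1])               # a query point outside the node

lo, hi = X_r.min(axis=0), X_r.max(axis=0)    # bounding box of the node
# Lower distance bound: distance from x_q to the box.
d_lo = np.linalg.norm(np.maximum(lo - x_q, 0.0) + np.maximum(x_q - hi, 0.0))
# Upper distance bound: farthest box corner from x_q.
corners = np.array([[a, b] for a in (lo[0], hi[0]) for b in (lo[1], hi[1])])
d_hi = np.linalg.norm(corners - x_q, axis=1).max()

N_R = len(X_r)
phi_true = K(np.linalg.norm(X_r - x_q, axis=1)).sum()
phi_lo = N_R * K(d_hi)   # eq. (10)
phi_hi = N_R * K(d_lo)   # eq. (11)
```

Every point's distance lies in $[\delta^{lo}_{qR}, \delta^{hi}_{qR}]$, so the true sum is guaranteed to lie between `phi_lo` and `phi_hi`, which is what the pruning rule exploits.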
radially symmetric Examples Gaussian Ku oc 6 22h2 Epanechnikov Ku cc 1 u2h2lu lt h Where they come up kernel estimators Gaussian process regression kernelized methods L J CS 8803 MDM Lecture 27 p354l Approximation Error 7 rFor infinitetailed kernels it will be impossible to compute this nonexhaustively without performing some approximation Ideally we d like to provide a guarantee that our approximate answer 313 is close to the true I in terms of a relative error tolerance e specified by the user VIM quotEll 6 qu q 8 We ll maintain various bounds in order to achieve this L J CS 8803 MDM Lecture 27 p364l Centroid Approximation and Bounds rA simple approximation of the contribution of a node B to T the summation for mg is NR quEQ qRIZKleq xrll NRKUIMQ MRH 9 r21 where HQ is the centroid of the query node and HR is the centroid of the reference node and simple bounds such that ltIgtqLR g ltIgtqR g ltIgtqUR are ogR NRK6gR 10 c1gtqUR NRngR 11 L J CS 8803 MDM Lecture 27 p374t Approximation Rule 7 rAside It is practically equivalent to only compute the lower and upper bounds then take their midpoint at the end finitedifference method Stop recursing and use this approximation when DUR LR N S j 6 12 q N This local rule ensures that the relative error tolerance on the global kernel sum will be met L J CS 8803 MDM Lecture 27 p384l Multipole Expansion rHow can we go beyond the centroid approximation One7 way is the idea of multipoe expansions Technically this refers to a different kernel used in physics Ku oc 1u Let s consider that example L J CS 8803 MDM Lecture 27 p394l Multipole Expansion 7 rThe multipole expansion is simply a Taylor expansion of the potential at a point xq due to a point xr about a third point LLR where 6W Hasq 337a 1 1 D I 5er sqr 5qR Ed D D Z 5er Z 6er d d axrd 6 6qR 9 21 acrd xrd 6 14 6qR L J CS 8803 MDM Lecture 27 p404t he total contribution from all the points in node R each Multipole Expansion Sum having weight wr is then L bq NR w NR 1 D rw 6 7 er r 
### Multipole Expansion: Moments

We can rewrite the last equation by replacing the inner summations with constants that depend only on the points in node $R$ (i.e., not on the query point):
$$\Phi_q(R) \approx \alpha_R\, \frac{1}{\delta_{qR}} + \sum_{d=1}^{D} \beta_{Rd}\, \left.\frac{\partial}{\partial x_{rd}} \frac{1}{\delta_{qr}}\right|_{\mu_R} + \frac{1}{2} \sum_{d=1}^{D} \sum_{d'=1}^{D} \Gamma_{Rdd'}\, \left.\frac{\partial^2}{\partial x_{rd}\, \partial x_{rd'}} \frac{1}{\delta_{qr}}\right|_{\mu_R} \qquad (16)$$
Here $\alpha_R = \sum_r w_r$ (a scalar), $\beta_R = \sum_r w_r\, \delta_{rR}$ (a vector), and $\Gamma_R = \sum_r w_r\, \delta_{rR}\, \delta_{rR}^T$ (a matrix) are called the monopole, dipole, and quadrupole moments of the node $R$, respectively, or multipole moments in general.

### Multipole Expansion: Gaussian

For the case of the Gaussian kernel we use a slightly different expansion, but the basic idea is the same. We use the Hermite functions $h_n(u) = e^{-u^2} H_n(u)$, built from the Hermite polynomials $H_n(u) = (-1)^n e^{u^2} D^n e^{-u^2}$, $u \in \mathbb{R}^1$. After scaling and shifting the argument $u$ appropriately, then taking the product of univariate functions for each dimension, we obtain the multivariate Hermite expansion
$$\sum_{r=1}^{N_R} e^{-\|x_q - x_r\|^2 / (2h^2)} = \sum_{r=1}^{N_R} \sum_{\alpha \ge 0} \frac{1}{\alpha!} \left(\frac{x_r - \mu_R}{\sqrt{2h^2}}\right)^{\alpha} h_{\alpha}\!\left(\frac{x_q - \mu_R}{\sqrt{2h^2}}\right) \qquad (17)$$
using the usual multi-index notation for series expansions.

### Multipole Expansion: Query-side

That was an expansion about $\mu_R$ (a reference-side expansion). We could also perform a Taylor expansion about $\mu_Q$ (a query-side expansion):
$$\sum_{r=1}^{N_R} e^{-\|x_q - x_r\|^2 / (2h^2)} = \sum_{\alpha \ge 0} \frac{1}{\alpha!} \left[\sum_{r=1}^{N_R} h_{\alpha}\!\left(\frac{x_r - \mu_Q}{\sqrt{2h^2}}\right)\right] \left(\frac{x_q - \mu_Q}{\sqrt{2h^2}}\right)^{\alpha} \qquad (18)$$

We will actually do both types of expansions:
1. Truncate the reference-side expansion to some number of terms $p$. We can bound the error of this approximation.
2. Truncate the query-side expansion of this truncated expression to some number of terms $p$. We can bound the error of this second approximation.

### Multipole Expansion: Algorithm

The overall algorithm requires three kinds of operations, described in the original physics literature as translation operators on various coefficients, where "far-field" effectively refers to the reference points and "local"
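The reference-side Hermite expansion can be checked numerically in one dimension, where multi-indices reduce to ordinary indices. The sketch below (our own construction) truncates the expansion to $p$ terms: the "far-field" coefficients depend only on the reference points, after which evaluating the node's kernel sum at any query point costs $O(p)$ instead of $O(N_R)$.

```python
import math
import numpy as np

def hermite_functions(u, p):
    """h_n(u) = e^{-u^2} H_n(u) for n = 0..p-1, via the recurrence
    H_{n+1}(u) = 2u H_n(u) - 2n H_{n-1}(u)."""
    H = [np.ones_like(u), 2.0 * u]
    for n in range(1, p):
        H.append(2.0 * u * H[n] - 2.0 * n * H[n - 1])
    return [np.exp(-u ** 2) * Hn for Hn in H[:p]]

rng = np.random.default_rng(2)
h = 0.5
xr = rng.normal(0.0, 0.1, size=200)     # references clustered around mu_R
mu_R = xr.mean()
xq = 1.0                                # a single 1-D query point
s = math.sqrt(2.0 * h ** 2)             # the sqrt(2 h^2) scaling

p = 16
# Far-field coefficients: sums over reference points only (no query point)
coef = [(((xr - mu_R) / s) ** n).sum() / math.factorial(n) for n in range(p)]

# Evaluate the truncated series at the query point
hn = hermite_functions(np.asarray((xq - mu_R) / s), p)
approx = sum(c * v for c, v in zip(coef, hn))

# Exact kernel sum for comparison
exact = np.exp(-(xq - xr) ** 2 / (2.0 * h ** 2)).sum()
```

With the references tightly clustered about $\mu_R$, sixteen terms already reproduce the exact sum to high relative accuracy; the truncation error grows with the node radius measured in units of $\sqrt{2h^2}$, which is what the error bounds in the lecture control.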
effectively refers to the query points:

- Far-field to far-field: a bottom-up precomputation of the coefficients of the reference-side expansions of parent nodes from those of their children.
- Far-field to local: during the run, the query-side and reference-side approximation of the contribution of a reference node to each point in a query node.
- Local to local: a top-down post-computation which propagates the approximations made at internal nodes down to combine them all at the leaf nodes.

### Other N-body Problems

Now throw in a decision, etc.

### N-body Decision Problems

Given a query point $x_q$ and a set of reference points $R = \{x_r\}$, find the class $\hat{c}(x_q)$:

k-nearest-neighbor decision:
$$\hat{c}(x_q) = \mathrm{label}\!\left(\operatorname{argmin}^{(k)}_{x_r} \|x_q - x_r\|\right) \qquad (19)$$

Kernel summation decision:
$$\hat{c}(x_q) = \arg\max_c \sum_r K(\|x_q - x_r\|)\; I(\mathrm{label}(x_r) = c) \qquad (20)$$

### N-body Decision Algorithms

Where they come up: k-nearest-neighbor classification, kernel discriminant analysis (aka nonparametric Bayes classification), support vector machines.

Basic idea for the kernel decision algorithm: maintain bounds on each class's sum, and traverse the query and reference trees until we can show that the lower bound for one class is definitely larger than the upper bound for the other.
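The two decision rules can be written directly as brute-force reference implementations (our own sketch; tree traversal with per-class bounds would accelerate both). For the k-nearest-neighbor decision we read the rule as a majority vote over the $k$ nearest labels:

```python
import numpy as np

def knn_decision(xq, pts, labels, k):
    # majority label among the k nearest reference points to xq
    idx = np.argsort(np.linalg.norm(pts - xq, axis=1))[:k]
    vals, counts = np.unique(labels[idx], return_counts=True)
    return vals[np.argmax(counts)]

def kernel_decision(xq, pts, labels, h):
    # argmax_c sum_r K(||xq - xr||) I(label(xr) = c), Gaussian K
    K = np.exp(-np.linalg.norm(pts - xq, axis=1) ** 2 / (2.0 * h ** 2))
    classes = np.unique(labels)
    sums = np.array([K[labels == c].sum() for c in classes])
    return classes[np.argmax(sums)]

# Two well-separated Gaussian blobs as toy data
rng = np.random.default_rng(3)
a = rng.normal([0.0, 0.0], 0.3, size=(50, 2))
b = rng.normal([2.0, 2.0], 0.3, size=(50, 2))
pts = np.vstack([a, b])
labels = np.array([0] * 50 + [1] * 50)
```

A tree-based version would stop traversing as soon as the lower bound on one class's kernel sum exceeds the upper bound on every other class's sum, since the argmax is then decided without computing the sums exactly.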
