About this Document

Class Notes

This 12-page set of class notes was uploaded by Orval Funk on Monday, September 28, 2015. The notes belong to STAT991 at the University of Pennsylvania, taught by S. Kakade in Fall. For similar materials see /class/215430/stat991-university-of-pennsylvania in Statistics at University of Pennsylvania.

Date Created: 09/28/15
Stat 991: Multivariate Analysis, Dimensionality Reduction and Spectral Methods
Lecture 8: Dimensionality Reduction and Learning: Random Features and Bochner's Theorem
Instructor: Sham Kakade

1 Random Projections of Common Kernels

Suppose we have a kernel $K(x, y)$ which represents an inner product between $x$ and $y$. When can we find random features $\phi_w(x)$ and $\phi_w(y)$ such that
\[ \mathbb{E}_w[\phi_w(x)\,\phi_w(y)] = K(x, y), \]
and such that these have concentration properties so that, if we take many projections, the average inner product converges rapidly to $K(x, y)$? In that case we have an analogue of the inner product preserving lemma. For shift invariant kernels of the form $K(x - y)$, Bochner's theorem provides a form for $\phi_w$.

1.1 Bochner's theorem and shift invariant kernels

First, let us provide some background on Fourier transforms. Given a positive finite measure $\mu$ on the real line $\mathbb{R}$ (if $\mu$ integrates to 1 then it is a probability density), the Fourier transform $Q(t)$ of $\mu$ is the continuous function
\[ Q(t) = \int_{\mathbb{R}} e^{-itx} \, d\mu(x). \]
Clearly the function $e^{-itx}$ is continuous and periodic, and $Q$ is continuous. Moreover, $Q$ is a positive definite function: in particular, the kernel $K(x, y) = Q(y - x)$ is positive definite, which can be checked via a direct calculation. Bochner's theorem says the converse is true.

Theorem 1.1 (Bochner). Every positive definite function $Q$ is the Fourier transform of a positive finite Borel measure.

This implies that for any shift invariant kernel $K(x - y)$ there exists a positive measure $p(w)$ such that
\[ K(x - y) = \int e^{i w \cdot (x - y)} \, dp(w), \]
and the measure $\bar{p} = p / \int dp(w)$ is a probability measure.

To see the implications of this, take a kernel $K(x - y)$ and its corresponding measure $p(w)$ under Bochner's theorem. Let $\alpha = \int dp(w)$ and let $\bar{p} = p/\alpha$. Now consider independently sampling $w_1, w_2, \ldots, w_k \sim \bar{p}$, and take the random projection vector of $x$ to be
\[ \phi(x) = \sqrt{\tfrac{\alpha}{k}} \left( e^{i w_1 \cdot x}, \ldots, e^{i w_k \cdot x} \right). \]
Since $\alpha\, \mathbb{E}_{w \sim \bar{p}}[e^{i w \cdot (x - y)}] = K(x - y)$, we have that $\phi(x) \cdot \overline{\phi(y)} \approx K(x - y)$ for large $k$. We can clearly ignore the imaginary components, as these have expectation 0, so it suffices to consider the real projection vector
\[ \sqrt{\tfrac{\alpha}{k}} \left( \cos(w_1 \cdot x), \sin(w_1 \cdot x), \ldots, \cos(w_k \cdot x), \sin(w_k \cdot x) \right) \in \mathbb{R}^{2k}, \]
which is also correct in expectation.

1.1.1 Example: Radial Basis Kernel

Consider the radial basis kernel
\[ K(x, y) = e^{-\|x - y\|^2/2}. \]
It is easy to see that if we choose $p$ to be the standard Gaussian measure, then
\[ \int e^{i w \cdot (x - y)} \, (2\pi)^{-d/2} e^{-\|w\|^2/2} \, dw = e^{-\|x - y\|^2/2} = K(x - y). \]
Hence the sampling distribution for $w$ is Gaussian. If the bandwidth of the RBF kernel is not 1, then we scale the variance of the Gaussians accordingly. For other shift invariant kernels there are different corresponding sampling measures $p$; however, we always use the Fourier features.

1.1.2 High Probability Inner Product Preservation

Recall that to prove a risk bound using random features, the key was the inner product preserving lemma, which characterizes how fast the inner products in the projected space converge to the truth as a function of $k$. We can do this here as well.

Lemma 1.2. Let $x, y \in \mathbb{R}^d$ and let $K(x - y)$ be a shift invariant kernel. If $k = \Omega\!\left(\frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)$ and if $\phi$ is created using independently sampled $w_1, w_2, \ldots, w_k \sim \bar{p}$ as discussed above, then with probability greater than $1 - \delta$,
\[ |\phi(x) \cdot \phi(y) - K(x - y)| \le \epsilon, \]
where $\phi$ uses the cos and sin features.

Proof. Note that the cos and sin functions are bounded by 1. Now we can apply Hoeffding's inequality directly to the random variables
\[ \frac{\alpha}{k} \sum_i \cos(w_i \cdot x)\cos(w_i \cdot y), \qquad \frac{\alpha}{k} \sum_i \sin(w_i \cdot x)\sin(w_i \cdot y) \]
to show that these are $\epsilon$-close to their means with the $k$ chosen above. This proves the result.
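As an illustration of the construction above, here is a minimal sketch (in Python with NumPy, which the lecture itself does not use) of the cos/sin Fourier features for the unit-bandwidth RBF kernel; the empirical inner product $\phi(x)\cdot\phi(y)$ should be close to $K(x-y)=e^{-\|x-y\|^2/2}$ for large $k$, as Lemma 1.2 suggests. Function names and test values are illustrative, not from the lecture.

```python
import numpy as np

def rbf_random_features(X, k, rng):
    """Map rows of X to 2k random Fourier features for the RBF kernel
    K(x, y) = exp(-||x - y||^2 / 2).  By Bochner's theorem the sampling
    distribution for w is the standard Gaussian (here alpha = 1)."""
    d = X.shape[1]
    W = rng.standard_normal((k, d))           # w_1, ..., w_k ~ N(0, I_d)
    proj = X @ W.T                            # inner products w_i . x
    # cos/sin features, scaled so that the feature inner product is unbiased
    return np.sqrt(1.0 / k) * np.hstack([np.cos(proj), np.sin(proj)])

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
k = 20000
Phi = rbf_random_features(np.vstack([x, y]), k, rng)
approx = Phi[0] @ Phi[1]                      # phi(x) . phi(y)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
print(approx, exact)                          # close for large k
```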
1.1.3 Polynomial Kernels

Now suppose we have a polynomial kernel of degree $l$ of the form
\[ K(x, y) = \sum_{i=1}^{l} (x \cdot y)^i. \]
Consider randomly sampling a set of projections $W = \{ w_{ij} : 1 \le i \le l,\ 1 \le j \le i \}$ of size $l(l+1)/2$, where each $w_{ij}$ is such that $\mathbb{E}[(w_{ij} \cdot x)(w_{ij} \cdot y)] = x \cdot y$ (our Gaussian projections suffice here). Now consider the random projection of $x$ down to one dimension,
\[ \phi_W(x) = \sum_{i=1}^{l} \prod_{j=1}^{i} w_{ij} \cdot x. \]
One can verify that $\mathbb{E}_W[\phi_W(x)\,\phi_W(y)] = K(x, y)$. To see this, note that
\[ \mathbb{E}_W\!\left[ \prod_{j=1}^{i} (w_{ij} \cdot x) \prod_{j=1}^{i} (w_{ij} \cdot y) \right] = \prod_{j=1}^{i} \mathbb{E}\big[ (w_{ij} \cdot x)(w_{ij} \cdot y) \big] = (x \cdot y)^i, \]
where the second step is due to independence. Similarly, for $i \ne i'$,
\[ \mathbb{E}_W\!\left[ \prod_{j=1}^{i} (w_{ij} \cdot x) \prod_{j=1}^{i'} (w_{i'j} \cdot y) \right] = 0, \]
again by independence. Now consider independently sampling $W_1, W_2, \ldots, W_k$ (note each $W_i$ is a set of $l(l+1)/2$ random vectors), and take the random projection vector of $x$ to be
\[ \phi(x) = \frac{1}{\sqrt{k}} \left( \phi_{W_1}(x), \phi_{W_2}(x), \ldots, \phi_{W_k}(x) \right). \]
Concentration properties for an inner product preserving lemma should be possible to prove as well, again using tail properties of Gaussians. This random projection is not as convenient to use unless $l$ is small.
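The following is a rough sketch of the polynomial-kernel construction just described, under the indexing assumed above ($1 \le j \le i$ for the degree-$i$ term); the names and test values are illustrative, not from the lecture. The empirical convergence is noticeably slower than for the Fourier features, consistent with the remark that this projection is less convenient unless $l$ is small.

```python
import numpy as np

def sample_W(l, d, rng):
    # One set W = {w_ij : 1 <= i <= l, 1 <= j <= i} of standard Gaussian
    # vectors; E[(w . x)(w . y)] = x . y for w ~ N(0, I_d).
    return [rng.standard_normal((i, d)) for i in range(1, l + 1)]

def phi_W(x, W):
    # phi_W(x) = sum_{i=1}^l prod_{j=1}^i (w_ij . x)
    return sum(np.prod(Wi @ x) for Wi in W)

rng = np.random.default_rng(1)
d, l, k = 4, 3, 20000
x, y = rng.standard_normal(d) / 2, rng.standard_normal(d) / 2
Ws = [sample_W(l, d, rng) for _ in range(k)]           # W_1, ..., W_k
phi_x = np.array([phi_W(x, W) for W in Ws]) / np.sqrt(k)
phi_y = np.array([phi_W(y, W) for W in Ws]) / np.sqrt(k)
exact = sum((x @ y) ** i for i in range(1, l + 1))     # K(x, y) = sum_i (x.y)^i
print(phi_x @ phi_y, exact)                            # rough agreement; high variance
```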
References

The observation on using Bochner's theorem to get random features was pointed out by Ali Rahimi and Ben Recht. This analysis argues for their success based on margins rather than on a covering number argument; the argument is rather loose, as it does not accurately reflect the sample sizes used in practice.

Stat 991: Multivariate Analysis, Dimensionality Reduction and Spectral Methods
Lecture 6: Dimensionality Reduction and Learning: Ridge Regression vs PCA
Instructor: Sham Kakade

1 Intro

The theme of these two lectures is that for L2 methods we need not work in infinite dimensional spaces. In particular, we can unadaptively find, and work in, a low dimensional space and achieve about as good results. These results question the need for explicitly working in infinite or high dimensional spaces for L2 methods. In contrast, for sparsity based methods, including L1 regularization, such nonadaptive projection methods significantly lose predictive power.

2 Ridge Regression and Dimensionality Reduction

This lecture will characterize the risk of ridge regression in infinite dimensions in terms of a bias-variance tradeoff. Furthermore, we will show that a simple dimensionality reduction scheme, based on PCA along with MLE estimates in the projected space, performs nearly as well as ridge regression.

3 Risk and Fixed Design Regression

Let us now consider the normal means problem, sometimes referred to as the fixed design setting. Here we have a set of $n$ points $\{X_i\} \subset \mathbb{R}^d$, and we let $X$ denote the $\mathbb{R}^{n \times d}$ matrix whose $i$-th row is $X_i$. We also observe an output vector $Y \in \mathbb{R}^n$. We desire to learn $\mathbb{E}[Y]$; in particular, we seek to predict $\mathbb{E}[Y]$ with $Xw$. The square loss of an estimator $w$ is
\[ L(w) = \frac{1}{n}\, \mathbb{E}\|Y - Xw\|^2 = \frac{1}{n} \sum_i \mathbb{E}(Y_i - X_i \cdot w)^2, \]
where the expectation is with respect to $Y$. Let $\beta$ be the optimal predictor,
\[ \beta = \arg\min_w L(w). \]
The risk of an estimator $\hat{\beta}$ is defined as
\[ R(\hat{\beta}) = L(\hat{\beta}) - L(\beta) = \frac{1}{n} \|X\hat{\beta} - X\beta\|^2, \]
which is the fixed design risk. Denoting $\Sigma = \frac{1}{n} X^\top X$, we can write the risk as
\[ R(\hat{\beta}) = (\hat{\beta} - \beta)^\top \Sigma\, (\hat{\beta} - \beta) = \|\hat{\beta} - \beta\|_\Sigma^2. \]
Another interpretation of the risk is how accurately we learn the parameters of the model.

Assume that $\hat{\beta}$ is an estimator constructed from the outcome $Y$ (we drop the explicit $Y$ dependence as this is clear from context). Let $\bar{\beta} = \mathbb{E}_Y[\hat{\beta}]$ be the expected weight. We can decompose the expected risk as
\[ \mathbb{E}_Y[R(\hat{\beta})] = \frac{1}{n}\, \mathbb{E}_Y \|X\hat{\beta} - X\beta\|^2 = \frac{1}{n}\, \mathbb{E}_Y \|X\hat{\beta} - X\bar{\beta}\|^2 + \frac{1}{n} \|X\bar{\beta} - X\beta\|^2 = (\text{average variance}) + \|\bar{\beta} - \beta\|_\Sigma^2, \]
where
\[ \text{average variance} = \frac{1}{n}\, \mathbb{E}_Y \|X\hat{\beta} - X\bar{\beta}\|^2 \]
and the prediction bias vector is $X\bar{\beta} - X\beta$. This is a certain bias-variance decomposition of the error.

3.1 Risk Bounds for Ridge Regression

The ridge regression estimator using an outcome $Y$ is
\[ \hat{\beta}_\lambda = \arg\min_w \frac{1}{n} \|Y - Xw\|^2 + \lambda \|w\|^2. \]
The estimator is then
\[ \hat{\beta}_\lambda = (\Sigma + \lambda I)^{-1} \frac{1}{n} X^\top Y = (\Sigma + \lambda I)^{-1} \frac{1}{n} \sum_i Y_i X_i^\top. \]
For simplicity, let us rotate $X$ such that
\[ \Sigma = \frac{1}{n} X^\top X = \mathrm{diag}(\lambda_1, \ldots, \lambda_d) \]
(note this rotation does not alter the predictions of rotationally invariant algorithms). With this choice we have that
\[ [\hat{\beta}_\lambda]_j = \frac{1}{n(\lambda_j + \lambda)} \sum_i Y_i [X_i]_j. \]
It is straightforward to see that $\bar{\beta}_\lambda = \mathbb{E}_Y[\hat{\beta}_\lambda]$ satisfies
\[ [\bar{\beta}_\lambda]_j = \frac{\lambda_j}{\lambda_j + \lambda}\, \beta_j, \]
by just taking expectations.

Lemma 3.1 (Risk Bound). If $\mathrm{Var}(Y_i) \le 1$, we have that
\[ \mathbb{E}_Y[R(\hat{\beta}_\lambda)] \le \frac{1}{n} \sum_j \left( \frac{\lambda_j}{\lambda_j + \lambda} \right)^2 + \lambda^2 \sum_j \frac{\lambda_j \beta_j^2}{(\lambda_j + \lambda)^2}. \]
This holds with equality if $\mathrm{Var}(Y_i) = 1$.

Proof. For the variance term, we have
\[ \mathbb{E}_Y \|\hat{\beta}_\lambda - \bar{\beta}_\lambda\|_\Sigma^2 = \sum_j \lambda_j\, \mathrm{Var}([\hat{\beta}_\lambda]_j) = \sum_j \frac{\lambda_j}{n^2 (\lambda_j + \lambda)^2} \sum_i \mathrm{Var}(Y_i)\, [X_i]_j^2 \le \frac{1}{n} \sum_j \left( \frac{\lambda_j}{\lambda_j + \lambda} \right)^2, \]
using $\sum_i [X_i]_j^2 = n\lambda_j$. This holds with equality if $\mathrm{Var}(Y_i) = 1$. For the bias term,
\[ \|\bar{\beta}_\lambda - \beta\|_\Sigma^2 = \sum_j \lambda_j \left( [\bar{\beta}_\lambda]_j - \beta_j \right)^2 = \sum_j \lambda_j \left( \frac{\lambda}{\lambda_j + \lambda} \right)^2 \beta_j^2, \]
and the result follows from algebraic manipulations.

The following bound characterizes the risk for two natural settings of $\lambda$.

Corollary 3.2. Assume $\mathrm{Var}(Y_i) \le 1$.

(Finite dims) For $\lambda = 0$, $\mathbb{E}_Y[R(\hat{\beta}_0)] \le d/n$; and if $\mathrm{Var}(Y_i) = 1$ (and all $\lambda_j > 0$), then $\mathbb{E}_Y[R(\hat{\beta}_0)] = d/n$.

(Infinite dims) For $\lambda = \sqrt{\|\Sigma\|_{\mathrm{tr}} / (n \|\beta\|^2)}$, we have
\[ \mathbb{E}_Y[R(\hat{\beta}_\lambda)] \le \|\beta\| \sqrt{\frac{\|\Sigma\|_{\mathrm{tr}}}{n}} \le \frac{\|\beta\| \max_i \|X_i\|}{\sqrt{n}}, \]
where the trace norm $\|\Sigma\|_{\mathrm{tr}}$ is the sum of the singular values, and $\|\Sigma\|_{\mathrm{tr}} \le \max_i \|X_i\|^2$. Furthermore, for all $n$ there exists a distribution $\Pr(Y)$ and an $X$ such that $\inf_\lambda \mathbb{E}_Y[R(\hat{\beta}_\lambda)]$ is $\Omega(1/\sqrt{n})$, so the above bound is tight up to log factors.

Conceptually, the second bound is dimension free, i.e. it does not depend explicitly on $d$, which could be infinite; we are effectively doing regression in a large, potentially infinite dimensional space.

Proof. The $\lambda = 0$ case follows directly from the previous lemma. Using that $(a+b)^2 \ge 2ab$, we can bound the variance term for general $\lambda$ as follows:
\[ \frac{1}{n} \sum_j \left( \frac{\lambda_j}{\lambda_j + \lambda} \right)^2 \le \frac{1}{n} \sum_j \frac{\lambda_j^2}{2 \lambda_j \lambda} = \frac{\sum_j \lambda_j}{2 n \lambda}. \]
Again using that $(a+b)^2 \ge 2ab$, the bias term is bounded as
\[ \lambda^2 \sum_j \frac{\lambda_j \beta_j^2}{(\lambda_j + \lambda)^2} \le \lambda^2 \sum_j \frac{\lambda_j \beta_j^2}{2 \lambda_j \lambda} = \frac{\lambda \|\beta\|^2}{2}. \]
So we have that
\[ \mathbb{E}_Y[R(\hat{\beta}_\lambda)] \le \frac{\|\Sigma\|_{\mathrm{tr}}}{2 n \lambda} + \frac{\lambda \|\beta\|^2}{2}, \]
and using the choice of $\lambda$ completes the proof.

To see that the above bound is tight, consider the following problem. Let $d = n$, let $\lambda_j = 1/j$ and $\beta_j^2 = 1/j$, and let $Y = X\beta + \eta$ where $\eta$ has unit variance coordinates. Here $\sum_j \lambda_j \le 1 + \log n$ and $\|\beta\|^2 \le 1 + \log n$, so the upper bound is $\tilde{O}(1/\sqrt{n})$. Now one can write the risk as
\[ \mathbb{E}_Y[R(\hat{\beta}_\lambda)] = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{(1 + j\lambda)^2} + \lambda^2 \sum_{j=1}^{n} \frac{1}{(1 + j\lambda)^2}, \]
and one can check that this is $\Omega(1/\sqrt{n})$ for every $\lambda$.

However, we now show that, with L2 complexity, we can effectively work in finite dimensions, where the dimension is chosen as a function of $n$.

4 PCA Projections and MLEs

Fix some $\lambda$. Consider the following "keep or kill" estimator, which uses the MLE estimate if $\lambda_j \ge \lambda$ and 0 otherwise:
\[ [\hat{\beta}_{\mathrm{PCA},\lambda}]_j = \begin{cases} [\hat{\beta}_0]_j & \text{if } \lambda_j \ge \lambda, \\ 0 & \text{else,} \end{cases} \]
where $\hat{\beta}_0$ is the MLE estimator. This estimator is 0 for the small values of $\lambda_j$, those on which we are effectively regularizing more anyway.

Theorem 4.1 (Risk Inflation of $\mathrm{PCA}_\lambda$). Assume $\mathrm{Var}(Y_i) = 1$. Then
\[ \mathbb{E}_Y[R(\hat{\beta}_{\mathrm{PCA},\lambda})] \le 4\, \mathbb{E}_Y[R(\hat{\beta}_\lambda)]. \]
Note that the actual risk (not just an upper bound) of the simple PCA estimate is within a factor of 4 of the ridge regression risk, on a wide class of problems.

Proof. Recall that
\[ \mathbb{E}_Y[R(\hat{\beta}_\lambda)] = \frac{1}{n} \sum_j \left( \frac{\lambda_j}{\lambda_j + \lambda} \right)^2 + \lambda^2 \sum_j \frac{\lambda_j \beta_j^2}{(\lambda_j + \lambda)^2}. \]
Since we can write the risk as
\[ \mathbb{E}_Y \|\hat{\beta} - \beta\|_\Sigma^2 = \mathbb{E}_Y \|\hat{\beta} - \bar{\beta}\|_\Sigma^2 + \|\bar{\beta} - \beta\|_\Sigma^2, \]
we have that
\[ \mathbb{E}_Y[R(\hat{\beta}_{\mathrm{PCA},\lambda})] = \frac{1}{n} \sum_j \mathbf{1}(\lambda_j \ge \lambda) + \sum_j \mathbf{1}(\lambda_j < \lambda)\, \lambda_j \beta_j^2, \]
where $\mathbf{1}$ is the indicator function.

We now show that each term in the risk of $\hat{\beta}_{\mathrm{PCA},\lambda}$ is within a factor of 4 of the corresponding term in the risk of $\hat{\beta}_\lambda$. If $\lambda_j \ge \lambda$, then the ratio of the $j$-th terms is
\[ \frac{\frac{1}{n}}{\frac{1}{n} \left( \frac{\lambda_j}{\lambda_j + \lambda} \right)^2 + \lambda^2 \frac{\lambda_j \beta_j^2}{(\lambda_j + \lambda)^2}} \le \left( \frac{\lambda_j + \lambda}{\lambda_j} \right)^2 = \left( 1 + \frac{\lambda}{\lambda_j} \right)^2 \le 4. \]
Similarly, if $\lambda_j < \lambda$, then the ratio of the $j$-th terms is
\[ \frac{\lambda_j \beta_j^2}{\frac{1}{n} \left( \frac{\lambda_j}{\lambda_j + \lambda} \right)^2 + \lambda^2 \frac{\lambda_j \beta_j^2}{(\lambda_j + \lambda)^2}} \le \left( \frac{\lambda_j + \lambda}{\lambda} \right)^2 = \left( 1 + \frac{\lambda_j}{\lambda} \right)^2 \le 4. \]
Since each term is within a factor of 4, the proof is completed.
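As a quick numerical illustration of Lemma 3.1 and Theorem 4.1, the sketch below (Python/NumPy, not part of the lecture) evaluates the closed-form risks of ridge regression and the keep-or-kill PCA estimator for an arbitrary illustrative spectrum and coefficient vector, and checks that their ratio stays below 4. The spectrum, $\beta$, and parameter values are assumptions made purely for the example.

```python
import numpy as np

# Exact risks in the diagonalized fixed-design model with Var(Y_i) = 1.
n, d = 200, 100
lam_j = 1.0 / np.arange(1, d + 1)            # eigenvalues of Sigma (illustrative)
beta = 1.0 / np.sqrt(np.arange(1, d + 1))    # true coefficients (illustrative)

def ridge_risk(lam):
    # Lemma 3.1 (with equality, since Var(Y_i) = 1).
    var = (1.0 / n) * np.sum((lam_j / (lam_j + lam)) ** 2)
    bias = lam ** 2 * np.sum(lam_j * beta ** 2 / (lam_j + lam) ** 2)
    return var + bias

def pca_risk(lam):
    # Keep-or-kill estimator: 1/n variance per kept coordinate,
    # bias lam_j * beta_j^2 per killed coordinate.
    keep = lam_j >= lam
    return keep.sum() / n + np.sum(lam_j[~keep] * beta[~keep] ** 2)

for lam in [0.01, 0.05, 0.2]:
    r, p = ridge_risk(lam), pca_risk(lam)
    print(f"lambda={lam}: ridge={r:.4f}  pca={p:.4f}  ratio={p / r:.2f}")  # ratio <= 4
```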
References

The observation about the risk inflation of ridge regression vs PCA was first pointed out to me by Dean Foster.

Stat 991: Multivariate Analysis, Dimensionality Reduction and Spectral Methods
Lecture 1: The Singular Value Decomposition
Instructor: Sham Kakade

1 Intro

The SVD is the single most important concept to understand in linear algebra. Intuitively, it precisely characterizes how any linear map behaves. Roughly speaking, the SVD corresponds to a certain natural notion of geometric regression; in fact, with this interpretation, all of the classical estimation issues with noisy data are relevant here.

1.1 Vanilla Regression and a Best Fit Line

Consider an input data matrix $X_{\mathrm{in}} \in \mathbb{R}^{n \times d}$ and a target prediction vector $X_{\mathrm{out}} \in \mathbb{R}^n$. In regression, we desire to predict the target with the inputs. In a least squares sense, the goal is to find $w$ which minimizes
\[ \| X_{\mathrm{out}} - X_{\mathrm{in}} w \|^2. \]
Here the solution is given by
\[ w = (X_{\mathrm{in}}^\top X_{\mathrm{in}})^{-1} X_{\mathrm{in}}^\top X_{\mathrm{out}}. \]
This is the least squares estimator.

Question 1. With noisy data, how accurate is our regression in (1) the fixed design, when $X_{\mathrm{out}}$ is random and $X_{\mathrm{in}}$ is fixed, and (2) the random design, when $X_{\mathrm{out}}$ and $X_{\mathrm{in}}$ are random?

2 The Best Fit Line, Rotationally Invariant Regression, and Matrix Norms

2.1 The Best Fit Line

In vanilla regression, note that there was one preferred coordinate which we desired to predict, and we fit our data with a line. Instead, let us say we have no preferred direction with which to measure our error, and yet we still desire to fit our data with a line. In particular, this can be viewed as a rotationally invariant, geometric generalization of regression: precisely, what is the best fit line to our data, measured with respect to the rotationally invariant Euclidean norm? Note that for any vector $x$, the best fit point on the line through $w$ is
\[ \frac{x \cdot w}{\|w\|^2}\, w. \]
Without loss of generality, let us constrain $w$ to be unit norm, i.e. $\|w\| = 1$.

Let $X \in \mathbb{R}^{n \times d}$, and consider fitting the best line to the rows $x_i \in \mathbb{R}^d$ of this matrix. Hence the best fit line $w$ is the solution to the problem
\[ \min_{w : \|w\|=1} \sum_i \| x_i - (x_i \cdot w)\, w \|^2, \]
where $w \in \mathbb{R}^d$. Equivalently, we can find $w$ as a solution to the maximization problem in the following lemma.

Lemma 2.1. We have that
\[ \sum_i \| x_i - (x_i \cdot w)\, w \|^2 = \sum_i \|x_i\|^2 - \sum_i (x_i \cdot w)^2 = \|X\|_F^2 - \|Xw\|^2, \]
where $\|X\|_F$ is the Frobenius norm (the square root of the sum of the squares of the entries). Hence the best fit line is given by
\[ \arg\max_{\|w\|=1} \|Xw\|. \]

Now one key step in understanding the SVD (presented later) is understanding the answer to the following question.

Question 2. Let $v$ be the best fit line to the rows of $X$. What is the best fit line to the columns of $X$, as a function of $v$ and $X$?

To answer this, let us first examine some norms.

2.2 The Spectral Norm and a Little Duality

The spectral norm of a matrix $X$ is defined as
\[ \|X\| = \max_{a : \|a\|=1} \|Xa\|. \]
Note it is rotationally invariant. Perhaps some intuition for this norm can be obtained by viewing the Euclidean norm as a certain maximization problem: we can write the Euclidean norm of a vector $a$ as
\[ \|a\| = \sqrt{a \cdot a} = \max_{b : \|b\|=1} b \cdot a, \]
which follows from Cauchy-Schwarz. To understand the previous question of the best fit line to the columns of $X$, observe that
\[ \|X\| = \max_{a, b : \|a\| = \|b\| = 1} b^\top X a, \qquad (1) \]
where $b \in \mathbb{R}^n$ and $a \in \mathbb{R}^d$.

2.3 The Best Line of the Columns

Lemma 2.2. If $v$ is the best fit line to the rows of $X$, then $Xv$ is the best fit line to the columns of $X$.

Proof. Observe that if $v$ is the argmax,
\[ v = \arg\max_{a : \|a\|=1} \|Xa\| = \arg\max_{a : \|a\|=1} \left( \max_{b : \|b\|=1} b^\top X a \right). \]
Note that the $b$ which achieves the max must be $\frac{Xv}{\|Xv\|}$; this is because $\frac{Xa}{\|Xa\|}$ is the unit length $b$ which maximizes $b^\top X a$. Hence the argmax over $(a, b)$ in Equation (1) is achieved by $\left( v, \frac{Xv}{\|Xv\|} \right)$. Note this implies that
\[ \frac{Xv}{\|Xv\|} = \arg\max_{b : \|b\|=1} \left( \max_{a : \|a\|=1} b^\top X a \right) = \arg\max_{b : \|b\|=1} \| X^\top b \|. \]
Hence $\frac{Xv}{\|Xv\|}$ specifies the best fit line to the columns of $X$, so $Xv$ is also a best fit line, though not necessarily of unit length.
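A small numerical check of Lemma 2.2 (again a sketch, not from the lecture): the best fit line to the rows of $X$ is the top right singular vector $v$, and $Xv$, normalized, is the top left singular vector, i.e. the best fit line to the columns. The matrix here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
v = Vt[0]                      # best fit line to the rows (argmax ||Xa||, ||a|| = 1)
u = U[:, 0]                    # best fit line to the columns

Xv = X @ v
unit_Xv = Xv / np.linalg.norm(Xv)
print(np.allclose(unit_Xv, u) or np.allclose(unit_Xv, -u))   # True (up to sign)
print(np.linalg.norm(Xv), s[0])                              # both equal sigma_1
```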
3 The Best Fitting Subspace and the SVD

Now let $A$ be a general matrix. The maximal singular value is $\max_{\|a\|=1} \|Aa\|$, and the argmax is the corresponding singular vector. We let $A_i$ denote a row of $A$.

Lemma 3.1. For an arbitrary matrix $A \in \mathbb{R}^{n \times d}$,
\[ \arg\max_{\|w\|=1} \|Aw\|^2 = \arg\min_{\|w\|=1} \|A - A w w^\top\|_F^2 = \arg\min_{\|w\|=1} \sum_i \| A_i - (A_i \cdot w)\, w \|^2, \]
where $\|\cdot\|_F$ is the Frobenius norm (the Frobenius norm of a matrix $M$ is $\|M\|_F^2 = \sum_{ij} M_{ij}^2$).

Proof. The proof essentially follows from the Pythagorean theorem.

Theorem 3.2 (SVD). Define the $k$-dimensional subspace $V_k$ as the span of the following $k$ vectors:
\[ v_1 = \arg\max_{\|v\|=1} \|Av\|^2, \quad v_2 = \arg\max_{\|v\|=1,\ v \cdot v_1 = 0} \|Av\|^2, \quad \ldots, \quad v_k = \arg\max_{\|v\|=1,\ v \cdot v_i = 0\ \forall i < k} \|Av\|^2. \]
Then $V_k$ is optimal in the sense that
\[ V_k = \arg\min_{V :\, \dim(V) = k} \sum_i \mathrm{distance}(A_i, V)^2. \]
Furthermore,
\[ \sigma_1 = \|Av_1\| \ge \sigma_2 = \|Av_2\| \ge \cdots \ge \sigma_{\min(n,d)} = \|Av_{\min(n,d)}\|. \]
Let $\sigma_i u_i = A v_i$, so $u_i$ is unit length. Then the set $\{u_i\}$ is orthonormal ($\{v_i\}$ is orthonormal by construction), and the SVD decomposition of $A$ is
\[ A = \sum_i \sigma_i u_i v_i^\top = U\, \mathrm{diag}(\sigma_1, \ldots, \sigma_{\min(n,d)})\, V^\top, \]
where $U$ and $V$ are orthogonal matrices with columns $u_i$ and $v_i$ respectively.

Proof. The interesting part of the proof is that $\{u_i\}$ is orthonormal; the rest of the proof essentially follows by construction.

As a corollary we have:

Corollary 3.3. Among all rank $k$ matrices $D$, the truncation $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$ is the one which minimizes $\|A - D\|_F$. Furthermore,
\[ \min_{D :\, \mathrm{rank}(D) = k} \|A - D\|_F^2 = \|A - A_k\|_F^2 = \sum_{i=k+1}^{\min(n,d)} \sigma_i^2. \]

3.1 Proofs

The argument is essentially an inductive one, based on the previous argument.

3.2 Three Interpretations

The relevance of the SVD is that it holds for all matrices, i.e. it is a characterization of all linear maps.

1. The SVD shows that any linear map consists of a rotation, followed by an axis-aligned scaling, followed by another rotation.

2. The best fit $k$-dimensional subspace to the rows is $V_k$. Furthermore, the best fit $(k+1)$-dimensional subspace contains the best fit $k$-dimensional subspace (even if there are equal singular values, we can always choose the subspaces such that this holds).

3. The best fit $k$-dimensional subspace to the columns is specified by the span of $Xv_1, \ldots, Xv_k$.

4 References

Material used was Wikipedia and Santosh Vempala's lecture notes.
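Finally, a short numerical illustration of Corollary 3.3 (a sketch using NumPy, not part of the lecture notes): the rank-$k$ SVD truncation minimizes the Frobenius error, and the squared error equals the sum of the discarded squared singular values. The matrix and the choice of $k$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 12))
k = 4

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]          # best rank-k approximation

err_sq = np.linalg.norm(A - A_k, "fro") ** 2
print(err_sq, np.sum(s[k:] ** 2))                 # equal, up to round-off
```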

