Cooperative Education Program

by: Florence Larkin

ECE901 Spring 2007: Statistical Learning Theory
Instructors: R. Castro and A. Singh

Lecture 17: Minimax Lower Bounds

Lower Performance Bounds

Up to now in class we've been analyzing estimators/predictors, obtaining upper bounds on their performance. These bounds are of the form

    sup_{f in F} E_f[ d(f̂_n, f) ] <= C n^(-γ), where γ > 0.

We would like to know if these bounds are tight, in the sense that there is no other estimator that is significantly better. To answer this we need lower bounds of the form

    inf_{f̂_n} sup_{f in F} E_f[ d(f̂_n, f) ] >= c n^(-γ).

We assume we have the following ingredients:

(i) A class of models F: a class of models containing the true model, a subset of some bigger class S. E.g., F could be the class of Lipschitz density functions, or of distributions P_XY satisfying the box-counting condition.

(ii) An observation model P_f, indexed by f in F: P_f denotes the distribution of the data under model f. E.g., in regression and classification this is the distribution of Z = (X_1, Y_1, ..., X_n, Y_n). We will assume that P_f is a probability measure on the measurable space (Z, B).

(iii) A performance metric d(.,.) >= 0. If you have a model estimate f̂_n, then the performance of that estimate relative to the true model f is d(f̂_n, f). E.g.,

    Regression: d(f̂_n, f) = ||f̂_n − f||;
    Classification: d(Ĝ_n, G*) = R(Ĝ_n) − R(G*) = ∫_{Ĝ_n Δ G*} |2η(x) − 1| dP_X(x).

As before, we are interested in the risk of a learning rule, in particular the maximal risk, given as

    sup_{f in F} E_f[ d(f̂_n, f) ] = sup_{f in F} ∫ d(f̂_n(z), f) dP_f(z),

where f̂_n is a function of the observations Z and E_f denotes the expectation with respect to P_f. The main goal is to get results of the form

    R_n := inf_{f̂_n} sup_{f in F} E_f[ d(f̂_n, f) ] >= c s_n,

where c > 0 and s_n -> 0 as n -> infinity. The inf is taken over all estimators, i.e., all measurable functions f̂_n of the data.

Suppose we have shown that lim inf_{n->inf} s_n^{-1} R_n >= c > 0 (a lower bound), and also that for a particular estimator f̂_n,

    lim sup_{n->inf} s_n^{-1} sup_{f in F} E_f[ d(f̂_n, f) ] <= C.

We then say that s_n is the optimal rate of convergence for this problem and that f̂_n attains that rate. (Note: two rates u_n, v_n are equivalent, u_n ≍ v_n, iff 0 < lim inf u_n/v_n <= lim sup u_n/v_n < infinity.)

General Reduction Scheme

Instead of directly bounding the expected performance, we are going to prove stronger probability bounds of the form

    inf_{f̂_n} sup_{f in F} P_f( d(f̂_n, f) >= s_n ) >= c > 0.

These bounds can be readily converted to expected performance bounds using Markov's inequality:

    P_f( d(f̂_n, f) >= s_n ) <= E_f[ d(f̂_n, f) ] / s_n,

therefore

    inf_{f̂_n} sup_{f in F} E_f[ d(f̂_n, f) ] >= inf_{f̂_n} sup_{f in F} s_n P_f( d(f̂_n, f) >= s_n ) >= c s_n.

First Reduction Step: reduce the original problem to an easier one by replacing the larger class F with a smaller finite class {f_0, ..., f_M} ⊆ F. Observe that

    inf_{f̂_n} sup_{f in F} P_f( d(f̂_n, f) >= s_n ) >= inf_{f̂_n} max_{f in {f_0,...,f_M}} P_f( d(f̂_n, f) >= s_n ).

The key idea is to choose a finite collection of models such that the resulting problem is as hard as the original; otherwise the lower bound will not be tight.

Second Reduction Step: next we reduce the problem to a hypothesis test. Ideally we would like to have something like

    inf_{f̂_n} sup_{f in F} P_f( d(f̂_n, f) >= s_n ) >= inf_ψ max_{j in {0,...,M}} P_{f_j}( ψ ≠ j ),

where the inf is over all measurable test functions ψ: Z^n -> {0, ..., M}, and P_{f_j}(ψ ≠ j) denotes the probability that, after observing the data, the test infers the wrong hypothesis. This might not always be true, or easy to show, but in certain scenarios it can be done.

Suppose d(.,.) is a semi-distance, i.e., it satisfies

    (i) d(f,g) = d(g,f) >= 0 (symmetric);
    (ii) d(f,f) = 0;
    (iii) d(f,g) <= d(f,h) + d(h,g) (triangle inequality).

(E.g., the regression norm and the symmetric-difference measure used below are semi-distances.)

Lemma 1: Suppose d(.,.) is a semi-distance. Also suppose that we have constructed f_0, ..., f_M such that d(f_j, f_k) >= 2 s_n for all j ≠ k. Take any estimator f̂_n and define the test ψ: Z^n -> {0, ..., M} as ψ = argmin_j d(f̂_n, f_j). Then ψ ≠ j implies d(f̂_n, f_j) >= s_n.
Proof: Suppose ψ ≠ j. Then there is a k ≠ j with d(f̂_n, f_k) <= d(f̂_n, f_j). Now

    2 s_n <= d(f_j, f_k) <= d(f_j, f̂_n) + d(f̂_n, f_k) <= 2 d(f̂_n, f_j),

so d(f̂_n, f_j) >= s_n.

The previous lemma implies that P_{f_j}( d(f̂_n, f_j) >= s_n ) >= P_{f_j}( ψ(f̂_n) ≠ j ). Therefore

    inf_{f̂_n} sup_{f in F} P_f( d(f̂_n, f) >= s_n )
        >= inf_{f̂_n} max_{j in {0,...,M}} P_{f_j}( d(f̂_n, f_j) >= s_n )
        >= inf_{f̂_n} max_{j in {0,...,M}} P_{f_j}( ψ(f̂_n) ≠ j )
        >= inf_ψ max_{j in {0,...,M}} P_{f_j}( ψ ≠ j ) =: p_{e,M}.

The third step follows since we are replacing the class of tests defined via f̂_n by the larger class of ALL possible tests ψ, and hence the inf taken over the larger class is smaller.

Now our goal throughout is going to be to find lower bounds for p_{e,M}. So we need to construct f_0, ..., f_M s.t. d(f_j, f_k) >= 2 s_n for j ≠ k AND p_{e,M} >= c > 0. Observe that this requires careful construction, since the first condition necessitates that f_j and f_k are far from each other, while the second condition requires that the distributions P_{f_j} and P_{f_k} are close enough that it is hard to distinguish them based on a given sample of data, so that the probability of error p_{e,M} is bounded away from 0.

We first consider the case M = 1, corresponding to binary hypothesis testing.

M = 1: Let P_0 and P_1 denote the two probability measures, i.e., the distributions of the data under models 0 and 1. Clearly, if P_0 and P_1 are very close then it is hard to distinguish the two hypotheses, and so p_{e,1} is large. A natural measure between probability measures is the total variation, defined as

    V(P_0, P_1) = sup_A |P_0(A) − P_1(A)| = sup_A | ∫_A (p_0(z) − p_1(z)) dν(z) |,

where p_0 and p_1 are the densities of P_0 and P_1 with respect to a common dominating measure ν, and A is any (measurable) subset of the domain. We will lower bound the probability of error p_{e,1} using the total variation distance. But first we establish the following lemma.

Scheffé's lemma:

    V(P_0, P_1) = (1/2) ∫ |p_0 − p_1| dν = 1 − ∫ min(p_0, p_1) dν.

Proof: Observe that the set A maximizing the right-hand side of the definition is given by either A_0 = {z : p_0(z) >= p_1(z)} or its counterpart {z : p_1(z) >= p_0(z)}. Picking A_0,

    V(P_0, P_1) = P_0(A_0) − P_1(A_0) = (1/2)( ∫_{A_0} (p_0 − p_1) + ∫_{A_0^c} (p_1 − p_0) ) = (1/2) ∫ |p_0 − p_1|.

For the second part, notice that |p_0 − p_1| = p_0 + p_1 − 2 min(p_0, p_1), so (1/2) ∫ |p_0 − p_1| = 1 − ∫ min(p_0, p_1).

We are now ready to tackle the lower bound on p_{e,1}. In this case we consider all tests ψ: Z^n -> {0, 1}; equivalently, ψ = 1_A(Z), where A is any subset of the domain. Then

    p_{e,1} = inf_A max{ P_0(A), P_1(A^c) }
            >= inf_A (1/2)( P_0(A) + P_1(A^c) )
            = (1/2)( 1 − sup_A (P_1(A) − P_0(A)) )
            = (1 − V(P_0, P_1)) / 2.

So if P_0 is close to P_1, then V(P_0, P_1) is small and the probability of error p_{e,1} is large. This is interesting, but unfortunately it is hard to work with total variation, especially for multivariate distributions. Bounds involving the Kullback-Leibler divergence,

    K(P_1 || P_0) = ∫ p_1(z) log( p_1(z) / p_0(z) ) dν(z),

are much more convenient. The following lemma relates total variation, affinity and KL divergence.
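Both quantities are easy to compute for a concrete pair of models. The following sketch (ours, not from the notes; numpy only) evaluates V(P_0, P_1) via Scheffé's lemma and K(P_1 || P_0) for n i.i.d. Bernoulli observations with success probabilities 1/2 − a and 1/2 + a. It works with the distribution of the success count, a sufficient statistic, which preserves both quantities:

    import numpy as np
    from math import comb

    # Illustration (ours): TV and KL between the joint laws of n i.i.d.
    # Bernoulli draws under p0 = 1/2 - a and p1 = 1/2 + a, plus the
    # resulting test-error lower bound p_{e,1} >= (1 - V)/2.

    def count_pmf(p, n):
        # P(k successes in n trials), k = 0..n; sufficient for TV and KL
        return np.array([comb(n, k) * p**k * (1 - p)**(n - k)
                         for k in range(n + 1)])

    n = 100
    a = 1 / np.sqrt(n)                      # the scaling used below
    q0, q1 = count_pmf(0.5 - a, n), count_pmf(0.5 + a, n)
    V = 0.5 * np.abs(q0 - q1).sum()         # Scheffe: (1/2) sum |p0 - p1|
    K = (q1 * np.log(q1 / q0)).sum()        # K(P1 || P0)
    print(f"V = {V:.3f}, K = {K:.3f}, p_e >= {(1 - V) / 2:.3f}")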
Lemma 2:

    1 − V(P_0, P_1) >= (1/2) A(P_0, P_1)^2 >= (1/2) exp( −K(P_1 || P_0) ),

where A(P_0, P_1) := ∫ sqrt(p_0 p_1) dν is the affinity.

Proof: For the first inequality,

    A^2 = ( ∫ sqrt(p_0 p_1) )^2
        = ( ∫ sqrt( min(p_0,p_1) max(p_0,p_1) ) )^2
        <= ∫ min(p_0,p_1) . ∫ max(p_0,p_1)        (by the Cauchy-Schwarz inequality)
        = ∫ min(p_0,p_1) ( 2 − ∫ min(p_0,p_1) )
        <= 2 ∫ min(p_0,p_1) = 2 (1 − V(P_0, P_1)),

using ∫ min + ∫ max = ∫ (p_0 + p_1) = 2. For the second inequality,

    A^2 = exp( 2 log ∫ sqrt(p_0 p_1) ) = exp( 2 log ∫ p_1 sqrt(p_0/p_1) )
        >= exp( 2 ∫ p_1 log sqrt(p_0/p_1) )        (by Jensen's inequality)
        = exp( −K(P_1 || P_0) ).

Putting everything together, we now have the following theorem.

Theorem 1: Let F be a class of models, and suppose we have observations Z distributed according to P_f, f in F. Let d(.,.) be the performance measure of the estimate relative to the true model, and assume d is a semi-distance. Let f_0, f_1 in F be s.t. d(f_0, f_1) >= 2 s_n. Then

    inf_{f̂_n} sup_{f in F} P_f( d(f̂_n, f) >= s_n )
        >= inf_{f̂_n} max_{j in {0,1}} P_{f_j}( d(f̂_n, f_j) >= s_n )
        >= (1/4) exp( −K(P_{f_1} || P_{f_0}) ).

How do we use this theorem? Choose f_0, f_1 such that K(P_{f_1} || P_{f_0}) <= α; then p_{e,1} is bounded away from 0 and we get a bound

    inf_{f̂_n} sup_{f in F} P_f( d(f̂_n, f) >= s_n ) >= c > 0,

or, after Markov's,

    inf_{f̂_n} sup_{f in F} E_f[ d(f̂_n, f) ] >= c s_n.

To apply the theorem we need to design f_0, f_1 s.t. d(f_0, f_1) >= 2 s_n and exp( −K(P_{f_1} || P_{f_0}) ) is bounded away from 0. To reiterate: the design of f_0, f_1 requires careful construction, so as to balance the tradeoff between the first condition, which requires f_0, f_1 to be far apart, and the second condition, which requires them to be close to each other.

Let's use this theorem in a problem we are familiar with. Recall the homework setting: X in [0,1], and Y | X = x ~ Bernoulli(η(x)), where η(x) = P(Y = 1 | X = x). We proved that, under these assumptions and an upper bound on the density of X, the Chernoff bounding technique yielded an expected excess risk for ERM of

    E[ R(Ĝ_n) ] − R(G*) = O( sqrt( log n / n ) ).

Is this the best possible rate?

Construct two models in the above class (denote it by P), P^0 and P^1. For both take P_X ~ Uniform[0,1], and

    η_0(x) = 1/2 − a,    η_1(x) = 1/2 + a,    a > 0,

so G*_0 = ∅ and G*_1 = [0,1]. We are interested in controlling the excess risk

    R(G) − R(G*) = ∫_{G Δ G*} |2η(x) − 1| dP_X(x) = 2a Δ(G, G*),

where Δ(G_1, G_2) := λ(G_1 Δ G_2), the (Lebesgue) measure of the symmetric difference.

Proposition 1: Δ(.,.) is a semi-distance.

Proof: It suffices to show that Δ(G_1,G_2) = Δ(G_2,G_1) >= 0, Δ(G,G) = 0 for all G, and Δ(G_1,G_2) <= Δ(G_1,G_3) + Δ(G_3,G_2). The first two statements are obvious. The last one (triangle inequality) follows from G_1 Δ G_2 ⊆ (G_1 Δ G_3) ∪ (G_3 Δ G_2). Suppose this was not the case: then there is an x in G_1 Δ G_2 with x ∉ G_1 Δ G_3 and x ∉ G_3 Δ G_2. The latter two statements say x ∈ G_1 ⟺ x ∈ G_3 and x ∈ G_3 ⟺ x ∈ G_2, hence x ∈ G_1 ⟺ x ∈ G_2, contradicting x ∈ G_1 Δ G_2.

Let's look at the first reduction step:

    inf_{Ĝ_n} sup_{P in P} P( R(Ĝ_n) − R(G*) >= s_n )
        >= inf_{Ĝ_n} max_{j in {0,1}} P^j( R(Ĝ_n) − R(G*_j) >= s_n )
        = inf_{Ĝ_n} max_{j in {0,1}} P^j( Δ(Ĝ_n, G*_j) >= s_n / (2a) ).

So we can work out a bound on Δ and then translate it to excess risk. Let's apply Theorem 1. Note that Δ(G*_0, G*_1) = 1, and let P_0 and P_1 denote the joint distributions of (X_1, Y_1, ..., X_n, Y_n) under the two models. Since the X-marginal is the same under both,

    K(P_1 || P_0) = E_1[ log( p^1(X_1,Y_1,...,X_n,Y_n) / p^0(X_1,Y_1,...,X_n,Y_n) ) ]
                  = n E_1[ log( p^1_{Y|X}(Y_1|X_1) / p^0_{Y|X}(Y_1|X_1) ) ].

Now p^1_{Y|X}(1|x) = 1/2 + a and p^0_{Y|X}(1|x) = 1/2 − a; also, under model 1, Y_1 ~ Bernoulli(1/2 + a). So we get

    K(P_1 || P_0) = n [ (1/2 + a) log( (1/2+a)/(1/2−a) ) + (1/2 − a) log( (1/2−a)/(1/2+a) ) ]
                  = 2 n a log( (1/2+a)/(1/2−a) )
                  = 2 n a log( 1 + 2a/(1/2 − a) )
                  <= 4 n a^2 / (1/2 − a),

using log(1+x) <= x. Let a = n^{-1/2}; then for n >= 16 we have 1/2 − a >= 1/4, and so K(P_1 || P_0) <= 16. Using Theorem 1, since Δ(G*_0, G*_1) = 1 >= 2 s_n for s_n = 1/2, we get

    inf_{Ĝ_n} max_{j in {0,1}} P^j( Δ(Ĝ_n, G*_j) >= 1/2 ) >= (1/4) e^{−16}.

This implies

    inf_{Ĝ_n} sup_{P in P} P( R(Ĝ_n) − R(G*) >= n^{−1/2} ) >= (1/4) e^{−16},

or, after Markov's inequality,

    inf_{Ĝ_n} sup_{P in P} E[ R(Ĝ_n) ] − R(G*) >= (1/4) e^{−16} n^{−1/2}.

Therefore, apart from the sqrt(log n) factor, ERM is getting the best possible performance.
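A quick numeric sanity check (ours; the choice a = n^{-1/2} and the constant 16 are as reconstructed above) of the exact KL and the bound used in the last step:

    import numpy as np

    # Evaluate K(P1||P0) = 2na log((1/2+a)/(1/2-a)) and the resulting
    # Theorem 1 lower bound (1/4) e^{-K}, for a = n^{-1/2}.
    for n in [16, 100, 10000]:
        a = n ** -0.5
        K = 2 * n * a * np.log((0.5 + a) / (0.5 - a))   # exact KL
        K_bound = 4 * n * a**2 / (0.5 - a)              # log(1+x) <= x bound
        print(f"n={n:6d}: K={K:6.3f} <= {K_bound:6.3f}, "
              f"(1/4)e^-K = {0.25 * np.exp(-K):.4f}")

The printout shows K staying bounded (<= 16 for n >= 16) while the separation s_n = 1/2 in Δ translates into an excess-risk separation of a = n^{-1/2}, exactly the tradeoff the construction has to balance.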
Reducing the initial problem to a binary hypothesis test does not always work. Sometimes we need M hypotheses, with M -> infinity as n -> infinity. If this is the case, we have the following theorem (stated in its usual form; cf. Tsybakov).

Theorem 2: Let M >= 2 and f_0, ..., f_M in F be such that d(f_j, f_k) >= 2 s_n for j ≠ k, where d is a semi-distance, and suppose

    (1/M) Σ_{j=1}^M K(P_{f_j} || P_{f_0}) <= α log M, with 0 < α < 1/8.

Then

    inf_{f̂_n} sup_{f in F} P_f( d(f̂_n, f) >= s_n )
        >= inf_{f̂_n} max_j P_{f_j}( d(f̂_n, f_j) >= s_n )
        >= ( sqrt(M) / (1 + sqrt(M)) ) ( 1 − 2α − sqrt( 2α / log M ) ) > 0.

We will use this theorem to show that the estimator of Lecture 4 is optimal. Recall the setup of Lecture 4:

    F = { f : |f(x) − f(y)| <= L |x − y| },

i.e., the class of Lipschitz functions with constant L, and

    Y_i = f(x_i) + W_i,    x_i = i/n,    E[W_i] = 0,  E[W_i^2] = σ^2 < infinity,  W_i, W_j independent if i ≠ j

(for the KL computation below, take the W_i to be Gaussian N(0, σ^2)). In that lecture, we constructed an estimator f̂_n such that

    sup_{f in F} E[ ||f̂_n − f||^2 ] = O( n^{−2/3} ).

Is this the best we can do?

We are going to construct a collection f_0, ..., f_M in F and apply Theorem 2. Notice that the metric of interest is d(f, g) = ||f − g|| = ( ∫_0^1 (f − g)^2 )^{1/2}, a semi-distance. Let m in N, h = 1/m, and define the bump

    K(x) = ( Lh/2 − L|x| ) 1{ |x| <= h/2 }.

Note that |K(a) − K(b)| <= L|a − b| for all a, b. The subclass we are going to consider are functions which are sums of such bumps. Let Ω = {0,1}^m be the collection of binary vectors of length m, ω = (ω_1, ..., ω_m) in Ω, and define

    f_ω(x) = Σ_{i=1}^m ω_i K( x − x_i ),    x_i := (i − 1/2) h.

(Each f_ω is Lipschitz(L), so f_ω in F.) Note that, for ω, ω' in Ω,

    d(f_ω, f_ω') = ||f_ω − f_ω'|| = ( Σ_{i=1}^m (ω_i − ω'_i)^2 ∫ K^2 )^{1/2} = sqrt( ρ(ω, ω') ) ||K||,

where ρ(ω, ω') := Σ_i 1{ω_i ≠ ω'_i} is the Hamming distance. Now

    ∫ K^2(x) dx = 2 L^2 ∫_0^{h/2} (h/2 − x)^2 dx = L^2 h^3 / 12,

so

    d(f_ω, f_ω') = (L / sqrt(12)) h^{3/2} sqrt( ρ(ω, ω') ).

Since |Ω| = 2^m, the number of functions in our class is 2^m. It turns out we do not need to consider all functions f_ω, ω in Ω, but only a select few. To that extent the following result is of use.

Lemma 3 (Varshamov-Gilbert, '62): Let m >= 8. There exists a subset { ω^(0), ..., ω^(M) } of Ω such that ω^(0) = (0, 0, ..., 0),

    ρ( ω^(j), ω^(k) ) >= m/8,  for all 0 <= j < k <= M,  and  M >= 2^{m/8}.

What this lemma says is that there are many (about 2^m, on a log scale) sequences in Ω that are pairwise very different. We are going to use the lemma to construct a useful set of hypotheses. Let { ω^(0), ..., ω^(M) } be the class of sequences in the lemma and define

    f_j := f_{ω^(j)},    j = 0, ..., M.

We now need to check the conditions of Theorem 2 and choose m appropriately. First note that, for j ≠ k,

    d(f_j, f_k) = (L / sqrt(12)) h^{3/2} sqrt( ρ(ω^(j), ω^(k)) ) >= (L / sqrt(12)) h^{3/2} sqrt(m/8) = (L / sqrt(96)) h =: 2 s_n.

Now let P_j := P_{f_j}, j = 0, ..., M. With Gaussian noise,

    K(P_j || P_0) = Σ_{i=1}^n E_j[ log( p_j(Y_i) / p_0(Y_i) ) ] = Σ_{i=1}^n f_j(x_i)^2 / (2σ^2)
                  <= n (Lh/2)^2 / (2σ^2) = L^2 n / (8 σ^2 m^2).

Now notice that log M >= (m/8) log 2. We want to choose m such that

    (1/M) Σ_j K(P_j || P_0) <= L^2 n / (8 σ^2 m^2) <= α (m/8) log 2 <= α log M.

This gives m^3 >= const . n, so take m = C_0 n^{1/3}. Then

    d(f_j, f_k) >= (L / sqrt(96)) m^{−1} >= 2 const . n^{−1/3},  for n >= n_0,

i.e., s_n = const . n^{−1/3}. Therefore

    inf_{f̂_n} sup_{f in F} P_f( ||f̂_n − f|| >= const . n^{−1/3} ) >= c > 0,

equivalently

    inf_{f̂_n} sup_{f in F} P_f( ||f̂_n − f||^2 >= const . n^{−2/3} ) >= c > 0,

or, after Markov's inequality,

    inf_{f̂_n} sup_{f in F} E[ ||f̂_n − f||^2 ] >= C const . n^{−2/3}.

Therefore the estimator constructed in class attains the optimal rate of convergence.
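The Varshamov-Gilbert lemma is a counting argument, but for small m one can exhibit a qualifying packing by brute force. The following sketch is our own illustration, not the lemma's proof; it greedily keeps binary words at Hamming distance >= m/8 from everything kept so far:

    import numpy as np
    from itertools import product

    # Greedy packing in {0,1}^m (ours): for small m this already produces
    # far more than the 2^(m/8) words the lemma guarantees.
    m = 10
    kept = [np.zeros(m, dtype=int)]              # omega^(0) = (0,...,0)
    for w in product([0, 1], repeat=m):
        w = np.asarray(w)
        if all((w != c).sum() >= m / 8 for c in kept):
            kept.append(w)
    print(f"kept {len(kept)} words; lemma promises M >= 2^(m/8) = "
          f"{2 ** (m / 8):.1f}")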
What is a Support Vector Machine?

- An optimally defined surface
- Typically nonlinear in the input space
- Linear in a higher dimensional space
- Implicitly defined by a kernel function

Acknowledgments: These slides combine and modify ones provided by Andrew Moore (CMU), Glenn Fung (Wisconsin), and Olvi Mangasarian (Wisconsin). [CS 540, University of Wisconsin-Madison, C. R. Dyer]

What are Support Vector Machines Used For?

- Classification
- Regression and data fitting
- Supervised and unsupervised learning

Linear Classifiers (aka Linear Discriminant Functions)

Definition: a linear classifier is a function that is a linear combination of the components of the input x:

    f(x) = w^T x + b = Σ_{i=1}^d w_i x_i + b,

where w is the weight vector and b the bias. A two-category classifier then uses the rule: decide class c1 if f(x) > 0 and class c2 if f(x) < 0; equivalently, decide c1 if w^T x > −b and c2 otherwise. As a labeling rule: f(x; w, b) = sign(w . x + b), with +1 denoting one class and −1 the other.

[A run of slides shows the same 2-D scatter of +1 and −1 points with the repeated prompt "How would you classify this data?" and several candidate separating lines.] Any of these would be fine — but which is best?

Classifier Margin

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.

Maximum Margin

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM, for linear SVM). The data points the margin pushes up against are called the support vectors.

Why Maximum Margin?

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us least chance of causing a misclassification.
3. Robust to outliers, since the model is immune to change/removal of any non-support-vector data points.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very, very well.

Specifying a Line and Margin

- Plus-plane:  { x : w . x + b = +1 }
- Minus-plane: { x : w . x + b = −1 }
- Classify as +1 if w . x + b >= 1; as −1 if w . x + b <= −1; ("universe explodes" if −1 < w . x + b < 1).

Computing the Margin Width

Here w is the plane's normal vector. How do we compute the margin width M in terms of w and b?

- The vector w is perpendicular to the plus-plane.
- Let x⁻ be any point on the minus-plane, and let x⁺ be the closest plus-plane point to x⁻.
- Claim: x⁺ = x⁻ + λw for some value of λ. Why? The line from x⁻ to x⁺ is perpendicular to the planes, so to get from x⁻ to x⁺, travel some distance in direction w.

What we know: w . x⁺ + b = 1;  w . x⁻ + b = −1;  x⁺ = x⁻ + λw;  |x⁺ − x⁻| = M. Then

    w . (x⁻ + λw) + b = 1  ⟹  (w . x⁻ + b) + λ w . w = 1  ⟹  −1 + λ w . w = 1  ⟹  λ = 2 / (w . w),

and it's now easy to get M in terms of w and b:

    M = |x⁺ − x⁻| = λ |w| = λ sqrt(w . w) = 2 / sqrt(w . w).

Learning the Maximum Margin Classifier

Given a guess of w and b we can compute whether all data points are in the correct half-planes, and compute the width of the margin. So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. How?

Learning via Quadratic Programming

QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints. Here:

    Minimize  (1/2) ||w||^2
    subject to  w . x_i + b >= 1  if x_i in class 1,
                w . x_i + b <= −1 if x_i in class 2.
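A hedged, minimal sketch of this in practice (ours, on made-up data): scikit-learn's SVC with a linear kernel and a large C approximates the hard-margin QP above, after which the margin width 2/||w|| can be read off the fitted weights:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 0.5, (20, 2)),
                   rng.normal( 2, 0.5, (20, 2))])
    y = np.array([-1] * 20 + [1] * 20)

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
    w, b = clf.coef_[0], clf.intercept_[0]
    print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
    print("support vectors:\n", clf.support_vectors_)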
Uh-oh! (noisy, non-separable data) This is going to be a problem. What should we do?

Idea 1: Find the minimum ||w||^2 while minimizing the number of training set errors. Problem: two things to minimize makes for an ill-defined optimization.

Idea 1.1: Minimize ||w||^2 + C (#train errors), with tradeoff parameter C. There's a serious practical problem that's about to make us reject this approach: it can't be expressed as a quadratic programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)

Idea 2: Minimize ||w||^2 + C (distance of error points to their correct place).

Learning Maximum Margin with Noise

Given a guess of (w, b) we can compute the sum of distances of points to their correct zones, and compute the margin width. Assume N examples (x_k, y_k), with y_k = ±1. The quadratic optimization criterion becomes:

    Minimize  (1/2) w . w + C Σ_{k=1}^N ε_k

subject to N constraints (one per example, plus ε_k >= 0):

    w . x_k + b >= 1 − ε_k   if y_k = +1,
    w . x_k + b <= −1 + ε_k  if y_k = −1.

Suppose We're in 1 Dimension

What would SVMs do with data on a line? Not a big surprise: a positive half-line and a negative half-line, split at the maximum-margin point.

Harder 1-Dimensional Dataset

Positive and negative points interleaved on the line — that's wiped the smirk off SVM's face. What can be done about this?

The Kernel Trick: preprocess the data, mapping x into a higher dimensional space, e.g.,

    z_k = ( x_k, x_k^2 ).

In z-space the data become linearly separable.

Example: All Degree-2 Monomials

    Φ: R^2 -> R^3,  (x_1, x_2) ↦ (z_1, z_2, z_3) := ( x_1^2, sqrt(2) x_1 x_2, x_2^2 ).

Project examples into some higher dimensional space where the data is linearly separable, defined by z = Φ(x). Training depends only on dot products of the form

    Φ(x_i) . Φ(x_j) = (x_i . x_j)^2 =: K(x_i, x_j).

The dimensionality of z-space is generally much larger than the dimension of the input space x.
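A two-line check (ours) that the degree-2 monomial map really turns dot products into the kernel (x . y)^2:

    import numpy as np

    def phi(x):
        # Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(x) @ phi(y), "==", (x @ y) ** 2)   # both equal 1.0

The sqrt(2) on the cross term is what makes the identity exact; without it the expansion of (x . y)^2 would not match.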
Common SVM Basis Functions

- z_k = polynomial terms of x_k of degree 1 to q. (For example, when q = 2 and m = 2: K(x, y) = (x . y + 1)^2 = 1 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2 + x_1^2 y_1^2 + x_2^2 y_2^2.)
- z_k = radial basis functions of x_k: z_k[j] = φ_j(x_k) = KernelFn( |x_k − c_j| / kernel width ).
- z_k = sigmoid functions of x_k.

SVM Kernel Functions

- K(a, b) = (a . b + 1)^d is an example of an SVM kernel function.
- Beyond polynomials, there are other very high dimensional basis functions that can be made practical by finding the right kernel function:
- Radial-basis-style kernel function:  K(a, b) = exp( −||a − b||^2 / (2σ^2) ).
- Neural-net-style kernel function:    K(a, b) = tanh( κ a . b − δ ).

(κ, σ, δ are "magic" parameters that must be chosen by a model selection method such as CV or VC-SRM.)

Application: The Federalist Papers

- Written in 1787-1788 by Alexander Hamilton, John Jay and James Madison to persuade the citizens of New York to ratify the constitution.
- The papers consisted of short essays, 900 to 3500 words in length.
- Authorship of 12 of those papers has been in dispute (Madison or Hamilton); these are referred to as the disputed Federalist papers.

Description of the Data

- For every paper, machine readable text was created using a scanner.
- Computed relative frequencies of 70 words that Mosteller and Wallace identified as good candidates for author-attribution studies.
- Each document is represented as a vector containing the 70 real numbers corresponding to the 70 word frequencies.
- The dataset consists of 118 papers: 50 Madison papers, 56 Hamilton papers, 12 disputed papers.

Function Words Based on Relative Frequencies

1 a, 2 all, 3 also, 4 an, 5 and, 6 any, 7 are, 8 as, 9 at, 10 be, 11 been, 12 but, 13 by, 14 can, 15 do, 16 down, 17 even, 18 every, 19 for, 20 from, 21 had, 22 has, 23 have, 24 her, 25 his, 26 if, 27 in, 28 into, 29 is, 30 it, 31 its, 32 may, 33 more, 34 must, 35 my, 36 no, 37 not, 38 now, 39 of, 40 on, 41 one, 42 only, 43 or, 44 our, 45 shall, 46 should, 47 so, 48 some, 49 such, 50 than, 51 that, 52 the, 53 their, 54 then, 55 there, 56 things, 57 this, 58 to, 59 up, 60 upon, 61 was, 62 were, 63 what, 64 when, 65 which, 66 who, 67 will, 68 with, 69 would, 70 your

SLA Feature Selection for Classifying the Disputed Federalist Papers

- Apply the SVM Successive Linearization Algorithm (SLA) for feature selection:
- Train on the 106 Federalist papers with known authors.
- Find a classification hyperplane that uses as few words as possible.
- Use the hyperplane to classify the 12 disputed papers.

Hyperplane Classifier Using 3 Words

A hyperplane depending on three words was found:

    0.537 to + 24.663 upon + 2.953 would = 66.616

All disputed papers ended up on the Madison side of the plane. [Results slide: 3-D plot of the separating plane for the Federalist papers, with the 12 disputed papers marked (Fung).]

Multi-Class Classification

- SVMs can only handle two-class outputs. What can be done?
- Answer: for N-class problems, learn N SVMs:
  SVM 1 learns "Output = 1" vs "Output ≠ 1", SVM 2 learns "Output = 2" vs "Output ≠ 2", ..., SVM N learns "Output = N" vs "Output ≠ N".
- Ideally only one f_j(x) > 0 and all others < 0, but this is not often the case in practice.
- Instead, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region: classify x as class c_j with j = argmax_j f_j(x). (A sketch follows below.)
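A minimal one-vs-rest sketch of that scheme (ours; scikit-learn also ships this built in, we spell it out to match the slides' description):

    import numpy as np
    from sklearn.svm import SVC

    def ovr_fit(X, y, classes):
        # one SVM per class: "Output == c" vs "Output != c"
        return {c: SVC(kernel="linear").fit(X, np.where(y == c, 1, -1))
                for c in classes}

    def ovr_predict(models, X):
        # pick the class whose SVM pushes the point furthest into "+"
        scores = np.column_stack([m.decision_function(X)
                                  for m in models.values()])
        return np.array(list(models))[np.argmax(scores, axis=1)]

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.4, (15, 2))
                   for c in ((0, 0), (3, 0), (0, 3))])
    y = np.repeat([0, 1, 2], 15)
    models = ovr_fit(X, y, classes=[0, 1, 2])
    print((ovr_predict(models, X) == y).mean())   # training accuracy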
Summary

- Learning linear functions: pick the separating plane that maximizes the margin; the separating plane is defined in terms of the support vectors only.
- Learning non-linear functions: project examples into a higher dimensional space; use kernel functions for efficiency.
- Generally avoids the overfitting problem.
- Global optimization method, no local optima.
- Can be expensive to apply, especially for multi-class problems.

----------------------------------------------------------------------

UNIVERSITY OF WISCONSIN-MADISON
COMPUTER SCIENCES DEPARTMENT

Class notes for Math/CS 885: Matrix theory for Numerical Analysis
copyright 1989 Carl de Boor. Spring 1989
Author's address: Center for the Mathematical Sciences, University of Wisconsin-Madison, 610 Walnut St., Madison WI 53705

notes for Jan.23ff (latest change Mar.4): linear spaces

L(S,T) := {linear maps A : S -> T}. For A in L(S,T), S =: dom A is the domain of A, T =: tar A its target.

task: given b in T, find x in S such that Ax = b.
- at least one solution iff b in ran A := range of A := A(S);
- at least one solution for every b in T iff ran A = T ("existence") iff A onto;
- at most one solution iff {0} = ker A := null space of A := {x in S : Ax = 0} ("uniqueness") iff A 1-1;
- exactly one solution for every b in T iff A is invertible.

Two ways to connect a linear space S linearly to coordinate space R^n: V : R^n -> S, and Λ : S -> R^n.

V = [v_1, ..., v_n] : R^n -> S : c ↦ Σ_j v_j c(j), with (v_j) an n-sequence in S, is the most general lm from R^n to S. If V in L(R^n, S) is given, then v_j = V e_j, all j, with e_j the j-th unit vector. In this course, any ls S will be a linear subspace of some coordinate space, hence the v_j will be column vectors and the map V is just the matrix [v_1, ..., v_n] with columns v_1, ..., v_n.

special case: L(R^n, R^m) = R^{m x n}. special fact: m < n implies that every X in R^{m x n} has a nontrivial nullspace. special matrix: I_n := [e_1, ..., e_n] is the n-th order identity or unit matrix.

Note: if V = [v_1, ..., v_n] in L(R^n, S) and A in L(S,T), then AV = A[v_1, ..., v_n] = [Av_1, ..., Av_n] in L(R^n, T).

Elements of ran V are called linear combinations of the v_j (note that I will always write the weights to the right of the v_j; WHY?), and ran V is called the span of the v_j, written span(v_j) = ran V. (Won't distinguish between the sequence (v_1, ..., v_n) and the map [v_1, ..., v_n].)

The sequence V is spanning for S iff the map V is onto; in that case, call S finite-dimensional. We only deal with finite-dimensional spaces. The sequence V is linearly independent iff the map V is 1-1. The sequence V is a basis for S iff the map V is invertible.

Fact 1: V not onto implies existence of w in S \ ran V and, for any such w, [V, w] is 1-1 in case V is 1-1. (proof: [V, w](c; d) = 0 implies −wd = Vc in ran V, hence, as w not in ran V, d = 0, so Vc = 0, so c = 0 in case V is 1-1.)

Fact 2: V not 1-1 implies that V = WX for some W in L(R^{n−1}, S) and X in L(R^n, R^{n−1}); hence, if V is onto, then W : R^{n−1} -> S is also onto. (proof: c in ker V \ {0} implies that, for some j, v_j = W c~ with W := [v_1, ..., v_{j−1}, v_{j+1}, ..., v_n], hence V = W[e_1, ..., e_{j−1}, c~, e_j, ..., e_{n−1}] = WX.)

Theorem: V in L(R^n, S) 1-1 and W in L(R^m, S) onto implies n <= m, with equality iff V is also onto and W is also 1-1.

proof of implication: W onto implies that, for each v_j, can find x_j in R^m so that v_j = W x_j, hence V = WX for X := [x_1, ..., x_n] in L(R^n, R^m). With this, n > m would imply that ker X is not trivial, hence ker V not trivial, a contradiction.

proof of equality characterization: This implies that if also V is onto and W is 1-1, then also m <= n, i.e., equality. It also implies (with Fact 1) that, if V is not onto, then n + 1 <= m. It also implies (with Fact 2) that, if W is not 1-1, then n <= m − 1.
Cor: Any two bases for S have the same cardinality, called the dimension of S and written dim S. (#V := number of terms in the sequence V.)

Cor: If T and U are lsubspaces of S and dim T + dim U > dim S, then T ∩ U ≠ {0}. (proof: Let V, W be bases for T, U resp., and assume that T ∩ U = {0}. Then [V, W](c; d) = 0 implies Vc = −Wd, hence Vc, Wd in ran V ∩ ran W = {0}, hence Vc = 0 = Wd, hence c = 0, d = 0. This proves that [V, W] is 1-1, therefore #V + #W <= dim S.)

Cor (with Fact 1): Any lin.ind. sequence can be extended to a basis.
Cor: Any linear subspace of a finite-dimensional ls has a basis.
Cor (with Fact 2): Any spanning sequence can be reduced to a basis.

Theorem: dim ran A + dim ker A = dim dom A.
(proof: extend a basis K for ker A to a basis V = [K, L] for dom A. Sufficient to prove dim ran A = #L. For this, consider AL : R^{#L} -> ran A. ran A = ran AV = ran AL (since AK = 0), i.e., AL is onto. If ALc = 0, then Lc in ker A, hence Lc = Kd for some d, therefore [K, L](d; −c) = 0, hence (since [K, L] is 1-1) in particular c = 0; in short, AL is also 1-1, therefore a basis for ran A.)

Cor: dim ran A <= min{dim tar A, dim dom A}.
Cor: If A in R^{n x n}, then A is 1-1 iff A is onto (since now dim tar A = dim dom A).

V invertible: then V^{-1} carries x in S to its coordinate vector V^{-1}x w.r.to the basis V.

Λ = [λ_1, ..., λ_m]^T : S -> R^m : x ↦ (λ_i x)_{i=1..m}, with (λ_i) a sequence of linear functionals on S, is the most general element of L(S, R^m). For given Λ, have λ_i : S -> R : x ↦ (Λx)(i), all i. (Again, won't distinguish between the sequence (λ_i) and the linear map Λ.) If S = R^n, then each λ_i in L(S, R) = R^{1 x n} is a row matrix or row vector, i.e.,

    λ_i x = w_i^T x = Σ_j w_i(j) x(j), all x,

for some w_i in R^n. Correspondingly, Λ in L(R^n, R^m) = R^{m x n} is a matrix, and w_i^T is the i-th row of that matrix. Hence this point of view stresses the rows of matrices, while the earlier one stresses the columns.

In this course, S will always be a linear subspace of some R^m, hence its dual S' := L(S, R) can be described by restricting (R^m)' to S. This means that we can think of each λ in S' as given by some w^T, as is argued in the following

Fact: If S ⊆ R^m, then for every λ in S' can find w in R^m such that λx = w^T x for all x in S. (proof: Take V a basis for S, extend it to a basis B := [V, L] for R^m. Then P := [V, 0]B^{-1} projects R^m onto S, in the sense that ran P = S and P restricted to S is the identity. Consequently, λP in L(R^m, R), i.e., λP = w^T for some w in R^m, and w^T agrees with λ on S.) Note that there is nothing unique about the w: different choices of L result in different w, but all these w's agree on S. Thus the most general linear map from S to R^m is of the form Λ = W^T restricted to S, for some W = [w_1, ..., w_m]; but if S is a proper subspace of R^m, then different W may give rise to the same map Λ.

Example: Λ = V^{-1}, with V a basis for S. The corresponding λ_i are the coordinate functionals for the basis V.

In order to compute with a linear map A in L(S,T), we factor it through some coordinate space, i.e., write it as A = V W^T with some W^T in L(S, R^n) and V in L(R^n, T). (example: take for V : R^r -> ran A a basis for ran A; then W^T := V^{-1}A : S -> R^r does the job.) More explicitly,

    A = V W^T = Σ_{j=1}^n v_j w_j^T, hence Ax = Σ_j v_j (w_j^T x).

How large can n be? Arbitrarily large. From a computational point of view, would like n as small as possible. How small can n be? There's a lower limit. The smallest possible n is called the rank of A, written rank A. By the example, rank A <= r := dim ran A, while necessarily ran V ⊇ ran A, hence rank A >= dim ran A. Therefore rank A = dim ran A.

Call the factorization/representation V W^T for A minimal if n = #V = rank A. If n = rank A, then V : R^n -> ran A is necessarily invertible, i.e., V is a basis for ran A. Also, ker W^T = ker A, since V is 1-1 (hence Ax = 0 iff W^T x = 0). Since A = V W^T implies A^T = W V^T, also rank A = rank A^T = dim ran A^T. Therefore, if A = V W^T = Σ_j v_j w_j^T is minimal (i.e., n = rank A), then V is a basis for ran A and W is a basis for ran A^T, hence ker A = (ran A^T)⊥.
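A small numeric sketch (ours) of a minimal factorization, using numpy's SVD as a black box to produce a basis V for ran A with exactly rank A columns:

    import numpy as np

    A = np.array([[1., 2., 3.],
                  [2., 4., 6.],
                  [1., 0., 1.]])          # rank 2: row 2 = 2 * row 1
    U, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > 1e-12 * s[0]))     # numerical rank
    V = U[:, :r]                          # basis for ran A
    W_T = s[:r, None] * Vt[:r]            # W^T, so that A = V W^T
    print(r, np.allclose(A, V @ W_T))     # 2 True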
Even if we insist on a minimal factorization, there are still many choices, suitable for specific jobs. The best known ones are:

- LU (or PLU) from Gauss elimination, useful for solving Ax = b;
- QR from Gram-Schmidt, useful for solving A^T A x = A^T b (least squares) and for eigenvalue computations;
- SVD from eigenvalue calculations, useful for understanding robustness or condition of the problem Ax = b;

and others. Their numerical construction uses induction (recursion) and elementary matrices. Their justification and analysis requires norms, particularly the 2-norm.

notes for Jan.27ff (latest change 19feb09)

A more general factorization for A in L(S,T) takes the form V Â W^T, which allows for some conditions to be put on W and V. For example, the factorization A = V Â W^{-1}, with both V and W bases, provides a matrix representation for A. Note that W^{-1} carries x in S to its coordinates w.r.to W, Â operates on that coordinate vector, and V maps the resulting vector to T. It follows that the matrix Â is given by Â = V^{-1} A W, showing that the j-th column of Â contains the coordinates of A w_j w.r.to V. One says that A and Â are equivalent. If one further insists on having Â very simple, then one can make it simple indeed, as we saw earlier, viz. of the form diag(I_{rank A}, 0), with the matrix 0 the zero matrix of the appropriate size.

When S = T and one insists on having V = W, it is much harder to keep Â simple. Now Â is called similar to A. If Â happens to be diagonal, then all its diagonal entries are eigenvalues of A, and the columns of V are the corresponding eigenvectors, in that then A v_j = Â(j,j) v_j, all j, with each v_j nontrivial since it is part of a basis. If we can find such V and Â, then A is called diagonable. While the diagonable matrices are dense, not all matrices are diagonable. In general, the best in the way of simplicity one can achieve — if one actually wants to construct it stably — is an upper triangular Â (the Schur form). Fortunately, this is sufficient for all practical purposes. If one is just proving theorems, then the Jordan canonical form is simpler yet. We return to these factorizations soon, when we discuss the computation of eigenvalues and eigenvectors.

Finally, the factorization A = V Â V^T is important in the study of the quadratic form x ↦ x^T A x associated with A or, more precisely, with the symmetric matrix (A + A^T)/2. Such Â is said to be congruent to A. For a symmetric A, Â can be chosen to be diagonal.

quick review of norms (sect. 2.1 and 2.2)

I assume that you have dealt with normed linear spaces and the associated map norm

    ||A|| := sup_{x ≠ 0} ||Ax||_T / ||x||_S,  for A in L(S,T),

and are familiar with the dual norm on S' = L(S, R), i.e., ||λ||_{S'} := sup_{x ≠ 0} |λx| / ||x||_S; in particular, for x in S, ||x||_S = sup_{λ in S'} |λx| / ||λ||_{S'}. For example, for the p-norms

    ||x||_p := ( Σ_j |x(j)|^p )^{1/p},  1 <= p < infinity;   ||x||_inf := max_j |x(j)|,

for x in R^n, the dual norm is the p'-norm, with 1/p' + 1/p = 1.

Cor: The rank-one map yλ : S -> T : x ↦ y(λx), with y a fixed element of T and λ a fixed element of S', has map norm ||yλ|| = ||y||_T ||λ||_{S'}.

Only the three norms ||.||_1, ||.||_2, ||.||_inf are of practical interest. The first and last, because the corresponding map norms

    ||A||_1 = max_j ||A(:,j)||_1,    ||A||_inf = max_i ||A(i,:)||_1

can actually be computed in finitely many arithmetic operations; the middle one, because it is the Euclidean norm, the norm associated with the scalar product: ||x||_2^2 = x^T x. For numerical analysts this norm is so important because there are many linear isometries for it, i.e., matrices which preserve this norm, viz. the orthogonal matrices, i.e., any matrix U with U^T U = 1. For such a matrix, and any x,

    ||Ux||_2^2 = (Ux)^T (Ux) = x^T U^T U x = x^T x = ||x||_2^2.
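Two quick checks (ours) of the claims just made: the finitely-computable 1- and inf-map norms are the max column/row sums, and an orthogonal matrix (here a Householder reflection, introduced formally later) preserves the 2-norm:

    import numpy as np

    A = np.array([[1., -2.], [3., 4.]])
    assert np.abs(A).sum(axis=0).max() == np.linalg.norm(A, 1)       # col sums
    assert np.abs(A).sum(axis=1).max() == np.linalg.norm(A, np.inf)  # row sums

    v = np.array([1., 2., 2.])
    U = np.eye(3) - (2 / (v @ v)) * np.outer(v, v)   # orthogonal
    x = np.array([3., -1., 4.])
    print(np.allclose(U.T @ U, np.eye(3)),
          np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))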
This implies that any orthogonal matrix is 1-1, hence a basis for its range, hence invertible, with inverse U^{-1} = U^T in case U is square. It also implies that ||V A W^T||_2 = ||A||_2 in case both V and W are orthogonal matrices.

One example of the plenitude of orthogonal matrices is the

Fact: Every orthogonal matrix V can be extended to an orthogonal basis for any linear subspace of R^n containing ran V. In particular, every linear subspace S of R^n has an orthogonal basis.

For the proof, observe that P := V V^T is a projector, since PP = V V^T V V^T = V V^T = P. It is the orthogonal projector for ran V, so called since, for any projector, (1 − P)P = P − P^2 = 0, hence, for this projector and for any x, V^T(x − Px) = V^T x − V^T V V^T x = 0, therefore ran V ⊥ (x − Px). Hence, if x in S \ ran V, then x − Px ≠ 0, so y := (x − Px)/||x − Px||_2 is well-defined and y ⊥ ran V; therefore [V, y] is again orthogonal and, in particular, 1-1. Thus repetition of this process must end, exactly when a basis for S is reached. The numerical process based on this construction is called Gram-Schmidt. (We will meet numerically preferable algorithms later.)

We use the following lemma in the proof of the existence of the SVD later.

1-Lemma: If |A(i,j)| = ||A||_2 =: a, then A(:,j) = A(i,j) e_i and A(i,:) = A(i,j) e_j^T, i.e., the rest of column j and of row i vanish.

Proof: With x := A e_j = A(:,j), compute

    a^2 = ||A||_2^2 >= ||A e_j||_2^2 = a^2 + Σ_{r ≠ i} |A(r,j)|^2,

to find that A(r,j) = 0 for r ≠ i. Use ||A^T||_2 = ||A||_2 to get that also A(i,s) = 0 for s ≠ j.

Numerical analysts also use norms on matrices which are, offhand, not map norms. The best known are again the p-norms, applied to A in R^{m x n} as a function on {1,...,m} x {1,...,n}. Of these, only the case p = 2 is of interest here, the so-called Frobenius norm:

    ||A||_F := ( Σ_{i,j} |A(i,j)|^2 )^{1/2}.

This norm is consistent with the 2-norm for vectors, in the sense that still ||Ax||_2 <= ||A||_F ||x||_2, for the simple reason that ||A||_2 <= ||A||_F; hence it provides a computable upper bound for ||A||_2. Also, ||U A W^T||_F = ||A||_F for all orthogonal U and W.

existence of the SVD

Theorem: For every A in R^{m x n}, can find square orthogonal V and W so that A = V Σ W^T, with Σ = diag(σ_1, ..., σ_p) in R^{m x n} and σ_1 >= σ_2 >= ... >= σ_p >= 0, p := min{m, n}.

Remark: Note that Σ, though a "diagonal matrix" by definition, is diagonal in the general sense that its only nonzero entries lie on its diagonal. In particular, the use of the construct diag(...) here (and in the book) does not coincide with its use in MATLAB.

Proof: Since the closed unit ball (in whatever norm) of R^n is compact, can find a unit vector x at which A takes on its norm, i.e., for which Ax = σy, with σ = ||A||_2 and ||y||_2 = 1. Extend x to an orthogonal basis W for R^n and y to an orthogonal basis V for R^m, and let Â := V^T A W. Then, in particular, ||Â||_2 = ||A||_2 = σ = Â(1,1), since Â e_1 = V^T A x = σ V^T y = σ e_1; hence, by the 1-Lemma, Â = diag(σ, B), and induction finishes the proof, since necessarily ||B||_2 <= ||Â||_2 (in fact, ||B||_2 = sup_{x(1)=0} ||Â x||_2 / ||x||_2 <= ||Â||_2).
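A numeric restatement of the theorem (ours): numpy's svd returns exactly such a factorization, with orthogonal factors, ordered singular values, and ||A||_2 = σ_1:

    import numpy as np

    A = np.random.default_rng(2).normal(size=(4, 3))
    V, sigma, Wt = np.linalg.svd(A)                    # A = V diag(sigma) W^T
    print(np.allclose(V.T @ V, np.eye(4)),
          np.allclose(Wt @ Wt.T, np.eye(3)))           # orthogonal factors
    print(np.all(np.diff(sigma) <= 0))                 # sigma_1 >= sigma_2 >= ...
    print(np.isclose(np.linalg.norm(A, 2), sigma[0]))  # ||A||_2 = sigma_1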
Cor: Let r be such that σ_r > 0 = σ_{r+1}. Then rank A = r. (Proof: A = V Σ W^T = Σ_{j<=r} v_j σ_j w_j^T = [v_1 σ_1, ..., v_r σ_r][w_1, ..., w_r]^T, with both [v_1, ..., v_r] and [w_1, ..., w_r] 1-1; hence this provides a minimal factorization for A.)

Cor: If A = V Σ W^T is an SVD decomposition for A and rank A >= k, then

    min{ ||A − B||_2 : rank B <= k } = σ_{k+1} = ||A − A_k||_2,  with A_k := Σ_{j<=k} v_j σ_j w_j^T.

(Proof: By assumption, rank A_k = k and ||A − A_k||_2 = ||V^T(A − A_k)W||_2 = ||diag(0, ..., 0, σ_{k+1}, ..., σ_p)||_2 = σ_{k+1}. Also, with rank B <= k, dim ker B >= n − k, hence ker B ∩ ran[w_1, ..., w_{k+1}] contains a unit vector z. Then Bz = 0, while Az = Σ_{j<=k+1} v_j σ_j (w_j^T z), so

    ||A − B||_2^2 >= ||(A − B)z||_2^2 = ||Az||_2^2 = Σ_{j<=k+1} σ_j^2 (w_j^T z)^2 >= σ_{k+1}^2 Σ_{j<=k+1} (w_j^T z)^2 = σ_{k+1}^2.)

This implies that the singular values σ_j = σ_j(A) are independent of the particular choice of V and W. Since A^T A = (V Σ W^T)^T (V Σ W^T) = W Σ^T Σ W^T, we see that Σ^T Σ is similar to A^T A, hence the σ_j^2 are the eigenvalues of A^T A (another way of seeing that the σ_j do not depend on the particular V and W). On the other hand, this shows that W must consist of eigenvectors for A^T A; specifically, w_j must be an eigenvector for A^T A belonging to the eigenvalue σ_j(A)^2. The analogous argument shows that v_j must be an eigenvector for A A^T belonging to the eigenvalue σ_j(A)^2. Thus, while the SVD is, strictly speaking, not unique, it is unique up to signs in case all the singular values are distinct.

With r := rank A, A = Σ_{j<=r} v_j σ_j w_j^T. Hence (v_j)_{j<=r} is an orthogonal basis for ran A, (w_j)_{j<=r} is an orthogonal basis for ran A^T, (v_j)_{j>r} is an orthogonal basis for ker A^T, and (w_j)_{j>r} is an orthogonal basis for ker A. Also,

    ||A||_2 = σ_1,    σ_1(A)^2 = ||A^T A||_2 = ||A A^T||_2.

notes for Feb.ff (latest change 19feb09)

Approximate inverse

After computing an approximate solution x̂ for the equation Ax = b (with b in T and A in L(S,T) given), one judges its quality by looking at the residual r := b − A x̂, for that is, offhand, the only thing one can compute. Since the error e := x − x̂ is related to the residual by the equation Ae = r, we need a lower bound for A in order to obtain a bound on e from r: let m(A) be some lower bound for the number inf_{x ≠ 0} ||Ax||/||x||; then ||e|| <= ||r||/m(A). Note that, for invertible A,

    inf_{x ≠ 0} ||Ax||/||x|| = inf_y ||A A^{-1} y|| / ||A^{-1} y|| = 1 / sup_y ||A^{-1} y|| / ||y|| = 1/||A^{-1}||,

i.e., 1/||A^{-1}|| is the sharpest bound available. Unfortunately, it seems to require knowledge of A^{-1}. There is only one practical way to obtain such a lower bound, viz. via an approximate inverse for A.

Call C in L(T,S) an approximate inverse for A in L(S,T) in case ||1 − CA|| < 1. With E := 1 − CA, an approximate inverse supplies a lower bound for A: since CAx = x − Ex, one gets

    ||C|| ||Ax|| >= ||CAx|| >= ||x|| − ||Ex|| >= (1 − ||E||) ||x||,  hence  ||Ax||/||x|| >= (1 − ||E||)/||C||.

In particular, ||e|| <= ||C|| ||r|| / (1 − ||E||).

Lemma: If the matrix A has C as an approximate inverse, then A is bounded below (i.e., inf_{x≠0} ||Ax||/||x|| > 0), therefore A is 1-1. Hence, if A is square, then A is invertible, and ||A^{-1}|| <= ||C|| / (1 − ||1 − CA||).

Proposition: dist(A, {C in L(S) : C not invertible}) = 1/||A^{-1}||.

Proof: Consider the rank-one modification C := A − xλ of A, with λ in S'. If λ(A^{-1}x) = 1, then C(A^{-1}x) = x − x λ(A^{-1}x) = x − x = 0, i.e., C is not invertible. Further, ||A − C|| = ||xλ|| = ||x|| ||λ||, and, for given x, the λ of smallest norm with λ(A^{-1}x) = 1 has ||λ|| = 1/||A^{-1}x||. Thus

    dist(A, {C : not invertible}) <= inf_x ||x|| / ||A^{-1}x|| = 1 / sup_x ||A^{-1}x||/||x|| = 1/||A^{-1}||.

For the converse inequality, observe that, for a noninvertible C, can find x in ker C \ {0}, hence get ||A − C|| >= ||(A − C)x||/||x|| = ||Ax||/||x|| >= 1/||A^{-1}||.

Cor: If A^{-1} takes on its norm at the unit vector x, and λ is a linear functional of norm 1 which takes on its norm at A^{-1}x, then A − xλ/||A^{-1}|| is a noninvertible matrix closest to A.

If Cr is a good estimate for the error x − x̂ in x̂, then adding it to x̂ should give a better approximation for x. This is at the basis of (successful) fixed point iteration. The equation Ax = b is equivalent to the equation

    x = x − C(Ax − b) = Ex + Cb,

and the fact that E has norm < 1 ensures convergence of the iteration

    x_{n+1} = E x_n + z,  n = 0, 1, 2, ...,  (z := Cb)

regardless of the choice of x_0 in S, since it ensures that the iteration function g : x ↦ Ex + z is Lipschitz continuous with Lipschitz constant ||E|| < 1. In fact, ||x_{n+1} − x_n|| = ||E(x_n − x_{n−1})|| <= ||E|| ||x_n − x_{n−1}||, therefore

    ||x_{n+m} − x_n|| <= Σ_{j<m} ||x_{n+j+1} − x_{n+j}|| <= Σ_{j<m} ||E||^{n+j} ||x_1 − x_0|| <= ( ||E||^n / (1 − ||E||) ) ||x_1 − x_0||,

which goes to zero as n -> infinity. Hence the sequence is Cauchy, therefore converges, to some y. On the other hand, any such limit y is necessarily a fixed point of the iteration, i.e., satisfies y = Ey + z, i.e., CAy = Cb. In fact, it is the unique solution, since if also v = Ev + z, then y − v = E(y − v), hence ||y − v|| <= ||E|| ||y − v||, which, in view of ||E|| < 1, can only hold if ||y − v|| = 0.
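A minimal numeric instance of this iteration (ours): take C to be the inverse of the diagonal part of A — for this diagonally dominant A that already gives ||E||_inf < 1 — and iterate. (This particular choice of C makes the scheme Jacobi iteration.)

    import numpy as np

    A = np.array([[4., 1., 0.], [1., 5., 1.], [0., 1., 4.]])
    b = np.array([1., 2., 3.])
    C = np.diag(1 / np.diag(A))            # crude approximate inverse
    E = np.eye(3) - C @ A
    print("||E||_inf =", np.linalg.norm(E, np.inf))   # < 1: converges
    x = np.zeros(3)
    for _ in range(50):
        x = E @ x + C @ b                  # equivalently x += C @ (b - A @ x)
    print(np.allclose(A @ x, b))           # True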
Conclusion: If C is an approximate inverse for A, then CA is 1-1 and onto, hence invertible. In particular, A is 1-1 and C is onto. Therefore, if also either A is onto or C is 1-1, then both are invertible. In particular, in that case the limit of the iteration x_{n+1} = E x_n + Cb is the unique solution of the linear system Ax = b.

One reason why numerical analysts are pleased to have so many different norms available is precisely because the definition of approximate inverse involves a specific norm, while conclusions like the invertibility of A or the convergence of the fixed point iteration are norm-independent in the finite-dimensional setting of this course. This means that we can make use of suitably chosen norms to establish these results. (We pick up on this when discussing iterative methods.)

notes for Feb.6 (latest change 19feb09)

Elementary matrices are the main tool in the construction of suitable factorizations. We call E in R^{n x n} an elementary matrix if it is a rank-one perturbation of the identity, i.e., of the form

    E = I_n − v w^T

for some n-vectors v and w. Such matrices are easy to apply: Ex = x − v(w^T x), i.e., it involves no more than a scalar product and the subtraction of a scalar multiple of one vector from another. (Both of these operations are part of the BLAS.) One would never form the matrix E explicitly, but merely store the two vectors v and w; in specific instances, only one of these has to be stored, since they are closely related.

The action of such an elementary matrix is easy to visualize: E keeps pointwise fixed the hyperplane w⊥, i.e., Ex = x for all x ⊥ w. Thus, if E is invertible, then E^{-1} must keep the same hyperplane pointwise fixed, i.e., necessarily E^{-1} = I_n − u w^T for some suitable u. One computes

    (I − u w^T)(I − v w^T) = I − ( u + v − u(w^T v) ) w^T

and concludes that

Prop: E = I − v w^T is invertible iff w^T v ≠ 1, in which case u = v/(w^T v − 1), i.e.,

    (I − v w^T)^{-1} = I − v w^T / (w^T v − 1).

Thus the inverse of an elementary matrix is an elementary matrix of the same kind, hence easily applicable.

Aside: More generally, we have the Sherman-Morrison-Woodbury formula

    (A − V W^T)^{-1} = A^{-1} − A^{-1} V (W^T A^{-1} V − 1)^{-1} W^T A^{-1},

which holds if both A and W^T A^{-1} V − 1 are invertible (hence A − V W^T is also invertible). The special case in which V and W are vectors is known as the Sherman-Morrison formula.

Cor: E = I − c v w^T is orthogonal iff v = w and c = 2/(v^T v). (proof: E orthogonal iff I − c w v^T = E^T = E^{-1} = I − c v w^T / (c w^T v − 1); comparing the two rank-one perturbations forces v and w to be parallel, so take w = v, and then c = c/(c v^T v − 1) requires c v^T v = 2.)

Cor: An orthogonal elementary matrix is also symmetric and self-inverse, E = E^{-1}. (proof: E = I − c v v^T implies E^T = E; hence, if also E^{-1} = E^T, then E is self-inverse.) In fact, for a general square matrix A, any two of the three properties

    orthogonal (A^T A = 1, hence A^{-1} = A^T),  symmetric (A^T = A),  self-inverse (A^{-1} = A)

imply the third. Because Householder has popularized the use of orthogonal elementary matrices, and because they are self-inverse (reflections), they are called Householder transformations.

EXAMPLES. In applications, one uses elementary matrices to map a specific vector x to some vector y, while leaving a whole hyperplane untouched.

For example, the first (and typical) step in Gauss elimination consists in subtracting from the j-th row of the matrix A in question the multiple A(j,1)/A(1,1) of the first row, for all j > 1. For the typical column a (of A) this means that we change a to a − m a(1), with m := (0, m_2, m_3, ..., m_n)^T, m_j := A(j,1)/A(1,1). This is exactly the same as applying the elementary matrix E_1 := I − m e_1^T. We can check now the proper choice of m, proper in the sense that it maps a_1 (i.e., the first column of A) to A(1,1) e_1: since E_1 a_1 = a_1 − m A(1,1), we need m = (a_1 − A(1,1) e_1)/A(1,1). Note that E_1^{-1} = I − m e_1^T/(e_1^T m − 1) = I + m e_1^T (since e_1^T m = 0), i.e., exactly like E_1 itself except for a change in sign. Thus

    A = (I + m e_1^T) [ A(1,1)  c^T
                        0       B  ],

and induction gives that A = LU, with L (unit) lower triangular and U upper triangular, provided the upper left corner element at each step is nonzero.

For a second example, we may wish to map an arbitrary vector x to a multiple α e_1 of the first unit vector. If we are to do that by a Householder reflection H = I − c v v^T for a certain vector v, then necessarily |α| = ||x||_2 (since H preserves the 2-norm), and v must be parallel to x − α e_1; hence we choose v = x − α e_1, whence c = 2/(v^T v) = 2/||x − α e_1||_2^2. One would choose the sign of α so as to make v^T v as large as possible, to avoid cancellation in the calculation of v, i.e., α = −signum(x(1)) ||x||_2, so that

    v = ( x(1) + signum(x(1)) ||x||_2, x(2), x(3), ... )^T.

Note that v^T v = x^T x − 2 α x(1) + α^2 = 2( ||x||_2^2 + ||x||_2 |x(1)| ). If H is constructed in this way from x := A e_1, for some given matrix A, then

    (1)    H A = [ α  c^T
                   0  B   ]

(remember that such H is self-inverse), and induction gives that A = QR, with Q orthogonal and R upper (or "right") triangular. Note that (1) even holds when A e_1 = 0, regardless of how we might then define the (formally undefined) v in that case. Note further that A need not be square here. Note also that necessarily ran[a_1, ..., a_j] ⊆ ran[q_1, ..., q_j], all j, which implies that the construction of this QR factorization also does the job for which one would otherwise rely on Gram-Schmidt. It turns out that Householder is better at it than Gram-Schmidt. We return to this shortly, in the discussion of least-squares (aka the solution of overdetermined systems).
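A compact sketch of the construction just described (ours; it forms H explicitly for clarity, which one would never do in production code — one would store v and apply E x = x − v(w^T x) instead):

    import numpy as np

    def house_qr(A):
        # Householder QR: each step reflects the current first column
        # onto alpha*e1, with alpha = -sign(x1)||x|| to avoid cancellation.
        A = A.astype(float).copy()
        m, n = A.shape
        Q = np.eye(m)
        for k in range(min(m, n)):
            x = A[k:, k]
            alpha = -np.copysign(np.linalg.norm(x),
                                 x[0] if x[0] != 0 else 1.0)
            v = x.copy()
            v[0] -= alpha                    # v = x - alpha*e1
            vv = v @ v
            if vv == 0:                      # column already zero
                continue
            H = np.eye(m)
            H[k:, k:] -= (2 / vv) * np.outer(v, v)
            A, Q = H @ A, Q @ H              # H is symmetric, self-inverse
        return Q, A                          # A has been reduced to R

    A = np.random.default_rng(3).normal(size=(5, 3))
    Q, R = house_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0, atol=1e-12))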
Givens rotations do not quite fit this pattern of an elementary matrix, since they are obtained as a rank-two modification of the identity. The typical Givens rotation G = G_{ij}(θ) is an arbitrary rotation in a particular two-dimensional coordinate plane. If this is the plane spanned by the unit vectors e_i and e_j, then G e_k = e_k for all k ∉ {i, j}, while

    G e_i = c e_i + s e_j,   G e_j = −s e_i + c e_j,   c := cos(θ), s := sin(θ).

Note that such a G is orthogonal, and that G_{ij}(θ)^{-1} = G_{ij}(−θ) = G_{ij}(θ)^T. Givens rotations are of use as similarity transformations, i.e., to change A into Â := G A G^{-1} in such a way that Â(i,j) = 0. We'll come back to this in the discussion of eigenvalue calculations.

notes for Feb.20ff (latest change May 13)

rounding error analysis for Gauss elimination

Gauss elimination provides the factorization PA = LU (or A = PLU, if that is more convenient), with L unit lower triangular and U upper triangular. The solution x of the system Ax = b is obtained by getting first y as the solution of Ly = Pb, and then x as the solution of Ux = y. Since pivoting (as recorded in the permutation matrix P) does not introduce any rounding errors, we may assume in the following error analysis that P = I, i.e., that we are applying Gauss elimination without pivoting to the properly rearranged matrix PA, which we'll call A again.

Calculation of L and U proceeds exactly (except possibly for the temporal order of the intermediate steps) as if we were computing their entries from the requirement that A = LU, with L unit lower triangular and U upper triangular, i.e.,

    A(i,j) = Σ_k L(i,k) U(k,j) = Σ_{k <= min(i,j)} L(i,k) U(k,j),

hence (using L(i,i) = 1)

    A(i,j) = Σ_{k<i} L(i,k) U(k,j) + U(i,j),          i <= j,
    A(i,j) = Σ_{k<j} L(i,k) U(k,j) + L(i,j) U(j,j),   i > j,

thus providing the formulae

    U(i,j) = A(i,j) − Σ_{k<i} L(i,k) U(k,j),               i <= j,
    L(i,j) = ( A(i,j) − Σ_{k<j} L(i,k) U(k,j) ) / U(j,j),  i > j.

This is not only convenient for handling storage (since it indicates that the interesting entries of L and U can be stored over the corresponding entries of A, even if they are computed by working out those sums gradually, as one would in classical Gauss elimination), but it also allows one to account for the rounding errors incurred during the factorization by perturbing the entries of A. Thus, with L and U now the computed factors, each computed entry satisfies the corresponding formula above with every term multiplied by a factor ε = 1 + δ, |δ| <= u (u the unit roundoff). Recall that, under the assumption nu <= 0.01 (which we now make), any product of up to n such factors (or their reciprocals) equals 1 + θ with |θ| <= 1.01 n u. Therefore each entry satisfies

    A(i,j)(1 + θ) = Σ_{k <= min(i,j)} L(i,k) U(k,j) (1 + θ_k),  |θ|, |θ_k| <= u_n,

i.e.,

    L U = A + E,  with  |E| <= |L| |U| u_n,

with u_n := n (1.01) u, the inequalities pointwise (i.e., B <= C :⟺ B(i,j) <= C(i,j) for all i, j), and, for any matrix B, |B| := ( |B(i,j)| : i, j ).
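A direct transcription of these formulas (ours; no pivoting, so it breaks down on a zero pivot), together with the quantity || |L||U| ||_inf that the bound above says one should monitor:

    import numpy as np

    def lu_doolittle(A):
        # Compute A = LU from the formulas above (Doolittle ordering).
        A = A.astype(float)
        n = len(A)
        L, U = np.eye(n), np.zeros((n, n))
        for i in range(n):
            for j in range(i, n):                        # row i of U
                U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
            for j in range(i + 1, n):                    # column i of L
                L[j, i] = (A[j, i] - L[j, :i] @ U[:i, i]) / U[i, i]
        return L, U

    A = np.array([[2., 1., 1.], [4., -6., 0.], [-2., 7., 2.]])
    L, U = lu_doolittle(A)
    print(np.allclose(L @ U, A))
    print("|| |L||U| ||_inf =",
          np.linalg.norm(np.abs(L) @ np.abs(U), np.inf),
          " vs  ||A||_inf =", np.linalg.norm(A, np.inf))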
The computational steps in back-solving the triangular systems Ly = b and Ux = y are exactly of the same kind as used in the construction of the factorization. The same analysis therefore gives that the computed solution ŷ of Ly = b satisfies the equation (L + G)ŷ = b, with |G| <= u_n |L|, and subsequent solving of Ux = ŷ provides the computed solution x̂ with (U + H)x̂ = ŷ, |H| <= u_n |U|. Altogether, the computed solution x̂ satisfies (L + G)(U + H)x̂ = b, and

    (L + G)(U + H) = LU + LH + GU + GH = A + E + LH + GU + GH,

hence

    (2)    (A + F) x̂ = b,  with  |F| <= |E| + |L||H| + |G||U| + |G||H| <= |L||U| ( u_n + u_n + u_n u_n + u_n ) ≈ 3 |L||U| u_n.

Conclusion: To the extent that this estimate is accurate, one could monitor the factorization process for trouble simply by tracking (or at least estimating) || |L||U| ||.

Conclusion: To the extent that this estimate is accurate, it should be the goal of any pivoting strategy to keep || |L||U| || small. Note that |L||U| >= |LU| = |A + E| ≈ |A|, hence at best || |L||U| || ≈ ||A||. In particular, it is not so much near-zero pivots by themselves that produce trouble with Gauss elimination, but the large entries of |L||U| that are usually their consequence.

condition

Let x be the solution of Ax = b, x̂ the computed solution, and set r := b − A x̂ = Ae, with e := x − x̂. Then ||e||/||A^{-1}|| <= ||r|| <= ||A|| ||e||, and ||b||/||A|| <= ||x|| <= ||A^{-1}|| ||b||, hence

    (3)    (1/κ(A)) ||r||/||b|| <= ||e||/||x|| <= κ(A) ||r||/||b||,

with κ(A) := ||A|| ||A^{-1}|| the condition number of A. Both inequalities in (3) are sharp, i.e., can be made equality by appropriate choice of x and x̂.

Conclusion: The relative residual allows conclusions about the size of the relative error to the extent that κ(A) is close to 1. If κ(A) is far from 1 with respect to the precision used, then a small relative residual gives no information about the relative error; the relative error could be even much smaller, or it could be much larger. For this reason, good packages (like MATLAB) provide automatically a condition number estimator, which warns the user when κ(A) u_n ≈ 1 or worse. For, assuming that pivoting has succeeded in keeping || |L||U| || = O(||A||), we conclude that the computed solution x̂ has a relative residual of the order of u_n, i.e., as good as can be expected; therefore the relative error is no worse than κ(A) u_n. Having this O(1) or worse means that the computed solution may have no correct digits.

One estimates κ(A), with the aid of the computed factorization, by trying to generate some right side d of norm 1 for which ||A^{-1}d|| is as large as possible. The heuristics are that, with proper pivoting, ||L^{-1}|| is not large; hence whatever trouble A^{-1} might have already comes out in U^{-1}. Computing U^{-1} and taking its norm would be the obvious thing to do, but it is too expensive: it costs O(n^3) flops. Instead, working in the max-norm, one chooses d_k in {−1, 1} so as to maximize |y_k|, with

    y_k := ( d_k − Σ_{j>k} U(k,j) y_j ) / U(k,k)

the k-th entry of y := U^{-1}d, as computed during back-substitution. In the scheme actually used in MATLAB (algorithm 4.5.1), the maximization is somewhat more global, since it takes into account the effect of choosing d_k = 1 vs. d_k = −1 not only on y_k itself, but also on all the sums Σ_{j>=k} U(i,j) y_j for i < k which enter the subsequent calculation of y(i).

Some efforts have been made to find methods for reducing κ(A). They all come down to looking for a norm in which κ(A) is small. If, as we mostly do, we only deal with map norms, then κ(A) >= 1. The standard way of trying to reduce κ(A) is to seek diagonal matrices D_l and D_r for which, in the given norms, κ(D_l A D_r) is as small as possible. One would restrict the entries of D_l and D_r to have mantissa 1 in whatever floating point arithmetic is being used (i.e., to be powers of the base), in order to avoid roundoff because of such scaling. This leads to equilibration of matrices, something I'll skip, because I don't fully appreciate its goal in the present context (and this early in the morning).
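An illustration of (3) (ours; it assumes scipy for the Hilbert matrix): for an ill-conditioned matrix, a tiny relative residual says little about the relative error:

    import numpy as np
    from scipy.linalg import hilbert

    n = 10
    A, x = hilbert(n), np.ones(n)
    b = A @ x
    xhat = np.linalg.solve(A, b)                    # computed solution
    rel_res = np.linalg.norm(b - A @ xhat) / np.linalg.norm(b)
    rel_err = np.linalg.norm(x - xhat) / np.linalg.norm(x)
    print(f"kappa(A) = {np.linalg.cond(A):.2e}, "
          f"rel. residual = {rel_res:.1e}, rel. error = {rel_err:.1e}")

Typically the residual here is at roundoff level while the error is many orders of magnitude larger — consistent with the factor κ(A) in (3).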
Factorization as approximate inverse

We can think of (2) as telling us that the solution process provides us with the means of computing Cz for given z, with C := (A + F)^{-1} and ||F|| boundable as shown. Is C even defined? Assuming A to be invertible, we know A + F to be invertible in case ||F|| < 1/||A^{-1}||, i.e., in case the relative perturbation ||F||/||A|| is less than 1/κ(A). In that case,

    ||(A + F)^{-1}|| <= ||A^{-1}|| / (1 − ||A^{-1}|| ||F||).

Then

    ||1 − CA|| = ||C(C^{-1} − A)|| = ||CF|| <= ||A^{-1}|| ||F|| / (1 − ||A^{-1}|| ||F||),

and this is less than 1 in case

    (4)    ||A^{-1}|| ||F|| < 1/2.

In such a case, one could use the back-substitution process to provide fixed point iteration, i.e., the iteration

    x_{n+1} = x_n + C r_n,  n = 0, 1, 2, ...,

with r_n := b − A x_n and x_0 := x̂, and so hope to obtain an improved solution. This is called iterative improvement. For this, note: strictly speaking, the error matrix F depends on the particular z to which the calculation was applied, hence we are dealing here with a more complicated object than mere fixed point iteration. But if we can be certain that (4) holds uniformly for all right sides involved, then convergence to the solution can be shown, if the calculation is carried out exactly. Since it usually is not, the additional error in the calculated residual r_n has to be taken into account. If the residual is computed with the same precision as the approximate solutions, then we cannot expect any improvement. For this reason, it is standard to compute r_n = b − A x_n with double precision accumulation of the scalar products b(i) − A(i,:) x_n involved (as can be done easily in many implementations of floating point arithmetic).

notes for Feb.24ff (latest change Mar.4)

overdetermined systems

Consider Ax = b with A in R^{m x n} and m > n. dim ran A <= dim dom A = n < m = dim tar A implies that there may be no solution. (Actually, this may even happen when m <= n, viz. when A fails to be onto because of rank deficiency.) In either case, the next best thing is to solve instead

    ||b − Ax|| = min.

This is a much more tricky computational problem than solving a square linear system, unless the norm is chosen so as to make the problem linear again. There is only one such choice of norm, viz. a weighted 2-norm. For this reason, one usually works with that norm, even though another norm may be indicated by the problem background.

Consider now the LS problem ||b − Ax||_2 = min. We are either looking for the element y = Ax in ran A closest to b, or we are looking explicitly for all x for which Ax is closest to b. (The book takes the latter point of view; I prefer the former.)

P_A := orthogonal projector onto ran A. Then P_A b is the best approximation in the 2-norm to b from ran A. Hence, in the terms used in the book,

    X := { x : ||b − Ax||_2 = min } = { x : Ax = P_A b }.

Also, X is a singleton iff A is 1-1; in that case, the book writes x_LS for its unique element. More generally, the book denotes by x_LS the unique point of minimal 2-norm in X. As near as I can tell, this so-called best least squares solution is the product of mathematical neatness, and good for nothing else. The typical situation calls for P_A b, and the elements of X become important only if, for some reason, it is important to represent P_A b in the form Ax. In that case, there are usually problem-dependent considerations that give preference to one element of X over the others. Such preferences amount to the addition of homogeneous equations, so that the coefficient matrix [A; B] of the enlarged system

    (5)    [A; B] x = [b; 0]

is 1-1. For example, X = x_LS + ker A, hence x_LS is the error in the best approximation from ker A to any element of X; in particular, x_LS is the unique element in X perpendicular to ker A. Thus the augmented system (5), with B = W^T and W any spanning sequence for ker A, has as its unique LS solution the vector x_LS of the original system.

pseudo-inverse

Recall that, with A = U Σ V^T = Σ_{k<=r} u_k σ_k v_k^T the SVD for A, (u_k)_{k<=r} provides an orthogonal basis for ran A; hence P_A = U_r U_r^T, therefore ||b − P_A b||_2^2 = Σ_{k>r} (u_k^T b)^2. Further,

    ||b − Ax||_2^2 = ||U^T (b − Ax)||_2^2 = Σ_{k<=r} ( u_k^T b − σ_k v_k^T x )^2 + Σ_{k>r} ( u_k^T b )^2.

Hence minimization requires v_k^T x = (u_k^T b)/σ_k for k <= r, thus pinning down v_k^T x for k <= r. Since V is orthogonal, the x of smallest norm satisfying these conditions will have v_k^T x = 0 for k > r, since ||x||_2^2 = ||V^T x||_2^2 = Σ_k (v_k^T x)^2, as already observed earlier. Conclusion:

    (6)    x_LS = Σ_{k<=r} v_k (u_k^T b)/σ_k = A⁺ b.

There is a vast literature on the resulting map

    A⁺ := V Σ⁺ U^T,  with  Σ⁺ := diag( σ_1^{-1}, ..., σ_r^{-1}, 0, ..., 0 ),

called the pseudo-inverse for A. Note that, for m >= n = rank A, A^T A = V Σ^T Σ V^T is invertible, hence

    (7)    A⁺ = (A^T A)^{-1} A^T.

If A is invertible, then A⁺ = A^{-1}. Hence the map A ↦ A⁺ is an extension of the map A ↦ A^{-1}. But note that ||A⁺||_2 = 1/σ_r, hence ||A⁺|| -> infinity as A approaches a matrix of lower rank. Thus the ability to define a pseudo-inverse for all A is bought at the price of having the map A ↦ A⁺ discontinuous. (Prove that that is necessarily so!) This discontinuity is implicit in the following bound, which need not go to zero as B -> A, since ||A⁺||_2, ||B⁺||_2 might blow up:

Theorem (6.12):

    ||B⁺ − A⁺||_F <= 2 ||B − A||_F max{ ||A⁺||_2, ||B⁺||_2 }^2.

Compare with ||B^{-1} − A^{-1}|| <= ||B − A|| ||A^{-1}|| ||B^{-1}||, valid for invertible maps and arbitrary norms.
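A check (ours) that formula (6) is exactly what numpy's pinv and least-squares solver compute:

    import numpy as np

    A = np.array([[1., 0.], [0., 1.], [1., 1.]])     # full column rank, m > n
    b = np.array([1., 2., 0.])
    x_pinv = np.linalg.pinv(A) @ b
    x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    x_svd = Vt.T @ ((U.T @ b) / s)                   # formula (6)
    print(np.allclose(x_pinv, x_lstsq), np.allclose(x_pinv, x_svd))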
original systemi pseudoinverse Recall that with A UEVT 2169121601612 the SVD for A UT uhui uT provides an orthogonal basis for ranA hence PA UTUTT therefore H12 7 PA12H Ekgtru 122i Further Hb7AzH UTb7 SWIM 7 amng kSr kgtr orthogonal the z of smallest norm satisfying these conditions will have Uk 1 0 for k gt 7 since HVTzHg Ekvgz2 as already observed earlieri Conclusion Hence minimization requires 1211 ugh0k for k S 7 thus pinning down 1211 for k lt Ti Since V is 6 1L5 2uf12ak k A12i kSr There is a vast literature on the resulting map AJr I VEJFUT with 2 2 diagaf1uia10i i i 0 l4 called the pseudoinverse for A Note that for m 2 n rankA ATA VEiVT hence 7 ATA 1AT VEglUTi If A is invertible then A4r A li Hence the map A 77 A4r is an extension of the map A 77 A li But note that HAH2 a1 hence HAH 7gt 00 as A approaches a matrix of lower ranki Thus the ability to de ne a pseudoinverse for all A is bought at the price of having the map A 77 A4r discontinuous Prove that that is necessarily soli This discontinuity is implicit in the following bound which need not go to zero as A 7gt B since HAH2 might blow up Theorem 612 HB AllF S 2H3 AllFmaXllAll 7 llBlllglA Compare with HBT1 Ailll S HB All llAilll HB llh valid for invertible maps and arbitrary normsi sensitivity analysis Standard sensitivity analysis for the LS problem assumes that both A and its perturbed version A 7 6A have full column ranki Here 6A 612 are names for the perturbations in A and Theorem 613 A E RmxnA of full rank all norms are 2 norms H12 7 AIM min Hb76b 7 A76AiH min 7 I b 7 AI 7Aquot 2 12 76b 7 A76Ai 51 maxfllMllllAlh ll5bllllbll lt 1 14 51119 1 llTllllbll lt17 Then coslt6gt 7 MAINW 7ltAgt 7 mamanal and Hi 7 77qu elt2nltAgt 7 sinlt6gtnltAgt2gt coslt6gt 7 03 H7 7 777127 elt1 727ltAgtgt min1m7 n 7 07 Proof Set A I A 7 MA I I b 7 MIL Then lle 7 Atth min de nes the smooth function 01 7gt R z t gt7 I AfAt 1Atbt since by assumption Ht AH S H6AH lt lanA hence rankAt n Also 55 11 10 550 0llill7 with I max7g01 therefore H55 IllNIH S lliollllrll 0llilll Differentiate the identity AfAtzt A372 to get 8 6ATAL 7 AtT A t 7 AtTAt ct 7 MW 7 A367 At t 0 this gives MTA 7 AT6Az 7 ATAa39c0 7 6ATb 7 AT b hence 70 7 ATA 1AT612 7 MI 7 MW HltATA 1ATH8HbH 757717 HIM 7 HltATA 1H HAH 777 W T 71 WM 771 7 A A A 7 HAHHIH l l l l HIM N0W7fr0m77llATA 1ATll10nAWhilellATA 1ll10nA27 A1507 llbllllAll Hill S llbllllAIll l cost9 and therefore S sint9HAHcost9l This implies that 7707qu 67ltAgt1coslt0gt 71 7 7ltAgt2sinlt6gtcoslt0gtv A differentiation of 8 gives 26AT6Azt 7 26ATAL 7 A6Az39t 7 AfAtit 6AT6b 7 6AT6bi Here all terms except for the term AfAtit are 052 hence so is ill The proof for the bound on H 7 proceeds along similar lines 1 EHIHHATA 1ATH HAM 15 Important point If sint9 07 iiei7 if the residual 7 is not zero7 then the bound on the relative error in terms of the relative noise in A and 1 involves the square of the condition number The corresponding bound on the relative change in the residual only involves the condition number itself For relatively small relative residual iiei7 for sint9 N 07 the noise effect is only proportional to the condition number In such a situation7 it is important to use a method Whose sensitivity to noise is only proportional to HAi The standard method of solving the normal equations ATAz AT eigi7 by Gauss elimination7 is not such a method since the condition of its coef cient matrix ATA is MA notes for Feb27 latest change l9feb09 QR factorization via Householder and modi ed GramSchmidt basic idea factor A QR with Q orthogonal and R 
right or upper triangular in the sense that ranij 0 for i gt j Notational difficulty If A is not square ie A E Rmxn with m gt n then straighforwardly also Q E Rmxn but R is square The book prefers for Q to be square as that is what one gets when using Householder for the construction of the factorization hence has R E Rmxn with the part below row n all zero and useless This does have the advantage that Q is invertible Purpose If we are to solve the linear system A b then we can now look at the equivalent system R QTb whose solution by backsubstitution is easy This even works when m gt n in which case we obtain thereby the orthogonal projection of 12 onto ranA ie the LS solution to the problem Let A QR be a QR factorization for A Then 9 ranAlj CranQlj Vj since then Aj AEj QREJ39 QEkltj Rkjek 21K RkjQek Hence if A is 11 then necessarily 7 7 ranAlj ranQlj Vj thus Q lj provides an orthogonal basis for ran A lj Conversely if Q is orthogonal and satis es 9 then necessarily A QR with R right triangular since Rj Raj contains the coordinates of A j with respect to Q and by 9 Aj 6 ran Qlj We can therefore think of obtaining such a QR factorization for A by orthogonalizing the columns of A a la GramSchmidt see Jan27 notes This would only provide us with an orthonormal basis for ranA But for solving A b that is all we need We can also think of it straightforwardly as the task of eliminating see notes Feb6 the strictly lower triangular part of A but using orthogonal elementary matrices whose product then provides Q or Q l The second point of view has the advantage that it provides Q as a square matrix thus giving not only an orthonormal basis for ranA but also for its orthogonal complement ln GramSchmidt the typical or j lst step starts with Qj Q lj already in hand This means that we can compute the orthogonal projection of any vector 1 onto ran A lj as Qij 1 Its error I 7 P11 is necessarily orthogonal to ran Aj ran Qj hence QZJ l I I 7 7 provides a suitable next column for Q provided we take I AZJ l and pmvided the resulting vector 1 7 is not zero ie provided A lj 1 is 11 The modi cation which gives the name to modi ed GramSchmidt concerns the actual calculation of z 7 sz z 7 In the standard process one computes the vector 5 and then subtracts Qjc from I Now notice that qJTz j z 7 Pglz since qj is perpendicular to raan1 ran Pj71 On the other hand 7 Pj71zH S and usually the inequality is quite strict This means that the scalar product qj z 7 j1z is apt to involve absolutely smaller terms than the scalar product qu Since they both give the same result it must be the case that the calculation QJTI involves more cancellation than the calculation 7 Pj71z hence the latter is preferable This means that at the beginning of the j lst step all columns of A will already have been modi ed j times by subtracting out their orthogonal projection onto ran 13971 ran j1 If we use Q as our working array which we would initialize to A then we have an orthonormal basis for Aj in Q lj while Q j ln l 7 Pj71A j ln The task of this step is to take out from columns j l n also the projection onto the span of qj by replacing each such column I by z 7 ijqj z This leaves the j lst column containing AZJ 1 7 PjAZ7j 1 and its normalization provides the next column for Q ie the vector qj1 In practice one would not carry out this normalization as it requires taking a squareroot Rather one would work with the unnormalized vectors call them 101102 and merely record their normsquares ie the numbers m Hpk H2 all Is One obtains 116qu from 
this in the form pkpgzTk When using quot 39 With I39 1 a the typical or j lst step starts with the matrix R which was initialized to A already upper triangular in its first j columns ie R1 X R l o si 17 with R1 square upper triangular Also A QR with Q initialized to I and presently containing the product of the previous j Householder re ectionsi A Householder re ection Hj l 7 ejvjvv is then constructed so that HjRj l Rljj l a 0i i i 0 with a necessarily equal to illRj 1Zm7j Q and R are then updated Q a QHj Q 7 cjijva R lt7 HjR R 7 cjvjijRy In fact Q is usually not constructed explicitlyi Rather the numbers Cj and vectors vj are stored individually mostly in R itself and the sequence H0H1 i i i is applied whenever application of Q is called for Since the construction of Hj requires division by a there is a dif culty here when Rj12m7391 0 In that case A j1 QR j1 Ekltj qukj 1 ie A6141 6 ran Qj ran Aj assuming that R1 is invertible ice A lzj 1 fails to be 1ili There is a straightforward way of dealing with this dif culty ofa rankde cient A From the elimination point of view there is no need for this elimination step since the strictly lower triangular part of the current column is already zeroi Hence the choice H39 I will do The corresponding action in GramSchmidt is simply to ignore the j 1st column and go on to the next Either way the nal factorization obtained this way is de cient since now R fails to have an invertible upper triangular parti One deals with this by formally swapping the useless j 1st column of A for a more useful iiei independent later column of Al This solution of the problem of rankde ciency takes it for granted that we can tell a zero when we see one In niteprecision arithmetic that is a hard problemi The best we can do here is the following if z 7 Ouz then within the precision used I 6 ran Pj ran Qj ranA1j i In the elimination approach one similarly checks whether lmj l OuHA j l and deems Aj l to be in ranA lzj in that case In either method one uses in practice column pivoting by choosing at the j lst step from among the columns not yet used the one with the largest normi This requires some normalization at the beginning of the process It is not necessary to recompute the norms of the remaining columns since we can keep a running total ln elimination each column is modi ed by a unitary matrix hence does not change its normi On the other hand if we have in hand m I HRjm MHQ from the previous step then we can easily update it for the current step by Tk lt7 m 7 Rj k In modi ed GramSchmidt one similarly keeps track of the current norm square of each column in the working matrix Q m I kH2 say i During the j lst step one subtracts from each such column I I Q 16 for k gt j the vector qjqTz thus can update Tk lt7 m 7 qJTz In either method one produces in the end a factorization AH QR for some permutation matrix H and with R of the form R R11 R12 0 0 where R11 is an invertible upper triangular matrix of order 7 I rankAi Then QTb QTA RHT has the socalled basic solution HTIB diagR1 110QTbi If you are after 1L3 though you would have to work still harder here You now know that X 13 H RflllfRull 1R7 hence must compute a best approximation to 13 from that space at the right On the other hand remember the earlier comments about a problemmotivated approach to removing rankde ciency Note that Householder holds a slight edge here over GramSchmidt since in the former method the vectors treated in step j l have only m 7j entries while they always have m entries in the latter method It can 
be shown that Householder elimination and modi ed GramSchmidt have similar stability properties which are better than those of Gauss elimination in case the latter is applicable If A is very sparse in a regular way then it might pay to use Givens rotations rather than Householder re ections during the elimination since the former can be used effectively to eliminate the occasional nonzero strictly lower triangular entryi But the implementation can be hairyi notes for marl latest change 19feb09 Householder bi diagonalization is of use in the LS problem It is also handy ultimately in the construction of the SVD to be discussed after the QR method for computing eigenvalues The goal factor A as UBVT with U and V orthogonal and B upper bidiagonal ie Bij y 0 gtj i i 1 This is as close as one can get to a SVD without iteration The purpose to provide better information about rankA though nothing beats the SVD for that for use in the LS problem More importantly to provide a good starting point for the iterative construction of the SVD The means use Householder re ections on both sides to alternately zero out a column below the diagonal and the corresponding row to the right of the first superdiagona Thus in the first step after initializing B to A and U and V to 1 construct H0 to carry Bl to 1161 and update B lt7 HOB U lt7 UHO Then construct K0 to carry the current Bl T to Bl lel leg In particular K0 leaves 61 fixed hence the resulting update B lt7 BKg BKO will not destroy the zeros already obtained in B 1 Also update V lt7 VKO Now B is of the form 7 01 516 le B22lv and the process is applied to the submatrix B22 You get the picture If m gtgt n it is faster to carry out all the left steps first since this avoids having to apply the Kj to the many rows Bi with i gt n It does require a final redoing of the factorization including further steps on the left but this will then only involve the matrix Bln 1m The detection of rankde ciency may be helped by going to this bidiagonal factorization Still it is easy to come up with bidiagonal matrices whose elements are all of size 1 yet whose condition number is large For example the nth order bidiagonal matrix On with typical row 0 7 is like a rstorder difference equation The nullspace of the bi in nite version of On is spanned by the vector 2700 ence On maps 1 2n 12 2 20 to en therefore Hofum 2 2n 1 In fact C7716 21 121 2 20 0 0 hence Hofum 2n 1 The singular values of such On behave similarly to those of Bn computed earlier They lie between 1 and 3 clustering against 1 except for anCn N 2 Only the actual calculation of the SVD or of C771 would reveal this near rankde ciency In effect the columns of On would be strongly linearly independent if it weren7t for the first column In particular the singular values of On 2m are perfectly ne Iterative improvement is possible in the fullrank case with the observation that the problem H12 7 AI l min is equivalent to the linear system L9 Mi whose unique solution is 7 z with 7 b 7 AI as usual In fact this equivalent formulation is useful in the construction of iterative methods for the solution of the LS problem in case A is suitably sparse It is also to be preferred to the normal equations formulation ATA ATb since the latter s coef cient matrix has condition H2 A2 while the former s has only condition H2 Generalized Least Squares concerns the problem vTv min subject to b A1 B1 19 Which for invertible B is equivalent to HB 1b 7 mini lndeed set 1 I B710 7 Paige solves this minimization problem this way He factors A Q1Q2 
130 With R square right triangular and then obtains an orthogonal matrix P so that QgBP 0 S With 5 square right triangular and correspondingly P 131132 If A E Rmxn and B is square then B has order m R has order n QgB E Rminxm hence S has order m 7 Then I A1 B1 is equivalent to T T T T lgill wals 1132 Consequently from the bottom half the triangular system S Q51 has the solution P2Tv from which 1 is determined With 1 in hand the top half provides the system R be known stuff Whose solution is z notes for mariSff latest change MarilS Similarity A and B are similar I for some invertible V A VBV li lfA S 7gt S and V R 7gt 5 ie V is a basis for S then B V lAV is a matrix called the matrix representation of A wrto the basis Vi In a picture A S 7 S VT TV R 7 R B In the best of circumstances A is diagonable ie can nd a diagonal matrix A diagA1A2Hi similar to A For then we can study the powers of A and more generally for functions f such as polynomials and limits of polynomials in terms of the much simpler matrices diagf A1 fA2i i i using the fact that then fA WWW 17 at least for polynomials f for other functions f it would be a de nition made plausible by a limit process It so happens that not all matrices are diagonable although the generic square matrix is ie the diagonable matrices are dense But every square matrix is similar to an upper triangular matrix and this often helps already in the study of In fact such a similarity is available with V unitary ie V V I which is the complex version of an orthogonal matrix The resulting upper triangular matrix unitarily similar to A is called the Schur form for A see below Since such an upper triangular matrix T is invertible iff all its diagonal entries are nonzero it follows that A 7 A is not invertible iff A 7 T is not invertible iff A Tjj for some ji Thus construction of an upper triangular matrix similar to A requires implicitly or explicitly the construction of the spectrum of A ie the set AA I A 6 IF A 7 A is not invertible The points in the spectrum of A are called eigenvalues of A and any vector in kerA 7 A is called an eigenvector of A belonging to A In particular v1 must be an eigenvector of A belonging to the eigenvalue T11i This anticipates the result proved later that every A is similar to an upper triangular matrix and this proof requires us to know that every A has eigenvaluesi This latter fact is most quickly seen by considering the minimal polynomial for A ie the polynomial p of smallest degree for which pA 0i 10Lemma For every A there exists a unique monic polynomial p of minimal degree for which pA 0 Any root of the minimal polynomial is an eigenvalue ofA and conversely lndeed with A of order n the n2l powers A0 A1i i i A 2 must be linearly dependent since they lie in the n2dimensional space anxn This means that for some nontrivial coef cient vector a0 al i i i an2 I pA I Ej Ajaj 0 This shows that there are such nontrivial annihilating polynomials p hence there must be one of smallest degree and with leading coef cient i If now A is any root of this p then p 7Aq hence 0 pA A7AqAi lf now A7A were invertible then it would follow that already qA 0 even though 4 is of smaller degree than p a contradiction to the assumed minimality of pi Conversely for any A p7pA 7Aq hence 7pAI pA7pA A7AqA and the left side is invertible when pA 0 therefore so must the right side bet Hence pA 0 implies A g AAi D In principle it is possible to construct the minimal polynomial in nitely many flops though there is that practical dif culty of determining the 
smallest j for which Aj E spanAkkltj since this requires the decision of whether something is zero With the minimal polynomial p in hand the problem of calculating the spectrum AA of A is reduced7 to the problem of polynomial root nding In fact the two problems 21 are equivalent Eigi MATLAB nds the roots of an arbitrary monic polynomial p 1 7 21j 1aj by nding the eigenvalues of its companion matrix 0 l 7 7 7 639 1 j lt n39 A I a LeiAev 17 7 P I l 7 m newt J n l I claim that p is the minimal polynomial for Apr First A lel Ej for j S n hence AgA1 Ag l are linearly independent since the vectors A361 1411761 l l l Ag lel are therefore Apls minimal polynomial must have degree at least nl Further A261 Apen 2k ekaUc 2k A lelaUc hence pApel 0 But then for any j pltAp j pApA 1el A 1pApel 0 therefore pAp 0 We conclude from 10Lemma that every matrix has eigenvalues because the fundamental theorem of algebra assures us that every nontrivial polynomial has roots provided that we admit complex numbers as roots and eigenvalues For this reason we now switch from the reals to the complex numbers This means that we use the conjugate transpose or Hermitian AH instead of AT and correspondingly use unitary matrices U ilei square matrices with UHU 1 instead of orthogonal ones We are also more interested in a hermitian matrix ie A with AH A than in a symmetric one We will also be concerned with normal matrices iiei matrices with AHA AAHi One calls the linear subspace ranX invariant for A or an invariant subspace of A if AX C ranX 1 e if AX XB for some Bl Thus T I V lAV is upper triangular iff for each j V3 2 111 l l vj spans an invariant subspace for Al 11Lemma AX XB for some not necessarily square X f 0 implies that MA MB 3e 0 Note that if here X is invertible then A and B are similar hence MA MB in that case For the proof of the Lemma write X U E g VH with U and V unitary and E diagonal of order 7 I rankX hence invertiblei Then with C I UHAUD I VHBV 0112 0 7 011 012 E 0 7 H 7 H 7 E 0 D11 D12 7 Z3D11 Z3D12 0212 ol lcgl CQQHO ol U AXV U XBV 0 0 D21 D22 0 0 Therefore C112 EDu iiei C11 and D11 are similar hence AC11 MDH 3e 0 Also C21 0D21 0 consequently MA MC 2 M011 Wu C MD AltBgtA B One calls the linear subspace ranX reducing for A if it is invariant for A and some complementary subspace iiei some subspace ranY with ranX ranY 0 and ranXranY domA is also invariant for Al Assuming that X is 11 this means that there is an extension of X to a basis V XY for domA so that both ranX and ranY are Ainvarianti In that case T 0 1 7 7 11 V AViTli 0 T22 with A ranX XTHX 1 and A rany YTQQYili aving obtained such a pair of reducing spaces for A we can look at each of these and try to break them up into smaller reducing spaces and if we succeed we can try to break the smaller spaces into yet smaller reducing spacesi Since we are starting off with a nitedimensional space we will after nitely many such attempts reach a basis V V1 l l l V5 for which each ranVj is a minimal Ainvariant space We call such a basis V maximally reducing for Al The corresponding matrix J V lAV is blockdiagonal and with a natural choice of the basis for the summand ranVj all j it is the Jordan canonical form for 22 Al From a computational point of view this canonical form is of no interest since it does not depend stably on Al lnstead Numerical Analysts work with the Schur form which is only upper triangular but does depend stably on Al In fact the basis for it is even unitaryi Unitary similarity is particularly important since it preserves properties 
concerning the scalar product such as normality ie the property that AHA AAH or the conjugate symmetry A of a hermitian Al Thus if A VBV 1 for a unitary V then V 1 VH hence AH VBHV l ie the hermitian of A is represented by the hermitian of the representer of A In particular A is normal hermitian iff B is normal hermitian The essential point for the understanding of the Schur form is the following 12Lemma IfX is 1 1 and ranX is A invariant ie AX XB for some B then With U an orthonormal basis for ran X and V U W an orthonormal basis for domA T I VHAV is block upper triangular ie H 7 T11 T12 V AV70 1122 Moreover ifA is normal then VHAV is block diagonal ie T12 0 For the proof note that T21 WHAU 0 since AranU C ranU L Wi As to the rest if A is normal then also T must be normal In particular 13 TuT T121111 Tngi But for any matrix C traceCCH ZltCCHUJ lCU W2 HONT traceCHCA j j k Thus 13 reads HTulliv llT12ll2iv HTulliw which implies T12 0 D Corollary Schur Form For any square matrix A can nd a unitary matrix V so that VHAV is upper triangular ie VHAV A N With A diagonal and N strictly upper triangular The inductive proof relies on the fact that every square matrix has an eigenvalue A say If U1 is a corresponding eigenvector of unit length then any extension to an on basis V will give H 7 A 12H V AV 7 0 B i lnduction provides a unitary U so that T I UHBU is upper triangular hence H 7 A 12H W AW 7 0 T with W I Vdiagl U again unitary is upper triangular D Note that we had the choice here of any particular A 6 MA hence we can prescribe the order in which we would like to have the eigenvalues of A appear in the Schur formi Corollary A is normal if A is unitarily similar to a diagonal matrix A is hermitian if A is unitarily similar to a real diagonal matrix For the proof recall that with V unitary VBVH is normal hermitian iff B is normal hermitian Hence need only prove that a normal matrix is unitarily similar to a diagonal matrix But with A VTVH and T upper triangular 12Lemma implies that T must be diagonali D 23 Note that for T I V IAV upper triangular dett 7 A 7 detV dett 7 T detV 1 7 dett 7 T 7 HQ 7 Tjji This shows that A 6 MA iff A is a root of the Characteristic polynomial of A ie the polynomial t gt gt dett 7 A Moreover each eigenvalue of A appears as a diagonal entry of such an upper triangular matrix similar to A exactly as many times as it appears as a root of the characteristic polynomial This is the algebraic multiplicity of the eigenvalue Note that with A VA NVH a Schur form for A Since Ej Mjl2 in which each eigenvalue of A appears according to its algebraic multiplicity depends only on A as does HAHF this shows that is independent of how we choose the unitary matrix Vi The number is called A7s departure from normality This shows that for a nonnormal matrix we have to go to nonunitary bases V if we want to make V lAV simpler than just upper triangular For this assume that T is block upper triangular T11 T12 14 T lt gt 0 with T11 T22 square but not necessarily of the same order We look for a similarity transformation Y lTY which leaves the two diagonal blocks as well as 0 T21 unchanged but turns that block T12 into zero In the general case this forces Y to be of the block form IZ lo Ilv and so gives Y71TY 1 IZ T11 T12 1 Z T11 T12T11ZZT22 0 0 0 T22 0 I T22 Hence this similarity is successful provided we can choose Z so that Z THZ 7 ZT22 7T12i Since the map so so de ned is linear and Dg0 Tg0 we will succeed in general iff go is invertible iiei iff go is lli For this reason the following lemma 
is important Lemma The linear map g0 X gt7 AX 7 XB is invertible iff MA MB 0 For the proof note that if A 6 MA MB then Av A1 and wHB AwH for some nontrivial v and w consequently vw 0 yet so vw Avw 7 va Ava 7 Ava 0 The converse was proved in llLemmai D Corollary Every A is similar to a block diagonal matrix J diagJ1 J2i i i J1 With MJS As all s and M f M fori fj hence MA Aj j liut Indeed if we take J to be the blockdiagonal matrix similar to A obtained from a maximal reducing basis X for A then necessarily each block J5 has only one eigenvaluei For if such J5 had more than one then we could get its Schur form T in the form 14 with both T11 and T22 of positive order and with disjoint spectral But this would imply that ranT11 and ranT22 are nontrivial reducing spaces for J5 hence for A thus contradicting the maximality of the reducing basis we started out with D 24 From this Corollary it is only a small step along the same line of argument to the Jordan canonical formi Here is the step described for the typical minimal reducing subspace Z I ranVs Which provides the diagonal block JS in a maximally re ned blockdiagonal representation J diagJ1 i i i J1 V lAV for A With V V1i i i Vt the corresponding maximal reducing basisi On Z ran V5 A Z takes the simple form A5 N With AN By 10Lemma this implies that N is nilpotent ie N4 0 for some q The smallest q for Which this holds is called the degree of nilpotency of Ni We claim that necessarily this smallest 4 equals dim Zr lndeed With q as de ned there must be I E Z for Which Nq lz 0 and that being so there must be yH E Z eigi y Nq lz for Which yHNq lz 0 De ne zj I Nq jz I yHNi l all ij l i i i q and consider their Gramian ie the matrix YHXZy1uiyqHIliH1qi We compute yHXWJ grimy 0 g 9 This means that YHX is triangular With nonzero diagonals hence invertible This implies that both X and Y are 11 and that ranX ker YH 0 therefore ranX and ker YH are complementary subspaces for Z Moreover ranX is Ninvariant since Nz39 13971 or 0 all j While also ker YH is Ninvariant since YHI 0 is equivalent to yHNiz 0 for all i and this implies that also YHNiNz 0 for all ii We conclude that ranX and ker YH are complementary subspaces Which are both Ninvariant hence they are reducing for N hence also for A Zi Since Z cannot be further reduced one of these subspaces must be trivial Since ranX is not it must be ker YHi Conclusion 4 dim ranX dim Z iiei X is a basis for Z Further Azj A5 Nzj Aszj zj1 all j With 102 qu 0 Therefore A5 1 AMX4 X 1 s and this shows the promised typical Jordan block in the Jordan canonical formi Note that kerN span 11 hence up to scalar multiples 11 is the unique eigenvector for A in Z ranX ran Vsi We conclude that for each A E AA there are exactly dim kerA 7 A Jordan blocks associated With A This is the geometric multiplicity of A In contrast the algebraic multiplicity of A is the sum of the orders of the Jordan blocks associated With A ie the number of times A appears as a diagonal element of J This implies that the algebraic multiplicity is at least as large as the geometric multiplicityi One calls A a defective eigenvalue if the two multiplicities do not agree Thus A is diagonable iff none of its eigenvalues is defective iiei iff all its Jordan blocks are l X 1 ie iff there are enough eigenvectors to staff a basis The order a aA of the largest Jordan block for A is called the ascent of A It is the smallest 7 for Which kerA 7 AT kerA 7 AT1i It is also the order to Which A is a root of the minimal polynomial for Al The space kerA 7 A UT kerA 7 AT is 
the eigenspace associated With A It is a reducing space the complementary Ainvariant space is ranA 7 A notes for marilO latest change Marti perturbations Since eigenvalue calculations in niteprecision arithmetic only produce approxima tions it is important to understand the continuity of the map AH MA Rightoff there is some question as to whether to take the spectrum MA as a subset or a sequence or a suite In the rst case we would only pay attention to the eigenvalues and not their multiplicities in the second case we would write them down including algebraic multiplicities eigl we would think of the spectrum as the sequence of diagonal entries in a Schur form for At This is a bit unsatisfactory since there is arbitrariness in the order in which the eigenvalues appear in such a sequence For this reason a suite or multiset or bag like a bag of marbles is more satisfactory for here we would merely record the set of eigenvalues together with their algebraic multiplicities but not enforce any particular order Theorem The map A gt gt MA from the matrix A to the suite MA is continuous The proof is all yours see Problem Fifi6 All standard perturbation arguments rely on the following observationl If M 6 MB EMB then B 7 M is invertible while B E 7 M B 7 MI B 7 M 1E 1 EB 7 M 1B 7 M is not hence 7 E B ilEll B 7 1 gt 7 gt 1 M ll7HEB7M 1H 7 in my norml Here are some examples Gershgorin MA C UiDi with D1 7 2 e c z 27 AM ZlAWM z the typical Gershgorin disk Moreover any isolated disk Will contain exactly one eigenvalue of A For the proof let D I diagAl lA22 i i and take A E MAMD Then with N I A 7 D A7 D7AN D7ID7 1N hence 7 A 1NH 2 1 By taking the maxnorm we get 1 mix WM 7 Al llNGJN my lAi7J39llAi7i 7 M j 1 The continuity of the function t gt gt MD tN provides the conclusion about isolated disksl D Note that by going to AH or by using the lnorm instead but with the factor D 7 A 1 on the other side we also know that MA C Di with D z e C 2 7 Aiil g E lAjill z BauerPike dist MA E MA S HXHEH With X any basis Which diagonalizes A and in any norm for Which diagd1d2 i i For the proof take it E MAEMA and set A I X lAXi Then AE7M XA7MX 1EXX 1 is not invertible there ore WA 7 M 1X 1EXH 21 But WA 7 MW 7 maxi Mi 7 W1 1diStM7MA D 26 If A is not diagonable one can at times get good results from the Schur formi SChur dist AA EAA S maxt9t91q With 9 I Ekltq and A N a Schur form for A and With q the degree of nilpotency of the strictly upper triangular part N of the Schur form For the proof take M E AA Then I 7 M 7 A 1E is de ned and not invertible Therefore 1S HOW AVIEH2 S HOW A 1H2HEH2 But 7 A 1H2 7 A 7 N IHQ while N is nilpotent of degree 4 hence 7 A 7 Z w 7 Arle 7 AH kltq and so with 6 I dist MAA lHM 7 AHQ get k 7 7 7 1S 2HNH25 HEH25S HEH2ZHNH max5 075 17m75 1M1 kltq kltq Consequently 6maxl 64 1 S 9 etc D Whether by BauerFike or by Schur we see potential sensitivity of eigenvalues in case A is strongly nonnormal for then either HltXgt or is large Put positively for a normal A dist MA E7 MAD 0HEH2 Such strong continuous dependence is available for individual eigenvalues to the extent that they are normal Speci cally assume that A E AA is a simple eigenvalue iiei there is just one Jordan block for A and this block is of order 1 Then this situation will persist in a suf ciently small neighborhood of A Speci cally for an arbitrary F with HFH 1 say there exists an 5 5AA F gt 0 so that for all t 6 755 there will be At E AA tF with corresponding normalized eigenvector zt so that the map t gt gt At is smooth Let yH be a 
normalized left eigenvector of A belonging to A Claim sz 0 Indeed if V is a maximal reducing basis for A and without loss of generality J I V lAV has A as its rst diagonal element then U1 is necessarily an eigenvector for A belonging to A hence as kerA 7 A is onedimensional by assumption equal to z 10 except for a scaling factor Assume without loss that in fact V 1 i i i and consider 10 I V 1Hi Then wHA wHVJV 1 efJV l Aer l AwH hence wH is a nontrivial left eigenvector for A belonging to A and sz 1 Since also kerA 7 AH is onedimensional it follows that any left eigenvector yH belonging to A is a scalar multiple of wH hence sz 0 With this we can differentiate the equation A mm Mme and evaluate at t 0 to get Am F1 A0z we apply yH to both sides to get yHFz since yHAz390 Asz390 hence WON lyHFIllyHIl S 1lyHIl This shows that a simple eigenvalue is sensitive to perturbations to the extent that its normalized left and right eigenvector fail to be parallel notes for mari15 latest change 19feb09 The following analysis is valid for any compact linear map on a normed linear space hence useful in the study of the eigenstructure of differential and integral operators and their approximation But I nd it helpful even in the simple case of a matrix discussed here resolvent The map R CUA 7gt Rnxn 2 gt gt A 7 2 1 is called the resolvent of A It is continuous on its domain since C 7gt Rnxn 2 gt gt A 7 2 is This implies that R is bounded on any compact subset of aAi R is also differentiable hence analytic since 132 RV RZA 2 i A WM Z 2RZRZ7 therefore dRzdz This makes complex variable theory available as a convenient and powerful tool for the analysis of the resolvent and ultimately the spectral properties of Al Remark If you are uncomfortable with operatorvalued functions consider instead the complexvalued function 2 gt7 ARzz for arbitrary A E X and arbitrary z E Xi The spectrum 0A comprises the singularities of R 15 Theorem Suppose that 0A 01002 and that it is possible to nd a simple closed curve 1quot in CUA Which encloses 01 and excludes 02 Then i 16 P771Rd 7 27d P 2 2 de nes a bounded lprojector Which commutes With A hence ii both M1 2 ranP and M2 2 ranl 7 P ker P are A invariant closed lss s and X M1 EB M2 and iii Aj I AM 6 bLMj and iv 0Aj aj Proof P is bounded eg by lT lZn max To see that P is a projector note that P is unchanged if we deform lquot as long as we don7t cross 0A in the process Therefore 71 71 71 71 Rz 7Rz P27 7 R Rdd7 7 7dd 27riF27ri F 2 Z Z Z 27riF27riF 272 Z Z 71 7 R d P 27riF Z Z 7 the crucial equality by Cauchyls formula if we as we may choose Tquot close to lquot but enclosing it hence F Jazz61 132 A 012 727riRz z7 272 Rz 7 d2 7 A272dziRzF27z 70 ii Since Rz A7z 1 commutes with A so does P hence Mj is invariant under A ie AMj C Mji iii Therefore Aj I AM 6 LMj and Rjz I Rz M7 Aj 7 will iv In particular 0Aj C 0A while if both A1 7 2 and A2 7 2 are boundedly invertible then so is A 7 Therefore while 0A1 U 0A2 aAi On the other hand for any 2 outside 1quot R2P ARZRzdz Awdzi Rz dz 2 72 27ri Fz 7z which shows that R1 RP M1 has no singularities outside 1quot Correspondingly R2 has no singularities inside 1quot Hence 0Aj 0quot Consequently 20 J1 7 J WTEVL From this identity we can read off all kinds of qualitative information particularly when A is a differential or integral operator and A E its discrete approximation In the present context we only make use of the immediate consequence 21 M 7 JH S HWTH HEH HVlll If we start off with an on basis V for 5 hence Hng 1 then the dual basis W from ran 
PT is uniquely determined The size of W is a measure of the inclination between S ran P and ran PTi In fact HPH2 supgc supx For a normal A P is an orthogonal projector iiei W is also on and that is the best situation But to the extent that A fails to be normal the effect of the perturbation E on A is magni ed The other factor in 21 is Vli To rst order its norm will differ from by something of order hence provide only secondorder terms for the difference J 7 Jli Generically the single eigenvalue A of J breaks up into a aA simple eigenvalues of J1 as the result of the perturbation How close these will be to A will depend on the precise eigenstructure of J Claim 22 lA All OllJ J1lla Wu E J1 Indeed since A 7 J 0 l 7 Ailu S HQ 7 J1D ll HQ 7 J1 7 A Dull S CODStHJHHJ1H llJ Jlll7 using the fact that the map E gt gt B is locally Lipschitz continuousi Note that BD 7C0 Ba j 1B7 CCj hence 071 HBD Call S llBlla jilllCllj llB Cll j0 On the other hand the average of all a eigenvalues of J1 counting multiplicities is within HJ 7 J1 of A since this average equals trace Jla A 7 trace J 7 1 a In a practical situation one would not be able to identify the reducing subspace S belonging to A But in observing the spectrum of A E as E becomes smaller one would be able to recognize that some A E AA is defective by seeing certain of the approximate eigenvalues relatively slowly clusteri If their average then turns out to approach a limit at a much faster rate the hypothesis of a defective eigenvalue would be con rmed notes for maril7 and earlier latest change May 13 powerboundedness The following important theorem is a simple consequence of knowing the Schur formi It relates the spectral radius pA I maxlAl A E AA of A to the possible powerboundedness of A ie to the condition that sup lt 00 k Since all norms in a nitedimensional linear space are equivalent a subset is bounded in one norm iff it is bounded in any other norm hence powerboundedness holds in all norms if it holds in one Particular norms for A can be found in the form HVAV 1H with V any basis this would correspond to taking the mapnorm for A corresponding to the vector norm 1 gt gt In particular a matrix A is powerbounded supid k Bkijl lt 00 for every some matrix representation B VAV 1 for Al But it may be easier to prove it in one norm than in others For example the matrix 9 1010 A 0 9 l is powerbounded in fact A is convergent iiei limk Ak 0 But if you look at the rst few powers of A in the maxnorm or the lnorm you d never guess that A is powerbounded let alone convergenti We start with the following 23Observation Any Jordan matrix J of order gt 1 belonging to an eigenvalue A of absolute value 1 fails to be power bounded For the proof observe that A a M b 7 AM Ab an 0 A 0 M 7 0 AM 7 Ak kAk l 0 k A 1 hence by induction 0 A J and for lAl 1 the l2entry goes to 00 with kl D Remark You have here a very special case of the following remarkable result concerning Jordanlike matrices For any reasonable function f and Jordanlike matrix the matrix has the divided difference Ah i i i Ajlf as its ijentry with fJij 0 for i gt ji 24Lemma For every square matrix A and every norm on domA MA S HAW and this inequality is sharp Equality is possible in some norm if all absolutely largest eigenvalues ofA are non defective This implies that for every 739 gt MA there is some mapnorm in which lt 739 hence for which lt 1 hence for which limkn00 0 Consequently 31 25Corollary For every square matriX A and every norm on domA 0739k for all 739 gt MA For the proof of the lemma 
observe rst that for all A E AA there is some nontrivial eigenvector 1 hence 2 lAl item AA lies inside the disk of radius The proof that by proper choice of the vector norm on domA we can make the corresponding map norm as close as we like to MA is a bit harder For this we begin with an on basis V for which V lAV A N with A diagonal and N strictly upper triangular Then we scale this on basis to the basis W I VD with D I diagd1 d2d3 i i for some positive d Then we nd that W lAW D 1A ND A D lND with D lNDij d iNijdj dj iNiji Since N is strictly upper triangular we have Nij 0 for i 2 j hence for any d lt1 lD lNDl lt lel hence HW lAWHoo S HAHoo HD 1NDHoo S MA dllNlloltgtv and this can be made arbitarily close to MA by choosing d small enough By keeping d positive we insure that W is a basis for domA hence I gt gt HW IIHDo is a norm on domA whose corresponding mapnorm for A is indeed the number HW lAWHi Finally we can achieve equality in case all absolutely largest that case the space S spanned by a maximally linear y 39 A the sum of the corresponding nullspaces kerA7 A is reducing and with T a corresponding complementary invariant subspace we know that B I A T has spectral radius lt MA HA SHi Therefore by what we already know we can nd some norm on T for which HA TH is close enough to MA T to guarantee that it is less than MA HA SHi But then in the resulting norm on domA SEBT HA SH MA The D converse is proved as part of the following Theoremi eigenvalues of A are nondefective For in set 0 col Lei 26Theorem The square matriX A is power bounded iflr either MA lt l or else MA l With every absolutely largest eigenvalue non defective By 24Lemma A is even convergent in case MA lt 1 while A cannot possibly be powerbounded in case MA gt 1 since then for some A an some unit vector x 2 lAlk A Hence for the proof it is suf cient to consider A with MA 1 If every absolutely largest eigenvalue of A is nondefective then by 24Lemma l for some mapnorm hence A is powerbounded If on the other hand some absolutely largest eigenvalue A of A is defective then there is a reducing subspace on which A looks like a Jordan matrix of order gt 1 with an eigenvalue of absolute value 1 hence by 23Observation A fails to be powerbounded already on this reducing subspacei D ln a practical item niteprecision arithmetic setting this theorem has to be augmented to take account of the fact that we usually cannot know the matrix A better than within a relative error of 5 I uHAH and the same holds for the computed powersi Thus it is not so much As spectrum AA or its spectral radius MA that predicts the behavior of the computed sequence Ak of powers as it is the approximate spectrum AEA and the approximate spectral radius pEA sup lAEAl with MA U AltAEgt A We AW 216 HEH a This de nition depends on the particular vector norm used but whatever that norm its dependence on 5 is the more interesting feature The earlier perturbation theory makes clear that for a normal A and in the 2norm pEA MA 05 hence some attention has to be paid in case MA is close to 1 Much more attention has to be paid for a general A since according to that earlier perturbation theory pEA MA 05 with a the maximum ascent of all maximal eigenvalues of A Thus for nonnormal and particular for defective matrices even the noise associated with niteprecision arithmetic never mind the errors possibly incurred when computing A could drastically raise the spectral radius of A as used in computations hence lead to drastically different conclusions concerning the boundedness of 
its powers This 32 is a very important issue in the study of iterative methods be it for the solution of linear systems per se or for the solutions of discretized ODEs or PDEsi In the power method one constructs the sequence e Ak k e 1 2 2k if 20 7 Hi in hopes that for large k the ratios zk1izk provide good estimates for the absolutely largest eigenvalue of Al This hope is justi ed in case the absolutely largest eigenvalue of A is dominant ie is the only absolutely largest eigenvalue even counting algebraic multiplicityi For with A the dominant eigenvalue for A X I kerA 7 A is reducing for Al Further with Y the corresponding complementary Ainvariant subspace B I A y has spectral radius lt pA Hence for any 5 gt 0 we can choose a norm on domA so that lt MB 5 hence can make as close to 7 I pBpA as we like Further with 20 I y for I E X and y E Y 2k Akzz Bkyz Ak11 This shows that zkAk I 00k Consequently 2k1 A 00k provided I f 0 Thus to the extent that the ratio 7 pBpA of the nextlargest eigenvalue to the dominant eigenvalue is less than 1 the power method provides an effective means for obtaining good estimates for the dominant eigenvalue and for a corresponding eigenvectmt The method of orthogonal iterations is the following generalization of the power method With Q0 an orthogonal matrix having p columns compute QkRk Zk AQkA k 12HH Here QkRk is the QR factorization of Zk and the hope is that ran Qk might converge to an Ainvariant subspacei This hope is justi ed in case A has a dominant p dimensional Ainvariant subspace ire a subspace S of dimension p so that max Al A E AAAA S lt minlAl A E AA Si multiplicities lApl gt lAp1l and S is the unique invariant subspace associated with the eigenvalues A1i i i p The book uses the notation This means that with lAll 2 lAgl gt the spectrum of A ordered by magnitude and counted including DpA for this invariant subspacei In this case we can provide Theorem 731 If DpA is well de ned then starting With almost any orthogonal Q0 6 Cnxp we have diStD17A7ran S ClAp1 plk7 With c a constant that depends on As departure from normality and on the separation sep T11 T22 between the complementary diagonal blocks in an block upper triangular form for A in Which T11 represents A DPM Proof A convenient way to measure the distance between two subspaces Sj is provided by the number dist 5391 5392 2 H131 7 PgHg with the orthogonal projector onto 5quot Let QjUj be on bases for C with Sj raan hence Pj Then H131 P2ll2 HQle Q2Q H2 lllQ17U1lHQ1Qi Q2Q lQ27U2lH2 H Q ltQ1Q e Q2Q gtQ2 Q Q1Q U2 H2 M 0 UfIQ2Q5Q2 0 UiJQ2 maxflleU2ll27 HUiHQ2H2 llQ U2ll2 HUiJQ2H2A 33 HU Q10 2 H2 Last equality is equivalent to HU21H2 HU12H2 in case U11 U12 U21U22 unitary with U11U22 square This means that dist DpAran if Q Q00 Q l is an on basis for which ran Qa DpAi For such a basis T I QHAQ is necessarily blockupper triangular iiei am Now consider the iteration Observe by induction that Aon Qk Rk R1 QkR hence TkQHQo QHQkR Since AT11 AT22 I we know that I 7X T11 T12 I XT11 0 0 I 0 T22 0 I 0 T22 with X I T11X 7 XT22 7T12i This gives the equation T 0 V07XW0 7 Vk7XWk 27 lo T2162 W0 Wk R with Wj I QHQj all j hence lkall2 dist DpA7ran Q07 the quantity we want to show goes to zero as k 7gt 00 Now for almost all orthogonal Q0 V0 7 XWO 7 XQgQ0 is invertible since 7 I 7XQH is onto regardless of X since I 7X is ontoi This implies from the rst part of 27 that R71 T1k1V0 XW0 1VI XWkL therefore from the second part of 27 Wk TQICQWOR 1 T 2W0T1k1V0 7 XW0 1V1c 7 XWk This gives the estimate dist DpA7ran Qk S HT2k2ll2 llT 
klb llV0 XWoilll2 llWoll2 lle XWkllz Here we estimate HVk 7 XWkH2 S l a constant that depends on the separation of the spectrum of T11 and T22 but in any case does not depend on k and neither does 7 XWO 1H2 HWOH2i But by k the result on powerboundedness and since pT22 lAp1l pT 1 lMpl for any positive 5 HT22H omApm 7 er 7 and HT kH 7 0mm 7 ark therefore nally dist DAAank 7 Willa The QR method can be understood as complete orthogonal iterationi Starting with a unitary Q0 ie with p n we can for each p interpret Qk I 12p as the result of having orthogonal iteration starting with Q0 12p Consequently if Mll gt Mgl gt H i gt Mnl then DpA is de ned for every p and the theorem tells us that Qk I 12p converges to a basis for it This implies that Tic I QkHAQk converges to an upper triangular matrix thus providing a Schur form for A in the limit The actual QR method organizes the calculation as follows We have TH Q ilAQH Q5401ka lt25kaka while Tic QkHAQk QkHAQkileHile RMQkHilel Thus Tk is obtained from T1671 by computing the QR factorization Tkil Q571QkRk and then multiplying its factors in the reverse order This gives the iteration For k 12Hi step 1 A QR with Q unitary and R righttriangulari step 2 A I RQi In this raw form the iteration is not very effective and also quite expensive lts convergence proof above relies on special assumptions and as the proof indicated the convergence rate can be quite slow in case eigenvalues clusteri Also each step takes a complete QR factorization iiei takes 0n3 opsi The latter complaint is dealt with by the assumption that A is already in upper Hessenberg form which can always be achieved by a unitary similarity transformationi Now the computation of A QR ta es only 0n2 ops and fortunately RQ is again upper Hessenberg The slow convergence is dealt with by implicit shifts and de ationi Thus streamlined the QR method generates the spectrum of a matrix in remarkably fast time MATLAB provides the command eigmovieA which illustrates the iteration for a symmetric Al notes for mari29ff latest change 19feb09 Hessenberg matrices An upper Hessenberg matrix H has zeros below the rst subdiagonal irei Hij 0 for i gt j 1 Such matrices are of importance for eigenvalue calculations because it is possible to bring any matrix of order n into Hessenberg form by n 7 2 Householder re ections and ii one step of the QR iteration applied to a Hessenberg matrix costs only 0n2 rather than 0n3 ops and produces again a Hessenberg matrixi Hessenberg QR step Assuming H is in upper Hessenberg form its QR factorization is most ef ciently computed by Givens rotationsi Walking down the diagonal one rotates rows k and k l to bring a zero into position k l k k l r r r n7 1 To compute from this H I RQ QHHQ it is only necessary to rotate columns k and k l in the same way and in the same order ie for k l r r r n 7 ll ince the workarray H was upper triangular after that set of row rotations columns k and kl have no entries below the diagonal when it comes time to rotate them hence the same will be true after their rotation except that there now might be a nonzero k l 16 entry Thus H is again upper Hessenberg In the procedure we obtain Q as the product Q H diag1fk SijJ 8 C kltn k k of Givens rotations with Ck the rotation applied to rows k and kl at the kth step thus producing 7 k R I For this reason it was just the right thing to follow this up by rotating columns k and k l with the very same rotation 61c and in the order k 12 l r i since that constructs RQ QHHQ the new upper Hessenberg matrix similar to 
the old Application of this fast QR iteration requires an initial similarity transformation to bring the original matrix into upper Hessenberg formi This can be done in various ways One could use Givens rotations as above to zero out all entries below the Hessenberg matrix following each zeroing out by the corresponding rotation of columns to keep the similarity Assuming that one starts at the upper left and works column by column the zero entries gained during row rotation will not be destroyed by the corresponding column rotation as the latter will always involve columns to the right of the one currently being worked on or the same reason use of Householder re ection worksi Here one would re ect the rst column to a vector x x 0 r r r g 0 with just the rst two entries nonzeroi The corresponding Householder re ection will keep 61 xed hence its subsequent application from the right to keep the similarity will not change the rst column iiei will retain the zeros just gaine r This settles the ef ciency of a QRmethod stepi Next we take on the other objection to the QRmethod its slow or nonexistent convergencei For this we need some facts about Hessenberg matrices unreduced Hessenberg A Hessenberg matrix H is called unreduced by the book if its rst sub diagonal contains no zero ie if li 0 A general Hessenberg matrix breaks apart into its unreduced blocks which may be treated separately if we are after the spectrum of the matrix Thus the QR iteration and its variants may be viewed as converging toward a Hessenberg matrix with small enough unreduced blocks to make the calculation of the spectrum a trivialityi An unreduced Hessenberg matrix H of order n has its rst n 7 1 columns linearly independent This implies that its kernel has dimension no bigger than 1 In fact if it has a nontrivial kernel then necessarily its last column is linearly dependent on the preceding columnsi Since the jth column of the righttriangular matrix R in a QR factorization for H contains the coordinates of the jth column Haj with respect to the basis Qlj for a space containing ran H lzj and equal to it in case H lzj is of full rank the following is true 28Lemma IfH QR is a QR factorization of the unreduced and non invertible Hessenberg matrix H then Rnn 0 29Corollary IfH is unreduced Hessenberg then for any o E AH the QR factorization QR H7o1 must have Rnn 0 therefore H I RQ of must be reduced With Hnn 7 l 0 This observation is at the heart of the QR iteration with shifts Note that the matrix H RQ of QHH 7 MDQ of obtained from H in such a shifted QR step is again unitarily similar to H only the way in which the unitary matrix Q was obtained has become a little bit more complicated It will be important later to know that the Q can be constructed implicitlyl iiei without actually carrying out the shifts This is the content of the next result which concerns the Uniqueness of the Q in a unitary similarity to an unreduced Hessenberg matrix 30Implicit Q Theorem If QV are both real orthogonal and H I QTAQG I VTAV are both upperHessenberg then ql v1 and H2 l Hk 1671 f 0 implies qj ivj and lHjj7ll lGjj7ll forj S k 1 Proof Have WH CW with W I VTQ and by assumption wl ell Therefore Whi Gwi Hi 1 owl1 Gwi 7 EH0 0101 131 Assuming that we already know that wj iEj for j S i as we do when i 1 this implies that liwi1 E raneluiei1 since then Gwi Gei is in that space hence if also 12 0 then raneli 6139 L wi1 E raneli i 6141 which implies that also wi1 ielurli In particular Hi17ii i1 CiEi 7 2H0 ii j7 Jltz which implies that l iGi l But this 
last conclusion also holds if l 0 since then 0 i Z Gji j 7 ZHjiieJ E iGi l Del1 raneli ei KM 19 therefore also Ci l 0 D Remark In a complex context we would merely have to replace here iEj by eXpi6 j iiei by the jth unit vector multiplied by an arbitrary complex sign Schur is to Jordan as Hessenberg is to Krylov Breakup into unreduced blocks A general Hessenberg matrix breaks apart into its unreduced blocks and these may be treated separately if we are after the spectrum of the matrix Thus the QR iteration and its variants may be viewed as converging toward a Hessenberg matrix with small enough unreduced blocks to make the calculation of the spectrum a trivialityi The calculation is indeed trivial in case such an unreduced block is of order 1 or 2 In the latter case with Z b the unreduced block one simply solves the corresponding characteristic equation d a7d7 eb for its two roots This possibility is particularly important since one would like if possible to avoid complex arithmetic in case the original matrix A is real In that case A may well have complex eigenvalues but these would come in complex conjugate pairs hence could be found as the two eigenvalues of an unreduced block of order 2 This fact can be used in the proof of the existence of the Schur form to show by induction that for an arbitrary Teal matrix A there exists an orthogonal matrix V so that VTAV is block upper triangular with diagonal blocks of order 1 or 2 and with each ordertwo diagonal block having two complex conjugate eigenvalues To make the proof work one would observe that for a nonreal eigenvalue A 7 to 6 MA there must be real vectors y and 2 f 0 so that AW 7 to y 7y 7 2 272 My ire 37 since A is real Ay2 y 2 7 g i This shows that rany2 is a twodimensional invariant subspace for A hence an on basis 111112 for it extended to an on basis V for R provides a blockupper triangular matrix representation for A with an upper diagonal block of order 2 Its lower diagonal block is taken care of by induction hypothesis This shows that a real QRiteration should generically converge to a block upper triangular matrix whose diagonal blocks are of order 1 or i shifts The second objection to the raw QR method concerns its slow rate of convergence This is remedied by careful use of shifts Recall that the convergence of the upper left principal submatrix Hljlj to an upper triangular matrix depended on having a dominant eigenspace of order j ie on having MA gt lAj1l when we write out the spectrum of A in order counting multiplicitiesi The greater the gap the faster the convergence Also once HO lj is small enough to be zero within the precision used we declare it to be zero and are then free to concentrate on the smaller matrices Hljlj and lnj 1m In principle we can force a big gap by shifting the spectrum ie by considering H 7 MI rather than H since now the important sequence is the newly ordered sequence M1 7 M 2 M2 7 M 2 which may show a large gap Mn1 7 M gt Mn 7 M in case M is close to MA In fact recall from 29Corollary that the shifted QRstep H7MIQR HRQMI with any M 6 MH produces from an unreduced H a similar Hessenberg matrix H with Hnn 7 l 0 and Hnn M This gives the hope often not disappointed that with M a point close to MH the very same shifted QR step will produce a more nearly reduced Hi In particular if 31gt Hltnn71gt ltltminlHn717n71l7lHn7nl7 then the shifted QR step with M I H nn is recommended as it can be shown to result in a quadratic convergence to a reduced Hessenberg matrix with the n n 7 l7entry equal to zero hence 
its n n7entry an eigenvalue This observation is based on the following calculationi If H 1nn71m X x 6 M is the lower right principal submatrix of H of order 2 then before the last row rotation it will look like l 3l causing the last rotation to be with 32 c2 1 and 78a cg 0 and giving the new lower right principal submatrix ca 88 dz 5 7 where if 3395 01 hence s 5ca 88 therefore since f3 m5 ca 850 32 52H ca 85 0 a 5 H2 The subsequent column rotations will affect the n n 7 l7entry only at the last rotation and this gives Hltltn71gtzn7ltn71gtzngt7mfg jib Sl7lg 2b lHn71rn7n71rn7 S C 38 therefore Hnn 71 521252 a2 Consequently for 5 ltlt lal we get Hn n 71 OlHn n 7 1 double shift The nal detail to be taken care of is the fact that in case of complex eigenvalues we cannot expect condition 31 always to hold ie we cannot expect Hn n always to be a good approximation to an eigenvalue of H for the simple reason that it is rea 1 Francis managed to come up with a variant in which one carries out two shifts in succession by M1 and 2 with aha the eigenvalues of G Hn 71nn71n Z For this can be done in real arithmetic and in 0n2 ops as follows Set H i 11 1 U1R17 H11 RlUl M117 H1 M21 1 U2R27 H21 RQUQ 211 Then Hj RjUj Mj1 UfHj1 7 MDUj Mj1 UJHHJV71UJ39 with H0 2 H hence H2 U1U2HHU1U2i But H M11H M21 U1R1U1R1 2 i 01 U1H1 10131 U1R1M2 1 U1H1 7 Man UlUgRgRli This implies that the needed matrix Z 2 U1 U2 is obtainable as the orthogonal matrix in the QR factorization ZR for the real matrix M H 7 M11H 7 M21 H2 7 M1 M2H 111 H2 7 traceGH detGi Here we are using the fact that 1 2 are the eigenvalues for that ordertwo lower right princical submatrix G of the real matrix H hence 1 2 a d trace G and out ad 7 be detG and both these numbers are real Of course we make the assumption here that the QR factorization for M is essentially unique and that is false But we do know by 30Theorem the following Observation If both ZTHZ and ZfHZl are Hessenberg and one is unreduced With Z Z1 orthogonal and Z 1 Z11 then Z Z1D for some diagonal matrix D I diag1i1i1 i i 1 hence ZTHZ and ZfHZl are essentially the same This means that even when 1 2 are complex we can obtain H2 entirely by real arithmetic and in 0n2 ops by the following steps Step 1 Compute the rst column of M H2 7 trace GH detGi Step 2 Determine the Householder re ection P0 which carries Mel to 161 Step 3 Compute in order the Householder re ections Pj so that PjPj1H i POH is upper Hessenberg in its rst j columns j 11 l i n 7 2 while each such Pj leaves 61 xed Step 4 Finish up by forming H H2 ZTHZ with Z I P0131 i i r 137721 It follows that Z must be essentially UlUg as desired provided the new H is still unreducedi But if it isn7t so much the better since we now have an H similar to the original matrix A which breaks into smaller blocksi It is important to note that Mel x x x 0 0 i i hence the similarity transformation H gt gt POHPO only affects the rst three rows and also the rst three columns This implies by induction that the subsequent transformations X gt7 PJ39ij each only affect rows j 17 27 j 3 and also columns j 1j 2j 3 ire each such transformation takes 0n opsi The entire transformation H gt gt H H2 ZTHZ therefore only takes 0n2 ops as promised 39 drift We have not discussed the possibility of drift since the book doesnlt mention it This possibility exists since the calculations do not explicitly refer back to the original A nor even to the rst Hessenberg matrix constructed The only way to combat this drift is to accumulate the various orthogonal matrices into one 
drift. We have not discussed the possibility of drift, since the book doesn't mention it. This possibility exists since the calculations do not explicitly refer back to the original A, nor even to the first Hessenberg matrix constructed. The only way to combat this drift is to accumulate the various orthogonal matrices into one matrix Q, and to restart the entire process, including the initial reduction to Hessenberg form, from the matrix QᵀAQ. The book does mention the importance of an initial scaling A ↦ D⁻¹AD, with the diagonal matrix D so chosen that all rows and columns of D⁻¹AD have approximately the same size.

notes for apr. (latest change apr.): eigenvector calculations use inverse iteration, i.e. the power method applied to the matrix (A − μI)⁻¹, with μ a very good approximation to a particular eigenvalue λ of A, hence 1/|λ − μ| likely to be the dominant eigenvalue of (A − μI)⁻¹; consequently the iteration converges fast, i.e. in one or two steps, to an eigenvector of (A − μI)⁻¹ belonging to 1/(λ − μ), i.e. an eigenvector of A belonging to λ. This is strange at first sight since, after all, A − μI is nearly singular. Hence it should be difficult to compute (A − μI)⁻¹x, and it is. But the point of the calculation is not so much to solve the equation

(32) (A − μI)z = x

as it is to produce an eigenvector of A belonging to that nearby eigenvalue λ, and that is exactly what the numerical solution of (32) does well: the more nearly singular, the better. If this is done at the end of QR iteration, then we can expect the approximation μ to be within u‖A‖ of an eigenvalue λ of A. Also, we have in hand, from the beginning step, the upper Hessenberg matrix H₀ := U₀ᴴAU₀, unitarily similar to A. Thus the solution of the linear system (H₀ − μI)z₁ = z₀ can be carried out, e.g. by Gauss elimination with partial pivoting, in O(n²) flops. Let H₀ − μI = UΣVᴴ be an SVD. Then σ_n ≈ u‖A‖ is about as small as we can make it. Therefore ‖(H₀ − μI)v_n‖ = σ_n ≈ u‖H₀‖, i.e. v_n is about as good an approximate eigenvector corresponding to λ ≈ μ as we can hope to get. Further,

z₁ = Σ_j v_j (u_jᴴz₀)/σ_j ≈ v_n (u_nᴴz₀)/σ_n

if λ is a simple eigenvalue, since then σ_{n−1} ≫ σ_n. We cannot hope to improve on this by subsequent iteration. For a multiple eigenvalue λ we cannot expect this process to work; this is due to the fact that it is then not possible to single out a particular eigenvector. This is a nice example of the paradox that the more there are, the harder they are to find.
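The inverse iteration just described is a few lines of code. A minimal sketch, assuming μ is already a good eigenvalue approximation (the function name and the slight perturbation of the shift are my choices; lu_factor/lu_solve stand in for the Gauss elimination with partial pivoting mentioned above):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def inverse_iteration(A, mu, steps=3, seed=0):
    """Approximate an eigenvector of A for the eigenvalue nearest mu.

    The near-singularity of A - mu*I is a feature, not a bug: each
    solve amplifies the component along the desired eigenvector.
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    lu, piv = lu_factor(A - mu * np.eye(n))   # factor once, reuse
    for _ in range(steps):                    # one or two steps usually suffice
        z = lu_solve((lu, piv), z)
        z /= np.linalg.norm(z)
    return z

A = np.array([[2., 1., 0.], [1., 3., 1.], [0., 1., 4.]])
lam = np.linalg.eigvalsh(A)[-1]
v = inverse_iteration(A, lam + 1e-8)          # slightly perturbed shift
print(np.linalg.norm(A @ v - (v @ A @ v) * v))  # tiny residual
```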
Orthogonal iteration was invented for this, since in such a case there is, in the limit, an o.n. basis for the invariant subspace associated with the dominant eigenvalue. More generally, one takes the block upper triangular Schur form T = QᴴAQ for A obtained in the limit, makes certain that all repeated eigenvalues occur consecutively, and then uses further similarity transformations of the kind already discussed to transform T to block-diagonal form, with different diagonal blocks having disjoint spectra and each diagonal block having just one eigenvalue, or else a complex conjugate pair of eigenvalues. This requires the ability to prescribe the order in which A's eigenvalues appear along the diagonal. Since any permutation can be achieved as the product of transpositions of neighboring terms, it is sufficient to know how to reorder two neighboring diagonal entries in an upper triangular matrix using similarity. For this, it is sufficient to know how to transform the matrix G = [a x; 0 d], with a ≠ d, by similarity to a matrix Ḡ = [d y; 0 a]. This means that we are looking for a basis V for which the matrix representation Ḡ = V⁻¹GV for G has the first column (d, 0)ᵀ, i.e. for which v₁ is an eigenvector of G belonging to the eigenvalue d. If we also insist that V be orthogonal, this determines V up to signs. In particular, then necessarily Ḡ(2,2) = a, since Ḡ is upper triangular, hence Ḡ(2,2) ∈ λ(Ḡ) = λ(G) = {a, d}, and d already appears as Ḡ(1,1). To get an eigenvector v of G belonging to d, use inverse iteration, i.e. solve

[a−d x; 0 0] v = 0,

say, to get (a − d)v₁ + x v₂ = 0, e.g. v = (x/(d − a), 1). Thus the Givens rotation Q which carries e₁ to a multiple of v, i.e. for which Qᵀv is a multiple of e₁, would give QᵀGQ = Ḡ.

Once the Schur form T for A has been reordered to have all repeated eigenvalues appear consecutively (there has to be a version that will bring 2-by-2 diagonal blocks with the same complex conjugate pair next to each other), one would partition T accordingly (see above) and then make use of the following material from March, recalled here through the miracle of modern information processing. Assume that T is block upper triangular,

(33) T = [T₁₁ T₁₂; 0 T₂₂],

with T₁₁, T₂₂ square but not necessarily of the same order. We look for a similarity transformation Y⁻¹TY which leaves the two diagonal blocks as well as the 0 block unchanged, but turns the block T₁₂ into zero. In the general case this forces Y to be of the block form Y = [I Z; 0 I], and so gives

Y⁻¹TY = [T₁₁ T₁₂ + (T₁₁Z − ZT₂₂); 0 T₂₂].

Hence this similarity is successful provided we can choose Z so that

φ(Z) := T₁₁Z − ZT₂₂ = −T₁₂.

Since the map φ so defined is linear, and its domain and target have equal dimension, we will succeed in general iff φ is invertible, i.e. iff φ is 1-1, i.e. iff λ(T₁₁) ∩ λ(T₂₂) = ∅, as we proved earlier. Implementation requires the solution of the Sylvester equation

(34) FZ − ZG = C

for Z, where in our case both F := T₁₁ and G := T₂₂ are block upper triangular with diagonal blocks of order 1 or 2. The book calls such matrices quasi-triangular. This quasi-triangularity makes for an easy inductive solution, the Bartels-Stewart algorithm. The k-th column of equation (34) reads

F z_k − Σ_{i ≤ k+1} z_i G(i, k) = c_k

(the sum stops at i = k+1 by quasi-triangularity). If G(k+1, k) = 0, then in fact

(F − G(k, k)I) z_k = c_k + Σ_{i < k} z_i G(i, k),

which can be solved quickly for z_k, since F is quasi-triangular, if we already know z_i for i < k. If G(k+1, k) ≠ 0, then we combine this with the next equation and get, since then by quasi-triangularity G(k+2, k+1) = 0, the system

[F − G(k,k)I, −G(k+1,k)I; −G(k,k+1)I, F − G(k+1,k+1)I] [z_k; z_{k+1}] = [c_k + Σ_{i<k} z_i G(i,k); c_{k+1} + Σ_{i<k} z_i G(i,k+1)].

By interlacing the equations from the first block with those of the second, a banded system is obtained which can be solved in O(n²) flops. Since this process breaks down when F and G share eigenvalues, it is not surprising that it experiences numerical difficulties in case λ(F) and λ(G) are close. Explicitly, one defines the separation between the spectra of two square matrices B and C, not necessarily of the same order, as the largest lower bound for the linear map φ: X ↦ BX − XC with respect to the Frobenius norm, i.e.

sep(B, C) := min_{X ≠ 0} ‖BX − XC‖_F / ‖X‖_F.

This number quantifies just how invertible φ is. Thus sep(F, G) measures how well we can solve the Sylvester equation (34) in the presence of noise.
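As an illustration of the decoupling step, the following sketch (variable names mine) solves T₁₁Z − ZT₂₂ = −T₁₂ with scipy's general-purpose Sylvester solver, which implements a Bartels-Stewart-type method, and checks that Y⁻¹TY is block diagonal. Note the sign convention: solve_sylvester(A, B, Q) solves AX + XB = Q, so we pass −T₂₂ and −T₁₂.

```python
import numpy as np
from scipy.linalg import solve_sylvester

# a block upper triangular T whose diagonal blocks have disjoint spectra
T11 = np.array([[4., 1.], [0., 3.]])
T22 = np.array([[1., 2.], [0., -1.]])
T12 = np.array([[5., -2.], [1., 7.]])
T = np.block([[T11, T12], [np.zeros((2, 2)), T22]])

# solve T11 @ Z - Z @ T22 = -T12, i.e. phi(Z) = -T12
Z = solve_sylvester(T11, -T22, -T12)

Y = np.block([[np.eye(2), Z], [np.zeros((2, 2)), np.eye(2)]])
Yinv = np.block([[np.eye(2), -Z], [np.zeros((2, 2)), np.eye(2)]])
print(np.round(Yinv @ T @ Y, 10))   # off-diagonal block is (numerically) zero
```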
notes for april 2 (latest change april 28). The book considers only real symmetric matrices, but the theory for their eigenstructure is identical with that of hermitian matrices, hence this is what will be discussed. Throughout, let A be a hermitian matrix, i.e. A = Aᴴ. By Schur, A = UᴴTU for some unitary U and some upper triangular T, for any matrix A. But A being hermitian implies that T is hermitian, which implies that T is diagonal, with real diagonal entries. Conclusion: A is unitarily similar to a real diagonal matrix Λ. In particular, if A is real, then U can be chosen orthogonal. For various reasons of convenience, we order the real eigenvalues of A from left to right as we write, rather than from right to left as the book does. We write λ₁(A) ≤ λ₂(A) ≤ ⋯ ≤ λ_n(A), and denote the corresponding sequence of o.n. eigenvectors by u₁, ..., u_n. The perturbation theory for the spectrum of a hermitian matrix A is remarkably simple and elegant. It centers on the Rayleigh quotient

R_A(x) := xᴴAx/xᴴx,

whose maximum/minimum equals the largest/smallest eigenvalue of A, as follows easily from the observation that, with A = UᴴΛU,

R_A(Uᴴy) = yᴴΛy/yᴴy = Σ_j λ_j|y(j)|² / Σ_j |y(j)|²,

hence min R_A = λ₁(A) and max R_A = λ_n(A). This is the Rayleigh principle. It is a very special case of the following MMM Theorem (or Principle, if you go in for such things), which is the convenient amalgam of the 'maximin' theorem with the Courant-Fischer 'minimax' theorem.

MaxiMiniMax Theorem. For a hermitian matrix A with eigenvalues λ₁(A) ≤ λ₂(A) ≤ ⋯ ≤ λ_n(A) and corresponding o.n. eigenvectors u₁, ..., u_n, and any j,

max_{dim G < j} min_{x ⊥ G} R_A(x) = λ_j(A) = min_{dim H = j} max_{x ∈ H} R_A(x),

with G and H arbitrary linear subspaces.

Proof. If dim G < j ≤ dim H, then one can find y ∈ H ∩ G⊥, y ≠ 0, therefore

min_{x ⊥ G} R_A(x) ≤ R_A(y) ≤ max_{x ∈ H} R_A(x).

Hence max_{dim G < j} min_{x ⊥ G} R_A(x) ≤ min_{dim H = j} max_{x ∈ H} R_A(x). On the other hand, for G = ran[u₁, ..., u_{j−1}] and H = ran[u₁, ..., u_j],

min_{x ⊥ G} R_A(x) = λ_j(A) = max_{x ∈ H} R_A(x). □

The MMM theorem has various useful and immediate corollaries.

Interlacing Theorem. If the matrix B is obtained from the hermitian matrix A by crossing out the last row and column, i.e. B = A(1:n−1, 1:n−1), then λ_j(A) ≤ λ_j(B) ≤ λ_{j+1}(A), j < n.

Proof. With J: ℝⁿ⁻¹ → ℝⁿ: x ↦ (x, 0), we have R_B(x) = R_A(Jx) and ran J = (e_n)⊥. Hence

λ_j(B) = min_{dim H = j, H ⊆ ran J} max_{y ∈ H} R_A(y) ≥ min_{dim H = j} max_{y ∈ H} R_A(y) = λ_j(A).

Also, for any G ⊆ ran J with dim G < j, the subspace G' := G + span(e_n) has dim G' < j+1, and {x ∈ ran J : x ⊥ G} = {x : x ⊥ G'}, so

λ_j(B) = max_{dim G < j, G ⊆ ran J} min_{x ∈ ran J, x ⊥ G} R_A(x) ≤ max_{dim G' < j+1} min_{x ⊥ G'} R_A(x) = λ_{j+1}(A). □

A different, simpler application of the MMM theorem is based on the following observation: if f(t) ≤ g(t) for all t, then this inequality persists if we take, on both sides, the maximum or minimum over the same set T, i.e. then

min_{t∈T} f(t) ≤ min_{t∈T} g(t), max_{t∈T} f(t) ≤ max_{t∈T} g(t).

It even persists if we further take the minimum or maximum over the same family 𝒯 of subsets T, e.g. then

max_{T∈𝒯} min_{t∈T} f(t) ≤ max_{T∈𝒯} min_{t∈T} g(t).

Consequently:

Corollary. If R_A(x) ≤ R_B(x) + c for some constant c and all x, then λ_j(A) ≤ λ_j(B) + c for all j.

This gives Weyl's inequalities: if A = B + C with A, B, C hermitian, then

λ_j(B) + λ₁(C) ≤ λ_j(A) ≤ λ_j(B) + λ_n(C), for all j.

Proof. Since λ₁(C) ≤ R_C(x) ≤ λ_n(C) by Rayleigh's principle, while R_A(x) = R_B(x) + R_C(x), the preceding corollary provides the proof. □

These inequalities are particularly useful when C is 'small' in some sense. E.g. the book considers, in a corollary, the special case C = τccᴴ with c of unit length. Then C is a rank-one matrix, with 0 as (n−1)-fold eigenvalue, while also Cc = τc, hence also τ is an eigenvalue. Consequently λ₁(C) = min(0, τ) and λ_n(C) = max(0, τ), and Weyl's inequalities do the rest. The book also brings, without proof, the

Wielandt-Hoffman Theorem. If A and A + E are both hermitian, then

Σ_j (λ_j(A + E) − λ_j(A))² ≤ Σ_{i,j} |E(i,j)|².

But I find the following max-norm version more directly useful.

max-norm Wielandt-Hoffman. If A and A + E are both hermitian, then

max_j |λ_j(A + E) − λ_j(A)| ≤ max(|λ₁(E)|, |λ_n(E)|) ≤ ‖E‖₂.

Proof. By Weyl's inequalities, λ_j(A + E) − λ_j(A) ≤ λ_n(E) and λ_j(A) − λ_j(A + E) ≤ −λ₁(E). □

Finally, a totally different application of the MMM Theorem is

Sylvester's Law of Inertia. Any two congruent hermitian matrices have the same number of positive, zero, and negative eigenvalues.

Proof. It is sufficient to prove that, if B = XᴴAX for some hermitian A and some invertible X, then λ_j(A) > 0 implies λ_j(B) > 0. For this we observe that, by the MMM Theorem, λ_j(A) > 0 implies that R_A is positive somewhere on any j-dimensional subspace, while also, by the MMM Theorem, for some j-dimensional subspace H,

λ_j(B) = max_{x∈H} R_B(x) = max_{x∈H} R_A(Xx)·(‖Xx‖/‖x‖)²,

and this is necessarily positive, since dim XH = j and R_A is positive somewhere on XH. □
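These inequalities are easy to sanity-check numerically. The sketch below (my own illustration) verifies interlacing and Weyl's inequalities on a random symmetric matrix, using numpy's eigvalsh, which returns eigenvalues in the increasing order used above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2     # random symmetric A

# Interlacing: eigenvalues of the leading principal submatrix B
B = A[:n-1, :n-1]
lamA, lamB = np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)
assert np.all(lamA[:-1] <= lamB + 1e-12) and np.all(lamB <= lamA[1:] + 1e-12)

# Weyl: split A = Bp + C with all three symmetric
Bp = rng.standard_normal((n, n)); Bp = (Bp + Bp.T) / 2
C = A - Bp
lamBp, lamC = np.linalg.eigvalsh(Bp), np.linalg.eigvalsh(C)
assert np.all(lamBp + lamC[0] <= lamA + 1e-12)
assert np.all(lamA <= lamBp + lamC[-1] + 1e-12)
print("interlacing and Weyl inequalities hold")
```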
notes for apr. 24 (latest change apr. 27). Sturm sequences provide a very elegant way to separate the eigenvalues of a Jacobi matrix, i.e. a real symmetric tridiagonal matrix T_n with diagonal a₁, ..., a_n and off-diagonal b₂, ..., b_n. The key for this is the three-term recurrence relation

p_r(x) = (a_r − x) p_{r−1}(x) − b_r² p_{r−2}(x), r = 1, 2, ..., n,

satisfied by the characteristic polynomials p_r(x) := det(T_r − xI), which, in light of the recurrence relation, is suitably augmented by the agreements

p_{−1} := 0, p₀ := 1.

Remark. Assuming that T_n is unreduced, the sequence (p_r) is orthogonal with respect to some discrete measure, a fact that summarizes the various statements about the zeros of the p_r about to be detailed. More on this later.

We now need a way to count the sign changes in a sequence or vector:

S(a) := #{j : sign(a_j) ≠ sign(a_{j+1})},

where ran(sign) = {−1, 1}. This definition is a bit hazy on just what sign(0) might be, and offhand that's really up to you. To make this precise, one defines

S⁺(a) := max S(a), S⁻(a) := min S(a),

with the extremum taken over all possible assignments of 1 or −1 to sign(a_j) in case a_j = 0. We apply this notion to the particular sequence P(x) := (p₀(x), ..., p_n(x)), i.e. we define s(x) := S⁻(P(x)). This function is piecewise constant, with jumps only at points x at which S⁻(P(x)) ≠ S⁺(P(x)). This can happen only at the finitely many points x at which p_r(x) = 0 for some r. If r < n, then, from the recurrence, p_{r+1}(x) = −b_{r+1}² p_{r−1}(x) at such a point. If now also p_{r−1}(x) = 0, then, assuming that T_n is unreduced, it would follow that also 0 = p_{r−2}(x) = p_{r−3}(x) = ⋯ = p₀(x) = 1, an impossibility. Hence, assuming T_n to be unreduced, it follows that s(x⁻) = s(x⁺) at such a point, since necessarily

(35) p_{r+1}(x) p_{r−1}(x) < 0 whenever p_r(x) = 0, r < n.

Thus, since p₀ never vanishes, s can have a jump at x only when p_n(x) = 0, in which case it is a jump of ±1 only, since then necessarily p_{n−1}(x) ≠ 0. On the other hand, since p_r(x) = (−x)^r + l.o.t., it follows that

s(−∞) = 0, s(∞) = n.

This implies that s must jump at least n times, and, since it can only jump at the zeros of p_n and p_n can have at most n zeros, it follows that p_n has exactly n real zeros, hence these must all be simple; in other words,

s(x⁺) − s(x⁻) = 1 whenever p_n(x) = 0, and s(x) = #{ξ < x : p_n(ξ) = 0}.

Therefore also p_n'(x) p_{n−1}(x) < 0 whenever p_n(x) = 0; this implies the same for the zeros of p_{n−1}. Also, since the earlier perturbation analysis gave us that the eigenvalues of T_{n−1} separate those of T_n, we now conclude from (35) that they strictly separate, i.e. that

λ₁(T_n) < λ₁(T_{n−1}) < λ₂(T_n) < λ₂(T_{n−1}) < ⋯ < λ_{n−1}(T_{n−1}) < λ_n(T_n).

This proves

Sturm's Theorem. If T_n is unreduced, then the zeros of p_n := det(T_n − ·I) are all real and simple and are strictly separated by the zeros of p_{n−1}. Moreover,

S⁻(p₀(x), ..., p_n(x)) = #{ξ < x : p_n(ξ) = 0}.

Since the recurrence provides the entire vector P(x) = (p₀(x), ..., p_n(x)), it is easy to compute s(x) = S⁻(P(x)) as one evaluates p_n(x), hence easy to localize the zeros of p_n and obtain them quickly by bisection, followed by modified regula falsi and the like. The corresponding eigenvectors of T_n are readily found by inverse iteration, particularly fast because T_n is tridiagonal. Note that, in such inverse iteration, we would be computing, in effect, the factorization LDLᵀ = T_n − μI, with L unit lower bidiagonal and D diagonal, and μ our approximation to a particular eigenvalue of T_n. Since the resulting D = diag(d₁, ..., d_n) is congruent to T_n − μI, it follows from Sylvester's Inertia Law that

#{j : d_j < 0} = #{ξ < μ : p_n(ξ) = 0},

suggesting another method for localizing and approximately determining the zeros of p_n.

Rayleigh quotient iteration is based on the following facts. With μ := R_A(x) = xᵀAx/xᵀx the Rayleigh quotient for A at x, and z computed à la inverse iteration as the solution of the linear system (A − μI)z = x, we compute

(A − xzᵀ/(zᵀz)) z = Az − x = μz,

showing that μ is an eigenvalue of the perturbed matrix A − xzᵀ/(zᵀz), with eigenvector z. Hence, by Bauer-Fike,

dist(μ, λ(A)) ≤ ‖xzᵀ‖₂/(zᵀz) = 1/‖z‖₂

if, as we may, we take x to be normalized. For the resulting iteration, x_{k+1} := z_{k+1}/‖z_{k+1}‖₂ with (A − μ_kI)z_{k+1} = x_k and μ_k := R_A(x_k), the pair (x_k, μ_k) converges cubically to an eigenpair, as the book's 2-by-2 example demonstrates. To shoot for a particular eigenvalue of A, one would start the iteration with a μ close to that eigenvalue, and use inverse iteration to obtain from it a first x.
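The eigenvalue count s(x) translates directly into code via the LDLᵀ pivots, exactly as in the inertia remark above. A minimal sketch (names, guard, and tolerance are mine; a careful implementation would also protect the recurrence against over/underflow):

```python
import numpy as np

def eig_count_below(a, b, x, eps=1e-300):
    """Number of eigenvalues of the Jacobi matrix T_n below x.

    a = (a_1..a_n) diagonal, b = (b_2..b_n) off-diagonal, all b nonzero.
    Uses the LDL^T pivots d_r of T_n - x*I; by Sylvester's inertia law,
    the number of negative pivots equals the number of eigenvalues < x.
    """
    count, d = 0, 1.0
    for r in range(len(a)):
        d = (a[r] - x) - (b[r-1]**2 / d if r > 0 else 0.0)
        if d == 0.0:
            d = -eps          # crude guard against an exactly zero pivot
        if d < 0:
            count += 1
    return count

def kth_eig(a, b, k, lo, hi, tol=1e-12):
    """k-th smallest eigenvalue (k = 1..n) by bisection on the count."""
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if eig_count_below(a, b, mid) >= k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

a = np.array([2., 2., 2., 2.]); b = np.array([1., 1., 1.])
T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
print([round(kth_eig(a, b, k, -10, 10), 6) for k in (1, 2, 3, 4)])
print(np.round(np.linalg.eigvalsh(T), 6))
```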
notes for apr. 26 (latest change apr. 26). Lanczos' method can be viewed as a refinement of the power method. If, as we assume, A is real symmetric, with R_A(x) := xᵀAx/xᵀx its Rayleigh quotient, then for almost any y we would expect R_A(A^j y) to converge to A's absolutely largest eigenvalue as j → ∞. On the other hand, we know that this absolutely largest eigenvalue is either min_x R_A(x) or max_x R_A(x), hence would expect the appropriate extremum, taken over x ∈ span(y, Ay, A²y, ..., A^{j−1}y), to be a much better approximation to that absolutely largest eigenvalue than the simple ratio R_A(A^{j−1}y). Here is a more explicit analysis. The typical element of the Krylov space

K(A, y, j) := span{A^k y : k < j}

is of the form p(A)y for some p ∈ π_{<j} := polynomials of degree < j. Thus, with y = Σ_k c(k)u_k the expansion of y w.r.t. an orthonormal basis consisting of eigenvectors of A, with corresponding eigenvalues λ₁ ≤ ⋯ ≤ λ_n, we compute

R_A(p(A)y) = (p(A)y)ᵀA(p(A)y) / (p(A)y)ᵀ(p(A)y) = Σ_k c(k)²p(λ_k)²λ_k / Σ_k c(k)²p(λ_k)²,

therefore (w.l.o.g. Σ_k c(k)² = 1)

λ₁ ≤ R_A(p(A)y) ≤ λ₁ + (λ_n − λ₁) · ((1 − c(1)²)/c(1)²) · max_{k>1} (p(λ_k)/p(λ₁))².

Assume without loss that λ₁ is indeed the absolutely largest eigenvalue of A. Our use of the Rayleigh quotient R_A(A^{j−1}y) to estimate λ₁ corresponds to the choice p := ()^{j−1}, hence provides the estimate

λ₁ ≤ R_A(A^{j−1}y) ≤ λ₁ + O(t^{2(j−1)}), t := |μ/λ₁|,

with μ the absolutely second largest eigenvalue of A. By choosing different p ∈ π_{<j} we may hope to get closer to λ₁. Specifically, we may seek to minimize

max_{λ₂ ≤ λ ≤ λ_n} |p(λ)/p(λ₁)|

over all p ∈ π_{<j}. It is well known that the minimum of the related quantity is taken on by the Chebyshev polynomial of degree j−1 for the interval [λ₂, λ_n], i.e. the polynomial C_{j−1}((2x − λ₂ − λ_n)/(λ_n − λ₂)), with C_k the Chebyshev polynomial of degree k. Recall that |C_k| is bounded by 1 on [−1, 1], but grows the fastest outside that interval when compared with any other polynomial of degree ≤ k which is absolutely bounded by 1 on [−1, 1]. Also, C_k(t) ≈ (|t| + √(t² − 1))^k / 2 for t outside [−1, 1]. Thus, replacing the interval [λ₂, λ_n] by the larger, hence less advantageous, interval [−|μ|, |μ|], we get here

λ₁ ≤ R_A(p(A)y) ≤ λ₁ + O(1/C_{j−1}(1/t)²),

which is substantially smaller than the O(t^{2(j−1)}) for the power method with Rayleigh quotient, or the O(t^{j−1}) for the straight power method λ₁ ≈ (A^j y)(k)/(A^{j−1}y)(k). The book provides more explicit numerical comparisons. Note also that the power method only goes after the absolutely largest eigenvalue, while the above analysis could equally well have been applied to the largest or to the smallest eigenvalue. In other words, having taken the trouble to compute the Krylov sequence y, Ay, A²y, ..., A^{j−1}y, we could have maximized as well as minimized the Rayleigh quotient over that subspace. This leads to Lanczos' method.

There are various ways of viewing that method. Most directly, Lanczos' method consists in deriving from y, Ay, A²y, ..., A^{j−1}y an equivalent o.n. sequence q₁, q₂, ..., q_j (that's the reason the book keeps talking about A^{j−1}y all the time), as one would with Gram-Schmidt, i.e. so that

ran[q₁, ..., q_k] = ran[y, Ay, ..., A^{k−1}y] = K(A, y, k), for all k.

If Q is an extension of the o.n. sequence to an o.n. basis for ℝⁿ, then the matrix representation T := QᵀAQ for A w.r.t. this basis has a tridiagonal upper left principal submatrix of order j. For, by construction,

Aq_k ∈ A ran[q₁, ..., q_k] = A K(A, y, k) ⊆ K(A, y, k+1) = ran[q₁, ..., q_{k+1}],

hence T(:,k) = QᵀAq_k has nonzero entries only in positions 1, ..., k+1. On the other hand, T is real symmetric since A is, hence also T(k,:) has nonzero entries only in its first k+1 positions. This implies that

(36) AQ_j =
Q_jT_j + b_jq_{j+1}e_jᵀ, with

T_j := tridiag with diagonal a₁, ..., a_j and off-diagonal b₁, ..., b_{j−1}

(note the different way of subscripting the off-diagonal entries here). In particular, since T_j is a principal submatrix of a matrix representation of A with respect to an o.n. basis,

λ₁(A) ≤ λ₁(T_j) = min_{x ∈ K(A,y,j)} R_A(x) ≤ max_{x ∈ K(A,y,j)} R_A(x) = λ_j(T_j) ≤ λ_n(A).

This shows that we can obtain best possible estimates from K(A, y, j) for the extreme eigenvalues of A by computing the extreme eigenvalues of T_j. It was Lanczos' genius to turn this construction around. We are usually not really interested in the Lanczos vectors q_k per se. Rather, we are only interested in the tridiagonal matrix T_j, since, through the symmetric QR or other methods, we can efficiently extract the Rayleigh quotient extrema sought as the extremal eigenvalues of T_j. But the entries of T_j we can hope to compute directly from the requirement that equation (36) should hold, as follows. Assume that we already know a_i, b_i for i < k, and also know q_i for i ≤ k. We read off that

Aq_k = b_{k−1}q_{k−1} + a_kq_k + b_kq_{k+1},

hence we can compute a_k as

a_k = q_kᵀAq_k,

and therefore know the vector

r_k := Aq_k − a_kq_k − b_{k−1}q_{k−1} = b_kq_{k+1}.

We can therefore compute b_k := ‖r_k‖₂ and, assuming that r_k ≠ 0, also get q_{k+1} := r_k/b_k. Note that we only needed q_{k−1} and q_k here, hence need to retain only q_k and q_{k+1} for the next step. This very convenience is also the downfall of the method. Since q_{k+1} is derived from q_k and q_{k−1}, drift sets in eventually, i.e. with increasing k the computed q_k increasingly fail to be orthogonal to earlier q_i. This not only makes the computed eigenvalues of the computed T_j drift away from those of A, but also introduces ghost eigenvalues, i.e. repetitions of eigenvalues already approximated well earlier on. Reorthogonalization seems to be the only remedy. This is unfortunate, since Lanczos' method is, offhand, ideal for computing the extremal eigenvalues of a large sparse A, and the sparsity advantage is lost if reorthogonalization is needed.
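The three-term construction is short enough to state in full. The sketch below (function name mine; the dense matvec would be a sparse one in practice) implements plain Lanczos without reorthogonalization, so for large j it exhibits exactly the drift and ghost-eigenvalue behavior discussed next.

```python
import numpy as np

def lanczos(A, y, j):
    """Plain Lanczos: returns a (j,), b (j-1,), so T_j = tridiag(b, a, b).

    No reorthogonalization: fine for modest j, drifts for large j.
    """
    n = len(y)
    a = np.zeros(j); b = np.zeros(max(j - 1, 0))
    q_prev = np.zeros(n)
    q = y / np.linalg.norm(y)
    beta = 0.0
    for k in range(j):
        w = A @ q
        a[k] = q @ w
        r = w - a[k] * q - beta * q_prev   # r_k = A q_k - a_k q_k - b_{k-1} q_{k-1}
        beta = np.linalg.norm(r)
        if k < j - 1:
            b[k] = beta
            if beta == 0.0:                # invariant subspace found: stop early
                a, b = a[:k+1], b[:k]
                break
            q_prev, q = q, r / beta
    return a, b

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)); A = (A + A.T) / 2
a, b = lanczos(A, rng.standard_normal(200), 30)
T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
ev, evA = np.linalg.eigvalsh(T), np.linalg.eigvalsh(A)
print(ev[0], evA[0], ev[-1], evA[-1])   # extreme Ritz values vs. true extremes
```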
notes for may 1 (latest change may 13). Error analysis for Lanczos' method. Paige shows that the equation AQ_j = Q_jT_j + r_je_jᵀ is satisfied by the computed Q_j, T_j, and r_j to within O(u‖A‖), i.e. as well as expected. The trouble with Lanczos' method has its source in 'small' r_j, which leads to amplification of noise when q_{j+1} = r_j/‖r_j‖₂ is computed, thus spoiling the expected orthogonality (though not the normality) of the computed q_{j+1}. This may have serious consequences for the quality of the eigenvalues of the computed T_j as approximations to the eigenvalues of A, as the following analysis shows. From p. 272 of the book we recall the

Proposition. If A, B are symmetric and E := AQ₁ − Q₁B, with Q₁ ∈ ℝ^{n×k} orthonormal, then

dist(λ(B), λ(A)) := max_j min_i |λ_i(A) − λ_j(B)| ≤ √2 ‖AQ₁ − Q₁B‖₂.

Proof. Extend Q₁ in any way to an o.n. basis Q = [Q₁, Q₂]. Then

QᵀAQ = [B 0; 0 Q₂ᵀAQ₂] + F, with F := [Q₁ᵀE (Q₂ᵀE)ᵀ; Q₂ᵀE 0].

Hence, by the max-norm Wielandt-Hoffman, dist(λ(C), λ(A)) ≤ ‖F‖₂ for C := diag(B, Q₂ᵀAQ₂), while λ(B) ⊆ λ(C). Further, for any (x, y),

‖F(x, y)‖ ≤ ‖QᵀEx‖ + ‖EᵀQ₂y‖ ≤ ‖E‖₂(‖x‖ + ‖y‖) ≤ √2 ‖E‖₂ ‖(x, y)‖,

and so ‖F‖₂ ≤ √2 ‖E‖₂. □

We conclude that, actually, 'small' r_j is good, provided Q_j is o.n. If it isn't, then the best we can say is the following.

Corollary. If A, B are symmetric and F := AX − XB, with X ∈ ℝ^{n×k} satisfying σ_k(X) > 0, then

dist(λ(B), λ(A)) ≤ √2 ‖AX − XB‖₂ / σ_k(X).

Proof. Let X = QR, with Q orthonormal and R square upper triangular, therefore ‖R⁻¹‖₂ = 1/σ_k(X). Then AQ − Q(RBR⁻¹) = (AX − XB)R⁻¹, hence the proposition does it. □

We conclude from this that

dist(λ(T_j), λ(A)) ≤ √2 ‖E_j‖₂ / σ_j(Q_j), with E_j := AQ_j − Q_jT_j − r_je_jᵀ.

Again, we know that ‖E_j‖₂ = O(u‖A‖₂). What is troubling is the failure of the computed Q_j to stay orthogonal, as this leads to 'small' σ_j(Q_j). The straightforward remedy is reorthogonalization, in which each computed q_j is further modified by orthogonalizing it w.r.t. all the earlier q_i: a very expensive procedure. An interesting modification merely orthogonalizes q_j w.r.t. a set of converged Ritz vectors, thus making certain that the corresponding extremal eigenvalues are not introduced anew as 'ghost' eigenvalues. In this view, the extremal eigenvalues of T_j are good approximations to eigenvalues of A to the extent that the corresponding Ritz vectors have converged. Here are the details, from the book's theorem on Ritz pairs. If A is hermitian, Q ∈ ℂ^{n×j} is orthonormal, and D := diag(θ₁, ..., θ_j) = ZᴴQᴴAQZ is a Schur decomposition for QᴴAQ, then

AQZ − QZD = (I − QQᴴ)AQZ.

The diagonal entries of D are called Ritz values, and the columns of [y₁, ..., y_j] := QZ are the corresponding Ritz vectors. It follows that

‖Ay_k − θ_ky_k‖₂ ≤ ‖(I − QQᴴ)AQ‖₂.

This is relevant here since AQ_j = Q_jT_j + r_je_jᵀ, while Z_jᴴT_jZ_j = diag(θ₁, ..., θ_j) for some suitable unitary matrix Z_j. In particular, y_k := Q_jZ_je_k is the approximate eigenvector, or Ritz vector, corresponding to the approximate eigenvalue, or Ritz value, θ_k. If ‖Ay_k − θ_ky_k‖₂ = O(u‖A‖₂), then the pair (y_k, θ_k) is as close to an eigenpair for A as we are entitled to get. We want subsequent q_i to stay away from this direction, and in exact arithmetic they would, since they would be orthogonal to ran Q_j, which contains these y_k. In finite precision arithmetic this orthogonality fails, and this leads to a reintroduction of these eigendirections. In selective reorthogonalization, one keeps a record of such converged Ritz pairs and only insists that subsequent q_i be orthogonal to them. There are tricks concerning just how one determines which Ritz pairs have converged without keeping track of all the q_i.

notes for may 1a (latest change may 2). Jacobi matrices, three-term recurrence relations, and orthogonal polynomials. Throughout, T_n is an unreduced Jacobi matrix, i.e. a real symmetric tridiagonal matrix with diagonal a₁, ..., a_n and off-diagonal b₁, ..., b_{n−1}, with b_j > 0 for all j. Associated with T_n are the monic polynomials

p₀ := 1, p_r := det(·I_r − T_r), r > 0.

These polynomials satisfy the three-term recurrence relation

p_r(x) = (x − a_r) p_{r−1}(x) − b_{r−1}² p_{r−2}(x), r = 1, 2, ..., n.

Conversely, if we know the sequence (p_r), then we can construct the corresponding Jacobi matrix by running the recurrence backwards. In fact, we can do this if we know just p_n and p_{n−1}. For if we already know p_r and p_{r−1}, then, by the recurrence, necessarily p_r = (x − a_r)p_{r−1} + l.o.t., which provides us with a_r. After that, we know p_r − (x − a_r)p_{r−1} = −b_{r−1}²p_{r−2}, hence can determine b_{r−1} and p_{r−2} from the requirement that p_{r−2} be monic. If we try this for an arbitrary pair p_n, p_{n−1} of monic polynomials, we should not succeed, since we proved earlier results about interlacing of zeros of the polynomials associated with a Jacobi matrix; we will run into trouble when we have to take square roots of negative numbers. What is the most general sequence of polynomials obtainable this way? From the Sturm sequence results we know that, at a minimum,

(37) p_n = Π_{i=1}^{n} (· − ξ_i), p_{n−1} = Π_{j=1}^{n−1} (· − η_j), with ξ₁ < η₁ < ξ₂ < η₂ < ⋯ < η_{n−1} < ξ_n.

As we will see, these necessary conditions are also sufficient. Associated with any inner product ⟨·,·⟩ there is its sequence of orthogonal polynomials, i.e. the unique sequence (p_r) of monic polynomials, p_r = ()^r + l.o.t. for all r, for which ⟨p_r, p_s⟩ = 0 whenever r ≠ s. There is, in fact, at most one such sequence since, by its definition, each p_r is in ()^r + π_{<r}, yet perpendicular to π_{<r} = span{p_s : s < r}, hence must be the error in the best approximation (b.a.) to ()^r from π_{<r} w.r.t. the norm ‖·‖ := ⟨·,·⟩^(1/2). This argument also proves existence of the sequence.
The actual numerical construction is best done by making use of the fact that such a sequence of orthogonal polynomials must satisfy a three-term recurrence relation, provided multiplication by x is symmetric with respect to the inner product, i.e. provided

⟨()f, g⟩ = ⟨f, ()g⟩.

For in that case, ⟨()p_{r−1}, p_s⟩ = ⟨p_{r−1}, ()p_s⟩ = 0 for all s < r − 2, since ()p_s ∈ π_{<r−1}. This means that p_r is obtainable from ()p_{r−1} by merely subtracting the components along p_{r−1} and p_{r−2}, to get something perpendicular to p_{r−1} and p_{r−2}; such a modification does not destroy the orthogonality to π_{<r−2}, since p_{r−1} and p_{r−2} are both orthogonal to it. This gives

p_r = ()p_{r−1} − p_{r−1}⟨()p_{r−1}, p_{r−1}⟩/‖p_{r−1}‖² − p_{r−2}⟨()p_{r−1}, p_{r−2}⟩/‖p_{r−2}‖²,

i.e.

p_r = (· − a_r)p_{r−1} − b_{r−1}²p_{r−2},

with

a_r := ⟨()p_{r−1}, p_{r−1}⟩/c_{r−1}, b_{r−1}² := ⟨()p_{r−1}, p_{r−2}⟩/c_{r−2} = ⟨p_{r−1}, ()p_{r−2}⟩/c_{r−2} = c_{r−1}/c_{r−2}, c_r := ‖p_r‖².

Note that the p_r so computed does have the correct normalization, since its leading coefficient is the same as that of ()p_{r−1}, i.e. 1, by induction. Also note that the recurrence can start with r = 1, taking p_{−1} := 0. All this goes through even if ⟨·,·⟩ is only semidefinite, except that we lose uniqueness. One could regain it by insisting on the 'best' b.a., i.e. the one we would obtain by using Gram-Schmidt while ignoring all p_r whose 'norm' is zero. For example, consider the discrete inner product

⟨f, g⟩ := Σ_{ξ∈Ξ} w(ξ) f(ξ) g(ξ),

with Ξ a finite point set, #Ξ = n say, and the weight function w positive on Ξ. This is definite on π_{<n}, but gives the nontrivial polynomial Π_{ξ∈Ξ}(· − ξ) zero 'norm'. In fact, this is necessarily p_n for this inner product, since it lies in ()^n + π_{<n} and has smallest possible norm, while on π_{<n} this discrete inner product is definite, hence there is a unique b.a. from that space to any function. If we continue, now ignoring p_n in the calculation of p_{n+1} (since p_n is zero as far as the inner product is concerned), then also p_{n+1} is unique: it is the unique element of its class of zero 'norm'. Thus, by induction, we obtain p_j for any j ≥ n as the error in the polynomial interpolant to ()^j from π_{<n} at Ξ.

We noticed earlier that we can compute p₁, ..., p_n if we know p_n and p_{n−1}. One way would be to run the recurrence backwards, a very unstable process. The other is to find a suitable inner product and generate the sequence, and the corresponding Jacobi matrix, by forward recurrence. We try this for the discrete inner product, choosing Ξ := Z(p_n) = {ξ₁, ..., ξ_n}, the zero set of p_n, i.e. with p_n and p_{n−1} satisfying (37). With this choice, p_n is indeed orthogonal to π_{<n}, regardless of the choice of the weights w(ξ_i). If we succeed in choosing the weights positive and so as to make p_{n−1} perpendicular to π_{<n−1}, we will be done, by the uniqueness of the orthogonal polynomials. This requires that the linear functional

L: π_{<n} → ℝ: f ↦ ⟨f, p_{n−1}⟩

have π_{<n−1} in its kernel. Since π_{<n−1} is the kernel of the divided difference [ξ₁, ..., ξ_n] as a linear functional on π_{<n}, this implies that L = γ[ξ₁, ..., ξ_n] for some γ. Consequently,

w(ξ) = γ/(p_{n−1}(ξ) p_n'(ξ)) on Ξ,

using the fact that [ξ₁, ..., ξ_n]f = Σ_i f(ξ_i)/p_n'(ξ_i). In particular, ‖p_{n−1}‖² = L(p_{n−1}) = γ[ξ₁, ..., ξ_n]p_{n−1} = γ. On the other hand, since, by (37), p_{n−1} has exactly one zero (counting multiplicity) in each interval (ξ_i, ξ_{i+1}), and each ξ_i is a simple zero of p_n, it follows that, for all i, signum(p_{n−1}(ξ_i) p_n'(ξ_i)) is the same, in fact = 1; hence, choosing the arbitrary scalar γ = 1, the weights we get are indeed positive. □

We will meet three-term recurrence relations again in the conjugate gradient method, and for the same reason that we saw them in the consideration of eigenvalues of symmetric tridiagonal matrices and in the Lanczos method, viz. because there are orthogonal polynomials in the background. Conversely, facts about orthogonal polynomials, e.g. nodes and weights for Gaussian quadrature formulae, are often most easily derived by looking at eigenvalues and eigenvectors of the associated Jacobi matrix.
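The forward recurrence just described, a Stieltjes-type procedure, is easy to code for a discrete inner product. The sketch below (names mine) computes the a_r, b_r from given nodes and positive weights and checks the quadrature fact just mentioned: the resulting Jacobi matrix has the nodes as its eigenvalues.

```python
import numpy as np

def jacobi_from_discrete_ip(xi, w):
    """Recurrence coefficients a (n,), b (n-1,) of the monic orthogonal
    polynomials for <f,g> = sum_i w_i f(xi_i) g(xi_i)."""
    n = len(xi)
    a = np.zeros(n); b = np.zeros(n - 1)
    p_prev = np.zeros(n)              # p_{r-1} evaluated at the nodes
    p = np.ones(n)                    # p_0 = 1
    c_prev = None
    for r in range(n):
        c = w @ p**2                          # c_r = ||p_r||^2
        a[r] = (w @ (xi * p**2)) / c          # a_{r+1} = <x p_r, p_r>/c_r
        if r > 0:
            b[r-1] = np.sqrt(c / c_prev)      # b_r^2 = c_r / c_{r-1}
        p_new = (xi - a[r]) * p - (b[r-1]**2 if r > 0 else 0.0) * p_prev
        p_prev, p, c_prev = p, p_new, c
    return a, b

xi = np.array([-2., -0.5, 1., 3.])           # nodes = zeros of p_n
w = np.array([0.3, 0.4, 0.2, 0.1])           # arbitrary positive weights
a, b = jacobi_from_discrete_ip(xi, w)
T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
print(np.sort(np.linalg.eigvalsh(T)))        # recovers the nodes xi
```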
For example, if we define P_r := p_r/(b₁b₂⋯b_r) for all r, then

b_rP_r = (x − a_r)P_{r−1} − b_{r−1}P_{r−2}, for all r.

This shows that, for each ξ ∈ Z(p_n) = Z(P_n), the vector (P₀(ξ), ..., P_{n−1}(ξ)) is a nontrivial eigenvector of the Jacobi matrix T_n belonging to the eigenvalue ξ, hence can be obtained from the normalized eigenvectors for T_n provided by the symmetric QR method.

notes for may (latest change 17 dec 92). The Conjugate Gradient Method, often called the 'conjugant' gradient method (an illustration of the power of alliteration in language formation), was derived by Hestenes and Stiefel as a direct method for solving a linear system. It is possible, as the book shows, to spend quite some time deriving it by looking at steepest descent, then conjugate descent directions, and the like. It seems more direct and less surprising to think of it instead as a particular instance of least-squares approximation by polynomials. Specifically, the cg method consists in computing

min_{p ∈ π_k} ‖x − p(A)b‖_A

for given b and given positive definite hermitian A, and for k = 0, 1, 2, ..., thus obtaining a best approximation (b.a.) x_k to the solution x of the equation Ax = b from

Π_k := ran[b, Ab, ..., A^k b]

(note, once again, the Krylov sequence). The norm here is the one associated with the inner product

⟨y, z⟩_A := yᴴAz, ‖y‖_A² = yᴴAy.

This makes it possible to compute ⟨x, y⟩_A for any y, without knowing x, as ⟨x, y⟩_A = bᴴy, and that is all we need to be able to do in order to construct the b.a. to x from any linear subspace in the standard way, i.e. by constructing an A-orthogonal basis p₀, ..., p_k for that subspace and then finding the b.a. as

Σ_j p_j ⟨x, p_j⟩_A / ‖p_j‖_A².

Our space Π_k is derived from the function space π_k of all polynomials of degree ≤ k, and there are standard methods available to compute the b.a. to some function f from π_k w.r.t. the norm derived from an inner product ⟨·,·⟩: we would generate the corresponding sequence p₀, p₁, ..., p_k of monic orthogonal polynomials with the aid of the three-term recurrence, and calculate the b.a. in the form

f_k := Σ_{j≤k} p_j ⟨f, p_j⟩/⟨p_j, p_j⟩.

We would do this inductively, generating first p_k from p_{k−1} and p_{k−2} via the recurrence

p_k = (· − a_k)p_{k−1} − b_{k−1}²p_{k−2}, with a_k := ⟨()p_{k−1}, p_{k−1}⟩/c_{k−1}, b_{k−1}² := c_{k−1}/c_{k−2}, c_k := ⟨p_k, p_k⟩,

and then f_k from f_{k−1} and p_k, via f_k := f_{k−1} + p_k⟨f, p_k⟩/c_k. In fact, we would recall the earlier discussion of modified Gram-Schmidt, hence would compute ⟨f, p_k⟩ more accurately as ⟨e_{k−1}, p_k⟩, with e_{k−1} := f − f_{k−1}, i.e. would update

d_k := p_k ⟨e_{k−1}, p_k⟩/c_k, f_k := f_{k−1} + d_k, e_k := e_{k−1} − d_k.

We would initialize the whole process with

p_{−1} := 0, f_{−1} := 0, p₀ := 1, e_{−1} := f.

By defining the inner product by

⟨f, g⟩ := (f(A)b)ᴴ A (g(A)b),

we have ‖x − p(A)b‖_A = ‖f − p‖, provided we take the function f := ()⁻¹, as then f(A)b = A⁻¹b = x. This makes the standard machinery for LS approximation by polynomials, recalled in the preceding paragraph, available for the efficient calculation of the b.a. to x from Π_k. The essence of the cg method lies in the hope that already for 'small' k this approximate solution of Ax = b is good enough. In any case, one is assured that for some finite k the approximate solution is exact, at least if the calculation is carried out in exact arithmetic. You may have a moment's hesitation here, since we are using functions of matrices. There is no difficulty as long as f and g are polynomials, but we are proposing here to apply the machinery also to the function ()⁻¹. Of course the notation is suggestive: f(A) = A⁻¹ in this case. But that seems merely slick and by definition. Just to avoid any questions, we have at least two choices. (i) Since A is hermitian positive definite, it is unitarily similar to a diagonal matrix Λ with positive diagonal entries:
A = UᴴΛU, in which case f(A) = Uᴴf(Λ)U, hence

(38) ⟨f, g⟩ = (Ub)ᴴ f(Λ)Λg(Λ) (Ub) = Σ_j λ_j |(Ub)(j)|² f(λ_j) g(λ_j).

This looks once again just like some ordinary discrete inner product. (ii) Observe that we can think of A⁻¹ also as a polynomial in A, as follows. Let p be the minimal polynomial for A and write it as p = ()q − α, for some polynomial q and with α := −p(0). Since p is A's minimal polynomial, all of its roots must be eigenvalues of A, hence A's invertibility implies that α ≠ 0. But this implies that Aq(A) = αI, hence A⁻¹ = q(A)/α; hence we can take f := q/α in the above. This second point of view, as did the first, implicitly, gives us a strong hint as to the success of conjugate gradients: the degree of the minimal polynomial for A may well be much smaller than the order n of the matrix A. In that case, we will obtain the solution to the linear system Ax = b much earlier than after n steps.

If we now carry out the calculation, we do not want to deal explicitly with the functions p_k, f_k, and e_k. Rather, we are only interested in the vectors

(39) p_k := p_k(A)b, x_k := f_k(A)b, e_k := e_k(A)b.

(I trust that the use of p_k for both the polynomial p_k and the vector p_k(A)b stresses the close connection and doesn't confuse.) We observe that, for any functions f and g,

⟨f, g⟩ = (f(A)b)ᴴ A (g(A)b) = ⟨f(A)b, g(A)b⟩_A.

Hence, in terms of the vectors (39), the earlier calculation reads

p_k = Ap_{k−1} − a_kp_{k−1} − b_{k−1}²p_{k−2}, with a_k := ⟨Ap_{k−1}, p_{k−1}⟩_A/c_{k−1}, b_{k−1}² := c_{k−1}/c_{k−2}, c_k := ⟨p_k, p_k⟩_A,
d_k := p_k ⟨e_{k−1}, p_k⟩_A/c_k, x_k := x_{k−1} + d_k, e_k := e_{k−1} − d_k.

We would initialize the whole process with

p_{−1} := 0, x_{−1} := 0, p₀ := b, e_{−1} := x = A⁻¹b.

In the calculations, we would make use of the fact that

⟨e_j, y⟩_A = (Ae_j)ᴴy = r_jᴴy, with r_j := Ae_j = b − Ax_j

the residual of the approximate solution x_j. Once we agree to keep track of the residual, then there are certain simplifications possible, because of the fact that two vectors y and z are A-orthogonal, or conjugate (⟨y, z⟩_A = 0), iff Ay and z are orthogonal, i.e. (Ay)ᴴz = 0. Here are the details. The vector p_j has the form p_j(A)b for some polynomial p_j of degree j. Hence p₀, ..., p_{k−1} provides an A-orthogonal basis for Π_{k−1}. Also, x_{k−1} = f_{k−1}(A)b ∈ Π_{k−1}, hence

r_{k−1} = b − Ax_{k−1} ∈ b + AΠ_{k−1} ⊆ Π_k.

On the other hand, r_{k−1} = (Ae_{k−1})(A)b, with the function e_{k−1} perpendicular to the function space π_{k−1}, hence

(40) r_{k−1}ᴴ p_j = 0, for all j < k.

This implies that

(41) r_{k−1}ᴴ r_{j−1} = 0, for all j < k.

It also implies that r_{k−1} has some nontrivial component in the p_k-direction, i.e. a nontrivial scalar multiple of p_k can be obtained by modifying r_{k−1} to be A-orthogonal to Π_{k−1}. Yet, since AΠ_{k−2} ⊆ Π_{k−1} ⊥ r_{k−1} by (40), r_{k−1} is already A-orthogonal to Π_{k−2}, hence we can obtain a nontrivial scalar multiple of p_k, which we call again just p_k, even though it might now be differently normalized, as

(42) p_k = r_{k−1} + β_k p_{k−1}, with β_k := −⟨r_{k−1}, p_{k−1}⟩_A/⟨p_{k−1}, p_{k−1}⟩_A.

With p_k available, we can proceed to get x_k = f_k(A)b by adding to x_{k−1} the p_k-component of x, i.e. the p_k-component of e_{k−1}, i.e. by computing

x_k = x_{k−1} + α_k p_k, with α_k := r_{k−1}ᴴp_k/(p_kᴴAp_k).

This also gives r_k = r_{k−1} − α_kAp_k. From this and (41),

(43) ‖r_k‖₂² = −α_k (Ap_k)ᴴ r_k,

while, from (42) and (40), r_{k−1}ᴴp_k = ‖r_{k−1}‖₂². Hence

α_k = ‖r_{k−1}‖₂²/‖p_k‖_A².

From this and (43), with k − 1 rather than k,

β_k = ‖r_{k−1}‖₂²/(α_{k−1}‖p_{k−1}‖_A²) = ‖r_{k−1}‖₂²/‖r_{k−2}‖₂².

Altogether, this gives the conjugate gradient iteration

p_k = r_{k−1} + β_k p_{k−1}, x_k = x_{k−1} + α_k p_k, r_k = r_{k−1} − α_k Ap_k,

with

β_k = ‖r_{k−1}‖₂²/‖r_{k−2}‖₂², α_k = ‖r_{k−1}‖₂²/‖p_k‖_A².

We would start with k = 1, having initialized the whole process with p₀ := 0, x₀ := 0, r₀ := b. The process becomes undefined when ‖r_{k−2}‖₂ = 0 and/or ‖p_k‖_A = 0. In the first case, we would have stopped at k − 2, since we would have found the exact solution to Ax = b. In the second case, we
conclude, A being positive definite, that the vector p_k = 0, hence, with (42) and (40), already r_{k−1} = 0, and this implies that we would have stopped at k − 1. In exact arithmetic, the cg iteration converges in finitely many steps, viz. with step k − 1 in case k = #{λ ∈ λ(A) : b is not perpendicular to ker(A − λI)}. In noisy arithmetic, it is more appropriate to treat it as an iteration, hence to apply standard analysis of iterative methods to it.

notes for may 8 (latest change may 8). Convergence of cg can be seen as follows. Since x_k = f_k(A)b is the b.a. from Π_k to x = A⁻¹b w.r.t. the norm

‖f‖² := ‖f(A)b‖_A² = Σ_j λ_j |(Ub)(j)|² |f(λ_j)|²,

we readily estimate

‖x − x_k‖_A ≤ ‖b‖_A · dist_{∞,[a,b]}(()⁻¹, π_k),

with [a, b] any interval containing λ(A). The smallest such interval is the interval [m, M] := [λ₁(A), λ_n(A)]. This indicates that cg might not be very effective when λ₁(A) is 'close' to 0. It also indicates that there may be real troubles (and there are) when A is real symmetric and invertible but not positive definite, as then we must approximate ()⁻¹ on both sides of 0, and that is much harder. In our situation of a positive definite A, we need to compute, or at least bound, the number

E_k := dist_{∞,[m,M]}(()⁻¹, π_k).

A linear change of variables which carries the interval [m, M] to the interval [−1, 1] carries 0 to the point

t₀ := −(M + m)/(M − m) = −(κ + 1)/(κ − 1), with κ := κ₂(A) = λ_max(A)/λ_min(A).

This implies that we can compute E_k as

E_k = (2/(M − m)) dist_{∞,[−1,1]}(1/(· − t₀), π_k).

It turns out that g := 1/(· − t₀) is one of the very few nontrivial functions for which it is possible to write down a formula for its best uniform approximation from π_k on [−1, 1]; Bernstein did that some time ago. This means that Bernstein also has an exact formula for E_k (source: Timan, page 76), which we do not reproduce here. But, up to small terms, we can obtain such a result immediately, as follows. For any p ∈ π_k, the function q := 1 − (· − t₀)p lies in π_{k+1} and satisfies q(t₀) = 1, and conversely; hence

dist_{∞,[−1,1]}(g, π_k) = min{‖q‖_{∞,[−1,1]}/ inf stretch : q ∈ π_{k+1}, q(t₀) = 1} ≥ 1/|C_{k+1}(t₀)|,

with equality up to such small terms, where C_j is the Chebyshev polynomial of degree j, the extremal polynomial for this problem. From the recurrence relation for Chebyshev polynomials (cf. the apr. 26 discussion of Lanczos' method),

C_j(t) = ((t + √(t² − 1))^j + (t − √(t² − 1))^j)/2,

hence 2|C_{k+1}(t₀)|/(|t₀| + √(t₀² − 1))^{k+1} → 1 monotonely from below as k → ∞. In terms of the condition number κ = κ₂(A) = M/m of A, we have

|t₀| + √(t₀² − 1) = (κ + 1 + 2√κ)/(κ − 1) = (√κ + 1)/(√κ − 1).

This shows that, for every k, up to the small terms mentioned,

‖x − x_k‖_A ≤ ‖b‖_A · (4/(M − m)) · ((√κ − 1)/(√κ + 1))^{k+1}.

This error analysis has replaced dist_{∞,λ(A)} by dist_{∞,[m,M]}, and that is often a very pessimistic change. In practice, convergence is much faster. Heuristically, the method reduces the contributions of the eigenvectors belonging to extreme eigenvalues fastest. This means that, in some sense, the interval on which we have to approximate ()⁻¹ from π_k shrinks as k grows, giving us a better condition number to work with.
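Collecting the recurrences: the following is a minimal dense-matrix sketch of the Hestenes-Stiefel iteration exactly as derived above (array names mine); a production code would use a sparse matvec and a tolerance on ‖r‖ rather than a fixed step count.

```python
import numpy as np

def cg(A, b, steps):
    """Conjugate gradients for hermitian positive definite A.

    Follows the text's indexing: p_k = r_{k-1} + beta_k p_{k-1},
    x_k = x_{k-1} + alpha_k p_k, r_k = r_{k-1} - alpha_k A p_k.
    """
    x = np.zeros_like(b)
    r = b.copy()                  # r_0 = b - A x_0 = b
    p = np.zeros_like(b)
    rr_old = 1.0                  # placeholder; beta_1 is irrelevant since p_0 = 0
    rr = r @ r
    for _ in range(steps):
        beta = rr / rr_old
        p = r + beta * p
        Ap = A @ p
        alpha = rr / (p @ Ap)     # = ||r_{k-1}||^2 / ||p_k||_A^2
        x += alpha * p
        r -= alpha * Ap
        rr_old, rr = rr, r @ r
        if rr == 0.0:             # exact solution found (in exact arithmetic)
            break
    return x

rng = np.random.default_rng(2)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)     # well-conditioned SPD matrix
b = rng.standard_normal(50)
x = cg(A, b, 50)
print(np.linalg.norm(A @ x - b))
```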
notes for may 10 (latest change june 15). Preconditioned cg is a natural response to the error analysis for cg just given. Since the convergence rate is 'slow' to the extent that κ₂(A) is large, one makes an 'easily invertible' change of coordinates x̃ := Cx. In terms of these coordinates, x̃ satisfies the linear system AC⁻¹x̃ = b or, with Ã := C⁻ᴴAC⁻¹ and b̃ := C⁻ᴴb, the system Ãx̃ = b̃. The hope is that κ₂(Ã) is smaller than κ₂(A). For example, writing A = CᴴC, as we may, e.g. using the LDLᴴ factorization of A, we get Ã = I, hence the perfect condition, 1. Of course, if we are willing to compute such a C, we might as well solve Ax = b directly. Thus the point is to pick an 'easily invertible' C for which, nevertheless, κ₂(Ã) is small. Popular candidates for C include the square root of the diagonal of A, or the C obtained by 'incomplete factorization', in which A ≈ CᴴC with C as sparse as A. Since we are interested ultimately in solving Ax = b, one modifies the earlier description of the cg iteration to maintain r_k and x_k rather than r̃_k and x̃_k. Note that, with x̃_k the approximate solution obtained and r̃_k := b̃ − Ãx̃_k its residual, C⁻¹r̃_k is not the residual of the approximate solution x_k := C⁻¹x̃_k. Rather,

r_k := b − Ax_k = b − AC⁻¹x̃_k = Cᴴ(b̃ − Ãx̃_k) = Cᴴr̃_k.

This introduces a notational difficulty, which the book resolves by defining

z_k := C⁻¹r̃_k = C⁻¹C⁻ᴴr_k =: M⁻¹r_k, with M := CᴴC, r_k := b − Ax_k.

These z_k come in handy when we need to calculate ‖r̃_k‖² = (C⁻ᴴr_k)ᴴ(C⁻ᴴr_k) = z_kᴴr_k. We do not bother to construct the p̃_k explicitly, but merely carry the p_k := C⁻¹p̃_k = C⁻¹(r̃_{k−1} + β_kp̃_{k−1}) = z_{k−1} + β_kp_{k−1}. This gives the following preconditioned cg iteration, obtained here by copying from the earlier note and modifying:

p_k = z_{k−1} + β_k p_{k−1}, x_k = x_{k−1} + α_k p_k, r_k = r_{k−1} − α_k Ap_k,

with Mz_{k−1} = r_{k−1} and

β_k = z_{k−1}ᴴr_{k−1}/(z_{k−2}ᴴr_{k−2}), α_k = z_{k−1}ᴴr_{k−1}/‖p_k‖_A².

We would start with k = 1, having initialized the whole process with p₀ := 0, x₀ := 0, r₀ := b, z₀ := M⁻¹r₀.
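The modification from the plain cg iteration is small. The sketch below (a variant of the cg code above, names mine) uses the 'square root of the diagonal' idea mentioned above, i.e. M = diag(A), applied as z = M⁻¹r:

```python
import numpy as np

def pcg(A, b, Minv_apply, steps):
    """Preconditioned cg; Minv_apply(r) returns z = M^{-1} r."""
    x = np.zeros_like(b)
    r = b.copy()
    z = Minv_apply(r)
    p = np.zeros_like(b)
    zr_old = 1.0                 # placeholder; beta_1 is irrelevant since p_0 = 0
    zr = z @ r
    for _ in range(steps):
        beta = zr / zr_old
        p = z + beta * p
        Ap = A @ p
        alpha = zr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = Minv_apply(r)
        zr_old, zr = zr, z @ r
    return x

rng = np.random.default_rng(3)
W = rng.standard_normal((50, 50))
A = W @ W.T + np.diag(np.linspace(1, 1000, 50))   # badly scaled SPD matrix
b = rng.standard_normal(50)
d = np.diag(A)
x = pcg(A, b, lambda r: r / d, 40)                # Jacobi (diagonal) preconditioner
print(np.linalg.norm(A @ x - b))
```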
The Perron-Frobenius theory asserts that any irreducible nonnegative matrix has an eigenvalue equal to its spectral radius, and that this eigenvalue is algebraically (hence also geometrically) simple, with the corresponding eigenvector positive in all its entries. The standard arguments (e.g. in Varga or Franklin) are somewhat long, but the gist of the argument is easy to give. Call a matrix A nonnegative, and write A ≥ 0, if it is nonnegative as a function, i.e. if A(i,j) ≥ 0 for all i, j. Such a matrix maps the nonnegative orthant ℝ₊ⁿ := {x ∈ ℝⁿ : x ≥ 0} into itself. Call A positive, and write A > 0, in case all its entries are positive. A positive matrix maps ℝ₊ⁿ \ 0 into its interior, i.e. the map

σ: S → S: x ↦ Ax/‖Ax‖, S := {x ∈ ℝ₊ⁿ : ‖x‖ = 1},

is continuous and a strict contraction, hence has a unique fixed point. (You might prefer here to use the 1-norm, for then S is a cutout from the hyperplane Σ_j x(j) = 1.) This means that we can find x in the relative interior of S for which Ax = λx, λ > 0. In particular, x is an eigenvector for A with corresponding positive eigenvalue λ. Note that we found this eigenpair by the power method and, by the argument, we would have reached it starting from any nontrivial nonnegative vector. Since for any nontrivial vector y, in particular for any eigenvector of A, we can find some positive vector z with zᵀy ≠ 0, it follows that λ must be the largest eigenvalue of A, and that its ascent must be one. Assume now that there is also an eigenvector y for λ independent of x. Then we may assume that y has some negative component, hence there is a point z in the segment [y, x] on the boundary of ℝ₊ⁿ, and this z cannot be zero, since x and y are linearly independent, and z is also an eigenvector for λ. But that is nonsense, since A carries any nontrivial vector in ℝ₊ⁿ into the interior. This is the original theory, due to Oskar Perron. Frobenius improved it by weakening the conditions on A. Specifically, Frobenius showed the same conclusion under the weaker assumption that A is nonnegative but irreducible, meaning that, for every pair of distinct indices i, j, there is a sequence i = i₁, i₂, ..., i_r = j so that Π_k A(i_k, i_{k+1}) ≠ 0. The nonnegativity implies that x ↦ Ax/‖Ax‖ maps S into itself, and the irreducibility guarantees that the map has at most one fixed point, and that this fixed point necessarily lies in the relative interior. With this, the geometric simplicity is demonstrated as before.

notes for may 12 (latest change may 11). Determinants are often brought into courses such as this quite unnecessarily, but when they are useful, they are remarkably so. The use of determinants is a bit bewildering to the beginner. I have found that, of the many, many determinant identities available, in the end I have always managed with just one, viz. Sylvester's determinant identity, and this is nothing but Gauss elimination. For A ∈ ℝ^{m×n} and i, j sequences with ranges in 1:m and 1:n resp., A(i, j) is the matrix fashioned from the rows and columns of A so indicated (in MATLAB we would have A(i,:) and A(i,j)). Precisely,

A(i,j)(p,q) := A(i(p), j(q)), for all p, q.

For [a₁, ..., a_n] =: A ∈ ℝ^{n×n}, det A is, by definition, a multilinear alternating form in its n column arguments. At a minimum, this means that

ℝⁿ → ℝ: x ↦ det(x, a₂, ..., a_n)

is linear (and likewise in each argument), and that

det(..., a, ..., b, ...) = −det(..., b, ..., a, ...),

with the various ellipses indicating the other arguments, held fixed. Note that this alternation property implies that the determinant is zero if two of its arguments coincide, which, together with the linearity, implies that

det(..., a, ..., b, ...) = det(..., a, ..., b + αa, ...).

It follows that

det(a₁, ..., a_n) = Σ_{i ∈ nⁿ} det(e_{i(1)}, ..., e_{i(n)}) Π_j a_j(i(j)),

just using linearity (i runs over all maps from (1, ..., n) to (1, ..., n)). By the alternation property, most of these summands are zero: only those determinants det(e_{i(1)}, ..., e_{i(n)}) for which i ∈ S_n := the symmetric group of order n, i.e. for which i is a permutation of the sequence (1, ..., n), are not automatically zero. Further, for such i, det(e_{i(1)}, ..., e_{i(n)}) = (−1)^i det(I), by the alternation property, with (−1)^i = ±1 depending on whether it takes an even or an odd number of interchanges to change i into a strictly increasing sequence. (This requires the proof that it is always possible to convert any such i into an increasing sequence by interchanges, which is easy, and that, while this might be done in many ways, the parity of such a number of interchanges is independent of how we got from i to (1, ..., n).) Thus det is entirely pinned down once we know the number det(I); it is customary to choose det(I) := 1. On the other hand, starting with the resulting formula

(44) det(a₁, ..., a_n) = Σ_{i ∈ S_n} (−1)^i Π_j a_j(i(j))

as a definition, one readily verifies that det so defined satisfies the three properties claimed for it. The same argument shows that

det(AB) = Σ_{i ∈ nⁿ} det(a_{i(1)}, ..., a_{i(n)}) Π_j b(i(j), j) = Σ_{i ∈ S_n} (−1)^i det(A) Π_j b(i(j), j) = det(A) det(B).

In particular, with B := [e₁, ..., e_{j−1}, x, e_{j+1}, ..., e_n] we have det B = x(j), hence

(45) det(A) x(j) = det(AB) = det(a₁, ..., a_{j−1}, Ax, a_{j+1}, ..., a_n),

therefore, with b := Ax,

x(j) = det(a₁, ..., a_{j−1}, b, a_{j+1}, ..., a_n)/det(A), for all j,

in case A is nonsingular, i.e. det(A) ≠ 0. This is Cramer's rule for the entries of the solution of Ax = b. It shows that a nonsingular A is invertible, since, for a noninvertible A, we can find x and j with x(j) ≠ 0 for which Ax = 0, hence the right side of (45) is zero, and so must det(A) be, since x(j) ≠ 0. Equivalently, it shows that the columns of A are linearly independent. On the other hand, if A is invertible, then 1 = det(I) = det(AA⁻¹) = det(A)det(A⁻¹), hence A is nonsingular, and det(A⁻¹) = 1/det(A). Thus invertibility and nonsingularity coincide, accounting for the great popularity of determinants as a means of telling whether or not a given matrix is invertible. But if such telling is done numerically, then it is usually faster just to solve Ax = b directly than to compute det(A). One verifies directly from the formula (44) that det Aᵀ = det A. Also from that formula, one obtains the very useful Laplace expansion by the last column:

det A = Σ_{i=1}^{n} (−1)^{n−i} A(i, n) det A((1, ..., n)\i, (1, ..., n−1)).

Finally, also from (44), det T = Π_j T(j,j) for any triangular matrix T, since, for any i ∈ S_n other than the identity, there is a j with i(j) < j and a j with i(j) > j, hence triangularity implies that Π_j T(j, i(j)) = 0 for all such i. Thus, for a general A, the fastest way to compute det A is via an LU or PLU factorization.
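To underline the last point, here is a small comparison (entirely my own illustration): the determinant by recursive Laplace expansion, which costs O(n!), against the determinant as a sign-adjusted product of pivots from a PLU factorization, which costs O(n³); scipy.linalg.lu supplies the factorization.

```python
import numpy as np
from scipy.linalg import lu

def det_laplace(A):
    """Laplace expansion by the last column: O(n!), for tiny n only."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    rows = np.arange(n)
    return sum((-1) ** (n - 1 - i) * A[i, n - 1]
               * det_laplace(A[rows != i][:, :n - 1]) for i in range(n))

def det_plu(A):
    """det A = det(P) * prod(diag(U)), det(P) = +-1; O(n^3)."""
    P, L, U = lu(A)                  # A = P @ L @ U, L unit lower triangular
    sign = round(np.linalg.det(P))   # +-1; one could instead count row swaps
    return sign * np.prod(np.diag(U))

A = np.random.default_rng(4).standard_normal((7, 7))
print(det_laplace(A), det_plu(A), np.linalg.det(A))   # all three agree
```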
Now comes the promised Sylvester's identity, which is a direct consequence of the following neat description of Gauss elimination. With k := (1, ..., k), consider the matrix B(i,j) := det A((k,i), (k,j)). Then, on expanding by entries of the last row,

B(i,j) = A(i,j) det A(k, k) − Σ_{r ≤ k} A(i,r) · (cofactor terms involving only rows k).

This shows that B(i,:) ∈ A(i,:) det A(k,k) + span A(k,:), while, directly, B(i,j) = 0 for i ∈ k, since then det A((k,i),(k,j)) has two rows the same. In the same way, B(:,j) ∈ A(:,j) det A(k,k) + span A(:,k), while, directly, B(i,j) = 0 for j ∈ k. Thus, if det A(k,k) ≠ 0, then, for i > k, B(i,:)/det A(k,k) provides the i-th row of the matrix obtained from A after k steps of Gauss elimination. In other words, the matrix S := B/det A(k,k) provides the Schur complement S(k+1:n, k+1:n) in A of the pivot block A(k,k). Since such row elimination is done by elementary matrices with determinant equal to 1, it follows that

det A = det A(k,k) · det S(k+1:n, k+1:n).

Since, for any i, j, B(i,j) depends only on the square matrix A((k,i),(k,j)), this implies

Sylvester's determinant identity. If S(i,j) := det A((k,i),(k,j))/det A(k,k), then, for any sequences i, j in k+1:n,

det S(i, j) = det A((k,i),(k,j))/det A(k,k).

Note: all DIAGRAMS will be provided in the lecture.

PHYSICS 244 NOTES. Lecture 27: ELECTRICAL CONDUCTION

Introduction. We now turn to the conduction of electrical current, which will underlie everything we are going to say about devices. This physical process is, of course, at the heart of all electrical engineering. For many purposes, for example almost all of power engineering, a simplified picture of conduction that only uses classical concepts is pretty much OK. This was invented mainly by German physicists at the end of the 19th century, and it worked well enough for the electrical industry to get started at that time. People were only dealing with DC or low-frequency AC currents at the very beginning, and that is what this lecture deals with. Communication using radio frequencies started soon after, at the end of the 19th and the beginning of the 20th centuries, using the theory of Maxwell (which we don't deal with in this course), the experiments of Hertz and Helmholtz, and finally the practical devices of Marconi. For all of these technologies, quantum mechanics is not really needed. Only in 1948, with the invention of the transistor, did all that change, and change completely. Now quantum mechanics is the starting point for the understanding of nearly all electronics. Nevertheless, the old-fashioned classical picture can form a basis for the full picture, and that is why we start with it.

Classical picture. Consider a cylindrical piece of metal of length L and cross-sectional area A. We want to understand the resistance of the material, defined by a measurement of V and I at the ends. Then R = V/I. Experimentally it is found, by looking at wires with different shapes made from the same material, that R = ρL/A, where ρ is the resistivity of the material. ρ is a property of the material, whereas R is a property of the material AND its dimensions. Our interest here is to connect ρ with the atomic-level constituents of the material.

We apply a voltage to the conductor, and this starts up a current. This seems straightforward, but there is something funny right away. The electric field caused by the voltage creates a force on each electron given by F = qE = −eE. The electrons should respond by an acceleration a = F/m = −eE/m. But the current of the electron is proportional to its velocity v, not its acceleration. This seems to imply that the current for a DC voltage would increase linearly in time, since at constant acceleration the velocity increases linearly in time. What is going on?

Actually, the situation is a familiar one. If you drag a heavy object across the floor, you find that the velocity is proportional to the applied force, not the acceleration. What happens is that you initially accelerate the object.
As it speeds up, the frictional force, which is proportional to the speed, increases until it balances the applied force. The result is motion at constant speed. If you think of the electrons as a fluid, a more helpful analogy might be the relation between a force and the water flowing in a pipe, where the force that pushes the water is balanced by the frictional force resisting the motion. Again, the point is that the speed, not the acceleration, is proportional to the force.

The same thing happens to the electrons. They initially speed up, but then they run into obstacles such as lattice imperfections, impurities, and sound waves. The time between such collisions is called the mean free time, or the relaxation time, and it is denoted by τ. Each time a collision happens, the motion of the electrons is randomized. [DIAGRAM]

Now, we know that the instantaneous motion of the electrons is quite fast: it is the Fermi velocity v_F, which in metals is about 10⁶ m/s, as we have seen. But this motion, when averaged over times longer than τ, is almost completely random in direction. There is only a small leftover part, called the drift velocity v_d, that is not random; it is along the direction of the field (actually opposite to the field, since the charge on the electron is negative). This leftover part v_d comes from the small increment of velocity between collisions.

Resistivity formula. Now we want the connection between ρ and the microscopic structure of the material. First we need the relation between the current and v_d:

I = V/R = VA/(ρL), but V = EL, so I/A = E/ρ, and j := I/A = n e v_d.

That is the first important equation: j = n e v_d. Here n is the volume density of electrons, while j is the current per unit area. Relating this to E yields j = E/ρ, so n e v_d = E/ρ. So what we need to do is find the connection between E and v_d. The electron moves with the very fast v_F between collisions. The acceleration in the time between collisions is what accounts for the drift velocity, so

v_d = aτ = eEτ/m.

We can now get ρ:

ρ = E/(n e v_d) = E/(n e · eEτ/m) = m/(n e² τ).

This is the basic result of this section. Taking each of the factors in turn: the resistivity is higher when the mass is big (the particles are hard to move); the resistivity is bigger if the charge is small (one factor of the charge comes from the force, and one because the contribution of each electron to the current is proportional to its charge); and the resistivity is bigger if τ is small, because it is only during the time τ that the electron picks up its drift velocity.

The distance between collisions is called the mean free path λ. Thus the relaxation time τ = λ/v_F. The main thing we have control over, in trying to get high-conductivity wire, is τ, the relaxation time: we want τ to be as big as possible. τ comes from bumping into imperfections in the lattice, or from bumping into moving atoms (sound waves). The former effect explains why pure metals are better than disordered alloys for conduction, why we prefer pure Cu for wires, and why working a metal, which introduces disruptions into the crystal lattice, raises its resistance; the last effect explains why the resistance of a metal goes up with temperature (positive temperature coefficient). τ, and therefore λ, depends on all these things, but in typical cases, such as Cu at room temperature, λ might be about 500 angstroms, which is 5 × 10⁻⁸ m or so. Putting the numbers into the formula ρ = m/(n e² τ), we find ρ ≈ 10⁻⁸ Ω·m, which is about right for Cu at 300 K (the measured value is 1.7 × 10⁻⁸ Ω·m).

The drift velocity depends on the electric field. For Cu at 300 K, in a field of 1 V/m, we use n e v_d = E/ρ, i.e. v_d = E/(ρ n e):

v_d = 1/(1.7 × 10⁻⁸ × 8.5 × 10²⁸ × 1.6 × 10⁻¹⁹) ≈ 4.3 × 10⁻³ m/s,

which is way smaller than v_F, the instantaneous velocity, which is normally about 10⁶ m/s.
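The arithmetic above is easy to check; the following short script (mine) recomputes τ, ρ, and v_d from the values quoted in these notes (SI units throughout):

```python
# Drude-model estimates for copper, using the values quoted above (SI units).
m_e = 9.11e-31       # electron mass, kg
e = 1.6e-19          # elementary charge, C
n = 8.5e28           # conduction-electron density of Cu, 1/m^3
v_F = 1.6e6          # Fermi velocity, m/s
lam = 5e-8           # mean free path, 500 angstroms, in m

tau = lam / v_F                  # relaxation time, s
rho = m_e / (n * e**2 * tau)     # resistivity, ohm * m
E = 1.0                          # applied field, V/m
v_d = e * E * tau / m_e          # drift velocity, m/s (= E/(rho*n*e))

print(f"tau = {tau:.2e} s")      # ~3e-14 s
print(f"rho = {rho:.2e} ohm*m")  # ~1e-8, near the measured 1.7e-8
print(f"v_d = {v_d:.2e} m/s")    # millimetres per second, << v_F
```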
Theoretical Probability Distributions (October 26, 2006)

Bernoulli process: two outcomes; the probability is fixed across trials; trials are independent.

[Slide: probability tree for three flips of a coin with P(H) = 1/4. One path has 3 heads (probability 1/64), three paths have 2 heads (3/64 each), three have 1 head (9/64 each), one has 0 heads (27/64). Independence gives, e.g., P(H ∩ Hᶜ ∩ Hᶜ) = P(H)P(Hᶜ)P(Hᶜ).]

Probability of any one sequence E with k successes:

P(E) = pᵏ(1 − p)ⁿ⁻ᵏ,

where E is the event of getting k successes in a particular order, k is the number of successes, p is the probability of success on one trial, 1 − p is the probability of failure on one trial, and n is the number of trials.

Counting rule: the number of ways of getting k outcomes on n trials is C(n, k) = n!/(k!(n − k)!); e.g. C(10, 3) = 10!/(7!·3!) = (10·9·8)/(3·2·1) = 120.

Binomial distribution formula:

P(X = k) = C(n, k) pᵏ(1 − p)ⁿ⁻ᵏ, where C(n, k) = n!/(k!(n − k)!).

[Slide: binomial probability table for n = 2, ..., 5 and p = .05, .10, ..., .50; the entries survive only as garbled columns and are omitted here.]

[Slide: binomial example, n = 5, p = 0.2: pmf ≈ .3277, .4096, .2048, .0512, .0064, .0003 for k = 0, ..., 5, with the corresponding cdf reaching 1.0000.]

Poisson distribution:

P(X = k) = λᵏ e⁻ᵏ⁽ᵉˣᵖ⁾.../k!  ... more precisely, P(X = k) = λᵏ e^{−λ}/k!,

where λ is the mean number of occurrences and e ≈ 2.71828 is the natural constant (e = lim_{n→∞}(1 + 1/n)ⁿ, e⁻¹ = lim_{n→∞}(1 − 1/n)ⁿ).

Poisson example, λ = 0.2: P(0) = e^{−0.2} = .81873; P(1) = 0.2 e^{−0.2} = .16375; P(2) = (0.2)² e^{−0.2}/2 = .01637; so P(0) + P(1) + P(2) = .99886 and P(X > 2) = 1 − .99886 = .00114.

[Slide: binomial vs. Poisson. For small p, Binomial(n, p) ≈ Poisson(λ = np); the worked comparison survives only in fragments, with both computations coming out at about .0728 ≈ .07.]

Multinomial distribution:

P(n₁, n₂, ..., n_k) = n!/(n₁! n₂! ⋯ n_k!) · p₁^{n₁} p₂^{n₂} ⋯ p_k^{n_k};

e.g., with n = 10 and p₁ = ⋯ = p₄ = 1/4, P(3, 3, 3, 1) = 10!/(3! 3! 3! 1!) · (1/4)¹⁰ = .01602.

[Slide: pmf plots of Binomial(p = .4, n = 10) and Poisson(λ = 4).]

Expected value and variance of a discrete distribution:

E(X) = Σᵢ xᵢ P(xᵢ), Var(X) = Σᵢ (xᵢ − μ)² P(xᵢ).

Mean and variance of the binomial: μ = np, σ = √(np(1 − p)).

Expected value and variance of a continuous distribution:

E(X) = ∫ₐᵇ x p(x) dx, Var(X) = ∫ₐᵇ (x − μ)² p(x) dx.

Normal distribution. [Slide: the normal density, plotted from −4 to 4 standard deviations.]

Standard normal distribution: z = (x − μ)/σ, with density p(z) = (1/√(2π)) e^{−z²/2}.

Standard normal table: the table entry for z is the probability lying below z.

68-95-99.7 rule: approximately 68% of observations fall within one standard deviation of the mean; approximately 95% fall within two standard deviations; approximately 99.7% fall within three standard deviations.
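The formulas above translate directly into code. This sketch (mine) computes binomial, Poisson, and multinomial pmfs from scratch and reproduces three of the slide numbers: the Binomial(5, 0.2) pmf, the Poisson(0.2) value P(2), and the multinomial value P(3, 3, 3, 1).

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) = lam^k e^(-lam) / k!."""
    return lam**k * exp(-lam) / factorial(k)

def multinomial_pmf(counts, probs):
    """P(n_1, ..., n_k) = n!/(n_1! ... n_k!) * p_1^n_1 ... p_k^n_k."""
    n = sum(counts)
    coeff = factorial(n)
    for c in counts:
        coeff //= factorial(c)
    out = float(coeff)
    for c, p in zip(counts, probs):
        out *= p**c
    return out

print([round(binom_pmf(k, 5, 0.2), 4) for k in range(6)])
# [0.3277, 0.4096, 0.2048, 0.0512, 0.0064, 0.0003]
print(round(poisson_pmf(2, 0.2), 5))                        # 0.01637
print(round(multinomial_pmf((3, 3, 3, 1), (0.25,) * 4), 5))  # 0.01602
```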
Normal approximation for the binomial (valid when np > 5 and n(1 − p) > 5): approximate Binomial(n, p) by a normal with μ = np and σ = √(np(1 − p)).

Example: 10 heads on 15 flips, p = 0.5. μ = 15 · 0.5 = 7.5, σ = √(15 · 0.25) = 1.9365; z = (10 − 7.5)/1.9365 = 1.29, and P(Z ≥ 1.29) = .0985.

Correction for continuity: P(H ≥ 10) = 1 − P(H ≤ 9) ≈ P(normal ≥ 9.5); z = (9.5 − 7.5)/1.9365 = 1.033, and P(Z ≥ 1.033) = .151.

Proportions vs. binomial: for k successes on n trials, p̂ = k/n has mean p and standard error √(p(1 − p)/n).

Chi-square distribution: χ²_ν = Z₁² + Z₂² + ⋯ + Z_ν², where ν is called the degrees of freedom; E(χ²_ν) = ν and Var(χ²_ν) = 2ν.

t distribution: lim_{ν→∞} t_ν = Normal(0, 1).

F distribution: F = (χ²_{ν₁}/ν₁)/(χ²_{ν₂}/ν₂).

[Slides: 'The Population' (FT/Bush polling example); 'Three Distributions: One Sample, Sample Means'.]

Mean of a normal distribution: if X and Y are normally distributed, then any linear combination of X and Y is also normally distributed.

[Slide: sampling distribution of the sample mean of a uniform distribution.]

Central limit theorem: the sampling distribution of a sample mean of X approaches normality as the sample size gets large, regardless of the distribution of X. The mean of this sampling distribution is μ_X, and the standard deviation (standard error) is σ_X/√n. The sampling distribution of any linear combination of N random variables approaches normality as N gets large.

[Slide: sampling distribution of the sample mean of an arbitrary distribution.]

Small samples (finite population correction): with N the population size and n the sample size,

SE(X̄) = (σ_X/√n) · √((N − n)/(N − 1)), SE(p̂) = √(p(1 − p)/n) · √((N − n)/(N − 1)).

Evolution of Quantitative Traits

Today: mutational variance for quantitative traits; drift and the evolution of quantitative traits; maintenance of variation for quantitative traits; genotype-by-environment interaction; evolution of correlated traits.

Mutational variance for QTL. At what rate does new variation for quantitative traits arise? The experiment with Mutation Accumulation (MA) lines:
- use a fully inbred line
- self or sib-mate replicate lines for many generations
- limit natural selection
- freeze or seed-bank control lines
- grow terminal lines and controls in a common garden
- measure variance

[Slide: founder at mutation-drift equilibrium; grow the lines for g generations, then measure the variance.]

Mutational variance for QTL:

V_m = Σᵢ 2μᵢ aᵢ² (diploid),

where μᵢ is the mutation rate at the i-th locus and aᵢ is the effect of substituting one gene (one gene becoming heterozygous). V_m is the mutational variance introduced each generation; it considers only non-segregating variance and only additive effects. V_m can be scaled by V_E and called the mutational heritability h²_m; since V_m ≪ V_E, the latter is used rather than V_P.

How can V_m be measured? The expected number of fixed mutations between a pair of lines at equilibrium is 2 × (2Nμ) × 1/(2N) per generation, i.e. 2μt after t generations. The contribution of each fixed mutation to the variance among the two lines is E(2a)²/2 = 2E(a²). Summing over time (t generations) and all loci, the variance among lines is 2V_m t. Answer: V_m is estimated as the variance among MA lines divided by 2t.

[Slide: body size in C. elegans; selfing of 152 replicate lines; h²_m = 0.004; plot of body volume for control vs. mutation-accumulation lines; from Barton et al. 2007.]

[Slide: table of mutational heritabilities h²_m for various organisms and traits (the readable entries include bristle number .0035, wing dimensions .0020, flour beetle .0091, mouse limb bone .0234; further entries are garbled here); from Lynch and Walsh 1998.]

Fun with numbers: V_m/V_E is roughly 0.0025 for many organisms and traits. It would require roughly 400 generations to attain h² = 0.5.
Mutational Variance for QTL: complications
- Multiple mutations in a single MA line
- Preexisting heterozygosity in the founder (mutation-drift equilibrium)
- Directionality of the mutations
- Relevance to the variance available to natural selection

Genetic Drift for QTL
How important is genetic drift in the phenotypic divergence of populations?
E[sigma^2_A(t)] = (1 - 1/(2N)) E[sigma^2_A(t-1)] + sigma^2_m
At equilibrium within a population: E[sigma^2_A] = 2 Ne sigma^2_m
At time t, between populations: E[Var(z-bar)] = 2 sigma^2_m t
As with molecular variation, phenotypic divergence is dependent on time and the mutation rate, but not on population size.

[Figure: between-line variance accumulating over generations for mu = .001/locus/gen, with Ne = 2 and 50 loci, Ne = 10 and 10 loci, and Ne = 10 and 50 loci; from Hartl and Clark 1997.]

Genetic Drift for QTL
Testing whether there has been more divergence between species or populations than can be explained by drift alone: compare the observed divergence with the neutral expectation 2 sigma^2_m t. Roughly, if
(Delta z-bar)^2 > (1.96)^2 * 2 sigma^2_m t,
then evolution has been too rapid to be explained by neutral processes, with 95% confidence.

The evolutionary biologist's "just so" story.

What maintains heritable variation in phenotypes?
Is most phenotypic variation neutral, a balance between mutation and drift?
H^2 = VG / (VG + VE), with VG = 2 Ne Vm at equilibrium. Taking Vm = 0.001 VE:
Ne:    100    1000    10000
H^2:   0.17   0.67    0.95
Neutral mutation can maintain very substantial amounts of genetic variation, even in very small populations.

What maintains heritable variation in phenotypes?
Mutation-stabilizing selection balance:
- subject of theoretical study
- large Vm and weak selection
- prediction: alleles of large effect should be kept at low frequency
- prediction: h^2 for fitness-related traits >> h^2 for morphological traits

What maintains heritable variation in phenotypes?
Balancing and variable selection:
- spatial adaptation with gene flow
- temporal fluctuations in the optimal phenotype
- resource specialization (small seeds vs. large seeds)
- frequency-dependent selection
- heterozygote advantage

Genotype-Environment Interaction
- Phenotypic plasticity: the same genotype exhibits different phenotypes in different environments
- Reaction norm: the pattern of phenotypic expression of genotypes across a range of environments
[Figure: three reaction-norm panels (phenotype vs. environment 1 -> 2): no GxE and no plasticity; no GxE with plasticity; GxE and overall plasticity.]

Genotype-Environment Interaction
- Reciprocal transplant experiment (Clausen, Keck and Hiesey 1940) with Potentilla glandulosa
[Figure: heights of plants (cm) grown at Stanford (30 m), Mather (1400 m), and Timberline (3050 m).]
[Figure: fitness of pea aphid (Acyrthosiphon pisum) genotypes on test crops; from Conner and Hartl 2004.]

Correlated Traits
- Traits are often correlated, strongly so in some cases
- Are the correlations genetic or environmental?
- Can selection disrupt the correlations, or is evolution constrained?
- Positive correlations are common for morphological traits (from Conner and Hartl 2004):

Species    Trait 1         Trait 2       r_p    r_a (or r_g)    r_e
Finch      wing length     bill length   .60    .95             .68
Phlox      plant height    petal width   .17    .42             .13
Daphnia    body size       clutch size   .35    .11             .49
Mouse      brain size      body size     .22    .23             .34

Correlated Traits
- Covariance: a measure of association that can be partitioned, COV_P = COV_A + COV_D + COV_E
- These components can be estimated using parent-offspring regression:
  COV_add = 2 cov(X_offspring, Y_midparent) = 2 cov(Y_offspring, X_midparent)
- Either pleiotropy or linkage can cause a genetic correlation
- Epistatic selection: due to the interaction of alleles at different loci
[Figure: thrum and homostyle flower morphs; from Conner and Hartl 2004.]

Selection on Correlated Traits
The multivariate breeder's equation, Delta z-bar = G beta, written out for three traits:
Delta z-bar_1 = sigma^2_1 beta_1 + sigma_12 beta_2 + sigma_13 beta_3
Delta z-bar_2 = sigma_12 beta_1 + sigma^2_2 beta_2 + sigma_23 beta_3
Delta z-bar_3 = sigma_13 beta_1 + sigma_23 beta_2 + sigma^2_3 beta_3
- Delta z-bar is a vector of responses after one generation of selection
- G is the additive genetic variance-covariance matrix
- beta is a vector of selection gradients, determined by multiple regression of fitness on the traits
- the G matrix can be estimated using unilineal relatives
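A minimal numerical sketch of the breeder's equation (see the Scarlet Gilia example below for real traits); the G matrix and gradients here are made-up illustrative numbers, not estimates from any study:

```python
import numpy as np

# Toy multivariate breeder's equation, delta_z = G @ beta.
G = np.array([[0.40, -0.15,  0.05],    # additive variances on the diagonal,
              [-0.15, 0.30,  0.02],    # additive covariances off it
              [0.05,  0.02,  0.10]])
beta = np.array([0.20, 0.10, -0.05])   # selection gradients

delta_z = G @ beta
print(delta_z)
# Trait 1 responds less than sigma_1^2 * beta_1 = 0.08 alone would predict,
# because its negative covariance with trait 2 drags the response back.
```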
Selection on Correlated Traits: an example, Scarlet Gilia
Delta z-bar = G beta for corolla length, corolla width, and time the stigma is receptive.
[Table: the estimated Delta z-bar, G, and beta for the three traits; the numeric entries did not survive conversion.]
- direct selection on all traits
- negative covariance causes a constraint on evolution
- very little additive variance for the amount of time the stigma is receptive
(From Conner and Hartl 2004.)

RESEARCH STATEMENT

Mathew Joseph

My research has been in the area of probability known as Random Walks in Random Environment (RWRE). An overview of this topic can be found in the lecture notes by Sznitman [5] and Zeitouni [14]. The theory of random walks has been central to probability theory. RWRE was introduced by Solomon [13] in 1975 as a natural generalisation of the classical random walk and has been a very active research area since. Whereas the classical random walk aims to explain the motion of a particle in a constant medium, RWRE looks at the motion of a particle in an inhomogeneous medium. In the classical random walk on the integer lattice, the transition (jump) probabilities are independent of the position of the particle performing the walk. In RWRE the transition probabilities may vary from site to site: they are initially chosen at random, and the motion of the particle depends on the chosen collection of transition probabilities.

More formally, let
Ω = { ω = (ω_x)_{x∈Z^d} : ω_x = (ω_{x,y})_{y∈Z^d} ∈ [0,1]^{Z^d}, Σ_y ω_{x,y} = 1 }
be the set of transition probabilities at the sites x ∈ Z^d. A probability measure ℙ on Ω is chosen which is stationary and ergodic with respect to the spatial shift maps. The set Ω is the environment space, and ℙ gives a probability measure on the set of environments; hence the name random environment. For each ω ∈ Ω and z ∈ Z^d, define a Markov chain (X_n)_{n≥0} on Z^d and a probability measure P_ω^z on the sequence space (Z^d)^{Z_+} such that
P_ω^z(X_0 = z) = 1,  P_ω^z(X_{n+1} = x + y | X_n = x) = ω_{x,y} for x, y ∈ Z^d.
There are thus two steps involved: first the environment ω is chosen at random according to the probability measure ℙ, and then we have the random walk with path probabilities P_ω^z. P_ω^z is the quenched measure (assume z is fixed beforehand). The averaged measure P_z is obtained by averaging the quenched measure with respect to ℙ:
P_z( (X_n)_{n≥0} ∈ A ) = ∫_Ω P_ω^z( (X_n)_{n≥0} ∈ A ) dℙ(ω).

While one-dimensional RWRE is fairly well understood, there are still many fundamental questions about multidimensional RWRE which have not been resolved. For example, questions like whether the RWRE will return to the origin with probability one (recurrence) cannot be answered in general. Law of large numbers results and central limit theorems are known only in some special cases.
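To make the two-step structure concrete, here is a small simulation sketch (my own illustration, not part of the statement): a one-dimensional nearest-neighbour RWRE in which each site's right-jump probability is drawn i.i.d. uniform. Many walks in one fixed environment estimate a quenched mean, while drawing a fresh environment for every walk estimates the averaged law.

```python
import random

def draw_environment(width=200):
    # i.i.d. environment: omega[x] = probability of stepping right from x
    return {x: random.uniform(0.1, 0.9) for x in range(-width, width + 1)}

def walk(omega, n_steps):
    # one trajectory started at the origin in the fixed environment omega
    x = 0
    for _ in range(n_steps):
        x += 1 if random.random() < omega[x] else -1
    return x

random.seed(0)
n, reps = 100, 2000

env = draw_environment()                          # quenched: ONE environment
quenched_mean = sum(walk(env, n) for _ in range(reps)) / reps

annealed_mean = sum(walk(draw_environment(), n)   # averaged: a fresh
                    for _ in range(reps)) / reps  # environment each time

print(quenched_mean, annealed_mean)
# The quenched mean is a random variable of the drawn environment;
# the averaged mean is ~0 here by the symmetry of Uniform(0.1, 0.9).
```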
Fluctuations of the quenched mean. In my first project (see [7]) I analysed the fluctuations of the quenched mean E_ω[X_n] = Σ_z z P_ω(X_n = z), i.e. the average position of the walk in n steps given the environment ω. Notice that, as a function of ω, this is a random variable. A handful of papers have dealt with this subject. Bernabei [2] and Boldrighini and Pellegrinotti [3] deal with this question in the case where ℙ is assumed to be independent and identically distributed (i.i.d.) and there is a direction in which the walk moves deterministically one step upwards (the time direction). Bernabei [2] showed that the centered quenched mean, normalised by its standard deviation, converges to a normal random variable, and he also determined the order of that standard deviation. In [3] the authors prove a central limit theorem for the correction caused by the random environment on the mean of a test function. Both these papers, however, assume that there are only finitely many transition probabilities. Balázs, Rassoul-Agha and Seppäläinen [1] replace the assumption of finitely many transition probabilities with a "finite range" assumption, for d = 2, to prove an invariance principle for the quenched mean. In my paper [7] I extended the results of [1] to the case where the walk is allowed to make larger steps upward. Another reason for looking at the quenched mean is that recently a number of papers by Rassoul-Agha and Seppäläinen (see [10], [11]) prove quenched CLTs for X_n using subdiffusivity of the quenched mean.

In what follows, e_1 = (1,0) and e_2 = (0,1). Let λ_n = inf{ k ≥ 1 : X_k · e_2 ≥ n } denote the first time when the walk reaches level n or above (the second coordinate is at least n). Denote the drift at the point x by D(x, ω) = E_ω^x[X_1] − x, and let B be a Brownian motion in R^2 with diffusion matrix Γ = E[D Dᵀ] − E[D] E[D]ᵀ. Using a modification of the well-known martingale functional central limit theorem, I was able to prove a functional central limit theorem for the expected position of the walk on crossing level n: after centering and normalising, it converges weakly to a centered Gaussian limit.

THEOREM 0.1. The processes
n^{-1/4} ( E_ω[X_{λ_⌊nt⌋} · e_1] − E[X_{λ_⌊nt⌋} · e_1] ),  t ≥ 0,
converge weakly in D([0,∞)) with the Skorohod topology; the limit is a centered Gaussian process whose variance is given in terms of appropriate positive constants (see [7] for the definitions).

The main result in the paper is a bit more general than the above. A key ingredient in the proof of the theorem is the limits of certain Green functions. This gives us an indication of what the proper scaling should be for a similar theorem in higher-dimensional space-time random environments: it should be √(log n) for d = 3 and an appropriate constant for d ≥ 4.

Independent particles in a dynamical random environment. This is an ongoing project with my advisor, Prof. Timo Seppäläinen (see [8]). Suppose we have particles distributed on the d-dimensional integer lattice, and let them all evolve independently according to the transition probabilities of a prespecified discrete random walk. It is known that independent Poisson occupation variables form an invariant measure for this system, in the sense that starting with this distribution, the distribution of particles at any later time will also be independent Poisson. A natural question is: are there any invariant measures on particle configurations for finite-range space-time (the environment changes with each time point) i.i.d. random environments? In other words, are there measures on particle configurations such that, if we let the particles evolve independently in the same random environment and observe the distributions at future times, then after averaging over the environment the distribution is the same as the initial one? We have been able to show that the class of invariant measures for this system consists of mixtures of inhomogeneous Poisson product measures.
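Before sketching how this is derived, here is the constant-environment baseline fact in simulation form (my own illustration): starting from i.i.d. Poisson occupations and letting every particle walk independently, the occupation field stays Poisson, which shows up as the mean and variance of the site counts agreeing.

```python
import math, random
from collections import Counter

random.seed(1)
L, rho = 50, 3.0        # ring of L sites; Poisson mean occupation rho

def poisson(lam):
    # Knuth's multiplication method, to keep the sketch dependency-free
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

# initial configuration: i.i.d. Poisson(rho) particles per site
particles = [x for x in range(L) for _ in range(poisson(rho))]

# every particle takes 20 independent +/-1 steps (periodic boundary)
for _ in range(20):
    particles = [(x + random.choice((-1, 1))) % L for x in particles]

occ = Counter(particles)
counts = [occ.get(x, 0) for x in range(L)]
mean = sum(counts) / L
var = sum((c - mean) ** 2 for c in counts) / L
print(mean, var)        # a Poisson field has mean = variance (~rho)
```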
To obtain this result we began by deriving the well-known invariant density for the environment process seen by a single tagged particle. For what follows, consider a space-time environment on Z × Z^d (time × space); the particles are distributed on {0} × Z^d, and their evolution depends only on the part of the environment from time 0 to ∞. Let π_{−1,x}^{0,0}(ω) denote the transition probability in ω from (−1, x) to (0, 0).

PROPOSITION 0.2. There exists a function 0 ≤ f < ∞ on Ω such that E[f] = 1, f(ω) is a function of the part of the environment below level 0, and
f(ω) = Σ_{x∈Z^d} f(T_{−1,x} ω) π_{−1,x}^{0,0}(ω),  ℙ-almost surely.

For 0 ≤ ρ < ∞ and ω ∈ Ω, let μ_ω^ρ be a probability distribution on particle configurations such that the occupation variables (η(x))_{x∈Z^d} are independent Poisson variables and η(x) has mean ρ f(T_{0,x} ω). Note that μ_ω^ρ depends on ω only through the levels below 0. Define the averaged measure μ^ρ = ∫ μ_ω^ρ dℙ(ω).

THEOREM 0.3. For each 0 ≤ ρ < ∞, μ^ρ is the unique invariant distribution for the process that is also invariant and ergodic under spatial translations and has mean occupation ρ.

We have further been able to show the following.

THEOREM 0.4. Any initial distribution which is invariant and ergodic under spatial translations and has mean occupation ρ converges to the equilibrium measure μ^ρ.

The second part of the project is to study the current fluctuations across the characteristic for a system of particles in a space-time random environment. This is an extension of earlier work done for the case of independent random walks (see [12]). Suppose initially we have particles distributed on the one-dimensional lattice points according to an i.i.d. distribution, and they all evolve independently in the same random environment. Imagine an observer moving at the averaged velocity (the characteristic speed) of the RWRE; we are interested in the particle current seen by the observer. More generally, the characteristic speed is simply the derivative f'(ρ) of the flux f(ρ) as a function of particle density ρ. The flux f(ρ) is the mean rate of flow across a fixed bond of the lattice when the system is stationary with density ρ. For general asymmetric particle systems in one dimension it is expected, and supported by known rigorous results, that the current fluctuations are of order n^{1/4} with a Gaussian limit if the macroscopic flux is linear, and of order n^{1/3} with the Tracy-Widom limit if the flux is strictly convex or concave. The latter is called the KPZ universality class. For the case of independent random walks (see [12], [9]) the current fluctuations are of order n^{1/4}. The motivation for our project is to investigate how the random environment influences fluctuations in the n^{1/4} universality class. So far we seem to be discovering that the dynamical environment does not change the fluctuation picture.

Let me now give the precise result. Let v denote the averaged speed of the walk and let σ² denote the averaged variance. Let Y_n(t, r) = Y_n^1(t, r) − Y_n^2(t, r), where
Y_n^1(t, r) = Σ_i 1{ X_0^i > r√n, X_{⌊nt⌋}^i ≤ ⌊nt⌋v + r√n }
and
Y_n^2(t, r) = Σ_i 1{ X_0^i ≤ r√n, X_{⌊nt⌋}^i > ⌊nt⌋v + r√n }.
In words, Y_n^1(t, r) (resp. Y_n^2(t, r)) counts the walks which are initially to the right (resp. left) of the characteristic line r√n + nvt and are to the left (resp. right) of the characteristic line at time nt. Let ρ_0 and σ_0² denote the mean and variance of the initial particle occupation variables. Let W be a two-parameter Brownian motion defined on R_+ × R and B a one-parameter Brownian motion defined on R, and further assume that W and B are independent of each other. Let φ_{v²} and Φ_{v²} denote the centered normal density and the centered normal distribution function with variance v². We have the following result.

THEOREM 0.5. Fix t_1, t_2, …, t_N ∈ R_+ and r_1, r_2, …, r_N ∈ R. Let
X_n = n^{-1/4} ( Y_n(t_1, r_1), Y_n(t_2, r_2), …, Y_n(t_N, r_N) )
and Z = ( Z(t_1, r_1), Z(t_2, r_2), …, Z(t_N, r_N) ), where Z is a Gaussian process with the stochastic integral representation
Z(t, r) = √ρ_0 ∫_{[0,t]×R} φ_{σ²(t−s)}(r − x) dW(s, x) + σ_0 ∫_R sign(x − r) Φ_{σ²t}(−|x − r|) dB(x).
Then X_n ⇒ Z under the averaged measure. The first integral in the definition of Z(t, r) represents the space-time noise created by the dynamics, and the second integral represents the initial noise propagated by the evolution.
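As a crude sanity check on the n^{1/4} order, here is a Monte Carlo sketch (my own, for the constant-environment baseline): symmetric walks, so v = 0 and the characteristic line is x = 0, started from Poisson initial occupations; the mean square of the current Y_n(1, 0) should then grow like √n.

```python
import math, random

random.seed(2)

def poisson(lam):
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def current(n, rho=1.0):
    # Y_n(1, 0) for symmetric +/-1 walks; only sites with |x0| <= n can
    # cross the origin within n steps, so that range of sites suffices.
    y = 0
    for x0 in range(-n, n + 1):
        for _ in range(poisson(rho)):
            x = x0 + sum(random.choice((-1, 1)) for _ in range(n))
            if x0 >= 1 and x <= 0:
                y += 1          # Y^1: started right of the line, ended left
            elif x0 <= 0 and x >= 1:
                y -= 1          # Y^2: started left, ended right
    return y

for n in (25, 100):
    sample = [current(n) for _ in range(150)]
    mean_sq = sum(c * c for c in sample) / len(sample)   # ~Var(Y_n)
    print(n, mean_sq / math.sqrt(n))   # ratio roughly constant in n
```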
Future work. One of the major open questions in RWRE is the so-called 0-1 law: is it true that P_0(X_n · ℓ → ∞) ∈ {0, 1} for all directions ℓ? While this has been proved for two-dimensional i.i.d. nearest-neighbour uniformly elliptic RWRE (uniformly elliptic: there is a δ > 0 such that the probability of jumping to a neighbour is at least δ, for almost all environments), it is open for higher-dimensional i.i.d. uniformly elliptic RWRE. In fact, a counterexample is known for uniformly elliptic mixing environments in dimensions d ≥ 3 [6]. Jonathan Peterson (NSF postdoc in the UW-Madison math department) and I are investigating the 0-1 law for uniformly elliptic mixing environments in dimension d = 2.

The "environment viewed from the particle" is the environment with the origin moved to the present position of the particle. It is easily shown that this forms a Markov chain on the environment space under both the quenched and averaged measures. An important question is whether there exists a measure ℙ_∞ on the environment space which is invariant for the environment Markov chain and is equivalent to the original measure ℙ (see [4] and the references therein). For one-dimensional nearest-neighbour RWRE which are ballistic (which have a non-zero limiting velocity), the existence of ℙ_∞ can be shown. I am investigating the question for the case of ballistic multidimensional nearest-neighbour RWRE.

There are other areas in probability which I find interesting. Apart from the introductory courses in probability, I have taken courses in stochastic calculus, large deviations and statistical mechanics. This year I am taking a course in interacting particle systems, specifically the analysis of the corner growth model. Last summer I attended the Park City Summer School in Utah and the Cornell Summer School, where I was introduced to SLE, random matrices, the dimer model, determinantal processes and queueing theory. I hope to work in some of these areas in the future.

REFERENCES

[1] Márton Balázs, Firas Rassoul-Agha, and Timo Seppäläinen. The random average process and random walk in a space-time random environment in one dimension. Comm. Math. Phys., 266(2):499-545, 2006.
[2] M. S. Bernabei. Anomalous behaviour for the random corrections to the cumulants of random walks in fluctuating random media. Probab. Theory Related Fields, 119(3):410-432, 2001.
[3] C. Boldrighini and A. Pellegrinotti. T^{-1/4}-noise for random walks in dynamic environment on Z. Mosc. Math. J., 1(3):365-380, 470-471, 2001.
[4] Erwin Bolthausen and Alain-Sol Sznitman. On the static and dynamic points of view for certain random walks in random environment. Methods Appl. Anal., 9(3):345-375, 2002. Special issue dedicated to Daniel W. Stroock and Srinivasa S. R. Varadhan on the occasion of their 60th birthday.
[5] Erwin Bolthausen and Alain-Sol Sznitman. Ten lectures on random media, volume 32 of DMV Seminar. Birkhäuser Verlag, Basel, 2002.
[6] Maury Bramson, Ofer Zeitouni, and Martin P. W. Zerner. Shortest spanning trees and a counterexample for random walks in random environments. Ann. Probab., 34(3):821-856, 2006.
[7] Mathew Joseph. Fluctuations of the quenched mean of a planar random walk in an i.i.d. random environment with forbidden direction. arXiv:0809.0320v1.
[8] Mathew Joseph and Timo Seppäläinen. Independent particles in a dynamical random environment. In preparation, 2008.
[9] Rohini Kumar. Space-time current process for independent random walks in one dimension. To appear in ALEA Lat. Am. J. Probab. Math. Stat.
[10] Firas Rassoul-Agha and Timo Seppäläinen. An almost sure invariance principle for random walks in a space-time random environment. Probab. Theory Related Fields, 133(3):299-314, 2005.
[11] Firas Rassoul-Agha and Timo Seppäläinen. Ballistic random walk in a random environment with a forbidden direction. ALEA Lat. Am. J. Probab. Math. Stat., 1:111-147, 2006.
[12] Timo Seppäläinen. Second-order fluctuations and current across characteristic for a one-dimensional growth model of independent random walks. Ann. Probab., 33(2):759-797, 2005.
[13] Fred Solomon. Random walks in a random environment. Ann. Probability, 3:1-31, 1975.
[14] Ofer Zeitouni. Random walks in random environments, volume 1837 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 2004. Lectures from the 31st Summer School on Probability Theory held in Saint-Flour, July 8-25, 2001. Edited by Jean Picard.
Large Sample and Small Sample
02/16/2007

Just as their names imply, a large sample property is a property of a statistic as n → ∞ (here n is the sample size), while a small sample property holds for n fixed and finite. Basically, these properties can be summarized in the following table:

Large Sample                           Small Sample
Consistency (→p, LLN)                  Unbiasedness (E)
Asymptotic Distribution (→d, CLT)      Exact Distribution (N)
Asymptotic Efficiency (AVar)           Minimal Variance (Var)

The first question is why we need large sample theory. Basically, if we don't derive the large sample properties, then we know almost nothing about the statistic we are using. In most cases we don't know much about the exact small sample properties of our statistic, except through some Monte Carlo simulation results (and Monte Carlo actually assumes much more than we are assuming). The solution to this problem is to use some approximation to the exact distribution. The two most popular methods of approximating the exact distribution are large sample theory and the bootstrap; here I will concentrate on large sample theory. Another justification of large sample theory is that it lies behind the philosophy of the frequentist: we could "potentially" get infinitely many observations, so we could, and should, check what happens as n → ∞. In summary, large sample theory is the benchmark for any further development.

Any statistic θ̂ is a function of the sample, that is, of i.i.d. observations: θ̂ = f(z_1, …, z_n) = g_n(F), where F is the distribution of z_i. Large sample theory lets n go to ∞ and checks what happens, while the bootstrap fixes n but substitutes F by F_n, the empirical distribution.

(Email: pingyu@wisc.edu)

Example 1 (Midterm 2000, #1). The model is y_i = x_i β + e_i, E(e_i | x_i) = 0, where x_i and e_i are scalar. We consider the estimator
β̃ = (Σ_{i=1}^n y_i) / (Σ_{i=1}^n x_i).
We assume that x_i and e_i have finite 4th moments and that (y_i, x_i) are a random sample (i.i.d.).
(a) Find E(β̃ | X).
(b) Find Var(β̃ | X).
(c) Show that β̃ →p β as n → ∞. Does this require any additional assumptions?
(d) Find the asymptotic distribution of √n (β̃ − β) as n → ∞.
(e) Without imposing any additional assumption, is β̃ necessarily less efficient than OLS?

Solution.
(a) Since β̃ = β + (Σ_i e_i)/(Σ_i x_i) and E(e_i | X) = 0,
E(β̃ | X) = β + Σ_i E(e_i | X) / (Σ_i x_i) = β.
This is conditional unbiasedness, a small sample property.

(b) Var(β̃ | X) = Var(Σ_i e_i | X)/(Σ_i x_i)² = Σ_i E(e_i² | x_i)/(Σ_i x_i)², which reduces to n σ²/(Σ_i x_i)² under homoskedasticity (σ² = E(e_i²)). This is a conditional variance, again a small sample property. By the Gauss-Markov theorem, OLS is best in small samples in the homoskedastic environment; β̃ is not the OLS estimator, so it need not be best in this sense.

(c) This is consistency, a large sample property. Write
β̃ = β + (n⁻¹ Σ_i e_i)/(n⁻¹ Σ_i x_i).
By the LLN, n⁻¹ Σ_i e_i →p E(e_i) = 0 and n⁻¹ Σ_i x_i →p μ ≡ E(x_i). Why can we then pass to the limit of the ratio? By the continuous mapping theorem, provided the limit of the denominator is nonzero. So β̃ →p β + 0/μ = β, and we require the additional assumption that μ ≠ 0.

(d) This is asymptotic normality, a large sample property. By the CLT, n^{-1/2} Σ_i e_i →d N(0, σ²), where σ² = E(e_i²). Therefore, assuming μ ≠ 0,
√n (β̃ − β) = (n^{-1/2} Σ_i e_i)/(n⁻¹ Σ_i x_i) →d N(0, σ²/μ²),
by Slutsky's theorem.
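A quick simulation sketch of parts (c) and (d) (my own, not part of the midterm): with x_i ~ N(1, 1), so that μ = 1, and homoskedastic e_i ~ N(0, 1), the standardized estimator √n(β̃ − β) should be approximately N(0, σ²/μ²) = N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
beta, mu, n, reps = 2.0, 1.0, 500, 2000

stats = []
for _ in range(reps):
    x = rng.normal(mu, 1.0, n)
    e = rng.normal(0.0, 1.0, n)
    y = beta * x + e
    beta_tilde = y.sum() / x.sum()     # the ratio estimator from Example 1
    stats.append(np.sqrt(n) * (beta_tilde - beta))

stats = np.asarray(stats)
print(stats.mean(), stats.var())       # ~0 and ~sigma^2/mu^2 = 1
```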
(e) If we only knew that E(x_i e_i) = 0, then the OLS estimator would be asymptotically efficient; that is, no other consistent estimator could have a smaller asymptotic variance than the OLS estimator. But in the linear regression model with E(e_i | x_i) = 0 this need not be true, since the semiparametric efficient estimator is the GLS estimator. For this concrete example, let's check whether Avar(β̃) can be smaller than Avar(β̂), the asymptotic variance of the OLS estimator:
Avar(β̂) = E(x_i² e_i²)/(E(x_i²))² = [E(x_i²) E(e_i²) + cov(x_i², e_i²)]/(E(x_i²))²,
Avar(β̃) = E(e_i²)/(E(x_i))².
Although (E(x_i))² ≤ E(x_i²), if cov(x_i², e_i²) is large enough relative to E(x_i²) E(e_i²) we could still get Avar(β̂) > Avar(β̃). With this intuition in mind, compute the difference concretely:
Avar(β̂) − Avar(β̃) > 0  ⟺  cov(x_i², e_i²)/(E(x_i²) E(e_i²)) > E(x_i²)/(E(x_i))² − 1 = Var(x_i)/(E(x_i))².
For example, suppose e_i ~ N(0, 1) and x_i ~ N(1, 1) marginally, so that Var(x_i)/(E(x_i))² = 1; the condition becomes cov(x_i², e_i²) > E(x_i²) E(e_i²) = 2, i.e. a correlation ρ between x_i² and e_i² with ρ √(Var(x_i²) Var(e_i²)) = ρ √12 > 2. So if we let ρ > 1/√3, then Avar(β̂) > Avar(β̃).

Discussion of (e). We know the GLS estimator is semiparametric efficient in the linear regression model. If E(e_i² | x_i) ∝ x_i (obviously we are then assuming x_i > 0 a.s.), then β̃ is actually the GLS estimator, so it isn't surprising that β̃ can be asymptotically more efficient than OLS.

Note: in this course, whenever we mention "efficiency" we always mean "asymptotic efficiency." (Do you ever wonder why we call this efficiency "semiparametric" efficiency?)

Definition 1. If a_n(T_n − θ) →d Y with 0 < E(Y²) < ∞ (a_n → ∞ or a_n → a > 0), and b_n(T'_n − θ) →d Z with 0 < E(Z²) < ∞ (b_n → ∞ or b_n → b > 0), then the asymptotic relative efficiency of T'_n with respect to T_n is defined, for a_n = b_n, to be E(Y²)/E(Z²). E.g., with a_n = b_n = √n, Y ~ N(0, σ₁²) and Z ~ N(0, σ₂²), the asymptotic relative efficiency of T'_n with respect to T_n is σ₁²/σ₂²; that is, if σ₁² > σ₂², then the ratio is > 1 and T'_n is more efficient than T_n.

Example 2 (Midterm 2005, #2). Take the model y_i = x_i'β + e_i with E(x_i e_i) = 0. Suppose you have two independent random samples of observations, of sizes n_1 and n_2. Let β̂_1 and β̂_2 denote the least-squares estimates of β on each sample. Let β̄ = (β̂_1 + β̂_2)/2 denote the average of the two estimates, and let β̂ denote the least-squares estimate on the combined sample. Which is more efficient, β̄ or β̂? When are they asymptotically equivalent?
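A simulation sketch (my own) of the point in part (e) above: under heteroskedasticity of the form E(e_i² | x_i) ∝ x_i, the ratio estimator β̃ coincides with GLS and has a smaller asymptotic variance than OLS.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, n, reps = 2.0, 500, 2000

ols_err, ratio_err = [], []
for _ in range(reps):
    x = rng.uniform(0.5, 2.5, n)          # x_i > 0 a.s.
    e = rng.normal(0.0, np.sqrt(x))       # E(e^2 | x) = x: heteroskedastic
    y = beta * x + e
    ols_err.append(np.sqrt(n) * ((x @ y) / (x @ x) - beta))
    ratio_err.append(np.sqrt(n) * (y.sum() / x.sum() - beta))

print(np.var(ols_err), np.var(ratio_err))
# beta_tilde equals GLS in this design, so its variance is the smaller one
# (theory for this design: ~0.73 for OLS vs ~0.67 for the ratio estimator)
```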
QUADRATIC TWISTS OF MODULAR FORMS AND ELLIPTIC CURVES

KEN ONO AND MATTHEW A. PAPANIKOLAS

1. INTRODUCTION

Here we summarize the results presented in the first author's lecture at the Millennial Conference on Number Theory. Some of these results appear in [O] in full detail. In addition we present a new result regarding the growth of Tate-Shafarevich groups of certain elliptic curves over elementary abelian 2-extensions.

We begin by fixing notation. Suppose that F(z) = Σ_{n=1}^∞ a(n) q^n ∈ S_{2k}(Γ_0(M)) (q := e^{2πiz} throughout) is an even weight newform on Γ_0(M) with trivial Nebentypus character. As usual, let L(F, s) denote its L-function, which is defined by analytically continuing
L(F, s) = Σ_{n=1}^∞ a(n)/n^s.
If D is a fundamental discriminant of the quadratic field Q(√D), then let χ_D denote its usual Kronecker character. Let F ⊗ χ_D denote the newform that is the D-quadratic twist of F, and let L(F ⊗ χ_D, s) denote the associated L-function. In particular, if gcd(D, M) = 1, then
F ⊗ χ_D = Σ_{n=1}^∞ χ_D(n) a(n) q^n.

Given a fixed F, we consider the behavior of the central critical values L(F ⊗ χ_D, k) as D varies. In an important paper, Goldfeld [G] conjectured that

(1)  Σ_{|D|≤X, gcd(D,M)=1} ord_{s=k} L(F ⊗ χ_D, s)  ~  (1/2) Σ_{|D|≤X, gcd(D,M)=1} 1.

(Note: this conjecture was originally formulated for weight 2 newforms associated to modular elliptic curves.) Obviously, this conjecture implies the weaker statement

(2)  #{ |D| ≤ X : L(F ⊗ χ_D, k) ≠ 0 } ≫_F X.

(The first author thanks the National Science Foundation, the Alfred P. Sloan Foundation, and the David and Lucile Packard Foundation for their generous research support.)

Using a variety of methods, early works by Bump, Friedberg, Hoffstein, Iwaniec, Murty, Murty and Waldspurger (see [B-F-H], [I], [M-M], [W]) produced a number of important nonvanishing theorems in the direction of (2). More recently, Katz and Sarnak [Ka-Sa] provided, among many other results, conditional proofs of (1). However, this claim has only been proven for certain special newforms by the works of James, Kohnen and Vatsal [J], [Ko], [V]. These cases require that the modular forms possess exceptional mod 3 Galois representations. The best unconditional general result in the direction of (2) is due to the first author and Skinner [O-S]. They proved that

(3)  #{ |D| ≤ X : L(F ⊗ χ_D, k) ≠ 0 } ≫_F X / log X.

We obtain, for almost every F, the following minor improvement.

Theorem 1. Let F(z) = Σ_{n=1}^∞ a(n) q^n ∈ S_{2k}(Γ_0(M)) be an even weight newform, and let K be a number field containing the coefficients a(n). If v is a place of K over 2, and there is a prime p ∤ 2M for which

(4)  ord_v(a(p)) = 0,

then there is a rational number 0 < α < 1 for which
#{ |D| ≤ X : L(F ⊗ χ_D, k) ≠ 0 } ≫_F X / (log X)^{1−α}.

This result has immediate implications for elliptic curves. We begin by fixing notation. Suppose that E/Q is an elliptic curve
E: y² = x³ + ax + b,
and let L(E, s) = Σ_{n=1}^∞ a_E(n) n^{−s} be its Hasse-Weil L-function. For integers d which are not perfect squares, let E(d) denote the d-quadratic twist of E,
E(d): dy² = x³ + ax + b.
Moreover, if E is an elliptic curve defined over a number field K, then let rk(E, K) denote the rank of the Mordell-Weil group E(K). Similarly, let Ш(E, K) denote the Tate-Shafarevich group of E/K. By a celebrated theorem of Kolyvagin [Kol] and the modularity of E, (2) implies the widely held speculation that

(5)  #{ |D| ≤ X : rk(E(D), Q) = 0 } ≫_E X.

Heath-Brown confirmed (5) for the congruent number elliptic curves in [H-B], and subsequent works by James and Vatsal [J], [V] confirm this assertion for a variety of families of quadratic twists which contain an elliptic curve with a rational torsion point of order 3. However, (5) remains open for most elliptic curves. In this direction, Theorem 1 implies the following result.

Corollary 2. If E/Q is an elliptic curve without a Q-rational torsion point of order 2, then there is a number 0 < α_E < 1 for which
#{ |D| ≤ X : rk(E(D), Q) = 0 } ≫_E X / (log X)^{1−α_E}.

Theorem 1 and Corollary 2 depend on a nonvanishing theorem (see Theorem 2.1) which guarantees the existence of a fat set of discriminants D, closed under multiplication, for which L(F ⊗ χ_D, k) ≠ 0. The most interesting consequence of Theorem 2.1 may be the following result concerning the triviality of the rank of the Mordell-Weil group of most elliptic curves E over prescribed elementary abelian 2-extensions of Q of arbitrarily large degree.

Theorem 3. Let E/Q be an elliptic curve without a Q-rational torsion point of order 2. Then there is a fundamental discriminant D_E and a set of primes S_E with positive density with the property that for every positive integer j we have
rk(E(D_E), Q(√m_1, √m_2, …, √m_j)) = rk(E(D_E), Q) = 0
whenever the integers m_1, m_2, …, m_j > 1 satisfy the following conditions:
(1) Each m_i is square-free with an even number of prime factors.
(2) All of the prime factors of each m_i are in S_E.
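The quadratic twist interacts with traces of Frobenius in a simple way: a_{E(d)}(p) = χ_d(p) a_E(p) for primes p ∤ 2dΔ_E. A self-contained numerical check (my own illustration; the curve and twist are arbitrary choices):

```python
# E: y^2 = x^3 + a*x + b over F_p, and its twist E(d): d*y^2 = x^3 + a*x + b.
def a_p(p, a, b, d=1):
    # trace of Frobenius: p + 1 - #E(F_p), counting the point at infinity
    sols = {}
    for y in range(p):
        v = (d * y * y) % p
        sols[v] = sols.get(v, 0) + 1
    count = 1                                  # point at infinity
    for x in range(p):
        count += sols.get((x**3 + a * x + b) % p, 0)
    return p + 1 - count

def chi(d, p):
    # Legendre symbol (d/p) for an odd prime p not dividing d, via Euler
    t = pow(d % p, (p - 1) // 2, p)
    return -1 if t == p - 1 else t

a, b, d = -1, 1, 5      # E: y^2 = x^3 - x + 1 (Delta = -368), twisted by 5
for p in (7, 11, 13, 17, 19, 29):   # odd primes not dividing 2*d*Delta
    assert a_p(p, a, b, d) == chi(d, p) * a_p(p, a, b)
print("a_{E(d)}(p) = chi_d(p) * a_E(p) checked for the sampled primes")
```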
Theorem 2.1 may also be used to prove the existence of non-trivial elements of Tate-Shafarevich groups of elliptic curves. Regarding Tate-Shafarevich groups, works by Bölling, Cassels, Kramer and Rohrlich [Bö], [Ca], [Kr], [R] yield a variety of results concerning the non-triviality of the 2- and 3-parts of Tate-Shafarevich groups for families of elliptic curves. Less is known about the non-triviality of the p-parts of Ш for primes p ≥ 5. Under a natural hypothesis, Theorem 2.1 and a theorem of Frey yield a general result which holds for many, if not all, curves E whose Mordell-Weil group over Q has torsion subgroup Z/3Z, Z/5Z or Z/7Z. For simplicity we present the following special case of this result.

Theorem 4. Suppose that E/Q is an elliptic curve whose torsion subgroup over Q is Z/ℓZ with ℓ ∈ {3, 5, 7}. If E is good at ℓ (see §3), has good reduction at ℓ, and has the property that there is an odd prime p_0 ≡ −1 (mod ℓ) of bad reduction with
ord_{p_0}(Δ_E) ≢ 0 (mod ℓ),
where Δ_E is the discriminant of E, then there are infinitely many negative square-free integers d for which
rk(E(d), Q) = 0 and #Ш(E(d), Q) ≡ 0 (mod ℓ).

Remark. Earlier work of Wong [Wo] implied that every elliptic curve without a Q-rational point of order 2 is good at ℓ. We employed his result in a crucial way to obtain [Th. 5, O] and [Cor. 6, O]. Unfortunately, we have been informed that there is a mistake in Wong's argument. Therefore readers should be aware that [Th. 5, O] and [Cor. 6, O] are true for elliptic curves that are good at ℓ (see §3).

Using Theorems 3 and 4, we obtain the next theorem, which shows for certain elliptic curves E/Q that there are infinitely many number fields K for which both
rk(E, K) ≫_E log([K : Q]) and rk_p(Ш(E, K)) ≫_E log([K : Q]).

Theorem 5. Let E/Q be an elliptic curve whose torsion subgroup over Q is Z/pZ with p ∈ {3, 5, 7}. If E is good at p (see §3), has good reduction at p, and has the property that there is an odd prime p_0 ≡ −1 (mod p) of bad reduction with
ord_{p_0}(Δ_E) ≢ 0 (mod p),
where Δ_E is the discriminant of E, then for every pair of non-negative integers r_m and r_s there are r_m + r_s square-free integers d_1, d_2, …, d_{r_m+r_s} with
rk(E, Q(√d_1, …, √d_{r_m+r_s})) ≥ 2 r_m and rk_p(Ш(E, Q(√d_1, …, √d_{r_m+r_s}))) ≥ 2 r_s.

In §2 we describe Theorem 2.1 and give a brief sketch of its proof, and in §3 we sketch the proofs of Theorems 1, 3, 4 and 5.

2. THE CRUCIAL NONVANISHING THEOREM

The next theorem is the main result which is vital for all of the results described in §1.

Theorem 2.1. Let F(z) = Σ_{n=1}^∞ a(n) q^n ∈ S_{2k}(Γ_0(M)) be an even weight newform, and let K be a number field containing the coefficients a(n). If v is a place of K over 2 and there is a prime p ∤ 2M for which

(2.1)  ord_v(a(p)) = 0,

then there is a fundamental discriminant D_F and a set of primes S_F with positive density such that for every positive integer j we have
L(F ⊗ χ_{p_1 p_2 ⋯ p_{2j} D_F}, k) ≠ 0
whenever p_1, p_2, …, p_{2j} ∈ S_F are distinct primes not dividing D_F.

Sketch of the proof. We begin by recalling a theorem due to Waldspurger [Th. 1, W] on the Shimura correspondence [Sh]. This result expresses many of the central critical values L(F ⊗ χ_D, k) in terms of the Fourier coefficients of certain half-integral weight cusp forms. For every fundamental discriminant D, define D_0 by

(2.2)  D_0 := |D| if D is odd; |D|/4 if D is even.

If ε ∈ {±1} is the sign of the functional equation of L(F, s), then there is a positive integer N with M | N, a Dirichlet character χ modulo 4N, a non-zero complex number Ω_F, and a non-zero half-integral weight eigenform
g(z) = Σ_{n=1}^∞ b_F(n) q^n ∈ S_{k+1/2}(Γ_0(4N), χ)
with the property that if εD > 0, then

(2.3)  b_F(D_0)² = ε_D · L(F ⊗ χ_D, k) · D_0^{k−1/2} / Ω_F if gcd(D_0, 4N) = 1, and 0 otherwise,

where ε_D is algebraic. Moreover, the coefficients a(n), b_F(n) and the values of χ are in O_K, the ring of integers of some fixed number field K. In addition, if p ∤ 4N is prime, then

(2.4)  λ_p = χ²(p) a(p),

where λ_p is the eigenvalue of g(z) for the half-integral weight Hecke operator T(p²) on S_{k+1/2}(Γ_0(4N), χ).

Define the integer s_0 by

(2.5)  s_0 := min_n ord_v(b_F(n)).

In addition, let G_F(z) be the integer weight cusp form defined by

(2.6)  G_F(z) = Σ_{n=1}^∞ b_G(n) q^n := g(z) · Θ(z), where Θ(z) := 1 + 2 Σ_{n=1}^∞ q^{n²}.

It is easy to see that if n is a positive integer, then

(2.7)  b_G(n) = b_F(n) + 2 Σ_{t=1}^∞ b_F(n − t²) ≡ b_F(n) (mod 2).

It is our goal to determine conditions under which b_G(n) is non-zero modulo 2.
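Identity (2.7) is pure power-series bookkeeping, so it can be checked mechanically. A minimal sketch with arbitrary placeholder coefficients b_F(n) (the real ones come from Waldspurger's theorem and are not computed here):

```python
import random

random.seed(3)
N = 60
# placeholder coefficients b_F(1..N); the actual b_F(n) come from the
# half-integral weight form g, which we do not compute in this sketch
b_F = [0] + [random.randrange(-20, 20) for _ in range(N)]

# theta series coefficients: Theta(z) = 1 + 2 * sum_{n>=1} q^(n^2)
theta = [0] * (N + 1)
theta[0] = 1
t = 1
while t * t <= N:
    theta[t * t] = 2
    t += 1

# b_G(n): coefficient of q^n in the product g(z) * Theta(z)
b_G = [sum(b_F[m] * theta[n - m] for m in range(1, n + 1))
       for n in range(N + 1)]

# check (2.7): b_G(n) = b_F(n) + 2*sum_{t>=1} b_F(n - t^2) == b_F(n) mod 2
assert all(b_G[n] % 2 == b_F[n] % 2 for n in range(1, N + 1))
print("parity congruence (2.7) verified for n <=", N)
```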
To achieve this we employ classical results regarding modular Galois representations due to Deligne and Serre [D], [D-S]. Suppose that f(z) = Σ_{n=1}^∞ a_f(n) q^n ∈ S_k(Γ_0(M), ψ) is an integer weight newform, and suppose that K is a number field whose ring of integers O_K contains the Fourier coefficients a_f(n) and the values of ψ. If O_v is the completion of O_K at any finite place v of K, say with residue characteristic ℓ, then there is a continuous representation
ρ_{f,v}: Gal(Q̄/Q) → GL_2(O_v)
with the property that if p ∤ ℓM is prime, then
Tr(ρ_{f,v}(Frob_p)) = a_f(p).

Using these representations we are able to study the arithmetic of the Fourier expansions of a collection of integer weight cusp forms with coefficients that are algebraic integers. Suppose that f_1(z), f_2(z), …, f_y(z), with
f_i(z) = Σ_{n=1}^∞ a_i(n) q^n ∈ S_{k_i}(Γ_0(M_i), χ_i),
are integer weight cusp forms. Suppose that the coefficients of all the f_i(z) and the values of all the χ_i are in O_K for some sufficiently large number field K, and let v be a finite place of K with residue characteristic ℓ. If p_0 ∤ ℓM_1M_2⋯M_y is prime and j is a positive integer, then the Chebotarev density theorem implies that there is a set of primes p of positive density such that for every 1 ≤ i ≤ y we have

(2.8)  ord_v( f_i(z) | T_{p, k_i, χ_i} − f_i(z) | T_{p_0, k_i, χ_i} ) > j.

In view of (2.4), there is a prime p_0 ∤ 4N for which ord_v(λ_{p_0}) = 0. Applying (2.8) to G_F(z) and F(z), there is a set of primes S_{p_0} with positive density with the property that every prime p ∈ S_{p_0} satisfies

(2.9)  ord_v(λ_p) = ord_v(λ_{p_0}) = 0,

and

(2.10)  ord_v( G_F(z) | T_{p, k+1, χ*} − G_F(z) | T_{p_0, k+1, χ*} ) > s_0,

where χ* denotes the Nebentypus character of G_F(z). Suppose that m is a positive integer for which ord_v(b_F(m)) = s_0, and suppose that q_1, q_2 ∈ S_{p_0} are distinct odd primes which are coprime to m. Using the definitions of the integer and half-integral weight Hecke operators, one easily checks that the coefficient of q^{mq_1} in G_F(z) | T_{q_1, k+1, χ*} is
b_G(mq_1²) + χ*(q_1) q_1^k b_G(m).
Since χ*(q_1) q_1^k ≢ 0 (mod 2), by (2.9) we find that the coefficient of q^{mq_1} in G_F(z) | T_{q_1, k+1, χ*} has ord_v equal to s_0. By (2.10), the coefficient of q^{mq_1} in G_F(z) | T_{q_2, k+1, χ*} also has ord_v = s_0, and this equals
b_G(mq_1q_2) + χ*(q_2) q_2^k b_G(mq_1/q_2) = b_G(mq_1q_2).
This shows that if ord_v(b_F(m)) = s_0 and q_1, q_2 ∈ S_{p_0} are distinct odd primes which do not divide m, then ord_v(b_F(mq_1q_2)) = s_0. In view of (2.3), the theorem follows by iterating this observation. □

3. THE LOOSE ENDS

In this section we sketch the proofs of Theorems 1, 3, 4 and 5.

Sketch of the proof of Theorem 1. Let T be a set of primes with density 0 < α < 1, and let N_T denote the set

(3.1)  N_T := { n ∈ N : n = Π p_i with p_i ∈ T, and μ(n) = 1 }.

Here μ denotes the usual Möbius function. Generalizing an argument of Landau, Serre proved [Th. 2.8, S] that
#{ n < X : n = Π p_i with p_i ∈ T } = c_T X / (log X)^{1−α} + O( X / (log X)^{2−α} )
for some positive constant c_T. A simple sieve argument yields
#{ n ∈ N_T : n < X } ≫_T X / (log X)^{1−α}.
Theorem 1 follows immediately from this estimate and Theorem 2.1. □
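A numerical illustration (my own) of the density estimate just quoted: with T = {primes p ≡ 1 (mod 4)}, which has density 1/2 (so α = 1/2), the count of members of N_T below X should behave like a constant times X / √(log X).

```python
# Count n < X that are square-free with an even number of prime factors,
# all lying in T = {p : p = 1 mod 4}, and compare with X / sqrt(log X).
import math

X = 200_000
spf = list(range(X))                 # smallest-prime-factor sieve
for p in range(2, int(X ** 0.5) + 1):
    if spf[p] == p:
        for m in range(p * p, X, p):
            if spf[m] == m:
                spf[m] = p

def in_N_T(n):
    # square-free, even number of prime factors, all primes = 1 mod 4
    count = 0
    while n > 1:
        p = spf[n]
        if p % 4 != 1:
            return False
        n //= p
        if n % p == 0:               # a repeated factor: not square-free
            return False
        count += 1
    return count > 0 and count % 2 == 0

total = sum(in_N_T(n) for n in range(2, X))
print(total, total / (X / math.sqrt(math.log(X))))  # ratio ~ constant c_T
```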
Sketch of the proof of Theorem 3. Since a_E(p) is even for all but finitely many primes p if and only if E has a rational point of order 2, we may freely apply Theorem 2.1. By the modularity of E and Kolyvagin's theorem, if L(E(D), 1) ≠ 0 then rk(E(D), Q) = 0. Suppose that S = {m_1, m_2, …, m_t} is a set of square-free, pairwise coprime integers > 1, where all the prime factors of each m_i are in the prescribed set of primes. If each m_i has an even number of prime factors, then for each 1 ≤ s ≤ t and any distinct d_1, d_2, …, d_s ∈ S we have
rk(E(d_1 d_2 ⋯ d_s), Q) = 0.
Theorem 3 now follows from the fact that
rk(E, K(√d)) = rk(E, K) + rk(E(d), K)
whenever √d ∉ K. □

Let E/Q be an elliptic curve whose torsion subgroup over Q is Z/ℓZ, where ℓ ∈ {3, 5, 7}. By Theorem 2.1 there is a discriminant d_E and a subset of primes S_E such that for every set of distinct odd primes p_1, …, p_{2j} ∈ S_E coprime to d_E we have
L(E(d_E p_1 ⋯ p_{2j}), 1) ≠ 0.
We say that E is good at ℓ if there are infinitely many such negative fundamental discriminants D = d_E p_1 ⋯ p_{2j} for which the following hold:
(i) We have ℓ | #Cl(O_{K_D}), the ideal class group of the quadratic field Q(√D).
(ii) We have gcd(D, N_E) = 1, where N_E is the conductor of E.
(iii) We have D ≡ 0 (mod 4) and D/4 ≡ 3 (mod 4).
(iv) If ord_ℓ(j_E) < 0, then χ_D(ℓ) = −1.
(v) Every prime p | N_E with p ∉ {2} ∪ S_E, and with the additional property that p ≢ −1 (mod ℓ) or ord_p(Δ_E) ≡ 0 (mod ℓ), satisfies
χ_D(p) = −1 if ord_p(j_E) ≥ 0; −1 if ord_p(j_E) < 0 and E/Q_p is a Tate curve; +1 otherwise.

Sketch of the proof of Theorem 4. Let ℓ be an odd prime such that E/Q is an elliptic curve with good reduction at ℓ and with a Q-rational point of order ℓ. If D is a negative fundamental discriminant satisfying the conditions appearing in the definition above, then Frey proved that
Cl_ℓ(D) ↪ S_ℓ(E(D)/Q).
Here Cl_ℓ(D) denotes the ℓ-part of the ideal class group of Q(√D), and S_ℓ(E(D)/Q) is the ℓ-Selmer group of E(D) over Q. Let S_E be the set of primes with positive density given in Theorem 2.1. Then there is a discriminant d_E and a subset S_E' of S_E such that for every set of distinct odd primes p_1, p_2, …, p_{2j} ∈ S_E' which are coprime to d_E we have

(3.2)  L(E(d_E p_1 p_2 ⋯ p_{2j}), 1) ≠ 0,
(3.3)  Cl_ℓ(d_E p_1 p_2 ⋯ p_{2j}) ↪ S_ℓ(E(d_E p_1 p_2 ⋯ p_{2j})/Q).

By Kolyvagin's theorem, (3.2) implies that E(d_E p_1 p_2 ⋯ p_{2j}) has rank zero. Therefore we find that if p_1, p_2, …, p_{2j} ∈ S_E' are distinct primes which do not divide d_E, then
#Cl(d_E p_1 p_2 ⋯ p_{2j}) ≡ 0 (mod ℓ) implies Z/ℓZ × Z/ℓZ ↪ Ш(E(d_E p_1 p_2 ⋯ p_{2j}), Q).
Therefore it suffices to prove that there are infinitely many suitable discriminants satisfying (3.2) and (3.3) with the additional property that ℓ | #Cl(d_E p_1 p_2 ⋯ p_{2j}). This is guaranteed since E is good at ℓ. □

Now we prove Theorem 5. We begin by fixing notation. For an abelian group A and a positive integer m, we let A[m] denote the m-torsion of A, and for a prime p we let rk_p(A) denote the p-rank of A[p]. If A is finitely generated, we let rk(A) denote its rank. For a Galois extension of fields L/K we let G(L/K) denote the Galois group, and we let G_K = G(K̄/K). For d ∈ K*, we let K_d = K(√d) and G_d = G(K_d/K). Our results depend on the following relations between p-ranks of Tate-Shafarevich and Mordell-Weil groups of elliptic curves upon a quadratic extension. Here S_p(E/K) denotes the p-Selmer group of an elliptic curve.

Lemma 3.1. Let E be an elliptic curve defined over a number field K. Let p be an odd prime, and let d be a non-square in K*. Let r(E, K) denote either rk(E(K)), rk_p(S_p(E/K)), or rk_p(Ш(E, K)). Then
r(E, K_d) = r(E, K) + r(E(d), K).
Proof. The result for the rank of E(K_d) is well known. Let σ denote the non-trivial element of G_d. Fix an isomorphism φ: E → E(d) defined over K_d so that φ(Q)^σ = −φ(Q^σ) for all Q ∈ E(K̄). As p is odd, we note that
E(K_d)[p] = E(K)[p] ⊕ E(d)(K)[p],
via the map Q ↦ (Q + Q^σ, φ(Q − Q^σ)) (multiplication by 2 is invertible on the p-torsion). Therefore the result also holds for r(E, K_d) = rk_p(E(K_d)/pE(K_d)). By the exact sequence
0 → E(K_d)/pE(K_d) → S_p(E/K_d) → Ш(E, K_d)[p] → 0,
it now suffices to prove the result for Selmer groups. The Selmer group decomposes as
S_p(E/K_d) = S_p(E/K_d)^+ ⊕ S_p(E/K_d)^−,
where the first group is the G_d-invariants of S_p(E/K_d) and the second comprises those elements on which σ acts by −1. Since p is odd, the Galois cohomology groups H^i(G_d, E(K_d)[p]) are trivial for i ≥ 1; therefore, by the Hochschild-Serre spectral sequence, the restriction map
H^1(G_K, E[p]) → H^1(G_{K_d}, E[p])^{G_d}
is an isomorphism. Thus it follows that
res: S_p(E/K) → S_p(E/K_d)^+
is an isomorphism. Likewise, S_p(E(d)/K) ≅ S_p(E(d)/K_d)^{G_d}. If φ* is the map induced by φ on cohomology, then one can check that
φ*: H^1(G_{K_d}, E[p])^− → H^1(G_{K_d}, E(d)[p])^{G_d}
is well defined and an isomorphism. For each place w of K_d, the map
φ*: H^1(G_{K_d,w}, E) → H^1(G_{K_d,w}, E(d))
is necessarily an isomorphism, so it follows from the definition of the Selmer group that
φ*: S_p(E/K_d)^− → S_p(E(d)/K_d)^{G_d}
is also an isomorphism. Therefore S_p(E/K_d)^− ≅ S_p(E(d)/K), and we are done. □

The following theorem is a weak version of one of the main results in [St-T].

Theorem 3.2. If E/Q is an elliptic curve, then for every positive integer r_m there are distinct square-free integers D_1, D_2, …, D_{r_m} for which rk(E(D_i), Q) ≥ 2.

Proof of Theorem 5. Using Theorem 3.2, let D_1, D_2, …, D_{r_m} be distinct square-free integers for which rk(E(D_i), Q) ≥ 2. Hence Lemma 3.1 implies that

(3.4)  rk(E, Q(√D_1, √D_2, …, √D_{r_m})) ≥ Σ_{j=1}^{r_m} rk(E(D_j), Q) ≥ 2 r_m.

Similarly, by Theorem 4 there are r_s many distinct square-free integers d_1, d_2, …, d_{r_s} for which rk(E(d_i), Q) = 0 and #Ш(E(d_i), Q) ≡ 0 (mod p). By the proof of Theorem 4 and Kolyvagin's theorem on the Birch and Swinnerton-Dyer conjecture (i.e., the finiteness of Ш(E(d_i), Q); together with the existence of the Cassels-Tate pairing this implies that rk_p(Ш(E(d_i), Q)) is even), it follows for each such d_i that
rk_p(Ш(E(d_i), Q)) ≥ 2.
Therefore, by Lemma 3.1 we get

(3.5)  rk_p(Ш(E, Q(√d_1, …, √d_{r_s}))) ≥ Σ_{i=1}^{r_s} rk_p(Ш(E(d_i), Q)) ≥ 2 r_s.

Consequently, (3.4), (3.5) and Lemma 3.1 imply that
rk(E, Q(√D_1, √D_2, …, √D_{r_m}, √d_1, …, √d_{r_s})) ≥ 2 r_m
and
rk_p(Ш(E, Q(√D_1, √D_2, …, √D_{r_m}, √d_1, …, √d_{r_s}))) ≥ 2 r_s.
This completes the proof. □
REFERENCES

[Bö] R. Bölling, Die Ordnung der Schafarewitsch-Tate-Gruppe kann beliebig gross werden, Math. Nachr. 67 (1975), 157-179.
[B-F-H] D. Bump, S. Friedberg and J. Hoffstein, Nonvanishing theorems for L-functions of modular forms and their derivatives, Invent. Math. 102 (1990), 543-618.
[Ca] J. W. S. Cassels, Arithmetic on curves of genus 1 (VI). The Tate-Shafarevich group can be arbitrarily large, J. reine angew. Math. 214/215 (1964), 65-70.
[D] P. Deligne, Formes modulaires et représentations ℓ-adiques, Sém. Bourbaki 1968/1969, Exposé 355, Springer Lect. Notes 179 (1971), 139-172.
[D-S] P. Deligne and J.-P. Serre, Formes modulaires de poids 1, Ann. Scient. Éc. Norm. Sup., 4e sér. 7 (1974), 507-530.
[F] G. Frey, On the Selmer group of twists of elliptic curves with Q-rational torsion points, Canad. J. Math. XL (1988), 649-665.
[G] D. Goldfeld, Conjectures on elliptic curves over quadratic fields, Number Theory (Carbondale), Springer Lect. Notes 751 (1979), 108-118.
[H-B] D. R. Heath-Brown, The size of the Selmer groups for the congruent number problem II, Invent. Math. 118 (1994), 331-370.
[I] H. Iwaniec, On the order of vanishing of modular L-functions at the critical point, Sém. Théor. Nombres Bordeaux 2 (1990), 365-376.
[J] K. James, L-series with nonzero central critical value, J. Amer. Math. Soc. 11 (1998), 635-641.
[Ka-Sa] N. Katz and P. Sarnak, Random matrices, Frobenius eigenvalues and monodromy, Amer. Math. Soc. Colloq. Publ. Vol. 45, Amer. Math. Soc., Providence, 1999.
[Ko] W. Kohnen, On the proportion of quadratic twists of L-functions attached to cusp forms not vanishing at the central point, J. reine angew. Math. 508 (1999), 179-187.
[Kol] V. Kolyvagin, Finiteness of E(Q) and Ш(E, Q) for a subclass of Weil curves (Russian), Izv. Akad. Nauk USSR, ser. Matem. 52 (1988), 522-540.
[Kr] K. Kramer, A family of semistable elliptic curves with large Tate-Shafarevich groups, Proc. Amer. Math. Soc. 89 (1983), 473-499.
[M-M] M. R. Murty and V. K. Murty, Mean values of derivatives of modular L-series, Annals of Math. 133 (1991), 447-475.
[O] K. Ono, Nonvanishing of quadratic twists of modular L-functions and applications to elliptic curves, J. reine angew. Math. 533 (2001), 81-97.
[O-S] K. Ono and C. Skinner, Non-vanishing of quadratic twists of modular L-functions, Invent. Math. 134 (1998), 651-660.
[R] D. Rohrlich, Unboundedness of the Tate-Shafarevich group in families of quadratic twists (Appendix to J. Hoffstein and W. Luo, Nonvanishing of L-series and the combinatorial sieve), Math. Res. Lett. 4 (1997), 435-444.
[S] J.-P. Serre, Divisibilité de certaines fonctions arithmétiques, L'Enseign. Math. 22 (1976), 227-260.
[Sh] G. Shimura, On modular forms of half-integral weight, Annals of Math. 97 (1973), 440-481.
[Si] J. H. Silverman, The Arithmetic of Elliptic Curves, Springer-Verlag, New York, 1986.
[St-T] C. Stewart and J. Top, On ranks of twists of elliptic curves and power-free values of binary forms, J. Amer. Math. Soc. 8, no. 4 (1995), 947-974.
[V] V. Vatsal, Canonical periods and congruence formulae, Duke Math. J. 98 (1999), 397-419.
[W] J.-L. Waldspurger, Sur les coefficients de Fourier des formes modulaires de poids demi-entier, J. Math. Pures et Appl. 60 (1981), 375-484.
[Wo] S. Wong, Elliptic curves and class number divisibility, Int. Math. Res. Not. 12 (1999), 661-672.

DEPARTMENT OF MATHEMATICS, UNIVERSITY OF WISCONSIN, MADISON, WISCONSIN 53706
E-mail address: ono@math.wisc.edu

DEPARTMENT OF MATHEMATICS, BROWN UNIVERSITY, PROVIDENCE, RHODE ISLAND 02912

