Class Note for GEOS 567 with Professor Richardson at UA
INVERSE PROBLEMS IN GEOPHYSICS
GEOS 567

A Set of Lecture Notes by Professors
Randall M. Richardson and George Zandt
Department of Geosciences
University of Arizona
Tucson, Arizona 85721

Revised and Updated Summer 2003

TABLE OF CONTENTS

PREFACE

CHAPTER 1  INTRODUCTION
1.1 Inverse Theory: What It Is and What It Does
1.2 Useful Definitions
1.3 Possible Goals of an Inverse Analysis
1.4 Nomenclature
1.5 Examples of Forward Problems
    1.5.1 Example 1: Fitting a Straight Line
    1.5.2 Example 2: Fitting a Parabola
    1.5.3 Example 3: Acoustic Tomography
    1.5.4 Example 4: Seismic Tomography
    1.5.5 Example 5: Convolution
1.6 Final Comments

CHAPTER 2  REVIEW OF LINEAR ALGEBRA AND STATISTICS
2.1 Introduction
2.2 Matrices and Linear Transformations
    2.2.1 Review of Matrix Manipulations
    2.2.2 Matrix Transformations
    2.2.3 Matrices and Vector Spaces
2.3 Probability and Statistics
    2.3.1 Introduction
    2.3.2 Definitions, Part 1
    2.3.3 Some Comments on Applications to Inverse Theory
    2.3.4 Definitions, Part 2

CHAPTER 3  INVERSE METHODS BASED ON LENGTH
3.1 Introduction
3.2 Data Error and Model Parameter Vectors
3.3 Measures of Length
3.4 Minimizing the Misfit: Least Squares
    3.4.1 Least Squares Problem for a Straight Line
    3.4.2 Derivation of the General Least Squares Solution
    3.4.3 Two Examples of Least Squares Problems
    3.4.4 Four-Parameter Tomography Problem
3.5 Determinancy of Least Squares Problems
    3.5.1 Introduction
    3.5.2 Even-Determined Problems (M = N)
    3.5.3 Overdetermined Problems (Typically N > M)
    3.5.4 Underdetermined Problems (Typically M > N)
3.6 Minimum Length Solution
    3.6.1 Background Information
    3.6.2 Lagrange Multipliers
    3.6.3 Application to the Purely Underdetermined Problem
    3.6.4 Comparison of Least Squares and Minimum Length Solutions
    3.6.5 Example of Minimum Length Problem
3.7 Weighted Measures of Length
    3.7.1 Introduction
    3.7.2 Weighted Least Squares
    3.7.3 Weighted Minimum Length
    3.7.4 Weighted Damped Least Squares
3.8 A Priori Information and Constraints
    3.8.1 Introduction
    3.8.2 A First Approach to Including Constraints
    3.8.3 A Second Approach to Including Constraints
    3.8.4 Example From Seismic Receiver Functions
3.9 Variance of the Model Parameters
    3.9.1 Introduction
    3.9.2 Application to Least Squares
    3.9.3 Application to the Minimum Length Problem
    3.9.4 Geometrical Interpretation of Variance

CHAPTER 4  LINEARIZATION OF NONLINEAR PROBLEMS
4.1 Introduction
4.2 Linearization of Nonlinear Problems
4.3 General Procedure for Nonlinear Problems
4.4 Three Examples
    4.4.1 A Linear Example
    4.4.2 A Nonlinear Example
    4.4.3 Nonlinear Straight-Line Example
4.5 Creeping vs. Jumping (Shaw and Orcutt, 1985)

CHAPTER 5  THE EIGENVALUE PROBLEM
5.1 Introduction
5.2 The Eigenvalue Problem for Square (M x M) Matrix A
    5.2.1 Background
    5.2.2 How Many Eigenvalues, Eigenvectors?
    5.2.3 The Eigenvalue Problem in Matrix Notation
    5.2.4 Summarizing the Eigenvalue Problem for A
5.3 Geometrical Interpretation of the Eigenvalue Problem for Symmetric A
    5.3.1 Introduction
    5.3.2 Geometrical Interpretation
    5.3.3 Coordinate System Rotation
    5.3.4 Summarizing Points
5.4 Decomposition Theorem for Square A
    5.4.1 The Eigenvalue Problem for A^T
    5.4.2 Eigenvectors for A^T
    5.4.3 Decomposition Theorem for Square Matrices
    5.4.4 Finding the Inverse A^-1 for the M x M Matrix A
    5.4.5 What Happens When There Are Zero Eigenvalues?
    5.4.6 Some Notes on the Properties of S_P and R_P
5.5 Eigenvector Structure of m_LS
    5.5.1 Square Symmetric A Matrix With Nonzero Eigenvalues
    5.5.2 The Case of Zero Eigenvalues
    5.5.3 Simple Tomography Problem Revisited

CHAPTER 6  SINGULAR-VALUE DECOMPOSITION (SVD)
6.1 Introduction
6.2 Formation of a New Matrix B
    6.2.1 Formulating the Eigenvalue Problem With G
    6.2.2 The Role of G^T as an Operator
6.3 The Eigenvalue Problem for B
    6.3.1 Properties of B
    6.3.2 Partitioning W
6.4 Solving the Shifted Eigenvalue Problem
    6.4.1 The Eigenvalue Problem for G^T G
    6.4.2 The Eigenvalue Problem for G G^T
6.5 How Many eta_i Are There, Anyway?
    6.5.1 Introducing P, the Number of Nonzero Pairs (eta_i, -eta_i)
    6.5.2 Finding the Eigenvector Associated With -eta_i
    6.5.3 No New Information From the -eta_i System
    6.5.4 What About the Zero Eigenvalues, eta_i, i = 2P + 1, ..., N + M?
    6.5.5 How Big Is P?
6.6 Introducing Singular Values
    6.6.1 Introduction
    6.6.2 Definition of the Singular Value
    6.6.3 Definition of the Singular-Value Matrix
6.7 Derivation of the Fundamental Decomposition Theorem for General G (N x M, N != M)
6.8 Singular-Value Decomposition (SVD)
    6.8.1 Derivation of Singular-Value Decomposition
    6.8.2 Rewriting the Shifted Eigenvalue Problem
    6.8.3 Summarizing SVD
6.9 Mechanics of Singular-Value Decomposition
6.10 Implications of Singular-Value Decomposition
    6.10.1 Relationships Between U, U_P, and U_0
    6.10.2 Relationships Between V, V_P, and V_0
    6.10.3 Graphic Representation of U, U_P, U_0, V, V_P, and V_0 Spaces
6.11 Classification of d = Gm Based on P, M, and N
    6.11.1 Introduction
    6.11.2 Class I: P = M = N
    6.11.3 Class II: P = M < N
    6.11.4 Class III: P = N < M
    6.11.5 Class IV: P < min(N, M)

CHAPTER 7  THE GENERALIZED INVERSE AND MEASURES OF QUALITY
7.1 Introduction
7.2 The Generalized Inverse Operator G_g^-1
    7.2.1 Background Information
    7.2.2 Class I: P = N = M
    7.2.3 Class II: P = M < N
    7.2.4 Class III: P = N < M
    7.2.5 Class IV: P < min(N, M)
7.3 Measures of Quality for the Generalized Inverse
    7.3.1 Introduction
    7.3.2 The Model Resolution Matrix R
    7.3.3 The Data Resolution Matrix N
    7.3.4 The Unit Model Covariance Matrix [cov_u m]
    7.3.5 Combining R, N, and [cov_u m]
    7.3.6 An Illustrative Example
7.4 Quantifying the Quality of R, N, and [cov_u m]
    7.4.1 Introduction
    7.4.2 Classes of Problems
    7.4.3 Effect of the Generalized Inverse Operator
7.5 Resolution Versus Stability
    7.5.1 Introduction
    7.5.2 R, N, and [cov_u m] for Nonlinear Problems

CHAPTER 8  VARIATIONS OF THE GENERALIZED INVERSE
8.1 Linear Transformations
    8.1.1 Analysis of the Generalized Inverse Operator G_g^-1
    8.1.2 G_g^-1 Operating on a Data Vector d
    8.1.3 Mapping Between Model and Data Space: An Example
8.2 Including Prior Information, or the Weighted Generalized Inverse
    8.2.1 Mathematical Background
    8.2.2 Coordinate System Transformation of Data and Model Parameter Vectors
    8.2.3 The Maximum Likelihood Inverse Operator, Resolution, and Model Covariance
    8.2.4 Effect on Model- and Data-Space Eigenvectors
    8.2.5 An Example
8.3 Damped Least Squares and the Stochastic Inverse
    8.3.1 Introduction
    8.3.2 The Stochastic Inverse
    8.3.3 Damped Least Squares
8.4 Ridge Regression
    8.4.1 Mathematical Background
    8.4.2 The Ridge Regression Operator
    8.4.3 An Example of Ridge Regression Analysis
8.5 Maximum Likelihood
    8.5.1 Background
    8.5.2 The General Case

CHAPTER 9  CONTINUOUS INVERSE THEORY AND OTHER APPROACHES
9.1 Introduction
9.2 The Backus-Gilbert Approach
9.3 Neural Networks
9.4 The Radon Transform and Tomography
    9.4.1 Introduction
    9.4.2 Interpretation of Tomography Using the Radon Transform
    9.4.3 Slant-Stacking as a Radon Transform (following Claerbout, 1985)
9.5 A Review of the Radon Transform Approach
9.6 Alternative Approach to Tomography

PREFACE

This set of lecture notes has its origin in a nearly incomprehensible course in inverse theory that I took as a first-semester graduate student at MIT. My goal, as a teacher and in these notes, is to present inverse theory in such a way that it is not only comprehensible but useful.

Inverse theory, loosely defined, is the fine art of inferring as much as possible about a problem from all available information. Information takes both the traditional form of data as well as the relationship between actual and predicted data. In a nuts-and-bolts definition, it is one way (some would argue the best way) to find and assess the quality of a solution to some mathematical problem of interest. Inverse theory has
two main branches, dealing with discrete and continuous problems, respectively. This text concentrates on the discrete case, covering enough material for a single-semester course. A background in linear algebra, probability and statistics, and computer programming will make the material much more accessible. Review material on the first two topics is provided in Chapter 2.

This text could stand alone. However, it was written to complement and extend the material covered in the required text for the course, which deals more completely with some areas. Furthermore, these notes make numerous references to sections in the required text. Besides, the required text is by far the best textbook on the subject and should be a part of the library of anyone interested in inverse theory. The required text is Geophysical Data Analysis: Discrete Inverse Theory, Revised Edition, by William Menke, Academic Press, 1989.

The course format is largely lecture. We may, from time to time, read articles from the literature and work in a seminar format. I will try to schedule a couple of guest lectures on applications. Be forewarned: there is a lot of homework for this course, and the assignments are occasionally very time consuming. I make every effort to avoid pure algebraic nightmares, but my general philosophy is summarized below:

    I hear, and I forget.
    I see, and I remember.
    I do, and I understand.
        -- Chinese Proverb

I try to have you do a simple problem by hand before turning you loose on the computer, where all realistic problems must be solved. You will also have access to existing code and a computer account on my SPARC-10 workstation. You may use and modify the code for some of the homework and for the term project. The term project is an essential part of the learning process and, I hope, will help you tie the course work together.

Grading for this course will be as follows:

    60%  Homework
    30%  Term Project
    10%  Class Participation

Good luck, and may you find the tradeoff between stability and resolution less traumatic than most, on average.

Randy
Richardson and George Zandt

CHAPTER 1
INTRODUCTION

1.1 Inverse Theory: What It Is and What It Does

Inverse theory, at least as I choose to define it, is the fine art of estimating model parameters from data. It requires a knowledge of the forward model capable of predicting data if the model parameters were, in fact, already known. Anyone who attempts to solve a problem in the sciences is probably using inverse theory, whether or not he or she is aware of it.

Inverse theory, however, is capable (at least when properly applied) of doing much more than just estimating model parameters. It can be used to estimate the quality of the predicted model parameters. It can be used to determine which model parameters, or which combinations of model parameters, are best determined. It can be used to determine which data are most important in constraining the estimated model parameters. It can determine the effects of noisy data on the stability of the solution. Furthermore, it can help in experimental design by determining where, what kind, and how precise data must be to determine model parameters.

Inverse theory is, however, inherently mathematical, and as such does have its limitations. It is best suited to estimating the numerical values of, and perhaps some statistics about, model parameters for some known or assumed mathematical model. It is less well suited to provide the fundamental mathematics or physics of the model itself. I like the example Albert Tarantola gives in the introduction of his classic book1 on inverse theory. He says "you can always measure the captain's age (for instance, by picking his passport), but there are few chances for this measurement to carry much information on the number of masts of the boat." You must have a good idea of the applicable forward model in order to take advantage of inverse theory.

Sooner or later, however, most practitioners become rather fanatical about the benefits of a particular approach to inverse theory. Consider the following as an
example of how, or how not, to apply inverse theory. The existence or nonexistence of a God is an interesting question. Inverse theory, however, is poorly suited to address this question. However, if one assumes that there is a God, and that She makes angels of a certain size, then inverse theory might well be appropriate to determine the number of angels that could fit on the head of a pin. Now, who said practitioners of inverse theory tend toward the fanatical?

In the rest of this chapter, I will give some useful definitions of terms that will come up time and again in inverse theory, and give some examples (mostly from Menke's book) of how to set up forward problems, in an attempt to clearly identify model parameters from data.

1Inverse Problem Theory, by Albert Tarantola, Elsevier Scientific Publishing Company, 1987.

1.2 Useful Definitions

Let us begin with some definitions of things like forward and inverse theory, models and model parameters, data, etc.

Forward Theory: The mathematical process of predicting data based on some physical or mathematical model, with a given set of model parameters (and perhaps some other appropriate information, such as geometry, etc.). Schematically, one might represent this as follows:

    model parameters --> model --> predicted data

As an example, consider the two-way vertical travel time t of a seismic wave through M layers of thickness d_i and velocity v_i. Then t is given by

    t = SUM_{i=1}^{M} 2 d_i / v_i    (1.1)

The forward problem consists of predicting data (travel time) based on a mathematical model of how seismic waves travel. Suppose that, for some reason, thickness was known for each layer (perhaps from drilling). Then only the M velocities would be considered model parameters. One would obtain a particular travel time t for each set of model parameters one chooses.

Inverse Theory: The mathematical process of predicting (or estimating) the numerical values (and associated statistics) of a set of model parameters of an assumed model, based on a set of data or observations.
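The forward travel-time calculation above can be sketched in a few lines of code (a minimal illustration only; the layer thicknesses and velocities here are made-up values, not from the notes):

```python
def travel_time(thicknesses, velocities):
    """Two-way vertical travel time through a stack of layers:
    t = sum over layers of 2 * d_i / v_i  (cf. Eq. 1.1)."""
    return sum(2.0 * d / v for d, v in zip(thicknesses, velocities))

d = [1.0, 2.0, 3.0]   # layer thicknesses d_i (km), assumed known (e.g., from drilling)
v = [2.0, 4.0, 6.0]   # layer velocities v_i (km/s): the model parameters

t = travel_time(d, v)  # each layer contributes 2*d_i/v_i = 1.0 s, so t = 3.0 s
```

Evaluating the forward model for a trial set of velocities is exactly what an inverse method must do repeatedly; note also that t is linear in the slownesses 1/v_i but nonlinear in the velocities themselves, a point taken up below.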
Schematically, one might represent this as follows:

    data --> model --> predicted (or estimated) model parameters

As an example, one might invert the travel time t above to determine the layer velocities. Note that one needs to know the mathematical model relating travel time to layer thickness and velocity information. Inverse theory should not be expected to provide the model itself.

Model: The model is the mathematical relationship between the model parameters (and other auxiliary information, such as the layer thickness information in the previous example) and the data. It may be linear or nonlinear, etc.

Model Parameters: The model parameters are the numerical quantities, or unknowns, that one is attempting to estimate. The choice of model parameters is usually problem dependent, and quite often arbitrary. For example, in the case of travel times cited earlier, layer thickness is not considered a model parameter, while layer velocity is. There is nothing sacred about these choices. As a further example, one might choose to cast the previous example in terms of slowness s_i, where

    s_i = 1 / v_i    (1.2)

Travel time t is a nonlinear function of layer velocities, but a linear function of layer slownesses. As you might expect, it is much easier to solve linear than nonlinear inverse problems. A more serious problem, however, is that linear and nonlinear formulations may result in different estimates of velocity if the data contain any noise. The point I am trying to impress on you now is that there is quite a bit of freedom in the way model parameters are chosen, and it can affect the answers you get.

Data: Data are simply the observations or measurements one makes in an attempt to constrain the solution of some problem of interest. Travel time in the example above is an example of data. There are, of course, many other examples of data. Some examples of inverse problems (mostly from Menke) follow:

    Medical tomography
    Earthquake location
    Earthquake moment tensor inversion
    Earth structure from surface or
body wave inversion
    Plate velocities (kinematics)
    Image enhancement
    Curve fitting
    Satellite navigation
    Factor analysis

1.3 Possible Goals of an Inverse Analysis

Now let us turn our attention to some of the possible goals of an inverse analysis. These might include:

    1. Estimates of a set of model parameters (obvious).
    2. Bounds on the range of acceptable model parameters.
    3. Estimates of the formal uncertainties in the model parameters.
    4. How sensitive is the solution to noise (or small changes in the data)?
    5. Where, and what kind of, data are best suited to determine a set of model parameters?
    6. Is the fit between predicted and observed data adequate?
    7. Is a more complicated (i.e., more model parameters) model significantly better than a simpler model?

Not all of these are completely independent goals. It is important to realize, as early as possible, that there is much more to inverse theory than simply a set of estimated model parameters. Also, it is important to realize that there is very often not a single correct answer. Unlike a mathematical inverse, which either exists or does not exist, there are many possible approximate inverses. These may give different answers. Part of the goal of an inverse analysis is to determine if the answer you have obtained is reasonable, valid, acceptable, etc. This takes experience, of course, but you have begun the process.

Before going on with how to formulate the mathematical methods of inverse theory, I should mention that there are two basic branches of inverse theory. In the first, the model parameters and data are discrete quantities. In the second, they are continuous functions. An example of the first might occur with the model parameters we seek being given by the moments of inertia of the planets:

    model parameters: I_1, I_2, I_3, ..., I_10    (1.3)

and the data being given by the perturbations in the orbital periods of satellites:

    data: T_1, T_2, T_3, ..., T_N    (1.4)

An example of a continuous-function type of problem might be given by velocity as a function of depth:

    model
parameters: v(z)    (1.5)

and the data given by a seismogram of ground motion:

    data: d(t)    (1.6)

Separate strategies have been developed for discrete and continuous inverse theory. There is, of course, a fair bit of overlap between the two. In addition, it is often possible to approximate continuous functions with a discrete set of values. There are potential problems (aliasing, for example) with this approach, but it often makes otherwise intractable problems tractable. Menke's book deals exclusively with the discrete case. This course will certainly emphasize discrete inverse theory, but I will also give you a little of the continuous inverse theory at the end of the semester.

1.4 Nomenclature

Now let us introduce some nomenclature. In these notes, vectors will be denoted by boldface lowercase letters, and matrices will be denoted by boldface uppercase letters. Suppose one makes N measurements in a particular experiment. We are trying to determine the values of M model parameters. Our nomenclature for data and model parameters will be

    data: d = [d_1, d_2, d_3, ..., d_N]^T    (1.7)

    model parameters: m = [m_1, m_2, m_3, ..., m_M]^T    (1.8)

where d and m are N- and M-dimensional column vectors, respectively, and T denotes transpose.

The model, or relationship between d and m, can have many forms. These can generally be classified as either explicit or implicit, and either linear or nonlinear. Explicit means that the data and model parameters can be separated onto different sides of the equal sign. For example,

    d_1 = 2m_1 + 4m_2    (1.9)

and

    d_1 = 2m_1 + 4m_1 m_2    (1.10)

are two explicit equations. Implicit means that the data cannot be separated on one side of an equal sign with model parameters on the other side. For example,

    d_1 + m_1 + m_2 = 0    (1.11)

and

    d_1 + m_1 + m_1 m_2 = 0    (1.12)

are two implicit equations. In each example above, the first represents a linear relationship between the data and model parameters, and the second represents a nonlinear relationship. In this course, we will deal exclusively with explicit-type equations, and predominantly with linear relationships. Then the
explicit linear case takes the form

    d = Gm    (1.13)

where d is an N-dimensional data vector, m is an M-dimensional model parameter vector, and G is an N x M matrix containing only constant coefficients. The matrix G is sometimes called the kernel, or data kernel, or even the Green's function, because of the analogy with the continuous-function case

    d(x) = INTEGRAL G(x, t) m(t) dt    (1.14)

Consider the following discrete-case example, with two observations (N = 2) and three model parameters (M = 3):

    d_1 = 2m_1 + 0m_2 - 4m_3
    d_2 =  m_1 + 2m_2 + 3m_3    (1.15)

which may be written as

    [d_1]   [2  0  -4] [m_1]
    [d_2] = [1  2   3] [m_2]    (1.16)
                       [m_3]

or simply

    d = Gm    (1.13)

where

    d = [d_1, d_2]^T
    m = [m_1, m_2, m_3]^T

and

    G = [2  0  -4]
        [1  2   3]    (1.17)

Then d and m are 2 x 1 and 3 x 1 column vectors, respectively, and G is a 2 x 3 matrix with constant coefficients.

On the following pages, I will give some examples of how forward problems are set up using matrix notation. See pages 10-16 of Menke for these and other examples.

1.5 Examples of Forward Problems

1.5.1 Example 1: Fitting a Straight Line (see page 10 of Menke)

[Figure: temperature T versus depth z]

Suppose that N temperature measurements T_i are made at depths z_i in the earth. The data are then a vector d of N measurements of temperature, where

    d = [T_1, T_2, T_3, ..., T_N]^T

The depths z_i are not data. Instead, they provide some auxiliary information that describes the geometry of the experiment. This distinction will be further clarified below.

Suppose that we assume a model in which temperature is a linear function of depth: T = a + bz. The intercept a and slope b then form the two model parameters of the problem:

    m = [a, b]^T

According to the model, each temperature observation must satisfy T_i = a + b z_i:

    T_1 = a + b z_1
    T_2 = a + b z_2
     .
     .
    T_N = a + b z_N

These equations can be arranged as the matrix equation Gm = d:

    [T_1]   [1  z_1]
    [T_2]   [1  z_2] [a]
    [ . ] = [ .  . ] [b]
    [T_N]   [1  z_N]

1.5.2 Example 2: Fitting a Parabola (see page 11 of Menke)

[Figure: temperature T versus depth z]

If the model in Example 1 is changed to assume a quadratic variation of temperature with depth of the form T = a + bz + cz^2, then a
new model parameter is added to the problem:

    m = [a, b, c]^T

The number of model parameters is now M = 3. The data are supposed to satisfy

    T_1 = a + b z_1 + c z_1^2
    T_2 = a + b z_2 + c z_2^2
     .
     .
    T_N = a + b z_N + c z_N^2

These equations can be arranged into the matrix equation

    [T_1]   [1  z_1  z_1^2]
    [T_2]   [1  z_2  z_2^2] [a]
    [ . ] = [ .   .    .  ] [b]
    [T_N]   [1  z_N  z_N^2] [c]

This matrix equation has the explicit linear form Gm = d. Note that although the equation is linear in the data and model parameters, it is not linear in the auxiliary variable z. The equation has a very similar form to the equation of the previous example, which brings out one of the underlying reasons for employing matrix notation: it can often emphasize similarities between superficially different problems.

1.5.3 Example 3: Acoustic Tomography (see pages 12-13 of Menke)

Suppose that a wall is assembled from a rectangular array of bricks (Figure 1.1, from Menke, below) and that each brick is composed of a different type of clay. If the acoustic velocities of the different clays differ, one might attempt to distinguish the different kinds of bricks by measuring the travel time of sound across the various rows and columns of bricks in the wall.

The data in this problem are N = 8 measurements of travel times:

    d = [T_1, T_2, T_3, ..., T_8]^T

The model assumes that each brick is composed of a uniform material and that the travel time of sound across each brick is proportional to the width and height of the brick. The proportionality factor is the brick's slowness s_i, thus giving M = 16 model parameters:

    m = [s_1, s_2, s_3, ..., s_16]^T

where the ordering is according to the numbering scheme of the figure.

[Figure 1.1: A 4 x 4 square array of bricks, numbered 1-16 row by row (13, 14, 15, 16 in the bottom row). The travel time of acoustic waves (dashed lines) through the rows and columns of the array is measured with the acoustic source S and receiver R placed on the edges of the square. The inverse problem is to infer the acoustic properties of the bricks, which are assumed to be homogeneous.]

The travel-time equations are, for example,

    row 1:     T_1 = h s_1 + h s_2  + h s_3  + h s_4
    row 2:     T_2 = h s_5 + h s_6  + h s_7  + h s_8
     .
    column 4:  T_8 = h s_4 + h s_8  + h s_12 + h s_16

and the matrix equation is

    [T_1]     [1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0] [s_1 ]
    [T_2]     [0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0] [s_2 ]
    [ . ] = h [              .                 ] [  . ]
    [T_8]     [0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1] [s_16]

Here the bricks are assumed to be of width and height h.

1.5.4 Example 4: Seismic Tomography

An example of the impact of inverse methods in the geosciences: Northern California. A large amount of data is available, much of it redundant. Patterns in the data can be interpreted qualitatively; inversion results quantify the patterns. Perhaps more importantly, inverse methods provide quantitative information on the resolution, standard error, and "goodness of fit." We cannot overemphasize the "impact" of colorful graphics, for both good and bad. Inverse theory is not a magic bullet: bad data will still give bad results, and interpretation of even good results requires breadth of understanding in the field. Inverse theory does provide quantitative information on how well the model is "determined," the importance of data, and model errors. Another example: improvements in "imaging" subduction zones.

1.5.5 Example 5: Convolution

Convolution is widely significant as a physical concept and offers an advantageous starting point for many theoretical developments. "One way to think about convolution is that it describes the action of an observing instrument when it takes a weighted mean of some physical quantity over a narrow range of some variable. All physical observations are limited in this way, and for this reason alone convolution is ubiquitous" (paraphrased from Bracewell, The Fourier Transform and Its Applications, 1964). It is widely used in time series analysis as well, to represent physical processes.

The convolution of two functions f(x) and g(x), represented as f(x)*g(x), is

    f(x)*g(x) = INTEGRAL f(u) g(x - u) du    (1.18)

For discrete, finite functions with common sampling intervals, the convolution is

    h_k = SUM_{i=0}^{m} f_i g_{k-i},    0 <= k <= m + n    (1.19)

A FORTRAN computer program for convolution would look something like

          L=M+N-1
          DO 10 I=1,L
    10    H(I)=0.0
          DO 20 J=1,M
          DO 20 I=1,N
    20    H(I+J-1)=H(I+J-1)+G(I)*F(J)

Convolution may also be written using matrix notation as
h2 j r r 120 n 0 fn 39 gm hnmil In the matrix form we recognize our familiar equation Gm d ignoring the confusing notation differences between elds when for example g1 above would be m1 and we can define deconvolution as the inverse problem of finding m G ld Altematively we can also reformulate the problem as GTGm GTD and find the solution as m GT 1 GTd 16 Final Comments The purpose of the previous examples has been to help you formulate forward problems in matrix notation It helps you to clearly differentiate model parameters from other information needed to calculate predicted data It also helps you separate data from everything else Getting the forward problem set up in matrix notation is essential before you can invert the system The logical next step is to take the forward problem given by d Gm 113 and invert it for an estimate of the model parameters mest as mest G inverse d 117 We will spend a lot of effort determining just what G lnverse means when the inverse does not exist in the mathematical sense of GG inverse G inverse G I 118 where I is the identity matrix The next order of business however is to shift our attention to a review of the basics of matrices anal linear algebra as well as probability and statistics in order to take full advantage of the power of inverse theory 11 Geosciences 567 CHAPTERZ RMRGZ CHAPTER 2 REVIEW OF LINEAR ALGEBRA AND STATISTICS 21 Introduction In discrete inverse methods matrices and linear transformations play Jndamental roles So do probability and statistics This review chapter then is divided into two parts In the rst we will begin by reviewing the basics of matrix manipulations Then we will introduce some special types of matrices Hermitian orthogonal and semiorthogonal Finally we will look at matrices as linear transformations that can operate on vectors of one dimension and return a vector of another dimension In the second section we will review some elementary probability and statistics with emphasis on Gaussian 
statistics. The material in the first section will be particularly useful in later chapters, when we cover eigenvalue problems and methods based on the length of vectors. The material in the second section will be very useful when we consider the nature of noise in the data and when we consider the maximum likelihood inverse.

2.2 Matrices and Linear Transformations

Recall from the first chapter that, by convention, vectors will be denoted by lowercase letters in boldface (i.e., the data vector d), while matrices will be denoted by uppercase letters in boldface (i.e., the matrix G) in these notes.

2.2.1 Review of Matrix Manipulations

Matrix Multiplication. If A is an N x M matrix (as in N rows by M columns) and B is an M x L matrix, we write the N x L product C of A and B as

    C = AB    (2.1)

We note that matrix multiplication is associative, that is,

    (AB)C = A(BC)    (2.2)

but in general is not commutative. That is, in general,

    AB != BA    (2.3)

In fact, if AB exists, then the product BA exists only if A is N x M and B is M x N; both products are then square (N x N and M x M, respectively), though generally of different sizes.

In Equation (2.1) above, the ijth entry in C is the product of the ith row of A and the jth column of B. Computationally, it is given by

    c_ij = SUM_{k=1}^{M} a_ik b_kj    (2.4)

One way to form C using standard FORTRAN code would be

          DO 300 I=1,N
          DO 300 J=1,L
          C(I,J)=0.0
          DO 300 K=1,M
    300   C(I,J)=C(I,J)+A(I,K)*B(K,J)    (2.5)

A special case of the general rule above is the multiplication of a matrix G (N x M) and a vector m (M x 1):

      d    =    G      m    (1.13)
    (N x 1)  (N x M)(M x 1)

In terms of computation, the ith entry of the vector d is given by

    d_i = SUM_{j=1}^{M} G_ij m_j    (2.6)

The Inverse of a Matrix. The mathematical inverse of the M x M matrix A, denoted A^-1, is defined such that

    A A^-1 = A^-1 A = I_M    (2.7)

where I_M is the M x M identity matrix, given by

          [1  0  ...  0]
    I_M = [0  1       .]    (2.8)
          [.      .    ]
          [0  ...  0  1]

A^-1 is the matrix which, when either pre- or postmultiplied by A, returns the identity matrix. Clearly, since only square matrices can both pre- and postmultiply each other, the mathematical inverse of a matrix only exists for square matrices. A useful theorem follows, concerning the inverse of a product of matrices.
Theorem: If

    A = BCD    (2.8a)

where A, B, C, and D are all N x N, then A^-1, if it exists, is given by

    A^-1 = D^-1 C^-1 B^-1    (2.8b)

Proof:

    A A^-1 = (BCD)(D^-1 C^-1 B^-1)
           = BC(D D^-1)C^-1 B^-1
           = BC I C^-1 B^-1
           = B(C C^-1)B^-1
           = B B^-1
           = I    (2.8c)

Similarly, A^-1 A = (D^-1 C^-1 B^-1)(BCD) = I.  Q.E.D.

The Transpose and Trace of a Matrix. The transpose of a matrix A is written as A^T and is given by

    (A^T)_ij = A_ji    (2.9)

That is, you interchange rows and columns. The transpose of a product of matrices is the product of the transposes, in reverse order. That is,

    (AB)^T = B^T A^T    (2.10)

Just about everything we do with real matrices A has an analog for complex matrices. In the complex case, wherever the transpose of a matrix occurs, it is replaced by the complex conjugate transpose of the matrix, denoted A^H. That is, if

    A_ij = a_ij + b_ij i    (2.11a)

then

    (A^H)_ij = c_ij + d_ij i    (2.11b)

where

    c_ij = a_ji    (2.11c)

and

    d_ij = -b_ji    (2.11d)

that is,

    (A^H)_ij = a_ji - b_ji i    (2.11e)

Finally, the trace of A is given by

    trace(A) = SUM_{i=1}^{M} a_ii    (2.12)

Hermitian Matrices. A matrix A is said to be Hermitian if it is equal to its complex conjugate transpose. That is,

    A = A^H    (2.13a)

If A is a real matrix, this is equivalent to

    A = A^T    (2.13b)

This implies that A must be square. The reason that Hermitian matrices will be important is that they have only real eigenvalues. We will take advantage of this many times when we consider eigenvalue and shifted eigenvalue problems later.

2.2.2 Matrix Transformations

Linear Transformations. A matrix equation can be thought of as a linear transformation. Consider, for example, the original matrix equation

    d = Gm    (1.13)

where d is an N x 1 vector, m is an M x 1 vector, and G is an N x M matrix. The matrix G can be thought of as an operator that operates on an M-dimensional vector m and returns an N-dimensional vector d.

Equation (1.13) represents an explicit linear relationship between the data and model parameters. The operator G in this case is said to be linear because if m is doubled, for example, so is d. Mathematically, one says that G is a linear operator if the following is true:

    If d = Gm and f = Gr, then d + f = G(m + r)    (2.14)

In another way to
Another way to look at matrix multiplication uses the by-now-familiar Equation (1.13), d = Gm: the column vector d can be thought of as a weighted sum of the columns of G, with the weighting factors being the elements of m. That is,

d = m_1 g_1 + m_2 g_2 + ... + m_M g_M    (2.15)

where

m = [m_1, m_2, ..., m_M]^T    (2.16a)

and

g_i = [g_1i, g_2i, ..., g_Ni]^T    (2.16b)

is the ith column of G. Also, if GA = B, then the above can be used to infer that the first column of B is a weighted sum of the columns of G, with the elements of the first column of A as weighting factors, etc., for the other columns of B. Each column of B is a weighted sum of the columns of G.

Next consider the transpose of Equation (1.13),

d^T = (G m)^T    (2.17)

or

d^T = m^T G^T,  (1×N) = (1×M)(M×N)    (2.18)

The row vector d^T is a weighted sum of the rows of G^T, with the weighting factors again being the elements of m. That is,

d^T = m_1 g_1^T + m_2 g_2^T + ... + m_M g_M^T    (2.19)

Extending this to

A^T G^T = B^T    (2.20)

we have that each row of B^T is a weighted sum of the rows of G^T, with the weighting factors being the elements of the appropriate row of A^T. In a long string of matrix multiplications such as

A B C = D    (2.21)

each column of D is a weighted sum of the columns of A, and each row of D is a weighted sum of the rows of C.

Orthogonal Transformations

An orthogonal transformation is one that leaves the length of a vector unchanged. We can only talk about the length of a vector being unchanged if the dimension of the vector is unchanged; thus only square matrices may represent an orthogonal transformation. Suppose L is an orthogonal transformation. Then if

L x = y    (2.22)

where L is N×N and x, y are both N-dimensional vectors, then

x^T x = y^T y    (2.23)

where Equation (2.23) represents the dot product of each vector with itself, which is equal to the squared length of the vector. If you have ever done coordinate transformations, you have dealt with an orthogonal transformation: orthogonal transformations rotate vectors but do not change their lengths.

There are several properties of orthogonal transformations that we will wish to use.
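The length-preserving property (2.22)–(2.23) can be seen with a rotation matrix, the classic orthogonal transformation. A short numpy sketch (the angle and vector are arbitrary illustrative choices, not from the notes):

```python
import numpy as np

# A 2-D rotation by angle t is an orthogonal transformation (Eq. 2.22)
t = 0.7
L = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])

x = np.array([3.0, 4.0])
y = L @ x

# Squared length is preserved (Eq. 2.23), even though the direction changes
print(np.isclose(x @ x, y @ y))            # True
print(np.isclose(np.linalg.norm(y), 5.0))  # True
```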
First, if L is an N×N orthogonal transformation, then

L^T L = I_N    (2.24)

This follows from

y^T y = (L x)^T (L x) = x^T L^T L x    (2.25)

but y^T y = x^T x by Equation (2.23). Thus

L^T L = I_N    Q.E.D.    (2.26)

Second, the relationship between L and its inverse is given by

L^-1 = L^T    (2.27a)

and

L = (L^T)^-1    (2.27b)

These two follow directly from Equation (2.26).

Third, the determinant of a matrix is unchanged if it is operated upon by orthogonal transformations. Recall that the determinant of a 3×3 matrix A, for example, where A is given by

A = | a11 a12 a13 |
    | a21 a22 a23 |    (2.28)
    | a31 a32 a33 |

is given by

det A = a11 (a22 a33 - a23 a32) - a12 (a21 a33 - a23 a31) + a13 (a21 a32 - a22 a31)    (2.29)

Thus, if A is an M×M matrix, L is an orthogonal transformation, and

A' = L A L^T    (2.30a)

it follows that

det A' = det A    (2.30b)

Fourth, the trace of a matrix is unchanged if it is operated upon by an orthogonal transformation, where trace A is defined as

trace A = Σ_{i=1}^{M} a_ii    (2.30c)

That is, the sum of the diagonal elements of a matrix is unchanged by an orthogonal transformation. Thus

trace A' = trace A    (2.30d)

Semiorthogonal Transformations

Suppose that the linear operator L is not square but N×M, N ≠ M. Then L is said to be semiorthogonal if and only if

L^T L = I_M but L L^T ≠ I_N,  N > M    (2.31)

or

L L^T = I_N but L^T L ≠ I_M,  M > N    (2.32)

where I_N and I_M are the N×N and M×M identity matrices, respectively. A matrix cannot be both orthogonal and semiorthogonal: orthogonal matrices must be square, and semiorthogonal matrices cannot be square. Furthermore, if L is a square N×N matrix and L^T L = I_N (2.26), then it is not possible to have L L^T ≠ I_N (2.33).

2.2.3 Matrices and Vector Spaces

The columns or rows of a matrix can be thought of as vectors. For example, if A is an N×M matrix, each column can be thought of as a vector in N-space because it has N entries. Conversely, each row of A can be thought of as a vector in M-space because it has M entries. We note that for the linear system of equations given by

G m = d    (1.13)

where G is N×M, m is M×1, and d is N×1, the model parameter vector m
lies in M-space along with all the rows of G, while the data vector d lies in N-space along with all the columns of G. In general, we will think of the M×1 vectors as lying in model space, while the N×1 vectors lie in data space.

Spanning a Space

The notion of spanning a space is important for any discussion of the uniqueness of solutions or of the ability to fit the data. We first need to introduce definitions of linear independence and vector orthogonality. A set of M vectors v_1, ..., v_M in M-space (the set of all M-dimensional vectors) is said to be linearly independent if and only if

a_1 v_1 + a_2 v_2 + ... + a_M v_M = 0    (2.34)

where the a_i are constants, has only the trivial solution a_i = 0, i = 1, ..., M. This is equivalent to saying that an arbitrary vector s in M-space can be written as a linear combination of the v_i, i = 1, ..., M. That is, one can find a_i such that, for an arbitrary vector s,

s = a_1 v_1 + a_2 v_2 + ... + a_M v_M    (2.35)

Two vectors r and s in M-space are said to be orthogonal to each other if their dot (or inner) product with each other is zero; that is, if

r · s = |r| |s| cos θ = 0    (2.36)

where θ is the angle between the vectors and |r|, |s| are the lengths of r and s, respectively. The dot product of two vectors is also given by

r^T s = s^T r = Σ_{i=1}^{M} r_i s_i    (2.37)

M-space is spanned by any set of M linearly independent M-dimensional vectors.

Rank of a Matrix

The number of linearly independent rows in a matrix, which is also equal to the number of linearly independent columns, is called the rank of the matrix. The rank is defined for both square and nonsquare matrices, and it cannot exceed the minimum of the number of rows and columns in the matrix; that is, the rank is less than or equal to min(N, M). If an M×M matrix is an orthogonal matrix, then it has rank M: the M rows are all linearly independent, as are the M columns. In fact, for an orthogonal matrix the rows are not only independent, they are orthogonal to each other; the same is true for the columns. If a matrix is semiorthogonal, then the smaller set, the M columns (or the N rows), is mutually orthogonal.
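The rank definition above can be sketched numerically (the matrices are illustrative values of my own choosing, not from the notes):

```python
import numpy as np

# Rank = number of linearly independent rows (= number of independent columns)
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])      # third row = first row + second row
print(np.linalg.matrix_rank(A))      # 2

# Rank cannot exceed min(N, M): a 2 x 3 matrix has rank at most 2
B = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(np.linalg.matrix_rank(B))      # 2
```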
We will make extensive use of matrices and linear algebra in this course, especially when we work with the generalized inverse. Next we need to turn our attention to probability and statistics.

2.3 Probability and Statistics

2.3.1 Introduction

We need some background in probability and statistics before proceeding very far. In this review section I will cover the material from Menke's book, using some material from other math texts to help clarify things. Basically, what we need is a way of describing the noise in data and in estimated model parameters. We will need the following terms: random variable, probability distribution, mean or expected value, maximum likelihood, variance, standard deviation, standardized normal variables, covariance, correlation coefficients, Gaussian distributions, and confidence intervals.

2.3.2 Definitions, Part 1

Random Variable: A function that assigns a value to the outcome of an experiment. A random variable has well-defined properties based on some distribution. It is called random because you cannot know beforehand the exact value for the outcome of the experiment. One cannot measure directly the true properties of a random variable; one can only make measurements (also called realizations) of a random variable and estimate its properties. The birth weight of baby goslings is a random variable, for example.

Probability Density Function: The true properties of a random variable b are specified by the probability density function P(b). The probability that a particular realization of b will fall between b and b + db is given by P(b) db. (Note that Menke uses d where I use b; his notation is bad when one needs to use integrals.) P(b) satisfies

∫_{-∞}^{∞} P(b) db = 1    (2.38)

which says that the probability of b taking on some value is 1. P(b) completely describes the random variable b. It is often useful, however, to try to summarize the properties of P(b) with a few numbers.

Mean or Expected Value: The mean value E(b), also denoted ⟨b⟩, is
much like the mean of a set of numbers; that is, it is the balancing point of the distribution P(b), and is given by

E(b) = ∫_{-∞}^{∞} b P(b) db    (2.39)

Maximum Likelihood: This is the point in the probability distribution P(b) that has the highest likelihood, or probability. It may or may not be close to the mean E(b) = ⟨b⟩. An important point is that for Gaussian distributions the maximum likelihood point and the mean E(b) = ⟨b⟩ are the same. [Figure, after Figure 2.3, p. 23 of Menke: a skewed distribution P(b) for which the maximum likelihood point b_ML and the mean ⟨b⟩ differ.] The maximum likelihood point b_ML of the probability distribution P(b) for data b gives the most probable value of the data. In general, this value can be different from the mean datum ⟨b⟩, which is at the balancing point of the distribution.

Variance: Variance is one measure of the spread, or width, of P(b) about the mean E(b). It is given by

σ² = ∫_{-∞}^{∞} (b - ⟨b⟩)² P(b) db    (2.40a)

Computationally, for L experiments in which the kth experiment gives b_k, the variance is given by

σ² = [1/(L - 1)] Σ_{k=1}^{L} (b_k - ⟨b⟩)²    (2.40b)

Standard Deviation: Standard deviation is the positive square root of the variance, given by

σ = (σ²)^{1/2}    (2.40c)

Covariance: Covariance is a measure of the correlation between errors. If the errors in two observations are uncorrelated, then the covariance is zero. We need another definition before proceeding.

Joint Density Function P(b): The probability that b_1 is between b_1 and b_1 + db_1, that b_2 is between b_2 and b_2 + db_2, etc. If the data are independent, then

P(b) = P(b_1) P(b_2) ... P(b_n)    (2.41)

If the data are correlated, then P(b) will have some more complicated form. The covariance between b_1 and b_2 is defined as

cov(b_1, b_2) = ∫...∫ (b_1 - ⟨b_1⟩)(b_2 - ⟨b_2⟩) P(b) db_1 db_2 ... db_n    (2.42a)

In the event that the data are independent, this reduces to

cov(b_1, b_2) = ∫∫ (b_1 - ⟨b_1⟩)(b_2 - ⟨b_2⟩) P(b_1) P(b_2) db_1 db_2 = 0    (2.42b)

The reason is that, for any value of (b_1 - ⟨b_1⟩), the factor (b_2 - ⟨b_2⟩) is as likely to be positive as negative; that is, the integral averages to zero. The matrix [cov b] contains all of the covariances defined using Equation (2.42).
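The sample formulas (2.40b) and, below, (2.42c) are exactly what numpy's covariance routine computes with its default L − 1 normalization. A sketch (synthetic data of my own choosing, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 1000
b1 = rng.normal(0.0, 2.0, L)
b2 = 0.5 * b1 + rng.normal(0.0, 1.0, L)   # deliberately correlated with b1

# Sample variance, Eq. (2.40b), and sample covariance, Eq. (2.42c)
var_b1 = np.sum((b1 - b1.mean()) ** 2) / (L - 1)
cov_12 = np.sum((b1 - b1.mean()) * (b2 - b2.mean())) / (L - 1)

# numpy's cov uses the same L - 1 normalization by default
C = np.cov(b1, b2)
print(np.isclose(var_b1, C[0, 0]))   # True
print(np.isclose(cov_12, C[0, 1]))   # True
```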
The matrix [cov b] is N×N. Note also that the covariance of b_i with itself is just the variance of b_i. In practical terms, if one has an N-dimensional data vector b that has been measured L times, then the ij-th term of [cov b], denoted cov(b_i, b_j), is defined as

cov(b_i, b_j) = [1/(L - 1)] Σ_{k=1}^{L} (b_i^k - ⟨b_i⟩)(b_j^k - ⟨b_j⟩)    (2.42c)

where b_i^k is the value of the ith datum in b on the kth measurement of the data vector, ⟨b_i⟩ is the mean, or average, value of b_i over all L measurements, and the L - 1 term results from sampling theory.

Correlation Coefficients: This is a normalized measure of the degree of correlation of errors. It takes on values between -1 and +1, with a value of 0 implying no correlation. The correlation coefficient matrix [cor b] is defined as

[cor b]_ij = cov(b_i, b_j) / (σ_i σ_j)    (2.43)

where cov(b_i, b_j) is the covariance matrix defined term by term as above, and σ_i and σ_j are the standard deviations of the ith and jth observations, respectively. The diagonal terms of [cor b] are equal to 1, since each observation is perfectly correlated with itself. [Figure, after Figure 2.8, p. 26 of Menke: contour plots of P(b_1, b_2) when the data are (a) uncorrelated, (b) positively correlated, (c) negatively correlated. The dashed lines indicate the four quadrants of alternating sign used to determine correlation.]

2.3.3 Some Comments on Applications to Inverse Theory

Some comments are now in order about the nature of the estimated model parameters. We will always assume that the noise in the observations can be described as random variables. Whatever inverse we create will map errors in the data into errors in the estimated model parameters. Thus the estimated model parameters are themselves random variables. This is true even though the true model parameters may not be random variables. If the distribution of noise for the data is known, then in principle the distribution for the estimated model parameters can be found by
mapping through the inverse operator. This is often very difficult, but one particular case turns out to have a rather simple form. We will see where this form comes from when we get to the subject of generalized inverses; for now, consider the following as magic. If the transformation between data b and model parameters m is of the form

m = M b + v    (2.44a)

where M is any arbitrary matrix and v is any arbitrary vector, then

⟨m⟩ = M ⟨b⟩ + v    (2.44b)

and

[cov m] = M [cov b] M^T    (2.44c)

2.3.4 Definitions, Part 2

Gaussian Distribution: This is a particular probability distribution, given by

P(b) = [1 / (σ (2π)^{1/2})] exp[ -(b - ⟨b⟩)² / (2σ²) ]    (2.45a)

It has the following properties: mean E(b) = ⟨b⟩ and variance σ². [Figure, after Figure 2.10, p. 29 of Menke: the familiar bell-shaped curves of Gaussian distributions with zero mean, σ = 1 for curve A and σ = 2 for curve B.] Many distributions can be approximated fairly accurately, especially away from the tails, by the Gaussian distribution. It is also very important because it is the limiting distribution for the sum of random variables. This is often just what one assumes for noise in the data.

One also needs a way to represent the joint probability, introduced earlier, for a set of random variables, each of which has a Gaussian distribution. The joint probability density function for a vector b of observations that all have Gaussian distributions is chosen to be (see Equation 2.10 of Menke, p. 30)

P(b) = [det(cov b)]^{-1/2} (2π)^{-N/2} exp[ -(1/2) (b - ⟨b⟩)^T [cov b]^{-1} (b - ⟨b⟩) ]    (2.45b)

which reduces to the previous case, Equation (2.45a), for N = 1 and var(b_1) = σ². In statistics books, Equation (2.45b) is often given as

P(b) = (2π)^{-N/2} |Σ_b|^{-1/2} exp[ -(1/2) (b - μ_b)^T Σ_b^{-1} (b - μ_b) ]

With this background, it makes sense, statistically at least, to replace the original relationship

b = G m    (1.13)

with

⟨b⟩ = G m    (2.46)

The reason is that one cannot expect that there is an m that should exactly predict any particular realization of b when b is, in fact, a random variable.
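The covariance-propagation rule (2.44c) can be checked by Monte Carlo: draw many realizations of b, map each through m = Mb + v, and compare the sample covariance of m with M [cov b] M^T. A sketch (the operator M and the covariance are arbitrary illustrative choices, not from the notes; the comparison is statistical, so a loose tolerance is used):

```python
import numpy as np

rng = np.random.default_rng(2)
cov_b = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # covariance of the data b
M = np.array([[1.0, -1.0],
              [0.5,  2.0]])               # some linear operator, m = M b + v

# Analytic propagation, Eq. (2.44c)
cov_m = M @ cov_b @ M.T

# Monte Carlo check
b = rng.multivariate_normal([0.0, 0.0], cov_b, size=200_000)
m = b @ M.T
print(np.allclose(np.cov(m.T), cov_m, atol=0.1))   # True, within sampling error
```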
The joint probability is then given by

P(b) = [det(cov b)]^{-1/2} (2π)^{-N/2} exp[ -(1/2) (b - G m)^T [cov b]^{-1} (b - G m) ]    (2.47)

What one then does is seek an m that maximizes the probability that the predicted data are, in fact, close to the observed data. This is the basis of the maximum likelihood, or probabilistic, approach to inverse theory.

Standardized Normal Variables: It is possible to standardize random variables by subtracting their mean and dividing by the standard deviation. If the random variable had a Gaussian (i.e., normal) distribution, then so does the standardized random variable; now, however, the standardized normal variable has zero mean and standard deviation equal to one. Random variables can be standardized by the transformation

s = (m - ⟨m⟩) / σ    (2.48)

where you will often see z replacing s in statistics books. We will see, when all is said and done, that most inverses represent a transformation to standardized variables, followed by a simple inverse analysis, and then a transformation back for the final solution.

Chi-Squared Goodness-of-Fit Test: A statistical test to see whether a particular observed distribution is likely to have been drawn from a population having some known form. The application we will make of the chi-squared test is to test whether the noise in a particular problem is likely to have a Gaussian distribution. This is not the kind of question one can answer with certainty, so one must talk in terms of probability or likelihood. For example, in the chi-squared test one typically says things like "there is only a 5% chance that this sample distribution does not follow a Gaussian distribution." As applied to testing whether a given distribution is likely to have come from a Gaussian population, the procedure is as follows. One sets up an arbitrary number of bins and compares the number of observations that fall into each bin with the number expected from a Gaussian distribution having the same mean and variance as the observed data. One quantifies the departure between the two
distributions, called the chi-squared value and denoted χ², as

χ² = Σ_{i=1}^{k} [ (observed in bin i - expected in bin i)² / (expected in bin i) ]    (2.49)

where the sum is over the number of bins, k. Next, the number of degrees of freedom for the problem must be considered. For this problem, the number of degrees of freedom is equal to the number of bins minus three. The reason you subtract three is as follows. You subtract 1 because if an observation does not fall into any subset of k - 1 bins, you know it falls in the one bin left over; you are not free to put it anywhere else. The other two come from the fact that you have assumed that the mean and standard deviation of the observed data set are the mean and standard deviation for the theoretical Gaussian distribution. With this information in hand, one uses standard chi-squared test tables from statistics books and determines whether such a departure would occur randomly more often than, say, 5% of the time. Officially, the null hypothesis is that the sample was drawn from a Gaussian distribution. If the observed value for χ² is greater than χ²_α, then one rejects the null hypothesis at the α significance level. Typically, α = 0.05 is used for the test. The α significance level is equivalent to the (1 - α) confidence level.

Confidence Intervals: One says, for example, with 98% confidence that the true mean of a random variable lies between two values. This is based on knowing the probability distribution for the random variable, of course, and can be very difficult, especially for complicated distributions that include nonzero correlation coefficients. For Gaussian distributions, however, these are well known and can be found in any standard statistics book. For example, Gaussian distributions have 68% and 95% confidence intervals of approximately ±1σ and ±2σ, respectively.

t and F Tests: These two statistical tests are commonly used to determine whether the properties of two samples are consistent with the samples coming from the same population. The F test, in particular, can be used to test whether adding a degree of freedom to the inversion significantly improves the fit between predicted and observed data.
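The binned χ² computation of Equation (2.49) can be sketched as follows (numpy; the sample and bin edges are illustrative choices of my own; the resulting χ², with k − 3 = 3 degrees of freedom here, would be compared against the tabulated 5% critical value, 7.815, from a standard chi-squared table):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(3)
L = 500
b = rng.normal(5.0, 2.0, L)               # sample to be tested

edges = np.array([-1e9, 2.0, 3.5, 5.0, 6.5, 8.0, 1e9])   # k = 6 bins
obs, _ = np.histogram(b, bins=edges)

mu, sigma = b.mean(), b.std(ddof=1)

def gauss_cdf(x):
    """CDF of the Gaussian with the sample's own mean and standard deviation."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * np.sqrt(2.0))))

expected = L * np.diff([gauss_cdf(e) for e in edges])     # Gaussian counts per bin

chi2 = np.sum((obs - expected) ** 2 / expected)           # Eq. (2.49)
dof = len(obs) - 3                                        # k bins minus 3
print(round(chi2, 2), dof)
```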
One expects to fit the data better by adding more model parameters, so the relevant question is whether the improvement is significant. As applied to the test of improvement in fit between case 1 and case 2, where case 2 uses more model parameters (degrees of freedom) to describe the same data set, the F ratio is given by

F = [ (E_1 - E_2) / (DOF_1 - DOF_2) ] / [ E_2 / DOF_2 ]    (2.50)

where E is the residual sum of squares and DOF is the number of degrees of freedom for each case. If F is large, one accepts that the second case, with more model parameters, provides a significantly better fit to the data. The calculated F is compared to published tables with DOF_1 - DOF_2 and DOF_2 degrees of freedom at a specified confidence level.

Reference: Hearn, T. M., Pn travel times in Southern California, J. Geophys. Res., 89, 1843–1855, 1984.

The next section will deal with solving inverse problems based on length measures. This will include the classic least squares approach.

CHAPTER 3: INVERSE METHODS BASED ON LENGTH

3.1 Introduction

This chapter is concerned with inverse methods based on the length of various vectors that arise in a typical problem. The two most common vectors concerned are the data-error, or misfit, vector and the model parameter vector. Methods based on the first vector give rise to classic least squares solutions. Methods based on the second vector give rise to what are known as minimum length solutions. Improvements over simple least squares and minimum length solutions include the use of information about noise in the data and a priori information about the model parameters, and are known as weighted least squares or weighted minimum length solutions, respectively. This chapter will end with material on how to handle constraints and on variances of the estimated model parameters.

3.2 Data Error and Model Parameter Vectors

The data error and model parameter vectors will play an
essential role in the development of inverse methods. They are given by the

data error vector: e = d_obs - d_pre    (3.1)

and the

model parameter vector: m = [m_1, m_2, ..., m_M]^T    (3.2)

The dimension of the error vector e is N×1, while the dimension of the model parameter vector m is M×1. In order to utilize these vectors, we next consider the notion of the size, or length, of vectors.

3.3 Measures of Length

The norm of a vector is a measure of its size, or length. There are many possible definitions for norms; we are most familiar with the Cartesian (L_2) norm. Some examples of norms follow:

L_1: ||e||_1 = Σ_{i=1}^{N} |e_i|    (3.3a)

L_2: ||e||_2 = [ Σ_{i=1}^{N} e_i² ]^{1/2}    (3.3b)

L_m: ||e||_m = [ Σ_{i=1}^{N} |e_i|^m ]^{1/m}    (3.3c)

and finally

L_∞: ||e||_∞ = max_i |e_i|    (3.3d)

Inverse methods based on different norms can, and often do, give different answers. The reason is that different norms give different weight to outliers. For example, the L_∞ norm gives all the weight to the largest misfit, while low-order norms give more equal weight to errors of different sizes. The L_2 norm gives the familiar Cartesian length of a vector.

Consider the total misfit E between observed and predicted data. It has units of length squared and can be found either as the square of the L_2 norm of e, the error vector of Equation (3.1), or, equivalently, as the dot (or inner) product of e with itself:

E = e^T e = [e_1 e_2 ... e_N] [e_1; e_2; ...; e_N] = Σ_{i=1}^{N} e_i²    (3.4)

Inverse methods based on the L_2 norm are also closely tied to the notion that errors in the data have Gaussian statistics. They give considerable weight to large errors, which would be considered unlikely if, in fact, the errors were distributed in a Gaussian fashion.

Now that we have a way to quantify the misfit between predicted and observed data, we are ready to define a procedure for estimating the value of the elements of m. The procedure is to take the partial derivative of E with respect to each element of m and set the resulting equations to zero. This produces a system of M equations.
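The norms (3.3a)–(3.3d) and the total misfit (3.4) can be computed directly; a sketch with an illustrative misfit vector containing one outlier (values of my own choosing), showing how the different norms weight it:

```python
import numpy as np

e = np.array([0.1, -0.2, 0.15, 3.0])   # misfit vector with one outlier

L1   = np.sum(np.abs(e))               # Eq. (3.3a)
L2   = np.sqrt(np.sum(e ** 2))         # Eq. (3.3b)
Linf = np.max(np.abs(e))               # Eq. (3.3d): dominated by the outlier

# numpy's built-in norms agree
print(np.allclose([L1, L2, Linf],
                  [np.linalg.norm(e, 1),
                   np.linalg.norm(e, 2),
                   np.linalg.norm(e, np.inf)]))  # True

E = e @ e                              # total misfit, Eq. (3.4)
print(np.isclose(E, L2 ** 2))          # True
```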
Manipulating this system in general leads to a solution for the M elements of m. The next section will show how this is done for the least squares problem of finding a best-fit straight line to a set of data points.

3.4 Minimizing the Misfit: Least Squares

3.4.1 Least Squares Problem for a Straight Line

[Figure, after Figure 3.1 of Menke, p. 36: (a) least squares fitting of a straight line to (z_i, d_i) pairs; (b) the error e_i for each observation is the difference between the observed and predicted datum, e_i = d_i^obs - d_i^pre.]

The ith predicted datum d_i^pre for the straight-line problem is given by

d_i^pre = m_1 + m_2 z_i    (3.5)

where the two unknowns m_1 and m_2 are the intercept and slope of the line, respectively, and z_i is the value along the z axis where the ith observation is made. For N points we have a system of N such equations, which can be written in matrix form as

[d_1; d_2; ...; d_N] = [1 z_1; 1 z_2; ...; 1 z_N] [m_1; m_2]    (3.6)

or, in the by now familiar matrix notation,

d = G m,  (N×1) = (N×2)(2×1)    (1.13)

The total misfit E is given by

E = e^T e = Σ_{i=1}^{N} (d_i^obs - d_i^pre)²    (3.7)

= Σ_{i=1}^{N} (d_i^obs - m_1 - m_2 z_i)²    (3.8)

Dropping the "obs" in the notation for the observed data, we have

E = Σ_{i=1}^{N} (d_i² - 2 d_i m_1 - 2 d_i m_2 z_i + 2 m_1 m_2 z_i + m_1² + m_2² z_i²)    (3.9)

Then, taking the partials of E with respect to m_1 and m_2, respectively, and setting them to zero yields the following equations:

∂E/∂m_1 = 2 N m_1 - 2 Σ d_i + 2 m_2 Σ z_i = 0    (3.10)

and

∂E/∂m_2 = -2 Σ d_i z_i + 2 m_1 Σ z_i + 2 m_2 Σ z_i² = 0    (3.11)

Rewriting Equations (3.10) and (3.11) yields

N m_1 + m_2 Σ z_i = Σ d_i    (3.12)

and

m_1 Σ z_i + m_2 Σ z_i² = Σ d_i z_i    (3.13)

Combining the two equations in matrix notation, in the form A m = b, yields

[N Σz_i; Σz_i Σz_i²] [m_1; m_2] = [Σd_i; Σd_i z_i]    (3.14)

or simply

A m = b,  (2×2)(2×1) = (2×1)    (3.15)

Note that by the above procedure we have reduced the problem from one with N equations in two unknowns m_1 and m_2 (Gm = d) to one with two equations in the same two unknowns (Am = b). The matrix equation Am = b can also be rewritten in terms of the original G and d when one notices that the matrix A can be factored as
A = [N Σz_i; Σz_i Σz_i²] = [1 1 ... 1; z_1 z_2 ... z_N] [1 z_1; 1 z_2; ...; 1 z_N] = G^T G,  (2×2) = (2×N)(N×2)    (3.16)

Also, b above can be rewritten similarly as

b = [Σd_i; Σd_i z_i] = [1 1 ... 1; z_1 z_2 ... z_N] [d_1; d_2; ...; d_N] = G^T d    (3.17)

Thus, substituting Equations (3.16) and (3.17) into Equation (3.14), one arrives at the so-called normal equations for the least squares problem:

G^T G m = G^T d    (3.18)

The least squares solution m_LS is then found as

m_LS = (G^T G)^{-1} G^T d    (3.19)

assuming that (G^T G)^{-1} exists. In summary, we used the forward problem (Equation 3.6) to give us an explicit relationship between the model parameters m_1, m_2 and a measure E of the misfit to the observed data. Then we minimized E by taking the partial derivatives of the misfit function with respect to the unknown model parameters, setting the partials to zero, and solving for the model parameters.

3.4.2 Derivation of the General Least Squares Solution

We start with any system of linear equations, which can be expressed in the form

d = G m,  (N×1) = (N×M)(M×1)    (1.13)

Again, let E = e^T e = (d - d_pre)^T (d - d_pre):

E = (d - G m)^T (d - G m)    (3.20)

= Σ_{i=1}^{N} ( d_i - Σ_{j=1}^{M} G_ij m_j ) ( d_i - Σ_{k=1}^{M} G_ik m_k )    (3.21)

As before, the procedure is to write out the above equation with all its cross terms, take partials of E with respect to each of the elements of m, and set the corresponding equations to zero. For example, following Menke (p. 40, Equations 3.6–3.9), we obtain an expression for the partial of E with respect to m_q:

∂E/∂m_q = 2 Σ_{k=1}^{M} m_k Σ_{i=1}^{N} G_iq G_ik - 2 Σ_{i=1}^{N} G_iq d_i = 0    (3.22a)

We can simplify this expression by recalling Equation (2.4) from the introductory remarks on matrix manipulations in Chapter 2,

c_ij = Σ_{k=1}^{M} a_ik b_kj    (2.4)

Note that the first summation on i in Equation (3.22a) looks similar in form to Equation (2.4), but the subscripts on the first G term are backwards. If we further note that interchanging the subscripts is equivalent to taking the transpose of G, we see that the summation on i gives the qk-th entry of G^T G:

Σ_{i=1}^{N} G_iq G_ik = Σ_{i=1}^{N} [G^T]_qi G_ik = [G^T G]_qk    (3.22b)

Thus Equation (3.22a) reduces to

∂E/∂m_q = 2 Σ_{k=1}^{M} m_k [G^T G]_qk - 2 Σ_{i=1}^{N} G_iq d_i = 0    (3.23)

Now we can further simplify the first summation by recalling Equation (2.6) from the same section:
d_i = Σ_{j=1}^{M} G_ij m_j    (2.6)

To see this clearly, we rearrange the order of terms in the first sum as follows:

Σ_{k=1}^{M} m_k [G^T G]_qk = Σ_{k=1}^{M} [G^T G]_qk m_k = [G^T G m]_q    (3.24)

which is the qth entry of G^T G m. Note that G^T G m has dimension (M×N)(N×M)(M×1) = M×1; that is, it is an M-dimensional vector. In a similar fashion, the second summation on i can be reduced to [G^T d]_q, the qth entry of the (M×N)(N×1) = M×1 dimensional vector G^T d. Thus, for the qth equation, we have

0 = ∂E/∂m_q = 2 [G^T G m]_q - 2 [G^T d]_q    (3.25)

Dropping the common factor of 2 and combining the q equations into matrix notation, we arrive at

G^T G m = G^T d    (3.26)

The least squares solution for m is thus given by

m_LS = (G^T G)^{-1} G^T d    (3.27)

The least squares operator G_LS^{-1} is thus given by

G_LS^{-1} = (G^T G)^{-1} G^T    (3.28)

Recalling basic calculus, we note that m_LS above is the solution that minimizes E, the total misfit. Summarizing: setting the M partial derivatives of E with respect to the elements of m to zero leads to the least squares solution.

We have just derived the least squares solution by taking the partial derivatives of E with respect to m_q and then combining the terms for q = 1, 2, ..., M. An alternative, but equivalent, formulation begins with Equation (3.20), but written out as

E = (d - G m)^T (d - G m) = d^T d - d^T G m - m^T G^T d + m^T G^T G m    (3.29)

Then taking the partial derivative of E with respect to m^T turns out to be equivalent to what was done in Equations (3.22)–(3.26) for m_q, namely

∂E/∂m^T = -2 G^T d + 2 G^T G m = 0    (3.30)

which leads to G^T G m = G^T d (3.26) and m_LS = (G^T G)^{-1} G^T d (3.27).

It is also perhaps interesting to note that we could have obtained the same solution without taking partials. To see this, consider the following four steps:

Step 1. Begin with G m = d (1.13).

Step 2. Premultiply both sides by G^T: G^T G m = G^T d (3.26).

Step 3. Premultiply both sides by (G^T G)^{-1}: (G^T G)^{-1} G^T G m = (G^T G)^{-1} G^T d (3.31).

Step 4. This reduces to m_LS = (G^T G)^{-1} G^T d (3.27), as before.

The point, however, is that this route does not show why m_LS is the solution that minimizes E, the misfit between the observed and predicted data.
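The general solution (3.27) can be exercised on the straight-line problem. A sketch with synthetic data (my own illustrative values; the normal equations are solved directly rather than by forming the inverse, which is numerically preferable):

```python
import numpy as np

# Synthetic straight-line data: d_i = m1 + m2 * z_i + noise
rng = np.random.default_rng(4)
z = np.linspace(0.0, 10.0, 21)
m_true = np.array([2.0, 0.5])                  # intercept, slope
d = m_true[0] + m_true[1] * z + rng.normal(0.0, 0.1, z.size)

G = np.column_stack([np.ones_like(z), z])      # Eq. (3.6)

# Normal equations (3.26): solve G^T G m = G^T d
m_ls = np.linalg.solve(G.T @ G, G.T @ d)

# numpy's own least squares routine gives the same answer
m_np, *_ = np.linalg.lstsq(G, d, rcond=None)
print(np.allclose(m_ls, m_np))                 # True
```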
All of this assumes, of course, that (G^T G)^{-1} exists. We will return to the existence and properties of (G^T G)^{-1} later. Next we will look at two examples of least squares problems, to show a striking similarity that is not obvious at first glance.

3.4.3 Two Examples of Least Squares Problems

Example 1: Best-Fit Straight-Line Problem

We have, of course, already derived the solution for this problem in the last section. Briefly, then, for the system of equations d = Gm given by Equation (3.6), we have

G^T G = [1 1 ... 1; z_1 z_2 ... z_N] [1 z_1; 1 z_2; ...; 1 z_N] = [N Σz_i; Σz_i Σz_i²]    (3.32a)

and

G^T d = [1 1 ... 1; z_1 z_2 ... z_N] [d_1; d_2; ...; d_N] = [Σd_i; Σd_i z_i]    (3.32b)

Thus the least squares solution is given by

[m_1; m_2]_LS = [N Σz_i; Σz_i Σz_i²]^{-1} [Σd_i; Σd_i z_i]    (3.32c)

Example 2: Best-Fit Parabola Problem

The ith predicted datum for a parabola is given by

d_i^pre = m_1 + m_2 z_i + m_3 z_i²    (3.33)

where m_1 and m_2 have the same meanings as in the straight-line problem and m_3 is the coefficient of the quadratic term. Again the problem can be written in the form d = Gm (1.13), where now we have

[d_1; d_2; ...; d_N] = [1 z_1 z_1²; 1 z_2 z_2²; ...; 1 z_N z_N²] [m_1; m_2; m_3]    (3.34)

and

G^T G = [N Σz_i Σz_i²; Σz_i Σz_i² Σz_i³; Σz_i² Σz_i³ Σz_i⁴],  G^T d = [Σd_i; Σd_i z_i; Σd_i z_i²]    (3.35)

As before, we form the least squares solution as m_LS = (G^T G)^{-1} G^T d (3.27). Although the forward problems of predicting data for the straight-line and parabolic cases look very different, the least squares solution is formed in a way that emphasizes the fundamental similarity between the two problems. For example, notice how the straight-line problem is buried within the parabola problem: the upper left 2×2 part of G^T G in Equation (3.35) is the same as Equation (3.32a), and the first two entries of G^T d in Equation (3.35) are the same as Equation (3.32b).

Next we consider a four-parameter example.

3.4.4 Four-Parameter Tomography Problem

Finally, let's consider a four-parameter problem, but this one based on the concept of tomography. [Figure: a 2×2 grid of blocks, each of side length h, with velocities v_1, v_2 in the top row and v_3, v_4 in the bottom row; ray t_1 crosses the top row, t_2 the bottom row, t_3 the left column, and t_4 the right column.] With slowness s_i = 1/v_i, the travel times are

t_1 = h/v_1 + h/v_2 = h (s_1 + s_2)
t_2 = h/v_3 + h/v_4 = h (s_3 + s_4)
t_3 = h/v_1 + h/v_3 = h (s_1 + s_3)
t_4 = h/v_2 + h/v_4 = h (s_2 + s_4)    (3.36)

or, in matrix form,
[t_1; t_2; t_3; t_4] = h [1 1 0 0; 0 0 1 1; 1 0 1 0; 0 1 0 1] [s_1; s_2; s_3; s_4]    (3.37)

that is, d = Gm (1.13). Then

G^T G = h² [2 1 1 0; 1 2 0 1; 1 0 2 1; 0 1 1 2]    (3.38)

and

G^T d = h [t_1 + t_3; t_1 + t_4; t_2 + t_3; t_2 + t_4]    (3.39)

So the normal equations, G^T G m = G^T d (3.18), are

h² [2 1 1 0; 1 2 0 1; 1 0 2 1; 0 1 1 2] [s_1; s_2; s_3; s_4] = h [t_1 + t_3; t_1 + t_4; t_2 + t_3; t_2 + t_4]    (3.40)

or

[2 1 1 0; 1 2 0 1; 1 0 2 1; 0 1 1 2] [s_1; s_2; s_3; s_4] = (1/h) [t_1 + t_3; t_1 + t_4; t_2 + t_3; t_2 + t_4]    (3.41)

Example: if s_1 = s_2 = s_3 = s_4 = 1 and h = 1, then t_1 = t_2 = t_3 = t_4 = 2. By inspection, s = [1, 1, 1, 1]^T is a solution, but so is s_1 = s_4 = 2, s_2 = s_3 = 0, or s_1 = s_4 = 0, s_2 = s_3 = 2. Solutions are nonunique. Look back at G. Are all of the columns (or rows) independent? No. What does that imply about G and G^T G? Rank < 4. What does that imply about (G^T G)^{-1}? It does not exist. So does m_LS exist? No. Other ways of saying this: the vectors g_i do not span the space of m; or, the experimental setup is not sufficient to uniquely determine the solution. Note that this analysis can be done without any data, based strictly on the experimental design. Another way to look at it: are the columns of G independent? No. For example, the coefficients (-1, 1, 1, -1) will make the columns add to zero. What pattern does that suggest is not resolvable? A checkerboard-like pattern of slowness perturbations in the proportions (-1, 1, 1, -1), which predicts zero change in every travel time and therefore cannot be recovered from the data.

Now that we have derived the least squares solution and considered some examples, we next turn our attention to something called the determinancy of the system of equations given by Equation (1.13), d = Gm. This will begin to permit us to classify systems of equations based on the nature of G.

3.5 Determinancy of Least Squares Problems (See pages 46–52, Menke)

3.5.1 Introduction

We have seen that the least squares solution to d = Gm is given by

m_LS = (G^T G)^{-1} G^T d    (3.27)

There is no guarantee, as we saw in Section 3.4.4, that the solution even exists. It fails to exist when the matrix G^T G has no mathematical inverse. We note that G^T G is square (M×M), so it is at least mathematically possible to consider inverting it. (N.B.: the dimension of G^T G is M×M independent of the number of observations made.)
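The rank-deficiency argument for the tomography example can be verified directly (a numpy sketch of the analysis above; recall it needs no data, only G):

```python
import numpy as np

h = 1.0
G = h * np.array([[1., 1., 0., 0.],
                  [0., 0., 1., 1.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.]])          # Eq. (3.37)

print(np.linalg.matrix_rank(G))               # 3, not 4
print(np.linalg.matrix_rank(G.T @ G))         # 3 as well: (G^T G)^-1 does not exist

# The unresolvable checkerboard pattern: G maps (-1, 1, 1, -1) to zero
n = np.array([-1., 1., 1., -1.])
print(G @ n)                                  # all zeros

# Two very different models predicting identical travel times
s_a = np.array([1., 1., 1., 1.])
s_b = np.array([2., 0., 0., 2.])
print(np.allclose(G @ s_a, G @ s_b))          # True
```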
Mathematically, we can say that G^T G has an inverse, and that the inverse is unique, when G^T G has rank M. The rank of a matrix was considered in Section 2.2.3. Essentially, if G^T G has rank M, then it has enough information in it to resolve M things, in this case model parameters. This happens when all M rows or, equivalently (since G^T G is square), all M columns are independent. Recall also that independent means you cannot write any row (or column) as a linear combination of the other rows (columns). G^T G will have rank < M if the number of observations N is less than M; Menke (pp. 45–46) gives the example of the straight-line fit to a single data point as an illustration. If (G^T G)^{-1} does not exist, an infinite number of estimates will all fit the data equally well. Mathematically, G^T G has rank < M if |G^T G| = 0, where |G^T G| is the determinant of G^T G.

Now let us introduce Menke's nomenclature, based on the nature of G^T G and on the prediction error. In all cases the number of model parameters is M and the number of observations is N.

3.5.2 Even-Determined Problems (M = N)

If a solution exists, it is unique. The prediction error d_obs - d_pre is identically zero. For example,

[1 1; 2 -1] [m_1; m_2] = [4; -1]    (3.42)

for which the solution is m = [1, 3]^T.

3.5.3 Overdetermined Problems (Typically N > M)

With more observations than unknowns, typically one cannot fit all the data exactly. The least squares problem falls in this category. The overdetermined example in the notes consists of adding one more equation (a third observation) to Equation (3.42): with three equations in the same two unknowns, the data can no longer be fit exactly, and the least squares solution of the augmented system, [1.333, 4.833]^T, no longer satisfies any of the equations perfectly.

3.5.4 Underdetermined Problems (Typically M > N)

With more unknowns than observations, m has no unique solution. A special case of the underdetermined problem occurs when you can fit the data exactly, which is called the purely underdetermined case. The prediction error for the purely underdetermined case is exactly zero; that is, the data can be fit exactly. An example of such a problem is

[2 1] [m_1; m_2] = 1    (3.44)

Possible
solutions include [0, 1]^T, [0.5, 0]^T, [5, -9]^T, [1/3, 1/3]^T, and [0.4, 0.2]^T. The solution with the minimum length, in the L2-norm sense, is [0.4, 0.2]^T.

The following example, however, is also underdetermined, but no choice of m1, m2, m3 will produce zero prediction error; thus it is not purely underdetermined:

    [1]   [1 2 1] [m1]
    [1] = [2 4 2] [m2]      (3.45)
                  [m3]

You might want to verify the above examples. Can you think of others?

Although I have stated that overdetermined (underdetermined) problems typically have N > M (N < M), it is important to realize that this is not always the case. Consider the following:

    [1]   [1 0 0]
    [2] = [1 0 0] [m1]
    [3]   [0 1 1] [m2]      (3.46)
    [4]   [0 2 2] [m3]

For this problem, m1 is overdetermined; that is, no choice of m1 can exactly fit both d1 and d2 (unless d1 happens to equal d2), while at the same time m2 and m3 are underdetermined. This is the case even though there are two equations (the last two) in only two unknowns, m2 and m3. The two equations, however, are not independent, since two times the next-to-last row in G equals the last row. Thus this problem is both overdetermined and underdetermined at the same time. For this reason, I am not very satisfied with Menke's nomenclature. As we will see later, when we deal with vector spaces, the key will be the singular values (much like eigenvalues) and associated eigenvectors for the matrix G.

3.6 Minimum Length Solution

The minimum length solution arises from the purely underdetermined case. In this section we will develop the minimum length operator using Lagrange multipliers, borrowing on the basic ideas of minimizing the length of a vector introduced in Section 3.4 on least squares.

3.6.1 Background Information

We begin with two pieces of information:

1. First, [G^T G]^-1 does not exist. Therefore we cannot calculate the least squares solution m_LS = [G^T G]^-1 G^T d.

2. Second, the prediction error e = d^obs - d^pre is exactly equal to zero.

To solve underdetermined problems, we must add information that is not already in G. This is called a priori information.
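The determinacy examples above are easy to verify numerically. This NumPy sketch (an added illustration) checks the purely underdetermined example (3.44) and the mixed-determined example (3.46):

```python
import numpy as np

# Purely underdetermined example (3.44): one equation, two unknowns
G = np.array([[2., 1.]])
d = np.array([1.])

# All of the listed solutions fit the datum exactly
for m in ([0., 1.], [0.5, 0.], [5., -9.], [1/3, 1/3], [0.4, 0.2]):
    assert np.allclose(G @ np.array(m), d)

# The pseudoinverse picks the minimum length (L2) solution among them
m_ml = np.linalg.pinv(G) @ d
print(m_ml)    # approximately [0.4, 0.2]

# Mixed-determined example (3.46): rank 2 with M = 3 unknowns and N = 4 data,
# so the problem is simultaneously over- and underdetermined
G2 = np.array([[1., 0., 0.],
               [1., 0., 0.],
               [0., 1., 1.],
               [0., 2., 2.]])
print(np.linalg.matrix_rank(G2))   # 2
```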
Examples might include the constraint that density be greater than zero for rocks, or that v_n, the seismic P-wave velocity at the Moho, fall within the range 5 < v_n < 10 km/s, etc.

Another a priori assumption is called solution simplicity: one seeks solutions that are as simple as possible. By analogy to seeking a solution with the simplest misfit to the data (i.e., the smallest, as in the least squares problem), one can seek a solution that minimizes the total length of the model parameter vector m. At first glance there may not seem to be any reason to do this; it does make sense for some cases, however. Suppose, for example, that the unknown model parameters are the velocities of points in a fluid. A solution that minimized the length of m would also minimize the kinetic energy of the system; thus it would be appropriate in this case to minimize m. It also turns out to be a nice property when one is doing nonlinear problems and the m that one is using is actually a vector of changes to the solution at the previous step: then it is nice to have small step sizes. The requirement of solution simplicity will lead us, as shown later, to the so-called minimum length solution.

3.6.2 Lagrange Multipliers (See page 50 and Appendix A.1, Menke)

Lagrange multipliers come to mind whenever one wishes to solve a problem subject to some constraints. In the purely underdetermined case, these constraints are that the data misfit be zero. Before considering the full purely underdetermined case, consider the following discussion of Lagrange multipliers (mostly after Menke).

Lagrange Multipliers With 2 Unknowns and 1 Constraint

Consider E(x, y), a function of two variables, and suppose that we want to minimize E(x, y) subject to some constraint of the form φ(x, y) = 0. The steps using Lagrange multipliers are as follows.

Step 1: At the minimum in E, small changes in x and y lead to no change in E:

    dE = (∂E/∂x) dx + (∂E/∂y) dy = 0      (3.47)

Step 2: The constraint equation, however, says that dx and dy cannot be varied independently, since the constraint
equation is independent of (i.e., different from) E. Since φ(x, y) = 0 for all admissible (x, y), so must dφ(x, y) = 0. But

    dφ = (∂φ/∂x) dx + (∂φ/∂y) dy = 0      (3.48)

Step 3: Form the weighted sum of (3.47) and (3.48):

    dE + λ dφ = [∂E/∂x + λ ∂φ/∂x] dx + [∂E/∂y + λ ∂φ/∂y] dy = 0      (3.49)

where λ is a constant. Note that (3.49) holds for arbitrary λ.

Step 4: If λ is chosen, however, in such a way that

    ∂E/∂x + λ ∂φ/∂x = 0      (3.50)

then it follows that

    ∂E/∂y + λ ∂φ/∂y = 0      (3.51)

since at least one of dx, dy (in this case dy) is arbitrary; i.e., dy may be chosen nonzero.

Step 5: When λ has been chosen as indicated above, it is called the Lagrange multiplier. Therefore (3.49) above is equivalent to minimizing E + λφ without any constraints; i.e.,

    ∂(E + λφ)/∂x = ∂E/∂x + λ ∂φ/∂x = 0      (3.52)

and

    ∂(E + λφ)/∂y = ∂E/∂y + λ ∂φ/∂y = 0      (3.53)

Finally, one must still solve the constraint equation

    φ(x, y) = 0      (3.54)

Thus the solution for (x, y) that minimizes E subject to the constraint φ(x, y) = 0 is given by (3.52), (3.53), and (3.54). That is, the problem has been reduced to the three equations (3.50), (3.51), and (3.54) in the three unknowns x, y, λ.

Extending the Problem to M Unknowns and N Constraints

The above procedure, used for a problem with two variables and one constraint, can be generalized to M unknowns in a vector m subject to N constraints φ_j(m) = 0, j = 1, ..., N. This leads to the following system of M equations (i = 1, ..., M):

    ∂E/∂m_i + Σ_{j=1}^{N} λ_j ∂φ_j/∂m_i = 0      (3.55)

with N constraints of the form

    φ_j(m) = 0      (3.56)

3.6.3 Application to the Purely Underdetermined Problem

With the background we now have in Lagrange multipliers, we are ready to reconsider the purely underdetermined problem. First we pose the following problem: find m such that m^T m is minimized, subject to the N constraints that the data misfit be zero:

    e_i = d_i^obs - d_i^pre = d_i - Σ_{j=1}^{M} G_ij m_j = 0      (3.57)

That is, minimize

    φ(m) = m^T m + Σ_{i=1}^{N} λ_i e_i      (3.58)

with respect to the elements m_q in m. We can expand the terms in Equation (3.58) and obtain

    φ(m) = Σ_{k=1}^{M} m_k^2 + Σ_{i=1}^{N} λ_i [d_i - Σ_{j=1}^{M} G_ij m_j]      (3.59)

Then we have

    ∂φ/∂m_q = Σ_{k=1}^{M} 2 m_k (∂m_k/∂m_q) - Σ_{i=1}^{N} λ_i Σ_{j=1}^{M} G_ij (∂m_j/∂m_q)      (3.60)

but

    ∂m_k/∂m_q = δ_kq      (3.61)

where δ_kq is the Kronecker delta. Thus

    ∂φ/∂m_q = 2 m_q - Σ_{i=1}^{N} λ_i G_iq = 0,   q = 1, 2, ..., M      (3.62)
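As a concrete check of the recipe, the following sketch (an added illustration; the functions E and φ are our own choice) minimizes E(x, y) = x^2 + y^2 subject to φ(x, y) = x + y - 1 = 0 by solving Equations (3.50), (3.51), and (3.54) as a linear system:

```python
import numpy as np

# Minimize E(x, y) = x^2 + y^2 subject to phi(x, y) = x + y - 1 = 0.
# Equations (3.50), (3.51), and (3.54) are linear in (x, y, lambda):
#   dE/dx + lambda * dphi/dx = 2x + lambda = 0
#   dE/dy + lambda * dphi/dy = 2y + lambda = 0
#   phi(x, y)                =  x + y      = 1
A = np.array([[2., 0., 1.],
              [0., 2., 1.],
              [1., 1., 0.]])
b = np.array([0., 0., 1.])
x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)    # 0.5 0.5 -1.0
```

The constrained minimum sits at x = y = 1/2, as symmetry suggests, with Lagrange multiplier λ = -1.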
In matrix notation, over all q, Equation (3.62) can be written as

    2m - G^T λ = 0      (3.63)

where λ is an N x 1 vector containing the N Lagrange multipliers λ_i, i = 1, ..., N. Note that G^T λ has dimension (M x N)(N x 1) = M x 1, as required to be able to subtract it from m. Now, solving explicitly for m yields

    m = (1/2) G^T λ      (3.64)

The constraints in this case are that the data be fit exactly; that is, d = Gm (1.13). Substituting (3.64) into (1.13) gives

    d = Gm = (1/2) G G^T λ      (3.65)

where G G^T has dimension (N x M)(M x N), or simply N x N. Solving for λ when [G G^T]^-1 exists yields

    λ = 2 [G G^T]^-1 d      (3.66)

The Lagrange multipliers are not ends in and of themselves, but upon substitution of Equation (3.66) into (3.64) we obtain

    m = (1/2) G^T λ = (1/2) G^T 2 [G G^T]^-1 d      (3.67)

Rearranging, we arrive at the minimum length solution m_ML:

    m_ML = G^T [G G^T]^-1 d      (3.68)

where G G^T is an N x N matrix, and the minimum length operator G_ML^-1 is given by

    G_ML^-1 = G^T [G G^T]^-1      (3.69)

The above procedure, then, is one that determines the solution with the minimum length (L2 norm) [m^T m]^(1/2) amongst the infinite number of solutions that fit the data exactly. In practice, one does not actually calculate the values of the Lagrange multipliers but goes directly to (3.68) above.

The above derivation shows that the length of m is minimized by the minimum length operator. It may make more sense to seek a solution that deviates as little as possible from some prior estimate of the solution, <m>, rather than from zero. The zero vector is, in fact, the prior estimate <m> for the minimum length solution given in Equation (3.68). If we wish to explicitly include <m>, then Equation (3.68) becomes

    m_ML = <m> + G^T [G G^T]^-1 [d - G<m>]
         = <m> + G_ML^-1 [d - G<m>]
         = G_ML^-1 d + [I - G_ML^-1 G] <m>      (3.70)

We note immediately that Equation (3.70) reduces to Equation (3.68) when <m> = 0.

3.6.4 Comparison of Least Squares and Minimum Length Solutions

In closing this section, it is instructive to note the similarity in form between the minimum length and least squares solutions.

Least squares:

    m_LS = [G^T G]^-1 G^T d      (3.27)

with

    G_LS^-1 = [G^T G]^-1 G^T      (3.28)

Minimum length:

    m_ML = <m> + G^T [G G^T]^-1 [d - G<m>]      (3.70)

with

    G_ML^-1 = G^T [G G^T]^-1      (3.69)

The minimum length solution exists when [G G^T]^-1 exists. Since G G^T is N x N, this is the same as saying when G G^T has rank N, that is, when the N rows (or N columns) are independent. In this case, your ability to predict (or calculate) each of the N observations is independent.

3.6.5 Example of Minimum Length Problem

Let's reconsider the four-parameter tomography problem we introduced in Section 3.4.4. With our new understanding of the determinancy of least squares problems, we can now recognize the problem in Section 3.4.4 as an underdetermined problem with four unknowns and three independent data. The least squares solution m_LS does not exist, but the minimum length solution m_ML does exist. In this case G is a 3 x 4 matrix (why?):

    G = [1 0 0 0]
        [0 0 1 1]      (3.71)
        [1 0 1 0]

    G G^T = [1 0 1]
            [0 2 1]      (3.72)
            [1 1 2]

The inverse for G G^T exists, so we can compute

    G^T [G G^T]^-1 = [ 1  0  0]
                     [ 0  0  0]      (3.73)
                     [-1  0  1]
                     [ 1  1 -1]

For an actual model m = [1, 0, 0, 0]^T, the data are d = [1, 0, 1]^T. The minimum length solution is given by

    m_ML = G^T [G G^T]^-1 d = [1, 0, 0, 0]^T      (3.74)

In this particular case, m_ML is the correct solution. This is not always the case; in fact, in most realistic cases it is not the correct solution (if one is known). Remember, this is the minimum length solution that solves the problem, but it is not unique, or for that matter correct.

3.7 Weighted Measures of Length

3.7.1 Introduction

One way to improve our estimates, using either the least squares solution

    m_LS = [G^T G]^-1 G^T d      (3.27)

or the minimum length solution

    m_ML = <m> + G^T [G G^T]^-1 [d - G<m>]      (3.70)

is to use weighted measures of the misfit vector

    e = d^obs - d^pre      (3.75)

or the model parameter vector m, respectively. The next two subsections will deal with these two approaches.

3.7.2 Weighted Least Squares

Weighted Measures of the Misfit Vector e

We saw in Section 3.4 that the least squares solution m_LS was the one that minimized the total misfit between predicted and observed data in the L2-norm sense. That is,

    E = e^T e = e_1^2 + e_2^2 + ... + e_N^2 = Σ_{i=1}^{N} e_i^2      (3.4)

is minimized.
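Before turning to weighted measures, the minimum length example above (Equations 3.71-3.74) can be checked with a short NumPy sketch (an added illustration):

```python
import numpy as np

# Three independent ray paths through four blocks, Equation (3.71)
G = np.array([[1., 0., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])

GGT = G @ G.T            # 3 x 3 and invertible, Equation (3.72)
print(GGT)

d = np.array([1., 0., 1.])               # data from the true model [1, 0, 0, 0]
m_ml = G.T @ np.linalg.solve(GGT, d)     # minimum length solution (3.68)
print(m_ml)    # [1, 0, 0, 0]: recovers the true model in this particular case
```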
Consider a new E defined as follows:

    E = e^T W_e e      (3.76)

where W_e is an as-yet-unspecified N x N weighting matrix. W_e can take any form, but one convenient choice is

    W_e = [cov d]^-1      (3.77)

where [cov d]^-1 is the inverse of the covariance matrix for the data. With this choice for the weighting matrix, data with large variances are weighted less than ones with small variances. While this is true in general, it is easier to show in the case where W_e is diagonal. This happens when cov d is diagonal, which implies that the errors in the data are uncorrelated. The diagonal entries in [cov d]^-1 are then given by the reciprocals of the diagonal entries in cov d. That is, if

    cov d = diag(σ_1^2, σ_2^2, ..., σ_N^2)      (3.78)

then

    [cov d]^-1 = diag(σ_1^-2, σ_2^-2, ..., σ_N^-2)      (3.79)

With this choice for W_e, the weighted misfit becomes

    E = e^T W_e e = Σ_{i=1}^{N} Σ_{j=1}^{N} e_i W_ij e_j      (3.80)

But

    W_ij = (1/σ_i^2) δ_ij      (3.81)

where δ_ij is the Kronecker delta. Thus we have

    E = Σ_{i=1}^{N} (1/σ_i^2) e_i^2      (3.82)

If the ith variance is large, then the component of the error vector in the ith direction, e_i, has little influence on the size of E. This is not the case in the unweighted least squares problem, where an examination of Equation (3.4) clearly shows that each component of the error vector contributes equally to the total misfit.

Obtaining the Weighted Least Squares Solution m_WLS

If one uses E = e^T W_e e as the weighted measure of error, we will see below that this leads to the weighted least squares solution

    m_WLS = [G^T W_e G]^-1 G^T W_e d      (3.83)

with a weighted least squares operator G_WLS^-1 given by

    G_WLS^-1 = [G^T W_e G]^-1 G^T W_e      (3.84)

While this is true in general, it is easier to arrive at Equation (3.83) in the case where W_e is a diagonal matrix and the forward problem d = Gm is given by the least squares problem for a best-fitting straight line (see Equation 3.6).

Step 1:

    E = e^T W_e e = Σ_i Σ_j e_i W_ij e_j = Σ_i W_ii e_i^2      (3.85)

      = Σ_i W_ii [d_i - m_1 - m_2 z_i]^2
      = Σ_i W_ii [d_i^2 - 2 m_1 d_i - 2 m_2 d_i z_i + m_1^2 + 2 m_1 m_2 z_i + m_2^2 z_i^2]      (3.86)

Step 2: Then

    ∂E/∂m_1 = -2 Σ_i d_i W_ii + 2 m_1 Σ_i W_ii + 2 m_2 Σ_i z_i W_ii = 0      (3.87)

and

    ∂E/∂m_2 = -2 Σ_i d_i z_i W_ii + 2 m_1 Σ_i z_i W_ii + 2 m_2 Σ_i z_i^2 W_ii = 0      (3.88)

This can be written in matrix form as

    [Σ W_ii       Σ z_i W_ii  ] [m_1]   [Σ d_i W_ii    ]
    [Σ z_i W_ii   Σ z_i^2 W_ii] [m_2] = [Σ z_i d_i W_ii]      (3.89)

Step 3: The left-hand side can be factored as

    [1   1   ... 1  ] [W_11       0 ] [1  z_1]
    [z_1 z_2 ... z_N] [     ...     ] [ ...  ] [m_1]  =  G^T W_e G m      (3.90-3.91)
                      [0       W_NN ] [1  z_N] [m_2]

Similarly, the right-hand side can be factored as

    [1   1   ... 1  ] [W_11       0 ] [d_1]
    [z_1 z_2 ... z_N] [     ...     ] [...]  =  G^T W_e d      (3.92-3.93)
                      [0       W_NN ] [d_N]

Step 4: Therefore, using Equations (3.91) and (3.93), Equation (3.89) can be written as

    G^T W_e G m = G^T W_e d      (3.94)

The weighted least squares solution m_WLS from Equation (3.94) is thus

    m_WLS = [G^T W_e G]^-1 G^T W_e d      (3.95)

assuming, of course, that [G^T W_e G]^-1 exists.

3.7.3 Weighted Minimum Length

The development of a weighted minimum length solution is similar to that of the weighted least squares problem. The steps are as follows. First, recall that the minimum length solution minimizes m^T m. By analogy with weighted least squares, we can choose to minimize

    m^T W_m m      (3.96)

instead of m^T m. For example, if one wishes to use

    W_m = [cov m]^-1      (3.97)

then one must replace m above with

    m - <m>      (3.98)

where <m> is the expected, or a priori, estimate of the parameter values. The reason for this is that the variances must represent fluctuations about zero. In the weighted least squares problem, it is assumed that the error vector e, which is being minimized, has a mean of zero. Thus, for the weighted minimum length problem, we replace m by its departure from the expected value <m>. Therefore we introduce a new function L to be minimized:

    L = [m - <m>]^T W_m [m - <m>]      (3.99)

If one then follows the procedure in Section 3.6 with this new function, one eventually (it is left to the student as an exercise) is led to the weighted minimum length solution m_WML, given by

    m_WML = <m> + W_m^-1 G^T [G W_m^-1 G^T]^-1 [d - G<m>]      (3.100)

and the weighted minimum length operator G_WML^-1 is given by

    G_WML^-1 = W_m^-1 G^T [G W_m^-1 G^T]^-1      (3.101)
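The weighted least squares solution (3.95) is one line of NumPy. The sketch below (an added illustration; the data are made up) downweights a bad datum by assigning it a large variance:

```python
import numpy as np

# Straight-line fit d = m1 + m2*z with one bad datum, downweighted via
# W_e = [cov d]^-1 (diagonal: uncorrelated errors)
z = np.array([0., 1., 2., 3.])
d = np.array([1., 2., 3., 10.])            # last datum is an outlier
var = np.array([0.1, 0.1, 0.1, 100.])      # large variance for the outlier

G = np.column_stack([np.ones_like(z), z])
We = np.diag(1.0 / var)

# Weighted least squares (3.95) versus plain least squares (3.27)
m_wls = np.linalg.solve(G.T @ We @ G, G.T @ We @ d)
m_ls = np.linalg.solve(G.T @ G, G.T @ d)
print(m_wls)   # close to [1, 1], the trend of the three good data
print(m_ls)    # slope dragged upward by the outlier
```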
This expression differs from Equation (3.38), page 54, of Menke, which uses W_m rather than W_m^-1; I believe Menke's equation is wrong. Note that the solution depends explicitly on the expected, or a priori, estimate of the model parameters, <m>. The second term represents a departure from the a priori estimate <m>, based on the inadequacy of the forward problem G<m> to fit the data d exactly.

Other choices for W_m include the following:

1. D^T D, where D is a derivative matrix (a measure of the flatness of m) of dimension (M - 1) x M:

    D = [1 -1  0 ...  0]
        [0  1 -1 ...  0]      (3.102)
        [        ...    ]
        [0 ...   0 1 -1]

2. D^T D, where D is an (M - 2) x M roughness (second derivative) matrix, given by

    D = [1 -2  1  0 ...  0]
        [0  1 -2  1 ...  0]      (3.103)
        [          ...     ]
        [0 ...  0  1 -2  1]

Note that for both choices of D presented, D^T D is an M x M matrix of rank less than M: for the first-derivative case it is of rank M - 1, while for the second it is of rank M - 2. This means that W_m does not have a mathematical inverse. This can introduce some nonuniqueness into the solution, but it does not preclude finding a solution. Finally, note that many choices for W_m are possible.

3.7.4 Weighted Damped Least Squares

In Sections 3.7.2 and 3.7.3 we considered weighted versions of the least squares and minimum length solutions. Both unweighted and weighted problems can be very unstable if the matrices that have to be inverted are nearly singular. In the weighted problems, these are

    G^T W_e G      (3.104)

and

    G W_m^-1 G^T      (3.105)

respectively, for the least squares and minimum length problems. In this case, one can form a weighted penalty, or cost, function given by

    E + ε^2 L      (3.106)

where E is from Equation (3.85) for weighted least squares and L is from Equation (3.99) for the weighted minimum length problem. One then goes through the exercise of minimizing Equation (3.106) with respect to the model parameters m and obtains what is known as the weighted damped least squares solution m_WD. It is, in fact, a weighted mix of the weighted least squares and weighted minimum length solutions. One finds that m_WD is given by either

    m_WD = <m> + [G^T W_e G + ε^2 W_m]^-1 G^T W_e [d - G<m>]      (3.107)

or

    m_WD = <m> + W_m^-1 G^T [G W_m^-1 G^T + ε^2 W_e^-1]^-1 [d - G<m>]      (3.108)

where the weighted damped least squares operator G_WD^-1 is given by

    G_WD^-1 = [G^T W_e G + ε^2 W_m]^-1 G^T W_e      (3.109)

or

    G_WD^-1 = W_m^-1 G^T [G W_m^-1 G^T + ε^2 W_e^-1]^-1      (3.110)

The two forms for G_WD^-1 can be shown to be equivalent. The ε^2 term has the effect of damping the instability. As we will see later in Chapter 6, using singular value decomposition, the above procedure minimizes the effects of small singular values in G^T W_e G or G W_m^-1 G^T. In the next section we will learn two methods of including a priori information and constraints in inverse problems.

3.8 A Priori Information and Constraints (See Menke, pages 55-57)

3.8.1 Introduction

Another common type of a priori information takes the form of linear equality constraints:

    F m = h      (3.111)

where F is a P x M matrix and P is the number of linear constraints considered. As an example, consider the case for which the mean of the model parameters is known. In this case, with only one constraint, we have

    (1/M) Σ_{i=1}^{M} m_i = h_1      (3.112)

Then Equation (3.111) can be written as

    F m = (1/M) [1 1 ... 1] [m_1, m_2, ..., m_M]^T = h_1      (3.113)

As another example, suppose that the jth model parameter m_j is actually known in advance. That is, suppose

    m_j = h_1      (3.114)

Then Equation (3.111) takes the form

    F m = [0 ... 0 1 0 ... 0] m = h_1      (3.115)

where the single 1 falls in the jth column. Note that for this example it would be possible to remove m_j as an unknown, thereby reducing the system of equations by one. It is often preferable, however, to use Equation (3.111) even in this case, rather than rewriting the forward problem in a computer code.

3.8.2 A First Approach to Including Constraints

We will consider two basic approaches to including constraints in inverse problems. Each has its strengths and weaknesses. The first includes the constraint matrix F in the forward problem, and the second uses Lagrange multipliers. The steps for the first approach are as follows.

Step 1: Include F m = h as rows in a new G that operates on the original m:

    [G] m = [d]      (3.116)
    [F]     [h]

where the augmented matrix is (N + P) x M, m is M x 1, and the augmented data vector is (N + P) x 1.
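Anticipating the weighting introduced in Step 3 below, this first approach can be sketched as follows (an added illustration; the data and the constrained slope are made up):

```python
import numpy as np

# Line fit with the a priori constraint m2 = 2 (slope assumed known),
# imposed by appending the row F = [0, 1], h = 2 with a large weight
z = np.array([0., 1., 2., 3.])
d = np.array([0.9, 3.1, 5.0, 6.9])
G = np.column_stack([np.ones_like(z), z])
F = np.array([[0., 1.]])
h = np.array([2.0])

Gaug = np.vstack([G, F])                  # Equation (3.116)
daug = np.concatenate([d, h])
w = np.ones(len(daug))
w[-1] = 1e6                               # the "big number" on the constraint row
We = np.diag(w)

m = np.linalg.solve(Gaug.T @ We @ Gaug, Gaug.T @ We @ daug)
print(m)    # slope essentially forced to 2.0
```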
Step 2: The new (N + P) x 1 misfit vector e becomes

    e = [d]obs - [d]pre      (3.117)
        [h]      [h]

Performing a least squares inversion would minimize the new e^T e based on Equation (3.116). The difference

    h - h^pre      (3.118)

which represents the misfit to the constraints, may be small, but it is unlikely that it would vanish, which it must if the constraints are to be satisfied.

Step 3: Introduce a weighted misfit

    e^T W_e e      (3.119)

where W_e is a diagonal matrix of the form

    W_e = diag(1, ..., 1, big, ..., big)      (3.120)

with ones for the first N entries and relatively large values for the last P entries, those associated with the constraint equations. Recalling the form of the weighting matrix used in Equation (3.77), one sees that Equation (3.120) is equivalent to assigning the constraints very small variances. Hence a weighted least squares approach in this case will give large weight to fitting the constraints. The size of the big numbers in W_e must be determined empirically. One seeks a number that leads to a solution that satisfies the constraints acceptably, but that does not make the matrix in Equation (3.104), which must be inverted to obtain the solution, too poorly conditioned. Matrices with a large range of values in them tend to be poorly conditioned.

Consider the example of the smoothing constraint (here P = M - 2):

    D m = 0      (3.121)

where the dimensions of D, m, and 0 are (M - 2) x M, M x 1, and (M - 2) x 1, respectively. The augmented equations are

    [G] m = [d]      (3.122)
    [D]     [0]

Let's use the following weighting matrix:

    W_e = [I_N     0      ]      (3.123)
          [0      θ^2 I_P ]

where θ^2 is a constant. Applying weighted least squares (3.95) to (3.122) with this W_e results in

    m = [G^T G + θ^2 D^T D]^-1 G^T d      (3.124-3.125)

where [G^T G + θ^2 D^T D] is M x M and G^T d is M x 1 (the constraint rows contribute nothing to the right-hand side because their "data" are zero), producing an M x 1 solution when evaluated. In this form we can see that this is simply the m_WLS for the problem

    [ G ] m = [d]      (3.126)
    [θ D]     [0]
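The tradeoff controlled by θ can be seen numerically. In this sketch (an added illustration; the forward problem is trivially G = I so the effect of θ is clear), increasing θ increases the data misfit while decreasing the model roughness ||Dm||^2:

```python
import numpy as np

# Tradeoff between misfit and smoothness for [G; theta*D] m = [d; 0], Eq. (3.126)
M = 20
G = np.eye(M)
rng = np.random.default_rng(0)
d = np.sin(np.linspace(0.0, np.pi, M)) + 0.2 * rng.standard_normal(M)

# (M - 2) x M roughness (second derivative) matrix, Equation (3.103)
D = np.zeros((M - 2, M))
for i in range(M - 2):
    D[i, i:i+3] = [1., -2., 1.]

def solve(theta):
    Gaug = np.vstack([G, theta * D])
    daug = np.concatenate([d, np.zeros(M - 2)])
    return np.linalg.lstsq(Gaug, daug, rcond=None)[0]

for theta in (0.0, 1.0, 10.0):
    m = solve(theta)
    misfit = np.sum((d - G @ m)**2)
    rough = np.sum((D @ m)**2)
    print(theta, misfit, rough)   # misfit grows and roughness shrinks as theta grows
```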
By varying θ, we can trade off the misfit and the smoothness for the model.

3.8.3 A Second Approach to Including Constraints

Whenever the subject of constraints is raised, Lagrange multipliers come to mind. The steps for this approach are as follows.

Step 1: Form a weighted sum of the misfit and the constraints:

    φ(m) = e^T e + 2 λ^T [F m - h]      (3.127)

which can be expanded as

    φ(m) = Σ_{i=1}^{N} [d_i - Σ_{j=1}^{M} G_ij m_j]^2 + 2 Σ_{i=1}^{P} λ_i [Σ_{j=1}^{M} F_ij m_j - h_i]      (3.128)

(note a difference here from Equation 3.43, page 56, in Menke), where there are P linear equality constraints, and where the factor of 2 has been added as a matter of convenience, to make the form of the final answer simpler.

Step 2: One then takes the partials of Equation (3.128) with respect to all the entries in m and sets them to zero. That is,

    ∂φ(m)/∂m_q = 0,   q = 1, 2, ..., M      (3.129)

which leads to

    2 Σ_{i=1}^{N} Σ_{j=1}^{M} m_j G_iq G_ij - 2 Σ_{i=1}^{N} d_i G_iq + 2 Σ_{j=1}^{P} λ_j F_jq = 0,   q = 1, 2, ..., M      (3.130)

where the first two terms are the same as in the least squares case (Equation 3.22a), since they come directly from e^T e, and the last term shows why the factor of 2 was added in Equation (3.128).

Step 3: Equation (3.130) is not the complete description of the problem. To the M equations in Equation (3.130), the P constraint equations must also be added. In matrix form, this yields

    [G^T G   F^T] [m]   [G^T d]
    [F       0  ] [λ] = [h    ]      (3.131)

where the matrix is (M + P) x (M + P) and the vectors are (M + P) x 1.

Step 4: The above system of equations can be solved as

    [m]   [G^T G   F^T]^-1 [G^T d]
    [λ] = [F       0  ]    [h    ]      (3.132)

As an example, consider constraining a straight line to pass through some point (z', d'). That is, for N observations we have

    d_i = m_1 + m_2 z_i,   i = 1, ..., N      (3.133)

subject to the single constraint

    d' = m_1 + m_2 z'      (3.134)

Then Equation (3.111) has the form

    F m = [1  z'] [m_1]  =  d'      (3.135)
                  [m_2]

We can then write out Equation (3.132) explicitly and obtain the following:

    [m_1]   [N        Σ z_i     1 ]^-1 [Σ d_i    ]
    [m_2] = [Σ z_i    Σ z_i^2   z']    [Σ z_i d_i]      (3.136)
    [λ  ]   [1        z'        0 ]    [d'       ]

Note the similarity between Equations (3.136) and (3.32), the least squares solution to fitting a straight line to a set of points without any constraints:

    [m_1]       [N       Σ z_i  ]^-1 [Σ d_i    ]
    [m_2]_LS  = [Σ z_i   Σ z_i^2]    [Σ z_i d_i]      (3.32)
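The bordered system (3.132) is easy to set up numerically. The sketch below (an added illustration with made-up data) forces a best-fit line through the point (z', d') = (2, 5):

```python
import numpy as np

# Force a best-fit line through (z', d') = (2, 5) via the bordered system (3.132)
z = np.array([0., 1., 2., 3.])
d = np.array([1.1, 2.9, 5.2, 6.8])
G = np.column_stack([np.ones_like(z), z])
F = np.array([[1., 2.]])     # m1 + z'*m2 = d'
h = np.array([5.0])

A = np.block([[G.T @ G, F.T],
              [F, np.zeros((1, 1))]])
rhs = np.concatenate([G.T @ d, h])
sol = np.linalg.solve(A, rhs)
m, lam = sol[:2], sol[2]
print(m, m[0] + 2.0 * m[1])   # the constraint m1 + 2*m2 = 5 is satisfied exactly
```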
If you wanted to get the same result for the straight line passing through a point using the first approach, with W_e, you would assign

    W_ii = 1,   i = 1, ..., N      (3.137)

and

    W_(N+1)(N+1) = big      (3.138)

which is equivalent to assigning a small variance, relative to the unconstrained part of the problem, to the constraint equation. The solution obtained with Equation (3.95) should then approach the solution obtained using Equation (3.136).

Note that it is easy to constrain lines to pass through the origin using Equation (3.136). In this case we have

    z' = d' = 0      (3.139)

and Equation (3.136) becomes

    [m_1]   [N        Σ z_i     1]^-1 [Σ d_i    ]
    [m_2] = [Σ z_i    Σ z_i^2   0]    [Σ z_i d_i]      (3.140)
    [λ  ]   [1        0         0]    [0        ]

The advantage of using the Lagrange multiplier approach to constraints is that the constraints will be satisfied exactly. It often happens, however, that the constraints are only approximately known, and using Lagrange multipliers to fit the constraints exactly may not be appropriate. An example might be a gravity inversion where the depth to bedrock at one point is known from drilling. Constraining the depth to be exactly the drill depth may be misleading if the depth in the model is an average over some area; then the exact depth at one point may not be the best estimate of the depth over the area in question. A second disadvantage of the Lagrange multiplier approach is that it adds one equation to the system of equations in Equation (3.131) for each constraint. This can add up quickly, making the inversion considerably more difficult computationally.

An entirely different class of constraints are called linear inequality constraints, which take the form

    F m ≥ h      (3.141)

These can be solved using linear programming techniques, but we will not consider them further in this class.

3.8.4 Seismic Receiver Function Example

The following is an example of using smoothing constraints in an inverse problem. Consider a general problem in time series analysis with a delta function input. Then the output from the "model" is the Green's function of the system. The inverse problem is this: given the Green's function, find the parameters of the model.

    [Figure: an impulse input passes through the model to produce the Green's function as output.]
In a little more concrete form:

    [Figure: model space (parameters m_1, ..., m_M) maps through the forward problem F(m) = d to data space (d_1, ..., d_N).]

If d is very noisy, then m_LS will have a high-frequency component that tries to "fit the noise," but this component will not be real. How do we prevent this? So far we have learned two ways: use m_WLS if we know cov d; or, if not, place a smoothing constraint on m. An example of this approach, using receiver function inversions, can be found in Ammon et al. (1990). The important points are as follows:

1. This approach is used in the real world.

2. The forward problem is written d_j = F_j(m), j = 1, 2, 3, ..., N. This is nonlinear, but after linearization (discussed in Chapter 4) the equations are the same as those discussed previously, with minor differences.

3. Note the correlation between the roughness in the model and the roughness in the data.

4. The way to choose the weighting parameter is to plot the tradeoff between smoothness and waveform fit.

3.9 Variances of Model Parameters (See pages 58-60, Menke)

3.9.1 Introduction

Data errors are mapped into model parameter errors through any type of inverse. We noted in Chapter 2 (Equations 2.41-2.43) that if

    m_est = M d + v      (2.41)

and if cov d is the data covariance matrix, which describes the data errors, then the a posteriori model covariance matrix is given by

    cov m = M [cov d] M^T      (2.44c)

The covariance matrix in Equation (2.44c) is called the a posteriori model covariance matrix because it is calculated after the fact. It gives what are sometimes called the formal uncertainties in the model parameters. It is different from the a priori model covariance matrix of Equation (3.97), which is used to constrain the underdetermined problem. The a posteriori covariance matrix in Equation (2.44c) shows explicitly the mapping of data errors into uncertainties in the model parameters.
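Equation (2.44c) can be tested against a Monte Carlo experiment. The following sketch (an added illustration, with a made-up straight-line problem) compares the formula to the scatter of least squares solutions over many noise realizations:

```python
import numpy as np

# Check cov m = M [cov d] M^T (2.44c) for a least squares line fit, where
# M = [G^T G]^-1 G^T, against a Monte Carlo estimate over noise realizations
rng = np.random.default_rng(1)
z = np.array([0., 1., 2., 3.])
G = np.column_stack([np.ones_like(z), z])
sigma = 0.5
d_true = G @ np.array([1., 2.])

Mop = np.linalg.solve(G.T @ G, G.T)              # least squares operator
cov_formula = sigma**2 * np.linalg.inv(G.T @ G)  # see Equation (3.144)

trials = np.array([Mop @ (d_true + sigma * rng.standard_normal(4))
                   for _ in range(20000)])
cov_mc = np.cov(trials.T)
print(cov_formula)
print(cov_mc)    # agrees with the formula to within sampling error
```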
Although the mapping will be clearer once we consider the generalized inverse in Chapter 7, it is instructive at this point to consider applying Equation (2.44c) to the least squares and minimum length problems.

3.9.2 Application to Least Squares

We can apply Equation (2.44c) to the least squares problem and obtain

    cov m = [G^T G]^-1 G^T [cov d] ([G^T G]^-1 G^T)^T      (3.142)

Further, if cov d is given by

    cov d = σ^2 I_N      (3.143)

then

    cov m = [G^T G]^-1 G^T σ^2 I_N ([G^T G]^-1 G^T)^T
          = σ^2 [G^T G]^-1 G^T G ([G^T G]^-1)^T
          = σ^2 ([G^T G]^-1)^T
          = σ^2 [G^T G]^-1      (3.144)

since the transpose of a symmetric matrix returns the original matrix.

3.9.3 Application to the Minimum Length Problem

Application of Equation (2.44c) to the minimum length problem leads to the following for the a posteriori model covariance matrix:

    cov m = G^T [G G^T]^-1 [cov d] (G^T [G G^T]^-1)^T      (3.145)

If the data covariance matrix is again given by

    cov d = σ^2 I_N      (3.146)

we obtain

    cov m = σ^2 G^T [G G^T]^-2 G      (3.147)

where

    [G G^T]^-2 = [G G^T]^-1 [G G^T]^-1      (3.148)

3.9.4 Geometrical Interpretation of Variance

There is another way to look at the variance of model parameter estimates for the least squares problem, one that considers the prediction error, or misfit to the data. Recall that we defined the misfit E as

    E = e^T e = [d - d^pre]^T [d - d^pre] = [d - Gm]^T [d - Gm]      (3.20)

which explicitly shows the dependence of E on the model parameters m. That is, we have

    E = E(m)      (3.149)

If E(m) has a sharp, well-defined minimum, then we can conclude that our solution m_LS is well constrained. Conversely, if E(m) has a broad, poorly defined minimum, then we conclude that our solution m_LS is poorly constrained. After Figure 3.10, page 59, of Menke, we have:

    [Figure: E(m) versus model parameter m, for (a) a narrow minimum and (b) a wide minimum. (a) The best estimate m_est of model parameter m occurs at the minimum of E(m). If the minimum is relatively narrow, then random fluctuations in E(m) lead to only small errors Δm in m_est. (b) If the minimum is wide, then large errors in m can occur.]

One way to quantify this qualitative observation is to realize that the width of the minimum for E(m) is related to the curvature, or second derivative, of E(m) at the minimum. For the least squares problem we have

    ∂^2 E/∂m^2 |_(m = m_LS) = ∂^2/∂m^2 [d - Gm]^T [d - Gm] |_(m = m_LS)      (3.150)

Evaluating the right-hand side, we have for the qth term

    ∂^2 E/∂m_q^2 = ∂^2/∂m_q^2 Σ_{i=1}^{N} [d_i - Σ_{j=1}^{M} G_ij m_j]^2      (3.151-3.153)

Using the same steps as we did in the derivation of the least squares solution in Equations (3.21)-(3.25), it is possible to see that Equation (3.153) represents the qth term in the derivative of G^T [d - Gm]. Combining the q equations into matrix notation yields

    ∂/∂m [d - Gm]^T [d - Gm] = -2 G^T [d - Gm]      (3.154)

Evaluating the first derivative on the right-hand side of Equation (3.154), we have for the qth term

    ∂/∂m_q Σ_{i=1}^{N} G_iq [d_i - Σ_{j=1}^{M} G_ij m_j] = -Σ_{i=1}^{N} G_iq G_iq      (3.155-3.157)

which we recognize as the (q, q) entry in G^T G. Therefore we can write the M x M matrix equation

    ∂/∂m G^T [d - Gm] = -G^T G      (3.158)

From Equations (3.150)-(3.158) we can conclude that the second derivative of E in the least squares problem is proportional to G^T G. That is,

    ∂^2 E/∂m^2 |_(m = m_LS) = constant x G^T G      (3.159)

Furthermore, from Equation (3.144), we have that cov m is proportional to [G^T G]^-1. Therefore we can associate large values of the second derivative of E, given by (3.159), with (1) sharp curvature for E, (2) a narrow well for E, and (3) good (i.e., small) model variance. As Menke points out, cov m can be interpreted as being controlled either by (1) the variance of the data times a measure of how error in the data is mapped into model parameters, or (2) a constant times the curvature of the prediction error at its minimum.

I like Menke's summary for his chapter (page 60) on this material very much; hence I've reproduced his closing paragraph for you, as follows:

    The methods of solving inverse problems that have been discussed in this chapter emphasize the data and model parameters themselves. The method of least squares estimates the model parameters with smallest prediction length. The method of minimum length estimates the simplest model parameters. The ideas of data and model parameters are very concrete and straightforward, and the methods based on them are simple and easily understood. Nevertheless, this viewpoint tends to obscure an important aspect of inverse problems. Namely, that the nature of
    the problem depends more on the relationship between the data and model parameters than on the data or model parameters themselves. It should, for instance, be possible to tell a well-designed experiment from a poorly designed one without knowing what the numerical values of the data or model parameters are, or even the range in which they fall.

Before considering the relationships implied in the mapping between model parameters and data in Chapter 5, we extend what we now know about linear inverse problems to nonlinear problems in the next chapter.

CHAPTER 4: LINEARIZATION OF NONLINEAR PROBLEMS

4.1 Introduction

Thus far we have dealt with the linear, explicit forward problem given by

    G m = d      (1.13)

where G is a matrix of coefficients (constants) that multiply the model parameter vector m and return a data vector d. If m is doubled, then d is also doubled. We can also write Equation (1.13) out explicitly as

    d_i = Σ_{j=1}^{M} G_ij m_j,   i = 1, 2, ..., N      (4.1)

This form emphasizes the linear nature of the problem. Next we consider a more general relationship between data and model parameters.

4.2 Linearization of Nonlinear Problems

Consider a general explicit relationship between the ith datum and the model parameters, given by

    d_i = g_i(m)      (4.2)

An example might be the cubic relationship

    d_1 = 2 m_1^3      (4.3)

used again in Section 4.4. The steps required to linearize a problem of the form of Equation (4.2) are as follows.

Step 1: Expand g_i(m) about some point m_0 in model space using a Taylor series expansion:

    d_i = g_i(m) = g_i(m_0) + Σ_{j=1}^{M} ∂g_i/∂m_j |_(m0) Δm_j + (1/2) Σ_{j=1}^{M} ∂^2 g_i/∂m_j^2 |_(m0) (Δm_j)^2 + O(Δm^3)      (4.4)

where Δm is the difference between m and m_0, or

    Δm = m - m_0      (4.5)

Step 2: If we assume that the terms in (Δm_j)^n, n ≥ 2, are small with respect to the Δm_j terms, then

    d_i ≈ g_i(m_0) + Σ_{j=1}^{M} ∂g_i/∂m_j |_(m = m0) Δm_j      (4.6)

Step 3: The predicted data ĉ_i at m = m_0 are given by

    ĉ_i = g_i(m_0)      (4.7)

Therefore

    d_i - ĉ_i = Σ_{j=1}^{M} ∂g_i/∂m_j |_(m = m0) Δm_j      (4.8)

Step 4: We can define the misfit Δc_i as

    Δc_i = d_i - ĉ_i      (4.9)

that is, observed data minus predicted data. Δc is not necessarily noise.
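A one-parameter check of the linearization (4.6), using the cubic g(m) = 2m^3 from Section 4.4 (the step size dm is our choice):

```python
# Check the linearization (4.6) for the nonlinear relationship g(m) = 2 m^3
def g(m):
    return 2.0 * m**3

def dg_dm(m):          # the "G" of the linearized problem, evaluated at m
    return 6.0 * m**2

m0, dm = 2.0, 0.001
exact = g(m0 + dm)
linear = g(m0) + dg_dm(m0) * dm     # g(m0) + G * dm
print(exact, linear)   # the difference is O(dm^2), here about 1.2e-05
```

Shrinking dm shrinks the error quadratically, which is exactly the statement that the neglected terms in (4.4) are second order in Δm.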
It is just the misfit between observed and predicted data for some choice of the model parameter vector m_0.

The partial derivative of the ith data equation with respect to the jth model parameter is given by

    ∂g_i/∂m_j      (4.10a)

These partial derivatives are functions of the model parameters, and may be nonlinear (gasp!) or occasionally even nonexistent (shudder!). Fortunately, the values of these partial derivatives evaluated at some point m_0 in model space, given by

    ∂g_i/∂m_j |_(m = m0)      (4.10b)

are just numbers (constants) if they exist, and not functions. We then define G_ij as follows:

    G_ij = ∂g_i/∂m_j |_(m = m0)      (4.11)

Step 5: Finally, combining the above, we have

    Δc_i = Σ_{j=1}^{M} G_ij Δm_j,   i = 1, ..., N      (4.12)

or, in matrix notation, the linearized problem becomes

    Δc = G Δm      (4.13)

where

    Δc_i = d_i - ĉ_i   (observed data - predicted data)      (4.14)
    ĉ_i = g_i(m_0)
    G_ij = ∂g_i/∂m_j |_(m = m0)      (4.15)
    Δm_j = change from m_0j      (4.16)

Thus, by linearizing Equation (4.2), we have arrived at a set of linear equations where now Δc, the difference between observed and predicted data, is a linear function of changes in the model parameters from some starting model.

Some general comments on Equation (4.13):

1. In general, Equation (4.13) only holds in the neighborhood of m_0 and for small changes Δm. The region where the linearization is valid depends on the smoothness of g_i(m).

2. Note that G now changes with each iteration. That is, one may obtain a different G for each spot in solution space. Having to re-form G at each step can be very time (computer) intensive, and often one uses the same G for more than one iteration.

4.3 General Procedure for Nonlinear Problems

Step 1: Pick some starting model vector m_0.

Step 2: Calculate the predicted data vector ĉ and form the misfit vector Δc = d - ĉ.

Step 3: Form G_ij = ∂g_i/∂m_j |_(m = m0).

Step 4: Solve for Δm using any appropriate inverse operator (i.e., least squares, minimum length, weighted least squares, etc.).

Step 5: Form a new model parameter vector

    m_1 = m_0 + Δm      (4.17)

One repeats Steps 1-5 until Δm becomes sufficiently small (convergence is obtained)
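Steps 1-5 can be written as a short loop. The sketch below (an added illustration; the forward problem g(m) = m^2 + m and the datum d = 6 are made up) converges to the root m = 2:

```python
# Steps 1-5 as a loop, for a hypothetical one-parameter nonlinear problem:
# g(m) = m^2 + m with a single datum d = 6 (so the answer is m = 2).
def g(m):
    return m**2 + m

def dg_dm(m):
    return 2.0 * m + 1.0

d = 6.0
m = 0.5                        # Step 1: starting model
for iteration in range(25):
    dc = d - g(m)              # Step 2: misfit between observed and predicted
    G = dg_dm(m)               # Step 3: partial derivative at the current model
    dm = dc / G                # Step 4: solve G * dm = dc (scalar "inverse")
    m = m + dm                 # Step 5: update the model
    if abs(dm) < 1e-12:        # convergence: dm sufficiently small
        break
print(m)    # converges to 2.0
```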
becomes sufficiently small (acceptable misfit). Note that mᵢ (note the boldface m) is the estimate of the model parameter vector at the ith iteration, and not the ith component of the model parameter vector.

4.4 Three Examples

4.4.1 A Linear Example

Suppose g₁(m) = d₁ is linear and of the form

2m₁ = 4    (4.18)

With only one equation, we have G = [2], m = [m₁], and d = [4]. (I know, I know. It's easy.) Then

∂g₁(m)/∂m₁ = G₁₁ = 2, for all m₁

Suppose that the initial estimate of the model vector is m₀ = [0]. Then

d̂ = Gm₀ = 2(0) = 0

and we have

Δd = d − d̂ = 4 − 0 = 4

or, for the change in the first (and only) element of our misfit vector, Δd₁ = 4. Looking at our lone equation, then,

G₁₁ Δm₁ = Δd₁

or

2 Δm₁ = 4

or

Δm₁ = 2

Since this is the only element in our model-change vector (in this case, Δm = [Δm₁]), we have Δm₁ = 2, and our next approximation of the model vector, m₁, then becomes

m₁ = m₀ + Δm = 0 + 2 = 2

We have just completed Steps 1–5 for the first iteration. Now it is time to update the misfit vector and see if we have reached a solution. Thus, for the predicted data, we obtain

d̂ = Gm₁ = 2(2) = 4

and for the misfit we have

Δd = d − d̂ = 4 − 4 = 0

which indicates that the solution has converged in one iteration.

To see that the solution does not depend on the starting point if Equation (4.2) is linear, let's start with m₀ = [1000] = [m₁]. Considering the one and only element of our predicted-data and misfit vectors, we have

d̂₁ = 2 × 1000 = 2000

and

Δd₁ = 4 − 2000 = −1996

then

2 Δm₁ = −1996

or

Δm₁ = −998

Since Δm₁ is the only element of our first model-change vector, we have Δm = [−998], and therefore

m₁ = m₀ + Δm = 1000 − 998 = 2

As before, the solution has converged in one iteration. This is a general conclusion if the relationship between the data and the model parameters is linear. This problem also illustrates that the nonlinear approach outlined above works when g_i(m) is linear.

Consider the following graphs for the system 2m₁ = 4. We note the following:

1. For d̂ versus m₁, we see that the slope ∂g(m₁)/∂m₁ = 2 for all m₁.

2. For our initial guess of m₀ = 0, Δd = 4 − 0 = 4, denoted by a square symbol on the plot of Δd versus m₁. We extrapolate back down the slope to the point m₁ where Δd = 0 to obtain our answer.

3. Because the slope does not change (the problem is linear), the starting guess m₀ has no effect on the final solution. We can always get to Δd = 0 in one iteration.

4.4.2 A Nonlinear Example

Now consider, as a second example of the form g₁(m) = d₁, the following:

2m³ = 16    (4.19)

Since we have only one unknown, I chose to drop the subscript. Instead, I will use the subscript to denote the iteration number. For example, m₃ will be the estimate of the model parameter m at the third iteration. Note also that, by inspection, m = 2 is the solution. Working through this example as we did the last one, we first note that G₁₁ at the ith iteration will be given by

G₁₁ = ∂g(m)/∂m = 3 × 2mᵢ² = 6mᵢ²

Note also that G₁₁ is now a function of m.

Iteration 1. Let us pick as our starting model

m₀ = 1

then

G₁₁ = ∂g(m)/∂m |_{m=m₀} = 6

also

d̂ = 2 × 1³ = 2

and

Δd = d − d̂ = 16 − 2 = 14

Because we have only one element in our misfit vector, we have Δd = 14, and the length squared of Δd, ‖Δd‖², is given simply by (Δd)² = 14 × 14 = 196. Now we find Δm₁, the change to m₀, as

6 Δm₁ = 14, so Δm₁ = 14/6 = 2.3333

Thus our estimate of the model parameter at the first iteration, m₁, is given by

m₁ = m₀ + Δm₁ = 1 + 2.3333 = 3.3333

Iteration 2. Continuing,

G₁₁ = 6m₁² = 66.66 and d̂ = 2(3.3333)³ = 74.07

thus

Δd = d − d̂ = 16 − 74.07 = −58.07

now ‖Δd‖² = 3372, and 66.66 Δm₂ = −58.07 gives Δm₂ = −0.871; thus

m₂ = m₁ + Δm₂ = 3.3333 − 0.871 = 2.462

Iteration 3. Continuing,

G₁₁ = 6m₂² = 36.37 and d̂ = 29.847

thus Δd = −13.847; now ‖Δd‖² = 192, and 36.37 Δm₃ = −13.847 gives Δm₃ = −0.381; thus

m₃ = m₂ + Δm₃ = 2.462 − 0.381 = 2.081

Iteration 4. (Will this thing ever end?)

G₁₁ = 6m₃² = 25.983 and d̂ = 18.024

thus Δd = −2.024; now ‖Δd‖² = 4.1, and 25.983 Δm₄ = −2.024 gives Δm₄ = −0.078; thus

m₄ = m₃ + Δm₄ = 2.081 − 0.078 = 2.003

Iteration 5. (When were computers invented?)

G₁₁ = 6m₄² = 24.072 and d̂ = 16.072

thus Δd = −0.072; now ‖Δd‖² = 0.005, and 24.072 Δm₅ = −0.072 gives Δm₅ = −0.003; thus

m₅ = m₄ + Δm₅ = 2.003 − 0.003 = 2.000

Iteration 6. Beginning again, we have

G₁₁ = 6m₅² = 24 and d̂ = 16

thus Δd = 0.
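The iteration just worked by hand is easy to reproduce numerically. The following short script (not part of the original notes) applies the Section 4.3 procedure to g(m) = 2m³ with observed datum d = 16, re-forming G₁₁ = 6m² at each step:

```python
# Iteratively linearized solution of g(m) = 2*m**3 = 16 (Section 4.4.2).
def g(m):
    return 2.0 * m**3

def dg_dm(m):
    return 6.0 * m**2            # G11, re-evaluated at each iteration

d_obs = 16.0
m = 1.0                          # starting model m0
for iteration in range(1, 21):
    d_pred = g(m)                # predicted datum, d-hat
    dd = d_obs - d_pred          # misfit, delta-d
    if abs(dd) < 1e-6:
        break                    # acceptable misfit reached
    dm = dd / dg_dm(m)           # solve G11 * dm = dd for the model change
    m = m + dm                   # update the model, Eq. (4.17)

print(round(m, 4))               # 2.0
```

The printed iterates match the hand calculation above (1 → 3.3333 → 2.462 → 2.081 → 2.003 → 2.000), illustrating both the overshoot at the first step and the rapid convergence afterward.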
We quit. We should note that we have quit because the misfit has been reduced to some acceptable level (three significant figures in this case). The solution happens to be an integer, and we have found it to three places. Most solutions are not integers, and we must decide how many significant figures are justified. The answer depends on many things, but one of the most important is the level of noise in the data. If the data are noisy, it does not make sense to claim that a solution to seven places is meaningful.

[Figure: d̂ = g(m) = 2m³ plotted versus m. The slope at m₀ = 1, extrapolated to d̂ = 16, yields m₁ = 3.333; the solution then iterates back down the curve to the correct solution, m = 2.]

Consider plotting Δd, rather than d̂, versus m. This is perhaps more useful, because when we solve for Δmᵢ we are always extrapolating to Δd = 0.

[Figure: Δd plotted versus m. Δd = 14 at m₀ = 1, and the slope at m₀ = 1 is extrapolated to Δd = 0.]

For this example, we see that at m₀ = 1 the slope ∂g(m)/∂m = 6. We used this slope to extrapolate to the point where Δd = 0. At the second iteration, m₁ = 3.333 is farther from the true solution m = 2 than was our starting model. Also, the length squared of the misfit is 3372, much worse than the misfit of 196 at our initial guess. This makes an important point: you can still get to the right answer even if some iteration takes you farther from the solution than where you have been. This is especially true for early steps in the iteration, when you may not be close to the solution. Note also that if we had started with m₀ closer to zero, the shallow slope would have sent us to an even higher value for m₁. We would still have recovered, though (do you see why?). The shallow slope corresponds to a small singular value, illustrating the problems associated with small singular values. We will consider singular-value analysis in the next chapter. What do you think would happen if you take m₀ < 0? Would it still converge to the correct solution? Try m₀ = −1 if your curiosity has been piqued.

4.4.3 Nonlinear Straight-Line Example

An interesting nonlinear problem is fitting a straight line to a set of data points (yᵢ, zᵢ) which may contain errors, or noise, along both the y and z axes. One could cast the problem as

yᵢ = a + b zᵢ,    i = 1, ..., N    (4.20)

Assuming z were perfectly known, one obtains a solution for a, b by a linear least squares approach (see Equations 3.27 and 3.32). Similarly, if y were perfectly known, one obtains a solution for

zᵢ = c + d yᵢ,    i = 1, ..., N    (4.21)

again using (3.27) and (3.32). These two lines can be compared by rewriting (4.21) as a function of y, giving

yᵢ = −(c/d) + (1/d) zᵢ,    i = 1, ..., N    (4.22)

In general, a ≠ −c/d and b ≠ 1/d, because in (4.20) we assumed all of the error, or misfit, was in y, while in (4.21) we assumed that all of the error was in z. Recall that the quantity being minimized in (4.20) is

E₁ = (y_obs − y_pre)ᵀ(y_obs − y_pre) = Σ_{i=1}^{N} (yᵢ − ŷᵢ)²    (4.23)

where ŷᵢ is the predicted y value. The comparable quantity for (4.21) is

E₂ = (z_obs − z_pre)ᵀ(z_obs − z_pre) = Σ_{i=1}^{N} (zᵢ − ẑᵢ)²    (4.24)

where ẑᵢ is the predicted z value. For the best-fit line in which both y and z have errors, the function to be minimized is

E = Σ_{i=1}^{N} [(yᵢ − ŷᵢ)² + (zᵢ − ẑᵢ)²]    (4.25)

where (y − ŷ) and (z − ẑ) together compose a vector of dimension 2N, if N is the number of pairs (yᵢ, zᵢ).

[Figure: a data point P = (yᵢ, zᵢ) and the fitted line. Line PA represents the misfit in y (4.23), line PC represents the misfit in z (4.24), while the perpendicular line PB represents the misfit in (4.25).]

In order to minimize (4.25), we must be able to write the forward problem for the predicted data (ŷᵢ, ẑᵢ). Let the solution we seek be given by

y = m₁ + m₂ z    (4.26)

Line PB is perpendicular to (4.26) and thus has a slope of −1/m₂. The equation of a line through P = (yᵢ, zᵢ) with slope −1/m₂ is given by

y − yᵢ = −(1/m₂)(z − zᵢ)

or

y = −(1/m₂) z + yᵢ + (1/m₂) zᵢ    (4.27)

The point (ŷᵢ, ẑᵢ) is thus the intersection of the lines given by (4.26) and (4.27). Equating the right-hand sides of (4.26) and (4.27) for y = ŷᵢ and z = ẑᵢ gives

m₁ + m₂ ẑᵢ = −(1/m₂) ẑᵢ + yᵢ + (1/m₂) zᵢ    (4.28)

which can be solved for ẑᵢ as

ẑᵢ = (zᵢ + m₂ yᵢ − m₁ m₂) / (1 + m₂²)    (4.29)
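As a numerical aside (not part of the original notes), the orthogonal projection of Equation (4.29) can be checked directly: with ŷᵢ = m₁ + m₂ẑᵢ from (4.26), the misfit E of (4.25) can be evaluated for any trial line. For the unweighted case, the minimizing line can also be found in closed form from the principal axis of the demeaned data's scatter matrix, an equivalent route to the iterative linearization described here. The three data points below are those of the worked example later in this section.

```python
import numpy as np

# Data from the worked example: (z_i, y_i) pairs with errors in both coordinates.
z = np.array([1.0, 2.0, 5.0])
y = np.array([1.0, 4.0, 5.0])

def E(m1, m2):
    """Total squared misfit of Eq. (4.25) for the trial line y = m1 + m2*z."""
    zhat = (z + m2 * y - m1 * m2) / (1.0 + m2**2)   # Eq. (4.29)
    yhat = m1 + m2 * zhat                            # Eq. (4.26)
    return np.sum((y - yhat)**2 + (z - zhat)**2)

# Closed-form orthogonal fit: the best line runs along the principal axis
# (largest-eigenvalue eigenvector) of the demeaned data's scatter matrix.
P = np.vstack([z - z.mean(), y - y.mean()])
w, V = np.linalg.eigh(P @ P.T)     # eigenvalues in ascending order
vz, vy = V[:, -1]                  # eigenvector of the largest eigenvalue
m2 = vy / vz                       # slope of the orthogonal-fit line
m1 = y.mean() - m2 * z.mean()      # intercept (line passes through the centroid)
print(round(m1, 3), round(m2, 3))  # 0.667 1.0
```

The recovered line is the nonlinear least squares solution y = 0.667 + 1.000z quoted below, and its misfit E(m₁, m₂) ≈ 1.33 is indeed smaller than the misfit ≈ 1.43 of the ordinary least squares line y = 1.077 + 0.846z, which minimizes only the y misfit of (4.23).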
Rearranging (4.26) and (4.27) to give z as a function of y, and again equating for y = ŷᵢ and z = ẑᵢ, gives

(ŷᵢ − m₁)/m₂ = −m₂ ŷᵢ + m₂ yᵢ + zᵢ    (4.30)

which can be solved for ŷᵢ as

ŷᵢ = (m₁ + m₂ zᵢ + m₂² yᵢ) / (1 + m₂²)    (4.31)

Substituting (4.31) for ŷᵢ and (4.29) for ẑᵢ into (4.25) now gives E as a function of the unknowns m₁ and m₂. The approach used for the linear problem was to take partials of E with respect to m₁ and m₂, set them equal to zero, and solve for m₁ and m₂, as was done in (3.10) and (3.11). Unfortunately, the resulting equations for the partials of E with respect to m₁ and m₂ in (4.25) are not linear in m₁ and m₂ and cannot be cast in the linear form

GᵀG m = GᵀD    (3.18)

Instead, we must consider (4.29) and (4.31) to be of the form of (4.2),

dᵢ = gᵢ(m)    (4.2)

and linearize the problem by expanding (4.29) and (4.31) in a Taylor series about some starting guesses ẑ₀ and ŷ₀, respectively. This requires taking partials of ŷ and ẑ with respect to m₁ and m₂, which can be obtained from (4.29) and (4.31).

Consider the following data set, shown also in the diagram below:

yᵢ: 1  4  5
zᵢ: 1  2  5

[Figure: the three data points plotted in the (z, y) plane, together with the three fitted lines y = 1.077 + 0.846z, y = 0.667 + 1.000z, and z = −0.154 + 0.846y.]

The linear least squares solution to

yᵢ = a + b zᵢ,    i = 1, 2, 3    (4.20)

is

yᵢ = 1.077 + 0.846 zᵢ,    i = 1, 2, 3    (4.32)

The linear least squares solution to

zᵢ = c + d yᵢ,    i = 1, 2, 3    (4.21)

is

zᵢ = −0.154 + 0.846 yᵢ,    i = 1, 2, 3    (4.33)

For comparison with a and b above, we can rewrite (4.33) with y as a function of z as

yᵢ = 0.182 + 1.182 zᵢ,    i = 1, 2, 3    (4.34)

The nonlinear least squares solution, which minimizes

E = Σ_{i=1}^{N} [(yᵢ − ŷᵢ)² + (zᵢ − ẑᵢ)²]    (4.25)

is given by

yᵢ = 0.667 + 1.000 zᵢ,    i = 1, 2, 3    (4.35)

From the figure, you can see that the nonlinear solution lies between the other two solutions.

It is also possible to consider a weighted nonlinear least squares best fit to a data set. In this case, we form a new E, after (3.59), as

E = [(y − ŷ)ᵀ (z − ẑ)ᵀ] Wₑ [(y − ŷ); (z − ẑ)]    (4.36)

where Wₑ is a 2N × 2N weighting matrix. The natural choice for Wₑ is the inverse data covariance matrix. If the errors in (yᵢ, zᵢ) are uncorrelated, then the data covariance matrix will be a diagonal matrix with the variances for yᵢ as the first N entries and the variances for zᵢ as the last N entries. If we further let V_y and V_z be the variances for yᵢ and zᵢ, respectively, then Equations (4.29) and (4.31) become

ẑᵢ = (V_y zᵢ + m₂ V_z yᵢ − m₁ m₂ V_z) / (m₂² V_z + V_y)    (4.37)

and

ŷᵢ = (m₁ V_y + m₂ V_y zᵢ + m₂² V_z yᵢ) / (m₂² V_z + V_y)    (4.38)

If V_z = V_y, then dividing through either (4.37) or (4.38) by the variance returns (4.29) or (4.31). Thus we see that weighted least squares techniques are equivalent to general least squares techniques when all the data variances are equal and the errors are uncorrelated. Furthermore, dividing both the numerator and denominator of (4.38) by V_y yields

ŷᵢ = [m₁ + m₂ zᵢ + m₂² (V_z/V_y) yᵢ] / [m₂² (V_z/V_y) + 1]    (4.39)

Then, in the limit that V_z goes to zero, (4.39) becomes

ŷᵢ = m₁ + m₂ zᵢ    (4.40)

which is just the linear least squares solution of (4.20). That is, if zᵢ is assumed to be perfectly known (V_z = 0), the nonlinear problem reduces to a linear problem. Similar arguments can be made for (4.37) to show that the linear least squares solution of (4.21) results when V_y goes to zero.

4.5 Creeping vs. Jumping [Shaw and Orcutt, 1985]

The general procedure described in Section 4.3 is termed "creeping" by Parker [1985]. It finds, from the set of acceptable misfit models, the one closest to the starting model under the Euclidean norm.

[Figure: the set of acceptable misfit models, with the creeping path m₀ → m₁ → m₂ → m₃ approaching it.]

Because of the nonlinearity, often several iterations are required to reach the desired misfit. The acceptable level of misfit is free to vary and can be viewed as a parameter that controls the trade-off between satisfying the data and keeping the model perturbation small. Because the starting model itself is physically reasonable, unphysical model estimates tend to be avoided.

There are several potential disadvantages to the creeping strategy. Creeping analysis depends significantly on the choice of the initial model. If the starting model is changed slightly, a new final model may well be found. In addition, constraints applied to model perturbations may not be as meaningful as those applied directly to the
model parameters. Parker [1985] introduced an alternative approach with a simple algebraic substitution. The new method, called "jumping," directly calculates the new model in a single step, rather than calculating a perturbation to the initial model. Now any suitable norm can be applied to the model, rather than to the perturbations. This new strategy is motivated in part by the desire to map the neighborhood of starting models near m₀ to a single final model, thus making the solution less sensitive to small changes in m₀.

Let's write the original nonlinear equations as

F(m) = d    (4.41)

After linearization about an initial model m₀, we have

G Δm = Δd    (4.42)

where

G_ij = ∂F_i/∂m_j |_{m=m₀},    Δd = d − F(m₀)    (4.43)

The algebraic substitution suggested by Parker is to simply add Gm₀ to both sides, yielding

G Δm + Gm₀ = Δd + Gm₀

then

G(Δm + m₀) = Δd + Gm₀    (4.44)

and

Gm₁ = Δd + Gm₀    (4.45)

or

Gm₁ = d − F(m₀) + Gm₀    (4.46)

At this point, this equation is algebraically equivalent to our starting linearized equation. But the crucial difference is that now we are solving directly for the model m₁ rather than for a perturbation Δm. This slight algebraic difference means we can now apply any suitable constraint to the model. A good example is a smoothing constraint. If the initial model is not smooth, applying a smoothing constraint to the model perturbations may not make sense. In the new formulation, we can apply the constraint directly to the model.

In the jumping scheme, the new model is computed directly, and the norm of this model is minimized relative to an absolute origin (0) corresponding to this norm. The explicit dependence on the starting model is greatly reduced. In our example of the second-derivative smoothing matrix D, we can now apply this directly to the jumping equations:

[G ; εD] m₁ = [Δd + Gm₀ ; 0]    (4.47)

We should keep in mind that, since the problem is nonlinear, there is still no guarantee that the final model will be unique, and even this "jumping" scheme is iterative.
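A minimal numerical sketch of the jumping equations (4.46)–(4.47) follows; it is not from the original notes, and the quadratic forward problem F(m)ᵢ = mᵢ² is invented purely for illustration. The stacked system of (4.47) is solved by ordinary least squares at each iteration, so the second-difference smoothing constraint acts on the model itself, not on the perturbation.

```python
import numpy as np

# Hypothetical nonlinear forward problem, for illustration only:
# d_i = F(m)_i = m_i**2, so the Jacobian G is diagonal with entries 2*m_i.
M = 5
m_true = np.linspace(1.0, 2.0, M)      # smooth "true" model (linear trend)
d = m_true**2                          # noise-free data

def F(m):
    return m**2

def jacobian(m):
    return np.diag(2.0 * m)

# Second-difference smoothing (roughening) matrix D.
D = np.zeros((M - 2, M))
for k in range(M - 2):
    D[k, k:k + 3] = [1.0, -2.0, 1.0]

eps = 0.1                              # trade-off parameter
m_old = np.ones(M)                     # rough starting model m0
for _ in range(20):                    # jumping is still iterative for nonlinear F
    G = jacobian(m_old)
    rhs = d - F(m_old) + G @ m_old     # right-hand side of Eq. (4.46)
    A = np.vstack([G, eps * D])        # stacked operator of Eq. (4.47)
    b = np.concatenate([rhs, np.zeros(M - 2)])
    m_new, *_ = np.linalg.lstsq(A, b, rcond=None)
    if np.max(np.abs(m_new - m_old)) < 1e-10:
        break
    m_old = m_new
```

Because the true model here has zero second difference, the smoothing term is exactly satisfiable and the iteration recovers m_true; with noisy data, ε would trade off squared misfit against model roughness, as described above.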
To summarize, the main advantage of the jumping scheme is that the new model is calculated directly. Thus constraints can be imposed directly on the new model. The minimization of the squared misfit can be traded off with the constraint measures, allowing optimization of some physically significant quantity. This tends to reduce the strong dependence on the initial model that is associated with the creeping scheme.

We now turn our attention to the generalized inverse in the next three chapters. We will begin with eigenvalue problems, and then continue with singular-value problems, the generalized inverse, and ways of quantifying the quality of the solution.

CHAPTER 5: THE EIGENVALUE PROBLEM

5.1 Introduction

In Chapter 3 the emphasis was on developing inverse operators and solutions based on minimizing some combination of the (perhaps weighted) error vector e and the model parameter vector m. In this chapter we will change the emphasis to the operator G itself. Using the concepts of vector spaces and linear transformations, we will consider how G maps model parameter vectors into predicted data vectors. This approach will lead to the generalized inverse operator. We will see that the operators introduced in Chapter 3 can be thought of as special cases of the generalized inverse operator. The power of the generalized inverse approach, however, lies in the ability to assess the resolution and stability of the solution based on the mapping back and forth between model and data spaces.

We will begin by considering the eigenvalue problem for square matrices and extend this to the shifted eigenvalue problem for nonsquare matrices. Once we have learned how to decompose a general matrix using the shifted eigenvalue problem, we will introduce the generalized inverse operator, show how it reduces to the operators and solutions from Chapter 3 in special cases, and consider measures of quality and stability for the solution.

The eigenvalue problem plays a central role in the vector space representation of inverse
problems. Although some of this is likely to be review, it is important to cover it in some detail so that the underlying linear algebra aspects are made clear. This will lay the foundation for an understanding of the mapping of vectors back and forth between model and data spaces.

5.2 Eigenvalue Problem for the Square M × M Matrix A

5.2.1 Background

Given the following system of linear equations,

Ax = b    (5.1)

where A is a general M × M matrix and x and b are M × 1 column vectors, respectively, it is natural to immediately plunge in, calculate A⁻¹, the inverse of A, and solve for x. Presumably there are lots of computer programs available to invert square matrices, so why not just hand the problem over to the local computer hack and be done with it? The reasons are many, but let it suffice to say that in almost all problems in geophysics A⁻¹ will not exist in the mathematical sense, and our task will be to find an approximate solution (we hope), along with some measure of the quality of that solution.

One approach to solving Equation (5.1) above is through the eigenvalue problem for A. It will have the benefit that, in addition to finding A⁻¹ when it exists, it will lay the groundwork for finding approximate solutions for the vast majority of situations where the exact mathematical inverse does not exist.

In the eigenvalue problem, the matrix A is thought of as a linear transformation that transforms one M-dimensional vector into another M-dimensional vector. Eventually we will seek the particular x that solves Ax = b, but the starting point is the more general problem of transforming any particular M-dimensional vector into another M-dimensional vector. We begin by defining M space as the space of all M-dimensional vectors. For example, 2-space is the space of all vectors in a plane. The eigenvalue problem can then be defined as follows: for A an M × M matrix, find all vectors s in M space such that

As = λs    (5.2)

where λ is a constant called the eigenvalue and s is the associated eigenvector. In words, this means: find all vectors s that, when operated on by A, return a vector that points in the same direction (up to the sign of the vector), with a length scaled by the constant λ.

5.2.2 How Many Eigenvalues, Eigenvectors?

If As = λs, then

[A − λI_M] s = 0_M    (5.3)

where 0_M is an M × 1 vector of zeros. Equation (5.3) is called a homogeneous equation because the right-hand side consists of zeros. It has a nontrivial solution (i.e., s ≠ [0, 0, ..., 0]ᵀ) if and only if

|A − λI_M| = 0    (5.4)

where

|A − λI_M| = det [ a₁₁−λ  a₁₂  ⋯  a₁M ;  a₂₁  a₂₂−λ  ⋯  a₂M ;  ⋮ ;  a_M1  ⋯  a_MM−λ ]    (5.5)

Equation (5.4), also called the characteristic equation, is a polynomial of order M in λ, of the form

λᴹ + C_{M−1} λ^{M−1} + C_{M−2} λ^{M−2} + ⋯ + C₀ λ⁰ = 0    (5.6)

Equation (5.6) has M, and only M, roots for λ. In general, these roots may be complex (shudder). However, if A is Hermitian, then all of the M roots are real (whew). They may, however, have repeated values (ugh), be negative (ouch), or zero (yuk). We may write the M roots for λ as

λᵢ,    i = 1, 2, ..., M

For each λᵢ, an associated eigenvector sᵢ which solves Equation (5.2) can easily be found. It is important to realize that the ordering of the λᵢ is completely arbitrary. Equation (5.2) then holds for M values of λᵢ. That is, we have

As₁ = λ₁s₁
As₂ = λ₂s₂
⋮
As_M = λ_M s_M    (5.7)

About the above notation: it turns out to be very inconvenient to use superscripts to denote which eigenvector a component belongs to, so sⱼ will denote the jth eigenvector and s_ij its ith component. Thus, if I use only a subscript on s, it refers to the number of the eigenvector. Of course, if it is a vector, instead of a component of a vector, it will always appear as a boldface letter.

The length of eigenvectors is arbitrary. To see this, note that if

Asᵢ = λᵢsᵢ    (5.8a)

then

A(αsᵢ) = λᵢ(αsᵢ)    (5.8b)

since A is a linear operator.

5.2.3 The Eigenvalue Problem in Matrix Notation

Now that we know that there are M values of λᵢ and M associated eigenvectors, we can write Equation (5.2) as

Asᵢ = λᵢsᵢ,    i = 1, 2, ..., M    (5.9)

Consider, then, the following matrices:

Λ = diag(λ₁, λ₂, ..., λ_M) = [ λ₁  0  ⋯  0 ;  0  λ₂  ⋯  0 ;  ⋮ ;  0  0  ⋯  λ_M ]    (5.10)

and the M × M matrix

S = [s₁ s₂ ⋯ s_M]    (5.11)

where the ith column of S is the eigenvector sᵢ associated with the ith eigenvalue λᵢ. Then Equation (5.9) can be written in compact matrix notation as

AS = SΛ    (5.12)

where the order of the matrices on the right-hand side is important. To see this, let us consider the ith columns of AS and SΛ, respectively. The first component of the ith column of AS is given by

[AS]₁ᵢ = Σ_{k=1}^{M} a₁ₖ s_{ki}    (5.13)

where s_{ki} is the kth component of the ith eigenvector. Similarly, components 2 through M of the ith column are given by

[AS]₂ᵢ = Σ_{k=1}^{M} a₂ₖ s_{ki}
⋮
[AS]_{Mi} = Σ_{k=1}^{M} a_{Mk} s_{ki}    (5.14)

Therefore the ith column of AS is given by

[ Σ a₁ₖ s_{ki} ; Σ a₂ₖ s_{ki} ; ⋯ ; Σ a_{Mk} s_{ki} ] = Asᵢ    (5.15)

That is, the ith column of AS is given by the product of A and the ith eigenvector sᵢ. Now, the first element of the ith column of SΛ is found as follows:

[SΛ]₁ᵢ = Σ_{k=1}^{M} s₁ₖ Λₖᵢ    (5.16)

But

Λₖᵢ = λᵢ δₖᵢ    (5.17)

where δₖᵢ is the Kronecker delta. Therefore

[SΛ]₁ᵢ = s₁ᵢ λᵢ    (5.18)

Entries 2 through M are then given by

[SΛ]₂ᵢ = s₂ᵢ λᵢ
⋮
[SΛ]_{Mi} = s_{Mi} λᵢ    (5.19)

Thus the ith column of SΛ is given by

λᵢ [ s₁ᵢ ; s₂ᵢ ; ⋯ ; s_{Mi} ] = λᵢ sᵢ    (5.20)

Thus, we have that the original equation defining the eigenvalue problem, Equation (5.9), Asᵢ = λᵢsᵢ, is given by the ith columns of the matrix equation AS = SΛ (5.12).

Consider, on the other hand, the ith column of ΛS:

[ΛS]₁ᵢ = λ₁ s₁ᵢ
⋮
[ΛS]_{Mi} = λ_M s_{Mi}    (5.21a)

Therefore the ith column of ΛS is given by

[ λ₁ s₁ᵢ ; λ₂ s₂ᵢ ; ⋯ ; λ_M s_{Mi} ]    (5.21b)

in which each component of sᵢ is scaled by a different eigenvalue. Clearly this is not equal to λᵢsᵢ in Equation (5.9), which is why the order SΛ matters.

5.2.4 Summarizing the Eigenvalue Problem for A

In review, we found that we could write the eigenvalue problem for Ax = b as

AS = SΛ    (5.12)

where

S = [s₁ s₂ ⋯ s_M]    (5.11)

is an M × M matrix, each column of which is an eigenvector sᵢ such that Asᵢ = λᵢsᵢ (5.9), and where

Λ = diag(λ₁, λ₂, ..., λ_M)    (5.10)

is a diagonal M × M matrix with the eigenvalues λᵢ of Asᵢ = λᵢsᵢ along the diagonal and zeros everywhere else.
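The relation AS = SΛ (5.12) is easy to verify numerically. Here is a short check (not part of the original notes) using the symmetric matrix A = [6 2; 2 3] that appears in the next section:

```python
import numpy as np

A = np.array([[6.0, 2.0],
              [2.0, 3.0]])

lam, S = np.linalg.eig(A)   # eigenvalues and the eigenvector matrix S
Lam = np.diag(lam)          # the diagonal matrix Lambda of Eq. (5.10)

# AS = S*Lambda, Eq. (5.12); note that S*Lambda != Lambda*S in general.
print(np.allclose(A @ S, S @ Lam))   # True
```

For this matrix the eigenvalues returned are 7 and 2 (in some order), matching the hand solution of the next section; each column of S is normalized to unit length, which is conventional but, as noted above, not required.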
5.3 Geometrical Interpretation of the Eigenvalue Problem for Symmetric A

5.3.1 Introduction

It is possible, of course, to cover the mechanics of the eigenvalue problem without ever considering a geometrical interpretation. There is, however, a wealth of information to be learned from looking at the geometry associated with the operator A. This material is not covered in Menke's book, although some of it is covered in Statistics and Data Analysis in Geology, 2nd Edition [1986], pages 131–139, by J. C. Davis, published by John Wiley and Sons.

We take as an example the symmetric 2 × 2 matrix A given by

A = [ 6  2 ;  2  3 ]    (5.22)

Working with a symmetric matrix assures real eigenvalues. We will discuss the case for the general square matrix later. The eigenvalue problem is solved by

|A − λI| = 0    (5.23)

where it is found that λ₁ = 7, λ₂ = 2, and the associated eigenvectors are given by

s₁ = [0.894, 0.447]ᵀ    (5.24a)

and

s₂ = [−0.447, 0.894]ᵀ    (5.24b)

respectively. Because A is symmetric, s₁ and s₂ are perpendicular to each other. The length of eigenvectors is arbitrary, but for orthogonal eigenvectors it is common to normalize them to unit length, as done in Equations (5.24a) and (5.24b). The sign of eigenvectors is arbitrary as well, and I have chosen the signs for s₁ and s₂ in Equations (5.24a) and (5.24b) for convenience when I later relate them to orthogonal transformations.

5.3.2 Geometrical Interpretation

The system of linear equations

Ax = b    (5.1)

implies that the M × M matrix A transforms M × 1 vectors x into M × 1 vectors b. In our example, M = 2, and hence x = [x₁, x₂]ᵀ represents a point in 2-space, and b = [b₁, b₂]ᵀ is just another point in the same plane. Consider the unit circle given by

xᵀx = x₁² + x₂² = 1    (5.25)

Matrix A maps every point on this circle onto another point. The eigenvectors s₁ and s₂, having unit length, are points on this circle. Then, since Asᵢ = λᵢsᵢ (5.8a), we have

As₁ = [ 6  2 ;  2  3 ][ 0.894 ;  0.447 ] = [ 6.258 ;  3.129 ] = 7 [ 0.894 ;  0.447 ]    (5.26a)

and

As₂ = [ 6  2 ;  2  3 ][ −0.447 ;  0.894 ] = [ −0.894 ;  1.788 ] = 2 [ −0.447 ;  0.894 ]    (5.26b)

As expected, when A operates on an eigenvector, it returns another vector, parallel or antiparallel to the eigenvector, scaled by the eigenvalue. When A operates on any direction different from the eigenvector directions, it returns a vector that is not parallel (or antiparallel) to the original vector.

What is the shape mapped out by A operating on the unit circle? We have already seen where s₁ and s₂ map. If we map out this transformation for the unit circle, we get an ellipse, and this is an important element of the geometrical interpretation. What is the equation for this ellipse? We begin with our unit circle, given by

xᵀx = 1    (5.25)

and

Ax = b    (5.1)

then

x = A⁻¹b    (5.27)

assuming A⁻¹ exists. Then, substituting (5.27) into (5.25), we get, for a general A,

(A⁻¹b)ᵀ(A⁻¹b) = bᵀ(A⁻¹)ᵀA⁻¹b = 1    (5.28a)

Where A is symmetric,

(A⁻¹)ᵀ = A⁻¹    (5.28b)

and in this case

bᵀA⁻¹A⁻¹b = 1    (5.28c)

After some manipulation, Equation (5.28) for the present choice of A, given in Equation (5.22), gives

b₁²/15.077 − b₁b₂/5.444 + b₂²/4.900 = 1    (5.29)

which we recognize as the equation of an ellipse inclined to the b₁ and b₂ axes. We can now make the following geometrical interpretation on the basis of the eigenvalue problem:

1. The major and minor axis directions are given by the directions of the eigenvectors s₁ and s₂.

2. The lengths of the semimajor and semiminor axes of the ellipse are given by the absolute values of the eigenvalues λ₁ and λ₂.

3. The columns of A (i.e., [6, 2]ᵀ and [2, 3]ᵀ in our example) are vectors from the origin to points on the ellipse.

The third observation follows from the fact that A operating on [1, 0]ᵀ and [0, 1]ᵀ returns the first and second columns of A, respectively, and both unit vectors clearly fall on the unit circle.

If one of the eigenvalues is negative, the unit circle is still mapped onto an ellipse, but some points around the unit circle will be mapped back through the circle to the other side. The absolute value of the eigenvalue still gives the length of the semiaxis.

The geometrical interpretation changes somewhat if A is not symmetric. The unit circle is still mapped onto an ellipse, and the columns of A still
represent vectors from the origin to points on the ellipse. Furthermore, the absolute values of the eigenvalues still give the lengths of the semimajor and semiminor axes. The only difference is that the eigenvectors s₁ and s₂ will no longer be orthogonal to each other and will not give the directions of the semimajor and semiminor axes.

The geometrical interpretation holds for larger-dimensional matrices. In the present 2 × 2 case for A, the unit circle maps onto an ellipse. For a 3 × 3 case, a unit sphere maps onto an ellipsoid. In general, the surface defined by the eigenvalue problem is of dimension one less than the dimension of A.

[Figure: geometry of the eigenvalue problem for A = [6 2; 2 3], showing the unit circle, the eigenvectors s₁ = [0.894, 0.447]ᵀ and s₂ = [−0.447, 0.894]ᵀ, and the inclined ellipse onto which the unit circle maps, with a blowup of the central portion.]

5.3.3 Coordinate System Rotation

For the case of a symmetric A matrix, it is possible to relate the eigenvectors of A to a coordinate transformation that diagonalizes A. For the example given in Equation (5.22), we construct the eigenvector matrix S with the eigenvectors s₁ and s₂ as the first and second columns, respectively. Then

S = [ 0.894  −0.447 ;  0.447  0.894 ]    (5.30)

Since s₁ and s₂ are unit vectors perpendicular to one another, we see that S represents an orthogonal transformation, and

SSᵀ = SᵀS = I₂    (5.31)

Now consider the coordinate transformation from (x, y) to a rotated system (x′, y′), where

[ x′ ;  y′ ] = [ cos θ  sin θ ;  −sin θ  cos θ ][ x ;  y ] = T [ x ;  y ]    (5.32a)

and

[ x ;  y ] = [ cos θ  −sin θ ;  sin θ  cos θ ][ x′ ;  y′ ] = Tᵀ [ x′ ;  y′ ]    (5.32b)

where T transforms [x, y]ᵀ to [x′, y′]ᵀ and Tᵀ transforms [x′, y′]ᵀ back to [x, y]ᵀ, θ is positive for counterclockwise rotation from x to x′, and

TTᵀ = TᵀT = I₂    (5.32c)

It is always possible to choose the signs of s₁ and s₂ such that they can be associated with x′ and y′, as I have done in Equations (5.24a) and (5.24b). Looking at the components of s₁, we see that θ = 26.6°, and we further note that

S = Tᵀ    (5.33)

The matrix A can also be thought of as a symmetric second-order tensor (such as stress or strain, for example) in the original (x, y) coordinate system. Tensors can also be rotated into the (x′, y′) coordinate system, as

A′ = TATᵀ    (5.34a)

where A′ is the rotated tensor. Using Equation (5.33) to replace T in Equation (5.34a) yields

A′ = SᵀAS    (5.34b)

If you actually perform this operation on the example, you will find that A′ is given by

A′ = [ 7  0 ;  0  2 ]    (5.35)

Thus we find that, in the new coordinate system defined by the s₁ and s₂ axes, A′ is a diagonal matrix with the diagonals given by the eigenvalues of A. If we were to write the equation for the ellipse in Equation (5.28) in the (x′, y′) coordinates, we would find

x′²/49 + y′²/4 = 1    (5.36)

which is just the equation of an ellipse with semimajor and semiminor axes aligned with the coordinate axes, of lengths 7 and 2, respectively. This new (x′, y′) coordinate system is often called the principal coordinate system.

In summary, we see that the eigenvalue problem for symmetric A results in an ellipse whose semimajor and semiminor axis directions and lengths are given by the eigenvectors and eigenvalues of A, respectively, and that these eigenvectors can be thought of as the orthogonal transformation that rotates A into a principal coordinate system where the ellipse is aligned with the coordinate axes.

5.3.4 Summarizing Points

A few points can be made:

1. The trace of a matrix is unchanged by orthogonal transformations, where the trace is defined as the sum of the diagonal terms; that is,

trace A = Σ_{i=1}^{M} a_ii

This implies that trace A = trace A′. You can use this fact to verify that you have correctly found the eigenvalues.

2. If the eigenvalues had been repeated, it would imply that the lengths of the two axes of the ellipse are the same. That is, the ellipse would degenerate into a circle. In this case, the uniqueness of the directions of the eigenvectors vanishes. Any two vectors (preferably orthogonal) would suffice.

3. If one of the eigenvalues is zero, it means the minor axis has zero length, and the ellipse
collapses into a straight line. No information about the direction perpendicular to the major axis can be obtained.

5.4 Decomposition Theorem for Square A

We have considered the eigenvalue problem for A. Now it is time to turn our attention to the eigenvalue problem for Aᵀ. It is important to do so because we will learn that Aᵀ has the same eigenvalues as A but, in general, different eigenvectors. We will be able to use the information about shared eigenvalues to decompose A into a product of matrices.

5.4.1 The Eigenvalue Problem for Aᵀ

The eigenvalue problem for Aᵀ is given by

Aᵀrᵢ = ηᵢrᵢ    (5.37)

where ηᵢ is the eigenvalue and rᵢ is the associated M × 1 column eigenvector. Proceeding in a manner similar to the eigenvalue problem for A, we have

[Aᵀ − ηᵢI_M] r = 0_M    (5.38)

This has nontrivial solutions for r if and only if

|Aᵀ − ηI_M| ≡ 0    (5.39)

But mathematically, |B| = |Bᵀ|. That is, the determinant is unchanged when you interchange rows and columns. The implication is that the Mth-order polynomial in η,

ηᴹ + b_{M−1} η^{M−1} + ⋯ + b₀ η⁰ = 0    (5.40)

is exactly the same as the Mth-order polynomial in λ for the eigenvalue problem for A. That is, A and Aᵀ have exactly the same eigenvalues. Therefore

ηᵢ = λᵢ    (5.41)

5.4.2 Eigenvectors for Aᵀ

In general, the eigenvectors s for A and the eigenvectors r for Aᵀ are not the same. Nevertheless, we can write the eigenvalue problem for Aᵀ in matrix notation as

AᵀR = RΛ    (5.42)

where

R = [r₁ r₂ ⋯ r_M]    (5.43)

is an M × M matrix of the eigenvectors rᵢ of Aᵀrᵢ = λᵢrᵢ, and where

Λ = diag(λ₁, λ₂, ..., λ_M)    (5.44)

is the same eigenvalue matrix shared by A.

5.4.3 Decomposition Theorem for Square Matrices

Statement of the Theorem

We are now in a position to decompose the square matrix A as the product of three matrices.

Theorem. The square M × M matrix A can be written as

A = SΛRᵀ    (5.45)

where S is the M × M matrix whose columns are the eigenvectors of A, R is the M × M matrix whose columns are the eigenvectors of Aᵀ, and Λ is the M × M diagonal matrix whose diagonal entries are the eigenvalues shared by A and Aᵀ and whose off-diagonal entries are zero.

We will use this theorem to find the inverse of A, if it exists. In this section, then, we will go through the steps to show that the decomposition theorem is true. This involves combining the results for the eigenvalue problems for A and Aᵀ.

Proof of the Theorem

Step 1. Combining the results for A and Aᵀ.

We start with Equation (5.12):

AS = SΛ    (5.12)

Premultiply Equation (5.12) by Rᵀ, which gives

RᵀAS = RᵀSΛ    (5.46)

Now, returning to Equation (5.42),

AᵀR = RΛ    (5.42)

Taking the transpose of Equation (5.42),

[AᵀR]ᵀ = [RΛ]ᵀ    (5.47a)

or

Rᵀ[Aᵀ]ᵀ = ΛᵀRᵀ    (5.47b)

or

RᵀA = ΛRᵀ    (5.47c)

since [Aᵀ]ᵀ = A and Λᵀ = Λ. Next, we postmultiply Equation (5.47c) by S to get

RᵀAS = ΛRᵀS    (5.48)

Noting that Equations (5.46) and (5.48) have the same left-hand sides, we have

RᵀSΛ = ΛRᵀS    (5.49)

or

RᵀSΛ − ΛRᵀS = 0_M    (5.50)

where 0_M is an M × M matrix of all zeros.

Step 2. Showing that RᵀS = I_M.

In order to proceed beyond Equation (5.50), we need to show that

RᵀS = I_M    (5.51)

This means two things. First, it means that

[Rᵀ]⁻¹ = S    (5.52)

and

S⁻¹ = Rᵀ    (5.53)

Second, it means the dot product of the eigenvectors in R (the columns of R) with the eigenvectors in S is given by

rᵢᵀsⱼ = δᵢⱼ    (5.54)

That is, if i ≠ j, rᵢ has no projection onto sⱼ. If i = j, then the projection of rᵢ onto sᵢ can be made to be 1. I say that it can be made to be 1 because the lengths of eigenvectors are arbitrary. Thus, if the two vectors have a nonzero projection onto each other, the length of one or the other (or some combination of both vectors) can be changed such that the projection onto each other is 1.

Let us consider for the moment the matrix product RᵀS, and let

W = RᵀS    (5.55)

Then Equation (5.50) implies that the (i, j) entry of WΛ − ΛW vanishes:

(λⱼ − λᵢ) Wᵢⱼ = 0, for all i, j    (5.56)

The diagonal entries of (5.56) vanish identically, independent of the values of the Wᵢᵢ, so the content of (5.56) lies in the off-diagonal entries:

(λⱼ − λᵢ) Wᵢⱼ = 0, i ≠ j    (5.57)

If none of the eigenvalues is repeated (i.e., λᵢ ≠ λⱼ, i ≠ j), then

λⱼ − λᵢ ≠ 0, i ≠ j    (5.58)

and it follows that

Wᵢⱼ = 0, i ≠ j    (5.59)

If λᵢ = λⱼ for some pair of
eigenvalues then it can still be shown that Equation 559 holds The explanation rests on the fact that when an eigenvalue is repeated there is a plane or hyperplane if M gt 3 associated with the eigenvalues rather that a single direction as is the case for the eigenvector associated with a nonrepeated eigenvalue One 97 Geosciences 567 CHAPTER 5 RMRGZ has the freedom to choose the eigenvectors 1 1 si and I s39 in such a way that Wij 0 while still having the eigenvectors span the appropriate planes Needless to say the proof is much more complicated than for the case without repeated eigenvalues and will be le to the student as an exercise The end result however is that we are left with W11 0 0 0 W 2 RTS 22 560 a 0 0 0 WW We recognize the W entries as the dot product of 13 and s1 given by M i i T ii rkiski zrksk 1139 Si 53961 kl kl If W 0 then we can make VV l by scaling 1 1 S or some combination of both We can claim that Vij i 0 as follows l 1 is orthogonal to s 139 7 That is 1 1 avector in M space is perpendicular to M 7 1 other vectors in M space These M 7 l vectors are not all perpendicular to each other or our work would be done 2 However the vectors in R and S spanM space That is one can write an arbitrary vector in M space as a linear combination of the vectors in R or S Thus 1 which has no projection onM 7 1 independent vectors inM space must have some projection on the only vector left in S S Since the projection is nonzero one has the freedom to choose the 13 and s such that Al Al W m rm riTs 1 562 k1 k1 Thus finally we have shown that it is possible to scale the vectors in R andor S such that RTS IM 563 This means that RT S 1 564 and RTST IT I STR 565 98 Step 3 Geosciences 567 CHAPTER 5 RMRGZ and ST R 1 566 Thus the inverse of one is the transpose of the other etc Before leaving this subject I should emphasize that R and S are not orthogonal matrices This is because in general RTR 7 1M 7 RRT 567a and sTs IM SST 567b We cannot even say that S and R are 
orthogonal to each other It is true that 17 is perpendicular to s 139 7 because we have shown that r s 0 139 if but r s 1 568 does not imply that 13 is parallel to s Recall that T 1 llsillcose 569 where 9is the angle between ri and si The fact that you can choose such that the dot product is equal to 1 does not require that 9be zero In fact it usually will not be zero It will be zero however if A is symmetric In that case R and S become orthogonal matrices Final justi cation that A SART Finally now that we have shown that RTS 1M we go back to Equations 512 and 547 AS SA 512 and ATR RA 547 If we postmultiply Equation 512 by RT we have ASRT SART 570 But 99 Geosciences 567 CHAPTER 5 RMRGZ SRT 1M 571 because if we start with RTS 1M 563 which we so laboriously proved in the last pages and postmultiply by S l we obtain RTSS 1 S 1 572a or RT S 1 572b Premultiplying by S gives SRT IM 573 as required Therefore at long last we have A SART 574 Equation 574 shows that an arbitrary square M XM matrix A can be decomposed into the product of three matrices S s1 s2 SM 511 where S is anM XM matrix each column of which is an eigenvector sl such that As jsi 59 r r r R 1 2 M 5 43 where R is anM XM matrix each column of which is an eigenvector 13 such that ATrl 1113 538 11 0 0 0 E A A 510 a 0 0 0 AM 100 Geosciences 567 CHAPTER 5 RMRGZ where A is a diagonalM XM matrix with the eigenvalues of Asl jsi 59 along the diagonal and zeros everywhere else A couple of points are worth noting before going on to use this theorem to nd the inverse of A if it exists First not all of the eigenvalues l are necessarily real Some may be zero Some may even be repeated Second note also that taking the transpose of Equation 574 yields AT 2 RAST 575 544 Finding the Inverse A 1 for the M X M Matrix A The goal in this section is to use Equation 574 to nd A l the inverse to A in the exact mathematical sense of A lA 1M 576a and AAA 1M 576b We start with Equation 59 the original statement of the eigenvalue 
problem for A Asl jsi 59 Premultiply by A 1 assuming for the moment that it exists A lAsl Aims AA lsi 577 But of course A lA 1M by Equation 576a Therefore sl jA lsi 578 or rearranging terms A lsl l jsl 0 579 101 Geosciences 567 CHAPTER 5 RMRGZ Equation 579 is of the form of an eigenvalue problem In fact it is a statement of the eigenvalue problem for A4 The eigenvalues for A 1 are given by the reciprocal of the eigenvalues for A A and A 1 share the same eigenvectors Si Since we know the eigenvalues and eigenvectors for A l we can use the information we have learned on how to decompose square matrices in Equation 574 to write A 1 as A 1 SA IRT 580 where 111 0 0 0 l E A 1 A 581 a 0 gt 0 0 llM Hence if we can decompose A as A SART 574 one can nd if it exists A 1 as A 1 SA lRT 580 Of course if 0 for any 139 then l j is unde ned and hence A 1 does not eXist 545 What Happens When There are Zero Eigenvalues Suppose that some of the eigenvalues 14 of As its 59 are zero What happens then First of course A 1 does not exist in the mathematical sense of Equations 576a7b Given that however let us look in more detail First suppose there are P nonzero and M 7P zero We can order the such that W 2M 2quot W gt0 582 and APH lP 1M 0 583 Recall that one is always free to order the 1739 any way one chooses as long as the si and 17 in Equation 574 are ordered the same way Then we can rewrite A as 102 Geosciences 567 CHAPTER 5 RMRGZ 11 0 0 0 12 A 1P 0 0 0 Consider Equation 574 again We have that A SART We can write out the righthand side as 11 0 0 gt 0 12 S1 S2 SP SP1 SM 3 1P 39 39 39 39 0 0 0 where S A and R are allM gtltM matrices Multiplying out SA explicitly yields r1 r2 Asl lzsz lPsP 0 0 rP I I i 39 rP1 39 M rM llt P gtHltM7Pzgtl llt M gtl gtlt quotU gt lt r1 r2 rP rP1 rM P 584 574 585 586 where we see that the last M 7 P columns of the product SA are all zero Note also that the last M 7P rows of RT or equivalently the last M 7P columns of R will all be multiplied by zeros This means 
that SP1 SP2 that these eigenvectors are obliterated in or by A or that A is blind to them SMand I P FP1 I M are not needed to form A We say In order not to have to write out this partitioning in long hand each time let us make the following de nitions 103 Geosciences 567 CHAPTER 5 RMRGZ Let S Spl S0 where SP s1 s2 SF 587 is anM gtltP matrix with the P eigenvectors of As jsi associated with the P nonzero eigenvalues as columns and where so SP1 SP2 SM 588 is anM X M 7P matrix with the M 7P eigenvectors associated with the zero eigenvalues of Asl jsl as columns Let R RP l R0 where RP 1 1 1 2 I 589 is an M X P matrix with the P eigenvectors of ATI Z39 1113 associated with the P nonzero eigenvalues 1139 as columns and where R0 rP1 rP2 rM 590 is anM X M 7P matrix with the M 7P eigenvectors associated with the zero eigenvalues of ATri 171 as columns Let AP M 0 0 0 where AP is the P X P subset of A with the P nonzero eigenvalues along the diagonal and zeros elsewhere as shown below 11 0 0 0 3 AP 1 A u 0 591 0 0 AP 104 Geosciences 567 CHAPTER 5 RMRGZ and where the rest of A consists entirely of zeros We see then that Equation 586 implies that A can be reconstructed with just Sp Ap and RP as A SP AP R MXM MXP PxP PXM 592 It is important to note that A can be reconstructed using either Equation 574 or Equation 592 The bene t of using Equation 592 is that the matrices are smaller and it could save you effort not to have to calculate the 1 si associated with the zero eigenvalues An important insight into the problem however is that although A can be reconstructed without any information about directions associated with eigenvectors having zero eigenvalues no information can be retrieved or gained about these directions in an inversion 546 Some Notes on the Properties of Sp and Rp 1 At best Sp is semiorthogonal It is possible that s Sp IP 593 depending on the form or information of A Note that the product S Sp has dimension P XMM gtltP P XP independent of M The product will 
equal Ip if and only if the P columns of Sp are all orthogonal to one another It is never possible that sps 1M 594 This product has dimension M X PP gtlt AI M X M Sp has M rows but only P of them can be independent Each row of Sp can be thought of as a vector in Pspace since there are P columns OnlyP vectors in Pspace can be linearly independent andM gt P 2 Similar arguments can be made about Rp That is at best Rp is semiorthogonal Thus R Rp 1P 595 is possible depending on the structure of A It is never possible that RPR IM 596 3 A 1 does not exist since there are zero eigenvalues 4 If there is some solution x to Ax b which is possible even if A 1 does not exist then there are an in nite number of solutions To see this we note that Asi jsi0 139P1P2M 597 This means that if 105 Geosciences 567 CHAPTER 5 RMRGZ Ax b 51 for some x then if we add S P 1 S 139 SM to x we have Ax si b is 598a b 598b since A is a linear operator This means that one could write a general solution as M x 2061s 599 iP1 where 06139 is an arbitrary weighting factor 55 Eigenvector Structure of mm 551 Square Symmetric A Matrix With Nonzero Eigenvalues Recall that the least squares problem can always be transformed into the normal equations form that involves square symmetric matrices If we start with Gm d 5100 GTGmGTd 5101 mLs GTGrlGTd 5102 Let A GTG and b GTd Then we have Am b 5103 mLS Arlb 5104 where A is a square symmetric matrix Now recall the decomposition theorem for square symmetric matrices A SAST 5105 where S is the M X M matrix with columns of eigenvectors of A and A is the M X M diagonal matrix with elements of eigenvalues of A Then we have shown A 1 SMIST 5106 and mLS SAMSTb 5107 106 Geosciences 567 CHAPTER 5 RMRGZ Let s take a closer look at the quotstructurequot of mLs The easiest way to do this is to use a simple example withM 2 Then s s 11 0 s s A111 12 1 11 21 5108 321 S22 0 1M2 512 S22 511 Sl2 S S A4 11 12 1 11 5109 521 S22 32 1 11 11 2 2 S12 5111521S12S22 A4 1 A 12 212 5110 
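(Aside: the eigenvector expansion of the inverse and of the least squares solution described in this section is easy to verify numerically. The sketch below uses numpy and an illustrative 2 x 2 symmetric matrix and data vector that are not from the notes; because A is symmetric, numpy's eigh returns orthonormal eigenvectors, so R = S and the decomposition reduces to A = S Lambda ST.)

```python
import numpy as np

# Illustrative symmetric matrix and data vector (not from the notes).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Eigendecomposition A = S Lam S^T; S is orthogonal because A is symmetric.
lam, S = np.linalg.eigh(A)

# Inverse as a sum of rank-one terms: A^{-1} = sum_i (1/lam_i) s_i s_i^T.
A_inv = sum((1.0 / lam[i]) * np.outer(S[:, i], S[:, i]) for i in range(len(lam)))
assert np.allclose(A_inv, np.linalg.inv(A))

# Least squares solution as a weighted sum of eigenvectors:
# m_LS = sum_i (s_i^T b / lam_i) s_i.
m_ls = sum((S[:, i] @ b / lam[i]) * S[:, i] for i in range(len(lam)))
assert np.allclose(m_ls, np.linalg.solve(A, b))
```

Each coefficient (s_i^T b) / lam_i makes explicit why a zero eigenvalue destroys the inverse and why a very small one makes the solution unstable: the projection of the data onto that eigenvector is divided by the eigenvalue.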
That is, written out in full,

A^-1 = [s11^2/lambda_1 + s12^2/lambda_2, s11 s21/lambda_1 + s12 s22/lambda_2; s21 s11/lambda_1 + s22 s12/lambda_2, s21^2/lambda_1 + s22^2/lambda_2]    (5.110)

This has the form of the sum of two matrices, one with a 1/lambda_1 coefficient and the other with a 1/lambda_2 coefficient:

A^-1 = (1/lambda_1) [s11; s21][s11 s21] + (1/lambda_2) [s12; s22][s12 s22] = (1/lambda_1) s1 s1T + (1/lambda_2) s2 s2T    (5.111)

In general, for a square symmetric matrix,

A^-1 = (1/lambda_1) s1 s1T + (1/lambda_2) s2 s2T + ... + (1/lambda_M) sM sMT    (5.112)

and

A = lambda_1 s1 s1T + lambda_2 s2 s2T + ... + lambda_M sM sMT    (5.113)

where si is the ith eigenvector of A and lambda_i the corresponding eigenvalue. Now let's finish forming mLS for the simple 2 x 2 case:

mLS = A^-1 b = (1/lambda_1) s1 s1T b + (1/lambda_2) s2 s2T b = (1/lambda_1)(s11 b1 + s21 b2) s1 + (1/lambda_2)(s12 b1 + s22 b2) s2    (5.114)

Or, in general,

mLS = sum over i = 1, ..., M of (1/lambda_i)(siT b) si = sum of ci si    (5.115)

where siT b = si . b is the projection of the data in the si direction and the ci = (siT b)/lambda_i are constants. In this form we can see that the least squares solution of Am = b is composed of a weighted sum of the eigenvectors of A. The coefficient of each eigenvector is a constant composed of two parts: one is the inverse of the corresponding eigenvalue, and the second is the projection of the data in the direction of the corresponding eigenvector. This form also clearly shows why there is no inverse if one or more of the eigenvalues of A is zero, and why very small eigenvalues can make mLS unstable. It also suggests how we might handle the case where A has one or more zero eigenvalues.

5.5.2 The Case of Zero Eigenvalues

As we saw in Section 5.4.5, we can order the eigenvalues from largest to smallest in absolute value, as long as the associated eigenvectors are ordered in the same way. Then we saw the remarkable result in Equation (5.92) that the matrix A can be completely reconstructed using just the nonzero eigenvalues and eigenvectors:

|lambda_1| >= |lambda_2| >= ... >= |lambda_P| > 0 and lambda_P+1 = ... = lambda_M = 0    (5.116)

That is,

A = SP LambdaP SPT = lambda_1 s1 s1T + ... + lambda_P sP sPT    (5.117)

This suggests that perhaps we can construct A^-1 similarly, using just the nonzero eigenvalues and their corresponding eigenvectors. Let's try

A~^-1 = SP LambdaP^-1 SPT    (5.118)

which must at least exist. But note that

A A~^-1 = SP LambdaP SPT SP LambdaP^-1 SPT    (5.119)

Note that SPT SP = IP because A is
symmetric, and hence the eigenvectors are orthogonal, and LambdaP LambdaP^-1 = IP, so

A A~^-1 = SP SPT, which is not IM [see (5.94)]    (5.120)

Like the case for A, we cannot write a true inverse for A using just the P nonzero eigenvectors; the true inverse does not exist. But we can use A~^-1 as an "approximate" inverse for A. Thus, in the case when A^-1 does not exist, we can use

m~ = A~^-1 b = SP LambdaP^-1 SPT b    (5.121)

5.5.3 Simple Tomography Problem Revisited

The concepts developed in the previous section are best understood in the context of an example problem. Recall the simple tomography problem: four cells numbered 1, 2 (top row) and 3, 4 (bottom row), with ray paths across the two rows (data d1, d2) and down the two columns (data d3, d4), so that

G = [1 1 0 0; 0 0 1 1; 1 0 1 0; 0 1 0 1]    (5.122)

A = GT G = [2 1 1 0; 1 2 0 1; 1 0 2 1; 0 1 1 2]    (5.123)

lambda = 4, 2, 2, 0, the eigenvalues of A    (5.124)

S = [s1 s2 s3 s4] = [0.5 0.5 0.5 0.5; 0.5 -0.5 0.5 -0.5; 0.5 0.5 -0.5 -0.5; 0.5 -0.5 -0.5 0.5], the eigenvectors of A    (5.125)

A = S Lambda ST    (5.126)

where

Lambda = diag(4, 2, 2, 0)    (5.127)

Let's look at the "patterns" in the eigenvectors si. [Figure: the four eigenvector patterns plotted on the 2 x 2 grid of cells: lambda_1 s1 is the uniform "dc" pattern, lambda_2 s2 the "L-R" (left-right) pattern, lambda_3 s3 the "T-B" (top-bottom) pattern, and lambda_4 s4 the "checkerboard" pattern.] These "patterns" are the fundamental building blocks (basis functions) of all solutions in this four-space. Do you see why the eigenvalues correspond to their associated eigenvector patterns? Explain this in terms of the sampling of the four paths shooting through the medium. What other paths must we sample in order to make the zero eigenvalue nonzero?

Now let's consider an actual model and the corresponding noise-free data. The model is

m = [1 0 0 0]T    (5.128)

or, graphically, a 2 x 2 grid with 1 in the upper-left cell and 0 elsewhere. The data are

d = Gm = [1 0 1 0]T and b = GT d = [2 1 1 0]T    (5.129)

First, A^-1 does not exist. But A~^-1 = SP LambdaP^-1 SPT does, where SP = [s1 s2 s3] and

LambdaP^-1 = diag(1/4, 1/2, 1/2)    (5.130)

Then

mLS = A~^-1 GT d    (5.131)

We now have all the parts necessary to construct this solution. Let's use the alternate form of the solution,

mLS = (1/lambda_1)(s1T b) s1 + (1/lambda_2)(s2T b) s2 + (1/lambda_3)(s3T b) s3

with s1 = [0.5 0.5 0.5 0.5]T, s2 = [0.5 -0.5 0.5 -0.5]T, s3 = [0.5 0.5 -0.5 -0.5]T, and note that s4T b = 0 for s4 = [0.5 -0.5 -0.5 0.5]T. The data
have no projection in the "null space" s4:

s4T b = 0.5 (2 - 1 - 1 + 0) = 0    (5.132)

The geometry of the experiment provides no constraints on the "checkerboard" pattern. How would we change the experiment to remedy this problem? Continuing the construction of the solution,

mLS = (2/4) s1 + (1/2) s2 + (1/2) s3    (5.133)

= [0.75 0.25 0.25 -0.25]T    (5.134)

Graphically, this is the 2 x 2 grid with 0.75, 0.25 in the top row and 0.25, -0.25 in the bottom row. Does this solution fit the data? What do we need to add to this solution to get the true solution? The difference between the true model and mLS is [0.25 -0.25 -0.25 0.25]T, a scaled version of the checkerboard pattern. This is the eigenvector associated with the zero eigenvalue and represents the "null" space. Any scaled value of this pattern can be added and the data will not change; the data are "blind" to this pattern. Thus the general solution is

m = mLS + const x s4 = [0.75 0.25 0.25 -0.25]T + const x [0.5 -0.5 -0.5 0.5]T    (5.135)

Summary

We now have all the tools to solve inverse problems, even those with zero eigenvalues. There are many real inverse problems that have been tackled using just what we have learned so far. In addition to truncating zero eigenvalues, the truncation method can be used to remove the effects of small eigenvalues. As you might expect, there is a "cost," or tradeoff, in truncating nonzero eigenvalues. Before we get into techniques to evaluate such tradeoffs, we first turn to the generalization of the eigenvalue-eigenvector concepts to nonsquare matrices.

CHAPTER 6: SINGULAR-VALUE DECOMPOSITION (SVD)

6.1 Introduction

Having finished with the eigenvalue problem for Ax = b, where A is square, we now turn our attention to the general N x M case, Gm = d. First, the eigenvalue problem per se does not exist for Gm = d unless N = M. This is because G maps (transforms) a vector m from M-space into a vector d in N-space. The concept of "parallel" breaks down when the vectors lie in different dimensional spaces. Since the eigenvalue problem is not defined for G, we will try
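(Aside: the truncated-eigenvalue solution of the Chapter 5 tomography problem can be reproduced numerically. The matrices below are exactly those of Section 5.5.3; keeping only the eigenvectors with nonzero eigenvalues recovers mLS.)

```python
import numpy as np

# The simple 2x2-cell tomography problem of Section 5.5.3.
G = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
m_true = np.array([1.0, 0.0, 0.0, 0.0])
d = G @ m_true                 # noise-free data [1, 0, 1, 0]
A = G.T @ G
b = G.T @ d                    # b = [2, 1, 1, 0]

lam, S = np.linalg.eigh(A)     # eigenvalues 0, 2, 2, 4 (ascending order)

# Truncated expansion: sum (s_i^T b / lam_i) s_i over the nonzero eigenvalues only.
tol = 1e-10
m_ls = np.zeros(4)
for i in range(4):
    if lam[i] > tol:
        m_ls += (S[:, i] @ b / lam[i]) * S[:, i]

print(m_ls)        # approximately [0.75, 0.25, 0.25, -0.25]
print(G @ m_ls)    # fits the data exactly: [1, 0, 1, 0]
```

Note that the result does not depend on which orthonormal basis eigh chooses for the repeated eigenvalue 2, because the sum over an eigenspace's rank-one projectors is basis independent. Adding any multiple of the checkerboard eigenvector to m_ls leaves G @ m_ls unchanged.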
to construct a square matrix that includes G and as it will turn out GT for which the eigenvalue problem is de ned This eigenvalue problem will lead us to singular value decomposition SVD a way to decompose G into the product of three matrices two eigenvector matrices V and U associated with model and data spaces respectively and a singularvalue matrix very similar to A from the eigenvalue problem for A Finally it will lead us to the generalized inverse operator de ned in a way that is analogous to the inverse matrix to A found using eigenvalueeigenvector analysis The end result of SVD is G UP AP v NXM NXP PXP PXM 61 where Up are the P N dimensional eigenvectors of GGT VP are the P M dimensional eigenvectors of GTG and AP is the P gtltP diagonal matrix with P singular values positive square roots of the nonzero eigenvalues shared by GGT and GTG on the diagonal 62 Formation of a New Matrix B 621 Formulating the Eigenvalue Problem With G The way to construct an eigenvalue problem that includes G is to form a square N AI X N AI matrix B partitioned as follows 62 lltN gtHltM gtl 113 Geosciences 567 Chapter 6 RMRGZ B is Hermitz39an because BT l l Gd Note for example B1N3 G13 and BN31 GT31 G13 etc 622 The Role of GT as an Operator Analogous to Equation 113 we can de ne an equation for GT as follows GT y c MXN Ngtltl Mgtltl 63 64a 64b 65 We do not have to have a particular y and c in mind when we do this We are simply interested in the mapping of an N dimensional vector into an M dimensional vector by GT We can combine Gm d and GTy c using B as 01 B z b NMgtltNM NMgtlt1NMgtlt1 where we have y z m and d b c Note that z and b are both N M X 1 column vectors 114 66 67 68 69 Geosciences 567 Chapter 6 RMRGZ 63 The Eigenvalue Problem for B The eigenvalue problem for the N AI X N M matrix B is given by Bwiniwi 11 2NM 610 631 Properties of B The matrix B is Hermitian Therefore allN M eigenvalues 711 are real In preparation for solving the eigenvalue problem we de ne the 
eigenvector matrix W for B as follows W w1 w2 WNM 611 NMgtltNM 39 39 We note that W is an orthogonal matrix and thus VVTW WWT INM 612 This is equivalent to w w 61J 613 where wi is the 1th eigenvector in W 632 Partitioning W Each eigenvector wl is N M X 1 Consider partitioning wl such that 614 Ilt gtIlt 2 gt That is we stack an Ndimensional vector ul and an M dimensional vector v into a single N AIdimensiona1 vector Then the eigenvalue problem for B from Equation 610 becomes 115 Geosciences 567 Chapter 6 RMRGZ 0 G 11 ul 615 277139 GT 0 v v This can be written as G vi int il2NM 616 NgtltM Mgtlt1 Ngtlt1 and GT 11 nivi 11 2NM 617 MXN Ngtlt1 Mgtlt1 Equations 616 and 617 together are called the shifted eigenvalue problem for G It is not an eigenvalue problem for G since G is not square and eigenvalue problems are only de ned for square matrices Still it is analogous to an eigenvalue problem Note that G operates on an M dimensional vector and returns an N dimensional vector GT operates on an N dimensional vector and returns anM dimensional vector Furthermore the vectors are shared by G and GT 64 Solving the Shifted Eigenvalue Problem Equations 616 and 617 can be solved by combining them into two related eigenvalue problems involving GTG and GGT respectively 641 The Eigenvalue Problem for GTG Eigenvalue problems are only de ned for square matrices Note then that GTG is M XM and hence has an eigenvalue problem The procedure is as follows Starting with Equation 6 17 GTul nivl 617 Multiply both sides by 711 mGTuZ 71 V1 618 01 116 Geosciences 567 Chapter 6 RMRGZ GTOlilli 77 Vi 619 BUB by Equation 616 we have GVZ39 quot1quot 616 Thus GTGVi nfvi 239 1 2 M 620 This is just the eigenvalue problem for GTG We were able to manipulate the shifted eigenvalue problem into an eigenvalue problem that presumably we can solve We make the following notes GTG is Hermitian N N GTGj 26ng Egggig 621 kil kil N N GTGLi 2GTjka Eggg1 GTGj 622 kl kl Therefore allM 71 are real Because the diagonal 
entries of GTG are all 2 0 then all 71 are also 2 0 This means that GTG is positive semide nite one de nition of which is simply that all the eigenvalues are real and 2 0 We can combine the M equations implied by Equation 620 into matrix notation as GTG V V M 623 MXMMXM MXMMXM where V is defined as follows V V1 V2 VM 624 MXM and 117 Geosciences 567 Chapter 6 RMRGZ 7112 0 0 0 11 E M 625 0 0 0 nil M X M 3 Because GTG is a Hermitian matrix V is itself an orthogonal matrix VVT VTV 1M 626 642 The Eigenvalue Problem for GGT The procedure for forming the eigenvalue problem for GGT is very analogous to that of GTG We note that GGT is N gtltN Starting with Equation 616 le mui 616 Again multiply by 711 Gmvi n i 627 But by Equation 617 we have GTui nivi 617 Thus GGTui nfui 139 1 2 N 628 We make the following notes for this eigenvalue problem 1 GGT is Hermitian 2 GGT is positive semide nite 3 Combining the N equations in Equation 628 we have GGTU UN 629 where U 111 Hz uN 630 N X N 118 Geosciences 567 Chapter 6 RMRGZ and n1 0 0 0 11 I N 0 0 0 111 N X N 4 U is an orthogonal matrix UTU UUT IN 65 How Many 7 Are There Anyway 631 632 A careful look at Equations 610 620 and 628 shows that the eigenvalue problems for B GTG and GGT are de ned for N AI M and N values of 139 respectively Just how many 71 are there 651 Introducing P the Number of Nonzero Pairs 1713 nl Equation 610 BWz39 Uin39 can be used to determine N AI real 71139 Equation 620 GTGVZ39 71 V1 can be used to determineM real 71 since GTG is M XM Equation 628 GGTllj 71 11139 can be used to determine N real 71 since GGT is N X N This section will convince you I hope that the following are true 1 There are P pairs of nonzero 71139 where each pair consists of 711 7711 2 If H1 is an eigenvalue of BWz39 Uin39 and the associated eigenvector W is given by 119 610 620 628 610 633 Geosciences 567 Chapter 6 RMRGZ then the eigenvector associated with 7711 is given by 3 There are N M 7 2P zero 711 634 4 You can know everything you need 
to know about the shifted eigenvalue problem by retaining only the information associated with the positive 71139 5 P is less than or equal to the minimum of N and M P S minN M 652 Finding the Eigenvector Associated With 171 Suppose that you have found W a solution to Equation 610 associated with 711 satis es the shifted eigenvalue problem GVi Willi and GTui Uin Let us try in as an eigenvalue and w given by and see if it satis es Equations 616 and 617 GVi TliX ui Willi and GTle39 TliVi 0r GTllz39 71in39 635 It also 616 617 634 636 637a 637b From this we conclude that the nonzero eigenvalues of B come in pairs The relationship between the solutions is given in Equations 633 and 634 Note that this property of paired eigenvalues and eigenvectors is not the case for the general eigenvalue problem It results from the symmetry of the shi ed eigenvalue problem 120 Geosciences 567 Chapter 6 RMRGZ 653 No New Information From the ni System Let us form an ordered eigenvalue matrix D for B given by gt771 0 0 0 772 UP 771 D 772 638 UP 0 0 0 NMgtltNM where 711 2 712 2 2 Up Note that the ordering of matrices in eigenvalue problems is arbitrary but must be internally consistent Then the eigenvalue problem for B from Equation 610 becomes BW WD 639 where now the N A1 X N AI dimensional matrix W is given by 111 112 P 1 2 11P ll2P1 uNM W 640 V1 V2 VP V1 V2 VP V2P1 VNM llt P gt l llt P gtl lltNM 7 2Pgtl The second P eigenvectors certainly contain independent information about the eigenvectors wl in N AIspace They contain no new information however about 11139 or Vi in N and M space respectively since 711139 contains no information not already contained in u139 654 What About the Zero Eigenvalues 173 s i 2P 1 N M For the zero eigenvalues the shifted eigenvalue problem becomes 121 Geosciences 567 Chapter 6 RMRGZ GVZ39 711111 011139 0 i2P 1 NM 641a NXl and GTIIZ39 mVZ39 OVZ39 0 i 2P 1 NM 641b MXl where 0 is a vector of zeros of the appropriate dimension If you premultiply Equation 641a by 
GT and Equation 641b by G you obtain GTGviGT0 0 642 MX 1 and GGTui G0 0 643 NX 1 Therefore we conclude that the 111 V1 associated with zero 71 for B are simply the eigenvectors of GGT and GTG associated with zero eigenvalues for GGT and GTG respectively 655 How Big is P Now that we have seen that the eigenvalues come in P pairs of nonzero values how can we determine the size of P We will see that you can determine P from either GTG or GGT and that P is bounded by the smaller of N and AI the number of observations and model parameters respectively The steps are as follows Step 1 Let the number of nonzero eigenvalues 71 of GTG be P Since GTG is M X AI there are onlyM 71 all together Thus P is less than or equal to M Step 2 If 71 0 is an eigenvalue of GTG then it is also an eigenvalue of GGT since GTle 713 v 620 and GGTuZ 71 ul 628 Thus the nonzero m s are shared by GTG and GGT Step 3 P is less than or equal to N since GGT is N gtltN Therefore since P SM andP SN P S minN AI 122 Geosciences 567 Chapter 6 RMRGZ Thus to determine P you can do the eigenvalue problem for either GTG M X M or GGT N XN It makes sense to choose the smaller of the two matrices That is one chooses GTG if M ltN or GGT ifNltM 66 Introducing Singular Values 661 Introduction Recalling Equation 629 de ning the eigenvalue problem for GGT GGT U U N NXN NXN NXN NXN The matrix U contains the eigenvectors Hi and can be ordered as U 111 2 P uP1 N N NXN 39 U l P gtHlt NiP gtl 01 UUp U0 Recall Equation 623 which de ned the eigenvalue problem for GTG GTG V V M MXM MXM MXM MXM The matrix V of eigenvectors is given by W V V1 V2 VP VP1 VMM MXM j U l P gtHlt MiP gtl 01 vvP v0 where the Hi V1 satisfy GVi71i11i and 123 629 644 645 623 646 647 616 Geosciences 567 Chapter 6 RMRGZ GTuz 71m 617 and where we have chosen the P positive 71139 from GTle 713v 620 GGTuZ n u 628 Note that it is customary to order the 11139 Vj such that 7112 712 2 2 UP 648 662 De nition of the Singular Value We de ne a singular value from 
Equation 620 or 628 as the positive square root of the eigenvalue 71 for GTG or GGT That is l an 649 Singular values are not eigenvalues is not an eigenvalue for G or GT since the eigenvalue problem is not de ned for G or GT N i M They are of course eigenvalues for B in Equation 610 but we will never explicitly deal with B The mat1ix B is a construct that allowed us to formulate the shi ed eigenvalue problem but in practice it is never formed NeveItheless you will often read or hear A referred to as an eigenvalue 663 De nition of A the SingularValue Matrix We can form anN XM matrix with the singular values on the diagonal If M gt N it has the form 11 0 0 0 A2 A 1P 0 N NXM 0 65021 0 0 Q R N gtHltM Ngt IfN gt M it has the form next page 124 Geosciences 567 Chapter 6 RMRGZ 11 o 0 1r 0 12 AP M 0 A N XM 0 0 Q E 0 N M Q l M gtl Then the shifted eigenvalue problem GVi Willi and GTllz39 71in39 can be written as GVi 1111i and GTllz39 11W 650b 616 617 651 652 where 71139 has been replaced by A since all information about U V can be obtained from the positive 71139 Equations 651 and 652 can be written in matrix notation as G V U A NgtltM MgtltM NgtltN NgtltM and GT U V AT MgtltN NgtltN MgtltM MgtltN 125 653 654 Geosciences 567 Chapter 6 RMRGZ 67 Derivation of the Fundamental Decomposition Theorem for General G NX M N1VI Recall that we used the eigenvalue problem for square A and AT to derive a decomposition theorem for square matrices A SART 574 where S R and A are eigenvector and eigenvalue matrices associated with A and AT We are now ready to derive an analogous decomposition theorem for the general N XM N M matrix G We start with Equation 653 G V U A 653 NXM MXM NXN NXM postmultiply by VT GVVT UAVT 655 But V is an orthogonal matrix That is VVT 1M 626 since GTG is Hermitian and the eigenvector matrices of Hermitian matrices are orthogonal Therefore we have the fundamental decomposition theorem for a general matrix given by G U A vT 656 NXM NXN NXM MXM By taking the 
transpose of Equation 656 we obtain also GT 2 V AT UT 657 MXN MXM MXN NXN where U 111 up up1 uN UPI U0 658 NXN 39 and 126 Geosciences 567 Chapter 6 RMRGZ W VPI V0 V V1 VP VP1 MXM 39 39 and 11 0 0 0 12 A 1P NXM 0 0 0 68 SingularValue Decomposition SVD 681 Derivation of SingularValue Decomposition 659 660 We will see below that G can be decomposed without any knowledge of the parts of U or V associated with zero singular values 11 139 gt F We start with the fundamental decomposition theorem G UAVT Let us introduce aP X P singularvalue matrix AP that is a subset of A 11 0 0 0 A AP 0 0 0 AP We now write out Equation 656 in terms of the partitioned matrices as E 2 AP 0 V U GN U U P 0 N P 0 0 VoT Q Q l P gtHltN7Pzgtl llt P gtHltM7PgtHlt M gt 127 gtlt quotU gt 656 661 662 Geosciences 567 Chapter 6 RMRGZ H T VP N UPAP 0 VT 663 U 0 llt P gtHlt M7Pgtl UPAPV 664 That is we can write G as G UP AP V 665 NXM NXP PXP PXM Equation 665 is known as the Singular Value Decomposition Theorem for G The matrices in Equation 665 are l G an arbitraryN XM matrix 2 The eigenvector matrix U UP 111 Hz up 666 where 11139 are the P N dimensional eigenvectors of GGTul n 111 628 associated with nonzero singular values 3 The eigenvector matrix V VP V1 V2 VP 667 where Vj are the P M dimensional eigenvectors of GTGVZ 713 V1 620 associated with nonzero singular values 11 and 128 Geosciences 567 Chapter 6 RMRGZ 4 The singularvalue matrix A 11 0 0 0 i A P A 661 a 0 0 0 AP where is the nonzero singular value associated with 111 and V1 139 l P 682 Rewriting the Shifted Eigenvalue Problem Now that we have seen that we can reconstruct G using only the subsets of U V and A de ned in Equations 661 666 and 667 we can rewrite the shifted eigenvalue problem given by Equations 653 and 654 G v U A 653 NXM MXM NXN NXM and GT U v AT 654 MXN NXN MXM MXN as 1 G vp Up AP 662 NXM MXP NXP PgtltP 2 GT Up vp Ap 663 MXN NXP MXP PgtltP 3 G v0 0 664 NXM MgtltM7P NgtltM7P 4 GT U0 0 665 MXN NgtltN7P MgtltN7P Note that the 
eigenvectors in V are a set of M orthogonal vectors which span model space while the eigenvectors in U are a set of N orthogonal vectors which span data space The P vectors in VP span aPdimensional subset of model space while the P vectors in Up span a Pdimensional subset of data space V0 and U0 are called null or zero spaces They are M 7 P and N 7 P dimensional subsets of model and data spaces respectively 683 Summarizing SVD In summary we started with Equations 113 and 65 Gm d 113 129 Geosciences 567 Chapter 6 RMRGZ and GTy c 65 We constructed 62 llt N gtHltM gtl We then considered the eigenvalue problem for B Bwimwi i12NM 610 This led us to the shifted eigenvalue problem le mul i 1 2 N M 616 and GTul my i 1 2 N M 617 We found that the shifted eigenvalue problem leads us to eigenvalue problems for GTG and GGT GTGVZ 713 v i 12 M 620 and GGTui i ui i 12 N 628 We then introduced the singular value A given by the positive square root of the eigenvalues from Equations 620 and 628 A Vn 649 Equations 616 617 620 and 628 give us a way to nd U V and A They also lead eventually to G U A VT 656 NXM NXN NXMMXM We then considered pa1titioning the mat1ices based on P the number of nonzero singular values This led us to singular value decomposition G Up AP VPT 665 NXM NXP PgtltP PXM Before considering an inverse operator based on singularvalue decomposition it is probably useful to cover the mechanics of singularvalue decomposition 130 Geosciences 567 Chapter 6 RMRGZ 69 Mechanics of SingularValue Decomposition The steps involved in singularvalue decomposition are as follows Step 1 Begin with Gm 1 Form GTG M X M or GGT N X N whichever is smaller NB Typically there are more observations than model parameters thus N gt 111 and GTG is the more common choice Step 2 Solve the eigenvalue problem for Hermitian GTG or GGT GTGVZ 713 V 620 1 Find the P nonzero 71 and associated Vi 2 Let L 01 12 3 Form VP V1 V2 VP 667 MXP and 11 0 0 0 3 AP 1 A 0 112122 21 661 0 0 AP PXP Step 3 Use GVZ39 
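(Aside: the full procedure of this section, Steps 1 through 5, can be sketched numerically with numpy. The G below is illustrative, not from the notes: it keeps three rows of the Chapter 5 tomography matrix, so N < M and GGT is the smaller matrix to start from. The eigenvalue problem for GGT gives the u_i, and GT u_i = lambda_i v_i then fixes the v_i with consistent signs.)

```python
import numpy as np

# Illustrative non-square G (N = 3 rows, M = 4 columns).
G = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])
N, M = G.shape

# Steps 1-2: eigenvalue problem for the smaller of G^T G (MxM) and G G^T (NxN).
eta, U = np.linalg.eigh(G @ G.T)      # here N < M, so use G G^T
order = np.argsort(eta)[::-1]         # order eigenvalues largest to smallest
eta, U = eta[order], U[:, order]
P = int(np.sum(eta > 1e-10))          # number of nonzero singular values
Up = U[:, :P]
Lp = np.sqrt(eta[:P])                 # singular values: lambda_i = sqrt(eta_i)

# Step 3: G^T u_i = lambda_i v_i preserves the sign relationship between u_i, v_i.
Vp = (G.T @ Up) / Lp

# Step 5: reconstruct G = U_P Lambda_P V_P^T.
assert np.allclose(G, Up @ np.diag(Lp) @ Vp.T)

# Cross-check the singular values against numpy's own SVD.
assert np.allclose(Lp, np.linalg.svd(G, compute_uv=False)[:P])
```

Deriving the second eigenvector set from the first (rather than solving both eigenvalue problems independently) is what keeps the signs of the u_i and v_i mutually consistent, exactly as Step 3 of the mechanics warns.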
u_i = (1/λ_i) G v_i to find u_i for each known λ_i, v_i. Note: finding u_i this way preserves the sign relationship implicit between u_i and v_i by taking the positive member of each pair (λ_i, −λ_i). You will not preserve the sign relationship, except by luck, if you use G G^T u_i = η_i u_i to find u_i.

Step 4: Form Λ_P from the P nonzero singular values.

Step 5: Finally, form G as

G = U_P Λ_P V_P^T    (6.65)
(N×M) = (N×P)(P×P)(P×M)

6.10 Implications of Singular-Value Decomposition

6.10.1 Relationships Between U, U_P, and U_0

1. U^T U = U U^T = I_N (all N×N). Since U is an orthogonal matrix.
2. U_P^T U_P = I_P, (P×N)(N×P). U_P is semiorthogonal, because all P vectors in U_P are perpendicular to each other.
3. U_P U_P^T ≠ I_N unless P = N, (N×P)(P×N). U_P U_P^T is N×N, but U_P cannot span N-space with only P N-dimensional vectors.
4. U_0^T U_0 = I_{N−P}, ((N−P)×N)(N×(N−P)). U_0 is semiorthogonal, since the N−P vectors in U_0 are all perpendicular to each other.
5. U_0 U_0^T ≠ I_N, (N×(N−P))((N−P)×N). U_0 has N−P N-dimensional vectors in it; it cannot span N-space. U_0 U_0^T is N×N.
6. U_P^T U_0 = 0, (P×N)(N×(N−P)) = P×(N−P). Since all the eigenvectors in U_P are perpendicular to all the eigenvectors in U_0.
7. U_0^T U_P = 0, ((N−P)×N)(N×P) = (N−P)×P. Again, since all the eigenvectors in U_0 are perpendicular to all the eigenvectors in U_P.

6.10.2 Relationships Between V, V_P, and V_0

1. V^T V = V V^T = I_M (all M×M). Since V is an orthogonal matrix.
2. V_P^T V_P = I_P, (P×M)(M×P). V_P is semiorthogonal, because all P vectors in V_P are perpendicular to each other.
3. V_P V_P^T ≠ I_M unless P = M, (M×P)(P×M). V_P V_P^T is M×M, but V_P cannot span M-space with only P M-dimensional vectors.
4. V_0^T V_0 = I_{M−P}, ((M−P)×M)(M×(M−P)). V_0 is semiorthogonal, since the M−P vectors in V_0 are all perpendicular to each other.
5. V_0 V_0^T ≠ I_M, (M×(M−P))((M−P)×M). V_0 has M−P M-dimensional vectors in it; it cannot span M-space. V_0 V_0^T is M×M.
6. V_P^T V_0 = 0, (P×M)(M×(M−P)) = P×(M−P). Since all the eigenvectors in V_P are perpendicular to all the eigenvectors in V_0.
7. V_0^T V_P = 0, ((M−P)×M)(M×P) = (M−P)×P. Again, since all the eigenvectors in V_0 are perpendicular to all the eigenvectors in V_P.

6.10.3 Graphic Representation of U, U_P, U_0 and V, V_P, V_0 Spaces

Recall that starting with Equation (1.13),

G m = d    (1.13)
(N×M)(M×1) = (N×1)

we obtained the fundamental decomposition theorem

G = U Λ V^T    (6.56)
(N×M) = (N×N)(N×M)(M×M)

and singular-value decomposition

G = U_P Λ_P V_P^T    (6.65)
(N×M) = (N×P)(P×P)(P×M)

This gives us the following.

1. Recall the definitions of U, U_P, and U_0:

U = [u_1 … u_N] = [U_P | U_0]    (6.58)

2. Similarly, recall the definitions of V, V_P, and V_0:

V = [v_1 … v_M] = [V_P | V_0]    (6.59)
(M×M)

3. Combining U = [U_P | U_0] and V = [V_P | V_0] graphically (6.68): U is N×N, partitioned into U_P (the first P columns) and U_0 (the remaining N−P columns); V is M×M, partitioned into V_P (the first P columns) and V_0 (the remaining M−P columns).

4. Summarizing:

1. V is an M×M matrix with the eigenvectors of G^T G as columns. It is an orthogonal matrix.
2. V_P is an M×P matrix with the P eigenvectors of G^T G associated with the nonzero eigenvalues of G^T G. V_P is a semiorthogonal matrix.
3. V_0 is an M×(M−P) matrix with the M−P eigenvectors of G^T G associated with the zero eigenvalues of G^T G. V_0 is a semiorthogonal matrix.
4. The eigenvectors in V, V_P, or V_0 are all M-dimensional vectors. They are all associated with model space, since m, the model parameter vector of G m = d, is an M-dimensional vector.
5. U is an N×N matrix with the eigenvectors of G G^T as columns. It is an orthogonal matrix.
6. U_P is an N×P matrix with the P eigenvectors of G G^T associated with the nonzero eigenvalues of G G^T. U_P is a semiorthogonal matrix.
7. U_0 is an N×(N−P) matrix with the N−P eigenvectors of G G^T associated with the zero eigenvalues of G G^T. U_0 is a semiorthogonal matrix.
8. The eigenvectors in U, U_P, or U_0 are all N-dimensional vectors. They are all associated with data space, since d, the data vector of G m = d, is an N-dimensional vector.

6.11 Classification of d = G m Based on P, M, and N

6.11.1 Introduction

In Section 3.3 we introduced a classification of the system of equations d = G m (1.13) based on the dimensions of d (N×1) and m (M×1). I said at the time that I found the classification lacking and would return to it later. Now that we have considered singular-value decomposition, including finding P, the number of nonzero singular values, a better classification scheme can be introduced.
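Before classifying problems by P, M, and N, it helps to see these quantities concretely. The sketch below, with an arbitrary illustrative 3×2 kernel (not from the notes), uses numpy's compact SVD to recover U_P, Λ_P, V_P, count P, and check the semiorthogonality relationships of Sections 6.10.1–6.10.2.

```python
import numpy as np

# Illustrative 3x2 kernel (not from the notes): N = 3, M = 2, and here P = 2.
G = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])

# Compact SVD: numpy returns U_p (N x P), the singular values, and V_p^T.
Up, s, VpT = np.linalg.svd(G, full_matrices=False)
P = int(np.sum(s > 1e-12))          # number of nonzero singular values

# Semiorthogonality: U_p^T U_p = I_P and V_p V_p^T = I_P here,
# but U_p U_p^T != I_N when P < N (U_p cannot span N-space).
print(np.allclose(Up.T @ Up, np.eye(P)))        # True
print(np.allclose(Up @ Up.T, np.eye(3)))        # False, since P < N
print(np.allclose(Up @ np.diag(s) @ VpT, G))    # True: G = Up Lp Vp^T
```

Note that `full_matrices=False` returns only the U_P, V_P partitions; passing `full_matrices=True` also returns the U_0 (and V_0) eigenvectors as the trailing columns.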
There are four basic classes of problems, based on the relationship between P, M, and N. We will consider each class in turn below.

6.11.2 Class I: P = M = N

Graphically, for this case, U = U_P (N×N) and V = V_P (N×N), and:

1. U_0 and V_0 are empty.
2. G has a unique mathematical inverse, G^{-1}.
3. There is a unique solution for m.
4. The data can be fit exactly.

6.11.3 Class II: P = M < N

Graphically, for this case, U = [U_P | U_0] (N×N, with U_P N×P and U_0 N×(N−P)) and V = V_P (M×M), and:

1. V_0 is empty, since P = M; U_0 is not empty, since P < N.
2. G has no mathematical inverse.
3. There is a unique solution, in the sense that only one solution has the smallest misfit to the data.
4. The data cannot be fit exactly unless the compatibility equations are satisfied, which are defined as follows:

U_0^T d = 0    (6.69)
((N−P)×N)(N×1) = ((N−P)×1)

If the compatibility equations are satisfied, one can fit the data exactly. The compatibility equations are equivalent to saying that d has no projection onto U_0. Equation (6.69) can be thought of as the N−P dot products of the eigenvectors in U_0 with the data vector d. If all of the dot products are zero, then d has no component in the (N−P)-dimensional subset of N-space spanned by the vectors in U_0. G operating on any vector m can only predict a vector that lies in the P-dimensional subset of N-space spanned by the P eigenvectors in U_P. We will return to this later.

5. P = M < N is the classic least squares environment. We will consider least squares again when we introduce the generalized inverse.

6.11.4 Class III: P = N < M

Graphically, for this case, U = U_P (N×N) and V = [V_P | V_0] (M×M, with V_P M×P and V_0 M×(M−P)), and:

1. U_0 is empty, since P = N; V_0 is not empty, since P < M.
2. G has no mathematical inverse.
3. You can fit the data exactly, because U_0 is empty.
4. The solution is not unique. If m_est is a solution which fits the data exactly,

G m_est = d_pre = d    (6.70)

then

m = m_est + Σ_{i=P+1}^{M} α_i v_i    (6.71)

is also a solution, where α_i is any arbitrary constant.
5. P = N < M is the minimum length environment. The minimum length solution sets α_i, i = P+1, …, M, to zero.

6.11.5 Class IV: P < min(N, M)

Graphically, for this case, U = [U_P | U_0] and V = [V_P | V_0], and:

1. Neither U_0 nor V_0 is empty.
2. G has no mathematical inverse.
3. You cannot fit the data exactly unless the compatibility equations (Equation 6.69) are satisfied.
4. The solution is nonunique.

This sounds like a pretty bleak environment: no mathematical inverse, cannot fit the data, and the solution is nonunique. It probably comes as no surprise that most realistic problems are of this type, P < min(N, M).

In the next chapter we will introduce the generalized inverse operator. It will reduce to the unique mathematical inverse when P = M = N, to the least squares operator when P = M < N, and to the minimum length operator when P = N < M. It will also give us a solution in the general case, P < min(N, M), that has many of the properties of the least squares and minimum length solutions.

CHAPTER 7

THE GENERALIZED INVERSE AND MEASURES OF QUALITY

7.1 Introduction

Thus far we have used the shifted eigenvalue problem to do singular-value decomposition for the system of equations G m = d. That is, we have

G = U Λ V^T    (6.56)
(N×M) = (N×N)(N×M)(M×M)

and also

G = U_P Λ_P V_P^T    (6.65)
(N×M) = (N×P)(P×P)(P×M)

where U is an N×N orthogonal matrix whose ith column is given by the ith eigenvector u_i, which satisfies

G G^T u_i = η_i u_i    (6.28)

V is an M×M orthogonal matrix whose ith column is given by the ith eigenvector v_i, which satisfies

G^T G v_i = η_i v_i    (6.20)

and Λ is an N×M diagonal matrix with the singular values λ_i = η_i^{1/2} along the diagonal. U_P, Λ_P, and V_P are the subsets of U, Λ, and V, respectively, associated with the P nonzero singular values, P ≤ min(N, M).

We found four classes of problems for G m = d, based on P, N, and M:

Class I: P = N = M. The mathematical inverse G^{-1} exists.
Class II: P = M < N. Least squares. Recall m_LS = (G^T G)^{-1} G^T d.
Class III: P = N < M. Minimum length. Recall m_ML = <m> + G^T (G G^T)^{-1} [d − G<m>].
Class IV: P < min(N, M). At present we have no way of obtaining an m_est.
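The four-class scheme above can be expressed as a small helper. This is a sketch under our own naming (the `classify` function and its tolerance argument are not from the notes); the rank tolerance stands in for deciding which singular values count as nonzero when computing P.

```python
import numpy as np

def classify(G, tol=1e-10):
    """Sketch of the four-class scheme; helper name and tol are ours."""
    N, M = G.shape
    P = np.linalg.matrix_rank(G, tol=tol)   # number of nonzero singular values
    if P == N == M:
        return "Class I: unique mathematical inverse"
    if P == M < N:
        return "Class II: least squares"
    if P == N < M:
        return "Class III: minimum length"
    return "Class IV: generalized inverse required"

print(classify(np.eye(3)))                                 # Class I
print(classify(np.array([[1., 0.], [0., 1.], [1., 1.]])))  # Class II
print(classify(np.array([[1., 0., 1.], [0., 1., 1.]])))    # Class III
print(classify(np.ones((3, 3))))                           # Class IV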
have no way of obtaining an mast 138 Geosciences 567 Chapter 7 RMRGZ Thus in this chapter we seek an inverse operator that has the following properties 1 Reduces to G 1 when P N M 2 Reduces to GTG 1GT when P M lt N least squares 3 Reduces to GTGGT 1 whenP N ltM minimum length 4 Exists when P lt minN AI In the following pages we will consider each of these classes of problems beginning with P NM Class I In this case G U A VT 71 NgtltN NgtltN NgtltN NgtltN with 11 0 0 0 E A 39 A 72 a 0 0 0 AN Since P N M there are no zero singular values and we have 112122w21Ngt0 73 In order to nd an inverse operator based on Equation 71 we need to nd the inverse of a product of matrices Applying the results from Equation 28 to Equation 7 1 above gives G71 VT 1A 1 U71 74 We know A 1 exists and is given by 111 0 0 0 1 E A 1 A 75 a 0 0 0 llN N X N We now make use of the fact that both U and V are orthogonal matrices The properties of U and V that we wish to use are vT1 v 76 and U71 UT 77 139 Geosciences 567 Chapter 7 RMRGZ Therefore G 1V N1 UT PNM NXN NXN NXN NXN 78 Equation 78 implies that G l the mathematical inverse of G can be found using singularvalue decomposition when P N M What we need now is to nd an operator for the other three classes of problems that will reduce to the mathematical inverse G 1 when it exists 72 The Generalized Inverse Operator G g1 721 Background Information We start out with three pieces of information 1 G UAVT 670 2 G 1 VA IUT when G 1 exists 78 3 G UPAPV singularvalue decomposition 665 Then by analogy with de ning the inverse in Equation 78 above on the form of Equation 71 we introduce the generalized inverse operator 71 71 T Gg VP AP UP MXN MgtltP PgtltP PgtltN 79 and find out the consequences for our four cases Menke points out that there may be many generalized inverses but Equation 79 is by far the most common generalized inverse 722 Class I PNM In this case we have 1 VPVandUpU 2 V0 and U0 are empty We start with the definition of the generalized 
inverse operator in Equation 79 G VIDA U 79 140 Geosciences 567 Chapter 7 RMRGZ But since P M we have VP V 710 Similarly since P N we have Up U 711 Finally since P N M we have A3 A 1 712 Therefore combining Equations 7947 12 we recover Equation 78 G 1 VA lUT P N M 78 the exact mathematical inverse Thus we have shown that the generalized inverse operator reduces to the exact mathematical inverse in the case of P N M Next we consider the case of P M lt N 723 Class II P Mlt N This is the least squares environment where we have more observations than unknowns but where a unrque solutron ex1sts In this case we have 1 VP V 2 V0 is empty 3 U0 exists Ultimately we wish to show that the generalized inverse operator reduces to the least squares operator whenP M lt N The Role ofGTG Recall that the least squares operator as de ned in Equation 327 for example is given by mLs GTG 1GTd 327 We rst consider GTG using singularvalue decomposition for G obtaining GTG UpApVII TUPAPvT 713 Recall that the transpose of the product of matrices is given by the product of the transposes in the reverse order ABT BTAT l4l Geosciences 567 Chapter 7 RMRGZ Therefore GTG VII FA U UPAPV 01 GTG VPAPU UPAPV since A AP and V T VP We know however that Up is a semiorthogonal matrix Thus U Up 1P Therefore Equation 715 reduces to GTG VPAPAPV Now consider the product of AP with itself 21 0 0 21 0 0 0 2 2 0 2 39 AA P P 0 0 0 0 AP 0 0 AP 01 A 0 0 2 APAP A ZAla a 0 0 0 A where we have introduced the notation A12 for the product of AP with itself 714 715 716 717 718 719 720 721 Please be aware as noted before that the notation A2 when A is a matrix has no universally accepted definition We will use the definition implied by Equation 721 Finally then we write Equation 719 as GTG VP A VT MXM MXPPXP PX 142 722 Geosciences 567 Chapter 7 RMRGZ Finding the Inverse 0f GTG I claim that GTG has a mathematical inverse GTG 1 when P M The reason that GTG 1 exists in this case is that GTG has the following eigenvalue 
problem GTGVZ39 n v 139 1 M 620 where because P AI we know that allM 71 are nonzero That is GTG has no zero eigenvalues Thus it has a mathematical inverse Using the theorem presented earlier about the inverse of a product of matrices in Equations 28 we have GTGl 1 V13 PM WV 723 The inverse of VIT is found as follows First v VP 1P 724 is always true because VP is semiorthogonal But because we have thatP M in this case we also have vpvT IM 725 P Thus VP is itself an orthogonal matrix and we have VII 1 1 VP 726 and V1 v 727 Thus we can write Equation 723 as GTGr1 VFW 171v 728 Finally we note that A123 1 is given by 1212 0 0 AH1 A 3 Mg 729 i 0 123 Therefore GTGrl VIDA v 730 where A is as defined in Equation 729 143 Geosciences 567 Chapter 7 RMRGZ Equivalence of G g1 ana39 Least Squares When P M lt N We start with the least squares operator from Equation 328 for example GL3 GTG 1GT 328 We can use Equation 730 for GTG 1 in Equation 328 and singularvalue decomposition for GT to obtain GTGHGT VIDA v UpApVII T 731 VIDA v VPAPU 732 But because VP is semiorthogonal we have from Equation 724 v VP 1P 724 Thus Equation 732 becomes GTGHGT VIDA APU 733 Now considering Equations 720 and 721 we see that A AP A3 734 Finally then Equation 733 becomes GTGHGT VIDA U G gl 735 as required That is we have shown that when P M lt N the generalized inverse operator is equivalent to the least squares operator Geometrical Interpretation of C1 When P M lt N It is possible to gain some insight into the generalized inverse operator by considering a geometrical argument An arbitrary data vector 1 may have components in both Up and U0 spaces The generalized inverse operator retums a solution mg for which the predicted data lies completely in Up space and minimizes the mis t to the observed data The steps necessary to see this follow Step 1 Let mg be the generalized inverse solution given by mg G d 736 Step 2 Let d be the predicted data given by 144 Step 3 Geosciences 567 Chapter 7 RMRGZ d Gmg 737 We 
now introduce the following theorem about the relationship of the predicted data to Up space Theorem d lies completely in Upspace the subset of N space spanned by the P eigenvectors in Up Proof If d lies in Upspace it is orthogonal to UOspace That is U3 1 U3 Gmg U3 UPAPV3 mg 738 0 N 7P X 1 which follows from U3 Up 0 739 That is every eigenvector in U0 is perpendicular to every eigenvector in Up Another way to see this is that all of the eigenvectors in U are perpendicular to each other Thus any subset of U is perpendicular to the rest of U QED Let 1 7 d be the residual data vector ie observed 7 predicted data also known as the mis t vector given by d7dd7Gmg d 7 GG gl d d 7 CG 1 d 7 Upw MVPAM d d 7 UPU3 1 740 We cannot further reduce Equation 740 whenever P lt N because in this case UPU3 1N Next we introduce a theorem about the relationship between the mis t vector and U0 space 145 Geosciences 567 Chapter 7 RMRGZ Theorem Proof The mis t vector 1 7 d is orthogonal to Up U d 7 d 7 U d 7UPUIT d 7 T T T 7UPd7UPUPUPd 7 U 1 7 U 1 7 0 QED 741 Pgtlt1 The crucial step in going from the second to third lines being that U Up Ip since Up is semiorthogonal This implies that the mis t vector 1 7 1 lies completely in the space spanned by U0 Combining the results from the above two theorems we introduce the nal theorem of this section concerning the relationship between the predicted data and the mis t vector Theorem Proof The predicted data vector d is perpendicular to the mis t vector d 7 d l The predicted data vector d lies in Up space 2 The mis t vector 1 7 d lies in U0 space 3 Since the vectors in Up are perpendicular toAthe vectors in U0 1 is perpendicular to the mis t vector 1 7 d QED Step 4 Consider the following schematic graph showing the relationship between the various vectors and spaces U0 space 146 Geosciences 567 Chapter 7 RMRGZ Theorem Proof The mis t vector 1 7 d is orthogonal to Up U d 7 d 7 U d 7UPUIT d 7 T T T 7UPd7UPUPUPd 7 U 1 7 U 1 7 0 QED 741 Pgtlt1 The 
crucial step in going from the second to third lines being that U Up Ip since Up is semiorthogonal This implies that the mis t vector 1 7 1 lies completely in the space spanned by U0 Combining the results from the above two theorems we introduce the nal theorem of this section concerning the relationship between the predicted data and the mis t vector Theorem Proof The predicted data vector d is perpendicular to the mis t vector d 7 d l The predicted data vector d lies in Up space 2 The mis t vector 1 7 d lies in U0 space 3 Since the vectors in Up are perpendicular toAthe vectors in U0 1 is perpendicular to the mis t vector 1 7 d QED Step 4 Consider the following schematic graph showing the relationship between the various vectors and spaces U0 space 146 Step 5 Step 6 Geosciences 567 Chapter 7 RMRGZ The data vector 1 has components in both Up and U0 spaces Note the following points 1 The predicted data vector d Gmg lies completely in Up space 2 The residual vector 1 7 d lies completely in U0 space 3 The shortest distancefrom the observed data 1 to the Up aXis is given by the mis t vector 1 7 d Thus the generalized inverse G 1 minimizes the distance between the observed data vector 1 and Up the subset of data space in which all possible predicted data 1 must lie Recall that the least squares operator given in Equation 328 Gils GTGHGT 328 minimizes the length of the mis t vector Thus the generalized inverse operator is equivalent to the least squares operator when P M lt N ForP M lt N it is possible to write the generalized inverse without forming Up To see this note that the generalized inverse is equivalent to least squares forP M ltN That is G gl GTGHGT 742 But by Equation 728 GTG 1 is given by GTGr1 VIDA v 728 Thus the generalized inverse in this case is given by G GTGHGT VpA v GT 743 Equation 743 shows that the generalized inverse can be found without ever forming Up when P M ltN In general this shortcut is not used even though you can form the inverse because 
there is useful information lost about data space Finally recall the compatibility equations given by Ugd 0 NiPgtlt1 669 Note that if the observed data 1 has any projection in U0 space is not possible to nd a solution mthat can t the data exactly All estimates m lead to predicted data Gm that lie in Up space Thus from the graph above one sees that if the observed data 1 lies completely in Up space the compatibility equations are automatically satis ed 147 Geosciences 567 Chapter 7 RMRGZ 724 Class III P Nlt M This is the minimum length environment where we have more model parameters than observations There are an in nite number of possible solutions that can t the data exactly Recall that the minimum length solution is the one which has the shortest length Ultimately we wish to show that the generalized inverse operator reduces to the minimum length operator when P N lt M ForP NltMwe have 1 Up U 2 U0 is empty 3 V0 is not empty The Role ofGGT Recall that the minimum length operator as de ned in Equation 359 is given by Gm GTGGT1 359 We seek thus to show that G g1 GTGGT 1in this case First consider writing GGT using singularvalue decomposition GGT UPAPV UPAPVIE 1T UpAan vPAPU UPyg U 744 Finding the Inverse 0f GGT Note that GGT is N XN andP N This implies that GGT 1 the mathematical inverse of GGT exists Again using the theorem stated in Equation 28 about the inverse of a product of matrices we have GGTH U13 1N PU UPA U P N 745 Then GTGGT1 UPAPVE iTUPA U 148 Geosciences 567 Chapter 7 RMRGZ VpAl U UpAgD2 U vPAPA U VijDl U G as required Fitting the Data Exactly When P N lt M As before let the generalized inverse solution mg be given by mg G d Then the predicted data d is given by d Gmg GG g1 d CG 1 UPAPV VPA U 1 UPU d 1 since UPU IN wheneverP N 746 736 737 7 47 Thus one can t the data exactly whenever P N The reason is that U0 is empty when P N That is Up is equal to U space The Generalized Inverse Solution Ing Lies in Vp Space The generalized inverse solution mg is 
given by mgG g1 d 149 736 Geosciences 567 Chapter 7 RMRGZ and is a vector in model space It lies completely in VP space The way to see this is to take the dot product of mg with the eigenvectors in V0 If mg has no projection in V0 space then it lies completely in VP space Thus T 7 T 71 VomgivoGg d 7 T 71 T 7V0VPAP UPd 0 748 M713 gtlt1 since V3 VP 0 Nonum39queness 0f the Solution When P N lt M The solution to Gm d is nonunique because V0 exists when P lt M Let the general solution m to Gm d be given by M 1hmg 2am 749 iP1 That is the general solution is given by the generalized inverse solution mg plus a linear combination of the eigenvectors in V0 space where 06139 are constants The predicted data for the general case is given by M GmG mg EatV iP1 750 M Gmg 2 06 CV iP1 When G operates on a vector in V0 space however it retums a zero vector That is GVO UPAPV v0 0 751 NgtltM7P which follows from the fact that the eigenvectors in VP are perpendicular to the eigenvectors in V0 Thus GmGmg0 d 752 150 Geosciences 567 Chapter 7 RMRGZ Now consider the length squared of m M A 2 2 m mg 2a 753 iP1 which follows from the fact that viTVJ H lllz lemgll2 754 That is mg the generalized inverse solution is the smallest of all possible solutions to Gm d This is precisely what was stated at the beginning of this section the generalized inverse solution is equivalent to the minimum length solution when P N lt M It is Possible to Write G751 Without VP When P N ltM To see this we write the generalized inverse operator as the minimum length operator and use singularvalue decomposition That is G GTGGT 1 GTUPA U11 755 Typically this shortcut is not used because knowledge of VP space is useful in the interpretation of the results 725 Class IV Plt minN A This is the class of problems for which neither least squares nor minimum length operators exist That is the least squares operator GLS GTG 1GT 328 does not exist because GTG 1 exists only whenP M Similarly the minimum length operator GML 
GTGGT 1 369 does not exist because GGT 1 exists only when P N For P lt minN AI we have 1 V0 is not empty 2 U0 is not empty 151 Geosciences 567 Chapter 7 RMRGZ In this environment the solution is both nonunique because V0 exists and it is impossible to t the data exactly unless the compatibility equations Equations 669 are satis ed That is it is impossible to t the data exactly unless the data have no projection onto U0 space The generalized inverse operator cannot be further reduced and is given by Equation 79 G VijDl U 79 The generalized inverse operator G 1 simultaneously minimizes the mis t vector 1 7 d in data space and the solution length ngH2 in model space In summary in this section we have shown that the generalized inverse operator G g1 reduces to l The exact inverse whenP N M 2 The least squares inverse GTG 1GT whenP M lt N 3 The minimum length inverse GTGGT 1 when P N ltM Since we have shown that the generalized inverse is equivalent to the exact least squares and minimum length operators when they exist it is worth comparing the way the solution mg is written In the least squares or unique inverse environment for example we would then write mg G gl d 756 but in the minimum length environment we would write mg ltmgt G gl d 7 Gltmgt 757 which explicitly includes a dependence on the prior estimate ltmgt It is somewhat disconcerting to have to carry around two forms of the solution for the generalized inverse Consider what happens however if we use Equation 757 for the unique or least squares environment Then mg ltmgt G d 7 Gltmgt ltmgt G d 7 G Gltmgt G d 1M7 G Gltmgt 758 For the unique inverse environment G1 G 1M and hence Equation 758 reduces to Equation 756 For the least squares environment we have G g1 G GTG 1GTG 1M 759 and hence Equation 758 again reduces to Equation 756 The unique inverse and least squares environments thus have no dependence on ltmgt Equation 757 however is true for the generalized inverse in all environments and is thus adopted as 
the general form of the generalized inverse solution mg 152 Geosciences 567 Chapter 7 RMRGZ In the next section we will introduce measures of the quality of the generalized inverse operator These will include the model resolution matrix R the data resolution matrix N also called the data information density matrix and the unit model covariance matrix covu m 73 Measures of Quality for the Generalized Inverse 731 Introduction In this section three measures of quality for the generalized inverse will be considered They are l The M XM model resolution matrix R 2 The N X N data resolution matrix N 3 The M XM unit covariance matrix covu m The model resolution matrix R measures the ability of the inverse operator to uniquely determine the estimated model parameters The data resolution matrix N measures the ability of the inverse operator to uniquely determine the data This is equivalent to describing the importance or independent information provided by the data The two resolution matrices depend upon the partitioning of model and data spaces into VP V0 and Up U0 spaces respectively Finally the unit covariance matrix covu m is a measure of how uncorrelated noise with unit variance in the data is mapped into uncertainties in the estimated model parameters 732 The Model Resolution Matrix R Imagine for the moment that there is some true solution mtrue that exactly satis es Gmtme d 760 In any inversion we estimate this true solution with meSt mest Ggslt d 761 where Ggslt is some inverse operator It is then possible to ask how mest compares to mme Speci cally considering the generalized inverse we start with Equation 761 and replace 1 with Gmtme obtaining mg G gl Gmtrue 762 The model resolution matrix R is then de ned as R G G 763 where R is anM XM symmetric matrix 153 Geosciences 567 Chapter 7 RMRGZ If R 1 then mest mm and we say that all of the model parameteis are perfectly resolved or equivalently that all of the model parameters are uniquely determined If R i 1M then mest 
is some weighted average of mme est Consider the kth element of mast denoted m k given by the product of the kth row of R and mme tme m1 m mist kth row ofR 2 764 mM The rows of R can thus be seen as windows or lteIs through which the true solution is viewed For example suppose that the kth row of R is given by Rk000f000 kth column We see that mist Omtrlm3 0mtlfi lmtkrue 0mg 0 or simply mist mtlgue In this case we say that the kth model parameter is perfectly resolved or uniquely determined Suppose however that the kth row of R were given by Rk 0 0 01 03 OTIS 04 02 0 kth column Then the kth estimated model parameter n5 is given by m ekst 0mg 03mg33 08m 5218 04mg 02mg53 Or m 1 is a weighted average of several terms in mme In the case just considered it depends most heavily 08 on m lime but it also depends on other components of the true solution We say then that quotF5 is not perfectly resolved in this case The closer the row of R is to the row of an identity matrix the better the resolution From the above discussion it is clear that model resolution may be considered element by element If R 1 then all elements are perfectly resolved If a single row of R is equal to the corresponding row of the identity matrix then the associated model parameter estimate is perfectly resolved Finally we can rewrite Equation 758 as mg G gl d I 7 Rltmgt 765 154 Geosciences 567 Chapter 7 RMRGZ Some Properties of R 1 R vPv Using singularvalue decomposition on Equation 763 R can be written as R VPx U UPAPV VIDA Apv vPv 766 In general VPVIT i 1 However if P M then VP V and V0 is empty In this case Rvpvva1M since V is an orthogonal matrix Thus the condition for perfect model resolution is that V0 be empty or equivalently that P M M 2 Trace R 2 rit P the number of nonzero singular values il Proof IfR IM thenP M and trace R M For the general case it is possible to write R as the product of the following three partitioned matrices T R vPv0 0 V1 0 0 v0T 767 V 1P 0VT 0 0 VAvT where no part of V0 
actually contributes to R because of the extra zeros in A The trace of A is equal to P Note however that matrix A has been obtained from R by an orthogonal transformation because V is an orthogonal matrix Thus by Equation 230d which states that the trace of a matrix is unchanged by an orthogonal transformation we conclude that trace R P as required QED Trace R P implies that G has enough information to uniquely resolve P aspects of the solution These aspects are in fact the P directions in model space given by the eigenvectors Vj in VP Whenever a row of R is equal to a row of the identity matrix I then 155 Geosciences 567 Chapter 7 RMRGZ no part of the associated model parameter m1 falls in V0 space ie it all falls in VP space and that model parameter is perfectly resolved When R is not equal to the identity matrix I some part of the problem is not perfectly resolved Sometimes this is acceptable and other times it is not depending on the problem Forming new model parameters as linear combinations of the old model parameters is one way to reduce the nonuniqueness of the problem One way to do this is to form new model parameters by using the eigenvectors V1 to de ne the linear combinations Suppose that Vi 139 gt p is given by vil W111 1T 768 This tells us that the average of all the model parameters is resolved even if the individual model parameters may not be If we de ned a new model parameter as the average of all the old model parameters it would be perfectly resolved If as is often the case G represents some kind of an averaging lnction you can attempt to reduce the nonuniqueness of the problem by forming new model parameters that are the sum or average of a subset of the old ones even without using the full information in VP If the model parameters are discretized versions of a continuous lnction such as velocity or density versus depth you may be able to improve the resolution by combining layers A rule of thumb in this case is to sum the entries along the 
diagonal of the resolution matrix R until you get close to one At this point your system is able to resolve one aspect of the solution uniquely You can try forming a new model parameter as the average of the layer velocities or densities up to this point Depending on the details of G you may have perfect resolution of this average of the old model parameters Depending on the problem it may be more useful to uniquely know the average of the model parameters over some depth range than it is to have nonunique estimates of the values over the same range M M E r E r12 rit importance of 1th model parameter If r 1 then the ith model parameter is uniquely resolved and it is thus said to be very important If on the other hand r is very small then the parameter is poorly resolved and is said to be not very important If we lrther note that R can be written as R 1 1 1 2 I M where 13 is the 1th column of R then the estimated solution mg from 762 can be written as mg Rmtrue r1m1 r2m2 rMmM 156 Geosciences 567 Chapter 7 RMRGZ where m is the ith component of mme which follows from 215221 That is the estimated solution vector can also be thought of as the weighted sum of the columns of R with the weighting factors being given by the true solution The length squared of each column of R can then be thought of as the importance of m in the solution The length squared of 13 is given by M llrill2 203 4 j1 Thus the diagonal entries in R give the importance of each model parameter for the problem We will return to the model resolution matrix R later to show how the generalized inverse is the inverse operator that minimizes the difference between R G1 G and IM in the least squares sense and when we discuss the tradeoff between resolution and variance 733 The Data Resolution Matrix N Consider the development of the data resolution matrix N which follows closely that of the model resolution matrix R The estimated solution for any inverse operator Ggslt is given by mest Ggslt d 7 69 The 
predicted data d for this estimated solution are given by 1 Gmest 770 Replacing mest in 7 70 with 769 gives 21 Gagt d Nd 771 where N is an N XN matrix called the data resolution matrix A Speci c Example As a speci c example consider the generalized inverse operator G g1 Then N is given by N GG gl 772 If N IN then the predicted data d equal the observed data d and the observed data can be t exactly If N i IN then the predicted data are some weighted average of the observed data 1 Consider the kth element of the predicted data 67 k 157 Geosciences 567 Chapter 7 RMRGZ all A d2 dk kth row ofN 773 dN The rows of N are windows through which the observed data are viewed If the kth row of N has a l in the kth column and zeroes elsewhere the kth observation is perfectly resolved We also say that the kth observation in this case provides completely independent information For this reason N is sometimes also referred to as the data information density matrix Equation 773 shows that the kth predicted datum is a weighted average of all of the observations with the weighting given by the entries in the kth row of N If the kth row of N has many nonzero elements then the kth predicted observation depends on the true value of many of the observations and not just on the kth The data resolution of the kth observation then is said to be poor Some Properties of N for the Generalized Inverse 1 N GG gl UPU Using singularvalue decomposition we have that N GG gl UPAPVII VIDA U UPU 774 since VVP I VP is semiorthogonal and AIDA I In general UPU IN However if P N then Up U and U0 is empty Then N UPU UUT IN since U is itself an orthogonal matrix Thus the condition for perfect data resolution is that U0 be empty or that P N 2 trace N P The proof follows that of trace R P in 763 I 0 UT NUPUTUPU0I 0U T 0 and IP 0 trace P QED 0 0 If for example n11 n22 3 1 775a 158 Geosciences 567 Chapter 7 RMRGZ one might choose to form a new observation di as a linear combination of all and d2 given in the 
simplest case by d1 all d2 775b The actual linear combination of the two observations that is resolved depends on the eigenvectors in Up or equivalently upon the structure of the data resolution matrix N In any case the new observation 61 could provide essentially independent information and could be a way to reduce the computational effort of the inverse problem by reducing N In many cases however the bene t of being able to average out data errors over the observations is more important than any computational savings that might come from combining observations N N 2n nil importance of the 1th observation 776 jl jl That is the sum of squares of the entries in a row or column since N is symmetric of N is equal to the diagonal entry in that row Thus as the diagonal entry gets large close to one the other entries in that row must become small As the importance of a particular datum becomes large the dependence of the predicted datum on other observations must become small If we further note that we can write N as N n1 n2 nN 777 where ni is the 1th column of N then the predicted data d from 771 can be written as 1 Nd n1d1 n2d2 nNdN where di is the ith component of d Equation 778 follows from 215221 That is the predicted data vector can also be thought of as the weighted sum of the columns of N with the weighting factors being given by the actual observations The length squared of each column of N can then be thought of as the importance of di in the solution The length squared of ni is given by N llnillz2n n 079 Thus the diagonal entries in N give the importance of each observation in the solution 159 Geosciences 567 Chapter 7 RMRGZ It can also be shown that the generalized inverse operator G g1 VijD1 U minimizes the difference between N and IN Let us now turn our attention to another measure of quality for the generalized inverse the unit covariance matrix covu m 734 The Unit Model Covariance Matrix covu In Any errors noise in the data will be mapped into errors in 
the estimates of the model parameters. The mapping was first considered in Section 3.7 of Chapter 3. We will now reconsider it from the generalized inverse viewpoint. Let the error (noise) in the data d be delta-d. Then the error in the model parameters due to delta-d is given by

    delta-m = G_g^{-1} delta-d    (7.80)

Step 1. Recall from Equation (2.42c) that the N x N data covariance matrix [cov d] is given by

    [cov d] = (1/k) sum_{i=1}^{k} delta-d_i delta-d_i^T    (7.81)

where k is the number of experiments and i is the experiment number. The diagonal terms are the data variances, and the off-diagonal terms are the covariances. The data covariance is also written as <delta-d delta-d^T>, where < > denotes averaging.

Step 2. We seek, then, a model covariance matrix <delta-m delta-m^T>:

    [cov m] = <delta-m delta-m^T>
            = <G_g^{-1} delta-d (G_g^{-1} delta-d)^T>
            = <G_g^{-1} delta-d delta-d^T (G_g^{-1})^T>    (7.82)

G_g^{-1} is not changing with each experiment, so we can take it outside the averaging, implying

    <delta-m delta-m^T> = G_g^{-1} <delta-d delta-d^T> (G_g^{-1})^T

or

    [cov m] = G_g^{-1} [cov d] (G_g^{-1})^T    (7.83)

The above derivation provides some of the logic behind Equation (2.44c), which was introduced in Chapter 2 as "magic."

Step 3. Finally, define a unit model covariance matrix [cov_u m] by assuming that [cov d] = I_N, that is, by assuming that all the data variances are equal to 1 and the covariances are all 0 (uncorrelated data errors). Then

    [cov_u m] = G_g^{-1} I_N (G_g^{-1})^T = G_g^{-1} (G_g^{-1})^T    (7.84)

Some Properties of [cov_u m]

1. Using singular-value decomposition, we can write the unit model covariance matrix as

    [cov_u m] = G_g^{-1} (G_g^{-1})^T
              = V_P Lambda_P^{-1} U_P^T (V_P Lambda_P^{-1} U_P^T)^T
              = V_P Lambda_P^{-1} U_P^T U_P Lambda_P^{-1} V_P^T
              = V_P Lambda_P^{-2} V_P^T    (7.85)

This emphasizes the importance of the size of the singular values in determining the model parameter variances. As lambda_i gets small, the entries in [cov_u m] tend to get big, implying large model parameter estimate variances, due to the terms in 1/lambda_i^2.

Consider the kth diagonal entry in [cov_u m], written [cov_u m]_kk, where

    [cov_u m] = [v_1  v_2  ...  v_P] Lambda_P^{-2} [v_1  v_2  ...  v_P]^T    (7.86)

If we multiply out the first two matrices, we can then identify the kk entry in [cov_u m] as the product of the kth row times the kth column of the resulting
matrices:

    [cov_u m]_kk = sum_{i=1}^{P} v_ki^2 / lambda_i^2    (7.87)

Thus, as lambda_i gets small, [cov_u m]_kk will get large if v_ki is not zero. Recall that v_ki is the kth component of the ith eigenvector v_i in V_P. Thus it is the combination of lambda_i getting small and v_i having a nonzero component in the kth row that makes the variance for the kth model parameter potentially very large.

2. Even if the data covariance is diagonal (i.e., all the observations have errors that are uncorrelated), [cov_u m] need not be diagonal. That is, the model parameter estimates may well have nonzero covariances, even though the data have zero covariances. For example, following Equation (7.87) above, we can see that

    [cov_u m]_rk = sum_{i=1}^{P} v_ri v_ki / lambda_i^2    (7.88)

Note that the sum in the above equation is not the dot product of two columns of V_P. In fact, even if the numerator were the dot product between columns (which would vanish for r != k by orthogonality), the fact that every term is divided by lambda_i^2 would likely yield something other than zero. The numerator is in fact the dot product of two rows of V_P, and is likely nonzero anyway.

3. Notice that [cov_u m] is a function of the forward problem, as expressed in G, and not a function of the actual data. Thus it can be useful for experimental design.

7.3.5 Combining R, N, [cov_u m]

Note that, in general, G, G_g^{-1}, R, N, and [cov_u m] can be written in terms of singular-value decomposition as

    1. G = U_P Lambda_P V_P^T
    2. G_g^{-1} = V_P Lambda_P^{-1} U_P^T
    3. R = G_g^{-1} G = V_P V_P^T
    4. N = G G_g^{-1} = U_P U_P^T
    5. [cov_u m] = G_g^{-1} (G_g^{-1})^T = V_P Lambda_P^{-2} V_P^T

Case I: P = M = N

    R = G_g^{-1} G = I_M, since G_g^{-1} = G^{-1}
    N = G G_g^{-1} = I_N, since G_g^{-1} = G^{-1}
    [cov_u m] = G_g^{-1} (G_g^{-1})^T = V Lambda^{-1} U^T (V Lambda^{-1} U^T)^T
              = V Lambda^{-1} U^T U Lambda^{-1} V^T = V Lambda^{-2} V^T

Case II: P = M < N (Least Squares)

    G_g^{-1} = V_P Lambda_P^{-1} U_P^T = (G^T G)^{-1} G^T
    R = G_g^{-1} G = V_P V_P^T = V V^T = I_M, since P = M
    N = G G_g^{-1} = G (G^T G)^{-1} G^T = U_P U_P^T (using SVD)
    [cov_u m] = (G^T G)^{-1} G^T [(G^T G)^{-1} G^T]^T = V_P Lambda_P^{-2} V_P^T (using SVD)

Case III: P = N < M (Minimum Length)

    G_g^{-1} = V_P Lambda_P^{-1} U_P^T = G^T (G G^T)^{-1}
    R = G_g^{-1} G = G^T (G G^T)^{-1} G = V_P V_P^T (using SVD)
    N = G G_g^{-1} = G G^T (G G^T)^{-1} = U_P U_P^T = U U^T = I_N
    [cov_u m] = G^T (G G^T)^{-1} [G^T (G G^T)^{-1}]^T = V_P Lambda_P^{-2} V_P^T (using SVD)

Case IV: P < min(M, N) (General Case)

This is just the general case.

7.3.6 An Illustrative Example

Consider a system of equations Gm = d given by

    [1.00  1.00] [m1]   [2.00]
    [2.00  2.01] [m2] = [4.10]    (7.89)

Doing singular-value decomposition, one finds

    lambda_1 = 3.169,  lambda_2 = 0.00316

    U_P = U = [0.446  -0.895]
              [0.895   0.446]

    V_P = V = [0.706  -0.710]
              [0.709   0.704]

    R = V_P V_P^T = [1.0  0.0]
                    [0.0  1.0]

    N = U_P U_P^T = [1.0  0.0]
                    [0.0  1.0]

    G_g^{-1} = G^{-1} = [ 201  -100]
                        [-200   100]

    m_g = G_g^{-1} d = [ 201  -100] [2.00]   [-8.0]
                       [-200   100] [4.10] = [10.0]    (7.90)

Note that the solution has perfect model resolution (R = I, and hence the solution is unique) and perfect data resolution (N = I, and hence the data can be fit exactly). Note also that P = N = M, and the generalized inverse is in fact the unique mathematical inverse.

This solution is, however, essentially meaningless if the data contain even a small amount of noise. To see this, consider the unit covariance matrix [cov_u m] for this case:

    [cov_u m] = G_g^{-1} (G_g^{-1})^T = V_P Lambda_P^{-2} V_P^T,  Lambda_P^{-2} = diag(0.0996, 100301)

giving

    [cov_u m] = [ 50401  -50200]
                [-50200   50000]    (7.91)

These are very large variances for m1 and m2, which indicate that the solution, while unique and fitting the data perfectly, is very unstable, or sensitive to noise in the data. For example, suppose that d2 is 4.0 instead of 4.1 (a 2.5% error). Then the generalized inverse solution m_g is given by

    m_g = G_g^{-1} d = [ 201  -100] [2.00]   [2.0]
                       [-200   100] [4.00] = [0.0]    (7.92)

That is, errors of less than a few percent in d result in errors on the order of several hundred percent in m_g. Whenever small changes in d result in large changes in m_g, the problem is considered unstable. In this particular problem, if a solution is desired with a standard deviation of order 0.1, then the data standard deviations must be less than about 5 x 10^{-4}.

Another way of quantifying the instability of the inversion is with the condition number, defined as

    condition number = lambda_max / lambda_min    (7.93)

For this particular problem, the condition number is approximately 1000, which indicates considerable instability. The condition number by itself can be misleading. If a problem has two
singular values lambda_1 and lambda_2, with lambda_1 = 1,000,000 and lambda_2 = 1000, then lambda_1/lambda_2 = 1000. This problem, however, is very stable with respect to changes of length order one in the data (do you see why?). If, however, lambda_1 = 0.001 and lambda_2 = 0.000001, then lambda_1/lambda_2 = 1000 as well, and unit-length changes in the data will cause large changes in the solution. In addition to the condition number, then, the absolute size of the singular values is important, especially compared to the size of the possible noise in the data.

In order to gain a better understanding of the origin of the instability, one must consider the structure of the G matrix itself. For the present example, inspection shows that the columns (or rows) of G are very nearly parallel to one another. For example, the angle between the vectors given by the columns is 0.114 degrees, obtained by taking the dot product of the two columns. The two columns of G span the two-dimensional data space (and hence the data resolution is perfect), but the fact that they are nearly parallel leads to a significant instability. It is not a coincidence, therefore, that the data eigenvector associated with the larger singular value, u_1 = [0.446, 0.895]^T, is essentially parallel to the common direction given by the columns of G. Nor is it a coincidence that u_2, associated with the smaller singular value, is perpendicular to the almost uniform direction given by the columns of G. The eigenvector u_1 represents a stable direction in data space as far as noise is concerned, while u_2 represents an unstable direction in data space as far as noise is concerned. Noise in data space parallel to u_1 will be damped by 1/lambda_1, while noise parallel to u_2 will be amplified by 1/lambda_2.

Similar arguments can be made about the rows of G, which lie in model space. That is, v_1 is essentially parallel to the almost uniform direction given by the rows of G, while v_2 is essentially perpendicular to that direction. Noise parallel to u_1, when operated on by the generalized inverse, creates noise in the solution parallel to v_1, while noise parallel to
u_2 creates noise parallel to v_2. Thus v_2 is the unstable direction in model space.

Methods to stabilize the model parameter variances will be considered in a later section, but it will also be shown that any gain in stability is obtained at a cost in resolution. First, however, we will introduce ways to quantify R, N, and [cov_u m]. We will return to the above example and show specifically how stability can be enhanced while resolution is lost.

7.4 Quantifying the Quality of R, N, and [cov_u m]

7.4.1 Introduction

In the preceding sections, we have shown that the model resolution matrix R, the data resolution matrix N, and the unit model covariance matrix [cov_u m] can be very useful, at least in a qualitative way, in assessing the quality of a particular inversion. In this section we will quantify these measures of quality, and show that the generalized inverse is the inverse that gives the best possible model and data resolution.

First, consider the following definitions (see Menke, page 68):

    spread(R) = ||R - I||^2 = sum_{i=1}^{M} sum_{j=1}^{M} (r_ij - delta_ij)^2    (7.94)

    spread(N) = ||N - I||^2 = sum_{i=1}^{N} sum_{j=1}^{N} (n_ij - delta_ij)^2    (7.95)

and

    size([cov_u m]) = sum_{i=1}^{M} [cov_u m]_ii    (7.96)

The spread function measures how different R or N is from an identity matrix. If R (or N) = I, then spread(R) (or spread(N)) = 0. The size function is the trace of the unit model covariance matrix, which gives the sum of the model parameter variances. We can now look at the spread and size functions for various classes of problems.

7.4.2 Classes of Problems

Class I: P = N = M

    spread(R) = spread(N) = 0: perfect model and data resolution
    size([cov_u m]): depends on the size of the singular values

Class II: P = M < N (Least Squares)

    spread(R) = 0: perfect model resolution
    spread(N) != 0: data not all independent
    size([cov_u m]): depends on the size of the singular values

Class III: P = N < M (Minimum Length)

    spread(R) != 0: nonunique solution
    spread(N) = 0: perfect data resolution
    size([cov_u m]): depends on the size of the singular values

Class IV: P < min(N, M) (General Case)

    spread(R) != 0: nonunique solution
    spread(N) != 0: data not all independent
    size([cov_u m]): depends on the size of the singular values

We also note that the position of an off-diagonal nonzero entry in R or N does not affect the spread. This is as it should be if the model parameters and data have no physical ordering.

7.4.3 Effect of the Generalized Inverse Operator G_g^{-1}

We are now in a position to show that the generalized inverse operator G_g^{-1} gives the best possible R and N matrices, in terms of minimizing the spread functions as defined in (7.94)-(7.95). Menke (pp. 68-70) does this for the P = M < N case, and less fully for the P = N < M case. Consider instead the more general derivation after Jackson (1972).

For any estimate of the inverse operator G_est^{-1}, the model resolution matrix R is given by

    R = G_est^{-1} G = G_est^{-1} U_P Lambda_P V_P^T = B V_P^T    (7.97)

where B = G_est^{-1} U_P Lambda_P. From (2.15)-(2.21), each row of R will be a linear combination of the rows of V_P^T, or equivalently, a linear combination of the columns of V_P. The weighting factors are determined by B, which depends on the choice of the inverse operator. The goal, then, is to choose an inverse operator that will make R most like the identity matrix I, in the sense of minimizing spread(R).

Define b_k^T as the kth row of B, and d_k^T as the kth row of I_M. We seek b_k as the least squares solution to

    b_k^T V_P^T = d_k^T    (7.98)
    (1 x P)(P x M) = (1 x M)

Taking the transposes implies

    V_P b_k = d_k    (7.99)
    (M x P)(P x 1) = (M x 1)

Equation (7.99) can be solved with the least squares operator (see 3.19) as

    b_k = (V_P^T V_P)^{-1} V_P^T d_k = I_P^{-1} V_P^T d_k = V_P^T d_k    (7.100)

Taking the transpose of (7.100) gives

    b_k^T = d_k^T V_P    (7.101)

Writing this out specifically, we have

    [b_k1  b_k2  ...  b_kP] = [0  ...  0  1  0  ...  0] V_P    (7.102)

where the single 1 on the right-hand side is in the kth position. Looking at (7.102), we see that the ith entry in b_k is given by

    b_ki = v_ki    (7.103)

That is, each element in the kth row of B is the corresponding element in the kth row of V_P. Or, simply put, the kth row of B is given by the kth row of V_P. Making similar arguments for each row of B gives us

    B = V_P    (7.104)

Substituting B back into (7.97) gives

    R = V_P V_P^T    (7.105)

This is, however, exactly the model
resolution matrix for the generalized inverse given in (7.65). Thus we have shown that the generalized inverse is the operator with the best model resolution, in the sense that the least squares difference between R and I_M is minimized. Very similar arguments show that the generalized inverse is also the operator with the best data resolution, in the sense that the least squares difference between N and I_N is minimized.

In cases where the model parameters or data have a natural ordering, such as a discretization of density versus depth for model parameters, or gravity measurements along a profile for data, we might want to modify the definition of the spread functions in (7.94)-(7.95). One such modification leads to the Backus-Gilbert inverse. A modified spread function is defined by

    spread(R) = sum_{i=1}^{M} sum_{j=1}^{M} w(i, j) (r_ij - delta_ij)^2    (7.106)

where w(i, j) = (i - j)^2. This gives more weight (penalty) to entries far from the diagonal. It has the effect, however, of canceling out any r_ii contribution to the spread. To handle this, a constraint equation is added and satisfied by the use of Lagrange multipliers. The constraint equation is given by

    sum_{j=1}^{M} r_ij = 1    (7.107)

This ensures that not all entries in a row of R are allowed to go to zero, which would otherwise minimize the spread in (7.106). The inverse operator based on (7.106) is called the Backus-Gilbert inverse, first developed for continuous (rather than discrete) problems.

7.5 Resolution Versus Stability

7.5.1 Introduction

We will see in this section that stability can be improved by removing small singular values from an inversion. We will also see, however, that this reduces the resolution. There is an unavoidable trade-off between solution stability and resolution.

Recall the example from Equation (7.89):

    [1.00  1.00] [m1]   [2.00]
    [2.00  2.01] [m2] = [4.10]    (7.89)

The singular values, eigenvector matrices, generalized inverse, and other relevant matrices are given in Equation (7.90). One option is to arbitrarily set lambda_2 = 0. Then P is reduced from 2 to 1, and

    U_P = [0.446]
          [0.895]

    V_P = [0.706]
          [0.709]

and
    R = V_P V_P^T = [0.5  0.5]
                    [0.5  0.5]

    N = U_P U_P^T = [0.2  0.4]
                    [0.4  0.8]

    G_g^{-1} = V_P Lambda_P^{-1} U_P^T = [0.099  0.199]
                                         [0.100  0.200]

    [cov_u m] = V_P Lambda_P^{-2} V_P^T = [0.0496  0.0498]
                                          [0.0498  0.0500]

    m_g = G_g^{-1} d = [1.016]
                       [1.020]

and

    d^pre = G m_g = [2.04]
                    [4.08]    (7.108)

First, note that the size of the unit model covariance matrix has been significantly reduced, indicating a dramatic improvement in the stability of the solution. The model parameter variances are order 0.05 for data with unit variance. Second, note that the fit to the data, while not perfect, is fairly close. The misfits for d1 and d2 are at most 2%. Third, however, note that both model and data resolution have been degraded from perfect resolution when both singular values were retained.

In fact, R now indicates that the estimates for both m1 and m2 are given by the average of the true values of m1 and m2. That is, 0.706 m1 plus 0.709 m2 is perfectly resolved, but there is no information about the difference. This can also be seen by examining V_P, which points in the [0.706, 0.709]^T direction in model space. This is the only direction in model space that can be resolved. Recall that when d2 was changed from 4.1 to 4.0, the solution changed from [-8, 10]^T to [2, 0]^T. The sum of m1 and m2 remained constant, but the difference changed significantly. If d2 is changed from 4.1 to 4.0 now, the solution is [0.998, 1.000]^T, very close to the solution with d2 = 4.1.

In some cases, knowing the sum of m1 and m2 may be useful, such as when m gives the velocity of some layered structure. Then knowing the average velocity, even if the individual layer velocities cannot be resolved, may be useful. In any case, we have shown that the original decomposition with two nonzero singular values was so unstable that the solution, while unique, was essentially meaningless.

The data resolution matrix N indicates that the second observation is more important than the first (0.8 versus 0.2 along the diagonal). This can be seen either from noting that the second row of either column of G is larger than the first row (and U_P is formed as a linear
combination of the columns of G), or by looking at U_P, which points in the [0.446, 0.895]^T direction in data space.

Another way to look at the trade-off is by plotting resolution versus stability, as shown below.

[Figure: trade-off curve between resolution (spread of R, N) and stability (size of cov_u m). As P decreases, spread grows (resolution degrades) while size(cov_u m) shrinks (stability improves); the best corner is small spread and small size.]

As P is decreased by setting small singular values to zero, the resolution degrades while the stability increases. Sometimes it is possible to pick an optimal cutoff value for small singular values based on this type of graph.

7.5.2 R, N, and [cov_u m] for Nonlinear Problems

The resolution matrices and the unit model covariance matrix are also useful in a nonlinear analysis, although the interpretations are somewhat different than they are for the linear case.

Model Resolution Matrix R

In the linear case, the solution is unique whenever R = I. For the nonlinear problem, a unique solution is not guaranteed, even if R = I. In fact, no solution may exist, even when R = I. Consider the following simple nonlinear problem:

    m1^2 = d1    (7.109)

With a single model parameter and a single observation, we have P = M = N. Thus R = I at every iteration. If d1 = 4, the process will iterate successfully to the solution m1 = 2, unless by chance the iterative process ever gives m1 exactly equal to zero, in which case the inverse is undefined. However, if d1 is negative, there is no real solution, and the iterative process will never converge to an answer, even though R = I.

The uniqueness of nonlinear problems also depends on the existence of local minima. It is always a good idea in nonlinear problems to explore solution space to make sure that the solution obtained corresponds to a global minimum. Take, for example, the following case with two observations and two model parameters:

    m1^4 + m2^2 = 2
    m1^2 + m2^4 = 2    (7.110)

This simple set of two nonlinear equations in two unknowns has R = I almost everywhere in solution space. By inspection, however, there are four solutions that fit the data exactly, given by [m1, m2]^T = [1, 1]^T, [1,
-1]^T, [-1, 1]^T, and [-1, -1]^T, respectively.

To see the role of the model resolution matrix for a nonlinear analysis, recall Equations (4.12)-(4.16), where, for example,

    delta-c = G delta-m    (4.13)

and where delta-c is the misfit to the data, given by the observed minus the predicted data, delta-m are changes to the model at this iteration, and G is the matrix of partial derivatives of the forward equations with respect to the model parameters. If R = I at the solution, then the changes delta-m are perfectly resolved in the close vicinity of the solution. If R != I, then there will be directions in model space, corresponding to V_0, that do not change the predicted data, and hence the fit to the data.

All of this analysis, of course, is based on the linearization of a nonlinear problem in the vicinity of the solution. The analysis of R is only as good as the linearization of the nonlinear problem. If the problem is very nonlinear at the solution, the validity of conclusions based on an analysis of R may be suspect.

Note also that R, which depends on both G and the inverse operator, may change during the iterative process. For example, in Equation (7.110) above, we noted that R = I almost everywhere in solution space. At the point given by [m1, m2]^T = [0.7071, 0.7071]^T, you may verify for yourselves that all four entries in the G matrix of partial derivatives are equal to 1.414, so that G is singular. In this case, P is reduced from 2 to 1. The next iteration, however, will take the solution to somewhere else, where the resolution matrix is again an identity matrix.

The analysis of R is thus generally reserved for the final iteration, at the solution. At intermediate steps, R determines whether there is a unique direction in which to move toward the solution. Since the path to the solution is less critical than the final solution, little emphasis is generally placed on R during the iterative process. The generalized inverse operator, which finds the minimum length solution for delta-m, finds the smallest possible change to the linearized problem that minimizes the misfit to the data. This is a benefit, because large
changes in delta-m will take the estimated parameter values farther away from the region where the linearization of the problem is valid.

Data Resolution Matrix N

In the linear case, N = I implies perfectly independent and resolved data. In the nonlinear case, N = I implies that the misfit delta-c, and not necessarily the data vector d itself, is perfectly resolved for the linearized problem. If N != I, then any part of the misfit delta-c that lies in U_0 space will not contribute to changes in the model parameter estimates. In the vicinity of the solution, if N = I, then data space is completely resolved and the misfit should typically go to zero. If N != I at the solution, then there may be a part of the data that cannot be fit. But even if N = I, there is no guarantee that there is any solution that will fit the data exactly. Recall the example in Equation (7.109) above, where N = I everywhere. If d1 is negative, no real solution can be found that fits the data exactly.

As with the model resolution matrix, N is most useful at the solution and less useful during the iterative process. Also, it should always be recalled that the analysis of N is only as valid as the linearization of the problem.

The Unit Model Covariance Matrix [cov_u m]

For a linear analysis, the unit model covariance provides variance estimates for the model parameters, assuming unit, uncorrelated data variances. For the nonlinear case, the unit model covariance matrix provides variance estimates for the model parameter changes delta-m. At the solution, these can be interpreted as variances for the model parameters themselves, as long as the problem is not too nonlinear. Along the iterative path, and at the final solution, the unit covariance matrix provides an estimate of the stability of the process. If the variances are large, then there must be small singular values, and the misfit may be mapped into large changes in the model parameters. Analysis of particular model parameter variances is usually reserved for the final iteration. As with both
the resolution matrices, the model parameter variance estimates are based on the linearized problem, and are only as good as the linearization itself.

Consider a simple N = M = 1 nonlinear forward problem given by

    d = m^{1/3}    (7.111)

The inverse solution (exact, or equivalently, the generalized inverse) is of course given by

    m = d^3    (7.112)

These relationships are shown in the figures on the next page. Suppose we consider a case where d_true = 1. The true solution is m = 1. A generalized inverse analysis leads to a linearized estimate of the uncertainty in the solution, [cov_u m], of 9. This analysis assumes Gaussian noise with mean zero and variance 1. If we instead use Gaussian noise with mean zero and standard deviation sigma = 0.25 (i.e., variance 0.0625), then cov m = 0.5625. The simple nature of this problem leads to an amplification by a factor of 9 between the data variance and the linearized estimate of the solution variance.

Now consider an experiment in which 50,000 noisy data values are collected. The noise in these data has a mean of 0.0 and a standard deviation of 0.25. For each noisy data value, a solution is found from the above equations. This will create 50,000 estimates of the solution. Distributions of both the data noise and the solution are shown in the figures after those for the forward and inverse problems in Equations (7.111)-(7.112). Note that, due to the nonlinear nature of the forward problem, the distribution of solutions is not even Gaussian. The mean value <m> is 1.18, greater than the true value of 1. The standard deviation is 0.84. Also shown on the figure is the maximum likelihood solution m_ML, as determined empirically from the distribution. The purpose of this example is to show that caution must be applied to the interpretation of all inverse problems, but especially nonlinear ones.

[Figure: Forward and inverse problems. (a) The forward problem, d_predicted = m^{1/3}. (b) The inverse problem, m_predicted = d_observed^3.]

[Figure: Mapping noisy data, 50,000 experiments. (a) Gaussian data noise;
(b) the resulting non-Gaussian distribution of solutions, with m_ML = 0.64, m_true = 1.0, <m> = 1.18, and sigma = 0.84.]

CHAPTER 8: VARIATIONS OF THE GENERALIZED INVERSE

8.1 Linear Transformations

8.1.1 Analysis of the Generalized Inverse Operator G_g^{-1}

Recall Equation (2.21):

    A B C = D    (2.21)

which states that if the matrix D is given by the product of matrices A, B, and C, then each column of D is a weighted sum of the columns of A, and each row of D is a weighted sum of the rows of C. Applying this to Gm = d, we saw in Equation (2.15) that the data vector d is the weighted sum of the columns of G. Note that both the data vector and the columns of G are N x 1 vectors in data space.

We can extend this analysis by using singular-value decomposition. Specifically, writing out G as

    G = U_P Lambda_P V_P^T    (6.65)
    (N x M) = (N x P)(P x P)(P x M)

each column of G is now seen as a weighted sum of the columns of U_P. Each column of G is an N x 1 dimensional vector (i.e., in data space) and is the weighted sum of the P eigenvectors u_1, u_2, ..., u_P in U_P. Each row of G is a weighted sum of the rows of V_P^T, or equivalently, of the columns of V_P. Each row of G is a 1 x M row vector in model space. It is the weighted sum of the P eigenvectors v_1, v_2, ..., v_P in V_P.

A similar analysis may be considered for the generalized inverse operator, where

    G_g^{-1} = V_P Lambda_P^{-1} U_P^T    (7.9)
    (M x N) = (M x P)(P x P)(P x N)

Each column of G_g^{-1} is a weighted sum of the columns of V_P. Each row of G_g^{-1} is a weighted sum of the rows of U_P^T, or equivalently, of the columns of U_P.

Let us now consider what happens in the system of equations Gm = d when we take one of the eigenvectors in V_P as m. Let m = v_i, the ith eigenvector in V_P. Then

    G v_i = U_P Lambda_P V_P^T v_i    (8.1)
    (N x 1) = (N x P)(P x P)(P x M)(M x 1)

We can expand this as

    G v_i = U_P Lambda_P [V_P^T v_i]    (8.2)

The product of V_P^T with v_i is a P x 1 vector with zeros everywhere except for the ith row, which represents the dot product of v_i with itself.
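Before expanding that product symbolically, the resulting mapping G v_i = lambda_i u_i (the shifted eigenvalue relation of Equation 6.16) can also be checked numerically. The sketch below is illustrative only and is not from the notes: it rebuilds the SVD of the 2 x 2 example of Equation (7.89) by hand from the symmetric eigenproblem of G^T G, using only the standard library, and then verifies the relation for both singular vectors.

```python
import math

# The 2x2 example of Equation (7.89)
G = [[1.00, 1.00],
     [2.00, 2.01]]

# G^T G (symmetric 2x2)
GtG = [[G[0][0]**2 + G[1][0]**2, G[0][0]*G[0][1] + G[1][0]*G[1][1]],
       [G[0][0]*G[0][1] + G[1][0]*G[1][1], G[0][1]**2 + G[1][1]**2]]

# Closed-form eigenvalues of [[a, b], [b, c]]
a, b, c = GtG[0][0], GtG[0][1], GtG[1][1]
disc = math.sqrt((a - c)**2 + 4*b*b)
mu1, mu2 = (a + c + disc)/2, (a + c - disc)/2   # eigenvalues of G^T G
lam1, lam2 = math.sqrt(mu1), math.sqrt(mu2)     # singular values of G

def unit(v):
    n = math.hypot(v[0], v[1])
    return [v[0]/n, v[1]/n]

def matvec(A, x):
    return [A[0][0]*x[0] + A[0][1]*x[1], A[1][0]*x[0] + A[1][1]*x[1]]

# Eigenvector of GtG for eigenvalue mu: (GtG - mu I) v = 0  ->  v ~ (b, mu - a)
v1 = unit([b, mu1 - a])
v2 = unit([b, mu2 - a])

# The coupled data-space eigenvectors: u_i = G v_i / lambda_i
u1 = [x/lam1 for x in matvec(G, v1)]
u2 = [x/lam2 for x in matvec(G, v2)]

print("lambda1 = %.3f, lambda2 = %.5f" % (lam1, lam2))
print("v1 =", [round(x, 3) for x in v1], " u1 =", [round(x, 3) for x in u1])

# Verify the shifted eigenvalue relation G v_i = lambda_i u_i
for lam, v, u in [(lam1, v1, u1), (lam2, v2, u2)]:
    Gv = matvec(G, v)
    assert all(abs(Gv[k] - lam*u[k]) < 1e-9 for k in range(2))
```

Running this recovers lambda_1 = 3.169 and lambda_2 = 0.00316, with v_1 = [0.706, 0.709]^T and u_1 = [0.446, 0.895]^T, matching (7.90); the ratio lambda_1/lambda_2 is the condition number of about 1000 discussed in Section 7.3.6.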
Continuing, we have

    G v_i = U_P Lambda_P [0, ..., 0, 1, 0, ..., 0]^T = U_P [0, ..., 0, lambda_i, 0, ..., 0]^T    (8.3)

where the nonzero entry is in the ith position, or simply

    G v_i = lambda_i u_i    (8.4)

This is, of course, simply the statement of the shifted eigenvalue problem from Equation (6.16). The point was not, however, to reinvent the shifted eigenvalue problem, but to emphasize the linear algebra, or mapping, between vectors in model and data space. Note that v_i, a unit-length vector in model space, is transformed into a vector of length lambda_i in data space, since u_i is also of unit length. If lambda_i is large, then a unit-length change in model space in the v_i direction will have a large effect on the data. Conversely, if lambda_i is small, then a unit-length change in model space in the v_i direction will have little effect on the data.

8.1.2 G_g^{-1} Operating on a Data Vector d

Now consider a similar analysis for the generalized inverse operator G_g^{-1}, which operates on a data vector d. Suppose that d is given by one of the eigenvectors in U_P, say u_i. Then

    G_g^{-1} u_i = V_P Lambda_P^{-1} U_P^T u_i    (8.5)
    (M x 1) = (M x P)(P x P)(P x N)(N x 1)

Following the development above, note that the product of U_P^T with u_i is a P x 1 vector with zeros everywhere except the ith row, which represents the dot product of u_i with itself. Continuing,

    G_g^{-1} u_i = V_P Lambda_P^{-1} [0, ..., 0, 1, 0, ..., 0]^T = V_P [0, ..., 0, 1/lambda_i, 0, ..., 0]^T    (8.6a)

or simply

    G_g^{-1} u_i = (1/lambda_i) v_i    (8.6b)

This is not a statement of the shifted eigenvalue problem, but it has an important implication for the mapping between data and model spaces. Specifically, it implies that a unit-length vector u_i in data space is transformed into a vector of length 1/lambda_i in model space. If lambda_i is large, then small changes in d in the direction of u_i will have little effect in model space. This is good if, as usual, these small changes in the data vector are associated with noise. If lambda_i is small, however, then small changes in d in the u_i direction will have a large effect on the model parameter estimates. This
reflects a basic instability in inverse problems whenever there are small, nonzero singular values. Noise in the data in directions parallel to eigenvectors associated with small singular values will be amplified into very unstable model parameter estimates.

Note also that there is an intrinsic relationship, or coupling, between the eigenvectors v_i in model space and u_i in data space. When G operates on v_i, it returns u_i scaled by the singular value lambda_i. Conversely, when G_g^{-1} operates on u_i, it returns v_i scaled by 1/lambda_i. This represents a very strong coupling between the v_i and u_i directions, even though the former are in model space and the latter are in data space. Finally, the linkage between these vectors depends very strongly on the size of the nonzero singular value.

8.1.3 Mapping Between Data and Model Space: An Example

One useful way to graphically represent the mapping back and forth between model and data spaces is with the use of "stick figures." These are formed by plotting the components of the eigenvectors in model and data space, for each model parameter and observation, as a stick, or line, whose length is given by the size of the component. These can be very helpful in illustrating directions in model space associated with stability and instability, as well as directions in data space where noise will have a large effect on the estimated solution.

For example, recall the previous example, given by

    [1.00  1.00] [m1]   [2.00]
    [2.00  2.01] [m2] = [4.10]    (7.89)

The singular values and associated eigenvectors are given by lambda_1 = 3.169 and lambda_2 = 0.00316, with

    V_P = V = [0.706  -0.710]
              [0.709   0.704]

    U_P = U = [0.446  -0.895]
              [0.895   0.446]    (8.7)

From this information we may plot the following figure.

[Figure: stick figures showing the components of v_1 and v_2 along m1 and m2 in model space, and of u_1 and u_2 along d1 and d2 in data space, linked by G (which scales by lambda_i) and G_g^{-1} (which scales by 1/lambda_i).]

From V_P we see that v_1 = [0.706, 0.709]^T. Thus, on the figure for v_1, the component along m1 is 0.706, while the component along m2 is 0.709. Similarly, u_1 = [0.446, 0.895]^T, and thus the components along d1 and d2 are 0.446 and 0.895, respectively. For v_2 = [-0.710, 0.704]^T, the components along m1 and m2 are -0.710 and 0.704,
respectively. Finally, the components of u_2 = [-0.895, 0.446]^T along d1 and d2 are -0.895 and 0.446, respectively.

These figures illustrate, in a simple way, the mapping back and forth between model and data space. For example, the top figure shows that a unit-length change in model space in the [0.706, 0.709]^T direction will be mapped by G into a change of length 3.169 in data space in the [0.446, 0.895]^T direction. A unit-length change in data space along the [0.446, 0.895]^T direction will be mapped by G_g^{-1} into a change of length 1/3.169 = 0.316 in model space in the [0.706, 0.709]^T direction. This is a stable mapping back and forth, since noise in the data is damped when it is mapped back into model space.

The pairing between the v_2 and u_2 directions is less stable, however, since a unit-length change in data space parallel to u_2 will be mapped back into a change of length 1/0.00316 = 317 parallel to v_2. The v_2 direction in model space will therefore be associated with a very large variance. Since v_2 has significant components along both m1 and m2, both will individually have large variances, as seen in the unit model covariance matrix for this example, given by

    [cov_u m] = [ 50551  -50154]
                [-50154   49753]    (8.8)

For a particular inverse problem, these figures can help one understand both the directions in model space that affect the data the least (or most) and the directions in data space along which noise will affect the estimated solution the least (or most).

8.2 Including Prior Information, or the Weighted Generalized Inverse

8.2.1 Mathematical Background

As we have seen, the generalized inverse operator is a very powerful operator, combining the attributes of both least squares and minimum length estimators. Specifically, the generalized inverse minimizes both

    e^T e = (d - d^pre)^T (d - d^pre) = (d - Gm)^T (d - Gm)    (8.9)

and

    (m - <m>)^T (m - <m>)

where <m> is the a priori estimate of the solution. As discussed in Chapter 3, however, it is useful to include as much prior information in an inverse problem as possible. Two forms of prior
information were included in weighted least squares and weighted minimum length, resulting in the new minimization criteria

    e^T W_e e = e^T [cov d]^{-1} e = (d - Gm)^T [cov d]^{-1} (d - Gm)    (8.10)

and

    m^T W_m m = (m - <m>)^T [cov m]^{-1} (m - <m>)    (8.11)

where [cov d] and [cov m] are a priori data and model covariance matrices, respectively. It is possible to include this information in a generalized inverse analysis as well.

The basic procedure is as follows. First, transform the problem into a coordinate system where the new data and model parameters each have uncorrelated errors and unit variance. The transformations are based on the information contained in the a priori data and model parameter covariance matrices. Then, perform a generalized inverse analysis in the transformed coordinate system. This is the appropriate inverse operator because both of the covariance matrices are identity matrices there. Finally, transform everything back to the original coordinates to obtain the final solution.

One may assume that the data covariance matrix [cov d] is a positive definite Hermitian matrix. This is equivalent to assuming that all variances are positive and that none of the correlation coefficients are exactly equal to plus or minus one. Then the data covariance matrix can be decomposed as

    [cov d] = B Lambda_d B^T    (8.12)
    (N x N) = (N x N)(N x N)(N x N)

where Lambda_d is a diagonal matrix containing the eigenvalues of [cov d], and B is an orthonormal matrix containing the associated eigenvectors. B is orthonormal because [cov d] is Hermitian, and all of the eigenvalues are positive because [cov d] is positive definite.

The inverse data covariance matrix is easily found as

    [cov d]^{-1} = B Lambda_d^{-1} B^T    (8.13)
    (N x N) = (N x N)(N x N)(N x N)

where we have taken advantage of the fact that B is an orthonormal matrix. It is convenient to write the right-hand side of (8.13) as

    B Lambda_d^{-1} B^T = D^T D    (8.14)
    (N x N) = (N x N)(N x N)

where

    D = Lambda_d^{-1/2} B^T    (8.15)

Thus

    [cov d]^{-1} = D^T D    (8.16)
    (N x N) = (N x N)(N x N)

The reason for writing the data covariance matrix in terms of D will be clear when we introduce
the transformed data vector. The covariance matrix itself can be expressed in terms of D as

cov d = { [cov d]^{-1} }^{-1} = [D^T D]^{-1} = D^{-1} [D^T]^{-1}     (8.17)

Similarly, the positive definite Hermitian model covariance matrix may be decomposed as

cov m = M Λ_m M^T     (8.18)
(M×M)   (M×M)(M×M)(M×M)

where Λ_m is a diagonal matrix containing the eigenvalues of cov m, and M is an orthonormal matrix containing the associated eigenvectors. The inverse model covariance matrix is thus given by

[cov m]^{-1} = M Λ_m^{-1} M^T     (8.19)

where, as before, we have taken advantage of the fact that M is an orthonormal matrix. The right-hand side of (8.19) can be written as

M Λ_m^{-1} M^T = S^T S     (8.20)

where

S = Λ_m^{-1/2} M^T     (8.21)

Thus

[cov m]^{-1} = S^T S     (8.22)

As before, it is possible to write the covariance matrix in terms of S as

cov m = { [cov m]^{-1} }^{-1} = [S^T S]^{-1} = S^{-1} [S^T]^{-1}     (8.23)

8.2.2 Coordinate System Transformation of Data and Model Parameter Vectors

The utility of D and S can now be seen as we introduce transformed data and model parameter vectors. First, we introduce a transformed data vector d′:

d′ = Λ_d^{-1/2} B^T d     (8.24)

or

d′ = D d     (8.25)

The transformed model parameter vector m′ is given by

m′ = Λ_m^{-1/2} M^T m     (8.26)

or

m′ = S m     (8.27)

The forward operator G must also be transformed into G′, the forward operator in the new coordinates. The transformation can be found by recognizing that

G′ m′ = d′     (8.28)

G′ S m = D d     (8.29)

or

D^{-1} G′ S m = d = G m     (8.30)

That is,

D^{-1} G′ S = G     (8.31)

Finally, by pre- and postmultiplying by D and S^{-1}, respectively, we obtain G′ as

G′ = D G S^{-1}     (8.32)

The transformations back from the primed coordinates to the original coordinates are given by

d = B Λ_d^{1/2} d′     (8.33)     or     d = D^{-1} d′     (8.34)

m = M Λ_m^{1/2} m′     (8.35)     or     m = S^{-1} m′     (8.36)

and

G = B Λ_d^{1/2} G′ Λ_m^{-1/2} M^T     (8.37)     or     G = D^{-1} G′ S     (8.38)

In the new coordinate system, the generalized inverse will minimize

e′^T e′ = [d′ − d′^pre]^T [d′ − d′^pre] = [d′ − G′m′]^T [d′ − G′m′]     (8.39)

and m′^T m′. Replacing d′, m′, and G′ in (8.39) with Equations (8.24)–(8.32), we have

[d′ − G′m′]^T [d′ − G′m′] = [Dd − DGS^{-1}Sm]^T [Dd − DGS^{-1}Sm]
                          = [Dd − DGm]^T [Dd − DGm]
                          = [D(d − Gm)]^T [D(d − Gm)]
                          = [d − Gm]^T D^T D [d − Gm]
                          = [d − Gm]^T [cov d]^{-1} [d − Gm]     (8.40)

where we have used (8.16) to replace D^T D with [cov d]^{-1}.
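As a quick numerical check of the decompositions above, the following sketch (Python with NumPy; the 2×2 covariance values are hypothetical, chosen only for illustration) builds D = Λ_d^{-1/2} B^T from an assumed data covariance matrix and verifies Equation (8.16), as well as the fact that the transformed data d′ = D d have uncorrelated errors with unit variance.

```python
import numpy as np

# Hypothetical 2x2 a priori data covariance matrix (illustrative values only)
cov_d = np.array([[4.0, -1.0],
                  [-1.0, 9.0]])

# Decompose cov d = B Lambda_d B^T (Eq. 8.12); eigh is appropriate for a
# symmetric positive-definite matrix and returns orthonormal eigenvectors in B
lam_d, B = np.linalg.eigh(cov_d)

# D = Lambda_d^{-1/2} B^T (Eq. 8.15)
D = np.diag(lam_d ** -0.5) @ B.T

# Eq. 8.16: [cov d]^{-1} = D^T D
assert np.allclose(D.T @ D, np.linalg.inv(cov_d))

# The transformed data d' = D d (Eq. 8.25) have covariance
# D (cov d) D^T = I, i.e., uncorrelated errors with unit variance
assert np.allclose(D @ cov_d @ D.T, np.eye(2))
```

The identical construction applied to cov m, with M and Λ_m, gives the model-space transformation S of Equation (8.21).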
Equation (8.40) shows that the unweighted misfit in the primed coordinate system is precisely the weighted misfit to be minimized in the original coordinates. Thus, the least squares solution in the primed coordinate system is equivalent to weighted least squares in the original coordinates. Furthermore, using (8.27) for m′, we have

m′^T m′ = [Sm]^T [Sm] = m^T S^T S m = m^T [cov m]^{-1} m     (8.41)

where we have used (8.22) to replace S^T S with [cov m]^{-1}. Equation (8.41) shows that the unweighted minimum length solution in the new coordinate system is equivalent to the weighted minimum length solution in the original coordinate system.

8.2.3 The Maximum Likelihood Inverse Operator, Resolution, and Model Covariance

The generalized inverse operator in the primed coordinates can be transformed into an operator in the original coordinates. We will show later that this is, in fact, the maximum likelihood operator in the case where all distributions are Gaussian. Let this inverse operator be G_MX^{-1}, given by

G_MX^{-1} = [D^{-1} G′ S]_g^{-1} = S^{-1} G′_g^{-1} D     (8.42)

The solution in the original coordinates, m_MX, can be expressed either as

m_MX = G_MX^{-1} d     (8.43)

or as

m_MX = S^{-1} m′_g = S^{-1} G′_g^{-1} d′     (8.44)

Now that the operator has been expressed in the original coordinates, it is possible to calculate the resolution matrices and an a posteriori model covariance matrix. The model resolution matrix R is given by

R = G_MX^{-1} G
  = S^{-1} G′_g^{-1} D D^{-1} G′ S
  = S^{-1} G′_g^{-1} G′ S
  = S^{-1} R′ S     (8.45)

where R′ is the model resolution matrix in the transformed coordinate system. Similarly, the data resolution matrix N is given by

N = G G_MX^{-1}
  = D^{-1} G′ S S^{-1} G′_g^{-1} D
  = D^{-1} G′ G′_g^{-1} D
  = D^{-1} N′ D     (8.46)

The a posteriori model covariance matrix cov m_p is given by

cov m_p = G_MX^{-1} [cov d] [G_MX^{-1}]^T     (8.47)

Replacing cov d in (8.47) with (8.17) gives

cov m_p = G_MX^{-1} D^{-1} [D^T]^{-1} [G_MX^{-1}]^T
        = S^{-1} G′_g^{-1} D D^{-1} [D^T]^{-1} D^T [G′_g^{-1}]^T [S^{-1}]^T
        = S^{-1} G′_g^{-1} [G′_g^{-1}]^T [S^{-1}]^T
        = S^{-1} [cov_u(m′)] [S^{-1}]^T     (8.48)

That is, an a
posteriori estimate of model parameter uncertainties can be obtained by transforming the unit model covariance matrix from the primed coordinates back to the original coordinates.

It is important to realize that the transformations introduced by D and S in (8.24)–(8.38) are not, in general, orthonormal. Thus

d′ = D d     (8.25)

implies that the length of the transformed data vector d′ is, in general, not equal to the length of the original data vector d. The function of D is to transform the data space into one in which the data errors are uncorrelated and all observations have unit variance. If the original data errors are uncorrelated, the data covariance matrix will be diagonal, and B from

[cov d]^{-1} = B Λ_d^{-1} B^T     (8.13)

will be an identity matrix. Then D, given by

D = Λ_d^{-1/2} B^T     (8.15)

will be the diagonal matrix Λ_d^{-1/2}. The transformed data d′ are then given by

d′ = Λ_d^{-1/2} d

or

d′_i = d_i / σ_{d_i},   i = 1, N     (8.49)

where σ_{d_i} is the data standard deviation for the ith observation. That is, if the original data errors are uncorrelated, then each transformed observation is given by the original observation divided by its standard deviation. The transformation in this case can be thought of as leaving the direction of each axis in data space unchanged, but stretching or compressing each axis depending on the standard deviation. To see this, consider a vector in data space representing the d_1 axis. That is,

d = [d_1, 0, …, 0]^T     (8.50)

This data vector is transformed into

d′ = [d_1/σ_{d_1}, 0, …, 0]^T     (8.51)

That is, the direction of the axis is unchanged, but the magnitude is changed by 1/σ_{d_1}. If the data errors are correlated, then the axes in data space are rotated by B^T and then stretched or compressed.

Very similar arguments can be made about the role of S in model space. That is, if the a priori model covariance matrix is diagonal, then the directions of the transformed axes in model space are the same as in the original coordinates (i.e., m_1, m_2, …, m_M), but the lengths are stretched or compressed by the appropriate model parameter standard
deviations. If the errors are correlated, then the axes in model space are rotated by M^T before they are stretched or compressed.

8.2.4 Effect on Model- and Data-Space Eigenvectors

This stretching and compressing of directions in data and model space affects the eigenvectors as well. Let V″ be the set of vectors transformed back into the original coordinates from V′, the set of model eigenvectors in the primed coordinates. Thus

V″ = S^{-1} V′     (8.52)

For example, suppose that cov m is diagonal. Then

V″ = Λ_m^{1/2} V′     (8.53)

For v″_i, the ith vector in V″, this implies

v″_i = [σ_{m_1} v′_{1i}, σ_{m_2} v′_{2i}, …, σ_{m_M} v′_{Mi}]^T     (8.54)

Clearly, for a general diagonal cov m, v″_i will no longer have unit length. This is true whether or not cov m is diagonal. Thus, in general, the vectors in V″ are not unit-length vectors. They can, of course, be normalized to unit length. Perhaps more importantly, however, the directions of the v″_i have been changed, and the vectors in V″ are no longer perpendicular to each other. Thus, the vectors in V″ cannot be thought of as orthonormal eigenvectors, even if they have been normalized to unit length.

These vectors still play an important role in the inverse analysis, however. Recall that the solution m_MX is given by

m_MX = G_MX^{-1} d     (8.43)

or as

m_MX = S^{-1} m′_g = S^{-1} G′_g^{-1} d′     (8.44)

We can expand (8.44) as

m_MX = S^{-1} V′_P Λ′_P^{-1} U′_P^T D d = V″_P Λ′_P^{-1} U′_P^T D d     (8.55)

Recall that the solution m_MX can be thought of as a linear combination of the columns of the first matrix in a product of several matrices (see Equations 2.19–2.21). This implies that the solution m_MX consists of a linear combination of the columns of V″_P. The solution is still a linear combination of the vectors in V″_P even if they have been normalized to unit length. Thus, V″_P still plays a fundamental role in the inverse analysis.

It is important to realize that cov m will only affect the solution if P < M. If P = M, then V′_P = V′ and V′_P spans all of model space; V″_P will also span all of solution space. In this case, all of model space can be expressed as
a linear combination of the vectors in V″_P, even though they are not an orthonormal set of vectors. Thus, the same solution will be reached regardless of the values in cov m. If P < M, however, the mapping of vectors from the primed coordinates back to the original space can affect the part of solution space that is spanned by V″_P. We will return to this point later with a specific example.

Very similar arguments can be made for the data eigenvectors as well. Let U″ be the set of vectors obtained by transforming the data eigenvectors U′ in the primed coordinates back into the original coordinates. Then

U″ = D^{-1} U′     (8.56)

In general, the vectors in U″ will be neither of unit length nor perpendicular to each other. The predicted data d̂ are given by

d̂ = G m_MX = D^{-1} G′ S m_MX = D^{-1} U′_P Λ′_P V′_P^T S m_MX = U″_P Λ′_P V′_P^T S m_MX     (8.57)

Thus, the predicted data are a linear combination of the columns of U″_P. It is important to realize that the transformations introduced by cov d will only affect the solution if P < N. If P = N, then U′_P = U′ and U′_P spans all of data space. The matrix U″_P will also span all of data space. In this case, all of data space can be expressed as a linear combination of the vectors in U″_P, even though they are not an orthonormal set of vectors. Thus, the same solution will be reached regardless of the values in cov d. If P < N, however, the mapping of vectors from the primed coordinates back to the original space can affect the part of solution space that is spanned by U″_P.

We are now in a position to consider a specific example.

8.2.5 An Example

Consider the following specific example of the form Gm = d, where G and d are given by

G = [ 1.00  1.00 ]
    [ 2.00  2.00 ]
or columns of G are linearly dependent and thus the number of nonzero singular values is one Thus the rst column of V or U gives VP or Up while the second column gives V0 or U0 The generalized inverse solution mg must lie in VP space and is thus parallel to the 0707 0707T direction in model space Similarly the predicted data 1 must lie in Up space and is thus parallel to the 0447 0894T direction in data space The model resolution matrix R indicates that only the sum equally weighted of the model parameters m1 and m2 is resolved Similarly the data resolution matrix N indicates that only the sum of all and 612 with more weight on 612 is resolved or important in constraining the solution Now let us assume that the a priori data and model parameter covariance matrices are not equal to a constant times the identity matrix Suppose 4362 72052 cov d 72052 15638 190 Geosciences 567 CHAPTER 8 RMRGZ and 23128 5142 cov m 860 5142 10872 The data covariance matrix cov d can be decomposed as cov d 31deT 7 0985 70174 4000 0000 0985 0174 861 0174 0985 0000 16000 70174 0985 I 4362 72052 72052 15638 Recall that B contains the eigenvectors of the symmetric matrix cov 1 Furthermore these eigenvectors represent the directions of the major and minor axes of an ellipse Thus for the present case the rst vector in B 0985 0174T is the direction in data space of the minor aXis of an ellipse having a halflength of 4 Similarly the second vector in B 4174 0985T is the direction in data space of the major aXis of an ellipse having length 16 The eigenvectors in BT represent a 100 counterclockwise rotation of data space as shown below 16 The negative olT diagonal entries in cov 1 indicate a negative correlation of errors betwen all and 612 Compare the gure above with gure c on page 23 of these notes The inverse data covariance cov d 1 can also be written as cov d 1 BA dl RT 191 Geosciences 567 CHAPTER 8 RMRGZ DTD 862 where D is given by D A3 BT 863 0492 0087 0043 0246 Similarly the model covariance 
matrix cov m can be decomposed as

cov m = M Λ_m M^T
      = [ 0.940  −0.342 ] [ 25.000  0.000 ] [ 0.940   0.342 ]
        [ 0.342   0.940 ] [ 0.000   9.000 ] [ −0.342  0.940 ]     (8.64)

The matrix M^T represents a 20° counterclockwise rotation of the m_1 and m_2 axes in model space. In the new coordinate system, the a priori model parameter errors are uncorrelated and have variances of 25 and 9, respectively. The major and minor axes of the error ellipse are along the (0.940, 0.342)^T and (−0.342, 0.940)^T directions, respectively. (Figure omitted: the geometry of the problem in model space, with axes rotated 20°.)

The inverse model parameter covariance matrix [cov m]^{-1} can also be written as

[cov m]^{-1} = M Λ_m^{-1} M^T = S^T S     (8.65)

where S is given by

S = Λ_m^{-1/2} M^T = [ 0.188   0.068 ]
                     [ −0.114  0.313 ]     (8.66)

With the information in D and S, it is now possible to transform G, d, and m into G′, d′, and m′ in the new coordinate system:

G′ = D G S^{-1}
   = [ 0.492   0.087 ] [ 1.000  1.000 ] [ 4.698  −1.026 ]
     [ −0.043  0.246 ] [ 2.000  2.000 ] [ 1.710   2.819 ]
   = [ 4.268  1.194 ]
     [ 2.877  0.805 ]

and

d′ = D d = [ 0.492   0.087 ] [ 4.000 ] = [ 2.404 ]
           [ −0.043  0.246 ] [ 5.000 ]   [ 1.057 ]     (8.67)

In the new coordinate system, the data and model parameter covariance matrices are identity matrices. Thus, a generalized inverse analysis gives

P = 1 < M = N = 2,   λ′_1 = 5.345

V′ = [ 0.963  −0.269 ]     U′ = [ 0.829  −0.559 ]
     [ 0.269   0.963 ]          [ 0.559   0.829 ]

G′_g^{-1} = [ 0.149  0.101 ]
            [ 0.042  0.028 ]

R′ = [ 0.927  0.259 ]      N′ = [ 0.688  0.463 ]
     [ 0.259  0.073 ]           [ 0.463  0.312 ]

m′_g = [ 0.465 ]           d̂′ = [ 2.143 ]
       [ 0.130 ]                [ 1.445 ]

e′^T e′ = e′^T [cov d′]^{-1} e′ = 0.218     (8.68)

The results may be transformed back to the original coordinates using Equations (8.34), (8.36), (8.42), (8.45), (8.46), (8.52), and (8.56) as

G_MX^{-1} = S^{-1} G′_g^{-1} D = [ 0.305  0.167 ]
                                 [ 0.173  0.094 ]

m_MX = S^{-1} m′_g = [ 2.052 ]
                     [ 1.162 ]

e^T e = 2.670,     e^T [cov d]^{-1} e = 0.218

V̂ = [ 0.870  −0.707 ]      Û = [ 0.447  −0.479 ]
    [ 0.493   0.707 ]          [ 0.894   0.878 ]

R = [ 0.639  0.639 ]       N = [ 0.478  0.261 ]
    [ 0.362  0.362 ]           [ 0.956  0.522 ]     (8.69)

Note that e^T e = 2.670 for the weighted case is larger than the misfit e^T e = 1.800 for the unweighted case. This is to be expected, because the unweighted case should produce the smallest misfit. The weighted case provides an answer that gives more weight to better-known data, but it produces a larger total misfit.
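The unweighted half of this example is easy to reproduce numerically. A minimal sketch (Python with NumPy), using nothing beyond the G and d of Equation (8.58):

```python
import numpy as np

G = np.array([[1.0, 1.0],
              [2.0, 2.0]])
d = np.array([4.0, 5.0])

# Singular-value decomposition; the two rows of G are linearly
# dependent, so only the first singular value is nonzero (P = 1)
U, s, Vt = np.linalg.svd(G)

# Generalized inverse solution m_g = V_P (1/lambda_1) U_P^T d
m_g = Vt[0] * (U[:, 0] @ d) / s[0]

d_pre = G @ m_g          # predicted data
e = d - d_pre            # misfit vector

print(s[0])              # 3.162... (lambda_1 = sqrt(10))
print(m_g)               # [1.4, 1.4]
print(d_pre)             # [2.8, 5.6]
print(e @ e)             # e^T e = 1.8, as in Eq. 8.59
```

The weighted analysis follows the same pattern once D and S are built from the eigendecompositions of cov d and cov m.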
The Û and V̂ matrices were obtained by transforming each eigenvector in the primed coordinate system into a vector in the original coordinates and then scaling to unit length. Note that the vectors in Û and V̂ are not perpendicular to each other. Note also that the solution m_MX is parallel to the (0.870, 0.493)^T direction in model space, also given by the first column of V̂. The predicted data d̂ are parallel to the (0.447, 0.894)^T direction in data space, also given by the first column in Û.

The resolution matrices were obtained from the primed-coordinate resolution matrices after Equations (8.45)–(8.46). Note that they are no longer symmetric matrices, but that the trace has remained equal to one. The model resolution matrix R still indicates that only a sum of the two model parameters m_1 and m_2 is resolved, but now we see that the estimate of m_1 is better resolved than that of m_2. This may not seem intuitively obvious, since the a priori variance of m_2 is less than that of m_1, and thus m_2 is better known. Because m_2 is better known, the inverse operator will leave m_2 closer to its prior estimate. Thus, m_1 will be allowed to vary further from its prior estimate. It is in this sense that the resolution of m_1 is greater than that of m_2. The data resolution matrix N still indicates that only the sum of d_1 and d_2 is resolved, or important in constraining the solution. Now, however, the importance of the first observation has been increased significantly from the unweighted case, reflecting the smaller variance for d_1 compared to d_2.

8.3 Damped Least Squares and the Stochastic Inverse

8.3.1 Introduction

As we have seen, the presence of small singular values causes significant stability problems with the generalized inverse. One approach is simply to set small singular values to zero and relegate the associated eigenvectors to the zero spaces. This improves stability, with an inevitable decrease in resolution. Ideally, the cutoff value for small singular values should be based on how noisy the data are. In practice, however, the
decision is almost always arbitrary. We will now introduce a damping term, the function of which is to improve the stability of inverse problems with small singular values. First, however, we will consider another inverse operator: the stochastic inverse.

8.3.2 The Stochastic Inverse

Consider a forward problem given by

G m + n = d     (8.70)

where n is an N × 1 noise vector. It is similar to

G m = d     (1.13)

except that we explicitly separate out the contribution of noise to the total data vector d. This has some important implications, however. We assume that both m and n are stochastic (i.e., random variables, as described in Chapter 2, that are characterized by their statistical properties) processes with mean, or expected, values of zero. This is natural for noise, but implies that the mean value must be subtracted from all model parameters. Furthermore, we assume that we have estimates for the model parameter and noise covariance matrices, cov m and cov n, respectively.

The stochastic inverse is defined by minimizing the average, or statistical, discrepancy between m and G_SI^{-1} d, where G_SI^{-1} is the stochastic inverse. Let G_SI^{-1} = L, and determine L by minimizing

⟨ [ m_i − Σ_j L_ij d_j ]² ⟩     (8.71)

for each i. Consider repeated experiments in which m and n are generated. Let these values on the kth experiment be m^k and n^k, respectively. If there are a total of q experiments, then we seek the L which minimizes

(1/q) Σ_{k=1}^{q} [ m_i^k − Σ_{j=1}^{N} L_ij d_j^k ]²     (8.72)

The minimum of Equation (8.72) is found by differentiating with respect to L_il and setting the result equal to zero:

∂/∂L_il { (1/q) Σ_{k=1}^{q} [ m_i^k − Σ_{j=1}^{N} L_ij d_j^k ]² } = 0     (8.73)

or

(2/q) Σ_{k=1}^{q} [ m_i^k − Σ_{j=1}^{N} L_ij d_j^k ] d_l^k = 0     (8.74)

This implies

(1/q) Σ_{k=1}^{q} m_i^k d_l^k = (1/q) Σ_{k=1}^{q} Σ_{j=1}^{N} L_ij d_j^k d_l^k     (8.75)

The left-hand side of Equation (8.75), when taken over i and l, is simply the covariance matrix between the model parameters and the data, or

cov(md) = ⟨m d^T⟩     (8.76)

The right-hand side, again taken over i and l, and recognizing that L will not vary from experiment to experiment, gives (see Equation 2.42c)
L ⟨d d^T⟩ = L cov d     (8.77)

where cov d is the data covariance matrix. Note that cov d is not the same matrix used elsewhere in these notes: as used here, cov d is a derived quantity based on cov m and cov n. With Equations (8.76) and (8.77), we can write Equation (8.75), taken over i and l, as

cov(md) = L cov d     (8.78)

or

L = cov(md) [cov d]^{-1}     (8.79)

We now need to rewrite cov d and cov(md) in terms of cov m, cov n, and G. This is done as follows:

cov d = ⟨d d^T⟩ = ⟨(Gm + n)(Gm + n)^T⟩
      = G⟨m m^T⟩G^T + G⟨m n^T⟩ + ⟨n m^T⟩G^T + ⟨n n^T⟩     (8.80)

If we assume that model parameter and noise errors are uncorrelated, that is, that ⟨m n^T⟩ = 0 = ⟨n m^T⟩, then Equation (8.80) reduces to

cov d = G⟨m m^T⟩G^T + ⟨n n^T⟩ = G [cov m] G^T + cov n     (8.81)

Similarly,

cov(md) = ⟨m d^T⟩ = ⟨m (Gm + n)^T⟩ = ⟨m m^T⟩G^T + ⟨m n^T⟩ = [cov m] G^T     (8.82)

if ⟨m n^T⟩ = 0. Replacing cov(md) and cov d in Equation (8.79) with the expressions from Equations (8.82) and (8.81), respectively, gives the definition of the stochastic inverse operator G_SI^{-1} as

G_SI^{-1} = [cov m] G^T { G [cov m] G^T + cov n }^{-1}     (8.83)

Then the stochastic inverse solution m_S is given by

m_S = G_SI^{-1} d = [cov m] G^T [cov d]^{-1} d     (8.84)

It is possible to decompose the symmetric covariance matrices cov d and cov m in exactly the same manner as was done for the maximum likelihood operator (Equations 8.14 and 8.20):

cov d = B Λ_d B^T = B Λ_d^{1/2} Λ_d^{1/2} B^T = D^{-1} [D^{-1}]^T     (8.85a)
[cov d]^{-1} = B Λ_d^{-1} B^T = D^T D     (8.85b)
cov m = M Λ_m M^T = M Λ_m^{1/2} Λ_m^{1/2} M^T = S^{-1} [S^{-1}]^T     (8.85c)
[cov m]^{-1} = M Λ_m^{-1} M^T = S^T S     (8.85d)

where Λ_d and Λ_m contain the eigenvalues of cov d and cov m, respectively, and the orthogonal matrices B and M contain the associated eigenvectors. At this point it is useful to reintroduce the set of transformations, based on the decompositions in (8.85), that take d, m, and G back and forth between the original coordinate system and a primed coordinate system:

m′ = S m     (8.86a)
d′ = D d     (8.86b)
G′ = D G S^{-1}     (8.86c)

m = S^{-1} m′     (8.87a)
d = D^{-1} d′     (8.87b)
G = D^{-1} G′ S     (8.87c)

Then Equation (8.84), using primed-coordinate variables, is given by

S^{-1} m′_S = [cov m] G^T [cov d]^{-1} d
            = S^{-1} [S^{-1}]^T [D^{-1} G′ S]^T D^T D D^{-1} d′
            = S^{-1} [S^{-1}]^T S^T G′^T [D^{-1}]^T D^T d′     (8.88a)

but [S^{-1}]^T S^T = I_M and [D^{-1}]^T D^T = I_N, and hence

S^{-1} m′_S = S^{-1} G′^T d′     (8.88b)

Premultiplying both sides by S yields

m′_S = G′^T d′     (8.89)
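Equations (8.84) and (8.89) give two routes to the same stochastic solution, and the equivalence is easy to check numerically. In the sketch below (Python with NumPy), G, d, and the two input covariance matrices are hypothetical values chosen only for illustration; note that cov d is the derived quantity of Equation (8.81), not an independent assumption.

```python
import numpy as np

G = np.array([[1.0, 1.0],
              [2.0, 2.0]])
d = np.array([4.0, 5.0])
cov_m = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # a priori model covariance (illustrative)
cov_n = np.eye(2)                # noise covariance (illustrative)

# Derived data covariance, Eq. 8.81
cov_d = G @ cov_m @ G.T + cov_n

# Route 1: stochastic inverse in the original coordinates, Eqs. 8.83-8.84
m_s = cov_m @ G.T @ np.linalg.inv(cov_d) @ d

# Route 2: transform to primed coordinates, apply m'_s = G'^T d' (Eq. 8.89),
# and transform back with m_s = S^{-1} m'_s
lam_d, B = np.linalg.eigh(cov_d)
D = np.diag(lam_d ** -0.5) @ B.T        # d' = D d
lam_m, M = np.linalg.eigh(cov_m)
S = np.diag(lam_m ** -0.5) @ M.T        # m' = S m
G_p = D @ G @ np.linalg.inv(S)          # G' = D G S^{-1}
m_s2 = np.linalg.inv(S) @ (G_p.T @ (D @ d))

assert np.allclose(m_s, m_s2)
```

Both routes agree because cov m = S^{-1}[S^{-1}]^T and [cov d]^{-1} = D^T D, exactly the identities used in deriving Equation (8.88).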
That is, the stochastic inverse in the primed coordinate system is simply the transpose of G′ in the primed coordinate system. Once you have found m′_S, you can transform back to the original coordinates to obtain the stochastic solution as

m_S = S^{-1} m′_S     (8.90)

The stochastic inverse minimizes the sum of the weighted model parameter vector and the weighted data misfit. That is, the quantity

m^T [cov m]^{-1} m + [d − d̂]^T [cov d]^{-1} [d − d̂]     (8.91)

is minimized. The generalized inverse (or maximum likelihood) operator minimizes each term individually, but not the sum.

It is important to realize that the transformations introduced in Equations (8.85), while of the same form and nomenclature as those introduced in the weighted generalized inverse case in Equations (8.14) and (8.20), differ in an important aspect. Namely, as mentioned after Equation (8.77), cov d is now a derived quantity given by Equation (8.81):

cov d = G [cov m] G^T + cov n     (8.81)

The data covariance matrix cov d is only equal to the noise covariance matrix cov n if you assume that the errors in m are exactly zero. Thus, before doing a stochastic inverse analysis and the transformations given in Equations (8.85), cov d must be constructed from the noise covariance matrix cov n and the mapping of model parameter uncertainties in cov m, as shown in Equation (8.81).

8.3.3 Damped Least Squares

We are now ready to see how this applies to damped least squares. Suppose

cov m = σ_m² I_M     (8.92)

and

cov n = σ_n² I_N     (8.93)

Define a damping term ε² as

ε² = σ_n² / σ_m²     (8.94)

The stochastic inverse operator from Equation (8.83) then becomes

G_SI^{-1} = G^T [G G^T + ε² I_N]^{-1}     (8.95)

To determine the effect of adding the ε² term, consider the following:

G G^T = U_P Λ_P² U_P^T     (7.44)

[G G^T]^{-1} exists only when P = N, and is given by

[G G^T]^{-1} = U_P Λ_P^{-2} U_P^T,   P = N     (8.96)

We can therefore write G G^T + ε² I as

G G^T + ε² I = [U_P | U_0] [ Λ_P² + ε² I_P        0        ] [U_P | U_0]^T     (8.97)
                           [       0         ε² I_{N−P}   ]

Thus

[G G^T + ε² I]^{-1} = [U_P | U_0] [ (Λ_P² + ε² I_P)^{-1}         0          ] [U_P | U_0]^T     (8.98)
                                  [          0           ε^{-2} I_{N−P}    ]

Explicitly multiplying Equation (8.98) out gives

[G G^T + ε² I]^{-1} = U_P [Λ_P² + ε² I_P]^{-1} U_P^T + ε^{-2} U_0 U_0^T

Next,
we write out Equation (8.95) using singular-value decomposition as

G_SI^{-1} = G^T [G G^T + ε² I_N]^{-1}
          = V_P Λ_P U_P^T { U_P [Λ_P² + ε² I_P]^{-1} U_P^T + ε^{-2} U_0 U_0^T }
          = V_P [ Λ_P / (Λ_P² + ε² I_P) ] U_P^T     (8.99)

since U_P^T U_0 = 0. Note the similarity between the stochastic inverse in Equation (8.99) and the generalized inverse

G_g^{-1} = V_P Λ_P^{-1} U_P^T

The net effect of the stochastic inverse is to suppress the contributions of eigenvectors with singular values less than ε. To see this, let us write out Λ_P [Λ_P² + ε² I_P]^{-1} explicitly:

Λ_P [Λ_P² + ε² I_P]^{-1} = diag{ λ_1/(λ_1² + ε²), λ_2/(λ_2² + ε²), …, λ_P/(λ_P² + ε²) }     (8.100)

If λ_i >> ε, then λ_i/(λ_i² + ε²) → 1/λ_i, the same as the generalized inverse. If λ_i << ε, then λ_i/(λ_i² + ε²) → λ_i/ε² → 0. The stochastic inverse thus dampens the contributions of eigenvectors associated with small singular values. The stochastic inverse in Equation (8.95) also looks similar to the minimum length inverse

G_ML^{-1} = G^T [G G^T]^{-1}     (3.58)

To see why the stochastic inverse is also called damped least squares, consider the following:

[G^T G + ε² I_M]^{-1} G^T = { V_P [Λ_P² + ε² I_P]^{-1} V_P^T + ε^{-2} V_0 V_0^T } V_P Λ_P U_P^T
                          = V_P [Λ_P² + ε² I_P]^{-1} Λ_P U_P^T
                          = V_P [ Λ_P / (Λ_P² + ε² I_P) ] U_P^T
                          = G^T [G G^T + ε² I_N]^{-1}     (8.101)

Thus

[G^T G + ε² I_M]^{-1} G^T = G^T [G G^T + ε² I_N]^{-1}     (8.102)

The choice of σ_m² is often arbitrary; thus ε² is often chosen arbitrarily to stabilize the problem. Solutions are obtained for a variety of ε², and a final choice is made based on the a posteriori model covariance matrix.

The stability gained with damped least squares is not obtained without loss elsewhere. Specifically, resolution degrades with increased damping. To see this, consider the model resolution matrix for the stochastic inverse:

R = G_SI^{-1} G = V_P [ Λ_P² / (Λ_P² + ε² I_P) ] V_P^T     (8.103)

It is easy to see that the stochastic inverse model resolution matrix reduces to the generalized inverse case when ε² goes to 0, as expected. The reduction in model resolution can be seen by considering the trace of R:

trace(R) = Σ_{i=1}^{P} λ_i² / (λ_i² + ε²) ≤ P     (8.104)

Similarly, the data resolution matrix N is given by

N = G G_SI^{-1} = U_P [ Λ_P² / (Λ_P² + ε² I_P) ] U_P^T     (8.105)

with

trace(N) = Σ_{i=1}^{P} λ_i² / (λ_i² + ε²) ≤ P     (8.106)

Finally, consider the unit model covariance matrix cov_u(m), given by

cov_u(m) = G_SI^{-1} [G_SI^{-1}]^T =
V_P [ Λ_P / (Λ_P² + ε² I_P) ]² V_P^T     (8.107)

which reduces to the generalized inverse case when ε² = 0. The introduction of ε² reduces the size of the covariance terms, a reflection of the stability added by including a damping term.

An alternative approach to damped least squares is achieved by adding equations of the form

ε m_i = 0,   i = 1, 2, …, M     (8.108)

to the original set of equations G m = d (1.13). The combined set of equations can be written in partitioned form as

[   G    ] m = [ d ]     (8.109)
[ ε I_M  ]     [ 0 ]
(N+M)×M        (N+M)×1

The least squares solution to Equation (8.109) is given by

m = [G^T G + ε² I_M]^{-1} G^T d     (8.110)

The addition of ε² I_M ensures a least squares solution because G^T G + ε² I_M will have no eigenvalues less than ε², and hence is invertible.

In signal processing, the addition of ε² is equivalent to adding white noise to the signal. Consider transforming

[G^T G + ε² I_M] m = G^T d     (8.111)

into the frequency domain as

[ F*(ω) F(ω) + ε² ] M(ω) = F*(ω) F_0(ω)     (8.112)

where F(ω) is the Fourier transform of the input waveform to some filter, * represents the complex conjugate, F_0(ω) is the Fourier transform of the output waveform from the filter, M(ω) is the Fourier transform of the impulse response of the filter, and ε² is a constant for all frequencies ω. Solving for m as the inverse Fourier transform of Equation (8.112) gives

m = FT^{-1} { F*(ω) F_0(ω) / [ F*(ω) F(ω) + ε² ] }     (8.113)

The addition of ε² in the denominator assures that the solution is not dominated by small values of F(ω), which can arise when the signal-to-noise ratio is poor. Because the ε² term is added equally at all frequencies, this is equivalent to adding white light to the signal.

Damping is particularly useful in nonlinear problems. In nonlinear problems, small singular values can produce very large changes, or steps, during the iterative process. These large steps can easily violate the assumption of linearity in the region where the nonlinear problem was linearized. In order to limit step sizes, an ε² term can be added. Typically, one uses a fairly large value of ε² during the initial phase of the iterative procedure, gradually letting ε² go
to zero as the solution is approached.

Recall that the generalized inverse minimized [d − Gm]^T [d − Gm] and m^T m individually. Consider a new function E to minimize, defined by

E = [d − Gm]^T [d − Gm] + ε² m^T m
  = m^T G^T G m − m^T G^T d − d^T G m + d^T d + ε² m^T m     (8.114)

Differentiating E with respect to m^T and setting the result equal to zero yields

∂E/∂m^T = G^T G m − G^T d + ε² m = 0

or

[G^T G + ε² I_M] m = G^T d     (8.115)

This shows why damped least squares minimizes a weighted sum of the misfit and the length of the model parameter vector.

8.4 Ridge Regression

8.4.1 Mathematical Background

Recall the least squares operator

[G^T G]^{-1} G^T     (8.116)

If the data covariance matrix cov d is given by

cov d = σ² I     (8.117)

then the a posteriori model covariance matrix cov m (also called the dispersion of m) is given by

cov m = σ² [G^T G]^{-1}     (8.118)

In terms of singular-value decomposition, it is given by

cov m = σ² V_P Λ_P^{-2} V_P^T     (8.119)

This can also be written as

cov m = σ² [V_P | V_0] [ Λ_P^{-2}  0 ] [V_P | V_0]^T     (8.120)
                       [    0      0 ]

The total variance is defined as the trace of the model covariance matrix, given by

trace(cov m) = σ² trace{ [G^T G]^{-1} } = σ² Σ_{i=1}^{P} 1/λ_i²     (8.121)

which follows from the fact that the trace of a matrix is invariant under an orthogonal coordinate transformation. It is clear from Equation (8.121) that the total variance will get large as λ_i gets small. We saw that the stochastic inverse operator

G_SI^{-1} = [G^T G + ε² I_M]^{-1} G^T = G^T [G G^T + ε² I_N]^{-1}     (8.102)

resulted in a reduction of the model covariance (8.107). In fact, the addition of ε² to each diagonal entry of G^T G results in a total variance defined by

trace(cov m) = σ² trace{ [G^T G + ε² I]^{-1} } = σ² Σ_{i=1}^{P} λ_i² / (λ_i² + ε²)²     (8.122)

Clearly, Equation (8.122) is less than (8.121) for all ε² > 0.

8.4.2 The Ridge Regression Operator

The stochastic inverse operator of Equation (8.102) is also called ridge regression, for reasons that will be explained shortly. The ridge regression operator is derived as follows. We seek an operator that finds a solution m_RR that is closest to the origin, as in the minimum length case, subject to the constraint that the solution lie on an ellipsoid defined by

[m_RR − m_LS]^T
G^T G [m_RR − m_LS] − ε_0 = 0     (8.123)
(1×M)(M×M)(M×1)   (1×1)

where m_LS is the least squares solution (i.e., that obtained by setting ε² equal to 0). Equation (8.123) represents a single-equation quadratic in m_RR. The ridge regression operator G_RR^{-1} is obtained using Lagrange multipliers. We form the function

J(m_RR) = m_RR^T m_RR + λ{ [m_RR − m_LS]^T G^T G [m_RR − m_LS] − ε_0 }     (8.124)

and differentiate with respect to m_RR to obtain

m_RR + λ G^T G [m_RR − m_LS] = 0     (8.125)

Solving Equation (8.125) for m_RR gives

[λ G^T G + I_M] m_RR = λ G^T G m_LS

or

m_RR = [λ G^T G + I_M]^{-1} λ G^T G m_LS     (8.126)

The least squares solution m_LS is given by

m_LS = [G^T G]^{-1} G^T d     (3.27)

Substituting m_LS from Equation (3.27) into (8.126),

m_RR = [λ G^T G + I_M]^{-1} λ G^T G [G^T G]^{-1} G^T d
     = [λ G^T G + I_M]^{-1} λ G^T d
     = [G^T G + (1/λ) I_M]^{-1} G^T d     (8.127)

If we let 1/λ = ε², then Equation (8.127) becomes

m_RR = [G^T G + ε² I_M]^{-1} G^T d     (8.128)

and the ridge regression operator G_RR^{-1} is defined as

G_RR^{-1} = [G^T G + ε² I_M]^{-1} G^T     (8.129)

In terms of singular-value decomposition, the ridge regression operator G_RR^{-1} is identical to the stochastic inverse operator and, following Equation (8.99), is given by

G_RR^{-1} = V_P [ Λ_P / (Λ_P² + ε² I_P) ] U_P^T     (8.130)

In practice, we determine ε² (and thus λ) by trial and error, with the attendant trade-off between resolution and stability. As defined, however, every choice of ε² is associated with a particular ε_0, and hence a particular ellipsoid from Equation (8.123). Changing ε_0 does not change the orientation of the ellipsoid; it simply stretches or contracts the major and minor axes. We can think of the family of ellipsoids defined by varying ε² (or ε_0) as a ridge in solution space, with each particular ε² (or ε_0) being a contour of the ridge. We then obtain the ridge regression solution by following one of the contours around the ellipsoid until we find the point closest to the origin; hence the name ridge regression.

8.4.3 An Example of Ridge Regression Analysis

A simple example will help clarify the ridge regression operator. Consider the following problem of the form G m = d:

[ 2  0 ] [ m_1 ] = [ 8 ]
[ 0  1 ] [ m_2 ]   [ 4 ]     (8.131)

Singular-value decomposition gives

U_P = U = I_2,   V_P = V = I_2,   Λ = Λ_P = [ 2  0 ]
                                            [ 0  1 ]     (8.132)

The generalized inverse G_g^{-1} is given by

G_g^{-1} = V_P Λ_P^{-1} U_P^T = [ 1/2  0 ]
                                [  0   1 ]     (8.133)

The generalized
inverse solution (also the exact, or least squares, solution) is

m_LS = G_g^{-1} d = [ 4 ]
                    [ 4 ]     (8.134)

The ridge regression solution is given by

m_RR = G_RR^{-1} d = V_P [ Λ_P / (Λ_P² + ε² I_P) ] U_P^T d
     = [ 2/(4 + ε²)      0       ] [ 8 ]
       [     0       1/(1 + ε²)  ] [ 4 ]
     = [ 16/(4 + ε²) ]
       [  4/(1 + ε²) ]     (8.135)

Note that for ε² = 0 the least squares solution is recovered. Also, as ε² → ∞, the solution goes to the origin. Thus, as expected, the solution varies from the least squares solution to the origin as more and more weight is given to minimizing the length of the solution vector.

We can now determine the ellipsoid associated with a particular value of ε². For example, let ε² = 1. Then the ridge regression solution from Equation (8.135) is

[ m_1 ]      = [ 16/5 ] = [ 3.2 ]
[ m_2 ]_RR     [ 4/2  ]   [ 2.0 ]     (8.136)

Now, returning to the constraint Equation (8.123), we have

[ m_1 − 4.0 ]^T [ 4  0 ] [ m_1 − 4.0 ] − ε_0 = 0
[ m_2 − 4.0 ]   [ 0  1 ] [ m_2 − 4.0 ]

or

4(m_1 − 4.0)² + (m_2 − 4.0)² − ε_0 = 0     (8.137)

To find ε_0, we substitute the solution from Equation (8.136) into (8.137):

4(3.2 − 4.0)² + (2.0 − 4.0)² = ε_0

or

ε_0 = 6.56     (8.138)

Substituting ε_0 from Equation (8.138) back into (8.137) and rearranging gives

(m_1 − 4.0)²/1.64 + (m_2 − 4.0)²/6.56 = 1.0     (8.139)

Equation (8.139) is of the form

(x − h)²/b² + (y − k)²/a² = 1.0     (8.140)

which represents an ellipse centered at (h, k), with semimajor and semiminor axes a and b parallel to the y and x axes, respectively. Thus, for the current example, the lengths of the semimajor and semiminor axes are 2.56 and 1.28, respectively. The axes of the ellipse are parallel to the m_2 and m_1 axes, and the ellipse is centered at (4, 4). Different choices for ε² will produce a family of ellipses centered on (4, 4), with semimajor and semiminor axes parallel to the m_2 and m_1 axes, respectively, and with the semimajor axis always twice the length of the semiminor axis. The shape and orientation of the family of ellipses follow completely from the structure of the original G matrix. The axes of the ellipse coincide with the m_1 and m_2 axes because the original G matrix was diagonal. If the original G matrix had not been diagonal, the axes of the ellipse would have been inclined to the m_1 and
m_2 axes. The center of the ellipse, given by the least squares solution, is of course a function of both G and the data vector d.

(Figure omitted: the m_1–m_2 plane for ε² = 1, showing the least squares solution at (4, 4), the ellipse with semi-axes 1.28 and 2.56, and the ridge regression solution at (3.2, 2.0), the point on the ellipse closest to the origin.)

It is also instructive to plot the length squared of the solution, m_RR^T m_RR, as a function of ε². (Figure omitted: m_RR^T m_RR versus ε².) This figure shows that adding ε² damps the solution from least squares toward zero length as ε² increases.

Next, consider a plot of the total variance from Equation (8.122) as a function of ε², for data variance σ² = 1. (Figure omitted: trace of cov m versus ε².) The total variance decreases, as expected, as more damping is included.

Finally, consider the model resolution matrix R, given by

R = G_RR^{-1} G = V_P [ Λ_P² / (Λ_P² + ε² I_P) ] V_P^T     (8.141)

We can plot trace(R) as a function of ε². (Figure omitted: trace of R versus ε², decreasing from 2.) For ε² = 0, we have perfect model resolution, with trace(R) = P = 2 = M = N. As ε² increases, the model resolution decreases.

Comparing the plots of total variance and the trace of the model resolution matrix, we see that as ε² increases, stability improves (total variance decreases) while resolution degrades. This is an inevitable trade-off. In this particular simple example, it is hard to choose the most appropriate value for ε², because in fact the sizes of the two singular values differ very little. In general, when the singular values differ greatly, the plots of total variance and trace(R) can help us choose ε². If the total variance initially diminishes rapidly and then very slowly with increasing ε², choosing ε² near the bend in the total variance curve is most appropriate.

We have shown in this section how the ridge regression operator is formed and how it is equivalent to damped least squares and the stochastic inverse operator.

8.5 Maximum Likelihood

8.5.1 Background

The
maximum likelihood approach is fundamentally probabilistic in nature. A probability density function (PDF) is created in data space that assigns a probability P(d) to every point in data space. This PDF is a function of the model parameters, and hence P(d) may change with each choice of m. The underlying principle of the maximum likelihood approach is to find a solution m_MX such that P(d) is maximized at the observed data d^obs. Put another way, a solution m_MX is sought such that the probability of observing the observed data is maximized.

At first thought, this may not seem very satisfying. After all, in some sense there is a 100% chance that the observed data are observed, simply because they are the observed data. The point, however, is that P(d) is a calculated quantity which varies over data space as a function of m. Put this way, does it make sense to choose m such that P(d^obs) is small, meaning that the observed data are an unlikely outcome of some experiment? This is clearly not ideal. Rather, it makes more sense to choose m such that P(d^obs) is as large as possible, meaning that you have found an m for which the observed data (which exist with 100% certainty) are as likely an outcome as possible.

Imagine a very simple example with a single observation, where P(d) is Gaussian with fixed mean <d> and a variance σ² that is a function of some model parameter m. For the moment, we need not worry about how m affects σ², other than to realize that as m changes, so does σ². Consider the diagram below, where the vertical axis is probability and the horizontal axis is d. Shown on the diagram are d^obs (the observed datum), <d> (the mean value for the Gaussian P(d)), and two different P(d) curves based on two different variance estimates, σ₁² and σ₂², respectively.

[Figure: two Gaussian P(d) curves with common mean <d>, one sharply peaked (σ₁²) and one broad (σ₂²), with the observed datum d^obs far out in the tail]

The area under both P(d) curves is equal to one, since this represents integrating P(d) over all possible data values. The curve for σ₁², where σ₁² is small, is sharply peaked at <d> but is very small at d^obs. In fact, d^obs
appears to be several standard deviations σ from <d>, indicating that d^obs is a very unlikely outcome. P(d) for σ₂², on the other hand, is not as sharply peaked at <d>, but because the variance is larger, P(d) is larger at the observed datum d^obs. You could imagine letting σ² get very large, in which case values far from <d> would have P(d) larger than zero, but no value of P(d) would be very large. In fact, you could imagine P(d^obs) becoming smaller than the case for σ₂². Thus the object would be to vary m, and hence σ², such that P(d^obs) is maximized. Of course, for this simple example we have not worried about the mechanics of finding m, but we will later for more realistic cases.

A second example, after one near the beginning of Menke's Chapter 5, is also very illustrative. Imagine collecting a single datum N times in the presence of Gaussian noise. The observed data vector d^obs has N entries and hence lies in an N-dimensional data space. You can think of each observation as a random variable with the same mean <d> and variance σ², both of which are unknown. The goal is to find <d> and σ². We can cast this problem in our familiar Gm = d form by associating m with <d> and noting that G = [1, 1, ..., 1]^T. Consider the simple case where N = 2, shown on the next page.

[Figure: the (d1, d2) plane, showing d^obs, the line d1 = d2 spanned by U_P, the least squares point Q, and circular contours of P(d)]

The observed data d^obs are a point in the (d1, d2) plane. If we do singular-value decomposition on G, we see immediately that in general U_P = (1/√N)[1, 1, ..., 1]^T, and for our N = 2 case U_P = [1/√2, 1/√2]^T and U_0 = [−1/√2, 1/√2]^T. We recognize that all predicted data must lie in U_P space, which is a single vector. Every choice of m = <d> gives a point on the line d1 = d2 = ... = dN. If we slide <d> up to the point Q on the diagram, we see that all the misfit lies in U_0 space and we have obtained the least squares solution for <d>. Also shown on the figure are contours of P(d) based on σ². If σ² is small, the contours will be close together and P(d^obs) will be small. The contours are circular because the variance is the same for each d_i. Our N = 2 case has thus
reduced to the one-dimensional case discussed on the previous page, where some value of σ² will maximize P(d^obs). Menke (Chapter 5) shows that P(d) for the N-dimensional case with Gaussian noise is given by

P(d) = [σ^N (2π)^{N/2}]^{-1} exp[ −(1/(2σ²)) Σ_{i=1}^{N} (d_i − <d>)² ]    (8.142)

where the d_i are the observed data and <d> and σ are the unknown model parameters. The solution for <d> and σ is obtained by maximizing P(d). That is, the partials of P(d) with respect to <d> and σ are formed and set to zero. Menke shows that this leads to

<d>^est = (1/N) Σ_{i=1}^{N} d_i    (8.143)

and

σ^est = [ (1/N) Σ_{i=1}^{N} (d_i − <d>)² ]^{1/2}    (8.144)

We see that <d> is found independently of σ, and this shows why the least squares solution (point Q on the diagram) seems to be found independently of σ. Now, however, Equation (8.144) indicates that σ^est will vary for different choices of <d>, affecting P(d^obs). The example can be extended to the general data vector d case, where the Gaussian noise (possibly correlated in d) is described by the data covariance matrix [cov d]. Then it is possible to assume that P(d) has the form

P(d) ∝ exp{ −(1/2) [d − Gm]^T [cov d]^{-1} [d − Gm] }    (8.145)

We note that the exponential in Equation (8.145) reduces to the exponential in Equation (8.142) when [cov d] = σ²I and Gm gives the predicted data given by <d>. P(d) in Equation (8.145) is maximized when [d − Gm]^T [cov d]^{-1} [d − Gm] is minimized. This is, of course, exactly what is minimized in the weighted least squares (Equations 3.59 and 3.60) and weighted generalized inverse (Equation 8.10) approaches. We can make the very important conclusion that maximum likelihood approaches are equivalent to weighted least squares or weighted generalized inverse approaches when the noise in the data is Gaussian.

8.5.2 The General Case

We found in the generalized inverse approach that whenever P < M the solution is nonunique. The equivalent viewpoint with the maximum likelihood approach is that P(d) does not have a well-defined peak. In this case prior information, such as minimum length for the generalized inverse, must be added. We can think of
d^obs and [cov d] as prior information for the data, which we could summarize as P_A(d). The prior information about the model parameters could also be summarized as P_A(m), and could take the form of a prior estimate of the solution <m> and a covariance matrix [cov m]_A. Graphically (after Figure 5.9 in Menke), you can represent the joint distribution P_A(m, d) = P_A(m) P_A(d), detailing the prior knowledge of data and model spaces, as contours:

[Figure: contours of P_A(m, d) in the model-space/data-space plane, centered on the point (<m>, d^obs)]

where P_A(m, d) is contoured about (d^obs, <m>), the most likely point in the prior distribution. The contours are not inclined to the model or data axes because we assume that there is no correlation between our prior knowledge of d and m. As shown, the figure indicates less confidence in <m> than in the data. Of course, if the maximum likelihood approach were applied to P_A(m, d), it would return (d^obs, <m>), because there has not been any attempt to include the forward problem Gm = d. Each choice of m leads to a predicted data vector d^pre. In the schematic figure on the next page, the forward problem Gm = d is thus shown as a line in the model-space/data-space plane.

[Figure: the line Gm = d in the (m, d) plane, with the contours of P_A(m, d) about (<m>, d^obs); the maximum likelihood solution m^est lies where the contoured probability is greatest along the line, with predicted data d^pre]

The maximum likelihood solution m^est is the point where P_A(m, d) obtains its maximum value along the Gm = d curve. If you imagine that P_A(m, d) is very elongated along the model-space axis, this is equivalent to saying that the data are known much better than the prior model parameter estimate <m>. In this case, d^pre will be very close to the observed data d^obs, but the estimated solution m^est may be very far from <m>. Conversely, if P_A(m, d) is elongated along the data axis, then the data uncertainties are relatively large compared to the confidence in <m>, and m^est will be close to <m> while d^pre may be quite different from d^obs.

Menke also points out that there may be uncertainties in the theoretical forward relationship Gm = d. These may be expressed in terms of an N × N inexact-theory covariance matrix [cov g]. This covariance matrix deserves some comment. As in
any covariance matrix of a single term (e.g., d, m, or G), the diagonal entries are variances and the off-diagonal terms are covariances. What does the (1, 1) entry of [cov g] refer to, however? It turns out to be the variance of the first equation, or row, in G. Similarly, each diagonal term in [cov g] refers to an uncertainty of a particular equation (row) in G, and off-diagonal terms are covariances between rows in G. Each row in G times m gives a predicted datum. For example, the first row of G times m gives d1^pre. Thus a large variance for the (1, 1) term in [cov g] would imply that we do not have much confidence in the theory's ability to predict the first observation. It is easy to see that this is equivalent to saying that not much weight should be given to the first observation. We will see, then, that [cov g] plays a role similar to [cov d].

We are now in a position to give the maximum likelihood operator G_ML^{-1} in terms of G and the data [cov d], model parameter [cov m], and theory [cov g] covariance matrices as

G_ML^{-1} = [cov m] G^T { [cov d] + [cov g] + G [cov m] G^T }^{-1}    (8.146a)

= { G^T ([cov d] + [cov g])^{-1} G + [cov m]^{-1} }^{-1} G^T ([cov d] + [cov g])^{-1}    (8.146b)

where Equations (8.146a) and (8.146b) are equivalent. There are several points to make. First, as mentioned previously, [cov d] and [cov g] appear everywhere as a pair. Thus the two covariance matrices play equivalent roles. Second, if we ignore all of the covariance information, we see that Equation (8.146a) looks like G^T[GG^T]^{-1}, which is the minimum length operator. Third, if we again ignore all covariance information, Equation (8.146b) looks like [G^TG]^{-1}G^T, which is the least squares operator. Thus we see that the maximum likelihood operator can be viewed as some kind of combined weighted least squares and weighted minimum length operator.

The maximum likelihood solution m_MX is given by

m_MX = <m> + G_ML^{-1} [d − G<m>]    (8.147)

= <m> + G_ML^{-1} d − R<m> = G_ML^{-1} d + [I − R]<m>    (8.148)

where R is the model resolution matrix. Equation (8.148) explicitly shows the dependence of m_MX on the prior estimate of the solution <m>. If there is perfect
model resolution, then R = I and m_MX is independent of <m>. If the ith row of R is equal to the ith row of the identity matrix, then there will be no dependence of the ith entry in m_MX on the ith entry in <m>.

Menke points out that there are several interesting limiting cases for the maximum likelihood operator. We begin by assuming some simple forms for the covariance matrices:

[cov g] = σ_g² I_N    (8.149a)

[cov m] = σ_m² I_M    (8.149b)

[cov d] = σ_d² I_N    (8.149c)

In the first case we assume that the data and theory are much better known than <m>. In the limiting case, we can assume σ_d² = σ_g² = 0. If we do, then Equation (8.146a) reduces to G_ML^{-1} = G^T[GG^T]^{-1}, the minimum length operator. If we assume that [cov m] still has some structure, then Equation (8.146a) reduces to

G_ML^{-1} = [cov m] G^T { G [cov m] G^T }^{-1}    (8.150)

the weighted minimum length operator. If we assume only that σ_g² and σ_d² are much less than σ_m², and that 1/σ_m² goes to 0, then Equation (8.146b) reduces to G_ML^{-1} = [G^TG]^{-1}G^T, or the least squares operator. It is important to realize that [G^TG]^{-1} only exists when P = M, and [GG^T]^{-1} only exists when P = N. Thus either form, or both, may fail to exist, depending on P. The simplifying assumptions about σ_g², σ_d², and σ_m² can thus break down the equivalence between Equations (8.146a) and (8.146b).

A second limiting case involves assuming no confidence in either or both of the data or theory. That is, we let σ_d² and/or σ_g² go to infinity. Then we see that G_ML^{-1} goes to 0 and m_MX = <m>. This makes sense if we realize that we have assumed the data and/or the theory are useless, and hence we do not have a useful forward problem to move us away from our prior estimate <m>.

We have assumed in deriving Equations (8.146a) and (8.146b) that all of the covariance matrices represent Gaussian processes. In this case, we have shown that maximum likelihood approaches will yield the same solution as weighted least squares (P = M), weighted minimum length (P = N), or weighted generalized inverse approaches. If the probability density functions are not Gaussian, then maximum likelihood approaches can lead to different solutions.
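These limiting cases are easy to check numerically. The sketch below uses a hypothetical 3 × 2 G (not from the notes) and the simple covariances of Equations (8.149a-c), for which [cov d] + [cov g] = σ²I and Equation (8.146b) collapses to the damped least squares operator [G^TG + (σ²/σ_m²)I]^{-1}G^T:

```python
import numpy as np

# Hypothetical overdetermined toy problem (N = 3 > M = 2), not from the notes.
G = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

def g_ml_inv(G, s2, sm2):
    """Eq. (8.146b) with [cov d] + [cov g] = s2*I and [cov m] = sm2*I,
    which reduces to a damped least squares operator."""
    M = G.shape[1]
    return np.linalg.solve(G.T @ G + (s2 / sm2) * np.eye(M), G.T)

G_ls = np.linalg.solve(G.T @ G, G.T)   # least squares operator [G^T G]^{-1} G^T

# Data and theory known much better than <m>: recovers least squares.
G_lim = g_ml_inv(G, s2=1e-10, sm2=1.0)
print(np.max(np.abs(G_lim - G_ls)))    # ~0

# No confidence in the data and/or theory: the operator goes to 0,
# so Eq. (8.148) gives m_MX = <m>.
G_zero = g_ml_inv(G, s2=1e12, sm2=1.0)
print(np.max(np.abs(G_zero)))          # ~0
```

Driving σ_m² toward zero instead (total confidence in the prior) damps the operator toward zero in the same way, again returning m_MX = <m>.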
If the distributions are Gaussian, however, then all of the modifications introduced in Section 8.2 for the generalized inverse can be thought of as the maximum likelihood approach.

CHAPTER 9

CONTINUOUS INVERSE PROBLEMS AND OTHER APPROACHES

9.1 Introduction

Until this point, we have only considered discrete inverse problems, either linear or nonlinear, that can be expressed in the form

d = Gm    (1.13)

We now turn our attention to another branch of inverse problems, called continuous inverse problems, in which at least the model parameter vector m is replaced by a continuous function and the matrix G is replaced by an integral relationship. The general case of a linear continuous inverse problem, involving continuous data and a continuous model function, is given by a Fredholm equation of the first kind:

g(y) = ∫ k(x, y) m(x) dx    (continuous data)    (9.1)

where the data g are a continuous function of some variable y, the model m is a continuous function of some other variable x, and k, called the data kernel or Green's function, is a function of both x and y. In most situations, data are not continuous but rather are a finite sample. For example, analog (continuous) seismic data are digitized to become a finite, discrete data set. In the case where there are N data points, Equation (9.1) becomes

g_j = ∫ k_j(x) m(x) dx    (discrete data, j = 1, ..., N)    (9.2)

One immediate implication of a finite data set of dimension N with a continuous (infinite-dimensional) model is that the solution is underconstrained: if there is any solution m(x) that fits the data, there will be an infinite number of solutions that fit the data as well. This basic problem of nonuniqueness was encountered before for the discrete problem in the minimum length environment (Chapter 3). Given that all continuous inverse problems with discrete data are nonunique, and that almost all real problems have discrete data, the goals of a continuous inverse analysis can be somewhat different from those of a typical discrete analysis. Three possible goals of a continuous analysis include: (1)
find a model m(x) that fits the data g, also known as construction; (2) find unique properties, or values, of all possible solutions that fit the data by taking linear combinations of the data, also known as appraisal; and (3) find the values of other linear combinations of the model using the data g, also known as inference (Oldenburg, 1984). There are many parallels, and some fundamental differences, between discrete and continuous inverse theory. For example, we have encountered the construction phase before for discrete problems whenever we used some operator to find a solution that best fits the data. The appraisal phase is most similar to an analysis of resolution and stability for discrete problems. We have not encountered the inference phase before. Emphasis in this chapter on continuous inverse problems will be on the construction and appraisal phases, and the references (especially Oldenburg, 1984) can be used for further information on the inference phase.

The material in this chapter is based primarily on the following references:

Backus, G. E., and J. F. Gilbert, Numerical application of a formalism for geophysical inverse problems, Geophys. J. Roy. Astron. Soc., 13, 247-276, 1967.
Huestis, S. P., An introduction to linear geophysical inverse theory, unpublished manuscript, 1992.
Jeffrey, W., and R. Rosner, On strategies for inverting remote sensing data, Astrophys. J., 310, 463-472, 1986.
Oldenburg, D. W., An introduction to linear inverse theory, IEEE Trans. Geosci. Remote Sensing, Vol. GE-22, No. 6, 665-674, 1984.
Parker, R. L., Understanding inverse theory, Ann. Rev. Earth Planet. Sci., 5, 35-64, 1977.
Parker, R. L., Geophysical Inverse Theory, Princeton University Press, 1994.

9.2 The Backus-Gilbert Approach

There are a number of approaches to solving Equations (9.1) or (9.2). This chapter will deal exclusively with one approach, called the Backus-Gilbert approach, which was developed by geophysicists in the 1960s. This approach is based on taking a linear combination of the data g_j, given by

Σ_{j=1}^{N} α_j g_j = Σ_{j=1}^{N} α_j ∫ k_j(x) m(x) dx = ∫ [ Σ_{j=1}^{N} α_j k_j(x) ] m(x) dx    (9.3)

where the α_j are as yet undefined constants. The essence of the Backus-Gilbert approach is deciding how the α_j's are chosen. If the expression in square brackets, Σ α_j k_j(x), has certain special properties, it is possible, in theory, to construct a solution from the linear combination of the data. Specifically, if

Σ_{j=1}^{N} α_j k_j(x) = δ(x − x0)    (9.4)

where δ(x − x0) is the Dirac delta function at x0, then

m̂(x0) = Σ_{j=1}^{N} α_j g_j = ∫ δ(x − x0) m(x) dx = m(x0)    (9.5)

That is, by choosing the α_j such that Σ α_j k_j(x) is as much like a delta function as possible at x0, we can obtain an estimate of the solution at x0. The expression Σ α_j k_j(x) plays such a fundamental role in Backus-Gilbert theory that we let

Σ_{j=1}^{N} α_j k_j(x) = A(x, x0)    (9.6)

where A(x, x0) has many names in the literature, including averaging kernel, optimal averaging kernel, scanning function, and resolving kernel. The averaging kernel A(x, x0) may take many different forms, but in general it is a function which at least peaks near x0, as shown below.

[Figure: an averaging kernel A(x, x0) peaking at x = x0]

Recall from Equation (9.6) that the averaging kernel A(x, x0) is formed as a linear combination of a finite set of data kernels. An example of a set of three data kernels k_j(x) over the interval 0 < x < 4 is shown below.

[Figure: three data kernels k_j(x) on the interval 0 < x < 4]

One of the fundamental problems encountered in the Backus-Gilbert approach is that the finite set of data kernels k_j(x), j = 1, ..., N, is an incomplete set of basis functions from which to construct the Dirac delta function. You may recall from Fourier analysis, for example, that a spike (Dirac delta function) in the spatial domain has a white spectrum in the frequency domain, which implies that it takes an infinite sum of sine and cosine terms (basis functions) to construct the spike function. It is thus impossible for the averaging kernel A(x, x0) to exactly equal the Dirac delta function with a finite set of data kernels. Much of the Backus-Gilbert approach thus comes down to deciding how best to make A(x, x0) approach a delta function. Backus and Gilbert defined three measures of the
"deltaness" of A(x, x0), as follows:

J = ∫ [A(x, x0) − δ(x − x0)]² dx    (9.7)

K = 12 ∫ (x − x0)² [A(x, x0) − δ(x − x0)]² dx    (9.8)

W = ∫ [ H(x − x0) − ∫_{−∞}^{x} A(x', x0) dx' ]² dx    (9.9)

where H(x − x0) is the Heaviside, or unit step, function at x0. The smaller J, K, or W is, the more the averaging kernel approaches the delta function in some sense. Consider first the K criterion:

K = 12 ∫ { (x − x0)² A(x, x0)² − 2(x − x0)² A(x, x0) δ(x − x0) + (x − x0)² δ(x − x0)² } dx    (9.10)

The second and third terms drop out because δ(x − x0) is nonzero only when x = x0, and there the (x − x0)² term is zero. Thus

K = 12 ∫ (x − x0)² A(x, x0)² dx = 12 ∫ (x − x0)² [ Σ_{j=1}^{N} α_j k_j(x) ]² dx    (9.11)

We minimize K by taking the partials with respect to the α_i's and setting them to zero:

∂K/∂α_i = 12 ∫ (x − x0)² 2 [ Σ_{j=1}^{N} α_j k_j(x) ] k_i(x) dx = 24 Σ_{j=1}^{N} α_j ∫ (x − x0)² k_i(x) k_j(x) dx = 0    (9.12)

Writing out the sum over j explicitly for the ith partial derivative gives

[∫(x − x0)² k_i k_1 dx] α_1 + [∫(x − x0)² k_i k_2 dx] α_2 + ... + [∫(x − x0)² k_i k_N dx] α_N = 0    (9.13)

Combining the N partial derivatives and using matrix notation, this becomes the N × N system with entries B_ij = ∫(x − x0)² k_i(x) k_j(x) dx acting on the N × 1 vector of the α_j:

B α = 0    (9.14), (9.15)

But B α = 0 has the trivial solution α = 0. Therefore we add the constraint, with Lagrange multipliers, that

∫ A(x, x0) dx = 1    (9.16)

which says that the area under the averaging kernel is one, or that A(x, x0) is a unimodular function. Adding the constraint to the original K criterion creates a new criterion, K', given by

K' = 12 ∫ (x − x0)² A(x, x0)² dx + λ [ ∫ A(x, x0) dx − 1 ]    (9.17)

which leads to the bordered (N + 1) × (N + 1) system

Σ_{j=1}^{N} [ 24 ∫ (x − x0)² k_i(x) k_j(x) dx ] α_j + λ ∫ k_i(x) dx = 0,    i = 1, ..., N

Σ_{j=1}^{N} [ ∫ k_j(x) dx ] α_j = 1    (9.18)

or C α' = u, where α' = [α_1, ..., α_N, λ]^T and u = [0, ..., 0, 1]^T. Then the α_j's are found by inverting the square, symmetric matrix C. The factor of 12 in the original definition of K was added to facilitate the geometrical interpretation of K. Specifically, with the factor of 12 included, K = ε if A(x, x0) is a unimodular (i.e., ∫A(x, x0) dx = 1) boxcar of width ε centered on x0. In general, when K is small, it is found (Oldenburg, 1984, p. 669) that K is a good estimate
of the width of the averaging function A(x, x0) at half its maximum value. Thus K gives essential information about the resolving power of the data. If K is large, then the estimate of the solution m(x) will be a smoothed average of the true solution around x0, and the solution will have poor resolution. Of course, you can also look directly at A(x, x0) and get much the same information. If A(x, x0) has a broad peak around x0, then the solution will be poorly resolved in that neighborhood. Features of the solution m(x) with a scale length less than the width of A(x, x0) cannot be resolved (i.e., are nonunique). Thus, on the figure below, the high-frequency variations in m(x) are not resolved.

[Figure: a model m(x) with high-frequency structure narrower than the width of A(x, x0) near x0]

The analysis of the averaging kernel above falls within the appraisal phase of the possible goals of a continuous inverse problem. Since any solution to Equation (9.2) is nonunique, oftentimes the most important aspect of the inverse analysis is the appraisal phase.

The above discussion does not include possible data errors. Without data errors, inverting C to get the α_j's gives you the solution. Even if C is nearly singular, then, except for numerical instability, once you have the α_j's you have a solution. If, however, the data contain noise, then near-singularity of C can cause large errors in the solution for small fluctuations in data values. It may help to think of the analogy between k_j(x) and the jth row of G in the discrete case Gm = d. Then

A(x, x0) = Σ_{j=1}^{N} α_j k_j(x)

is equivalent to taking some linear combination of the rows of G (which are M-dimensional and therefore represent vectors in model space) and trying to make a row vector as close as possible to a row of the identity matrix. If there is a near-linear dependency of rows in G, then the coefficients in the linear combination will get large. One can also speak of the near interdependence of the data kernels k_j(x), j = 1, ..., N, in Hilbert space. If this is the case, then terms like ∫ k_i k_j dx can be large and C will be nearly singular. This will lead to large values for the α_j as the linear combination tries to
approximate δ(x − x0) with a set of basis functions (kernels) that have near interdependence. Another example arises in the figure after Equation (9.6), with three data kernels that are all nearly zero at x = 3. It is difficult to approximate a delta function near x = 3 with a linear combination of the data kernels, and the coefficients of that linear combination are likely to be quite large.

You can quantify the effect of the near-singularity of C by considering the data to have some covariance matrix [cov d]. Then the variance of your estimate of the solution m(x0) at x0 is

σ²_{m(x0)} = [α_1, α_2, ..., α_N] [cov d] [α_1, α_2, ..., α_N]^T    (9.20)

and if [cov d] = diag(σ_1², σ_2², ..., σ_N²), then

σ²_{m(x0)} = Σ_{j=1}^{N} α_j² σ_j²    (9.21)

(see Oldenburg, 1984, Equation 18, p. 670, or Jeffrey and Rosner, Equation 35, p. 467). If σ²_{m(x0)} is large, the solution is unstable. We now have two conflicting goals: (1) minimize K, versus (2) minimize σ²_{m(x0)}. This leads to trade-off curves of stability (small σ²_{m(x0)}) versus resolving power (small K). These trade-off curves are typically plotted as on the next page.

[Figure: trade-off curve of variance versus scanning width K; successively throwing away the larger α_j's moves the solution from best resolving power toward best stability, with an optimal point at the bend of the curve]

Typically, one gains a lot of stability without too much loss in resolving power early on, by throwing away the largest few α_j's. Another way to accomplish the same goal is called spectral expansion, or spectral synthesis. With this technique, you do singular-value decomposition on C and start throwing away the smallest singular values and associated eigenvectors until the error amplification (i.e., log σ²_{m(x0)}) is sufficiently small. In either case, you give up resolving power to gain stability. Small singular values (or large α_j) are associated with high-frequency components of the solution m(x). As you give up resolving power to gain stability, the width of the resolving kernel increases. Thus, in general, the narrower the resolving kernel, the better the resolution but the poorer the stability. Similarly, the wider the
resolving kernel, the poorer the resolution but the better the stability.

The logic behind the trade-off between resolution and stability can be looked at another way. With the best resolving power, you obtain the best fit to the data, in the sense of minimizing the difference between observed and predicted data. However, if the data are known to contain noise, then fitting the data exactly, or too well, would imply fitting the noise. It does not make sense to fit the data better than the data uncertainties in this case. We can quantify this relation for the case in which the data errors are Gaussian and uncorrelated, with standard deviation σ_j, by introducing

χ² = Σ_{j=1}^{N} (g_j − ĝ_j)² / σ_j²    (9.22)

where ĝ_j is the predicted jth datum, which depends on the choice of the α_j. χ² will be in the range 0 ≤ χ² ≤ ∞. If χ² = 0, then the data are fit perfectly, noise included; if χ² ≈ N, then the data are being fit at about the one-standard-deviation level. If χ² >> N, then the data are fit poorly. By using the trade-off curve or the spectral expansion method, you affect the choice of the α_j, and hence the predicted data ĝ_j, and ultimately the value of χ². Thus the best solution is obtained by adjusting the α_j until χ² ≈ N.

9.3 Neural Networks

Neural networks, based loosely on models of the brain and dating from research in the 1940s (long before the advent of computers), have become very successful in a wide range of applications: pattern recognition (one of the earliest applications); aerospace (autopilots, aircraft control systems); defense (weapon steering); signal processing; entertainment (animation and other special effects); manufacturing (quality control, product design); medicine (breast cancer cell analysis, EEG analysis, optimization of transplant times); oil and gas exploration; telecommunications (image and data compression); and speech (speech recognition). The basic idea behind neural networks is to design a system, based heavily on a parallel processing architecture, that can learn to solve problems of
interest. We begin our discussion of neural networks by introducing the neuron model, which predictably has a number of names, including (of course) neurons, but also nodes, units, and processing elements (PEs).

[Figure: a single neuron with input P, weight w, and output O]

The neuron sees some input P coming in, which is weighted by w before entering the neuron. The neuron processes this weighted input Pw and creates an output O. The output of the neuron can take many forms. In early models, the output was always +1 or −1, because the neural network was being used for pattern recognition. This neuron model is still used today (called the step function, or Heaviside function, among other things), but other neuron models have the output equal to the weighted input to the neuron (called the linear model) and, perhaps the most common of all, the sigmoid model, which looks like

[Figure: the sigmoid function, rising smoothly from 0 to 1]

and is often taken to be given by

S(x) = 1 / (1 + e^{−x})

where x is the weighted input to the neuron. This model has elements of both the linear and step-function models, but has the advantage over the step function of being continuously differentiable.

So how is the neuron model useful? It is useful because, like the brain, the neuron can be trained and can learn from experience. What it learns is w, the correct weight to give the input so that the output of the neuron matches some desired value. As an almost trivial example, let us assume that our neuron in the figure above behaves as a linear model, with the output equal to the weighted input. We train the neuron to learn the correct w by giving it examples of inputs and outputs that are true. For example, consider the examples

input = 1, output = 2
input = 10, output = 20
input = −20, output = −40
input = 4.2, output = 8.4

By inspection, we recognize that w = 2 is the correct solution. However, in general we start out not knowing what w should be. We thus begin by assuming it to be some number, perhaps 0 or, as is typically done, some random number. Let us assume that w = 0 for our example. Then, for the first test, with input 1, our output would be 0. The
network recognizes that the output does not match the desired output and changes w. There are a world of ways to change w based on the mismatch between the output and the desired output (more on this later), but let us assume that the system will increase w, though not all the way to 2, stopping at 0.5. Typically, neural networks do not change w to perfectly fit the desired output, because making large changes to w very often results in instability. Now our system, with w = 0.5, inputs 10 and outputs 5. It again increases w and moves on to the other examples. When it has cycled through the known input/output pairs once, this is called an epoch. Once it has gone through an epoch, it goes back to the first example and cycles again, until there is an acceptably small misfit between the outputs and desired outputs for all of the known examples. In neural network programs, the codes typically go through all of the known examples randomly, rather than sequentially, but the idea is the same. In our example, it will settle in eventually on w = 2. In more realistic examples, it takes hundreds to thousands of iterations (one iteration equals an epoch) to find an acceptable set of weights. In our nomenclature, we have the system d = Gm, and in this example G = 2. The beauty of the neural network is that it learned this relationship without even having to know it formally. It did it simply by adjusting weights until it was able to correctly match a set of example input/output pairs.

Suppose we change our example input/output pairs to

input = 1, output = 5
input = 10, output = 23
input = −20, output = −37
input = 4.2, output = 8.7

where the output has been shifted 3 units from the previous example. You may recognize that we will be unable to find a single weight w to make the output of our neuron equal the desired output. This leads to the first additional element in the neuron model, called the bias. Consider the diagram below.

[Figure: a neuron with input P, weight w, bias b, and output O]

The bias is something added to the weighted input Pw before the neuron processes it to make the output.
It is convention that the bias has an input of 1 and another weight (the bias, b) that is to be determined in the learning process. In this example, the neuron will learn that, as before, w = 2, and now that the bias b is 3. The next improvement to our model is to consider a neuron that has multiple inputs.

[Figure: a neuron with multiple inputs P_1, ..., P_N, weights w_1, ..., w_N, and bias b]

We now introduce the nomenclature that the process of the neuron is going to be some function of the sum of the weighted input vector and the bias b. That is, we describe the neuron by the function f(P·w + b), where

P·w + b = P_1 w_1 + P_2 w_2 + ... + P_N w_N + b = Σ_{i=1}^{N} P_i w_i + b    (9.27)

This is a very incomplete introduction to neural networks. Perhaps someday we will add more material.

9.4 The Radon Transform and Tomography (Approach 1)

9.4.1 Introduction

Consider a model space m(x, y) through which a straight-line ray is parameterized by the perpendicular distance u and angle θ. Position (x, y) and ray coordinates (u, s) are related by

[x; y] = [cos θ, −sin θ; sin θ, cos θ] [u; s]    (9.28)

with the inverse

[u; s] = [cos θ, sin θ; −sin θ, cos θ] [x; y]

(Equation 1.18 in Menke). If m(x, y) is a slowness (1/v) model and t(u, θ) is the travel time along a ray at distance u and angle θ, then

t(u, θ) = ∫ m(x, y) ds    (9.29)

is known as the Radon transform (RT). Another way of stating this is that t(u, θ) is the "projection" of m(x, y) onto the line defined by u and θ. The inverse problem is: given t(u, θ) for many values of u and θ, find the model m(x, y), i.e., the inverse Radon transform (IRT). Define a 2-D Fourier transform of the model:

m̂(k_x, k_y) = ∫∫ m(x, y) e^{−2πi(k_x x + k_y y)} dx dy    (9.30)

m(x, y) = ∫∫ m̂(k_x, k_y) e^{2πi(k_x x + k_y y)} dk_x dk_y    (9.31)

Now define the 1-D Fourier transform of the projection data:

t̂(k_u, θ) = ∫ t(u, θ) e^{−2πi k_u u} du    (9.32)

t(u, θ) = ∫ t̂(k_u, θ) e^{2πi k_u u} dk_u    (9.33)

Substituting Equation (9.29) into Equation (9.32) gives

t̂(k_u, θ) = ∫ [ ∫ m(x, y) ds ] e^{−2πi k_u u} du    (9.34)

Making a change of variables, ds du → dx dy, and using the fact that the Jacobian determinant is unity, we have

t̂(k_u, θ) = ∫∫ m(x, y) e^{−2πi(k_u cos θ x + k_u sin θ y)} dx dy = m̂(k_u cos θ, k_u sin θ)    (9.35)

Equation (9.35) states that the 1-D Fourier transform of the projected data is equal to the 2-D Fourier transform of the model. This relationship is
known as the Fourier central slice theorem, because for a fixed angle θ0 the projected data provide the Fourier transform of the model along a slice through the origin of the wavenumber space, i.e., along the line (k_x, k_y) = k_u (cos θ0, sin θ0).

[Figure: the central slice theorem; the 1-D Fourier transform of the projection at angle θ0 gives m̂ along the line at angle θ0 through the origin of the (k_x, k_y) plane]

Now we can invert the 2-D Fourier transform and recover the model:

m(x, y) = ∫∫ t̂(k_u, θ) e^{2πi k_u (x cos θ + y sin θ)} dk_x dk_y    (9.36)

Changing variables to polar coordinates gives

m(x, y) = ∫_0^{2π} ∫_0^{∞} t̂(k_u, θ) e^{2πi k_u (x cos θ + y sin θ)} k_u dk_u dθ    (9.37)

where the factor k_u arises because dk_x dk_y → k_u dk_u dθ. Then

m(x, y) = ∫_0^{π} ∫_{−∞}^{∞} t̂(k_u, θ) |k_u| e^{2πi k_u u} dk_u dθ = ∫_0^{π} m*(u = x cos θ + y sin θ, θ) dθ    (9.38)

where m* is obtained from the inverse Fourier transform of |k_u| t̂(k_u, θ). The Radon transform can be written more simply as

F(a, b) = ∫ f(x, a + bx) dx    (9.39)

where F is the shadow (projection), f is the model, and the integral is taken along the path t = a + bx, with the inverse

f(x, y) = ∫ Q(y − bx, b) db    (9.40)

where Q is the rho-filtered shadow and the integral is a path through the shadows. Note the similarity of the inverse and forward transforms, with a change of sign in the argument of the integrand.

[Figure: the rho filter |k_y| in the wavenumber domain, the model, and the reconstructed image]

9.4.2 Interpretation of Tomography Using the Radon Transform

[Figure: sources, projections, and an anomaly; back projection with one ray smears (blurs) the anomaly along the ray, while back projection with many rays focuses it]

The resolution and accuracy of the reconstructed image are controlled by the coverage of the sources. Because of this requirement of excellent coverage for a good reconstruction, the Radon transform approach is generally not used much in geophysics. One exception to this is in the area of seismic processing of petroleum industry data, where data density and coverage are in general much better.

9.4.3 Slant-Stacking as a Radon Transform (following Claerbout, 1985)

Let u(x, t) be a wavefield. An example would be as follows:

[Figure: a wavefield u(x, t) recorded at a line of receivers]

The slant stack of the wavefield is defined by

t̂(p, τ) = ∫ u(x, τ + px) dx    (9.41)

The integral along x is done at constant τ, which defines a slanting straight line in the (x, t) plane with slope p. Note the similarity of Equation (9.41) to Equation (9.39): Equation (9.41) is a
Radon transform To get a better feel for this concept let39s step back a little and consider the travel time equations of rays in a layered medium 231 Geosciences 567 CHAPTER 9 RMRGZ x1 x2 x3 The travel time equation is I 4212x2v or t2v2 x2 4212 942 Because the signs of the t2 and x2 terms are opposite this is an equation of a hyperbola Slant stacking is changing variables from t7 x to L39 7 p where L39 tipx 943 and dt 944 p dx From Equation 942 2t dt v2 2x dx so dt x 945 dx v2t The equations simplify if we introduce the parametric substitution 2 vtcos6 x vtsin6 946 so x ztan6 Now take the linear moveout Equation 943 and substitute for t and x using p sin 9 v T Z Smeztan6 vcos6 v Ecose 947 v This can be written a er some substitution as 1 pzvz 948 232 Geosciences 567 CHAPTER 9 RMRGZ and nally 2 This is an equation of an ellipse in T p space All this was to show that hyperbolic curves in t x space transform to ellipses in T p space gt gt V V T With this background consider how the wavef1eld in a layeroverhalfspace slant stacks or Radon transforms lv0 source gt direct wave VI F F re ected wave V0 direct wave direct wave 0 head wave v gt v head wave 1 0 n A couple of important features of the T p plot are as follows head wave V7 re ected wave t Curves cross one another in t x space but not in T p space 2 The p aXis has dimensions of l v 3 The velocity of each layer can be read from its T p curve as the inverse of the maXimum p value on its ellipse 4 Head waves are points in T p space located where the ellipsoids touch Returning to our original slantstack equation L7pTJux Tpxdx 941 This equation can be transformed into the Fourier space using Ukw r Jx teiwtikxdx dt 950 oo oo 233 Geosciences 567 CHAPTER 9 RMRGZ Let p k w Uwp w 1 Juxteiw mdx dt 951 and change of variables from Ito T t 7px Uwpw f zx t pxdx e mdr 952 lltg9417gt Insert Equation 941 to get Uwpw 1700 r e mdr 953 Think of this as a 17D function of w that is extracted from the kiw plane along the line k wp 
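Equation (9.41) is also easy to evaluate directly in the time domain. The sketch below is our own hypothetical discretization (the names, grids, and linear time-interpolation are assumptions, not from the notes):

```python
import numpy as np

def slant_stack(u, x, t, p_values):
    """Direct slant stack, Equation (9.41): ubar(p, tau) = sum over x of u(x, tau + p*x) dx.

    u        : 2-D array, u[i, j] = wavefield at offset x[i] and time t[j]
    x, t     : uniformly spaced offset and time axes
    p_values : slownesses (slopes) at which to stack
    """
    dx = x[1] - x[0]
    out = np.zeros((len(p_values), len(t)))
    for k, p in enumerate(p_values):
        for i, xi in enumerate(x):
            # sample trace i at the shifted times tau + p * x_i
            # (linear interpolation; zero outside the recorded window)
            out[k] += np.interp(t + p * xi, t, u[i], left=0.0, right=0.0) * dx
    return out
```

A dipping event t = τ0 + p0 x stacks coherently into a single peak at (p0, τ0), which is exactly the straight-line summation along slope p described in the text.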
Finally, taking the inverse Fourier transform,

	ū(p, τ) = ∫ U(ωp, ω) e^{−iωτ} dω		(9.54)

(constant factors of 2π are suppressed throughout). This result shows that a slant stack can be done by Fourier transform operations:

1. Fourier transform u(x, t) to U(k, ω).
2. Extract U(ωp, ω) from U(k, ω) along the line k = ωp.
3. Inverse Fourier transform from ω to τ.
4. Repeat for all interesting values of p.

Tomography is based on the same mathematics as inverse slant stacking. In simple terms, tomography is the reconstruction of a function given line integrals through it. The inverse slant stack is based on the inverse Fourier transform of Equation (9.50):

	u(x, t) = ∫ [ ∫ U(k, ω) e^{ikx} dk ] e^{−iωt} dω		(9.55)

Substitute k = ωp and dk = ω dp. Notice that when ω is negative, the integration with dp is from positive to negative. To keep the integration in the conventional sense, we introduce |ω|:

	u(x, t) = ∫ [ ∫ U(ωp, ω) |ω| e^{iωpx} dp ] e^{−iωt} dω		(9.56)

Changing the order of integration gives

	u(x, t) = ∫ { ∫ U(ωp, ω) e^{iωpx} |ω| e^{−iωt} dω } dp		(9.57)

Note that the term in braces contains the inverse Fourier transform of a product of three functions of frequency. The three functions are (1) U(ωp, ω), which is the Fourier transform of the slant stack; (2) e^{iωpx}, which can be thought of as a delay operator; and (3) |ω|, which is called the rho filter. A product of three terms in Fourier space is a convolution in the original space. Letting the delay px be a time shift in the argument,

	u(x, t) = rho(t) * ∫ ū(p, t − px) dp		(9.58)

This is the inverse slant-stack equation. Comparing the forward and inverse Equations (9.41) and (9.58), note that the inverse is basically another slant stack with a sign change:

	ū(p, τ) = ∫ u(x, τ + px) dx		(9.41)

	u(x, t) = rho(t) * ∫ ū(p, t − px) dp		(9.58)

9.5 Review of the Radon Transform: Approach 2

If m(x, y) is a 2-D slowness (1/v) model and t(u, θ) is the travel time along a ray parameterized by distance u and angle θ, then

	t(u, θ) = ∫ m(x, y) ds		(9.59)

is the Radon transform. The inverse problem is: given t(u, θ) for many values of u and θ, find the model m(x, y). Take the 1-D Fourier transform of the projection data:

	T(ku, θ) = ∫ t(u, θ) e^{−2πi ku u} du		(9.60)

Substitute Equation (9.59):

	T(ku, θ) = ∫ [ ∫ m(x, y) ds ] e^{−2πi ku u} du		(9.61)
Change variables ds du → dx dy:

	T(ku, θ) = ∫∫ m(x, y) e^{−2πi ku (x cos θ + y sin θ)} dx dy		(9.62)

Recognize that the right-hand side is a 2-D Fourier transform:

	M(kx, ky) = ∫∫ m(x, y) e^{−2πi(kx x + ky y)} dx dy		(9.63)

So

	T(ku, θ) = M(ku cos θ, ku sin θ)		(9.64)

In words: the 1-D Fourier transform of the projected data is equal to the 2-D Fourier transform of the model evaluated along a radial line in the (kx, ky) space with angle θ.

[Figure: the projection at angle θ0 Fourier transforms to M along the radial line at angle θ0 in the (kx, ky) plane.]

If the Radon transform is known for all values of u and θ, then the Fourier transform image M is known everywhere in the (kx, ky) plane. The model m(x, y) can then be found by taking the inverse Fourier transform of M(kx, ky).

Slant-stacking is also a Radon transform. Let u(x, t) be a wavefield. Then the slant stack is

	ū(p, τ) = ∫ u(x, τ + px) dx		(9.65)

[Figure: the integration path is a straight line of slope p in the (x, t) plane.]

Now consider a simple example of reflections in a layer.

[Figure: a source over a layer of thickness z and velocity v.]

	Travel time (t, x) equation:	t²v² = x² + 4z²	(hyperbolas)
	Slant-stack (τ, p) equation:	v²τ²/(4z²) + v²p² = 1	(ellipses)

With this background, consider how the complete wavefield in a layer slant stacks.

[Figure: in (x, t) the direct wave, reflected wave, and head wave (v1 > v0); in (τ, p) the direct and head waves map to points and the reflected wave to an ellipse.]

The (τ, p) wavefield can be transformed into (k, ω) space by a 2-D Fourier transform:

	U(k, ω) = ∫∫ u(x, t) e^{iωt − ikx} dx dt		(9.66)

The inverse slant stack (IRT) is based on the inverse Fourier transform of the above equation:

	u(x, t) = ∫ [ ∫ U(k, ω) e^{ikx} dk ] e^{−iωt} dω		(9.67)

Substitute k = ωp and dk = ω dp. Notice that when ω is negative, the integral with respect to dp is from +∞ to −∞. To keep the integration in the conventional sense, introduce |ω|:

	u(x, t) = ∫ [ ∫ U(ωp, ω) |ω| e^{iωpx} dp ] e^{−iωt} dω		(9.68)

Changing the order of integration,

	u(x, t) = ∫ { ∫ U(ωp, ω) e^{iωpx} |ω| e^{−iωt} dω } dp		(9.69)

The term in braces contains an inverse Fourier transform of a product of three functions of ω: (1) U(ωp, ω) is the Fourier transform of the slant stack evaluated at k = ωp, so the slant stack is given by

	ū(p, τ) = ∫ U(ωp, ω) e^{−iωτ} dω		(9.70)

(2) The e^{iωpx} term can be thought of as a delay operator, where τ = t − px. (3) The |ω| term is called the rho filter.
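The rho filter is just multiplication by |ω| in the frequency domain. A minimal FFT sketch (the helper name and sampling assumptions are our own, not from the notes):

```python
import numpy as np

def rho_filter(trace, dt=1.0):
    """Apply the |omega| ("rho") filter to a uniformly sampled time series."""
    n = trace.size
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=dt)  # angular frequencies
    # multiply the spectrum by |omega|, then transform back to time
    return np.real(np.fft.ifft(np.abs(omega) * np.fft.fft(trace)))
```

Because |ω| vanishes at ω = 0, the filter removes the DC level and boosts high frequencies, which is what sharpens the smeared back projection.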
Now we can rewrite the above equation as

	u(x, t) = rho(t) * ∫ ū(p, t − px) dp		(Inverse Radon Transform)	(9.71)

Compare with

	ū(p, τ) = ∫ u(x, τ + px) dx		(Radon Transform)	(9.41)

9.6 Alternative Approach to Tomography

An alternative approach to tomography is to discretize the travel time equation as follows.

[Figure: a 2-D slowness model m(x, y) divided into blocks, crossed by rays.]		(9.72)

	t_r = ∫_(ray r) m dl		(9.73)

With some assumptions we can linearize and discretize this equation to

	t_r = Σ_b l_rb m_b		(9.74)

where t_r = travel time of the rth ray, m_b = slowness of the bth block, and l_rb = length of the rth ray segment in the bth block. In matrix form,

	d = G m		(9.75)

We know how to solve this equation. But what if we have a 3-D object with 100 blocks on a side? Then M = 100³ = 10⁶, and even GᵀG is a matrix with M² elements, or 10¹². Try throwing that into your MATLAB program! So what can we do? Let's start at the beginning again:

	d = G m		(9.75)

	Gᵀd = GᵀG m		(9.76)

In a sense, Gᵀ is an approximate inverse operator that transforms a data vector into model space. Also in this respect, GᵀG can be thought of as a resolution matrix R that shows you the filter between your estimate of the model, Gᵀd, and the real model m. Since the ideal is R = I, let's try inverting Equation (9.76) by using only the diagonal elements of GᵀG:

	diag( Σ_i l_i1², Σ_i l_i2², . . . , Σ_i l_iM² ) m = Gᵀd		(9.77)

Then the solution can be computed simply:

	m_b = [ Σ_{i=1}^{N} l_ib t_i ] / [ Σ_{i=1}^{N} l_ib² ]		the "tomographic approximation"	(9.78)

Operationally, each ray is back projected, and at each block a ray hits, the contributions to the two sums are accumulated in two vectors. At the end, the contents of each element of the numerator vector are divided by each element of the denominator vector. This reduces storage requirements to 2M instead of M².

Let's see how this works for our simple four-parameter tomography problem (Chapter 3), in which four unit blocks are crossed by four rays: one along each row and one along each column.

[Figure: the 2 × 2 block model, with one ray along each row and each column; the block values fit the data.]

	G = | 1 1 0 0 |
	    | 0 0 1 1 |
	    | 1 0 1 0 |
	    | 0 1 0 1 |		(9.79)

	GᵀG = | 2 1 1 0 |
	      | 1 2 0 1 |
	      | 1 0 2 1 |
	      | 0 1 1 2 |		(9.80)

The approximate equation keeps only the diagonal of GᵀG:

	2 I m = Gᵀd		(9.81)

so

	m_est = (1/2) Gᵀd		(9.82)

Not bad, but this does not predict the data.
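The two-vector back projection just described can be sketched in a few lines. The sparse ray representation, function name, and the travel times in the example are our own hypothetical illustration:

```python
import numpy as np

def tomographic_approximation(rays, travel_times, n_blocks):
    """Back-projection estimate of Equation (9.78): m_b = sum(l_ib t_i) / sum(l_ib^2).

    rays : one list per ray of (block_index, segment_length) pairs,
           i.e., a sparse row-by-row representation of G.
    Only two length-M vectors are stored, instead of the M x M matrix GtG.
    """
    numer = np.zeros(n_blocks)  # accumulates l_rb * t_r
    denom = np.zeros(n_blocks)  # accumulates l_rb ** 2
    for segments, t in zip(rays, travel_times):
        for b, l in segments:
            numer[b] += l * t
            denom[b] += l * l
    m = np.zeros(n_blocks)
    hit = denom > 0.0           # leave untouched any block no ray crosses
    m[hit] = numer[hit] / denom[hit]
    return m

# hypothetical 2 x 2 example: unit-length segments, rays along both rows and both columns
rays = [[(0, 1.0), (1, 1.0)], [(2, 1.0), (3, 1.0)],
        [(0, 1.0), (2, 1.0)], [(1, 1.0), (3, 1.0)]]
m_est = tomographic_approximation(rays, [1.0, 1.0, 0.0, 2.0], 4)
```

As in the discussion above, the estimate captures the pattern of the model but in general does not reproduce the data exactly, which is what motivates the iterative refinements that follow.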
Can we improve on this? Yes. Following Menke, develop an iterative solution:

	GᵀG m = Gᵀd		(9.83)

	[ I − (I − GᵀG) ] m = Gᵀd		(9.84)

	m = Gᵀd + (I − GᵀG) m		(9.85)

so

	m_est^(i) = Gᵀd + (I − GᵀG) m_est^(i−1)		(9.86)

(cf. Menke). A slightly different iterative technique is

	m^(k) = D⁻¹ Gᵀd + D⁻¹ (D − GᵀG) m^(k−1)		(9.87)

where D is the diagonalized GᵀG matrix. Then, for m^(0) = 0, m^(1) = D⁻¹Gᵀd is the tomographic approximation solution.

These are still relatively rudimentary iterative techniques, and they do not always converge. Ideally, we want iterative techniques that converge to the least squares solution. A number of such techniques exist and have been used in tomography; they include the simultaneous iterative reconstruction technique (SIRT) and LSQR. See the literature.

This chapter is only an introduction to continuous inverse theory, neural networks, and the Radon transform. The references cited at the beginning of the chapter are an excellent place to begin a deeper study of continuous inverse problems. The Backus-Gilbert approach provides one way to handle the construction (i.e., finding a solution) and appraisal (i.e., what do we really know about the solution?) phases of a continuous inverse analysis. In this sense, these are the same tasks that were covered in detail in the chapters on discrete inverse problems. For all the division between continuous and discrete inverse practitioners (continuous inverse types have been known to look down their noses at discrete types, saying that they are not doing an inverse analysis but rather parameter estimation; conversely, discrete types sneer and say that continuous applications are too esoteric and don't really work very often anyway), we have shown in this chapter that the goals of constructing (i.e., finding) a solution and appraising it are universal between the two approaches. We hope to add more material on these topics Real Soon Now.

References

Claerbout, J. F., Imaging the Earth's Interior, 398 pp., Blackwell Scientific Publications, Oxford, UK, 1985.