New User Special Price Expires in

Let's log you in.

Sign in with Facebook


Don't have a StudySoup account? Create one here!


Create a StudySoup account

Be part of our community, it's free to join!

Sign up with Facebook


Create your account
By creating an account you agree to StudySoup's terms and conditions and privacy policy

Already have a StudySoup account? Login here


by: Santos Fadel


Santos Fadel
GPA 3.76

Ashwani Kapila

Almost Ready


These notes were just uploaded, and will be ready to view shortly.

Purchase these notes here, or revisit this page.

Either way, we'll remind you when they're ready :)

Preview These Notes for FREE

Get a free preview of these Notes, just enter your email below.

Unlock Preview
Unlock Preview

Preview these materials now for free

Why put in your email? Get access to more of this material and other relevant free materials for your school

View Preview

About this Document

Ashwani Kapila
Class Notes
25 ?




Popular in Course

Popular in ComputerScienence

This 84 page Class Notes was uploaded by Santos Fadel on Monday October 19, 2015. The Class Notes belongs to CSCI 4800 at Rensselaer Polytechnic Institute taught by Ashwani Kapila in Fall. Since its upload, it has received 31 views. For similar materials see /class/224840/csci-4800-rensselaer-polytechnic-institute in ComputerScienence at Rensselaer Polytechnic Institute.

Similar to CSCI 4800 at RPI

Popular in ComputerScienence




Report this Material


What is Karma?


Karma is the currency of StudySoup.

You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 10/19/15
NUMERICAL COMPUTING Spring 2012 LESSON 2 Relevant sections from the Text 21 7 247 This lesson discusses the direct numerical approach for solving systems of linear algebraic equa tions Gaussian elimination LUi decomposition and partial pivoting are considered along with assessment of the quality of the computed solution by employing the notions of conditioning sta bility and accuracy 1 Introduction Systems of linear equations especially n equations in n unknowns arise in a variety of applications in science engineering economics nance and bio ogyi The solution of such a system especially for equations numbering in the tens of thousands or even more is the most frequently encountered problem in scienti c computing and still a topic of research Such a system can be written as a11I1a1212 a1nI z b1 a2111a22732 39a2n1n b2 anlzl an2z2 annzn bn The matrix version of this set of equations is A17 177 where the square matrix A and column vectors 1 and are given by an L112 am b1 11 A L121 L122 cm 7 b b2 and 17 I2 anl am am bn In In an alternative form these equations can be expressed as an L112 am b1 L121 L122 a2n 52 am am am bn The lefthand side of the above equation now represents a linear combination of the column vectors of A with the 11 the coef cients that force the linear combination to equal the vector bl For given coef cients aij and 12 the problem is to nd the values of the unknowns Iii An immediate question is does a solution even exist We consider this question by examining two loworder systemsl Example 11l For n 2 the two equations 111 any b1 211 a22y 112 have a geometric interpretation they are equations of straight lines in the my plane and the solution is the set ofpoints common to the two lines There is no solution if the lines are parallel a unique solution if the lines are not parallel and an in nity of solutions if the lines are coincidentl These situations are illustrated by the following examples 1 The lines 21 7 3y 8 6x 7 9y 2 are parallel and there is no point of intersection 2 The lines 21 7 3y 8 3x 7 y 5 are not parallel and have a unique point of intersection l 72 3 The lines 2173y 8 6179y 24 are coincident and there is an in nity of common points 4 3t 2t7 where t is arbitrary Example 12 For n 3 the three equations 111 any 132 b1 211 a22y 232 b2 311 any 332 113 can also be interpreted geometrically they represent three planes and the solution yields the points common to the planes There is no solution if two or more of the planes are parallel a unique solution if the planes intersect in a point and an in nity of solutions if the planes pass through a common line but are not coincident if they pass through a common line and two of them are coincident or all three are coincidentl The following example illustrates this situation 1 N 93 F The rst two of the planes 2x 7 3y 2 12 6x 7 9y 32 2 z 7 y 2 0 are parallel and there is no point of intersection The planes 2x 7 3y 2 12 6x 7 9y 2 28 2x 7 y 7 2 0 intersect in the unique point 1 72 4 The planes 2x 7 3y 2 12 2x 7 y 7 2 16 z 7 2 9 intersect in the straight line 9 t 2 t t where t is arbitrary The planes 2x 7 3y 2 12 6x 7 9y 32 36 z 2 8 intersect in the straight line 8 7 t 4 7 t3t where t is arbitrary Here the rst two planes are coincidentl 5 The planes 2173y2 12 6179y32 36 4zi6y22 24 are coincident and have the double in nity of solutions uv12 7 Zu 31 where u and v are arbitrary The linear systems considered above had none one or an in nity of solutions and this result holds in general Here we were able to exploit a geometric approach but in general an algebraic approach is more appropriate and the solution behavior is dictated by the algebraic properties of the matrix A and the augmented matrix Albi You must review the basic terminology associated with matrices and the rules and manipulations of elementary linear algebra The basic terms include matrix vector row vector column vector transposed matrix identity matrix diagonal matrix symmetric matrix upper triangular matrix lower triangular matrix Other useful terms include inverse matrix determinant of a square matrix and its computation singular and nonsingular matrices linear dependence and independence of vectors elementary row operations on a matrix rank of a matrix The following theorem summarizes the conditions under which a a matrix is nonsigular and connects them with the solvability of the linear systemi Theorem 1 Given a square matrix A of order n the following statements are equivalent 1 A is nonsingulari 2i det A a 0 3 The columns of A form an independent set of vectors 4 The rows of A form an independent set of vectors 5 The linear system A b has a unique solution for all vectors 1 of order mi 6 The homogeneous linear system A 0 has only the trivial solution 0 Corollary If A is singular then A 0 has in nitely many nontrivial solutionsi A b has no solution in general unless I is such that A and Alb have the same rank hen A b has in nitely many solutions 2 Triangular Systems Now that we know the circumstances under which the solution is unique let us see how to nd it First we consider triangular systems that are easy to solve A matrix L is lower triangular if its elements satisfy lij 0 for i lt j In other words all elements above the principal diagonal are zero Similarly a matrix U is upper triangular if its elements satisfy uij 0 for i gt ji In other words all elements below the principal diagonal are zero The determinant of a triangular system is the product of the diagonal elements the matrix is nonsingular if none of the diagonal397 elements is zero For n 4 we have 11 0 0 0 21 22 0 0 131 132 133 0 41 142 143 Z44 u11 u12 U13 u14 0 W2 u23 W4 0 0 U33 U34 0 0 0 mm U 7 Triangular linear systems are easy to solve Consider the upper triangular system Um c7 ie unri u1212 uisrs u1414 Ci u2212 U23 W414 C2 US13 u34I4 63 714414 54 Starting with the last equation we solve for the unknowns as follows 54 I4 u44 CS 7 713414 13 uSS C2 7 712313 7 712414 12 u22 Cl 7 711212 7 711313 7 711414 1 ull The above procedure extends easily to arbitrary order n and leads to the following backward substitution algorithm i1n sum sum uijXj xi ci sumuii There is a corresponding forward substitution procedure for the lower triangular system Ly bi Operation count for backward substitutioni For i running from 1 to n the statement sum sum uij x j is executed for j running from i 1 to n7 and the statement xi c i sum ui i is executed oncei Each statement requires one addition or subtraction and one multiplication or division Thus the total number of operations is n n n nn 1 2 2 2 22017241 2T an 11 z1 11 the last approximation holding for large ni There are a like number of multiplicationsi Incidentally the dominant term in the count can be obtained by replacing sums by integrals and adjusting the limits of integration to make life easy Thus in the case above the result is 2djdi2niidin2i 0 i39 0 3 General Systems Turning now to the general system consider rst the single equation ax 12 its solution can be computed as I ba or as I a 1 X 12 The former requires one division but the latter requires a division in computing the inverse of b and a multiplication and therefore more expensive Similar considerations of cost hold true for systems For nonsigular A the solution of A b can be expressed as 17 A lbi7 but this method is prohibitively expensive for large systems as we shall see That leaves two choices The rst is the indirect or iterative approach in which we start with a guess for the solution and compute successive iterates which if convergent can be continued until some tolerance is met The second is the direct approach in which barring roundoff errors a nite number of steps lead to the exact solution We shall rst concentrate on the second approach which is embodied by Gaussian elimination We shall consider the relevant algorithm and its sensitivity to errors in data as well as roundoff errors of computation 31 Gaussian elimination The purpose of Gaussian elimination is to reduce a general system A 17 into an upper triangular system U c which can then be solved by backward substitutioni Before introducing the algorithm for the general case we consider a 3 X 3 example Example 31 Consider the system 1 2 3 11 9 2 2 1 I2 6 i 311 3 2 1 13 7 We shall use row operations to transform the system into an upper triangular form 239e a form in which all elements below the principal diagonal are zerosi Let the three rows be denoted by R1 R2 and R31 In the rst step we shall put zeros in the positions 12 and 13 This is done by replacing row R1 by the combination R2 7 2R1 and row R3 by the combination R2 7 3R1 yielding the system 1 2 3 11 9 0 72 75 12 712 0 74 is 13 720 In the second step we shall put a zero in the position 2 3 This is done by replacing R3 by the combination R3 7 2R2 which yields 1 2 3 11 9 0 72 75 12 712 i 312 0 0 2 13 4 The system 311 has now been reduced to U17 c 313 where 1 2 3 U 0 72 75 0 0 2 is an upper triangular matrix with the pivots on the diagonal The systems A b and U c have the same solution since equality has been maintained at every step in deriving the second system from the rst and the steps are reversible Since U is upper triangular U c can be solved by back substitution this was the Whole purpose of elimination We get 213 4 yielding zg 2 7212 7 513 712 or 7212 71252 72 or 12 1 and zl2123zg9 or 1197217321i We now turn to the general case a1111 a1212 a1313 a1414 171 2111 a2212 a2313 2414 b2 a3111 a3212 3313 a3414 113 a4111a4212 a4313a4414 1 4 STEP Ii De ne the multipliers mil i2 3 4 an Subtracting mil times the rst equation from the ith equation 23 4 to get a1111 a1212 a1313 a1414 b1 a22I2 a2313 a24I4 2 a32I2 was a34I4 3 4212 4313 4414 1 54 Here a j aij milaljy 11 bi miibr STEP Hi De ne the multipliers a mig i3 4 a22 Subtracting mi times the second equation from the ith equation 3 4 to get a1111 a1212 a1313 a1414 b1 7 a22I2 a23I3a2414 2 H H 3313 a34z4 b3 H H H 4313 4414 1 4 Here aij aij 7 migagj b 1217 minQi STEP llli De ne the multiplier Subtracting mig times the third equation from the 4th equation to get a1111 a1212 a1313 a1414 b1 7 2212 2313 2414 7 1 2 H u 1 3313 a34z4 123 W m a44z4 b4 Here H H W H H aij azj miSaSjv bi bi M21721 STEP TVi Solve the resulting upper triangular system by backward substitution Consider the following pseudocode for a naive Gaussian elimination algorithmi The corresponding function M le is available in the Appendix for k1n1 for ik1n maik akk for jk1n aijaijmakj end b i b i mb k end end Remarks 1 The outer loop the k loop ranges over the number of steps from 1 to n7 1 Step Is processes rows k 1 through mi 2 The middle loop the 239 loop ranges over the rows 16 1 through n to be processed by step kl It computes the multiplier for each rowi 3 For each row i the inner loop the j loop ranges over the columns k 1 through n updating the corresponding elements The elements of the vector 12 are updated as well Note that the elements of column 16 which will be rendered zero are not actually compute i 4 At the end the upper triangular half of the matrix A is updated The elements below the principal diagonal all zeros will not be needed by the backward substitution and are therefore not modi ed 5 The solution is completed by the backward substitution algorithm referred to above 6 An operation count for large n is obtained in the usual way by integrating the loops 3 Zdk di dj2i o k k 3 7 The elimination process as described above fails at step k if the corresponding pivot element akk 0 because then the corresponding multipliers are unde nedi We shall return to this important issue lateri 32 The LU decomposition We now connect Gaussian elimination to the factorization A LUi This decomposition avoids repetition if A b is to be solved for a number of different I vectors that may not all be available at once STEP 1 Consider the n 4 system considered above Let A1 A and de ne the matrix 1 0 0 0 7 1m1 1 0 0 Ml 772131 0 1 0 7 777141 0 0 l where the mil are the multipliers de ned above Observe that 1 0 0 0 an an dis a14 an an dis a14 A2M1A1 772121 1 0 0 an agg agg a24 0 aQQ aQg ag4 m31 0 1 0 an M2 ass M4 0 a32 ass a34 m41 0 0 1 a41 a42 a43 M4 0 542 543 544 Thus we see that multiplication with M1 is equivalent to the row operations of step 1 of Gaussian elimination STEP 11 Next set 1 0 0 0 0 1 0 0 M2 0 7mg 1 0 7 0 772142 0 1 Then 1 0 0 0 7 1 0 0 0 an a12 dis a14 an an dis a14 A3M2A2 0 1 0 0 0 a2 a23 a24 7 0 a22 a23 a24 0 77132 1 0 0 a32 ass a34 0 0 agg a34 0 77142 0 1 0 542 543 544 0 0 513 ah STEP 1111 Finally set 7 1 0 0 0 0 1 0 0 M3 0 0 1 0 7 0 0 772143 1 Then 1 0 0 0 an L112 an L114 an an an L114 7 7 0 1 0 0 0 a22 a23 a24 0 a22 a23 a24 A47M3A37 0 0 1 0 0 0 ag ag 0 0 g aS 0 0 772143 1 0 0 04213 ah 0 0 0 ail Thus we see that M 3M2M 1A U where U is the upper triangular matrix produced by Gaussian elimination Note further that the lower triangular matrix L1 de ne y 0 f Mfl 0 HOOD 0 1 mgl 0 0 where the second equality can be established by showing that Mle1 Ii Similarly Lk M1671 k 2 31 On setting 1 0 0 mgl 1 0 77131 77132 1 0 77141 77142 77143 1 0 0 L L1L2L3 7 where L being the product of lower triangular matrices is also lower triangular we nd that A LUi The corresponding Matlab code is available in the Appendix Remarks H Note that the diagonal elements of the lower triangular matrix L are all unity and that the elements of L below the diagonal are given by Lij miji N The decomposition procedure applies for all n CA3 1 The problem A 17 becomes LU bi It is solved in two steps First LC b is solved for c by forward substitution and then U c for by backward substitutioni 33 Pivoting We have already identi ed a problem with Gaussian elimination namely the appearance of a zero in the pivot positionl Let us illustrate the problem with an examp er Example 32 Let us perform Gaussian elimination on the augmented matrix 10 77 0 7 A 73 2 1 6 3 9 5 71 5 6 Step L The pivot element is All 10 We have 72121 7310 7073 mgl 510 075 The row operations R27m21R1 and R3 7m31R1 yield H 0 77 0 7 A1 0 0 6 6 0 2 5 5 275 Step Hi The pivot element is A22 0 Further progress is not possible but the situation can be salvaged by interchanging R2 and R37 This yields the equivalent system 10 77 0 7 A1 0 2 5 5 2 5 0 0 6 6 and brings 25 in the pivot position 1422 No further steps are required since upper triangularization has already been achieved Back substitution yields 76 7257513 77z270 2 5 7 71 a 10 and we are done Example 33 Now consider the following variation of Example 32 obtained by replacing A32 fl by 7375 10 77 0 7 A 73 2 1 6 3 9 5 735 5 6 Step I Proceeding as in example 1 above we get 10 A1 0 0 CON 0 7 6 6 5 25 Step ll A zero appears in the pivot position A22 but there is no nonzero entry below the pivot Further progress is not possible The system is singular and in this case does not have a solution The singular solution would have an in nity of solutions had the element in the 3 4 position above been 5 Example 34 A more insidious situation arises when the pivot element is small but nonzerol Consider the augmented matrix 20 10 l 1 Alb 7 l l l 0 The exact solution of A b is 1 11 12 W 10 On a double precision machine with emach 10 let us see how Gaussian elimination proceeds The multiplier is 1 20 71121 10720 10 1 Then R2 7 R2 7 m21R1 leads to 10 20 1 1 1 0 1 7 1020 1 71020 7 which rounds off to 100 1 1 1 0 71020 1 71020 yielding the solution 12 1 11 01 When compared to the exact solution above the error is enormous An alternate procedure is to interchange rows R1 and R2 prior to beginning the process of elimination This row interchange leads to 71 110 Alb lmw 111 Now m21 10 20 and R2 7 R2 7 m21R1 leads to 11 10 0 171040117 1110 0111 Now the solution is 12 1 11 71 which is a good approximation to the exact solution found above which rounds off to The process just outlined can be extended to systems of any size It is based on the recognition that small but nonzero pivot elements generate large multipliers which in turn can lead to addition of numbers of vastly disparate size and hence substantial loss of signi cance as the entries of the matrix are being updated during eliminationi The remedy at elimination step 16 is to exchange rows so that the largest element in absolute value in positions 16 through it in the k th column is brought to the pivot position Ak 16 Then the multipliers are always smaller than unity and loss of signi cance is minimized This procedure is known as partial pivotingi An alternative is complete pivoting where both rows k z n and columns k z n are scanned and suitably interchanged so as to bring the largest possible element in submatrix Ak n k z n to the pivot positioni Implementation of complete pivoting brings in additional complexity and as a result is not popular in practical implementations of Gaussian elimination 34 Errors conditioning and stability Now that we know how to compute the solution of A b for nonsingular A a natural question arises how accurate is the solution Before this question can be answered we need to decide how the error of such a computation can be measured When the solution to a problem is a scalar the error between the exact solution I and the computed solution i can be measured by the quantity 1i 7 When the solution is a vector or a matrix it is unwieldy to keep track of the error in each component Instead we resort to the concept of the norm of a vector or a matrix which is a single scalar characterizing the size of the vector or the matrix 34 1 Vector norms Some common vector norms are O The 2 norm or the Euclidean norm 7 12 llf ll2 i1 n llf lll eril 1 n 117 lwllp Emv7 il O The 1 norm 0 The pnorm O The in nitynorm max llwloo 19 lml For example for 4 7 3 2 5 v54 7 357 Ml1 147 llil lloo 5 These norms are roughly comparable in size The choice of a norm is generally a matter of analytical and computational convenience We shall mostly use the lnorm or the in nitynormi Every vector norm satis es the following properties 9 gt0 for 17 0 O Howll for any scalar 1 llyll S Ml llyllA A consequence of the third property above the triangle inequality is Ml llyll S llil yllA 342 Matrix norms For our purposes the norm of a matrix is de ned most usefully in terms of consistency with the underlying vector normi Recall that in two and three dimensions the multiplicative action of a matrix A on a vector amounts to a rotation of the vector accompanied by a stretch or a compression The norm of the matrix written as llAll7 measures the maximum stretch induced by A 239e max lAl of Ml Ml or equivalently max llAll llAil llA Ml 1 It can be shown that max n llAlll maximal absolute column sum Z laijl J 11 n llAlloo maximal absolute row sum In E laijli 2 F1 12 Figure 1 Rotation and stretch induced by matrix multiplication applied to a vector For example for 210 142710 0071 Mb 47 HAlloo 3 Every matrix norm satis es the following properties 0 gt0 for A7 0i O HaAH lalHAH for any scalar 1 HABH S HAH HBHA HABH S HAllllBll HALEH S HAllllil ll 343 Condition number The condition number H of the matrix A is de ned as H HAHHA ll Since is the maximum stretch induced by A7 and HA 1H is the reciprocal of the maximum shrinkage induced by A see gure 1 we can write 31 HA1 MA W umu1llA17ll A large condition number implies that the matrix A when applied to vectors causes extreme distortioni If A is singular 239e its determinant is zero then the inverse does not exist and we take the condition number to be in nite The condition number satis es the following properties 0 HltAgt Z 1 o HltIgt 1i 0 HaA MA This property shows that MA is a better measure of the sigular nature of A than the determinant Since det aA andet A7 multiplying a 50 X 50 matrix by a factor of 3 ampli es the determinant by a factor of 350 m 1023 13 O For a diagonal matrix D with diagonal entried di 239 l n maxldil H D if minldil Exact computation of the condition number of A involving as it does the inverse of A is a nontrivial exerciser However the exact value is seldom needed in practice and an approximation suf ces Such approx imations can be found by Matlab commands cond A and rcond A i The latter generates the reciprocal of MA The results depend of course on the norm employed For example the command cond A 1 will produce an approximate value of the lnorm of 344 Quality of the solution We now return to the question of how accurate the computed solution is and address the question from three different points of viewl ll Conditioning Let us begin by assuming that the computations are exact and that the only errors introduced into the problem are those in the input when the elements of an b are entered into the computer For de niteness let us further assume that only the elements of b involve such data errors and that the elements of A are entered exactlyl Data errors in A can be treated analogouslyl Let Ab denote the data error in b and let Am be the corresponding error 239e the propagated data error in the solution ml Then Am b and AmAm bAb so that AAm Ab or Am A lAbl Taking norms we obtain 71 71 HALEH HA Abll S HA HHAbHA 34 We can also write llbllllA1 ll S HAHHil H or 1 A S H M will Hbll Equations 34 and 35 can be combined to yield Am 7 Ab M HAHHA luw uwu H H 35 u u u N Am Ab uwu S Al HbH where HltAgt is the condition number of A de ned in the preceding subsectioni A similar result is obtained if the entries of A are perturbe Thus we see that relative propagated data error HltAgt gtlt relative data error 36 239 e H is the error magnifying factor rendering the problem illconditioned for large Hi Consider the idealized situation in which the only errors that occur are those due to the rounding involved when elements of A or b are entered into the machine The relative input errors are therefore of order emachl Let HltAgt 1016 Even if the subsequent computations are exact the relative output error could be as large as 10k gtlt emach 239e k digits could be lost in the output This phenomenon is illustrated by the following example Example Consider Am b where 7 62e 22 6276 22 l A e2 Asia2 sum2 l l W27 l 14 and the exact solution is The inverse matrix A 1 is easily found to be 1 7 62 672V 62 i 672V A l 76276 22 Susie2 Upon using lnorms and assuming that e is small so that 6 2 gtgt 62 we nd that HAHHA 1H 6 2 so that HltAgt 674 For 6 01 solution by Gaussian elimination turns out to be 1 I2 7 071067811861788 X 107 while the exact solution is 1 I2 7 071067811865473 X 10 We observe that the computed solution has lost 4 signi cant digits For 6 001 the computed solution is 1 I2 7 071067855474009 X 1037 while the exact solution is 1 I2 7 071067811865473 X 103 Now the computed solution has lost 8 signi cant digits These results are entirely consistent with the condition number of A being 104 in the rst case and 108 in the secon i ll Accuracyi Suppose that we have obtained a computed solution 0 to the problem A bi One way to gauge the accuracy of the solution is to compute the residual rb7Aztci We know that when the solution is exact 239e when the error A17H szci 0 then 0 as well In other words vanishing of the residual implies zero error in the solution Is it also true that a small residual implies a small output error Not necessarily since multiplying A bi by a number a does not affect the solution but does scale by on To eliminate the effect of scaling we focus on the relative residual de ned as W HAN ll cll Now llAil HH17c 17H HA VALEc NH HA 1TH HAAHHTHA Upon dividing both sides by H170 H and rearranging the RHS we get Mill 71 W HAHHA H llil cll llAllHil cH HTH HA7i llAllHil cH Thus we nd that relative output error HltAgt gtlt relative residuali 3 7 15 We have come upon the requirement for accuracy Only for wellconditioned problems does a small relative residual guarantee small output error This concept is illustrated by the following example Example Consider the 2 X 2 system A17 b7 0780 0563 11 7 0217 0913 0659 12 7 0254 Let us suppose that by using two different methods we have computed two different solutions 17 0341 27 0999 quot3 lioiom and quot3 71000 The corresponding residuals are T1 b7 14111 00008001 and T2 17714142 0000780 0000913 It is clear that in any norm Hrlu is much smaller than Hrgu suggesting that solution 0 is to be preferred However the exact solution 7 l 1 l 71 shows that we is more accurate The explanation of this seeming contradiction lies behind the fact that the condition number of A is of the order of 105 Ill Stabilityi Stability refers to the sensitivity of the numerical model to accumulation and ampli cation of roundoff errors during the computation a stable algorithm being less prone to such contaminationi Let us suppose that the computed solution 0 of A17 b is the exact solution of a nearby problem say A EC bi Here is a measure of the extent to which the original problems is perturbed ie is a measure of the backward error 1 A stable algorithm more precisely a backward stable algorithm would require that the relative backward error be small Now the residual r can be expressed as follows b7 A17C Emci Therefore HTHHE17cH llEllllil ch so that r HEH 2 i7 llil cll and hence m gt M HAN 7 HAMth Thus a large relative residual RHS implies a large relative backward error LHS and hence an unstable algorithmi If the algorithm is stable the backward error LHS will be small and so the relative residual RHS will be small as well This statement about the smallness of the relative residual holds regardless of the conditioning of the problemi However the small relative residual implies a small relative output error ie an accurate solution only when the problem is wellconditioned In other words accuracy requires both stability and a wellconditioned problemi 1See section 14 of Lesson 1 for a de nition of backward error 4 Appendix 41 Naive Gaussian algorithm function xnaiveGEAb Z Solves Axb by standard Gaussian elimination Z Input Z Nonsingular square matrix A of order n X Vector b of order n quot0 Z Output Z Solution vector x Z Check to make sure that the matrix A is square and establish order n mn sizeA if m n error The matrix is not square end Z Z Perform Gaussian elimination Z for k1n1 for ik1n mAikAkk for jk1n AijAijmAkj end bi bimbk end Z Z Apply backwards substitution to get solution x Z for in11 sum0 for ji1n sum sum AijXj en xi bisumAiiL end 7 Z Example A4 3 2 1 3 4 3 2 2 3 4 3 1 2 3 4 00 b1 l 1 1 Z XO 1 1 0 NUMERICAL COMPUTING Spring 2012 LESSON 5 This lesson is concerned with numerical algorithms for solving initialvalue problems for ordinary di erential equations It should be treated as a supplement to Chapter 9 of the Text 1 Introduction In this lesson we shall consider rstorder ODEs of the form dy i t l 11 dt f y The discussion Will focus on single ODEs but the procedures developed here can apply equally well to a system of n ODEs for Which y and f are ndimensional vectors 5 7 7 7 7 7 7 7 v j I I 1 1 0 2 4 e 8 m 7 7 7 7 7 7 7 7 r z 7 Figure 1 Direction eld and solution curves for yt ysintl Any function yt satisfying the ODE is a solution Its graph in the ty plane is called a trajectory or an integral curve The ODE itself does not de ne a solution uniquely At a point Ly it only provides the slope y t of the integral curve that passes through that point There is an in nite family of solution curves in general see gure 1 above To select a particular member of the family one needs to specify a point on the curve This is done via an initial condition ya at Example 11 y t ysint y0 1 The unique solution is y exp1 7 cos it The solution is valid for t2 0 Figure 1 above Example 12 y t 1 y2 y0 0 The solution is y tantl It is unique but valid only for 0 g t lt 7r2l Example 13 y t yQS y0 0 There are two solutions yt 0 and yt 75327 With the exception of certain special forms of fty solutions to the ODE 11 cannot be determined analytically Recourse to the computer to obtain approximate solutions is therefore necessary 2 Theoretical Considerations Before embarking on the development of numerical approximations it is useful to examine certain aspects of the theory of the IV dy E fty7 W1 av 21 We shall consider in particular the issues of existence uniqueness and stability of solutions Existence deals with the question of whether the IV 21 has a solution If it does not then there is no point in trying to nd an approximate solution numericallyi If it does then uniqueness is concerned with the question of whether there is only one solution Stability deals with the sensitivity of the solution to small perturbations in the IV Let ft y be continuous in the strip D a g t g b7 700 lt y lt Moreover assume Lipschitz condition in y there exists a constant L such that for all Ly in D lft7y1 ft7 7y2l S MM 7 wt Then the following results hold H The TV has a unique solution throughout the interval a bl N The solution depends continuously on the initial data ie if satis es the ODE and the initial data ya d then Mt 9W ewialla All 93 If satis es more generally the perturbed ODE 9 1307s then Lt7a 7 W 7 m emima 7 an 7 eL lM where M maxlfi in Dr Incidentally consider the Lipschitz condition for a single variable 7 g 11 7 12 It implies continuity iiei must approach as 12 approaches 11 Also it is implied by the existence of a bounded derivative iiei W11 f12llf5ll11 I2l S Llrl I2l where L is the bound on the derivative and the intermediatevalue theorem has been applied Therefore in the current case if BfBy exists in D then L can be taken to be mwleByl in Dr The partial derivative BfBy is often called the jacobiani Thus we have existence uniqueness and a continuous dependence of the solution on initial data ie a well posed problemi However as the above bounds show the perturbed solution may still diverge exponentially from the exact solution thereby displaying a strong sensitivity to perturbations in initial data Heretofore we have referred to such sensitivity by the term conditioning However a long tradition in ODEs uses the term stability instead We shall do that as well but with due care so as not to cause confusion with our use thus far of stability in the context of sensitivity to perturbations that arise during computationi 21 Stability of solutions of the IVP Often we are concerned with lVPs whose solutions exist for all t 2 at A simple example is the IV yt Ay y0 a with solution y ae i A small perturbation in the initial data leads to Mi 7 WM la dlek We see that the difference between the exact and the perturbed solutions remains bounded for A g 0 but grows for A gt 0 If A lt 0 the difference in fact decays with t These possibilities correspond respectively to a stable an unstable and an asymptotically stable solutioni Note the behavior of the family of trajectories in each case In general we consider the IV dy 7 7 7d 7 my yltagt 7 a and de ne a solution yt to be O stable if given any 6 gt 0 there exists a 6 gt 0 such that any other solution which satis es the ODE and the initial condition lye 7 ol 6 also satis es 7 g e for all t Z 039 O asymptotically stable if in addition to being stable 7 7 0 as t7 00 We have already examined the linear constantcoef cient ODE For the linear variablecoef cient case fty is of the form ygti The sign of gt at a particular value of t gives an indication of stability but only in the short run as a change in the sign of y will alter the behavior In the nonlinear case one examines the sign of the jacobian fy t y but again the behavior may change in t and now also from one solution y to another Example 21i yt y1 7 The steady solutions are y 0 and y 1 The jacobian is A l 7 2y which is positive for y 0 and negative for y 1 We conclude that y 0 is unstable and y 1 is stable Also two solutions starting at say y 0 05 and y 01 have positive jacobians and hence a tendency to move apart until they pass y 12 when the sign of the jacobian changes and the solutions draw closer 2 tt x 5 x N V Ammmxmm f gtSquot z7v 7 05 r c 1 2 3 4 5 t Figure 2 Trajectories of MO yl 7 3 Numerical Solution of IVPS Analytical solutions to ODEs are unavailable except for special classes of ODEsi When available the analytical solution is in the form of a function yt or its graph the solution curve A computational solution produces discrete approximations ti to points tiyti on the solution curvei 31 Euler7s Method We begin with the basic lVP yt fty a g t g b ya 1 31 The objective is to determine a numerical approximation to yt at the discrete set of points at0ltt1ltt2lttn1lttnbi We adopt the notation that wi is the numerical approximation to yi For simplicity we take the points to be equally spaced so that the step size is given by hb 7a TL Then ti aih i012 n We assume that yt has two continuous derivatives It is instructive to derive the method in three different ways i Expand the exact solution yt in a Taylor series about t ti to get 1 975 9139 t 7 750902 05 ti2 Ci7 where Ci is an unknown quantity lying between t and tin Upon replacement of MO by fty and evaluation at t ti1 the above expression reads 1 2 yi1 M hf ivyi Eh y 007 where Ci now lies between ti and 751411 Euler s method results upon dropping the last term on the right and repalcing the exact solution yi by the numerical approximation wi wi1wihftiwi i01n71 w0ai 312 ii Replace the derivative term in the ODE 311 by the forward difference approximation recall discussion on numerical differentiation section 511 of text 7 1 t M 7 7h 77 y 1 h 2 y C to get y39 7 y39 1 h9775 27 Omitting the error term again leads to 312 iii Integrate the differential equation in 311 from t ti to t ti1 to get dy 1 mm at The LHS yields yi1 7 yin On the RHS approximate the integrand by its value at the left end point til The result is hfti7 yi7 and one is again led to 312 The pattern of the Euler approximation is clear At each step we evaluate the slope f at the current approx imate solution point and use it to advance to the next time step Thus the solution curve is approximated locally by the tangent line segment passing through ti7 and having slope fti7 Example 31 y 1 7 2y y01i This is a linear problem with exact solution 1 1 yt E 6723 313 Euler s method gives wi1 wi 7517100 wi h17 210i leading to the difference equation wi1 717 2hwi h7 we 11 The difference equation can be solved by noting that the homogeneous equation will 7 l 7 2hwi 0 has the general solution Cl 7 2hi while the nonhomogeneous equation will 7 l 72hwi h has a particular solution 121 The general solution to the difference equation is the sum wi C17 2hi1 The initial condition we 1 determines C 12 leading to the approximate solution 7 1 1 i w17 172h1 Note that with ih t the above can be rewritten as 1 1 ti yiswi ltl7227gt1 In the limit of vanishingly small time step h 7gt 0 and correspondingly i 7gt 0 The RHS of the above equation then approaches the exact solution 313 In this simple example we were able to compute wi analyticallyi A numerical code is required in general Example 32 y 1yt 1 g t g 6 yl 11 The Matlab code given below involves three scripts The rst script callingOO m sets up the problem by specifying the time span and the initial condition It calls the function euler1m which in turn calls the script ydot1m to obtain y t and generates the Euler approximation The output includes the numerical solution and the error between the exact and the approximate solutions quot0 script callingOO m o quot0 Solves IVP for ODE y f ty yaalpha alttltb quot0 Input quot0 ODE particulars fty a b alpha quot0 Number of subintervals n quot0 quot0 Output quot0 Exact solution y quot0 Computed solution w quot0 Error Iyi wi quot0 quot0 quot0 Calls Euler routine eulerl clc format1 t w y Iywl nn fprintfformat1 a1 b6 quot0 initial and final times alpha1 quot0 initial value n10 quot0 number of subintervals hban quot0 step size seuler1 ydot1 abalphan for i1tn1 ttsi1 yexacttt1logtt errabsyexactsi2 format2 53f Z156e Z156e Z156e n fprintfformat2ttsi2yexact err end ZZZZZZZZZ function soleuler1fcnabalphaN Z Generates Euler solution to y fty using N steps Z Initial condition yaalpha Z Interval ab hbaN taONh y1alpha for k1 N yk1ykhfevalfcntkyk end solt y ZZZZZZZZZ function fydot1ty f1y t quot47347073473470 OUTPUT t w y Iywl 1 000 1 39 1 39 0 39 1 500 2000000e00 2108198e00 1 081977e01 2 000 3 166667e00 3386294e00 2 196277e01 2 500 4458333e00 4790727e00 3 323935e01 3 000 5850000e00 6295837e00 4 458369e01 3 500 7325000e00 7884670e00 5 596704e01 4 000 8871429e00 9545177e00 6 737489e01 4 500 1048036e01 1126835e01 7879911e01 5 000 1214484e01 1304719e01 9 023483e01 5 500 1385933e01 1487611e01 1016789e00 6 000 1561926e01 1675056e01 1131293e00 The exact and the approximate solutions are plotted in Figure 3 32 Local and global errors Suppose that the solution wi is exact 239e yi win Then a Taylor expansion about ti gives h2 yi1 91 hfti7 Ey ci7 ti S Ci S tir Also wil wi fltti7 yi flttivyi Subtraction leads to h2 i1 E yi1 wi1 iyKci 34 This is called the local truncation error LTEi The LTE of an IV algorithm is the error that results When a single step of the algorithm is performed With exact input data A method has accuracy of order k if the Solution of dydt 1yt Figure 3 Exact solution and Euler approximation of the IV y l yt yl 1 t E 1 6 h 05 LTE is proportional to hk1l Thus the Euler method is of order 1 One can Visualize the LTE by observing that at time ti the solution point ti lies on a certain solution curve of the ODE The LTE causes the next computed solution point n1 wi1 to jump from this solution curve to a neigboring solution curve With the LTE determining the magnitude of the jump The next example illustrates this phenomenon Example 33 y iy y0 1 t6 02 Figure 4 Euler solution of y y left and y 7y right y0 1 With h 04 The solutions are plotted in Figure 4 In each case a number of neighboring exact solutions are plotted along With the Euler approximationi We note that at every time step the tangentline approximation representing the Euler approximation causes the solution to jump from one integral curve to anot eri The global error at step i l is de ned as the difference between the exact and computed solution 9i1 yi1 wi17 3 5 and is the error of interest in practice It is the result of accumulation of local errors but is not necessarily their sum as the local errors can undergo magni cation or shrinkage as the calculation proceeds see the two cases in Figure 4 Remark Do not be confused by similar expressions for the local and global errors in 34 and 35 respectively In the former case the solution at time ti is deemed to be exact and error over a single time step is being considered in the latter case every step produces an approximate solution and it is the accumulated error since the initial time to that is being considered 33 Global error of the Euler method for the general ODE We begin by noting that 60 ya 7 we 0 Let us try to compute 6141 We have h2 yi1 91 hfti7 iMOi ti S Ci S tir Therefore i1 yi1 wil h2 9139 hf i iywci wi hf i i h2 5i hlf i 9139 fltti7 wil 5W0 h2 6i hfy iy i yi 101 9HCi wi S i S M h2 i eill hfy z39y EH 5W6 The meanvalue theorem has been applied in the penultimate step The above expression shows that the global error at step i 1 is the sum of the LTE at that step plus the global error at the preceding step multiplied by the growth factor 1 hfy ti i Now suppose that in the region of interest lfyl S L W l S M Then h2 lei1l 1 hLleZl In order to obtain an estimate of the global error consider the difference equation h2 2H1 1 hL2i EM 20 0 We shall show that leil g zit This is certainly true for i 0 Assuming the truth for i it then follows from above that lei1l 2i1i Now the solution for the difference equation for 2139 see Example 31 for a similar situation is 2139 c1 hLi 7 The initial condition gives 5 hMZL so that 2i hM Z E1hL 71y leil lt 2139 hM Z ith 71 hM Z S flew 1l ewrtogtiu This result provides an upper bound for the global error and shows that it is of order h7 thus becoming vanishingly small as h a 0 The primary importance of this result is not so much that it provides a bound on the global error but the fact that it proves convergence of the Euler method A convergent method is one for which the global error vanishes in the limit h a 0 Thus the method becomes increasingly accurate as the time step is reduced For it to be useful a method needs to be convergentl Example 34 y 1 yt7 l g t g 67 yl 1 We return to the IV rst considered in Example 32 but now examine its global error at the nal time t 6 as h is reduced from 12 to 1256 The Matlab code and the computed results are given belowl Only the script for the calling program is provided the Euler routine and the function M le are the same as in Examp e 32 quot0 script calling01m quot0 Solves IVP y tfty yaalpha at progressively quot0 decreasing step size quot0 Input 0 o ODE particulars fty a b alpha quot0 Number of subintervals n quot0 quot0 Output quot0 Step size quot0 Exact solution y quot0 Computed solution w quot0 Global error Iyn wn quot0 Error ratio quot0 Calls Euler routine eulerl clc format h Iywl error ratio nn fprintfformat t01 T6 quot0 initial and final times alpha1 quot0 initial value yexactT 1logT for k128 n102 k1 quot0 number of subintervals hT1n quot0 step size seuler1 ydot1 tO Talphan globalerrabs yexactsn1 2 if k1 format o75f o 156e n fprintf format habsyexactsn1 2 10 else rateg1obalerrOglobalerr format 75f Z156e Z156e n fprintfformathabsyexactsn12rate end globalerr0globalerr end ZZZZZZZZZZZZZZZ NUMERICAL OUTPUT h Iywl error ratio 050000 1131293e00 0 25000 5948077e01 1901947e00 0 12500 3049166e01 1950723e00 0 06250 1543519e01 1975464e00 0 03125 7765033e02 1987782e00 0 01562 3894382e02 1993906e00 000781 1950158e02 1996957e00 000391 9758208e03 1998480e00 The rst column of the output above lists the values of h which are being reduced by a factor of 2 from one row to the next The second column is the global error at the nal time t 6 We note that it is decreasing with decreasing hr The third column is the ratio of the global error at two successive values of h This ratio approaches 2 indicating that the global error is decreasing at the same rate as h and is therefore proportional to h for the Euler method as anticipated in the result of section 33 34 Higherorder Method Explicit Trapezoidal We have seen that the Euler method is rstorder accurate ie its LTE is of order h2i We now present an example of a secondorder accurate method Starting with the basic IVP 31 we integrate the ODE as was done in the third method of deriving the Euler method39 see section 31 y dy 1 m w dy The LHS yields yi1 7 yin On the RHS approximate the integrand by the trapezoidal rule to get h yi1 yi N 5W1 9139 fti17yi1l On ignoring truncation errors and upon replacing yi by wi one is led to the algorithm h 2 wi1 wi 160517 21 10141 In general this is a nonlinear equation for wi1 and its solution will require a root nder such as Newton s method or bisectioni Algorithms of this kind in which the approximation wi1 cannot be explicitly com puted in terms of the approximation wi are said to be implicit We shall have more to say about them later It is possible to render the above algorithm explicit by replacing wi1 on the RHS by its Eulermethod approximation The result is the Explicit Trapezoidal algorithm h wi1 W E fab 751417715 hfti7 Example 35 y1 yt 1 g t g 6 yl 1 11 We consider the above IVP again and obtain its numerical solution by the explicit trapezoidal method The results analogous to those obtained in Examples 32 and 34 are given belowi The rst set of results corresponding to h 0 5 tabulates the computed and exact solutions and the corresponding error A comparison With the error results for the Euler method obtained in Example 32 shows the greater accuracy of the trapezoidal method The trapezoidal error at the nal time for example is 018636 as compared to the Euler value of 113129 The higher accuracy of the trapezoidal method is also eVident in the plot of the exact and computed solutions shown in Figure 5 When compared to the corresponding Euler plots in Figure 3 Trapezoidal solution of dydt 1yt Figure 5 Exact solution and explicit trapezoidal approximation of the 1V y 1 yt7 y1 1 t 6 16 h 015 COMPUTATIONAL RESULTS t w y Iywl 1 000 1 39 1 39 0 39 1 500 2083333e00 2108198e00 2 486433e02 2 000 3340278e00 3386294e00 4 601658e02 2 500 4725347e00 4790727e00 6 537961e02 3 000 6212083e00 6295837e00 8 375353e02 3 500 7783145e00 7884670e00 1 015255e01 4 000 9426273e00 9545177e00 1189048e01 4 500 1113233e01 1126835e01 1 360137e01 5 000 1 289426e01 1304719e01 1 529290e01 5 500 1 470641e01 1487611e01 1 697006e01 6 000 1656419e01 1675056e01 1863628e01 h Iywl error ratio 0 50000 1863628e01 25000 5 327087e02 3498400e00 12500 1423406e02 3742494e00 OO 006250 3 677094e03 3 871007e00 003125 9 342982e04 3935675e00 001562 2354635e04 3967911e00 000781 5910261e05 3983978e00 000391 1 480528e05 3991995e00 The second set of computational results shows the effect of reducing h on the global error The rst column of the output lists the values of h7 which are being reduced by a factor of 2 from one row to the next The second column is the global error at the nal time t 6 The third column is the ratio of the global error at two successive values of h This ratio approaches 4 indicating that the global error is decreasing at the rate h27 rather than the slower rate h for the Euler method Example 34 35 Higherorder Taylor Methods After having presented one example of a method that is more accurate than the Euler method we now consider the problem of deriving higherorder methods more systematicallyi One approach is based on Taylor series which can be used to advance the solution from time ti to ti1i Assuming suf cient differentiability one can obtain the Taylor expansion h2 hk yitl 9139 hyt 39 y Rim where the remainder has the form hk1 Ric yltk1gt Ci ti S Ci S tir 36 We note that y is ft7 y and derivatives of y of order greater than one are computed by differentiating f0 9 Thus 6 f t W y t f Mt W fytyty f ffy y t f f fzyy f mm f fy fyyy etc where the prime denotes total differentiation with respect to t Upon dropping the remainder we obtain the approximation wi1 to the solution yi1 as h2 hk wi1 wi hfti7wi iftivwi 39 39 39 g k lwz vuh 37 We can write the above expression economically as wi1 wi th ti7 wi7 38 where Tk Ly is de ned as hkil k 1 k f my 39 The LTE of the method is de ned by the remainder 36 above we observe from this expression that the method is of order kl We note also that the Euler method corresponds to k 1 my fltnygt gmm Example 36 Sty Ff 27 y4 1 Using a Taylorseries method of order 2 take one step and compute y4ili For k 2 the algorithm 38 can be written as wil wi hT2 ti 13 where h T2t7 y f6 y 511 y ft7 yfyt7 yA Here the subscripts t and y denote the corresponding partial derivatives We are given that 2 7 y2 5t ft7y Then 2 7 y2 2y ft 7 572 and fy 75 The algorithm then reduces to 27w2 h2 27102 2710122115 1 139 51 1 2 5t Sti 5n wi1 wi h Upon substituting h 0 1 and the initial condition we 1 at to 4 we nd that 71 393 ml 7 80000 36 Runge Kutta Methods Taylorseries methods especially those of higher orders require substantial derivative computationi Runge Kutta methods imitate Taylorseries methods in accuracy but avoid this dif culty by using diffrerence ap proximations for the derivatives The methods can be derived in a variety of ways including use of quadrature rules and use of Taylor series 361 Runge Kutta method of order 2 Consider the derivation of the Taylorseries method of order 2 h2 yt h 90 flyt Ey W 0013 1 W hf h2ft ffy 0h3 1 1 7 m Ehf 5w hf hffy1 0lth3gti By using Taylor series for two variables we obtain ft h7y hf My hfz ffy 0h2i Then we can write h h ya hgt 7 W 5mg Em mm mm 0h3gt Upon dropping the error term we get the Runge Kutta method of order 2 which can also be written as h wil wi 5W1 k2 where 161 100517115 k2 fltti17wi hkll This method can be thought of as an Eulerlike method in which the slope used to advance the solution is computed as an average of the initial slope and the slope at the provisional next step computed by the Euler method A moment s inspection reveals that this method is the same as the explicit trapezoidal method introduced earlier It also goes under the name Heun s method 14 In general secondorder RungeKutta methods are of the form yt h yt h41ft7y q2ft ah y hf y yl 0013 where qlq2a and B are to be found subject to the requirement that the formula be equivalent to the Taylorseries formula of the same order Upon Taylor expansion we get yltt h W qlhf 42w ah mm 0013 Upon comparison with w h m hf 902 1 0013 we obtain the conditions ql 42 1 qga 12 qg 121 One solution is ql L12 12a 1 which corresponds to Heun s method There are other choices such as ql 0 L12 1 a 12 leading to the modi ed Euler method wi1 wi hk27 where k1 ftiwi k2 ftih2wi him2 362 RungeKutta method of order 4 This classical method is given by wi1 wiglk1 27 27 1641 where 161 160513 wi7 k2 fti hZywi hk127 k3 fti h27wi hk227 k4 fltti17wi hks One can show by Taylor series arguments that it reproduces the terms in the Taylor series upto and including h4 and that the error term is 0h5i Advantages of a Runge Kutta method over similarorder Taylor method include no evaluations of derivatives of f and easy implementationi Disadvantages include dif cult derivation of the truncation error termi 4 Absolute Stability The explicit singlestep methods derived thus far for the IVP 31 have the following general form wil wi hFti7 well w0 017 where the speci cs of F depend upon the method under consideration Analysis similar to that carried out for the Euler method in section 33 shows that under appropriate restrictions on fty the global error tends to zero as h 4gt 0 239e the method is convergenti Since computations always involve nite values of h it is reasonable to investigate how a method behaves in practice when h is small but nitei In particular if an exact solution of the IV is stable or asymptotically stable it would be desirable for the corresponding numerical solution to retain this behavior That this is true for suf ciently small h is ensured by the convergence property but the question remains how small To investigate this question let us consider the following standard IVP as the test case M M t 0 y0 at Plots 0 1IC expr100 l Figure 7 Analytical solution of y 7100y 100t 101 y0 1 C For C 0 the solution is the straight line 1 t For C y 0 the solution has an exponential component that decays very rapidly to zero and one is again left with the straight line 1 t as the asymptotei One can say that for C a 0 the solution has two characteristic time scales a fast time on which the initial transient decays and a slow time on which the portion 1 t of the solution evolves Problems exhibiting such disparate time scales are said to be 575239 and a straightforward numerical approach experiences dif culties To explore the issue let us try to solve this problem by the Euler method The algorithm leads to wi1 wi hfti wi h7100wi100ti101 we 1 C This linear nonhomogeneous difference equation is of the type encountered before in this lesson and the solution is found to be I wi C17100hZ 1tii Stability requires that the rst term on the RHS be a decaying term which leads to the restriction 71lt17100hlt1 01 0 lt h lt 0102 Let us solve the IV by the Euler method for C 0 0101 and 70101 for h 0105 a value outside the stability regime The corresponding solutions are plotted in Figure 8 While the solution corresponding to 0 is computed accurately those for O f 0 illustrate dramatically the instability of the algorithmi Thus we nd that even though the rapidlyvarying component of the solution decays rapidly the algorithm is forced to employ a very short time step for the entire time span of the computation This makes the method extremely expensive if computations are to be carried out over long times A remedy is provided by implicit methods 51 Implicit Approach Backward Euler Method Consider a backward Taylor expansion where the solution at time ti1 is expanded about time ti as follows 92 21 h y i1 fly21 0012 y i1 hfti17 y i1 0h2 17 nus m m5 U2 U25 m mas m m5 us nus m m5 U2 U25 m u m m as nus m m5 U2 U25 u m m m Figure 8 Euler solution of y 7 7100y 100t 101 y0 1 C for h 005 C 0 left C 0 01 middle and C 7001 right The exact solution in each case is shown in blue and the Euler solution in red o Numencal Egtltact Figure 9 Backward Euler solution of y 7100y 100t1017 y0 1 C for h 014 C 01 The exact solution corresponds to C 0 Upon dropping the LTE and substituting wi for one obtains the backward Euler formula wil wi hfti17 wi1 This method is rstorder like the forward Euler method but is implicit In other words one will need to solve a nonlinear algebraic problem in general to determine wi1i For the linear test problem y Ay y0 a the above equation reduces to wi1 wi hAwi17 we 1 In this case it is possible to solve for wi1 explicitly and the result is wi wi1 m we 0A This difference equation has the general solution a w 7 Z 1 7 hAZ7 For A lt 0 the magni cation factor 11 7 hA is smaller than unity which causes wi to decay as i is increased Thus the method is stable for all h In other words unlike what we found in the explicit Euler method no restriction on the size of h is imposed by absolute stability In practice of course accuracy will limit the step size Example 39 For the linear problem considered in Example 318 the above method leads to h101 100ti 10 1 Z 1 01 1 100hw 1 100h J0 wi1 The corresponding numerical solution and the exact solution 1 t for C 07 are shown in Figure 9 for h 0141 We note that unlike the unstable Forward Euler solutions for h 005 shown in Figure 8 the Backward Euler solution for the much larger value h 014 is stable and approaches the asymptotic solution 1t as anticipate 1 6 IVP Tools in MATLAB MATLAB has a variety of ODE solvers using diverse methods The most popular of these is the solver ODE45 which is an adaptive procedure based on 4th and 5th order RungeKutta methods The sample code below illustrates the use of ODE45 in solving the IVP y7y75equotsin5t7 y0l7 0 t 61 The ODE is supplied in a separate m le while the calling program supplies the initial condition and the span of t over which the solution is desired The exact and computed solutions are compared in Figure 101 Figure 10 ODE45 solution of y 7y 7 56 t sin 5t7 y0 l7 0 g t g 61 quot0 script calling3m quot0 Calls ode45 and solves dydt fty yaalpha quot0 Plots numerical approximation and the exact solution quot0 Supplies initial value and span of t quot0 Requires fyt supplied in an auxiliary mfile clc tspan0 6 y01 X initial value t yode45odetestslopetspany0 figure plotty 1inewidth 2 hold on tt1inspace06 yyexpttcos5tt plotttyy r linewidth 2 zoom on maxabsyexptcos5t ZZZZZZZZZZZZZZZZZZZZZZZ Z function file odetestslope m function yprimeodetestslopety yprimey5exptsin5t 20 NUMERICAL COMPUTING Spring 2012 LESSON 4 Relevant sections from the Text 81 8367 86 861 and 87 This lesson meant as a supplement to Chapter 8 in the Text discusses numerical integration 1 Introduction The integral b 1f mm is in most practical cases impossible to evaluate analytically Even for simple functions such as 7 sinIQ the integral does not exist in terms of other elementary functions Numerical approximation is therefore the only option Integrals are numerically approximated by quadrature rules which are nite sums of the form n Qnf i1 This is an npoint rule where the points 11 are sampling points or nodes obeying a 11ltzgltiult1n71ltzn b7 and the wi are weights The main issues are how to choose the nodes and how to determine the weights so that an accurate approximation is obtained at low cost ie with as few function evaluations as possible A quadrature rule can be thought of as a generalized Riemann sum inspired by the form n Sn 2 mi Ari i1 with the wi replacing the Arii However unlike the Riemann sum the weights in a quadrature rule need not be all positive The weights depend upon a b and the nodes but not on the function The rule is closed if a 11 and b In otherwise it is open When the nodes are equispaced the rule is said to belong to the NewtonCotes Class 11 Derivation A quadrature rule can be derived by approximating f by its polynomial interpolant F on the nodes Ii and then replacing f in the integrand by F Thus Hf e NF 1 Fz dz Qnf If Lagrange interpolation is used then FW 216950114907 i1 whence b n cw Z mamz dz 1 i1 n b dz so that the weights are I wi dzi If f is a polynomial of degree n 7 1 or smaller then it is its own interpolant and the quadrature rule is exact This observation when applied to the n monomials 1 z 12 i i i zn l in succession yields a system of n linear equations for the weights This system is Qnzp1rp7100717272227171 b ldzb7a b 2 2 b7a Azdz7T b 3 3 b7a 2 7 AI dz7 3 12 bn 7 an wlzilil wgzgil i i i wnznil Inil dz it n a or in expanded form w1w2uiwn wlzlwgzgiuwn1n 2 2 wlzl 10212 i i iwnzn The matrix associated with this system is a Vandermonde matrix and there is a unique solution for the weights Examples are n 2 trapezoidal rule and n 3 Simpson s rulei 12 Error estimation The error associated with a quadrature rule can be determined by a variety of methods An obvious procedure is the following 1 EM 71m few 7 W 7m dz b f c z A dz where z 7 zlz 7 12 i i i I 7 In and 01 6 a 12 Estimation of the integral on the right can provide error bounds For the trapezoidal rule for example we get b 251 E2f Wz7az7bdz b 7 142M ltz7agtltz7bgtdz f2n In evaluating this error we have made use of the meanvalue theorem for integrals which applies when does not change sign in a by In terms of the mesh spacing h b 7 a the above result for the Trapezoidal error7 ET reads ETi7 hS 102 Wax We can also make use of the Taylor theorem for error evaluation Consider Simpson s rule b b 7 H dz 6aU4 ahlfW2MlE where now h b7 a2i Upon expanding each side in a Taylor expansion about a and cancelling common terms ll in details one obtains 15 lt4 15 4 E 7 if if s f fm Note that the error for the twopoint trapezoidal rule is proportional to hgf while that for the 3point Simpson s rule is proportional to h5f4i ln can be shown that in general the error for n even is Kf hn1 while that for n odd is Kflt 1hn2i A related notion is that of the degree of precision d of a quadrature rule it is the highestdegree polynomial for which the rule is exact The above discussion suggests that the precision of Qn is n for n odd and n 7 l for n even The constant of proportionality K in the error formulas can then be evaluated easily by applying the rule to the rst monomial for which it is not exact For example consider the 3point Simpson s rule 2h AfwwMHMWMWmHWW Applying the rule to f 14 for which f4 24 we get 5 g4h4 2h4 24Kh5 whence 1 K as above Remarks 1 Since the error is proportional to a positive power of h and h b 7 an it follows that the error approaches zero as n is increased provided fol remains boundedi 2 Note that Hm7WM Ah7 m Ahf7 dz 7wv7mw so that the absolute condition number of the integration problem 239e the ratio of the output error to the input error is simply I 7 a For the quadrature rule lm7 l 7m lZwZfm 7 an9 iltwmm7fmw lme7mm i1 Thus the absolute condition number of the numerical rule is 21 Recall the rst moment equation of the derivation of an interpolatory quadrature rule wi b 7 at If all weights are positive the numerical and analytical integration problems have the same condition number so that the quadrature rule is stable If some of the weights are negative as is true for n 2 11 then the numerical condition number can be much larger In fact it can be shown that ELI 7gt X as n 7gt 0 so that the NewtonCotes rules become arbitrarily illconditioned as n growsi The presence of large positive and negative weights also leads to the integral being computed as the sum of large quantities of differing signs further leading to inaccuracies CA3 i It turns out that rules based on the Chebyshev points ClenshawCurtis rules lead to positive weights for any n thus eliminating the above illconditioningi Gaussian points discussed next lead to still more accurate and in fact optimal approximations 2 Gaussian Quadrature ln Gaussian quadrature both the nodes and the weights are considered free parameters to be derived for optimal accuracy The derivation of the 2n parameters is done by requiring that the rule be exact for monomials up to degree 2n 7 1 The resulting system of equations is nonlinear Example n 2 and n 3 The rules are typically open rulesi Points are bunched closer near the ends of the interval Derivation of the nodes on 711 can also be accomplished by employing orthogonal polynomials and it turns out that the n nodes are the real zeros of the Legendre polynomial of degree n Thus the zeros of the Legendre polynomial P2 12 7 13 ilxg are the nodes for the 2point Gaussian rule while those of the Legendre polynomial P3 13 7 315 0i 35 are those for the 3point rule The nodes can be ported to an arbitrary interval by the usual linear transformation Usually one transfers the original interval to 71 1 and transforms the integrand accordingly 3 Composite Quadrature An alternative to using Qn with a large n is to subdivide the original interval a 12 into k subintervals or panels and to use a lowerorder quadrature rule on each paneli This is equivalent to a piecewise polynomial interpolation of the integrand The composite rule is stable if the underlying loworder rule is Thus with panel width p b 7 ak and a rule of order n we have b k ajp k V I d d m J f fltzgt z My fltzgt z cw where is the order n rule applied to panel ji Example Compound Trapezoidal rule panel width p h b 7 aki b M 7 fltzgt dz 7 give 7 fltahgt mam fa2h 7 am we 711hgt fa 1mm gm 2fah 2fa k 7 1h fa W The error is the sum over all panels ie E 7 7h73vlt2gtlt51gt We f2 kl 7 7I flt2gtltsgt 7 702 7 agth72flt2gtltsgt l2 l2 12 This form of the error indicates that for a given interval of integration the composite rule is of order 2 ie halving of the panel width reduces the error by a factor of 12 For the compound Simpson s rule 17 b 7 ak while the mesh size h 102 b 7 a2k1 We have Hf gww 4fa h fa 2h u 2h 4fa 3h fa 4h fa 2k 7 2h 4fa 2k 71h fa 2kh1 The error is E 7 7551flt4gtlt51gt 14 flt4gtlt51gt1 7 7kh75flt2gtltsgt 7 7ltb 7 agth74flt4gt 51 90 90 180 Thus the method is of order 41 We emphasize that increasing the number of panels does not change the underlying loworder method and hence the order of the derivative appearing in the error estimatesi Observe again that h b 7 a2k1 This allows us to rewrite the error as 1 b7a 5 1 if i 4 E M 2 gt k4 51 Given the bound M on 4 an error tolerance 6 will require the nuimber of panels k to obey b7a b7aM 14 kg 2 lt 1806 gt Suppose we compute by using composite Simpson s rule twice once with k panels and then with 216 panels and let the corresponding answers be A1 and A21 Then we can write the corresponding errors as 1 1270 5 1 if i 4 E1 7 90lt 2 gt k4f 51 1 b7a 5 1 if 4 E2 7 90lt 2 gt 16k4f 52 Let us now make the assumption that f4 f4 This is not unreasonable if the derivative does not vary very much on a 12 The shorter the interval the more likely this is Then 1f 7 A1 c 1f 7 A2 where 7 1 b71151 077lt 2gtyf4 1gt We then have 7 A2 7 A1 0 7 17116 and also the error estimate A A I 7 A2 s 21 15 Note that while this estimate involves only the computational results it does have the analytical estimate as its basis 4 Adaptive Quadrature Uniformlyspaced composite rules are not ef cient if the magnitude of the derivative involved in the er ror bound varies widely across the integration intervali Adaptive quadrature addresses this problem by discovering where the integrand is illbehaved and re ning the mesh size there accordinglyi Such a procedure based upon the composite Simpson s rule may proceed as follows H Specify tolerance 6 10 Compute I in two ways by A1 and A2 with A2 the more accurate 93 Estimate the error in A2 based on A1 7 A2 1f the error is within tolerance then done If not then divide a 12 into two equal subintervals and repeat 0 Continue to divide and rule If the tolerance is met on a particular subinterval then no further subdivision of that subinterval is required 01 Stop when the tolerance is met on all subintervals on Certain subtle points arise should tolerance be relative or absolute should one ignore unconverged intervals from which contribution to the integral is zero etc Example Evaluation of 1 dm 0 to an accuracy of 6 00005 by adaptive quadrature The notation used below is as follows Sa b Result of Simpson s rule with h b a 3a b Result of composite Simpson s rule with h b a27 Ea b Estimate of the error associated with 3a b We begin by dividing the interval 01 into two panels of equal width On each the acceptable tolerance is 62 000025 1 Panels 012 and 121 S121 043093403 S121 043090219 1 4 1 E12 1 E09 S 00000018775 lt E00005 000025 Since tolerance is satis ed accept 312 1 10 Similarly 30 12 4 51012 E012 V 022559223 023211709 7 000043499 gt 000025 V Unacceptable Divide subinterval and repeat Error per new subinterval 0000252 0000125 S1412 015235819 S1412 015230814 E1412 0664x 1076lt0000125 Since tolerance is satis ed accept 314 12 S014 007975890 S014 008200578 E014 000015 gt 0000125 Unacceptable Divide subinterval and repeat Error per subinterval 00001252 0625 x 10 4 S1814 005380075 S1814 005387027 E1814 02340 x 104 lt 0025 x 1074 Since tolerance is satis ed accept S 1 8 14 S018 002819903 S 018 002901464 E0 18 05437 x 104 lt 0625 x 1074 Since tolerance is satis ed accept 30 18 Result I S018 S1814 S1412 312 1 066621524 1W 23 066606607 E 000045143 lt 00005 NUMERICAL COMPUTING Spring 2010 LESSON 3 Relevant sections from the Text 117 5557 562 This lesson discusses iterative numerical algorithms for nding roots of algebraic equations R0 bustness and speed of convergence are ofparticular interest Sample MA TLAB codes are provided These notes supplement material in the text 1 Introduction In this lesson we consider the problem of nding numerically the solution of the single nonlinear equation in a single variab e me 7 0 The values of z satisfying this equation are known as the roots of the equation or the zeros of the function f I First let us look at a few examples of how such problems may arise Example 11 The chemical reactor problem In a chemical reactor suppose that the concentration of a particular chemical as a function of time is given by Ct 100 7 65aquot 7 25653 The initial concentration is CO 10 and the concentration is increasing with t Suppose we wish to nd it at which C is twice the initial value Then we must solve the nonlinear equation 20 100 7 656quot 7 256 5 7 or 656quot 256 5t 80 This problem has a unique solution but it cannot be found analytically Example 12 The missile intercept problem Let the movement of an object A in the zy7 plane be described parametrically by the equations I1tt7 y1t1 6 Let the motion of a second object B be described by 12t 1 7 tcos a y2t tsina 7 0172 Here it is the time Is it possible to choose a so that both the objects will be at the same location at some time We set the z and y coordinates equal and get the pair of equations 17tcosa 1 7 e t tsina 7 0172 Elimination of 1 yields ft E 1 7 t2 1 7 equot 01 7 t2 0 The rst positive root of this equation if one exists yields the time of intercept tquot i Then the desired a is given by 1 7 t tquot 7 provided this equation has a solution If it does there is an in nity of solutions COSOt Example 13 The minimum surfaceof revolution problem Let Azly1 and Bzgy2 be two points in the my plane Consider the problem of determining the curve connecting A and B for which the surface area of revolution about the z axis is minimized It can be shown that the optimal curve y satis es the differential equation 1 y 7 W 07 with general solution 1 y a cosh 01 C1 where C and C1 are constants of integration This curve is a catenaryi Consider the problem for 11 710 12 Io yl yg 1 For 10 12 does the optimal curve exist If so how many solutions are there What about 10 1 Upon substitution into the general solution above the boundary conditions 11 710 12 IO yl yg 1 yield C1 0 by symmetry and require C to satisfy the equation C coshCzoi For 10 12 we have C 7 coshC2 0 and for 10 1 the equation becomes C 7 cosh C 0 Existence and possible multiplicity of solutions depends upon the roots of the above nonlinear equations The situation is depicted in the gure belowi There are two roots in the rst case 111788 and 42536 as we shall nd later on and none in the second Figure 1 The left gure shows graphs of C7coshC2 07 corresponding to 10 12 and C7coshC 07 corresponding to 10 1 The right gure shows the catenaries corresponding to the two roots for the case 10 1 2 Example 14 Consider the class of polynomial functions of which an example is the quadratic a0 alz agz2 which has two zeros 7a1 Na 7 4a0a2 7a1 7 Ma 7 4a0a2 zl and zg i 2 2a0 Polynomial equations of degree 3 and 4 can also be solved analytically although the formulas for their roots are more complicated No such formulas exist for the zeros of polynomials of degree 5 and higher and one must resort to numerical computation Example 15 Consider the trigonometric equation sinz 7 a 0 For a 12 we can write down all the roots 7r6 i 27m and 57r6 i 27m where n is an integer No such expressions exist for the roots corresponding to a Thus we see that there are few problems of this kind for which the exact solution can be written down in the form of a mathematical formula which can then be evaluated for speci c numerical values either by hand or by a computer lnstead as we shall see one must resort to a numerical algorithm that is iterative 239e it produces a sequence of numbers which one hopes quickly converges an approximate solution that has the desired accuracyi Implementation of an iterative method begs the following questions We note that the new bracket is onehalf the length of the original bracket 239e b2 7 a2 62 We repeat the process and observe that the mid point of the new interval 7 a2 52 2 provides an estimate for a with error bound 2 7 a2 6 52la a2l 2 QA Note that the estimate 12 is obtained after one step of bisectioni After n steps we have the estimate an1 bnl an1 f for the zero a with error bound 6 enl la an l S 2nr In order to achieve the error tolerance en1 g e we must choose n such that 6 fwd lt e 239e ln6e gt 7 i n 7 ln2 1 Remarks 1 The method always converges 239e an a a as n a 07 as long as the initial bracket contains the single zero 1 2 The function plays no role in determining n for given 6 only 6 matters 3 For 61 we get 1210 7 97656 e 7 04 1220 7 965367 e 7 07 1230 7 963132 e 710 1240 7 960949 e 713 Thus every 10 iterations gain about 3 digits of accuracy According to a de nition to be introduced shortly convergence is lineari 4 The procedure does not work if a is a multiple zero of even order Why not 5 Here is a pseudocode for implementing bisection to satisfy a given tolerancei input abtol provide f while ba gttol do ca ba 2 if f c 0 then done else if signfa signfc then ac else bc end end and then to fCn fIn If we assume that convergence is occurring e the starting guess was close enough then zn17azn7a2 i 3320 7 av and we can say that fa and f cn f a so that a 7 zn1 Ca 7 In2 where 7W 0 may In fact a In1 7 nligtoltgt a 7 In2 7 C We recall the following de nition that allows us to characterize how rapidly a sequence converges to a limit 311 Order of convergence for sequences Let the sequence 37 converge to s such that lim m C n7gtoltgt ls 7 snlp for some positive constant C and p Z 1 Then p is called the order of convergence of the sequence For the special case p 1 it is required that C lt 1 and convergence is said to be linear Such was the case for bisection where C 12 For 17 2 which applies to Newton s method as we saw above convergence is said to be quadratici Once the iteration is close enough to convergence error is roughly squared with each iteration and the number of correct digits approximately doublesi For the xg example computed above we nd that e4e 2234042553224218e o1 e5ei 2236067058510029e 01 eBeg 2235786924898081e 01 We see that n1 i is approaching a constant with increasing n as quadratic convergence requires The remaining question is how close does the initial guess need to be Consider a function f that is twice continuously differentiable on a b and satis es the conditions 1 lt0 2 f has no zeros on a z 3 f does not change sign on a z 4i lfaf al and are both less than I 7 a Then 0 has a unique solution a in a z and Newton7s iteration will converge to a from any point in ab Remarks The rst condition establishes the existence of a solution The second and third conditions force f to be monotone and inflectionfree and guarantee uniquenessi The last condition ensures that an iteration starting at either end point of the interval remains in the interval We shall return to some of these ideas in our discussion of xed points Figure 7 Graphs of 91 1 z13 and 921 IS 71 on 12 Note that the simplest choice 91 1 yields the second of the above choices but there are several alternatives to choosing gr Starting With 10 07 let us carry out 10 xedpoint iterations for each of the above four choicesi Here are the results 911 1 9013 k 1 x 1 259921049894873e00 k 2 x 1312293836683289e00 k 3 1 322353819138825e00 k 4 x 1 324268744551578e00 k 5 x 1 324632625250920e00 k 6 X 1 324701748510359e00 k 7 x 1 324714878440950e00 k 8 x 1 324717372435671e00 k 9 1 324717846162145e00 k 10 x 1 324717936144965e00 921 13 71 k 1 x O OOOOOOOOOOOOOOOeOO 1 OOOOOOOOOOOOOOOeOO 2000000000000000eOO 9000000000000000eOO 7300000000000000e02 WW5 3890170010000000e08 5 887158716227059e25 2 040409013227526e77 8 494771472237388e231 Inf 1 x 2000000000000000e00 7500000000000000e01 3111111111111111e00 4 247448979591836e01 7 897338780221665e00 1426588076214919e01 614607377836743e01 812790487205253e02 098183336118154e03 10 x 3228739752529491e04 ID 00 l 01 O1 gtJgt A to gt N N N N N gt N H WWWWWWWWW 1 X 1500000000000000e00 DOONOEO39IpbWIO NNNNNNNN 324717957244746e00 1324717957244746e00 WWWWWWWWW 10x 11 931 12 137171 94ltIgtI m We observe that the rst iteration converges to 8 digits in 10 iterations and the fourth to 16 digits in six While the second and third iterations diverge Returning now to the remaining questions posed above we proceed analyticallyi Formal De nitions and Theorems Contraction Let g I be continuous on the bounded interval a 1 Then 9 is said to be a contraction on a z if there exists a constant L 0 lt L lt 1 such that Mr 7 Remarks 9yl SLlI7yl Wye laybl 1 1 The term contraction stems from the fact that the distance 79yl between the images of z and y has shrunk by at least the factor 2 Let 9 also be differentiable in abi Then for any my in ab the MeanValue Theorem implies that Mr 7 9W 7 WWW 7 W S WWHI 7 yly Where 77 6 ab Therefore if lg 77l g L lt 1 for 77 6 ab then 9 is a contraction This is an easier condition to apply than 1 to establish contractioni This result gives the error at the nth iteration in terms of the difference of the rst two iterates I can be used to determine how many iterates are needed to attain a prescribed tolerance 6 Thus r Ln 1 7 z lt 6 1 7 Ll 1 l is satis ed for n gt lnlzl 7 10 7 ln6l 7L 7 1111 L It is still possible to get local convergence results if we only have information on 9 near the xed point Such information may be dif cult to come by since a is an unknown However the theory so developed leads to results of practical utilityi A Local Convergence Theoremi Let g be differentiable in an open interval containing the xed point a and let lg al lt 1 Then for 10 suf ciently close to a the xedpoint iteration converges and a In1 1 7 i 3 ange a W W Since 9 is continuously differentiable in an open interval of a and lg al lt l we can nd a closed interval centered at a in which lgz g L lt 1 Then to prove convergence we proceed as in proving the Contraction Mapping Theoremi We have a In1 9a 9In 9a 7 M 7 son 7 coga Mgrlt50 a 7 ruga 7 WM where 5 lies between a and zni As n7gt 0 In and En approach 1 Therefore for ga 0 3320 73521 9W Since lg al lt l convergence is linear If ga 0 but g a y 0 then the above argument leads to la In1l 1 hm 7 7 1 Moo 1 7 In 2lg l so that convergence is quadratici This result can be extended in an obvious way Mr Let us suppose that for the root nding problem 0 we wish to nd a xedpoint problem 91 I such that convergence of the xedpoint iteration is quadratici How do we nd the requisite y We begin by seeking g in the form 995 I 90101 where 101 is to be found We note immediately that at z a the above equation is satis ed for arbitrary 11 Now 91 1 fIMI NPI For quadratic convergence we need ga 0 This achieved by choosing 101 Then ga 7fap oz 0 since fa 0 Our choice leads to 7 fr 9I77 fmxy The corresponding xedpoint iteration is f In 7 gm 7 In 7 Wu and we have derived Newton s method from a different point of viewi 14 5 Nonlinear Systems of Equations ln Lesson 2 we studied methods to solve systems of linear equations and in this lesson we have concentrated on the solution of a single nonlinear equation thus far We now combine the relevant concepts and consider systems of nonlinear equations Our approach will generalize Newton s method for solving the single nonlinear equation 0 Recall that the Newtoniteration equation 31 was derived under the assumption that if In is the present iterate to the solution then an improved iterate can be obtained by replacing by its linear or tangentline approximation centered at I In and solving the resulting linear equation The tangentline approximation which is just the Taylor polynomial of degree 1 is THE f1n1 MmeIn The solution In1 of T1I 0 is I 1 7 I 7 i zn n 7 n f 7 which is the same as equation 31 derived earlieri We now consider two nonlinear equations in two unknowns written as f1 9 0 102179 0 Let In yn be the current guess to the solution We linearize the above equations at this guess noting that the linearization now consists of a degree1 Taylor polynomial in two variables These polynomials when set to zero yield E E f1ltzmyngt z 7 zngta7fltznyngt y 7 yngta7fy1ltzmyngt 7 0 f2rmyn z 7 zngtltzmyngt y 7 mg m 7 0 This is a pair of linear equations whose solution I y provides the next guess With I 7 In AIn and y 7 yn Ayn the problem can be written in a matrix form as 3 3 Inv yn TQltIn7 yn AIn 16117 yn Imyn Ayn f2lt1 7y With the solution determined the next iterates are given by n1 In Azn yn1 yn Ayn 6 Miscellaneous Topics 61 Termination of a xedpoint iteration With bisection method given the initial bracket one can compute the number of iterations required for given accuracy We cannot do this in general So the question arises what criterion does one use to stop the iteration ldeally one should stop when a 7 In is small enough but we do not know at We can however keep track of the iterates In and the residuals and may be tempted to stop when either In 7 In1 or is small enough For functions that are very at near a we use the smallness of In 7 In1 and for functions that are very steep near a we use the smallness of Both cases can be covered by requiring that In 7 In1l be small enough 62 Multiple roots Recall that Newton s method assumes that fa a 0 If a is a double root then one can write f11 7 a2FI7 where Fa 0 The xedpoint iteration function for Newton s method now becomes 91 I M 1 17a2Fz 17 I7QFz fI z 7 appI 2at 7 max I myI 21 One can show that for this choice ga 12 0 Therefore convergence is linear One nds that quadratic convergence can be recovered by selecting 1 g z z 7 2 l Hz Similarly for a root with multiplicity m the proper choice is 1 91 z7mfzl 63 Aitkin7s extrapolation In computation we can often accelerate convergence or improve accuracy with only a small amount of additional work Consider a convergent xedpoint iteration In gzn1l Then proceeding as in the proof of the convergence theorem 0 i In 90 7 917171 95n7101 7 17171 Also 1 a 7 In a 7 In711n71 7 In m l 7 In1n71 7 In whence 5 7 Q n71 7 a 7 In 1 7 71 In In71 This is a formula for the exact answer a in terms of successive iterates 1771 and In but is not directly useful as gzn71 is not known The extrapolation consists of estimating gzn71 by a computable quantityl This estimate is obtained in two steps First note that as n 7gt 0 In 7gt a and also 5 7gt at Therefore gEn7l y a Second consider the quotient An Izn7xzn a 7 1771 7 a 7 In a i 17172 a i 17171 a 7 In71 7 y 5n710t 7 In71 1 7 In7lgEn72 7 a 7 In71 1 7 Nina 7 9 5n7217 gmgn s ga Thus we have the estimate n R 95n717 16 and hence the computable estimate for a z A a 171 an 7 We use this estimate to compute from the sequence In the secondary sequence A7 17A In1n In In717 which is expected to converge fasteri 64 More than one root Once a root 1 is found one looks for the second one by considering the de ated function 7 a and applying one of the methods discussed to it It is a good idea to go back and re ne each root using the original function to avoid contamination due to rounding error 65 How does Matlab nd roots The methods discussed so far belong to one of two categories reliable but slow bisection or fast but possibly unreliableNewton secant i i i Practical software often uses a hybrid approach combining the strengths of more than one methods For example Matlab s routine fzero combines bisection secant and inverse quadratic interpolation The simplest invocation of the routine is fzero funct ion x0 where function is the name of the function being evaluated and x0 is an initial guess For example fzero sinx 3 looks for the zero of sinz near 10 3 The routine starts by nding a bracket around 10 in which the zero liesi This is done by taking increasingly bigger steps on either side of 10 until a bracket identi ed by a change in the sign of the function at the end points is found One can also provide a bracket by using the command fzero sinx 25 351 Once the bracket is found the routine uses secant and inverse quadratic interpolation and shrinks the bracket If at any stage an iterate falls outside the bracket then bisection is used for a few iterations before returning to the fast methodsi Use the command type fzero to get a listing of the routine There are commands to change the default tolerance the display mode etc use help fzero to get more information Matlab uses a different procedure related to eigenvalue search to nd all the zeros of a polynomiali 7 Appendix 71 Sample Matlab code for bisection Consider the problem of nding the unique zero of 12 7 4sinz in the interval 13 We propose to 0 so by using the following Matlab program written as function M le bisect mi The input quantities are the end points of the interval and the tolerance to which the root is to be calculated The output is the root For instructional purposes we also count and display the number of iterations neededi Additionally the pause command is invoked which allows the output to be displayed one iteration at a timer The function is de ned separately in a function M le f m and the function f is explicitly referred to in bisect m For execution purposes it is best to put both the M les in the same directory The program is called by the following Matlab command issued in the Command Windowi gtgt xbbisect 1 3 1e8 The complete output is also printed belowi function xbbisectabtol Z Matlab function bisectm Z Solves fx 0 using the bisection method Z Z Inputs Z a left endpoint of interval X b right endpoint of interval Z tol tolerance for stopping bisection method Z Required mfile Z fm is mfile for function fx Z Outputs Z Root by bisection Z Z it0 Z set iteration count to zero fafa fbfb Z Z check bracketing X if signfasignfb error Root not in bracket end Z begin while loop 7 while bagt2tol itit1 ca05ba fcfc if signfasignfc Z root lies in ad bc fbfc else Z root lies in cb ac fafc end Z print progress xb05ab err05absba fprintf n n Xi Solution Z2015e error Z2015e n itxb err pause Z The pause command causes the execution of this routine to X be paused at the end of each iteration Press any key Z on the keyboard to continue end Z print results format long e fprintf nn Computed Solution Ze nn xb fprintf Iteration count Xi nn it 18 mmmmmmmmmmmmx Z Matlab function fm Z Defines fx function y fx y x 24sinx 730ZZZZZZXZXZZXZZZZZXZXZZXZ n 1 Solution 1 500000000000000e00 n 2 Solution 1 750000000000000e00 n 3 Solution 1 875000000000000e00 n 4 Solution 1 937500000000000e00 n 5 Solution 1 906250000000000e00 n 6 Solution 1 921875000000000e00 n 7 Solution 1 929687500000000e00 n 8 Solution 1 933593750000000e00 n 9 Solution 1 935546875000000e00 error 5000000000000000e01 error 2500000000000000e01 error 1 250000000000000e01 error 6250000000000000eO2 error 3125000000000000eO2 error 1 562500000000000e02 error 7 812500000000000e03 error 3906250000000000e03 error 1 953125000000000e03 n 10 Solution 1 934570312500000e00 n 11 Solution 1934082031250000e00 n 12 Solution 1 933837890625000e00 n 13 Solution 1 933715820312500e00 n 14 Solution 1 933776855468750e00 n 15 Solution 1 933746337890625e00 n 16 Solution 1 933761596679688e00 n 17 Solution 1 933753967285156e00 n 18 Solution 1 933750152587891e00 n 19 Solution 1 933752059936523e00 n 20 Solution 1 933753013610840e00 n 2 H Solution 1 933753490447998e00 n 22 Solution 1 933753728866577e00 19 error 9 765625000000000e04 error 4882812500000000eO4 error 2 441406250000000e04 error 1 220703125000000e04 error 6 103515625000000e05 error 3 051757812500000e05 error 1 525878906250000e05 error 7 629394531250000e06 error 3 814697265625000e06 error 1 907348632812500e06 error 9 536743164062500e07 error 4 768371582031250e07 error 2 384185791015625e07 n 23 Solution 1 933753848075867e00 error 1 192092895507812e07 n 24 Solution 1 933753788471222e00 error 5 960464477539062e08 n 25 Solution 1 933753758668900e00 error 2 980232238769531e08 n 26 Solution 1 933753773570061e00 error 1 490116119384766e08 n 27 Solution l 933753766119480e00 error 7 450580596923828e09 Computed Solution 1933754e00 Iteration count 27 Kb 1933753766119480e00 72 Sample Matlab code for Newt0n7s method Consider the problem of nding the square root of 5 which is equivalent to nding the positive zero of 12 7 5 The Newton algorithm for root nding is written below as the function M le Newtonmi The implementation requires the initial guess IO and the required tolerance tol as inputs Also required as an input is the string name of the function M le here fff m where the function and its derivative f z are de ned Contrast this approach with that used in bisectm above where f was not included as an input The output is the root For instructional purposes we also count and display the number of iterations needed The program is called by the following Matlab command issued in the Command Windowi gtgt rNewton fff 71e15 function r Newtonfunx0tol quot0 Input fun string name of mfile that returns fx and f x quot0 x0 initial guess quot0 tol tolerance for stopping quot0 quot0 Output r the root x x0 quot0 initial guess k 0 quot0 initialize iteration counter maxit50 quot0 max iterations allowed while klt maxit kk1 f fp fevalfunx dx ffp x xdx err absdx fprintf n k Xi Solution Z2015e Error Z1510e n kxerr if err lt tol rx return end 20 end warningsprintf root not found within tolerance after Zd iterationsn k ZZZZZZZZZZZZZZZZZZZZZZZ Z file fffm Z Specifies f and fprime function ffp fffx fx 25 fp2x ZZZZZZZZZZZZZZZZZZZZZZZZ k 1 Solution 3857142857142857e00 Error 31428571429e00 k 2 Solution 2 576719576719577e00 Error 12804232804e00 k 3 Solution 2 258585661049727e00 Error 31813391567e01 k 4 Solution 2236180226081103e00 Error 2 2405434969e02 k 5 Solution 2 236067980317037e00 Error 11224576407e04 k 6 Solution 2 236067977499790e00 Error 2 8172470799e09 k 7 Solution 2 236067977499790e00 Error 1 9860273226e16 2236067977499790e00 73 Sample Matlab code for xedpoint iteration In this example we consider the xedpoint iteration for the function 91 2 81 7 12 The starting guess is 1 5 and the interval 02 is being used for plotting purposes Z Script fixedl m X Carries out the fixedpoint iteration Z Input g function being iterated Z In the calling program use a command such as Z ginline x 31 Z to supply the input function g X Otherwise supply the function g in this script Z r initial guess Z Output plot of iteration Z r the fixed point if it exists maxit15 Z max iterations allowed ginline 28xxx r 15 X initial guess k 1 Z initialize iteration counter 21 xlinspace02 ygxgt figure plotxyxx grid on hold on plotrr0 gr r hold on while klt maxit kk1 r1gr plotr r1r1 r1 r hold on plotr1 r1r1 gr1 r hold on rr1 fprintf n k Zi r Z2015e n kr pause end k 2 r 1949999999999999800 k 3 r 1657500000000000900 k 4 r 1893693749999999e00 k 5 r 1716266481210937e00 k 6 r 1 859975512862452900 k 7 r 1748422527566924e00 k 8 r 183860174228387Ge00 k 9 I 1767628511665548e00 k 10 r 1824849277410574e00 k 11 r 1779503091483713e00 k 12 r 1815977403554304e00 k 13 r 1 786962799732219e00 k 14 r 1810259791623403900 k 15 r 1791686903377122e00 22 O5 05 23 NUMERICAL COMPUTING SPRING 2012 LESSON 1 This lesson introduces the subject of Numerical Computing and reviews introductory concepts from calculus that will be needed as the course proceeds It also deals with errors that arise in the construction of numerical solutions to mathematical models with niteprecision representation of real numbers and with conditioning of mathematical models and stability of numerical algo rithms Supplements Chapter 1 of the text 1 Introduction Scienti c computing or numerical computing is the activity of using computers to obtain solutions to mathematical problems that are either too laborious or impossible to solve by han i Problems of this sort arise routinely in science engineering medicine nance and numerous other elds Prior to the advent of computers solutions of mathematical problems were largely produced through analytical means often involving concepts of calculus Example 11i Oscillations of a springmass systemi dz oi 7d d2 d iclkz0 10 vi m M dt The solution of this linear model obtained by integrating the ODE analytically is 1 012771 1 76 sinwt w w The solution determines I as a function of t and the parametric dependencies are explicitly displayedi Example 12 Arc length of a piece of the sine curvei Analysis provides the formula for the arc length as an integral a L lcos21 dzi 0 Further analytical progress is impossible as the integrand does not have a known antiderivativei For a numerical choice of a t e integral can be A A 39 39 ru e say or a 7r the answer to six decimal places is 3620198 However the parametric dependence of L upon a is not available unless one carries out a number of computations with different values of ai JJuJJJULJLaJJ y A Realworld problems such as 0 weather forecasting O computing the flow around an airplane in an internal combustion engine or around a stent in a blood vesse O designing a drug a heart valve a safer rocketmotor propellant or the schedule of an airline 0 oil reservoir simulation 0 image processing and reconstruction O forecasting economic trends etc are much more complex than the simple examples considered above It turns out however that these large scale problems can typically be broken down into simpler component parts and these parts are not unlike those just consideredi They may involve 0 tting functions to data 0 solving large systems of linear or nonlinear algebraic equations 0 solving eigenvalue problems for matrices O computing integrals or O obtaining solutions of systems of ordinary differential equations This course will concentrate on such basic classes of smallscale problems for which we shall study numerical algorithms We shall learn how to implement these algorithms and mathematically analyze their behavior The last item is particularly important because as we shall see computer solutions are almost always approximate We must make sure that the numerical procedure being used does not compromise the solution by magnifying small initial errors or accumulating excessive errors during the computation The algorithm should ensure that employment of additional computing resources produces a solution that is increasingly close to the correct answer Another important issue is ef ciency some algorithms may require fewer operations than others to accomplish the same task The aim is to understand the ideas and concepts underlying scienti c computing rather than specialize in a particular area MATLAB will be an important component of the course and you are expected to acquire pro ciency in its user We shall use MATLAB primarily as an exploratory environment It also has many builtin functions for solving particular problems You will learn how to use some of them gain an understanding of how they work and why they sometimes do not so that they can be employed intelligentlyl 2 Theorems From Calculus Taylor s Theoreml Let have n1 continuous derivatives on ab for some n 2 0 and let z 10 E ab Then 161 Pn1 RAIL where n k r 7 1 PM Z fltkgtltzogt and 1 1 RN 7 z 7 t f 1t at n 10 An equivalent form of the remainder is I 7 IonTl R n1 7 we n1 f c where c depends on 1 lies between 10 and I but is otherwise unknownl Despite c being unknown we shall see how the knowledge of the form of the remainder can still allow it to be estimated Taylor s theorem is important because it allows one to represent a large class of functions by polynomial approximations with known speci ed boundable error Examples of some familiar functions and their Taylor series 1 1 r i k n1 c 6 160er Tnllz e 7 Dk 2k1 illnTl 2n3 sinz 7 mz cosc n 7 71V 2k 71 1 2n2 cosz 7 E 2k 1 mz s1nc k0 Example 20 Given that M4 1 Rzjlt1cgt for ze7l2l2 where c is between I and 0 nd an upper bound on lRl valid for all z 6 712 12 that is independent of z and c 3 25 iexact 25 appro m on 2 2 15 15 7 1 1 7 05 05 7 0 1 1 0 1 1 05 o o 5 1 1 05 o o 5 1 X X Figure l Approximation and error in Example 211 We note that both I and 5 lie in the interval fl212ll To get an upper bound or lRl we maximize the numerator and minimize the denominator Thus lRl ltl1lclgt 1721 Example 211 Consider the problem of approximating e on the interval 711 We select 10 0 Since I can be any point in 71 l the same is true of cl Suppose we wish the approximation to be accurate to within 10 4 in absolute value 239e we want l6 7 PAIN anIl lt10 4 for all z 6 711 This can be accomplished by nding an upper bound on and choosing n so that this upper bound is smaller than 1041 This upper bound corresponds to evaluating Rn at z c 1 Thus we set e 7 lt 1074 whence n 1 l gt 10461 71 1 i l l By computing the rst few factorials it is easily seen that the smallest n satisfying the above inequality is n 7 The results for the approximation and the error are shown in Figure ll On a larger interval the error may be larger This is shown in Figure 2 where both the exact and approximate functions are plotted on the interval 71 5 Example 22 Let as 1 7 git 7 dt 0 t 0 Find Pnz the Taylor polynomial approximation to of degree n and centered at z 0 Also nd the derivative form of the remainder Rn 0 Find n so that P7 approximates to within an error of 10 4 on fl212ll Zn 0 20 i eXPX x polynomial approx gtlt X 15 100 XX X XXX 10 50 5 u 0 0 1 2 3 4 5 1 0 1 Figure 2 Approximation and error on a larger interval in Example 211 We begin with a Taylor approximation of the integrand which in turn requires the Taylor approximation of 7 e l Slnce dn Tyler Arm and therefore d 7x n dzne 10 71 7 we have the Taylor series 2 In In1 7117 i Hilni 71n1770 e 121 7u nnf where 5 lies in the interval from 0 to If A simple rearrangement yields 7171 n 176 z 12 7 7 if i 7 n71 7 7 fz7 71 1 n1e i z 2 3 nl To obtain the series for we integrate the above expression from 0 to If 12 13 n71 In inn 1 n 7c f I m lt mtmo dt Pnz The remainder can be manipulated as follows fly 1 7 fly 7A ac zn1 7A Rn z 7 tne 0 dt e C t dt 71 n e Cl n1 0 n1 0 n1nll Here we have highlighted the fact that in the rst integral above 5 is a function of the integration variable t and lies between 0 and t Also 6 is a constant that lies between 0 and If We have used the lntegral Mean Value Theorem see below in obtaining the above result We now bound the remainder to ensure that the error associated with the Taylor polynomial is less than 10 4 for z 6 712 12 We have anl Mil In1 A mnmnf It is easy to check that n 4 will suf ce since the above expression then has the value 8 6 X 10 5i The 1 39 1 col quot Taylor r A A 39 39 to is P4II M M Mean Value Extreme Value and Intermediate Value Theorems Mean Value Theorem Let be continuous on ab and differentiable on a7bi Then there exists a point c 6 ab such that b f i a f C WA The following form is more useful it allows difference of values of function to be expressed in terms of difference of values of argument 1612 1611 fCI2 7951 For example we infer that lCOS127 coszll g 12 7 11 Intermediate Value Theorem Let be continuous on all Let W be avalue between u and Then there exists a c 6 ab such that W This theorem will be found useful in root nding It tells us that a continuous function is one whose graph can be drawn without lifting the pen off the paper Extreme Value Theorem Let be continuous on all Then there exist m 6 ab and M 6 all such that g for all z 6 ab and Z for all z 6 all Furthermore these minimum maximum values are achieved either at the end points of the interval or at critical points Integral Mean Value Theorem Let and 91 be both continuous on ab and let 91 not change sign on all hen there exists a c 6 a7 12 such t at b b ftyt dt fc glttgt at This theorem has the following discrete versioni Discrete Average Value Theorem Let be continuous on all Consider the sum 5 Z akf k 161 where 1k 6 ab ak Z 0 and 221 ak 1 Then there exists a c 6 ab such that 5 The above results will be found useful in obtaining error estimates for numerical algorithmsi Order of convergence for sequences Let the sequence 3n converge to s such that s 7 s hm lintll C n7gtoltgt ls 7 snlp for some positive constant C7 and p Z 1 Then 0 is called the order of convergence of the sequence For the special case p 17 it is required that C lt 1 and convergence is said to be linear For p 2 convergence is said to be quadratici 3 Sources of Error in Computations Construction of a numerical solution to a physical problem involves a number of steps identi ed in the schematic diagram belowi here are several reasons why the numerical solution may provide only an ap proximation to the exact answer the mathematical model may be de cient the input data incorrect and errors introduced during the computational process The job of the computational scientist is to understand how the errors arise and to control them suf ciently so that the answer is usable We shall look at errors in detail and from several points of viewi For the purpose of this course our starting point will be the mathematical model we shall not dwell on the approximations involved in its development Mathematical Model 1 Numerical Model 1 31 Measuring Error The error in using the approximation 1 for an exact often unknown quantity 1 is absolute error 1 7 1 Often one uses the absolute value 1 7 1 Thus 321143 approximates 32114159 with an error of 0001411 A more useful measure is the relative error de ned as 7 1 1 relat1ve error 1 Often the approximation 1 rather than the exact value 1 appears in the denominator Relative error is not de ned i 1 0 1n the above example the relative error is 000141 7 000004387 04387 1074 32114159 The number 1 approximates 1 to t signi cant digits if t is the largest nonnegative integer for which 1 7 1 1 71041 1 lt 2 Here 321143 approximates 32114159 to 4 signi cant digits 32 Data Error and Computational Error Let us suppose that the computational problem consists in evaluating the function for which 1 is the exact input and the exact outputi Let the inexact input be 1 so that data error 1 7 1i f I gt fr data error L propagated data error computational error f 02 Exact computation on the inexact input produces the incorrect output leading to the propagated data error 7 In practice computation of f is approximate as well and if the computed function is represented by f then the computed value is leading to the computational error 7 Now the total error is total error 7 7 7 computational error propagated data error Example 311 Consider the evaluation of sinz at z TrSi To get a quick estimate of the answer let us approximate 7r by 3 thus replacing the exact input I by the approximate input i 35 06 Let us also approximate sinz by z the rst term in its Taylor expansion about I 0 Then we have the approximate answer sin7r5 m 06 The exact answer is sin7r5 0587785 with error 00122151 The total error can be broken down as follows data error i 7 z 06 7 7r5 700283185 propagated data error 7 sin06 7 sin7r5 700231428 computational error 7 06 7 sin06 00353575 total error the sum of the two above 0012215 We note that the two errors are of opposite signs but that may not always be the case 33 Truncation Error and Rounding Error Computational error is the sum of truncation error and rounding error 0 Truncation error is the difference between the true result of the mathematical model and the result obtained from the numerical model using exact arithmetici Recall that mathematical models are typically continuous and involve the notion of the limit as in an integral an in nite sum or a derivative the limit of a difference quotienti The computer is unable to handle the in nite processes and must replace them with nite ones as in truncating an in nite series or replacing a derivative by a difference quotienti Such a replacement of a continuous process by a discrete one is necessarily approximate and leads to discretization or truncation errorsi A common example is the replacement of a transcendental or trigonometric function by a Taylor poynomial eigi replacing e by T3z 1 1 47122 133li If eo39l is approximated by T30il using exact arithmetic then the truncation error is T30i1 7 601 We shall discuss this error in more detail in subsequent lessons 0 Roundoff error is the difference between the result of the numerical model using exact arithmetic and the result of the same model using niteprecision arithmetic As we shall see in Section 2 the machine necessarily approximates the set of all real numbers by a nite and discrete set of reals leading to roundo c errorsi Example 32 Interplay of truncation and rounding errors in approximation of dsin zdz cosz by the difference quotient sinz h 7 sinzhi As h 7gt 0 the truncation error declines and the quotient should tend to the exact resulti See the gure below which plots the difference between cosz and the difference quotient sinz h 7 sinzh as h is reduced from 1 to 10 15i The computation is carried out for z 5 oundoff error intrudes at approximately h 10 3 grows as h is reduced further and contaminates the approximation severelyi 10 1o 2 f ff 3 10 4 f P U 2 10 6 2 10 8 7l0 1010715 1 710 75 100 log h 34 Forward and Backward Errors f backward error f forward error A f A A I f I 161 Let us again consider the problem of function evaluation but now from a slightly different point of viewi As before I is the exact input and the exact output We assume that the input 1 remains exact but the function is again evaluated approximately yielding the computed value Then the error in the solution 7 is called the forward error whose relative size is one estimate of the quality of the calc ationi An alternate approach is to treat the approximate answer as the exact solution of a perturbed problem and ask how large such a perturbation needs to be to produce the computed result In this case let the perturbed problem be de ned by the modi ed input i that satis es the condition Then li 7 ml is the backward error How the perturbed problem is de ned may vary from one problem to the next Consider for example the problem of nding the zero r of Then 0 Let the computed zero be r so that the forward error is r 7r Now r is the exact zero of the perturbed function 7 0 Therefore the backward error is the perturbation Note the relationship between forward error and output error and backward error and input error 35 Error Propagation Conditioning Let us consider the situation in which computation is exact so that the only error in the output is that due to the propagation of data error in the input Let us see whether a small relative error in the input necessarily implies a small relative error in the output Example 33 Suppose that we wish to compute the distance d between two points A and B at distances a and b respectively from a reference point 0 with subtended angle 1 It can be shown that the exact expression for d is d H b 7 a2 4absin2oz2i For a 100 units 12 102 units and a 2 the answer is 73275767 Now suppose that we make a 1 error in the angle so that a 2702 Then d 773953907 The relative error in the output is 77395390 7 77327576 W 07009257 or 0792592 about the same as the input errori Now suppose that we make a 2 error in a so that a 102 Then d 77190663 and the relative error in the output is 7190663 7 7327576 W 7001877 or 71 877 again about the same as the input errori Is this always true The next example answers in the negative Example 34 Consider the ODE initialvalue problem d dt 1 2t 1 e 3 The exact solution is Now considered the perturbed initial condition 10 713ei The exact solution of the perturbed problem is A 1 72t 2 zt fie 66 i 3 The difference between the two solutions 66 grows without bound with t no matter how small 6 is39 see the accompanying gure for e 1043 In contrast to the previous example here a small change in the input results in a large change in the output t Example 35 Let y Let I be increased to 1 61 so that y changes to y 6y fz 61 N hfz leading to 6y N f z6zi The ratio of the relative output error to relative input error is 7 6yy N HEW 7 HI sacz fltzgt6z I fltzgt Notice the magnifying effect of the large slopei For sini1 I H I 7 1712s1n 11 H with H4gtOO aszelii Example 36 Finding the zero T of If is perturbed to 691 how does T change The perturbed root T h satis es fTh egTh 0 Taylor expansion leads to f0 hf T 690 07 whence h 9T 6 7 f T Let us apply this result to the famous example of Wilkinson s polynomial with M ltz71gtltz 72gtltz 7 20 Let 91 120 corresponding to a small change in the coef cient of I20 in Consider the resulting relative error in the root T 20 given by LT 920 202 7777m75 107 e 20f20 20 X which is enormous Thus we see that certain problems are more sensitive to data errors than others This sensitivity has nothing to do with the computational procedure and everything to do with the mathematical character of the problem itself it persists even if the computation with perturbed data is carried out with in nite precisioni This sensitivity is measured by the condition numbeT H de ned as the magnitude of the ratio of relative output error to relative input error A large condition number implies higher sensitivity and the problem is then said to be illconditioned A moderate condition number implies less sensitivity and the problem is called wellconditioned 4 Digital Representation of Real Numbers Roundoff Errors Everyday computations typically involves real numbers A complex number is simply an ordered pair of realsi A real number can be an integer a rational fraction terminating or recurring decimal or an irrational nonterminating decimali The real number system has a onetoone correspondence with points on a liner ln hand calculations we typically use the decimal system or base 10 to represent numbers Thus 4271508 4 x1022 x1017 x10 5 x 1010 x10 28 x 10 On the RHS there appear powers of 10 and digits from 0 to 91 Any real number can be represented in this way if we allow an in nite number of digits Thus in 73141592653589 1 H where the last digit shown on the right contributes 9 X 10 121 The representation is nonunique in that 4271508 has other representations such as 74217508 X 101 or 74275018 X 10 The representation can be rendered unique by normalizing ie by writing the number as 74127508 X 102 One nonzero digit before decimal1 More generally a number I has the normalized form in base 10 z iqx 10E where l qlt 101 Computers represent real numbers in a form called oating point that resembles scienti c notation For example a number N is stored as N i do1d1d2 dp1BE dl d2 dpel E 7 iltd0 w B signmantissa or signi cand gtlt BE where the positive integer B is the base the integer E is the exponent and the di are the digits ln scienti c notation the base is 10 while for a computer the base is usually 2 and sometimes 8 or 161 The digit do is between 1 and B 7 1 while each of the remaining digits is beween 0 and B 7 11 Base 2 is a logical choice because in the binary representation each digit is 0 or 1 corresponding to the onoff states of a transistor What makes computing with a machine different from computing by hand is that depending upon the precision of the machine only a xed number of digits p is available and the integer exponent E is also restricted to a nite range E e EmimEmml As already mentioned with do y 0 the system is called a normalized floatingpoint system with precision 11 Some texts choose to write the normalized system as i1d1d2 dpBE with d1 necessarily being nonzeroi We shall denote this oatingpoint system as FBp EmimEmm ln base 10 a number N has the form d1 d2 dpil E N7iltdo n10pil 10 whereas in base 2 it would be d d d Ni lt1 1 25gt 2E1 For the binary normalized system do is necessarily 1 so that there is no need to store its value Consequently p 71 bits suf ce to store a 10bit mantissai In any normalized system the number zero must be an exception and we shall return to this point shortly Note that numbers that can be exactly represented in this way are limited in range form a discrete set and are not evenly distributed on the real line For example the positive binary numbers in the system F23720 are 14 516 38 716 12 58 34 78 1 54 32 74 How many The total number of oatingpoint numbers available in FBpEminEmw assuming d1 0 is V 2B 7 1BP 1Emw 7 mm 1 1 the 1 having been added to accommodate zero For a base 10 system with two digits and exponents of 0 and 1 we have V 210 7 1102 1 3611 For a base 2 system with 3 digits and exponents of 2 1 and 0 V 212231 251 How large and how small The smallest normalized number available has do 1 all other di 0 and E Emin so that in magnitude 111mm Under Flow Limit UFL BEWW The largest number corresponds to do d1 171 B 7 1E Em ie 1 B 1 32771 1 mm Over Flow L1m1t OFL B 71 lt1 1 E7 gtBEW 1 7 B pBlEm 1 For the base 10 example above UFL 1 and OFL 991 For the binary example UFL 0125 and OFL 1175 The IEEE Standard Most modern machines follow the 1EEE standard wherein a singleprecision number is stored in four words or 32 bits A bit is a binary digit 0 or 11 A byte is 4 bits A word is two bytes Of the 32 bits 23 are used for digits 8 for the exponent and 1 for the sign 0 for 1 and 1 for 11 Since do is always 1 it need not be stored and the precision of the system is 241 A typical singleprecision representation is 71gt x 2E x 1102 Here 1g 11f2 g 1111 111111 111111111 111112 27 2 2 The 8 bits for the exponent E can take 2 3 256 values from 00000000 0 to 11111111 2551 The extreme values 0 and 255 are reserved for special situations to be discussed belowi The system stores a biased exponent E ranging from1 to 2541 The actual exponent E E7127 then has the range 71261271 and one need not expend one bit for the sign The number zero is handled in a special way it is stored as E 0 and d1 d2 dgg 01 Other special numbers are i Inf stored as E 255 and d1 d2 dgg 0 and NaN not a number 00 stored as E 255 and not all d1 d2d23 zero In double precision each oatingpoint number occupies eight words 64 bits39 11 for the exponent 52 for the digits and 1 for the sign 1A The 11bit exponent can take 2048 values 0 to 20471 The exponent range for E is 1 2046 and for E E 7 1023 71022 102311 Zero ioo and NaN are de ned as above Single Precision 1 8 23 8 biased exponent E f from normalized mantissa 11f Double Precision 1 11 52 1n single precision UFL 2125 s 1118 x10 38 and OFL 2 7 2 23 2127 s 2128 s 3140 x 10331 UFL and OFL are the smallest and largest numbers in a given oatingpoint system or oat 1fa computation produces a number smaller than UFL the condition is called under ow and leads to an error or warning If a number larger than OFL results an over ow has occurred and an error message is generated along with the exceptional oating point value Inf Inf satis es 1nf1nf1nf and InfInfNaN1 1n double precision 13 UFL 2 1022 s 2123 x 10 and OFL 2 7 2 52 210 R 21024 1180 x 103081 Subnormal numbers The gap between the smallest normalized number and 0 can be lled by relaxing the normalization requirement ie by letting do 0 when E Emmi This provides additional subnormal numbers ranging from 0111111112 gtlt ZEWquot to 0100111012 gtlt ZEWquot thus leading to gradual under owi These subnormal numbers are less accurate than normal numbers because they have less room in the fraction eld for nonzero bits The IEEE system allows subnormal numbers but not all machines use it Matlab does 1 It identi es smack by eps UFL by realmin and OFL by realmaxi Granularityi Consider F10 3 71 17 a 3digit decimal system with E 71 0 17 in which a oatingpoint number is represented as deadld210Ei For E 0 the numbers range from 1100 to 9199 and are evenly separated with spacing 1011 For E 71 the spacing decreases by a factor of 10 while for E 1 it increases by the same factori Similarly for the general case do dl 39dp71BE7 the numbers are arranged into blocks each block corresponds to a xed value of the exponent within each block the separation between adjacent numbers is xed at Bl p1E and increases by the factor B as the exponent E is increased by 1 Thus the larger a number the greater is the distance between it and its neighbors Machine epsiloni Machine epsilon in any oat is the difference between 1 and the next larger oati In a system with precision p and base B 7 1 7 W For IEEE single precision emach 1223 x 1119 X 10 7 while for IEEE double precision emach 1252 s 2122 gtlt 10 There are two interpretations of machine epsiloni First it represents the relative spacing between adjacent oats This is because the relative spacing between say B and the next oat will be mad 11000171 1 0001BE71BE B E 11000171 5mm Thus if z is an exact oat the two oats on either side of it form the triplet 11i 5mach71711 Smash Second as we discuss below machine epsilon is also a measure of the precision with which a number I is approximated by the oatingpoint systemi 41 FloatingPoint Approximation and Machine Precision If a number is not exactly a p oat ie it requires more than p digits it is approximated by an adjacent p oat to its left or right Example Let I 31145927104i In a 6digit decimal system the two nearby machine numbers bracketing z are 11 3114593104 12 31145931041 11 had believed otherwise My thanks to John Behmer of this class who raised a question that led to my discovering that the current version of Matlab does support subnormal numbers 14 Approximation of I by 11 is called approximation by chopping it always results in a number closer to zero than I and is obtained by dropping all digits in 1 beyond the 6th Approximation by 12 in the present case is called approximation by rounding to the nearest39 we note that 12 is closer to 1 than rah 12 is obtained by adding 5 to the 7th digit of z and then chopping to 6 digits In a binary system with 10bit precision let I lldl dp71dp 2E 4 2E The numbers bracketing z are 11 lld1dp712E and 12 m1 Haw 212 Chopping corresponds to dropping digits dz and beyond so that z is always approximated by the number 11 which is closer to zero than 1 Rounding to nearest consists of chopping after the following procedure add 1 to the bit dp71 rounding up if dz 1 ie approximate I by 12 and do nothing if dp 0 ie approximate I by 11 with one exception if the bits following bit dp71 are 10000 H exactly ie 1 lies exactly half way between 11 and 12 then round up or down so as to make dp71 0 Consider the relative error introduced by chopping when I is replaced by rah I 7 I1 12 7 11 217pE lil S l7 7E z z 42 Since 1 g L lt 2 z 7 2E 217 1 The error is only onehalf of the above value if 11 approximates I by rounding to nearest because then zizl lt 21 pE 1 I 7 I1 12 7 11 21717E 2pE 1 1 lil liliE E 7 71 z 2x 242 22 2 21 A similar result will emerge if 12 were to approximate I by rounding to nearest This quantity ie the error bound is called the unit roundo it equals machine epsilon when rounding by chopping and half that when rounding to nearest Since chopping always rounds the magnitude towards zero it is unidirectional and can accumulate over many computationsl Rounding may go in either direction and can therefore have greater overall accuracy In either case the accumulated error is called roundojf error Propagation of roundoff error in arithmetic operations Consider z and y to be machine numbers ilel numbers that can be represented exactly on the machine Let the symbol 0 represent any one of the four arithmetic operations We proceed on the premise that when two such numbers are combined arithmetically the combination is rst correctly formed then normalized rounded off and stored This is common practice where computers carry out arithmetic operations in special registers that have extra bits called guard bits which allow numbers to exist temporarily with extra precisionl Thus I o y is stored as flz o y and from our discussion of unit roundoff we know that ltzoygtltzoygtlt16gt where walsgemach provided we are using rounding to nearest When I and y are not machine numbers then additional error is introduced in their machine representation before the arithmetic operation is carried out In addition for example we have l w flyl 11 61 y1 6216s e I y 611 62y 1 053 Then relative error Iy6ir62y1y539 y 19 63 51152y 63 51m52 zy 1 m where we have used y mzl Thus the error is 63 61 if m is small 63 62 if m is large and can be arbitrarily large if m m 71 catastrophic cancellation Example 411 Subtraction of two close numbers loss of signi cance Consider z 123416726 y 112339 4427 17y 1000222991 On a 5digit machine with rounding to nearest one would have flz 123427 fly 123397 flz 7 00031 The relative error is 0003 70002 2299 7 0000 7701 m 345 0002 2299 0002 2299 Even though the machine will represent 7 as the normalized number 30000 10 47 the extra zeros in the mantissa serve only as placeholders the correct digits have been lost Thus we see that except for this special case roundoff error made in a single arithmetic step is quite small It can of course accumulate when a large number of operations are performed Although these errors are inevitable we can avoid the occurrence of a large error in a single step by careful design of algorithms As a rule the subtraction of nearly equal numbers should be avoided Consider the quadratic equation az2 121 c 07 b gt 07 with roots 71 Vb2 7 4ac 7 7 b2 7 4ac 11 7 2a 7 12 7 2a If 4ac ltlt 122 then there is a possibility of a large error in 11 if the above formula is used The mathemat ically equivalent formula zl cazg avoids the problemi Example 42 Assume a decimal system with exponent range i10 and smack 10 51 Note the oating point results of the following operations 11 1106 11 21 1103 10011103 10le103 10011 31 1107 1 000 0001107 11000 0001 x 107 107 Loss of digits 4 10101015 10400 001 5 1010 x10 1010 5 x 10 00 NaN Example 43 As another example of contamination by roundoff consider the evaluation of exp710 by means of its Taylor polynomials The N7 term Taylor series approximation to e is given by N k SNltIgt 1 ZTk where Tk The truncation error of the approximation in absolute value is ENI 6 SNWHA We note that all derivatives of 6 equal 611 Then according to Taylor s theorem n1 T I C e 7 SNltIgt We 16 Where 5 lies between 0 and IT For I lt 0 as is the case here 60 has its maximum value equal to unity at c 0 Therefore z nl ENI lei SNWN g lTnlIlv ie the truncation error at any stage is no larger than the rst term of the series droppedi For I 710 the terms of the series increase in magnitude till the 10th term and decrease thereafter approaching zero in the limit Therefore as N a X ENz a 0 The Matlab script appears belowi The accompanying gure displays the absolute error EN710 against the number of terms N note the at line beyond N 47 A table listing N the term TN the partial sum SN and the error EN is also printed to aid in understanding Taylor Series Approximations to exp 10 Absolute Error 20 40 Number of Terms As observed above the terms in the Taylor series increase in magnitude up to the 10th term and then decrease This is con rmed by the accompanying table note the magnitude of TN at N 10 Because of roundoff every term is computed to a relative error of emach a multiple of 10 Therefore TlO Which is a multiple of 103 is computed to Within an absolute error of 103 X 10 16 10 13i As a result the sum SN for N gt 10 can never approximate 6 10 to an accuracy greater than 10 13i This the reason Why the error EN even though it is supposed to converge to zero if computed precisely does not decay due to roundoff beyond the level of 10 Which it rst reaches at N 47 according to the accompanying gure as well as the last column of the table Beyond N 47 the error remains at the level 10713 hence the at line in the gure quot0 Script File ExtraCreditm quot0 quot0 Computes and plots error in the Taylor series representation of expx quot0 Requires x the argument of the exponential and N the highest degree of Taylor polynomial cl x10 N60 exexpx quot0 Exact answer t1 s1 RESULT E quot0 Set up RESULT as a blank matrix 0 quot0 Print headings for output display fprintf N TN SN EN n 70 quot0 Set loop for computing Taylor polynomials 70 for n1N ttxn sst Eabsex s RESULTRESULTntsE Z Augment RESULTS by current output Z Plot error against number of terms on a semilog plot semilogyRESULT 1 RESULT 4 LineWidth 2 x1abe1 Number of Terms ylabe1 Absolute Error title Taylor Series Approximations to exp10 Z Print Z fprintf 40f Z251Ge X2516e Z2516e n RESULT output N 1 10000000000000000eOl 9 OOOOOOOOOOOOOOOOeOO 90000453999297623e00 2 50000000000000000e01 41000000000000000e01 40999954600070240e01 3 16666666666666666e02 12566666666666666e02 1 2566671206659642e02 4 9 2 7023e02 5 83333333333333326e02 54233333333333326e02 54233337873326298e02 6 1 3888888888888887e03 84655555555555543e02 8 4655551015562571e02 7 19841269841269839e03 1 1375714285714284e03 11375714739713583e03 8 2 4801587301587297e03 13425873015873012e03 13425872561873714e03 9 2 7557319223985887e03 14131446208112875e03 1 4131446662112173e03 10 27557319223985887e03 13425873015873012e03 13425872561873714e03 11 25052108385441716e03 11626235369568703e03 11626235823563002e03 12 2 0876756987868098e03 9 2505216182993945e02 9 2505211643000973e02 13 1 6059043836821616e03 6 8085222185222210e02 680852 6725215182e02 14 11470745597729726e03 4 6622233792075053e02 4 6622229252082076e02 15 76471637318198179e02 29849403526123126e02 2 9849408066116104e02 16 47794773323873864e02 17945369797750737e02 17945365257757760e02 17 28114572543455211e02 10169202745704473e02 10169207285697449e02 18 1 5619206968586229e02 54500042228817563e01 54499996828887802e01 19 82206352466243317e01 27706310237425754e01 2 7706355637355518e01 20 4 1103176233121658e01 1 3396865995695904e01 13396820595766142e01 21 19572941063391266e01 61760750676953613e00 61761204676251236e00 22 88967913924505755e00 27207163247552142e00 2 7206709248254519e00 23 38681701706306848e00 11474538458754706e00 1 474992458052331e00 24 16117375710961186e00 46428372522064798e01 46423832529088549e01 25 6 4469502843844739901 18041130321779941e01 18045670314756190e01 26 2 4795962632247978e01 6 7548323104680369e02 67502923174917878e02 27 9 1836898637955480e02 2 4288575533275111e02 24333975463037595e02 28 3 2798892370698385e02 85103168374232735e03 84649169076607880e03 29 11309962886447719e02 2 39 20ov1o l 1 t 30 37699876288159063e03 97034157979146093e04 9 2494165002897605eO4 31 1 2161250415535180903 24578346176205705e04 2 9118339152454188e04 32 38003907548547436e04 13425561372341731e04 88855683960932462e05 33 1 1516335620771951eO4 1 9092257515697798e05 26307672246787049e05 34 33871575355211623e05 52963832870909421e05 75639031084245731e06 35 96775929586318929e06 43286239912277529e05 21136898502073181e06 36 26882202662866368e06 45974460178564163e05 57453041607931571e07 18 37 7 2654601791530731907 4 5247914160648854e 05 15201560183599329907 38 1 9119632050402824907 45439110481152882e05 39180718668034684908 39 4 9024697565135442e08 4 8 98439788971039946e09 4O 12256174391283859908 4 24121954941789809e09 41 2 9893108271424050909 4 5399352647151881e05 57711533296667897910 42 71174067312914397910 45400064387825013605 13462534016582149e10 43 l6552108677421953910 4 53998988667382406 05 30895746606972394911 44 37618428812322619e11 4 5399936485167052905 67226822050262154912 45 8 3596508471828033912 45399928125516204905 1 6369686435904231912 46 1 8173154015614790e12 45399929942831606905 1 8034675857737936e13 47 3 8666285139605936913 4 5399929556168757e05 20631609078504767913 48 80554760707512366914 4 53999296367235146 05 12576133304230755e13 49 1 6439747083165789e14 45399929620283770905 1 4220107703117810e13 50 3 2879494166331576915 4 53999296235717226 05 13891312552289856613 51 64469596404571717e16 4 5399929622927028905 13955781923971275913 52 1 2397999308571485916 45399929623051007e05 1 3943384072128903913 53 2 3392451525606574917 4 5399929623027615905 1 3945723238316041913 54 4 3319354677049209918 4 5399929623031945e05 1 3945290235073404913 55 78762463049180375919 40399929623031159905 13945368839730909e13 5G 1 4064725544496496e19 45399929623031301905 1 3945354609577396e13 7 2 4674957095607887e20 4 5399929623031274e05 13945357320082827913 58 4 2543029475186009921 45399929623031281e05 1 3945356642456469e13 59 7 2106829618959334e22 45399929623031281905 1 3945356642456469e13 60 12017804936493222922 4 5399929623031281905 13945356642456469e13 42 Numerical Instability Numerical instability is a term applied to algorithms in much the same way as conditioning is applied to mathematical models an algorithm is unstable if roundofE errors creeping in at an early stage of calculation are magni ed to unacceptable levels An unstable algirithm will work ne on a computer with in nite precision Example 44 Computation of a a 71 One possible algorithm is the direct one A second algorithm is a a 2 7 anil n 2 3 Proof a2 17a or a2ail0 or a ilix52 The second algorithm is attractive as multiplication is replaced by subtraction Matlab results for computations using both the direct and recurrence algorithms are given below We note that differences between the results of the two algorithms begin to appear at n 5 and increase thereafter At n 39 even the rst digits and at n 43 even the signs are in disagreement The second algorithm is unstable because roundoff allows the second solution 71 x52 with magnitude greater than unity to creep in Since the recurrence relation is linear error grows exponentially The general solution of the recurrence relation is 44gt we The constants A and B are determined by initial values a and a2 The correct solution corresponds to B 0 Contamination due to roundoff brings in a small B whose effect is magni ed with increasing n Z Script File unstable m a Z Illustrates instability of the recurrence relation 7 a n a n 2 a n 1 X Specify a1 and a2 alsqrt512 a2a1 2 Z Initialize column vectors a and aa Initially each has Z a1 and a2 as elements aa1 a2 aaa1 a2 nn12 Z Compute a n by using the recurrence relation as well as directly for Z n from 3 to 50 by setting up a loop for n35 ban2an1 Z result computed by recurrence bba1 n Z result computed directly aab aaaabb Z Put results as new rows in a and aa nnnnn end Z Collect results in a matrix cnn aa a Z Display results fprintf n direct recurrence n fprintf 5i Z2015e Z2015e n c direct recurrence u robotyj 1 1 19660112501052e 01 60679774997897e 01 8980337503155e 01 a lt 030 LU cc r y1co so 819660112501052e01 360679774997897901 458980337503155e01 016994374947428e02 6994374947418e02 572809000084124e02 572809000084133e02 6 38 23 1 9 5 444185374863305e02 3444185374863284e02 2 1 8 5 3 1 WJgtOJIQH Op l m GO 128623625220820e02 128623625220849e02 315561749642485602 130618755783355903 024998740641495e03 105620015141862e03 919378725493634e03 I 315561749642435e02 130618755784136e03 024998740640219e03 105620015143917e03 919378725496301e03 b vb HH HIGHO NO IOOI IQWUTWHM H 20 14 1 186241289642228e 03 1 186241289647616e03 15 7331374358574057eO4 7331374358486853904 16 4DJLUOOQJ01 re 4 45 o r o ru e04 17 2800335820725831e04 2800335820497546e04 18 1730702717122396904 1730702717491761e04 19 1059633103603436604 1069633103005785904 20 6610696135189609e 05 6610696144859762e05 21 4 085634900844748905 4085634885198086e O5 22 2525061234344861905 2525061259661676905 23 1560573666499888e05 1560573625536410e05 24 9644875678449738e06 9644876341252662e06 25 59608609865491406 06 5960859914111438e06 26 3684014691900599906 3684016427141223e06 27 2 276846294648542e 06 2 276843486970215e06 28 i 407168397252057906 1407172940171009e 06 29 8696778973964854eO7 8696705467992061907 30 5374904998555718e07 5375023933718026e07 31 3321873975409138907 3321681534274035e07 32 2UDJUk1U 14 5 y 2 1907 33 1 268842952262559907 1268339134830043eO7 34 7 841880708840215908 7 850032646139482e 08 35 4846548813785372908 4833358702160950908 36 299530 3016670 40 1050 e08 37 1 851216918730527e08 1 816684758182419e08 38 1144114976324318908 1 199989185796113e08 39 7 7LU quot 61061135 5909 40 4 370130339181083e 09 5 832936134098077e09 41 2 700889084881016909 31340195897649778e 10 42 1669211 9 549891 39 C 43 1031647830580948e09 5164896954568121e09 44 6375934237191194e10 1066381349890122e08 45 3940544058618292610 1582871045346934e08 46 24353901685729039 10 2649252395237056e08 47 1 505153900045390e 10 4232123440583990e 08 48 9 3023626852751299 11 6 881375835821046e 08 49 5 749176315178772e11 1111349927640504e07 50 3553186370096359e11 1799487511222608e07 5 Appendix Conversion Between Decimal and Binary 51 Binary to decimal We demonstrate this conversion by means of two examples Example 51 Consider the binary number I 1101011101101 21 Then the decimal equivalent can be found by rst writing I as an appropriate sum of powers of 2 1 1 1 1 7 4 2 0 I 7 252 2 2 27 1 1 1 1 321641 Ra 53 015 01125 00625 01015625 531890625 Example 52 Now consider a binary number with a repeating block I 0111001100 1100 Hi Upon using an overline for the repeating block we can write I 10111002 101210011002 1012 111002 x 2 2 i 2 3 1 1 4 where 2 11100 We can also write 2 as 7 2 11100 1100 Multiplication of both sides by 24 moves the binary point to the right by 4 digits 239e 242 11mm2 23 22 2 12 21 Therefore 162 12 2 or 2 0181 Upon substituting for 2 in 31 we nd that z 0145 52 Decimal to Binary Example 53 Consider z 2351451 We consider the integer and fraction parts separatelyi Let 235 i i 040302012 5120 5221 5322 0423 1 32 Division by 2 yields 235 T 117 rema1nder 1 Since all terms on the RHS of 32 are divisible by 2 except the rst the remainder 1 implies that we must assign cl i ow we have 117 5220 5321 5422 Another division by 2 yields 58 remainder 1 The unity remainder selects Cg 1 Continuing in the same way we divide repeatedly by 2 and record the remaindersi Starting from the beginning we get number 2 a dividend remainder 2352 a 1171 cl 1172 a 581 Cg 582 4gt 29 0 Cg 292 4gt 14 1 C4 142 4gt 7 0 C5 72 a 3 1 55 32 a 1 1 C7 12 a 0 1 53 22 Therefore the binary representation of the integer 235 is 23510 11 101 011 For the fractional part let 1 1 1 0145 1d1d2d3d41112 d1 d2E d33 1 d4 24 1 Now we multiply repeatedly by 2 and record the integer part 0145 x 2 019 0 d1 09 x 2 018 1 d2 08 x 2 016 1 d3 06 x 2 02 1 d4 02 x 2 0140 d5 04 x 2 0180 d5 018 X 2 0161d71 We see that the pattern has begun to repeat Therefore the binary representation of 04510 is 04510 01110011001112 101110021 Combining the two results above we have 2354510 11 101011101 1100 23


Buy Material

Are you sure you want to buy this material for

25 Karma

Buy Material

BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.


You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

Why people love StudySoup

Steve Martinelli UC Los Angeles

"There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

Kyle Maynard Purdue

"When you're taking detailed notes and trying to help everyone else out in the class, it really helps you learn and understand the I made $280 on my first study guide!"

Bentley McCaw University of Florida

"I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"


"Their 'Elite Notetakers' are making over $1,200/month in sales by creating high quality content that helps their classmates in a time of need."

Become an Elite Notetaker and start selling your notes online!

Refund Policy


All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email


StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here:

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.