An Engineer's Toolkit of Numerical
Algorithms
With the MATLAB® toolbox
https://github.com/PetrKryslUCSD/AETNA
July 2015
Contents
Motivation
Analyzing errors
6.1 Taylor series
Illustration 1
6.2 Order-of analysis
Illustration 2
6.2.1 Using the big-O notation
Illustration 3
Illustration 4
6.2.2 Error of the Riemann-sum approximation of integrals
6.2.3 Error of the Midpoint approximation of integrals
Unconstrained Optimization
9.1 Basic ideas
9.2 Two degrees of freedom static equilibrium: unstable structure
Illustration 1
9.3 Two degrees of freedom static equilibrium: stable structure
9.4 Potential function
Illustration 2
9.5 Determining definiteness
9.6 Two degrees of freedom static equilibrium: computing displacement
9.7 One degree of freedom total energy minimization example
9.8 Two degrees of freedom total energy minimization example
9.9 Application of the total energy minimization
9.9.1 Line search
9.9.2 Line search for the quadratic-form objective function
Illustration 3
9.10 Conjugate Gradients method
9.11 Generalization to multiple equations
Illustration 4
9.12 Direct versus iterative methods
9.13 Least-squares minimization
9.13.1 Geometry of least squares fitting
9.13.2 Solving least squares problems
9.14 Annotated bibliography
Index
1
Motivation
The narrative in this chapter is provided in the hope that it will motivate the esteemed reader to take the present subject seriously. Please do not be discouraged if the text in this chapter is found lacking in entertainment value. The rest of the book will make it up to you to excess.
Let us consider the experience of the renowned structural engineer WTP (in Figure 1.1 accompanied by his attorney CR). It concerns a planar truss structure designed by WTP and analyzed for static loads. The structural software used was developed by Owl & Co.
Fig. 1.1. The renowned structural engineer WTP depicted on the stairs of his house with his attorney CR. CR was instrumental in keeping WTP's engineering career on track.
The structure was first analyzed with default analysis settings and the shape after the deformation is shown in Figure 1.2 (also included is a visual representation of the applied loading and the three pin supports). The shape before deformation is shown in broken line. The deformation is highly magnified.
Later that day WTP was exploring the menus of the analysis software (there is always a first time for everything), and got intrigued by the fact that the analysis option "Use automatic stabilization" was checked. The documentation was not very helpful in explaining the effects of this option (WTP was in fact not sure in which language the documentation was written), and therefore an experiment was in order. The analysis option "Use automatic stabilization" was unchecked, and the analysis was repeated. To WTP's surprise the results were practically identical, except that a slight displacement unsymmetry developed (for instance, a displacement of -0.7039 versus -0.6518 units). This was disquieting since the structure and the boundary conditions (loads and supports) were symmetric. WTP was however unfazed, especially given that the analysis was to be delivered to the client the next day.
Several weeks later the software developers alerted WTP that a bug was found in the analysis software, the bug was fixed, and an update was to be installed. WTP remembered the slight unsymmetry, and therefore checked whether the update removed it. Since the unsymmetry remained, a brief discussion ensued in whose course WTP ascertained that the bug had to do with the color in which the logo of the company was drawn in the splash screen, and therefore it was somewhat unlikely to be the cause of the unsymmetry.

Fig. 1.2. The planar truss structure. Deformation under indicated static loads is shown in solid line (magnified). The undeformed structure is shown in dashed line.
During a discussion with a colleague WTP was able to convince himself that no unsymmetry was to be expected in the analysis, and that if it appeared it should be considered an error. At that point WTP began to draw on his immense powers of reasoning. After only a few hours he was able to recall the name of the text in which properties of coupled systems of linear algebraic equations were discussed in his junior year in college. An intense session with the textbook followed, and WTP was quickly able to find the page that pertained to errors that can appear in the solution of systems of equations. The error was found to be proportional to the error in the right-hand side (the loads) and to the condition number of the stiffness matrix. The loads were, as WTP checked, specified correctly, and consequently the mysterious condition number was probably the source of the confounding error.
WTP was now able to find in the textbook that the condition number of the stiffness matrix was rather expensive to compute as one had to solve an eigenvalue problem. WTP was not to be deterred however, and subcontracted this work out to a group of students from the local university, cost it what it may (it wasn't much). The magnitudes of the eigenvalues of the stiffness matrix found by the students are shown in Figure 1.3.
Fig. 1.3. The magnitudes of the eigenvalues of the stiffness matrix of the structure from Figure 1.2
The rather small first eigenvalue did not escape WTP and a few more rewarding hours were spent looking for information that could lead to an understanding of the relationship between the condition number, the eigenvalue problem, and the stiffness matrix. Eventually the critical piece of information that the so-called singular matrix had at least one zero eigenvalue was located, and the conclusion that the stiffness matrix was somehow close to singular was reached.
The displacement shape corresponding to the first eigenvector (Figure 1.4) facilitated the ultimate breakthrough. The structure contained a mechanism: a floppy piece of structure that was insufficiently connected to the rest of the structure (which was in fact sufficiently supported).
Fig. 1.4. The first eigenvector of the stiffness matrix of the structure from Figure 1.2
The structure was consequently subjected to a redesign to remove the mechanism, and the redesign was eagerly adopted by the client who remarked on the propitious circumstance that a superior design became available before the structure was realized. WTP has yet again demonstrated that superior skill and knowledge cannot fail to win the day. Even though his friend CR's assistance was not required in this matter, his comforting presence during these trials and tribulations was gratefully noted by WTP.
2
Modeling with differential equations
Summary
1. In this chapter we develop an understanding of initial value problems (IVPs). We look at the simple but illustrative model of motion in a viscous fluid, and the model of satellite motion. The main idea: these models can be treated similarly since they are both members of the class of IVPs. The constituents of an IVP are the governing equation and the initial conditions.
2. The IVPs that will be considered in this book will be in the form of coupled first-order (only one derivative with respect to the independent variable) equations.
3. We develop simple methods for integrating IVPs numerically in time. The main idea: approximate the curve by its tangent in order to make one discrete step in time. The basic visual picture is provided by the direction field.
4. We discuss the essential differences between IVPs and BVPs (boundary value problems). The main idea: BVPs are harder to solve than IVPs because the problem data is located on the entire boundary of the domain of the independent variables.
5. We investigate the accuracy of some simple numerical solvers for IVPs. The main concepts: the monomial relationship between the error and the time step length gives us formulas to estimate the error, the log-log plot illuminates the convergence produced by the dependence on the time step, and the convergence rate is revealed by the log-log plot.
6. We wrap up the exposition of the various time integrators by describing the Runge-Kutta integrators. Main idea: try to aim the time step for optimal accuracy by sampling the right-hand side function (that is, the slope) within the time step.
For a small sphere falling at its terminal velocity in a viscous fluid, the Stokes drag balances the net gravitational force,

6\pi\eta r v = \frac{4}{3}\pi r^3 (\rho_s - \rho_f) g ,    (2.1)

where \frac{4}{3}\pi r^3 is the volume of the sphere, \eta is the dynamic fluid viscosity (for instance in SI units Pa·s), 6\pi r is the shape factor of the sphere of radius r, v is the velocity of the falling sphere relative to the fluid, m is the mass of the sphere, and g is the gravitational acceleration. On the left of equation (2.1) is the so-called drag force F_d; on the right is the gravitational force F_g (i.e. \frac{4}{3}\pi r^3 \rho_s g, where \rho_s is the mass density of the material of the sphere) minus the buoyancy force (i.e. \frac{4}{3}\pi r^3 \rho_f g, where \rho_f is the mass density of the fluid); compare with Figure 2.1.

Fig. 2.1. Sphere falling in viscous fluid.
An application of this law to structural engineering may be found for instance in composites manufacturing: a commonly used manufacturing technique for large parts infuses dry fibers laid up on a bagged mold with resin by creating a degree of vacuum (Vacuum Assisted Resin Transfer Moulding, VARTM) to suck the resin into the fibers. A critical property of the polymer resins is their dynamic viscosity: if the resin is too viscous, the fibers may be incompletely impregnated and the part must be discarded. Some of the techniques to determine the viscosity of the liquid resemble a high school science experiment: drop a ball into a tube filled with this liquid. Measure the time it takes the ball to travel some distance. From that calculate the ball's velocity (distance/time), and knowing the ball's diameter and mass obtain from (2.1) the liquid's viscosity.
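The last step is a one-line calculation: solving (2.1) for the viscosity gives \eta = 2r^2(\rho_s - \rho_f)g/(9v). A minimal sketch of this computation follows; the measurement values are made up for illustration and are not taken from the book.

```matlab
% Viscosity from the terminal velocity of a falling ball, from (2.1):
% 6*pi*eta*r*v = (4/3)*pi*r^3*(rhos-rhof)*g  =>  eta = 2*r^2*(rhos-rhof)*g/(9*v)
r    = 0.005;      % ball radius [m] (assumed)
rhos = 7800;       % ball mass density [kg/m^3] (assumed: steel)
rhof = 1100;       % liquid mass density [kg/m^3] (assumed: resin)
g    = 9.81;       % gravitational acceleration [m/s^2]
v    = 0.10/2.0;   % measured: 0.10 m traveled in 2.0 s, hence v [m/s]
eta  = 2*r^2*(rhos-rhof)*g/(9*v)   % dynamic viscosity [Pa s], about 7.3 here
```

Note that the formula presumes the ball has already reached its terminal velocity, which is the point of the discussion that follows.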
Of course, if we calculate the ball's velocity as (distance/time) it had better be uniform in that interval. So how does the velocity of the falling ball vary with time? Let us say we observe the proceedings with a high-speed camera. We drop the ball from rest, and then we see the ball rapidly accelerate. Eventually it seems to settle down to a steady speed on the way downwards. The modeling keyword is acceleration, and consequently we shall use Newton's equation: acceleration is proportional to force. The acceleration may be written as \ddot{x} (measuring the distance traveled downwards: Figure 2.1), and the total applied force is F_g - F_d. Therefore, we write

\frac{4}{3}\pi r^3 \rho_s \ddot{x} = \frac{4}{3}\pi r^3 (\rho_s - \rho_f) g - 6\pi\eta r v .    (2.2)
Dividing through by the mass \frac{4}{3}\pi r^3 \rho_s we obtain

\ddot{x} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v .    (2.3)
We see that we have one equation, but two variables: x and v. These are not independent, since the velocity is defined as the rate of change of the distance fallen, v = \dot{x}. We have two choices. Either we express equation (2.3) in terms of the distance

\ddot{x} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} \dot{x}    (2.4)
and we obtain a second order differential equation, or we express equation (2.3) in terms of the velocity

\dot{v} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v    (2.5)
and we obtain a first order differential equation. Since we are at the moment primarily interested in the velocity, we will stick to the latter.
All these equations are the so-called equations of motion. They are differential equations, expressing the rate of change of some variable (x or v) in terms of the same variable (and/or other variables, in general). The independent variable is the time t and the dependent variable is the velocity v.
We realize that to obtain a solution we must somehow integrate both sides of the equation of
motion. From calculus we know that integration brings in constant(s) of integration. So, for instance
for equation (2.5), we may write

\int_{t_0}^{t} \dot{v}(\tau)\, d\tau = \int_{t_0}^{t} \left[ \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v(\tau) \right] d\tau

and evaluating the left-hand side we arrive at

v(t) - v(t_0) = \int_{t_0}^{t} \left[ \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v(\tau) \right] d\tau .
Here the task is to find a suitable form of the function v(\tau) to satisfy this equation for all times. The value v(t_0) is arbitrary. Its physical meaning is that of the velocity at the beginning of the interval t_0 \le \tau \le t. Therefore, setting the value v(t_0) to some particular number

v(t_0) = v_0    (2.6)

is called specifying the initial condition. The initial condition makes the solution to the equation of motion meaningful to a particular problem. Therefore, we always think of the models of this type in terms of the pair governing equation (the equation of motion) plus the initial condition. This type of model is called the initial value model (and the problem which is modeled this way is called an initial value problem: IVP). The problem of the falling sphere is an initial value problem, and the model that needs to be solved is
\dot{v} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v , \qquad v(0) = v_0 .    (2.7)
The solution may be sought as the sum v = v_h + v_p of a homogeneous solution v_h and a particular solution v_p, which satisfy

\dot{v}_h = -\frac{9\eta}{2r^2\rho_s} v_h , \qquad 0 = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v_p .

The homogeneous equation may be solved by assuming the solution in the form of an exponential

v_h(t) = \exp(a t) .

Differentiating v_h we find a = -\frac{9\eta}{2r^2\rho_s}.

The particular solution can be guessed as v_p = constant, and differentiating we find

v_p = \frac{2r^2(\rho_s - \rho_f)}{9\eta} g .

The solution to the initial value problem is the sum of the particular solution and some multiple of the homogeneous solution

v(t) = v_p + C v_h(t)

and it must satisfy the initial condition v(0) = v_0. Substitution of t = 0 leads to

C = v_0 - \frac{2r^2(\rho_s - \rho_f)}{9\eta} g .
Fig. 2.2. Sphere falling in viscous fluid. Time variation of the descent speed.
See: aetna/Stokes/stokesref.m
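The closed-form solution can be evaluated directly. The following sketch reproduces the shape of a curve like the one in Figure 2.2; the material constants are assumed for illustration only and are not necessarily those behind the book's figures.

```matlab
% Evaluate the analytical solution v(t) = vp + C*exp(a*t) of the IVP (2.7).
% Constants below are assumed for illustration only.
rhos = 1200; rhof = 1000;  % sphere and fluid mass densities [kg/m^3]
eta  = 0.1;  r = 0.005;    % dynamic viscosity [Pa s], sphere radius [m]
g    = 9.81; v0 = 0;       % gravitational acceleration [m/s^2], v(0) [m/s]
a  = -(9*eta)/(2*r^2*rhos);            % exponent of the homogeneous solution
vp = (2*r^2*(rhos-rhof))/(9*eta)*g;    % particular solution (terminal velocity)
C  = v0 - vp;                          % from the initial condition v(0) = v0
t  = linspace(0, 0.5, 100);
v  = vp + C*exp(a*t);
plot(t, v), xlabel('t [s]'), ylabel('v(t) [m/s]')
```

Since a < 0, the exponential dies out and v(t) approaches the terminal velocity v_p, which is the behavior observed in the experiment.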
For most problems of practical interest such an analytical solution is not available. The tools available to engineers for these problems will most likely be numerical in nature. (Hence the reason for this book.)
The simplest method with which to introduce numerical solutions to initial value problems (IVPs) is Euler's method. It is based on a very simple observation: the solution graph is a curve. The solution process itself could be understood as constructing a curve. The curve passes through a point that is known to us from the specification of the IVP: the point (t_0, v_0). A curve consists of an infinite number of points, and we do not want to have to compute the coordinates of an infinite number of points. The next best thing would be to compute the solution at only a few points along the curve, and somehow approximate the curve in-between. It is logical to try to approach this task by starting from the point we know from the beginning, (t_0, v_0), and to compute next another point on the curve, let us say (t_1, v_1). Then restart the process by moving one point forward in time, compute (t_2, v_2), and so on. This is an important aspect of numerical methods: the algorithms make discrete steps and they produce discrete solutions (as opposed to a continuous analytical solution).
In general we will not be able to compute this sequence of points so that they all lie on the exact solution curve. The points will only be close to the curve we wish to find (they will be only approximately on the curve). In fact, there is in general an infinite number of solution curves, those passing through all possible initial conditions. Refer to Figure 2.3: shown are five solution curves for five different initial velocities. So if our numerical solution process drifts off the desired solution curve, it will most probably lie on an adjacent solution curve.
Since the process is repetitive (start from a known solution point and then compute the next solution point), we may just as well think in terms of the pair (t_j, v_j) (known) and (t_{j+1}, v_{j+1}) (unknown, to be computed). How do we approximately locate the point (t_{j+1}, v_{j+1}) from what we know of the solution curve passing through (t_j, v_j)? We know the location (t_j, v_j), but is there anything else? The answer is yes: having (t_j, v_j) allows us to substitute these values on the right-hand side of the governing equation (2.7) and compute
\dot{v}(t_j, v_j) = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v_j .    (2.8)

This is the slope of the solution curve at (t_j, v_j), and following the tangent for one step gives

(t_{j+1}, v_{j+1}) \approx (t_j + \Delta t,\; v_j + \Delta t\, \dot{v}(t_j, v_j)) , \quad \Delta t \text{ small} .    (2.9)
Here the notation \dot{v}(t_j, v_j) may become confusing, since by the superimposed dot we don't mean that a time derivative of some quantity was taken. We simply mean the value of the given function on the right of (2.8). Therefore, we give the right-hand side function a name and we use the notation

\dot{v} = f(t, v) , \qquad v(t_0) = v_0    (2.10)

for the IVP. Here by f(t, v) we mean that the right-hand side of the governing equation is known as a function of t and v. Then the Euler algorithm may be written as

(t_{j+1}, v_{j+1}) \approx (t_j + \Delta t,\; v_j + \Delta t\, f(t_j, v_j)) , \quad \Delta t \text{ small} .    (2.11)
One more remark is in order in reference to Figure 2.3. The short red lines indicate the slope of the solution curves passing through the points from which the straight red lines emanate (the left-hand side ends). The straight lines represent the tangents to the solution curves (the slopes). They are also known as the direction field. Plotting the direction field is a good way in which the behavior of solutions to ordinary differential equations can be understood. It works best for a single scalar equation since it is hard to visualize the direction fields when there is more than one dependent variable.
See: aetna/Stokes/stokesdirf.m
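A direction field of this kind can be sketched with the built-in quiver function. The snippet below is only a rough stand-in for what stokesdirf.m does, with the material constants assumed for illustration.

```matlab
% Sketch the direction field of dv/dt = f(t,v) for the Stokes IVP (2.7).
% Constants are assumed for illustration only.
rhos = 1200; rhof = 1000; eta = 0.1; r = 0.005; g = 9.81;
f = @(t,v)((rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v);
[T, V] = meshgrid(linspace(0, 0.2, 12), linspace(0, 0.35, 12));
S = f(T, V);                       % slope dv/dt at each grid point
quiver(T, V, ones(size(S)), S);    % short segments with slope S at each point
xlabel('t [s]'), ylabel('v(t) [m/s]')
```

Each arrow has horizontal component 1 and vertical component S, so its direction matches the slope of the solution curve passing through that point.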
Fig. 2.3. Stokes problem solutions corresponding to different initial conditions, with the direction field shown at a few selected points.
tspan = [0 0.5];% seconds
Define an anonymous function (assigned to the variable f) to return the value of the right-hand side of (2.7) for a given time t and velocity v.
f=@(t,v)((rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v);
Decide how many steps the algorithm should take, and compute the time step to cover the time
span in the selected number of time steps.
nsteps =20;
dt= diff(tspan)/nsteps;
Initialize two arrays to hold the solution pairs. Note that the two lines in the loop correspond exactly to the algorithm formula (2.11). We call the function f defined above to evaluate the right-hand side.
t(1)=tspan(1);
v(1)=v0;
for j=1:nsteps
t(j+1) =t(j)+dt;
v(j+1) =v(j)+dt*f(t(j),v(j));
end
Finally, we graphically represent the solution as a series of markers that correspond to the computed solution pairs (t_j, v_j).
plot(t,v,'o')
xlabel('t [s]'),ylabel('v(t) [m/s]')
See: aetna/Stokes/stokes1.m
Fig. 2.4. Stokes problem solution computed with a simple implementation of Euler's method.
Integrating the equation of motion (2.4) once, the velocity may be expressed in terms of the distance traveled,

\dot{x}(t) - \dot{x}(t_0) = \frac{\rho_s - \rho_f}{\rho_s} g\,(t - t_0) - \frac{9\eta}{2r^2\rho_s} \left( x(t) - x(t_0) \right) .
See: aetna/Stokes/stokes2.m
\int_{t_0}^{t} \left( \dot{x}(\tau) - \dot{x}(t_0) \right) d\tau = \int_{t_0}^{t} \frac{\rho_s - \rho_f}{\rho_s} g\,(\tau - t_0)\, d\tau - \int_{t_0}^{t} \frac{9\eta}{2r^2\rho_s} \left( x(\tau) - x(t_0) \right) d\tau .
This expression could be further simplified, but my point can be made here: the two constants of integration are already present, x(t_0) and \dot{x}(t_0). Therefore the IVP (the governing equation plus the initial conditions) may be written

\ddot{x} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} \dot{x} , \qquad x(t_0) = x_0 , \quad \dot{x}(t_0) = v_0 .    (2.13)
The meaning of the IVP is: find a function (distance traveled) x(t) such that it satisfies the equation of motion, and such that the initial distance and the initial velocity at the time t_0 are x_0 and v_0, respectively.
The integration of IVPs in MATLAB is made general by requiring that all IVPs be first order (only first derivatives of the variables may be present). Our IVP (2.13) is second order, but we can see that it may be converted to a first order form. Just introduce the velocity to write

\dot{v} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v , \qquad \dot{x} = v , \qquad x(t_0) = x_0 , \quad v(t_0) = v_0 .    (2.14)
The price to pay for having to deal with only the first order derivatives is an increased number of variables: now we have two. Since we have two variables, we had better also have two equations. Note that the initial conditions are now written in terms of the two variables, but we still have two of them. That is not entirely surprising since we still need two integration constants: we have two first-order equations, each of them needs to be integrated once, which will again result in two constants of integration.
The IVP now deals with a system of coupled ordinary differential equations. Such systems are usually written in the so-called vector form. We introduce a vector to collect our variables

z = \begin{bmatrix} x \\ v \end{bmatrix}

and then the IVP (2.14) is put as

\dot{z} = \begin{bmatrix} v \\ \dfrac{\rho_s - \rho_f}{\rho_s} g - \dfrac{9\eta}{2r^2\rho_s} v \end{bmatrix} = f(t, z) , \qquad z(t_0) = \begin{bmatrix} x_0 \\ v_0 \end{bmatrix} .    (2.15)
Formally, this is the same as the IVP (2.7), except that our variable is a vector, and the function on the right-hand side returns a vector and takes the time and a vector as arguments. This parallel makes it possible to treat a variety of IVPs with the same code in MATLAB. Here we show an implementation that computes the solution to (2.15).
The definitions of the constants are the same as above, except for the initial conditions. The initial condition now is a column vector.
z0 = [0;0];% Initial distance [m] and velocity [m/s]
The right-hand side function looks very similar to the one introduced above, except that it needs to
return a vector, and whenever it refers to velocity it needs to take it out of the input vector as z(2)
f=@(t,z)([z(2); (rhos-rhof)/rhos*g-(9*eta)/(2*r^2*rhos)*z(2)]);
The MATLAB integrator is called exactly as before.
[t,z] = ode23 (f, tspan, z0);
The arrays returned by the integrator collect results in the form of a table:

t(:)    z(:,1)    z(:,2)
t1      x1        v1
t2      x2        v2
...     ...       ...

Plotting the two arrays then yields two curves: the distance traveled and the velocity (Figure 2.6).
plot(t,z,'o-')
xlabel('t [s]'),ylabel('x(t) [m], v(t) [m/s]')
Fig. 2.6. Stokes problem. Solution of (2.15) computed with a MATLAB integrator.
See: aetna/Stokes/stokes3.m
Making the step smaller is however expensive. The more steps we make the algorithm take, the longer we have to wait for the computer to give us the solution. Hence we may wish to use a step that is sufficiently large for the results to arrive quickly, but large steps also have consequences. What if I wanted to take only 10 steps instead of 20 in the first MATLAB script (Section 2.2.1, set nsteps =10;)? The result is shown in Figure 2.7 and it is clearly unphysical: in the actual experiment (and in the analytical solution and in our prior calculations) the dropped sphere certainly seems to be monotonically speeding up, whereas here the result tells us that the velocity oscillates, and moreover at times seems to be higher than the terminal velocity.
Fig. 2.7. Stokes problem. Solution with a larger time step than in Figure 2.4.
An explanation of this phenomenon may be found in Figure 2.8. Note the direction field, which will help us understand the numerical solution. Starting from (t_1 = 0, v_1 = 0) we proceed along the steep straight line so far that the next solution point (t_2, v_2) overshoots the terminal velocity. The next step is along a straight line with a negative slope, and again we go so far that we undershoot the terminal velocity. The third step takes us along a straight line with a positive slope, and we overshoot again. This kind of computed response is not useful to us since the qualitative feature of the solution, namely the monotonic increase of the speed, is lost in the numerical solution.
Fig. 2.8. Stokes problem. Solution with a larger time step than in Figure 2.4. The direction field and the analytical solution are shown.
See: aetna/Stokes/stokes4.m
See: aetna/Stokes/stokes4ill.m
In summary, we see that the selection of the time step length has two kinds of implications. Firstly, the time step affects the accuracy (how far are the computed solutions from the curve that we would like to track?). Secondly, the time step affects the quality of the solution (is the shape of the computed solution series a reasonable approximation of the shape of the exact solution curve?). The first aspect is generally referred to as accuracy. The second aspect is generally considered a manifestation of stability (or instability, depending on how we look at it).
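The stability aspect is easy to reproduce in a few lines. For the model problem the forward Euler update multiplies the distance of the iterate from the terminal velocity by a constant factor every step, and the oscillation of Figure 2.7 appears as soon as that factor turns negative. The sketch below demonstrates this on a linear equation of the same form; the constants are assumed for illustration, not taken from the book.

```matlab
% Forward Euler for dv/dt = K - a*v (same form as (2.7), terminal velocity
% vinf = K/a): each step multiplies the error v_j - vinf by (1 - a*dt).
% |1 - a*dt| < 1 is needed for the error to decay; a negative factor already
% produces the unphysical oscillation about vinf seen in Figure 2.7.
a = 50; vinf = 0.3;        % assumed decay rate [1/s] and terminal velocity [m/s]
K = a*vinf; v = 0;         % right-hand side constant; start from rest, v(0) = 0
dt = 0.03;                 % 1 - a*dt = -0.5: stable, but the sign alternates
for j = 1:10
    v = v + dt*(K - a*v);  % forward Euler step
end
% each step halves |v - vinf| while flipping the side of vinf the iterate is on
```

With dt above 2/a the factor drops below -1 and the iterates diverge, which is the instability proper; between 1/a and 2/a they converge but oscillate, which is the overshoot seen earlier.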
2.2.5 A variation on Euler's method

Euler's method proposes to follow a straight line set up at (t_j, v_j) to arrive at the point (t_{j+1}, v_{j+1}). As an alternative, let us consider the possibility of following a straight line set up at the (initially unknown) point (t_{j+1}, v_{j+1}). For simplicity let us work with the IVP (2.7). The right-hand side of the equation of motion is the function

f(t, v) = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v .

Its value at the new point is

f(t_{j+1}, v_{j+1}) = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v_{j+1} ,
which we substitute into the formula that expresses the movement from the point (t_j, v_j) to (t_{j+1}, v_{j+1}) along a straight line

(t_{j+1}, v_{j+1}) = (t_j + \Delta t,\; v_j + \Delta t\, f(t_{j+1}, v_{j+1})) .    (2.16)

Since f is linear in v, the second component may be solved for the unknown explicitly:

v_{j+1} = \frac{ v_j + \Delta t\, \frac{\rho_s - \rho_f}{\rho_s} g }{ 1 + \Delta t\, \frac{9\eta}{2r^2\rho_s} } .
The MATLAB script of Section 2.2.1 may be easily modified to incorporate our new algorithm. The only change occurs inside the time-stepping loop
for j=1:nsteps
t(j+1) =t(j)+dt;
v(j+1) =(v(j)+dt*(rhos-rhof)/rhos*g)/(1+dt*(9*eta)/(2*r^2*rhos));
end
With this modification we can now compute the numerical solution without overshoot not only with just 10 steps, but with just five or even two; see Figure 2.9. The computed points are not particularly accurate, but the qualitative character of the solution curves is preserved. In this sense, the present modification of Euler's algorithm has rather different properties than the original.
See: aetna/Stokes/stokes5.m
Fig. 2.9. Stokes problem. Solution computed with the modified Euler algorithm.
In order to be able to distinguish between these algorithms we will call the original algorithm of Section 2.2.1 the forward Euler, and the algorithm introduced in this section will be called backward Euler. The justification for this nomenclature may be sought in the visual analogy of approximating a curve with a tangent: in the forward Euler method this tangent points forward from the point (t_j, v_j), in the backward Euler method the tangent points backward from (t_{j+1}, v_{j+1}).
2.2.6 Implementations of forward and backward Euler method

In this book we shall spend some time experimenting with the forward and backward Euler method. However, MATLAB does not come with integrators implementing these methods. They are too simplistic to serve the general-purpose aspirations of MATLAB. Since it will make our life easier if we don't have to code the forward and backward Euler method every time we want to apply it to a different problem, the toolbox aetna that comes with the book provides integrators for this pair of methods.
The aetna forward and backward Euler integrators are called in the same way as the built-in MATLAB integrators. We have seen in Section 2.2.2 an example of the built-in MATLAB integrator, ode23. There is one difference, however, which is unavoidable. The built-in MATLAB integrators are able to determine the time step automatically, and in general the time step is changed from step to step. The aetna forward and backward Euler integrators are fixed-time-step implementations: the user controls the time step, and it will not change. Therefore, we have to supply the initial time step as an option to the integrator. (In fact, even the MATLAB built-in integrators take that options argument. It is used to control various aspects of the solution process.) The MATLAB odeset function is used to create the options argument. To compute the solution with the forward Euler integrator odefeul, replace the ode23 line in the script in Section 2.2.2 with these two lines:
options =odeset('initialstep', 0.01);
[t,v] = odefeul(f, tspan, [v0], options);
To compute the solution with a backward Euler integrator, use odebeul instead. The inquisitive reader now probably wonders: how does odebeul solve for v_{j+1} from the implicit equation

v_{j+1} = v_j + \Delta t\, f(t_{j+1}, v_{j+1})

when it cannot even know how the function f was defined (all it is given is the function handle f)? The answer is: the equation is solved numerically. Solving (systems of) non-linear algebraic equations is
See: aetna/utilities/ODE/integrators/odefeul.m
See: aetna/Stokes/stokes6.m
See: aetna/utilities/ODE/integrators/odebeul.m
so important that MATLAB cannot fail to deliver some methods for dealing with them. MATLAB's fzero implements a few methods by which the root of a single nonlinear equation may be located. It takes two arguments: a function handle (this would be the function whose zero we wish to find) and the initial guess of the root location. First we define the function

F(v_{j+1}) = v_{j+1} - v_j - \Delta t\, f(t_{j+1}, v_{j+1})

by moving all the terms to the left-hand side, and our goal will be to find v_{j+1} such that F(v_{j+1}) = 0. For that purpose we will create a handle to an anonymous function @(v1)(v1-v(j)-dt*f(t(j+1),v1)) in which we readily recognize the function F(v_{j+1}) (the argument v_{j+1} is called v1). Finally, the time stepping loop for the backward Euler method is written as
for j=1:nsteps
t(j+1) =t(j)+dt;
v(j+1) =fzero(@(v1)(v1-v(j)-dt*f(t(j+1),v1)),v(j));
end
where the second line inside the loop solves the implicit equation using fzero.
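The same idea carries over directly to other environments. Below is a minimal Python sketch of one backward Euler step, with a hand-rolled bisection solve standing in for MATLAB's fzero; the function names and the bracketing interval are our own illustrative choices, not part of aetna.

```python
def backward_euler_step(f, t, v, dt, lo=-1e6, hi=1e6, tol=1e-12):
    """One backward Euler step: solve F(v1) = v1 - v - dt*f(t+dt, v1) = 0.

    A simple bisection stands in for MATLAB's fzero; it assumes the
    root lies in [lo, hi] and that F changes sign over that bracket.
    """
    t1 = t + dt
    F = lambda v1: v1 - v - dt * f(t1, v1)
    a, b = lo, hi
    if F(a) * F(b) > 0:
        raise ValueError("root not bracketed")
    while b - a > tol:
        m = 0.5 * (a + b)
        if F(a) * F(m) <= 0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)

# Model decay problem dv/dt = -v/2: the implicit equation is linear here,
# so the exact answer is v1 = v0/(1 + dt/2), which the solver should match.
f = lambda t, v: -0.5 * v
v1 = backward_euler_step(f, 0.0, 1.0, 0.1)
```

For a linear right-hand side the bisection is of course overkill, but the same call works unchanged when f is nonlinear, which is the whole point of solving the implicit equation numerically.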
M′ = V ,   EIv′′ = M ,   (2.17)

where V is the shear force resultant, q is the applied transverse load, M is the moment resultant, E is the Young's modulus, I is the moment of inertia of the cross-section, and (·)′ = d(·)/dx. Therefore, the governing equation (static equilibrium of the differential element of the beam) is fourth order

EIv′′′′ = q .   (2.18)
Our knowledge of the particular configuration of the beam would be expressed in terms of the conditions at either end: Is the cross-section at the end of the beam free of loading? Is it supported? Is the support a roller, or is the rotation at the supported cross-section disallowed? At the cross-section x = 0 we could write the following four conditions

v(0) = v₀ ,   v′(0) = s₀ ,   v′′(0) = (1/EI) M₀ ,   v′′′(0) = (1/EI) V₀ ,
depending of course on what was known: deflection v₀, slope s₀, moment M₀, or shear force V₀. Four similar conditions could be written for the cross-section at x = L (L = the length of the beam). Normally we would know two quantities out of the four at each end of the beam. For instance, for a beam supported on rollers at each end (the so-called simply supported beam) the known quantities would be v₀ = M₀ = v_L = M_L = 0. Since the quantities are specified at the boundary x = 0 and x = L of the domain 0 ≤ x ≤ L on which the governing equation is written, we call these the boundary conditions. The entire setup leads consequently to a boundary value problem (BVP), which for the instance of the simply supported beam would be defined as

EIv′′′′ = q ,   v(0) = 0 ,   v′′(0) = 0 ,   v(L) = 0 ,   v′′(L) = 0 .   (2.19)
The difference between an IVP and a BVP looks innocuous but is rather consequential. All the conditions from which the integration constants need to be obtained are given at the same point for the IVP. On the other hand, they are not given at the same point for the BVP, and therefore the boundary value problem is considerably more difficult to solve. One of the difficulties is that solutions to BVPs do not necessarily exist for some combinations of boundary conditions.
13 See: aetna/Stokes/stokes7.m
Illustration 1

Consider the beam-bending BVP with

EIv′′′′ = 0 ,   v(0) = 0 ,   v′′(0) = 0 ,   v′′(L) = 0 ,   v′′′(L) = (1/EI) V_L ≠ 0 .

Note that the beam is not loaded (q = 0). The boundary conditions correspond to a beam simply supported at one end and free at the other end, where a nonzero shear force is applied. In the absence of other forces and moments, the shear force at x = L cannot be balanced. Such a beam is not stably supported, and therefore no solution exists for these boundary conditions.
We will handle some simple boundary value problems in this book, but most of their intricacies are outside of its scope.
[Figure: differential element of the beam, with deflection v(x), distributed load q(x), shear forces V(x) and V(x + dx), and moments M(x) and M(x + dx).]
It is relatively straightforward to add the aspect of dynamics to the equation of motion (2.18). All terms are moved to one side of the equation, and they represent the total applied force on the differential element of the beam. Then invoking Newton's law of motion, we obtain

μ v̈ = q − EIv′′′′ .   (2.20)

Here μ is the mass per unit length, and v̈ is the acceleration. The equation of motion has now become a partial differential equation, since there are now derivatives with respect to both space and time. With the time derivatives there comes the need for more constants of integration. It is consistent with our physical reality that the integration constants will come from the beginning of the time interval on which the equation of motion (2.20) holds. Therefore, they will be obtained from the so-called initial conditions. The solution will still be subject to the boundary conditions as before, and thus we obtain an initial boundary value problem (IBVP) for the function v(t, x) of the midline deflection.
For instance

μ v̈ = q − EIv′′′′ ,
v(t, 0) = 0 ,   v′′(t, 0) = 0 ,   v(t, L) = 0 ,   v′′(t, L) = 0 ,
v(0, x) = v_{t0}(x) ,   v̇(0, x) = v̇_{t0}(x) .   (2.21)

This IBVP model describes the vibration of a simply-supported beam, whose deflection at time t = 0 (initial deflection) is described by the shape v_{t0}(x) and whose (initial) velocity at time t = 0 is given as v̇_{t0}(x).
The gravitational force acting on the satellite is

F = −(GmM/r³) r .

Here G is the gravitational constant, m and M are the masses of the satellite and the planet respectively, and r is the vector from the center of the earth to the location of the satellite (r = ‖r‖ is its length). The IVP formulation is straightforward.

Fig. 2.11. Satellite motion. Satellite path and the gravitational force.

The equation of motion is a classical Newton's law: the acceleration of the mass of the satellite is proportional to the acting force

F = m r̈ .

Substituting for the force, we obtain

m r̈ = −(GmM/r³) r ,

which is entirely expressed in terms of the components of the location of the satellite with respect to the Earth. The initial conditions are the location and velocity of the satellite at some time instant, let us say at t = 0. Altogether the IVP reads

m r̈ = −(GmM/r³) r ,   r(0) = r₀ ,   ṙ(0) = v₀ .   (2.22)
As for the problem discussed in Section 2.2.3, the dynamics of this IVP is driven by a second order equation. In order to convert to the first order form, we shall use the obvious definition of a new variable, the velocity v = ṙ. With this definition, the IVP may be written in first order form as

ṙ = v ,   v̇ = −(GM/r³) r ,   r(0) = r₀ ,   v(0) = v₀ .   (2.23)
Note that the mass of the satellite canceled in the equation of motion.
As before we can introduce the same formal way of writing the IVP using a single dependent variable. Introduce the vector

z = [r ; v]

and the definition of the right-hand side function

f(t, z) = [v ; −(GM/r³) r] .   (2.24)

The IVP then reads ż = f(t, z), z(0) = z₀.
The complete MATLAB code14 to compute the solution starts with a few definitions. Especially note the initial conditions, velocity v0, and position r0.
G=6.67428 *10^-11;% cubic meters per kilogram second squared;
M=5.9736e24;% kilogram
R=6378e3;% meters
v0=[-2900;-3200;0]*0.9;% meters per second
r0=[R+20000e3;0;0];% meters
dt=0.125*60;% in seconds
te=50*3600; % seconds
Now the right-hand side function is defined (as an anonymous function, assigned to the variable f). Clearly the MATLAB code corresponds very closely to equation (2.24).
f=@(t,z)([z(4:6);-G*M*z(1:3)/(norm(z(1:3))^3)]);
We set the initial time step (the MATLAB integrator may or may not consider it: it is always driven by accuracy), and then we call the integrator ode45.
opts=odeset('InitialStep',dt);
[t,z]=ode45(f,[0,te],[r0;v0]);
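For readers following along in another language, here is a hedged Python sketch of the same right-hand side (2.24). The constants mirror the MATLAB script above; plain lists are used to stay dependency-free, and the sample state is our own illustrative choice.

```python
import math

G = 6.67428e-11  # gravitational constant, m^3 / (kg s^2)
M = 5.9736e24    # mass of the planet, kg

def f(t, z):
    """Right-hand side of (2.24): z = [r; v], dz/dt = [v; -G*M*r/|r|^3]."""
    r, v = z[:3], z[3:]
    n3 = math.sqrt(r[0]**2 + r[1]**2 + r[2]**2) ** 3
    return v + [-G * M * ri / n3 for ri in r]

# Sanity check: at r = (R, 0, 0) the acceleration must point back toward
# the origin with magnitude G*M/R^2 (about 9.8 m/s^2 at the surface).
R = 6378e3
z = [R, 0.0, 0.0, 0.0, 7000.0, 0.0]
dz = f(0.0, z)
```

Any fixed- or adaptive-step integrator can then be pointed at this function, exactly as ode45 is pointed at the anonymous function above.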
Finally, we do some visualization in order to understand the output better than a printout of the numbers can afford. In Figure 2.12 we compare results for this problem obtained with two MATLAB
integrators, ode45 and ode23, and with the forward and backward Euler integrators, odefeul and
odebeul. Some of the interesting features are: ode45 is nominally of higher accuracy than ode23.
However, we can see the individual curves spread out quite distinctly for ode45 while only a single
curve, at this resolution of the image, is presented for ode23. From what we know about analytical
solutions to this problem (remember Kepler?), the curve is an ellipse and the computed paths for
repeated revolutions of the satellite around the planet would ideally overlap and represent a single
curve. Therefore we have to conclude that ode23 is actually doing a better job, but not perfect (the
trajectory is not actually closed). The two Euler integrators produce altogether useless solutions.
The problem is not accuracy, it is the qualitative character of the orbits. From years and years of
observations of the motion of satellites (and from the analytical solution to this model) we know
that the energy of a satellite moving without contact with the atmosphere should be conserved to
a high degree. For the forward Euler the satellite is spiraling out (which would correspond to its
gaining energy), while for the backward Euler it is spiraling in (losing energy). A lot of energy! We
say that all these integrators fail to reproduce the qualitative character of the solution, but some
fail more spectacularly than others.
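The spiral-out/spiral-in behavior is easy to reproduce on a toy problem. The sketch below is our own assumption-laden miniature, not aetna code: it applies forward and backward Euler to the undamped oscillator x′ = v, v′ = −x, whose "energy" (x² + v²)/2 is exactly conserved by the true solution; forward Euler inflates it every step, backward Euler shrinks it.

```python
def forward_euler_osc(x, v, dt, n):
    # x' = v, v' = -x; each step multiplies x^2 + v^2 by (1 + dt^2)
    for _ in range(n):
        x, v = x + dt * v, v - dt * x
    return x, v

def backward_euler_osc(x, v, dt, n):
    # The implicit step is linear and solved in closed form here;
    # each step shrinks x^2 + v^2 by the factor 1/(1 + dt^2).
    for _ in range(n):
        d = 1.0 + dt * dt
        x, v = (x + dt * v) / d, (v - dt * x) / d
    return x, v

E0 = 0.5 * (1.0**2 + 0.0**2)
xf, vf = forward_euler_osc(1.0, 0.0, 0.1, 100)
xb, vb = backward_euler_osc(1.0, 0.0, 0.1, 100)
Ef = 0.5 * (xf * xf + vf * vf)   # grows: the forward Euler "gains energy"
Eb = 0.5 * (xb * xb + vb * vb)   # shrinks: the backward Euler "loses energy"
```

This is the same qualitative failure as the spiraling satellite orbits, stripped down to two lines of arithmetic per step.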
Looking at energy is a good way of judging the performance of the above integrators. The kinetic
energy of the satellite is
14 See: aetna/SatelliteMotion/satellite1.m
Fig. 2.12. Satellite motion. Solution computed with (left to right): ode45, ode23, odefeul, odebeul.
K = m‖v‖² / 2 ,

and the potential energy is

V = − GmM / r .
The total energy T = K + V should be conserved for all times. Let us compute this quantity for the solutions produced by these various integrators, and graph it. Or rather, we will graph T/m = K/m + V/m so that the expressions do not depend on the mass of the satellite, which did not appear in the IVP in the first place. Here is the code15 to produce Figure 2.13, which shows what the time variation of the energies should look like (the total energy is conserved, hence a horizontal line).
Km=0*t;
Vm=0*t;
for i=1:length(t)
Km(i)=norm(z(i,4:6))^2/2;
Vm(i)=-G*M/norm(z(i,1:3));
end
plot(t,Km,'k--'); hold on
plot(t,Vm,'k:'); hold on
plot(t,Km+Vm,'k-'); hold on
xlabel('t [s]'),ylabel('T/m, K/m, V/m [m^2/s^2]')
In Figure 2.14 we compare the four integrators. Only the total energy is shown, so ideally we should see horizontal lines corresponding to perfectly conserved energy. On the contrary, we can see that none of the four integrators conserves the total energy. Note that the vertical axes have rather different scales. The Euler integrators perform very poorly: the change in total energy is huge. The ode45 is significantly outperformed by ode23, but both integrators lose energy nevertheless. Since ode45 is significantly more expensive than ode23, this example illustrates that choosing an appropriate integrator can make the difference between success and failure.
Fig. 2.13. Satellite motion. Total energy (solid line), potential energy (dotted line), kinetic energy (dashed
line).
Fig. 2.14. Satellite motion. Total energy computed with (left to right, top to bottom): ode45, ode23,
odefeul, odebeul.
Fig. 2.15. Dry friction sliding of eccentric mass shaker. Sliding motion: Displacement in dotted line, velocity
in solid line.
with a polished steel base lying on a steel platform. The mass of the shaker is m. The harmonic force due to the eccentric mass motion is added to the weight of the shaker to give the normal contact force between those two as mg + A sin(ωt), and the horizontal force parallel to the platform A cos(ωt). The IVP of the shaker sliding motion may be written in terms of its velocity as

m v̇ + μ(v) (mg + A sin(ωt)) sign v = A cos(ωt) ,   v(0) = v₀ .
Here μ(v) is the friction coefficient. For steel on steel contact we could take

μ(v) = μₛ for |v| ≤ v_stick ,   μ(v) = μₖ < μₛ otherwise.

Here μₛ is the so-called static friction coefficient, μₖ is the kinetic friction coefficient, and v_stick is the sticking velocity. In words, for low sliding velocity the coefficient of friction is high, for high sliding velocity the coefficient of friction is low.
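The velocity-dependent friction coefficient is the only nonstandard ingredient, and it can be sketched in a few lines of Python. The numerical values below are placeholders chosen by us for illustration, not values taken from the book.

```python
MU_S = 0.25      # static friction coefficient (placeholder value)
MU_K = 0.15      # kinetic friction coefficient, MU_K < MU_S (placeholder)
V_STICK = 1e-3   # sticking velocity [m/s] (placeholder)

def mu(v):
    """Friction coefficient: high (static) below the sticking velocity,
    low (kinetic) above it -- the discontinuity that drives stick-slip."""
    return MU_S if abs(v) <= V_STICK else MU_K
```

The jump at |v| = v_stick is what makes the right-hand side of the IVP discontinuous, which is exactly the complication discussed next.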
Run the simulation script stickslip_harm_2_animate and watch the animation a few times to get a feel for the motion.16 Figure 2.15 shows the displacement and velocity of the sliding motion. The brief stick phases should be noted. Also note the drift of the shaker due to the lift-off force, which reduces the contact force, and hence also the friction force, when the mass is moving upwards.
Now consider Figure 2.16, which shows the velocity of the sliding motion for two slightly different initial conditions.17 Note well the direction field and consider how quickly (in fact discontinuously) it changes for some values of the velocity of sliding.
We take the initial velocity of the shaker to be 0.99 v_stick and 1.01 v_stick. We may expect that for such close initial conditions the velocity curves would also stay close, but they don't. The reason is the discontinuous (and divergent) direction field, as is especially evident in the close-up on the right. The direction field is also discontinuous at zero velocity, but there it is convergent, and solution curves that arrive there are forced to remain at zero (and sticking occurs).
The divergent discontinuous direction field makes the solution non-unique. As users of numerical algorithms for IVPs we must be aware of such potential complications, and address them by careful consideration of the formulation of the problem and interpretation of the results.
Fig. 2.16. Dry friction sliding of eccentric mass shaker. Direction eld and velocity curves for two initial
conditions.
First, what do we mean by error? Consider for example that we want to obtain a numerical solution to the IVP

ẏ = f(t, y) ,   y(0) = y₀   (2.25)

in the sense that our goal is the approximation of the value of the solution at some given point t = t̄. The difference between our computed solution y_t̄ and the true answer y(t̄) will be the true error

E_t = y(t̄) − y_t̄ .

We have already seen that the time step length apparently controls the error (we will see later exactly how it achieves this feat). So let us compute the solution for a few decreasing time step lengths. The result will be a table of time step length versus true error.

Time step length | Solution at t̄ for Δt_j
Δt₁              | y_{t̄,1}
Δt₂              | y_{t̄,2}
...              | ...
The true error is a fine concept, but not very useful, as knowing it implies knowing the exact solution. In practical applications of numerical methods we will never know the true error (otherwise why would we be computing a numerical solution?). In practice we will have to be content with the concept of an approximate error. A useful form of approximate error is the difference of successive solutions. So now we can construct the table of approximate errors

Time step length | Solution at t̄ for Δt_j | Approximate error
Δt₁              | y_{t̄,1}                 |
Δt₂              | y_{t̄,2}                 | E_{a,1} = y_{t̄,2} − y_{t̄,1}
Δt₃              | y_{t̄,3}                 | E_{a,2} = y_{t̄,3} − y_{t̄,2}
Δt₄              | y_{t̄,4}                 | E_{a,3} = y_{t̄,4} − y_{t̄,3}
...              | ...                     | ...
y(0) = 1.0   (2.26)

and our goal will be to compute y(t̄ = 4). Figure 2.17 shows on the left the succession of computed solutions with various time steps. As we can see, the two methods used, the forward and backward Euler, are approaching the same value as the time step gets smaller. We call this behavior convergence. From the computed sequence of solutions we can obtain the approximate errors as discussed above18. The approximate errors are shown in Figure 2.17 on the right.
Fig. 2.17. Successive approximations to y(t̄ = 4) for various time steps (on the left), and approximate errors (on the right). Red curve: backward Euler, blue curve: forward Euler
With this data at hand we can try to ask some questions. How does the error depend on the
time step length? The curves in Figure 2.17 suggest a linear relationship. Before we look at this
question in more detail, we will consider the problem of numerical integration of the IVP again, this
time with a view towards devising a better (read: more accurate) integrator than the first two Euler algorithms.
2.6.1 Modified Euler method
As discussed below equation (2.5), the governing equation of the IVP (2.25) may be subject to integration from t₀ to t to obtain

y(t) − y(t₀) = ∫_{t₀}^{t} f(τ, y(τ)) dτ .

To use this expression to obtain an actual solution may not be easy because of the integral on the right-hand side. This gives us an incentive to try to approximate the right-hand side integral. One possibility is to write

∫_{t₀}^{t} f(τ, y(τ)) dτ ≈ (t − t₀) f(t, y(t))

and we get the backward Euler method. This should be familiar: we are approximating the areas under curves (integrals of functions) by rectangles (recall the concept of the Riemann sum). A better approximation would be achieved with trapezoids. Thus we may try
18 See: aetna/ScalarODE/scalardecayconv.m
∫_{t₀}^{t} f(τ, y(τ)) dτ ≈ ((t − t₀)/2) [f(t₀, y(t₀)) + f(t, y(t))] .

The resulting trapezoidal rule

y_{j+1} = y_j + (Δt/2) [f(t_j, y_j) + f(t_{j+1}, y_{j+1})]

has very attractive properties, and we will devote more attention to it later. (It is implemented in aetna as odetrap19.) One factor that may discourage its use is cost: it is an implicit method, and to obtain y(t) one has to solve a (in general, nonlinear) equation for y(t)

y(t) = y(t₀) + ((t − t₀)/2) [f(t₀, y(t₀)) + f(t, y(t))] .   (2.27)
To obtain a method that is explicit in y(t) one may try the following trick: in the above equation approximate y(t) in f(t, y(t)) using the forward Euler step to arrive at

y_a = y(t₀) + (t − t₀) f(t₀, y(t₀)) ,
y(t) = y(t₀) + ((t − t₀)/2) [f(t₀, y(t₀)) + f(t, y_a)] .   (2.28)

This formula defines one of the so-called modified Euler algorithms. It turns out to be only a little bit more expensive than the basic forward Euler method, but its accuracy is superior, as we will immediately see on some results. (An implementation is available in odemeul20.)
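To make (2.28) concrete, here is a small Python sketch of one modified Euler step (a generic predictor-corrector of Heun type; the function name is ours, not aetna's).

```python
import math

def modified_euler_step(f, t0, y0, dt):
    """One step of the modified Euler method (2.28):
    predict with forward Euler, then average the two slopes."""
    ya = y0 + dt * f(t0, y0)                             # forward Euler predictor
    return y0 + 0.5 * dt * (f(t0, y0) + f(t0 + dt, ya))  # averaged-slope corrector

# For dy/dt = -y/2 one step reproduces the Taylor expansion of exp(-dt/2)
# through the quadratic term, which is the signature of second-order accuracy.
y1 = modified_euler_step(lambda t, y: -0.5 * y, 0.0, 1.0, 0.1)
```

Compare y1 = 0.95125 with the exact exp(−0.05) ≈ 0.951229: the mismatch appears only in the cubic term of the expansion.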
2.6.2 Deeper look at errors: going to the limit

We will now compute21 the solution to (2.26) also with the modified Euler method (2.28). Figure 2.18 shows that the modified Euler approaches the solution somewhat more quickly than both the backward and forward Euler methods. This is especially clear when we look at the approximate errors (on the right).
Fig. 2.18. Successive approximations to y(t̄ = 4) for various time steps (on the left), and approximate errors (on the right). Red curve: backward Euler, blue curve: forward Euler, black curve: modified Euler
The errors seem to decrease roughly linearly for the backward and forward Euler methods. The modified Euler method errors decrease along some curve. Can we find out what kind of curve? Could it be a polynomial? That would be the first thing to try, because polynomials tend to be very useful in this way (viz the Taylor series later in the book).
19 See: aetna/utilities/ODE/integrators/odetrap.m
20 See: aetna/utilities/ODE/integrators/odemeul.m
21 See: aetna/ScalarODE/scalardecayconv1.m
We will assume that the approximate errors depend on the time step length as

E_a(Δt) ≈ C Δt^β .   (2.29)

The exponent β is unknown. One clever way in which we can use the data to find out the value of β relies on taking logarithms on both sides of the equation

log(E_a(Δt)) ≈ log(C Δt^β) ,

which yields

log(E_a(Δt)) ≈ log(C) + β log(Δt) .
This is an expression for a straight line on a plot with logarithmic axes. The slope of the line would be β. Figure 2.19 shows the approximate errors re-plotted on the log-log scale. Also shown are two red triangles. The hypotenuse in those triangles has slope 1 or 2 respectively. This may be compared with the plotted data. The forward and backward Euler approximate errors (at least for the smaller time step lengths) appear to lie along a straight line with slope equal to one. The modified Euler approximate errors on the other hand are close to a straight line with slope equal to two. Therefore, we may hypothesize that the approximate errors behave as E_a(Δt) ≈ CΔt for the forward and backward Euler, and as E_a(Δt) ≈ CΔt² for the modified Euler. The exponent β is called the rate of convergence (convergence rate). The higher the rate of convergence, the faster the errors drop. Later in the book we will use mathematical analysis tools (the Taylor series) to understand where the convergence rate is coming from.
Fig. 2.19. The approximate errors from Figure 2.18 re-plotted on the log-log scale. Red curve: backward Euler, blue curve: forward Euler, black curve: modified Euler
What about the first few points in the computed series, which fail to lie along a straight line on the log-log plot? We have assumed that the errors depended on only a single power of the time step length. This is a good assumption for very small time step lengths (the so-called asymptotic range, where Δt → 0), but for larger time step lengths (the so-called pre-asymptotic range) the error more likely depends on a mixture of powers of the time step length. Then the data points will not lie on a straight line on the log-log plot.
Plotting the data as in Figure 2.19 is very useful in that it gives us the convergence rate. Could
we use this information to get a handle on the true error? As explained above, we assumed that the
approximate error depended on the time step length through the monomial relation (2.29). Using a simple trick, we can relate the approximate errors to the true errors

E_{a,j} = y_{t̄,j+1} − y_{t̄,j} = (y_{t̄,j+1} − y(t̄)) + (y(t̄) − y_{t̄,j}) ,

where E_{t,j+1} = y(t̄) − y_{t̄,j+1} and E_{t,j} = y(t̄) − y_{t̄,j}, and so we have

E_{a,j} = E_{t,j} − E_{t,j+1} .   (2.30)
Then if the approximate error on the left behaves as the monomial (2.29), then so will the true errors on the right. There are two parameters in (2.29), the rate β and the constant C. We have estimated the rate by plotting the approximate errors on a log-log scale. Now we can estimate the constant C by taking

E_{t,j} ≈ C Δt_j^β ,   E_{t,j+1} ≈ C Δt_{j+1}^β

to obtain

E_{a,j} ≈ C Δt_j^β − C Δt_{j+1}^β   (2.31)

and

C = E_{a,j} / (Δt_j^β − Δt_{j+1}^β) .
For instance, for the forward Euler we have obtained the following approximate errors

>> ea_f =
6.2500e-02 3.7613e-02 1.7954e-02 8.7217e-03 4.2952e-03 2.1311e-03
>> dts =
2, 1, 1/2, 1/4, 1/8, 1/16, 1/32

and we have estimated from Figure 2.19 that the convergence rate was β = 1. Therefore, we can estimate the constant using (for example) E_{a,3} as

>> C=ea_f(3)/(dts(3)-dts(4))
C =
0.071816687928745
This is useful: we can now predict for instance how small a time step will be required to obtain the solution within the absolute tolerance 10⁻⁴:

E_{t,j} ≈ C Δt_j^β ≤ 10⁻⁴   ⟹   Δt_j ≤ (10⁻⁴ / C)^{1/β}

>> 1e-4/(ea_f(3)/(dts(3)-dts(4)))
ans =
0.001392434027300

Indeed, we find that for time step length 1/1024 < 0.00139 the true error is computed as 0.000066 < 10⁻⁴.
If we do not have an estimate of the convergence rate, we can try solving for it. Provided we have at least two approximate errors, let us say E_{a,1} and E_{a,2}, we can write (2.31) twice as

E_{a,1} = C Δt₁^β − C Δt₂^β ,   E_{a,2} = C Δt₂^β − C Δt₃^β .

This system of two nonlinear equations will allow us to solve for both unknowns C and β. In general a numerical solution of this nonlinear system of equations will be required. Only if the time steps are always related by a constant factor, so that

Δt_{j+1} = θ Δt_j ,   (2.32)
where θ is a fixed constant, will we be able to solve the system analytically. First we write

E_{a,1} = C Δt₁^β (1 − θ^β) ,   E_{a,2} = C Δt₂^β (1 − θ^β) ,   (2.33)

then divide the first equation by the second, canceling the factor (1 − θ^β), to obtain

E_{a,1} / E_{a,2} = (Δt₁ / Δt₂)^β .   (2.34)

This is then easily solved for the convergence rate β by taking a logarithm of both sides to give

β = log(E_{a,1} / E_{a,2}) / log(Δt₁ / Δt₂) .   (2.35)
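The estimate (2.35), together with the constant from (2.31), is a one-liner in any language. The Python sketch below recovers the rate from synthetic second-order data; the function name and the data are our own illustrations, not values from the book.

```python
import math

def convergence_rate(ea1, ea2, dt1, dt2):
    """Estimate beta from two approximate errors at steps dt1 > dt2, eq. (2.35)."""
    return math.log(ea1 / ea2) / math.log(dt1 / dt2)

# Synthetic data obeying Ea = C*dt^2 exactly (C = 3), with halved steps,
# generated through the difference formula (2.31).
C, dts = 3.0, [0.4, 0.2, 0.1]
ea = [C * (dts[j]**2 - dts[j + 1]**2) for j in range(2)]
beta = convergence_rate(ea[0], ea[1], dts[0], dts[1])
C_est = ea[0] / (dts[0]**beta - dts[1]**beta)   # constant from (2.31)
```

On clean data the estimates are exact; on real computed errors they are only as good as the asymptotic-range assumption discussed next.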
The described procedure for the estimation of the parameters of the relation (2.29) is a special case of the so-called Richardson extrapolation. When the data for the extrapolation is nice, this procedure is very useful. The data may not be nice: for instance, for some reason we haven't reached the asymptotic range. Or, perhaps there is a lot of noise in the data. Then the extrapolation procedure cannot work. It is a good idea to always visualize the approximate error on the log-log graph. If the approximate error data does not appear to lie on a straight line, the extrapolation should not be attempted.

An important note: the above Richardson extrapolation can work only for results obtained with fixed-step-length integrators. The step length is the parameter in the extrapolation formula. It varies from step to step when the MATLAB adaptive step length integrators (e.g. ode23, ...) are used, which the formula cannot accommodate, and the extrapolation is then not applicable.
An explicit Runge-Kutta method with s stages computes the solution at the time t = t₀ + Δt as

y(t) = y(t₀) + Δt (b₁ k₁ + b₂ k₂ + ... + b_s k_s) ,   (2.36)

which means that from y(t₀) we follow a slope which is determined as a linear combination of slopes k_j evaluated at various points within the current time step

k₁ = f(t₀ + c₁Δt, y(t₀))
k₂ = f(t₀ + c₂Δt, y(t₀) + a₂₁Δt k₁)
k₃ = f(t₀ + c₃Δt, y(t₀) + a₃₁Δt k₁ + a₃₂Δt k₂)
...   (2.37)

where Δt = (t − t₀), and the coefficients a_sj, b_j, c_j are determined in various ingenious ways so that the method has the best accuracy and stability properties.
Figure 2.20 shows graphically an example of such an explicit Runge-Kutta method, the modified Euler method. It can be written in the above notation as

y(t) = y(t₀) + Δt ((1/2) k₁ + (1/2) k₂) ,
k₁ = f(t₀ + 0·Δt, y(t₀)) ,
k₂ = f(t₀ + 1·Δt, y(t₀) + 1·Δt k₁) .   (2.38)

We see that the coefficients of this method are c₁ = 0, c₂ = 1, a₂₁ = 1 and b₁ = b₂ = 1/2.

Fig. 2.20. The modified Euler algorithm as a graphical schematic: The solution at the time t = t₀ + Δt is arrived at from y(t₀) using the average slope (1/2) k₁ + (1/2) k₂
The coefficients of Runge-Kutta methods a_sj, b_j, c_j are usually presented in the form of the so-called Butcher tableau

c | a
--+--
  | b   (2.39)

where the coefficients are elements of the three matrices. For the explicit RK methods c₁ = 0 always, and the matrix a is strictly lower triangular. The modified Euler method is an RK method with s = 2 and the tableau

0 | 0   0
1 | 1   0
--+--------
  | 1/2 1/2
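Given a tableau, one explicit RK step is mechanical to code. The following generic Python stepper is our own sketch (not an aetna routine); fed the modified Euler tableau above, it reproduces (2.38).

```python
def rk_step(f, t0, y0, dt, a, b, c):
    """One explicit Runge-Kutta step for a scalar ODE from a Butcher tableau.
    a is strictly lower triangular, b holds the weights, c the nodes."""
    k = []
    for i in range(len(b)):
        yi = y0 + dt * sum(a[i][j] * k[j] for j in range(i))
        k.append(f(t0 + c[i] * dt, yi))
    return y0 + dt * sum(bi * ki for bi, ki in zip(b, k))

# Modified Euler tableau: c = (0, 1), a21 = 1, b = (1/2, 1/2).
a = [[0.0, 0.0], [1.0, 0.0]]
b = [0.5, 0.5]
c = [0.0, 1.0]
y1 = rk_step(lambda t, y: -0.5 * y, 0.0, 1.0, 0.1, a, b, c)
```

Swapping in a different tableau (such as the fourth-order one below) changes the method without touching the stepping code, which is the point of the tableau notation.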
MATLAB in the ode45 integrator. The tableau of the fourth-order explicit Runge-Kutta with a
fixed time step is

0   | 0   0   0   0
1/2 | 1/2 0   0   0
1/2 | 0   1/2 0   0
1   | 0   0   1   0
----+----------------
    | 1/6 1/3 1/3 1/6
Fig. 2.21. The approximate errors plotted on the log-log scale. Red curve: backward Euler, blue curve: forward Euler, black curve with o markers: modified Euler, black curve with x markers: fourth-order Runge-Kutta oderk4
To round off this discussion we will consider the adaptive-step Runge-Kutta method implemented in the MATLAB ode23 integrator. The tableau reads
22 See: aetna/utilities/ODE/integrators/oderk4.m
23 See: aetna/ScalarODE/scalardecayconv2.m
0   | 0    0   0   0
1/2 | 1/2  0   0   0
3/4 | 0    3/4 0   0
1   | 2/9  1/3 4/9 0
----+------------------
    | 2/9  1/3 4/9 0
    | 7/24 1/4 1/3 1/8
The array b with two rows instead of one makes the method so useful: the solution at the time t = t₀ + Δt may be computed in two different ways from the slopes k₁, ..., k₄. One of these (the first row) is third-order accurate and the other (the second row) is second-order accurate. The difference between them can be used to guide the change of the time step to maintain accuracy.
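The mechanism can be sketched in Python using the tableau above. This is a sketch of one step of such an embedded pair only; aetna's and MATLAB's actual step-size control logic is more elaborate, and the function name is ours.

```python
import math

def embedded_step(f, t, y, dt):
    """One step of the embedded pair from the tableau above.
    Returns the first-row solution and the error estimate."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt / 2 * k1)
    k3 = f(t + 3 * dt / 4, y + 3 * dt / 4 * k2)
    y_a = y + dt * (2 / 9 * k1 + 1 / 3 * k2 + 4 / 9 * k3)   # first row of b
    k4 = f(t + dt, y_a)
    y_b = y + dt * (7 / 24 * k1 + 1 / 4 * k2 + 1 / 3 * k3 + 1 / 8 * k4)  # second row of b
    return y_a, y_a - y_b   # the difference drives the step-size control

y_a, err = embedded_step(lambda t, y: -y, 0.0, 1.0, 0.1)
exact = math.exp(-0.1)
```

For the decay problem dy/dt = −y the first-row solution lands within a few parts in 10⁶ of exp(−0.1), and the cheap error estimate correctly flags an error of roughly that size, with no knowledge of the exact solution.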
Suggested experiments
1. The integrator ode87fixed24 uses a high order Runge-Kutta formula and fixed time step length. Repeat the above exercise with this integrator, and estimate its convergence rate.
24 See: aetna/utilities/ODE/integrators/ode87fixed.m
3 Preservation of solution features: stability
Summary
1. In this chapter we investigate the central role that the eigenvalue problem plays in the design
of ODE integrators. The goal is to preserve important solution features. This is referred to as
the stability of the integration algorithm. Main idea: stability can be investigated on the model
equation of the scalar linear ODE.
2. For the model IVP, the formula of a particular integrator can be written down so that the new
value of the solution is expressed as a multiple of the solution value in the previous step. Main
idea: the amplication factor depends on the product of the eigenvalue and the time step, and
therefore the shape of the numerical solution is determined by these quantities. The eigenvalue
is given as data, the time step can be (needs to be) chosen by the user.
3. The scalar linear ODE with a complex coecient is equivalent to two coupled real equations in
two real variables. Main idea: the ODE with a complex coecient describes harmonic oscillations.
4. For the model IVP with a complex coecient, the same procedure that leads to an amplication
factor is used. Main idea: the amplication factor and the solution now live in the complex
plane. The magnitude of the amplication factor again is seen to play a role in the stability
investigation.
5. Understanding the amplication factors is aided by appropriate diagrams. Main idea: The preservation of solution features is illustrated by a complete stability diagram for the various methods.
The magnitude of the amplication factor may also be visualized as a surface above the complex
plane.
Our model IVP is the scalar linear ODE

ẏ = k y ,   y(0) = y₀ .   (3.1)

For the moment we shall consider k real. As an example take k = −1/2, with an arbitrary initial condition
ẏ = −(1/2) y ,   y(0) = 1.3361 .   (3.2)

We look for the solution in the form y = B e^{λt}. Substituting into the differential equation gives

λ B e^{λt} = k B e^{λt} .   (3.3)

The constant B ≠ 0 (otherwise we don't have a solution!), and for the above to hold for all times t we must require

B(λ − k) = 0 .

The above equation is called the eigenvalue problem, and this is definitely not the last time we will encounter this type of equation in the present book. Here λ is the eigenvalue, and B is the eigenvector. The solution is easy: we see that λ = k. Any B ≠ 0 will satisfy the eigenvalue equation. We could determine B so that the initial value problem (3.1) was solved by substituting into the initial condition to obtain B = y₀.
The solution to the IVP (3.2) is drawn with a solid line in Figure 3.1. It is a decaying solution. In the same figure there's also a growing solution (for k = +1/2), and a constant solution (for k = 0).
[Figure 3.1: plot of y(t) versus t.]
to obtain

y_{j+1} = y_j + Δt k y_j = (1 + Δt k) y_j .   (3.4)

We would like to see a monotonically decaying numerical solution, |y_{j+1}| < |y_j|, so the so-called amplification factor (1 + Δt k) must be positive and its magnitude must be less than one

|1 + Δt k| < 1 .

If this condition is satisfied but (1 + Δt k) < 0, the solution decreases in magnitude, but changes sign from step to step. Finally, (1 + Δt k) = 0 implies that the solution drops to zero in one step and stays zero. Recall that for our example k = −1/2. Correspondingly, in Figure 3.2 we see1 a monotonically decaying solution for Δt = 1.0 (|1 + Δt k| = |1 + 1.0 × (−1/2)| = 1/2 < 1), a solution dropping to zero in one step for Δt = 2.0, a solution decaying, but non-monotonically, for Δt = 3.0 (as 1 + Δt k = 1 + 3.0 × (−1/2) = −1/2), and finally for Δt = 4.0 we get a solution which oscillates between ±y₀. Note that for an even bigger time step we would get an oscillating solution which would increase in amplitude rather than decrease.
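The four regimes can be verified directly from the amplification factor, without running any integrator. A small Python check (our own illustration):

```python
def amplification(dt, k):
    """Forward Euler amplification factor for y' = k*y."""
    return 1.0 + dt * k

k = -0.5
factors = [amplification(dt, k) for dt in (1.0, 2.0, 3.0, 4.0)]
# dt=1.0: factor  0.5 -> monotonic decay
# dt=2.0: factor  0.0 -> drops to zero in one step
# dt=3.0: factor -0.5 -> decaying, sign alternates each step
# dt=4.0: factor -1.0 -> oscillates between +y0 and -y0
```

Any dt beyond 4.0 pushes the factor below −1, and the oscillation grows instead of decaying.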
Fig. 3.2. Forward Euler solutions to (3.2) for time steps (left to right) Δt = 1.0, 2.0, 3.0, 4.0
In summary, for a negative coefficient k < 0 in the model IVP (3.1) we can reproduce the correct shape of the solution curve with the forward Euler method provided

0 < Δt ≤ −1/k .   (3.5)

This is visualized in Figure 3.3. On the top we show the real line; the thick part indicates where the eigenvalues λ = k are located when they are negative. On the bottom we show the real line for the quantity Δtλ. The thick segment corresponds to equation (3.5). The filled circle indicates included, the empty circle indicates excluded. The meaning of (3.5) is expressed in words as:
1 See: aetna/ScalarODE/scalarsimple.m
for a negative λ = k the forward Euler will reproduce the correct decaying behavior provided the quantity Δtλ lands in the segment −1 ≤ Δtλ < 0 as indicated by the arrow.

The time step lengths that satisfy equation (3.5) are called stable. If we need to be precise, we would say that such time step lengths are stable for the forward Euler applied to IVPs with decaying solutions.

Fig. 3.3. Forward Euler stability when applied to the model problem (3.1) for negative eigenvalues. The given coefficient is located in the negative part of the real axis on top. The time step Δt needs to be chosen to place the product Δtλ in the unit interval −1 ≤ Δtλ < 0 on the axis at the bottom.
If we require only that the magnitude of the solution decays, |1 + Δtk| < 1, we obtain the weaker condition

−2 < Δt k < 0 ,   (3.6)

so that Δtk is allowed to be in the interval between −2 and zero. This will guarantee that the solution decays, albeit non-monotonically. Such behavior is considered admissible when all we care about is that the solution decays. Detailed discussion follows in Section ??.
The backward Euler method applied to the model problem yields

yj+1 = yj / (1 − Δt k) ,   (3.7)

where

1 / (1 − Δt k)

is the amplification factor for this Euler scheme. Now if we realize that by assumption k < 0, we see
that the solution is going to decay monotonically for all nonzero time step lengths, since 1 − Δt k > 1
for Δt > 0. Hence we can state that any time step length is stable for the backward Euler method
applied to an IVP with a decaying solution.
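The contrast between the two amplification factors is easy to check numerically. The sketch below is plain Python (not part of the MATLAB AETNA toolbox); the value k = −1/2 is the illustrative coefficient used earlier in this section.

```python
# Stability check for forward vs. backward Euler applied to y' = k*y with k < 0.
def forward_euler_factor(dt, k):
    return 1.0 + dt * k          # y_{j+1} = (1 + dt*k) y_j

def backward_euler_factor(dt, k):
    return 1.0 / (1.0 - dt * k)  # y_{j+1} = y_j / (1 - dt*k)

k = -0.5
# Forward Euler: stable only for 0 < dt < -2/k = 4, monotone for dt <= -1/k = 2.
print(abs(forward_euler_factor(1.0, k)))  # 0.5 -> monotone decay
print(abs(forward_euler_factor(3.0, k)))  # 0.5 -> decay, but with alternating sign
print(abs(forward_euler_factor(5.0, k)))  # 1.5 -> blow-up
# Backward Euler: |1/(1 - dt*k)| < 1 for every dt > 0, hence unconditionally stable.
print(all(abs(backward_euler_factor(dt, k)) < 1 for dt in (0.1, 1.0, 10.0, 1000.0)))  # True
```

The last line illustrates the statement above: no matter how large the time step, the backward Euler amplification magnitude stays below one for a decaying solution.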
For a growing solution, k > 0, the backward Euler method reproduces the correct growing behavior provided

0 < Δt ≤ 1/k .   (3.8)

The time step lengths that satisfy equation (3.8) are called stable. If we need to be precise, we
would say that such time step lengths are stable for the backward Euler applied to IVPs with growing
solutions. We see that the situation mirrors the one discussed for the forward Euler applied
to decaying solutions. Figure 3.4, which corresponds to (3.8), illustrates this quite clearly, as it
is quite literally a mirror image of Figure 3.3 for the forward Euler and k < 0.
Fig. 3.4. Backward Euler stability when applied to the model problem (3.1) for positive eigenvalues. The
given coefficient λ is located in the positive part of the real axis on top. The time step Δt needs to be chosen
to place the product Δt λ in the unit interval 0 < Δt λ ≤ 1 on the axis at the bottom.
y = y0 exp(kt) .

Note that the complex exponential may be expressed in terms of sine and cosine:

exp(kt) = exp [(Re k + i Im k)t] = exp(Re k t) [cos(Im k t) + i sin(Im k t)] .

The solution is now to be sought with a time dependence in the form of a complex exponential. Let
us write the solution in terms of the real and imaginary parts,

y = Re y + i Im y ,

which can be substituted into the differential equation, together with k = Re k + i Im k, to give

Re ẏ + i Im ẏ = (Re k + i Im k)(Re y + i Im y) .

Expanding we obtain

Re ẏ + i Im ẏ = Re k Re y − Im k Im y + i Re k Im y + i Im k Re y .

Now we group the real and imaginary terms,

[Re ẏ − (Re k Re y − Im k Im y)] + i [Im ẏ − (Re k Im y + Im k Re y)] = 0 ,

and in order to get a real zero on the right-hand side, we require that both brackets vanish identically,
and we obtain a system of coupled real differential equations

Re ẏ = Re k Re y − Im k Im y ,
Im ẏ = Re k Im y + Im k Re y ,   (3.9)

with the initial conditions Re y(0) = Re y0 , Im y(0) = Im y0 .
So we see that to solve (3.1) with k complex is equivalent to solving the real IVP (profitably written
in matrix form)

[ Re ẏ ]   [ Re k, −Im k ] [ Re y ]        [ Re y(0) ]   [ Re y0 ]
[ Im ẏ ] = [ Im k,  Re k ] [ Im y ]  ,     [ Im y(0) ] = [ Im y0 ]  .   (3.10)
The method of Section 3.2 can be used again, but with a little modification since we now have a
matrix differential equation instead of a scalar ODE. We will seek the solution to (3.10) as

[ Re y ]            [ z1 ]
[ Im y ] = exp(λt)  [ z2 ]  .
For brevity we will introduce the notation

      [ Re y ]
w =   [ Im y ]

and

      [ Re k, −Im k ]
K =   [ Im k,  Re k ]   (3.11)

so that the IVP reads

ẇ = K w ,   w(0) = w0 .   (3.12)

The assumed solution then takes the form

w = exp(λt) z ,

and substituting it into (3.12) gives

λ exp(λt) z = K exp(λt) z .   (3.13)

Canceling the (nonzero) scalar exp(λt) on both sides yields

K z = λ z .   (3.14)
This is the so-called matrix eigenvalue problem. The vector z is the eigenvector, the scalar λ
is the eigenvalue, and they both may be complex. The eigenvalue problem (EP) is highly nonlinear,
and therefore for larger matrices impossible to solve analytically and quite difficult to solve
numerically.
Looking at (3.14) we realize that there are too many unknowns here: λ, z1, and z2 (three), and
not enough equations (two). We need one more equation, and to get it we rewrite (3.14) as

(K − λ1) z = 0 ,

where 1 is an identity matrix. This is a system of linear equations for the vector z with a zero
right-hand side. In order for the above equation to have a nonzero solution, the square matrix

K − λ1

must be singular. (A linear combination of the columns of K − λ1 yields a zero vector, which is
just another way of saying that the columns are linearly dependent. Hence, the matrix is singular.)
We may put the fact that K − λ1 is singular differently by referring to its determinant:

det (K − λ1) = 0 .   (3.15)
Illustration 1

Expand the determinant of the 2 × 2 matrix

[  2, −1 ]       [ 1, 0 ]
[ −1,  1 ]  − λ  [ 0, 1 ]  .

The determinant may be defined recursively in terms of cofactors (Laplace formula). For a 2 × 2
matrix we obtain the familiar diagonal-products rule

det ( [  2, −1 ]       [ 1, 0 ] )
    ( [ −1,  1 ]  − λ  [ 0, 1 ] )  = (2 − λ)(1 − λ) − (−1)(−1) = λ² − 3λ + 1 .
We see that the expanded determinant is a polynomial in λ, the so-called characteristic polynomial.
For a 2 × 2 matrix the polynomial is quadratic, and with each additional row and column the
degree of the polynomial goes up by one. As a consequence, to solve the eigenvalue problem means to
find the roots of the characteristic polynomial. This is a highly nonlinear and potentially unstable
computation, which for larger matrices must be done numerically, since no analytical formulas for the
roots exist for polynomials of degree five and higher.
Illustration 2
Display the characteristic polynomial of the matrix [2,-1,0,0;-1,1,-1,0;0,-1,1,-1;0,0,-1,1].
The MATLAB symbolic solution
>> syms lambda real
>> det( [2,-1,0,0;-1,1,-1,0;0,-1,1,-1;0,0,-1,1]-lambda*eye(4))
ans =
lambda+6*lambda^2-5*lambda^3-2+lambda^4
>> ezplot(ans)
>> grid on
yields a curve similar to the one shown in Figure 3.5. One has to zoom in to be able to estimate
where the roots lie. There are going to be four of them, corresponding to the highest power λ⁴.
Fig. 3.5. The characteristic polynomial p(λ) = λ⁴ − 5λ³ + 6λ² + λ − 2
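The claim that the eigenvalues are the roots of the characteristic polynomial can be cross-checked numerically. The sketch below uses NumPy (Python; not part of the MATLAB AETNA toolbox), with the polynomial coefficients read off the symbolic MATLAB answer above.

```python
import numpy as np

A = np.array([[2., -1., 0., 0.],
              [-1., 1., -1., 0.],
              [0., -1., 1., -1.],
              [0., 0., -1., 1.]])
# Eigenvalues computed directly from the matrix...
eigs = np.sort(np.linalg.eigvals(A).real)
# ...and the roots of p(lambda) = lambda^4 - 5 lambda^3 + 6 lambda^2 + lambda - 2.
roots = np.sort(np.roots([1., -5., 6., 1., -2.]).real)
print(np.allclose(eigs, roots))  # True: the four real roots match the eigenvalues
```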
Illustration 3
We may familiarize ourselves with the concepts of the EP solutions by looking at some simple 2 × 2
matrices.
Zero matrix. The characteristic polynomial is

det ( [ 0, 0 ]       [ 1, 0 ] )
    ( [ 0, 0 ]  − λ  [ 0, 1 ] )  = λ² = 0 ,

which has the double root λ1,2 = 0. Apparently any vector v is an eigenvector, since

0 v = 0 · v .
The MATLAB solution agrees with our analytical consideration (columns of V are the eigenvectors,
the diagonal elements of D are the eigenvalues). The eigenvectors we obtained are particularly
nice because they are orthogonal.
>> [V,D]=eig([0,0;0,0])
V =
     1     0
     0     1
D =
     0     0
     0     0
Identity matrix. The characteristic polynomial is

det ( [ 1, 0 ]       [ 1, 0 ] )
    ( [ 0, 1 ]  − λ  [ 0, 1 ] )  = (1 − λ)² = 0 ,

which has the double root λ1,2 = 1. Again any vector is an eigenvector. The MATLAB solution
agrees with our analytical consideration (note that the eigenvectors are again orthonormal).
>> [V,D]=eig([1,0;0,1])
V =
     1     0
     0     1
D =
     1     0
     0     1
Diagonal matrix. The characteristic polynomial is

det ( [ a, 0 ]       [ 1, 0 ] )
    ( [ 0, b ]  − λ  [ 0, 1 ] )  = (a − λ)(b − λ) = 0 ,
which has the roots λ1 = a and λ2 = b. The eigenvectors may be calculated by substituting the
eigenvalue (let us start with λ1)

[ a, 0 ]
[ 0, b ]  v1 = λ1 v1 = a v1 ,

and by guessing that this can be satisfied with the vector

      [ 1 ]
v1 =  [ 0 ]  .
Similarly for the second eigenvalue.
The symbolic MATLAB solution agrees with our analytical consideration. (a, b are real symbolic
constants.)
>> syms a b real
>> [V,D]=eig([a,0;0,b])
V =
[ 1, 0]
[ 0, 1]
D =
[ a, 0]
[ 0, b]
General real matrix. The characteristic polynomial is

det ( [ a, d ]       [ 1, 0 ] )
    ( [ c, b ]  − λ  [ 0, 1 ] )  = (a − λ)(b − λ) − cd = 0 .
The roots λ1 and λ2 need to be solved for from this quadratic equation. The symbolic
MATLAB expression below evaluates the determinant
>> syms a b c d lambda real
>> det([a,d;c,b]-lambda*[1,0;0,1])
ans =
a*b-a*lambda-lambda*b+lambda^2-d*c
A helpful observation usually made in a linear algebra course is that the trace of the 2 × 2 matrix
(i.e. the sum of the diagonal elements) is equal to the sum of the eigenvalues, a + b = λ1 + λ2, and
the determinant of the matrix is equal to the product of the eigenvalues, ab − cd = λ1 λ2. We can
easily verify this symbolically in MATLAB by first computing the eigenvalues and eigenvectors
(symbolically)

syms a b c d lambda real
[V,D]=eig([a,d;c,b])

and then using the symbolic expressions

D(1,1)+D(2,2)-a-b
simple(D(1,1)*D(2,2)-a*b+c*d)
we check that we get identically zero. As an example consider the matrix

[  2, −1 ]
[ −1,  2 ]  .

We find the eigenvalues from 2 + 2 = 4 = λ1 + λ2 and 2 · 2 − (−1)(−1) = 3 = λ1 λ2. We
easily guess λ1 = 3 and λ2 = 1. The eigenvectors are found by substituting the eigenvalue into
the eigenvalue problem, and then solving the singular system of equations. For instance,

( [  2, −1 ]        [ 1, 0 ] ) [ z11 ]   [ 0 ]
( [ −1,  2 ]  − λ1  [ 0, 1 ] ) [ z21 ] = [ 0 ]

so that

[ −1, −1 ] [ z11 ]   [ 0 ]
[ −1, −1 ] [ z21 ] = [ 0 ]  .

These two equations are linearly dependent, and we cannot determine both elements z11, z21
from a single equation. Choosing for instance z11 = 1 gives (one possible) solution for the first
eigenvector

[ z11 ]   [  1 ]
[ z21 ] = [ −1 ]  .
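Although the worked example above is done by hand (and symbolically in MATLAB), the guessed eigenpairs are easy to double-check with a short NumPy sketch (Python; not part of the AETNA toolbox):

```python
import numpy as np

A = np.array([[2., -1.], [-1., 2.]])
z1 = np.array([1., -1.])   # guessed eigenvector for lambda1 = 3
z2 = np.array([1., 1.])    # guessed eigenvector for lambda2 = 1
print(np.allclose(A @ z1, 3.0 * z1))  # True
print(np.allclose(A @ z2, 1.0 * z2))  # True
# trace = sum of the eigenvalues, determinant = product of the eigenvalues
print(np.isclose(np.trace(A), 3.0 + 1.0), np.isclose(np.linalg.det(A), 3.0 * 1.0))
```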
Real matrix of the form (3.11). The characteristic polynomial is

det ( [ a, −b ]       [ 1, 0 ] )
    ( [ b,  a ]  − λ  [ 0, 1 ] )  = (a − λ)² + b² = 0 .

Taking the helpful formulas for the eigenvalues of the 2 × 2 matrix,

λ1 + λ2 = 2a ,   λ1 λ2 = a² + b² ,

and the identity (a + i b)(a − i b) = a² + b², we can see that the eigenvalues are in fact

λ1 = a + i b ,   λ2 = a − i b .

So the diagonal elements of the matrix are the real parts of the eigenvalues, and the off-diagonal
elements are the (real values of the) imaginary parts of the eigenvalues.
Suggested experiments
1. When we compute the eigenvector by solving the system with the singular matrix we have to
choose one element of the vector, apparently arbitrarily. Discuss whether the choice is truly
arbitrary. For instance, could we choose z11 = 0?
For the matrix K of (3.11) the characteristic equation (3.15) works out as

det ( [ Re k, −Im k ]       [ 1, 0 ] )
    ( [ Im k,  Re k ]  − λ  [ 0, 1 ] )  = (Re k − λ)² + (Im k)² = 0 .   (3.16)
We know that for the scalar case the eigenvalue is λ = k = Re k + i Im k. Would this eigenvalue satisfy
also the characteristic equation above? Substituting and simplifying we obtain:

(Re k − λ)² + (Im k)² = (Re k − Re k − i Im k)² + (Im k)² = i² (Im k)² + (Im k)² = 0 .

It does! That is not all, however. Numbers whose imaginary parts have equal magnitude but opposite
signs are called complex conjugate (see Figure 3.6). The characteristic equation (3.16) also has
the root λ = k̄ = Re k − i Im k, where the overbar means complex conjugate. This holds because
(−i Im k)² = (i Im k)². The eigenvalue problem in Section 3.2 is saying the same thing, since forming
the complex conjugate of the eigenvalue equation is equally valid as the original equation.

For a purely imaginary coefficient, Re k = 0, the matrix (3.11) specializes to

      [    0,  −Im k ]
K =   [ Im k,      0 ]  .   (3.17)
Fig. 3.6. A complex number a and its conjugate ā in the complex plane; their sum a + ā lies on the real axis.
These are interesting matrices, which occur commonly in many important applications. We will
hear more about them. The eigenvalues are λ1,2 = ±i Im k, which means purely imaginary. We write
λ1 = λ̄2 (and λ̄1 = λ2).

We solve for the components of the first eigenvector. The procedure is the same as in the example
above: substitute the computed eigenvalue into the eigenproblem equation, and since the resulting
equations are linearly dependent, choose one of the components of the eigenvector and solve for the
rest. Thus we get for λ1 = i Im k

( [    0, −Im k ]        [ 1, 0 ] ) [ z11 ]   [ 0 ]
( [ Im k,     0 ]  − λ1  [ 0, 1 ] ) [ z21 ] = [ 0 ]  .

This may be rewritten (after dividing through by Im k)

[ −i, −1 ] [ z11 ]   [ 0 ]
[  1, −i ] [ z21 ] = [ 0 ]

and choosing z21 = 1 we obtain the first eigenvector

     [ z11 ]   [ i ]
z1 = [ z21 ] = [ 1 ]  .

Similarly we obtain the second eigenvector as

     [ z12 ]   [ −i ]
z2 = [ z22 ] = [  1 ]  .

Note that z1 and z2 are complex conjugate, as are their corresponding eigenvalues. We can easily
convince ourselves that an eigenvalue problem with complex conjugate eigenvalues must have complex
conjugate eigenvectors. For an arbitrary real matrix A write the complex conjugate on either side
of the equation:

A z = λ z   implies   A z̄ = Ā z̄ = λ̄ z̄ .
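The conjugate eigenpairs of the skew-symmetric matrix can be verified directly. The NumPy sketch below (Python; not part of the AETNA toolbox) uses b = 0.3 as an arbitrary stand-in for Im k:

```python
import numpy as np

b = 0.3                               # plays the role of Im k; the value is arbitrary
K = np.array([[0., -b], [b, 0.]])
z1 = np.array([1j, 1.0])              # eigenvector for lambda1 = i*b
z2 = np.conj(z1)                      # eigenvector for lambda2 = -i*b, i.e. [-i, 1]
print(np.allclose(K @ z1, 1j * b * z1))   # True
print(np.allclose(K @ z2, -1j * b * z2))  # True
```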
Both eigenvalue/eigenvector pairs,

w1 = exp(λ1 t) z1   and   w2 = exp(λ2 t) z2 = w̄1 ,   (3.18)

could be solutions of the IVP (3.10). A general solution therefore is likely to be a mix of these two,

w = C1 w1 + C2 w2 .
We expect w to be a real vector, whereas w1 and w2 are both complex quantities. However, they are
complex conjugate, which suggests that if the constants are also complex conjugate the expression
on the right may be real (refer to Figure 3.6):

w = C1 w1 + C̄1 w̄1 .

In general, the complex constant may be written as

C1 = Re C1 + i Im C1   (3.19)

and the complex exponential has the equivalent expression (Euler's formula from complex analysis)

exp(i Im k t) = cos(Im k t) + i sin(Im k t) .   (3.20)
Substituting these expressions and collecting the real terms, the solution becomes

        [ −Re C1 sin(Im k t) − Im C1 cos(Im k t) ]
w = 2   [  Re C1 cos(Im k t) − Im C1 sin(Im k t) ]  ,   (3.21)

and expressing the constants through the initial values, Re y0 = −2 Im C1 and Im y0 = 2 Re C1, turns this into

    [ cos(Im k t), −sin(Im k t) ] [ Re y0 ]
w = [ sin(Im k t),  cos(Im k t) ] [ Im y0 ]  .   (3.22)

The matrix in the above equation is the so-called rotation matrix. The quantity Im k has the
meaning of angular velocity, and correspondingly Im k t is the rotation angle. One way of visualizing
rotations is through phasors: see Figure 3.7. A phasor is a rotating vector whose components vary
harmonically.
Figure 3.8 provides a link between different ways of visualizing rotations². The black circle
corresponds to the trace of the tip of the rotating vector of Figure 3.7. The rainbow-colored helical
tube (time advances from blue to red) is the black circle stretched in the time dimension. (Think
Slinky.)

The red curve is the projection of the helix onto the plane Im y = 0, and it is the graph of t
versus Re y. The blue curve is the projection of the helix onto the plane Re y = 0, and it is the graph
of t versus Im y. When we plot the solutions computed by the MATLAB integrators they are the
superposition of these (red and blue) curves in one plane, as shown on the right in Figure 3.8.³ The
rotating-vector picture tells us what kind of curves we should expect: the vector rotates with constant
angular velocity, which when projected onto either of the two coordinates will yield a sinusoidal
phase-shifted curve in time; compare with Figure 3.8.
See: aetna/ScalarODE/scalaroscillstream.m
See: aetna/ScalarODE/scalaroscillplot.m
Fig. 3.7. A phasor: the vector w(t), rotated by the angle Im k t from its initial position w0, in the plane Re y, Im y.
Fig. 3.8. Graphical representation of the solution to (3.12) with Im k = 0.3, Re y0 = 0, Im y0 = 8.
See: aetna/ScalarODE/scalaroscill1st.m
time step, but the forward Euler integrator odefeul fails spectacularly: the solution blows up very
quickly (Figure 3.10 on the left). The backward Euler integrator is not much better, except that the
amplitude goes to zero (Figure 3.10 on the right). With smaller time steps we can reduce the rate of
the blow-up (or decay) of the amplitude, but we can never remove it (try it: decrease the time step by
a couple of orders of magnitude and arm yourselves with patience; it is going to take a long time to
integrate). We consider the constant amplitude as the main feature of the solutions to this problem.
Therefore, we must conclude that for this problem the two integrators appear to be unconditionally
unstable as they are unable to maintain an unchanging amplitude of the oscillations no matter how
small the time step. For comparison we show the results for the built-in ode45 integrator, applied
Fig. 3.10. Example of Section 3.11, odefeul integrator (on the left) and odebeul integrator (on the right).
Time step Δt = 0.099.
to this problem over a long integration time⁵ in Figure 3.11 (there are so many oscillations that the curves visually
melt into a solid block). We see that even for this integrator there is a systematic change (decay)
in the amplitude of the oscillation. By reducing the time step length we can reduce the drift, but
we cannot remove it entirely (as observed in numerical experiments). Again, this behavior has to do
with stability, not accuracy.
See: aetna/ScalarODE/scalaroscill1stlong.m
Fig. 3.11. Example of Section 3.11, ode45 integrator, long integration time
and we work in the knowledge that k, yj, yj+1 are complex. We now understand that for a purely
imaginary k = i Im k the solution may be represented as a circle in the plane Re y, Im y. Another way
of saying this is that the modulus of the complex quantity y is constant. We take the modulus on both
sides:

|yj+1| = |(1 + Δt k) yj| = |1 + Δt k| |yj| ,   (3.23)

and in order to get |yj+1| = |yj| (so that the solution points lie on a circle) we need the complex
amplification factor to satisfy

|1 + Δt k| = 1 .   (3.24)

Figure 3.12 illustrates the meaning of the above equation graphically. The circle of radius equal to
1.0 centered at (0, 0) is translated to be centered at (−1, 0) in order for the complex number Δt k to
satisfy (3.24). Now consider the purely imaginary value of the coefficient k = i Im k. Such numbers
Fig. 3.12. The number Δt k relative to the stability circle |1 + Δt k| = 1, of radius 1 and centered at (−1, 0), in the plane Δt Re k, Δt Im k.
lie along the imaginary axis, Re k = 0, and when multiplied by Δt > 0 the resulting product just
moves closer to or further away from the origin. One such number Δt k is shown in Figure 3.12.
In order for Δt k to satisfy (3.24) the dot representing the number must move to the thick circle
in Figure 3.12. We can see that no such non-zero time step length exists: only Δt = 0 will make
Δt k = 0 lie on the circle, at (0, 0). Therefore, we must conclude that the forward Euler method
is unconditionally unstable for imaginary k, as there is no time step length that would satisfy the
stability requirement (3.24).
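This conclusion can also be seen numerically: for purely imaginary k, |1 + Δt k| = sqrt(1 + (Δt Im k)²), which exceeds 1 for every positive Δt. A plain-Python sketch (not part of the AETNA toolbox; k = 3i is an arbitrary illustrative value):

```python
# For purely imaginary k, the forward Euler amplification magnitude always exceeds 1.
k = 3j
for dt in (0.099, 0.01, 0.001):
    print(dt, abs(1 + dt * k))    # magnitudes are all greater than 1
# No matter how small dt > 0 is made, the amplitude grows:
print(all(abs(1 + dt * k) > 1 for dt in (0.099, 0.01, 1e-6)))  # True
```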
Next we shall consider the backward Euler method (3.7) for the same problem. Taking the
modulus on both sides we obtain

|yj+1| = | yj / (1 − Δt k) | = |yj| / |1 − Δt k| ,

and in order to get |yj+1| = |yj| (so that the solution points lie on a circle) we need

|1 − Δt k| = 1 .   (3.25)
Figure 3.13 now illustrates that the circle of radius equal to 1.0 centered at (0, 0) needs to be
translated to be centered at (+1, 0) in order for Δt k to satisfy (3.25). Again, we must conclude that
the backward Euler method is unconditionally unstable for imaginary k, as there is no non-zero time
step length that would satisfy the stability requirement (3.25).
Fig. 3.13. The number Δt k relative to the stability circle |1 − Δt k| = 1, of radius 1 and centered at (+1, 0), in the plane Δt Re k, Δt Im k.
(3.26)
The solution will be in the form of (3.21), except that everything will be multiplied by the real
exponential exp(Re k t):

                    [ −Re C1 sin(Im k t) − Im C1 cos(Im k t) ]
w = 2 exp(Re k t)   [  Re C1 cos(Im k t) − Im C1 sin(Im k t) ]  .

Following the same steps as in Section 3.10, we arrive at the solution to the IVP in the form

                  [ cos(Im k t), −sin(Im k t) ] [ Re y0 ]
w = exp(Re k t)   [ sin(Im k t),  cos(Im k t) ] [ Im y0 ]  ,   (3.27)

which may be interpreted readily as the rotation of a phasor with exponentially decreasing (Re k < 0)
or increasing (Re k > 0) amplitude.
Let us take first Re k < 0 and the forward Euler algorithm. Equation (3.23) is still our starting
point, but now we are asking if there is a time step length that would make the modulus of the
solution decrease in time, or in mathematical terms

|1 + Δt k| < 1 .   (3.28)

For the accompanying picture refer to Figure 3.14: one possible complex coefficient k is shown, as
is its scaling (down) by the time step, Δt k. Clearly, it is now possible by choosing a sufficiently small
time step length to bring Δt k inside the circle so that its distance from (−1, 0) is less than one, and
so that the stability criterion (3.28) is satisfied. Since now there is a time step length such that the
forward Euler can reproduce the correct solution shape, we call forward Euler for general complex k
and Re k < 0 conditionally stable. The condition implied by "conditionally" is equation (3.28), and
for a given k we can use it to solve for an appropriate Δt.
Fig. 3.14. A complex coefficient k with Re k < 0 and its scaling Δt k, shown relative to the stability circle |1 + Δt k| = 1 centered at (−1, 0).
On the other hand, we can now see that for the forward Euler algorithm we achieve stability
for Re k > 0 for any Δt (for a growing solution, stability now means |1 + Δt k| > 1): the coefficient
k is in the right-hand half-plane, and the stability circle is in the left-hand half-plane. Therefore
multiplying a complex k with an arbitrary Δt > 0 will satisfy |1 + Δt k| > 1. Hence, for Re k > 0 the
forward Euler method is unconditionally stable.
This state of affairs is again mirrored by the behavior of the backward Euler algorithm. First
take Re k > 0. Equation (3.25) is now used to figure out if there is a time step length that would
make the modulus of the solution increase in time, or in mathematical terms

1 / |1 − Δt k| > 1 .   (3.29)
For the accompanying picture refer to Figure 3.15: one possible complex coefficient k is shown,
as is its scaling Δt k. Clearly, it is now possible by choosing a sufficiently small time step length
to bring Δt k inside the circle so that its distance from (+1, 0) is less than one, which will ensure
satisfaction of (3.29). Thus, the backward Euler method is conditionally stable for general complex
k and Re k > 0. Also, we now conclude the backward Euler algorithm achieves stability for Re k < 0
for any Δt (stability here meaning the inequality in (3.29) is reversed): the coefficient k is in the
left-hand half-plane, and the stability circle is in the right-hand half-plane. Similar reasoning as for
the forward Euler leads us to conclude that backward Euler is unconditionally stable for complex k
and Re k < 0.
Illustration 4
Apply the modified Euler (2.28) to the model equation (3.1), and derive the amplification factor.
Fig. 3.15. A complex coefficient k with Re k > 0 and its scaling Δt k, shown relative to the stability circle |1 − Δt k| = 1 centered at (+1, 0).
Substituting the right-hand side of the model equation into the formula (2.28) we get

ya = y(t0) + (t − t0) f(t0, y(t0)) = y(t0) + (t − t0) k y(t0)

and

y(t) = y(t0) + ((t − t0)/2) [k y(t0) + k ya]
     = y(t0) + ((t − t0)/2) [k y(t0) + k (y(t0) + (t − t0) k y(t0))]
     = y(t0) [ 1 + k(t − t0) + (k(t − t0))² / 2 ] .

The term in square brackets that multiplies y(t0) is the amplification factor for the modified Euler.
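The amplification factor 1 + z + z²/2 (with z = Δt k) reproduces the exponential exp(z) through the quadratic term, so the per-step error shrinks like z³. A plain-Python sketch of this observation (not part of the AETNA toolbox):

```python
import math

def amp_modified_euler(z):
    # z stands for dt*k; this is the factor derived above
    return 1 + z + 0.5 * z ** 2

# The mismatch with exp(z) behaves roughly like z^3/6 for small real z:
for z in (0.1, 0.01):
    print(z, abs(math.exp(z) - amp_modified_euler(z)))
```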
Suggested experiments
1. Derive the amplication factor for the trapezoidal rule (2.27).
ẏ = k y ,   y(0) = y0 ,   k, y complex .
The eigenvalue λ = k (a complex number) is plotted in the complex plane. Depending on where
it lands, the analytical solution will display the following behaviors: in the left half-plane we get
decaying oscillations, in the right half-plane growing oscillations. If the eigenvalue is purely
imaginary, we get pure oscillation. If the eigenvalue is purely real, we get either exponentially
decaying or growing solutions. Finally, a zero eigenvalue yields a stagnant (unchanging) solution.
Figure 3.17 shows the behaviors produced by the forward Euler integrator. The same color coding as in
Figure 3.16 is used. The key to understanding whether the forward Euler integrator can give us a discrete
Fig. 3.16. Behavior classification for the first-order linear differential equation.
solution that mimics the analytical one is to compare the two figures. The complex number Δt λ is
plotted in the complex plane in Figure 3.17. Forward Euler can reproduce the desired behavior if
there is such a Δt as to place the number Δt λ in Figure 3.17 in the region with the same color as
the one in which λ was located in Figure 3.16.
Illustration 5

Example 1: consider λ = −0.1 + i3. The analytical solution is a decaying oscillation. In Figure 3.17 we
can see that a sufficiently small time step Δt will indeed place Δt λ inside the circle of unit radius
centered at −1, which has the same color as the left-hand half-plane in Figure 3.16. Forward
Euler is conditionally stable in this case. (The condition is that Δt must be sufficiently small.)

Example 2: consider λ = i3. The analytical solution is pure oscillation. In Figure 3.17 we can
see that it is not possible to find any other time step but Δt = 0 to place Δt λ on the circle of unit
radius centered at −1 (which has the same color as the imaginary axis in Figure 3.16). Forward
Euler is unconditionally unstable for pure oscillations.

Example 3: consider λ = 13.3. The analytical solution is exponential growth. In Figure 3.17 we
can see that the positive part of the real axis has the same color in both figures. Therefore, for
all Δt > 0 we get the correct behavior. Forward Euler is unconditionally stable for exponentially
growing solutions.

Example 4: consider λ = −0.61. The analytical solution is exponentially decaying. In Figure 3.17
we can see that a sufficiently small time step Δt will indeed place Δt λ within the interval
−1 ≤ Δt λ < 0, which has the same color as the negative part of the real axis in Figure 3.16. Forward Euler
is conditionally stable in this case. (The condition is that Δt must be sufficiently small.)
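The four examples can be checked with the forward Euler amplification magnitude |1 + Δt λ| alone. A plain-Python sketch (not part of the AETNA toolbox); the Δt values are illustrative:

```python
lam1 = -0.1 + 3j   # Example 1: decaying oscillation
print(abs(1 + 0.01 * lam1) < 1)                            # True: small dt is stable
lam2 = 3j          # Example 2: pure oscillation
print(all(abs(1 + dt * lam2) > 1 for dt in (0.1, 1e-4)))   # True: no dt > 0 works
lam3 = 13.3        # Example 3: exponential growth
print(abs(1 + 0.5 * lam3) > 1)                             # True: growth for any dt
lam4 = -0.61       # Example 4: exponential decay
print(abs(1 + 0.1 * lam4) < 1)                             # True: small dt is stable
```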
In words, using the pair of images 3.16 and 3.17, the forward Euler integrator is found to be
unconditionally unstable for pure oscillations, unconditionally stable for growing oscillations and
exponentially growing non-oscillating solutions, and conditionally stable for exponentially decaying
oscillating and non-oscillating solutions. Analogous observations can be made about the backward
Euler integrator, which is found to be unconditionally unstable for pure oscillations, conditionally
stable for growing oscillations and exponentially growing non-oscillating solutions, and unconditionally
stable for exponentially decaying oscillating and non-oscillating solutions.
Fig. 3.17. Behavior classification for the first-order linear differential equation, forward Euler algorithm.
Fig. 3.18. Behavior classification for the first-order linear differential equation, backward Euler algorithm.
The amplification factor of the modified Euler may be written in terms of the product Δt λ as

1 + Δt λ + (Δt λ)² / 2 .   (3.30)

All possible complex λ are allowed, which means that Δt λ may represent an arbitrary point of the
complex plane. The magnitude of the amplification factor may therefore be considered a function
of the complex number Δt λ, and it is often useful to visualize such functions as surfaces raised
above the complex plane.⁶ The MATLAB function surf is designed to do just that. It takes three
matrices which represent the coordinates of points of a logically rectangular grid. The elements
x(k,m), y(k,m), z(k,m) represent Cartesian coordinates of the k,m vertex of the grid. The grid
may then be rendered with surf(x,y,z). Here we set up a grid with 99 rectangular faces in each
direction (which is why we have 100 × 100 matrices for the corners of those faces). First, the extent
of the grid and the number of corners:

See: aetna/StabilitySurfaces/StabilitySurfaces.m
xlow =-3.2; xhigh= 0.9;
ylow =-3.2; yhigh= 3.2;
n=100;
Then we set up the matrices for the coordinates. Note that the index k corresponds to moving in the x
direction, and the index m corresponds to moving in the y direction. dtlambda is a complex number (1i
is the imaginary unit), so taking its absolute value means getting the magnitude of the amplification
factor.
x=zeros(n,n); y=zeros(n,n); z=zeros(n,n);
for k =1:n
for m =1:n
x(k,m) =xlow +(k-1)/(n-1)*(xhigh-xlow);
y(k,m) =ylow +(m-1)/(n-1)*(yhigh-ylow);
dtlambda = x(k,m) + 1i*y(k,m);
z(k,m) = abs(1 + dtlambda + 0.5*dtlambda.^2);
end
end
Of course there is more than one way of accomplishing this. Here is the whole setup accomplished
with just three lines using the handy meshgrid and linspace functions.
[x,y] = meshgrid(linspace(xlow,xhigh,n),linspace(ylow,yhigh,n));
dtlambda = x + 1i*y;
% Modified Euler
z = abs(1 + dtlambda + 0.5*dtlambda.^2);
Next we draw the color-coded surface that represents the height z above the complex plane: blue
is the lowest, red is the highest.
surf(x,y,z,'edgecolor','none')
Then we draw into the same figure the level curve at height 1.0 of the same function z of x, y. We
set the linewidth of the curve using a handle returned from the function contour3.
hold on
[C,H] = contour3(x,y,z,[1, 1],'k')
set(H,'linewidth', 3)
Finally set up the view, and label the axes.
axis([-4 0.6 -4 4 0 8])
axis equal,
xlabel ('Re (\Delta{t}\lambda)')
ylabel ('Im (\Delta{t}\lambda)')
Voilà: Figure 3.19. It shows how the amplification factor falls below 1.0 in amplitude inside an oval
shape in the left-hand half-plane.
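The same stability map can be sampled without any plotting at all: evaluate |1 + Δtλ + (Δtλ)²/2| on a grid of Δtλ values. A NumPy sketch (Python; not part of the AETNA toolbox), using the same grid extents as the MATLAB script above:

```python
import numpy as np

x = np.linspace(-3.2, 0.9, 100)
y = np.linspace(-3.2, 3.2, 100)
X, Y = np.meshgrid(x, y)
Z = np.abs(1 + (X + 1j * Y) + 0.5 * (X + 1j * Y) ** 2)   # modified Euler magnitude
print(Z.shape)  # (100, 100)
# Spot checks: stable well inside the oval, unstable in the right half-plane.
print(abs(1 + (-1) + 0.5 * (-1) ** 2) < 1)    # True  (dt*lambda = -1 gives 0.5)
print(abs(1 + 0.5 + 0.5 * 0.5 ** 2) > 1)      # True  (dt*lambda = +0.5 gives 1.625)
```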
As shown in the MATLAB script StabilitySurfaces, corresponding surface representations of
the amplification factors for the methods discussed so far, namely forward and backward Euler (Figures 3.20
and 3.21), the trapezoidal rule (Figure 3.22), and the fourth-order Runge-Kutta (Figure 3.23), are easily
obtained just by commenting out or uncommenting the appropriate definitions of the variable z.
Figure 3.24 compares the level curves at 1.0 for the amplitude of the amplification factor
for the first-order linear differential equation for the integrators FEUL = forward Euler algorithm,

Fig. 3.19. Surface of the amplitude of the amplification factor for the first-order linear differential equation,
MEUL = modified Euler algorithm. The contour of unit amplitude is shown in black.

Fig. 3.20. Surface of the amplitude of the amplification factor for the first-order linear differential equation,
FEUL = forward Euler algorithm. The contour of unit amplitude is shown in black.
Fig. 3.21. Surface of the amplitude of the amplification factor for the first-order linear differential equation,
BEUL = backward Euler algorithm. The contour of unit amplitude is shown in black.

Fig. 3.22. Surface of the amplitude of the amplification factor for the first-order linear differential equation,
TRAP = trapezoidal rule algorithm. The contour of unit amplitude is shown in black.

Fig. 3.23. Surface of the amplitude of the amplification factor for the first-order linear differential equation,
RK4 = fourth-order Runge-Kutta algorithm. The contour of unit amplitude is shown in black.
BEUL = backward Euler algorithm, MEUL = modified Euler algorithm, TRAP = trapezoidal rule
algorithm, and RK4 = fourth-order Runge-Kutta algorithm. Note that the level curve for the trapezoidal
rule coincides with the vertical axis in the figure. For a decaying solution, the integrator will produce
a stable solution if Δt λ is inside the contours in the left-hand half-plane, or, for the backward Euler,
outside the circle in the right-hand half-plane. Clearly, comparing Figure 3.24 with the surface
representations of integrator stability in Figures 3.20-3.23, we can see that visualizing the stability
with contours is only part of the story: the surface figures supply the missing information about the
magnitude of the amplification factor.
Suggested experiments
1. Use the information in Figure 3.22 to estimate the stability diagram for the integrator
odetrap, similar to those shown in Figures 3.17 and 3.18.
Fig. 3.24. Level curves (contours) at value 1.0 of the amplitude of the amplification factor for the
first-order linear differential equation; FEUL = forward Euler, BEUL = backward Euler, MEUL = modified
Euler, TRAP = trapezoidal rule, RK4 = fourth-order Runge-Kutta.
4
Linear Single Degree of Freedom Oscillator
Summary
1. The model of the linear oscillator with a single degree of freedom is investigated from the point
of view of the uncoupling procedure (so-called modal expansion), and the solution in the form
of a matrix exponential. Main idea: solve the eigenvalue problem for the governing ODE system,
and expand the original variables in terms of the eigenvectors. The modal expansion is a critical
piece in engineering vibration analysis.
2. For the single degree of freedom linear vibrating system we study how to transform between
the second-order and the first-order matrix form, and we discuss the relationship of the scalar
equation with the complex coefficient from Chapter 3 with the linear oscillator model. Main
idea: the two IVPs are shown to be equivalent descriptions.
3. It is shown that modal analysis is possible as long as the system matrix is not defective, i.e. as
long as it has a full set of eigenvectors. The case of critical damping is discussed as a special case
which leads to a defective system matrix.
4. The modal analysis allows multiple degree of freedom systems to be understood in terms of the
properties of multiple single degree of freedom linear oscillators.
v(0) = v0 ;

together these will constitute the complete definition of the IVP of the linear oscillator. Using the
definition of the velocity,

v = ẋ ,

will yield the general first-order form of the 1-dof damped oscillator IVP as

ẏ = A y ,   y(0) = y0 ,   (4.1)

where

      [    0,     1 ]
A =   [ −k/m,  −c/m ]   (4.2)
and

      [ x ]
y =   [ v ]  .
The discussion of Section 3.7 (refer to equation (3.13)) applies here too. We assume the solution
in the form of an exponential

y = e^{λt} z .

The characteristic equation for the damped oscillator is

                     [   −λ,        1 ]
det (A − λ1) = det   [ −k/m,  −c/m − λ ]   = λ² + (c/m) λ + k/m = 0 .

The quantity ωn, defined by

ωn² = k/m ,   (4.3)

is the natural angular frequency of the oscillator.
Therefore, the differential equation of motion may be written in terms of the new variables as

[   ẇ1  ]   [    0,  1 ] [   w1  ]
[ ωn ẇ2 ] = [ −ωn²,  0 ] [ ωn w2 ]  ,

and by canceling ωn in the second equation we obtain

[ ẇ1 ]   [   0,  ωn ] [ w1 ]       [ w1(0) ]   [   x0  ]
[ ẇ2 ] = [ −ωn,   0 ] [ w2 ]  ,    [ w2(0) ] = [ v0/ωn ]  ,

which is in perfect agreement with Section 3.10: we get two variables, the displacement x and the
velocity scaled by the angular frequency, v/ωn, coupled together by a skew-symmetric matrix of the
form of equation (3.17) (with ωn playing the role of Im k). The solution in the new variables w1, w2 is
therefore expressed by the rotation matrix as in (3.22). Now we can understand that Figure 3.7
describes the motion of an oscillating mass.
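The link between the skew-symmetric system matrix and the rotation matrix can be verified numerically: the matrix exponential of the scaled-oscillator matrix is exactly a rotation. A NumPy sketch (Python; not part of the AETNA toolbox), where the values of ωn and t are arbitrary illustrative choices:

```python
import numpy as np

def expm_via_eig(M, t):
    # matrix exponential via the eigendecomposition (valid for diagonalizable M)
    lam, V = np.linalg.eig(M)
    return np.real(V @ np.diag(np.exp(lam * t)) @ np.linalg.inv(V))

omega_n, t = 2.0, 0.7
S = np.array([[0., omega_n], [-omega_n, 0.]])      # skew-symmetric oscillator matrix
R = np.array([[np.cos(omega_n * t), np.sin(omega_n * t)],
              [-np.sin(omega_n * t), np.cos(omega_n * t)]])
print(np.allclose(expm_via_eig(S, t), R))  # True: exp(S t) is the rotation matrix
```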
The roots of the characteristic equation are

λ1,2 = −c/(2m) ± √( (c/(2m))² − k/m ) ,

and we must have

(c/(2m))² ≥ k/m

for λ1,2 to come out real.
4.1.2 λ = −(c/2m) ± iω: Oscillation

This is the second subcase. Substituting λ = −(c/2m) + iω into the characteristic equation, we obtain

( −c/(2m) + iω )² + (c/m)( −c/(2m) + iω ) + k/m = 0 ,   (4.4)
62
( c )2
k
=
.
m
2m
For to come out real and positive (we include the latter condition since = 0 was already covered
in the preceding section) we require
( c )2
k
<
.
2m
m
4.1.3 Critically damped oscillator

The case of ω = 0 and at the same time

( c/(2m) )² = k/m

yields a special case: the critically damped oscillator. The damping coefficient

c_cr = 2mω_n                                                   (4.5)

is the so-called critical damping. The critically damped oscillator needs special handling, which
we will postpone to its own section following the discussion of the generic cases of the supercritically and the subcritically damped oscillators.
4.2 Supercritically damped oscillator (i.e. ζ > 1)

The eigenvalues are real,

λ_{1,2} = −c/(2m) ± √( (c/(2m))² − k/m ) .

Let us compute the first eigenvector, corresponding to λ₁ = −c/(2m) + √( (c/(2m))² − k/m ). We are
looking for the vector z₁ that solves

(A − λ₁ 1) z₁ = 0 .
Substituting, we have

[ −λ₁, 1 ; −k/m, −c/m − λ₁ ] [ z₁₁ ; z₂₁ ] = [ 0 ; 0 ] .

The two equations are really only one equation (the rows and columns of the matrix on the left
are linearly dependent, since that is the condition from which we solved for λ₁). Therefore, using
for instance the first equation and choosing z₁₁ = 1 we compute z₂₁ = λ₁. We repeat the same
procedure for the second root to arrive at the two eigenvectors

z₁ = [ z₁₁ ; z₂₁ ] = [ 1 ; λ₁ ] ,   z₂ = [ z₁₂ ; z₂₂ ] = [ 1 ; λ₂ ] .
The general solution of the differential equation of motion of the oscillator is therefore

y = c₁ e^{λ₁ t} z₁ + c₂ e^{λ₂ t} z₂ .                          (4.6)

In matrix form this reads

y = [z₁, z₂] [ e^{λ₁ t}, 0 ; 0, e^{λ₂ t} ] [ c₁ ; c₂ ] = V [ e^{λ₁ t}, 0 ; 0, e^{λ₂ t} ] [ c₁ ; c₂ ] .   (4.7)
Using the transformed initial value

w₀ = V⁻¹ y₀                                                    (4.8)

we can write

w = [ e^{λ₁ t}, 0 ; 0, e^{λ₂ t} ] w₀

as a completely equivalent solution to the oscillator IVP, using the new variable w. Each component
of the solution is independent of the other, as we can see from the scalar equivalent of the above
matrix equation

w₁(t) = e^{λ₁ t} w₁₀ ,   w₂(t) = e^{λ₂ t} w₂₀ ,                (4.9)

with the initial condition w(0) = w₀ = V⁻¹ y₀ .
The matrix V⁻¹AV is a very nice one: it is diagonal. To see this, we realize that for each column
of the matrix V the eigenvalue problem

A z_j = λ_j z_j ,   j = 1, 2                                   (4.10)

holds, and writing all such eigenvalue problems in one shot is possible as

A [z₁, z₂] = [z₁, z₂] [ λ₁, 0 ; 0, λ₂ ]                        (4.11)

using the diagonal matrix

Λ = [ λ₁, 0 ; 0, λ₂ ] .                                        (4.12)

Therefore we have

A [z₁, z₂] = A V = V Λ

and pre-multiplying with V⁻¹

V⁻¹ A V = Λ .                                                  (4.13)

We say that the matrix A is similar to a diagonal matrix. (We also say that A is diagonalizable.)
So the IVP for the oscillator can be written in the new variable w as

ẇ = V⁻¹ A V w = Λ w ,   w(0) = w₀ = V⁻¹ y₀ .
This means that we can write totally independent scalar IVPs for each component

ẇ₁(t) = λ₁ w₁ ,  w₁(0) = w₁₀ ,   ẇ₂(t) = λ₂ w₂ ,  w₂(0) = w₂₀ .   (4.14)

This is the well-known decoupling procedure: the original variables y are in general coupled together
since the matrix A is in general non-diagonal. Therefore, to make things easier for us, we switch to
a different set of variables w with the transformation (4.8) in which all the variables are uncoupled.
The uncoupled variables each have their own IVP, which is easily solved. Finally, if we wish to, we
switch back to the original variables y. This procedure may be summarized as

ẏ = A y ,  y(0) = y₀   (original IVP),
w₀ = V⁻¹ y₀ ,  ẇ = Λ w   (uncoupled IVP),                          (4.15)
w_j(t) = e^{λ_j t} w_{j0} ,  y(t) = Σ_{j=1}^{2} z_j w_j(t)   (back to the original variables).
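The decoupling procedure just summarized can be cross-checked numerically. The following is a minimal sketch in Python with NumPy/SciPy (an illustrative aside, not part of the MATLAB toolkit; the data values mirror Illustration 3 below):

```python
import numpy as np
from scipy.linalg import expm

# Data as in the supercritically damped example: m = 13, k = 6100, zeta = 3/2
m, k, zeta = 13.0, 6100.0, 1.5
omega_n = np.sqrt(k / m)
c = zeta * 2 * m * omega_n                      # supercritical damping
A = np.array([[0.0, 1.0], [-k / m, -c / m]])
y0 = np.array([0.0, 1.0])                       # x0 = 0, v0 = 1

lam, V = np.linalg.eig(A)                       # columns of V are eigenvectors
w0 = np.linalg.solve(V, y0)                     # w0 = V^{-1} y0

t = 0.37                                        # arbitrary time instant
w = np.exp(lam * t) * w0                        # uncoupled scalar solutions
y = V @ w                                       # back to the original variables

# The same solution obtained directly from the matrix exponential of A t
y_direct = expm(A * t) @ y0
print(np.allclose(y, y_direct))                 # True
```

The check confirms that solving the two scalar IVPs and mapping back with V is equivalent to solving the coupled IVP directly.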
Illustration 3

Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 3/2, x₀ = 0, and v₀ = 1.
We shall follow the procedure (4.15). The MATLAB solution is based on the Symbolic Math
Toolbox. First come the definitions of the variables. The variable names are self-explanatory.
function froscill_super_symb
syms m k c omega_n t x0 v0 real
y0= [x0;v0];
c_cr=2*m*omega_n;
c=3/2*c_cr;
A = [0, 1; -omega_n^2, -(c/m)];
We compute symbolically the eigenvalues and eigenvectors, and we construct the diagonal matrix
(called L in the code)
[V,D] =eig(A);
L =simple(inv(V)*A*V);
(Control question: How do L and D compare?) Next we can compute the matrix with e^{λ_j t} on the
diagonal (called eLt). Note that calling the MATLAB function exp on a matrix would exponentiate
each element of the matrix. This is not what we intend: only the elements on the diagonal should
be affected. Therefore we have to extract the diagonal of L with diag, exponentiate, and then
reconstruct a square matrix with another call to diag
eLt =diag(exp(diag(L)*t));
Now we are ready to write down the last equation (4.15) to construct the solution components
(displacement and velocity).
y=simple(V*eLt*inv(V))*y0;
It only remains to substitute numbers and plot. These are the given numbers, and we also define an
auxiliary variable.
x0= 0; v0=1;% [initial displacement; initial velocity]
m= 13;
k= 6100; omega_n= sqrt(k/m);
For the plotting we need data to plot on the horizontal and vertical axis. Here we set it up so that
the time variable array t consists of 200 points spanning two periods of vibration of the undamped
system.
T_n=(2*pi)/omega_n;
t=linspace(0, 2*T_n, 200);
Finally the plotting of the components of the solution.
plot(t,eval(vectorize(y(1))),'m-'); hold on
plot(t,eval(vectorize(y(2))),'r--'); hold on
Remember that the components of y are symbolic expressions. Now that we have provided all the variables
with numerical values, we need to evaluate the numerical value of the solution components using
the MATLAB function eval. It also doesn't hurt to use the function vectorize: the variable t is an
array. In case the expression for the solution components contained arithmetic operators joining two or
more terms that referred to t (such as exp(t)*sin(t)) we would want the expressions to evaluate
element-by-element. vectorize replaces all references to operators such as * or ^ with .* or
.^ so that these operators work on each scalar element of the arrays in turn.
The eigenvalues are now complex,

λ_{1,2} = −c/(2m) ± i √( k/m − (c/(2m))² ) .
Let us remind ourselves that an undamped oscillator is a special case of the subcritically damped
oscillator for c = 0.
The same procedure as in Section 4.2 leads to the eigenvectors

z₁ = [ z₁₁ ; z₂₁ ] = [ 1 ; λ₁ ] ,   z₂ = [ z₁₂ ; z₂₂ ] = [ 1 ; λ₂ ] ,
which are complex, since the λ_j are complex numbers. The solution is again written as in (4.6), but with
the important difference that all quantities on the right-hand side are complex while the left-hand
side is expected to be real.
The second eigenvector corresponds to the second eigenvalue, which is the complex conjugate of
the first one, λ₂ = λ̄₁. We see this easily by writing the complex conjugate of the equation A z = λ z
(see equation (3.18)). The two constants c_j can be determined from the initial condition

y(0) = c₁ e^{λ₁·0} z₁ + c₂ e^{λ₂·0} z₂ = c₁ z₁ + c₂ z₂ = y₀

and since y₀ is real, the two constants must be complex conjugates of each other, c₂ = c̄₁. The
constants are still determined by

[ c₁ ; c₂ ] = V⁻¹ y₀ .
Now we can follow all the derivations from the previous section, and the solution will still be arrived
at in the form of (4.7). Since both y and y₀ are real, the product of the three complex matrices

V [ e^{λ₁ t}, 0 ; 0, e^{λ̄₁ t} ] V⁻¹

must also be real, and however surprising it may seem, it is real. (We can do the algebra by hand
or with MATLAB to check this.)
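The claim that this complex product collapses to a real matrix is easy to check numerically. A minimal sketch in Python/NumPy (illustrative aside only; the data values are the ones used in Illustration 4 below):

```python
import numpy as np

# Subcritical damping (zeta = 0.2): complex conjugate eigenpairs, yet
# V diag(e^{lambda_j t}) V^{-1} must come out real.
m, k, zeta = 13.0, 6100.0, 0.2
omega_n = np.sqrt(k / m)
c = zeta * 2 * m * omega_n
A = np.array([[0.0, 1.0], [-k / m, -c / m]])

lam, V = np.linalg.eig(A)
assert np.allclose(lam[0], np.conj(lam[1]))     # complex conjugate pair

t = 0.11
P = V @ np.diag(np.exp(lam * t)) @ np.linalg.inv(V)
print(np.max(np.abs(P.imag)))                   # negligibly small: P is real
```

Up to floating-point round-off, the imaginary part of the product vanishes, exactly as argued above.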
Illustration 4

Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 0.2 (< 1 so that the
damping is subcritical), x₀ = 0, and v₀ = 1.
We shall follow the procedure (4.15). The MATLAB solution is based on the Symbolic Math
Toolbox. First come the definitions of the variables. The variable names are self-explanatory. The code is
pretty much the same as for the supercritically damped oscillator example above, except
function froscill_sub_symb
...
c=0.2*c_cr;
...
We may verify that the eigenvalues (and eigenvectors) are now general complex numbers. For instance
K>> D(1,1)
ans =
(-1/5+2/5*i*6^(1/2))*omega_n
It is rather satisfying to find that no modifications to the code of froscill_super_symb, which was
written for the real (supercritical) case, are required to account for the complex eigenvalues and
eigenvectors: it just works as is.
For the undamped oscillator (c = 0) the eigenvalues are purely imaginary,

λ_{1,2} = ±i √(k/m) = ±i ω_n .

Let us compute the first eigenvector, corresponding to λ₁ = i ω_n:

z₁ = [ z₁₁ ; z₂₁ ] = [ 1 ; i ω_n ] .

The second eigenvector corresponds to the second eigenvalue, which is the complex conjugate of the
first one, λ₂ = λ̄₁ = −i ω_n ,
z₂ = [ z₁₂ ; z₂₂ ] = z̄₁ = [ 1 ; −i ω_n ] .
The general solution of the free undamped oscillator motion is a linear combination of the eigenvectors

y = c₁ e^{λ₁ t} z₁ + c₂ e^{λ₂ t} z₂ .

Because of the complex-conjugate status of the pairs of eigenvalues and eigenvectors, we have

y = c₁ e^{λ₁ t} z₁ + c̄₁ e^{λ̄₁ t} z̄₁ .

Introducing the initial condition, which is real, we obtain

y(0) = c₁ z₁ + c₂ z̄₁

and we must conclude c₂ = c̄₁, otherwise the right-hand side couldn't be real. Using

Re a = (a + ā)/2

we see that the sum c₁ z₁ + c̄₁ z̄₁ therefore evaluates to 2 Re(c₁ z₁), and the constants can be determined from
y(0) = 2 Re(c₁ z₁) = 2 ( Re c₁ Re z₁ − Im c₁ Im z₁ ) = 2 [ Re z₁ , −Im z₁ ] [ Re c₁ ; Im c₁ ] .
We will introduce the matrix composed of the real and (negative) imaginary part of the eigenvector z₁

Z = [ Re z₁ , −Im z₁ ] = [ 1, 0 ; 0, −ω_n ] .                  (4.16)
Then we can write

[ Re c₁ ; Im c₁ ] = (1/2) Z⁻¹ y(0) = (1/2) [ 1, 0 ; 0, −1/ω_n ] y(0) .
Using the same principle that we obtain a real number from the sum of complex conjugates, we
write

y = 2 Re( c₁ e^{λ₁ t} z₁ ) ,                                   (4.17)

which may be expanded into

y = 2 [ Re c₁ ( cos ω_n t Re z₁ − sin ω_n t Im z₁ ) − Im c₁ ( sin ω_n t Re z₁ + cos ω_n t Im z₁ ) ] .
Then collecting the terms leads to the matrix expression

y = 2 [ Re z₁ , −Im z₁ ] [ cos ω_n t , −sin ω_n t ; sin ω_n t , cos ω_n t ] [ Re c₁ ; Im c₁ ] ,

which after substitution of Re c₁ , Im c₁ finally results in the matrix expression

y = 2 Z [ cos ω_n t , −sin ω_n t ; sin ω_n t , cos ω_n t ] (1/2) Z⁻¹ y(0) = Z R(t) Z⁻¹ y(0) .   (4.18)

We have in this way introduced the time-dependent rotation matrix

R(t) = [ cos ω_n t , −sin ω_n t ; sin ω_n t , cos ω_n t ] .    (4.19)

The solution for the displacement and velocity of the linear single degree of freedom oscillator can
therefore be understood as the result of the rotation of the initial-value quantity Z⁻¹ y(0) (phasor)

Z⁻¹ y(t) = R(t) Z⁻¹ y(0) .                                     (4.20)
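The phasor form of the solution can be checked against the exact solution of the undamped IVP. A minimal Python/NumPy sketch (an illustrative aside, not part of the toolkit):

```python
import numpy as np
from scipy.linalg import expm

# Undamped oscillator: Z R(t) Z^{-1} y(0) should reproduce the exact solution
m, k = 13.0, 6100.0
omega_n = np.sqrt(k / m)
A = np.array([[0.0, 1.0], [-omega_n**2, 0.0]])
y0 = np.array([0.0, 1.0])

Z = np.array([[1.0, 0.0], [0.0, -omega_n]])     # Z = [Re z1, -Im z1]

def R(t):                                       # time-dependent rotation matrix
    cn, sn = np.cos(omega_n * t), np.sin(omega_n * t)
    return np.array([[cn, -sn], [sn, cn]])

t = 0.04
y_rot = Z @ R(t) @ np.linalg.inv(Z) @ y0        # phasor-rotation form
y_exact = expm(A * t) @ y0                      # exact solution via expm
print(np.allclose(y_rot, y_exact))              # True
```

Both routes yield x(t) = (v₀/ω_n) sin ω_n t and v(t) = v₀ cos ω_n t for these initial values.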
Illustration 5

Check that the procedure (4.15) and the alternative formula (4.18) lead to the same solution.
We don't want to do this by hand. It is faster to use the MATLAB symbolic algebra. The function
froscill_un_symb computes the solution twice, and then subtracts one from the other. If we get
zeroes as a result, the solutions were the same.
The code begins with the same variable definitions and solution of the eigenvalue problem as
froscill_sub_symb. We compute the first solution using (4.15).
L =simple(inv(V)*A*V);
eLt =diag(exp(diag(L)*t));
y1=simple(V*eLt*inv(V))*y0;
Next, we compute the solution using the alternative with the rotation matrix (4.19).
Z =[real(V(:,1)),-imag(V(:,1))];
R = [cos(omega_n*t),-sin(omega_n*t);
sin(omega_n*t),cos(omega_n*t)];
y2 =simple(Z*R*inv(Z))*y0;
Finally we evaluate y1-y2.
Finally, we can realize that the solution (4.20) is of the same form as that derived in Section 3.10
(as in (3.22)) and then again in the Illustration in Section 4.1. The new variables are w₁ = y₁, w₂ =
y₂/ω_n as in Section 4.1.
4.5.1 Subcritically damped oscillator: alternative treatment

The eigenvalues are

λ_{1,2} = −c/(2m) ± i √( k/m − (c/(2m))² ) .
Equation (4.17) is still applicable. The only difference is that λ_{1,2} now have a real component. Using

e^{(α+iω)t} = e^{αt} e^{iωt}

we see that (4.18) requires only a change of the matrix R, which for the damped oscillator should
read

R(t) = e^{αt} [ cos ωt , −sin ωt ; sin ωt , cos ωt ] .

Here

α = −c/(2m) ,   ω = √( k/m − (c/(2m))² ) .
Recall that the solution of the scalar IVP ẏ = a y, y(0) = y₀, is y(t) = e^{at} y₀, where the scalar
exponential may be expanded in the series

e^{at} = Σ_{k=0}^{∞} a^k t^k / k! .

The matrix exponential could be defined (and in fact this is one of its definitions) analogously as

e^{At} = Σ_{k=0}^{∞} A^k t^k / k! .                            (4.21)
For a general matrix A, evaluating the infinite series would be difficult. Fortunately, for some
special matrices it turns out to be easy. Especially the nice diagonal matrix makes this a breeze: all
the powers D^k of a diagonal matrix D are again diagonal, with entries D_jj^k, and therefore

e^{Dt} = Σ_{k=0}^{∞} D^k t^k / k! = diag( Σ_{k=0}^{∞} D₁₁^k t^k / k! , Σ_{k=0}^{∞} D₂₂^k t^k / k! , ... , Σ_{k=0}^{∞} D_nn^k t^k / k! ) .
This result is easily verified by just multiplying the diagonal matrix through with itself. Finally, we
realize that on the right-hand side we have a matrix with the exponentials e^{D_jj t} on the diagonal:

e^{Dt} = Σ_{k=0}^{∞} D^k t^k / k! = diag( e^{D₁₁ t} , e^{D₂₂ t} , ... , e^{D_{n−1,n−1} t} , e^{D_nn t} ) .   (4.22)
This is very helpful indeed, since we already saw that having a full set of eigenvectors as in equation (4.13) allows us to write the matrix A as similar to a diagonal matrix. Let us substitute the similarity

V⁻¹ A V = Λ ,   A = V Λ V⁻¹

into the definition of the matrix exponential:

e^{At} = Σ_{k=0}^{∞} A^k t^k / k! = Σ_{k=0}^{∞} ( V Λ V⁻¹ )^k t^k / k! .
Now we work out the matrix powers. The zeroth, first, and second are

( V Λ V⁻¹ )⁰ = 1 = V 1 V⁻¹ ,   ( V Λ V⁻¹ )¹ = V Λ V⁻¹ ,
( V Λ V⁻¹ )² = ( V Λ V⁻¹ )( V Λ V⁻¹ ) = V Λ ( V⁻¹ V ) Λ V⁻¹ = V Λ Λ V⁻¹ = V Λ² V⁻¹ ,

and in general

( V Λ V⁻¹ )^k = V Λ^k V⁻¹ .

Therefore

e^{At} = Σ_{k=0}^{∞} V Λ^k V⁻¹ t^k / k! = V ( Σ_{k=0}^{∞} Λ^k t^k / k! ) V⁻¹ = V e^{Λt} V⁻¹ .
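The identity e^{At} = V e^{Λt} V⁻¹ is easy to verify numerically for any diagonalizable matrix. A minimal Python/NumPy sketch (illustrative aside; the random test matrix is our own choice):

```python
import numpy as np
from scipy.linalg import expm

# e^{At} = V e^{Lambda t} V^{-1}; checked on a generic (diagonalizable) matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
lam, V = np.linalg.eig(A)                       # eigenvalues and eigenvectors

t = 0.7
lhs = expm(A * t)                               # general-purpose matrix exponential
rhs = (V @ np.diag(np.exp(lam * t)) @ np.linalg.inv(V)).real
print(np.allclose(lhs, rhs))                    # True
```

The `.real` simply discards the negligible imaginary round-off when the eigenvalues come in complex-conjugate pairs.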
To compute the matrix exponential of the diagonal matrix Λt is easy, so the only thing we need in order to
compute the exponential of At is a full set of eigenvectors of A. (Warning: there are matrices that
do not have a full set of linearly independent eigenvectors. Such matrices are called defective. More
details are discussed in the next section.)
As a matter of fact, we have been using the matrix exponential all along. The solution (4.7)
is of the form V e^{Λt} V⁻¹. In equation (4.18) the matrix R(t) (rotation matrix) is also a matrix
exponential, of a special matrix: the skew-symmetric matrix
S = [ 0, −ω_n ; ω_n, 0 ] = ω_n [ 0, −1 ; 1, 0 ] .
Note that the powers of S have this special structure:

S² = −ω_n² 1 ,   S³ = −ω_n² S ,   S⁴ = ω_n⁴ 1 ,   S⁵ = ω_n⁴ S ,   ... .
Therefore, for the rotation matrix we have

R(t) = e^{St} = Σ_{k=0}^{∞} S^k t^k / k! = 1 t⁰/0! + S t¹/1! − ω_n² 1 t²/2! − ω_n² S t³/3! + ... .
Constructing the infinite matrix series, this gives the correct Taylor expansions for the cosines and
sines of the rotation matrix

R(t) = e^{St} = ( 1 − ω_n²t²/2! + ω_n⁴t⁴/4! − ... ) 1 + ( ω_n t − ω_n³t³/3! + ω_n⁵t⁵/5! − ... ) (1/ω_n) S ,

where the first parenthesis is the expansion of cos ω_n t and the second that of sin ω_n t.
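That e^{St} really is the rotation matrix can be confirmed directly. A minimal Python/SciPy sketch (illustrative aside; the value of ω_n is arbitrary):

```python
import numpy as np
from scipy.linalg import expm

omega_n = 3.7                                       # arbitrary angular frequency
S = omega_n * np.array([[0.0, -1.0], [1.0, 0.0]])   # skew-symmetric matrix

t = 0.25
R = np.array([[np.cos(omega_n * t), -np.sin(omega_n * t)],
              [np.sin(omega_n * t),  np.cos(omega_n * t)]])
print(np.allclose(expm(S * t), R))                  # True
```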
For the critically damped oscillator, substituting λ = −c/(2m) into the eigenvector equation gives

[ c/(2m), 1 ; −k/m, −c/m + c/(2m) ] [ z₁₁ ; z₂₁ ] = [ 0 ; 0 ] .

Further simplifying with k/m = ( c/(2m) )² leads to

[ c/(2m), 1 ; −( c/(2m) )², −c/(2m) ] [ z₁₁ ; z₂₁ ] = [ 0 ; 0 ] ,

which yields the single eigenvector

z₁ = [ 1 ; −c/(2m) ] .
Inconveniently, this is the only eigenvector that we are going to get for the case of the critically
damped oscillator. Since we obtained a double real root, the second eigenvector is exactly the same
as the rst. We say inconveniently, because our approach was developed for an invertible eigenvector
matrix
V = [z 1 , z 2 ]
and it will now fail since both columns of V are the same, and such a matrix is not invertible.
We call matrices that have missing eigenvectors defective. For the critically damped oscillator the
matrix A is defective.
[Figure 4.2: the two real eigenvalues of the nearly critically damped oscillator (ζ = 1.005) plotted in the complex plane (Re λ horizontal, Im λ vertical) on a circle of radius ω_n.]
Let us approach the degenerate case of the critically damped oscillator as the limit of the supercritically damped oscillator whose two eigenvalues approach each other to become one. Figure 4.2
shows a circle of radius equal to ω_n for the data of the IVP (4.1) set to m = 13, k = 6100,
ζ = 1.005 (in other words, close to critical damping). The two (real) eigenvalues are indicated by
small circular markers (the function animated_eigenvalue_diagram illustrates with an animation
how the eigenvalues change in dependence on the amount of damping). For critical damping (ζ = 1.0)
the two eigenvalues would merge on the black circle and become one real eigenvalue (also referred to
as a repeated eigenvalue). As the eigenvalues approach each other, λ₂ → λ₁, the solution may still
be written as

y = c₁ e^{λ₁ t} z₁ + c₂ e^{λ₂ t} z₂ .

In order to understand the behavior of the solution as the eigenvalues approach each other, we can write
the exponential e^{λ₂ t} using the Taylor series with λ₁ as the starting point

e^{λ₂ t} = e^{λ₁ t} + d( e^{λ₂ t} )/dλ₂ |_{λ₁} (λ₂ − λ₁) + ... = e^{λ₁ t} + t e^{λ₁ t} (λ₂ − λ₁) + ... ,

so the two functions that appear in the solution are e^{λ₁ t} and t e^{λ₁ t} .
With essentially the same reasoning we can now look for the missing eigenvector. Write (again
assuming λ₂ → λ₁)

z₂ ≈ z₁ + dz₂/dλ₂ |_{λ₁} (λ₂ − λ₁) .

This allows us to subtract the two eigenvector equations from each other to obtain

(+) A z₂ = λ₂ z₂
(−) A z₁ = λ₁ z₁
―――――――――――――――――――――
A (z₂ − z₁) = (λ₂ z₂ − λ₁ z₁) ,

where we can substitute the difference of the eigenvectors to arrive at

A dz₂/dλ₂ |_{λ₁} (λ₂ − λ₁) = (λ₂ − λ₁) z₁ + λ₂ dz₂/dλ₂ |_{λ₁} (λ₂ − λ₁) ,

and, factoring out (λ₂ − λ₁), finally

A dz₂/dλ₂ |_{λ₁} = z₁ + λ₂ dz₂/dλ₂ |_{λ₁} .
Note that dz₂/dλ₂ has the direction of the difference between the two vectors z₂ and z₁. Since z₂
and z₁ are linearly independent vectors for λ₂ ≠ λ₁, so are the vectors z₁ and dz₂/dλ₂. Therefore,
when λ₂ = λ₁, we can obtain a full set of linearly independent vectors that go with the double root
as the two vectors z₁ and p₂ that solve

A z₁ = λ₁ z₁ ,   A p₂ = z₁ + λ₂ p₂ .                           (4.23)
Here p₂ is not an eigenvector. Rather, it is called a principal vector. To continue with our critically
damped oscillator: we can compute the principal vector from

[ 0, 1 ; −k/m, −c/m ] [ p₁₂ ; p₂₂ ] = [ z₁₁ ; z₂₁ ] + λ₂ [ p₁₂ ; p₂₂ ] ,

or, upon substitution,

[ 0, 1 ; −( c/(2m) )², −c/m ] [ p₁₂ ; p₂₂ ] = [ 1 ; −c/(2m) ] − ( c/(2m) ) [ p₁₂ ; p₂₂ ] ,

or, rearranging the terms,

[ c/(2m), 1 ; −( c/(2m) )², −c/(2m) ] [ p₁₂ ; p₂₂ ] = [ 1 ; −c/(2m) ] .

Since the matrix on the left-hand side is singular, the principal vector is not determined uniquely.
One possible solution is

p₂ = [ p₁₂ ; p₂₂ ] = [ 0 ; 1 ] .
Similarly to the general oscillator eigenproblem (4.10), which could be written in the matrix
form (4.11), we can write here for the critically damped oscillator

A [ z₁, p₂ ] = [ z₁, p₂ ] [ λ₁, 1 ; 0, λ₂ ] ,                  (4.24)

where we introduce the so-called Jordan matrix

J = [ λ₁, 1 ; 0, λ₂ ] = [ λ₁, 1 ; 0, λ₁ ]   (since λ₁ = λ₂) .  (4.25)
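The Jordan-matrix relation is easy to verify numerically. A minimal Python/NumPy sketch for the critically damped oscillator (illustrative aside; data as in Illustration 6 below):

```python
import numpy as np

# Critically damped oscillator: A M = M J with M = [z1, p2] and Jordan matrix J
m, k = 13.0, 6100.0
omega_n = np.sqrt(k / m)
c = 2 * m * omega_n                              # critical damping c_cr
A = np.array([[0.0, 1.0], [-k / m, -c / m]])
lam = -c / (2 * m)                               # double eigenvalue

z1 = np.array([1.0, lam])                        # the single eigenvector
p2 = np.array([0.0, 1.0])                        # one choice of principal vector
M = np.column_stack([z1, p2])
J = np.array([[lam, 1.0], [0.0, lam]])

print(np.allclose(A @ M, M @ J))                 # True
```

The first column checks A z₁ = λ₁ z₁, the second checks A p₂ = z₁ + λ₂ p₂, exactly as in (4.23) and (4.24).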
Illustration 6

Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 1.0 (critical damping),
x₀ = 0, and v₀ = 1.
We shall follow the procedure that leads to the Jordan matrix. The MATLAB solution is based
on the Symbolic Math Toolbox.
The solution to the eigenvalue problem yields a rectangular one-column V. Therefore we solve for
the principal vector p₂, and we form the matrix M = [z₁, p₂].
Illustration 7

Compute the matrix exponential of the Jordan matrix times t,

J t = t [ λ, 1 ; 0, λ ] .

Solution: The matrix can be decomposed as

J t = λt 1 + t N ,   N = [ 0, 1 ; 0, 0 ] .

Because we have

( λt 1 )( t N ) = ( t N )( λt 1 )

(i.e. the matrices commute), it holds for the matrix exponential that

e^{λt 1 + t N} = e^{λt 1} e^{t N} = e^{t N} e^{λt 1} .

The exponential of the diagonal matrix is easy: see equation (4.22). For the matrix N, using the
definition (4.21) we readily get

e^{t N} = Σ_{k=0}^{∞} N^k t^k / k! = 1 + t N ,

because all its powers of order two and higher are zero matrices, N² = 0, and so on. Therefore, we have

e^{Jt} = e^{λt 1} e^{t N} = e^{λt} 1 ( 1 + t N ) = e^{λt} [ 1, t ; 0, 1 ] .
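The closed form just derived can be confirmed against a general-purpose matrix exponential. A minimal Python/SciPy sketch (illustrative aside; λ and t are arbitrary values):

```python
import numpy as np
from scipy.linalg import expm

lam, t = -0.8, 1.3                               # arbitrary eigenvalue and time
J = np.array([[lam, 1.0], [0.0, lam]])           # 2x2 Jordan block

closed_form = np.exp(lam * t) * np.array([[1.0, t], [0.0, 1.0]])
print(np.allclose(expm(J * t), closed_form))     # True
```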
5 Linear Multiple Degree of Freedom Oscillator
Summary
1. For the multiple degree of freedom linear vibrating system we study how to transform between
the second-order and the first-order matrix form. Modal analysis is discussed in detail for both
forms.
2. Modal analysis decouples the equations of the multiple degree of freedom system. The original coupled system may be understood in terms of the individual modal components. Main
idea: whether coupled or uncoupled, the response of the system is determined by the modal
characteristics. Each uncoupled equation evolves as governed by its own eigenvalue.
3. We can analyze a scalar real or complex linear differential equation to gain insight into the
stability behavior. When the equations are coupled, stability is usually decided by the fastest
changing component of the solution (as dictated by the largest eigenvalue). This information is
used to select the time step for direct integration of the equations of motion.
4. The frequency content (spectrum) is a critical piece of information. We use the Fourier transform
and we discuss the Nyquist frequency.
5. The first-order form of the vibrating system equations is used to analyze damped systems.
M ẍ + C ẋ + K x = 0 ,                                          (5.1)

where M is the mass matrix, K is the stiffness matrix, C is the damping matrix, and x is the vector
of displacements. In conjunction with the initial conditions

x(0) = x₀ ,   ẋ(0) = v₀
this will define the multi-degree of freedom (dof) damped oscillator IVP. Using the definition

v = ẋ

will yield the general first-order form of the multi-dof damped oscillator IVP as

ẏ = A y ,   y(0) = y₀ ,                                        (5.2)

where

A = [ 0, 1 ; −M⁻¹K, −M⁻¹C ]

and

y = [ x ; v ] .
The vector variable y collects both the vector of displacements x and the vector of velocities v.
Figure 5.1 shows an example of a multi-degree of freedom oscillator that is physically realized as
three carriages connected by springs and dampers. This will be our sample mechanical system that
will be studied in the following sections.
Fig. 5.1. Linear 3-degree of freedom oscillator: masses m₁, m₂, m₃ with displacements x₁, x₂, x₃, connected by springs and dampers (k₁, c₁), (k₂, c₂), (k₃, c₃)
The mass and stiffness matrices of this system are

M = [ m, 0, 0 ; 0, m, 0 ; 0, 0, m ] ,   K = [ 2k, −k, 0 ; −k, 2k, −k ; 0, −k, k ] .   (5.3)
Similarly to the characteristic equation for the standard eigenvalue problem (3.15) we can write

det( K − ω² M ) = 0 .                                          (5.4)
Illustration 1

For the stiffness and mass matrices given above, the characteristic polynomial is

det( [ 2k, −k, 0 ; −k, 2k, −k ; 0, −k, k ] − ω² [ m, 0, 0 ; 0, m, 0 ; 0, 0, m ] )
  = k³ − 6k²m ω² + 5km² (ω²)² − m³ (ω²)³ .

The eigenvalues ω² are the roots of this polynomial.
Illustration 2

For the stiffness and mass matrices given above, the characteristic equation is

k³ − 6k²m ω² + 5km² (ω²)² − m³ (ω²)³ = 0 .

Find the roots.
A symbolic solution can be delivered by MATLAB, but it is far from tidy. A numerical solution for the data m = 1.3, k = 61, c = 0 results from the roots of

−(2197/1000)(ω²)³ + (10309/20)(ω²)² − (145119/5) ω² + 226981 = 0 ,
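The roots can be confirmed numerically, e.g. with NumPy (an illustrative Python aside paralleling the MATLAB computation):

```python
import numpy as np

m, k = 1.3, 61.0
# Coefficients of -m^3 (w2)^3 + 5 k m^2 (w2)^2 - 6 k^2 m (w2) + k^3, highest power first
coeffs = [-m**3, 5 * k * m**2, -6 * k**2 * m, k**3]
w2 = np.sort(np.roots(coeffs).real)              # the three eigenvalues omega^2
print(w2)                                        # approx. [9.2937, 72.9634, 152.3583]
```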
[Figure: graph of the characteristic cubic polynomial as a function of ω² (horizontal axis roughly 0 to 150); its three zero crossings are the eigenvalues.]
The eigenvalues (and eigenvectors) of the generalized eigenvalue problem are known to be
real for M, K symmetric. Also, when the stiffness matrix is nonsingular, the eigenvalues will be
positive. Hence we write

ω² = −λ² ≥ 0 .

The generalized eigenvalue problem is solved in MATLAB using

[V,D]=eig(K,M);
For the above matrices, the eigenvalues are ω₁² = 9.2937 (i.e. angular frequency ω₁ = 3.0486),
ω₂² = 72.9634 (i.e. angular frequency ω₂ = 8.5419), and ω₃² = 152.3583 (i.e. angular frequency
ω₃ = 12.3433). Therefore, we see that the λ's are all imaginary, λ_j = ±i ω_j. Note that there
are three eigenvalues, but each eigenvalue generates two solutions because of the ± for the square
roots. That is necessary, because six constants are needed to satisfy the initial conditions (two
conditions, each with three equations).
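The generalized eigenvalue computation can be reproduced outside MATLAB as well. A minimal Python/SciPy sketch (illustrative aside; scipy.linalg.eigh plays the role of eig(K,M) for the symmetric pair):

```python
import numpy as np
from scipy.linalg import eigh

m, k = 1.3, 61.0
M = m * np.eye(3)                                # mass matrix
K = k * np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 1.0]])             # stiffness matrix

w2 = eigh(K, M, eigvals_only=True)               # generalized symmetric EVP, ascending
omega = np.sqrt(w2)                              # angular frequencies
print(omega)                                     # approx. [3.0486, 8.5419, 12.3433]
```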
The solutions are therefore found to be both

x = e^{+i ω_j t} z_j   and   x = e^{−i ω_j t} z_j ,

which are complex vectors. The solution however needs to be real. This is easily accomplished by
taking as the solutions a linear combination of the above, for instance

x = Re( e^{+i ω_j t} + e^{−i ω_j t} ) z_j   and   x = Im( e^{+i ω_j t} − e^{−i ω_j t} ) z_j .
From Euler's formula we know that

Re( e^{+i ω_j t} + e^{−i ω_j t} ) = 2 cos ω_j t

and

Im( e^{+i ω_j t} − e^{−i ω_j t} ) = 2 sin ω_j t .

Therefore, we can take as the linearly independent solutions (j = 1, 2, 3)

x = cos ω_j t z_j   and   x = sin ω_j t z_j .
In this way we will obtain enough integration constants to satisfy the initial conditions, since the
general solution may be written as

x = Σ_{j=1}^{3} ( a_j cos ω_j t + b_j sin ω_j t ) z_j .
The undamped mode shapes for our example are shown in Figure 5.2, both graphically as arrows
and numerically as the values of the components.
5.2.2 First-order form

Next we will explore the free vibration of the same system in its first-order form. The system matrix
is (note: no damping)

A = [ 0, 1 ; −M⁻¹K, 0 ] .
The standard eigenvalue problem is solved in MATLAB as
[V,D]=eig(A);
Note that the results for the eigenvalues on the diagonal of D indicate that the eigenvalues are not
ordered from smallest in absolute value to largest, as we would like to see them:

D =
   0+12.34i        0        0        0        0        0
        0   0-12.34i        0        0        0        0
        0        0   0+8.54i        0        0        0
        0        0        0   0-8.54i        0        0
        0        0        0        0   0+3.05i        0
        0        0        0        0        0   0-3.05i
Fig. 5.2. Linear 3-degree of freedom oscillator: second-order model, undamped modes
We can reorder them using the sort function: the first line sorts the diagonal elements by ascending
modulus, the second line re-orders the rows and columns of D and constructs the new D, and the third
line then reorders the columns of V.
[Ignore,ix] = sort(abs(diag(D)));
D =D(ix,ix);
V =V(:,ix);
Here is the reordered D (be sure to compare with the eigenvalues computed in the previous section
for the generalized EP):

D =
   0+3.05i        0        0        0        0        0
        0   0-3.05i        0        0        0        0
        0        0   0+8.54i        0        0        0
        0        0        0   0-8.54i        0        0
        0        0        0        0   0+12.3i        0
        0        0        0        0        0   0-12.3i
and the corresponding eigenvectors as the columns of V (displayed as a 6 × 6 complex matrix, not reproduced in full here).
Note that the eigenvalues come in complex conjugate pairs. The corresponding eigenvectors are
also complex conjugate. Each pair of complex conjugate eigenvalues corresponds to a one-degree of
freedom oscillator with complex-conjugate solutions.
Figure 5.3 illustrates graphically the modes of the A matrix. There are six components to each
eigenvector: the first three elements represent the components of the displacement, and the last
three elements represent the components of the velocity. Therefore, the eigenvectors are visualized
using two arrows at each mass. We use the classical complex-vector (phasor) representation: the
real part is on the horizontal axis, and the imaginary part is on the vertical axis. Note that all
displacement components (green) are purely imaginary, while all the velocity components (red) are
real. An animation of the motion described by a single eigenvector,

x = e^{λ_j t} z_j ,

is also available in the toolkit.

Fig. 5.3. Linear 3-degree of freedom oscillator: first-order model, undamped modes
Figure 5.4 shows the free-vibration response to excitation in the form of the initial condition set
to (the real part of) mode 2. Note that the displacements go through zero at the same time, and
that the amplitude does not change.
Fig. 5.4. Linear 3-degree of freedom oscillator: first-order model, undamped. Free-vibration response to
initial condition in the form of mode 2.
We have made the observation that the eigenvalues and eigenvectors come in complex conjugate
pairs. Each pair of complex conjugate eigenvalues corresponds to a one-degree of freedom oscillator
with complex-conjugate solutions. We have shown in Section 4.3 that all the individual eigenvalue
problems for the 2 × 2 matrix A may be written as one matrix expression

A V = V Λ ,

where each column of V corresponds to one eigenvector, and the eigenvalues are the diagonal elements of the diagonal matrix Λ. So, provided V was invertible, the matrix A was similar to a
diagonal matrix (4.13). Exactly the same transformation may be used no matter what the size of
the matrix A. The 6 × 6 A is also similar to a diagonal matrix
V⁻¹ A V = D

using the matrix of eigenvectors V. Therefore, the original IVP (5.2) may be written in the completely equivalent form

ẇ = D w ,   w(0) = V⁻¹ y₀                                      (5.5)

for the new variables, the modal coordinates, w. Each modal coordinate w_j is independent of the
others, since the matrix D is diagonal.
For a single modal equation ẇ = λ w, w(0) = w₀, the amplification factor of the fourth-order
Runge-Kutta integrator reads

β = 1 + λΔt + (λΔt)²/2 + (λΔt)³/6 + (λΔt)⁴/24 .
The stability diagram is shown in Figure 3.24. The intersection of the imaginary axis with the level
curve |β| = 1 of the amplification factor gives one and only one stable time step for purely oscillatory
solutions. Numerically, we can solve for the corresponding stable time step with fzero as
F=@(dt)(abs(1+(dt*lambda)+(dt*lambda)^2/2+(dt*lambda)^3/6+(dt*lambda)^4/24)-1);
dt =fzero(F, 1.0)
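The same root-finding can be sketched in Python with SciPy's brentq in place of fzero (an illustrative aside; the bracketing interval [0.5, 1.5] is our own choice). For a purely imaginary eigenvalue λ = iω the known RK4 stability limit is Δt = 2√2/ω, which is what the root finder should reproduce:

```python
import numpy as np
from scipy.optimize import brentq

lam = 3.0486j                                    # eigenvalue of mode 2
beta = lambda dt: abs(1 + dt * lam + (dt * lam)**2 / 2
                      + (dt * lam)**3 / 6 + (dt * lam)**4 / 24)

# Solve |beta(dt)| = 1 for the stable time step
dt_stable = brentq(lambda dt: beta(dt) - 1.0, 0.5, 1.5)
print(dt_stable)                                 # approx. 0.9278 (= 2*sqrt(2)/3.0486)
```

The result matches the first entry of the dts list quoted below.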
Integrating with the stable time step leads to an oscillating solution with unchanging amplitude;
using a longer time step yields oscillating solutions with increasing amplitude, and decreasing the
time step leads to oscillations with decaying amplitude. Figure 5.5 was produced by the script
n3_undamped_direct_modal. The modal coordinate w₂ (λ₂ = 3.0486i) was integrated by oderk4
with the stable time step Δt (horizontal curve), a slightly longer time step 1.00001Δt (rising curve), and
a shorter time step Δt/10 (dropping curve), and it is a good illustration of the above derivation.
If we were to numerically integrate the IVP (5.5), i.e. the uncoupled form of the original (5.2),
we could integrate each equation separately from all the others, since in the uncoupled form they
are totally independent. Hence we could also use different time steps for different equations. Let us
say we were to use a conditionally stable integrator such as oderk4. Then for each equation j
we could find a stable time step and integrate w_j with that time step. Of course, to construct the
original solution as y = V w would take additional work: all the w_j would be computed at different
time instants, whereas all the components of y should be known at the same time instants.
Alternately, if we were to integrate the original IVP (5.2) in the coupled form, the uncoupled
modal coordinates w_j would still be present in the solution y, only now they would be mixed
together (coupled) in the variables y_k. Again, let us assume that we need to use a conditionally
stable integrator such as oderk4. However, now we have to use only one time step for all the
components of the solution. It would be in general impossible for purely oscillatory solutions to
See: aetna/utilities/ODE/integrators/oderk4.m
See: aetna/ThreeCarriages/n3_undamped_direct_modal.m
Fig. 5.5. Integration of modal coordinate w₂ (λ₂ = 3.0486i). The real and imaginary part of the solution
(phase-space diagram) on the left, the absolute value of the complex solution on the right. Integrated with the stable
time step Δt (exactly one circle on the left, the horizontal curve on the right), a slightly longer time step
1.00001Δt (increasing radius on the left, rising curve on the right), and a shorter time step Δt/10 (decreasing
radius on the left, dropping curve on the right)
integrate at a time step that was stable for all w_j at the same time. If we cannot integrate all
solution components so that their amplitude of oscillation is conserved, then we would probably
elect to have the amplitudes decay rather than grow. Therefore, we would integrate the coupled IVP
with a time step equal to or shorter than the shortest stable time step. For our example the stable
time step lengths are
dts =
0.9278
0.9278
0.3311
0.3311
0.2291
0.2291
The shortest stable time step (for solution components five and six) is Δt_min ≈ 0.2291. Figure 5.6
shows that running the integrator at the shortest stable time step yields a solution of the original,
coupled, vibrating system which is non-growing (decaying oscillations), because two components
are integrated at their stable time step (and therefore their amplitude is maintained), and the first
four components are integrated below their stable time step and hence their amplitude decays.
Running the integration at just a slightly longer time step than Δt_min means that the first four
components are still integrated below their stable time steps. Their amplitude will still decay. The
last two components are integrated very slightly above their stable time step, which means that the
amplification factor for them is just a tad greater than one. We can clearly see how that can easily
destroy the solution, as we get a sharply growing oscillation amplitude of the coupled solution (on
the right).
5.3.1 Practical use of eigenvalues for integration

The eigenvalues of the matrix A of the IVP (5.2) (sometimes referred to as the spectrum of A) need
to be taken into account when the IVP is to be integrated numerically. We have shown the reasons
for this above, and now we are going to summarize a few practical rules.
If the decoupling of the original system is feasible and cost-effective, each of the resulting independent modal equations can be integrated separately with its own time step. In particular,
exponentially decaying (or growing) solutions may require the time step to be smaller than some
appropriate length for stability. Purely oscillating solutions may also pose a limit on the time
step, depending on the integrator. To achieve stability we need to solve for an appropriate time
step from the amplification factor, as shown for instance above for the fourth-order Runge-Kutta
integrator, or for the Euler integrators in Chapter 3.
Fig. 5.6. Integration of the undamped IVP with the shortest stable time step ∆t_min (non-growing solution
on the left), and with a time step slightly longer than the shortest stable one, 1.002 ∆t_min (growing solution
on the right)
All types of solutions may also require a time step that provides sufficient accuracy. In this respect
we should remember that equations should not be integrated at a time step that is longer than the
stable time step. Therefore we first consider stability, and then, if necessary, we further shorten
the time step length for accuracy. For oscillating solutions, good accuracy is typically achieved if
the time step is less than 1/10 of the period of oscillation. In particular, let us say we obtained a purely
imaginary eigenvalue for the jth mode, λ_j = iω_j. Then the time step for acceptable accuracy
should be
∆t ≤ T_j/10 ,   T_j = 2π/ω_j .
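As a minimal sketch of this rule, the accuracy-driven time step can be computed from the imaginary parts of the eigenvalues. Here we use the angular frequencies of the three-mass example discussed in this chapter (ω ≈ 3.05, 8.54, 12.35 rad/s, i.e. 0.485, 1.359, 1.965 Hz); with other data the vector lam would be replaced accordingly.

```matlab
% Purely imaginary eigenvalue pairs of the undamped three-mass example
lam = [3.05i; -3.05i; 8.54i; -8.54i; 12.35i; -12.35i];
omega = abs(imag(lam));      % angular frequencies omega_j
T = 2*pi./omega;             % periods T_j = 2*pi/omega_j
dt_accuracy = min(T)/10      % time step <= 1/10 of the shortest period, about 0.051
```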
If the equations cannot be decoupled (such as when the cost of solving the complete eigenvalue
problem is too high), the system has to be integrated in its coupled form. Firstly, we should
think about stability. A time step must be chosen that works well for all the eigenvalues and
eigenvectors in the system. That shouldn't be a problem for unconditionally stable integrators:
they would give reasonable answers for any time step length. Unfortunately, there is really only
one such integrator on our list, the trapezoidal rule. For conditionally stable integrators we have
to choose a suitable time step length. In particular, we would most likely try to avoid integrating
at a time step length that would make some of the solution components grow when they should
not grow (oscillating or decaying components). We should therefore choose the smallest of all
the time step limits computed for the individual eigenvector/eigenvalue pairs.
Secondly, the time step is typically assessed with respect to accuracy requirements; this was
discussed above.
More on the topic of time step selection follows in the next two sections, which deal with solutions to
initial boundary value problems.
of the mass 3, which the simulation will give us as a discrete signal. The signal is a sequence of
numbers x_j measured at equally spaced time intervals t_j, such that t_j − t_{j−1} = ∆t.
The sampling interval is a critical quantity. With a given sampling interval length it is only
possible to sample signals faithfully up to a certain frequency. Figure 5.7 shows two signals of different
frequencies sampled with the same sampling interval. Even though the signals have different
frequencies, their sampling produces exactly the same numbers and therefore we would interpret
them as one and the same. This is called aliasing. The so-called Nyquist rate 1/∆t is the minimum
sampling rate required to avoid aliasing, i.e. viewing two very different frequencies as being the same
due to inadequate sampling.
Fig. 5.7. Illustration of the Nyquist rate. Sampling at a rate that is lower than the Nyquist rate for the
signal represented with the dashed line. As far as the information obtained from the sampling is concerned, the
two signals shown in the figure are completely equivalent, even though they have different frequencies.
We can see from Figure 5.8 that the Nyquist rate is twice the frequency
we wish to reproduce faithfully. The highest frequency that is reproduced faithfully at the Nyquist
rate is the Nyquist frequency
f_Ny = (1/2)(1/∆t) ,   (5.6)
where ∆t is the sampling interval. If we sample at an even higher rate (with a smaller sampling
interval), the signal is going to be reproduced much better; on the other hand, sampling slower, below
the Nyquist rate, i.e. with a longer sampling interval, the signal is going to be aliased: we will get
the wrong idea of its frequency.
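Aliasing is easy to demonstrate numerically. In this sketch (the sampling rate and frequencies are chosen arbitrarily for illustration), a 3 Hz cosine and a 7 Hz cosine sampled at 10 Hz produce identical sample sequences, since 7 = 10 − 3 is the alias of 3 Hz at that rate.

```matlab
fs = 10;  dt = 1/fs;        % sampling rate 10 Hz, sampling interval dt
n = 0:19;  t = n*dt;        % 20 sample times
f1 = 3;                     % a 3 Hz cosine ...
f2 = fs - f1;               % ... and its 7 Hz alias
s1 = cos(2*pi*f1*t);
s2 = cos(2*pi*f2*t);
max(abs(s1 - s2))           % identical samples, up to round-off
```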
In order to extract the frequencies that contribute to the response from the measured signal we
perform an FFT analysis. A quick refresher: the discrete Fourier transform (DFT) is expressed
by the formula
A_m = (1/N) Σ_{n=1}^{N} e^{−i 2π(n−1)(m−1)/N} a_n ,   m = 1, ..., N   (5.7)
that links two sets of numbers, the input signal a_n and its Fourier transform coefficients A_m. The
Fast Fourier transform (FFT) is simply a fast way of multiplying with the complex transform matrix,
i.e. of evaluating the sums Σ_{n=1}^{N} e^{−i 2π(n−1)(m−1)/N} a_n.
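Formula (5.7) can be checked directly against MATLAB's fft. A minimal sketch (with a hypothetical random signal a): the loop evaluates the DFT sum term by term and the result agrees with (1/N)*fft(a) to round-off.

```matlab
N = 8;
a = randn(N,1);                            % an arbitrary input signal a_n
A = zeros(N,1);
n = (1:N)';
for m = 1:N                                % direct evaluation of the DFT sum (5.7)
    A(m) = (1/N)*sum(exp(-1i*2*pi*(n-1)*(m-1)/N).*a);
end
max(abs(A - (1/N)*fft(a)))                 % agrees with fft up to round-off
```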
The Fourier transform (Fourier series) of a periodic function x(t) with period T is defined as
x(t) = Σ_{m=−∞}^{∞} X_m e^{i m (2π/T) t} ,   (5.8)
where
[Four panels of sampled signals s(t), at f = 1 f_Ny, f = 1.1 f_Ny, f = 2 f_Ny, and f = 10 f_Ny.]
Fig. 5.8. Illustration of the Nyquist frequency. Frequencies which are lower than the Nyquist frequency are
sampled at a rate higher than twice their frequency.
X_m = (1/T) ∫_0^T x(t) e^{−i m (2π/T) t} dt .   (5.9)
Here ω_0 = 2π/T is the fundamental frequency. The following illustration shows how equation (5.7),
which defines the transformation between the Fourier coefficients and the input discrete signal, can be
obtained from the above expressions for the continuous transform by a numerical approximation of the
integral.
Illustration 3
Consider the possibility that the function x(t) is known only by its values x_j = x(t_j) at equally spaced
time intervals t_j, such that t_j − t_{j−1} = ∆t. Assume the period of the function is an integer number
of the time intervals, T = N ∆t, and the function is periodic between 0 and T. The integral (5.9)
may then be approximated by a Riemann sum
(1/T) ∫_0^T x(t) e^{−i 2π m t/T} dt ≈ (1/T) Σ_{n=1}^{N} x(t_n) e^{−i 2π m t_n/T} ∆t ,
where m = 0, 1, .... After we substitute T = N ∆t, t_n = (n − 1)∆t, and x(t_n) = x_n we obtain
(1/T) ∫_0^T x(t) e^{−i 2π m t/T} dt ≈ (1/(N ∆t)) Σ_{n=1}^{N} x_n e^{−i 2π m (n−1)∆t/(N ∆t)} ∆t
and finally
(1/T) ∫_0^T x(t) e^{−i 2π m t/T} dt ≈ (1/N) Σ_{n=1}^{N} x_n e^{−i 2π m (n−1)/N} .
This is already close to formula (5.7). The remaining difference may be removed by a shift of the
index m: if we replace m with m − 1, so that m = 1, 2, ..., the above changes to
(1/T) ∫_0^T x(t) e^{−i 2π (m−1) t/T} dt ≈ (1/N) Σ_{n=1}^{N} x_n e^{−i 2π (m−1)(n−1)/N} .
As an example of the use of the DFT we will analyze the spectrum of an earthquake acceleration
record to find out which frequencies were represented strongly in the ground motion.
Fig. 5.9. Workspace variables stored in elcentro.mat. The variable desc is the description of the data
stored in the file.
Illustration 4
The earthquake record is from the notorious 1940 El Centro earthquake. The acceleration data is
stored in elcentro.mat (Figure 5.9), and processed by the script dft_example_1. Note that when
the file is loaded as Data=load('elcentro.mat');, the variables stored in the file become fields of
a structure (in this case called Data).
Data=load('elcentro.mat');
dt=Data.delt;% The sampling interval
x=Data.han;% This is the signal: Let us process the North-South acceleration
t=(0:1:length(x)-1)*dt;% The times at which samples were taken
Next the signal is going to be padded to a length which is an integral power of 2, for efficiency. The
product of the complex transform matrix with the signal is carried out by fft.
N = 2^nextpow2(length(x)); % Next power of 2 from length of x
X = (1/N)*fft(x,N);% Now we compute the coefficients X_k
The Nyquist frequency is calculated and used to determine the N/2 frequencies of interest, which
are all frequencies up to the Nyquist frequency (one half of the Nyquist rate).
f_Ny=(1/dt)/2; % This is the Nyquist frequency
f = f_Ny*linspace(0,1,N/2);% These are the frequencies
Because of aliasing there is a symmetry of the computed coefficients, and hence we also take
only one half of the coefficients, X(1:N/2). In order to preserve the energy of the signal we multiply
by two.
absX=2*abs(X(1:N/2)); % Take 2 times one half of the coefficients
Finally, the coefficients are plotted.
plot(f,absX,'Color','r','LineWidth',3,'LineStyle','-','Marker','.'); hold on
xlabel('Frequency f [Hz]'); ylabel('|X(f)|');
[Plot: one-sided amplitude spectrum |X(f)| versus frequency f [Hz], up to 25 Hz.]
We can see that the highest-magnitude accelerations in the north-south direction occur with
frequencies below 5 Hz.
Finally, we are ready to come back to our vibration example. The displacement at the third mass
is the signal to transform.
x=y(:,3);% this is the signal to transform
The computation of the Fourier transform coefficients proceeds as
N = 2^nextpow2(length(x)); % Next power of 2 from length of x
X = (1/N)*fft(x,N);% Now we compute the coefficients X_k
f_Ny=(1/dt)/2; % This is the Nyquist frequency
f = f_Ny*linspace(0,1,N/2);% These are the frequencies
absX=2*abs(X(1:N/2)); % Take 2 times one half of the coefficients
Note that the absolute value of one half of the coefficients (shown in Figure 5.10) is often called the
one-sided amplitude spectrum.
The three frequencies that we may expect to show up correspond to the angular frequencies
above and are 0.485 Hz, 1.359 Hz and 1.965 Hz. As evident from Figure 5.10, the
intermediate frequency, 1.359 Hz, is missing in the FFT. By including only the modes 1,2 and 5,6,
with frequencies 0.485 Hz and 1.965 Hz, in the initial condition, we have excluded the intermediate
two modes from the response. Because they were not excited by the initial condition, the two modes do
not appear in the FFT: they do not contribute to the response of the system at any time.
Next we simulate the forced vibration of the system, with zero initial condition and a sinusoidal
force at the frequency of 3 Hz applied at the mass 3. With the inclusion of forcing the second order
equations of motion are rewritten as
M ẍ = −K x + L ,
where L is the vector of forces applied to the individual masses. Converting this to first order form
results in
[ ẋ ]   [    0     1 ] [ x ]   [   0   ]
[ v̇ ] = [ −M⁻¹K   0 ] [ v ] + [ M⁻¹L ] .
Therefore, we add the forcing into the right-hand side function supplied to the integrator: it now
includes a harmonic force applied to mass 3.
Fig. 5.10. Linear 3-degree of freedom oscillator: first-order model, undamped. Free-vibration response to
initial condition in the form of a mode 1,2,5,6 mixture.
[t,y]=odetrap(@(t,y)A*y+sin(2*pi*3*t)*[0;0;0;0;0;1],...
tspan,y0,odeset('InitialStep',dt));
Again, the measurement of the response (the signal) will be the displacement of the mass 3. The
simulation will give us the displacement x3 as a discrete signal. The FFT analysis on this signal is
shown in Figure 5.11. We can see that now all free-vibration frequencies are present, and of course
the forcing frequency shows up strongly.
Fig. 5.11. Linear 3-degree of freedom oscillator: first-order model, undamped. Forced-vibration response.
The damping matrix is taken as proportional to the stiffness matrix,
     [ 2c  −c   0 ]
C =  [ −c  2c  −c ] ,
     [  0  −c   c ]
where for our particular data c = 3.13. This is an example of the so-called Rayleigh damping.
(In addition to stiffness-proportional damping there is also mass-proportional Rayleigh damping.) The
eigenvalues are now complex, with negative real parts
D = diag( −0.238+3.04i, −0.238−3.04i, −1.87+8.33i, −1.87−8.33i, −3.91+11.7i, −3.91−11.7i ) .
Clearly the system is strongly damped (the real parts of the eigenvalues are quite large in magnitude).
The eigenvectors show that the velocities (the last three components) are no longer phase-shifted
by 90° with respect to the displacements.
Fig. 5.12. Linear 3-degree of freedom oscillator: first-order model, modes for stiffness-proportional damping
Figure 5.13 shows the free-vibration response to excitation in the form of the initial condition set
to (the real part of) mode 2. Note that the displacements go through zero at the same time. This
may also be deduced from Figure 5.12, from the fact that all the displacement arrows for any particular
mode are parallel, which means they all have the same phase shift. Next we repeat the frequency
analysis we've performed for the undamped system previously: we simulate the forced vibration of
the damped system, with zero initial condition and a sinusoidal force at the frequency of 3 Hz applied
at the mass 3. Again, the measurement of the response will be the displacement of the mass 3. The
one-sided amplitude FFT analysis of this signal is shown in Figure 5.14. We can see that not all
free-vibration frequencies are clearly distinguishable, while the forcing frequency shows up strongly.
Fig. 5.13. Linear 3-degree of freedom oscillator: first-order model, stiffness-proportional damping. Free-vibration response to initial condition in the form of mode 2.
Fig. 5.14. Linear 3-degree of freedom oscillator: first-order model, stiffness-proportional damping. Forced-vibration response.
     [ c1  0  0 ]
C =  [ 0   0  0 ] .
     [ 0   0  0 ]
Otherwise the mass and stiffness properties are unchanged.
The eigenvalues are quite interesting. There are two negative real eigenvalues (each
corresponding to an exponentially decaying mode), and two pairs of complex conjugate eigenvalues
corresponding to one-degree of freedom damped oscillators.
D = diag( −2.4, −0.641+3.98i, −0.641−3.98i, −0.254+11.1i, −0.254−11.1i, −21.4 ) .
Correspondingly, the first and last eigenvectors are real, and the rest form complex conjugate pairs.
Fig. 5.15. Linear 3-degree of freedom oscillator: first-order model, non-proportional damping.
Response for initial conditions in the form of mode 6.
Figure 5.16 illustrates graphically the modes of the A matrix. It is noteworthy that the displacements
and velocities for the purely decaying modes are phase-shifted by 180° (they are out of phase).
Figure 5.17 shows the free-vibration response to excitation in the form of the initial condition
set to (the real part of) mode 2. Note that the displacements no longer go through zero at the same
time: they are phase-shifted. This may also be deduced from Figure 5.16, because the displacement
arrows for any particular mode are no longer parallel.
Illustration 5
The dynamics of the system discussed above is to be integrated with the time step ∆t = 0.06 s with
the modified Euler integrator. Determine whether this integrator will be stable.
Fig. 5.16. Linear 3-degree of freedom oscillator: first-order model, modes for non-proportional damping

Fig. 5.17. Linear 3-degree of freedom oscillator: first-order model, non-proportional damping. Free-vibration
response to initial condition in the form of mode 2.

The eigenvalues are diag(D)
lambda=[ -2.4030
-0.6411+3.9785i
-0.6411-3.9785i
-0.2541+11.1142i
-0.2541-11.1142i
-21.4220]
Each eigenvalue needs to be substituted into the amplification factor for the modified Euler (3.30), and its modulus (absolute value) needs to be evaluated. The result is
>> abs(1+dt*lambda+1/2*(dt*lambda).^2)
ans =
0.8662
0.9616
0.9616
1.0063
1.0063
0.5407
Since two of the amplification factors (for the complex-conjugate eigenvalues 4 and 5) are
greater than one in modulus, the integrator is not going to be stable at the given time step: the
contribution of the modes 4 and 5 would grow in time.
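A sketch of how the largest stable step might be located numerically (assuming the vector lambda above and the modified Euler amplification factor (3.30)): shrink ∆t until every amplification factor has modulus at most one.

```matlab
lambda = [-2.4030; -0.6411+3.9785i; -0.6411-3.9785i; ...
          -0.2541+11.1142i; -0.2541-11.1142i; -21.4220];
ampl = @(dt) max(abs(1 + dt*lambda + (dt*lambda).^2/2));
dt = 0.06;                 % start from the unstable step of the illustration
while ampl(dt) > 1         % shrink until all modes are non-growing
    dt = 0.99*dt;
end
dt                         % an estimate of the largest stable time step
```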
Fig. 5.18. Linear 3-degree of freedom oscillator: first-order model, modes for singular-stiffness non-proportional damping
Note the zero eigenvalue: for a singular stiffness the entire matrix A must be singular (consider
whether the first three columns of A can be linearly independent when K has linearly dependent
columns).
D = diag( 0, −0.679+4.31i, −0.679−4.31i, −0.237+11.2i, −0.237−11.2i, −23.8 ) .
Correspondingly, the first and last eigenvectors are real, and the rest form complex conjugate pairs. The
first eigenvector is as expected: all displacements the same, no velocities.
Under these conditions no forces are generated in any of the springs or the damper.
6
Analyzing errors
Summary
1. The basic tool here is the Taylor series. Especially important is the Lagrange remainder term.
2. We use it to reason about order-of estimates (i.e. big-O notation). Main idea: as we control error
in numerical algorithms by decreasing the time step length, the element size, and other control
parameters, towards zero, the first term of the Taylor series that is missing in our model will
dominate the error. We use these ideas to evaluate errors of integrals and to estimate local and
global errors of ODE integrators.
3. Combining order-of error estimates with repeated solutions with different time step lengths allows
us to construct time-adaptive integrators. Main idea: by controlling the local error (estimated
from the Taylor series) we attempt to deliver the solution within a user-given error tolerance.
4. We discuss the approximation of derivatives by the so-called finite difference stencils. Main idea:
the total error has components of a distinct nature, the truncation error and the machine-representation error.
5. The computer represents numbers as collections of bits. Main idea: the machine-representation
error (round-off) is due to the computer being able to store only some values, to which the
results of arithmetic operations must be converted (with the attendant loss of precision).
The Taylor series of the function y(x) at the point x0 is
y(x) = y(x0) + (dy(x0)/dx)(x − x0) + (d²y(x0)/dx²)(x − x0)²/2 + ... .
Its purpose is to approximate the function value at x from the function derivatives at x0 (the function
value may be considered the zeroth derivative). When the above series converges, the Taylor series
becomes a better and better approximation with each additional term. (When the Taylor series of
a given function converges, we call such a function analytic.)
Illustration 1
Warning: The Taylor series need not be convergent. For instance, the function log(1 + x) has a
convergent Taylor series in the interval −1 < x ≤ 1. Outside this interval the Taylor series does not
converge (the more terms are added, the worse the approximation becomes). Try the following code
that uses the taylor MATLAB function.
syms x real
t=taylor(log(1+x),6);
x=linspace(-1,+2,100);
plot (x,log(1+x))
hold on
plot (x,eval(vectorize(t)),'--')
Note the use of vectorize: MATLAB will choke on all those powers of x from the Taylor series
function when x is an array of numbers.
Often it is useful to truncate the Taylor series exactly (that is, to write down a finite number of
the terms, but still preserve the exact meaning). The Lagrange remainder can be used for this
purpose. For instance we can write
y(x) = y(x0) + (dy(x̂)/dx)(x − x0)
to truncate after the first term, or
y(x) = y(x0) + (dy(x0)/dx)(x − x0) + (d²y(x̂)/dx²)(x − x0)²/2
to truncate after the second term. Both truncations are exact (when the Taylor series converges, of
course). The trick is to write the last term (which is the Lagrange remainder) with the derivative taken
at a point x̂ somewhere between x0 and x. The location x̂ is not the same in the two truncations above.
In general, we would write
y(x) = y(x0) + (dy(x0)/dx)(x − x0) + (d²y(x0)/dx²)(x − x0)²/2 + ... + (dⁿy(x0)/dxⁿ)(x − x0)ⁿ/n! + R_n ,
where the Lagrange remainder is
R_n = (dⁿ⁺¹y(x̂)/dxⁿ⁺¹)(x − x0)ⁿ⁺¹/(n + 1)! .   (6.1)
Having reminded ourselves of the basics of Taylor series approximation, we can look at a very
useful tool (terminology, really) to help us with engineering analyses of all kinds.
We say that f(x) is of the order of g(x) as x → 0 if
lim_{x→0} |f(x)| / |g(x)| < M < ∞ ,
where we require g(x) ≠ 0 for x ≠ 0. In words, the absolute values of the two functions are in some
proportion that is of finite magnitude. We write f(x) ∈ O(g(x)) and say "f of x is big O of g of x as x
goes to zero". The meaning of this definition is that |f(x)| decreases towards zero at least as fast
as |g(x)|.
Illustration 2
Example 1: Consider f(x) = 0.1x + 30x², for x > 0. Show that it is of order g(x) = x as x → 0.
We form the fraction and simplify
lim_{x→0} |f(x)|/|g(x)| = lim_{x→0} |0.1x + 30x²|/|x| = lim_{x→0} (0.1x + 30x²)/x = lim_{x→0} (0.1 + 30x) = 0.1 < ∞ .
Conclusion: f(x) = 0.1x + 30x² is of order g(x) = x as x → 0. We say f of x is big O of x, and write
f(x) = 0.1x + 30x² ∈ O(x).
Example 2: Consider f(x) = 0.1x + 30, for x > 0. Show that it is of order g(x) = 1 as x → 0.
We form the fraction and simplify
lim_{x→0} |f(x)|/|g(x)| = lim_{x→0} |0.1x + 30|/|1| = lim_{x→0} (0.1x + 30) = 30 < ∞ .
Conclusion: f(x) = 0.1x + 30 is of order g(x) = 1 as x → 0. We say f of x is big O of one, and write
f(x) = 0.1x + 30 ∈ O(1).
Example 3: Consider f(x) = 0.1x + 30x², for x > 0. Show that f(x) is not of order g(x) = x² as
x → 0.
We form the fraction and simplify
lim_{x→0} |f(x)|/|g(x)| = lim_{x→0} |0.1x + 30x²|/|x²| = lim_{x→0} (0.1x + 30x²)/x² = lim_{x→0} (0.1/x + 30) → ∞ .
Since the limit is not finite, f(x) = 0.1x + 30x² is not of order g(x) = x² as x → 0.
Solution: As all polynomial expressions include the constant term, all of these polynomials are
O(1).
Illustration 4
Estimate the resulting magnitude of the Taylor series sum for t_{j+1} → t_j. Assume that all the
derivatives exist and are finite numbers.
(d²y(t_j)/dt²)(t_{j+1} − t_j)²/2 + (d³y(t_j)/dt³)(t_{j+1} − t_j)³/3! + (d⁴y(t_j)/dt⁴)(t_{j+1} − t_j)⁴/4! + ... .
First of all, the Taylor series is a polynomial in the quantity t_{j+1} − t_j, and this quantity goes to
zero as t_{j+1} → t_j. Therefore, we can introduce the new variable τ = t_{j+1} − t_j and write
(d²y(t_j)/dt²)τ²/2 + (d³y(t_j)/dt³)τ³/3! + (d⁴y(t_j)/dt⁴)τ⁴/4! + ... .
The quantities (d²y(t_j)/dt²)/2!, (d³y(t_j)/dt³)/3!, (d⁴y(t_j)/dt⁴)/4!, ... are just inconsequential coefficients, and we can easily
convince ourselves that
(d²y(t_j)/dt²)τ²/2 + (d³y(t_j)/dt³)τ³/3! + (d⁴y(t_j)/dt⁴)τ⁴/4! + ... ∈ O(τ²)
by evaluating
lim_{τ→0} [ (d²y(t_j)/dt²)τ²/2 + (d³y(t_j)/dt³)τ³/3! + (d⁴y(t_j)/dt⁴)τ⁴/4! + ... ] / τ²
= (d²y(t_j)/dt²)/2 + lim_{τ→0} [ (d³y(t_j)/dt³)τ/3! + (d⁴y(t_j)/dt⁴)τ²/4! + ... ]
= (d²y(t_j)/dt²)/2 < ∞ .
In conclusion, the Taylor series sum is O((t_{j+1} − t_j)²).
We approximate the integral ∫_a^b y(x) dx using the Riemann-sum approximation indicated by the filled rectangles in Figure 6.1. The error of
approximating the actual area between x0 and x0 + h by the rectangle y(x0)h may be estimated by
expressing the Taylor series of y(x) at x0,
y(x) = y(x0) + (dy(x0)/dx)(x − x0) + (d²y(x0)/dx²)(x − x0)²/2 + ...
and integrating the Taylor series, where we can conveniently introduce the change of variables
s = x − x0:
∫_{x0}^{x0+h} y(x) dx = ∫_0^h [ y(x0) + (dy(x0)/dx)s + (d²y(x0)/dx²)s²/2 + ... ] ds .
We obtain
∫_{x0}^{x0+h} y(x) dx = y(x0)h + (dy(x0)/dx)h²/2 + (d²y(x0)/dx²)h³/6 + ... .
Comparing with the approximate area y(x0)h, we express the error as
e = (dy(x0)/dx)h²/2 + (d²y(x0)/dx²)h³/6 + ... ∈ O(h²) .
The integral of y(x) between a and b is approximated as the sum of the areas of
n = (b − a)/h
such rectangles. A pessimistic estimate of the total error magnitude would ignore the possibility of
error canceling, so that the absolute value of the total error could be bounded by the sum of the
absolute values of the errors committed for each subinterval
|E| ≤ Σ_{i=1}^{n} |e_i| = Σ_{i=1}^{n} O(h²) = n O(h²) = ((b − a)/h) O(h²) = O(h) .
Note that when we write the equals sign in the above equation, we don't really mean equality; we
use it rather informally to mean "is". In terms of the order-of analysis, we would write for the
error E of the integral from a to b
E ∈ O(h) .
From the point of view of the user of the Riemann-sum approximation this is good news: the error
can be controlled. By decreasing h (that is, by using more subintervals) we can make the total error
smaller. It would be even nicer if the error were O(h²), since then it would decrease faster as h
was decreased. We demonstrate this as follows: assume that we use twice as many subintervals. For
E ∈ O(h) the error would decrease as
h → h/2 :  E ∈ O(h)  →  E_new ∈ O(h/2) = O(h)/2 ,
so the error decreases by a factor of two. For E ∈ O(h²) the error would decrease as
h → h/2 :  E ∈ O(h²)  →  E_new ∈ O((h/2)²) = O(h²)/4 ,
so the error decreases by a factor of four. The payoff of using twice as many intervals is better
this time.
6.2.3 Error of the Midpoint approximation of integrals
Now we estimate the error of the midpoint approximation of integrals of one variable
using the order-of analysis. For instance, as shown in Figure 6.2, approximate the integral
∫_a^b y(x) dx
using the midpoint approximation indicated by the filled rectangles in the figure. The error of
approximating the actual area between x0 − h/2 and x0 + h/2 by the rectangle y(x0)h may be
estimated by expressing the Taylor series of y(x) at x0,
y(x) = y(x0) + (dy(x0)/dx)(x − x0) + (d²y(x0)/dx²)(x − x0)²/2 + ...
Fig. 6.1. Riemann-sum approximation of the integral of a scalar variable.
and integrating the Taylor series, where we introduce the change of variables s = x − x0:
∫_{x0−h/2}^{x0+h/2} y(x) dx = ∫_{−h/2}^{h/2} [ y(x0) + (dy(x0)/dx)s + (d²y(x0)/dx²)s²/2 + ... ] ds .
We obtain
∫_{x0−h/2}^{x0+h/2} y(x) dx = y(x0)h + (d²y(x0)/dx²)h³/24 + ... .
Importantly, the term with dy(x0)/dx produced one negative contribution (triangle) which canceled
the corresponding positive contribution (triangle), and so this term, with its associated h², dropped
out. Comparing with the approximate area y(x0)h, we express the error as
e = (d²y(x0)/dx²)h³/24 + ... ∈ O(h³) ,
which is one order higher than the error estimated for the Riemann sum. The integral of the function
y(x) between a and b is approximated as a sum of the areas of the n rectangles, and the absolute
value of the total error could be bounded by the sum of the absolute values of the errors committed
for each subinterval
|E| ≤ Σ_{i=1}^{n} |e_i| = Σ_{i=1}^{n} O(h³) = n O(h³) = ((b − a)/h) O(h³) = O(h²) .
The order-of analysis tells us that the error E of the integral from a to b for the midpoint rule is
E ∈ O(h²) ,
and therefore the midpoint rule is more accurate than either of the Riemann-sum rules.
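These orders are easy to confirm numerically. A sketch, with an arbitrary smooth integrand, ∫_0^1 e^x dx: halving h should roughly halve the Riemann-sum error and quarter the midpoint error.

```matlab
f = @(x) exp(x);  exact = exp(1) - 1;     % integral of exp over [0,1]
for n = [100, 200]                        % doubling the number of subintervals
    h = 1/n;  x = (0:n-1)*h;              % left endpoints of the subintervals
    Er = abs(sum(f(x))*h - exact);        % left-rectangle (Riemann-sum) error
    Em = abs(sum(f(x + h/2))*h - exact);  % midpoint error
    fprintf('h=%g  Riemann err=%.3e  midpoint err=%.3e\n', h, Er, Em);
end
% The Riemann error drops by about a factor of 2, the midpoint error by about 4.
```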
Consider the IVP ẏ(t) = f(t), y(0) = y0 .
Fig. 6.2. Midpoint approximation of the integral of a scalar variable.
y(t) = y0 + ∫_0^t f(τ) dτ .
y(0) = 1 .
Each step of the forward Euler algorithm drifts off from the original curve. So we see one solution
curve departing from the starting point (t0, y0), but after one step the forward Euler no longer tries
to follow that curve, but rather the one starting at (t1, y1), and so on. Clearly, here is the potential
for amplifying small errors if the solution curves part company rapidly as time goes on. However,
provided we use time steps which are sufficiently small that the forward Euler does not excessively
amplify these little drifts, we can estimate the error on the entire solution interval (the so-called
global error) from the so-called local errors in each time step.
Fig. 6.3. Forward Euler integration drifting off the original solution path.
Consider the IVP ẏ = f(t, y), y(0) = y0. Expand the true solution in a Taylor series:
y(t_{j+1}) = y(t_j) + (dy(t_j)/dt)(t_{j+1} − t_j) + (d²y(t_j)/dt²)(t_{j+1} − t_j)²/2 + ... .
Here y(t_{j+1}) is the true solution that lies on the solution curve passing through the point (t_j, y_j),
and y_{j+1} is what we get from forward Euler. Now we can substitute from the definition
dy(t_j)/dt = f(t_j, y_j)
to get
y(t_{j+1}) = y(t_j) + f(t_j, y_j)(t_{j+1} − t_j) + (d²y(t_j)/dt²)(t_{j+1} − t_j)²/2 + ...
and then move the first two terms on the right-hand side onto the left-hand side
y(t_{j+1}) − y(t_j) − f(t_j, y_j)(t_{j+1} − t_j) = (d²y(t_j)/dt²)(t_{j+1} − t_j)²/2 + ... .
Finally, the sum of the second and third terms on the left-hand side is −y_{j+1} (recall the forward Euler
update y_{j+1} = y_j + f(t_j, y_j)(t_{j+1} − t_j), with y(t_j) = y_j), and so we obtain the local error
(also called truncation error) in this time step as
y(t_{j+1}) − y_{j+1} = (d²y(t_j)/dt²)(t_{j+1} − t_j)²/2 + (d³y(t_j)/dt³)(t_{j+1} − t_j)³/3! + ... .
Two observations: firstly, the local error is second order in the time step,
y(t_{j+1}) − y_{j+1} ∈ O((t_{j+1} − t_j)²) ,
and secondly, the coefficient of this term is the second derivative at (t_j, y_j), which measures the
curvature of the solution curve at that point. The more the curve curves, the larger the error. If
the solution happens to have zero curvature at (t_j, y_j), then we would predict that the Euler step
should not incur any error. It still might: our prediction neglected all those dots (the higher order
terms) in the Taylor series, but at least for zero curvature the second order term in the error would be
absent. The local error resulted from the truncation of the Taylor series, which is a good explanation
of why it is called the truncation error.
6.3.2 Global error of forward Euler
We have demonstrated above (see Figure 6.3) that the global error, that is, the difference between
the exact analytical solution y(t_n) and the computational solution y_n, is a mixture of two
components. Now we will look at the global error in detail. We will try to estimate the global error at
time t_{n+1}, GE_{n+1}, from the global error GE_n at time t_n: see Figure 6.4. Note that we are thinking
in terms of a scalar differential equation, but the conclusions may be readily generalized to coupled
equations.
The first component of the global error is the local (truncation) error, which is caused by the
truncation of the Taylor series, as explained in the previous section.
The second component is caused by the drift-off in the previous steps of the algorithm: every
step of the integrator will cause the solution to drift off the original curve passing through the
initial condition. Let us consider performing one single step of the numerical integration, from t_n to
t_{n+1}. Two different curves pass through the two points (t_n, y_n) and (t_n, y(t_n)): let us say ỹ(t) passes
through (t_n, y_n), and y(t) passes through (t_n, y(t_n)). The difference between the points (t_n, y_n) and
(t_n, y(t_n)) is the global error at time t_n, GE_n.
The difference between the two curves, y(t_{n+1}) − ỹ(t_{n+1}), at time t_{n+1} measures the propagated
error. We can estimate the propagated error PE_{n+1} as the global error GE_n plus the
increase of the distance between the two curves. The increase can be approximated to first order
from the slopes ỹ'(t_n) = f(t_n, y_n) and y'(t_n) = f(t_n, y(t_n)):
PE_{n+1} ≈ GE_n + ( f(t_n, y(t_n)) − f(t_n, y_n) ) (t_{n+1} − t_n) .
We can also use the Taylor series to expand the right-hand side function f as
f(t_n, y(t_n)) ≈ f(t_n, y_n) + (∂f(t_n, y_n)/∂y)(y(t_n) − y_n)
to obtain
PE_{n+1} ≈ GE_n + (∂f(t_n, y_n)/∂y)(y(t_n) − y_n)(t_{n+1} − t_n) .
y(0) = y0 .
|E| ≤ Σ_{i=1}^{n} |e_i| = Σ_{i=1}^{n} O(∆t²) = n O(∆t²) = (t/∆t) O(∆t²) = O(∆t) .
Thus we see that we lost one order in the error estimate in going from local to global errors. The
forward Euler algorithm is second order locally, but only first order globally.
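A minimal numerical check of the global order (a sketch, using the test equation ẏ = −y, y(0) = 1, whose exact solution is e^{−t}): halving the step roughly halves the global error at t = 1, confirming first-order global convergence.

```matlab
f = @(t,y) -y;  T = 1;  yexact = exp(-T);   % exact solution at t = T
for n = [100, 200]                          % doubling the number of steps
    dt = T/n;  y = 1;
    for j = 1:n                             % forward Euler steps
        y = y + dt*f((j-1)*dt, y);
    end
    fprintf('dt=%g  global error=%.3e\n', dt, abs(y - yexact));
end
% Halving dt roughly halves the global error: first order globally.
```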
Illustration 5
Now we can go back to the graphs of Chapter 2, especially Figure 2.19. The slopes of the error curves on
the log-log scale will now make sense. For the forward Euler we now know that its local error is
second order, but the global error is first order. Figure 2.19 displays the global error, and hence
the slope (i.e. the convergence rate) is one. For the modified Euler the global error is second order;
consequently its local error is cubic in the time step.
Fig. 6.4. Global error of the forward Euler integration. LE_{n+1} = local (truncation) error, PE_{n+1} = propagated error, GE_n = global error at time t_n, GE_{n+1} = global error at time t_{n+1}.
Suggested experiments
1. Estimate from Figure 2.21 the order of the local error of the oderk4 Runge-Kutta integrator.
We expand f(x) about x0 with the Lagrange remainder after the second-order term,
f(x) = f(x0) + (df(x0)/dx)(x − x0) + (d²f(x0)/dx²)(x − x0)²/2! + R2 ,
where
R2 = (d³f(θx + (1 − θ)x0)/dx³)(x − x0)³/3! ,   0 ≤ θ ≤ 1 .
Dividing through by (x − x0) and rearranging, we get
(f(x) − f(x0))/(x − x0) = df(x0)/dx + (d²f(x0)/dx²)(x − x0)/2! + R2/(x − x0) .
When the second-order derivative term and the remainder term are ignored, presumably because they
are much smaller in magnitude than what we keep from the right-hand side expression, we get an
approximation of the derivative as
df(x0)/dx ≈ (f(x) − f(x0))/(x − x0) .   (6.2)
Because of the form of this expression, we call this formula a divided difference. All the formulas
for the approximation of derivatives derived in this section are of this nature.
What we've neglected above is
-\frac{d^2 f(x_0)}{dx^2}\frac{(x - x_0)}{2!} - \frac{R_2}{x - x_0} ,
and we realize that this is the error of the approximation. Unless \frac{d^3 f(\theta x + (1-\theta)x_0)}{dx^3} behaves like 1/(x - x_0), in words unless its magnitude blows up to infinity as x \to x_0, we can estimate the magnitude
of the error using the order-of notation we developed earlier:
\frac{R_2}{x - x_0} \approx O(|x - x_0|^2) .
Since
\frac{d^2 f(x_0)}{dx^2}\frac{(x - x_0)}{2!} \approx O(|x - x_0|) ,   (6.3)
we see that this term will dominate the error as the control parameter, the step along the x axis, x - x_0,
becomes shorter and shorter. The accuracy of the algorithm (6.2) is quite poor, the error being only
O(|x - x_0|). We call this kind of error the truncation error, since it is the result of the truncation of
the Taylor series.
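The first-order estimate is easy to check numerically. The following Python sketch (the test function sin, the point x_0 = 1, and the sequence of steps are arbitrary choices made here for illustration, not taken from the text) halves the step repeatedly and watches the error of the divided difference (6.2) halve along with it:

```python
import math

def forward_difference(f, x0, h):
    # Formula (6.2): the divided-difference approximation of df/dx at x0.
    return (f(x0 + h) - f(x0)) / h

exact = math.cos(1.0)  # derivative of sin at x0 = 1
errors = []
h = 0.1
for _ in range(5):
    errors.append(abs(forward_difference(math.sin, 1.0, h) - exact))
    h /= 2.0

# First-order convergence: halving h should roughly halve the error,
# so successive error ratios approach 2.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(ratios)
```

The printed ratios hover near 2, which is exactly the O(|x - x_0|) behavior predicted by (6.3).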
Illustration 6
Consider a common counterexample where (6.3) is not valid. In Figure 6.5 a piecewise linear function
is shown (solid line) with its derivative (dashed). If we take (6.2) with x_0 to the left of b, for x < b
the formula works perfectly. The second derivative is in fact
\frac{d^2 f(x_0)}{dx^2} = 0 ,
which makes our derivative computation perfect: no error. Now we will make x_0 approach b from
the left arbitrarily closely. The error estimate (6.3) is then no longer valid since at x_0 = b
\frac{d^2 f(x_0)}{dx^2} \to \infty .
This unfortunate behavior is due to the first derivative being discontinuous at b.
Fig. 6.5. Piecewise linear function f(x) (solid line) and its derivative (dashed line).
Now we will consider the approximation formula (6.2) for two cases: x > x_0 and x < x_0. When
x > x_0 we are looking forward with the formula to determine the slope at x_0, hence we get the
forward Euler approximation of the derivative. Let us write h = |x - x_0|. Then the formula (6.2)
may be rewritten in the familiar form
\frac{df(x_0)}{dx} \approx \frac{f(x_0 + h) - f(x_0)}{h} .
On the other hand, when x < x_0 we are looking backward with the formula to determine the
slope at x_0, hence we get the backward Euler approximation of the derivative as this version of the
formula (6.2):
\frac{df(x_0)}{dx} \approx \frac{f(x_0) - f(x_0 - h)}{h} .
Fig. 6.6. Forward and backward Euler approximation of the derivative.
Figure 6.6 illustrates these concepts. The actual derivative is the slope of the green line (tangent
at (x0 , f (x0 ))), which is approximated by the forward Euler algorithm as the slope of the red dashed
line, and by the backward Euler algorithm as the slope of the blue dashed line.
Evidently the figure suggests an improvement on these two algorithms. The green line seems to
have a slope rather close to the average of the slopes of the red and blue lines. (The angles between
the blue and green line and between the red and green line are about the same.) So what happens
if we average those Euler predictions?
\frac{1}{2}\left(\frac{f(x_0 + h) - f(x_0)}{h} + \frac{f(x_0) - f(x_0 - h)}{h}\right) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} .   (6.4)
The above formula defines another algorithm, the centered difference approximation of the derivative. Figure 6.7 shows the dashed green line which represents the centered difference approximation
of the tangent, and we can see that the slopes of the dashed and solid green lines are indeed
quite close. It appears that the centered difference approximation should be more accurate in general, and we can investigate this analytically by averaging not only the approximation formulas, but
the entire expressions including the errors.
The forward difference approximation of the derivative, including the truncation error R_{2,f},
\frac{df(x_0)}{dx} = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{d^2 f(x_0)}{dx^2}\frac{h}{2!} - \frac{R_{2,f}}{h}
Fig. 6.7. Forward and backward Euler and centered difference approximation of the derivative.
is added to the backward difference approximation of the derivative, including the truncation error
R_{2,b},
\frac{df(x_0)}{dx} = \frac{f(x_0 - h) - f(x_0)}{-h} - \frac{d^2 f(x_0)}{dx^2}\frac{(-h)}{2!} - \frac{R_{2,b}}{(-h)}
to result in the expression of the centered difference approximation
\frac{df(x_0)}{dx} = \frac{f(x_0 + h) - f(x_0 - h)}{2h} - \frac{R_{2,f} - R_{2,b}}{2h} .
The truncation error of the centered difference is therefore
\frac{R_{2,f} - R_{2,b}}{2h} \approx O(h^2) .   (6.5)
It is one order higher than the truncation errors of the Euler algorithms (O(h^2) versus O(h)), and
higher is better: the error decreases faster with decreasing h.
The formulas for the numerical approximation of derivatives of functions, forward and backward
Euler, and the centered differences, are called finite difference stencils, and many more, sometimes
with considerably higher accuracy, can be found in the technical literature. The price to pay is
that with higher accuracy one needs more function values around the point x_0.
Illustration 7
We shall now investigate the numerical evidence for these estimates of the truncation error. In the script
compare_conv_driver, x is the point where the derivative is evaluated, n is the number of reductions
of the step, dx0 is the initial step, which is then subsequently reduced by the factor divFactor.
funhand and funderhand are the handles of the function and its derivative (as anonymous MATLAB
functions).
funhand=@(x)2*x^2-1/3*x^3;
funderhand=@(x)4*x-3/3*x^2;
x=1e1;
n= 9;
dx0= 0.3;
divFactor=4;
Fig. 6.8. Forward and backward Euler and centered difference approximation of the derivative. Error versus
the step size.
Figure 6.8 both confirms the expected outcome and presents an unexpected one: the forward
and backward Euler approximations are of the same accuracy, and on the log-log scale their error decreases with a rate
of convergence equal to one, while the centered difference is both more accurate in absolute terms
and its error decreases with a convergence rate of two. What may be unexpected, however, is the
behavior of the centered difference error for very small steps. The error does not decrease anymore;
rather, the opposite occurs.
Shifting x as the point where the derivative is evaluated (change the third line to read
x=1e4;
) gives the results in Figure 6.9. The performance of the numerical differentiation algorithms has now
very much deteriorated, and a decrease in the step size does not necessarily lead to an improvement
in the result, neither in the two Euler derivative approximations, nor in the centered difference
approximation.
The explanation for the behavior described in the Illustration above rests in what is displayed
in the graphs: the graphs present the total error incurred by the numerical algorithm, and this error
is the result of the interplay between the truncation error and the effect of the so-called machine-representation error. The term round-off error is commonly used for this type of error. However,
round-off is only a special case of the broader class of machine-representation errors. Another term
which would be equivalent is computer-arithmetic error, or just arithmetic error. We will sometimes
use "machine-representation" and "arithmetic" error interchangeably.
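This interplay is easy to reproduce. The Python sketch below (the function and evaluation point are stand-ins chosen here for illustration, not taken from the MATLAB script) scans the step size for the centered difference and locates the step at which the total error bottoms out; below that step the arithmetic error, roughly proportional to 1/h, takes over:

```python
import math

def centered_difference(f, x0, h):
    # Formula (6.4): centered difference approximation of df/dx at x0.
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

exact = math.cos(1.0)
steps = [10.0 ** (-k) for k in range(1, 16)]
errors = [abs(centered_difference(math.sin, 1.0, h) - exact) for h in steps]

best = min(errors)
# The smallest achievable total error sits many orders of magnitude above
# machine epsilon, and the tiniest step gives a much worse result.
print(best, errors[-1])
```

The minimum of the total error marks the step below which reducing h buys nothing: the round-off contribution grows as fast as the truncation contribution shrinks.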
Fig. 6.9. Forward and backward Euler and centered difference approximation of the derivative. As Figure 6.8,
but the point of evaluation is shifted towards a much bigger number, x=1e4.
the power of two, similarly to what we're used to with decimal numbers. For instance, the decimal
number 13 = 1·10^1 + 3·10^0 can be written in the binary system as 13 = 1·2^3 + 1·2^2 + 0·2^1 + 1·2^0.
Hence its binary representation is 1101. We can use the MATLAB function dec2bin:
>> dec2bin(13)
ans =
1101
The largest number we can store in a byte (more precisely in an unsigned byte) is 255, viz
>> dec2bin(255)
ans =
11111111
since in that case all the bits are toggled to 1. If we wish to represent signed numbers, we must
reserve one bit for the storage of the sign (positive or negative). Then we have only seven bits for
the storage of the actual pattern of 0s and 1s. The largest number that seems to be available then is
>> bin2dec('1111111')
ans =
127
However, by some clever manipulation it is possible to squeeze out one more number out of the eight
bits, and so we get as the algebraically smallest and largest integers using the MATLAB functions
intmin and intmax
>> intmin('int8')
ans =
-128
>> intmax('int8')
ans =
127
The clever trick is called the two's complement representation, and the bits represent numbers as
shown here
00000000=0
00000001=1
00000010=2
00000011=3
...
01111111=127
11111111=-1
11111110=-2
11111101=-3
...
10000000=-128
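The table can be cross-checked with a short Python sketch; int8_from_bits is a hypothetical helper written for this illustration, which gives the leading bit the weight -2^7 and all other bits their usual positive weights:

```python
def int8_from_bits(bits):
    # Interpret an 8-character '0'/'1' string, most significant bit first,
    # in two's complement: the top bit carries the weight -2**7.
    assert len(bits) == 8
    value = -128 * int(bits[0])
    for i in range(1, 8):
        value += int(bits[i]) * 2 ** (7 - i)
    return value

print(int8_from_bits('00000011'))  # 3
print(int8_from_bits('01111111'))  # 127
print(int8_from_bits('11111111'))  # -1
print(int8_from_bits('10000000'))  # -128
```

Notice how the single negative weight of the top bit reproduces every entry of the table, including the extra number -128 squeezed out of the eight bits.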
The argument 'int8' denotes the so-called integer type, and there are four signed and four unsigned
varieties in MATLAB (with 8, 16, 32, and 64 bits). As an example, here are the smallest and largest
unsigned 64-bit integers
>> intmin('uint64')
ans =
0
>> intmax('uint64')
ans =
18446744073709551615
Integers are nice to work with, and they are very useful for instance as counters in loops. If we're
not careful, bad things can happen though. Take the following code fragment: first we create the
variable a as an 8-bit integer zero with int8
>> a= int8(0);
and then we increment it 1000 times by one. The result is a bit unexpected, perhaps:
for i=1: 1000
a=a+1;
end
a
a =
127
What happened? Overflow! When the variable reached the largest value that can be stored in a
variable of this type, it stopped increasing: the variable overflowed.
6.5.2 Floating-point data types
The floating-point numbers are represented with values for the so-called mantissa M and exponent
E, stored in bits essentially as described above, as
M*2^E
The basic datatype in MATLAB is a floating-point number stored in 64 bits, the so-called double. The
machine representation for this number is standardized, as described in the ANSI/IEEE Standard
754-1985, Standard for Binary Floating Point Arithmetic. The exponent and the mantissa are stored
as patterns of bits, which may be numbered from 0 to 63, left to right. The first bit is
the sign bit, S, the next eleven bits are the exponent bits, E, and the final 52 bits are the mantissa
bits M:
S EEEEEEEEEEE MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
0 1         11 12                                                63
The value V represented by the 64-bit word may be determined by an algorithm (how else?):
1. If E=2047
a) If M is nonzero, then V=NaN (Not a Number)
b) Else (M==0)
i. If S is 1, then V=-Inf
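A Python sketch of the complete decoding rule (the normalized and denormalized cases included; the helper name is mine, and struct is used only to reach the raw 64 bits):

```python
import struct

def decode_double(x):
    # Recover S, E, M from the 64-bit pattern of a double and re-evaluate
    # the value V by the IEEE 754 rules: specials for E=2047, denormalized
    # numbers for E=0, normalized numbers otherwise.
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    S = bits >> 63
    E = (bits >> 52) & 0x7FF
    M = bits & ((1 << 52) - 1)
    sign = -1.0 if S else 1.0
    if E == 2047:
        return float('nan') if M else sign * float('inf')
    if E == 0:
        # Denormalized: no implicit leading 1, exponent fixed at -1022.
        return sign * (M / 2.0 ** 52) * 2.0 ** (-1022)
    return sign * (1.0 + M / 2.0 ** 52) * 2.0 ** (E - 1023)

print(decode_double(1.0), decode_double(-2.5), decode_double(float('inf')))
```

Round-tripping any double through decode_double reproduces it exactly, which is a handy way to convince oneself of the bit layout.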
For S=0, E=1023, and all the bits of the mantissa M set to zero, the value is V=1.0. Then the next closest
number to 1.0 that can be represented in the computer is 1.0 + 2^-52. Between these two values is a
gap where no numbers live. It is a tiny gap, so it may not bother us too much, but consider what
happens as the exponent 2^(E-1023) gets bigger. The MATLAB function eps computes the size of
the gap next to any particular double value. For instance, for a value representative of the Young's
modulus in some units the gap will get bigger
>> eps(3e11)
ans =
6.103515625000000e-005
For the distance across the Milky Way, the gap already amounts to (in meters)
>> eps(100000 *9.46e15)
ans =
131072
and the distance to the outermost object in the universe can be recorded in the computer with a
double value only with precision amounting to tens of millions of kilometers
>> eps(18e9 *9.46e15)
ans =
3.435973836800000e+010
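Python's standard library exposes the same gap as math.ulp (available in Python 3.9 and later), so the MATLAB numbers above can be cross-checked:

```python
import math

# Gap between 1.0 and the next representable double: 2^-52.
print(math.ulp(1.0))
# Gap next to a Young's-modulus-sized value; compare MATLAB's eps(3e11).
print(math.ulp(3e11))
# Gap next to the Milky Way distance in meters; compare eps(100000*9.46e15).
print(math.ulp(100000 * 9.46e15))
```

The second and third values agree exactly with the MATLAB outputs above, since both functions report the spacing of the same IEEE 754 doubles.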
The gap between adjacent numbers represented in the computer is called the machine epsilon, and
it is an important quantity, especially when we consider round-off. As an example, consider the
addition of two numbers: watch what happens to the trailing digits.
>> pi
ans =
3.141592653589793
>> 1+pi
ans =
4.141592653589793
>> 1e6+pi
ans =
1.000003141592654e+006
In the second example, the disparate magnitudes led to truncation of the previously significant digits.
Addition results in a loss of significance when the numbers are disparate. For subtraction, the
dangerous situation occurs when the two numbers are close. Consider this example:
>> 3.14159265-pi
ans =
-3.589792907376932e-009
The computer essentially made up the trailing digits. (To check this we go to the Web: a lot
of digits of π are available online: 3.14159265358979323846264338....) This problem is referred
to as loss of significance, and it is one of the most deleterious aspects of computer arithmetic.
Another example illustrates the so-called catastrophic cancellation of terms, for both addition
and subtraction. We may consider the expressions below equivalent, but it evidently matters how
the MATLAB engine interprets the expression we typed in:
>> (1 +2e-38)-(1 -2e-38)
ans =
0
>> (1 +2e-38-1 +2e-38)
ans =
2.000000000000000e-038
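The same experiment can be repeated in Python, where the effect is identical: floating-point addition is not associative, so the grouping decides whether the tiny terms survive:

```python
# 1 + 2e-38 rounds to exactly 1.0 (2e-38 is far below the gap next to 1.0),
# so the two parenthesized sums are both 1.0 and their difference cancels
# catastrophically to zero:
a = (1 + 2e-38) - (1 - 2e-38)

# Evaluated left to right, the ones cancel first and the tiny terms are
# added afterwards, so they survive:
b = 1 + 2e-38 - 1 + 2e-38

print(a, b)  # 0.0 2e-38
```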
2. The range in which floating-point values are represented is sparsely populated: there are gaps
between numbers (the so-called machine epsilon), which increase with the magnitude of the
number. The machine epsilon essentially limits the largest number of significant digits that we
can expect (15 for double, 6 for single).
3. Operations on the computer-represented numbers rarely lead to exact answers, and especially
addition and subtraction can prove devastating to our budget of significant digits (overflow,
underflow, round-off, cancellation of terms, ...). Consequently the number of significant digits is
usually much less than the number of digits in the computer printouts.
The arithmetic error of the centered difference approximation may be estimated as
E_R \approx \frac{\epsilon(f(x_0))}{2h} .
Here \epsilon(f(x_0)) is the machine epsilon at the real number f(x_0). We can see that the arithmetic error
E_R increases in inverse proportion to the step size h. Also, we see that the error increases with the
magnitude of the numbers to be subtracted, f(x_0), since the machine epsilon depends on it.
Now let us go back to Figure 6.8. The total error displayed in Figure 6.8 is the sum of the
truncation error and the arithmetic error. The descending branches of the errors are dominated by
the truncation error, either O(h) (slope +1) or O(h^2) (slope +2). In the climbing branch of the
total error the arithmetic error dominates. The dependence of the arithmetic error on 1/h = h^{-1}
can be clearly detected in the slope -1 of the climbing branch of the total error of the derivative in
Figure 6.9.
Note well that while talking about the total error we disregard avoidable errors of the nature
of bugs and mistakes. Sadly, these errors are sometimes present, but their unpredictable nature
makes them very difficult to discuss in general.
Illustration 8
The report "DCS Upgrades for Nuclear Power Plants: Saving Money and Reducing Risk through
Virtual-Stimulation Control System Checkout" by G. McKim, M. Yeager and C. Weirich from 2011
states on page 5, when discussing a software simulator of a nuclear power plant subsystem: "Here
was the first surprise. The emulated Bailey response in Figure 5 didn't show this rate limiting. The
controller output traveled as fast as 12% per second. This led to a line-by-line examination of the
FORTRAN source code for the Bailey emulation, whereupon it was discovered that, contrary to
belief, the rate limiting was not included in the simulation."
This is an example of a software bug: the feature that was supposed to be programmed was
either never implemented or was implemented and later deleted.
Illustration 9
The Deepwater Horizon Accident Investigation Report from September 8, 2010 states on page 64:
"The 13.97 ppg interval at 18,200 ft. was included in the OptiCem model report as the reservoir
zone. The investigation team was unable to clarify why this pressure (13.97 ppg) was used in the
model since available log data measured the main reservoir pressure to be 12.6 ppg at the equivalent
depth. Use of the higher pressure would tend to increase the predicted gas flow potential. The same
OptiCem report refers to a 14.01 ppg zone at 17,700 ft. (which, in fact, should be 14.1 ppg: the
actual pressure measured using the GeoTap logging-while-drilling tool)." (Emphasis is mine.)
These two instances are illustrations of an input error (mistake of the operator). Undoubtedly
important, but they are outside of the scope of error control that numerical methods can exercise
and therefore will not be discussed in this book.
7
Solution of systems of equations
Summary
1. We discuss a couple of representative methods for the solution of a scalar nonlinear equation.
Main idea: Newton's and the bisection method are complementary with respect to the rate of
convergence and robustness.
2. Newton's method (in one of its several variants) is a crucial building block in nonlinear analysis
of structures, where systems of coupled nonlinear equations need to be solved repeatedly. Main
idea: efficient solvers for systems of coupled linear equations are critical to the success of
Newton's method.
3. We present solutions of systems of coupled linear equations that fall under the class of factorizations, using the
examples of the LU and QR decompositions. Main idea: factorizations provide critical infrastructure to a variety of numerical algorithms, especially Newton-like solvers of nonlinear equations
and eigenvalue problem solvers.
4. Errors produced by factorization algorithms depend on the so-called condition number. Main
idea: condition numbers are related to eigenvalues.
The backward Euler method advances the solution by the formula
y_{j+1} = y_j + (t_{j+1} - t_j) f(t_{j+1}, y_{j+1}) .   (7.1)
In general, for an arbitrary right-hand side function f this will require the solution of a nonlinear
algebraic equation to obtain y_{j+1}. For convenience we will define the function of the unknown y_{j+1}
F(y_{j+1}) = y_{j+1} - y_j - (t_{j+1} - t_j) f(t_{j+1}, y_{j+1}) .
The solution y^{(*)} of the equation
F(y^{(*)}) = 0
is the sought y_{j+1}.
We attempt to find the solution to F(y^{(*)}) = 0 by first guessing where the root may be, y^{(0)}, and
then using the Taylor series expansion of F at y^{(0)}
F(y^{(*)}) = F(y^{(0)}) + \frac{dF(y^{(0)})}{dy_{j+1}} (y^{(*)} - y^{(0)}) + R_1 = 0 .
The term
\frac{dF(y^{(0)})}{dy_{j+1}}
is referred to as the Jacobian. Provided the remainder R_1 is negligible compared to the other terms,
we can write approximately
F(y^{(0)}) + \frac{dF(y^{(0)})}{dy_{j+1}} (y^{(*)} - y^{(0)}) \approx 0 ,   (7.2)
which may be solved for an improved guess of the root
y^{(1)} = y^{(0)} - \left(\frac{dF(y^{(0)})}{dy_{j+1}}\right)^{-1} F(y^{(0)}) .
Thus we arrive at Newton's algorithm for finding the solution of a nonlinear algebraic equation:
guess the starting point of the iteration, y^{(0)}, as close to the expected root y^{(*)} as
possible. Then repeat until the error (in some measure to be determined) drops below an acceptable
tolerance:
y^{(k)} = y^{(k-1)} - \left(\frac{dF(y^{(k-1)})}{dy_{j+1}}\right)^{-1} F(y^{(k-1)})   (7.3)
if error e^{(k)} < tolerance, break; otherwise go on
k = k + 1 and repeat from the top
The error could be measured as the difference between the successive iterations
e^{(k)} = |y^{(k)} - y^{(k-1)}| ,
or by comparing the value of the function F with zero
e^{(k)} = |F(y^{(k)})| .
Or, convergence can be decided by looking at some composite of the above errors; for instance, the
iteration could be considered converged when either of these errors drops below a certain tolerance.
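A minimal Python sketch of the loop (7.3) for a scalar equation, with the two error measures combined as just discussed (the function names here are mine, not aetna's):

```python
def newton_scalar(F, dFdx, x0, tol=1e-12, maxiter=25):
    # The iteration (7.3) for a scalar equation F(x) = 0, stopping when
    # either the step or the residual drops below the tolerance.
    x = x0
    for _ in range(maxiter):
        step = F(x) / dFdx(x)
        x = x - step
        if abs(step) < tol or abs(F(x)) < tol:
            return x
    raise RuntimeError('Newton iteration did not converge')

# Solve 0.5 + (x - 1)^3 = 0, the example used later in this chapter.
root = newton_scalar(lambda x: 0.5 + (x - 1) ** 3,
                     lambda x: 3 * (x - 1) ** 2,
                     2.6)
print(root)
```

The root is 1 - 0.5^(1/3), and the iteration reaches it in a handful of steps from the guess 2.6.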
Illustration 1
How do we apply Newton's algorithm to solve the nonlinear equation that defines a single step
of the backward Euler algorithm?
To advance the solution we have to solve F(y_{j+1}) = y_{j+1} - y_j - (t_{j+1} - t_j) f(t_{j+1}, y_{j+1}) = 0. The
only difficulty may be presented by the derivative of the function f, which we need in order to compute
\frac{dF(y^{(k-1)})}{dy_{j+1}} = 1 - (t_{j+1} - t_j) \frac{\partial f(t_{j+1}, y^{(k-1)})}{\partial y_{j+1}} .
This turns out to be really easy for the simple function f of a linear ODE with a constant coefficient,
f(t, y) = \lambda y ,
for which the Jacobian is
\frac{dF(y^{(k-1)})}{dy_{j+1}} = 1 - (t_{j+1} - t_j)\lambda .
The Newton algorithm gives the solution in one iteration step as
y^{(*)} = y^{(1)} = y^{(0)} - \left(\frac{dF(y^{(0)})}{dy_{j+1}}\right)^{-1} F(y^{(0)}) = \frac{y_j}{1 - (t_{j+1} - t_j)\lambda} .
For this special right-hand side function it works out precisely as we would expect from the definition
of the backward Euler method. For general right-hand side functions f the solution will require
several iterations of the algorithm, until some tolerance is reached as discussed above.
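For the linear right-hand side the single-step property can be verified directly; in this Python sketch lam and dt are arbitrary test values chosen here:

```python
def backward_euler_step_newton(yj, dt, lam, y0):
    # One Newton iteration for F(y) = y - yj - dt*lam*y = 0,
    # with the Jacobian dF/dy = 1 - dt*lam, starting from the guess y0.
    F = y0 - yj - dt * lam * y0
    dFdy = 1.0 - dt * lam
    return y0 - F / dFdy

yj, dt, lam = 2.0, 0.1, -3.0
# Whatever the starting guess, a single step lands on yj / (1 - dt*lam):
print(backward_euler_step_newton(yj, dt, lam, yj))
print(backward_euler_step_newton(yj, dt, lam, 100.0))
print(yj / (1.0 - dt * lam))
```

Because F is linear in y, one Newton step is exact regardless of the starting guess, which is what the derivation above predicts.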
Concerning the implementation of the backward Euler time integration in MATLAB: either we
have to provide not only a function for the right-hand side f but also its derivative \partial f(\cdot, y)/\partial y, or the
software must make do without the derivative. Fortunately, we realize that numerical differentiation
could be used, and we have developed some approaches in the previous chapter.
7.1.1 Convergence rate of Newton's method
Write the Taylor series for the scalar function F(x), but this time keep the remainder
F(x^{(*)}) = F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (x^{(*)} - x^{(k)}) + R_1 ,
where
R_1 = \frac{d^2 F(\xi)}{dx^2} \frac{(x^{(*)} - x^{(k)})^2}{2!} , \quad \xi \in [x^{(k)}, x^{(*)}] .
Since x^{(*)} is the root, F(x^{(*)}) = 0, and therefore
F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (x^{(*)} - x^{(k)}) + R_1 = 0 .   (7.4)
Newton's algorithm would use the above equation, neglect the remainder R_1, and thus obtain
an estimate of the root x^{(k+1)} from
F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (x^{(k+1)} - x^{(k)}) = 0 .   (7.5)
We now define the errors of the successive iterates, E_k = x^{(*)} - x^{(k)} and E_{k+1} = x^{(*)} - x^{(k+1)},
and these may be substituted both in Equation (7.4) and in the expression for the remainder.
Thus (7.4) may be written in terms of the errors as
F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} E_k + \frac{d^2 F(\xi)}{dx^2} \frac{E_k^2}{2!} = 0 .   (7.6)
Similarly, (7.5) written in terms of the errors reads
F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (E_k - E_{k+1}) = 0 .   (7.7)
Subtracting (7.7) from (7.6) yields
E_{k+1} = -\left(\frac{dF(x^{(k)})}{dx}\right)^{-1} \frac{d^2 F(\xi)}{dx^2} \frac{E_k^2}{2!} .   (7.8)
We say that Newton's method attains a quadratic convergence rate, because the error in the
current iteration is proportional to the square of the error in the previous iteration (and this is good:
assuming the error is going to be small, the square of a small number is even smaller).
Illustration 2
We shall solve the equation f(x) = 0.5 + (x - 1)^3 = 0 with Newton's method. The solver used
is the aetna implementation of Newton's method, newt. The approximate errors in the seven
iterations required for convergence to machine precision are

Iteration   Approximate Error
1           0.806299474015900
2           0.338070307349234
3           0.090929677656663
4           0.009026273904915
5           0.000101115651663
6           0.000000012879718
7           0.000000000000000
A good rule of thumb is that the number of zeros behind the decimal point of the error doubles with
each iteration. That is excellent convergence indeed.
Figure 7.1 illustrates the formula (7.8). We plot the approximate errors E_{k+1} versus E_k as
plot(e(1:end-2),e(2:end-1),'ro-')% e = approximate errors
Clearly the data resembles a parabolic arc, exactly as predicted by the formula. Re-plotting on a log-log
scale (Figure 7.2) as
loglog(e(1:end-2),e(2:end-1),'ro-')
confirms the relationship between E_{k+1} and E_k. It is quadratic, since the slope on the log-log plot
is very close to 2.
Fig. 7.1. Approximate errors E_{k+1} plotted versus E_k for @(x)(0.5+(x-1)^3), x0=2.6, on the linear scale and on the log-log scale.
F (x)
x3
x1
x1
x0 x2
x3
x0
x2
Fig. 7.2. Failure of Newton's method due to divergence (left), and successful convergence upon the selection
of the initial guess closer to the root (right).
Fig. 7.3. Failure of Newton's method: first it gets stuck next to a false root (a maximum), then the iterations
blast off to infinity.
Fig. 7.4. Failure of Newton's method: if the initial guess of the root is not sufficiently close, the method does not
find the root that was intended.
See: aetna/NonlinearEquations/bisect.m
Fig. 7.5. Approximate errors E_{k+1} versus E_k for the bisection method, on the linear and the log-log scale.
Figure 7.6 is a good comparison of the typical convergence properties of Newton's and the
bisection method. Evidently the bisection method requires many more iterations than Newton's
method. When each evaluation of the function is expensive, the quicker converging method wins.
When the robustness of bisection is required (such as when Newton's would not converge), the
slower method is preferable. Wouldn't it make sense to combine such disparate methods and switch
between them as needed? That is how the MATLAB fzero function works. (Find out from the
documentation which methods are combined in fzero.)
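A bare-bones Python sketch of the bisection idea (not a transcription of aetna's bisect.m), applied to the same example equation:

```python
def bisect(F, a, b, tol=1e-12):
    # Classic bisection: keep halving the bracket [a, b] as long as the
    # function changes sign across it.
    Fa = F(a)
    assert Fa * F(b) < 0, 'the root must be bracketed'
    while b - a > tol:
        m = (a + b) / 2.0
        if Fa * F(m) <= 0:
            b = m            # sign change in [a, m]
        else:
            a, Fa = m, F(m)  # sign change in [m, b]
    return (a + b) / 2.0

root = bisect(lambda x: 0.5 + (x - 1) ** 3, 0.0, 2.6)
print(root)
```

Each pass halves the bracket, so the error decreases by a fixed factor of two per iteration: robust, but far slower than the quadratic convergence of Newton's method.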
Fig. 7.6. Comparison of the convergence of the bisection method (dashed line) and Newton's method
(solid line).
For a system of coupled nonlinear equations F(y) = 0 we proceed analogously: expand F at the guess y^{(0)},
F(y^{(*)}) = F(y^{(0)}) + \frac{dF(y^{(0)})}{dy} (y^{(*)} - y^{(0)}) + R_1 = 0 .
Provided the remainder R_1 is negligible compared to the other terms, we can write approximately
F(y^{(0)}) + \frac{dF(y^{(0)})}{dy} (y^{(*)} - y^{(0)}) \approx 0 ,   (7.9)
(7.9)
which at rst sight looks exactly like (7.2). There must be a dierence here, however, as we are
dealing with a system of equations. What do we mean by
dF (y (0) ) ()
(y y (0) ) ?
dy
(7.10)
The expression (7.9) holds for each component (row) of the vector (column matrix) separately. The
components of the vector function F and of the argument y may be written as
[F (.)]r ,
[y]c .
Each of the components [F (.)]r is a function of all the components [y]c . Therefore, equation (7.9)
in components must have the meaning
[F (y (0) )]r +
[F (y (0) )]r
[y () y (0) ]c 0 ,
[y]
c
c=1:n
i.e. in words: the change in the component [F (y (0) )]r is due to the change of this component in the
direction of each of the c components of the argument [y]c , which is expressed by the rst term of
the Taylor series. Thus we see that left-hand side of (7.9) is the sum of two vectors, F (y (0) ) and the
vector
dF (y (0) ) ()
(y y (0) ) ,
dy
which is the product of a square matrix
dF (y (0) )
dy
dF (y (0) )
and the vector (y () y (0) ). The matrix
dy
The matrix \frac{dF(y^{(0)})}{dy} is called the Jacobian matrix. The Newton iteration for the system then repeats the update
y^{(k)} = y^{(k-1)} - \left(\frac{dF(y^{(k-1)})}{dy}\right)^{-1} F(y^{(k-1)}) ,   (7.11)
with k = k + 1 after each pass, until the error (in some measure to be determined) drops below an acceptable tolerance. In general it
is a good idea not to invert a matrix if we can help it. Rewriting the Newton algorithm with the Jacobian matrix denoted
J(y^{(k-1)}) = \frac{dF(y^{(k-1)})}{dy} ,   (7.12)
each iteration instead solves the linear system J(y^{(k-1)}) \delta y = -F(y^{(k-1)}) and updates y^{(k)} = y^{(k-1)} + \delta y.
As an example, consider the two coupled nonlinear equations
f(x, y) = (x^2 + 3y^2)/2 - 2 = 0 ,
g(x, y) = xy + 3/4 = 0 .
The two expressions f(x, y) and g(x, y) may be interpreted as surfaces raised above the x, y plane.
Setting these to zero is equivalent to forcing the points that satisfy these equations, individually, to
lie on the level curves of the surfaces. The solution of the two equations being satisfied simultaneously
corresponds to the intersection of the level curves. The figures of the surfaces were produced by the
script two_surfaces.
The solution will be attempted with the Newton method. The vector argument is
y = \begin{bmatrix} x \\ y \end{bmatrix}
and the vector function is
F(y) = \begin{bmatrix} f(x, y) \\ g(x, y) \end{bmatrix} .
The components of the Jacobian matrix are
J_{11} = \frac{\partial f(x, y)}{\partial x} = x , \quad J_{12} = \frac{\partial f(x, y)}{\partial y} = 3y , \quad J_{21} = \frac{\partial g(x, y)}{\partial x} = y , \quad J_{22} = \frac{\partial g(x, y)}{\partial y} = x .
The MATLAB code defines both the vector function and the Jacobian matrix as anonymous functions.
F=@(x,y) [((x.^2 + 3*y.^2)/2 -2); (x.*y +3/4)];
J=@(x,y) [x, 3*y; y, x];
With these functions at hand it is easy to carry out the iteration interactively, step-by-step. For
instance, guessing
w0= [-0.5;0.5];
we update the solution as
>> w=w0-J(w0(1),w0(2))\F(w0(1),w0(2))
w =
-0.5000
1.5000
For the next iteration, we reset the variable w0
>> w0=w;
and repeat the solution
w=w0-J(w0(1),w0(2))\F(w0(1),w0(2))
w =
-0.6154
1.1538
We can watch the differences between the successive iterations getting smaller. With four iterations
we get five decimal digits converged.
w =
-0.6923
1.0833
This point is one of the four possible solutions (level-curve intersections). To get a different
solution we need to start with a different guess w0, for instance w0= [-2;0.5];.
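The interactive session above can be scripted compactly. Here is a Python sketch with the same F, Jacobian, and starting guess, using Cramer's rule for the 2-by-2 linear solve so that no external libraries are needed:

```python
def solve2x2(J, r):
    # Cramer's rule for the 2x2 linear system J d = r.
    (a, b), (c, d) = J
    det = a * d - b * c
    return ((r[0] * d - b * r[1]) / det, (a * r[1] - r[0] * c) / det)

def F(x, y):
    return ((x ** 2 + 3 * y ** 2) / 2 - 2, x * y + 3 / 4)

def J(x, y):
    return ((x, 3 * y), (y, x))

x, y = -0.5, 0.5  # the starting guess w0
for _ in range(10):
    dx, dy = solve2x2(J(x, y), F(x, y))
    x, y = x - dx, y - dy

print(x, y)  # one of the four level-curve intersections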
One column of the Jacobian matrix at a time may be approximated numerically by perturbing the
argument in the direction of a single coordinate: compute
{}_cF = F(z + h_c e_c) ,
where
[e_c]_m = 1 for c = m , [e_c]_m = 0 otherwise,
and h_c is a suitably small number (not too small: let us not forget the effect of computer arithmetic).
The Jacobian matrix is approximated by the computed column vectors as
\frac{\partial F(z)}{\partial z} \approx \left[ ({}_1F - F)/h_1 , ({}_2F - F)/h_2 , \ldots , ({}_nF - F)/h_n \right] .   (7.13)
In these columns we recognize numerical approximations of derivatives of the vector function F
(divided differences).
One can recognize in Newton's method with the numerical approximation of the Jacobian a
variation of the so-called secant method.
Illustration 4
In the script Jacobian_example we compare the analytically determined Jacobian matrix with its
numerical approximation. The vector function is taken as
F(z) = \begin{bmatrix} z_1^2 + 2 z_1 z_2 + z_2^2 \\ z_1 z_2 \end{bmatrix} .
The Jacobian matrix is evaluated at z_1 = -0.23, z_2 = 0.6 as
>> F=@(z)[z(1)^2+2*z(1)*z(2)+z(2)^2;...
z(1)*z(2)];
dFdz=@(z)[2*z(1)+2*z(2),2*z(1)+2*z(2);...
z(2),
z(1)];
zbar = [-0.23;0.6];
>> Jac =dFdz(zbar)
Jac =
    0.7400    0.7400
    0.6000   -0.2300
Evaluating the function at the base point and using the step size of 0.1
>> Fbar =F(zbar);
h=1e-1;
we obtain the approximate (numerically dierentiated) Jacobian matrix
>> Jac_approx =[(F(zbar+[1;0]*h)-Fbar)/h, (F(zbar+[0;1]*h)-Fbar)/h]
Jac_approx =
    0.8400    0.8400
    0.6000   -0.2300
We may note that the second row is in fact exact. (Why?) On the other hand, the Jacobian matrix
will not be evaluated exactly in any component for the second example in Jacobian_example. Check
it out.
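The comparison is easy to reproduce outside MATLAB. The Python sketch below builds the approximation (7.13) column by column for the same vector function, base point, and step:

```python
def F(z):
    return (z[0] ** 2 + 2 * z[0] * z[1] + z[1] ** 2, z[0] * z[1])

def numerical_jacobian(F, z, h):
    # Formula (7.13): perturb one component of z at a time and collect
    # the divided-difference columns.
    Fz = F(z)
    cols = []
    for c in range(len(z)):
        zp = list(z)
        zp[c] += h
        Fp = F(zp)
        cols.append([(Fp[r] - Fz[r]) / h for r in range(len(Fz))])
    # Transpose the list of columns into a list of rows.
    return [[cols[c][r] for c in range(len(z))] for r in range(len(Fz))]

Japprox = numerical_jacobian(F, [-0.23, 0.6], 0.1)
print(Japprox)  # compare with the analytic [[0.74, 0.74], [0.6, -0.23]]
```

The second row comes out exact because the second component of F is linear in each argument, so the divided difference is the derivative with no truncation error.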
Fig. 7.7. Cable structure configuration. Dashed line: schematic of the connections. Filled dots indicate
supported joints.
\Delta_{x,2} = Y_3 - Y_1 , \quad \Delta_{y,2} = Y_4 - Y_2 , \quad L_2 = \sqrt{\Delta_{x,2}^2 + \Delta_{y,2}^2} ,
where Y_3, Y_4 are the coordinates after deformation of joint 2. Together Y_1, Y_2, Y_3, Y_4 constitute the
unknowns in the problem. Finally, for the third cable running into joint 1 we have the relative stretch
\frac{L_1 - L_{10}}{L_{10}}
(and analogously for the other cables), which relates the relative stretch of the cable to the axial
force. This is based on the assumption that the stretches are small compared to 1.0 and therefore
the stresses are small compared to the elastic modulus, and this assumption is verified in the present
problem. In general it is a good idea to verify that the assumptions that go into a model are
reasonable by backtracking from the results. For instance, in the current problem we would find the
locations of the joints, and from those we would compute the forces and stresses. If the stresses in
the cables were well below the yield stress (or negligibly small with respect to the elastic modulus),
our assumption would have been verified.
Working out in detail just the first term in the first equation gives us an idea of the complexity
of the resulting equations. We do get an appreciation for the tedium associated with computing
derivatives of such terms with respect to the unknowns Y_1, Y_2, Y_3, Y_4 to construct the Jacobian
matrix analytically.
(The computed tensile stresses in the five cables are approximately σ1 = 449, σ2 = 357, σ3 = 414, σ4 = 205, σ5 = 208.)
Fig. 7.8. Cable structure configuration. Dashed line: schematic of the connections. Filled dots indicate
supported joints. Thick solid line: actual configuration of the prestressed structure. Tensile stresses are
indicated.
As a final note, we shall point out that MATLAB comes with its own sophisticated function for
the numerical evaluation of the Jacobian matrix, numjac. The pieces of code that would need to
be changed with respect to our implementation are the computation of the residual (the function
needs to accept additional arguments)
function R=Force_residual(Ignore1,Y,varargin)
y(1,:) =Y(1:2);
y(2,:) =Y(3:4);
F =zeros(size(p,1),2);
for j=1:size(conn,1)
L=Length(j);
N(j)=E*A(j)*(L-Initial_L(j))/L;
F(conn(j,1),:) =F(conn(j,1),:) +N(j)*Delta(j)/L;
F(conn(j,2),:) =F(conn(j,2),:) -N(j)*Delta(j)/L;
end
R =[F(1,:);F(2,:)];
end
and the evaluation of the numerical Jacobian in the Newton loop (there are a few additional
arguments to pass)
Y=[y(1,:);y(2,:)];% Initialize deformed configuration
for iteration = 1: maximum_iteration % Newton loop
R=Force_residual(0,Y);% Compute residual
[dRdy] = numjac(@Force_residual,0,...
Y,R,Y/1e3,[],0);% Compute Jacobian
dY=-dRdy\R;% Solve for correction
if norm(dY,inf)<AbsTol % Check convergence
y(1,:) =Y(1:2);% Converged
y(2,:) =Y(3:4);
R=Force_residual(0,Y);% update the forces
sigma =N./A;% Stress
return;
end
Y=Y+dY;% Update configuration
end
error('Not converged')% bummer :(
We can easily check that the two implementations of the computation give identical results.
In summary, Newton's method, in its several variants and refinements, has a special place among
the mainstream methods for solving systems of nonlinear algebraic equations in engineering applications. One of the building blocks of this class of algorithms is a solver for repeatedly solving a
system of linear algebraic equations. This is the topic we will take up in the following sections.
7.3 LU factorization
Consider a system of linear algebraic equations
Ax = b
with a square matrix A. It is possible to factorize the matrix into the product of a lower triangular
matrix and an upper triangular matrix
A = LU
The triangular matrices are not determined uniquely. Here we will consider the variant where the
lower triangular matrix L has ones on the diagonal.
7.3.1 Forward and backward substitution
What is the value of the LU factorization? It derives from the efficiency with which a system with
a triangular matrix can be solved. For instance, consider the system
L y = b ,
where L is lower triangular (non-zeros are indicated by the black dots, the zeros are not shown). In
the first row of L there is only one nonzero, L_{11}. Therefore we can solve immediately for y_1. Next,
y_1 may be substituted into the second equation, from which we can solve for y_2, and so on. Since
we are solving for the unknowns in the order of their indexes, 1, 2, 3, ..., n, we call this the forward
substitution.
function c=fwsubs(L,b)
[n m] = size(L);
if n ~= m, error('Matrix must be square!'); end
c=zeros(n,1);
c(1)=b(1)/L(1,1);
for i=2:n
c(i)=(b(i)-L(i,1:i-1)*c(1:i-1))/L(i,i);
end
end
Similarly, consider the system

Ux = c ,

where U is upper triangular. In the last row of U there is only one nonzero, Unn. Therefore we can solve immediately for xn. Next, xn may be substituted into the last but one equation, from which we can solve for xn−1, and so on. Since we are solving for the unknowns in the reverse order of their indexes, n, n−1, n−2, ..., 2, 1, we call this the backward substitution.12
function x=bwsubs(U,c)
[n m] = size(U);
if n ~= m, error('Matrix must be square!'); end
x=zeros(n,1);
x(n)=c(n)/U(n,n);
for i=n-1:-1:1
x(i)=(c(i)-U(i,i+1:n)*x(i+1:n))/U(i,i);
end
end
11 See: aetna/LUFactorization/fwsubs.m
12 See: aetna/LUFactorization/bwsubs.m
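For readers who prefer to experiment outside MATLAB, the two routines translate directly; this is an illustrative Python sketch with list-of-lists matrices (the function names mirror the aetna files, but the code is not part of the toolbox):

```python
def fwsubs(L, b):
    """Forward substitution: solve L y = b for lower-triangular L,
    computing y1, y2, ..., yn in the order of their indexes."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y

def bwsubs(U, c):
    """Backward substitution: solve U x = c for upper-triangular U,
    computing xn, ..., x2, x1 in reverse order."""
    n = len(c)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (c[i] - s) / U[i][i]
    return x

# Two-step solve of A x = b with A = L U
L = [[1.0, 0.0], [0.5, 1.0]]
U = [[2.0, 1.0], [0.0, 3.0]]
b = [4.0, 8.0]
y = fwsubs(L, b)   # step one: L y = b
x = bwsubs(U, y)   # step two: U x = y, so x solves (L U) x = b
```

The two calls at the end are precisely the two-step solve discussed next.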
And so we come to the punchline: provided we can factorize a general matrix A into the triangular factors, we can solve the system Ax = b in two steps. Write

Ax = LUx = L(Ux) = Ly = b ,  where y = Ux .
Step one, solve for y from
Ly = b .
And step two, solve for x from
Ux = y .
Both solution steps can be done very efficiently since the matrices involved are triangular. This is handy in many situations where the right-hand side b will change several times while the matrix A stays the same. For instance, here is how we compute the inverse of a general square matrix A:
write the definition of the inverse

A A⁻¹ = 1

column-by-column as

A c_k(A⁻¹) = c_k(1) .

Here by c_k(A⁻¹) we mean the kth column of A⁻¹, and by c_k(1) we mean the kth column of the identity matrix. So if we successively set the right-hand side vector to b = c_k(1), k = 1, 2, ..., and solve Ax = b, we obtain the columns of the inverse matrix as c_k(A⁻¹) = x.
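A small Python sketch of this column-by-column idea (illustrative only; a 2x2 Cramer solve stands in for the repeated forward/backward substitutions with fixed factors):

```python
def solve2(A, b):
    """Solve the 2x2 system A x = b (a stand-in for an LU-based solve)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

A = [[4.0, 3.0], [6.0, 3.0]]
# Columns of the inverse: solve A x = e_k for each column e_k of the identity
inv_cols = [solve2(A, [1.0, 0.0]), solve2(A, [0.0, 1.0])]
```

With an actual LU factorization, the factors would be computed once and reused for every column of the identity.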
7.3.2 Factorization
The crucial question is: how do we compute the factors? LU factorization can be easily explained
by reference to the well-known Gaussian elimination. We shall start with an example:
A =
0.796, 0.7448, 0.1201, 0.0905
0.3649, …
0.0186, 0.093, 1.204, 0.0012
0.1734, 0.6695, 0.0653, 0.4113
First we will change the numbers below the diagonal in the first column to zeros. Gaussian elimination does this by replacing a row in which a zero should be introduced, let us say row j, by a combination of the row j and the so-called pivot row. Thus a zero will be introduced in the element (2,1) by subtracting (0.3649)/(0.796) times row 1 from row 2 to obtain
0.796, 0.7448, 0.1201, 0.0905
0, 1.558, 0.2884, 0.5034
0.0186, 0.093, 1.204, 0.0012
0.1734, 0.6695, 0.0653, 0.4113
The element (1,1) (the number 0.796) is called a pivot. Evidently, the success of the proceedings is going to rely on the pivot being different from zero (not only strictly different from zero, but sufficiently different: it shouldn't be too small compared to the other numbers in the same column). The manipulation described above can be executed by the following code fragment
i=1;
A(2,i:end) =A(2,i:end)-A(2,i)/A(i,i)*A(i,i:end)
Importantly, the same can also be written as a result of a matrix-matrix multiplication by the so-called elimination matrix

E(2,1) =
1, 0, 0, 0
-0.4584, 1, 0, 0
0, 0, 1, 0
0, 0, 0, 1
The elimination matrices are easily computed in MATLAB as13
function E =elim_matrix(A,i,j)
E =eye(size(A));
E(i,j) =-A(i,j)/A(j,j);
end
We can readily verify that the element (2,1) of A can be eliminated (zeroed out) by multiplying

E(2,1) A =
0.796, 0.7448, 0.1201, 0.0905
0, 1.558, 0.2884, 0.5034
0.0186, 0.093, 1.204, 0.0012
0.1734, 0.6695, 0.0653, 0.4113
Next we will change 0.0186 to a zero. Again, we will do this with an elimination matrix, and note well that we will be working with the above right-hand side matrix, not the original A. So we will construct

E(3,1) =
1, 0, 0, 0
0, 1, 0, 0
-0.02337, 0, 1, 0
0, 0, 0, 1

and compute

E(3,1) E(2,1) A =
0.796, 0.7448, 0.1201, 0.0905
0, 1.558, 0.2884, 0.5034
0, 0.1104, 1.201, 0.003315
0.1734, 0.6695, 0.0653, 0.4113
And so on: eliminating the non-zeros in the first column is accomplished by the sequence

E(4,1) E(3,1) E(2,1) A =
0.796, 0.7448, 0.1201, 0.0905
0, 1.558, 0.2884, 0.5034
0, 0.1104, 1.201, 0.003315
0, 0.5073, 0.03914, 0.431
Now we start working on the second column. Note again that we are working with the matrix E(4,1) E(3,1) E(2,1) A, not the elements of the original matrix. Thus 0.07087 = (0.1104/1.558), and the elimination matrix to put a zero in the element (3,2) reads

E(3,2) =
1, 0, 0, 0
0, 1, 0, 0
0, -0.07087, 1, 0
0, 0, 0, 1
Finally, we apply the elimination matrix to the element (4,3) and the entire Gaussian elimination sequence will read

E(4,3) E(4,2) E(3,2) E(4,1) E(3,1) E(2,1) A =
0.796, 0.7448, 0.1201, 0.0905
0, 1.558, 0.2884, 0.5034
0, 0, 1.18, 0.03899
0, 0, 0, 0.2627
We recall that we wish to construct the factorization A = LU, which means that the above matrix on the right is U and consequently

L⁻¹ = E(4,3) E(4,2) E(3,2) E(4,1) E(3,1) E(2,1) .

So now we have the matrix U and the inverse of L. Fortunately, L is obtained very easily. Not by inverting the above product, but rather by inverting each of the terms separately

L = E(2,1)⁻¹ E(3,1)⁻¹ E(4,1)⁻¹ E(3,2)⁻¹ E(4,2)⁻¹ E(4,3)⁻¹ .
For instance, to invert E(2,1) we realize that the effect of the matrix multiplication in the product E(2,1) A is to make the second row of the result the sum of a multiple of the first row and the second row. Therefore, to multiply with the inverse of E(2,1) is to undo this operation, that is, to subtract a multiple of the first row from the second row. The inverse of E(2,1) also has ones on the diagonal; the only change is that the off-diagonal element changes its sign (we want subtraction instead of addition)

E(2,1)⁻¹ = 2·1 − E(2,1) .
The same reasoning applies to the other elimination matrices. Now we only have to figure out the product of the inverses of the elimination matrices. Take for instance the product E(2,1)⁻¹ E(3,1)⁻¹:

E(2,1)⁻¹ E(3,1)⁻¹ =

1, 0, 0, 0          1, 0, 0, 0
0.4584, 1, 0, 0     0, 1, 0, 0
0, 0, 1, 0     ×    0.02337, 0, 1, 0
0, 0, 0, 1          0, 0, 0, 1

=
1, 0, 0, 0
0.4584, 1, 0, 0
0.02337, 0, 1, 0
0, 0, 0, 1
The pattern is clear: each matrix in the product will simply copy its only nonzero off-diagonal element into the same location in the resulting matrix. Thus we have

L =
1, 0, 0, 0
0.4584, 1, 0, 0
0.02337, 0.07087, 1, 0
0.2178, 0.3256, 0.1127, 1
The entire elimination process for our given matrix can be expressed as a series of matrix multiplications

E21=elim_matrix(A,2,1)
E31=elim_matrix(E21*A,3,1)
E41=elim_matrix(E31*E21*A,4,1)
E32=elim_matrix(E41*E31*E21*A,3,2)
E42=elim_matrix(E32*E41*E31*E21*A,4,2)
E43=elim_matrix(E42*E32*E41*E31*E21*A,4,3)
U =E43*E42*E32*E41*E31*E21*A
Inefficient, but correct. In reality the elimination is usually done in-place. The upper triangle and the diagonal of A store the matrix U, and the lower triangle (below the diagonal) of A stores the matrix L (we do not store the diagonal, since we know that the diagonal of L consists of ones). naivelu4 is one of the naive implementations of the LU factorization in aetna.14
14 See: aetna/LUFactorization/naivelu4.m
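The in-place storage scheme can be sketched in a few lines of Python (the function naive_lu is our illustrative analogue, not the aetna code; no pivoting):

```python
def naive_lu(A):
    """Naive in-place LU factorization without pivoting: after the call,
    the upper triangle of A holds U and the strictly lower triangle holds
    the multipliers of L (the unit diagonal of L is implied)."""
    n = len(A)
    for k in range(n - 1):
        for i in range(k + 1, n):
            A[i][k] = A[i][k] / A[k][k]          # multiplier, stored where the zero would go
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]     # update the trailing submatrix
    return A

A = [[2.0, 1.0], [1.0, 3.5]]
naive_lu(A)
# A now holds both factors: U = [[2,1],[0,3]] above, L multiplier 0.5 below
```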
A =
…
0.7891, 0.236, 0.007259, 0.4891
0.09073, 0.6998, 0.9637, 0.9205
Compute the factorization using this command
[L,U,P]=lu(A)
with the result
L =
1, 0, 0, 0
0.2287, 1, 0, 0
0.5897, 0.04327, 1, 0
0.115, 0.7778, 0.8597, 1

P =
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
0, 0, 0, 1
and

U =
…
0, 0.8648, 0.3228, 0.5833
0, 0, 0.828, 0.478
0, 0, 0, 0.0003876

The meaning of the output is that

LU = P A .
The matrix P permutes (switches) the rows of the matrix A. That is the actual pivoting. Note that the permutation matrix has an interesting inverse: it is its own transpose (the permutation matrix is orthogonal). Therefore we can write the above as

Pᵀ L U = A .

The matrix

Pᵀ L =
0.5897, 0.04327, 1, 0
0.2287, 1, 0, 0
1, 0, 0, 0
0.115, 0.7778, 0.8597, 1
is the so-called psychologically lower triangular matrix. Such a matrix would be returned if we called
lu with only two output arguments
[L,U]=lu(A)
How do we use the three output matrices? Symbolically, we can write the way in which we use the LU factorization (A = LU) as (we do not actually use inverses, we use forward and backward substitution!)

y = L⁻¹ b ,  x = U⁻¹ y

or

x = U⁻¹ (L⁻¹ b) .

When pivoting is used, we have rather A = Pᵀ L U, so that we are solving

y = (Pᵀ L)⁻¹ b = L⁻¹ (P b) ,  x = U⁻¹ y

or

x = U⁻¹ (L⁻¹ (P b)) .

In MATLAB syntax, we write

x=U\(L\(P*b));
In other words, the LU factorization is used as before, except that the rows of the right-hand side
vector are reordered (permuted) by P .
A more efficient approach to working with the LU factorization when pivoting is applied is to compute the so-called permutation vector

[L,U,p]=lu(A,'vector')
The permutation vector is p = [3, 2, 1, 4]. We can see that it correlates with the position of the 1s
in the rows of the permutation matrix. The permutation vector is used for multiple right-hand sides
as
x=U\(L\b(p));
which is a shorthand for
y=L\b(p); x=U\y;
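The roles of the factors and of the permutation vector can be sketched in Python as follows (illustrative only: plu and solve_plu are our own names, imitating what [L,U,p]=lu(A,'vector') returns, with both triangular factors packed into one array):

```python
def plu(A):
    """LU factorization with partial pivoting, in place.
    Returns (LU, p): LU stores U above and the L multipliers below the
    diagonal, and p is the permutation vector (row i of LU corresponds
    to original row p[i])."""
    n = len(A)
    p = list(range(n))
    for k in range(n - 1):
        # choose the largest pivot in column k (partial pivoting)
        m = max(range(k, n), key=lambda i: abs(A[i][k]))
        if m != k:
            A[k], A[m] = A[m], A[k]
            p[k], p[m] = p[m], p[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A, p

def solve_plu(LU, p, b):
    """Permute b with p, then forward and backward substitution."""
    n = len(b)
    pb = [b[i] for i in p]                      # b(p) in MATLAB notation
    y = [0.0] * n
    for i in range(n):
        y[i] = pb[i] - sum(LU[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x

A = [[0.0, 2.0], [1.0, 1.0]]   # needs pivoting: the (1,1) entry is zero
LU, p = plu(A)
x = solve_plu(LU, p, [2.0, 2.0])
```

Only the right-hand side is permuted at solve time; the factors themselves are reused unchanged for every new b.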
and our sums, we conclude that the required factorization time is

t_LU = C ( n³/3 − n²/2 ) .
In Chapter 6 we have seen the big-O notation used as a means of describing how a function value decreases as the argument decreases towards zero. Here we introduce the opposite viewpoint: the notation can also be used to express how quickly a function value grows. As we discussed, the big-O notation typically expresses how complicated functions behave in terms of a simple monomial (say x²). When measuring how quickly a function value decreases, the low powers dominate; contrariwise, when we measure how quickly a function value grows, the high powers dominate.
Illustration 5
Consider the simple function f(x) = x² + 30000x. Use the big-O notation to describe its behavior as x → 0 and as x → ∞.
As x → 0 the function value decrease is dominated by the linear term (30000x), as it drops in magnitude much more slowly than the square. On the contrary, the square term grows much faster than the linear term as x → ∞. Therefore we conclude that f(x) = O(x) as x → 0 and that f(x) = O(x²) as x → ∞.
The big-O notation is often used in computer science to express how quickly the cost of an algorithm grows as the number of quantities to be processed grows. For instance, nice algorithms are those that grow linearly or logarithmically: computing the mean of a vector of length n is an operation of O(n), and the FFT is an operation of O(n log n). Not-so-nice algorithms may be very expensive for large n; for instance, a naive discrete Fourier transform (the slow version of the FFT) is O(n²). Much more expensive than the FFT!
The LU factorization is one of the more computationally intensive algorithms. Based on the expression that includes both a cubic term and a quadratic term we conclude that for sufficiently large n we should write t_LU = O(n³). Rather costly!
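The cubic growth can be checked by literally counting the innermost multiply-add updates of the elimination; a small Python sketch (this particular count is exactly (n−1)n(2n−1)/6 = n³/3 − n²/2 + n/6):

```python
def lu_op_count(n):
    """Count the innermost multiply-add updates in the naive LU elimination.
    The exact count is (n-1)n(2n-1)/6 = n^3/3 - n^2/2 + n/6, i.e. O(n^3)."""
    count = 0
    for k in range(n - 1):          # columns being eliminated
        for i in range(k + 1, n):   # rows below the pivot
            for j in range(k + 1, n):
                count += 1          # one update of the trailing submatrix
    return count
```

Doubling n multiplies the count by roughly eight, as a cubic cost should.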
Illustration 6
Figure 7.9 shows the results of a numerical experiment. The MATLAB LU factorization is run for a sequence of variously sized matrices, and the factorization time is recorded.

t = [];
for n = 10:10:600
A=rand(n);
tic;
for Trial = 1:1000
[L,U,p]=lu(A,'vector');
end
t(end+1) =toc
end
The curve of required CPU time per factorization illustrates our estimate: first the time grows more slowly than predicted, but asymptotically it appears to approach a straight line with slope 3, which corresponds to a cubic dependence on the number of equations.
Fig. 7.9. Factorization time [s] versus matrix size n, on a log-log scale, with a reference triangle of slope 1:3.
In a similar way, we can show that the time for forward or backward substitution is going to grow as O(n²). This is good news, since for many right-hand sides the time is only going to grow as quickly as the factorization itself. For instance, to compute a matrix inverse we need to solve n times an n × n system of linear algebraic equations. If we use LU factorization with forward and backward substitution, it will take

O(n³) + n O(n²) = O(n³)

time, where the first term is the factorization and the second term is the n forward/backward substitutions. If we use just plain Gaussian elimination for each solve, it will take

n O(n³) = O(n⁴) .

A much more quickly growing cost!
Illustration 7
The cost estimate t_LU = C O(n³) can be put to good use in guessing the time that it may take to factorize larger matrices. From Figure 7.9 we can read off that on this particular computer a 400 × 400 matrix takes about one hundredth of a second:

t_LU,400 = C O(400³) = 0.01 s .

Therefore we can express the time constant as

C = 0.01 s / O(400³) ,

and the prediction for a 3000 × 3000 matrix is

t_LU,3000 = ( 0.01 s / O(400³) ) O(3000³) = 4.2 s .
Running the calculation we find 2.35 s. This is a substantial difference with respect to the prediction. First, the measurement of 0.01 s is likely to be substantially in error, as it is difficult to measure the execution times of computations that conclude very quickly; there are just too many confounding factors in the software (think of all the operating system overhead) and hardware. Second, our estimate was based on the cubic term, but we know there is also a quadratic term, and that was not taken into account. The matrix may not be large enough for the asymptotic big-O estimate to work based on the largest term only.
Furthermore, let us say we want to use the second measurement, t_LU,3000 = 2.35 s, to predict the factorization time for a 30,000 × 30,000 matrix. If we had a computer with enough memory to accommodate a matrix of this size, our prediction would be that the factorization time would go up by a factor of 1000 = 10³ with respect to the time measured for the 3000 × 3000 matrix, so about 40 minutes. We would find the prediction rather more accurate this time. (Try it with a slightly more modest increase: for instance, a factor of 2 increase in the size of the matrix would increase the factorization time by a factor of 8.)
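The arithmetic of this Illustration can be packaged in a couple of Python lines (a sketch of the extrapolation only, using the measurements quoted above):

```python
def predict_time(t_measured, n_measured, n_target):
    """Extrapolate the factorization time with the asymptotic model
    t = C n^3, calibrating the constant C from one measurement."""
    C = t_measured / float(n_measured) ** 3
    return C * float(n_target) ** 3

t3000 = predict_time(0.01, 400, 3000)     # prediction from the 400x400 timing
t30000 = predict_time(2.35, 3000, 30000)  # prediction from the 3000x3000 timing
```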
For a symmetric matrix A, the factorization

A = LU = L D Û ,

where D is the diagonal of U and Û = D⁻¹U has ones on the diagonal, implies that Û = Lᵀ. Therefore for symmetric A we can make one more step from the LU factorization to the LDLᵀ factorization

A = L D Lᵀ .

This saves both time (we don't have to compute U) and space (we don't have to store U).
Figure 7.10 displays a finite element model with over 2000 unknowns. A small model, it can be handled comfortably on a reasonably equipped laptop, yet it will serve us well to illustrate some of the aspects of the so-called large-scale computing algorithms of which we need to be aware.
The figure shows a tuning fork. This one sounds approximately the note of A (440 Hz, international concert pitch). To find this vibration frequency, we need to solve an eigenvalue problem (in our terminology, the free vibration problem).
The impedance matrix A = K − ω²M, which couples together the stiffness and the mass matrix, is of dimension roughly 2000 × 2000. However, not all 4 million numbers are nonzero. Figure 7.11 illustrates this by displaying the non-zeros as black dots (the zeros are not shown). The code to get an image like this for the matrix A is as simple as

spy(A)

Where do the unknowns come from? The vibration model describes the motion of each node (that would be the corners and the midsides of the edges of the tetrahedral shapes which constitute the mesh of the tuning fork). At each node we have three displacements. Through the stiffness and mass of each of the tetrahedra, the nodes which are connected by the tetrahedra are dynamically coupled (in the sense that the motion of one node creates forces on another node). All these coupling interactions are recorded in the impedance matrix A. If an unknown displacement j at node K is coupled to an unknown displacement k at node M, there will be a nonzero element Ajk in the impedance matrix. If we do not care how we number the individual unknowns, the impedance matrix may look for instance as shown in Figure 7.11: there are some interesting patterns in the matrix, but otherwise the connections seem to be pretty random.
An important aspect of working with large matrices is that as a rule only the non-zeros in the matrices will be stored. The matrices will be stored as sparse. So far we have been working with dense matrices: all the numbers were stored in a two-dimensional table. A sparse matrix has a more complicated storage scheme, since only the non-zeros are kept, and all the zeros are implied (not stored, but when we ask for an element of the matrix that is not in storage, we get back a zero). This may mean considerable savings for matrices that hold only a very small number of non-zeros.
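The idea of sparse storage — keep only the non-zeros, imply the zeros — can be sketched with a Python dictionary keyed by the index pair (an illustrative dictionary-of-keys scheme; MATLAB's internal sparse format is different):

```python
class SparseMatrix:
    """Minimal dictionary-of-keys sparse storage: only non-zeros are kept;
    reading an entry that is not in storage gives back a zero."""
    def __init__(self, nrows, ncols):
        self.shape = (nrows, ncols)
        self.data = {}
    def __setitem__(self, ij, value):
        if value != 0.0:
            self.data[ij] = value
        else:
            self.data.pop(ij, None)   # storing a zero removes the entry
    def __getitem__(self, ij):
        return self.data.get(ij, 0.0)
    def nnz(self):
        return len(self.data)

A = SparseMatrix(2000, 2000)
A[0, 0] = 3.0
A[10, 500] = -1.5
# 4 million entries implied, only two actually stored
```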
Fig. 7.11. The structure of the tuning fork impedance matrix. Left to right: A = LU , L, U . Original
numbering of the unknowns. The black dots represent non-zeros, zeros are not shown.
The reason we might want to worry about how the unknowns are numbered lies in the way the LU factorization works. Remember, we are removing non-zeros below the diagonal by combining rows. That means that if we are eliminating element (k,m), we are adding a multiple of the row m to the row k. If the row m happens to have non-zeros to the right of the column m, all those non-zeros will now appear in row k. In this way, some of the zeros in a certain envelope around the diagonal will become non-zeros during the elimination. This is clearly evident in Figure 7.11, where we can see almost entirely black (non-zero) matrices L and U. Why is this a problem? Because there are a lot more non-zeros in the LU factors than in the original matrix A. The more numbers we have to operate on, the more it costs to factorize the matrix, and the longer it takes. Also, all the non-zeros need to be stored, and updating a sparse matrix with additional non-zeros is very expensive.
The appearance of additional non-zeros in the matrix during the elimination is called fill-in. Fortunately, there are ways in which the fill-in may be minimized by carefully numbering coupled unknowns. Figure 7.12 and Figure 7.13 visualize the impedance matrix and its factors for two different renumbering schemes: the reverse Cuthill-McKee and the symmetric approximate minimum degree permutation. The matrix A holds the same number of non-zeros in all three figures (original numbering, and the two renumbered cases). However, the factors in the renumbered cases hold about 10 times fewer non-zeros than the original factors. This may be significant. Recall that for a dense matrix the cost scales as O(N³). For a sparse matrix with a nice numbering which will limit the fill-in to, say, 100 elements per row, the cost will scale as O(100 N²). For N = 10⁶ this will be the difference between having to wait for the factors for one minute or for a full week.
Fig. 7.12. The structure of the tuning fork impedance matrix. Left to right: A = LU , L, U . Renumbering
of the unknowns with symrcm. The black dots represent non-zeros, zeros are not shown.
As a last note on the subject, we may take into account other techniques for solving systems of linear algebraic equations than factorization. There is a large class of iterative algorithms, a line-up starting with the Jacobi and Gauss-Seidel solvers and currently ending with the so-called multigrid solvers. These algorithms are much less sensitive to the numbering of the unknowns. In this book we do not discuss these techniques, only a couple of minimization-based solvers, including the powerful conjugate gradients, but refer for instance to Trefethen and Bau for an interesting story on current iterative solvers. They are becoming ubiquitous in commercial software, hence we had better know something about them.
Fig. 7.13. The structure of the tuning fork impedance matrix. Left to right: A = LU , L, U . Renumbering
of the unknowns with symamd. The black dots represent non-zeros, zeros are not shown.
Since the determinant of a product is the product of the determinants, and the determinant of a triangular matrix is the product of its diagonal elements, we have

det L = ∏_{i=1}^{n} L_ii = 1 ,  det U = ∏_{i=1}^{n} U_ii ,

so that we have det A = ∏_{i=1}^{n} U_ii.
If on the other hand L has been modified by pivoting permutations, its determinant can be ±1, according to how many permutations occurred. (It is probably best to use the MATLAB built-in det function. It uses the LU factorization, and correctly accounts for pivoting.)
That's how determinants are computed, not by Cramer's rule (not if we wish to live to see the result).
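Here is a hedged Python sketch of the determinant computed this way (partial pivoting with a sign flip per row swap; illustrative only — MATLAB's det does the equivalent internally):

```python
def det_via_lu(A):
    """Determinant from LU with partial pivoting: det A is plus or minus
    the product of the pivots, the sign flipping once per row swap."""
    n = len(A)
    A = [row[:] for row in A]   # work on a copy
    sign = 1.0
    for k in range(n - 1):
        m = max(range(k, n), key=lambda i: abs(A[i][k]))
        if m != k:
            A[k], A[m] = A[m], A[k]
            sign = -sign        # one row swap flips the sign
        if A[k][k] == 0.0:
            return 0.0          # singular matrix
        for i in range(k + 1, n):
            r = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= r * A[k][j]
    d = sign
    for k in range(n):
        d *= A[k][k]            # product of the pivots
    return d
```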
We might consider using the LU factorization for determining the number of linearly independent rows (columns) of a matrix, the so-called rank. If the LU factorization succeeds, the matrix A had full rank. Otherwise, it is possible that the factorization failed just because full pivoting was not applied: the factorization might succeed if all possibilities for pivoting are exploited. MATLAB does not use factorization for this reason (and other reasons that have to do with the stability of the computation); it rather takes advantage of the so-called singular value decomposition. If the matrix A does not have full rank (the number of linearly independent columns, or linearly independent rows, is less than the dimension of the matrix) it is singular, and cannot be LU factorized.
On the diagonal of the matrix U we have the pivots. The signs of the pivots determine the so-called positive or negative definiteness (or indefiniteness) of a matrix. More about this in the chapter on optimization.
vector b and the coefficient matrix A itself are not represented during the solution process faithfully. Therefore, in this section we will consider how the properties of A and b affect the error of the solution x.
First, we shall inspect the sensitivity of the solution of the system of coupled linear algebraic equations Ax = b to the magnitude of the error of the right-hand side, and to the properties of the matrix A. Equivalently, we could also state this in terms of errors: how large can they get?
7.4.1 Perturbation of b
Imagine the right-hand side vector changes just a little bit to b + δb. The solution will then also change

A (x + δx) = b + δb ,

which then gives

A δx = δb .
Now we would like to measure the relative change in the solution ‖δx‖/‖x‖ due to the relative change in the right-hand side ‖δb‖/‖b‖. In terms of norms we can write (symbolically, we never actually invert the matrix)

δx = A⁻¹ δb  (7.14)

so that using the so-called CBS inequality (CBS: Cauchy, Bunyakovsky, Schwarz) we estimate

‖δx‖ ≤ ‖A⁻¹‖ ‖δb‖ .  (7.15)

It does not matter very much which norm is meant here, they are all equivalent. Also we can write for the norms of the solution vector on the left-hand side and the vector on the right-hand side

Ax = b  ⟹  ‖A‖ ‖x‖ ≥ ‖b‖ .  (7.16)

Dividing (7.15) by ‖b‖ we obtain

‖δx‖/‖b‖ ≤ ‖A⁻¹‖ ‖δb‖/‖b‖ .

On the right-hand side we now have the relative error ‖δb‖/‖b‖. Now we can introduce (7.16) to replace ‖b‖ on the left-hand side

‖δx‖/(‖A‖ ‖x‖) ≤ ‖A⁻¹‖ ‖δb‖/‖b‖ ,

which will give us the relative error of the solution ‖δx‖/‖x‖. Finally we rearrange this result into

‖δx‖/‖x‖ ≤ ‖A‖ ‖A⁻¹‖ ‖δb‖/‖b‖ .  (7.17)

The quantity ‖A‖ ‖A⁻¹‖ is the so-called condition number of the matrix A. This inequality relates the relative error of the solution to the relative error of the right-hand side vector. The coefficient of proportionality is found to be determined by the properties of the coefficient matrix.
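To see the condition number at work, here is a small Python sketch using the 1-norm (the maximum absolute column sum), one of the equivalent norms mentioned above; the nearly singular 2x2 example matrix is our own:

```python
def norm1(A):
    """Matrix 1-norm: the maximum absolute column sum."""
    n = len(A)
    return max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))

def inv2(A):
    """Explicit inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

A = [[1.0, 1.0], [1.0, 1.0001]]       # nearly singular, hence ill conditioned
kappa = norm1(A) * norm1(inv2(A))     # condition number, roughly 4e4
```

With a condition number of this size, a relative perturbation of the right-hand side can be amplified by a factor of tens of thousands in the worst case.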
Illustration 8
When the condition number is large, we see that there is a possibility of the change in the right-hand side being very much magnified in the change of the solution. An example of the effect is given here.15 Consider the least-squares computation of a quadratic function passing through three points: the point locations are x=[0;1.11;1.13], and the values of the function at those three points are y=[1;0.5;0.513]. The least-squares computation is set up as

A = [x.^2,x.^1,x.^0];
p=(A'*A)\(A'*y)

to solve for the parameters p of the quadratic fit from the so-called normal equations (see details in Section 9.13). The solution is

p =
0.973849956151390
-1.531423901778500
1.000000000000001
Now change the values of the quadratic function by dy=[0;0.00746;-0.006658];, which is a relative change of norm(dy)/norm(y)=0.00813. The solution changes by

dp=(A'*A)\(A'*dy)
dp =
-0.630637805947415
0.706728685322350
-0.000000000000000

which can be appreciated as a pretty substantial change. We see that

norm(dy)/norm(y)
ans =
0.008128568566353
norm(dp)/norm(p)
ans =
0.457113748779369
This means that while the data changed by less than 1%, the solution for the parameters changed by almost 50%. We call matrices that produce this kind of large sensitivity ill conditioned. Figure 7.14, produced by

x =linspace(0,1.13,100)';
plot(x,[x.^2,x.^1,x.^0]*p,'r-','linewidth',2); hold on
plot(x,[x.^2,x.^1,x.^0]*(p+dp),'k--','linewidth',2)

shows the effect of the ill conditioning: it shows two quadratic curves fitted to the original data y (red solid curve), and to the perturbed data y+dy (black dashed curve). The curves are very different despite the fact that the points through which they pass have been moved only very little.
Fig. 7.14. Quadratic curves fitted to the original data y (red solid curve), and the perturbed data y+dy (black dashed curve).
cond(A'*A)
ans =
7.145344297615475e+004

The magnitude of the condition number can be understood in relative terms by considering the condition numbers of identity matrices (these are probably the best matrices to work with!), which are equal to one. More generally, orthogonal matrices also have condition numbers that are equal to one. That is as low as the condition number goes; all other matrices have larger condition numbers. The bigger the condition number, the bigger the ill-conditioning problem. In particular, we can see that the condition number depends on the existence of the inverse of A. The closer the matrix A is to being non-invertible, the larger the condition number is going to get. For a singular matrix the condition number is defined to be infinite. In the present case, the condition number is seen to be fairly large. Hence we get the substantial amplification of the change of the right-hand side in the solution vector.
Illustration 9
To continue the previous Illustration, we change the horizontal position of one of the points, x=[0;0.61;1.13].16 The perturbed quadratic curve is found to differ only slightly from the original. The condition number confirms that the matrix is considerably less ill-conditioned

cond(A'*A)
ans =
193.7789
7.4.3 Perturbation of A
We can also consider the effect of changes in the matrix itself, for instance when the elements of the matrix are calculated with some error. So when the matrix changes (not the right-hand side, that remains the same), we write for the changed solution
(A + δA)(x + δx) = b .

Canceling Ax = b gives

A δx + δA (x + δx) = 0

or

A δx = −δA (x + δx) .

Considering the problem in terms of norms as before,

δx = −A⁻¹ δA (x + δx)

and

‖δx‖ ≤ ‖A⁻¹‖ ‖δA‖ ‖x + δx‖ .

To bring in relative changes again, we divide by ‖x + δx‖ on both sides and divide and multiply with ‖A‖ on the right-hand side

‖δx‖/‖x + δx‖ ≤ ‖A‖ ‖A⁻¹‖ ‖δA‖/‖A‖ .

We see that the relative change in the solution is expressed as before. It is bounded by the relative change in the left-hand side matrix, and the multiplier is again the condition number.
The condition number appears to be an important quantity. In order to understand the condition
number we have to understand a little bit where the norms of the matrix and its inverse come from.
7.4.4 Induced matrix norm
An easy way in which we can talk of matrix norms while introducing nothing more than norms of vectors stems from the so-called induced matrix norm. We think of the matrix A (here we will discuss only square matrices, but this would also apply to rectangular matrices) as producing a map from the vector space Rⁿ to the same vector space by taking input x and producing output y

y = Ax .

We can measure how big a matrix is (that is, its norm) by measuring how much all possible input vectors x get stretched by A. We take the largest possible stretch as the induced norm of A

‖A‖ = max_{x ≠ 0} ‖Ax‖/‖x‖ .

Note that on the left we have a matrix norm, and on the right we have a vector norm. That is why we say that the matrix norm on the left is induced by the vector norm on the right. An alternative form of the above equation, and a very useful one, can be expressed as

‖A‖ = max_{‖x‖ = 1} ‖Ax‖ .  (7.18)
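The definition can be probed numerically: scan unit vectors around the circle and record the largest stretch. A Python sketch for 2x2 matrices and the 2-norm (a brute-force estimate for illustration, not how norms are actually computed):

```python
import math

def induced_norm2_estimate(A, samples=3600):
    """Estimate the induced 2-norm of a 2x2 matrix by scanning unit
    vectors x around the circle and recording the largest ||Ax||_2."""
    best = 0.0
    for k in range(samples):
        t = 2.0 * math.pi * k / samples
        x = (math.cos(t), math.sin(t))
        y0 = A[0][0] * x[0] + A[0][1] * x[1]
        y1 = A[1][0] * x[0] + A[1][1] * x[1]
        best = max(best, math.hypot(y0, y1))
    return best

# For a diagonal matrix the induced 2-norm is the largest |diagonal entry|
est = induced_norm2_estimate([[3.0, 0.0], [0.0, 1.0]])
```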
Which vector norms can we use? A common choice is the p-norm

‖x‖_p = ( Σ_{j=1}^{n} |x_j|^p )^{1/p}

(we may recall the similarity with the root-mean-square formula for p = 2). Taking p = 1 we get ‖x‖₁ = Σ_{j=1}^{n} |x_j| (the so-called 1-norm); taking p = 2 we obtain the usual Euclidean norm (also called the 2-norm)

‖x‖₂ = ( Σ_{j=1}^{n} |x_j|² )^{1/2} .

Also used is the so-called infinity norm, which has to be worked out by a limiting process: ‖x‖_∞ = max_{j=1:n} |x_j|.
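The three norms are one-liners in any language; an illustrative Python sketch:

```python
def pnorm(x, p):
    """The vector p-norm: (sum |x_j|^p)^(1/p)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def infnorm(x):
    """The infinity norm, the limit of the p-norm as p grows: max |x_j|."""
    return max(abs(v) for v in x)

x = [3.0, -4.0]
# pnorm(x, 1) is 7.0, pnorm(x, 2) is 5.0, infnorm(x) is 4.0
```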
Illustration 10
The three norms introduced above are illustrated in Figure 7.15. The squares and the circle represent vectors of unit norm, as measured by the various norm definitions. The arrows are vectors of unit norm, using the three norm definitions given above.17
Fig. 7.15. Illustration of the vector norms (1, 2, ∞): the unit "circles" ‖x‖₁ = 1, ‖x‖₂ = 1, and ‖x‖_∞ = 1 in the (x1, x2) plane.
The norm of the inverse can be rewritten in terms of A by substituting x = Ay:

‖A⁻¹‖ = max_{x ≠ 0} ‖A⁻¹x‖/‖x‖ = max_{Ay ≠ 0} ‖y‖/‖Ay‖
17 See: aetna/MatrixNorms/vecnordemo.m
so that

‖A⁻¹‖ = ( min_{y ≠ 0} ‖Ay‖/‖y‖ )⁻¹ = ( min_{‖y‖ = 1} ‖Ay‖ )⁻¹ .

With these formulas for the norms, we can write for the condition number

‖A‖ ‖A⁻¹‖ = max_{‖x‖ = 1} ‖Ax‖ / min_{‖y‖ = 1} ‖Ay‖ .  (7.19)
Now this is relatively easy to visualize. Figures 7.16 and 7.17 present a gallery of matrices. The images visualize the results of the multiplication of unit-length vectors pointing in various directions from the origin. The induced 2-norm is used, and consequently the heads of the unit-length vectors form a circle of unit radius. We can see how the formula for the condition number (7.19) correlates with the largest and smallest length of the vector that results from the multiplication of the matrix and the unit vector. For instance, for the matrix A we may estimate the lengths of the longest and shortest Ax vectors as 3 and 2, and therefore we guess the condition number to be 3/2. This may be compared with the computed condition number ‖A‖ ‖A⁻¹‖ ≈ 1.414. Alternatively, we could take the length of the longest vector Ax as 3 and the length of the longest vector A⁻¹x as 1/2, and therefore we guess the condition number to be 3 · 1/2.
Illustration 11
Use the function matnordemo18 to create for each of the three norms a diagram similar to those of Figure 7.16 for the matrix [2 -0.2; -1.5 3], and then try to read off the norm of this matrix from the figure. Compare with the matrix norm computed as

norm([2 -0.2; -1.5 3],1)
norm([2 -0.2; -1.5 3],2)
norm([2 -0.2; -1.5 3],inf)

18 See: aetna/MatrixNorms/matnordemo.m
Fig. 7.16. Matrix and matrix inverse norm illustration. Matrix condition numbers: ‖A‖ ‖A⁻¹‖ = 1.414; ‖B‖ ‖B⁻¹‖ = 3.167; ‖C‖ ‖C⁻¹‖ = 1.0.
strains, and the directions of the principal stresses and strains, are the eigenvalues and eigenvectors of these matrices.
In fact, for all matrices, symmetric and unsymmetric, the matrix norm has something to do with eigenvalues and eigenvectors. Consider the definition of the induced matrix norm

‖A‖ = max_{x ≠ 0} ‖Ax‖/‖x‖ ,

specialized to the 2-norm,

‖A‖₂ = max_{x ≠ 0} ‖Ax‖₂/‖x‖₂ .
Fig. 7.17. Matrix and matrix inverse norm illustration. Matrix condition numbers: ‖D‖ ‖D⁻¹‖ = 10.0; ‖E‖ ‖E⁻¹‖ = 2.618; ‖F‖ ‖F⁻¹‖ = 22.15. (Here E=[1,1; 0,1].)
For the moment we shall consider that the vector norms are Euclidean norms (2-norms). From the definition of the vector norms, we have

‖Ax‖₂² = (Ax)ᵀ(Ax)

so that we can write

‖A‖₂² = max_{x ≠ 0} xᵀAᵀAx / xᵀx .

The expression on the right is the so-called Rayleigh quotient of the matrix AᵀA (not of A itself!). It is the result of the pre-multiplication of the eigenvalue problem

AᵀAx = λx  (7.20)

by xᵀ and division by xᵀx, which yields

λ = xᵀAᵀAx / xᵀx .

Note that

xᵀAᵀAx ≥ 0 ,  xᵀx > 0 ,

where xᵀx = 0 is not allowed by the definition of the norm. Clearly, the Rayleigh quotient attains its maximum for the largest eigenvalue in absolute value max |λ|, and its minimum for the smallest eigenvalue in absolute value min |λ|. From this we can deduce

‖A‖ = √(max |λ|) .

Similarly, we obtain

‖A⁻¹‖ = 1/√(min |λ|) .

Hence, the condition number of A is found to be

‖A‖ ‖A⁻¹‖ = √( max |λ| / min |λ| ) .

If the matrix A is symmetric, we write an eigenvalue problem for it as

Av = λ′v .  (7.21)

In comparison with (7.20) we see that

λ = (λ′)² .  (7.22)

Therefore, the norm of a symmetric matrix will be

‖A‖ = max |λ′| ,

where λ′ solves the eigenvalue problem (7.21). Analogously, the norm of the inverse of a symmetric matrix will be

‖A⁻¹‖ = 1/ min |λ′| ,

and the condition number of a symmetric matrix is

‖A‖ ‖A⁻¹‖ = max |λ′| / min |λ′| .  (7.23)
Illustration 12

Apply formula (7.23) to a singular matrix.

Any singular matrix has at least one zero eigenvalue. No matter how large an eigenvalue of a
singular matrix can get, we know that its smallest eigenvalue (in absolute value) is equal to zero.
Consequently, the condition number of the singular matrix is infinite.
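These relationships are easy to check numerically. A minimal sketch (the symmetric matrix B below is an arbitrary example of mine, not from the toolbox; cond, eig, and norm are built-in MATLAB functions):

```matlab
B = [2, 1; 1, 2];  % symmetric, with eigenvalues 1 and 3
lam = eig(B);
% norm of a symmetric matrix: largest eigenvalue in absolute value
norm(B) - max(abs(lam))  % should be numerically zero
% condition number by formula (7.23)
cond(B) - max(abs(lam))/min(abs(lam))  % should be numerically zero
% a singular matrix has a zero eigenvalue: infinite condition number
cond([1, 1; 1, 1])
```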
7.5 QR factorization

Consider a system of linear algebraic equations

Ax = b

with a square matrix A. It is possible to factorize the matrix into the product of an orthogonal
matrix Q and an upper triangular matrix R

A = QR .

How does this work? If we write this relationship between the matrices in terms of their columns
things become clearer.

c_k(A) = Q c_k(R) .

Now remember, R is an upper triangular matrix. For instance like this (* stands for a possibly
nonzero coefficient)

R = [ * * * *
      0 * * *
      0 0 * *
      0 0 0 * ] .
Then the first column of A is c_1(A) = c_1(Q) R_11 (R_11 is the only possibly nonzero coefficient in the first column
of R). The fourth column of A is a linear combination of the first four columns of Q (the
coefficients are the entries of the fourth column of R)

c_4(A) = c_1(Q) R_14 + c_2(Q) R_24 + c_3(Q) R_34 + c_4(Q) R_44

and so on. The principle is now clear: each of the columns of A is constructed of columns of Q which
are orthogonal, and the columns of Q can be obtained by straightening out the columns of A as
long as the columns of A are linearly independent (refer to Figure 7.18): q_1 is a unit vector in the
direction of a_1, and q_2 is obtained from the part of a_2 that is orthogonal to q_1.
Fig. 7.18. Two arbitrary linearly independent vectors a_1 and a_2, and two orthonormal vectors q_1
and q_2 that span the same plane
The great advantage that can be derived from this factorization stems from the fact that the
inverse of an orthogonal matrix is simply its transpose

Q^-1 = Q^T .
If we substitute this factorization into Ax = b we obtain
Ax = QRx = b
and this allows us to rewrite the system as
Rx = Q^T b .

Now since the matrix R is upper triangular, to solve for the unknown x is very efficient: starting
at the bottom we proceed by backsubstitution. The solution is not for free, of course. We had to
construct the factorization in the first place.
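As a quick sketch of the complete solution process (the matrix and right-hand side here are arbitrary examples, not from the toolbox):

```matlab
A = [4, -2, 1; -2, 4, -2; 1, -2, 4]; b = [1; 2; 3];
[Q, R] = qr(A);   % factorize: A = Q*R
x = R \ (Q'*b);   % Q is orthogonal, so inv(Q) = Q'; R\ is backsubstitution
norm(A*x - b)     % residual: should be on the order of machine precision
```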
An additional benefit of this particular factorization is in the ability to factorize rectangular
matrices, not just square. Furthermore, due to the orthogonality of Q operations with it are as
nice numerically as possible (remember the perfect condition number of one?). Therefore the QR
factorization is used when numerical stability is at a premium. Examples may be found in the least-squares fitting subject. Also, the QR factorization leads to a valuable algorithm for the computation
of eigenvalues and eigenvectors for general matrices.
7.5.1 Householder reflections

The question now is how to compute the QR factorization. A particularly popular and effective
algorithm is based on the so-called Householder reflections.

The Householder transformation (reflection) is designed to modify a column matrix so that the
result of the transformation has only one nonzero element, the first one, but the length of the result
(that is its norm) is preserved. Matrix transformations that preserve lengths are either rotations or
reflections (the Householder transformation is the latter):
Ha = a~ ,   where ||a~|| = ||a|| .

The transformation produces the vector

a~ = [ a~_1, 0, ..., 0 ]^T

by reflection in a plane that is defined by the normal generated as the difference n = a~ - a and
passes through the origin O (see Figure 7.19). This follows from the two vectors a~ and a being of
the same length.
Fig. 7.19. Householder transformation: the geometrical relationships. The reflection plane is shown by the
dashed line. Consider that in the two-dimensional figure there are two possible reflection planes.
The Householder transformation matrix may be expressed in terms of the normal n as

H = 1 + (n n^T) / (n^T a) ,     (7.24)

which is an orthogonal matrix,

H^T H = 1 .
Interestingly, this matrix is also symmetric. This is really how it should be: H produces a mirror
image of a, a~ = Ha. The mirror image of a~, the inverse operation a = H^-1 a~, must give us back
a, but the inverse operation is again a reflection, the same reflection that gave us a~ from a.
To compute the Householder matrix we could use the function Householder_matrix. The sign
of the non-zero element of a~ is computed with particular attention to numerical stability: when we
compute n = a~ - a, the vector a~ has only one nonzero element. To avoid numerical error when
subtracting two similar numbers a~_1 - a_1 we choose sign a~_1 = -sign a_1.
function H = Householder_matrix(a)
if (a(1)>0), at1 =-norm(a);% choose the sign wisely
else, at1 =+norm(a); end
n=-a; n(1)=n(1)+at1;% this is the subtraction n = a~ - a
H = eye(length(a))+(n*n')/(n'*a);% this is formula (7.24)
end
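A brief check of the function (the vector a here is an arbitrary example of mine):

```matlab
a = [3; 4; 0];             % norm(a) = 5
H = Householder_matrix(a);
H*a                        % only the first element survives: [-5; 0; 0]
H'*H                       % identity: the reflection is orthogonal
```

Note that the sign of the surviving element is opposite to that of a(1), as dictated by the stability consideration above.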
How do we use the Householder transformation? We consider the columns of the matrix to be
transformed as the vectors that we can reflect as shown above. The first step zeroes out the elements
of A below the diagonal of the first column (* stands for a possibly nonzero element):

H_1 A = [ * * * * * *
          0 * * * * *
          0 * * * * *
          0 * * * * *
          0 * * * * *
          0 * * * * * ]

We write H_1 for the 6 x 6 matrix obtained from the first column of A. We write H_2 for the 5 x 5
matrix obtained from the second column of A, from the diagonal to the bottom of the column.
Analogously for the other Householder matrices. Each H_j is embedded as the trailing diagonal
block of an identity matrix, and one column after another is reduced below the diagonal, until

[ 1 0 ; 0 H_5 ] ... [ 1 0 ; 0 H_2 ] H_1 A = R ,

where the block 1 in the factor containing H_j stands for the (j-1) x (j-1) identity matrix.
To obtain A from R we would successively invert the above relationships one by one. That is not
difficult since we realize that those matrices are orthogonal and symmetric, so the inverse is equal
to the original matrix. We just have to switch the order of the matrices. We get

A = H_1 [ 1 0 ; 0 H_2 ] ... [ 1 0 ; 0 H_5 ] R ,

and therefore

Q = H_1 [ 1 0 ; 0 H_2 ] ... [ 1 0 ; 0 H_5 ] ,

where in each factor the smaller Householder matrix is embedded as the trailing diagonal block of
an identity matrix.
Illustration 13

Here we present a factorization which is based directly on the schemas above. The function
Householder_matrix computes the Householder matrix of equation (7.24). Note that the matrices H_j are blocks embedded
in an identity matrix. The following code fragment should be stepped
through, and I will bet that it will nicely reinforce our ideas of how Householder reflections work.
format short
A=rand(5); R=A % this is where R starts
Q=eye(size(A));% this is where Q starts
for k=1:size(A,1)-1
H=eye(size(A));% Start with an identity...
% ...and then put in the Householder matrix as a block
H(k:end,k:end) = Householder_matrix(R(k:end,k))
R= H*R % this matrix is becoming R
Q= Q*H % this matrix is becoming Q
end
Q'*Q% check that this is an orthogonal matrix: should get identity
A-Q*R % check that the factorization is correct
R-Q'*A % another way to check
The algorithm to produce the QR factorization is designed to be a little bit more efficient than
the code above, but it is still surprisingly short and readable
function [Q,R] = HouseQR(A)
m=size(A,1);
Q=eye(m); R =A;
for k=1:size(A,1)-1
n = Householder_normal(R(k:end,k));
R(k:end,k:end) =R(k:end,k:end)-2*n*(n'*R(k:end,k:end));
Q(:,k:end)=Q(:,k:end)-2*(Q(:,k:end)*n)*n';
end
end
Instead of the Householder matrix (7.24) we use in HouseQR the equivalent expression

H = 1 - 2 N N^T ,

where N has the same direction as n but is of unit length

N = n / ||n|| .
8

Solution methods for eigenvalue problems
Summary

1. We discover a few basic algorithms for the solution of the eigenvalue problem, both the standard
and the generalized form.
2. Repeated multiplication with a matrix tends to amplify directions associated with eigenvectors
of dominant eigenvalues. Main idea: write the modal expansion, and consider the powers of
eigenvalues.
3. Various forms of the power iteration, including the QR iteration, form the foundations of some
of the workhorse routines used in vibration analysis and in general purpose software (with
appropriate, and sometimes considerable, refinements).
4. The Rayleigh quotient is an invaluable tool both for algorithm design and for quick ad hoc
checks.
5. This area of numerical analysis has seen considerable progress in recent years and some powerful new algorithms have emerged. Solving large-scale eigenvalue problems nevertheless remains
nontrivial, even with sophisticated software packages.
Expand an arbitrary starting vector x in the basis of the eigenvectors v_j of the matrix A (the
modal expansion)

x = sum_{j=1:n} c_j v_j .

Repeated multiplication by A then gives

A^k x = sum_{j=1:n} c_j A^k v_j = sum_{j=1:n} c_j λ_j^k v_j .
Factoring out the magnitude of the first eigenvalue,

sum_{j=1:n} c_j A^k v_j = |λ_1|^k sum_{j=1:n} c_j (λ_j^k / |λ_1|^k) v_j .

Due to our assumption that the first eigenvalue dominates, the coefficients λ_j^k / |λ_1|^k will approach
zero in absolute value as k → ∞, except for λ_1^k / |λ_1|^k which will maintain absolute value equal to
one. Therefore, as k → ∞ the only term left from the modal expansion of x will be

A^k x → c_1 λ_1^k v_1 .
Figure 8.1 illustrates the effect of repeated multiplication of an arbitrary vector x by the 2 x 2
matrix A

Ax , AAx = A^2 x , ...

The eigenvalues are λ_1 = 1.6 (with eigenvector v_1), λ_2 = 0.37 (with eigenvector v_2), so the first eigenvalue is dominant, and evidently the result of the multiplication leans more and more towards the
first eigenvector. The leaning is very rapid. The reason is that the fraction λ_2^k / |λ_1|^k = (0.23125)^k
will decrease very rapidly with higher powers (for instance, (0.23125)^4 = 0.00285). Therefore, the
contribution of the eigenvector v_2 to the vector A^k x will become vanishingly small rather quickly.
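This leaning is easy to reproduce numerically; a possible sketch (the normalization of x in each pass only keeps the numbers readable, it does not change the direction):

```matlab
A = [1.68, 0.548; 0.202, 0.286]; % the matrix of Figure 8.1
[V, D] = eig(A);
[~, i] = max(abs(diag(D)));      % locate the dominant eigenvalue
vd = V(:, i);                    % dominant (unit) eigenvector
x = rand(2, 1);
for k = 1:8
    x = A*x; x = x/norm(x);      % one multiplication, then rescale
end
abs(vd'*x)                       % approaches 1 as the directions align
```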
A = [1.68 0.548; 0.202 0.286]

Fig. 8.1. The effect of several matrix-vector multiplications. Eigenvalues λ_1 = 1.6, λ_2 = 0.37
The repeated multiplication to amplify the components of the dominant eigenvector is the
principle behind the so-called power iteration method for the calculation of the dominant eigenvalue/eigenvector.
Repeated multiplication by the matrix will diminish the contributions of all other eigenvectors except the first one so that eventually
the product A^k x will be mostly in the direction of the first eigenvector v_1.

The method is not failproof. Firstly, it appears that if the starting vector x does not contain any
contribution of the first eigenvector, c_1 = 0, the power method is not going to converge. Fortunately,
any amount of the inevitable arithmetic error will likely introduce some contribution of the first
eigenvector to which the power method will ultimately converge. Unfortunately, it may take a long
time.
Secondly, the method is definitely going to have trouble with converging for |λ_2| ≈ |λ_1| (in
words, when the second eigenvalue is close to the first eigenvalue in magnitude). The ratio λ_2^k / |λ_1|^k
will decrease slowly, resulting in slow convergence. Such a situation is illustrated in Figure 8.2: the
eigenvalues are λ_1 = -0.8, λ_2 = 0.75. The iterated vector A^k x appears to converge to the direction
of v_1, but slowly.
A few observations can be made from Figure 8.2. The iterated vector A^k x decreases in magnitude
(|λ_1| < 1), and if we iterate sufficiently long the vector will get so short that we may risk underflow,
or at least numerical issues due to arithmetic error. (Note that for |λ_1| > 1 the approximations
to the eigenvector will grow, which may eventually result in overflow.) Further, since λ_1 < 0 the
iterated vector aligns itself alternately with v_1 and -v_1. This is fine, since both are perfectly good
eigenvectors, but it complicates somewhat the issue of how to measure convergence. We want to
measure convergence of directions, not of the individual components of the vector!
A = [1.4 0.894; 1.43 1.35]

Fig. 8.2. The effect of several matrix-vector multiplications. Eigenvalues λ_1 = -0.8, λ_2 = 0.75
To address the concerns about underflow and overflow we may introduce normalization (rescaling)
of the iterated vector as

x^(0) given
for k = 1, 2, ...
    x^(k) = A x^(k-1)
    x^(k) = x^(k) / ||x^(k)||
How to measure the convergence of the algorithm may be made easier by considering the associated
problem of finding the eigenvalue λ_1. An excellent tool is offered by the Rayleigh quotient. Pre-multiply the eigenvalue problem on both sides with v_j^T

A v_j = λ_j v_j   =>   v_j^T A v_j = λ_j v_j^T v_j ,

which gives (the Rayleigh quotient)

λ_j = (v_j^T A v_j) / (v_j^T v_j) .
Now consider the vector x^(k) as an approximation of the eigenvector v_1. A good approximation of
the eigenvalue will be

λ ≈ (x^(k)T A x^(k)) / (x^(k)T x^(k)) .

It will be much easier to measure relative approximate errors in the eigenvalue than to measure the
convergence of the direction of the eigenvector. An actual implementation of the power iteration
algorithm then follows easily:
function [lambda,v,converged]=pwr2(A,v,tol,maxiter)
% ... some error checking omitted
plambda=Inf;% eigenvalue in previous iteration
converged = false;
for iter=1:maxiter
    u=A*v; % update eigenvector approx
    lambda=(u'*v)/(v'*v);% Rayleigh quotient
    v=u/norm(u);% normalize
    if (abs(lambda-plambda)/abs(lambda)<tol)
        converged = true; break;% converged!
    end
    plambda=lambda;% eigenvalue in previous iteration
end
end
Note that we have to return a Boolean flag to indicate whether the iteration process inside the
function converged or not. This is a common design feature of software implementing iterative
processes, since the iterations may or may not succeed.
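Assuming pwr2 as listed above, a typical call might look like this (the tolerance and iteration limit are arbitrary choices):

```matlab
A = [1.68, 0.548; 0.202, 0.286]; % the matrix of Figure 8.1
[lambda, v, converged] = pwr2(A, rand(2,1), 1e-9, 100);
% compare with the dominant eigenvalue delivered by eig
abs(lambda - max(abs(eig(A))))   % should be tiny if converged is true
```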
We conclude this section with pointing out that power iteration relies on the existence of a
dominant eigenvalue. This is not applicable in many important problems, for example for the first
order form of the equations of motion of a vibrating system. For such systems eigenvalues come in
complex conjugate pairs. There is no single dominant eigenvalue, and consequently power iteration
will not converge. This is illustrated in Figure 8.3, where we show the progress of the power iteration
for two different starting vectors for a matrix with eigenvalues λ_1,2 = ±0.7. There is no progress
towards any of the eigenvectors, since the iterated vectors just switch between two different directions
neither of which is the eigenvector direction.
In what follows we shall work with real symmetric matrices, unless we explicitly say
otherwise. The main reasons: these matrices are very important in practice, we don't
have to treat special cases such as missing eigenvectors, and the eigenvalues and eigenvectors are real.
Illustration 1

Figure 8.4 shows the model of two linked buildings. Each building is represented by a concentrated
mass m standing in for the total mass of the floor, and springs linking the floors k_c which would be
representative of the total horizontal stiffness of the columns in between the floors (or the ground).
The buildings are linked at each floor with another spring k, representative of walkways (bridges)
that connect the buildings. The masses in the system are numbered as shown.
1 See: aetna/EigenvalueProblems/pwr2.m
A = [1.24 0.808; 1.29 1.24]

Fig. 8.3. The effect of several matrix-vector multiplications. Eigenvalues λ_1,2 = ±0.7
The mass matrix is simply m times the 10 x 10 identity matrix. The stiffness matrix K has the structure
shown below. Note that if the buildings are not linked by the walkways (k = 0), the stiffness matrix
will split into two uncoupled 5 x 5 diagonal blocks that correspond to each building separately.
Nonzero walkway stiffness will couple the vibrations of the two buildings together.
K =
[ kc+k   -kc     0      0      0     -k     0      0      0      0
  -kc   2kc+k   -kc     0      0      0    -k      0      0      0
   0    -kc    2kc+k   -kc     0      0     0     -k      0      0
   0     0     -kc    2kc+k   -kc     0     0      0     -k      0
   0     0      0     -kc    2kc+k    0     0      0      0     -k
  -k     0      0      0      0     kc+k  -kc     0      0      0
   0    -k      0      0      0     -kc   2kc+k  -kc     0      0
   0     0     -k      0      0      0    -kc   2kc+k   -kc     0
   0     0      0     -k      0      0     0    -kc    2kc+k   -kc
   0     0      0      0     -k      0     0      0    -kc    2kc+k ]
The vibration problem can be described by the equation (5.3)

ω^2 M z = K z .

Since the mass matrix is just a multiple of the identity, this may be written as

A z = λ z ,

where we define

A = (1/m) K ,   and   λ = ω^2 .
The first practice will apply the power method to the computation of the largest frequency of
vibration. We assume m = 133, kc = 61000, k = 3136 (in consistent units). The solution with
MATLAB's eig is written for the eigenvalue problem as

[M,K,A] = lb_prop;
[V,D]=eig(A) % This may be replaced with [V,D]=eig(K,M)
disp('Frequencies [Hz]')
sqrt(diag(D))/(2*pi)
which yields the resulting frequencies as
Frequencies [Hz]
ans =
0.9702 1.4614 2.8319 3.0354 4.4641 4.5960 5.7348 5.8380 6.5408 6.6315
Applying the power method as shown in the script lb_A_power2 with a random starting vector yields
an approximation of the highest eigenvalue, but it is not anywhere close to being converged. This
should not surprise us. We would expect the convergence to be slow: the two largest eigenvalues are
very closely spaced (the largest eigenvalue is weakly dominant): see Figure 8.5. This makes, together
with the inherent symmetry in the structure, for an interesting experiment: see below.
Suggested experiments

1. Use a starting vector in the form of ones(10,1). Do we get convergence to the largest eigenvalue?
If not, try to explain. [Difficult]
Fig. 8.4. The model of the two linked buildings, with the masses numbered 1-10

Fig. 8.5. The two highest-frequency mode shapes: f9 = 6.5408 [Hz], f10 = 6.6315 [Hz]
Multiply the eigenvalue problem A x = λ x on both sides by A^-1 and divide by λ to obtain

(1/λ) x = A^-1 x .

In words, the matrix A and A^-1 have the same eigenvectors, and the eigenvalues of A^-1 are the
inverses of the eigenvalues of A. Clearly, the largest eigenvalue of A^-1 will be one over the smallest
eigenvalue of A

max |eigenvalue of A^-1| = 1 / min |eigenvalue of A| .
Therefore, to find the eigenvalue/eigenvector pair of A for the smallest eigenvalue in absolute value
we can perform the power iteration on A^-1. We would not wish to invert the matrix, of course, and
so we formulate the algorithm as

x^(0) given
for k = 1, 2, ...
    solve A x^(k) = x^(k-1)
    x^(k) = x^(k) / ||x^(k)||

which simply means solve for x^(k) from A x^(k) = x^(k-1). (Compare with the power iteration algorithm on page 163; there is only one change, but an important one.) Since the solution is needed
during each iteration, we may conveniently and efficiently take advantage of the LU factorization.
The inverse power iteration algorithm is summarized in the code below. Note the changes with
respect to the power iteration in the first two lines in the for loop.
function [lambda,v,converged]=invpwr2(A,v,tol,maxiter)
% ... some error checking omitted
plambda=Inf;% initialize eigenvalue in previous iteration
[L,U,p]=lu(A,'vector');% Factorization
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p)); % update eigenvector approx, equiv. to u=A\v
    lambda=(v'*v)/(u'*v);% Rayleigh quotient: note the inverse
    v=u/norm(u);% normalize
    if (abs(lambda-plambda)/abs(lambda)<tol)
        converged = true; break;% converged!
    end
    plambda=lambda;
end
end
3 See: aetna/EigenvalueProblems/invpwr2.m
Note the shortcut to the value of the Rayleigh quotient: the vector product (u'*v) incorporates the
multiplication with A^-1. Then, because we are iterating to find 1/λ, we invert the fraction.

The inverse power iteration also relies on the existence of a dominant eigenvalue. Dominant
here means that the smallest eigenvalue should be strictly smaller in absolute value than any other
eigenvalue of A. We assume again they are ordered in decreasing magnitude, and for the success of
the inverse iteration we require

|λ_1| ≥ |λ_2| ≥ |λ_3| ≥ ... ≥ |λ_{n-1}| > |λ_n| .

Analogously to the power iteration, the convergence of the inverse power iteration will be faster for
very dominant eigenvalues, |λ_{n-1}| >> |λ_n|, and painfully slow for |λ_{n-1}| ≈ |λ_n|.
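Assuming invpwr2 as listed above, a quick sanity check might read (the test matrix is an arbitrary symmetric example of mine, shifted to make it positive definite):

```matlab
S = rand(4); A = S + S' + 5*eye(4); % symmetric, eigenvalues shifted positive
[lambda, v, converged] = invpwr2(A, rand(4,1), 1e-9, 200);
% lambda should approximate the eigenvalue of A smallest in magnitude
abs(lambda - min(abs(eig(A))))
```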
Illustration 2

Here we illustrate the convergence of the inverse power iteration on the example of two symmetric
matrices. We construct two random matrices with spectra that are identical except for the smallest eigenvalue. The smallest eigenvalue is dominant in one matrix, and rather close to the second
eigenvalue in magnitude in the second matrix. Consequently Figure 8.6 displays quite disparate
convergence behaviors of the inverse power iteration: very good in the first case, poor in the second.
Fig. 8.6. The relative error of the smallest eigenvalue for two symmetric 13 x 13 matrices with eigenvalues
[13, 14:25] and [6.1, 14:25]
Illustration 3

Apply the inverse power iteration method to the structure described in the Illustration on page 164.

The inverse power method as shown in the script lb_A_invpower with a random starting vector
yields an approximation of the lowest eigenvalue with satisfactory convergence. The first two mode
shapes are shown in Figure 8.7 (only the mode on the left was computed with inverse power iteration,
the mode on the right was added using eig()).
Fig. 8.7. The first two mode shapes: f1 = 0.97015 [Hz], f2 = 1.4614 [Hz]
Suggested experiments

1. Change the stiffness of the link spring to k = 0. Does the inverse power iteration converge? If
not, why?
Fig. 8.8. The eigenvalues 1/λ_j of the inverse matrix, and the effect of the shift σ on the spectrum of the shifted inverse
Fig. 8.9. The relative error of the smallest eigenvalue λ_4 for the symmetric 4 x 4 matrices with eigenvalues
[2.80, 1.167, 0.609, 0.452]. Comparison of un-shifted and shifted inverse power iteration.

Figure 8.9 shows the effect of shifting. Two shifts are applied, one corresponding to Figure 8.8,
and one even closer to the eigenvalue λ_4 in magnitude, σ = 0.4. The effect of shifting is quite
dramatic. The closer we can guess the magnitude of the smallest eigenvalue (so that we can set the
shift to be equal to the guess of the eigenvalue) the higher the convergence rate.
The inverse power iteration algorithm with shifting is given in MATLAB code below.
function [lambda,v,converged]=sinvpwr2(A,v,sigma,tol,maxiter)
% ... some error checking omitted
n=size(A,1);
plambda=Inf;% initialize eigenvalue in previous iteration
v=v/norm(v);% normalize
[L,U,p]=lu((A-sigma*eye(n)),'vector');%Factorization
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p)); % update eigenvector approx
    lambda=(u'*A*u)/(u'*u);% Rayleigh q. using the definition
    v=u/norm(u);% normalize
    if (abs(lambda-plambda)/abs(lambda)<tol)
        converged = true; break;% converged!
    end
    plambda=lambda;
end
end
6 See: aetna/EigenvalueProblems/sinvpwr2.m
Illustration 5

Apply the inverse power iteration method to the structure described in the Illustration on page 164, but
change the stiffness of the link spring to k = 0. Would shifting help with convergence to the first
frequency?
A = [1.05 0.171; 0.171 0.614]

Fig. 8.10. The effect of several matrix-vector multiplications. Eigenvalues λ_1 = 1.11, λ_2 = 0.556. No effort
is made to maintain the iteration vectors linearly independent.
The fact that the most dominant eigenvector will be swamping out all the other eigenvectors is
going to keep us from obtaining reasonable approximations of the other eigenvectors. In other words,
since the dominant eigenvector components will be getting magnified more than the components
of the other vectors, eventually all the vectors on which we iterate will become aligned with the
dominant eigenvector. Figure 8.10 illustrates the effect of simultaneous iteration on two vectors:
starting vectors are w_1^(0), w_2^(0). After just four iterations the vectors w_1^(4), w_2^(4) are pretty much
aligned with the dominant eigenvector v_1. They are still linearly independent, but only barely.
So iteration on multiple vectors will be tricky. The desired eigenvectors will still be present,
but they will be hard to extract from such an ill conditioned basis (all vectors essentially parallel).
Therefore, similarly to power (inverse power) iteration where we normalized the approximation in
each step so as to avoid underflow or overflow, we will normalize the set of vectors on which we iterate.
Not only so they are unit magnitude, but also so that they are mutually orthogonal. (Technical
term: the vectors are orthonormal.) An excellent tool for this purpose is the QR factorization: the
columns of the matrix Q are orthonormal, and they come from the columns of the input matrix.
In this way we get the so-called simultaneous power iteration (also called block power iteration). The starting vectors will be arranged as columns of a rectangular matrix

W^(0) = [ w_1^(0), w_2^(0), ... w_p^(0) ] .
The algorithm will repeatedly multiply the iterated n x p matrix W^(k) by the n x n matrix A and
also orthogonalize the columns of the iterated matrix by the QR factorization.

W^(0) given
for k = 1, 2, ...
    W^(k) = A W^(k-1)
    QR = W^(k) % compute QR factorization
    W^(k) = Q
(8.1)
The eigenvalue approximations may be computed as before from the Rayleigh quotient

λ_j^(k) = ( w_j^(k)T A w_j^(k) ) / ( w_j^(k)T w_j^(k) ) .

Note that the iterated vectors are orthonormal,

w_j^(k)T w_m^(k) = { 1, when j = m
                     0, otherwise.
Figure 8.11 shows the effect of orthogonalization for the same matrix and the same starting vectors
as in Figure 8.10, but this time with QR factorization. The iterated vectors now converge to the two
eigenvectors.
A = [1.05 0.171; 0.171 0.614]

Fig. 8.11. The effect of several matrix-vector multiplications. Eigenvalues λ_1 = 1.11, λ_2 = 0.556. Iteration
vectors are orthogonalized after each iteration.
In order to switch from the block power iteration to the block inverse power iteration we just
switch the one line that refers to the repeated multiplication with the coefficient matrix so that the
multiplication is with its inverse

W^(0) given
for k = 1, 2, ...
    A W^(k) = W^(k-1) % solve
    QR = W^(k) % compute QR factorization
    W^(k) = Q
(8.2)
The MATLAB code for the block inverse power iteration is given below. Note that the so-called
economy QR factorization is used: the matrix Q is rectangular rather than square.
function [lambda,v,converged]=binvpwr2(A,v,tol,maxiter)
% ... some error checking omitted
nvecs=size(v,2);% How many eigenvalues?
plambda=Inf+zeros(nvecs,1);
lambda =plambda;
[v,r]=qr(v,0);% normalize
[L,U,p] =lu(A,'vector');% Factorized for efficiency
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p,:)); % update vectors
    for j=1:nvecs % Rayleigh quotient
        lambda(j)=(v(:,j)'*v(:,j))./(u(:,j)'*v(:,j));
    end
    [v,r]=qr(u,0);% economy QR factorization
    if (norm(lambda-plambda)/norm(lambda)<tol)
        converged = true; break;
    end
    plambda=lambda;
end
end

8 See: aetna/EigenvalueProblems/binvpwr2.m
Note that when we're computing the Rayleigh quotient we have to account for u being the result of
the inverse power iteration. Also, we could have replaced
lambda(j)=(v(:,j)'*v(:,j))./(u(:,j)'*v(:,j)) with
lambda(j)= 1.0./(u(:,j)'*v(:,j)) (why?).
Shifting could also be applied to block inverse power iteration. Even though only one shift value
can be used, the beneficial effect applies to all iterated eigenvectors: the iteration will converge to
the eigenvectors with eigenvalues closest to the shift.
Illustration 6

Apply the block inverse power iteration method to the structure described in the Illustration on page 164,
but change the stiffness of the link spring to k = 0. Use it to find the first two modes.

A possible solution is given in the script lb_A_blinvpower.

Suggested experiments

1. Interpret the mode shapes obtained above with the solution provided by MATLAB's eig. The
mode shapes are different. Does it matter?
8.5 QR iteration

An obvious step to take with simultaneous power iteration is to compute all the eigenvalues and
eigenvectors of the n x n matrix A by iterating on n vectors at the same time. This is shown in the
following algorithm (note the choice of the initial orthonormal vectors as the columns of an identity
matrix):

W^(0) = 1
for k = 1, 2, ...
    W^(k) = A W^(k-1)
    QR = W^(k) % compute QR factorization
    W^(k) = Q
(8.3)
The matrix W^(k) converges to a matrix of eigenvectors. Recall that the matrix of eigenvectors can
make the matrix A similar to a diagonal matrix, the matrix of the eigenvalues (see (4.13)). The
matrix W^(k) is only close to the matrix of eigenvectors (and getting closer with the iteration), and
therefore the matrix

W^(k)T A W^(k)

will be only close to a diagonal matrix, not perfectly diagonal, and the numbers on the diagonal will
approximate the eigenvalues.
It can be shown that the above simultaneous iteration is equivalent to the so-called QR iteration
(note well that this is different from QR factorization). The QR iteration is given by the following
algorithm:

A^(0) = A
for k = 1, 2, ...
    QR = A^(k-1) % compute QR factorization
    A^(k) = RQ % note the switched factors
(8.4)
The matrix A^(k) that appears in the last step of (8.4) is the same as A^(k) = W^(k)T A W^(k) in
the algorithm (8.3) (explained in detail in Trefethen, Bau (1997)). In this sense the two algorithms
are equivalent. The script qr_power_correspondence demonstrates the equivalence of the two
algorithms for a randomly generated matrix.
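The behavior of (8.4) is also easy to observe directly; a minimal sketch for a random symmetric matrix (the iteration count is an arbitrary choice of mine, and more steps may be needed when eigenvalues are closely spaced or repeated in absolute value):

```matlab
S = rand(5); A = S + S';       % random symmetric test matrix
Ak = A;
for k = 1:300
    [Q, R] = qr(Ak);           % factorize...
    Ak = R*Q;                  % ...and multiply in the reverse order
end
% for symmetric A the iterated matrix approaches a diagonal matrix
% holding the eigenvalues of A
sort(diag(Ak)) - sort(eig(A))  % should be tiny
```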
The QR iteration (8.4) is amenable to several significant enhancements as pointed out below.
The QR iteration is one of the most important algorithms used in eigenvalue/eigenvector problems.
First we will inspect the properties of the transformations effected by the above algorithm.
8.5.1 Schur factorization

The matrix A^(k) in (8.4) converges to an upper triangular matrix. In fact, for our assumption of A
being symmetric, A^(k) converges to a diagonal matrix. In the limit of k → ∞ the transformation

W^(k)T A W^(k)     (8.5)

is upper triangular. This can be shown as follows: the square matrix A has at least one eigenvalue
and one eigenvector. Therefore, we can write (for simplicity the procedure is demonstrated here for
a 6 x 6 matrix; the symbol * means here a general complex number; zeros are not shown)

A U_1 = U_1 [ λ_1 * * * * *
                  * * * * *
                  * * * * *
                  * * * * *
                  * * * * *
                  * * * * * ]

where the first column of U_1 is an eigenvector of A: A u_1 = λ_1 u_1, and the other columns of U_1 are
arbitrarily selected to form an orthonormal basis (this is always possible). Now we write
11 Hermitian matrix: A = A^H, where A^H is the so-called conjugate transpose (its elements are complex
conjugates of the transposed matrix).
12 Defective matrix does not have a full set of eigenvectors. Example: [0, 1; 0, 0]. Double eigenvalue 0, a
single eigenvector [1; 0].
13 Unitary matrix: complex matrix U such that U^H U = U U^H = 1. For real matrices unitary = orthogonal.
U_1^T A U_1 = [ λ_1 * * * * *
                 0  * * * * *
                 0  * * * * *
                 0  * * * * *
                 0  * * * * *
                 0  * * * * * ]

and we apply exactly the same argument to the smaller 5 x 5 matrix (the * elements). This again
leads to the first column having zeros below the diagonal, which we write as

U_2^T U_1^T A U_1 U_2 = [ λ_1  *  * * * *
                               λ_2 * * * *
                                   * * * *
                                   * * * *
                                   * * * *
                                   * * * * ]

Continuing in this fashion we arrive at

U_5^T ... U_2^T U_1^T A U_1 U_2 ... U_5 = [ λ_1 * * * * *
                                                 λ_2 * * * *
                                                     λ_3 * * *
                                                         λ_4 * *
                                                             λ_5 *
                                                                 λ_6 ]
Since we can define a unitary matrix as U = U_1 U_2 ... U_5 we have completed the Schur factorization.
This construction highlights the main attraction of the Schur factorization: the upper triangular
matrix on the right-hand side has the eigenvalues of A on the diagonal. It also points to a major
difficulty: in order to compute the Schur factorization we have to solve a sequence of eigenvalue
problems. This is not possible in a finite number of steps in general, as follows from the impossibility
of finding the roots of an arbitrarily high order polynomial by explicit formulas. As a consequence,
computing the Schur factorization must be an iterative procedure, and in fact the QR iteration is
precisely such a procedure.
8.5.2 QR iteration: Shifting and deflation

The QR iteration is a numerically stable procedure because it proceeds by applying successive
orthogonal transformations, similarly to the construction we just outlined. To show this we write for
the QR factors in one step

Q^(k) R^(k) = A^(k-1) ,   A^(k) = R^(k) Q^(k)

and substitute in the second equation R^(k) = Q^(k)T A^(k-1) from the first equation:

A^(k) = Q^(k)T A^(k-1) Q^(k) .
Fig. 8.12. QR factorization example. Matrix eigenvalues [3, 3, 4, 4.5, 5, 7]. QR iterations 1, 5, 9, 13 are
shown top to bottom, left to right.
elements decrease in magnitude with successive iterations, and the diagonal elements come to dominate. Figure 8.13 shows similar computation as in Figure 8.12, but with a dierent matrix. This
time the QR iteration gets stuck on the three eigenvalues in the top left corner, and the iteration
does not result in a diagonal matrix. The lack of convergence is due to the repeated eigenvalues (in
absolute value), and additional sophistication is needed to extract the the repeated eigenvalues.
Shifting may be introduced into the QR iteration similarly as in the simultaneous inverse iteration. The QR iteration may in fact be shown to be equivalent not only to simultaneous iteration, but also to simultaneous inverse iteration. Therefore, the shifting will have a very similar effect: faster convergence in the lower eigenvalues. The shift can be selected in various judicious ways. Here we will discuss a simple choice: the Rayleigh quotient shift. We have seen that the QR iteration was successively transforming the original matrix to a diagonal matrix. The elements on the diagonal of the iterated matrix are in fact the Rayleigh quotients. A good shift therefore is the element A_nn^(k-1) of the iterated matrix. The shift is applied as
A^(0) = A
for k = 1, 2, ...
    rho = A_nn^(k-1)
    Q^(k) R^(k) = A^(k-1) - rho 1   % compute QR factorization
    A^(k) = R^(k) Q^(k) + rho 1
This translates directly into MATLAB code:16

function A = qrstepS(A)
[m,n]=size(A);
rho = A(n,n); % shift
[Q,R]=qr(A-rho*eye(n,n));
A = R*Q + rho*eye(n,n);
end

14 See: aetna/EigenvalueProblems/qrstep.m
15 See: aetna/EigenvalueProblems/Visualize_qr_iteration.m
16 See: aetna/EigenvalueProblems/qrstepS.m

Fig. 8.13. QR factorization example. Matrix eigenvalues [5, 5, 5, 3, 2, 1]. QR iterations 1, 5, 9, 13 are shown top to bottom, left to right.
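For readers working outside MATLAB, the same shifted step can be sketched in Python. The function name qr_step_shifted and the test matrix below are ours, not part of the AETNA toolbox; the eigenvalues are chosen distinct so the iteration diagonalizes the matrix:

```python
import numpy as np

def qr_step_shifted(A):
    """One step of the shifted QR iteration with Rayleigh quotient shift rho = A[n-1, n-1]."""
    n = A.shape[0]
    rho = A[-1, -1]                               # the shift
    Q, R = np.linalg.qr(A - rho * np.eye(n))      # QR factorize the shifted matrix
    return R @ Q + rho * np.eye(n)                # reverse the factors, shift back

# Symmetric test matrix with known eigenvalues, hidden by an orthogonal similarity
rng = np.random.default_rng(0)
Q0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q0 @ np.diag([1.0, 3.0, 6.0, 10.0]) @ Q0.T

for _ in range(100):
    A = qr_step_shifted(A)

off_diag_norm = np.linalg.norm(A - np.diag(np.diag(A)))
```

After the iterations, the diagonal of A approximates the eigenvalues and the off-diagonal entries have decayed.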
In practice, once an eigenvalue converges, the corresponding row and column are removed from the matrix, and the QR iteration continues on the smaller remaining matrix. This is called deflation.
Illustration 7
Apply the shifted QR iteration method to the structure described in the Illustration on page 164.
The shifted QR algorithm using qrstepS in the script lb A qr17 does not in fact converge very well. The basic algorithm without shifting18 actually works better. Even better is the shifting strategy known under the name of Wilkinson (James Hardy Wilkinson, 1919-1986, was a giant in the 20th century history of numerical algorithms)19.
Suggested experiments
1. Change the stiffness of the link spring to k = 0. Does the QR iteration converge? Try the variants with shifting.
M = [m 0 0; 0 m 0; 0 0 m] ,   K = [2k -k 0; -k 2k -k; 0 -k k]

Here k = 61, and all the masses are equal, m = 1.3. For instance, we can check how many natural frequencies lie below 0.5 Hz. We form the matrix

A = K - (2π · 0.5)² M
Using the MATLAB LU factorization [L,U,P] = lu(A) yields

L = [1 0 0; -0.559 1 0; 0 -0.812 1] ,  U = [109 -61 0; 0 75.1 -61; 0 0 -1.39] ,  P = [1 0 0; 0 1 0; 0 0 1]
Since there is only one negative number on the diagonal of U (that is, on the diagonal of the matrix D from the LDL^T matrix factorization) we conclude that only one natural frequency lies below 0.5 Hz.
Next we check how many natural frequencies lie below 2.0 Hz. The factorization gives

L = [1 0 0; 0 1 0; 0.732 0.633 1] ,  U = [-83.3 -61 0; 0 -61 -144; 0 0 30.3] ,  P = [1 0 0; 0 0 1; 0 1 0] ,
which we compare with the frequencies given in Section 5.4 and conclude that something is wrong: there are two negative numbers on the diagonal, but all three frequencies are in fact below 2.0 Hz. The reason is that once partial pivoting introduces a non-identity permutation matrix, so that

LU = PA ,

the congruence that the Sylvester theorem relies upon is no longer applicable. In fact, the product LU is no longer symmetric and it is not possible to factor it into LDL^T. The pivoting has to be done carefully to preserve the symmetry of the resulting product of factors. For instance, the MATLAB
function ldl produces directly the LDL^T factorization and returns the psychologically lower-triangular factor L. We can write [L,D] = ldl(A), with the result

L = [1 0 0; 0.732 0.423 1; 0 1 0] ,  D = [-83.3 0 0; 0 -144 0; 0 0 -12.8] .
Now we see three negative numbers on the diagonal of D which indeed corresponds to our prior
knowledge that all three frequencies are below 2.0 Hz.
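The slicing count can be reproduced numerically. The Python sketch below (function names are ours) counts the negative pivots of an unpivoted LDL^T, which preserves the congruence with A so that the Sylvester theorem applies; skipping the pivoting is admissible here because no zero pivots are encountered:

```python
import numpy as np

k, m = 61.0, 1.3
K = np.array([[2*k, -k, 0.0], [-k, 2*k, -k], [0.0, -k, k]])
M = m * np.eye(3)

def ldl_pivots(A):
    """Unpivoted LDL^T of a symmetric matrix: return the diagonal of D (assumes nonzero pivots)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    d = np.zeros(n)
    for i in range(n):
        d[i] = A[i, i]                       # pivot after previous eliminations
        for j in range(i + 1, n):
            lji = A[j, i] / d[i]
            A[j, i:] -= lji * A[i, i:]       # eliminate below the pivot
    return d

def count_below(f_hz):
    """Sylvester inertia count: frequencies below f_hz = negative pivots of K - omega^2 M."""
    omega = 2.0 * np.pi * f_hz
    return int(np.sum(ldl_pivots(K - omega**2 * M) < 0))
```

With the book's values k = 61, m = 1.3 this reproduces the conclusions above: one frequency below 0.5 Hz, three below 2.0 Hz.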
by defining R = L D^(1/2), so that R R^T = K. We see that we need to work with a positive definite stiffness matrix so that the diagonal matrix D will give real square roots. With the Cholesky factors at hand we transform the generalized eigenvalue problem K z = ω² M z as

K z = R R^T z = ω² M z

and by introducing y = R^T z we obtain the standard eigenvalue problem

(1/ω²) y = R^(-1) M R^(-T) y .
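The transformation can be checked numerically for the three-mass example. This is a Python sketch (variable names are ours); the eigenvalues of the transformed standard problem must agree with the generalized eigenvalues of (K, M):

```python
import numpy as np

k, m = 61.0, 1.3
K = np.array([[2*k, -k, 0.0], [-k, 2*k, -k], [0.0, -k, k]])
M = m * np.eye(3)

R = np.linalg.cholesky(K)          # K = R R^T, R lower triangular
X = np.linalg.solve(R, M)          # R^{-1} M
B = np.linalg.solve(R, X.T).T      # R^{-1} M R^{-T}, symmetric

mu = np.linalg.eigvalsh(B)         # mu = 1 / omega^2
omega2 = np.sort(1.0 / mu)

# Reference: generalized eigenvalues of (K, M); M = m*I makes this simple
omega2_ref = np.sort(np.linalg.eigvalsh(K) / m)
```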
If the stiffness happens to be singular, but the mass matrix is not, the roles of these two matrices may be reversed.
For larger generalized eigenvalue problems (in vibration analysis it is not uncommon nowadays to work with millions of equations) the conversion to the standard eigenvalue problem would be too expensive. Moreover, we are typically not interested in all the eigenvalues anyway, and a better-suited technique will help us extract a few eigenvalues of interest, typically the lowest ones.
The inverse iteration method (8.2) is easily adapted to the generalized eigenvalue problem. The
simultaneous inverse iteration for the generalized eigenvalue problem is written as
W^(0) given
for k = 1, 2, ...
    K W^(k) = M W^(k-1)   % solve
    Q R = W^(k)           % compute QR factorization
    W^(k) = Q
The eigenvalues may be estimated during the iteration using the Rayleigh quotient. For the generalized eigenvalue problem the Rayleigh quotient is computed from

ω² M z = K z   =>   ω² = (z^T K z) / (z^T M z) .
The MATLAB code for the generalized eigenvalue problem solved with block inverse power
iteration is given below: 20
20 See: aetna/EigenvalueProblems/gepbinvpwr2.m
function [lambda,v,converged]=gepbinvpwr2(K,M,v,tol,maxiter)
% ... some error checking omitted
nvecs=size(v,2);% How many eigenvalues?
plambda=Inf+zeros(nvecs,1);% previous eigenvalue
lambda =plambda;
[L,U,p] =lu(K,'vector');
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\(M*v(p,:))); % update vector
    for j=1:nvecs
        lambda(j)=(v(:,j)'*K*v(:,j))/(v(:,j)'*M*v(:,j));% Rayleigh quotient
    end
    [v,r]=qr(u,0);% economy factorization
    if (norm(lambda-plambda)/norm(lambda)<tol)
        converged = true; break;
    end
    plambda=lambda;
end
end
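A Python sketch of the same block inverse power iteration follows (names are ours; for brevity it re-solves with K each pass instead of reusing an LU factorization, which is what the MATLAB code above does):

```python
import numpy as np

def gep_block_inv_power(K, M, V, tol=1e-9, maxiter=200):
    """Block inverse power iteration for K z = lambda M z; targets the smallest eigenvalues."""
    lam_prev = np.full(V.shape[1], np.inf)
    lam = lam_prev
    for _ in range(maxiter):
        U = np.linalg.solve(K, M @ V)                               # solve K U = M V
        lam = np.array([(v @ K @ v) / (v @ M @ v) for v in V.T])    # Rayleigh quotients
        V, _ = np.linalg.qr(U)                                      # re-orthonormalize the block
        if np.linalg.norm(lam - lam_prev) / np.linalg.norm(lam) < tol:
            return lam, V, True
        lam_prev = lam
    return lam, V, False

k, m = 61.0, 1.3
K = np.array([[2*k, -k, 0.0], [-k, 2*k, -k], [0.0, -k, k]])
M = m * np.eye(3)
rng = np.random.default_rng(0)
lam, V, ok = gep_block_inv_power(K, M, rng.standard_normal((3, 2)))
```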
Illustration 9
Apply the block inverse power iteration method for the generalized eigenvalue problem to the structure described in the Illustration on page 164.
The algorithm gepbinvpwr2 converges as well as the regular block inverse power iteration for the
standard eigenvalue problem.21 No surprise, given how easy it was to transition from the generalized
to the standard eigenvalue problem for this particular mass matrix.
8.7.1 Shifting
Shifting could also be introduced into the block inverse power iteration for the generalized eigenvalue problem, not only to speed up convergence to the smallest eigenvalue by making it more dominant, but also for precisely the opposite: to make the smallest eigenvalue less dominant. What we mean by this is that if a structure contains rigid body modes (the structure can move without experiencing any resisting forces), it has at least one zero frequency of vibration. Such a frequency is very strongly dominant in the inverse power iteration (1/0!). The effect of this dominance cannot be exploited, however, since the matrix K is not invertible. This would make the block inverse power iteration algorithm (page 180) impossible.
Shifting can help. To the eigenvalue problem (with λ = ω²)

λ M z = K z

we add the term σ M z on both sides

λ M z + σ M z = σ M z + K z

and obtain

(λ + σ) M z = (σ M + K) z .
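A quick numerical check of the shifting trick, in Python, using the singular stiffness matrix introduced below (the book's values k = 61, m = 1.3; the shift 0.2 is arbitrary):

```python
import numpy as np

k, m, sigma = 61.0, 1.3, 0.2
K = np.array([[k, 0.0, 0.0], [0.0, k, -k], [0.0, -k, k]])  # singular: rigid body mode
M = m * np.eye(3)

# K alone cannot be inverted, but sigma*M + K can: solve the shifted problem instead.
# With M = m*I the shifted standard matrix sigma*I + K/m is symmetric.
lam_shifted = np.linalg.eigvalsh(np.linalg.solve(M, sigma * M + K)).min()
lam = lam_shifted - sigma   # remove the shift: the rigid body mode, ~0
```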
K = [k 0 0; 0 k -k; 0 -k k]
Equivalently, we say that the structure has a rigid body mode. The frequency corresponding to the rigid body mode is zero. Figure 8.14 shows this rigid body mode as a translation of the masses 2, 3. Mass 1 does not displace.22 Clearly, all springs maintain their unstressed length: the rigid body motion does not induce any forces in the structure.

Fig. 8.14. Structure with a singular stiffness matrix. The rigid body mode (ω = 0).
Now we shall try to apply the block inverse power iteration with gepbinvpwr2.23 The script n3_sing_undamped_modes_MK224 invokes gepbinvpwr2 to obtain the first mode without shifting, and the resulting eigenvector and eigenvalue are worthless. The eigenvector in fact contains not-a-numbers (NaN). Why? Because the stiffness matrix is singular, its LU factorization should not exist. The MATLAB function lu (put a breakpoint inside gepbinvpwr2) returns the factors as
K>> L,U
L =
     1     0     0
     0     1     0
     0    -1     1
U =
    61     0     0
     0    61   -61
     0     0     0
The 0 in the element 3,3 of the U factor is a problem: at some point we will have to divide by it. Hence the not-a-numbers.
The script n3_sing_undamped_modes_MK325 invokes gepbinvpwr2 to obtain the first mode with shifting. The shift is guessed as 0.2. This number is arbitrary, but it should be sufficiently small
to avoid getting close to the first nonzero frequency. The script shows how we invoke gepbinvpwr2 for a stiffness matrix that is modified by the addition of a multiple of the mass matrix to make it non-singular.
[M,C,K,A,k1,k2,k3,c1,c2,c3] = properties_sing_undamped;
v=rand(size(M,1),1);% initial guess of the eigenvector
tol=1e-9; maxiter =4;% tolerance, how many iterations allowed?
sigma = 0.2;% this is the shift
[lambda,v,converged]=gepbinvpwr2(K+sigma*M,M,v,tol,maxiter)
lambda =lambda-sigma % subtract the shift to get the original eigenvalue
The output evidently shows that the iteration was successful.
lambda =
    0.2000       % shifted
v =
   -0.0000
   -0.7071
   -0.7071
converged =
     1
lambda =
   6.3838e-016   % shift removed: ~0
Suggested experiments
For the structure from Illustration on page 164:
1. Change the stiffness of the link spring to k = 0. Does the block inverse power iteration converge?
2. Use the spectrum slicing approach to check the number of eigenvalues located by the power
iteration above.
9
Unconstrained Optimization
Summary
1. A number of basic techniques in structural analysis rely on results from the area of optimization.
Main idea: Equilibrium of structures and minimization of potential functions are intimately tied.
Equilibrium equations are the conditions of the minimum.
2. Stability of structures is connected to the classification of the stiffness matrix. Main idea: positive definite matrices correspond to stable structures.
3. The line search is a basic tool in minimization. Main idea: Monitor the gradient of the objective
function. Minimum (extremum) is indicated when the gradient becomes orthogonal to the line
search direction.
4. Solving a system of linear equations and minimizing an objective function are two roads to the
same destination. Main idea: We show that minimizing the so-called quadratic form solves a
system of linear algebraic equations.
5. The method of steepest descent may be improved by the method of conjugate gradients. Main
idea: keep track of directions of past line searches.
6. Direct versus iterative methods. Main idea: direct and iterative methods are rather different in their properties (cost vs. accuracy). Iterative algorithms seem to be becoming more and more important in modern software.
7. Least-squares fitting is an important example of optimization.
Find x* such that f(x*) <= f(x) for all x .   (9.1)

This can be easily changed into a maximization task by flipping the objective function about the horizontal axis (i.e. changing its sign) and seeking the maximum as

Find x* such that f(x*) >= f(x) for all x .   (9.2)
DE = (1/2) k s² .

Using a matrix expression (for reasons that will become clear later), the stretch of the spring can be expressed as

s = [cos 30°, sin 30°] [x1; x2] .

The energy stored in the spring can also be written as

DE = (1/2) k s^T s ,

where by s^T we mean the transpose (never mind that the transpose of a scalar doesn't do anything). Substituting for the stretch we obtain

DE = (1/2) k ( [cos 30°, sin 30°] [x1; x2] )^T ( [cos 30°, sin 30°] [x1; x2] ) ,

which gives in short order

DE = (1/2) k [x1, x2] [cos 30°; sin 30°] [cos 30°, sin 30°] [x1; x2] .

If we define the matrix

K = k [cos 30°; sin 30°] [cos 30°, sin 30°] = k [cos 30° cos 30°, cos 30° sin 30°; sin 30° cos 30°, sin 30° sin 30°] ,   (9.3)

the deformation energy is the quadratic form

DE = (1/2) x^T K x .   (9.4)
Fig. 9.2. Static equilibrium of particle suspended on a spring. The surface of the deformation energy.
Illustration 1
Modify the code below to display the surface in Figure 9.2. The second and the last line need to be modified to reflect a particular objective function. The last line is supposed to draw arrows representing the gradient.

[x,y]=meshgrid(-10:10,-10:10);
z=x.*y; % function
surf(x,y,z,'Edgecolor','none'); hold on
contour3(x,y,z,20,'w'); hold on
quiver(x,y,y,x); % gradient
Fig. 9.3. Static equilibrium of particle suspended on two springs.
Figure 9.4 shows the variation of the deformation energy as a function of x1, x2: the only point where the DE assumes the value of zero is at x1 = 0, x2 = 0. Everywhere else the deformation energy is positive. This means that whenever the displacements are different from zero, the springs will store nonzero energy. This is the hallmark of stable structures.
Fig. 9.4. Static equilibrium of particle suspended on a spring. The surface of the deformation energy.

Matrices A that have the property

x^T A x > 0

for all x ≠ 0, and for which

x^T A x = 0

only for x = 0, are called positive definite. Stable structures have positive definite stiffness matrices. Positive definite matrices are nonsingular (they are regular). This is a fact well worth retaining.
Note that the stiffness matrix is symmetric. An important property of quadratic forms is that only symmetric matrices contribute to the value of the quadratic form. We can show that as follows: For the moment assume that A is in general unsymmetric. The quadratic form is a scalar (real number), and as such it is equal to its transpose

x^T A x = (x^T A x)^T .

Therefore, we can write

x^T A x = x^T A^T x

or

x^T A x - x^T A^T x = x^T (A - A^T) x = 0 .   (9.7)

The general matrix A may be written as a sum of a symmetric matrix and a skew-symmetric (anti-symmetric) matrix

A = (1/2)(A + A^T) + (1/2)(A - A^T) .

In the expression (9.7) we recognize the anti-symmetric part of A. Therefore, we conclude that the anti-symmetric part does not contribute to the quadratic form, only the symmetric part does. Therefore, normally we work only with symmetric matrices in quadratic forms.
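This is easy to verify numerically; a Python sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))      # general, unsymmetric matrix
S = 0.5 * (A + A.T)                  # symmetric part
W = 0.5 * (A - A.T)                  # skew-symmetric part
x = rng.standard_normal(4)

q_full = x @ A @ x
q_sym = x @ S @ x                    # the symmetric part carries the whole value
q_skew = x @ W @ x                   # the skew part contributes nothing
```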
Consider how to compute the derivative with respect to x of the product a^T b: both vectors need to be differentiated in turn using the chain rule. So that we don't have to differentiate a transpose of the vector a, we take advantage of the fact that the result of the product a^T b is a scalar, which may be transposed at will without changing anything

a^T b = b^T a .

To differentiate the vector b in the product a^T b with respect to x is straightforward: it yields

a^T (∂b/∂x) .

To differentiate the vector a in the product a^T b with respect to x, we first transpose the product to get b^T a, and then we differentiate to obtain

b^T (∂a/∂x) .

So the product a^T b is differentiated as

∂(a^T b)/∂x = a^T (∂b/∂x) + b^T (∂a/∂x) .

Now back to the quadratic function. The quadratic term may be identified with the above product of vectors if we write

a = x ,  b = A x ,
x^T A x = z^T D z = Σ_{i=1:n} D_ii z_i² .

The last expression is going to be positive for any combination of z_i only if D_ii > 0 for all i. So D_ii > 0 for all i guarantees that the quadratic form is positive definite.
If any of the D_ii were equal to zero (to get this factorization if any of the elements in the pivot position was zero would be tricky!) and all the others were positive, the matrix would be positive semi-definite (and singular). (Just for completeness, if the pivots were a mixture of positive and negative numbers, the matrix would be indefinite.)
DE = (1/2) K x² ,

where K is the stiffness constant of the spring. The potential energy of the applied force is defined as

W = -L x .

The total energy is defined as

TE = DE + W .   (9.9)
The solution for the equilibrium displacement is determined by the principle of minimum total energy: for the equilibrium displacement x* the total energy assumes the smallest possible value

x* = arg min TE .   (9.10)

(It should be read: find x* as such argument that minimizes TE.) This is an unconstrained minimization problem. The minimum of the total energy is distinguished by the condition that the slope at the minimum is zero:

dTE/dx = d/dx ( (1/2) K x² - L x ) = K x - L = 0 .

This condition is seen to be simply the equation of equilibrium, whose solution indeed is the equilibrium displacement.
The meaning of equation (9.9) and of the minimization problem (9.10) is illustrated in Figure 9.5. The deformation energy is represented by a parabolic arc (dashed line), which attains zero value (that is its minimum) at zero displacement. The potential energy of the external force is represented by the straight dashed line. The sum of the deformation energy and the energy of the external force tilts the dashed parabola into the solid line parabola, the total energy. That shifts the original minimum on the dashed parabola into the new minimum on the solid parabola (negative value) at x*. The minimum is easily seen to be

min TE = (1/2) x* K x* - L x* = (1/2) x* K x* - K x* x* = -(1/2) x* K x* = -(1/2) x* L .
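For the single spring, the minimum can be confirmed directly; a Python sketch with made-up values of K and L:

```python
import numpy as np

K, L = 4.0, 6.0                        # illustrative stiffness and load
TE = lambda x: 0.5 * K * x**2 - L * x  # total energy

x_star = L / K                         # equilibrium: K x = L
min_TE = TE(x_star)                    # should equal -(1/2) x* L

# Brute-force check on a grid
xs = np.linspace(-1.0, 4.0, 1001)
x_grid_min = xs[np.argmin(TE(xs))]
```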
Fig. 9.5. The energies as functions of the displacement x: deformation energy DE, potential energy of the force W, and total energy TE, with the minimum of TE at x*.
Fig. 9.6. Static equilibrium of particle suspended on two springs. The surface of total energy.
x* = arg min_x TE .   (9.12)
Since all candidate displacements x may be considered in the minimization without any restrictions,
the minimization problem is called unconstrained.
Fig. 9.7. Walk towards the minimum of the objective function. Starting point is p0 , the walk proceeds
against the direction of the gradient.
∂TE/∂x = ∂/∂x ( (1/2) x^T K x - L^T x ) .

From (9.8) we have

∂TE/∂x = (1/2) x^T ( K + K^T ) - L^T .

Since the matrix K is symmetric, we can simplify

∂/∂x ( (1/2) x^T K x ) = x^T K

and finally

∂TE/∂x = x^T K - L^T .   (9.13)
r = -(∇TE)^T = L - K x .

The vector r is called the residual. We make it into a column matrix in order for the addition of the vectors x and r to make sense.
Next we have to find out how far to go. One possible strategy is to go as far as possible, meaning that we would follow along a given direction until we have reached the lowest possible value of the objective function starting from a given point in a given direction. Denoting the starting point x0, we write the motion in the direction r as

x = x0 + α r .

The lowest point will be reached when we stop descending; if we went any further we would start ascending on the surface of the objective function. We are moving along a direction which subtends various angles with the gradient at any given point. When we are descending we are moving against the direction of the gradient. This would be expressed as (see Figure 9.8, and observe the gradient of function f at point p2)

∇f(p2) r(p0) < 0 .

Note that the result of the multiplication ∇f(p2) r (row matrix with one row times column matrix with one column) is a number, proportional to the cosine of the angle that these two arrows subtend.
Fig. 9.8. Walk to find the minimum of the objective function along a given direction. Starting point is p0, the walk proceeds in the direction of r(p0) towards the point p1.
On the other hand, when we are ascending we are moving broadly in the same direction in which the gradient points, and we have (see Figure 9.8, and observe the gradient of function f at point p3)

∇f(p3) r(p0) > 0 .

Finally, we must conclude that when we are standing at a point from which to move in any direction would mean ascending, the path at that point must be perpendicular to the direction of the gradient at that point (see Figure 9.8, observe the gradient of function f at point p4)

∇f(p4) r(p0) = 0 .

(Remark: This may be an oversimplification for more general objective functions. There is also the possibility that a part of the path from p0 to p1 runs level: neither descending nor ascending.)
The condition that the gradient (9.13) at the lowest point x must be orthogonal to the direction of descent r can be written down as

∇f(x) r = ( x^T K - L^T ) r = 0

and writing x = x0 + α r we obtain

∇f(x) r = ( x0^T K + α r^T K - L^T ) r = 0 .

Further, we recognize in x0^T K - L^T = -r^T the negative residual, so that we arrive at

α = ( r^T r ) / ( r^T K r ) .

This is really the entire algorithm of steepest descent applied to the quadratic-form objective function (9.11): improve the location of the lowest value of the objective function by moving from the starting point x0 to the new point x

x = x0 + ( ( r^T r ) / ( r^T K r ) ) r ,   r = L - K x0

and then reset the starting point x0 = x. Such an algorithm is concisely written in MATLAB as
and then reset the starting point x0 = x. Such an algorithm is concisely written in MATLAB as
for iter=1:maxiter
r = b-A*x0;
x = x0 + (dot(r,r)/dot(A*r,r))* r;
x0 = x;
end
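A self-contained Python version of the same loop (function and variable names are ours), applied to a small symmetric positive definite system:

```python
import numpy as np

def steepest_descent(A, b, x0, maxiter=300):
    """Steepest descent for the quadratic form 1/2 x^T A x - b^T x, A symmetric positive definite."""
    x = x0.astype(float).copy()
    for _ in range(maxiter):
        r = b - A @ x                  # residual = negative gradient
        denom = r @ (A @ r)
        if denom == 0.0:               # converged to machine precision
            break
        x = x + (r @ r) / denom * r    # exact line search along r
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = steepest_descent(A, b, np.zeros(2))
```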
The steepest descent solver for quadratic objective functions is provided in the toolbox as SteepestAxb.1

Illustration 3

In Figure 9.9 we apply the solver SteepestAxb to the two-spring equilibrium problem from Section 9.8. Given that this is a two-unknowns system of linear algebraic equations, it takes a lot of iterations to arrive at a solution: inefficient! So why would we bother with this method? It does have some redeeming characteristics. To mention one, it requires very little memory. More about this later in Section 9.12.
Fig. 9.9. Convergence in the norm of the solution error for the steepest descent algorithm applied to the
two-spring equilibrium problem.
Fig. 9.10. Walk towards the minimum of the quadratic-form (total energy) objective function. Starting
point is p0 , the walk proceeds against the direction of the gradient.
effort is wasted by zigzagging towards the minimum, with each step going too much sideways with too little progress in the direction of the minimum.
We realize that there are only two independent directions in the plane x1, x2. The first direction is d(0) = -∇f(x(0))^T, the direction for the first descent step. Therefore, it must be possible to find a direction for the second step d(1) that would lead directly to the minimum. The reason is that at the point x(2) (that is, at the minimum) the gradient must vanish, which will make it perpendicular to any vector, including the first and second descent direction

∇f(x(2)) d(0) = 0 ,   ∇f(x(2)) d(1) = 0 ,   since ∇f(x(2)) = 0 at the minimum x(2) .

The second orthogonality condition, that is ∇f(x(2)) d(1) = 0, occurs naturally as a stopping condition for the step along d(1) (we go as far downhill as possible). We write

x(2) = x(1) + α d(1)

and the second condition will allow us to express
1 See: aetna/SteepestDescent/SteepestAxb.m
α = -( ∇f(x(1)) d(1) ) / ( d(1)^T K d(1) ) ,

and substituting into the first orthogonality condition leads to the requirement

d(1)^T K d(0) = 0 .   (9.14)
From this condition we can determine the second descent direction. We can see that it must be a combination of the first direction d(0) and of -∇f(x(1))^T: these two vectors are orthogonal and therefore they span the plane. In other words any vector can be expressed as a linear combination of these two. Thus we write

d(1) = -∇f(x(1))^T + β d(0) .

From (9.14) we obtain

β = ( ∇f(x(1)) K d(0) ) / ( d(0)^T K d(0) ) .
To show that the solution can indeed be obtained in just two steps in this case is possible with MATLAB symbolic math:2

K =[sym('K11'),sym('K12');sym('K12'),sym('K22')];% stiffness
L =[sym('L1');sym('L2')];% load
X0 =[sym('X01');sym('X02')];% starting point
g=@(x)(x'*K-L');% compute gradient
a=@(x,d)(-g(x)*d)/(d'*K*d);% compute alpha
b=@(x,d)(g(x)*K*d)/(d'*K*d);% compute beta
d0 =-g(X0)';% first descent direction
X1 =X0 +a(X0,d0)*d0;% second point
d1 =b(X1,d0)*d0-g(X1)';% second descent direction
X2 =X1 +a(X1,d1)*d1;% final point
simplify(g(X2))% gradient at final point ~ 0

The gradient at X2 indeed comes out as the zero matrix. (Word of caution: the symbolic computation may take a while; computer-assisted algebra is not very efficient.)
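The same two-step construction can also be verified numerically. This Python sketch substitutes made-up numbers for the symbols; after the second step the gradient vanishes to machine precision:

```python
import numpy as np

K = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite "stiffness"
L = np.array([1.0, 2.0])
g = lambda x: x @ K - L                   # gradient (row vector)
a = lambda x, d: -(g(x) @ d) / (d @ K @ d)   # step length alpha
b = lambda x, d: (g(x) @ K @ d) / (d @ K @ d)  # direction correction beta

x0 = np.array([10.0, -7.0])
d0 = -g(x0)                               # first descent direction
x1 = x0 + a(x0, d0) * d0                  # second point
d1 = b(x1, d0) * d0 - g(x1)               # second (K-conjugate) direction
x2 = x1 + a(x1, d1) * d1                  # final point: the minimum
```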
Fig. 9.11. Walk towards the minimum of the quadratic-form (total energy) objective function. Starting
point is p0 , the walk proceeds in the directions determined to reach the minimum in just two steps.
We will again make the gradient at the point x(k+1) orthogonal to the two directions d(k-1) and d(k),

∇f(x(k+1)) d(k) = 0 ,   ∇f(x(k+1)) d(k-1) = 0 ,

only this time the gradient does not have to vanish identically at x(k+1) since there are many vectors to which it could be orthogonal without having to become identically zero. First we will work out the gradient at the point x(k+1)

∇f(x(k+1)) = x(k+1)^T K - L^T = ( x(k) + α d(k) )^T K - L^T = x(k)^T K + α d(k)^T K - L^T ,

which results in

α = -( ∇f(x(k)) d(k) ) / ( d(k)^T K d(k) )

and in the condition

d(k)^T K d(k-1) = 0 .   (9.15)
We say that the directions d(k-1) and d(k) are K-orthogonal or K-conjugate (or just conjugate directions for short).
So that we can determine the new direction d(k) to be K-conjugate to the old one d(k-1), we assume the new descent direction is a combination of the direction of steepest descent -∇f(x(k))^T and the old direction d(k-1)

d(k) = -∇f(x(k))^T + β d(k-1) .

Substituting into the K-conjugate condition (9.15) we obtain

β = ( ∇f(x(k)) K d(k-1) ) / ( d(k-1)^T K d(k-1) ) .
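Putting the recurrences together gives the conjugate gradient loop. The toolbox version is ConjGradAxb.m; the simplified Python translation below is ours. For a symmetric positive definite K of size n, the iteration terminates at the exact solution in at most n steps (up to roundoff):

```python
import numpy as np

def conjugate_gradients(K, L, x0):
    """CG for 1/2 x^T K x - L^T x: each new direction is K-conjugate to the previous one."""
    x = x0.astype(float).copy()
    g = x @ K - L                     # gradient (row vector)
    d = -g                            # first direction: steepest descent
    for _ in range(len(L)):           # exact convergence in at most n steps
        dKd = d @ K @ d
        if dKd == 0.0:                # already converged
            break
        alpha = -(g @ d) / dKd        # exact line search
        x = x + alpha * d
        g = x @ K - L
        beta = (g @ K @ d) / dKd      # K-conjugacy correction
        d = -g + beta * d
    return x

rng = np.random.default_rng(3)
G = rng.standard_normal((6, 6))
K = G @ G.T + 6.0 * np.eye(6)         # symmetric positive definite
L = rng.standard_normal(6)
x = conjugate_gradients(K, L, np.zeros(6))
```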
3 See: aetna/SteepestDescent/ConjGradAxb.m
4 See: aetna/SteepestDescent/test_cg_1.m
201
10
10
10
10
10
15
10
Fig. 9.12. Comparison of the convergence of the steepest-descent algorithm (dashed line) and the Conjugate
Gradients algorithm (solid line). Matrix: poisson(18), 324 unknowns.
Fig. 9.13. Solution obtained with the Steepest Descent algorithm for the matrix gallery('poisson',18), 324 unknowns, using various numbers of iterations (iter = 6, 16, 32, 65, 108, 162).
Fig. 9.14. Solution obtained with the Conjugate Gradients algorithm for the matrix poisson(18), 324 unknowns, using various numbers of iterations (iter = 3, 6, 16).
Fig. 9.15. Comparison of effort versus error for direct and iterative methods.
Fig. 9.16. Stainless steel 303 round coupon, and the load-deflection diagram.
means that substituting the displacement wk and the force measured for that displacement Fk into the above relationship will not render it an equality; something will be left over: we will call it the residual

Fk - F(wk) = Fk - (p1 wk + p2) = rk .

This may be written in matrix form for all the data points as

[F1; F2; ...; Fn] - [w1, 1; w2, 1; ...; wn, 1] [p1; p2] = [r1; r2; ...; rn] .
For convenience, using the measured data w1, w2, ..., wn and F1, F2, ..., Fn we will define the matrix

A = [w1, 1; w2, 1; ...; wn, 1]

and the vector

b = [F1; F2; ...; Fn] .

The vector of the parameters of the linear fit is

u = [p1; p2] .

The vector of the residuals (also called the error of the linear fit) is

e = [r1; r2; ...; rn] .

So we write

b - A u = e ,

where the matrix A has more rows than columns. This is the reason why it will not be possible to make the error exactly zero in general: there are more equations than unknowns.
We realize that in default of being able to zero out the error, we have to go for the next best thing, which is to somehow minimize the magnitude of the error. In terms of the norm of the vector e it means to find the minimum of the objective function e^T e .
A =[w, ones(length(w),1)];
pl =(A'*A)\(A'*F)

pl =
   1.0e+004 *
   3.799600652696673
  -0.042276210924081
So the stiffness of the coupon based on the linear fit is approximately 37996 lb/in. Continuing our investigation, we realize that the data points appear to lie on an S-shaped curve, which suggests a linear regression with a cubic polynomial. This is easily accommodated in our model by taking

F(w) = p1 w³ + p2 w² + p3 w + p4 .

The matrix A becomes

A = [w1³, w1², w1, 1; w2³, w2², w2, 1; ...; wn³, wn², wn, 1]
and the solution is
A =[w.^3, w.^2, w, ones(length(w),1)];
pc =(A'*A)\(A'*F)

pc =
   1.0e+006 *
  -7.000925471829832
   0.862362550370550
   0.006168259997214
  -0.000087120026727
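The same normal-equations fits can be reproduced in Python on synthetic data. The data below are made up to resemble the coupon test, not the book's measurements:

```python
import numpy as np

# Synthetic load-deflection data with noise (illustrative only)
w = np.linspace(0.005, 0.07, 40)
F_true = lambda t: -7.0e6 * t**3 + 8.6e5 * t**2 + 6.2e3 * t - 87.0
rng = np.random.default_rng(4)
F = F_true(w) + rng.normal(0.0, 5.0, w.shape)

# Linear fit via the normal equations, mirroring the MATLAB snippet
A1 = np.column_stack([w, np.ones_like(w)])
pl = np.linalg.solve(A1.T @ A1, A1.T @ F)

# Cubic fit
A3 = np.column_stack([w**3, w**2, w, np.ones_like(w)])
pc = np.linalg.solve(A3.T @ A3, A3.T @ F)

res_lin = F - A1 @ pl
res_cub = F - A3 @ pc
```

The cubic residual is smaller in norm, as the text observes for the real data.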
Fig. 9.17. Stainless steel 303 round coupon, and the load-deflection diagram. Linear polynomial fit on the left, cubic polynomial fit on the right.

Figure 9.17 shows the linear and cubic polynomial fit of the experimental data. The difference is somewhat inconspicuous, but plotting the residuals is quite enlightening. Figure 9.18 shows the
residual for the linear and cubic polynomial fit. The linear polynomial fit residual shows a clear bias in the form of a cubic curve. This indicates that a cubic polynomial would be a better fit. That is indeed true, as both the magnitude decreased and the bias was removed from the cubic-fit residual.
Fig. 9.18. Stainless steel 303 round coupon load-deflection diagram. Linear polynomial fit residual in dashed line, cubic polynomial fit residual in solid line.
Figure 9.19 shows the variation of the stiffness coefficient as a function of the deflection for both the linear and the cubic polynomial fit. It may be appreciated that the stiffness varies by a substantial amount when determined from the cubic fit, while it is constant based on the linear fit.
Fig. 9.19. Stainless steel 303 round coupon load-deflection diagram. Stiffness coefficient as a function of deflection. Dashed line: from linear polynomial fit, solid line: from cubic polynomial fit.
they cannot serve as such basis vectors, and the linear combination of the columns of the matrix A is only going to cover a subset of R^n. Inspect Figure 9.20: the columns of the matrix A generate the gray plane as a graphical representation of the subset of R^n. The vector b is of course not confined to the plane and somehow sticks out of it. The difference e between b and Au also sticks out. To make the error e as small as possible (as short as possible) then amounts to making it orthogonal to the gray plane Au. The shortest possible error e = b - Au* will be orthogonal to all possible vectors in the gray plane, Au, as expressed here

(Au)^T e = 0 .

Substituting we obtain

(Au)^T (b - Au*) = 0

or

u^T A^T (b - Au*) = u^T ( A^T b - A^T A u* ) = 0 .

When we say "for all possible vectors in the gray plane, Au", we mean for all parameters u, and since the above equation must be true for all u, we have again the normal equations

A^T b - A^T A u* = 0 .

The solution to the normal equations are such parameters u* that they make the error of the least-squares fitting as small as possible.
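The orthogonality argument is easy to confirm numerically; a Python sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 3))           # more rows than columns
b = rng.standard_normal(8)

u_star = np.linalg.lstsq(A, b, rcond=None)[0]
e = b - A @ u_star                        # shortest possible error

ortho = A.T @ e                           # should vanish: e is orthogonal to range(A)
```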
Index
drag force, 5
dry friction, 21
dynamic viscosity, 6
eigenvalue, 39, 51
inverse, 167
positive, 79
shifting, 169
vibrating system, 164
eigenvalue problem, 34
generalized, 180
matrix, 39
standard, 180
eigenvalues of inverse matrix, 167
eigenvector, 39
elimination matrix, 137
energy, 186
energy minimization, 192
energy of deformation, 186
energy surface, 186
equation
explicit, 15
homogeneous, 7
implicit, 15
inhomogeneous, 7
partial differential, 18
equilibrium, 132
error, 100
approximate, 24
arithmetic, 110
machine-representation, 110
round-off, 110
true, 24
truncation, 110
error estimate, 103
Euclidean norm, 151
Euler formula, 49
Euler's formula, 45, 80
Euler's method, 9
expansion
modal, 161
experimental scatter, 202
explicit equation, 15
exponent, 112
exponential
matrix, 70
exponential as solution, 60
extremum, 204
factorization
Cholesky, 180
LDLT, 144, 179, 191
LU, 134, 179
QR, 156
QR economy, 207
Schur, 175
FFT, 86, 91
fill-in, 145
finite difference stencil, 109
first-order form, 80
force residual, 132
forced vibration, 89
forward Euler, 51
forward difference, 129
forward Euler, 16, 30, 52, 102
approximation of derivatives, 108
forward substitution, 135
Fourier series, 86
Fourier transform, 86
discrete, 86
free-vibration response, 82
frequency content, 85
friction coefficient, 23
fundamental frequency, 87
Gauss-Seidel, 145
generalized eigenvalue problem, 78, 79, 180
global error, 104, 105
global minimum, 194
governing equation, 7
gradient, 189, 190, 194, 195
Hermitian matrix, 175
Hessian matrix, 190
homogeneous equation, 7
Householder transformation, 157
IBVP, 18
identity matrix, 136
ill conditioned, 148
ill conditioned basis, 172
ill conditioning, 148
impedance matrix, 144
implicit algorithm, 119
implicit equation, 15
in-place factorization, 138
indefinite matrix, 191
independent variable, 7
induced matrix norm, 150
Inf, 112
inhomogeneous equation, 7
initial boundary value problem, 18
initial condition, 7, 18
initial conditions, 59
initial value problem, 7
instability, 15
integer, 111
integral
diff, 10
double, 112
eig, 40, 166
eps, 114
eval, 66
expm, 75
exp, 65
ezplot, 40
fft, 88
fzero, 17, 83, 125
int8, 112
intmax, 111
intmin, 111
ldl, 180
linspace, 8, 54
lu, 139
meshgrid, 54
norm, 152
numjac, 129, 134
ode23, 11, 31
ode45, 20, 31, 46
odeset, 16
realmax, 113
realmin, 113
single, 115
solve, 79
sort, 81
spy, 144
surf, 53
syms, 40
taylor, 97
tril, 139
triu, 139
vectorize, 66, 98
anonymous function, 10
function handle, 17
matrix
commuting, 75
condition number, 147
congruence, 179
conjugate transpose, 175
damping, 77
defective, 71, 72, 175
dense, 127
determinant, 146
diagonalizable, 64
eigenvalue problem, 39
elimination, 137
exponential, 70
Fourier transform, 86
Hessian, 190
Householder, 157
identity, 39
impedance, 144
indefinite, 191
inverse, 127, 136
inverse, eigenvalues, 167
Jacobian, 127
Jordan, 74
lower diagonal, 30
lower triangular, 134
mass, 77
norm, 150
normal equations, 204
of eigenvectors, 63
of principal vectors, 74
orthogonal, 140, 156, 157
permutation, 139
positive definite, 189
positive semi-definite, 191
power, 71
psychologically lower triangular, 140
quadratic form, 186
rank, 146
Rayleigh quotient, 154
rotation, 45, 68, 71
similar, 64
singular, 39
skew-symmetric, 43, 71
sparse, 127, 144
sparse, fill-in, 145
spectrum, 179
stiffness, 2, 186
symmetric, 79, 152
unitary, 175
unsymmetric, 152
upper triangular, 134, 156
matrix exponential, 70, 75
matrix inverse, 127
matrix power, 161
matrix powers, 71
method
conjugate gradients, 196
direct, 202
Euler's, 9
Hermitian, 175
iterative, 202
of conjugate gradients, 202
power, 161
rectangular, 206
minimization
unconstrained, 193
modal coordinates, 65, 83
modal expansion, 161
mode, 65
mode shape
undamped, 80
modified Euler, 30
modified Euler method, 26
multi-grid, 145
NaN, 112
natural frequency, 60, 179
nested function, 132
Newton's algorithm, 120, 127
Newton's equation, 6
nonlinear algebraic equation, 119
vector, 126
nonsingular, 189
nonsingular matrix, 79
norm
1-norm, 151
2-norm, 151
Euclidean, 151
infinity, 151
matrix, 150
vector, 150
normal coordinates, 65
normal equations, 148, 204, 207
normalized values, 113
numerical stability, 207
QR factorization, 157
Nyquist frequency, 86
Nyquist rate, 86
objective function, 185, 186, 193, 204
one-sided spectrum, 89
optimization
unconstrained, 185
order-of analysis, 98
orthogonal, 198, 207
orthogonal matrix, 140, 149, 156, 157
orthogonal vectors, 172
orthogonality condition, 199
orthonormal, 172
oscillation, 51
oscillator
multi-degree of freedom, 77
overow, 112, 115, 163
partial dierential equation, 18
partial pivoting, 139, 179
period of vibration, 85
periodic function, 86
permutation matrix, 139, 140, 179
permutation vector, 140
phase, 91, 93
phase shift, 91
phasor, 45, 68, 81
Wilkinson, 178
shifted eigenvalue, 169
shifting, 174
similar matrix, 64, 152
similarity transformation, 64
simultaneous iteration, 174
simultaneous power iteration, 172
singular, 146, 191
matrix, 39
singular matrix, 149, 187
singular stiffness, 95
singular value decomposition, 146
skew-symmetric matrix, 43, 71, 189
solution
general, 7
particular, 7
sparse matrix, 127, 144
spectrum, 84
spectrum slicing, 179
spring, 186
stability, 15, 33
stable structure, 188
stable time step, 36, 37
standard eigenvalue problem, 80, 180
static equilibrium, 188
steepest descent, 194–196
stiffness matrix, 186
Stokes, 5
Stokes' Law, 5
stretch, 186
Sylvester, 179
Sylvester's Law of inertia, 179
symmetric
stiffness, 189
symmetric matrix, 79, 190
Taylor series, 70, 100
terminal velocity, 8
time step, 84
time stepping
adaptive, 11
tolerance, 120
total energy, 192
trace, 42
transpose, 186, 190
trapezoidal integrator, 26
trapezoidal rule, 26
triangular matrix, 135
true error, 24
truncation error, 104, 105, 107, 108, 110
unconditionally stable, 52
unconditionally unstable, 52
unconstrained minimization, 192
unconstrained minimization problem, 193
unconstrained optimization, 185, 193
uncoupled variables, 65
undamped oscillator, 67
undamped vibration
natural frequency, 60
underow, 115, 163
unitary matrix, 175
unnormalized values, 113
unsigned byte, 111
unstable structure, 187
upper triangular matrix, 134, 156