An Engineer's Toolkit of Numerical
Algorithms
With the MATLAB® toolbox
https://github.com/PetrKryslUCSD/AETNA
July 2015
Contents
Motivation
Analyzing errors
6.1 Taylor series
Illustration 1
6.2 Order-of analysis
Illustration 2
6.2.1 Using the big-O notation
Illustration 3
Illustration 4
6.2.2 Error of the Riemann-sum approximation of integrals
6.2.3 Error of the Midpoint approximation of integrals
Unconstrained Optimization
9.1 Basic ideas
9.2 Two degrees of freedom static equilibrium: unstable structure
Illustration 1
9.3 Two degrees of freedom static equilibrium: stable structure
9.4 Potential function
Illustration 2
9.5 Determining definiteness
9.6 Two degrees of freedom static equilibrium: computing displacement
9.7 One degree of freedom total energy minimization example
9.8 Two degrees of freedom total energy minimization example
9.9 Application of the total energy minimization
9.9.1 Line search
9.9.2 Line search for the quadratic-form objective function
Illustration 3
9.10 Conjugate Gradients method
9.11 Generalization to multiple equations
Illustration 4
9.12 Direct versus iterative methods
9.13 Least-squares minimization
9.13.1 Geometry of least squares fitting
9.13.2 Solving least squares problems
9.14 Annotated bibliography
Index
1
Motivation
The narrative in this chapter is provided in the hope that it will motivate the esteemed reader to take the present subject seriously. Please do not be discouraged if the text in this chapter is found lacking in entertainment value. The rest of the book will make it up to you to excess.
Let us consider the experience of the renowned structural engineer WTP (in Figure 1.1 accompanied by his attorney CR). It concerns a planar truss structure designed by WTP and analyzed for static loads. The structural software used was developed by Owl & Co.
Fig. 1.1. The renowned structural engineer WTP depicted on the stairs of his house with his attorney CR. CR was instrumental in keeping WTP's engineering career on track.
The structure was first analyzed with default analysis settings and the shape after the deformation is shown in Figure 1.2 (also included is a visual representation of the applied loading and the three pin supports). The shape before deformation is shown in broken line. The deformation is highly magnified.
Later that day WTP was exploring the menus of the analysis software (there is always a first time for everything), and got intrigued by the fact that the analysis option "Use automatic stabilization" was checked. The documentation was not very helpful in explaining the effects of this option (WTP was in fact not sure in which language the documentation was written), and therefore an experiment was in order. The analysis option "Use automatic stabilization" was unchecked, and the analysis was repeated. To WTP's surprise the results were practically identical, except that a slight displacement unsymmetry developed (for instance, a displacement of -0.7039 versus -0.6518 units). This was disquieting since the structure and the boundary conditions (loads and supports) were symmetric. WTP was however unfazed, especially given that the analysis was to be delivered to the client the next day.
Several weeks later the software developers alerted WTP that a bug was found in the analysis software, the bug was fixed, and an update was to be installed. WTP remembered the slight unsymmetry, and therefore checked whether the update removed it. Since the unsymmetry remained, a brief discussion ensued in whose course WTP ascertained that the bug had to do with the color in which the logo of the company was drawn in the splash screen, and therefore it was somewhat unlikely to be the cause of the unsymmetry.

Fig. 1.2. The planar truss structure. Deformation under indicated static loads is shown in solid line (magnified). The undeformed structure is shown in dashed line.
During a discussion with a colleague WTP was able to convince himself that no unsymmetry was to be expected in the analysis, and that if it appeared it should be considered an error. At that point WTP began to draw on his immense powers of reasoning. After only a few hours he was able to recall the name of the text in which properties of coupled systems of linear algebraic equations were discussed in his junior year in college. An intense session with the textbook followed, and WTP was quickly able to find the page that pertained to errors that can appear in the solution of systems of equations. The error was found to be proportional to the error in the right-hand side (the loads) and to the condition number of the stiffness matrix. The loads were, as WTP checked, specified correctly, and consequently the mysterious condition number was probably the source of the confounding error.
WTP was now able to find in the textbook that the condition number of the stiffness matrix was rather expensive to compute as one had to solve an eigenvalue problem. WTP was not to be deterred however, and subcontracted this work out to a group of students from the local university, cost it what it may (it wasn't much). The magnitudes of the eigenvalues of the stiffness matrix found by the students are shown in Figure 1.3.
Fig. 1.3. The magnitudes of the eigenvalues of the stiffness matrix of the structure from Figure 1.2
The rather small first eigenvalue did not escape WTP and a few more rewarding hours were spent looking for information that could lead to an understanding of the relationship between the condition number, the eigenvalue problem, and the stiffness matrix. Eventually the critical piece of information that the so-called singular matrix had at least one zero eigenvalue was located, and the conclusion that the stiffness matrix was somehow close to singular was reached.
The displacement shape corresponding to the first eigenvector (Figure 1.4) facilitated the ultimate breakthrough. The structure contained a mechanism: a floppy piece of structure that was insufficiently connected to the rest of the structure (which was in fact sufficiently supported).
Fig. 1.4. The first eigenvector of the stiffness matrix of the structure from Figure 1.2
The structure was consequently subjected to a redesign to remove the mechanism, and the redesign was eagerly adopted by the client who remarked on the propitious circumstance that a superior design became available before the structure was realized. WTP has yet again demonstrated that superior skill and knowledge cannot fail to win the day. Even though his friend CR's assistance was not required in this matter, his comforting presence during these trials and tribulations was gratefully noted by WTP.
2
Modeling with differential equations
Summary
1. In this chapter we develop an understanding of initial value problems (IVPs). We look at the simple but illustrative model of motion in a viscous fluid, and the model of satellite motion. The main idea: these models can be treated similarly since they are both members of the class of IVPs. The constituents of an IVP are the governing equation and the initial conditions.
2. The IVPs that will be considered in this book will be in the form of coupled first-order (only one derivative with respect to the independent variable) equations.
3. We develop simple methods for integrating IVPs numerically in time. The main idea: approximate the curve by its tangent in order to make one discrete step in time. The basic visual picture is provided by the direction field.
4. We discuss the essential differences between IVPs and BVPs (boundary value problems). The main idea: BVPs are harder to solve than IVPs because the problem data is located on the entire boundary of the domain of the independent variables.
5. We investigate the accuracy of some simple numerical solvers for IVPs. The main concepts: the monomial relationship between the error and the time step length gives us formulas to estimate the error, the log-log plot illuminates the convergence produced by the dependence on the time step, and the convergence rate is revealed by the log-log plot.
6. We wrap up the exposition of the various time integrators by describing the Runge-Kutta integrators. Main idea: try to aim the time step for optimal accuracy by sampling the right-hand side function (that is, the slope) within the time step.
For a small sphere falling at its terminal velocity in a viscous fluid, the Stokes drag balances the net gravitational force,

6\pi\eta r v = \frac{4}{3}\pi r^3 (\rho_s - \rho_f) g ,    (2.1)

where \frac{4}{3}\pi r^3 is the volume of the sphere, \eta is the dynamic fluid viscosity (for instance in SI units Pa·s), 6\pi r is the shape factor of the sphere of radius r, v is the velocity of the falling sphere relative to the fluid, m is the mass of the sphere, and g is the gravitational acceleration. On the left of equation (2.1) is the so-called drag force F_d; on the right is the gravitational force F_g (i.e. \frac{4}{3}\pi r^3 \rho_s g, where \rho_s is the mass density of the material of the sphere) minus the buoyancy force (i.e. \frac{4}{3}\pi r^3 \rho_f g, where \rho_f is the mass density of the fluid); compare with Figure 2.1.

Fig. 2.1. Sphere falling in viscous fluid.
An application of this law to structural engineering may be found for instance in composites manufacturing: a commonly used manufacturing technique for large parts infuses dry fibers laid up on a bagged mold with resin by creating a degree of vacuum (Vacuum Assisted Resin Transfer Moulding, VARTM) to suck the resin into the fibers. A critical property of the polymer resins is their dynamic viscosity: if the resin is too viscous, the fibers may be incompletely impregnated and the part must be discarded. Some of the techniques to determine the viscosity of the liquid resemble a high school science experiment: drop a ball into a tube filled with this liquid. Measure the time it takes the ball to travel some distance. From that calculate the ball's velocity (distance/time), and knowing the ball's diameter and mass obtain from (2.1) the liquid's viscosity.
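The last step is a one-line calculation: solving (2.1) for the viscosity gives \eta = 2r^2(\rho_s - \rho_f)g/(9v). A minimal sketch of this computation follows; the measurement values are made up for illustration and are not taken from the book.

```matlab
% Viscosity from the terminal velocity of a falling ball, from (2.1):
% 6*pi*eta*r*v = (4/3)*pi*r^3*(rhos-rhof)*g  =>  eta = 2*r^2*(rhos-rhof)*g/(9*v)
r    = 0.005;      % ball radius [m] (assumed)
rhos = 7800;       % ball mass density [kg/m^3] (assumed: steel)
rhof = 1100;       % liquid mass density [kg/m^3] (assumed: resin)
g    = 9.81;       % gravitational acceleration [m/s^2]
v    = 0.10/2.0;   % measured: 0.10 m traveled in 2.0 s, hence v [m/s]
eta  = 2*r^2*(rhos-rhof)*g/(9*v)   % dynamic viscosity [Pa s], about 7.3 here
```

Note that the formula presumes the ball has already reached its terminal velocity, which is the point of the discussion that follows.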
Of course, if we calculate the ball's velocity as (distance/time) it had better be uniform in that interval. So how does the velocity of the falling ball vary with time? Let us say we observe the proceedings with a high-speed camera. We drop the ball from rest, and then we see the ball rapidly accelerate. Eventually it seems to settle down to a steady speed on the way downwards. The modeling keyword is acceleration, and consequently we shall use Newton's equation: acceleration is proportional to force. The acceleration may be written as \ddot{x} (measuring the distance traveled downwards: Figure 2.1), and the total applied force is F_g - F_d. Therefore, we write

\frac{4}{3}\pi r^3 \rho_s \ddot{x} = \frac{4}{3}\pi r^3 (\rho_s - \rho_f) g - 6\pi\eta r v .    (2.2)
Dividing through by the mass \frac{4}{3}\pi r^3 \rho_s we obtain

\ddot{x} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v .    (2.3)
We see that we have one equation, but two variables: x and v. These are not independent, since the velocity is defined as the rate of change of the distance fallen, v = \dot{x}. We have two choices. Either we express equation (2.3) in terms of the distance

\ddot{x} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} \dot{x}    (2.4)
and we obtain a second order differential equation, or we express equation (2.3) in terms of the velocity

\dot{v} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v    (2.5)
and we obtain a first order differential equation. Since we are at the moment primarily interested in the velocity, we will stick to the latter.
All these equations are the so-called equations of motion. They are differential equations, expressing the rate of change of some variable (x or v) in terms of the same variable (and/or other variables, in general). The independent variable is the time t and the dependent variable is the velocity v.
We realize that to obtain a solution we must somehow integrate both sides of the equation of
motion. From calculus we know that integration brings in constant(s) of integration. So, for instance
for equation (2.5), we may write

\int_{t_0}^{t} \dot{v}(\tau)\, d\tau = \int_{t_0}^{t} \left[ \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v(\tau) \right] d\tau

and evaluating the left-hand side we arrive at

v(t) - v(t_0) = \int_{t_0}^{t} \left[ \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v(\tau) \right] d\tau .
Here the task is to find a suitable form of the function v(\tau) to satisfy this equation for all times. The value v(t_0) is arbitrary. Its physical meaning is that of the velocity at the beginning of the interval t_0 \le \tau \le t. Therefore, setting the value v(t_0) to some particular number

v(t_0) = v_0    (2.6)

is called specifying the initial condition. The initial condition makes the solution to the equation of motion meaningful to a particular problem. Therefore, we always think of the models of this type in terms of the pair governing equation (the equation of motion) plus the initial condition. This type of model is called the initial value model (and the problem which is modeled this way is called an initial value problem: IVP). The problem of the falling sphere is an initial value problem, and the model that needs to be solved is
\dot{v} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v , \qquad v(0) = v_0 .    (2.7)
The solution may be sought as the sum v = v_h + v_p of a homogeneous solution v_h and a particular solution v_p, which satisfy

\dot{v}_h = -\frac{9\eta}{2r^2\rho_s} v_h , \qquad 0 = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v_p .

The homogeneous equation may be solved by assuming the solution in the form of an exponential

v_h(t) = \exp(a t) .

Differentiating v_h we find a = -\frac{9\eta}{2r^2\rho_s}.

The particular solution can be guessed as v_p = constant, and differentiating we find

v_p = \frac{2r^2(\rho_s - \rho_f)}{9\eta} g .

The solution to the initial value problem is the sum of the particular solution and some multiple of the homogeneous solution

v(t) = v_p + C v_h(t)

and it must satisfy the initial condition v(0) = v_0. Substitution of t = 0 leads to

C = v_0 - \frac{2r^2(\rho_s - \rho_f)}{9\eta} g .
Fig. 2.2. Sphere falling in viscous fluid. Time variation of the descent speed.
See: aetna/Stokes/stokesref.m
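The closed-form solution can be evaluated directly. The following sketch reproduces the shape of a curve like the one in Figure 2.2; the material constants are assumed for illustration only and are not necessarily those behind the book's figures.

```matlab
% Evaluate the analytical solution v(t) = vp + C*exp(a*t) of the IVP (2.7).
% Constants below are assumed for illustration only.
rhos = 1200; rhof = 1000;  % sphere and fluid mass densities [kg/m^3]
eta  = 0.1;  r = 0.005;    % dynamic viscosity [Pa s], sphere radius [m]
g    = 9.81; v0 = 0;       % gravitational acceleration [m/s^2], v(0) [m/s]
a  = -(9*eta)/(2*r^2*rhos);            % exponent of the homogeneous solution
vp = (2*r^2*(rhos-rhof))/(9*eta)*g;    % particular solution (terminal velocity)
C  = v0 - vp;                          % from the initial condition v(0) = v0
t  = linspace(0, 0.5, 100);
v  = vp + C*exp(a*t);
plot(t, v), xlabel('t [s]'), ylabel('v(t) [m/s]')
```

Since a < 0, the exponential dies out and v(t) approaches the terminal velocity v_p, which is the behavior observed in the experiment.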
For most problems of practical interest such an analytical solution is not available. The tools available to engineers for these problems will most likely be numerical in nature. (Hence the reason for this book.)
The simplest method with which to introduce numerical solutions to initial value problems (IVPs) is Euler's method. It is based on a very simple observation: the solution graph is a curve. The solution process itself could be understood as constructing a curve. The curve passes through a point that is known to us from the specification of the IVP: the point (t_0, v_0). A curve consists of an infinite number of points, and we do not want to have to compute the coordinates of an infinite number of points. The next best thing would be to compute the solution at only a few points along the curve, and somehow approximate the curve in-between. It is logical to try to approach this task by starting from the point we know from the beginning, (t_0, v_0), and to compute next another point on the curve, let us say (t_1, v_1). Then restart the process by moving one point forward in time, compute (t_2, v_2), and so on. This is an important aspect of numerical methods: the algorithms make discrete steps and they produce discrete solutions (as opposed to a continuous analytical solution).
In general we will not be able to compute this sequence of points so that they all lie on the exact solution curve. The points will only be close to the curve we wish to find (they will be only approximately on the curve). In fact, there is in general an infinite number of solution curves, those passing through all possible initial conditions. Refer to Figure 2.3: shown are five solution curves for five different initial velocities. So if our numerical solution process drifts off the desired solution curve, it will most probably lie on an adjacent solution curve.
Since the process is repetitive (start from a known solution point and then compute the next solution point), we may just as well think in terms of the pair (t_j, v_j) (known) and (t_{j+1}, v_{j+1}) (unknown, to be computed). How do we approximately locate the point (t_{j+1}, v_{j+1}) from what we know of the solution curve passing through (t_j, v_j)? We know the location (t_j, v_j), but is there anything else? The answer is yes: having (t_j, v_j) allows us to substitute these values on the right-hand side of the governing equation (2.7) and compute
\dot{v}(t_j, v_j) = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v_j .    (2.8)

This is the slope of the solution curve at (t_j, v_j), and following the tangent for one step gives

(t_{j+1}, v_{j+1}) \approx (t_j + \Delta t,\; v_j + \Delta t\, \dot{v}(t_j, v_j)) , \quad \Delta t \text{ small} .    (2.9)
Here the notation \dot{v}(t_j, v_j) may become confusing, since by the superimposed dot we don't mean that a time derivative of some quantity was taken. We simply mean the value of the given function on the right of (2.8). Therefore, we give the right-hand side function a name and we use the notation

\dot{v} = f(t, v) , \qquad v(t_0) = v_0    (2.10)

for the IVP. Here by f(t, v) we mean that the right-hand side of the governing equation is known as a function of t and v. Then the Euler algorithm may be written as

(t_{j+1}, v_{j+1}) \approx (t_j + \Delta t,\; v_j + \Delta t\, f(t_j, v_j)) , \quad \Delta t \text{ small} .    (2.11)
One more remark is in order in reference to Figure 2.3. The short red lines indicate the slope of the solution curves passing through the points from which the straight red lines emanate (the left-hand side ends). The straight lines represent the tangents to the solution curves (the slopes). They are also known as the direction field. Plotting the direction field is a good way in which the behavior of solutions to ordinary differential equations can be understood. It works best for a single scalar equation since it is hard to visualize the direction fields when there is more than one dependent variable.
See: aetna/Stokes/stokesdirf.m
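A direction field of this kind can be sketched with the built-in quiver function. The snippet below is only a rough stand-in for what stokesdirf.m does, with the material constants assumed for illustration.

```matlab
% Sketch the direction field of dv/dt = f(t,v) for the Stokes IVP (2.7).
% Constants are assumed for illustration only.
rhos = 1200; rhof = 1000; eta = 0.1; r = 0.005; g = 9.81;
f = @(t,v)((rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v);
[T, V] = meshgrid(linspace(0, 0.2, 12), linspace(0, 0.35, 12));
S = f(T, V);                       % slope dv/dt at each grid point
quiver(T, V, ones(size(S)), S);    % short segments with slope S at each point
xlabel('t [s]'), ylabel('v(t) [m/s]')
```

Each arrow has horizontal component 1 and vertical component S, so its direction matches the slope of the solution curve passing through that point.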
Fig. 2.3. Stokes problem solutions corresponding to different initial conditions, with the direction field shown at a few selected points.
tspan = [0 0.5];% seconds
Define an anonymous function (assigned to the variable f) to return the value of the right-hand side of (2.7) for a given time t and velocity v.
f=@(t,v)((rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v);
Decide how many steps the algorithm should take, and compute the time step to cover the time
span in the selected number of time steps.
nsteps =20;
dt= diff(tspan)/nsteps;
Initialize two arrays to hold the solution pairs. Note that the two lines in the loop correspond exactly to the algorithm formula (2.11). We call the function f defined above to evaluate the right-hand side.
t(1)=tspan(1);
v(1)=v0;
for j=1:nsteps
t(j+1) =t(j)+dt;
v(j+1) =v(j)+dt*f(t(j),v(j));
end
Finally, we graphically represent the solution as a series of markers that correspond to the computed solution pairs (t_j, v_j).
plot(t,v,'o')
xlabel('t [s]'),ylabel('v(t) [m/s]')
See: aetna/Stokes/stokes1.m
Fig. 2.4. Stokes problem solution computed with a simple implementation of Euler's method.
Integrating the equation of motion (2.4) once, the velocity may be expressed in terms of the distance traveled,

\dot{x}(t) - \dot{x}(t_0) = \frac{\rho_s - \rho_f}{\rho_s} g\,(t - t_0) - \frac{9\eta}{2r^2\rho_s} \left( x(t) - x(t_0) \right) .
See: aetna/Stokes/stokes2.m
\int_{t_0}^{t} \left( \dot{x}(\tau) - \dot{x}(t_0) \right) d\tau = \int_{t_0}^{t} \frac{\rho_s - \rho_f}{\rho_s} g\,(\tau - t_0)\, d\tau - \int_{t_0}^{t} \frac{9\eta}{2r^2\rho_s} \left( x(\tau) - x(t_0) \right) d\tau .
This expression could be further simplified, but my point can be made here: the two constants of integration are already present, x(t_0) and \dot{x}(t_0). Therefore the IVP (the governing equation plus the initial conditions) may be written

\ddot{x} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} \dot{x} , \qquad x(t_0) = x_0 , \quad \dot{x}(t_0) = v_0 .    (2.13)
The meaning of the IVP is: find a function (distance traveled) x(t) such that it satisfies the equation of motion, and such that the initial distance and the initial velocity at the time t_0 are x_0 and v_0, respectively.
The integration of IVPs in MATLAB is made general by requiring that all IVPs be first order (only first derivatives of the variables may be present). Our IVP (2.13) is second order, but we can see that it may be converted to a first order form. Just introduce the velocity to write

\dot{v} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v , \qquad \dot{x} = v , \qquad x(t_0) = x_0 , \quad v(t_0) = v_0 .    (2.14)
The price to pay for having to deal with only the first order derivatives is an increased number of variables: now we have two. Since we have two variables, we had better also have two equations. Note that the initial conditions are now written in terms of the two variables, but we still have two of them. That is not entirely surprising since we still need two integration constants: we have two first-order equations, each of them needs to be integrated once, which will again result in two constants of integration.
The IVP now deals with a system of coupled ordinary differential equations. Such systems are usually written in the so-called vector form. We introduce a vector to collect our variables

z = \begin{bmatrix} x \\ v \end{bmatrix}

and then the IVP (2.14) is put as

\dot{z} = \begin{bmatrix} v \\ \dfrac{\rho_s - \rho_f}{\rho_s} g - \dfrac{9\eta}{2r^2\rho_s} v \end{bmatrix} = f(t, z) , \qquad z(t_0) = \begin{bmatrix} x_0 \\ v_0 \end{bmatrix} .    (2.15)
Formally, this is the same as the IVP (2.7), except that our variable is a vector, and the function on the right-hand side returns a vector and takes the time and a vector as arguments. This parallel makes it possible to treat a variety of IVPs with the same code in MATLAB. Here we show an implementation that computes the solution to (2.15).
The definitions of the constants are the same as above, except for the initial conditions. The initial condition now is a column vector.
z0 = [0;0];% Initial distance [m] and velocity [m/s]
The right-hand side function looks very similar to the one introduced above, except that it needs to
return a vector, and whenever it refers to velocity it needs to take it out of the input vector as z(2)
f=@(t,z)([z(2); (rhos-rhof)/rhos*g-(9*eta)/(2*r^2*rhos)*z(2)]);
The MATLAB integrator is called exactly as before.
[t,z] = ode23 (f, tspan, z0);
The arrays returned by the integrator collect results in the form of a table:

t(:)    z(:,1)    z(:,2)
t1      x1        v1
t2      x2        v2
...     ...       ...

Plotting the two arrays then yields two curves: the distance traveled and the velocity (Figure 2.6).
plot(t,z,'o-')
xlabel('t [s]'),ylabel('x(t) [m], v(t) [m/s]')
Fig. 2.6. Stokes problem. Solution of (2.15) computed with a MATLAB integrator.
See: aetna/Stokes/stokes3.m
Making the step smaller is however expensive. The more steps we make the algorithm take, the longer we have to wait for the computer to give us the solution. Hence we may wish to use a step that is sufficiently large for the results to arrive quickly, but large steps also have consequences. What if I wanted to take only 10 steps instead of 20 in the first MATLAB script (Section 2.2.1, set nsteps =10;)? The result is shown in Figure 2.7 and it is clearly unphysical: in the actual experiment (and in the analytical solution and in our prior calculations) the dropped sphere certainly seems to be monotonically speeding up, whereas here the result tells us that the velocity oscillates, and moreover at times seems to be higher than the terminal velocity.
Fig. 2.7. Stokes problem. Solution with a larger time step than in Figure 2.4.
An explanation of this phenomenon may be found in Figure 2.8. Note the direction field, which will help us understand the numerical solution. Starting from (t_1 = 0, v_1 = 0) we proceed along the steep straight line so far that the next solution point (t_2, v_2) overshoots the terminal velocity. The next step is along a straight line with a negative slope, and again we go so far that we undershoot the terminal velocity. The third step takes us along a straight line with a positive slope, and we overshoot again. This kind of computed response is not useful to us since the qualitative feature of the solution, namely the monotonic increase of the speed, is lost in the numerical solution.
Fig. 2.8. Stokes problem. Solution with a larger time step than in Figure 2.4. The direction field and the analytical solution are shown.
See: aetna/Stokes/stokes4.m
See: aetna/Stokes/stokes4ill.m
In summary, we see that the selection of the time step length has two kinds of implications. Firstly, the time step affects the accuracy (how far are the computed solutions from the curve that we would like to track?). Secondly, the time step affects the quality of the solution (is the shape of the computed solution series a reasonable approximation of the shape of the exact solution curve?). The first aspect is generally referred to as accuracy. The second aspect is generally considered a manifestation of stability (or instability, depending on how we look at it).
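The stability aspect is easy to reproduce in a few lines. For the model problem the forward Euler update multiplies the distance of the iterate from the terminal velocity by a constant factor every step, and the oscillation of Figure 2.7 appears as soon as that factor turns negative. The sketch below demonstrates this on a linear equation of the same form; the constants are assumed for illustration, not taken from the book.

```matlab
% Forward Euler for dv/dt = K - a*v (same form as (2.7), terminal velocity
% vinf = K/a): each step multiplies the error v_j - vinf by (1 - a*dt).
% |1 - a*dt| < 1 is needed for the error to decay; a negative factor already
% produces the unphysical oscillation about vinf seen in Figure 2.7.
a = 50; vinf = 0.3;        % assumed decay rate [1/s] and terminal velocity [m/s]
K = a*vinf; v = 0;         % right-hand side constant; start from rest, v(0) = 0
dt = 0.03;                 % 1 - a*dt = -0.5: stable, but the sign alternates
for j = 1:10
    v = v + dt*(K - a*v);  % forward Euler step
end
% each step halves |v - vinf| while flipping the side of vinf the iterate is on
```

With dt above 2/a the factor drops below -1 and the iterates diverge, which is the instability proper; between 1/a and 2/a they converge but oscillate, which is the overshoot seen earlier.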
2.2.5 A variation on Euler's method

Euler's method proposes to follow a straight line set up at (t_j, v_j) to arrive at the point (t_{j+1}, v_{j+1}). As an alternative, let us consider the possibility of following a straight line set up at the (initially unknown) point (t_{j+1}, v_{j+1}). For simplicity let us work with the IVP (2.7). The right-hand side of the equation of motion is the function

f(t, v) = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v .

Its value at the new point is

f(t_{j+1}, v_{j+1}) = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2r^2\rho_s} v_{j+1} ,
which we substitute into the formula that expresses the movement from the point (t_j, v_j) to (t_{j+1}, v_{j+1}) along a straight line

(t_{j+1}, v_{j+1}) = (t_j + \Delta t,\; v_j + \Delta t\, f(t_{j+1}, v_{j+1})) .    (2.16)

Since f is linear in v, the second component may be solved for the unknown explicitly:

v_{j+1} = \frac{ v_j + \Delta t\, \frac{\rho_s - \rho_f}{\rho_s} g }{ 1 + \Delta t\, \frac{9\eta}{2r^2\rho_s} } .
The MATLAB script of Section 2.2.1 may be easily modified to incorporate our new algorithm. The only change occurs inside the time-stepping loop
for j=1:nsteps
t(j+1) =t(j)+dt;
v(j+1) =(v(j)+dt*(rhos-rhof)/rhos*g)/(1+dt*(9*eta)/(2*r^2*rhos));
end
With this modification we can now compute the numerical solution without overshoot not only with just 10 steps, but with just five or even two; see Figure 2.9. The computed points are not particularly accurate, but the qualitative character of the solution curves is preserved. In this sense, the present modification of Euler's algorithm has rather different properties than the original.
See: aetna/Stokes/stokes5.m
Fig. 2.9. Stokes problem. Solution computed with the modified Euler algorithm.
In order to be able to distinguish between these algorithms we will call the original algorithm of Section 2.2.1 the forward Euler, and the algorithm introduced in this section will be called backward Euler. The justification for this nomenclature may be sought in the visual analogy of approximating a curve with a tangent: in the forward Euler method this tangent points forward from the point (t_j, v_j), in the backward Euler method the tangent points backward from (t_{j+1}, v_{j+1}).
2.2.6 Implementations of forward and backward Euler method

In this book we shall spend some time experimenting with the forward and backward Euler method. However, MATLAB does not come with integrators implementing these methods. They are too simplistic to serve the general-purpose aspirations of MATLAB. Since it will make our life easier if we don't have to code the forward and backward Euler method every time we want to apply it to a different problem, the toolbox aetna that comes with the book provides integrators for this pair of methods.
The aetna forward and backward Euler integrators are called in the same way as the built-in MATLAB integrators. We have seen in Section 2.2.2 an example of the built-in MATLAB integrator, ode23. There is one difference, however, which is unavoidable. The built-in MATLAB integrators are able to determine the time step automatically, and in general the time step is changed from step to step. The aetna forward and backward Euler integrators are fixed-time-step implementations: the user controls the time step, and it will not change. Therefore, we have to supply the initial time step as an option to the integrator. (In fact, even the MATLAB built-in integrators take that options argument. It is used to control various aspects of the solution process.) The MATLAB odeset function is used to create the options argument. To compute the solution with the forward Euler integrator odefeul, replace the ode23 line in the script in Section 2.2.2 with these two lines:
options =odeset('initialstep', 0.01);
[t,v] = odefeul(f, tspan, [v0], options);
To compute the solution with a backward Euler integrator, use odebeul instead. The inquisitive reader now probably wonders: how does odebeul solve for v_{j+1} from the implicit equation

v_{j+1} = v_j + \Delta t\, f(t_{j+1}, v_{j+1})

when it cannot even know how the function f was defined (all it is given is the function handle f)? The answer is: the equation is solved numerically. Solving (systems of) non-linear algebraic equations is
See: aetna/utilities/ODE/integrators/odefeul.m
See: aetna/Stokes/stokes6.m
See: aetna/utilities/ODE/integrators/odebeul.m
so important that MATLAB cannot fail to deliver some methods for dealing with them. MATLAB's fzero implements a few methods by which the root of a single nonlinear equation may be located. It takes two arguments: a function handle (this would be the function whose zero we wish to find) and the initial guess of the root location. First we define the function

F(v_{j+1}) = v_{j+1} - v_j - \Delta t\, f(t_{j+1}, v_{j+1})

by moving all the terms to the left-hand side, and our goal will be to find v_{j+1} such that F(v_{j+1}) = 0. For that purpose we will create a handle to an anonymous function @(v1)(v1-v(j)-dt*f(t(j+1),v1)) in which we readily recognize the function F(v_{j+1}) (the argument v_{j+1} is called v1). Finally, the time stepping loop for the backward Euler method is written as
for j=1:nsteps
t(j+1) =t(j)+dt;
v(j+1) =fzero(@(v1)(v1-v(j)-dt*f(t(j+1),v1)),v(j));
end
where the second line inside the loop solves the implicit equation using fzero.
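The same idea carries over directly to other environments. Below is a minimal Python sketch of one backward Euler step, with a hand-rolled bisection solve standing in for MATLAB's fzero; the function names and the bracketing interval are our own illustrative choices, not part of aetna.

```python
def backward_euler_step(f, t, v, dt, lo=-1e6, hi=1e6, tol=1e-12):
    """One backward Euler step: solve F(v1) = v1 - v - dt*f(t+dt, v1) = 0.

    A simple bisection stands in for MATLAB's fzero; it assumes the
    root lies in [lo, hi] and that F changes sign over that bracket.
    """
    t1 = t + dt
    F = lambda v1: v1 - v - dt * f(t1, v1)
    a, b = lo, hi
    if F(a) * F(b) > 0:
        raise ValueError("root not bracketed")
    while b - a > tol:
        m = 0.5 * (a + b)
        if F(a) * F(m) <= 0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)

# Model decay problem dv/dt = -v/2: the implicit equation is linear here,
# so the exact answer is v1 = v0/(1 + dt/2), which the solver should match.
f = lambda t, v: -0.5 * v
v1 = backward_euler_step(f, 0.0, 1.0, 0.1)
```

For a linear right-hand side the bisection is of course overkill, but the same call works unchanged when f is nonlinear, which is the whole point of solving the implicit equation numerically.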
M′ = V ,   EIv′′ = M ,   (2.17)

where V is the shear force resultant, q is the applied transverse load, M is the moment resultant, E is the Young's modulus, I is the moment of inertia of the cross-section, and (·)′ = d(·)/dx. Therefore, the governing equation (static equilibrium of the differential element of the beam) is fourth order

EIv′′′′ = q .   (2.18)
Our knowledge of the particular configuration of the beam would be expressed in terms of the conditions at either end: Is the cross-section at the end of the beam free of loading? Is it supported? Is the support a roller, or is the rotation at the supported cross-section disallowed? At the cross-section x = 0 we could write the following four conditions

v(0) = v₀ ,   v′(0) = s₀ ,   v′′(0) = (1/EI) M₀ ,   v′′′(0) = (1/EI) V₀ ,
depending of course on what was known: deflection v₀, slope s₀, moment M₀, or shear force V₀. Four similar conditions could be written for the cross-section at x = L (L = the length of the beam). Normally we would know two quantities out of the four at each end of the beam. For instance, for a beam supported on rollers at each end (the so-called simply supported beam) the known quantities would be v₀ = M₀ = v_L = M_L = 0. Since the quantities are specified at the boundary x = 0 and x = L of the domain 0 ≤ x ≤ L on which the governing equation is written, we call these the boundary conditions. The entire setup leads consequently to a boundary value problem (BVP), which for the instance of the simply supported beam would be defined as

EIv′′′′ = q ,   v(0) = 0 ,   v′′(0) = 0 ,   v(L) = 0 ,   v′′(L) = 0 .   (2.19)
The difference between an IVP and a BVP looks innocuous but is rather consequential. All the conditions from which the integration constants need to be obtained are given at the same point for the IVP. On the other hand, they are not given at the same point for the BVP, and therefore the boundary value problem is considerably more difficult to solve. One of the difficulties is that solutions to BVPs do not necessarily exist for some combinations of boundary conditions.
13 See: aetna/Stokes/stokes7.m
Illustration 1

Consider the beam-bending BVP with

EIv′′′′ = 0 ,   v(0) = 0 ,   v′′(0) = 0 ,   v′′(L) = 0 ,   v′′′(L) = (1/EI) V_L ≠ 0 .

Note that the beam is not loaded (q = 0). The boundary conditions correspond to a beam simply supported at one end and free at the other end, where a nonzero shear force is applied. In the absence of other forces and moments, the shear force at x = L cannot be balanced. Such a beam is not stably supported, and therefore no solution exists for these boundary conditions.
We will handle some simple boundary value problems in this book, but most of their intricacies are outside of its scope.
[Figure: differential element of the beam, with deflection v(x), distributed load q(x), shear forces V(x) and V(x + dx), and moments M(x) and M(x + dx).]
It is relatively straightforward to add the aspect of dynamics to the equation of motion (2.18). All terms are moved to one side of the equation, and they represent the total applied force on the differential element of the beam. Then invoking Newton's law of motion, we obtain

μ v̈ = q − EIv′′′′ .   (2.20)

Here μ is the mass per unit length, and v̈ is the acceleration. The equation of motion has now become a partial differential equation, since there are now derivatives with respect to both space and time. With the time derivatives there comes the need for more constants of integration. It is consistent with our physical reality that the integration constants will come from the beginning of the time interval on which the equation of motion (2.20) holds. Therefore, they will be obtained from the so-called initial conditions. The solution will still be subject to the boundary conditions as before, and thus we obtain an initial boundary value problem (IBVP) for the function v(t, x) of the midline deflection.
For instance

μ v̈ = q − EIv′′′′ ,
v(t, 0) = 0 ,   v′′(t, 0) = 0 ,   v(t, L) = 0 ,   v′′(t, L) = 0 ,
v(0, x) = v_{t0}(x) ,   v̇(0, x) = v̇_{t0}(x) .   (2.21)

This IBVP model describes the vibration of a simply-supported beam, whose deflection at time t = 0 (initial deflection) is described by the shape v_{t0}(x) and whose (initial) velocity at time t = 0 is given as v̇_{t0}(x).
The gravitational force acting on the satellite is

F = −(GmM/r³) r .

Here G is the gravitational constant, m and M are the masses of the satellite and the planet respectively, and r is the vector from the center of the earth to the location of the satellite (r = ‖r‖ is its length). The IVP formulation is straightforward.

Fig. 2.11. Satellite motion. Satellite path and the gravitational force.

The equation of motion is a classical Newton's law: the acceleration of the mass of the satellite is proportional to the acting force

F = m r̈ .

Substituting for the force, we obtain

m r̈ = −(GmM/r³) r ,

which is entirely expressed in terms of the components of the location of the satellite with respect to the Earth. The initial conditions are the location and velocity of the satellite at some time instant, let us say at t = 0. Altogether the IVP reads

m r̈ = −(GmM/r³) r ,   r(0) = r₀ ,   ṙ(0) = v₀ .   (2.22)
As for the problem discussed in Section 2.2.3, the dynamics of this IVP is driven by a second order equation. In order to convert to the first order form, we shall use the obvious definition of a new variable, the velocity v = ṙ. With this definition, the IVP may be written in first order form as

ṙ = v ,   v̇ = −(GM/r³) r ,   r(0) = r₀ ,   v(0) = v₀ .   (2.23)
Note that the mass of the satellite canceled in the equation of motion.
As before we can introduce the same formal way of writing the IVP using a single dependent variable. Introduce the vector

z = [r ; v]

and the definition of the right-hand side function

f(t, z) = [v ; −(GM/r³) r] .   (2.24)

The IVP then reads ż = f(t, z), z(0) = z₀.
The complete MATLAB code14 to compute the solution starts with a few definitions. Especially note the initial conditions, velocity v0, and position r0.
G=6.67428 *10^-11;% cubic meters per kilogram second squared;
M=5.9736e24;% kilogram
R=6378e3;% meters
v0=[-2900;-3200;0]*0.9;% meters per second
r0=[R+20000e3;0;0];% meters
dt=0.125*60;% in seconds
te=50*3600; % seconds
Now the right-hand side function is defined (as an anonymous function, assigned to the variable f). Clearly the MATLAB code corresponds very closely to equation (2.24).
f=@(t,z)([z(4:6);-G*M*z(1:3)/(norm(z(1:3))^3)]);
We set the initial time step (the MATLAB integrator may or may not consider it: it is always driven by accuracy), and then we call the integrator ode45.
opts=odeset('InitialStep',dt);
[t,z]=ode45(f,[0,te],[r0;v0]);
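For readers following along in another language, here is a hedged Python sketch of the same right-hand side (2.24). The constants mirror the MATLAB script above; plain lists are used to stay dependency-free, and the sample state is our own illustrative choice.

```python
import math

G = 6.67428e-11  # gravitational constant, m^3 / (kg s^2)
M = 5.9736e24    # mass of the planet, kg

def f(t, z):
    """Right-hand side of (2.24): z = [r; v], dz/dt = [v; -G*M*r/|r|^3]."""
    r, v = z[:3], z[3:]
    n3 = math.sqrt(r[0]**2 + r[1]**2 + r[2]**2) ** 3
    return v + [-G * M * ri / n3 for ri in r]

# Sanity check: at r = (R, 0, 0) the acceleration must point back toward
# the origin with magnitude G*M/R^2 (about 9.8 m/s^2 at the surface).
R = 6378e3
z = [R, 0.0, 0.0, 0.0, 7000.0, 0.0]
dz = f(0.0, z)
```

Any fixed- or adaptive-step integrator can then be pointed at this function, exactly as ode45 is pointed at the anonymous function above.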
Finally, we do some visualization in order to understand the output better than a printout of the numbers can afford. In Figure 2.12 we compare results for this problem obtained with two MATLAB
integrators, ode45 and ode23, and with the forward and backward Euler integrators, odefeul and
odebeul. Some of the interesting features are: ode45 is nominally of higher accuracy than ode23.
However, we can see the individual curves spread out quite distinctly for ode45 while only a single
curve, at this resolution of the image, is presented for ode23. From what we know about analytical
solutions to this problem (remember Kepler?), the curve is an ellipse and the computed paths for
repeated revolutions of the satellite around the planet would ideally overlap and represent a single
curve. Therefore we have to conclude that ode23 is actually doing a better job, but not perfect (the
trajectory is not actually closed). The two Euler integrators produce altogether useless solutions.
The problem is not accuracy, it is the qualitative character of the orbits. From years and years of
observations of the motion of satellites (and from the analytical solution to this model) we know
that the energy of a satellite moving without contact with the atmosphere should be conserved to
a high degree. For the forward Euler the satellite is spiraling out (which would correspond to its
gaining energy), while for the backward Euler it is spiraling in (losing energy). A lot of energy! We
say that all these integrators fail to reproduce the qualitative character of the solution, but some
fail more spectacularly than others.
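The spiral-out/spiral-in behavior is easy to reproduce on a toy problem. The sketch below is our own assumption-laden miniature, not aetna code: it applies forward and backward Euler to the undamped oscillator x′ = v, v′ = −x, whose "energy" (x² + v²)/2 is exactly conserved by the true solution; forward Euler inflates it every step, backward Euler shrinks it.

```python
def forward_euler_osc(x, v, dt, n):
    # x' = v, v' = -x; each step multiplies x^2 + v^2 by (1 + dt^2)
    for _ in range(n):
        x, v = x + dt * v, v - dt * x
    return x, v

def backward_euler_osc(x, v, dt, n):
    # The implicit step is linear and solved in closed form here;
    # each step shrinks x^2 + v^2 by the factor 1/(1 + dt^2).
    for _ in range(n):
        d = 1.0 + dt * dt
        x, v = (x + dt * v) / d, (v - dt * x) / d
    return x, v

E0 = 0.5 * (1.0**2 + 0.0**2)
xf, vf = forward_euler_osc(1.0, 0.0, 0.1, 100)
xb, vb = backward_euler_osc(1.0, 0.0, 0.1, 100)
Ef = 0.5 * (xf * xf + vf * vf)   # grows: the forward Euler "gains energy"
Eb = 0.5 * (xb * xb + vb * vb)   # shrinks: the backward Euler "loses energy"
```

This is the same qualitative failure as the spiraling satellite orbits, stripped down to two lines of arithmetic per step.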
Looking at energy is a good way of judging the performance of the above integrators. The kinetic
energy of the satellite is
14 See: aetna/SatelliteMotion/satellite1.m
Fig. 2.12. Satellite motion. Solution computed with (left to right): ode45, ode23, odefeul, odebeul.
K = m‖v‖² / 2 ,

and the potential energy is

V = − GmM / r .
The total energy T = K + V should be conserved for all times. Let us compute this quantity for the solutions produced by these various integrators, and graph it. Or rather, we will graph T/m = K/m + V/m so that the expressions do not depend on the mass of the satellite, which did not appear in the IVP in the first place. Here is the code15 to produce Figure 2.13, which shows what the time variation of the energies should look like (the total energy is conserved, hence a horizontal line).
Km=0*t;
Vm=0*t;
for i=1:length(t)
Km(i)=norm(z(i,4:6))^2/2;
Vm(i)=-G*M/norm(z(i,1:3));
end
plot(t,Km,'k--'); hold on
plot(t,Vm,'k:'); hold on
plot(t,Km+Vm,'k-'); hold on
xlabel('t [s]'),ylabel('T/m, K/m, V/m [m^2/s^2]')
In Figure 2.14 we compare the four integrators. Only the total energy is shown, so ideally we should see horizontal lines corresponding to perfectly conserved energy. On the contrary, we can see that none of the four integrators conserves the total energy. Note that the vertical axes have rather different scales. The Euler integrators perform very poorly: the change in total energy is huge. The ode45 is significantly outperformed by ode23, but both integrators lose energy nevertheless. Since ode45 is significantly more expensive than ode23, this example illustrates that choosing an appropriate integrator can make the difference between success and failure.
Fig. 2.13. Satellite motion. Total energy (solid line), potential energy (dotted line), kinetic energy (dashed
line).
Fig. 2.14. Satellite motion. Total energy computed with (left to right, top to bottom): ode45, ode23,
odefeul, odebeul.
Fig. 2.15. Dry friction sliding of eccentric mass shaker. Sliding motion: Displacement in dotted line, velocity
in solid line.
with a polished steel base lying on a steel platform. The mass of the shaker is m. The harmonic force due to the eccentric mass motion is added to the weight of the shaker to give the normal contact force between those two as mg + A sin(ωt), and the horizontal force parallel to the platform A cos(ωt). The IVP of the shaker sliding motion may be written in terms of its velocity as

m v̇ + μ(v) (mg + A sin(ωt)) sign v = A cos(ωt) ,   v(0) = v₀ .
Here μ(v) is the friction coefficient. For steel on steel contact we could take

μ(v) = μₛ for |v| ≤ v_stick ,   μ(v) = μₖ < μₛ otherwise.

Here μₛ is the so-called static friction coefficient, μₖ is the kinetic friction coefficient, and v_stick is the sticking velocity. In words, for low sliding velocity the coefficient of friction is high, for high sliding velocity the coefficient of friction is low.
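The velocity-dependent friction coefficient is the only nonstandard ingredient, and it can be sketched in a few lines of Python. The numerical values below are placeholders chosen by us for illustration, not values taken from the book.

```python
MU_S = 0.25      # static friction coefficient (placeholder value)
MU_K = 0.15      # kinetic friction coefficient, MU_K < MU_S (placeholder)
V_STICK = 1e-3   # sticking velocity [m/s] (placeholder)

def mu(v):
    """Friction coefficient: high (static) below the sticking velocity,
    low (kinetic) above it -- the discontinuity that drives stick-slip."""
    return MU_S if abs(v) <= V_STICK else MU_K
```

The jump at |v| = v_stick is what makes the right-hand side of the IVP discontinuous, which is exactly the complication discussed next.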
Run the simulation script stickslip_harm_2_animate and watch the animation a few times to get a feel for the motion.16 Figure 2.15 shows the displacement and velocity of the sliding motion. The brief stick phases should be noted. Also note the drift of the shaker due to the lift-off force, which reduces the contact force, and hence also the friction force, when the mass is moving upwards.
Now consider Figure 2.16, which shows the velocity of the sliding motion for two slightly different initial conditions.17 Note well the direction field and consider how quickly (in fact discontinuously) it changes for some values of the velocity of sliding.
We take the initial velocity of the shaker to be 0.99 v_stick and 1.01 v_stick. We may expect that for such close initial conditions the velocity curves would also stay close, but they don't. The reason is the discontinuous (and divergent) direction field, as is especially evident in the close-up on the right. The direction field is also discontinuous at zero velocity, but there it is convergent, and solution curves that arrive there are forced to remain at zero (and sticking occurs).
The divergent discontinuous direction field makes the solution non-unique. As users of numerical algorithms for IVPs we must be aware of such potential complications, and address them by careful consideration of the formulation of the problem and interpretation of the results.
Fig. 2.16. Dry friction sliding of eccentric mass shaker. Direction eld and velocity curves for two initial
conditions.
First, what do we mean by error? Consider for example that we want to obtain a numerical solution to the IVP

ẏ = f(t, y) ,   y(0) = y₀   (2.25)

in the sense that our goal is the approximation of the value of the solution at some given point t = t̄. The difference between our computed solution y_t̄ and the true answer y(t̄) will be the true error

E_t = y(t̄) − y_t̄ .

We have already seen that the time step length apparently controls the error (we will see later exactly how it achieves this feat). So let us compute the solution for a few decreasing time step lengths. The result will be a table of time step length versus true error.

Time step length | Solution at t̄ for Δt_j
Δt₁              | y_{t̄,1}
Δt₂              | y_{t̄,2}
...              | ...
The true error is a fine concept, but not very useful, as knowing it implies knowing the exact solution. In practical applications of numerical methods we will never know the true error (otherwise why would we be computing a numerical solution?). In practice we will have to be content with the concept of an approximate error. A useful form of approximate error is the difference of successive solutions. So now we can construct the table of approximate errors

Time step length | Solution at t̄ for Δt_j | Approximate error
Δt₁              | y_{t̄,1}                 |
Δt₂              | y_{t̄,2}                 | E_{a,1} = y_{t̄,2} − y_{t̄,1}
Δt₃              | y_{t̄,3}                 | E_{a,2} = y_{t̄,3} − y_{t̄,2}
Δt₄              | y_{t̄,4}                 | E_{a,3} = y_{t̄,4} − y_{t̄,3}
...              | ...                     | ...
y(0) = 1.0   (2.26)

and our goal will be to compute y(t̄ = 4). Figure 2.17 shows on the left the succession of computed solutions with various time steps. As we can see, the two methods used, the forward and backward Euler, are approaching the same value as the time step gets smaller. We call this behavior convergence. From the computed sequence of solutions we can obtain the approximate errors as discussed above18. The approximate errors are shown in Figure 2.17 on the right.
Fig. 2.17. Successive approximations to y(t̄ = 4) for various time steps (on the left), and approximate errors (on the right). Red curve: backward Euler, blue curve: forward Euler
With this data at hand we can try to ask some questions. How does the error depend on the
time step length? The curves in Figure 2.17 suggest a linear relationship. Before we look at this
question in more detail, we will consider the problem of numerical integration of the IVP again, this
time with a view towards devising a better (read: more accurate) integrator than the first two Euler algorithms.
2.6.1 Modified Euler method
As discussed below equation (2.5), the governing equation of the IVP (2.25) may be subject to integration from t₀ to t to obtain

y(t) − y(t₀) = ∫_{t₀}^{t} f(τ, y(τ)) dτ .

To use this expression to obtain an actual solution may not be easy because of the integral on the right-hand side. This gives us an incentive to try to approximate the right-hand side integral. One possibility is to write

∫_{t₀}^{t} f(τ, y(τ)) dτ ≈ (t − t₀) f(t, y(t))

and we get the backward Euler method. This should be familiar: we are approximating the areas under curves (integrals of functions) by rectangles (recall the concept of the Riemann sum). A better approximation would be achieved with trapezoids. Thus we may try
18 See: aetna/ScalarODE/scalardecayconv.m
∫_{t₀}^{t} f(τ, y(τ)) dτ ≈ ((t − t₀)/2) [f(t₀, y(t₀)) + f(t, y(t))] .

The resulting trapezoidal rule

y_{j+1} = y_j + (Δt/2) [f(t_j, y_j) + f(t_{j+1}, y_{j+1})]

has very attractive properties, and we will devote more attention to it later. (It is implemented in aetna as odetrap19.) One factor that may discourage its use is cost: it is an implicit method, and to obtain y(t) one has to solve a (in general, nonlinear) equation for y(t)

y(t) = y(t₀) + ((t − t₀)/2) [f(t₀, y(t₀)) + f(t, y(t))] .   (2.27)
To obtain a method that is explicit in y(t) one may try the following trick: in the above equation approximate y(t) in f(t, y(t)) using the forward Euler step to arrive at

y_a = y(t₀) + (t − t₀) f(t₀, y(t₀)) ,
y(t) = y(t₀) + ((t − t₀)/2) [f(t₀, y(t₀)) + f(t, y_a)] .   (2.28)

This formula defines one of the so-called modified Euler algorithms. It turns out to be only a little bit more expensive than the basic forward Euler method, but its accuracy is superior, as we will immediately see on some results. (An implementation is available in odemeul20.)
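To make (2.28) concrete, here is a small Python sketch of one modified Euler step (a generic predictor-corrector of Heun type; the function name is ours, not aetna's).

```python
import math

def modified_euler_step(f, t0, y0, dt):
    """One step of the modified Euler method (2.28):
    predict with forward Euler, then average the two slopes."""
    ya = y0 + dt * f(t0, y0)                             # forward Euler predictor
    return y0 + 0.5 * dt * (f(t0, y0) + f(t0 + dt, ya))  # averaged-slope corrector

# For dy/dt = -y/2 one step reproduces the Taylor expansion of exp(-dt/2)
# through the quadratic term, which is the signature of second-order accuracy.
y1 = modified_euler_step(lambda t, y: -0.5 * y, 0.0, 1.0, 0.1)
```

Compare y1 = 0.95125 with the exact exp(−0.05) ≈ 0.951229: the mismatch appears only in the cubic term of the expansion.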
2.6.2 Deeper look at errors: going to the limit

We will now compute21 the solution to (2.26) also with the modified Euler method (2.28). Figure 2.18 shows that the modified Euler approaches the solution somewhat more quickly than both the backward and forward Euler methods. This is especially clear when we look at the approximate errors (on the right).
Fig. 2.18. Successive approximations to y(t̄ = 4) for various time steps (on the left), and approximate errors (on the right). Red curve: backward Euler, blue curve: forward Euler, black curve: modified Euler
The errors seem to decrease roughly linearly for the backward and forward Euler methods. The modified Euler method errors decrease along some curve. Can we find out what kind of curve? Could it be a polynomial? That would be the first thing to try, because polynomials tend to be very useful in this way (viz the Taylor series later in the book).
19 See: aetna/utilities/ODE/integrators/odetrap.m
20 See: aetna/utilities/ODE/integrators/odemeul.m
21 See: aetna/ScalarODE/scalardecayconv1.m
We will assume that the approximate errors depend on the time step length as

E_a(Δt) ≈ C Δt^β .   (2.29)

The exponent β is unknown. One clever way in which we can use the data to find out the value of β relies on taking logarithms on both sides of the equation

log(E_a(Δt)) ≈ log(C Δt^β) ,

which yields

log(E_a(Δt)) ≈ log(C) + β log(Δt) .
This is an expression for a straight line on a plot with logarithmic axes. The slope of the line would be β. Figure 2.19 shows the approximate errors re-plotted on the log-log scale. Also shown are two red triangles. The hypotenuse in those triangles has slope 1 or 2 respectively. This may be compared with the plotted data. The forward and backward Euler approximate errors (at least for the smaller time step lengths) appear to lie along a straight line with slope equal to one. The modified Euler approximate errors on the other hand are close to a straight line with slope equal to two. Therefore, we may hypothesize that the approximate errors behave as E_a(Δt) ≈ CΔt for the forward and backward Euler, and as E_a(Δt) ≈ CΔt² for the modified Euler. The exponent β is called the rate of convergence (convergence rate). The higher the rate of convergence, the faster the errors drop. Later in the book we will use mathematical analysis tools (the Taylor series) to understand where the convergence rate is coming from.
Fig. 2.19. The approximate errors from Figure 2.18 re-plotted on the log-log scale. Red curve: backward Euler, blue curve: forward Euler, black curve: modified Euler
What about the first few points in the computed series, which fail to lie along a straight line on the log-log plot? We have assumed that the errors depended on only a single power of the time step length. This is a good assumption for very small time step lengths (the so-called asymptotic range, where Δt → 0), but for larger time step lengths (the so-called pre-asymptotic range) the error more likely depends on a mixture of powers of the time step length. Then the data points will not lie on a straight line on the log-log plot.
Plotting the data as in Figure 2.19 is very useful in that it gives us the convergence rate. Could
we use this information to get a handle on the true error? As explained above, we assumed that the
approximate error depended on the time step length through the monomial relation (2.29). Using a simple trick, we can relate the approximate errors to the true errors

E_{a,j} = y_{t̄,j+1} − y_{t̄,j} = (y_{t̄,j+1} − y(t̄)) + (y(t̄) − y_{t̄,j}) ,

where E_{t,j+1} = y(t̄) − y_{t̄,j+1} and E_{t,j} = y(t̄) − y_{t̄,j}, and so we have

E_{a,j} = E_{t,j} − E_{t,j+1} .   (2.30)
Then if the approximate error on the left behaves as the monomial (2.29), then so will the true errors on the right. There are two parameters in (2.29), the rate β and the constant C. We have estimated the rate by plotting the approximate errors on a log-log scale. Now we can estimate the constant C by taking

E_{t,j} ≈ C Δt_j^β ,   E_{t,j+1} ≈ C Δt_{j+1}^β

to obtain

E_{a,j} ≈ C Δt_j^β − C Δt_{j+1}^β   (2.31)

and

C = E_{a,j} / (Δt_j^β − Δt_{j+1}^β) .
For instance, for the forward Euler we have obtained the following approximate errors

>> ea_f =
6.2500e-02 3.7613e-02 1.7954e-02 8.7217e-03 4.2952e-03 2.1311e-03
>> dts =
2, 1, 1/2, 1/4, 1/8, 1/16, 1/32

and we have estimated from Figure 2.19 that the convergence rate was β = 1. Therefore, we can estimate the constant using (for example) E_{a,3} as

>> C=ea_f(3)/(dts(3)-dts(4))
C =
0.071816687928745
This is useful: we can now predict for instance how small a time step will be required to obtain the solution within the absolute tolerance 10⁻⁴:

E_{t,j} ≈ C Δt_j^β ≤ 10⁻⁴   ⟹   Δt_j ≤ (10⁻⁴ / C)^{1/β}

>> 1e-4/(ea_f(3)/(dts(3)-dts(4)))
ans =
0.001392434027300

Indeed, we find that for time step length 1/1024 < 0.00139 the true error is computed as 0.000066 < 10⁻⁴.
If we do not have an estimate of the convergence rate, we can try solving for it. Provided we have at least two approximate errors, let us say E_{a,1} and E_{a,2}, we can write (2.31) twice as

E_{a,1} = C Δt₁^β − C Δt₂^β ,   E_{a,2} = C Δt₂^β − C Δt₃^β .

This system of two nonlinear equations will allow us to solve for both unknowns C and β. In general a numerical solution of this nonlinear system of equations will be required. Only if the time steps are always related by a constant factor, so that

Δt_{j+1} = θ Δt_j ,   (2.32)
where θ is a fixed constant, will we be able to solve the system analytically. First we write

E_{a,1} = C Δt₁^β (1 − θ^β) ,   E_{a,2} = C Δt₂^β (1 − θ^β) ,   (2.33)

then divide the first equation by the second, canceling the factor (1 − θ^β), to obtain

E_{a,1} / E_{a,2} = (Δt₁ / Δt₂)^β .   (2.34)

This is then easily solved for the convergence rate β by taking a logarithm of both sides to give

β = log(E_{a,1} / E_{a,2}) / log(Δt₁ / Δt₂) .   (2.35)
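The estimate (2.35), together with the constant from (2.31), is a one-liner in any language. The Python sketch below recovers the rate from synthetic second-order data; the function name and the data are our own illustrations, not values from the book.

```python
import math

def convergence_rate(ea1, ea2, dt1, dt2):
    """Estimate beta from two approximate errors at steps dt1 > dt2, eq. (2.35)."""
    return math.log(ea1 / ea2) / math.log(dt1 / dt2)

# Synthetic data obeying Ea = C*dt^2 exactly (C = 3), with halved steps,
# generated through the difference formula (2.31).
C, dts = 3.0, [0.4, 0.2, 0.1]
ea = [C * (dts[j]**2 - dts[j + 1]**2) for j in range(2)]
beta = convergence_rate(ea[0], ea[1], dts[0], dts[1])
C_est = ea[0] / (dts[0]**beta - dts[1]**beta)   # constant from (2.31)
```

On clean data the estimates are exact; on real computed errors they are only as good as the asymptotic-range assumption discussed next.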
The described procedure for the estimation of the parameters of the relation (2.29) is a special case of the so-called Richardson extrapolation. When the data for the extrapolation is nice, this procedure is very useful. The data may not be nice: for instance, for some reason we haven't reached the asymptotic range. Or, perhaps there is a lot of noise in the data. Then the extrapolation procedure cannot work. It is a good idea to always visualize the approximate error on the log-log graph. If the approximate error data does not appear to lie on a straight line, the extrapolation should not be attempted.

An important note: the above Richardson extrapolation can work only for results obtained with fixed-step-length integrators. The step length is the parameter in the extrapolation formula. It varies from step to step when the MATLAB adaptive step length integrators (e.g. ode23, ...) are used, which the formula cannot accommodate, and the extrapolation is then not applicable.
An explicit Runge-Kutta method with s stages computes the solution at the time t = t₀ + Δt as

y(t) = y(t₀) + Δt (b₁ k₁ + b₂ k₂ + ... + b_s k_s) ,   (2.36)

which means that from y(t₀) we follow a slope which is determined as a linear combination of slopes k_j evaluated at various points within the current time step

k₁ = f(t₀ + c₁Δt, y(t₀))
k₂ = f(t₀ + c₂Δt, y(t₀) + a₂₁Δt k₁)
k₃ = f(t₀ + c₃Δt, y(t₀) + a₃₁Δt k₁ + a₃₂Δt k₂)
...   (2.37)

where Δt = (t − t₀), and the coefficients a_sj, b_j, c_j are determined in various ingenious ways so that the method has the best accuracy and stability properties.
Figure 2.20 shows graphically an example of such an explicit Runge-Kutta method, the modified Euler method. It can be written in the above notation as

y(t) = y(t₀) + Δt ((1/2) k₁ + (1/2) k₂) ,
k₁ = f(t₀ + 0·Δt, y(t₀)) ,
k₂ = f(t₀ + 1·Δt, y(t₀) + 1·Δt k₁) .   (2.38)

We see that the coefficients of this method are c₁ = 0, c₂ = 1, a₂₁ = 1 and b₁ = b₂ = 1/2.

Fig. 2.20. The modified Euler algorithm as a graphical schematic: The solution at the time t = t₀ + Δt is arrived at from y(t₀) using the average slope (1/2) k₁ + (1/2) k₂
The coefficients of Runge-Kutta methods a_sj, b_j, c_j are usually presented in the form of the so-called Butcher tableau

c | a
--+--
  | b   (2.39)

where the coefficients are elements of the three matrices. For the explicit RK methods c₁ = 0 always, and the matrix a is strictly lower triangular. The modified Euler method is an RK method with s = 2 and the tableau

0 | 0   0
1 | 1   0
--+--------
  | 1/2 1/2
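Given a tableau, one explicit RK step is mechanical to code. The following generic Python stepper is our own sketch (not an aetna routine); fed the modified Euler tableau above, it reproduces (2.38).

```python
def rk_step(f, t0, y0, dt, a, b, c):
    """One explicit Runge-Kutta step for a scalar ODE from a Butcher tableau.
    a is strictly lower triangular, b holds the weights, c the nodes."""
    k = []
    for i in range(len(b)):
        yi = y0 + dt * sum(a[i][j] * k[j] for j in range(i))
        k.append(f(t0 + c[i] * dt, yi))
    return y0 + dt * sum(bi * ki for bi, ki in zip(b, k))

# Modified Euler tableau: c = (0, 1), a21 = 1, b = (1/2, 1/2).
a = [[0.0, 0.0], [1.0, 0.0]]
b = [0.5, 0.5]
c = [0.0, 1.0]
y1 = rk_step(lambda t, y: -0.5 * y, 0.0, 1.0, 0.1, a, b, c)
```

Swapping in a different tableau (such as the fourth-order one below) changes the method without touching the stepping code, which is the point of the tableau notation.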
MATLAB in the ode45 integrator. The tableau of the fourth-order explicit Runge-Kutta with a
fixed time step is

0   | 0   0   0   0
1/2 | 1/2 0   0   0
1/2 | 0   1/2 0   0
1   | 0   0   1   0
----+----------------
    | 1/6 1/3 1/3 1/6
Fig. 2.21. The approximate errors plotted on the log-log scale. Red curve: backward Euler, blue curve: forward Euler, black curve with o markers: modified Euler, black curve with x markers: fourth-order Runge-Kutta oderk4
To round off this discussion we will consider the adaptive-step Runge-Kutta method implemented in the MATLAB ode23 integrator. The tableau reads
22 See: aetna/utilities/ODE/integrators/oderk4.m
23 See: aetna/ScalarODE/scalardecayconv2.m
0   | 0    0   0   0
1/2 | 1/2  0   0   0
3/4 | 0    3/4 0   0
1   | 2/9  1/3 4/9 0
----+------------------
    | 2/9  1/3 4/9 0
    | 7/24 1/4 1/3 1/8
The array b with two rows instead of one makes the method so useful: the solution at the time t = t₀ + Δt may be computed in two different ways from the slopes k₁, ..., k₄. One of these (the first row) is third-order accurate and the other (the second row) is second-order accurate. The difference between them can be used to guide the change of the time step to maintain accuracy.
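The mechanism can be sketched in Python using the tableau above. This is a sketch of one step of such an embedded pair only; aetna's and MATLAB's actual step-size control logic is more elaborate, and the function name is ours.

```python
import math

def embedded_step(f, t, y, dt):
    """One step of the embedded pair from the tableau above.
    Returns the first-row solution and the error estimate."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt / 2 * k1)
    k3 = f(t + 3 * dt / 4, y + 3 * dt / 4 * k2)
    y_a = y + dt * (2 / 9 * k1 + 1 / 3 * k2 + 4 / 9 * k3)   # first row of b
    k4 = f(t + dt, y_a)
    y_b = y + dt * (7 / 24 * k1 + 1 / 4 * k2 + 1 / 3 * k3 + 1 / 8 * k4)  # second row of b
    return y_a, y_a - y_b   # the difference drives the step-size control

y_a, err = embedded_step(lambda t, y: -y, 0.0, 1.0, 0.1)
exact = math.exp(-0.1)
```

For the decay problem dy/dt = −y the first-row solution lands within a few parts in 10⁶ of exp(−0.1), and the cheap error estimate correctly flags an error of roughly that size, with no knowledge of the exact solution.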
Suggested experiments
1. The integrator ode87fixed24 uses a high order Runge-Kutta formula and fixed time step length. Repeat the above exercise with this integrator, and estimate its convergence rate.
24 See: aetna/utilities/ODE/integrators/ode87fixed.m
3 Preservation of solution features: stability
Summary
1. In this chapter we investigate the central role that the eigenvalue problem plays in the design
of ODE integrators. The goal is to preserve important solution features. This is referred to as
the stability of the integration algorithm. Main idea: stability can be investigated on the model
equation of the scalar linear ODE.
2. For the model IVP, the formula of a particular integrator can be written down so that the new
value of the solution is expressed as a multiple of the solution value in the previous step. Main
idea: the amplication factor depends on the product of the eigenvalue and the time step, and
therefore the shape of the numerical solution is determined by these quantities. The eigenvalue
is given as data, the time step can be (needs to be) chosen by the user.
3. The scalar linear ODE with a complex coecient is equivalent to two coupled real equations in
two real variables. Main idea: the ODE with a complex coecient describes harmonic oscillations.
4. For the model IVP with a complex coecient, the same procedure that leads to an amplication
factor is used. Main idea: the amplication factor and the solution now live in the complex
plane. The magnitude of the amplication factor again is seen to play a role in the stability
investigation.
5. Understanding the amplication factors is aided by appropriate diagrams. Main idea: The preservation of solution features is illustrated by a complete stability diagram for the various methods.
The magnitude of the amplication factor may also be visualized as a surface above the complex
plane.
Our model IVP is the scalar linear ODE

ẏ = k y ,   y(0) = y₀ .   (3.1)

For the moment we shall consider k real. As an example take k = −1/2, with an arbitrary initial condition
ẏ = −(1/2) y ,   y(0) = 1.3361 .   (3.2)

We look for the solution in the form y = B e^{λt}. Substituting into the differential equation gives

λ B e^{λt} = k B e^{λt} .   (3.3)

The constant B ≠ 0 (otherwise we don't have a solution!), and for the above to hold for all times t we must require

B(λ − k) = 0 .

The above equation is called the eigenvalue problem, and this is definitely not the last time we will encounter this type of equation in the present book. Here λ is the eigenvalue, and B is the eigenvector. The solution is easy: we see that λ = k. Any B ≠ 0 will satisfy the eigenvalue equation. We could determine B so that the initial value problem (3.1) was solved by substituting into the initial condition to obtain B = y₀.
The solution to the IVP (3.2) is drawn with a solid line in Figure 3.1. It is a decaying solution. In the same figure there's also a growing solution (for k = +1/2), and a constant solution (for k = 0).
[Figure 3.1: plot of y(t) versus t.]
to obtain

y_{j+1} = y_j + Δt k y_j = (1 + Δt k) y_j .   (3.4)

We would like to see a monotonically decaying numerical solution, |y_{j+1}| < |y_j|, so the so-called amplification factor (1 + Δt k) must be positive and its magnitude must be less than one

|1 + Δt k| < 1 .

If this condition is satisfied but (1 + Δt k) < 0, the solution decreases in magnitude, but changes sign from step to step. Finally, (1 + Δt k) = 0 implies that the solution drops to zero in one step and stays zero. Recall that for our example k = −1/2. Correspondingly, in Figure 3.2 we see1 a monotonically decaying solution for Δt = 1.0 (|1 + Δt k| = |1 + 1.0 × (−1/2)| = 1/2 < 1), a solution dropping to zero in one step for Δt = 2.0, a solution decaying, but non-monotonically, for Δt = 3.0 (as 1 + Δt k = 1 + 3.0 × (−1/2) = −1/2), and finally for Δt = 4.0 we get a solution which oscillates between ±y₀. Note that for an even bigger time step we would get an oscillating solution which would increase in amplitude rather than decrease.
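The four regimes can be verified directly from the amplification factor, without running any integrator. A small Python check (our own illustration):

```python
def amplification(dt, k):
    """Forward Euler amplification factor for y' = k*y."""
    return 1.0 + dt * k

k = -0.5
factors = [amplification(dt, k) for dt in (1.0, 2.0, 3.0, 4.0)]
# dt=1.0: factor  0.5 -> monotonic decay
# dt=2.0: factor  0.0 -> drops to zero in one step
# dt=3.0: factor -0.5 -> decaying, sign alternates each step
# dt=4.0: factor -1.0 -> oscillates between +y0 and -y0
```

Any dt beyond 4.0 pushes the factor below −1, and the oscillation grows instead of decaying.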
Fig. 3.2. Forward Euler solutions to (3.2) for time steps (left to right) Δt = 1.0, 2.0, 3.0, 4.0
In summary, for a negative coefficient k < 0 in the model IVP (3.1) we can reproduce the correct shape of the solution curve with the forward Euler method provided

0 < Δt ≤ −1/k .   (3.5)

This is visualized in Figure 3.3. On the top we show the real line; the thick part indicates where the eigenvalues λ = k are located when they are negative. On the bottom we show the real line for the quantity Δtλ. The thick segment corresponds to equation (3.5). The filled circle indicates included, the empty circle indicates excluded. The meaning of (3.5) is expressed in words as:
1 See: aetna/ScalarODE/scalarsimple.m
for a negative λ = k the forward Euler will reproduce the correct decaying behavior provided the quantity Δtλ lands in the segment −1 ≤ Δtλ < 0 as indicated by the arrow.

The time step lengths that satisfy equation (3.5) are called stable. If we need to be precise, we would say that such time step lengths are stable for the forward Euler applied to IVPs with decaying solutions.

Fig. 3.3. Forward Euler stability when applied to the model problem (3.1) for negative eigenvalues. The given coefficient is located in the negative part of the real axis on top. The time step Δt needs to be chosen to place the product Δtλ in the unit interval −1 ≤ Δtλ < 0 on the axis at the bottom.
If we require only that the magnitude of the solution decays, |1 + Δtk| < 1, we obtain the weaker condition

−2 < Δt k < 0 ,   (3.6)

so that Δtk is allowed to be in the interval between −2 and zero. This will guarantee that the solution decays, albeit non-monotonically. Such behavior is considered admissible when all we care about is that the solution decays. Detailed discussion follows in Section ??.
The backward Euler method applied to the model problem yields

yj+1 = yj / (1 − Δt k) ,   (3.7)

where

1 / (1 − Δt k)

is the amplification factor for this Euler scheme. Now if we realize that by assumption k < 0, we see
that the solution is going to decay monotonically for all nonzero time step lengths, since 1 − Δt k > 1
for Δt > 0. Hence we can state that any time step length is stable for the backward Euler method
applied to an IVP with a decaying solution.
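The contrast between the two amplification factors is easy to check numerically. The sketch below is plain Python (not part of the MATLAB AETNA toolbox); the value k = −1/2 is the illustrative coefficient used earlier in this section.

```python
# Stability check for forward vs. backward Euler applied to y' = k*y with k < 0.
def forward_euler_factor(dt, k):
    return 1.0 + dt * k          # y_{j+1} = (1 + dt*k) y_j

def backward_euler_factor(dt, k):
    return 1.0 / (1.0 - dt * k)  # y_{j+1} = y_j / (1 - dt*k)

k = -0.5
# Forward Euler: stable only for 0 < dt < -2/k = 4, monotone for dt <= -1/k = 2.
print(abs(forward_euler_factor(1.0, k)))  # 0.5 -> monotone decay
print(abs(forward_euler_factor(3.0, k)))  # 0.5 -> decay, but with alternating sign
print(abs(forward_euler_factor(5.0, k)))  # 1.5 -> blow-up
# Backward Euler: |1/(1 - dt*k)| < 1 for every dt > 0, hence unconditionally stable.
print(all(abs(backward_euler_factor(dt, k)) < 1 for dt in (0.1, 1.0, 10.0, 1000.0)))  # True
```

The last line illustrates the statement above: no matter how large the time step, the backward Euler amplification magnitude stays below one for a decaying solution.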
For a growing solution, k > 0, the backward Euler method reproduces the correct growing behavior provided

0 < Δt ≤ 1/k .   (3.8)

The time step lengths that satisfy equation (3.8) are called stable. If we need to be precise, we
would say that such time step lengths are stable for the backward Euler applied to IVPs with growing
solutions. We see that the situation mirrors the one discussed for the forward Euler applied
to decaying solutions. Figure 3.4, which corresponds to (3.8), illustrates this quite clearly, as it
is quite literally a mirror image of Figure 3.3 for the forward Euler and k < 0.
Fig. 3.4. Backward Euler stability when applied to the model problem (3.1) for positive eigenvalues. The
given coefficient λ is located in the positive part of the real axis on top. The time step Δt needs to be chosen
to place the product Δt λ in the unit interval 0 < Δt λ ≤ 1 on the axis at the bottom.
y = y0 exp(kt) .

Note that the complex exponential may be expressed in terms of sine and cosine:

exp(kt) = exp [(Re k + i Im k)t] = exp(Re k t) [cos(Im k t) + i sin(Im k t)] .

The solution is now to be sought with a time dependence in the form of a complex exponential. Let
us write the solution in terms of the real and imaginary parts,

y = Re y + i Im y ,

which can be substituted into the differential equation, together with k = Re k + i Im k, to give

Re ẏ + i Im ẏ = (Re k + i Im k)(Re y + i Im y) .

Expanding we obtain

Re ẏ + i Im ẏ = Re k Re y − Im k Im y + i Re k Im y + i Im k Re y .

Now we group the real and imaginary terms,

[Re ẏ − (Re k Re y − Im k Im y)] + i [Im ẏ − (Re k Im y + Im k Re y)] = 0 ,

and in order to get a real zero on the right-hand side, we require that both brackets vanish identically,
and we obtain a system of coupled real differential equations

Re ẏ = Re k Re y − Im k Im y ,
Im ẏ = Re k Im y + Im k Re y ,   (3.9)

with the initial conditions Re y(0) = Re y0 , Im y(0) = Im y0 .
So we see that to solve (3.1) with k complex is equivalent to solving the real IVP (profitably written
in matrix form)

[ Re ẏ ]   [ Re k, −Im k ] [ Re y ]        [ Re y(0) ]   [ Re y0 ]
[ Im ẏ ] = [ Im k,  Re k ] [ Im y ]  ,     [ Im y(0) ] = [ Im y0 ]  .   (3.10)
The method of Section 3.2 can be used again, but with a little modification since we now have a
matrix differential equation instead of a scalar ODE. We will seek the solution to (3.10) as

[ Re y ]            [ z1 ]
[ Im y ] = exp(λt)  [ z2 ]  .
For brevity we will introduce the notation

      [ Re y ]
w =   [ Im y ]

and

      [ Re k, −Im k ]
K =   [ Im k,  Re k ]   (3.11)

so that the IVP reads

ẇ = K w ,   w(0) = w0 .   (3.12)

The assumed solution then takes the form

w = exp(λt) z ,

and substituting it into (3.12) gives

λ exp(λt) z = K exp(λt) z .   (3.13)

Canceling the (nonzero) scalar exp(λt) on both sides yields

K z = λ z .   (3.14)
This is the so-called matrix eigenvalue problem. The vector z is the eigenvector, the scalar λ
is the eigenvalue, and they both may be complex. The eigenvalue problem (EP) is highly nonlinear,
and therefore for larger matrices impossible to solve analytically and quite difficult to solve
numerically.
Looking at (3.14) we realize that there are too many unknowns here: λ, z1, and z2 (three), and
not enough equations (two). We need one more equation, and to get it we rewrite (3.14) as

(K − λ1) z = 0 ,

where 1 is an identity matrix. This is a system of linear equations for the vector z with a zero
right-hand side. In order for the above equation to have a nonzero solution, the square matrix

K − λ1

must be singular. (A linear combination of the columns of K − λ1 yields a zero vector, which is
just another way of saying that the columns are linearly dependent. Hence, the matrix is singular.)
We may put the fact that K − λ1 is singular differently by referring to its determinant:

det (K − λ1) = 0 .   (3.15)
Illustration 1

Expand the determinant of the 2 × 2 matrix

[  2, −1 ]       [ 1, 0 ]
[ −1,  1 ]  − λ  [ 0, 1 ]  .

The determinant may be defined recursively in terms of cofactors (Laplace formula). For a 2 × 2
matrix we obtain the familiar diagonal-products rule

det ( [  2, −1 ]       [ 1, 0 ] )
    ( [ −1,  1 ]  − λ  [ 0, 1 ] )  = (2 − λ)(1 − λ) − (−1)(−1) = λ² − 3λ + 1 .
We see that the expanded determinant is a polynomial in λ, the so-called characteristic polynomial.
For a 2 × 2 matrix the polynomial is quadratic, and with each additional row and column the
degree of the polynomial goes up by one. As a consequence, to solve the eigenvalue problem means to
find the roots of the characteristic polynomial. This is a highly nonlinear and potentially unstable
computation, which for larger matrices must be done numerically, since no analytical formulas for the
roots exist for polynomials of degree five and higher.
Illustration 2
Display the characteristic polynomial of the matrix [2,-1,0,0;-1,1,-1,0;0,-1,1,-1;0,0,-1,1].
The MATLAB symbolic solution
>> syms lambda real
>> det( [2,-1,0,0;-1,1,-1,0;0,-1,1,-1;0,0,-1,1]-lambda*eye(4))
ans =
lambda+6*lambda^2-5*lambda^3-2+lambda^4
>> ezplot(ans)
>> grid on
yields a curve similar to the one shown in Figure 3.5. One has to zoom in to be able to estimate
where the roots lie. There are going to be four of them, corresponding to the highest power λ⁴.
Fig. 3.5. The characteristic polynomial p(λ) = λ⁴ − 5λ³ + 6λ² + λ − 2
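The claim that the eigenvalues are the roots of the characteristic polynomial can be cross-checked numerically. The sketch below uses NumPy (Python; not part of the MATLAB AETNA toolbox), with the polynomial coefficients read off the symbolic MATLAB answer above.

```python
import numpy as np

A = np.array([[2., -1., 0., 0.],
              [-1., 1., -1., 0.],
              [0., -1., 1., -1.],
              [0., 0., -1., 1.]])
# Eigenvalues computed directly from the matrix...
eigs = np.sort(np.linalg.eigvals(A).real)
# ...and the roots of p(lambda) = lambda^4 - 5 lambda^3 + 6 lambda^2 + lambda - 2.
roots = np.sort(np.roots([1., -5., 6., 1., -2.]).real)
print(np.allclose(eigs, roots))  # True: the four real roots match the eigenvalues
```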
Illustration 3
We may familiarize ourselves with the concepts of the EP solutions by looking at some simple 2 × 2
matrices.
Zero matrix. The characteristic polynomial is

det ( [ 0, 0 ]       [ 1, 0 ] )
    ( [ 0, 0 ]  − λ  [ 0, 1 ] )  = λ² = 0 ,

which has the double root λ1,2 = 0. Apparently any vector v is an eigenvector, since

0 v = 0 · v .
The MATLAB solution agrees with our analytical consideration (columns of V are the eigenvectors,
the diagonal elements of D are the eigenvalues). The eigenvectors we obtained are particularly
nice because they are orthogonal.
>> [V,D]=eig([0,0;0,0])
V =
     1     0
     0     1
D =
     0     0
     0     0
Identity matrix. The characteristic polynomial is

det ( [ 1, 0 ]       [ 1, 0 ] )
    ( [ 0, 1 ]  − λ  [ 0, 1 ] )  = (1 − λ)² = 0 ,

which has the double root λ1,2 = 1. Again any vector is an eigenvector. The MATLAB solution
agrees with our analytical consideration (note that the eigenvectors are again orthonormal).
>> [V,D]=eig([1,0;0,1])
V =
     1     0
     0     1
D =
     1     0
     0     1
Diagonal matrix. The characteristic polynomial is

det ( [ a, 0 ]       [ 1, 0 ] )
    ( [ 0, b ]  − λ  [ 0, 1 ] )  = (a − λ)(b − λ) = 0 ,
which has the roots λ1 = a and λ2 = b. The eigenvectors may be calculated by substituting the
eigenvalue (let us start with λ1)

[ a, 0 ]
[ 0, b ]  v1 = λ1 v1 = a v1 ,

and by guessing that this can be satisfied with the vector

      [ 1 ]
v1 =  [ 0 ]  .
Similarly for the second eigenvalue.
The symbolic MATLAB solution agrees with our analytical consideration. (a, b are real symbolic
constants.)
>> syms a b real
>> [V,D]=eig([a,0;0,b])
V =
[ 1, 0]
[ 0, 1]
D =
[ a, 0]
[ 0, b]
General real matrix. The characteristic polynomial is

det ( [ a, d ]       [ 1, 0 ] )
    ( [ c, b ]  − λ  [ 0, 1 ] )  = (a − λ)(b − λ) − cd = 0 .
The roots λ1 and λ2 need to be solved for from this quadratic equation. The symbolic
MATLAB expression below evaluates the determinant
>> syms a b c d lambda real
>> det([a,d;c,b]-lambda*[1,0;0,1])
ans =
a*b-a*lambda-lambda*b+lambda^2-d*c
A helpful observation usually made in a linear algebra course is that the trace of the 2 × 2 matrix
(i.e. the sum of the diagonal elements) is equal to the sum of the eigenvalues, a + b = λ1 + λ2, and
the determinant of the matrix is equal to the product of the eigenvalues, ab − cd = λ1 λ2. We can
easily verify this symbolically in MATLAB by first computing the eigenvalues and eigenvectors
(symbolically)

syms a b c d lambda real
[V,D]=eig([a,d;c,b])

and then using the symbolic expressions

D(1,1)+D(2,2)-a-b
simple(D(1,1)*D(2,2)-a*b+c*d)
we check that we get identically zero. As an example consider the matrix

[  2, −1 ]
[ −1,  2 ]  .

We find the eigenvalues from 2 + 2 = 4 = λ1 + λ2 and 2 · 2 − (−1)(−1) = 3 = λ1 λ2. We
easily guess λ1 = 3 and λ2 = 1. The eigenvectors are found by substituting the eigenvalue into
the eigenvalue problem, and then solving the singular system of equations. For instance,

( [  2, −1 ]        [ 1, 0 ] ) [ z11 ]   [ 0 ]
( [ −1,  2 ]  − λ1  [ 0, 1 ] ) [ z21 ] = [ 0 ]

so that

[ −1, −1 ] [ z11 ]   [ 0 ]
[ −1, −1 ] [ z21 ] = [ 0 ]  .

These two equations are linearly dependent, and we cannot determine both elements z11, z21
from a single equation. Choosing for instance z11 = 1 gives (one possible) solution for the first
eigenvector

[ z11 ]   [  1 ]
[ z21 ] = [ −1 ]  .
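Although the worked example above is done by hand (and symbolically in MATLAB), the guessed eigenpairs are easy to double-check with a short NumPy sketch (Python; not part of the AETNA toolbox):

```python
import numpy as np

A = np.array([[2., -1.], [-1., 2.]])
z1 = np.array([1., -1.])   # guessed eigenvector for lambda1 = 3
z2 = np.array([1., 1.])    # guessed eigenvector for lambda2 = 1
print(np.allclose(A @ z1, 3.0 * z1))  # True
print(np.allclose(A @ z2, 1.0 * z2))  # True
# trace = sum of the eigenvalues, determinant = product of the eigenvalues
print(np.isclose(np.trace(A), 3.0 + 1.0), np.isclose(np.linalg.det(A), 3.0 * 1.0))
```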
Real matrix of the form (3.11). The characteristic polynomial is

det ( [ a, −b ]       [ 1, 0 ] )
    ( [ b,  a ]  − λ  [ 0, 1 ] )  = (a − λ)² + b² = 0 .

Taking the helpful formulas for the eigenvalues of the 2 × 2 matrix,

λ1 + λ2 = 2a ,   λ1 λ2 = a² + b² ,

and the identity (a + i b)(a − i b) = a² + b², we can see that the eigenvalues are in fact

λ1 = a + i b ,   λ2 = a − i b .

So the diagonal elements of the matrix are the real parts of the eigenvalues, and the off-diagonal
elements are the (real values of the) imaginary parts of the eigenvalues.
Suggested experiments
1. When we compute the eigenvector by solving the system with the singular matrix we have to
choose one element of the vector, apparently arbitrarily. Discuss whether the choice is truly
arbitrary. For instance, could we choose z11 = 0?
For the matrix K of (3.11) the characteristic equation (3.15) works out as

det ( [ Re k, −Im k ]       [ 1, 0 ] )
    ( [ Im k,  Re k ]  − λ  [ 0, 1 ] )  = (Re k − λ)² + (Im k)² = 0 .   (3.16)
We know that for the scalar case the eigenvalue is λ = k = Re k + i Im k. Would this eigenvalue satisfy
also the characteristic equation above? Substituting and simplifying we obtain:

(Re k − λ)² + (Im k)² = (Re k − Re k − i Im k)² + (Im k)² = i² (Im k)² + (Im k)² = 0 .

It does! That is not all, however. Numbers whose imaginary parts have equal magnitude but opposite
signs are called complex conjugate (see Figure 3.6). The characteristic equation (3.16) also has
the root λ = k̄ = Re k − i Im k, where the overbar means complex conjugate. This holds because
(−i Im k)² = (i Im k)². The eigenvalue problem in Section 3.2 is saying the same thing, since forming
the complex conjugate of the eigenvalue equation is equally valid as the original equation.

For a purely imaginary coefficient, Re k = 0, the matrix (3.11) specializes to

      [    0,  −Im k ]
K =   [ Im k,      0 ]  .   (3.17)
Fig. 3.6. A complex number a and its conjugate ā in the complex plane; their sum a + ā lies on the real axis.
These are interesting matrices, which occur commonly in many important applications. We will
hear more about them. The eigenvalues are λ1,2 = ±i Im k, which means purely imaginary. We write
λ1 = λ̄2 (and λ̄1 = λ2).

We solve for the components of the first eigenvector. The procedure is the same as in the example
above: substitute the computed eigenvalue into the eigenproblem equation, and since the resulting
equations are linearly dependent, choose one of the components of the eigenvector and solve for the
rest. Thus we get for λ1 = i Im k

( [    0, −Im k ]        [ 1, 0 ] ) [ z11 ]   [ 0 ]
( [ Im k,     0 ]  − λ1  [ 0, 1 ] ) [ z21 ] = [ 0 ]  .

This may be rewritten (after dividing through by Im k)

[ −i, −1 ] [ z11 ]   [ 0 ]
[  1, −i ] [ z21 ] = [ 0 ]

and choosing z21 = 1 we obtain the first eigenvector

     [ z11 ]   [ i ]
z1 = [ z21 ] = [ 1 ]  .

Similarly we obtain the second eigenvector as

     [ z12 ]   [ −i ]
z2 = [ z22 ] = [  1 ]  .

Note that z1 and z2 are complex conjugate, as are their corresponding eigenvalues. We can easily
convince ourselves that an eigenvalue problem with complex conjugate eigenvalues must have complex
conjugate eigenvectors. For an arbitrary real matrix A write the complex conjugate on either side
of the equation:

A z = λ z   implies   A z̄ = Ā z̄ = λ̄ z̄ .
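The conjugate eigenpairs of the skew-symmetric matrix can be verified directly. The NumPy sketch below (Python; not part of the AETNA toolbox) uses b = 0.3 as an arbitrary stand-in for Im k:

```python
import numpy as np

b = 0.3                               # plays the role of Im k; the value is arbitrary
K = np.array([[0., -b], [b, 0.]])
z1 = np.array([1j, 1.0])              # eigenvector for lambda1 = i*b
z2 = np.conj(z1)                      # eigenvector for lambda2 = -i*b, i.e. [-i, 1]
print(np.allclose(K @ z1, 1j * b * z1))   # True
print(np.allclose(K @ z2, -1j * b * z2))  # True
```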
Both eigenvalue/eigenvector pairs,

w1 = exp(λ1 t) z1   and   w2 = exp(λ2 t) z2 = w̄1 ,   (3.18)

could be solutions of the IVP (3.10). A general solution therefore is likely to be a mix of these two,

w = C1 w1 + C2 w2 .
We expect w to be a real vector, whereas w1 and w2 are both complex quantities. However, they are
complex conjugate, which suggests that if the constants are also complex conjugate the expression
on the right may be real (refer to Figure 3.6):

w = C1 w1 + C̄1 w̄1 .

In general, the complex constant may be written as

C1 = Re C1 + i Im C1   (3.19)

and the complex exponential has the equivalent expression (Euler's formula from complex analysis)

exp(i Im k t) = cos(Im k t) + i sin(Im k t) .   (3.20)
Substituting these expressions and collecting the real terms, the solution becomes

        [ −Re C1 sin(Im k t) − Im C1 cos(Im k t) ]
w = 2   [  Re C1 cos(Im k t) − Im C1 sin(Im k t) ]  ,   (3.21)

and expressing the constants through the initial values, Re y0 = −2 Im C1 and Im y0 = 2 Re C1, turns this into

    [ cos(Im k t), −sin(Im k t) ] [ Re y0 ]
w = [ sin(Im k t),  cos(Im k t) ] [ Im y0 ]  .   (3.22)

The matrix in the above equation is the so-called rotation matrix. The quantity Im k has the
meaning of angular velocity, and correspondingly Im k t is the rotation angle. One way of visualizing
rotations is through phasors: see Figure 3.7. A phasor is a rotating vector whose components vary
harmonically.
Figure 3.8 provides a link between different ways of visualizing rotations². The black circle
corresponds to the trace of the tip of the rotating vector of Figure 3.7. The rainbow-colored helical
tube (time advances from blue to red) is the black circle stretched in the time dimension. (Think
Slinky.)

The red curve is the projection of the helix onto the plane Im y = 0, and it is the graph of t
versus Re y. The blue curve is the projection of the helix onto the plane Re y = 0, and it is the graph
of t versus Im y. When we plot the solutions computed by the MATLAB integrators they are the
superposition of these (red and blue) curves in one plane, as shown on the right in Figure 3.8.³ The
rotating-vector picture tells us what kind of curves we should expect: the vector rotates with constant
angular velocity, which when projected onto either of the two coordinates will yield a sinusoidal
phase-shifted curve in time; compare with Figure 3.8.
See: aetna/ScalarODE/scalaroscillstream.m
See: aetna/ScalarODE/scalaroscillplot.m
Fig. 3.7. A phasor: the vector w(t), rotated by the angle Im k t from its initial position w0, in the plane Re y, Im y.
Fig. 3.8. Graphical representation of the solution to (3.12) with Im k = 0.3, Re y0 = 0, Im y0 = 8.
See: aetna/ScalarODE/scalaroscill1st.m
time step, but the forward Euler integrator odefeul fails spectacularly: the solution blows up very
quickly (Figure 3.10 on the left). The backward Euler integrator is not much better, except that the
amplitude goes to zero (Figure 3.10 on the right). With smaller time steps we can reduce the rate of
the blow-up (or decay) of the amplitude, but we can never remove it (try it: decrease the time step by
a couple of orders of magnitude and arm yourselves with patience; it is going to take a long time to
integrate). We consider the constant amplitude as the main feature of the solutions to this problem.
Therefore, we must conclude that for this problem the two integrators appear to be unconditionally
unstable as they are unable to maintain an unchanging amplitude of the oscillations no matter how
small the time step. For comparison we show the results for the built-in ode45 integrator, applied
Fig. 3.10. Example of Section 3.11, odefeul integrator (on the left) and odebeul integrator (on the right).
Time step Δt = 0.099.
to this problem over a long integration time⁵ in Figure 3.11 (there are so many oscillations that the curves visually
melt into a solid block). We see that even for this integrator there is a systematic change (decay)
in the amplitude of the oscillation. By reducing the time step length we can reduce the drift, but
we cannot remove it entirely (as observed in numerical experiments). Again, this behavior has to do
with stability, not accuracy.
See: aetna/ScalarODE/scalaroscill1stlong.m
Fig. 3.11. Example of Section 3.11, ode45 integrator, long integration time
and we work in the knowledge that k, yj, yj+1 are complex. We now understand that for a purely
imaginary k = i Im k the solution may be represented as a circle in the plane Re y, Im y. Another way
of saying this is that the modulus of the complex quantity y is constant. We take the modulus on both
sides:

|yj+1| = |(1 + Δt k) yj| = |1 + Δt k| |yj| ,   (3.23)

and in order to get |yj+1| = |yj| (so that the solution points lie on a circle) we need the complex
amplification factor to satisfy

|1 + Δt k| = 1 .   (3.24)

Figure 3.12 illustrates the meaning of the above equation graphically. The circle of radius equal to
1.0 centered at (0, 0) is translated to be centered at (−1, 0) in order for the complex number Δt k to
satisfy (3.24). Now consider the purely imaginary value of the coefficient k = i Im k. Such numbers
Fig. 3.12. The number Δt k relative to the stability circle |1 + Δt k| = 1, of radius 1 and centered at (−1, 0), in the plane Δt Re k, Δt Im k.
lie along the imaginary axis, Re k = 0, and when multiplied by Δt > 0 the resulting product just
moves closer to or further away from the origin. One such number Δt k is shown in Figure 3.12.
In order for Δt k to satisfy (3.24) the dot representing the number must move to the thick circle
in Figure 3.12. We can see that no such non-zero time step length exists: only Δt = 0 will make
Δt k = 0 lie on the circle, at (0, 0). Therefore, we must conclude that the forward Euler method
is unconditionally unstable for imaginary k, as there is no time step length that would satisfy the
stability requirement (3.24).
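This conclusion can also be seen numerically: for purely imaginary k, |1 + Δt k| = sqrt(1 + (Δt Im k)²), which exceeds 1 for every positive Δt. A plain-Python sketch (not part of the AETNA toolbox; k = 3i is an arbitrary illustrative value):

```python
# For purely imaginary k, the forward Euler amplification magnitude always exceeds 1.
k = 3j
for dt in (0.099, 0.01, 0.001):
    print(dt, abs(1 + dt * k))    # magnitudes are all greater than 1
# No matter how small dt > 0 is made, the amplitude grows:
print(all(abs(1 + dt * k) > 1 for dt in (0.099, 0.01, 1e-6)))  # True
```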
Next we shall consider the backward Euler method (3.7) for the same problem. Taking the
modulus on both sides we obtain

|yj+1| = | yj / (1 − Δt k) | = |yj| / |1 − Δt k| ,

and in order to get |yj+1| = |yj| (so that the solution points lie on a circle) we need

|1 − Δt k| = 1 .   (3.25)
Figure 3.13 now illustrates that the circle of radius equal to 1.0 centered at (0, 0) needs to be
translated to be centered at (+1, 0) in order for Δt k to satisfy (3.25). Again, we must conclude that
the backward Euler method is unconditionally unstable for imaginary k, as there is no non-zero time
step length that would satisfy the stability requirement (3.25).
Fig. 3.13. The number Δt k relative to the stability circle |1 − Δt k| = 1, of radius 1 and centered at (+1, 0), in the plane Δt Re k, Δt Im k.
(3.26)
The solution will be in the form of (3.21), except that everything will be multiplied by the real
exponential exp(Re k t):

                    [ −Re C1 sin(Im k t) − Im C1 cos(Im k t) ]
w = 2 exp(Re k t)   [  Re C1 cos(Im k t) − Im C1 sin(Im k t) ]  .

Following the same steps as in Section 3.10, we arrive at the solution to the IVP in the form

                  [ cos(Im k t), −sin(Im k t) ] [ Re y0 ]
w = exp(Re k t)   [ sin(Im k t),  cos(Im k t) ] [ Im y0 ]  ,   (3.27)

which may be interpreted readily as the rotation of a phasor with exponentially decreasing (Re k < 0)
or increasing (Re k > 0) amplitude.
Let us take first Re k < 0 and the forward Euler algorithm. Equation (3.23) is still our starting
point, but now we are asking if there is a time step length that would make the modulus of the
solution decrease in time, or in mathematical terms

|1 + Δt k| < 1 .   (3.28)

For the accompanying picture refer to Figure 3.14: one possible complex coefficient k is shown, as
is its scaling (down) by the time step, Δt k. Clearly, it is now possible by choosing a sufficiently small
time step length to bring Δt k inside the circle so that its distance from (−1, 0) is less than one, and
so that the stability criterion (3.28) is satisfied. Since now there is a time step length such that the
forward Euler can reproduce the correct solution shape, we call forward Euler for general complex k
and Re k < 0 conditionally stable. The condition implied by "conditionally" is equation (3.28), and
for a given k we can use it to solve for an appropriate Δt.
Fig. 3.14. A complex coefficient k with Re k < 0 and its scaling Δt k, shown relative to the stability circle |1 + Δt k| = 1 centered at (−1, 0).
On the other hand, we can now see that for the forward Euler algorithm we achieve stability
for Re k > 0 for any Δt (for a growing solution, stability now means |1 + Δt k| > 1): the coefficient
k is in the right-hand half-plane, and the stability circle is in the left-hand half-plane. Therefore
multiplying a complex k with an arbitrary Δt > 0 will satisfy |1 + Δt k| > 1. Hence, for Re k > 0 the
forward Euler method is unconditionally stable.
This state of affairs is again mirrored by the behavior of the backward Euler algorithm. First
take Re k > 0. Equation (3.25) is now used to figure out if there is a time step length that would
make the modulus of the solution increase in time, or in mathematical terms

1 / |1 − Δt k| > 1 .   (3.29)
For the accompanying picture refer to Figure 3.15: one possible complex coefficient k is shown,
as is its scaling Δt k. Clearly, it is now possible by choosing a sufficiently small time step length
to bring Δt k inside the circle so that its distance from (+1, 0) is less than one, which will ensure
satisfaction of (3.29). Thus, the backward Euler method is conditionally stable for general complex
k and Re k > 0. Also, we now conclude the backward Euler algorithm achieves stability for Re k < 0
for any Δt (stability here meaning the inequality in (3.29) is reversed): the coefficient k is in the
left-hand half-plane, and the stability circle is in the right-hand half-plane. Similar reasoning as for
the forward Euler leads us to conclude that backward Euler is unconditionally stable for complex k
and Re k < 0.
Illustration 4
Apply the modified Euler (2.28) to the model equation (3.1), and derive the amplification factor.
Fig. 3.15. A complex coefficient k with Re k > 0 and its scaling Δt k, shown relative to the stability circle |1 − Δt k| = 1 centered at (+1, 0).
Substituting the right-hand side of the model equation into the formula (2.28) we get

ya = y(t0) + (t − t0) f(t0, y(t0)) = y(t0) + (t − t0) k y(t0)

and

y(t) = y(t0) + ((t − t0)/2) [k y(t0) + k ya]
     = y(t0) + ((t − t0)/2) [k y(t0) + k (y(t0) + (t − t0) k y(t0))]
     = y(t0) [ 1 + k(t − t0) + (k(t − t0))² / 2 ] .

The term in square brackets that multiplies y(t0) is the amplification factor for the modified Euler.
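The amplification factor 1 + z + z²/2 (with z = Δt k) reproduces the exponential exp(z) through the quadratic term, so the per-step error shrinks like z³. A plain-Python sketch of this observation (not part of the AETNA toolbox):

```python
import math

def amp_modified_euler(z):
    # z stands for dt*k; this is the factor derived above
    return 1 + z + 0.5 * z ** 2

# The mismatch with exp(z) behaves roughly like z^3/6 for small real z:
for z in (0.1, 0.01):
    print(z, abs(math.exp(z) - amp_modified_euler(z)))
```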
Suggested experiments
1. Derive the amplication factor for the trapezoidal rule (2.27).
ẏ = k y ,   y(0) = y0 ,   k, y complex .
The eigenvalue λ = k (a complex number) is plotted in the complex plane. Depending on where
it lands, the analytical solution will display the following behaviors: in the left half-plane we get
decaying oscillations, in the right half-plane growing oscillations. If the eigenvalue is purely
imaginary, we get pure oscillation. If the eigenvalue is purely real, we get either exponentially
decaying or growing solutions. Finally, a zero eigenvalue yields a stagnant (unchanging) solution.
Figure 3.17 shows the behaviors produced by the forward Euler integrator. The same color coding as in
Figure 3.16 is used. The key to understanding whether the forward Euler integrator can give us a discrete
Fig. 3.16. Behavior classification for the first-order linear differential equation.
solution that mimics the analytical one is to compare the two figures. The complex number Δt λ is
plotted in the complex plane in Figure 3.17. Forward Euler can reproduce the desired behavior if
there is such a Δt as to place the number Δt λ in Figure 3.17 in the region with the same color as
the one in which λ was located in Figure 3.16.
Illustration 5

Example 1: consider λ = −0.1 + i3. The analytical solution is a decaying oscillation. In Figure 3.17 we
can see that a sufficiently small time step Δt will indeed place Δt λ inside the circle of unit radius
centered at −1, which has the same color as the left-hand half-plane in Figure 3.16. Forward
Euler is conditionally stable in this case. (The condition is that Δt must be sufficiently small.)

Example 2: consider λ = i3. The analytical solution is pure oscillation. In Figure 3.17 we can
see that it is not possible to find any other time step but Δt = 0 to place Δt λ on the circle of unit
radius centered at −1 (which has the same color as the imaginary axis in Figure 3.16). Forward
Euler is unconditionally unstable for pure oscillations.

Example 3: consider λ = 13.3. The analytical solution is exponential growth. In Figure 3.17 we
can see that the positive part of the real axis has the same color in both figures. Therefore, for
all Δt > 0 we get the correct behavior. Forward Euler is unconditionally stable for exponentially
growing solutions.

Example 4: consider λ = −0.61. The analytical solution is exponentially decaying. In Figure 3.17
we can see that a sufficiently small time step Δt will indeed place Δt λ within the interval
−1 ≤ Δt λ < 0, which has the same color as the negative part of the real axis in Figure 3.16. Forward Euler
is conditionally stable in this case. (The condition is that Δt must be sufficiently small.)
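The four examples can be checked with the forward Euler amplification magnitude |1 + Δt λ| alone. A plain-Python sketch (not part of the AETNA toolbox); the Δt values are illustrative:

```python
lam1 = -0.1 + 3j   # Example 1: decaying oscillation
print(abs(1 + 0.01 * lam1) < 1)                            # True: small dt is stable
lam2 = 3j          # Example 2: pure oscillation
print(all(abs(1 + dt * lam2) > 1 for dt in (0.1, 1e-4)))   # True: no dt > 0 works
lam3 = 13.3        # Example 3: exponential growth
print(abs(1 + 0.5 * lam3) > 1)                             # True: growth for any dt
lam4 = -0.61       # Example 4: exponential decay
print(abs(1 + 0.1 * lam4) < 1)                             # True: small dt is stable
```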
In words, using the pair of images 3.16 and 3.17, the forward Euler integrator is found to be
unconditionally unstable for pure oscillations, unconditionally stable for growing oscillations and
exponentially growing non-oscillating solutions, and conditionally stable for exponentially decaying
oscillating and non-oscillating solutions. Analogous observations can be made about the backward
Euler integrator, which is found to be unconditionally unstable for pure oscillations, conditionally
stable for growing oscillations and exponentially growing non-oscillating solutions, and unconditionally
stable for exponentially decaying oscillating and non-oscillating solutions.
Fig. 3.17. Behavior classification for the first-order linear differential equation, forward Euler algorithm.
Fig. 3.18. Behavior classification for the first-order linear differential equation, backward Euler algorithm.
The amplification factor of the modified Euler may be written in terms of the product Δt λ as

1 + Δt λ + (Δt λ)² / 2 .   (3.30)

All possible complex λ are allowed, which means that Δt λ may represent an arbitrary point of the
complex plane. The magnitude of the amplification factor may therefore be considered a function
of the complex number Δt λ, and it is often useful to visualize such functions as surfaces raised
above the complex plane.⁶ The MATLAB function surf is designed to do just that. It takes three
matrices which represent the coordinates of points of a logically rectangular grid. The elements
x(k,m), y(k,m), z(k,m) represent Cartesian coordinates of the k,m vertex of the grid. The grid
may then be rendered with surf(x,y,z). Here we set up a grid with 99 rectangular faces in each
direction (which is why we have 100 × 100 matrices for the corners of those faces). First, the extent
of the grid and the number of corners:

See: aetna/StabilitySurfaces/StabilitySurfaces.m
xlow =-3.2; xhigh= 0.9;
ylow =-3.2; yhigh= 3.2;
n=100;
Then we set up the matrices for the coordinates. Note that the index k corresponds to moving in the x
direction, and the index m corresponds to moving in the y direction. dtlambda is a complex number (1i
is the imaginary unit), so taking its absolute value means getting the magnitude of the amplification
factor.
x=zeros(n,n); y=zeros(n,n); z=zeros(n,n);
for k =1:n
for m =1:n
x(k,m) =xlow +(k-1)/(n-1)*(xhigh-xlow);
y(k,m) =ylow +(m-1)/(n-1)*(yhigh-ylow);
dtlambda = x(k,m) + 1i*y(k,m);
z(k,m) = abs(1 + dtlambda + 0.5*dtlambda.^2);
end
end
Of course there is more than one way of accomplishing this. Here is the whole setup accomplished
with just three lines using the handy meshgrid and linspace functions.
[x,y] = meshgrid(linspace(xlow,xhigh,n),linspace(ylow,yhigh,n));
dtlambda = x + 1i*y;
% Modified Euler
z = abs(1 + dtlambda + 0.5*dtlambda.^2);
Next we draw the color-coded surface that represents the height z above the complex plane: blue
is the lowest, red is the highest.
surf(x,y,z,'edgecolor','none')
Then we draw into the same figure the level curve at height 1.0 of the same function z of x, y. We
set the linewidth of the curve using a handle returned from the function contour3.
hold on
[C,H] = contour3(x,y,z,[1, 1],'k')
set(H,'linewidth', 3)
Finally set up the view, and label the axes.
axis([-4 0.6 -4 4 0 8])
axis equal,
xlabel ('Re (\Delta{t}\lambda)')
ylabel ('Im (\Delta{t}\lambda)')
Voilà: Figure 3.19. It shows how the amplification factor falls below 1.0 in amplitude inside an oval
shape in the left-hand half-plane.
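The same stability map can be sampled without any plotting at all: evaluate |1 + Δtλ + (Δtλ)²/2| on a grid of Δtλ values. A NumPy sketch (Python; not part of the AETNA toolbox), using the same grid extents as the MATLAB script above:

```python
import numpy as np

x = np.linspace(-3.2, 0.9, 100)
y = np.linspace(-3.2, 3.2, 100)
X, Y = np.meshgrid(x, y)
Z = np.abs(1 + (X + 1j * Y) + 0.5 * (X + 1j * Y) ** 2)   # modified Euler magnitude
print(Z.shape)  # (100, 100)
# Spot checks: stable well inside the oval, unstable in the right half-plane.
print(abs(1 + (-1) + 0.5 * (-1) ** 2) < 1)    # True  (dt*lambda = -1 gives 0.5)
print(abs(1 + 0.5 + 0.5 * 0.5 ** 2) > 1)      # True  (dt*lambda = +0.5 gives 1.625)
```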
As shown in the MATLAB script StabilitySurfaces, corresponding surface representations of
the amplification factors for the methods discussed so far, namely forward and backward Euler (Figures 3.20
and 3.21), the trapezoidal rule (Figure 3.22), and the fourth-order Runge-Kutta (Figure 3.23), are easily
obtained just by commenting out or uncommenting the appropriate definitions of the variable z.
Figure 3.24 compares the level curves at 1.0 for the amplitude of the amplification factor
for the first-order linear differential equation for the integrators FEUL = forward Euler algorithm,

Fig. 3.19. Surface of the amplitude of the amplification factor for the first-order linear differential equation,
MEUL = modified Euler algorithm. The contour of unit amplitude is shown in black.

Fig. 3.20. Surface of the amplitude of the amplification factor for the first-order linear differential equation,
FEUL = forward Euler algorithm. The contour of unit amplitude is shown in black.
Fig. 3.21. Surface of the amplitude of the amplification factor for the first-order linear differential equation,
BEUL = backward Euler algorithm. The contour of unit amplitude is shown in black.

Fig. 3.22. Surface of the amplitude of the amplification factor for the first-order linear differential equation,
TRAP = trapezoidal rule algorithm. The contour of unit amplitude is shown in black.

Fig. 3.23. Surface of the amplitude of the amplification factor for the first-order linear differential equation,
RK4 = fourth-order Runge-Kutta algorithm. The contour of unit amplitude is shown in black.
BEUL = backward Euler algorithm, MEUL = modified Euler algorithm, TRAP = trapezoidal rule
algorithm, and RK4 = fourth-order Runge-Kutta algorithm. Note that the level curve for the trapezoidal
rule coincides with the vertical axis in the figure. For a decaying solution, the integrator will produce
a stable solution if Δt λ is inside the contours in the left-hand half-plane, or, for the backward Euler,
outside the circle in the right-hand half-plane. Clearly, comparing Figure 3.24 with the surface
representations of integrator stability in Figures 3.20-3.23, we can see that visualizing the stability
with contours is only part of the story: the surface figures supply the missing information about the
magnitude of the amplification factor.
Suggested experiments
1. Use the information in Figure 3.22 to estimate the stability diagram for the integrator
odetrap, similar to those shown in Figures 3.17 and 3.18.
Fig. 3.24. Level curves (contours) at value 1.0 of the amplitude of the amplification factor for the
first-order linear differential equation; FEUL = forward Euler, BEUL = backward Euler, MEUL = modified
Euler, TRAP = trapezoidal rule, RK4 = fourth-order Runge-Kutta.
4
Linear Single Degree of Freedom Oscillator
Summary
1. The model of the linear oscillator with a single degree of freedom is investigated from the point
of view of the uncoupling procedure (so-called modal expansion), and the solution in the form
of a matrix exponential. Main idea: solve the eigenvalue problem for the governing ODE system,
and expand the original variables in terms of the eigenvectors. The modal expansion is a critical
piece in engineering vibration analysis.
2. For the single degree of freedom linear vibrating system we study how to transform between
the second-order and the first-order matrix form, and we discuss the relationship of the scalar
equation with the complex coefficient from Chapter 3 with the linear oscillator model. Main
idea: the two IVPs are shown to be equivalent descriptions.
3. It is shown that modal analysis is possible as long as the system matrix is not defective, i.e. as
long as it has a full set of eigenvectors. The case of critical damping is discussed as a special case
which leads to a defective system matrix.
4. The modal analysis allows multiple degree of freedom systems to be understood in terms of the
properties of multiple single degree of freedom linear oscillators.
v(0) = v0 ;

together these will constitute the complete definition of the IVP of the linear oscillator. Using the
definition of the velocity,

v = ẋ ,

will yield the general first-order form of the 1-dof damped oscillator IVP as

ẏ = A y ,   y(0) = y0 ,   (4.1)

where

      [    0,     1 ]
A =   [ −k/m,  −c/m ]   (4.2)
and

      [ x ]
y =   [ v ]  .
The discussion of Section 3.7 (refer to equation (3.13)) applies here too. We assume the solution
in the form of an exponential

y = e^{λt} z .

The characteristic equation for the damped oscillator is

                     [   −λ,        1 ]
det (A − λ1) = det   [ −k/m,  −c/m − λ ]   = λ² + (c/m) λ + k/m = 0 .

The quantity ωn, defined by

ωn² = k/m ,   (4.3)

is the natural angular frequency of the oscillator.
Therefore, the differential equation of motion may be written in terms of the new variables as

[   ẇ1  ]   [    0,  1 ] [   w1  ]
[ ωn ẇ2 ] = [ −ωn²,  0 ] [ ωn w2 ]  ,

and by canceling ωn in the second equation we obtain

[ ẇ1 ]   [   0,  ωn ] [ w1 ]       [ w1(0) ]   [   x0  ]
[ ẇ2 ] = [ −ωn,   0 ] [ w2 ]  ,    [ w2(0) ] = [ v0/ωn ]  ,

which is in perfect agreement with Section 3.10: we get two variables, the displacement x and the
velocity scaled by the angular frequency, v/ωn, coupled together by a skew-symmetric matrix of the
form of equation (3.17) (with ωn playing the role of Im k). The solution in the new variables w1, w2 is
therefore expressed by the rotation matrix as in (3.22). Now we can understand that Figure 3.7
describes the motion of an oscillating mass.
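The link between the skew-symmetric system matrix and the rotation matrix can be verified numerically: the matrix exponential of the scaled-oscillator matrix is exactly a rotation. A NumPy sketch (Python; not part of the AETNA toolbox), where the values of ωn and t are arbitrary illustrative choices:

```python
import numpy as np

def expm_via_eig(M, t):
    # matrix exponential via the eigendecomposition (valid for diagonalizable M)
    lam, V = np.linalg.eig(M)
    return np.real(V @ np.diag(np.exp(lam * t)) @ np.linalg.inv(V))

omega_n, t = 2.0, 0.7
S = np.array([[0., omega_n], [-omega_n, 0.]])      # skew-symmetric oscillator matrix
R = np.array([[np.cos(omega_n * t), np.sin(omega_n * t)],
              [-np.sin(omega_n * t), np.cos(omega_n * t)]])
print(np.allclose(expm_via_eig(S, t), R))  # True: exp(S t) is the rotation matrix
```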
The roots of the characteristic equation are

λ1,2 = −c/(2m) ± √( (c/(2m))² − k/m ) ,

and we must have

(c/(2m))² ≥ k/m

for λ1,2 to come out real.
4.1.2 λ = −(c/2m) ± iω: Oscillation

This is the second subcase. Substituting λ = −(c/2m) + iω into the characteristic equation, we obtain

( −c/(2m) + iω )² + (c/m)( −c/(2m) + iω ) + k/m = 0 ,   (4.4)
62
( c )2
k
=
.
m
2m
For to come out real and positive (we include the latter condition since = 0 was already covered
in the preceding section) we require
( c )2
k
<
.
2m
m
4.1.3 Critically damped oscillator

The case of ω = 0 and at the same time

( c/(2m) )² = k/m

yields a special case: the critically damped oscillator. The damping coefficient

c_cr = 2mω_n                                                   (4.5)

is the so-called critical damping. The critically damped oscillator needs special handling, which
we will postpone to its own section following the discussion of the generic cases of the supercritically and the subcritically damped oscillators.
4.2 Supercritically damped oscillator (i.e. ζ > 1)

The eigenvalues are real,

λ_{1,2} = −c/(2m) ± √( (c/(2m))² − k/m ) .

Let us compute the first eigenvector, corresponding to λ₁ = −c/(2m) + √( (c/(2m))² − k/m ). We are
looking for the vector z₁ that solves

(A − λ₁ 1) z₁ = 0 .
Substituting, we have

[ −λ₁, 1 ; −k/m, −c/m − λ₁ ] [ z₁₁ ; z₂₁ ] = [ 0 ; 0 ] .

The two equations are really only one equation (the rows and columns of the matrix on the left
are linearly dependent, since that is the condition from which we solved for λ₁). Therefore, using
for instance the first equation and choosing z₁₁ = 1 we compute z₂₁ = λ₁. We repeat the same
procedure for the second root to arrive at the two eigenvectors

z₁ = [ z₁₁ ; z₂₁ ] = [ 1 ; λ₁ ] ,   z₂ = [ z₁₂ ; z₂₂ ] = [ 1 ; λ₂ ] .
The general solution of the differential equation of motion of the oscillator is therefore

y = c₁ e^{λ₁ t} z₁ + c₂ e^{λ₂ t} z₂ .                          (4.6)

In matrix form this reads

y = [z₁, z₂] [ e^{λ₁ t}, 0 ; 0, e^{λ₂ t} ] [ c₁ ; c₂ ] = V [ e^{λ₁ t}, 0 ; 0, e^{λ₂ t} ] [ c₁ ; c₂ ] .   (4.7)
Using the transformed initial value

w₀ = V⁻¹ y₀                                                    (4.8)

we can write

w = [ e^{λ₁ t}, 0 ; 0, e^{λ₂ t} ] w₀

as a completely equivalent solution to the oscillator IVP, using the new variable w. Each component
of the solution is independent of the other, as we can see from the scalar equivalent of the above
matrix equation

w₁(t) = e^{λ₁ t} w₁₀ ,   w₂(t) = e^{λ₂ t} w₂₀ ,                (4.9)

with the initial condition w(0) = w₀ = V⁻¹ y₀ .
The matrix V⁻¹AV is a very nice one: it is diagonal. To see this, we realize that for each column
of the matrix V the eigenvalue problem

A z_j = λ_j z_j ,   j = 1, 2                                   (4.10)

holds, and writing all such eigenvalue problems in one shot is possible as

A [z₁, z₂] = [z₁, z₂] [ λ₁, 0 ; 0, λ₂ ]                        (4.11)

using the diagonal matrix

Λ = [ λ₁, 0 ; 0, λ₂ ] .                                        (4.12)

Therefore we have

A [z₁, z₂] = A V = V Λ

and pre-multiplying with V⁻¹

V⁻¹ A V = Λ .                                                  (4.13)

We say that the matrix A is similar to a diagonal matrix. (We also say that A is diagonalizable.)
So the IVP for the oscillator can be written in the new variable w as

ẇ = V⁻¹ A V w = Λ w ,   w(0) = w₀ = V⁻¹ y₀ .
This means that we can write totally independent scalar IVPs for each component

ẇ₁(t) = λ₁ w₁ ,  w₁(0) = w₁₀ ,   ẇ₂(t) = λ₂ w₂ ,  w₂(0) = w₂₀ .   (4.14)

This is the well-known decoupling procedure: the original variables y are in general coupled together
since the matrix A is in general non-diagonal. Therefore, to make things easier for us, we switch to
a different set of variables w with the transformation (4.8) in which all the variables are uncoupled.
The uncoupled variables each have their own IVP, which is easily solved. Finally, if we wish to, we
switch back to the original variables y. This procedure may be summarized as

ẏ = A y ,  y(0) = y₀   (original IVP),
w₀ = V⁻¹ y₀ ,  ẇ = Λ w   (uncoupled IVP),                          (4.15)
w_j(t) = e^{λ_j t} w_{j0} ,  y(t) = Σ_{j=1}^{2} z_j w_j(t)   (back to the original variables).
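The decoupling procedure just summarized can be cross-checked numerically. The following is a minimal sketch in Python with NumPy/SciPy (an illustrative aside, not part of the MATLAB toolkit; the data values mirror Illustration 3 below):

```python
import numpy as np
from scipy.linalg import expm

# Data as in the supercritically damped example: m = 13, k = 6100, zeta = 3/2
m, k, zeta = 13.0, 6100.0, 1.5
omega_n = np.sqrt(k / m)
c = zeta * 2 * m * omega_n                      # supercritical damping
A = np.array([[0.0, 1.0], [-k / m, -c / m]])
y0 = np.array([0.0, 1.0])                       # x0 = 0, v0 = 1

lam, V = np.linalg.eig(A)                       # columns of V are eigenvectors
w0 = np.linalg.solve(V, y0)                     # w0 = V^{-1} y0

t = 0.37                                        # arbitrary time instant
w = np.exp(lam * t) * w0                        # uncoupled scalar solutions
y = V @ w                                       # back to the original variables

# The same solution obtained directly from the matrix exponential of A t
y_direct = expm(A * t) @ y0
print(np.allclose(y, y_direct))                 # True
```

The check confirms that solving the two scalar IVPs and mapping back with V is equivalent to solving the coupled IVP directly.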
Illustration 3

Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 3/2, x₀ = 0, and v₀ = 1.
We shall follow the procedure (4.15). The MATLAB solution is based on the Symbolic Math
Toolbox. First come the definitions of the variables. The variable names are self-explanatory.
function froscill_super_symb
syms m k c omega_n t x0 v0 real
y0= [x0;v0];
c_cr=2*m*omega_n;
c=3/2*c_cr;
A = [0, 1; -omega_n^2, -(c/m)];
We compute symbolically the eigenvalues and eigenvectors, and we construct the diagonal matrix
(called L in the code)
[V,D] =eig(A);
L =simple(inv(V)*A*V);
(Control question: How do L and D compare?) Next we can compute the matrix with e^{λ_j t} on the
diagonal (called eLt). Note that calling the MATLAB function exp on a matrix would exponentiate
each element of the matrix. This is not what we intend: only the elements on the diagonal should
be affected. Therefore we have to extract the diagonal of L with diag, exponentiate, and then
reconstruct a square matrix with another call to diag
eLt =diag(exp(diag(L)*t));
Now we are ready to write down the last equation (4.15) to construct the solution components
(displacement and velocity).
y=simple(V*eLt*inv(V))*y0;
It only remains to substitute numbers and plot. These are the given numbers, and we also define an
auxiliary variable.
x0= 0; v0=1;% [initial displacement; initial velocity]
m= 13;
k= 6100; omega_n= sqrt(k/m);
For the plotting we need data to plot on the horizontal and vertical axis. Here we set it up so that
the time variable array t consists of 200 points spanning two periods of vibration of the undamped
system.
T_n=(2*pi)/omega_n;
t=linspace(0, 2*T_n, 200);
Finally the plotting of the components of the solution.
plot(t,eval(vectorize(y(1))),'m-'); hold on
plot(t,eval(vectorize(y(2))),'r--'); hold on
Remember that the components of y are symbolic expressions. Now that we have provided all the variables
with numerical values, we need to evaluate the numerical value of the solution components using
the MATLAB function eval. It also doesn't hurt to use the function vectorize: the variable t is an
array. In case the expression for the solution components contained arithmetic operators joining two or
more terms that referred to t (such as exp(t)*sin(t)) we would want the expressions to evaluate
element-by-element. vectorize replaces all references to operators such as * or ^ with .* or
.^ so that these operators work on each scalar element of the arrays in turn.
The eigenvalues are now complex,

λ_{1,2} = −c/(2m) ± i √( k/m − (c/(2m))² ) .
Let us remind ourselves that an undamped oscillator is a special case of the subcritically damped
oscillator for c = 0.
The same procedure as in Section 4.2 leads to the eigenvectors

z₁ = [ z₁₁ ; z₂₁ ] = [ 1 ; λ₁ ] ,   z₂ = [ z₁₂ ; z₂₂ ] = [ 1 ; λ₂ ] ,
which are complex, since the λ_j are complex numbers. The solution is again written as in (4.6), but with
the important difference that all quantities on the right-hand side are complex while the left-hand
side is expected to be real.
The second eigenvector corresponds to the second eigenvalue, which is the complex conjugate of
the first one, λ₂ = λ̄₁. We see this easily by writing the complex conjugate of the equation A z = λ z
(see equation (3.18)). The two constants c_j can be determined from the initial condition

y(0) = c₁ e^{λ₁·0} z₁ + c₂ e^{λ₂·0} z₂ = c₁ z₁ + c₂ z₂ = y₀

and since y₀ is real, the two constants must be complex conjugates of each other, c₂ = c̄₁. The
constants are still determined by

[ c₁ ; c₂ ] = V⁻¹ y₀ .
Now we can follow all the derivations from the previous section, and the solution will still be arrived
at in the form of (4.7). Since both y and y₀ are real, the product of the three complex matrices

V [ e^{λ₁ t}, 0 ; 0, e^{λ̄₁ t} ] V⁻¹

must also be real, and however surprising it may seem, it is real. (We can do the algebra by hand
or with MATLAB to check this.)
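The claim that this complex product collapses to a real matrix is easy to check numerically. A minimal sketch in Python/NumPy (illustrative aside only; the data values are the ones used in Illustration 4 below):

```python
import numpy as np

# Subcritical damping (zeta = 0.2): complex conjugate eigenpairs, yet
# V diag(e^{lambda_j t}) V^{-1} must come out real.
m, k, zeta = 13.0, 6100.0, 0.2
omega_n = np.sqrt(k / m)
c = zeta * 2 * m * omega_n
A = np.array([[0.0, 1.0], [-k / m, -c / m]])

lam, V = np.linalg.eig(A)
assert np.allclose(lam[0], np.conj(lam[1]))     # complex conjugate pair

t = 0.11
P = V @ np.diag(np.exp(lam * t)) @ np.linalg.inv(V)
print(np.max(np.abs(P.imag)))                   # negligibly small: P is real
```

Up to floating-point round-off, the imaginary part of the product vanishes, exactly as argued above.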
Illustration 4

Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 0.2 (< 1 so that the
damping is subcritical), x₀ = 0, and v₀ = 1.
We shall follow the procedure (4.15). The MATLAB solution is based on the Symbolic Math
Toolbox. First come the definitions of the variables. The variable names are self-explanatory. The code is
pretty much the same as for the supercritically damped oscillator example above, except
function froscill_sub_symb
...
c=0.2*c_cr;
...
We may verify that the eigenvalues (and eigenvectors) are now general complex numbers. For instance
K>> D(1,1)
ans =
(-1/5+2/5*i*6^(1/2))*omega_n
It is rather satisfying to find that no modifications to the code of froscill_super_symb, which was
written for the real (supercritical) case, are required to account for the complex eigenvalues and
eigenvectors: it just works as is.
For the undamped oscillator (c = 0) the eigenvalues are purely imaginary,

λ_{1,2} = ±i √(k/m) = ±i ω_n .

Let us compute the first eigenvector, corresponding to λ₁ = i ω_n:

z₁ = [ z₁₁ ; z₂₁ ] = [ 1 ; i ω_n ] .

The second eigenvector corresponds to the second eigenvalue, which is the complex conjugate of the
first one, λ₂ = λ̄₁ = −i ω_n ,
z₂ = [ z₁₂ ; z₂₂ ] = z̄₁ = [ 1 ; −i ω_n ] .
The general solution of the free undamped oscillator motion is a linear combination of the eigenvectors

y = c₁ e^{λ₁ t} z₁ + c₂ e^{λ₂ t} z₂ .

Because of the complex-conjugate status of the pairs of eigenvalues and eigenvectors, we have

y = c₁ e^{λ₁ t} z₁ + c̄₁ e^{λ̄₁ t} z̄₁ .

Introducing the initial condition, which is real, we obtain

y(0) = c₁ z₁ + c₂ z̄₁

and we must conclude c₂ = c̄₁, otherwise the right-hand side couldn't be real. Using

Re a = (a + ā)/2

we see that the sum c₁ z₁ + c̄₁ z̄₁ therefore evaluates to 2 Re(c₁ z₁), and the constants can be determined from
y(0) = 2 Re(c₁ z₁) = 2 ( Re c₁ Re z₁ − Im c₁ Im z₁ ) = 2 [ Re z₁ , −Im z₁ ] [ Re c₁ ; Im c₁ ] .
We will introduce the matrix composed of the real and (negative) imaginary part of the eigenvector z₁

Z = [ Re z₁ , −Im z₁ ] = [ 1, 0 ; 0, −ω_n ] .                  (4.16)
Then we can write

[ Re c₁ ; Im c₁ ] = (1/2) Z⁻¹ y(0) = (1/2) [ 1, 0 ; 0, −1/ω_n ] y(0) .
Using the same principle that we obtain a real number from the sum of complex conjugates, we
write

y = 2 Re( c₁ e^{λ₁ t} z₁ ) ,                                   (4.17)

which may be expanded into

y = 2 [ Re c₁ ( cos ω_n t Re z₁ − sin ω_n t Im z₁ ) − Im c₁ ( sin ω_n t Re z₁ + cos ω_n t Im z₁ ) ] .
Then collecting the terms leads to the matrix expression

y = 2 [ Re z₁ , −Im z₁ ] [ cos ω_n t , −sin ω_n t ; sin ω_n t , cos ω_n t ] [ Re c₁ ; Im c₁ ] ,

which after substitution of Re c₁ , Im c₁ finally results in the matrix expression

y = 2 Z [ cos ω_n t , −sin ω_n t ; sin ω_n t , cos ω_n t ] (1/2) Z⁻¹ y(0) = Z R(t) Z⁻¹ y(0) .   (4.18)

We have in this way introduced the time-dependent rotation matrix

R(t) = [ cos ω_n t , −sin ω_n t ; sin ω_n t , cos ω_n t ] .    (4.19)

The solution for the displacement and velocity of the linear single degree of freedom oscillator can
therefore be understood as the result of the rotation of the initial-value quantity Z⁻¹ y(0) (phasor)

Z⁻¹ y(t) = R(t) Z⁻¹ y(0) .                                     (4.20)
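The phasor form of the solution can be checked against the exact solution of the undamped IVP. A minimal Python/NumPy sketch (an illustrative aside, not part of the toolkit):

```python
import numpy as np
from scipy.linalg import expm

# Undamped oscillator: Z R(t) Z^{-1} y(0) should reproduce the exact solution
m, k = 13.0, 6100.0
omega_n = np.sqrt(k / m)
A = np.array([[0.0, 1.0], [-omega_n**2, 0.0]])
y0 = np.array([0.0, 1.0])

Z = np.array([[1.0, 0.0], [0.0, -omega_n]])     # Z = [Re z1, -Im z1]

def R(t):                                       # time-dependent rotation matrix
    cn, sn = np.cos(omega_n * t), np.sin(omega_n * t)
    return np.array([[cn, -sn], [sn, cn]])

t = 0.04
y_rot = Z @ R(t) @ np.linalg.inv(Z) @ y0        # phasor-rotation form
y_exact = expm(A * t) @ y0                      # exact solution via expm
print(np.allclose(y_rot, y_exact))              # True
```

Both routes yield x(t) = (v₀/ω_n) sin ω_n t and v(t) = v₀ cos ω_n t for these initial values.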
Illustration 5

Check that the procedure (4.15) and the alternative formula (4.18) lead to the same solution.
We don't want to do this by hand. It is faster to use the MATLAB symbolic algebra. The function
froscill_un_symb computes the solution twice, and then subtracts one from the other. If we get
zeroes as a result, the solutions were the same.
The code begins with the same variable definitions and solution of the eigenvalue problem as
froscill_sub_symb. We compute the first solution using (4.15).
L =simple(inv(V)*A*V);
eLt =diag(exp(diag(L)*t));
y1=simple(V*eLt*inv(V))*y0;
Next, we compute the solution using the alternative with the rotation matrix (4.19).
Z =[real(V(:,1)),-imag(V(:,1))];
R = [cos(omega_n*t),-sin(omega_n*t);
sin(omega_n*t),cos(omega_n*t)];
y2 =simple(Z*R*inv(Z))*y0;
Finally we evaluate y1-y2.
Finally, we can realize that the solution (4.20) is of the same form as that derived in Section 3.10
(as in (3.22)) and then again in the Illustration in Section 4.1. The new variables are w₁ = y₁, w₂ =
y₂/ω_n as in Section 4.1.
4.5.1 Subcritically damped oscillator: alternative treatment

The eigenvalues are

λ_{1,2} = −c/(2m) ± i √( k/m − (c/(2m))² ) .
Equation (4.17) is still applicable. The only difference is that λ_{1,2} now have a real component. Using

e^{(α+iω)t} = e^{αt} e^{iωt}

we see that (4.18) requires only a change of the matrix R, which for the damped oscillator should
read

R(t) = e^{αt} [ cos ωt , −sin ωt ; sin ωt , cos ωt ] .

Here

α = −c/(2m) ,   ω = √( k/m − (c/(2m))² ) .
Recall that the solution of the scalar IVP ẏ = a y, y(0) = y₀, is y(t) = e^{at} y₀, where the scalar
exponential may be expanded in the series

e^{at} = Σ_{k=0}^{∞} a^k t^k / k! .

The matrix exponential could be defined (and in fact this is one of its definitions) analogously as

e^{At} = Σ_{k=0}^{∞} A^k t^k / k! .                            (4.21)
For a general matrix A, evaluating the infinite series would be difficult. Fortunately, for some
special matrices it turns out to be easy. Especially the nice diagonal matrix makes this a breeze: all
the powers D^k of a diagonal matrix D are again diagonal, with entries D_jj^k, and therefore

e^{Dt} = Σ_{k=0}^{∞} D^k t^k / k! = diag( Σ_{k=0}^{∞} D₁₁^k t^k / k! , Σ_{k=0}^{∞} D₂₂^k t^k / k! , ... , Σ_{k=0}^{∞} D_nn^k t^k / k! ) .
This result is easily verified by just multiplying the diagonal matrix through with itself. Finally, we
realize that on the right-hand side we have a matrix with the exponentials e^{D_jj t} on the diagonal:

e^{Dt} = Σ_{k=0}^{∞} D^k t^k / k! = diag( e^{D₁₁ t} , e^{D₂₂ t} , ... , e^{D_{n−1,n−1} t} , e^{D_nn t} ) .   (4.22)
This is very helpful indeed, since we already saw that having a full set of eigenvectors as in equation (4.13) allows us to write the matrix A as similar to a diagonal matrix. Let us substitute the similarity

V⁻¹ A V = Λ ,   A = V Λ V⁻¹

into the definition of the matrix exponential:

e^{At} = Σ_{k=0}^{∞} A^k t^k / k! = Σ_{k=0}^{∞} ( V Λ V⁻¹ )^k t^k / k! .
Now we work out the matrix powers. The zeroth, first, and second are

( V Λ V⁻¹ )⁰ = 1 = V 1 V⁻¹ ,   ( V Λ V⁻¹ )¹ = V Λ V⁻¹ ,
( V Λ V⁻¹ )² = ( V Λ V⁻¹ )( V Λ V⁻¹ ) = V Λ ( V⁻¹ V ) Λ V⁻¹ = V Λ Λ V⁻¹ = V Λ² V⁻¹ ,

and in general

( V Λ V⁻¹ )^k = V Λ^k V⁻¹ .

Therefore

e^{At} = Σ_{k=0}^{∞} V Λ^k V⁻¹ t^k / k! = V ( Σ_{k=0}^{∞} Λ^k t^k / k! ) V⁻¹ = V e^{Λt} V⁻¹ .
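The identity e^{At} = V e^{Λt} V⁻¹ is easy to verify numerically for any diagonalizable matrix. A minimal Python/NumPy sketch (illustrative aside; the random test matrix is our own choice):

```python
import numpy as np
from scipy.linalg import expm

# e^{At} = V e^{Lambda t} V^{-1}; checked on a generic (diagonalizable) matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
lam, V = np.linalg.eig(A)                       # eigenvalues and eigenvectors

t = 0.7
lhs = expm(A * t)                               # general-purpose matrix exponential
rhs = (V @ np.diag(np.exp(lam * t)) @ np.linalg.inv(V)).real
print(np.allclose(lhs, rhs))                    # True
```

The `.real` simply discards the negligible imaginary round-off when the eigenvalues come in complex-conjugate pairs.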
To compute the matrix exponential of the diagonal matrix Λt is easy, so the only thing we need in order to
compute the exponential of At is a full set of eigenvectors of A. (Warning: there are matrices that
do not have a full set of linearly independent eigenvectors. Such matrices are called defective. More
details are discussed in the next section.)
As a matter of fact, we have been using the matrix exponential all along. The solution (4.7)
is of the form V e^{Λt} V⁻¹. In equation (4.18) the matrix R(t) (rotation matrix) is also a matrix
exponential, of a special matrix: the skew-symmetric matrix
S = [ 0, −ω_n ; ω_n, 0 ] = ω_n [ 0, −1 ; 1, 0 ] .
Note that the powers of S have this special structure:

S² = −ω_n² 1 ,   S³ = −ω_n² S ,   S⁴ = ω_n⁴ 1 ,   S⁵ = ω_n⁴ S ,   ... .
Therefore, for the rotation matrix we have

R(t) = e^{St} = Σ_{k=0}^{∞} S^k t^k / k! = 1 t⁰/0! + S t¹/1! − ω_n² 1 t²/2! − ω_n² S t³/3! + ... .
Constructing the infinite matrix series, this gives the correct Taylor expansions for the cosines and
sines of the rotation matrix

R(t) = e^{St} = ( 1 − ω_n²t²/2! + ω_n⁴t⁴/4! − ... ) 1 + ( ω_n t − ω_n³t³/3! + ω_n⁵t⁵/5! − ... ) (1/ω_n) S ,

where the first parenthesis is the expansion of cos ω_n t and the second that of sin ω_n t.
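That e^{St} really is the rotation matrix can be confirmed directly. A minimal Python/SciPy sketch (illustrative aside; the value of ω_n is arbitrary):

```python
import numpy as np
from scipy.linalg import expm

omega_n = 3.7                                       # arbitrary angular frequency
S = omega_n * np.array([[0.0, -1.0], [1.0, 0.0]])   # skew-symmetric matrix

t = 0.25
R = np.array([[np.cos(omega_n * t), -np.sin(omega_n * t)],
              [np.sin(omega_n * t),  np.cos(omega_n * t)]])
print(np.allclose(expm(S * t), R))                  # True
```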
For the critically damped oscillator, substituting λ = −c/(2m) into the eigenvector equation gives

[ c/(2m), 1 ; −k/m, −c/m + c/(2m) ] [ z₁₁ ; z₂₁ ] = [ 0 ; 0 ] .

Further simplifying with k/m = ( c/(2m) )² leads to

[ c/(2m), 1 ; −( c/(2m) )², −c/(2m) ] [ z₁₁ ; z₂₁ ] = [ 0 ; 0 ] ,

which yields the single eigenvector

z₁ = [ 1 ; −c/(2m) ] .
Inconveniently, this is the only eigenvector that we are going to get for the case of the critically
damped oscillator. Since we obtained a double real root, the second eigenvector is exactly the same
as the rst. We say inconveniently, because our approach was developed for an invertible eigenvector
matrix
V = [z 1 , z 2 ]
and it will now fail since both columns of V are the same, and such a matrix is not invertible.
We call matrices that have missing eigenvectors defective. For the critically damped oscillator the
matrix A is defective.
[Figure 4.2: the two real eigenvalues of the nearly critically damped oscillator (ζ = 1.005) plotted in the complex plane (Re λ horizontal, Im λ vertical) on a circle of radius ω_n.]
Let us approach the degenerate case of the critically damped oscillator as the limit of the supercritically damped oscillator whose two eigenvalues approach each other to become one. Figure 4.2
shows a circle of radius equal to ω_n for the data of the IVP (4.1) set to m = 13, k = 6100,
ζ = 1.005 (in other words, close to critical damping). The two (real) eigenvalues are indicated by
small circular markers (the function animated_eigenvalue_diagram illustrates with an animation
how the eigenvalues change in dependence on the amount of damping). For critical damping (ζ = 1.0)
the two eigenvalues would merge on the black circle and become one real eigenvalue (also referred to
as a repeated eigenvalue). As the eigenvalues approach each other, λ₂ → λ₁, the solution may still
be written as

y = c₁ e^{λ₁ t} z₁ + c₂ e^{λ₂ t} z₂ .

In order to understand the behavior of the solution as the eigenvalues approach each other, we can write
the exponential e^{λ₂ t} using the Taylor series with λ₁ as the starting point

e^{λ₂ t} = e^{λ₁ t} + d( e^{λ₂ t} )/dλ₂ |_{λ₁} (λ₂ − λ₁) + ... = e^{λ₁ t} + t e^{λ₁ t} (λ₂ − λ₁) + ... ,

so the two functions that appear in the solution are e^{λ₁ t} and t e^{λ₁ t} .
With essentially the same reasoning we can now look for the missing eigenvector. Write (again
assuming λ₂ → λ₁)

z₂ ≈ z₁ + dz₂/dλ₂ |_{λ₁} (λ₂ − λ₁) .

This allows us to subtract the two eigenvector equations from each other to obtain

(+) A z₂ = λ₂ z₂
(−) A z₁ = λ₁ z₁
―――――――――――――――――――――
A (z₂ − z₁) = (λ₂ z₂ − λ₁ z₁) ,

where we can substitute the difference of the eigenvectors to arrive at

A dz₂/dλ₂ |_{λ₁} (λ₂ − λ₁) = (λ₂ − λ₁) z₁ + λ₂ dz₂/dλ₂ |_{λ₁} (λ₂ − λ₁) ,

and, factoring out (λ₂ − λ₁), finally

A dz₂/dλ₂ |_{λ₁} = z₁ + λ₂ dz₂/dλ₂ |_{λ₁} .
Note that dz₂/dλ₂ has the direction of the difference between the two vectors z₂ and z₁. Since z₂
and z₁ are linearly independent vectors for λ₂ ≠ λ₁, so are the vectors z₁ and dz₂/dλ₂. Therefore,
when λ₂ = λ₁, we can obtain a full set of linearly independent vectors that go with the double root
as the two vectors z₁ and p₂ that solve

A z₁ = λ₁ z₁ ,   A p₂ = z₁ + λ₂ p₂ .                           (4.23)
Here p₂ is not an eigenvector. Rather, it is called a principal vector. To continue with our critically
damped oscillator: we can compute the principal vector from

[ 0, 1 ; −k/m, −c/m ] [ p₁₂ ; p₂₂ ] = [ z₁₁ ; z₂₁ ] + λ₂ [ p₁₂ ; p₂₂ ] ,

or, upon substitution,

[ 0, 1 ; −( c/(2m) )², −c/m ] [ p₁₂ ; p₂₂ ] = [ 1 ; −c/(2m) ] − ( c/(2m) ) [ p₁₂ ; p₂₂ ] ,

or, rearranging the terms,

[ c/(2m), 1 ; −( c/(2m) )², −c/(2m) ] [ p₁₂ ; p₂₂ ] = [ 1 ; −c/(2m) ] .

Since the matrix on the left-hand side is singular, the principal vector is not determined uniquely.
One possible solution is

p₂ = [ p₁₂ ; p₂₂ ] = [ 0 ; 1 ] .
Similarly to the general oscillator eigenproblem (4.10), which could be written in the matrix
form (4.11), we can write here for the critically damped oscillator

A [ z₁, p₂ ] = [ z₁, p₂ ] [ λ₁, 1 ; 0, λ₂ ] ,                  (4.24)

where we introduce the so-called Jordan matrix

J = [ λ₁, 1 ; 0, λ₂ ] = [ λ₁, 1 ; 0, λ₁ ]   (since λ₁ = λ₂) .  (4.25)
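The Jordan-matrix relation is easy to verify numerically. A minimal Python/NumPy sketch for the critically damped oscillator (illustrative aside; data as in Illustration 6 below):

```python
import numpy as np

# Critically damped oscillator: A M = M J with M = [z1, p2] and Jordan matrix J
m, k = 13.0, 6100.0
omega_n = np.sqrt(k / m)
c = 2 * m * omega_n                              # critical damping c_cr
A = np.array([[0.0, 1.0], [-k / m, -c / m]])
lam = -c / (2 * m)                               # double eigenvalue

z1 = np.array([1.0, lam])                        # the single eigenvector
p2 = np.array([0.0, 1.0])                        # one choice of principal vector
M = np.column_stack([z1, p2])
J = np.array([[lam, 1.0], [0.0, lam]])

print(np.allclose(A @ M, M @ J))                 # True
```

The first column checks A z₁ = λ₁ z₁, the second checks A p₂ = z₁ + λ₂ p₂, exactly as in (4.23) and (4.24).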
Illustration 6

Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 1.0 (critical damping),
x₀ = 0, and v₀ = 1.
We shall follow the procedure that leads to the Jordan matrix. The MATLAB solution is based
on the Symbolic Math Toolbox.
The solution to the eigenvalue problem yields a rectangular one-column V. Therefore we solve for
the principal vector p₂, and we form the matrix M = [z₁, p₂].
Illustration 7

Compute the matrix exponential of the Jordan matrix times t,

J t = t [ λ, 1 ; 0, λ ] .

Solution: The matrix can be decomposed as

J t = λt 1 + t N ,   N = [ 0, 1 ; 0, 0 ] .

Because we have

( λt 1 )( t N ) = ( t N )( λt 1 )

(i.e. the matrices commute), it holds for the matrix exponential that

e^{λt 1 + t N} = e^{λt 1} e^{t N} = e^{t N} e^{λt 1} .

The exponential of the diagonal matrix is easy: see equation (4.22). For the matrix N, using the
definition (4.21) we readily get

e^{t N} = Σ_{k=0}^{∞} N^k t^k / k! = 1 + t N ,

because all its powers of order two and higher are zero matrices, N² = 0, and so on. Therefore, we have

e^{Jt} = e^{λt 1} e^{t N} = e^{λt} 1 ( 1 + t N ) = e^{λt} [ 1, t ; 0, 1 ] .
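The closed form just derived can be confirmed against a general-purpose matrix exponential. A minimal Python/SciPy sketch (illustrative aside; λ and t are arbitrary values):

```python
import numpy as np
from scipy.linalg import expm

lam, t = -0.8, 1.3                               # arbitrary eigenvalue and time
J = np.array([[lam, 1.0], [0.0, lam]])           # 2x2 Jordan block

closed_form = np.exp(lam * t) * np.array([[1.0, t], [0.0, 1.0]])
print(np.allclose(expm(J * t), closed_form))     # True
```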
5 Linear Multiple Degree of Freedom Oscillator
Summary
1. For the multiple degree of freedom linear vibrating system we study how to transform between
the second-order and the first-order matrix form. Modal analysis is discussed in detail for both
forms.
2. Modal analysis decouples the equations of the multiple degree of freedom system. The original coupled system may be understood in terms of the individual modal components. Main
idea: whether coupled or uncoupled, the response of the system is determined by the modal
characteristics. Each uncoupled equation evolves as governed by its own eigenvalue.
3. We can analyze a scalar real or complex linear differential equation to gain insight into the
stability behavior. When the equations are coupled, stability is usually decided by the fastest
changing component of the solution (as dictated by the largest eigenvalue). This information is
used to select the time step for direct integration of the equations of motion.
4. The frequency content (spectrum) is a critical piece of information. We use the Fourier transform
and we discuss the Nyquist frequency.
5. The first-order form of the vibrating system equations is used to analyze damped systems.
M ẍ + C ẋ + K x = 0 ,                                          (5.1)

where M is the mass matrix, K is the stiffness matrix, C is the damping matrix, and x is the vector
of displacements. In conjunction with the initial conditions

x(0) = x₀ ,   ẋ(0) = v₀
this will define the multi-degree of freedom (dof) damped oscillator IVP. Using the definition

v = ẋ

will yield the general first-order form of the multi-dof damped oscillator IVP as

ẏ = A y ,   y(0) = y₀ ,                                        (5.2)

where

A = [ 0, 1 ; −M⁻¹K, −M⁻¹C ]

and

y = [ x ; v ] .
The vector variable y collects both the vector of displacements x and the vector of velocities v.
Figure 5.1 shows an example of a multi-degree of freedom oscillator that is physically realized as
three carriages connected by springs and dampers. This will be our sample mechanical system that
will be studied in the following sections.
Fig. 5.1. Linear 3-degree of freedom oscillator: masses m₁, m₂, m₃ with displacements x₁, x₂, x₃, connected by springs and dampers (k₁, c₁), (k₂, c₂), (k₃, c₃)
The mass and stiffness matrices of this system are

M = [ m, 0, 0 ; 0, m, 0 ; 0, 0, m ] ,   K = [ 2k, −k, 0 ; −k, 2k, −k ; 0, −k, k ] .   (5.3)
Similarly to the characteristic equation for the standard eigenvalue problem (3.15) we can write

det( K − ω² M ) = 0 .                                          (5.4)
Illustration 1

For the stiffness and mass matrices given above, the characteristic polynomial is

det( [ 2k, −k, 0 ; −k, 2k, −k ; 0, −k, k ] − ω² [ m, 0, 0 ; 0, m, 0 ; 0, 0, m ] )
  = k³ − 6k²m ω² + 5km² (ω²)² − m³ (ω²)³ .

The eigenvalues ω² are the roots of this polynomial.
Illustration 2

For the stiffness and mass matrices given above, the characteristic equation is

k³ − 6k²m ω² + 5km² (ω²)² − m³ (ω²)³ = 0 .

Find the roots.
A symbolic solution can be delivered by MATLAB, but it is far from tidy. A numerical solution for the data m = 1.3, k = 61, c = 0 results from the roots of

−(2197/1000)(ω²)³ + (10309/20)(ω²)² − (145119/5) ω² + 226981 = 0 ,
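The roots can be confirmed numerically, e.g. with NumPy (an illustrative Python aside paralleling the MATLAB computation):

```python
import numpy as np

m, k = 1.3, 61.0
# Coefficients of -m^3 (w2)^3 + 5 k m^2 (w2)^2 - 6 k^2 m (w2) + k^3, highest power first
coeffs = [-m**3, 5 * k * m**2, -6 * k**2 * m, k**3]
w2 = np.sort(np.roots(coeffs).real)              # the three eigenvalues omega^2
print(w2)                                        # approx. [9.2937, 72.9634, 152.3583]
```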
[Figure: graph of the characteristic cubic polynomial as a function of ω² (horizontal axis roughly 0 to 150); its three zero crossings are the eigenvalues.]
The eigenvalues (and eigenvectors) of the generalized eigenvalue problem are known to be
real for M, K symmetric. Also, when the stiffness matrix is nonsingular, the eigenvalues will be
positive. Hence we write

ω² = −λ² ≥ 0 .

The generalized eigenvalue problem is solved in MATLAB using

[V,D]=eig(K,M);
For the above matrices, the eigenvalues are ω₁² = 9.2937 (i.e. angular frequency ω₁ = 3.0486),
ω₂² = 72.9634 (i.e. angular frequency ω₂ = 8.5419), and ω₃² = 152.3583 (i.e. angular frequency
ω₃ = 12.3433). Therefore, we see that the λ's are all imaginary, λ_j = ±i ω_j. Note that there
are three eigenvalues, but each eigenvalue generates two solutions because of the ± for the square
roots. That is necessary, because six constants are needed to satisfy the initial conditions (two
conditions, each with three equations).
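The generalized eigenvalue computation can be reproduced outside MATLAB as well. A minimal Python/SciPy sketch (illustrative aside; scipy.linalg.eigh plays the role of eig(K,M) for the symmetric pair):

```python
import numpy as np
from scipy.linalg import eigh

m, k = 1.3, 61.0
M = m * np.eye(3)                                # mass matrix
K = k * np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 1.0]])             # stiffness matrix

w2 = eigh(K, M, eigvals_only=True)               # generalized symmetric EVP, ascending
omega = np.sqrt(w2)                              # angular frequencies
print(omega)                                     # approx. [3.0486, 8.5419, 12.3433]
```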
The solutions are therefore found to be both

x = e^{+i ω_j t} z_j   and   x = e^{−i ω_j t} z_j ,

which are complex vectors. The solution however needs to be real. This is easily accomplished by
taking as the solutions a linear combination of the above, for instance

x = Re( e^{+i ω_j t} + e^{−i ω_j t} ) z_j   and   x = Im( e^{+i ω_j t} − e^{−i ω_j t} ) z_j .
From Euler's formula we know that

Re( e^{+i ω_j t} + e^{−i ω_j t} ) = 2 cos ω_j t

and

Im( e^{+i ω_j t} − e^{−i ω_j t} ) = 2 sin ω_j t .

Therefore, we can take as the linearly independent solutions (j = 1, 2, 3)

x = cos ω_j t z_j   and   x = sin ω_j t z_j .
In this way we will obtain enough integration constants to satisfy the initial conditions, since the
general solution may be written as

x = Σ_{j=1}^{3} ( a_j cos ω_j t + b_j sin ω_j t ) z_j .
The undamped mode shapes for our example are shown in Figure 5.2, both graphically as arrows
and numerically as the values of the components.
5.2.2 First-order form

Next we will explore the free vibration of the same system in its first-order form. The system matrix
is (note: no damping)

A = [ 0, 1 ; −M⁻¹K, 0 ] .
The standard eigenvalue problem is solved in MATLAB as
[V,D]=eig(A);
Note that the results for the eigenvalues on the diagonal of D indicate that the eigenvalues are not
ordered from smallest in absolute value to largest, as we would like to see them:

D =
   0+12.34i        0        0        0        0        0
        0   0-12.34i        0        0        0        0
        0        0   0+8.54i        0        0        0
        0        0        0   0-8.54i        0        0
        0        0        0        0   0+3.05i        0
        0        0        0        0        0   0-3.05i
Fig. 5.2. Linear 3-degree of freedom oscillator: second-order model, undamped modes
We can reorder them using the sort function: the first line sorts the diagonal elements by ascending
modulus, the second line re-orders the rows and columns of D and constructs the new D, and the third
line then reorders the columns of V.
[Ignore,ix] = sort(abs(diag(D)));
D =D(ix,ix);
V =V(:,ix);
Here is the reordered D (be sure to compare with the eigenvalues computed in the previous section
for the generalized EP):

D =
   0+3.05i        0        0        0        0        0
        0   0-3.05i        0        0        0        0
        0        0   0+8.54i        0        0        0
        0        0        0   0-8.54i        0        0
        0        0        0        0   0+12.3i        0
        0        0        0        0        0   0-12.3i
and the corresponding eigenvectors as the columns of V (displayed as a 6 × 6 complex matrix, not reproduced in full here).
Note that the eigenvalues come in complex conjugate pairs. The corresponding eigenvectors are
also complex conjugate. Each pair of complex conjugate eigenvalues corresponds to a one-degree of
freedom oscillator with complex-conjugate solutions.
Figure 5.3 illustrates graphically the modes of the A matrix. There are six components to each
eigenvector: the first three elements represent the components of the displacement, and the last
three elements represent the components of the velocity. Therefore, the eigenvectors are visualized
using two arrows at each mass. We use the classical complex-vector (phasor) representation: the
real part is on the horizontal axis, and the imaginary part is on the vertical axis. Note that all
displacement components (green) are purely imaginary, while all the velocity components (red) are
real. An animation of the motion described by a single eigenvector,

x = e^{λ_j t} z_j ,

is also available in the toolkit.

Fig. 5.3. Linear 3-degree of freedom oscillator: first-order model, undamped modes
Figure 5.4 shows the free-vibration response to excitation in the form of the initial condition set
to (the real part of) mode 2. Note that the displacements go through zero at the same time, and
that the amplitude does not change.
Fig. 5.4. Linear 3-degree of freedom oscillator: first-order model, undamped. Free-vibration response to
initial condition in the form of mode 2.
We have made the observation that the eigenvalues and eigenvectors come in complex conjugate
pairs. Each pair of complex conjugate eigenvalues corresponds to a one-degree of freedom oscillator
with complex-conjugate solutions. We have shown in Section 4.3 that all the individual eigenvalue
problems for the 2 × 2 matrix A may be written as one matrix expression

A V = V Λ ,

where each column of V corresponds to one eigenvector, and the eigenvalues are the diagonal elements of the diagonal matrix Λ. So, provided V was invertible, the matrix A was similar to a
diagonal matrix (4.13). Exactly the same transformation may be used no matter what the size of
the matrix A. The 6 × 6 A is also similar to a diagonal matrix
V⁻¹ A V = D

using the matrix of eigenvectors V. Therefore, the original IVP (5.2) may be written in the completely equivalent form

ẇ = D w ,   w(0) = V⁻¹ y₀                                      (5.5)

for the new variables, the modal coordinates, w. Each modal coordinate w_j is independent of the
others, since the matrix D is diagonal.
For a single modal equation ẇ = λ w, w(0) = w₀, the amplification factor of the fourth-order
Runge-Kutta integrator reads

β = 1 + λΔt + (λΔt)²/2 + (λΔt)³/6 + (λΔt)⁴/24 .
The stability diagram is shown in Figure 3.24. The intersection of the imaginary axis with the level
curve |β| = 1 of the amplification factor gives one and only one stable time step for purely oscillatory
solutions. Numerically, we can solve for the corresponding stable time step with fzero as
F=@(dt)(abs(1+(dt*lambda)+(dt*lambda)^2/2+(dt*lambda)^3/6+(dt*lambda)^4/24)-1);
dt =fzero(F, 1.0)
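The same root-finding can be sketched in Python with SciPy's brentq in place of fzero (an illustrative aside; the bracketing interval [0.5, 1.5] is our own choice). For a purely imaginary eigenvalue λ = iω the known RK4 stability limit is Δt = 2√2/ω, which is what the root finder should reproduce:

```python
import numpy as np
from scipy.optimize import brentq

lam = 3.0486j                                    # eigenvalue of mode 2
beta = lambda dt: abs(1 + dt * lam + (dt * lam)**2 / 2
                      + (dt * lam)**3 / 6 + (dt * lam)**4 / 24)

# Solve |beta(dt)| = 1 for the stable time step
dt_stable = brentq(lambda dt: beta(dt) - 1.0, 0.5, 1.5)
print(dt_stable)                                 # approx. 0.9278 (= 2*sqrt(2)/3.0486)
```

The result matches the first entry of the dts list quoted below.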
Integrating with the stable time step leads to an oscillating solution with unchanging amplitude;
using a longer time step yields oscillating solutions with increasing amplitude, and decreasing the
time step leads to oscillations with decaying amplitude. Figure 5.5 was produced by the script
n3_undamped_direct_modal. The modal coordinate w₂ (λ₂ = 3.0486i) was integrated by oderk4
with the stable time step Δt (horizontal curve), a slightly longer time step 1.00001Δt (rising curve), and
a shorter time step Δt/10 (dropping curve), and it is a good illustration of the above derivation.
If we were to numerically integrate the IVP (5.5), i.e. the uncoupled form of the original (5.2),
we could integrate each equation separately from all the others, since in the uncoupled form they
are totally independent. Hence we could also use different time steps for different equations. Let us
say we were to use a conditionally stable integrator such as oderk4. Then for each equation j
we could find a stable time step and integrate w_j with that time step. Of course, to construct the
original solution as y = V w would take additional work: all the w_j would be computed at different
time instants, whereas all the components of y should be known at the same time instants.
Alternately, if we were to integrate the original IVP (5.2) in the coupled form, the uncoupled
modal coordinates w_j would still be present in the solution y, only now they would be mixed
together (coupled) in the variables y_k. Again, let us assume that we need to use a conditionally
stable integrator such as oderk4. However, now we have to use only one time step for all the
components of the solution. It would be in general impossible for purely oscillatory solutions to
See: aetna/utilities/ODE/integrators/oderk4.m
See: aetna/ThreeCarriages/n3_undamped_direct_modal.m
Fig. 5.5. Integration of modal coordinate w₂ (λ₂ = 3.0486i). The real and imaginary part of the solution
(phase-space diagram) on the left, the absolute value of the complex solution on the right. Integrated with the stable
time step Δt (exactly one circle on the left, the horizontal curve on the right), a slightly longer time step
1.00001Δt (increasing radius on the left, rising curve on the right), and a shorter time step Δt/10 (decreasing
radius on the left, dropping curve on the right)
integrate at a time step that was stable for all w_j at the same time. If we cannot integrate all
solution components so that their amplitude of oscillation is conserved, then we would probably
elect to have the amplitudes decay rather than grow. Therefore, we would integrate the coupled IVP
with a time step equal to or shorter than the shortest stable time step. For our example the stable
time step lengths are
dts =
0.9278
0.9278
0.3311
0.3311
0.2291
0.2291
The shortest stable time step (for solution components five and six) is Δt_min ≈ 0.2291. Figure 5.6
shows that running the integrator at the shortest stable time step yields a solution of the original,
coupled, vibrating system which is non-growing (decaying oscillations), because two components
are integrated at their stable time step (and therefore their amplitude is maintained), and the first
four components are integrated below their stable time step and hence their amplitude decays.
Running the integration at just a slightly longer time step than Δt_min means that the first four
components are still integrated below their stable time steps. Their amplitude will still decay. The
last two components are integrated very slightly above their stable time step, which means that the
amplification factor for them is just a tad greater than one. We can clearly see how that can easily
destroy the solution, as we get a sharply growing oscillation amplitude of the coupled solution (on
the right).
5.3.1 Practical use of eigenvalues for integration

The eigenvalues of the matrix A of the IVP (5.2) (sometimes referred to as the spectrum of A) need
to be taken into account when the IVP is to be integrated numerically. We have shown the reasons
for this above, and now we are going to summarize a few practical rules.
If the decoupling of the original system is feasible and cost-effective, each of the resulting independent modal equations can be integrated separately with its own time step. In particular,
exponentially decaying (or growing) solutions may require the time step to be smaller than some
appropriate length for stability. Purely oscillating solutions may also pose a limit on the time
step, depending on the integrator. To achieve stability we need to solve for an appropriate time
step from the amplification factor, as shown for instance above for the fourth-order Runge-Kutta
integrator, or for the Euler integrators in Chapter 3.
Fig. 5.6. Integration of the undamped IVP with the shortest stable time step ∆t_min (non-growing solution
on the left), and with a time step slightly longer than the shortest stable one, 1.002 ∆t_min (growing solution
on the right)
All types of solutions may also require a time step that provides sufficient accuracy. In this respect
we should remember that equations should not be integrated at a time step that is longer than the
stable time step. Therefore we first consider stability, and then, if necessary, we further shorten
the time step length for accuracy. For oscillating solutions, good accuracy is typically achieved if
the time step is less than 1/10 of the period of oscillation. In particular, let us say we obtained a purely
imaginary eigenvalue for the jth mode, λ_j = iω_j. Then the time step for acceptable accuracy
should be
∆t ≤ T_j/10 ,   T_j = 2π/ω_j .
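As a minimal sketch of this rule, the accuracy-driven time step can be computed from the imaginary parts of the eigenvalues. Here we use the angular frequencies of the three-mass example discussed in this chapter (ω ≈ 3.05, 8.54, 12.35 rad/s, i.e. 0.485, 1.359, 1.965 Hz); with other data the vector lam would be replaced accordingly.

```matlab
% Purely imaginary eigenvalue pairs of the undamped three-mass example
lam = [3.05i; -3.05i; 8.54i; -8.54i; 12.35i; -12.35i];
omega = abs(imag(lam));      % angular frequencies omega_j
T = 2*pi./omega;             % periods T_j = 2*pi/omega_j
dt_accuracy = min(T)/10      % time step <= 1/10 of the shortest period, about 0.051
```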
If the equations cannot be decoupled (such as when the cost of solving the complete eigenvalue
problem is too high), the system has to be integrated in its coupled form. Firstly, we should
think about stability. A time step must be chosen that works well for all the eigenvalues and
eigenvectors in the system. That shouldn't be a problem for unconditionally stable integrators:
they would give reasonable answers for any time step length. Unfortunately, there is really only
one such integrator on our list, the trapezoidal rule. For conditionally stable integrators we have
to choose a suitable time step length. In particular, we would most likely try to avoid integrating
at a time step length that would make some of the solution components grow when they should
not grow (oscillating or decaying components). We should therefore choose the smallest of all
the time step limits computed for the individual eigenvector/eigenvalue pairs.
Secondly, the time step is typically assessed with respect to accuracy requirements; this was
discussed above.
More on the topic of time step selection follows in the next two sections, which deal with solutions to
initial boundary value problems.
of the mass 3, which the simulation will give us as a discrete signal. The signal is a sequence of
numbers x_j measured at equally spaced time intervals t_j, such that t_j − t_{j−1} = ∆t.
The sampling interval is a critical quantity. With a given sampling interval length it is only
possible to sample signals faithfully up to a certain frequency. Figure 5.7 shows two signals of different
frequencies sampled with the same sampling interval. Even though the signals have different
frequencies, their sampling produces exactly the same numbers and therefore we would interpret
them as one and the same. This is called aliasing. The so-called Nyquist rate 1/∆t is the minimum
sampling rate required to avoid aliasing, i.e. viewing two very different frequencies as being the same
due to inadequate sampling.
Fig. 5.7. Illustration of the Nyquist rate. Sampling at a rate that is lower than the Nyquist rate for the
signal represented with the dashed line. As far as the information obtained from the sampling is concerned, the
two signals shown in the figure are completely equivalent, even though they have different frequencies.
We can see from Figure 5.8 that the Nyquist rate is twice the frequency
we wish to reproduce faithfully. The highest frequency that is reproduced faithfully at the Nyquist
rate is the Nyquist frequency
f_Ny = (1/2)(1/∆t) ,   (5.6)
where ∆t is the sampling interval. If we sample at an even higher rate (with a smaller sampling
interval), the signal is going to be reproduced much better; on the other hand, sampling slower, below
the Nyquist rate, i.e. with a longer sampling interval, the signal is going to be aliased: we will get
the wrong idea of its frequency.
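Aliasing is easy to demonstrate numerically. In this sketch (the sampling rate and frequencies are chosen arbitrarily for illustration), a 3 Hz cosine and a 7 Hz cosine sampled at 10 Hz produce identical sample sequences, since 7 = 10 − 3 is the alias of 3 Hz at that rate.

```matlab
fs = 10;  dt = 1/fs;        % sampling rate 10 Hz, sampling interval dt
n = 0:19;  t = n*dt;        % 20 sample times
f1 = 3;                     % a 3 Hz cosine ...
f2 = fs - f1;               % ... and its 7 Hz alias
s1 = cos(2*pi*f1*t);
s2 = cos(2*pi*f2*t);
max(abs(s1 - s2))           % identical samples, up to round-off
```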
In order to extract the frequencies that contribute to the response from the measured signal we
perform an FFT analysis. A quick refresher: the discrete Fourier transform (DFT) is expressed
by the formula
A_m = (1/N) Σ_{n=1}^{N} e^{−i 2π(n−1)(m−1)/N} a_n ,   m = 1, ..., N   (5.7)
that links two sets of numbers, the input signal a_n and its Fourier transform coefficients A_m. The
Fast Fourier transform (FFT) is simply a fast way of multiplying with the complex transform matrix,
i.e. of evaluating the sums Σ_{n=1}^{N} e^{−i 2π(n−1)(m−1)/N} a_n.
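Formula (5.7) can be checked directly against MATLAB's fft. A minimal sketch (with a hypothetical random signal a): the loop evaluates the DFT sum term by term and the result agrees with (1/N)*fft(a) to round-off.

```matlab
N = 8;
a = randn(N,1);                            % an arbitrary input signal a_n
A = zeros(N,1);
n = (1:N)';
for m = 1:N                                % direct evaluation of the DFT sum (5.7)
    A(m) = (1/N)*sum(exp(-1i*2*pi*(n-1)*(m-1)/N).*a);
end
max(abs(A - (1/N)*fft(a)))                 % agrees with fft up to round-off
```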
The Fourier transform (Fourier series) of a periodic function x(t) with period T is defined as
x(t) = Σ_{m=−∞}^{∞} X_m e^{i m (2π/T) t} ,   (5.8)
where
[Four panels of sampled signals s(t), at f = 1 f_Ny, f = 1.1 f_Ny, f = 2 f_Ny, and f = 10 f_Ny.]
Fig. 5.8. Illustration of the Nyquist frequency. Frequencies which are lower than the Nyquist frequency are
sampled at a rate higher than twice their frequency.
X_m = (1/T) ∫_0^T x(t) e^{−i m (2π/T) t} dt .   (5.9)
Here ω_0 = 2π/T is the fundamental frequency. The following illustration shows how equation (5.7),
which defines the transformation between the Fourier coefficients and the input discrete signal, can be
obtained from the above expressions for the continuous transform by a numerical approximation of the
integral.
Illustration 3
Consider the possibility that the function x(t) is known only by its values x_j = x(t_j) at equally spaced
time intervals t_j, such that t_j − t_{j−1} = ∆t. Assume the period of the function is an integer number
of the time intervals, T = N ∆t, and the function is periodic between 0 and T. The integral (5.9)
may then be approximated by a Riemann sum
(1/T) ∫_0^T x(t) e^{−i 2π m t/T} dt ≈ (1/T) Σ_{n=1}^{N} x(t_n) e^{−i 2π m t_n/T} ∆t ,
where m = 0, 1, .... After we substitute T = N ∆t, t_n = (n − 1)∆t, and x(t_n) = x_n we obtain
(1/T) ∫_0^T x(t) e^{−i 2π m t/T} dt ≈ (1/(N ∆t)) Σ_{n=1}^{N} x_n e^{−i 2π m (n−1)∆t/(N ∆t)} ∆t
and finally
(1/T) ∫_0^T x(t) e^{−i 2π m t/T} dt ≈ (1/N) Σ_{n=1}^{N} x_n e^{−i 2π m (n−1)/N} .
This is already close to formula (5.7). The remaining difference may be removed by a shift of the
index m: if we replace m with m − 1, so that m = 1, 2, ..., the above changes to
(1/T) ∫_0^T x(t) e^{−i 2π (m−1) t/T} dt ≈ (1/N) Σ_{n=1}^{N} x_n e^{−i 2π (m−1)(n−1)/N} .
As an example of the use of the DFT we will analyze the spectrum of an earthquake acceleration
record to find out which frequencies were represented strongly in the ground motion.
Fig. 5.9. Workspace variables stored in elcentro.mat. The variable desc is the description of the data
stored in the file.
Illustration 4
The earthquake record is from the notorious 1940 El Centro earthquake. The acceleration data is
stored in elcentro.mat (Figure 5.9), and processed by the script dft_example_1. Note that when
the file is loaded as Data=load('elcentro.mat');, the variables stored in the file become fields of
a structure (in this case called Data).
Data=load('elcentro.mat');
dt=Data.delt;% The sampling interval
x=Data.han;% This is the signal: Let us process the North-South acceleration
t=(0:1:length(x)-1)*dt;% The times at which samples were taken
Next the signal is going to be padded to a length which is an integral power of 2, for efficiency. The
product of the complex transform matrix with the signal is carried out by fft.
N = 2^nextpow2(length(x)); % Next power of 2 from length of x
X = (1/N)*fft(x,N);% Now we compute the coefficients X_k
The Nyquist frequency is calculated and used to determine the N/2 frequencies of interest, which
are all frequencies up to the Nyquist frequency (one half of the Nyquist rate).
f_Ny=(1/dt)/2; % This is the Nyquist frequency
f = f_Ny*linspace(0,1,N/2);% These are the frequencies
Because of aliasing there is a symmetry of the computed coefficients, and hence we also take
only one half of the coefficients, X(1:N/2). In order to preserve the energy of the signal we multiply
by two.
absX=2*abs(X(1:N/2)); % Take 2 times one half of the coefficients
Finally, the coefficients are plotted.
plot(f,absX,'Color','r','LineWidth',3,'LineStyle','-','Marker','.'); hold on
xlabel('Frequency f [Hz]'); ylabel('|X(f)|');
[Plot: one-sided amplitude spectrum |X(f)| versus frequency f [Hz], up to 25 Hz.]
We can see that the highest-magnitude accelerations in the north-south direction occur with
frequencies below 5 Hz.
Finally, we are ready to come back to our vibration example. The displacement at the third mass
is the signal to transform.
x=y(:,3);% this is the signal to transform
The computation of the Fourier transform coefficients proceeds as
N = 2^nextpow2(length(x)); % Next power of 2 from length of x
X = (1/N)*fft(x,N);% Now we compute the coefficients X_k
f_Ny=(1/dt)/2; % This is the Nyquist frequency
f = f_Ny*linspace(0,1,N/2);% These are the frequencies
absX=2*abs(X(1:N/2)); % Take 2 times one half of the coefficients
Note that the absolute value of one half of the coefficients (shown in Figure 5.10) is often called the
one-sided amplitude spectrum.
The three frequencies that we may expect to show up correspond to the angular frequencies
above and are 0.485 Hz, 1.359 Hz and 1.965 Hz. As evident from Figure 5.10, the
intermediate frequency, 1.359 Hz, is missing in the FFT. By including only the modes 1,2 and 5,6,
with frequencies 0.485 Hz and 1.965 Hz, in the initial condition, we have excluded the intermediate
two modes from the response. Because they were not excited by the initial condition, the two modes do
not appear in the FFT: they do not contribute to the response of the system at any time.
Next we simulate the forced vibration of the system, with zero initial condition and a sinusoidal
force at the frequency of 3 Hz applied at the mass 3. With the inclusion of forcing the second order
equations of motion are rewritten as
M ẍ = −K x + L ,
where L is the vector of forces applied to the individual masses. Converting this to first order form
results in
[ ẋ ]   [    0     1 ] [ x ]   [   0   ]
[ v̇ ] = [ −M⁻¹K   0 ] [ v ] + [ M⁻¹L ] .
Therefore, we add the forcing into the right-hand side function supplied to the integrator: it now
includes a harmonic force applied to mass 3.
Fig. 5.10. Linear 3-degree of freedom oscillator: first-order model, undamped. Free-vibration response to
initial condition in the form of a mode 1,2,5,6 mixture.
[t,y]=odetrap(@(t,y)A*y+sin(2*pi*3*t)*[0;0;0;0;0;1],...
tspan,y0,odeset('InitialStep',dt));
Again, the measurement of the response (the signal) will be the displacement of the mass 3. The
simulation will give us the displacement x3 as a discrete signal. The FFT analysis on this signal is
shown in Figure 5.11. We can see that now all free-vibration frequencies are present, and of course
the forcing frequency shows up strongly.
Fig. 5.11. Linear 3-degree of freedom oscillator: first-order model, undamped. Forced-vibration response.
The damping matrix is taken as proportional to the stiffness matrix,
     [ 2c  −c   0 ]
C =  [ −c  2c  −c ] ,
     [  0  −c   c ]
where for our particular data c = 3.13. This is an example of the so-called Rayleigh damping.
(In addition to stiffness-proportional damping there is also mass-proportional Rayleigh damping.) The
eigenvalues are now complex, with negative real parts
D = diag( −0.238+3.04i, −0.238−3.04i, −1.87+8.33i, −1.87−8.33i, −3.91+11.7i, −3.91−11.7i ) .
Clearly the system is strongly damped (the real parts of the eigenvalues are quite large in magnitude).
The eigenvectors show that the velocities (the last three components) are no longer phase-shifted
by 90° with respect to the displacements.
Fig. 5.12. Linear 3-degree of freedom oscillator: first-order model, modes for stiffness-proportional damping
Figure 5.13 shows the free-vibration response to excitation in the form of the initial condition set
to (the real part of) mode 2. Note that the displacements go through zero at the same time. This
may also be deduced from Figure 5.12, from the fact that all the displacement arrows for any particular
mode are parallel, which means they all have the same phase shift. Next we repeat the frequency
analysis we've performed for the undamped system previously: we simulate the forced vibration of
the damped system, with zero initial condition and a sinusoidal force at the frequency of 3 Hz applied
at the mass 3. Again, the measurement of the response will be the displacement of the mass 3. The
one-sided amplitude FFT analysis of this signal is shown in Figure 5.14. We can see that not all
free-vibration frequencies are clearly distinguishable, while the forcing frequency shows up strongly.
Fig. 5.13. Linear 3-degree of freedom oscillator: first-order model, stiffness-proportional damping. Free-vibration response to initial condition in the form of mode 2.
Fig. 5.14. Linear 3-degree of freedom oscillator: first-order model, stiffness-proportional damping. Forced-vibration response.
     [ c1  0  0 ]
C =  [ 0   0  0 ] .
     [ 0   0  0 ]
Otherwise the mass and stiffness properties are unchanged.
The eigenvalues are quite interesting. There are two negative real eigenvalues (each
corresponding to an exponentially decaying mode), and two pairs of complex conjugate eigenvalues
corresponding to one-degree of freedom damped oscillators.
D = diag( −2.4, −0.641+3.98i, −0.641−3.98i, −0.254+11.1i, −0.254−11.1i, −21.4 ) .
Correspondingly, the first and last eigenvectors are real, and the rest form complex conjugate pairs.
Fig. 5.15. Linear 3-degree of freedom oscillator: first-order model, non-proportional damping.
Response for initial conditions in the form of mode 6.
Figure 5.16 illustrates graphically the modes of the A matrix. It is noteworthy that the displacements
and velocities for the purely decaying modes are phase-shifted by 180° (they are out of phase).
Figure 5.17 shows the free-vibration response to excitation in the form of the initial condition
set to (the real part of) mode 2. Note that the displacements no longer go through zero at the same
time: they are phase-shifted. This may also be deduced from Figure 5.16, because the displacement
arrows for any particular mode are no longer parallel.
Illustration 5
The dynamics of the system discussed above is to be integrated with the time step ∆t = 0.06 s with
the modified Euler integrator. Determine whether this integrator will be stable.
Fig. 5.16. Linear 3-degree of freedom oscillator: first-order model, modes for non-proportional damping

Fig. 5.17. Linear 3-degree of freedom oscillator: first-order model, non-proportional damping. Free-vibration
response to initial condition in the form of mode 2.

The eigenvalues are diag(D)
lambda=[ -2.4030
-0.6411+3.9785i
-0.6411-3.9785i
-0.2541+11.1142i
-0.2541-11.1142i
-21.4220]
Each eigenvalue needs to be substituted into the amplification factor for the modified Euler (3.30), and its modulus (absolute value) needs to be evaluated. The result is
>> abs(1+dt*lambda+1/2*(dt*lambda).^2)
ans =
0.8662
0.9616
0.9616
1.0063
1.0063
0.5407
Since two of the amplification factors (for the complex-conjugate eigenvalues 4 and 5) are
greater than one in modulus, the integrator is not going to be stable at the given time step: the
contribution of the modes 4 and 5 would grow in time.
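A sketch of how the largest stable step might be located numerically (assuming the vector lambda above and the modified Euler amplification factor (3.30)): shrink ∆t until every amplification factor has modulus at most one.

```matlab
lambda = [-2.4030; -0.6411+3.9785i; -0.6411-3.9785i; ...
          -0.2541+11.1142i; -0.2541-11.1142i; -21.4220];
ampl = @(dt) max(abs(1 + dt*lambda + (dt*lambda).^2/2));
dt = 0.06;                 % start from the unstable step of the illustration
while ampl(dt) > 1         % shrink until all modes are non-growing
    dt = 0.99*dt;
end
dt                         % an estimate of the largest stable time step
```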
Fig. 5.18. Linear 3-degree of freedom oscillator: first-order model, modes for singular-stiffness non-proportional damping
Note the zero eigenvalue: for a singular stiffness the entire matrix A must be singular (consider
whether the first three columns of A can be linearly independent when K has linearly dependent
columns).
D = diag( 0, −0.679+4.31i, −0.679−4.31i, −0.237+11.2i, −0.237−11.2i, −23.8 ) .
Correspondingly, the first and last eigenvectors are real, and the rest form complex conjugate pairs. The
first eigenvector is as expected: all displacements the same, no velocities.
Under these conditions no forces are generated in any of the springs or the damper.
6
Analyzing errors
Summary
1. The basic tool here is the Taylor series. Especially important is the Lagrange remainder term.
2. We use it to reason about order-of estimates (i.e. big-O notation). Main idea: as we control error
in numerical algorithms by decreasing the time step length, the element size, and other control
parameters, towards zero, the first term of the Taylor series that is missing in our model will
dominate the error. We use these ideas to evaluate errors of integrals and to estimate local and
global errors of ODE integrators.
3. Combining order-of error estimates with repeated solutions with different time step lengths allows
us to construct time-adaptive integrators. Main idea: by controlling the local error (estimated
from the Taylor series) we attempt to deliver the solution within a user-given error tolerance.
4. We discuss the approximation of derivatives by the so-called finite difference stencils. Main idea:
the total error has components of a distinct nature, the truncation error and the machine-representation error.
5. The computer represents numbers as collections of bits. Main idea: the machine-representation
error (round-off) is due to the computer being able to store only some values, to which the
results of arithmetic operations must be converted (with the attendant loss of precision).
The Taylor series of the function y(x) at the point x0 is
y(x) = y(x0) + (dy(x0)/dx)(x − x0) + (d²y(x0)/dx²)(x − x0)²/2 + ... .
Its purpose is to approximate the function value at x from the function derivatives at x0 (the function
value may be considered the zeroth derivative). When the above series converges, the Taylor series
becomes a better and better approximation with each additional term. (When the Taylor series of
a given function converges, we call such a function analytic.)
Illustration 1
Warning: The Taylor series need not be convergent. For instance, the function log(1 + x) has a
convergent Taylor series in the interval −1 < x ≤ 1. Outside this interval the Taylor series does not
converge (the more terms are added, the worse the approximation becomes). Try the following code
that uses the taylor MATLAB function.
syms x real
t=taylor(log(1+x),6);
x=linspace(-1,+2,100);
plot (x,log(1+x))
hold on
plot (x,eval(vectorize(t)),'--')
Note the use of vectorize: MATLAB will choke on all those powers of x from the Taylor series
function when x is an array of numbers.
Often it is useful to truncate the Taylor series exactly (that is, to write down a finite number of
the terms, but still preserve the exact meaning). The Lagrange remainder can be used for this
purpose. For instance we can write
y(x) = y(x0) + (dy(x̂)/dx)(x − x0)
to truncate after the first term, or
y(x) = y(x0) + (dy(x0)/dx)(x − x0) + (d²y(x̂)/dx²)(x − x0)²/2
to truncate after the second term. Both truncations are exact (when the Taylor series converges, of
course). The trick is to write the last term (which is the Lagrange remainder) with the derivative taken
at a point x̂ somewhere between x0 and x. The location x̂ is not the same in the two truncations above.
In general, we would write
y(x) = y(x0) + (dy(x0)/dx)(x − x0) + (d²y(x0)/dx²)(x − x0)²/2 + ... + (dⁿy(x0)/dxⁿ)(x − x0)ⁿ/n! + R_n ,
where the Lagrange remainder is
R_n = (dⁿ⁺¹y(x̂)/dxⁿ⁺¹)(x − x0)ⁿ⁺¹/(n + 1)! .   (6.1)
Having reminded ourselves of the basics of Taylor series approximation, we can look at a very
useful tool (terminology, really) to help us with engineering analyses of all kinds.
We say that f(x) is of the order of g(x) as x → 0 if
lim_{x→0} |f(x)| / |g(x)| < M < ∞ ,
where we require g(x) ≠ 0 for x ≠ 0. In words, the absolute values of the two functions are in some
proportion that is of finite magnitude. We write f(x) ∈ O(g(x)) and say "f of x is big O of g of x as x
goes to zero". The meaning of this definition is that |f(x)| decreases towards zero at least as fast
as |g(x)|.
Illustration 2
Example 1: Consider f(x) = 0.1x + 30x², for x > 0. Show that it is of order g(x) = x as x → 0.
We form the fraction and simplify
lim_{x→0} |f(x)|/|g(x)| = lim_{x→0} |0.1x + 30x²|/|x| = lim_{x→0} (0.1x + 30x²)/x = lim_{x→0} (0.1 + 30x) = 0.1 < ∞ .
Conclusion: f(x) = 0.1x + 30x² is of order g(x) = x as x → 0. We say f of x is big O of x, and write
f(x) = 0.1x + 30x² ∈ O(x).
Example 2: Consider f(x) = 0.1x + 30, for x > 0. Show that it is of order g(x) = 1 as x → 0.
We form the fraction and simplify
lim_{x→0} |f(x)|/|g(x)| = lim_{x→0} |0.1x + 30|/|1| = lim_{x→0} (0.1x + 30) = 30 < ∞ .
Conclusion: f(x) = 0.1x + 30 is of order g(x) = 1 as x → 0. We say f of x is big O of one, and write
f(x) = 0.1x + 30 ∈ O(1).
Example 3: Consider f(x) = 0.1x + 30x², for x > 0. Show that f(x) is not of order g(x) = x² as
x → 0.
We form the fraction and simplify
lim_{x→0} |f(x)|/|g(x)| = lim_{x→0} |0.1x + 30x²|/|x²| = lim_{x→0} (0.1x + 30x²)/x² = lim_{x→0} (0.1/x + 30) → ∞ .
Since the limit is not finite, f(x) = 0.1x + 30x² is not of order g(x) = x² as x → 0.
Solution: As all polynomial expressions include the constant term, all of these polynomials are
O(1).
Illustration 4
Estimate the resulting magnitude of the Taylor series sum for t_{j+1} → t_j. Assume that all the
derivatives exist and are finite numbers.
(d²y(t_j)/dt²)(t_{j+1} − t_j)²/2 + (d³y(t_j)/dt³)(t_{j+1} − t_j)³/3! + (d⁴y(t_j)/dt⁴)(t_{j+1} − t_j)⁴/4! + ... .
First of all, the Taylor series is a polynomial in the quantity t_{j+1} − t_j, and this quantity goes to
zero as t_{j+1} → t_j. Therefore, we can introduce the new variable τ = t_{j+1} − t_j and write
(d²y(t_j)/dt²)τ²/2 + (d³y(t_j)/dt³)τ³/3! + (d⁴y(t_j)/dt⁴)τ⁴/4! + ... .
The quantities (d²y(t_j)/dt²)/2!, (d³y(t_j)/dt³)/3!, (d⁴y(t_j)/dt⁴)/4!, ... are just inconsequential coefficients, and we can easily
convince ourselves that
(d²y(t_j)/dt²)τ²/2 + (d³y(t_j)/dt³)τ³/3! + (d⁴y(t_j)/dt⁴)τ⁴/4! + ... ∈ O(τ²)
by evaluating
lim_{τ→0} [ (d²y(t_j)/dt²)τ²/2 + (d³y(t_j)/dt³)τ³/3! + (d⁴y(t_j)/dt⁴)τ⁴/4! + ... ] / τ²
= (d²y(t_j)/dt²)/2 + lim_{τ→0} [ (d³y(t_j)/dt³)τ/3! + (d⁴y(t_j)/dt⁴)τ²/4! + ... ]
= (d²y(t_j)/dt²)/2 < ∞ .
In conclusion, the Taylor series sum is O((t_{j+1} − t_j)²).
We approximate the integral ∫_a^b y(x) dx using the Riemann-sum approximation indicated by the filled rectangles in Figure 6.1. The error of
approximating the actual area between x0 and x0 + h by the rectangle y(x0)h may be estimated by
expressing the Taylor series of y(x) at x0,
y(x) = y(x0) + (dy(x0)/dx)(x − x0) + (d²y(x0)/dx²)(x − x0)²/2 + ...
and integrating the Taylor series, where we can conveniently introduce the change of variables
s = x − x0:
∫_{x0}^{x0+h} y(x) dx = ∫_0^h [ y(x0) + (dy(x0)/dx)s + (d²y(x0)/dx²)s²/2 + ... ] ds .
We obtain
∫_{x0}^{x0+h} y(x) dx = y(x0)h + (dy(x0)/dx)h²/2 + (d²y(x0)/dx²)h³/6 + ... .
Comparing with the approximate area y(x0)h, we express the error as
e = (dy(x0)/dx)h²/2 + (d²y(x0)/dx²)h³/6 + ... ∈ O(h²) .
The integral of y(x) between a and b is approximated as the sum of the areas of
n = (b − a)/h
such rectangles. A pessimistic estimate of the total error magnitude would ignore the possibility of
error canceling, so that the absolute value of the total error could be bounded by the sum of the
absolute values of the errors committed for each subinterval
|E| ≤ Σ_{i=1}^{n} |e_i| = Σ_{i=1}^{n} O(h²) = n O(h²) = ((b − a)/h) O(h²) = O(h) .
Note that when we write the equals sign in the above equation, we don't really mean equality; we
use it rather informally to mean "is". In terms of the order-of analysis, we would write for the
error E of the integral from a to b
E ∈ O(h) .
From the point of view of the user of the Riemann-sum approximation this is good news: the error
can be controlled. By decreasing h (that is, by using more subintervals) we can make the total error
smaller. It would be even nicer if the error were O(h²), since then it would decrease faster as h
was decreased. We demonstrate this as follows: assume that we use twice as many subintervals. For
E ∈ O(h) the error would decrease as
h → h/2 :  E ∈ O(h)  →  E_new ∈ O(h/2) = O(h)/2 ,
so the error decreases by a factor of two. For E ∈ O(h²) the error would decrease as
h → h/2 :  E ∈ O(h²)  →  E_new ∈ O((h/2)²) = O(h²)/4 ,
so the error decreases by a factor of four. The payoff of using twice as many intervals is better
this time.
6.2.3 Error of the Midpoint approximation of integrals
Now we estimate the error of the midpoint approximation of integrals of one variable
using the order-of analysis. For instance, as shown in Figure 6.2, approximate the integral
∫_a^b y(x) dx
using the midpoint approximation indicated by the filled rectangles in the figure. The error of
approximating the actual area between x0 − h/2 and x0 + h/2 by the rectangle y(x0)h may be
estimated by expressing the Taylor series of y(x) at x0,
y(x) = y(x0) + (dy(x0)/dx)(x − x0) + (d²y(x0)/dx²)(x − x0)²/2 + ...
Fig. 6.1. Riemann-sum approximation of the integral of a scalar variable.
and integrating the Taylor series, where we introduce the change of variables s = x − x0:
∫_{x0−h/2}^{x0+h/2} y(x) dx = ∫_{−h/2}^{h/2} [ y(x0) + (dy(x0)/dx)s + (d²y(x0)/dx²)s²/2 + ... ] ds .
We obtain
∫_{x0−h/2}^{x0+h/2} y(x) dx = y(x0)h + (d²y(x0)/dx²)h³/24 + ... .
Importantly, the term with dy(x0)/dx produced one negative contribution (triangle) which canceled
the corresponding positive contribution (triangle), and so this term, with its associated h², dropped
out. Comparing with the approximate area y(x0)h, we express the error as
e = (d²y(x0)/dx²)h³/24 + ... ∈ O(h³) ,
which is one order higher than the error estimated for the Riemann sum. The integral of the function
y(x) between a and b is approximated as a sum of the areas of the n rectangles, and the absolute
value of the total error could be bounded by the sum of the absolute values of the errors committed
for each subinterval
|E| ≤ Σ_{i=1}^{n} |e_i| = Σ_{i=1}^{n} O(h³) = n O(h³) = ((b − a)/h) O(h³) = O(h²) .
The order-of analysis tells us that the error E of the integral from a to b for the midpoint rule is
E ∈ O(h²) ,
and therefore the midpoint rule is more accurate than either of the Riemann-sum rules.
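These orders are easy to confirm numerically. A sketch, with an arbitrary smooth integrand, ∫_0^1 e^x dx: halving h should roughly halve the Riemann-sum error and quarter the midpoint error.

```matlab
f = @(x) exp(x);  exact = exp(1) - 1;     % integral of exp over [0,1]
for n = [100, 200]                        % doubling the number of subintervals
    h = 1/n;  x = (0:n-1)*h;              % left endpoints of the subintervals
    Er = abs(sum(f(x))*h - exact);        % left-rectangle (Riemann-sum) error
    Em = abs(sum(f(x + h/2))*h - exact);  % midpoint error
    fprintf('h=%g  Riemann err=%.3e  midpoint err=%.3e\n', h, Er, Em);
end
% The Riemann error drops by about a factor of 2, the midpoint error by about 4.
```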
Consider the IVP ẏ(t) = f(t), y(0) = y0 .
Fig. 6.2. Midpoint approximation of the integral of a scalar variable.
y(t) = y0 + ∫_0^t f(τ) dτ .
y(0) = 1 .
Each step of the forward Euler algorithm drifts off from the original curve. So we see one solution
curve departing from the starting point (t0, y0), but after one step the forward Euler no longer tries
to follow that curve, but rather the one starting at (t1, y1), and so on. Clearly, here is the potential
for amplifying small errors if the solution curves part company rapidly as time goes on. However,
provided we use time steps which are sufficiently small that the forward Euler does not excessively
amplify these little drifts, we can estimate the error on the entire solution interval (the so-called
global error) from the so-called local errors in each time step.
Fig. 6.3. Forward Euler integration drifting off the original solution path.
Consider the IVP ẏ = f(t, y), y(0) = y0. Expand the true solution in a Taylor series:
y(t_{j+1}) = y(t_j) + (dy(t_j)/dt)(t_{j+1} − t_j) + (d²y(t_j)/dt²)(t_{j+1} − t_j)²/2 + ... .
Here y(t_{j+1}) is the true solution that lies on the solution curve passing through the point (t_j, y_j),
and y_{j+1} is what we get from forward Euler. Now we can substitute from the definition
dy(t_j)/dt = f(t_j, y_j)
to get
y(t_{j+1}) = y(t_j) + f(t_j, y_j)(t_{j+1} − t_j) + (d²y(t_j)/dt²)(t_{j+1} − t_j)²/2 + ...
and then move the first two terms on the right-hand side onto the left-hand side
y(t_{j+1}) − y(t_j) − f(t_j, y_j)(t_{j+1} − t_j) = (d²y(t_j)/dt²)(t_{j+1} − t_j)²/2 + ... .
Finally, the sum of the second and third terms on the left-hand side is −y_{j+1} (recall the forward Euler
update y_{j+1} = y_j + f(t_j, y_j)(t_{j+1} − t_j), with y(t_j) = y_j), and so we obtain the local error
(also called truncation error) in this time step as
y(t_{j+1}) − y_{j+1} = (d²y(t_j)/dt²)(t_{j+1} − t_j)²/2 + (d³y(t_j)/dt³)(t_{j+1} − t_j)³/3! + ... .
Two observations: firstly, the local error is second order in the time step,
y(t_{j+1}) − y_{j+1} ∈ O((t_{j+1} − t_j)²) ,
and secondly, the coefficient of this term is the second derivative at (t_j, y_j), which measures the
curvature of the solution curve at that point. The more the curve curves, the larger the error. If
the solution happens to have zero curvature at (t_j, y_j), then we would predict that the Euler step
should not incur any error. It still might: our prediction neglected all those dots (the higher order
terms) in the Taylor series, but at least for zero curvature the second order term in the error would be
absent. The local error resulted from the truncation of the Taylor series, which is a good explanation
of why it is called the truncation error.
6.3.2 Global error of forward Euler
We have demonstrated above (see Figure 6.3) that the global error, that is, the difference between
the exact analytical solution y(t_n) and the computational solution y_n, is a mixture of two
components. Now we will look at the global error in detail. We will try to estimate the global error at
time t_{n+1}, GE_{n+1}, from the global error GE_n at time t_n: see Figure 6.4. Note that we are thinking
in terms of a scalar differential equation, but the conclusions may be readily generalized to coupled
equations.
The first component of the global error is the local (truncation) error, which is caused by the
truncation of the Taylor series, as explained in the previous section.
The second component is caused by the drift-off in the previous steps of the algorithm: every
step of the integrator will cause the solution to drift off the original curve passing through the
initial condition. Let us consider performing one single step of the numerical integration, from t_n to
t_{n+1}. Two different curves pass through the two points (t_n, y_n) and (t_n, y(t_n)): let us say ỹ(t) passes
through (t_n, y_n), and y(t) passes through (t_n, y(t_n)). The difference between the points (t_n, y_n) and
(t_n, y(t_n)) is the global error at time t_n, GE_n.
The difference between the two curves, y(t_{n+1}) − ỹ(t_{n+1}), at time t_{n+1} measures the propagated
error. We can estimate the propagated error PE_{n+1} as the global error GE_n plus the
increase of the distance between the two curves. The increase can be approximated to first order
from the slopes ỹ'(t_n) = f(t_n, y_n) and y'(t_n) = f(t_n, y(t_n)):
PE_{n+1} ≈ GE_n + ( f(t_n, y(t_n)) − f(t_n, y_n) ) (t_{n+1} − t_n) .
We can also use the Taylor series to expand the right-hand side function f as
f(t_n, y(t_n)) ≈ f(t_n, y_n) + (∂f(t_n, y_n)/∂y)(y(t_n) − y_n)
to obtain
PE_{n+1} ≈ GE_n + (∂f(t_n, y_n)/∂y)(y(t_n) − y_n)(t_{n+1} − t_n) .
y(0) = y0 .
|E| ≤ Σ_{i=1}^{n} |e_i| = Σ_{i=1}^{n} O(∆t²) = n O(∆t²) = (t/∆t) O(∆t²) = O(∆t) .
Thus we see that we lost one order in the error estimate in going from local to global errors. The
forward Euler algorithm is second order locally, but only first order globally.
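A minimal numerical check of the global order (a sketch, using the test equation ẏ = −y, y(0) = 1, whose exact solution is e^{−t}): halving the step roughly halves the global error at t = 1, confirming first-order global convergence.

```matlab
f = @(t,y) -y;  T = 1;  yexact = exp(-T);   % exact solution at t = T
for n = [100, 200]                          % doubling the number of steps
    dt = T/n;  y = 1;
    for j = 1:n                             % forward Euler steps
        y = y + dt*f((j-1)*dt, y);
    end
    fprintf('dt=%g  global error=%.3e\n', dt, abs(y - yexact));
end
% Halving dt roughly halves the global error: first order globally.
```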
Illustration 5
Now we can go back to the graphs of Chapter 2, especially Figure 2.19. The slopes of the error curves on
the log-log scale will now make sense. For the forward Euler we now know that its local error is
second order, but the global error is first order. Figure 2.19 displays the global error, and hence
the slope (i.e. the convergence rate) is one. For the modified Euler the global error is second order;
consequently its local error is cubic in the time step.
Fig. 6.4. Global error of the forward Euler integration. LE_{n+1} = local (truncation) error, PE_{n+1} = propagated error, GE_n = global error at time t_n, GE_{n+1} = global error at time t_{n+1}.
Suggested experiments
1. Estimate from Figure 2.21 the order of the local error of the oderk4 Runge-Kutta integrator.
We expand f(x) about x0 with the Lagrange remainder after the second-order term,
f(x) = f(x0) + (df(x0)/dx)(x − x0) + (d²f(x0)/dx²)(x − x0)²/2! + R2 ,
where
R2 = (d³f(θx + (1 − θ)x0)/dx³)(x − x0)³/3! ,   0 ≤ θ ≤ 1 .
Dividing through by (x − x0) and rearranging, we get
(f(x) − f(x0))/(x − x0) = df(x0)/dx + (d²f(x0)/dx²)(x − x0)/2! + R2/(x − x0) .
When the second-order derivative term and the remainder term are ignored, presumably because they
are much smaller in magnitude than what we keep from the right-hand side expression, we get an
approximation of the derivative as
df(x0)/dx ≈ (f(x) − f(x0))/(x − x0) .   (6.2)
Because of the form of this expression, we call this formula a divided difference. All the formulas
for the approximation of derivatives derived in this section are of this nature.
What we've neglected above is
-\frac{d^2 f(x_0)}{dx^2}\frac{(x - x_0)}{2!} - \frac{R_2}{x - x_0} ,
and we realize that this is the error of the approximation. Unless \frac{d^3 f(\theta x + (1-\theta)x_0)}{dx^3} behaves like 1/(x - x_0), in words unless its magnitude blows up to infinity as x \to x_0, we can estimate the magnitude
of the error using the order-of notation we developed earlier:
\frac{R_2}{x - x_0} \approx O(|x - x_0|^2) .
Since
\frac{d^2 f(x_0)}{dx^2}\frac{(x - x_0)}{2!} \approx O(|x - x_0|) ,   (6.3)
we see that this term will dominate the error as the control parameter, the step along the x axis, x - x_0,
becomes shorter and shorter. The accuracy of the algorithm (6.2) is quite poor, the error being only
O(|x - x_0|). We call this kind of error the truncation error, since it is the result of the truncation of
the Taylor series.
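The first-order estimate is easy to check numerically. The following Python sketch (the test function sin, the point x_0 = 1, and the sequence of steps are arbitrary choices made here for illustration, not taken from the text) halves the step repeatedly and watches the error of the divided difference (6.2) halve along with it:

```python
import math

def forward_difference(f, x0, h):
    # Formula (6.2): the divided-difference approximation of df/dx at x0.
    return (f(x0 + h) - f(x0)) / h

exact = math.cos(1.0)  # derivative of sin at x0 = 1
errors = []
h = 0.1
for _ in range(5):
    errors.append(abs(forward_difference(math.sin, 1.0, h) - exact))
    h /= 2.0

# First-order convergence: halving h should roughly halve the error,
# so successive error ratios approach 2.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(ratios)
```

The printed ratios hover near 2, which is exactly the O(|x - x_0|) behavior predicted by (6.3).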
Illustration 6
Consider a common counterexample where (6.3) is not valid. In Figure 6.5 a piecewise linear function
is shown (solid line) with its derivative (dashed). If we take (6.2) with x_0 to the left of b, for x < b
the formula works perfectly. The second derivative is in fact
\frac{d^2 f(x_0)}{dx^2} = 0 ,
which makes our derivative computation perfect: no error. Now we will make x_0 approach b from
the left arbitrarily closely. The error estimate (6.3) is then no longer valid since at x_0 = b
\frac{d^2 f(x_0)}{dx^2} \to \infty .
This unfortunate behavior is due to the first derivative being discontinuous at b.
Fig. 6.5. Piecewise linear function f(x) (solid line) and its derivative (dashed line).
Now we will consider the approximation formula (6.2) for two cases: x > x_0 and x < x_0. When
x > x_0 we are looking forward with the formula to determine the slope at x_0, hence we get the
forward Euler approximation of the derivative. Let us write h = |x - x_0|. Then the formula (6.2)
may be rewritten in the familiar form
\frac{df(x_0)}{dx} \approx \frac{f(x_0 + h) - f(x_0)}{h} .
On the other hand, when x < x_0 we are looking backward with the formula to determine the
slope at x_0, hence we get the backward Euler approximation of the derivative as this version of the
formula (6.2):
\frac{df(x_0)}{dx} \approx \frac{f(x_0) - f(x_0 - h)}{h} .
Fig. 6.6. Forward and backward Euler approximation of the derivative.
Figure 6.6 illustrates these concepts. The actual derivative is the slope of the green line (tangent
at (x0 , f (x0 ))), which is approximated by the forward Euler algorithm as the slope of the red dashed
line, and by the backward Euler algorithm as the slope of the blue dashed line.
Evidently the figure suggests an improvement on these two algorithms. The green line seems to
have a slope rather close to the average of the slopes of the red and blue lines. (The angles between
the blue and green line and between the red and green line are about the same.) So what happens
if we average those Euler predictions?
\frac{1}{2}\left(\frac{f(x_0 + h) - f(x_0)}{h} + \frac{f(x_0) - f(x_0 - h)}{h}\right) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} .   (6.4)
The above formula defines another algorithm, the centered difference approximation of the derivative. Figure 6.7 shows the dashed green line which represents the centered difference approximation
of the tangent, and we can see that the slopes of the dashed and solid green lines are indeed
quite close. It appears that the centered difference approximation should be more accurate in general, and we can investigate this analytically by averaging not only the approximation formulas, but
the entire expressions including the errors.
The forward difference approximation of the derivative, including the truncation error R_{2,f},
\frac{df(x_0)}{dx} = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{d^2 f(x_0)}{dx^2}\frac{h}{2!} - \frac{R_{2,f}}{h}
Fig. 6.7. Forward and backward Euler and centered difference approximation of the derivative.
is added to the backward difference approximation of the derivative, including the truncation error
R_{2,b},
\frac{df(x_0)}{dx} = \frac{f(x_0 - h) - f(x_0)}{-h} - \frac{d^2 f(x_0)}{dx^2}\frac{(-h)}{2!} - \frac{R_{2,b}}{(-h)}
to result in the expression of the centered difference approximation
\frac{df(x_0)}{dx} = \frac{f(x_0 + h) - f(x_0 - h)}{2h} - \frac{R_{2,f} - R_{2,b}}{2h} .
The truncation error of the centered difference is therefore
\frac{R_{2,f} - R_{2,b}}{2h} \approx O(h^2) .   (6.5)
It is one order higher than the truncation errors of the Euler algorithms (O(h^2) versus O(h)), and
higher is better: the error decreases faster with decreasing h.
The formulas for the numerical approximation of derivatives of functions, forward and backward
Euler, and the centered differences, are called finite difference stencils, and many more, sometimes
with considerably higher accuracy, can be found in the technical literature. The price to pay is
that with higher accuracy one needs more function values around the point x_0.
Illustration 7
We shall now investigate the numerical evidence for these estimates of the truncation error. In the script
compare_conv_driver, x is the point where the derivative is evaluated, n is the number of reductions
of the step, dx0 is the initial step, which is then subsequently reduced by the factor divFactor.
funhand and funderhand are the handles of the function and its derivative (as anonymous MATLAB
functions).
funhand=@(x)2*x^2-1/3*x^3;
funderhand=@(x)4*x-3/3*x^2;
x=1e1;
n= 9;
dx0= 0.3;
divFactor=4;
Fig. 6.8. Forward and backward Euler and centered difference approximation of the derivative. Error versus
the step size.
Figure 6.8 both confirms the expected outcome and presents an unexpected one: the forward
and backward Euler approximations are of the same accuracy, and on the log-log scale their error decreases with a rate
of convergence equal to one, while the centered difference is both more accurate in absolute terms
and its error decreases with a convergence rate of two. What may be unexpected, however, is the
behavior of the centered difference error for very small steps. The error does not decrease anymore;
rather, the opposite occurs.
Shifting x as the point where the derivative is evaluated (change the third line to read
x=1e4;
) gives the results in Figure 6.9. The performance of the numerical differentiation algorithms has now
very much deteriorated, and a decrease in the step size does not necessarily lead to an improvement
in the result, neither in the two Euler derivative approximations, nor in the centered difference
approximation.
The explanation for the behavior described in the Illustration above rests in what is displayed
in the graphs: the graphs present the total error incurred by the numerical algorithm, and this error
is the result of the interplay between the truncation error and the effect of the so-called machine-representation error. The term round-off error is commonly used for this type of error. However,
round-off is only a special case of the broader class of machine-representation errors. Another term
which would be equivalent is computer-arithmetic error, or just arithmetic error. We will sometimes
use "machine-representation" and "arithmetic" error interchangeably.
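This interplay is easy to reproduce. The Python sketch below (the function and evaluation point are stand-ins chosen here for illustration, not taken from the MATLAB script) scans the step size for the centered difference and locates the step at which the total error bottoms out; below that step the arithmetic error, roughly proportional to 1/h, takes over:

```python
import math

def centered_difference(f, x0, h):
    # Formula (6.4): centered difference approximation of df/dx at x0.
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

exact = math.cos(1.0)
steps = [10.0 ** (-k) for k in range(1, 16)]
errors = [abs(centered_difference(math.sin, 1.0, h) - exact) for h in steps]

best = min(errors)
# The smallest achievable total error sits many orders of magnitude above
# machine epsilon, and the tiniest step gives a much worse result.
print(best, errors[-1])
```

The minimum of the total error marks the step below which reducing h buys nothing: the round-off contribution grows as fast as the truncation contribution shrinks.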
Fig. 6.9. Forward and backward Euler and centered difference approximation of the derivative. As Figure 6.8,
but the point of evaluation is shifted towards a much bigger number, x=1e4.
the power of two, similarly to what we're used to with decimal numbers. For instance, the decimal
number 13 = 1·10^1 + 3·10^0 can be written in the binary system as 13 = 1·2^3 + 1·2^2 + 0·2^1 + 1·2^0.
Hence its binary representation is 1101. We can use the MATLAB function dec2bin:
>> dec2bin(13)
ans =
1101
The largest number we can store in a byte (more precisely in an unsigned byte) is 255, viz
>> dec2bin(255)
ans =
11111111
since in that case all the bits are toggled to 1. If we wish to represent signed numbers, we must
reserve one bit for the storage of the sign (positive or negative). Then we have only seven bits for
the storage of the actual pattern of 0s and 1s. The largest number that seems to be available then is
>> bin2dec('1111111')
ans =
127
However, by some clever manipulation it is possible to squeeze out one more number out of the eight
bits, and so we get as the algebraically smallest and largest integers using the MATLAB functions
intmin and intmax
>> intmin('int8')
ans =
-128
>> intmax('int8')
ans =
127
The clever trick is called the two's complement representation, and the bits represent numbers as
shown here
00000000=0
00000001=1
00000010=2
00000011=3
...
01111111=127
11111111=-1
11111110=-2
11111101=-3
...
10000000=-128
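The table can be cross-checked with a short Python sketch; int8_from_bits is a hypothetical helper written for this illustration, which gives the leading bit the weight -2^7 and all other bits their usual positive weights:

```python
def int8_from_bits(bits):
    # Interpret an 8-character '0'/'1' string, most significant bit first,
    # in two's complement: the top bit carries the weight -2**7.
    assert len(bits) == 8
    value = -128 * int(bits[0])
    for i in range(1, 8):
        value += int(bits[i]) * 2 ** (7 - i)
    return value

print(int8_from_bits('00000011'))  # 3
print(int8_from_bits('01111111'))  # 127
print(int8_from_bits('11111111'))  # -1
print(int8_from_bits('10000000'))  # -128
```

Notice how the single negative weight of the top bit reproduces every entry of the table, including the extra number -128 squeezed out of the eight bits.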
The argument 'int8' denotes the so-called integer type, and there are four signed and four unsigned
varieties in MATLAB (with 8, 16, 32, and 64 bits). As an example, here are the smallest and largest
unsigned 64-bit integers
>> intmin('uint64')
ans =
0
>> intmax('uint64')
ans =
18446744073709551615
Integers are nice to work with, and they are very useful for instance as counters in loops. If we're
not careful, bad things can happen though. Take the following code fragment: first we create the
variable a as an 8-bit integer zero with int8
>> a= int8(0);
and then we increment it 1000 times by one. The result is a bit unexpected, perhaps:
for i=1: 1000
a=a+1;
end
a
a =
127
What happened? Overflow! When the variable reached the largest value that can be stored in a
variable of this type, it stopped increasing: the variable overflowed.
6.5.2 Floating-point data types
The floating-point numbers are represented with values for the so-called mantissa M and exponent
E, stored in bits essentially as described above, as
M*2^E
The basic datatype in MATLAB is a floating-point number stored in 64 bits, the so-called double. The
machine representation for this number is standardized, as described in the ANSI/IEEE Standard
754-1985, Standard for Binary Floating Point Arithmetic. The exponent and the mantissa are stored
as patterns of bits, which may be numbered from 0 to 63, left to right. The first bit is
the sign bit, S, the next eleven bits are the exponent bits, E, and the final 52 bits are the mantissa
bits M:
S EEEEEEEEEEE MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
0 1         11 12                                                63
The value V represented by the 64-bit word may be determined by an algorithm (how else?):
1. If E=2047
a) If M is nonzero, then V=NaN (Not a Number)
b) Else (M==0)
i. If S is 1, then V=-Inf
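A Python sketch of the complete decoding rule (the normalized and denormalized cases included; the helper name is mine, and struct is used only to reach the raw 64 bits):

```python
import struct

def decode_double(x):
    # Recover S, E, M from the 64-bit pattern of a double and re-evaluate
    # the value V by the IEEE 754 rules: specials for E=2047, denormalized
    # numbers for E=0, normalized numbers otherwise.
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    S = bits >> 63
    E = (bits >> 52) & 0x7FF
    M = bits & ((1 << 52) - 1)
    sign = -1.0 if S else 1.0
    if E == 2047:
        return float('nan') if M else sign * float('inf')
    if E == 0:
        # Denormalized: no implicit leading 1, exponent fixed at -1022.
        return sign * (M / 2.0 ** 52) * 2.0 ** (-1022)
    return sign * (1.0 + M / 2.0 ** 52) * 2.0 ** (E - 1023)

print(decode_double(1.0), decode_double(-2.5), decode_double(float('inf')))
```

Round-tripping any double through decode_double reproduces it exactly, which is a handy way to convince oneself of the bit layout.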
For S=0, E=1023, and all the bits of the mantissa M set to zero, the value is V=1.0. Then the next closest
number to 1.0 that can be represented in the computer is 1.0 + 2^-52. Between these two values is a
gap where no numbers live. It is a tiny gap, so it may not bother us too much, but consider what
happens as the exponent 2^(E-1023) gets bigger. The MATLAB function eps computes the size of
the gap next to any particular double value. For instance, for a value representative of the Young's
modulus in some units the gap will get bigger
>> eps(3e11)
ans =
6.103515625000000e-005
For the distance across the Milky Way, the gap already amounts to (in meters)
>> eps(100000 *9.46e15)
ans =
131072
and the distance to the outermost object in the universe can be recorded in the computer with a
double value only with precision amounting to tens of millions of kilometers
>> eps(18e9 *9.46e15)
ans =
3.435973836800000e+010
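Python's standard library exposes the same gap as math.ulp (available in Python 3.9 and later), so the MATLAB numbers above can be cross-checked:

```python
import math

# Gap between 1.0 and the next representable double: 2^-52.
print(math.ulp(1.0))
# Gap next to a Young's-modulus-sized value; compare MATLAB's eps(3e11).
print(math.ulp(3e11))
# Gap next to the Milky Way distance in meters; compare eps(100000*9.46e15).
print(math.ulp(100000 * 9.46e15))
```

The second and third values agree exactly with the MATLAB outputs above, since both functions report the spacing of the same IEEE 754 doubles.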
The gap between adjacent numbers represented in the computer is called the machine epsilon, and
it is an important quantity, especially when we consider round-off. As an example, consider the
addition of two numbers: watch what happens to the trailing digits.
>> pi
ans =
3.141592653589793
>> 1+pi
ans =
4.141592653589793
>> 1e6+pi
ans =
1.000003141592654e+006
In the second example, the disparate magnitudes led to truncation of the previously significant digits.
Addition results in a loss of significance when the numbers are disparate. For subtraction, the
dangerous situation occurs when the two numbers are close. Consider this example:
>> 3.14159265-pi
ans =
-3.589792907376932e-009
The computer essentially made up the trailing digits. (To check this we go to the Web: a lot
of digits of π are available online: 3.14159265358979323846264338....) This problem is referred
to as loss of significance, and it is one of the most deleterious aspects of computer arithmetic.
Another example illustrates the so-called catastrophic cancellation of terms, for both addition
and subtraction. We may consider the expressions below equivalent, but it evidently matters how
the MATLAB engine interprets the expression we typed in:
>> (1 +2e-38)-(1 -2e-38)
ans =
0
>> (1 +2e-38-1 +2e-38)
ans =
2.000000000000000e-038
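The same experiment can be repeated in Python, where the effect is identical: floating-point addition is not associative, so the grouping decides whether the tiny terms survive:

```python
# 1 + 2e-38 rounds to exactly 1.0 (2e-38 is far below the gap next to 1.0),
# so the two parenthesized sums are both 1.0 and their difference cancels
# catastrophically to zero:
a = (1 + 2e-38) - (1 - 2e-38)

# Evaluated left to right, the ones cancel first and the tiny terms are
# added afterwards, so they survive:
b = 1 + 2e-38 - 1 + 2e-38

print(a, b)  # 0.0 2e-38
```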
2. The range in which floating-point values are represented is sparsely populated: there are gaps
between numbers (the so-called machine epsilon), which increase with the magnitude of the
number. The machine epsilon essentially limits the largest number of significant digits that we
can expect (15 for double, 6 for single).
3. Operations on the computer-represented numbers rarely lead to exact answers, and especially
addition and subtraction can prove devastating to our budget of significant digits (overflow,
underflow, round-off, cancellation of terms, ...). Consequently the number of significant digits is
usually much less than the number of digits in the computer printouts.
The arithmetic error of the centered difference approximation may be estimated as
E_R \approx \frac{\epsilon(f(x_0))}{2h} .
Here \epsilon(f(x_0)) is the machine epsilon at the real number f(x_0). We can see that the arithmetic error
E_R increases in inverse proportion to the step size h. Also, we see that the error increases with the
magnitude of the numbers to be subtracted, f(x_0), since the machine epsilon depends on it.
Now let us go back to Figure 6.8. The total error displayed in Figure 6.8 is the sum of the
truncation error and the arithmetic error. The descending branches of the errors are dominated by
the truncation error, either O(h) (slope +1) or O(h^2) (slope +2). In the climbing branch of the
total error the arithmetic error dominates. The dependence of the arithmetic error on 1/h = h^{-1}
can be clearly detected in the slope -1 of the climbing branch of the total error of the derivative in
Figure 6.9.
Note well that while talking about the total error we disregard avoidable errors of the nature
of bugs and mistakes. Sadly, these errors are sometimes present, but their unpredictable nature
makes them very difficult to discuss in general.
Illustration 8
The report "DCS Upgrades for Nuclear Power Plants: Saving Money and Reducing Risk through
Virtual-Stimulation Control System Checkout" by G. McKim, M. Yeager and C. Weirich from 2011
states on page 5, when discussing a software simulator of a nuclear power plant subsystem: "Here
was the first surprise. The emulated Bailey response in Figure 5 didn't show this rate limiting. The
controller output traveled as fast as 12% per second. This led to a line-by-line examination of the
FORTRAN source code for the Bailey emulation, whereupon it was discovered that, contrary to
belief, the rate limiting was not included in the simulation."
This is an example of a software bug: the feature that was supposed to be programmed was
either never implemented or was implemented and later deleted.
Illustration 9
The Deepwater Horizon Accident Investigation Report from September 8, 2010 states on page 64:
"The 13.97 ppg interval at 18,200 ft. was included in the OptiCem model report as the reservoir
zone. The investigation team was unable to clarify why this pressure (13.97 ppg) was used in the
model since available log data measured the main reservoir pressure to be 12.6 ppg at the equivalent
depth. Use of the higher pressure would tend to increase the predicted gas flow potential. The same
OptiCem report refers to a 14.01 ppg zone at 17,700 ft. (which, in fact, should be 14.1 ppg: the
actual pressure measured using the GeoTap logging-while-drilling tool)." (Emphasis is mine.)
These two instances are illustrations of an input error (mistake of the operator). Undoubtedly
important, but they are outside of the scope of error control that numerical methods can exercise
and therefore will not be discussed in this book.
7
Solution of systems of equations
Summary
1. We discuss a couple of representative methods for the solution of a scalar nonlinear equation.
Main idea: Newton's and the bisection method are complementary with respect to the rate of
convergence and robustness.
2. Newton's method (in one of its several variants) is a crucial building block in nonlinear analysis
of structures, where systems of coupled nonlinear equations need to be solved repeatedly. Main
idea: efficient solvers for systems of coupled linear equations are critical to the success of
Newton's method.
3. We present solutions of systems of coupled linear equations that fall under the class of factorizations, using the
examples of the LU and QR decompositions. Main idea: factorizations provide critical infrastructure to a variety of numerical algorithms, especially Newton-like solvers of nonlinear equations
and eigenvalue problem solvers.
4. Errors produced by factorization algorithms depend on the so-called condition number. Main
idea: condition numbers are related to eigenvalues.
The backward Euler method advances the solution by the formula
y_{j+1} = y_j + (t_{j+1} - t_j) f(t_{j+1}, y_{j+1}) .   (7.1)
In general, for an arbitrary right-hand side function f this will require the solution of a nonlinear
algebraic equation to obtain y_{j+1}. For convenience we will define the function of the unknown y_{j+1}
F(y_{j+1}) = y_{j+1} - y_j - (t_{j+1} - t_j) f(t_{j+1}, y_{j+1}) .
The solution y^{(*)} of the equation
F(y^{(*)}) = 0
is the sought y_{j+1}.
We attempt to find the solution to F(y^{(*)}) = 0 by first guessing where the root may be, y^{(0)}, and
then using the Taylor series expansion of F at y^{(0)}
F(y^{(*)}) = F(y^{(0)}) + \frac{dF(y^{(0)})}{dy_{j+1}} (y^{(*)} - y^{(0)}) + R_1 = 0 .
The term
\frac{dF(y^{(0)})}{dy_{j+1}}
is referred to as the Jacobian. Provided the remainder R_1 is negligible compared to the other terms,
we can write approximately
F(y^{(0)}) + \frac{dF(y^{(0)})}{dy_{j+1}} (y^{(*)} - y^{(0)}) \approx 0 ,   (7.2)
which may be solved for an improved guess of the root
y^{(1)} = y^{(0)} - \left(\frac{dF(y^{(0)})}{dy_{j+1}}\right)^{-1} F(y^{(0)}) .
Thus we arrive at Newton's algorithm for finding the solution of a nonlinear algebraic equation:
guess the starting point of the iteration, y^{(0)}, as close to the expected root y^{(*)} as
possible. Then repeat until the error (in some measure to be determined) drops below an acceptable
tolerance:
y^{(k)} = y^{(k-1)} - \left(\frac{dF(y^{(k-1)})}{dy_{j+1}}\right)^{-1} F(y^{(k-1)})   (7.3)
if error e^{(k)} < tolerance, break; otherwise go on
k = k + 1 and repeat from the top
The error could be measured as the difference between the successive iterations
e^{(k)} = |y^{(k)} - y^{(k-1)}| ,
or by comparing the value of the function F with zero
e^{(k)} = |F(y^{(k)})| .
Or, convergence can be decided by looking at some composite of the above errors; for instance, the
iteration could be considered converged when either of these errors drops below a certain tolerance.
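A minimal Python sketch of the loop (7.3) for a scalar equation, with the two error measures combined as just discussed (the function names here are mine, not aetna's):

```python
def newton_scalar(F, dFdx, x0, tol=1e-12, maxiter=25):
    # The iteration (7.3) for a scalar equation F(x) = 0, stopping when
    # either the step or the residual drops below the tolerance.
    x = x0
    for _ in range(maxiter):
        step = F(x) / dFdx(x)
        x = x - step
        if abs(step) < tol or abs(F(x)) < tol:
            return x
    raise RuntimeError('Newton iteration did not converge')

# Solve 0.5 + (x - 1)^3 = 0, the example used later in this chapter.
root = newton_scalar(lambda x: 0.5 + (x - 1) ** 3,
                     lambda x: 3 * (x - 1) ** 2,
                     2.6)
print(root)
```

The root is 1 - 0.5^(1/3), and the iteration reaches it in a handful of steps from the guess 2.6.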
Illustration 1
How do we apply Newton's algorithm to solve the nonlinear equation that defines a single step
of the backward Euler algorithm?
To advance the solution we have to solve F(y_{j+1}) = y_{j+1} - y_j - (t_{j+1} - t_j) f(t_{j+1}, y_{j+1}) = 0. The
only difficulty may be presented by the derivative of the function f, which we need in order to compute
\frac{dF(y^{(k-1)})}{dy_{j+1}} = 1 - (t_{j+1} - t_j) \frac{\partial f(t_{j+1}, y^{(k-1)})}{\partial y_{j+1}} .
This turns out to be really easy for the simple function f of a linear ODE with a constant coefficient,
f(t, y) = \lambda y ,
for which the Jacobian is
\frac{dF(y^{(k-1)})}{dy_{j+1}} = 1 - (t_{j+1} - t_j)\lambda .
The Newton algorithm gives the solution in one iteration step as
y^{(*)} = y^{(1)} = y^{(0)} - \left(\frac{dF(y^{(0)})}{dy_{j+1}}\right)^{-1} F(y^{(0)}) = \frac{y_j}{1 - (t_{j+1} - t_j)\lambda} .
For this special right-hand side function it works out precisely as we would expect from the definition
of the backward Euler method. For general right-hand side functions f the solution will require
several iterations of the algorithm, until some tolerance is reached as discussed above.
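For the linear right-hand side the single-step property can be verified directly; in this Python sketch lam and dt are arbitrary test values chosen here:

```python
def backward_euler_step_newton(yj, dt, lam, y0):
    # One Newton iteration for F(y) = y - yj - dt*lam*y = 0,
    # with the Jacobian dF/dy = 1 - dt*lam, starting from the guess y0.
    F = y0 - yj - dt * lam * y0
    dFdy = 1.0 - dt * lam
    return y0 - F / dFdy

yj, dt, lam = 2.0, 0.1, -3.0
# Whatever the starting guess, a single step lands on yj / (1 - dt*lam):
print(backward_euler_step_newton(yj, dt, lam, yj))
print(backward_euler_step_newton(yj, dt, lam, 100.0))
print(yj / (1.0 - dt * lam))
```

Because F is linear in y, one Newton step is exact regardless of the starting guess, which is what the derivation above predicts.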
Concerning the implementation of the backward Euler time integration in MATLAB: either we
have to provide not only a function for the right-hand side f but also its derivative \partial f(\cdot, y)/\partial y, or the
software must make do without the derivative. Fortunately, we realize that numerical differentiation
could be used, and we have developed some approaches in the previous chapter.
7.1.1 Convergence rate of Newton's method
Write the Taylor series for the scalar function F(x), but this time keep the remainder
F(x^{(*)}) = F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (x^{(*)} - x^{(k)}) + R_1 ,
where
R_1 = \frac{d^2 F(\xi)}{dx^2} \frac{(x^{(*)} - x^{(k)})^2}{2!} , \quad \xi \in [x^{(k)}, x^{(*)}] .
Since x^{(*)} is the root, F(x^{(*)}) = 0, and therefore
F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (x^{(*)} - x^{(k)}) + R_1 = 0 .   (7.4)
Newton's algorithm would use the above equation, neglect the remainder R_1, and thus obtain
an estimate of the root x^{(k+1)} from
F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (x^{(k+1)} - x^{(k)}) = 0 .   (7.5)
We now define the errors of the successive iterates, E_k = x^{(*)} - x^{(k)} and E_{k+1} = x^{(*)} - x^{(k+1)},
and these may be substituted both in Equation (7.4) and in the expression for the remainder.
Thus (7.4) may be written in terms of the errors as
F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} E_k + \frac{d^2 F(\xi)}{dx^2} \frac{E_k^2}{2!} = 0 .   (7.6)
Similarly, (7.5) written in terms of the errors reads
F(x^{(k)}) + \frac{dF(x^{(k)})}{dx} (E_k - E_{k+1}) = 0 .   (7.7)
Subtracting (7.7) from (7.6) yields
E_{k+1} = -\left(\frac{dF(x^{(k)})}{dx}\right)^{-1} \frac{d^2 F(\xi)}{dx^2} \frac{E_k^2}{2!} .   (7.8)
We say that Newton's method attains a quadratic convergence rate, because the error in the
current iteration is proportional to the square of the error in the previous iteration (and this is good:
assuming the error is going to be small, the square of a small number is even smaller).
Illustration 2
We shall solve the equation f(x) = 0.5 + (x - 1)^3 = 0 with Newton's method. The solver used
is the aetna implementation of Newton's method, newt. The approximate errors in the seven
iterations required for convergence to machine precision are

Iteration   Approximate Error
1           0.806299474015900
2           0.338070307349234
3           0.090929677656663
4           0.009026273904915
5           0.000101115651663
6           0.000000012879718
7           0.000000000000000
A good rule of thumb is that the number of zeros behind the decimal point of the error doubles with
each iteration. That is excellent convergence indeed.
Figure 7.1 illustrates the formula (7.8). We plot the approximate errors E_{k+1} versus E_k as
plot(e(1:end-2),e(2:end-1),'ro-')% e = approximate errors
Clearly the data resembles a parabolic arc, exactly as predicted by the formula. Re-plotting on a log-log
scale (Figure 7.2) as
loglog(e(1:end-2),e(2:end-1),'ro-')
confirms the relationship between E_{k+1} and E_k. It is quadratic, since the slope on the log-log plot
is very close to 2.
Fig. 7.1. Approximate errors E_{k+1} plotted versus E_k for @(x)(0.5+(x-1)^3), x0=2.6, on the linear scale and on the log-log scale.
F (x)
x3
x1
x1
x0 x2
x3
x0
x2
Fig. 7.2. Failure of Newton's method due to divergence (left), and successful convergence upon the selection
of the initial guess closer to the root (right).
Fig. 7.3. Failure of Newton's method: first it gets stuck next to a false root (a maximum), then the iterations
blast off to infinity.
Fig. 7.4. Failure of Newton's method: if the initial guess of the root is not sufficiently close, the method does not
find the root that was intended.
See: aetna/NonlinearEquations/bisect.m
Fig. 7.5. Approximate errors E_{k+1} versus E_k for the bisection method, on the linear and the log-log scale.
Figure 7.6 is a good comparison of the typical convergence properties of Newton's and the
bisection method. Evidently the bisection method requires many more iterations than Newton's
method. When each evaluation of the function is expensive, the quicker converging method wins.
When the robustness of bisection is required (such as when Newton's would not converge), the
slower method is preferable. Wouldn't it make sense to combine such disparate methods and switch
between them as needed? That is how the MATLAB fzero function works. (Find out from the
documentation which methods are combined in fzero.)
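A bare-bones Python sketch of the bisection idea (not a transcription of aetna's bisect.m), applied to the same example equation:

```python
def bisect(F, a, b, tol=1e-12):
    # Classic bisection: keep halving the bracket [a, b] as long as the
    # function changes sign across it.
    Fa = F(a)
    assert Fa * F(b) < 0, 'the root must be bracketed'
    while b - a > tol:
        m = (a + b) / 2.0
        if Fa * F(m) <= 0:
            b = m            # sign change in [a, m]
        else:
            a, Fa = m, F(m)  # sign change in [m, b]
    return (a + b) / 2.0

root = bisect(lambda x: 0.5 + (x - 1) ** 3, 0.0, 2.6)
print(root)
```

Each pass halves the bracket, so the error decreases by a fixed factor of two per iteration: robust, but far slower than the quadratic convergence of Newton's method.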
Fig. 7.6. Comparison of the convergence of the bisection method (dashed line) and Newton's method
(solid line).
For a system of coupled nonlinear equations F(y) = 0 we proceed analogously: expand F at the guess y^{(0)},
F(y^{(*)}) = F(y^{(0)}) + \frac{dF(y^{(0)})}{dy} (y^{(*)} - y^{(0)}) + R_1 = 0 .
Provided the remainder R_1 is negligible compared to the other terms, we can write approximately
F(y^{(0)}) + \frac{dF(y^{(0)})}{dy} (y^{(*)} - y^{(0)}) \approx 0 ,   (7.9)
(7.9)
which at rst sight looks exactly like (7.2). There must be a dierence here, however, as we are
dealing with a system of equations. What do we mean by
dF (y (0) ) ()
(y y (0) ) ?
dy
(7.10)
The expression (7.9) holds for each component (row) of the vector (column matrix) separately. The
components of the vector function F and of the argument y may be written as
[F (.)]r ,
[y]c .
Each of the components [F (.)]r is a function of all the components [y]c . Therefore, equation (7.9)
in components must have the meaning
[F (y (0) )]r +
[F (y (0) )]r
[y () y (0) ]c 0 ,
[y]
c
c=1:n
i.e. in words: the change in the component [F (y (0) )]r is due to the change of this component in the
direction of each of the c components of the argument [y]c , which is expressed by the rst term of
the Taylor series. Thus we see that left-hand side of (7.9) is the sum of two vectors, F (y (0) ) and the
vector
dF (y (0) ) ()
(y y (0) ) ,
dy
which is the product of a square matrix
dF (y (0) )
dy
dF (y (0) )
and the vector (y () y (0) ). The matrix
dy
The matrix \frac{dF(y^{(0)})}{dy} is called the Jacobian matrix. The Newton iteration for the system then repeats the update
y^{(k)} = y^{(k-1)} - \left(\frac{dF(y^{(k-1)})}{dy}\right)^{-1} F(y^{(k-1)}) ,   (7.11)
with k = k + 1 after each pass, until the error (in some measure to be determined) drops below an acceptable tolerance. In general it
is a good idea not to invert a matrix if we can help it. Rewriting the Newton algorithm with the Jacobian matrix denoted
J(y^{(k-1)}) = \frac{dF(y^{(k-1)})}{dy} ,   (7.12)
each iteration instead solves the linear system J(y^{(k-1)}) \delta y = -F(y^{(k-1)}) and updates y^{(k)} = y^{(k-1)} + \delta y.
As an example, consider the two coupled nonlinear equations
f(x, y) = (x^2 + 3y^2)/2 - 2 = 0 ,
g(x, y) = xy + 3/4 = 0 .
The two expressions f(x, y) and g(x, y) may be interpreted as surfaces raised above the x, y plane.
Setting these to zero is equivalent to forcing the points that satisfy these equations, individually, to
lie on the level curves of the surfaces. The solution of the two equations being satisfied simultaneously
corresponds to the intersection of the level curves. The figures of the surfaces were produced by the
script two_surfaces.
The solution will be attempted with the Newton method. The vector argument is
y = \begin{bmatrix} x \\ y \end{bmatrix}
and the vector function is
F(y) = \begin{bmatrix} f(x, y) \\ g(x, y) \end{bmatrix} .
The components of the Jacobian matrix are
J_{11} = \frac{\partial f(x, y)}{\partial x} = x , \quad J_{12} = \frac{\partial f(x, y)}{\partial y} = 3y , \quad J_{21} = \frac{\partial g(x, y)}{\partial x} = y , \quad J_{22} = \frac{\partial g(x, y)}{\partial y} = x .
The MATLAB code defines both the vector function and the Jacobian matrix as anonymous functions.
F=@(x,y) [((x.^2 + 3*y.^2)/2 -2); (x.*y +3/4)];
J=@(x,y) [x, 3*y; y, x];
With these functions at hand it is easy to carry out the iteration interactively, step-by-step. For
instance, guessing
w0= [-0.5;0.5];
we update the solution as
>> w=w0-J(w0(1),w0(2))\F(w0(1),w0(2))
w =
-0.5000
1.5000
For the next iteration, we reset the variable w0
>> w0=w;
and repeat the solution
w=w0-J(w0(1),w0(2))\F(w0(1),w0(2))
w =
-0.6154
1.1538
We can watch the differences between the successive iterations getting smaller. With four iterations
we get five decimal digits converged.
w =
-0.6923
1.0833
This point is one of the four possible solutions (level-curve intersections). To get a different
solution we need to start with a different guess w0, for instance w0= [-2;0.5];.
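The interactive session above can be scripted compactly. Here is a Python sketch with the same F, Jacobian, and starting guess, using Cramer's rule for the 2-by-2 linear solve so that no external libraries are needed:

```python
def solve2x2(J, r):
    # Cramer's rule for the 2x2 linear system J d = r.
    (a, b), (c, d) = J
    det = a * d - b * c
    return ((r[0] * d - b * r[1]) / det, (a * r[1] - r[0] * c) / det)

def F(x, y):
    return ((x ** 2 + 3 * y ** 2) / 2 - 2, x * y + 3 / 4)

def J(x, y):
    return ((x, 3 * y), (y, x))

x, y = -0.5, 0.5  # the starting guess w0
for _ in range(10):
    dx, dy = solve2x2(J(x, y), F(x, y))
    x, y = x - dx, y - dy

print(x, y)  # one of the four level-curve intersections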
One column of the Jacobian matrix at a time may be approximated numerically by perturbing the
argument in the direction of a single coordinate: compute
{}_cF = F(z + h_c e_c) ,
where
[e_c]_m = 1 for c = m , [e_c]_m = 0 otherwise,
and h_c is a suitably small number (not too small: let us not forget the effect of computer arithmetic).
The Jacobian matrix is approximated by the computed column vectors as
\frac{\partial F(z)}{\partial z} \approx \left[ ({}_1F - F)/h_1 , ({}_2F - F)/h_2 , \ldots , ({}_nF - F)/h_n \right] .   (7.13)
In these columns we recognize numerical approximations of derivatives of the vector function F
(divided differences).
One can recognize in Newton's method with the numerical approximation of the Jacobian a
variation of the so-called secant method.
Illustration 4
In the script Jacobian_example we compare the analytically determined Jacobian matrix with its
numerical approximation. The vector function is taken as
F(z) = \begin{bmatrix} z_1^2 + 2 z_1 z_2 + z_2^2 \\ z_1 z_2 \end{bmatrix} .
The Jacobian matrix is evaluated at z_1 = -0.23, z_2 = 0.6 as
>> F=@(z)[z(1)^2+2*z(1)*z(2)+z(2)^2;...
z(1)*z(2)];
dFdz=@(z)[2*z(1)+2*z(2),2*z(1)+2*z(2);...
z(2),
z(1)];
zbar = [-0.23;0.6];
>> Jac =dFdz(zbar)
Jac =
    0.7400    0.7400
    0.6000   -0.2300
Evaluating the function at the base point and using the step size of 0.1
>> Fbar =F(zbar);
h=1e-1;
we obtain the approximate (numerically dierentiated) Jacobian matrix
>> Jac_approx =[(F(zbar+[1;0]*h)-Fbar)/h, (F(zbar+[0;1]*h)-Fbar)/h]
Jac_approx =
    0.8400    0.8400
    0.6000   -0.2300
We may note that the second row is in fact exact. (Why?) On the other hand, the Jacobian matrix
will not be evaluated exactly in any component for the second example in Jacobian_example. Check
it out.
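The comparison is easy to reproduce outside MATLAB. The Python sketch below builds the approximation (7.13) column by column for the same vector function, base point, and step:

```python
def F(z):
    return (z[0] ** 2 + 2 * z[0] * z[1] + z[1] ** 2, z[0] * z[1])

def numerical_jacobian(F, z, h):
    # Formula (7.13): perturb one component of z at a time and collect
    # the divided-difference columns.
    Fz = F(z)
    cols = []
    for c in range(len(z)):
        zp = list(z)
        zp[c] += h
        Fp = F(zp)
        cols.append([(Fp[r] - Fz[r]) / h for r in range(len(Fz))])
    # Transpose the list of columns into a list of rows.
    return [[cols[c][r] for c in range(len(z))] for r in range(len(Fz))]

Japprox = numerical_jacobian(F, [-0.23, 0.6], 0.1)
print(Japprox)  # compare with the analytic [[0.74, 0.74], [0.6, -0.23]]
```

The second row comes out exact because the second component of F is linear in each argument, so the divided difference is the derivative with no truncation error.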
Fig. 7.7. Cable structure configuration. Dashed line: schematic of the connections. Filled dots indicate
supported joints.
\Delta_{x,2} = Y_3 - Y_1 , \quad \Delta_{y,2} = Y_4 - Y_2 , \quad L_2 = \sqrt{\Delta_{x,2}^2 + \Delta_{y,2}^2} ,
where Y_3, Y_4 are the coordinates after deformation of joint 2. Together Y_1, Y_2, Y_3, Y_4 constitute the
unknowns in the problem. Finally, for the third cable running into joint 1 we have the relative stretch
\frac{L_1 - L_{10}}{L_{10}}
(and analogously for the other cables), which relates the relative stretch of the cable to the axial
force. This is based on the assumption that the stretches are small compared to 1.0 and therefore
the stresses are small compared to the elastic modulus, and this assumption is verified in the present
problem. In general it is a good idea to verify that the assumptions that go into a model are
reasonable by backtracking from the results. For instance, in the current problem we would find the
locations of the joints, and from those we would compute the forces and stresses. If the stresses in
the cables were well below the yield stress (or negligibly small with respect to the elastic modulus),
our assumption would have been verified.
Working out in detail just the first term in the first equation gives us an idea of the complexity
of the resulting equations. We do get an appreciation for the tedium associated with computing
derivatives of such terms with respect to the unknowns Y_1, Y_2, Y_3, Y_4 to construct the Jacobian
matrix analytically.
(The computed tensile stresses in the five cables are approximately σ1 = 449, σ2 = 357, σ3 = 414, σ4 = 205, σ5 = 208.)
Fig. 7.8. Cable structure configuration. Dashed line: schematic of the connections. Filled dots indicate
supported joints. Thick solid line: actual configuration of the prestressed structure. Tensile stresses are
indicated.
As a final note, we shall point out that MATLAB comes with its own sophisticated function for
the numerical evaluation of the Jacobian matrix, numjac. The pieces of code that would need to
be changed with respect to our implementation are the computation of the residual (the function
needs to accept additional arguments)
function R=Force_residual(Ignore1,Y,varargin)
y(1,:) =Y(1:2);
y(2,:) =Y(3:4);
F =zeros(size(p,1),2);
for j=1:size(conn,1)
L=Length(j);
N(j)=E*A(j)*(L-Initial_L(j))/L;
F(conn(j,1),:) =F(conn(j,1),:) +N(j)*Delta(j)/L;
F(conn(j,2),:) =F(conn(j,2),:) -N(j)*Delta(j)/L;
end
R =[F(1,:);F(2,:)];
end
and the evaluation of the numerical Jacobian in the Newton loop (there are a few additional
arguments to pass)
Y=[y(1,:);y(2,:)];% Initialize deformed configuration
for iteration = 1: maximum_iteration % Newton loop
R=Force_residual(0,Y);% Compute residual
[dRdy] = numjac(@Force_residual,0,...
Y,R,Y/1e3,[],0);% Compute Jacobian
dY=-dRdy\R;% Solve for correction
if norm(dY,inf)<AbsTol % Check convergence
y(1,:) =Y(1:2);% Converged
y(2,:) =Y(3:4);
R=Force_residual(0,Y);% update the forces
sigma =N./A;% Stress
return;
end
Y=Y+dY;% Update configuration
end
error('Not converged')% bummer :(
We can easily check that the two implementations of the computation give identical results.
In summary, Newton's method, in its several variants and refinements, has a special place among
the mainstream methods for solving systems of nonlinear algebraic equations in engineering applications. One of the building blocks of this class of algorithms is a solver for repeatedly solving a
system of linear algebraic equations. This is the topic we will take up in the following sections.
7.3 LU factorization
Consider a system of linear algebraic equations
Ax = b
with a square matrix A. It is possible to factorize the matrix into the product of a lower triangular
matrix and an upper triangular matrix
A = LU
The triangular matrices are not determined uniquely. Here we will consider the variant where the
lower triangular matrix L has ones on the diagonal.
7.3.1 Forward and backward substitution
What is the value of the LU factorization? It derives from the efficiency with which a system with
a triangular matrix can be solved. For instance, consider the system
L y = b ,
where L is lower triangular (non-zeros are indicated by the black dots, the zeros are not shown). In
the first row of L there is only one nonzero, L_{11}. Therefore we can solve immediately for y_1. Next,
y_1 may be substituted into the second equation, from which we can solve for y_2, and so on. Since
we are solving for the unknowns in the order of their indexes, 1, 2, 3, ..., n, we call this the forward
substitution.
function c=fwsubs(L,b)
[n m] = size(L);
if n ~= m, error('Matrix must be square!'); end
c=zeros(n,1);
c(1)=b(1)/L(1,1);
for i=2:n
c(i)=(b(i)-L(i,1:i-1)*c(1:i-1))/L(i,i);
end
end
Similarly, consider the system

Ux = c ,

where U is upper triangular. In the last row of U there is only one nonzero, Unn. Therefore we can solve immediately for xn. Next, xn may be substituted into the last but one equation, from which we can solve for xn−1, and so on. Since we are solving for the unknowns in the reverse order of their indexes, n, n−1, n−2, ..., 2, 1, we call this the backward substitution.12
function x=bwsubs(U,c)
[n m] = size(U);
if n ~= m, error('Matrix must be square!'); end
x=zeros(n,1);
x(n)=c(n)/U(n,n);
for i=n-1:-1:1
x(i)=(c(i)-U(i,i+1:n)*x(i+1:n))/U(i,i);
end
end
11 See: aetna/LUFactorization/fwsubs.m
12 See: aetna/LUFactorization/bwsubs.m
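For readers who prefer to experiment outside MATLAB, the two routines translate directly; this is an illustrative Python sketch with list-of-lists matrices (the function names mirror the aetna files, but the code is not part of the toolbox):

```python
def fwsubs(L, b):
    """Forward substitution: solve L y = b for lower-triangular L,
    computing y1, y2, ..., yn in the order of their indexes."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y

def bwsubs(U, c):
    """Backward substitution: solve U x = c for upper-triangular U,
    computing xn, ..., x2, x1 in reverse order."""
    n = len(c)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (c[i] - s) / U[i][i]
    return x

# Two-step solve of A x = b with A = L U
L = [[1.0, 0.0], [0.5, 1.0]]
U = [[2.0, 1.0], [0.0, 3.0]]
b = [4.0, 8.0]
y = fwsubs(L, b)   # step one: L y = b
x = bwsubs(U, y)   # step two: U x = y, so x solves (L U) x = b
```

The two calls at the end are precisely the two-step solve discussed next.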
And so we come to the punchline: provided we can factorize a general matrix A into the triangular factors, we can solve the system Ax = b in two steps. Write

Ax = LUx = L(Ux) = Ly = b ,  where y = Ux .
Step one, solve for y from
Ly = b .
And step two, solve for x from
Ux = y .
Both solution steps can be done very efficiently since the matrices involved are triangular. This is handy in many situations where the right-hand side b will change several times while the matrix A stays the same. For instance, here is how we compute the inverse of a general square matrix A:
write the definition of the inverse

A A⁻¹ = 1

column-by-column as

A c_k(A⁻¹) = c_k(1) .

Here by c_k(A⁻¹) we mean the kth column of A⁻¹, and by c_k(1) we mean the kth column of the identity matrix. So if we successively set the right-hand side vector to b = c_k(1), k = 1, 2, ..., and solve Ax = b, we obtain the columns of the inverse matrix as c_k(A⁻¹) = x.
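A small Python sketch of this column-by-column idea (illustrative only; a 2x2 Cramer solve stands in for the repeated forward/backward substitutions with fixed factors):

```python
def solve2(A, b):
    """Solve the 2x2 system A x = b (a stand-in for an LU-based solve)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

A = [[4.0, 3.0], [6.0, 3.0]]
# Columns of the inverse: solve A x = e_k for each column e_k of the identity
inv_cols = [solve2(A, [1.0, 0.0]), solve2(A, [0.0, 1.0])]
```

With an actual LU factorization, the factors would be computed once and reused for every column of the identity.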
7.3.2 Factorization
The crucial question is: how do we compute the factors? LU factorization can be easily explained
by reference to the well-known Gaussian elimination. We shall start with an example:
A =
0.796, 0.7448, 0.1201, 0.0905
0.3649, …
0.0186, 0.093, 1.204, 0.0012
0.1734, 0.6695, 0.0653, 0.4113
First we will change the numbers below the diagonal in the first column to zeros. Gaussian elimination does this by replacing a row in which a zero should be introduced, let us say row j, by a combination of the row j and the so-called pivot row. Thus a zero will be introduced in the element (2,1) by subtracting (0.3649)/(0.796) times row 1 from row 2 to obtain
0.796, 0.7448, 0.1201, 0.0905
0, 1.558, 0.2884, 0.5034
0.0186, 0.093, 1.204, 0.0012
0.1734, 0.6695, 0.0653, 0.4113
The element (1,1) (the number 0.796) is called a pivot. Evidently, the success of the proceedings is going to rely on the pivot being different from zero (not only strictly different from zero, but sufficiently different: it shouldn't be too small compared to the other numbers in the same column). The manipulation described above can be executed by the following code fragment
i=1;
A(2,i:end) =A(2,i:end)-A(2,i)/A(i,i)*A(i,i:end)
Importantly, the same can also be written as a result of a matrix-matrix multiplication by the so-called elimination matrix

E(2,1) =
1, 0, 0, 0
-0.4584, 1, 0, 0
0, 0, 1, 0
0, 0, 0, 1
The elimination matrices are easily computed in MATLAB as13
function E =elim_matrix(A,i,j)
E =eye(size(A));
E(i,j) =-A(i,j)/A(j,j);
end
We can readily verify that the element (2,1) of A can be eliminated (zeroed out) by multiplying

E(2,1) A =
0.796, 0.7448, 0.1201, 0.0905
0, 1.558, 0.2884, 0.5034
0.0186, 0.093, 1.204, 0.0012
0.1734, 0.6695, 0.0653, 0.4113
Next we will change 0.0186 to a zero. Again, we will do this with an elimination matrix, and note well that we will be working with the above right-hand side matrix, not the original A. So we will construct

E(3,1) =
1, 0, 0, 0
0, 1, 0, 0
-0.02337, 0, 1, 0
0, 0, 0, 1

and compute

E(3,1) E(2,1) A =
0.796, 0.7448, 0.1201, 0.0905
0, 1.558, 0.2884, 0.5034
0, 0.1104, 1.201, 0.003315
0.1734, 0.6695, 0.0653, 0.4113
And so on: eliminating the non-zeros in the first column is accomplished by the sequence

E(4,1) E(3,1) E(2,1) A =
0.796, 0.7448, 0.1201, 0.0905
0, 1.558, 0.2884, 0.5034
0, 0.1104, 1.201, 0.003315
0, 0.5073, 0.03914, 0.431
Now we start working on the second column. Note again that we are working with the matrix E(4,1) E(3,1) E(2,1) A, not the elements of the original matrix. Thus 0.07087 = (0.1104/1.558), and the elimination matrix to put a zero in the element (3,2) reads

E(3,2) =
1, 0, 0, 0
0, 1, 0, 0
0, -0.07087, 1, 0
0, 0, 0, 1
Finally, we apply the elimination matrix to the element (4,3) and the entire Gaussian elimination sequence will read

E(4,3) E(4,2) E(3,2) E(4,1) E(3,1) E(2,1) A =
0.796, 0.7448, 0.1201, 0.0905
0, 1.558, 0.2884, 0.5034
0, 0, 1.18, 0.03899
0, 0, 0, 0.2627
We recall that we wish to construct the factorization A = LU, which means that the above matrix on the right is U and consequently

L⁻¹ = E(4,3) E(4,2) E(3,2) E(4,1) E(3,1) E(2,1) .

So now we have the matrix U and the inverse of L. Fortunately, L is obtained very easily. Not by inverting the above product, but rather by inverting each of the terms separately

L = E(2,1)⁻¹ E(3,1)⁻¹ E(4,1)⁻¹ E(3,2)⁻¹ E(4,2)⁻¹ E(4,3)⁻¹ .
For instance, to invert E(2,1) we realize that the effect of the matrix multiplication in the product E(2,1) A is to make the second row of the result the sum of a multiple of the first row and the second row. Therefore, to multiply with the inverse of E(2,1) is to undo this operation, that is, to subtract a multiple of the first row from the second row. The inverse of E(2,1) also has ones on the diagonal; the only change is that the off-diagonal element changes its sign (we want subtraction instead of addition)

E(2,1)⁻¹ = 2·1 − E(2,1) .
The same reasoning applies to the other elimination matrices. Now we only have to figure out the product of the inverses of the elimination matrices. Take for instance the product E(2,1)⁻¹ E(3,1)⁻¹:

E(2,1)⁻¹ E(3,1)⁻¹ =

1, 0, 0, 0          1, 0, 0, 0
0.4584, 1, 0, 0     0, 1, 0, 0
0, 0, 1, 0     ×    0.02337, 0, 1, 0
0, 0, 0, 1          0, 0, 0, 1

=
1, 0, 0, 0
0.4584, 1, 0, 0
0.02337, 0, 1, 0
0, 0, 0, 1
The pattern is clear: each matrix in the product will simply copy its only nonzero off-diagonal element into the same location in the resulting matrix. Thus we have

L =
1, 0, 0, 0
0.4584, 1, 0, 0
0.02337, 0.07087, 1, 0
0.2178, 0.3256, 0.1127, 1
The entire elimination process for our given matrix can be expressed as a series of matrix multiplications

E21=elim_matrix(A,2,1)
E31=elim_matrix(E21*A,3,1)
E41=elim_matrix(E31*E21*A,4,1)
E32=elim_matrix(E41*E31*E21*A,3,2)
E42=elim_matrix(E32*E41*E31*E21*A,4,2)
E43=elim_matrix(E42*E32*E41*E31*E21*A,4,3)
U =E43*E42*E32*E41*E31*E21*A
Inefficient, but correct. In reality the elimination is usually done in-place. The upper triangle and the diagonal of A store the matrix U, and the lower triangle (below the diagonal) of A stores the matrix L (we do not store the diagonal, since we know that the diagonal of L consists of ones). naivelu4 is one of the naive implementations of the LU factorization in aetna.14
14 See: aetna/LUFactorization/naivelu4.m
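The in-place storage scheme can be sketched in a few lines of Python (the function naive_lu is our illustrative analogue, not the aetna code; no pivoting):

```python
def naive_lu(A):
    """Naive in-place LU factorization without pivoting: after the call,
    the upper triangle of A holds U and the strictly lower triangle holds
    the multipliers of L (the unit diagonal of L is implied)."""
    n = len(A)
    for k in range(n - 1):
        for i in range(k + 1, n):
            A[i][k] = A[i][k] / A[k][k]          # multiplier, stored where the zero would go
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]     # update the trailing submatrix
    return A

A = [[2.0, 1.0], [1.0, 3.5]]
naive_lu(A)
# A now holds both factors: U = [[2,1],[0,3]] above, L multiplier 0.5 below
```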
A =
…
0.7891, 0.236, 0.007259, 0.4891
0.09073, 0.6998, 0.9637, 0.9205
Compute the factorization using this command
[L,U,P]=lu(A)
with the result
L =
1, 0, 0, 0
0.2287, 1, 0, 0
0.5897, 0.04327, 1, 0
0.115, 0.7778, 0.8597, 1

P =
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
0, 0, 0, 1
and

U =
…
0, 0.8648, 0.3228, 0.5833
0, 0, 0.828, 0.478
0, 0, 0, 0.0003876

The meaning of the output is that

LU = P A .
The matrix P permutes (switches) the rows of the matrix A. That is the actual pivoting. Note that the permutation matrix has an interesting inverse: it is its own transpose (the permutation matrix is orthogonal). Therefore we can write the above as

Pᵀ L U = A .

The matrix

Pᵀ L =
0.5897, 0.04327, 1, 0
0.2287, 1, 0, 0
1, 0, 0, 0
0.115, 0.7778, 0.8597, 1
is the so-called psychologically lower triangular matrix. Such a matrix would be returned if we called
lu with only two output arguments
[L,U]=lu(A)
How do we use the three output matrices? Symbolically, we can write the way in which we use the LU factorization (A = LU) as (we do not actually use inverses, we use forward and backward substitution!)

y = L⁻¹ b ,  x = U⁻¹ y

or

x = U⁻¹ (L⁻¹ b) .

When pivoting is used, we have rather A = Pᵀ L U, so that we are solving

y = (Pᵀ L)⁻¹ b = L⁻¹ (P b) ,  x = U⁻¹ y

or

x = U⁻¹ (L⁻¹ (P b)) .

In MATLAB syntax, we write

x=U\(L\(P*b));
In other words, the LU factorization is used as before, except that the rows of the right-hand side
vector are reordered (permuted) by P .
A more efficient approach to working with the LU factorization when pivoting is applied is to compute the so-called permutation vector

[L,U,p]=lu(A,'vector')
The permutation vector is p = [3, 2, 1, 4]. We can see that it correlates with the position of the 1s
in the rows of the permutation matrix. The permutation vector is used for multiple right-hand sides
as
x=U\(L\b(p));
which is a shorthand for
y=L\b(p); x=U\y;
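The roles of the factors and of the permutation vector can be sketched in Python as follows (illustrative only: plu and solve_plu are our own names, imitating what [L,U,p]=lu(A,'vector') returns, with both triangular factors packed into one array):

```python
def plu(A):
    """LU factorization with partial pivoting, in place.
    Returns (LU, p): LU stores U above and the L multipliers below the
    diagonal, and p is the permutation vector (row i of LU corresponds
    to original row p[i])."""
    n = len(A)
    p = list(range(n))
    for k in range(n - 1):
        # choose the largest pivot in column k (partial pivoting)
        m = max(range(k, n), key=lambda i: abs(A[i][k]))
        if m != k:
            A[k], A[m] = A[m], A[k]
            p[k], p[m] = p[m], p[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A, p

def solve_plu(LU, p, b):
    """Permute b with p, then forward and backward substitution."""
    n = len(b)
    pb = [b[i] for i in p]                      # b(p) in MATLAB notation
    y = [0.0] * n
    for i in range(n):
        y[i] = pb[i] - sum(LU[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x

A = [[0.0, 2.0], [1.0, 1.0]]   # needs pivoting: the (1,1) entry is zero
LU, p = plu(A)
x = solve_plu(LU, p, [2.0, 2.0])
```

Only the right-hand side is permuted at solve time; the factors themselves are reused unchanged for every new b.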
and our sums, we conclude that the required factorization time is

t_LU = C ( n³/3 − n²/2 ) .
In Chapter 6 we have seen the big-O notation used as a means of describing how a function value decreases as the argument decreases towards zero. Here we introduce the opposite viewpoint: the notation can also be used to express how quickly a function value grows. As we discussed, the big-O notation typically expresses how complicated functions behave in terms of a simple monomial (say x²). When measuring how quickly a function value decreases, the low powers dominate; contrariwise, when we measure how quickly a function value grows, the high powers dominate.
Illustration 5
Consider the simple function f(x) = x² + 30000x. Use the big-O notation to describe its behavior as x → 0 and as x → ∞.
As x → 0 the function value decrease is dominated by the linear term (30000x), as it drops in magnitude much more slowly than the square. On the contrary, the square term grows much faster than the linear term as x → ∞. Therefore we conclude that f(x) = O(x) as x → 0 and that f(x) = O(x²) as x → ∞.
The big-O notation is often used in computer science to express how quickly the cost of an algorithm grows as the number of quantities to be processed grows. For instance, nice algorithms are those that grow linearly or logarithmically: computing the mean of a vector of length n is an operation of O(n), and the FFT is an operation of O(n log n). Not-so-nice algorithms may be very expensive for large n; for instance, a naive discrete Fourier transform (the slow version of the FFT) is O(n²). Much more expensive than the FFT!
The LU factorization is one of the more computationally intensive algorithms. Based on the expression that includes both a cubic term and a quadratic term we conclude that for sufficiently large n we should write t_LU = O(n³). Rather costly!
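The cubic growth can be checked by literally counting the innermost multiply-add updates of the elimination; a small Python sketch (this particular count is exactly (n−1)n(2n−1)/6 = n³/3 − n²/2 + n/6):

```python
def lu_op_count(n):
    """Count the innermost multiply-add updates in the naive LU elimination.
    The exact count is (n-1)n(2n-1)/6 = n^3/3 - n^2/2 + n/6, i.e. O(n^3)."""
    count = 0
    for k in range(n - 1):          # columns being eliminated
        for i in range(k + 1, n):   # rows below the pivot
            for j in range(k + 1, n):
                count += 1          # one update of the trailing submatrix
    return count
```

Doubling n multiplies the count by roughly eight, as a cubic cost should.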
Illustration 6
Figure 7.9 shows the results of a numerical experiment. The MATLAB LU factorization is run for a sequence of variously sized matrices, and the factorization time is recorded.

t = [];
for n = 10:10:600
A=rand(n);
tic;
for Trial = 1:1000
[L,U,p]=lu(A,'vector');
end
t(end+1) =toc
end
The curve of required CPU time per factorization illustrates our estimate: first the time grows more slowly than predicted, but asymptotically it appears to approach a straight line with slope 3, which corresponds to a cubic dependence on the number of equations.
Fig. 7.9. Factorization time [s] versus matrix size n, on a log-log scale, with a reference triangle of slope 1:3.
In a similar way, we can show that the time for forward or backward substitution is going to grow as O(n²). This is good news, since for many right-hand sides the time is only going to grow as quickly as the factorization itself. For instance, to compute a matrix inverse we need to solve n times an n × n system of linear algebraic equations. If we use LU factorization with forward and backward substitution, it will take

O(n³) + n O(n²) = O(n³)

time, where the first term is the factorization and the second term is the n forward/backward substitutions. If we use just plain Gaussian elimination for each solve, it will take

n O(n³) = O(n⁴) .

A much more quickly growing cost!
Illustration 7
The cost estimate t_LU = C O(n³) can be put to good use in guessing the time that it may take to factorize larger matrices. From Figure 7.9 we can read off that on this particular computer a 400 × 400 matrix takes about one hundredth of a second:

t_LU,400 = C O(400³) = 0.01 s .

Therefore we can express the time constant as

C = 0.01 s / O(400³) ,

and the prediction for a 3000 × 3000 matrix is

t_LU,3000 = ( 0.01 s / O(400³) ) O(3000³) = 4.2 s .
Running the calculation we find 2.35 s. This is a substantial difference with respect to the prediction. First, the measurement of 0.01 s is likely to be substantially in error, as it is difficult to measure the execution times of computations that conclude very quickly; there are just too many confounding factors in the software (think of all the operating system overhead) and hardware. Second, our estimate was based on the cubic term, but we know there is also a quadratic term, and that was not taken into account. The matrix may not be large enough for the asymptotic big-O estimate to work based on the largest term only.
Furthermore, let us say we want to use the second measurement, t_LU,3000 = 2.35 s, to predict the factorization time for a 30,000 × 30,000 matrix. If we had a computer with enough memory to accommodate a matrix of this size, our prediction would be that the factorization time would go up by a factor of 1000 = 10³ with respect to the time measured for the 3000 × 3000 matrix, so about 40 minutes. We would find the prediction rather more accurate this time. (Try it with a slightly more modest increase: for instance, a factor of 2 increase in the size of the matrix would increase the factorization time by a factor of 8.)
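The arithmetic of this Illustration can be packaged in a couple of Python lines (a sketch of the extrapolation only, using the measurements quoted above):

```python
def predict_time(t_measured, n_measured, n_target):
    """Extrapolate the factorization time with the asymptotic model
    t = C n^3, calibrating the constant C from one measurement."""
    C = t_measured / float(n_measured) ** 3
    return C * float(n_target) ** 3

t3000 = predict_time(0.01, 400, 3000)     # prediction from the 400x400 timing
t30000 = predict_time(2.35, 3000, 30000)  # prediction from the 3000x3000 timing
```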
For a symmetric matrix A, the factorization

A = LU = L D Û ,

where D is the diagonal of U and Û = D⁻¹U has ones on the diagonal, implies that Û = Lᵀ. Therefore for symmetric A we can make one more step from the LU factorization to the LDLᵀ factorization

A = L D Lᵀ .

This saves both time (we don't have to compute U) and space (we don't have to store U).
Figure 7.10 displays a finite element model with over 2000 unknowns. A small model, it can be handled comfortably on a reasonably equipped laptop, yet it will serve us well to illustrate some of the aspects of the so-called large-scale computing algorithms of which we need to be aware.
The figure shows a tuning fork. This one sounds approximately the note of A (440 Hz, international concert pitch). To find this vibration frequency, we need to solve an eigenvalue problem (in our terminology, the free vibration problem).
The impedance matrix A = K − ω²M, which couples together the stiffness and the mass matrix, is of dimension roughly 2000 × 2000. However, not all 4 million numbers are nonzero. Figure 7.11 illustrates this by displaying the non-zeros as black dots (the zeros are not shown). The code to get an image like this for the matrix A is as simple as

spy(A)

Where do the unknowns come from? The vibration model describes the motion of each node (that would be the corners and the midsides of the edges of the tetrahedral shapes which constitute the mesh of the tuning fork). At each node we have three displacements. Through the stiffness and mass of each of the tetrahedra, the nodes which are connected by the tetrahedra are dynamically coupled (in the sense that the motion of one node creates forces on another node). All these coupling interactions are recorded in the impedance matrix A. If an unknown displacement j at node K is coupled to an unknown displacement k at node M, there will be a nonzero element Ajk in the impedance matrix. If we do not care how we number the individual unknowns, the impedance matrix may look for instance as shown in Figure 7.11: there are some interesting patterns in the matrix, but otherwise the connections seem to be pretty random.
An important aspect of working with large matrices is that as a rule only the non-zeros in the matrices will be stored. The matrices will be stored as sparse. So far we have been working with dense matrices: all the numbers were stored in a two-dimensional table. A sparse matrix has a more complicated storage scheme, since only the non-zeros are kept, and all the zeros are implied (not stored, but when we ask for an element of the matrix that is not in storage, we get back a zero). This may mean considerable savings for matrices that hold only a very small number of non-zeros.
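The idea of sparse storage — keep only the non-zeros, imply the zeros — can be sketched with a Python dictionary keyed by the index pair (an illustrative dictionary-of-keys scheme; MATLAB's internal sparse format is different):

```python
class SparseMatrix:
    """Minimal dictionary-of-keys sparse storage: only non-zeros are kept;
    reading an entry that is not in storage gives back a zero."""
    def __init__(self, nrows, ncols):
        self.shape = (nrows, ncols)
        self.data = {}
    def __setitem__(self, ij, value):
        if value != 0.0:
            self.data[ij] = value
        else:
            self.data.pop(ij, None)   # storing a zero removes the entry
    def __getitem__(self, ij):
        return self.data.get(ij, 0.0)
    def nnz(self):
        return len(self.data)

A = SparseMatrix(2000, 2000)
A[0, 0] = 3.0
A[10, 500] = -1.5
# 4 million entries implied, only two actually stored
```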
Fig. 7.11. The structure of the tuning fork impedance matrix. Left to right: A = LU , L, U . Original
numbering of the unknowns. The black dots represent non-zeros, zeros are not shown.
The reason we might want to worry about how the unknowns are numbered lies in the way the LU factorization works. Remember, we are removing non-zeros below the diagonal by combining rows. That means that if we are eliminating element (k,m), we are adding a multiple of the row m to the row k. If the row m happens to have non-zeros to the right of the column m, all those non-zeros will now appear in row k. In this way, some of the zeros in a certain envelope around the diagonal will become non-zeros during the elimination. This is clearly evident in Figure 7.11, where we can see almost entirely black (non-zero) matrices L and U. Why is this a problem? Because there are a lot more non-zeros in the LU factors than in the original matrix A. The more numbers we have to operate on, the more it costs to factorize the matrix, and the longer it takes. Also, all the non-zeros need to be stored, and updating a sparse matrix with additional non-zeros is very expensive.
The appearance of additional non-zeros in the matrix during the elimination is called fill-in. Fortunately, there are ways in which the fill-in may be minimized by carefully numbering coupled unknowns. Figure 7.12 and Figure 7.13 visualize the impedance matrix and its factors for two different renumbering schemes: the reverse Cuthill-McKee and the symmetric approximate minimum degree permutation. The matrix A holds the same number of non-zeros in all three figures (original numbering, and the two renumbered cases). However, the factors in the renumbered cases hold about 10 times fewer non-zeros than the original factors. This may be significant. Recall that for a dense matrix the cost scales as O(N³). For a sparse matrix with a nice numbering which will limit the fill-in to, say, 100 elements per row, the cost will scale as O(100 N²). For N = 10⁶ this will be the difference between having to wait for the factors for one minute or for a full week.
Fig. 7.12. The structure of the tuning fork impedance matrix. Left to right: A = LU , L, U . Renumbering
of the unknowns with symrcm. The black dots represent non-zeros, zeros are not shown.
As a last note on the subject, we may take into account other techniques for solving systems of linear algebraic equations than factorization. There is a large class of iterative algorithms, a line-up starting with the Jacobi and Gauss-Seidel solvers and currently ending with the so-called multigrid solvers. These algorithms are much less sensitive to the numbering of the unknowns. In this book we do not discuss these techniques, only a couple of minimization-based solvers, including the powerful conjugate gradients, but refer for instance to Trefethen and Bau for an interesting story on current iterative solvers. They are becoming ubiquitous in commercial software, hence we had better know something about them.
Fig. 7.13. The structure of the tuning fork impedance matrix. Left to right: A = LU , L, U . Renumbering
of the unknowns with symamd. The black dots represent non-zeros, zeros are not shown.
Since the determinant of a product is the product of the determinants, and the determinant of a triangular matrix is the product of its diagonal elements, we have

det L = ∏_{i=1}^{n} L_ii = 1 ,  det U = ∏_{i=1}^{n} U_ii ,

so that we have det A = ∏_{i=1}^{n} U_ii.
If on the other hand L has been modified by pivoting permutations, its determinant can be ±1, according to how many permutations occurred. (It is probably best to use the MATLAB built-in det function. It uses the LU factorization, and correctly accounts for pivoting.)
That's how determinants are computed, not by Cramer's rule (not if we wish to live to see the result).
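Here is a hedged Python sketch of the determinant computed this way (partial pivoting with a sign flip per row swap; illustrative only — MATLAB's det does the equivalent internally):

```python
def det_via_lu(A):
    """Determinant from LU with partial pivoting: det A is plus or minus
    the product of the pivots, the sign flipping once per row swap."""
    n = len(A)
    A = [row[:] for row in A]   # work on a copy
    sign = 1.0
    for k in range(n - 1):
        m = max(range(k, n), key=lambda i: abs(A[i][k]))
        if m != k:
            A[k], A[m] = A[m], A[k]
            sign = -sign        # one row swap flips the sign
        if A[k][k] == 0.0:
            return 0.0          # singular matrix
        for i in range(k + 1, n):
            r = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= r * A[k][j]
    d = sign
    for k in range(n):
        d *= A[k][k]            # product of the pivots
    return d
```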
We might consider using the LU factorization for determining the number of linearly independent rows (columns) of a matrix, the so-called rank. If the LU factorization succeeds, the matrix A had full rank. Otherwise, it is possible that the factorization failed just because full pivoting was not applied: the factorization might succeed if all possibilities for pivoting are exploited. MATLAB does not use factorization for this reason (and other reasons that have to do with the stability of the computation); it rather takes advantage of the so-called singular value decomposition. If the matrix A does not have full rank (the number of linearly independent columns, or linearly independent rows, is less than the dimension of the matrix) it is singular, and cannot be LU factorized.
On the diagonal of the matrix U we have the pivots. The signs of the pivots determine the so-called positive or negative definiteness (or indefiniteness) of a matrix. More about this in the chapter on optimization.
vector b and the coefficient matrix A itself are not represented during the solution process faithfully. Therefore, in this section we will consider how the properties of A and b affect the error of the solution x.
First, we shall inspect the sensitivity of the solution of the system of coupled linear algebraic equations Ax = b to the magnitude of the error of the right-hand side, and to the properties of the matrix A. Equivalently, we could also state this in terms of errors: how large can they get?
7.4.1 Perturbation of b
Imagine the right-hand side vector changes just a little bit to b + δb. The solution will then also change

A (x + δx) = b + δb ,

which then gives

A δx = δb .
Now we would like to measure the relative change in the solution ‖δx‖/‖x‖ due to the relative change in the right-hand side ‖δb‖/‖b‖. In terms of norms we can write (symbolically, we never actually invert the matrix)

δx = A⁻¹ δb  (7.14)

so that using the so-called CBS inequality (CBS: Cauchy, Bunyakovsky, Schwarz) we estimate

‖δx‖ ≤ ‖A⁻¹‖ ‖δb‖ .  (7.15)

It does not matter very much which norm is meant here, they are all equivalent. Also we can write for the norms of the solution vector on the left-hand side and the vector on the right-hand side

Ax = b  ⟹  ‖A‖ ‖x‖ ≥ ‖b‖ .  (7.16)

Dividing (7.15) by ‖b‖ we obtain

‖δx‖/‖b‖ ≤ ‖A⁻¹‖ ‖δb‖/‖b‖ .

On the right-hand side we now have the relative error ‖δb‖/‖b‖. Now we can introduce (7.16) to replace ‖b‖ on the left-hand side

‖δx‖/(‖A‖ ‖x‖) ≤ ‖A⁻¹‖ ‖δb‖/‖b‖ ,

which will give us the relative error of the solution ‖δx‖/‖x‖. Finally we rearrange this result into

‖δx‖/‖x‖ ≤ ‖A‖ ‖A⁻¹‖ ‖δb‖/‖b‖ .  (7.17)

The quantity ‖A‖ ‖A⁻¹‖ is the so-called condition number of the matrix A. This inequality relates the relative error of the solution to the relative error of the right-hand side vector. The coefficient of proportionality is found to be determined by the properties of the coefficient matrix.
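To see the condition number at work, here is a small Python sketch using the 1-norm (the maximum absolute column sum), one of the equivalent norms mentioned above; the nearly singular 2x2 example matrix is our own:

```python
def norm1(A):
    """Matrix 1-norm: the maximum absolute column sum."""
    n = len(A)
    return max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))

def inv2(A):
    """Explicit inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

A = [[1.0, 1.0], [1.0, 1.0001]]       # nearly singular, hence ill conditioned
kappa = norm1(A) * norm1(inv2(A))     # condition number, roughly 4e4
```

With a condition number of this size, a relative perturbation of the right-hand side can be amplified by a factor of tens of thousands in the worst case.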
Illustration 8
When the condition number is large, we see that there is a possibility of the change in the right-hand side being very much magnified in the change of the solution. An example of the effect is given here.15 Consider the least-squares computation of a quadratic function passing through three points: the point locations are x=[0;1.11;1.13], and the values of the function at those three points are y=[1;0.5;0.513]. The least-squares computation is set up as

A = [x.^2,x.^1,x.^0];
p=(A'*A)\(A'*y)

to solve for the parameters p of the quadratic fit from the so-called normal equations (see details in Section 9.13). The solution is

p =
0.973849956151390
-1.531423901778500
1.000000000000001
Now change the values of the quadratic function by dy=[0;0.00746;-0.006658];, which is a relative change of norm(dy)/norm(y)=0.00813. The solution changes by

dp=(A'*A)\(A'*dy)
dp =
-0.630637805947415
0.706728685322350
-0.000000000000000

which can be appreciated as a pretty substantial change. We see that

norm(dy)/norm(y)
ans =
0.008128568566353
norm(dp)/norm(p)
ans =
0.457113748779369
This means that while the data changed by less than 1%, the solution for the parameters changed by almost 50%. We call matrices that produce this kind of large sensitivity ill conditioned. Figure 7.14, produced by

x =linspace(0,1.13,100)';
plot(x,[x.^2,x.^1,x.^0]*p,'r-','linewidth',2); hold on
plot(x,[x.^2,x.^1,x.^0]*(p+dp),'k--','linewidth',2)

shows the effect of the ill conditioning: it shows two quadratic curves fitted to the original data y (red solid curve), and to the perturbed data y+dy (black dashed curve). The curves are very different despite the fact that the points through which they pass have been moved only very little.
Fig. 7.14. Quadratic curves fitted to the original data y (red solid curve), and the perturbed data y+dy (black dashed curve).
cond(A'*A)
ans =
7.145344297615475e+004

The magnitude of the condition number can be understood in relative terms by considering the condition numbers of identity matrices (these are probably the best matrices to work with!), which are equal to one. More generally, orthogonal matrices also have condition numbers that are equal to one. That is as low as the condition number goes; all other matrices have larger condition numbers. The bigger the condition number, the bigger the ill-conditioning problem. In particular, we can see that the condition number depends on the existence of the inverse of A. The closer the matrix A is to being non-invertible, the larger the condition number is going to get. For a singular matrix the condition number is defined to be infinite. In the present case, the condition number is seen to be fairly large. Hence we get the substantial amplification of the change of the right-hand side in the solution vector.
Illustration 9
To continue the previous Illustration, we change the horizontal position of one of the points, x=[0;0.61;1.13].16 The perturbed quadratic curve is found to differ only slightly from the original. The condition number confirms that the matrix is considerably less ill-conditioned

cond(A'*A)
ans =
193.7789
7.4.3 Perturbation of A
We can also consider the effect of changes in the matrix itself, for instance when the elements of the matrix are calculated with some error. So when the matrix changes (not the right-hand side, that remains the same), we write for the changed solution
(A + δA)(x + δx) = b .

Canceling Ax = b gives

A δx + δA (x + δx) = 0

or

A δx = −δA (x + δx) .

Considering the problem in terms of norms as before,

δx = −A⁻¹ δA (x + δx)

and

‖δx‖ ≤ ‖A⁻¹‖ ‖δA‖ ‖x + δx‖ .

To bring in relative changes again, we divide by ‖x + δx‖ on both sides and divide and multiply with ‖A‖ on the right-hand side

‖δx‖/‖x + δx‖ ≤ ‖A‖ ‖A⁻¹‖ ‖δA‖/‖A‖ .

We see that the relative change in the solution is expressed as before. It is bounded by the relative change in the left-hand side matrix, and the multiplier is again the condition number.
The condition number appears to be an important quantity. In order to understand the condition
number we have to understand a little bit where the norms of the matrix and its inverse come from.
7.4.4 Induced matrix norm
An easy way in which we can talk of matrix norms while introducing nothing more than norms of vectors stems from the so-called induced matrix norm. We think of the matrix A (here we will discuss only square matrices, but this would also apply to rectangular matrices) as producing a map from the vector space Rⁿ to the same vector space by taking input x and producing output y

y = Ax .

We can measure how big a matrix is (that is, its norm) by measuring how much all possible input vectors x get stretched by A. We take the largest possible stretch as the induced norm of A

‖A‖ = max_{x ≠ 0} ‖Ax‖/‖x‖ .

Note that on the left we have a matrix norm, and on the right we have a vector norm. That is why we say that the matrix norm on the left is induced by the vector norm on the right. An alternative form of the above equation, and a very useful one, can be expressed as

‖A‖ = max_{‖x‖ = 1} ‖Ax‖ .  (7.18)
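The definition can be probed numerically: scan unit vectors around the circle and record the largest stretch. A Python sketch for 2x2 matrices and the 2-norm (a brute-force estimate for illustration, not how norms are actually computed):

```python
import math

def induced_norm2_estimate(A, samples=3600):
    """Estimate the induced 2-norm of a 2x2 matrix by scanning unit
    vectors x around the circle and recording the largest ||Ax||_2."""
    best = 0.0
    for k in range(samples):
        t = 2.0 * math.pi * k / samples
        x = (math.cos(t), math.sin(t))
        y0 = A[0][0] * x[0] + A[0][1] * x[1]
        y1 = A[1][0] * x[0] + A[1][1] * x[1]
        best = max(best, math.hypot(y0, y1))
    return best

# For a diagonal matrix the induced 2-norm is the largest |diagonal entry|
est = induced_norm2_estimate([[3.0, 0.0], [0.0, 1.0]])
```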
Which vector norms can we use? A common choice is the p-norm

‖x‖_p = ( Σ_{j=1}^{n} |x_j|^p )^{1/p}

(we may recall the similarity with the root-mean-square formula for p = 2). Taking p = 1 we get ‖x‖₁ = Σ_{j=1}^{n} |x_j| (the so-called 1-norm); taking p = 2 we obtain the usual Euclidean norm (also called the 2-norm)

‖x‖₂ = ( Σ_{j=1}^{n} |x_j|² )^{1/2} .

Also used is the so-called infinity norm, which has to be worked out by a limiting process: ‖x‖_∞ = max_{j=1:n} |x_j|.
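The three norms are one-liners in any language; an illustrative Python sketch:

```python
def pnorm(x, p):
    """The vector p-norm: (sum |x_j|^p)^(1/p)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def infnorm(x):
    """The infinity norm, the limit of the p-norm as p grows: max |x_j|."""
    return max(abs(v) for v in x)

x = [3.0, -4.0]
# pnorm(x, 1) is 7.0, pnorm(x, 2) is 5.0, infnorm(x) is 4.0
```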
Illustration 10
The three norms introduced above are illustrated in Figure 7.15. The squares and the circle represent vectors of unit norm, as measured by the various norm definitions. The arrows are vectors of unit norm, using the three norm definitions given above.17
Fig. 7.15. Illustration of the vector norms (1, 2, ∞): the unit "circles" ‖x‖₁ = 1, ‖x‖₂ = 1, and ‖x‖_∞ = 1 in the (x1, x2) plane.
The norm of the inverse can be rewritten in terms of A by substituting x = Ay:

‖A⁻¹‖ = max_{x ≠ 0} ‖A⁻¹x‖/‖x‖ = max_{Ay ≠ 0} ‖y‖/‖Ay‖
17 See: aetna/MatrixNorms/vecnordemo.m
so that

‖A⁻¹‖ = ( min_{y ≠ 0} ‖Ay‖/‖y‖ )⁻¹ = ( min_{‖y‖ = 1} ‖Ay‖ )⁻¹ .

With these formulas for the norms, we can write for the condition number

‖A‖ ‖A⁻¹‖ = max_{‖x‖ = 1} ‖Ax‖ / min_{‖y‖ = 1} ‖Ay‖ .  (7.19)
Now this is relatively easy to visualize. Figures 7.16 and 7.17 present a gallery of matrices. The images visualize the results of the multiplication of unit-length vectors pointing in various directions from the origin. The induced 2-norm is used, and consequently the heads of the unit-length vectors form a circle of unit radius. We can see how the formula for the condition number (7.19) correlates with the largest and smallest length of the vector that results from the multiplication of the matrix and the unit vector. For instance, for the matrix A we may estimate the lengths of the longest and shortest Ax vectors as 3 and 2, and therefore we guess the condition number to be 3/2. This may be compared with the computed condition number ‖A‖ ‖A⁻¹‖ ≈ 1.414. Alternatively, we could take the length of the longest vector Ax as 3 and the length of the longest vector A⁻¹x as 1/2, and therefore we guess the condition number to be 3 · 1/2.
Illustration 11
Use the function matnordemo18 to create for each of the three norms a diagram similar to those of Figure 7.16 for the matrix [2 -0.2; -1.5 3], and then try to read off the norm of this matrix from the figure. Compare with the matrix norm computed as

norm([2 -0.2; -1.5 3],1)
norm([2 -0.2; -1.5 3],2)
norm([2 -0.2; -1.5 3],inf)

18 See: aetna/MatrixNorms/matnordemo.m
Fig. 7.16. Matrix and matrix inverse norm illustration. Matrix condition numbers: ‖A‖ ‖A⁻¹‖ = 1.414; ‖B‖ ‖B⁻¹‖ = 3.167; ‖C‖ ‖C⁻¹‖ = 1.0.
strains, and the directions of the principal stresses and strains, are the eigenvalues and eigenvectors of these matrices.
In fact, for all matrices, symmetric and unsymmetric, the matrix norm has something to do with eigenvalues and eigenvectors. Consider the definition of the induced matrix norm

‖A‖ = max_{x ≠ 0} ‖Ax‖/‖x‖ ,

specialized to the 2-norm,

‖A‖₂ = max_{x ≠ 0} ‖Ax‖₂/‖x‖₂ .
Fig. 7.17. Matrix and matrix inverse norm illustration. Matrix condition numbers: ‖D‖ ‖D⁻¹‖ = 10.0; ‖E‖ ‖E⁻¹‖ = 2.618; ‖F‖ ‖F⁻¹‖ = 22.15. (Here E=[1,1; 0,1].)
For the moment we shall consider that the vector norms are Euclidean norms (2-norms). From the definition of the vector norms, we have

‖Ax‖₂² = (Ax)ᵀ(Ax)

so that we can write

‖A‖₂² = max_{x ≠ 0} xᵀAᵀAx / xᵀx .

The expression on the right is the so-called Rayleigh quotient of the matrix AᵀA (not of A itself!). It is the result of the pre-multiplication of the eigenvalue problem

AᵀAx = λx  (7.20)

by xᵀ and division by xᵀx, which yields

λ = xᵀAᵀAx / xᵀx .

Note that

xᵀAᵀAx ≥ 0 ,  xᵀx > 0 ,

where xᵀx = 0 is not allowed by the definition of the norm. Clearly, the Rayleigh quotient attains its maximum for the largest eigenvalue in absolute value max |λ|, and its minimum for the smallest eigenvalue in absolute value min |λ|. From this we can deduce

‖A‖ = √(max |λ|) .

Similarly, we obtain

‖A⁻¹‖ = 1/√(min |λ|) .

Hence, the condition number of A is found to be

‖A‖ ‖A⁻¹‖ = √( max |λ| / min |λ| ) .

If the matrix A is symmetric, we write an eigenvalue problem for it as

Av = λ′v .  (7.21)

In comparison with (7.20) we see that

λ = (λ′)² .  (7.22)

Therefore, the norm of a symmetric matrix will be

‖A‖ = max |λ′| ,

where λ′ solves the eigenvalue problem (7.21). Analogously, the norm of the inverse of a symmetric matrix will be

‖A⁻¹‖ = 1/ min |λ′| ,

and the condition number of a symmetric matrix is

‖A‖ ‖A⁻¹‖ = max |λ′| / min |λ′| .  (7.23)
Illustration 12

Apply formula (7.23) to a singular matrix.

Any singular matrix has at least one zero eigenvalue. No matter how large an eigenvalue of a
singular matrix can get, we know that its smallest eigenvalue (in absolute value) is equal to zero.
Consequently, the condition number of the singular matrix is infinite.
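These relationships are easy to check numerically. A minimal sketch (the symmetric matrix B below is an arbitrary example of mine, not from the toolbox; cond, eig, and norm are built-in MATLAB functions):

```matlab
B = [2, 1; 1, 2];  % symmetric, with eigenvalues 1 and 3
lam = eig(B);
% norm of a symmetric matrix: largest eigenvalue in absolute value
norm(B) - max(abs(lam))  % should be numerically zero
% condition number by formula (7.23)
cond(B) - max(abs(lam))/min(abs(lam))  % should be numerically zero
% a singular matrix has a zero eigenvalue: infinite condition number
cond([1, 1; 1, 1])
```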
7.5 QR factorization

Consider a system of linear algebraic equations

Ax = b

with a square matrix A. It is possible to factorize the matrix into the product of an orthogonal
matrix Q and an upper triangular matrix R

A = QR .

How does this work? If we write this relationship between the matrices in terms of their columns
things become clearer.

c_k(A) = Q c_k(R) .

Now remember, R is an upper triangular matrix. For instance like this (* stands for a possibly
nonzero coefficient)

R = [ * * * *
      0 * * *
      0 0 * *
      0 0 0 * ] .
Then the first column of A is c_1(A) = c_1(Q) R_11 (R_11 is the only possibly nonzero coefficient in the first column
of R). The fourth column of A is a linear combination of the first four columns of Q (the
coefficients are the entries of the fourth column of R)

c_4(A) = c_1(Q) R_14 + c_2(Q) R_24 + c_3(Q) R_34 + c_4(Q) R_44

and so on. The principle is now clear: each of the columns of A is constructed of columns of Q which
are orthogonal, and the columns of Q can be obtained by straightening out the columns of A as
long as the columns of A are linearly independent (refer to Figure 7.18): q_1 is a unit vector in the
direction of a_1, and q_2 is obtained from the part of a_2 that is orthogonal to q_1.
Fig. 7.18. Two arbitrary linearly independent vectors a_1 and a_2, and two orthonormal vectors q_1
and q_2 that span the same plane
The great advantage that can be derived from this factorization stems from the fact that the
inverse of an orthogonal matrix is simply its transpose

Q^-1 = Q^T .
If we substitute this factorization into Ax = b we obtain
Ax = QRx = b
and this allows us to rewrite the system as
Rx = Q^T b .

Now since the matrix R is upper triangular, to solve for the unknown x is very efficient: starting
at the bottom we proceed by backsubstitution. The solution is not for free, of course. We had to
construct the factorization in the first place.
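As a quick sketch of the complete solution process (the matrix and right-hand side here are arbitrary examples, not from the toolbox):

```matlab
A = [4, -2, 1; -2, 4, -2; 1, -2, 4]; b = [1; 2; 3];
[Q, R] = qr(A);   % factorize: A = Q*R
x = R \ (Q'*b);   % Q is orthogonal, so inv(Q) = Q'; R\ is backsubstitution
norm(A*x - b)     % residual: should be on the order of machine precision
```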
An additional benefit of this particular factorization is in the ability to factorize rectangular
matrices, not just square. Furthermore, due to the orthogonality of Q operations with it are as
nice numerically as possible (remember the perfect condition number of one?). Therefore the QR
factorization is used when numerical stability is at a premium. Examples may be found in the least-squares fitting subject. Also, the QR factorization leads to a valuable algorithm for the computation
of eigenvalues and eigenvectors for general matrices.
7.5.1 Householder reflections

The question now is how to compute the QR factorization. A particularly popular and effective
algorithm is based on the so-called Householder reflections.

The Householder transformation (reflection) is designed to modify a column matrix so that the
result of the transformation has only one nonzero element, the first one, but the length of the result
(that is its norm) is preserved. Matrix transformations that preserve lengths are either rotations or
reflections (the Householder transformation is the latter):
Ha = a~ ,   where ||a~|| = ||a|| .

The transformation produces the vector

a~ = [ a~_1, 0, ..., 0 ]^T

by reflection in a plane that is defined by the normal generated as the difference n = a~ - a and
passes through the origin O (see Figure 7.19). This follows from the two vectors a~ and a being of
the same length.
Fig. 7.19. Householder transformation: the geometrical relationships. The reflection plane is shown by the
dashed line. Consider that in the two-dimensional figure there are two possible reflection planes.
The Householder transformation matrix may be expressed in terms of the normal n as

H = 1 + (n n^T) / (n^T a) ,     (7.24)

which is an orthogonal matrix,

H^T H = 1 .
Interestingly, this matrix is also symmetric. This is really how it should be: H produces a mirror
image of a, a~ = Ha. The mirror image of a~, the inverse operation a = H^-1 a~, must give us back
a, but the inverse operation is again a reflection, the same reflection that gave us a~ from a.
To compute the Householder matrix we could use the function Householder_matrix. The sign
of the non-zero element of a~ is computed with particular attention to numerical stability: when we
compute n = a~ - a, the vector a~ has only one nonzero element. To avoid numerical error when
subtracting two similar numbers a~_1 - a_1 we choose sign a~_1 = -sign a_1.
function H = Householder_matrix(a)
if (a(1)>0), at1 =-norm(a);% choose the sign wisely
else, at1 =+norm(a); end
n=-a; n(1)=n(1)+at1;% this is the subtraction n = a~ - a
H = eye(length(a))+(n*n')/(n'*a);% this is formula (7.24)
end
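A brief check of the function (the vector a here is an arbitrary example of mine):

```matlab
a = [3; 4; 0];             % norm(a) = 5
H = Householder_matrix(a);
H*a                        % only the first element survives: [-5; 0; 0]
H'*H                       % identity: the reflection is orthogonal
```

Note that the sign of the surviving element is opposite to that of a(1), as dictated by the stability consideration above.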
How do we use the Householder transformation? We consider the columns of the matrix to be
transformed as the vectors that we can reflect as shown above. The first step zeroes out the elements
of A below the diagonal of the first column (* stands for a possibly nonzero element):

H_1 A = [ * * * * * *
          0 * * * * *
          0 * * * * *
          0 * * * * *
          0 * * * * *
          0 * * * * * ]

We write H_1 for the 6 x 6 matrix obtained from the first column of A. We write H_2 for the 5 x 5
matrix obtained from the second column of A, from the diagonal to the bottom of the column.
Analogously for the other Householder matrices. Each H_j is embedded as the trailing diagonal
block of an identity matrix, and one column after another is reduced below the diagonal, until

[ 1 0 ; 0 H_5 ] ... [ 1 0 ; 0 H_2 ] H_1 A = R ,

where the block 1 in the factor containing H_j stands for the (j-1) x (j-1) identity matrix.
To obtain A from R we would successively invert the above relationships one by one. That is not
difficult since we realize that those matrices are orthogonal and symmetric, so the inverse is equal
to the original matrix. We just have to switch the order of the matrices. We get

A = H_1 [ 1 0 ; 0 H_2 ] ... [ 1 0 ; 0 H_5 ] R ,

and therefore

Q = H_1 [ 1 0 ; 0 H_2 ] ... [ 1 0 ; 0 H_5 ] ,

where in each factor the smaller Householder matrix is embedded as the trailing diagonal block of
an identity matrix.
Illustration 13

Here we present a factorization which is based directly on the schemas above. The function
Householder_matrix computes the Householder matrix of equation (7.24). Note that the matrices H_j are blocks embedded
in an identity matrix. The following code fragment should be stepped
through, and I will bet that it will nicely reinforce our ideas of how Householder reflections work.
format short
A=rand(5); R=A % this is where R starts
Q=eye(size(A));% this is where Q starts
for k=1:size(A,1)-1
H=eye(size(A));% Start with an identity...
% ...and then put in the Householder matrix as a block
H(k:end,k:end) = Householder_matrix(R(k:end,k))
R= H*R % this matrix is becoming R
Q= Q*H % this matrix is becoming Q
end
Q'*Q% check that this is an orthogonal matrix: should get identity
A-Q*R % check that the factorization is correct
R-Q'*A % another way to check
The algorithm to produce the QR factorization is designed to be a little bit more efficient than
the code above, but it is still surprisingly short and readable
function [Q,R] = HouseQR(A)
m=size(A,1);
Q=eye(m); R =A;
for k=1:size(A,1)-1
n = Householder_normal(R(k:end,k));
R(k:end,k:end) =R(k:end,k:end)-2*n*(n'*R(k:end,k:end));
Q(:,k:end)=Q(:,k:end)-2*(Q(:,k:end)*n)*n';
end
end
Instead of the Householder matrix (7.24) we use in HouseQR the equivalent expression

H = 1 - 2 N N^T ,

where N has the same direction as n but is of unit length

N = n / ||n|| .
8

Solution methods for eigenvalue problems
Summary

1. We discover a few basic algorithms for the solution of the eigenvalue problem, both the standard
and the generalized form.
2. Repeated multiplication with a matrix tends to amplify directions associated with eigenvectors
of dominant eigenvalues. Main idea: write the modal expansion, and consider the powers of
eigenvalues.
3. Various forms of the power iteration, including the QR iteration, form the foundations of some
of the workhorse routines used in vibration analysis and in general purpose software (with
appropriate, and sometimes considerable, refinements).
4. The Rayleigh quotient is an invaluable tool both for algorithm design and for quick ad hoc
checks.
5. This area of numerical analysis has seen considerable progress in recent years and some powerful new algorithms have emerged. Solving large-scale eigenvalue problems nevertheless remains
nontrivial, even with sophisticated software packages.
Expand an arbitrary starting vector x in the basis of the eigenvectors v_j of the matrix A (the
modal expansion)

x = sum_{j=1:n} c_j v_j .

Repeated multiplication by A then gives

A^k x = sum_{j=1:n} c_j A^k v_j = sum_{j=1:n} c_j λ_j^k v_j .
Factoring out the magnitude of the first eigenvalue,

sum_{j=1:n} c_j A^k v_j = |λ_1|^k sum_{j=1:n} c_j (λ_j^k / |λ_1|^k) v_j .

Due to our assumption that the first eigenvalue dominates, the coefficients λ_j^k / |λ_1|^k will approach
zero in absolute value as k → ∞, except for λ_1^k / |λ_1|^k which will maintain absolute value equal to
one. Therefore, as k → ∞ the only term left from the modal expansion of x will be

A^k x → c_1 λ_1^k v_1 .
Figure 8.1 illustrates the effect of repeated multiplication of an arbitrary vector x by the 2 x 2
matrix A

Ax , AAx = A^2 x , ...

The eigenvalues are λ_1 = 1.6 (with eigenvector v_1), λ_2 = 0.37 (with eigenvector v_2), so the first eigenvalue is dominant, and evidently the result of the multiplication leans more and more towards the
first eigenvector. The leaning is very rapid. The reason is that the fraction λ_2^k / |λ_1|^k = (0.23125)^k
will decrease very rapidly with higher powers (for instance, (0.23125)^4 = 0.00285). Therefore, the
contribution of the eigenvector v_2 to the vector A^k x will become vanishingly small rather quickly.
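This leaning is easy to reproduce numerically; a possible sketch (the normalization of x in each pass only keeps the numbers readable, it does not change the direction):

```matlab
A = [1.68, 0.548; 0.202, 0.286]; % the matrix of Figure 8.1
[V, D] = eig(A);
[~, i] = max(abs(diag(D)));      % locate the dominant eigenvalue
vd = V(:, i);                    % dominant (unit) eigenvector
x = rand(2, 1);
for k = 1:8
    x = A*x; x = x/norm(x);      % one multiplication, then rescale
end
abs(vd'*x)                       % approaches 1 as the directions align
```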
A = [1.68 0.548; 0.202 0.286]

Fig. 8.1. The effect of several matrix-vector multiplications. Eigenvalues λ_1 = 1.6, λ_2 = 0.37
The repeated multiplication to amplify the components of the dominant eigenvector is the
principle behind the so-called power iteration method for the calculation of the dominant eigenvalue/eigenvector.
Repeated multiplication by the matrix will diminish the contributions of all other eigenvectors except the first one so that eventually
the product A^k x will be mostly in the direction of the first eigenvector v_1.

The method is not failproof. Firstly, it appears that if the starting vector x does not contain any
contribution of the first eigenvector, c_1 = 0, the power method is not going to converge. Fortunately,
any amount of the inevitable arithmetic error will likely introduce some contribution of the first
eigenvector to which the power method will ultimately converge. Unfortunately, it may take a long
time.
Secondly, the method is definitely going to have trouble with converging for |λ_2| ≈ |λ_1| (in
words, when the second eigenvalue is close to the first eigenvalue in magnitude). The ratio λ_2^k / |λ_1|^k
will decrease slowly, resulting in slow convergence. Such a situation is illustrated in Figure 8.2: the
eigenvalues are λ_1 = -0.8, λ_2 = 0.75. The iterated vector A^k x appears to converge to the direction
of v_1, but slowly.
A few observations can be made from Figure 8.2. The iterated vector A^k x decreases in magnitude
(|λ_1| < 1), and if we iterate sufficiently long the vector will get so short that we may risk underflow,
or at least numerical issues due to arithmetic error. (Note that for |λ_1| > 1 the approximations
to the eigenvector will grow, which may eventually result in overflow.) Further, since λ_1 < 0 the
iterated vector aligns itself alternately with v_1 and -v_1. This is fine, since both are perfectly good
eigenvectors, but it complicates somewhat the issue of how to measure convergence. We want to
measure convergence of directions, not of the individual components of the vector!
A = [1.4 0.894; 1.43 1.35]

Fig. 8.2. The effect of several matrix-vector multiplications. Eigenvalues λ_1 = -0.8, λ_2 = 0.75
To address the concerns about underflow and overflow we may introduce normalization (rescaling)
of the iterated vector as

x^(0) given
for k = 1, 2, ...
    x^(k) = A x^(k-1)
    x^(k) = x^(k) / ||x^(k)||
How to measure the convergence of the algorithm may be made easier by considering the associated
problem of finding the eigenvalue λ_1. An excellent tool is offered by the Rayleigh quotient. Pre-multiply the eigenvalue problem on both sides with v_j^T

A v_j = λ_j v_j   =>   v_j^T A v_j = λ_j v_j^T v_j ,

which gives (the Rayleigh quotient)

λ_j = (v_j^T A v_j) / (v_j^T v_j) .
Now consider the vector x^(k) as an approximation of the eigenvector v_1. A good approximation of
the eigenvalue will be

λ ≈ (x^(k)T A x^(k)) / (x^(k)T x^(k)) .

It will be much easier to measure relative approximate errors in the eigenvalue than to measure the
convergence of the direction of the eigenvector. An actual implementation of the power iteration
algorithm then follows easily:
function [lambda,v,converged]=pwr2(A,v,tol,maxiter)
% ... some error checking omitted
plambda=Inf;% eigenvalue in previous iteration
converged = false;
for iter=1:maxiter
    u=A*v; % update eigenvector approx
    lambda=(u'*v)/(v'*v);% Rayleigh quotient
    v=u/norm(u);% normalize
    if (abs(lambda-plambda)/abs(lambda)<tol)
        converged = true; break;% converged!
    end
    plambda=lambda;% eigenvalue in previous iteration
end
end
Note that we have to return a Boolean flag to indicate whether the iteration process inside the
function converged or not. This is a common design feature of software implementing iterative
processes, since the iterations may or may not succeed.
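Assuming pwr2 as listed above, a typical call might look like this (the tolerance and iteration limit are arbitrary choices):

```matlab
A = [1.68, 0.548; 0.202, 0.286]; % the matrix of Figure 8.1
[lambda, v, converged] = pwr2(A, rand(2,1), 1e-9, 100);
% compare with the dominant eigenvalue delivered by eig
abs(lambda - max(abs(eig(A))))   % should be tiny if converged is true
```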
We conclude this section with pointing out that power iteration relies on the existence of a
dominant eigenvalue. This is not applicable in many important problems, for example for the first
order form of the equations of motion of a vibrating system. For such systems eigenvalues come in
complex conjugate pairs. There is no single dominant eigenvalue, and consequently power iteration
will not converge. This is illustrated in Figure 8.3, where we show the progress of the power iteration
for two different starting vectors for a matrix with eigenvalues λ_1,2 = ±0.7. There is no progress
towards any of the eigenvectors, since the iterated vectors just switch between two different directions
neither of which is the eigenvector direction.
In what follows we shall work with real symmetric matrices, unless we explicitly say
otherwise. The main reasons: these matrices are very important in practice, we don't
have to treat special cases such as missing eigenvectors, and the eigenvalues and eigenvectors are real.
Illustration 1

Figure 8.4 shows the model of two linked buildings. Each building is represented by a concentrated
mass m standing in for the total mass of the floor, and springs linking the floors k_c which would be
representative of the total horizontal stiffness of the columns in between the floors (or the ground).
The buildings are linked at each floor with another spring k, representative of walkways (bridges)
that connect the buildings. The masses in the system are numbered as shown.
1 See: aetna/EigenvalueProblems/pwr2.m
A = [1.24 0.808; 1.29 1.24]

Fig. 8.3. The effect of several matrix-vector multiplications. Eigenvalues λ_1,2 = ±0.7
The mass matrix is simply m times the 10 x 10 identity matrix. The stiffness matrix K has the structure
shown below. Note that if the buildings are not linked by the walkways (k = 0), the stiffness matrix
will split into two uncoupled 5 x 5 diagonal blocks that correspond to each building separately.
Nonzero walkway stiffness will couple the vibrations of the two buildings together.
K =
[ kc+k   -kc     0      0      0     -k     0      0      0      0
  -kc   2kc+k   -kc     0      0      0    -k      0      0      0
   0    -kc    2kc+k   -kc     0      0     0     -k      0      0
   0     0     -kc    2kc+k   -kc     0     0      0     -k      0
   0     0      0     -kc    2kc+k    0     0      0      0     -k
  -k     0      0      0      0     kc+k  -kc     0      0      0
   0    -k      0      0      0     -kc   2kc+k  -kc     0      0
   0     0     -k      0      0      0    -kc   2kc+k   -kc     0
   0     0      0     -k      0      0     0    -kc    2kc+k   -kc
   0     0      0      0     -k      0     0      0    -kc    2kc+k ]
The vibration problem can be described by the equation (5.3)

ω^2 M z = K z .

Since the mass matrix is just a multiple of the identity, this may be written as

A z = λ z ,

where we define

A = (1/m) K ,   and   λ = ω^2 .
The first practice will apply the power method to the computation of the largest frequency of
vibration. We assume m = 133, kc = 61000, k = 3136 (in consistent units). The solution with
MATLAB's eig is written for the eigenvalue problem as

[M,K,A] = lb_prop;
[V,D]=eig(A) % This may be replaced with [V,D]=eig(K,M)
disp('Frequencies [Hz]')
sqrt(diag(D))/(2*pi)
which yields the resulting frequencies as
Frequencies [Hz]
ans =
0.9702 1.4614 2.8319 3.0354 4.4641 4.5960 5.7348 5.8380 6.5408 6.6315
Applying the power method as shown in the script lb_A_power2 with a random starting vector yields
an approximation of the highest eigenvalue, but it is not anywhere close to being converged. This
should not surprise us. We would expect the convergence to be slow: the two largest eigenvalues are
very closely spaced (the largest eigenvalue is weakly dominant): see Figure 8.5. This makes, together
with the inherent symmetry in the structure, for an interesting experiment: see below.
Suggested experiments

1. Use a starting vector in the form of ones(10,1). Do we get convergence to the largest eigenvalue?
If not, try to explain. [Difficult]
Fig. 8.4. The model of the two linked buildings, with the masses numbered 1-10

Fig. 8.5. The two highest-frequency mode shapes: f9 = 6.5408 [Hz], f10 = 6.6315 [Hz]
Multiply the eigenvalue problem A x = λ x on both sides by A^-1 and divide by λ to obtain

(1/λ) x = A^-1 x .

In words, the matrix A and A^-1 have the same eigenvectors, and the eigenvalues of A^-1 are the
inverses of the eigenvalues of A. Clearly, the largest eigenvalue of A^-1 will be one over the smallest
eigenvalue of A

max |eigenvalue of A^-1| = 1 / min |eigenvalue of A| .
Therefore, to find the eigenvalue/eigenvector pair of A for the smallest eigenvalue in absolute value
we can perform the power iteration on A^-1. We would not wish to invert the matrix, of course, and
so we formulate the algorithm as

x^(0) given
for k = 1, 2, ...
    solve A x^(k) = x^(k-1)
    x^(k) = x^(k) / ||x^(k)||

which simply means solve for x^(k) from A x^(k) = x^(k-1). (Compare with the power iteration algorithm on page 163; there is only one change, but an important one.) Since the solution is needed
during each iteration, we may conveniently and efficiently take advantage of the LU factorization.
The inverse power iteration algorithm is summarized in the code below. Note the changes with
respect to the power iteration in the first two lines in the for loop.
function [lambda,v,converged]=invpwr2(A,v,tol,maxiter)
% ... some error checking omitted
plambda=Inf;% initialize eigenvalue in previous iteration
[L,U,p]=lu(A,'vector');% Factorization
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p)); % update eigenvector approx, equiv. to u=A\v
    lambda=(v'*v)/(u'*v);% Rayleigh quotient: note the inverse
    v=u/norm(u);% normalize
    if (abs(lambda-plambda)/abs(lambda)<tol)
        converged = true; break;% converged!
    end
    plambda=lambda;
end
end
3 See: aetna/EigenvalueProblems/invpwr2.m
Note the shortcut to the value of the Rayleigh quotient: the vector product (u'*v) incorporates the
multiplication with A^-1. Then, because we are iterating to find 1/λ, we invert the fraction.

The inverse power iteration also relies on the existence of a dominant eigenvalue. Dominant
here means that the smallest eigenvalue should be strictly smaller in absolute value than any other
eigenvalue of A. We assume again they are ordered in decreasing magnitude, and for the success of
the inverse iteration we require

|λ_1| ≥ |λ_2| ≥ |λ_3| ≥ ... ≥ |λ_{n-1}| > |λ_n| .

Analogously to the power iteration, the convergence of the inverse power iteration will be faster for
very dominant eigenvalues, |λ_{n-1}| >> |λ_n|, and painfully slow for |λ_{n-1}| ≈ |λ_n|.
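Assuming invpwr2 as listed above, a quick sanity check might read (the test matrix is an arbitrary symmetric example of mine, shifted to make it positive definite):

```matlab
S = rand(4); A = S + S' + 5*eye(4); % symmetric, eigenvalues shifted positive
[lambda, v, converged] = invpwr2(A, rand(4,1), 1e-9, 200);
% lambda should approximate the eigenvalue of A smallest in magnitude
abs(lambda - min(abs(eig(A))))
```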
Illustration 2

Here we illustrate the convergence of the inverse power iteration on the example of two symmetric
matrices. We construct two random matrices with spectra that are identical except for the smallest eigenvalue. The smallest eigenvalue is dominant in one matrix, and rather close to the second
eigenvalue in magnitude in the second matrix. Consequently Figure 8.6 displays quite disparate
convergence behaviors of the inverse power iteration: very good in the first case, poor in the second.
Fig. 8.6. The relative error of the smallest eigenvalue for two symmetric 13 x 13 matrices with eigenvalues
[13, 14:25] and [6.1, 14:25]
Illustration 3

Apply the inverse power iteration method to the structure described in the Illustration on page 164.

The inverse power method as shown in the script lb_A_invpower with a random starting vector
yields an approximation of the lowest eigenvalue with satisfactory convergence. The first two mode
shapes are shown in Figure 8.7 (only the mode on the left was computed with inverse power iteration,
the mode on the right was added using eig()).
Fig. 8.7. The first two mode shapes: f1 = 0.97015 [Hz], f2 = 1.4614 [Hz]
Suggested experiments

1. Change the stiffness of the link spring to k = 0. Does the inverse power iteration converge? If
not, why?
Fig. 8.8. The eigenvalues 1/λ_j of the inverse matrix, and the effect of the shift σ on the spectrum of the shifted inverse
Fig. 8.9. The relative error of the smallest eigenvalue λ_4 for the symmetric 4 x 4 matrices with eigenvalues
[2.80, 1.167, 0.609, 0.452]. Comparison of un-shifted and shifted inverse power iteration.

Figure 8.9 shows the effect of shifting. Two shifts are applied, one corresponding to Figure 8.8,
and one even closer to the eigenvalue λ_4 in magnitude, σ = 0.4. The effect of shifting is quite
dramatic. The closer we can guess the magnitude of the smallest eigenvalue (so that we can set the
shift to be equal to the guess of the eigenvalue) the higher the convergence rate.
The inverse power iteration algorithm with shifting is given in MATLAB code below.
function [lambda,v,converged]=sinvpwr2(A,v,sigma,tol,maxiter)
% ... some error checking omitted
n=size(A,1);
plambda=Inf;% initialize eigenvalue in previous iteration
v=v/norm(v);% normalize
[L,U,p]=lu((A-sigma*eye(n)),'vector');%Factorization
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p)); % update eigenvector approx
    lambda=(u'*A*u)/(u'*u);% Rayleigh q. using the definition
    v=u/norm(u);% normalize
    if (abs(lambda-plambda)/abs(lambda)<tol)
        converged = true; break;% converged!
    end
    plambda=lambda;
end
end
6 See: aetna/EigenvalueProblems/sinvpwr2.m
Illustration 5

Apply the inverse power iteration method to the structure described in the Illustration on page 164, but
change the stiffness of the link spring to k = 0. Would shifting help with convergence to the first
frequency?
A = [1.05 0.171; 0.171 0.614]

Fig. 8.10. The effect of several matrix-vector multiplications. Eigenvalues λ_1 = 1.11, λ_2 = 0.556. No effort
is made to maintain the iteration vectors linearly independent.
The fact that the most dominant eigenvector will be swamping out all the other eigenvectors is
going to keep us from obtaining reasonable approximations of the other eigenvectors. In other words,
since the dominant eigenvector components will be getting magnified more than the components
of the other vectors, eventually all the vectors on which we iterate will become aligned with the
dominant eigenvector. Figure 8.10 illustrates the effect of simultaneous iteration on two vectors:
starting vectors are w_1^(0), w_2^(0). After just four iterations the vectors w_1^(4), w_2^(4) are pretty much
aligned with the dominant eigenvector v_1. They are still linearly independent, but only barely.
So iteration on multiple vectors will be tricky. The desired eigenvectors will still be present,
but they will be hard to extract from such an ill conditioned basis (all vectors essentially parallel).
Therefore, similarly to power (inverse power) iteration where we normalized the approximation in
each step so as to avoid underflow or overflow, we will normalize the set of vectors on which we iterate.
Not only so they are unit magnitude, but also so that they are mutually orthogonal. (Technical
term: the vectors are orthonormal.) An excellent tool for this purpose is the QR factorization: the
columns of the matrix Q are orthonormal, and they come from the columns of the input matrix.
In this way we get the so-called simultaneous power iteration (also called block power iteration). The starting vectors will be arranged as columns of a rectangular matrix

W^(0) = [ w_1^(0), w_2^(0), ... w_p^(0) ] .
The algorithm will repeatedly multiply the iterated n x p matrix W^(k) by the n x n matrix A and
also orthogonalize the columns of the iterated matrix by the QR factorization.

W^(0) given
for k = 1, 2, ...
    W^(k) = A W^(k-1)
    QR = W^(k) % compute QR factorization
    W^(k) = Q
(8.1)
The eigenvalue approximations may be computed as before from the Rayleigh quotient

λ_j^(k) = ( w_j^(k)T A w_j^(k) ) / ( w_j^(k)T w_j^(k) ) .

Note that the iterated vectors are orthonormal,

w_j^(k)T w_m^(k) = { 1, when j = m
                     0, otherwise.
Figure 8.11 shows the effect of orthogonalization for the same matrix and the same starting vectors
as in Figure 8.10, but this time with QR factorization. The iterated vectors now converge to the two
eigenvectors.
A = [1.05 0.171; 0.171 0.614]

Fig. 8.11. The effect of several matrix-vector multiplications. Eigenvalues λ_1 = 1.11, λ_2 = 0.556. Iteration
vectors are orthogonalized after each iteration.
In order to switch from the block power iteration to the block inverse power iteration we just
switch the one line that refers to the repeated multiplication with the coefficient matrix so that the
multiplication is with its inverse

W^(0) given
for k = 1, 2, ...
    A W^(k) = W^(k-1) % solve
    QR = W^(k) % compute QR factorization
    W^(k) = Q
(8.2)
The MATLAB code for the block inverse power iteration is given below. Note that the so-called
economy QR factorization is used: the matrix Q is rectangular rather than square.
function [lambda,v,converged]=binvpwr2(A,v,tol,maxiter)
% ... some error checking omitted
nvecs=size(v,2);% How many eigenvalues?
plambda=Inf+zeros(nvecs,1);
lambda =plambda;
[v,r]=qr(v,0);% normalize
[L,U,p] =lu(A,'vector');% Factorized for efficiency
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p,:)); % update vectors
    for j=1:nvecs % Rayleigh quotient
        lambda(j)=(v(:,j)'*v(:,j))./(u(:,j)'*v(:,j));
    end
    [v,r]=qr(u,0);% economy QR factorization
    if (norm(lambda-plambda)/norm(lambda)<tol)
        converged = true; break;
    end
    plambda=lambda;
end
end

8 See: aetna/EigenvalueProblems/binvpwr2.m
Note that when we're computing the Rayleigh quotient we have to account for u being the result of
the inverse power iteration. Also, we could have replaced
lambda(j)=(v(:,j)'*v(:,j))./(u(:,j)'*v(:,j)) with
lambda(j)= 1.0./(u(:,j)'*v(:,j)) (why?).
Shifting could also be applied to block inverse power iteration. Even though only one shift value
can be used, the beneficial effect applies to all iterated eigenvectors: the iteration will converge to
the eigenvectors with eigenvalues closest to the shift.
Illustration 6

Apply the block inverse power iteration method to the structure described in the Illustration on page 164,
but change the stiffness of the link spring to k = 0. Use it to find the first two modes.

A possible solution is given in the script lb_A_blinvpower.

Suggested experiments

1. Interpret the mode shapes obtained above with the solution provided by MATLAB's eig. The
mode shapes are different. Does it matter?
8.5 QR iteration

An obvious step to take with simultaneous power iteration is to compute all the eigenvalues and
eigenvectors of the n x n matrix A by iterating on n vectors at the same time. This is shown in the
following algorithm (note the choice of the initial orthonormal vectors as the columns of an identity
matrix):

W^(0) = 1
for k = 1, 2, ...
    W^(k) = A W^(k-1)
    QR = W^(k) % compute QR factorization
    W^(k) = Q
(8.3)
The matrix W^(k) converges to a matrix of eigenvectors. Recall that the matrix of eigenvectors can
make the matrix A similar to a diagonal matrix, the matrix of the eigenvalues (see (4.13)). The
matrix W^(k) is only close to the matrix of eigenvectors (and getting closer with the iteration), and
therefore the matrix

W^(k)T A W^(k)

will be only close to a diagonal matrix, not perfectly diagonal, and the numbers on the diagonal will
approximate the eigenvalues.
It can be shown that the above simultaneous iteration is equivalent to the so-called QR iteration
(note well that this is different from QR factorization). The QR iteration is given by the following
algorithm:

A^(0) = A
for k = 1, 2, ...
    QR = A^(k-1) % compute QR factorization
    A^(k) = RQ % note the switched factors
(8.4)
The matrix A^(k) that appears in the last step of (8.4) is the same as A^(k) = W^(k)T A W^(k) in
the algorithm (8.3) (explained in detail in Trefethen, Bau (1997)). In this sense the two algorithms
are equivalent. The script qr_power_correspondence demonstrates the equivalence of the two
algorithms for a randomly generated matrix.
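The behavior of (8.4) is also easy to observe directly; a minimal sketch for a random symmetric matrix (the iteration count is an arbitrary choice of mine, and more steps may be needed when eigenvalues are closely spaced or repeated in absolute value):

```matlab
S = rand(5); A = S + S';       % random symmetric test matrix
Ak = A;
for k = 1:300
    [Q, R] = qr(Ak);           % factorize...
    Ak = R*Q;                  % ...and multiply in the reverse order
end
% for symmetric A the iterated matrix approaches a diagonal matrix
% holding the eigenvalues of A
sort(diag(Ak)) - sort(eig(A))  % should be tiny
```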
The QR iteration (8.4) is amenable to several significant enhancements as pointed out below.
The QR iteration is one of the most important algorithms used in eigenvalue/eigenvector problems.
First we will inspect the properties of the transformations effected by the above algorithm.
8.5.1 Schur factorization

The matrix A^(k) in (8.4) converges to an upper triangular matrix. In fact, for our assumption of A
being symmetric, A^(k) converges to a diagonal matrix. In the limit of k → ∞ the transformation

W^(k)T A W^(k)     (8.5)

is upper triangular. This can be shown as follows: the square matrix A has at least one eigenvalue
and one eigenvector. Therefore, we can write (for simplicity the procedure is demonstrated here for
a 6 x 6 matrix; the symbol * means here a general complex number; zeros are not shown)

A U_1 = U_1 [ λ_1 * * * * *
                  * * * * *
                  * * * * *
                  * * * * *
                  * * * * *
                  * * * * * ]

where the first column of U_1 is an eigenvector of A: A u_1 = λ_1 u_1, and the other columns of U_1 are
arbitrarily selected to form an orthonormal basis (this is always possible). Now we write
11 Hermitian matrix: A = A^H, where A^H is the so-called conjugate transpose (its elements are complex
conjugates of the transposed matrix).
12 Defective matrix does not have a full set of eigenvectors. Example: [0, 1; 0, 0]. Double eigenvalue 0, a
single eigenvector [1; 0].
13 Unitary matrix: complex matrix U such that U^H U = U U^H = 1. For real matrices unitary = orthogonal.
U_1^T A U_1 = [ λ_1 * * * * *
                 0  * * * * *
                 0  * * * * *
                 0  * * * * *
                 0  * * * * *
                 0  * * * * * ]

and we apply exactly the same argument to the smaller 5 x 5 matrix (the * elements). This again
leads to the first column having zeros below the diagonal, which we write as

U_2^T U_1^T A U_1 U_2 = [ λ_1  *  * * * *
                               λ_2 * * * *
                                   * * * *
                                   * * * *
                                   * * * *
                                   * * * * ]

Continuing in this fashion we arrive at

U_5^T ... U_2^T U_1^T A U_1 U_2 ... U_5 = [ λ_1 * * * * *
                                                 λ_2 * * * *
                                                     λ_3 * * *
                                                         λ_4 * *
                                                             λ_5 *
                                                                 λ_6 ]
Since we can define a unitary matrix as U = U_1 U_2 ... U_5 we have completed the Schur factorization.
This construction highlights the main attraction of the Schur factorization: the upper triangular
matrix on the right-hand side has the eigenvalues of A on the diagonal. It also points to a major
difficulty: in order to compute the Schur factorization we have to solve a sequence of eigenvalue
problems. This is not possible in a finite number of steps in general, as follows from the impossibility
of finding the roots of an arbitrarily high order polynomial by explicit formulas. As a consequence,
computing the Schur factorization must be an iterative procedure, and in fact the QR iteration is
precisely such a procedure.
8.5.2 QR iteration: Shifting and deflation

The QR iteration is a numerically stable procedure because it proceeds by applying successive
orthogonal transformations, similarly to the construction we just outlined. To show this we write for
the QR factors in one step

Q^(k) R^(k) = A^(k-1) ,   A^(k) = R^(k) Q^(k)

and substitute in the second equation R^(k) = Q^(k)T A^(k-1) from the first equation:

A^(k) = Q^(k)T A^(k-1) Q^(k) .
Fig. 8.12. QR factorization example. Matrix eigenvalues [3, 3, 4, 4.5, 5, 7]. QR iterations 1, 5, 9, 13 are
shown top to bottom, left to right.
elements decrease in magnitude with successive iterations, and the diagonal elements come to dominate. Figure 8.13 shows similar computation as in Figure 8.12, but with a dierent matrix. This
time the QR iteration gets stuck on the three eigenvalues in the top left corner, and the iteration
does not result in a diagonal matrix. The lack of convergence is due to the repeated eigenvalues (in
absolute value), and additional sophistication is needed to extract the the repeated eigenvalues.
Shifting may be introduced into the QR iteration similarly as in the simultaneous inverse iteration. The QR iteration may in fact be shown to be equivalent not only to simultaneous iteration, but also to simultaneous inverse iteration. Therefore, the shifting will have a very similar effect: faster convergence in the lower eigenvalues. The shift can be selected in various judicious ways. Here we will discuss a simple choice: the Rayleigh quotient shift. We have seen that the QR iteration was successively transforming the original matrix to a diagonal matrix. The elements on the diagonal of the iterated matrix are in fact the Rayleigh quotients. A good shift therefore is the element A_nn^(k-1) of the iterated matrix. The shift is applied as
A^(0) = A
for k = 1, 2, ...
    rho = A_nn^(k-1)
    Q^(k) R^(k) = A^(k-1) - rho 1   % compute QR factorization
    A^(k) = R^(k) Q^(k) + rho 1
This translates directly into MATLAB code:16

function A = qrstepS(A)
[m,n]=size(A);
rho = A(n,n); % shift
[Q,R]=qr(A-rho*eye(n,n));
A = R*Q + rho*eye(n,n);
end

14 See: aetna/EigenvalueProblems/qrstep.m
15 See: aetna/EigenvalueProblems/Visualize_qr_iteration.m
16 See: aetna/EigenvalueProblems/qrstepS.m

Fig. 8.13. QR factorization example. Matrix eigenvalues [5, 5, 5, 3, 2, 1]. QR iterations 1, 5, 9, 13 are shown top to bottom, left to right.
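For readers working outside MATLAB, the same shifted step can be sketched in Python. The function name qr_step_shifted and the test matrix below are ours, not part of the AETNA toolbox; the eigenvalues are chosen distinct so the iteration diagonalizes the matrix:

```python
import numpy as np

def qr_step_shifted(A):
    """One step of the shifted QR iteration with Rayleigh quotient shift rho = A[n-1, n-1]."""
    n = A.shape[0]
    rho = A[-1, -1]                               # the shift
    Q, R = np.linalg.qr(A - rho * np.eye(n))      # QR factorize the shifted matrix
    return R @ Q + rho * np.eye(n)                # reverse the factors, shift back

# Symmetric test matrix with known eigenvalues, hidden by an orthogonal similarity
rng = np.random.default_rng(0)
Q0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q0 @ np.diag([1.0, 3.0, 6.0, 10.0]) @ Q0.T

for _ in range(100):
    A = qr_step_shifted(A)

off_diag_norm = np.linalg.norm(A - np.diag(np.diag(A)))
```

After the iterations, the diagonal of A approximates the eigenvalues and the off-diagonal entries have decayed.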
In practice, once an eigenvalue converges, the corresponding row and column are removed from the matrix, and the QR iteration continues on the smaller remaining matrix. This is called deflation.
Illustration 7
Apply the shifted QR iteration method to the structure described in the Illustration on page 164.
The shifted QR algorithm using qrstepS in the script lb A qr17 does not in fact converge very well. The basic algorithm without shifting18 actually works better. Even better is the shifting strategy known under the name of Wilkinson (James Hardy Wilkinson, 1919-1986, was a giant in the 20th century history of numerical algorithms)19.
Suggested experiments
1. Change the stiffness of the link spring to k = 0. Does the QR iteration converge? Try the variants with shifting.
M = [m 0 0; 0 m 0; 0 0 m] ,   K = [2k -k 0; -k 2k -k; 0 -k k]

Here k = 61, and all the masses are equal, m = 1.3. For instance, we can check how many natural frequencies lie below 0.5 Hz. We form the matrix

A = K - (2π · 0.5)² M
Using the MATLAB LU factorization [L,U,P] = lu(A) yields

L = [1 0 0; -0.559 1 0; 0 -0.812 1] ,  U = [109 -61 0; 0 75.1 -61; 0 0 -1.39] ,  P = [1 0 0; 0 1 0; 0 0 1]
Since there is only one negative number on the diagonal of U (that is, on the diagonal of the matrix D from the LDL^T matrix factorization) we conclude that only one natural frequency lies below 0.5 Hz.
Next we check how many natural frequencies lie below 2.0 Hz. The factorization gives

L = [1 0 0; 0 1 0; 0.732 0.633 1] ,  U = [-83.3 -61 0; 0 -61 -144; 0 0 30.3] ,  P = [1 0 0; 0 0 1; 0 1 0] ,
which we compare with the frequencies given in Section 5.4 and conclude that something is wrong: there are two negative numbers on the diagonal, but all three frequencies are in fact below 2.0 Hz. The reason is that once partial pivoting introduces a non-identity permutation matrix, so that

LU = PA ,

the congruence that the Sylvester theorem relies upon is no longer applicable. In fact, the product LU is no longer symmetric and it is not possible to factor it into LDL^T. The pivoting has to be done carefully to preserve the symmetry of the resulting product of factors. For instance, the MATLAB
function ldl produces directly the LDL^T factorization and returns the psychologically lower-triangular factor L. We can write [L,D] = ldl(A), with the result

L = [1 0 0; 0.732 0.423 1; 0 1 0] ,  D = [-83.3 0 0; 0 -144 0; 0 0 -12.8] .
Now we see three negative numbers on the diagonal of D which indeed corresponds to our prior
knowledge that all three frequencies are below 2.0 Hz.
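The slicing count can be reproduced numerically. The Python sketch below (function names are ours) counts the negative pivots of an unpivoted LDL^T, which preserves the congruence with A so that the Sylvester theorem applies; skipping the pivoting is admissible here because no zero pivots are encountered:

```python
import numpy as np

k, m = 61.0, 1.3
K = np.array([[2*k, -k, 0.0], [-k, 2*k, -k], [0.0, -k, k]])
M = m * np.eye(3)

def ldl_pivots(A):
    """Unpivoted LDL^T of a symmetric matrix: return the diagonal of D (assumes nonzero pivots)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    d = np.zeros(n)
    for i in range(n):
        d[i] = A[i, i]                       # pivot after previous eliminations
        for j in range(i + 1, n):
            lji = A[j, i] / d[i]
            A[j, i:] -= lji * A[i, i:]       # eliminate below the pivot
    return d

def count_below(f_hz):
    """Sylvester inertia count: frequencies below f_hz = negative pivots of K - omega^2 M."""
    omega = 2.0 * np.pi * f_hz
    return int(np.sum(ldl_pivots(K - omega**2 * M) < 0))
```

With the book's values k = 61, m = 1.3 this reproduces the conclusions above: one frequency below 0.5 Hz, three below 2.0 Hz.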
by defining R = L D^(1/2), so that R R^T = K. We see that we need to work with a positive definite stiffness matrix so that the diagonal matrix D will give real square roots. With the Cholesky factors at hand we transform the generalized eigenvalue problem K z = ω² M z as

K z = R R^T z = ω² M z

and by introducing y = R^T z we obtain the standard eigenvalue problem

(1/ω²) y = R^(-1) M R^(-T) y .
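The transformation can be checked numerically for the three-mass example. This is a Python sketch (variable names are ours); the eigenvalues of the transformed standard problem must agree with the generalized eigenvalues of (K, M):

```python
import numpy as np

k, m = 61.0, 1.3
K = np.array([[2*k, -k, 0.0], [-k, 2*k, -k], [0.0, -k, k]])
M = m * np.eye(3)

R = np.linalg.cholesky(K)          # K = R R^T, R lower triangular
X = np.linalg.solve(R, M)          # R^{-1} M
B = np.linalg.solve(R, X.T).T      # R^{-1} M R^{-T}, symmetric

mu = np.linalg.eigvalsh(B)         # mu = 1 / omega^2
omega2 = np.sort(1.0 / mu)

# Reference: generalized eigenvalues of (K, M); M = m*I makes this simple
omega2_ref = np.sort(np.linalg.eigvalsh(K) / m)
```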
If the stiffness happens to be singular, but the mass matrix is not, the roles of these two matrices may be reversed.
For larger generalized eigenvalue problems (in vibration analysis it is not uncommon nowadays to work with millions of equations) the conversion to the standard eigenvalue problem would be too expensive. Moreover, we are typically not interested in all the eigenvalues anyway, and a better-suited technique will help us extract a few eigenvalues of interest, typically the lowest ones.
The inverse iteration method (8.2) is easily adapted to the generalized eigenvalue problem. The
simultaneous inverse iteration for the generalized eigenvalue problem is written as
W^(0) given
for k = 1, 2, ...
    K W^(k) = M W^(k-1)   % solve
    Q R = W^(k)           % compute QR factorization
    W^(k) = Q
The eigenvalues may be estimated during the iteration using the Rayleigh quotient. For the generalized eigenvalue problem the Rayleigh quotient is computed from

ω² M z = K z   =>   ω² = (z^T K z) / (z^T M z) .
The MATLAB code for the generalized eigenvalue problem solved with block inverse power
iteration is given below: 20
20 See: aetna/EigenvalueProblems/gepbinvpwr2.m
function [lambda,v,converged]=gepbinvpwr2(K,M,v,tol,maxiter)
% ... some error checking omitted
nvecs=size(v,2);% How many eigenvalues?
plambda=Inf+zeros(nvecs,1);% previous eigenvalue
lambda =plambda;
[L,U,p] =lu(K,'vector');
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\(M*v(p,:))); % update vector
    for j=1:nvecs
        lambda(j)=(v(:,j)'*K*v(:,j))/(v(:,j)'*M*v(:,j));% Rayleigh quotient
    end
    [v,r]=qr(u,0);% economy factorization
    if (norm(lambda-plambda)/norm(lambda)<tol)
        converged = true; break;
    end
    plambda=lambda;
end
end
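A Python sketch of the same block inverse power iteration follows (names are ours; for brevity it re-solves with K each pass instead of reusing an LU factorization, which is what the MATLAB code above does):

```python
import numpy as np

def gep_block_inv_power(K, M, V, tol=1e-9, maxiter=200):
    """Block inverse power iteration for K z = lambda M z; targets the smallest eigenvalues."""
    lam_prev = np.full(V.shape[1], np.inf)
    lam = lam_prev
    for _ in range(maxiter):
        U = np.linalg.solve(K, M @ V)                               # solve K U = M V
        lam = np.array([(v @ K @ v) / (v @ M @ v) for v in V.T])    # Rayleigh quotients
        V, _ = np.linalg.qr(U)                                      # re-orthonormalize the block
        if np.linalg.norm(lam - lam_prev) / np.linalg.norm(lam) < tol:
            return lam, V, True
        lam_prev = lam
    return lam, V, False

k, m = 61.0, 1.3
K = np.array([[2*k, -k, 0.0], [-k, 2*k, -k], [0.0, -k, k]])
M = m * np.eye(3)
rng = np.random.default_rng(0)
lam, V, ok = gep_block_inv_power(K, M, rng.standard_normal((3, 2)))
```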
Illustration 9
Apply the block inverse power iteration method for the generalized eigenvalue problem to the structure described in the Illustration on page 164.
The algorithm gepbinvpwr2 converges as well as the regular block inverse power iteration for the
standard eigenvalue problem.21 No surprise, given how easy it was to transition from the generalized
to the standard eigenvalue problem for this particular mass matrix.
8.7.1 Shifting
Shifting could also be introduced into the block inverse power iteration for the generalized eigenvalue problem, not only to speed up convergence to the smallest eigenvalue by making it more dominant, but also for precisely the opposite: to make the smallest eigenvalue less dominant. What we mean by this is that if a structure contains rigid body modes (the structure can move without experiencing any resisting forces), it has at least one zero frequency of vibration. Such a frequency is very strongly dominant in the inverse power iteration (1/0!). The effect of this dominance cannot be exploited, however, since the matrix K is not invertible. This would make the block inverse power iteration algorithm (page 180) impossible.
Shifting can help. To the eigenvalue problem (with λ = ω²)

λ M z = K z

we add the term σ M z on both sides

λ M z + σ M z = σ M z + K z

and obtain

(λ + σ) M z = (σ M + K) z .
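A quick numerical check of the shifting trick, in Python, using the singular stiffness matrix introduced below (the book's values k = 61, m = 1.3; the shift 0.2 is arbitrary):

```python
import numpy as np

k, m, sigma = 61.0, 1.3, 0.2
K = np.array([[k, 0.0, 0.0], [0.0, k, -k], [0.0, -k, k]])  # singular: rigid body mode
M = m * np.eye(3)

# K alone cannot be inverted, but sigma*M + K can: solve the shifted problem instead.
# With M = m*I the shifted standard matrix sigma*I + K/m is symmetric.
lam_shifted = np.linalg.eigvalsh(np.linalg.solve(M, sigma * M + K)).min()
lam = lam_shifted - sigma   # remove the shift: the rigid body mode, ~0
```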
K = [k 0 0; 0 k -k; 0 -k k]
Equivalently, we say that the structure has a rigid body mode. The frequency corresponding to the rigid body mode is zero. Figure 8.14 shows this rigid body mode as a translation of the masses 2, 3. Mass 1 does not displace.22 Clearly, all springs maintain their unstressed length: the rigid body motion does not induce any forces in the structure.

Fig. 8.14. Structure with a singular stiffness matrix. The rigid body mode (ω = 0).
Now we shall try to apply the block inverse power iteration with gepbinvpwr2.23 The script n3_sing_undamped_modes_MK224 invokes gepbinvpwr2 to obtain the first mode without shifting, and the resulting eigenvector and eigenvalue are worthless. The eigenvector in fact contains not-a-numbers (NaN). Why? Because the stiffness matrix is singular, its LU factorization should not exist. The MATLAB function lu (put a breakpoint inside gepbinvpwr2) returns the factors as
K>> L,U
L =
     1     0     0
     0     1     0
     0    -1     1
U =
    61     0     0
     0    61   -61
     0     0     0
The 0 in the element 3,3 of the U factor is a problem: at some point we will have to divide by it. Hence the not-a-numbers.
The script n3_sing_undamped_modes_MK325 invokes gepbinvpwr2 to obtain the first mode with shifting. The shift is guessed as 0.2. This number is arbitrary, but it should be sufficiently small
to avoid getting close to the first nonzero frequency. The script shows how we invoke gepbinvpwr2 for a stiffness matrix that is modified by the addition of a multiple of the mass matrix to make it non-singular.
[M,C,K,A,k1,k2,k3,c1,c2,c3] = properties_sing_undamped;
v=rand(size(M,1),1);% initial guess of the eigenvector
tol=1e-9; maxiter =4;% tolerance, how many iterations allowed?
sigma = 0.2;% this is the shift
[lambda,v,converged]=gepbinvpwr2(K+sigma*M,M,v,tol,maxiter)
lambda =lambda-sigma % subtract the shift to get the original eigenvalue
The output evidently shows that the iteration was successful.
lambda =
    0.2000       % shifted
v =
   -0.0000
   -0.7071
   -0.7071
converged =
     1
lambda =
   6.3838e-016   % shift removed: ~0
Suggested experiments
For the structure from Illustration on page 164:
1. Change the stiffness of the link spring to k = 0. Does the block inverse power iteration converge?
2. Use the spectrum slicing approach to check the number of eigenvalues located by the power
iteration above.
9
Unconstrained Optimization
Summary
1. A number of basic techniques in structural analysis rely on results from the area of optimization.
Main idea: Equilibrium of structures and minimization of potential functions are intimately tied.
Equilibrium equations are the conditions of the minimum.
2. Stability of structures is connected to the classification of the stiffness matrix. Main idea: positive definite matrices correspond to stable structures.
3. The line search is a basic tool in minimization. Main idea: Monitor the gradient of the objective
function. Minimum (extremum) is indicated when the gradient becomes orthogonal to the line
search direction.
4. Solving a system of linear equations and minimizing an objective function are two roads to the
same destination. Main idea: We show that minimizing the so-called quadratic form solves a
system of linear algebraic equations.
5. The method of steepest descent may be improved by the method of conjugate gradients. Main
idea: keep track of directions of past line searches.
6. Direct versus iterative methods. Main idea: direct and iterative methods are rather different in their properties (cost vs. accuracy). Iterative algorithms seem to be becoming more and more important in modern software.
7. Least-squares fitting is an important example of optimization.
Find x* such that f(x*) <= f(x) for all x .   (9.1)

This can be easily changed into a maximization task by flipping the objective function about the horizontal axis (i.e. changing its sign) and seeking the maximum as

Find x* such that f(x*) >= f(x) for all x .   (9.2)
DE = (1/2) k s² .

Using a matrix expression (for reasons that will become clear later), the stretch of the spring can be expressed as

s = [cos 30°, sin 30°] [x1; x2] .

The energy stored in the spring can also be written as

DE = (1/2) k s^T s ,

where by s^T we mean the transpose (never mind that the transpose of a scalar doesn't do anything). Substituting for the stretch we obtain

DE = (1/2) k ( [cos 30°, sin 30°] [x1; x2] )^T ( [cos 30°, sin 30°] [x1; x2] ) ,

which gives in short order

DE = (1/2) k [x1, x2] [cos 30°; sin 30°] [cos 30°, sin 30°] [x1; x2] .

If we define the matrix

K = k [cos 30°; sin 30°] [cos 30°, sin 30°] = k [cos 30° cos 30°, cos 30° sin 30°; sin 30° cos 30°, sin 30° sin 30°] ,   (9.3)

the deformation energy is the quadratic form

DE = (1/2) x^T K x .   (9.4)
Fig. 9.2. Static equilibrium of particle suspended on a spring. The surface of the deformation energy.
Illustration 1
Modify the code below to display the surface in Figure 9.2. The second and the last line need to be modified to reflect a particular objective function. The last line is supposed to draw arrows representing the gradient.

[x,y]=meshgrid(-10:10,-10:10);
z=x.*y; % function
surf(x,y,z,'Edgecolor','none'); hold on
contour3(x,y,z,20,'w'); hold on
quiver(x,y,y,x); % gradient
Fig. 9.3. Static equilibrium of particle suspended on two springs.
Figure 9.4 shows the variation of the deformation energy as a function of x1, x2: the only point where the DE assumes the value of zero is at x1 = 0, x2 = 0. Everywhere else the deformation energy is positive. This means that whenever the displacements are different from zero, the springs will store nonzero energy. This is the hallmark of stable structures.
Fig. 9.4. Static equilibrium of particle suspended on a spring. The surface of the deformation energy.

Matrices A that have the property

x^T A x > 0

for all x ≠ 0, and for which

x^T A x = 0

only for x = 0, are called positive definite. Stable structures have positive definite stiffness matrices. Positive definite matrices are nonsingular (they are regular). This is a fact well worth retaining.
Note that the stiffness matrix is symmetric. An important property of quadratic forms is that only symmetric matrices contribute to the value of the quadratic form. We can show that as follows: For the moment assume that A is in general unsymmetric. The quadratic form is a scalar (real number), and as such it is equal to its transpose

x^T A x = (x^T A x)^T .

Therefore, we can write

x^T A x = x^T A^T x

or

x^T A x - x^T A^T x = x^T (A - A^T) x = 0 .   (9.7)

The general matrix A may be written as a sum of a symmetric matrix and a skew-symmetric (anti-symmetric) matrix

A = (1/2)(A + A^T) + (1/2)(A - A^T) .

In the expression (9.7) we recognize the anti-symmetric part of A. Therefore, we conclude that the anti-symmetric part does not contribute to the quadratic form, only the symmetric part does. Therefore, normally we work only with symmetric matrices in quadratic forms.
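This is easy to verify numerically; a Python sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))      # general, unsymmetric matrix
S = 0.5 * (A + A.T)                  # symmetric part
W = 0.5 * (A - A.T)                  # skew-symmetric part
x = rng.standard_normal(4)

q_full = x @ A @ x
q_sym = x @ S @ x                    # the symmetric part carries the whole value
q_skew = x @ W @ x                   # the skew part contributes nothing
```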
Consider how to compute the derivative with respect to x of the product a^T b: both vectors need to be differentiated in turn using the chain rule. So that we don't have to differentiate a transpose of the vector a, we take advantage of the fact that the result of the product a^T b is a scalar, which may be transposed at will without changing anything

a^T b = b^T a .

To differentiate the vector b in the product a^T b with respect to x is straightforward: it yields

a^T (∂b/∂x) .

To differentiate the vector a in the product a^T b with respect to x, we first transpose the product to get b^T a, and then we differentiate to obtain

b^T (∂a/∂x) .

So the product a^T b is differentiated as

∂(a^T b)/∂x = a^T (∂b/∂x) + b^T (∂a/∂x) .

Now back to the quadratic function. The quadratic term may be identified with the above product of vectors if we write

a = x ,  b = A x ,
x^T A x = z^T D z = Σ_{i=1:n} D_ii z_i² .

The last expression is going to be positive for any combination of z_i only if D_ii > 0 for all i. So D_ii > 0 for all i guarantees that the quadratic form is positive definite.
If any of the D_ii were equal to zero (to get this factorization if any of the elements in the pivot position was zero would be tricky!) and all the others were positive, the matrix would be positive semi-definite (and singular). (Just for completeness, if the pivots were a mixture of positive and negative numbers, the matrix would be indefinite.)
DE = (1/2) K x² ,

where K is the stiffness constant of the spring. The potential energy of the applied force is defined as

W = -L x .

The total energy is defined as

TE = DE + W .   (9.9)
The solution for the equilibrium displacement is determined by the principle of minimum total energy: for the equilibrium displacement x* the total energy assumes the smallest possible value

x* = arg min TE .   (9.10)

(It should be read: find x* as such argument that minimizes TE.) This is an unconstrained minimization problem. The minimum of the total energy is distinguished by the condition that the slope at the minimum is zero:

dTE/dx = d/dx ( (1/2) K x² - L x ) = K x - L = 0 .

This condition is seen to be simply the equation of equilibrium, whose solution indeed is the equilibrium displacement.
The meaning of equation (9.9) and of the minimization problem (9.10) is illustrated in Figure 9.5. The deformation energy is represented by a parabolic arc (dashed line), which attains zero value (that is its minimum) at zero displacement. The potential energy of the external force is represented by the straight dashed line. The sum of the deformation energy and the energy of the external force tilts the dashed parabola into the solid line parabola, the total energy. That shifts the original minimum on the dashed parabola into the new minimum on the solid parabola (negative value) at x*. The minimum is easily seen to be

min TE = (1/2) x* K x* - L x* = (1/2) x* K x* - K x* x* = -(1/2) x* K x* = -(1/2) x* L .
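For the single spring, the minimum can be confirmed directly; a Python sketch with made-up values of K and L:

```python
import numpy as np

K, L = 4.0, 6.0                        # illustrative stiffness and load
TE = lambda x: 0.5 * K * x**2 - L * x  # total energy

x_star = L / K                         # equilibrium: K x = L
min_TE = TE(x_star)                    # should equal -(1/2) x* L

# Brute-force check on a grid
xs = np.linspace(-1.0, 4.0, 1001)
x_grid_min = xs[np.argmin(TE(xs))]
```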
Fig. 9.5. The energies as functions of the displacement x: deformation energy DE, potential energy of the force W, and total energy TE, with the minimum of TE at x*.
Fig. 9.6. Static equilibrium of particle suspended on two springs. The surface of total energy.
x* = arg min_x TE .   (9.12)
Since all candidate displacements x may be considered in the minimization without any restrictions,
the minimization problem is called unconstrained.
Fig. 9.7. Walk towards the minimum of the objective function. Starting point is p0 , the walk proceeds
against the direction of the gradient.
∂TE/∂x = ∂/∂x ( (1/2) x^T K x - L^T x ) .

From (9.8) we have

∂TE/∂x = (1/2) x^T ( K + K^T ) - L^T .

Since the matrix K is symmetric, we can simplify

∂/∂x ( (1/2) x^T K x ) = x^T K

and finally

∂TE/∂x = x^T K - L^T .   (9.13)
r = -(∇TE)^T = L - K x .

The vector r is called the residual. We make it into a column matrix in order for the addition of the vectors x and r to make sense.
Next we have to find out how far to go. One possible strategy is to go as far as possible, meaning that we would follow along a given direction until we have reached the lowest possible value of the objective function starting from a given point in a given direction. Denoting the starting point x0, we write the motion in the direction r as

x = x0 + α r .

The lowest point will be reached when we stop descending; if we went any further we would start ascending on the surface of the objective function. We are moving along a direction which subtends various angles with the gradient at any given point. When we are descending we are moving against the direction of the gradient. This would be expressed as (see Figure 9.8, and observe the gradient of function f at point p2)

∇f(p2) r(p0) < 0 .

Note that the result of the multiplication ∇f(p2) r (row matrix with one row times column matrix with one column) is a number, proportional to the cosine of the angle that these two arrows subtend.
Fig. 9.8. Walk to find the minimum of the objective function along a given direction. Starting point is p0, the walk proceeds in the direction of r(p0) towards the point p1.
On the other hand, when we are ascending we are moving broadly in the same direction in which the gradient points, and we have (see Figure 9.8, and observe the gradient of function f at point p3)

∇f(p3) r(p0) > 0 .

Finally, we must conclude that when we are standing at a point from which to move in any direction would mean ascending, the path at that point must be perpendicular to the direction of the gradient at that point (see Figure 9.8, observe the gradient of function f at point p4)

∇f(p4) r(p0) = 0 .

(Remark: This may be an oversimplification for more general objective functions. There is also the possibility that a part of the path from p0 to p1 runs level: neither descending nor ascending.)
The condition that the gradient (9.13) at the lowest point x must be orthogonal to the direction of descent r can be written down as

∇f(x) r = ( x^T K - L^T ) r = 0

and writing x = x0 + α r we obtain

∇f(x) r = ( x0^T K + α r^T K - L^T ) r = 0 .

Further, we recognize in x0^T K - L^T = -r^T the negative residual, so that we arrive at

α = ( r^T r ) / ( r^T K r ) .

This is really the entire algorithm of steepest descent applied to the quadratic-form objective function (9.11): improve the location of the lowest value of the objective function by moving from the starting point x0 to the new point x

x = x0 + ( ( r^T r ) / ( r^T K r ) ) r ,   r = L - K x0

and then reset the starting point x0 = x. Such an algorithm is concisely written in MATLAB as
and then reset the starting point x0 = x. Such an algorithm is concisely written in MATLAB as
for iter=1:maxiter
r = b-A*x0;
x = x0 + (dot(r,r)/dot(A*r,r))* r;
x0 = x;
end
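A self-contained Python version of the same loop (function and variable names are ours), applied to a small symmetric positive definite system:

```python
import numpy as np

def steepest_descent(A, b, x0, maxiter=300):
    """Steepest descent for the quadratic form 1/2 x^T A x - b^T x, A symmetric positive definite."""
    x = x0.astype(float).copy()
    for _ in range(maxiter):
        r = b - A @ x                  # residual = negative gradient
        denom = r @ (A @ r)
        if denom == 0.0:               # converged to machine precision
            break
        x = x + (r @ r) / denom * r    # exact line search along r
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = steepest_descent(A, b, np.zeros(2))
```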
The steepest descent solver for quadratic objective functions is provided in the toolbox as SteepestAxb.1

Illustration 3

In Figure 9.9 we apply the solver SteepestAxb to the two-spring equilibrium problem from Section 9.8. Given that this is a two-unknowns system of linear algebraic equations, it takes a lot of iterations to arrive at a solution: inefficient! So why would we bother with this method? It does have some redeeming characteristics. To mention one, it requires very little memory. More about this later in Section 9.12.
Fig. 9.9. Convergence in the norm of the solution error for the steepest descent algorithm applied to the
two-spring equilibrium problem.
Fig. 9.10. Walk towards the minimum of the quadratic-form (total energy) objective function. Starting
point is p0 , the walk proceeds against the direction of the gradient.
effort is wasted by zigzagging towards the minimum, with each step going too much sideways with too little progress in the direction of the minimum.
We realize that there are only two independent directions in the plane x1, x2. The first direction is d(0) = -∇f(x(0))^T, the direction for the first descent step. Therefore, it must be possible to find a direction for the second step d(1) that would lead directly to the minimum. The reason is that at the point x(2) (that is, at the minimum) the gradient must vanish, which will make it perpendicular to any vector, including the first and second descent direction

∇f(x(2)) d(0) = 0 ,   ∇f(x(2)) d(1) = 0 ,   since ∇f(x(2)) = 0 at the minimum x(2) .

The second orthogonality condition, that is ∇f(x(2)) d(1) = 0, occurs naturally as a stopping condition for the step along d(1) (we go as far downhill as possible). We write

x(2) = x(1) + α d(1)

and the second condition will allow us to express
1 See: aetna/SteepestDescent/SteepestAxb.m
α = -( ∇f(x(1)) d(1) ) / ( d(1)^T K d(1) ) ,

and substituting into the first orthogonality condition leads to the requirement

d(1)^T K d(0) = 0 .   (9.14)
From this condition we can determine the second descent direction. We can see that it must be a combination of the first direction d(0) and of -∇f(x(1))^T: these two vectors are orthogonal and therefore they span the plane. In other words any vector can be expressed as a linear combination of these two. Thus we write

d(1) = -∇f(x(1))^T + β d(0) .

From (9.14) we obtain

β = ( ∇f(x(1)) K d(0) ) / ( d(0)^T K d(0) ) .
To show that the solution can indeed be obtained in just two steps in this case is possible with MATLAB symbolic math:2

K =[sym('K11'),sym('K12');sym('K12'),sym('K22')];% stiffness
L =[sym('L1');sym('L2')];% load
X0 =[sym('X01');sym('X02')];% starting point
g=@(x)(x'*K-L');% compute gradient
a=@(x,d)(-g(x)*d)/(d'*K*d);% compute alpha
b=@(x,d)(g(x)*K*d)/(d'*K*d);% compute beta
d0 =-g(X0)';% first descent direction
X1 =X0 +a(X0,d0)*d0;% second point
d1 =b(X1,d0)*d0-g(X1)';% second descent direction
X2 =X1 +a(X1,d1)*d1;% final point
simplify(g(X2))% gradient at final point ~ 0

The gradient at X2 indeed comes out as the zero matrix. (Word of caution: the symbolic computation may take a while; computer-assisted algebra is not very efficient.)
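The same two-step construction can also be verified numerically. This Python sketch substitutes made-up numbers for the symbols; after the second step the gradient vanishes to machine precision:

```python
import numpy as np

K = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite "stiffness"
L = np.array([1.0, 2.0])
g = lambda x: x @ K - L                   # gradient (row vector)
a = lambda x, d: -(g(x) @ d) / (d @ K @ d)   # step length alpha
b = lambda x, d: (g(x) @ K @ d) / (d @ K @ d)  # direction correction beta

x0 = np.array([10.0, -7.0])
d0 = -g(x0)                               # first descent direction
x1 = x0 + a(x0, d0) * d0                  # second point
d1 = b(x1, d0) * d0 - g(x1)               # second (K-conjugate) direction
x2 = x1 + a(x1, d1) * d1                  # final point: the minimum
```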
Fig. 9.11. Walk towards the minimum of the quadratic-form (total energy) objective function. Starting
point is p0 , the walk proceeds in the directions determined to reach the minimum in just two steps.
We will again make the gradient at the point x(k+1) orthogonal to the two directions d(k-1) and d(k),

∇f(x(k+1)) d(k) = 0 ,   ∇f(x(k+1)) d(k-1) = 0 ,

only this time the gradient does not have to vanish identically at x(k+1) since there are many vectors to which it could be orthogonal without having to become identically zero. First we will work out the gradient at the point x(k+1)

∇f(x(k+1)) = x(k+1)^T K - L^T = ( x(k) + α d(k) )^T K - L^T = x(k)^T K + α d(k)^T K - L^T ,

which results in

α = -( ∇f(x(k)) d(k) ) / ( d(k)^T K d(k) )

and in the condition

d(k)^T K d(k-1) = 0 .   (9.15)
We say that the directions d(k-1) and d(k) are K-orthogonal or K-conjugate (or just conjugate directions for short).
So that we can determine the new direction d(k) to be K-conjugate to the old one d(k-1), we assume the new descent direction is a combination of the direction of steepest descent -∇f(x(k))^T and the old direction d(k-1)

d(k) = -∇f(x(k))^T + β d(k-1) .

Substituting into the K-conjugate condition (9.15) we obtain

β = ( ∇f(x(k)) K d(k-1) ) / ( d(k-1)^T K d(k-1) ) .
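Putting the recurrences together gives the conjugate gradient loop. The toolbox version is ConjGradAxb.m; the simplified Python translation below is ours. For a symmetric positive definite K of size n, the iteration terminates at the exact solution in at most n steps (up to roundoff):

```python
import numpy as np

def conjugate_gradients(K, L, x0):
    """CG for 1/2 x^T K x - L^T x: each new direction is K-conjugate to the previous one."""
    x = x0.astype(float).copy()
    g = x @ K - L                     # gradient (row vector)
    d = -g                            # first direction: steepest descent
    for _ in range(len(L)):           # exact convergence in at most n steps
        dKd = d @ K @ d
        if dKd == 0.0:                # already converged
            break
        alpha = -(g @ d) / dKd        # exact line search
        x = x + alpha * d
        g = x @ K - L
        beta = (g @ K @ d) / dKd      # K-conjugacy correction
        d = -g + beta * d
    return x

rng = np.random.default_rng(3)
G = rng.standard_normal((6, 6))
K = G @ G.T + 6.0 * np.eye(6)         # symmetric positive definite
L = rng.standard_normal(6)
x = conjugate_gradients(K, L, np.zeros(6))
```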
3 See: aetna/SteepestDescent/ConjGradAxb.m
4 See: aetna/SteepestDescent/test_cg_1.m
201
10
10
10
10
10
15
10
Fig. 9.12. Comparison of the convergence of the steepest-descent algorithm (dashed line) and the Conjugate
Gradients algorithm (solid line). Matrix: poisson(18), 324 unknowns.
Fig. 9.13. Solution obtained with the Steepest Descent algorithm for the matrix gallery('poisson',18), 324 unknowns, using various numbers of iterations (iter = 6, 16, 32, 65, 108, 162).
Fig. 9.14. Solution obtained with the Conjugate Gradients algorithm for the matrix poisson(18), 324 unknowns, using various numbers of iterations (iter = 3, 6, 16).
Fig. 9.15. Comparison of effort versus error for direct and iterative methods.
Fig. 9.16. Stainless steel 303 round coupon, and the load-deflection diagram.
means that substituting the displacement wk and the force measured for that displacement Fk into the above relationship will not render it an equality; something will be left over: we will call it the residual

Fk - F(wk) = Fk - (p1 wk + p2) = rk .

This may be written in matrix form for all the data points as

[F1; F2; ...; Fn] - [w1, 1; w2, 1; ...; wn, 1] [p1; p2] = [r1; r2; ...; rn] .
For convenience, using the measured data w1, w2, ..., wn and F1, F2, ..., Fn we will define the matrix

A = [w1, 1; w2, 1; ...; wn, 1]

and the vector

b = [F1; F2; ...; Fn] .

The vector of the parameters of the linear fit is

u = [p1; p2] .

The vector of the residuals (also called the error of the linear fit) is

e = [r1; r2; ...; rn] .

So we write

b - A u = e ,

where the matrix A has more rows than columns. This is the reason why it will not be possible to make the error exactly zero in general: there are more equations than unknowns.
We realize that in default of being able to zero out the error, we have to go for the next best thing, which is to somehow minimize the magnitude of the error. In terms of the norm of the vector e it means to find the minimum of the objective function e^T e .
A =[w, ones(length(w),1)];
pl =(A'*A)\(A'*F)

pl =
   1.0e+004 *
   3.799600652696673
  -0.042276210924081
So the stiffness of the coupon based on the linear fit is approximately 37996 lb/in. Continuing our investigation, we realize that the data points appear to lie on an S-shaped curve, which suggests a linear regression with a cubic polynomial. This is easily accommodated in our model by taking

F(w) = p1 w³ + p2 w² + p3 w + p4 .

The matrix A becomes

A = [w1³, w1², w1, 1; w2³, w2², w2, 1; ...; wn³, wn², wn, 1]
and the solution is
A =[w.^3, w.^2, w, ones(length(w),1)];
pc =(A'*A)\(A'*F)

pc =
   1.0e+006 *
  -7.000925471829832
   0.862362550370550
   0.006168259997214
  -0.000087120026727
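The same normal-equations fits can be reproduced in Python on synthetic data. The data below are made up to resemble the coupon test, not the book's measurements:

```python
import numpy as np

# Synthetic load-deflection data with noise (illustrative only)
w = np.linspace(0.005, 0.07, 40)
F_true = lambda t: -7.0e6 * t**3 + 8.6e5 * t**2 + 6.2e3 * t - 87.0
rng = np.random.default_rng(4)
F = F_true(w) + rng.normal(0.0, 5.0, w.shape)

# Linear fit via the normal equations, mirroring the MATLAB snippet
A1 = np.column_stack([w, np.ones_like(w)])
pl = np.linalg.solve(A1.T @ A1, A1.T @ F)

# Cubic fit
A3 = np.column_stack([w**3, w**2, w, np.ones_like(w)])
pc = np.linalg.solve(A3.T @ A3, A3.T @ F)

res_lin = F - A1 @ pl
res_cub = F - A3 @ pc
```

The cubic residual is smaller in norm, as the text observes for the real data.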
Fig. 9.17. Stainless steel 303 round coupon, and the load-deflection diagram. Linear polynomial fit on the left, cubic polynomial fit on the right.

Figure 9.17 shows the linear and cubic polynomial fit of the experimental data. The difference is somewhat inconspicuous, but plotting the residuals is quite enlightening. Figure 9.18 shows the
residual for the linear and cubic polynomial fit. The linear polynomial fit residual shows a clear bias in the form of a cubic curve. This indicates that a cubic polynomial would be a better fit. That is indeed true, as both the magnitude decreased and the bias was removed from the cubic-fit residual.
Fig. 9.18. Stainless steel 303 round coupon load-deflection diagram. Linear polynomial fit residual in dashed line, cubic polynomial fit residual in solid line.
Figure 9.19 shows the variation of the stiffness coefficient as a function of the deflection for both the linear and the cubic polynomial fit. It may be appreciated that the stiffness varies by a substantial amount when determined from the cubic fit, while it is constant based on the linear fit.
Fig. 9.19. Stainless steel 303 round coupon load-deflection diagram. Stiffness coefficient as a function of deflection. Dashed line: from linear polynomial fit, solid line: from cubic polynomial fit.
they cannot serve as such basis vectors, and the linear combination of the columns of the matrix A is only going to cover a subset of R^n. Inspect Figure 9.20: the columns of the matrix A generate the gray plane as a graphical representation of the subset of R^n. The vector b is of course not confined to the plane and somehow sticks out of it. The difference e between b and Au also sticks out. To make the error e as small as possible (as short as possible) then amounts to making it orthogonal to the gray plane Au. The shortest possible error e = b - Au* will be orthogonal to all possible vectors in the gray plane, Au, as expressed here

(Au)^T e = 0 .

Substituting we obtain

(Au)^T (b - Au*) = 0

or

u^T A^T (b - Au*) = u^T ( A^T b - A^T A u* ) = 0 .

When we say "for all possible vectors in the gray plane, Au", we mean for all parameters u, and since the above equation must be true for all u, we have again the normal equations

A^T b - A^T A u* = 0 .

The solution to the normal equations are such parameters u* that they make the error of the least-squares fitting as small as possible.
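The orthogonality argument is easy to confirm numerically; a Python sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 3))           # more rows than columns
b = rng.standard_normal(8)

u_star = np.linalg.lstsq(A, b, rcond=None)[0]
e = b - A @ u_star                        # shortest possible error

ortho = A.T @ e                           # should vanish: e is orthogonal to range(A)
```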
Index
drag force, 5
dry friction, 21
dynamic viscosity, 6
eigenvalue, 39, 51
inverse, 167
positive, 79
shifting, 169
vibrating system, 164
eigenvalue problem, 34
generalized, 180
matrix, 39
standard, 180
eigenvalues of inverse matrix, 167
eigenvector, 39
elimination matrix, 137
energy, 186
energy minimization, 192
energy of deformation, 186
energy surface, 186
equation
explicit, 15
homogeneous, 7
implicit, 15
inhomogeneous, 7
partial differential, 18
equilibrium, 132
error, 100
approximate, 24
arithmetic, 110
machine-representation, 110
round-off, 110
true, 24
truncation, 110
error estimate, 103
Euclidean norm, 151
Euler formula, 49
Euler's formula, 45, 80
Euler's method, 9
expansion
modal, 161
experimental scatter, 202
explicit equation, 15
exponent, 112
exponential
matrix, 70
exponential as solution, 60
extremum, 204
factorization
Cholesky, 180
LDLT, 144, 179, 191
LU, 134, 179
QR, 156
QR economy, 207
Schur, 175
FFT, 86, 91
fill-in, 145
finite difference stencil, 109
first-order form, 80
force residual, 132
forced vibration, 89
forward Euler, 51
forward difference, 129
forward Euler, 16, 30, 52, 102
approximation of derivatives, 108
forward substitution, 135
Fourier series, 86
Fourier transform, 86
discrete, 86
free-vibration response, 82
frequency content, 85
friction coefficient, 23
fundamental frequency, 87
Gauss-Seidel, 145
generalized eigenvalue problem, 78, 79, 180
global error, 104, 105
global minimum, 194
governing equation, 7
gradient, 189, 190, 194, 195
Hermitian matrix, 175
Hessian matrix, 190
homogeneous equation, 7
Householder transformation, 157
IBVP, 18
identity matrix, 136
ill conditioned, 148
ill conditioned basis, 172
ill conditioning, 148
impedance matrix, 144
implicit algorithm, 119
implicit equation, 15
in-place factorization, 138
indefinite matrix, 191
independent variable, 7
induced matrix norm, 150
Inf, 112
inhomogeneous equation, 7
initial boundary value problem, 18
initial condition, 7, 18
initial conditions, 59
initial value problem, 7
instability, 15
integer, 111
integral
diff, 10
double, 112
eig, 40, 166
eps, 114
eval, 66
expm, 75
exp, 65
ezplot, 40
fft, 88
fzero, 17, 83, 125
int8, 112
intmax, 111
intmin, 111
ldl, 180
linspace, 8, 54
lu, 139
meshgrid, 54
norm, 152
numjac, 129, 134
ode23, 11, 31
ode45, 20, 31, 46
odeset, 16
realmax, 113
realmin, 113
single, 115
solve, 79
sort, 81
spy, 144
surf, 53
syms, 40
taylor, 97
tril, 139
triu, 139
vectorize, 66, 98
anonymous function, 10
function handle, 17
matrix
commuting, 75
condition number, 147
congruence, 179
conjugate transpose, 175
damping, 77
defective, 71, 72, 175
dense, 127
determinant, 146
diagonalizable, 64
eigenvalue problem, 39
elimination, 137
exponential, 70
Fourier transform, 86
Hessian, 190
Householder, 157
identity, 39
impedance, 144
indefinite, 191
inverse, 127, 136
inverse, eigenvalues, 167
Jacobian, 127
Jordan, 74
lower diagonal, 30
lower triangular, 134
mass, 77
norm, 150
normal equations, 204
of eigenvectors, 63
of principal vectors, 74
orthogonal, 140, 156, 157
permutation, 139
positive definite, 189
positive semi-definite, 191
power, 71
psychologically lower triangular, 140
quadratic form, 186
rank, 146
Rayleigh quotient, 154
rotation, 45, 68, 71
similar, 64
singular, 39
skew-symmetric, 43, 71
sparse, 127, 144
sparse, fill-in, 145
spectrum, 179
stiffness, 2, 186
symmetric, 79, 152
unitary, 175
unsymmetric, 152
upper triangular, 134, 156
matrix exponential, 70, 75
matrix inverse, 127
matrix power, 161
matrix powers, 71
method
conjugate gradients, 196
direct, 202
Euler's, 9
Hermitian, 175
iterative, 202
of conjugate gradients, 202
power, 161
rectangular, 206
minimization
unconstrained, 193
modal coordinates, 65, 83
modal expansion, 161
mode, 65
mode shape
undamped, 80
modified Euler, 30
modified Euler method, 26
multi-grid, 145
NaN, 112
natural frequency, 60, 179
nested function, 132
Newton's algorithm, 120, 127
Newton's equation, 6
nonlinear algebraic equation, 119
vector, 126
nonsingular, 189
nonsingular matrix, 79
norm
1-norm, 151
2-norm, 151
Euclidean, 151
infinity, 151
matrix, 150
vector, 150
normal coordinates, 65
normal equations, 148, 204, 207
normalized values, 113
numerical stability, 207
QR factorization, 157
Nyquist frequency, 86
Nyquist rate, 86
objective function, 185, 186, 193, 204
one-sided spectrum, 89
optimization
unconstrained, 185
order-of analysis, 98
orthogonal, 198, 207
orthogonal matrix, 140, 149, 156, 157
orthogonal vectors, 172
orthogonality condition, 199
orthonormal, 172
oscillation, 51
oscillator
multi-degree of freedom, 77
overow, 112, 115, 163
partial dierential equation, 18
partial pivoting, 139, 179
period of vibration, 85
periodic function, 86
permutation matrix, 139, 140, 179
permutation vector, 140
phase, 91, 93
phase shift, 91
phasor, 45, 68, 81
Wilkinson, 178
shifted eigenvalue, 169
shifting, 174
similar matrix, 64, 152
similarity transformation, 64
simultaneous iteration, 174
simultaneous power iteration, 172
singular, 146, 191
matrix, 39
singular matrix, 149, 187
singular stiffness, 95
singular value decomposition, 146
skew-symmetric matrix, 43, 71, 189
solution
general, 7
particular, 7
sparse matrix, 127, 144
spectrum, 84
spectrum slicing, 179
spring, 186
stability, 15, 33
stable structure, 188
stable time step, 36, 37
standard eigenvalue problem, 80, 180
static equilibrium, 188
steepest descent, 194–196
stiffness matrix, 186
Stokes, 5
Stokes' Law, 5
stretch, 186
Sylvester, 179
Sylvester's Law of inertia, 179
symmetric
stiffness, 189
symmetric matrix, 79, 190
Taylor series, 70, 100
terminal velocity, 8
time step, 84
time stepping
adaptive, 11
tolerance, 120
total energy, 192
trace, 42
transpose, 186, 190
trapezoidal integrator, 26
trapezoidal rule, 26
triangular matrix, 135
true error, 24
truncation error, 104, 105, 107, 108, 110
unconditionally stable, 52
unconditionally unstable, 52
unconstrained minimization, 192
unconstrained minimization problem, 193
unconstrained optimization, 185, 193
uncoupled variables, 65
undamped oscillator, 67
undamped vibration
natural frequency, 60
underow, 115, 163
unitary matrix, 175
unnormalized values, 113
unsigned byte, 111
unstable structure, 187
upper triangular matrix, 134, 156