Advanced Mathematics
DLMDSAM01
Course Book
Masthead
Publisher:
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt
Mailing address:
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
media@iu.org
www.iu.de
DLMDSAM01
Version No.: 001-2021-0512
www.iubh.de
Module Director
Prof. Dr. Eric Guiffo Kaigom
Mr. Guiffo Kaigom has also mentored foreign students for many years.
He is a regular reviewer for the IEEE Robotics and Automation Society
conferences and serves as a journal reviewer and editorial board
member for various journals.
Table of Contents
Advanced Mathematics
Module Director
Introduction
Advanced Mathematics
Signposts Throughout the Course Book
Learning Objectives
Unit 1
Calculus
1.1 Differentiation and Integration
Unit 2
Integral Transformations
2.1 Convolutions
Unit 3
Vector Algebra
3.1 Scalars and Vectors
Unit 4
Vector Calculus
4.1 Differentiation of Vectors
Unit 5
Matrices and Vector Spaces
5.1 Basic Matrix Algebra
Unit 6
Information Theory
6.1 Mean Squared Error (MSE)
Appendix 1
List of References
Appendix 2
List of Tables and Figures
Introduction
Advanced Mathematics
Welcome
This course book contains the core content for this course. Additional learning materials can
be found on the learning platform, but this course book should form the basis for your
learning.
The content of this course book is divided into units, which are divided further into sections.
Each section contains only one new key concept to allow you to quickly and efficiently add
new learning material to your existing knowledge.
At the end of each section of the digital course book, you will find self-check questions.
These questions are designed to help you check whether you have understood the concepts
in each section.
For all modules with a final exam, you must complete the knowledge tests on the learning
platform. You will pass the knowledge test for each unit when you answer at least 80% of the
questions correctly.
When you have passed the knowledge tests for all the units, the course is considered finished
and you will be able to register for the final assessment. Please ensure that you complete
the evaluation prior to registering for the assessment.
Good luck!
Learning Objectives
The course Advanced Mathematics aims to provide students with the mathematical background
knowledge to use and understand current methods and approaches from engineering and the
sciences.
To this end, the course starts with an exposition of the fundamentals of calculus. The notions
of differentiation and integration are introduced together with important generalizations to
multiple dimensions. Moreover, the widely used optimization technique of the calculus of
variations is explained. Integral transformations, which play a vital role in scientific and
engineering applications, are also covered.
The subject domains of linear algebra and calculus are brought together in the explanation
of vector calculus. The course concludes with explanations of important concepts from the
field of information theory, which underpins virtually all aspects of our contemporary
communication systems.
Unit 1
Calculus
STUDY GOALS
… how to perform partial differentiation and multiple integrals for functions with multiple
variables.
DL-E-DLMDSAM01-L01
1. Calculus
Introduction
Functions express relationships between variables. For example, in standard notation,
the function y = f(x) formalizes how a value y, called the dependent variable, varies
with respect to another value x, called the independent variable. The letter f is the
name of the function.
The rate at which the dependent variable changes with respect to the independent variable
is of particular interest, both mathematically and for applications. One example
of such a rate of change is the change in distance with respect to time, also known as
velocity. The method for finding this rate when given a function is called differentiation.
The operation of differentiation can often be “undone” via an operation called integration.
Integration can be seen as the inverse of differentiation, and the function it produces is
therefore often called the antiderivative. Intuitively, this means that if we have a formula that
expresses the speed of a particle with respect to time, we can often construct a formula
for the displacement of the particle (i.e., the distance it has traveled).
Textbooks that cover this subject area in more depth include Deisenroth, Faisal & Ong
(2020, Chap. 5), Strang (2017, Chap. 2-5, 7, 8, 13, 14), and Loomis & Sternberg (2014, Chap.
3, 8).
We are often interested in how a function changes with respect to its argument. For
example, we could imagine a travelling car with position s at a given time t. We know
from our everyday experience that at each time, t, a car has a velocity, v(t), which
measures how fast the car is travelling at time t. Over a given time interval, Δt, the average
velocity describes the rate at which the car travels the distance for that interval of
time

$$v(t) = \frac{\Delta s}{\Delta t}, \tag{1.1}$$

where Δs is the change in position of the car or the distance covered by the car. In
equation 1.1, the time t is the argument of the function v, which establishes a relationship
between the distance covered by the car and the time it takes to cover that distance.

Function: A function is a relation between two sets that associates every element of one set
to exactly one element of the other set.
More generally, we often wish to find the rate of change of a general function f(x),
where f depends on some argument x. We will begin by considering functions that
depend only on a single variable, such as f(x) = x^2, which is shown in the graphic that
follows this explanation. Fix a given value of x and let us consider the value of the
function, f(x), as we change the input to a slightly different value, where we assume
that the function is continuous and doesn’t have any “kinks” or “jumps.” For example, if
we start at x_0 = 1 and move to x_1 = 1.1, the value of the function f(x) = x^2 will change
from f(x_0) = 1 to f(x_1) = 1.21. Let us denote this change in x by Δx and write
x → x + Δx to indicate that x changes from x to x + Δx. Then, the change in the value
of the function f is Δf = f(x + Δx) − f(x). By making this increment Δx arbitrarily small,
we can work out the rate of change of f at a single instant. That is the idea of a derivative,
an instantaneous rate of change, and the limit operation allows us to formally
capture this intuition. As we make the change in x smaller and smaller, written Δx → 0,
we can define the gradient or first derivative of the function f as

$$f'(x) \equiv \frac{df(x)}{dx} \equiv \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}. \tag{1.2}$$
The function is differentiable at x_a if, and only if, this limit exists at the point x = x_a.
Definition 1.2 does not specify whether we approach x from smaller values (so Δx is
negative) or from larger values (in which case Δx is positive). This is because, in order for the
limit to exist, the definition of the limit requires that the quotient [f(x + Δx) − f(x)]/Δx
approaches the same value, f'(x), from both the left and right of x.
Geometrically, the derivative fʹ(x) can be interpreted as the slope of the line tangent to
the function f(x) at the point x.
Example
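Applying definition 1.2 to f(x) = x^2 gives

$$\begin{aligned}
f'(x) &= \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \\
      &= \lim_{\Delta x \to 0} \frac{(x + \Delta x)^2 - x^2}{\Delta x} \\
      &= \lim_{\Delta x \to 0} \frac{x^2 + 2x\Delta x + (\Delta x)^2 - x^2}{\Delta x} \\
      &= \lim_{\Delta x \to 0} \frac{2x\Delta x + (\Delta x)^2}{\Delta x} \\
      &= \lim_{\Delta x \to 0} \frac{\Delta x\,(2x + \Delta x)}{\Delta x} \\
      &= \lim_{\Delta x \to 0} (2x + \Delta x) \\
      &= 2x
\end{aligned}$$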
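The limit in definition 1.2 can also be explored numerically. The following is a minimal sketch, not part of the course material: the helper names `difference_quotient` and `square` are our own, and the step sizes are arbitrary choices. It evaluates the quotient for f(x) = x^2 at x = 1 with shrinking Δx, from both sides.

```python
def difference_quotient(f, x, dx):
    """Return (f(x + dx) - f(x)) / dx, the quotient inside the limit of definition 1.2."""
    return (f(x + dx) - f(x)) / dx


def square(x):
    """The example function f(x) = x^2."""
    return x * x


# Shrink the step from the right (dx > 0) and from the left (dx < 0).
# Both sequences should approach the same value, f'(1) = 2.
for dx in (0.1, 0.01, 0.001, 1e-6):
    right = difference_quotient(square, 1.0, dx)
    left = difference_quotient(square, 1.0, -dx)
    print(dx, right, left)
```

For this particular function the quotient equals 2x + Δx, so the printed values differ from 2 by roughly the step size, up to floating-point rounding.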
Using definition 1.2 in combination with the laws of limits, one can find derivatives of
many fundamental functions. For reference, here are the derivatives of some important
functions where n > 0 is a natural number and a is a real-valued constant.
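$$\frac{d}{dx} x^n = n x^{n-1}, \qquad \frac{d}{dx} e^{ax} = a e^{ax},$$

$$\frac{d}{dx} \ln(ax) = \frac{d}{dx}\left(\ln a + \ln x\right) = \frac{1}{x}, \qquad \frac{d}{dx} \sin(ax) = a\cos(ax),$$

$$\frac{d}{dx} \cos(ax) = -a\sin(ax), \qquad \frac{d}{dx} \tan(ax) = \frac{a}{\cos^2(ax)}.$$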
$$f''(x) \equiv \frac{df'(x)}{dx} \equiv \lim_{\Delta x \to 0} \frac{f'(x + \Delta x) - f'(x)}{\Delta x}, \tag{1.3}$$
where, again, fʹʹ is defined if, and only if, the limit exists. More generally, we can define
the nth derivative of f(x) to be
(1.4)
Stationary points
Looking again at the first graphic depicting a parabola, we notice that the point (0, 0) is
special; the value of the function on either side of point x = 0 is greater than at x = 0.
In other words, at x = 0, f achieves a local minimum. Graphically, we observe that the
line tangent to the graph of f at this point is horizontal: its slope is equal to zero. To
reiterate: the slope of the line tangent to f at x = 0, which is the derivative f'(0), has a
value of zero.
Points where the derivative is equal to zero, such as the point described above, are
called stationary points. After examining a number of examples, we see that f' is often,
but not always, equal to zero at a local minimum. The other possibility, illustrated by
the previous graphic, is that the slope of the tangent line, the derivative, is undefined
at the local extremum. For f(x) = |x|, (0, 0) is a critical point, defined to be a place where
the derivative is zero or does not exist, but it is not a stationary point.
Note that there are three different types of stationary points. They are as follows:
Note that the maximum and minimum found this way may not be the global maximum
or minimum of the function, but rather a local extremum at the stationary point.
Rules of Differentiation
$$\frac{d}{dx} f(x) = f'(x) = a\,\frac{d}{dx} g(x) = a\,g'(x).$$
Differentiation of products
Previously in this section, differentiation rules for some functions with simple structures
were discussed. However, in many cases, we are interested in the rates of change
of functions that are more complicated.
As a first example, we will investigate how to differentiate functions that can be written
as products of two other functions, namely functions of the form f(x) = u(x) · v(x). The
idea is that if we know how to differentiate u and v, and how to use that information
together with the product structure to find a derivative of f, we can avoid applying the
definition of the derivative. We could, from this perspective, reexamine f(x) = x^2, noting
that we could write it as f(x) = x · x. A slightly more complicated example is
g(x) = x^2 · sin(x), which we could decompose as g(x) = u(x) · v(x) where u(x) = x^2 and
v(x) = sin(x). Such a decomposition is not unique; we could consider any functions u
and v whose product is x^2 · sin(x). However, the idea behind decomposing the original
function f(x) into two functions, u and v, is to choose u and v that are easier to differentiate
than f. Then, if we have a general method to calculate the derivative of a product,
we can apply that method to f in order to make taking the derivative easier than if
we were to calculate it using definition 1.2. This general method, called
the product rule, is obtained from the definition (see equation 1.2) as follows. First, let’s
simplify the difference f(x+Δx) − f(x). This results in
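$$\begin{aligned}
f(x + \Delta x) - f(x) &= u(x + \Delta x)\,v(x + \Delta x) - u(x)\,v(x) \\
&= u(x + \Delta x)\left[v(x + \Delta x) - v(x)\right] + v(x)\left[u(x + \Delta x) - u(x)\right].
\end{aligned}$$

Note that we added and subtracted v(x)u(x+Δx) in order to be able to factor. Substituting
the result of our simplification into the definition of the derivative, we obtain

$$\begin{aligned}
\frac{df}{dx} &= \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} \left\{ u(x + \Delta x)\left[\frac{v(x + \Delta x) - v(x)}{\Delta x}\right] + v(x)\left[\frac{u(x + \Delta x) - u(x)}{\Delta x}\right] \right\}.
\end{aligned}$$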
As Δx approaches zero, u(x+Δx) approaches u(x) and the terms in the square brackets
become the derivatives of the functions v and u, respectively. Hence, the formula for
the derivative of a product of functions, called the product rule, is given by

$$f' \equiv \frac{df}{dx} \equiv \frac{d}{dx}\left[u(x)\,v(x)\right] = u(x)\frac{dv(x)}{dx} + v(x)\frac{du(x)}{dx} = uv' + vu'. \tag{1.5}$$
Using this rule repeatedly, the derivative of products of three or more differentiable
functions can be obtained as follows:
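Given f(x) = u(x) v(x) w(x),

$$\begin{aligned}
f'(x) = \frac{df}{dx} &= u\,\frac{d}{dx}(vw) + vw\,\frac{d}{dx}u \\
&= uv\,\frac{dw}{dx} + uw\,\frac{dv}{dx} + vw\,\frac{du}{dx}.
\end{aligned}$$

Example

$$\begin{aligned}
\frac{d}{dx}\left(x^2 \sin x\right) &= x^2\,\frac{d}{dx}\sin x + \sin x\,\frac{d}{dx}x^2 \\
&= x^2 \cos x + 2x \sin x.
\end{aligned}$$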
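The product-rule result for x^2 sin x can be checked against a numerical derivative. This is a minimal sketch under our own assumptions: `central_difference` is a hypothetical helper using a symmetric difference quotient (a standard numerical approximation, not the one-sided quotient of definition 1.2), and the evaluation point x = 1.3 is arbitrary.

```python
import math


def central_difference(f, x, h=1e-6):
    """Approximate f'(x) with the symmetric quotient (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def g(x):
    """The example function g(x) = x^2 * sin(x)."""
    return x**2 * math.sin(x)


def g_prime_product_rule(x):
    """Derivative from the product rule with u = x^2 and v = sin x: u v' + v u'."""
    return x**2 * math.cos(x) + 2 * x * math.sin(x)


x = 1.3  # arbitrary evaluation point
print(central_difference(g, x), g_prime_product_rule(x))
```

The two printed numbers should agree to several decimal places; the small remaining difference comes from the finite step h and floating-point rounding.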
The essential idea of the chain rule is that we differentiate the outer function f with
respect to the inner function u to get f'(u), leaving the inner function alone. Then we
differentiate the inner function u with respect to x to get u'(x) and multiply the two together:

$$\frac{df}{dx} = \frac{df}{du} \cdot \frac{du}{dx}. \tag{1.6}$$

This is known as the chain rule because we “chain” the derivatives together. The concept
can be easily extended to functions of functions of functions, and so on. We only
need to repeatedly apply the chain rule until we reach the independent variable.
Example
We can write this as f(x) = u^2(x), where u(x) = x − 1. Using equation 1.6 we obtain

$$\begin{aligned}
\frac{df}{dx} &= \frac{df}{du} \cdot \frac{du}{dx} \\
&= 2u\,\frac{du}{dx} \\
&= 2u \cdot 1 \\
&= 2(x - 1).
\end{aligned}$$
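Equation 1.6 translates directly into a small function that multiplies the outer derivative, evaluated at the inner function, by the inner derivative. The sketch below is illustrative only; `chain_derivative` and its parameter names are our own, and the evaluation point x = 3 is arbitrary.

```python
def chain_derivative(outer_prime, inner, inner_prime, x):
    """Equation 1.6: df/dx = f'(u) * u'(x), with f'(u) evaluated at u = inner(x)."""
    return outer_prime(inner(x)) * inner_prime(x)


# f(x) = (x - 1)^2 written as f(u) = u^2 with u(x) = x - 1.
value = chain_derivative(
    outer_prime=lambda u: 2 * u,   # df/du = 2u
    inner=lambda x: x - 1,         # u(x) = x - 1
    inner_prime=lambda x: 1.0,     # du/dx = 1
    x=3.0,
)
print(value)  # 2 * (3 - 1) = 4.0
```

For deeper nesting, the same function can be applied repeatedly, using the result of one application as the `inner_prime` of the next.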
The chain rule can also be used to calculate the derivative of functions of the form
f(x) = 1/v(x). Rather than writing this as a quotient, we can express it as a composition
of functions, f(x) = v^{−1}(x) (noting that this is the −1 power, not the inverse function),
and then apply the chain rule

$$\begin{aligned}
\frac{df}{dx} &= \frac{df}{dv}\,\frac{dv}{dx} \\
&= -v^{-2}\,\frac{dv}{dx} \\
&= -\frac{1}{v(x)^2}\,\frac{dv}{dx},
\end{aligned}$$

where we have used the elementary derivative d/dx x^n = n x^{n−1}.
Differentiation of quotients
In some cases, the function we want to take the derivative of can be written in the form
of a quotient of two functions, such as f(x) = u(x)/v(x). One way to create a rule in order to
calculate derivatives for such functions is to combine the product rule in equation 1.5
with the chain rule, and write the quotient as the product f(x) = u(x)[1/v(x)]. Applying the
product rule, we get

$$\begin{aligned}
\frac{df}{dx} &= \frac{d}{dx}\left[\frac{u(x)}{v(x)}\right] \\
&= u\,\frac{d}{dx}\left[\frac{1}{v(x)}\right] + \frac{1}{v(x)}\,\frac{d}{dx}u(x).
\end{aligned}$$

Using the chain rule to evaluate d/dx [1/v(x)] as above, we obtain

$$\frac{df}{dx} = u\left[-\frac{dv(x)/dx}{v(x)^2}\right] + \frac{du(x)/dx}{v(x)}.$$

The “prime” notation for the derivative yields an expression that is easier to read

$$f' = \left(\frac{u}{v}\right)' = \frac{vu' - uv'}{v^2}. \tag{1.7}$$
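As a quick check of equation 1.7, consider tan x = sin x / cos x. The choice u = sin x, v = cos x is ours for illustration and is not an example from the text. With u' = cos x and v' = −sin x, the quotient rule gives:

```latex
\begin{aligned}
\frac{d}{dx}\tan x
  &= \frac{d}{dx}\,\frac{\sin x}{\cos x}
   = \frac{\cos x \cdot \cos x - \sin x \cdot (-\sin x)}{\cos^2 x} \\
  &= \frac{\cos^2 x + \sin^2 x}{\cos^2 x}
   = \frac{1}{\cos^2 x}.
\end{aligned}
```

This agrees with the entry for tan(ax) in the list of elementary derivatives when a = 1.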
$$v = \frac{\Delta s}{\Delta t},$$

where Δs is the change in position (the distance) over time interval Δt. We considered
progressively shorter time intervals in order to investigate the instantaneous rate of
change

$$v(t) = \frac{ds}{dt}.$$
The first derivative of the position with respect to time is the instantaneous velocity.
If we know a function for the velocity v(t), it is natural to wonder if it is possible to
calculate the distance Δs the car has traveled over a given time interval, Δt.
Informally, we can do this by looking at very small sub-intervals rather than considering
the whole interval Δt at once. We will then assume that the velocity is constant over
each of these small sub-intervals in order to get an approximation of the distance the car
travels over each one. Each time, we will use our constant approximation of the
velocity multiplied by the length of time to get the area of a small rectangle, as
shown in the following figure. We know that we have a small error in each interval, but
the smaller the intervals, the smaller the error becomes. To find the total distance the
car has traveled, we add up the areas of all the rectangles that represent the
contributions from each small time interval.
More formally, consider an arbitrary function f(x) of a single variable x that is defined
over the interval a ≤ x ≤ b. Following the approach above, we divide the interval [a, b]
into n sub-intervals by introducing intermediate points ξ_i so that a = ξ_0 < ξ_1 < ...
< ξ_n = b. The lengths of the intervals (ξ_i − ξ_{i−1}) are the widths of the rectangles along
the x-axis and the f(ξ_i) are the heights of the rectangles. The sum S,

$$S = \sum_{i=1}^{n} f(\xi_i)\,(\xi_i - \xi_{i-1}), \tag{1.8}$$
is the area of all of the rectangles. The area under some curves over certain intervals is
not finite; consider f(x) = 1/x over the interval [0, 1]. Therefore, as we take more and
more intervals, i.e. as we consider the limit of S as n approaches ∞, the sum S may or
may not converge to a finite limit. If this limit exists, the limit of the sum is the definite
integral I of the function f(x) in the interval [a, b],

$$I = \int_a^b f(x)\,dx. \tag{1.9}$$
For closed, finite intervals, the question of whether this limit exists (whether the function
f is integrable over the given interval) can be settled by checking continuity: for
continuous functions over a finite interval [a, b] this limit, the integral, always
exists. Continuity is sufficient for integrability, but not necessary.
Example
The first step toward computing this integral, or determining whether this limit
exists, is to divide the interval [0, b] into n rectangles of uniform width w. Next, we
evaluate the function f(x) = x^2 at the right-hand endpoint of each sub-interval to
determine the height of each rectangle. We could also have taken the value at the
left-hand endpoint or at any point in between; the limit does not depend on this
choice. The right-hand endpoint of the ith sub-interval is x = iw, so the area of the ith
rectangle is w · (iw)^2 = i^2 w^3. The total area of our approximation, A, is then given by

$$A = \sum_{i=1}^{n} i^2 w^3.$$
The term w^3 is a constant with respect to the index of summation, i, so we can factor
it out of the sum operator as follows:

$$A = w^3 \sum_{i=1}^{n} i^2.$$

$$\sum_{i=1}^{n} i^2 = \frac{1}{6}\,n(n + 1)(2n + 1),$$

$$A = w^3\,\frac{1}{6}\,n(n + 1)(2n + 1).$$
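When constructing the rectangles, we divided the interval [0, b] into intervals of the
same length, namely w = b/n. Therefore, we can substitute into our expression for A
and reduce to get

$$\begin{aligned}
A &= \left(\frac{b}{n}\right)^3 \frac{1}{6}\,n(n + 1)(2n + 1) \\
&= \frac{b^3}{6}\,\frac{n(n + 1)(2n + 1)}{n^3} \\
&= \frac{b^3}{6}\,\frac{(n + 1)(2n + 1)}{n^2} \\
&= \frac{b^3}{6}\,\frac{2n^2 + 3n + 1}{n^2} \\
&= \frac{b^3}{6}\left(2 + \frac{3}{n} + \frac{1}{n^2}\right).
\end{aligned}$$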
Using the properties of limits and finite sums, as above, one can see that the following
properties hold:
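$$\int_a^b 0\,dx = 0 \tag{1.10}$$

$$\int_a^a f(x)\,dx = 0 \tag{1.11}$$

$$\int_a^b \left[f(x) + g(x)\right]dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx \tag{1.12}$$

$$\int_a^c f(x)\,dx = \int_a^b f(x)\,dx + \int_b^c f(x)\,dx, \quad \text{for all } b \in [a, c] \tag{1.13}$$

$$\int_a^b f(x)\,dx = -\int_b^a f(x)\,dx.$$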
$$F(x) = \int_a^x f(u)\,du. \tag{1.14}$$

$$\begin{aligned}
F(x + \Delta x) &= \int_a^{x + \Delta x} f(u)\,du \\
&= \int_a^x f(u)\,du + \int_x^{x + \Delta x} f(u)\,du \\
&= F(x) + \int_x^{x + \Delta x} f(u)\,du.
\end{aligned}$$
If we bring F(x) to the left side and divide both sides by Δx, the equation reads

$$\frac{F(x + \Delta x) - F(x)}{\Delta x} = \frac{1}{\Delta x}\int_x^{x + \Delta x} f(u)\,du.$$

$$\frac{dF(x)}{dx} = f(x), \tag{1.15}$$

$$\frac{d}{dx}\int_a^x f(u)\,du = f(x). \tag{1.16}$$
This says that the derivative of the integral gives back the original integrand. This very
important result is called the Fundamental Theorem of Calculus, and it has a second
part, which relates the definite integral to the antiderivative. Let’s explore it now.
The above discussion did not depend on any attribute of the arbitrary constant a.
Hence, the inverse of differentiation is not unique. However, any two antiderivatives
F_1(x) and F_2(x) of the same function differ at most by a constant, so it is written as

$$\int f(x)\,dx = F(x) + c \tag{1.17}$$

for the family of functions with derivative f(x). Recall that d/dx c = 0. This is the indefinite
integral of f(x), and c is called the constant of integration.
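The statement of equation 1.16, that differentiating the integral returns the integrand, can be illustrated numerically. This is a minimal sketch and not a proof: the integrand cos u, the lower limit a = 0, the midpoint rule used for the integral, and the names `midpoint_integral` and `F` are all our own choices.

```python
import math


def midpoint_integral(f, a, b, n=2000):
    """Midpoint-rule approximation of the definite integral of f over [a, b]."""
    w = (b - a) / n
    return sum(f(a + (i + 0.5) * w) for i in range(n)) * w


def F(x, a=0.0):
    """F(x) = integral of cos(u) from a to x, as in equation 1.14 with f = cos."""
    return midpoint_integral(math.cos, a, x)


x, dx = 1.0, 1e-3
slope = (F(x + dx) - F(x - dx)) / (2 * dx)  # numerical derivative of F at x
print(slope, math.cos(x))                    # the two values should nearly agree
```

Changing the lower limit a shifts F by a constant but leaves the slope unchanged, which mirrors the role of the constant of integration in equation 1.17.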
The antiderivative F(x) can also be used to evaluate definite integrals. Let x_0 be an
arbitrary fixed point in (a, b) and consider equation 1.13 to obtain

$$\int_a^b f(x)\,dx = \int_a^{x_0} f(x)\,dx + \int_{x_0}^b f(x)\,dx \tag{1.18}$$

$$= \int_{x_0}^b f(x)\,dx + \int_a^{x_0} f(x)\,dx \tag{1.19}$$

$$\int_a^b f(x)\,dx = \int_{x_0}^b f(x)\,dx - \int_{x_0}^a f(x)\,dx \tag{1.20}$$

$$= F(b) - F(a). \tag{1.21}$$
$$\int_a^\infty f(x)\,dx = \lim_{b \to \infty} \int_a^b f(x)\,dx = \lim_{b \to \infty} \left[F(b) - F(a)\right], \tag{1.22}$$
Evaluation of integrals
Unfortunately, unlike differentiation, many integrals cannot be evaluated easily and
there are few simple rules which can be used. Some examples of indefinite integrals
are given below. Note that u is typically a function u(x), and du = u’(x)dx.
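$$\int u^n\,du = \frac{u^{n+1}}{n + 1} + c, \quad n \neq -1$$

$$\int \frac{du}{u} = \ln u + c$$

$$\int a^u\,du = \frac{a^u}{\ln a} + c$$

$$\int e^u\,du = e^u + c$$

$$\int \frac{du}{\sin^2 u} = -\cot u + c$$

$$\int \frac{du}{u^2 + a^2} = \frac{1}{a}\arctan\frac{u}{a} + c$$

$$\int \frac{du}{\sqrt{a^2 - u^2}} = \arcsin\frac{u}{a} + c$$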
Formulae for a large number of integrals can be found in tables of integrals. In order to
evaluate unknown integrals, we generally try to transform integrals into forms that are
easier to evaluate. For reference, here are a few “techniques of integration” that might
help.
• Logarithmic integration: Integrals whose integrand can be written as the quotient of
the derivative of a function and that same function can be evaluated as

$$\int \frac{f'(x)}{f(x)}\,dx = \ln f(x) + c$$

• Linearity:

$$\int \sum_{i=1}^{n} a_i f_i(x)\,dx = \sum_{i=1}^{n} a_i \int f_i(x)\,dx$$

• Integration by substitution: with x = u(t),

$$\int_{u(a)}^{u(b)} f(x)\,dx = \int_a^b f(u(t))\,\frac{du(t)}{dt}\,dt$$

The key to identifying integrals of this form is to find a suitable substitution function.
• Integration by parts: Recall the product rule:

$$\frac{d}{dx}(u \cdot v) = uv' + u'v$$

Integration by parts enables us to split the integral into parts which are easier to
solve. Rearranging equation 1.5 (the product rule)

$$\frac{d}{dx}(uv) = u\,\frac{dv}{dx} + v\,\frac{du}{dx}$$

to

$$u\,\frac{dv}{dx} = \frac{d}{dx}(uv) - v\,\frac{du}{dx}$$

and integrating both sides with respect to x gives

$$\int uv'\,dx = uv - \int vu'\,dx.$$
The “art” of solving the integral is to choose the functions u and v so that the
remaining integral becomes easier to solve.
Example
Noting that the integrand is a product of x and cos x, we solve this using integration
by parts and choose u = x and v' = cos x, and thus get the result that v = sin x
and du = dx. Substituting, we get

$$\begin{aligned}
\int_a^b x\cos x\,dx &= \Big[x\sin x\Big]_a^b - \int_a^b \sin x\,dx \\
&= \Big[x\sin x + \cos x\Big]_a^b \\
&= (b\sin b + \cos b) - (a\sin a + \cos a).
\end{aligned}$$
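The result of the integration by parts can be compared with a direct numerical integration of x cos x. This is a sketch under stated assumptions: the limits a = 0.5 and b = 2.0 are arbitrary, the midpoint rule is our choice of numerical method, and the function names are hypothetical.

```python
import math


def midpoint_integral(f, a, b, n=10000):
    """Midpoint-rule approximation of the definite integral of f over [a, b]."""
    w = (b - a) / n
    return sum(f(a + (i + 0.5) * w) for i in range(n)) * w


def by_parts_result(a, b):
    """(b sin b + cos b) - (a sin a + cos a), the closed form from the example."""
    def antiderivative(x):
        return x * math.sin(x) + math.cos(x)
    return antiderivative(b) - antiderivative(a)


a, b = 0.5, 2.0  # arbitrary integration limits
numeric = midpoint_integral(lambda x: x * math.cos(x), a, b)
print(numeric, by_parts_result(a, b))
```

Agreement of the two printed values for several choices of a and b is a useful sanity check on the signs, which are easy to get wrong when integrating by parts.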
Example
Evaluate the integral ∫ 1/(x^2 + x) dx.
First, we note that the denominator x^2 + x can be factored as x(x+1). Using partial
fraction decomposition, we get

$$\begin{aligned}
\int \frac{1}{x^2 + x}\,dx &= \int \frac{1}{x(x + 1)}\,dx \\
&= \int \left(\frac{1}{x} - \frac{1}{x + 1}\right)dx \\
&= \ln x - \ln(x + 1) + c \\
&= \ln\frac{x}{x + 1} + c
\end{aligned}$$

where we have split the integral of the difference into a difference of integrals and
used the fact that ln(a/b) = ln a − ln b. However, in general, we need to be careful,
because the logarithm is not defined for negative arguments.
Taylor approximation
A very useful application of derivatives and integrals is Taylor’s theorem. Taylor’s theorem
provides an approximation to a function in the vicinity of a given point x_0 as a
sum. Taylor’s theorem requires that the function f(x) is continuous and that all of the
derivatives f'(x), f''(x), ..., up to order f^(n)(x) exist in order to generate an nth degree
polynomial approximation to f(x) near x_0. Using equation 1.21 we can write

$$\int_a^{a + \epsilon} f'(x)\,dx = f(a + \epsilon) - f(a) \tag{1.23}$$

$$f(a + \epsilon) = f(a) + \int_a^{a + \epsilon} f'(x)\,dx. \tag{1.24}$$

Taylor's Theorem: The theorem is named after Brook Taylor, who expressed this relationship
in 1712.
Assuming that ϵ is very small, we can assume f'(x) is approximately equal to f'(a), and
hence

$$f(a + \epsilon) \approx f(a) + \epsilon f'(a) \tag{1.25}$$

holds. We can express this in terms of x and a, assuming that we stay close to the point
a, to get the approximation

$$f(x) \approx f(a) + (x - a)\,f'(a). \tag{1.26}$$
The approximation given by equation 1.26 is called the linear approximation to f(x)
near x = a. It is the tangent line approximation to the function f. By using more information
about f, namely by constructing a function that also agrees with f on higher
order derivatives at x = a, we can obtain an even better approximation. That is the
general idea of the Taylor approximation of degree n. Because f is n times differentiable, we
can apply the approximation to each of the derivatives of f to obtain

$$f'(x) \approx f'(a) + (x - a)\,f''(a),$$

$$f''(x) \approx f''(a) + (x - a)\,f'''(a),$$

and similarly,

$$f^{(n-1)}(x) \approx f^{(n-1)}(a) + (x - a)\,f^{(n)}(a).$$
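We can now substitute the estimate of f'(x) into equation 1.24 and obtain

$$\begin{aligned}
f(a + \epsilon) &\approx f(a) + \int_a^{a + \epsilon} \left[f'(a) + (x - a)\,f''(a)\right]dx \\
&= f(a) + \epsilon f'(a) + \frac{\epsilon^2}{2}\,f''(a).
\end{aligned}$$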
This process can be repeated iteratively as long as the higher order derivatives exist,
which yields the nth-degree Taylor polynomial approximation. Expressing again in terms
of x and a, we can write:

$$f(x) \approx f(a) + (x - a)\,f'(a) + \frac{(x - a)^2}{2!}\,f''(a) + \cdots + \frac{(x - a)^n}{n!}\,f^{(n)}(a). \tag{1.27}$$
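Equation 1.27 can be evaluated directly once the derivatives at the expansion point are known. The sketch below is illustrative: `taylor_polynomial` is a hypothetical helper, and the exponential function expanded about a = 0 is our own choice of example, picked because every derivative of exp at 0 equals 1.

```python
import math


def taylor_polynomial(derivatives_at_a, a, x):
    """Evaluate equation 1.27 given the list [f(a), f'(a), ..., f^(n)(a)]."""
    return sum(
        d * (x - a) ** k / math.factorial(k)
        for k, d in enumerate(derivatives_at_a)
    )


# exp is its own derivative, so all derivatives at a = 0 are equal to 1.
x = 0.5
for n in (1, 2, 4, 8):
    approx = taylor_polynomial([1.0] * (n + 1), 0.0, x)
    print(n, approx, abs(approx - math.exp(x)))
```

The error column shrinks quickly with the degree n, which illustrates why agreeing with f on more derivatives at x = a gives a better approximation near a.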
function. These derivatives are called partial derivatives indicating that we only
observe the “partial” change of the function along one of the variables. Similar to the
definition of the derivative with respect to a single variable (equation 1.2), we define
the partial derivative to be:
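$$\frac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x, y) - f(x, y)}{\Delta x}, \tag{1.28}$$

and

$$\frac{\partial f}{\partial y} = \lim_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x, y)}{\Delta y}, \tag{1.29}$$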
where the symbol ∂ indicates that this differentiation is performed partially with
respect to a single variable while the other variables are kept constant. To make this
explicit, it is often written as

$$\left(\frac{\partial f}{\partial x}\right)_y \quad \text{and} \quad \left(\frac{\partial f}{\partial y}\right)_x \tag{1.30}$$

in order to indicate which variable is considered in the derivative (the one in the partial
derivative expression) and which is kept constant (the one outside the parentheses).
Just as there are many notations for the derivative in the single variable case,
there are also many ways to indicate partial derivatives. The following are some common
short-hand notations for the partial derivative of f with respect to x:

$$\frac{\partial f}{\partial x} = f_x = \partial_x f. \tag{1.31}$$
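Higher order partial derivatives are calculated in the same way, provided that the relevant
limits exist. Some possibilities in the case of two variables are

$$\frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right) = \frac{\partial^2 f}{\partial x^2} = f_{xx},$$

$$\frac{\partial}{\partial y}\left(\frac{\partial f}{\partial y}\right) = \frac{\partial^2 f}{\partial y^2} = f_{yy},$$

$$\frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right) = \frac{\partial^2 f}{\partial x \partial y} = f_{xy}, \text{ and}$$

$$\frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right) = \frac{\partial^2 f}{\partial y \partial x} = f_{yx}.$$

$$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}$$

holds.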
Example
$$\frac{\partial f}{\partial x} = 6xy^2.$$

For the partial derivative with respect to y, we now treat x as a constant and find

$$\frac{\partial f}{\partial y} = 6x^2 y + 1.$$
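Partial derivatives can be approximated numerically by varying one argument while holding the other fixed. In the sketch below, the function f(x, y) = 3x^2 y^2 + y is an assumption on our part: it is one function whose partial derivatives match the two results above, and it is not taken from the text. The helper names and the evaluation point are likewise our own.

```python
def f(x, y):
    # Hypothetical function, chosen so that df/dx = 6xy^2 and df/dy = 6x^2 y + 1.
    return 3 * x**2 * y**2 + y


def partial_x(func, x, y, h=1e-6):
    """Symmetric difference quotient in x with y held constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Symmetric difference quotient in y with x held constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x, y = 1.5, -2.0  # arbitrary evaluation point
print(partial_x(f, x, y), 6 * x * y**2)       # numerical vs. analytic df/dx
print(partial_y(f, x, y), 6 * x**2 * y + 1)   # numerical vs. analytic df/dy
```

Each line prints a numerical estimate next to the analytic expression; the pairs should agree up to a small discretization error.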
Total Differential
The definition of the partial derivatives allows us to examine the rate of change of a
function along, for example, the x or y axes. We now want to investigate the rate of
change if we move in any direction in the domain.
In a case where we have functions of two variables x and y, we move Δx in the x direction
and Δy in the y direction. Following the approach we have taken previously, we can
evaluate

$$\begin{aligned}
\Delta f &= f(x + \Delta x, y + \Delta y) - f(x, y) \\
&= f(x + \Delta x, y + \Delta y) - f(x, y + \Delta y) + f(x, y + \Delta y) - f(x, y) \\
&= \left[\frac{f(x + \Delta x, y + \Delta y) - f(x, y + \Delta y)}{\Delta x}\right]\Delta x + \left[\frac{f(x, y + \Delta y) - f(x, y)}{\Delta y}\right]\Delta y
\end{aligned}$$

where we have performed the algebraic trick of adding and subtracting the same term,
namely −f(x, y+Δy) + f(x, y+Δy) = 0, in the middle step in order to factor into the
desired quotients, and we have also multiplied by Δx/Δx = 1 and Δy/Δy = 1. The term in
the first square brackets describes the change of the function f(x, y) if we move a step
Δx in the x direction; the term in the second square brackets describes the corresponding
change in the y direction. If we let Δx → 0 and Δy → 0, the terms in the square brackets
become the partial derivatives defined in equations 1.28 and 1.29, and we obtain the total
differential of a function f(x, y), which is then given by

$$df = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy. \tag{1.32}$$
$$df = \frac{\partial f}{\partial x_1}\,dx_1 + \frac{\partial f}{\partial x_2}\,dx_2 + \cdots + \frac{\partial f}{\partial x_n}\,dx_n. \tag{1.33}$$
Chain Rule
For example, in the case of a function f(x, y), the variables x and y are now functions of
another variable u and we wish to find the derivative with respect to u, i.e. df/du. Starting
from the total differential in equation 1.32 we obtain

$$\frac{df}{du} = \frac{\partial f}{\partial x}\,\frac{dx}{du} + \frac{\partial f}{\partial y}\,\frac{dy}{du}.$$

The same approach can be taken if the functions are nested in more than one level, i.e.
instead of f(u(x)) one might have f(u(v(x))), and the chain rule can be used to calculate
the derivative, e.g.

$$\frac{df(u(v(x)))}{dx} = \frac{\partial f}{\partial u}\,\frac{\partial u}{\partial v}\,\frac{dv}{dx}. \tag{1.34}$$
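The chain rule for a function f(x, y) whose arguments both depend on a parameter u can be checked numerically. The following sketch rests on our own assumptions: f(x, y) = x^2 y with x = cos u and y = sin u is a hypothetical example, not one from the text, and the function names are ours.

```python
import math


def f(x, y):
    """Hypothetical example function f(x, y) = x^2 * y."""
    return x**2 * y


def df_du_chain_rule(u):
    """df/du = (df/dx)(dx/du) + (df/dy)(dy/du) for x = cos u, y = sin u."""
    x, y = math.cos(u), math.sin(u)
    df_dx, df_dy = 2 * x * y, x**2              # partial derivatives of f
    dx_du, dy_du = -math.sin(u), math.cos(u)    # derivatives of the inner functions
    return df_dx * dx_du + df_dy * dy_du


def df_du_numeric(u, h=1e-6):
    """Symmetric difference quotient of the composite function u -> f(x(u), y(u))."""
    def composite(t):
        return f(math.cos(t), math.sin(t))
    return (composite(u + h) - composite(u - h)) / (2 * h)


u = 0.7  # arbitrary parameter value
print(df_du_chain_rule(u), df_du_numeric(u))
```

The two printed values should agree to several decimal places, since both compute the rate of change of f along the curve (x(u), y(u)).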
$$S = \sum_{p=1}^{N} f(x_p, y_p)\,\Delta A_p$$

to express the approximate volume, where ΔA_p is the area of the base and f(x_p, y_p) is
the height of cell p. Again, we consider many areas like this, i.e. we let N → ∞, implying
ΔA_p → 0. Similar to the case of a single variable, if the above sum has a finite limit or
value, we say that this limit is the value of the double integral of f(x, y) over some
region R:

$$I = \int_R f(x, y)\,dA, \tag{1.35}$$

where dA is an infinitesimally small area in the x, y plane where the function f(x, y) is
evaluated. So far, we have not made any assumption about the small area ΔA considered
in the above sum. If we choose small rectangles in x and y direction, we can write
ΔA = ΔxΔy, and when Δx → 0 and Δy → 0, we can write

$$I = \iint_R f(x, y)\,dx\,dy \tag{1.36}$$
as the double integral. For such integrals, it sometimes matters whether we integrate
with respect to x or to y first. It is frequently helpful to draw a picture to see which
variable could be taken more easily to depend on the other. If x can be easily expressed as
a function of y, we might choose to take small areas in the direction of width dy first.
That gives us

$$I = \int_{y=c}^{y=d} \left[\int_{x=x_1(y)}^{x=x_2(y)} f(x, y)\,dx\right] dy. \tag{1.37}$$

In this case, the bounds of the inner integral are the parametrization of the boundary
curve C, expressed as x = x_1(y) and x = x_2(y). In the first step, y is treated as a constant
as the inner integral over x is evaluated. The next step of the computation, the
outer integral, is evaluated between the bounds y = c and y = d just as in the single
variable case as there are no x’s left in the expression.
Alternatively, we can first evaluate the integral over y and then over x, as

$$I = \int_{x=a}^{x=b} \left[\int_{y=y_1(x)}^{y=y_2(x)} f(x, y)\,dy\right] dx. \tag{1.38}$$
Example
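Evaluate the integral I = ∬_R x^2 y dx dy where R is given by a triangular area bounded
by x = 0, y = 0, x + y = 1.
First, we carry out the integration over y, which means that we keep x fixed. In this
case, the limits on y are y = 0 and y = 1 − x. Given the constraint x + y = 1, the
maximum value of x is x = 1 for y = 0. The integral is then written as

$$I = \int_{x=0}^{x=1} \left[\int_{y=0}^{y=1-x} x^2 y\,dy\right] dx.$$

$$\int_{y=0}^{y=1-x} x^2 y\,dy = \left[\frac{1}{2}x^2 y^2\right]_{y=0}^{y=1-x} = \frac{1}{2}x^2 (1 - x)^2.$$

$$\begin{aligned}
\int_{x=0}^{x=1} \frac{1}{2}x^2 (1 - x)^2\,dx
&= \frac{1}{2}\int_{x=0}^{x=1} x^2\,dx - \frac{1}{2}\int_{x=0}^{x=1} 2x^3\,dx + \frac{1}{2}\int_{x=0}^{x=1} x^4\,dx \\
&= \frac{1}{2}\left[\frac{1}{3}x^3\right]_0^1 - \left[\frac{1}{4}x^4\right]_0^1 + \frac{1}{2}\left[\frac{1}{5}x^5\right]_0^1 \\
&= \frac{1}{6} - \frac{1}{4} + \frac{1}{10} \\
&= \frac{1}{60}.
\end{aligned}$$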
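The value 1/60 can be cross-checked with a nested numerical integration that mirrors the order used above: an inner sum over y from 0 to 1 − x for fixed x, then an outer sum over x from 0 to 1. This is a minimal sketch; the function name `double_integral_triangle`, the midpoint rule, and the grid size are our own choices.

```python
def double_integral_triangle(f, n=200):
    """Midpoint-rule estimate of the integral of f over the triangle x >= 0, y >= 0, x + y <= 1."""
    wx = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * wx            # x is held fixed for the inner integral
        y_max = 1.0 - x               # upper limit of y from the line x + y = 1
        wy = y_max / n
        inner = sum(f(x, (j + 0.5) * wy) for j in range(n)) * wy
        total += inner * wx           # outer integral over x
    return total


estimate = double_integral_triangle(lambda x, y: x**2 * y)
print(estimate, 1 / 60)
```

Note how the upper limit of the inner sum depends on x, exactly as the inner integral's bound y = 1 − x does in the worked example.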
In case of more than two variables, the same notation can be extended accordingly,
such as

$$\iiint_V f(x, y, z)\,dx\,dy\,dz \tag{1.39}$$

where, in the case of three variables, we integrate over a specific volume rather than an
area.
This is the idea behind calculus of variations. In most cases, we want to minimize or
maximize a given quantity that depends on a family of input functions; the calculus of
variations provides a method for finding a function f(x) which yields the extreme value.
As a concrete example, we could imagine a rope that is attached to two points, A and
B, as shown in the following figure, but otherwise hangs freely under the influence of
gravity. We expect that the rope will hang down in a shape such as the one indicated
by the solid line, and not take any other shape (e.g., those suggested by the two dotted
lines), as long as there is no external force other than gravity and all initial
motion has come to rest. In this example, the rope is fixed at the points A and B, so
we have two constraints (not including the length of the rope, which we take as
constant). As the gravitational force acts on each part of the rope, the rope will take the
shape where the total potential energy, expressed by the integral over all small
segments of the rope, is minimal. We wish to find the function y(x) that
describes the shape of the hanging rope with the minimal potential energy.

Gravity
Gravity is one of the natural forces caused by the mass of objects, resulting in them
being pulled towards each other.
\[
I = \int_a^b F(y, y', x) \, dx , \tag{1.40}
\]
where a, b, and F are given by the nature of the problem we wish to consider. This inte-
gral depends on the function y(x). In the example of the rope, the limits a and b of the
integral are fixed: they correspond to the endpoints of the rope at which the rope is
attached, for example, to two poles.
We call such functions (ones that take other functions as their input and result in a
scalar as their output) functionals. Here, I is a functional of y(x), which we denote by
\[
I = I[y(x)] . \tag{1.41}
\]
We use square brackets to indicate that I is a functional rather than a function on ℝⁿ.
We then look for the curves y(x), which are the stationary value(s) of the integral I, and
determine whether such curves are extrema of the integral. The integral may have one
or more stationary points.
A stationary point y(x) of the functional I[y(x)] is a point where the functional I does
not change if the y(x) is perturbed by a small amount. In the case of the rope, this
would be the function that describes the physical shape the rope takes if we fix it at
two points and let it hang under the influence of gravity. Because y(x) is a stationary
point of the integral I[y(x)], if we change
\[
y(x) \to y(x) + \epsilon \, \eta(x) \tag{1.42}
\]
by a small amount ε using any (sufficiently well-behaved) function η(x), we require that
the value of I does not change, i.e.,
\[
\left. \frac{dI}{d\epsilon} \right|_{\epsilon = 0} = 0 \quad \forall \, \eta(x) . \tag{1.43}
\]
We now insert the above equation 1.42 into the integral in equation 1.40:
\[
I[y(x), \epsilon] = \int_a^b F(y + \epsilon\eta, \; y' + \epsilon\eta', \; x) \, dx .
\]
We generally assume that all functions are well behaved, especially when considering
situations related to physical examples.
We have already encountered Taylor series for the case of the single variable in
Eqn.(1.27) and used this to expand a function into a series around some point. This
approach can be generalized to several variables. For example, for a function that
depends on two variables x and y, we can write the corresponding second degree
Taylor polynomial as:
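\[
f(x, y) = f(x_0, y_0) + \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y
+ \frac{1}{2!} \left( \frac{\partial^2 f}{\partial x^2} \Delta x^2 + 2 \frac{\partial^2 f}{\partial x \partial y} \Delta x \Delta y + \frac{\partial^2 f}{\partial y^2} \Delta y^2 \right) ,
\]
where all derivatives are evaluated at the point (x0, y0). The same expression can be written more compactly as
\[
f(x, y) = f(x_0, y_0) + \left( \Delta x \frac{\partial}{\partial x} + \Delta y \frac{\partial}{\partial y} \right) f(x, y)
+ \frac{1}{2!} \left( \Delta x \frac{\partial}{\partial x} + \Delta y \frac{\partial}{\partial y} \right)^2 f(x, y) .
\]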
Extending to higher derivatives, we can write the Taylor series for a function of two
variables as:
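\[
f(x, y) = \sum_{n=0}^{\infty} \frac{1}{n!} \left[ \left( \Delta x \frac{\partial}{\partial x} + \Delta y \frac{\partial}{\partial y} \right)^n f(x, y) \right]_{x_0, y_0}
\]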
We can further generalize this to any number of variables denoted by the vector x :
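\[
f(\mathbf{x}) = f(\mathbf{x}_0) + \sum_i \frac{\partial f}{\partial x_i} \Delta x_i + \frac{1}{2!} \sum_{i, j} \frac{\partial^2 f}{\partial x_i \partial x_j} \Delta x_i \Delta x_j + \cdots
\]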
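To see how well a second-degree Taylor polynomial in two variables approximates a function, one can compare it with the exact value for shrinking step sizes. The sketch below uses the illustrative function f(x, y) = eˣ cos(y) expanded around (0, 0); the function and the names `f` and `taylor2` are assumptions chosen for this example, with the derivatives at the origin worked out by hand.

```python
import math


def f(x, y):
    """Example function (an illustrative choice): f(x, y) = exp(x) * cos(y)."""
    return math.exp(x) * math.cos(y)


def taylor2(dx, dy):
    """Second-degree Taylor polynomial of f around (x0, y0) = (0, 0)."""
    f0 = 1.0                          # f(0, 0)
    fx, fy = 1.0, 0.0                 # first partial derivatives at (0, 0)
    fxx, fxy, fyy = 1.0, 0.0, -1.0    # second partial derivatives at (0, 0)
    return (f0 + fx * dx + fy * dy
            + 0.5 * (fxx * dx * dx + 2.0 * fxy * dx * dy + fyy * dy * dy))


# Approximation error for two step sizes along the diagonal dx = dy = d.
errors = {d: abs(f(d, d) - taylor2(d, d)) for d in (0.1, 0.01)}
for d, err in errors.items():
    print(f"step {d}: error {err:.2e}")
```

Reducing the step by a factor of ten reduces the error by roughly a factor of a thousand, as expected when the first neglected terms are of third order.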
Returning to the calculus of variations and the integral I[y(x), ε], we can use the Taylor
series with Δy = ϵη and Δy′ = ϵη′ and write the integral in the following way:
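\[
\begin{aligned}
I[y(x), \epsilon] &= \int_a^b F(y + \epsilon\eta, \; y' + \epsilon\eta', \; x) \, dx \\
&= \int_a^b F(y, y', x) \, dx + \int_a^b \left( \frac{\partial F}{\partial y} \epsilon\eta + \frac{\partial F}{\partial y'} \epsilon\eta' \right) dx + O(\epsilon^2) .
\end{aligned} \tag{1.44}
\]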
In the following, we ignore all terms of order ε2 and higher because ε is assumed to be
a very small number. This means we consider the equation
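\[
I[y(x), \epsilon] = \int_a^b F(y, y', x) \, dx + \int_a^b \left( \frac{\partial F}{\partial y} \epsilon\eta + \frac{\partial F}{\partial y'} \epsilon\eta' \right) dx .
\]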
Now, recall that when we introduced the small perturbation in y(x) in Equation 1.42, we
said that this should not change the integral because we are at a stationary point. We
expressed this more formally in Equation 1.43, where we demand that the integral I
does not change if we change y a little bit by the term εη(x) for any choice of η.
This then implies that the second term must be equal to zero for any choice of η(x),
because ε is a small (but non-zero) number and we do not make any demands of the
function η(x) except that it be sufficiently well behaved, so we can take its derivative,
integrate it, and so on. Then, because we demand that this holds for any small pertur-
bation, the second part in the equation above must vanish, which we can write as
\[
\delta I = \int_a^b \left( \frac{\partial F}{\partial y} \eta + \frac{\partial F}{\partial y'} \eta' \right) dx = 0 , \tag{1.45}
\]
where the notation δI is used to indicate the variation in the functional I[y(x)] due to
the change in y(x) → y(x) + εη(x). Furthermore, ε is a small but non-zero number and
can therefore be omitted from the above equation.
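Integrating the second term by parts gives
\[
\int_a^b \frac{\partial F}{\partial y'} \eta' \, dx = \left[ \eta \frac{\partial F}{\partial y'} \right]_a^b - \int_a^b \eta \, \frac{d}{dx} \frac{\partial F}{\partial y'} \, dx , \tag{1.46}
\]
so that the condition δI = 0 becomes
\[
\left[ \eta \frac{\partial F}{\partial y'} \right]_a^b + \int_a^b \left( \frac{\partial F}{\partial y} - \frac{d}{dx} \frac{\partial F}{\partial y'} \right) \eta(x) \, dx = 0 . \tag{1.47}
\]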
We now impose the constraint that the endpoints a and b are fixed, as are y(a) and
y(b) – recalling our initial example of the freely hanging rope under the influence of
gravity, where the rope is fixed at its two attached points.
Since y(a) and y(b) are fixed, we also require that, at these points, η(a) = 0 and
η(b) = 0: if we “wiggle” the rope a bit, i.e., change y(x), the endpoints
remain unchanged. It follows that the first term in the above equation vanishes. Since
equation 1.47 must be equal to zero for any choice of η(x), this implies that the
function in the integral must be zero, namely that
\[
\frac{\partial F}{\partial y} = \frac{d}{dx} \frac{\partial F}{\partial y'} . \tag{1.48}
\]
Example
Show that the shortest path between two points is a straight line.
We start by specifying the initial and final points that will be connected by an
arbitrary path: the initial point A is given by the coordinates (a, y(a)) and the
final point B is given by the coordinates (b, y(b)), as shown below:
For any small segment of the path, the length can be approximated by a straight
line using the distance formula
\[
ds = \sqrt{(dx)^2 + (dy)^2} ,
\]
where we assume that dx and dy are small enough to justify the approximation of
the small triangle for ds. Factoring out dx, the equation above can be written as
\[
ds = \sqrt{1 + y'^2} \, dx . \tag{1.49}
\]
The total length L of the path is then
\[
L = \int ds = \int_a^b \sqrt{1 + y'^2} \, dx , \tag{1.50}
\]
where the integration takes place along the path between the two points. We now
calculate the path which leads to a stationary point for L, in this case a minimum
which gives the shortest connection between the points A and B. We start from the
Euler-Lagrange equation 1.48 and note that the function in the integral L does not
depend on y explicitly. This implies that
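\[
\frac{\partial F}{\partial y} = 0 ,
\]
so that equation 1.48 reduces to
\[
\frac{d}{dx} \frac{\partial F}{\partial y'} = 0 , \tag{1.51}
\]
which means that
\[
\frac{\partial F}{\partial y'} = c , \tag{1.52}
\]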
for some constant c. We now take the derivative of the function F = √(1 + y′²) with
respect to y′ and obtain
\[
c = \frac{\partial F}{\partial y'} = \frac{y'}{\sqrt{1 + y'^2}} , \tag{1.53}
\]
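which we now solve for dy so that we can integrate both sides and obtain an explicit
formula for y as follows:
\begin{align}
c &= \frac{y'}{\sqrt{1 + y'^2}} \tag{1.54} \\
c^2 &= \frac{y'^2}{1 + y'^2} \tag{1.55} \\
c^2 \left( 1 + y'^2 \right) &= y'^2 \tag{1.56} \\
c^2 &= y'^2 - c^2 y'^2 \tag{1.57} \\
&= \left( 1 - c^2 \right) y'^2 \tag{1.58} \\
c &= \sqrt{1 - c^2} \, \frac{dy}{dx} \tag{1.59} \\
\frac{c}{\sqrt{1 - c^2}} \, dx &= dy . \tag{1.60}
\end{align}
Integrating both sides, we obtain
\[
y = \frac{c}{\sqrt{1 - c^2}} \, x + k \tag{1.61}
\]
for some constant k by noting that the term c/√(1 − c²) is constant and ∫dx = x.
Equation 1.61 is the equation of a straight line, which completes the argument.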
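The result can be illustrated numerically by perturbing the straight line with a function η(x) that vanishes at both endpoints and comparing path lengths. This is a minimal sketch under assumed values: the endpoints A = (0, 0) and B = (1, 2), the sine-shaped perturbation, and the helper names `path_length`, `straight`, and `eta` are all illustrative choices.

```python
import math


def path_length(y, a, b, n=2000):
    """Approximate L = integral of sqrt(1 + y'^2) dx by summing the lengths
    of n short straight segments along the curve y(x)."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x0 = a + i * h
        x1 = a + (i + 1) * h
        total += math.hypot(x1 - x0, y(x1) - y(x0))
    return total


a, b = 0.0, 1.0        # fixed endpoints in x (assumed values)
ya, yb = 0.0, 2.0      # fixed values y(a) and y(b) (assumed values)


def straight(x):
    """Straight line through (a, ya) and (b, yb)."""
    return ya + (yb - ya) * (x - a) / (b - a)


def eta(x):
    """Perturbation that vanishes at both endpoints: eta(a) = eta(b) = 0."""
    return math.sin(math.pi * (x - a) / (b - a))


# Length of the perturbed path y(x) + eps * eta(x) for several values of eps.
lengths = {
    eps: path_length(lambda x, e=eps: straight(x) + e * eta(x), a, b)
    for eps in (-0.2, -0.1, 0.0, 0.1, 0.2)
}
for eps, length in lengths.items():
    print(f"eps = {eps:+.1f}: L = {length:.6f}")
```

The unperturbed path (eps = 0) has the smallest length, √5 for these endpoints, and any nonzero perturbation of this form makes the path longer.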
Summary
In this unit, we have seen functions of a single variable f(x), as well as multivariate
functions, such as f(x, y). Differentiation is a tool for studying the rate of change of
a function with respect to a given variable. In the case of multivariate functions, the
partial derivatives indicate how much the function changes along the x- or y-axis,
for example, while the total differential extends this idea to the rate of change of a
function in any arbitrary direction. Integration of the functions of one variable was
introduced as the area enclosed by the function and can be interpreted as the
inverse of the derivative. The integral is therefore often called the antiderivative.
The Taylor expansion can be used to approximate a given function at a specific
point. Finally, the calculus of variations extends the concepts of differentiation and
integration to functionals, i.e., functions whose inputs are themselves functions.
Knowledge Check
You can check your understanding by completing the questions for this unit on the
learning platform.
Good luck!
Unit 2
Integral Transformations
STUDY GOALS
… how to express time domain and frequency domain functions using Fourier
transformations.
DL-E-DLMDSAM01-L02
2. Integral Transformations
Introduction
Integral transformations play an important role in analyzing, manipulating, and
transforming signals. This unit focuses on two transformations which are of great
importance in practical applications: convolutions and Fourier transformations. Convolutions
describe how two functions interact with each other, for example, how the finite resolu-
tion of a sensor or measuring device impacts the value of the quantity measured by
this device. Fourier series and Fourier transformations are used to analyze and
describe periodic signals. This formalism allows us to express the observed signals as a
superposition of signals of different frequencies and intensities and allows us to switch
between equivalent descriptions in the (observed) time domain and in the frequency
domain. This can make the treatment of signals much easier to handle as some trans-
formations or filters are more easily applied in one domain than in the other.
A good textbook that covers this subject area is Signals & Systems (Oppenheim et al.,
1997).
2.1 Convolutions
In order to measure any quantity, we must rely on a measurement device. For example,
to measure a temperature, we use a thermometer which tells us the temperature of the
substance we want to investigate. This simple picture is not quite correct, however. The
measurement does not reflect the actual “true” physical quantity (such as the
temperature) but is distorted by the intrinsic resolution of the measuring instrument. It is
important to keep in mind that the measured quantity, for example the temperature,
does not exist as an abstract quantity but is always related to a real, physical system.
As such, there is ultimately no single value associated with this property in the
mathematical sense; instead, it is always governed by probabilities and probability distributions.
This implies that if we keep repeating the same measurement, we will get slightly
different numeric values for the same “true” physical value, with a spread that is determined by the
intrinsic resolution of the device. Examples illustrating these possibilities for such
resolutions are shown in the following figure.

Resolution
Detectors are not infinitely precise but can only measure a quantity up to a certain
precision determined by the resolution of the detector.
A good thermometer may have a high resolution and return an unbiased measurement.
This means that the thermometer does not shift the “true” value, but rather returns a
value which randomly fluctuates slightly around the true value. The measurements of a
thermometer with this property are indicated by the red dashed line. By repeating the
measurement many times, it is possible to determine the resolution of the instrument
and hence the intrinsic volatility of the measurements made with this thermometer.
The black solid line, on the other hand, illustrates the readings of a thermometer with
a lower resolution; the values still fluctuate randomly around the “true” value, but due
to the lower resolution of the instrument, the fluctuations are stronger.
Finally, the blue dash-dotted line illustrates what happens if the measuring instrument
itself introduces a bias. In this case, the measured values are no longer “faithful” to the
“true” ones; the resolution of the instrument is asymmetric with long tails, indicating
that the intrinsic fluctuations become biased towards higher values.
To express the ideas of “faithful” and “true” more mathematically, we note that we need
to have a function f(x) that represents the true values of the substance being
measured. It is important to note that the vast majority of systems, objects, and processes in
our world are stochastic. This means that any value or number we observe is random,
but follows a distinct probability distribution that is governed by a specific process
relevant for this system. There are, of course, notable exceptions to this, otherwise it
would be difficult to implement a clock. However, from the atomic scale to everyday
situations such as the speed of wind or the shopping behavior of customers in a
supermarket, everything needs to be calculated in terms of probabilities and probability
distributions. In the example above, f(x) would describe the distribution of the actual
temperature of the underlying physical process relevant for the substance we want to
examine.

Stochastic
Stochastic systems are governed by the laws of probability, as opposed to
deterministic systems that can be calculated precisely.

The instrument itself is represented by a resolution function g(y) that determines how
the “true” values are observed, e.g. biased or unbiased with high or low resolution. The
measurements themselves are represented by some function h(z), which depends on
both the actual underlying state as described by f(x) and the resolution function g(y).
The variables x, y, and z all describe the same quantity, which, in our example, is the
temperature. However, they each enter the consideration at a different point: x is the
“true” value, y the resolution, and z what we finally observe. Effectively, the resolution
introduces a systematic error in the observation of the “true” value. If the measurement
device is biased, this error will include a shift that is more probable in one direction
compared to the other. In the example shown in the previous figure, the resolution
function is biased towards larger, more positive values and hence, more positive values
will be observed compared to negative values.
We now build a more detailed intuition about how convolutions work. Since, in general,
the true value x is a random number taken from a distinct probability distribution, we
know that the probability of getting x precisely is zero. Instead, we can calculate the
probability of getting the true value in the interval (x, x+dx) using the distribution f(x).
As f(x) defines the probability for any x, this is given by f(x)dx. Next, because we need a
device to measure this value, we need to include the resolution function g(y) so that
we finally observe the value z. The measurement instrument will generally shift the true
value to the observed value, hence we do not observe x, but rather z, which is shifted
by the amount z − x, i.e. the difference between them. Note that, in general, we cannot
know the true value of x. In case of an unbiased instrument, the resolution will “smear”
the true values so that the resulting distribution is broader than the true, physical one.
In case of a biased instrument, this step will also include a further shift in one direc-
tion. Hence, the original interval dx gets transformed into the interval dz and the over-
all effect of the instrument is given by g(z − x)dz. We can then combine the probabili-
ties for the true value and the instrument and obtain f(x)dxg(z − x)dz for a particular
observation. To express this for any value, we need to integrate over all possible values
of x to obtain the distribution for all observable values z
\[
(f * g)(z) = h(z) = \int_{-\infty}^{\infty} f(x) \, g(z - x) \, dx . \tag{2.1}
\]
Equation 2.1 is called the convolution of the functions f and g and is typically denoted
by f ∗ g. Two examples of convolutions of two simple functions are shown in the fol-
lowing figure. In both cases, two uniform functions are nonzero over a small interval
and are convolved with each other. In the top instance, the two functions are separated
from each other, meaning that there is no portion of the domain where they are both
nonzero. In the other case, the intervals at which the two functions are nonzero
overlap. The convolutions are shown in the right-hand column of graphs. Note that the
shape of the convolved function is not the same as the shape of the original
functions; the two uniform functions gain a “triangular” shape when convolved.
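The triangular shape can be reproduced with a discrete version of equation 2.1, in which the integral is replaced by a sum over sampled values. The following is a minimal sketch; the helper name `convolve` and the four-sample box are illustrative assumptions, and the factor dx that would turn the sum into an approximation of the integral is omitted for clarity.

```python
def convolve(f, g):
    """Full discrete convolution h[k] = sum over i of f[i] * g[k - i]."""
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj      # the term f[i] * g[k - i] with k = i + j
    return h


box = [1.0, 1.0, 1.0, 1.0]           # a uniform function sampled at four points
triangle = convolve(box, box)
print(triangle)                      # [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]
```

The output rises linearly to a peak and falls off again, which is the triangular shape described above.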
Most functions are not uniform functions. An example with functions more like those
we are interested in studying is shown in the following figure. In this case, the “true”
values are distributed according to a gamma (Γ) distribution (black, solid). The measurement device
is represented by an unbiased Gaussian resolution function with mean zero and stand-
ard deviation one. The observed values are then distributed according to the convolu-
tion of both functions as shown in the graph on the right.
This illustrates a typical pitfall when dealing with measured values. Although we know
that the “true” values are strictly positive (as illustrated by the black solid curve in the
left figure), the observed values can also be negative due to the finite resolution of the
measurement instrument. Depending on the concrete problem, these values need to be
treated with extra consideration as they may violate physical boundaries.
Convolutions play an important part in image processing and are at the core of (con-
ventional) image filters and convolutional neural networks. Instead of interpreting f and
g as a (true) signal and resolution function, one of the functions, for example f, repre-
sents the image and the other (g) represents a kernel that is used to operate on the
image. Given a suitable kernel, this defines a filter that can be used for a wide range of
applications such as blurring, sharpening, and edge detection. The kernel is often
denoted by K or ω.
In this section, we will discuss how convolution can be used to blur an image. To do so,
each part of the image is convolved with a Gaussian filter in the x and y directions, in
particular we apply a two-dimensional (or multivariate) Gaussian to each part of the
image. The two-dimensional function is illustrated in the following figure. In the case of
images, we work with discrete data as the images can be represented by collections of
individual pixels of the form (x, y, r, g, b), where x and y give the position of the pixel
in the image and r, g, and b give the relative levels of red, green, and blue in that pixel.
The continuous convolution integral in equation 2.1 then becomes, in the discrete case,
the summation
\[
(K * f)(x, y) = \sum_{i=-a}^{a} \sum_{j=-b}^{b} K(i, j) \, f(x - i, y - j) . \tag{2.2}
\]
In the case of Gaussian blurring, the kernel is given by the matrix in equation 2.3, which
is applied to each part of the image. In this way, the value to which each pixel is trans-
formed is influenced by its neighboring pixels where their relative weight is given by
the kernel. The following figure shows the effect of applying such a kernel to an image.
Special care needs to be taken at the edges of the image where the kernel can poten-
tially exceed the image boundaries. In these cases one can either extend the image by,
for example, repeating the outermost pixels, or one can crop the image so that the ker-
nel always fits inside the original image:
\[
K = \frac{1}{16} \begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix} . \tag{2.3}
\]
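A small sketch of how the kernel in equation 2.3 might be applied to a single-channel image, using the discrete sum in equation 2.2 and repeating the outermost pixels at the edges. The function name `blur` and the representation of the image as a list of rows are assumptions for illustration; a color image would be handled by applying the same steps to each of the r, g, and b channels.

```python
# 3x3 kernel from equation 2.3; the weights are divided by 16 when applied.
KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]


def blur(image):
    """Apply the 3x3 kernel to a single-channel image given as a list of rows.
    Pixels outside the image are replaced by the nearest edge pixel."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    rr = min(max(r - i, 0), rows - 1)   # clamp row index at the edges
                    cc = min(max(c - j, 0), cols - 1)   # clamp column index at the edges
                    acc += KERNEL[i + 1][j + 1] * image[rr][cc]
            out[r][c] = acc / 16.0
    return out


# A single bright pixel in the middle of a dark 5x5 image.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 16.0
blurred = blur(image)
for row in blurred:
    print(row)
```

The bright pixel is spread over its neighbors with the relative weights of the kernel, while the total brightness of the image is unchanged because the kernel weights sum to one.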
Fourier Series
We have previously seen how the Taylor expansion can be used to approximate a signal
or a function. However, Taylor expansions are not the only way to do this — there is
another way to look at functions, the Fourier series, that is particularly well suited to
periodic signals such as those found in a wide range of natural and engineering sys-
tems. The main idea of the Fourier series is to express the signal as a sum of sine and
cosine components of varying strength and frequency. In order to create a Fourier ser-
ies for a function, the signal must satisfy the following Dirichlet conditions:
The requirement that the function is periodic can be stated as the condition
f(x + L) = f(x) = f(x + nL); in other words, the function repeats itself after the full
period L has passed. This also implies that the signal has neither a beginning nor an
end. Note that in practice using 2L instead of L for a single full period is also common
notation. In practical applications, one has to consider how to deal with finite signals,
e.g. by assuming that the signal follows the same structure beyond the observed signal.
The Fourier series, the promised decomposition of a signal into a summation of sine
and cosine waves, is given by
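\[
f(x) = \frac{1}{2} a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos\left( \frac{2\pi n x}{L} \right) + b_n \sin\left( \frac{2\pi n x}{L} \right) \right] . \tag{2.4}
\]
For a period of L = 2π, this simplifies to
\[
f(x) = \frac{1}{2} a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos(nx) + b_n \sin(nx) \right] . \tag{2.5}
\]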
The figure below shows a periodic sinusoidal signal to which only one frequency
contributes, namely f(x) = sin(x). The left part of the figure shows the signal in the time
domain, i.e. what we would observe if we measured the signal at different points in
time. The right part of the figure shows which frequencies contribute to the signal, tell-
ing us which coefficients in the Fourier series are nonzero.
The next figure shows a slightly more complicated signal in which the signal with the
lower frequency (the same as in the previous figure) is overlaid with a signal that is ten
times faster. The resulting periodic signal as we would measure it, namely in the time
domain, is shown in the left part of the figure. The right part of the figure shows the
two frequencies contributing to this signal. The graphic titled “Periodic Sawtooth Sig-
nal” shows a more realistic signal that is frequently encountered in electrical engineer-
ing. During each period, the sawtooth signal rises linearly between the minimal and
maximal values, and drops to the minimum when the signal reaches the maximum. The
left part of the figure shows the measured signal in the time domain, and the right part
of the figure illustrates that many coefficients in the Fourier series are needed to build
up this more complex signal, and that the weight of each contribution decreases
with increasing frequency (for an ideal sawtooth, in proportion to 1/n).
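The slow decay of the sawtooth coefficients can be checked numerically. The sketch below assumes a sawtooth with f(x) = x on one period (−π, π) and uses the standard coefficient formula b_n = (1/π) ∫ f(x) sin(nx) dx for the series in equation 2.5, which is not stated explicitly in the text above; the cosine coefficients vanish for this odd function. The names `sawtooth`, `sine_coefficient`, and `partial_sum` are illustrative.

```python
import math


def sawtooth(x):
    """One period of a sawtooth signal on (-pi, pi) (an illustrative choice)."""
    return x


def sine_coefficient(n, samples=2000):
    """Midpoint-rule estimate of b_n = (1/pi) * integral of f(x) sin(n x) dx
    over one period from -pi to pi."""
    h = 2.0 * math.pi / samples
    total = 0.0
    for k in range(samples):
        x = -math.pi + (k + 0.5) * h
        total += sawtooth(x) * math.sin(n * x) * h
    return total / math.pi


def partial_sum(x, coefficients):
    """Evaluate the truncated series sum of b_n * sin(n x) for n = 1, 2, ..."""
    return sum(b * math.sin(n * x) for n, b in enumerate(coefficients, start=1))


b = [sine_coefficient(n) for n in range(1, 51)]
print([round(v, 4) for v in b[:4]])     # magnitudes fall off like 2/n
print(partial_sum(1.0, b))              # slowly approaches f(1.0) = 1.0
```

Even with fifty terms the partial sum only approximates the signal to within a few percent, which reflects how many coefficients a signal with sharp jumps requires.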
Recall Euler’s formula, which establishes the fundamental relationship between
trigonometric functions and complex-valued exponential functions. In many applications it
is helpful to use Euler’s insight to express the sine and cosine terms in the Fourier
series by exponential functions with imaginary arguments. To do so, we can use the
identities
\[
\sin(nx) = \frac{1}{2i} \left( e^{inx} - e^{-inx} \right) , \quad \text{and} \quad \cos(nx) = \frac{1}{2} \left( e^{inx} + e^{-inx} \right) . \tag{2.6}
\]

Euler's Formula
\( e^{i\Phi} = \cos\Phi + i \sin\Phi \)
Inserting these into equation 2.5, the Fourier series can be expressed as
\[
f(x) = \sum_{n=-\infty}^{\infty} c_n e^{inx} , \quad \text{with} \quad c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) \, e^{-inx} \, dx . \tag{2.7}
\]
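As an illustration of equation 2.7, the coefficients c_n of the single-frequency signal f(x) = sin(x) can be computed numerically. This is a minimal sketch; the helper name `fourier_coefficient` and the midpoint-rule discretization are assumptions made for this example.

```python
import cmath
import math


def fourier_coefficient(f, n, samples=1000):
    """Midpoint-rule estimate of c_n = (1 / (2 pi)) * integral of
    f(x) * exp(-i n x) dx from -pi to pi (equation 2.7)."""
    h = 2.0 * math.pi / samples
    total = 0j
    for k in range(samples):
        x = -math.pi + (k + 0.5) * h
        total += f(x) * cmath.exp(-1j * n * x) * h
    return total / (2.0 * math.pi)


coeffs = {n: fourier_coefficient(math.sin, n) for n in range(-3, 4)}
for n, c in coeffs.items():
    print(f"c_{n} = {c.real:+.3f} {c.imag:+.3f}i")
```

Only c_1 = 1/(2i) and c_{−1} = −1/(2i) are nonzero (up to rounding), in agreement with the identity for sin(nx) in equation 2.6.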
Fourier Transformations
In equation 2.7, we expressed the Fourier series as an infinite sum of complex func-
tions, still assuming for simplicity that the period L is equal to 2π. If we relax this
assumption, the Fourier series can be written as
\[
f(x) = \sum_{n=-\infty}^{\infty} c_n e^{i 2\pi n x / L} , \tag{2.8}
\]
or, writing ω_n = 2πn/L,
\[
f(x) = \sum_{n=-\infty}^{\infty} c_n e^{i \omega_n x} . \tag{2.9}
\]
Previously, we always interpreted the period L as some finite interval after which the
function repeats itself — indeed, this was one of our core assumptions when we dis-
cussed the Fourier series. We now consider the limit of large periods, i.e. L → ∞ where
the signal only repeats after a very long time. Instead of fixed frequencies in the sine
and cosine terms that we have considered so far, the difference in frequencies
Δω = 2π/L becomes infinitesimally small and the frequencies become a continuum
rather than the discrete values we saw in the first two examples in this section. We
recall from equation 2.7 that the coefficients cn are given by
\[
c_n = \frac{1}{L} \int_{x_0}^{x_0 + L} f(x) \, e^{-i 2\pi n x / L} \, dx = \frac{\Delta\omega}{2\pi} \int_{x_0}^{x_0 + L} f(x) \, e^{-i \omega_n x} \, dx \tag{2.10}
\]
where we have again allowed an arbitrary period L, and ω_n = 2πn/L is a discrete
function of the index n of the coefficients. In the integral limits, x0 is an arbitrary
constant which is often taken as -L/2. In fact, this is the reason that 2L is often used to
indicate a full period so that the limits of the integral can be written as -L for the lower,
and L for the upper limit instead of -L/2 and L/2. Substituting this expression for the
coefficients into the complex Fourier series, we get
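\[
\begin{aligned}
f(x) &= \sum_{n=-\infty}^{\infty} c_n e^{i \omega_n x} \\
&= \sum_{n=-\infty}^{\infty} \frac{\Delta\omega}{2\pi} \left[ \int_{x_0}^{x_0 + L} f(u) \, e^{-i \omega_n u} \, du \right] e^{i \omega_n x}
\end{aligned}
\]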
where we have used u instead of x in the expression for cn to avoid confusion with the
various variables and what they refer to.
We now consider the limit of long periods, i.e. L → ∞ which implies that Δω → 0. Then
the sum
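\[
\sum_{n=-\infty}^{\infty} \frac{\Delta\omega}{2\pi} \, g(\omega_n) \, e^{i \omega_n x} \tag{2.11}
\]
becomes the integral
\[
\frac{1}{2\pi} \int_{-\infty}^{\infty} g(\omega) \, e^{i \omega x} \, d\omega , \tag{2.12}
\]
with
\[
g(\omega_n) = \int_{-L/2}^{L/2} f(u) \, e^{-i \omega_n u} \, du , \tag{2.13}
\]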
where we have chosen the constant x0 now to be -L/2. Putting this together, the inte-
gral becomes
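\[
f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} d\omega \, e^{i \omega x} \int_{-\infty}^{\infty} du \, f(u) \, e^{-i \omega u} . \tag{2.14}
\]
This motivates the definition of the Fourier transformation of f(x) as
\[
\tilde{f}(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x) \, e^{-i \omega x} \, dx , \tag{2.15}
\]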
where we have changed back from u to x. We can also define the inverse Fourier trans-
formation
\[
f(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \tilde{f}(\omega) \, e^{+i \omega x} \, d\omega . \tag{2.16}
\]
The normalization factor 1/(2π) is split equally between the Fourier transformation and its
inverse as 1/√(2π), which avoids having to remember in which equation the factor needs to
be inserted. Note that the tilde in \(\tilde{f}(\omega)\) is often dropped if there is no risk of confusing
the Fourier transformation and its inverse.
Using the Fourier transformation, we can switch between two equivalent views of
analysing a signal, either in the “observable” domain f(x) or in the frequency domain \(\tilde{f}(\omega)\).
Since most functions are observed as time-dependent functions or signals, we typically
use f(t) instead of f(x) to indicate the dependence on time. In fact, the left and right
parts of the previous three figures correspond to either the time domain (left part) or
the frequency domain (right part). The main advantage of being able to switch between
these equivalent representations of a signal is that some operations might be very
complicated in one domain but very easy in the other. For example, applying a fre-
quency filter is very difficult in the time domain but easy in the frequency domain. Con-
cretely, we consider the periodic signal in the figure titled “Periodic Sine Signal with
Two Contributing Frequencies.” We know from the frequency spectrum that this is a
simple signal where a low frequency sine function is combined with a high frequency
sine function. Using a low-pass filter (a filter that will only let low frequencies pass), we
can extract the part of the signal with the low frequency. To do so in the time-domain
where we observe the signal would be very difficult, but it is easily achieved in the fre-
quency domain by applying the dampening function shown in the previous figure.
Rather than using a simple rectangle as one might assume, we use a “softened”
rectangular function. In this simple example, this would make little difference; in
more complex examples, however, the smooth attenuation avoids artifacts that can be induced
by a sharp cut-off. The following figure shows that we can then extract the
low-frequency signal by switching back to the time domain as desired.
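The low-pass filtering described above can be sketched for sampled data with a discrete Fourier transformation. Everything specific in this example is an assumption for illustration: the test signal sin(t) + 0.5 sin(10t), the number of samples, the plain O(N²) transform written out by hand, and the particular smooth weight used as the “softened” rectangle (the text does not specify its exact form).

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transformation X[k] = sum over t of x[t] exp(-2 pi i k t / n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(X):
    """Inverse transformation x[t] = (1/n) sum over k of X[k] exp(+2 pi i k t / n)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def soft_low_pass(k, n, cutoff=4.0, order=4):
    """Smooth weight close to 1 below the cutoff and close to 0 above it
    (an assumed form of a softened rectangle)."""
    m = min(k, n - k)                 # indices above n/2 represent negative frequencies
    return 1.0 / (1.0 + (m / cutoff) ** (2 * order))


N = 128
times = [2.0 * math.pi * t / N for t in range(N)]
signal = [math.sin(t) + 0.5 * math.sin(10.0 * t) for t in times]

spectrum = dft(signal)                                   # time -> frequency domain
filtered_spectrum = [soft_low_pass(k, N) * X for k, X in enumerate(spectrum)]
filtered = [z.real for z in inverse_dft(filtered_spectrum)]   # back to the time domain

max_error = max(abs(y - math.sin(t)) for y, t in zip(filtered, times))
print(f"largest deviation from sin(t) after filtering: {max_error:.2e}")
```

After filtering, the signal is very close to the low-frequency component sin(t) alone; the filter itself is a simple multiplication in the frequency domain.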
Example
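Find the Fourier transformation of the function f(t) = A e^(−λt) for t ≥ 0 with λ > 0,
assuming f(t) = 0 for t < 0.
Inserting f(t) into equation 2.15 gives
\[
\tilde{f}(\omega) = \frac{A}{\sqrt{2\pi}} \int_0^{\infty} e^{-\lambda t} e^{-i \omega t} \, dt = \frac{A}{\sqrt{2\pi}} \int_0^{\infty} e^{-(\lambda + i\omega) t} \, dt \tag{2.17}
\]
where we note that the bounds of integration range from zero to ∞ as we assume
that f(t) = 0 for t < 0. Remembering \(\int e^{ax} \, dx = \frac{1}{a} e^{ax}\), the integral is evaluated as
\[
\begin{aligned}
\tilde{f}(\omega) &= \frac{A}{\sqrt{2\pi}} \left[ - \frac{e^{-(\lambda + i\omega) t}}{\lambda + i\omega} \right]_{t=0}^{\infty} \\
&= \frac{A}{\sqrt{2\pi}} \, \frac{1}{\lambda + i\omega} .
\end{aligned}
\]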
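The analytical result can be compared with a direct numerical evaluation of the integral. The sketch below assumes illustrative parameter values (A = 2, λ = 1.5, ω = 3), truncates the integration at a large but finite time, and uses the 1/√(2π) normalization of equation 2.15; the function names are hypothetical.

```python
import cmath
import math


def fourier_transform_numeric(A, lam, omega, t_max=30.0, steps=100_000):
    """Midpoint-rule estimate of (1 / sqrt(2 pi)) * integral from 0 to t_max of
    A * exp(-lam * t) * exp(-i * omega * t) dt; t_max stands in for infinity."""
    h = t_max / steps
    total = 0j
    for k in range(steps):
        t = (k + 0.5) * h
        total += A * cmath.exp(-(lam + 1j * omega) * t) * h
    return total / math.sqrt(2.0 * math.pi)


def fourier_transform_exact(A, lam, omega):
    """Closed-form result derived in the example above."""
    return A / (math.sqrt(2.0 * math.pi) * (lam + 1j * omega))


A, lam, omega = 2.0, 1.5, 3.0          # assumed example values
numeric = fourier_transform_numeric(A, lam, omega)
exact = fourier_transform_exact(A, lam, omega)
print(numeric, exact)
```

The two complex numbers should agree to many decimal places, since the integrand has decayed to a negligible size long before the truncation point.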
Summary
Convolutions mathematically express how the effects of two functions can be
combined. For example, a measurement of a physical quantity taken by some
measurement device can be described by combining the “true” value with a resolution
function that is specific to the details of the device. The observed value is then the
convolution of the “true” value with the finite sensor resolution. In image processing,
convolutions with a given kernel are used to manipulate images by blurring,
sharpening, detecting edges, or transforming the image. Periodic signals can be expressed
as a Fourier series which is a series of sine and cosine functions of fixed frequen-
cies. The relative weight of the coefficients of the terms in the series determines
the shape of the final signal. Fourier transformations can be derived as the limit of
increasingly long periods over which the fixed frequencies become a continuous
spectrum. Using the Fourier transform, a function can be expressed either in its
original form or as a Fourier transformation. Some operations are much simpler to
perform in one form compared to the other, so the Fourier transform is very useful
in practice.
Knowledge Check
You can check your understanding by completing the questions for this unit on the
learning platform.
Good luck!
Unit 3
Vector Algebra
STUDY GOALS
DL-E-DLMDSAM01-L03
3. Vector Algebra
Introduction
In this unit, we introduce techniques that allow us to operate on more complex mathe-
matical objects called vectors that contain one piece of information in each coordinate.
For example, a vector can describe the speed of a particle and the direction in which it
is moving. This is an example of a two- or three-dimensional vector. Likewise, an
n-dimensional vector could be used to record n pieces of information such as time,
temperature, and the x-, y-, and z-coordinates of the direction of movement or force.
Given the numerous and important applications of vectors to physics, we will focus on
developing an intuition for vectors in two and three dimensions, in particular in ℝ²
and ℝ³, the familiar Cartesian plane and space. Vector operations on these spaces
can be generalized to the broader contexts seen in information science, machine learn-
ing, and computer science in general.
Good textbooks that cover the subject area are chapters 11 and 12 of Calculus (Strang,
2017) and chapters one and two of Advanced Calculus (Loomis & Sternberg, 2014).
Regardless of their dimension, vectors can be analyzed by breaking them into their
component parts. For example, in the two-dimensional case, the components would be
horizontal and vertical. For n-dimensional vectors, there is no physical analogue, but
the ith-coordinate of one vector will measure the same quantity as the ith-coordinate of
another vector in the same space. Let’s begin by building an intuition for vectors in the
familiar, two-dimensional plane.
Two-Dimensional Vectors
One of the most common applications of vectors is the study of objects moving in two-
dimensional space, though there are many other applications of two-dimensional vec-
tors, too.
Let’s consider an initial point, P = (p1, p2), in the x, y-coordinate plane and a terminal
point, Q = (q1, q2). Then the vector representing the travel from P to Q can be written
PQ. The magnitude of PQ is the length of the line segment connecting P and Q,
obtained from the distance formula. Namely, the magnitude of PQ is given by
\[
\left\| \overrightarrow{PQ} \right\| = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2} .
\]
If initial point P is the origin (0, 0), we say that the vector is in standard position and
we call it a position vector. Note that the ordered pair at Q = (q1, q2) uniquely specifies
a position vector as there is only one distance and one path from (0, 0) to (q1, q2). The
following figure shows an example of a vector \(\vec{v}\) in standard position. Similarly, for
vectors in n dimensions, we write \(\vec{v} = (v_1, v_2, \ldots, v_n)\) for a vector whose ith component is
v_i, where the v_i come from an underlying field (such as the real or complex numbers).
Fundamental definitions
1. Two vectors \(\vec{a} = (a_1, \ldots, a_n)\) and \(\vec{b} = (b_1, \ldots, b_m)\) are equal, \(\vec{a} = \vec{b}\), if, and only if,
n = m and a_i = b_i for all 1 ≤ i ≤ n. This means that they have the same magnitude
and direction.
2. Given \(\vec{a}\) as above, the negation of the vector, denoted by \(-\vec{a}\), is (−a_1, …, −a_n), a
vector having the same magnitude but opposite direction to that of \(\vec{a}\).
3. If \(\|\vec{v}\| = 1\), we say that the vector \(\vec{v}\) is a unit vector.
Example
In two dimensions, this force can be represented as shown in the following figure.
Example
Suppose that the vector \(\vec{v}\) is given by the directed line segment extending from
the point (0, 0) to the point (3, 2) and that the vector \(\vec{u}\) is given by the directed line
segment from the point (1, 2) to the point (4, 4). Is it true that \(\vec{u} = \vec{v}\)?
Let P = (0, 0), Q = (3, 2), R = (1, 2), and S = (4, 4) as shown in the following figure.
We need to determine whether the directed line segments have the same direction
and magnitude. We find the magnitude of both line segments to be
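$$\lVert \overrightarrow{PQ} \rVert = \lVert \vec{v} \rVert = \sqrt{(3-0)^2 + (2-0)^2} = \sqrt{13}$$
and
$$\lVert \overrightarrow{RS} \rVert = \lVert \vec{u} \rVert = \sqrt{(4-1)^2 + (4-2)^2} = \sqrt{13}.$$
In the two-dimensional case, the slope of a line describes its direction; finding the
slope of both line segments, we see that they agree:
$$\text{Slope of } PQ = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{2-0}{3-0} = \frac{2}{3},$$
$$\text{Slope of } RS = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{4-2}{4-1} = \frac{2}{3}.$$
Both segments also run toward increasing x, so they point the same way along this slope.
As $\vec{u}$ and $\vec{v}$ have the same direction and the same magnitude, we have $\vec{u} = \vec{v}$.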
An easier way to determine whether two vectors are the same is to use the component
form of vectors: For a vector $\overrightarrow{AC}$ determined by the points A = (x1, y1) and C = (x2, y2),
the component form is defined as $\overrightarrow{AC} = (x_2 - x_1, y_2 - y_1)$. The coordinates of this vector
describe the position vector $\overrightarrow{OP}$ from the origin (0, 0) to the point P = (x2 − x1, y2 − y1).
Computationally, the addition and subtraction of vectors is as easy as the addition and
subtraction of their components in the underlying field, which, for us, is either the real
or complex numbers.
Field
A mathematical field is a set on which addition, multiplication, subtraction, and division
are defined and behave the same way as for rational and real numbers.
Specifically, for $\vec{v} = (v_1, \dots, v_n)$ and $\vec{u} = (u_1, \dots, u_n)$,
$$\vec{v} + \vec{u} = (v_1 + u_1, \dots, v_n + u_n) \quad \text{and} \quad \vec{v} - \vec{u} = (v_1 - u_1, \dots, v_n - u_n).$$
Notice that because the components of the vectors are elements of the field of real
numbers ℝ (or complex numbers ℂ), addition of these entries, and thus of vectors, is
commutative: namely, $\vec{v} + \vec{u} = \vec{u} + \vec{v}$. Subtraction, of course, is not commutative.
Vectors also share other properties with the real or complex numbers: Vector addition
is associative, there exists an additive identity $\vec{0}$, and every element $\vec{v}$ has an additive
inverse, $-\vec{v}$.
Let $\vec{u} = (u_1, u_2)$, $\vec{v} = (v_1, v_2)$, $\vec{w} = (w_1, w_2)$ and let k and c be scalars.
1. $\vec{u} + \vec{v} = \vec{v} + \vec{u}$ (commutative property)
2. $(\vec{u} + \vec{v}) + \vec{w} = \vec{u} + (\vec{v} + \vec{w})$ (associative property)
3. $\vec{u} + \vec{0} = \vec{u} = \vec{0} + \vec{u}$ (additive identity)
4. $\vec{u} + (-\vec{u}) = \vec{0}$
5. $k(c\vec{u}) = (kc)\vec{u}$
6. $k(\vec{v} + \vec{w}) = k\vec{v} + k\vec{w}$
7. $(k + c)\vec{u} = k\vec{u} + c\vec{u}$
Example
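Let $\vec{u} = (7, 2)$ and $\vec{v} = (-3, 5)$. Find $\vec{u} + \vec{v}$, $\vec{u} - 6\vec{v}$, $3\vec{u} + 4\vec{v}$ and $\lVert 5\vec{v} - 2\vec{u} \rVert$.
We have:
$$\vec{u} + \vec{v} = (7, 2) + (-3, 5) = (4, 7);$$
$$\vec{u} - 6\vec{v} = (7, 2) - 6(-3, 5) = (7, 2) - (-18, 30) = (25, -28);$$
$$3\vec{u} + 4\vec{v} = 3(7, 2) + 4(-3, 5) = (21, 6) + (-12, 20) = (9, 26);$$
$$\lVert 5\vec{v} - 2\vec{u} \rVert = \lVert 5(-3, 5) - 2(7, 2) \rVert = \lVert (-15, 25) - (14, 4) \rVert = \sqrt{(-29)^2 + 21^2} \approx 35.8.$$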
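The component-wise rules used in this example can be checked with a few lines of code. The following is a minimal sketch in plain Python; the helper names `add`, `sub`, `scale`, and `norm` are chosen here for illustration and are not part of the course material.

```python
import math

def add(a, b):
    """Component-wise sum of two vectors of equal length."""
    return tuple(x + y for x, y in zip(a, b))

def sub(a, b):
    """Component-wise difference of two vectors of equal length."""
    return tuple(x - y for x, y in zip(a, b))

def scale(k, a):
    """Multiply every component of a by the scalar k."""
    return tuple(k * x for x in a)

def norm(a):
    """Euclidean magnitude of a vector."""
    return math.sqrt(sum(x * x for x in a))

u = (7, 2)
v = (-3, 5)

print(add(u, v))                                      # (4, 7)
print(sub(u, scale(6, v)))                            # (25, -28)
print(add(scale(3, u), scale(4, v)))                  # (9, 26)
print(round(norm(sub(scale(5, v), scale(2, u))), 1))  # 35.8
```

Tuples are used so that results can be compared directly with the hand computation.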
The computational way of adding vectors gives little intuition for their physical applications.
Let us now turn our attention to the graphical addition and subtraction of vectors,
restricting ourselves, for now, to the case of ℝ², which we will visualize as the Cartesian
coordinate plane.
ℝ²
ℝ² is the two-dimensional space of real numbers.
Let $\vec{x}$ and $\vec{y}$ be vectors in ℝ². Conceptually, the sum of these two vectors should be
the net effect of the two vectors together. For ease of explanation, suppose that $\vec{x}$ and
$\vec{y}$ represent forces applied in the plane.
One way to think about this is as the two vectors in sequence, namely “doing” or
“applying” one and then the other. Graphically, that would mean placing the tail of $\vec{y}$
on the head of $\vec{x}$; if we then draw a new vector, indicated by the red dashed line from
the tail of $\vec{x}$ to the head of $\vec{y}$, that vector is the sum $\vec{x} + \vec{y}$, as shown in the following
figure. We can denote this sum by a new vector such as $\vec{z} = \vec{x} + \vec{y}$.
The next figure illustrates that performing addition graphically is equivalent to per-
forming addition using the component form of the vectors. The parallelogram also
shows that addition of vectors is commutative — we get the same diagonal regardless
of which vector we apply first.
Example
Let $\vec{v} = (-3, 2)$ and $\vec{w} = (5, -9)$. Find $\vec{v} + \vec{w}$ both graphically and by using the
component form of the vectors.
We have
$$\vec{v} + \vec{w} = (-3 + 5, 2 + (-9)) = (2, -7).$$
Introduction to Bases
A basis for a vector space is a minimal spanning set of vectors in that vector space. Minimal
means that if we took any of the vectors away, the set would no longer span the space;
there would be some vectors that cannot be formed as linear combinations of the
remaining elements in the set. Recall that a linear combination of elements is a sum of
those elements, possibly each multiplied by a scalar. For example, a two-dimensional
space can be spanned by the vectors $\vec{i}$ and $\vec{j}$ pointing along the x and y axes. Any
element of this two-dimensional space (or plane) can be expressed by a linear combination
of these two vectors such as $\vec{v} = 2\vec{i} + 3\vec{j}$. If we were to take one of them
away, we could no longer reach all points in the plane.
Spanning Set
A spanning set is one that allows us to express every vector in the space as a linear
combination of elements in that spanning set.
Bases (the plural of basis) are fundamental to the study of vector spaces, and they are
not unique. Though we will typically use the familiar unit vectors in the coordinate-axis
directions as our bases for two- and three-dimensional space, there are many other
bases, and how to move from one basis to another is an important question.
Unit vectors
Recall that a vector of magnitude (or length) one is called a unit vector. For example,
the vector $\vec{v} = (-3/5, 4/5)$ is a unit vector because
$$\lVert \vec{v} \rVert = \sqrt{(-3/5)^2 + (4/5)^2} = \sqrt{\frac{9}{25} + \frac{16}{25}} = \sqrt{\frac{25}{25}} = 1.$$
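Unit vectors are particularly useful because they all have the same length, namely
length one, so they differ from one another only in direction. In particular, there is only
one unit vector that points in a given direction. For a nonzero vector $\vec{v}$, the unit vector
$\vec{u}$ in the direction of $\vec{v}$ is
$$\vec{u} = \frac{\vec{v}}{\lVert \vec{v} \rVert} = \frac{1}{\lVert \vec{v} \rVert}\,\vec{v}.$$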
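Dividing a vector by its magnitude is easy to express in code. The sketch below is illustrative only: the function name `normalize` and the sample vector (3, 4) are assumptions made here, not taken from the text.

```python
import math

def normalize(v):
    """Return the unit vector pointing in the direction of v."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        # The zero vector has no direction, so it cannot be normalized.
        raise ValueError("cannot normalize the zero vector")
    return tuple(x / length for x in v)

u = normalize((3, 4))
print(u)  # (0.6, 0.8)
```

The explicit check for length zero reflects the restriction to nonzero vectors in the formula.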
Example
The unit vectors parallel to the x- and y-axes, sometimes called standard unit vectors,
are particularly useful because they form what is called a basis of the 2-dimensional
vector space. In particular, we let
$$\vec{i} = (1, 0); \quad \vec{j} = (0, 1).$$
Example
Express $\vec{r} = (2, -6)$ as a linear combination of $\vec{i}$ and $\vec{j}$.
We have
$$\vec{r} = (2, -6) = 2\vec{i} + (-6)\vec{j} = 2\vec{i} - 6\vec{j}.$$
It is important to note that the set $\{\vec{i}, \vec{j}\}$ is only one of many possible bases for ℝ²;
there are many other bases for this same space. For example, we could express each
point in terms of a distance r from the origin and an angle θ between the positive x-axis
and the vector to any point in ℝ². Indeed, in the two-dimensional case, any two
vectors that are not scalar multiples of one another will form a basis. This condition is
sufficient in the two-dimensional case, but not in higher dimensions.
In three dimensions, the standard unit vectors are
$$\vec{i} = (1, 0, 0); \quad \vec{j} = (0, 1, 0); \quad \text{and} \quad \vec{k} = (0, 0, 1).$$
The three standard unit vectors are directed along the positive x-, y-, and z-axes of the
three-dimensional rectangular coordinate system.
For this basis, it is easy to see that any vector $\vec{v}$ can be expressed as a linear combination
of the standard basis vectors because this is how we tend to think about position
in three-dimensional space. In particular,
$$\vec{v} = \vec{v}_x + \vec{v}_y + \vec{v}_z = v_1\vec{i} + v_2\vec{j} + v_3\vec{k}.$$
Position vectors
In the three-dimensional space of real numbers ℝ³, a position vector extends from the
origin (0, 0, 0) to the point (x, y, z) and is written $\vec{r} = (x, y, z)$. This can be expressed as
a linear combination of the standard unit vectors:
$$\vec{r} = x\vec{i} + y\vec{j} + z\vec{k},$$
with magnitude
$$\lVert \vec{r} \rVert = \sqrt{x^2 + y^2 + z^2}.$$
Example
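Given $\vec{r}_1 = 3\vec{i} - 2\vec{j} + \vec{k}$, $\vec{r}_2 = 2\vec{i} - 4\vec{j} - 3\vec{k}$ and $\vec{r}_3 = -\vec{i} + 2\vec{j} + 2\vec{k}$,
find the magnitude of $\vec{M} = \vec{r}_1 + \vec{r}_2 + \vec{r}_3$.
We have
$$\begin{aligned}
\vec{M} &= \vec{r}_1 + \vec{r}_2 + \vec{r}_3 \\
&= \vec{i}\,(3 + 2 - 1) + \vec{j}\,(-2 - 4 + 2) + \vec{k}\,(1 - 3 + 2) \\
&= 4\vec{i} - 4\vec{j} \\
&= (4, -4, 0)
\end{aligned}$$
and thus,
$$\lVert \vec{M} \rVert = \sqrt{4^2 + (-4)^2 + 0^2} = 4\sqrt{2}.$$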
Example
$$\vec{r} = (0 - (-2), -4 - 3, 4 - 1) = (2, -7, 3).$$
$$\lVert \vec{r} \rVert = \sqrt{2^2 + (-7)^2 + 3^2} = \sqrt{62}.$$
Collinear vectors
Two vectors $\vec{u}$ and $\vec{v}$ are collinear (a generalization of parallel) if there exists a real-valued
constant c such that $\vec{u} = c\vec{v}$. We use the following notation to express that
two vectors are collinear: $\vec{u} \parallel \vec{v}$. Note that this definition holds for vector spaces of
any dimension.
Example
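Suppose vector $\vec{w}$ has the initial point (2, −1, 3) and the terminal point (−4, 7, 5).
Are $\vec{w}$ and $\vec{u} = (3, -4, -1)$ collinear? What about $\vec{w}$ and $\vec{v} = (12, 16, 4)$?
The component form of $\vec{w}$ is $(-4 - 2, 7 - (-1), 5 - 3) = (-6, 8, 2)$. Since
$$\vec{u} = (3, -4, -1) = -\frac{1}{2}(-6, 8, 2) = -\frac{1}{2}\vec{w},$$
we conclude that $\vec{w} \parallel \vec{u}$ holds.
To check if $\vec{w}$ is collinear with $\vec{v}$, we must determine whether there is a constant c
satisfying $\vec{w} = c\vec{v}$, namely
$$(-6, 8, 2) = c\,(12, 16, 4).$$
The first component requires c = −1/2, whereas the second requires c = 1/2, so no such
constant exists and $\vec{w}$ and $\vec{v}$ are not collinear.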
Example
Determine whether the points P = (1, −2, 3), Q = (2, 1, 0), and R = (4, 7, −6) lie
on the same line.
$$\overrightarrow{PQ} = (2 - 1, 1 - (-2), 0 - 3) = (1, 3, -3)$$
and
$$\overrightarrow{PR} = (4 - 1, 7 - (-2), -6 - 3) = (3, 9, -9).$$
Note that $\overrightarrow{PQ}$ and $\overrightarrow{PR}$ have the same initial point by construction, so they are collinear
if, and only if, they lie on the same line. By inspection, we see that $\overrightarrow{PR} = 3\,\overrightarrow{PQ}$,
and therefore $\overrightarrow{PQ} \parallel \overrightarrow{PR}$ holds and the three points are on the same line.
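The collinearity test in the two examples above can be sketched in code. The approach shown here is an assumption chosen for illustration: rather than solving for the constant c, it checks that $a_i b_j = a_j b_i$ for every pair of indices, which avoids division and is exact for integer components. The function names are hypothetical, and the test treats the zero vector as collinear with every vector.

```python
from itertools import combinations

def displacement(p, q):
    """Component form of the vector from point p to point q."""
    return tuple(b - a for a, b in zip(p, q))

def collinear(a, b):
    """True if a and b are scalar multiples of one another.

    Checks a_i * b_j == a_j * b_i for all index pairs, which holds exactly
    when one vector is a scalar multiple of the other.
    """
    return all(a[i] * b[j] == a[j] * b[i]
               for i, j in combinations(range(len(a)), 2))

w = displacement((2, -1, 3), (-4, 7, 5))
print(w)                          # (-6, 8, 2)
print(collinear(w, (3, -4, -1)))  # True
print(collinear(w, (12, 16, 4)))  # False

P, Q, R = (1, -2, 3), (2, 1, 0), (4, 7, -6)
print(collinear(displacement(P, Q), displacement(P, R)))  # True
```

For floating-point components, the equality test would need to be replaced by a comparison within a tolerance.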
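Let $\vec{u} = (u_1, \dots, u_n)$ and $\vec{v} = (v_1, \dots, v_n)$ be vectors. The dot product of $\vec{u}$ with $\vec{v}$ is
the scalar
$$\vec{u} \cdot \vec{v} = u_1v_1 + u_2v_2 + \dots + u_nv_n.$$
Note that $\vec{u} \cdot \vec{v}$ is a scalar. By considering the properties of the real-valued components,
we can verify the following properties.
1. $\vec{u} \cdot \vec{v} = \vec{v} \cdot \vec{u}$
2. $\vec{u} \cdot (\vec{v} + \vec{w}) = \vec{u} \cdot \vec{v} + \vec{u} \cdot \vec{w}$
3. $\vec{u} \cdot \vec{u} = \lVert \vec{u} \rVert^2$
4. $\vec{u} \cdot (c\vec{v}) = c\,(\vec{u} \cdot \vec{v}) = (c\vec{u}) \cdot \vec{v}$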
Example
Let $\vec{u} = (2, -2)$ and $\vec{v} = (5, 8)$. Find $\vec{u} \cdot \vec{v}$ and $\vec{u} \cdot 2\vec{v}$.
We have:
$$\vec{u} \cdot \vec{v} = 2 \cdot 5 + (-2) \cdot 8 = -6;$$
$$\vec{u} \cdot 2\vec{v} = 2\,(\vec{u} \cdot \vec{v}) = 2 \cdot (-6) = -12.$$
The scalar or dot product can be extended so that we can handle vectors in ℂⁿ with
complex components as well. We define the scalar product as
$$\vec{u} \cdot \vec{v} = u_1^* v_1 + u_2^* v_2 + \dots + u_n^* v_n$$
where the asterisk in $u_i^*$ indicates the complex conjugate. We then find that some of the
properties above change, in particular, the following:
• $\vec{u} \cdot \vec{v} = (\vec{v} \cdot \vec{u})^*$
• $(c\vec{u}) \cdot \vec{v} = c^*\,(\vec{u} \cdot \vec{v})$
• $\vec{u} \cdot (c\vec{v}) = c\,(\vec{u} \cdot \vec{v})$
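The three modified properties can be verified numerically. The following sketch uses Python's built-in complex numbers; the function name `cdot` and the sample vectors and scalar are assumptions chosen for illustration.

```python
def cdot(u, v):
    """Complex scalar product: conjugate the components of the first vector."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def scale(c, u):
    """Multiply every component of u by the scalar c."""
    return tuple(c * a for a in u)

# Sample data (hypothetical, not from the text).
u = (1 + 2j, 3 - 1j)
v = (2 - 1j, 1j)
c = 2 + 3j

# u . v equals the complex conjugate of v . u
print(cdot(u, v) == cdot(v, u).conjugate())
# A scalar in the first slot comes out conjugated
print(cdot(scale(c, u), v) == c.conjugate() * cdot(u, v))
# A scalar in the second slot comes out unchanged
print(cdot(u, scale(c, v)) == c * cdot(u, v))
```

The sample values have small integer parts, so the comparisons are exact; for general floating-point data a tolerance such as `abs(x - y) < 1e-12` would be more appropriate.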
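$$\cos\theta = \frac{a_1b_1 + a_2b_2}{\lVert \vec{a} \rVert \lVert \vec{b} \rVert}. \tag{3.1}$$
We note without proof that the same formula works for vectors in three dimensions. In
particular, for $\vec{a} = (a_1, a_2, a_3)$ and $\vec{b} = (b_1, b_2, b_3)$,
$$\cos\theta = \frac{a_1b_1 + a_2b_2 + a_3b_3}{\lVert \vec{a} \rVert \lVert \vec{b} \rVert}. \tag{3.2}$$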
Proof
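$$\cos\theta = \frac{\xi_1}{\lVert \vec{b} \rVert}; \quad \sin\theta = \frac{\xi_2}{\lVert \vec{b} \rVert}; \quad \xi_3 = \lVert \vec{a} \rVert - \xi_1. \tag{3.3}$$
It follows that $\xi_1 = \lVert \vec{b} \rVert \cos\theta$ and $\xi_2 = \lVert \vec{b} \rVert \sin\theta$, as shown in panel (b) of the figure.
We also have
$$\begin{aligned}
\lVert \vec{a} - \vec{b} \rVert^2 &= (a_1 - b_1)^2 + (a_2 - b_2)^2 \\
&= \lVert \vec{a} \rVert^2 + \lVert \vec{b} \rVert^2 - 2\,(a_1b_1 + a_2b_2).
\end{aligned} \tag{3.4}$$
$$\begin{aligned}
\lVert \vec{a} - \vec{b} \rVert^2 &= \left(\lVert \vec{b} \rVert \sin\theta\right)^2 + \left(\lVert \vec{a} \rVert - \lVert \vec{b} \rVert \cos\theta\right)^2 \\
&= \lVert \vec{b} \rVert^2 \left(\sin^2\theta + \cos^2\theta\right) + \lVert \vec{a} \rVert^2 - 2\,\lVert \vec{a} \rVert \lVert \vec{b} \rVert \cos\theta \\
&= \lVert \vec{a} \rVert^2 + \lVert \vec{b} \rVert^2 - 2\,\lVert \vec{a} \rVert \lVert \vec{b} \rVert \cos\theta.
\end{aligned} \tag{3.5}$$
Comparing equations 3.4 and 3.5 gives
$$\lVert \vec{a} \rVert \lVert \vec{b} \rVert \cos\theta = a_1b_1 + a_2b_2, \tag{3.6}$$
and from this, we can immediately obtain equation 3.1 by solving for cos θ.
$$\vec{a} \cdot \vec{b} = \lVert \vec{a} \rVert \lVert \vec{b} \rVert \cos\theta. \tag{3.7}$$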
Observe that this gives a particularly easy test for orthogonality. In particular, if $\vec{u}$
is perpendicular to $\vec{v}$, we have
$$\vec{u} \cdot \vec{v} = 0. \tag{3.8}$$
Example
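Let $\vec{u} = (3, -1, 2)$ and $\vec{v} = (-4, 0, 2)$. Find the angle θ between $\vec{u}$ and $\vec{v}$.
We have
$$\vec{u} \cdot \vec{v} = \lVert \vec{u} \rVert \lVert \vec{v} \rVert \cos\theta,$$
$$\cos\theta = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert \lVert \vec{v} \rVert} = \frac{-12 + 4}{\sqrt{14}\sqrt{20}} = \frac{-4}{\sqrt{70}}.$$
Since $\vec{u} \cdot \vec{v} < 0$, the angle is obtuse:
$$\theta = \arccos\left(\frac{-4}{\sqrt{70}}\right) \approx 2.069 \text{ rad}.$$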
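The angle computation in this example follows directly from equation 3.7 and can be sketched as below. The helper names `dot`, `norm`, and `angle` are chosen here for illustration.

```python
import math

def dot(a, b):
    """Dot product of two real vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean magnitude of a vector."""
    return math.sqrt(dot(a, a))

def angle(a, b):
    """Angle in radians between two nonzero vectors, from a.b = |a||b|cos(theta)."""
    return math.acos(dot(a, b) / (norm(a) * norm(b)))

u = (3, -1, 2)
v = (-4, 0, 2)

print(dot(u, v))              # -8
theta = angle(u, v)
print(round(theta, 3))        # 2.069
print(theta > math.pi / 2)    # True: a negative dot product means an obtuse angle
```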
Vector projection
One important application of the dot product is in finding the extent to which a given
vector is “in the same direction” as a second vector. One example of this is the horizon-
tal and vertical components of the velocity of a projectile thrown into the air. The verti-
cal component is the projection of the initial velocity vector onto the y-axis and
reflects how much of that initial velocity is going in the “up” direction. The following
discussion allows us to consider this question for any two directions, and more for-
mally.
Let $\vec{a}$ and $\vec{b}$ be two-dimensional real vectors in ℝ². Imagine that we shine a light onto
vector $\vec{a}$ from a light source perpendicular to $\vec{b}$. We can think of the projection of $\vec{a}$
onto $\vec{b}$ as the shadow that $\vec{a}$ casts onto vector $\vec{b}$. If we think only of the length of the
shadow, this is called the scalar projection. If we consider the length and direction of
the shadow, we get the vector projection.
It is worth noting that this shadow, this projection, let’s call it $\mathrm{Proj}_{\vec{b}}\,\vec{a}$, might be longer
than the vector $\vec{b}$. Recall that one way to define the line containing
$\vec{b}$ is $L = \{c\vec{b} : c \in ℝ\}$, namely all scalar multiples of $\vec{b}$ by a real number c. This means
that, so far, we know that for some c, $\mathrm{Proj}_{\vec{b}}\,\vec{a} = c\vec{b}$.
The other thing we know about the projection of $\vec{a}$ onto $\vec{b}$ is that it is orthogonal to
the “light source,” which, in this case, has the direction $\vec{a} - c\vec{b}$. This means that
$\vec{b} \cdot (\vec{a} - c\vec{b}) = 0$. Using our knowledge of dot products, we can find a particular value
for c. We know that
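$$\vec{b} \cdot (\vec{a} - c\vec{b}) = 0 = \vec{b} \cdot \vec{a} - c\,\vec{b} \cdot \vec{b}. \tag{3.9}$$
Solving this for c, we find that $c = \dfrac{\vec{b} \cdot \vec{a}}{\lVert \vec{b} \rVert^2}$ and therefore that the vector projection of $\vec{a}$
onto $\vec{b}$ is
$$\mathrm{Proj}_{\vec{b}}\,\vec{a} = \frac{\vec{b} \cdot \vec{a}}{\lVert \vec{b} \rVert^2}\,\vec{b}. \tag{3.10}$$
The scalar projection of $\vec{a}$ onto $\vec{b}$ is the magnitude of the vector projection of $\vec{a}$ onto
$\vec{b}$.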
Example
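Find the vector projection of $\vec{u} = 3\vec{i} - 5\vec{j} + 2\vec{k}$ onto $\vec{v} = 7\vec{i} + \vec{j} - 2\vec{k}$.
$$\vec{w} = \frac{\vec{v} \cdot \vec{u}}{\lVert \vec{v} \rVert^2}\,\vec{v} = \frac{12}{54}\,(7, 1, -2) = \left(\frac{14}{9}, \frac{2}{9}, -\frac{4}{9}\right).$$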
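Equation 3.10 translates into a short function. The sketch below uses `fractions.Fraction` from the standard library so that the result can be compared exactly with the hand computation; the function names are chosen here for illustration and assume integer (or rational) components.

```python
from fractions import Fraction

def dot(a, b):
    """Dot product of two real vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

def project(a, b):
    """Vector projection of a onto a nonzero vector b: (b.a / |b|^2) b."""
    c = Fraction(dot(b, a), dot(b, b))
    return tuple(c * x for x in b)

u = (3, -5, 2)
v = (7, 1, -2)

w = project(u, v)
print(w)  # (Fraction(14, 9), Fraction(2, 9), Fraction(-4, 9))

# The remainder u - w must be orthogonal to v.
residual = tuple(x - y for x, y in zip(u, w))
print(dot(v, residual))  # 0
```

The final check mirrors the orthogonality condition in equation 3.9 that was used to derive the constant c.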
Inner product
The inner product can be seen as a generalization of the scalar or dot product discussed
so far. While the dot or scalar product was specifically defined as
$\vec{u} \cdot \vec{v} = u_1v_1 + u_2v_2 + \dots + u_nv_n$, the inner product is a function that takes two vectors
and assigns a real (or complex) number to them. We generally assume that we
use complex numbers when working with the inner product. The inner product of two
vectors, $\vec{u}$ and $\vec{v}$, is written as
$$\langle \vec{u} \,|\, \vec{v} \rangle,$$
or alternatively, $\langle \vec{u}, \vec{v} \rangle$, $(\vec{u} \,|\, \vec{v})$, or just $\vec{u} \cdot \vec{v}$.
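The inner product has the following properties:
1. $\langle \vec{u} \,|\, \vec{v} \rangle = \langle \vec{v} \,|\, \vec{u} \rangle^*$
2. $\langle \vec{u} \,|\, c\vec{v} + d\vec{w} \rangle = c\,\langle \vec{u} \,|\, \vec{v} \rangle + d\,\langle \vec{u} \,|\, \vec{w} \rangle$
3. $\langle c\vec{u} + d\vec{v} \,|\, \vec{w} \rangle = c^*\langle \vec{u} \,|\, \vec{w} \rangle + d^*\langle \vec{v} \,|\, \vec{w} \rangle$
4. $\langle c\vec{u} \,|\, d\vec{v} \rangle = c^* d\,\langle \vec{u} \,|\, \vec{v} \rangle$
5. Vectors are defined to be orthogonal if $\langle \vec{u} \,|\, \vec{v} \rangle = 0$.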
The norm of a vector is defined as $\lVert \vec{u} \rVert = \langle \vec{u} \,|\, \vec{u} \rangle^{1/2}$, a generalization of the magnitude
of a vector we have encountered so far. Generally, $\langle \vec{u} \,|\, \vec{u} \rangle$ can be both positive
and negative. However, in most cases we will encounter vector spaces with $\langle \vec{u} \,|\, \vec{u} \rangle \ge 0$
and hence we say that the norm is positive semi-definite.
Cross product
The cross product (or vector product) is generally only defined in the three-dimensional
space ℝ³. Let $\vec{a} = a_1\vec{i} + a_2\vec{j} + a_3\vec{k}$ and $\vec{b} = b_1\vec{i} + b_2\vec{j} + b_3\vec{k}$. For nonzero
vectors in ℝ³, the cross product of $\vec{a}$ and $\vec{b}$ is a vector that is perpendicular to both of
the given vectors:
$$\vec{a} \times \vec{b} = (a_2b_3 - b_2a_3)\,\vec{i} - (a_1b_3 - b_1a_3)\,\vec{j} + (a_1b_2 - b_1a_2)\,\vec{k}.$$
Note that there is a more elegant definition using determinants of matrices. Then
$\vec{a} \times \vec{b}$ is the vector that is orthogonal to the plane spanned by the vectors $\vec{a}$ and $\vec{b}$.
In particular, as we have noted, for nonzero vectors, $\vec{a} \times \vec{b}$ is perpendicular to both $\vec{a}$
and $\vec{b}$.
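1. $\vec{u} \times \vec{v} = -(\vec{v} \times \vec{u})$
2. $\vec{u} \times (\vec{v} + \vec{w}) = \vec{u} \times \vec{v} + \vec{u} \times \vec{w}$
3. $\vec{u} \times \vec{u} = \vec{0}$
4. $\vec{u} \cdot (\vec{v} \times \vec{w}) = (\vec{u} \times \vec{v}) \cdot \vec{w}$
5. The cross product is not associative, i.e. $\vec{u} \times (\vec{v} \times \vec{w}) \ne (\vec{u} \times \vec{v}) \times \vec{w}$ in general.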
Example
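Let $\vec{u} = \vec{i} - 2\vec{j} + \vec{k}$ and $\vec{v} = 3\vec{i} + \vec{j} - 2\vec{k}$. Find $\vec{u} \times \vec{v}$.
We have
$$\begin{aligned}
\vec{u} \times \vec{v} &= \vec{i}\,(4 - 1) - \vec{j}\,(-2 - 3) + \vec{k}\,(1 + 6) \\
&= 3\vec{i} + 5\vec{j} + 7\vec{k}.
\end{aligned}$$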
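The component formula for the cross product can be checked against this example with a short sketch. The function names `cross` and `dot` are chosen here for illustration.

```python
def dot(a, b):
    """Dot product of two real vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    """Cross product of two three-dimensional vectors, component by component."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

u = (1, -2, 1)
v = (3, 1, -2)

w = cross(u, v)
print(w)                     # (3, 5, 7)
print(dot(w, u), dot(w, v))  # 0 0, so w is perpendicular to both u and v
print(cross(v, u))           # (-3, -5, -7), illustrating anti-commutativity
```

The second component is written as $a_3b_1 - a_1b_3$, which is the same as $-(a_1b_3 - b_1a_3)$ in the formula from the text.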
1. $\vec{u} \times \vec{v} = \vec{0}$ if, and only if, $\vec{u} = k\vec{v}$ (namely, if the vectors are scalar multiples of
one another).
2. The vector $\vec{u} \times \vec{v}$ is orthogonal to both $\vec{u}$ and $\vec{v}$, as shown in the following figure.
3. The magnitude is given by $\lVert \vec{u} \times \vec{v} \rVert = \lVert \vec{u} \rVert \lVert \vec{v} \rVert \sin\theta$ and is equal to the area of the
parallelogram with adjacent sides $\vec{u}$ and $\vec{v}$.
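The above figure illustrates the last property: There, we have two vectors $\vec{a}$ and $\vec{b}$. Let
θ be the angle between them and let h be the height of the parallelogram with adjacent
sides of lengths $\lVert \vec{a} \rVert$ and $\lVert \vec{b} \rVert$. The height of the parallelogram is
$$h = \lVert \vec{b} \rVert \sin\theta,$$
so that
$$\text{Area} = h\,\lVert \vec{a} \rVert = \lVert \vec{b} \rVert \lVert \vec{a} \rVert \sin\theta.$$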
Example
Find a unit vector $\vec{w}$ that is orthogonal to both vectors $\vec{u} = \vec{i} - 4\vec{j} + \vec{k}$ and
$\vec{v} = 2\vec{i} + 3\vec{j}$.
$$\vec{u} \times \vec{v} = -3\vec{i} + 2\vec{j} + 11\vec{k},$$
$$\lVert \vec{u} \times \vec{v} \rVert = \sqrt{(-3)^2 + 2^2 + 11^2} = \sqrt{134}.$$
The cross product gives us the desired direction; we normalize the vector to be a
unit vector (length one) by dividing the cross product by its magnitude:
$$\vec{w} = \frac{\vec{u} \times \vec{v}}{\lVert \vec{u} \times \vec{v} \rVert} = -\frac{3}{\sqrt{134}}\,\vec{i} + \frac{2}{\sqrt{134}}\,\vec{j} + \frac{11}{\sqrt{134}}\,\vec{k}.$$
Summary
In this unit, we learned the basics of how to interpret and compute with vectors —
mathematical objects that encode more than one measurement such as speed and
direction or even n-many attributes of the state of a system. We learned to add and
subtract vectors, to multiply vectors by scalars, and how to compute and use the dot
product (also called the scalar product because the output is a scalar) and the
cross product (whose output is a vector perpendicular to the two vectors forming
the product). The geometrical interpretation of vectors is especially important in
two- and three-dimensional space. Unit vectors are vectors of length one, and the
standard unit vectors for the Cartesian coordinate system are parallel to the coordi-
nate axes and pointing in the positive direction. The concept of a basis plays a cen-
tral role in vector calculus; a basis is a minimal spanning set of vectors, a set of the
smallest size so that every vector in the space can be formed as a linear combina-
tion of the vectors in the basis.
Knowledge Check
You can check your understanding by completing the questions for this unit on the
learning platform.
Good luck!
Unit 4
Vector Calculus
STUDY GOALS
… what scalar fields and vector fields are and how to visualize them.
… how to use and interpret vector operators on scalar and vector fields.
DL-E-DLMDSAM01-L04
88 Unit 4
4. Vector Calculus
Introduction
This unit combines the concepts of differentiation, integration, and vector functions. It
introduces the mathematical tools used to study how objects behave in arbitrary coordinate
systems; for example, we will learn how to determine the rate of change
in a function that describes an object moving through three-dimensional space. As a
concrete example, we can imagine a plane flying through the sky where the position of
the plane relative to the observer or a fixed point is described as a time-dependent
vector (x(t), y(t), z(t)). The rate of change of this position vector with respect to time
gives the velocity of the plane at time t and the rate of change of the velocity gives the
acceleration of the plane at time t.
Two important concepts of vector calculus are scalar and vector fields. One example of
a scalar field from our common physical experience is a function that gives the temper-
ature at each point in a room: Given a position, such a function outputs a scalar. To vis-
ualize a vector field, we can consider the speed and direction of water as it flows down
a drain; the vector field associates the vector that describes the velocity (speed and
direction) of the water with each point in space.
For further explanation of this topic, see chapter 15 of the textbook Calculus (Strang,
2017).
The derivative describes the rate of change of a
function with respect to changes in its arguments. We can use the example of a car to
see that the rate of change in the distance traveled is the velocity. In this simple exam-
ple, we implicitly assume that the car would travel along a long, straight road, i.e. we
are not concerned with the direction of the car. In general, the car can not only change
its speed, but also the direction in which it is travelling. One natural way to describe
the position of the car and its velocity is with vectors. In particular, we can consider the
position of the car as a vector function with a scalar argument, time.
The general idea of the derivative as a limit applies to vector functions as well. We can
define the derivative of the vector $\vec{a}(u)$ as
$$\frac{d\vec{a}}{du} = \lim_{\Delta u \to 0} \frac{\vec{a}(u + \Delta u) - \vec{a}(u)}{\Delta u}. \tag{4.1}$$
The following figure illustrates a small change in the vector $\vec{a}(u)$ caused by a small
change in the argument u. Note that the derivative of a vector function is also a vector
function. However, the two vectors are not necessarily parallel, but can point in different
directions. In Cartesian coordinates, equation 4.1 can be written as
$$\frac{d\vec{a}}{du} = \frac{da_x}{du}\,\vec{i} + \frac{da_y}{du}\,\vec{j} + \frac{da_z}{du}\,\vec{k}, \tag{4.2}$$
which means that we can differentiate each component of the vector function $\vec{a}(u)$
separately.
Example
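The position of a car at time t is given by $\vec{x}(t) = t^2\,\vec{i} + 3t\,\vec{j} + t\,\vec{k}$ in Cartesian
coordinates. Find the velocity $\vec{v}(t)$ (and its magnitude, i.e. the speed, $\lVert \vec{v}(t) \rVert$) of the car
at time t = 1.
Differentiating each component with respect to t gives
$$\vec{v}(t) = \frac{d\vec{x}}{dt} = 2t\,\vec{i} + 3\,\vec{j} + \vec{k}. \tag{4.3}$$
At time t = 1, we know that the velocity is $\vec{v}(1) = 2\vec{i} + 3\vec{j} + 1\vec{k}$. Remembering
that speed is the magnitude of the velocity vector, we obtain the norm of the vector
$$\lVert \vec{v}(1) \rVert = \sqrt{2^2 + 3^2 + 1^2} = \sqrt{14}.$$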
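The limit definition in equation 4.1 can be approximated numerically, which gives a quick check of this example. The sketch below uses a central difference with a small step h; the function names and the step size are assumptions chosen for illustration, and the result is an approximation rather than an exact derivative.

```python
import math

def position(t):
    """Position of the car from the example: x(t) = (t^2, 3t, t)."""
    return (t ** 2, 3 * t, t)

def derivative(f, t, h=1e-6):
    """Central-difference estimate of the derivative of a vector function f at t."""
    forward = f(t + h)
    backward = f(t - h)
    # Each component is differentiated separately, as in equation 4.2.
    return tuple((p - m) / (2 * h) for p, m in zip(forward, backward))

velocity = derivative(position, 1.0)
speed = math.sqrt(sum(c * c for c in velocity))

print([round(c, 6) for c in velocity])  # close to [2.0, 3.0, 1.0]
print(round(speed, 4))                  # 3.7417, which is sqrt(14) to four places
```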
Just as for scalar functions, it is useful to make note of rules of differentiation that
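allow us to avoid using the definition whenever they apply. Assume that $\vec{a}$ and $\vec{b}$ are
differentiable vector functions and that ϕ is a differentiable scalar function. Then we
can use equation 4.1 and prove the following useful rules:
$$\frac{d(\phi\,\vec{a})}{du} = \phi\,\frac{d\vec{a}}{du} + \frac{d\phi}{du}\,\vec{a} \tag{4.4}$$
$$\frac{d(\vec{a} \cdot \vec{b})}{du} = \vec{a} \cdot \frac{d\vec{b}}{du} + \frac{d\vec{a}}{du} \cdot \vec{b} \tag{4.5}$$
$$\frac{d(\vec{a} \times \vec{b})}{du} = \vec{a} \times \frac{d\vec{b}}{du} + \frac{d\vec{a}}{du} \times \vec{b}. \tag{4.6}$$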
Example
Given a point particle circling a center with constant speed and fixed radius, show
that for any time, t, the velocity vector is perpendicular to the position vector.
Let $\vec{r}(t)$ denote the position function. The point particle is always the same distance
from the center of a circle, and so $\vec{r} \cdot \vec{r} = r^2$ is a constant. Note that, by the
problem, $\vec{v}(t)$ has a constant magnitude, so $\vec{v} \cdot \vec{v} = v^2$. Hence,
$$\frac{d}{dt}\left(\vec{r} \cdot \vec{r}\right) = \vec{r} \cdot \vec{v} + \vec{v} \cdot \vec{r} = 2\,\vec{r} \cdot \vec{v} = 0,$$
which implies $\vec{r} \perp \vec{v}$.
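$$\frac{\partial \vec{a}}{\partial u_i}, \tag{4.7}$$
we treat all variables $u_j$ where j ≠ i as constant and differentiate $\vec{a}$ as we vary only $u_i$.
Using partial derivatives, one can prove a version of the chain rule in order to compute
derivatives of vector functions $\vec{a}$ whose arguments $u_1, u_2, \dots, u_n$ are themselves functions
of some variables $v_i$, namely $u_i(v_1, v_2, \dots, v_m)$, and get
$$\frac{\partial \vec{a}}{\partial v_i} = \frac{\partial \vec{a}}{\partial u_1}\frac{\partial u_1}{\partial v_i} + \frac{\partial \vec{a}}{\partial u_2}\frac{\partial u_2}{\partial v_i} + \dots + \frac{\partial \vec{a}}{\partial u_n}\frac{\partial u_n}{\partial v_i}. \tag{4.8}$$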
$$\int \vec{a}(u)\,du = \vec{A}(u) + \vec{b} \tag{4.9}$$
where $\vec{b}$ is an arbitrary constant vector.
$$\int_{u_1}^{u_2} \vec{a}(u)\,du = \vec{A}(u_2) - \vec{A}(u_1). \tag{4.10}$$
Naturally, just as the antiderivative of a scalar function is a scalar function, the antider-
ivative of a vector function is a vector function and its constant of integration is a vec-
tor constant.
$$W = \int_C \vec{F} \cdot d\vec{r} \tag{4.11}$$
as illustrated in the following figure. Only the component of the force that is parallel to
the line tangent to the curve contributes to the work done moving an object along the
curve C, hence the work W is given by the scalar product of the vectors for the force
and the parametrization of the curve. The path $\vec{r}(t)$ parameterizes the way the force is
applied, e.g. in Cartesian coordinates
$$\vec{r}(t) = (x(t), y(t), z(t)) \tag{4.12}$$
where we have included a dependency on the time t to indicate where the object on
which the force is applied is at a given moment. The differential $d\vec{r}$ is then given by
$$d\vec{r}(t) = (dx, dy, dz) = \left(\frac{dx}{dt}\,dt, \frac{dy}{dt}\,dt, \frac{dz}{dt}\,dt\right). \tag{4.13}$$
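Substituting equation 4.13 into equation 4.11 turns the work integral into an ordinary integral over t, which can be approximated numerically. The sketch below uses the midpoint rule. The force field F = (y, x, 0) and the path r(t) = (t, t², 0) for 0 ≤ t ≤ 1 are hypothetical choices made here for illustration; for them the integrand is 3t², so the exact work is 1.

```python
def work(force, path, velocity, t0, t1, n=1000):
    """Midpoint-rule estimate of W = integral of F(r(t)) . (dr/dt) dt over [t0, t1]."""
    h = (t1 - t0) / n
    total = 0.0
    for k in range(n):
        t = t0 + (k + 0.5) * h          # midpoint of the k-th subinterval
        F = force(*path(t))             # force evaluated on the curve
        dr_dt = velocity(t)             # tangent vector dr/dt
        total += sum(f * d for f, d in zip(F, dr_dt)) * h
    return total

# Hypothetical example data (not from the text).
force = lambda x, y, z: (y, x, 0.0)
path = lambda t: (t, t ** 2, 0.0)
velocity = lambda t: (1.0, 2 * t, 0.0)

W = work(force, path, velocity, 0.0, 1.0)
print(round(W, 4))  # 1.0
```

Only the component of the force along the tangent dr/dt contributes to each term of the sum, which is the numerical counterpart of the scalar product in equation 4.11.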
$$\vec{r}(u, v) = \vec{r}_0 + u\,\vec{a} + v\,\vec{b} \tag{4.14}$$
where $\vec{r}_0$ is a fixed point in the surface, “anchoring” the surface in space. Linear combinations
of vectors $\vec{a}$ and $\vec{b}$ span the surface. In the Cartesian coordinate system, $\vec{i}$
and $\vec{j}$ are orthogonal, so they create a rectangular area over which to integrate. In
general, the small surface area generated by $\vec{a}$ and $\vec{b}$ will be a parallelogram and
$$d\vec{A} = \frac{\partial \vec{r}(u, v)}{\partial u} \times \frac{\partial \vec{r}(u, v)}{\partial v}\,du\,dv. \tag{4.15}$$
$$\iint_A d\vec{A} = \iint_A \frac{\partial \vec{r}(u, v)}{\partial u} \times \frac{\partial \vec{r}(u, v)}{\partial v}\,du\,dv. \tag{4.16}$$
the temperature in the room, can have a different value at each position. In this exam-
ple, we could consider the temperature in the room at more than one time, suggesting
that our scalar field would be a function ϕ(x, y, z, t) that depends on both position
and time.
From our everyday experiences, we are also familiar with situations where each point in
a given region is associated with a movement or force in a given direction with some
strength. The following figure shows two examples. On the left, a paddle is pushed
through the water and, as the paddle leaves the water, the water moves with both
direction and magnitude (speed). The picture on the right shows water draining from a
sink. As the water drains out of the sink, it forms a vortex.
Intuitively, it is clear that the direction and speed of the water change with position. We
can introduce the vector field $\vec{V} = \vec{V}(x, y, z, t)$, which assigns a vector representing
the velocity of the water to each position (x, y, z) at a given time t. This particular
example is dependent on four variables related to physical quantities such as position
and time, but we can generalize the ideas of both scalar and vector fields to arbitrarily
many variables, such as $\vec{V} = \vec{V}(x_1, x_2, \dots, x_n)$.
Vector Field
A vector field relates a vector to every point in a given area A.
Compared to the visualizations of a scalar field as shown in the right part of the figure
“Visualization of a Scalar Field,” the visualization of vector fields becomes more com-
plex. In the case of scalar fields, we only needed to associate a single value of the sca-
lar field to each coordinate. In the two-dimensional example, we used a single number
illustrated by an appropriate color scale to visualize the scalar field. In case of a vector
field, we need to add information about strength and direction at each point. For exam-
ple, we can place little arrows on a regular grid as shown in the next figure. At each
point, the arrow points in the direction of the vector field and its length indicates the
strength at this point.
Additionally, we can use color information to highlight the strength of the vector field.
Alternatively, we can use a stream plot as shown in the following figure where the lines
show how the vector field changes as a function of position and how we might observe
a particle as it follows the vector field. The direction is indicated by little arrows on the
field lines or streams and we can either vary the density or the color of the stream
lines to indicate the strength of the vector field (or both).
The ∇ Operator
We now introduce the vector operator ∇, which is often called nabla or del, and is used
in applications of derivatives of vector fields as well as many applications in physical
and information sciences. In Cartesian coordinates, the operator is defined to take par-
tial derivatives coordinate-wise, namely
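$$\nabla \equiv \vec{i}\,\frac{\partial}{\partial x} + \vec{j}\,\frac{\partial}{\partial y} + \vec{k}\,\frac{\partial}{\partial z} \tag{4.17}$$
$$\equiv \left(\frac{\partial}{\partial x}, \frac{\partial}{\partial y}, \frac{\partial}{\partial z}\right) \tag{4.18}$$
where x, y, and z are the Cartesian coordinates, and $\vec{i}$, $\vec{j}$, and $\vec{k}$ are the standard unit
vectors along the coordinate axes. Correspondingly, the coordinate-wise second partial
derivatives can be obtained by repeated application of the del operator
$$\Delta = \nabla^2 = \nabla \cdot \nabla = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}. \tag{4.19}$$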
One of the most important applications of the nabla operator is determining the rate
of change of a scalar field ϕ in a given direction called the directional derivative. In the
example of the figure “Visualization of a Scalar Field,” we discussed the scalar field of
the temperature in a room with a radiator in a corner. We now want to determine how
much the temperature changes if we move from one point to another. Starting at a
point $P(x_0, y_0, z_0)$, we move a small distance away from P along the line
$\vec{g}(s, \vec{a}) = \vec{x} + s\,\vec{a}$ in the direction of the vector $\vec{a}$, where $\vec{x}$ is the position vector of P
and s is a scalar. The value of the field at the new point is then $\phi(x + sa_x, y + sa_y, z + sa_z)$.
The rate of change of ϕ in the direction of $\vec{a}$, which is called the directional
derivative and is frequently denoted $\nabla_{\vec{a}}$, is then
$$\frac{d\phi(s)}{ds} = \frac{\partial \phi}{\partial x}\frac{dx}{ds} + \frac{\partial \phi}{\partial y}\frac{dy}{ds} + \frac{\partial \phi}{\partial z}\frac{dz}{ds} \tag{4.20}$$
$$= \frac{\partial \phi}{\partial x}\,a_x + \frac{\partial \phi}{\partial y}\,a_y + \frac{\partial \phi}{\partial z}\,a_z \tag{4.21}$$
$$= \vec{a} \cdot \nabla\phi. \tag{4.22}$$
The quantity ∇ϕ is called the gradient of a scalar field ϕ and describes the direction of
steepest ascent from any point in the field. The quantity $\frac{d\phi(s)}{ds} = \vec{a} \cdot \nabla\phi$ describes the
rate of change of the field ϕ for some distance s in a given direction $\vec{a}$.
A vector field $\vec{V}$ that is the gradient of some scalar field ϕ is called conservative and
the corresponding scalar field ϕ is called the potential of this conservative field.
Example
$$\nabla\phi = 2xyz^4\,\vec{i} + x^2z^4\,\vec{j} + 4x^2yz^3\,\vec{k}.$$
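A gradient like the one above can be checked numerically by estimating each partial derivative with a central difference. The sketch below assumes the scalar field is ϕ(x, y, z) = x²yz⁴, which is one field whose gradient matches the expression above; the function names, the evaluation point (1, 2, 1), and the direction vector are likewise chosen here for illustration.

```python
def phi(x, y, z):
    """Assumed scalar field, consistent with the gradient shown in the example."""
    return x ** 2 * y * z ** 4

def gradient(f, p, h=1e-5):
    """Central-difference estimate of the gradient of a scalar field f at point p."""
    grad = []
    for i in range(len(p)):
        hi = list(p)
        lo = list(p)
        hi[i] += h   # vary only coordinate i, holding the others constant
        lo[i] -= h
        grad.append((f(*hi) - f(*lo)) / (2 * h))
    return tuple(grad)

def analytic_gradient(x, y, z):
    """The gradient as stated in the text."""
    return (2 * x * y * z ** 4, x ** 2 * z ** 4, 4 * x ** 2 * y * z ** 3)

p = (1.0, 2.0, 1.0)
numeric = gradient(phi, p)
print([round(g, 6) for g in numeric])  # close to [4.0, 1.0, 8.0]
print(analytic_gradient(*p))           # (4.0, 1.0, 8.0)

# Directional derivative along a = (1, 0, 0), following equation 4.22.
a = (1.0, 0.0, 0.0)
print(round(sum(ai * gi for ai, gi in zip(a, numeric)), 6))  # 4.0
```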
The scalar product of the nabla operator with a vector field is called the divergence of
a vector field $\vec{V}$:
$$\nabla \cdot \vec{V} = \operatorname{div} \vec{V}. \tag{4.23}$$
The divergence is a measure of the flux of a vector field at any given point and has
important applications in physics. To illustrate this, imagine water flowing into one end
of a pipe and out the other. The flow of the water can be described by a vector field. In
the simplest case, there are no sources of additional water nor any drains that would
alter the total volume of water. In this case, the divergence of the vector field is zero. If,
however, another pipe is attached between the entrance and exit of the original pipe,
the total volume of water that exits the pipe could change. If the additional pipe adds
water to the system, the divergence of the vector field is greater than zero. If the additional
pipe drains water from the system, the divergence will be less than zero. The figure
“Visualization of a Vector Field Using Arrows to Indicate Direction and Strength of
the Field at Each Position” is a representation of positive divergence. We can imagine a
source in the center from which the field flows outward.
Flux
The flux of a vector field can be interpreted as how much the field acts like a “source”
or “drain” at a given point.
Another physical example is an electric point charge from which field lines extend to
infinity.
The curl is the cross product of the nabla operator and a vector field:
$$\nabla \times \vec{V} = \operatorname{curl} \vec{V}. \tag{4.24}$$
The curl describes the “whirliness” of a vector field. For example, if the vector field $\vec{V}$
describes the flow of water after a paddle leaves the water as seen in the left part of
the figure “Illustrations of Vector Fields in Every Day Life,” the curl of the vector field $\vec{V}$
is related to the vortices left behind by the paddles. More specifically, the curl
describes the angular velocity of the water in the area around any point. If we were to
insert a small probe such as a sheet of plastic at various points around the vortex left
behind by the paddle, this probe would tend to rotate in those regions with non-vanishing
curl, i.e. where $\nabla \times \vec{V} \ne \vec{0}$.
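Divergence and curl can both be estimated numerically from partial derivatives of the field components. The sketch below is illustrative only: the two sample fields, a pure "source" field V = (x, y, z) and a pure rotation field V = (−y, x, 0), are hypothetical choices made here, as are the function names and the evaluation point.

```python
def partial(field, i, p, h=1e-5):
    """Central-difference estimate of the derivative of a vector field
    with respect to coordinate i at point p (returns one value per component)."""
    hi = list(p)
    lo = list(p)
    hi[i] += h
    lo[i] -= h
    return tuple((a - b) / (2 * h) for a, b in zip(field(*hi), field(*lo)))

def divergence(field, p):
    """div V = dVx/dx + dVy/dy + dVz/dz."""
    return sum(partial(field, i, p)[i] for i in range(3))

def curl(field, p):
    """curl V = (dVz/dy - dVy/dz, dVx/dz - dVz/dx, dVy/dx - dVx/dy)."""
    dx, dy, dz = (partial(field, i, p) for i in range(3))
    return (dy[2] - dz[1], dz[0] - dx[2], dx[1] - dy[0])

source = lambda x, y, z: (x, y, z)        # flows outward from the origin
rotation = lambda x, y, z: (-y, x, 0.0)   # circulates around the z-axis

p = (0.5, -1.0, 2.0)
print(round(divergence(source, p), 6))             # 3.0: positive, acts as a source
print([round(c, 6) for c in curl(source, p)])      # no rotation: all components vanish
print(round(divergence(rotation, p), 6))           # no net outflow: divergence vanishes
print([round(c, 6) for c in curl(rotation, p)])    # z-component is 2.0
```

The outward-flowing field has positive divergence and no curl, while the rotating field has no divergence and a curl along the axis of rotation, matching the qualitative descriptions in the text.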
Summary
Knowledge Check
You can check your understanding by completing the questions for this unit on the
learning platform.
Good luck!
Unit 5
Matrices and Vector Spaces
STUDY GOALS
… how to compute the determinant, trace, and transpose of matrices, as well as the
complex and Hermitian conjugates of matrices.
… what tensors are and how to perform basic calculations with them.
DL-E-DLMDSAM01-L05
104 Unit 5
Introduction
Matrices are arrays of numbers that play an important role in many applications of
mathematics, from solving systems of linear equations to quantum mechanics. Many
problem settings can be reformulated efficiently as matrix equations and then solved
in a systematic way. This unit introduces basic matrix algebra and operators. Diagonali-
zation of matrices is an important skill that facilitates the changes of a coordinate sys-
tem or the transformation of one set of variables into another. Choosing a specific set
of new variables can make the solution to the original problem much easier.
Tensors formalize and extend the concepts of scalars, vectors, and matrices. This unit
introduces tensors and the basic rules of how to work with them.
To gain further understanding of this topic, see chapters 2–4 of Mathematics for
Machine Learning (Deisenroth et al., 2020), chapter 11 of Calculus (Strang, 2017), and
chapters one and two of Advanced Calculus (Loomis & Sternberg, 2014).
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} = (a_{ij}).$$
(5.1)
This n × m matrix has n rows and m columns. Each entry can be identified using the
two subscripts i and j. Matrices play an important part in many applications, for exam-
ple, to describe a rotation of a vector or properties of materials such as elasticity. In
many cases, matrices are a convenient way to perform operations on vectors. Some
important types of matrices are as follows:
• a diagonal matrix in which only the diagonal elements are nonzero, and
• the identity matrix is a square diagonal matrix in which aii = 1 for all i, and all other
elements are zero.
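Examples
Matrix A is symmetric,
$$A = \begin{pmatrix} 9 & 4 & 5 \\ 4 & 6 & 3 \\ 5 & 3 & 7 \end{pmatrix},$$
matrix B is diagonal,
$$B = \begin{pmatrix} 9 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
and the 3 × 3 identity matrix is
$$I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$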
We adopt the convention that capital letters such as A denote matrices, small letters
such as aij denote the entries of a matrix, and Greek letters such as α denote scalars.
The following rules apply when calculating with matrices:
$$A_{n \times k} \, B_{k \times m} = C_{n \times m}.$$
(5.2)
The calculation is performed as row $i$ multiplied by column $j$:
$$AB = C \iff c_{ij} = \sum_k a_{ik} b_{kj}.$$
Matrix multiplication is distributive, $A(B + C) = AB + AC$ and $(B + C)A = BA + CA$, but is not, in general, commutative. In particular, there are many cases in which $AB \neq BA$ holds. Indeed, not only are the products frequently unequal, but the fact that $AB$ is defined does not mean that $BA$ is defined; both products exist and have the same shape only when $A$ and $B$ are square matrices of the same dimension. The quantity $[A, B] \equiv AB - BA$ is called the commutator.
Commutator
The commutator between matrices becomes very important in many applications such as quantum mechanics.
Example
Evaluate the matrix products AB and BA where
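$$A = \begin{pmatrix} 3 & 2 & -1 \\ 0 & 3 & 2 \\ 1 & -3 & 4 \end{pmatrix}$$
and
$$B = \begin{pmatrix} 2 & -2 & 3 \\ 1 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix}.$$
For $AB$,
$$\begin{pmatrix} 3 & 2 & -1 \\ 0 & 3 & 2 \\ 1 & -3 & 4 \end{pmatrix} \begin{pmatrix} 2 & -2 & 3 \\ 1 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix} = \begin{pmatrix} 5 & -6 & 8 \\ 9 & 7 & 2 \\ 11 & 3 & 7 \end{pmatrix},$$
and for $BA$,
$$\begin{pmatrix} 2 & -2 & 3 \\ 1 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix} \begin{pmatrix} 3 & 2 & -1 \\ 0 & 3 & 2 \\ 1 & -3 & 4 \end{pmatrix} = \begin{pmatrix} 9 & -11 & 6 \\ 3 & 5 & 1 \\ 10 & 9 & 5 \end{pmatrix}.$$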
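The two products above can be checked with a few lines of Python. This is a minimal sketch using plain lists of rows; the helper name `matmul` is an assumption made for this illustration.

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows: c_ij = sum_k a_ik * b_kj."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must agree")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


A = [[3, 2, -1], [0, 3, 2], [1, -3, 4]]
B = [[2, -2, 3], [1, 1, 0], [3, 2, 1]]

AB = matmul(A, B)
BA = matmul(B, A)
# The commutator [A, B] = AB - BA is nonzero whenever the products differ.
commutator = [[AB[i][j] - BA[i][j] for j in range(3)] for i in range(3)]

print(AB)
print(BA)
print(commutator)
```

Because the commutator is not the zero matrix, these two matrices do not commute.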
In the following section, we will discuss matrix operations that are common in applications that involve calculating with matrices. Let $A_{n \times m}$ be a matrix consisting of n rows and m columns. In some calculations, it is useful to interchange the rows and columns of a matrix. The resulting matrix is called the transpose of A, which is denoted by $A^T$. Note that $A^T$ is an m × n matrix.
Example
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}.$$
The transpose is
$$A^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}.$$
The transpose of a product of two matrices is the product of the transposed matrices in reverse order, namely
$$(AB)^T = B^T A^T.$$
For matrices with complex number entries a ± bi, we find the complex conjugate matrix, $A^*$, by taking the complex conjugate of each entry of A,
$$(A^*)_{ij} = (a_{ij})^*.$$
(5.3)
Complex Conjugate
The complex conjugate of a ± bi is a ∓ bi.
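Example
$$A = \begin{pmatrix} 1 & 2 & 3i \\ 4+i & 5 & 6 \end{pmatrix}.$$
$$A^* = \begin{pmatrix} 1 & 2 & -3i \\ 4-i & 5 & 6 \end{pmatrix}.$$
The Hermitian conjugate of a matrix A is the transpose of the complex conjugate and is denoted by $A^\dagger$.
Example
$$A = \begin{pmatrix} 1 & 2 & 3i \\ 4+i & 5 & 6 \end{pmatrix}.$$
$$A^\dagger = \begin{pmatrix} 1 & 4-i \\ 2 & 5 \\ -3i & 6 \end{pmatrix}.$$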
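The transpose, complex conjugate, and Hermitian conjugate can be sketched in Python with the built-in complex type. The function names below are assumptions chosen for this illustration.

```python
def transpose(m):
    """Interchange rows and columns."""
    return [list(col) for col in zip(*m)]


def conjugate(m):
    """Take the complex conjugate of every entry."""
    return [[complex(z).conjugate() for z in row] for row in m]


def hermitian(m):
    """Hermitian conjugate: transpose of the complex conjugate."""
    return transpose(conjugate(m))


A = [[1, 2, 3j], [4 + 1j, 5, 6]]
A_star = conjugate(A)
A_dagger = hermitian(A)
print(A_star)
print(A_dagger)
```

Python writes the imaginary unit as `j`, so `3j` stands for 3i in the example above.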
Note that the Hermitian conjugate (or the transpose matrix in the case of real-valued matrices) is related to the inner (or dot) product of vectors. Let the two vectors $\vec{a}$ and $\vec{b}$ be given by
$$\vec{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} \quad \text{and} \quad \vec{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}.$$
Here, the vectors are represented by column matrices. If we take the Hermitian conjugate of $\vec{a}$ (resulting in a row matrix) and multiply it with $\vec{b}$, we obtain
$$\begin{pmatrix} a_1^* & a_2^* & \cdots & a_n^* \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} = \sum_{i=1}^{n} a_i^* b_i = a_i^* b_i = \vec{a}^\dagger \vec{b},$$
(5.4)
which is the inner product $\langle \vec{a}, \vec{b} \rangle$. In the case of real numbers, $\vec{a}^\dagger$ becomes $\vec{a}^T$ and we use the more familiar notation for the inner product $\vec{a} \cdot \vec{b}$.
In this derivation, we have used the summation convention that all indices occurring twice are summed over without having to write the summation sign (∑) explicitly. In this case, we use the notation $a_i^* b_i$ as a short-hand version for $\sum_{i=1}^{n} a_i^* b_i$.
Trace of a Matrix
The trace of a square matrix is one of several characteristics associated with square
matrices where we express properties of the matrix with a single number. The trace of a
square matrix is the sum of the diagonal elements:
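$$\operatorname{Tr} A = a_{11} + a_{22} + \cdots + a_{nn} = \sum_{i=1}^{n} a_{ii}.$$
(5.5)
Some properties of the trace and the sum, difference, and product of matrices are
$$\operatorname{Tr}(A \pm B) = \operatorname{Tr} A \pm \operatorname{Tr} B$$
(5.6)
and
$$\operatorname{Tr}(AB) = \operatorname{Tr}(BA).$$
(5.7)
Example
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}, \qquad \operatorname{Tr} A = 1 + 5 + 9 = 15.$$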
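The trace and its properties in equations 5.6 and 5.7 can be checked numerically. In the sketch below, the second matrix B and the helper names are assumptions picked only for illustration.

```python
def trace(m):
    """Sum of the diagonal elements of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


def matmul(a, b):
    """c_ij = sum_k a_ik * b_kj for matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]  # arbitrary second matrix for the check

A_plus_B = [[A[i][j] + B[i][j] for j in range(3)] for i in range(3)]

print(trace(A))
print(trace(A_plus_B) == trace(A) + trace(B))            # equation 5.6
print(trace(matmul(A, B)) == trace(matmul(B, A)))        # equation 5.7
```

Note that equation 5.7 holds even though AB and BA are, in general, different matrices.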
Determinant of a Matrix
The determinant of a matrix is also a single number which is only defined for square matrices and is denoted as
$$\det A = |A|.$$
(5.8)
It can be calculated as
$$\det A = \sum_{P[\alpha\beta\ldots\omega]} \epsilon_{\alpha\beta\ldots\omega}\, a_{1\alpha} a_{2\beta} \cdots a_{n\omega}.$$
(5.9)
The above sum runs over all permutations of the indices indicated by P[αβ…ω]. For example, two different permutations of the indices i and j are ij and ji. For n indices, there are n! permutations, i.e. the sum runs over n! terms. The quantity $\epsilon_{\alpha\beta\ldots\omega}$ is called the anti-symmetric tensor, which takes the values +1 and −1 depending on how often the indices are interchanged:
$$\epsilon_{\alpha\beta\ldots\omega} = \begin{cases} +1 & \text{for an even permutation of } 1, \ldots, n \\ -1 & \text{for an odd permutation of } 1, \ldots, n \\ 0 & \text{if two indices are the same.} \end{cases}$$
(5.10)
Permutation
A permutation is one of several ways that a number of items can be arranged.
Example
$$|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = \epsilon_{12} a_{11} a_{22} + \epsilon_{21} a_{12} a_{21} = a_{11} a_{22} - a_{12} a_{21}.$$
Notice that the determinant is the product of the diagonal elements minus the
product of the off-diagonal elements. This is always the case for 2 × 2 matrices, but
it does not work for matrices of other dimensions.
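The permutation sum in equation 5.9 translates almost literally into code. The following Python sketch is meant as an illustration rather than an efficient method, since the number of terms grows as n!. The function names and the test matrices are assumptions.

```python
from itertools import permutations


def permutation_sign(perm):
    """+1 for an even permutation, -1 for an odd one (counted via inversions)."""
    inversions = sum(1 for i in range(len(perm))
                     for j in range(i + 1, len(perm)) if perm[i] > perm[j])
    return -1 if inversions % 2 else 1


def det_leibniz(m):
    """Determinant as a signed sum over all permutations of the column indices."""
    n = len(m)
    total = 0
    for perm in permutations(range(n)):
        term = permutation_sign(perm)
        for row, col in enumerate(perm):
            term *= m[row][col]
        total += term
    return total


print(det_leibniz([[1, 2], [3, 4]]))                   # 1*4 - 2*3
print(det_leibniz([[1, 2, 3], [0, 4, 5], [1, 0, 6]]))
print(det_leibniz([[1, 2, 3], [1, 2, 3], [7, 8, 9]]))  # two identical rows
```

Terms with a repeated index never appear here because `permutations` only yields arrangements of distinct indices, which corresponds to the value 0 of the anti-symmetric tensor.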
Example
The cofactor $C_{ij}$ of the entry $a_{ij}$ is
$$C_{ij} = (-1)^{i+j} M_{ij}$$
(5.11)
where the minor $M_{ij}$ is the determinant of the matrix of size (n − 1) × (n − 1), which is obtained by removing all elements of the ith row and jth column of the original matrix A.
1. $|A^T| = |A|$
2. $|A^\dagger| = |(A^*)^T| = |A^*| = |A|^*$
3. $|AB| = |A||B| = |BA|$
4. $|\lambda A| = \lambda^n |A|$
5. If two rows or columns of a matrix are interchanged, the determinant changes its sign but not its magnitude.
6. If two rows or columns of a matrix are identical, the determinant is zero.
Inverse of a Matrix
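Just as in the familiar cases of operations on the real numbers and applications of functions on real numbers, it is sometimes possible to multiply two matrices and obtain the identity matrix. When
$$AB = BA = I,$$
(5.12)
the matrix B is called the inverse of A and is denoted by $A^{-1}$. Provided that $\det A \neq 0$, its entries are given by
$$(A^{-1})_{ij} = \frac{C_{ji}}{\det A}$$
(5.13)
where $C_{ji}$ are the cofactors defined in equation 5.11 with the indices swapped, namely i, j becomes j, i. For the 2 × 2 matrix
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},$$
the inverse of A is
$$A^{-1} = \frac{C^T}{\det A} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}.$$
(5.14)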
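Example
Find the inverse of
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 1 & 0 & 6 \end{pmatrix}.$$
The determinant is
$$\det A = 1(4 \cdot 6 - 0 \cdot 5) - 2(0 \cdot 6 - 1 \cdot 5) + 3(0 \cdot 0 - 1 \cdot 4) = 24 + 10 - 12 = 22.$$
The cofactors are
$$C_{11} = \begin{vmatrix} 4 & 5 \\ 0 & 6 \end{vmatrix} = 24, \quad C_{12} = -\begin{vmatrix} 0 & 5 \\ 1 & 6 \end{vmatrix} = 5, \quad C_{13} = \begin{vmatrix} 0 & 4 \\ 1 & 0 \end{vmatrix} = -4,$$
$$C_{21} = -\begin{vmatrix} 2 & 3 \\ 0 & 6 \end{vmatrix} = -12, \quad C_{22} = \begin{vmatrix} 1 & 3 \\ 1 & 6 \end{vmatrix} = 3, \quad C_{23} = -\begin{vmatrix} 1 & 2 \\ 1 & 0 \end{vmatrix} = 2,$$
$$C_{31} = \begin{vmatrix} 2 & 3 \\ 4 & 5 \end{vmatrix} = -2, \quad C_{32} = -\begin{vmatrix} 1 & 3 \\ 0 & 5 \end{vmatrix} = -5, \quad C_{33} = \begin{vmatrix} 1 & 2 \\ 0 & 4 \end{vmatrix} = 4,$$
so that
$$C = \begin{pmatrix} 24 & 5 & -4 \\ -12 & 3 & 2 \\ -2 & -5 & 4 \end{pmatrix}.$$
The inverse is therefore
$$A^{-1} = \frac{1}{\det A} C^T = \frac{1}{22} \begin{pmatrix} 24 & 5 & -4 \\ -12 & 3 & 2 \\ -2 & -5 & 4 \end{pmatrix}^T = \frac{1}{22} \begin{pmatrix} 24 & -12 & -2 \\ 5 & 3 & -5 \\ -4 & 2 & 4 \end{pmatrix}.$$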
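The cofactor method of the example above can be sketched in Python with exact fractions. The helper names are assumptions made for this illustration, and the recursive determinant is only practical for small matrices.

```python
from fractions import Fraction


def minor(m, i, j):
    """Matrix obtained by deleting row i and column j."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]


def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))


def cofactor_matrix(m):
    """C_ij = (-1)^(i+j) M_ij, as in equation 5.11."""
    n = len(m)
    return [[(-1) ** (i + j) * det(minor(m, i, j)) for j in range(n)]
            for i in range(n)]


def inverse(m):
    """(A^-1)_ij = C_ji / det A, as in equation 5.13."""
    d = det(m)
    if d == 0:
        raise ValueError("matrix is singular")
    c = cofactor_matrix(m)
    n = len(m)
    return [[Fraction(c[j][i], d) for j in range(n)] for i in range(n)]


A = [[1, 2, 3], [0, 4, 5], [1, 0, 6]]
C = cofactor_matrix(A)
A_inv = inverse(A)
print(det(A))
print(C)
print(A_inv)
```

Using `Fraction` avoids rounding errors, so the product of A and its inverse reproduces the identity matrix exactly.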
In such an example, we may find that the matrix operation applied to a vector only
changes the magnitude or length of a vector:
$$A\vec{x} = \lambda\vec{x}.$$
(5.15)
This equation is called the eigenvalue problem for matrix A where λ are either real- or complex-valued numbers called the eigenvalues of the matrix, and the solution $\vec{x}$ is an eigenvector associated with λ. This equation can be written as
$$(A - \lambda I)\vec{x} = \vec{0},$$
(5.16)
which gives rise to the characteristic equation. It can be viewed as a homogeneous system of linear equations of the form $B\vec{x} = \vec{0}$ where $B = A - \lambda I$. This system has a nontrivial (nonzero) solution only if the determinant of B is zero; the condition $\det(A - \lambda I) = 0$ is the characteristic equation, from which we can determine the eigenvalues of A. Note that $\vec{x} = \vec{0}$ is always a solution to this system, so the nontrivial requirement is important.
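Example
Calculate the eigenvalues and associated eigenvectors of the matrix
$$A = \begin{pmatrix} 10 & -3 \\ -3 & 2 \end{pmatrix}.$$
First, we form the equation
$$(A - \lambda I)\vec{x} = \begin{pmatrix} 10-\lambda & -3 \\ -3 & 2-\lambda \end{pmatrix} \vec{x} = \vec{0}.$$
Next, we find
$$\begin{aligned} \det(A - \lambda I) &= (10-\lambda)(2-\lambda) - (-3)(-3) \\ &= 20 - 10\lambda - 2\lambda + \lambda^2 - 9 \\ &= \lambda^2 - 12\lambda + 11 \\ &= 0. \end{aligned}$$
Completing the square gives
$$\begin{aligned} \lambda^2 - 12\lambda + 11 &= 0 \\ \lambda^2 - 12\lambda &= -11 \\ (\lambda - 6)^2 &= -11 + 36 \\ (\lambda - 6)^2 &= 25, \end{aligned}$$
so that $\lambda - 6 = \pm 5$, i.e. $\lambda_1 = 1$ and $\lambda_2 = 11$.
Now that we have the eigenvalues, we can find the associated eigenvectors. For $\lambda_1 = 1$, we get
$$(A - 1I)\vec{x} = \begin{pmatrix} 10-1 & -3 \\ -3 & 2-1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \vec{0},$$
$$9x_1 - 3x_2 = 0, \qquad -3x_1 + x_2 = 0.$$
Both equations are satisfied by $x_1 = t$, $x_2 = 3t$. For $\lambda_2 = 11$, we get
$$(A - 11I)\vec{x} = \begin{pmatrix} 10-11 & -3 \\ -3 & 2-11 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \vec{0},$$
$$-x_1 - 3x_2 = 0, \qquad -3x_1 - 9x_2 = 0.$$
Using the parameter t, we are able to describe the infinite number of solutions as those satisfying $x_1 = t$, $x_2 = -t/3$. Typically, we choose eigenvectors that are unit vectors; in this case
$$\vec{x}_1 = \frac{1}{\sqrt{10}} \begin{pmatrix} 1 \\ 3 \end{pmatrix} \quad \text{and} \quad \vec{x}_2 = \frac{1}{\sqrt{10}} \begin{pmatrix} 3 \\ -1 \end{pmatrix}.$$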
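For a real symmetric 2 × 2 matrix, the eigenvalue problem can be solved in closed form. The following Python sketch does this for the example above; the function name and the way the eigenvector is read off from the first row of A − λI are assumptions made for this illustration.

```python
import math


def eigen_symmetric_2x2(m):
    """Eigenvalues and unit eigenvectors of a real symmetric matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = m
    if b == 0:  # already diagonal
        return [(float(a), (1.0, 0.0)), (float(d), (0.0, 1.0))]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    pairs = []
    for lam in (mean - radius, mean + radius):
        # The first row (a - lam) x1 + b x2 = 0 is solved by (x1, x2) = (b, lam - a).
        vec = (b, lam - a)
        norm = math.hypot(*vec)
        pairs.append((lam, (vec[0] / norm, vec[1] / norm)))
    return pairs


A = [[10, -3], [-3, 2]]
pairs = eigen_symmetric_2x2(A)
for lam, vec in pairs:
    print(lam, vec)
```

An eigenvector is only defined up to a nonzero factor, so the computed unit vectors may differ from the ones in the example by an overall sign.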
5.3 Diagonalization
Change of Basis
A basis $\{\vec{e}_i : i = 1, 2, \ldots, N\}$ is a minimal spanning set of linearly independent vectors. One example is the Cartesian coordinate system, which forms a basis of ℝ³ where $\vec{i} = \vec{e}_1$ points along the positive x-axis, $\vec{j} = \vec{e}_2$ points along the positive y-axis, and $\vec{k} = \vec{e}_3$ points along the positive z-axis.
Linearly Independent
Vectors are linearly independent if they cannot be expressed as linear combinations of each other.
Considering an n-dimensional vector space with basis $\vec{e}_1, \ldots, \vec{e}_n$, every vector $\vec{x}$ in the vector space can be expressed as a linear combination of the basis vectors, namely,
$$\vec{x} = x_1 \vec{e}_1 + x_2 \vec{e}_2 + \cdots + x_n \vec{e}_n,$$
(5.17)
or $\vec{x}$ can be written in vector form as
$$\vec{x} = (x_1, x_2, \ldots, x_n)^T.$$
(5.18)
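However, the familiar $\vec{i}, \vec{j}, \vec{k}$ is not the only basis for ℝ³. For example, sometimes it is easier to consider a problem in spherical coordinates rather than in Cartesian coordinates. The new base vectors $\vec{e}\,'_j$ can be expressed as
$$\vec{e}\,'_j = \sum_{i=1}^{N} S_{ij} \vec{e}_i$$
(5.19)
in terms of the original base vectors, so that a vector $\vec{x}$ can be written in either base as
$$\vec{x} = \sum_{i=1}^{N} x_i \vec{e}_i = \sum_{j=1}^{N} x'_j \vec{e}\,'_j = \sum_{j=1}^{N} x'_j \sum_{i=1}^{N} S_{ij} \vec{e}_i.$$
(5.20)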
This means that we can express the components of the vector $\vec{x}$ in the original base (or coordinate system) $\vec{e}_i$. We denote these components of the vector $\vec{x}$ with $x_i$. We can also express the components of the vector $\vec{x}$ in the new base (or coordinate system) defined by $\vec{e}\,'_i$; in this case, we denote the components by $x'_i$. The matrix $S_{ij}$ connects the two representations
$$x_i = \sum_{j=1}^{N} S_{ij} x'_j,$$
(5.21)
or in vector notation,
$$\vec{x} = S\vec{x}\,' \iff \vec{x}\,' = S^{-1}\vec{x}.$$
(5.22)
A linear operator A that maps $\vec{x}$ to $\vec{y}$ takes the form
$$\vec{y} = A\vec{x}, \qquad \vec{y}\,' = A'\vec{x}\,'$$
(5.23)
in the two coordinate systems, and using equation 5.22, the first equation in the line above can be written as
$$S\vec{y}\,' = AS\vec{x}\,'$$
(5.24)
expressing $\vec{x}$ and $\vec{y}$ in the new “primed” coordinate system using the matrix S. Hence
$$\vec{y}\,' = S^{-1}AS\vec{x}\,',$$
(5.25)
and comparing with equation 5.23, we find
$$A' = S^{-1}AS.$$
(5.26)
Matrix Diagonalization
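Consider the matrix S so that each column of the matrix corresponds to an eigenvector of some matrix A, so
$$S = \begin{pmatrix} \uparrow & \uparrow & & \uparrow \\ \vec{x}^1 & \vec{x}^2 & \cdots & \vec{x}^N \\ \downarrow & \downarrow & & \downarrow \end{pmatrix},$$
(5.27)
where
$$A\vec{x}^j = \lambda_j \vec{x}^j.$$
Then, the entry $(A')_{ij}$ of matrix A′, which is the matrix A expressed in the new base or coordinate system consisting of the eigenvectors, is given by
$$\begin{aligned}
(S^{-1}AS)_{ij} &= \sum_k \sum_l (S^{-1})_{ik} A_{kl} S_{lj} \\
&= \sum_k \sum_l (S^{-1})_{ik} A_{kl} (\vec{x}^j)_l && \text{(eigenvectors in matrix } S\text{)} \\
&= \sum_k (S^{-1})_{ik} \lambda_j (\vec{x}^j)_k && \text{(eigenvalues of matrix } A \text{ applied)} \\
&= \sum_k \lambda_j (S^{-1})_{ik} S_{kj} && \text{(eigenvectors in matrix } S\text{)} \\
&= \lambda_j \delta_{ij},
\end{aligned}$$
where $\delta_{ij}$ is 1 for i = j and 0 otherwise, which means that the resulting matrix is diagonal with the eigenvalues on the diagonal
$$A' = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_N \end{pmatrix}.$$
(5.28)
In this derivation, we used that
• the matrix S was chosen so that the columns of S are the eigenvectors of A,
• applying A to an eigenvector of A scales that eigenvector by the associated eigenvalue λ (recall that $A\vec{x} = \lambda\vec{x}$ for an eigenvector $\vec{x}$ of A), and
• applying the inverse of a matrix to that matrix gives the identity matrix, namely, $S^{-1}S = I$.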
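Example
Diagonalize the matrix
$$A = \begin{pmatrix} 2 & 0 & 0 \\ 1 & 2 & 1 \\ -1 & 0 & 1 \end{pmatrix}.$$
The eigenvalues follow from
$$\det(A - \lambda I) = \begin{vmatrix} 2-\lambda & 0 & 0 \\ 1 & 2-\lambda & 1 \\ -1 & 0 & 1-\lambda \end{vmatrix} = (2-\lambda)^2(1-\lambda) = 0,$$
so that $\lambda_1 = 1$ and $\lambda_2 = \lambda_3 = 2$.
Next, we need to solve the equation $(A - \lambda I)\vec{x} = \vec{0}$ for each value of λ. For $\lambda_1 = 1$ we obtain
$$(A - 1I)\vec{x} = \begin{pmatrix} 2-1 & 0 & 0 \\ 1 & 2-1 & 1 \\ -1 & 0 & 1-1 \end{pmatrix}\vec{x} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ -1 & 0 & 0 \end{pmatrix}\vec{x} = \vec{0},$$
i.e.
$$x_1 = 0, \qquad x_1 + x_2 + x_3 = 0, \qquad -x_1 = 0.$$
The first and third equation imply $x_1 = 0$, which means that $x_2 = -x_3$. The eigenvector is then given by
$$\vec{v}_1 = a_1 \begin{pmatrix} 0 \\ -1 \\ 1 \end{pmatrix}.$$
For the repeated eigenvalue λ = 2, we obtain in the same way the two eigenvectors
$$\vec{v}_2 = a_2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \quad \text{and} \quad \vec{v}_3 = a_3 \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}.$$
Choosing $a_1 = a_2 = a_3 = 1$, the matrix of eigenvectors is
$$S = \begin{pmatrix} 0 & 0 & -1 \\ -1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}$$
and the diagonalized matrix is
$$D = S^{-1}AS = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$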
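The result of this example can be verified by computing S⁻¹AS directly. The Python sketch below uses exact fractions and a small Gauss-Jordan routine for the inverse; both helper functions are assumptions written for this illustration.

```python
from fractions import Fraction


def matmul(a, b):
    """c_ij = sum_k a_ik * b_kj for matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def inverse(m):
    """Inverse by Gauss-Jordan elimination on the augmented matrix (m | I)."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


A = [[2, 0, 0], [1, 2, 1], [-1, 0, 1]]
S = [[0, 0, -1], [-1, 1, 0], [1, 0, 1]]  # columns are the eigenvectors v1, v2, v3

D = matmul(matmul(inverse(S), A), S)
print(D)
```

The diagonal of D lists the eigenvalues in the same order as the eigenvectors appear in the columns of S.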
5.4 Tensors
Introduction to Tensors
So far, we have encountered scalars, vectors, and matrices. In many examples, we have
seen how these objects can be applied, for example, to describe physical systems. In
the following section, we will discover how we can treat these objects in a more unified
way and how we can use this to extend these objects by building an intuition using
examples from our physical world. Note that this is only a first introduction and not a
fully formal treatment of the underlying mathematical theory.
In some scenarios, a simple number is sufficient, e.g. “how many pieces of cake are
there?” — and you might answer “(number of pieces of cake).” However, in many situa-
tions, a simple number is not sufficient, e.g. “how far is it to your home?” Just answer-
ing “3” is not sufficient, we also need a unit, e.g. km: “my house is 3 km from here.” In
applications and physics, these objects are called denominate numbers or scalars.
Other examples are temperature, the (inner) energy of a system, and pressure. When
calculating with scalars, we can add, subtract, multiply, or divide them, but we always
get a scalar when operating with scalars.
In other situations, more information is required than denominate numbers, for exam-
ple, answering the question “how do I get to your house?” In addition to a number, a
direction is needed: “walk 3 km due north.” These objects are vectors and are character-
ized by a direction and a value, the magnitude of the vector. Other examples include
velocity, acceleration, and (angular) momentum. Vectors are often represented by their components $\vec{v} = a\vec{i} + b\vec{j} + c\vec{k}$ where $\vec{i}, \vec{j}, \vec{k}$ are the unit vectors (for example, in the x-, y-, and z-direction in a Cartesian coordinate system) and a, b, c are scalars denoting how far one has to go in each direction. We have already encountered how we can add, subtract, or multiply vectors. Depending on the operation, the result can either be a scalar or a vector.
• Sum: $\vec{w} = \vec{u} \pm \vec{v}$. The sum (or difference) of two vectors is a vector.
• Inner product: $\vec{u} \cdot \vec{v} = \eta$. The inner product is a scalar and the inner product of a vector with itself is the square of its magnitude (length).
• Cross product: $\vec{s} = \vec{u} \times \vec{v}$. The cross product of two vectors $\vec{u}$ and $\vec{v}$ is a vector orthogonal to both $\vec{u}$ and $\vec{v}$. In the case of three-dimensional space, the cross product $\vec{s}$ is perpendicular to the plane spanned by $\vec{u}$ and $\vec{v}$.
• Multiplication by a scalar: A vector can be multiplied by a scalar to change its magnitude.
We have also used matrices to operate on vectors, which can involve rotating vectors,
or expressing a change of basis or coordinate system. However, many physical systems
cannot be described by scalars and vectors alone. A familiar example is the inertia
matrix (or tensor) of an object: If we rotate a three-dimensional object such as a gyroscope, we find that the rotation around some axis is stable (at least, as long as the gyroscope does not lose too much energy due to friction in the course of our investigation). However, if we apply a force (a vector with both a direction and a magnitude) from outside, the gyroscope will generally re-orient itself along some other direction that is different to the one of the force we applied. We need the nine elements of a 3-by-3 matrix (or tensor) to describe this behavior. Another example is the magnetic susceptibility of a material. When an external magnetic field $\vec{H}$ is applied, any material will, in general, show a magnetic response $\vec{M}$. However, in many materials, the external field and the magnetic response are not aligned. This means that the magnetic susceptibility χ is not a single number or a vector and we need 3 · 3 = 9 elements to describe the response of the object, $\vec{M} = \chi\vec{H}$. To describe other physical systems, such as our universe in the context of general relativity or stationary and rotating black holes, we need objects with even more elements. We typically call these metrics in the context of general relativity.
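Tensors extend and generalize what we have seen so far of scalars, vectors, and matrices in 3d space:
• a scalar is a tensor of rank 0,
• a vector is a tensor of rank 1, and
• a matrix is a tensor of rank 2.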
This list can be extended to higher ranks, though these are then generally just called
“tensor of rank k.” Here, we can understand the rank of a tensor intuitively as the num-
ber of indices we need to express the tensor. A scalar that has no index can be repre-
sented by a single number. The components of a vector can be identified by a single
index, for example ai, where we understand that ai are the elements of the vector a and
the index i takes all values i = 1, 2, …, n required to address all elements of the vector
of dimension n. In the same way, we can denote the elements of some matrix A by two
indices: ai,j.
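If $\vec{a}$ and $\vec{b}$ are vectors and T is a tensor of rank 2, the linear equation $\vec{b} = T\vec{a}$ can be expressed as the following system of linear equations which is equivalent to our previous use of matrix operations:
$$\begin{aligned} b_1 &= t_{11}a_1 + t_{12}a_2 + t_{13}a_3 \\ b_2 &= t_{21}a_1 + t_{22}a_2 + t_{23}a_3 \\ b_3 &= t_{31}a_1 + t_{32}a_2 + t_{33}a_3. \end{aligned}$$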
Tensors of rank 2 do behave like square matrices, but tensors generalize the concepts
we have encountered so far.
We use this example to introduce the following summation convention: Without writing
the summation sign explicitly, we always sum over indices that occur twice. In the
above example, we can write each component $b_i$ of the vector $\vec{b}$ as a sum
$$b_i = \sum_{j=1}^{3} t_{ij} a_j = t_{ij} a_j$$
(5.29)
where we have used the convention to avoid writing the summation sign $\sum_{j=1}^{3}$. Note
that this also implies that the range of the indices i, j is no longer made explicit but is
understood from the context. This notation becomes useful if many indices are used.
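In code, the summation convention simply becomes an explicit loop (or sum) over the repeated index. The following Python sketch evaluates $b_i = t_{ij} a_j$; the numerical values of t and a are arbitrary assumptions for illustration.

```python
def contract_matrix_vector(t, a):
    """b_i = t_ij a_j, with the repeated index j summed over."""
    return [sum(t[i][j] * a[j] for j in range(len(a))) for i in range(len(t))]


t = [[1, 2, 0],
     [0, 1, 3],
     [4, 0, 1]]
a = [1, 2, 3]

b = contract_matrix_vector(t, a)
print(b)
```

The free index i survives the contraction and labels the components of the result, while the repeated index j disappears.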
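Tensors that are invariant under a permutation of their indices (they do not change sign when two indices are swapped) are called symmetric, but if the sign flips, they are called anti-symmetric. Specifically, a tensor of rank 2 is symmetric if $t_{ij} = t_{ji}$ and anti-symmetric if $t_{ij} = -t_{ji}$.
The basic operations on tensors are as follows: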
• Addition: Adding or subtracting tensors is only defined for tensors of the same rank: $c_{ij} = a_{ij} + b_{ij}$. Addition is commutative, so $a_{ij} + b_{ij} = b_{ij} + a_{ij}$.
• Dyad product: Multiplication is very different to what we have seen so far with scalars or the inner product of vectors. It is obtained by multiplying each of the components term by term. The dyad product always leads to a tensor of higher rank: $r_{iklm} = a_{ik} \otimes b_{lm}$. The dyadic product is neither the inner nor the cross product, but a new entity. The dyad product is generally not commutative, so $a_{ik} \otimes b_{lm} \neq b_{lm} \otimes a_{ik}$. In case of two tensors of rank 1, $\vec{a} = (a_1, a_2, a_3)^T$ and $\vec{b} = (b_1, b_2, b_3)^T$, the dyadic product is
$$\vec{a} \otimes \vec{b} = \begin{pmatrix} a_1b_1 & a_1b_2 & a_1b_3 \\ a_2b_1 & a_2b_2 & a_2b_3 \\ a_3b_1 & a_3b_2 & a_3b_3 \end{pmatrix}.$$
(5.30)
• Contraction: Tensor contraction is used when we sum over the indices of a tensor, that is, if an index occurs twice. Recall that we can view vectors as tensors of rank one. The dot product of two vectors $\vec{a} \cdot \vec{b}$ is given by the sum of the components $a_1b_1 + a_2b_2 + a_3b_3$ for three-dimensional vectors. This can also be expressed as $\vec{a} \cdot \vec{b} = \sum_{i=1}^{3} a_i b_i = a_i b_i$ if we apply the summation convention that we introduced earlier. The result is a number, i.e. a tensor of rank 0. In other words, we have contracted the index i of the vectors (tensors of rank 1) a and b.
• Trace: The trace is an example of a contraction and is calculated in the same way as we have seen in case of a square matrix, namely as a summation over the diagonal: $\operatorname{Tr}(r_{ik}) = r_{ii}$.
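The operations listed above can be illustrated for tensors of rank 1 with a short Python sketch. The vectors and function names are assumptions chosen for this example.

```python
def dyadic(a, b):
    """Dyadic product r_ij = a_i b_j of two rank-1 tensors (result has rank 2)."""
    return [[ai * bj for bj in b] for ai in a]


def dot(a, b):
    """Contraction a_i b_i of two rank-1 tensors (result has rank 0)."""
    return sum(ai * bi for ai, bi in zip(a, b))


def trace(r):
    """Contraction r_ii of a rank-2 tensor over its two indices."""
    return sum(r[i][i] for i in range(len(r)))


a = [1, 2, 3]
b = [4, 5, 6]

ab = dyadic(a, b)
ba = dyadic(b, a)
print(ab)
print(ba)                     # differs from ab: the dyadic product does not commute
print(dot(a, b), trace(ab))   # contracting the dyadic product gives the dot product
```

The last line shows how the operations are connected: taking the trace of the dyadic product contracts its two indices and reproduces the inner product of the two vectors.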
In the following explanation, we use an intuition built by examples from our physical
world to introduce the concept of co- and contravariance. These concepts play an
important role in the description of physical systems and, if we use tensors to describe
these systems, a further convention discussed below affects how we use the indices of
a tensor.
A covector, on the other hand, transforms the same way as the change of base or coordinate system does: its components co-vary, and hence we call this property covariance. An example of a covariant vector is the gradient of a field. For example, if we
measure the strength of an electric field in V/m and change the unit of length to mm,
we need to multiply the coordinate system or base by 1000. The resulting covector also
needs to be multiplied by 1000 in order to stay invariant under this change of coordi-
nate system or base.
Generally, contravariant tensors are denoted with superscript indices, i.e. $r^i$, and covariant tensors are denoted with subscript indices, i.e. $r_i$.
We can now connect this to our discussion of tensors. Tensors can have both co- and
contravariant properties. This means that, in general, a tensor will have both super-
script and subscript indices, for example gkij, depending on how it behaves during a
change of base or coordinate transformation.
Summary
Matrices and vector spaces are key ingredients in many complex calculations. For
example, matrices can be used to change coordinate systems. Basic skills in calcu-
lating with matrices include adding and subtracting matrices, performing matrix/
vector operations, and finding the eigenvectors and eigenvalues of matrices. An
important step in changing bases is the diagonalization of a matrix. Tensors extend
the concepts and calculations of vectors and matrices to a more general setting,
allowing for linear and non-linear coordinate systems.
Knowledge Check
You can check your understanding by completing the questions for this unit on the
learning platform.
Good luck!
Unit 6
Information Theory
STUDY GOALS
… how to express the difference between a prediction and observed events using the MSE.
… the concepts of information entropy, Shannon entropy, and Kullback Leibler divergence.
6. Information Theory
Introduction
The field of information theory focuses on the quantification, processing, and commu-
nication of information. The concept of entropy was introduced in the field of informa-
tion science by Claude Shannon and has many important applications, including the
quantification of information contained in a data stream. We will also develop related
tools for measuring the degree of similarity between probability distributions and
membership in data classes. Such techniques are useful for classification tasks where
one is concerned with assigning an event to one or more pre-defined classes as well as
in regression tasks, which focus on the extrapolation or prediction of a quantity in a
given situation.
Various metrics such as the mean squared error, Kullback-Leibler divergence, and cross-
entropy can be derived from information theoretic principles and play a leading role in
algorithmic approaches to the extrapolation and prediction of new or unknown events.
The simplest nontrivial example of such a model is linear regression where the form of
the function used to estimate the true values of the data is the line y = mx + b, for a
single variable x. In this case, we have two free parameters, namely a1 = m and a2 = b.
To develop such a model, we not only need the observed values of the variables X, but
also the corresponding observed true values y in order to choose the parameters ai
appropriately. During the development of the model, we obtain an intermediate esti-
mate of the predicted value ŷ based on the current values of the parameters ai and we
need a metric to determine how to optimize the parameters further. Once we have optimized the final parameter values of the model, we need to assess the accuracy of the model.
Free Parameters
Free parameters are those parameters in a model that need to be determined using the data.
One of the simplest and most popular metrics is the mean squared error
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2.$$
(6.1)
The MSE is symmetric in the true observed value y and the model's prediction ŷ. It puts
a strong penalty on predictions ŷ that are far from the observed value y. Although this
sounds like a desirable quantity, it also means that the metric can be dominated by a
few extreme values, even if the bulk of the predicted values are close to the observed
ones.
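A minimal Python sketch of equation 6.1 makes the sensitivity to extreme values visible. The data below are invented for illustration only.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observations and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
all_close = [1.1, 1.9, 3.1, 3.9, 5.1]      # every prediction is slightly off
one_outlier = [1.0, 2.0, 3.0, 4.0, 10.0]   # four perfect predictions, one far off

mse_close = mean_squared_error(y_true, all_close)
mse_outlier = mean_squared_error(y_true, one_outlier)
print(mse_close, mse_outlier)
```

Although four of the five predictions in the second list are exact, the single large deviation dominates the metric because the error enters quadratically.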
The MSE and other metrics can be used in two different ways in the evaluation of a
prediction model. One use is during model construction and the other is during model
testing and assessment.
1. Loss function: A loss function is used during model building while the parameters ai
are optimized.
2. Score function: A score function is used to compare the values predicted by the
model with the observed values after the model has been built.
In the case of the loss function, the metric is directly used to optimize the model
parameters. The final value(s) of the model parameters will depend on the loss func-
tion in the sense that a different loss function will lead to different optimal parameters.
The final evaluation of the predicted values ŷ compared to the true values y is done
using a score function, which may or may not be different from the loss function.
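The two roles of a metric can be sketched for the linear model y = mx + b. In the Python example below, the MSE acts as the loss function when the parameters are fitted (the closed-form least-squares solution minimizes it) and as the score function on separate test data. All data values and function names are assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Slope m and intercept b that minimize the MSE loss (ordinary least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.1, 2.9, 5.2, 6.8]
test_x = [4.0, 5.0]
test_y = [9.1, 10.9]

m, b = fit_line(train_x, train_y)                             # model building
train_loss = mse(train_y, [m * x + b for x in train_x])       # MSE as loss function
test_score = mse(test_y, [m * x + b for x in test_x])         # MSE as score function
print(m, b, train_loss, test_score)
```

Choosing a different loss function, for example the mean absolute error, would generally lead to different optimal values of m and b.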
To determine the Gini coefficient of the income distribution of people in a given coun-
try, we find the income of all the people in the country and present this data as a
cumulative percentage of population against the cumulative share of income earned.
An example of the resulting Lorenz curve is shown in the following figure:
The main idea behind the Gini index is to show the extent to which wealth is or is not
evenly distributed throughout the population of a country. Researchers are often also
interested in whether the government of said country is trying to keep this ratio as low
as possible, namely whether they are striving for income equality. As we already know,
the Gini coefficient 0 ≤ G ≤ 1 is the ratio of the areas
$$G = \frac{A}{A + B}.$$
Example
Suppose we live in a very small country with only ten people. Let’s call them a1, a2, …,
a10 and suppose that the total income for the country is $100 per day and this
income is distributed evenly among the population so that the income of each resi-
dent ai is $10 per day (i = 1, …, 10). Evaluate A and G.
The total income distribution is shown in the table below, where the proportion
and cumulative proportion refer to the population considered in this example. So,
in this case, A = 0 and G = 0; the Lorenz Curve and the line of equality coincide.
Example
Suppose that the graph of the cumulative proportion of population (on the hori-
zontal axis) against the cumulative percentage of income (on the vertical axis) is as
shown below, where the Lorenz curve is defined by $y = x^5$.
We need to find the area of the region A, between the line of equality (in green)
and the Lorenz curve (in red). One way to do this is to find the area of the triangular
region below the line of equality and subtract the area under the Lorenz curve. The
area of the triangle below the line of equality is half the area of the whole square,
and is therefore 0.5. Let $I_B$ denote the area of the region B under the Lorenz curve. $I_B$ is then
$$I_B = \int_0^1 x^5 \, dx = \frac{1}{6} \approx 0.1667$$
and therefore, the area of the region A is 0.5 − 0.1667 = 0.3333. We can now find the Gini index, which is
$$G = \frac{A}{A + B} = \frac{0.3333}{0.5} \approx 0.667.$$
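Both Gini examples can be reproduced numerically. The Python sketch below computes G = A / (A + B) once from a Lorenz curve given as a function and once from a list of incomes; the function names, the midpoint rule, and the number of integration steps are assumptions made for this illustration.

```python
def gini_from_lorenz(lorenz, steps=100_000):
    """G = A / (A + B), where A + B = 0.5 and B is the area under the Lorenz curve."""
    width = 1.0 / steps
    area_b = sum(lorenz((k + 0.5) * width) for k in range(steps)) * width  # midpoint rule
    return (0.5 - area_b) / 0.5


def gini_from_incomes(incomes):
    """Same ratio, with B taken from the piecewise-linear Lorenz curve of the data."""
    values = sorted(incomes)
    total = sum(values)
    n = len(values)
    area_b = 0.0
    cumulative = 0.0
    for value in values:
        previous = cumulative
        cumulative += value / total
        area_b += (previous + cumulative) / (2 * n)  # trapezoid of width 1/n
    return (0.5 - area_b) / 0.5


g_curve = gini_from_lorenz(lambda x: x ** 5)
g_equal = gini_from_incomes([10] * 10)
g_single_earner = gini_from_incomes([0] * 9 + [100])
print(g_curve, g_equal, g_single_earner)
```

The third call describes a hypothetical country in which one of ten residents earns the entire income; the result is close to, but below, the upper limit G = 1.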
Gini Impurity
The Gini index should not be confused with the Gini impurity. Unfortunately, in practice the two terms are often used interchangeably. The term “Gini index” is often used for the
Gini impurity and we need to check the context carefully to avoid further confusion.
Similar to the Gini index discussed above, the Gini impurity is a measure of the homo-
geneity of a distribution of elements in a set and is related to the probability of incor-
rectly classifying an object in a data set. Suppose that we have N classification groups
or classes in a given dataset and let pi be the probability of a random instance belong-
ing to class i. Then we have the following cases for two subsequent experiments where
we assign a class to an element of the dataset:
1. We obtain the identical output for the same category i with probability $p_i^2$.
2. We obtain the identical output, irrespective of the category, with probability $\sum_{i=1}^{N} p_i^2$.
3. Using the above, we obtain two different outputs with probability $1 - \sum_{i=1}^{N} p_i^2$.
Therefore, to find the Gini impurity, we need to find the probability of being wrong
about any given classification and then sum over all classifications. The Gini impurity is
$$G = \sum_{i=1}^{N} \sum_{j \neq i} p_i p_j.$$
(6.2)
It is sometimes computationally useful to write this formula in other ways. Recall that $\sum_{i=1}^{N} p_i = 1$, which means that we must assign each item to one of the available classes, and therefore, $p_i = 1 - \sum_{j \neq i} p_j$. Observe that
$$\begin{aligned}
G &= \sum_{i=1}^{N} p_i \sum_{j \neq i} p_j \\
&= \sum_{i=1}^{N} p_i (1 - p_i) \\
&= \sum_{i=1}^{N} (p_i - p_i^2) \\
&= \sum_{i=1}^{N} p_i - \sum_{i=1}^{N} p_i^2 \\
&= 1 - \sum_{i=1}^{N} p_i^2
\end{aligned}$$
where we used the fact that $\sum_{i=1}^{N} p_i = 1$ in the last step, which is a result of the fact that there are no other possible outcomes except the N classifications.
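The equivalence of the two forms of the Gini impurity can be checked numerically. In the Python sketch below, the class probabilities and the function names are assumptions for illustration.

```python
def gini_impurity_pairs(p):
    """Double sum over all ordered pairs of different classes, as in equation 6.2."""
    return sum(p[i] * p[j] for i in range(len(p)) for j in range(len(p)) if j != i)


def gini_impurity(p):
    """Simplified form 1 - sum_i p_i^2."""
    return 1.0 - sum(p_i ** 2 for p_i in p)


probabilities = [0.5, 0.3, 0.2]  # hypothetical class probabilities, summing to 1

g_pairs = gini_impurity_pairs(probabilities)
g_simple = gini_impurity(probabilities)
print(g_pairs, g_simple)
print(gini_impurity([1.0]), gini_impurity([0.5, 0.5]))
```

The simplified form needs only N terms instead of N(N − 1), which matters when the impurity is evaluated many times, for example for every candidate split of a decision tree.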
What is Entropy?
Apart from the laws of quantum mechanics, entropy is perhaps the most confusing
physical quantity. In our everyday language, it is associated with the degree of random-
ness in a system — for example, we would say that a cube of sugar dissolved in tea has
a higher level of randomness as it is natural for the sugar to dissolve, but we have
never observed that a sweet tea spontaneously separates into tea and a cube of sugar
at the bottom. We also often hear that “entropy defines the arrow of time.” The origin of these analogies is understandable, but they do not quite capture the concept of entropy. Furthermore, entropy has historically been introduced first in thermodynamics and then in statistical physics. At first glance, both appear very different but, after careful consideration, they are equivalent to each other. Before turning to information science, it is therefore useful to understand entropy on a more fundamental level.
Arrow of Time
The arrow of time is a concept indicating that time always moves forward (and not backward) and that reactions follow this direction.
We start with the thermodynamic understanding of entropy and remind ourselves that
while some reactions occur spontaneously, others do not. A good textbook that covers
this subject in more detail is Physical Chemistry (Atkins & de Paula, 2006, p. 573 ff). For
example, a hot drink such as tea cools down to ambient temperature, a gas expands
into the available volume and a ball bounces a bit lower each time it hits the ground
until it comes to rest. In the case of the ball, we can intuitively understand this as with
each bounce, the ball transfers some of its kinetic energy to the ground, which is trans-
formed into the random thermal motion of the atoms in the ground, i.e. the ground
heats up a little bit. However, we have never observed that a ball resting on a warm
ground spontaneously jumps into the air. This could only happen if all the atoms in the
ground would act together and push the ball away. We can then identify spontaneous
reactions such as the bouncing ball or expanding gas by looking for changes that lead
to the dispersal of energy of the system: As the ball bounces a bit less each time, it
loses energy which is transferred into the random motion of atoms into the ground.
Inner Energy
The inner energy can be changed by either transferring energy as heat or performing
work: $dU = \delta Q + \delta W$

The thermodynamic definition of entropy is then centered on the idea that a change in
a system is related to the energy it loses in the process, which in turn can be expressed
by the amount of energy that is transferred by heat. This sounds quite complicated but
in thermodynamics, the (inner) energy of a system is a measure of how much work a
given system can do. For example, a compressed gas can turn a turbine, whereas a gas
filling the available space cannot. As we have seen in the example of the bouncing ball,
heat is related to the random motion of the atoms as opposed to uniform motion in
the case of work. We can then conclude that the ability of a system to perform “useful”
work is reduced in proportion to the amount of heat transferred into random motion.
Furthermore, it is intuitively reasonable that this depends on the temperature: The
effect of adding a bit more heat to an already hot system is much less than to a cold
system. This is indeed then the thermodynamic definition of entropy:
$$ dS = \frac{\delta Q}{T} \tag{6.3} $$
where S denotes the entropy, δQ the incremental amount of heat exchanged, and T the
temperature of the system. This definition gives us an understanding of why we asso-
ciate entropy with randomness if we think back to the example of the bouncing ball. As
a bit of heat is transferred to the ground, the atoms in the ground move around a bit
more and their motion becomes more random or more unordered.
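The temperature dependence in equation 6.3 can be illustrated with a minimal numerical sketch. It assumes that the temperature stays approximately constant while a small amount of heat is added, so that the entropy change is roughly Q/T. The function name and the values (1 J of heat at 300 K and at 600 K) are illustrative assumptions and do not come from the text.

```python
def entropy_change(heat_joule: float, temperature_kelvin: float) -> float:
    """Approximate entropy change dS = dQ / T for a small heat transfer at nearly constant T."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive (in kelvin)")
    return heat_joule / temperature_kelvin


# The same amount of heat added to a cold and to a hot system (illustrative values).
dS_cold = entropy_change(1.0, 300.0)
dS_hot = entropy_change(1.0, 600.0)
print(f"cold system: {dS_cold:.5f} J/K, hot system: {dS_hot:.5f} J/K")
```

Under these assumptions, the colder system gains twice as much entropy from the same heat, which matches the qualitative statement above.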
$$ W = \frac{N!}{n_0!\, n_1!\, n_2! \dots}. \tag{6.4} $$

$$ S = k_B \ln W \tag{6.5} $$
where $k_B$ is the Boltzmann constant ($k_B = 1.38 \cdot 10^{-23}\,\mathrm{m^2\,kg\,s^{-2}\,K^{-1}}$) and W is the weight
of the configuration. From the reasoning above, we can see that this quantity behaves
in the same way as the thermodynamic definition we have seen earlier (in fact, one can
show that they are equivalent). The only parameter that defines the system is the
temperature T, which we can modify by the exchange of some heat q. In the limit of T → 0,
only the ground state is accessible, which means that only one configuration is possible,
leading to W = 1 and hence, S = 0 as ln 1 = 0. As only one state is accessible, the
amount of “randomness” is minimal and increases as we raise the temperature (by
adding some heat q) because more states become accessible.

Boltzmann constant
$k_B = 1.38 \cdot 10^{-23}\,\mathrm{m^2\,kg\,s^{-2}\,K^{-1}}$
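Equations 6.4 and 6.5 can be evaluated directly for small systems. The following is a minimal sketch; the function names and the example occupation numbers are hypothetical choices for illustration, and the value of the Boltzmann constant is the rounded one quoted above.

```python
from math import factorial, log

K_B = 1.38e-23  # Boltzmann constant in J/K (rounded value used in the text)


def weight(occupations):
    """Weight W = N! / (n0! n1! n2! ...) of a configuration (equation 6.4)."""
    w = factorial(sum(occupations))
    for n in occupations:
        w //= factorial(n)  # each division is exact, so integer arithmetic suffices
    return w


def boltzmann_entropy(occupations):
    """Statistical entropy S = k_B ln W (equation 6.5)."""
    return K_B * log(weight(occupations))


# All four molecules in the ground state: only one configuration, so S = 0.
print(weight([4, 0, 0]), boltzmann_entropy([4, 0, 0]))
# Molecules spread over three states: more configurations, so S > 0.
print(weight([2, 1, 1]), boltzmann_entropy([2, 1, 1]))
```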
There are, of course, exceptions to this. One example is carbon monoxide (CO). The
ground state is such that a carbon atom C is followed by an oxygen atom O and, as the
temperature gets lower and lower, the only accessible state should be CO CO CO… as
the system slowly “freezes” into a regular lattice. However, the state OC is not much
different from CO in terms of energy and hence, it can happen that the configuration
OC is “trapped” as not enough energy is available to flip into CO and as T → 0, our
lattice might look like this: COCOOC…. This gives us a first glimpse of how the entropy
can be used in relation to information science if we imagine that we could denote the
configuration CO with 0 and OC with 1 and the above sequence expressed as a bit
stream would read 001….
$$ \ln W = \ln \frac{N!}{n_0!\, n_1!\, n_2! \dots} \tag{6.6} $$

$$ = \ln N! - \left( \ln n_0! + \ln n_1! + \ln n_2! + \dots \right) \tag{6.7} $$
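$$ = \ln N! - \sum_i \ln n_i!. \tag{6.8} $$

Using Stirling's approximation for large x,

$$ \ln x! \approx x \ln x - x, \tag{6.9} $$

we obtain

$$ \ln W = N \ln N - N - \sum_i \left( n_i \ln n_i - n_i \right) \tag{6.10} $$

$$ = N \ln N - \sum_i n_i \ln n_i, \tag{6.11} $$

where the last step uses $\sum_i n_i = N$.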
$$ S = k_B \sum_i \left( n_i \ln N - n_i \ln n_i \right) \tag{6.12} $$

$$ = -k_B \sum_i n_i \ln \frac{n_i}{N} \tag{6.13} $$

$$ = -N k_B \sum_i p_i \ln p_i \tag{6.14} $$
where pi = ni/N is the fraction of molecules in state i or the probability that the mole-
cule is in state i.
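The quality of the large-N form in equation 6.14 can be checked numerically. The sketch below compares the exact value of ln W, computed with the log-gamma function (using ln n! = lgamma(n + 1)), with the expression −N ∑ p_i ln p_i. The function names and the occupation numbers are hypothetical and chosen only for illustration.

```python
from math import lgamma, log


def ln_weight_exact(occupations):
    """Exact ln W = ln N! - sum_i ln n_i!, with ln n! evaluated as lgamma(n + 1)."""
    n_total = sum(occupations)
    return lgamma(n_total + 1) - sum(lgamma(n + 1) for n in occupations)


def ln_weight_large_n(occupations):
    """Large-N form ln W ~ -N * sum_i p_i ln p_i with p_i = n_i / N."""
    n_total = sum(occupations)
    return -n_total * sum((n / n_total) * log(n / n_total) for n in occupations if n > 0)


occupations = [5000, 3000, 2000]  # hypothetical occupation numbers
exact = ln_weight_exact(occupations)
approx = ln_weight_large_n(occupations)
relative_error = abs(exact - approx) / exact
print(f"exact: {exact:.1f}, approximation: {approx:.1f}, relative error: {relative_error:.1e}")
```

For occupation numbers of this size the two values agree to well within one percent, and the agreement improves as N grows.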
Shannon Entropy
The father of information theory, Claude Shannon, introduced the term entropy to
describe the minimum encoding size necessary to send a message without information
loss (Shannon, 1948). This involves two questions. The first is what maximal compression
rate we can achieve when transmitting the information; this is related to the entropy.
The second concerns the technical implementation and is related to the maximal
capacity of a transmission channel. The latter is part of electrical engineering and,
for the remainder, we focus on the first question.
In information theory, we are concerned with the amount of information that we can
obtain from a system and the information content of some event A is defined as
$$ I(A) = -\log_2 p(A) = \log_2 \frac{1}{p(A)} \tag{6.15} $$
where p(A) is the probability that the event occurs. We notice that the information con-
tent decreases when the probability of an event increases — the more likely an event
becomes, the less “surprised” we are about it and the more we expect it, which implies
that it merely confirms the information we already had. In the extreme case that p(A)
= 1 where the event always occurs, no further information is added. We also note that
the information due to independent events is additive: I(A1∩A2) = I(A1) + I(A2).
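The properties of the information content noted above can be checked with a short sketch of equation 6.15. The function name and the example probabilities are assumptions made for illustration.

```python
from math import isclose, log2


def information_content(p: float) -> float:
    """Information content I = log2(1 / p) in bits for an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in the interval (0, 1]")
    return log2(1.0 / p)


# A certain event carries no information; rarer events carry more.
print(information_content(1.0), information_content(0.5), information_content(0.25))

# For independent events, p(A1 and A2) = p(A1) * p(A2), so the information adds up.
p1, p2 = 0.5, 0.25
print(isclose(information_content(p1 * p2),
              information_content(p1) + information_content(p2)))
```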
We now turn to larger systems that are described by some discrete variable X that can
take the values {x1, x2, …, xn} according to some probability distribution p(X). The
Shannon entropy is then defined as the average information content of an outcome
$$ H = E[I(X)] = E[-\log_2 p(X)] \tag{6.16} $$

$$ = -\sum_i p(x_i) \log_2 p(x_i) \tag{6.17} $$
where E[.] is the expectation value we use to calculate the average. Comparing this
definition to equation 6.14, we find that they are the same, apart from the change in
base from the natural logarithm to base two and the fact that the Shannon entropy
does not have the constants $N k_B$ as they are not directly related to a physical system.
Since we have already studied the entropy of ensembles in the context of statistical
physics, this connection is not surprising. In both cases, we are concerned with large
systems that are described in terms of some probability function p that determines the
likelihood that a possible discrete value or state is occupied.

Expectation Value
$E[x] = \int x\,p(x)\,dx$ or $E[x] = \sum_i x_i p_i$
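Equation 6.17 translates almost directly into code. The following is a minimal sketch; the function name, the tolerance used to check normalization, and the example distributions are assumptions. Terms with zero probability are skipped, following the usual convention that they contribute nothing to the sum.

```python
from math import log2


def shannon_entropy(probabilities):
    """Shannon entropy H = -sum_i p_i log2 p_i in bits (equation 6.17)."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to one")
    return sum(-p * log2(p) for p in probabilities if p > 0)


# Four equally likely outcomes need two bits on average.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
# A skewed distribution over the same outcomes has a lower entropy.
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))
```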
So far, the event space was discrete. By considering a probability density function
rather than a frequency, it is possible to meaningfully discuss Shannon’s entropy for
underlying variables that have infinitely many possible values. The underlying topology
of the variable we are measuring becomes important. If the possible values are the real
numbers, equation 6.17 can be written in integral form
$$ H(x) = -\int p(x) \log_2 p(x) \, dx \tag{6.18} $$
Example
We can compute the entropy of a coin toss. For a fair coin, heads and tails come
up with equal probability of 50 percent. The Shannon entropy is
$$ H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) $$

$$ = -\sum_{i=1}^{2} \frac{1}{2} \log_2 \frac{1}{2} $$

$$ = -\sum_{i=1}^{2} \frac{1}{2} \cdot (-1) $$

$$ = 1\ \text{bit}. $$
In this case, the entropy is maximal as we cannot predict the outcome of the next
coin toss from what we have observed so far. Hence, we need one bit per coin toss
to encode the resulting information if heads or tails comes up. However, if the coin
is not fair so that heads comes up with a higher probability p than the probability q
for tails, our entropy would be different: H = −p log2(p) − q log2(q). This number is
smaller than one as the probability of heads is now higher and we are less “sur-
prised” if heads comes up.
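The effect of an unfair coin can be explored with a short sketch of the expression H = −p log2(p) − q log2(q), where q = 1 − p. The function name and the probabilities tried are assumptions for illustration.

```python
from math import log2


def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin that shows heads with probability p and tails with q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in the interval [0, 1]")
    q = 1.0 - p
    return sum(-x * log2(x) for x in (p, q) if x > 0)


# The fair coin gives the maximum; the more biased the coin, the lower the entropy.
for p in (0.5, 0.6, 0.9, 1.0):
    print(f"p = {p}: H = {binary_entropy(p):.3f} bits")
```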
Example
First, we note that our system only has two states, zero and one. Counting the num-
ber of each, we find that we have six zeros and two ones out of eight characters.
Hence, the probability to obtain a zero is p(0) = 6/8 = 3/4 and the probability to
obtain a one is p(1) = 2/8 = 1/4. The Shannon entropy is then
H = −0.75 log2(0.75) − 0.25 log2(0.25) = 0.811 bits. We can compare this to the coin
toss above: If the zeros (heads) occurred as often as the ones (tails), we would
have H = 1 bit. However, the zeros occur much more frequently than the ones;
hence, encountering a zero conveys less information, as we can guess with a probability
of 3/4 that the next character will be a zero.
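The counting argument of this example can be sketched in code. The bit string below is a hypothetical sequence with six zeros and two ones, chosen only to match the counts stated in the example; the function name is likewise an assumption.

```python
from collections import Counter
from math import log2


def empirical_entropy(symbols: str) -> float:
    """Shannon entropy in bits per symbol, estimated from relative symbol frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


bits = "00100010"  # hypothetical sequence: six zeros and two ones
h = empirical_entropy(bits)
print(f"H = {h:.3f} bits per character")
```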
We can use the concept of entropy to determine how different two probability distributions
p(x) and q(x) are. For each of them, we can define the Shannon entropy; in addition, we
can define the relative entropy, or Kullback-Leibler (KL) divergence, between p(x)
and q(x) as
$$ D_{KL}(p \parallel q) = \sum_i p(x_i) \log_2 \frac{p(x_i)}{q(x_i)} \tag{6.19} $$

$$ D_{KL}(p \parallel q) = \int p(x) \log_2 \frac{p(x)}{q(x)} \, dx \tag{6.20} $$
The relative entropy between two probability distributions over the same random vari-
able is a measure of how different the two distributions are. It satisfies Gibbs’ inequal-
ity
$$ D_{KL}(p \parallel q) \geq 0 \tag{6.21} $$
them p and q, on the same underlying set of variables, suppose p is the true distribu-
tion, but q is the distribution for which we have optimized. How hard is it (measured in
number of bits of data that we need) to identify an event in the space? The cross
entropy of p and q is an effort to answer this question.
To define the cross entropy, we will use the entropy of the random variable x and the
Kullback-Leibler divergence between the true probability distribution p and the one we
use to estimate it, q. In essence, “how hard it is” to identify an event in the space is the
natural difficulty (uncertainty), which is measured by the entropy, plus the added difficulty
induced by estimating p with q, which is quantified by the Kullback-Leibler divergence.
Recall that the Kullback-Leibler divergence $D_{KL}(p \parallel q)$ defined by equation 6.19 is

$$ D_{KL}(p \parallel q) = \sum_i p(x_i) \log_2 \frac{p(x_i)}{q(x_i)}. $$
$$ D_{KL}(p \parallel q) = \sum_{i=1}^{N} p(x_i) \log_2 \frac{p(x_i)}{q(x_i)} $$

$$ = \sum_{i=1}^{N} p(x_i) \log_2 p(x_i) - \sum_{i=1}^{N} p(x_i) \log_2 q(x_i) $$

$$ = -H(p) + H(p, q), \tag{6.22} $$
where H(p) is the Shannon entropy of the distribution p, defined by equation 6.17, and
H(p, q) is defined to be

$$ H(p, q) = -\sum_{i=1}^{N} p(x_i) \log_2 q(x_i), \tag{6.23} $$
and is called the cross entropy of p and q. The cross entropy H(p, q) represents the
average number of bits required to identify an event when our coding scheme is built
on the distribution q while the true distribution is p. Due to the asymmetry of the
Kullback-Leibler divergence, the cross entropy is also generally asymmetric, i.e., in
general H(p, q) ≠ H(q, p). Another basic attribute of the cross entropy is that it is
bounded below by the entropy of the true distribution: since the Kullback-Leibler
divergence is non-negative, equation 6.22 gives H(p, q) ≥ H(p). The smallest possible
cross entropy is obtained when we use the true distribution in our coding scheme.
Namely, using equation 6.22 with q = p we see
$$ H(p, p) = H(p) + D_{KL}(p \parallel p) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i) = H(p). $$
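The relation between entropy, Kullback-Leibler divergence, and cross entropy in equation 6.22 can be verified numerically. The sketch below uses hypothetical function names and example distributions, and it assumes that q assigns a non-zero probability to every outcome that p allows.

```python
from math import isclose, log2


def entropy(p):
    """Shannon entropy H(p) in bits."""
    return sum(-p_i * log2(p_i) for p_i in p if p_i > 0)


def cross_entropy(p, q):
    """Cross entropy H(p, q) = -sum_i p_i log2 q_i in bits (equation 6.23)."""
    return sum(-p_i * log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits (equation 6.19)."""
    return sum(p_i * log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


p = [0.75, 0.25]  # illustrative "true" distribution
q = [0.5, 0.5]    # illustrative distribution used for coding

# Equation 6.22 rearranged: H(p, q) = H(p) + D_KL(p || q)
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
# Coding with the true distribution reaches the lower bound H(p).
print(cross_entropy(p, p), entropy(p))
```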
In machine learning applications, the cross-entropy is also often used as a loss function
during the optimization of the model, in particular for classification tasks where
events are sorted into two or more categories. During model building and training, we
know the true category of each event. This is our p, namely p_k = 1 for the true
category k and p_l = 0 for all others. The prediction model returns a probability for each
possible category, for example q_1 = 0.1, q_2 = 0.7, q_3 = 0.01, …, where ∑_i q_i = 1
because the event has to belong to one of the categories. Hence, the cross-entropy
determines how well the model q describes the true p.

Machine Learning
In machine learning, algorithms are not explicitly programmed, but use data to learn
specific relationships.
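For a single event with a known true category, the cross-entropy reduces to the negative logarithm of the probability that the model assigns to that category. The following is a minimal sketch of this loss; the function name, the three-category prediction, and the use of base-two logarithms (following equation 6.23; machine learning libraries frequently use the natural logarithm instead) are assumptions for illustration.

```python
from math import log2


def cross_entropy_loss(true_index: int, predicted) -> float:
    """Cross-entropy in bits between a one-hot true distribution p and predicted probabilities q."""
    if abs(sum(predicted) - 1.0) > 1e-9:
        raise ValueError("predicted probabilities must sum to one")
    # One-hot encoding of the true category: p_k = 1 and p_l = 0 for all other categories.
    one_hot = [1.0 if i == true_index else 0.0 for i in range(len(predicted))]
    return sum(-p_i * log2(q_i) for p_i, q_i in zip(one_hot, predicted) if p_i > 0)


predicted = [0.1, 0.7, 0.2]  # hypothetical model output for three categories

# The loss is small if the model favors the true category and large otherwise.
print(cross_entropy_loss(1, predicted))
print(cross_entropy_loss(0, predicted))
```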
Summary
We also discuss how to evaluate predictive models, starting with the widespread
method of the Mean Squared Error (MSE), which is used to quantify the quadratic
deviation between a model and the observed data. The Gini index is often confused
with the Gini impurity. The Gini index is a statistical measure of distribution, commonly
applied to study the income (or wealth) distribution of a country. The Gini
impurity, by contrast, is used in machine learning, in particular in the construction
of decision trees, to determine the probability of
being wrong about any given classification. The Kullback-Leibler divergence
between two probability distributions over the same variable is used to describe
the degree of similarity between the two distributions. We conclude this unit by
investigating what it actually means to estimate the entropy and how accurately we
are able to do so, and we develop the tool of cross-entropy, used to compare prob-
ability distributions, to help us.
Knowledge Check
You can check your understanding by completing the questions for this unit on the
learning platform.
Good luck!
Congratulations!
You have now completed the course. After you have completed the knowledge tests on
the learning platform, please carry out the evaluation for this course. You will then be
eligible to complete your final assessment. Good luck!
Appendix 1
List of References
Atkins, P. W., & de Paula, J. (2006). Physical chemistry (8th ed.). W. H. Freeman.
Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for machine learning.
Cambridge University Press. http://doi.org/10.1017/9781108679930
Gini, C. (1921). Measurement of inequality of incomes. The Economic Journal, 31(121),
124–126. http://dx.doi.org/10.2307/2223319
Johnson, B. & Johnson, J. (2012, April 28). Cross-product in vector algebra.
https://www.thunderbolts.info/wp/2012/05/02/appendix-i-vector-algebra/cross-product-in-vector-algebra/
Loomis, L., & Sternberg, S. (2014). Advanced calculus. World Scientific Publishing Company.
Oppenheim, A., Willsky, A.S. & Nawab, S.H. (1997). Signals & systems (2nd ed.). Prentice
Hall.
Pexels. (n.d.). Shallow focus photography of a cavalier king charles spaniel.
https://www.pexels.com/photo/shallow-focus-photography-of-a-cavalier-king-charles-spaniel-1390361/
Appendix 2
List of Tables and Figures
A Parabola: f(x) = x²
Source: Author.
f(x) = |x|
Source: Author.
Visualization of a Vector Field Using Arrows to Indicate Direction and Strength of the
Field at Each Position
Source: Author.