DLMDS Advanced Mathematics

COURSE BOOK
Advanced Mathematics
DLMDSAM01
Masthead
Publisher:
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt
Mailing address:
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
media@iu.org
www.iu.de
DLMDSAM01
Version No.: 001-2021-0512
© 2021 IU Internationale Hochschule GmbH

This course book is protected by copyright. All rights reserved.
This course book may not be reproduced and/or electronically edited, duplicated, or distributed in any kind of
form without written permission by the IU Internationale Hochschule GmbH.
www.iubh.de
Module Director
Prof. Dr. Eric Guiffo Kaigom
Mr. Guiffo Kaigom is a professor at IU International University of
Applied Sciences in the robotics and industrial engineering programs.
His professional focus is on the development of digital solutions for
the use of intelligent robots in combination with human creativity.
After studying electrical engineering and obtaining his doctorate at
RWTH Aachen University, he headed the intelligent robot systems
team. His primary responsibility was the development and transfer of
robotics and virtual/augmented reality-driven technologies to
aerospace (on-orbit servicing) and manufacturing (smart factory)
industries. After trading his academic position for one in industry, he
held positions as head of research and development, emerging
technologies, and as a consultant. In industry, he made important
contributions to product development and lighthouse projects in the
fields of digital engineering, digital transformation, and data-driven
resilient ecosystems.
Mr. Guiffo Kaigom has also mentored foreign students for many years.
He is a regular reviewer for the IEEE Robotics and Automation Society
conferences and serves as a journal reviewer and editorial board
member for various journals.
Table of Contents
Module Director
Introduction
Advanced Mathematics
Signposts Throughout the Course Book
Learning Objectives
Unit 1
Calculus
1.1 Differentiation and Integration
1.2 Partial Differentiation
1.3 Multiple Integrals
1.4 Calculus of Variations
Unit 2
Integral Transformations
2.1 Convolutions
2.2 Fourier Transformation
Unit 3
Vector Algebra
3.1 Scalars and Vectors
3.2 Addition and Subtraction of Vectors
3.3 Multiplication of Vectors: Dot Product and Scalar Product
Unit 4
Vector Calculus
4.1 Differentiation of Vectors
4.2 Integration of Vectors
4.3 Scalar and Vector Fields
4.4 Vector Operations
Unit 5
Matrices and Vector Spaces
5.1 Basic Matrix Algebra
5.2 Determinant, Trace, Transpose, Complex, and Hermitian Conjugates
5.3 Diagonalization
5.4 Tensors
Unit 6
Information Theory
6.1 Mean Squared Error (MSE)
6.2 Gini Index
6.3 Entropy, Shannon Entropy, Kullback-Leibler Divergence
6.4 Cross Entropy
Appendix 1
List of References
Appendix 2
List of Tables and Figures
Introduction
Signposts Throughout the Course Book
Welcome
This course book contains the core content for this course. Additional learning materials can
be found on the learning platform, but this course book should form the basis for your
learning.
The content of this course book is divided into units, which are divided further into sections.
Each section contains only one new key concept to allow you to quickly and efficiently add
new learning material to your existing knowledge.
At the end of each section of the digital course book, you will find self-check questions.
These questions are designed to help you check whether you have understood the concepts
in each section.
For all modules with a final exam, you must complete the knowledge tests on the learning
platform. You will pass the knowledge test for each unit when you answer at least 80% of the
questions correctly.
When you have passed the knowledge tests for all the units, the course is considered fin-
ished and you will be able to register for the final assessment. Please ensure that you com-
plete the evaluation prior to registering for the assessment.
Good luck!
Learning Objectives
The course Advanced Mathematics aims to provide students with the mathematical back-
ground knowledge to use and understand current methods and approaches from engineer-
ing and the sciences.
To this end, the course starts with an exposition of the fundamentals of calculus. The notions
of differentiation and integration are introduced together with important generalizations to
multiple dimensions. Moreover, the widely used optimization technique of the calculus of
variations is explained. Integral transformations, which play a vital role in scientific and
engineering applications, are also covered.
These analytical techniques are complemented by a thorough introduction to mathematical
methods associated with linear algebra. Here, you will learn about the important concepts of
vectors, matrices, and their algebraic manipulation. Furthermore, the concept of a tensor,
which plays a crucial role in popular approaches to machine learning, is introduced.
The subject domains of linear algebra and calculus are brought together in the explanation
of vector calculus. The course concludes with explanations of important concepts from the
field of information theory, which underpins virtually all aspects of our contemporary
communication systems.
Unit 1
Calculus
STUDY GOALS
On completion of this unit, you will have learned ...
… how to differentiate and integrate functions of a single variable.
… how to perform partial differentiation and multiple integrals for functions with multiple
variables.
… how to approximate a function in a Taylor series.
… the basic concepts of calculus of variations.
DL-E-DLMDSAM01-L01
1. Calculus
Introduction
Functions express relationships between variables. For example, in standard notation,
the function y = f(x) formalizes how a value y, called the dependent variable, varies
with respect to another value x, called the independent variable. The letter f is the
name of the function.
The rate at which the dependent variable changes with respect to the independent var-
iable is of particular interest, both mathematically and for applications. One example
of such a rate of change is the change in distance with respect to time — also known as
velocity. The method for finding this rate when given a function is called differentiation.
Differentiation is a powerful tool that is used to understand the relationship between

variables. In the case of multiple independent variables, for example, z = f(x, y), par-
tial differentiation allows us to explore the rate of change of the function with respect
to each independent variable.
The operation of differentiation can often be “undone” via an operation called integration.
Integration can be seen as the inverse of differentiation, and the function it produces is
therefore often called the antiderivative. Intuitively, this means that if we have a formula
that expresses the speed of a particle with respect to time, we can often construct a
formula for the displacement of the particle (i.e., the distance it has traveled).
In calculus of variations, the concept of differentiation is extended to functionals,

which are maps from a set of functions to the set of real numbers. In this sense, func-
tionals are functions of functions. The calculus of variations looks for which input func-
tion maximizes or minimizes the dependent (output) variable.
Textbooks that cover this subject area in more depth include Deisenroth, Faisal & Ong
(2020, Chap. 5), Strang (2017, Chap. 2-5, 7, 8, 13, 14), and Loomis & Sternberg (2014,
Chap. 3, 8).
1.1 Differentiation and Integration
Derivatives of Functions of a Single Variable
We are often interested in how a function changes with respect to its argument. For
example, we could imagine a travelling car with position s at a given time t. We know
from our everyday experience that at each time, t, a car has a velocity, v(t), which
measures how fast the car is travelling at time t. Over a given time interval, Δt, the aver-
age velocity describes the rate at which the car travels the distance for that interval of
time
$$
v(t) = \frac{\Delta s}{\Delta t}, \tag{1.1}
$$
where Δs is the change in position of the car or the distance covered by the car. In
equation 1.1, the time t is the argument of the function v, which establishes a
relationship between the distance covered by the car and the time it takes to cover that
distance.
Function: A function is a relation between two sets that associates every element of one
set to exactly one element of the other set.
More generally, we often wish to find the rate of change of a general function f(x),
where f depends on some argument x. We will begin by considering functions that
depend only on a single variable, such as f(x) = x^2, which is shown in the graphic that
follows this explanation. Fix a given value of x and let us consider the value of the
function, f(x), as we change the input to a slightly different value, where we assume
that the function is continuous and doesn’t have any “kinks” or “jumps.” For example, if
we start at x_0 = 1 and move to x_1 = 1.1, the value of the function f(x) = x^2 will change
from f(x_0) = 1 to f(x_1) = 1.21. Let us denote this change in x by Δx and write
x → x + Δx to indicate that x changes from x to x + Δx. Then, the change in the value
of the function f is Δf = f(x + Δx) - f(x). By making this increment Δx arbitrarily small,
we can work out the rate of change of f at a single instant. That is the idea of a deriva-
tive — an instantaneous rate of change — and the limit operation allows us to formally
capture this intuition. As we make the change in x smaller and smaller, written Δx → 0,
we can define the gradient or first derivative of the function f as
$$
f'(x) \equiv \frac{df(x)}{dx} \equiv \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}. \tag{1.2}
$$
The function is differentiable at x_a if, and only if, this limit exists at the point
x = x_a. The definition 1.2 does not specify if we approach x from smaller values (so Δx is
negative) or vice versa (in which case Δx is positive). This is because, in order for the
limit to exist, the definition of the limit requires that the quotient
[f(x + Δx) − f(x)]/Δx approaches the same value, f′(x), from both the left and right of x.
Geometrically, the derivative fʹ(x) can be interpreted as the slope of the line tangent to
the function f(x) at the point x.
Example
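Find the first derivative of f(x) = x^2.
Using definition 1.2,
$$
\begin{aligned}
f'(x) &= \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{(x + \Delta x)^2 - x^2}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{x^2 + 2x\Delta x + (\Delta x)^2 - x^2}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{2x\Delta x + (\Delta x)^2}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{\Delta x\,(2x + \Delta x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} (2x + \Delta x) \\
&= 2x.
\end{aligned}
$$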
Here, we have used the fact that Δx becomes infinitesimally small as it approaches
zero, but is always non-zero and can therefore be cancelled from the numerator
and denominator.
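The limit in definition 1.2 can also be explored numerically. The following Python sketch
evaluates the difference quotient of f(x) = x^2 at x = 3 for shrinking Δx. The names `f` and
`difference_quotient` are illustrative choices and not part of the course material. By the
derivation above the quotient equals 2x + Δx, so the printed values should approach 6.

```python
def f(x):
    """The example function f(x) = x^2."""
    return x ** 2


def difference_quotient(func, x, dx):
    """Difference quotient (func(x + dx) - func(x)) / dx from definition 1.2."""
    return (func(x + dx) - func(x)) / dx


x = 3.0
for dx in (1.0, 0.1, 0.01, 0.001):
    # For f(x) = x^2 the quotient is 2x + dx, which tends to 2x = 6 as dx shrinks.
    print(dx, difference_quotient(f, x, dx))
```

The step Δx is kept non-zero throughout, mirroring the cancellation argument in the example.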
Be aware that to be differentiable at x_a, a function must be continuous at x_a (or else
the limit will not exist at that point), but merely being continuous everywhere does not
necessarily mean that a function is differentiable everywhere, as shown in the following
figure for f(x) = |x|. As x approaches 0 from the left, the derivative (which is the slope,
in this case) is −1, but if we approach x = 0 from the right, the limit of the quotient
[f(x + Δx) − f(x)]/Δx is +1. The right and left hand limits do not agree, so the limit, and
therefore the derivative of f(x) = |x|, is not defined at x = 0.
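The disagreement between the two one-sided limits can be seen directly in a small Python
sketch. The helper name `one_sided_quotient` is an assumption made for illustration; a
negative step approaches x = 0 from the left and a positive step from the right.

```python
def one_sided_quotient(func, x, dx):
    """Difference quotient with a signed step dx (dx < 0 approaches from the left)."""
    return (func(x + dx) - func(x)) / dx


left = one_sided_quotient(abs, 0.0, -1e-6)   # slope seen from the left of x = 0
right = one_sided_quotient(abs, 0.0, 1e-6)   # slope seen from the right of x = 0
print(left, right)
```

Because the two values differ, no single number can serve as the derivative of |x| at x = 0.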
Using definition 1.2 in combination with the laws of limits, one can find derivatives of
many fundamental functions. For reference, here are the derivatives of some important
functions where n > 0 is a natural number and a is a real-valued constant.
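$$
\begin{aligned}
\frac{d}{dx} x^n &= n x^{n-1}, & \frac{d}{dx} e^{ax} &= a e^{ax}, \\
\frac{d}{dx} \ln(ax) &= \frac{d}{dx}(\ln a + \ln x) = \frac{1}{x}, & \frac{d}{dx} \sin(ax) &= a\cos(ax), \\
\frac{d}{dx} \cos(ax) &= -a\sin(ax), & \frac{d}{dx} \tan(ax) &= \frac{a}{\cos^2(ax)}.
\end{aligned}
$$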
Higher order derivatives

Derivatives are themselves functions, therefore we can consider their rates of change.
We call derivatives of derivatives of a function f higher order derivatives of f, and they
are obtained using the definition of the derivative in the same way. For the second
derivative, we use definition 1.2 but replace the function f(x) with the first derivative
fʹ(x) as follows:
$$
f''(x) \equiv \frac{df'(x)}{dx} \equiv \lim_{\Delta x \to 0} \frac{f'(x + \Delta x) - f'(x)}{\Delta x}, \tag{1.3}
$$
where, again, fʹʹ is defined if, and only if, the limit exists. More generally, we can define
the nth derivative of f(x) to be
$$
f^{(n)}(x) \equiv \frac{df^{(n-1)}}{dx} \equiv \lim_{\Delta x \to 0} \frac{f^{(n-1)}(x + \Delta x) - f^{(n-1)}(x)}{\Delta x}, \tag{1.4}
$$
whenever the limit exists.
Stationary points
Looking again at the first graphic depicting a parabola, we notice that the point (0, 0) is
special; the value of the function on either side of point x = 0 is greater than at x = 0.
In other words, at x = 0, f achieves a local minimum. Graphically, we observe that the
line tangent to the graph of f at this point is horizontal: its slope is equal to zero. To
reiterate, the slope of the line tangent to f at x = 0, which is the derivative f′(0), has
a value of zero.
Points where the derivative is equal to zero, such as the point described above, are
called stationary points. After examining a number of examples, we see that fʹ is often,
but not always, equal to zero at a local minimum. The other possibility, illustrated by
the previous graphic, is that the slope of the tangent line, the derivative, is undefined
at the local extrema. For f(x) = |x|, (0, 0) is a critical point, defined to be a place where
the derivative is zero or does not exist, but it is not a stationary point.
Note that there are three different types of stationary points. They are as follows:
• the function f has a maximum at a stationary point at x = a if f′(a) = 0 and
f′′(a) < 0,
• the function f has a minimum at a stationary point at x = a if f′(a) = 0 and
f′′(a) > 0, and
• a stationary point at x = a is called a saddle point (for functions of a single variable,
also known as a stationary point of inflection) if f′(a) = 0 and f′′ changes sign at
this point.
Note that the maximum and minimum found this way may not be the global maximum
or minimum of the function, but rather a local extremum at the stationary point.
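The three cases above can be sketched in Python as a rough numerical classifier. This is
an illustration under stated assumptions rather than a robust method: the helper names, the
step size `h`, the tolerance `tol`, and the probing offset `delta` are all hypothetical
choices, and the caller is assumed to know already that f′(a) = 0.

```python
def second_derivative(func, x, h=1e-4):
    """Central-difference estimate of the second derivative of func at x."""
    return (func(x + h) - 2.0 * func(x) + func(x - h)) / h ** 2


def classify_stationary_point(func, a, h=1e-4, tol=1e-3):
    """Classify a point a at which func'(a) = 0 is assumed to hold."""
    curvature = second_derivative(func, a, h)
    if curvature > tol:
        return "minimum"
    if curvature < -tol:
        return "maximum"
    # f''(a) is numerically zero: probe f'' on both sides to detect a sign change.
    delta = 0.1  # hypothetical offset, chosen only for this illustration
    left = second_derivative(func, a - delta, h)
    right = second_derivative(func, a + delta, h)
    if left * right < 0:
        return "saddle point"
    return "undetermined"


# g(x) = x^3 - 3x has stationary points at x = -1 and x = 1.
print(classify_stationary_point(lambda x: x ** 3 - 3 * x, 1.0))
print(classify_stationary_point(lambda x: x ** 3 - 3 * x, -1.0))
# h(x) = x^3 has a stationary point at x = 0 where f'' changes sign.
print(classify_stationary_point(lambda x: x ** 3, 0.0))
```

The "undetermined" branch covers cases such as f(x) = x^4 at x = 0, where the second
derivative vanishes without changing sign and the test above gives no verdict.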
Rules of Differentiation
Differentiation of functions with a constant

Some functions are composed of a constant and a variable part, e.g. f(x) = a · g(x)
where a is an arbitrary constant and g(x) is some function that depends on x. The
derivative is given by
$$
\frac{d}{dx} f(x) = f'(x) = a \frac{d}{dx} g(x) = a g'(x).
$$
Differentiation of products
Previously in this section, differentiation rules for some functions with simple struc-
tures were discussed. However, in many cases, we are interested in the rates of change
of functions that are more complicated.
As a first example, we will investigate how to differentiate functions that can be written
as products of two other functions, namely functions of the form f(x) = u(x) · v(x). The
idea is that if we know how to differentiate u and v, and how to use that information
together with the product structure to find a derivative of f, we can avoid applying the
definition of the derivative. We could, from this perspective, reexamine f(x) = x^2, noting
that we could write it as f(x) = x · x. A slightly more complicated example is
g(x) = x^2 · sin(x), which we could decompose as g(x) = u(x) · v(x) where u(x) = x^2 and
v(x) = sin(x). Such a decomposition is not unique; we could consider any functions u
and v whose product is x^2 · sin(x). However, the idea behind decomposing the original
function f(x) into two functions, u and v, is to choose u and v that are easier to
differentiate than f. Then, if we have a general method to calculate the derivative of a
product, we can apply that method to f in order to make taking the derivative easier than if
we were to calculate it using definition 1.2. This general method, called
the product rule, is obtained from the definition (see equation 1.2) as follows. First, let’s
simplify the difference f(x+Δx) − f(x). This results in
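$$
\begin{aligned}
f(x + \Delta x) - f(x) &= u(x + \Delta x) \cdot v(x + \Delta x) - u(x) \cdot v(x) \\
&= u(x + \Delta x)\,[v(x + \Delta x) - v(x)] + v(x)\,[u(x + \Delta x) - u(x)].
\end{aligned}
$$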
Note that we added and subtracted v(x)u(x+Δx) in order to be able to factor. Substi-
tuting the result of our simplification into the definition of the derivative, we obtain
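$$
\begin{aligned}
\frac{df}{dx} &= \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} \left\{ u(x + \Delta x) \left[ \frac{v(x + \Delta x) - v(x)}{\Delta x} \right] + v(x) \left[ \frac{u(x + \Delta x) - u(x)}{\Delta x} \right] \right\}.
\end{aligned}
$$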
As Δx approaches zero, u(x+Δx) approaches u(x) and the terms in the square brackets
become the derivatives of the functions v and u respectively. Hence, the formula for
the derivative of a product of functions, called the product rule, is given by
$$
f' \equiv \frac{df}{dx} \equiv \frac{d}{dx}\,[u(x)v(x)] = u(x)\frac{dv(x)}{dx} + v(x)\frac{du(x)}{dx} = uv' + vu'. \tag{1.5}
$$
Using this rule repeatedly, the derivative of products of three or more differentiable
functions can be obtained as follows:
Given f(x) = u(x)v(x)w(x),
$$
\begin{aligned}
f'(x) = \frac{df}{dx} &= u\frac{d}{dx}(vw) + vw\frac{d}{dx}u \\
&= uv\frac{dw}{dx} + uw\frac{dv}{dx} + vw\frac{du}{dx}.
\end{aligned}
$$
Example
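Find the derivative of f(x) = x^2 sin(x).
Using the product rule (equation 1.5) with u(x) = x^2 and v(x) = sin(x), we get
$$
\begin{aligned}
\frac{d}{dx}\left[x^2 \sin(x)\right] &= x^2 \frac{d}{dx}\sin(x) + \sin(x)\frac{d}{dx}x^2 \\
&= x^2\cos(x) + 2x\sin(x).
\end{aligned}
$$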
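As a sanity check on this result, the following Python sketch compares the product-rule
formula with a symmetric difference quotient. The helper `central_difference` and its step
size `h` are assumptions introduced for illustration; they stand in for the limit in
definition 1.2 and only approximate the derivative.

```python
import math


def central_difference(func, x, h=1e-6):
    """Symmetric difference quotient, a numerical stand-in for the derivative."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def f(x):
    """f(x) = x^2 sin(x), the product u(x) v(x) with u = x^2 and v = sin(x)."""
    return x ** 2 * math.sin(x)


def f_prime_product_rule(x):
    """u v' + v u' = x^2 cos(x) + 2x sin(x), as obtained in the example."""
    return x ** 2 * math.cos(x) + 2.0 * x * math.sin(x)


for x in (0.5, 1.0, 2.0):
    # The two columns should agree to several decimal places.
    print(x, central_difference(f, x), f_prime_product_rule(x))
```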
The chain rule

Many functions can be written as compositions of functions, namely as functions
whose inputs are functions themselves. For example, f(x) = (x − 1)^2 can be written as
f(x) = u^2(x) where u(x) = x − 1. Write this as f(u(x)).
The essential idea of the chain rule is that we differentiate the outer function f with
respect to the inner function u to get fʹ(u), leaving the inner function alone. Then differ-
entiate the inner function u with respect to x to get uʹ(x) and multiply the two together:
$$
\frac{df}{dx} = \frac{df}{du} \cdot \frac{du}{dx}. \tag{1.6}
$$
This is known as the chain rule because we “chain” the derivatives together. The con-
cept can be easily extended to functions of functions of functions, and so on. We only
need to repeatedly apply the chain rule until we reach the independent variable.
Example
Find the derivative of f(x) = (x − 1)^2.
We can write this as f(x) = u^2(x), where u(x) = x − 1. Using equation 1.6 we obtain
$$
\begin{aligned}
\frac{df}{dx} &= \frac{df}{du} \cdot \frac{du}{dx} \\
&= 2u\,\frac{du}{dx} \\
&= 2u \cdot 1 \\
&= 2(x - 1).
\end{aligned}
$$
The chain rule can also be used to calculate the derivative of functions of the form
f(x) = 1/v(x). Rather than writing this as a quotient, we can express it as a composition
of functions, f(x) = v^{−1}(x) (noting that this is the −1 power, not the inverse function),
and then apply the chain rule
$$
\begin{aligned}
\frac{df}{dx} &= \frac{df}{dv}\,\frac{dv}{dx} \\
&= -v^{-2}\,\frac{dv}{dx} \\
&= -\frac{1}{v(x)^2}\,\frac{dv}{dx},
\end{aligned}
$$
where we have used the elementary derivative d/dx x^n = n x^{n−1}.
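Both uses of the chain rule above can be checked numerically. In the Python sketch below,
the inner function v(x) = x^2 + 1 for the reciprocal case is a hypothetical choice made only
for illustration, as are all helper names; the symmetric difference quotient approximates
the derivative rather than computing it exactly.

```python
def central_difference(func, x, h=1e-6):
    """Symmetric difference quotient, a numerical stand-in for the derivative."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def inner(x):
    """Inner function u(x) = x - 1."""
    return x - 1.0


def composed(x):
    """f(u(x)) = u(x)^2 = (x - 1)^2."""
    return inner(x) ** 2


def chain_rule_derivative(x):
    """df/du * du/dx = 2u * 1, evaluated at u = x - 1."""
    return 2.0 * inner(x) * 1.0


def v(x):
    """Illustrative choice v(x) = x^2 + 1 (an assumption, not from the text)."""
    return x ** 2 + 1.0


def reciprocal(x):
    """f(x) = 1 / v(x)."""
    return 1.0 / v(x)


def reciprocal_derivative(x):
    """-(1 / v^2) * dv/dx with dv/dx = 2x."""
    return -(1.0 / v(x) ** 2) * (2.0 * x)


print(central_difference(composed, 3.0), chain_rule_derivative(3.0))
print(central_difference(reciprocal, 2.0), reciprocal_derivative(2.0))
```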
Differentiation of quotients
In some cases, the function we want to take the derivative of can be written in the form
of a quotient of two functions, such as f(x) = u(x)/v(x). One way to create a rule in order
to calculate derivatives for such functions is to combine the product rule in equation 1.5
with the chain rule, and write the quotient as the product f(x) = u(x)[1/v(x)]. Applying the
product rule, we get
$$
\begin{aligned}
\frac{df}{dx} &= \frac{d}{dx}\left[\frac{u(x)}{v(x)}\right] \\
&= u\,\frac{d}{dx}\left[\frac{1}{v(x)}\right] + \frac{1}{v(x)}\,\frac{d}{dx}u(x).
\end{aligned}
$$
Using the chain rule to evaluate d/dx [1/v(x)] as above, we obtain
$$
\frac{df}{dx} = u\left[-\frac{dv(x)/dx}{v(x)^2}\right] + \frac{du(x)/dx}{v(x)}.
$$
The “prime” notation for the derivative yields an expression that is easier to read
$$
f' = \left(\frac{u}{v}\right)' = \frac{vu' - uv'}{v^2}, \tag{1.7}
$$
where u = u(x) and v = v(x).
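Equation 1.7 can be checked in the same numerical way. The Python sketch below uses
u(x) = sin(x) and v(x) = x, which are illustrative choices and not taken from the text; the
helper names are likewise hypothetical.

```python
import math


def central_difference(func, x, h=1e-6):
    """Symmetric difference quotient, a numerical stand-in for the derivative."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def quotient(x):
    """f(x) = u(x) / v(x) with the illustrative choice u = sin(x), v = x."""
    return math.sin(x) / x


def quotient_rule_derivative(x):
    """(v u' - u v') / v^2 from equation 1.7."""
    u, du = math.sin(x), math.cos(x)
    v, dv = x, 1.0
    return (v * du - u * dv) / v ** 2


for x in (1.0, 2.5):
    print(x, central_difference(quotient, x), quotient_rule_derivative(x))
```

The point x = 0 is avoided because v(x) = x vanishes there and the quotient is undefined.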
Integrals of Functions of a Single Variable
Integrals as area under the curve

In the first part of this unit, we focused on rates of change of functions of a single vari-
able, and developed the first derivative, or gradient, as a tool to mathematically investi-
gate rates of change. Returning to the example of the car traveling on a road, we found
that we could express the average velocity as the distance traveled by the car in a given
amount of time,
$$
v = \frac{\Delta s}{\Delta t},
$$
where Δs is the change in position (the distance) over time interval Δt. We considered
progressively shorter time intervals in order to investigate the instantaneous rate of
change
$$
v(t) = \frac{ds}{dt}.
$$
The first derivative of the position with respect to time is the instantaneous velocity.
If we know a function for the velocity v(t), it is natural to wonder if it is possible to cal-
culate the distance Δs the car has traveled over a given time interval, Δt.
If the velocity is constant, this is intuitively clear. We have Δs = vΔt, as illustrated in

the figure above. Note that the area of the rectangle v · Δt corresponds to change in
the position Δs. In most cases, the velocity of a car will not be constant, and therefore
we would like a more general way to express the distance as a (continuous) function of
velocity and time.
Informally, we can do this by looking at very small sub-intervals rather than consider-
ing the whole interval Δt at once. We will then assume that the velocity is constant over
these small rectangles in order to get an approximation of the distance the car travels
over each small interval. Each time, we will use our constant approximation of the
velocity multiplied by the length of time to get the area of the small rectangle, as
shown in the following figure. We know that we have a small error in each interval, but
the smaller the intervals, the smaller the error becomes. To find the total distance the
car has traveled, we combine the total areas of all the rectangles that represent the
contributions from each small time interval.
More formally, consider an arbitrary function f(x) of a single variable x that is defined
over the interval a ≤ x ≤ b. Following the approach above, we divide the interval [a, b]
into n sub-intervals by introducing intermediate points ξ_i so that a = ξ_0 < ξ_1 < ...
< ξ_n = b. The lengths of the sub-intervals (ξ_i − ξ_{i−1}) are the widths of the rectangles
along the x-axis, and the values f(x_i), where x_i is a point in the ith sub-interval
[ξ_{i−1}, ξ_i], are the heights of the rectangles. The sum S,
$$
S = \sum_{i=1}^{n} f(x_i)\,(\xi_i - \xi_{i-1}), \tag{1.8}
$$
is the area of all of the rectangles. The area under some curves over certain intervals is
not finite; consider f(x) = 1/x over the interval [0, 1]. Therefore, as we take more and
more intervals, i.e. as we consider the limit of S as n approaches ∞, the sum S may or
may not converge to a finite limit. If this limit exists, the limit of the sum is the definite
integral I of the function f(x) in the interval [a, b],
$$
I = \int_a^b f(x)\,dx. \tag{1.9}
$$
If the limit does not exist, the integral is undefined.
For closed, finite intervals, a sufficient condition for this limit to exist, that is, for
the function f to be integrable over the given interval, is that f is continuous over that
interval. For continuous functions over a finite interval [a, b] this limit, the integral,
always exists.
Example
Evaluate the integral I = ∫_0^b x^2 dx.
The function f(x) = x^2, called the integrand, is shown below.
The first step toward computing this integral, or determining whether this limit
exists, is to divide the interval [0, b] into n rectangles of uniform width w. Next, we
evaluate the function f(x) = x^2 at the right hand endpoint of each sub-interval to
determine the height of each rectangle. We could also have taken the value at the
left hand endpoint or at any point in between; the limit does not depend on this
choice. The area of the ith rectangle is then w · (iw)^2 = i^2 w^3. The total area of our
approximation, A, is then given by
$$
A = \sum_{i=1}^{n} i^2 w^3.
$$
The term w^3 is a constant with respect to the index of summation, i, so we can
factor it out of the sum operator as follows:
$$
A = w^3 \sum_{i=1}^{n} i^2.
$$
Recall that the sum of the first n squares has the closed form
$$
\sum_{i=1}^{n} i^2 = \frac{1}{6}\,n(n + 1)(2n + 1),
$$
and hence the area of our approximation is
$$
A = w^3\,\frac{1}{6}\,n(n + 1)(2n + 1).
$$
When constructing the rectangles, we divided the interval [0, b] into intervals of the
same length, namely w = b/n. Therefore, we can substitute into our expression for A
and reduce to get
$$
\begin{aligned}
A &= \left(\frac{b}{n}\right)^3 \frac{1}{6}\,n(n + 1)(2n + 1) \\
&= \frac{b^3}{6}\,\frac{n(n + 1)(2n + 1)}{n^3} \\
&= \frac{b^3}{6}\,\frac{(n + 1)(2n + 1)}{n^2} \\
&= \frac{b^3}{6}\,\frac{2n^2 + 3n + 1}{n^2} \\
&= \frac{b^3}{6}\left(2 + \frac{3}{n} + \frac{1}{n^2}\right).
\end{aligned}
$$
As we increase the number of intervals without bound, or as n → ∞, the value of
the sum in parentheses in the above expression approaches 2, and thus b^3/3 is the value
of the definite integral I = ∫_0^b x^2 dx = (1/3) b^3.
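The rectangle construction from this example translates directly into a short Python
sketch. The function names and the choice b = 2 are assumptions made for illustration. The
code sums n rectangles of width w = b/n and height (iw)^2 and compares the result with the
closed form derived above and with the limit b^3/3.

```python
def riemann_sum_x_squared(b, n):
    """Sum of n rectangles of width w = b / n with heights (i * w)^2, i = 1..n."""
    w = b / n
    return sum(w * (i * w) ** 2 for i in range(1, n + 1))


def closed_form(b, n):
    """The expression (b^3 / 6) * (2 + 3/n + 1/n^2) derived in the example."""
    return b ** 3 / 6.0 * (2.0 + 3.0 / n + 1.0 / n ** 2)


b = 2.0
for n in (10, 100, 1000, 10000):
    # Both columns should shrink toward b^3 / 3 as n grows.
    print(n, riemann_sum_x_squared(b, n), closed_form(b, n))
print("limit:", b ** 3 / 3.0)
```

With these heights the sum overestimates the integral for every finite n, and the excess is
exactly the 3/n + 1/n^2 part of the closed form.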
Using the properties of limits and finite sums, as above, one can see that the following
properties hold:
$$
\int_a^b 0\,dx = 0 \tag{1.10}
$$
$$
\int_a^a f(x)\,dx = 0 \tag{1.11}
$$
$$
\int_a^b [f(x) + g(x)]\,dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx \tag{1.12}
$$
$$
\int_a^c f(x)\,dx = \int_a^b f(x)\,dx + \int_b^c f(x)\,dx, \quad \text{for all } b \in [a, c] \tag{1.13}
$$
If we set c = a in the last expression, we can derive the identity
$$
\int_a^b f(x)\,dx = -\int_b^a f(x)\,dx.
$$
Integrals as the inverse of differentiation

So far we have introduced integrals over finite intervals [a, b], where the bounds a and
b are fixed. We can formally define the function F(x) to be
$$
F(x) = \int_a^x f(u)\,du. \tag{1.14}
$$
To see how integration is related to differentiation, we evaluate the function F at posi-

tion x+Δx and apply equation 1.13 to get
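$$
\begin{aligned}
F(x + \Delta x) &= \int_a^{x + \Delta x} f(u)\,du \\
&= \int_a^x f(u)\,du + \int_x^{x + \Delta x} f(u)\,du \\
&= F(x) + \int_x^{x + \Delta x} f(u)\,du.
\end{aligned}
$$
If we bring F(x) to the left side and divide both sides by Δx, the equation reads
$$
\frac{F(x + \Delta x) - F(x)}{\Delta x} = \frac{1}{\Delta x}\int_x^{x + \Delta x} f(u)\,du.
$$
Consider the limit of both sides as Δx approaches zero. The left side becomes the
derivative of F, while the right side, which is the average value of f over [x, x + Δx],
approaches f(x) for continuous f. This gives
$$
\frac{dF(x)}{dx} = f(x), \tag{1.15}
$$
or, written with the definition of F(x) substituted in,
$$
\frac{d}{dx}\int_a^x f(u)\,du = f(x). \tag{1.16}
$$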
This says that the derivative of the integral gives back the original integrand. This very
important result is called the Fundamental Theorem of Calculus, and it has a second
part, which relates the definite integral to the antiderivative. Let’s explore it now.
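Equation 1.16 can be illustrated numerically. The Python sketch below builds F(x) from a
midpoint-rule approximation of the integral and then forms the difference quotient of F.
The choice f = cos, the lower bound a = 0, the helper names, and the step sizes are all
assumptions for illustration; the midpoint rule is a swapped-in numerical technique, not
the construction used in the text.

```python
import math


def midpoint_integral(func, a, b, n=10_000):
    """Midpoint-rule approximation of the definite integral of func over [a, b]."""
    w = (b - a) / n
    return w * sum(func(a + (i + 0.5) * w) for i in range(n))


def F(x, a=0.0):
    """F(x) = integral of cos(u) du from a to x, as in equation 1.14."""
    return midpoint_integral(math.cos, a, x)


def derivative_of_F(x, dx=1e-3):
    """Difference quotient (F(x + dx) - F(x)) / dx."""
    return (F(x + dx) - F(x)) / dx


x = 1.0
# The difference quotient of F should be close to the integrand f(x) = cos(x).
print(derivative_of_F(x), math.cos(x))
```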
The above discussion did not depend on any attribute of the arbitrary constant a.
Hence, the inverse of differentiation is not unique. However, any two antiderivatives
F_1(x) and F_2(x) of the same function differ at most by a constant, so we write
$$
\int f(x)\,dx = F(x) + c \tag{1.17}
$$
for the family of functions with derivative f(x). Recall that d/dx c = 0. This is the
indefinite integral of f(x), and c is called the constant of integration.
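The antiderivative F(x) can also be used to evaluate definite integrals. Let x_0 be an
arbitrary fixed point in (a, b) and consider equation 1.13 to obtain
$$
\int_a^b f(x)\,dx = \int_a^{x_0} f(x)\,dx + \int_{x_0}^b f(x)\,dx \tag{1.18}
$$
$$
= \int_{x_0}^b f(x)\,dx + \int_a^{x_0} f(x)\,dx. \tag{1.19}
$$
Reversing the bounds of the last integral changes its sign, and writing F(x) for the
integral of f from x_0 to x, we get
$$
\int_a^b f(x)\,dx = \int_{x_0}^b f(x)\,dx - \int_{x_0}^a f(x)\,dx \tag{1.20}
$$
$$
= F(b) - F(a). \tag{1.21}
$$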
Integrals with infinite bounds of integration

The somewhat intuitive definition of the integral as the area under a curve or as an
inverse function does not allow for bounds of integration that are infinite. However, we
can extend the definition to include these cases with the observation that
$$
\int_a^\infty f(x)\,dx = \lim_{b \to \infty} \int_a^b f(x)\,dx = \lim_{b \to \infty} F(b) - F(a), \tag{1.22}
$$
where the limit as b approaches ∞ is evaluated after the integral is calculated.
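A short Python sketch makes the limit in equation 1.22 concrete. The integrand
f(x) = 1/x^2 with antiderivative F(x) = −1/x and lower bound a = 1 is an illustrative
choice, not an example from the text. The truncated integral F(b) − F(a) = 1 − 1/b
approaches 1 as b grows.

```python
def F(x):
    """Antiderivative of the illustrative integrand f(x) = 1 / x^2."""
    return -1.0 / x


def truncated_integral(a, b):
    """Definite integral of 1 / x^2 over [a, b], evaluated as F(b) - F(a)."""
    return F(b) - F(a)


for b in (10.0, 100.0, 1000.0, 1e6):
    # The integral is calculated first; only then is b pushed toward infinity.
    print(b, truncated_integral(1.0, b))
```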
Evaluation of integrals
Unfortunately, unlike derivatives, many integrals cannot be evaluated easily, and
there are few simple rules that can be used. Some examples of indefinite integrals
are given below. Note that u is typically a function u(x), and du = u′(x)dx.
$$
\begin{aligned}
\int u^n\,du &= \frac{u^{n+1}}{n + 1} + c \quad (n \neq -1) \\
\int \frac{du}{u} &= \ln u + c \\
\int a^u\,du &= \frac{a^u}{\ln a} + c \\
\int e^u\,du &= e^u + c \\
\int \cos u\,du &= \sin u + c \\
\int \sin u\,du &= -\cos u + c \\
\int \cosh u\,du &= \sinh u + c \\
\int \sinh u\,du &= \cosh u + c \\
\int \frac{du}{\cos^2 u} &= \tan u + c \\
\int \frac{du}{\sin^2 u} &= -\cot u + c \\
\int \frac{du}{u^2 + a^2} &= \frac{1}{a}\arctan\frac{u}{a} + c \\
\int \frac{du}{\sqrt{a^2 - u^2}} &= \arcsin\frac{u}{a} + c
\end{aligned}
$$
Formulae for a large number of integrals can be found in tables of integrals. In order to
evaluate unknown integrals, we generally try to transform integrals into forms that are
easier to evaluate. For reference, here are a few “techniques of integration” that might
help.
• Logarithmic integration: Integrals for which the integrand can be written as the
quotient of the derivative of a function and that same function can be evaluated as
$$
\int \frac{f'(x)}{f(x)}\,dx = \ln f(x) + c
$$
• Decomposition: When the integrand is a linear combination of integrable functions,
we can split the integral of the sum into a sum of simpler integrals:
$$
\int \sum_{i=1}^{n} a_i f_i(x)\,dx = \sum_{i=1}^{n} a_i \int f_i(x)\,dx
$$
• Substitution: If the integrand can be parameterized in terms of a different variable
or function x = u(t), we can often utilize the substitution:
$$
\int_{u(a)}^{u(b)} f(x)\,dx = \int_a^b f(u(t))\,\frac{du(t)}{dt}\,dt
$$
Here t runs from a to b while x = u(t) runs from u(a) to u(b). The key to identifying
integrals of this form is to find a suitable substitution function.
• Integration by parts: Recall the product rule:
$$
\frac{d}{dx}(u \cdot v) = uv' + u'v
$$
Integration by parts enables us to split the integral into parts which are easier to
solve. Rearranging equation 1.5 (the product rule)
$$
\frac{d}{dx}(uv) = u\frac{dv}{dx} + v\frac{du}{dx}
$$
to
$$
u\frac{dv}{dx} = \frac{d}{dx}(uv) - v\frac{du}{dx}
$$
and integrating both sides, we obtain
$$
\int uv'\,dx = uv - \int vu'\,dx.
$$
The “art” of solving the integral is to choose the functions u and v so that the
remaining integral becomes easier to solve.
Example
Evaluate the integral ∫_a^b x cos x dx.
Noting that the integrand is a product of x and cos x, we solve this using integration
by parts and choose u = x and v′ = cos x, and thus get the result that v = sin x
and du = dx. Substituting, we get
$$
\begin{aligned}
\int_a^b x\cos x\,dx &= \int_a^b x\,(\sin x)'\,dx \\
&= \big[x\sin x\big]_a^b - \int_a^b \sin x\,dx \\
&= \big[x\sin x + \cos x\big]_a^b \\
&= (b\sin b + \cos b) - (a\sin a + \cos a).
\end{aligned}
$$
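The result of this example can be compared with a direct numerical integration. In the
Python sketch below, the midpoint rule, the bounds a = 0.5 and b = 2, and all function names
are assumptions chosen for illustration; the numerical value only approximates the integral.

```python
import math


def midpoint_integral(func, a, b, n=10_000):
    """Midpoint-rule approximation of the definite integral of func over [a, b]."""
    w = (b - a) / n
    return w * sum(func(a + (i + 0.5) * w) for i in range(n))


def by_parts_result(a, b):
    """(b sin b + cos b) - (a sin a + cos a), the closed form from the example."""
    return (b * math.sin(b) + math.cos(b)) - (a * math.sin(a) + math.cos(a))


a, b = 0.5, 2.0
numeric = midpoint_integral(lambda x: x * math.cos(x), a, b)
print(numeric, by_parts_result(a, b))
```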
Example
Evaluate the integral ∫ 1/(x^2 + x) dx.
First, we note that the denominator x^2 + x can be factored as x(x+1). Using partial
fraction decomposition, we get
$$
\begin{aligned}
\int \frac{1}{x^2 + x}\,dx &= \int \frac{1}{x(x + 1)}\,dx \\
&= \int \left(\frac{1}{x} - \frac{1}{x + 1}\right) dx \\
&= \ln x - \ln(x + 1) + c \\
&= \ln\frac{x}{x + 1} + c,
\end{aligned}
$$
where we have split the integral of the difference into a difference of integrals and
used the fact that ln(a/b) = ln a − ln b. However, in general, we need to be careful,
because the logarithm is not defined for negative arguments; the result as written holds
for x > 0.
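To check this antiderivative, the Python sketch below evaluates the definite integral over
[1, 2], where x > 0 and the logarithms are defined, both numerically and as F(2) − F(1).
The bounds, the midpoint rule, and the function names are illustrative assumptions.

```python
import math


def midpoint_integral(func, a, b, n=10_000):
    """Midpoint-rule approximation of the definite integral of func over [a, b]."""
    w = (b - a) / n
    return w * sum(func(a + (i + 0.5) * w) for i in range(n))


def integrand(x):
    """1 / (x^2 + x)."""
    return 1.0 / (x ** 2 + x)


def antiderivative(x):
    """ln(x / (x + 1)), valid for x > 0."""
    return math.log(x / (x + 1.0))


numeric = midpoint_integral(integrand, 1.0, 2.0)
exact = antiderivative(2.0) - antiderivative(1.0)  # equals ln(4/3)
print(numeric, exact)
```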
Taylor Approximation
Taylor's Theorem: The theorem is named after Brook Taylor, who expressed this
relationship in 1712.
A very useful application of derivatives and integrals is Taylor’s theorem. Taylor’s
theorem provides an approximation to a function in the vicinity of a given point x0 as a
sum. Taylor’s theorem requires that the function f(x) is continuous and that all of the
derivatives f′(x), f′′(x), ..., up to order f^(n)(x) exist in order to generate an nth
degree polynomial approximation to f(x) near x0. Using equation 1.21 we can express f(x) as
$$\int_a^{a+\epsilon} f'(x)\,dx = f(a+\epsilon) - f(a), \tag{1.23}$$
where ϵ is small, so that every x in the interval [a, a + ϵ] lies in the vicinity of a.
This can be written as
$$f(a+\epsilon) = f(a) + \int_a^{a+\epsilon} f'(x)\,dx. \tag{1.24}$$
Assuming that ϵ is very small, we can assume f'(x) is approximately equal to f'(a), and
hence
$$f(a+\epsilon) \approx f(a) + \epsilon f'(a) \tag{1.25}$$
holds. We can express this in terms of x and a, assuming that we stay close to the point
a, to get the approximation
$$f(x) \approx f(a) + (x-a) f'(a). \tag{1.26}$$
The approximation given by equation 1.26 is called the linear approximation to f(x)
near x = a. It is the tangent line approximation to the function f. By using more infor-
mation about f, namely by constructing a function that also agrees with f on higher
order derivatives at x = a, we can obtain an even better approximation. That is the
general idea of the Taylor approximation of degree n. Because f is n times
differentiable, we can apply the approximation to each of the derivatives of f to obtain
$$f'(x) \approx f'(a) + (x-a) f''(a),$$
$$f''(x) \approx f''(a) + (x-a) f'''(a),$$
and similarly,
$$f^{(n-1)}(x) \approx f^{(n-1)}(a) + (x-a) f^{(n)}(a).$$
We can now substitute the estimate of f'(x) into equation 1.24 and obtain
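$$\begin{aligned}
f(a+\epsilon) &\approx f(a) + \int_a^{a+\epsilon} \big[f'(a) + (x-a) f''(a)\big]\,dx \\
&= f(a) + \epsilon f'(a) + \frac{\epsilon^2}{2} f''(a).
\end{aligned}$$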
This process can be repeated iteratively as long as the higher order derivatives exist,
which yields the nth-degree Taylor polynomial approximation. Expressing again in terms
of x and a, we can write:
$$f(x) \approx f(a) + (x-a) f'(a) + \frac{(x-a)^2}{2!} f''(a) + \cdots + \frac{(x-a)^n}{n!} f^{(n)}(a). \tag{1.27}$$
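To see how the quality of the approximation in equation 1.27 improves with the degree n, the polynomial can be evaluated directly. This is a small sketch under stated assumptions: the function name `taylor_polynomial` is hypothetical, and f(x) = exp(x) with expansion point a = 0 is chosen only because all of its derivatives are known in closed form.

```python
import math


def taylor_polynomial(derivatives_at_a, a, x):
    """Evaluate sum_k f^(k)(a) * (x - a)^k / k! for the supplied derivative values."""
    return sum(d * (x - a) ** k / math.factorial(k)
               for k, d in enumerate(derivatives_at_a))


a, x = 0.0, 0.5
# For f(x) = exp(x), every derivative at a equals exp(a).
for n in (1, 2, 4, 8):
    derivs = [math.exp(a)] * (n + 1)
    approx = taylor_polynomial(derivs, a, x)
    print(n, approx, abs(approx - math.exp(x)))
```

The printed error shrinks rapidly as n grows, which reflects the factor (x − a)^n / n! in the highest-order term.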
1.2 Partial Differentiation

In the previous section, we considered derivatives of functions of a single variable, i.e.
$f^{(n)}(x) = \frac{d^n}{dx^n} f(x)$. More generally, we can consider rates of change of
functions that depend on more than one variable. We can write f(x1, x2, . . . , xn) for a
function that depends on n variables x1, x2, . . . , xn. An example of a function that
depends on two variables x and y, f(x, y) = x² + y², is shown in the following figure.
The function is well-defined for each pair (x, y). For example, f(1, 1) = 2.
Previously, we discussed that the derivative of a function of a single variable is related
to the change or gradient of that function. As we consider more variables, we want to
know how the function changes as each of the variables changes individually, imagining,
for example, how f changes as x varies while y is held constant. Considering again the
function f(x, y) = x² + y², it has a specific gradient in every direction of the xy plane.
As a special case, we consider what happens when we move in either the x or the y
direction, for example, along the x or y axis. We move along one direction, for example
the x axis, and keep the value of the other variable(s), in this case y, constant as we
observe the change of the function. These derivatives are called partial derivatives,
indicating that we only observe the “partial” change of the function along one of the
variables. Similar to the definition of the derivative with respect to a single variable
(equation 1.2), we define the partial derivative to be:
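$$\frac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x+\Delta x, y) - f(x, y)}{\Delta x}, \tag{1.28}$$
and
$$\frac{\partial f}{\partial y} = \lim_{\Delta y \to 0} \frac{f(x, y+\Delta y) - f(x, y)}{\Delta y}, \tag{1.29}$$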
where the symbol ∂ indicates that this differentiation is performed partially with
respect to a single variable while the other variables are kept constant. To make this
explicit, it is often written as
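$$\left(\frac{\partial f}{\partial x}\right)_y \quad \text{and} \quad \left(\frac{\partial f}{\partial y}\right)_x \tag{1.30}$$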
in order to indicate which variable is considered in the derivative (the one in the par-
tial derivative expression) and which is kept constant (the one outside the parenthe-
ses). Just as there are many notations for the derivative in the single variable case,
there are also many ways to indicate partial derivatives. The following are some com-
mon short-hand notations for the partial derivative of f with respect to x:
$$\frac{\partial f}{\partial x} = f_x = \partial_x f. \tag{1.31}$$
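One can also calculate higher order partial derivatives, provided that the relevant limits
exist; they are calculated in the same way. The second-order possibilities in the case of
two variables are
$$\begin{aligned}
\frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right) &= \frac{\partial^2 f}{\partial x^2} = f_{xx}, \\
\frac{\partial}{\partial y}\left(\frac{\partial f}{\partial y}\right) &= \frac{\partial^2 f}{\partial y^2} = f_{yy}, \\
\frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right) &= \frac{\partial^2 f}{\partial x\,\partial y} = f_{xy}, \text{ and} \\
\frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right) &= \frac{\partial^2 f}{\partial y\,\partial x} = f_{yx}.
\end{aligned}$$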
Note that under sufficient continuity conditions, the relation
$$\frac{\partial^2 f}{\partial x\,\partial y} = \frac{\partial^2 f}{\partial y\,\partial x}$$
holds.
Example
Find f_x and f_y, the first partial derivatives of f(x, y) = 3x²y² + y.
First, we calculate the partial derivative with respect to x, treating y as a constant to
obtain
$$\frac{\partial f}{\partial x} = 6xy^2.$$
For the partial derivative with respect to y, we now treat x as a constant and find
$$\frac{\partial f}{\partial y} = 6x^2 y + 1.$$
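The two partial derivatives from this example can be sanity-checked with finite differences. The sketch below is illustrative only: the helper names `partial_x` and `partial_y` and the evaluation point are assumptions, and it uses a central difference, a symmetric variant of the one-sided limit in equations 1.28 and 1.29 that converges to the same value.

```python
def f(x, y):
    return 3 * x**2 * y**2 + y


def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of df/dx with y held constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Central-difference estimate of df/dy with x held constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x0, y0 = 1.5, -2.0  # arbitrary evaluation point
print(partial_x(f, x0, y0), 6 * x0 * y0**2)        # numeric vs. analytic 6xy^2
print(partial_y(f, x0, y0), 6 * x0**2 * y0 + 1)    # numeric vs. analytic 6x^2 y + 1
```

In each printed pair the numeric estimate should match the analytic expression up to rounding error.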
Total Differential
The definition of the partial derivatives allows us to examine the rate of change of a
function along, for example, the x or y axes. We now want to investigate the rate of
change if we move in any direction in the domain.
In a case where we have functions of two variables x and y, we move Δx in the x direc-
tion and Δy in the y direction. Following the approach we have taken previously, we can
evaluate
$$\begin{aligned}
\Delta f &= f(x+\Delta x, y+\Delta y) - f(x, y) \\
&= f(x+\Delta x, y+\Delta y) - f(x, y+\Delta y) + f(x, y+\Delta y) - f(x, y) \\
&= \left[\frac{f(x+\Delta x, y+\Delta y) - f(x, y+\Delta y)}{\Delta x}\right]\Delta x + \left[\frac{f(x, y+\Delta y) - f(x, y)}{\Delta y}\right]\Delta y,
\end{aligned}$$
where we have performed the algebraic trick of adding and subtracting the same term,
namely −f(x, y + Δy) + f(x, y + Δy) = 0, in the middle step in order to factor into the
desired quotients, and we have also multiplied by Δx/Δx = 1 and Δy/Δy = 1. The term in
the first square brackets describes the change of the function f(x, y) if we move a step
Δx in the x direction; the term in the second square brackets describes the corresponding
change in the y direction. If we let Δx → 0 and Δy → 0, the terms in the square brackets
become the partial derivatives defined in equations 1.28 and 1.29, and we obtain the total
differential of a function f(x, y), which is then given by
$$df = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy. \tag{1.32}$$
For functions of n variables, the formula above is extended accordingly to
$$df = \frac{\partial f}{\partial x_1}\,dx_1 + \frac{\partial f}{\partial x_2}\,dx_2 + \cdots + \frac{\partial f}{\partial x_n}\,dx_n. \tag{1.33}$$
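As a rough illustration of how well the total differential approximates the actual change of a function, the sketch below compares Δf with df for f(x, y) = x² + y², the example function used earlier in this section. The starting point (1, 1) and the step sizes are arbitrary choices made for this illustration.

```python
def f(x, y):
    return x**2 + y**2


def total_differential(x, y, dx, dy):
    # df = (df/dx) dx + (df/dy) dy, with df/dx = 2x and df/dy = 2y for this f
    return 2 * x * dx + 2 * y * dy


x0, y0 = 1.0, 1.0
gaps = []
for step in (0.1, 0.01, 0.001):
    actual = f(x0 + step, y0 + step) - f(x0, y0)
    linear = total_differential(x0, y0, step, step)
    gaps.append(abs(actual - linear))
    print(step, actual, linear)
```

For this particular function the gap between the two values is exactly Δx² + Δy², so it shrinks quadratically as the steps become smaller.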
Chain Rule
When a function of a single variable could be expressed as a composition of functions,
we used the chain rule (recall equation 1.6) to differentiate it. The same approach can
be applied to functions with several variables.
For example, in the case of a function f(x, y), suppose the variables x and y are
themselves functions of another variable u and we wish to find the derivative with respect
to u, i.e. $\frac{df}{du}$. Starting from the total differential in equation 1.32, we obtain
$$\frac{df}{du} = \frac{\partial f}{\partial x}\frac{dx}{du} + \frac{\partial f}{\partial y}\frac{dy}{du}.$$
The same approach can be taken if the functions are nested in more than one level, i.e.
instead of f(u(x)) one might have f(u(v(x))), and the chain rule can be used to calculate
the derivative, e.g.
$$\frac{df(u(v(x)))}{dx} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial v}\frac{dv}{dx}. \tag{1.34}$$
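The multivariate chain rule can be verified on a concrete case. The following sketch is an illustration only: the functions f(x, y) = x²y, x(u) = u², and y(u) = sin u are assumptions picked for this check and do not come from the course text. It compares the chain-rule expression with a direct finite-difference derivative of the composed function.

```python
import math


def f(x, y):
    return x**2 * y


def x_of(u):
    return u**2


def y_of(u):
    return math.sin(u)


def chain_rule_derivative(u):
    x, y = x_of(u), y_of(u)
    df_dx, df_dy = 2 * x * y, x**2        # partial derivatives of f
    dx_du, dy_du = 2 * u, math.cos(u)     # ordinary derivatives of x(u) and y(u)
    return df_dx * dx_du + df_dy * dy_du


def direct_derivative(u, h=1e-6):
    """Central difference applied to the composed function u -> f(x(u), y(u))."""
    def composed(t):
        return f(x_of(t), y_of(t))
    return (composed(u + h) - composed(u - h)) / (2 * h)


u0 = 0.7
print(chain_rule_derivative(u0), direct_derivative(u0))
```

Both numbers should coincide up to the small truncation and rounding error of the finite difference.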
1.3 Multiple Integrals

Previously in this unit, we introduced integrals over functions of a single variable.
Recall that the definite integral measures the area under the curve given by the function
f(x) over the interval [a, b]. We later extended this explanation to encompass indefinite
integration. Furthermore, we interpreted the integration as the inverse of
differentiation.
Again, in the multivariate case, we will approach integration as a limit of
approximations, focusing on the case of two variables x and y first. We wish to find the
volume enclosed by the x, y-plane and the function f(x, y) with specific bounds in the x-
and y-directions, represented by a region R enclosed by a contour C. Following the
approach described previously in this unit, we divide the area R inside the curve into N
cells of area ΔA_p with p = 1, 2, . . . , N and define the sum
$$S = \sum_{p=1}^{N} f(x_p, y_p)\,\Delta A_p$$
to express the approximate volume, where ΔA_p is the area of the base and f(x_p, y_p) is
the height of cell p. Again, we consider many areas like this, i.e. we let N → ∞, implying
ΔA_p → 0. Similar to the case of a single variable, if the above sum has a finite limit,
we say that this limit is the value of the double integral of f(x, y) over the region R:
$$I = \int_R f(x, y)\,dA, \tag{1.35}$$
where dA is an infinitesimally small area in the x, y plane where the function f(x, y) is
evaluated. So far, we have not made any assumption about the small area ΔA consid-
ered in the above sum. If we choose small rectangles in x and y direction, we can write
ΔA = ΔxΔy, and when Δx → 0 and Δy → 0, we can write
$$I = \iint_R f(x, y)\,dx\,dy \tag{1.36}$$
as the double integral. For such integrals, the order of integration (with respect to x
first or y first) can affect how easy the computation is. It is frequently helpful to draw
a picture to see which variable could be taken more easily to depend on the other. If x
can be easily expressed as a function of y, we might choose to take horizontal strips of
width dy and integrate over x first. That gives us
$$I = \int_{y=c}^{y=d} \left[\int_{x=x_1(y)}^{x=x_2(y)} f(x, y)\,dx\right] dy. \tag{1.37}$$
In this case, the bounds of the inner integral are the parametrization of the boundary
curve C, expressed as x = x1(y) and x = x2(y). In the first step, y is treated as a con-
stant as the inner integral over x is evaluated. The next step of the computation, the
outer integral, is evaluated between the bounds y = c and y = d just as in the single
variable case as there are no x’s left in the expression.
Alternatively, we can first evaluate the integral over y and then over x, as
$$I = \int_{x=a}^{x=b} \left[\int_{y=y_1(x)}^{y=y_2(x)} f(x, y)\,dy\right] dx. \tag{1.38}$$
Example
Evaluate the integral $I = \iint_R x^2 y\,dx\,dy$, where R is the triangular area bounded
by x = 0, y = 0, and x + y = 1.
First, we carry out the integration over y, which means that we keep x fixed. In this
case, the limits on y are y = 0 and y = 1 − x. Given the constraint x + y = 1, the
maximum value of x is x = 1 for y = 0. The integral is then written as
$$I = \int_{x=0}^{x=1} \left[\int_{y=0}^{y=1-x} x^2 y\,dy\right] dx.$$
We evaluate the inner integral first, treating x as a constant, to get
$$\int_{y=0}^{y=1-x} x^2 y\,dy = \left[\frac{1}{2}x^2 y^2\right]_{y=0}^{y=1-x} = \frac{1}{2}x^2(1-x)^2.$$
This result is now inserted into the outer integral as follows:
$$\begin{aligned}
\int_{x=0}^{x=1} \frac{1}{2}x^2(1-x)^2\,dx
&= \frac{1}{2}\int_{x=0}^{x=1} x^2\,dx - \frac{1}{2}\int_{x=0}^{x=1} 2x^3\,dx + \frac{1}{2}\int_{x=0}^{x=1} x^4\,dx \\
&= \frac{1}{2}\left[\frac{1}{3}x^3\right]_0^1 - \left[\frac{1}{4}x^4\right]_0^1 + \frac{1}{2}\left[\frac{1}{5}x^5\right]_0^1 \\
&= \frac{1}{6} - \frac{1}{4} + \frac{1}{10} \\
&= \frac{1}{60}.
\end{aligned}$$
In case of more than two variables, the same notation can be extended accordingly,
such as
$$\iiint_V f(x, y, z)\,dx\,dy\,dz \tag{1.39}$$
where, in the case of three variables, we integrate over a specific volume rather than an
area.
1.4 Calculus of Variations

Previously, we introduced the idea of local extrema and how to use stationary points to
find them. We can apply the same ideas to more than one variable. In fact, we can even
extend this idea to look for input functions that give extrema (maxima and minima),
rather than input values that give extrema.
This is the idea behind calculus of variations. In most cases, we want to minimize or
maximize a given quantity that depends on a family of input functions; the calculus of
variations provides a method for finding a function f(x) which yields the extreme value.
Gravity: Gravity is one of the natural forces caused by the mass of objects, resulting
in them being pulled towards each other.
As a concrete example, we could imagine a rope that is attached to two points, A and
B, as shown in the following figure, but otherwise hangs freely under the influence of
gravity. We expect that the rope will hang down in a shape such as the one indicated
by the solid line, and not take any other shape (e.g., those suggested by the two dotted
lines), as long as there is no external force other than gravity and all initial
motion has come to rest. In this example, the rope is fixed at the points A and B, so
we have two constraints (not including the length of the rope) which we take as
constant. As the gravitational force acts on each part of the rope, the rope will take the
shape where the total potential energy, expressed by the integral over all small
segments of the rope, is minimal. We wish to find the function y(x) that describes the
shape of the hanging rope with the minimal potential energy.
To introduce the calculus of variations, we start with the integral
$$I = \int_a^b F(y, y', x)\,dx, \tag{1.40}$$
where a, b, and F are given by the nature of the problem we wish to consider. This inte-
gral depends on the function y(x). In the example of the rope, the limits a and b of the
integral are fixed: they correspond to the endpoints of the rope at which the rope is
attached, for example, to two poles.
We call such functions (ones that take other functions as their input and result in a
scalar as their output) functionals. Here, I is a functional of y(x), which we denote by
$$I = I[y(x)]. \tag{1.41}$$
We use square brackets to indicate that I is a functional rather than a function on ℝⁿ.
We then look for the curves y(x) for which the integral I is stationary, and determine
whether such curves are extrema of the integral. The integral may have one or more
stationary points.
A stationary point y(x) of the functional I[y(x)] is a function for which the functional I
does not change, to first order, if y(x) is perturbed by a small amount. In the case of
the rope, this would be the function that describes the physical shape the rope takes if
we fix it at two points and let it hang under the influence of gravity. Because y(x) is a
stationary point of the integral I[y(x)], if we change
$$y(x) \to y(x) + \epsilon\,\eta(x) \tag{1.42}$$
by a small amount ε using any (sufficiently well-behaved) function η(x), we require that
the value of I does not change to first order in ε, i.e.,
$$\left.\frac{dI}{d\epsilon}\right|_{\epsilon=0} = 0 \quad \forall\,\eta(x). \tag{1.43}$$
We now insert the above equation 1.42 into the integral in equation 1.40:
$$I[y(x), \epsilon] = \int_a^b F(y + \epsilon\eta,\, y' + \epsilon\eta',\, x)\,dx.$$
We generally assume that all functions are well behaved, especially when considering
situations related to physical examples.
Taylor Series with Multiple Variables
We have already encountered the Taylor series for the case of a single variable in
equation 1.27 and used it to expand a function into a series around some point. This
approach can be generalized to several variables. For example, for a function that
depends on two variables x and y, we can write the corresponding second degree
Taylor polynomial as:
$$f(x,y) \approx f(x_0,y_0) + \frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y}\Delta y + \frac{1}{2!}\left[\frac{\partial^2 f}{\partial x^2}\Delta x^2 + 2\frac{\partial^2 f}{\partial x\,\partial y}\Delta x\,\Delta y + \frac{\partial^2 f}{\partial y^2}\Delta y^2\right],$$
where we evaluate the derivatives at the point (x0, y0), and Δx = x − x0 and Δy = y − y0.
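We can write this as:
$$f(x,y) \approx f(x_0,y_0) + \left(\Delta x\frac{\partial}{\partial x} + \Delta y\frac{\partial}{\partial y}\right) f(x,y) + \frac{1}{2!}\left(\Delta x\frac{\partial}{\partial x} + \Delta y\frac{\partial}{\partial y}\right)^2 f(x,y),$$
with all derivatives evaluated at (x0, y0).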
Extending to higher derivatives, we can write the Taylor series for a function of two
variables as:
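$$f(x,y) = \sum_{n=0}^{\infty} \frac{1}{n!}\left[\left(\Delta x\frac{\partial}{\partial x} + \Delta y\frac{\partial}{\partial y}\right)^n f(x,y)\right]_{x_0,\,y_0}$$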
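We can further generalize this to any number of variables denoted by the vector x:
$$f(\mathbf{x}) = f(\mathbf{x}_0) + \sum_i \frac{\partial f}{\partial x_i}\Delta x_i + \frac{1}{2!}\sum_{i,j} \frac{\partial^2 f}{\partial x_i\,\partial x_j}\Delta x_i\,\Delta x_j + \cdots$$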
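A quick numerical experiment shows what the second-degree Taylor polynomial in two variables buys us. The sketch below is an illustration under assumptions: the function f(x, y) = exp(x) sin(y), the expansion point (0, 1), and the name `taylor2` are chosen here only for demonstration, and the derivatives are entered by hand.

```python
import math


def f(x, y):
    return math.exp(x) * math.sin(y)


def taylor2(x, y, x0, y0):
    """Second-degree Taylor polynomial of f about (x0, y0), using analytic derivatives."""
    dx, dy = x - x0, y - y0
    e, s, c = math.exp(x0), math.sin(y0), math.cos(y0)
    f0 = e * s
    fx, fy = e * s, e * c                    # first partial derivatives at (x0, y0)
    fxx, fxy, fyy = e * s, e * c, -e * s     # second partial derivatives at (x0, y0)
    return (f0 + fx * dx + fy * dy
            + 0.5 * (fxx * dx**2 + 2 * fxy * dx * dy + fyy * dy**2))


x0, y0 = 0.0, 1.0
errors = []
for step in (0.1, 0.05, 0.025):
    x, y = x0 + step, y0 + step
    errors.append(abs(f(x, y) - taylor2(x, y, x0, y0)))
    print(step, errors[-1])
```

Halving the step reduces the error by roughly a factor of eight, consistent with the first neglected terms being of third order in Δx and Δy.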
Returning to the calculus of variations and the integral I[y(x), ε], we can use the Taylor
series with Δy = ϵη and Δy′ = ϵη′ and write the integral in the following way:
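$$\begin{aligned}
I[y(x), \epsilon] &= \int_a^b F(y + \epsilon\eta,\, y' + \epsilon\eta',\, x)\,dx \\
&= \int_a^b F(y, y', x)\,dx + \int_a^b \left(\frac{\partial F}{\partial y}\epsilon\eta + \frac{\partial F}{\partial y'}\epsilon\eta'\right) dx + O(\epsilon^2).
\end{aligned} \tag{1.44}$$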
In the following, we ignore all terms of order ε2 and higher because ε is assumed to be
a very small number. This means we consider the equation
$$I[y(x), \epsilon] = \int_a^b F(y, y', x)\,dx + \int_a^b \left(\frac{\partial F}{\partial y}\epsilon\eta + \frac{\partial F}{\partial y'}\epsilon\eta'\right) dx.$$
Now, recall that when we introduced the small perturbation in y(x) in Equation 1.42, we
said that this should not change the integral because we are at a stationary point. We
expressed this more formally in Equation 1.43, where we demand that the integral I
does not change if we change y a little bit by the term εη(x) for any choice of η.
This then implies that the second term must be equal to zero for any choice of η(x),
because ε is a small (but non-zero) number and we do not make any demands of the
function η(x) except that it be sufficiently well behaved, so we can take its derivative,
integrate it, and so on. Then, because we demand that this holds for any small pertur-
bation, the second part in the equation above must vanish, which we can write as
$$\delta I = \int_a^b \left(\frac{\partial F}{\partial y}\eta + \frac{\partial F}{\partial y'}\eta'\right) dx = 0, \tag{1.45}$$
where the notation δI is used to indicate the variation in the functional I[y(x)] due to
the change y(x) → y(x) + εη(x). Because ε is a small but non-zero number, the common
factor ε has been divided out of the above equation.
We now integrate the second part of the integral by parts, resulting in
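$$\int_a^b \frac{\partial F}{\partial y'}\eta'\,dx = \left[\eta\frac{\partial F}{\partial y'}\right]_a^b - \int_a^b \eta\,\frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right) dx, \tag{1.46}$$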
so the integral equation becomes
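$$\left[\eta\frac{\partial F}{\partial y'}\right]_a^b + \int_a^b \left[\frac{\partial F}{\partial y} - \frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right)\right]\eta(x)\,dx = 0. \tag{1.47}$$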
We now impose the constraint that the endpoints a and b are fixed, as are y(a) and
y(b) – recalling our initial example of the freely hanging rope under the influence of
gravity, where the rope is fixed at its two attached points.
Since y(a) and y(b) are fixed, we also require that, at these points, η(a) = 0 and
η(b) = 0: if we “wiggle” the rope a bit, i.e., change y(x), the endpoints
remain unchanged. It follows that the first term in the above equation vanishes. Since
the remaining integral in equation 1.47 must be equal to zero for any choice of η(x), the
factor multiplying η(x) in the integrand must itself be zero, namely
$$\frac{\partial F}{\partial y} = \frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right). \tag{1.48}$$
Equation 1.48 is known as the Euler-Lagrange equation.
Example
Show that the shortest path between two points is a straight line.
We start by specifying the initial and final points that will be connected with an
arbitrary path; the initial point A is given by the coordinates (a, y(a)) and the final
point B is given by the coordinates (b, y(b)), as shown below:
For any small segment of the path, the length can be approximated by a straight
line using the distance formula
$$ds = \sqrt{(dx)^2 + (dy)^2},$$
where we assume that dx and dy are small enough to justify the approximation of
the small triangle for ds. Factoring out dx, the equation above can be written as
$$ds = \sqrt{1 + (y')^2}\,dx. \tag{1.49}$$
The total length of the line is given by the integral
$$L = \int ds = \int_a^b \sqrt{1 + (y')^2}\,dx, \tag{1.50}$$
where the integration takes place along the path between the two points. We now
calculate the path which leads to a stationary point for L, in this case a minimum
which gives the shortest connection between the points A and B. We start from the
Euler-Lagrange equation 1.48 and note that the function in the integral L does not
depend on y explicitly. This implies that
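$$\frac{\partial F}{\partial y} = 0,$$
so the Euler-Lagrange equation can be written as
$$\frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right) = 0, \tag{1.51}$$
which in turn implies that
$$\frac{\partial F}{\partial y'} = c, \tag{1.52}$$
for some constant c. We now take the derivative of the function $F = \sqrt{1 + (y')^2}$
with respect to y′ and obtain
$$c = \frac{\partial F}{\partial y'} = \frac{y'}{\sqrt{1 + (y')^2}}, \tag{1.53}$$
recalling that $\sqrt{w}$ can be written as $w^{1/2}$.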
We now solve the equation
$$c = \frac{y'}{\sqrt{1 + (y')^2}}$$
for dy so that we can integrate both sides and obtain an explicit formula for y as
follows:
$$c = \frac{y'}{\sqrt{1 + (y')^2}} \tag{1.54}$$
$$c^2 = \frac{(y')^2}{1 + (y')^2} \tag{1.55}$$
$$c^2\left(1 + (y')^2\right) = (y')^2 \tag{1.56}$$
$$c^2 = (y')^2 - c^2 (y')^2 \tag{1.57}$$
$$\phantom{c^2} = (1 - c^2)(y')^2 \tag{1.58}$$
$$c = \sqrt{1 - c^2}\,\frac{dy}{dx} \tag{1.59}$$
$$\frac{c}{\sqrt{1 - c^2}}\,dx = dy. \tag{1.60}$$
Integrating both sides yields
$$y = \frac{c}{\sqrt{1 - c^2}}\,x + k \tag{1.61}$$
for some constant k, noting that the term $\frac{c}{\sqrt{1 - c^2}}$ is constant and
$\int dx = x$.
As expected, the above equation is indeed a straight line of the form y = mx + b
with slope $m = \frac{c}{\sqrt{1 - c^2}}$ and intercept b = k (here b denotes the
intercept, not the endpoint of the integration interval).
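The result can be illustrated numerically by evaluating the length functional of equation 1.50 for the straight line and for perturbed paths y(x) + εη(x) with η(a) = η(b) = 0. This is a sketch under stated assumptions: the endpoints (0, 0) and (1, 1), the perturbation η(x) = sin(πx), and the function names are chosen for illustration only, and the integral is approximated by summing short straight segments.

```python
import math


def path_length(y, a, b, n=2000):
    """Approximate L = integral of sqrt(1 + y'^2) dx by summing short straight segments."""
    h = (b - a) / n
    xs = [a + k * h for k in range(n + 1)]
    return sum(math.hypot(xs[k + 1] - xs[k], y(xs[k + 1]) - y(xs[k]))
               for k in range(n))


def straight(x):
    return x  # straight line from (0, 0) to (1, 1)


def make_perturbed(eps):
    def y(x):
        # eta(x) = sin(pi x) vanishes at both endpoints, so A and B stay fixed
        return straight(x) + eps * math.sin(math.pi * x)
    return y


a, b = 0.0, 1.0
lengths = {eps: path_length(make_perturbed(eps), a, b)
           for eps in (0.0, 0.05, 0.1, -0.1)}
for eps, length in lengths.items():
    print(eps, length)
```

The unperturbed path has length √2, and every perturbed path tried here is longer, in line with the straight line being the minimum of L.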
Summary
In this unit, we have seen functions of a single variable f(x), as well as multivariate
functions, such as f(x, y). Differentiation is a tool for studying the rate of change of
a function with respect to a given variable. In the case of multivariate functions, the
partial derivatives indicate how much the function changes along the x- or y-axis,
for example, while the total differential extends this idea to the rate of change of a
function in any arbitrary direction. Integration of the functions of one variable was
introduced as the area enclosed by the function and can be interpreted as the
inverse of the derivative. The integral is therefore often called the antiderivative.
The Taylor expansion can be used to approximate a given function near a specific
point. Finally, the calculus of variations extends the concepts of differentiation and
integration to functionals, that is, functions whose inputs are themselves functions.
Knowledge Check
Did you understand this unit?
You can check your understanding by completing the questions for this unit on the
learning platform.
Good luck!
Unit 2
Integral Transformations
STUDY GOALS
… what integral transformations are.
… how to combine the effects of two functions using a convolution integral.
… how to use convolutions to describe real-life applications such as finite sensor
resolution or image manipulation.
… how to express periodic signals as a Fourier series.
… how to express time domain and frequency domain functions using Fourier
transformations.
2. Integral Transformations
Introduction
Integral transformations play an important role in analyzing, manipulating, and
transforming signals. This unit focuses on two transformations which are of great
importance in practical applications: convolutions and Fourier transformations. Convolutions
describe how two functions interact with each other, for example, how the finite resolu-
tion of a sensor or measuring device impacts the value of the quantity measured by
this device. Fourier series and Fourier transformations are used to analyze and
describe periodic signals. This formalism allows us to express the observed signals as a
superposition of signals of different frequencies and intensities and allows us to switch
between equivalent descriptions in the (observed) time domain and in the frequency
domain. This can make the treatment of signals much easier to handle as some trans-
formations or filters are more easily applied in one domain than in the other.
A good textbook that covers this subject area is Signals & Systems (Oppenheim et al.,
1997).
2.1 Convolutions
Definition of the Convolution
Resolution: Detectors are not infinitely precise but can only measure a quantity up to a
certain precision determined by the resolution of the detector.
In order to measure any quantity, we must rely on a measurement device. For example,
to measure a temperature, we use a thermometer which tells us the temperature of the
substance we want to investigate. This simple picture is not quite correct, however. The
measurement does not reflect the actual “true” physical quantity (such as the
temperature) but is distorted by the intrinsic resolution of the measuring instrument. It
is important to keep in mind that the measured quantity, for example the temperature,
does not exist as an abstract quantity but is always related to a real, physical system.
As such, there is ultimately no single value associated with this property in the
mathematical sense but it is always governed by probabilities and probability
distributions. This implies that if we keep repeating the same measurement, we will get
slightly different numeric values for the same “true” physical value, with a spread that
is determined by the intrinsic resolution of the device. Examples illustrating these
possibilities for such resolutions are shown in the following figure.
A good thermometer may have a high resolution and return an unbiased measurement.
This means that the thermometer does not shift the “true” value, but rather returns a
value which randomly fluctuates slightly around the true value. The measurements of a
thermometer with this property are indicated by the red dashed line. By repeating the
measurement many times, it is possible to determine the resolution of the instrument
and hence the intrinsic volatility of the measurements made with this thermometer.
The black solid line, on the other hand, illustrates the readings of a thermometer with
a lower resolution; the values still fluctuate randomly around the “true” value, but due
to the lower resolution of the instrument, the fluctuations are stronger.
Finally, the blue dash-dotted line illustrates what happens if the measuring instrument
itself introduces a bias. In this case, the measured values are no longer “faithful” to the
“true” ones; the resolution of the instrument is asymmetric with long tails, indicating
that the intrinsic fluctuations become biased towards higher values.
Stochastic: Stochastic systems are governed by the laws of probability, as opposed to
deterministic systems that can be calculated precisely.
To express the ideas of “faithful” and “true” more mathematically, we note that we need
to have a function f(x) that represents the true values of the substance being measured.
It is important to note that the vast majority of systems, objects, and processes in
our world are stochastic. This means that any value or number we observe is random,
but it follows a distinct probability distribution that is governed by a specific process
relevant for this system. There are, of course, notable exceptions to this, otherwise it
would be difficult to implement a clock. However, from the atomic scale to everyday
situations such as the speed of wind or the shopping behavior of customers in a
supermarket, everything needs to be calculated in terms of probabilities and probability
distributions. In the example above, f(x) would describe the distribution of the actual
temperature of the underlying physical process relevant for the substance we want to
examine.
The instrument itself is represented by a resolution function g(y) that determines how
the “true” values are observed, e.g. biased or unbiased with high or low resolution. The
measurements themselves are represented by some function h(z), which depends on
both the actual underlying state as described by f(x) and the resolution function g(y).
The variables x, y, and z all describe the same quantity, which, in our example, is the
temperature. However, they each enter the consideration at a different point: x is the
“true” value, y the resolution, and z what we finally observe. Effectively, the resolution
introduces a systematic error in the observation of the “true” value. If the measurement
device is biased, this error will include a shift that is more probable in one direction
compared to the other. In the example shown in the previous figure, the resolution
function is biased towards larger, more positive values and hence, more positive values
will be observed compared to negative values.
We now build a more detailed intuition about how convolutions work. Since, in general,
the true value x is a random number taken from a distinct probability distribution, we
know that the probability of getting x precisely is zero. Instead, we can calculate the
probability of getting the true value in the interval (x, x+dx) using the distribution f(x).
As f(x) defines the probability for any x, this is given by f(x)dx. Next, because we need a
device to measure this value, we need to include the resolution function g(y) so that
we finally observe the value z. The measurement instrument will generally shift the true
value to the observed value, hence we do not observe x, but rather z, which is shifted
by the amount z − x, i.e. the difference between them. Note that, in general, we cannot
know the true value of x. In case of an unbiased instrument, the resolution will “smear”
the true values so that the resulting distribution is broader than the true, physical one.
In case of a biased instrument, this step will also include a further shift in one direc-
tion. Hence, the original interval dx gets transformed into the interval dz and the over-
all effect of the instrument is given by g(z − x)dz. We can then combine the probabili-
ties for the true value and the instrument and obtain f(x)dxg(z − x)dz for a particular
observation. To express this for any value, we need to integrate over all possible values
of x to obtain the distribution for all observable values z
$$(f * g)(z) = h(z) = \int_{-\infty}^{\infty} f(x)\,g(z - x)\,dx. \tag{2.1}$$
Equation 2.1 is called the convolution of the functions f and g and is typically denoted
by f ∗ g. Two examples of convolutions of two simple functions are shown in the
following figure. In both cases, two uniform functions that are nonzero over a small
interval are convolved with each other. In the top instance, the two functions are
separated from each other, meaning that there is no portion of the domain where they are
both nonzero. In the other case, the intervals over which the two functions are nonzero
overlap. The convolutions are shown in the right-hand column of graphs. Note that the
shape of the convolved function is not the same as the shape of the original functions;
the two uniform functions gain a “triangular” shape when convolved.
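The triangular shape can be reproduced with a discrete analogue of equation 2.1. The following minimal sketch makes some assumptions for illustration: the uniform function is represented by five equal samples, the integral is replaced by a sum, and the function name `convolve` is hypothetical.

```python
def convolve(f, g):
    """Full discrete convolution h[z] = sum_x f[x] * g[z - x] of two finite sequences."""
    h = [0.0] * (len(f) + len(g) - 1)
    for x, fx in enumerate(f):
        for y, gy in enumerate(g):
            h[x + y] += fx * gy   # the pair (x, y) contributes to the output index z = x + y
    return h


box = [1.0] * 5                   # uniform function sampled at five points
triangle = convolve(box, box)
print(triangle)                   # rises linearly to a peak and then falls again
```

Each output value counts how many sample positions of the two boxes overlap for a given shift, which is why the result grows and then shrinks linearly.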
Most functions are not uniform functions. An example with functions more like those
we are interested in studying is shown in the following figure. In this case, the “true”
values are distributed according to a Γ (gamma) distribution (black, solid). The
measurement device is represented by an unbiased Gaussian resolution function with mean
zero and standard deviation one. The observed values are then distributed according to
the convolution of both functions as shown in the graph on the right.
This illustrates a typical pitfall when dealing with measured values. Although we know
that the “true” values are strictly positive (as illustrated by the black solid curve in the
left figure), the observed values can also be negative due to the finite resolution of the
measurement instrument. Depending on the concrete problem, these values need to be
treated with extra consideration as they may violate physical boundaries.
Applications in Image Processing
Convolutions play an important part in image processing and are at the core of (con-
ventional) image filters and convolutional neural networks. Instead of interpreting f and
g as a (true) signal and resolution function, one of the functions, for example f, repre-
sents the image and the other (g) represents a kernel that is used to operate on the
image. Given a suitable kernel, this defines a filter that can be used for a wide range of
applications such as blurring, sharpening, and edge detection. The kernel is often
denoted by K or ω.
In this section, we will discuss how convolution can be used to blur an image. To do so,
each part of the image is convolved with a Gaussian filter in the x and y directions; that
is, we apply a two-dimensional (or multivariate) Gaussian to each part of the
image. The two-dimensional function is illustrated in the following figure. In the case of
images, we work with discrete data as the images can be represented by collections of
individual pixels of the form (x, y, r, g, b), where x and y give the position of the pixel
in the image and r, g, and b give the relative levels of red, green, and blue in that pixel.
The continuous convolution integral in equation 2.1 then becomes, in the discrete case,
the summation
$$(K * f)(x, y) = \sum_{i=-a}^{a} \sum_{j=-b}^{b} K(i, j)\,f(x - i, y - j), \tag{2.2}$$
where f(x, y) is the original image and K is the appropriate kernel.
In the case of Gaussian blurring, the kernel is given by the matrix in equation 2.3, which
is applied to each part of the image. In this way, the value to which each pixel is trans-
formed is influenced by its neighboring pixels where their relative weight is given by
the kernel. The following figure shows the effect of applying such a kernel to an image.
www.iubh.de
54 Unit 2
Special care needs to be taken at the edges of the image where the kernel can poten-
tially exceed the image boundaries. In these cases one can either extend the image by,
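for example, repeating the outermost pixels, or one can crop the image so that the kernel
always fits inside the original image. The kernel used for Gaussian blurring is
$$K = \frac{1}{16} \begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}. \tag{2.3}$$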
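The discrete convolution and the edge handling described above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: it assumes a single-channel (grayscale) image stored as a list of rows, and the names `KERNEL`, `clamp`, and `convolve2d` are chosen here for illustration only. Pixels outside the image are handled by repeating the outermost pixels.

```python
# 3x3 Gaussian kernel from equation 2.3; its entries sum to one.
KERNEL = [[w / 16 for w in row] for row in ([1, 2, 1], [2, 4, 2], [1, 2, 1])]


def clamp(index, size):
    """Repeat the outermost pixel when an index leaves the image."""
    return max(0, min(index, size - 1))


def convolve2d(image, kernel):
    """Discrete 2D convolution (equation 2.2) of a grayscale image."""
    rows, cols = len(image), len(image[0])
    a = len(kernel) // 2      # kernel half-height
    b = len(kernel[0]) // 2   # kernel half-width
    out = [[0.0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            total = 0.0
            for i in range(-a, a + 1):
                for j in range(-b, b + 1):
                    # K(i, j) * f(x - i, y - j), with clamped indices at the edges
                    pixel = image[clamp(x - i, rows)][clamp(y - j, cols)]
                    total += kernel[i + a][j + b] * pixel
            out[x][y] = total
    return out


# A single bright pixel on a dark background is spread over its neighbours.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 16.0
blurred = convolve2d(image, KERNEL)
```

Because the kernel entries sum to one, the total brightness of the image is preserved in this example; the bright pixel is simply redistributed over its 3 × 3 neighbourhood with the weights of the kernel.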
2.2 Fourier Transformation
Fourier Series
We have previously seen how the Taylor expansion can be used to approximate a signal
or a function. However, Taylor expansions are not the only way to do this — there is
another way to look at functions, the Fourier series, that is particularly well suited to
periodic signals such as those found in a wide range of natural and engineering sys-
tems. The main idea of the Fourier series is to express the signal as a sum of sine and
cosine components of varying strength and frequency. In order to create a Fourier ser-
ies for a function, the signal must satisfy the following Dirichlet conditions:
• the function must be periodic,
• the function must be single-valued and have at most a finite number of finite
discontinuities within a period (such as a sawtooth function),
• within each period, the function must have a finite number of maxima and minima,
and
• the integral of the function over a single period must exist and be finite.
The requirement that the function is periodic can be stated as the condition
f(x + L) = f(x) = f(x + nL); in other words, the function repeats itself after the full
period L has passed. This also implies that the signal has neither a beginning nor an
end. Note that in practice using 2L instead of L for a single full period is also common
notation. In practical applications, one has to consider how to deal with finite signals,
e.g. by assuming that the signal follows the same structure beyond the observed signal.
The Fourier series, the promised decomposition of a signal into a summation of sine
and cosine waves, is given by
$$f(x) = \frac{1}{2} a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos\left(\frac{2\pi n x}{L}\right) + b_n \sin\left(\frac{2\pi n x}{L}\right) \right]. \tag{2.4}$$
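The coefficients in equation 2.4 are not written out at this point in the text. As a sketch, assuming the standard convention that matches the factor 1/2 in front of $a_0$ and an arbitrary starting point $x_0$ of the period (a symbol introduced here for illustration), they can be obtained from the signal by integrating over one full period:

```latex
a_n = \frac{2}{L} \int_{x_0}^{x_0 + L} f(x) \cos\left(\frac{2\pi n x}{L}\right) dx, \qquad n = 0, 1, 2, \dots

b_n = \frac{2}{L} \int_{x_0}^{x_0 + L} f(x) \sin\left(\frac{2\pi n x}{L}\right) dx, \qquad n = 1, 2, 3, \dots
```

With this convention, $\frac{1}{2} a_0$ is the average value of the signal over one period.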
When the period L is equal to 2π, this simplifies to
$$f(x) = \frac{1}{2} a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos(nx) + b_n \sin(nx) \right]. \tag{2.5}$$
The figure below shows a periodic sinusoidal signal to which only one frequency
contributes, namely f(t) = sin(t). The left part of the figure shows the signal in the time
domain, i.e. what we would observe if we measured the signal at different points in
time. The right part of the figure shows which frequencies contribute to the signal,
telling us which coefficients in the Fourier series are nonzero.
The next figure shows a slightly more complicated signal in which the signal with the
lower frequency (the same as in the previous figure) is overlaid with a signal that is ten
times faster. The resulting periodic signal as we would measure it, namely in the time
domain, is shown in the left part of the figure. The right part of the figure shows the
two frequencies contributing to this signal. The graphic titled “Periodic Sawtooth
Signal” shows a more realistic signal that is frequently encountered in electrical
engineering. During each period, the sawtooth signal rises linearly from the minimal to
the maximal value and drops back to the minimum once it reaches the maximum. The
left part of the figure shows the measured signal in the time domain, and the right part
of the figure illustrates that many coefficients in the Fourier series are needed to build
up this more complex signal, and that the weight of each contribution decreases as
the frequency increases.
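The behaviour of the sawtooth coefficients can be checked numerically. The following sketch makes several assumptions that are not part of the text: the sawtooth is taken as f(x) = x on (−π, π) with period 2π, the sine coefficients are computed with the standard formula b_n = (1/π) ∫ f(x) sin(nx) dx over one period, and the function names are illustrative. For this particular signal the cosine coefficients vanish and the known closed form is b_n = 2(−1)^(n+1)/n.

```python
import math


def sawtooth(x):
    """Sawtooth with period 2*pi that rises linearly from -pi to pi."""
    return (x + math.pi) % (2 * math.pi) - math.pi


def sine_coefficient(n, samples=20000):
    """Approximate b_n = (1/pi) * integral of f(x) sin(nx) over one period."""
    dx = 2 * math.pi / samples
    total = 0.0
    for k in range(samples):
        x = -math.pi + (k + 0.5) * dx  # midpoint rule
        total += sawtooth(x) * math.sin(n * x) * dx
    return total / math.pi


def partial_sum(x, terms):
    """Fourier series of the sawtooth truncated after `terms` sine terms."""
    return sum(2 * (-1) ** (n + 1) / n * math.sin(n * x)
               for n in range(1, terms + 1))


# Numerical coefficients b_1 ... b_5; their magnitudes fall off like 2/n.
coefficients = [sine_coefficient(n) for n in range(1, 6)]
```

The slow 1/n decay of the coefficients is the reason why many terms are needed before `partial_sum` reproduces the sawtooth well.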
Recall Euler’s formula, $e^{i\Phi} = \cos\Phi + i \sin\Phi$, which establishes the fundamental
relationship between trigonometric functions and complex-valued exponential functions.
In many applications it is helpful to use Euler’s insight to express the sine and cosine
terms in the Fourier series by exponential functions with imaginary arguments. To do so,
we can use the identities
$$\sin(nx) = \frac{1}{2i} \left( e^{inx} - e^{-inx} \right), \quad \text{and} \quad \cos(nx) = \frac{1}{2} \left( e^{inx} + e^{-inx} \right). \tag{2.6}$$
Inserting these into equation 2.5, the Fourier series can be expressed as
$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{inx}, \quad \text{with} \quad c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\, e^{-inx}\, dx. \tag{2.7}$$
Fourier transformations
In equation 2.7, we expressed the Fourier series as an infinite sum of complex func-
tions, still assuming for simplicity that the period L is equal to 2π. If we relax this
assumption, the Fourier series can be written as
$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{i 2\pi n x / L}, \tag{2.8}$$
which, using the frequency $\omega_n = 2\pi n / L$, can be expressed as
$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{i \omega_n x}. \tag{2.9}$$
Previously, we always interpreted the period L as some finite interval after which the
function repeats itself — indeed, this was one of our core assumptions when we dis-
cussed the Fourier series. We now consider the limit of large periods, i.e. L → ∞ where
the signal only repeats after a very long time. Instead of fixed frequencies in the sine
and cosine terms that we have considered so far, the difference in frequencies
Δω = 2π/L becomes infinitesimally small and the frequencies become a continuum
rather than the discrete values we saw in the first two examples in this section. We
recall from equation 2.7 that the coefficients cn are given by
$$c_n = \frac{1}{L} \int_{x_0}^{x_0 + L} f(x)\, e^{-i 2\pi n x / L}\, dx = \frac{\Delta\omega}{2\pi} \int_{x_0}^{x_0 + L} f(x)\, e^{-i \omega_n x}\, dx \tag{2.10}$$
where we have again allowed an arbitrary period L, and $\omega_n$ is a discrete function of
the coefficient index n given by $\omega_n = 2\pi n / L$. In the integral limits, $x_0$ is an
arbitrary constant which is often taken as −L/2. In fact, this is the reason that 2L is often
used to indicate a full period, so that the limits of the integral can be written as −L for the
lower and L for the upper limit instead of −L/2 and L/2. Substituting this expression for
the coefficients into the complex Fourier series, we get
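$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{i \omega_n x} = \sum_{n=-\infty}^{\infty} \frac{\Delta\omega}{2\pi} \left[ \int_{x_0}^{x_0 + L} f(u)\, e^{-i \omega_n u}\, du \right] e^{i \omega_n x}$$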
where we have used u instead of x in the expression for cn to avoid confusion with the
various variables and what they refer to.
We now consider the limit of long periods, i.e. L → ∞, which implies that Δω → 0. Then
the sum
$$\sum_{n=-\infty}^{\infty} \frac{\Delta\omega}{2\pi}\, g(\omega_n)\, e^{i \omega_n x} \tag{2.11}$$
becomes the integral
$$\frac{1}{2\pi} \int_{-\infty}^{\infty} g(\omega)\, e^{i \omega x}\, d\omega \tag{2.12}$$
where, in our case, the function g(ω) is given by
$$g(\omega_n) = \int_{-L/2}^{L/2} f(u)\, e^{-i \omega_n u}\, du \tag{2.13}$$
where we have chosen the constant $x_0$ now to be −L/2. Putting this together, we obtain
$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i \omega x}\, d\omega \int_{-\infty}^{\infty} f(u)\, e^{-i \omega u}\, du. \tag{2.14}$$
From this, we define the Fourier transform of the function f(x) to be
$$\tilde{f}(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x)\, e^{-i \omega x}\, dx \tag{2.15}$$
where we have changed back from u to x. We can also define the inverse Fourier trans-
formation
$$f(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \tilde{f}(\omega)\, e^{+i \omega x}\, d\omega. \tag{2.16}$$
The normalization factor $\frac{1}{2\pi}$ is split equally between the Fourier transformation and its
inverse as $\frac{1}{\sqrt{2\pi}}$, which avoids having to remember in which equation the factor needs to
be inserted. Note that the tilde in $\tilde{f}(\omega)$ is often dropped if there is no risk of confusing
the Fourier transformation and its inverse.
Using the Fourier transformation, we can switch between two equivalent views of
analysing a signal, either in the “observable” domain f(x) or in the frequency domain
$\tilde{f}(\omega)$. Since most functions are observed as time-dependent functions or signals, we
typically use f(t) instead of f(x) to indicate the dependency on time. In fact, the left and
right parts of the previous three figures correspond to either the time domain (left part) or
the frequency domain (right part). The main advantage of being able to switch between
these equivalent representations of a signal is that some operations might be very
complicated in one domain but very easy in the other. For example, applying a fre-
quency filter is very difficult in the time domain but easy in the frequency domain. Con-
cretely, we consider the periodic signal in the figure titled “Periodic Sine Signal with
Two Contributing Frequencies.” We know from the frequency spectrum that this is a
simple signal where a low frequency sine function is combined with a high frequency
sine function. Using a low-pass filter (a filter that will only let low frequencies pass), we
can extract the part of the signal with the low frequency. To do so in the time-domain
where we observe the signal would be very difficult, but it is easily achieved in the fre-
quency domain by applying the dampening function shown in the previous figure.
Rather than using a simple rectangle as one might assume, we use a “softened” rectan-
gular function. In this simple example, this would make little difference, however, in
more complex examples, the smooth attenuation avoids artifacts that can be induced
by a sharp cut-off. The following figure shows that we can then extract the low-fre-
quency signal by switching back to the time domain as desired.
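The low-pass filtering procedure described above can be sketched with a discrete Fourier transform. The example below is an illustration under stated assumptions, not the method used to produce the figures: the signal is taken to be sin(t) + sin(10t) sampled at 64 points over one period, the "softened" rectangle is modelled by a logistic function with a hypothetical cutoff of 5 and width 0.5, and a naive transform is used so that only the standard library is needed.

```python
import cmath
import math

N = 64  # samples over one period [0, 2*pi)
t = [2 * math.pi * k / N for k in range(N)]
signal = [math.sin(tk) + math.sin(10 * tk) for tk in t]


def dft(values):
    """Naive discrete Fourier transform (O(N^2), fine for small N)."""
    n = len(values)
    return [sum(values[k] * cmath.exp(-2j * math.pi * m * k / n)
                for k in range(n)) for m in range(n)]


def inverse_dft(spectrum):
    """Inverse of dft; returns complex samples."""
    n = len(spectrum)
    return [sum(spectrum[m] * cmath.exp(2j * math.pi * m * k / n)
                for m in range(n)) / n for k in range(n)]


def soft_low_pass(frequency, cutoff=5.0, width=0.5):
    """Smoothed rectangle: close to 1 below the cutoff, close to 0 above."""
    return 1.0 / (1.0 + math.exp((frequency - cutoff) / width))


spectrum = dft(signal)
filtered = []
for m, value in enumerate(spectrum):
    frequency = min(m, N - m)  # bins m and N - m hold the same |frequency|
    filtered.append(value * soft_low_pass(frequency))

# Back in the time domain only the slow component sin(t) remains.
recovered = [z.real for z in inverse_dft(filtered)]
```

Applying the same weight to the bins m and N − m keeps the filtered spectrum symmetric, so the recovered signal is real up to rounding errors.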
Example
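Find the Fourier transformation of the function $f(t) = A e^{-\lambda t}$ for t ≥ 0 with λ > 0,
assuming f(t) = 0 for t < 0.
We use equation 2.15 and write
$$\tilde{f}(\omega) = \frac{A}{\sqrt{2\pi}} \int_{0}^{\infty} e^{-\lambda t} e^{-i \omega t}\, dt = \frac{A}{\sqrt{2\pi}} \int_{0}^{\infty} e^{-(\lambda + i\omega) t}\, dt \tag{2.17}$$
where we note that the bounds of integration range from zero to ∞ as we assume
that f(t) = 0 for t < 0. Remembering $\int e^{ax}\, dx = \frac{1}{a} e^{ax}$, the integral is evaluated as
$$\tilde{f}(\omega) = \frac{A}{\sqrt{2\pi}} \left[ -\frac{e^{-(\lambda + i\omega) t}}{\lambda + i\omega} \right]_{t=0}^{\infty} = \frac{A}{\sqrt{2\pi}}\, \frac{1}{\lambda + i\omega}.$$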
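The closed-form result of this example can be checked numerically. The sketch below assumes the symmetric normalization factor 1/√(2π) for the transform, picks illustrative values for A, λ, and ω (they are not taken from the text), and truncates the integral at a hypothetical upper limit where the integrand is negligible.

```python
import cmath
import math

A, lam, omega = 2.0, 1.5, 3.0  # illustrative values with lam > 0


def integrand(t):
    """A * exp(-lam * t) * exp(-i * omega * t) for t >= 0."""
    return A * cmath.exp(-(lam + 1j * omega) * t)


def fourier_transform_numeric(upper=40.0, steps=200000):
    """Trapezoidal rule for the integral of the example on [0, upper]."""
    dt = upper / steps
    total = 0.5 * (integrand(0.0) + integrand(upper))
    for k in range(1, steps):
        total += integrand(k * dt)
    return total * dt / math.sqrt(2 * math.pi)


numeric = fourier_transform_numeric()
closed_form = A / math.sqrt(2 * math.pi) / (lam + 1j * omega)
```

Both values are complex numbers; comparing `numeric` with `closed_form` for a few choices of ω is a quick way to gain confidence in the analytic calculation.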
Summary
Convolutions mathematically express how the effects of two functions can be combined.
For example, a measurement of a physical quantity taken by some measurement device
can be described by the “true” value and a resolution function that is unique to the
specific details of the device. The observed value is then the convolution of the “true”
value with the finite sensor resolution. In image processing, convolutions with a given
kernel are used to manipulate images by blurring, sharpening, detecting edges, or
transforming the image. Periodic signals can be expressed
as a Fourier series which is a series of sine and cosine functions of fixed frequen-
cies. The relative weight of the coefficients of the terms in the series determines
the shape of the final signal. Fourier transformations can be derived as the limit of
increasingly long periods over which the fixed frequencies become a continuous
spectrum. Using the Fourier transform, a function can be expressed either in its
original form or as a Fourier transformation. Some operations are much simpler to
perform in one form compared to the other, so the Fourier transform is very useful
in practice.
Knowledge Check
learning platform.
Good luck!
Unit 3
Vector Algebra
STUDY GOALS
… the difference between scalars and vectors.
… how to perform basic operations with vectors.
… geometric and physical interpretations of vectors.
… examples of applications that use basic vector operations.
DL-E-DLMDSAM01-L03
3. Vector Algebra
Introduction
In this unit, we introduce techniques that allow us to operate on more complex
mathematical objects called vectors that contain one piece of information in each
coordinate. For example, a vector can describe the speed of a particle and the direction
in which it is moving. This is an example of a two- or three-dimensional vector. Likewise,
an n-dimensional vector could be used to record n pieces of information such as time,
temperature, and the x-, y-, and z-coordinates of the direction of movement or force.
Given the numerous and important applications of vectors to physics, we will focus on
developing an intuition for vectors in two and three dimensions, in particular in ℝ² and
ℝ³, the familiar Cartesian plane and space. Vector operations on these spaces can be
generalized to the broader contexts seen in information science, machine learning, and
computer science in general.
Good textbooks that cover the subject area are chapters 11 and 12 of Calculus (Strang,
2017) and chapters one and two of Advanced Calculus (Loomis & Sternberg, 2014).
3.1 Scalars and Vectors

Numbers as we are accustomed to them are scalars, and they measure one-dimensional
quantities like temperature or weight. A familiar example of a scalar-valued function is
$h(t) = -4.9 t^2 + v_0 t + h_0$, the parabola describing the height of an object t seconds
after being thrown into the air with initial height $h_0$, initial velocity $v_0$, and
acceleration due to gravity of −9.8 meters per second squared.
Acceleration Due to Gravity: Note that this is only an average value as the exact value
depends on the location and local topography of the earth.
Technically, a vector is an element of a mathematical structure called a vector space.
For our purposes, we can consider a vector to be a mathematical object that has both
magnitude and direction. This way of thinking of vectors works very well for two- and
three-dimensional space. For more general applications to machine learning and
information science, vectors can have many more than three coordinates, and might
have nothing to do with physical space; however, they will still satisfy the same basic
rules as our more familiar examples from physical space.
Regardless of their dimension, vectors can be analyzed by breaking them into their
component parts. For example, in the two-dimensional case, the components would be
horizontal and vertical. For n-dimensional vectors, there is no physical analogue, but
the ith-coordinate of one vector will measure the same quantity as the ith-coordinate of
another vector in the same space. Let’s begin by building an intuition for vectors in the
familiar, two-dimensional plane.
Two-Dimensional Vectors
One of the most common applications of vectors is the study of objects moving in two-
dimensional space, though there are many other applications of two-dimensional vec-
tors, too.
In this context, we want a two-dimensional vector to tell us which direction to travel
and how far to go. Note that a vector doesn’t tell us where to start. Also, the “how far to
go” is usually over a fixed time increment, and therefore tells us a speed, not a position.
There are several ways to indicate that a quantity is a vector rather than a scalar: the
element may be underlined, written in bold font, or written with an arrow over the
symbol. In the following, we will use the latter notation and denote vectors like $\vec{v}$.
Let’s consider an initial point, $P = (p_1, p_2)$, in the x, y-coordinate plane and a terminal
point, $Q = (q_1, q_2)$. Then the vector representing the travel from P to Q can be written
$\overrightarrow{PQ}$. The magnitude of $\overrightarrow{PQ}$ is the length of the line segment connecting P and Q,
obtained from the distance formula. Namely, the magnitude of $\overrightarrow{PQ}$ is given by
$$|\overrightarrow{PQ}| = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}.$$
If the initial point P is the origin (0, 0), we say that the vector is in standard position and
we call it a position vector. Note that the ordered pair $Q = (q_1, q_2)$ uniquely specifies
a position vector as there is only one distance and one path from (0, 0) to $(q_1, q_2)$. The
following figure shows an example of a vector $\vec{v}$ in standard position. Similarly, for
vectors in n dimensions, we write $\vec{v} = (v_1, v_2, \dots, v_n)$ for a vector whose ith component
is $v_i$, where the $v_i$ come from an underlying field (such as the real or complex numbers).
Fundamental definitions
1. Two vectors $\vec{a} = (a_1, \dots, a_n)$ and $\vec{b} = (b_1, \dots, b_m)$ are equal, $\vec{a} = \vec{b}$, if, and only if,
n = m and $a_i = b_i$ for all 1 ≤ i ≤ n. This means that they have the same magnitude
and direction.
2. Given $\vec{a}$ as above, the negation of the vector, denoted by $-\vec{a}$, is $(-a_1, \dots, -a_n)$, a
vector having the same magnitude but opposite direction to that of $\vec{a}$.
3. If $|\vec{v}| = 1$, we say that the vector $\vec{v}$ is a unit vector.
Example
Use a vector to graphically represent a force of 10 Newtons in a direction of 30°
North East.
In two dimensions, this force can be represented as shown in the following figure.
Example
Suppose that the vector $\vec{v}$ is given by the directed line segment extending from
the point (0, 0) to the point (3, 2) and that the vector $\vec{u}$ is given by the directed line
segment from the point (1, 2) to the point (4, 4). Is it true that $\vec{u} = \vec{v}$?
Let P = (0, 0), Q = (3, 2), R = (1, 2), and S = (4, 4) as shown in the following figure.
We need to determine whether the directed line segments have the same direction
and magnitude. We find the magnitude of both line segments to be
$$|\overrightarrow{PQ}| = |\vec{v}| = \sqrt{(3 - 0)^2 + (2 - 0)^2} = \sqrt{13} \quad \text{and}$$
$$|\overrightarrow{RS}| = |\vec{u}| = \sqrt{(4 - 1)^2 + (4 - 2)^2} = \sqrt{13}.$$
In the two-dimensional case, the slope of a line describes its direction; finding the
slope of both line segments, we see that they agree:
$$\text{Slope of } PQ = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{2 - 0}{3 - 0} = \frac{2}{3},$$
$$\text{Slope of } RS = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{4 - 2}{4 - 1} = \frac{2}{3}.$$
As $\vec{u}$ and $\vec{v}$ have the same direction and the same magnitude, we have $\vec{u} = \vec{v}$.
An easier way to determine whether two vectors are the same is to use the component
form of vectors: for a vector $\overrightarrow{AC}$ determined by the points $A = (x_1, y_1)$ and $C = (x_2, y_2)$,
the component form is defined as $\overrightarrow{AC} = (x_2 - x_1, y_2 - y_1)$. The coordinates of this vector
describe the position vector $\overrightarrow{OP}$ from the origin (0, 0) to the point $P = (x_2 - x_1, y_2 - y_1)$.
Note that two position vectors $\vec{u} = (u_1, u_2)$ and $\vec{v} = (v_1, v_2)$ are equal if their
respective coordinates are equal, so $\vec{u} = \vec{v}$ if, and only if, $u_1 = v_1$ and $u_2 = v_2$. Also,
note that the initial and final points are not part of the vector itself. The vectors $\overrightarrow{AC}$ and
$\overrightarrow{OP}$ are the same vector.
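As a small illustration of the component form, the following sketch recomputes the earlier example with the points P, Q, R, and S. The function name `component_form` is chosen here for illustration and is not part of the text.

```python
def component_form(start, end):
    """Component form of the vector from `start` to `end`."""
    return tuple(e - s for s, e in zip(start, end))


v = component_form((0, 0), (3, 2))   # from P to Q
u = component_form((1, 2), (4, 4))   # from R to S

# Two vectors are equal exactly when all of their components agree.
same_vector = (u == v)
```

Comparing components replaces the separate checks of magnitude and slope, and the same function works unchanged for points with more than two coordinates.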
3.2 Addition and Subtraction of Vectors
Properties of Addition and Subtraction of Vectors
Computationally, the addition and subtraction of vectors is as easy as the addition and
subtraction of their components in the underlying field, which, for us, is either the real
or complex numbers.
Field: A mathematical field is a set on which addition, multiplication, subtraction, and
division are defined and behave the same way as for rational and real numbers.
Specifically, for $\vec{v} = (v_1, \dots, v_n)$ and $\vec{u} = (u_1, \dots, u_n)$,
$$\vec{v} + \vec{u} = (v_1 + u_1, \dots, v_n + u_n) \quad \text{and} \quad \vec{v} - \vec{u} = (v_1 - u_1, \dots, v_n - u_n).$$
Notice that because the components of the vectors are elements of the field of real
numbers ℝ (or complex numbers ℂ), addition of these entries, and thus of vectors, is
commutative: namely, $\vec{v} + \vec{u} = \vec{u} + \vec{v}$. Subtraction, of course, is not commutative.
Vectors also share other properties with the real or complex numbers: vector addition
is associative, there exists an additive identity $\vec{0}$, and every element $\vec{v}$ has an additive
inverse, $-\vec{v}$.
We summarize these properties below:
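Let $\vec{u} = (u_1, u_2)$, $\vec{v} = (v_1, v_2)$, $\vec{w} = (w_1, w_2)$ and let k and c be scalars.
1. $\vec{u} + \vec{v} = \vec{v} + \vec{u}$ (commutative property)
2. $(\vec{u} + \vec{v}) + \vec{w} = \vec{u} + (\vec{v} + \vec{w})$ (associative property)
3. $\vec{u} + \vec{0} = \vec{u} = \vec{0} + \vec{u}$ (additive identity)
4. $\vec{u} + (-\vec{u}) = \vec{0}$ (additive inverse)
5. $k(c\,\vec{u}) = (kc)\,\vec{u}$ (associativity of scalar multiplication)
6. $k(\vec{v} + \vec{w}) = k\vec{v} + k\vec{w}$ (distributive property)
7. $(k + c)\,\vec{u} = k\vec{u} + c\vec{u}$ (distributive property)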
Example
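Let $\vec{u} = (7, 2)$ and $\vec{v} = (-3, 5)$. Find $\vec{u} + \vec{v}$, $\vec{u} - 6\vec{v}$, $3\vec{u} + 4\vec{v}$ and $5\vec{v} - 2\vec{u}$.
We have:
$$\vec{u} + \vec{v} = (7, 2) + (-3, 5) = (4, 7);$$
$$\vec{u} - 6\vec{v} = (7, 2) - 6(-3, 5) = (7, 2) - (-18, 30) = (25, -28);$$
$$3\vec{u} + 4\vec{v} = 3(7, 2) + 4(-3, 5) = (21, 6) + (-12, 20) = (9, 26);$$
$$5\vec{v} - 2\vec{u} = 5(-3, 5) - 2(7, 2) = (-15, 25) - (14, 4) = (-29, 21).$$
The last vector has the magnitude $|5\vec{v} - 2\vec{u}| = \sqrt{(-29)^2 + 21^2} \approx 35.8$.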
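The component-wise rules can be turned directly into code. The sketch below recomputes the four combinations of this example with plain tuples; the helper names `add`, `scale`, `subtract`, and `magnitude` are illustrative choices, not notation from the text.

```python
import math


def add(u, v):
    """Component-wise sum of two vectors of equal dimension."""
    return tuple(a + b for a, b in zip(u, v))


def scale(k, v):
    """Multiply every component of v by the scalar k."""
    return tuple(k * a for a in v)


def subtract(u, v):
    """u - v, written as u + (-1) * v."""
    return add(u, scale(-1, v))


def magnitude(v):
    """Length of v from the distance formula."""
    return math.sqrt(sum(a * a for a in v))


u = (7, 2)
v = (-3, 5)

results = {
    "u + v": add(u, v),
    "u - 6v": subtract(u, scale(6, v)),
    "3u + 4v": add(scale(3, u), scale(4, v)),
    "5v - 2u": subtract(scale(5, v), scale(2, u)),
}
```

The last entry is the vector (−29, 21); its magnitude, `magnitude(results["5v - 2u"])`, is approximately 35.8.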
Vectors in Two Dimensions
The computational way of adding vectors gives little intuition for their physical
applications. Let us now turn our attention to the graphical addition and subtraction of
vectors, restricting ourselves, for now, to the case of ℝ², which we will visualize as the
Cartesian coordinate plane.
ℝ²: ℝ² is the two-dimensional space of real numbers.
Let $\vec{x}$ and $\vec{y}$ be vectors in ℝ². Conceptually, the sum of these two vectors should be
the net effect of the two vectors together. For ease of explanation, suppose that $\vec{x}$ and
$\vec{y}$ represent forces applied in the plane.
One way to think about this is as the two vectors in sequence, namely “doing” or
“applying” one and then the other. Graphically, that would mean placing the tail of $\vec{y}$
on the head of $\vec{x}$; if we then draw a new vector, indicated by the red dashed line from
the tail of $\vec{x}$ to the head of $\vec{y}$, that vector is the sum $\vec{x} + \vec{y}$, as shown in the following
figure. We can denote this sum by a new vector, such as $\vec{z} = \vec{x} + \vec{y}$.
The difference of two vectors $\vec{x}$ and $\vec{y}$ is the sum of $\vec{x}$ and $-\vec{y}$. The difference
$\vec{x} + (-\vec{y}) = \vec{x} - \vec{y}$ is shown in the following figure. We can denote this difference by a
new vector, such as $\vec{z} = \vec{x} - \vec{y}$.
The next figure illustrates that performing addition graphically is equivalent to per-
forming addition using the component form of the vectors. The parallelogram also
shows that addition of vectors is commutative — we get the same diagonal regardless
of which vector we apply first.
Example
Let $\vec{v} = (-3, 2)$ and $\vec{w} = (5, -9)$. Find $\vec{v} + \vec{w}$ both graphically and by using the
component form of the vectors.
We have
$$\vec{v} + \vec{w} = (-3 + 5,\, 2 + (-9)) = (2, -7).$$
Introduction to Bases
A basis for a vector space is a minimal spanning set of vectors in that vector space.
Minimal means that if we took any of the vectors away, the set would no longer span the
space; there would be some vectors that cannot be formed as linear combinations of the
remaining elements in the set. Recall that a linear combination of elements is a sum of
those elements, possibly each multiplied by a scalar. For example, a two-dimensional
space can be spanned by the vectors $\vec{i}$ and $\vec{j}$ pointing along the x and y axes. Any
element of this two-dimensional space (or plane) can be expressed by a linear
combination of these two vectors, such as $\vec{v} = 2\vec{i} + 3\vec{j}$. If we were to take one of them
away, we could no longer reach all points in the plane.
Spanning Set: A spanning set is one that allows us to express every vector in the space
as a linear combination of elements in that spanning set.
Bases (the plural of basis) are fundamental to the study of vector spaces, and they are
not unique. Though we will typically use the familiar unit vectors in the coordinate-axis
directions as our bases for two- and three-dimensional space, there are many other
bases, and how to move from one basis to another is an important question.
Unit vectors
Recall that a vector of magnitude (or length) one is called a unit vector. For example,
the vector $\vec{v} = (-3/5, 4/5)$ is a unit vector because
$$|\vec{v}| = \sqrt{(-3/5)^2 + (4/5)^2} = \sqrt{\frac{9}{25} + \frac{16}{25}} = \sqrt{\frac{25}{25}} = 1.$$
Unit vectors are particularly useful because they all have the same length, namely
length one, so they differ from one another only in direction. In particular, there is only
one unit vector that points in a given direction.
Let us denote the unit vector in the direction of vector $\vec{v}$ by $\vec{u}$. To find the unit vector
in the direction of a given nonzero vector $\vec{v}$, we scale $\vec{v}$ to make a vector of length one
as follows:
$$\vec{u} = \frac{\vec{v}}{|\vec{v}|} = \frac{1}{|\vec{v}|}\, \vec{v}.$$
Example
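Find a unit vector in the direction of $\vec{v} = (-1, 2)$.
First we find the magnitude of $\vec{v}$: $|\vec{v}| = \sqrt{(-1)^2 + 2^2} = \sqrt{5}$.
Then the unit vector in the direction of $\vec{v}$ is
$$\vec{u} = \frac{\vec{v}}{|\vec{v}|} = \frac{(-1, 2)}{\sqrt{5}} = \frac{1}{\sqrt{5}} (-1, 2).$$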
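Normalizing a vector is a one-line scaling once the magnitude is known. The sketch below uses illustrative function names and adds one guard that the text does not discuss: the zero vector has no direction, so it cannot be normalized.

```python
import math


def magnitude(v):
    """Length of v from the distance formula."""
    return math.sqrt(sum(a * a for a in v))


def unit_vector(v):
    """Scale v by 1/|v|; the zero vector has no direction."""
    length = magnitude(v)
    if length == 0:
        raise ValueError("the zero vector has no direction")
    return tuple(a / length for a in v)


# Unit vector in the direction of (-1, 2), i.e. (1/sqrt(5)) * (-1, 2).
u = unit_vector((-1, 2))
```

Up to floating-point rounding, `magnitude(u)` equals one, which is a convenient sanity check for any normalization routine.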
The unit vectors parallel to the x- and y-axes, sometimes called standard unit vectors,
are particularly useful because they form what is called a basis of the two-dimensional
vector space. In particular, let
$$\vec{i} = (1, 0); \quad \vec{j} = (0, 1).$$
Any vector in two-dimensional space can be written as a linear combination of these
two vectors $\vec{i}$ and $\vec{j}$. For example, we can express an arbitrary vector $\vec{v}$ as
$$\vec{v} = (v_1, v_2) = (v_1, 0) + (0, v_2) = v_1 (1, 0) + v_2 (0, 1) = v_1 \vec{i} + v_2 \vec{j}.$$
Example
Express $\vec{r} = (2, -6)$ as a linear combination of $\vec{i}$ and $\vec{j}$.
We have
$$\vec{r} = (2, -6) = 2\vec{i} + (-6)\vec{j} = 2\vec{i} - 6\vec{j}.$$
It is important to note that the set $\{\vec{i}, \vec{j}\}$ is only one of many possible bases for ℝ²;
there are many other bases for this same space. For example, we could express each
point in terms of a distance r from the origin and an angle θ between the positive x-axis
and the vector to any point in ℝ². Indeed, in the two-dimensional case, any two
vectors that are not scalar multiples of one another will form a basis. This simple
criterion is sufficient in the two-dimensional case, but not in higher dimensions, where
vectors can fail to form a basis even if no two of them are scalar multiples of one another.
Vectors in Three-Dimensional Space
The definitions and properties we observed in two dimensions can be extended to
three dimensions in a straightforward way. As a basis of standard unit vectors for three
dimensions, we define the set $\{\vec{i}, \vec{j}, \vec{k}\}$, where
$$\vec{i} = (1, 0, 0); \quad \vec{j} = (0, 1, 0); \quad \text{and} \quad \vec{k} = (0, 0, 1).$$
The three standard unit vectors are directed along the positive x-, y-, and z-axes of the
three-dimensional rectangular coordinate system.
For this basis, it is easy to see that any vector $\vec{v}$ can be expressed as a linear
combination of the standard basis vectors because this is how we tend to think about
position in three-dimensional space. In particular,
$$\vec{v} = \vec{v}_x + \vec{v}_y + \vec{v}_z = v_1 \vec{i} + v_2 \vec{j} + v_3 \vec{k}$$
where for i = 1, 2, 3, the $v_i$ are the components (orthogonal projections) of $\vec{v}$ in the x-,
y-, and z-directions, respectively, as illustrated in the following figure.
Position vectors
In the three-dimensional space of real numbers ℝ³, a position vector extends from the
origin (0, 0, 0) to the point (x, y, z) and is written $\vec{r} = (x, y, z)$. This can be expressed as
a linear combination of the standard unit vectors:
$$\vec{r} = x\vec{i} + y\vec{j} + z\vec{k},$$
with magnitude
$$|\vec{r}| = \sqrt{x^2 + y^2 + z^2}.$$
Example
Given $\vec{r}_1 = 3\vec{i} - 2\vec{j} + \vec{k}$, $\vec{r}_2 = 2\vec{i} - 4\vec{j} - 3\vec{k}$ and $\vec{r}_3 = -\vec{i} + 2\vec{j} + 2\vec{k}$.
Find the magnitude of $\vec{M} = \vec{r}_1 + \vec{r}_2 + \vec{r}_3$.
We have
$$\vec{M} = \vec{r}_1 + \vec{r}_2 + \vec{r}_3 = \vec{i}\,(3 + 2 - 1) + \vec{j}\,(-2 - 4 + 2) + \vec{k}\,(1 - 3 + 2) = 4\vec{i} - 4\vec{j} = (4, -4, 0)$$
and thus,
$$|\vec{M}| = \sqrt{4^2 + (-4)^2 + 0^2} = 4\sqrt{2}.$$
Example
Find the position vector $\vec{r}$ corresponding to the vector $\vec{v}$ with initial point (−2, 3, 1)
and terminal point (0, −4, 4). Next, find the unit vector $\vec{u}$ in the direction of $\vec{r}$.
The position vector is
$$\vec{r} = (0 - (-2),\, -4 - 3,\, 4 - 1) = (2, -7, 3).$$
The magnitude of the position vector is
$$|\vec{r}| = \sqrt{2^2 + (-7)^2 + 3^2} = \sqrt{62},$$
and therefore the unit vector in the direction of $\vec{r}$ is
$$\vec{u} = \frac{\vec{r}}{|\vec{r}|} = \frac{1}{\sqrt{62}} (2, -7, 3).$$
Collinear vectors
Two vectors $\vec{u}$ and $\vec{v}$ are collinear (a generalization of parallel) if there exists a
real-valued constant c such that $\vec{u} = c\vec{v}$. We use the following notation to express that
two vectors are collinear: $\vec{u} \parallel \vec{v}$. Note that this definition holds for vector spaces of
any dimension.
Example
Suppose vector $\vec{w}$ has the initial point (2, −1, 3) and the terminal point (−4, 7, 5).
Are $\vec{w}$ and $\vec{u} = (3, -4, -1)$ collinear? What about $\vec{w}$ and $\vec{v} = (12, 16, 4)$?
First let us write $\vec{w}$ in the component form,
$$\vec{w} = (-4 - 2,\, 7 - (-1),\, 5 - 3) = (-6, 8, 2).$$
Observing that it is possible to write $\vec{u}$ as
$$\vec{u} = (3, -4, -1) = -\frac{1}{2} (-6, 8, 2) = -\frac{1}{2} \vec{w},$$
we conclude that $\vec{w} \parallel \vec{u}$ holds.
To check if $\vec{w}$ is collinear with $\vec{v}$, we must determine whether there is a constant c
satisfying $\vec{v} = c\vec{w}$, namely
$$(12, 16, 4) = c\,(-6, 8, 2).$$
This generates three equations in c that must be simultaneously satisfied. Such a c
would have to be −2 for the first coordinate but 2 for the last two coordinates.
Therefore, there is no such c and $\vec{w}$ is not collinear with $\vec{v}$, $\vec{w} \nparallel \vec{v}$.
Example
Determine whether the points P = (1, −2, 3), Q = (2, 1, 0), and R = (4, 7, −6) lie
on the same line.
The component forms of $\overrightarrow{PQ}$ and $\overrightarrow{PR}$ are
$$\overrightarrow{PQ} = (2 - 1,\, 1 - (-2),\, 0 - 3) = (1, 3, -3) \quad \text{and}$$
$$\overrightarrow{PR} = (4 - 1,\, 7 - (-2),\, -6 - 3) = (3, 9, -9).$$
Note that $\overrightarrow{PQ}$ and $\overrightarrow{PR}$ have the same initial point by construction, so they are
collinear if, and only if, they lie on the same line. By inspection, we see that $\overrightarrow{PR} = 3\,\overrightarrow{PQ}$,
and therefore $\overrightarrow{PQ} \parallel \overrightarrow{PR}$ holds and the three points are on the same line.
3.3 Multiplication of Vectors: Dot Product and Scalar Product
We have covered addition and subtraction of vectors and multiplication of vectors by
scalars, which scales vectors, changing their length but not their direction. While there
is not a single “multiplication” of vectors, there are two important products that we will
introduce in this section: the scalar or dot product of two vectors, which has a scalar
output, and the cross product of two vectors, the output of which is a vector.
Scalar or Dot Product
Let $\vec{u} = (u_1, \dots, u_n)$ and $\vec{v} = (v_1, \dots, v_n)$ be vectors. The dot product of $\vec{u}$ with $\vec{v}$ is
the scalar
$$\vec{u} \cdot \vec{v} = u_1 v_1 + u_2 v_2 + \dots + u_n v_n.$$
Note that $\vec{u} \cdot \vec{v}$ is a scalar. By considering the properties of the real-valued
components, we can verify the following properties.
Properties of the dot product
For all real vectors $\vec{u}$, $\vec{v}$, and $\vec{w}$ in ℝⁿ and all real scalars c, the following properties hold:
1. $\vec{u} \cdot \vec{v} = \vec{v} \cdot \vec{u}$
2. $\vec{u} \cdot (\vec{v} + \vec{w}) = \vec{u} \cdot \vec{v} + \vec{u} \cdot \vec{w}$
3. $\vec{u} \cdot \vec{u} = |\vec{u}|^2$
4. $\vec{u} \cdot (c\vec{v}) = c\,(\vec{u} \cdot \vec{v}) = (c\vec{u}) \cdot \vec{v}$
Example
Let $\vec{u} = (2, -2)$ and $\vec{v} = (5, 8)$. Find $\vec{u} \cdot \vec{v}$ and $\vec{u} \cdot 2\vec{v}$.
We have:
$$\vec{u} \cdot \vec{v} = 2 \cdot 5 + (-2) \cdot 8 = -6;$$
$$\vec{u} \cdot 2\vec{v} = 2\,(\vec{u} \cdot \vec{v}) = 2 \cdot (-6) = -12.$$
The scalar or dot product can be extended so that we can handle vectors in ℂⁿ with
complex components as well. We define the scalar or dot product as
$$\vec{u} \cdot \vec{v} = u_1^* v_1 + u_2^* v_2 + \dots + u_n^* v_n$$
where the asterisk in $u_i^*$ indicates the complex conjugate. We then find that some of the
properties above change, in particular, the following:
• $\vec{u} \cdot \vec{v} = (\vec{v} \cdot \vec{u})^*$
• $(c\vec{u}) \cdot \vec{v} = c^* (\vec{u} \cdot \vec{v})$
• $\vec{u} \cdot (c\vec{v}) = c\,(\vec{u} \cdot \vec{v})$
The magnitude of the vector remains unchanged as $\vec{u} \cdot \vec{u}$ is real.
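The modified properties of the complex dot product can be checked with Python's built-in complex numbers. The vectors and the scalar below are arbitrary illustrative values, and the function name `dot` is an assumption of this sketch; it conjugates the components of its first argument, following the definition above.

```python
def dot(u, v):
    """Dot product with the first argument conjugated, as defined above."""
    return sum(complex(a).conjugate() * b for a, b in zip(u, v))


u = (1 + 2j, 3 - 1j)
v = (2 - 1j, 1j)
c = 2 + 3j

uv = dot(u, v)   # u . v
vu = dot(v, u)   # v . u, the complex conjugate of u . v

# Scaling the first or the second argument by c behaves differently.
scaled_first = dot(tuple(c * a for a in u), v)    # equals conj(c) * (u . v)
scaled_second = dot(u, tuple(c * b for b in v))   # equals c * (u . v)

# u . u is real and non-negative, so it still defines a magnitude.
norm_squared = dot(u, u)
```

Swapping the order of the arguments conjugates the result, which is why the product is no longer commutative for complex vectors.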
The angle between two vectors

In two or three dimensions, the scalar or dot product can be interpreted geometrically
in terms of the angle between two vectors. In two dimensions (ℝ²), we let θ denote the
angle between two vectors $\vec{a} = (a_1, a_2)$ and $\vec{b} = (b_1, b_2)$ as shown in the following
figure. The cosine of the angle θ between these two vectors is given by
$$\cos\theta = \frac{a_1 b_1 + a_2 b_2}{|\vec{a}|\,|\vec{b}|}. \tag{3.1}$$
We note without proof that the same formula works for vectors in three dimensions. In
particular, for $\vec{a} = (a_1, a_2, a_3)$ and $\vec{b} = (b_1, b_2, b_3)$,
$$\theta = \arccos\left( \frac{a_1 b_1 + a_2 b_2 + a_3 b_3}{|\vec{a}|\,|\vec{b}|} \right). \tag{3.2}$$
Proof
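As we can see from panel (a) of the figure above,
$$\cos\theta = \frac{\xi_1}{|\vec{b}|}; \quad \sin\theta = \frac{\xi_2}{|\vec{b}|}; \quad \xi_3 = |\vec{a}| - \xi_1. \tag{3.3}$$
It follows that $\xi_1 = |\vec{b}| \cos\theta$ and $\xi_2 = |\vec{b}| \sin\theta$, as shown in panel (b) of the figure.
We also have
$$|\vec{a} - \vec{b}|^2 = (a_1 - b_1)^2 + (a_2 - b_2)^2 = |\vec{a}|^2 + |\vec{b}|^2 - 2\,(a_1 b_1 + a_2 b_2). \tag{3.4}$$
On the other hand,
$$|\vec{a} - \vec{b}|^2 = \left( |\vec{b}| \sin\theta \right)^2 + \left( |\vec{a}| - |\vec{b}| \cos\theta \right)^2 = |\vec{b}|^2 \left( \sin^2\theta + \cos^2\theta \right) + |\vec{a}|^2 - 2\,|\vec{a}|\,|\vec{b}| \cos\theta = |\vec{a}|^2 + |\vec{b}|^2 - 2\,|\vec{a}|\,|\vec{b}| \cos\theta. \tag{3.5}$$
Comparing equation 3.4 with equation 3.5, we conclude that
$$|\vec{a}|\,|\vec{b}| \cos\theta = a_1 b_1 + a_2 b_2, \tag{3.6}$$
and from this, we can immediately obtain equation 3.1 by solving for cos θ.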
A more conventional way to write equation 3.1 is
$$\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}| \cos\theta. \tag{3.7}$$
Observe that this gives a particularly easy test for orthogonality. In particular, if $\vec{u}$
is perpendicular to $\vec{v}$, we have
$$\vec{u} \cdot \vec{v} = 0 \tag{3.8}$$
since cos θ = cos(π/2) = 0.
Example
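Let $\vec{u} = (3, -1, 2)$ and $\vec{v} = (-4, 0, 2)$. Find the angle θ between $\vec{u}$ and $\vec{v}$.
We have
$$\vec{u} \cdot \vec{v} = |\vec{u}|\,|\vec{v}| \cos\theta,$$
from which we obtain
$$\cos\theta = \frac{\vec{u} \cdot \vec{v}}{|\vec{u}|\,|\vec{v}|} = \frac{-12 + 4}{\sqrt{14}\,\sqrt{20}} = \frac{-4}{\sqrt{70}}.$$
Since $\vec{u} \cdot \vec{v} < 0$, the angle is obtuse:
$$\theta = \arccos\left( \frac{-4}{\sqrt{70}} \right) \approx 2.069 \text{ rad}.$$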
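The angle computation of this example can be reproduced in a few lines. The sketch assumes nonzero real vectors of equal dimension; the function names are illustrative, and the clamping step is a numerical safeguard added here that is not part of the mathematical formula.

```python
import math


def dot(u, v):
    """Dot product of two real vectors of equal dimension."""
    return sum(a * b for a, b in zip(u, v))


def angle_between(u, v):
    """Angle in radians from equation 3.7, for nonzero real vectors."""
    cos_theta = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    # Guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_theta)))


theta = angle_between((3, -1, 2), (-4, 0, 2))
```

Here `theta` is approximately 2.069 rad, and two perpendicular vectors such as (1, 0) and (0, 1) give π/2, in line with the orthogonality test of equation 3.8.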
Vector projection
One important application of the dot product is in finding the extent to which a given
vector is “in the same direction” as a second vector. One example of this is the horizon-
tal and vertical components of the velocity of a projectile thrown into the air. The verti-
cal component is the projection of the initial velocity vector onto the y-axis and
reflects how much of that initial velocity is going in the “up” direction. The following
discussion allows us to consider this question for any two directions, and more for-
mally.
Let $\vec{a}$ and $\vec{b}$ be two-dimensional real vectors in ℝ². Imagine that we shine a light onto
vector $\vec{a}$ from a light source perpendicular to $\vec{b}$. We can think of the projection of $\vec{a}$
onto $\vec{b}$ as the shadow that $\vec{a}$ casts onto vector $\vec{b}$. If we think only of the length of the
shadow, this is called the scalar projection. If we consider the length and direction of
the shadow, we get the vector projection.
It is worth noting that this shadow, this projection, let’s call it $\mathrm{Proj}_{\vec{b}}\,\vec{a}$, might be longer than the vector $\vec{b}$. Recall that one way to define the line containing $\vec{b}$ is $L = \{c\vec{b} : c \in \mathbb{R}\}$, namely all scalar multiples of $\vec{b}$ by a real number $c$. This means that, so far, we know that for some $c$, $\mathrm{Proj}_{\vec{b}}\,\vec{a} = c\vec{b}$.
The other thing we know about the projection of $\vec{a}$ onto $\vec{b}$ is that it is orthogonal to the “light source,” which, in this case, has the direction $\vec{a} - c\vec{b}$. This means that $\vec{b}\cdot(\vec{a} - c\vec{b}) = 0$. Using our knowledge of dot products, we can find a particular value for $c$. We know that
$$\vec{b}\cdot(\vec{a} - c\vec{b}) = 0 = \vec{b}\cdot\vec{a} - c\,\vec{b}\cdot\vec{b}. \tag{3.9}$$
Solving this for $c$, we know that $c = \frac{\vec{b}\cdot\vec{a}}{|\vec{b}|^2}$ and therefore, that the vector projection of $\vec{a}$ onto $\vec{b}$ is
$$\mathrm{Proj}_{\vec{b}}\,\vec{a} = \frac{\vec{b}\cdot\vec{a}}{|\vec{b}|^2}\,\vec{b}. \tag{3.10}$$
The scalar projection of $\vec{a}$ onto $\vec{b}$ is the magnitude of the vector projection of $\vec{a}$ onto $\vec{b}$.
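Example
Find the vector projection of $\vec{u} = 3\vec{i} - 5\vec{j} + 2\vec{k}$ onto $\vec{v} = 7\vec{i} + \vec{j} - 2\vec{k}$.
By formula (3.10), the vector projection of $\vec{u}$ onto $\vec{v}$ is the vector $\vec{w}$, given by
$$\vec{w} = \frac{\vec{v}\cdot\vec{u}}{|\vec{v}|^2}\,\vec{v} = \frac{12}{54}\,(7, 1, -2) = \left(\frac{14}{9}, \frac{2}{9}, -\frac{4}{9}\right).$$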
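Formula (3.10) translates directly into code. The sketch below uses exact fractions from the Python standard library so that the result matches the hand calculation; the function name `vector_projection` is a hypothetical helper introduced only for this illustration, and it assumes integer components.

```python
from fractions import Fraction

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def vector_projection(a, b):
    """Projection of a onto a nonzero vector b, following formula (3.10)."""
    c = Fraction(dot(b, a), dot(b, b))  # c = (b . a) / |b|^2
    return tuple(c * x for x in b)

u = (3, -5, 2)
v = (7, 1, -2)
w = vector_projection(u, v)
print(w)  # (Fraction(14, 9), Fraction(2, 9), Fraction(-4, 9))

# The remainder u - w is orthogonal to v, as required by equation 3.9.
remainder = tuple(x - y for x, y in zip(u, w))
print(dot(v, remainder))  # 0
```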
Inner product
The inner product can be seen as a generalization of the scalar or dot product discussed so far. While the dot or scalar product was specifically defined as $\vec{u}\cdot\vec{v} = u_1 v_1 + u_2 v_2 + \dots + u_n v_n$, the inner product is a function that takes two vectors and assigns a real (or complex) number to them. However, we generally assume that we use complex numbers when working with the inner product. The inner product of two vectors, $\vec{u}$ and $\vec{v}$, is written as
$$\langle \vec{u} \,|\, \vec{v} \rangle,$$
or alternatively, $\langle \vec{u}, \vec{v} \rangle$, $(\vec{u} \,|\, \vec{v})$, or just $\vec{u}\cdot\vec{v}$.
The following properties hold:
1. $\langle \vec{u} | \vec{v} \rangle = \langle \vec{v} | \vec{u} \rangle^*$
2. $\langle \vec{u} | c\vec{v} + d\vec{w} \rangle = c\langle \vec{u} | \vec{v} \rangle + d\langle \vec{u} | \vec{w} \rangle$
3. $\langle c\vec{u} + d\vec{v} | \vec{w} \rangle = c^*\langle \vec{u} | \vec{w} \rangle + d^*\langle \vec{v} | \vec{w} \rangle$
4. $\langle c\vec{u} | d\vec{v} \rangle = c^* d\,\langle \vec{u} | \vec{v} \rangle$
5. Vectors are defined to be orthogonal if $\langle \vec{u} | \vec{v} \rangle = 0$
The norm of a vector is defined as $\|\vec{u}\| = \langle \vec{u} | \vec{u} \rangle^{1/2}$ as a generalization of the magnitude of a vector we have encountered so far. Generally, $\langle \vec{u} | \vec{u} \rangle$ can be both positive and negative. However, in most cases we will encounter vector spaces with $\langle \vec{u} | \vec{u} \rangle \geq 0$ and hence we say that the norm is positive semi-definite.
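The properties above can be checked numerically for the standard inner product on complex vectors, $\langle \vec{u} | \vec{v} \rangle = \sum_i u_i^* v_i$. This is a minimal sketch: the choice of this particular inner product, the function name `inner`, and the sample vectors are assumptions made for illustration.

```python
def inner(u, v):
    """<u|v> = sum(conj(u_i) * v_i); conjugate-linear in the first argument."""
    return sum(x.conjugate() * y for x, y in zip(u, v))

u = [1 + 2j, 3j]
v = [2 - 1j, 1 + 1j]
c = 2 - 3j

# Property 1: <u|v> equals the complex conjugate of <v|u>.
print(inner(u, v) == inner(v, u).conjugate())        # True

# Property 4 with d = 1: <c u|v> = c* <u|v>.
cu = [c * x for x in u]
print(inner(cu, v) == c.conjugate() * inner(u, v))   # True

# <u|u> is real and non-negative here, so the norm is well defined.
print(inner(u, u) == 14)                             # True
```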
Cross product
The cross product (or vector product) is generally only defined in the three-dimensional space $\mathbb{R}^3$. Let $\vec{a} = a_1\vec{i} + a_2\vec{j} + a_3\vec{k}$ and $\vec{b} = b_1\vec{i} + b_2\vec{j} + b_3\vec{k}$. For nonzero vectors in $\mathbb{R}^3$, the cross product of $\vec{a}$ and $\vec{b}$ is a vector that is perpendicular to both of the given vectors.
The cross or vector product is defined as:
$$\vec{a}\times\vec{b} = (a_2 b_3 - b_2 a_3)\,\vec{i} - (a_1 b_3 - b_1 a_3)\,\vec{j} + (a_1 b_2 - b_1 a_2)\,\vec{k}.$$
Note that there is a more elegant definition using determinants of matrices. Then $\vec{a}\times\vec{b}$ is the vector which is orthogonal to the plane spanned by the vectors $\vec{a}$ and $\vec{b}$. In particular, as we have noted, for nonzero vectors, $\vec{a}\times\vec{b}$ is perpendicular to both $\vec{a}$ and $\vec{b}$.
Algebraic properties of the cross product

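For vectors $\vec{u}$, $\vec{v}$, and $\vec{w}$ in $\mathbb{R}^3$, the following properties hold:
1. $\vec{u}\times\vec{v} = -\vec{v}\times\vec{u}$
2. $\vec{u}\times(\vec{v} + \vec{w}) = \vec{u}\times\vec{v} + \vec{u}\times\vec{w}$
3. $\vec{u}\times\vec{u} = \vec{0}$
4. $\vec{u}\cdot(\vec{v}\times\vec{w}) = (\vec{u}\times\vec{v})\cdot\vec{w}$
5. The cross product is not associative, i.e. in general $\vec{u}\times(\vec{v}\times\vec{w}) \neq (\vec{u}\times\vec{v})\times\vec{w}$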
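Example
Let $\vec{u} = \vec{i} - 2\vec{j} + \vec{k}$ and $\vec{v} = 3\vec{i} + \vec{j} - 2\vec{k}$. Find $\vec{u}\times\vec{v}$.
We have
$$\vec{u}\times\vec{v} = \vec{i}\,(4 - 1) - \vec{j}\,(-2 - 3) + \vec{k}\,(1 + 6) = 3\vec{i} + 5\vec{j} + 7\vec{k}.$$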
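The component formula for the cross product is easy to get wrong by a sign, so a small check in code can be useful. The sketch below implements the definition given above for 3-component tuples; `cross` and `dot` are illustrative helper names, not part of the course material.

```python
def cross(a, b):
    """Cross product of two 3-component vectors, using the component formula."""
    a1, a2, a3 = a
    b1, b2, b3 = b
    return (a2 * b3 - b2 * a3,
            -(a1 * b3 - b1 * a3),
            a1 * b2 - b1 * a2)

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

u = (1, -2, 1)
v = (3, 1, -2)
w = cross(u, v)
print(w)                     # (3, 5, 7)
print(dot(w, u), dot(w, v))  # 0 0, so w is perpendicular to both u and v
print(cross(v, u))           # (-3, -5, -7), illustrating u x v = -(v x u)
```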
Geometric properties of the cross product

For vectors $\vec{u}$, $\vec{v}$, and $\vec{w}$ the following properties hold:
1. $\vec{u}\times\vec{v} = \vec{0}$ if, and only if, $\vec{u} = k\vec{v}$ (namely, if the vectors are scalar multiples of one another).
2. The vector $\vec{u}\times\vec{v}$ is orthogonal to both $\vec{u}$ and $\vec{v}$ as shown in the following figure.
3. The magnitude is given by $|\vec{u}\times\vec{v}| = |\vec{u}||\vec{v}|\sin\theta$ and is equal to the area of the parallelogram with adjacent sides $\vec{u}$ and $\vec{v}$.
The above figure illustrates the last property: There, we have two vectors $\vec{a}$ and $\vec{b}$. Let $\theta$ be the angle between them and let $h$ be the height of the parallelogram with adjacent sides of lengths $|\vec{a}|$ and $|\vec{b}|$. The height of the parallelogram is
$$h = |\vec{b}|\sin\theta,$$
and the area of the parallelogram is
$$\text{Area} = h\,|\vec{a}| = |\vec{b}||\vec{a}|\sin\theta.$$
Example
Find a unit vector $\vec{w}$ that is orthogonal to both vectors $\vec{u} = \vec{i} - 4\vec{j} + \vec{k}$ and $\vec{v} = 2\vec{i} + 3\vec{j}$.
We first find the cross product of $\vec{u}$ and $\vec{v}$
$$\vec{u}\times\vec{v} = -3\vec{i} + 2\vec{j} + 11\vec{k},$$
which has magnitude
$$|\vec{u}\times\vec{v}| = \sqrt{(-3)^2 + 2^2 + 11^2} = \sqrt{134}.$$
The cross product gives us the desired direction; we normalize the vector to be a unit vector (length one) by dividing the cross product by its magnitude:
$$\vec{w} = \frac{\vec{u}\times\vec{v}}{|\vec{u}\times\vec{v}|} = -\frac{3}{\sqrt{134}}\,\vec{i} + \frac{2}{\sqrt{134}}\,\vec{j} + \frac{11}{\sqrt{134}}\,\vec{k}.$$
Summary
In this unit, we learned the basics of how to interpret and compute with vectors:
mathematical objects that encode more than one measurement such as speed and
direction or even n-many attributes of the state of a system. We learned to add and
subtract vectors, to multiply vectors by a scalar, and how to compute and use the dot
product (also called the scalar product because the output is a scalar) and the
cross product (whose output is a vector perpendicular to the two vectors forming
the product). The geometrical interpretation of vectors is especially important in
two- and three-dimensional space. Unit vectors are vectors of length one, and the
standard unit vectors for the Cartesian coordinate system are parallel to the coordi-
nate axes and pointing in the positive direction. The concept of a basis plays a cen-
tral role in vector calculus; a basis is a minimal spanning set of vectors, a set of the
smallest size so that every vector in the space can be formed as a linear combina-
tion of the vectors in the basis.
Knowledge Check
learning platform.
Good luck!
Unit 4
Vector Calculus
STUDY GOALS
… how to differentiate and integrate vector functions.
… how to differentiate along an arbitrary line.
… how to integrate over an arbitrary surface.
… what scalar fields and vector fields are and how to visualize them.
… how to use and interpret vector operators on scalar and vector fields.
DL-E-DLMDSAM01-L04
4. Vector Calculus
Introduction
This unit combines the concepts of differentiation, integration, and vector functions. It
introduces the mathematical tools used to study how objects behave in arbitrary coordinate
systems; for example, we will learn how to determine the rate of change
in a function that describes an object moving through three-dimensional space. As a
concrete example, we can imagine a plane flying through the sky where the position of
the plane relative to the observer or a fixed point is described as a time-dependent
vector (x(t), y(t), z(t)). The rate of change of this position vector with respect to time
gives the velocity of the plane at time t and the rate of change of the velocity gives the
acceleration of the plane at time t.
Two important concepts of vector calculus are scalar and vector fields. One example of
a scalar field from our common physical experience is a function that gives the temper-
ature at each point in a room: Given a position, such a function outputs a scalar. To vis-
ualize a vector field, we can consider the speed and direction of water as it flows down
a drain; the vector field associates the vector that describes the velocity (speed and
direction) of the water with each point in space.
For further explanation of this topic, see chapter 15 of the textbook Calculus (Strang,
2017).
4.1 Differentiation of Vectors
Differentiation of Vector Functions
When discussing the concept of the derivative, we considered the rate of change of a
function with respect to changes in its arguments. We can use the example of a car to
see that the rate of change in the distance traveled is the velocity. In this simple example,
we implicitly assume that the car travels along a long, straight road, i.e. we
are not concerned with the direction of the car. In general, the car can change not only
its speed, but also the direction in which it is travelling. One natural way to describe
the position of the car and its velocity is with vectors. In particular, we can consider the
position of the car as a vector function with a scalar argument, time.
More generally, let $\vec{a} = \vec{a}(u)$ be a vector function with scalar argument $u$. In three-dimensional Cartesian space, we can express $\vec{a}$ as $\vec{a}(u) = a_x(u)\,\vec{i} + a_y(u)\,\vec{j} + a_z(u)\,\vec{k}$, where $\vec{i}$, $\vec{j}$, and $\vec{k}$ are the unit vectors in the x-, y-, and z-directions, and the scalar functions $a_x(u)$, $a_y(u)$, and $a_z(u)$ are the components of the vector in each of these directions.
The general idea of the derivative as a limit applies to vector functions as well. We can define the derivative of the vector $\vec{a}(u)$ as
$$\frac{d\vec{a}}{du} = \lim_{\Delta u \to 0} \frac{\vec{a}(u + \Delta u) - \vec{a}(u)}{\Delta u}. \tag{4.1}$$
The following figure illustrates a small change in the vector $\vec{a}(u)$ caused by a small change in the argument $u$. Note that the derivative of a vector function is also a vector function. However, the two vectors are not necessarily parallel, but can point in different directions. In Cartesian coordinates, equation 4.1 can be written as
$$\frac{d\vec{a}}{du} = \frac{da_x}{du}\,\vec{i} + \frac{da_y}{du}\,\vec{j} + \frac{da_z}{du}\,\vec{k}, \tag{4.2}$$
which means that we can differentiate each component of the vector function $\vec{a}(u)$ separately.
Example
The position of a car at time $t$ is given by $\vec{x}(t) = t^2\,\vec{i} + 3t\,\vec{j} + t\,\vec{k}$ in Cartesian coordinates. Find the velocity $\vec{v}(t)$ (and its magnitude, i.e. the speed, $|\vec{v}(t)|$) of the car at time $t = 1$.
First, we find the derivative of $\vec{x}(t)$. The result is
$$\vec{v}(t) = \frac{d\vec{x}(t)}{dt} = 2t\,\vec{i} + 3\,\vec{j} + 1\,\vec{k}. \tag{4.3}$$
At time $t = 1$, we know that the velocity $\vec{v}(1) = 2\vec{i} + 3\vec{j} + 1\vec{k}$. Remembering that speed is the magnitude of the velocity vector, we obtain the norm of the vector
$$|\vec{v}(1)| = \sqrt{2^2 + 3^2 + 1^2} = \sqrt{14}.$$
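Equation 4.2 says that a vector function is differentiated component by component, which suggests a simple numerical check. The sketch below approximates the derivative with a central difference; the step size `h` and the helper names `position` and `derivative` are assumptions made for illustration, not a prescribed method.

```python
import math

def position(t):
    """Position x(t) = t^2 i + 3t j + t k from the example."""
    return (t * t, 3 * t, t)

def derivative(f, t, h=1e-6):
    """Central-difference approximation of df/dt, applied component by component."""
    forward, backward = f(t + h), f(t - h)
    return tuple((p - q) / (2 * h) for p, q in zip(forward, backward))

v1 = derivative(position, 1.0)
speed = math.sqrt(sum(c * c for c in v1))
print([round(c, 6) for c in v1])  # [2.0, 3.0, 1.0]
print(round(speed, 6))            # 3.741657, i.e. sqrt(14)
```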
Rules of Differentiation of Vector Functions
Just as for scalar functions, it is useful to make note of rules of differentiation that allow us to avoid using the definition whenever they apply. Assume that $\vec{a}$ and $\vec{b}$ are differentiable vector functions and that $\phi$ is a differentiable scalar function. Then we can use equation 4.1 and prove the following useful rules:
$$\frac{d(\phi\vec{a})}{du} = \phi\,\frac{d\vec{a}}{du} + \frac{d\phi}{du}\,\vec{a} \tag{4.4}$$
$$\frac{d(\vec{a}\cdot\vec{b})}{du} = \vec{a}\cdot\frac{d\vec{b}}{du} + \frac{d\vec{a}}{du}\cdot\vec{b} \tag{4.5}$$
$$\frac{d(\vec{a}\times\vec{b})}{du} = \vec{a}\times\frac{d\vec{b}}{du} + \frac{d\vec{a}}{du}\times\vec{b}. \tag{4.6}$$
Example
Given a point particle circling a center with constant speed and fixed radius, show that for any time, $t$, the velocity vector is perpendicular to the position vector.
Let $\vec{r}(t)$ denote the position function, measured from the center of the circle. The point particle is always the same distance from the center, and so $\vec{r}\cdot\vec{r} = r^2$ is a constant. Note that, by the problem, $\vec{v}(t)$ has a constant magnitude, so $\vec{v}\cdot\vec{v} = v^2$. Hence, since the derivative of a constant is zero, rule 4.5 gives
$$\frac{d}{dt}(\vec{r}\cdot\vec{r}) = \vec{r}\cdot\vec{v} + \vec{v}\cdot\vec{r} = 2\,\vec{r}\cdot\vec{v} = 0,$$
which implies $\vec{r} \perp \vec{v}$.
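The result of this example can be confirmed numerically for a concrete circular motion. In the sketch below, the radius `R`, the angular speed `OMEGA`, and the sample times are hypothetical values chosen only for illustration.

```python
import math

R, OMEGA = 2.0, 0.5  # hypothetical radius and angular speed

def r(t):
    """Position on a circle of radius R centred at the origin."""
    return (R * math.cos(OMEGA * t), R * math.sin(OMEGA * t))

def v(t):
    """Velocity, obtained by differentiating each component of r(t)."""
    return (-R * OMEGA * math.sin(OMEGA * t), R * OMEGA * math.cos(OMEGA * t))

for t in (0.0, 1.3, 4.0):
    rt, vt = r(t), v(t)
    r_dot_v = rt[0] * vt[0] + rt[1] * vt[1]
    speed = math.hypot(*vt)
    # r . v vanishes up to rounding, and the speed stays at R * OMEGA.
    print(f"t={t}: r.v={r_dot_v:.1e}, speed={speed:.3f}")
```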
Vector functions with multiple scalar arguments

In the case of multivariable scalar functions, we used partial derivatives to express the rate of change of a function with respect to a single variable. Analogously, we can extend the idea of partial derivatives to vector functions that depend on more than one variable. Suppose that $\vec{a}(u_1, u_2, \dots, u_n)$ is a vector function with scalar arguments $u_1, \dots, u_n$. Then, to find
$$\frac{\partial\vec{a}}{\partial u_i}, \tag{4.7}$$
we treat all variables $u_j$ where $j \neq i$ as constant and differentiate $\vec{a}$ as we vary only $u_i$.
Using partial derivatives, one can prove a version of the chain rule in order to compute derivatives of vector functions $\vec{a}$ whose arguments $u_1, u_2, \dots, u_n$ are themselves functions of some variables $v_i$, namely $u_j(v_1, v_2, \dots, v_m)$, and get
$$\frac{\partial\vec{a}}{\partial v_i} = \frac{\partial\vec{a}}{\partial u_1}\frac{\partial u_1}{\partial v_i} + \frac{\partial\vec{a}}{\partial u_2}\frac{\partial u_2}{\partial v_i} + \dots + \frac{\partial\vec{a}}{\partial u_n}\frac{\partial u_n}{\partial v_i}. \tag{4.8}$$
4.2 Integration of Vectors
Integration of Vector Functions
We can view integration of vector functions analogously to integration of functions of a single variable. If we consider the vector function $\vec{a}(u)$ as the derivative of some function $\vec{A}$, namely $\vec{a} = d\vec{A}(u)/du$, the integral, viewed as the antiderivative, can be expressed as
$$\int \vec{a}(u)\,du = \vec{A}(u) + \vec{b} \tag{4.9}$$
where $\vec{b}$ is an arbitrary constant vector.
The definite integral is given by
$$\int_{u_1}^{u_2} \vec{a}(u)\,du = \vec{A}(u_2) - \vec{A}(u_1). \tag{4.10}$$
Naturally, just as the antiderivative of a scalar function is a scalar function, the antiderivative of a vector function is a vector function and its constant of integration is a vector constant.
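As a short worked illustration of equation 4.10, we can integrate the velocity of the car from the earlier example, $\vec{v}(t) = 2t\,\vec{i} + 3\,\vec{j} + \vec{k}$, from $t = 0$ to $t = 1$. The choice of this function and these limits is ours; each component is integrated separately.

```latex
\int_{0}^{1} \vec{v}(t)\,dt
  = \left[\, t^{2}\,\vec{i} + 3t\,\vec{j} + t\,\vec{k} \,\right]_{0}^{1}
  = \vec{i} + 3\,\vec{j} + \vec{k}
```

This agrees with the displacement $\vec{x}(1) - \vec{x}(0)$ computed from the position function $\vec{x}(t) = t^2\,\vec{i} + 3t\,\vec{j} + t\,\vec{k}$.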
Integration along paths

Previously, we integrated functions along the axes, for example $\int f(x)\,dx$ along the x-axis or, in the multivariate case, $\iint f(x, y)\,dx\,dy$ along first the x-axis and then the y-axis. In general, however, integrals can be performed along an arbitrary path, not only along one of the coordinate axes. An intuitive example is the physical definition of the work performed when applying a force along a path. In the simplest case, the work is given as $W = \vec{F}\cdot\vec{r}$, where $\vec{F}$ is the force along a defined path. The work $W$ itself is a scalar; however, the path $\vec{r}$ (not only along the axes) and the force $\vec{F}$ are in general vectors, as we can apply a force of some strength $|\vec{F}|$ in any direction as well as move in any direction. Hence, the definition of the work $W$ done by a force $\vec{F}$ as a particle travels along path $C$, becomes
$$W = \int_C \vec{F}\cdot d\vec{r} \tag{4.11}$$
as illustrated in the following figure. Only the component of the force that is parallel to the line tangent to the curve contributes to the work done moving an object along the curve $C$, hence the work $W$ is given by the scalar product of the vectors for the force and the parametrization of the curve. The path $\vec{r}(t)$ parameterizes the way the force is applied, e.g. in Cartesian coordinates
$$\vec{r}(t) = (x(t), y(t), z(t)) \tag{4.12}$$
where we have included a dependency on the time $t$ to indicate where the object on which the force is applied is at a given moment. The differential $d\vec{r}$ is then given by
$$d\vec{r}(t) = (dx, dy, dz) = \left(\frac{dx}{dt}\,dt, \frac{dy}{dt}\,dt, \frac{dz}{dt}\,dt\right) \tag{4.13}$$
as $x$, $y$, $z$ implicitly depend on $t$ and are typically parameterized as a function of $t$ as shown in equation 4.12.
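To see equations 4.11 to 4.13 at work, we can approximate a line integral by cutting the path into many short segments and summing the scalar products of force and displacement. The following is a minimal sketch: the force field $\vec{F} = (-y, x, 0)$, the circular path, and the number of segments `n` are hypothetical choices for illustration. For this field and one full turn around the unit circle, the exact value of the work is $2\pi$.

```python
import math

def force(x, y, z):
    """Hypothetical force field F = (-y, x, 0), chosen only for illustration."""
    return (-y, x, 0.0)

def path(t):
    """Unit circle in the xy-plane, traversed once for t in [0, 2*pi]."""
    return (math.cos(t), math.sin(t), 0.0)

def work(force, path, t0, t1, n=2000):
    """Approximate W = integral of F . dr, using dr ~ r(t_{k+1}) - r(t_k)."""
    total = 0.0
    dt = (t1 - t0) / n
    for k in range(n):
        start = path(t0 + k * dt)
        end = path(t0 + (k + 1) * dt)
        mid = path(t0 + (k + 0.5) * dt)
        f = force(*mid)  # force evaluated at the midpoint of the segment
        total += sum(fc * (e - s) for fc, s, e in zip(f, start, end))
    return total

W = work(force, path, 0.0, 2 * math.pi)
print(round(W, 4))  # 6.2832, close to 2*pi
```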
Integration over surfaces

Just as we extended integration along a coordinate axis to integration over arbitrary curves in space, we can also extend double integration to integration over arbitrary surfaces. For a fixed curve, a single free parameter can be used to describe the movement along the curve. In the case of a surface, we need two free variables in order to parametrize the surface. For example, we could parameterize by
$$\vec{r}(u, v) = \vec{r}_0 + u\,\vec{a} + v\,\vec{b} \tag{4.14}$$
where $\vec{r}_0$ is a fixed point in the surface, “anchoring” the surface in space. Linear combinations of vectors $\vec{a}$ and $\vec{b}$ span the surface. In the Cartesian coordinate system, $\vec{i}$ and $\vec{j}$ are orthogonal, so they create a rectangular area over which to integrate. In general, the small surface area generated by $\vec{a}$ and $\vec{b}$ will be a parallelogram and
$$d\vec{A} = \frac{\partial\vec{r}(u, v)}{\partial u} \times \frac{\partial\vec{r}(u, v)}{\partial v}\,du\,dv. \tag{4.15}$$
Therefore, the integral over an arbitrary surface is given by
$$\iint_A d\vec{A} = \iint_A \frac{\partial\vec{r}(u, v)}{\partial u} \times \frac{\partial\vec{r}(u, v)}{\partial v}\,du\,dv. \tag{4.16}$$
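As an illustration of the area element in equation 4.15, the sketch below sums the magnitude $|\partial\vec{r}/\partial u \times \partial\vec{r}/\partial v|\,du\,dv$ over a parameter grid to estimate the area of a surface. The unit sphere, the grid size `n`, and the finite-difference step `h` are assumptions made for this example; the exact area of the unit sphere is $4\pi$.

```python
import math

def cross(a, b):
    """Cross product of two 3-component vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def sphere(u, v):
    """Unit sphere: u is the polar angle, v the azimuthal angle."""
    return (math.sin(u) * math.cos(v), math.sin(u) * math.sin(v), math.cos(u))

def surface_area(r, u_range, v_range, n=200, h=1e-5):
    """Midpoint-rule estimate of the double integral of |dr/du x dr/dv| du dv."""
    (u0, u1), (v0, v1) = u_range, v_range
    du, dv = (u1 - u0) / n, (v1 - v0) / n
    total = 0.0
    for i in range(n):
        u = u0 + (i + 0.5) * du
        for j in range(n):
            v = v0 + (j + 0.5) * dv
            # Partial derivatives of r by central differences.
            r_u = [(p - q) / (2 * h) for p, q in zip(r(u + h, v), r(u - h, v))]
            r_v = [(p - q) / (2 * h) for p, q in zip(r(u, v + h), r(u, v - h))]
            total += math.sqrt(sum(c * c for c in cross(r_u, r_v))) * du * dv
    return total

area = surface_area(sphere, (0.0, math.pi), (0.0, 2 * math.pi))
print(area, 4 * math.pi)  # the two values agree to about three decimal places
```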
4.3 Scalar and Vector Fields

Although the phrases “scalar field” and “vector field” may sound intimidating, we are quite familiar with these concepts from our everyday life even if we don’t know them by these names. A scalar field ϕ(x, y) is a function that assigns a scalar value to each point in a two-dimensional space. Note that we can define a scalar field on a space of any dimension, not only dimension two.
Scalar Field: A scalar field relates a single (scalar) value to every point in a given area A.
As an example, consider a room with a radiator in a corner as shown in the following figure. As the radiator heats up the room, the temperature is different at each point of the room. If we had a thermometer fixed at a wall, we would measure a particular value, a scalar, that changes with time. That is the value of the scalar field at that one point. If we take the thermometer in our hand and walk around the room, the values change as we walk. We can potentially obtain a different measurement of the temperature at each position (x, y), meaning that the function, the scalar field, that represents the temperature in the room, can have a different value at each position. In this example, we could consider the temperature in the room at more than one time, suggesting that our scalar field would be a function ϕ(x, y, z, t) that depends on both position and time.
From our everyday experiences, we are also familiar with situations where each point in a given region is associated with a movement or force in a given direction with some strength. The following figure shows two examples. On the left, a paddle is pushed through the water and, as the paddle leaves the water, the water moves with both direction and magnitude (speed). The picture on the right shows water draining from a sink. As the water drains out of the sink, it forms a vortex.
Intuitively, it is clear that the direction and speed of the water change with position. We can introduce the vector field $\vec{V} = \vec{V}(x, y, z, t)$, which assigns a vector representing the velocity of the water to each position (x, y, z) at a given time t. This particular example is dependent on four variables related to physical quantities such as position and time, but we can generalize the ideas of both scalar and vector fields to arbitrarily many variables, such as $\vec{V} = \vec{V}(x_1, x_2, \dots, x_n)$.
Vector Field: A vector field relates a vector to every point in a given area A.
Compared to the visualizations of a scalar field as shown in the right part of the figure
“Visualization of a Scalar Field,” the visualization of vector fields becomes more com-
plex. In the case of scalar fields, we only needed to associate a single value of the sca-
lar field to each coordinate. In the two-dimensional example, we used a single number
illustrated by an appropriate color scale to visualize the scalar field. In case of a vector
field, we need to add information about strength and direction at each point. For exam-
ple, we can place little arrows on a regular grid as shown in the next figure. At each
point, the arrow points in the direction of the vector field and its length indicates the
strength at this point.
Additionally, we can use color information to highlight the strength of the vector field.
Alternatively, we can use a stream plot as shown in the following figure where the lines
show how the vector field changes as a function of position and how we might observe
a particle as it follows the vector field. The direction is indicated by little arrows on the
field lines or streams and we can either vary the density or the color of the stream
lines to indicate the strength of the vector field (or both).
4.4 Vector Operations
The ∇ Operator
We now introduce the vector operator ∇, which is often called nabla or del, and is used in applications of derivatives of vector fields as well as many applications in physical and information sciences. In Cartesian coordinates, the operator is defined to take partial derivatives coordinate-wise, namely
$$\nabla \equiv \frac{\partial}{\partial x}\,\vec{i} + \frac{\partial}{\partial y}\,\vec{j} + \frac{\partial}{\partial z}\,\vec{k} \tag{4.17}$$
$$\equiv \left(\frac{\partial}{\partial x}, \frac{\partial}{\partial y}, \frac{\partial}{\partial z}\right) \tag{4.18}$$
where $x$, $y$, and $z$ are the Cartesian coordinates, and $\vec{i}$, $\vec{j}$, and $\vec{k}$ are the standard unit vectors along the coordinate axes. Correspondingly, the coordinate-wise second partial derivatives can be obtained by repeated application of the del operator
$$\Delta = \nabla^2 = \nabla\cdot\nabla = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}. \tag{4.19}$$
The resulting operator is called the Delta or Laplace operator.
Gradient of a Scalar Field
One of the most important applications of the nabla operator is determining the rate of change of a scalar field ϕ in a given direction, called the directional derivative. In the example of the figure “Visualization of a Scalar Field,” we discussed the scalar field of the temperature in a room with a radiator in a corner. We now want to determine how much the temperature changes if we move from one point to another. Starting at a point $P(x_0, y_0, z_0)$, we move a small distance away from $P$ along the line $\vec{g}(s, \vec{a}) = \vec{x} + s\,\vec{a}$ in the direction of the vector $\vec{a}$, where $\vec{x}$ is the position vector of $P$ and $s$ is a scalar. The value of the field at the new point is then $\phi(x_0 + s a_x, y_0 + s a_y, z_0 + s a_z)$. The rate of change of $\phi$ in the direction of $\vec{a}$, which is called the directional derivative and is frequently denoted $\nabla_{\vec{a}}\phi$, is then
$$\frac{d\phi(s)}{ds} = \frac{\partial\phi}{\partial x}\frac{dx}{ds} + \frac{\partial\phi}{\partial y}\frac{dy}{ds} + \frac{\partial\phi}{\partial z}\frac{dz}{ds} \tag{4.20}$$
$$= \frac{\partial\phi}{\partial x}\,a_x + \frac{\partial\phi}{\partial y}\,a_y + \frac{\partial\phi}{\partial z}\,a_z \tag{4.21}$$
$$= \vec{a}\cdot\nabla\phi. \tag{4.22}$$
The quantity ∇ϕ is called the gradient of a scalar field ϕ and describes the direction of steepest ascent from any point in the field. The quantity $\frac{d\phi(s)}{ds} = \vec{a}\cdot\nabla\phi$ describes the rate of change of the field ϕ for some distance s in a given direction $\vec{a}$.
A vector field $\vec{V}$ that is the gradient of some scalar field ϕ is called conservative and the corresponding scalar field ϕ is called the potential of this conservative field.
Example
Find the gradient of the scalar field $\phi = x^2 y z^4$.
Applying the nabla operator, we obtain
$$\nabla\phi = 2xyz^4\,\vec{i} + x^2 z^4\,\vec{j} + 4x^2 y z^3\,\vec{k}.$$
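A gradient computed by hand can be cross-checked against finite differences, and equation 4.22 then gives the directional derivative as a dot product. In the sketch below, the evaluation point `p`, the unit direction `a`, and the step size `h` are hypothetical values chosen for illustration.

```python
def phi(x, y, z):
    """Scalar field from the example: phi = x^2 * y * z^4."""
    return x**2 * y * z**4

def grad_phi(x, y, z):
    """Analytic gradient obtained in the example."""
    return (2 * x * y * z**4, x**2 * z**4, 4 * x**2 * y * z**3)

def numerical_gradient(f, point, h=1e-6):
    """Central differences along each coordinate axis."""
    grad = []
    for i in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[i] += h
        minus[i] -= h
        grad.append((f(*plus) - f(*minus)) / (2 * h))
    return tuple(grad)

p = (1.0, 2.0, -1.0)
exact = grad_phi(*p)
approx = numerical_gradient(phi, p)
print(exact)  # (4.0, 1.0, -8.0)
print([round(g, 5) for g in approx])

# Directional derivative along the unit vector a, as in equation 4.22.
a = (0.6, 0.8, 0.0)
directional = sum(ai * gi for ai, gi in zip(a, exact))
print(directional)  # approximately 3.2
```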
Divergence of a Vector Field
The scalar product of the nabla operator with a vector field $\vec{V}$ is called the divergence of the vector field:
$$\nabla\cdot\vec{V} = \operatorname{div}\vec{V}. \tag{4.23}$$
The divergence is a measure of the flux of a vector field at any given point and has important applications in physics. To illustrate this, imagine water flowing into one end of a pipe and out the other. The flow of the water can be described by a vector field. In the simplest case, there are no sources of additional water nor any drains that would alter the total volume of water. In this case, the divergence of the vector field is zero. If, however, another pipe is attached between the entrance and exit of the original pipe, the total volume of water that exits the pipe could change. If the additional pipe adds water to the system, the divergence of the vector field is greater than zero. If the additional pipe drains water from the system, the divergence will be less than zero. The figure “Visualization of a Vector Field Using Arrows to Indicate Direction and Strength of the Field at Each Position” is a representation of positive divergence. We can imagine a source in the center and the field flows outward.
Flux: The flux of a vector field can be interpreted as how much the field acts like a “source” or “drain” at a given point.
Another physical example is an electric point charge from which field lines extend to infinity.
Curl of a Vector Field
The curl is the cross product of the nabla operator and a vector field
$$\nabla\times\vec{V} = \operatorname{curl}\vec{V} \tag{4.24}$$
The curl describes the “whirliness” of a vector field. For example, if the vector field $\vec{V}$ describes the flow of water after a paddle leaves the water as seen in the left part of the figure “Illustrations of Vector Fields in Every Day Life,” the curl of the vector field $\vec{V}$ is related to the vortices left behind by the paddles. More specifically, the curl describes the angular velocity of the water in the area around any point. If we were to insert a small probe such as a sheet of plastic at various points around the vortex left behind by the paddle, this probe would tend to rotate in those regions with non-vanishing curl, i.e. where $\nabla\times\vec{V} \neq \vec{0}$.
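The contrast between divergence and curl becomes concrete when both are evaluated for two simple fields: one that flows outward from a source and one that rotates about an axis. The sketch below writes out the Cartesian components, $\nabla\cdot\vec{V} = \partial V_x/\partial x + \partial V_y/\partial y + \partial V_z/\partial z$ and the corresponding cross-product components of the curl, and approximates the partial derivatives by central differences. The two fields, the evaluation point, and the step size are hypothetical choices for illustration.

```python
def partial(f, point, axis, h=1e-5):
    """Central-difference partial derivative of a vector field f along one axis."""
    plus = list(point)
    minus = list(point)
    plus[axis] += h
    minus[axis] -= h
    return tuple((a - b) / (2 * h) for a, b in zip(f(*plus), f(*minus)))

def divergence(f, point):
    """dVx/dx + dVy/dy + dVz/dz."""
    return sum(partial(f, point, i)[i] for i in range(3))

def curl(f, point):
    """(dVz/dy - dVy/dz, dVx/dz - dVz/dx, dVy/dx - dVx/dy)."""
    dx, dy, dz = (partial(f, point, i) for i in range(3))
    return (dy[2] - dz[1], dz[0] - dx[2], dx[1] - dy[0])

def source(x, y, z):
    """Field flowing outward from the origin."""
    return (x, y, z)

def vortex(x, y, z):
    """Rigid rotation about the z-axis."""
    return (-y, x, 0.0)

p = (0.3, -0.7, 1.2)
# Source field: divergence close to 3, curl close to (0, 0, 0).
print(divergence(source, p), curl(source, p))
# Vortex field: divergence close to 0, curl close to (0, 0, 2).
print(divergence(vortex, p), curl(vortex, p))
```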
Summary
The concepts of differentiation and integration can be extended to vector functions

in a straightforward way. In addition to the familiar integration along the coordinate
axis, line integrals can be used to integrate along arbitrary curves. In a similar way,
integrals can be defined over arbitrary surfaces. Scalar and vector fields play an
important role in many applications and vector operations based on the nabla
operator (∇) are used to define the gradient, divergence, and curl.
Knowledge Check
learning platform.
Good luck!
Unit 5
Matrices and Vector Spaces
STUDY GOALS
… what matrices and special matrices are.
… how to perform calculations with matrices.
… how to compute the determinant, trace, and transpose of matrices, as well as the
complex and Hermitian conjugates of matrices.
… how to determine eigenvalues and eigenvectors of matrices.
… how to diagonalize matrices and change bases.
… what tensors are and how to perform basic calculations with them.
DL-E-DLMDSAM01-L05
5. Matrices and Vector Spaces
Introduction
Matrices are arrays of numbers that play an important role in many applications of
mathematics, from solving systems of linear equations to quantum mechanics. Many
problem settings can be reformulated efficiently as matrix equations and then solved
in a systematic way. This unit introduces basic matrix algebra and operators. Diagonali-
zation of matrices is an important skill that facilitates the changes of a coordinate sys-
tem or the transformation of one set of variables into another. Choosing a specific set
of new variables can make the solution to the original problem much easier.
Tensors formalize and extend the concepts of scalars, vectors, and matrices. This unit
introduces tensors and the basic rules of how to work with them.
To gain further understanding of this topic, see chapters 2 to 4 of Mathematics for
Machine Learning (Deisenroth et al., 2020), chapter 11 of Calculus (Strang, 2017), and
chapters 1 and 2 of Advanced Calculus (Loomis & Sternberg, 2014).
5.1 Basic Matrix Algebra

Scalars that represent single numbers are used in many physical applications. They can, for example, represent a single measurement such as the temperature at a given point. We have also previously encountered vectors that are characterized by a magnitude and direction in a single mathematical object, such as the velocity of a car. Compared to scalars, vectors have several components and we can represent a vector as follows: $\vec{v} = (v_x, v_y, v_z) = v_i$. In other words, we can use a single index, $i$, to identify the components. Matrices are rectangular arrays of numbers. Each entry can be identified by its row (numbered top to bottom) and column (numbered left to right). Using these two indices, we write
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} = (a_{ij}). \tag{5.1}$$
This n × m matrix has n rows and m columns. Each entry can be identified using the
two subscripts i and j. Matrices play an important part in many applications, for exam-
ple, to describe a rotation of a vector or properties of materials such as elasticity. In
many cases, matrices are a convenient way to perform operations on vectors. Some
important types of matrices are as follows:
• a square matrix in which m = n,
• a symmetric matrix for which $a_{ij} = a_{ji}$,
• a diagonal matrix in which all off-diagonal elements are zero, and
• the identity matrix, a square diagonal matrix in which $a_{ii} = 1$ for all i, and all other elements are zero.
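Examples
Matrix A is symmetric
$$A = \begin{pmatrix} 9 & 4 & 5 \\ 4 & 6 & 3 \\ 5 & 3 & 7 \end{pmatrix},$$
matrix B is diagonal
$$B = \begin{pmatrix} 9 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
and matrix $I_3$ is the 3 × 3 identity matrix
$$I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$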
Calculating with Matrices
We adopt the convention that capital letters such as A denote matrices, small letters
such as $a_{ij}$ denote the entries of a matrix, and Greek letters such as α denote scalars.
The following rules apply when calculating with matrices:
1. $A = B \Leftrightarrow \forall i, j: (a_{ij} = b_{ij})$
Two matrices are equal if, and only if, the matrices have the same dimension and identical entries.
2. $A + B = C \Leftrightarrow \forall i, j: (c_{ij} = a_{ij} + b_{ij})$
Matrix addition and subtraction are performed entry-wise. Matrix addition is commutative and associative: A + B = B + A and A + (B + C) = (A + B) + C.
3. $B = \alpha A \Leftrightarrow \forall i, j: (b_{ij} = \alpha a_{ij})$
Multiplying a matrix by a scalar multiplies each entry by said scalar.
4. Matrix multiplication is only defined when the number of columns in the left matrix is equal to the number of rows in the right matrix. The resulting matrix, the product, has the same number of rows as the left matrix and the same number of columns as the right matrix:
$$A_{n\times k}\,B_{k\times m} = C_{n\times m}. \tag{5.2}$$
The calculation is performed as row $i$ multiplied by column $j$:
$$AB = C, \quad c_{ij} = \sum_k a_{ik} b_{kj}.$$
Matrix multiplication is distributive, A(B + C) = AB + AC and (B + C)A = BA + CA, but is not, in general, commutative. In particular, there are many cases in which AB ≠ BA holds. Indeed, not only are the products frequently unequal, but just because AB is defined, this doesn’t mean that BA is defined; both products are defined when, for example, A and B are square matrices of the same dimension. The quantity [A, B] ≡ AB − BA is called the commutator.
Commutator: The commutator between matrices becomes very important in many applications such as quantum mechanics.
Example
Evaluate the matrix products AB and BA where
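$$A = \begin{pmatrix} 3 & 2 & -1 \\ 0 & 3 & 2 \\ 1 & -3 & 4 \end{pmatrix}$$
and
$$B = \begin{pmatrix} 2 & -2 & 3 \\ 1 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix}.$$
For AB,
$$\begin{pmatrix} 3 & 2 & -1 \\ 0 & 3 & 2 \\ 1 & -3 & 4 \end{pmatrix} \begin{pmatrix} 2 & -2 & 3 \\ 1 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix} = \begin{pmatrix} 5 & -6 & 8 \\ 9 & 7 & 2 \\ 11 & 3 & 7 \end{pmatrix},$$
and for BA,
$$\begin{pmatrix} 2 & -2 & 3 \\ 1 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix} \begin{pmatrix} 3 & 2 & -1 \\ 0 & 3 & 2 \\ 1 & -3 & 4 \end{pmatrix} = \begin{pmatrix} 9 & -11 & 6 \\ 3 & 5 & 1 \\ 10 & 9 & 5 \end{pmatrix}.$$
Note that AB ≠ BA.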
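The rule $c_{ij} = \sum_k a_{ik} b_{kj}$ can be written out as a short program, which also makes it easy to compute the commutator for this example. The sketch below stores matrices as lists of rows; the function name `matmul` is our own choice for illustration.

```python
def matmul(A, B):
    """Product of an n x k matrix A and a k x m matrix B: c_ij = sum_k a_ik * b_kj."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[3, 2, -1], [0, 3, 2], [1, -3, 4]]
B = [[2, -2, 3], [1, 1, 0], [3, 2, 1]]

AB = matmul(A, B)
BA = matmul(B, A)
commutator = [[x - y for x, y in zip(row_ab, row_ba)]
              for row_ab, row_ba in zip(AB, BA)]
print(AB)          # [[5, -6, 8], [9, 7, 2], [11, 3, 7]]
print(BA)          # [[9, -11, 6], [3, 5, 1], [10, 9, 5]]
print(commutator)  # [[-4, 5, 2], [6, 2, 1], [1, -6, 2]]
```

Calling `matmul` with incompatible shapes, such as a 2 × 3 matrix times a 2 × 3 matrix, raises a `ValueError`, mirroring the requirement in rule 4.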
5.2 Determinant, Trace, Transpose, Complex, and Hermitian Conjugates
The Transpose of a Matrix and the Hermitian Conjugate Matrix
In the following section, we will discuss matrix operations that are common in applications that involve calculating with matrices. Let $A_{n\times m}$ be a matrix consisting of n rows and m columns. In some calculations, it is useful to interchange the rows and columns of a matrix. The resulting matrix is called the transpose of A, which is denoted by $A^T$. Note that $A^T$ is an m × n matrix.
Example
Find the transpose of
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}.$$
The transpose is
$$A^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}.$$
The transpose of a product of two matrices is the product of the transposed matrices in reverse order, namely
$$(AB)^T = B^T A^T.$$
In convincing ourselves of this, it is useful to consider the dimensions of the matrices involved.
For matrices with complex number entries $a \pm bi$, we find the complex conjugate matrix, $A^*$, by taking the complex conjugate of each entry of A,
$$(A^*)_{ij} = (a_{ij})^*, \tag{5.3}$$
where $a_{ij}$ is the element of matrix A in row i and column j.
Complex Conjugate: The complex conjugate of $a \pm bi$ is $a \mp bi$.
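Example
Find the complex conjugate matrix of
$$A = \begin{pmatrix} 1 & 2 & 3i \\ 4 + i & 5 & 6 \end{pmatrix}.$$
The complex conjugate matrix is
$$A^* = \begin{pmatrix} 1 & 2 & -3i \\ 4 - i & 5 & 6 \end{pmatrix}.$$
The Hermitian conjugate of a matrix A is the transpose of the complex conjugate and is denoted by $A^\dagger$.
Example
Find the Hermitian conjugate of
$$A = \begin{pmatrix} 1 & 2 & 3i \\ 4 + i & 5 & 6 \end{pmatrix}.$$
The Hermitian conjugate is
$$A^\dagger = \begin{pmatrix} 1 & 4 - i \\ 2 & 5 \\ -3i & 6 \end{pmatrix}.$$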
Note that the Hermitian conjugate (or the transpose matrix in the case of real-valued matrices) is related to the inner (or dot) product of vectors. Let the two vectors $\vec{a}$ and $\vec{b}$ be given by
$$\vec{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}, \text{ and } \vec{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}.$$
Here, the vectors are represented by column matrices. If we take the Hermitian conjugate of $\vec{a}$ (resulting in a row matrix) and multiply it with $\vec{b}$ we obtain
$$(a_1^*, a_2^*, \dots, a_n^*) \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} = \sum_{i=1}^{n} a_i^* b_i = a_i^* b_i = \vec{a}^\dagger\vec{b}, \tag{5.4}$$
which is the inner product $\langle \vec{a} | \vec{b} \rangle$. In the case of real numbers, $\vec{a}^\dagger$ becomes $\vec{a}^T$ and we use the more familiar notation for the inner product $\vec{a}\cdot\vec{b}$.
In this derivation, we have used the summation convention that all indices occurring twice are summed over without having to write the summation sign (∑) explicitly. In this case, we use the notation $a_i^* b_i$ as a short-hand version for $\sum_{i=1}^{n} a_i^* b_i$.
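The connection between the Hermitian conjugate and the inner product in equation 5.4 can be traced step by step in code. The sketch below stores matrices as lists of rows and uses Python's built-in complex numbers; the function name `hermitian_conjugate` and the sample column vectors `a` and `b` are hypothetical choices for illustration.

```python
def hermitian_conjugate(M):
    """Transpose of the complex conjugate of a matrix given as a list of rows."""
    return [[M[i][j].conjugate() for i in range(len(M))]
            for j in range(len(M[0]))]

# The 2 x 3 matrix from the example above; its Hermitian conjugate is 3 x 2.
A = [[1, 2, 3j], [4 + 1j, 5, 6]]
A_dagger = hermitian_conjugate(A)
print(len(A_dagger), len(A_dagger[0]))  # 3 2

# Column matrices a and b; a_dagger is a 1 x 2 row matrix.
a = [[1 + 1j], [2]]
b = [[3], [1 - 2j]]
a_dagger = hermitian_conjugate(a)
inner = sum(a_dagger[0][i] * b[i][0] for i in range(len(b)))
print(inner)  # (5-7j)
```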
Trace of a Matrix
The trace of a square matrix is one of several characteristics associated with square
matrices where we express properties of the matrix with a single number. The trace of a
square matrix is the sum of the diagonal elements:
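$$\mathrm{Tr}(A) = a_{11} + a_{22} + \cdots + a_{nn} = \sum_{i=1}^{n} a_{ii}. \tag{5.5}$$
Some properties of the trace and the sum, difference, and product of matrices are
$$\mathrm{Tr}(A \pm B) = \mathrm{Tr}(A) \pm \mathrm{Tr}(B) \tag{5.6}$$
and
$$\mathrm{Tr}(AB) = \mathrm{Tr}(BA). \tag{5.7}$$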
Example
Find the trace of
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}.$$
The trace is Tr(A) = 1 + 5 + 9 = 15.
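The product property (5.7) is worth checking once by hand or by machine, because the products AB
and BA themselves generally differ. A minimal sketch, assuming a second matrix B with arbitrarily
chosen entries and the hypothetical helpers `trace` and `matmul`:

```python
def trace(M):
    """Sum of the diagonal elements of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
B = [[0, 1, 0],
     [2, 0, 1],
     [1, 1, 1]]   # hypothetical example values

print(trace(A))
print(trace(matmul(A, B)), trace(matmul(B, A)))
```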
Determinant of a Matrix
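The determinant of a matrix is also a single number which is only defined for square
matrices and is denoted as
$$\det A = |A| = \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{vmatrix}. \tag{5.8}$$
The determinant is calculated as
$$\det A = \sum_{P[\alpha\beta\dots\omega]} \epsilon_{\alpha\beta\dots\omega}\, a_{1\alpha} a_{2\beta} \cdots a_{n\omega}. \tag{5.9}$$
The above sum runs over all permutations of the indices indicated by P[αβ...ω]. For
example, two different permutations of the indices i and j are ij and ji. For n indices,
there are n! permutations, i.e. the sum runs over n! terms. The quantity $\epsilon_{\alpha\beta\dots\omega}$ is
called the anti-symmetric tensor, which takes the values +1 and −1 depending on how
often pairs of indices are interchanged:
$$\epsilon_{\alpha\beta\dots\omega} = \begin{cases} +1 & \text{for an even permutation of } 1, \dots, n \\ -1 & \text{for an odd permutation of } 1, \dots, n \\ 0 & \text{if two indices are the same.} \end{cases} \tag{5.10}$$
This means that $\epsilon_{\alpha\beta\gamma\dots\omega} = -\epsilon_{\beta\alpha\gamma\dots\omega} = +\epsilon_{\beta\gamma\alpha\dots\omega}$.
Permutation
A permutation is one of several ways that a number of items can be arranged.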
Example
Calculate the determinant of an arbitrary 2 × 2 matrix.
For an arbitrary 2 × 2 matrix we obtain
$$|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = \epsilon_{12}\, a_{11} a_{22} + \epsilon_{21}\, a_{12} a_{21} = a_{11} a_{22} - a_{12} a_{21}.$$
Notice that the determinant is the product of the diagonal elements minus the
product of the off-diagonal elements. This is always the case for 2 × 2 matrices, but
it does not work for matrices of other dimensions.
Example
Calculate the determinant of a 3 × 3 matrix.
For a 3 × 3 matrix we obtain
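$$\begin{aligned} |A| = \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = {} & + a_{11} a_{22} a_{33} - a_{11} a_{23} a_{32} \\ & + a_{12} a_{23} a_{31} - a_{12} a_{21} a_{33} \\ & + a_{13} a_{21} a_{32} - a_{13} a_{22} a_{31}. \end{aligned}$$
Another way to calculate the determinant of a matrix A is to use the Laplace
expansion, also called the cofactor expansion. The cofactors are
$$C_{ij} = (-1)^{i+j} M_{ij} \tag{5.11}$$
where the minor $M_{ij}$ is the determinant of the matrix of size (n − 1) × (n − 1),
which is obtained by removing all elements of the ith row and jth column of the
original matrix A. The determinant is then the sum of the entries of any one row (or
column), each multiplied by its cofactor, for example $\det A = \sum_{j=1}^{n} a_{ij} C_{ij}$ for a fixed row i.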
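Both routes to the determinant can be compared in a few lines of Python. This is a sketch under
stated assumptions: the function names are invented for this illustration, the sign in equation
5.10 is obtained by counting inversions (one of several equivalent ways to find the parity of a
permutation), the Laplace expansion is taken along the first row, and the test matrix is an
arbitrary choice. Indices in the code start at 0 rather than 1.

```python
from itertools import permutations


def parity_sign(perm):
    """+1 for an even permutation, -1 for an odd one, found by counting inversions."""
    n = len(perm)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])
    return -1 if inversions % 2 else 1


def det_permutation(A):
    """Equation 5.9: sum over all n! permutations of the column indices."""
    total = 0
    for perm in permutations(range(len(A))):
        term = parity_sign(perm)
        for row, col in enumerate(perm):
            term *= A[row][col]
        total += term
    return total


def det_laplace(A):
    """Cofactor expansion along the first row, applied recursively."""
    if len(A) == 1:
        return A[0][0]
    total = 0
    for j in range(len(A)):
        minor = [row[:j] + row[j + 1:] for row in A[1:]]  # drop row 0 and column j
        total += (-1) ** j * A[0][j] * det_laplace(minor)
    return total


A = [[1, 2, 3],
     [0, 4, 5],
     [1, 0, 6]]   # arbitrary example values
print(det_permutation(A), det_laplace(A))
```

The permutation sum needs n! terms, so it is only practical for very small matrices; the recursive
expansion is easier to follow by hand but is not efficient for large n either.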
For future use, we record several useful properties of determinants.
1. $|A^T| = |A|$
2. $|A^\dagger| = |(A^*)^T| = |A^*| = |A|^*$
3. $|AB| = |A||B| = |BA|$
4. $|\lambda A| = \lambda^n |A|$ for an n × n matrix A
5. If two rows or columns of a matrix are interchanged, the determinant changes its
sign but not its absolute value.
6. If two rows or columns of a matrix are identical, the determinant is zero.
Inverse of a Matrix
Just as in the familiar cases of operations on the real numbers and applications of
functions on real numbers, it is sometimes possible to multiply two matrices and
obtain the identity matrix. When
$$AB = BA = I, \tag{5.12}$$
we call B the inverse of A, denoted $B = A^{-1}$. Note that AB = I = BA can only be
satisfied when A is a square matrix; for non-square matrices it is possible that BA = I
but AB ≠ I, and vice versa.
The entries of the inverse matrix can be calculated using
$$\left(A^{-1}\right)_{ij} = \frac{C_{ji}}{\det A} \tag{5.13}$$
where $C_{ji}$ are the cofactors defined in equation 5.11 with the indices swapped, namely i,
j becomes j, i.
Matrices with det(A) = 0 are called singular and cannot be inverted.
For the case of an invertible 2 × 2 matrix A,
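$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},$$
the inverse of A is
$$A^{-1} = \frac{C^T}{\det A} = \frac{1}{a_{11} a_{22} - a_{12} a_{21}} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}. \tag{5.14}$$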
Example
Find the inverse of the matrix
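$$A = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 1 & 0 & 6 \end{pmatrix}.$$
We first find the determinant by cofactor expansion:
$$\det A = 1\,(4 \cdot 6 - 0 \cdot 5) - 2\,(0 \cdot 6 - 1 \cdot 5) + 3\,(0 \cdot 0 - 1 \cdot 4) = 24 + 10 - 12 = 22.$$
We then find the cofactors for each entry:
$$\begin{aligned}
C_{11} &= \begin{vmatrix} 4 & 5 \\ 0 & 6 \end{vmatrix} = 24, & C_{12} &= -\begin{vmatrix} 0 & 5 \\ 1 & 6 \end{vmatrix} = 5, & C_{13} &= \begin{vmatrix} 0 & 4 \\ 1 & 0 \end{vmatrix} = -4, \\
C_{21} &= -\begin{vmatrix} 2 & 3 \\ 0 & 6 \end{vmatrix} = -12, & C_{22} &= \begin{vmatrix} 1 & 3 \\ 1 & 6 \end{vmatrix} = 3, & C_{23} &= -\begin{vmatrix} 1 & 2 \\ 1 & 0 \end{vmatrix} = 2, \\
C_{31} &= \begin{vmatrix} 2 & 3 \\ 4 & 5 \end{vmatrix} = -2, & C_{32} &= -\begin{vmatrix} 1 & 3 \\ 0 & 5 \end{vmatrix} = -5, & C_{33} &= \begin{vmatrix} 1 & 2 \\ 0 & 4 \end{vmatrix} = 4.
\end{aligned}$$
The cofactor matrix is
$$C = \begin{pmatrix} 24 & 5 & -4 \\ -12 & 3 & 2 \\ -2 & -5 & 4 \end{pmatrix}.$$
Hence, the inverse matrix is
$$A^{-1} = \frac{1}{\det A}\, C^T = \frac{1}{22} \begin{pmatrix} 24 & 5 & -4 \\ -12 & 3 & 2 \\ -2 & -5 & 4 \end{pmatrix}^T = \frac{1}{22} \begin{pmatrix} 24 & -12 & -2 \\ 5 & 3 & -5 \\ -4 & 2 & 4 \end{pmatrix}.$$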
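The result can be confirmed by multiplying A with the computed inverse. The sketch below follows
equation 5.13 literally and uses exact fractions from the standard library so that no rounding
occurs; the function names `minor`, `det`, `inverse`, and `matmul` are assumptions made for this
illustration.

```python
from fractions import Fraction


def minor(A, i, j):
    """Matrix A with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(len(A)))


def inverse(A):
    """Entry (i, j) of the inverse is the cofactor C_ji divided by det A."""
    d = det(A)
    if d == 0:
        raise ValueError("matrix is singular and cannot be inverted")
    n = len(A)
    return [[Fraction((-1) ** (i + j) * det(minor(A, j, i)), d) for j in range(n)]
            for i in range(n)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2, 3],
     [0, 4, 5],
     [1, 0, 6]]
A_inv = inverse(A)
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(matmul(A, A_inv) == identity and matmul(A_inv, A) == identity)
```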
Eigenvalues and Eigenvectors of a Matrix
Observe that a column vector is an n × 1 matrix, and therefore, it is possible to perform

certain matrix operations on vectors. This means that we can use matrices to modify
vectors or express certain operations as the product of a matrix and a vector. For exam-
ple we can express a rotation as a matrix applied to a vector, or a shearing transforma-
tion that moves one axis to a different angle but only stretches or compresses another
axis.
In such examples, we may find that the matrix operation applied to certain vectors only
changes the magnitude or length of the vector:
$$A\vec{x} = \lambda \vec{x}. \tag{5.15}$$
This equation is called the eigenvalue problem for matrix A where λ are either real- or
complex-valued numbers called the eigenvalues of the matrix, and the solution $\vec{x}$ is an
eigenvector associated with λ. This equation can be written as
$$(A - \lambda I)\,\vec{x} = \vec{0}, \tag{5.16}$$
which gives rise to the characteristic equation det(A − λI) = 0. Equation 5.16 can be
viewed as a homogeneous system of linear equations of the form $B\vec{x} = \vec{0}$ where
B = A − λI. This system has a nontrivial (nonzero) solution only if the determinant of B
is zero, and solving det(A − λI) = 0 for λ determines the eigenvalues of A. Note that
$\vec{x} = \vec{0}$ is always a solution to equation 5.16, so the nontrivial requirement is important.
Example
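Calculate the eigenvalues and associated eigenvectors of the matrix
$$A = \begin{pmatrix} 10 & -3 \\ -3 & 2 \end{pmatrix}.$$
First, we form the equation
$$(A - \lambda I)\,\vec{x} = \begin{pmatrix} 10 - \lambda & -3 \\ -3 & 2 - \lambda \end{pmatrix} \vec{x} = \vec{0}.$$
Next, we find
$$\begin{aligned} \det(A - \lambda I) &= (10 - \lambda)(2 - \lambda) - (-3)(-3) \\ &= 20 - 10\lambda - 2\lambda + \lambda^2 - 9 \\ &= \lambda^2 - 12\lambda + 11 \\ &= 0 \end{aligned}$$
and then solve the characteristic equation
$$\begin{aligned} \lambda^2 - 12\lambda + 11 &= 0 \\ \lambda^2 - 12\lambda &= -11 \\ (\lambda - 6)^2 &= -11 + 36 \\ (\lambda - 6)^2 &= 25, \end{aligned}$$
which has solutions λ1 = 1 and λ2 = 11.
Now that we have the eigenvalues, we can find the associated eigenvectors. For
λ1 = 1, we get
$$(A - 1I)\,\vec{x} = \begin{pmatrix} 10 - 1 & -3 \\ -3 & 2 - 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \vec{0},$$
which leads to the system of equations
$$\begin{aligned} 9x_1 - 3x_2 &= 0 \\ -3x_1 + x_2 &= 0 \end{aligned}$$
and is solved by $x_1 = t$, $x_2 = 3t$ where t is a free variable. The same approach is
taken for λ2 = 11. The equation
$$(A - 11I)\,\vec{x} = \begin{pmatrix} 10 - 11 & -3 \\ -3 & 2 - 11 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \vec{0}$$
leads to the system of equations
$$\begin{aligned} -x_1 - 3x_2 &= 0 \\ -3x_1 - 9x_2 &= 0. \end{aligned}$$
Using the parameter t, we are able to describe the infinitely many solutions as
those satisfying $x_1 = t$, $x_2 = -t/3$. Typically, we choose eigenvectors that are
unit vectors; in this case
$$\vec{x}_1 = \frac{1}{\sqrt{10}} \begin{pmatrix} 1 \\ 3 \end{pmatrix} \quad \text{and} \quad \vec{x}_2 = \frac{1}{\sqrt{10}} \begin{pmatrix} 3 \\ -1 \end{pmatrix}.$$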
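The example can be verified numerically. The sketch below relies on the fact that the
characteristic polynomial of any 2 × 2 matrix is λ² − Tr(A) λ + det(A); it assumes real
eigenvalues, and the function names are hypothetical helpers for this illustration.

```python
import math


def eigenvalues_2x2(A):
    """Roots of lambda^2 - Tr(A) * lambda + det(A) = 0, assuming they are real."""
    (a, b), (c, d) = A
    tr = a + d
    det = a * d - b * c
    root = math.sqrt(tr * tr - 4 * det)
    return (tr - root) / 2, (tr + root) / 2


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


A = [[10, -3],
     [-3, 2]]
lam1, lam2 = eigenvalues_2x2(A)

norm = math.sqrt(10)
x1 = [1 / norm, 3 / norm]    # unit eigenvector for lambda = 1
x2 = [3 / norm, -1 / norm]   # unit eigenvector for lambda = 11

# A x should equal lambda x, component by component.
check1 = all(math.isclose(l, lam1 * r) for l, r in zip(matvec(A, x1), x1))
check2 = all(math.isclose(l, lam2 * r) for l, r in zip(matvec(A, x2), x2))
print(lam1, lam2, check1, check2)
```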
5.3 Diagonalization
Change of Basis
A basis $\{\vec{e}_i : i = 1, 2, \dots, N\}$ is a minimal spanning set of linearly independent vectors.
One example is the Cartesian coordinate system, which forms a basis of ℝ³ where
$\vec{i} = \vec{e}_1$ points along the positive x-axis, $\vec{j} = \vec{e}_2$ points along the positive y-axis, and
$\vec{k} = \vec{e}_3$ points along the positive z-axis.
Linearly Independent
Vectors are linearly independent if they cannot be expressed as linear combinations of
each other.
Considering an n-dimensional vector space with basis $\vec{e}_1, \dots, \vec{e}_n$, every vector $\vec{x}$ in the
vector space can be expressed as a linear combination of the basis vectors, namely,
$$\vec{x} = x_1 \vec{e}_1 + x_2 \vec{e}_2 + \cdots + x_n \vec{e}_n, \tag{5.17}$$
or $\vec{x}$ can be written in vector form as
$$\vec{x} = (x_1, x_2, \dots, x_n)^T. \tag{5.18}$$
However, the familiar $\vec{i}, \vec{j}, \vec{k}$ is not the only basis for ℝ³. For example, sometimes it
is easier to consider a problem in spherical coordinates rather than in Cartesian
coordinates. The new base vectors $\vec{e}\,'_j$ can be expressed as
$$\vec{e}\,'_j = \sum_{i=1}^{N} S_{ij}\, \vec{e}_i \tag{5.19}$$
where $S_{ij}$ are the entries of the matrix S that transforms from the old base $\vec{e}$ to the new
base $\vec{e}\,'$. In the following, we denote the original base with $\vec{e}_i$ and the new base that we
want to change to with $\vec{e}\,'_i$. Intuitively it is clear that any object represented by a vector in
either base does not change its properties if we express it using $\vec{e}_i$ or $\vec{e}\,'_i$. For example,
we can imagine that the vector $\vec{x}$ refers to the position of an object. Irrespective of
how we choose the base to refer to the position of the object, its position remains
unchanged. In general, an arbitrary vector can be expressed as
$$\vec{x} = \sum_{i=1}^{N} x_i\, \vec{e}_i = \sum_{i=1}^{N} x'_i\, \vec{e}\,'_i = \sum_{j=1}^{N} x'_j \sum_{i=1}^{N} S_{ij}\, \vec{e}_i. \tag{5.20}$$
This means that we can express the components of the vector $\vec{x}$ in the original base
(or coordinate system) $\vec{e}_i$. We denote these components of the vector $\vec{x}$ with $x_i$. We
can also express the components of the vector $\vec{x}$ in the new base (or coordinate
system) defined by $\vec{e}\,'_i$; in this case, we denote the components by $x'_i$. The matrix $S_{ij}$
connects the two representations
$$x_i = \sum_{j=1}^{N} S_{ij}\, x'_j \tag{5.21}$$
or in vector notation,
$$\vec{x} = S \vec{x}\,' \iff \vec{x}\,' = S^{-1} \vec{x}. \tag{5.22}$$
In case of equations where an arbitrary matrix A is applied to a vector $\vec{x}$, this can be
expressed in both bases as
$$\vec{y} = A \vec{x}, \qquad \vec{y}\,' = A' \vec{x}\,' \tag{5.23}$$
and using equation 5.22, the first equation in the line above can be written as
$$S \vec{y}\,' = A S \vec{x}\,' \tag{5.24}$$
expressing $\vec{x}$ and $\vec{y}$ in the new “primed” coordinate system using the matrix S. Hence
$$\vec{y}\,' = S^{-1} A S\, \vec{x}\,', \tag{5.25}$$
which implies that
$$A' = S^{-1} A S. \tag{5.26}$$
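A small numerical check makes the meaning of equation 5.26 concrete: transforming a vector to the
old base, applying A, and transforming back gives the same result as applying A′ directly in the
new base. The matrices S, its inverse, and A below are hypothetical values chosen so that all
arithmetic stays in whole numbers; the helper names are likewise assumptions.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


S = [[1, 1],
     [0, 1]]        # columns are the new base vectors expressed in the old base
S_inv = [[1, -1],
         [0, 1]]    # inverse of S, entered by hand for this small example
A = [[2, 1],
     [0, 3]]        # operator expressed in the old base

A_prime = matmul(S_inv, matmul(A, S))   # equation 5.26

x_prime = [1, 2]                        # components in the new base
x = matvec(S, x_prime)                  # equation 5.22: x = S x'
y = matvec(A, x)                        # apply A in the old base
y_prime = matvec(S_inv, y)              # transform the result back

print(A_prime, y_prime, matvec(A_prime, x_prime))
```

In this particular case A′ turns out to be diagonal, because the columns chosen for S happen to be
eigenvectors of A.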
Matrix Diagonalization
Consider the matrix S so that each column of the matrix corresponds to an eigenvector
of some matrix A, so
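$$S = \begin{pmatrix} \uparrow & \uparrow & & \uparrow \\ \vec{x}^{\,1} & \vec{x}^{\,2} & \cdots & \vec{x}^{\,N} \\ \downarrow & \downarrow & & \downarrow \end{pmatrix}. \tag{5.27}$$
In the above matrix, $\vec{x}^{\,1}, \vec{x}^{\,2}, \dots, \vec{x}^{\,N}$ indicate the eigenvectors. Note that the superscript
does not denote an exponent in this case. They fulfill the equation for eigenvectors
$$A \vec{x}^{\,j} = \lambda_j \vec{x}^{\,j}.$$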
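Then, the entry $(A')_{ij}$ of matrix A′, which is the matrix A expressed in the new base or
coordinate system consisting of the eigenvectors, is given by
$$\begin{aligned}
\left(S^{-1} A S\right)_{ij} &= \sum_k \sum_l \left(S^{-1}\right)_{ik} A_{kl}\, S_{lj} \\
&= \sum_k \sum_l \left(S^{-1}\right)_{ik} A_{kl} \left(\vec{x}^{\,j}\right)_l && \text{(eigenvectors in matrix } S\text{)} \\
&= \sum_k \left(S^{-1}\right)_{ik} \lambda_j \left(\vec{x}^{\,j}\right)_k && \text{(eigenvalues of matrix } A \text{ applied)} \\
&= \sum_k \lambda_j \left(S^{-1}\right)_{ik} S_{kj} && \text{(eigenvectors in matrix } S\text{)} \\
&= \lambda_j \delta_{ij},
\end{aligned}$$
where $\delta_{ij}$ is 1 if i = j and 0 otherwise. This means that the resulting matrix is diagonal
with the eigenvalues on the diagonal
$$A' = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_N \end{pmatrix}. \tag{5.28}$$
In this derivation we have used the following:
• the matrix S was chosen so that the columns of S are the eigenvectors of A,
• applying A to an eigenvector of A scales that eigenvector by the associated
eigenvalue λ (recall that $A\vec{x} = \lambda\vec{x}$ holds for an eigenvector $\vec{x}$ of A), and
• applying the inverse of a matrix to that matrix gives the identity matrix, namely,
$S^{-1} S = I$.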
Example
Diagonalize the following matrix:
$$A = \begin{pmatrix} 2 & 0 & 0 \\ 1 & 2 & 1 \\ -1 & 0 & 1 \end{pmatrix}.$$
As a first step, we need to find the eigenvalues of A via det(A−λI) = 0:
$$\det(A - \lambda I) = \begin{vmatrix} 2-\lambda & 0 & 0 \\ 1 & 2-\lambda & 1 \\ -1 & 0 & 1-\lambda \end{vmatrix} = (2-\lambda)^2 (1-\lambda) = 0.$$
The eigenvalues are therefore λ1 = 1 and λ2 = 2, where λ2 occurs twice. Next, we need
to solve the eigenvalue equation (A−λI)x = 0 for each value of λ. For λ1 = 1 we obtain
$$(A - 1I)\,\vec{x} = \begin{pmatrix} 2-1 & 0 & 0 \\ 1 & 2-1 & 1 \\ -1 & 0 & 1-1 \end{pmatrix} \vec{x} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ -1 & 0 & 0 \end{pmatrix} \vec{x} = \vec{0}.$$
This leads to the following set of equations:
$$\begin{aligned} x_1 &= 0 \\ x_1 + x_2 + x_3 &= 0 \\ -x_1 &= 0. \end{aligned}$$
The first and third equation imply $x_1 = 0$, which means that $x_2 = -x_3$. The
eigenvector is then given by
$$\vec{v}_1 = a_1 \begin{pmatrix} 0 \\ -1 \\ 1 \end{pmatrix}$$
where $a_1$ is an arbitrary constant. We can choose $a_1$ so that the eigenvector
becomes a unit vector. However, as we want to use the matrix that we construct
from the eigenvectors in the later process of changing from one base or coordinate
system to another, we choose $a_1 = 1$ to make the resulting matrix of eigenvectors
as simple as possible. We find the eigenvectors for λ2 = 2 in the same way and
obtain
$$\vec{v}_2 = a_2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \quad \text{and} \quad \vec{v}_3 = a_3 \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}.$$
Again, we choose $a_2 = 1$ and $a_3 = 1$ to simplify the matrix of eigenvectors. The
matrix of eigenvectors is then
$$S = \begin{pmatrix} 0 & 0 & -1 \\ -1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}$$
and the diagonal matrix (with the eigenvalues on the diagonal) is
$$D = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix},$$
which can be verified using the expression $D = S^{-1} A S$.
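One way to carry out this verification without computing the inverse explicitly is to check the
equivalent statement AS = SD, which holds column by column exactly when each column of S is an
eigenvector of A with the matching eigenvalue in D. The following sketch does this in plain
Python and also confirms that S is invertible; the helper names `matmul` and `det` are assumptions
for this illustration.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def det(A):
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det([row[:j] + row[j + 1:] for row in A[1:]])
               for j in range(len(A)))


A = [[2, 0, 0],
     [1, 2, 1],
     [-1, 0, 1]]
S = [[0, 0, -1],
     [-1, 1, 0],
     [1, 0, 1]]    # columns: v1, v2, v3
D = [[1, 0, 0],
     [0, 2, 0],
     [0, 0, 2]]

# If S is invertible, A S = S D is the same statement as D = S^{-1} A S.
print(det(S) != 0, matmul(A, S) == matmul(S, D))
```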
5.4 Tensors
Introduction to Tensors
So far, we have encountered scalars, vectors, and matrices. In many examples, we have
seen how these objects can be applied, for example, to describe physical systems. In
the following section, we will discover how we can treat these objects in a more unified
way and how we can use this to extend these objects by building an intuition using
examples from our physical world. Note that this is only a first introduction and not a
fully formal treatment of the underlying mathematical theory.
In some scenarios, a simple number is sufficient, e.g. the question “how many pieces of
cake are there?” can be answered with the number of pieces alone. However, in many
situations, a simple number is not sufficient, e.g. “how far is it to your home?” Just
answering “3” is not sufficient; we also need a unit, e.g. km: “my house is 3 km from here.” In
applications and physics, these objects are called denominate numbers or scalars.
Other examples are temperature, the (inner) energy of a system, and pressure. When
calculating with scalars, we can add, subtract, multiply, or divide them, but we always
get a scalar when operating with scalars.
In other situations, more information is required than denominate numbers, for exam-
ple, answering the question “how do I get to your house?” In addition to a number, a
direction is needed: “walk 3 km due north.” These objects are vectors and are character-
ized by a direction and a value, the magnitude of the vector. Other examples include
velocity, acceleration, and (angular) momentum. Vectors are often represented by their
components $\vec{v} = a\,\vec{i} + b\,\vec{j} + c\,\vec{k}$ where $\vec{i}, \vec{j}, \vec{k}$ are the unit vectors (for example, in the
x-, y-, and z-direction in a Cartesian coordinate system) and a, b, c are scalars denoting
how far one has to go in each direction. We have already encountered how we can add,
subtract, or multiply vectors. Depending on the operation, the result can either be a
scalar or a vector.
• Sum: $\vec{w} = \vec{u} \pm \vec{v}$. The sum (or difference) of two vectors is a vector.
• Inner product: $\vec{u} \cdot \vec{v} = \eta$. The inner product is a scalar and the inner product of a
vector with itself is the square of its magnitude (length).
• Cross product: $\vec{s} = \vec{u} \times \vec{v}$. The cross product of two vectors $\vec{u}$ and $\vec{v}$ is a vector
orthogonal to both $\vec{u}$ and $\vec{v}$. In the case of three-dimensional space, the cross
product $\vec{s}$ is perpendicular to the plane spanned by $\vec{u}$ and $\vec{v}$.
• Multiplication by a scalar: A vector can be multiplied by a scalar to change its
magnitude.
We have also used matrices to operate on vectors, which can involve rotating vectors,
or expressing a change of basis or coordinate system. However, many physical systems
cannot be described by scalars and vectors alone. A familiar example is the inertia
matrix (or tensor) of an object: If we rotate a three-dimensional object such as a gyro-
scope, we find that the rotation around some axis is stable (at least, as long as the
gyroscope does not lose too much energy due to friction in the course of our
investigation). However, if we apply a force (a vector with both a direction and a magnitude)
from outside, the gyroscope will generally re-orient itself along some other direction
that is different to the one of the force we applied. We need the nine elements of a
3-by-3 matrix (or tensor) to describe this behavior. Another example is the magnetic
susceptibility of a material. When an external magnetic field $\vec{H}$ is applied, any material
will, in general, show a magnetic response $\vec{M}$. However, in many materials, the
external field and the magnetic response are not aligned. This means that the magnetic
susceptibility χ is not a single number or a vector and we need 3 · 3 = 9 elements to
describe the response of the object, $\vec{M} = \chi \vec{H}$. To describe other physical systems, such
as understanding our universe in the context of general relativity, or stationary or rotating
black holes, we need objects with even more elements. We typically call these metrics
in the context of general relativity.
Tensors extend and generalize what we have seen so far of scalars, vectors, and
matrices in 3d space:
1. Scalar: Tensor of rank 0 (1 component).
2. Vector: Tensor of rank 1 (3 components).
3. Dyad: Tensor of rank 2 ($3^2 = 9$ components).
4. Triad: Tensor of rank 3 ($3^3 = 27$ components).
This list can be extended to higher ranks, though these are then generally just called
“tensor of rank k.” Here, we can understand the rank of a tensor intuitively as the
number of indices we need to express the tensor. A scalar that has no index can be
represented by a single number. The components of a vector can be identified by a single
index, for example $a_i$, where we understand that $a_i$ are the elements of the vector $\vec{a}$ and
the index i takes all values i = 1, 2, …, n required to address all elements of the vector
of dimension n. In the same way, we can denote the elements of some matrix A by two
indices: $a_{ij}$.
If $\vec{a}$ and $\vec{b}$ are vectors and T is a tensor of rank 2, the linear equation $\vec{b} = T\vec{a}$ can be
expressed as the following system of linear equations which is equivalent to our
previous use of matrix operations:
$$\begin{aligned} b_1 &= t_{11} a_1 + t_{12} a_2 + t_{13} a_3 \\ b_2 &= t_{21} a_1 + t_{22} a_2 + t_{23} a_3 \\ b_3 &= t_{31} a_1 + t_{32} a_2 + t_{33} a_3. \end{aligned}$$
Tensors of rank 2 do behave like square matrices, but tensors generalize the concepts
we have encountered so far.
We use this example to introduce the following summation convention: Without writing
the summation sign explicitly, we always sum over indices that occur twice. In the
above example, we can write each component $b_i$ of the vector $\vec{b}$ as a sum
$$b_i = \sum_{j=1}^{3} t_{ij} a_j = t_{ij} a_j \tag{5.29}$$
where we have used the convention to avoid writing the summation sign $\sum_{j=1}^{3}$. Note
that this also implies that the range of the indices i, j is no longer made explicit but is
understood from the context. This notation becomes useful if many indices are used.
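In code, the convention corresponds to a loop (or a sum) over every repeated index, while each
index that appears only once labels a component of the result. A minimal sketch of equation 5.29,
with hypothetical values for the tensor T and the vector a:

```python
T = [[1, 0, 2],
     [0, 3, 0],
     [4, 0, 5]]    # rank-2 tensor t_ij (hypothetical example values)
a = [1, 2, 3]      # rank-1 tensor a_j

n = len(a)
# b_i = t_ij a_j: the repeated index j is summed over, the free index i remains.
b = [sum(T[i][j] * a[j] for j in range(n)) for i in range(n)]
print(b)
```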
Tensors invariant under a permutation of their indices (they do not change sign when two
indices are swapped) are called symmetric, but if the sign flips, they are called
anti-symmetric. Specifically,
if $t_{ijk \dots l} = t_{jik \dots l}$ when i and j are swapped: T is symmetric
if $t_{ijk \dots l} = -t_{jik \dots l}$ when i and j are swapped: T is anti-symmetric
Calculating with Tensors
• Addition: Adding or subtracting tensors is only defined for tensors of the same rank:
$c_{ij} = a_{ij} + b_{ij}$. Addition is commutative, so $a_{ij} + b_{ij} = b_{ij} + a_{ij}$.
• Dyad product: Multiplication is very different to what we have seen so far with
scalars or the inner product of vectors. It is obtained by multiplying each of the
components term by term. The dyad product always leads to a tensor of higher rank:
$r_{iklm} = a_{ik} \otimes b_{lm}$. The dyadic product is neither the inner nor the cross product, but a
new entity. The dyad product is generally not commutative, so $a_{ik} \otimes b_{lm} \neq b_{lm} \otimes a_{ik}$.
In case of two tensors of rank 1, $\vec{a} = (a_1, a_2, a_3)^T$ and $\vec{b} = (b_1, b_2, b_3)^T$, the dyadic product is
$$\vec{a} \otimes \vec{b} = \begin{pmatrix} a_1 b_1 & a_1 b_2 & a_1 b_3 \\ a_2 b_1 & a_2 b_2 & a_2 b_3 \\ a_3 b_1 & a_3 b_2 & a_3 b_3 \end{pmatrix}. \tag{5.30}$$
• Contraction: Tensor contraction is used when we sum over the indices of a tensor,
that is, if an index occurs twice. Recall that we can view vectors as tensors of rank
one. The dot product of two vectors $\vec{a} \cdot \vec{b}$ is given by the sum of the components
$a_1 b_1 + a_2 b_2 + a_3 b_3$ for three-dimensional vectors. This can also be expressed as
$\vec{a} \cdot \vec{b} = \sum_{i=1}^{3} a_i b_i = a_i b_i$ if we apply the summation convention that we introduced
earlier. The result is a number, i.e. a tensor of rank 0. In other words, we have
contracted the index i of the vectors (tensors of rank 1) $\vec{a}$ and $\vec{b}$.
• Trace: The trace is an example of a contraction and is calculated in the same way as
we have seen in case of a square matrix, namely as a summation over the diagonal
$\mathrm{Tr}(r_{ik}) = r_{ii}$.
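The operations in this list can be tried out on two small vectors. The sketch below uses nested
Python lists and hypothetical example values; it shows that the dyadic product of two rank-1
tensors has two indices, that swapping the factors gives a different (transposed) result, and
that contracting the two indices of the dyad reproduces the dot product.

```python
a = [1, 2, 3]
b = [4, 5, 6]   # hypothetical example values

# Dyadic product: r_ij = a_i b_j (rank 1 and rank 1 give rank 2, equation 5.30).
dyad_ab = [[ai * bj for bj in b] for ai in a]
dyad_ba = [[bi * aj for aj in a] for bi in b]

# Contraction of two vectors: a_i b_i (rank 0).
dot = sum(ai * bi for ai, bi in zip(a, b))

# Trace as a contraction of the rank-2 tensor: r_ii.
trace = sum(dyad_ab[i][i] for i in range(len(a)))

print(dyad_ab)
print(dyad_ab == dyad_ba, dot, trace)
```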
Co- and Contravariant Tensors
In the following explanation, we use an intuition built by examples from our physical
world to introduce the concept of co- and contravariance. These concepts play an
important role in the description of physical system and, if we use tensors to describe
these systems, a further convention discussed below affects how we use the indices of
a tensor.
The quantitative description of a physical process is independent of the coordinate
system in which we describe it. For example, we can describe a physical process in
Cartesian coordinates using x, y, and z. If we were to use a spherical coordinate system
with coordinates r, θ, ϕ or cylindrical coordinates with parameters r, θ, z to describe
the same system, we must obtain the same physical relationships. However, describing
a given physical system may be more convenient in one coordinate system than in
another as, for example, we can exploit some form of symmetry. An object rotating at a
fixed distance from an origin in a plane is easily described in spherical or cylindrical
coordinates where the angle θ describing the rotation varies with time and the other
coordinates are constant. To express the same system in Cartesian coordinates requires
a more complex description involving both x and y. It is therefore often desirable to
change the coordinate system or base. As we do so, some quantities will change but
others will remain invariant.
Symmetry
If a system is symmetric under some operation, we mean that it looks the same whether
we apply the operation or not.
An example of an invariant is the mass m of a particle, which is described as a scalar.
Irrespective of the coordinate system used, the mass will stay the same. Using the
example above, if we express the rotation of the object around a fixed point in Carte-
sian coordinates, we will obtain the same value of its mass as if we used cylindrical
coordinates.
The components of an ordinary vector such as the direction vector $\vec{r}$ or the velocity
vector $\vec{v}$ will generally change if we change the coordinate system or base. For a vector
to remain constant under such a change, its components have to contra-vary to
compensate. For the vector to be independent of the change of base or coordinate system,
the matrix that transforms the vector components has to be the inverse of the matrix
expressing the change in base or coordinate system. We call this property contravariant.
As an example, we consider the displacement vector $\vec{r}$ pointing to an object. If the
object is at a distance of 1 m in the x-direction, we can express this as (1 m, 0 m, 0 m).
However, if we change the units from meter to millimeter in x direction, the same object
is now referred to as (1000 mm, 0 m, 0 m). To change the base, in this case going from
meters to millimeters, we divide the base by 1000 but multiply the components of the
vector by 1000.
A covector, on the other hand, transforms the same way as the change of base or
coordinate system does: its components co-vary and hence, we call this property
covariant. An example of a covariant vector is the gradient of a field. For example, if we
measure the strength of an electric field in V/m and change the unit of length to mm,
we divide the base by 1000. The components of the covector are then also divided by
1000 (1 V/m = 0.001 V/mm), so that the covector itself stays invariant under this change
of coordinate system or base.
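The two behaviors can be contrasted with a one-dimensional numerical sketch. It assumes that the
unit of length changes from meters to millimeters, so that the new base vector is the old one
divided by 1000; the displacement of 1 m is taken from the example above, whereas the gradient
value of 5 V/m and all variable names are hypothetical. The contravariant component scales
inversely to the base and the covariant component scales with it, so their product, the potential
difference along the displacement, does not depend on the unit chosen.

```python
import math

factor = 1000                  # 1 m = 1000 mm: new base vector = old base vector / 1000

x_m = 1.0                      # displacement component in meters (contravariant)
grad_v_per_m = 5.0             # gradient component in V/m (covariant), hypothetical value

x_mm = x_m * factor                     # contravariant: multiply by 1000
grad_v_per_mm = grad_v_per_m / factor   # covariant: divide by 1000, like the base

# The pairing of covector and vector is the same in both unit systems.
drop_in_m = grad_v_per_m * x_m
drop_in_mm = grad_v_per_mm * x_mm
print(x_mm, grad_v_per_mm, drop_in_m, drop_in_mm)
```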
Generally, contravariant tensors are denoted with superscript indices, i.e. $r^i$, and
covariant tensors are denoted with subscript indices, $r_i$.
We can now connect this to our discussion of tensors. Tensors can have both co- and
contravariant properties. This means that, in general, a tensor will have both
superscript and subscript indices, for example $g^{k}{}_{ij}$, depending on how it behaves during a
change of base or coordinate transformation.
Summary
Matrices and vector spaces are key ingredients in many complex calculations. For
example, matrices can be used to change coordinate systems. Basic skills in calcu-
lating with matrices include adding and subtracting matrices, performing matrix/
vector operations, and finding the eigenvectors and eigenvalues of matrices. An
important step in changing bases is the diagonalization of a matrix. Tensors extend
the concepts and calculations of vectors and matrices to a more general setting,
allowing for linear and non-linear coordinate systems.
Knowledge Check
learning platform.
Good luck!
Unit 6
Information Theory
STUDY GOALS
On completion of this unit, you will have learned...
… how to express the difference between a prediction and observed events using the MSE.
… what the Gini index is and how to determine it.
… the concepts of information entropy, Shannon entropy, and Kullback Leibler divergence.
… how to use the cross-entropy to compare two probability functions.
6. Information Theory
Introduction
The field of information theory focuses on the quantification, processing, and commu-
nication of information. The concept of entropy was introduced in the field of informa-
tion science by Claude Shannon and has many important applications, including the
quantification of information contained in a data stream. We will also develop related
tools for measuring the degree of similarity between probability distributions and
membership in data classes. Such techniques are useful for classification tasks where
one is concerned with assigning an event to one or more pre-defined classes as well as
in regression tasks, which focus on the extrapolation or prediction of a quantity in a
given situation.
Various metrics such as the mean squared error, Kullback-Leibler divergence, and
cross-entropy can be derived from information theoretic principles and play a leading role in
algorithmic approaches to the extrapolation and prediction of new or unknown events.
6.1 Mean Squared Error (MSE)

In many cases, we need to predict a quantity from observed values. One way to do this
is a regression where we associate one or more independent variables with a target or
dependent variable we wish to predict. The independent variables are also often called
features and the dependent variable is called a label. If we denote the independent
variable(s) by X and the dependent variable by y, we wish to establish a relationship
so that $y = f(X, a_i)$ where f is some functional form that maps the independent
variable(s) X to the dependent variable y. This mapping may depend on one or more
parameters $a_i$.
The class X is a collection of variables in a given dataset and $\vec{x}$ corresponds to a
specific set of values of these ordered variables. The domain of the vector $\vec{x}$ does not
necessarily fulfill all requirements of a vector space that we have encountered so far; it is
often merely a convenient notation to express that we consider all observations or
values of the independent variables for a specific event. For any concrete set of
observations $\vec{x}$ this mapping f(·) results in a prediction ŷ. The obvious task in modeling is to
estimate $f(X, a_i)$, which could be done, for example, by building an explicit statistical or
causal model, or by using a machine learning algorithm to learn from past
observations.
The simplest nontrivial example of such a model is linear regression where the form of
the function used to estimate the true values of the data is the line y = mx + b, for a
single variable x. In this case, we have two free parameters, namely $a_1 = m$ and $a_2 = b$.
To develop such a model, we not only need the observed values of the variables X, but
also the corresponding observed true values y in order to choose the parameters $a_i$
appropriately. During the development of the model, we obtain an intermediate
estimate of the predicted value ŷ based on the current values of the parameters $a_i$ and we
need a metric to determine how to optimize the parameters further. Once we have
optimized the final parameter values of the model, we need to assess the accuracy of
the model.
Free Parameters
Free parameters are those parameters in a model that need to be determined using
the data.
One of the simplest and most popular metrics is the mean squared error
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2, \tag{6.1}$$
where the sum runs over the N observations.
The MSE is symmetric in the true observed value y and the model's prediction ŷ. It puts
a strong penalty on predictions ŷ that are far from the observed value y. Although this
sounds like a desirable quantity, it also means that the metric can be dominated by a
few extreme values, even if the bulk of the predicted values are close to the observed
ones.
The MSE and other metrics can be used in two different ways in the evaluation of a
prediction model. One use is during model construction and the other is during model
testing and assessment.
1. Loss function: A loss function is used during model building while the parameters ai
are optimized.
2. Score function: A score function is used to compare the values predicted by the
model with the observed values after the model has been built.
In the case of the loss function, the metric is directly used to optimize the model
parameters. The final value(s) of the model parameters will depend on the loss func-
tion in the sense that a different loss function will lead to different optimal parameters.
The final evaluation of the predicted values ŷ compared to the true values y is done
using a score function, which may or may not be different from the loss function.
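To make the behavior of the MSE tangible, the following sketch evaluates equation 6.1 for a
linear model y = mx + b on a handful of invented data points. The function names `mse` and
`predict` and all numbers are assumptions for this illustration. The last case shows how a single
extreme observation dominates the metric even though the other three predictions are exact.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed values and predictions (equation 6.1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("both sequences must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def predict(x, m, b):
    """Predictions of the linear model y = m x + b."""
    return [m * xi + b for xi in x]


x = [0, 1, 2, 3]
y = [1.0, 3.0, 5.0, 7.0]             # generated by y = 2x + 1 (hypothetical data)

perfect = mse(y, predict(x, 2, 1))   # parameters match the data
poor = mse(y, predict(x, 1, 1))      # wrong slope

y_outlier = [1.0, 3.0, 5.0, 17.0]    # one observation is far off
dominated = mse(y_outlier, predict(x, 2, 1))
print(perfect, poor, dominated)
```

Used as a loss function, the same quantity would be minimized with respect to m and b; used as a
score function, it would be reported once for the final parameter values.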
6.2 Gini Index

The Gini index, or Gini coefficient, is a statistical measure of the degree of inequality of
values in frequency distributions. It is commonly applied to measure the distribution of
the income (or wealth) within a country. The index was developed by the Italian Cor-
rado Gini in 1921 (Gini, 1912, 1921). The coefficient ranges from zero, corresponding to
zero percent difference and therefore total equality, to one, which corresponds to 100
percent difference and therefore perfect inequality. Values over one are only possible if
there is negative income or wealth, namely people with debts. The higher the value of
the Gini index, the more pronounced the inequality is.
To determine the Gini coefficient of the income distribution of people in a given
country, we find the income of all the people in the country and present this data as a
cumulative percentage of population against the cumulative share of income earned.
An example of the resulting Lorenz Curve is shown in the following figure:
[Figure: Lorenz curve]
The main idea behind the Gini index is to show the extent to which wealth is or is not
evenly distributed throughout the population of a country. Researchers are often also
interested in whether the government of said country is trying to keep this ratio as low
as possible, namely whether they are striving for income equality. The Gini coefficient
0 ≤ G ≤ 1 is the ratio of the areas
$$G = \frac{A}{A + B}$$
depicted in the Lorenz Curve, where A is the area between the line of equality and the
Lorenz curve and B is the area under the Lorenz curve.
A few cases of interest are examined below.
1. If A = 0, the Lorenz Curve coincides with the line of equality.

2. If G = 0, there is a “perfect” distribution of income, meaning a perfectly uniform
distribution of income; all people in the country possess the same regular influx of
money.
3. If A is very large, the area B becomes very small. In this case, G ≈ 1 (the Gini coeffi-
cient is large) and there is very uneven distribution of income.
Example
Suppose we live in a very small country with only ten people. Let’s call us a1, a2, …,
a10 and suppose that the total income for the country is $100 per day and this
income is distributed evenly among the population so that the income of each resi-
dent ai is $10 per day (i = 1, …, 10). Evaluate A and G.
The total income distribution is shown in the table below, where the proportion
and cumulative proportion refer to the population considered in this example. So,
in this case, A = 0 and G = 0; the Lorenz Curve and the line of equality coincide.
The cumulative proportions (“prop.”) of population versus the cumulative percent-

age of income (“inc.”) is shown below:
Cumulative Proportions of Population versus Cumulative Percentage of Income

Citizen   Prop.   Cumul. prop.   Inc.   Cumul. inc.
a1        10%     10%            10%    10%
a2        10%     20%            10%    20%
a3        10%     30%            10%    30%
a4        10%     40%            10%    40%
a5        10%     50%            10%    50%
a6        10%     60%            10%    60%
a7        10%     70%            10%    70%
a8        10%     80%            10%    80%
a9        10%     90%            10%    90%
a10       10%     100%           10%    100%
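The cumulative columns of the table can also be computed directly from a list of incomes. The following Python sketch is one possible way to do this, assuming a piecewise-linear Lorenz curve between the data points; the function name `gini_from_incomes` and the second, very unequal income list are illustrative assumptions and not part of the course text.

```python
def gini_from_incomes(incomes):
    """Gini coefficient G = A / (A + B) from a list of non-negative incomes."""
    ordered = sorted(incomes)
    n = len(ordered)
    total = sum(ordered)
    # Lorenz curve points: cumulative income share at population shares 0, 1/n, ..., 1.
    lorenz = [0.0]
    running = 0.0
    for income in ordered:
        running += income
        lorenz.append(running / total)
    # Area B under the piecewise-linear Lorenz curve (trapezoidal rule).
    area_b = sum((lorenz[k] + lorenz[k + 1]) / 2 for k in range(n)) / n
    # A + B is the triangle under the line of equality, which has area 0.5.
    area_a = 0.5 - area_b
    return area_a / 0.5


equal = gini_from_incomes([10] * 10)            # the ten-resident example
unequal = gini_from_incomes([0] * 9 + [100])    # hypothetical: one resident earns everything
print(equal, unequal)
```

For the evenly distributed incomes the result is zero up to rounding, while the hypothetical second list gives a value close to one.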
Example
Suppose that the graph of the cumulative proportion of population (on the horizontal
axis) against the cumulative percentage of income (on the vertical axis) is as
shown below, where the Lorenz curve is defined by y = x^5.
Determine the Gini index.
We need to find the area of the region A, between the line of equality (in green)
and the Lorenz curve (in red). One way to do this is to find the area of the triangular
region below the line of equality and subtract the area under the Lorenz curve. The
area of the triangle below the line of equality is half the area of the whole square,
and is therefore 0.5. Let $I_B$ denote the area of the region B under the Lorenz curve.
$I_B$ is then
$$I_B = \int_0^1 x^5 \, dx = \frac{1}{6} \approx 0.1667$$
and therefore, the area of the region A is 0.5 − 0.1667 = 0.3333. We can now find
the Gini index, which is
$$G = \frac{A}{A + B} = \frac{0.3333}{0.5} \approx 0.667.$$
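The same calculation can be checked numerically. The sketch below approximates the area B with a midpoint rule; the function name `gini_from_lorenz` and the number of integration steps are assumptions chosen for illustration.

```python
def gini_from_lorenz(lorenz, steps=100_000):
    """Approximate G = A / (A + B) for a Lorenz curve given as a function on [0, 1]."""
    width = 1.0 / steps
    # Midpoint rule for the area B under the Lorenz curve.
    area_b = sum(lorenz((k + 0.5) * width) for k in range(steps)) * width
    # The triangle under the line of equality has area A + B = 0.5.
    area_a = 0.5 - area_b
    return area_a / 0.5


g = gini_from_lorenz(lambda x: x ** 5)
print(round(g, 3))  # 0.667
```

Passing the line of equality, `lambda x: x`, gives a Gini index of zero up to rounding, in agreement with the first example.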
Gini Impurity
The Gini index should not be confused with the Gini impurity. Unfortunately, in practice
the two terms are used interchangeably. The term “Gini index” is often used for the
Gini impurity, so we need to check the context carefully to avoid confusion.
Similar to the Gini index discussed above, the Gini impurity is a measure of the
homogeneity of a distribution of elements in a set and is related to the probability of
incorrectly classifying an object in a data set. Suppose that we have N classification groups
or classes in a given dataset and let $p_i$ be the probability of a random instance
belonging to class i. Then we have the following cases for two subsequent experiments where
we assign a class to an element of the dataset:
1. We obtain the identical output for the same category i with probability $p_i^2$.
2. We obtain the identical output, irrespective of the category, with probability
$\sum_{i=1}^{N} p_i^2$.
3. Using the above, we obtain two different outputs with probability
$1 - \sum_{i=1}^{N} p_i^2$.
Therefore, to find the Gini impurity, we need to find the probability of being wrong
about any given classification and then sum over all classifications. The Gini impurity is
$$G = \sum_{i=1}^{N} \sum_{j \neq i} p_i p_j. \tag{6.2}$$
It is sometimes computationally useful to write this formula in other ways. Recall that
$\sum_{i=1}^{N} p_i = 1$, which means that we must assign each item to one of the available
classes, and therefore $p_i = 1 - \sum_{j \neq i} p_j$. Observe that
$$\begin{aligned}
G &= \sum_{i=1}^{N} p_i \sum_{j \neq i} p_j \\
  &= \sum_{i=1}^{N} p_i (1 - p_i) \\
  &= \sum_{i=1}^{N} (p_i - p_i^2) \\
  &= \sum_{i=1}^{N} p_i - \sum_{i=1}^{N} p_i^2 \\
  &= 1 - \sum_{i=1}^{N} p_i^2
\end{aligned}$$
where we used the fact that $\sum_{i=1}^{N} p_i = 1$ in the last step, which is a result of the fact
that there are no other possible outcomes except the N classifications.
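Both forms of the Gini impurity can be compared numerically. The following Python sketch evaluates the double sum of equation 6.2 and the simplified form side by side; the function names and the three class probabilities are illustrative assumptions.

```python
def gini_impurity(probabilities):
    """Gini impurity 1 - sum(p_i^2) for class probabilities that sum to one."""
    return 1.0 - sum(p * p for p in probabilities)


def gini_impurity_pairwise(probabilities):
    """Same quantity via the double sum over i and j != i from equation 6.2."""
    n = len(probabilities)
    return sum(probabilities[i] * probabilities[j]
               for i in range(n) for j in range(n) if j != i)


probs = [0.5, 0.3, 0.2]  # hypothetical class probabilities
print(gini_impurity(probs), gini_impurity_pairwise(probs))
```

Both calls agree up to rounding (about 0.62 for these probabilities), and a set containing only one class has impurity zero.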
6.3 Entropy, Shannon Entropy, Kullback-Leibler Divergence
What is Entropy?
Apart from the laws of quantum mechanics, entropy is perhaps the most confusing
physical quantity. In our everyday language, it is associated with the degree of
randomness in a system. For example, we would say that a cube of sugar dissolved in tea has
a higher level of randomness, as it is natural for the sugar to dissolve, but we have
never observed sweet tea spontaneously separating into tea and a cube of sugar
at the bottom. We also often hear that “entropy defines the arrow of time.” The origin
of these analogies is understandable, but they do not quite capture the concept of
entropy. Furthermore, entropy has historically been introduced first in thermodynamics
and then in statistical physics. At first glance, both appear very different but, after
careful consideration, they are equivalent to each other. Before turning to information
science, it is therefore useful to understand entropy on a more fundamental level.

Arrow of Time: The arrow of time is a concept indicating that time always moves forward
(and not backward) and that reactions follow this direction.

We start with the thermodynamic understanding of entropy and remind ourselves that
while some reactions occur spontaneously, others do not. A good textbook that covers
this subject in more detail is Physical Chemistry (Atkins & de Paula, 2006, p. 573 ff). For
example, a hot drink such as tea cools down to ambient temperature, a gas expands
into the available volume and a ball bounces a bit lower each time it hits the ground
until it comes to rest. In the case of the ball, we can intuitively understand this as with
each bounce, the ball transfers some of its kinetic energy to the ground, which is trans-
formed into the random thermal motion of the atoms in the ground, i.e. the ground
heats up a little bit. However, we have never observed a ball resting on warm
ground spontaneously jump into the air. This could only happen if all the atoms in the
ground acted together and pushed the ball away. We can then identify spontaneous
reactions such as the bouncing ball or expanding gas by looking for changes that lead
to the dispersal of energy of the system: As the ball bounces a bit less each time, it
loses energy, which is transferred into the random motion of atoms in the ground.

The thermodynamic definition of entropy is then centered on the idea that a change in
a system is related to the energy it loses in the process, which in turn can be expressed
by the amount of energy that is transferred by heat. This sounds quite complicated but
in thermodynamics, the (inner) energy of a system is a measure of how much work a
given system can do. For example, a compressed gas can turn a turbine, whereas a gas
filling the available space cannot. As we have seen in the example of the bouncing ball,
heat is related to the random motion of the atoms as opposed to uniform motion in
the case of work. We can then conclude that the ability of a system to perform “useful”
work is reduced in proportion to the amount of heat transferred into random motion.
Furthermore, it is intuitively reasonable that this depends on the temperature: The
effect of adding a bit more heat to an already hot system is much less than to a cold
system.

Inner Energy: The inner energy can be changed by either transferring energy as heat or
performing work: $dU = \delta Q + \delta W$

This is indeed the thermodynamic definition of entropy:
$$dS = \frac{\delta Q}{T} \tag{6.3}$$
where S denotes the entropy, δQ the incremental amount of heat exchanged, and T the
temperature of the system. This definition gives us an understanding of why we asso-
ciate entropy with randomness if we think back to the example of the bouncing ball. As
a bit of heat is transferred to the ground, the atoms in the ground move around a bit
more and their motion becomes more random or more unordered.
Unfortunately, while the thermodynamic understanding of entropy explains our everyday
understanding well, it does not really help us to relate the entropy to any concept
in information science. We therefore need to turn to statistical physics to gain a deeper
understanding. In statistical physics, we are concerned with the emergent properties of
large ensembles and not primarily with a detailed description of the interaction of two
or maybe a few atoms or molecules on a quantum level. Instead, we analyze how a
large number of molecules behave and treat them as little hard balls hitting each
other. This simplification allows us to analyze macroscopic quantities: For example, one
mole of water consists of approximately $6 \cdot 10^{23}$ water molecules. It would be near
impossible to calculate all precise effects of this large number of molecules, nor is it
necessary to do so in order to describe, for example, how a small amount of water heats up
when we put it on a stove. In this approach we neglect the contribution to the total
energy of a system that arises from the interaction of molecules, but instead assume
that the molecules fly around as tiny “billiard balls,” hitting each other constantly and
thus not only changing energy but also changing modes of motion. Our large ensemble
consequently consists of N of these billiard balls or molecules, and each molecule is in a
specific state of energy ϵi. This concept of “state of energy” is important as the different
levels of energy ϵi are not continuous but discrete. For example, a number of molecules
are in the ground state ϵ0, others in the next level ϵ1, and so on. We can imagine that in
the ground state ϵ0, all molecules are at rest in an ordered lattice and no longer move;
this is not exactly correct, but it serves as a useful analogy. As we are only concerned
with large numbers of molecules, we say that, on average, ni molecules occupy the
energy state ϵi. The laws of statistical physics then tell us that the distribution of
molecules across the possible states is governed by a single parameter: the temperature.

Mole: A mole is the base unit for the amount of a substance and contains exactly
$6.02214076 \cdot 10^{23}$ particles (atoms, molecules, etc.).
Ground State: The ground state is the lowest energy state of a particle (e.g. atom,
molecule).

We can visualize this in the following way: The hotter a system is, i.e. the higher its tem-
perature, the more energy states are accessible and the more the molecules can move
about and hit each other. In each collision, some molecules will lose a bit of energy
and go into a lower state, and other molecules will gain that energy and go into a
higher state but, on average, the population of states remains the same. At very low
temperatures, only a few energy states are accessible. This dependency already brings us
closer to our statistical understanding of entropy and how it is related to the
randomness or orderedness we have discussed above. As the temperature gets lower and
lower, only the ground state ϵ0 of the system is accessible and all particles are in that
state. In this case, we can write the configuration as {N, 0, 0, …}. If the temperature is a
bit higher, more states are accessible and another configuration of the system might be
{N − 2, 2, 0, …}, where the first state ϵ1 above the ground state is now accessible. In
general, the population of the system’s energy states is described by {n0, n1, n2, …},
which can be achieved in W different ways depending on which molecule is in which
state. If we imagine the system as lots of identical balls, we can see that we can obtain
the same configuration of states by many different choices of which ball goes into
which state as we cannot tell them apart. W is called the “weight” of the configuration
and is given by
$$W = \frac{N!}{n_0!\, n_1!\, n_2! \cdots}. \tag{6.4}$$
With these tools, we can define the Boltzmann entropy:
$$S = k_B \ln W \tag{6.5}$$
where $k_B$ is the Boltzmann constant
($k_B = 1.38 \cdot 10^{-23}\,\mathrm{m^2\,kg\,s^{-2}\,K^{-1}}$) and W is the weight
of the configuration. From the reasoning above, we can see that this quantity behaves
in the same way as the thermodynamic definition we have seen earlier (in fact, one can
show that they are equivalent). The only parameter that defines the system is the
temperature T, which we can modify by the exchange of some heat q. In the limit of T → 0,
only the ground state is accessible, which means that only one configuration is
possible, leading to W = 1 and hence, S = 0 as ln 1 = 0. As only one state is accessible, the
amount of “randomness” is minimal and increases as we raise the temperature (by
adding some heat q) because more states become accessible.
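To make the weight of a configuration more concrete, the following Python sketch evaluates equations 6.4 and 6.5 for a toy system of ten molecules. The function names and the two configurations are assumptions for illustration; real systems contain far too many molecules for exact factorials to be practical.

```python
import math

K_B = 1.38e-23  # Boltzmann constant in J/K, rounded as in the text


def weight(populations):
    """Number of ways W = N! / (n0! n1! n2! ...) to realize a configuration."""
    w = math.factorial(sum(populations))
    for n_i in populations:
        w //= math.factorial(n_i)
    return w


def boltzmann_entropy(populations):
    """S = k_B ln W for the given configuration."""
    return K_B * math.log(weight(populations))


print(weight([10, 0, 0]))  # 1: all molecules in the ground state
print(weight([8, 2, 0]))   # 45: two molecules promoted to the first excited state
print(boltzmann_entropy([10, 0, 0]), boltzmann_entropy([8, 2, 0]))
```

The configuration with all molecules in the ground state has W = 1 and therefore zero entropy, while promoting two molecules already allows 45 arrangements and a positive entropy.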
There are, of course, exceptions to this. One example is carbon monoxide (CO). The
ground state is such that a carbon atom C is followed by an oxygen atom O and, as the
temperature gets lower and lower, the only accessible state should be CO CO CO… as
the system slowly “freezes” into a regular lattice. However, the state OC is not much
different from CO in terms of energy and hence, it can happen that the configuration
OC is “trapped” as not enough energy is available to flip into CO and as T → 0, our
lattice might look like this: COCOOC…. This gives us a first glimpse of how the entropy
can be used in relation to information science if we imagine that we could denote the
configuration CO with 0 and OC with 1 and the above sequence expressed as a bit
stream would read 001….
For later use, it is convenient to rewrite the logarithm of the weight in the Boltzmann
entropy as
$$\ln W = \ln \frac{N!}{n_0!\, n_1!\, n_2! \cdots} \tag{6.6}$$
$$= \ln N! - (\ln n_0! + \ln n_1! + \ln n_2! + \cdots) \tag{6.7}$$
$$= \ln N! - \sum_i \ln n_i!. \tag{6.8}$$
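We can simplify the factorials using Stirling’s approximation, $\ln x! \approx x \ln x - x$:
$$\ln W = \ln N! - \sum_i \ln n_i! \tag{6.9}$$
$$\approx N \ln N - N - \sum_i (n_i \ln n_i - n_i) \tag{6.10}$$
$$= N \ln N - \sum_i n_i \ln n_i, \tag{6.11}$$
where the last step uses $\sum_i n_i = N$ to cancel the linear terms.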
Using $N = \sum_i n_i$ to write $N \ln N = \sum_i n_i \ln N$, we can express the entropy as
$$S \approx k_B \sum_i (n_i \ln N - n_i \ln n_i) \tag{6.12}$$
$$= -k_B \sum_i n_i \ln \frac{n_i}{N} \tag{6.13}$$
$$= -N k_B \sum_i p_i \ln p_i \tag{6.14}$$
where $p_i = n_i/N$ is the fraction of molecules in state i or the probability that a
molecule is in state i.
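The quality of Stirling’s approximation in this derivation can be checked numerically. The sketch below compares the exact value of ln W with the approximate form −N Σ p_i ln p_i for a hypothetical configuration of one million molecules; the function names and the populations are assumptions, and `math.lgamma(n + 1)` is used as a numerically stable way to obtain ln n!.

```python
import math


def ln_weight_exact(populations):
    """ln W = ln N! - sum_i ln n_i!, using lgamma(n + 1) = ln n!."""
    total = sum(populations)
    return math.lgamma(total + 1) - sum(math.lgamma(n + 1) for n in populations)


def ln_weight_stirling(populations):
    """Stirling form -N * sum_i p_i ln p_i with p_i = n_i / N (equation 6.14 without k_B)."""
    total = sum(populations)
    return -total * sum((n / total) * math.log(n / total)
                        for n in populations if n > 0)


config = [500_000, 300_000, 200_000]  # hypothetical populations n0, n1, n2
exact = ln_weight_exact(config)
approx = ln_weight_stirling(config)
print(exact, approx, abs(exact - approx) / exact)
```

For populations of this size the relative difference between the two values is far below one percent, which is why the approximation is safe for macroscopic numbers of molecules.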
Shannon Entropy
The father of information theory, Claude Shannon, introduced the term entropy to
describe the minimum encoding size necessary to send a message without information
loss (Shannon, 1948). This has two components to it. The first asks what maximal
compression rate we can achieve to transmit the information; this is related to the entropy.
The second is concerned with the technical implementation and is related to the
maximal capacity of a transmission channel. The latter is part of electrical engineering and,
for the remainder, we focus on the first part.
In information theory, we are concerned with the amount of information that we can
obtain from a system and the information content of some event A is defined as
$$I(A) = -\log_2 p(A) = \log_2 \frac{1}{p(A)} \tag{6.15}$$
where p(A) is the probability that the event occurs. We notice that the information con-
tent decreases when the probability of an event increases — the more likely an event
becomes, the less “surprised” we are about it and the more we expect it, which implies
that it merely confirms the information we already had. In the extreme case that p(A)
= 1 where the event always occurs, no further information is added. We also note that
the information due to independent events is additive: I(A1∩A2) = I(A1) + I(A2).
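The additivity for independent events can be illustrated with a few lines of Python. This is a minimal sketch; the function name `information_content` and the two event probabilities are assumptions chosen so that the results are whole numbers of bits.

```python
import math


def information_content(probability):
    """Information content I(A) = -log2 p(A) in bits, for 0 < p(A) <= 1."""
    return -math.log2(probability)


p_a1, p_a2 = 0.5, 0.25   # hypothetical probabilities of two independent events
joint = p_a1 * p_a2      # probability that both events occur

print(information_content(p_a1))   # 1.0
print(information_content(p_a2))   # 2.0
print(information_content(joint))  # 3.0
```

The information content of the joint event is the sum of the individual contributions, because the logarithm turns the product of probabilities into a sum.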
We now turn to larger systems that are described by some discrete variable X that can
take the values {x1, x2, …, xn} according to some probability distribution p(X). The
Shannon entropy is then defined as the average information content of an outcome
$$H = E[I(X)] = E[-\log_2 p(X)] \tag{6.16}$$
$$= -\sum_i p(x_i) \log_2 p(x_i) \tag{6.17}$$
where E[.] is the expectation value we use to calculate the average. Comparing this
definition to equation 6.14, we find that they are the same, apart from the change in
base from the natural logarithm to base two and the fact that the Shannon entropy
does not have the constants $N k_B$, as it is not directly related to a physical system.
Since we have already studied the entropy of ensembles in the context of statistical
physics, this connection is not surprising. In both cases, we are concerned with large
systems that are described in terms of some probability function p that determines the
likelihood that a possible discrete value or state is occupied.

Expectation Value: $E[x] = \int x\, p(x)\, dx$ or $E[x] = \sum_i x_i p_i$
So far, the event space was discrete. By considering a probability density function
rather than a frequency, it is possible to meaningfully discuss Shannon’s entropy for
underlying variables that have infinitely many possible values. The underlying topology
of the variable we are measuring becomes important. If the possible values are the real
numbers, equation 6.17 can be written in integral form
$$H(x) = -\int p(x) \log_2 p(x) \, dx \tag{6.18}$$
where p(x) represents the probability density function.
Example
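We can compute the entropy of a coin toss. For a fair coin, heads and tails come
up with equal probability of 50 percent. The Shannon entropy is
$$\begin{aligned}
H &= -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) \\
  &= -\sum_{i=1}^{2} \frac{1}{2} \log_2 \frac{1}{2} \\
  &= -\sum_{i=1}^{2} \frac{1}{2} \cdot (-1) \\
  &= 1 \text{ bit}.
\end{aligned}$$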
In this case, the entropy is maximal as we cannot predict the outcome of the next
coin toss from what we have observed so far. Hence, we need one bit per coin toss
to encode the resulting information if heads or tails comes up. However, if the coin
is not fair so that heads comes up with a higher probability p than the probability q
for tails, our entropy would be different: H = −p log2(p) − q log2(q). This number is
smaller than one as the probability of heads is now higher and we are less “sur-
prised” if heads comes up.
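The effect of an unfair coin can be explored numerically. The sketch below implements equation 6.17 for a list of probabilities; the function name `shannon_entropy` and the biased probabilities 0.9 and 0.1 are illustrative assumptions.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy H = -sum p_i log2 p_i in bits; terms with p_i = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(shannon_entropy([0.5, 0.5]))  # 1.0, the fair coin
print(shannon_entropy([0.9, 0.1]))  # a hypothetical biased coin, below one bit
```

The biased coin yields roughly 0.47 bits per toss, so on average fewer bits are needed to encode its outcomes than for the fair coin.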
Example
Calculate the Shannon entropy of the string 00100010.
First, we note that our system only has two states, zero and one. Counting the num-
ber of each, we find that we have six zeros and two ones out of eight characters.
Hence, the probability to obtain a zero is p(0) = 6/8 = 3/4 and the probability to
obtain a one is p(1) = 2/8 = 1/4. The Shannon entropy is then
H = −0.75 log2(0.75) − 0.25 log2(0.25) ≈ 0.811 bits. We can compare this to the coin
toss above: If the zeros (heads) occurred as often as the ones (tails), we would
have H = 1 bit. However, the zeros occur much more frequently than the ones,
hence encountering a zero conveys less information, as we can guess with a
probability of 3/4 that the next character will be a zero.
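The counting step of this example is easy to automate. The following Python sketch estimates the probabilities from the relative character frequencies of the string; the function name `string_entropy` is an assumption for illustration.

```python
import math
from collections import Counter


def string_entropy(text):
    """Shannon entropy in bits per character, from relative character frequencies."""
    counts = Counter(text)
    length = len(text)
    return -sum((c / length) * math.log2(c / length) for c in counts.values())


print(round(string_entropy("00100010"), 3))  # 0.811
```

A string with equally many zeros and ones, such as "0101", would give exactly one bit per character.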
The Kullback-Leibler Divergence
We can use the concept of entropy to determine how different two probability distribu-
tions p(x) and q(x) are. For each of them, we can define the Shannon entropy and we
can define the relative entropy or the Kullback-Leibler (KL) divergence between p(x)
and q(x) as
$$D_{KL}(p \parallel q) = \sum_i p(x_i) \log_2 \frac{p(x_i)}{q(x_i)} \tag{6.19}$$
for discrete distributions p and q, and
$$D_{KL}(p \parallel q) = \int p(x) \log_2 \frac{p(x)}{q(x)} \, dx \tag{6.20}$$
for the continuous case.
The relative entropy between two probability distributions over the same random
variable is a measure of how different the two distributions are. It satisfies Gibbs’
inequality
$$D_{KL}(p \parallel q) \geq 0 \tag{6.21}$$
where DKL(p||q) = 0 if and only if p(x) = q(x). The Kullback-Leibler divergence is sometimes
also called the KL “distance” although it is not, strictly speaking, a distance, since it is
not symmetric in p and q, i.e. its value generally changes if p and q are interchanged.
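The asymmetry is easy to see in a small numerical example. The sketch below implements the discrete form of equation 6.19; the function name `kl_divergence` and the two distributions are assumptions, and the code assumes that q is non-zero wherever p is non-zero.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log2(p_i / q_i) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


p = [0.75, 0.25]  # hypothetical distribution, e.g. the string example above
q = [0.5, 0.5]    # hypothetical comparison distribution, e.g. a fair coin

print(kl_divergence(p, q))
print(kl_divergence(q, p))
print(kl_divergence(p, p))  # 0.0
```

The first two values are both positive but differ from each other (roughly 0.19 and 0.21 bits), while the divergence of a distribution from itself vanishes.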
6.4 Cross Entropy

Unsurprisingly, we are not always right when we infer a probability distribution from a
data set. We want to be able to discuss this possibility more formally, and it is for this
purpose that we introduce cross entropy. Given two probability distributions, let’s call
them p and q, on the same underlying set of variables, suppose p is the true distribu-
tion, but q is the distribution for which we have optimized. How hard is it (measured in
number of bits of data that we need) to identify an event in the space? The cross
entropy of p and q is an effort to answer this question.
To define the cross entropy, we will use the entropy of the random variable x and the
Kullback-Leibler divergence between the true probability distribution p and the one we
use to estimate it, q. In essence, “how hard it is” to identify an event in the space is the
natural difficulty (uncertainty), which is measured by the entropy, plus the added
difficulty induced by estimating p with q, which is quantified by the Kullback-Leibler
divergence. Recall that the Kullback-Leibler divergence DKL(p||q) defined by equation 6.19 is
$$D_{KL}(p \parallel q) = \sum_i p(x_i) \log_2 \frac{p(x_i)}{q(x_i)}.$$
Using the properties of logarithms, this can be rewritten as
$$\begin{aligned}
D_{KL}(p \parallel q) &= \sum_{i=1}^{N} p(x_i) \log_2 \frac{p(x_i)}{q(x_i)} \\
  &= \sum_{i=1}^{N} p(x_i) \log_2 p(x_i) - \sum_{i=1}^{N} p(x_i) \log_2 q(x_i) \\
  &= -H(p) + H(p, q),
\end{aligned} \tag{6.22}$$
where H(p) is the Shannon entropy of the distribution p, defined by equation 6.17, and
H(p, q) is defined to be
$$H(p, q) = -\sum_{i=1}^{N} p(x_i) \log_2 q(x_i), \tag{6.23}$$
and is called the cross entropy of p and q. The cross entropy H(p, q) represents
the average number of bits required for us to identify an event given that we
have coded our scheme using distribution q when the true distribution is p. Due to the
asymmetry of the Kullback-Leibler divergence, the cross entropy is also generally
asymmetric, that is, in general H(p, q) ≠ H(q, p). Another basic attribute of cross entropy is
that it is bounded below by the entropy of the true distribution: since DKL(p||q) ≥ 0,
equation 6.22 gives H(p, q) = H(p) + DKL(p||q) ≥ H(p). The smallest possible cross entropy
is therefore obtained when we use the true distribution in our coding scheme.
Namely, setting q = p in equation 6.22 we see
$$H(p, p) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i) = H(p),$$
which is the Shannon entropy.
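The relation between entropy, Kullback-Leibler divergence, and cross entropy in equation 6.22 can be verified numerically. The following Python sketch is one way to do so; the function names and the two distributions are assumptions, and q is assumed to be non-zero wherever p is non-zero.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log2 q_i; assumes q_i > 0 wherever p_i > 0."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def kl_divergence(p, q):
    """D_KL(p || q) in bits."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


p = [0.75, 0.25]  # hypothetical true distribution
q = [0.5, 0.5]    # hypothetical distribution used for the coding scheme

print(cross_entropy(p, q))                 # 1.0
print(entropy(p) + kl_divergence(p, q))    # same value up to rounding
print(cross_entropy(p, p) - entropy(p))    # 0.0
```

Coding with the fair-coin distribution q costs one bit per event, compared with about 0.81 bits when the true distribution p is used, and the difference is exactly the Kullback-Leibler divergence.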
In machine learning applications, the cross-entropy is also often used as a loss function
during the optimization of the model, in particular for classification tasks where
events are sorted into two or more categories. During model building and training, we
know the true category that the event is in. This is our p, namely pk = 1 for the true
category and pl = 0 for all others. The prediction model returns a probability for each
possible category, for example q1 = 0.1, q2 = 0.7, q3 = 0.01, …, where the sum ∑i qi = 1
as the event has to belong to one of the categories. Hence, the
cross-entropy determines how well the model q describes the true p.

Machine Learning: In machine learning, algorithms are not explicitly programmed, but use
data to learn specific relationships.
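With pk = 1 for the true category and pl = 0 for all others, equation 6.23 collapses to a single term, H(p, q) = −log2 qk. The sketch below illustrates this for a three-class problem; the function name and both prediction vectors are hypothetical, and the logarithm to base two follows the convention of this unit (software libraries often use the natural logarithm instead).

```python
import math


def cross_entropy_loss(true_index, predicted):
    """Cross entropy in bits for a one-hot true distribution:
    -log2 of the probability the model assigns to the true category."""
    return -math.log2(predicted[true_index])


confident = [0.1, 0.7, 0.2]  # hypothetical model output favoring category 2
unsure = [0.4, 0.3, 0.3]     # hypothetical model output with little preference

# The true category is the second one (index 1) in both cases.
print(cross_entropy_loss(1, confident))
print(cross_entropy_loss(1, unsure))
```

The model that assigns a higher probability to the true category receives the smaller loss, which is the behavior an optimizer exploits during training.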
Summary
Information science is a multidisciplinary field that studies the collection,
classification, storage, processing, and dissemination of information. The field is
concerned with both the underlying theoretical framework and the
practical applications. Information science incorporates aspects from a wide range
of fields such as computer science, cognitive science, and social science. One key aspect of
the underlying theory in information science is to understand how much informa-
tion is contained in a data stream and how to transmit this information in the
smallest possible lossless encoding. An important tool used in this is entropy.
Understanding how we can use entropy to describe the emergent properties of
large ensembles of physical systems, we can also understand how the concept is
used in information science.
We also discuss how to evaluate predictive models, starting with the widespread
method of the Mean Squared Error (MSE), which is used to quantify the quadratic
deviation between a model and the observed data. The Gini index is often confused
with the Gini impurity. The Gini index is a statistical measure of distribution,
commonly applied to study the income (or wealth) distribution of a country. The Gini
impurity also measures how elements are distributed, but it is used in machine learning,
in particular in the construction of decision trees, to determine the probability of
being wrong about any given classification. The Kullback-Leibler divergence
between two probability distributions over the same variable is used to describe
the degree of similarity between the two distributions. We conclude this unit by
investigating what it actually means to estimate the entropy and how accurately we
are able to do so, and we develop the tool of cross-entropy, used to compare prob-
ability distributions, to help us.
Knowledge Check
learning platform.
Good luck!
Evaluation
Congratulations!
You have now completed the course. After you have completed the knowledge tests on
the learning platform, please carry out the evaluation for this course. You will then be
eligible to complete your final assessment. Good luck!
Appendix 1
List of References
Atkins, P. W., & de Paula, J. (2006). Physical chemistry (8th ed.). W. H. Freeman.
Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for machine learning.
Cambridge University Press. http://doi.org/10.1017/9781108679930
Gini, C. (1912). Variabilità e mutabilità. Memorie di metodologica statistica.
Gini, C. (1921). Measurement of inequality of incomes. The Economic Journal, 31(121),
124–126. http://dx.doi.org/10.2307/2223319
HiClipart. (n.d.). White panel sliding door illustration.
https://www.hiclipart.com/free-transparent-background-png-clipart-iozib
Johnson, B., & Johnson, J. (2012, April 28). Cross-product in vector algebra.
https://www.thunderbolts.info/wp/2012/05/02/appendix-i-vector-algebra/cross-product-in-vector-algebra/
Loomis, L., & Sternberg, S. (2014). Advanced calculus. World Scientific Publishing Company.
Math 10. (n.d.). Vector operations.
https://www.math10.com/en/geometry/vectors-operations/vectors-operations.html
Oppenheim, A., Willsky, A. S., & Nawab, S. H. (1997). Signals & systems (2nd ed.). Prentice
Hall.
Pexels. (n.d.). Shallow focus photography of a cavalier king charles spaniel.
https://www.pexels.com/photo/shallow-focus-photography-of-a-cavalier-king-charles-spaniel-1390361/
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical
Journal, 27(3), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Strang, G. (2017). Calculus (3rd ed.). Wellesley-Cambridge Press.
Appendix 2
List of Tables and Figures
A Parabola: f(x) = x^2
Source: Author.
f(x) = |x|
Source: Author.
Constant Velocity Against Time

Source: Author.
Variable Velocity Against Time

Source: Author.
Parabola for x > 0

Source: Author.
A Function Depending on Two Variables f(x, y) = x^2 + y^2

Source: Author.
Schematic View of a Function f(x, y) which is to be Integrated

Source: Author.
Illustration of the Concept of Calculus of Variations with a Rope Hanging From Points A and B
Source: Author.
Two Points A, B Connected By a Path

Source: Author.
Three Examples of Biased and Unbiased Resolution Functions

Source: Author.
Convolution of Two Overlapping and Non-Overlapping Uniform Functions

Source: Author.
Convolution of a Gaussian and a Γ Function

Source: Author.
Multivariate Gaussian Function

Source: Author.
Illustration of Gaussian Blurring: Original, Medium, and Strong Blurring

Source: Pexels, n.d.
Periodic Sine Signal with One Contributing Frequency

Source: Author.
Periodic Sine Signal with Two Contributing Frequencies

Source: Author.
Periodic Sawtooth Signal

Source: Author.
Definition of a Low-Pass Filter

Source: Author.
Extracted Signal in the Time Domain Using a Low-Pass Filter

Source: Author.
Example of a Position Vector

Source: Math 10, n.d.
Graphical Representation of a Force of 10 N in a Direction of 30° North East

Source: Author.
Illustration of Vectors PQ and RS

Source: Author.
Addition of Two Vectors

Source: Author.
Difference of Two Vectors, a − b

Source: Author.
Addition of Two Vectors in Component Form

Source: Math 10, n.d.
Components of a Three-Dimensional Vector

Source: Author.
Angle Between Vectors

Source: Author.
Geometric Interpretation of the Cross-Product Between Vectors

Source: Johnson & Johnson, 2012.
Illustration of the Rate of Change of the Vector Function a(u)

Source: Author.
A Force F Applied Along a Curve r

Source: Author.
Visualization of a Scalar Field

Source: HiClipart, n.d.
Illustrations of Vector Fields in Every Day Life

Source: Pexels, n.d.
Visualization of a Vector Field Using Arrows to Indicate Direction and Strength of the
Field at Each Position
Source: Author.
Visualization of a Vector Field as a Stream Plot

Source: Author.
Example of a General Lorenz Curve

Source: Author.
Lorenz Curve for an Even Income Distribution

Source: Author.
Cumulative Proportions of Population versus Cumulative Percentage of Income

Source: Author.
Lorenz Curve for the Second Exercise

Source: Author.
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt
Mailing address:
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
Phone: +49 30 311 988 55

media@iu.org
