June 2, 2022
Contents
1 Introduction to the course 3
1.1 Goals of machine learning and of this course . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Zooming in on Linear Algebra, and Optimization and ML . . . . . . . . . . . . . . . . . . . . 4
1.3 Remainder of the lecture notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Optimization basics 81
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.2 Building up the gradient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3 What do we converge to? Convex functions and global optimality . . . . . . . . . . . . . . . . 89
3.4 Modelling losses for ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.5 What distinguishes optimization methods used for ML? . . . . . . . . . . . . . . . . . . . . . 97
3.6 Gradient descent: a simple proof of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.7 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6 Clustering 127
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 K-means clustering - Euclidean distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.3 K-means clustering - kernelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.4 Graph-based clustering problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.5 Cut-based clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.6 Spectral clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.7 Final comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
1 Introduction to the course
1.1 Goals of machine learning and of this course
Welcome to the course! This is a course about machine learning (ML), which is a reasonably ‘trendy’ subject
these days. One might say: yet another course about machine learning. What kind of machine learning will
that be? These are legitimate questions because ML is not a single thing; instead, it consists of multiple
sub-domains: supervised learning, unsupervised learning, semi-supervised learning, and recently we can
speak of the emergence of deep learning as a separate branch of ML, although, in fact, it supports the earlier
three sub-domains. All of these are worlds of their own, and you can spend a lifetime exploring the
literature related to them. There are many books and many ways in which you can approach this subject:
1. understanding of the mathematical structures with which the data can be represented;
2. mathematical formulation of the goal we aim to achieve (which is usually pretty vague at the starting
point), including a judgement of whether it is achievable at all, or at the cost of which mathematical/computational
difficulties;
3. designing efficient computational procedures that achieve the goal in a reasonable time.
This interplay of the data (the dirty part) and mathematics (the pure part) is where a lot of theoretical
knowledge and practice are needed. Often, achieving the ‘ideal’ goal will not be possible (in short: you cannot
solve your problem by stating a theorem that will fit the data), but life still requires solutions that work
most of the time, so we aim to teach you what to pay attention to in order to build a good tool that delivers
what you want, most of the time.
This is a good place to say that we will not discuss machine learning from a probabilistic or statistical
perspective, and hence, you will not see many symbols like P here. Staying probability-free, we will assume
that data of certain types is given, and we will focus on how to construct machine learning models from it;
this process is also referred to as learning from the data. Now, we are ready to explain where the course title
comes from. Surprisingly, once the data is given, machine learning models for different domains all consist of
elementary mathematical building blocks from optimization and linear algebra. To put it differently, when
[Figure: diagram relating Machine Learning, Optimization, and Linear Algebra.]
Figure 1.1: It is difficult to strictly separate machine learning from optimization and linear algebra. All machine
learning algorithms discussed in our lectures are based on linear algebra and optimization; optimization
always uses linear algebra.
presented with a new problem in which your task is to perform ‘some kind of learning’, it is very likely that
any tool you come up with is going to require solving an optimization problem (at the higher level) and
linear algebraic techniques (lower level) to make it work.
However, we will not strictly split the course among linear algebra and optimization topics since many
building blocks of machine learning algorithms have aspects from both fields; see also fig. 1.1. We will often
use the term linear algebra synonymously with numerical linear algebra. This is because ultimately, we are
always aiming at using the algorithms on a computer; hence, it is important to investigate them under this
condition.
From the implementation perspective, to understand the algorithms in depth, we will focus on implementing
the ML algorithms from scratch instead of using black-box (Python) packages. Therefore, you can
expect a certain amount of implementation work during the course.
We will lack the time to gain experience in how to optimally tune machine learning algorithms
for the use case at hand. This we will leave to applied machine learning courses and the further literature,
partly given above. Nonetheless, we want to gain insights into the relevance and influence of the tunable
parameters (a.k.a. hyperparameters) of machine learning algorithms, such that you can
1. modify the settings and hyperparameters of existing ML packages and
2. build computationally reliable tools/packages yourself.
• Matrix-matrix, matrix-vector, and vector-vector multiplication, that is,

    C = A · B,        (1.1)
    y = A · x, and    (1.2)
    c = yᵀ · x.

  Consider the special case of A, B ∈ R^{n×n} being dense and x ∈ R^n. Then,

    A · x

  requires n² scalar multiplications and n(n − 1) scalar additions. Since, on a normal CPU (central
  processing unit), addition and multiplication can be performed as one floating point operation
  (FLOP), the cost is essentially O(n²) FLOPs.
  On the other hand,

    A · B

  equals multiplying A with n vectors (the columns of B). Hence,

    (A · B) · x    (1.3)

  requires O(n³ + n²) = O(n³) FLOPs, whereas

    A · (B · x)    (1.4)

  only requires O(2n²) = O(n²) FLOPs. The larger n, the larger
  the computational cost of eq. (1.3) compared to eq. (1.4).
• Solving a linear system of equations

    Ax = b.    (1.5)

  If the system is ‘nice’, i.e., the matrix A is square and invertible, we know that x can be computed
  as x = A⁻¹b. Explicitly inverting the matrix is, however, often infeasible in terms of computational work as well as
  memory consumption. Hence, (numerical) linear algebra and optimization algorithms for inexactly
  solving eq. (1.5) using, for instance, iterative schemes are used in practice; these also make use of the
  previous types of linear algebra operations, that is, summation and multiplication.
These and many other operations are at the core of ML techniques; they are also used in many optimization
algorithms; cf. fig. 1.1.
1.2.2 Optimization
Optimization is concerned with finding solutions to problems of the form

  min f(x)  subject to  x ∈ X,    (1.6)

where X is the set of admissible solutions and f(x) is the function whose value is to be minimized by selecting
x; of course, minimization and maximization are algorithmically equivalent because

  max_{x ∈ X} f(x) = − min_{x ∈ X} (−f(x)).

If we now choose, for instance,

  min_x ‖Ax − b‖²,    (1.7)

then we have formulated the problem of solving a linear system of equations, as in eq. (1.5), as an optimization
problem of finding an x that minimizes the mismatch between the Ax and b vectors; eq. (1.7) is also called
the least-squares problem. In some sense, this problem is more general than eq. (1.5): any x that is a solution
of eq. (1.5) is obviously also a solution of eq. (1.7). However, there are cases where we can find a solution of
the least-squares problem eq. (1.7), even if the linear equation system eq. (1.5) has no solution.
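To make this concrete, here is a minimal NumPy sketch (the small system below is made up for illustration): `np.linalg.lstsq` solves the least-squares problem eq. (1.7) and returns a solution even though this particular Ax = b has none.

```python
import numpy as np

# A made-up overdetermined system: 3 equations, 2 unknowns, no exact solution.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])

# np.linalg.lstsq minimizes ||Ax - b||_2, i.e., it solves eq. (1.7).
x, res, rank, sv = np.linalg.lstsq(A, b, rcond=None)

# Ax = b has no solution here, so the minimal residual is nonzero.
r = np.linalg.norm(A @ x - b)
```

The returned x satisfies the normal equations AᵀAx = Aᵀb, which can be checked numerically.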
Rewriting a linear system of equations as an optimization problem is just one example of the fact that
optimization and linear algebra have a lot of links between them, even without thinking about ML; we could
say that they were friends already before ML was fashionable.
Supervised learning. Supervised learning can be explained as data fitting - the process that tries to find
a mapping from one part of the data (features) to another part (labels). It is used when you have a dataset
in which many objects are equipped with certain features, and each of these objects is labelled. This can
correspond to the following pairs:
Object  | Features                          | Label
--------|-----------------------------------|---------------------------------------------
Person  | Income, education,                | Defaulting on loan repayment in the past
Meal    | Amounts of various ingredients    | Taste, rated from 1 to 10
Device  | Age, temperature when working     | Needs replacement or not
Photo   | A vector of pixel values          | Name of the object on the photo: cat or dog?
Your goal, when you encounter a new object equipped with a new set of features but whose label is not
known (or not available), is to guess the correct ‘label’. ML does this by trying to figure out a relationship between
the known objects’ features and labels in the training dataset and then using this estimated relationship to guess
the labels on new data. Why would one do something like this? Typically, this is because determining the
correct label for the new object is not possible (one would need to wait a long time) or very expensive effort-
or money-wise. If the ML-based model is able to make correct guesses often enough, then the downsides of
making an error every once in a while will be outweighed by the benefits of making the guesses fast.
More formally, supervised learning corresponds to having a dataset

  I = {x₁, …, xₙ},  O = {y₁, …, yₙ},

with n samples, such that

  xᵢ ↦ yᵢ

for 1 ≤ i ≤ n. This means that xᵢ contains the features (input) and yᵢ the labels (output) of the i-th data
sample. The features and labels can be integer- or scalar-valued. We can subdivide supervised learning into
classification, which corresponds to the case when the labels are integer-valued, and regression, where the
labels are scalar-valued. In supervised learning, the model is constructed to minimize the data misfit, which,
as mentioned before, corresponds to solving a minimization problem.
Example 1.2. Supervised Learning – Regression
The most classical examples of supervised learning are regression problems, known in some areas
as data-fitting. Graphically, this can be described as aiming to find a function that most closely
describes the mapping of the points on the x-axis in the picture below to their corresponding
y-axis coordinates (marked as red rectangles).

[Figure: data points (red) and fitted model curve (blue) over x ∈ [0, 10].]

This amounts to the minimization problem

  min_w Σᵢ (ŷᵢ − wᵀΦ(x̂ᵢ))²,

where we aim to minimize the sum of squares of the total discrepancies between the actual value
of the output ŷᵢ assigned to sample x̂ᵢ and the value wᵀΦ(x̂ᵢ), which is an inner product of
a vector w of parameters that we control and Φ(x̂ᵢ), where Φ is a point-to-vector mapping that gives our
function wᵀΦ(x) some ‘flexibility’. For example, the blue curve in the picture has been obtained by
taking

  Φ(x) = (1, x, x², x³)ᵀ,

whereby, naturally, it has to hold that w ∈ R⁴. In other words, the blue curve illustrates fitting the
best degree-3 polynomial to match the relationship of x and y in the data.
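A sketch of such a fit in NumPy (the data and coefficients below are invented for illustration, not those of the figure): we build Φ(x) = (1, x, x², x³)ᵀ row by row and solve the resulting least-squares problem for w.

```python
import numpy as np

# Toy data from a known cubic plus a little noise; w_true is an assumption
# made up for this illustration.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
w_true = np.array([2.0, -1.0, 3.0, -0.5])                  # coefficients of 1, x, x^2, x^3
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])    # rows are Phi(x_i)^T
y = Phi @ w_true + 0.1 * rng.standard_normal(x.size)

# Fit w by minimizing sum_i (y_i - w^T Phi(x_i))^2, a linear least-squares problem.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

With little noise, the recovered w is close to the coefficients that generated the data.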
Example 1.3. Supervised Learning – Classification
Another type of supervised learning is classification, which can be used, e.g., for guessing whether a given
tissue is healthy or cancer-afflicted. Suppose you have a number of observation vectors xᵢ and their
corresponding labels yᵢ ∈ {−1, 1}, i = 1, …, n. You would like to find out if it is possible to find
a relationship between the input values x and the labels y, to be able to predict labels in the future.
One of the ways to do it is to build up a support vector machine, or a hyperplane that separates the
groups of points with labels −1 and 1.
[Figure: separating hyperplane wᵀx = 0, with the margin boundaries wᵀx = 1 and wᵀx = −1 enclosing a buffer zone between the two classes.]

We look for a w that satisfies

  wᵀxᵢ ≥ 1   ∀i : yᵢ = 1,
  wᵀxᵢ ≤ −1  ∀i : yᵢ = −1.

Why not ≥ 0 and ≤ 0, respectively? In that case, the obvious solution would be w = 0, which would
not provide us with any useful tool.
The above system of inequalities can be written concisely as:

  yᵢ(wᵀxᵢ) ≥ 1  ∀i.
This is a nice idea, but such a good separation of points might not always be possible. What we can
do then is to find w such that the total sum of violations of the above inequalities is as small as
possible:

  min_w Σᵢ₌₁ⁿ max{1 − yᵢ(wᵀxᵢ), 0}.

Most often, we will prefer the problem to be formulated with a square,

  min_w Σᵢ₌₁ⁿ max{1 − yᵢ(wᵀxᵢ), 0}²,

because that will keep the function to be minimized smooth and differentiable, i.e., it will keep it a
nice function from an optimization point of view.
Optimization algorithms that help us find the best w will have to compute the gradient
of the minimized function (the objective function) w.r.t. the parameter w many times (remember: the gradient is
the direction of steepest ascent of a function, so minus the gradient should guide us towards decreasing
the function value). Hence, we will perform many iterations with the matrix X in which all the data xᵢ
is stored, and a vector y of all the labels yᵢ. This is the part of the procedure in which efficient linear
algebra will be of great importance.
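The following sketch implements plain gradient descent on the squared hinge objective above, with a tiny made-up dataset (the step size and iteration count are illustrative choices, not prescriptions):

```python
import numpy as np

def sq_hinge_loss(w, X, y):
    # f(w) = sum_i max{1 - y_i * (w^T x_i), 0}^2
    m = np.maximum(1.0 - y * (X @ w), 0.0)
    return float(np.sum(m ** 2))

def sq_hinge_grad(w, X, y):
    # gradient of f: sum_i 2 * max{1 - y_i * (w^T x_i), 0} * (-y_i * x_i)
    m = np.maximum(1.0 - y * (X @ w), 0.0)
    return -2.0 * (X.T @ (m * y))

# Tiny made-up dataset: two points per class, linearly separable.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
for _ in range(5000):                 # plain gradient descent, fixed step size
    w = w - 0.01 * sq_hinge_grad(w, X, y)
```

Because the data is separable, the loss can be driven essentially to zero, and the resulting w classifies all training points correctly by sign.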
Unsupervised learning. Another domain of ML – unsupervised learning – deals with data that does not
possess anything like labels, but in which we are still interested in certain patterns. One of the typical
examples here is clustering – a process in which, trying to make sense out of a huge amount of data (objects),
we try to subdivide the objects into groups where the items belonging to the same group are ‘similar’.
Example 1.4. Unsupervised Learning - Clustering
Consider that you have a group of people and are informed who of them knows/exchanges messages
with each other. That is, for each pair (i, j) of people, you know if there exists a relationship (1)
or not between them (0), with 0 for pairs (i, i) by convention. If you place all this information into
a matrix, you obtain a so-called adjacency matrix of the graph in which the nodes are persons, and
edges are existing relationships between them:
      ( 0 1 ⋯ 0 )
  A = ( 1 0 ⋯ 1 )
      ( ⋮ ⋮ ⋱ ⋮ )
      ( 0 1 ⋯ 0 )
Based on this information you might have to figure out what are the k ‘groups of friends’ among
them. There are many ways to translate this question into a mathematical goal, but all of them
will be an optimization problem, whose efficient solution will require the use of the linear algebraic
properties of the adjacency matrix A.
Note that the adjacency matrix is often sparse (for instance, in the example visualized above).
If possible, it is very important to take this property into account, since this significantly reduces
computational cost, in terms of both computational work and memory.
In order to formulate the partitioning of the graph into k clusters mathematically, we can, for instance,
consider the following optimization problem:
  min_{u_{ik} ∈ {0,1}}  Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ u_{ik} u_{jk} d(i, j),

where the u_{ik} are decision variables indicating whether the i-th element is assigned to the k-th cluster or not, and d(i, j)
defines a ‘distance’ between nodes i and j. In a graph context, it makes sense that the ‘distance’ is
defined in relation to the strength of the link between i and j through graph edges (and not their
Euclidean distance – think about it: it is more important how many people can connect me to a
given person rather than how far that person lives from me). For that purpose, a technique known
as spectral clustering uses a transformed version of the matrix A to solve this problem.
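As a sketch of that spectral approach (the small ‘friendship’ graph below is invented for illustration): the graph Laplacian L = D − A is one standard transformed version of the adjacency matrix, and the sign pattern of its second-smallest eigenvector recovers the two groups.

```python
import numpy as np

# Toy graph: two groups {0,1,2} and {3,4,5}, dense inside each group,
# and a single edge (2,3) connecting the groups.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# Graph Laplacian L = D - A, with D the diagonal matrix of node degrees.
D = np.diag(A.sum(axis=1))
L = D - A

# Eigenvectors of the symmetric Laplacian; eigenvalues come back sorted
# ascending, and the eigenvector for the second-smallest eigenvalue
# (the 'Fiedler vector') separates the two groups by its sign.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
labels = (fiedler > 0).astype(int)
```

For a connected graph, the smallest eigenvalue of L is 0; the sign split of the Fiedler vector is the core of spectral clustering for k = 2.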
Another example of unsupervised learning is dimensionality reduction, where we try to describe a
high-dimensional object with a low-dimensional structure. The question is then essentially which dimensions
can be dropped (possibly after applying some initial transformation to the data first), so that different objects
can still be distinguished while taking up less memory space. A perfect example here is image compression.
Example 1.5. Unsupervised Learning – Dimension Reduction
[Image: surface of Mercury.]
Let A be the matrix containing the pixel values of an image. Then, we can reduce the size of the image
by using dimension reduction techniques. In particular, we consider the singular value decomposition
(SVD)
  A = U Σ Vᵀ,   Σ = ( Σ_r  0 )
                    (  0   0 ),

where Σ ∈ R^{m×n} and

  Σ_r = diag(σ₁, σ₂, …, σ_r).
By replacing all diagonal entries σᵢ with i larger than some k by zero, we can reduce the size of the image.
Above, you can see the original image of size 1 144 × 1 071 = 1 225 224 pixels (left) and the image resulting
from taking only 38 diagonal entries from the SVD (right). The compression factor is ≈ 15.0.
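A minimal sketch of this compression in NumPy (using a random stand-in matrix, since the Mercury image itself is not available here):

```python
import numpy as np

# Stand-in 'image': any matrix of pixel values works for the mechanics.
rng = np.random.default_rng(0)
A = rng.random((120, 100))

# Truncated SVD: keep only the k largest singular values.
k = 38
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Storage for the rank-k factors vs. the full matrix.
full_storage = A.size                                    # 120 * 100 values
compressed_storage = k * (A.shape[0] + A.shape[1] + 1)   # U_k, V_k, and s_k
ratio = full_storage / compressed_storage
```

Keeping more singular values monotonically decreases the reconstruction error, at the price of a smaller compression factor.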
Semi-supervised learning Real life situations often involve a setup in which there is a certain amount of
labelled data (e.g. animal pictures, as in supervised learning), and much more unlabelled data (new animal
pictures, where nobody has indicated the type of an animal). The unlabelled data need not, however, be
useless because it can be similar in features to the old data (new dog pictures typically look more similar to
old dog pictures than to old cat pictures). In other words, one can try to use the unlabelled animal pictures
to create a better predictive model than one trained on the labelled data alone.
A very heuristic idea here is to build a supervised model on the labelled data, and to use it to predict
labels on a part of the unlabelled data. Then, one can add the ‘pseudo labelled’ data to the original labelled
dataset, and try to learn a new supervised model on the enlarged dataset. Under certain assumptions,
this approach, known as self-training, can work. A classical domain for this type of learning is language
processing.
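A toy sketch of the self-training loop just described, with an intentionally simple nearest-centroid ‘supervised model’ (all data and design choices here are assumptions for illustration, not a standard recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(X, y):
    # 'Supervised model': one centroid per class; predict the nearer one.
    return np.array([X[y == 0].mean(), X[y == 1].mean()])

def predict(centroids, X):
    return (np.abs(X - centroids[1]) < np.abs(X - centroids[0])).astype(int)

# Two 1-D classes centered at -2 and +2; only 4 points carry labels.
X_lab = np.array([-2.1, -1.9, 2.0, 2.2])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.concatenate([rng.normal(-2, 0.5, 50), rng.normal(2, 0.5, 50)])

centroids = fit_centroids(X_lab, y_lab)
for _ in range(3):
    # Pseudo-label the unlabelled data, then retrain on the enlarged dataset.
    pseudo = predict(centroids, X_unlab)
    X_all = np.concatenate([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, pseudo])
    centroids = fit_centroids(X_all, y_all)
```

After a few rounds, the centroids are estimated from 104 points instead of 4, which is the whole point of self-training.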
Semi-supervised learning, essentially, involves algorithms that use the tools of (un)supervised learning,
therefore we will skip giving very specific examples here and move on directly to the example of neural
networks.
Deep learning a.k.a. neural networks. Separating deep learning from the rest of ML is, on a theoretical
level, incorrect. Deep learning is, strictly speaking, an add-on used to perform the tasks of
supervised/unsupervised/semi-supervised learning. However, deep learning models have attracted so much
attention in the past years, and their mathematical analysis is so distinctive, that it makes sense to discuss
them as a separate subject.
Neural networks are a computationally efficient tool to build very complex ML models, and they are
inspired by the way neural networks in our bodies transform and transmit signals from the nerves to the
brain. These complex networks can discover much more complex relationships in the data than the ‘earliest
and simple’ ML models would do. In short, deep learning is the good old ML albeit on computational
steroids.
Example 1.6. Neural Networks
The picture here illustrates a simple example of a neural network that transforms the input vector
x, consisting of three entries x₁, x₂, x₃, first into a so-called hidden layer consisting of three neurons
a₁⁽¹⁾, a₂⁽¹⁾, a₃⁽¹⁾ that transform the incoming signals (incoming arrows), combining them into a single
number; the transformed signals from a₁⁽¹⁾, a₂⁽¹⁾, a₃⁽¹⁾ are then sent to another single-neuron hidden
layer a⁽²⁾, which transforms the incoming signal into a single output value y.

[Figure: network diagram with inputs x₁, x₂, x₃, hidden neurons a₁⁽¹⁾, a₂⁽¹⁾, a₃⁽¹⁾, second-layer neuron a⁽²⁾, and output y.]

What does it exactly mean that a signal is sent by means of an arrow, and how do the neurons work?
For that purpose, it makes sense to zoom in on a single neuron, for example a₁⁽¹⁾, and show a typical,
very simple, set of mathematical operations it consists of.
[Figure: a single neuron j with incoming values α₀, α₁, α₂, …, αᵢ and weights w_{j0}, w_{j1}, w_{j2}, …, w_{ji}; it forms the weighted sum s = Σᵢ w_{ji} αᵢ and outputs β = ϕ(s).]
First, the numbers incoming from the preceding nodes (nodes from which an arrow/arc leads to the
current neuron) are multiplied by scalar parameters w_{ji}. Then, all those multiplied signals are
added together into a single number s. This single number, in the end, is transformed by means of
a simple, nonlinear function ϕ(·) into the output value β.
Coming back to the big picture again, the symbols Θ⁽¹⁾ and Θ⁽²⁾ denote all the
weights w_{ji} corresponding to the neuron zoom-ins from the smaller picture. In ML terminology, Θ⁽¹⁾
and Θ⁽²⁾ are the parameters of the neural network, and the goal of the training process is to find
(optimize) the values of these parameters so that, over a training sample consisting of many pairs
(x̂⁽¹⁾, ŷ⁽¹⁾), …, (x̂⁽ⁿ⁾, ŷ⁽ⁿ⁾), the values y that the network generates based on the input vectors
x̂⁽ⁱ⁾ are as close as possible to the corresponding values ŷ⁽ⁱ⁾.
In this way, neural networks can be depicted as ‘computational graphs’ where on the left-hand side
we have the input features, and each node corresponds to a linear or non-linear transformation of the
numbers coming from the incoming arcs. In principle, the output value y is a compound function of
the input vector x:

  y(x) := ϕ( Σ w.. · ϕ( Σ w... xᵢ ) + Σ w.. · ϕ( Σ w... xᵢ ) + Σ w.. · ϕ( Σ w... xᵢ ) ),
but good luck if you try to work with a function like this directly – here, the mathematical/geometrical
structure of a graph helps put an order among the possibly hundreds of hidden layers, and millions
of neurons, and in particular, in the training process.
This problem of training a neural network can be formulated, for example, as the minimization of the
squared norm between the outputs ŷ⁽ⁱ⁾ and the values y(x̂⁽ⁱ⁾):

  min_{Θ⁽¹⁾, Θ⁽²⁾}  Σᵢ₌₁ⁿ ( y(x̂⁽ⁱ⁾) − ŷ⁽ⁱ⁾ )².
An algorithm that solves this minimization problem will need to compute the gradient of the objective
function with respect to all the in-between parameters w_{ji}. Doing so in the way that
you learn in a calculus course would be computationally intractable, so we will learn graph-theoretic
and linear-algebraic tools that will allow us to keep the number of computations as low as possible.
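The forward pass of the small network from the picture can be sketched as follows (the weights are random placeholders, and the logistic function is just one common choice for ϕ; neither comes from the lecture):

```python
import numpy as np

def phi(s):
    # Logistic function: one common choice of nonlinearity.
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
Theta1 = rng.standard_normal((3, 3))   # weights: input -> first hidden layer
Theta2 = rng.standard_normal(3)        # weights: first hidden layer -> a^(2)

def network(x):
    a1 = phi(Theta1 @ x)   # each neuron: weighted sum, then nonlinearity
    a2 = phi(Theta2 @ a1)  # single second-layer neuron produces the output y
    return a2

# Squared-error training objective over a toy sample of 5 pairs (x_hat, y_hat).
X_hat = rng.standard_normal((5, 3))
y_hat = rng.random(5)
loss = sum((network(xi) - yi) ** 2 for xi, yi in zip(X_hat, y_hat))
```

Training would now mean adjusting Theta1 and Theta2 to decrease this loss, which is exactly the minimization problem stated above.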
2 Linear algebra basics
As pointed out in the previous section, (numerical) linear algebra is the foundation of many algorithms and
techniques in machine learning. The most obvious reason is the data, which is the basis for machine learning
models. There are many different types of data subject to machine learning techniques, such as
• textual data,
The data can be characterized in different ways. For instance, the data can be numeric or non-numeric,
structured or unstructured, or static or temporal. However, when represented digitally, the data is usually
encoded in terms of linear algebra structures, that is, vectors, matrices, and tensors (arrays) of scalars
(floating point numbers) or integers. Therefore, dealing with data naturally involves dealing with linear
algebra structures.
Let us, for instance, consider an image:
A grey image can be stored as a matrix where each entry corresponds to the intensity of a pixel of the image;
in 8-bit greyscale, a value of zero corresponds to black, a value of 255 corresponds to white, and all values
between zero and 255 correspond to different shades of grey. For instance, the matrix

  ( 0   0  0 )
  ( 0 255  0 )
  ( 0   0  0 )
  ( 0 255  0 )
  ( 0   0  0 )
A color image corresponds to three matrices corresponding to the amount of red, green, and blue (RGB)
in the image.
As a second example, consider time-dependent sensor data:
[Figure: time-dependent sensor data s(t) over t ∈ [0, 10].]
In this course, we will only consider certain types of numbers for the entries of matrices and vectors, that is,
the real numbers R, or scalars, and the integers Z; in some cases, we might also consider boolean numbers
{0, 1}.
For now, we will concentrate on the case that all entries are real numbers, that is, v₁, …, vₘ ∈ R and
a₁₁, …, aₙₘ ∈ R. We then write v ∈ R^m and A ∈ R^{n×m}. In machine learning, the case of integers
and boolean numbers will mostly arise from classification problems, that is, when categorizing data into a
finite number of classes, which are labeled as boolean values or integers. Therefore, instead of considering
all integers, we typically restrict ourselves to the natural numbers N.
The space R^m is the vector space of all vectors of length m, and R^{n×m} is also a vector space, however, a
vector space of matrices. This means that the axioms of vector spaces are satisfied for matrices and vectors
with the element-wise addition
  ( a11 ⋯ a1m )   ( b11 ⋯ b1m )   ( a11 + b11 ⋯ a1m + b1m )
  (  ⋮  ⋱  ⋮  ) + (  ⋮  ⋱  ⋮  ) = (     ⋮      ⋱      ⋮     )
  ( an1 ⋯ anm )   ( bn1 ⋯ bnm )   ( an1 + bn1 ⋯ anm + bnm )

and

  ( v1 )   ( w1 )   ( v1 + w1 )
  (  ⋮ ) + (  ⋮ ) = (    ⋮    )
  ( vm )   ( wm )   ( vm + wm )

as well as the scalings

    ( a11 ⋯ a1m )   ( c·a11 ⋯ c·a1m )
  c (  ⋮  ⋱  ⋮  ) = (   ⋮    ⋱   ⋮   )
    ( an1 ⋯ anm )   ( c·an1 ⋯ c·anm )

and

    ( v1 )   ( c·v1 )
  c (  ⋮ ) = (   ⋮  ).
    ( vm )   ( c·vm )
Here, all aij , bij , vi , wi and c are scalars.
Exercise 2.1. Vector spaces
Recall the axioms of vector spaces.
The dimension of a vector space is defined as the number of vectors in a basis of this vector space;
for this to be well-defined, all bases of a vector space must have the same size, which is a
well-known statement from linear algebra. A standard basis of R^m is

  (1, 0, …, 0)ᵀ, (0, 1, …, 0)ᵀ, …, (0, 0, …, 1)ᵀ,

which shows again that the dimension is m. A standard basis for the matrix space R^{n×m} is defined analogously,
and hence, its dimension n · m follows.
Any vector v in an m-dimensional vector space V can be represented as a linear combination

  v = a₁b₁ + … + aₘbₘ

of a basis {b₁, …, bₘ}, where a₁, …, aₘ ∈ R. Recall also that a basis of a vector space is a maximal linearly
independent set of vectors. Vectors b₁, …, bₘ are linearly independent if

  a₁b₁ + … + aₘbₘ = 0  ⇒  a₁ = … = aₘ = 0.
Otherwise, they are linearly dependent. A linearly independent set of vectors is a basis if we cannot add
any vector from the vector space without the set becoming linearly dependent. Any subset W of a vector
space V is called a subspace if it is closed with respect to addition and scaling; as a result, the subspace is
a vector space itself.
As long as we see vectors and matrices just as elements of vector spaces, a specific vector or matrix does
not have any special properties, except for the zero elements

  ( 0 )          ( 0 ⋯ 0 )
  ( ⋮ )   and    ( ⋮ ⋱ ⋮ ).
  ( 0 )          ( 0 ⋯ 0 )
This operation is only well-defined if the sizes of the matrix and the vector are compatible. The matrix of
the composition of two linear maps that are represented by the matrices A ∈ R^{l×m} and B ∈ R^{m×n} can also
be obtained by the matrix-matrix multiplication

       ( a11 ⋯ a1m )   ( b11 ⋯ b1n )   ( c11 ⋯ c1n )
  AB = (  ⋮  ⋱  ⋮  ) · (  ⋮  ⋱  ⋮  ) = (  ⋮  ⋱  ⋮  ),
       ( al1 ⋯ alm )   ( bm1 ⋯ bmn )   ( cl1 ⋯ cln )

where

  c_{ij} = Σ_{k=1}^{m} a_{ik} b_{kj} = a_{i1} b_{1j} + … + a_{im} b_{mj}

for 1 ≤ i ≤ l and 1 ≤ j ≤ n. Again, it is important that the sizes of the matrices are compatible. For
simplicity, we will use matrices and the corresponding linear maps synonymously.
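To make the formula for c_{ij} concrete, here is a naive triple-loop implementation (for illustration only; in practice one uses the optimized `@` operator):

```python
import numpy as np

def matmul_naive(A, B):
    # Computes C with c_ij = sum_k a_ik * b_kj, exactly as in the definition.
    l, m = A.shape
    m2, n = B.shape
    assert m == m2, "inner dimensions must be compatible"
    C = np.zeros((l, n))
    for i in range(l):
        for j in range(n):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.arange(6.0).reshape(2, 3)    # 2 x 3
B = np.arange(12.0).reshape(3, 4)   # 3 x 4
C = matmul_naive(A, B)              # 2 x 4
```

The result agrees with NumPy's built-in matrix product; the triple loop also makes the O(l·m·n) operation count visible.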
Let us recall example 1.1 from section 1, which is about the computational efficiency of these
operations:
Example 2.1. Matrix-Matrix Vs Matrix-Vector Multiplication
Let us consider the special case of A, B ∈ R^{n×n} being dense matrices and x ∈ R^n. Then,

  A · x

requires n² scalar multiplications and n(n − 1) scalar additions. On some computing architectures,
addition and multiplication can be performed in parallel; then, we count them as one floating point
operation (FLOP). The cost for the matrix-vector multiplication is essentially O(n²) FLOPs.
On the other hand,

  A · B

equals multiplying A with n vectors (the columns of B). This results in a total of O(n³) FLOPs.
Hence,

  (A · B) · x    (2.1)

requires O(n³ + n²) = O(n³) FLOPs, whereas

  A · (B · x)    (2.2)

only requires O(2n²) = O(n²) FLOPs. The larger n, the larger the overhead of eq. (2.1) compared
to eq. (2.2).
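The two evaluation orders can be checked numerically: they give the same vector, but eq. (2.1) forms the full n × n product while eq. (2.2) only performs two matrix-vector multiplications.

```python
import numpy as np

n = 300
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

slow = (A @ B) @ x   # eq. (2.1): builds an n x n matrix first, O(n^3) FLOPs
fast = A @ (B @ x)   # eq. (2.2): two matrix-vector products, O(n^2) FLOPs
```

Up to floating-point rounding, `slow` and `fast` are identical; timing the two lines for growing n makes the O(n³) vs. O(n²) gap very tangible.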
This example shows that, even though mathematically equivalent, it matters how computations are
performed in practice. In particular, when dealing with large data sets and when repeatedly performing
such operations, this can become critical for performance. In the following, we will discuss more properties of
matrices.
Rectangular, quadratic matrices, and the matrix rank  The size of a matrix is linked with the
domain space and codomain space of the corresponding linear map. In particular, as discussed before,
if A ∈ R^{n×m}, then it corresponds to a linear map

  A : R^m → R^n,

where the domain space is domain(A) = R^m and the codomain space is codomain(A) = R^n. Now, the
dimension formula from linear algebra says that

  dim(domain(A)) = dim(range(A)) + dim(ker(A)).    (2.3)

Here, the range space range(A) is also called the column space because it is the space spanned by the
columns of A; in other words, it is the set of all linear combinations of the column vectors of A. The kernel
ker(A) is also called the null space; it corresponds to the space of vectors w ∈ domain(A) such that Aw = 0.
We will use the terms kernel and null space synonymously in this course.
Applying eq. (2.3) to the transpose

  Aᵀ := ( a11 ⋯ an1 )
        (  ⋮  ⋱  ⋮  ) ∈ R^{m×n}
        ( a1m ⋯ anm )

of

  A = ( a11 ⋯ a1m )
      (  ⋮  ⋱  ⋮  ) ∈ R^{n×m},
      ( an1 ⋯ anm )

with the corresponding linear map Aᵀ : R^n → R^m, we obtain

  dim(codomain(A)) = dim(range(Aᵀ)) + dim(ker(Aᵀ)).
Here, range(Aᵀ) is the column space of Aᵀ, which is the same as the row space of A, that is, the space
spanned by the rows of A (= columns of Aᵀ). The null space ker(Aᵀ) of Aᵀ is also called the left null
space of A, and it corresponds to all w ∈ domain(Aᵀ) = codomain(A) such that Aᵀw = 0.
One important observation from linear algebra is that dim(range(A)) = dim(range(Aᵀ)), that is, that
the dimensions of the column and row space are the same. The numbers dim(range(Aᵀ)) and dim(range(A))
are denoted as the row rank and column rank of the matrix A, and since they are the same, we just call
it the rank of the matrix A. The rank of the matrix A is a measure for the amount of information stored
in the matrix, that is, the number of linearly independent rows and columns of the matrix.
We can already imagine that, when building machine learning models from large data sets, which are
stored as matrices, it will be essential to know how much information is stored in the matrix.
Even if a data set is very large, it may still be represented by just a few vectors spanning the whole column or
row space. During the course, we will also discuss how to measure which dimensions of the row and column
spaces are actually most relevant and which can be neglected at the cost of only small errors.
Even though the spaces domain(A), range(Aᵀ), and ker(A) as well as codomain(A), range(A),
and ker(Aᵀ) do not fully describe the action of A, they are important to characterize the matrix. Many
important matrix properties can be formulated in terms of these spaces and their dimensions as well as the
rank of the matrix. For instance, injectivity and surjectivity can be defined based thereon:
Theorem 2.1.
Let A ∈ R^{n×m} be a matrix. The linear map corresponding to A is bijective
  ⇔ dim(ker(A)) = 0 ∧ dim(range(A)) = dim(codomain(A))
  ⇔ dim(domain(A)) = dim(range(A)) ∧ dim(range(A)) = dim(codomain(A))
  ⇔ dim(domain(A)) = dim(range(A)) = dim(codomain(A)).
One conclusion is that then A ∈ R^{n×n}, that is, that the matrix is a square matrix.
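The dimension formula eq. (2.3) and the equality of row and column rank can be checked numerically (the rank-3 test matrix below is constructed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 6, 5, 3
# Product of a 6x3 and a 3x5 random matrix: rank 3 by construction.
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

rank = np.linalg.matrix_rank(A)
nullity = m - rank                      # dim(ker(A)) via eq. (2.3)

# Row rank equals column rank: the transpose has the same rank.
rank_T = np.linalg.matrix_rank(A.T)
```

Here A stores 6 · 5 = 30 numbers but only carries rank 3 worth of information, which is exactly the situation exploited by low-rank approximations later in the course.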
Diagonal, triangular, and symmetric matrices Any square matrix A ∈ Rn×n can be partitioned as
A = L + D + U, (2.4)
and R = (r_{ij})_{ij} ∈ R^{n×n} is an upper triangular matrix with

  r_{ij} = a_{ij} for i < j,  and  r_{ij} = 0 otherwise.

Note that it is also common to write D = diag_{1≤i≤n}(a_{ii}). We can also write
or

    A = ( a_11  a_12  ···  a_1n )
        ( a_21  a_22  ···  a_2n )
        (  ⋮     ⋮     ⋱    ⋮   )
        ( a_n1  a_n2  ···  a_nn )

      = ( 0     ···  ···        0 )   ( a_11  0     ···   0    )   ( 0  a_12  ···  a_1n      )
        ( a_21   ⋱               ⋮ )   ( 0     a_22   ⋱    ⋮    )   ( ⋮   ⋱     ⋱    ⋮        )
        (  ⋮     ⋱    ⋱          ⋮ ) + ( ⋮      ⋱     ⋱    0    ) + ( ⋮          ⋱   a_{n−1,n} )
        ( a_n1  ···  a_{n,n−1}  0 )   ( 0     ···   0    a_nn )   ( 0  ···   ···  0          )
                = L                           = D                          = U
and R = (R_ij)_ij ∈ R^{n×m} is an upper triangular block matrix with

    R_ij = A_ij   for i < j,
    R_ij = 0      otherwise.

Again, we can write D = diag_{1≤i≤n}(A_ii). This idea can also be extended analogously to k × l block matrices, with k ≠ l.
Sparse matrices In practice, there are many cases where a large amount of the matrix entries are actually zero. One typical example is the following: Let us assume that a video-streaming platform has n users and offers m different videos, movies or series, for streaming. The information whether a user has watched a movie, or not, could be encoded in terms of a matrix A ∈ {0, 1}^{n×m}, where a_ij = 1 corresponds to the case when user i has seen the movie j and a_ij = 0 if not. For instance, A could be of the form
    A = ( 1  0  0  ··· )
        ( 0  0  0  ··· )
        ( 1  1  1  ··· ),
        ( ⋮  ⋮  ⋮   ⋱  )
meaning that user 1 has seen movie 1 but did not see movies 2 and 3. On the other hand, user 3 has seen movies 1, 2, and 3. Of course, most users will not have seen the largest part of all movies. A similar example would be the data for rating the different movies. In case the user could rate a movie with 1 to 5 stars, the resulting data could be stored as a matrix B ∈ {0, . . . , 5}^{n×m}.
In such cases, where by far most matrix entries have the same value (this value is typically zero), it is most efficient to only store those entries that differ from this value. We call these matrices sparse. Let us consider a small example:
    A = ( 4  0  0  2  0 )
        ( 0  0  0  0  0 )
        ( 5  4  3  0  0 )          (2.5)
        ( 0  0  0  0  1 )
        ( 0  2  0  0  0 )
Instead of storing all 25 entries of this matrix, we could store the triples of row index, column index, and value of all nonzero entries of A. The matrix A has 7 nonzero entries, and hence, 21 numbers are sufficient to store the whole matrix; in this small example, we save 4 numbers in memory for storing A. This format is also denoted as the dictionary of keys (DOK) format.
Later in this course, we will learn about some unsupervised learning techniques, so-called recommender systems, specifically aimed at dealing with the data sets described above. In particular, the goal of these techniques is to fill in the missing data of such sparse matrices. One typical application is also name giving, namely trying to recommend articles (for instance, videos, movies, and series) to a customer based on information from previous ratings.
Two other famous sparse formats are the compressed sparse row (CSR) and compressed sparse column (CSC) formats. In the following example, we will introduce the CSR format; in this format, the rows are stored as in the DOK format, but we need less data for storing the column information.
Example 2.2. CSR format

Definition 2.1.
The compressed sparse row (CSR) format of a matrix A ∈ R^{n×m} with k non-zero entries is defined by two 1D arrays val and col_ind of length k and another array row_ptr of length n + 1.

Only the k non-zero entries of A are written row-by-row in val, and the corresponding column indices are written in col_ind. row_ptr[i] points to the first entry of the i-th row in val, where the last entry of row_ptr points to the first entry in the fictitious (n + 1)-th row.

    val      4 2 5 4 3 1 2
    col_ind  1 4 1 2 3 5 2
    row_ptr  1 3 3 6 7 8
We can observe that 2 × 7 + 6 = 20 numbers are sufficient to store the matrix. This is cheaper than storing the full matrix A as well as storing the DOK format of the matrix.
The CSC format is similar to the CSR format, but in CSC, the matrix is stored column-by-column. This means that we store val, row_ind, and col_ptr arrays of sizes k, k, and m + 1, respectively; as before, k is the number of nonzero entries and m is the number of columns.
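As an illustration, here is a minimal Python sketch (not from the lecture notes) that converts a dense matrix to the three CSR arrays. Note that, unlike the 1-based indices used above, the sketch uses the 0-based indexing that is common in software.

```python
def dense_to_csr(A):
    """Convert a dense matrix (list of lists) to CSR arrays (0-based indices)."""
    val, col_ind, row_ptr = [], [], [0]
    for row in A:
        for j, entry in enumerate(row):
            if entry != 0:
                val.append(entry)
                col_ind.append(j)
        row_ptr.append(len(val))  # start of the next row in val
    return val, col_ind, row_ptr

# the matrix from eq. (2.5)
A = [[4, 0, 0, 2, 0],
     [0, 0, 0, 0, 0],
     [5, 4, 3, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 2, 0, 0, 0]]
val, col_ind, row_ptr = dense_to_csr(A)
print(val)      # [4, 2, 5, 4, 3, 1, 2]
print(col_ind)  # [0, 3, 0, 1, 2, 4, 1]
print(row_ptr)  # [0, 2, 2, 5, 6, 7]
```

Shifting every stored index by one recovers exactly the 1-based arrays shown in the definition above.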
Exercise 2.2. CSC format
Write down the CSC format of the matrix

    ( 4  0  0  2  0 )
    ( 0  0  0  0  0 )
    ( 5  4  3  0  0 ).
    ( 0  0  0  0  1 )
    ( 0  2  0  0  0 )
Besides reducing the memory consumption for storing a sparse matrix, using a sparse matrix format is also more computationally efficient. For instance, the number of floating point operations for computing the matrix-vector product

    A · x

of a matrix A ∈ R^{n×m} and a vector x ∈ R^m is of order n × m. In case the matrix has only k nonzero entries, the matrix-vector product requires only O(k) FLOPs. If k ≪ n × m, this is significantly less computationally demanding than performing a matrix-vector multiplication with a full matrix.
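To make the O(k) count concrete, here is a hedged sketch of a matrix-vector product in the CSR format (0-based arrays, as is common in software); the loops touch only the k stored nonzeros.

```python
def csr_matvec(val, col_ind, row_ptr, x):
    """y = A @ x for A stored in CSR format; O(k) operations for k nonzeros."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        # the entries of row i live in val[row_ptr[i]:row_ptr[i+1]]
        for idx in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[idx] * x[col_ind[idx]]
    return y

# CSR arrays (0-based) of the matrix from eq. (2.5)
val     = [4, 2, 5, 4, 3, 1, 2]
col_ind = [0, 3, 0, 1, 2, 4, 1]
row_ptr = [0, 2, 2, 5, 6, 7]
y = csr_matvec(val, col_ind, row_ptr, [1, 1, 1, 1, 1])
print(y)  # [6.0, 0.0, 12.0, 1.0, 2.0]
```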
Let us now note that combinations of block matrices and sparsity are common: a typical example is a matrix of the form

    A = ( A_11  0     0     A_14  0    )
        ( 0     0     0     0     0    )
        ( A_31  A_32  A_33  0     0    ),
        ( 0     0     0     0     A_45 )
        ( 0     A_52  0     0     0    )

where each block is dense but there are many blocks which are completely zero. This is basically the same as eq. (2.5) but with dense matrices as blocks. In this case, a combination of sparse and dense matrix formats would be favorable.
Bilinear forms We have seen that any matrix defines a linear map between two vector spaces. Moreover, any matrix A ∈ R^{n×m} also defines a bilinear form

    a(·,·) : R^m × R^n → R,   (v, w) ↦ w^T A v.   (2.6)
In order to be a bilinear form, a(·,·) has to be linear in each argument.
In case the matrix is square, A ∈ R^{n×n}, and symmetric, the bilinear form a(·,·) : R^n × R^n → R is also symmetric, that is,

    a(v, w) = a(w, v)   ∀ v, w ∈ R^n,

and this bilinear form is called:
• positive definite or negative definite if a(v, v) > 0 or a(v, v) < 0, respectively, for each v ≠ 0 ∈ R^n,
• positive semi-definite or negative semi-definite if a(v, v) ≥ 0 or a(v, v) ≤ 0, respectively, for each v ∈ R^n.
If the matrix is symmetric positive definite (SPD), the bilinear form is called a scalar product (or inner product), and

    ‖v‖_A^2 := a(v, v) = v^T A v   (2.7)

defines a norm on R^n.
Exercise 2.5. Scalar product
Show that a(·,·) as defined in eq. (2.6) is a scalar product if the matrix A is SPD.

For A = I, we obtain the scalar product which is known as the Euclidean inner product

    (·,·) : R^n × R^n → R,   (v, w) ↦ w^T v = Σ_{i=1}^{n} w_i v_i.   (2.8)
Let us finally remark that there is also the outer product of two vectors:

    · ⊗ · : R^n × R^m → R^{n×m},   v ⊗ w ↦ v w^T.
Vector norms Just before, we have seen that an SPD matrix A induces a vector norm ‖·‖_A. Various different matrix and vector norms are frequently used in ML. Along with the sub-additivity/triangle inequality

    ‖v + w‖ ≤ ‖v‖ + ‖w‖   ∀ v, w ∈ V,

and

    ‖cv‖ = |c| ‖v‖   ∀ c ∈ R, v ∈ V,

any norm ‖·‖ on a vector space V is positive definite:

    ‖v‖ ≥ 0   ∀ v ∈ V,   ‖v‖ = 0 ⇒ v = 0.

For example, while the problem

    min_v f(v)

may not have a unique minimizer, the regularized problem

    min_v f(v) + α ‖v‖
Figure 2.1: The functions f(x) = sin(x_1) + cos(x_2) (left) and g(x) = sin(x_1) + cos(x_2) + 0.1 ‖x‖² (right): while f does not have a unique minimizer, g has one.
will have a unique minimum for a sufficiently large α ∈ R. Moreover, for α → +∞,

    arg min_v { f(v) + α ‖v‖ } → 0.

This is also called regularization, and it will be discussed in more detail in the optimization basics as well as in the second half of the course; see also fig. 2.1.
An effect of regularization, which is highly relevant and will appear over and over in machine learning, is indicated in the following example:
Example 2.3. Overfitting
[Figure 2.2: two vectors v and w and the angle α between them.]
As can be seen in the left picture, we would get a quite good fit to the original function. Of course, since the noise is unknown, it is difficult to recover the original function exactly. Here, adding more data helps to improve the fit.
In practice, we do not know what kind of (nonlinear) function describes the true relation between x and y. Therefore, one might use a higher polynomial degree in order to ensure that the model actually has the capacity to learn this relation. As we can see in the middle image, the resulting model is a very good fit with respect to the noisy data but not necessarily a good fit of the true function f. This is called overfitting.
Without going into the details of regularization, we just make an observation about what happens when adding the norm of the coefficient vector as a regularization term. In particular, we solve the regularized least-squares problem

    arg min_{a_0,...,a_10} Σ_i ( g_{a_0,...,a_10}(x_i) − ŷ_i )² + λ ‖(a_0, . . . , a_10)^T‖²,

with the regularization parameter λ ∈ R⁺. Now, setting λ = 0.1, we again obtain a reasonable fit of the original function, without using the knowledge that the original function was a polynomial of degree 3; see the plot on the right.
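A fit of this kind can be reproduced with a few lines of NumPy; the following is a hedged sketch (the data, noise level, and λ are made up and are not the ones from the lecture notes). It solves the ridge-regularized normal equations (V^T V + λ I) a = V^T ŷ for a degree-10 polynomial.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y_true = x**3 - x                                    # assumed degree-3 ground truth
y_hat = y_true + 0.05 * rng.standard_normal(x.size)  # noisy observations

V = np.vander(x, 11)   # degree-10 polynomial features (hypothetical model)
lam = 0.1              # regularization parameter lambda

# ridge solution of  arg min ||V a - y_hat||^2 + lam * ||a||^2
a = np.linalg.solve(V.T @ V + lam * np.eye(11), V.T @ y_hat)
```

The solution `a` satisfies the regularized normal equations exactly; plotting `V @ a` against `y_true` would show the smoothing effect described above.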
Let us discuss some typical examples of vector norms. The standard Euclidean norm of a vector v ∈ R^n is given by the square root of the sum of the squares of the vector entries,

    ‖v‖ := √( Σ_{i=1}^{n} v_i² ).   (2.9)

Note that the Euclidean inner product of two vectors v and w is a measure for the angle α between them; cf. fig. 2.2. In particular, now that we have defined the Euclidean norm, we have

    cos(α) = (v, w) / ( ‖v‖ ‖w‖ ).

As a consequence, if two vectors are orthogonal, meaning that the angle is (1/2 + k)π for some k ∈ Z, we have that

    (v, w) = cos((1/2 + k)π) ‖v‖ ‖w‖ = 0.

Furthermore, we obtain the
Theorem 2.2. Cauchy–Schwarz inequality
|(v, w)| ≤ ‖v‖ ‖w‖   ∀ v, w ∈ R^n.
In fact, the definition of the Euclidean norm can be generalized to the l_p-norm as follows:

    ‖v‖_p := ( Σ_{i=1}^{n} |v_i|^p )^{1/p}.   (2.11)
Vectors of length 1 (with respect to a certain norm) play a special role in terms of normalization. They are also denoted as unit vectors. To compute the unit vector v′ corresponding to some vector v, we can just normalize it with

    v′ = v / ‖v‖.

We obtain that ‖v′‖ = 1.
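As a quick illustration (a NumPy sketch, not from the notes), the l1-, l2-, and l∞-norms of a vector and its normalization:

```python
import numpy as np

v = np.array([3.0, 4.0])
l1 = np.linalg.norm(v, 1)          # |3| + |4| = 7
l2 = np.linalg.norm(v)             # sqrt(9 + 16) = 5
linf = np.linalg.norm(v, np.inf)   # max(|3|, |4|) = 4

v_unit = v / l2                    # normalization: unit vector in the l2-norm
print(np.linalg.norm(v_unit))      # 1.0
```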
Different norms are also used to measure errors in machine learning: if a machine learning model predicts an output of y and the correct output is ŷ, then the error is measured as

    ‖ŷ − y‖.

In particular, this is only zero if ŷ = y. As we will discuss later, different norms might be used, depending on the situation.
Matrix norms We will also have to deal with different matrix norms. Since matrices are just linear operators, we can always define a matrix norm as the operator norm induced by a vector norm:

    ‖A‖ = max_{x ≠ 0} ‖Ax‖ / ‖x‖ = max_{‖x‖=1} ‖Ax‖,   (2.14)
• the row sum norm

    ‖A‖_∞ = max_{‖x‖_∞ = 1} ‖Ax‖_∞ = max_{i=1,...,n} Σ_{j=1}^{m} |a_ij|.   (2.17)
which will play an important role in dimension reduction techniques based on the singular value decomposition, which will be introduced later. In the machine learning community, the square of the Frobenius norm ‖A‖_F² is also called the energy of the matrix. It can equivalently be written using the Frobenius inner product,

    ‖A‖_F² = (A, A)_F,   (2.19)

where the Frobenius inner product is defined as

    (A, B)_F := tr(A^T B),
which is in analogy with powers of a scalar; of course, for a non-square matrix, we cannot even compute AA. Moreover, in analogy to c⁰ = 1 for any c ∈ R, we define A⁰ = I, where I ∈ R^{n×n} is the identity matrix. For matrices, it is possible to have

    AB = 0

for A, B ≠ 0. It is also possible that

    A^d = 0

for A ≠ 0.
Definition 2.2. Nilpotent matrix of index d
We call a matrix A nilpotent of index d if

    A^d = 0   but   A^{d−1} ≠ 0.
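A small NumPy illustration (not from the notes): a nonzero matrix whose square is zero, that is, a nilpotent matrix of index 2, and two nonzero matrices with product zero.

```python
import numpy as np

# nilpotent of index 2: A != 0 but A @ A == 0
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])

# two nonzero matrices whose product is the zero matrix
B = np.array([[1.0, 0.0],
              [0.0, 0.0]])
C = np.array([[0.0, 0.0],
              [0.0, 1.0]])

print(A @ A)  # zero matrix
print(B @ C)  # zero matrix
```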
Raising a matrix to a high power is an operation that, if done 'as written on paper', involves a lot of matrix multiplications, an operation which is rather expensive. Whenever possible, this should be avoided. Therefore, let us discuss how to efficiently compute powers for a certain class of matrices, that is, for diagonalizable matrices.
Definition 2.3. Diagonalizable matrix
A square matrix A ∈ R^{n×n} is called diagonalizable if an invertible matrix V ∈ R^{n×n} and a diagonal matrix D ∈ R^{n×n} exist, such that

    A = V D V^{−1}.

Since V^{−1} V = I, we obtain

    A^d = V D^d V^{−1}.
Recall that D is a diagonal matrix, and it is easy to compute a power of a diagonal matrix:

    D^d = (diag(d_ii))^d = diag(d_ii^d),

that is, by computing the power of the diagonal entries. This also gives us the possibility to compute a fractional power of a diagonal matrix,

    D^{1/d} = diag(d_ii^{1/d}),
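A NumPy sketch of this trick (an illustration, assuming A is symmetric so that an orthogonal eigendecomposition exists): A^d = V D^d V^{−1}, computed via powers of the eigenvalues.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # symmetric, hence diagonalizable

eigvals, V = np.linalg.eigh(A)       # A = V @ diag(eigvals) @ V.T
d = 5
A_pow = V @ np.diag(eigvals**d) @ V.T  # A^d via powers of the diagonal entries

# compare with repeated matrix multiplication
assert np.allclose(A_pow, np.linalg.matrix_power(A, d))
```

The eigendecomposition costs O(n³) once; after that, every additional power (including fractional ones for SPD matrices) only requires powering n scalars.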
with a_i ∈ R, for 0 ≤ i ≤ n. It is easy to see that two matrix polynomials f(A) and g(A) with the same matrix A commute, that is,

    f(A) g(A) = g(A) f(A).

Again, analogously to the scalar case, we can also define negative powers of a matrix. We have that

    A¹ A^{−1} = A^{−1} A¹ = A⁰ = I.

Therefore, A^{−1} has to be the (multiplicative) inverse of A. The inverse of a matrix exists if it has full rank. This means that the matrix has to be a square matrix, and that all columns and rows have to be linearly independent. Then,

    A A^{−1} = A^{−1} A = I.
As before, for the case of positive powers of a matrix A, we can define negative powers of an invertible matrix:

    A^{−d} = A^{−1} · · · A^{−1}  (d factors)  = Π_{i=1}^{d} A^{−1},
for d ∈ N and r ∈ Q+ if D−1/d and D−r exist. Therefore, we have extended powers to any rational numbers.
An interesting observation is that, as for scalars,
I − An+1 = (I + A + . . . + An ) (I − A)
For a proof, see, for example, [1, Section 1.2.5, Lemma 1.2.5]. Matrix updates of the form

    A + u v^T

are called rank 1 updates, and they are often used in so-called quasi-Newton methods for minimizing nonlinear functions or solving nonlinear equations; those will be discussed in more detail when we discuss the basics of optimization.
Example 2.4. Discussion of the computational work
Let us assume that A ∈ R^{n×n} and its inverse A^{−1} are given, and let both matrices be dense. Furthermore, let u, v ∈ R^n. Equation (2.20) can be split into the following computations:
Furthermore, the following, more general, extension of lemma 2.1 to higher-rank updates of a matrix will
be important:
Theorem 2.3. Sherman–Morrison–Woodbury identity
Let A ∈ R^{n×n} be invertible and U, V ∈ R^{n×k} for some small k, that is, k ≪ n. Then, the matrix

    A + U V^T
It can be seen that lemma 2.1 corresponds to the special case of k = 1 in theorem 2.3. The matrix U V^T is a rank k matrix, and

    A + U V^T

is called a rank k update, or, since k ≪ n, a low-rank update. If A^{−1} is known, the computational work for computing eq. (2.21) is generally much lower than computing

    (A + U V^T)^{−1}

directly. We will discuss the solution of linear equation systems in section 2.6; then, it will become clearer why it is important to save computational work.
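As a hedged illustration of the rank-1 case (presumably the content of lemma 2.1, the Sherman–Morrison formula), a known inverse can be updated in O(n²) instead of recomputing it in O(n³):

    (A + u v^T)^{−1} = A^{−1} − (A^{−1} u v^T A^{−1}) / (1 + v^T A^{−1} u).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned, invertible
A_inv = np.linalg.inv(A)                         # assumed to be known already
u = rng.standard_normal(n)
v = rng.standard_normal(n)

# Sherman-Morrison: rank-1 update of the inverse, O(n^2) work
Au = A_inv @ u
vA = v @ A_inv
updated_inv = A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

# agrees with inverting the updated matrix directly
assert np.allclose(updated_inv, np.linalg.inv(A + np.outer(u, v)))
```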
Krylov subspaces A matrix polynomial p(A) for a matrix A ∈ R^{n×n} also yields a matrix, which can be applied to some vector v:

    p(A) v = Σ_{k=0}^{m} a_k A^k v.
that is, the space spanned by the vectors A^k v, 0 ≤ k ≤ m − 1, for some v ∈ R^n, is called the Krylov subspace K_m(A, v). A nonzero polynomial p of minimal degree such that

    p(A) v = 0

is called the minimal polynomial of v with respect to A. We call the degree of this polynomial the grade of v with respect to A.
It is immediately clear that the grade of v has to be lower than or equal to n. Otherwise, we would have
Lemma 2.3.
The Krylov subspace Km has dimension m if and only if the grade µ of v with respect to A is not
less than m:
dim (Km ) = m ⇔ grade (v) ≥ m.
Hence,
dim (Km ) = min {m, grade (v)} .
See [41, Section 6.2, Propositions 6.1 and 6.2] for more details.
As we will see, Krylov subspaces have favorable properties:
• They can be computed relatively efficiently, without changing the sparsity pattern of the matrix: sparse
remains sparse.
• They serve well as reduced dimensional spaces; we will discuss later what this means in practice.
2.4 Orthogonalization
As a next step, we discuss how to orthogonalize and orthonormalize a set of vectors a_1, . . . , a_m. This will also give a way of computing the rank of a matrix A = (a_1, . . . , a_m), and hence, to determine how much information, that is, how many dimensions, a given data set spans, and, if possible, whether some dimensions can be dropped to reduce the data set size. This will also enable us to define a first matrix factorization,

    A = QR,

which can be employed to solve linear equation systems. In principle, it is convenient to transform a matrix into a collection of orthogonal objects because then, all kinds of matrix operations become numerically safer.
Consider the matrix A ∈ R^{n×m} consisting of a_1, . . . , a_m as columns. Of course, the vectors could be linearly dependent, and the rank of A could be smaller than m (and n). In order to construct a basis of the space V = span{a_1, . . . , a_m} and to determine the rank of A, we can orthogonalize the vectors a_1, . . . , a_m using the Gram–Schmidt orthogonalization algorithm. Let us first assume that A has full column rank.

Algorithm 1: Gram–Schmidt orthogonalization algorithm
Data: Linearly independent a_1, . . . , a_m ∈ R^n
q_1 = a_1;
for i = 2 to m do
    q_i = a_i;
    for j = 1 to i − 1 do
        q_i = q_i − ((a_i, q_j) / (q_j, q_j)) q_j;
    end
end
Result: Orthogonal q_1, . . . , q_m ∈ R^n
Here, the inner for-loop can be written out as follows:
Definition 2.7. Orthogonal vectors
A set of vectors v_1, . . . , v_m is called orthogonal if

    (v_i, v_j) = 0

for all i ≠ j. They are called orthonormal if additionally

    ‖v_i‖ = 1

for all i.
If the rank of A is smaller than m, it is clearly not possible to find m orthogonal vectors from the space V. Investigate this:

Exercise 2.13. Modification of Gram–Schmidt
2. Modify algs. 1 and 2 such that they will not fail and still generate an orthogonal or orthonormal, respectively, basis of V.

As nice as it is, it turns out that Gram–Schmidt is numerically unstable (a notion to be defined): when executed on a computer, it often fails to accurately produce orthogonal vectors due to rounding errors. Let us now build up some understanding of what numerical stability means.
Example 2.5. Numerical stability of the Gram–Schmidt algorithm
Consider the three vectors

    a_1 = (1, 0.01, 0)^T,   a_2 = (1, 0, 0.01)^T,   a_3 = (1, 0.01, 0.01)^T.
By performing Gram–Schmidt orthonormalization with 10^{−3} accuracy, we obtain the vectors

    q_1 = (1, 0.01, 0)^T,   q_2 = (0, −0.707, 0.707)^T,   q_3 = (0, 0, 1)^T.
We notice that, even when also computing the inner products with 10−3 accuracy, we obtain
Conditioning and stability The concepts of conditioning and stability are important to investigate the usage of numerical schemes. Here, we will discuss them for the solution map of an abstract problem

    f : X → Y,

where X and Y are normed vector spaces (vector spaces with a norm). The space X contains the (input) data, and Y contains the solutions (or targets). Let us note that conditioning is a property of the problem, whereas stability is a property of a numerical algorithm to solve the problem.
Therefore, let us first discuss the conditioning of a problem. We call a problem well-conditioned if a small change in the data x yields only small changes in the corresponding solution f(x). This means that, if there are only small perturbations in the data, the resulting solution also changes only mildly. In the context of numerical computations, a typical type of perturbations of the data are rounding errors caused by storing scalars as floating-point numbers and performing the computations in floating-point arithmetic.
Example 2.6. Floating-point numbers
In floating point format, a scalar is stored as follows:
This format is standardized according to the IEEE Standard for Floating-Point Arithmetic (IEEE 754). The relative error due to rounding is bounded by the machine precision ε (also denoted as the machine epsilon):

    |fl(x) − x| / |x| ≤ ε.

This error depends on the format. It can be reduced, but generally it cannot be prevented.
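A small Python illustration (not from the notes) of the double-precision machine epsilon: it is the gap between 1 and the next representable number, so adding anything smaller than ε/2 to 1 rounds back to 1.

```python
import sys

eps = sys.float_info.epsilon   # machine epsilon of float64, 2^(-52) ~ 2.22e-16

print(1.0 + eps > 1.0)         # True: eps is large enough to be "seen"
print(1.0 + eps / 4 == 1.0)    # True: smaller perturbations are rounded away
```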
Problems that are badly conditioned are also called ill-conditioned. This means that small changes in the data x result in relatively large changes in the solution f(x). It is clear that well-conditioned problems are much more favorable than ill-conditioned problems. In numerical computations, we always have to
deal with small errors in the data (rounding errors), and therefore, ill-conditioned problems can be highly
problematic.
The conditioning of a problem is measured by the condition number:

Definition 2.8. Condition number
• The absolute condition number of a problem f is defined as

    κ_abs := κ_abs(x) := sup_{δx ∈ D} ‖f(x + δx) − f(x)‖ / ‖δx‖.

Here, D is a neighborhood of admissible perturbations δx, that is, such that f(x + δx) yields an admissible solution in Y.
We obtain that

    κ = sup_{δx ∈ D} ( ‖A(x + δx) − Ax‖ / ‖Ax‖ ) · ( ‖x‖ / ‖δx‖ )
      = sup_{δx ∈ D} ( ‖Aδx‖ / ‖δx‖ ) · ( ‖x‖ / ‖Ax‖ )
      ≤ ‖A‖ · ( ‖x‖ / ‖Ax‖ ).

Furthermore, if A is invertible, substituting y := Ax,

    ‖x‖ / ‖Ax‖ ≤ sup_{z ∈ X} ‖z‖ / ‖Az‖ = sup_{y ∈ Y} ‖A^{−1} y‖ / ‖y‖ = ‖A^{−1}‖.
We obtain the following theorem:

Theorem 2.4.
Let the matrix A ∈ R^{n×n} be invertible. The (relative) condition number of the matrix-vector multiplication Ax = b is

    κ(x) ≤ ‖A‖ ‖A^{−1}‖ =: κ(A),

and the (relative) condition number of solving the linear equation system Ax = b, which corresponds to the matrix-vector multiplication A^{−1} b = x, is also bounded by κ(A).
Exercise 2.14.
Show that κ(A) ≥ 1 for an invertible matrix A ∈ Rn×n .
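These quantities can be checked quickly in NumPy (an illustration, not from the notes): κ(A) = ‖A‖ ‖A^{−1}‖ in the spectral norm, computed both directly and via `np.linalg.cond`, and κ(A) ≥ 1.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# condition number as the product of the operator norms of A and its inverse
kappa = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)

print(kappa)                   # spectral condition number of A
print(np.linalg.cond(A, 2))    # the same value computed by NumPy
```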
Let f̃ : X → Y
Figure 2.3: Application of an algorithm f̃ to a problem f, backward error δx, and forward error δy.
be a numerical algorithm for solving an abstract problem f : X → Y. Consider the relative error resulting from solving the problem with the numerical algorithm f̃:

    ‖f̃(x) − f(x)‖ / ‖f(x)‖.
In other words:
A stable algorithm gives almost the exact answer f˜(x) to the almost exact question x̃.
Definition 2.11. Backward stability
We call an algorithm backward stable if, for every x ∈ X, there is an x̃ ∈ X such that

    ‖x̃ − x‖ / ‖x‖ = O(ε)

and

    f̃(x) = f(x̃).
In other words:
A backward stable algorithm f˜ gives the exact answer f˜(x) to the almost correct question x̃.
This means that backward stable algorithms exhibit the best stability behavior we could hope for since
the error is only in the order of the machine precision times the condition number of the problem, which
cannot be influenced by the algorithm.
Example 2.7. Modified Gram–Schmidt algorithm
As we have already mentioned earlier, the classical alg. 1 is not numerically stable. By a modification of the inner for-loop, the stability of the algorithm can be improved:

Algorithm 3: Modified Gram–Schmidt orthogonalization algorithm
q_i = a_i;
for j = 1 to i − 1 do
    q_i = q_i − (q_i, q_j) q_j;
end
q_i = q_i / ‖q_i‖;
Later, we will discuss further approaches for stabilizing the Gram–Schmidt procedure.
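The difference in stability can be observed numerically. The following sketch (an illustration with the standard Läuchli-type example, not the vectors from example 2.5) runs both variants in double precision; the classical variant loses orthogonality completely, the modified one does not.

```python
import numpy as np

def classical_gs(A):
    """Classical Gram-Schmidt: all coefficients come from the ORIGINAL column a_i."""
    Q = np.zeros_like(A)
    for i in range(A.shape[1]):
        q = A[:, i] - Q[:, :i] @ (Q[:, :i].T @ A[:, i])
        Q[:, i] = q / np.linalg.norm(q)
    return Q

def modified_gs(A):
    """Modified Gram-Schmidt: coefficients come from the RUNNING residual q_i."""
    Q = np.zeros_like(A)
    for i in range(A.shape[1]):
        q = A[:, i].copy()
        for j in range(i):
            q -= (q @ Q[:, j]) * Q[:, j]
        Q[:, i] = q / np.linalg.norm(q)
    return Q

eps = 1e-8  # eps**2 is below machine precision relative to 1
A = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]])

loss = lambda Q: np.linalg.norm(Q.T @ Q - np.eye(3))
print(loss(classical_gs(A)))  # O(1): orthogonality is lost
print(loss(modified_gs(A)))   # tiny: close to machine precision
```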
QR factorization Based on the Gram–Schmidt algorithm, we will derive the QR factorization (or QR decomposition) of a matrix. In the general case of A ∈ R^{n×m}, with n ≥ m and full column rank, we can derive a factorization

    A = QR,   (2.23)

where Q ∈ R^{n×n} is an orthogonal matrix and R ∈ R^{n×m} is an upper triangular matrix. This factorization is called the QR factorization.
Definition 2.12. Orthogonal and semi-orthogonal matrix
A semi-orthogonal matrix Q ∈ R^{n×m} is a matrix with m orthonormal columns, that is, a matrix with the property

    Q^T Q = I_m ∈ R^{m×m}.   (2.24)

An orthogonal matrix is a semi-orthogonal matrix with full rank. In other words, it is a square matrix with orthonormal columns. In this case, the inverse of the matrix Q is its transpose Q^T.
We can partition Q = (Q_1 Q_2), where Q_1 ∈ R^{n×m} and Q_2 ∈ R^{n×(n−m)} are semi-orthogonal and R_1 ∈ R^{m×m} has full rank, such that

    A = QR = ( Q_1  Q_2 ) ( R_1 )
                          (  0  )  = Q_1 R_1.

This is an alternative form of the QR factorization, and we will focus on this variant. Therefore, we will just use the notation Q = Q_1 and R = R_1; for n = m, both variants of the QR factorization coincide.
One way to compute the QR decomposition of a matrix A is to apply the Gram–Schmidt orthonormalization algorithm to the columns of A. Since we assume that A has full column rank, we can just perform algs. 2 and 3 without the modifications discussed in exercise 2.13. The result will be the columns of the semi-orthogonal matrix Q, and the coefficients in the algorithm will yield the matrix R.
Algorithm 4: QR factorization via the Gram–Schmidt orthonormalization algorithm
Data: A = (a_1, . . . , a_m) ∈ R^{n×m}, R = 0 ∈ R^{m×m}
r_11 = ‖a_1‖;
q_1 = a_1 / r_11;
for i = 2 to m do
    q_i = a_i;
    for j = 1 to i − 1 do
        r_ji = (a_i, q_j);
        q_i = q_i − r_ji q_j;
    end
    r_ii = ‖q_i‖;
    q_i = q_i / r_ii;
end
Result: Semi-orthogonal Q = (q_1, . . . , q_m), upper triangular R
The following exercise deals with the verification of the desired properties of the resulting matrices.
Exercise 2.15. QR factorization
Verify that
• Q is semi-orthogonal and R is an upper triangular matrix,
• A = QR if Q and R have been computed using alg. 4, and
• Q^T Q = I.
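A compact NumPy sketch of alg. 4 (using the modified inner loop for stability); this is an illustration assuming full column rank, not a robust implementation:

```python
import numpy as np

def qr_gram_schmidt(A):
    """QR factorization of A (n x m, full column rank) via Gram-Schmidt."""
    n, m = A.shape
    Q = np.zeros((n, m))
    R = np.zeros((m, m))
    for i in range(m):
        q = A[:, i].copy()
        for j in range(i):
            R[j, i] = q @ Q[:, j]   # modified variant: coefficient from the residual
            q -= R[j, i] * Q[:, j]
        R[i, i] = np.linalg.norm(q)
        Q[:, i] = q / R[i, i]
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
Q, R = qr_gram_schmidt(A)

assert np.allclose(Q @ R, A)             # A = QR
assert np.allclose(Q.T @ Q, np.eye(3))   # Q is semi-orthogonal
assert np.allclose(R, np.triu(R))        # R is upper triangular
```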
Once the QR decomposition of an invertible matrix A ∈ Rn×n has been computed, the linear equation
system
Ax = b
can be solved easily. In particular,

    Ax = b  ⇔  QRx = b  ⇔  Rx = Q^T b,   (2.25)

and the latter system can again be solved easily. In particular, since R is of the form

    R = ( r_11  ···  r_1n )
        (        ⋱    ⋮   )
        ( 0          r_nn ),

we can solve it row by row from the last to the first row, that is, using backward substitution.
For a dense matrix A, while the computational complexity of the QR decomposition is O(n³), it is only O(n²) for computing Q^T b and solving Rx = Q^T b. Therefore, if a QR factorization has been computed, solving a linear equation system with A is relatively cheap. Unfortunately, if A is sparse, the factors Q and R can be denser compared to the original matrix.
Exercise 2.16. Condition number of an orthogonal matrix
Show that
• the condition number of an orthogonal matrix Q is 1 and so is the condition number of Q> and
As a result of exercise 2.16, solving Ax = b has the same conditioning as solving Rx = Q> b.
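A sketch of this solution process in NumPy (using `np.linalg.qr` for the factorization and a hand-written backward substitution; an illustration, not from the notes):

```python
import numpy as np

def backward_substitution(R, c):
    """Solve R x = c for upper triangular R, row by row from the last row."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (c[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)

Q, R = np.linalg.qr(A)                  # O(n^3), done once
x = backward_substitution(R, Q.T @ b)   # O(n^2) per right-hand side

assert np.allclose(A @ x, b)
```

Once Q and R are stored, additional right-hand sides only cost the cheap O(n²) step.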
Solving an overdetermined system using the QR factorization Let us now consider the case of a rectangular matrix A ∈ R^{n×m}, with n > m and linearly independent columns. Then, the linear equation system

    Ax = b   (2.26)

is overdetermined since there are more equations than variables. It only has a solution if b ∈ range(A). In this case, the solution is unique since we assumed that the columns are linearly independent.
Example 2.8.
The problem f(x) := x³ + 1 = 2, for x ∈ R, is well-posed. We have that

    x³ + 1 = 2  ⇒  x³ = 1  ⇒  x = 1.

Thus, there is a unique solution in R. Moreover, f(x) is continuously differentiable in a neighborhood of the solution x = 1, and df/dx(1) = 3 ≠ 0. By the implicit function theorem, we obtain a continuously differentiable inverse function of f in a neighborhood of x = 1. Hence, the solution depends continuously on the data.
[Plot: the graph of x³ + 1 on the interval [−2, 2].]
• b ∈ range(A).
If b ∉ range(A), there cannot be an exact solution of eq. (2.26). However, we can still try to find a vector that is as close as possible to solving eq. (2.26). More precisely, let us consider the solution which is most accurate in terms of the Euclidean norm of the error, that is,

    arg min_{x ∈ domain(A)} ‖Ax − b‖.   (2.27)

We also call this the least-squares problem because it corresponds to minimizing the square of the Euclidean norm, or l₂-norm, of the error.
Note that, in case A is invertible, ‖Ax − b‖² is minimized if Ax = b. Hence, both problems eq. (2.26) and eq. (2.27) are equivalent in this case.
As it turns out, we can derive a formula for the solution of the least-squares problem as follows: First of all,

    ‖Ax − b‖² = (Ax − b)^T (Ax − b) = x^T A^T A x − x^T A^T b − b^T A x + b^T b = x^T A^T A x − 2 b^T A x + b^T b.

This function is continuously differentiable, and hence, a necessary condition for finding a local minimum is

    0 = d/dx ‖Ax − b‖² = d/dx ( x^T A^T A x − 2 x^T A^T b + b^T b ) = 2 A^T A x − 2 A^T b
    ⇔ A^T A x = A^T b.

This linear equation system is also called the normal equations. If the columns of A are linearly independent,

    A^T A x = 0  ⇒  x^T A^T A x = 0  ⇒  ‖Ax‖² = 0  ⇔  Ax = 0  ⇒  x = 0.
Hence, ker(A^T A) = {0}, and since A^T A ∈ R^{m×m}, A^T A is invertible. This means that there is a unique solution

    x̂ = (A^T A)^{−1} A^T b

of the normal equations. In practice, this solution is computed by solving the linear equation system instead of computing the inverse of the matrix (in practical terms, saying that a vector is the result of solving a well-determined set of linear equations is pretty much the same as saying that we have a formula for this vector).
The Hessian

    d²/dx² ‖Ax − b‖² = 2 A^T A

is positive definite everywhere, so this solution is a global minimizer of ‖Ax − b‖²: let x ≠ 0; then, since the columns of A are linearly independent,

    x^T (2 A^T A) x = 2 ‖Ax‖² > 0.

Hence, the unique solution of the normal equations, x̂, is the solution of the least-squares problem eq. (2.27). The case that the columns of A are not linearly independent will be discussed at a later point; then, we cannot find a unique minimizer.
The normal equations

    A^T A x = A^T b

are a linear equation system with the matrix A^T A. We have not yet extended the definition of the condition number of a matrix to rectangular matrices. However, as we will see later,

    κ(A^T A) = κ(A)².

Since κ(A) ≥ 1, the conditioning of the normal equations is much worse compared to the original equation system eq. (2.26).
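This squaring of the condition number can be checked numerically (a NumPy illustration with the spectral condition number; the specific matrix is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # rectangular, full column rank

kappa_A = np.linalg.cond(A, 2)              # sigma_max / sigma_min of A
kappa_normal = np.linalg.cond(A.T @ A, 2)   # condition number of the normal equations

print(kappa_normal / kappa_A**2)   # close to 1
```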
Now, let

    A = QR

be a QR factorization of A. Then, inserting A = QR into the normal equations yields

    R^T Q^T Q R x = R^T Q^T b  ⇔  R^T R x = R^T Q^T b.

If the columns of A are linearly independent, the columns of R have to be linearly independent as well. Therefore, R and R^T are invertible, and we have

    R x̂ = Q^T b,

which can be solved using backward substitution. In contrast to the normal equations, this problem has the same conditioning as the original problem eq. (2.26).
In the next paragraphs, we will see other examples for the use of the QR decomposition of a matrix.
Computing Krylov subspaces In section 2.3, we have introduced Krylov subspaces. By using Gram–Schmidt orthogonalization, we derive Arnoldi's method for computing an orthonormal basis of K_m(A, v). Note that

    v e_k^T = v ⊗ e_k

is a rank one matrix. From eq. (2.30) and the fact that w = h_{k+1,k} q_{k+1} ⊥ Q_k, we obtain
For a symmetric matrix A, H_k is symmetric and tridiagonal:

    H_k = ( h_{1,1}   h_{2,1}                O        )
          ( h_{2,1}   h_{2,2}    ⋱                    )
          (            ⋱          ⋱      h_{k,k−1}    )
          ( O              h_{k,k−1}     h_{k,k}      )
Now, with α_k = h_{k,k} and β_k = h_{k−1,k}, Arnoldi's method simplifies to the (symmetric) Lanczos method; in this method, we make use of the fact that each new orthonormal basis vector can be computed using a three-term recurrence relation, that is, using only the two previous basis vectors:

Algorithm 6: Lanczos' method for computing K_m(A, v) for a symmetric A.
Data: A ∈ R^{n×n} and v ∈ R^n
β_1 = 0; q_0 = 0;                       /* initialization */
q_1 = v / ‖v‖;
for k := 1, . . . , m do                /* iteration */
    α_k := q_k^T A q_k;
    w = A q_k − α_k q_k − β_k q_{k−1};  /* new direction orthogonal to previous q */
    β_{k+1} = ‖w‖_2;
    q_{k+1} = w / β_{k+1};              /* normalization */
end
Result: Orthonormal basis q_1, . . . of K_m(A, v)
From Lanczos' method, we obtain the tridiagonal matrix

    T_k = ( α_1  β_2              0   )
          ( β_2  α_2   ⋱              )
          (       ⋱     ⋱    β_k      )
          ( 0          β_k   α_k      )

and

    Q_k^T A Q_k = T_k.   (2.33)
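A compact NumPy sketch of alg. 6 (an illustration without safeguards against breakdown, i.e., against β_{k+1} = 0):

```python
import numpy as np

def lanczos(A, v, m):
    """m steps of Lanczos for symmetric A: returns Q (n x m) with orthonormal
    columns spanning K_m(A, v) and the tridiagonal T = Q^T A Q."""
    n = A.shape[0]
    Q = np.zeros((n, m + 1))
    alpha, beta = np.zeros(m), np.zeros(m + 1)  # beta[0] = 0 (initialization)
    Q[:, 0] = v / np.linalg.norm(v)
    for k in range(m):
        w = A @ Q[:, k]
        alpha[k] = Q[:, k] @ w
        w -= alpha[k] * Q[:, k]          # orthogonalize against q_k ...
        if k > 0:
            w -= beta[k] * Q[:, k - 1]   # ... and against q_{k-1} (three-term recurrence)
        beta[k + 1] = np.linalg.norm(w)
        Q[:, k + 1] = w / beta[k + 1]
    T = np.diag(alpha) + np.diag(beta[1:m], 1) + np.diag(beta[1:m], -1)
    return Q[:, :m], T

rng = np.random.default_rng(0)
B = rng.standard_normal((30, 30))
A = B + B.T                              # symmetric test matrix
Q, T = lanczos(A, rng.standard_normal(30), m=6)

assert np.allclose(Q.T @ Q, np.eye(6), atol=1e-8)   # orthonormal basis
assert np.allclose(Q.T @ A @ Q, T, atol=1e-8)       # eq. (2.33)
```

In exact arithmetic the recurrence guarantees orthogonality; in floating point it slowly degrades, which is one reason reorthogonalization strategies exist in practice.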
Orthogonalization with projections, rotations, and reflections In alg. 1, we have employed the Gram–Schmidt orthonormalization algorithm to find an orthonormal basis for the range of a matrix A. There, we have used linear operations of the form

    P_w v = (v, w) w,   (2.34)

where ‖w‖ = 1. In fact, this type of linear map is an orthogonal projection; in fig. 2.4, we can see a graphical representation of the application of P_w.
[Figure 2.4: the orthogonal projection P_w v of a vector v onto w and the remainder v − P_w v.]
Definition 2.14.
A projection matrix is a square matrix P ∈ R^{n×n} with the property

    P² = P.

Definition 2.15.
A projection is an orthogonal projection if

    P^T = P.
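For a unit vector w, the map in eq. (2.34) has the matrix P_w = w w^T; a quick NumPy check (an illustration) of the two defining properties:

```python
import numpy as np

w = np.array([3.0, 4.0]) / 5.0     # unit vector, ||w|| = 1
P = np.outer(w, w)                 # matrix of P_w v = (v, w) w

assert np.allclose(P @ P, P)       # idempotent: a projection (def. 2.14)
assert np.allclose(P.T, P)         # symmetric: an orthogonal projection (def. 2.15)

v = np.array([1.0, 2.0])
assert np.allclose(P @ v, (v @ w) * w)  # matches eq. (2.34)
```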
Let Pw be defined in eq. (2.34). Discuss the relative condition number eq. (2.22) as defined in defini-
tion 2.8 for Pw (v) depending on v. What is the worst case?
We will therefore discuss two alternative approaches for computing an orthogonal basis, that is, using
rotations or reflections.
Let us first consider the matrix describing the rotation around the origin by an angle α:

G_α = \begin{pmatrix} \cos(α) & \sin(α) \\ −\sin(α) & \cos(α) \end{pmatrix}    (2.36)
[Figure 2.5: rotation of a vector v by angles α_1 and α_2, such that G_{α_1} v and G_{α_2} v are aligned with the coordinate axes.]
We have that

G_α^T G_α
= \begin{pmatrix} \cos(α) & −\sin(α) \\ \sin(α) & \cos(α) \end{pmatrix}
  \begin{pmatrix} \cos(α) & \sin(α) \\ −\sin(α) & \cos(α) \end{pmatrix}
= \begin{pmatrix} \sin(α)^2 + \cos(α)^2 & \sin(α)\cos(α) − \sin(α)\cos(α) \\ \sin(α)\cos(α) − \sin(α)\cos(α) & \sin(α)^2 + \cos(α)^2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I,
which means that G_α is actually an orthogonal matrix. It is clear that the angle α can be chosen such that

G_{α_1} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} r \\ 0 \end{pmatrix}    (2.37)

or

G_{α_2} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} 0 \\ l \end{pmatrix},    (2.38)

that is, such that the resulting vector is aligned with one of the two axes; cf. fig. 2.5.
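As a quick sketch, the rotation that maps (a, b) onto the x-axis, as in eq. (2.37), can be computed without evaluating trigonometric functions, using c = a/r and s = b/r with r = √(a² + b²); the numeric values below are illustrative:

```python
import numpy as np

def givens_2x2(a, b):
    """Rotation G_alpha with G @ [a, b] = [r, 0], where r = sqrt(a^2 + b^2)."""
    r = np.hypot(a, b)
    c, s = a / r, b / r               # cos(alpha) and sin(alpha)
    return np.array([[c, s], [-s, c]])

G = givens_2x2(3.0, 4.0)
print(G @ np.array([3.0, 4.0]))       # → approximately [5., 0.]
```

Avoiding cos/arctan here is also numerically preferable; this is how Givens rotations are set up in practice.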
Exercise 2.20. Rotation to axes

A = G_{α_1}^T R    (2.39)
In order to derive a scheme for general matrices, we can extend the rotation matrix to higher dimensions as follows:
G_α^{ij} = \begin{pmatrix}
1 &        &          &        &        &        &          &        &   \\
  & \ddots &          &        &        &        &          &        &   \\
  &        & \cos(α)  & 0      & \cdots & 0      & \sin(α)  &        &   \\
  &        & 0        & 1      &        &        & 0        &        &   \\
  &        & \vdots   &        & \ddots &        & \vdots   &        &   \\
  &        & 0        &        &        & 1      & 0        &        &   \\
  &        & −\sin(α) & 0      & \cdots & 0      & \cos(α)  &        &   \\
  &        &          &        &        &        &          & \ddots &   \\
  &        &          &        &        &        &          &        & 1
\end{pmatrix},    (2.40)
where i and j correspond to the rows and columns for performing the rotation. In particular, the rotation is, again, performed around the origin but within the plane spanned by the ith and jth coordinates. We call these matrices Givens rotation matrices.

Exercise 2.21.
Now, using Givens rotation matrices, we can transform a matrix

\begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}

into an upper triangular matrix as follows: In a first step, we eliminate the entry a_{n1} by choosing a suitable angle α_n, such that

G_{α_n}^{1n} \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}
= \begin{pmatrix} r_{11} & ? & \cdots & ? \\ a_{21} & \cdots & \cdots & a_{2m} \\ \vdots & & & \vdots \\ a_{n−1,1} & \cdots & \cdots & a_{n−1,m} \\ 0 & ? & \cdots & ? \end{pmatrix}.
In the same way, we eliminate all entries a_{21}, . . . , a_{n−1,1}, resulting in the matrix

G_{α_2}^{2n} \cdots G_{α_n}^{1n} \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}
= \begin{pmatrix} \hat r_{11} & ? & \cdots & ? \\ 0 & ? & \cdots & ? \\ \vdots & \vdots & \ddots & \vdots \\ 0 & ? & \cdots & ? \end{pmatrix}.    (2.41)
As before, in the computation of the LU factorization, we continue with the remaining lower right submatrix for the second step. In each step, we eliminate the entries below the diagonal in one column, until we end up with an upper triangular matrix R. Since all the rotation matrices involved in this procedure are orthogonal, we can easily obtain the Q matrix in the QR factorization of A by computing the transpose of the product of the Givens rotation matrices, as in eq. (2.39).
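A compact NumPy sketch of this elimination procedure follows; the dense rotation matrices and the test matrix are illustrative (production codes apply each rotation to only two rows instead of forming G):

```python
import numpy as np

def qr_givens(A):
    """QR factorization by Givens rotations: returns Q, R with A = Q @ R."""
    n, m = A.shape
    Q = np.eye(n)
    R = A.astype(float).copy()
    for j in range(min(n, m)):             # column whose subdiagonal is eliminated
        for i in range(n - 1, j, -1):      # zero out R[i, j] using row j
            a, b = R[j, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.eye(n)
            G[[j, i], [j, i]] = c          # rotation in the (j, i) plane
            G[j, i], G[i, j] = s, -s
            R = G @ R                      # row operation from the left
            Q = Q @ G.T                    # accumulate transposed rotations
    return Q, R

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Q, R = qr_givens(A)
# Q is orthogonal and R is upper triangular with A = Q @ R
```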
A similar procedure can be carried out using Householder reflection matrices. In particular, we can reflect a vector at a hyperplane p with unit normal vector w, ‖w‖ = 1, using the reflection matrix

H_p = I − 2ww^T = I − 2w ⊗ w.

[Figure 2.6: reflection of a vector at hyperplanes p_1 and p_2 with unit normal vectors w_1 and w_2.]
Now, using Householder reflection matrices, we can transform a matrix

\begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}

into an upper triangular matrix, analogously to eq. (2.41). As shown in exercise 2.22, the reflection matrices are orthogonal, and hence, we can easily obtain a QR factorization by multiplying and transposing the reflection matrices.
The reason that Givens rotations and Householder reflections lead to a numerically stable scheme is that
they are orthogonal matrices. As we have seen in exercise 2.16, the matrix-vector multiplication with an
orthogonal matrix is well-conditioned; the condition number is optimal, that is, 1. This means that numerical
errors are not amplified by an application of Givens rotations and Householder reflections.
Example 2.9. Numerical stability of the QR factorization
Let

A = \begin{pmatrix} 1 & 1 & 1 \\ 0.01 & 0 & 0.01 \\ 0 & 0.01 & 0.01 \end{pmatrix}.

We compute A = QR using Gram–Schmidt projections, Givens rotations, and Householder reflections. Computations with 10^{−3} accuracy yield

• for Gram–Schmidt projections:

Q = \begin{pmatrix} 1 & 0 & 0 \\ 0.01 & −0.707 & 0 \\ 0 & 0.707 & 1 \end{pmatrix}
⇒ Q^T Q = \begin{pmatrix} 1 & 0 & 0 \\ 0.01 & 1 & 0.707 \\ 0 & 0.707 & 1 \end{pmatrix}
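The loss of orthogonality can be reproduced numerically; the sketch below compares classical Gram–Schmidt against NumPy's Householder-based `np.linalg.qr` on a similar ill-conditioned matrix (the matrix and the choice eps = 1e-8 are illustrative):

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram-Schmidt QR (numerically unstable)."""
    n, m = A.shape
    Q = np.zeros((n, m))
    R = np.zeros((m, m))
    for j in range(m):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]
            v -= R[i, j] * Q[:, i]    # subtract projections onto previous columns
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

eps = 1e-8
A = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, eps],
              [0.0, eps, eps]])
Q_gs, _ = gram_schmidt(A)
Q_hh, _ = np.linalg.qr(A)             # Householder-based, numerically stable
err_gs = np.linalg.norm(Q_gs.T @ Q_gs - np.eye(3))
err_hh = np.linalg.norm(Q_hh.T @ Q_hh - np.eye(3))
# err_gs is of order 1, while err_hh stays near machine precision
```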
Figure 2.7: Application of A to the canonical basis vectors e1 , e2 , e3 . The determinant corresponds to the
(signed) volume of the parallelepiped defined by Ae1 , Ae2 , Ae3 .
In fig. 2.7, we can see the application of the matrix A to the canonical basis vectors e_1, e_2, e_3; this corresponds to the columns of A, that is,

A = \begin{pmatrix} a_1 & \cdots & a_n \end{pmatrix} = \begin{pmatrix} Ae_1 & \cdots & Ae_n \end{pmatrix}.
where j is a fixed column index, and Aij ∈ R(n−1)×(n−1) results from dropping the ith row and
jth column from A.
• If n = 1,
det (A) = a11 .
Exercise 2.23.
Derive a formula for the determinant of an upper triangular matrix R = (rij )ij ∈ Rn×n .
The determinant contains important information about the matrix. First of all, det (A) = 0 if and only
if A is singular. Moreover,
• switching two rows (or columns) flips the sign of det (A),
• scaling one row (or column) of A by a constant c, resulting in the matrix Ã, will scale the (signed) volume of the parallelepiped by c. Hence,

det(Ã) = c · det(A).
Another very important property of the determinant is the following lemma; cf. [2, lemma 3.2.1].
Lemma 2.4.
Let A ∈ R^{n×n} and B ∈ R^{n×n}. Then,

det(AB) = det(A) · det(B).
Based on lemma 2.4 and exercise 2.23, we obtain that, once we have computed a QR factorization

A = QR,

the determinant can be computed easily as det(A) = det(Q) det(R) = ± \prod_{i=1}^n r_{ii}, since det(Q) = ±1 for an orthogonal matrix Q.
Eigendecomposition The eigenvalues and eigenvectors expose very important information about the
matrix.
Let the matrix A ∈ R^{n×n} be diagonalizable with

A = V D V^{−1},    (2.44)

where V is invertible and D = diag(λ_1, . . . , λ_n) is diagonal. Equation (2.44) is also called the eigendecomposition of A. In particular, it yields
AV = V D,
and hence, we have
Avi = λi vi , (2.45)
for 1 ≤ i ≤ n, where vi is the ith column of V .
Definition 2.17. Eigenvalues and eigenvectors
Let A ∈ R^{n×n}. The scalar λ ∈ R and vector v ∈ R^n, v ≠ 0, are called eigenvalue and eigenvector of A if

Av = λv.    (2.46)

We denote the eigenspace corresponding to λ by

E_λ := {v ∈ R^n | Av = λv}.
Therefore, a diagonalizable matrix is fully determined by its eigenvalues and -vectors. They also provide
additional information about the matrix and the data stored in it. For instance:
Exercise 2.25. Spectral norm and eigenvalues
Let A ∈ Rn×n be diagonalizable. Show that:
1. The spectral norm of A corresponds to its largest eigenvalue in absolute value, that is, ‖A‖_2 = |λ_max|, provided that A is additionally symmetric.
For a symmetric matrix A, the basis of eigenvectors V can be chosen orthonormal. Then, we have V −1 = V > ,
and we obtain
A = V DV > .
Let us now discuss how to compute the eigenvalues and -vectors of a matrix. The eigenvalues of matrix
A are the roots of the characteristic polynomial characteristic
polynomial
p(λ) = det(A − λI).

Indeed, if λ is an eigenvalue with eigenvector v ≠ 0, then

0 = Av − λIv = (A − λI) v,

that is, A − λI is singular and det(A − λI) = 0.
This means that, in the complex numbers, we can always find n, not necessarily distinct, eigenvalues. If these eigenvalues are pairwise distinct, there is also a corresponding basis of eigenvectors, and A ∈ C^{n×n} is diagonalizable in the complex numbers. Unfortunately, even the existence of real eigenvalues is not guaranteed over the real numbers. However, as mentioned earlier, let us focus on scalars (real numbers) for now.
Let us assume that the characteristic polynomial has n, not necessarily distinct, real roots. This means that

p(λ) = \prod_{i=1}^{n} (λ̂_i − λ).

Several of the λ̂_i can coincide, that is, an eigenvalue λ_i can be a root of multiplicity larger than one of the characteristic polynomial. We call this the algebraic multiplicity of the eigenvalue. The algebraic multiplicity of an eigenvalue λ is not necessarily the same as the dimension of the eigenspace, dim(E_λ), which is called the geometric multiplicity. However, the algebraic multiplicity is an upper bound for the geometric multiplicity.
Once the eigenvalues {λi }1≤i≤n have been computed, the corresponding eigenvectors can be obtained by
solving the linear equation systems
(A − λi I) vi = 0
for v_i. The geometric multiplicity of λ_i is then given by the dimension of the solution space, which corresponds to E_{λ_i}.
For large data sets and, hence, large matrices, the characteristic polynomial is of high degree. Hence, it
becomes computationally demanding to compute the characteristic polynomial and all its roots numerically.
Therefore, the use of iterative schemes for computing the eigenvalues approximately can be more efficient.
One relatively simple algorithm can be derived based on computing QR factorizations; hence, the algorithm is called the QR algorithm.
Algorithm 7: QR algorithm
Data: A ∈ R^{n×n}
A_0 = A;
k = 0;
while ‖L_k‖ > TOL do                          /* iteration */
    A_k = Q_k R_k;                            /* compute QR factorization using alg. 1 */
    A_{k+1} = R_k Q_k;                        /* interchange factors */
    A_{k+1} = L_{k+1} + D_{k+1} + U_{k+1};    /* partition for checking stopping criterion */
    k = k + 1;                                /* update k */
end
Result: Eigenvalue approximations diag(A_k)
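A minimal NumPy sketch of the (unshifted) QR iteration; the symmetric test matrix, tolerance, and iteration cap are illustrative:

```python
import numpy as np

def qr_algorithm(A, tol=1e-10, max_iter=500):
    """Unshifted QR iteration: returns approximate eigenvalues of A."""
    Ak = A.astype(float)
    for _ in range(max_iter):
        Q, R = np.linalg.qr(Ak)       # factorize
        Ak = R @ Q                    # interchange the factors
        if np.linalg.norm(np.tril(Ak, -1)) < tol:   # lower part almost zero?
            break
    return np.sort(np.diag(Ak))

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])       # symmetric test matrix, eigenvalues 1, 2, 4
eigs = qr_algorithm(A)
# compare with np.linalg.eigvalsh(A)
```

Production eigensolvers add shifts and a preliminary reduction to Hessenberg form to accelerate convergence; this sketch omits both.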
Let us discuss briefly how the algorithm works: First, we compute the QR decomposition of the matrix
A0 = A, that is,
A0 = Q0 R0
Then, we interchange the factors Q0 and R0 to obtain an update for our matrix
A_1 = R_0 Q_0.

Since R_0 = Q_0^T A_0, this is a similarity transformation: A_1 = Q_0^T A_0 Q_0. Hence, if A_0 = A is symmetric, A_1 is symmetric as well. Moreover, all matrices A_0, A_1, . . . will have the same eigenvalues.
During the iteration, the lower triangular part L_k of A_k will converge to zero, while the diagonal entries of A_k will converge to the eigenvalues of A. In order to check for convergence, we consider ‖L_k‖ ≤ TOL, that is, we stop the QR iteration once the lower triangular part L_k of A_k is almost zero.
In the following two examples, you can see the convergence of the QR algorithm based on two 3 × 3
matrices, one symmetric and one unsymmetric matrix.
Example 2.10. QR Algorithm – Symmetric Matrix
For the symmetric case, we can easily obtain the eigenvectors as follows: We have that

D = Q^T A Q

with D = A_k almost diagonal and Q = Q_0 · · · Q_k orthogonal. Therefore, the columns of Q are good approximations to the eigenvectors of A.
Example 2.11. QR Algorithm – Unsymmetric Matrix
We have that

|λ_i / λ_1| < 1

for all |λ_i| < |λ_1|. Assuming that |λ_1| > |λ_2|, the expression in eq. (2.47) will converge to v_1 for d → ∞. Hence,

‖A^d v‖ → |λ_1|^d  and  \frac{1}{‖A^d v‖} A^d v → v_1

for d → ∞.
The resulting algorithm alg. 8 only requires the application of A but does not change the matrix itself.
Algorithm 8: Power method
Data: A ∈ R^{n×n} and initial vector v ∈ R^n
for k := 1, . . . , m do              /* iteration */
    v = Av / ‖Av‖;
    μ = (v^T A v) / (v^T v);          /* Rayleigh quotient */
end
Result: Approximate eigenvalue μ ≈ λ_1 of largest absolute value and corresponding eigenvector v ≈ v_1.
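A sketch of alg. 8 in NumPy; the test matrix (eigenvalues 5 and 2), starting vector, and iteration count are illustrative:

```python
import numpy as np

def power_method(A, v, m=100):
    """Power iteration: dominant eigenvalue (Rayleigh quotient) and eigenvector."""
    for _ in range(m):
        v = A @ v
        v = v / np.linalg.norm(v)     # normalize to avoid over-/underflow
    mu = (v @ A @ v) / (v @ v)        # Rayleigh quotient
    return mu, v

A = np.array([[4.0, 1.0], [2.0, 3.0]])
mu, v = power_method(A, np.array([1.0, 0.0]))
# mu approximates the dominant eigenvalue 5, v the eigenvector (1, 1)/sqrt(2)
```

Convergence is linear with rate |λ_2/λ_1| (here 2/5), so the method is fast only when the dominant eigenvalue is well separated.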
H_d ṽ = μ ṽ.    (2.48)

Then, v := Q_d ṽ is an eigenvector of the projected problem

Q_d Q_d^T A v = μ v,

and, if the Krylov space approximates the eigenvector well,

Q_d Q_d^T A v ≈ A v.

In order to solve the small eigenvalue problem eq. (2.48), even if this is not feasible for the original matrix A, we can now apply the QR algorithm.
Figure 2.8: Optimization of a quadratic function f (colored) under a quadratic constraint g = 0.
Left and right eigenvectors  The eigenvectors, which we have introduced before, are right eigenvectors,

Av = λv,

since they are applied to A from the right. In addition to that, we can introduce left eigenvectors, given by the eigenvalue problem

v^T A = λ v^T    (2.49)

for the left eigenvector v and the corresponding eigenvalue λ.

Exercise 2.26.
Show that the left and right eigenvalues of a matrix A ∈ R^{n×n} are the same.

For a diagonalizable matrix

A = V D V^{−1},

the columns of V are the right eigenvectors. On the other hand, we obtain

V^{−1} A = D V^{−1},

that is, the rows of V^{−1} are left eigenvectors of A.
Definition 2.19. Simultaneous diagonalizability
A set of diagonalizable matrices {A_1, . . . , A_k} is called simultaneously diagonalizable if there exists a single invertible matrix V such that

D_i = V A_i V^{−1}

is diagonal for all 1 ≤ i ≤ k.
As discussed before, a QR factorization

A = QR

can be used to solve a linear equation system

Ax = b.    (2.50)

For a square matrix A ∈ R^{n×n}, the computational complexity of this approach is O(n^3); in particular, it is

(2/3) n^3 + O(n^2)

FLOPs.
Here, we are going to discuss alternative matrix factorization approaches,
• the LU factorization and
• the Cholesky factorization,
and iterative Krylov schemes, which are more frequently used, in particular, for sparse matrices. Therefore,
as a first step, we will discuss how to perform row and column operations using matrix-matrix multiplications.
Row and column operations In order to derive matrix factorization techniques, it is generally help-
ful to understand that matrix row and column operations can be written in terms of matrix-matrix row and
multiplications, and how. column
operations
Let us first discuss row operations: We start with the exemplary matrix

\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} ∈ R^{3×3}.
In particular, the following matrix-matrix multiplications correspond to different linear row operations:
• Permutation of two rows:

\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 7 & 8 & 9 \\ 4 & 5 & 6 \end{pmatrix}
• Based on these considerations, give inverse matrices for the row and column operation matrices.
Of course, if each of the matrices O_1, . . . , O_k corresponds to a row operation, multiple row operations on a matrix A can be assembled into a single operation by multiplying all those matrices:

(O_k · · · O_1) A =: Ô A.

Exercise 2.28. Row operations
Eliminate all entries below the diagonal in the matrix in eq. (2.51) by performing row operations. Compute the matrix corresponding to performing all those row operations at once. Which matrix factorization results from these operations?
Performing column operations on a matrix A can generally be handled analogously by multiplying operators from the right:

A · O.
Exercise 2.29. Column operations
The same as exercise 2.28 but with column operations: Eliminate all entries above the diagonal in the matrix in eq. (2.51) by performing only column operations. Compute the matrix corresponding to performing all those column operations at once and the resulting matrix factorization.
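The row-versus-column distinction can be illustrated in a few lines of NumPy, using the example matrix from above:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Row operation: swapping rows 2 and 3 corresponds to multiplying by a
# permutation matrix from the LEFT.
P = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]])
print(P @ A)      # rows 2 and 3 swapped

# Column operation: the SAME matrix applied from the RIGHT swaps
# columns 2 and 3 instead.
print(A @ P)      # columns 2 and 3 swapped
```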
In exercises 2.28 and 2.29, you have already seen how to simplify the matrix structure by performing row or column operations. As we have already discussed in section 2.4, solving a linear equation system is straightforward if the matrix is triangular. We will make use of this in the next paragraph.
LU factorization  The most common way of factorizing an invertible matrix A ∈ R^{n×n} is the so-called LU factorization (or LU decomposition), that is, the factorization into an upper triangular matrix U and a lower triangular matrix L:

A = LU.

With y := U x, solving Ax = b is then equivalent to solving

Ly = b  ⇔  \begin{pmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ l_{n1} & \cdots & l_{n,n−1} & l_{nn} \end{pmatrix} y = b.
Solving this system is called forward substitution, and it can be easily done row by row, from the first row to the last row.
In order to compute x from y, we only have to solve

\begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ 0 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & r_{n−1,n} \\ 0 & \cdots & 0 & r_{nn} \end{pmatrix} x = y,

where the matrix on the left is U.
This step is called backward substitution, and it can, again, be performed row by row; this time, due to the matrix structure of U, it is done from the last to the first row.
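Both substitutions can be sketched directly in NumPy; the triangular factors and right-hand side below are illustrative:

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for lower triangular L, row by row from the top."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_substitution(U, y):
    """Solve U x = y for upper triangular U, row by row from the bottom."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

L = np.array([[2.0, 0.0], [1.0, 3.0]])
U = np.array([[1.0, 4.0], [0.0, 5.0]])
b = np.array([2.0, 7.0])
y = forward_substitution(L, b)      # solves L y = b
x = backward_substitution(U, y)     # then U x = y, so (L U) x = b
```

Each substitution costs O(n^2) FLOPs, which is why the triangular solves are cheap compared to computing the factorization itself.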
The LU decomposition can be computed by Gaussian elimination. As a first step, for eliminating one entry a_{i1} below the diagonal from

\begin{pmatrix} a_{11} & \cdots & \cdots & \cdots & a_{1n} \\ \vdots & \ddots & & & \vdots \\ a_{i1} & \cdots & a_{ii} & \cdots & a_{in} \\ \vdots & & & \ddots & \vdots \\ a_{n1} & \cdots & \cdots & \cdots & a_{nn} \end{pmatrix},

we subtract a_{i1}/a_{11} times the first row from the ith row. In terms of row operations via matrix-matrix multiplications, this corresponds to a multiplication with an elimination matrix from the left; cf. eq. (2.53).
Hence, the full first column can be eliminated as follows:

\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & & & a_{2n} \\ \vdots & & & \vdots \\ a_{n1} & \cdots & a_{n,n−1} & a_{nn} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & \cdots & 0 \\ \frac{a_{21}}{a_{11}} & 1 & \ddots & \vdots \\ \vdots & 0 & \ddots & 0 \\ \frac{a_{n1}}{a_{11}} & 0 & \cdots & 1 \end{pmatrix}
\cdot
\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ 0 & a_{22} − \frac{a_{21}}{a_{11}} a_{12} & \cdots & a_{2n} − \frac{a_{21}}{a_{11}} a_{1n} \\ \vdots & \vdots & & \vdots \\ 0 & a_{n2} − \frac{a_{n1}}{a_{11}} a_{12} & \cdots & a_{nn} − \frac{a_{n1}}{a_{11}} a_{1n} \end{pmatrix}    (2.54)

=
\underbrace{\begin{pmatrix} 1 & 0 & \cdots & \cdots & 0 \\ l_{21} & 1 & \ddots & & \vdots \\ \vdots & 0 & \ddots & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & 0 \\ l_{n1} & 0 & \cdots & 0 & 1 \end{pmatrix}}_{L^{(1)}}
\cdot
\underbrace{\begin{pmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1n} \\ 0 & r_{22}^{(1)} & r_{23}^{(1)} & \cdots & r_{2n}^{(1)} \\ 0 & r_{32}^{(1)} & r_{33}^{(1)} & \cdots & r_{3n}^{(1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & r_{n2}^{(1)} & r_{n3}^{(1)} & \cdots & r_{nn}^{(1)} \end{pmatrix}}_{U^{(1)}}
We use the upper index (k) to denote entries of the kth step of the algorithm; the index is omitted in case
the entries remain the same until termination of the algorithm. Note that, after one iteration of Gaussian
elimination, the first column of L(1) already contains the final entries of L. Moreover, the first two rows of
U (1) already contain the final entries of U .
Based on the discussion in the previous paragraph, that is, on row and column operations, it can be
observed that this factorization indeed yields the original matrix A.
As a next step after eq. (2.54), the matrix

\begin{pmatrix} r_{22}^{(1)} & r_{23}^{(1)} & \cdots & r_{2n}^{(1)} \\ r_{32}^{(1)} & r_{33}^{(1)} & \cdots & r_{3n}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n2}^{(1)} & r_{n3}^{(1)} & \cdots & r_{nn}^{(1)} \end{pmatrix}

is factorized as already in eq. (2.54). This procedure is performed recursively, until the remaining matrix block to factorize is only a single entry r_{nn} = r_{nn}^{(n−1)}.
Exercise 2.30. LU Decomposition
Compute the LU decomposition of the matrix
A = \begin{pmatrix} 1 & 2 & 4 \\ 3 & 8 & 14 \\ 2 & 6 & 13 \end{pmatrix}
Exercise 2.31. Computational Complexity
Discuss the number of FLOPs needed for
• computing the LU decomposition A = LU of a dense matrix A ∈ Rn×n ,
• solving Ly = b and U x = y for a right hand side b ∈ Rn .
Note that the step in eq. (2.53) can only be performed if a11 6= 0 because we would have to divide by
zero otherwise. However, if the matrix is invertible there is always a pivot, i.e., a nonzero element, in each pivot
column. Therefore, we can exchange the rows such that there is a nonzero element on the diagonal. As we
have seen in the previous paragraph, this can be achieved by multiplication with a permutation matrix from
the left. This yields an LU factorization with pivoting LU
factorization
P A = LU. (2.55) with pivoting
In order to improve the stability of the algorithm, we always exchange rows such that the element with the
maximum absolute value is on the diagonal.
Let us finally remark that, if the matrix A is sparse and many entries below the diagonal are already
zero, we can omit eliminating the corresponding entries in eq. (2.54). This will save computation work and
may also result in sparse matrices L and U . However, the matrices L and U can, in general, be much denser
compared to the original matrix A.
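The elimination with partial pivoting can be sketched as follows in NumPy (scipy.linalg.lu provides a production implementation; this illustrative version returns P, L, U with P A = L U, using the matrix from exercise 2.30 as a test case):

```python
import numpy as np

def lu_pivot(A):
    """LU factorization with partial pivoting: returns P, L, U with P A = L U."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(U[k:, k]))   # row with the largest pivot
        for M in (U, P):
            M[[k, p]] = M[[p, k]]             # exchange rows k and p
        L[[k, p], :k] = L[[p, k], :k]         # also swap stored multipliers
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]       # multiplier
            U[i, k:] -= L[i, k] * U[k, k:]    # eliminate entry U[i, k]
    return P, L, U

A = np.array([[1.0, 2.0, 4.0],
              [3.0, 8.0, 14.0],
              [2.0, 6.0, 13.0]])
P, L, U = lu_pivot(A)
# P @ A == L @ U, with L unit lower triangular and U upper triangular
```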
Cholesky factorization  In case the matrix A is symmetric positive definite, an LU factorization can be computed with fewer operations. In particular, instead of a general decomposition

A = LU,

we can compute a symmetric factorization

A = L D L^T    (2.56)

with a unit lower triangular matrix L and a diagonal matrix D; note that L · (D L^T) is actually an LU decomposition as derived before. From the fact that A is positive definite, we have that the entries of D are all positive, and hence, we can compute the matrix D^{1/2} as discussed in section 2.3, such that

D^{1/2} D^{1/2} = D
and with L̂ := L D^{1/2}, we can get another form of the Cholesky factorization:

A = L̂ · L̂^T.    (2.58)
Different from the LU factorization discussed in the previous paragraph and the LU factorization in eq. (2.57),
the diagonal entries of L̂ are generally not one.
Both variants eqs. (2.56) and (2.58) are computationally more efficient than the standard LU factorization, and storing the resulting matrices L and D or L̂, respectively, generally requires less storage compared to storing L and U.
In order to find the matrix L in eq. (2.56), let l_i be the ith column of L. Then, one can make use of the fact that

a_{ij} = d_{ii}^{1/2} d_{jj}^{1/2} \, l_i^T · l_j.
Exercise 2.32.
1. Derive an algorithm for computing the Cholesky factorization of an SPD matrix A ∈ R^{n×n}.
2. What is the computational complexity of computing the Cholesky factorization?
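A sketch of the L̂ L̂^T variant (eq. (2.58)) with explicit loops; np.linalg.cholesky is the production routine, and the SPD test matrix is illustrative:

```python
import numpy as np

def cholesky(A):
    """Cholesky factorization A = L @ L.T for a symmetric positive definite A."""
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        # diagonal entry: remove the contributions of the previous columns
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])       # SPD test matrix
L = cholesky(A)
# L is lower triangular with L @ L.T == A
```

Only the lower triangle of A is touched, which is where the factor-of-two savings over a general LU factorization come from.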
Matrix inversion  The inverse of an invertible matrix A ∈ R^{n×n} can be written as

A^{−1} = \begin{pmatrix} ã_1 & \cdots & ã_n \end{pmatrix} = \begin{pmatrix} A^{−1} e_1 & \cdots & A^{−1} e_n \end{pmatrix},

where {e_1, . . . , e_n} is the canonical basis in R^n. Each of the columns ã_i ∈ R^n can be computed by solving a linear equation system

A ã_i = e_i.
Hence, assuming that we have computed an LU factorization of matrix A = LU , the inverse A−1 can be
computed by solving n linear equation systems using forward and backward substitution. Based on exer-
cise 2.31, it is easy to derive the computational complexity O(n3 ).
Usually, we are not interested in all the entries of the matrix A^{−1}, but we want to be able to apply it to some vectors {v_1, . . . , v_k}, where usually k ≪ n. In this case, it is therefore much more efficient to solve the linear equation systems

A x_i = v_i

for 1 ≤ i ≤ k. On the other hand, even if k ≈ n, A^{−1} may be much denser compared to L and U. Therefore, even storing the matrix A^{−1} might be infeasible.
Figure 2.9: Minimizing a function by following the steepest descent, that is, the negative gradient. For a non-convex function Φ, if we started in a different point, we might end up in a different local minimum.
Iterative solvers Finally, we will discuss briefly how to solve linear equation systems
Ax = b
using iterative solvers, which only involve matrix-vector multiplications but no modifications of the matrix
itself.
Even though this assumption is not satisfied for matrices corresponding to general data sets, for the sake of brevity and simplicity, let us focus on the easy case that A ∈ R^{n×n} is SPD. Then, solving Ax = b is equivalent to minimizing the function Φ, where Φ(y) := ½ y^T A y − y^T b. We will now first introduce the gradient descent method, which is illustrated by fig. 2.9. It is motivated by the idea that the fastest way of reaching the minimum of a valley is to descend in the steepest direction, which is the direction of the negative gradient

−∇Φ(x) = b − Ax.
This vector r := b − Ax is also called the residual of the linear equation system Ax = b. That is, if Ax̂ = b, the corresponding residual is b − Ax̂ = 0. Instead of just using the residual as the step, we usually scale it by some factor α, which, in machine learning, is usually called the learning rate. The learning rate is a hyperparameter, and changing the learning rate can have two consequences:

• If the individual steps are unnecessarily small otherwise, we can speed up convergence by choosing a large α.

• If the individual steps are too large, for instance, if we jump over the whole valley within one step, we can improve the convergence by choosing a small α.
There are also extensions of this simple method, which choose the learning rate adaptively. This is important
because, similar to the function in fig. 2.9, training a machine learning model generally corresponds to
optimizing a non-convex function. Hence, the plain gradient descent algorithm might not yield optimal
convergence, and more advanced variants are required.
The standard gradient method is given in alg. 9.

Algorithm 9: Gradient descent method
Data: Initial guess x^(0) ∈ R^n, learning rate α ∈ R_+, and tolerance TOL > 0
r^(0) := b − A x^(0);
k := 0;
while ‖r^(k)‖ ≥ TOL ‖r^(0)‖ do
    x^(k+1) := x^(k) + α r^(k);
    r^(k+1) := b − A x^(k+1);
    k := k + 1;
end
Result: Approximate solution of Ax = b
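The gradient descent iteration for Ax = b can be sketched as follows; the SPD matrix, learning rate, and tolerance are illustrative:

```python
import numpy as np

def gradient_descent(A, b, x0, alpha, tol=1e-10, max_iter=10_000):
    """Solve A x = b for SPD A by following the negative gradient of
    Phi(y) = 0.5 * y^T A y - y^T b, i.e. the residual r = b - A x."""
    x = x0.astype(float)
    r0_norm = np.linalg.norm(b - A @ x)
    for _ in range(max_iter):
        r = b - A @ x                     # residual = negative gradient
        if np.linalg.norm(r) < tol * r0_norm:
            break
        x = x + alpha * r                 # gradient step with learning rate alpha
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])    # SPD
b = np.array([5.0, 5.0])
x = gradient_descent(A, b, np.zeros(2), alpha=0.4)
# x approximates the solution of A x = b, here [1, 2]
```

The iteration only converges if α is small enough relative to the largest eigenvalue of A (here, α < 2/λ_max); too large a learning rate makes the iterates diverge, matching the discussion above.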
Exercise 2.33.
If we allow for choosing the learning rate in each iteration step, derive a formula for the optimal
choice, in case A ∈ Rn×n is SPD.
Gradient based optimization techniques will be discussed in more detail in the optimization and machine
learning parts of the course.
Finally, we want to note that Krylov spaces can be used to improve the convergence of the standard
gradient descent method. In particular, instead of approaching the solution by only considering the local
gradient in one direction, we take into account the previous search directions as well.
One example of Krylov subspace methods for solving a linear equation system with an SPD matrix is the conjugate gradient method.

Algorithm 10: Conjugate gradient method
Data: Initial guess x^(0) ∈ R^n and tolerance TOL > 0
r^(0) := b − A x^(0);
p^(0) := r^(0);
k := 0;
while ‖r^(k)‖ ≥ TOL ‖r^(0)‖ do
    α_k := (p^(k), r^(k)) / (A p^(k), p^(k));
    x^(k+1) := x^(k) + α_k p^(k);
    r^(k+1) := r^(k) − α_k A p^(k);
    β_k := (A p^(k), r^(k+1)) / (A p^(k), p^(k));
    p^(k+1) := r^(k+1) − β_k p^(k);
    k := k + 1;
end
Result: Approximate solution of the linear equation system Ax = b
Without discussing the algorithm in detail, similar to Lanczos' method, it exploits the symmetry of the matrix so that the orthogonalization can be performed via a short recurrence.
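Alg. 10 translates almost line by line into NumPy; the SPD test problem below is illustrative:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=1000):
    """Conjugate gradient method for an SPD matrix A, following alg. 10."""
    x = x0.astype(float)
    r = b - A @ x
    p = r.copy()
    r0_norm = np.linalg.norm(r)
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol * r0_norm:
            break
        Ap = A @ p
        alpha = (p @ r) / (Ap @ p)
        x = x + alpha * p                 # step along the search direction
        r = r - alpha * Ap                # update the residual
        beta = (Ap @ r) / (Ap @ p)
        p = r - beta * p                  # new A-conjugate search direction
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])          # SPD test matrix
b = np.array([1.0, 2.0, 3.0])
x = conjugate_gradient(A, b, np.zeros(3))
# A @ x approximates b
```

In exact arithmetic, CG solves an n-dimensional SPD system in at most n iterations; in practice, it is used as an iterative method and stopped by the residual criterion.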
Definition 2.20. Rank-deficiency
A matrix A ∈ R^{n×m} is called rank-deficient if

rank(A) < min(n, m).
This is the most general type of data set to be stored in a matrix A. This also results in the fact that
the least squares problem
\arg\min_{x ∈ domain(A)} ‖Ax − b‖^2,
as defined in eq. (2.27), is not well-posed anymore. Consider the following example of image data:
Example 2.12. Rank-deficient data
The above picture of the surface of Mercury has an image resolution of 1 144 × 1 071 pixels; however, its rank is only 153.
A different example is a database with many more features than observations, for instance, the database of a video platform with more videos than users; see also the discussion on sparse data in section 2.1.
In this case, the matrix A may not be diagonalizable, that is, there is no basis of eigenvectors, and we
will need a different tool in order to extract structure and important information from the matrix.
Singular value decomposition Let us consider a general matrix A ∈ Rn×m with rank r, where r ≤
min (n, m). In contrast to the eigendecomposition, the singular value decomposition exists for any matrix.
This is stated in the following theorem and definition:
Theorem 2.9. Singular value decomposition
Let A ∈ R^{n×m} be a matrix of rank r. Then, there exist orthogonal matrices U ∈ R^{n×n} and V ∈ R^{m×m} such that

A = U Σ V^T,  Σ = \begin{pmatrix} Σ_r & 0 \\ 0 & 0 \end{pmatrix},    (2.59)

where Σ ∈ R^{n×m},

Σ_r = diag(σ_1, σ_2, · · · , σ_r),

and σ_1 ≥ σ_2 ≥ · · · ≥ σ_r > 0. Equation (2.59) is called the singular value decomposition (SVD) of A. The σ_i are called the singular values of A, and they are unique up to their ordering. The columns of U and V are called the left and right singular vectors, respectively.
The same holds true if R is replaced by C. Then,

A = U Σ V^†.

Considering the columns of U = (u_1 · · · u_n) and V = (v_1 · · · v_m), eq. (2.59) yields

A v_i = σ_i u_i,  A^T u_i = σ_i v_i  ∀ i = 1, . . . , min(n, m).

This looks very similar to the eigenvalue problem eq. (2.46)
we saw earlier. However, the vectors on the left and right hand sides are now different. Nonetheless, there is a close connection between the SVD and the eigendecomposition introduced before. In particular,

A^T A = V Σ^T U^T U Σ V^T = V Σ^T Σ V^T.

Since V is orthogonal, we have that V^{−1} = V^T. Therefore, the eigenvalues of A^T A are the squared singular values of A. Moreover, the eigenvectors of A^T A are the right singular vectors of A.
Similarly,

A A^T = U Σ V^T V Σ^T U^T = U \underbrace{Σ Σ^T}_{= diag(σ_1^2, σ_2^2, · · · , σ_r^2, 0, . . . , 0)} U^T,
and the eigenvalues of AA> are also the squared singular values of A. Moreover, the eigenvectors of AA>
are the left singular vectors of A.
Moreover, if the matrix A ∈ R^{n×n} has an orthonormal basis of eigenvectors V ∈ R^{n×n} and all eigenvalues are nonnegative, then

A = V D V^T

is also a singular value decomposition of A, with U = V; here, D ∈ R^{n×n} is the diagonal matrix containing the eigenvalues of A. Hence, the left and right singular vectors are the eigenvectors, and the singular values are the eigenvalues of A.
Exercise 2.34. Singular value and eigendecomposition
For some small n (for instance n = 3):
1. Give an example for matrices A, V, D ∈ Rn×n with
A = V DV > ,
A = V DV > ,
3. Can you come up with matrix properties discussed earlier which are necessary conditions for
the existence of a decomposition
A = V DV > ,
with V ∈ Rn×n orthogonal and D ∈ Rn×n diagonal?
‖A‖_2 = σ_max.

‖A‖_2 ≤ ‖A‖_F.
There is an alternative form of the SVD, the compact SVD, which uses semi-orthogonal matrices instead of orthogonal matrices:

Theorem 2.10. Compact singular value decomposition
For a rank r matrix A ∈ R^{n×m}, there exists a compact singular value decomposition (compact SVD):

A = U Σ_r V^T,  Σ_r = diag(σ_1, σ_2, · · · , σ_r),

where U ∈ R^{n×r} and V ∈ R^{m×r} are semi-orthogonal matrices.
The following theorem justifies why the SVD is an essential tool in unsupervised learning:

Theorem 2.11. Eckart–Young Theorem
Let A ∈ R^{n×m} be of rank r and have a singular value decomposition A = U Σ V^T with

Σ = \begin{pmatrix} Σ_r & 0 \\ 0 & 0 \end{pmatrix},  Σ_r = diag(σ_1, σ_2, · · · , σ_r).

For k < r, let A_k := U Σ^{(k)} V^T, where Σ^{(k)} retains only the k largest singular values σ_1, . . . , σ_k. Then, A_k minimizes

‖A − B‖_F

over all matrices B of rank at most k.
Therefore, the singular value decomposition of a matrix A enables us to find best approximations with a given rank k, that is, best low-rank approximations of A. We will see this in the following example:

Example 2.13. Image Compression
Surface of Mercury.
Let us again consider the image of the surface of Mercury already shown earlier in examples 1.5
and 2.12. The high-resolution image is of size 1 144 × 1 071 and rank 1 071.
In the following plot, we can see the singular values σ_1, . . . , σ_{1 071} of the matrix corresponding to the pixel values:
It can be observed that the first singular values are much higher than the smallest singular values, and
hence, following theorem 2.11, we can find good low-rank approximations of the image with respect
to the Frobenius norm. The image on the top right is a rank 37 approximation of the image, that is,
only 37 singular values as well as left and right singular vectors are necessary to store this image.
In the following table, several examples of error resulting from such low-rank approximations are
listed:
‖A − Â‖_F / ‖A‖_F   rank    numbers to store   compression factor
0.05                 1       2 216              553.0
0.01                 9       19 944             61.0
0.005                37      81 992             15.0
0.001                153     339 048            3.6
0.0                  1 071   1 225 224          1.0

Numerical results for SVD image compression.
As we can observe, for the image on the top right, we obtain compression by a factor of 15, resulting
in an error of only 0.5 % in the Frobenius norm.
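Such a low-rank compression can be sketched in a few lines of NumPy; the random rank-5 matrix below stands in for the image:

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]   # keep only the k largest singular values

rng = np.random.default_rng(0)
# Synthetic "image": a rank-5 matrix plus small noise.
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
A += 1e-3 * rng.standard_normal(A.shape)

A5 = best_rank_k(A, 5)
rel_err = np.linalg.norm(A - A5) / np.linalg.norm(A)
# rel_err is tiny: only the noise outside the dominant rank-5 part remains
```

Storing U[:, :k], s[:k], and Vt[:k, :] requires k(n + m + 1) numbers instead of nm, which is exactly the compression factor reported in the table above.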
Efficiently computing the SVD  In section 2.5, we have already discussed how to use Krylov subspaces to approximately solve an eigenvalue problem. Due to the connection of the SVD and the eigenvalue decompositions of A^T A and A A^T, we could compute a singular value decomposition of A by solving the respective eigenvalue problems. However, the matrices A^T A and A A^T are typically ill-conditioned, since their condition number is the square of the condition number of A, and using the QR decomposition for large matrices may result in high computational cost. Moreover, as before, the algorithm does not preserve sparsity of the matrix. Instead, we can, again, derive a scheme to approximately compute the SVD of A using Krylov subspaces.
As we have seen in theorem 2.11 and example 2.13, the largest singular values are most relevant for good approximations of the matrix corresponding to the data.
Therefore, let us apply Lanczos' method alg. 6, as described in section 2.5, to the symmetric matrix

C = \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}.
The first iteration of Lanczos' algorithm yields

β_1 = 0,  q_0 = 0,  α_1 := q_1^T C q_1,  v = C q_1 − α_1 q_1 − β_1 q_0.

For q_1 = \frac{1}{‖w‖} \begin{pmatrix} w \\ 0 \end{pmatrix}, this becomes

v = \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix} \frac{1}{‖w‖} \begin{pmatrix} w \\ 0 \end{pmatrix}
− \frac{1}{‖w‖^3} \underbrace{\begin{pmatrix} w^T & 0 \end{pmatrix} \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} w \\ 0 \end{pmatrix}}_{=0} \begin{pmatrix} w \\ 0 \end{pmatrix}
= \frac{1}{‖w‖} \begin{pmatrix} 0 \\ A^T w \end{pmatrix}.
Repeating this procedure shows that we get orthogonal vectors q_k of the form

q_{2l−1} = \begin{pmatrix} u \\ 0 \end{pmatrix}  and  q_{2l} = \begin{pmatrix} 0 \\ v \end{pmatrix},  l = 1, . . . ,
and the matrix

B_k = \begin{pmatrix} α_1 & & & \\ β_2 & α_2 & & \\ & β_3 & \ddots & \\ & & \ddots & α_k \\ & & & β_{k+1} \end{pmatrix} ∈ R^{(k+1)×k},

the Lanczos bidiagonalization algorithm yields the following relations:

A V_k = U_{k+1} B_k,
A^T U_{k+1} = V_k B_k^T + α_{k+1} v_{k+1} e_{k+1}^T = V_k B_k^T + α_{k+1} v_{k+1} ⊗ e_{k+1}.    (2.60)
Let us now discuss how this algorithm will help us to compute a reduced dimensional variant of the singular value decomposition. Before, we have seen that, for a singular value σ and its left and right singular vectors u and v,

Av = σu,  A^T u = σv.

Now, let σ̃ be a singular value of B_k and ũ and ṽ the corresponding left and right singular vectors. Then, as before,

B_k ṽ = σ̃ ũ,  B_k^T ũ = σ̃ ṽ.
Using eq. (2.60), we obtain
Thus, as before in section 2.5 for computing eigenvalues and eigenvectors in a Krylov subspace, the singular values of B_k converge to the singular values of A, and the vectors U_{k+1} ũ and V_k ṽ converge to the left and right singular vectors of A. In particular, once we span the full rank of A, the term α_{k+1} v_{k+1} e_{k+1}^T ũ will vanish, and we obtain all nonzero singular values as well as the corresponding left and right singular vectors.
Pseudo-inverse matrices Let us again consider a rank-deficient matrix A ∈ Rn×m . In this case, there
are multiple solutions of the least squares problem eq. (2.27),
In order to make the problem well-posed, we can reformulate it such that the solution becomes unique. In particular, let us consider the least-squares solution with minimum norm (LSMN), x_LSMN.

Let x̃ ⊥ ker(A); then each solution x̂ of eq. (2.61) can be written as

x̂ = x̃ + y

with y ∈ ker(A),
and hence, the minimum is attained if y = 0. The solution with minimum norm is xLSMN , that is, the
solution orthogonal to ker (A). Hence, we obtain the LSMN solution by solving the constrained problem:
Now, let

A = U Σ V^T,  Σ = \begin{pmatrix} Σ_r & 0 \\ 0 & 0 \end{pmatrix}

be the SVD eq. (2.59) of A. Then, the matrix

A^+ = V \begin{pmatrix} Σ_r^{−1} & 0 \\ 0 & 0 \end{pmatrix} U^T    (2.62)

is the so-called Moore–Penrose inverse of A. It satisfies

A A^+ A = U \begin{pmatrix} Σ_r & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} Σ_r^{−1} & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} Σ_r & 0 \\ 0 & 0 \end{pmatrix} V^T
= U \begin{pmatrix} Σ_r & 0 \\ 0 & 0 \end{pmatrix} V^T = A

and analogously

A^+ A A^+ = V \begin{pmatrix} Σ_r^{−1} & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} Σ_r & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} Σ_r^{−1} & 0 \\ 0 & 0 \end{pmatrix} U^T
= V \begin{pmatrix} Σ_r^{−1} & 0 \\ 0 & 0 \end{pmatrix} U^T = A^+.
In case A ∈ R^{n×n} is invertible, we also have A^+ = A^{-1}. The two identities above are in fact part of the defining properties of the Moore–Penrose inverse: B = A^+ satisfies

A B A = A,
B A B = B.
Theorem 2.12.
The LSMN solution x_LSMN introduced before can be computed using the Moore–Penrose inverse by

x_LSMN = A^+ b = V [Σ_r^{-1} 0; 0 0] U^T b.
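To make eq. (2.62) concrete, here is a small NumPy sketch (variable names ours) that builds A^+ from the SVD by inverting only the nonzero singular values, and uses it to compute the LSMN solution:

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-deficient A: a 5x4 matrix of rank 2.
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))

U, s, Vt = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * s[0]   # threshold for "zero" singular values
mask = s > tol
s_inv = np.zeros_like(s)
s_inv[mask] = 1.0 / s[mask]

# Sigma^+ = [Sigma_r^{-1} 0; 0 0], shaped (m, n); then A^+ = V Sigma^+ U^T (eq. 2.62).
Sigma_plus = np.zeros((A.shape[1], A.shape[0]))
Sigma_plus[: len(s), : len(s)] = np.diag(s_inv)
A_plus = Vt.T @ Sigma_plus @ U.T

# LSMN solution of min ||Ax - b||: x = A^+ b.
b = rng.standard_normal(5)
x_lsmn = A_plus @ b
```

The construction matches `numpy.linalg.pinv`, and `numpy.linalg.lstsq` returns the same minimum-norm solution for rank-deficient systems.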
Based on the definition of the pseudo-inverse, we can generalize the definition of the condition number of an invertible matrix (definition 2.9) to general matrices:

Definition 2.23. Condition number of a matrix
We define the condition number of a matrix A ∈ R^{n×m} as

κ := κ(A) = ‖A‖ ‖A^+‖.
As we have discussed earlier, the conditioning of the normal equations is significantly worse compared to
the original problem. The following exercise will explain this:
Figure 2.10: Graph and the corresponding incidence matrix. The graph (left) has nodes x1, x2, x3, x4 and edges e1, . . . , e5; the incidence matrix (right) is

          x1  x2  x3  x4
    e1 ( −1   1   0   0 )
    e2 ( −1   0   1   0 )
A = e3 (  0  −1   1   0 )
    e4 (  0  −1   0   1 )
    e5 (  0   0  −1   1 )

In the incidence matrix, we use the convention that, in each row, the −1 comes first and the 1 comes second.
As in the example in fig. 2.10 (left), each graph consists of nodes connected by edges. In fig. 2.10 (right),
we show the resulting incidence matrix A, indicating the edges of the graph. For a graph with n nodes incidence
and m edges, the incidence matrix A has m rows and n columns, that is, A ∈ Rm×n . matrix
In order to define the values of the incidence matrix of a graph in a unique way, we use the following convention: an edge e_k connecting the nodes x_i and x_j, with i < j, is indicated by the entries

a_{ki} = −1,   a_{kj} = 1.

All other entries of the kth row of A are zero. In other words, each row contains a single −1 and a single 1, and the −1 always comes before the 1. Hence, the incidence matrix A of a graph is generally sparse.
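The convention above can be implemented directly. A small sketch (helper name ours) that builds the incidence matrix of the graph in fig. 2.10:

```python
import numpy as np

def incidence_matrix(n_nodes, edges):
    """Incidence matrix A in R^{m x n}: row k has -1 at the smaller node index
    of edge k and +1 at the larger one (the convention above). Nodes are 0-based."""
    A = np.zeros((len(edges), n_nodes))
    for k, (i, j) in enumerate(edges):
        i, j = min(i, j), max(i, j)
        A[k, i] = -1.0
        A[k, j] = 1.0
    return A

# The graph of fig. 2.10: nodes x1..x4 (here 0-based), edges e1..e5.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
A = incidence_matrix(4, edges)
```

Note that every row sums to zero, which is exactly why the constant vector lies in the kernel of A.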
Figure 2.11: A connected graph with nodes x1, . . . , x5 and edges e1, . . . , e5, which is not connected anymore once the edge e5 (gray) is omitted.
Figure 2.11 shows an example of a connected graph. In case e5 is omitted, then the sets of nodes
{x1 , x2 , x3 } and {x4 , x5 } are not connected. Otherwise, there is a path of edges between each pair of nodes.
The incidence matrix of a connected graph has the following properties:

rank(A) = n − 1,   dim ker(A) = 1,   dim ker(A^T) = m − n + 1.    (2.64)
We will not discuss a general proof for these properties, but the next exercise will help us to understand
them:
Exercise 2.39. Connected Graph
Verify the properties eq. (2.64) of the incidence matrix for the graph in fig. 2.11. How does the situation change if edge e5 is omitted?
Definition 2.25.
A complete graph is a graph where each pair of nodes is connected by an edge.
Figure 2.12: A complete graph, where each pair of nodes is connected by an edge.
Figure 2.13: A tree with nodes x1, x2, x3, x4.
Definition 2.26.
A tree is a graph in which each pair of nodes is connected by exactly one path.
Using the incidence matrix of a graph, we can define three more matrices describing the graph:
Definition 2.27.
The symmetric positive semidefinite graph Laplacian matrix of a graph is defined as

L := A^T A ∈ R^{n×n},

where A is the incidence matrix of the graph with n nodes. The degree matrix D is the diagonal of L,

D := diag(L),

and

B := D − L    (2.65)

is the adjacency matrix of the graph.

The degree matrix counts the number of edges at each node: in particular, d_ii is the number of edges of the node x_i. The adjacency matrix is a symmetric binary matrix (entries 0 and 1), where b_ij = b_ji = 1 if and only if there is an edge between the nodes x_i and x_j.
Exercise 2.42. Graph Laplacian, degree matrix, and adjacency matrix
Compute the graph Laplacian, degree matrix, and adjacency matrix of the graphs in figs. 2.12
and 2.13.
Since the graph Laplacian L is symmetric positive semi-definite and sparse, many of the techniques
introduced so far are applicable, for instance, efficient techniques for solving a linear equation system with
L and computing its eigenvalues.
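As a quick illustration of definition 2.27 (variable names ours), the three matrices for the graph of fig. 2.10 can be computed from its incidence matrix:

```python
import numpy as np

# Incidence matrix of the graph in fig. 2.10 (rows e1..e5, columns x1..x4).
A = np.array([
    [-1,  1,  0,  0],
    [-1,  0,  1,  0],
    [ 0, -1,  1,  0],
    [ 0, -1,  0,  1],
    [ 0,  0, -1,  1],
], dtype=float)

L = A.T @ A                  # graph Laplacian, symmetric positive semidefinite
D = np.diag(np.diag(L))      # degree matrix: d_ii = number of edges at node x_i
B = D - L                    # adjacency matrix (eq. 2.65)
```

Here the degrees come out as 2, 3, 3, 2 for x1, . . . , x4, and the constant vector lies in the kernel of L, reflecting dim ker(A) = 1 for a connected graph.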
The concepts introduced before can be extended to weighted graphs, that is, graphs where the strength of the connection between two nodes can differ. If C is a diagonal matrix with positive weights, then, for instance, the resulting weighted graph Laplacian matrix is

A^T C A.    (2.66)

We will not further discuss these weighted graphs at this point, but we might come back to them at a later point.
3 Optimization basics
3.1 Motivation
In this lecture we will begin our three-part story of optimization methods we need for the remainder of the
course.
Example 3.1.
Consider again the regression problem of Example 1.2. Try to approximate the function based on the 11 sampled perturbed measurements at points {0, 1, . . . , 10}, by doing a least-squares regression with

Φ(x) = (1, x, x^2, x^3)^T,

i.e., by minimizing the empirical loss f(w) over the coefficient vector w; the explicit expression for f(w) was generated using a symbolic computation package (not by hand derivation). We call w the decision variable and f(w) the objective function.
Although the above problem can be formulated as a least-squares problem whose solution can be found analytically, doing so by hand would involve enormous effort. Next to that, a tiny modification in the loss function used (use |y_i − Φ(x_i)^T w|, for example, or something else) would make the problem no longer a least-squares one. We therefore need a more general methodology.
Finding the w that minimizes the empirical loss (training) is nothing else but solving an optimization problem in which we select the function's parameters so that the value of the above expression gets minimized. A need therefore arises to find a w that 'achieves the best performance' on the training data at a reasonable computational cost. Ideally, the algorithm used to determine the parameters works fast, so that it can be run on big datasets without difficulties.
Remark 3.1.
In the following lectures, we will only consider optimization problems in which the decision variables can take any real value. Such problems are known as continuous optimization problems.
There are also many applications in which the decision variables are restricted to be equal to, for example, {0, 1}. Solving such integer or mixed-integer optimization problems is, in general, much more difficult computationally and is not common in ML due to the sizes of the datasets used.

Figure 3.1: Plot of the polynomial 2w^4 + w^3 − 3w^2 generated using 351 equidistant points on the interval [−2, 1.5].
For the time being, we will leave the specific ML optimization problems aside to focus on optimization alone, and once we know 'what are convenient optimization problems', we will get back to ML to formulate some basic optimization problems.
A point x̄ is a local minimum of a function f : R^n → R if there exists an ε > 0 such that

f(x̄) ≤ f(x)   ∀x ∈ B(x̄, ε),

where B(x̄, ε) is a ball of radius ε around x̄. For a local maximum, the opposite inequality holds.
A point x̄ is a global minimum of a function f : R^n → R if

f(x̄) ≤ f(x)   ∀x ∈ R^n.
From a visual inspection, we clearly see that the point (0.70, −0.65) is only a local minimum, whereas the point (−1.07, −2.04) is both a local and a global minimum. Obviously, when we want to minimize a function f, we would like to end up in the point (−1.07, −2.04). In this specific case, the task is easy: we can just point with our finger to where the minimum is because we 'see it'.
However, 'seeing' required, first, the problem to be a one-dimensional one (you can't 'see' things in problems where x has more than two dimensions). Secondly, we needed to have a good guess about where the minimum is, and thirdly, to evaluate the function 351 times to draw the pictures. In fact, what we did was applying a strategy known as grid search.
Definition 3.2.
Grid search is a minimization strategy that evaluates a function at a grid of n equidistant points and
picks the point with the lowest value as the solution. Grid search is a very general methodology and
it is used, among others, for hyperparameter optimization of ML models, see section 3.7.
Grid search can be a good idea in low-dimensional optimization problems to get a 'rough feel' for the area where the minimum might be. Already for two dimensions, however, the number of points one needs to evaluate can be prohibitive. If x ∈ R^2, and we want to inspect 10 possible values for each entry of x, then overall we inspect 10^2 combinations of values. For n dimensions, this means 10^n points, which is a prohibitively large number. For highly irregular functions, even grid search might not provide very informative results.
At the same time, in real-life applications, function evaluation can be an expensive operation, and the goal is to find a minimum in as short a time as possible, using as few function evaluations as possible. For that reason,
in order to build minimization algorithms for which we are able to provide any performance guarantees,
we need to limit ourselves to minimizing ‘well-behaved’ functions. As we will repeat many times in these
lectures, making assumptions will be rather cheap in the ML context, because it is us – the ML tool designer
– who chooses what functions are minimized.
For the above reasons, we will now assume that we are minimizing differentiable functions over Rn . For
such, we have the following convenient property:
Theorem 3.1. Stationary points
For a differentiable function f : R^n → R, every local minimum or maximum x satisfies

∇f(x) = 0.    (3.1)
As we will see from the construction of our gradient method, it will be rather unlikely that we end up in a local maximum. However, we will be in real danger of ending up in a so-called saddle point. Saddle points are, typically, a big enemy of an optimizer.
Definition 3.3. Saddle points
For a differentiable function f : R^n → R, a saddle point is a stationary point that is neither a local minimum nor a local maximum.
Exercise 3.1.
Draw an example of a function f : R → R with a saddle point.
We will address the issue of ending up in saddle points in the next lecture and for now, we will just
make the common-sense assumption that a good strategy to minimize a given function f is to look for points
where the gradient vanishes (even though such a point is not guaranteed to be a local minimizer, this is still one of the best strategies we have because the vanishing gradient is a necessary condition).
It sounds thus like it would be a good idea to find a minimizer by simply solving the system of equations
(3.1). Usually, however, this involves a lot of nonlinear functions and is not easy to solve at all.
For that reason, we need to perform the minimization smartly, trying various points until exhaustion
of the computational time we have (typically things need to work fast), and picking the best one among
the ones we tried. This ‘trying’ needs to be smart, because otherwise we end up wasting a lot of effort on
evaluating points that make no sense.
For that reason, real-life optimization most often employs a 'local search' strategy of moving from one point to another, typically trying to improve the value of the objective function at each step. Being at a given point x_k, we would like to select the next point x_{k+1} so that

f(x_{k+1}) < f(x_k).
Remember that as an algorithm, while being in a given point xk we don’t ‘see’ the function image around
us because seeing this image requires us to perform many function evaluations. Thus, we need to make a
decision about the next point to move to using only information that is possible to compute, at a low cost, at
the current point. How can we quickly get an idea about how the function looks around a given point?
Here, what comes to our help is the first-order Taylor approximation of the function:

f(x) ≈ f(x_k) + ∇f(x_k)^T (x − x_k).

It means that the gradient of a function gives us the normal vector of the hyperplane tangent to the graph of f and, as you might remember, the gradient is the 'direction of steepest ascent' of the function. If we choose a direction d_k such that

d_k^T ∇f(x_k) < 0,

we call such a direction a descent direction. Having such a d_k, we know that there exists an ε > 0 such that

f(x_k + t d_k) < f(x_k)   ∀t ∈ (0, ε),

that is, for small enough step sizes in the direction of d_k we will achieve a descent. This is the basis of the standard descent algorithm 19.
There are three choices to be made in alg. 19: (i) mechanism for the selection of direction dk , (ii) choice
of step size tk , (iii) the stopping criterion. We will address them beginning with the stopping criterion.
Stopping criteria. As you remember, for smooth functions every (local) minimum is a point at which
the function gradient vanishes. It is therefore logical to establish the gradient vanishing as the criterion. But
numerically, we cannot hope to hit exactly a point where the gradient is equal to 0, so instead the following criterion is usually used:

‖∇f(x_k)‖_2 ≤ ε,

where ε is some small number. Because even this can take a long time (if it happens at all), additional stopping criteria can be used, such as a maximum number of iterations, or that the improvement between subsequent iterations is small enough:

|f(x_k) − f(x_{k−1})| ≤ ε.
Stepsize selection. Next, we can address the issue of choosing the stepsize. Intuitively, we would like
to make big steps when we are far away from the minimum, and then smaller ones when we approach the
minimum (otherwise, there would be a risk that we ‘jump over’ the minimum without noticing it). But of
course, we never really know how far we are from the minimum so we need to make these choices somewhat
blindly. Here, we present and discuss several basic methods of choosing the stepsize tk :
• Constant stepsize: tk = t. This is the simplest of all strategies where the question is, of course,
what number should be chosen for t. One can try several values for this number and ‘judge by eye’
the convergence of the algorithm. When more analytical properties of the function to be minimized
are known (for example, the Lipschitz constant of the gradient), we can also choose a constant t that
will guarantee convergence of the algorithm.
• Exact line search (or bisection). If we are lucky, the problem of minimizing the function along the
search direction,

t_k = argmin_{t ≥ 0} f(x_k + t d_k),
can be solved exactly at a cheap cost. For some problems, for example when the function is quadratic,
this is indeed possible because an efficient oracle can be created to solve such a one-dimensional
optimization problem. In other situations, one can use the bi-section method to determine a point
where the gradient of the function g(t) = f(x_k − t∇f(x_k)) vanishes (remember – this does not necessarily mean that it is the minimum of this function, but it is better than nothing). For this, note that
g 0 (0) < 0. If we guess a value tmax such that g 0 (tmax ) > 0, then by continuity of g 0 (t) there has to
exist a point t̄ on the interval (0, tmax ) such that g 0 (t̄) = 0. In this situation we can employ the highly
efficient single-dimensional bisection Algorithm 13 to find a point where the gradient of g(t) vanishes.
• Backtracking. It is a procedure that tries to find a step length t that would give a ‘sufficiently good’
decrease of the function (we will define what that means), but without requiring the possible number
of function evaluations that exact line search would do. It requires three parameters: s > 0, α ∈ (0, 1),
β ∈ (0, 1) and its working is outlined in Algorithm 14.
In other words, the stepsize is chosen as tk = sβ ik where ik is the smallest nonnegative integer for
which the condition
f (xk ) − f (xk + tk dk ) ≥ −αtk ∇f (xk )> dk
is satisfied. What is the quantity −tk ∇f (xk )> dk ? It is the amount of descent ‘promised’ by the first-
order Taylor approximation. We therefore want the stepsize to be such that the real function value
improvement it achieves is at least a factor of α of the Taylor-promised improvement.
• Diminishing step size. This is a strategy that tries to achieve the balance between 'long steps in the beginning' and 'short steps later on' which we discussed above. Different formulas for decreasing the step size are possible, for example

t_k = 1/(a + bk)   or   t_k = a exp(−bk).

Decreasing step size rules are popular in applied machine learning because they perform well and because theoretical convergence results are fairly easily obtained for them, in particular in combination with an algorithm called stochastic gradient descent, to be briefly introduced later. One does not, however, run them for too long, as the final stepsize values can become negligible.
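The pieces above can be combined into a working method. Since Algorithms 14 and 15 themselves are not reproduced in these notes, the following is our own reading of the rules described here (parameter names s, α, β as in the text), sketched in NumPy:

```python
import numpy as np

def backtracking_step(f, grad_f, x, d, s=1.0, alpha=0.3, beta=0.5):
    """Return t = s * beta**i for the smallest i >= 0 such that
    f(x) - f(x + t*d) >= -alpha * t * grad_f(x) @ d  (the backtracking rule)."""
    g = grad_f(x)
    t = s
    while f(x) - f(x + t * d) < -alpha * t * (g @ d):
        t *= beta
    return t

def gradient_descent(f, grad_f, x0, tol=1e-8, max_iter=10_000):
    """Steepest descent with backtracking; stops when ||grad f(x_k)|| <= tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        d = -g                                   # steepest-descent direction
        t = backtracking_step(f, grad_f, x, d)
        x = x + t * d
    return x

# Example: the badly-conditioned quadratic from Example 3.2; minimum at 0.
f = lambda x: x[0]**2 + 8 * x[0] * x[1] + 20 * x[1]**2
grad_f = lambda x: np.array([2 * x[0] + 8 * x[1], 8 * x[0] + 40 * x[1]])
x_min = gradient_descent(f, grad_f, np.array([1.0, 1.0]))
```

Exercise 3.2 guarantees that the `while` loop in `backtracking_step` always terminates for a descent direction.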
Exercise 3.2.
Prove that in the backtracking stepsize selection rule, for ∇f(x_k) ≠ 0 and s > 0, α ∈ (0, 1), β ∈ (0, 1), the 'while' loop inside alg. 14 always finishes after a finite number of steps.
Descent direction. We now address the issue of selecting the descent direction, which has a very natural solution. Because the gradient itself provides the direction of steepest ascent of the function, we know that its negative gives the direction of steepest descent of the function:

d_k = −∇f(x_k).

This is the idea of the most basic method for minimization of differentiable functions – the gradient descent method, which is a generalization of the one of alg. 9. The vanilla version of this method is given in alg. 15. Equipped with this, you are almost ready to apply the gradient descent algorithm to any smooth minimization problem you wish. However, there is one well-known issue related to scaling/preprocessing the problem's data before running the gradient method (or any optimization algorithm, in fact). It is known as the zig-zagging behavior.
Example 3.2. Zigzagging and variable scaling
The pure gradient method can exhibit some unfavorable behavior if the magnitude of the gradient
components related to different variables differs a lot. Consider the following example of minimizing
the quadratic function

f(x) = x_1^2 + 8 x_1 x_2 + 20 x_2^2

using the pure gradient method with a fixed stepsize of length 1.05. The path traversed by the method is illustrated in the following figure, where the blue curves denote the isolines of the function f(x) = x_1^2 + 8 x_1 x_2 + 20 x_2^2, while the red line illustrates the path followed by the algorithm.
As you can see, at each step the gradient direction does not really point in the direction of the true
minimum and, as a result, many more steps are needed. This is a consequence of the fact that the condition number of the Hessian of the function to be minimized,

f(x) = x^T A x,   A = [1 4; 4 20],

is very high, in fact ≈ 108. This is when we speak of badly-conditioned optimization problems.
Such situations occur, for example, when features used in ML-based optimization problems are not
normalized. Imagine that you are performing an optimization problem based on data in which people’s
height is measured in meters and weight in kilograms. Then, you easily obtain a situation in which
one feature has values in the interval [1, 2] while the other in the interval [50, 100]. To mitigate against
it, all features can be normalized, for example, by (i) forcing their mean to be equal to 0 and their
variance to be equal to 1, (ii) forcing their minimum and maximum values to be equal to 0 and 1,
respectively.
If scaling is not performed before the optimization problem is set up, then an optimization field-trick is to apply scaling to the problem, in which we transform our vector of decision variables using an invertible matrix D:

x = D^{1/2} y.

In this new optimization variable, the gradient descent iteration becomes equivalent to

x_{k+1} = x_k − t_k D ∇f(x_k)

in terms of the original decision variable, which is known as the scaled gradient descent method.
A most logical choice for the scaling matrix D for a quadratic f would be the matrix D = A^{-1}, because it would make the transformed matrix equal to an identity matrix. In some cases, however, computing such a matrix would be expensive, and one can, for example, use the matrix

D = (diag(A))^{-1},

which uses only the diagonal elements of A and makes the inversion computationally very cheap.
In the next lecture, we will see that considering the issue of ‘what would be the optimal scaling
technique’ leads to an extension of the gradient method – the Newton method.
Exercise 3.3.
Consider a quadratic function f(x) = x^T Q x, where Q is a positive definite matrix. Suppose we use the diagonal scaling matrix

D = [Q_11^{-1} 0; 0 Q_22^{-1}].

Show that the scaling improves the condition number of the matrix in the sense that

κ(D^{1/2} Q D^{1/2}) ≤ κ(Q).
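A quick numerical check of this effect on the quadratic of Example 3.2 (variable names ours): diagonal scaling brings the condition number down from roughly 108 to roughly 18.

```python
import numpy as np

A = np.array([[1.0, 4.0], [4.0, 20.0]])
D = np.diag(1.0 / np.diag(A))      # diagonal scaling D = (diag(A))^{-1}
D_half = np.sqrt(D)                # D^{1/2} (entrywise, since D is diagonal)

kappa = np.linalg.cond(A)                           # roughly 108
kappa_scaled = np.linalg.cond(D_half @ A @ D_half)  # roughly 18
```

The scaled matrix D^{1/2} A D^{1/2} has unit diagonal, which is exactly what the normalization trick above achieves on the level of features.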
So far, we discussed the algorithm and its possible variations from a high-level viewpoint. But you can't really learn optimization and get a hang of it without running a few algorithms yourself. For that reason, we propose this exercise.
Exercise 3.4. Adapted exercise 4.3 from [3]
Consider the problem

min_{x ∈ R^5} x^T A x.
• the diminishing stepsize rule

t_k = 1/(a + bk)

with parameters of our choice (test a few possibilities).
For that reason, typically some problem structure is required to be able to ‘prove’ that the gradient
method converges. A most standard assumption is, of course, that a minimum exists or at least that the
function to be minimized is bounded from below (usually a rather cheap assumption), plus some smoothness
assumption such as the Lipschitz continuity of the gradient, i.e., that there exists a constant L > 0 such
that:
‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖   ∀x, y ∈ R^n.
We will use this assumption in section 3.7 to give an example convergence proof of the gradient method. To
make yourself familiar with this assumption and to teach you to think about the bad-type situations for gradient methods, we propose the following exercise.
Exercise 3.6. Exercise 4.10 from [3]
Give an example of a twice differentiable function f : R → R and a starting point x_0 ∈ R such that the problem min_{x ∈ R} f(x) has an optimal solution and the gradient descent method with stepsize 2/L diverges, i.e., the sequence {f(x_k)} is increasing.
Luckily, because in ML applications it is us who designs the functions to be minimized, we are free to pick
a function for which convergence is expected. For this reason, now let’s assume that we designed our ML
problem reasonably so that it has a minimizer, and that our gradient descent implementation converges to
a point at which the gradient vanishes. The question then is: what can we say about the point we converged to? Is this a global minimum? Is this at least a local minimum?
For general differentiable functions f, the answer is no to both of these questions, sadly. Despite that fact, gradient methods are widely popular because they are computationally very cheap and still, stationary points are good candidates for being minima, even if we cannot prove it. In fact, all the vastly popular deep learning networks are trained exactly in this way – nobody knows whether the minimum found is a local minimum or a global one, but they still work.
That being said, since in ML we have the freedom of modelling the problem with loss functions of our
choice, whenever we can, we rather pick a function where convergence can be guaranteed and therefore,
powerful algorithms can be used. And there is a fairly general class of functions for which the gradient
method is guaranteed to converge to a global minimum – convex functions.
Figure 3.2: Examples of a convex (left) and nonconvex functions (right). Convex functions are very common
in simple supervised learning models such as linear regression. In deep learning methods, one almost always
has to do with nonconvex functions with multiple local minima, the best of which is hard to find.
Definition 3.4.
A function f : R^n → R is convex if for every x, y ∈ R^n and every λ ∈ [0, 1] it holds that

f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y).

If the function is twice differentiable everywhere, then convexity is equivalent to the following property:

∇^2 f(x) ⪰ 0   ∀x ∈ R^n,

that is, the Hessian is positive semidefinite everywhere.
Geometrically, convexity of a function means that 'the straight line segment connecting two points on or above the graph of the function lies fully above or on the graph of the function', see fig. 3.2.
For such functions, we can show that every local minimum is also a global one.
Exercise 3.7.
Show that for a differentiable convex function f : Rn → R every stationary point is a global minimum.
The conclusion of this fact is that convexity is a desirable property of a function to minimize. How do we
check/make sure that a function is convex? Luckily, we do not need to do it by going ‘straight to definition’
always, but just like we have calculus for checking differentiability of functions, we have certain ‘rules’ using
which we can construct ‘big convex functions’ out of ‘small convex functions’.
Theorem 3.2. Operations that preserve convexity
If two functions f, g : R^n → R are convex, then the following functions are convex as well: the sum f + g, the scaled function αf for α ≥ 0, and the pointwise maximum max{f, g}.
If the function f : R^n → R is convex and the function g : R → R is convex and nondecreasing, then the function h(x) = g(f(x)) is convex.
Exercise 3.8.
Prove Theorem 3.2.
Exercise 3.9.
Prove or give a counterexample. If f (x) and g(y) are convex, then F (x, y) = f (x)g(y) is convex.
In short – whenever you can and it is not too limiting on your ability to model the problem at hand, it
is always good to opt for a formulation in which convexity can be guaranteed.
3.4 Modelling losses for ML
Consider a general prediction mechanism

ȳ = g(w, x)

parametrized by a vector w ∈ R^n. In order to find a good w, we need to choose it so that the relationship g(w, x) performs well on the training data first. We do it by choosing w to be a minimizer of the training problem:

w = argmin_{w ∈ R^n} Σ_{i=1}^N L(w, x_i, y_i),

where L(·, ·, ·) is a to-be-defined-by-us loss function. Naturally, the loss function should be closely connected to the prediction mechanism g(·, ·), typically by comparing the prediction g(w, x_i) with the observed y_i.
Whenever we model a ML prediction tool and the corresponding loss function, the loss function we choose should meet two goals: (i) reflect the real loss we're trying to minimize, and (ii) be easy to minimize. In learning how to do this, we will begin with the simplest example of regression.
3.4.1 Regression
A regression problem is formulated by having a data sample consisting of (x_i, y_i), i = 1, . . . , N, where we try to find the mechanism hidden in the relation

x_i ∈ R^n → y_i ∈ R

in the training part of the data. As you already know, the simplest way to model this problem is to use the following prediction mechanism:

ȳ = w^T x,

where w ∈ R^n.
Remark 3.2.
Note that we can extend this model easily to y depending nonlinearly on x, by 'appending' the vector x with nonlinear transformations of each of its entries. Similarly, we can append an element 1 to the vectors x_i so that a 'constant term' is also taken into account in our model.
Our goal is to find a w that errs as little as possible on the data that we have at our disposal. We
therefore need to ‘punish’ the difference
y_i − ȳ_i = y_i − w^T x_i

by using a loss function l(·), where we minimize

Σ_{i=1}^N l(y_i − w^T x_i).

If we use the quadratic loss function l(s) = s^2, then the objective function to minimize by playing with w is:

f(w) = Σ_{i=1}^N (y_i − w^T x_i)^2.
What do we know about this function? First of all, it is quadratic in w. Secondly, it is also convex in w, so any point at which the function gradient is equal to 0 is a minimizer of this function. It is, therefore, a very nice optimization problem to solve, and we can employ the gradient method (or some other method) right away.
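A minimal sketch of doing exactly that (synthetic data and variable names ours), using the gradient ∇f(w) = 2 X^T (Xw − y) of the quadratic loss and a constant stepsize:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 50, 3
X = rng.standard_normal((N, n))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(N)   # noisy linear data

w = np.zeros(n)
t = 0.005                                # constant stepsize
for _ in range(5000):
    grad = 2 * X.T @ (X @ w - y)         # gradient of f(w) = ||y - Xw||^2
    w = w - t * grad
```

Because the problem is convex and quadratic, the iterates converge to the least-squares solution, which here is close to `w_true` up to the noise level.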
Remark 3.3. Huber loss function
Another popular choice for a loss function is the Huber loss:
l(s) = { s^2 / 2       if −1 ≤ s ≤ 1,
       { |s| − 1/2     otherwise.
In some applications, the upside of the Huber loss function is that the total loss over all samples is
then not overly influenced by a few points for which the error is very large, known as the outliers.
But realistically, our first choice will always be the quadratic loss function. It is a good practice to see how things look when we put them in vector notation. Stacking all y_i into one vector y ∈ R^N, and all x_i's as row vectors in a matrix X ∈ R^{N×n}, the function to be minimized is:

f(w) = ‖y − Xw‖_2^2 = (y − Xw)^T (y − Xw),
where we use the norm notation of section 2.2. The gradient is then equal to

∇f(w) = 2 X^T (Xw − y),

and setting it to zero yields the normal equations

X^T X w = X^T y,

which you know from section 2.4 and which have a unique solution if the matrix X has full rank. To repeat, determining w by solving the system of normal equations directly is computationally not efficient.
If the matrix X is not full rank, then typically the performance of the gradient method on the optimization problem of minimizing f(w) is poor. Just as in numerical linear algebra a common trick in such cases is to select the minimum-norm solution of the system, a common optimization trick is to 'add' a component to the minimized objective function to ensure that the optimization problem has a unique minimizer. This process is known as regularization, and its most classical example is Tikhonov regularization:
f(w) = ‖y − Xw‖_2^2 + λ w^T w = (y − Xw)^T (y − Xw) + λ w^T w,
where λ > 0 is a regularization parameter. Although regularization typically changes the optimal solution
to the problem, the concepts of minimum-norm solutions and regularization are closely related, as we will see in our third lecture on optimization. Regularization, as it turns out, is extremely important in machine learning, not only for numerical reasons.
Remark 3.4. Regularization and out-of-sample performance
One of the effects of the regularization term on the optimal solution is that it promotes 'small' vectors w, which corresponds to 'simpler' models. This is in line with the common experience in ML that simpler models, trained on the training data, tend to perform better on the test data than non-regularized models. In Example 11.2 we will see how this intuitive and empirical observation can be derived mathematically.
Coming back to the numerical aspects of regularization, if we look at it from the point of view of equating the gradient to 0, then we obtain the following system of equations:

(X^T X + λI) w = X^T y   ⟺   w = (X^T X + λI)^{-1} X^T y.

Here, an interesting relation to the pseudo-inverse matrices (recall section 2.7) arises because, as it turns out, we have that

X^+ = lim_{λ→0+} (X^T X + λI)^{-1} X^T = lim_{λ→0+} X^T (X X^T + λI)^{-1}.
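A quick numerical sanity check of this limit (our own setup: a rank-deficient X with known singular values and a small λ):

```python
import numpy as np

rng = np.random.default_rng(2)
Q1, _ = np.linalg.qr(rng.standard_normal((6, 6)))
Q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
# Rank-2 matrix, 6x4, with singular values 3 and 1.
X = Q1[:, :2] @ np.diag([3.0, 1.0]) @ Q2[:, :2].T

lam = 1e-8
# Regularized solve: (X^T X + lam*I)^{-1} X^T, applied columnwise.
ridge_inv = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T)

# For small lambda this approaches the pseudo-inverse X^+.
diff = np.max(np.abs(ridge_inv - np.linalg.pinv(X)))
```

The deviation shrinks as λ → 0+, which is exactly the statement of the limit above.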
3.4.2 Classification
In a (binary) classification problem, we instead try to find the mechanism hidden in the relation

x_i ∈ R^n → y_i ∈ {−1, 1}.

How can we build up the corresponding prediction and training mechanism to keep the optimization problem 'nice'?
A brute-force approach to this problem would be to behave as if the problem was a regression problem,
i.e., to use the same loss function as in regression and then, on new samples, to use the following mechanism
to predict the label of a new object:
ȳ = sign(w^T x).    (3.2)
This idea, however, suffers from a serious flaw. Namely, the regression loss function penalizes every deviation from the target label (for example, 1), while in fact, (3.2) is 'wrong' only when sign(w^T x_i) ≠ y_i.
For that reason, we need a loss function that, using the prediction mechanism (3.2), will impose penalty
only when the sign mismatch happens. In Example 11.2 we already got to know one particular solution
to this problem, known as the support vector machine (SVM), where the function to be minimized is, for
example:

f(w) = Σ_{i=1}^N max{0, 1 − y_i (w^T x_i)}^2.    (3.3)

This formulation is known as the L2-loss support vector machine, where the term L2 is there in relation to the 2-norm related penalization of the violation of the inequality 1 ≤ y_i (w^T x_i).
Exercise 3.10.
Prove that (3.3) is a convex function of w.
Similarly to the regression case, it is very common to add a regularization term to the objective function:

f(w) = Σ_{i=1}^N max{0, 1 − y_i (w^T x_i)}^2 + λ‖w‖_2^2.    (3.4)
Exercise 3.11.
Argue that (3.4) is a convex function of w and derive its gradient.
We easily verify that this formulation is convex in w as well. The name comes from the shape of the function max{0, 1 − x}. It is, however, not differentiable everywhere. Still, it is possible to apply a modified version of the gradient method to this problem. You would be surprised that in the ML literature, people behave as if this function were differentiable, i.e., they call the algorithm used to solve this problem the gradient method. We will, however, be mathematically correct by calling it the subgradient method, and we will introduce it only in the next lecture.
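A sketch of the regularized L2-loss SVM objective (3.4) and its gradient (our own vectorized form; the squared hinge term itself is once differentiable, and samples with inactive margins simply contribute zero):

```python
import numpy as np

def svm_l2_objective(w, X, y, lam):
    """f(w) = sum_i max(0, 1 - y_i * w^T x_i)^2 + lam * ||w||^2   (eq. 3.4)."""
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return np.sum(margins**2) + lam * (w @ w)

def svm_l2_gradient(w, X, y, lam):
    """Gradient: -2 * sum_i max(0, 1 - y_i * w^T x_i) * y_i * x_i + 2*lam*w."""
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return -2.0 * X.T @ (margins * y) + 2.0 * lam * w
```

These two functions are all that is needed to run the descent method of alg. 15 on Exercise 3.12.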
Exercise 3.12.
Construct a training dataset on the [0, 10]2 set by sampling two groups of points, one labelled −1,
and the other labelled 1, so that they are mostly located apart from each other. Then, set up an
L2 -loss SVM for this dataset with a regularization parameter λ and optimize it for a few values of λ
using the gradient descent method. For each λ, plot the corresponding w^T x = 0 line to inspect how
well it separates the two groups of points.
Now, sample in the same way two new groups of points labelled −1 and 1. Check which of the SVMs
you obtained before performs best in predicting the correct labels for this new dataset.
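A minimal sketch of this exercise (the cluster centres, stepsize rule, and all parameter values are our own illustrative choices; we append a constant feature to x so that w^T x = 0 describes an affine line rather than one forced through the origin):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups of points in [0, 10]^2, mostly located apart from each other.
n = 50
X_neg = np.clip(rng.normal([2.5, 2.5], 1.0, size=(n, 2)), 0, 10)
X_pos = np.clip(rng.normal([7.5, 7.5], 1.0, size=(n, 2)), 0, 10)
X = np.vstack([X_neg, X_pos])
y = np.concatenate([-np.ones(n), np.ones(n)])

# Append a constant feature so that w^T x = 0 describes an affine line.
Xa = np.hstack([X, np.ones((2 * n, 1))])

def svm_gradient(w, lam):
    # Gradient of sum_i max{0, 1 - y_i w^T x_i}^2 + lam ||w||_2^2
    margins = np.maximum(0.0, 1.0 - y * (Xa @ w))
    return -2.0 * Xa.T @ (margins * y) + 2.0 * lam * w

def train(lam, iters=5000):
    # Constant stepsize 1/L, with L an upper bound on the Lipschitz
    # constant of the gradient of this piecewise-quadratic objective.
    L = 2.0 * np.linalg.eigvalsh(Xa.T @ Xa).max() + 2.0 * lam
    w = np.zeros(3)
    for _ in range(iters):
        w -= svm_gradient(w, lam) / L
    return w

for lam in [0.01, 1.0, 100.0]:
    w = train(lam)
    acc = np.mean(np.sign(Xa @ w) == y)
    print(f"lambda={lam:7.2f}  training accuracy={acc:.2f}")
```

Rerunning the last loop on freshly sampled points gives the comparison asked for in the second part of the exercise.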
Yet another way to achieve the same thing is to think of the prediction less as a label itself and
more as a probability of being labelled either −1 or 1. The only question is how to model probabilities
in a continuous way that will make things friendly to optimization. A popular way to model the probability
that a given sample receives the label +1 is:
P(ȳ = 1|x) = exp(w^T x) / (1 + exp(w^T x)),
No time for questions, jump in and we’ll explain later. For such a model we use the following prediction
mechanism:
ȳ = 1 if P(ȳ = 1|x) ≥ 0.5, and ȳ = −1 otherwise.
Under such a prediction mechanism, how could we quantify ‘loss’ or ‘success’ so that we can somehow
optimize w on the training dataset? Success happens when we have a sample (x_i, y_i) and our prediction
mechanism assigns the label correctly. In line with our prediction model, the probability of success
(predicting the correct label) is:
P(ȳ_i = y_i|(x_i, y_i)) = exp(w^T x_i) / (1 + exp(w^T x_i)) = 1 / (1 + exp(−w^T x_i))    for y_i = 1,

P(ȳ_i = y_i|(x_i, y_i)) = 1 / (1 + exp(w^T x_i))    for y_i = −1,
and with this formula, the probability that we are correct across all samples can be written as:

Π_{i=1}^N P(ȳ_i = y_i|(x_i, y_i)) = Π_{i=1}^N 1 / (1 + exp(−y_i(w^T x_i)))
Training our model would mean to maximize this term over w, and it looks like a difficult expression to
maximize. If we take a logarithm of it, we obtain:
log Π_{i=1}^N P(ȳ_i = y_i|(x_i, y_i)) = − Σ_{i=1}^N log(1 + exp(−y_i(w^T x_i)))
Maximizing this expression is equivalent to minimizing its negative, and we obtain the following
optimization problem to solve:

min_w Σ_{i=1}^N log(1 + exp(−y_i(w^T x_i))).
As strange as it sounds, it turns out that this function is convex in w and amenable to efficient optimization
as the so-called log-sum-exp function. The entire derivation we just went through is equivalent to solving
the so-called logistic regression problem, and similarly to the previous cases, it is possible and common to
add a regularization term to the objective function
f(w) = Σ_{i=1}^N log(1 + exp(−y_i(w^T x_i))) + λ‖w‖^2
Logistic regression is a very popular tool for solving binary classification problems and as we shall see, also
for multi-class classification ones.
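A minimal training loop for this objective (the data, the squared 2-norm regularizer, and the constant stepsize are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 2))
w_hidden = np.array([1.5, -2.0])                 # used only to label the data
y = np.where(X @ w_hidden + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)

lam = 0.1

def grad(w):
    # Gradient of sum_i log(1 + exp(-y_i w^T x_i)) + lam ||w||_2^2
    z = y * (X @ w)
    s = 1.0 / (1.0 + np.exp(z))                  # = exp(-z) / (1 + exp(-z))
    return -X.T @ (s * y) + 2.0 * lam * w

w = np.zeros(2)
for _ in range(500):
    w -= 0.01 * grad(w)

acc = np.mean(np.sign(X @ w) == y)
print("training accuracy:", acc)
```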
Remark 3.5.
The name ‘logistic regression’ comes from the idea of modelling probabilities as
log(p / (1 − p)) ∼ w^T x
a term known as the ‘logit’, which was invented as a ‘trick’ to relate probabilities (which
have to belong to [0, 1]) to an expression that is linear (and thus unbounded) in the
parameter vector.
In other words, we assign to a given point the number of the class whose inner product of the SVM vector
and the observation's feature vector is the largest. But how do we train such a model?
Mathematically, we want that for an observation (x_i, y_i), if y_i = j, then it holds that

w_j^T x_i ≥ w_l^T x_i    ∀l ≠ j.

From here, we are nearly at the formulation; the only thing left is to modify this inequality slightly to
prevent the trivial solution in which all w_j = 0. We do it in the same fashion as in the standard SVM (recall
example 1.3), by introducing the threshold value 1:

w_{y_i}^T x_i − w_l^T x_i ≥ 1    ∀l ≠ y_i.
Now, we want to penalize the situation in which 1 − w_{y_i}^T x_i + w_l^T x_i > 0, for each l ≠ y_i and for
each sample i. We can do it, for example, using the hinge loss function via the following formulation
Σ_{i=1}^N Σ_{l≠y_i} max{1 − w_{y_i}^T x_i + w_l^T x_i, 0}.    (3.5)
This function is convex in w but it is not differentiable. If we prefer a function that is differentiable (and in
this chapter we definitely do prefer this because we only learned the gradient descent method so far) we can
also apply the L2 -loss function to obtain:
Σ_{i=1}^N Σ_{l≠y_i} max{1 − w_{y_i}^T x_i + w_l^T x_i, 0}^2.    (3.6)
As in all the previous cases, it is a standard practice to add a regularization term to the objective function
to add preference for ‘simple’ models.
Another possible trick is to extend the concept of logistic regression to the multi-class setting. There,
the reasoning would be that we try to model the probability that a given sample belongs to the j-th class:
P(ȳ = j|x) = exp(w_j^T x) / Σ_{l=1}^k exp(w_l^T x).
How do we set up an optimization model to achieve that? As it turns out, we can repeat directly the
idea from the logistic regression to obtain a complicated ‘log-sum-exp’ style function to minimize, which is
nevertheless convex.
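As a small sketch, the class probabilities and the resulting ‘log-sum-exp’ style per-sample losses can be computed as follows (shapes and names are our own conventions; subtracting the maximum is a standard numerical-stability trick that does not change the values):

```python
import numpy as np

def softmax_probs(W, x):
    # Rows of W are the per-class weight vectors w_1, ..., w_k.
    z = W @ x
    z = z - z.max()              # numerical stabilization only
    e = np.exp(z)
    return e / e.sum()

def multiclass_log_loss(W, X, y):
    # y[i] in {0, ..., k-1}; the per-sample loss is
    # log(sum_l exp(w_l^T x_i)) - w_{y_i}^T x_i, which is always >= 0.
    total = 0.0
    for xi, yi in zip(X, y):
        z = W @ xi
        m = z.max()
        total += m + np.log(np.sum(np.exp(z - m))) - z[yi]
    return total

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # k = 3, two features
p = softmax_probs(W, np.array([2.0, 0.0]))
print(p, p.sum())            # probabilities sum to 1; class 0 dominates here
```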
be reformulated as an optimization problem of the following form:
This is a classical production optimization problem. You have the data, there is a certain objective
function to be maximized, and there are certain constraints to be met. Here is already the first difference
with the optimization problems we’ve constructed so far – they did not include any constraints on the decision
variable w, which was free to take any value in Rn . Although there are ML-related optimization problems
in which constraints play a role, it is rather uncommon because it makes the problem more difficult to solve
(apart from making sure that one moves in a direction of decreasing the loss function, one also needs to
make sure that no constraints are violated).
Training data performance and test data performance. The most important difference between
ordinary optimization and ML lies in the following. In the production planning example, we can
safely assume (unless there are errors) that the problem data will be as stated when we implement the
solution. That means that the logic chip will, for example, require 1 gram of silicon, etc. In this context, it
makes sense to put a lot of effort into seeking the solution that really maximizes the objective function.
In optimization problems related to supervised machine learning the situation is different by design. We
have the datasets (xi , yi ), i = 1, . . . , n and we optimize the vectors w to perform as well as possible on this
data. But in the end, the predictive mechanism will be used on new data, which is different from the training
dataset.
Of course, we expect the new data to be similar to the training set (otherwise there would be no reason
to train anything), but it will not be exactly the same data. There are two conclusions that can be made
from this.
First, if there is a way to explicitly take into account the fact that the new data will be slightly different
from the training data, then we can try to include this information already at the training stage, so that we
‘behave as if we were optimizing for the yet-unknown data’. This might sound weird, but it turns out that
regularization is implicitly doing exactly this.
Example 3.5. Regularizer as a by-product of data uncertainty
Consider the classical linear regression problem where we minimize the following problem (now we
take the norm, not the squared norm)
f(w) = ‖y − Xw‖_2
where X, y represents the training data set. Now, we will try to imitate the fact that the new data
on which the model will be evaluated, is slightly different. In particular, let us assume that the new
data is of the form:
X + ∆, y
while we leave the vector y unchanged for simplicity, and we perturb the feature vector a bit by a
matrix ∆. If we had a chance to train our model on this new data, then the value of the loss function
on the training data would be equal to:

‖y − (X + ∆)w‖_2.
The problem is that we do not know the value of ∆ in advance. Continuing our thought experiment,
assume that the matrix ∆ is ‘not too big’: for example, that its 2-operator norm (see
section 2.2) is not larger than a certain small number: ‖∆‖_2 ≤ λ.
If we worry about the worst-possible realization of ∆ for a given w, then we can discover that it holds
that
max_{‖∆‖_2 ≤ λ} ‖y − (X + ∆)w‖_2 = ‖y − Xw‖_2 + λ‖w‖_2.
This is the worst-case value of the loss function. For that reason, if our aim is to minimize such a
worst-case value of the loss function on the unknown data, we are in fact applying regularization to
our optimization problem.
Of course, the setup in this example was somewhat artificial, but in fact one can re-derive many
different regularization terms by considering a specific form of ‘perturbation’ in the training data and
then deriving the form of the loss function under the worst-case value of this perturbation.
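The worst-case identity above can also be checked numerically; the rank-one perturbation ∆ constructed below attains the maximum (the problem data are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
w = rng.normal(size=5)
lam = 0.3

r = y - X @ w                                   # residual on the clean data
u = r / np.linalg.norm(r)
# Rank-one perturbation attaining the worst case; note ||Delta||_2 = lam.
Delta = -lam * np.outer(u, w) / np.linalg.norm(w)

lhs = np.linalg.norm(y - (X + Delta) @ w)       # perturbed loss
rhs = np.linalg.norm(y - X @ w) + lam * np.linalg.norm(w)
print(lhs, rhs, np.linalg.norm(Delta, 2))       # lhs equals rhs
```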
Another conclusion from the fact that the data whose performance we really care about is different
from the data we optimize on is that it might not make sense to run an optimization algorithm all
the way until a minimum is found. For this reason, optimization performed for the purposes of machine
learning often stops much earlier, i.e., the stopping criteria are more relaxed.
Separability of the objective functions and the stochastic gradient descent. Next, the objective
functions in ML-related problems are separable: they are sums of many similar terms, each of which
corresponds to a single sample in the training dataset:

f(w) = Σ_{i=1}^N L(w, x_i, y_i)
As the number of terms can be very large (as large as the dataset), even computing the gradient might be a very
time-consuming task at each step of the gradient descent algorithm. At the same time, chances are high
that many of the terms are very similar, as they correspond to the dataset, where many points are expected
to be close to each other.
For that reason, it is common to approximate the objective function (and its gradient) by randomly
selecting n_S elements in the training sample. In other words, to use the function

f̃(w) = Σ_{i∈S} L(w, x_i, y_i)

where S is a sample of n_S elements from the training set. It turns out that while this function includes
far fewer points, its gradient is still an excellent approximation of the gradient of the original function.
This approach is known as the stochastic gradient descent (SGD) algorithm and is extremely popular in ML
applications.
It is important to notice that

∇f(w) ≈ ∇f̃(w) = Σ_{i∈S} ∇L(w, x_i, y_i),
so that the gradient computation can be done in parallel for all of the n_S samples, and then simply summed.
For this reason, the batch size n_S is often chosen as the maximum number of processes one can run in parallel
on the machine.
SGD methods typically cycle through the full data set, rather than simply sampling the data points at
random. In other words, the data points are permuted in some random order and blocks of points are drawn
from this ordering. Therefore, all other points are processed before arriving at a data point again. Each
cycle of the SGD procedure is referred to as an epoch - a term you will often see in ML publications.
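The cycling scheme just described can be sketched as follows (least-squares loss; the batch size, stepsize, and data are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 1000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=N)

def minibatch_grad(w, idx):
    # Gradient of sum_{i in idx} (x_i^T w - y_i)^2
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb)

w = np.zeros(d)
batch, step = 32, 1e-3
for epoch in range(20):
    order = rng.permutation(N)               # fresh random order each epoch
    for start in range(0, N, batch):
        idx = order[start:start + batch]     # each point used once per epoch
        w -= step * minibatch_grad(w, idx)

print(np.linalg.norm(w - w_true))            # close to zero
```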
This lemma can be used to quantify things very efficiently for the gradient method, in the following
simple corollary.
Lemma 3.2. Sufficient decrease
Let f be twice continuously differentiable with an L-Lipschitz continuous gradient. Then, we have
that

f(x) − f(x − t∇f(x)) ≥ t(1 − Lt/2) ‖∇f(x)‖^2
As you can see, we managed to obtain a lower bound on the improvement in the objective function as
a result of a gradient step. If we want to maximize this lower bound, then we can manipulate the step size
t to maximize the term on the right. This boils down to maximization of a simple quadratic function with
the following famous formula:
t* = 1/L,
in which case we have
f(x) − f(x − (1/L)∇f(x)) ≥ (1/(2L)) ‖∇f(x)‖^2
Here, you see that if for our function we know the Lipschitz constant of the gradient, then we can use it to
provide a constant stepsize.
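The sufficient-decrease bound with t = 1/L can be verified numerically on a quadratic, where L is simply the largest eigenvalue of the Hessian (the test data below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(6, 6))
A = M @ M.T + np.eye(6)                 # positive definite Hessian
b = rng.normal(size=6)

f = lambda x: 0.5 * x @ A @ x - b @ x   # gradient is A x - b, Hessian is A
grad = lambda x: A @ x - b

L = np.linalg.eigvalsh(A).max()         # Lipschitz constant of the gradient
x = rng.normal(size=6)
g = grad(x)
decrease = f(x) - f(x - g / L)          # actual decrease with t = 1/L
bound = g @ g / (2.0 * L)               # guaranteed decrease from the lemma
print(decrease, bound)                  # decrease is at least the bound
```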
Theorem 3.3.
Let f be twice continuously differentiable with an L-Lipschitz continuous gradient, and assume
that f is bounded below. Let {x_k} be the sequence generated by the gradient method with constant
stepsize t = 1/L for solving

min_{x∈R^n} f(x).

Then ‖∇f(x_k)‖ → 0 as k → ∞, and min_{j=0,...,k} ‖∇f(x_j)‖ = O(1/√k).
Via this result, we were able to obtain not only the fact that the gradient of the minimized function
converges, but also an upper bound on the rate of this convergence: the norm of the gradient converges at
the rate O(1/√k). Similar results can be obtained for the other stepsize selection strategies we presented –
backtracking and exact line search.
This is, in fact, one of the worst convergence rates you can get, if you can prove anything about an
algorithm at all. Under the assumptions we took, this is the sharpest possible rate estimate we can get, but
for special function classes rates of O(1/k) or even O(1/k^2) are possible.
Exercise 3.13. Exercise 4.11 from [3]
Suppose that f : Rn → R is a twice differentiable function with Lipschitz continuous gradient with
Lipschitz constant L. Suppose that the optimal value of minimizing f is f ∗ . Let {xk } be the sequence
generated by the gradient descent method with constant stepsize 1/L. Show that if {xk } is bounded
then f (xk ) → f ∗ as k → +∞.
• training set - on this dataset a specific ML model is trained under specific values of hyperparameters
such as the regularization parameter λ, stepsize for gradient descent etc.
• validation set - performance of the model trained on the ‘training set’ is evaluated on the validation
set;
• test set - on this set the ML tool with the best hyperparameter value is evaluated to see if a
given ML tool is good. If multiple ML tools are considered, their respective validation-best versions
are compared against each other on this test set.
How do we optimize over the hyperparameters? Typically, the most common tool is grid search – for
example, we test 10 different values for λ, train a model on the training set for each of them, and pick the
best λ among them based on the models’ performance on the validation set.
As for the model training itself, a standard practice is to pre-process the data. This is done, for example,
by normalizing all the features (entries of the x_i’s) so that each of them has a common minimum/maximum
value, or a common mean/standard deviation. This helps to avoid algorithm convergence issues such as
zigzagging and, overall, improves the optimization performance.
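A compact sketch of this pipeline: standardization with training statistics, followed by a grid search over λ for ridge regression (split sizes, feature scales, and the grid are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 300, 4
X = rng.normal(size=(N, d)) * np.array([1.0, 10.0, 0.1, 5.0])  # mixed scales
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=N)

tr, va = slice(0, 200), slice(200, 300)
mu, sd = X[tr].mean(axis=0), X[tr].std(axis=0)
Z = (X - mu) / sd                        # common mean 0 / std 1 per feature
yc = y - y[tr].mean()                    # center the targets as well

def ridge(Ztr, ytr, lam):
    # Closed-form solution of min_w ||Ztr w - ytr||^2 + lam ||w||^2
    return np.linalg.solve(Ztr.T @ Ztr + lam * np.eye(Ztr.shape[1]),
                           Ztr.T @ ytr)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = []
for lam in grid:
    w = ridge(Z[tr], yc[tr], lam)
    scores.append(np.mean((Z[va] @ w - yc[va]) ** 2))
best_lam = grid[int(np.argmin(scores))]
print("validation MSEs:", np.round(scores, 3), "best lambda:", best_lam)
```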
Figure 4.1: Examples of shallow minima (left) and flat regions (right).
Figure 4.2: A nonsmooth function.
As you know, particularly nasty examples are saddle points. An inherent property of ML
problems (objective functions being sums of many functions of many parameters) is that they tend to have
many saddle points, as the following exercise tries to explain.
Exercise 4.2.
Consider the univariate function f(x) = x^3 − 3x and its natural multivariate extension:

F(x_1, . . . , x_n) = Σ_{i=1}^n f(x_i).

Show that this function has only one minimum, one maximum and 2^n − 2 saddle points. Argue why
saddle points proliferate in high-dimensional functions.
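The counting in this exercise can be confirmed numerically for small n: the critical points of F have every coordinate in {−1, +1} (since f'(x) = 3x^2 − 3), and the diagonal Hessian diag(6x_1, . . . , 6x_n) reveals the type of each:

```python
import numpy as np
from itertools import product

n = 3
minima = maxima = saddles = 0
for point in product([-1.0, 1.0], repeat=n):      # all critical points of F
    hess_diag = 6.0 * np.array(point)             # Hessian is diag(6 x_i)
    if np.all(hess_diag > 0):
        minima += 1
    elif np.all(hess_diag < 0):
        maxima += 1
    else:
        saddles += 1

print(minima, maxima, saddles)    # 1 minimum, 1 maximum, 2^3 - 2 = 6 saddles
```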
ball that rolls on the graph of a function and even when it falls into a local minimum, it maintains some of
its earlier speed, rolling further. The computation of the update direction is then given by
x_{k+1} = x_k + d_k,    d_k = βd_{k−1} − t_k ∇f(x_k),
where β ∈ (0, 1) is a parameter that determines how much of the previous update is ‘remembered’ in the
next update. Clearly, with an update like this, the algorithm will not stop immediately when it encounters
a point xk at which ∇f (xk ) = 0, but instead, it will keep moving further.
AdaGrad (Adaptive Gradient Descent, [16]). The AdaGrad algorithm differentiates the scaling
of different components of x. In particular, it keeps track of the sum of squared magnitudes of the partial
derivatives with respect to each coordinate x_i. From iteration to iteration, one updates these quantities as:

A_{0,i} = 0,    A_{k,i} = A_{k−1,i} + (∂f/∂x_i (x_k))^2,    i = 1, . . . , n.
These are used to scale down the update with respect to the corresponding parameters as:

x_{k+1,i} = x_{k,i} − (α/√(A_{k,i})) ∂f/∂x_i (x_k).
Clearly, AdaGrad is a diminishing stepsize update rule. This means that from a certain moment, the method
will practically stop moving. Another downside is that none of the gradient history stored in Ak,i ’s gets
forgotten.
RMSProp (Root Mean Square Propagation, [24]). The RMSProp algorithm is essentially Ada-
Grad, but with the important trick that the past magnitude information gets gradually forgotten with time
at a rate quantified by a parameter 1 − ρ where ρ ∈ (0, 1). The update rules are then:

A_{0,i} = 0,    A_{k,i} = ρA_{k−1,i} + (1 − ρ)(∂f/∂x_i (x_k))^2,    i = 1, . . . , n,

and

x_{k+1,i} = x_{k,i} − (α/√(A_{k,i})) ∂f/∂x_i (x_k).
This is an ‘improvement’ upon AdaGrad, but the feature of both algorithms is that there is no momentum
effect in the gradient itself (only in the step length). Out of these considerations, the Adam algorithm was
born.
Adam (Adaptive Moment Estimation, [27]). The very popular Adam algorithm marries the features
of all the above-mentioned methods, performing exponential smoothing of both the stepsize, and the gradient
direction on a per-entry level of x. The corresponding magnitude rule for Ak,i is the same as for RMSProp,
and for the direction it applies the following rule:

F_{0,i} = 0,    F_{k,i} = ρ_f F_{k−1,i} + (1 − ρ_f) ∂f/∂x_i (x_k).

With these values, the per-entry update rule is:

x_{k+1,i} = x_{k,i} − (α/√(A_{k,i})) F_{k,i},
which combines both the ideas of gradient memory, and the stepsize magnitude memory.
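The recursions above can be sketched in a few lines (we omit the bias-correction step that the original Adam paper additionally applies; the test function and all parameter values are our own illustrative choices):

```python
import numpy as np

grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])  # f(x) = x1^2 + 10*x2^2

alpha, rho, rho_f, eps = 0.01, 0.9, 0.9, 1e-8
x = np.array([3.0, 3.0])
A = np.zeros(2)             # running squared magnitudes (the RMSProp part)
F = np.zeros(2)             # running gradient (the momentum part)
for _ in range(3000):
    g = grad(x)
    A = rho * A + (1 - rho) * g ** 2
    F = rho_f * F + (1 - rho_f) * g
    x = x - alpha * F / (np.sqrt(A) + eps)

print(x)                    # ends up near the minimizer (0, 0)
```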
The algorithms mentioned above are state-of-the-art tools for training, for example, huge neural networks.
All the methods mentioned now have been hand-crafted through experimentation and are empirically seen
to perform very well on ML-related optimization problems. Theoretical results on their convergence rates
under different sets of assumptions are also available (similar to those of gradient descent), but they are of
smaller importance.
Importantly, we can add that the stochastic gradient descent algorithm achieves, to a certain degree,
similar goals as the approaches mentioned above. Due to the stochasticity of the gradient evaluation, SGD
is less likely than normal GD to get stuck in local optima, and more likely to traverse flat regions.
4.3 Newton method
4.3.1 Introduction
The algorithms of the previous section were trying to account for curvature of the functions (which is,
fundamentally, second order information about the function, as opposed to the gradient method, which is
an example of first-order information) by accumulating information from the past history of the first-order
information. There are good reasons for that – while computation of the gradient requires O(n) time,
computing the curvature (Hessian) of the function takes O(n2 ) time, which is expensive for an operation
that would be related to just making one step.
Nevertheless, second-order information is a very powerful source of knowledge about a function and
second-order information-based optimization methods (Newton method and its variants) are a very important
part of the optimization curriculum, due to their ability to converge quickly to the minimum once they get
close to it. But, if you remember well what we said in the previous section – that in ML-minded
optimization we do not care that much about the actual minimum – this alone is not a convincing enough
reason to use the Newton method. So why bother?
The answer is: it still makes sense to learn about the idea of the Newton method, so that it can serve as a
good motivation for computationally more efficient tools that try to do the same thing, are inspired by
it, but work at a lower computational cost.
A point x obtained in this way is a critical point of the Taylor approximation. If the matrix ∇^2 f(x_k) is
invertible, we obtain:

x = x_k − (∇^2 f(x_k))^{−1} ∇f(x_k),
where, of course, we typically do not want to compute (∇^2 f(x_k))^{−1} explicitly, since it is only applied once.
This is the basis of the ‘pure’ Newton method outlined in alg. 16. When running the Newton method, the
vast majority of time is spent solving the system of linear equations to get x_{k+1} − x_k, and smart approaches
are needed that do it efficiently, utilizing the problem structure as much as possible. For that reason, it is
no overstatement to say that an optimization algorithm related to the Newton method is almost equivalent
to the algorithm used to compute the Newton updates.
All this computational effort is not for nothing. The Newton method is a very powerful one and it is immensely
popular in mathematical optimization. As the following exercise shows, for strictly convex quadratic functions
it actually converges immediately to the minimum.
Exercise 4.3.
Show that the pure Newton method finds the minimum of a strictly convex quadratic function
in one step.
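A quick numerical confirmation on a strictly convex quadratic (the data are arbitrary illustrative values; note that we solve the Newton system rather than forming the inverse):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)                       # positive definite Hessian
b = rng.normal(size=4)
grad = lambda x: A @ x - b                    # f(x) = 0.5 x^T A x - b^T x

x0 = rng.normal(size=4)
x1 = x0 - np.linalg.solve(A, grad(x0))        # one pure Newton step

x_star = np.linalg.solve(A, b)                # the true minimizer
print(np.allclose(x1, x_star))                # True
```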
This is in sharp contrast with gradient-style methods that may exhibit zigzagging behavior, as you have
seen before. Moreover, for the Newton method one can show that, close to the minimum x*, its
convergence becomes quadratic, as the following result shows. This result is so powerful that we provide
you with a proof that requires only minimal background knowledge. The assumptions of this theorem are,
except for specific cases, expected to hold only locally around minima of non-quadratic functions, but they
make the analysis illustrative.
Theorem 4.1.
Let f be a twice continuously differentiable function defined over R^n. Assume that ∇^2 f(x) ⪰ mI
for some m > 0 and that the Hessian ∇^2 f is L-Lipschitz continuous. Then the iterates of the pure
Newton method satisfy ‖x_{k+1} − x*‖ ≤ (L/(2m)) ‖x_k − x*‖^2, where x* denotes the minimizer of f.
If we combine the last equality with the fact that ∇^2 f(x_k) ⪰ mI, then we obtain that ‖(∇^2 f(x_k))^{−1}‖ ≤ 1/m.
Therefore, we have

‖x_{k+1} − x*‖ ≤ ‖(∇^2 f(x_k))^{−1}‖ ‖∫_0^1 (∇^2 f(x_k + t(x* − x_k)) − ∇^2 f(x_k))(x* − x_k) dt‖
             ≤ ‖(∇^2 f(x_k))^{−1}‖ ∫_0^1 ‖(∇^2 f(x_k + t(x* − x_k)) − ∇^2 f(x_k))(x* − x_k)‖ dt
             ≤ ‖(∇^2 f(x_k))^{−1}‖ ∫_0^1 ‖∇^2 f(x_k + t(x* − x_k)) − ∇^2 f(x_k)‖ ‖x* − x_k‖ dt
             ≤ (L/m) ∫_0^1 t ‖x_k − x*‖^2 dt = (L/(2m)) ‖x_k − x*‖^2,

which is the desired result.
All the things so far are great news for minimization of quadratic functions or situations where we are so
close to a local minimum that the quadratic approximation of a function is very accurate. However, not all
functions are convex quadratic with a positive definite Hessian. For other functions, even if they are convex,
the step made by the pure Newton method might be simply too long and can guide one to a place where
the Taylor approximation loses its accuracy completely.
For that reason, one typically does not implement the pure Newton method but instead, uses the search
direction implied by the Newton method, combined with line search (see one of the strategies we learned
in the previous section) in order to determine the next point, see alg. 17. This is a basis for a ‘realistic’
implementation of the Newton method.
Line search, unfortunately, is not something that is very expedient to do in ML because of the size of
datasets involved and the amount of time it may take to evaluate the function value.
But this is not the end of issues with the Newton method. As you remember, it mostly consists of solving
a system of linear equations to obtain the Newton direction. But the Newton matrix need not be invertible
or positive definite; all we know is that it is at least symmetric. To check this (and also to solve the system
of linear equations later), typically one attempts a Cholesky factorization (see section 2.6), which succeeds
precisely when the matrix is positive definite. If you are lucky, all the eigenvalues are positive and the
matrix is positive definite and, hence, invertible.
But some problems are ‘not convex enough’, which means that the Hessian matrix might become singular
(when one of the eigenvalues is zero), or they are not convex at all in a given area, in which case the Hessian
is indefinite (which happens, for example, at saddle points). For the case when the Hessian is positive
semidefinite but one of the eigenvalues is zero, people use a very simple trick that is similar to regularization:
a matrix λI is added, so that the step is:

x_{k+1} = x_k − (∇^2 f(x_k) + λI)^{−1} ∇f(x_k).
Another approach, more suitable around saddle points, is to trust the Taylor approximation only within a
certain distance from the current point, and to determine the next point as the minimizer of the function
within a ball around it. The optimization problem solved then is
x_{k+1} = argmin_x f(x_k) + ∇f(x_k)^T (x − x_k) + (1/2)(x − x_k)^T ∇^2 f(x_k)(x − x_k)
    s.t. ‖x − x_k‖_2^2 = (x − x_k)^T (x − x_k) ≤ δ,
where δ > 0 is the radius of the trust region. This idea is illustrated in fig. 4.3.
Such optimization problems can be solved very efficiently due to the fact that the Hessian of the
objective function – ∇^2 f(x_k) – and the Hessian of the function (x − x_k)^T (x − x_k) used in the constraint –
an identity matrix I – are simultaneously diagonalizable, a term you have encountered in section 2.5. This
property allows extremely efficient special algorithms for this so-called ‘Trust Region Subproblem’ (TRSP).
Despite all the fixes presented so far, one of the main problems with the Newton method is that it is
indiscriminately attracted to all critical points (such as maxima or saddle points). This is particularly
troublesome because, in the kind of objective functions encountered in ML, saddle points tend to proliferate
a lot, as you have seen earlier. Surprisingly, the first-order methods we discussed earlier exhibit less
attraction to such points. For that reason, the Newton method does not always perform better than
gradient descent. The Newton method
Figure 4.3: (An ugly picture of) minimization of a saddle-point quadratic function in the trust region method.
is needed for loss functions with complex curvatures but without too many saddle points. Overall, the
computational-work-per-value of the Newton method is not great in its ‘almost pure’ versions. Therefore,
real-world ML practitioners often prefer gradient-descent methods in combination with computational
algorithms like Adam. However, there exist low-cost imitations of the Newton method that are used in ML.
We will introduce one of them in the next section.
Exercise 4.4.
Is it possible for a Newton update to reach a maximum rather than a minimum? Justify your answer.
In what types of functions is the Newton method guaranteed to reach a maximum rather than a
minimum?
x_{k+1} = x_k − D_k^{−1} ∇f(x_k),

where in the Newton method the matrix D_k is simply ∇^2 f(x_k). But computation and inversion of ∇^2 f(x_k)
are rather expensive, so the idea is to make D_k a sequence of matrices that behave ‘sort of’ like the Hessian,
but which are cheap to update from D_k to D_{k+1}.
What would it mean for D_k to behave like the Hessians? The idea is that the updated matrix
should satisfy the approximate secant condition (so named as it relates to the secant method for finding
zeros of functions)

D_{k+1}(x_{k+1} − x_k) ≈ ∇f(x_{k+1}) − ∇f(x_k),
which is a finite-difference approximation. But there are really many matrices Dk+1 satisfying this condition
and it’s not immediately obvious how one could decide for a specific choice among them which, again, would
be cheap to update from Dk to Dk+1 . We will now present some of the empirically working solutions to this
problem.
There, the idea is, given the matrix D_k and x_{k+1}, x_k, ∇f(x_{k+1}), ∇f(x_k), to choose a D_{k+1} which
is as close as possible to D_k:

D_{k+1} = argmin_D ‖D − D_k‖_?    s.t.    D(x_{k+1} − x_k) = ∇f(x_{k+1}) − ∇f(x_k),    D = D^T.
The question mark at the norm is there deliberately, because the ease of solving the above optimization
problem hinges upon the choice of that norm. It turns out that the problem is easy to solve if we pick
‖X‖_? := ‖A^{1/2} X A^{1/2}‖_F, where

A = ∫_0^1 ∇^2 f(x_k + t(x_{k+1} − x_k)) dt.
Not by coincidence, it happens that A(x_{k+1} − x_k) = ∇f(x_{k+1}) − ∇f(x_k). Under this choice, the problem
can be analyzed by hand and there is an explicit solution to the minimum-norm optimization problem, given
by:

D_{k+1} = (I − γ_k v_k q_k^T) D_k (I − γ_k q_k v_k^T) + γ_k v_k v_k^T,    (4.2)

with

v_k = ∇f(x_{k+1}) − ∇f(x_k),    q_k = x_{k+1} − x_k,    γ_k = 1/(v_k^T q_k),
where we do not show the entire reasoning behind it (it is conceptually not that difficult, but it requires a lot
of analysis of the Lagrange optimality conditions for (4.4)). This is a neat algebraic result, but we need the
inverse of this matrix, rather than the matrix itself. Moreover, we do not want to perform this inversion
explicitly for every k. Ideally, we would like to obtain the inverse D_{k+1}^{−1} from D_k^{−1} in a cheap way. What
comes to the rescue is that (4.2) is a low-rank update of the matrix D_k.
Because of that, we can use the Sherman-Morrison-Woodbury identity of theorem 2.3 to obtain the
following formula:

D_{k+1}^{−1} = D_k^{−1} + (q_k q_k^T)/(v_k^T q_k) − (D_k^{−1} v_k v_k^T D_k^{−1})/(v_k^T D_k^{−1} v_k).    (4.3)
This is known as the Davidon–Fletcher–Powell (DFP) quasi-Newton method update formula, and further
computational tricks are used even to avoid the storage of the entire matrix in real-life software.
Exercise 4.5.
Check the validity of (4.3) using theorem 2.3.
Another, more popular update rule of this kind, known as the Broyden–Fletcher–Goldfarb–Shanno (BFGS)
update, is obtained by formulating the problem (4.4) directly in terms of the inverted Hessian, instead of
the Hessian itself:

D_{k+1}^{−1} = argmin_{D_{k+1}^{−1}} ‖D_{k+1}^{−1} − D_k^{−1}‖_?    (4.4)
    s.t.  x_{k+1} − x_k = D_{k+1}^{−1}(∇f(x_{k+1}) − ∇f(x_k)),
          D_{k+1}^{−1} = D_{k+1}^{−T}.
The BFGS method is the state of the art among the quasi-Newton methods used in optimization.
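A bare-bones sketch of a BFGS-style iteration, maintaining the inverse-Hessian approximation directly (we use exact line search, which is available here because the objective is a quadratic with known Hessian; all data are illustrative):

```python
import numpy as np

A = np.diag([2.0, 20.0])
b = A @ np.array([1.0, -2.0])           # so the minimizer is (1, -2)
grad = lambda x: A @ x - b

x = np.zeros(2)
H = np.eye(2)                           # inverse-Hessian approximation
g = grad(x)
for _ in range(20):
    if np.linalg.norm(g) < 1e-10:       # already converged
        break
    d = -H @ g                          # quasi-Newton direction
    t = -(g @ d) / (d @ A @ d)          # exact line search (quadratic f)
    x_new = x + t * d
    g_new = grad(x_new)
    s, v = x_new - x, g_new - g         # q_k and v_k in the text's notation
    if s @ v > 1e-12:                   # curvature condition; else skip
        rho = 1.0 / (s @ v)
        I = np.eye(2)
        H = (I - rho * np.outer(s, v)) @ H @ (I - rho * np.outer(v, s)) \
            + rho * np.outer(s, s)      # standard BFGS inverse update
    x, g = x_new, g_new

print(x)                                # converges to (1, -2)
```

On a quadratic with exact line search, this iteration terminates in at most n steps, which is why so few iterations suffice here.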
4.4 Non-smooth optimization
In the end, we move to a very important case: having to minimize a function which is not differentiable
everywhere. Recall the hinge loss SVM problem of example 3.3, where the loss function was clearly not
everywhere differentiable. Another example is the following one.
Example 4.1. Lasso
A classical example of a non-differentiable loss function is the L1 -regularized linear regression, known
as lasso (least absolute shrinkage and selection operator). The loss function there is

f(w) = ‖y − Xw‖_2^2 + λ‖w‖_1.

This function has the benefit that using the L1 norm in the regularization term is particularly effective
at forcing many entries of w to be equal to 0 at the optimal solution – see exercise 4.8.
For that reason, it is essential that we have methods that are able to minimize such functions in a
mathematically rigorous way as well.
For differentiable functions we will have a close relation between the gradient and the subgradient, given
by the following result.
Lemma 4.1.
For a function f : R^n → R which is convex and differentiable at a point x, we have ∂f(x) = {∇f(x)}.
A subgradient is, essentially, the normal vector of a hyperplane passing through the point (x, f (x)) that
includes the graph of f completely above or on itself. It is not a coincidence that the condition in the
definition of a subgradient is very similar to that in the definition of convexity of a function. In fact, we
have the following result.
Lemma 4.2.
A function f : R^n → R is convex if and only if ∂f(x) ≠ ∅ everywhere.
Although in theory, one can have a discussion about subgradients of nonconvex functions in at least some
points of their domains, in practice discussion about subgradients is almost always done only in the context
of convex functions and this is also the assumption we shall make.
Just as for computing derivatives (gradients), for the subgradients we have some ‘calculus rules’ for
computing them for complicated functions out of the subgradients for the simpler functions.
111
Lemma 4.3.
For convex functions f : Rn → R we have
Exercise 4.6.
Compute the subgradients of the following functions: ‖x‖_2, ‖x‖_1 = Σ_{i=1}^n |x_i|, and max{0, 1 − x}.
One of the nice features of subgradients is that we can use them to formulate a very
general version of an optimality condition:
Theorem 4.2.
For a function f : Rn → R, if 0 ∈ ∂f (x) then x is a global minimizer of f .
Exercise 4.7.
Prove theorem 4.2.
Equipped with the notion of the subgradient, we can present the generalization of the gradient descent
method to nondifferentiable convex functions, known as the subgradient descent method. Depending on the
assumptions that we impose on our function at hand, we can prove some results about O(1/√K) or O(1/K)
convergence of the subgradient method, where K is the number of iterations.
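A minimal sketch of the subgradient method with diminishing stepsize t_k = c/√(k+1) (the function, stepsize constant, and data are our own illustrative choices; since the method is not a descent method, we track the best value seen so far):

```python
import numpy as np

a = np.array([2.0, -1.0, 0.5])
f = lambda x: np.sum(np.abs(x - a))          # minimum value 0, at x = a
subgrad = lambda x: np.sign(x - a)           # one valid subgradient at x

x = np.zeros(3)
best_f = f(x)
for k in range(5000):
    x = x - (0.5 / np.sqrt(k + 1.0)) * subgrad(x)
    best_f = min(best_f, f(x))               # not a descent method, so
                                             # track the best iterate
print(best_f)                                # close to the minimum value 0
```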
The subgradient method is, however, quite slow because it is ‘blind to the structure’ of the function f .
Especially in ML, we are the designers of our own functions and even in situations when some functions are
nondifferentiable, there can be some good things about them that can be exploited, apart from the single
negative fact that it is not always differentiable. Typically, the structure to be exploited is a decomposition f(x) = g(x) + h(x).
We introduce the class of algorithms that are able to exploit this structure in the next section.
Now, in machine learning, the nondifferentiable functions we try to minimize very often come in the form f(x) = g(x) + h(x), where g(x) is convex and differentiable and h(x) is convex and not differentiable, but still 'nice' in the sense that it is a fairly simple function.
Example 4.2. Compressed sensing
Proximal gradient algorithms are used very frequently in image reconstruction tasks, such as image deblurring.
A typical situation in image deblurring or compressed sensing is that we observe a signal y, which corresponds to an underlying 'true' signal w that is not observed and has to be estimated. By the laws of physics we expect the relationship
Aw ≈ y
to hold, where A is a known matrix that models the physics. Then, the deblurring task is performed by minimizing the function
‖Aw − y‖_2^2 + λ‖Tw‖_1,
where T is a special matrix such that the term kT wk1 triggers the optimal solution to be ‘sparse’ in
a way that is proper to the given application – for example, in an image we don’t want to have many
pairs of adjacent pixels with completely different colors.
Exercise 4.9.
Derive the proximal operator of the function h(x) = ‖x‖_1.
The proximal gradient method alternates two simple steps:
• with respect to the smooth part g(x), take a gradient step;
• with respect to the obtained point, compute the proximal operator of the function h(x),
which get formulated as
x_{k+1} = argmin_y { g(x_k) + ∇g(x_k)^T (y − x_k) + (1/(2γ)) ‖y − x_k‖_2^2 + h(y) },
What is crucial for the proximal gradient mapping is that the proximal operator is easy to compute for the function h(x). If you have this, you are good to go to apply the proximal gradient descent method, which is a super powerful and popular tool in machine learning. For example, we have the following result that gives us an O(1/K) convergence rate for the value of the objective function.
Theorem 4.3.
Let g(x) be convex and have an L-Lipschitz continuous gradient, and suppose prox_{h,t_k}(x_k − t_k ∇g(x_k)) can be computed easily. Then, choosing the fixed stepsize t_k = 1/L, we obtain that for arbitrary x_0 it holds
f(x_k) − f* ≤ (L/(2k)) ‖x_0 − x*‖^2.
The proof of this theorem is not particularly difficult but it requires ‘putting together’ quite a few small
facts and properties of convex functions.
Using the proximal gradient method for minimizing the Lasso-regularized linear regression carries the
name of iterative shrinkage-thresholding algorithm (ISTA). Under specific assumptions on the functions g,
h one can obtain an even faster rate of convergence O(1/K 2 ) through the so-called fast iterative shrinkage-
thresholding algorithm (FISTA, [4]), which is one of the most popular algorithms used in ML or image
retrieval applications (check the number of citations of the paper).
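As a concrete illustration, here is a minimal ISTA sketch for the Lasso problem min_w ½‖Xw − y‖_2^2 + λ‖w‖_1, on a hypothetical toy instance of our own. Note that the soft-thresholding formula used below as the prox of the L1 norm is the answer to exercise 4.9, so treat this as a reference sketch rather than the exercise solution:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (coordinatewise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, steps=500):
    """Proximal gradient (ISTA) for min_w 0.5*||Xw - y||^2 + lam*||w||_1."""
    L = np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of the gradient of g
    t = 1.0 / L                             # fixed stepsize, as in theorem 4.3
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y)            # gradient step on the smooth part g
        w = soft_threshold(w - t * grad, t * lam)   # prox step on the nonsmooth part h
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10); w_true[:3] = [2.0, -1.0, 3.0]
y = X @ w_true
w_hat = ista(X, y, lam=0.5)
```

Each iteration is exactly the two-step pattern of the proximal gradient method: a gradient step on g, then the prox of h at the obtained point.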
Exercise 4.10.
Work out the details of the proximal gradient methods for the Lasso algorithm.
4.5 Practical summary
Long story short, the situation is as follows. In machine learning you often face situations in which a pure gradient method is impossible to implement, often due to non-differentiability of the function at hand. If you are the master of the situation and the function to be minimized has to be nondifferentiable but you can at least keep it convex, then it is very handy if the non-differentiable component has an easy-to-compute proximal operator, in which case you can apply the proximal gradient descent method.
In high-dimensional ML models, chances are high that the function you are trying to minimize is not
going to be convex, it will have a lot of local minima, and even more saddle points. For that reason, the
modifications of the gradient methods introduced in the beginning of this section will come in very handy.
If the problem size allows it, quasi-Newton methods can yield a faster convergence.
In highly complicated models such as the neural networks with ReLU activation functions, it can happen
that both difficulties (nonconvexity and nondifferentiability) appear at the same time. We will treat this
case separately in the neural networks part of this course.
This means that applying any of the algorithms we learned so far might lead to ‘falling outside the feasible
set W ’.
How to deal with that? The first advice is: if you can, avoid constrained problems. One interesting case here is when you encounter a problem of the form:
min_w ‖y − Xw‖_2^2
s.t. Aw = b.
Exercise 5.1.
Transform the above problem to an equivalent, unconstrained one.
Sometimes, however, the constraints are more complicated than that, and the problem one is solving is
min_{w∈R^n} f(w)
s.t. g_i(w) ≤ 0, i = 1, . . . , m,
where g_i(w) are some functions (typically convex and differentiable). A very engineering way to deal with a situation like this is to turn the problem back into an unconstrained one by including a penalty for constraint violation:
min_{w∈R^n} f(w) + C Σ_{i=1}^m max{0, g_i(w)}^2 (5.1)
In that way, we make peace with the fact that the constraints can be violated and impose a certain price C
per squared unit of violation, which should be tuned to make sure that the optimization problem is nudged
into staying 'close enough' to the constraints being satisfied. Such a reformulation has one nice feature, which you are asked to show in the following exercise, and one which is not nice: the conditioning of the problem might be bad.
Exercise 5.2.
Prove the statement that if the functions f (w), gi (w) are all convex, then the resulting penalized
problem (5.1) has a convex objective.
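A tiny numerical sketch of the penalty reformulation (5.1), on an illustrative instance of our own choosing: we minimize ‖w − (2, 2)‖^2 subject to w_1 + w_2 ≤ 1 by running plain gradient descent on the penalized objective.

```python
import numpy as np

# Hypothetical instance of the penalty reformulation (5.1):
# minimize f(w) = ||w - (2,2)||^2  subject to  g(w) = w1 + w2 - 1 <= 0.
f = lambda w: float(np.sum((w - 2.0) ** 2))
g = lambda w: float(w[0] + w[1] - 1.0)

def penalized_gd(C, steps=5000, lr=1e-3):
    """Plain gradient descent on the penalized objective f(w) + C*max(0, g(w))^2."""
    w = np.zeros(2)
    for _ in range(steps):
        grad_f = 2.0 * (w - 2.0)
        viol = max(0.0, g(w))                   # amount of constraint violation
        grad_pen = 2.0 * C * viol * np.ones(2)  # gradient of C*max(0, g(w))^2
        w -= lr * (grad_f + grad_pen)
    return w

w_pen = penalized_gd(C=100.0)
# The exact constrained minimizer is (0.5, 0.5); for C = 100 the penalized
# minimizer works out to (102/201, 102/201), slightly violating the constraint.
```

As the text warns, the solution sits slightly outside the feasible set, and the violation shrinks only as C grows, while a large C makes the problem badly conditioned.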
But sometimes situations arise (or we create them by modelling the problem in a certain way) where we really need to respect some constraints and they are not trivially eliminated from the problem. Then, what to do depends on the structure of the set W.
Example 5.1. L1 regularized regression via a constraint
As already mentioned, L1 regularization is a very popular tool in ML. In some applications, it is more
common to perform this regularization by explicitly bounding the L1 -magnitude of the parameter
vector:
min_w ‖y − Xw‖_2^2
s.t. ‖w‖_1 ≤ C,
which is a generalization of a concept you have seen in the earlier sections of this course (section 2.7,
minimum-norm solutions).
Just as in the above example, in ML we don’t encounter very complicated sets W . Typically, this will
be a box, a ball or something of similar level of complication. For that reason, the two classes of algorithms
that we will discuss first are essentially fixes of the gradient(-like) methods.
Towards the end of this section, we will learn about duality. In classical optimization, duality plays the role that checking the ∇f(x) = 0 condition plays in unconstrained optimization: checking whether we are close to the optimal solution. Additionally, when used properly, it can be used to strengthen algorithms.
For other types of sets, some set-specific norm might be used in the projection operator which typically has
Of course, you never want to be solving an optimization problem (5.2) to find the projection itself. The
idea is that this algorithm is applied only if the projection ΠW (x) is easy to compute – in optimization terms
this means that the operator is either available as a closed-form formula or, in the worst case, as a result of
optimization over a single variable.
Exercise 5.3.
What is the formula for the projection operator onto the set
W = {x ∈ R^n : l_i ≤ x_i ≤ u_i},
and onto the set
W = {x ∈ R^n : ‖x‖_1 ≤ 1}?
Hint: the latter can be reduced to minimization over a single decision variable, but we don't know of any closed-form formula for it. You will find Lagrange multipliers useful in this task.
The convergence results for the projected gradient descent algorithm are similar to that of the gradient
method. Typically, they will depend on the parameters of the set W to some extent, such as the diameter
and the shape. In such situations, even more than in unconstrained optimization the following rule holds:
there is a set of parameters that makes things work in the proofs of convergence, but for purposes of real life
optimization one typically picks larger stepsizes than the theoretically-valid ones.
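To make the method concrete, here is a minimal projected gradient descent sketch on a toy instance of our own making, using a box constraint, for which the projection is just a coordinatewise clip:

```python
import numpy as np

def project_box(x, lo, hi):
    """Projection onto W = {x : lo <= x_i <= hi} is a coordinatewise clip."""
    return np.clip(x, lo, hi)

def projected_gd(grad, x0, lo, hi, lr=0.1, steps=2000):
    """Projected gradient descent: gradient step, then project back onto W."""
    x = project_box(np.asarray(x0, dtype=float), lo, hi)
    for _ in range(steps):
        x = project_box(x - lr * grad(x), lo, hi)
    return x

# Toy problem: minimize ||x - c||^2 over the box [0, 1]^3; the minimizer
# is simply the projection (clip) of c onto the box.
c = np.array([2.0, -0.5, 0.3])
grad = lambda x: 2.0 * (x - c)
x_star = projected_gd(grad, np.zeros(3), 0.0, 1.0)
# x_star -> (1.0, 0.0, 0.3)
```

The whole 'fix' relative to plain gradient descent is the single extra `project_box` call after each gradient step.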
Because, in the above expression, the only really variable term is y, the problem can be reduced to:
y_{k+1} = argmin_{y∈W} ∇f(x_k)^T y.
This is not the end, however, because we do not trust the Taylor approximation too far away from the current point. For that reason, the real step that is made is:
x_{k+1} = x_k + t_k (y_{k+1} − x_k),
that is, we stop somewhere on the way to the point y_{k+1}, which prevents us from jumping from one boundary point of W to another boundary point (you can easily check that the point y_{k+1} lies on the boundary of the set W).
It turns out that the L1 -regularized linear regression is exactly one of the situations in which (5.3) is
easier than the projection operator.
Algorithm 21: Frank-Wolfe algorithm
Data: x_0 ∈ W
for k = 0, 1, 2, . . . do
    Solve y_{k+1} = argmin_{y∈W} ∇f(x_k)^T y
    Select step size t_k
    x_{k+1} = x_k + t_k (y_{k+1} − x_k)
    if stopping criterion met then
        Stop and return x_k
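The loop above can be sketched in a few lines; this is a toy instance of our own, which deliberately uses a box rather than the L1 ball (whose linear minimization step is exercise 5.5), with the standard stepsize t_k = 2/(k + 2):

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, steps=2000):
    """Frank-Wolfe with the standard stepsize t_k = 2/(k+2)."""
    x = np.asarray(x0, dtype=float)
    for k in range(steps):
        y = lmo(grad(x))                    # y_{k+1} = argmin_{y in W} grad f(x_k)^T y
        x = x + (2.0 / (k + 2)) * (y - x)   # move part of the way towards y_{k+1}
    return x

# Toy feasible set W = [-1, 1]^n: the linear subproblem is solved by
# picking the corner y = -sign(g), one kernel of why Frank-Wolfe is cheap.
lmo_box = lambda g: -np.sign(g)

# Minimize ||x - c||^2 over W; the minimizer is the clip of c onto [-1, 1]^n.
c = np.array([2.0, 0.4, -3.0])
grad = lambda x: 2.0 * (x - c)
x_fw = frank_wolfe(grad, lmo_box, np.zeros(3))
```

Note that the iterates never need a projection: every x_k is a convex combination of points of W, so it stays feasible by construction.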
Exercise 5.5.
Derive the formula for the Frank-Wolfe step for minimization over the L1 ball.
As for the convergence guarantees of the Frank-Wolfe algorithm, these roughly follow the same pattern as those for the projected gradient algorithm, depending on our assumptions on the function we minimize.
If you are interested in more mathematical details on the convergence rates of various algorithms we
introduced so far in this course, in the context of ML, we recommend the material of the excellent course
‘Optimization for Machine Learning’ by Martin Jaggi, for which there are also YouTube video lectures
available. 1
5.4 Duality
5.4.1 General duality
Any introduction to optimization is incomplete without giving at least a glimpse of duality theory. Duality theory is something that in classical optimization is mostly used for the purpose of (i) constructing optimality certificates for the solutions of optimization problems, and (ii) constructing better algorithms by leveraging dual information (e.g., primal-dual interior point methods). However, in ML duality has found a beautiful application where the so-called dual problem of the optimization problem we solve (most often, the SVM) allows us to construct primal predictive tools of almost arbitrary level of sophistication at no extra computational cost.
To introduce this, we will start from the ‘classical angle’ and then move on to the ML applications.
Suppose you are solving a problem
min_{x∈R^n} f(x)   s.t.   g_i(x) ≤ 0, i = 1, . . . , m, (5.4)
which we will henceforth call the primal problem. If all the functions f and g_i(x) are convex, then this is a 'nice' optimization problem for which we can have legitimate hopes of finding an optimal solution. For that reason, we will make this assumption from now onwards.
In general, constrained optimization problems are ‘nice’ if both the objective function to minimize and
the set of feasible solutions are convex.
Exercise 5.6.
Show that if g_i(x) are convex functions then the set of feasible solutions of (5.4), i.e.,
X = {x ∈ R^n : g_i(x) ≤ 0, i = 1, . . . , m},
1 See: https://github.com/epfml/OptML_course
is convex, that means, for all x, y ∈ X and λ ∈ [0, 1] it holds that λx + (1 − λ)y ∈ X.
which is a nondifferentiable function. We will turn it into a constrained optimization problem where all the functions involved are differentiable and convex. If we introduce additional decision variables ξ_i and require that
ξi and require that
ξi ≥ 0
ξi ≥ 1 − yi (w> xi ),
then ξi can become our ‘proxy’ for the value of the term max{0, 1−yi (w> xi )} so that the optimization
problem is:
min_{w,ξ}  Σ_{i=1}^N ξ_i + C w^T w (5.5)
s.t.  ξ_i ≥ 1 − y_i (w^T x_i)  ∀i,
      ξ_i ≥ 0  ∀i.
Check for yourself that at the optimal solution ξ_i will indeed always have the incentive to be equal to max{0, 1 − y_i (w^T x_i)}. Also, check that if you rewrite this problem in form (5.4), then all the constraints and the objective function are indeed convex in w, ξ.
We now come back to problem (5.4). How do you certify the optimality of the solution x you find? If the problem had no constraints, and the function f(x) was convex and differentiable, then a simple answer to this question would be: by checking whether ∇f(x) = 0. But in the presence of constraints, a stationary point might not be feasible, as depicted in fig. 5.1.
In a way, an optimality certificate, if it exists, must take the form of an easy-to-verify statement that 'from this point onwards it is not possible to move to any better point because the constraints forbid it'. At the same time, the optimality certificate should be easy to compute, just like for unconstrained optimization problems it is easy to compute the gradient and to check whether it is equal to 0. An optimality certificate that is a beautiful mathematical statement of the form 'there exists no ... such that ...' but which cannot be easily verified numerically is useless in practice.
Remark 5.1.
What we are going to derive is a special case of the separating hyperplane theorem [37], which in turn is a special case of the Hahn-Banach theorem in functional analysis [39].
One extremely popular technology of building such optimality certificates is via the so-called Lagrange
Figure 5.1: Constrained minimization of a convex quadratic function f(x) = x^T A x. Due to the constraints, the feasible region consists of the points below a_1^T x − b_1 ≤ 0 and to the left of a_2^T x − b_2 ≤ 0, and the optimal point is the vertex of the feasible region, not the minimum of the function itself (which is infeasible because of the constraints).
relaxation of the problem. The Lagrangian dual function of (5.4) is defined as:
ℓ(α) := inf_{x∈X} { f(x) + Σ_{i=1}^m α_i g_i(x) }. (5.6)
The function L(x, α) := f(x) + Σ_{i=1}^m α_i g_i(x) appearing inside the infimum is known as the Lagrangian of (5.4), and it can be summarized as: 'we make peace with the fact that some of the constraints can be violated, but we impose a price α_i for each unit of violation of the i-th constraint, and the objective becomes lower by α_i per extra unit if the constraint is satisfied with a slack'. In (5.6) we wrote X instead of R^n because some of the constraints might be so simple (e.g., nonnegativity constraints) that it is easy to find the infimum (5.6) even without relaxing them – you will see that in example 5.3.
Minimizing L(x, α) over x is a relaxation of the original problem, because its optimal value for α ≥ 0 is always going to be a lower bound on the optimal value of (5.4). This result is known as the weak duality theorem.
Exercise 5.7. Weak duality
Show that for any α ≥ 0 it holds that ℓ(α) ≤ f(x) for any x that satisfies the constraints of (5.4).
Thus, plugging any α ≥ 0 into the dual function, we obtain a lower bound on the optimal value of (5.4). Of
course, everything hinges on the question: is the dual function easy to compute via an easy formula? In
many interesting cases the answer is yes, as in our SVM example.
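Before the SVM, weak duality can be checked numerically on a one-dimensional toy problem of our own (min x^2 subject to 1 − x ≤ 0, whose dual function works out in closed form):

```python
import numpy as np

# Toy primal problem (our own): min x^2  s.t.  g(x) = 1 - x <= 0, so f* = 1 at x = 1.
# The dual function has a closed form: the infimum in
# l(alpha) = inf_x { x^2 + alpha*(1 - x) } is attained at x = alpha/2,
# giving l(alpha) = alpha - alpha^2/4.
dual = lambda alpha: alpha - alpha ** 2 / 4.0
f_star = 1.0

alphas = np.linspace(0.0, 10.0, 101)
lower_bounds = [dual(a) for a in alphas]
# Weak duality: every dual value is a lower bound on f*; the best one,
# attained at alpha = 2, matches f* exactly (no duality gap here).
```

Each α ≥ 0 certifies 'the optimum is at least ℓ(α)', and maximizing over α gives the tightest such certificate.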
Example 5.3. Support vector machine with L2 regularizer - dual function
Let us introduce variables α_i that will play the role of α in the Lagrange relaxation of the SVM problem (5.5). In theory, we could also introduce a set of variables to relax the constraints ξ_i ≥ 0, but this is not needed because minimization over ξ ≥ 0 is easy even without relaxing these constraints.
The Lagrangian (whose infimum over w and ξ ≥ 0 gives the dual function) becomes
L(w, ξ, α) = C w^T w + Σ_{i=1}^N ξ_i + Σ_{i=1}^N α_i (1 − y_i (w^T x_i) − ξ_i)
           = C w^T w − Σ_{i=1}^N α_i y_i (w^T x_i) + Σ_{i=1}^N α_i + Σ_{i=1}^N ξ_i (1 − α_i).
Exercise 5.8.
Verify the formulation (5.7).
We come back to our general considerations. Imagine the following: suppose that you determined an x and an α ≥ 0 such that x is feasible for (5.4), and it holds that f(x) = ℓ(α). Because ℓ(α) is a lower bound on the value of any feasible solution x, a logical conclusion is that such an α is an optimality certificate for x.
For that reason, the search for the best possible lower bound provided by the dual function is important on its own, because the gap between f(x) for a feasible x and ℓ(α) bounds the maximum possible loss in the value of the objective function compared to the (unknown) optimal solution. This search is formulated as the dual optimization problem:
max_{α≥0} ℓ(α).
The KKT conditions play an important role in optimization. In a few special cases it is possible to
solve the KKT conditions analytically and thus solve the optimization problem. More generally, many
optimization algorithms have been conceived as methods for solving the KKT conditions: imagine that you treat the 'equality part' of the KKT conditions as a system of nonlinear equations for which you are trying to find a root. If you try to apply the Newton method to this system, then you are close to the derivation
of something called primal-dual methods for constrained optimization. For more details on this way of
explaining these methods, see [9].
In the context of fig. 5.1, the KKT conditions constitute a certificate that from the optimal point, it is
not allowed to move any further in the direction of improving the objective function values.
Example 5.4.
For our SVM example, the dual optimization problem of maximizing ℓ(α) is equivalent to:
max_α  Σ_{i=1}^N α_i − (1/(4C)) Σ_{i,j=1}^N α_i α_j y_i y_j x_i^T x_j (5.9)
s.t.  0 ≤ α_i ≤ 1.
Why is it so? Well, if we want to maximize the value of the dual function we certainly do not want
its value to be equal to −∞. Therefore, we can enforce the constraints that make the value of the
dual function finite, without any loss of generality.
It is easy to verify that the Slater condition holds for our pair of problems (5.5) and (5.12) and that their optimal values must be finite. Therefore, strong duality holds and the optimal values of both problems are the same, attained by solutions w*, ξ*, α* that satisfy the KKT conditions:
2Cw* = Σ_{i=1}^N α_i* y_i x_i
ξ_i* ≥ 1 − y_i (w*^T x_i)  ∀i
ξ_i* ≥ 0  ∀i
0 ≤ α_i* ≤ 1  ∀i
α_i* (1 − y_i (w*^T x_i) − ξ_i*) = 0  ∀i,
where the i’s for which αi∗ > 0 are called the support vectors because they constitute the predictive
tools. This is where the name of the tool comes from. If you remember, the predictive tool was
y = sign(w> xi ), it can now be written as:
N
!> N
!
∗> 1 X
∗ 1 X
∗ >
y = sign(w x) = sign α yi xi x = sign α yi xi x .
2C i=1 i 2C i=1 i
Exercise 5.9.
Show that the Slater condition holds for our SVM primal-dual pair (5.5)-(5.12), and that their optimal values must be finite.
Figure 5.2: The idea of lifting the dimensionality of the features by including their nonlinear transformation so that the data becomes more 'linearly separable'. Source: https://datascience.stackexchange.com/questions/17536/kernel-trick-explanation
In ML, our interest in duality theory is not so much due to the need to find optimality certificates
because, as you might remember, we don’t care about optimality that much. But, sometimes the dual
problem is actually easier to solve numerically than the primal problem. In the SVM case the dual problem
has constraints on the variables αi , but these are very simple constraints.
Exercise 5.10.
Which of the constrained optimization algorithms we have learned is applicable to the dual SVM problem (5.12)?
And from the solution to the dual problem, the primal solution can be recovered by using the KKT
conditions as visible in the above example. But the real added value of the dual problem in ML, especially
in the SVM is only about to come.
Figure 5.3: A linear SVM and polynomial SVM applied to the feature space. Source: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
so that it suddenly has n + n(n + 1)/2 entries. That means the dimension of our data set increases a lot. What does the SVM problem formulation look like then? Something like this:
min_{w̃,ξ}  C w̃^T w̃ + Σ_{i=1}^N ξ_i (5.11)
s.t.  ξ_i ≥ 1 − y_i (w̃^T x̃_i)  ∀i,
      ξ_i ≥ 0  ∀i,
so w̃ ∈ R^{n+n(n+1)/2} is much bigger than the original w, and the size of the optimization problem increases substantially! Sometimes this price might be worth paying, because as a result we obtain a much better predictive tool:
y = sign(w̃^T x̃),
which, if applied to the original feature space, can take a much more flexible form, as illustrated in fig. 5.3.
However, the more features, the heavier the computations in this case. This is where duality theory will
come in. Let us get back to the case without the nonlinear transformations of the features and recall once
again the dual problem:
max_α  Σ_{i=1}^N α_i − (1/(4C)) Σ_{i,j=1}^N α_i α_j y_i y_j ⟨x_i, x_j⟩ (5.12)
s.t.  0 ≤ α_i ≤ 1. (5.13)
As you can see, the dual problem does not depend on the vectors x_i as such, but only on their inner products, and each inner product is, in the end, a single number. Moreover, you can check that the dual of (5.11) would look exactly the same, only the inner products would be different: the terms ⟨x_i, x_j⟩ would simply become ⟨x̃_i, x̃_j⟩. The corresponding prediction tool would then be
y = sign( (1/(2C)) Σ_{i=1}^N α_i* y_i ⟨x̃_i, x̃⟩ ).
Thus, the dimensionality of α and the number of terms in the prediction tool do not scale with the number
of features in the primal problem but with the number of samples alone.
Here comes the key of the so-called kernel trick: we can generalize the term ⟨x_i, x_j⟩ in the dual problem formulation to any kernel function K(x, y), which roughly measures the similarity of two vectors, satisfying
• symmetry: K(x, y) = K(y, x) ∀x, y ∈ R^n;
• nonnegativity: K(x, y) ≥ 0 ∀x, y ∈ R^n.
Such a kernel function, as it turns out, always corresponds to a certain nonlinear transform of the feature data in the primal space (by the Mercer theorem, [43]). Sometimes, this transform can only be expressed with an infinite number of terms, which would lead to solving a problem with an infinite number of entries in w in the primal space. Doing so greatly increases our ability to separate groups of points.
In other words, we can formulate an operator K(xi , xj ) and doing so will ‘imitate’ considering many,
many more features in the primal problem. This operator should be fast-to-compute.
Example 5.5.
If we want to use the monomials of degree at most 2 of our feature values, then the way of doing it in (5.10) is highly inefficient, because in the dual problem we would have inner products of very long vectors. A much more efficient way of taking into account expressions of degree up to r in terms of the features is to use the polynomial kernel function:
K(x_i, x_j) = (1 + x_i^T x_j)^r.
It does the same job as the original idea but requires far fewer floating-point operations. Additionally, it immediately accounts for including the 'constant term' in our feature vector.
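One can check numerically that the degree-2 polynomial kernel really is an inner product of lifted feature vectors; the explicit feature map below (with its √2 scalings) is one standard choice, written out by us for illustration:

```python
import numpy as np
from itertools import combinations

def phi(x):
    """One explicit feature map whose inner product gives (1 + x^T y)^2."""
    n = len(x)
    feats = [1.0]                                                  # constant term
    feats += [np.sqrt(2.0) * x[i] for i in range(n)]               # linear terms
    feats += [x[i] ** 2 for i in range(n)]                         # squares
    feats += [np.sqrt(2.0) * x[i] * x[j]
              for i, j in combinations(range(n), 2)]               # cross terms
    return np.array(feats)

kernel = lambda x, y: (1.0 + x @ y) ** 2

rng = np.random.default_rng(1)
x, y = rng.standard_normal(5), rng.standard_normal(5)
# One cheap kernel evaluation equals the inner product of the two long lifted vectors.
same = np.isclose(kernel(x, y), phi(x) @ phi(y))
```

The point of the trick is on display: `kernel` touches 5 numbers per vector, while `phi` produces 21 features that we never have to build.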
Another very popular choice is the Gaussian kernel K(x, y) = exp(−(γ/2)‖x − y‖_2^2), where γ is a to-be-tuned hyperparameter. This kernel does not correspond to a finite-dimensional transformation of the data in the primal space, but to an infinite-dimensional one, which you can infer from interpreting this formula as an 'inner product' of two series in terms of x, y for γ = 1:
exp(−(1/2)‖x − y‖^2) = exp( x^T y − (1/2)‖x‖^2 − (1/2)‖y‖^2 )
= exp(x^T y) exp(−(1/2)‖x‖^2) exp(−(1/2)‖y‖^2)
= Σ_{j=0}^∞ [ (x^T y)^j / j! ] exp(−(1/2)‖x‖^2) exp(−(1/2)‖y‖^2)
= Σ_{j=0}^∞ Σ_{n_1+···+n_k=j} [ exp(−(1/2)‖x‖^2) (x_1^{n_1} ··· x_k^{n_k}) / √(n_1! ··· n_k!) ] · [ exp(−(1/2)‖y‖^2) (y_1^{n_1} ··· y_k^{n_k}) / √(n_1! ··· n_k!) ].
Using kernels, what does the corresponding prediction tool look like? Analogously to the original formulas, we have:
y = sign( (1/(2C)) Σ_{i=1}^N α_i* y_i K(x_i, x) ).
All this is really cool because it allows us to play with the separability of the sets based on the features, in a way that does not increase the size of the optimization problem formulation. Overall, we can say that the kernel trick used in the SVM is one of the most impressive uses of duality theory, apart from certifying the optimality of solutions to optimization problems.
Exercise 5.11. Kernelization of the SVM
Consider again exercise 3.12. Now, for your dataset, consider using one of the kernels mentioned above to create a nonlinear SVM classifier. That is, formulate the corresponding dual problem and solve it using the projected gradient method. Then, you can use the obtained decision tool to color your picture in order to see into what parts the SVM classifier divided the entire [0, 10]^2 square. You can achieve this by taking a dense grid of points on this set, classifying each of them using the SVM you obtained, and coloring the points according to the two predicted classes.
6 Clustering
6.1 Introduction
Finally, after having the crash-course introduction to the relevant linear algebra and optimization, the time
has come to discuss some machine learning subjects.
To begin with, we make one important remark: we cannot discuss all the possible techniques, so we make
a selection of those in which we find the linear algebra/optimization most illustrative. At the end of each
section, if necessary, we will hint at other popular techniques which we do not discuss as they do not involve
(that much) interesting linear algebra or optimization.
Clustering falls into the group of unsupervised learning techniques, which correspond to revealing structure in a data set without any labels. Whereas in the optimization basics we mostly discussed techniques relevant to supervised learning, we have already seen techniques that can be seen as unsupervised learning in the linear algebra basics; in particular, we have already discussed how to reduce the dimension of a data set to the – with respect to some measure – most relevant information using Krylov methods and the SVD.
Clustering is about having N objects that need to be divided into groups such that the objects within a single group are similar to each other, but the groups are different. About these objects, we might have feature information stored in per-object vectors x_i, i = 1, . . . , N. In other cases, we might not have feature information about each object but instead have pairwise information about the relationships between the objects, which can take the form of:
• 0 − 1 information about whether there exists a link between the two objects or not;
• a continuous value w_ij quantifying the 'distance' between objects i and j.
In this context, the higher w_ij, the less similar the two objects are.
One of the nicest illustrations of clustering is image segmentation, where we try to divide an image into
different parts, for example, to separate people from the background. Another example is grouping people
based on knowing each other or their interests.
When fixing which data points belong to which cluster, we know that, for each cluster, the point c_k that minimizes the expression
Σ_{i∈C_k} ‖x_i − c_k‖^2
Figure 6.1: Examples of image segmentation through clustering. Source: https://it.mathworks.com/matlabcentral/fileexchange/66181-image-segmentation-using-fast-fuzzy-c-means-clusering
is
c_k = (1/|C_k|) Σ_{i∈C_k} x_i.
Exercise 6.1.
Verify the above statement.
For that reason, in the minimization problem (6.1), we can eliminate minimizing over c1 , . . . , cK and
focus on the minimization across the composition of clusters C1 , . . . , CK .
The bad news is that solving this problem to optimality is an extremely difficult task – the problem is known to be NP-hard – so for realistic problem sizes we need to resort to heuristic algorithms.
The heuristic on which the classical K-means algorithm rests consists of alternating steps of:
1. Computing the cluster centers ck as the averages of the points included in the cluster.
2. Re-assigning each point to the cluster C_k whose center c_k lies closest to it.
Algorithm 22: K-means
Data: x_1, . . . , x_N; K – number of clusters
Initialize centroids c_1, . . . , c_K
while improvement of (6.1) was obtained in the previous iteration do
    Assign each x_i to cluster k = argmin_j ‖x_i − c_j‖_2^2
    for k = 1, 2, . . . , K do
        Update c_k = (1/|C_k|) Σ_{i∈C_k} x_i
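The two alternating steps of algorithm 22 fit in a few lines; this is a minimal sketch of our own (real initialization schemes such as k-means++ do better than the random pick used here):

```python
import numpy as np

def kmeans(X, K, steps=100, seed=0):
    """Plain K-means: alternate the assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # data points as initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(steps):
        # assignment step: each point goes to its nearest centroid
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):   # no improvement: stop
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs; K-means should recover them as the two clusters.
rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((20, 2)) + 10.0,
               rng.standard_normal((20, 2)) - 10.0])
labels, centers = kmeans(X, K=2)
```

Each iteration can only decrease the objective (6.1), which is why terminating on 'no improvement' is safe; the result, however, is only a local optimum and depends on the initialization.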
then you notice that in fact it depends only on the inner products of c_i and x_i. You might already be guessing what is about to happen. Namely, one can lift the feature vectors to a higher space using a mapping Φ(·) and replace the inner product with a kernel function, so that the new 'distance' function becomes
Φ(x_i)^T Φ(x_i) − 2 Φ(x_i)^T Φ(c_i) + Φ(c_i)^T Φ(c_i) = K(x_i, x_i) − 2 K(x_i, c_i) + K(c_i, c_i).
The examples of kernel functions used for clustering are the same as in the case of kernel SVMs.
This idea is the basis of the kernel K-means method, which is the same as the standard K-means, but with the re-assignment of points to clusters done on the basis of this modified distance.
The only caveat about this idea is how to compute the points c_1, . . . , c_K. We cannot compute them anymore as the averages of points within a cluster, because working with the kernel distance means that, implicitly, we have 'lifted' our feature vectors using a mapping Φ(·) to a higher dimension,
x_i → Φ(x_i),
where Φ(x_i) could be infinite-dimensional (recall the case of the Gaussian kernel from section 5.4.2), and it is in that higher-dimensional space that we are performing K-means, so that the distance is computed as:
⟨Φ(x_i) − Φ(c_i), Φ(x_i) − Φ(c_i)⟩ = ⟨Φ(x_i), Φ(x_i)⟩ − 2 ⟨Φ(x_i), Φ(c_i)⟩ + ⟨Φ(c_i), Φ(c_i)⟩
= K(x_i, x_i) − 2 K(x_i, c_i) + K(c_i, c_i).
For that reason, our 'center' of cluster C_k is a higher-dimensional vector Φ(c_k) such that
Φ(c_k) = (1/|C_k|) Σ_{i∈C_k} Φ(x_i).
Of course, we do not want to compute this vector explicitly, among other reasons because, for some kernels, it corresponds to an infinite-dimensional transformation. However, we can easily compute the distance of each point from it using the following derivation:
⟨Φ(x_j) − (1/|C_k|) Σ_{i∈C_k} Φ(x_i), Φ(x_j) − (1/|C_k|) Σ_{i∈C_k} Φ(x_i)⟩
= ⟨Φ(x_j), Φ(x_j)⟩ − (2/|C_k|) Σ_{i∈C_k} ⟨Φ(x_j), Φ(x_i)⟩ + (1/|C_k|^2) Σ_{i,l∈C_k} ⟨Φ(x_i), Φ(x_l)⟩
= K(x_j, x_j) − (2/|C_k|) Σ_{i∈C_k} K(x_j, x_i) + (1/|C_k|^2) Σ_{i,l∈C_k} K(x_i, x_l).
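The derived formula needs only kernel evaluations, never the lifted vectors. As a small sanity check of our own: for the linear kernel K(x, y) = x^T y (where Φ is just the identity map), the kernel formula must agree with the ordinary squared distance to the cluster mean.

```python
import numpy as np

def kernel_dist2_to_center(j, idx, K):
    """Squared kernel distance of point j to the implicit center of cluster idx,
    computed from kernel evaluations only, as in the derivation above."""
    idx = np.asarray(idx)
    return K[j, j] - 2.0 * K[j, idx].mean() + K[np.ix_(idx, idx)].mean()

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 3))
K = X @ X.T                  # linear kernel matrix: Phi is the identity map
cluster = [0, 2, 5]

d2 = kernel_dist2_to_center(1, cluster, K)
explicit = float(np.linalg.norm(X[1] - X[cluster].mean(axis=0)) ** 2)
# For the linear kernel the two computations agree exactly (up to rounding).
```

Swapping `K` for, say, a Gaussian kernel matrix changes nothing in `kernel_dist2_to_center`, which is exactly why kernel K-means never needs the centers explicitly.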
The complete algorithm description is given in Algorithm 23 where as you can see, we still need to
initialize the algorithm with some centroids, but once the initial cluster assignment is done, we don’t need
the variables c1 , . . . , cK anymore.
(A figure here depicts a friendship graph on nine people, with nodes labelled 1–9.)
Based on this information, you can formulate an adjacency matrix of the graph in which the nodes
are persons, and edges are existing relationships between them:
W = [ 0 1 0 1 0 0 0 0 0
      1 0 0 1 1 0 0 0 0
      0 0 0 1 1 0 0 0 0
      1 1 1 0 0 0 0 0 0
      0 1 1 0 0 1 0 0 0
      0 0 0 0 1 0 1 1 1
      0 0 0 0 0 1 0 1 0
      0 0 0 0 0 1 1 0 1
      0 0 0 0 0 1 0 1 0 ]
Based on this, you might have to figure out what the two 'groups of friends' among them are.
In such a situation, it is natural to visualize the data at hand as a graph and to try to cluster the objects
with a graph-oriented mindset.
However, also in other contexts it is possible to visualize the problem as a graph, where the weight of
each edge connecting two nodes stands for the ‘strength of similarity’ between the objects represented by
the nodes.
Example 6.3. Image clustering
Consider a set of N black-and-white images consisting of h × w pixels. Each such image can be represented with a vector x_i ∈ R^{hw}, where each entry corresponds to a number on the black (0) – white (1) scale. Then, the similarity of two images i and j can be computed using a number obtained, e.g., with the Gaussian kernel:
Such a matrix will correspond to a graph in which two nodes are connected by an edge only if the
two nodes are considered ‘sufficiently similar’.
Summarizing the above examples: we can consider every clustering task as clustering nodes in a graph,
where the graph information can consist of
• 0-1 information if two nodes are connected by an edge or not
• continuous information informing about the ‘distance’ between the two nodes.
Equipped with this mindset, we will now present popular clustering ideas that come directly from the world
of graphs, and we will simply assume that for a graph we have a matrix W ∈ RN ×N which has either the
0-1 or continuous entries.
known as the min-cut problem (a cut is a division of the set of nodes in a graph into two complementary sets). While being a nice formulation, in applied contexts it suffers from the downside that, often, it can lead to one of the clusters consisting of just one, most isolated, node. As in clustering the aim is often to divide objects into subsets of 'similar magnitudes' in size, a workaround that helps to achieve this goal is, instead, to minimize a 'normalized' version of this quantity:
    min_V  ( ∑_{r∈V, s∈V̄} w_rs ) / ( ∑_{r,s∈V} w_rs )  +  ( ∑_{r∈V, s∈V̄} w_rs ) / ( ∑_{r,s∈V̄} w_rs ),
where the sum of the weights of outgoing arcs is normalized by the ‘inner weight’ of a given cluster, i.e., how
strongly connected are the nodes within a given cluster. Note that the edges inside the cluster are counted
twice. This trick has the property of preventing highly asymmetric cluster sizes.
The above considerations applied only to contexts with two clusters in mind for illustrative purposes.
The multi-cluster analogue of the min-cut idea would be to minimize the quantity:
    Cut(C_1, …, C_K) = ∑_{k=1}^{K} ∑_{r∈C_k, s∉C_k} w_rs,
where each edge connecting two different clusters is counted twice. The bad news, however, is that, again, minimizing a quantity like this is not a computationally tractable optimization problem, and one would need to resort to heuristic techniques similar to those of the K-means clustering algorithm, namely, checking whether shifting a given node from one set to another improves the RatioCut value.
As it turns out however, considering the linear algebraic properties of the matrix W can in some cases
imitate the minimization of the above quantity, and does lead to nice clustering techniques. The central object here is the graph Laplacian matrix

    L = D − W,

where D is a diagonal matrix with D_ii = ∑_{j=1}^{N} w_ij. If the edge information consists only of 0-1 information
whether there is an edge or not, then the diagonal entries of D contain the degree information of nodes in
the graph.
What is important to note about this matrix is that it is symmetric and positive semidefinite; cf. Defi-
nition 2.27. Hence, all its eigenvalues will be real and nonnegative.
Exercise 6.3.
What is the smallest eigenvalue of matrix L and the corresponding eigenvector?
By considering this matrix, it turns out that we can recover the idea of RatioCut by left- and right-
multiplying it with a specific matrix whose nonzero entries indicate the belonging of a given node to a
cluster k.
Proposition 6.1.
Let C_1, C_2, …, C_K be the clusters (sets of sample indices) and let H ∈ R^{N×K} be a matrix where

    H_{i,j} = (1/√|C_j|) · 1_{i∈C_j},

where 1_{…} is an indicator function equal to 1 if the clause holds, and 0 otherwise. We then have that

    RatioCut(C_1, …, C_K) = trace(H^⊤ L H).
Exercise 6.4.
Prove the above proposition.
Note that the columns of the matrix H so defined form an orthonormal set, and their nonzero entries uniquely define the belonging of a given observation to each of the clusters.
Noticing this, we can state that to minimize the RatioCut we can search for a matrix H whose columns are orthonormal and such that each H_{ij} ∈ {0, 1/√|C_j|}. Unfortunately, this is an integer programming problem which we cannot solve efficiently.
But we can relax some of the restrictions of the problem so defined: the idea is to search for an orthogonal matrix H ∈ R^{N×K} that minimizes trace(H^⊤ LH). It is rather difficult to solve an optimization problem that includes constraints of the type ‘the columns of a given matrix should be orthonormal’, but luckily, this particular problem has a well-known solution.
By linear algebra we know the solution to this problem is the matrix H whose columns are the eigenvectors corresponding to the K minimal eigenvalues of L. The downside is, of course, that such a matrix will not have rows with only one nonzero entry indicating to which cluster a given observation should belong. For that reason, we still need to... cluster the rows of H. Here you realize that there is no escape from the K-means clustering algorithm: it is the algorithm most commonly used for clustering the rows of H. The resulting algorithm is called unnormalized spectral clustering, presented in Algorithm 24.
which ‘correctly’ classifies the nodes into the red and blue clusters.
In some contexts, the unnormalized version of the spectral clustering algorithm does not lead to the most
desired results because the topological structure of the graph is dominated by a few nodes with the largest
degree Dii .
In Internet-related clustering problems this can mean, for example, that a given node is a unit that is
sending out spam.
For that reason, another, normalized variant of the spectral clustering algorithm is used, where the
Laplacian is normalized as
L̄ = D−1/2 LD−1/2 = I − D−1/2 W D−1/2 .
With that change, the algorithm remains the same and is given in Algorithm 25.
Algorithm 25: Normalized spectral clustering
Data: x_1, …, x_N ; K – number of clusters
Compute the similarity matrix W and the normalized Laplacian L̄ = I − D−1/2 W D−1/2 .
Construct a matrix H whose columns are the eigenvectors corresponding to the K minimal
eigenvalues of L̄.
Use the K-means algorithm to cluster the rows of H into C_1, …, C_K.
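Both variants can be sketched compactly in numpy. This is an illustrative sketch of the two algorithms, not code from the notes: it assumes a precomputed weight matrix W with no isolated nodes, and the small `kmeans` helper (with a deterministic farthest-point initialization) stands in for whatever K-means implementation one prefers.

```python
import numpy as np

def kmeans(H, K, iters=50):
    """Plain Lloyd iterations on the rows of H (tiny helper, not optimized)."""
    centers = [H[0]]                       # deterministic farthest-point init
    for _ in range(K - 1):
        dists = np.min([((H - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(H[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((H[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = H[labels == k].mean(axis=0)
    return labels

def spectral_clustering(W, K, normalized=True):
    """Cluster the nodes of a graph with weight matrix W into K groups."""
    d = W.sum(axis=1)
    if normalized:
        Dinv = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(W)) - Dinv @ W @ Dinv   # normalized Laplacian
    else:
        L = np.diag(d) - W                      # unnormalized Laplacian
    _, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    H = vecs[:, :K]                             # eigenvectors of the K smallest
    return kmeans(H, K)
```

On a graph made of two disconnected cliques, for instance, the two zero eigenvalues of the Laplacian produce eigenvectors that are constant on each clique, so the K-means step on the rows of H recovers the cliques exactly.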
Both the unnormalized and normalized versions of this algorithm require us to compute the K smallest eigenvalues of a matrix whose size scales linearly with N – the number of samples. For that reason, from a certain size onwards the task might become computationally challenging. For that purpose, effective approximate techniques, such as Nyström sampling, have been developed [1].
By means of the material of this course, since the graph Laplacian is symmetric, its eigenvalues and
eigenvectors can be approximated using the QR algorithm. If the graph Laplacian is sparse, one may instead
want to compute the eigenvalues and eigenvectors in a Krylov subspace. Unfortunately, the eigenvectors corresponding to the largest eigenvalues are dominant in the Krylov subspace. Therefore, one could instead compute approximations of the largest eigenvalues and corresponding eigenvectors in the Krylov subspace

    K_m((L + εI)^{−1}, v),

for some v, small ε > 0 and m > K; the term εI is a small regularization term making the matrix invertible. Note that, if λ_i is an eigenvalue of L, then 1/λ_i is close to an eigenvalue of (L + εI)^{−1} with approximately the same eigenvector. Hence, the K largest eigenvalues and corresponding eigenvectors of (L + εI)^{−1} can be used to approximately compute the K smallest eigenvalues and eigenvectors, respectively, of L.
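This shift-invert relation is easy to check numerically. The sketch below uses a random symmetric positive semidefinite matrix as a stand-in for a graph Laplacian (an assumption for illustration only) and verifies that the eigenvalues of (L + εI)^{−1}, sorted in decreasing order, are exactly 1/(λ_i + ε) for the eigenvalues λ_i of L in increasing order:

```python
import numpy as np

# A random symmetric PSD matrix as a stand-in for a graph Laplacian L.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
L = A @ A.T
eps = 1e-3

# Eigenvalues of L (ascending) and of the shifted inverse (L + eps*I)^{-1}.
lam = np.linalg.eigvalsh(L)
mu = np.linalg.eigvalsh(np.linalg.inv(L + eps * np.eye(6)))

# The largest eigenvalues of the inverse are 1/(lambda_i + eps) for the
# smallest lambda_i, with the same eigenvectors.
```

In practice one would not form the inverse explicitly but apply it within the Krylov iterations by solving linear systems with L + εI.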
Exercise 6.5.
Consider a graph consisting of two complete subgraphs which are not connected by any edge, and
where the matrix W is an adjacency matrix. Will unnormalized/normalized spectral clustering with K = 2
lead to two clusters corresponding to the two connected parts? What will that look like for K complete
subgraphs, and clustering with K clusters?
Figure 7.1: Comparison of an SVM classification tool and rectangle-partition tool. Areas are shaded in the
color of the predicted value.
7 Tree-based learners
7.1 Introduction
So far we did nice and cool ‘proper optimization’ in this course; that is, we were learning methods with the guarantee that, if the problem possesses some friendly properties (convexity), then the method converges to the optimal solution – the gradient method, for example.
This allowed us to optimize the shape of fairly complex classification (SVMs) or regression (linear regression, including nonlinear feature transformations) tools.
What was the ‘essence’ of the power of, for example, SVMs? It was that through the kernel trick we were
able to construct a fairly complicated ‘separation surface’. This complicated surface, as a result, was doing
a good job, separating points with different label values +1 and −1.
One can, however, also try to think differently about the problem and, instead of dividing the points using a ‘single but complicated shape’, divide the feature space into many simple shapes, and assign the value +1 or −1 as the predictor to each of the simple shapes, depending on which label is in the majority there in the training set. This idea is illustrated in fig. 7.2.
In the right panel, the feature space has been divided by partitioning it into hyper-rectangles, so that the sample points within the successive rectangles become more and more ‘uniform’, i.e., in the end we end up with small rectangles where nearly every point has the same label.
How do we formalize the corresponding prediction tool? If we denote each rectangle by Xl , and y(Xl ) is
the label assigned to Xl , then the decision tool is:
y = y(Xl ) if x ∈ Xl ,
Figure 7.2: Comparison of a polynomial regression of degree 3 with a piecewise-constant regression through
domain splitting.
Advantages:
• interpretability of the subsets – if the hyperplanes used have, for example, only one nonzero entry, then we have a set of simple threshold rules that identifies which label a given data point should receive.
Disadvantages:
• prone to overfitting when the partition is too fine.
    X_l = { x ∈ R^n : a_{lk}^⊤ x ≤ b_{lk}, k = 1, …, N_l }.
Additionally, to each of the subsets Xl we will assign the corresponding label y(Xl ) which will act as the
predictor for that specific set.
What do we want to achieve with this partitioning and these labels? Just as in our earlier optimization
attempts, we will try to fit the partition and the labels to the training data as well as possible.
When is the fit good on the training set? When within each set Xl the labels of the data points there
are close to y(Xl ). When is the fit bad? When we have the exact opposite. Because labelling each subset
Xl can be considered separately, we will do so in the formal discussion now, considering also regression and
classification separately.
You know the result – v is the average of all the labels inside the set:

    v = (1/|{i : x_i ∈ X_l}|) ∑_{i: x_i ∈ X_l} y_i.
    y = 1 if w^⊤ x ≥ 0,   y = −1 otherwise.
Now, if we are trying to fit a single constant label for all observations within a given set X_l, we can skip the linear dependency and simply try to find a number v ∈ R minimizing the loss function

    ∑_{i: x_i ∈ X_l} log(1 + exp(−y_i v)),     (7.1)
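For a constant prediction v, the loss (7.1) actually has a closed-form minimizer: setting the derivative to zero gives v = log(n₊/n₋), the log-odds of the two classes in the subset. The sketch below (an illustration, not code from the notes; it assumes both classes are present in the subset) verifies this numerically:

```python
import math

def best_constant_logit(labels):
    """Minimizer of sum_i log(1 + exp(-y_i * v)) over v, for labels in {-1, +1}.
    Setting the derivative to zero gives v = log(n_plus / n_minus); assumes
    both classes occur at least once."""
    n_plus = sum(1 for y in labels if y == 1)
    n_minus = len(labels) - n_plus
    return math.log(n_plus / n_minus)

def logistic_loss(labels, v):
    """The loss (7.1) for a constant prediction v."""
    return sum(math.log(1 + math.exp(-y * v)) for y in labels)
```

For labels (1, 1, 1, −1), for instance, the minimizer is log 3, and perturbing v in either direction only increases the loss.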
Figure 7.3: Hierarchical partitioning of the space into four subsets.
Do you have an idea how a problem like this can be solved in the general case? That would be a very complicated task, because the feasible set is definitely not convex, and an attempt to formulate the above constraints using closed-form expressions would be absolutely daunting. In other words, a problem posed like this is absolutely hopeless.
With two simplifying restrictions, however, it is possible to formulate the problem in a way that can be
at least tried to be solved using existing software.
Organizing the partition into a tree. The first assumption is that the number of subsets L should be a power of 2, such that the subsets are obtained through a recursive process of partitions, where the ‘children nodes’ inherit the half-spaces of the ‘parent nodes’. First, one splits the entire space using a single hyperplane. Then, each of the resulting subsets is split into two using a single hyperplane again, which gives us four subsets after two splitting rounds. After d splitting rounds we obtain 2^d subsets, each defined using d hyperplanes. In this way, we obtain a tree structure. This idea is illustrated in fig. 7.3.
Each partition depends on only one feature. Another step is to restrict each of the hyperplanes to only one feature, that is, to require that all vectors a_{li} are unit vectors or their negatives. Actually, this is exactly
the way that the right panel in fig. 7.2 has been constructed, using a tree of depth 3, because all the lines in this plot are either vertical or horizontal (corresponding to splits along only one feature). This restriction has the optimization-friendly feature that, suddenly, the search for the best a_{li} is restricted to only n possibilities.
A tree satisfying the above two assumptions would be a very interpretable one - and interpretability is a
much discussed topic nowadays in the context of ML used in societal applications.
Example 7.1. Interpretability
Imagine you are constructing a classification tree for deciding whether someone is suspected of having
diabetes or not, based on a number of the patient’s characteristics (BMI, cholesterol, etc.). A classification tree built using simple one-feature rules such as ‘is the patient’s BMI higher than Z’ is way more trustworthy to practitioners than one built using rules such as ‘is 0.4 times the patient’s BMI plus 0.145 times the patient’s cholesterol level higher than Z’. This is exactly what the whole discussion about the interpretability of AI tools is about, if you have heard of it.
The two restrictions simplify the accounting a lot, and the resulting problem can actually be written down as a mixed-integer linear optimization (MILP, [49]) problem, i.e., a problem of the form

    min_{x ∈ R^{t_1} × Z^{t_2}}  c^⊤ x
    s.t.  Ax ≤ b,
For this, we need to select the feature j ∈ {1, …, n} that will be the basis of the partition, and then optimize the threshold b. How do we do this? We can try every possible feature j = 1, …, n, optimize the threshold for each of them, and select the feature that, together with its threshold, gives the best subset purity:

    L(X_l^−(b, j), v_−) + L(X_l^+(b, j), v_+).
For a fixed feature index j, we can optimize the value b by, for example, searching over the interval

    [min{x_{i,j} : x_i ∈ X_l}, max{x_{i,j} : x_i ∈ X_l}].

Note that if there are N′ points within the given subset, then the number of thresholds one actually needs to try does not exceed N′ − 1 (why?).
Formally, we thus do the following to split the set Xl :
    min_{j, b, v_−, v_+}  L(X_l^−(b, j), v_−) + L(X_l^+(b, j), v_+).
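This greedy search over features and thresholds can be sketched as follows. It is an illustrative sketch under squared loss (the regression case), with each side labelled by its mean; the helper name `best_split` is an assumption:

```python
import numpy as np

def best_split(X, y):
    """Greedy single-feature split minimizing squared loss.
    Tries every feature j and every midpoint threshold b between consecutive
    sorted sample values (at most N'-1 candidates); each side is labelled
    with the mean of its labels."""
    best = (np.inf, None, None)  # (loss, feature j, threshold b)
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])
        for b in (vals[:-1] + vals[1:]) / 2:   # midpoints: both sides nonempty
            left, right = y[X[:, j] <= b], y[X[:, j] > b]
            loss = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if loss < best[0]:
                best = (loss, j, b)
    return best
```

Applying the same function recursively to the two resulting subsets, up to a chosen depth d, yields the full tree-building procedure described next.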
Overall, the recursive algorithm for constructing a classification or regression tree is as follows.
What are the benefits of an algorithm like this? First of all, at each step one minimizes over a single one-dimensional parameter b, and most of the computations can be parallelized. For that reason, the buildup of trees like this is extremely fast. Additionally, because at each level one uses a single-feature criterion, the corresponding prediction tools are easily interpretable.
The recursive mechanism has some drawbacks as well, of course. Compared to the ideal situation in
which all the partition parameters would be optimized jointly, the accuracy of such a greedily-built tree will
certainly be suboptimal.
Exercise 7.1.
Construct a worst-case two-class classification dataset in R² such that, if you apply the recursive tree construction once, at least one of the optimal trees predicts nothing, i.e., it is just as good as guessing the label of a new point based on whichever label is most frequent.
So far, we have assumed that the tree depth d is a fixed value. In fact, it is a hyperparameter of our tool that we need to tune to make it work as well as possible – just like the degree of the polynomial kernel in SVMs, for example. If a sufficiently high d is chosen, it is possible to partition the dataset in a perfectly pure way, where each sample lives in its own cell. But that is not the point – this is only the training data, and it is likely that a tree like this will underperform on the test data. For that reason, d should be chosen (the tree can be ‘pruned’) so as to select a depth value that performs best not on the training dataset, but on the test/validation dataset.
Figure 7.4: Classifiers obtained for random forests with trees of maximum depth of 3 each, consisting of 1,
50 and 100 trees.
two features gives roughly the same result, why would one pick one feature over the other? And how would we know how it impacts the splits further down the tree? In the end, investigating every possible feature in alg. 26 to select the best one among them, although it can be parallelized, still takes time and effort.
As it turns out, a strategy that usually works better is to select the features to branch on randomly for a given tree, but to construct many trees simultaneously (which can be parallelized). While this might sound like a risky choice, this idea actually works pretty well. Alg. 27 illustrates the idea of randomly creating many trees – a random forest.
One question is left: how do we aggregate the predictions of multiple trees? In the case of regression, the simplest answer is to average out the predictions generated by the different trees. For classification, we can use a ‘voting’ mechanism in which a given sample receives the label that most of the trees select. Fig. 7.4 illustrates the random forest idea.
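The two aggregation rules can be sketched in a couple of lines (an illustration, assuming ±1 classification labels and an odd number of trees so that votes cannot tie):

```python
import numpy as np

def aggregate(predictions, task="classification"):
    """Combine predictions from many trees: average for regression,
    majority vote for +/-1 classification labels (assumes no ties)."""
    P = np.asarray(predictions, dtype=float)   # shape (n_trees, n_samples)
    if task == "regression":
        return P.mean(axis=0)
    return np.sign(P.sum(axis=0))              # majority of +/-1 votes
```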
7.5 Boosting
Instead of voting/averaging of many randomly created trees, one can also come up with the following idea:
create a single tree first, and then, ‘add to it’ another tree that would focus on samples on which the
previous tree’s classifications were wrong. In that way, the next tree is created deliberately to compensate
for underperformances of the previous one, not in a random fashion.
This is the idea of boosting. Although the boosting idea can be applied to pretty much any ML tool, it became particularly popular for constructing tree-based classification and regression tools. This is because, for these tools, the trade-off between ‘let’s construct many simple tools without optimizing each of them too much’ and ‘let’s construct a single, highly-optimized complicated tool’ seems to be in favor of the former.
We now introduce the idea formally, using regression as the first example. Suppose that for our training dataset we construct a first predictive model

    y = f_1(x)

by minimizing, for example, the following loss function:

    min_{f ∈ F} ∑_{i=1}^{N} (y_i − f(x_i))².
We then set model_1 = f_1(x). Note that, essentially, each next model will be fitted so as to cover for the ‘misclassifications’ of the previous one.
The general m-th step of the boosting approach is then given by

    model_m(x) = model_{m−1}(x) + f_m(x),   f_m ∈ argmin_{f ∈ F} ∑_{i=1}^{N} (y_i − model_{m−1}(x_i) − f(x_i))²,

where at each step we treat the previous ‘model’ as fixed and we only optimize the new term that corrects for the misclassifications of the previous one.
For each subsequent tree in the boosting process, the creation of the next tree can follow the same
recursive mechanism as before, and we only modify the loss function to minimize when performing the
splits.
For regression trees the implementation is simple – just as in the formulas above, we construct a decision tree fitted to the errors made by the previous trees; in other words, we fit a regression tree to the data set

    (x_1, y_1 − model_{m−1}(x_1)), …, (x_N, y_N − model_{m−1}(x_N)).
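The residual-fitting loop can be sketched as follows. This is an illustrative sketch, not the course's implementation: it uses depth-1 trees (‘stumps’) as the base learners, and both helper names `fit_stump` and `boost` are assumptions:

```python
import numpy as np

def fit_stump(X, y):
    """Depth-1 regression tree: best single-feature threshold split,
    each side labelled with the mean of its labels."""
    best = None
    for j in range(X.shape[1]):
        for b in np.unique(X[:, j])[:-1]:
            m = X[:, j] <= b
            vl, vr = y[m].mean(), y[~m].mean()
            loss = ((y[m] - vl)**2).sum() + ((y[~m] - vr)**2).sum()
            if best is None or loss < best[0]:
                best = (loss, j, b, vl, vr)
    _, j, b, vl, vr = best
    return lambda Z: np.where(Z[:, j] <= b, vl, vr)

def boost(X, y, steps=20):
    """Boosting for regression: each new stump is fitted to the residuals
    y - model_{m-1}(x) of the current ensemble."""
    model = [lambda Z: np.zeros(len(Z))]
    for _ in range(steps):
        resid = y - sum(f(X) for f in model)   # what is left to explain
        model.append(fit_stump(X, resid))
    return lambda Z: sum(f(Z) for f in model)
```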
For classification trees, we cannot simply ‘subtract’ the model prediction from the label itself (we could end up with numbers different from −1 and 1). But what we can do is treat the ‘v’ of the previous model for a given sample as something to be corrected. In the logistic regression setting we chose, this means minimizing the following loss function:

    ∑_{i: x_i ∈ X_l} log(1 + exp(−y_i (model_{m−1}(x_i) + v))).

This is where, we hope, you see the entire point of introducing the softmax function mechanism. Fig. 7.5 illustrates the idea.
Figure 7.5: Classifiers obtained for boosting with trees of maximum depth of 3 each, after a single step (no
boosting), 50 and 100 steps.
Exercise 7.2.
Consider again exercise 3.12. This time, code your own classification tree constructor, including its forest and boosting variants. Compare the speed and performance to the SVM-based classification.
7.6 Need for interpretability – will proper optimization make its comeback?
As you have seen above, the two approaches that make up for the deficiencies of a single tree consist in creating sums or random collections of many trees. While this improves the predictive performance, it makes the model less interpretable than a single tree based on single-feature splits only.
For that reason, a logical question arises: if one day the pressure for AI tools used, e.g., to process job applications, to be interpretable leads to a ban on random forests or overly complicated boosted trees, what will the future be for such highly efficient prediction tools?
A possible direction in which things might go is the following. Imagine the ML tool designer being forced to create a simple, interpretable tool. In the trees context, this might correspond to having a single tree of fixed maximum depth. Then, the focus on the quality of each of the branchings will be much higher than it is now – currently these branchings are done heuristically and quality is improved by creating more trees. It is possible that in such a situation heuristic approaches will no longer be sufficient, and one will have to develop (almost) exact algorithms for the joint problem, much in the spirit of [6].
8 Hyperparameter optimization
8.1 Introduction
Many, if not all, ML tools are described by hyperparameters chosen by the user before the ML tool gets
optimized to fit the training data – think of:
• stepsize used in the gradient descent method
• the regularization parameter in SVMs
• γ used in kernel functions
While some hyperparameters are discrete numbers – for example, the degree of the polynomial kernel used in SVMs – other hyperparameters take continuous values – such as the stepsize lengths.
The name hyper stems from the need to distinguish them from the actual parameters which are optimized
‘automatically’:
Were we to minimize over w and λ ≥ 0 simultaneously, the optimal λ would always be equal to 0,
which means that something is wrong either with the loss function, or with the very idea of optimizing
them jointly.
A second, much more important reason is that hyperparameters optimized in such a way would lead to models that perform very badly out of sample, i.e., on data which is not part of the training data.
Therefore, hyperparameters should be chosen in such a way that the model performs as well as possible on
data that is different from the training data. How can this be done?
There are multiple ways to do it, and the process of doing so is called validation. The most classical approach is so-called K-fold cross-validation. In this approach, the training dataset X is divided into K distinct sets of equal size (or almost equal size, if that is not possible):

    X = ∪_{k=1}^{K} X_k.
Denote by M(X , h) the model trained on dataset X with hyperparameter value h, and by P(M(X , h), X 0 ) its
performance on set X 0 (for example, SVM loss function without the regularizer, out of sample classification
error, loss function in regression, etc...).
Then, the K-fold cross-validated performance of the hyperparameter setting h is computed as

    (1/K) ∑_{k=1}^{K} P(M(X \ X_k, h), X_k).
In other words, for each k, we train the model on the data set consisting of all samples except for X_k (training set), and then evaluate its performance on X_k (validation set). In this way, the models are evaluated on different data than they were trained on, and each sample in the dataset has played the role of both a training and a validation sample. The goal of K-fold cross-validation is to make the evaluation independent of the specific data split at hand.
Coming back to the hyperparameter choice, our goal is to fit the model as well as possible, corrected for the validation step. Thus, we want to solve the following problem:

    min_{h ∈ H} G(h) := (1/K) ∑_{k=1}^{K} P(M(X \ X_k, h), X_k).     (8.2)
This problem is known as hyperparameter optimization (HO), and it is computationally challenging because the mere evaluation of the function for a specific h requires the model to be trained K times. Of course, the model trainings can be easily parallelized, as they are completely independent, but in principle, even training a single model is not a trivial step.
Remark 8.1. Training-validation-test splitting
Strictly speaking, the most popular approach to building and assessing ML models is as follows. First, you divide your dataset into training and test sets. Next, on the training set you perform HO with K-fold cross-validation (thus iteratively ‘taking out’ small pieces of the training set to be used for validation). Then, once the hyperparameter value h∗ that minimizes (8.2) is found, the model is trained on the entire training set with hyperparameters h∗. In the end, the performance of this model is assessed on the test set (which has not been used at all until this moment).
HO, by the nature of the problem, is a task that can involve only a limited number of model trainings / K-fold validations, because each of them, on its own, is time-expensive. In general, the longer it takes to train a single model, the fewer ‘attempts’ we have to try different values of h. What does not make the goal easier is the fact that the function we are trying to minimize can be non-convex, as in fig. 8.1, and the fact that some entries of h might have to take integer values.
Luckily (as in many other cases discussed earlier in this course), ML is not the first field to encounter this kind of problem – industrial design and experiment design in biotechnological research, among others, have dealt with this problem since the beginning of the 20th century.
Example 8.2.
Imagine you are operating an oil field and try to figure out the best place to drill to put an oil rig. In principle, you want to find the place where the field is most shallow. Each depth measurement costs a lot of time and money, and you can only perform a maximum of 10 measurements. How would you strategize about the different places at which to measure the depth?
For that reason, it should not come as a surprise that the ML strategies for HO will ‘borrow’ from the
ideas used in those fields. Our running assumption throughout this section is that we will be facing the task
of finding the best value h ∈ H, where H is the set of all ‘reasonable’ parameter values.
Figure 8.1: Negative of the classification accuracy on the validation dataset for an SVM described by two
hyperparameters (we want to minimize this function over the hyperparameter values).
Remark 8.2.
As already mentioned above, hyperparameters can take both continuous and discrete forms – think of the kernel parameter γ in SVMs and the number of hidden layers in neural networks. This section of the notes is written mostly with continuous parameters in mind, but most of the ideas here can be extended in a ‘life-hacky’ way to searching over discrete parameter spaces.
What would be typical first-shot strategies for hyperparameter selection? As a first try, you would
probably ask people who work in an application similar to yours what values of hyperparameters typically
‘work’. This is a good strategy because ‘folk knowledge’ like this can save you a lot of time and you benefit
from the work done already by other people.
If this does not make you happy though, you can also try a few different values for h yourself and simply select the best one. This will almost certainly improve upon the single-shot value of h you get from folk wisdom, but (i) it is not easily replicable if someone wants to check your results, and (ii) it might involve a lot of your own time, which you might prefer to spend on other activities.
For these reasons, a need for automated HO/tuning strategies arises. In what follows, we shall discuss three general approaches for solving this problem, in increasing order of their mathematical sophistication (and decreasing order of popularity). Although this chapter is constructed for the purposes of HO, you should consider it as a general discussion of what to do when we try to minimize a function which is very costly to evaluate and for which no information apart from its value can be obtained. Formally, this is called black-box optimization.
Since we need to find a joint selection of all components h1 , . . . , hnh , we need to search the entire multi-
dimensional product set:
H = H1 × . . . × Hnh .
For each parameter combination in this set we need to train our model and validate it. That means that for
higher-dimensional h, grid search suffers from the curse of dimensionality - the number of model trainings
needed becomes impractical. For hyperparameter vectors of length 2 or 3, as in for example, the support
vector machines with the Gaussian kernel, it is a perfectly suitable approach. Of course, no approach is free from a certain degree of arbitrariness – in the case of grid search, we need to select the grid area first. As a rule of thumb, you should begin with a rather coarse grid; if you observe that the best-performing values of h lie on the ‘boundary’ of the grid, it is a signal that perhaps you should extend the grid a bit to see if even better values do not lie outside of it. The ‘best’ situation thus is when the best values are somewhere ‘in the middle’ of the grid, because that gives you a signal that they are at least ‘locally optimal’. Once you have identified an area with particularly good values, you can try performing another grid search there, with a finer grid.
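A bare-bones grid search over the product set H₁ × … × Hₙ can be sketched as follows (an illustration; `G` stands for the cross-validated objective (8.2), and the helper name `grid_search` is an assumption):

```python
import itertools

def grid_search(G, grids):
    """Evaluate G at every point of the product grid H1 x ... x Hn and return
    the best point and value. The number of evaluations is the product of the
    grid sizes, which is the curse of dimensionality in action."""
    best_h, best_v = None, float("inf")
    for h in itertools.product(*grids):
        v = G(h)                       # one full K-fold training/validation
        if v < best_v:
            best_h, best_v = h, v
    return best_h, best_v
```

The coarse-to-fine refinement described above amounts to calling this function again with a denser grid around the returned `best_h`.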
Another strategy for hyperparameter selection is to sample them randomly from H and to stick with the best value found after a prescribed number of samples. An upside of this strategy is that it is we who design the probability distribution to sample from. Any prior knowledge we might have about where it is ‘more likely’ to find the best possible values can be included in the sampling strategy.
Example 8.3. Low ‘effective dimensionality’ of the hyperparameter space
When tuning an SVM, it is likely that the best values of λ and γ are related to each other – when λ is
small, γ should be small as well so that the impact of one parameter does not dominate the training
process. For that reason, it makes sense for the sampling strategy to sample the different values for
(λ, γ) in a ‘correlated’ way.
A downside of sampling is that it is replicable only if you fix the sampling seed of the process, and that, just like grid search, it requires some prior knowledge of ‘where to search’.
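Random search with a user-designed, seeded distribution can be sketched as follows (an illustration; the names `random_search` and `sampler` are assumptions, and the correlated (λ, γ) sampler in the usage note mirrors Example 8.3):

```python
import random

def random_search(G, sampler, budget=50, seed=0):
    """Random search: draw hyperparameters from a user-designed distribution
    and keep the best; fixing the seed makes the run replicable."""
    rng = random.Random(seed)
    best_h, best_v = None, float("inf")
    for _ in range(budget):
        h = sampler(rng)
        v = G(h)
        if v < best_v:
            best_h, best_v = h, v
    return best_h, best_v
```

A sampler encoding a ‘correlated’ prior, as in Example 8.3, could look like `lambda rng: (lam := 10**rng.uniform(-3, 1), lam * rng.uniform(0.5, 2.0))` – the exact ranges here are purely illustrative.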
• the assumed prior distribution used to construct our ‘guess’ about the shape of G
• the acquisition function used to quantify where the gain of evaluating the next sample is the highest.
In this lecture, we focus on the most popular case of using a Gaussian process as the prior. We now
explain the meaning of the term ‘Gaussian process’ in this context. At the start of the optimization, we
Figure 8.2: The ‘true’ evaluated function (blue) and its ‘idea’ consisting of the mean (dashed) and a 95% con-
fidence interval (turquoise shaded area) based on three evaluations of the function (red points). Constructed
using the ‘bayesian-optimization’ package in Python [34].
assume that the function value at each point h is a normal random variable with expectation

    E(G(h)) = 0,

and that the covariance of the function values at points h and h′ is

    E(G(h)G(h′)) = K(h, h′),
where K(·, ·) is some selected kernel function. The kernel function can be one of the kernel functions we
already learned in the course, such as the Gaussian kernel.
Remark 8.3.
Note that when h = h0 , the formula gives us the variance of G(h) because the expectation is assumed
to be 0.
Given these assumptions and the pairs (h_1, y_1 = G(h_1)), …, (h_t, y_t = G(h_t)), the distribution of G(h) is computed as the conditional distribution of a multivariate normal vector given the values y = (y_1, …, y_t). This is exactly where the power and beauty of using a normal distribution as the prior comes in – basic statistics gives us a closed-form formula for the parameters of this conditional distribution, from which the per-point confidence regions depicted in fig. 8.2 are obtained.
To derive them, we formulate things formally. We assume that the vector g = (G(h_1), …, G(h_t), G(h)) follows a multivariate normal distribution with

    E(g) = 0   and   E(gg^⊤) = [ Σ    k
                                 k^⊤  K(h, h) ],

where

    Σ_{ij} = K(h_i, h_j),  i, j = 1, …, t,   and   k = (K(h_1, h), …, K(h_t, h))^⊤.
Then, given y = (y_1, …, y_t), from basic statistics and calculus [17] we know that G(h) follows the conditional distribution

    G(h) | y ∼ N( k^⊤ Σ^{−1} y,  K(h, h) − k^⊤ Σ^{−1} k ),

that is, a normal distribution with mean and standard deviation given by

    μ(h) = k^⊤ Σ^{−1} y,   σ(h) = √( K(h, h) − k^⊤ Σ^{−1} k ).
Importantly, the fact that we use a kernel function to model the (co)variances ensures that the term under
the square root never becomes negative, because both Σ and E(gg⊤) are positive-semidefinite.
In fig. 8.2 you can see an example of the obtained expected values and 95% confidence intervals for the
function value at different h.
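The posterior formulas above translate directly into code. The following NumPy sketch is a simplified illustration (not the implementation behind the figures); the scalar Gaussian kernel and the parameter gamma are assumed choices:

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    # K(h, h') = exp(-gamma (h - h')^2) for scalar h; gamma is an assumed parameter
    return np.exp(-gamma * (a - b) ** 2)

def gp_posterior(h, hs, ys, gamma=1.0):
    """Posterior mean mu(h) and standard deviation sigma(h), given the
    evaluations ys = (y1, ..., yt) at the points hs = (h1, ..., ht)."""
    Sigma = gaussian_kernel(hs[:, None], hs[None, :], gamma)   # t x t kernel matrix
    k = gaussian_kernel(hs, h, gamma)                          # vector k
    mu = k @ np.linalg.solve(Sigma, ys)                        # k^T Sigma^{-1} y
    var = gaussian_kernel(h, h, gamma) - k @ np.linalg.solve(Sigma, k)
    return mu, np.sqrt(max(var, 0.0))                          # clip tiny negatives
```

At an already evaluated point, the posterior mean reproduces the observed value and the standard deviation vanishes, matching the interpolation behaviour visible in fig. 8.2.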
Now, given this way of establishing our probability distribution for the unknown function, what is the
best point to try as the next one? It should be a point where the gain from running the next evaluation is the
largest, i.e., a point that balances carefully between exploration (investigating regions of biggest variance)
and exploitation (minimizing in areas where we expect the function to be the lowest). The field of BO has
constructed the concept of acquisition functions that try to capture exactly this, and which become the
objects to minimize when searching for the next iterate.
For the Gaussian process prior, the popular acquisition functions are generally a function of three
things: the mean µ(h) and the standard deviation σ(h) of G(h), and the best value seen so far ybest. Two
examples of acquisition functions to minimize are
• (the negative of) the probability of improving upon the so-far best value ybest;
• the expected improvement.
Exercise 8.1.
Derive the formulas for (8.3) and (8.4) as functions of µ(h), σ(h) and Φ – the cumulative distribution
function of the standard normal distribution.
Of course, the minimization of the acquisition function is itself an optimization problem to solve. On
the upside, this problem is typically low-dimensional. On the downside, this function will typically be non-
convex and hence, the most that one can hope for is to find a local minimum by applying gradient descent
or a quasi-Newton method starting from a random point. Indeed, the most frequently used algorithm for
this problem is the BFGS algorithm (or L-BFGS, which is a ‘limited memory’ version of BFGS).
Once a minimizer h is found, the function is evaluated at this new point and the ‘idea’ of the function
G is refined, hopefully with G(h) < ybest. Figure 8.3 illustrates the subsequent iterations of the Bayesian
optimization algorithm.
An important aspect that has not been mentioned so far is that to begin with, one needs to sample a few
points h1 , . . . , hninit for which the function will be evaluated and which will serve as the basis for the first
iteration of BO (without this, the BO algorithm cannot be initialized).
The complete description of the algorithm is given in alg. 28. BO is a global optimization method for which
no general convergence results can be provided. However, in practice it works really well on moderately-
dimensional problems for which performing a single function evaluation is expensive. For that reason, BO is
a part of many ML libraries such as AutoML [23]. For more information on BO, please see [19].
Figure 8.3: The evaluated function (above) and its ‘idea’ after 4, 5 and 9 evaluations, together with the
corresponding acquisition function (lower confidence bound) and the next best point to evaluate marked (below).
Algorithm 28: Bayesian optimization algorithm.
Data: Function G(·), kernel function K(·, ·), number ninit of initial points to sample randomly,
acquisition function.
Sample ninit initial points h1, . . . , hninit randomly and evaluate G(·) on them.
Set t = ninit.
while Stopping criterion not met do
    Update the Gaussian process model based on (h1, y1), . . . , (ht, yt).
    Minimize the acquisition function to find a new point ht+1.
    Evaluate the function yt+1 = G(ht+1).
    Set t := t + 1.
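The whole loop of alg. 28 can be sketched in a few lines of NumPy. This is a simplified illustration, not the ‘bayesian-optimization’ package: the Gaussian kernel, the lower-confidence-bound acquisition µ(h) − βσ(h), and minimizing the acquisition on a grid (instead of with L-BFGS) are all choices made for brevity:

```python
import numpy as np

def kern(a, b, gamma=0.5):
    # Gaussian kernel K(h, h') = exp(-gamma (h - h')^2), an assumed choice
    return np.exp(-gamma * (a - b) ** 2)

def bayes_opt(G, lo, hi, n_init=3, n_iter=10, beta=2.0, seed=0):
    rng = np.random.default_rng(seed)
    hs = list(rng.uniform(lo, hi, n_init))        # initial random sample
    ys = [G(h) for h in hs]
    grid = np.linspace(lo, hi, 200)               # acquisition minimized on a grid
    for _ in range(n_iter):
        H, Y = np.array(hs), np.array(ys)
        Sigma = kern(H[:, None], H[None, :]) + 1e-8 * np.eye(len(H))
        Kg = kern(grid[:, None], H[None, :])      # k-vectors for all grid points
        mu = Kg @ np.linalg.solve(Sigma, Y)
        var = 1.0 - np.sum(Kg * np.linalg.solve(Sigma, Kg.T).T, axis=1)
        sd = np.sqrt(np.clip(var, 0.0, None))
        acq = mu - beta * sd                      # lower confidence bound
        h_next = grid[np.argmin(acq)]             # next point to evaluate
        hs.append(h_next)
        ys.append(G(h_next))
    best = int(np.argmin(ys))
    return hs[best], ys[best]

h_best, y_best = bayes_opt(lambda h: (h - 2.0) ** 2, 0.0, 5.0)  # minimize on [0, 5]
```

The β parameter controls the exploration–exploitation trade-off discussed above: a larger β favors regions of large variance.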
Another idea is something slightly more wild, which goes in the same direction of thought as the trust-
region methods did. Namely, to create an approximate ‘image’ of our function around the current iterate
ht and then to minimize this ‘approximated image’ over a small set around ht – the trust region. This is
the most classical idea of model-based derivative-free optimization (model-based because we build a model of
what our function might look like).
Imagine that around the current iterate ht you sample or select deterministically p + 1 points h̄0 , . . . , h̄p ,
with h̄0 = ht , and for each of them you evaluate the corresponding function value G(h̄s ), s = 0, . . . , p. Then
you can use this ‘sample’ of points and the corresponding function value as a data-set to which you are trying
to fit a polynomial function that describes it as closely as possible (according to the metric of your choice).
The most common choices here are first-order polynomial (thus an imitation of the first-order Taylor
approximation), and the second-order polynomial (imitation of the second-order Taylor approximation). For
higher-order polynomials, one would need to sample too many points to obtain something reasonable and
the benefit diminishes.
Speaking formally, we are trying to fit a model
m(h) = Σ_{j=1}^{nα} αj φj (h),
where φj (h) are basis functions (monomials of degree at most 1 or 2), so that the relationship h → m(h)
mimics the relationship
hs → G(hs), s = 0, . . . , p,
which can be done using, for example, linear regression. Smart implementations of this algorithm select the
points hs in such a way that the computation of the optimal αj’s is as efficient as possible. For more details,
see [12]. Once the model is ready, we perform a trust-region step, as outlined in section 11.2.
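As a small illustration of the model-fitting step just described, a second-order model can be fitted through p + 1 = 3 sample points by least squares and then minimized over the trust region. The sample placement, the stepsize δ, and the one-dimensional setting are assumptions made for brevity:

```python
import numpy as np

def fit_quadratic_model(G, h_t, delta=0.5):
    """Fit m(h) = a0 + a1 h + a2 h^2 to samples at h_t - delta, h_t, h_t + delta."""
    hs = np.array([h_t - delta, h_t, h_t + delta])
    ys = np.array([G(h) for h in hs])
    Phi = np.vstack([np.ones(3), hs, hs ** 2]).T       # monomial basis values
    alpha, *_ = np.linalg.lstsq(Phi, ys, rcond=None)   # least-squares fit
    return alpha

def trust_region_step(G, h_t, delta=0.5):
    a0, a1, a2 = fit_quadratic_model(G, h_t, delta)
    # minimize the model over the trust region [h_t - delta, h_t + delta]
    cand = [h_t - delta, h_t + delta]
    if a2 > 0:
        cand.append(float(np.clip(-a1 / (2 * a2), h_t - delta, h_t + delta)))
    return min(cand, key=lambda h: a0 + a1 * h + a2 * h ** 2)
```

For a function that already is a quadratic, the model is exact, and the step moves straight toward the minimizer, truncated at the trust-region boundary.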
Exercise 8.2.
For the SVM problem in exercise 3.12, perform hyperparameter optimization using the methods
learned in this section, assuming that you are constructing regularized SVM steered by regularization
parameter C, and using Gaussian kernel with parameter γ, and using the accuracy on the validation
dataset as the quality measure.
For this, you need to write a function that performs k-fold cross validation using your earlier-
constructed SVM function, which will serve as your function G(h). For Bayesian optimization, you
can use the bayesian-optimization package in Python to which you only need to pass the domain for
(C, γ) and the function which is to be evaluated – the package will do the rest for you.
Plot the best-obtained model quality measure against the total number of function evaluations per-
formed.
8.5 Summary
HO is a standard and essential thing to do for any ML problem – without it, you are likely not to obtain a
useful ML tool. At the same time, tuning the hyperparameters is costly – for each attempt you make, you
need to train the entire ML model to evaluate the validation loss function.
Additionally, it is important that your HO tool is replicable – another person can obtain the same results
on the same data (if the same random number generator seeds are used in case of stochastic approaches).
For that reason, automated methods for HO are in demand. In this section, you have seen a broad overview
of such methods in decreasing order of their popularity.
It is possible to criticize each of the mentioned methods on the basis of the fact that they themselves
also require certain user-provided choices, such as the kernel used in BO, or the stepsize used in coordinate
search. Of course, a discussion like this can go on forever, as any approach will in the end require ‘some’
parameters. However, you need to keep the end goal in mind – it is obtaining the best possible generalization
performance of your ML tool. In this context, any approach is good that yields satisfactory results within an
acceptable time budget.
9 Linear unsupervised learning
In this section, we will discuss several unsupervised learning techniques; some of them have not been discussed
before, while others are reformulations of earlier techniques in terms of matrix factorizations. Since we have
already discussed several matrix factorization techniques in section 2, many ideas will look very familiar.
The MNIST data set contains a total of 70 000 images with 28 × 28 pixels. However, in
each image, many pixels are zero. Therefore, it is very likely that the data set can be well represented
in a space of lower dimension than R28×28.
(Of course, as already pointed out in section 2, gray-scale images are not stored as matrices of scalars
but as matrices of integers.)
Let us, for now, assume that a spanning set C is given and first discuss only the task of finding the optimal
weight vectors wk. One strategy to compute the wk is to formulate the following least-squares problem:
min_{w1,...,wK} g(w1, . . . , wK) with g(w1, . . . , wK) = (1/K) Σ_{k=1}^{K} ‖Cwk − xk‖². (9.2)
As we have discussed earlier, if the columns of C are linearly independent, there is a unique solution to
this problem. With respect to dimension reduction, the case of a linearly dependent spanning set is of
course less relevant, since the dimension could first be easily reduced by making the spanning set linearly
independent.
Linear autoencoder Let us now discuss eq. (9.2) in more detail. Since the individual terms of the sum
are independent of each other, they can be optimized independently, and we obtain the normal equations
as a condition for optimality:
C⊤Cwk = C⊤xk (9.3)
for k = 1, . . . , K; see also eq. (2.28). The weight vector wk is also denoted as the encoding of the data
point xk, and Cwk is denoted as the corresponding decoding. Note that, if
Cwk = xk,
that is, if the decoding coincides with the data point, we have found a minimizer for eq. (9.2).
In case the matrix C is semi-orthogonal, we have C⊤C = I, and eq. (9.3) simplifies to
wk = C⊤xk. (9.4)
Of course, we can always transform C into a semi-orthogonal matrix by orthonormalization; cf. section 2.4.
Substituting eq. (9.4) into eq. (9.1) yields the autoencoder formula
CC⊤xk = xk, (9.5)
which represents the process of encoding (wk = C⊤xk) and decoding (Cwk = CC⊤xk) in one formula. If C
is orthogonal, we have
CC⊤ = I
as well, and hence, encoding and decoding are inverse operations.
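A small NumPy sketch of eqs. (9.4) and (9.5); the data point is an assumed example, and C is made semi-orthogonal via a QR factorization, cf. section 2.4:

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.linalg.qr(rng.standard_normal((5, 2)))[0]   # semi-orthogonal: C^T C = I
x = C @ np.array([1.0, -2.0])                      # a data point in the span of C

w = C.T @ x          # encoding, eq. (9.4)
x_rec = C @ w        # decoding; CC^T x = x because x lies in the span of C
```

For points inside the span of C, encoding followed by decoding is lossless; for general points, CC⊤x is the orthogonal projection onto that span.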
In practice, as we already saw in section 2.7, the dimension of a data set can often be significantly reduced
without introducing a large error using the SVD; this is the case, for example, if the matrix is rank-deficient
or if there is some small noise in the data. Let us now discuss this from a different perspective. In
particular, instead of eq. (9.1), we are then interested in finding C ∈ Rn×m for a small m, such that
Cwk ≈ xk, k = 1, . . . , K. (9.6)
This means that we are trying to find an approximate spanning set of size m, or in other words, we try to
find an m-dimensional subspace which approximates the data well; cf. fig. 9.1. Following similar arguments
as before, a semi-orthogonal C yields an approximate autoencoder formula
CC⊤xk ≈ xk; (9.7)
cf. eq. (9.5). All the previous steps are clear based on what we have learned earlier. However, the discussion
so far was based on the assumption that an (initial) spanning set is given. In practice, this is typically not
the case. Therefore, the more interesting and more challenging task is to find a suitable spanning set C for
a given data set.
Let the number of spanning vectors m ≤ K be given. Our goal is then to find C ∈ Rn×m such that eq. (9.6)
is satisfied as well as possible. The most straight-forward strategy is to simply modify eq. (9.2) to optimize
for both the encodings and the spanning vectors at the same time:
min_{w1,...,wK,C} g(w1, . . . , wK, C) with g(w1, . . . , wK, C) = (1/K) Σ_{k=1}^{K} ‖Cwk − xk‖². (9.8)
Figure 9.1: Dimension reduction with an autoencoder with a varying number of vectors. Original image
taken from unsplash.com.
Here, in the derivation, we have again used the assumption that C is semi-orthogonal. However, in eq. (9.9),
we do not explicitly enforce orthonormality of the columns of C anymore. One can show that all minima
are actually semi-orthogonal matrices; see the following exercise.
Exercise 9.1. Orthogonality of autoencoders
Show that the minima of eq. (9.9) are all semi-orthogonal matrices.
An alternative way of seeing eq. (9.9) is as optimizing C with respect to the approximate autoencoder
formula eq. (9.7) using a least-squares formulation. The resulting matrix C is also denoted as a linear
autoencoder. One way of learning a linear autoencoder is by optimizing eq. (9.9) using a gradient descent
iteration.
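A minimal gradient descent iteration for learning the spanning vectors and the encodings jointly, in the spirit of eq. (9.8), might look as follows. The stepsize, initialization, iteration count, and the synthetic rank-2 data are assumptions; the constant factor 2/K of the gradient is folded into the stepsize:

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic data lying exactly in a 2-dimensional subspace of R^4
U = np.linalg.qr(rng.standard_normal((4, 2)))[0]
X = U @ rng.standard_normal((2, 20))            # data matrix, n x K

m, K = 2, X.shape[1]
C = rng.standard_normal((4, m)) * 0.5           # spanning vectors
W = rng.standard_normal((m, K)) * 0.5           # encodings
for _ in range(5000):
    R = C @ W - X                               # residual
    C -= 0.01 * R @ W.T                         # gradient step in C
    W -= 0.01 * C.T @ R                         # gradient step in W
loss = np.linalg.norm(C @ W - X) ** 2 / K
```

Since the synthetic data is exactly rank 2 and m = 2, the loss can be driven close to zero.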
Principal component analysis The solution of eq. (9.9) is generally not unique. A simple way of seeing
this is to multiply the minimizer C of eq. (9.9) by an orthogonal matrix, that is,
C̃ := CQ
with Q being orthogonal. Obviously, we get
g(C̃) = (1/K) Σ_{k=1}^{K} ‖C̃C̃⊤xk − xk‖² = (1/K) Σ_{k=1}^{K} ‖CQQ⊤C⊤xk − xk‖² = g(C),
since QQ⊤ = I.
Figure 9.2: Principal components of a data set in two dimensions.
To single out a particular solution, we first mean-center the data: let
µ = (1/K) Σ_{k=1}^{K} xk
and
x̂k = xk − µ, k = 1, . . . , K.
Then, the resulting data matrix reads
X = [x̂1 · · · x̂K],
and we consider its covariance matrix:
Definition 9.1. Covariance matrix
Let X ∈ Rn×K be a mean-centered data matrix. Then, the covariance matrix of X is given by
CXX = (1/K) XX⊤.
This matrix is symmetric, and in order to find the directions of maximum variance, we can compute the
eigenvalue decomposition
(1/K) XX⊤ = CXX = V DV⊤,
with an orthonormal basis of eigenvectors V and D = diag(λi). These eigenvectors are also denoted as the
principal components. As we will now recall, these are not really new to us. Let us also consider the
SVD of the matrix X⊤,
X⊤ = Û ΣV̂⊤. (9.10)
Figure 9.3: Three-dimensional data with high variance within a plane and low variance orthogonal to the
plane.
Then,
(1/K) XX⊤ = (1/K) (Û ΣV̂⊤)⊤(Û ΣV̂⊤) = (1/K) V̂ Σ⊤ Û⊤Û Σ V̂⊤ = (1/K) V̂ Σ²V̂⊤,
since Û⊤Û = I. Comparing with the eigenvalue decomposition of CXX, we obtain
V = V̂,
which means that the eigenvectors in V are the same as the right singular vectors of X > , which are the same
as the left singular vectors of X.
Now, from the Eckart–Young theorem, theorem 2.11, we know that choosing the singular values with
maximum magnitude as well as the corresponding singular vectors gives us the best approximation of the
matrix in terms of the Frobenius norm. In PCA, we choose a number m < K of eigenvectors from V
as spanning vectors to obtain a lower-dimensional approximation of the columns of the data matrix X. Let
this matrix be denoted as Vm.
An m-dimensional representation of the rows of X can simply be obtained as
Um = X⊤Vm,
where Um = ÛmΣm, and Ûm and Σm are the matrices from the best rank-m approximation from the
Eckart–Young theorem; cf. eq. (9.10).
In conclusion, PCA is nothing other than computing the SVD of the mean-centered matrix corre-
sponding to a data set. Plugged into the autoencoder formula, we obtain
VmVm⊤X ≈ X.
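The PCA computation just described can be sketched in a few lines of NumPy; the synthetic data (with deliberately low variance in one coordinate) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 100))
X[2] *= 0.01                               # third coordinate carries little variance
X = X - X.mean(axis=1, keepdims=True)      # mean-centering

V, S, _ = np.linalg.svd(X, full_matrices=False)  # left singular vectors of X
m = 2
Vm = V[:, :m]                              # top-m principal components
X_rec = Vm @ (Vm.T @ X)                    # autoencoder formula Vm Vm^T X
err = np.linalg.norm(X_rec - X) / np.linalg.norm(X)
```

Because almost all variance lies in the first two coordinates, the rank-2 reconstruction error is tiny, in line with the Eckart–Young theorem.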
9.2 Recommender systems
Let us now consider a machine learning task, which we had already mentioned earlier in section 2.1. In
particular, let us consider the case where some of the values in the data set are missing. This could be the
case if measurement data is missing due to failure or if the data is generally still unknown. A very typical
example is the following one:
Example 9.2. Recommender systems
Consider a video-streaming platform which has n users and offers m different videos, movies or series,
for streaming; other examples would be any kind of online store. The information whether a user
has watched a movie or not could be encoded in terms of a matrix A ∈ {0, 1}^{n×m}, where aij = 1
corresponds to the case when user i has seen movie j and aij = 0 if not. For instance, A could
be of the form
1 0 0 · · ·
0 0 0 · · ·
1 1 1 · · ·
⋮ ⋮ ⋮ ⋱
As we realized, this matrix is actually sparse, which means that one could store only the nonzero
entries. The same could be true when storing the data for user ratings of the movies:
4 0 0 · · ·
0 0 0 · · ·
5 4 2 · · ·
⋮ ⋮ ⋮ ⋱
Here, 1–5 corresponds to a rating and 0 corresponds to the case where data is missing.
In practice, it would, of course, be desirable to fill those gaps using machine learning techniques, that
is, to predict the ratings of yet-unseen movies by a given user and, based on this, to recommend
movies. This means that the resulting matrix will actually be dense.
Let us consider a data set with missing entries. In this case, we formalize the set of indices for which
data is available as follows: for each data point xk, let Ωk denote the set of indices of the available entries
of xk. In fact, Ωk corresponds to a sparse binary column vector, which is a column of the sparse binary
matrix corresponding to the index pairs
Ω = {(i, k) : i ∈ Ωk, k = 1, . . . , K}
for all data points; this matrix would actually correspond to the first matrix in example 9.2.
In order to fill the missing entries of the data vectors x1, . . . , xK, we will use the concept introduced in
the previous section. In particular, we will try to generate an approximate spanning set for the available
data, that is,
Cwk ≈ xk, k = 1, . . . , K;
cf. eq. (9.6). However, we can only enforce a good fit for those entries where data is available. For all other
entries, we are missing a label, and the decoding is a pure prediction. Therefore, we consider the least-squares
problem eq. (9.8), but where we only evaluate the error for those entries that contain data:
min_{w1,...,wK,C} g(w1, . . . , wK, C) with g(w1, . . . , wK, C) = (1/K) Σ_{k=1}^{K} ‖{Cwk − xk}|Ωk‖². (9.11)
Note that the relations Cwk = xk, k = 1, . . . , K, can be written compactly as
CW = X,
which shows that all previous techniques are actually based on a matrix factorization of X with the weight
and data matrices
W = [w1 · · · wK] and X = [x1 · · · xK]. (9.12)
Let us now discuss how general the concept of matrix factorizations in unsupervised learning is.
Linear autoencoder and PCA Using W and X from eq. (9.12), eq. (9.8) can be rewritten as follows:
min_{W,C} g(W, C) with g(W, C) = (1/K) ‖CW − X‖F²; (9.13)
here, ‖·‖F is the Frobenius norm as defined in eq. (2.18). Note that, as we have mentioned earlier, this
minimization problem does not have a unique solution, such that regularization techniques are often used
to obtain matrices W and C with minimum norm. Therefore, for all the matrix factorization techniques
mentioned in this section, one typically optimizes the regularized loss function
min_{W,C} g(W, C) with g(W, C) = (1/K) ‖CW − X‖F² + λ‖C‖F² + λ‖W‖F², (9.14)
for some λ > 0 instead. Of course, other norms than the Frobenius norm can also be used in the regularized
problem. For the sake of brevity, we will concentrate on discussing the differences in the non-regularized loss
functions.
Exercise 9.2.
Show that g(w1 , . . . , wk , C) from eq. (9.8) and g(W, C) from eq. (9.13) are equivalent.
Hence, the goal of eq. (9.13) can be posed as computing a matrix factorization
CW ≈ X, (9.15)
where the ≈ can be understood in terms of the Frobenius norm. In case we prescribe the condition that the
matrix C is semi-orthogonal, we end up with a linear autoencoder.
Of course, as discussed in section 9.1, the PCA is also derived in terms of different matrix factorizations,
namely, by an eigen decomposition or singular value decomposition.
Recommender systems Similar to eq. (9.15), recommender systems can be understood as computing
the matrix factorization
{CW ≈ X}|Ω.
Here, {V}|Ω is the matrix extension of the masking operator {v}|Ωk introduced for vectors in section 9.2.
The resulting loss function is then given by
g(W, C) = (1/K) ‖{CW − X}|Ω‖F². (9.16)
As before, it turns out that this is just a reformulation of the loss function in eq. (9.11).
Exercise 9.3.
Show that g(w1 , . . . , wk , C) from eq. (9.11) and g(W, C) from eq. (9.16) are equivalent.
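A sketch of this masked factorization, solved by a gradient descent iteration on eq. (9.16); the synthetic rank-2 ‘ratings’, the mask density, the stepsize, and the iteration count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
# rank-2 ground-truth 'ratings' with roughly 70% of the entries observed
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 8))
M = rng.random(X.shape) < 0.7              # mask: True where data is available

C = rng.standard_normal((6, 2)) * 0.5
W = rng.standard_normal((2, 8)) * 0.5
for _ in range(5000):
    R = (C @ W - X) * M                    # residual only on observed entries
    C -= 0.02 * R @ W.T
    W -= 0.02 * C.T @ R
# fit on the observed entries; the unobserved entries of CW are the predictions
rmse_obs = np.sqrt(np.mean((C @ W - X)[M] ** 2))
```

Only the observed entries enter the loss; the decoded values at the masked positions are exactly the predicted ratings discussed in example 9.2.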
K-means clustering Interestingly, many unsupervised learning techniques can be written as a matrix
factorization problem. One other example from this lecture is the K-means clustering algorithm, which has
already been discussed in section 6. In particular, in K-means clustering, our goal is that the data points
assigned to a cluster are similar to the center of the cluster, that is,
ck ≈ xi ∀xi ∈ Ck, k = 1, . . . , K. (9.17)
Here, ck is the centroid of the kth cluster, and Ck the index set of all data points assigned to the kth cluster.
Now, we form a matrix C out of the ck and obtain that
Cek ≈ xi ∀xi ∈ Ck, k = 1, . . . , K,
which is equivalent to eq. (9.17). As before, ek is the kth canonical Euclidean basis vector.
In matrix notation, this reads
CW ≈ X,
where, again, we have combined all the data points xk into a single matrix. Furthermore, each column of W
has to be some canonical basis vector,
wk ∈ {ek}k=1,...,K.
Hence, the matrix factorization corresponding to the K-means clustering algorithm requires very specific
constraints on the matrix W.
This matrix factorization corresponds to the constrained optimization problem
min_{C,W} ‖CW − X‖F², where wk ∈ {ek}k=1,...,K. (9.18)
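Alternating between choosing the one-hot columns wk (assignment) and recomputing the centroids C (update) is exactly the K-means iteration from section 6. A compact NumPy sketch, with assumed synthetic data and a deterministic initialization:

```python
import numpy as np

rng = np.random.default_rng(4)
# two well-separated clusters around (0, 0) and (5, 5); X is the 2 x 40 data matrix
X = np.hstack([rng.standard_normal((2, 20)) * 0.1,
               rng.standard_normal((2, 20)) * 0.1 + 5.0])

n_clusters = 2
C = X[:, [0, -1]].copy()                      # one initial centroid from each cluster
for _ in range(10):
    # assignment step: column wk of W is the one-hot vector of the nearest centroid
    d = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)   # n_clusters x 40
    W = np.eye(n_clusters)[:, d.argmin(axis=0)]
    # update step: for fixed W, the centroids minimizing ||CW - X||_F^2 are the means
    C = (X @ W.T) / np.maximum(W.sum(axis=1), 1)
loss = np.linalg.norm(C @ W - X) ** 2 / X.shape[1]
```

The update step is precisely the least-squares solution of eq. (9.18) for fixed W, which reduces to per-cluster means.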
Algorithm 31: Block coordinate search with block size k.
Data: Function G(·), initial point h0.
Set t := 1.
while Stopping criterion not met do
    for Block i = 1, . . . , n/k do
        hblock := ht.
        for Points hnext ∈ {hblock ± κej, j = (i − 1)k + 1, . . . , ik} do
            if G(hnext) < G(hblock) then
                Set hblock := hnext.
        if G(hblock) < G(ht) then
            Set ht+1 := hblock and exit the for loop.
    if no block yielded G(hblock) < G(ht) then
        Change κ.
This can, for instance, be solved using a (block) coordinate search algorithm; cf. alg. 30. Even though a
simple coordinate search algorithm can be used, the block variant might be more efficient:
it is just a slight variation of the coordinate search algorithm wherein we look for the best optimization
step in blocks of coordinates. This way, we do not necessarily choose the next best coordinate to reduce G,
but we always look for the best choice within each block before stopping the iteration. Of course, we can
also choose the order of the coordinates randomly.
Sparse coding In K-means clustering, we have chosen each column of the matrix W to be a standard
basis vector. This ensures that each data point is assigned to exactly one cluster. A natural extension of
this is to allow each data point to be in more than one cluster. One may think of the following examples,
where it may be reasonable that certain data points belong to more than one cluster:
Example 9.3. Handwritten numbers
If, instead of just a single digit, we consider numbers from 0 to 99 as data points, there are various
ways of clustering them into overlapping clusters. For instance, based on whether they contain a
certain digit or not.
Example 9.4. Clustering images of faces
Similar to the previous example, clustering images of faces into overlapping clusters may yield
very reasonable results, for instance, considering the predominant color in the image and the size
of the face.
If we slightly modify eq. (9.18) and allow each data point to belong to at most S clusters, we obtain the
sparse coding problem [35]:
min_{C,W} ‖CW − X‖F², where ‖wk‖0 ≤ S, k = 1, . . . , K. (9.19)
The ‖·‖0 ‘norm’ used in this formulation denotes the number of nonzero entries of a vector. The name
sparse coding is linked to the structure of the matrix W: due to the limited number of nonzeros for
each data point, the matrix W is sparse, and its density is limited by the number of clusters a data point
can belong to, S.
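The ℓ0 constraint can be illustrated by a hard-thresholding projection that keeps only the S largest-magnitude entries of a code. This is an illustrative heuristic, not the sparse coding algorithm of [35]; it exploits that, for a semi-orthogonal C, the unconstrained least-squares code is C⊤x, cf. eq. (9.4):

```python
import numpy as np

def sparsify(w, S):
    """Project w onto {||w||_0 <= S}: keep the S largest-magnitude entries."""
    out = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-S:]
    out[idx] = w[idx]
    return out

rng = np.random.default_rng(5)
C = np.linalg.qr(rng.standard_normal((8, 4)))[0]   # semi-orthogonal spanning set
w_true = np.array([2.0, 0.0, -1.5, 0.0])           # a 2-sparse code
x = C @ w_true                                     # the corresponding data point

w = sparsify(C.T @ x, S=2)    # least-squares code C^T x, then projection
```

When the data point really has an S-sparse code with respect to C, this projection recovers it exactly.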
Nonnegative matrix factorizations Another very useful type of matrix factorization is given by the
nonnegative matrix factorization problem
min_{C,W} ‖CW − X‖F², where C, W ≥ 0, (9.20)
where the inequalities are meant element-wise, that is, all matrix entries are supposed to be nonnegative. Let
us briefly comment on a simple technique for solving this constrained optimization problem with so-called
box constraints. To optimize eq. (9.20), or the corresponding regularized problem, one can perform a
simple projected gradient descent method, where, after each step, all negative values are simply set to
zero; cf. section 5.2. As a result, the constraints are satisfied after each gradient step. Alternatively, the
problem can be solved using duality, as discussed in section 5.4.
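A sketch of the projected gradient descent just described; the synthetic nonnegative factors, the stepsize, and the iteration count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.random((6, 2)) @ rng.random((2, 5))   # a nonnegative rank-2 matrix

C, W = rng.random((6, 2)), rng.random((2, 5))
step = 0.02
for _ in range(5000):
    R = C @ W - X
    C = np.maximum(C - step * R @ W.T, 0.0)   # gradient step, then project to C >= 0
    W = np.maximum(W - step * C.T @ R, 0.0)
loss = np.linalg.norm(C @ W - X) ** 2
```

The projection np.maximum(·, 0) enforces the box constraints after every step, so the iterates are always feasible.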
Nonnegative matrix factorizations can be very helpful since, for certain examples, they are highly
interpretable. Consider the following example:
Example 9.5. Text classification
Consider this example taken from [1]: The following table lists the word counts for Lion, Tiger,
Cheetah, Jaguar, Porsche, and Ferrari (columns) in 6 different documents, which are the rows of the
table:
Lion Tiger Cheetah Jaguar Porsche Ferrari
Document 1 2 2 1 2 0 0
Document 2 2 3 3 3 0 0
Document 3 1 1 1 1 0 0
Document 4 2 2 2 3 1 1
Document 5 0 0 0 1 1 1
Document 6 0 0 0 2 1 2
• In C, we can see how often the words from the different topics appear on average in each of the
documents.
• In W , we can see which words belong to which topic.
(Note that this is an idealized situation in which the entries of C and W are not only nonnegative
but also integer valued. However, also in a less idealized setting, the factorization remains better
interpretable compared with a general factorization.)
This concept can very well be incorporated into recommender systems. Nonnegative matrix factorizations
can then be employed to encode whether, for instance, a movie includes action, humor, etc. This can be used
to group movies into certain categories or to identify groups of users which have the same taste with respect
to certain elements of movies.
Figure 10.1: (Panels: 1 line segment, 10 line segments, 100 line segments.)
10 Neural networks
10.1 Feedforward neural networks
Let us consider a general supervised learning task, that is, we want to fit a function F : Rn → Rm to map
given input to corresponding output data,
I = {x1 , . . . , xN } , O = {y1 , . . . , yN } ,
with xi ∈ Rn and yi ∈ Rm, for i = 1, . . . , N. More precisely, we want the function F to satisfy
F (xi) = yi
as well as possible, with respect to some error norm. As we have already discussed in section 5.4.2, it may
be difficult to construct a linear map in case the data is actually following nonlinear relations. In the kernel
trick in SVMs, we lift the data into a higher dimensional space using a nonlinear map, with the hope that
the separation of the data into classes by a linear model is easier in that representation. There, we therefore
had to select a specific kernel function. Neural networks will provide a framework which automatically learns
nonlinear relations and can therefore be seen as more flexible.
In order to understand the concept, let us consider the two-dimensional data set depicted in fig. 10.1.
Obviously, we cannot fit a simple linear classification model to separate the two classes (red and blue);
instead, a logistic regression with cross-entropy loss function has been used.
Example 10.1. Logistic regression with cross-entropy loss
The cross-entropy is a measure from information theory for the similarity of two probability dis-
tributions. Hence, it can be used as a measure for the quality of a model by comparing the
discrete probability distribution of the predictions of a trained model with the distribution of the labels
in the data itself. For a binary classification problem, the cross-entropy loss is given by:
min_{w,b} − Σ_{i=1}^{N} [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ], (10.1)
where the model is
F (x) = σ(x⊤w),
Figure 10.2: Logistic regression with a piece-wise linear model. Compared with fig. 10.1, noise has been
applied to the data set. We can also notice the effects of overfitting when comparing with the decision
boundary for the original data in fig. 10.1.
Note that the true labels yi are either zero or one, yi ∈ {0, 1}, such that for each term in the sum
only
log (F (xi)) or log (1 − F (xi))
remains, measuring the deviation from the correct label; since we use eq. (10.2), the model output
will be in the interval [0, 1].
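A small numeric sketch of eq. (10.1); the toy data and weights are assumptions, and the bias b is omitted, as in the formula above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    """Binary cross-entropy loss for the linear model F(x) = sigmoid(x^T w)."""
    p = sigmoid(X @ w)                     # predicted probabilities in (0, 1)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 2.0], [-1.0, -1.0], [2.0, 0.5]])   # toy data, one row per xi
y = np.array([1.0, 0.0, 1.0])                           # labels in {0, 1}
loss = cross_entropy(np.array([1.0, 1.0]), X, y)
```

Only one of the two logarithmic terms is active per sample, exactly as noted above, since yi ∈ {0, 1}.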
To train the linear model, we optimize the coefficients w in
x⊤w.
In fig. 10.1, we have used a BFGS-type quasi-Newton method to optimize the parameters of the model;
see section 4.3.3. More precisely, we have used a limited-memory BFGS (L-BFGS) method. In fact, since
the data set is just two-dimensional, the model to be trained has just two parameters w1 and w2.
The decision boundary of the linear logistic regression model, that is, the hypersurface that partitions
the space into the two classes, as shown in fig. 10.1 (left), is naturally also a linear function. Therefore, it is
clear that a linear model is not sufficient. As a remedy, similar to the idea of SVMs, we could now introduce
a nonlinear mapping that allows us to have a nonlinear decision boundary. A simple extension of a linear
model would be to use piece-wise linear functions. Given a sufficiently high number of segments, we should,
in principle, be able to describe any continuous nonlinear relation with arbitrary precision; as you can
see in fig. 10.1 (middle and right), we can find a good nonlinear model in this way. Figure 10.2 shows the
corresponding results for noisy data, where training for a perfect fit is even more difficult.
Let us formalize this type of model: any piece-wise linear function p with at most M segments can be
written as follows:
p (x) = Σ_{i=1}^{M} ai α (bi x + ci), (10.3)
where
α(x) = max {0, x} (10.4)
is a simple piece-wise linear function, also called the rectified linear unit (ReLU) function; cf. fig. 10.3.
Figure 10.3: ReLU, sigmoid, and hyperbolic tangent activation functions.
1. Verify that every piece-wise linear function can be written as eq. (10.3).
2. Draw, for some exemplary piece-wise linear functions, the decomposition into functions of the
form ai α (bi x + ci ).
A nice property of the representation (10.3) is that the grid points of the piece-wise linear function are
implicitly given through the parameters ai, bi, and ci, for i = 1, . . . , M. They do not have to be chosen
a priori but can be automatically optimized when training the model.
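As a sketch of eq. (10.3), two ReLU terms already suffice to build the kink of the absolute-value function; the coefficients below are chosen by hand:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)   # eq. (10.4)

def p(x, a, b, c):
    """Piece-wise linear function p(x) = sum_i a_i relu(b_i x + c_i), eq. (10.3)."""
    return sum(ai * relu(bi * x + ci) for ai, bi, ci in zip(a, b, c))

# |x| = relu(x) + relu(-x): a kink at zero built from two ReLU terms
a, b, c = [1.0, 1.0], [1.0, -1.0], [0.0, 0.0]
```

Each additional term ai relu(bi x + ci) can introduce one further kink, which is why M terms suffice for M segments.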
With eq. (10.3), we have already introduced the most simple form of an artificial neural network
(ANN) or, for simplicity, just neural network (NN). Note that there are also other ways of introducing
neural networks, for instance, based on the biological motivation of modeling the brain as a network of
neurons; however, we will focus on our algorithmic approach.
Let us now discuss how to extend eq. (10.3) to obtain the definition of general feedforward neural
networks. Therefore, let us first note that eq. (10.3) can simply be extended to higher dimensions as follows:
P (x) = A α (Bx + c), (10.5)
where x, c, and P (x) are now vectors and A and B are matrices. Due to the matrix notation,
we also got rid of the sum in eq. (10.3). Moreover, the function α is now applied component-wise to the
vector (Bx + c). This also gives us some freedom to vary a dimension inside the model as long as all the
other dimensions stay compatible. In particular, with
x ∈ Rn, c ∈ Rk, P (x) ∈ Rm, A ∈ Rm×k, B ∈ Rk×n,
the dimensions are compatible, and k can be chosen freely, independent of the input and output dimensions
n and m. Furthermore, we see that α is essential for the nonlinearity of the model. If α was a linear
Figure 10.4: Logistic regression with model eq. (10.5) for varying k (columns: k = 1, 10, 100) and activation
functions α (rows: α = Id, ReLU, sigmoid, tanh) on the noisy data set.
function, then P(x) would just be a composition of linear functions, resulting in a linear function as well; cf. fig. 10.4 (first row), where we see that the model does not change when increasing k if we use a linear function (α = Id). In the context of NNs, α is also called an activation function.
So far, we chose α such that the resulting function is piece-wise linear; however, we could also choose other nonlinear functions for α. For instance, the sigmoid function eq. (10.2) or the hyperbolic tangent

tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})
are other common choices for activation functions; cf. fig. 10.3. See also fig. 10.4 for the decision boundaries of models with different activation functions. The most important property of all those functions is that they are nonlinear. Moreover, as we will see next, the activation functions appear many times in typical neural networks. Therefore, in terms of computational work, it is beneficial to have a function that is relatively simple and efficient to compute; moreover, the computation of the gradients for optimizing the neural network should be efficient as well.
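The three activation functions from fig. 10.3 are all cheap to evaluate component-wise, for instance in NumPy:

```python
import numpy as np

# Common activation functions; each is applied component-wise, and both the
# function and its derivative are inexpensive to evaluate.
def relu(t):
    return np.maximum(t, 0.0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def tanh(t):
    return np.tanh(t)  # (e^t - e^-t) / (e^t + e^-t)

t = np.array([-2.0, 0.0, 2.0])
print(relu(t))     # zero for negative inputs, identity otherwise
print(sigmoid(t))  # squashes into (0, 1)
print(tanh(t))     # squashes into (-1, 1)
```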
It has been shown that neural networks of the form eq. (10.5) are universal function approximators. As one example, we now give a concrete version of the universal approximation theorem for NNs with sigmoid activation, one hidden layer, and arbitrary width; we will shortly explain the notions of hidden layers and the width of a layer in detail.
Theorem 10.1. Universal approximation theorem (sigmoid)
Let I_n denote the n-dimensional unit cube [0, 1]^n, and let α be the sigmoid activation function. Then, finite sums of the form

P(x) = Σ_{i=1}^{M} a_i α(b_i^T x + c_i),   (10.6)

with a_i, c_i ∈ R and b_i ∈ R^n, are dense in C(I_n). In other words, given any f ∈ C(I_n) and ε > 0, there is a sum P(x) of the above form for which

|P(x) − f(x)| < ε   ∀x ∈ I_n.

Note that C(I_n) is the space of continuous functions on I_n.
This theorem was proven by Cybenko in [14] for a more general class of activation functions, of which the sigmoid function is an example. Further generalizations can be found, for example, in [26, 25]. In particular, it has been shown that the approximation property holds for a large class of activation functions and is mostly due to the architecture of the network itself. Note that eq. (10.6), of course, easily extends to the vector-valued case in eq. (10.5).
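The flavor of theorem 10.1 can be illustrated numerically. The following sketch fixes random inner parameters b_i and c_i and fits only the outer coefficients a_i by linear least squares; this fitting procedure, the target function, and all parameter values are our own choices for the illustration and not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Target function on [0, 1] and an evaluation grid.
f = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0.0, 1.0, 200)

M = 50                               # number of terms in the sum (10.6)
b = rng.normal(0.0, 10.0, size=M)    # random inner weights b_i
c = rng.normal(0.0, 10.0, size=M)    # random inner biases c_i

# Features alpha(b_i x + c_i); only the outer coefficients a_i are fitted.
Phi = sigmoid(np.outer(x, b) + c)    # shape (200, M)
a, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)

err = np.max(np.abs(Phi @ a - f(x)))
print(f"max error on grid: {err:.4f}")
```

Increasing M typically shrinks the error, in line with the density statement of the theorem.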
In order to arrive at the general definition of a DNN, we will now define the composition of functions of the form eq. (10.5). In particular, let x ∈ R^n be the input of the NN; then, an NN with L hidden layers is given by

h_1 = α(W_1 x + b_1),
h_{i+1} = α(W_{i+1} h_i + b_{i+1}),   for i = 1, . . . , L − 1,   (10.7)
y = W_{L+1} h_L.
The final vector y ∈ R^m is then the output of the neural network, and the other vectors h_i ∈ R^{n_i}, i = 1, . . . , L, are the states of the neurons in the hidden layers of the network. The matrices W_i ∈ R^{n_i×n_{i−1}} contain the weights of the neural network, and the vectors b_i ∈ R^{n_i} are often denoted as the biases of the neural network; see also fig. 10.5 for a schematic visualization of a DNN with two hidden layers. The final layer, the output layer, is linear, that is, it only corresponds to the multiplication with the matrix W_{L+1} ∈ R^{m×n_L}; however, one may also add a bias vector to the last layer.
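A minimal sketch of evaluating eq. (10.7) in NumPy, assuming the weight matrices and bias vectors are given as lists; the dimensions in the example are arbitrary:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def forward(x, weights, biases, alpha=relu):
    """Evaluate the feedforward network (10.7): L hidden layers followed by a
    linear output layer. `weights` holds L+1 matrices, `biases` L vectors."""
    h = x
    for W, b in zip(weights[:-1], biases):
        h = alpha(W @ h + b)      # hidden layer: affine map plus activation
    return weights[-1] @ h        # linear output layer (no bias, as in (10.7))

# Tiny example: n = 3 inputs, two hidden layers of width 4, m = 2 outputs.
rng = np.random.default_rng(1)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=4), rng.normal(size=4)]
y = forward(rng.normal(size=3), weights, biases)
print(y.shape)
```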
A network is called dense if the weight matrices are dense. There are also network architectures which use a lower number of weights, such that the matrices W_i are sparse matrices. Moreover, even though we
Figure 10.5: Dense feedforward neural network with four inputs, one output, and two hidden layers with five neurons each.
Figure 10.6: Logistic regression with a neural network model with 1 layer (left), 2 layers (middle), and 3 layers (right) of 5 neurons each on the noisy data set.
kept the activation function α fixed, we can also vary it from layer to layer, that is, taking α_1, . . . , α_L instead of α. For the sake of simplicity, we restrict the discussion largely to the case that α is the same for all layers.
The number of layers L of the network is also denoted as its depth, and the number of neurons within a layer is denoted as the width of the layer. The universal approximation theorem, theorem 10.1, describes the approximation properties for the case of depth 1 and arbitrary width. Universal approximation properties can also be shown for fixed width but arbitrary depth; see, for example, [30]. Training deep neural networks is often also denoted as deep learning. There is no uniform definition of the term deep, but deep learning usually starts at around 3 or 4 hidden layers. However, modern networks can easily have tens or more than one hundred layers. In fig. 10.6, we can see that increasing the number of layers of the network has a similar effect as increasing the number of neurons, which corresponds to increasing the number of line segments in fig. 10.2.
Defining the layer functions

F_i(x) := α(W_i x + b_i),   i = 1, . . . , L,

we have

NN^α_{W,b}(x) = W_{L+1} F_L ∘ · · · ∘ F_1(x).
Figure 10.7: Visualization of the loss landscapes of two different neural networks. The loss landscape may be
highly non-convex. Images taken from https://github.com/tomgoldstein/loss-landscape. See also [29]
for more details.
Then, let

I = {x_1, . . . , x_N},  O = {y_1, . . . , y_N}

be given input and output data. Training a neural network then generally corresponds to solving the following general type of optimization problem:

min_{W,b} Σ_{i=1}^{N} L(NN^α_{W,b}(x_i), y_i),   (10.8)
where L is a loss function penalizing deviations of the model output NN^α_{W,b}(x_i) from the correct labels y_i. In that sense, it is usually a type of distance function, such as the cross entropy in eq. (10.1) or the mean squared error (MSE)

L_MSE(NN^α_{W,b}(x_i), y_i) = (1/N) ‖NN^α_{W,b}(x_i) − y_i‖_2^2,

which we had already seen earlier. Another typical variant is the mean absolute error (MAE)

L_MAE(NN^α_{W,b}(x_i), y_i) = (1/N) ‖NN^α_{W,b}(x_i) − y_i‖_1.
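The two loss terms, with the factor 1/N inside each summand as written above, can be sketched as:

```python
import numpy as np

def mse(pred, y, N):
    """Per-sample MSE term as above: (1/N) * ||pred - y||_2^2."""
    return np.sum((pred - y) ** 2) / N

def mae(pred, y, N):
    """Per-sample MAE term: (1/N) * ||pred - y||_1."""
    return np.sum(np.abs(pred - y)) / N

pred, y = np.array([1.0, 2.0]), np.array([0.0, 4.0])
print(mse(pred, y, N=10))  # (1 + 4) / 10 = 0.5
print(mae(pred, y, N=10))  # (1 + 2) / 10 = 0.3
```

Summing these terms over all i = 1, . . . , N as in eq. (10.8) then yields the mean over the data set.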
As discussed in section 3, convex optimization is generally much easier than non-convex optimization. Therefore, of course, our hope is that the loss function is convex with respect to the parameters W and b of the neural network. Unfortunately, as can be seen for two examples in fig. 10.7, this is generally not the case. In particular, depending on the activation functions, the depth of the network, and the widths of the hidden layers, optimizing a neural network may correspond to a highly non-convex optimization problem.
The most common techniques for training a neural network are variants of the stochastic gradient descent
(SGD) method (sections 3.5 and 4.2) and quasi-Newton methods (section 4.3.3). In particular, the Adam
gradient descent and L-BFGS quasi-Newton methods are very popular for training neural networks. In the
next paragraphs, we will discuss some important aspects for the optimization of neural networks.
Parameter initialization The convergence of the optimization schemes depends significantly on the initial
guess for the weight matrices W and bias vectors b. In particular, due to the complex landscapes of typical
loss functions of neural networks (fig. 10.7), a bad initial choice may result in slow convergence or even
divergence of the optimization scheme. On the other hand, due to the high complexity of (deep) neural
networks, it is generally challenging to come up with good initialization strategies. In particular, a good
initialization strategy may strongly depend on the network architecture used. For a more detailed discussion,
see, for instance, [21, Section 8.4].
Let us here discuss a few commonly used heuristic strategies. A first approach is to sample the weights in the i-th layer, that is, the coefficients of W_i, from the uniform distribution

U(−1/√n_i, 1/√n_i),
where n_i is the number of inputs of the layer. In [20], Glorot and Bengio suggest a slightly different approach, which also takes into account the number of outputs of the layer, n_{i+1}. They call it a normalized initialization:

U(−√(6/(n_i + n_{i+1})), √(6/(n_i + n_{i+1}))).
Even though this formula has been derived based on very strong assumptions, which are generally not satisfied
for neural networks, the popularity of the approach shows that the strategy also works well in practice. In
particular, the formula is based on the assumption that the network corresponds to a composition of linear
maps, that is, just the multiplication of the weight matrices.
In [42], Saxe et al. suggest to initialize with random orthogonal matrices. Then, in order to account
for the nonlinearity in each layer, the weights are scaled with a factor g (also denoted as gain factor). In
fact, if the gain factor is chosen appropriately, this allows for training very deep networks, even if the weight
matrices are not orthogonal. If it is not chosen appropriately, the output and the gradients of very deep
networks can deteriorate to be either very large or almost zero.
Another strategy is the sparse initialization described by Martens in [31], where all neurons are initialized with a fixed number of nonzero weights; the other weights of the neurons are initialized as zero. In practice, this means that, at initialization, the matrices W_i have a fixed number of nonzero entries per row.
The strategy for initialization of the bias vectors has to be compatible with the strategy for the weights.
As it turns out, a simple initialization with zero is compatible with most weight initialization schemes. One
counter example is the case of output data which is not mean centered. Then, one may want to add a
bias vector to the output layer to account for this; in eq. (10.7), we had defined the output layer without a
bias vector. In particular, one may initialize the bias in the output, depending on the initialization of the
remaining parameters of the network, to fit the marginal statistics of the output on the training set.
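The normalized initialization from above can be sketched as follows; the layer sizes in the example and the zero bias initialization are illustrative choices.

```python
import numpy as np

def normalized_init(n_in, n_out, rng):
    """Normalized ("Glorot") initialization: sample the weights from
    U(-sqrt(6/(n_in + n_out)), +sqrt(6/(n_in + n_out)))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = normalized_init(784, 128, rng)   # e.g. a layer with 784 inputs, 128 outputs
b = np.zeros(128)                    # zero biases are compatible with this scheme
print(W.shape)
```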
Mini-batch optimization As can be seen in eq. (10.8), the loss function for neural networks is typically separable, that is, we have one additive term for each data point. In optimizing neural networks, mini-batch optimization is typically used. Mini-batch optimization is a variant of the stochastic gradient descent method, where in each gradient step one random term of the sum

min_{W,b} Σ_{i=1}^{N} L(NN^α_{W,b}(x_i), y_i)

is selected, and only the corresponding gradient

∇_{W,b} L(NN^α_{W,b}(x_i), y_i)   (10.9)

is computed;
cf. section 3.5. Only this gradient is then used to perform the update in the gradient descent method. In the
next step, another random term from the remaining ones is chosen, until the gradient for each term in the sum (that is, for each data point) has been used once; the number of iterations required to cover the whole sum eq. (10.8) is also denoted as one epoch. The main argument for the feasibility of this approach is that, for a large data set, the expectation of a gradient update remains the same. This approach has two major advantages:
• The computation of the gradients and, as a result, each gradient step becomes cheaper.
• The convergence becomes more robust against getting stuck in local minima. This is because a local minimum of eq. (10.8) might not be a local minimum of a single term

L(NN^α_{W,b}(x_i), y_i)

anymore. Therefore, it is possible to escape local minima, which might not be the case for the classical gradient descent method.
On the other hand:
• One epoch of SGD is significantly more expensive than one epoch of the classical gradient descent
method, where an epoch is the same as a single iteration.
• Since, in each step, not the whole loss function is used, convergence can also be slower compared to
the classical gradient descent method.
The classical and the stochastic gradient descent methods are extreme cases of mini-batch gradient descent methods. In this approach, the index set {1, . . . , N} of the data points is first partitioned into disjoint subsets of cardinality k; the sets are denoted as batches, and the cardinality k is denoted as the batch size. Let us denote those subsets as B_1, . . . , B_K. Then, in the j-th step of mini-batch SGD, we use the gradient

Σ_{i∈B_j} ∇_{W,b} L(NN^α_{W,b}(x_i), y_i).

K iterations of mini-batch SGD yield one epoch. Mini-batch SGD with full batches simply corresponds to the gradient descent method, and mini-batch SGD with batch size one corresponds to SGD.
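The partition into batches B_1, . . . , B_K can be sketched as follows; `params`, `grad`, and the learning rate in `run_epoch` are hypothetical placeholders standing in for the network parameters and the batch gradient of eq. (10.9).

```python
import numpy as np

def minibatch_indices(N, k, rng):
    """Shuffle the indices of the N data points and partition them into
    disjoint batches B_1, ..., B_K of size k (the last one may be smaller)."""
    idx = rng.permutation(N)
    return [idx[s:s + k] for s in range(0, N, k)]

def run_epoch(params, grad, N, k, lr, rng):
    """One epoch of mini-batch SGD (sketch): `grad(params, batch)` is assumed
    to return the summed gradient over the batch."""
    for batch in minibatch_indices(N, k, rng):
        params = params - lr * grad(params, batch)
    return params

rng = np.random.default_rng(0)
batches = minibatch_indices(N=10, k=3, rng=rng)
print([len(b) for b in batches])  # [3, 3, 3, 1]
```

Note that every index appears in exactly one batch, so each data point contributes exactly once per epoch.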
Forward and backward propagation In order to optimize neural networks via gradient descent or quasi-Newton methods, several computational steps are necessary. First, in each iteration, it is necessary to evaluate the neural network and the loss function for given input data. This can easily be done by going through the network based on the scheme in eq. (10.7); going through the network from input to output is also denoted as the forward propagation.
Then, in order to compute the update, we have to compute gradients of the loss function eqs. (10.8) and (10.9) with respect to the network parameters W and b. On the one hand, this could potentially be difficult because neural networks are highly nonlinear and can have a large number of parameters, which appear in different layers of the network. However, as we have seen before, neural networks are also built of very simple elementary building blocks; cf. eq. (10.7). In particular, they are just compositions of affine linear maps and nonlinear activation functions, and for a single network layer, it is quite simple to compute the derivatives of the output of the layer with respect to the parameters of the layer. The propagation through multiple layers can then be performed using simple derivation rules.
Exercise 10.2.
1. For a neural network with a single layer and an activation function of your choice, derive formulae for the derivatives of the output

y = α(W x̃ + b)

with respect to the network parameters W and b.
2. Derive formulae for computing the derivatives of an MSE loss.
Exercise 10.3.
Implement the neural network from exercise 10.2, its derivatives, and a gradient descent algorithm
to optimize the network parameters for a given data set.
For deep and wide neural networks, the computation of the gradients can result in high computational costs if performed in a naive way. The backpropagation algorithm is a special case of automatic differentiation, which also performs the computation of the gradients based on simple derivation rules. However, the computations are arranged in such a way that they can be performed very efficiently.
In section 10.3, we will discuss the three necessary steps in the optimization of neural networks: the construction of the computational graph of the neural network, the forward propagation, and the backward propagation.
Data normalization and batch-normalization Figure 10.8 nicely shows the effect of data normalization
on a simple two-dimensional linear regression with MSE loss. In particular, starting with some unnormalized
data set, we obtain a loss function which is a very stretched convex function; see also example 3.2. As we
can see, the resulting gradient iteration converges very slowly. In particular, after 100 iterations, we still
have not converged to a good data fit.
A simple standard normalization of the data improves the optimization significantly. In particular, for each feature, we first compute the mean

μ_j = (1/N) Σ_{i=1}^{N} (x_i)_j

and the standard deviation

σ_j = √( (1/N) Σ_{i=1}^{N} ((x_i)_j − μ_j)^2 )

for j = 1, . . . , n. Then, we transform each data point x_i to have zero mean and standard deviation one by

(x̂_i)_j = ((x_i)_j − μ_j) / σ_j   ∀j = 1, . . . , n.
This requires that σ_j ≠ 0, which we can assume since, otherwise, the feature is constant for all data points, such that we could simply remove it.
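The standardization above can be sketched as follows (using the population standard deviation, that is, the factor 1/N inside the square root); the badly scaled example data is an illustrative assumption.

```python
import numpy as np

def standardize(X):
    """Transform each feature (column) of X to zero mean and unit standard
    deviation; features with sigma_j = 0 are constant and should be removed."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)          # population std: 1/N inside the square root
    return (X - mu) / sigma, mu, sigma

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[0.1, 100.0], size=(1000, 2))  # badly scaled features
X_hat, mu, sigma = standardize(X)
print(np.allclose(X_hat.mean(axis=0), 0.0), np.allclose(X_hat.std(axis=0), 1.0))
```

Since mu and sigma are returned, the transformation is completely invertible, as noted above.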
The positive effect of this simple transformation, which is completely invertible, can be seen in fig. 10.8.
In particular, due to the much more favorable shape of the loss function of the least-squares problem, the
gradient descent iteration yields a much better fit within just 20 iterations.
Since the neural network is a composition of models, it could be beneficial for the optimization of each
layer if the input of the layer would be standard normalized as well. We can only perform the normalization
of the input of the first layer before the training. Since the inputs of the deeper layers depend on the
outputs of the corresponding previous layer, it cannot be performed before the optimization. In particular,
the outputs of the previous layer may change significantly during the optimization.
Therefore, in the batch-normalization approach, the normalization is performed on-the-fly for each batch in the mini-batch optimization. In particular, we append the normalization step to each layer in eq. (10.7). Therefore, let x̃_1, . . . , x̃_k be the input vectors of the l-th layer for some batch of the optimization. Applying the l-th layer yields

ỹ_i = α(W_l x̃_i + b_l),
Figure 10.8: Comparing 100 gradient descent steps on a data set without normalization and 20 gradient descent steps on the same data set with normalization. Visualization using the code from https://github.com/jermwatt/machine_learning_refined. See also [48].
for i = 1, . . . , k. Without batch normalization, these vectors would be directly used as inputs for the next layer. Instead, we normalize each ỹ_i as before:

(ŷ_i)_j = ((ỹ_i)_j − μ_j) / σ_j   ∀j = 1, . . . , n_l,

where

μ_j = (1/k) Σ_{i=1}^{k} (ỹ_i)_j   and   σ_j = √( (1/k) Σ_{i=1}^{k} ((ỹ_i)_j − μ_j)^2 )

are computed over the current batch.
Hyperparameter optimization One of the strengths of neural networks is their flexibility. In fact, they can be applied to different types of data sets of varying complexity, to classification and regression problems, and they can even be employed in unsupervised learning; see, for instance, section 10.4. At the same time, neural networks have a large number of hyperparameters, including:
• Network architecture: depth of the network, width of the layers, choice of activation function(s); see
also section 10.4 for some famous examples of network architectures
• Optimization method, such as variants of gradient descent or quasi-Newton methods; these methods
themselves introduce additional hyperparameters
• Weight initialization (as discussed before)
• Regularization techniques: norm regularization, dropout ([45]), early stopping (trying to stop the
optimization before overfitting), etc.
• ...
The most common way of hyperparameter optimization is grid-search with k-fold cross-validation. Al-
ternative approaches have already been discussed in section 8.
These steps will help us to efficiently evaluate a function as well as compute the derivatives of the function with respect to the inputs. Note that the forward and backward propagation as described here is also denoted as the reverse mode of automatic differentiation, which is particularly efficient for network-type functions; for more details on this, see, for instance, the discussion in [48, Appendix B].
We will discuss these concepts for simple examples of nonlinear functions. However, the concepts are very general and can be applied to complex functions, for instance, neural networks.
Computational graph Any function which is given by an algebraic expression can be expressed as the composition of elementary operations. Let us start with a very simple example: the function f with

f(x) = ax + b

can be decomposed into the elementary operations g(x) = ax and h(y) = y + b, such that

f(x) = h(g(x)).
These operations can be organized in a computational graph, which tracks the order in which these operations are performed. By doing so, it also helps to reveal data dependencies and parallelism of computations. Moreover, it can be used to optimize computations. Let us discuss this based on some simple examples:
Example 10.2. Computational graph with a single input
Let us consider the function

f(x) = sin(x)^2 cos(x) + e^{cos(x)}.

(Computational graph: x feeds into a sin(·) node and, twice, into a cos(·) node; the sine output is squared; the square is multiplied with one cosine; the exponential of the other cosine is added to the product.)
Now, we can go through the graph from left to right, layer by layer:

1. a = sin(x),  b = cos(x),  c = cos(x)
2. d = a^2,  e = e^c
3. f = d · b
4. g = f + e

First of all, we notice that all operations in each step can be carried out in parallel. Moreover, we notice that we have some redundancy in our computations. In particular, we could optimize the graph by a slight rearrangement:
(Optimized computational graph: the node cos(·) is computed only once, and its output is shared by the multiplication × and the exponential e^{(·)} feeding into the final addition +.)
Once the computational graph for a function has been set up, it can be reused and combined with other graphs to build more complex compositions of functions:
Example 10.3. Composing computational graphs and multiple inputs
Let us consider the function

g(x_1, x_2) = f(x_1) − f(x_2),

where f is defined as in example 10.2. We obtain a computational graph in which x_1 and x_2 each feed into an f(·) node, and the two results are subtracted. Here, each f(·) node of the graph corresponds to the computational graph in example 10.2. We obtain:

1. a = f(x_1),  b = f(x_2)
2. c = a − b
Exercise 10.4.
Create the computational graph for a single-layer neural network of the form eq. (10.5) with two inputs, three neurons, a single output, and a generic activation function α.
Note that, for a function with a fixed structure, such as a neural network with a fixed number of layers and neurons within the layers, the computational graph has to be computed only once. Even if the parameters of the network are optimized, the computational graph will remain the same.
Forward propagation After we have created the computational graph of our function, let us discuss the next step, which is the forward propagation. In this step, we sweep through the computational graph from left to right and compute the values in the nodes as well as the partial derivatives of each node of the computational graph with respect to its inputs; in terms of the graph, we compute the partial derivatives of each child node with respect to its parent nodes. By doing so, instead of directly computing the derivatives with respect to the inputs, we can omit computing and storing a lot of unnecessary partial derivatives.
For instance, consider example 10.3, where

∂a/∂x_2 = 0   and   ∂b/∂x_1 = 0.
Moreover, in neural networks, we have to compute the derivatives of the loss with respect to all network
parameters. As you will notice in exercise 10.8, once we have computed all the partial derivatives with
respect to the parent nodes, we have all information necessary to assemble all required derivatives.
Let us, again, discuss the forward propagation for a simple example.
Example 10.4. Forward propagation
Let us consider the function

f(x_1, x_2, x_3, x_4) = (x_1^2 + x_2^2 + x_3^2 + x_4^2) / 4

with 4 inputs. (Computational graph: each input x_i feeds into a squaring node (·)^2; the squares are added pairwise, the two partial sums are added, and the result is divided by 4.)
Now, let us propagate through the graph from left to right and compute both the evaluations and
the partial derivatives with respect to the corresponding inputs:
1. a = x_1^2,  ∂a/∂x_1 = 2x_1
   b = x_2^2,  ∂b/∂x_2 = 2x_2
   c = x_3^2,  ∂c/∂x_3 = 2x_3
   d = x_4^2,  ∂d/∂x_4 = 2x_4

2. e = a + b,  ∂e/∂a = 1,  ∂e/∂b = 1
   f = c + d,  ∂f/∂c = 1,  ∂f/∂d = 1

3. g = e + f,  ∂g/∂e = 1,  ∂g/∂f = 1

4. h = g/4,  ∂h/∂g = 1/4
In order to perform the backward propagation next, we store both the function evaluations and the
partial derivatives in each node of the computational graph; cf. example 10.5.
Exercise 10.5.
Take the graph from exercise 10.4 and perform the forward propagation.
Backward propagation The final step for computing the derivatives of a function with respect to its inputs is to sweep through the computational graph backwards, that is, from right to left. This is called the backward propagation. In this step, using elementary rules for computing derivatives, such as the chain rule, the product rule, and the linearity of the derivative, we can simply compute the derivatives of the outputs with respect to the inputs of the function. As mentioned before, this step relies heavily on the computations performed before in the forward propagation.
We continue example 10.4:
Example 10.5. Backward propagation
Let us consider the function

f(x_1, x_2, x_3, x_4) = (x_1^2 + x_2^2 + x_3^2 + x_4^2) / 4

from example 10.4, where we can also find its computational graph. Now, given the output h, we propagate backwards through the computational graph to compute the partial derivatives

∂h/∂x_1,  ∂h/∂x_2,  ∂h/∂x_3,  ∂h/∂x_4.
We get that

1. ∂h/∂g = 1/4

2. ∂h/∂e = (∂h/∂g)(∂g/∂e) = 1/4 · 1 = 1/4
   ∂h/∂f = (∂h/∂g)(∂g/∂f) = 1/4 · 1 = 1/4

3. ∂h/∂a = (∂h/∂e)(∂e/∂a) = 1/4 · 1 = 1/4
   ∂h/∂b = (∂h/∂e)(∂e/∂b) = 1/4 · 1 = 1/4
   ∂h/∂c = (∂h/∂f)(∂f/∂c) = 1/4 · 1 = 1/4
   ∂h/∂d = (∂h/∂f)(∂f/∂d) = 1/4 · 1 = 1/4

4. ∂h/∂x_1 = (∂h/∂a)(∂a/∂x_1) = 1/4 · 2x_1 = x_1/2
   ∂h/∂x_2 = (∂h/∂b)(∂b/∂x_2) = 1/4 · 2x_2 = x_2/2
   ∂h/∂x_3 = (∂h/∂c)(∂c/∂x_3) = 1/4 · 2x_3 = x_3/2
   ∂h/∂x_4 = (∂h/∂d)(∂d/∂x_4) = 1/4 · 2x_4 = x_4/2
Hence, computing the derivatives of the output with respect to any of the nodes in the graph results
from just multiplying the partial derivatives computed before in the forward propagation, using the
chain rule.
Example 10.5 shows nicely that, after creating the computational graph and performing the forward
propagation, computing all the derivatives just corresponds to a simple assembly of the previously computed
partial derivatives. Moreover, each step only involves adjacent nodes in the computational graph. These
aspects make the computations very efficient.
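The forward and backward sweeps of examples 10.4 and 10.5 can be written out directly in code:

```python
import numpy as np

def forward_backward(x):
    """Forward and backward propagation through the computational graph of
    f(x1,...,x4) = (x1^2 + x2^2 + x3^2 + x4^2) / 4 from examples 10.4 and 10.5."""
    x1, x2, x3, x4 = x
    # Forward sweep: node values and local partial derivatives.
    a, da = x1**2, 2 * x1        # da = da/dx1, and analogously below
    b, db = x2**2, 2 * x2
    c, dc = x3**2, 2 * x3
    d, dd = x4**2, 2 * x4
    e = a + b                    # de/da = de/db = 1
    f = c + d                    # df/dc = df/dd = 1
    g = e + f                    # dg/de = dg/df = 1
    h = g / 4                    # dh/dg = 1/4
    # Backward sweep: chain rule, right to left.
    dh_dg = 0.25
    dh_de = dh_dg * 1.0
    dh_df = dh_dg * 1.0
    dh_da, dh_db = dh_de * 1.0, dh_de * 1.0
    dh_dc, dh_dd = dh_df * 1.0, dh_df * 1.0
    grad = np.array([dh_da * da, dh_db * db, dh_dc * dc, dh_dd * dd])
    return h, grad

x = np.array([1.0, 2.0, 3.0, 4.0])
h, grad = forward_backward(x)
print(h)     # (1 + 4 + 9 + 16) / 4 = 7.5
print(grad)  # equals x / 2, as derived in example 10.5
```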
In the following exercises, you can apply this concept to neural networks step by step.
Exercise 10.6.
Take the graph and forward propagation from exercises 10.4 and 10.5 and perform the backward
propagation to compute all the necessary derivatives for performing a step of the gradient descent
method.
Exercise 10.7.
Implement the algorithm derived in exercises 10.4 to 10.6 and optimize the small neural network to fit a function f : R^2 → R based on noisy data. To this end, choose your own function f and a set of 100 data points x_1, . . . , x_100 ∈ [0, 1]^2, evaluate f at all data points (y_i = f(x_i)), and add some noise from U(−ε, ε) for a small ε to each function evaluation y_i.
Exercise 10.8.
Extend exercises 10.6 and 10.7 to deep networks with a variable number of layers.
Finally, note that, in contrast to the creation of the computational graph, the forward and backward propagation depend on the value of the input of the function. Therefore, during the optimization of a neural network, these steps have to be repeated in each step of the optimizer, even though the computational graph remains fixed.
Convolutional neural networks In order to motivate convolutional neural networks, let us first discuss what a discrete convolution is. To this end, let us consider the data matrix I, which could be the matrix representation of a gray-scale image, and a kernel matrix K; the kernel matrix is also called a filter. Then, the convolution of I with the kernel K is given by

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n).   (10.10)
Here, for simplicity, we use the notation S(i, j) instead of S_ij. The convolution operation is commutative, in the sense that the data matrix I and the kernel matrix K can be flipped. This yields:

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n).   (10.11)
In both cases, the sum is performed over all valid indices. Otherwise, we define the result of I (m, n) K (i − m, j − n)
or I (i − m, j − n) K (m, n), respectively, to be zero.
One often instead uses the equivalent cross-correlation operation, which is essentially a convolution without flipping the kernel:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n).
This is what is typically implemented in neural network libraries instead of the convolution introduced before.
Note that, in neural networks, the weights of the kernel are typically not chosen by the user but optimized
during training. Therefore, there is effectively no difference between the two operations. For simplicity,
we will therefore also focus on this operation and also use the terms convolution and cross-correlation
synonymously.
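A minimal sketch of the "valid" cross-correlation (no padding), with an illustrative image and kernel:

```python
import numpy as np

def cross_correlate(I, K):
    """'Valid' cross-correlation S(i,j) = sum_{m,n} I(i+m, j+n) K(m,n),
    i.e. a convolution without flipping the kernel; no padding."""
    p, q = K.shape
    rows = I.shape[0] - p + 1
    cols = I.shape[1] - q + 1
    S = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # Each output entry combines only a p x q neighborhood of I.
            S[i, j] = np.sum(I[i:i + p, j:j + q] * K)
    return S

I = np.arange(12.0).reshape(3, 4)   # a 3x4 "image"
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])         # a 2x2 kernel
S = cross_correlate(I, K)
print(S.shape)  # (2, 3): without padding, the output shrinks
```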
Let us consider a small example to understand the convolutional operation:
Example 10.6.
Let

I = [ a b c d
      e f g h
      i j k l ]

and

K = [ w x
      y z ].

Then, we obtain the following result from I ∗ K:

S = [ aw + bx + ey + fz   bw + cx + fy + gz   cw + dx + gy + hz
      ew + fx + iy + jz   fw + gx + jy + kz   gw + hx + ky + lz ]   (10.12)

Notice that each entry in S is computed only from neighboring coefficients in I. The kernel matrix is shifted over the matrix I to compute the entries of S. For instance, the second entry bw + cx + fy + gz again uses four neighboring entries of I, namely b, c, f, and g.
As can be seen in fig. 10.9, the choice of specific kernel matrices may help to identify certain features in an image, such as vertical and horizontal edges. The output of the convolution is also called a feature map,
Figure 10.9: Applying convolutions with the two kernel matrices

K = [ 0 0 0        K = [ 0 1 −1
      1 1 1              0 1 −1
     −1 −1 −1 ]          0 1 −1 ]

to the original image highlights horizontal and vertical edges, respectively.
and convolutions can be seen as a special technique for feature engineering. In doing so, only neighboring information is used; cf. eq. (10.12) in example 10.6. The size of the kernel matrix determines the locality of the convolutional operation.
Moreover, it directly follows from eqs. (10.10) and (10.11) that the convolution is a linear map with respect to I, that is,

(a · I_1 + b · I_2) ∗ K = a (I_1 ∗ K) + b (I_2 ∗ K),

for a, b ∈ R. As a result, a convolution can be written as a matrix multiplication

I ∗ K = K I,   (10.13)

with a suitable matrix K. Moreover, due to the locality of the convolutional operation visible in eq. (10.12), it becomes clear that the matrix K is sparse, and its sparsity is determined by the size of the kernel matrix K.
Exercise 10.9.
Derive a formula for the matrix K defined by eq. (10.13).
Exercise 10.10.
Visualize the sparsity pattern of K for a full matrix I of size 5 × 5 and kernel matrix sizes 1 × 1, 2 × 2,
and 3 × 3.
Another observation from example 10.6 is that the dimension of the matrix reduces when applying the convolution. This can be avoided by extending the matrix I by layers of zeros. This approach is also called padding; cf. fig. 10.10 for a one-dimensional example with padding, where the input and output vectors have the same length. Conversely, in order to obtain an even stronger reduction of the dimension, striding can be used, which means that the kernel is shifted by more than one entry in I for computing the next entry in S; see, for example, [21, Chapter 9].
Note that convolutional operations are not restricted to data arranged as a two-dimensional matrix. They can also be employed for data which has a tensor product structure of any other dimension; see fig. 10.10 for a one-dimensional example.
Now, a convolutional neural network is basically a network in which all (or some of the) weight matrices
are replaced by convolutional operations. As discussed above, convolutional operations just correspond to
Figure 10.10: One-dimensional convolution with padding: each entry in the output vector s is computed by at most three neighboring entries in the vector i. The colors red, green, and blue correspond to the three entries x, y, and z of the kernel vector k.
a special type of sparse matrix. In that sense, they are just a specialization of the networks described by eq. (10.7). In particular, with the same width of the layers as in a dense neural network, the number of trainable parameters is significantly lower, for two reasons:
• The matrix K resulting from a convolution with a kernel K is sparse.
• When applying a kernel matrix of size n × m, the number of trainable parameters for this convolution is exactly n · m, independent of the size of the input matrix I. This means that, compared with a standard dense neural network, many weights are shared between several neurons.
Due to this extreme reduction in the number of trainable parameters, it becomes feasible to use multiple
kernel matrices within each layer of a convolutional neural network. As a result, multiple feature maps can
be learned; cf. fig. 10.9, where one kernel highlights horizontal edges and another filter highlights vertical
edges. This also shows that, for complex images, it will be necessary to use multiple kernel matrices in
order to prevent loss of image information needed in the following layers. The different feature maps in
a convolutional layer are also called channels. Another example of channels are the intensities of the colors
red, green, and blue in the RGB format of a picture.
After the convolutional operations in a layer, as in dense neural networks (eq. (10.7)), we apply an
activation function to make the layer nonlinear. Finally, another important type of operation on neighboring
matrix entries is used, which gathers (statistical) information of nearby outputs. Layers of this type are
called pooling layers. Two typical examples are:
• Average pooling: computes the average value within each neighborhood of entries; average pooling
can also be written as a convolution.
• Max pooling: computes the maximum value within each neighborhood of entries; max pooling is
nonlinear and can therefore not be written as a linear convolution.
For more details, see, for example, [21, Section 9.4]. Pooling operations are generally employed after convolution
and activation, but they do not have to be employed after each convolutional layer.
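Both pooling variants can be sketched by applying a reduction over non-overlapping blocks; the helper pool2d below is our own, not notation from the notes:

```python
import numpy as np

def pool2d(S, size, op):
    """Apply op (np.max or np.mean) over non-overlapping size x size blocks of S."""
    n, m = S.shape[0] // size, S.shape[1] // size
    blocks = S[:n * size, :m * size].reshape(n, size, m, size)
    return op(blocks, axis=(1, 3))

S = np.array([[1.0, 2.0, 5.0, 6.0],
              [3.0, 4.0, 7.0, 8.0]])
max_pooled = pool2d(S, 2, np.max)    # max pooling:     [[4., 8.]]
avg_pooled = pool2d(S, 2, np.mean)   # average pooling: [[2.5, 6.5]]
```

Note that average pooling is a linear operation (and thus expressible as a convolution with stride), while max pooling is not.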
Whether the use of convolutional neural networks is feasible depends on the structure of the data. If
the data is arranged in a tensor product structure and there is a notion of neighboring entries, the use of
convolutional neural networks can be reasonable. If we have input data which cannot be structured in such a
way, the use of convolutional neural networks can be disadvantageous. However, convolutional operations can
also be defined on unstructured data with some adjacency structure. In the unstructured case, the adjacency
is given by a graph (section 2.8), and the corresponding type of convolutional neural networks is denoted as
graph convolutional neural networks; for the sake of brevity, we will not discuss this concept here.
183
Figure 10.12: Simple residual layer (left), residual layer with linear map (middle), and residual block ranging
over multiple hidden layers (right).
Figure 10.13: U-Net architecture with skip connections. Image taken from https://github.com/
HarisIqbal88/PlotNeuralNet. See [38] for the U-Net architecture. When neglecting the skip connections,
the U-Net has an autoencoder architecture with convolutional layers; cf. fig. 10.15.
Residual neural networks The development of residual neural networks (ResNets) arose from the
observation that the most successful neural networks for challenging (image) data sets are very deep networks;
see, for instance, the ImageNet image recognition challenge [40]. As already mentioned in section 10.2,
it is important to make sure that the network outputs and gradients neither become extremely large nor
vanish. In particular, it has been observed that, for very deep neural networks, the gradients tend to vanish.
For ReLU activation functions, this can be attributed to the fact that, for half of the cases, the gradient
of the ReLU function
α (x) = max {0, x}
is zero; cf. fig. 10.3. In particular, consider a single layer with
hi+1 = α (Wi+1 hi + bi+1 ) . (10.14)
Let W_{i+1} = (w_{kl})_{k,l}. Then, we can consider a single entry of the output vector

(h_{i+1})_j = α( ∑_{k=1}^{n_i} w_{jk} (h_i)_k + (b_{i+1})_j ),
Figure 10.14: Two data sets that can be approximated well by a one-dimensional encoding: linear (left) and
quadratic (right).
If the argument of the activation function is negative, the corresponding gradient is zero. In a residual
layer, the input of the layer is added to its output, h_{i+1} = α(W_{i+1} h_i + b_{i+1}) + h_i; cf. fig. 10.12 (left). This
reduces the chance for vanishing gradients significantly. However, this requires matching dimensions of the
two adjacent layers, that is, n_{i+1} = n_i. In case the dimensions do not fit, one can instead use a linear map
Ŵ_{i+1} with appropriate dimensions; cf. fig. 10.12 (middle).
Instead of adding the input to the output of each individual layer, a certain number of hidden layers can
be placed in between; cf. fig. 10.12 (right). This technique is used very often in practice; instead of a residual
layer, we then also speak of a residual block. The connection in the computational graph which takes the
input of a residual block and adds it to the output of the last hidden layer in the residual block is also called
a skip connection. This name emphasizes that the input skips all the hidden layers and is directly
propagated to the end of the residual block.
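The two residual-layer variants of fig. 10.12 can be sketched in a few lines of NumPy (assuming ReLU as the activation α; function and variable names are our own):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_layer(h, W, b, W_hat=None):
    """Residual layer: relu(W h + b) plus the skip connection.

    If input and output dimensions match, the input h itself is added
    (fig. 10.12, left); otherwise a linear map W_hat is applied to h
    first (fig. 10.12, middle)."""
    skip = h if W_hat is None else W_hat @ h
    return relu(W @ h + b) + skip

rng = np.random.default_rng(0)
h = rng.normal(size=3)

# matching dimensions (n_{i+1} = n_i): identity skip connection
W, b = rng.normal(size=(3, 3)), rng.normal(size=3)
out_same = residual_layer(h, W, b)            # shape (3,)

# non-matching dimensions: skip connection through a linear map W_hat
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=5)
W_hat = rng.normal(size=(5, 3))
out_diff = residual_layer(h, W2, b2, W_hat)   # shape (5,)
```

Even when relu(W h + b) has zero gradient, the skip connection still propagates the identity (or W_hat), which is what mitigates vanishing gradients.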
A very successful type of convolutional neural network which uses skip connections is the U-Net,
introduced by Ronneberger et al. in [38]. Even though originally introduced for medical image processing,
it is now frequently used for many different types of data with tensor product structure; see fig. 10.13
for a visualization.
Exercise 10.11.
Discuss the computational graph, forward propagation, and backward propagation for a residual layer
of a feedforward neural network in detail.
Nonlinear autoencoders Previously, in section 9.1, we introduced the linear autoencoder,
that is, a linear map C which encodes (via Cᵀ) and decodes (via C) in such a way that the data points of a data
set can be represented as well as possible,

C Cᵀ x_k ≈ x_k;

cf. eq. (9.7). The main reason for constructing a (linear) autoencoder is to reduce the dimension of the data
set, that is, if x_k ∈ Rⁿ and Cᵀ x_k ∈ Rᵏ, we want

k ≪ n.
If the data set follows approximately a linear relation, a linear autoencoder will perform well; see fig. 10.14
(left). If the data follows approximately a nonlinear relation, a linear autoencoder will perform poorly;
see fig. 10.14 (right). If the nonlinear relation is known, it might be possible to transform the data such that
the relation is approximately linear.
Figure 10.15: Nonlinear autoencoders based on a simple dense neural network: minimum number of layers
(left) and additional hidden layers (right).
Of course, the nonlinear relation is typically unknown. As we have discussed in section 10.1, neural
networks are well-suited for approximating nonlinear relations; moreover, the nonlinearity can be optimized
automatically during the training. Therefore, let us discuss briefly whether we can employ neural networks
to perform a nonlinear dimensionality reduction; this is an example of using neural networks to perform an
unsupervised learning task. To this end, consider the data set

{x_1, . . . , x_K}

with x_1, . . . , x_K ∈ Rⁿ. Now, we can reduce the dimension of the data set from n to k < n by training a
neural network

w = α(W_1 x + b_1),
y = α(W_2 w + b_2)    (10.16)

with x, y ∈ Rⁿ and w ∈ Rᵏ, such that

y_i ≈ x_i  ∀ i = 1, . . . , K.
See fig. 10.15 (left) for an example of a corresponding network architecture. The reduced dimensional space
is often also denoted as the latent space, and it is the space of the latent layer. As mentioned before, in
order to obtain a dimensionality reduction, we need that k < n; therefore, the latent layer is also denoted
as the bottleneck. More precisely, for the example of an MSE loss function, we train the network by solving
the minimization problem

min_{W_1, W_2, b_1, b_2} (1/K) ∑_{i=1}^{K} ‖α(W_2 α(W_1 x_i + b_1) + b_2) − x_i‖².
This minimization problem depends only on the data points x_1, . . . , x_K and does not require additional
labels. Therefore, training an autoencoder network is an unsupervised learning task. After training the
neural network, the encoding of x_i is

w_i = α(W_1 x_i + b_1).

If the MSE is small, we decode the encoding with

y_i = α(W_2 w_i + b_2),
Figure 10.16: Layer of a recurrent neural network. A neuron can also be connected with itself.
yielding a good approximation of x_i. Figure 10.15 (right) shows an example with additional
hidden layers, allowing for a higher degree of nonlinearity. The U-Net, which has been shown in fig. 10.13,
is based on a deep autoencoder network with convolutional layers; however, due to the skip connections,
the bottleneck does not necessarily learn an encoding of the input.
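A minimal sketch of the forward map (10.16) and the resulting MSE objective (with tanh standing in for the generic activation α; all names are our own):

```python
import numpy as np

def autoencoder_loss(X, W1, b1, W2, b2, alpha=np.tanh):
    """MSE loss (1/K) * sum_i || alpha(W2 alpha(W1 x_i + b1) + b2) - x_i ||^2."""
    W_lat = alpha(X @ W1.T + b1)   # encodings w_i in the latent space, shape (K, k)
    Y = alpha(W_lat @ W2.T + b2)   # reconstructions y_i, shape (K, n)
    return float(np.mean(np.sum((Y - X) ** 2, axis=1)))

K, n, k = 100, 4, 2                # K data points, bottleneck dimension k < n
rng = np.random.default_rng(0)
X = 0.1 * rng.normal(size=(K, n))  # small values: tanh is nearly linear here
W1, b1 = rng.normal(size=(k, n)), np.zeros(k)
W2, b2 = rng.normal(size=(n, k)), np.zeros(n)

loss = autoencoder_loss(X, W1, b1, W2, b2)
# training the autoencoder means minimizing this loss over W1, b1, W2, b2
```

The loss only uses the data points themselves, which makes the training an unsupervised learning task, as described above.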
Comment on recurrent neural networks Let us end this section on different neural network architectures
with a short comment on recurrent neural networks (RNNs). In particular, they allow for recurrent
interaction of neurons in a single layer. Therefore, they are particularly well-suited for time-dependent data,
where the neurons interact in chronological order; cf. fig. 10.16, from left to right.
Example 10.7. RNNs for time series
Let us again consider the example of time-dependent sensor data already shown in section 2.
[Plot: time-dependent sensor data s over time t.]
and used as input for a neural network. In the context of time-dependent data, the data at a
certain time s_i often depends on the data at previous times s_{i−1}, s_{i−2}, . . .. Very simple examples
are temperature measurements during the course of a day. Another example of temporal data are
stock prices. In those cases, it makes sense to represent the time-dependence by an RNN
architecture, which explicitly accounts for the connections of neurons in chronological order.
11 Optimization with constraint learning
11.1 Introduction
Most of this course has been devoted to (numerical) linear algebra and optimization techniques whose main
aim was to achieve a certain ML goal – to train a predictive or clustering tool. However, in real life even
ML is typically only a ‘servant’ of a certain bigger goal. What could such bigger goals look like?
For example, when we cluster, apart from purely technical applications such as image segmentation,
we typically want to do it in order to understand the group of data and, perhaps, to apply a customized
offer/treatment to a uniform group of patients, clients etc. Think of the different ways you can cluster people
in:
• public transport, to offer them suitable periodic ticket deals;
• retail, to identify different groups that choose various levels of the prestige-price tradeoff, and to
propose to each of them a different product line (it is a fact of life that many ‘different’ store chains are
owned by the same owner, and the main role of the differences is to offer people different tradeoffs
of price/prestige/sustainability of a given brand);
• healthcare, if you were to develop several different treatment schedules.
As you can see, in all these examples there are still decisions to be made after the clustering is done and the
clustering is only a ‘servant’ of these decisions.
What about supervised learning? Again, we make predictions in order to act upon them. Think of:
• classification model to judge which flights are most likely to be late – based on the probabilities, you
will be prioritizing the luggage belts or ground crews to certain parts of the airport to mitigate against
the delays
• regression model to measure the precise impact of a given treatment on a certain tissue in the body –
based on this you will be choosing between various treatments
• default/no default classification built to decide whether to grant someone a loan or not; regression
model to predict the sales price of a range of used cars.
Sometimes, the decisions you make on the basis of the ML models are going to be simple – to do something
or not to do it (grant a loan) or to simply rank certain decisions in the order of their making (prioritizing
various treatment schedules/teams at the airport). Then, there’s nothing too complicated to think about -
one can pick the best decisions (i.e. ’optimize’) almost by hand.
But sometimes, the decisions to be made on the basis of the ML model are complicated - they entail an
entire vector of decisions, which are related one to another in a nontrivial way. This means that in order to
pick the best possible decision, one needs a nontrivial tool. This nontrivial tool is optimization.
The goal is to introduce you to interesting situations in which first, an optimization model is used to
train an ML tool, and then, the ML tool is embedded inside an optimization problem which is to make the
’real decisions’ based on it.
As the goal set in this way can be arbitrarily complex, we need to ‘set the conditions in which it might
work’ or, to speak in military terminology, define the perimeter of the operation. If ML models, which can
be pretty complicated, are to be only a part of a bigger thing, then certainly the bigger thing to be solved
needs to be a reasonably solvable class of optimization problems.
In this way, the first part of this lecture will be a crash course on (mixed integer) linear programming,
which is by far the most scalable optimization technology, and which covers, volume-wise, probably 90% of
non-ML applications of optimization. The crash course is imported from an in-progress textbook [36].³ We
shall not cover the basic algorithms for solving these problems, but they are very simple and you can learn
about them, for example, in [7].
3 You can find the hopefully self-explanatory Jupyter notebooks at http://jckantor.github.io/MO-book/intro.html
The shorter, second part of the lecture, in turn, will be an attempt to show you how to combine this with
machine learning in two or three specific contexts. To make it concrete, it is mostly imported from the survey
[18], which illustrates two ‘really applied’ applications.
Using the modelling package Pyomo, this problem can be formulated in Python and solved using the
solver GLPK as follows.
import pyomo.environ as pyo

m = pyo.ConcreteModel('BIM')

m.x1 = pyo.Var()
m.x2 = pyo.Var()

# objective and material constraints of the BIM problem (example 11.1):
m.profit = pyo.Objective(expr=12 * m.x1 + 9 * m.x2, sense=pyo.maximize)
m.silicon = pyo.Constraint(expr=m.x1 <= 1000)
m.germanium = pyo.Constraint(expr=m.x2 <= 1500)
m.plastic = pyo.Constraint(expr=m.x1 + m.x2 <= 1750)
m.copper = pyo.Constraint(expr=4 * m.x1 + 2 * m.x2 <= 4800)

m.x1domain = pyo.Constraint(expr=m.x1 >= 0)
m.x2domain = pyo.Constraint(expr=m.x2 >= 0)

pyo.SolverFactory('glpk').solve(m)
Several things can be said about the above example. First, it was rather straightforward to model – the
choice of the decision variables was obvious and the constraints were easy to formulate. Secondly, it was easy
to find the optimal solution by hand. Third, it is evident to the ‘naked eye’ that the solution found is indeed
optimal.
Surprisingly, also much larger and seemingly more complicated problems can be modelled using linear
constraints only. For such problems, however, we are often not able to find the solution by hand, and
one typically cannot judge ‘by eye’ that a particular solution is an optimal one. To move on to working
confidently with real-life problems, we need to gain more knowledge about LP.
In the following sections, we shall expand on what we learned so far. First, we will extend our capabilities
of modelling various situations using linear constraints. Secondly, we will provide a formal definition of the
search for the optimality certificate of a given solution. In the end, we will explain intuitively how numerical
algorithms for LP problems work.
For some expressions, like the ones in example 11.1, it is clear that we can write them down using linear
functions. But there are important real-life objective functions and constraints for which it is difficult to
immediately see the same. At the same time, because LPs are by far the easiest problems to solve, a problem
should be expressed using linear constraints as long as this is possible. We will therefore provide a number of
useful LP modeling techniques.
x_i = x_i⁺ − x_i⁻,
|x_i| = x_i⁺ + x_i⁻,
x_i⁺, x_i⁻ ≥ 0.

It is easy to show that if, in a solution of the modified problem, either x_i⁺ or x_i⁻ is zero for every i, then
the problems with and without absolute values are equivalent.
Note that the same reasoning can be used to reformulate absolute values involving entire expressions
such as |x_1 − x_4|, and constraints such as
|x1 | + x2 ≤ 4,
but it cannot be used to do so when the coefficient in front of the absolute value is negative.
Example 11.2. Wine quality prediction
Physical, chemical, and sensory quality properties were collected for a large number of red and white
variants of the Portuguese ”Vinho Verde” wine and then donated to the UCI machine learning
repository; see [13]. The dataset consists of n = 1,599 measurements of 11 physical and chemical
characteristics plus an integer measure of sensory quality recorded on a scale from 3 to 8. Due to
privacy and logistic issues, there is no data about grape types, wine brand, wine selling price, etc.
The goal of the regression is to find coefficients m_j and b that minimize the mean absolute deviation
(MAD), that is,

min (1/n) ∑_{i=1}^{n} |y_i − ŷ_i|
s.t. ŷ_i = ∑_{j=1}^{J} x_{i,j} m_j + b  ∀ i = 1, . . . , n,
where xi,j are values of ’explanatory’ variables, in this case the 11 physical and chemical characteristics
of the wines.
General expressions such as (11.3) can be linearized by introducing an auxiliary variable z and setting

min z
s.t. c_kᵀ x ≤ z  ∀ k ∈ K.

This trick works because if all the quantities corresponding to different indices k ∈ K are below the auxiliary
variable z, then their maximum is also below z, and vice versa. Note that the absolute value function can
be rewritten as |x_i| = max{x_i, −x_i}; hence the linearization of an optimization problem involving absolute
values in the objective function is a special case of this.
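As an illustration of this linearization applied to a MAD regression like the one in example 11.2, the following sketch solves the resulting LP on synthetic data with scipy.optimize.linprog (instead of Pyomo); one auxiliary variable e_i ≥ |y_i − ŷ_i| is introduced per data point, and the true coefficients (2, −1) and intercept 0.5 are our own made-up choices:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, J = 40, 2
X = rng.normal(size=(n, J))                      # 'explanatory' variables x_{i,j}
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.05 * rng.normal(size=n)

# decision variables: [m_1 .. m_J, b, e_1 .. e_n]; minimize (1/n) sum_i e_i
c = np.concatenate([np.zeros(J + 1), np.ones(n) / n])

# e_i >= (x_i . m + b) - y_i  and  e_i >= y_i - (x_i . m + b)
A_ub = np.vstack([np.hstack([X, np.ones((n, 1)), -np.eye(n)]),
                  np.hstack([-X, -np.ones((n, 1)), -np.eye(n)])])
b_ub = np.concatenate([y, -y])

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (J + 1) + [(0, None)] * n)
m_hat, b_hat = res.x[:J], res.x[J]   # fitted coefficients and intercept
```

At the optimum each e_i equals |y_i − ŷ_i|, so the LP objective coincides with the MAD.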
min (cᵀx + α) / (dᵀx + β)
s.t. Ax ≤ b,
     x ≥ 0,
where the term d> x + β is either strictly positive or strictly negative over the entire feasible set of x.
Setting first t = (dᵀx + β)⁻¹ and then y_i = x_i t for every index i, we obtain the following equivalent
linear optimization problem:

min cᵀy + αt
s.t. Ay ≤ tb,
     dᵀy + βt = 1,
     t ≥ 0,
     y ≥ 0.
Note that the inequality for t should in fact be strict, i.e., t > 0, but in view of the assumption above for
d> x + β, having relaxed the constraint does not change the optimal solution.
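A tiny numerical sanity check of this transformation (our own example, solved with scipy.optimize.linprog): minimize (2x + 1)/(x + 1) subject to 0 ≤ x ≤ 3; since the objective is increasing in x, the optimum is x = 0 with value 1:

```python
from scipy.optimize import linprog

# original problem: min (2x + 1)/(x + 1)  s.t.  x <= 3,  x >= 0,
# i.e. c = 2, alpha = 1, d = 1, beta = 1, A = [1], b = [3]
# transformed variables (y, t):  min 2y + t
#   s.t.  y - 3t <= 0  (Ay <= tb),  y + t = 1  (d^T y + beta t = 1),  y, t >= 0
res = linprog(c=[2.0, 1.0],
              A_ub=[[1.0, -3.0]], b_ub=[0.0],
              A_eq=[[1.0, 1.0]], b_eq=[1.0],
              bounds=[(0, None), (0, None)])
x_opt = res.x[0] / res.x[1]   # recover x = y / t
# res.fun == 1.0 and x_opt == 0.0, matching the original fractional problem
```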
min_{x∈Rⁿ} cᵀx
s.t. Ax ≤ b,
     x_i ∈ Z, i ∈ I,
where I ⊆ {1, . . . , n} is the set of indices identifying the variables that take integer values. Of course, if
the decision variables are required to be nonnegative, we could use the set N instead of Z. A special case
of integer variables are binary variables, which can only take values in B = {0, 1}. Consider the following
example.
Example 11.3. Building microchips with integers
The company BIM realizes that a 1% fraction of the copper always gets wasted while producing both
types of microchips, more specifically 1% of the required amount. This means that it actually takes
4.04 gr of copper to produce a logic chip and 2.02 gr of copper to produce a memory chip. If we
rewrite the linear problem in example 11.1 and modify accordingly the coefficients in the corresponding
constraints, we obtain the following problem
If we solve the problem again, we obtain a different optimal solution than the original one, namely (x_1, x_2) ≈
(626.238, 1123.762) and an optimal value of roughly 17628.713. Note, in particular, that this new
optimal solution is not integer, but on the other hand there is no constraint in the LP above requiring
x_1 and x_2 to be such.
In terms of production, of course, we would simply produce entire chips, but it is not clear how
to implement the fractional solution (x_1, x_2) ≈ (626.238, 1123.762). Rounding down to (x_1, x_2) =
(626, 1123) will intuitively yield a feasible solution, but we might be giving away some profit and/or
not using the available material efficiently. Rounding up to (x_1, x_2) = (627, 1124) could possibly lead
to an infeasible solution for which the available material is not enough. We can of course inspect all
these candidate integer solutions by hand, but if the problem involved many more decision
variables or had a more complex structure, this would become much harder and possibly not lead to
the true optimal solution.
A much safer approach is to explicitly require the two decision variables to be nonnegative integers,
thus transforming the original problem into the following MILP:
The optimal solution is (x_1, x_2) = (626, 1124) with a profit of 17628. Note that for this specific
problem both of the naive rounding strategies outlined above would not have yielded the true optimal
solution. The Python code for obtaining the optimal solution using MILP solvers is given below.
m = pyo.ConcreteModel('BIMperturbed')
# as for the model 'BIM' above, but with the perturbed copper coefficients
# 4.04 and 2.02, and with the variables declared integer, e.g.,
# m.x1 = pyo.Var(within=pyo.NonNegativeIntegers)
MILP naturally applies to situations in which we need to deal with integer quantities, such as when
scheduling people, as the following extensive example illustrates.
Example 11.4. Shift scheduling
This example concerns a model for scheduling weekly shifts for a small campus food store. It is
inspired by a Towards Data Science article, whose original implementation has been revised. Let us
look at the problem description from the original article.
A new food store has been opened at the University Campus which will be open 24 hours
a day, 7 days a week. Each day, there are three eight-hour shifts. Morning shift is from
6:00 to 14:00, evening shift is from 14:00 to 22:00 and night shift is from 22:00 to 6:00 of
the next day. During the night there is only one worker while during the day there are
two, except on Sunday that there is only one for each shift. Each worker will not exceed
a maximum of 40 hours per week and have to rest for 12 hours between two shifts. As for
the weekly rest days, an employee who rests one Sunday will also prefer to do the same
that Saturday. In principle, there are available ten employees, which is clearly over-sized.
The less the workers are needed, the more the resources for other stores.
This problem requires the assignment of N workers to a predetermined set of shifts. There are three
shifts per day, seven days per week. These observations suggest the need for three ordered sets:
• WORKERS, with N elements representing workers.
• DAYS, labeling the days of the week.
• SHIFTS, labeling the shifts each day.
The problem describes additional considerations that suggest the utility of several additional sets.
• SLOTS is an ordered set of (day, shift) pairs describing all of the available shifts during the
week.
• BLOCKS is an ordered set of all overlapping 24-hour periods in the week. An element of the
set contains the (day, shift) pairs in the corresponding period. This set will be used to limit
worker assignments to no more than one within each 24-hour period.
• WEEKENDS is the set of all (day, shift) pairs on a weekend. This set will be used to
implement worker preferences on weekend scheduling.
These additional sets improve the readability of the model.
N = number of workers
WorkersRequired_{d,s} = number of workers required for each (day, shift) pair (d, s)
Let us now look at the model constraints. Workers are assigned to each shift to meet the staffing
requirement:

∑_{w ∈ WORKERS} assign_{w,d,s} ≥ WorkersRequired_{d,s}  ∀ (d, s) ∈ SLOTS.
The model objective is to minimize the overall number of workers needed to fill the shift and work
requirements while also attempting to meet worker preferences regarding weekend shift assignments.
This is formulated here as an objective for minimizing a weighted sum of the number of workers needed
to meet all shift requirements and the number of workers assigned to weekend shifts. The positive
weight γ determines the relative importance of these two measures of a desirable shift schedule.
min ∑_{w ∈ WORKERS} needed_w + γ ∑_{w ∈ WORKERS} weekend_w
Figure 11.1 is a visual representation of the optimal shift schedule obtained for a specific instance of
the problem.
Previously, we claimed that every optimization problem that can be formulated as an LP is ‘easy’ to
solve. Does that mean that every MILP problem is easy to solve as well? Not necessarily: due
to the significantly greater modelling capacity of MILP, it can indeed be used to model problems that are
fundamentally ‘difficult’, i.e., for which no efficient solution procedure is known, even if tools other than
MILP are allowed. Here, we present an example of such a classical problem – the knapsack problem –
which can be used to model many resource allocation situations, e.g., on computational clusters.
Example 11.5. Resource allocation – Knapsack 0-1 problem
A traveler can only bring a single fixed-weight knapsack and must fill it with the most valuable items.
Given a finite set of n items, where each item i has a weight wi and a value vi , we want to select the
subset of items to put in the knapsack so that the total weight is less than or equal to a given limit
Figure 11.1: The optimal shift schedule.
W and the total value is as large as possible. It can be formulated as an MILP as follows:
max_{x} ∑_{i=1}^{n} v_i x_i
s.t. ∑_{i=1}^{n} w_i x_i ≤ W,
     x_i ∈ B, i = 1, . . . , n.
The knapsack problem is one of the most fundamental combinatorial optimization problems. Its
variants often arise in resource allocation, where the decision makers have to choose a subset of non-divisible
tasks or resources under a fixed time or budget constraint. General versions of this
problem are routinely solved on computational clusters, for example.
This problem is known to be NP-complete, which, roughly, means that it is widely believed that no
algorithm exists for it that would not need to check, on a worst-case instance, all 2ⁿ solutions.
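As a sketch of this formulation on a small made-up instance, using scipy.optimize.milp (available in SciPy ≥ 1.9) instead of Pyomo; the item values and weights are invented for illustration:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

values = np.array([10.0, 13.0, 18.0, 31.0, 7.0, 15.0])
weights = np.array([2.0, 3.0, 4.0, 6.0, 1.0, 3.0])
W = 10.0                                   # knapsack capacity

# milp minimizes, so we negate the values to maximize the total value
res = milp(c=-values,
           constraints=LinearConstraint(weights.reshape(1, -1), -np.inf, W),
           integrality=np.ones(len(values)),   # all variables integer
           bounds=Bounds(0, 1))                # integer in [0, 1] = binary
selected = np.round(res.x).astype(int)
```

For these data, the optimal selection is items 3, 4, and 5 (values 31, 7, 15) with total value 53 at weight exactly 10.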
As visible from the example, our enthusiasm for modelling a problem we encounter as an MILP can
sometimes lead us to, accidentally, modelling one of the well-known NP-hard problems. Does that mean
that MILP is an inefficient technology? No, because powerful solvers have been developed for MILPs that
allow solving optimization problems with thousands of integer variables efficiently.
To use them, we need to become familiar with techniques and tricks that allow us to model, via integer
variables and MILP constraints, situations that we might not think of as amenable to MILP at first encounter.
Now we can model the discontinuous variable x by the following linear constraints:
x ≤ uy,
x ≥ ly,
y ∈ B.
Indeed, by studying this system of constraints, you can see that the relationship (11.4) becomes enforced.
For this, again the big-M method can be used, where we need a new binary variable y ∈ B and two large
positive constants M_1, M_2 > 0. The linearized constraints are then

a_1ᵀ x ≤ b_1 + M_1 y,
a_2ᵀ x ≤ b_2 + M_2 (1 − y),
y ∈ B.
Example 11.6. Multi-plant production
Consider the following production problem:

max_{x,y≥0} 40x + 30y
s.t. x ≤ 40 (demand)
     x + y ≤ 80 (labor A)
     2x + y ≤ 100 (labor B).

The optimal solution is (x, y) = (20, 60), which results in a profit of 2600.
Labor B is a relatively high cost factor in the production of product X. A new technology has been developed
with the potential to lower costs by reducing the time required to finish product X to 1.5 hours, but it
requires a more highly skilled labor type C at a unit cost of 60 per hour.
It is our task to assess whether the new technology is beneficial, i.e., whether adopting it would lead to
a higher profit. In this situation we have an either-or structure for the objective and for the labor B
constraint:
max_{x,y≥0, z∈B} profit
s.t. x ≤ 40 (demand)
x + y ≤ 80 (labor A)
profit ≤ 40x + 30y + M z
profit ≤ 60x + 30y + M (1 − z)
2x + y ≤ 100 + M z
1.5x + y ≤ 100 + M (1 − z).
where the variable z ∈ {0, 1} ‘activates’ the constraints related to the old or new technology, respec-
tively, and M is a big enough number.
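A sketch of the either-or model above, solved with scipy.optimize.milp instead of Pyomo; we pick M = 10⁴ and order the variables as (x, y, profit, z):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

M = 1e4                                  # 'big enough' constant
c = np.array([0.0, 0.0, -1.0, 0.0])      # maximize profit -> minimize -profit

# all constraints written as  A [x, y, profit, z]^T <= ub
A = np.array([
    [1.0,   0.0,  0.0, 0.0],   # x <= 40                       (demand)
    [1.0,   1.0,  0.0, 0.0],   # x + y <= 80                   (labor A)
    [-40.0, -30.0, 1.0, -M],   # profit <= 40x + 30y + M z     (old technology)
    [-60.0, -30.0, 1.0,  M],   # profit <= 60x + 30y + M(1-z)  (new technology)
    [2.0,   1.0,  0.0,  -M],   # 2x + y <= 100 + M z           (old labor B)
    [1.5,   1.0,  0.0,   M],   # 1.5x + y <= 100 + M(1-z)      (new labor C)
])
ub = np.array([40.0, 80.0, 0.0, M, 100.0, 100.0 + M])

res = milp(c=c,
           constraints=LinearConstraint(A, -np.inf, ub),
           integrality=np.array([0, 0, 0, 1]),  # only z is binary
           bounds=Bounds([0, 0, -np.inf, 0], [np.inf, np.inf, np.inf, 1]))
x, y, profit, z = res.x
```

For these data the solver selects z = 1 (the new technology) with (x, y) = (40, 40) and a profit of 3600 > 2600, so adopting the new technology is indeed beneficial here.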
The if-then condition requires that if a condition A holds, say a_1ᵀ x ≤ b_1, then another condition B,
say a_2ᵀ x ≤ b_2, must hold. A situation like this can still be encoded as a linear model using the
big-M method.
First notice that the implication A ⇒ B is logically equivalent to ¬A ∨ B. Using this trick, the if-then
condition is logically equivalent to requiring

a_1ᵀ x > b_1 or a_2ᵀ x ≤ b_2.
Introducing two large constants M_1, M_2 > 0 and a binary variable y, this either-or constraint is equivalent
to

a_1ᵀ x > b_1 − M_1 y,
a_2ᵀ x ≤ b_2 + M_2 (1 − y),
y ∈ B.

Here, one needs to be careful because in MILP, strict constraints of the form a_1ᵀ x > b_1 − M_1 y cannot be
enforced as such and are always implemented as weak inequalities a_1ᵀ x ≥ b_1 − M_1 y, which in most contexts
is fine.
y ≤ x1 ,
y ≤ x2 ,
y ≥ x1 + x2 − 1,
y ∈ B.
Similarly, the product x1 x2 with x1 ∈ B and l ≤ x2 ≤ u can be replaced by a new variable y and the following
additional constraints
y ≤ ux1 ,
y ≥ lx1 ,
y ≤ x2 − l(1 − x1 ),
y ≥ x2 − u(1 − x1 ),
y ∈ R.
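The first set of constraints can be verified exhaustively; the following sketch checks that, for binary x1 and x2, the only feasible binary y is exactly the product x1 · x2:

```python
import itertools

# for every binary (x1, x2), the constraints y <= x1, y <= x2,
# y >= x1 + x2 - 1 admit exactly one binary y, namely y = x1 * x2
for x1, x2 in itertools.product([0, 1], repeat=2):
    feasible = [y for y in (0, 1)
                if y <= x1 and y <= x2 and y >= x1 + x2 - 1]
    assert feasible == [x1 * x2]
```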
optimization problem. Such an optimization problem can be formulated in general as:
min f (x)
x,y
s.t. g(x) ≤ 0
y = h(x)
θ(y) ≤ 0,
where f(x) is the cost function to be minimized in terms of the decision vector x, subject to constraints
g(x) ≤ 0, which can comprise multiple smaller constraints. The problem entails another set of decision variables
y, which result from applying a predictive model h that transforms the original decision variables x into y. The
requirement is that the transformed decision variables y meet certain criteria, which are formulated
using the constraint θ(y) ≤ 0.
What are the two specific applications that could fit into such a description?
Example 11.7. Optimizing the World Food Programme supply chain [18]
Imagine you are to construct an optimal ‘food basket’ for a humanitarian crisis relief operation. In
constructing the meal, you are to pick from a number of ingredients, each of which has a certain
unit price. The goal is, of course, to keep the total cost of the food basket as low as possible, while
meeting two essential requirements:
• the food basket should meet minimum nutritional requirements in terms of proteins, vitamins
etc.
• the meal made out of such a food basket should meet a minimum ‘taste’ score, because otherwise
it is not likely to be eaten and some food is going to be wasted.
Formulating the objective in such a problem is easy – it is simply the inner product of a decision
vector x, where each entry measures how much of a given type of food (rice, dates, etc.) the basket
should include, and the vector c of unit prices.
Formulating the nutrition constraints is also straightforward – it will be something like:
Ax ≥ b,
where the matrix A stores information about how much of nutrient i (row) food type j
(column) contains per unit, and the i-th entry of b is the minimum requirement for nutrient i.
Solving the problem of minimizing the cost of a food basket that meets the nutrition requirements can be,
depending on the number of options, an easy or a large-scale LP problem, and can be formulated
with Pyomo in exactly the same way as in the previous two subsections.
The ‘taste’ score is, however, trickier to formulate. How can we map a vector x
to some kind of taste value? We cannot really, unless we know how people value the taste
of various baskets and try to transform this data into a ‘predicted’ score of a potential new basket.
This is exactly where the predictive model comes in. In the WFP problem, this was done based on a
database of past food baskets x1 , . . . , xN along with the scores y 1 , . . . , y N they received from people.
Based on these scores, a regression tree that maps x to y was fitted:
y ≈ CART(x).
Using this tree, the optimization problem to be solved to optimize the food basket is:

min_{x,y}  c^T x
s.t.  Ax ≥ b
      y ≥ y_min
      y = CART(x).
Example 11.8. Radiotherapy [18]
Consider another example, in which the vector x defines the positions and intensities of the various beams used in cancer radiotherapy. Normally, to quantify the amount of radiation that reaches a certain area of the patient’s body for such a vector x, one needs to perform expensive simulations based on the physics of radiation propagation through different tissues, taking the patient’s geometry into account.
Because such simulations can be time-costly, and because radiotherapy requires rather fast decisions, a different approach has been proposed. Namely, for many patients i with geometries w^i and corresponding radiation schemes x^i, a predictive model is trained to predict the amount of radiation delivered to different parts of the patient’s body. For simplicity, let us assume
that a function

f_Bad : (x, w) → R_+

quantifies the amount of radiation sent to the healthy organs around the cancer tumor, i.e., radiation sent to areas where radiation should not go; and that a function

f_Good : (x, w) → R_+
quantifies the amount of radiation sent to the cancer tumor. The benefit of such trained functions
is that they are very quick to compute given a vector x and geometry w. In the field of scientific
machine learning, training a function like this would be called model order reduction.
In rough terms, then, the optimization problem one would like to solve maximizes the amount of radiation sent to the cancer tumor itself, subject to a constraint on the amount of radiation sent to the healthy tissues:

max_x  f_Good(x, w)
s.t.  f_Bad(x, w) ≤ δ,

where δ stands for the upper bound on the radiation of non-cancerous areas.
For both of the above examples, including complicated ML-generated constraints in an optimization problem and claiming it should be easy to solve is easier said than done. The question is, of course, how to encode the relationship

y = h(x)

in a form an optimization solver can handle.
Linear regression The easiest case of transforming decision variables x into a ‘score’ is, of course, when the ML tool used is linear regression, of the form

y = w^T x,

because then this single linear constraint is the only thing that needs to be added to the problem formulation.
Regression trees A slightly more involved case is the situation where we use a regression tree. In this case, the tree partitions the domain of x into L different regions, where L is the number of leaves in the tree, and the l-th leaf is described by a set of linear constraints

A_l x ≤ b_l,

with p_l being the predictor value assigned to the l-th leaf, so that the predictive tool is:

y =  p_1  if A_1 x ≤ b_1
     ⋮
     p_L  if A_L x ≤ b_L.
However, such a full mixed-integer reformulation can be overkill: it may be more effective to parallelize the search for the best x by solving L smaller optimization problems, simply ‘trying’ every leaf l with p_l ≥ y_min and keeping the best solution found. This also has the benefit of avoiding integer variables.
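The leaf-by-leaf strategy can be sketched as follows, with a hand-specified toy tree and scipy.optimize.linprog standing in for a real solver; the leaf constraints, prices, and the score threshold y_min are all made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Toy tree over x in R^2 with three leaves; each leaf l is a polyhedron
# A_l x <= b_l with predicted score p_l.  All numbers are invented.
leaves = [
    (np.array([[1.0, 0.0]]),               np.array([1.0]),        0.2),  # x1 <= 1
    (np.array([[-1.0, 0.0], [0.0, 1.0]]),  np.array([-1.0, 2.0]),  0.9),  # x1 >= 1, x2 <= 2
    (np.array([[-1.0, 0.0], [0.0, -1.0]]), np.array([-1.0, -2.0]), 0.5),  # x1 >= 1, x2 >= 2
]

c = np.array([1.0, 2.0])   # unit costs to minimize
y_min = 0.4                # required minimum predicted score

# 'Try' each leaf whose prediction meets the threshold and keep the
# cheapest feasible x; the L subproblems are independent (parallelizable).
best = None
for A_l, b_l, p_l in leaves:
    if p_l < y_min:
        continue
    res = linprog(c, A_ub=A_l, b_ub=b_l, bounds=(0, 5))
    if res.success and (best is None or res.fun < best[0]):
        best = (res.fun, res.x, p_l)

print(best)  # cheapest x among score-feasible leaves
```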
Neural networks with ReLU activation function When we use a dense neural network, the only difficult part about formulating it with MILP constraints is the nonlinear activation; everything else is just a combination of products of the network’s weights and the decision variables.
So, how do we transform the pre-activation value x of a given node into the output

y = max{0, x}?

Provided that the absolute value |x| can be bounded by a large enough number M, we can formulate this as:

y ≥ x
y ≤ x + M(1 − z)
y ≤ Mz
y ≥ 0
z ∈ {0, 1}.
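As a sanity check on this big-M encoding, the following sketch brute-forces both values of z and verifies that the constraints admit exactly y = max{0, x}; M = 10 is an assumed valid bound on |x| for the tested inputs.

```python
import numpy as np

M = 10.0  # assumed bound: all tested |x| <= M

def feasible_y(x, z, tol=1e-9):
    """Return the interval of y satisfying y>=x, y<=x+M(1-z), y<=Mz, y>=0,
    or None if the constraints are infeasible for this (x, z)."""
    lo = max(x, 0.0)
    hi = min(x + M * (1 - z), M * z)
    return (lo, hi) if lo <= hi + tol else None

for x in np.linspace(-5, 5, 101):
    ys = set()
    for z in (0, 1):
        interval = feasible_y(x, z)
        if interval is not None:
            lo, hi = interval
            assert hi - lo < 1e-9            # interval collapses to a point
            ys.add(round(lo, 6))
    assert ys == {round(max(0.0, x), 6)}     # constraints force y = max(0, x)
print("big-M ReLU encoding verified on 101 inputs")
```

Note how the binary z selects the active piece: z = 1 forces y = x (feasible only when x ≥ 0), while z = 0 forces y = 0 (feasible only when x ≤ 0).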
Exercise 11.2.
Suppose you are to formulate an optimization problem where the constraint-learning part consists of a logistic regression function:

y = exp(w^T x) / (1 + exp(w^T x)).

How would you (approximately) model it using (mixed-integer) linear programming? After you have tried, you can check a possible solution to this question in [5].
predictive model. While this is not forbidden by definition, common sense tells us that the predictive model’s
estimates for a solution like this can be far from exact.
For that reason, most authors developing optimization-with-constraint-learning approaches suggest using a ‘trust region’ approach: defining an area of the decision-variable space for which the predictive model is still trusted. This area can be defined, for example, as the convex hull of all the samples x^i collected so far, or some region around this convex hull.
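Membership in the convex hull of the samples x^i can itself be checked (or enforced) with linear constraints: x must equal a convex combination of the samples. A minimal sketch using scipy.optimize.linprog as a feasibility oracle, with made-up sample points:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, samples):
    """LP feasibility: does x = samples.T @ lam for some lam >= 0 with
    sum(lam) = 1?  samples has one sample x^i per row."""
    n, d = samples.shape
    A_eq = np.vstack([samples.T, np.ones((1, n))])  # X^T lam = x, 1^T lam = 1
    b_eq = np.append(x, 1.0)
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.success

# Hypothetical historical decision vectors x^i (rows of X).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(in_convex_hull(np.array([0.3, 0.3]), X))   # inside the triangle
print(in_convex_hull(np.array([1.0, 1.0]), X))   # outside the triangle
```

The same λ-variables and constraints can be added directly to the optimization problem to restrict x to the trust region.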
Summarizing, a general approach for solving a decision problem based on a machine learning tool would be as follows.
1. Identify which relationships inside the decision problem are not clear and are best discovered using ML.
2. Collect the data and train the ML tool you need.
3. If the decisions to be made on the basis of the ML tool are simple – stop here.
4. If the decisions to be made on the basis of the ML tool form a complicated array of decisions best investigated using algorithms, set up an optimization problem with the ML tool embedded and solve it.
5. Validate the obtained solution, for example by bootstrapping the samples from your data set and investigating the solution’s performance on them.
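The validation in step 5 can be sketched as follows, with a linear model standing in for the ML tool; the data, the candidate solution x_star, and all coefficients are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic history: past decisions X (rows x^i) and observed scores y^i,
# generated from a made-up linear ground truth plus noise.
X = rng.uniform(0, 1, size=(100, 3))
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(0, 0.1, size=100)
x_star = np.array([0.5, 0.5, 0.5])   # candidate solution to validate

# Bootstrap: refit the predictive model on resampled data and inspect the
# spread of predicted scores at x_star.
preds = []
for _ in range(200):
    idx = rng.integers(0, len(X), size=len(X))
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    preds.append(w @ x_star)
preds = np.array(preds)
print(f"predicted score at x_star: {preds.mean():.3f} +/- {preds.std():.3f}")
```

A small spread relative to the decision threshold (e.g. y_min) suggests the ML-based constraint can be trusted at x_star; a large spread suggests x_star lies where the model extrapolates.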
12 Exercise solutions
Exercise 2.22
1. For the sake of brevity, we define x := sign(v_1) e_1. Then,

w̃ := v − ‖v‖ x

and

w := w̃ / ‖w̃‖.

Then,

H_p v = (I − 2ww^T) v = v − 2(w^T v) w = v − (2 w̃^T v / ‖w̃‖^2) w̃.

Here,

‖w̃‖^2 = 2 (‖v‖^2 − ‖v‖ x^T v)   (12.1)

and

2 w̃^T v = 2 (v − ‖v‖ x)^T v = 2 (‖v‖^2 − ‖v‖ x^T v).   (12.2)

Since (12.1) and (12.2) coincide, the coefficient of w̃ equals 1, so that

H_p v = v − w̃ = v − (v − ‖v‖ x) = ‖v‖ x = sign(v_1) ‖v‖ e_1.
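A quick numerical check of this derivation, using NumPy with a random v of dimension 5:

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=5)

# Build the Householder reflector exactly as in the derivation above.
e1 = np.zeros(5)
e1[0] = 1.0
x = np.sign(v[0]) * e1
w_tilde = v - np.linalg.norm(v) * x
w = w_tilde / np.linalg.norm(w_tilde)
Hp = np.eye(5) - 2.0 * np.outer(w, w)

# H_p v should equal sign(v_1) ||v|| e_1, and H_p should be a reflector.
assert np.allclose(Hp @ v, np.linalg.norm(v) * x)
assert np.allclose(Hp @ Hp, np.eye(5))   # H_p is its own inverse
```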
References
[1] Charu Aggarwal. Linear algebra and optimization for machine learning. Springer, 2020.
[2] Charu C Aggarwal et al. Neural networks and deep learning. Springer, 2018.
[3] Amir Beck. Introduction to nonlinear optimization: Theory, algorithms, and applications with
MATLAB. SIAM, 2014.
[4] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
[5] David Bergman, Teng Huang, Philip Brooks, Andrea Lodi, and Arvind U Raghunathan. Janos: an inte-
grated predictive and prescriptive modeling framework. INFORMS Journal on Computing, 34(2):807–
816, 2022.
[6] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082,
2017.
[7] Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization, volume 6. Athena
Scientific Belmont, MA, 1997.
[8] Christopher M. Bishop. Pattern recognition and machine learning. Information Science and Statistics.
Springer, New York, 2006.
[9] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[10] Steven L Brunton and J Nathan Kutz. Data-driven science and engineering: Machine learning,
dynamical systems, and control. Cambridge University Press, 2019.
[11] Francois Chollet. Deep learning with Python. Simon and Schuster, 2021.
[12] Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization.
SIAM, 2009.
[13] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine
preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553,
November 2009.
[14] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,
signals and systems, 2(4):303–314, 1989.
[15] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal
Processing Magazine, 29(6):141–142, 2012.
[16] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of machine learning research, 12(7), 2011.
[17] Morris L Eaton. Multivariate statistics: a vector space approach. 1983.
[18] Adejuyigbe Fajemisin, Donato Maragno, and Dick den Hertog. Optimization with constraint learning:
A framework and survey. arXiv preprint arXiv:2110.02121, 2021.
[19] Peter I Frazier. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
[20] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural net-
works. In Proceedings of the thirteenth international conference on artificial intelligence and statistics,
pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
[21] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Adaptive Computation and Machine
Learning series. MIT Press, 2016.
[22] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, Second Edition. Springer Series in Statistics. Springer New York, 2009.
[23] Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the state-of-the-art. Knowledge-Based
Systems, 212:106622, 2021.
[24] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture
6a overview of mini-batch gradient descent. http://www.cs.toronto.edu/~tijmen/csc321/slides/
lecture_slides_lec6.pdf, 2012.
[25] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–
257, 1991.
[26] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal
approximators. Neural networks, 2(5):359–366, 1989.
[27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[28] J Nathan Kutz. Data-driven modeling & scientific computation: methods for complex systems & big
data. Oxford University Press, 2013.
[29] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape
of neural nets. In Neural Information Processing Systems, 2018.
[30] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural
networks: A view from the width. Advances in neural information processing systems, 30, 2017.
[31] James Martens et al. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735–742,
2010.
[32] Andreas C Müller and Sarah Guido. Introduction to machine learning with Python: a guide for data
scientists. O’Reilly Media, 2016.
[33] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[34] Fernando Nogueira. Bayesian Optimization: Open source constrained global optimization tool for
Python, 2014–.
[35] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy
employed by V1? Vision Research, 37(23):3311–3325, 1997.
[36] Krzysztof Postek, Alessandro Zocca, Joaquim Gromicho, and Jeffrey Kantor. Data-Driven Mathematical
Optimization in Python. Online, 2022.
[37] Ralph Tyrell Rockafellar. Convex analysis. Princeton university press, 2015.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In International Conference on Medical image computing and computer-assisted
intervention, pages 234–241. Springer, 2015.
[39] Walter Rudin. Functional analysis. McGraw-Hill, New York, 1991.
[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large
Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252,
2015.
[41] Yousef Saad. Iterative methods for sparse linear systems. SIAM, 2003.
[42] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics
of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[43] Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al. Learning with kernels: support vector
machines, regularization, optimization, and beyond. MIT press, 2002.
[44] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to
Algorithms. Cambridge University Press, 2014.
[45] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning
research, 15(1):1929–1958, 2014.
[46] Gilbert Strang. Linear algebra and learning from data. Wellesley-Cambridge Press Cambridge, 2019.
[47] Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. SIAM, 1997.
[48] Jeremy Watt, Reza Borhani, and Aggelos K Katsaggelos. Machine learning refined: Foundations,
algorithms, and applications. Cambridge University Press, 2020.
[49] Laurence A Wolsey. Integer programming. John Wiley & Sons, 2020.