
Linear Algebra and Optimization for Machine Learning

Alexander Heinlein (a.heinlein@tudelft.nl) Krzysztof Postek (k.s.postek@tudelft.nl)

June 2, 2022

Contents
1 Introduction to the course 3
1.1 Goals of machine learning and of this course . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Zooming in on Linear Algebra, Optimization, and ML . . . . . . . . . . . . . . . . . . . . . 4
1.3 Remainder of the lecture notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Linear algebra basics 13


2.1 Matrices and vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Bilinear forms and norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Matrix powers and polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Orthogonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Diagonalization and eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.6 Solving linear equation systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.7 The singular value decomposition and pseudo inverses . . . . . . . . . . . . . . . . . . . . . . 67
2.8 Graphs and matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

3 Optimization basics 81
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.2 Building up the gradient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3 What do we converge to? Convex functions and global optimality . . . . . . . . . . . . . . . . 89
3.4 Modelling losses for ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.5 What distinguishes optimization methods used for ML? . . . . . . . . . . . . . . . . . . . . . 97
3.6 Gradient descent: a simple proof of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.7 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

4 Unconstrained optimization: beyond the gradient method 103


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.2 Correcting the gradient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.3 Newton method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.4 Non-smooth optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5 Practical summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5 Constrained optimization and duality 115


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2 Projected gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3 Frank-Wolfe algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.4 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

6 Clustering 127
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 K-means clustering - Euclidean distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.3 K-means clustering - kernelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.4 Graph-based clustering problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.5 Cut-based clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.6 Spectral clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.7 Final comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

7 Tree-based learners 135


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.2 Classification and regression trees - the optimization problem . . . . . . . . . . . . . . . . . . 136
7.3 Recursive tree construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.4 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.5 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.6 Need for interpretability – will proper optimization make its comeback? . . . . . . . . . . . . 143

8 Hyperparameter optimization 144


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.2 Grid search and random sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.3 Bayesian optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.4 Zero-order (derivative-free) approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

9 Linear unsupervised learning 153


9.1 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.2 Recommender systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.3 Matrix factorization techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

10 Neural networks 164


10.1 Feedforward neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
10.2 Optimization of neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
10.3 Computational graph, forward propagation, and backward propagation . . . . . . . . . . . . . 175
10.4 Some examples of neural network architectures . . . . . . . . . . . . . . . . . . . . . . . . . . 180

11 Optimization with constraint learning 189


11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
11.2 Linear optimization and its modelling techniques . . . . . . . . . . . . . . . . . . . . . . . . . 190
11.3 Mixed-integer linear programming and modelling techniques . . . . . . . . . . . . . . . . . . . 193
11.4 Constraint learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

12 Exercise solutions 205

1 Introduction to the course
1.1 Goals of machine learning and of this course
Welcome to the course! This is a course about machine learning (ML) which is a reasonably ‘trendy’ subject
these days. One can say - yet another course about machine learning. What kind of machine learning will
that be? These are legitimate questions because ML is not a single thing but instead, it consists of multiple
sub-domains: supervised learning, unsupervised learning, semi-supervised learning, and recently we can
speak of the emergence of deep learning as a separate branch of ML, although, in fact, it supports the earlier
three sub-domains. All of these are worlds on their own, and you can spend a lifetime on exploring the
literature related to them. There are many books and many ways in which you can approach this subject:

• General machine learning introduction [32, 48, 46]


• Learning theory – here a good resource would be the book [44], or a somewhat softer book [22]
• The probabilistic perspective on various ML topics – [33]

• Data-driven modeling [28, 10]


• Clicking yourself through the tools and applying them – for example, a good starter on the subject
would be the textbook [32]
• Focus on the specific domain of deep learning [21, 11, 2]

• Pattern recognition and machine learning [8]


On top of things, there is plenty of online material, often in the form of blog posts or videos, with which you
can start using your first ML models within a few hours. However, be careful with your sources: in contrast
to the books listed above, many of those articles have not gone through a serious review process; of course,
they can still be extremely helpful.
So what will this course bring if there is such an abundance of teach-it-to-yourself material online?
Well, we want you to know how to design ML tools that really achieve the goal you want, or that achieve
a computationally feasible goal that is closest to the one you had in mind. This, unavoidably, requires
understanding of how to transform the data at hand into a tool.
Understanding what data-to-tool transformations are possible and building them, requires

1. understanding of the mathematical structures with which the data can be represented;
2. mathematical formulation of the goal we aim to achieve (which is usually pretty vague at the starting
point) including a judgement whether it is achievable at all, or at the cost of what mathematical/computational
difficulties;
3. designing efficient computational procedures that achieve the goal in a reasonable time.

This interplay of the data (the dirty part) and mathematics (the pure part) is where a lot of theoretical
knowledge and practice is needed. Often, achieving the ‘ideal’ goal will not be possible (in short: you cannot
solve your problem by stating a theorem that will fit the data), but life still requires solutions that work
most of the time, so we aim to teach you what to pay attention to in order to build a good tool that delivers
what you want, most of the time.
This is a good place to say that we will not discuss machine learning from a probabilistic or statistical
perspective, and hence, you will not see many symbols like P here. Staying probability-free, we will assume
that data of certain types is given, and we will focus on how to construct machine learning models from that;
this process is also denoted as learning from the data. Now, we are ready to explain where the course title
comes from. Surprisingly, once the data is given, machine learning models for different domains all consist of
elementary mathematical building blocks from optimization and linear algebra. To put it differently, when


Figure 1.1: It is difficult to relate machine learning with optimization and linear algebra. All machine
learning algorithms discussed in our lectures are based on linear algebra and optimization; optimization
always uses linear algebra.

presented with a new problem in which your task is to perform ‘some kind of learning’, it is very likely that
any tool you come up with is going to require solving an optimization problem (at the higher level) and
linear algebraic techniques (lower level) to make it work.
However, we will not strictly split the course among linear algebra and optimization topics since many
building blocks of machine learning algorithms have aspects from both fields; see also fig. 1.1. We will often
use the term linear algebra synonymously with numerical linear algebra. This is because ultimately, we are
always aiming at using the algorithms on a computer; hence, it is important to investigate them under this
condition.
From the implementation perspective, to understand the algorithms in-depth, we will focus on imple-
menting the ML algorithms from scratch instead of using black-box (Python) packages. Therefore, you can
expect a certain amount of implementation work during the course.
We will lack the time to gain experience on how to optimally tune machine learning algorithms
for the use case at hand. This we will leave to applied machine learning courses and the further literature,
partly given above. Nonetheless, we want to gain insights into the relevance and influence of the tunable
parameters (a.k.a. hyperparameters) of machine learning algorithms, such that you can
1. modify the settings and hyperparameters of existing ML packages and
2. build computationally reliable tools/packages yourself.

1.2 Zooming in on Linear Algebra, Optimization, and ML


Once we have set the ultimate purpose of this course on a high level, the time has come to zoom in a bit and
see how its components will play together. We will first give you ‘our’ view on what linear algebra and
optimization are, and then, on examples of simple ML applications, we will see their place in the ML
game.

1.2.1 Linear algebra


Linear algebra is a very fundamental branch of mathematics and, broadly speaking, it deals with properties
of linear systems of equations, matrices, vectors, and operations between all these. Nearly all applied mathematics
relies on tools from linear algebra because approximating (usually nonlinear) phenomena using
linear ones is very convenient for our minds. Some important types of linear algebra
operations that are utilized heavily in ML are the following:

• Matrix and vector summation:


\[
C = A + B \quad \text{and} \quad z = x + y, \tag{1.1}
\]
for A, B, C ∈ R^{n×m} and x, y, z ∈ R^k.

• Matrix-matrix, matrix-vector, and vector-vector multiplication, that is,

\[
C = A \cdot B, \qquad y = A \cdot x, \quad \text{and} \qquad c = y^\top \cdot x, \tag{1.2}
\]
for A ∈ R^{n×m}, B ∈ R^{m×k}, C ∈ R^{n×k}, x ∈ R^m, y ∈ R^n, and c ∈ R. As the summation operation
before, the multiplication is a very straightforward operation. However, it is computationally more
demanding. Moreover, the specific order of carrying out multiple multiplications in a row can have a
significant influence on the numerical stability and computational work. For instance, even though,
from a mathematical standpoint,
\[
(A \cdot B) \cdot z \tag{1.3}
\]
and
\[
A \cdot (B \cdot z) \tag{1.4}
\]
give the same result for z ∈ R^k, the latter is significantly more efficient, especially in the case of dense
matrices.
Definition 1.1.
A matrix A ∈ R^{n×m} is called dense if it has approximately n × m nonzero entries. If it has
significantly fewer nonzero entries, it is instead called sparse.

Example 1.1. Matrix-Matrix Vs Matrix-Vector Multiplication


Let us consider the special case of A, B ∈ R^{n×n} being dense matrices and x ∈ R^n. Then,
\[
A \cdot x
\]
requires n² scalar multiplications and n(n − 1) scalar additions. Since, on a normal CPU (central
processing unit), addition and multiplication can be performed as one floating point operation
(FLOP), the cost is essentially O(n²) FLOPs.
On the other hand,
\[
A \cdot B
\]
amounts to multiplying A with n vectors (the columns of B). Hence, eq. (1.3) requires O(n³ + n²) =
O(n³) FLOPs, whereas eq. (1.4) only requires O(2n²) = O(n²) FLOPs. The larger n, the larger
the computational cost of eq. (1.3) compared to eq. (1.4).
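To see this difference in practice, here is a minimal Python/NumPy sketch (not part of the original notes) that times both evaluation orders; the matrix size and the random data are arbitrary choices for illustration.

```python
import time
import numpy as np

# Minimal sketch: comparing the mathematically equivalent orders (A B) z and A (B z).
n = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
z = rng.standard_normal(n)

t0 = time.perf_counter()
y1 = (A @ B) @ z          # O(n^3): forms the dense product A B first
t1 = time.perf_counter()
y2 = A @ (B @ z)          # O(n^2): only two matrix-vector products
t2 = time.perf_counter()

print(f"(A B) z : {t1 - t0:.3f} s")
print(f"A (B z) : {t2 - t1:.3f} s")
print("max difference:", np.max(np.abs(y1 - y2)))  # equal up to rounding errors
```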

• Solving linear systems of equations of the form

\[
Ax = b. \tag{1.5}
\]
If the system is ‘nice’, i.e., the matrix A is square and invertible, we know that x can be computed
as x = A⁻¹b. Explicitly inverting the matrix is, however, generally too expensive in terms of computational work as well as
memory consumption. Hence, (numerical) linear algebra and optimization algorithms for inexactly
solving eq. (1.5) using, for instance, iterative schemes are used in practice; these also make use of the
previous types of linear algebra operations, that is, summation and multiplication.
These and many other operations are at the core of ML techniques; they are also used in many optimization
algorithms; cf. fig. 1.1.

1.2.2 Optimization
Optimization is concerned with finding solutions to problems of the form

\[
\min_{x \in X} f(x) \tag{1.6}
\]
where X is the set of admissible solutions and f(x) is the function whose value is to be minimized by selecting
x; of course, minimization and maximization are algorithmically equivalent because
\[
\arg\max_{x \in X} f(x) = \arg\min_{x \in X} -f(x).
\]
Hence, it is sufficient to consider the minimization case eq. (1.6).


In our context, the function f (x) will typically represent a measure of misfit of the model to the data,
and we want this misfit to be as small as possible. This means that we have learned a model from the data.
For example, if the optimization problem is

\[
\min_{x \in X} f(x) := \|Ax - b\|_2^2, \tag{1.7}
\]
then we have formulated the problem of solving a linear system of equations, as in eq. (1.5), as an optimization
problem of finding an x that minimizes the mismatch between the vectors Ax and b; eq. (1.7) is also called
the least-squares problem. In some sense, this problem is more general compared to eq. (1.5): if x is a solution
of eq. (1.5), it is obviously also a solution of eq. (1.7). However, there are cases where we can find a solution of
the least-squares problem eq. (1.7), even if the linear equation system eq. (1.5) has no solution.
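As a small illustration, the following NumPy sketch solves a least-squares problem of the form eq. (1.7); the random overdetermined system is an assumed toy example, not data from the notes.

```python
import numpy as np

# Minimal sketch: solving min_x ||Ax - b||_2^2 even though Ax = b may have no exact solution.
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))   # overdetermined: more equations than unknowns
b = rng.standard_normal(10)

x, residual, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)
print("least-squares solution x:", x)
print("misfit ||Ax - b||^2     :", np.sum((A @ x - b) ** 2))
```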
Rewriting a linear system of equations as an optimization problem is just one example of the fact that
optimization and linear algebra have a lot of links between them, even without thinking about ML; we could
say that they were friends already before ML was fashionable.

1.2.3 Machine learning


Generally, ML deals with finding models based on given data. In this section, our aim will be to show
selected examples of ML tools, at the same time pinpointing where linear algebra and optimization sit
inside them.
It makes sense to group our discussion along the classical divisions of supervised learning, unsupervised
learning, semi-supervised learning, and deep learning.

Supervised learning. Supervised learning can be explained as data fitting - the process that tries to find
a mapping from one part of the data (features) to another part (labels). It is used when you have a dataset
in which many objects are equipped with certain features, and each of these objects is labelled. This can
correspond to the following pairs:
Object   Features                           Label
Person   Income, education                  Defaulting on loan repayment in the past
Meal     Amounts of various ingredients     Taste, rated from 1 to 10
Device   Age, temperature when working      Needs replacement or not
Photo    A vector of pixel values           Name of the object on the photo: cat or dog?
Your goal is, when you encounter a new object equipped with a set of features but whose label is not
known (or not available), to guess the correct ‘label’. ML does it by trying to figure out a relationship between
the known objects’ features and labels in the training dataset and then use this estimated relationship to guess
the labels on new data. Why would one do something like this? Typically, this is because determining the
correct label for the new object is not possible (one would need to wait a long time) or very expensive effort-
or money-wise. If the ML-based model is able to make correct guesses often enough, then the downsides of
making an error every once in a while will be outweighed by the benefits of making the guesses fast.

More formally, supervised learning corresponds to having a dataset
\[
I = \{x_1, \ldots, x_n\}, \quad O = \{y_1, \ldots, y_n\},
\]
with n samples, such that
\[
x_i \mapsto y_i,
\]
for 1 ≤ i ≤ n. This means that x_i contains the features (input) and y_i the labels (output) of the i-th data
sample. The features and labels can be integer- or scalar-valued. We can sub-divide supervised learning into
classification, which corresponds to the case when the labels are integer-valued, and regression, where the
labels are scalar-valued. In supervised learning, the model is constructed to minimize the data misfit, which
– as mentioned before – corresponds to solving a minimization problem.
Example 1.2. Supervised Learning – Regression
The most classical examples of supervised learning are regression problems, known in some areas
as data-fitting. Graphically, this can be described as aiming to find a function that most closely
describes the mapping of the points on the x-axis in the picture below, with their corresponding
y-axis coordinates (marked as red rectangles).

[Figure: noisy data points (red, legend ‘data’) and the fitted model curve (blue, legend ‘model’) over x ∈ [0, 10].]

The (optimization) problem of finding such a function can be formulated mathematically as
\[
\min_{w} \sum_{i=1}^{n} \left( w^\top \Phi(\hat{x}_i) - \hat{y}_i \right)^2,
\]
where we aim to minimize the sum of squares of the discrepancies between the actual output values
ŷ_i assigned to the samples x̂_i and the values w^⊤Φ(x̂_i), which are inner products of a vector w of
parameters that we control with Φ(x̂_i), a point-to-vector mapping that gives our function w^⊤Φ(x)
some ‘flexibility’. For example, the blue curve in the picture has been obtained by taking
\[
\Phi(x) = \begin{pmatrix} 1 \\ x \\ x^2 \\ x^3 \end{pmatrix},
\]
whereby, naturally, it has to hold that w ∈ R^4. In other words, the blue curve illustrates fitting the
best degree-3 polynomial to match the relationship of x and y in the data.
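A minimal sketch of this fit in Python/NumPy is given below; the synthetic noisy data and the use of np.linalg.lstsq are illustrative assumptions, but the feature map Φ(x) = (1, x, x², x³)ᵀ is the one from the example.

```python
import numpy as np

# Minimal sketch: fitting w for the feature map Phi(x) = (1, x, x^2, x^3)^T by least squares.
rng = np.random.default_rng(2)
x_hat = np.linspace(0, 10, 20)
y_hat = -0.5 * x_hat**3 + 4 * x_hat**2 - 3 * x_hat + rng.normal(0, 5, x_hat.size)  # synthetic noisy data

Phi = np.vander(x_hat, N=4, increasing=True)   # rows (1, x, x^2, x^3)
w, *_ = np.linalg.lstsq(Phi, y_hat, rcond=None)
print("fitted coefficients w:", w)
print("sum of squared errors:", np.sum((Phi @ w - y_hat) ** 2))
```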

Example 1.3. Supervised Learning – Classification
Another type of supervised learning is classification, which can be used, e.g., to guess if a given
tissue is healthy or cancer-afflicted. Suppose you have a number of observation vectors x_i and their
corresponding labels y_i ∈ {−1, 1}, i = 1, . . . , n. You would like to find out if it is possible to find
a relationship between the input values x and the labels y, to be able to predict this in the future.
One of the ways to do it is to build up a support vector machine, or a hyperplane that separates the
groups of points with labels −1 and 1.

[Figure: separating hyperplane w^⊤x = 0 with the margin boundaries w^⊤x = 1 and w^⊤x = −1 delimiting the buffer zone between the two classes.]

The aim there is to find a parameter vector w such that we have
\[
\begin{aligned}
w^\top x_i &\geq 1 \quad \forall i : y_i = 1, \\
w^\top x_i &\leq -1 \quad \forall i : y_i = -1.
\end{aligned}
\]
Why not ≥ 0 and ≤ 0, respectively? In that case, the obvious solution would be w = 0, which would
not provide us with any useful tool.
The above system of inequalities can be written concisely as:
\[
y_i (w^\top x_i) \geq 1 \quad \forall i.
\]

This is a nice idea, but such a good separation of points might not always be possible. What we can
do then is to find w such that the total sum of violations of the above inequalities is as small as
possible:
\[
\min_{w} \sum_{i=1}^{n} \max\{1 - y_i (w^\top x_i), 0\}.
\]
Most often, we will prefer the problem to be formulated with squared terms,
\[
\min_{w} \sum_{i=1}^{n} \max\{1 - y_i (w^\top x_i), 0\}^2,
\]
because that will keep the function to be minimized smooth and differentiable, i.e., it will keep it a
nice function from an optimization point of view.
Optimization algorithms that help us find the best w will have to compute the gradient of the minimized
function (the objective function) w.r.t. the parameter w many times (remember – the gradient is the
direction of steepest ascent of a function, so minus the gradient should guide us towards decreasing
the function value). Hence, we will perform many iterations with the matrix X in which all the data x_i
is stored, and a vector y of all the labels y_i. This is the part of the procedure in which efficient linear
algebra will be of great importance.
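The following sketch illustrates this idea with plain gradient descent on the squared hinge loss; the synthetic two-cluster data, the step size, and the number of iterations are assumptions made for illustration only.

```python
import numpy as np

# Minimal sketch: minimizing sum_i max(1 - y_i w^T x_i, 0)^2 with plain gradient descent.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])  # two point clouds
y = np.concatenate([np.ones(50), -np.ones(50)])                           # labels in {+1, -1}

w = np.zeros(2)
step = 1e-3
for _ in range(500):
    margins = np.maximum(1.0 - y * (X @ w), 0.0)      # per-sample violations
    grad = -2.0 * (margins * y) @ X                   # gradient of the squared hinge loss
    w -= step * grad

print("learned w:", w)
print("misclassified samples:", np.sum(np.sign(X @ w) != y))
```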

Unsupervised learning. Another domain of ML – unsupervised learning – is when the data does not
possess anything like labels, but we are still interested in certain patterns inside it. One of the typical
examples here is clustering – a process in which, trying to make sense out of a huge amount of data (objects),
we try to subdivide them into groups where the items belonging to the same group are ‘similar’.
Example 1.4. Unsupervised Learning - Clustering

[Figure: a graph whose nodes are colored by cluster membership – red: cluster 1, blue: cluster 2.]

Consider that you have a group of people and are informed which of them know/exchange messages
with each other. That is, for each pair (i, j) of people, you know if there exists a relationship (1)
or not between them (0), with 0 for pairs (i, i) by convention. If you place all this information into
a matrix, you obtain a so-called adjacency matrix of the graph in which the nodes are persons, and
edges are existing relationships between them:

\[
A = \begin{pmatrix}
0 & 1 & \cdots & 0 \\
1 & 0 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 1 & \cdots & 0
\end{pmatrix}
\]

Based on this information, you might have to figure out what the k ‘groups of friends’ among
them are. There are many ways to translate this question into a mathematical goal, but all of them
will be an optimization problem, whose efficient solution will require the use of the linear algebraic
properties of the adjacency matrix A.
Note that the adjacency matrix is often sparse (for instance, for the example visualized above).
If possible, it is very important to take this property into account, since this significantly reduces the
computational cost, in terms of computational work and memory.
In order to formulate the partitioning of the graph into k clusters mathematically, we can, for instance,
consider the following optimization problem:
\[
\min_{u_{ik} \in \{0,1\}} \sum_{i=1}^{n} \sum_{j=1}^{n} u_{ik} u_{jk} \, d(i, j),
\]
where the u_{ik} are decision variables indicating whether the i-th element is assigned to the k-th cluster
or not, and d(i, j) defines a ‘distance’ between nodes i and j. In a graph context, it makes sense that
the ‘distance’ is defined in relation to the strength of the link between i and j through graph edges
(and not their Euclidean distance – think about it: it is more important how many people can connect
me to a given person rather than how far that person lives from me). For that purpose, a technique
known as spectral clustering uses a transformed version of the matrix A to solve this problem.
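As a small illustration, the sketch below evaluates the clustering objective above for a hand-chosen assignment; the tiny adjacency matrix and the crude choice d(i, j) = 1 − a_{ij} are assumptions for demonstration only and are not the spectral-clustering construction discussed later.

```python
import numpy as np

# Minimal sketch: evaluating sum_k sum_{i,j} u_{ik} u_{jk} d(i, j) for a given binary assignment.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 1],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 1, 0]])
D = 1 - A                      # crude illustrative 'distance': 0 if connected, 1 otherwise
np.fill_diagonal(D, 0)

U = np.array([[1, 0],          # nodes 0-2 in cluster 0, nodes 3-4 in cluster 1
              [1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

objective = sum(U[:, k] @ D @ U[:, k] for k in range(U.shape[1]))
print("objective value:", objective)
```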

Another example of unsupervised learning is dimensionality reduction, where we try to describe a
high-dimensional object with a low-dimensional structure. The question is then essentially, which dimensions
can be dropped (possibly after applying some initial transformation to the data first), so that different objects
can still be distinguished, but that they take less memory space. A perfect example here is image compression.

Example 1.5. Unsupervised Learning – Dimension Reduction

[Image: surface of Mercury – original (left) and compressed version (right).]

Let A be the matrix containing the pixel values of an image. Then, we can reduce the size of the image
by using dimension reduction techniques. In particular, we consider the singular value decomposition
(SVD)
\[
A = U \Sigma V^\top, \qquad \Sigma = \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix},
\]
where Σ ∈ R^{m×n} and
\[
\Sigma_r = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r) = \begin{pmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \end{pmatrix}.
\]

By replacing all diagonal entries σ_i with i larger than some k by zero, we can reduce the size of the image.
Above, you can see the original image of size 1 144 × 1 071 = 1 225 224 pixels (left) and the image resulting
from keeping only 38 diagonal entries of the SVD (right). The compression factor is ≈ 15.0.
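A minimal NumPy sketch of this kind of compression is shown below; since the Mercury image itself is not included here, a random low-rank matrix stands in for the pixel data, and the truncation rank k is an arbitrary choice.

```python
import numpy as np

# Minimal sketch: rank-k approximation of a 'greyscale image' via the SVD.
rng = np.random.default_rng(4)
A = rng.standard_normal((200, 30)) @ rng.standard_normal((30, 150))  # stand-in for the pixel matrix

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

k = 20
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]   # keep only the k largest singular values

storage_full = A.size
storage_k = U[:, :k].size + k + Vt[:k, :].size
print("relative error     :", np.linalg.norm(A - A_k) / np.linalg.norm(A))
print("compression factor :", storage_full / storage_k)
```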

Semi-supervised learning Real life situations often involve a setup in which there is a certain amount of
labelled data (e.g. animal pictures, as in supervised learning), and much more unlabelled data (new animal
pictures, where nobody has indicated the type of an animal). The unlabelled data need not, however, be
useless because it can be similar in features to the old data (new dog pictures typically look more similar to
old dog pictures than to old cat pictures). In other words, one can try to use the unlabelled animal pictures
to create a better predictive model than the model on the labelled data alone.
A very heuristic idea here is to build a supervised model on the labelled data, and to use it to predict
labels on a part of the unlabelled data. Then, one can add the ‘pseudo labelled’ data to the original labelled
dataset, and try to learn a new supervised model on the enlarged dataset. Under certain assumptions,
this approach, known as self-training, can work. A classical domain for this type of learning is language
processing.
Semi-supervised learning, essentially, involves algorithms that use the tools of (un)supervised learning,
therefore we will skip giving very specific examples here and move on directly to the example of neural
networks.

Deep learning a.k.a. neural networks. Separating deep learning from the rest of ML is, on a theoretical
level, incorrect. Deep learning is, strictly speaking, an add-on used to perform the tasks of the
supervised/unsupervised/semi-supervised learning. However, deep learning models have attracted so much
attention in the past years, and their mathematical analysis is so distinctive, that it makes sense to discuss
them as a separate subject.
Neural networks are a computationally efficient tool to build very complex ML models, and they are
inspired by the way neural networks in our bodies transform and transmit signals from the nerves to the
brain. These complex networks can discover much more complex relationships in the data than the ‘earliest
and simple’ ML models would do. In short, deep learning is the good old ML albeit on computational
steroids.
Example 1.6. Neural Networks
The picture here illustrates a simple example of a neural network that transforms the input vector
x consisting of three entries x_1, x_2, x_3, first into a so-called hidden layer consisting of three neurons
a_1^{(1)}, a_2^{(1)}, a_3^{(1)} that transform the incoming signals (incoming arrows), combining them into a single
number each; the transformed signals from a_1^{(1)}, a_2^{(1)}, a_3^{(1)} are then sent to another single-neuron hidden
layer a^{(2)}, which transforms the incoming signal into a single output value y.

[Figure: the network x = (x_1, x_2, x_3) → hidden layer a^{(1)} = (a_1^{(1)}, a_2^{(1)}, a_3^{(1)}) via the weights Θ^{(1)} → single neuron a^{(2)} via the weights Θ^{(2)} → output y.]

What does it exactly mean that a signal is sent by means of an arrow and how do the neurons work?
For that purpose, it makes sense to zoom in on a single neuron, for example a_1^{(1)}, and show a typical,
very simple, set of mathematical operations it consists of.

[Figure: a single neuron with inputs α_0, α_1, …, α_i, weights w_{j0}, w_{j1}, …, w_{ji}, the weighted sum s = Σ_i w_{ji} α_i, and the output β = φ(s).]

First, the numbers incoming from the preceding nodes (nodes from which an arrow/arc leads to the
current neuron) are multiplied by a scalar parameter w_{ij}. Then, all those multiplied signals are
added together into a single number s. This single number, in the end, is transformed by means of
a simple, nonlinear function φ(·) into the output value β.
Coming back to the big picture again, the symbols Θ^{(1)} and Θ^{(2)} stand there to denote all the
weights w_{ij} corresponding to the neuron zoom-ins from the smaller picture. In ML terminology, Θ^{(1)}
and Θ^{(2)} are the parameters of the neural network and the goal of the training process is to find
(optimize) the values of these parameters so that, over a training sample consisting of many pairs
(x̂^{(1)}, ŷ^{(1)}), . . . , (x̂^{(n)}, ŷ^{(n)}), the values y that the network would generate based on the input vectors
x̂^{(i)} are as close as possible to the corresponding values ŷ^{(i)}.
In this way, neural networks can be depicted as ‘computational graphs’ where on the left-hand side
we have the input features, and each node corresponds to a linear or non-linear transformation of the
numbers coming from the incoming arcs. In principle, the output value y is a compound function of
the input vector x:
\[
y(x) := \varphi\left( \sum_{\ldots} w_{\ldots} \, \varphi\Big(\sum_{\ldots} w_{\ldots} x_i\Big) + \sum_{\ldots} w_{\ldots} \, \varphi\Big(\sum_{\ldots} w_{\ldots} x_i\Big) + \sum_{\ldots} w_{\ldots} \, \varphi\Big(\sum_{\ldots} w_{\ldots} x_i\Big) \right),
\]
but good luck if you try to work with a function like this directly – here, the mathematical/geometrical
structure of a graph helps put an order among the possibly hundreds of hidden layers, and millions
of neurons, and in particular, in the training process.
This problem of training a neural network can be formulated, for example, as the minimization of the
squared error between the network outputs y(x̂^{(i)}) and the values ŷ^{(i)}:
\[
\min_{\Theta^{(1)}, \Theta^{(2)}} \sum_{i=1}^{n} \left( y(\hat{x}^{(i)}) - \hat{y}^{(i)} \right)^2.
\]

An algorithm that will solve this minimization problem will need to compute the gradient of the
function that gives us y with respect to all the in-between parameters w_{ij}. Doing so in the way that
you learn in a calculus course would be computationally intractable, so we will learn graph-theoretic
and linear-algebraic tools that will allow us to keep the number of computations as low as possible.
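To make the notation concrete, here is a minimal sketch of the forward pass of the small network from the example; the particular weights Θ^{(1)}, Θ^{(2)} and the choice of a sigmoid activation for φ are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: forward pass of a 3 -> 3 -> 1 network and its squared error on one sample.
def phi(s):
    return 1.0 / (1.0 + np.exp(-s))   # a simple nonlinear activation (assumed sigmoid)

rng = np.random.default_rng(5)
Theta1 = rng.standard_normal((3, 3))  # weights from the inputs to the first hidden layer
Theta2 = rng.standard_normal((1, 3))  # weights from the hidden layer to the output neuron

def forward(x):
    a1 = phi(Theta1 @ x)              # first hidden layer
    a2 = phi(Theta2 @ a1)             # second layer / output
    return a2

x_hat = np.array([0.2, -1.0, 0.5])
y_hat = 0.7
y = forward(x_hat)
print("network output y(x):", y)
print("squared error      :", float((y - y_hat) ** 2))
```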

1.3 Remainder of the lecture notes


We hope that by now the goal of the course is more or less clear. How are we going to structure the path
towards this goal? Overall, we are going to divide the course into two parts:
• Theory buildup: within the first 5-6 weeks. First, we will introduce the necessary tools from linear
algebra and optimization. For some of you, some notions here will be repetitions from other courses,
but we will try to motivate the need for them by giving ML application examples.
• ML-oriented part. In the second part of the course, we shall go through several important ML tech-
niques and we will analyze them from an LA/Opt point of view.

2 Linear algebra basics
As pointed out in the previous section, (numerical) linear algebra is the foundation of many algorithms and
techniques in machine learning. The most obvious reason is the data, which is the basis for machine learning
models. There are many different types of data subject to machine learning techniques, such as
• textual data,

• visual data (images and videos),


• audio data,
• measurement data, and
• many others.

The data can be characterized in different ways. For instance, the data can be numeric or non-numeric,
structured or unstructured, or static or temporal. However, when represented digitally, the data is usually
encoded in terms of linear algebra structures, that is, vectors, matrices, and tensors (arrays) of scalars
(floating point numbers) or integers. Therefore, dealing with data naturally involves dealing with linear
algebra structures.
Let us, for instance, consider an image:

[Image: a greyscale photograph, taken from https://unsplash.com]

A grey image can be stored as a matrix where each entry corresponds to the intensity of a pixel of the image;
in 8-bit greyscale a value of zero corresponds to black, a value of 255 corresponds to white, and all values
between zero and 255 correspond to different shades of grey. For instance, the matrix
\[
\begin{pmatrix}
0 & 0 & 0 \\
0 & 255 & 0 \\
0 & 0 & 0 \\
0 & 255 & 0 \\
0 & 0 & 0
\end{pmatrix}
\]

corresponds to the image of an 8:

A color image corresponds to three matrices corresponding to the amount of red, green, and blue (RGB)
in the image.
As a second example, consider time-dependent sensor data:
[Figure: plot of time-dependent sensor data s over t ∈ [0, 10].]

This data can be stored as a vector of the sensor measurements:
\[
\begin{pmatrix} s_0 & s_1 & s_2 & \cdots & s_{10} \end{pmatrix}.
\]
A second, even more important reason for the importance of (numerical) linear algebra in machine
learning is that virtually all machine learning algorithms are based on or use tools and techniques from
numerical linear algebra. In particular, many supervised machine learning algorithms are based on iterative
schemes that involve, for instance, the repeated application of matrices onto a vector. Unsupervised learning
techniques are aimed at revealing structure in given data, and this is often done on the basis of matrix
factorization techniques.
The reason why it is important and helpful to investigate these algorithms from the (numerical) linear
algebra perspective is that the performance of the algorithms can be analyzed based on the properties of the
linear algebra objects, that is, the matrices and vectors. In particular, we will discuss both performance aspects:
• convergence speed, accuracy, and robustness as well as
• computational efficiency.
While it is often difficult to generalize statements about the convergence speed, accuracy, and robustness of
the algorithms to general data sets, the theory from (numerical) linear algebra can still be a guideline for the
application to general data sets.
Moreover, since real data sets are often large and convergence and accuracy are often worse compared to
the configurations discussed in the theory, the algorithms involve large computational costs and often con-
verge only slowly. As a result, it becomes particularly important to design, implement, and tune algorithms
for computational efficiency.
In this section, we will introduce/recall (numerical) linear algebra basics and tools that are employed in
a wide range of the machine learning algorithms; we will discuss this in more detail in the second part of the
lecture.

2.1 Matrices and vectors


First, we will introduce matrix and vector notations used in this course and discuss special classes of matrices
as well as important properties matrices can have. First of all, recall that a vector is a tensor of first order
and a matrix is a tensor of second order. In easier words, a vector is a one-dimensional array of numbers,
and a matrix is a two-dimensional array of numbers; we write them as follows:
\[
v = \begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix}
\quad \text{and} \quad
A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}.
\]

In this course, we will only consider certain types of numbers for the entries of matrices and vectors, that is,
the real numbers R, or scalars, and the integers Z; in some cases, we might also consider boolean numbers
{0, 1}.
For now, we will concentrate on the case that all entries are real numbers, that is, v_1, . . . , v_m ∈ R and
a_{11}, . . . , a_{nm} ∈ R. We then write that v ∈ R^m and A ∈ R^{n×m}. In machine learning, the case of integers
and boolean numbers will mostly arise from classification problems, that is, when categorizing data into a
finite number of classes, which are labeled as boolean values or integers. Therefore, instead of considering
all integers, we typically restrict ourselves to the natural numbers N.
The space R^m is the vector space of all vectors of length m, and R^{n×m} is also a vector space, however, a
vector space of matrices. This means that the axioms of vector spaces are satisfied for matrices and vectors
with the element-wise addition
\[
\begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}
+
\begin{pmatrix} b_{11} & \cdots & b_{1m} \\ \vdots & \ddots & \vdots \\ b_{n1} & \cdots & b_{nm} \end{pmatrix}
=
\begin{pmatrix} a_{11} + b_{11} & \cdots & a_{1m} + b_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} + b_{n1} & \cdots & a_{nm} + b_{nm} \end{pmatrix}
\]
and
\[
\begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix} + \begin{pmatrix} w_1 \\ \vdots \\ w_m \end{pmatrix} = \begin{pmatrix} v_1 + w_1 \\ \vdots \\ v_m + w_m \end{pmatrix}
\]
as well as the scalings
\[
c \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} = \begin{pmatrix} c\,a_{11} & \cdots & c\,a_{1m} \\ \vdots & \ddots & \vdots \\ c\,a_{n1} & \cdots & c\,a_{nm} \end{pmatrix}
\]
and
\[
c \begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix} = \begin{pmatrix} c\,v_1 \\ \vdots \\ c\,v_m \end{pmatrix}.
\]
Here, all a_{ij}, b_{ij}, v_i, w_i, and c are scalars.
Exercise 2.1. Vector spaces
Recall the axioms of vector spaces.

The dimension of a vector space is defined as the number of vectors of a basis of this vector space;
in order for this to be well-defined, all bases of one vector space must have the same size, which is a
well-known statement from linear algebra. A standard basis of R^m is
\[
\left\{ \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \ldots, \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix} \right\},
\]
which shows again that the dimension is m. A standard basis for the matrix space R^{n×m} is defined analogously,
and hence, its dimension n · m follows.
Any vector v in an m-dimensional vector space V can be represented as a linear combination
\[
v = a_1 b_1 + \ldots + a_m b_m
\]
of a basis {b_1, . . . , b_m}, where a_1, . . . , a_m ∈ R. Recall also that a basis of a vector space is a maximal linearly
independent set of vectors. Vectors are linearly independent if
\[
a_1 b_1 + \ldots + a_m b_m = 0 \quad \Rightarrow \quad a_1 = \ldots = a_m = 0.
\]

Otherwise, they are linearly dependent. A linearly independent set of vectors is a basis if we cannot add
any vector from the vector space without the set becoming linearly dependent. Any subset W of a vector
space V is called a subspace if it is closed with respect to addition and scaling; as a result, the subspace is
a vector space itself.
As long as we see vectors and matrices just as elements of vector spaces, a specific vector or matrix does
not have any special properties, except for the zero elements
\[
\begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix} 0 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 0 \end{pmatrix}.
\]
In particular, for any element v of a vector space, we always have that v + 0 = v.


Matrices can have additional properties when identifying them with linear maps between vector spaces. In
particular, all linear maps from R^m to R^n can be represented as matrices in R^{n×m}. A map A : R^m → R^n
is linear if
\[
A(v + w) = A(v) + A(w) \quad \text{and} \quad A(c \cdot v) = c A(v),
\]
where v, w ∈ R^m and c ∈ R. In order to apply the linear map corresponding to the matrix A ∈ R^{n×m} to a
vector v ∈ R^m, we perform the matrix-vector multiplication
\[
Av = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} \cdot \begin{pmatrix} v_1 \\ \vdots \\ v_m \end{pmatrix}
= \begin{pmatrix} a_{11} v_1 + \ldots + a_{1m} v_m \\ \vdots \\ a_{n1} v_1 + \ldots + a_{nm} v_m \end{pmatrix}.
\]
This operation is only well-defined if the sizes of the matrix and the vector are compatible. The matrix of
the composition of two linear maps that are represented by the matrices A ∈ R^{l×m} and B ∈ R^{m×n} can also
be obtained by the matrix-matrix multiplication
\[
AB = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{l1} & \cdots & a_{lm} \end{pmatrix} \cdot \begin{pmatrix} b_{11} & \cdots & b_{1n} \\ \vdots & \ddots & \vdots \\ b_{m1} & \cdots & b_{mn} \end{pmatrix}
= \begin{pmatrix} c_{11} & \cdots & c_{1n} \\ \vdots & \ddots & \vdots \\ c_{l1} & \cdots & c_{ln} \end{pmatrix},
\]
where
\[
c_{ij} = \sum_{k=1}^{m} a_{ik} b_{kj} = a_{i1} b_{1j} + \ldots + a_{im} b_{mj}
\]
for 1 ≤ i ≤ l and 1 ≤ j ≤ n. Again, it is important that the sizes of the matrices are compatible. For
simplicity, we will use matrices and the corresponding linear maps synonymously.
Let us recall example 1.1 from section 1, which is about the computational efficiency of those
operations:
Example 2.1. Matrix-Matrix Vs Matrix-Vector Multiplication
Let us consider the special case of A, B ∈ R^{n×n} being dense matrices and x ∈ R^n. Then,
\[
A \cdot x
\]
requires n² scalar multiplications and n(n − 1) scalar additions. On some computing architectures,
addition and multiplication can be performed in parallel. Then, we count them as one floating point
operation (FLOP). The cost for the matrix-vector multiplication is essentially O(n²) FLOPs.
On the other hand,
\[
A \cdot B
\]
amounts to multiplying A with n vectors (the columns of B). This results in a total of O(n³) FLOPs.
Hence,
\[
(A \cdot B) \cdot x \tag{2.1}
\]
requires O(n³ + n²) = O(n³) FLOPs, whereas
\[
A \cdot (B \cdot x) \tag{2.2}
\]
only requires O(2n²) = O(n²) FLOPs. The larger n, the larger the overhead of eq. (2.1) is compared
to eq. (2.2).

This example shows that, even though two ways of evaluating an expression may be mathematically equivalent,
it is important how the computations are performed in practice. In particular, when dealing with large data
sets and when repeatedly performing such operations, this can become critical for performance. In the following,
we will discuss more properties of matrices.

Rectangular, quadratic matrices, and the matrix rank   The size of a matrix is linked with the
domain space and codomain space of the corresponding linear map. In particular, as discussed before,
if A ∈ R^{n×m}, then it corresponds to a linear map
\[
A : \mathbb{R}^m \to \mathbb{R}^n,
\]
where the domain space is domain(A) = R^m and the codomain space is codomain(A) = R^n. Now, the
dimension formula from linear algebra says that
\[
\dim(\operatorname{domain}(A)) = \dim(\operatorname{range}(A)) + \dim(\ker(A)). \tag{2.3}
\]
Here, the range space range(A) is also called the column space because it is the space spanned by the
columns of A; in other words, it is the set of all linear combinations of the column vectors of A. The kernel
ker(A) is also called the null space; it corresponds to the space of vectors w ∈ domain(A) such that Aw = 0.
We will use the terms kernel and null space synonymously in this course.
From eq. (2.3), we obtain
\[
\ker(A) \subset \operatorname{domain}(A) \quad \text{and} \quad \operatorname{range}(A) \subset \operatorname{codomain}(A).
\]
When considering the transposed matrix
\[
A^\top := \begin{pmatrix} a_{11} & \cdots & a_{n1} \\ \vdots & \ddots & \vdots \\ a_{1m} & \cdots & a_{nm} \end{pmatrix} \in \mathbb{R}^{m \times n}
\]
of
\[
A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} \in \mathbb{R}^{n \times m},
\]
we get the corresponding linear map
\[
A^\top : \underbrace{\operatorname{domain}(A^\top)}_{=\operatorname{codomain}(A)} \to \underbrace{\operatorname{codomain}(A^\top)}_{=\operatorname{domain}(A)}.
\]
From the dimension formula,
\[
\dim(\operatorname{domain}(A^\top)) = \dim(\operatorname{range}(A^\top)) + \dim(\ker(A^\top)),
\]

we obtain
\[
\dim(\operatorname{codomain}(A)) = \dim(\operatorname{range}(A^\top)) + \dim(\ker(A^\top)).
\]
Here, range(A^⊤) is the column space of A^⊤, which is the same as the row space of A, that is, the space
spanned by the rows of A (i.e., the columns of A^⊤). The null space ker(A^⊤) of A^⊤ is also called the left null
space of A, and it corresponds to all w ∈ domain(A^⊤) = codomain(A) such that A^⊤w = 0.
One important observation from linear algebra is that dim(range(A)) = dim(range(A^⊤)), that is, that
the dimensions of the column and row space are the same. The numbers dim(range(A)) and dim(range(A^⊤))
are denoted as the column rank and row rank of the matrix A, and since they are the same, we just call
it the rank of the matrix A. The rank of the matrix A is a measure for the amount of information stored
in the matrix, that is, the number of linearly independent rows and columns of the matrix.
We can already imagine that, when building machine learning models from large data sets, which are
stored as matrices, it will be essential to actually know how much information is stored in the matrix.
Even if a data set is very large, it may still be represented by just a few vectors spanning the whole column or
row space. During the course, we will also discuss how to measure which dimensions of the row and column
spaces are actually most relevant and which can be neglected at the cost of only small errors.
Even though the spaces domain(A), range(A^⊤), and ker(A) as well as codomain(A), range(A), and
ker(A^⊤) do not fully describe the action of A, they are important to characterize the matrix. Many
important matrix properties can be formulated in terms of these spaces and their dimensions as well as the
rank of the matrix. For instance, injectivity and surjectivity can be defined based thereon:
Theorem 2.1.
Let A ∈ R^{n×m} be a matrix. The corresponding linear map
• is injective ⇔ dim(ker(A)) = 0,
• is surjective ⇔ dim(range(A)) = dim(codomain(A)),
• is bijective ⇔ it is injective and surjective.

Note that, using the dimension formula, we obtain that
\[
\begin{aligned}
A \text{ is bijective} &\Leftrightarrow \dim(\ker(A)) = 0 \,\wedge\, \dim(\operatorname{range}(A)) = \dim(\operatorname{codomain}(A)) \\
&\Leftrightarrow \dim(\operatorname{domain}(A)) = \dim(\operatorname{range}(A)) \,\wedge\, \dim(\operatorname{range}(A)) = \dim(\operatorname{codomain}(A)) \\
&\Leftrightarrow \dim(\operatorname{domain}(A)) = \dim(\operatorname{range}(A)) = \dim(\operatorname{codomain}(A)).
\end{aligned}
\]
One conclusion is that A ∈ R^{n×n}, that is, that the matrix is a square matrix.
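As a small numerical illustration of the rank as a measure of information content, the following sketch builds a large matrix from only two independent columns and checks its rank with NumPy; the sizes are arbitrary.

```python
import numpy as np

# Minimal sketch: a large matrix built from two independent columns has rank 2.
rng = np.random.default_rng(6)
basis = rng.standard_normal((100, 2))          # two independent column vectors
coeff = rng.standard_normal((2, 50))
A = basis @ coeff                              # 100 x 50 matrix, but only rank 2

print("shape:", A.shape)
print("rank :", np.linalg.matrix_rank(A))      # prints 2
```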

Diagonal, triangular, and symmetric matrices Any square matrix A ∈ Rn×n can be partitioned as

\[
A = L + D + U, \tag{2.4}
\]
where L = (l_{ij})_{ij} ∈ R^{n×n} is a lower triangular matrix with
\[
l_{ij} = \begin{cases} a_{ij} & \text{for } j < i, \\ 0 & \text{otherwise}, \end{cases}
\]
D = (d_{ij})_{ij} ∈ R^{n×n} is a diagonal matrix with
\[
d_{ij} = \begin{cases} a_{ii} & \text{for } i = j, \\ 0 & \text{otherwise}, \end{cases}
\]
and U = (u_{ij})_{ij} ∈ R^{n×n} is an upper triangular matrix with
\[
u_{ij} = \begin{cases} a_{ij} & \text{for } i < j, \\ 0 & \text{otherwise}. \end{cases}
\]

Note that it is also common to write D = diag_{1≤i≤n}(a_{ii}). We can also write
\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & a_{(n-1),n} \\
a_{n1} & \cdots & a_{n,(n-1)} & a_{nn}
\end{pmatrix}
=
\underbrace{\begin{pmatrix}
0 & \cdots & \cdots & 0 \\
a_{21} & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
a_{n1} & \cdots & a_{n,(n-1)} & 0
\end{pmatrix}}_{=L}
+
\underbrace{\begin{pmatrix}
a_{11} & 0 & \cdots & 0 \\
0 & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & a_{nn}
\end{pmatrix}}_{=D}
+
\underbrace{\begin{pmatrix}
0 & a_{12} & \cdots & a_{1n} \\
\vdots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & a_{(n-1),n} \\
0 & \cdots & \cdots & 0
\end{pmatrix}}_{=U}.
\]

These definitions can also be applied to non-square matrices, resulting in
\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & \cdots & \cdots & \cdots & a_{1n} \\
a_{21} & \ddots & & & & & \vdots \\
\vdots & \ddots & \ddots & & & & \vdots \\
a_{m1} & \cdots & a_{m,m-1} & a_{m,m} & a_{m,m+1} & \cdots & a_{mn}
\end{pmatrix}
=
\underbrace{\begin{pmatrix}
0 & \cdots & \cdots & \cdots & \cdots & \cdots & 0 \\
a_{21} & \ddots & & & & & \vdots \\
\vdots & \ddots & \ddots & & & & \vdots \\
a_{m1} & \cdots & a_{m,m-1} & 0 & \cdots & \cdots & 0
\end{pmatrix}}_{=L}
+
\underbrace{\begin{pmatrix}
a_{11} & 0 & \cdots & \cdots & \cdots & \cdots & 0 \\
0 & \ddots & \ddots & & & & \vdots \\
\vdots & \ddots & \ddots & \ddots & & & \vdots \\
0 & \cdots & 0 & a_{m,m} & 0 & \cdots & 0
\end{pmatrix}}_{=D}
+
\underbrace{\begin{pmatrix}
0 & a_{12} & \cdots & \cdots & \cdots & \cdots & a_{1n} \\
\vdots & \ddots & \ddots & & & & \vdots \\
\vdots & & \ddots & \ddots & & & \vdots \\
0 & \cdots & \cdots & 0 & a_{m,m+1} & \cdots & a_{mn}
\end{pmatrix}}_{=U}
\]

or
\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & a_{(n-1),n} \\
a_{n1} & & \ddots & a_{nn} \\
\vdots & & & a_{(n+1),n} \\
\vdots & & & \vdots \\
a_{m1} & \cdots & \cdots & a_{mn}
\end{pmatrix}
=
\underbrace{\begin{pmatrix}
0 & \cdots & \cdots & 0 \\
a_{21} & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \vdots \\
a_{n1} & & \ddots & 0 \\
\vdots & & & a_{(n+1),n} \\
\vdots & & & \vdots \\
a_{m1} & \cdots & \cdots & a_{mn}
\end{pmatrix}}_{=L}
+
\underbrace{\begin{pmatrix}
a_{11} & 0 & \cdots & 0 \\
0 & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & a_{nn} \\
\vdots & & & 0 \\
\vdots & & & \vdots \\
0 & \cdots & \cdots & 0
\end{pmatrix}}_{=D}
+
\underbrace{\begin{pmatrix}
0 & a_{12} & \cdots & a_{1n} \\
\vdots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & a_{(n-1),n} \\
\vdots & & & 0 \\
\vdots & & & \vdots \\
0 & \cdots & \cdots & 0
\end{pmatrix}}_{=U}.
\]
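In NumPy, this splitting can be written in a few lines; the small 3 × 3 example matrix below is an arbitrary illustration.

```python
import numpy as np

# Minimal sketch: splitting a square matrix into its strictly lower triangular,
# diagonal, and strictly upper triangular parts.
A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])

L = np.tril(A, k=-1)            # entries below the diagonal
D = np.diag(np.diag(A))         # diagonal entries
U = np.triu(A, k=1)             # entries above the diagonal

print(np.allclose(A, L + D + U))   # True: the three parts sum to A
```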

A square matrix A ∈ R^{n×n} can have a symmetry property: if
\[
a_{ij} = a_{ji} \quad \forall \, 1 \leq i, j \leq n,
\]
we call the matrix symmetric. In other words, A^⊤ = A or, based on eq. (2.4), L = U^⊤. Hence, the diagonal
D of a square matrix A is always symmetric, whereas L and U are only symmetric if L = U = 0.
Note that the previous definitions can be extended to block matrices. For simplicity, we consider a matrix
with k × k blocks:
\[
A = \begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1k} \\
A_{21} & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & A_{(k-1),k} \\
A_{k1} & \cdots & A_{k,(k-1)} & A_{kk}
\end{pmatrix}
\]
with A_{ij} ∈ R^{n_i × m_j} for all 1 ≤ i, j ≤ k. Then, A ∈ R^{n×m}, with
\[
\sum_{i=1}^{k} n_i = n \quad \text{and} \quad \sum_{i=1}^{k} m_i = m.
\]
Again, we consider the decomposition
\[
A = L + D + U,
\]
where L = (L_{ij})_{ij} ∈ R^{n×m} is a lower triangular block matrix with
\[
L_{ij} = \begin{cases} A_{ij} & \text{for } j < i, \\ 0 & \text{otherwise}, \end{cases}
\]
D = (D_{ij})_{ij} ∈ R^{n×m} is a diagonal block matrix with
\[
D_{ij} = \begin{cases} A_{ii} & \text{for } i = j, \\ 0 & \text{otherwise}, \end{cases}
\]
and U = (U_{ij})_{ij} ∈ R^{n×m} is an upper triangular block matrix with
\[
U_{ij} = \begin{cases} A_{ij} & \text{for } i < j, \\ 0 & \text{otherwise}. \end{cases}
\]
Again, we can write D = diag_{1≤i≤k}(A_{ii}). This idea can also be extended analogously to k × l block matrices
with k ≠ l.

Sparse matrices In practice, there are many cases where a large amount of the matrix entries are actually
zero. One typical example is the following: Let us assume that a video-streaming platform has n users and
offers m different videos, movies or series, for streaming. The information whether a user has watched a
movie, or not, could be encoded in terms of a matrix A ∈ {0, 1}^{n×m}, where a_{ij} = 1 corresponds to the case
when user i has seen movie j and a_{ij} = 0 if not. For instance, A could be of the form
\[
A = \begin{pmatrix}
1 & 0 & 0 & \cdots \\
0 & 0 & 0 & \cdots \\
1 & 1 & 1 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix},
\]
meaning that user 1 has seen movie 1 but did not see movies 2 and 3. On the other hand, user 3 has seen
movies 1, 2, and 3. Of course, most users will not have seen the largest part of all movies. A similar example
would be the data for rating the different movies. In case the user could rate a movie with 1 to 5 stars, the
resulting data could be stored as a matrix B ∈ {0, . . . , 5}^{n×m}.
In such cases, where by far most matrix entries have the same value (this value is typically zero), it
is most efficient to only store those entries that differ from this value. We call these matrices sparse. Let
us consider a small example:
\[
A = \begin{pmatrix}
4 & 0 & 0 & 2 & 0 \\
0 & 0 & 0 & 0 & 0 \\
5 & 4 & 3 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 2 & 0 & 0 & 0
\end{pmatrix}. \tag{2.5}
\]
Instead of storing all 25 entries of this matrix, we could store the following triples of row index, column
index, and value of all nonzero entries of A:

{(1, 1, 4) , (1, 4, 2) , (3, 1, 5) , (3, 2, 4) , (3, 3, 3) , (4, 5, 1) , (5, 2, 2)}

The matrix A has 7 nonzero entries, and hence, 21 numbers are sufficient to store the whole matrix; in this
small example, we save 4 numbers in memory for storing A. This format is also denoted as the dictionary
of keys (DOK) format.
Later in this course, we will learn about some unsupervised learning techniques, so-called recommender
systems, specifically aimed at dealing with the data sets described above. In particular, the goal of these
techniques is to fill in the missing data of such sparse matrices. One typical application is name-giving, namely
trying to recommend articles (for instance, videos, movies, and series) to a customer based on information
from previous ratings.
Two other famous sparse formats are the compressed sparse row (CSR) and compressed sparse
column (CSC) formats. In the following example, we will introduce the CSR format; in this format, the
rows are stored as in the DOK format, but we need less data for storing the column information.

Example 2.2. CSR format

Definition 2.1.
The compressed sparse row (CSR) format of a matrix A ∈ R^{n×m} with k non-zero entries is
defined by two 1D arrays val and col ind of length k and another array row ptr of length
n + 1.
Only the k non-zero entries of A are written row-by-row in val, and the corresponding column
indices are written in col ind. row ptr[i] points to the first entry of the i-th row in val,
where the last entry of row ptr points to the first entry in the fictitious (n + 1)-th row.

Considering the exemplary matrix
\[
A = \begin{pmatrix}
4 & 0 & 0 & 2 & 0 \\
0 & 0 & 0 & 0 & 0 \\
5 & 4 & 3 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 2 & 0 & 0 & 0
\end{pmatrix}
\]
from eq. (2.5), we obtain the following CSR matrix format:

val      4 2 5 4 3 1 2
col ind  1 4 1 2 3 5 2
row ptr  1 3 3 6 7 8

We can observe that 2 × 7 + 6 = 20 numbers are sufficient to store the matrix. This is cheaper than
storing the full matrix A as well as storing the DOK format of the matrix.
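For comparison, the following sketch builds the same matrix with scipy.sparse, which stores CSR matrices in exactly these three arrays (with 0-based instead of 1-based indices); this is an illustration, not part of the original example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Minimal sketch: the CSR representation of the example matrix; note that
# scipy uses 0-based indices, while the example above uses 1-based indices.
A = np.array([[4, 0, 0, 2, 0],
              [0, 0, 0, 0, 0],
              [5, 4, 3, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 2, 0, 0, 0]])

A_csr = csr_matrix(A)
print("val    :", A_csr.data)      # [4 2 5 4 3 1 2]
print("col ind:", A_csr.indices)   # [0 3 0 1 2 4 1]
print("row ptr:", A_csr.indptr)    # [0 2 2 5 6 7]
```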

The CSC format is similar to the CSR format, but in CSC, the matrix is stored column-by-column. This
means that we store val, row ind, and col ptr arrays of sizes k, k, and m + 1, respectively; as before, k
is the number of nonzero entries and m is the number of columns.
Exercise 2.2. CSC formats
Write down the CSC format of the matrix
\[
A = \begin{pmatrix}
4 & 0 & 0 & 2 & 0 \\
0 & 0 & 0 & 0 & 0 \\
5 & 4 & 3 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 2 & 0 & 0 & 0
\end{pmatrix}.
\]

Exercise 2.3. Matrix formats


Discuss the memory consumption of storing
• all entries (including the zero entries),
• the DOK format,
• the CSR format, and
• the CSC format
of a sparse matrix A ∈ R^{n×m} with k nonzero entries per row. When is which format the most efficient
one with respect to the memory consumption?

In addition to reducing the memory consumption for storing a sparse matrix, using a sparse matrix format
is also more computationally efficient. For instance, the number of floating point operations for computing
the matrix-vector product
\[
A \cdot x
\]
of a matrix A ∈ R^{n×m} and a vector x ∈ R^m is O(n · m). In case the matrix has only k nonzero entries, the
matrix-vector product requires only O(k) FLOPs. If k ≪ n · m, this is significantly less computationally
demanding than performing a matrix-vector multiplication with a full matrix.
Let us note that combinations of block matrices and sparsity are common: a typical example is a matrix
of the form
\[
A = \begin{pmatrix}
A_{11} & 0 & 0 & A_{14} & 0 \\
0 & 0 & 0 & 0 & 0 \\
A_{31} & A_{32} & A_{33} & 0 & 0 \\
0 & 0 & 0 & 0 & A_{45} \\
0 & A_{52} & 0 & 0 & 0
\end{pmatrix},
\]
where each block is dense but there are many blocks which are completely zero. This is basically the same
as eq. (2.5) but with dense matrices as blocks. In this case, a combination of sparse and dense matrix formats
would be favorable.

2.2 Bilinear forms and norms


In this section, we will define additional operations on matrices and vectors, that is, bilinear forms and
norms. Both are highly relevant for machine learning algorithms. For the usage of norms, we can already
give some intuition.

Bilinear forms We have seen that any matrix defines a linear map between two vector spaces. Moreover,
any matrix A ∈ R^{n×m} also defines a bilinear form
\[
\begin{aligned}
a(\cdot, \cdot) : \mathbb{R}^m \times \mathbb{R}^n &\to \mathbb{R}, \\
(v, w) &\mapsto w^\top A v.
\end{aligned} \tag{2.6}
\]
In order to be a bilinear form, a(·, ·) has to be linear in each argument:
\[
a(c_1 v_1 + c_2 v_2, w) = c_1 a(v_1, w) + c_2 a(v_2, w) \quad \text{and} \quad a(v, c_1 w_1 + c_2 w_2) = c_1 a(v, w_1) + c_2 a(v, w_2),
\]
for all v, v_1, v_2 ∈ R^m, w, w_1, w_2 ∈ R^n, and c_1, c_2 ∈ R.


Exercise 2.4. Bilinear form
Show that a (·, ·) as defined in eq. (2.6) defines a bilinear form.

In case the matrix A ∈ R^{n×n} is square and symmetric, the bilinear form a(·, ·) : R^n × R^n → R is also
symmetric, that is,
\[
a(v, w) = a(w, v) \quad \forall v, w \in \mathbb{R}^n,
\]
and this bilinear form is called
• positive definite or negative definite if a(v, v) > 0 or a(v, v) < 0, respectively, for each 0 ≠ v ∈ R^n,
• positive semi-definite or negative semi-definite if a(v, v) ≥ 0 or a(v, v) ≤ 0, respectively, for each v ∈ R^n.
If the matrix is symmetric positive definite (SPD), the bilinear form is called a scalar product (or inner
product), and
\[
\|v\|_A^2 := a(v, v) = v^\top A v \tag{2.7}
\]
defines a norm on R^n.

Exercise 2.5. Scalar product
Show that a (·, ·) as defined in eq. (2.6) is a scalar product if the matrix A is SPD.

Exercise 2.6. Norm


Show that ‖·‖_A as defined in eq. (2.7) defines a norm on R^n.

If the matrix A is just the identity I, we have
\[
a(v, w) = w^\top I v = w^\top v.
\]
Then, we obtain the scalar product which is known as the Euclidean inner product
\[
\begin{aligned}
(\cdot, \cdot) : \mathbb{R}^n \times \mathbb{R}^n &\to \mathbb{R}, \\
(v, w) &\mapsto w^\top v = \sum_{i=1}^{n} w_i v_i.
\end{aligned} \tag{2.8}
\]
Let us finally remark that there is also the outer product of two vectors
\[
\begin{aligned}
\cdot \otimes \cdot : \mathbb{R}^n \times \mathbb{R}^m &\to \mathbb{R}^{n \times m}, \\
v \otimes w &\mapsto v w^\top.
\end{aligned}
\]

Exercise 2.7. Outer product


For two vectors v ∈ R^n and w ∈ R^m, show that
1. every row of v ⊗ w is a multiple of every other row,
2. every column of v ⊗ w is a multiple of every other column,
3. v ⊗ w is a rank-1 matrix ⇔ v, w ≠ 0.

Vector norms Just before, we have seen that an SPD matrix A induces a vector norm ‖·‖_A. Various
different matrix and vector norms are frequently used in ML. Along with the sub-additivity/triangle inequality

    ‖v + w‖ ≤ ‖v‖ + ‖w‖   for all v, w ∈ V,

and

    ‖cv‖ = |c| ‖v‖   for all c ∈ R, v ∈ V,

any norm ‖·‖ on a vector space V is positive definite:

    ‖v‖ ≥ 0 for all v ∈ V,   and   ‖v‖ = 0 ⇒ v = 0.

It can therefore be used to make the minimization problem

    min_v f(v)

uniquely solvable. In particular, if f is bounded from below, the function

    f(v) + α ‖v‖

will have a unique minimum for a sufficiently large α ∈ R. Moreover, for α → +∞,

    arg min_v {f(v) + α ‖v‖} → 0.

This is also called regularization, and it will be discussed in more detail in the optimization basics as well as
in the second half of the course; see also fig. 2.1.

Figure 2.1: The functions f(x) = sin(x_1) + cos(x_2) (left) and g(x) = sin(x_1) + cos(x_2) + 0.1 ‖x‖^2 (right): while f does not have a unique minimizer, g has one.

An effect of regularization, which is highly relevant and will appear over and over in machine learning, is
indicated in the following example:
Example 2.3. Overfitting

Our goal is to recover the original function

    f(x) = x^3 − 13x^2 + 30x,

as plotted in blue, from noisy data (depicted in red)

    {(x_i, ŷ_i)}_{i=0,...,10}

as well as possible. For now, we will use polynomial regression in order to illustrate overfitting and
how regularization helps to prevent it.
First of all, we make the following observation: if we already knew that the original function is of the
form

    g_{a_0,a_1,a_2,a_3}(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3,

we could easily fit the coefficients to the noisy data by minimizing the squared errors

    arg min_{a_0,a_1,a_2,a_3} \sum_{i=0}^{10} (g_{a_0,a_1,a_2,a_3}(x_i) − ŷ_i)^2.

Figure 2.2: Angle α between the vectors v and w.

As can be seen in the left picture, we would get a quite good fit to the original function. Of course,
since the noise is unknown, it is difficult to recover the original function exactly. Here, adding more
data helps to improve the fit.
In practice, we do not know what kind of (nonlinear) function describes the true relation between
x and y. Therefore, one might use a higher polynomial degree in order to ensure that the model
actually has the capacity to learn this relation. As we can see in the middle image, the resulting model
is a very good fit with respect to the noisy data but not necessarily a good fit of the true function f.
This is called overfitting.
Without going into the details of regularization, we just make an observation about what happens
when adding the norm of the coefficient vector as a regularization term. In particular, we solve the
regularized least-squares problem

    arg min_{a_0,...,a_{10}} \sum_{i=0}^{10} (g_{a_0,...,a_{10}}(x_i) − ŷ_i)^2 + λ ‖(a_0, ..., a_{10})^T‖^2,

with the regularization parameter λ ∈ R_+. Now, setting λ = 0.1, we again obtain a reasonable fit
of the original function, without using the knowledge that the original function was a polynomial of
degree 3; see the plot on the right.
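The following sketch mimics the setting of the example; it is an illustration under assumptions (random noise level, degree-10 model) and not the code used to produce the plots in the original notes. It fits the noisy data once by plain least squares and once with an l2 penalty λ = 0.1 on the coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**3 - 13 * x**2 + 30 * x

# Noisy data (x_i, yhat_i), i = 0, ..., 10; the noise level is an arbitrary choice
x = np.linspace(0, 10, 11)
y_hat = f(x) + rng.normal(scale=5.0, size=x.shape)

# Design matrix for a degree-10 polynomial g(x) = a_0 + a_1 x + ... + a_10 x^10
X = np.vander(x, N=11, increasing=True)

# Plain least squares: arg min ||X a - y_hat||^2  (interpolates the noise -> overfitting)
a_ls, *_ = np.linalg.lstsq(X, y_hat, rcond=None)

# Regularized least squares: arg min ||X a - y_hat||^2 + lam * ||a||^2,
# solved via the regularized normal equations (X^T X + lam I) a = X^T y_hat
lam = 0.1
a_reg = np.linalg.solve(X.T @ X + lam * np.eye(11), X.T @ y_hat)

x_fine = np.linspace(0, 10, 200)
X_fine = np.vander(x_fine, N=11, increasing=True)
err_ls = np.max(np.abs(X_fine @ a_ls - f(x_fine)))
err_reg = np.max(np.abs(X_fine @ a_reg - f(x_fine)))
print(f"max deviation from f: unregularized {err_ls:.1f}, regularized {err_reg:.1f}")
```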

Let us discuss some typical examples of vector norms. The standard Euclidean norm of a vector v ∈ R^n
is given by the square root of the sum of the squares of the vector entries,

    ‖v‖ := \sqrt{\sum_{i=1}^n v_i^2}.   (2.9)

It can also be defined using the Euclidean inner product:

    ‖v‖^2 = (v, v).   (2.10)

Note that the Euclidean inner product of two vectors v and w is a measure for the angle α between them;
cf. fig. 2.2. In particular, now that we have defined the Euclidean norm, we have

    cos(α) = (v, w) / (‖v‖ ‖w‖).

As a consequence, if two vectors are orthogonal, meaning that the angle is (1/2 + k)π for some k ∈ Z, we
have that

    (v, w) = cos((1/2 + k)π) ‖v‖ ‖w‖ = 0.

Furthermore, we obtain the following theorem:
Theorem 2.2. Cauchy-Schwarz inequality

|(v, w)| ≤ kvk kwk ∀v, w ∈ Rn .

In fact, the definition of the Euclidean norm can be generalized to the l_p-norm as follows:

    ‖v‖_p := \left( \sum_{i=1}^n |v_i|^p \right)^{1/p}.   (2.11)

Then, the Euclidean norm is the special case p = 2, that is,

    ‖v‖ = ‖v‖_2.   (2.12)
The following exercise will deal with these norms and the case p = ∞; see the definition in eq. (2.13).

Exercise 2.8. l_p norm
Show that ‖·‖_p as defined in eq. (2.11) defines a norm on R^n and, for n = 2, visualize the set of all
solutions of

    ‖v‖_p = 1

for increasing values of p (starting with 1) as well as the set of all solutions of

    ‖v‖_∞ := max_{i=1,...,n} |v_i| = 1.   (2.13)

Vectors of length 1 (with respect to a certain norm) play a special role in terms of normalization. They
are also denoted as unit vectors. To compute the unit vector v′ corresponding to some vector v, we can
just normalize it via

    v′ = v / ‖v‖.

We obtain that ‖v′‖ = 1.
Different norms are also used to measure errors in machine learning: if a machine learning model predicts
an output y and the correct output is ŷ, then the error is measured as

    ‖ŷ − y‖.

In particular, this is only zero if ŷ = y. As we will discuss later, different norms might be used, depending on
the situation.
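As a small sketch (with arbitrarily chosen vectors), the l_p-norms from eq. (2.11) and eq. (2.13), the normalization to a unit vector, and an error measurement can all be evaluated with NumPy:

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])

norm_1 = np.linalg.norm(v, ord=1)         # sum of absolute values        -> 7.0
norm_2 = np.linalg.norm(v)                # Euclidean norm (default p=2)  -> 5.0
norm_inf = np.linalg.norm(v, ord=np.inf)  # maximum absolute entry        -> 4.0
print(norm_1, norm_2, norm_inf)

# Normalization to a unit vector v' = v / ||v||, here w.r.t. the Euclidean norm
v_unit = v / norm_2
print(np.linalg.norm(v_unit))             # 1.0

# A prediction error measured in two different norms
y_hat = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.0])
print(np.linalg.norm(y_hat - y, ord=1), np.linalg.norm(y_hat - y, ord=2))
```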

Matrix norms We will also have to deal with different matrix norms. Since matrices are just linear
operators, we can always define a matrix norm as the operator norm induced by a vector norm:

    ‖A‖ = max_{x ≠ 0} ‖Ax‖ / ‖x‖ = max_{‖x‖=1} ‖Ax‖,   (2.14)

where ‖·‖ is some vector norm. Some examples, for A = (a_{ij}) ∈ R^{n×m}, are


• the column sum norm

    ‖A‖_1 = max_{‖x‖_1=1} ‖Ax‖_1 = max_{j=1,...,m} \sum_{i=1}^n |a_{ij}|,   (2.15)

• the spectral norm

    ‖A‖_2 = max_{‖x‖_2=1} ‖Ax‖_2,   (2.16)

• the row sum norm

    ‖A‖_∞ = max_{‖x‖_∞=1} ‖Ax‖_∞ = max_{i=1,...,n} \sum_{j=1}^m |a_{ij}|.   (2.17)

Exercise 2.9. Matrix norms
Consider the matrix

    A = \begin{pmatrix} 2 & 1 \\ 0 & 0.5 \end{pmatrix}

and visualize

• Ax for all x with ‖x‖_1 = 1 as well as ‖Ax‖_1,

• Ax for all x with ‖x‖_2 = 1 as well as ‖Ax‖_2, and

• Ax for all x with ‖x‖_∞ = 1 as well as ‖Ax‖_∞.

Another important norm is the so-called Frobenius norm

    ‖A‖_F := \sqrt{\sum_{i=1}^n \sum_{j=1}^m |a_{ij}|^2},   (2.18)

which will play an important role in dimension reduction techniques based on the singular value decomposition,
which will be introduced later. In the machine learning community, the square of the Frobenius
norm, ‖A‖_F^2, is also called the energy of the matrix. It can equivalently be written using the Frobenius inner product,

    ‖A‖_F^2 = (A, A)_F,   (2.19)

where the Frobenius inner product is defined as

    (A, B)_F := tr(A^T B).

Exercise 2.10. Frobenius norm


Verify that eq. (2.18) and eq. (2.19) are equivalent.

2.3 Matrix powers and polynomials


Matrix powers and polynomials By repeatedly multiplying a square matrix A ∈ R^{n×n} with itself, we
can define the power of a matrix,

    A^d = A ··· A (d times) = \prod_{i=1}^d A,

in analogy with powers of a scalar; of course, for a non-square matrix, we cannot even compute AA.
Moreover, in analogy to c^0 = 1 for any c ∈ R, we define A^0 = I, where I ∈ R^{n×n} is the identity matrix. For
matrices, it is possible to have

    AB = 0

for A, B ≠ 0. It is also possible that

    A^d = 0

for A ≠ 0.

Definition 2.2. Nilpotent matrix of index d
We call a matrix A nilpotent of index d if

    A^d = 0

but

    A^{d−1} ≠ 0.

Exercise 2.11. Nilpotent matrix


Give an example for a nilpotent of index d matrix N ∈ Rd×d , for the cases d = 2, 3 and then for the
general case d = n.

Raising a matrix to a high power is an operation that, if done ‘as written on paper’ involves a lot of
matrix multiplications – an operation which is rather expensive. Whenever possible, this should be avoided.
Therefore, let us discuss how to efficiently compute powers for a certain class of matrices, that is, for
diagonalizable matrices.
Definition 2.3. Diagonalizable matrix
A square matrix A ∈ R^{n×n} is called diagonalizable if an invertible matrix V ∈ R^{n×n} and a diagonal
matrix D ∈ R^{n×n} exist such that

    A = V D V^{-1}.

If a matrix A is diagonalizable, we have that

    A^d = (V D V^{-1})^d = (V D V^{-1})(V D V^{-1}) ··· (V D V^{-1})  (d times).

Since V^{-1} V = I, we obtain

    A^d = V D^d V^{-1}.

Recall that D is a diagonal matrix, and it is easy to compute a power of a diagonal matrix,

    D^d = (diag(d_{ii}))^d = diag(d_{ii}^d),

that is, by computing the power of the diagonal entries. This also gives us the possibility to compute a
fractional power of a diagonal matrix,

    D^{1/d} = diag(d_{ii}^{1/d}),

and, based on this,

    D^r = diag(d_{ii}^r),

with r ∈ Q_+ (rational numbers), if d_{ii}^r exists for all 1 ≤ i ≤ n. In particular, it exists for non-negative
diagonal entries.
From the definition of fractional and rational powers of a diagonal matrix, we can extend this definition
to diagonalizable matrices. If A = V D V^{-1}, then

    A^{1/d} = V D^{1/d} V^{-1}   and   A^r = V D^r V^{-1}.
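As a small sketch (for a symmetric matrix, where a real eigendecomposition with an orthogonal V is guaranteed), powers and fractional powers can be computed via the diagonalization A = V D V^{-1}:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # symmetric, hence diagonalizable

# Eigendecomposition A = V D V^{-1}; for symmetric A, V can be chosen orthogonal
eigvals, V = np.linalg.eigh(A)

A_cubed = V @ np.diag(eigvals**3) @ V.T       # A^3 = V D^3 V^{-1}
print(np.allclose(A_cubed, A @ A @ A))        # True

A_sqrt = V @ np.diag(eigvals**0.5) @ V.T      # A^{1/2} = V D^{1/2} V^{-1}
print(np.allclose(A_sqrt @ A_sqrt, A))        # True
```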
A matrix polynomial p of degree n of a matrix A is given by the expression

    p(A) = \sum_{k=0}^n a_k A^k,

with a_k ∈ R, for 0 ≤ k ≤ n. It is easy to see that two matrix polynomials f(A) and g(A) with the same
matrix A commute, that is,

    f(A) g(A) = g(A) f(A).
Again, analogously to the scalar case, we can also define negative powers of a matrix. We have that

    A^1 A^{-1} = A^{-1} A^1 = A^0 = I.

Therefore, A^{-1} has to be the (multiplicative) inverse of A. The inverse of a matrix exists if it has full
rank. This means that the matrix has to be a square matrix, and that all columns and rows have to be
linearly independent.

Definition 2.4. Invertible matrix
We call a matrix A invertible or nonsingular if there exists a matrix A^{-1} such that

    A A^{-1} = A^{-1} A = I.

Otherwise, we call the matrix singular.

As before, for the case of positive powers of a matrix A, we can define negative powers of an invertible
matrix,

    A^{-d} = A^{-1} ··· A^{-1} (d times) = \prod_{i=1}^d A^{-1}.

If A is additionally diagonalizable, we can also define

    A^{-1/d} = V D^{-1/d} V^{-1}   and   A^{-r} = V D^{-r} V^{-1},

for d ∈ N and r ∈ Q_+ if D^{-1/d} and D^{-r} exist. Therefore, we have extended powers to any rational number.
An interesting observation is that, as for scalars,

    I − A^{n+1} = (I + A + ... + A^n)(I − A)

and, by replacing A by −A,

    I − (−A)^{n+1} = (I − A + A^2 − A^3 + ... + (−A)^n)(I + A).

Now, if A^n → 0 for n → ∞, with respect to some matrix norm ‖·‖, we obtain

    (I + A)^{-1} = I − A + A^2 − A^3 + ...
    (I − A)^{-1} = I + A + A^2 + A^3 + ...

Similarly, we obtain, without proving it, the following lemma:

Lemma 2.1. Matrix inversion lemma
Let A ∈ R^{n×n} be invertible and 0 ≠ u, v ∈ R^n. Then,

    A + u v^T (= A + u ⊗ v)

is invertible if and only if

    v^T A^{-1} u ≠ −1.

Furthermore, in this case,

    (A + u v^T)^{-1} = A^{-1} − \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}.   (2.20)

For a proof, see, for example, [1, Section 1.2.5, Lemma 1.2.5]. Matrix updates of the form

    A + u v^T

are called rank 1 updates, and they are often used in so-called quasi-Newton methods for minimizing
nonlinear functions or solving nonlinear equations; those will be discussed in more detail when we discuss
the basics of optimization.
Example 2.4. Discussion of the computational work
Let us assume that A ∈ R^{n×n} and its inverse A^{-1} are given, and let both matrices be dense. Furthermore,
let u, v ∈ R^n. Equation (2.20) can be split into the following computations:

    û = A^{-1} u                               O(n^2) FLOPs
    α = 1 + v^T A^{-1} u = 1 + v^T û           O(n) FLOPs
    v̂^T = v^T A^{-1}                           O(n^2) FLOPs
    Â = û v̂^T                                 O(n^2) FLOPs
    (A + u v^T)^{-1} = A^{-1} − (1/α) Â        O(n^2) FLOPs

This means that this computation needs O(n^2) FLOPs in total, whereas the direct computation of the
inverse needs O(n^3) FLOPs. Hence, depending on n, using lemma 2.1
can be considerably faster than computing (A + u v^T)^{-1} from scratch.
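The computation above can be written out as a short NumPy sketch (random test data, assuming A and A^{-1} are already available):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
A_inv = np.linalg.inv(A)                           # assumed to be known already
u = rng.standard_normal(n)
v = rng.standard_normal(n)

# Rank-1 update of the inverse via eq. (2.20), using only O(n^2) work
u_hat = A_inv @ u
alpha = 1.0 + v @ u_hat
v_hat = v @ A_inv
B_inv = A_inv - np.outer(u_hat, v_hat) / alpha

# Reference: recompute the inverse from scratch, which costs O(n^3)
B_inv_direct = np.linalg.inv(A + np.outer(u, v))
print(np.allclose(B_inv, B_inv_direct))            # True
```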

Furthermore, the following, more general, extension of lemma 2.1 to higher-rank updates of a matrix will
be important:

Theorem 2.3. Sherman–Morrison–Woodbury identity
Let A ∈ R^{n×n} be invertible and U, V ∈ R^{n×k} for some small k, that is, k << n. Then, the matrix

    A + U V^T

is invertible if and only if

    I + V^T A^{-1} U

is invertible. Then, the inverse is given by

    (A + U V^T)^{-1} = A^{-1} − A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}.   (2.21)

It can be seen that lemma 2.1 corresponds to the special case k = 1 in theorem 2.3. The matrix U V^T
is a rank k matrix, and

    A + U V^T

is called a rank k update or, since k << n, a low-rank update. If A^{-1} is known, the computational work
for computing eq. (2.21) is generally much lower than computing

    (A + U V^T)^{-1}

directly. We will discuss the solution of linear equation systems in section 2.6; then, it will become clearer why
it is important to save computational work.

Krylov subspaces A matrix polynomial p(A) for a matrix A ∈ R^{n×n} also yields a matrix, which can be
applied to some vector v:

    p(A) v = \sum_{k=0}^m a_k A^k v.

Definition 2.5. Krylov subspace of order m
The space

    K_m(A, v) := span{v, Av, A^2 v, ..., A^{m−1} v},

that is, the space spanned by the vectors {A^k v}_{0 ≤ k ≤ m−1} for some v ∈ R^n, is called the Krylov
subspace of order m.
The Krylov subspace is the space of all vectors that can be written as p (A) v, where p is a polynomial of
maximum degree m − 1. For more details on Krylov spaces and the corresponding Krylov subspace methods,
see [41]; furthermore, we will also briefly discuss Krylov subspace methods in section 2.4. These methods
are relevant for iteratively solving linear equation systems or eigenvalue problems.
We only discuss some properties of these spaces here before discussing their use shortly. Consider the
following definition:
Definition 2.6. Minimal polynomial of v
Let v ∈ R^n. The non-zero matrix polynomial p of lowest degree such that

    p(A) v = 0

is called the minimal polynomial of v with respect to A. We call the degree of this polynomial
the grade of v with respect to A.

It is immediately clear that the grade µ of v has to be less than or equal to n. Otherwise, we would have

    dim(K_µ(A, v)) = µ > n,

which is impossible since K_µ(A, v) ⊂ R^n.


Then, we have the following lemmata:
Lemma 2.2.
Let µ be the grade of v (with respect to A). Then Kµ := Kµ (A, v) is invariant under the application
of A, that is,
AKµ ⊂ Kµ ,
and Km = Kµ for all m ≥ µ. Here,

AKµ := span Av, A2 v, . . . , Aµ v .




Lemma 2.3.
The Krylov subspace Km has dimension m if and only if the grade µ of v with respect to A is not
less than m:
dim (Km ) = m ⇔ grade (v) ≥ m.
Hence,
dim (Km ) = min {m, grade (v)} .

See [41, Section 6.2, Propositions 6.1 and 6.2] for more details.
As we will see, Krylov subspaces have favorable properties:
• They can be computed relatively efficiently, without changing the sparsity pattern of the matrix: sparse
remains sparse.

• They serve well as reduced dimensional spaces; we will discuss later what this means in practice.

2.4 Orthogonalization
As a next step, we discuss how to orthogonalize and orthonormalize a set of vectors a_1, ..., a_m. This will
also give a way of computing the rank of a matrix A = [a_1, ..., a_m], and hence, to determine how many
dimensions a given data set spans and, if possible, whether some dimensions can
be dropped to reduce the data set size. This will also enable us to define a first matrix factorization,

    A = QR,

which can be employed to solve linear equation systems. In principle, it is convenient to transform a matrix
into a collection of orthogonal objects because then all kinds of matrix operations become numerically
safer.

Gram–Schmidt orthogonalization Consider the vectors a_1, ..., a_m ∈ R^n and the matrix

    A = [a_1, ..., a_m],

consisting of a_1, ..., a_m as columns. Of course, the vectors could be linearly dependent, and the rank
of A could be smaller than m (and n). In order to construct a basis of the space V = span{a_1, ..., a_m}
and to determine the rank of A, we can orthogonalize the vectors a_1, ..., a_m using the Gram–Schmidt
orthogonalization algorithm. Let us first assume that A has full column rank.

Algorithm 1: Gram–Schmidt orthogonalization algorithm
Data: Linearly independent a_1, ..., a_m ∈ R^n
q_1 = a_1;
for i = 2 to m do
    q_i = a_i;
    for j = 1 to i − 1 do
        q_i = q_i − ((a_i, q_j)/(q_j, q_j)) q_j;
    end
end
Result: Orthogonal q_1, ..., q_m ∈ R^n
Here, the inner for-loop can be written out as follows:

    q_i = a_i − ((a_i, q_1)/(q_1, q_1)) q_1 − ... − ((a_i, q_{i−1})/(q_{i−1}, q_{i−1})) q_{i−1},

which is also the same as

    q_i = a_i − ((a_i, q_1)/‖q_1‖^2) q_1 − ... − ((a_i, q_{i−1})/‖q_{i−1}‖^2) q_{i−1}.
If the rank of A is m, the resulting vectors q1 , . . . , qm will be orthogonal:

Definition 2.7. Orthogonal vectors
A set of vectors v1 , . . . , vm is called orthogonal if orthogonal

(vi , vj ) = 0

for all i 6= j.
They are called orthonormal if additionally orthonormal

kvi k = 1

for all i.

In order to obtain orthonormal vectors, the algorithm has to be slightly modified:

Algorithm 2: Gram–Schmidt orthonormalization algorithm
Data: Linearly independent a_1, ..., a_m ∈ R^n
q_1 = a_1/‖a_1‖;
for i = 2 to m do
    q_i = a_i;
    for j = 1 to i − 1 do
        q_i = q_i − (a_i, q_j) q_j;
    end
    q_i = q_i/‖q_i‖;
end
Result: Orthonormal q_1, ..., q_m ∈ R^n
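A direct translation of alg. 2 into NumPy could look as follows; this is a sketch assuming linearly independent columns, with an arbitrary random test matrix.

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram-Schmidt orthonormalization of the columns of A (alg. 2)."""
    n, m = A.shape
    Q = np.zeros((n, m))
    Q[:, 0] = A[:, 0] / np.linalg.norm(A[:, 0])
    for i in range(1, m):
        q = A[:, i].copy()
        for j in range(i):
            q -= (A[:, i] @ Q[:, j]) * Q[:, j]   # subtract the projection onto q_j
        Q[:, i] = q / np.linalg.norm(q)
    return Q

A = np.random.default_rng(0).standard_normal((5, 3))
Q = gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(3)))           # approximately the identity
```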

Exercise 2.12. Gram–Schmidt orthogonality


Prove that, if a1 , . . . , am ∈ Rn are linearly independent, algs. 1 and 2 produce orthogonal and or-
thonormal vectors q1 , . . . , qm , respectively.

If the rank of A is smaller than m, it is clearly not possible to find m orthogonal vectors from the space
V . Investigate this:
Exercise 2.13. Modification of Gram–Schmidt

1. How does the algorithm fail if a_1, ..., a_m ∈ R^n are linearly dependent?

2. Modify algs. 1 and 2 such that they do not fail and still generate an orthogonal or orthonormal,
respectively, basis of V.

As nice as it is, it turns out that Gram–Schmidt is numerically unstable (a notion to be defined below), and,
when executed on a computer, it often fails to accurately produce orthogonal vectors due to rounding errors.
Let us now build up some understanding of what numerical stability means.
Example 2.5. Numerical stability of the Gram–Schmidt algorithm
Consider the three vectors

    a_1 = (1, 0.01, 0)^T,   a_2 = (1, 0, 0.01)^T,   a_3 = (1, 0.01, 0.01)^T.

By performing Gram–Schmidt orthonormalization with 10^{-3} accuracy, we obtain the vectors

    q_1 = (1, 0.01, 0)^T,   q_2 = (0, −0.707, 0.707)^T,   q_3 = (0, 0, 1)^T.

We notice that, even when also computing the inner products with 10^{-3} accuracy, we obtain

    (q_1, q_1) = 1        (q_1, q_2) = 0        (q_1, q_3) = 0
    (q_2, q_1) = 0.01     (q_2, q_2) = 1        (q_2, q_3) = 0.707
    (q_3, q_1) = 0        (q_3, q_2) = 0.707    (q_3, q_3) = 1

Hence, the vectors are far from orthogonal.

Conditioning and stability The concepts of conditioning and stability are important for investigating
numerical schemes. Here, we will discuss them for the solution map of an abstract problem

    f: X → Y,

where X and Y are normed vector spaces (vector spaces with a norm). The space X contains the (input)
data, and Y contains the solutions (or targets). Let us note that conditioning is a property of the problem,
whereas stability is a property of a numerical algorithm to solve the problem.
Therefore, let us first discuss the conditioning of a problem. We call a problem well-conditioned
if a small change in the data x yields only small changes in the corresponding solution f(x). This means
that, if there are only small perturbations in the data, the resulting solution also changes only mildly. In
the context of numerical computations, a typical type of perturbation of the data are rounding errors
caused by storing scalars as floating-point numbers and performing the computations in floating-point
arithmetic.
Example 2.6. Floating-point numbers
In floating-point format, a scalar is stored as follows:

    sign   significand × base^exponent

This format is standardized according to the IEEE Standard for Floating-Point Arithmetic (IEEE 754):

    precision   sign   exponent   significand field   total bits   ε
    half        1      5          10                  16           2^{-11}
    single      1      8          23                  32           2^{-24}
    double      1      11         52                  64           2^{-53}

The relative error due to rounding is bounded by the machine precision ε (also denoted as the machine
epsilon):

    |fl(x) − x| / |x| ≤ ε

This error depends on the format. It can be reduced, but generally it cannot be prevented.
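The machine precision of these formats can be queried directly in NumPy (a small sketch); note that NumPy's `eps` is the gap between 1 and the next representable number, i.e., twice the unit roundoff ε listed in the table above.

```python
import numpy as np

# Machine epsilon (spacing of 1.0) for half, single, and double precision
print(np.finfo(np.float16).eps)   # 2**-10 ~ 9.77e-04 (half)
print(np.finfo(np.float32).eps)   # 2**-23 ~ 1.19e-07 (single)
print(np.finfo(np.float64).eps)   # 2**-52 ~ 2.22e-16 (double)

# A typical rounding effect in double precision
print(0.1 + 0.2 == 0.3)           # False
print(abs((0.1 + 0.2) - 0.3))     # ~5.6e-17, on the order of the machine precision
```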

Problems that are badly conditioned are also called ill-conditioned. This means that small changes in
the data x result in relatively large changes in the solution f(x). It is clear that well-conditioned problems
are much more favorable than ill-conditioned problems. In numerical computations, we always have to
deal with small errors in the data (rounding errors), and therefore, ill-conditioned problems can be highly
problematic.
The conditioning of a problem is measured by the condition number:

Definition 2.8. Condition number
• The absolute condition number of a problem f is defined as

    κ_abs := κ_abs(x) := sup_{δx ∈ D} ‖f(x + δx) − f(x)‖ / ‖δx‖.

• The (relative) condition number of a problem f is defined as

    κ := κ(x) := sup_{δx ∈ D} \frac{‖f(x + δx) − f(x)‖}{‖f(x)‖} \frac{‖x‖}{‖δx‖}.   (2.22)

Here, D is a neighborhood of admissible perturbations δx, that is, such that f(x + δx) yields an admissible
solution in Y.
For the matrix-vector multiplication f(x) = Ax, we obtain

    κ = sup_{δx ∈ D} \frac{‖A(x + δx) − Ax‖}{‖Ax‖} \frac{‖x‖}{‖δx‖} = sup_{δx ∈ D} \frac{‖Aδx‖}{‖δx‖} \frac{‖x‖}{‖Ax‖} ≤ ‖A‖ \frac{‖x‖}{‖Ax‖}.

Furthermore, if A is invertible,

    \frac{‖x‖}{‖Ax‖} ≤ sup_{z ∈ X} \frac{‖z‖}{‖Az‖} = sup_{y ∈ Y} \frac{‖A^{-1} y‖}{‖y‖} = ‖A^{-1}‖   (with y := Az).
We obtain the following theorem:
Theorem 2.4.
Let the matrix A ∈ R^{n×n} be invertible. The (relative) condition number of the matrix-vector multiplication
Ax = b is

    κ(x) ≤ ‖A‖ ‖A^{-1}‖,

and the (relative) condition number of solving the linear equation system Ax = b, which corresponds to
the matrix-vector multiplication A^{-1} b = x, is also

    κ(b) ≤ ‖A‖ ‖A^{-1}‖.


From that, we derive the definition:


Definition 2.9. Condition number of a matrix
We define the condition number of an invertible matrix A ∈ R^{n×n} as

    κ := κ(A) = ‖A‖ ‖A^{-1}‖.

Exercise 2.14.
Show that κ(A) ≥ 1 for an invertible matrix A ∈ R^{n×n}.
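As a sketch, condition numbers can be computed with NumPy (the matrices below are arbitrary examples):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])        # almost singular, hence badly conditioned

# Condition number kappa(A) = ||A|| * ||A^{-1}||, here w.r.t. the spectral norm
kappa_manual = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
kappa_numpy = np.linalg.cond(A, 2)
print(kappa_manual, kappa_numpy)      # both of order 1e4

# An orthogonal matrix has the optimal condition number 1
Q = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
print(np.linalg.cond(Q, 2))           # 1.0
```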

In contrast to the conditioning of a problem, stability is a property of a numerical algorithm. Let

    f̃: X → Y

be a numerical algorithm for solving an abstract problem f: X → Y.

Figure 2.3: Application of an algorithm f̃ to a problem f, backward error δx, and forward error δy.

Consider the relative error resulting from solving the problem with the numerical algorithm f̃:

    ‖f̃(x) − f(x)‖ / ‖f(x)‖.

We call an algorithm exact if

    ‖f̃(x) − f(x)‖ / ‖f(x)‖ = O(ε),

that is, if the error is in the order of the machine precision; in practice, this is the best we could hope for.
However, exactness of an algorithm is almost never achievable in practice. Instead, we will consider stability;
cf. [47, p. 104].
We will now define forward and backward stability; see also fig. 2.3 for the forward error and backward
error appearing in these definitions.
Definition 2.10. Stability
We call an algorithm stable if

    ‖f̃(x) − f(x̃)‖ / ‖f(x̃)‖ = O(ε)

for some x̃ with

    ‖x̃ − x‖ / ‖x‖ = O(ε).

In other words:
A stable algorithm gives almost the exact answer f̃(x) to an almost exact question x̃.

Definition 2.11. Backward stability
We call an algorithm backward stable if, for every x ∈ X, there is an x̃ ∈ X such that

    ‖x̃ − x‖ / ‖x‖ = O(ε)

and

    f̃(x) = f(x̃).

In other words:
A backward stable algorithm f̃ gives the exact answer f̃(x) to an almost correct question x̃.

It is shown in [47, theorem 15.1] that:

Theorem 2.5.
Let f̃ be a backward stable algorithm for the solution of f: X → Y. Then,

    ‖f̃(x) − f(x)‖ / ‖f(x)‖ ≤ O(κ(x) ε),

where κ(x) is the (relative) condition number of the problem f.

This means that backward stable algorithms exhibit the best stability behavior we could hope for since
the error is only in the order of the machine precision times the condition number of the problem, which
cannot be influenced by the algorithm.
Example 2.7. Modified Gram–Schmidt algorithm
As we have already mentioned earlier, the classical alg. 1 is not numerically stable. By
a modification of the inner for-loop, the stability of the algorithm can be improved:

Algorithm 3: Modified Gram–Schmidt orthonormalization algorithm (inner part for the ith vector)
q_i = a_i;
for j = 1 to i − 1 do
    q_i = q_i − (q_i, q_j) q_j;
end
q_i = q_i/‖q_i‖;

The difference to alg. 2 is that the projections are applied to the current, already partially orthogonalized
vector q_i instead of the original vector a_i. Later, we will discuss further approaches for stabilizing the Gram–Schmidt procedure.
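A sketch comparing the loss of orthogonality of the classical and the modified variant on an ill-conditioned test matrix (the matrix with nearly linearly dependent columns is an arbitrary choice):

```python
import numpy as np

def gram_schmidt(A, modified=False):
    """Classical (alg. 2) or modified (alg. 3) Gram-Schmidt orthonormalization."""
    n, m = A.shape
    Q = np.zeros((n, m))
    for i in range(m):
        q = A[:, i].copy()
        for j in range(i):
            coeff = (q if modified else A[:, i]) @ Q[:, j]
            q -= coeff * Q[:, j]
        Q[:, i] = q / np.linalg.norm(q)
    return Q

# Ill-conditioned test matrix (nearly linearly dependent columns)
eps = 1e-8
A = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]])

for modified in (False, True):
    Q = gram_schmidt(A, modified)
    err = np.linalg.norm(Q.T @ Q - np.eye(3))
    print("modified" if modified else "classical", err)
```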

QR factorization Based on the Gram–Schmidt algorithm, we will derive the QR factorization (or QR
decomposition) of a matrix. In the general case of A ∈ R^{n×m}, with n ≥ m and full column rank, we can
derive a factorization

    A = QR,   (2.23)

where Q ∈ R^{n×n} is an orthogonal matrix and R ∈ R^{n×m} is an upper triangular matrix. This factorization
is called the QR factorization.
Definition 2.12. Orthogonal and semi-orthogonal matrix
A semi-orthogonal matrix Q ∈ R^{n×m} is a matrix with m orthonormal columns, that is, a matrix
with the property

    Q^T Q = I_m ∈ R^{m×m}.   (2.24)

An orthogonal matrix is a semi-orthogonal matrix with full rank. In other words, it is a square
matrix with orthonormal columns. In this case, the inverse of the matrix Q is its transpose Q^T.

For the case n > m, we have that

    Q = [Q_1  Q_2],   R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix},

where Q_1 ∈ R^{n×m} and Q_2 ∈ R^{n×(n−m)} are semi-orthogonal, R_1 ∈ R^{m×m} has full rank, and

    A = QR = [Q_1  Q_2] \begin{pmatrix} R_1 \\ 0 \end{pmatrix} = Q_1 R_1.

This is an alternative form of the QR factorization (often called the reduced QR factorization), and we
will focus on this variant. Therefore, we will just use the notation Q = Q_1 and R = R_1; for n = m, both variants of the
QR factorization coincide.
One way to compute the QR decomposition of a matrix A is to apply the Gram–Schmidt orthonormalization
algorithm to the columns of A. Since we assume that A has full column rank, we can just
perform algs. 2 and 3 without the modifications discussed in exercise 2.13. The result will be the columns
of the semi-orthogonal matrix Q, and the coefficients in the algorithm will yield the matrix R.
Algorithm 4: QR factorization via the Gram–Schmidt orthonormalization algorithm
Data: A = [a_1, ..., a_m] ∈ R^{n×m}, R = 0 ∈ R^{m×m}
r_{11} = ‖a_1‖;
q_1 = a_1/r_{11};
for i = 2 to m do
    q_i = a_i;
    for j = 1 to i − 1 do
        r_{ji} = (a_i, q_j);
        q_i = q_i − r_{ji} q_j;
    end
    r_{ii} = ‖q_i‖;
    q_i = q_i/r_{ii};
end
Result: Semi-orthogonal Q = [q_1, ..., q_m], upper triangular R
The following exercise deals with the verification of the desired properties of the resulting matrices.
Exercise 2.15. QR factorization
Verify that
• Q is semi-orthogonal and R is an upper triangular matrix,
• A = QR if Q and R have been computed using alg. 4, and
• Q^T Q = I.

Once the QR decomposition of an invertible matrix A ∈ R^{n×n} has been computed, the linear equation
system

    Ax = b

can be solved easily. In particular,

    Ax = b  ⇔  QRx = b  ⇔  Rx = Q^T b.   (2.25)

Since R is of the form

    R = \begin{pmatrix} r_{11} & \cdots & r_{1n} \\ & \ddots & \vdots \\ 0 & & r_{nn} \end{pmatrix},

we can solve this system row-by-row from the last to the first row, that is, using backward substitution.
For a dense matrix A, while the computational complexity of the QR decomposition is O(n^3), it is
only O(n^2) for computing Q^T b and solving Rx = Q^T b. Therefore, if a QR factorization has been computed, solving
a linear equation system with A is relatively cheap. Unfortunately, if A is sparse, the factors Q and R can
be denser compared to the original matrix.
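A sketch of this two-step solve (QR factorization once, then the cheap triangular solve); the back substitution is written out explicitly for illustration, and the test system is random.

```python
import numpy as np

def backward_substitution(R, y):
    """Solve R x = y for an upper triangular, invertible R."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)

Q, R = np.linalg.qr(A)                  # O(n^3), done once
x = backward_substitution(R, Q.T @ b)   # O(n^2) per right-hand side
print(np.allclose(A @ x, b))            # True
```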
Exercise 2.16. Condition number of an orthogonal matrix
Show that
• the condition number of an orthogonal matrix Q is 1 and so is the condition number of Q> and

• the condition number of R is the same as the condition number of A.

As a result of exercise 2.16, solving Ax = b has the same conditioning as solving Rx = Q> b.

Solving an overdetermined system using the QR factorization Let us now consider the case of
a rectangular matrix A ∈ R^{n×m}, with n > m and linearly independent columns. Then, the linear equation
system

    Ax = b   (2.26)

is overdetermined since there are more equations than variables. It only has a solution if b ∈ range(A).
In this case, the solution is unique since we assumed that the columns are linearly independent.

Definition 2.13. Well-posedness (Hadamard (1902))

A problem f : X → Y is well-posed for some x ∈ X if well-posed

• a solution f (x) exists,

• the solution is unique, and


• the solution f (x) depends continuously on the data x.
If these conditions are not satisfied, the problem is called ill-posed. ill-posed

Example 2.8.
The problem f(x) := x^3 + 1 = 2, for x ∈ R, is well-posed. We have that

    x^3 + 1 = 2  ⇒  x^3 = 1  ⇒  x = 1.

Thus, there is a unique solution in R. Moreover, f(x) is continuously
differentiable in a neighborhood of the solution x = 1 and df/dx(1) =
3 ≠ 0. By the implicit function theorem, we obtain a continuously
differentiable inverse function of f in a neighborhood of x = 1. Hence,
the solution depends continuously on the data.

(The example is accompanied by a plot of x^3 + 1 on the interval [−2, 2].)

We come back to the linear equation system eq. (2.26):


Exercise 2.17. Well-posedness
Show that eq. (2.26) is well-posed if
• A has full column rank and

• b ∈ range (A).

If b ∉ range(A), there cannot be an exact solution of eq. (2.26). However, we can still try to find a vector
that is as close as possible to solving eq. (2.26). More precisely, let us consider the solution which is most
accurate in terms of the Euclidean norm of the error, that is,

    arg min_{x ∈ domain(A)} ‖Ax − b‖.

As we will see, it is more convenient to instead consider the equivalent problem:

    arg min_{x ∈ domain(A)} ‖Ax − b‖^2.   (2.27)

We also call this the least-squares problem because it corresponds to minimizing the square of the
Euclidean norm, or l_2-norm, of the error.
Note that, in case A is invertible, ‖Ax − b‖^2 is minimized if Ax = b. Hence, both problems eq. (2.26)
and eq. (2.27) are equivalent in this case.
As it turns out, we can derive a formula for the solution of the least-squares problem as follows. First of
all,

    ‖Ax − b‖^2 = (Ax − b)^T (Ax − b) = x^T A^T A x − x^T A^T b − b^T A x + b^T b = x^T A^T A x − 2 b^T A x + b^T b.

This function is continuously differentiable, and hence, a necessary condition for finding a local minimum is

    0 = \frac{d}{dx} ‖Ax − b‖^2 = \frac{d}{dx} (x^T A^T A x − 2 x^T A^T b + b^T b) = 2 A^T A x − 2 A^T b
    ⇔ A^T A x = A^T b.

This linear equation system is also called the normal equations. If the columns of A are linearly independent,

    A^T A x = 0  ⇒  x^T A^T A x = 0  ⇒  ‖Ax‖^2 = 0  ⇔  Ax = 0  ⇒  x = 0.
Hence, ker(A^T A) = {0}, and since A^T A ∈ R^{m×m} is square, A^T A is invertible. This means that there is a unique
solution x̂ to the normal equations

    A^T A x̂ = A^T b,   (2.28)

that is, x̂ = (A^T A)^{-1} A^T b; in practice, for better efficiency, we always solve a single linear equation system
instead of computing the inverse of the matrix (in practical analysis, saying that a vector is the result of solving
a well-determined set of linear equations is pretty much the same as saying that we have a formula for this
vector).
If the Hessian

    \frac{d^2}{dx^2} ‖Ax − b‖^2 = 2 A^T A

is positive definite everywhere, this solution is a global minimizer of ‖Ax − b‖^2. Let x ≠ 0. Then, since the
columns of A are linearly independent, we have

    x^T A^T A x = y^T y > 0,   with y := Ax ≠ 0.

Hence, the unique solution of the normal equations, x̂, is the solution of the least-squares problem eq. (2.27).
The case that the columns of A are not linearly independent will be discussed at a later point; then, we cannot
find a unique minimizer.
The normal equations

    A^T A x = A^T b

are a linear equation system with the matrix A^T A. We have not extended the definition of the condition
number of a matrix to non-square (or singular) matrices. However, as we will see later,

    κ(A^T A) = κ(A)^2.

Since κ(A) ≥ 1, the conditioning of the normal equations is generally much worse compared to the original equation
system eq. (2.26).
Now, let

    A = QR

be a QR factorization of A. Then,

    A^T A x̂ = A^T b
    ⇔ (QR)^T QR x̂ = (QR)^T b
    ⇔ R^T (Q^T Q) R x̂ = R^T Q^T b
    ⇔ R^T R x̂ = R^T Q^T b,

using Q^T Q = I. If the columns of A are linearly independent, the columns of R have to be linearly independent as well.
Therefore, R and R^T are invertible, and we have

    R x̂ = Q^T b,

which can be solved using backward substitution. In contrast to the normal equations and

    R^T R x̂ = R^T Q^T b,

this problem has the same conditioning as the original problem eq. (2.26).
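The following sketch (with random data) solves the same least-squares problem via the normal equations and via the QR factorization; both agree here, but the QR route avoids squaring the condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))        # overdetermined: n = 50 > m = 3
b = rng.standard_normal(50)

# Normal equations: A^T A x = A^T b  (condition number kappa(A)^2)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# QR route: A = QR, then solve R x = Q^T b  (condition number kappa(A))
Q, R = np.linalg.qr(A)                  # reduced QR: Q is 50x3, R is 3x3
x_qr = np.linalg.solve(R, Q.T @ b)

x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_normal, x_qr), np.allclose(x_qr, x_ref))
```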
In the next paragraphs, we will see other examples for the use of the QR decomposition of a matrix.

Computing Krylov subspaces In section 2.3, we have introduced Krylov subspaces:

    K_m(A, v) := span{v, Av, A^2 v, ..., A^{m−1} v}.

By using Gram–Schmidt orthogonalization, we derive Arnoldi's method for computing an orthonormal
basis of K_m(A, v) as follows:

Algorithm 5: Arnoldi's method for computing K_m(A, v)
Data: A ∈ R^{n×n} and v ∈ R^n
q_1 = v/‖v‖;
for k := 1, ..., m − 1 do                          /* iteration */
    w := A q_k;                                    /* expansion */
    for i := 1 to k do                             /* orthogonalization */
        h_{i,k} := (w, q_i);
        w := w − h_{i,k} q_i;
    end
    h_{k+1,k} := ‖w‖_2;
    if h_{k+1,k} = 0 then break;                   /* invariant subspace spanned */
    q_{k+1} := w/h_{k+1,k};                        /* new basis vector */
end
Result: Orthonormal basis q_1, ... of K_m(A, v)
Of course, in practice, instead of checking hk+1,k = 0 for an early termination of the algorithm, one would
check if |hk+1,k | is below some tolerance.
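A compact NumPy sketch of alg. 5 (without the early-termination check, and with an arbitrary random test matrix):

```python
import numpy as np

def arnoldi(A, v, m):
    """Orthonormal basis Q of the Krylov subspace K_m(A, v) and H = Q^T A Q."""
    n = len(v)
    Q = np.zeros((n, m))
    H = np.zeros((m, m))
    Q[:, 0] = v / np.linalg.norm(v)
    for k in range(m):
        w = A @ Q[:, k]                       # expansion
        for i in range(k + 1):                # Gram-Schmidt orthogonalization
            H[i, k] = w @ Q[:, i]
            w -= H[i, k] * Q[:, i]
        if k + 1 < m:
            H[k + 1, k] = np.linalg.norm(w)
            Q[:, k + 1] = w / H[k + 1, k]     # new basis vector
    return Q, H

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
Q, H = arnoldi(A, rng.standard_normal(20), m=6)
print(np.allclose(Q.T @ Q, np.eye(6)))        # orthonormal basis
print(np.allclose(Q.T @ A @ Q, H))            # cf. eq. (2.31)
```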
In particular, in the kth iteration step, we obtain the so-called Hessenberg matrix

    H_k = \begin{pmatrix} h_{1,1} & \cdots & \cdots & h_{1,k} \\ h_{2,1} & \ddots & & \vdots \\ & \ddots & \ddots & \vdots \\ 0 & & h_{k,k−1} & h_{k,k} \end{pmatrix}   (2.29)

and an orthonormal basis of the Krylov subspace K_k(A, v),

    Q_k = [q_1  q_2  ...  q_k].

We obtain the relation

    A Q_k = Q_k H_k + h_{k+1,k} q_{k+1} e_k^T,   (2.30)

where h_{k+1,k} q_{k+1} = w and e_k is the kth canonical basis vector in R^k. Note that

    v e_k^T = v ⊗ e_k

is a rank one matrix. From eq. (2.30) and the fact that w = h_{k+1,k} q_{k+1} ⊥ Q_k, we obtain

    Q_k^T A Q_k = H_k.   (2.31)

Here, w ⊥ Q_k means that w^T Q_k = 0, or

    (w, q_i) = 0, for 1 ≤ i ≤ k.
If A is symmetric, we obtain from eq. (2.31) that the Hessenberg matrix H_k is symmetric as well. Since
the Hessenberg matrix is also of the form in eq. (2.29), it must be tridiagonal for a symmetric A:

    H_k = \begin{pmatrix} h_{1,1} & h_{2,1} & & 0 \\ h_{2,1} & \ddots & \ddots & \\ & \ddots & \ddots & h_{k,k−1} \\ 0 & & h_{k,k−1} & h_{k,k} \end{pmatrix}
Now, with α_k = h_{k,k} and β_k = h_{k−1,k}, Arnoldi's method simplifies to the (symmetric) Lanczos method;
in this method, we make use of the fact that each new orthonormal basis vector can be computed using a
three-term recurrence relation, that is, using only the two previous basis vectors:

Algorithm 6: Lanczos' method for computing K_m(A, v) for a symmetric A
Data: A ∈ R^{n×n} and v ∈ R^n
β_1 = 0; q_0 = 0;                                  /* initialization */
q_1 = v/‖v‖;
for k := 1, ..., m do                              /* iteration */
    α_k := q_k^T A q_k;
    w = A q_k − α_k q_k − β_k q_{k−1};             /* new direction orthogonal to previous q */
    β_{k+1} = ‖w‖_2;
    q_{k+1} = w/β_{k+1};                           /* normalization */
end
Result: Orthonormal basis q_1, ... of K_m(A, v)
From Lanczos' method, we obtain the tridiagonal Hessenberg matrix

    T_k = \begin{pmatrix} α_1 & β_2 & & & 0 \\ β_2 & α_2 & \ddots & & \\ & \ddots & \ddots & \ddots & \\ & & \ddots & \ddots & β_k \\ 0 & & & β_k & α_k \end{pmatrix}

and the orthonormal basis of the Krylov subspace K_k(A, v),

    Q_k = [q_1  q_2  ...  q_k].

Finally, analogously to Arnoldi's method, we have

    A Q_k = Q_k T_k + β_{k+1} q_{k+1} e_k^T   (2.32)

and

    Q_k^T A Q_k = T_k.   (2.33)

Orthogonalization with projections, rotations, and reflections In alg. 1, we have employed the
Gram–Schmidt orthonormalization algorithm to find an orthonormal basis for the range of a matrix A.
In doing so, we have used linear operations of the form

    P_w v = (v, w) w,   (2.34)

where ‖w‖ = 1. In fact, this type of linear map is an orthogonal projection; in fig. 2.4, we can see a
graphical representation of the application of P_w.

Figure 2.4: Orthogonal projection of v onto w.

Definition 2.14.
A projection matrix is a square matrix P ∈ R^{n×n} with the property

    P^2 = P.

Definition 2.15.
A projection is an orthogonal projection if

    P^T = P.

Exercise 2.18. Orthogonal Projection


Verify that Pw as defined in eq. (2.34) is an orthogonal projection.

If a linear map P is an orthogonal projection, we have that P v is orthogonal to (I − P)v, or P v ⊥ (I − P)v,
which is the same as

    (P v, (I − P)v) = 0.   (2.35)

Furthermore, if P is an orthogonal projection, so is I − P.
To build an orthogonal basis of range(A) using the Gram–Schmidt algorithm, we make use of the fact
that

    w ⊥ (I − P_w)v,

which is easy to derive from eq. (2.35) and the fact that w and P_w v are linearly dependent.
As mentioned before, the classical Gram–Schmidt algorithm is not numerically stable. In particular,
small numerical errors appearing during the algorithm can result in large errors in the result. In particular,
if the orthogonal projections P_w are not applied exactly, the matrix Q in the QR factorization of a matrix
can be far from orthogonal. This is caused by the fact that, depending on v, the evaluation of P_w(v)
can be ill-conditioned:
Exercise 2.19. Condition number of Pw

Let Pw be defined in eq. (2.34). Discuss the relative condition number eq. (2.22) as defined in defini-
tion 2.8 for Pw (v) depending on v. What is the worst case?

We will therefore discuss two alternative approaches for computing an orthogonal basis, that is, using
rotations or reflections.
Let us first consider the matrix describing the rotation around the origin by an angle α:

    G_α = \begin{pmatrix} \cos(α) & \sin(α) \\ −\sin(α) & \cos(α) \end{pmatrix}   (2.36)

Figure 2.5: Rotation of v onto the x and y axes.

We have that

    G_α^T G_α = \begin{pmatrix} \cos(α) & −\sin(α) \\ \sin(α) & \cos(α) \end{pmatrix} \begin{pmatrix} \cos(α) & \sin(α) \\ −\sin(α) & \cos(α) \end{pmatrix} = \begin{pmatrix} \cos(α)^2 + \sin(α)^2 & 0 \\ 0 & \sin(α)^2 + \cos(α)^2 \end{pmatrix} = I,

which means that G_α is actually an orthogonal matrix. It is clear that the angle α can be chosen such that

    G_{α_1} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} r \\ 0 \end{pmatrix}   (2.37)

or

    G_{α_2} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} 0 \\ l \end{pmatrix},   (2.38)

that is, such that the resulting vector is aligned with one of the two axes; cf. fig. 2.5.
Exercise 2.20. Rotation to axes

1. Compute α_1 and α_2 such that eq. (2.37) and eq. (2.38) hold, respectively.

2. Give formulae for r and l.

For a given 2 × 2 matrix, we have that

    G_{α_1} \underbrace{\begin{pmatrix} a & c \\ b & d \end{pmatrix}}_{=:A} = \begin{pmatrix} r & ? \\ 0 & ? \end{pmatrix} =: R.

Hence, we have already computed a QR factorization of A:

    A = G_{α_1}^T R   (2.39)
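A small NumPy sketch of this 2 × 2 case: choosing α such that the entry below the diagonal is eliminated, cf. eq. (2.37) and eq. (2.39). The test matrix is arbitrary.

```python
import numpy as np

def givens(a, b):
    """Rotation G with G @ [a, b] = [r, 0]."""
    alpha = np.arctan2(b, a)
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, s],
                     [-s, c]])

A = np.array([[3.0, 1.0],
              [4.0, 2.0]])

G = givens(A[0, 0], A[1, 0])       # eliminate the entry below the diagonal
R = G @ A
print(R)                           # first column becomes [5., 0.]
Q = G.T                            # A = G^T R, cf. eq. (2.39)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(2)))
```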

In order to derive a scheme for general matrices, we can extend the rotation matrix to higher dimensions

as follows:

    G_α^{ij} = \begin{pmatrix} I & & & & \\ & \cos(α) & & \sin(α) & \\ & & I & & \\ & −\sin(α) & & \cos(α) & \\ & & & & I \end{pmatrix},   (2.40)

where i and j correspond to the rows and columns for performing the rotation: the cos(α) entries are at
positions (i, i) and (j, j), the entries ±sin(α) at positions (i, j) and (j, i), and the remaining diagonal entries
are ones. In particular, the rotation is, again, performed around the origin but within the plane spanned by the ith
and jth coordinates. We call these matrices Givens rotation matrices.

Discuss why Gij


α in eq. (2.40) is orthogonal.

We can now use these rotation matrices to turn a matrix

    \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}

into an upper triangular matrix as follows: In a first step, we eliminate the entry a_{n1} by choosing a suitable
angle α_n, such that

    G_{α_n}^{1n} \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} = \begin{pmatrix} r & ? & \cdots & ? \\ a_{21} & \cdots & \cdots & a_{2m} \\ \vdots & & & \vdots \\ a_{n−1,1} & \cdots & \cdots & a_{n−1,m} \\ 0 & ? & \cdots & ? \end{pmatrix}.
In the same way, we eliminate all remaining entries a_{21}, ..., a_{n−1,1} of the first column, resulting in a matrix of the form

    G_{α_2}^{12} \cdots G_{α_n}^{1n} \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} = \begin{pmatrix} \hat r_{11} & ? & \cdots & ? \\ 0 & ? & \cdots & ? \\ \vdots & \vdots & \ddots & \vdots \\ 0 & ? & \cdots & ? \end{pmatrix}.   (2.41)

As in the computation of the LU factorization (cf. section 2.6), we continue with the remaining submatrix
(without the first row and column) in the second step. In each step, we eliminate the entries below the
diagonal of the first column of the remaining submatrix, until we end up with an upper triangular matrix R.
Since all the rotation matrices involved in this procedure are orthogonal, we can easily obtain the Q matrix
in the QR factorization of A by computing the transpose of the product of the Givens rotation matrices,
as in eq. (2.39).
A similar procedure can be carried out using Householder reflection matrices. In particular, we can
transform any vector

    v = (v_1, v_2, ..., v_m)^T

to a vector

    (r, 0, ..., 0)^T

by using a reflection as follows: If we use a reflection along a plane p, the norm of the vector will remain the
same. Therefore,

    r = sign(v_1) ‖v‖;   (2.42)

as can be seen in fig. 2.6. Now, let

    w̃ = v − sign(v_1) ‖v‖ e_1   (2.43)

and w := w̃/‖w̃‖.

Figure 2.6: Reflection of v onto the x and y axes.

Exercise 2.22. Householder reflection


Show that, with the notation given above in eqs. (2.42) and (2.43), the matrix

Hp = I − 2ww> = I − 2w ⊗ w

is a reflection along the plane p ⊥ w. In particular, show that


1. Hp v = sign (v1 ) kvk e1 ,
2. Hp> = Hp ,

3. Hp> Hp = Hp Hp> = I, and


4. Hp2 = I.
In particular, Hp is an orthogonal matrix.
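A sketch of the construction in eqs. (2.42) and (2.43), reflecting an arbitrarily chosen vector onto the first coordinate axis:

```python
import numpy as np

def householder(v):
    """Reflection H_p with H_p v = sign(v_1) ||v|| e_1, cf. eqs. (2.42) and (2.43)."""
    w = v - np.sign(v[0]) * np.linalg.norm(v) * np.eye(len(v))[0]
    w = w / np.linalg.norm(w)
    return np.eye(len(v)) - 2.0 * np.outer(w, w)

v = np.array([2.0, -1.0, 2.0])
H = householder(v)
print(H @ v)                                   # [3., 0., 0.]
print(np.allclose(H @ H.T, np.eye(3)))         # H is orthogonal
print(np.allclose(H @ H, np.eye(3)))           # H is an involution (H^2 = I)
```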

Now, using Householder reflection matrices, we can transform a matrix

    \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}

into an upper triangular matrix, analogously to eq. (2.41). As shown in exercise 2.22, the reflection matrices
are orthogonal, and hence, we can easily obtain a QR factorization by multiplying and transposing the
reflection matrices.
The reason that Givens rotations and Householder reflections lead to a numerically stable scheme is that
they are orthogonal matrices. As we have seen in exercise 2.16, the matrix-vector multiplication with an
orthogonal matrix is well-conditioned; the condition number is optimal, that is, 1. This means that numerical
errors are not amplified by an application of Givens rotations and Householder reflections.
Example 2.9. Numerical stability of the QR factorization
Let

    A = \begin{pmatrix} 1 & 1 & 1 \\ 0.01 & 0 & 0.01 \\ 0 & 0.01 & 0.01 \end{pmatrix}.

We compute A = QR using Gram–Schmidt projections, Givens rotations, and Householder reflections.
Computation with 10^{-3} accuracy yields

• for Gram–Schmidt projections:

    Q = \begin{pmatrix} 1 & 0 & 0 \\ 0.01 & −0.707 & 0 \\ 0 & 0.707 & 1 \end{pmatrix}  ⇒  Q^T Q = \begin{pmatrix} 1 & 0 & 0 \\ 0.01 & 1 & 0.707 \\ 0 & 0.707 & 1 \end{pmatrix}

• for Givens rotations:

    Q = \begin{pmatrix} 1 & −0.007 & −0.007 \\ 0.01 & 0.707 & 0.707 \\ 0 & −0.707 & 0.707 \end{pmatrix}  ⇒  Q^T Q = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

• for Householder reflections:

    Q = \begin{pmatrix} −1 & 0.007 & 0.007 \\ −0.01 & −0.707 & −0.707 \\ 0 & 0.707 & −0.707 \end{pmatrix}  ⇒  Q^T Q = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

See also example 2.5.

2.5 Diagonalization and eigenvectors


In this section, we will concentrate on square matrices A ∈ Rn×n again. Previously, in section 2.3, we have
already seen advantages of diagonalizable matrices with respect to the efficient computation of matrix powers
Ad . In this section, we will discuss the diagonalization of matrices and the related topic of eigenvalues and
eigenvectors. Moreover, we will draw some first connections to machine learning.

Figure 2.7: Application of A to the canonical basis vectors e_1, e_2, e_3. The determinant corresponds to the
(signed) volume of the parallelepiped defined by Ae_1, Ae_2, Ae_3.

The determinant Let us first consider the triangular matrix

    A = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 3 & 1 \\ 0 & 0 & 2 \end{pmatrix}.

In fig. 2.7, we can see the application of the matrix A to the canonical basis
vectors e_1, e_2, e_3; the results correspond to the columns of A, that is,

    A = [a_1 ... a_n] = [A e_1 ... A e_n].
Definition 2.16. Determinant


The determinant of a matrix A ∈ Rn×n is the (signed) volume of the n-dimensional parallelepiped determinant
defined by its column vectors.
It can be computed using the recursive formula:
• If n > 1,
n
X
det (A) = (−1)i+j aij det (Aij ) ,
i=1

where j is a fixed column index, and Aij ∈ R(n−1)×(n−1) results from dropping the ith row and
jth column from A.

• If n = 1,
det (A) = a11 .

Exercise 2.23.
Derive a formula for the determinant of an upper triangular matrix R = (rij )ij ∈ Rn×n .

The determinant contains important information about the matrix. First of all, det (A) = 0 if and only
if A is singular. Moreover,
• switching two rows (or columns) flips the sign of det (A),

• det A> = det (A), and




• scaling one row (or column) of A by a constant c, resulting in the matrix Ã, will scale the (signed)
volume of the parallelepiped by c. Hence,

    det(Ã) = c · det(A).

Another very important property of the determinant is the following lemma; cf. [2, lemma 3.2.1].
Lemma 2.4.
Let A ∈ Rn×n and B ∈ Rn×n . Then,

det (AB) = det (A) det (B) .

This lemma has many direct consequences:


Exercise 2.24.
Show:

• det(A^{-1}) = 1/det(A).

• The determinant of an orthogonal matrix is either 1 or −1.

• If A ∈ R^{n×n} is diagonalizable with A = V D V^{-1}, where V is invertible and D = diag(λ_1, ..., λ_n) is diagonal, then

    det(A) = det(D) = \prod_{i=1}^n λ_i.

Based on lemma 2.4 and exercise 2.23, we obtain that, once we have computed a QR factorization

A = QR

of a matrix A ∈ Rn×n , we can easily compute its determinant.


Using the determinant of a matrix, we can now discuss eigenvalue and eigenvectors of matrices.

Eigendecomposition The eigenvalues and eigenvectors expose very important information about the
matrix.
Let the matrix A ∈ Rn×n be diagonalizable with

A = V DV −1 , (2.44)

where V is invertible and D = diag(λ_1, ..., λ_n) is diagonal. Equation (2.44) is also called the eigendecomposition
of A. In particular, it yields

    A V = V D,
and hence, we have
Avi = λi vi , (2.45)
for 1 ≤ i ≤ n, where vi is the ith column of V .

Definition 2.17. Eigenvalues and eigenvectors
Let A ∈ R^{n×n}. The scalar λ ∈ R and vector 0 ≠ v ∈ R^n are called eigenvalue and eigenvector of A if

    A v = λ v.   (2.46)

We also call (λ, v) an eigenpair.

Let us introduce the convention that λ_max := λ_1 ≥ ... ≥ λ_n =: λ_min.


We can directly see that, if v ∈ Rn is an eigenvector corresponding to the eigenvalue λ, then cv is also
an eigenvector corresponding to λ for 0 6= c ∈ R:

A(cv) = cAv = cλv = λ(cv).

Similarly, if v, w ∈ Rn are eigenvectors corresponding to the same eigenvalue λ, then

A(v + w) = Av + Aw = λv + λw = λ(v + w).

Hence v + w is also an eigenvector corresponding to λ. Therefore,

Eλ := {v ∈ Rn |Av = λv}

is a subspace of Rn and we call it the eigenspace of λ. eigenspace of λ


We have that:
Theorem 2.6.
A matrix A ∈ Rn×n is diagonalizable if and only if there exists a basis of Rn of eigenvectors.

Therefore, a diagonalizable matrix is fully determined by its eigenvalues and -vectors. They also provide
additional information about the matrix and the data stored in it. For instance:
Exercise 2.25. Spectral norm and eigenvalues
Let A ∈ R^{n×n} be symmetric positive definite. Show that:
1. The spectral norm of A corresponds to its largest eigenvalue, that is,

    ‖A‖_2 = λ_max.

2. The condition number of A with respect to the spectral norm is given by

    κ(A) = λ_max / λ_min.

If the matrix A is symmetric, let us consider

    A v_1 = λ_1 v_1   and   A v_2 = λ_2 v_2

with λ_1 ≠ λ_2. We have that

    λ_1 v_1^T v_2 = (A v_1)^T v_2 = v_1^T A v_2 = v_1^T (A v_2) = v_1^T (λ_2 v_2) = λ_2 v_1^T v_2.

This can only be the case if

    0 = v_1^T v_2 = (v_1, v_2).

For a symmetric matrix A, the basis of eigenvectors V can hence be chosen orthonormal. Then, we have V^{-1} = V^T,
and we obtain

    A = V D V^T.
Let us now discuss how to compute the eigenvalues and -vectors of a matrix. The eigenvalues of matrix
A are the roots of the characteristic polynomial characteristic
polynomial
p (λ) = det (A − λI) .

This is clear because, if there is an eigenpair (λ, v), we have that

0 = Av − λIv = (A − λI) v.

Hence, the matrix


(A − λI)
has a nontrivial kernel, and its determinant is zero. The characteristic polynomial is a polynomial of degree
n, and we have:
Theorem 2.7. Fundamental theorem of algebra
Every non-constant single-variable polynomial with complex coefficients has at least one complex
root.

This means that, over the complex numbers, we can always find n, not necessarily distinct, eigenvalues
(counted with algebraic multiplicity); if all eigenvalues are distinct, there is a corresponding basis of eigenvectors,
and the matrix A ∈ C^{n×n} is diagonalizable over the complex numbers. The same is not guaranteed for real
numbers. However, as mentioned earlier, let us focus on real scalars for now.
Let us assume that the characteristic polynomial has n, not necessarily distinct, real roots. This means
that

    p(λ) = \prod_{i=1}^n (λ̂_i − λ).

Multiple of the λ̂_i could be the same, that is, an eigenvalue λ_i can be a root of multiplicity larger than one of
the characteristic polynomial. We call this the algebraic multiplicity of the eigenvalue. The algebraic
multiplicity of an eigenvalue λ is not necessarily the same as the dimension of the eigenspace, dim(E_λ),
which is called the geometric multiplicity. However, the algebraic multiplicity is an upper bound for the
geometric multiplicity.
Once the eigenvalues {λ_i}_{1 ≤ i ≤ n} have been computed, the corresponding eigenvectors can be obtained by
solving the linear equation systems

    (A − λ_i I) v_i = 0

for v_i. The geometric multiplicity of λ_i is then given by the dimension of the solution space, which corresponds
to E_{λ_i}.
For large data sets and, hence, large matrices, the characteristic polynomial is of high degree. Hence, it
becomes computationally demanding to compute the characteristic polynomial and all its roots numerically.
Therefore, the use of iterative schemes for computing the eigenvalues approximately can be more efficient.
One relatively simple algorithm can be derived based on computing QR factorizations. Hence, the algorithm

is called the QR algorithm.
Algorithm 7: QR algorithm
Data: A ∈ R^{n×n}
A_0 = A;
k = 0;
while ‖L_k‖ > TOL do                               /* iteration */
    A_k = Q_k R_k;                                  /* compute QR factorization, e.g., using alg. 4 */
    A_{k+1} = R_k Q_k;                              /* interchange factors */
    A_{k+1} = L_{k+1} + D_{k+1} + U_{k+1};          /* split into strictly lower, diagonal, and strictly upper part for the stopping criterion */
    k = k + 1;                                      /* update k */
end
Result: Approximate eigenvalues diag(A_k)
Let us discuss briefly how the algorithm works: First, we compute the QR decomposition of the matrix
A_0 = A, that is,

    A_0 = Q_0 R_0.

Then, we interchange the factors Q_0 and R_0 to obtain an update of our matrix,

    A_1 = R_0 Q_0.

This matrix satisfies

    A_1 = Q_0^T Q_0 R_0 Q_0 = Q_0^T A_0 Q_0,

using Q_0^T Q_0 = I. Hence, if A_0 = A is symmetric, A_1 is symmetric as well. Moreover, all matrices A_0, A_1, ... have the
same eigenvalues, since they are related by similarity transformations.
During the iteration, the strictly lower triangular part L_k of A_k will converge to zero, while the diagonal entries
of A_k will converge to the eigenvalues of A. In order to check for convergence, we consider ‖L_k‖ ≤ TOL,
that is, we stop the QR iteration once the lower triangular part L_k of A_k is almost zero.
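A minimal NumPy sketch of the iteration just described (with a fixed number of iterations instead of the tolerance check, applied to the matrix of example 2.10 below):

```python
import numpy as np

def qr_algorithm(A, num_iter=50):
    """Basic (unshifted) QR algorithm; the diagonal converges to the eigenvalues."""
    Ak = A.copy()
    for _ in range(num_iter):
        Q, R = np.linalg.qr(Ak)   # compute QR factorization
        Ak = R @ Q                # interchange the factors
    return Ak

A = np.array([[-10.0, 13.0, 13.0],
              [13.0, 7.0, -16.0],
              [13.0, -16.0, -7.0]])
Ak = qr_algorithm(A)
print(np.sort(np.diag(Ak)))                  # approximate eigenvalues of A
print(np.sort(np.linalg.eigvalsh(A)))        # reference values
```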
In the following two examples, you can see the convergence of the QR algorithm based on two 3 × 3
matrices, one symmetric and one unsymmetric matrix.

Example 2.10. QR Algorithm – Symmetric Matrix
Consider the symmetric 3 × 3 matrix

    A = \begin{pmatrix} −10 & 13 & 13 \\ 13 & 7 & −16 \\ 13 & −16 & −7 \end{pmatrix}.

Below, you can find some iterates of the QR algorithm, showing
that the diagonal entries converge to the eigenvalues of A,
which are approximately 4.02, 18.21, and −32.22. Since A is symmetric, both the lower
and the upper triangular parts converge to zero; see
also the plots on the right.

    A_1 = \begin{pmatrix} −30.0639 & −8.8883 & 3.8254 \\ −8.8883 & 16.2198 & 2.9181 \\ 3.8254 & 2.9181 & 3.8441 \end{pmatrix}

    A_5 = \begin{pmatrix} −32.2091 & −0.8803 & 0.0010 \\ −0.8803 & 18.1897 & 0.0053 \\ 0.0010 & 0.0053 & 4.0194 \end{pmatrix}

    A_{10} = \begin{pmatrix} −32.2245 & 0.0507 & 0.0000 \\ 0.0507 & 18.2050 & −0.0000 \\ 0.0000 & −0.0000 & 4.0194 \end{pmatrix}

For the symmetric case, we can easily obtain the eigenvectors as follows: Since we have that

    A_{k+1} = Q_k^T A_k Q_k = Q_k^T ··· Q_0^T A Q_0 ··· Q_k,

upon convergence of the method at some k, we have

    D = Q^T A Q

with D = A_k almost diagonal and Q = Q_0 ··· Q_k orthogonal. Therefore, the columns of Q are good approximations
to the eigenvectors of A.

Example 2.11. QR Algorithm – Unsymmetric Matrix
Consider the unsymmetric 3 × 3 matrix

    A = \begin{pmatrix} 16 & 14 & 14 \\ 3 & 11 & 6 \\ −10 & −16 & −11 \end{pmatrix}.

The matrix is unsymmetric, and hence, only the lower triangular
part converges to zero, while the diagonal entries converge to the
correct eigenvalues of A, which are 2, 5, and 9.

    A_1 = \begin{pmatrix} 9.7836 & −0.0785 & −32.9416 \\ −0.3661 & 5.8744 & −8.6751 \\ 0.2509 & 0.2231 & 0.3421 \end{pmatrix}

    A_5 = \begin{pmatrix} 8.9573 & −1.4163 & −31.4510 \\ −0.1217 & 5.0626 & −13.7851 \\ 0.0004 & 0.0038 & 1.9800 \end{pmatrix}

    A_{10} = \begin{pmatrix} 8.9978 & −1.3396 & 31.0316 \\ −0.0067 & 5.0024 & 14.7059 \\ −0.0000 & −0.0000 & 1.9998 \end{pmatrix}

The QR algorithm requires the computation of a QR decomposition of a matrix in each iteration. As


pointed out before, this comes with a significant computational cost of O(n3 ). A much less favorable property
of the algorithm comes from the fact that the QR decomposition does not preserve the sparsity pattern of a
matrix. As we have discussed before, operations involving sparse matrices can be significantly more efficient
than operations with dense matrices of the same size. Consequently, algorithms which do not change the
matrix are often much more efficient.
A simple iterative scheme for computing the eigenvalue with the largest absolute value and the corresponding
eigenvector of a matrix is the power method. It is based on the following observation: Let
v_1, ..., v_n be a basis of eigenvectors with corresponding eigenvalues |λ_1| ≥ ... ≥ |λ_n| of a diagonalizable
matrix A ∈ R^{n×n}. Then, any vector v ∈ R^n can be written as

    v = \sum_{i=1}^n a_i v_i

and, hence,

    A v = \sum_{i=1}^n a_i A v_i = \sum_{i=1}^n a_i λ_i v_i.

Applying A d times, we see that

    A^d v = \sum_{i=1}^n a_i A^d v_i = \sum_{i=1}^n a_i λ_i^d v_i.

For our investigation of the convergence, we divide by λ_1^d:

    \frac{1}{λ_1^d} A^d v = \sum_{i=1}^n a_i \left( \frac{λ_i}{λ_1} \right)^d v_i.   (2.47)

We have that

    |λ_i / λ_1| < 1

for all |λ_i| < |λ_1|. Assuming that |λ_1| > |λ_2| and a_1 ≠ 0, the expression in eq. (2.47) will converge to a_1 v_1 for d → ∞.
Hence, ‖A^d v‖ grows like |λ_1|^d, and

    A^d v / ‖A^d v‖

converges (up to sign) to the normalized eigenvector v_1 for d → ∞.
The resulting algorithm alg. 8 only requires the application of A but does not change the matrix itself.

Algorithm 8: Power method
Data: A ∈ R^{n×n} and an initial vector v ∈ R^n
for k := 1, ..., m do                              /* iteration */
    v = A v / ‖A v‖;
    µ = (v^T A v) / (v^T v);                       /* Rayleigh quotient */
end
Result: Approximate eigenvalue µ ≈ λ_1 of largest absolute value and corresponding eigenvector
v ≈ v_1.

Definition 2.18. Rayleigh quotient
The Rayleigh quotient is defined as

    R(A, v) := (v, v)_A / (v, v),

where (v, v)_A := v^T A v.

Obviously, we have that

    R(A, v_i) = λ_i

for A v_i = λ_i v_i.
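A sketch of alg. 8, together with the Rayleigh quotient; the random symmetric test matrix and the number of iterations are arbitrary choices.

```python
import numpy as np

def power_method(A, num_iter=200, seed=0):
    """Approximate the eigenvalue of largest absolute value and its eigenvector."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(num_iter):
        v = A @ v
        v = v / np.linalg.norm(v)
    mu = (v @ A @ v) / (v @ v)       # Rayleigh quotient
    return mu, v

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B + B.T                          # symmetric test matrix

mu, v = power_method(A)
eigvals = np.linalg.eigvalsh(A)
print(mu, eigvals[np.argmax(np.abs(eigvals))])   # should (approximately) agree
```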
We have seen that, when applying a matrix multiple times, the eigenvalue with the largest absolute value
becomes dominant. Therefore, an alternative approach results from combining Krylov subspaces and the
QR algorithm: the Krylov subspace K_m(A, v) will approximate the eigenvectors corresponding to the largest
absolute eigenvalues best. Therefore, let us apply d iterations of Arnoldi's method or, for a symmetric A,
Lanczos' method. This results in a semi-orthogonal matrix Q_d ∈ R^{n×d} spanning K_d(A, v). In order to compute
approximations of the largest absolute eigenvalues, we can solve the smaller eigenvalue problem

    H_d ṽ = µ ṽ,   (2.48)

with the Hessenberg matrix H_d = Q_d^T A Q_d ∈ R^{d×d} being a reduced dimensional approximation of A in the
Krylov subspace K_d(A, v). The eigenvalues µ_1, ..., µ_d of eq. (2.48) are also called the Ritz values.
We obtain from eq. (2.48) that

    Q_d^T A Q_d ṽ = µ ṽ  ⇒  Q_d Q_d^T A Q_d ṽ = µ Q_d ṽ.

Then, v := Q_d ṽ is an eigenvector of

    Q_d Q_d^T A v = µ v.

However, since Q_d is only semi-orthogonal, we only have

    Q_d Q_d^T A v ≈ A v.

In order to solve the small eigenvalue problem eq. (2.48), even if the QR algorithm is not feasible for the original matrix
A, we can now apply it to H_d.
Figure 2.8: Optimization of a quadratic function f (colored) under a quadratic constraint g = 0.

Left and right eigenvectors The eigenvectors, which we have introduced before, are right eigenvectors

Av = λv

since they are applied to A from the right. In addition to that, we can introduce left eigenvectors, given by
the eigenvalue problem
v > A = λv > (2.49)
for the left eigenvector v and the corresponding eigenvalue λ.
Exercise 2.26.
Show that the left and right eigenvalues of a matrix A ∈ Rn×n are the same.

By transposing eq. (2.49),


A> v = λv
we see that the left eigenvectors and -values actually correspond to those of the transposed A> of A. A
direct result from this is:
Lemma 2.5.
If the matrix A ∈ Rn×n is symmetric, the left and right eigenvectors and eigenvalues coincide.

Moreover, for a diagonalizable matrix A with

A = V DV −1 ,

the columns of V are the right eigenvectors. On the other hand, we obtain

V −1 A = DV −1 ,

and hence, the rows of V −1 are the left eigenvectors.

Simultaneous diagonalizability Another important property of diagonalizable matrices, which is help-


ful in optimization techniques, is simultaneous diagonalizability. For instance, it helps when optimizing a
quadratic function under a quadratic constraint; see fig. 2.8. It is also used in trust region approaches in
optimization.

Definition 2.19. Simultaneous diagonalizability
A set of diagonalizable matrices {A_1, . . . , A_k} is called simultaneously diagonalizable if there exists a single invertible matrix V such that
\[
D_i = V A_i V^{-1}
\]
is diagonal, for all 1 ≤ i ≤ k.

There is an alternative characterization of simultaneously diagonalizable matrices, which is given by the


following theorem:
Theorem 2.8.
A set of diagonalizable matrices {A_1, . . . , A_k} is simultaneously diagonalizable if and only if the matrices commute pairwise.

2.6 Solving linear equation systems


Previously, we have already discussed the QR factorization,

A = QR,

which can be employed to solve a linear equation system

Ax = b. (2.50)

For a square matrix A ∈ R^{n×n}, the computational complexity of this approach is O(n^3); in particular, it is
\[
\tfrac{2}{3} n^3 + O(n^2).
\]
Here, we are going to discuss alternative matrix factorization approaches,
• the LU factorization and
• the Cholesky factorization,
and iterative Krylov schemes, which are more frequently used, in particular, for sparse matrices. Therefore,
as a first step, we will discuss how to perform row and column operations using matrix-matrix multiplications.

Row and column operations In order to derive matrix factorization techniques, it is generally helpful to understand that, and how, matrix row and column operations can be written in terms of matrix-matrix multiplications.
Let us first discuss row operations: We start with the exemplary matrix
\[
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} \in \mathbb{R}^{3\times 3}
\]
and multiply it from the left with certain 3 × 3 matrices as follows:
\[
\text{Operator} \cdot \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}.
\]
In particular, the following matrix-matrix multiplications correspond to different linear row operations:

• Permutation of two rows:
\[
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
=
\begin{pmatrix} 1 & 2 & 3 \\ 7 & 8 & 9 \\ 4 & 5 & 6 \end{pmatrix}
\]
Here, the second and third rows are exchanged.

• Adding a multiple c ∈ R of one row to another row:
\[
\begin{pmatrix} 1 & 2 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
=
\begin{pmatrix} 9 & 12 & 15 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
\]
Here, 2 times the second row is added to the first row.

• Scaling a row by a constant c ∈ R:
\[
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 3 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
=
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 21 & 24 & 27 \end{pmatrix}
\]
Here, the third row is scaled by 3.

Exercise 2.27. Revertible operations

• Discuss which of the row and column operations introduced before are revertible and for which constant c.
• How do the operations change the determinant of the matrix?
• Based on these considerations, give inverse matrices for the row and column operation matrices.

Of course, if each of the matrices O1 , . . . , Ok corresponds to a row operation, multiple row operations on
a matrix A can be assembled into a single operation by multiplying all those matrices:

\[
\underbrace{O_k \cdots O_1}_{=:\hat O} \cdot A
\]

In order to understand this better, solve the following exercise:


Exercise 2.28. Row operations
Eliminate all lower-diagonal entries (red) in the following matrix
\[
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 4 & 9 & 16 \\
1 & 8 & 27 & 64 \\
1 & 16 & 81 & 256
\end{pmatrix}
\tag{2.51}
\]
by performing row operations. Compute the matrix corresponding to performing all those row operations at once. Which matrix factorization results from these operations?

Performing column operations to a matrix A can generally be handled analogously by multiplying oper-
ators from the right:
A · Operator.

Exercise 2.29. Column operations
The same as exercise 2.28 but with column operations: Eliminate all upper-diagonal entries (green)
in the matrix in eq. (2.51) by performing only column operations. Compute the matrix corresponding
to performing all those column operations at once and the resulting matrix factorization.

In exercises 2.28 and 2.29, you have already seen how to simplify the matrix structure by performing row or column operations. As we have already discussed in section 2.4, a linear equation system with a triangular matrix is straightforward to solve. We will make use of this in the next paragraph.

LU factorization The most common way of factorizing an invertible matrix A ∈ R^{n×n} is the so-called LU factorization (or LU decomposition), that is, the factorization into an upper triangular matrix U and a lower triangular matrix L:
\[
A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}
= \underbrace{\begin{pmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \ddots & \vdots \\ \vdots & & \ddots & 0 \\ l_{n1} & \cdots & l_{n,(n-1)} & l_{nn} \end{pmatrix}}_{=L}
\cdot
\underbrace{\begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ 0 & r_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & r_{(n-1),n} \\ 0 & \cdots & 0 & r_{nn} \end{pmatrix}}_{=U} . \tag{2.52}
\]
Once the LU factorization has been computed, the linear equation system in eq. (2.50) can be solved by forward and backward substitution:
\[
\begin{aligned}
& A x = b \\
\Leftrightarrow\; & L \underbrace{\begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ 0 & r_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & r_{(n-1),n} \\ 0 & \cdots & 0 & r_{nn} \end{pmatrix} x}_{=:y} = b \\
\Leftrightarrow\; & L y = b \\
\Leftrightarrow\; & \underbrace{\begin{pmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \ddots & \vdots \\ \vdots & & \ddots & 0 \\ l_{n1} & \cdots & l_{n,(n-1)} & l_{nn} \end{pmatrix}}_{=L} y = b
\end{aligned}
\]
Solving this system is called the forward substitution, and it can be easily done row by row, from the first row to the last row.
In order to compute x from y, we only have to solve
\[
\underbrace{\begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ 0 & r_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & r_{(n-1),n} \\ 0 & \cdots & 0 & r_{nn} \end{pmatrix}}_{=U} x = y.
\]
This step is called backward substitution, and it can, again, be performed row by row; this time, due to the structure of the matrix U, it is done from the last to the first row.
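As a concrete illustration of forward and backward substitution, the following NumPy sketch solves L y = b row by row from the top and then U x = y from the bottom; the small test matrices are illustrative only.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for a lower triangular L, from the first row to the last."""
    n = L.shape[0]
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_substitution(U, y):
    """Solve U x = y for an upper triangular U, from the last row to the first."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

L = np.array([[2.0, 0.0], [1.0, 3.0]])
U = np.array([[1.0, 4.0], [0.0, 5.0]])
b = np.array([2.0, 16.0])
x = backward_substitution(U, forward_substitution(L, b))
print(np.allclose(L @ U @ x, b))  # True
```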

The LU decomposition can be computed by Gaussian elimination. As a first step, for eliminating one entry a_{i1} below the diagonal from
\[
\begin{pmatrix}
a_{11} & \cdots & \cdots & \cdots & a_{1n} \\
\vdots & \ddots & & & \vdots \\
a_{i1} & \cdots & a_{ii} & \cdots & a_{in} \\
\vdots & & & \ddots & \vdots \\
a_{n1} & \cdots & \cdots & \cdots & a_{nn}
\end{pmatrix},
\]
we subtract \frac{a_{i1}}{a_{11}} times the first row from the ith row. In terms of row operations via matrix-matrix multiplications, this corresponds to
\[
\begin{pmatrix}
1 & & & & \\
& \ddots & & & \\
-\frac{a_{i1}}{a_{11}} & & 1 & & \\
& & & \ddots & \\
& & & & 1
\end{pmatrix}
\cdot
\begin{pmatrix}
a_{11} & \cdots & \cdots & \cdots & a_{1n} \\
\vdots & & & & \vdots \\
a_{i1} & \cdots & a_{ii} & \cdots & a_{in} \\
\vdots & & & & \vdots \\
a_{n1} & \cdots & \cdots & \cdots & a_{nn}
\end{pmatrix}
=
\begin{pmatrix}
a_{11} & \cdots & \cdots & \cdots & a_{1n} \\
\vdots & & & & \vdots \\
a_{i1} - \frac{a_{i1}}{a_{11}} a_{11} & \cdots & a_{ii} - \frac{a_{i1}}{a_{11}} a_{1i} & \cdots & a_{in} - \frac{a_{i1}}{a_{11}} a_{1n} \\
\vdots & & & & \vdots \\
a_{n1} & \cdots & \cdots & \cdots & a_{nn}
\end{pmatrix},
\tag{2.53}
\]
where the entry in the ith row and first column is
\[
a_{i1} - \frac{a_{i1}}{a_{11}} a_{11} = 0.
\]

Hence, all entries below the diagonal in the first column can be eliminated at once as follows:
\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & \ddots & & \vdots \\
\vdots & & \ddots & a_{(n-1),n} \\
a_{n1} & \cdots & a_{n,(n-1)} & a_{nn}
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & \cdots & \cdots & 0 \\
\frac{a_{21}}{a_{11}} & 1 & 0 & & \vdots \\
\vdots & 0 & \ddots & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & 0 \\
\frac{a_{n1}}{a_{11}} & 0 & \cdots & 0 & 1
\end{pmatrix}
\cdot
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
0 & a_{22} - \frac{a_{21}}{a_{11}} a_{12} & \cdots & a_{2n} - \frac{a_{21}}{a_{11}} a_{1n} \\
\vdots & \vdots & & \vdots \\
0 & a_{n2} - \frac{a_{n1}}{a_{11}} a_{12} & \cdots & a_{nn} - \frac{a_{n1}}{a_{11}} a_{1n}
\end{pmatrix}
\tag{2.54}
\]
\[
=
\underbrace{\begin{pmatrix}
1 & 0 & \cdots & \cdots & 0 \\
l_{21} & 1 & 0 & & \vdots \\
\vdots & 0 & \ddots & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & 0 \\
l_{n1} & 0 & \cdots & 0 & 1
\end{pmatrix}}_{L^{(1)}}
\cdot
\underbrace{\begin{pmatrix}
r_{11} & r_{12} & r_{13} & \cdots & r_{1n} \\
0 & r_{22} & r_{23} & \cdots & r_{2n} \\
0 & r_{32}^{(1)} & r_{33}^{(1)} & \cdots & r_{3n}^{(1)} \\
\vdots & \vdots & \vdots & & \vdots \\
0 & r_{n2}^{(1)} & r_{n3}^{(1)} & \cdots & r_{nn}^{(1)}
\end{pmatrix}}_{U^{(1)}}
\]

We use the upper index (k) to denote entries of the kth step of the algorithm; the index is omitted in case
the entries remain the same until termination of the algorithm. Note that, after one iteration of Gaussian
elimination, the first column of L(1) already contains the final entries of L. Moreover, the first two rows of
U (1) already contain the final entries of U .
Based on the discussion in the previous paragraph, that is, on row and column operations, it can be
observed that this factorization indeed yields the original matrix A.
As a next step after eq. (2.54), the matrix
\[
\begin{pmatrix}
r_{22} & r_{23} & \cdots & r_{2n} \\
r_{32}^{(1)} & r_{33}^{(1)} & \cdots & r_{3n}^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
r_{n2}^{(1)} & r_{n3}^{(1)} & \cdots & r_{nn}^{(1)}
\end{pmatrix}
\]
is factorized as already in eq. (2.54). This procedure is performed recursively, until the remaining matrix block to factorize is only a single entry r_{nn}^{(n-1)} = r_{nn}.
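The recursive elimination just described can be written compactly as a loop over pivot columns. The following NumPy sketch computes an LU factorization without pivoting and is meant only to illustrate the procedure; it fails if a zero pivot is encountered (see the remark on pivoting below), and the test matrix is an illustrative choice.

```python
import numpy as np

def lu_no_pivoting(A):
    """LU factorization A = L @ U by Gaussian elimination without pivoting."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    for k in range(n - 1):                               # loop over pivot columns
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]            # multipliers l_ik = a_ik / a_kk
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])  # eliminate below the pivot
    return L, U

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
L, U = lu_no_pivoting(A)
print(np.allclose(L @ U, A))  # True
```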
Exercise 2.30. LU Decomposition
Compute the LU decomposition of the matrix
\[
A = \begin{pmatrix} 1 & 2 & 4 \\ 3 & 8 & 14 \\ 2 & 6 & 13 \end{pmatrix}
\]
and use the LU decomposition to solve the linear equation system
\[
A x = \begin{pmatrix} 3 \\ 9 \\ 9 \end{pmatrix}.
\]

Exercise 2.31. Computational Complexity
Discuss the number of FLOPs needed for
• computing the LU decomposition A = LU of a dense matrix A ∈ Rn×n ,
• solving Ly = b and U x = y for a right hand side b ∈ Rn .

What is the computational complexity of these operations?

Note that the step in eq. (2.53) can only be performed if a_{11} ≠ 0 because we would have to divide by zero otherwise. However, if the matrix is invertible, there is always a pivot, i.e., a nonzero element, in each column. Therefore, we can exchange the rows such that there is a nonzero element on the diagonal. As we have seen in the previous paragraph, this can be achieved by multiplication with a permutation matrix from the left. This yields an LU factorization with pivoting
\[
P A = L U. \tag{2.55}
\]
In order to improve the stability of the algorithm, we always exchange rows such that the element with the maximum absolute value is on the diagonal.
Let us finally remark that, if the matrix A is sparse and many entries below the diagonal are already
zero, we can omit eliminating the corresponding entries in eq. (2.54). This will save computation work and
may also result in sparse matrices L and U . However, the matrices L and U can, in general, be much denser
compared to the original matrix A.
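In practice one rarely codes pivoted LU by hand; a library routine can be used instead. The sketch below uses scipy.linalg.lu_factor and lu_solve, which compute a pivoted LU factorization once and reuse it for several right-hand sides; the dense test matrix is illustrative.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))        # illustrative dense test matrix
lu, piv = lu_factor(A)                 # pivoted LU factorization, computed once

# The factorization can be reused for several right-hand sides.
for _ in range(3):
    b = rng.standard_normal(5)
    x = lu_solve((lu, piv), b)         # only forward/backward substitution per solve
    assert np.allclose(A @ x, b)
```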

Cholesky factorization In case the matrix A is symmetric positive definite, an LU factorization can be computed with fewer operations. In particular, instead of a general decomposition
\[
A = L U,
\]
we can find a decomposition
\[
A = L D L^\top, \tag{2.56}
\]
where
\[
L = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
l_{21} & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
l_{n1} & \cdots & l_{n,(n-1)} & 1
\end{pmatrix}
\]
is a lower triangular matrix and
\[
D = \begin{pmatrix}
d_{11} & 0 & \cdots & 0 \\
0 & d_{22} & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & d_{nn}
\end{pmatrix}
\]
is a diagonal matrix. Note that
\[
A = L \underbrace{\left( D L^\top \right)}_{=:U} \tag{2.57}
\]

is actually an LU decomposition as derived before. From the fact that A is positive definite, we have that
the entries of D are all positive, and hence, we can compute the matrix D1/2 as discussed in section 2.3,
such that
D1/2 D1/2 = D

and with L̂ := L D^{1/2}, we can get another form of the Cholesky factorization:
\[
A = \hat L \cdot \hat L^\top. \tag{2.58}
\]
Different from the LU factorization discussed in the previous paragraph and the LU factorization in eq. (2.57), the diagonal entries of L̂ are generally not one.
Both variants eqs. (2.56) and (2.58) are computationally more efficient than the standard LU factorization, and storing the resulting matrices L and D or L̂, respectively, generally requires less storage compared to storing L and U.
In order to find the matrices L and D in eq. (2.56), let l_i be the ith row of L. Then, one can make use of the fact that
\[
a_{ij} = l_i^\top D\, l_j,
\]
which directly follows from eq. (2.56). Similarly, we obtain that
\[
a_{ij} = \hat l_i^\top \cdot \hat l_j
\]
from eq. (2.58), where \hat l_i is the ith row of L̂.
Let us briefly explain the idea how to derive the algorithm to compute L; the same can be done similarly for L̂. Since the first row of L only contains a single 1 as its first entry, d_{11} can be computed from
\[
a_{11} = e_1^\top D\, e_1 = d_{11},
\]
where e_1 is the first Euclidean standard basis vector. Then, (l_2)_1 ((l_2)_2 is 1) and d_{22} can be computed from the two equations
\[
a_{21} = l_2^\top D\, e_1 = d_{11} (l_2)_1
\quad\text{and}\quad
a_{22} = l_2^\top D\, l_2 = d_{11} (l_2)_1^2 + d_{22}.
\]
Using this procedure, the entries of L and D can be computed recursively. We will not discuss this in detail,
but you might want to derive the algorithms yourself:
Exercise 2.32. Cholesky Factorization

1. Derive an algorithm for computing the Cholesky factorization of an SPD matrix A ∈ Rn×n .
2. What is the computational complexity of computing the Cholesky factorization?

3. Why is it necessary that the matrix is SPD?
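For reference, a library implementation of the variant in eq. (2.58) is available as numpy.linalg.cholesky; the following sketch uses it to solve an SPD system by forward and backward substitution. The test matrix is an illustrative random SPD matrix.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)          # illustrative SPD matrix
b = rng.standard_normal(4)

L_hat = np.linalg.cholesky(A)        # A = L_hat @ L_hat.T, cf. eq. (2.58)
y = solve_triangular(L_hat, b, lower=True)     # forward substitution
x = solve_triangular(L_hat.T, y, lower=False)  # backward substitution
print(np.allclose(A @ x, b))         # True
```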

Matrix inversion The inverse of an invertible matrix A ∈ R^{n×n} can be written as
\[
A^{-1} = \begin{pmatrix} \tilde a_1 & \cdots & \tilde a_n \end{pmatrix} = \begin{pmatrix} A^{-1} e_1 & \cdots & A^{-1} e_n \end{pmatrix},
\]
where {e_1, . . . , e_n} is the canonical basis in R^n. Each of the columns ã_i ∈ R^n can be computed by solving a linear equation system
\[
A \tilde a_i = e_i.
\]
Hence, assuming that we have computed an LU factorization A = LU of the matrix, the inverse A^{-1} can be computed by solving n linear equation systems using forward and backward substitution. Based on exercise 2.31, it is easy to derive the computational complexity O(n^3).
Usually, we are not interested in all the entries of the matrix A^{-1}, but we want to be able to apply it to some vectors {v_1, . . . , v_k}, where usually k ≪ n. In the case k ≪ n, it is therefore much more efficient to solve the linear equation systems
\[
A x_i = v_i
\]
for 1 ≤ i ≤ k. On the other hand, even if k ≈ n, A^{-1} may be much denser compared to L and U. Therefore, even storing the matrix A^{-1} might be infeasible.

Figure 2.9: Minimizing a function by following the steepest descent direction, that is, the negative gradient. For a non-convex function Φ, starting from a different point might lead to a different local minimum.

Iterative solvers Finally, we will discuss briefly how to solve linear equation systems

Ax = b

using iterative solvers, which only involve matrix-vector multiplications but no modifications of the matrix
itself.
Even though this assumption is not satisfied for matrices corresponding to general data sets, for the sake
of brevity and simplicity, let us focus on the easy case that A ∈ Rn×n is SPD. Then,

\[
Ax = b \quad\Leftrightarrow\quad x = \arg\min_y \Phi(y),
\]
where Φ(y) := \frac{1}{2} y^⊤ A y − y^⊤ b. We will now first introduce the gradient descent method, which is illustrated in fig. 2.9. It is motivated by the idea that the fastest way of reaching the minimum of a valley is to descend in the steepest direction, which is the direction of the negative gradient
\[
-\nabla_x \Phi(x) = -(Ax - b) = b - Ax.
\]

This vector r := b − Ax is also called the residual of the linear equation system Ax = b. That is, if Ax̂ = b, the corresponding residual is b − Ax̂ = 0. Instead of just using the residual as the step, we usually scale it by some factor α, which, in machine learning, is usually called the learning rate. The learning rate is a hyperparameter, and changing the learning rate can have two consequences:
• If the individual steps would otherwise be unnecessarily small, we can speed up convergence by choosing a large α.
• If the individual steps are too large, for instance, if we jump over the whole valley within one step, we can improve the convergence by choosing a small α.
There are also extensions of this simple method, which choose the learning rate adaptively. This is important
because, similar to the function in fig. 2.9, training a machine learning model generally corresponds to
optimizing a non-convex function. Hence, the plain gradient descent algorithm might not yield optimal
convergence, and more advanced variants are required.

The standard gradient method is given in alg. 9.

Algorithm 9: Gradient descent method
Data: Initial guess x(0) ∈ R^n, learning rate α ∈ R+, and tolerance TOL > 0
r(0) := b − Ax(0) ;
while ‖r(k)‖ ≥ TOL ‖r(0)‖ do
    x(k+1) := x(k) + α r(k) ;
    r(k+1) := b − Ax(k+1) ;
end
Result: Approximate solution of Ax = b
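A direct transcription of alg. 9 into NumPy might look as follows; the SPD test matrix and the fixed learning rate are illustrative choices only.

```python
import numpy as np

def gradient_descent_spd(A, b, x0, alpha, tol=1e-8, max_iter=10_000):
    """Gradient descent iteration for an SPD matrix A (alg. 9)."""
    x = x0.copy()
    r = b - A @ x                      # initial residual
    r0_norm = np.linalg.norm(r)
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol * r0_norm:
            break
        x = x + alpha * r              # step in the direction of the residual
        r = b - A @ x                  # update the residual
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative SPD matrix
b = np.array([1.0, 1.0])
x = gradient_descent_spd(A, b, np.zeros(2), alpha=0.4)
print(np.allclose(A @ x, b, atol=1e-6))  # True
```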

Exercise 2.33.
If we allow for choosing the learning rate in each iteration step, derive a formula for the optimal
choice, in case A ∈ Rn×n is SPD.

Gradient based optimization techniques will be discussed in more detail in the optimization and machine
learning parts of the course.
Finally, we want to note that Krylov spaces can be used to improve the convergence of the standard
gradient descent method. In particular, instead of approaching the solution by only considering the local
gradient in one direction, we take into account the previous search directions as well.
One example of a Krylov subspace method for solving a linear equation system with an SPD matrix is the conjugate gradient method.
Algorithm 10: Conjugate gradient method
Data: Initial guess x(0) ∈ R^n and tolerance TOL > 0
p(0) := r(0) := b − Ax(0) ;
while ‖r(k)‖ ≥ TOL ‖r(0)‖ do
    α_k := (p(k), r(k)) / (Ap(k), p(k)) ;
    x(k+1) := x(k) + α_k p(k) ;
    r(k+1) := r(k) − α_k Ap(k) ;
    β_k := (Ap(k), r(k+1)) / (Ap(k), p(k)) ;
    p(k+1) := r(k+1) − β_k p(k) ;
end
Result: Approximate solution of Ax = b
Without discussing the algorithm in detail: similar to Lanczos' method, it exploits the symmetry of the matrix so that the orthogonalization can be performed with a short recurrence.
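A compact NumPy sketch of alg. 10 is shown below; in practice one would rather use scipy.sparse.linalg.cg, but the explicit loop makes the short recurrence visible. The small test problem is illustrative.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Conjugate gradient method (alg. 10) for an SPD matrix A."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    r0_norm = np.linalg.norm(r)
    for _ in range(max_iter or n):
        if np.linalg.norm(r) < tol * r0_norm:
            break
        Ap = A @ p
        alpha = (p @ r) / (Ap @ p)     # step length along the search direction
        x = x + alpha * p
        r = r - alpha * Ap             # update the residual
        beta = (Ap @ r) / (Ap @ p)     # short recurrence for the new direction
        p = r - beta * p
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
print(np.allclose(A @ x, b))  # True
```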

2.7 The singular value decomposition and pseudo inverses


In section 2.4, we have already discussed how to solve an overdetermined linear equation system, with a
matrix with full column rank, using a least-squares formulation. This leads to the normal equations, which
can be solved using the QR factorization of the matrix.
In this section, we will deal with general data sets, leading to a general matrix A ∈ Rn×m . This matrix
may be rectangular and rank-deficient.

Definition 2.20. Rank-deficiency
A matrix A ∈ R^{n×m} is called rank-deficient if
\[
\mathrm{rank}(A) < \min(n, m),
\]
and the rank-deficiency is defined as
\[
\min(n, m) - \mathrm{rank}(A).
\]

This is the most general type of data set to be stored in a matrix A. This also results in the fact that the least squares problem
\[
\arg\min_{x \in \mathrm{domain}(A)} \|Ax - b\|^2,
\]
as defined in eq. (2.27), is not well-posed anymore. Consider the following example of image data:
Example 2.12. Rank-deficient data

The above picture of the surface of Mercury has an image resolution of 1 144 × 1 071 pixels, however
its rank is only 153.

A different example is a database with many more features than observations, for instance, the database of a video platform with more videos than users; see also the discussion on sparse data in section 2.1.

In this case, the matrix A may not be diagonalizable, that is, there is no basis of eigenvectors, and we
will need a different tool in order to extract structure and important information from the matrix.

Singular value decomposition Let us consider a general matrix A ∈ Rn×m with rank r, where r ≤
min (n, m). In contrast to the eigendecomposition, the singular value decomposition exists for any matrix.
This is stated in the following theorem and definition:

Theorem 2.9. Singular value decomposition
Let A ∈ R^{n×m} be a matrix of rank r. Then, there exist orthogonal matrices U ∈ R^{n×n} and V ∈ R^{m×m} such that
\[
A = U \Sigma V^\top, \qquad \Sigma = \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix}, \tag{2.59}
\]
where Σ ∈ R^{n×m} and
\[
\Sigma_r = \mathrm{diag}(\sigma_1, \sigma_2, \cdots, \sigma_r) = \begin{pmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \end{pmatrix},
\]
and σ_1 ≥ σ_2 ≥ · · · ≥ σ_r > 0. Equation (2.59) is called the singular value decomposition (SVD) of A.
The σ_i are called the singular values of A, and they are unique up to their ordering. The columns of U and V are called the left and right singular vectors, respectively.

The same holds true if R is replaced by C. Then,
\[
A = U \Sigma V^\dagger
\]
and U and V are unitary matrices.


Note that eq. (2.59) is equivalent to
\[
A v_i = \sigma_i u_i, \qquad A^\top u_i = \sigma_i v_i \qquad \forall i = 1, \ldots, \min(n, m),
\]
with U = (u_1 · · · u_n) and V = (v_1 · · · v_m), where σ_i := 0 for i > r. This looks very similar to the eigenvalue problem eq. (2.46) we saw earlier. However, the vectors on the left and right hand sides are now different. Nonetheless, there is a close connection between the SVD and the eigendecomposition introduced before. In particular,
\[
A^\top A = V \Sigma^\top U^\top U \Sigma V^\top = V \underbrace{\Sigma^\top \Sigma}_{=\mathrm{diag}(\sigma_1^2, \ldots, \sigma_r^2, 0, \ldots, 0)} V^\top.
\]
Since V is orthogonal, we have that V^{-1} = V^⊤. Therefore, the eigenvalues of A^⊤A are the squared singular values of A. Moreover, the eigenvectors of A^⊤A are the right singular vectors of A.
Similarly,
\[
A A^\top = U \Sigma V^\top V \Sigma^\top U^\top = U \underbrace{\Sigma \Sigma^\top}_{=\mathrm{diag}(\sigma_1^2, \ldots, \sigma_r^2, 0, \ldots, 0)} U^\top,
\]
and the eigenvalues of AA^⊤ are also the squared singular values of A. Moreover, the eigenvectors of AA^⊤ are the left singular vectors of A.
Moreover, if the matrix A ∈ R^{n×n} has an orthonormal basis of eigenvectors V ∈ R^{n×n} and nonnegative eigenvalues, then
\[
A = V D V^\top
\]
is also a singular value decomposition of A, where U = V; here, D ∈ R^{n×n} is the diagonal matrix containing the eigenvalues of A. Hence, the left and right singular vectors are the eigenvectors and the singular values are the eigenvalues of A. (If some eigenvalues are negative, their signs can be absorbed into U, and the singular values are the absolute values of the eigenvalues.)
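The connection between the SVD and the eigendecomposition of A^⊤A can be verified numerically; the sketch below compares the squared singular values of a random illustrative matrix with the eigenvalues of A^⊤A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                # illustrative rectangular matrix

sigma = np.linalg.svd(A, compute_uv=False)     # singular values of A
eigvals = np.linalg.eigvalsh(A.T @ A)          # eigenvalues of A^T A (ascending)

print(np.allclose(np.sort(sigma**2), eigvals)) # True: eigenvalues = squared singular values
```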
Exercise 2.34. Singular value and eigendecomposition
For some small n (for instance n = 3):

1. Give an example for matrices A, V, D ∈ Rn×n with

A = V DV > ,

where V orthogonal and D diagonal.

2. Give an example for matrix A ∈ Rn×n which cannot be decomposed as

A = V DV > ,

with V ∈ Rn×n orthogonal and D ∈ Rn×n diagonal.

3. Can you come up with matrix properties discussed earlier which are necessary conditions for
the existence of a decomposition
A = V DV > ,
with V ∈ Rn×n orthogonal and D ∈ Rn×n diagonal?

Exercise 2.35. Spectral norm and singular values


Show that the spectral norm of a matrix corresponds to its largest singular value, that is,

kAk2 = σmax .

Exercise 2.36. Frobenius norm and singular values

Show that the Frobenius norm of a rank r matrix A is the square root of the sum of the squared singular values:
\[
\|A\|_F = \sqrt{\sum_{i=1}^{r} \sigma_i^2}.
\]

A direct consequence of the two previous exercises is the fact that

kAk2 ≤ kAkF .

There is an alternative form of the SVD, the compact SVD, which uses semi-orthogonal matrices instead
of orthogonal matrices:
Theorem 2.10. Compact singular value decomposition
For a rank r matrix A ∈ R^{n×m}, there exists a compact singular value decomposition (compact SVD):
\[
A = U \Sigma_r V^\top, \qquad \Sigma_r = \mathrm{diag}(\sigma_1, \sigma_2, \cdots, \sigma_r) = \begin{pmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \end{pmatrix},
\]
where U ∈ R^{n×r} and V ∈ R^{m×r} are semi-orthogonal matrices.

Exercise 2.37. Compact SVD


Discuss how the SVD and the compact SVD are equivalent.

The following theorem justifies why the SVD is an essential tool in unsupervised learning:
Theorem 2.11. Eckart–Young Theorem
Let A ∈ R^{n×m} be of rank r and have a singular value decomposition A = U Σ V^⊤ with
\[
\Sigma = \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix}, \qquad \Sigma_r = \mathrm{diag}(\sigma_1, \sigma_2, \cdots, \sigma_r).
\]
Let B ∈ R^{n×m} be of rank k ≤ r. The matrix B that minimizes
\[
\|A - B\|_F
\]
is given by B = U \hat\Sigma V^⊤, where
\[
\hat\Sigma = \begin{pmatrix} \Sigma_k & 0 \\ 0 & 0 \end{pmatrix}, \qquad \Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \cdots, \sigma_k).
\]

Therefore, the singular value decomposition of a matrix A enables us to find best approximations with a
given rank k, that is, best low-rank approximations of A. We will see this in the following example: best low-rank
approxima-
Example 2.13. Image Compression tions

Surface of Mercury.

Let us again consider the image of the surface of Mercury already shown earlier in examples 1.5 and 2.12. The high-resolution image is of size 1 144 × 1 071 and rank 1 071.
In the following plot, we can see the singular values σ_1, . . . , σ_{1 071} of the matrix corresponding to the pixel values:

It can be observed that the first singular values are much higher than the smallest singular values, and
hence, following theorem 2.11, we can find good low-rank approximations of the image with respect
to the Frobenius norm. The image on the top right is a rank 37 approximation of the image, that is,
only 37 singular values as well as left and right singular vectors are necessary to store this image.
In the following table, several examples of error resulting from such low-rank approximations are
listed:
‖A − Â‖_F / ‖A‖_F    rank    numbers to store    compression factor
0.05                    1           2 216              553.0
0.01                    9          19 944               61.0
0.005                  37          81 992               15.0
0.001                 153         339 048                3.6
0.0                 1 071       1 225 224                1

Numerical results for SVD image compression.

As we can observe, for the image on the top right, we obtain compression by a factor of 15, resulting
in an error of only 0.5 % in the Frobenius norm.
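A low-rank approximation in the sense of theorem 2.11 is easy to compute with NumPy; the sketch below truncates the SVD of a random illustrative matrix and reports the relative Frobenius error for a few ranks.

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 80))               # illustrative data matrix
for k in (5, 20, 80):
    A_k = best_rank_k(A, k)
    err = np.linalg.norm(A - A_k) / np.linalg.norm(A)   # relative Frobenius error
    print(k, round(err, 3))
```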

Efficiently computing the SVD In section 2.5, we have already discussed how to use Krylov subspaces to approximately solve an eigenvalue problem. Due to the connection of the SVD and the eigenvalue decomposition of A^⊤A respectively AA^⊤, we could compute a singular value decomposition of A by solving the respective eigenvalue problems. However, the matrices A^⊤A and AA^⊤ are ill-conditioned, and using the QR decomposition for large matrices may result in high computational cost. Moreover, as before, the algorithm does not preserve the sparsity of the matrix. Instead, we can, again, derive a scheme to approximately compute the SVD of A using Krylov subspaces.
As we have seen in theorem 2.11 and example 2.13, the largest singular values are most relevant for good approximations of the matrix corresponding to the data.
Therefore, let us apply Lanczos' method alg. 6, as described in section 2.5, to the symmetric matrix
\[
C = \begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix}
\]
with starting vector
\[
q_1 = \frac{1}{\|w\|} \begin{pmatrix} w \\ 0 \end{pmatrix}.
\]

The first iteration of Lanczos' algorithm yields
\[
\beta_1 = 0, \qquad q_0 = 0, \qquad \alpha_1 := q_1^\top C q_1, \qquad v = C q_1 - \alpha_1 q_1 - \beta_1 q_0 .
\]
Hence, the second vector in the Krylov subspace becomes
\[
\begin{aligned}
v &= C q_1 - \underbrace{\alpha_1}_{q_1^\top C q_1} q_1 - \underbrace{\beta_1}_{=0} q_0 \\
&= \begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix} \frac{1}{\|w\|} \begin{pmatrix} w \\ 0 \end{pmatrix}
 - \frac{1}{\|w\|} \begin{pmatrix} w^\top & 0 \end{pmatrix} \begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix} \frac{1}{\|w\|} \begin{pmatrix} w \\ 0 \end{pmatrix} \cdot \frac{1}{\|w\|} \begin{pmatrix} w \\ 0 \end{pmatrix} \\
&= \frac{1}{\|w\|} \begin{pmatrix} 0 \\ A^\top w \end{pmatrix}
 - \frac{1}{\|w\|^3} \underbrace{\begin{pmatrix} w^\top & 0 \end{pmatrix} \begin{pmatrix} 0 \\ A^\top w \end{pmatrix}}_{=0} \begin{pmatrix} w \\ 0 \end{pmatrix} \\
&= \frac{1}{\|w\|} \begin{pmatrix} 0 \\ A^\top w \end{pmatrix} .
\end{aligned}
\]
As the next step, the resulting vector v is normalized, and we obtain
\[
q_2 = \frac{1}{\|v\|_2} v = \frac{1}{\|A^\top w\|} \begin{pmatrix} 0 \\ A^\top w \end{pmatrix} .
\]
Repeating this procedure shows that we get orthogonal vectors q_k of the form
\[
q_{2l-1} = \begin{pmatrix} u \\ 0 \end{pmatrix} \quad\text{and}\quad q_{2l} = \begin{pmatrix} 0 \\ v \end{pmatrix}, \qquad l = 1, \ldots .
\]

As a result, we obtain the following Lanczos Bidiagonalization algorithm:

Algorithm 11: Lanczos Bidiagonalization (Golub and Kahan)
Data: A ∈ R^{n×m}; w with ‖w‖₂ = 1
u1 := w ;                                   /* initialization */
β1 := 1 ;
v1 := A^⊤ u1 ;
α1 := ‖v1‖₂ ;
v1 := v1 / α1 ;
for k := 1, . . . do                         /* iteration */
    u_{k+1} := A v_k − α_k u_k ;             /* new direction orthogonal to previous u */
    β_{k+1} := ‖u_{k+1}‖₂ ;
    u_{k+1} := u_{k+1} / β_{k+1} ;           /* normalization */
    v_{k+1} := A^⊤ u_{k+1} − β_{k+1} v_k ;   /* new direction orthogonal to previous v */
    α_{k+1} := ‖v_{k+1}‖₂ ;
    v_{k+1} := v_{k+1} / α_{k+1} ;           /* normalization */
end
Now, with the semi-orthogonal matrices
\[
U_k := \begin{pmatrix} u_1 & u_2 & \ldots & u_k \end{pmatrix} \quad\text{and}\quad V_k := \begin{pmatrix} v_1 & v_2 & \ldots & v_k \end{pmatrix}
\]
and the matrix
\[
B_k = \begin{pmatrix}
\alpha_1 & & & \\
\beta_2 & \alpha_2 & & \\
& \beta_3 & \ddots & \\
& & \ddots & \alpha_k \\
& & & \beta_{k+1}
\end{pmatrix},
\]
the Lanczos Bidiagonalization algorithm yields the following relations:
\[
\begin{aligned}
A V_k &= U_{k+1} B_k \\
A^\top U_{k+1} &= V_k B_k^\top + \alpha_{k+1} v_{k+1} e_{k+1}^\top = V_k B_k^\top + \alpha_{k+1} v_{k+1} \otimes e_{k+1}
\end{aligned}
\tag{2.60}
\]

Let us now discuss how this algorithm will help us to compute a reduced dimensional variant of the singular value decomposition. Before, we have seen that, for a singular value σ and its left and right singular vectors u and v,
\[
A v = \sigma u, \qquad A^\top u = \sigma v.
\]
Now, let σ̃ be a singular value of B_k and ũ and ṽ the corresponding left and right singular vectors. Then, as before,
\[
B_k \tilde v = \tilde\sigma \tilde u, \qquad B_k^\top \tilde u = \tilde\sigma \tilde v.
\]
Using eq. (2.60), we obtain
\[
\begin{aligned}
A V_k \tilde v &= U_{k+1} B_k \tilde v = \tilde\sigma U_{k+1} \tilde u \\
A^\top U_{k+1} \tilde u &= V_k B_k^\top \tilde u + \alpha_{k+1} v_{k+1} e_{k+1}^\top \tilde u = \tilde\sigma V_k \tilde v + \alpha_{k+1} v_{k+1} e_{k+1}^\top \tilde u .
\end{aligned}
\]

Thus, as before in section 2.5 for computing eigenvalues and eigenvectors in a Lanczos subspace, the singular
values of Bk converge to the singular values of A, and the vectors Uk+1 ũ and Vk ṽ converge to the left and
right singular vectors of A. In particular, once we span the full rank of A, the term αk+1 vk+1 e> k+1 ũ will
vanish, and we obtain all nonzero singular values as well as the corresponding left and right singular vectors.
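In practice, Krylov-based routines for the largest singular triplets are readily available; the sketch below uses scipy.sparse.linalg.svds, which approximates a few of the largest singular values and vectors using only matrix-vector products, in the spirit of the approach described here. The sparse matrix is an illustrative random example.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = sp.random(2000, 1500, density=0.001, random_state=0)  # illustrative sparse matrix

# Approximate the 5 largest singular values and the corresponding singular vectors
# without ever forming A^T A or a dense factorization.
U, s, Vt = svds(A, k=5)
print(np.sort(s)[::-1])
```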

Pseudo-inverse matrices Let us again consider a rank-deficient matrix A ∈ R^{n×m}. In this case, there are multiple solutions of the least squares problem eq. (2.27),
\[
\min_x \|Ax - b\|_2^2 . \tag{2.61}
\]

In particular, if x̃ is a least squares solution, then each
\[
\hat x = \tilde x + y \quad\text{with } y \in \ker(A)
\]
is also a least squares solution; this is clear because
\[
\|A \hat x - b\|_2^2 = \|A(\tilde x + y) - b\|_2^2 = \|A \tilde x - b\|_2^2 .
\]
In order to make the problem well-posed, we can reformulate the problem such that the solution becomes unique. In particular, let us consider the least squares solution x_{LSMN}, that is, the least-squares solution with minimum norm (LSMN).
Let x̃ ⊥ ker(A); then each solution x̂ of eq. (2.61) can be written as
\[
\hat x = \tilde x + y,
\]
where y ∈ ker(A). We have that
\[
\|\hat x\|^2 = \|\tilde x\|^2 + \|y\|^2,
\]

and hence, the minimum is attained if y = 0. The solution with minimum norm is x_{LSMN}, that is, the solution orthogonal to ker(A). Hence, we obtain the LSMN solution by solving the constrained problem
\[
\min_{x \perp \ker(A)} \|Ax - b\|_2^2 .
\]

Now, let
\[
A = U \Sigma V^\top, \qquad \Sigma = \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix}
\]
be the SVD eq. (2.59) of A. Then, the matrix
\[
A^{+} = V \begin{pmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^\top \tag{2.62}
\]
is somehow related to inverting A. In particular,
\[
A A^{+} A
= U \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix} \underbrace{V^\top V}_{=I_m} \begin{pmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{pmatrix} \underbrace{U^\top U}_{=I_n} \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix} V^\top
= U \begin{pmatrix} \Sigma_r \Sigma_r^{-1} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix} V^\top
= U \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix} V^\top = A
\]
and analogously
\[
A^{+} A A^{+}
= V \begin{pmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^\top U \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix} V^\top V \begin{pmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^\top
= V \begin{pmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^\top = A^{+}.
\]
In case A ∈ R^{n×n} is invertible, we also have
\[
A A^{-1} A = A \quad\text{and}\quad A^{-1} A A^{-1} = A^{-1}.
\]
Hence, A^{+}, as defined in eq. (2.62), is a generalization of the inverse to general matrices.


In particular, A+ is a pseudo-inverse of a general matrix A ∈ Rn×m :
Definition 2.21. Pseudo-inverse matrix
A matrix B which satisfies
\[
A B A = A \qquad\text{and}\qquad B A B = B
\]
is also called a pseudo-inverse of A.

The pseudo-inverse given by eq. (2.62) is a specific choice:

Definition 2.22.
The matrix
\[
A^{+} = V \begin{pmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^\top
\]
is a specific pseudo-inverse of A, which is called the Moore–Penrose inverse of A.
Theorem 2.12.
The LSMN solution x_{LSMN} introduced before can be computed using the Moore–Penrose inverse by
\[
x_{LSMN} = A^{+} b = V \begin{pmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^\top b .
\]
Let us now prove this theorem.

Proof. Let
\[
z = V^\top x = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} \quad\text{and}\quad c = U^\top b = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix},
\]
where z_1, c_1 ∈ R^r, z_2 ∈ R^{m−r}, and c_2 ∈ R^{n−r}. The first property of x_{LSMN} is that it solves eq. (2.61). We have that
\[
\begin{aligned}
\|b - Ax\|^2 &= \|b - U \Sigma V^\top x\|^2 \\
&= \|U^\top (b - U \Sigma V^\top x)\|^2 \\
&= \|c - U^\top U \Sigma z\|^2 \\
&= \left\| \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} - \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} \right\|^2
= \left\| \begin{pmatrix} c_1 - \Sigma_r z_1 \\ c_2 \end{pmatrix} \right\|^2 .
\end{aligned}
\tag{2.63}
\]
In the second step, we have used that
\[
\|U^\top v\|^2 = \|v\|^2,
\]
which follows from the orthogonality of U^⊤.
From eq. (2.63), we obtain that \|b - Ax\|^2 is minimized by choosing z_1 := Σ_r^{-1} c_1. However, z_2 can still be chosen freely. Since V^⊤ is orthogonal, we have
\[
\|z\|^2 = \|V^\top x\|^2 = \|x\|^2,
\]
and thus, \|x\|^2 is minimized by choosing z_2 = 0. Hence,
\[
x_{LSMN} = V z_{LSMN} = V \begin{pmatrix} \Sigma_r^{-1} c_1 \\ 0 \end{pmatrix}
= V \begin{pmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \end{pmatrix}
= V \begin{pmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^\top b = A^{+} b .
\]
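Numerically, the Moore–Penrose inverse and the LSMN solution are available directly in NumPy; the sketch below checks, on an illustrative rank-deficient matrix, that applying np.linalg.pinv to b agrees with the minimum-norm solution returned by np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative rank-deficient matrix: 6 x 4 with rank 2.
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))
b = rng.standard_normal(6)

x_pinv = np.linalg.pinv(A) @ b                      # Moore-Penrose inverse applied to b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)     # minimum-norm least-squares solution
print(np.allclose(x_pinv, x_lstsq))                 # True
```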

Based on the definition of the pseudo-inverse, we can generalize the definition of the condition number
of an invertible matrix definition 2.9 to general matrices:
Definition 2.23. Condition number of a matrix
We define the condition number of a matrix A ∈ R^{n×m} as
\[
\kappa := \kappa(A) = \|A\| \, \|A^{+}\| .
\]

As we have discussed earlier, the conditioning of the normal equations is significantly worse compared to
the original problem. The following exercise will explain this:

Figure 2.10: Graph with nodes x1, . . . , x4 and edges e1, . . . , e5 (left) and the corresponding incidence matrix (right),
\[
A = \begin{pmatrix}
-1 & 1 & 0 & 0 \\
-1 & 0 & 1 & 0 \\
0 & -1 & 1 & 0 \\
0 & -1 & 0 & 1 \\
0 & 0 & -1 & 1
\end{pmatrix},
\]
where the rows correspond to the edges e1, . . . , e5 and the columns to the nodes x1, . . . , x4. In the incidence matrix, we use the convention that, in each row, the −1 comes first and 1 comes second.

Exercise 2.38. Condition number of A> A


How are the condition number of a general matrix A ∈ Rn×m and the condition number of A> A
related?

2.8 Graphs and matrices


Let us now discuss another important type of data sets that often arise in data science and machine learning
problems; see, e.g., [46] for more details.
Example 2.14. Social networks

Taken from https://pixabay.com


The information about which people know each other can be stored as a graph object. Dealing with the graph can help to identify subgroups of people using, for instance, clustering algorithms.

As in the example in fig. 2.10 (left), each graph consists of nodes connected by edges. In fig. 2.10 (right), we show the resulting incidence matrix A, indicating the edges of the graph. For a graph with n nodes and m edges, the incidence matrix A has m rows and n columns, that is, A ∈ R^{m×n}.
In order to define the values of the incidence matrix of a graph in a unique way, we use the following
convention: An edge ek connecting the nodes xi and xj is indicated by the entries

aki = −1 and akj = 1,

where i < j. All other entries of the kth row of A are zero. In other words, each row contains a single −1
and a single 1, and the −1 always comes before the 1. Hence, the incidence matrix A of a graph is generally
sparse.

Figure 2.11: A connected graph, which is not connected anymore once the edge e5 (gray) is omitted.

One interesting property of incidence matrices A ∈ R^{m×n} is that
\[
\begin{pmatrix} c \\ \vdots \\ c \end{pmatrix} \in \ker(A) \qquad \forall c \in \mathbb{R}.
\]

Let us consider some special cases of graphs:


Definition 2.24.
A connected graph is a graph in which each pair of distinct nodes is connected by a single edge or a path of edges.

Figure 2.11 shows an example of a connected graph. In case e5 is omitted, then the sets of nodes
{x1 , x2 , x3 } and {x4 , x5 } are not connected. Otherwise, there is a path of edges between each pair of nodes.
The incidence matrix of a connected graph has the following properties:
\[
\begin{aligned}
\dim(\ker(A)) &= 1 \\
\dim(\mathrm{range}(A)) &= n - 1 \\
\dim(\mathrm{range}(A^\top)) &= n - 1 \\
\dim(\ker(A^\top)) &= m - n + 1
\end{aligned}
\tag{2.64}
\]

We will not discuss a general proof for these properties, but the next exercise will help us to understand
them:
Exercise 2.39. Connected Graph
Verify the properties eq. (2.64) of the incidence matrix for the graph in fig. 2.11. How does the situation change if edge e5 is omitted?

Definition 2.25.
A complete graph is a graph where each pair of nodes is connected by an edge.

An example for a complete graph is shown in fig. 2.12.


Exercise 2.40. Complete Graph
Compute the number of edges of a complete graph with n nodes.

Figure 2.12: A complete graph, where each pair of nodes is connected by an edge.

Figure 2.13: A tree type graph.

Definition 2.26.
A tree is a graph in which each pair of nodes is connected by exactly one path.

An example for a tree is shown in fig. 2.13.


Exercise 2.41. Tree
Compute the number of edges of a tree with n nodes.

Using the incidence matrix of a graph, we can define three more matrices describing the graph:

Definition 2.27.
The symmetric positive semidefinite graph Laplacian matrix of a graph is defined as
\[
L := A^\top A \in \mathbb{R}^{n \times n},
\]
where A is the incidence matrix of the graph with n nodes. The degree matrix D is the diagonal of L,
\[
D := \mathrm{diag}(L),
\]
and
\[
B := D - L \tag{2.65}
\]
is the adjacency matrix of the graph.

The degree matrix counts the number of edges at each node; in particular, d_{ii} is the number of edges of the node x_i. The adjacency matrix B is a symmetric binary matrix (entries 0 and 1), where b_{ij} = b_{ji} = 1 if and only if there is an edge between the nodes x_i and x_j.
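These matrices are easy to form from an edge list; the following NumPy sketch does so for the graph of fig. 2.10.

```python
import numpy as np

# Edges of the graph in fig. 2.10, given as (i, j) node pairs with i < j
# so that the -1 comes before the 1 in each row of the incidence matrix.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n = 4

A = np.zeros((len(edges), n))
for k, (i, j) in enumerate(edges):
    A[k, i], A[k, j] = -1, 1           # incidence matrix, one row per edge

L = A.T @ A                            # graph Laplacian
D = np.diag(np.diag(L))                # degree matrix
B = D - L                              # adjacency matrix
print(B.astype(int))
```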

Exercise 2.42. Graph Laplacian, degree matrix, and adjacency matrix

Compute the graph Laplacian, degree matrix, and adjacency matrix of the graphs in figs. 2.12
and 2.13.

Since the graph Laplacian L is symmetric positive semi-definite and sparse, many of the techniques
introduced so far are applicable, for instance, efficient techniques for solving a linear equation system with
L and computing its eigenvalues.
The concepts introduced before can be extended to weighted graphs, that is, graphs where the strength of the connection between two nodes can differ. If C is a diagonal matrix with positive weights, then, for instance, the resulting weighted graph Laplacian matrix is
\[
A^\top C A . \tag{2.66}
\]
We will not further discuss these weighted graphs at this point, but we might come back to them at a later point.

3 Optimization basics
3.1 Motivation
In this lecture we will begin our three-part story of optimization methods we need for the remainder of the
course.
Example 3.1.
Consider again the regression problem of Example 1.2. Try to approximate the function based on the 11 sampled perturbed measurements at points {0, 1, . . . , 10}, by doing a least-squares regression with
\[
\Phi(x) = \begin{pmatrix} 1 \\ x \\ x^2 \\ x^3 \end{pmatrix},
\]

then the example minimization problem with w ∈ R4 looks as follows:

f (w) =(4.0w3 + 27.0w4 + w1 + w2 − 3.22)2 +


(4.0w3 + 27.0w4 + w1 + w2 − 4.04)2 +
(4.0w3 + 27.0w4 + w1 + w2 − 9.01)2 +
(11.68 + 4.0w3 + 27.0w4 + w1 + w2 )2 +
(4.0w3 + 27.0w4 + w1 + w2 − 20.32)2 +
(30.06 + 4.0w3 + 27.0w4 + w1 + w2 )2 +
(42.16 + 4.0w3 + 27.0w4 + w1 + w2 )2 +
(51.108 + 4.0w3 + 27.0w4 + w1 + w2 )2 +
(65.76 + 4.0w3 + 27.0w4 + w1 + w2 )2 +
(67.27 + 4.0w3 + 27.0w4 + w1 + w2 )2 +
(87.64 + 4.0w3 + 27.0w4 + w1 + w2 )2 ,

which was generated using a symbolic computation package (not by hand derivation). We call w the
decision variable and f (w) the objective function.

Although the above problem can be formulated as a least-squares problem whose solution can be found analytically, doing so would involve enormous effort. Next to that, a tiny modification in the loss function used (use |yi − Φ(xi)^⊤ w|, for example, or something else) would make the problem no longer a least-squares one. We therefore need a more general methodology.
Finding the w that minimizes the empirical loss (training) is nothing else but solving an optimization problem in which we select the function's parameters so that the value of the above expression gets minimized. A need therefore arises to find a w that 'achieves the best performance' on the training data at a reasonable computational cost. Ideally, the algorithm used to determine the parameters works fast, so that it can be run on big datasets without difficulties.
Remark 3.1.
In the following lectures, we will only consider optimization problems in which the decision variables
can take any real value. Such a group of problems is known as continuous optimization problems.
There are also many applications in which the decision variables are restricted to be equal to, for
example, {0, 1}. Solving such integer or mixed-integer optimization problems is, in general, much more difficult computationally and is not common in ML due to the sizes of datasets used.

Figure 3.1: Plot of the polynomial 2w^4 + w^3 − 3w^2 generated using 351 equidistant points on the interval [−2, 1.5].

For the time being, we will leave the specific ML optimization problems aside to focus on the optimization alone, and once we know 'what are convenient optimization problems', we will get back to ML to formulate some basic optimization problems.

3.2 Building up the gradient method


We will begin with the simplest possible case when the parameter vector is one-dimensional. Consider the
situation in which the function to be minimized is
\[
f(w) = 2w^4 + w^3 - 3w^2 .
\]
How do we find the minimum of a function like this? First thing we can do is to plot it. To get a reasonable
plot, we need a grid of points at which we evaluate the function, see fig. 3.1.
In the plot, we can distinguish several points - local and global minimum, whose formal definitions are
given in what follows.
Definition 3.1. Local and global optimum
A point x̄ is a local minimum of a function f : R^n → R if there exists an ε > 0 such that
\[
f(\bar x) \le f(x) \qquad \forall x \in B(\bar x, \varepsilon),
\]
where B(x̄, ε) is a ball of radius ε around x̄. For a local maximum, the opposite inequality holds.
A point x̄ is a global minimum of a function f : R^n → R if
\[
f(\bar x) \le f(x) \qquad \forall x \in \mathbb{R}^n .
\]
For a global maximum, the opposite inequality holds.

From a visual inspection, we clearly see that the point (0.70, −0.65) is only a local minimum, whereas the point (−1.07, −2.04) is both a local and a global minimum. Obviously, when we want to minimize a function f, we would like to end up in the point (−1.07, −2.04). In this specific case, the task is easy: we can just point with our finger to where the minimum is because we 'see it'.
However, 'seeing' required, first, the problem to be a one-dimensional one (you can't 'see' things in problems where x has more than two dimensions). Secondly, we needed to have a good guess about where the minimum is and, thirdly, to evaluate the function 351 times to draw the picture. In fact, what we did was applying a strategy known as the grid search.
Definition 3.2.
Grid search is a minimization strategy that evaluates a function at a grid of n equidistant points and
picks the point with the lowest value as the solution. Grid search is a very general methodology and
it is used, among others, for hyperparameter optimization of ML models, see section 3.7.

Grid search can be a good idea in low-dimensional optimization problems to get a 'rough feel' about the area where the minimum might be. Already for two dimensions, however, the number of points one needs to evaluate can be prohibitive. If x ∈ R², and we want to inspect 10 possible values for each entry of x, then overall we inspect 10² combinations of values. For n dimensions, this means 10^n points, which is a prohibitively large number. For highly irregular functions, even grid search might not provide very informative results.
At the same time, in real-life applications, function evaluation can be an expensive operation and the goal
is to find a minimum in a possibly short time/using as few as possible function evaluations. For that reason,
in order to build minimization algorithms for which we are able to provide any performance guarantees,
we need to limit ourselves to minimizing ‘well-behaved’ functions. As we will repeat many times in these
lectures, making assumptions will be rather cheap in the ML context, because it is us – the ML tool designer
– who chooses what functions are minimized.
For the above reasons, we will now assume that we are minimizing differentiable functions over Rn . For
such, we have the following convenient property:
Theorem 3.1. Stationary points
For a differentiable function f : R^n → R, at every local minimum/maximum x we have that
\[
\nabla f(x) = 0 . \tag{3.1}
\]
Such a point is known as a stationary point of the function.

As we will see from the construction of our gradient method, it will be rather unlikely that we end up in a local maximum. However, we will be in real danger of ending up in a so-called saddle point. Saddle points are, typically, a big enemy of an optimizer.
Definition 3.3. Saddle points
For a differentiable function f : R^n → R, a saddle point is a stationary point that is neither a local minimum nor a local maximum.

Exercise 3.1.
Draw an example of a f : R → R function with a saddle point.

We will address the issue of ending up in saddle points in the next lecture. For now, we will just make the common-sense assumption that a good strategy to minimize a given function f is to look for points where the gradient vanishes (even though such a point is not guaranteed to be a local minimizer, this is still one of the best strategies we have because it represents a necessary condition).

It sounds thus like it would be a good idea to find a minimizer by simply solving the system of equations
(3.1). Usually, however, this involves a lot of nonlinear functions and is not easy to solve at all.
For that reason, we need to perform the minimization smartly, trying various points until exhaustion
of the computational time we have (typically things need to work fast), and picking the best one among
the ones we tried. This ‘trying’ needs to be smart, because otherwise we end up wasting a lot of effort on
evaluating points that make no sense.
For that reason, real-life optimization most often employs a 'local search' strategy of moving from one point to another, typically trying to improve the value of the objective function at each step. Being in a point x_k, we would like to select the next point x_{k+1} so that
\[
f(x_0) \ge f(x_1) \ge f(x_2) \ge f(x_3) \ge \ldots
\]

Remember that as an algorithm, while being in a given point xk we don’t ‘see’ the function image around
us because seeing this image requires us to perform many function evaluations. Thus, we need to make a
decision about the next point to move to using only information that is possible to compute, at a low cost, at
the current point. How can we quickly get an idea about how the function looks like around a given point?
Here, what comes to help us is the first-order Taylor approximation of the function:
\[
f(x) \approx f(x_k) + \nabla f(x_k)^\top (x - x_k).
\]
It means that the gradient of a function gives us the normal vector of the hyperplane tangent to the graph of f and, as you might remember, the gradient is the 'direction of steepest ascent' of the function. Suppose we choose a direction d_k such that
\[
d_k^\top \nabla f(x_k) < 0.
\]
We call such a direction a descent direction. Having such a d_k, we know that there exists an ε > 0 such that
\[
f(x_k + t d_k) < f(x_k) \qquad \forall t \in (0, \varepsilon),
\]
that is, for small enough step sizes in the direction of d_k we will achieve a descent. This is the basis of the standard descent algorithm 12.

Algorithm 12: Descent algorithm


Data: x0 ∈ Rn
for k = 0, 1, 2, . . . do
Select dk such that d> k ∇f (xk ) < 0
Select step size tk
xk+1 = xk + tk dk
if Stopping criterion met then
Stop and return xk

There are three choices to be made in alg. 12: (i) the mechanism for the selection of the direction dk, (ii) the choice of step size tk, (iii) the stopping criterion. We will address them beginning with the stopping criterion.
Stopping criteria. As you remember, for smooth functions every (local) minimum is a point at which
the function gradient vanishes. It is therefore logical to establish the gradient vanishing as the criterion. But
numerically, we cannot hope to hit directly a point where the gradient is equal to 0 so, instead, the following
criterion is usually used:
\[
\|\nabla f(x_k)\|_2 \le \varepsilon,
\]
where ε is some small number. Because even this can take a long time (if it happens at all), additional stopping criteria can be used, such as a maximum number of iterations, or that the improvement between subsequent iterations is small enough:
\[
|f(x_k) - f(x_{k-1})| \le \varepsilon .
\]

Stepsize selection. Next, we can address the issue of choosing the stepsize. Intuitively, we would like
to make big steps when we are far away from the minimum, and then smaller ones when we approach the
minimum (otherwise, there would be a risk that we ‘jump over’ the minimum without noticing it). But of
course, we never really know how far we are from the minimum so we need to make these choices somewhat
blindly. Here, we present and discuss several basic methods of choosing the stepsize tk :
• Constant stepsize: tk = t. This is the simplest of all strategies where the question is, of course,
what number should be chosen for t. One can try several values for this number and ‘judge by eye’
the convergence of the algorithm. When more analytical properties of the function to be minimized
are known (for example, the Lipschitz constant of the gradient), we can also choose a constant t that
will guarantee convergence of the algorithm.
• Exact line search (or bisection). If we are lucky, the problem of minimizing the function along the search direction,
\[
t_k = \arg\min_{t \ge 0} \{ f(x_k + t d_k) \},
\]
can be solved exactly at a cheap cost. For some problems, for example when the function is quadratic, this is indeed possible because an efficient oracle can be created to solve such a one-dimensional optimization problem. In other situations, one can use the bisection method to determine a point where the derivative of the function g(t) = f(x_k − t∇f(x_k)) vanishes (remember – this does not necessarily guarantee that it is the minimum of this function, but it is better than nothing). For this, note that g′(0) < 0. If we guess a value t_max such that g′(t_max) > 0, then by continuity of g′(t) there has to exist a point t̄ in the interval (0, t_max) such that g′(t̄) = 0. In this situation we can employ the highly efficient single-dimensional bisection Algorithm 13 to find a point where the derivative of g(t) vanishes.

Algorithm 13: Bisection algorithm for line search.
Data: a = 0, b ∈ R+ such that g′(a) < 0, g′(b) > 0. Precision parameter ε.
while |a − b| > ε do
    Set c = (a + b)/2 and evaluate g′(c).
    if g′(c) > 0 then
        Set b := c
    else
        Set a := c

• Backtracking. It is a procedure that tries to find a step length t that would give a ‘sufficiently good’
decrease of the function (we will define what that means), but without requiring the possible number
of function evaluations that exact line search would do. It requires three parameters: s > 0, α ∈ (0, 1),
β ∈ (0, 1) and its working is outlined in Algorithm 14.

Algorithm 14: Backtracking step size selection.
Data: dk, s > 0, α ∈ (0, 1), β ∈ (0, 1)
Initialize tk = s.
while f(xk) − f(xk + tk dk) < −α tk ∇f(xk)^⊤ dk do
    Set tk = β tk
Return tk.

In other words, the stepsize is chosen as t_k = s β^{i_k} where i_k is the smallest nonnegative integer for which the condition
\[
f(x_k) - f(x_k + t_k d_k) \ge -\alpha t_k \nabla f(x_k)^\top d_k
\]
is satisfied. What is the quantity −t_k ∇f(x_k)^⊤ d_k? It is the amount of descent 'promised' by the first-order Taylor approximation. We therefore want the stepsize to be such that the real function value improvement it achieves is at least a factor of α of the Taylor-promised improvement. (A small code sketch of this rule is given right after this list.)

• Diminishing step size. This is a strategy that tries to achieve the balance between 'long steps in the beginning' and 'short steps later on' which we discussed above. Different formulas for decreasing the step size are possible, for example
\[
t_k = \frac{1}{a + bk} \qquad\text{or}\qquad t_k = a \exp(-bk).
\]
Diminishing step size rules are popular in applied machine learning, typically because they perform well and because theoretical convergence results are fairly easily obtained for them with an algorithm called stochastic gradient descent, to be briefly introduced later. One does not, however, run them for too long as the final stepsize values can be negligible.
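As announced in the backtracking item above, here is a minimal NumPy sketch of alg. 14; the parameter values s = 1, α = 0.5, β = 0.5 and the test function are illustrative choices.

```python
import numpy as np

def backtracking(f, grad_fk, xk, dk, s=1.0, alpha=0.5, beta=0.5):
    """Backtracking step size selection (alg. 14) along a descent direction dk."""
    t = s
    # Shrink t until the actual decrease is at least alpha times the decrease
    # promised by the first-order Taylor approximation.
    while f(xk) - f(xk + t * dk) < -alpha * t * grad_fk @ dk:
        t *= beta
    return t

# Illustrative usage on f(x) = x1^2 + 10 x2^2 with dk = -grad f(xk).
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
xk = np.array([1.0, 1.0])
t = backtracking(f, grad(xk), xk, -grad(xk))
print(t)
```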

Exercise 3.2.
Prove that in the backtracking stepsize selection rule, for ∇f(x_k) ≠ 0 and s > 0, α ∈ (0, 1), β ∈ (0, 1), the 'while' loop inside alg. 14 always finishes after a finite number of steps.

Descent direction. We now address the issue of selecting the descent direction which has a very natural
solution. Because the gradient itself provides the direction of steepest ascent of the function, we know that
its negative gives the direction of steepest descent of the function:

xk+1 = xk − tk ∇f (xk ), tk > 0

This is the idea of the most basic method for minimization of differentiable functions, the gradient descent method, which is a generalization of the one in alg. 9. The vanilla version of this method is given in alg. 15.

Algorithm 15: Gradient descent algorithm.
Data: x0 ∈ R^n
for k = 0, 1, 2, . . . do
    Compute ∇f(xk)
    Use one of the methods for determining tk
    xk+1 = xk − tk ∇f(xk)
    if ‖∇f(xk)‖2 ≤ ε then
        Stop and return xk
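A direct implementation of alg. 15 with backtracking step sizes, applied to the polynomial from the beginning of this section, might look as follows; this is only an illustrative sketch, and the stated minima are the approximate values read off earlier.

```python
import numpy as np

def gradient_descent(f, grad, x0, s=1.0, alpha=0.5, beta=0.5, eps=1e-8, max_iter=1000):
    """Gradient descent (alg. 15) with backtracking step sizes (alg. 14)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:        # stopping criterion
            break
        d = -g                              # steepest descent direction
        t = s
        while f(x) - f(x + t * d) < -alpha * t * (g @ d):
            t *= beta                       # backtracking
        x = x + t * d
    return x

# The one-dimensional example from the beginning of this section.
f = lambda w: 2 * w[0] ** 4 + w[0] ** 3 - 3 * w[0] ** 2
grad = lambda w: np.array([8 * w[0] ** 3 + 3 * w[0] ** 2 - 6 * w[0]])
print(gradient_descent(f, grad, [1.0]))    # converges to the local minimum near 0.70
print(gradient_descent(f, grad, [-2.0]))   # converges to the global minimum near -1.07
```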

Equipped with this, you are almost ready to apply the gradient descent algorithm to any smooth minimization problem you wish. However, there is one well-known issue related to scaling/preprocessing the problem's data before running the gradient method (or any optimization algorithm, in fact). It is known as the zig-zagging behavior.
Example 3.2. Zigzagging and variable scaling
The pure gradient method can exhibit some unfavorable behavior if the magnitude of the gradient
components related to different variables differs a lot. Consider the following example of minimizing the quadratic function
\[
f(x) = x_1^2 + 8 x_1 x_2 + 20 x_2^2
\]
using the pure gradient method, with a fixed stepsize of length 1.05. The path traversed by the method is illustrated in the following figure, where the blue curves denote the isolines of the function f(x) = x_1^2 + 8 x_1 x_2 + 20 x_2^2, while the red line illustrates the path followed by the algorithm.

As you can see, at each step the gradient direction does not really point in the direction of the true minimum and, as a result, many more steps are needed. This is a result of the fact that the condition number of the Hessian of the function
\[
f(x) = x^\top A x, \qquad A = \begin{pmatrix} 1 & 4 \\ 4 & 20 \end{pmatrix},
\]
is very high, in fact ≈ 108. This is when we speak of badly-conditioned optimization problems.
Such situations occur, for example, when features used in ML-based optimization problems are not
normalized. Imagine that you are performing an optimization problem based on data in which people’s
height is measured in meters and weight in kilograms. Then, you easily obtain a situation in which
one feature has values in the interval [1, 2] while the other in the interval [50, 100]. To mitigate against
it, all features can be normalized, for example, by (i) forcing their mean to be equal to 0 and their
variance to be equal to 1, (ii) forcing their minimum and maximum values to be equal to 0 and 1,
respectively.
If scaling is not performed before the optimization problem is set up, then a common trick in the optimization field is to apply scaling to the problem itself, in which we transform our vector of decision variables using an invertible matrix D:
\[
x = D^{1/2} y.
\]
In this new optimization variable, the optimization problem becomes
\[
\min_{y \in \mathbb{R}^n} \; y^\top (D^{1/2})^\top A D^{1/2} y
\]
and the gradient step becomes
\[
y_{k+1} = y_k - t_k (D^{1/2})^\top A D^{1/2} y_k,
\]
equivalent to
\[
x_{k+1} = x_k - t_k D \nabla f(x_k)
\]
in terms of the original decision variable, which is known as the scaled gradient descent method.

A most logical choice for the scaling matrix D for a quadratic f would be the matrix D = A^{-1} because it would make the transformed matrix equal to an identity matrix. In some cases, however, computing such a matrix would be expensive and one can, for example, use the matrix
\[
D = (\mathrm{diag}(A))^{-1},
\]
which uses only the diagonal elements of A, which makes the inversion process computationally very cheap.
In the next lecture, we will see that considering the issue of 'what would be the optimal scaling technique' leads to an extension of the gradient method – the Newton method.

Exercise 3.3. Exercise 4.2 from [3]

Consider the minimization problem
\[
\min_{x \in \mathbb{R}^2} x^\top Q x
\]
where Q is a positive definite matrix. Suppose we use the diagonal scaling matrix
\[
D = \begin{pmatrix} Q_{11}^{-1} & 0 \\ 0 & Q_{22}^{-1} \end{pmatrix}.
\]
Show that the scaling improves the condition number of the matrix in the sense that
\[
\kappa(D^{1/2} Q D^{1/2}) \le \kappa(Q).
\]

So far, we have discussed many possible choices within the algorithm from a high-level perspective. But you can't really learn optimization and get a hang of it without running a few algorithms yourself. For that reason, we propose this exercise.
Exercise 3.4. Adapted exercise 4.3 from [3]

Consider the quadratic minimization problem
\[
\min_{x \in \mathbb{R}^5} x^\top A x
\]
where A is the 5 × 5 Hilbert matrix defined by
\[
A_{ij} = \frac{1}{i + j - 1}, \qquad \forall i, j.
\]
Implement and run the following methods and compare the number of iterations required by each of the methods when the initial vector is x0 = [1, 2, 3, 4, 5], to obtain a solution with ‖∇f(x)‖ ≤ 10^{-4}:
• gradient method with backtracking stepsize rule and parameters α = 0.5, β = 0.5, s = 1
• gradient method with backtracking stepsize rule and parameters α = 0.1, β = 0.5, s = 1
• gradient method with exact line search (analytically finding the minimum of the one-dimensional
quadratic function)
• diagonally scaled gradient method with diagonal elements Dii = 1/Aii
• diagonally scaled gradient method with diagonal elements Dii = 1/Aii and backtracking line
search with parameters α = 0.1, β = 0.5, s = 1.

• the diminishing stepsize rule
\[
t_k = \frac{1}{a + bk}
\]
with parameters of your choice (test a few possibilities).

3.3 What do we converge to? Convex functions and global optimality


For general differentiable loss functions f(w) we are not guaranteed that the gradient method will always converge to a stationary point, due to problem structure (e.g., there is no minimum) or bad selection of the algorithm parameters – stepsize length, initial point, or a combination thereof. In bad cases, if you use the gradient method blindly with a bad stepsize length, it is even possible that the sequence diverges (the opposite of converging) in the sense of the function value sequence {f(xk)} not being nonincreasing.
Exercise 3.5.
Construct an example of a function f : R → R and a starting point x0 such that the gradient descent method with constant stepsize tk = 1 yields an increasing sequence {f(xk)}.

For that reason, typically some problem structure is required to be able to 'prove' that the gradient method converges. A most standard assumption is, of course, that a minimum exists or at least that the function to be minimized is bounded from below (usually a rather cheap assumption), plus some smoothness assumption such as the Lipschitz continuity of the gradient, i.e., that there exists a constant L > 0 such that
\[
\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \qquad \forall x, y \in \mathbb{R}^n .
\]
We will use this assumption in section 3.6 to give an example convergence proof of the gradient method. To make yourself familiar with this assumption and to teach you to think about the bad-type situations for gradient methods, we propose the following exercise.
Exercise 3.6. Exercise 4.10 from [3]

Give an example of a twice differentiable function f : R → R and a starting point x0 ∈ R such that the problem min_{x∈R} f(x) has an optimal solution and the gradient descent method with stepsize 2/L diverges, i.e., the sequence {f(xk)} is increasing.

Luckily, because in ML applications it is we who design the functions to be minimized, we are free to pick a function for which convergence is expected. For this reason, let us now assume that we designed our ML problem reasonably so that it has a minimizer, and that our gradient descent implementation converges to a point at which the gradient vanishes. The question then is: what can we say about the point we converged to? Is it a global minimum? Is it at least a local minimum?
For general differentiable functions f the answer is, sadly, no to both of these questions. Despite that fact, gradient methods are widely popular because they are computationally very cheap and still, stationary points are good candidates for being minima, even if we cannot prove it. In fact, all the vastly popular deep learning networks are trained exactly in this way – nobody knows whether the minimum found is a local minimum or a global one, but they still work.
That being said, since in ML we have the freedom of modelling the problem with loss functions of our
choice, whenever we can, we rather pick a function where convergence can be guaranteed and therefore,
powerful algorithms can be used. And there is a fairly general class of functions for which the gradient
method is guaranteed to converge to a global minimum – convex functions.

Figure 3.2: Examples of a convex (left) and a nonconvex (right) function. Convex functions are very common in simple supervised learning models such as linear regression. In deep learning methods, one almost always deals with nonconvex functions with multiple local minima, the best of which is hard to find.

Definition 3.4.
A function f : Rn → R is convex if for every x, y ∈ Rn it holds that

f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y), ∀θ ∈ [0, 1]

If the function is differentiable everywhere, then convexity is equivalent to the following property:

f (y) ≥ f (x) + ∇f (x)> (y − x), ∀y, x

If the function f is twice continuously differentiable, then convexity on R^n is equivalent to:

∇²f(x) ≥ 0 (positive semidefinite)   ∀x ∈ R^n.

Geometrically, convexity of a function means that 'the straight line segment connecting two points on or above the graph of the function lies fully above or on the graph of the function', see fig. 3.2.
For such functions, we can show that every local minimum is also a global one.
Exercise 3.7.
Show that for a differentiable convex function f : Rn → R every stationary point is a global minimum.

The conclusion of this fact is that convexity is a desirable property of a function to minimize. How do we check or make sure that a function is convex? Luckily, we do not always need to go 'straight to the definition': just like we have calculus rules for checking differentiability of functions, we have certain rules with which we can construct 'big' convex functions out of 'small' convex functions.

Theorem 3.2. Operations that preserve convexity
If two functions f, g : Rn → R are convex then the following functions are convex as well:

h(x) = αf (x), α≥0


h(x) = f (x) + g(x)

If the function f : R^n → R is convex and the function g : R → R is convex and nondecreasing, then the function h(x) = g(f(x)) is convex.

Exercise 3.8.
Prove Theorem 3.2.

Exercise 3.9.
Prove or give a counterexample. If f (x) and g(y) are convex, then F (x, y) = f (x)g(y) is convex.

In short – whenever you can and it is not too limiting on your ability to model the problem at hand, it
is always good to opt for a formulation in which convexity can be guaranteed.

3.4 Modelling losses for ML


So far we’ve had some exposure to optimization, but without a real commitment to ML. We had to do this
because basic understanding of optimization is necessary to model the learning problem in such a way that
minimization of the right functions is easy. Now, however, it’s time to come back to the actual purpose of
optimization: solving optimization problems whose solutions provide us with ML tools.
Optimization can and will be used throughout this course to a variety of ML purposes (clustering, SVD
etc.). However, while in most ML domains optimization is simply a workhorse to get the result you want,
in supervised learning the interplay of optimization properties and the modeller’s freedom is largest.
For that, we will now go through several standard examples of supervised learning problems to ‘walk
together by hand’ through how we formulate optimization problems, and how we notice elements of the
structure of these problems that help us construct more efficient algorithms to solve them.
Vaguely speaking, in supervised learning, at our disposal we have a dataset (xi , yi ), i = 1, . . . , N . We
will call the vector xi ∈ Rn the feature vector and the output y the label. We will call this initial set the
training data set. Based on this data, we need to make a prediction for the y-part for new data, which is
not known at the training stage. We will call this data the test data set.
We can formulate a prediction mechanism as a to-be-defined-by-us function

ȳ = g(w, x)

parametrized by a vector w ∈ Rn . In order to find a good w, we need to choose it so that the relationship
g(w, x) performs well on the training data first. We do it by choosing w to be a minimizer of the training
problem:
w = argmin_{w ∈ R^n} Σ_{i=1}^N L(w, x_i, y_i),

where L(·, ·, ·) is a to-be-defined-by-us loss function. Naturally, the loss function should be closely connected to the prediction mechanism g(·, ·), typically through a univariate function l(·):

L(w, xi , yi ) = l(ȳi − yi ) = l(g(w, xi ) − yi ).

Whenever we model an ML prediction tool and the corresponding loss function, the loss function we choose should meet two goals: (i) reflect the real loss we're trying to minimize, and (ii) be easy to minimize. In learning how to do this, we will begin with the simplest example: regression.

3.4.1 Regression
A regression problem is formulated by having a data sample consisting of (x_i, y_i), i = 1, . . . , N, where we try to find the mechanism hidden in the relation

x_i ∈ R^n → y_i ∈ R

in the training part of the data. As you already know, the simplest way to model this problem is to use the following prediction mechanism

ȳ = w^T x,

where w ∈ R^n.
Remark 3.2.
Note that we can easily extend this model to y depending nonlinearly on x, by 'appending' the vector x with nonlinear transformations of each of its entries. Similarly, we can append an element 1 to the vectors x_i so that a 'constant term' is also taken into account in our model.

Our goal is to find a w that errs as little as possible on the data that we have at our disposal. We therefore need to 'punish' the difference

y_i − ȳ_i = y_i − w^T x_i

by using a loss function l(·), where we minimize

Σ_{i=1}^N l(y_i − w^T x_i).

If we use the quadratic loss function l(s) = s², then the objective function to minimize by playing with w is:

f(w) = Σ_{i=1}^N (y_i − w^T x_i)².

What do we know about this function? First of all, it is quadratic in w. Secondly, it is also convex in w, so any point at which the gradient is equal to 0 is a minimizer of this function. It is, therefore, a very nice optimization problem to solve, and we can employ the gradient method (or some other method) right away.
Remark 3.3. Huber loss function
Another popular choice for a loss function is the Huber loss:
l(s) = s²/2 if −1 ≤ s ≤ 1, and l(s) = |s| − 1/2 otherwise.

In some applications, the upside of the Huber loss function is that the total loss over all samples is not overly influenced by a few points for which the error is very large, known as outliers.

But realistically, our first choice will always be the quadratic loss function. It is good practice to see how things look when we put them in vector notation. Stacking all y_i into one vector y ∈ R^N, and all the x_i's as row vectors in a matrix X ∈ R^{N×n}, the function to be minimized is:

f(w) = ‖y − Xw‖²₂ = (y − Xw)^T (y − Xw),

where we use the norm notation of section 2.2. The gradient is then equal to

∇f(w) = 2 (X^T X w − X^T y).

Setting that expression equal to zero, we obtain the system of normal equations

X^T X w = X^T y,

which you know from section 2.4 and which has a unique solution if the matrix X has full column rank. To repeat, determining w by solving the system of normal equations is computationally not efficient.
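As a small illustration (ours, on synthetic data with made-up names), the following sketch compares the normal-equations solution with plain gradient descent using the gradient above; the constant stepsize 1/L with L = 2‖X‖₂² is an assumption justified later in section 3.6.

import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 3
X = rng.normal(size=(N, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

# normal equations: X^T X w = X^T y
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# gradient descent on f(w) = ||y - Xw||^2 with gradient 2 X^T (Xw - y)
w = np.zeros(n)
t = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # a safe constant stepsize
for _ in range(5000):
    w = w - t * 2 * X.T @ (X @ w - y)

print(np.allclose(w, w_ne, atol=1e-6))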
If the matrix X does not have full rank, then the performance of the gradient method on the optimization problem of minimizing f(w) is typically poor. Just as in numerical linear algebra a common trick in such cases is to select the minimum-norm solution of the system, a common optimization trick is to 'add' a component to the minimized objective function to ensure that the optimization problem has a unique minimizer. This process is known as regularization, and its most classical example is Tikhonov regularization

f(w) = ‖y − Xw‖²₂ + λ w^T w = (y − Xw)^T (y − Xw) + λ w^T w,

where λ > 0 is a regularization parameter. Although regularization typically changes the optimal solution to the problem, the concepts of minimum-norm solutions and regularization are closely related, as we will see in our third lecture on optimization. Regularization, as it turns out, is extremely important in machine learning, and not only for numerical reasons.
Remark 3.4. Regularization and out-of-sample performance
One of the effects of the regularization term on the optimal solution is that it promotes 'small' vectors w, which corresponds to 'simpler' models in which the entries of w are kept small. This is in line with the common experience in ML that simpler models trained on the training data tend to perform better on the test data than non-regularized models. In Example 11.2 we will see how this empirical intuition can be derived mathematically.

Coming back to the numerical aspects of regularization, if we look at it from the point of view of equating the gradient to 0, then we obtain the following system of equations:

(X^T X + λI) w = X^T y   ⟺   w = (X^T X + λI)^{-1} X^T y.

Here, an interesting relation to the pseudo-inverse matrices (recall section 2.7) arises because, as it turns
out, we have that
X^+ = lim_{λ→0+} (X^T X + λI)^{-1} X^T = lim_{λ→0+} X^T (X X^T + λI)^{-1}

with the equality holding because of the push-through identity.
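This limit is easy to check numerically; the sketch below (our own illustration on random data) compares the regularized solution for a very small λ with the pseudo-inverse solution:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)

lam = 1e-8
w_reg = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)   # (X^T X + lam I)^{-1} X^T y
w_pinv = np.linalg.pinv(X) @ y                                # X^+ y
print(np.allclose(w_reg, w_pinv, atol=1e-6))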

3.4.2 Classification: binary setting


One has to say that regression problems are easy. Easy to the point that, if you have enough data, you are able to write down the least-squares solution in closed form.
Things get really interesting from a modelling standpoint when we want to build a model that learns to classify objects according to whether they belong to a certain group or not:

x_i ∈ R^n → y_i ∈ {−1, 1}.

How can we build up the corresponding prediction and training mechanism to keep the optimization problem 'nice'?

A brute-force approach to this problem would be to behave as if the problem was a regression problem,
i.e., to use the same loss function as in regression and then, on new samples, to use the following mechanism
to predict the label of a new object:

ȳ = sign(w> x) (3.2)

This idea, however, suffers from a serious flaw. Namely, the regression loss function penalizes every deviation from the target label (for example, 1), while in fact, (3.2) is 'wrong' only when sign(w^T x_i) ≠ y_i.
For that reason, we need a loss function that, using the prediction mechanism (3.2), will impose penalty
only when the sign mismatch happens. In Example 11.2 we already got to know one particular solution
to this problem, known as the support vector machine (SVM), where the function to be minimized is, for
example:
f(w) = Σ_{i=1}^N max{0, 1 − y_i (w^T x_i)}²    (3.3)

This formulation is known as the L2-loss support vector machine, where the term L2 refers to the squared (2-norm-style) penalization of the violation of the inequality 1 ≤ y_i (w^T x_i).
Exercise 3.10.
Prove that (3.3) is a convex function of w.

Similarly to the regression case, it is very common to add a regularization term to the objective function:

f(w) = Σ_{i=1}^N max{0, 1 − y_i (w^T x_i)}² + λ‖w‖²₂.    (3.4)

Exercise 3.11.
Argue that (3.4) is a convex function of w and derive its gradient.

Example 3.3. Hinge loss SVM


Another example of an SVM formulation is the hinge loss one:
f(w) = Σ_{i=1}^N max{0, 1 − y_i (w^T x_i)}.

We easily verify that this formulation is convex in w as well. The name comes from the shape of the function max{0, 1 − x}. It is, however, not differentiable everywhere. Still, it is possible to apply a modified version of the gradient method to this problem. You would be surprised that in the ML literature, people behave as if this function were differentiable, i.e., they call the algorithm used to solve this problem the gradient method. We will, however, be mathematically correct by calling it the subgradient method, and we will introduce it only in the next lecture.

Exercise 3.12.
Construct a training dataset on the [0, 10]2 set by sampling two groups of points, one labelled −1,
and the other labelled 1, so that they are mostly located apart from each other. Then, set up an
L2 -loss SVM for this dataset with a regularization parameter λ and optimize it for a few values of λ
using the gradient descent method. For each λ, plot the corresponding w> x = 0 line to inspect how

well it separates the two groups of points.
Now, sample in the same way two new groups of points labelled −1 and 1. Check which of the SVMs you obtained before performs best in predicting the correct labels for this new dataset.
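As a starting point for exercise 3.12, here is a minimal sketch (ours; the data generation and all names are our own assumptions) of the regularized L2-loss SVM objective (3.4), its gradient, and a plain gradient-descent loop. Following remark 3.2, a constant feature is appended so that the separating line w^T x = 0 does not have to pass through the origin.

import numpy as np

def svm_l2_loss(w, X, y, lam):
    # f(w) = sum_i max(0, 1 - y_i w^T x_i)^2 + lam ||w||^2
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return np.sum(margins ** 2) + lam * (w @ w)

def svm_l2_grad(w, X, y, lam):
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    # only samples with a positive margin violation contribute
    return -2.0 * X.T @ (margins * y) + 2.0 * lam * w

# two groups of points in [0, 10]^2, plus an appended constant feature
rng = np.random.default_rng(2)
X = np.vstack([rng.uniform(0, 4, size=(20, 2)), rng.uniform(6, 10, size=(20, 2))])
X = np.hstack([X, np.ones((40, 1))])
y = np.concatenate([-np.ones(20), np.ones(20)])

lam = 0.1
w = np.zeros(3)
t = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 + 2 * lam)   # conservative constant stepsize
for _ in range(20000):
    w = w - t * svm_l2_grad(w, X, y, lam)
print(w, svm_l2_loss(w, X, y, lam))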

Yet another way to achieve the same thing is to think of the prediction less as a label itself, and more as a probability of being labelled either −1 or 1. The only question is how to model probabilities in a continuous way that will make things friendly to optimization. A popular way to model the probability that a given sample receives value +1 is:

P(ȳ = 1 | x) = exp(w^T x) / (1 + exp(w^T x)),

No time for questions, jump in and we’ll explain later. For such a model we use the following prediction
mechanism: 
ȳ = 1 if P(ȳ = 1 | x) ≥ 0.5, and ȳ = −1 otherwise.
Under such a prediction mechanism, how could we quantify 'loss' or 'success' so that we can somehow optimize w on the training data set? Success happens when we have a sample (x_i, y_i) and our prediction mechanism assigns the label correctly. In line with our prediction model, the probability of success (predicting the correct label) is:

P(ȳ_i = y_i | (x_i, y_i)) = exp(w^T x_i) / (1 + exp(w^T x_i)) = 1 / (1 + exp(−w^T x_i))   for y_i = 1,
P(ȳ_i = y_i | (x_i, y_i)) = 1 / (1 + exp(w^T x_i))   for y_i = −1.

In both cases, the probability can be captured with a single formula:

P(ȳ_i = y_i | (x_i, y_i)) = 1 / (1 + exp(−y_i (w^T x_i)))

and with this formula, the probability that we are correct across all samples can be written as:

Π_{i=1}^N P(ȳ_i = y_i | (x_i, y_i)) = Π_{i=1}^N 1 / (1 + exp(−y_i (w^T x_i))).

Training our model would mean to maximize this term over w, and it looks like a difficult expression to
maximize. If we take a logarithm of it, we obtain:
log Π_{i=1}^N P(ȳ_i = y_i | (x_i, y_i)) = − Σ_{i=1}^N log(1 + exp(−y_i (w^T x_i)))

To maximize this expression is equivalent to minimizing its negative, and we obtain the following optimization problem to solve:

min_w Σ_{i=1}^N log(1 + exp(−y_i (w^T x_i))).

As strange as it sounds, it turns out that this function is convex in w and amenable to efficient optimization
as the so-called log-sum-exp function. The entire derivation we just went through is equivalent to solving

the so-called logistic regression problem, and similarly to the previous cases, it is possible and common to
add a regularization term to the objective function
f(w) = Σ_{i=1}^N log(1 + exp(−y_i (w^T x_i))) + λ‖w‖²

Logistic regression is a very popular tool for solving binary classification problems and as we shall see, also
for multi-class classification ones.
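For concreteness, a small sketch (ours) of the regularized logistic loss and its gradient; np.logaddexp is used to evaluate log(1 + exp(·)) in a numerically stable way, and a finite-difference comparison serves as a sanity check.

import numpy as np

def logistic_loss(w, X, y, lam):
    # f(w) = sum_i log(1 + exp(-y_i w^T x_i)) + lam ||w||^2
    z = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -z)) + lam * (w @ w)

def logistic_grad(w, X, y, lam):
    z = y * (X @ w)
    s = 1.0 / (1.0 + np.exp(z))          # per-sample weight 1 / (1 + exp(y_i w^T x_i))
    return -X.T @ (s * y) + 2.0 * lam * w

# finite-difference check of the first partial derivative on random data
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(10, 3)), rng.choice([-1.0, 1.0], size=10), rng.normal(size=3)
eps = 1e-6
e0 = np.zeros(3)
e0[0] = 1.0
fd = (logistic_loss(w + eps * e0, X, y, 0.1) - logistic_loss(w - eps * e0, X, y, 0.1)) / (2 * eps)
print(fd, logistic_grad(w, X, y, 0.1)[0])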
Remark 3.5.
The name ‘logistic regression’ comes from the idea of modelling probabilities as
 
p
log ∼ w> x
1−p

which term is known as the ‘logit’, which has been invented as a ‘trick’ to relate probabilities (which
have to belong to [0, 1]), with an expression that would be linear (and thus unbounded) in the
parameter vector.

3.4.3 Classification: multiclass settings


As a last element of our exercises in modelling supervised learning problems, we will consider the question of how to model a classification problem in which, for a change, we no longer choose between two classes but more than that:

x_i ∈ R^n → y_i ∈ {1, . . . , k}.

How do we set up a prediction mechanism, and the corresponding optimization problem, to do something like this? It is always good to construct a new idea starting from an idea that already worked somewhere else.
In support vector machines, we used a linear function w^T x_i and would assign label −1 or +1 based on whether this linear function is less than or greater than zero. One idea would be to use a linear function again and, in the case of three classes, to assign label 1, 2, 3 based on whether the value of this function falls, for example, into the intervals (−∞, 0], (0, 10], and (10, +∞). Building an optimization problem around this idea, however, is somewhat tricky, and the threshold values we've chosen are somewhat arbitrary.
An idea that would be ‘symmetric’ w.r.t. each of the classes 1, . . . , k is a slightly different modification
of the SVM idea. Namely, to each of the classes we assign its own vector wj and establish the prediction
mechanism as
ȳ = argmax_{j ∈ {1,...,k}} w_j^T x.

In other words, we assign to a given point the number of the class for which the inner product of the SVM vector and the observation's feature vector is the largest. But how do we train such a model?
Mathematically, we want that for an observation (xi , yi ), if yi = j, then it holds that

w_j^T x_i ≥ w_l^T x_i   ∀l ≠ j   ⇔   w_j^T x_i − w_l^T x_i ≥ 0   ∀l ≠ j.

From here, we are nearly at the formulation; the only thing left is to modify this inequality slightly to prevent the trivial solution in which all w_j = 0. We do it in the same fashion as in the standard SVM (recall example 1.3), by introducing the threshold value 1:

w_{y_i}^T x_i − w_l^T x_i ≥ 1   ∀l ≠ y_i.

Now, we want to penalize the situation where 1 − w_{y_i}^T x_i + w_l^T x_i > 0, for each l ≠ y_i and for each sample i. We can do it, for example, using the hinge loss function in the following formulation:

Σ_{i=1}^N Σ_{l ≠ y_i} max{1 − w_{y_i}^T x_i + w_l^T x_i, 0}.    (3.5)

This function is convex in w but it is not differentiable. If we prefer a function that is differentiable (and in this chapter we definitely do, because we have only learned the gradient descent method so far), we can also apply the L2-loss function to obtain:

Σ_{i=1}^N Σ_{l ≠ y_i} max{1 − w_{y_i}^T x_i + w_l^T x_i, 0}².    (3.6)

As in all the previous cases, it is a standard practice to add a regularization term to the objective function
to add preference for ‘simple’ models.
Another possible trick is to extend the concept of logistic regression to the multi-class setting. There,
the reasoning would be that we try to model the probability that a given sample belongs to the j-th class:

P(ȳ = j | x) = exp(w_j^T x) / Σ_{l=1}^k exp(w_l^T x).

How do we set up an optimization model to achieve that? As it turns out, we can repeat directly the
idea from the logistic regression to obtain a complicated ‘log-sum-exp’ style function to minimize, which is
nevertheless convex.
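A short sketch (ours) of this multiclass logistic ('softmax') loss, where the matrix W stacks the vectors w_j as rows and the usual max-subtraction trick keeps the exponentials numerically stable:

import numpy as np

def softmax_loss(W, X, y):
    # W: (k, n), X: (N, n), y: integer labels in {0, ..., k-1}
    scores = X @ W.T                                      # N x k matrix of w_j^T x_i
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilize the exponentials
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # negative log-likelihood of the correct classes (convex in W)
    return -log_probs[np.arange(len(y)), y].sum()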

3.5 What distinguishes optimization methods used for ML?


So far, for the supervised learning problems, we’ve been quite successful in formulating optimization problems
that were convex, and in some cases, having even a unique optimal solution. That would make us think
that optimization was almost made for machine learning. And if it was made for machine learning, that
would mean that to estimate ML models, we could use specialized software packages for solving optimization
problems called... solvers. But using optimization-minded solvers is usually not a good idea for ML because,
originally, mathematical optimization was designed to solve problems like the following example.
Example 3.4. Chip production
The company BIM (Best International Machines) produces two types of microchips, logic chips (each
of which requires 1gr silicon, 1gr plastic, 4gr copper, and yields 12 euro profit) and memory chips
(1gr germanium, 1gr plastic, 2gr copper, 9 euro profit). The current stock of raw materials is as
follows: 1000gr silicon, 1500gr germanium, 1750gr plastic, 4800gr of copper. How many microchips
of each type should be produced to maximize the profit while respecting the raw material stock
availability? Let x1 denote the number of logic chips and x2 that of memory chips. This decision can

be reformulated as an optimization problem of the following form:

max_{x1, x2}   12 x1 + 9 x2
s.t.           x1 ≤ 1000            (silicon)
               x2 ≤ 1500            (germanium)
               x1 + x2 ≤ 1750       (plastic)
               4 x1 + 2 x2 ≤ 4800   (copper)
               x1, x2 ≥ 0           (nonnegative production)

This is a classical production optimization problem. You have the data, there is a certain objective
function to be maximized, and there are certain constraints to be met. Here is already the first difference
with the optimization problems we’ve constructed so far – they did not include any constraints on the decision
variable w, which was free to take any value in Rn . Although there are ML-related optimization problems
in which constraints play a role, it is rather uncommon because it makes the problem more difficult to solve
(apart from making sure that one moves in a direction of decreasing the loss function, one also needs to
make sure that no constraints are violated).
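For problems of this classical form one would indeed hand the data directly to a solver. As an illustration (assuming scipy is available; nothing later depends on it), the BIM problem can be solved with scipy's linear-programming routine:

import numpy as np
from scipy.optimize import linprog

# linprog minimizes c^T x, so the profit vector is negated to maximize it
c = np.array([-12.0, -9.0])
A_ub = np.array([[1.0, 0.0],    # silicon
                 [0.0, 1.0],    # germanium
                 [1.0, 1.0],    # plastic
                 [4.0, 2.0]])   # copper
b_ub = np.array([1000.0, 1500.0, 1750.0, 4800.0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)          # optimal production plan and the corresponding profit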

Training data performance and test data performance. The most important difference between ordinary optimization purposes and ML lies in the following. In the production planning example, we can safely assume (unless there are errors) that the problem data will be as stated when we implement the solution. That means that the logic chip will, for example, require 1 gram of silicon, etc. In this context, it makes sense to put a lot of effort into seeking the solution that really maximizes the objective function.
In optimization problems related to supervised machine learning the situation is different by design. We
have the datasets (xi , yi ), i = 1, . . . , n and we optimize the vectors w to perform as well as possible on this
data. But in the end, the predictive mechanism will be used on new data, which is different from the training
dataset.
Of course, we expect the new data to be similar to the training set (otherwise there would be no reason
to train anything), but it will not be exactly the same data. There are two conclusions that can be made
from this.
First, if there is a way to explicitly take into account the fact that the new data will be slightly different
from the training data, then we can try to include this information already at the training stage so that we
‘behave as if we were optimizing for the unknown yet data’. This might sound weird, but it turns out that
regularization is implicitly doing exactly this.
Example 3.5. Regularizer as a by-product of data uncertainty
Consider the classical linear regression problem where we minimize the following objective (now we take the norm, not the squared norm)

f (w) = ky − Xwk2

where X, y represents the training data set. Now, we will try to imitate the fact that the new data
on which the model will be evaluated, is slightly different. In particular, let us assume that the new
data is of the form:
X + ∆, y
while we leave the vector y unchanged for simplicity, and we perturb the feature vector a bit by a
matrix ∆. If we had a chance to train our model on this new data, then the value of the loss function

on this new data would be equal to:

g(w) = ‖y − X̂w‖₂ = ‖y − (X + ∆)w‖₂.

The problem is that we do not know the value of ∆ in advance. Continuing our thought experiment,
assume that the matrix ∆ is ‘not too big’. For example, that it holds that its 2-operator norm (see
section 2.2) is not larger than a certain small number: k∆k2 ≤ λ.
If we worry about the worst-possible realization of ∆ for a given w, then we can discover that it holds
that
max_{‖∆‖₂ ≤ λ} ‖y − (X + ∆)w‖₂ = ‖y − Xw‖₂ + λ‖w‖₂.

This is the worst-case value of the loss function. For that reason, if our aim is to minimize such a
worst-case value of the loss function on the unknown data, we are in fact applying regularization to
our optimization problem.
Of course, the setup in this example was somewhat artificial, but in fact one can re-derive many different regularization terms by considering a specific form of 'perturbation' in the training data and then deriving the form of the loss function under the worst-case value of this perturbation.
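The worst-case identity above is easy to verify numerically: the rank-one perturbation ∆ = −λ r w^T / (‖r‖‖w‖) with r = y − Xw has operator norm λ and attains the maximum. The sketch below (ours, on random data) checks both claims:

import numpy as np

rng = np.random.default_rng(3)
X, y, w = rng.normal(size=(30, 4)), rng.normal(size=30), rng.normal(size=4)
lam = 0.7

r = y - X @ w
Delta = -lam * np.outer(r, w) / (np.linalg.norm(r) * np.linalg.norm(w))

lhs = np.linalg.norm(y - (X + Delta) @ w)                 # perturbed loss
rhs = np.linalg.norm(r) + lam * np.linalg.norm(w)         # claimed worst-case value
print(np.isclose(lhs, rhs), np.isclose(np.linalg.norm(Delta, 2), lam))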

Another conclusion from the fact that the data whose performance we really care about is different from the data we optimize on, is that it might not make sense to run an optimization algorithm all the way until a minimum is found. For this reason, optimization performed for the purposes of machine learning often stops much earlier, i.e., the stopping criteria are more relaxed.
Separability of the objective functions and the stochastic gradient descent. Next, the objective functions in ML-related problems are separable: they are sums of many similar terms, each of which corresponds to a single sample in the training dataset:

f(w) = Σ_{i=1}^N L(w, x_i, y_i)

As the number of terms can be very large (as large as the dataset), even computing the gradient might be a very time-consuming task at each step of the gradient descent algorithm. At the same time, chances are high that many of the terms are very similar, as they correspond to the dataset, where many points are expected to be close to each other.
For that reason, it is common to approximate the objective function (and its gradient) by randomly selecting N_S elements of the training sample. In other words, to use the function

f̃(w) = Σ_{i∈S} L(w, x_i, y_i)

where S is a sample of N_S elements from the training set. It turns out that while this function includes many fewer points, its gradient is still an excellent approximation of the gradient of the original function. This approach is known as the stochastic gradient descent (SGD) algorithm and is extremely popular in ML applications.
It is important to notice that

∇f(w) ≈ ∇f̃(w) = Σ_{i∈S} ∇L(w, x_i, y_i),

so that the gradient computation can be done in parallel for all of the N_S samples and then simply summed. For this reason, the batch size N_S is often chosen as the maximum number of processes one can run in parallel on the machine.
SGD methods typically cycle through the full data set, rather than simply sampling the data points at
random. In other words, the data points are permuted in some random order and blocks of points are drawn

from this ordering. Therefore, all other points are processed before arriving at a data point again. Each
cycle of the SGD procedure is referred to as an epoch - a term you will often see in ML publications.
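A minimal sketch (ours) of one such epoch for a generic separable loss; grad_L(w, x, y) is a placeholder for the per-sample gradient ∇L(w, x_i, y_i) and is an assumption of this sketch.

import numpy as np

def sgd_epoch(w, X, y, grad_L, batch_size, stepsize, rng):
    # one epoch: permute the data once, then sweep through it in minibatches
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = order[start:start + batch_size]
        g = sum(grad_L(w, X[i], y[i]) for i in batch)   # minibatch gradient estimate
        w = w - stepsize * g
    return w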

3.6 Gradient descent: a simple proof of convergence


This course is still a mathematical course, and from the standpoint of a mathematician implementing optimization methods, it is important to understand 'why' optimization algorithms work, i.e., why they converge. Because an entire course could be devoted to just that, we restrict ourselves and help you 'get a foot in the door' of proving convergence of algorithms.
In this section we want to give you a minimum-required-knowledge proof of the convergence of the gradient value in the gradient descent method. As a by-product, you will also have a chance to see how, if we know something about the function to be minimized, this knowledge can be used to determine a good value of the constant stepsize for the gradient method.
But, as is often the case in life, in order to prove something we need to assume something. Convergence proofs typically require some kind of smoothness assumption on the function, i.e., some knowledge about how fast the function changes. Without it, proving convergence (or a convergence rate) can be difficult.
In our case, to keep things simple we will make the already-mentioned Lipschitz-continuity assumption
on the function’s gradient. First, we will need a short result that relates the amount of change in the function
value of the gradient step to the value of the gradient, step length and the Lipschitz constant L.
Lemma 3.1.
Let f be twice continuously differentiable with an L-Lipschitz continuous gradient. Then, we have that

f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ‖x − y‖².

Proof. We have that

f(y) − f(x) = ⟨∇f(x), y − x⟩ + ∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt.

We therefore have that

|f(y) − f(x) − ⟨∇f(x), y − x⟩| = | ∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt |
  ≤ ∫₀¹ |⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩| dt
  ≤ ∫₀¹ ‖∇f(x + t(y − x)) − ∇f(x)‖ ‖y − x‖ dt
  ≤ ∫₀¹ t L ‖y − x‖² dt
  = (L/2) ‖y − x‖².

This lemma can be used to quantify things very efficiently for the gradient method, in the following
simple corollary.

Lemma 3.2. Sufficient decrease
Let f be twice continuously differentiable with an L-Lipschitz continuous gradient. Then, we have that

f(x) − f(x − t∇f(x)) ≥ t (1 − Lt/2) ‖∇f(x)‖².

As you can see, we managed to obtain a lower bound on the improvement in the objective function as
a result of a gradient step. If we want to maximize this lower bound, then we can manipulate the step size
t to maximize the term on the right. This boils down to maximization of a simple quadratic function with
the following famous formula:
t* = 1/L,

in which case we have

f(x) − f(x − t∇f(x)) ≥ (1/(2L)) ‖∇f(x)‖².
Here, you see that if for our function we know the Lipschitz constant of the gradient, then we can use it to
provide a constant stepsize.
Theorem 3.3.
Let f be twice continuously differentiable with an L-Lipschitz continuous gradient. Let {x_k} be the sequence generated by the gradient method with constant stepsize t = 1/L for solving

min_{x ∈ R^n} f(x).

Assume that the function f is bounded from below by a constant m.

Then, we have that the sequence {f(x_k)} is nonincreasing and

min_{k=0,...,K} ‖∇f(x_k)‖ ≤ √( 2L (f(x_0) − m) / (K + 1) ).

Proof. By the sufficient decrease lemma with t = 1/L, we have that

f(x_k) − f(x_{k+1}) ≥ (1/(2L)) ‖∇f(x_k)‖² ≥ 0,

which establishes that the sequence {f(x_k)} is nonincreasing. Secondly, because the sequence {f(x_k)} is nonincreasing and bounded from below, we know that it is convergent and, therefore, also lim_{k→+∞} ‖∇f(x_k)‖ = 0.
Now, we can do something which is a standard trick in almost all proofs of algorithm convergence – we add all such inequalities via the telescoping trick over k = 0, . . . , K. Through this trick, all the intermediate function values vanish and we obtain

f(x_0) − f(x_{K+1}) ≥ (1/(2L)) Σ_{k=0}^K ‖∇f(x_k)‖².

Because f(x_{K+1}) ≥ m, we have

f(x_0) − m ≥ f(x_0) − f(x_{K+1}) ≥ (1/(2L)) Σ_{k=0}^K ‖∇f(x_k)‖² ≥ ((K + 1)/(2L)) min_{k=0,...,K} ‖∇f(x_k)‖².

Via this result, we were able to obtain not only that the gradient of the minimized function converges to zero, but also an upper bound on the rate of this convergence: the norm of the gradient converges at the rate O(1/√K). Similar results can be obtained for the other stepsize selection strategies we presented – backtracking and exact line search.
This is, in fact, one of the worst convergence rates you can get if you can prove anything at all about an algorithm. Under the assumptions we made, this is the sharpest possible rate estimate, but for special function classes rates of O(1/k) or even O(1/k²) are possible.
Exercise 3.13. Exercise 4.11 from [3]

Suppose that f : Rn → R is a twice differentiable function with Lipschitz continuous gradient with
Lipschitz constant L. Suppose that the optimal value of minimizing f is f ∗ . Let {xk } be the sequence
generated by the gradient descent method with constant stepsize 1/L. Show that if {xk } is bounded
then f (xk ) → f ∗ as k → +∞.

3.7 Final remarks


We finish this lecture by summarizing the key steps necessary for using optimization in an ML application,
especially in the context of supervised learning.
First of all we need to decide what kind of a prediction model we want to build. In order to optimize the
performance of such a model on the training data set, we need to establish an optimization model where a
loss function is minimized. If multiple choices are possible, it is usually preferable that the function to be
minimized be smooth and convex in the model’s parameters because that makes it possible to deploy the
(stochastic) gradient descent method that converges to a global minimum.
The choice of the ML model and the corresponding optimization problem also implies the choice of the corresponding hyperparameters: for example, the regularization factor λ (which promotes the usage of simpler predictive tools), the stepsize rule parameters in the gradient descent method, etc. We want to select the hyperparameter values that give the best-performing ML tool in the end.
The issue is that the optimization model is solved for given values of the hyperparameters. For that reason, if we want to optimize ('tune') the hyperparameters, we face an 'optimization problem over the λ's', each candidate value of which requires solving an optimization problem of its own.
For that reason, in ML, the first thing that is done is to split the dataset into three parts:

• training set - on this dataset a specific ML model is trained under specific values of hyperparameters
such as the regularization parameter λ, stepsize for gradient descent etc.
• validation set - performance of the model trained on the ‘training set’ is evaluated on the validation
set;
• test set - on this set the ML tool with the best hyperparameter value is evaluated to see if a given ML tool is good. If multiple ML tools are considered, their respective validation-best versions are compared against each other on this test set.
How do we optimize over the hyperparameters? Typically, the most common tool is grid search – for
example, we test 10 different values for λ, train a model on the training set for each of them, and pick the
best λ among them based on the models’ performance on the validation set.
As for the model training itself, a standard practice is to pre-process the data. This is done, for example, by normalizing all the features (entries of the x_i's) so that each of them has a common minimum/maximum value, or a common mean/standard deviation. This helps to avoid algorithm convergence issues such as zigzagging and, overall, improves the optimization performance.
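A small sketch of such a preprocessing step (ours), assuming the feature vectors are stacked row-wise; note that the statistics are computed on the training set only and then applied to both sets:

import numpy as np

def standardize(X_train, X_test):
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0.0] = 1.0          # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma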

Figure 4.1: Examples of shallow minima (left) and flat regions (right).

4 Unconstrained optimization: beyond the gradient method


4.1 Introduction
The gradient method is a nice and very intuitive tool for optimizing loss functions of unconstrained problems (we assume that x can take any value in the Euclidean space). However, there are several things that can make its application troublesome – we spend a lot of computational effort against small improvements in the objective function value.
Common examples of such situations are:
• the gradient method gets stuck in 'shallow' local minima or stops on flat regions from which it could still continue further on, but it 'doesn't know' to do so, because the standard stopping criterion makes it stop where it is, as in fig. 4.1.
• the curvature of the function (essentially, how quickly the gradient changes) makes the directions selected by the gradient method not great, because the direction stops being a descent direction very quickly, as in the zigzagging example 3.2
• the function to minimize is not smooth, as in fig. 4.2
These are the classical downsides of applying the gradient method blindly, or situations in which it simply
cannot be applied on an as-is basis. Throughout this section, we will learn a few methods for dealing with
these situations, with a special attention to those that require only first-order information.
To warm up for the coming discussion, we need to become familiar with the deep connection between
the definiteness of the Hessian of the local approximation of f , and the properties of the stationary point at
hand.
Exercise 4.1.
Show that for a twice differentiable function f : Rn → R, a point x∗ for which ∇f (x∗ ) = 0 and where
∇2 f (x∗ ) is positive definite, is a local minimum. Next, provide a counterexample for the case when
the matrix is positive semidefinite.

Figure 4.2: A nonsmooth function.

As you know, particularly nasty examples are saddle points. An inherent property of ML problems (objective functions being sums of many functions of many parameters) is that they tend to have many saddle points, as the following exercise tries to explain.
Exercise 4.2.
Consider the univariate function f (x) = x3 − 3x and its natural multivariate extension:
F(x_1, . . . , x_n) = Σ_{i=1}^n f(x_i).

Show that this function has only one minimum, one maximum, and 2^n − 2 saddle points. Argue why saddle points proliferate in high-dimensional functions.

4.2 Correcting the gradient method


Gradient-based methods are very popular because the computational effort needed to obtain the gradient is fairly low – it requires computing only n entries. This is attractive because in ML, due to the size of datasets, higher-order information is relatively expensive to compute. But as we have already seen, such methods applied blindly can under-perform. The first idea is therefore to keep the cost of the method low, but avoid the downsides of a blind implementation.
The most popular methods aiming at dealing with irregular curvatures of the function 'modify' the gradient directions/stepsizes to insure against the situations mentioned above. In particular, some of these methods will even modify the step length per component of the parameter vector w, depending on how rapidly the magnitude of ∂f/∂x_i has been changing in the previous iterations, in line with the intuition: for parameters with a 'stable' history of partial derivative values, we make longer steps, while for parameters with a 'turbulent' history of partial derivatives we make smaller steps. You can see it as a way of avoiding the zigzagging behavior presented in example 3.2.
Simple momentum-based learning. One of the simplest ideas is to try to imitate the behavior of a ball that rolls on the graph of a function: even when it falls into a local minimum, it maintains some of its earlier speed and rolls further. The computation of the update direction is then given by

x_{k+1} = x_k + d_k,   d_k = β d_{k−1} − t_k ∇f(x_k),

where β ∈ (0, 1) is a parameter that determines how much of the previous update is 'remembered' in the next update. Clearly, with an update like this, the algorithm will not stop immediately when it encounters a point x_k at which ∇f(x_k) = 0, but instead, it will keep moving further.
AdaGrad (Adaptive Gradient Descent, [16]). The AdaGrad algorithm differentiates the scaling
of different components of x. In particular, it keeps track of the sum of squared magnitudes of the partial
derivatives with respect to xk,i . From iteration to iteration, one updates these quantities as:
A_{0,i} = 0,   A_{k,i} = A_{k−1,i} + (∂f/∂x_i)²,   i = 1, . . . , n.

These are used to scale down the update with respect to the corresponding parameters as:

x_{k+1,i} = x_{k,i} − (α / √A_{k,i}) (∂f/∂x_i).
Clearly, AdaGrad is a diminishing stepsize update rule. This means that from a certain moment, the method
will practically stop moving. Another downside is that none of the gradient history stored in Ak,i ’s gets
forgotten.
RMSProp (Root Mean Square Propagation, [24]). The RMSProp algorithm is essentially Ada-
Grad, but with the important trick that the past magnitude information gets gradually forgotten with time
at a rate quantified by a parameter 1 − ρ where ρ ∈ (0, 1). The update rules are then:
A_{0,i} = 0,   A_{k,i} = ρ A_{k−1,i} + (1 − ρ) (∂f/∂x_i)²,   i = 1, . . . , n,

and

x_{k+1,i} = x_{k,i} − (α / √A_{k,i}) (∂f/∂x_i).
This is an ‘improvement’ upon AdaGrad, but the feature of both algorithms is that there is no momentum
effect in the gradient itself (only in the step length). Out of these considerations, the Adam algorithm was
born.
Adam (Adaptive Moment Estimation, [27]). The very popular Adam algorithm marries the features
of all the above-mentioned methods, performing exponential smoothing of both the stepsize, and the gradient
direction on a per-entry level of x. The corresponding magnitude rule for Ak,i is the same as for RMSProp,
and for the direction it applies the following rule:
 
F_{0,i} = 0,   F_{k,i} = ρ_f F_{k−1,i} + (1 − ρ_f) (∂f/∂x_i).

With these values, the per-entry update rule is:

x_{k+1,i} = x_{k,i} − (α / √A_{k,i}) F_{k,i},
which combines both the ideas of gradient memory, and the stepsize magnitude memory.
The algorithms mentioned above are state-of-the-art tools for training, for example, huge neural networks.
All the methods mentioned now have been hand-crafted through experimentation and are empirically seen
to perform very well on ML-related optimization problems. Theoretical results on their convergence rates
under different sets of assumptions are also available (similar to those of gradient descent), but they are of
smaller importance.
Importantly, we can add that the stochastic gradient descent algorithm, to a certain degree, achieves similar goals as the approaches mentioned above. Due to the stochasticity of the gradient evaluation, SGD is less likely than plain GD to get stuck in local optima, and more likely to traverse flat regions.
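The update rules above translate into only a few lines of code. The following sketch (ours, simplified: no bias correction, and a small ε added before the division to avoid dividing by zero – both standard practical tweaks) implements an Adam-style per-entry update:

import numpy as np

def adam_like_step(x, grad, A, F, alpha=1e-3, rho=0.999, rho_f=0.9, eps=1e-8):
    g = grad(x)
    A = rho * A + (1 - rho) * g ** 2         # smoothed squared magnitudes (RMSProp part)
    F = rho_f * F + (1 - rho_f) * g          # smoothed gradient (momentum part)
    x = x - alpha * F / (np.sqrt(A) + eps)   # per-entry scaled step
    return x, A, F

# usage on a toy quadratic
grad = lambda x: 2 * x
x, A, F = np.array([5.0, -3.0]), np.zeros(2), np.zeros(2)
for _ in range(10000):
    x, A, F = adam_like_step(x, grad, A, F)
print(x)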

4.3 Newton method
4.3.1 Introduction
The algorithms of the previous section were trying to account for the curvature of the function (which is, fundamentally, second-order information about the function, as opposed to the gradient method, which uses only first-order information) by accumulating information from the past history of the first-order information. There are good reasons for that – while computation of the gradient requires O(n) time, computing the curvature (Hessian) of the function takes O(n²) time, which is expensive for an operation that would be related to just making one step.
Nevertheless, second-order information is a very powerful source of knowledge about a function and
second-order information-based optimization methods (Newton method and its variants) are a very important
part of the optimization curriculum, due to their ability to converge quickly to the minimum once they get
close to it. But, if you remember well what we said in the previous section about the fact that in ML-minded
optimization we don’t care that much about the actual minimum, this should not be a convincing enough
reason to use the Newton method.
The answer is: it still makes sense to learn about the idea of the Newton method, so that it can be a
good motivation for computationally more efficient tools that try to do the same thing and are inspired by
it, but work at a lower computational cost.

4.3.2 Newton method


The idea behind the gradient method was to take a 'local picture' of the function we're minimizing to determine the direction to move in next. That local picture was a first-order Taylor approximation. Just as well, one might create a better local picture of the function by including the curvature information. This is the second-order Taylor approximation:

f(x) = f(x_k) + ∇f(x_k)^T (x − x_k) + (1/2) (x − x_k)^T ∇²f(x_k) (x − x_k) + o(‖x − x_k‖²).

This higher-quality picture comes at a higher cost, but it is much more informative – we also have an estimate of how quickly the gradient will change. So the idea of the Newton method is to behave as if this were the actual function we're trying to minimize and to equate the gradient of this approximation to zero. This gives a system of equations:

∇f (xk ) + ∇2 f (xk )(x − xk ) = 0

A point x obtained in this way is a critical point of the Taylor approximation. If the matrix ∇²f(x_k) is invertible, we obtain:

x = x_k − (∇²f(x_k))^{-1} ∇f(x_k),

where, of course, we typically do not want to compute (∇²f(x_k))^{-1} explicitly since it is only applied once. This is the basis of the 'pure' Newton method outlined in alg. 16. When running the Newton method, the vast majority of time is spent solving the system of linear equations to get x_{k+1} − x_k, and smart approaches are needed that do it efficiently, utilizing the problem structure as much as possible. For that reason, it is not an overstatement that an optimization algorithm related to the Newton method is almost equivalent to the algorithm used to compute the Newton updates.

Algorithm 16: Pure Newton method.


Data: x0
while Convergence criterion not met do
xk+1 = xk − (∇2 f (xk ))−1 ∇f (xk ).
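A minimal sketch (ours) of alg. 16, in which the Newton system is solved with a generic dense solver instead of forming the inverse explicitly; the quadratic example illustrates the one-step behavior of exercise 4.3.

import numpy as np

def newton(x0, grad, hess, tol=1e-8, max_iter=50):
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        d = np.linalg.solve(hess(x), -g)   # solve H d = -g instead of inverting H
        x = x + d
    return x

# strictly convex quadratic f(x) = x^T A x + b^T x + c
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: 2 * A @ x + b
hess = lambda x: 2 * A
print(newton(np.array([10.0, 10.0]), grad, hess))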

All this computational effort is not for nothing. Newton method is a very powerful one and it is immensely
popular in mathematical optimization. As the following exercise shows, for quadratic functions it actually
converges immediately to the minimum.
Exercise 4.3.
Show that the pure Newton method finds the minimum of a function

f (x) = x> Ax + b> x + c, A positive definite

in one step.

This is in sharp contrast with gradient-style methods, which may exhibit zigzagging behavior, as you have seen before. Moreover, for the Newton method one can show that, close to the minimum x*, its convergence becomes quadratic, as the following result shows. This result is so powerful that we provide you with a minimum-required-knowledge proof. The assumptions of this theorem are, except for specific cases, expected to hold only locally around minima for non-quadratic functions, but they make the analysis illustrative.
Theorem 4.1.
Let f be a twice continuously differentiable function defined over Rn . Assume that

• there exists m > 0 for which ∇²f(x) ≥ mI for all x ∈ R^n

• there exists L > 0 for which ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖ for any x, y ∈ R^n
Let {xk } be the sequence generated by Newton’s method and let x∗ be the unique minimizer of f
over Rn . Then for any k = 0, 1, . . . the inequality holds:
‖x_{k+1} − x*‖ ≤ (L/(2m)) ‖x_k − x*‖².

Proof. Let k be a nonnegative integer. Then we have

x_{k+1} − x* = x_k − (∇²f(x_k))^{-1} ∇f(x_k) − x*
             = x_k − x* + (∇²f(x_k))^{-1} (∇f(x*) − ∇f(x_k))        [as ∇f(x*) = 0]
             = x_k − x* + (∇²f(x_k))^{-1} ∫₀¹ ∇²f(x_k + t(x* − x_k)) (x* − x_k) dt
             = (∇²f(x_k))^{-1} ∫₀¹ (∇²f(x_k + t(x* − x_k)) − ∇²f(x_k)) (x* − x_k) dt.

If we combine the last equality with the fact that ∇2 f (xk ) ≥ mI, then we obtain that k(∇2 f (xk ))−1 k ≤ 1/m.
Therefore, we have
‖x_{k+1} − x*‖ ≤ ‖(∇²f(x_k))^{-1}‖ ‖ ∫₀¹ (∇²f(x_k + t(x* − x_k)) − ∇²f(x_k)) (x* − x_k) dt ‖
              ≤ ‖(∇²f(x_k))^{-1}‖ ∫₀¹ ‖(∇²f(x_k + t(x* − x_k)) − ∇²f(x_k)) (x* − x_k)‖ dt
              ≤ ‖(∇²f(x_k))^{-1}‖ ∫₀¹ ‖∇²f(x_k + t(x* − x_k)) − ∇²f(x_k)‖ ‖x* − x_k‖ dt
              ≤ (L/m) ∫₀¹ t ‖x_k − x*‖² dt = (L/(2m)) ‖x_k − x*‖²,

which is the desired result.
All the things so far are great news for minimization of quadratic functions or situations where we are so
close to a local minimum that the quadratic approximation of a function is very accurate. However, not all
functions are convex quadratic with a positive definite Hessian. For other functions, even if they are convex,
the step made by the pure Newton method might be simply too long and can guide one to a place where
the Taylor approximation loses its accuracy completely.
For that reason, one typically does not implement the pure Newton method but instead, uses the search
direction implied by the Newton method, combined with line search (see one of the strategies we learned
in the previous section) in order to determine the next point, see alg. 17. This is a basis for a ‘realistic’
implementation of the Newton method.

Algorithm 17: Newton method with line search.


Data: x0
while Convergence criterion not met do
Compute dk by solving ∇2 f (xk )dk = −∇f (xk ).
Use one of the line search strategies to select stepsize tk for xk+1 = xk + tk dk .

Line search, unfortunately, is not something that is very expedient to do in ML because of the size of
datasets involved and the amount of time it may take to evaluate the function value.
But this is not the end of the issues with the Newton method. As you remember, it mostly consists of solving a system of linear equations to obtain the Newton direction. But the Newton matrix need not be invertible or positive definite; all we know is that it is at least symmetric. To check this (and also to solve the system of linear equations later), typically a Cholesky factorization (section 2.6) is attempted; it succeeds exactly when the matrix is positive definite. If you are lucky, all the eigenvalues are positive and the matrix is positive definite and, hence, invertible.
But some problems are 'not convex enough', which means that the Hessian matrix might become singular (when one of the eigenvalues is zero), or they are not convex at all in a given area, in which case the Hessian is indefinite (which happens, for example, at saddle points). For the case when the Hessian is positive semidefinite but one of the eigenvalues is zero, people use a very simple trick that is similar to regularization: a matrix λI is added so that the step is:

(∇2 f (xk ) + λI)d = −∇f (xk ).

Another approach, more suitable around saddle points, is to trust the Taylor approximation only within a certain distance from the current point, and to determine the next point as the minimizer of the quadratic model within a ball around it. The optimization problem solved is then

x_{k+1} = argmin_x  f(x_k) + ∇f(x_k)^T (x − x_k) + (1/2) (x − x_k)^T ∇²f(x_k) (x − x_k)
          s.t.  ‖x − x_k‖²₂ = (x − x_k)^T (x − x_k) ≤ δ,

where δ > 0 is the radius of the trust region. This idea is illustrated in fig. 4.3.
Such optimization problems can be solved very efficiently due to the fact that the Hessian of the objective function – ∇²f(x_k) – and the Hessian of the function (x − x_k)^T (x − x_k) used in the constraint – an identity matrix I – are simultaneously diagonalizable, a term you have encountered in section 2.5. This property allows extremely efficient special algorithms for this so-called 'Trust Region Subproblem' (TRSP).
Despite all the fixes presented so far, one of the main problems with the Newton method is that it is indiscriminately attracted to all critical points (such as maxima or saddle points). This is particularly troublesome as, in the kind of objective functions encountered in ML, saddle points tend to proliferate, as you have seen earlier. Surprisingly, the first-order methods we discussed earlier exhibit less attraction to such points. For that reason, the Newton method does not always perform better than gradient descent.
Figure 4.3: (An ugly picture of) minimization of a saddle-point quadratic function in the trust region method.

The Newton method is needed for loss functions with complex curvatures, but without too many saddle points. Overall, thus, the computational-work-per-value ratio of the Newton method is not great in its 'almost pure' versions. Therefore, real-world ML practitioners often prefer gradient-descent methods in combination with computational algorithms like Adam. However, there exist low-cost imitations of the Newton method that are used in ML. We will introduce one of them in the next section.
Exercise 4.4.
Is it possible for a Newton update to reach a maximum rather than a minimum? Justify your answer.
In what types of functions is the Newton method guaranteed to reach a maximum rather than a
minimum?

4.3.3 Quasi-Newton methods


Quasi-Newton methods try to 'eat the cookie and have the cookie', that is, to gain good second-order information about the function, while computing only first-order information. In particular, they try to mimic the logic of the gradient method by making steps of the form

x_{k+1} = x_k − t_k D_k^{-1} ∇f(x_k),

where in the Newton method the matrix D_k is simply ∇²f(x_k). But computation and inversion of ∇²f(x_k) are rather expensive, so the idea is to make D_k a sequence of matrices that behave 'sort of' like the Hessian, but which are cheap to update from D_k to D_{k+1}.
What would it mean that we want D_k to behave like the Hessians? The idea is that the updated matrix should satisfy the approximate secant condition (as it relates to the secant method for finding zeros of functions)

D_{k+1} (x_{k+1} − x_k) ≈ ∇f(x_{k+1}) − ∇f(x_k),
which is a finite-difference approximation. But there are really many matrices Dk+1 satisfying this condition
and it’s not immediately obvious how one could decide for a specific choice among them which, again, would
be cheap to update from Dk to Dk+1 . We will now present some of the empirically working solutions to this
problem.

The idea is, given the matrix D_k and x_{k+1}, x_k, ∇f(x_{k+1}), ∇f(x_k), to choose a D_{k+1} which is as close as possible to D_k:

D_{k+1} = argmin_{D_{k+1}}  ‖D_{k+1} − D_k‖_?                                (4.1)
          s.t.  D_{k+1} (x_{k+1} − x_k) = ∇f(x_{k+1}) − ∇f(x_k)
                D_{k+1} = D_{k+1}^T.

The question mark at the norm is there deliberately, because the ease of solving the above optimization problem hinges on the choice of that norm. It turns out that the problem is easy to solve if we pick ‖X‖_? := ‖A^{1/2} X A^{1/2}‖_F, where

A = ∫₀¹ ∇²f(x_k + t(x_{k+1} − x_k)) dt.

Not by coincidence, it happens that A (x_{k+1} − x_k) = ∇f(x_{k+1}) − ∇f(x_k). Under this choice, the problem can be analyzed by hand and there is an explicit solution to the minimum-norm optimization problem, given by:

D_{k+1} = (I − γ_k v_k q_k^T) D_k (I − γ_k q_k v_k^T) + γ_k v_k v_k^T,       (4.2)

with

v_k = ∇f(x_{k+1}) − ∇f(x_k),   q_k = x_{k+1} − x_k,   γ_k = 1/(v_k^T q_k),
where we do not show the entire reasoning behind it (it is conceptually not that difficult, but it requires a lot of analysis of the Lagrange optimality conditions for (4.1)). This is a neat algebraic result, but we need the inverse of this matrix rather than the matrix itself. Moreover, we do not want to perform this inversion explicitly for every k. Ideally, we would like to obtain the inverse D_{k+1}^{-1} from D_k^{-1} in a cheap way. What comes to the rescue is that (4.2) is a low-rank update of the matrix D_k.
Because of that, we can use the Sherman-Morrison-Woodbury identity of theorem 2.3 to obtain the following formula:

D_{k+1}^{-1} = D_k^{-1} − (D_k^{-1} v_k v_k^T D_k^{-1}) / (v_k^T D_k^{-1} v_k) + (q_k q_k^T) / (q_k^T v_k).     (4.3)

This is known as the Davidon–Fletcher–Powell (DFP) quasi-Newton method update formula, and further
computational tricks are used even to avoid the storage of the entire matrix in real-life software.
Exercise 4.5.
Check the validity of (4.3) using theorem 2.3.

Another, more popular, update rule of this kind, known as the Broyden, Fletcher, Goldfarb & Shanno (BFGS) update, is obtained by formulating the problem (4.1) directly in terms of the inverse Hessian, instead of the Hessian itself:

D_{k+1}^{-1} = argmin_{D_{k+1}^{-1}}  ‖D_{k+1}^{-1} − D_k^{-1}‖_?            (4.4)
               s.t.  x_{k+1} − x_k = D_{k+1}^{-1} (∇f(x_{k+1}) − ∇f(x_k))
                     D_{k+1}^{-1} = (D_{k+1}^{-1})^T.

The BFGS method is the state of the art among the quasi-Newton methods used in optimization.
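As a sketch (ours), the inverse update (4.3) and one quasi-Newton step of the form x_{k+1} = x_k − t_k D_k^{-1} ∇f(x_k); the matrix H below plays the role of D_k^{-1}.

import numpy as np

def dfp_inverse_update(H, q, v):
    # q = x_{k+1} - x_k, v = grad f(x_{k+1}) - grad f(x_k); H approximates the inverse Hessian
    Hv = H @ v
    return H - np.outer(Hv, Hv) / (v @ Hv) + np.outer(q, q) / (q @ v)

def quasi_newton_step(x, H, grad, t=1.0):
    x_new = x - t * H @ grad(x)
    H_new = dfp_inverse_update(H, x_new - x, grad(x_new) - grad(x))
    return x_new, H_new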

4.4 Non-smooth optimization
In the end, we move to a very important case of having to minimize a function which is not differentiable
everywhere. Recall the hinge loss SVM problem of example 3.3, where the loss function was clearly not
everywhere differentiable. Another example is the following one.
Example 4.1. Lasso
A classical example of a non-differentiable loss function is the L1 -regularized linear regression, known
as lasso (least absolute shrinkage and selection operator). The loss function there is

f (w) = ky − Xwk22 + λkwk1 .

This function has the benefit that using the L1 norm in the regularization term is particularly effective
at forcing many entries of w to be equal to 0 at the optimal solution – see exercise 4.8.

For that reason, it is essential that we have methods that are able to minimize such functions in a
mathematically rigorous way as well.

4.4.1 Subgradient method


Among our methods, we begin with the simplest one which uses the following fact: while for convex functions
f : Rn → R it can indeed happen that they are not differentiable, they may have something that behaves
‘almost’ like the gradient. This thing is called subgradient and it is defined as follows:
Definition 4.1.
For a function f : Rn → R a vector g ∈ Rn is a subgradient of f at x if

f (y) ≥ f (x) + g > (y − x), ∀y ∈ Rn

The set of subgradients of f at x is called the subdifferential at x and is denoted by ∂f (x).

For differentiable functions there is a close relation between the gradient and the subgradient, given by the following result.
Lemma 4.1.
For a function f : Rn → R, which is differentiable at a point x, we have

∂f (x) ⊂ {∇f (x)}

A subgradient is, essentially, the normal vector of a hyperplane passing through the point (x, f(x)) such that the graph of f lies completely above or on it. It is not a coincidence that the condition in the definition of a subgradient is very similar to that in the definition of convexity of a function. In fact, we have the following result.
Lemma 4.2.
A function f : R^n → R is convex if and only if ∂f(x) ≠ ∅ everywhere.

Although in theory, one can have a discussion about subgradients of nonconvex functions in at least some
points of their domains, in practice discussion about subgradients is almost always done only in the context
of convex functions and this is also the assumption we shall make.
Just as for computing derivatives (gradients), for the subgradients we have some ‘calculus rules’ for
computing them for complicated functions out of the subgradients for the simpler functions.

Lemma 4.3.
For convex functions f : Rn → R we have

∂(αf)(x) = α ∂f(x),    α > 0,
∂(f + g)(x) = ∂f(x) + ∂g(x),
g(x) = f(Ax + b)  ⇒  ∂g(x) = A^⊤ ∂f(Ax + b).

Exercise 4.6.
Compute the subgradients of the following functions: ‖x‖₂, ‖x‖₁ = Σ_{i=1}^n |x_i|, max{0, 1 − x}.

One of the nice features of subgradients is that we can use them to formulate a very
general version of an optimality condition:
Theorem 4.2.
For a function f : Rn → R, if 0 ∈ ∂f (x) then x is a global minimizer of f .

Exercise 4.7.
Prove theorem 4.2.

Exercise 4.8. L1 regularization and sparsity


Consider the Lasso minimization function of example 4.1. Write down the optimality condition for
a global minimizer of f(w) using theorem 4.2. Now, consider the similar objective function but with the
L2/Tikhonov regularization:

f(w) = ‖y − Xw‖₂² + λ‖w‖₂²,
and write down the optimality condition for this function as well. Do you see the intuition why L1
regularization is more likely to yield optimal points with more components of w equal to 0?

Equipped with the notion of the subgradient, we can present the generalization of the gradient descent
method to nondifferentiable convex functions, known as the subgradient descent method. Depending on the

Algorithm 18: Subgradient descent algorithm


Data: x0 ∈ Rn
for k = 0, 1, 2, . . . do
Select dk ∈ ∂f (xk )
Select step size tk
xk+1 = xk − tk dk
if Stopping criterion met then
Stop and return xk


assumptions that we impose on our function at hand, we can prove results about O(1/√K) or O(1/K)
convergence of the subgradient method, where K is the number of iterations.
The subgradient method is, however, quite slow because it is ‘blind to the structure’ of the function f .
Especially in ML, we are the designers of our own functions and even in situations when some functions are
nondifferentiable, there can be some good things about them that can be exploited, apart from the single

negative fact that it is not always differentiable. Typically, this structure to be exploited is as follows:

f(x) = g(x) + h(x),

where g is convex and differentiable with an L-Lipschitz gradient, and h is convex and nondifferentiable but ‘simple’.

We introduce the class of algorithms that are able to exploit this structure in the next section.

4.4.2 Proximal gradient


It makes sense to begin with a re-derivation of the standard gradient descent step. Namely, making a step
of the form
xk+1 = xk − γ∇f (xk )
can be seen as a result of minimizing a specific local quadratic approximation of the function f :
min_y  f(x_k) + ∇f(x_k)^⊤(y − x_k) + (1/(2γ))‖y − x_k‖²   ⇔   min_y  (1/(2γ))‖y − (x_k − γ∇f(x_k))‖².
In other words, the gradient step could be obtained by minimizing a local linear approximation of the function
f plus a quadratic penalization term that punishes going too far away from xk .
Based on this information, we will now define the following operator known as the proximal operator.
Definition 4.2.
For a given function h : Rn → R we define the proximal operator
 
prox_{h,γ}(x) = argmin_y { (1/(2γ))‖y − x‖² + h(y) }

Under this definition, the traditional gradient descent step on a differentiable function h can be seen as

x_{k+1} = prox_{ĥ_k,γ}(x_k),  where  ĥ_k(y) = h(x_k) + ∇h(x_k)^⊤(y − x_k)

is the linearization of h at x_k, exactly as in the derivation above. (Applying prox_{h,γ} to x_k directly, without linearizing, would instead give the so-called proximal point method.)

Now, in machine learning, the nondifferentiable functions we try to minimize very often come in the form

f (x) = g(x) + h(x)

where g(x) is convex and differentiable and h(x) is convex and not differentiable, but still ‘nice’ in the sense
that it is a fairly simple function.
Example 4.2. Compressed sensing
Proximal gradient algorithms are used very frequently in image recognition tasks, or image deblurring.
A typical situation in image deblurring or compressed sensing is that we observe a signal y, which
corresponds to the underlying ‘true’ signal x which is not observed and has to be estimated. By laws
of physics we expect the relationship
Ax ≈ y
to hold, where A is a known matrix that models the physics. Then, the deblurring task is performed
by minimizing the function

‖Ax − y‖₂² + λ‖Tx‖₁,

where T is a special matrix such that the term ‖Tx‖₁ triggers the optimal solution to be ‘sparse’ in
a way that is proper to the given application – for example, in an image we don’t want to have many
pairs of adjacent pixels with completely different colors.

Exercise 4.9.
Derive the proximal operator of the function h(x) = kxk1 .

The idea of proximal gradient descent is very simple:


• make a gradient step with respect to g(x) – the part of the function for which we can compute the
gradient;

• with respect to the obtained point, compute the proximal operator of the function h(x).
which gets formulated as

x_{k+1} = argmin_y  g(x_k) + ∇g(x_k)^⊤(y − x_k) + (1/(2γ))‖y − x_k‖₂² + h(y).

This is the basis of the proximal gradient algorithm:

Algorithm 19: Proximal gradient algorithm


Data: x0 ∈ Rn
for k = 0, 1, 2, . . . do
Select step size tk
xk+1 = proxh,tk (xk − tk ∇g(xk ))
if Stopping criterion met then
Stop and return xk

What is crucial for the proximal gradient method is that the proximal operator is easy to compute for
the function h(x). If you have this, you are good to go: you can apply the proximal gradient descent method, which
is a super powerful and popular tool in machine learning. For example, we have the following result
that gives us an O(1/K) convergence rate for the value of the objective function.
Theorem 4.3.
Let g(x) be convex with an L-Lipschitz continuous gradient, and assume prox_{h,t_k}(x_k − t_k ∇g(x_k)) can be
computed easily. Then, choosing the fixed step size t_k = 1/L, we obtain that for an arbitrary x_0 it holds that

f(x_k) − f^* ≤ (L/(2k)) ‖x_0 − x^*‖².

The proof of this theorem is not particularly difficult but it requires ‘putting together’ quite a few small
facts and properties of convex functions.
Using the proximal gradient method for minimizing the Lasso-regularized linear regression carries the
name of iterative shrinkage-thresholding algorithm (ISTA). Under specific assumptions on the functions g,
h one can obtain an even faster rate of convergence O(1/K 2 ) through the so-called fast iterative shrinkage-
thresholding algorithm (FISTA, [4]), which is one of the most popular algorithms used in ML or image
retrieval applications (check the number of citations of the paper).
Exercise 4.10.
Work out the details of the proximal gradient methods for the Lasso algorithm.

4.5 Practical summary
Long story short, the situation is as follows. In machine learning you often face situations in which a pure
gradient method is not possible to implement, often due to non-differentiability of the function
at hand. If you are in control of the situation and the function to be minimized has to be nondifferentiable
but you can at least keep it convex, then it is very handy if the non-differentiable component has an
easy-to-compute proximal operator, in which case you can apply the proximal gradient descent method.
In high-dimensional ML models, chances are high that the function you are trying to minimize is not
going to be convex, it will have a lot of local minima, and even more saddle points. For that reason, the
modifications of the gradient methods introduced in the beginning of this section will come in very handy.
If the problem size allows it, quasi-Newton methods can yield a faster convergence.
In highly complicated models such as the neural networks with ReLU activation functions, it can happen
that both difficulties (nonconvexity and nondifferentiability) appear at the same time. We will treat this
case separately in the neural networks part of this course.

5 Constrained optimization and duality


5.1 Introduction
So far, in our kick-start introduction to optimization we have been living the comfortable life of minimizing
unconstrained problems, i.e., the parameter vector w was allowed to lie anywhere in R^n as long as it was
‘doing the job’. Sometimes, however, the situation requires the parameters to belong to a certain subset of
the Euclidean space:

min_{w∈W⊂R^n} f(w).

This means that applying any of the algorithms we learned so far might lead to ‘falling outside the feasible
set W ’.
How to deal with that? The first advice is – if you can, to avoid constrained problems. One interesting
case here is when you encounter a problem of the form:

min_w  ‖y − Xw‖₂²    s.t.  Aw = b,

Exercise 5.1.
Transform the above problem to an equivalent, unconstrained one.

Sometimes, however, the constraints are more complicated than that and the problem one is solving is

min_{w∈R^n}  f(w)
s.t.  g_i(w) ≤ 0,   i = 1, . . . , m,

where gi (w) are some functions (typically convex and differentiable). A very engineering way to deal with a
situation like this is to turn this problem back into an unconstrained one by including a penalty for constraint
violation:
min_{w∈R^n}  f(w) + C Σ_{i=1}^m (max{0, g_i(w)})²    (5.1)

In that way, we make peace with the fact that the constraints can be violated and impose a certain price C
per squared unit of violation, which should be tuned to make sure that the optimization problem is nudged

into staying ‘close enough’ to the constraints being satisfied. Such a reformulation has one nice feature which
you are asked to show in the following exercise, and one which is not nice - that the conditioning of the
problem might be bad.
Exercise 5.2.
Prove the statement that if the functions f (w), gi (w) are all convex, then the resulting penalized
problem (5.1) has a convex objective.

But sometimes situations arise (or we make them by modelling the problem in a certain way) where we
really need to respect some constraints and they are not trivially eliminated from the problem. Then,
what to do depends on the structure of the set W.
Example 5.1. L1 regularized regression via a constraint
As already mentioned, L1 regularization is a very popular tool in ML. In some applications, it is more
common to perform this regularization by explicitly bounding the L1 -magnitude of the parameter
vector:

min_w  ‖y − Xw‖₂²    s.t.  ‖w‖₁ ≤ C,

which is a generalization of a concept you have seen in the earlier sections of this course (section 2.7,
minimum-norm solutions).

Just as in the above example, in ML we don’t encounter very complicated sets W . Typically, this will
be a box, a ball or something of similar level of complication. For that reason, the two classes of algorithms
that we will discuss first are essentially fixes of the gradient(-like) methods.
Towards the end of this section, we will learn about duality. In classical optimization, duality plays the
role that checking the condition ∇f(x) = 0 plays in unconstrained optimization – checking whether we are at (or close to) the
optimal solution. Additionally, when used properly, it can be used to strengthen algorithms.

5.2 Projected gradient


The simplest ‘fix the algorithm’ approach to solving constrained optimization problems is the following:
whenever the gradient method makes us jump outside the set W, we perform a projection back onto this
set. Usually, but not always, the projection operator is defined using the orthogonal projection:

Π_W(x) = argmin_{y∈W} ‖y − x‖₂².    (5.2)

For other types of sets, some set-specific norm might be used in the projection operator which typically has

Algorithm 20: Projected gradient algorithm


Data: x0 ∈ Rn
for k = 0, 1, 2, . . . do
Select step size tk
xk+1 = ΠW (xk − tk ∇f (xk ))
if Stopping criterion met then
Stop and return xk

the purpose of easy computation of the projection.

Of course, you never want to be solving an optimization problem (5.2) to find the projection itself. The
idea is that this algorithm is applied only if the projection ΠW (x) is easy to compute – in optimization terms
this means that the operator is either available as a closed-form formula or, in the worst case, as a result of
optimization over a single variable.
Exercise 5.3.
What is the formula for the projection operator onto set

W = {x ∈ Rn : li ≤ xi ≤ ui }

Exercise 5.4. Challenging


Try to derive the projection operator for the L1 -ball

W = {x ∈ Rn : kxk1 ≤ 1}

Hint: it can be reduced to minimization over a single decision variable but we don’t know of any
closed-form formula for it. You will find using Lagrange multipliers useful in this task.

The convergence results for the projected gradient descent algorithm are similar to those for the gradient
method. Typically, they will depend on the parameters of the set W to some extent, such as its diameter
and shape. In such situations, even more than in unconstrained optimization, the following rule holds:
there is a set of parameters that makes things work in the proofs of convergence, but for purposes of real-life
optimization one typically picks larger step sizes than the theoretically-valid ones.

5.3 Frank-Wolfe algorithm


Projected gradient descent, as we said, requires us to solve a quadratic optimization problem (5.2). If the
solution to such a problem is not available in a simple form and we would really need to solve it numerically, then we should
recall a basic principle in optimization – linear optimization problems, where both the objective function
and the constraints are linear, are easier to solve than problems in which either of them is quadratic.
The next step in this reasoning is to ask the question: where would we go if we fully trusted the first-order Taylor
approximation of the function? That is, we look for the solution to:

min_{y∈W}  f(x_k) + ∇f(x_k)^⊤(y − x_k).

Because, in the above expression, the only term that actually varies is y, the problem can be reduced to:

y_{k+1} = argmin_{y∈W}  ∇f(x_k)^⊤ y.    (5.3)

This is not the end, however, because we do not trust the Taylor approximation too far away from the current
point. For that reason, the real step that is made is:

x_{k+1} = x_k + t_k (y_{k+1} − x_k),    t_k ∈ (0, 1),

that is, we stop somewhere on the way from x_k to the point y_{k+1}, which prevents us from jumping from one boundary
point of W to another boundary point (you can easily check that the point y_{k+1} lies on the boundary of the
set W).
It turns out that the L1 -regularized linear regression is exactly one of the situations in which (5.3) is
easier than the projection operator.

Algorithm 21: Frank-Wolfe algorithm
Data: x0 ∈ Rn
for k = 0, 1, 2, . . . do
Solve y_{k+1} = argmin_{y∈W} ∇f(x_k)^⊤ y
Select step size tk
xk+1 = xk + tk (yk+1 − xk )
if Stopping criterion met then
Stop and return xk

Exercise 5.5.
Derive the formula for the Frank-Wolfe step of minimization over L1 ball.

As for the convergence guarantees of the Frank-Wolfe algorithm, these roughly follow the same pattern as those
for the projected gradient algorithm, depending on our assumptions on the function we minimize.
If you are interested in more mathematical details on the convergence rates of various algorithms we
introduced so far in this course, in the context of ML, we recommend the material of the excellent course
‘Optimization for Machine Learning’ by Martin Jaggi, for which there are also YouTube video lectures
available. 1

5.4 Duality
5.4.1 General duality
Any introduction to optimization is incomplete without giving at least a glimpse of duality theory. Duality
theory is something that in classical optimization is mostly used for the purpose of (i) constructing optimality
certificates for the solutions of optimization problems, (ii) constructing better algorithms by leveraging dual
information (e.g. primal-dual interior point methods). However, in ML duality has found a beautiful appli-
cation where the so called dual problem of the optimization problem we solve (most often, the SVM) allows
us to construct primal predictive tools of almost arbitrary level of sophistication at no extra computational
cost.
To introduce this, we will start from the ‘classical angle’ and then move on to the ML applications.
Suppose you are solving a problem

min_x  f(x)    (5.4)
s.t.  g_i(x) ≤ 0,   i = 1, . . . , m,

which we will henceforth call the primal problem. If all the functions f and gi (x) are convex, then this is
a ‘nice’ optimization problem for which we can have legitimate hopes to find an optimal solution. For that
reason, we will make this assumption from now onwards.
In general, constrained optimization problems are ‘nice’ if both the objective function to minimize and
the set of feasible solutions are convex.
Exercise 5.6.
Show that if g_i(x) are convex functions then the feasible set of (5.4), i.e.,

X = {x ∈ R^n : g_i(x) ≤ 0, i = 1, . . . , m},

1 See: https://github.com/epfml/OptML_course

is convex, that means, for all x, y ∈ X it holds:

θx + (1 − θ)y ∈ X, ∀θ ∈ [0, 1].

Example 5.2. Hinge-loss SVM with L2 regularizer


Consider the Hinge-loss SVM of example 3.3 again. We are going to try to fit a hyperplane w that
separates the data points as well as possible. The corresponding optimization problem will be one of
minimizing the sum of loss functions max{0, 1 − yi (w> xi )}, plus a quadratic regularizer:
f(w) = Σ_{i=1}^N max{0, 1 − y_i (w^⊤ x_i)} + C w^⊤ w,

which is a nondifferentiable function. We will turn it into a constrained optimization problem where
all the functions involved are differentiable and convex. If we introduce an additional decision variable
ξi and require that

ξi ≥ 0
ξi ≥ 1 − yi (w> xi ),

then ξi can become our ‘proxy’ for the value of the term max{0, 1−yi (w> xi )} so that the optimization
problem is:
min_{w,ξ}  Σ_{i=1}^N ξ_i + C w^⊤ w    (5.5)
s.t.  ξ_i ≥ 1 − y_i (w^⊤ x_i)   ∀i,
      ξ_i ≥ 0   ∀i.

Check for yourself that indeed at the optimal solution ξi will always have the incentive to be equal
to max{0, 1 − yi (w> xi )}. Also, check that if you rewrite this problem to form (5.4), then all the
constraints and objective functions are indeed convex in w, ξ.

We now come back to problem (5.4). How do you certify the optimality of the solution x you find? If
the problem had no constraints, and the function f (x) was convex and differentiable, then a simple answer
to this question is: by checking if ∇f (x) = 0. But in the presence of constraints, a stationary point might
not be feasible, as depicted in fig. 5.1.
In a way, an optimality certificate, if it exists, must take form of an easy-to-verify statement that ‘from
this point onwards it is not possible to go to any better point because the constraints forbid it’. At the same
time, the optimality certificate should be easy to compute, just like for unconstrained optimization problems
it is easy to compute the gradient and to check if it is equal to 0. An optimality certificate that would be a
beautiful mathematical statement of the form ’there exists no ... such that...’ but which would not be easily
verified numerically, is useless in practice.
Remark 5.1.
What we are going to derive is a special case of the separating hyperplane theorem [37], which in turn
is a special case of the Hahn-Banach theorem in functional analysis [39].

One extremely popular technology of building such optimality certificates is via the so-called Lagrange


Figure 5.1: Constrained minimization of a convex quadratic function f(x) = x^⊤ A x. Due to the constraints,
the feasible region consists of points that are below the line a_1^⊤ x − b_1 = 0 and to the left of the line a_2^⊤ x − b_2 = 0, and the
optimal point is the vertex of the feasible region, not the minimum of the function itself (which is infeasible
because of the constraints).

relaxation of the problem. The Lagrangian dual function of (5.4) is defined as:
ℓ(α) := inf_{x∈X} { f(x) + Σ_{i=1}^m α_i g_i(x) }.    (5.6)

The infimized inner function

L(x, α) = f(x) + Σ_{i=1}^m α_i g_i(x)

is known as the Lagrangian of (5.4), and it can be summarized as: ‘we make peace with the fact that some of the constraints
can be violated, but we impose a price α_i for each unit of violation of the i-th constraint, and the objective
becomes lower by α_i per unit of slack if the constraint is satisfied with a slack’. In (5.6) we wrote X instead
of R^n because some of the constraints might be so simple (e.g., nonnegativity constraints) that it is easy to
find the infimum (5.6) even without relaxing them – you will see that in example 5.3.
Minimizing L(x, α) over x is a relaxation of the original problem because its optimal value for α ≥ 0 is
always going to be a lower bound on the optimal value of (5.4). This result is known as the weak duality
theorem.
Exercise 5.7. Weak duality
Show that for any α ≥ 0 it holds that `(α) ≤ f (x) for any x that satisfies the constraints of (5.4).

Thus, plugging any α ≥ 0 into the dual function, we obtain a lower bound on the optimal value of (5.4). Of
course, everything hinges on the question: is the dual function easy to compute via an easy formula? In
many interesting cases the answer is yes, as in our SVM example.
Example 5.3. Support vector machine with L2 regularizer - dual function
Let us introduce variable αi that will play the role of α in the Lagrange relaxation of the SVM
problem (5.5). In theory, we could also introduce a set of variables to relax the constraints ξi ≥ 0
but this is not needed because minimization over ξ ≥ 0 is easy even without the relaxation of the
constraints.

The Lagrangian of (5.5) becomes

L(w, ξ, α) = C w^⊤ w + Σ_{i=1}^N ξ_i + Σ_{i=1}^N α_i (1 − y_i (w^⊤ x_i) − ξ_i)
           = C w^⊤ w − Σ_{i=1}^N α_i y_i (w^⊤ x_i) + Σ_{i=1}^N α_i + Σ_{i=1}^N ξ_i (1 − α_i).

As an exercise, you will check that we have

inf_{ξ≥0, w} L(w, ξ, α) = Σ_{i=1}^N α_i − (1/(4C)) Σ_{i,j=1}^N α_i α_j y_i y_j x_i^⊤ x_j   if 0 ≤ α_i ≤ 1 for all i,
and −∞ otherwise.    (5.7)

Exercise 5.8.
Verify the formulation (5.7).

We come back to our general considerations. Imagine the following: suppose that you determined an x
and an α ≥ 0 such that x is feasible for (5.4), and it holds that f(x) = ℓ(α). Because ℓ(α) is a lower bound on
the value of any feasible solution of (5.4), a logical conclusion is that such an α is an optimality certificate for x.
For that reason, the search for the best possible lower bound provided by the dual function is important
on its own, because the gap between f(x) for a feasible x and ℓ(α) quantifies the maximum possible loss
in the value of the objective function compared to the (unknown) optimal solution. This search is formulated
as the dual optimization problem:

max_α  ℓ(α)    (5.8)
s.t.  α ≥ 0.

The questions are:


1. for what problems (5.4) does there always exist such a pair (x, α)?
2. if such a pair exists, how to determine α?
With respect to the first question, we have the following fundamental result.
Theorem 5.1. Strong duality
Assume the functions f, gi : Rn → R in problem (5.4) are all convex and differentiable.
Also, assume that there exists x̂ such that gi (x̂) < 0 for every non-affine gi and gi (x̂) ≤ 0 when gi is
affine, which is known as the Slater assumption.
If all the above holds, then strong duality holds, i.e., there exists a pair (x∗ , α∗ ) such that the optimal
values of (5.4) and (5.8) are the same, and it holds that
∇f(x^*) + Σ_{i=1}^m α_i^* ∇g_i(x^*) = 0,    (stationarity)
g_i(x^*) ≤ 0,  for i = 1, . . . , m,    (primal feasibility)
α_i^* ≥ 0,  for i = 1, . . . , m,    (dual feasibility)
α_i^* g_i(x^*) = 0,  for i = 1, . . . , m,    (complementary slackness)

which are known as the Karush-Kuhn-Tucker (KKT) conditions.

The KKT conditions play an important role in optimization. In a few special cases it is possible to
solve the KKT conditions analytically and thus solve the optimization problem. More generally, many
optimization algorithms have been conceived as methods for solving the KKT conditions: imagine that you
treat the ‘equality part’ of the KKT conditions as a system of (generally nonlinear) equations for which you are trying to
find a root. If you try to apply the Newton method to this system, then you are close to the derivation
of something called primal-dual methods for constrained optimization. For more details on this way of
explaining these methods, see [9].
In the context of fig. 5.1, the KKT conditions constitute a certificate that from the optimal point, it is
not allowed to move any further in the direction of improving the objective function values.
Example 5.4.
For our SVM example, the dual optimization problem of maximizing ℓ(α) is equivalent to:

max_α  Σ_{i=1}^N α_i − (1/(4C)) Σ_{i,j=1}^N α_i α_j y_i y_j x_i^⊤ x_j    (5.9)
s.t.  0 ≤ α_i ≤ 1.

Why is it so? Well, if we want to maximize the value of the dual function we certainly do not want
its value to be equal to −∞. Therefore, we can enforce the constraints that make the value of the
dual function finite, without any loss of generality.
It is easy to verify that the Slater condition holds for our pair of problems (5.5) and (5.12) and that
their optimal values must be bounded. Therefore, strong duality holds and the optimal values of both
problems are the same, attained by solutions w∗ , ξ ∗ , α∗ that satisfy the KKT conditions:
2C w^* = Σ_{i=1}^N α_i^* y_i x_i,
ξ_i^* ≥ 1 − y_i (w^{*⊤} x_i)   ∀i,
ξ_i^* ≥ 0   ∀i,
0 ≤ α_i^* ≤ 1   ∀i,
α_i^* (1 − y_i (w^{*⊤} x_i) − ξ_i^*) = 0   ∀i.

From the first condition we can derive the formula

w^* = (1/(2C)) Σ_{i=1}^N α_i^* y_i x_i = (1/(2C)) Σ_{i: α_i^*>0} α_i^* y_i x_i,

where the data points x_i for which α_i^* > 0 are called the support vectors, because they are the only ones that
enter the resulting predictive tool. This is where the name of the method comes from. If you remember, the predictive tool was
y = sign(w^⊤ x); it can now be written as:

y = sign(w^{*⊤} x) = sign( ((1/(2C)) Σ_{i=1}^N α_i^* y_i x_i)^⊤ x ) = sign( (1/(2C)) Σ_{i=1}^N α_i^* y_i x_i^⊤ x ).

Exercise 5.9.
Show that the Slater condition holds for our SVM primal-dual pair (5.5)-(5.12), and that their optimal
values are finite.

Figure 5.2: The idea of lifting the dimensionality of the features by including their nonlinear transformation
so that the data becomes more ‘linearly separable’. Source: https://datascience.stackexchange.com/
questions/17536/kernel-trick-explanation

In ML, our interest in duality theory is not so much due to the need to find optimality certificates
because, as you might remember, we don’t care about optimality that much. But, sometimes the dual
problem is actually easier to solve numerically than the primal problem. In the SVM case the dual problem
has constraints on the variables αi , but these are very simple constraints.
Exercise 5.10.
Which of the constrained optimization algorithms that we learned is applicable to the dual SVM problem
(5.12)?

And from the solution to the dual problem, the primal solution can be recovered by using the KKT
conditions as visible in the above example. But the real added value of the dual problem in ML, especially
in the SVM is only about to come.

5.4.2 Duality in SVM: the kernel trick


Consider the standard SVM formulation with L2 regularizer as we did so far in this section. The idea of this
problem formulation is that it tries, inasmuch as it is possible, to separate points with label −1 from those
with label 1 using a hyperplane.
Now, suppose that it is not possible to separate them perfectly, but it is possible to do so if we also
include the squares and products of the features as the decision variables. This idea is illustrated in fig. 5.2.
Therefore, we perform a transformation of our feature vector to

x → x̃ = (x_1, . . . , x_n, x_1², x_1 x_2, . . . , x_n²)^⊤,    (5.10)

Figure 5.3: A linear SVM and polynomial SVM applied to the feature space. Source: https://
scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html

so that it suddenly has n + n(n + 1)/2 entries. That means, the dimension of our data set increases a lot.
What does the SVM problem formulation look like then? Something like this:

min_{w̃,ξ}  C w̃^⊤ w̃ + Σ_{i=1}^N ξ_i    (5.11)
s.t.  ξ_i ≥ 1 − y_i (w̃^⊤ x̃_i)   ∀i,
      ξ_i ≥ 0   ∀i,

so the parameter vector w̃ ∈ R^{n+n(n+1)/2} is much bigger than the original w, and the size of the problem increases
substantially! Sometimes this increase might be worth paying the price, because as a result we will obtain a much better predictive tool:

y = sign(w̃^⊤ x̃),


which, if applied to the original feature space, can take a much more flexible form, as illustrated in fig. 5.3.
However, the more features, the heavier the computations in this case. This is where duality theory will
come in. Let us get back to the case without the nonlinear transformations of the features and recall once
again the dual problem:
max_α  Σ_{i=1}^N α_i − (1/(4C)) Σ_{i,j=1}^N α_i α_j y_i y_j ⟨x_i, x_j⟩    (5.12)
s.t.  0 ≤ α_i ≤ 1.    (5.13)

As you can see, the dual problem does not depend on the vectors x_i as such, but only on their inner products,
and each inner product is, in the end, a single number. Moreover, you can check that the dual of (5.11)
would look exactly the same, only the inner products would be different: the terms ⟨x_i, x_j⟩ would simply
become ⟨x̃_i, x̃_j⟩. The corresponding prediction tool would then be
y = sign( (1/(2C)) Σ_{i=1}^N α_i^* y_i ⟨x̃_i, x̃⟩ ).

Thus, the dimensionality of α and the number of terms in the prediction tool do not scale with the number
of features in the primal problem but with the number of samples alone.
Here comes the key of the so-called kernel trick: we can generalize the term ⟨x_i, x_j⟩ in the dual
problem formulation to any kernel function K(x, y), which roughly measures the similarity of two vectors,
satisfying

• symmetry:
K(x, y) = K(y, x)   ∀x, y ∈ R^n,

• positive semidefiniteness: for any points x_1, . . . , x_m ∈ R^n, the matrix with entries K(x_i, x_j) is positive semidefinite.

Such a kernel function, as it turns out, always corresponds to a certain nonlinear transform of the feature
data in the primal space (by Mercer's theorem, [43]). Sometimes, this transform can only be expressed with
an infinite number of terms, which would correspond to solving a problem with an infinite number of entries in w
in the primal space. Doing so greatly increases our ability to separate groups of points.
In other words, we can formulate an operator K(xi , xj ) and doing so will ‘imitate’ considering many,
many more features in the primal problem. This operator should be fast-to-compute.
Example 5.5.
If we want to use the monomials of degree at most 2 of our feature values, then the way of doing it
in (5.10) is highly inefficient, because in the dual problem we would have inner products of very long
vectors. A much more efficient way of taking into account expressions of degree up to r in terms of
the features is to use the polynomial kernel function:

K(x_i, x_j) = (1 + x_i^⊤ x_j)^r.

It does the same job as the original idea but requires far fewer floating point operations. Additionally,
it immediately accounts for including the ‘constant term’ in our feature vector.

Another popular kernel is the Gaussian kernel:

K(x_i, x_j) = exp(−γ ‖x_i − x_j‖₂²),

where γ is a to-be-tuned hyperparameter. This kernel does not correspond to a finite-dimensional transformation
of the data in the primal space, but to an infinite-dimensional one, which you can infer from
interpreting this formula as an ‘inner product’ of two series in terms of x, y for γ = 1/2:

exp(−(1/2)‖x − y‖₂²) = exp(x^⊤ y − (1/2)‖x‖² − (1/2)‖y‖²)
  = exp(x^⊤ y) exp(−(1/2)‖x‖²) exp(−(1/2)‖y‖²)
  = Σ_{j=0}^∞ ((x^⊤ y)^j / j!) exp(−(1/2)‖x‖²) exp(−(1/2)‖y‖²)
  = Σ_{j=0}^∞ Σ_{n_1+···+n_k=j} [ exp(−(1/2)‖x‖²) (x_1^{n_1} ··· x_k^{n_k}) / √(n_1! ··· n_k!) ] [ exp(−(1/2)‖y‖²) (y_1^{n_1} ··· y_k^{n_k}) / √(n_1! ··· n_k!) ].

Using kernels, what does the corresponding prediction tool look like? Analogously to the original
formulas, we have:

y = sign( (1/(2C)) Σ_{i=1}^N α_i^* y_i K(x_i, x) ).
All this is really cool because it allows us to play with the flexibility of separating the classes based on the
features, in a way that does not increase the size of the optimization problem formulation.
Overall, we can say that the kernel trick used in SVMs is one of the most impressive uses of duality
theory, apart from certifying the optimality of solutions to optimization problems.

Exercise 5.11. Kernelization of the SVM
Consider again exercise 3.12. Now, for your dataset, consider using one of the kernels mentioned
above to create a nonlinear SVM classifier. That is, formulate the corresponding dual problem and solve
it using the projected gradient method. Then, you can use the obtained decision tool to color your
picture, in order to see into which parts the SVM classifier divided the entire [0, 10]² square. You
can achieve this by taking a dense grid of points on this set, classifying each of them using the
SVM you obtained, and coloring the points with two different colors according to their predicted class.

6 Clustering
6.1 Introduction
Finally, after having the crash-course introduction to the relevant linear algebra and optimization, the time
has come to discuss some machine learning subjects.
To begin with, we make one important remark: we cannot discuss all the possible techniques, so we make
a selection of those in which we find the linear algebra/optimization most illustrative. At the end of each
section, if necessary, we will hint at other popular techniques which we do not discuss as they do not involve
(that much) interesting linear algebra or optimization.
Clustering falls into the group of unsupervised learning techniques, which corresponds to revealing struc-
ture in a data set without any labels. Whereas in the optimization basics, we have mostly discussed techniques
relevant to supervised learning, we have already seen other techniques that can be seen as unsupervised learning
in the linear algebra basics; in particular, we have already discussed how to reduce the dimension of a data
set to the – with respect to some measure – most relevant information using Krylov methods and the SVD.
Clustering is about having N objects that need to be divided into groups such that the objects within
a single group are similar to each other, but the groups are different. About these objects, we might have
feature information stored in per-object vectors x_i, i = 1, . . . , N. In other cases, we might not have feature
information about each object but instead, we have pairwise information about the relationships between
the objects, which can take the form of:
• 0 − 1 information about whether there exists a link between the two objects or not

• wij ≥ 0 information about the ‘degree of similarity’ of objects i and j.


Importantly, such relational information can also be obtained from the feature information as in the following
example.
Example 6.1.
The simplest example of the ‘degree of similarity’ is Euclidean distance between two objects:

w_{ij} = ‖x_i − x_j‖₂².

In this context, the higher w_{ij}, the less similar the two objects are.

One of the nicest illustrations of clustering is image segmentation, where we try to divide an image into
different parts, for example, to separate people from the background. Another example is grouping people
based on knowing each other or their interests.

6.2 K-means clustering - Euclidean distance


We will begin with the classical K-means algorithm which is the ‘mother’ of many other clustering approaches.
There, the goal is to minimize the sum of squared distances of the data points xi to the centers ck of the
clusters Ck they belong to:
min_{C_1,...,C_K, c_1,...,c_K}  Σ_{k=1}^K Σ_{i∈C_k} ‖x_i − c_k‖².    (6.1)

When fixing which data points belong to which cluster, we know that, for each cluster, the point c_k that
minimizes the expression

Σ_{i∈C_k} ‖x_i − c_k‖²

Figure 6.1: Examples of image segmentation through clustering, source: https://it.mathworks.com/
matlabcentral/fileexchange/66181-image-segmentation-using-fast-fuzzy-c-means-clusering

is

c_k = (1/|C_k|) Σ_{i∈C_k} x_i.

Exercise 6.1.
Verify the above statement.

For that reason, in the minimization problem (6.1), we can eliminate minimizing over c1 , . . . , cK and
focus on the minimization across the composition of clusters C1 , . . . , CK .
The bad news is that solving this problem to optimality is an extremely difficult task – the problem is
known to be N P -complete – so that for realistic problem sizes, we need to resort to heuristic algorithms.
The heuristic on which the classical K-means algorithm rests consists of alternating steps of:
1. Computing the cluster centers ck as the averages of the points included in the cluster.
2. Re-assigning the points to the cluster Ck whose center ck it lies the closest to.

Formally, the algorithm is described by Algorithm 22.


Exercise 6.2.
Prove that alg. 22 stops. Hint: show that the sequence of the values of the loss function is nonincreasing
(and bounded from below).

Algorithm 22: K means
Data: x1 , . . . , xN , k – number of clusters
Initialize centroids c1 , . . . , cK
while Improvement of (6.1) obtained in the previous iteration do
Assign each x_i to cluster k = argmin_j ‖x_i − c_j‖₂²
for k = 1, 2, . . . , K do
    Update c_k = (1/|C_k|) Σ_{i∈C_k} x_i.

6.3 K-means clustering - kernelization


Although the K-means algorithm is very nice and is typically your first shot when trying to cluster objects,
the Euclidean distance used to measure the similarity of xi and xj need not always be the most suitable one.
If you consider the formula for the squared Euclidean distance

‖x_i − c_k‖² = x_i^⊤ x_i − 2 x_i^⊤ c_k + c_k^⊤ c_k,

then you notice that in fact it depends only on inner products involving x_i and c_k. You might already be guessing
what is about to happen. Namely, one can lift the feature vectors to a higher-dimensional space using a mapping Φ(·)
and replace the inner product with a kernel function, so that the new ‘distance’ function becomes

Φ(x_i)^⊤ Φ(x_i) − 2 Φ(x_i)^⊤ Φ(c_k) + Φ(c_k)^⊤ Φ(c_k) = K(x_i, x_i) − 2 K(x_i, c_k) + K(c_k, c_k).

The examples of kernel functions used for clustering are the same as in the case of kernel SVMs.
This idea is the basis of the kernel K-means method, which would be the same as the standard K-means,
but with the re-assignment of points to clusters on the basis of this modified distance.
The only caveat about this idea is how to compute the points c_1, . . . , c_K. We cannot compute them
anymore as the averages of points within a cluster, because working with the kernel distance means that,
implicitly, we have ‘lifted’ our feature vectors using a mapping Φ(·) to a higher dimension,

x_i → Φ(x_i),

where Φ(x_i) could be infinite-dimensional (recall the case of the Gaussian kernel from section 5.4.2), and
it is in that higher-dimensional space that we are performing K-means, so that the distance is computed as:

⟨Φ(x_i) − Φ(c_k), Φ(x_i) − Φ(c_k)⟩ = ⟨Φ(x_i), Φ(x_i)⟩ − 2⟨Φ(x_i), Φ(c_k)⟩ + ⟨Φ(c_k), Φ(c_k)⟩
                                   = K(x_i, x_i) − 2 K(x_i, c_k) + K(c_k, c_k).

For that reason, our ‘center’ of cluster C_k is a higher-dimensional vector Φ(c_k) such that

Φ(c_k) = (1/|C_k|) Σ_{i∈C_k} Φ(x_i).

Of course, we do not want to compute this vector explicitly, among other reasons because, for some kernels, it
corresponds to an infinite-dimensional transformation. However, we can easily compute the distance of each
point from it using the following derivation:

⟨Φ(x_j) − (1/|C_k|) Σ_{i∈C_k} Φ(x_i),  Φ(x_j) − (1/|C_k|) Σ_{i∈C_k} Φ(x_i)⟩
  = ⟨Φ(x_j), Φ(x_j)⟩ − (2/|C_k|) Σ_{i∈C_k} ⟨Φ(x_j), Φ(x_i)⟩ + (1/|C_k|²) Σ_{i,l∈C_k} ⟨Φ(x_i), Φ(x_l)⟩
  = K(x_j, x_j) − (2/|C_k|) Σ_{i∈C_k} K(x_j, x_i) + (1/|C_k|²) Σ_{i,l∈C_k} K(x_i, x_l).

The complete algorithm description is given in Algorithm 23, where, as you can see, we still need to
initialize the algorithm with some cluster assignment, but once the initial assignment is done, we don't need
the variables c_1, . . . , c_K anymore.

Algorithm 23: K means clustering - kernel version


Data: x_1, . . . , x_N, k – number of clusters
Initialize clusters C_1, . . . , C_K.
while Loss function improved in the previous iteration do
    Assign each x_j to cluster k' = argmin_k [ K(x_j, x_j) − (2/|C_k|) Σ_{i∈C_k} K(x_j, x_i) + (1/|C_k|²) Σ_{i,l∈C_k} K(x_i, x_l) ]

6.4 Graph-based clustering problems


So far, we have discussed clustering in a way that considered all the observations x1 , . . . , xN as a loose
collection of separate objects equipped with features. In some applications, however, the objects do not
really have features themselves and instead, we have information about there being a link between two
objects or not. Recall the ‘groups of friends’ example.
Example 6.2. Groups of friends
Consider that you have a group of people and are informed who of them knows/exchanges messages
with each other. That is, for each pair (i, j) of people, you know if there exists a relationship (1) or
not between them (0), with 0 for pairs (i, i) by convention.

[Figure: a friendship graph on the nodes 1–9; red: cluster 1, blue: cluster 2.]

Based on this information, you can formulate an adjacency matrix of the graph in which the nodes
are persons, and edges are existing relationships between them:
 
0 1 0 1 0 0 0 0 0
 1 0 0 1 1 0 0 0 0 
 
 0 0 0 1 1 0 0 0 0 
 
 1 1 1 0 0 0 0 0 0 
 
W =  0 1 1 0 0 1 0 0 0


 0 0 0 0 1 0 1 1 1 
 
 0 0 0 0 0 1 0 1 0 
 
 0 0 0 0 0 1 1 0 1 
0 0 0 0 0 1 0 1 0

Based on this you might have to figure out what are the two ‘groups of friends’ among them.

In such a situation, it is natural to visualize the data at hand as a graph and to try to cluster the objects
with a graph-oriented mindset.

However, also in other contexts it is possible to visualize the problem as a graph, where the weight of
each edge connecting two nodes stands for the ‘strength of similarity’ between the objects represented by
the nodes.
Example 6.3. Image clustering
Consider a set of N black-white images consisting of h × w pixels. Each of such images can be
represented with a vector xi ∈ Rhw , where each entry corresponds to a number on the black (0) -
white (1) scale. Then, similarity of two images i and j can be computed using a number obtained,
e.g., with the Gaussian kernel:

w_{ij} = exp(−γ ‖x_i − x_j‖₂²),    γ > 0.




In this way, we can construct a similarity matrix W ∈ RN ×N , which is a generalization of the


adjacency matrix that you learned in definition 2.27. Such a matrix will correspond to a complete
graph with real, nonnegative numbers standing for the edge weights.
The construction of a graph based on such a problem can go even further, by deciding on a threshold
value ε such that two images are considered ‘possibly related’ if w_{ij} > ε and otherwise not. In this
way, we can construct an adjacency matrix W of a graph where

w_{ij} = 1 if w_{ij} > ε, and w_{ij} = 0 otherwise.

Such a matrix will correspond to a graph in which two nodes are connected by an edge only if the
two nodes are considered ‘sufficiently similar’.

Summarizing the above examples: we can consider every clustering task as clustering nodes in a graph,
where the graph information can consist of
• 0-1 information if two nodes are connected by an edge or not
• continuous information informing about the ‘distance’ between the two nodes.
Equipped with this mindset, we will now present popular clustering ideas that come directly from the world
of graphs, and we will simply assume that for a graph we have a matrix W ∈ RN ×N which has either the
0-1 or continuous entries.

6.5 Cut-based clustering


Suppose you were to divide the nodes of a graph into two clusters, having the matrix W at your disposal.
One of the first ideas that can come to someone's mind is to divide the nodes into two sets V and V̄
such that the sum of the weights of the edges connecting pairs of nodes from the different sets is as small as
possible. In that case, we would be minimizing

min_V  Σ_{r∈V, s∈V̄} w_{rs},

known as the min-cut problem (cut is the division of the set of nodes in a graph into two complementary
sets). While being a nice formulation, in applied contexts it suffers from the downside that, often, it can
lead to one of the clusters consisting of just one, most isolated, node. As in clustering the aim is often to
divide objects into subsets of sizes with ‘similar magnitudes’, a workaround that helps to achieve this goal
is, instead, to minimize a ‘normalized’ version of this quantity:
min_V  ( Σ_{r∈V, s∈V̄} w_{rs} ) / ( Σ_{r,s∈V} w_{rs} )  +  ( Σ_{r∈V, s∈V̄} w_{rs} ) / ( Σ_{r,s∈V̄} w_{rs} ),

where the sum of the weights of outgoing arcs is normalized by the ‘inner weight’ of a given cluster, i.e., how
strongly connected are the nodes within a given cluster. Note that the edges inside the cluster are counted
twice. This trick has the property of preventing highly asymmetric cluster sizes.
The above considerations applied only to contexts with two clusters in mind for illustrative purposes.
The multi-cluster analogue of the min-cut idea would be to minimize the quantity:
Cut(C_1, . . . , C_K) = Σ_{k=1}^K Σ_{r∈C_k, s∉C_k} w_{rs}.

And the normalized version of this idea is given by:


RatioCut(C_1, . . . , C_K) = Σ_{k=1}^K ( Σ_{r∈C_k, s∉C_k} w_{rs} ) / ( Σ_{r,s∈C_k} w_{rs} ),

where the edges inside the cluster are counted twice. The bad news, however is that again, minimizing a
quantity like this is not a computationally tractable optimization problem and one would need to resort to
heuristic techniques similar to that of the K-means clustering algorithm, namely, trying if shifting a given
node from one set to another helps improve the RadioCut value.
As it turns out however, considering the linear algebraic properties of the matrix W can in some cases
imitate the minimization of the above quantity, and does lead to nice clustering techniques.

6.6 Spectral clustering


Consider the Laplacian matrix of the graph, given by

L = D − W,

where D is a diagonal matrix with D_{i,i} = Σ_{j=1}^N w_{ij}. If the edge information consists only of 0-1 information
on whether there is an edge or not, then the diagonal entries of D contain the degrees of the nodes in
the graph.
What is important to note about this matrix is that it is symmetric and positive semidefinite; cf. Defi-
nition 2.27. Hence, all its eigenvalues will be real and nonnegative.
Exercise 6.3.
What is the smallest eigenvalue of matrix L and the corresponding eigenvector?

By considering this matrix, it turns out that we can recover the idea of RatioCut by left- and right-
multiplying it with a specific matrix whose nonzero entries indicate the belonging of a given node to a
cluster k.
Proposition 6.1.
Let C_1, C_2, . . . , C_K be the clusters (sets of sample indices) and let H ∈ R^{N×K} be a matrix where

H_{i,j} = (1/√|C_j|) · 1_{i∈C_j},

where 1_? is an indicator function equal to 1 if the clause holds, and 0 otherwise. We then have that

RatioCut(C_1, . . . , C_K) = trace(H^⊤ L H).
Exercise 6.4.
Prove the above proposition.

Note that the columns of the so-defined matrix H form an orthonormal set, and their nonzero entries uniquely
define the belonging of a given observation to each of the clusters.
Noticing this, we can state that for the problem of minimizing the RatioCut we can search for a matrix
H whose columns are orthonormal and such that each H_{ij} ∈ {0, 1/√|C_j|}. Unfortunately, this is an integer
programming problem which we cannot solve efficiently.
But we can relax some of the restrictions of the so-defined problem: the idea would be to search for a
matrix H ∈ R^{N×K} with orthonormal columns that minimizes trace(H^⊤ L H). It is rather difficult to solve an optimization
problem that includes constraints of the type ‘the columns of a given matrix should be orthonormal’, but luckily,
this particular problem has a well-known solution.
By linear algebra we know the solution to this problem is the matrix H whose columns are the eigenvectors
corresponding to the K minimal eigenvalues of L. The downside is, of course, that such a matrix will not
have rows with only one nonzero entry indicating to which cluster a given observation should belong. For
that reason, we still need to... cluster the rows of H. Here, you realize that there is no escape from
the K-means clustering algorithm, because it is the algorithm used most commonly for clustering the rows
of H. The resulting algorithm is called unnormalized spectral clustering, presented in Algorithm 24.

Algorithm 24: Unnormalized spectral clustering


Data: x1 , . . . , xN , k – number of clusters
Compute the similarity matrix W and the Laplacian L = D − W .
Construct a matrix H whose columns are the eigenvectors corresponding to the K minimal
eigenvalues of L.
Use K-means algorithm to cluster the rows of H into C1 , . . . , CK .

Example 6.4. Special case K = 2


If there are to be only two clusters, then the eigenvector corresponding to the second-smallest
eigenvalue is known as the Fiedler vector. Because the entries in the eigenvector corresponding to
the smallest eigenvalue are all the same (see exercise 6.3), the clustering of the rows is done solely on the
basis of the entries of the second eigenvector. A common technique to split the objects into
two clusters is then simply to form one cluster from the indices with a positive entry,
and the other from those with a nonpositive entry.
For example, in the ‘groups of friends’ example 6.2 we have the following Fiedler vector

v2 = [−0.39, −0.31, −0.275, −0.35, −0.11, 0.28, 0.38, 0.38, 0.38],

which ‘correctly’ classifies the nodes into the red and blue clusters.

In some contexts, the unnormalized version of the spectral clustering algorithm does not lead to the most
desired results because the topological structure of the graph is dominated by a few nodes with the largest
degree Dii .
In Internet-related clustering problems this can mean, for example, that a given node is a unit that is
sending out spam.
For that reason, another, normalized variant of the spectral clustering algorithm is used, where the
Laplacian is normalized as
L̄ = D−1/2 LD−1/2 = I − D−1/2 W D−1/2 .
With that change, the algorithm remains the same and is given in Algorithm 25.

Algorithm 25: Normalized spectral clustering
Data: x1 , . . . , xN , k – number of clusters
Compute the similarity matrix W and the normalized Laplacian L̄ = I − D−1/2 W D−1/2 .
Construct a matrix H whose columns are the eigenvectors corresponding to the K minimal
eigenvalues of L̄.
Use K-means algorithm to cluster the rows of H into C1 , . . . , CK .

Both the unnormalized and the normalized version of this algorithm require us to compute the K smallest
eigenvalues of a matrix whose size scales linearly with N – the number of samples. For that reason, from
a certain size onwards the task might become computationally challenging. For that purpose, effective
approximate techniques, such as Nyström sampling, have been constructed [1].
By means of the material of this course, since the graph Laplacian is symmetric, its eigenvalues and
eigenvectors can be approximated using the QR algorithm. If the graph Laplacian is sparse, one may instead
want to compute the eigenvalues and eigenvectors in a Krylov subspace. Unfortunately, the eigenvectors
corresponding to the largest eigenvalues are dominant in the Krylov subspace. Therefore, one could compute
approximates of the largest eigenvalues and corresponding eigenvectors in the Krylov subspace

K_m((L + εI)^{−1}, v),

for some v, a small ε > 0, and m > K, instead; the term εI is a small regularization term making the matrix
invertible. Note that, if λ_i is an eigenvalue of L with eigenvector v_i, then 1/(λ_i + ε) is an eigenvalue of (L + εI)^{−1} with
the same eigenvector v_i. Hence, the K largest eigenvalues and corresponding eigenvectors of
(L + εI)^{−1} can be used to compute the K smallest eigenvalues and eigenvectors, respectively,
of L.
Exercise 6.5.
Consider a graph consisting of two complete subgraphs which are not connected by any edge, and
where the matrix W is an adjacency matrix. Will unnormalized/normalized clustering with K = 2
lead to two clusters corresponding to the two connected parts? How will that look like for K complete
subgraphs, and clustering with K clusters?

6.7 Final comments


As already said in the beginning, in this chapter we only introduced the clustering approaches that have
some interesting linear-algebraic or optimization-related properties. Because of that, we do not introduce
other popular techniques such as, for example, agglomerative/divisive (hierarchical) techniques, or density-based
techniques such as DBSCAN [22, 32], all of which scale nicely to very large datasets.

Figure 7.1: Comparison of an SVM classification tool and rectangle-partition tool. Areas are shaded in the
color of the predicted value.

7 Tree-based learners
7.1 Introduction
So far we did nice and cool ‘proper optimization’ in this course, that is, we were learning methods that,
provided the problem possesses some friendly properties (convexity), converge to the
optimal solution – for example, the gradient method.
This allowed us to optimize the shape of fairly complex classification (SVMs) or regression (linear regression,
or linear regression including nonlinear feature transformations) tools.
What was the ‘essence’ of the power of, for example, SVMs? It was that through the kernel trick we were
able to construct a fairly complicated ‘separation surface’. This complicated surface, as a result, was doing
a good job, separating points with different label values +1 and −1.
One can, however, also try to think differently about the problem and, instead of dividing the points using
a ‘single but complicated shape’, divide the feature space into many simple shapes, and assign the value +1 or
−1 as the predictor to each of the simple shapes, depending on whatever is the majority of labels there in
the training set. This idea is illustrated in fig. 7.1.
In the right panel, the feature space has been divided by partitioning it into hyper-rectangles, so that
the sample points within the subsequent rectangles are more and more ‘uniform’, i.e. that in the end we end
up with small rectangles where nearly every point has the same label.
How do we formalize the corresponding prediction tool? If we denote each rectangle by Xl , and y(Xl ) is
the label assigned to Xl , then the decision tool is:

y = y(Xl ) if x ∈ Xl ,

with ties being broken arbitrarily.


A similar reasoning as above can be applied to regression problems. We can construct a complicated
function that we try to fit to the data, or use a super-simple function (constant), that is applied on subsets
of the partitioned feature space. This is illustrated in fig. 7.2. The corresponding predictor will then again be:

y = y(X_l) if x ∈ X_l.
As you can intuit, the finer the partition we use, the more ‘exact’ a prediction tool like this can become
on the training dataset. Partitioning the space into such rectangles can also have the benefit of creating
interpretable ‘clusters’ of points. Let us now put aside the question of how to obtain such a partitioning and
try to group the potential (dis)advantages of a tool like this.
Advantages:

• simplicity of construction – only hyperplanes used to separate subsets


Figure 7.2: Comparison of a polynomial regression of degree 3 with a piecewise-constant regression through
domain splitting.

• interpretability of the subsets – if the hyperplanes used have, for example, only one nonzero entry, then
we have a set of simple threshold rules that identifies what label a given data point should receive
Disadvantages:
• prone to overfitting when the partition is too fine

• how to optimize them?


The construction of this chapter will be somewhat different from what you would see in a typical machine
learning textbook discussing the so-called classification and regression trees ([22], which is the name of the
tools that we will be constructing). Normally, a textbook would start directly by providing you with
a classical heuristic approach to constructing partitions such as the ones above.
We will, however, begin by trying to formulate the problem of creating the ‘best partition’ as an optimiza-
tion problem and point to its difficulty. Only having seen the difficulty, we shall move to the heuristics that
are popular in the ML community to construct the best partitions or ensembles of partitions used together.
In the end, however, in the discussion about interpretability of ML tools, we will come back to the proper
optimization idea, asking ourselves the question if it can be useful to create the ML tools of the future, that
will be required by the lawmakers to be interpretable.

7.2 Classification and regression trees - the optimization problem


7.2.1 Rough idea
Suppose you have a data set of N points (x_i, y_i), i = 1, . . . , N, where x_i ∈ R^n. We will assume that we are
dealing either with a regression task, in which case y_i ∈ R, or a two-class classification task, where y_i ∈ {−1, 1}.
Our goal will be to construct an ‘optimal partition of the feature space R^n’ for purposes of classification/regression.
That is, given the maximum number L of subsets, to divide the space R^n into L subsets, each
of which is defined as an intersection of a certain number of half-spaces:

X_l = {x ∈ R^n : a_{lk}^⊤ x ≤ b_{lk}, k = 1, . . . , N_l}.

In particular, we want our partition to meet the following two requirements:

• mutually exclusive: for any l, l' it holds that the set X_l ∩ X_{l'} is either empty or is a subset of a
hyperplane – this means that the probability that we end up in X_l ∩ X_{l'} is zero under all reasonable
data distributions;

• collectively exhaustive: it holds that

∪_l X_l = R^n.

Additionally, to each of the subsets Xl we will assign the corresponding label y(Xl ) which will act as the
predictor for that specific set.
What do we want to achieve with this partitioning and these labels? Just as in our earlier optimization
attempts, we will try to fit the partition and the labels to the training data as well as possible.
When is the fit good on the training set? When within each set Xl the labels of the data points there
are close to y(Xl ). When is the fit bad? When we have the exact opposite. Because labelling each subset
Xl can be considered separately, we will do so in the formal discussion now, considering also regression and
classification separately.

7.2.2 Fitting a label for a single Xl for regression


When we’re dealing with the case y ∈ R, the goal of finding a single best-fitting label is rather straight-
forward – pick a label value v that is as close as possible to all the labels within the set Xl . This can be
achieved by minimizing the following loss function:
\[
\sum_{i:\, x_i \in X_l} (v - y_i)^2.
\]
You know what the result of this is – v is the average value of all the labels inside the set:
\[
v = \frac{1}{|\{i : x_i \in X_l\}|} \sum_{i:\, x_i \in X_l} y_i.
\]

7.2.3 Fitting a single label for two-class classification


Consider now the case of y ∈ {−1, 1}. It is easy to say that the best-fitting single label for a given subset
is the label of the majority of points in that set. However, we need a slightly more complicated story here
that will pay off later on.
If you remember logistic regression, the idea there was that we used a linear function to model the
logarithm of the ‘odds ratio’, i.e., the probability of a sample getting labelled +1 versus the probability of
the opposite:
\[
w^\top x \sim \log\left(\frac{\mathbb{P}(y = 1 \mid x)}{1 - \mathbb{P}(y = 1 \mid x)}\right),
\]
so that the probability is
\[
\mathbb{P}(y = 1 \mid x) = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}.
\]
The corresponding prediction model was essentially
\[
y = \begin{cases} 1 & \text{if } w^\top x \ge 0, \\ -1 & \text{otherwise,} \end{cases}
\]
and the loss function to minimize over a sample was
\[
\sum_{i=1}^{N} \log\left(1 + \exp\left(-y_i w^\top x_i\right)\right).
\]
Now, if we’re trying to fit a single constant label for all observations within a given set Xl , we can skip the
linear dependency and simply try to find a number v ∈ R to minimize the loss function
\[
\sum_{i:\, x_i \in X_l} \log\left(1 + \exp(-y_i v)\right), \qquad (7.1)
\]

Figure 7.3: Hierarchical partitioning of the space into four subsets.

with the prediction model being
\[
y = \begin{cases} 1 & \text{if } v \ge 0, \\ -1 & \text{otherwise.} \end{cases}
\]
It is worth noticing that, because of the simplicity of the prediction mechanism, the optimization of v in
(7.1) need not be performed until the minimizer is found – getting the sign right is enough. In practice, this
means that v will take a positive value if the majority of labels is +1 and it will take a negative value if the
majority of labels is −1.
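As a small illustration (this sketch is ours, not part of the original notes), the following Python snippet fits the single constant label for a subset of samples: the mean for regression, and the minimizer of the logistic loss (7.1) for two-class classification. The function names are hypothetical, and scipy is used only for the one-dimensional minimization.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_regression_label(y_subset):
    # Least-squares optimal constant: the average of the labels in the subset.
    return np.mean(y_subset)

def fit_classification_label(y_subset):
    # Minimize the logistic loss (7.1) over a single scalar v; labels are in {-1, +1}.
    loss = lambda v: np.sum(np.log(1.0 + np.exp(-y_subset * v)))
    res = minimize_scalar(loss, bounds=(-10.0, 10.0), method="bounded")
    return res.x  # its sign matches the majority label

# Small usage example
y_reg = np.array([1.2, 0.8, 1.0])
y_clf = np.array([1, 1, -1])
print(fit_regression_label(y_reg))      # 1.0
print(fit_classification_label(y_clf))  # positive number -> predict +1
```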

7.2.4 The optimization problem to solve


To unify the discussion now, assume that for each subset Xl we are minimizing some loss function L(Xl , vl )
depending on the task we face, boiling down to fitting a single ‘label-generating number’ vl .
If we want to optimize the partition of the space, then we obtain the problem:
\[
\begin{aligned}
\min_{a_{lk},\, b_{lk},\, v_l} \quad & \sum_{l=1}^{L} L(X_l, v_l) \\
\text{s.t.} \quad & X_l = \left\{ x \in \mathbb{R}^n : a_{lk}^\top x \le b_{lk}, \ k \in \{1, \dots, N_l\} \right\} \\
& X_l \cap X_{l'} = \emptyset \ \text{ or } \ \dim(X_l \cap X_{l'}) \le n - 1 \quad \forall l \ne l' \\
& \bigcup_{l} X_l = \mathbb{R}^n.
\end{aligned}
\]
Do you have an idea how a problem like this can be solved in the general case? That would be a very complicated
task because the feasible set is definitely not convex, and an attempt to formulate the above constraints using
closed-form expressions would be absolutely daunting. In other words, a problem posed like this is
absolutely hopeless.
With two simplifying restrictions, however, it is possible to formulate the problem in a way that one can
at least try to solve using existing software.
Organizing the partition into a tree. The first assumption is that the number of subsets L should
be a power of 2, such that the subsets are obtained through a recursive tree of partitions, where the
‘children nodes’ inherit the half-spaces of the ‘parent nodes’. First, one splits the entire space using a single
hyperplane. Then, each of the resulting subsets is split using a single hyperplane into two again, which gives
us four subsets after two splitting rounds. After d splitting rounds we obtain 2d subsets, each defined using
d hyperplanes. In this way, we obtain a tree structure. This idea is illustrated in fig. 7.3.
Each partition depends only on one feature. Another step is to restrict each of the hyperplanes
to only one feature, that is, to require that all vectors ali are unit vectors or their negatives. Actually, this is exactly
the way that the right panel in fig. 7.2 has been constructed, using a tree of depth 3, because all the lines
in this plot are either vertical or horizontal (corresponding to splits along only one feature). This restriction
has the optimization-friendly feature that, suddenly, the search for the best ali is restricted to n possibilities
only.
A tree satisfying the above two assumptions would be a very interpretable one - and interpretability is a
much discussed topic nowadays in the context of ML used in societal applications.
Example 7.1. Interpretability
Imagine you are constructing a classification tree for deciding whether someone is suspected of having
diabetes or not, based on a number of the patient’s characteristics (BMI, cholesterol, etc.). A classi-
fication tree built using simple one-feature rules such as ‘is the patient’s BMI higher than Z’ is way
more trustworthy to practitioners than a classification tree built using rules such as ‘is 0.4 patient’s
BMI plus 0.145 patient’s cholesterol level higher than Z’. This is exactly what the whole discussion about the
interpretability of AI tools is about, in case you have heard of it.

The two restrictions simplify the accounting a lot, and the resulting problem can actually be written
down as a mixed-integer linear optimization (MILP, [49]) problem, i.e., a problem of the form
\[
\begin{aligned}
\min_{x \in \mathbb{R}^{t_1} \times \mathbb{Z}^{t_2}} \quad & c^\top x \\
\text{s.t.} \quad & Ax \le b,
\end{aligned}
\]
where the last t2 entries of x are restricted to be integers.2


As of now, application of algorithms used to solve MILPs is rather limited in ML, for which reason we
don’t teach you the MILP algorithms here. We can say however, that good algorithms exist and this idea
has been tried in a line of work started by [6]. However, for problem sizes encountered in ML, this is not
really going to work (but maybe something better in that fashion will be invented in the future – we will get
back to this point).

7.3 Recursive tree construction


People who invented classification/regression trees have been doing it long before [6] tried to construct these
trees optimally. To build these trees, they did what one usually does when the optimization problem at hand
is too difficult – they invented smart heuristics for solving this problem, and then, heuristics to improve the
heuristics.
The result is astonishing – decision trees (the basic tool built by constructing the tree of
hyperplanes in fig. 7.3 sequentially from top to bottom instead of jointly), and their more advanced
versions known as gradient-boosted trees and random forests, often frequent the top of the winning entries’
list in various ML competitions.
If an idea which abolishes the noble pursuit of optimality in solving the problem behind fig. 7.3 works so
well, it is definitely worth studying. We begin with the construction of a single decision tree.
The idea is to split the optimization of the different hyperplanes in fig. 7.3 into a sequential process.
First – think only about the first partition with a single hyperplane and select this hyperplane such that
the partition into two subsets is ‘as pure as possible’. Then, for each of the resulting subsets, we can repeat
the procedure, independently from each other. This idea has a significant computational advantage – the
splitting of various ‘children’ of the same subset can be parallelized.
How does a single partitioning look then? Consider a set Xl that we want to partition using a
hyperplane xj = b into two subsets:
\[
X_l^-(b, j) = X_l \cap \{x : x_j \le b\}, \qquad X_l^+(b, j) = X_l \cap \{x : x_j > b\}.
\]


2 If you are curious about how such problems are solved, here is a nice video tutorial: http://www.youtube.com/watch?v=

sMtaUWQOjcY

For this, we need to select the feature j ∈ {1, . . . , n} that will be the basis of the partition, and then optimize
the threshold b. How do we do this? We can try every possible feature j = 1, . . . , n, for each of them
optimize the threshold, and select the feature that, together with its threshold, gives the best subset purity:
\[
L(X_l^-(b, j), v_-) + L(X_l^+(b, j), v_+).
\]
For a fixed feature index j, we can optimize the value b by, for example, searching over the interval
\[
\left[\min\{x_{i,j} : x_i \in X_l\},\ \max\{x_{i,j} : x_i \in X_l\}\right].
\]
Note that if there are N' points within the given subset, then the number of thresholds one actually needs
to try does not exceed N' − 1 (why?).
Formally, we thus do the following to split the set Xl :
\[
\min_{j,\, b,\, v_-,\, v_+} \ L(X_l^-(b, j), v_-) + L(X_l^+(b, j), v_+).
\]
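For illustration (our own sketch, with hypothetical helper names), this single-split search for a regression tree with the squared loss from section 7.2.2 could look as follows:

```python
import numpy as np

def squared_loss(y):
    # Loss of the best constant label (the mean) on a subset of targets.
    return np.sum((y - np.mean(y)) ** 2) if len(y) > 0 else 0.0

def best_split(X, y):
    """Search all features j and all thresholds b; return the split with the smallest total loss."""
    best = (None, None, np.inf)  # (feature j, threshold b, loss)
    for j in range(X.shape[1]):
        # Midpoints between consecutive sorted feature values suffice as candidate thresholds.
        values = np.unique(X[:, j])
        for b in (values[:-1] + values[1:]) / 2.0:
            left, right = X[:, j] <= b, X[:, j] > b
            loss = squared_loss(y[left]) + squared_loss(y[right])
            if loss < best[2]:
                best = (j, b, loss)
    return best

# Usage: j, b, loss = best_split(X_train, y_train)
```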

Overall, the recursive algorithm for constructing a classification or regression tree is as follows.

Algorithm 26: Recursive tree building


Data: Dataset (x1 , y1 ), . . . , (xN , yN ), tree depth d, loss function L.
for Tree level l = 1, . . . , d do
for Set Al in level l do
Determine the best feature j, threshold b and the corresponding subset labellings v− , v+ to
minimize
L(Xl− (b, j), v− ) + L(Xl+ (b, j), v+ )
Add the resulting sets to the sets on level l + 1.

What are the benefits of an algorithm like this? First of all, at each step one minimizes over a single
one-dimensional parameter b, and most of the computations can be parallelized. For that reason, the buildup
of trees like this is extremely fast. Additionally, because at each level one uses a single-feature criterion, the
corresponding prediction tools are easily interpretable.
The recursive mechanism has some drawbacks as well, of course. Compared to the ideal situation in
which all the partition parameters would be optimized jointly, the accuracy of such a greedily-built tree will
certainly be suboptimal.
Exercise 7.1.
Construct a worst-case two-class classification dataset in R2 such that if you apply the recursive tree
construction once, at least one of the optimal trees predicts nothing useful, i.e., it is just as good as guessing
the label of a new point based on whichever label is most frequent.

So far, we assumed that the tree depth d is a fixed value. In fact, it is a hyperparameter of our tool that
we need to tune to make it work as well as possible – just like the degree of the polynomial kernel in SVMs,
for example. If a sufficiently high d is chosen, it is possible to partition the dataset in a perfectly pure
way where each sample lives in its own cell. But that is not the point – this is only the training data,
and it is likely that a tree like this will underperform on the test data. For that reason, d should be chosen
(the tree can be ‘pruned’) such that the selected depth value performs best not on the training dataset, but on the
test/validation dataset.

7.4 Random forests


As already discussed, a way to create a fine precision classification/regression tree is to make it a fairly deep
tree. Such a tree, however, is likely to suffer from overfitting. It can be also quite arbitrary – if splitting on

Figure 7.4: Classifiers obtained for random forests with trees of maximum depth of 3 each, consisting of 1,
50 and 100 trees.

two features gives roughly the same result, why would one pick one feature over the other? And how would
we know how it impacts the splittings further down the tree? In the end, investigating every possible feature
in alg. 26 to select the best one among them, although it can be parallelized, still takes time/effort.
As it turns out, a strategy that usually works better is to select the features to branch on randomly for a
given tree, but to construct many trees simultaneously (which can be parallelized). While this might sound
like a risky choice, this idea actually works pretty well. alg. 27 illustrates the idea of randomly creating
many trees – a random forest of trees.

Algorithm 27: Random forest construction


Data: Dataset (x1 , y1 ), . . . , (xN , yN ), tree depth d, loss function L, number of trees T .
for Tree t = 1, . . . , T do
for Tree level l = 1, . . . , d do
for Set Al in level l do
Randomly select feature j and pick the best threshold b and the corresponding subset
labellings v− , v+ to minimize

L(Xl− (b, j), v− ) + L(Xl+ (b, j), v+ )

Add the resulting sets to the sets on level l + 1.

One question is left. How do we aggregate the predictions of multiple trees? In the case of regression, the
simplest answer is to average out the predictions generated by the different trees. For classification, we can use
a ‘voting’ mechanism where a given sample receives the label that most of the trees select. Fig. 7.4 illustrates
the random forest idea.
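A minimal sketch of such an aggregation (ours; it assumes a list of already fitted trees, each exposing a predict method):

```python
import numpy as np

def forest_predict_regression(trees, X):
    # Average the per-tree predictions for each sample.
    return np.mean([tree.predict(X) for tree in trees], axis=0)

def forest_predict_classification(trees, X):
    # Majority vote over labels in {-1, +1}: the sign of the summed votes.
    votes = np.sum([tree.predict(X) for tree in trees], axis=0)
    return np.where(votes >= 0, 1, -1)
```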

7.5 Boosting
Instead of voting/averaging of many randomly created trees, one can also come up with the following idea:
create a single tree first, and then, ‘add to it’ another tree that would focus on samples on which the
previous tree’s classifications were wrong. In that way, the next tree is created deliberately to compensate
for underperformances of the previous one, not in a random fashion.
This is the idea of boosting. Although the boosting idea can be applied pretty much to any ML tool, it
became particularly popular when constructing tree-based classification and regression tools. This is because,
for these tools, the trade-off between ‘let’s construct many simple tools without optimizing each of them too
much’ versus ‘let’s construct a single, highly-optimized complicated tool’ seems to be in favor of the
former.

We now introduce the idea formally on the example of regression first. Suppose that for our training data
set we construct a first predictive model
\[
y = f_1(x)
\]
by minimizing, for example, the following loss function:
\[
\min_{f \in \mathcal{F}} \ \sum_{i=1}^{N} (y_i - f(x_i))^2.
\]

The idea of boosting is the following – we treat the predictive model
\[
\text{model}_1(x) = f_1(x)
\]
as given, and next try to construct a new model
\[
\text{model}_2(x) = \text{model}_1(x) + f_2(x)
\]
by minimizing the following loss function:
\[
f_2 := \operatorname*{argmin}_{f \in \mathcal{F}} \ \sum_{i=1}^{N} \big((y_i - \text{model}_1(x_i)) - f(x_i)\big)^2.
\]
Note that, essentially, we are trying to fit a model that will cover for the ‘misclassifications’ (i.e., the errors)
of the previous model.
The general m-th step of the boosting approach is then given by
\[
\text{model}_m(x) = \text{model}_{m-1}(x) + f_m(x),
\]
where the new term is obtained by minimizing the objective function
\[
f_m := \operatorname*{argmin}_{f \in \mathcal{F}} \ \sum_{i=1}^{N} \big((y_i - \text{model}_{m-1}(x_i)) - f(x_i)\big)^2,
\]
i.e., at each step we treat the previous ‘model’ as fixed and we only optimize the new term that corrects for
the mistakes of the previous one.
For each subsequent tree in the boosting process, the creation of the next tree can follow the same
recursive mechanism as before, and we only modify the loss function to minimize when performing the
splits.
For regression trees the implementation is simple – just as in the formulas above, we construct a decision
tree fitted to the errors generated by the previous trees; in other words, we fit a regression tree to the data
set
\[
(x_1, y_1 - \text{model}_{m-1}(x_1)), \ \dots, \ (x_N, y_N - \text{model}_{m-1}(x_N)).
\]
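A rough sketch of this boosting loop for regression (our own illustration, using scikit-learn's DecisionTreeRegressor as a convenient base learner, which is an assumption and not part of these notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression(X, y, n_rounds=100, max_depth=3):
    """Fit a sum of shallow regression trees, each trained on the residuals of the previous ones."""
    trees, residual = [], y.astype(float).copy()
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                  # fit f_m to the current residuals
        residual = residual - tree.predict(X)  # update y_i - model_m(x_i)
        trees.append(tree)
    return trees

def boosted_predict(trees, X):
    # The boosted model is the sum of all fitted trees.
    return np.sum([tree.predict(X) for tree in trees], axis=0)
```

In practice, one would typically also scale each new tree's contribution by a small learning rate to reduce overfitting.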
For classification trees, we cannot simply ‘subtract’ the model prediction from the label itself (we could
end up with numbers different from −1 or 1). But what we can do is to treat the ‘v’ of the previous model
for a given sample as something to be corrected. In the logistic regression setting we chose, this means
minimizing the following loss function:
\[
\sum_{i:\, x_i \in X_l} \log\big(1 + \exp\big(-y_i (\text{model}_{m-1}(x_i) + v)\big)\big).
\]
This is where, we hope, you see the entire point of introducing the softmax function mechanism. Fig. 7.5
illustrates the idea.

Figure 7.5: Classifiers obtained for boosting with trees of maximum depth of 3 each, after a single step (no
boosting), 50 and 100 steps.

Exercise 7.2.
Consider again exercise 3.12. This time – code your own classification tree constructor, including its
forest- and boosting-variant. Compare the speed and performance to the SVM-based classification.

7.6 Need for interpretability – will proper optimization make its comeback?
As you have seen above, the two approaches that make up for the deficiencies of having a single tree consist
in creating sums or random collections of many trees. While this improves the predictive performance, it
makes the model less interpretable than a single tree based on single-feature splits only.
For that reason, a logical question is the following one. If one day the pressure for AI tools used, e.g.,
to process job applications, to be interpretable leads to a ban on using random forests or overly complicated
boosted trees, what will the future be for such highly efficient prediction tools?
A possible direction in which things might go is the following one. Imagine the ML tool designer being
forced to create a simple, interpretable tool. In the trees context, this might correspond to having a single
tree of fixed maximum depth. Then, the focus on the quality of each of the branchings will be much higher
than now – currently these branchings are done heuristically and quality is improved by creating more trees.
It is possible that in such a situation, heuristic approaches will no longer be sufficient and one will have to
develop (almost) exact algorithms for the joint problem, much in the spirit of [6].

8 Hyperparameter optimization
8.1 Introduction
Many, if not all, ML tools are described by hyperparameters chosen by the user before the ML tool gets
optimized to fit the training data – think of:
• the stepsize used in the gradient descent method
• the regularization parameter in SVMs
• the γ used in kernel functions

While some hyperparameters are discrete numbers – for example, the degree of the polynomial kernel used
in SVMs – other hyperparameters take continuous values – such as the stepsize length.
The name hyper stems from the need to distinguish them from the actual parameters which are optimized
‘automatically’:

• the SVM’s coefficient vector


• coefficients of a neural network
• hyperplanes that define a classification tree
For a given model, let us summarize the specific hyperparameter settings by a vector h ∈ Rnh . For a given h,
the parameters of the model are optimized to fit the training data as well as possible. However, it is possible
that for some values of h the fit is better than for the others. That seems to mean that the hyperparameter
values should be optimized simultaneously with the parameters of the given ML tool.
This is, however, not how things are done. The reasons for that are twofold. First, it could lead to trivial
solutions.
Example 8.1.
Consider the loss function in the regularized SVM:
\[
\sum_{i=1}^{N} \max\big(0,\ 1 - y_i (w^\top x_i)\big)^2 + \lambda \|w\|_2^2. \qquad (8.1)
\]

Were we to minimize over w and λ ≥ 0 simultaneously, the optimal λ would always be equal to 0,
which means that something is wrong either with the loss function, or with the very idea of optimizing
them jointly.

A second, much more important reason is that hyperparameters optimized in such a way would lead
to models that perform very badly out of sample, i.e., on data which is not part of the training data.
Therefore, hyperparameters should be chosen in such a way that the model performs as well as possible on
data that is different from the training data. How can this be done?
There are multiple ways to do it, and the process of doing so is called validation. A most classical approach
is the so-called K-fold cross-validation. In this approach the training dataset X is divided into K distinct
sets of equal size (or almost equal size if that’s not possible):
\[
\mathcal{X} = \bigcup_{k=1}^{K} \mathcal{X}_k.
\]

Denote by M(X , h) the model trained on dataset X with hyperparameter value h, and by P(M(X , h), X 0 ) its
performance on set X 0 (for example, SVM loss function without the regularizer, out of sample classification
error, loss function in regression, etc...).

Then, the K-fold cross-validated performance of the hyperparameter setting h is computed as
\[
\frac{1}{K} \sum_{k=1}^{K} P\big(M(\mathcal{X} \setminus \mathcal{X}_k, h),\ \mathcal{X}_k\big).
\]

In other words, for each k, we train the model on the data set consisting of all samples except for Xk (training
set), and then evaluate its performance on Xk (validation set). In this way, the models are evaluated on
different data than they were trained on, and each sample in the dataset has played the role of both a training
and a validation sample. The goal of K-fold cross-validation is to make the model assessment independent of the
specific data split at hand.
Coming back to hyperparameter choice, our goal would be to fit the model as well as possible, corrected
for the validation step. We thus want to solve the following problem:
\[
\min_{h \in \mathcal{H}} \ G(h) := \frac{1}{K} \sum_{k=1}^{K} P\big(M(\mathcal{X} \setminus \mathcal{X}_k, h),\ \mathcal{X}_k\big). \qquad (8.2)
\]

This problem is known as the hyperparameter optimization (HO) and it is computationally challenging
because for the very evaluation of the function for a specific h, the model has to be trained K times. Of
course, the model trainings can be easily parallelized as they are completely independent, but in principle,
training even a single model is not a trivial step.
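To make the objective G(h) in (8.2) concrete, here is a minimal Python sketch (ours; train_model and performance are hypothetical placeholders for training a model with hyperparameters h and measuring its validation performance):

```python
import numpy as np

def cross_validated_objective(X, y, h, K, train_model, performance, seed=0):
    """Compute G(h): the average validation performance over K folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    scores = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train_model(X[train], y[train], h)         # train on X \ X_k
        scores.append(performance(model, X[val], y[val]))  # evaluate on X_k
    return np.mean(scores)
```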
Remark 8.1. Training-validation-test splitting
Strictly speaking, the most popular approach to building and assessing ML models is as follows. First,
you divide your dataset into training and test sets. Next, on the training set you perform HO with
K-fold cross-validation (thus iteratively ‘taking out’ small pieces of the training set which will be used
for validation). Then, once the hyperparameter value h∗ that minimizes (8.2) is found, the model
is trained on the entire training set with hyperparameters h∗ . In the end, the performance of this
model is assessed on the test set (which was not used at all until this moment).

HO, by the nature of the problem, is a task that can involve only a limited number of model trainings /
K-fold validations, because each of them, on its own, is expensive. In general – the longer it takes to
train a single model, the fewer ‘attempts’ we have to try different values for h. What doesn’t make the goal
easier is the fact that the function we are trying to minimize can be non-convex, as in fig. 8.1, and the
fact that some entries of h might have to take integer values.
Luckily (as in many other cases discussed earlier in this course), ML is not the first field to encounter
this kind of problem – industrial design, experiment design in biotechnological research: all these fields have
dealt with this problem since the beginning of the 20th century.
Example 8.2.
Imagine you are operating an oil field and try to figure out the best place to drill and put an oil rig. In
principle, you want to find a place where the oil lies as shallow as possible. Each depth measurement costs a
lot of time and money, and you can only perform a maximum of 10 measurements. How would you strategize
the different places at which to measure the depth?

For that reason, it should not come as a surprise that the ML strategies for HO will ‘borrow’ from the
ideas used in those fields. Our running assumption throughout this section is that we will be facing the task
of finding the best value h ∈ H, where H is the set of all ‘reasonable’ parameter values.


Figure 8.1: Negative of the classification accuracy on the validation dataset for an SVM described by two
hyperparameters (we want to minimize this function over the hyperparameter values).

Remark 8.2.
As already mentioned above, hyperparameters can take both a continuous and a discrete form – think
of the kernel parameter γ in SVMs and the number of hidden layers in neural networks. This section
of the notes is written mostly with continuous parameters in mind, but most of the ideas here can
be extended in a ‘life-hacky’ way to searching over discrete parameter spaces.

What would be typical first-shot strategies for hyperparameter selection? As a first try, you would
probably ask people who work on an application similar to yours what values of hyperparameters typically
‘work’. This is a good strategy because ‘folk knowledge’ like this can save you a lot of time and you benefit
from the work already done by other people.
If this does not make you happy though, you can also try a few different values for h yourself and simply
select the best one. This will almost certainly improve upon the single-shot value h you get from folk wisdom,
but (i) it is not easily replicable if someone wants to check your results, (ii) it might involve a lot of your own
time, which you might prefer to spend on other activities.
For these reasons, a need for automated HO/tuning strategies arises. In what follows, we shall discuss
three general approaches for solving this problem in an increasing order of their mathematical sophistication
(and decreasing order of popularity). Although this chapter is constructed for the purposes of HO, you
should consider it as a general discussion of what to do when we try to minimize a function which is very
costly to evaluate and for which no information apart from its value can be obtained. Formally, this is called
black-box optimization.

8.2 Grid search and random sampling


The most basic HO idea brings us back to the very beginning of our learning of optimization – the grid
search. The idea is simple: having to find the best value h ∈ H ⊂ Rnh , for each entry of h we investigate a
number of possible values
\[
h_i \in H_i = \{\bar{h}_{i,1}, \dots, \bar{h}_{i,r_i}\}.
\]

Since we need to find a joint selection of all components h1 , . . . , hnh , we need to search the entire multi-
dimensional product set:
\[
H = H_1 \times \dots \times H_{n_h}.
\]
For each parameter combination in this set, we need to train our model and validate it. That means that for
higher-dimensional h, grid search suffers from the curse of dimensionality – the number of model trainings
needed becomes impractical. For hyperparameter vectors of length 2 or 3, as in, for example, the support
vector machines with the Gaussian kernel, it is a perfectly suitable approach. Of course, no approach is free
from a certain degree of arbitrariness – in the case of grid search, we need to select the grid area first.
As a rule of thumb, you should begin with a rather coarse grid, and if you observe that the best-performing
values of h lie on the ‘boundary’ of the grid, it is a signal that perhaps you should extend the grid a bit
to see if even better values do not lie outside of it. The ‘best’ situation thus is when the best values are
somewhere ‘in the middle’ of the grid, because that gives you a signal that they are at least ‘locally optimal’.
Once you have identified an area with particularly good values, you can try performing another grid search
there, with a finer grid.
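A minimal sketch of grid search (our own illustration; G stands for the cross-validated objective from (8.2)):

```python
import itertools
import numpy as np

def grid_search(G, grids):
    """Evaluate G on the Cartesian product of per-entry candidate lists and return the best h."""
    best_h, best_val = None, np.inf
    for h in itertools.product(*grids):
        val = G(np.array(h))
        if val < best_val:
            best_h, best_val = np.array(h), val
    return best_h, best_val

# Usage, e.g. for an SVM with hyperparameters (C, gamma):
# grids = [np.logspace(-3, 3, 7), np.logspace(-3, 3, 7)]
# h_best, G_best = grid_search(G, grids)
```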
Another strategy for hyperparameter selection is to sample them randomly from H and to stick with the
best value found after a prescribed number of samples. An upside of this strategy is that it is we who design
the probability distribution to sample from. Any prior knowledge we might have about where it is ‘more
likely’ that we find the best possible values can be included in the sampling strategy.
Example 8.3. Low ‘effective dimensionality’ of the hyperparameter space
When tuning an SVM, it is likely that the best values of λ and γ are related to each other – when λ is
small, γ should be small as well so that the impact of one parameter does not dominate the training
process. For that reason, it makes sense for the sampling strategy to sample the different values for
(λ, γ) in a ‘correlated’ way.

A downside of sampling is that it is only replicable if you fix the random seed for the process, and
that, just as grid search, it requires some prior knowledge of ‘where to search’.

8.3 Bayesian optimization


A major issue with grid or random search is that once they evaluate the function G(h) at a given point,
this information is not used anymore. This is a big loss because in HO we can only perform a limited
number of function evaluations, so it makes sense for every piece of information collected to be used when
‘guessing’ which points to try next, and to avoid sampling points in areas in which our hope for finding a
good h is low.
In this section, we aim to achieve exactly this – once we have already tried values h1 , . . . , ht for which
we know the corresponding values of G(h), we will construct a ‘confidence region’ for the function in
other places. This ‘confidence region’ should follow common sense: we should have more confidence when
we try a point h close to the already evaluated ones, compared to trying a value far away from them. Fig. 8.2
outlines this approach.
Having constructed such a confidence region for the function, one can then select as the next
evaluation point h one that would decrease our uncertainty about G the most (exploration), or where the
chance of finding a value $G(h) < \min_{k \le K} G(h_k)$ is the largest (exploitation).
Bayesian optimization (BO) implements this idea with its two most important elements being:

• the assumed prior distribution used to construct our ‘guess’ about the shape of G
• the acquisition function used to quantify where the gain of evaluating the next sample is the highest.
In this lecture, we focus on the most popular case of using a Gaussian process as the prior. We now
explain the meaning of the term ‘Gaussian process’ in this context. At the start of the optimization, we

Figure 8.2: The ‘true’ evaluated function (blue) and its ‘idea’ consisting of the mean (dashed) and a 95% con-
fidence interval (turquoise shaded area) based on three evaluations of the function (red points). Constructed
using the ‘bayesian-optimization’ package in Python [34].

assume that the function value at each point h is a normal random variable with expectation
\[
\mathbb{E}(G(h)) = 0,
\]
and that the covariance of the function values between points h and h' is
\[
\mathbb{E}(G(h)\, G(h')) = K(h, h'),
\]
where K(·, ·) is some selected kernel function. The kernel function can be one of the kernel functions we
already learned in the course, such as the Gaussian kernel.
Remark 8.3.
Note that when h = h0 , the formula gives us the variance of G(h) because the expectation is assumed
to be 0.

Given these assumptions, and the pairs (h1 , y1 = G(h1 )), . . . , (ht , yt = G(ht )), the distribution for G(h)
is computed as the conditional distribution of a multivariate normal distribution of G(h) given the values
y = (y1 , . . . , yt ). This is exactly where the power and the beauty of using a normal distribution as the
prior distribution comes in – from basic statistics we obtain a closed-form formula for the parameters of this
conditional distribution, from which the per-point confidence regions depicted in fig. 8.2 are obtained.
To derive them, we formulate things formally. We assume that the distribution of g = (G(h1 ), . . . , G(ht ), G(h))
follows a multivariate normal distribution with
\[
\mathbb{E}(g) = 0
\]
and
\[
\mathbb{E}(g g^\top) = \begin{pmatrix} \Sigma & k \\ k^\top & K(h, h) \end{pmatrix},
\]
where
\[
\Sigma = \begin{pmatrix}
K(h_1, h_1) & K(h_1, h_2) & \cdots & K(h_1, h_t) \\
K(h_2, h_1) & K(h_2, h_2) & \cdots & K(h_2, h_t) \\
\vdots & \vdots & \ddots & \vdots \\
K(h_t, h_1) & K(h_t, h_2) & \cdots & K(h_t, h_t)
\end{pmatrix},
\qquad
k = \begin{pmatrix} K(h_1, h) \\ K(h_2, h) \\ \vdots \\ K(h_t, h) \end{pmatrix}.
\]

Then, given y = (y1 , . . . , yt ), from basic statistics and calculus [17] we know that G(h) follows the
conditional distribution
\[
G(h) \mid y \ \sim\ \mathcal{N}\big(k^\top \Sigma^{-1} y,\ K(h, h) - k^\top \Sigma^{-1} k\big),
\]
that is, a normal distribution with mean and standard deviation given by
\[
\mu(h) = k^\top \Sigma^{-1} y, \qquad \sigma(h) = \sqrt{K(h, h) - k^\top \Sigma^{-1} k}.
\]

Importantly, the fact that we use a kernel function to model the (co)variances will make sure that the term in
the square root will never become a negative number because both Σ and E(gg > ) are positive-semidefinite.
In fig. 8.2 you could see an example of the obtained expected values and 95% confidence intervals for the
function value at different h.
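A small numpy sketch (ours) of these posterior formulas, using the Gaussian kernel; the jitter added to the diagonal of Σ is a common numerical-stability trick and not part of the derivation above:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian kernel K(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def gp_posterior(h, H_obs, y_obs, gamma=1.0, jitter=1e-8):
    """Posterior mean mu(h) and standard deviation sigma(h) given observations (H_obs, y_obs)."""
    t = len(y_obs)
    Sigma = np.array([[rbf_kernel(H_obs[i], H_obs[j], gamma) for j in range(t)] for i in range(t)])
    k = np.array([rbf_kernel(H_obs[i], h, gamma) for i in range(t)])
    Sigma_inv = np.linalg.inv(Sigma + jitter * np.eye(t))
    mu = k @ Sigma_inv @ y_obs
    var = rbf_kernel(h, h, gamma) - k @ Sigma_inv @ k
    return mu, np.sqrt(max(var, 0.0))
```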
Now, given this way of establishing our probability distribution for the unknown function, what is the
best point to try next? It should be a point where the gain from running the next evaluation is the
largest, i.e., a point that balances carefully between exploration (investigating regions of biggest variance)
and exploitation (minimizing in areas where we expect the function to be the lowest). The field of BO has
constructed the concept of acquisition functions that try to capture exactly this, and which become the
objects to minimize when searching for the next iterate.
For the Gaussian process prior, the popular acquisition functions are generally a function of three
things: the mean µ(h), the standard deviation σ(h) of G(h), and the best value seen so far ybest . Three
examples of acquisition functions to minimize are
• (negative of) the probability of improving upon the so-far best value ybest :

a(h, ybest ) = P(G(h) < ybest ) (8.3)

• expected improvement

a(h, ybest ) = E(min{G(h) − ybest , 0}) (8.4)

• lower confidence bound

a(µ(h), σ(h), ybest ) = µ(h) − ασ(h) (8.5)

Exercise 8.1.
Derive the formulas for (8.3) and (8.4) as functions of µ(h), σ(h) and Φ – the cumulative distribution
function of the standard normal distribution.

Of course, the minimization of the acquisition function is itself an optimization problem to solve. On
the upside, this problem is typically low-dimensional. On the downside, typically this function will be non-
convex and hence, the most one can hope for is to find a local minimum by applying gradient descent
or a quasi-Newton method starting from a random point. Indeed, the most frequently used algorithm for
this problem is the BFGS algorithm (or L-BFGS, which is a ‘limited memory’ version of BFGS).
Once a minimizer h of the acquisition function is found, the function G is evaluated at this new point and
the ‘idea’ of the function G is refined, hopefully with G(h) < ybest . Figure 8.3 illustrates the subsequent
iterations of the Bayesian optimization algorithm.
An important aspect that has not been mentioned so far is that, to begin with, one needs to sample a few
points h1 , . . . , hninit for which the function will be evaluated and which will serve as the basis for the first
iteration of BO (without this, the BO algorithm cannot be initialized).
The complete description of the algorithm is given in alg. 28. BO is a global optimization method for which
no general convergence results can be provided. However, in practice it works really well on moderately-
dimensional problems for which performing a single function evaluation is expensive. For that reason, BO is
a part of many ML libraries such as AutoML [23]. For more information on BO, please see [19].


Figure 8.3: The evaluated function (above) and its ‘idea’ after 4, 5 and 9 evaluations, and the corresponding
acquisition function, together with the next-best point to evaluate marked (lower confidence bound, below).
Algorithm 28: Bayesian optimization algorithm.
Data: Function G(·), kernel function K(·, ·), number ninit of initial points to sample randomly,
acquisition function.
Sample ninit initial points h1 , . . . , hninit randomly and evaluate G(·) on them.
Set t = ninit .
while Stopping criterion not met do
Update the Gaussian process model based on (h1 , y1 ), . . . , (ht , yt ).
Minimize the acquisition function to find a new point ht+1 .
Evaluate the function yt+1 = G(ht+1 ).
Set t := t + 1.
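In practice, one rarely implements this loop by hand; for example, the ‘bayesian-optimization’ package [34] mentioned above wraps it behind a small interface. A rough usage sketch is given below – note that the exact interface may differ between package versions, that the package maximizes (so we pass the negative of G), and that G here is assumed to be the cross-validated objective from (8.2):

```python
from bayes_opt import BayesianOptimization

# The objective takes keyword arguments named after the hyperparameters; we maximize -G to minimize G.
def neg_G(C, gamma):
    return -G([C, gamma])  # G: hypothetical cross-validated objective from (8.2)

optimizer = BayesianOptimization(f=neg_G,
                                 pbounds={"C": (1e-3, 1e3), "gamma": (1e-3, 1e1)},
                                 random_state=1)
optimizer.maximize(init_points=5, n_iter=20)
print(optimizer.max)  # best hyperparameters found and the corresponding (negated) objective value
```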

8.4 Zero-order (derivative-free) approaches


As already said before, the hope that we can get hold of the derivative of G(h) and use it to perform some
kind of gradient descent method is rather thin. Nevertheless, we can still try to come up with some
approach that works like a local search optimizer (moving from one point to another, in a direction where we
hope the objective function will improve).
The simplest of such ideas is the coordinate search. In this mindset, being at a given point h, we check
which entry of h we can modify a bit up/down so that the function value decreases. If we specify the size of
the maximum step to be equal to κ, this means that we need to check all points h ± κei . This approach is
outlined in alg. 29.

Algorithm 29: Coordinate search.

Data: Function G(·), initial point h0 .
Set t := 1.
while Stopping criterion not met do
for Points hnext ∈ {ht ± κei , i = 1, . . . , nh } do
if G(hnext ) < G(ht ) then
Set ht+1 := hnext and exit the for loop.
if no hnext with G(hnext ) < G(ht ) was found then
Change κ.
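A compact Python sketch of this coordinate search (ours; the stopping rule and the step-size update are simple choices made here, not prescribed by the algorithm above):

```python
import numpy as np

def coordinate_search(G, h0, kappa=1.0, max_evals=100):
    h = np.array(h0, dtype=float)
    G_h = G(h)
    evals = 1
    while evals < max_evals and kappa > 1e-6:
        improved = False
        for i in range(len(h)):
            for step in (+kappa, -kappa):
                h_next = h.copy()
                h_next[i] += step
                G_next = G(h_next)
                evals += 1
                if G_next < G_h:
                    h, G_h, improved = h_next, G_next, True
                    break
            if improved:
                break
        if not improved:
            kappa /= 2.0  # no improving coordinate step found: shrink the step size
    return h, G_h
```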

Another idea is something slightly more wild, which goes in the same direction of thought as the trust-
region methods did. Namely, to create an approximate ‘image’ of our function around the current iterate
ht and then, to minimize this ‘approximated image’ over a small set around ht – the trust region. This is
a most classical idea of model-based derivative-free optimization (model-based because we build a model of
how our function might look).
Imagine that around the current iterate ht you sample or select deterministically p + 1 points h̄0 , . . . , h̄p ,
with h̄0 = ht , and for each of them you evaluate the corresponding function value G(h̄s ), s = 0, . . . , p. Then
you can use this ‘sample’ of points and the corresponding function values as a data set to which you try
to fit a polynomial function that describes it as closely as possible (according to the metric of your choice).
The most common choices here are first-order polynomial (thus an imitation of the first-order Taylor
approximation), and the second-order polynomial (imitation of the second-order Taylor approximation). For
higher-order polynomials, one would need to sample too many points to obtain something reasonable and
the benefit diminishes.
Speaking formally, we are trying to fit a model
\[
m(h) = \sum_{j} \alpha_j \phi_j(h),
\]
where the φj (h) are basis functions (monomials of degree at most 1 or 2), so that the relationship h → m(h)
mimics the relationship
\[
\bar{h}_s \to G(\bar{h}_s), \quad s = 0, \dots, p,
\]
which can be done using, for example, linear regression. Smart implementations of this algorithm select the
points h̄s in such a way that the computation of the optimal αj ’s is as efficient as possible. For more details,
see [12]. Once the model is ready, we perform a trust-region step, as outlined in alg. 30 below.

Algorithm 30: Trust-region search.

Data: Function G(·), initial point h0 .
Set t = 0.
while Stopping criterion not met do
Sample points h̄0 = ht , . . . , h̄p and compute the function values for them.
Fit the function model m(h) to the obtained dataset.
Solve the trust-region subproblem ht+1 = argminh: kh−ht k2 ≤δ m(h).

Exercise 8.2.
For the SVM problem in exercise 3.12, perform hyperparameter optimization using the methods
learned in this section, assuming that you are constructing regularized SVM steered by regularization
parameter C, and using Gaussian kernel with parameter γ, and using the accuracy on the validation
dataset as the quality measure.
For this, you need to write a function that performs k-fold cross validation using your earlier-
constructed SVM function, which will serve as your function G(h). For Bayesian optimization, you
can use the bayesian-optimization package in Python to which you only need to pass the domain for
(C, γ) and the function which is to be evaluated – the package will do the rest for you.
Plot the best-obtained model quality measure against the total number of function evaluations per-
formed.

8.5 Summary
HO is a standard and essential thing to do for any ML problem – without it, you are likely not to obtain a
useful ML tool. At the same time, tuning the hyperparameters is costly – for each attempt you make, you
need to train the entire ML model to evaluate the validation loss function.
Additionally, it is important that your HO tool is replicable – another person should be able to obtain the
same results on the same data (if the same random number generator seeds are used in the case of stochastic
approaches). For that reason, automated methods for HO are in demand. In this section, you have seen a
broad overview of such methods in decreasing order of their popularity.
It is possible to criticize each of the mentioned methods on the basis of the fact that they themselves
also require certain user-provided choices, such as the kernel used in BO, or the stepsize used in coordinate
search. Of course, a discussion like this can go on forever, as any approach will in the end require ‘some’
parameters. However, you need to keep the end goal in mind – it is obtaining the best possible generalization
performance of your ML tool. In this context, any approach is good that yields satisfactory results within an
acceptable time budget.

9 Linear unsupervised learning
In this section, we will discuss several unsupervised learning techniques which have not been discussed
before, or present a reformulation of certain techniques in terms of matrix factorizations. As we have
already discussed several matrix factorization techniques in section 2, many ideas will look very familiar.

9.1 Dimensionality reduction


We had already discussed dimensionality reduction from a linear algebra perspective in terms of Krylov
subspaces and the singular value decomposition. Let us now discuss dimensionality reduction from the
data perspective. Therefore, we will first introduce the concept of a linear autoencoder. We will later extend
this to nonlinear autoencoders in section 10 and, more specifically, in section 10.4.
The question of dimension reduction is very closely connected to the question whether the data belongs to a
subspace of the whole Rn , exactly or approximately. Therefore, we call c1 , . . . , cm ∈ Rn a spanning set
for a data set {x1 , . . . , xK }, x1 , . . . , xK ∈ Rn , if we can represent all the data points as linear combinations
of c1 , . . . , cm . This means that wki ∈ R exist such that
\[
x_k = \sum_{i=1}^{m} c_i w_{ki}
\]

for all k = 1, . . . , K. Obviously, this is the case if
\[
\operatorname{span}\{c_1, \dots, c_m\} = \operatorname{span}\{x_1, \dots, x_K\}.
\]
With
\[
C := \begin{pmatrix} c_1 & \cdots & c_m \end{pmatrix}
\quad \text{and} \quad
w_k := \begin{pmatrix} w_{k1} \\ \vdots \\ w_{km} \end{pmatrix},
\]
we obtain the shorter matrix notation
\[
C w_k = x_k, \quad k = 1, \dots, K. \qquad (9.1)
\]

Example 9.1. Spanning sets

Taken from the MNIST data set [15].

The MNIST data set contains a total of 70 000 images with 28 × 28 pixels. However, in
each image, many pixels are zero. Therefore, it is very likely that the data set can be well represented
in a space of lower dimension than R28×28 .
(Of course, as already pointed out in section 2, gray-scale images are not stored as matrices of scalars
but as matrices of integers.)

Let us, for now, assume that a spanning set C is given and first discuss only the task of finding the optimal
weight vectors wk . One strategy to compute the wk is to formulate the following least-squares problem:
\[
\min_{w_1, \dots, w_K} g(w_1, \dots, w_K) \quad \text{with} \quad g(w_1, \dots, w_K) = \frac{1}{K} \sum_{k=1}^{K} \|C w_k - x_k\|^2. \qquad (9.2)
\]

As we have discussed earlier, if the columns of C are linearly independent, there is a unique solution to
this problem. With respect to dimension reduction, the case of a linearly dependent spanning set is of
course less relevant, since the dimension could first be easily reduced by making the spanning set linearly
independent.

Linear autoencoder Let us now discuss eq. (9.2) in more detail. Since the individual terms of the sum
are independent of each other, they can be optimized independently, and we obtain the normal equations
as a condition for optimality:
\[
C^\top C w_k = C^\top x_k \qquad (9.3)
\]
for k = 1, . . . , K; see also eq. (2.28). The weight vector wk is also denoted as the encoding of the data
point xk , and Cwk is denoted as the corresponding decoding. Note that, if

Cwk = xk ,

that is, if the decoding coincides with the data point, we have found a minimizer for eq. (9.2).
In case the matrix C is semi-orthogonal, we have $C^\top C = I$, and eq. (9.3) simplifies to
\[
w_k = C^\top x_k. \qquad (9.4)
\]
Of course, we can always transform C into a semi-orthogonal matrix by orthonormalization; cf. section 2.4.
Substituting eq. (9.4) into eq. (9.1) yields the autoencoder formula
\[
C C^\top x_k = x_k, \qquad (9.5)
\]
which represents the process of encoding ($w_k = C^\top x_k$) and decoding ($C w_k = C C^\top x_k$) in one formula. If C
is orthogonal, we have that
\[
C C^\top = I
\]
as well, and hence, encoding and decoding are inverse operations.
In practice, as we already saw in section 2.7, the dimension of a data set can be significantly reduced
without introducing a large error using the SVD. Let us now discuss this from a different perspective. This
is the case if the matrix is rank-deficient or, as another example, if there is some small noise in the data. In
particular, instead of eq. (9.1), we are then interested in finding C ∈ Rn×m for a small m, such that

Cwk ≈ xk k = 1, . . . , K. (9.6)

This means that we are trying to find an approximate spanning set of size m, or in other words, we try to
find an m-dimensional subspace which approximates the data well; cf. fig. 9.1. Following similar arguments
as before, a semi-orthogonal C yields an approximate autoencoder formula
\[
C C^\top x_k \approx x_k; \qquad (9.7)
\]

cf. eq. (9.5). All the previous steps are clear based on what we have learned earlier. However, the discussion
so far was based on the assumption that an (initial) spanning set is given. In practice, this is typically not
the case. Therefore, the more interesting and more challenging task is to find a suitable spanning set C for
a given data set.
Let the number of spanning vectors m ≤ K be given. Our goal is then to find C ∈ Rn×m such that eq. (9.6)
is satisfied as well as possible. The most straightforward strategy is to simply modify eq. (9.2) to optimize
for both the encodings and the spanning vectors at the same time:
\[
\min_{w_1, \dots, w_K, C} g(w_1, \dots, w_K, C) \quad \text{with} \quad g(w_1, \dots, w_K, C) = \frac{1}{K} \sum_{k=1}^{K} \|C w_k - x_k\|^2. \qquad (9.8)
\]

Figure 9.1: Dimension reduction with an autoencoder with a varying number of vectors. Original image
taken from unsplash.com.

By plugging in eq. (9.4), we obtain an optimization problem which only depends on C:
\[
\min_{C} g(C) \quad \text{with} \quad g(C) = \frac{1}{K} \sum_{k=1}^{K} \left\| C C^\top x_k - x_k \right\|^2. \qquad (9.9)
\]
Here, in the derivation, we have again used the assumption that C is semi-orthogonal. However, in eq. (9.9),
we do not explicitly enforce orthonormality of the columns of C anymore. One can show that all minima
are actually semi-orthogonal matrices; see the following exercise.
Exercise 9.1. Orthogonality of autoencoders
Show that the minima of eq. (9.9) are all semi-orthogonal matrices.

An alternative way of seeing eq. (9.9) is as optimizing C with respect to the approximate autoencoder
formula eq. (9.7) using a least-squares formulation. The resulting matrix C is also denoted as a linear
autoencoder. One way of learning a linear autoencoder is by optimizing eq. (9.9) using a gradient descent
iteration.
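A minimal numpy sketch of such a gradient descent iteration (ours; the gradient formula is obtained by differentiating eq. (9.9) in matrix form, and the step size and iteration count are arbitrary choices):

```python
import numpy as np

def train_linear_autoencoder(X, m, lr=1e-3, n_iter=1000):
    """X is n x K (one data point per column); returns C of shape n x m approximately minimizing (9.9)."""
    n, K = X.shape
    C = np.random.randn(n, m) * 0.01
    for _ in range(n_iter):
        R = C @ C.T @ X - X                          # residuals of the autoencoder formula (9.7)
        grad = (2.0 / K) * (R @ X.T + X @ R.T) @ C   # gradient of g(C) with respect to C
        C -= lr * grad
    return C
```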

Principal component analysis The solution of eq. (9.9) is generally not unique. A simple way of seeing
this is to multiply the minimizer C of eq. (9.9) by an orthogonal matrix, that is,
\[
\tilde{C} := C Q
\]
with Q being orthogonal. Obviously, we get
\[
g(\tilde{C}) = \frac{1}{K} \sum_{k=1}^{K} \left\| \tilde{C} \tilde{C}^\top x_k - x_k \right\|^2
= \frac{1}{K} \sum_{k=1}^{K} \left\| C \underbrace{Q Q^\top}_{=I} C^\top x_k - x_k \right\|^2 = g(C).
\]

Figure 9.2: Principal components of a data set in two dimensions.

Therefore C̃ is also a minimizer of eq. (9.9).


Let us now discuss a particular choice, that is, the case where the spanning vectors point in orthogonal
directions of largest variance; cf. figs. 9.2 and 9.3 for a two-dimensional and a three-dimensional example.
These directions are also called the principal components. The resulting method, that is, reducing the
dimension using a linear autoencoder based on the principal components, is then also called principal
component analysis (PCA).
In order to compute the principal components, we first center the data set at the origin; the resulting
data is also denoted as mean-centered. Therefore, let
\[
\mu = \frac{1}{K} \sum_{k=1}^{K} x_k
\]
and
\[
\hat{x}_k = x_k - \mu.
\]
Then, the resulting data matrix reads
\[
X = \begin{pmatrix} \hat{x}_1 & \cdots & \hat{x}_K \end{pmatrix},
\]
and we consider its covariance matrix:
Definition 9.1. Covariance matrix
Let X ∈ Rn×K be a mean-centered data matrix. Then, the covariance matrix of X is given by
\[
C_{XX} = \frac{1}{K} X X^\top.
\]

This matrix is symmetric, and in order to find the directions of maximum variance, we can compute the
eigenvalue decomposition
\[
\frac{1}{K} X X^\top = C_{XX} = V D V^\top,
\]
with an orthonormal basis of eigenvectors V and D = diag(λi ). These eigenvectors are also denoted as the
principal components. As we will now recall, these are not really new to us. Let us also consider the
SVD of the matrix $X^\top$:
\[
X^\top = \hat{U} \Sigma \hat{V}^\top, \qquad (9.10)
\]

Figure 9.3: Three dimension data with high variance within a plane and low variance orthogonal to the
plane.

where Σ = diag(µi ). As we notice,
\[
\frac{1}{K} X X^\top
= \frac{1}{K} \left( \hat{U} \Sigma \hat{V}^\top \right)^\top \hat{U} \Sigma \hat{V}^\top
= \frac{1}{K} \hat{V} \Sigma^\top \underbrace{\hat{U}^\top \hat{U}}_{=I} \Sigma \hat{V}^\top
= \frac{1}{K} \hat{V} \Sigma^2 \hat{V}^\top.
\]
Hence, the eigenvalues satisfy
\[
\lambda_i = \frac{\mu_i^2}{K},
\]
where µi are the singular values of X > . Moreover, we notice that

V = V̂ ,

which means that the eigenvectors in V are the same as the right singular vectors of X > , which are the same
as the left singular vectors of X.
Now, from the Eckart–Young theorem, theorem 2.11, we know that choosing the singular values with
maximum magnitude as well as the corresponding singular vectors gives us the best approximation for the
matrix in terms of the Frobenius norm. In the PCA, we choose a number m < K of eigenvectors from V
as spanning vectors to obtain a lower dimensional approximation of the columns of the data matrix X. Let
this matrix be denoted as Vm .
An m-dimensional representation of the rows of X can simply be obtained with

Um = X > Vm ,

where Um = Ûm Σm , and Ûm and Σm are the matrices from the best rank m approximation from the
Eckart–Young theorem; cf. eq. (9.10).
In conclusion, the PCA is nothing else than computing the SVD for the mean-centered matrix corre-
sponding to a data set. Plugged into the autoencoder formula, we obtain

Vm Vm> X ≈ X.
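As a small illustration (ours, not part of the notes), PCA via the SVD of the mean-centered data matrix could look as follows:

```python
import numpy as np

def pca(X, m):
    """X is n x K (one data point per column); returns the m principal components and the encodings."""
    mu = X.mean(axis=1, keepdims=True)
    X_hat = X - mu                                        # mean-centered data matrix
    V, s, _ = np.linalg.svd(X_hat, full_matrices=False)   # columns of V: left singular vectors of X_hat
    V_m = V[:, :m]                                        # m directions of largest variance
    W = V_m.T @ X_hat                                     # encodings (coordinates in the principal subspace)
    return V_m, W, mu

# Reconstruction: X is approximated by V_m @ W + mu
```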

9.2 Recommender systems
Let us now consider a machine learning task, which we had already mentioned earlier in section 2.1. In
particular, let us consider the case where some of the values in the data set are missing. This could be the
case if measurement data is missing due to failure or if the data is generally still unknown. A very typical
example is the following one:
Example 9.2. Recommender systems

Consider a video-streaming platform which has n users and offers m different videos, movies or series,
for streaming; other examples would be any kinds of online stores. The information whether a user
has watched a movie, or not, could be encoded in terms of a matrix A ∈ {0, 1}n×m , where aij = 1
corresponds to the case when user i has seen the movie j and aij = 0 if not. For instance, A could
be of the form
\[
\begin{pmatrix}
1 & 0 & 0 & \cdots \\
0 & 0 & 0 & \cdots \\
1 & 1 & 1 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}.
\]
As we can see, this matrix is actually sparse, which means that one could store only the nonzero
entries. The same could be true when storing the data for user ratings of the movies:
\[
\begin{pmatrix}
4 & 0 & 0 & \cdots \\
0 & 0 & 0 & \cdots \\
5 & 4 & 2 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}.
\]
Here, 1–5 corresponds to a rating and 0 corresponds to the case where data is missing.
In practice, it would, of course, be desirable to fill those gaps using machine learning techniques, that
is, to predict the ratings of yet unseen movies by a given user and, based on this, to recommend
movies. This means that the resulting matrix will actually be dense.

Let us consider a data set with missing entries. In this case, we formalize the set of index pairs for which
data is available as follows:
\[
\Omega_k = \{(i, k) \mid \text{the } i\text{th entry of } x_k \text{ has data}\},
\]
which contains all index pairs of available data of the data point xk . In fact, this corresponds to a sparse
binary column vector, which is a column of the sparse binary matrix corresponding to the index pairs
\[
\Omega = \{(i, k) \mid (i, k) \in \Omega_k,\ k = 1, \dots, K\}
\]
for all data points; the matrix would actually correspond to the first matrix in example 9.2.

In order to fill the missing entries of the data vectors x1 , . . . , xK , we will use the concept introduced in
the previous section. In particular, we will try to generate an approximate spanning set for the available
data, that is,
\[
C w_k \approx x_k, \quad k = 1, \dots, K;
\]
cf. eq. (9.6). However, we can only enforce a good fit for those entries where data is available. For all other
entries, we are missing a label, and the decoding is a pure prediction. Therefore, we consider the least-squares
problem eq. (9.8), but where we only evaluate the error for those entries that contain data:
\[
\min_{w_1, \dots, w_K, C} g(w_1, \dots, w_K, C) \quad \text{with} \quad g(w_1, \dots, w_K, C) = \frac{1}{K} \sum_{k=1}^{K} \left\| \{C w_k - x_k\}|_{\Omega_k} \right\|^2, \qquad (9.11)
\]
where, for a vector v ∈ Rn ,
\[
\left( \{v\}|_{\Omega_k} \right)_i = \begin{cases} v_i & \text{if } (i, k) \in \Omega_k, \\ 0 & \text{otherwise.} \end{cases}
\]
This means that the {·}|Ωk operator acts like a mask on a vector. As a result, only those entries contribute
to the least-squares error where data is available.
Equation (9.11) can be optimized analogously to eq. (9.8).
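A minimal sketch of optimizing the masked loss (9.11) in matrix form by gradient descent (our own illustration; the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def masked_factorization(X, M, m, lr=1e-3, n_iter=2000):
    """X is n x K with missing entries, M is the binary mask (1 where data is available); returns C, W."""
    n, K = X.shape
    C = np.random.randn(n, m) * 0.01
    W = np.random.randn(m, K) * 0.01
    for _ in range(n_iter):
        R = M * (C @ W - X)          # masked residuals: only observed entries contribute
        grad_C = (2.0 / K) * R @ W.T
        grad_W = (2.0 / K) * C.T @ R
        C -= lr * grad_C
        W -= lr * grad_W
    return C, W

# Prediction of the missing entries: (C @ W) at the positions where M == 0.
```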

9.3 Matrix factorization techniques


Comparing the techniques described in this section so far, they are all derived from the simple concept of a
spanning set, described by
\[
C w_k = x_k, \quad k = 1, \dots, K;
\]
cf. eq. (9.1). This can equivalently be written as
\[
C W = X,
\]
which shows that all previous techniques are actually based on a matrix factorization of X with the weight
and data matrices
\[
W = \begin{pmatrix} w_1 & \cdots & w_K \end{pmatrix} \quad \text{and} \quad X = \begin{pmatrix} x_1 & \cdots & x_K \end{pmatrix}. \qquad (9.12)
\]
Let us now discuss how general the concept of matrix factorizations in unsupervised learning is.

Linear autoencoder and PCA Using W and X from eq. (9.12), eq. (9.8) can be rewritten as follows:
\[
\min_{W, C} g(W, C) \quad \text{with} \quad g(W, C) = \frac{1}{K} \|C W - X\|_F^2; \qquad (9.13)
\]
here, ‖·‖F is the Frobenius norm as defined in eq. (2.18). Note that, as we have mentioned earlier, this
minimization problem does not have a unique solution, such that regularization techniques are often used
to obtain matrices W and C with minimum norm. Therefore, for all the matrix factorization techniques
mentioned in this section, one typically optimizes the regularized loss function
\[
\min_{W, C} g(W, C) \quad \text{with} \quad g(W, C) = \frac{1}{K} \|C W - X\|_F^2 + \lambda \|C\|_F^2 + \lambda \|W\|_F^2, \qquad (9.14)
\]
for some λ > 0 instead. Of course, other norms than the Frobenius norm can also be used in the regularized
problem. For the sake of brevity, we will concentrate on discussing the differences in the non-regularized loss
function.

Exercise 9.2.
Show that g(w1 , . . . , wk , C) from eq. (9.8) and g(W, C) from eq. (9.13) are equivalent.

Hence, the goal of eq. (9.13) can be posed as computing a matrix factorization
\[
C W \approx X, \qquad (9.15)
\]
where the ≈ can be understood in terms of the Frobenius norm. In case we prescribe the condition that the
matrix C is semi-orthogonal, we end up with a linear autoencoder.
Of course, as discussed in section 9.1, the PCA is also derived in terms of different matrix factorizations,
namely, by an eigen decomposition or singular value decomposition.

Recommender systems Similar to eq. (9.15), recommender systems can be understood as computing
the matrix factorization
{CW ≈ X}|Ω .
Here, {V }|Ω is the matrix extension of the masking operator {v}|Ωk introduced for vectors in section 9.2.
The resulting loss function is then given by
\[
g(W, C) = \frac{1}{K} \left\| \{C W - X\}|_{\Omega} \right\|_F^2. \qquad (9.16)
\]
As before, it turns out that this is just a reformulation of the loss function in eq. (9.11).
Exercise 9.3.
Show that g(w1 , . . . , wk , C) from eq. (9.11) and g(W, C) from eq. (9.16) are equivalent.

K-means clustering Interestingly, many unsupervised learning techniques can be written as a matrix
factorization problem. One other example from this lecture is the K-means clustering algorithm, which has
already been discussed in section 6. In particular, in K-means clustering, our goal is that the data points
assigned to a cluster are similar to the center of the cluster, that is,

ck ≈ xi ∀xi ∈ Ck , k = 1, . . . , K. (9.17)

Here, ck is the centroid of the kth cluster, and Ck the index set of all data points assigned to the kth cluster.
Now, we form a matrix C out of the ck and obtain that

Cek ≈ xp ∀xi ∈ Ck , k = 1, . . . , K,

which is equivalent to eq. (9.17). As before, ek is the kth canonical Euclidean basis function.
In matrix notation, this reads
CW ≈ X, k = 1, . . . , K,
where, again, we have combined all the data points xk into a single matrix. Furthermore, each row of W
has to be some canonical basis vector
wk ∈ {ek }k=1,...,K .
Hence, the matrix factorization corresponding to the K-means clustering algorithms requires very specific
constraints for the matrix W .
This matrix factorization corresponds to the constrained optimization problem

    min_{C,W} ‖CW − X‖_F^2,   where wk ∈ {ek}k=1,...,K.   (9.18)
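As a small sanity check of eq. (9.18), the sketch below builds the assignment matrix W from given cluster labels (each column a canonical basis vector) and verifies that ‖CW − X‖_F^2 coincides with the usual K-means objective; the data and labels are arbitrary illustrative values.

import numpy as np

rng = np.random.default_rng(1)
n, Kdata, Kclusters = 2, 6, 2
X = rng.standard_normal((n, Kdata))            # data points as columns
labels = np.array([0, 0, 1, 1, 0, 1])          # assumed cluster assignments

# centroid matrix C and assignment matrix W (columns are canonical basis vectors)
C = np.column_stack([X[:, labels == k].mean(axis=1) for k in range(Kclusters)])
W = np.eye(Kclusters)[:, labels]               # column i equals e_{label(i)}

frobenius_objective = np.linalg.norm(C @ W - X, "fro") ** 2
kmeans_objective = sum(np.linalg.norm(X[:, i] - C[:, labels[i]]) ** 2
                       for i in range(Kdata))
print(np.isclose(frobenius_objective, kmeans_objective))   # True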

Algorithm 31: Block coordinate search with block size k.
Data: Function G(·), initial point h0.
Set t := 1.
while Stopping criterion not met do
    for Block i = 1, . . . , n/k do
        hblock := ht.
        for Points hnext ∈ {ht ± κ ej , j = (i − 1)k + 1, . . . , ik} do
            if G(hnext) < G(hblock) then
                Set hblock := hnext.
        if G(hblock) < G(ht) then
            Set ht+1 := hblock and exit the for loop.
    if no block yielded an hblock with G(hblock) < G(ht) then
        Change κ.

This can, for instance, be solved using a (block) coordinate search algorithm; cf. alg. 30. Even though a simple coordinate search algorithm could be used, the block variant in alg. 31 might be more efficient: it is just a slight variation of the coordinate search algorithm in which we look for the best optimization step within blocks of coordinates. This way, we do not necessarily choose the next best coordinate to reduce G, but we always look for the best choice within each block before stopping the iteration. Of course, we can also choose the order of the coordinates randomly; a minimal code sketch is given below.
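The following minimal Python sketch illustrates the block coordinate search idea of alg. 31; the test function G, the block size, and the rule for shrinking κ are placeholder choices and not part of the lecture notes.

import numpy as np

def block_coordinate_search(G, h0, block_size=2, kappa=0.5, max_iter=200):
    h = np.asarray(h0, dtype=float)
    n = h.size
    for _ in range(max_iter):
        improved = False
        for start in range(0, n, block_size):           # loop over blocks of coordinates
            h_block = h.copy()
            for j in range(start, min(start + block_size, n)):
                for sign in (+1.0, -1.0):                # candidate points h_t +/- kappa * e_j
                    h_next = h.copy()
                    h_next[j] += sign * kappa
                    if G(h_next) < G(h_block):
                        h_block = h_next
            if G(h_block) < G(h):                        # accept the best point within the block
                h = h_block
                improved = True
                break
        if not improved:
            kappa *= 0.5                                 # no block improved: shrink the step size
        if kappa < 1e-8:
            break
    return h

# usage example on a simple quadratic test function
G = lambda h: ((h - np.array([1.0, -2.0, 0.5, 3.0])) ** 2).sum()
print(block_coordinate_search(G, np.zeros(4)))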

Sparse coding   In K-means clustering, we have chosen each column of the matrix W to be a standard basis vector. This ensures that each data point is assigned to exactly one cluster. A natural extension of this is to allow each data point to be in more than one cluster. One may think about the following examples, where it may be reasonable that certain data points belong to more than one cluster:
Example 9.3. Handwritten numbers

Image created based on the MNIST data set [15].

If, instead of just a single digit, we consider numbers from 0 to 99 as data points, there are various
ways of clustering them into overlapping clusters. For instance, based on whether they contain a
certain digit or not.

Example 9.4. Clustering images of faces

Photos taken from unsplash.com.

Similar to the previous example, clustering of images of faces into overlapping clusters may yield very reasonable results, for instance, considering the predominant color in the image and the size of the face.

If we slightly modify eq. (9.18) and allow each data point to belong to at most S clusters, we obtain the sparse coding algorithm [35]:

    min_{C,W} ‖CW − X‖_F^2,   where ‖wk‖_0 ≤ S, k = 1, . . . , K.   (9.19)
The k·k0 norm used in this formulation denotes the number of nonzero entries in the vector. The name
sparse coding is linked to the matrix structure of W . That is, due to the limited number of nonzeros for
each data point, the matrix W is sparse, and its density is limited by the number of clusters a data point
can belong to, S.

Nonnegative matrix factorizations   Another very useful type of matrix factorization is given by the nonnegative matrix factorization problem

    min_{C,W} ‖CW − X‖_F^2,   where C, W ≥ 0,   (9.20)

where the inequalities are meant element-wise, that is, all matrix entries are supposed to be nonnegative. Let us briefly comment on a simple technique for solving this constrained optimization problem with so-called box constraints. To optimize eq. (9.20), or the corresponding regularized problem, one can perform a simple projected gradient descent method, where, after each step, all negative values are simply set to zero; cf. section 5.2. As a result, the constraints are satisfied after each gradient step. Alternatively, the problem can be solved using duality, as discussed in section 5.4.
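A minimal sketch of such a projected gradient approach for eq. (9.20) is given below: after each gradient step, negative entries of C and W are simply set to zero. The data, step size, and iteration count are illustrative placeholders.

import numpy as np

rng = np.random.default_rng(2)
X = np.abs(rng.standard_normal((6, 8)))       # nonnegative data matrix
r = 2                                          # factorization rank
C = np.abs(rng.standard_normal((6, r)))
W = np.abs(rng.standard_normal((r, 8)))
step = 5e-3

for _ in range(2000):
    R = C @ W - X
    grad_C = 2.0 * R @ W.T
    grad_W = 2.0 * C.T @ R
    C = np.maximum(C - step * grad_C, 0.0)     # gradient step followed by projection onto C >= 0
    W = np.maximum(W - step * grad_W, 0.0)     # gradient step followed by projection onto W >= 0

print(np.linalg.norm(C @ W - X, "fro"))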
Nonnegative matrix factorizations can be very helpful since, for certain examples, they are highly interpretable. Consider the following example:
Example 9.5. Text classification
Consider this example taken from [1]: The following table lists the word counts for Lion, Tiger,
Cheetah, Jaguar, Porsche, and Ferrari (columns) in 6 different documents, which are the rows of the
table:

Lion Tiger Cheetah Jaguar Porsche Ferrari
Document 1 2 2 1 2 0 0
Document 2 2 3 3 3 0 0
Document 3 1 1 1 1 0 0
Document 4 2 2 2 3 1 1
Document 5 0 0 0 1 1 1
Document 6 0 0 0 2 1 2

By computing a nonnegative rank-2 (approximate) factorization of the corresponding 6 × 6 matrix X, we obtain X ≈ CW with

    C = [ 2 0 ; 3 0 ; 1 0 ; 2 1 ; 0 1 ; 0 2 ]  ∈ R^{6×2}   and   W = [ 1 1 1 1 0 0 ; 0 0 0 1 1 1 ]  ∈ R^{2×6},

where the rows of the matrices are separated by semicolons.

The matrices are interpretable in the following sense:


• The columns of C and rows of W correspond to the topics cats and cars.

• In C, we can see how often the words from the different topics appear on average in each of the
documents.
• In W , we can see which words belong to which topic.
(Note that this is an idealized situation in which the entries of C and W are not only nonnegative
but also integer valued. However, also in a less idealized setting, the factorization remains better
interpretable compared with a general factorization.)
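The factorization from this example can be checked directly with a few lines of NumPy; the script below multiplies C and W and compares the product with the word-count table entry by entry.

import numpy as np

X = np.array([[2, 2, 1, 2, 0, 0],
              [2, 3, 3, 3, 0, 0],
              [1, 1, 1, 1, 0, 0],
              [2, 2, 2, 3, 1, 1],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 2, 1, 2]])
C = np.array([[2, 0], [3, 0], [1, 0], [2, 1], [0, 1], [0, 2]])
W = np.array([[1, 1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 1]])

print(C @ W)                      # approximately reproduces the word-count table X
print(np.abs(C @ W - X).max())    # largest entry-wise deviation (here: 1)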

This concept can very well be incorporated into recommender systems. Nonnegative matrix factorizations can then be employed to encode whether, for instance, a movie includes action, humor, etc. This can be used to group movies into certain categories or to identify groups of users which have the same taste with respect to certain elements of movies.

Figure 10.1: Logistic regression with a piece-wise linear model (panels: 1, 10, and 100 line segments).

10 Neural networks
10.1 Feedforward neural networks
Let us consider a general supervised learning task, that is, we want to fit a function F : Rn → Rm to map
given input to corresponding output data,
I = {x1 , . . . , xN } , O = {y1 , . . . , yN } ,
with xi ∈ Rn and yi ∈ Rm , for i = 1, . . . , N . More precisely, we want the function F to satisfy
F (xi ) = yi
as well as possible, with respect to some error norm. As we have already discussed in section 5.4.2, it may
be difficult to construct a linear map in case the data is actually following nonlinear relations. In the kernel
trick in SVMs, we lift the data into a higher dimensional space using a nonlinear map, with the hope that
the separation of the data into classes by a linear model is easier in that representation. There, we therefore
had to select a specific kernel function. Neural networks will provide a framework which automatically learns
nonlinear relations and can therefore be seen as more flexible.
In order to understand the concept, let us consider the two-dimensional data set depicted in fig. 10.1.
Obviously, we cannot fit a simple linear classification model to separate the two classes (red and blue);
instead, a logistic regression with cross-entropy loss function has been used.
Example 10.1. Logistic regression with cross-entropy loss
The cross-entropy is a measure from information theory for the similarity of two probability distributions. Hence, it can be used as a measure for the quality of a model through comparing the discrete probability distributions for the predictions of a trained model with the distribution of labels in the data itself. For a binary classification problem, the cross-entropy loss is given by

    min_{w,b} − Σ_{i=1}^{N} [ yi log(σ(xi^⊤ w)) + (1 − yi) log(1 − σ(xi^⊤ w)) ],   (10.1)

where w are the parameters of the linear model and

    σ(x) = 1 / (1 + e^{−x})   (10.2)

is the logistic function. It is also denoted as the sigmoid function; see fig. 10.3. Hence, writing the logistic regression model as a single function

    F(x) = σ(x^⊤ w),

Figure 10.2: Logistic regression with a piece-wise linear model (panels: 1, 10, and 100 line segments). Compared with fig. 10.1, noise has been applied to the data set. We can also notice the effects of overfitting when comparing with the decision boundary for the original data in fig. 10.1.

we obtain the loss

    min_{w,b} − Σ_{i=1}^{N} [ yi log(F(xi)) + (1 − yi) log(1 − F(xi)) ].

Note that the true labels yi are either zero or one, yi ∈ {0, 1}, such that for each term in the sum
only
log (F (xi )) or log (1 − F (xi ))
remains, measuring the deviation from the correct label; since we use eq. (10.2), the model output
will be in the interval [0, 1].
To train the linear model, we optimize the coefficients w in

x> w.

Therefore, we can use optimization techniques of section 3.
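As a small illustration of eqs. (10.1) and (10.2), the following sketch implements the sigmoid and evaluates the cross-entropy loss in its negative log-likelihood form on synthetic data; the data and labels are placeholder values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # logistic function, eq. (10.2)

def cross_entropy_loss(w, X, y):
    # negative log-likelihood, cf. eq. (10.1); the rows of X are the data points x_i
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 2))             # 100 two-dimensional data points
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # synthetic binary labels
w = np.zeros(2)
print(cross_entropy_loss(w, X, y))            # loss at the initial parameters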

In fig. 10.1, we have used a BFGS-type quasi-Newton method to optimize the parameters of the model;
see section 4.3.3. More precisely, we have used a limited memory BFGS (L-BFGS) method. In fact, since
the data set is just two-dimensional, the model to be trained has just two parameters w1 and w2 .
The decision boundary of the linear logistic regression model, that is, the hypersurface that partitions the space into the two classes, as shown in fig. 10.1 (left), is naturally also a linear function. Therefore, it is clear that a linear model is not sufficient. As a remedy, similar to the idea of SVMs, we could now introduce a nonlinear mapping that allows us to have a nonlinear decision boundary. A simple extension of a linear model would be to use piece-wise linear functions. Given a sufficiently high number of segments, we should - in principle - be able to describe any continuous nonlinear relation with arbitrary precision; as you can see in fig. 10.1 (middle and right), we can find a good nonlinear model in this way. Figure 10.2 shows the corresponding results for noisy data, where training for a perfect fit is even more difficult.
Let us formalize this type of models: Any piece-wise linear function p with at most M segments can be written as follows:

    p(x) = Σ_{i=1}^{M} ai α(bi x + ci),   (10.3)

where

    α(x) = max{0, x}   (10.4)

is a simple piece-wise linear function, also called the rectified linear unit (ReLU) function; cf. fig. 10.3.

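To illustrate eq. (10.3), the following sketch evaluates p(x) = Σ ai α(bi x + ci) with the ReLU activation eq. (10.4) for some arbitrary illustrative coefficients.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)                 # eq. (10.4)

def p(x, a, b, c):
    # piece-wise linear model p(x) = sum_i a_i * relu(b_i * x + c_i), cf. eq. (10.3)
    x = np.atleast_1d(x)
    return np.sum(a[:, None] * relu(b[:, None] * x[None, :] + c[:, None]), axis=0)

a = np.array([1.0, -2.0, 0.5])                # illustrative coefficients
b = np.array([1.0, 1.0, -1.0])
c = np.array([0.0, -1.0, 2.0])
print(p(np.linspace(-3.0, 3.0, 7), a, b, c))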
Figure 10.3: ReLU, sigmoid, and hyperbolic tangent activation functions.

Exercise 10.1. Piece-wise linear functions

1. Verify that every piece-wise linear function can be written as eq. (10.3).
2. Draw, for some exemplary piece-wise linear functions, the decomposition into functions of the
form ai α (bi x + ci ).

A nice property of the representation (10.3) is that the grid points for the piece-wise linear function are implicitly given through the parameters ai, bi, and ci, for i = 1, . . . , M. They do not have to be chosen a priori but can be automatically optimized when training the model.
With eq. (10.3), we have already introduced the most simple form of an artificial neural network (ANN) or, for simplicity, just neural network (NN). Note that there are also other ways of introducing neural networks, for instance, based on the biological motivation of modeling the brain as a network of neurons; however, we will focus on our algorithmic approach.
Let us now discuss how to extend eq. (10.3) to obtain the definition of general feedforward neural networks. Therefore, let us first note that eq. (10.3) can simply be extended to higher dimensions as follows:

    P(x) = A α(Bx + c),   (10.5)

where x, c, and P (x) are now vectors and A and B are matrices, respectively. Due to the matrix notation,
we also got rid of the sum in eq. (10.3). Moreover, the function α is now applied component-wise to the
vector (Bx + c). This also gives us some freedom to vary a dimension inside the model as long as all the
other dimensions stay compatible. In particular, with

x ∈ Rn , c ∈ Rk , P (x) ∈ Rm ,
A ∈ Rm×k , B ∈ Rk×n ,

the dimensions are compatible, and k can be chosen freely independent of the input and output dimensions
n and m. Furthermore, we see that α is essential for the nonlinearity of the model. If α was a linear

Figure 10.4: Logistic regression with model eq. (10.5) for varying k (columns: k = 1, 10, 100) and activation functions α (rows: Id, ReLU, sigmoid, tanh) on the noisy data set.

function, then P(x) would just be the composition of linear functions, resulting in a linear function as well; cf. fig. 10.4 (first row), where we see that the model does not change when increasing k if we use a linear function (α = Id). In the context of NNs, α is also called an activation function.
So far, we chose α such that the resulting function is piece-wise linear; however, we could also choose other nonlinear functions for α. For instance, the sigmoid function eq. (10.2) or the hyperbolic tangent

    tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})

are other common choices for activation functions; cf. fig. 10.3. See also fig. 10.4 for the decision boundaries of models with different activation functions. The most important property of all those functions is that they are nonlinear. Moreover, as we will see next, the activation functions will appear many times in typical neural networks. Therefore, in terms of computational work, it is beneficial to have a function which is relatively simple and efficient to compute; moreover, the computation of the gradients for optimizing the neural network should be efficient as well.
It has been shown that neural networks of the form eq. (10.5) are universal function approximators. As one example, we now give a concrete version of the universal approximation theorem for NNs with sigmoid activation, one hidden layer, and arbitrary width; we will shortly explain the notion of hidden layers and the width of a layer in detail.
Theorem 10.1. Universal approximation theorem (sigmoid)
Let In denote the n-dimensional unit cube [0, 1]^n, and let α be the sigmoid activation function. Then, finite sums of the form

    P(x) = Σ_{i=1}^{M} ai α(bi^⊤ x + ci)   (10.6)

are dense in C(In). In other words, given any f ∈ C(In) and ε > 0, there is a sum P(x) of the above form for which

    |P(x) − f(x)| < ε   ∀x ∈ In.

Note that C(In) is the space of continuous functions on In.

This theorem has been proven by Cybenko in [14] for a more general class of activation functions of which the sigmoid function is an example. Further generalizations can be found, for example, in [26, 25]. In particular, it has been shown that the approximation property holds for a large class of activation functions and is mostly due to the architecture of the network itself. Note that eq. (10.6), of course, easily extends to the vector-valued case in eq. (10.5).
In order to arrive at the general definition of a DNN, we will now define the composition of functions of the form eq. (10.5). In particular, let x ∈ Rn be the input of the NN; then an NN with L hidden layers is given by

    h1 = α(W1 x + b1),
    hi+1 = α(Wi+1 hi + bi+1),   for i = 1, . . . , L − 1,   (10.7)
    y = WL+1 hL.

The final vector y ∈ Rm is then the output of the neural network, and the other vectors hi ∈ Rni, i = 1, . . . , L, are the states of the neurons in the hidden layers of the network. The matrices Wi ∈ Rni×ni−1 contain the weights of the neural network, and the vectors bi ∈ Rni are often denoted as the biases of the neural network; see also fig. 10.5 for a schematic visualization of a DNN with two hidden layers. The final layer, the output layer, is linear, that is, it only corresponds to the multiplication with the matrix WL+1 ∈ Rm×nL; however, one may also add a bias vector to the last layer.
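A minimal NumPy sketch of the forward pass through the network eq. (10.7) with ReLU activation is given below; the layer widths and the random parameters are placeholder choices.

import numpy as np

rng = np.random.default_rng(4)
relu = lambda z: np.maximum(0.0, z)

# network with input dimension 4, two hidden layers of width 5, and output dimension 1
widths = [4, 5, 5, 1]
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]
bs = [rng.standard_normal(widths[i + 1]) for i in range(len(widths) - 2)]   # no bias in the output layer

def forward(x, Ws, bs, alpha=relu):
    h = x
    for W, b in zip(Ws[:-1], bs):      # hidden layers: h_{i+1} = alpha(W_{i+1} h_i + b_{i+1})
        h = alpha(W @ h + b)
    return Ws[-1] @ h                  # linear output layer, cf. eq. (10.7)

x = rng.standard_normal(4)
print(forward(x, Ws, bs))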
A network is called dense if the weight matrices are dense. There are also network architectures which use a lower number of weights, such that the matrices Wi are sparse matrices. Moreover, even though we

Figure 10.5: Dense feedforward neural network with four inputs (x1, . . . , x4), one output (y), and two hidden layers with five neurons each.
Figure 10.6: Logistic regression with a neural network model with 1 layer (left), 2 layers (middle), and 3 layers (right) of 5 neurons each on the noisy data set.

kept the activation function α fixed, we can also vary it from layer to layer, that is, taking α1, . . . , αL instead of α. For the sake of simplicity, we restrict the discussion largely to the case that α is the same for all layers.
The number of layers L of the network is also denoted as its depth, and the number of neurons within a layer is denoted as the width of the layer. The universal approximation theorem, theorem 10.1, describes the approximation properties for the case of depth 1 and arbitrary width. Universal approximation properties can also be shown for fixed width but arbitrary depth; see, for example, [30]. Training deep neural networks is often also denoted as deep learning. There is no uniform definition of the term deep, but deep learning usually starts at around 3 or 4 hidden layers. However, modern networks can easily have tens or more than one hundred layers. In fig. 10.6, we can see that increasing the number of layers of the network has a similar effect as increasing the number of neurons, which corresponds to increasing the number of line segments in fig. 10.2.

10.2 Optimization of neural networks


Let NN^α_{W,b} be some neural network of the form eq. (10.7), which uses the activation function α in each layer and is parameterized by the weights W = {Wi}_{i=1,...,L+1} and bias vectors b = {bi}_{i=1,...,L}. Then, with

    Fi(x) := α(Wi x + bi),

we have

    NN^α_{W,b}(x) = WL+1 FL ◦ . . . ◦ F1(x).

Figure 10.7: Visualization of the loss landscapes of two different neural networks. The loss landscape may be
highly non-convex. Images taken from https://github.com/tomgoldstein/loss-landscape. See also [29]
for more details.

Then, let

    I = {x1, . . . , xN},   O = {y1, . . . , yN}

be given input and output data. Then, training a neural network generally corresponds to solving the following general type of optimization problem:

    min_{W,b} Σ_{i=1}^{N} L(NN^α_{W,b}(xi), yi),   (10.8)

where L is a loss function penalizing deviations of the model output NN^α_{W,b}(xi) from the correct labels yi. In that sense, it is usually a type of distance function, such as the cross entropy in eq. (10.1) or the mean squared error (MSE)

    L_MSE(NN^α_{W,b}(xi), yi) = (1/N) ‖NN^α_{W,b}(xi) − yi‖_2^2,

which we had already seen earlier. Another typical variant is the mean absolute error (MAE)

    L_MAE(NN^α_{W,b}(xi), yi) = (1/N) ‖NN^α_{W,b}(xi) − yi‖_1.
As discussed in section 3, convex optimization is generally much easier compared to non-convex optimization. Therefore, of course, our hope is that the loss function is convex with respect to the parameters W and b of the neural network. Unfortunately, as can be seen for two examples in fig. 10.7, this is generally not the case. In particular, depending on the activation functions, the depth of the network, and the widths of the hidden layers, optimizing a neural network may correspond to a highly non-convex optimization problem.
The most common techniques for training a neural network are variants of the stochastic gradient descent
(SGD) method (sections 3.5 and 4.2) and quasi-Newton methods (section 4.3.3). In particular, the Adam
gradient descent and L-BFGS quasi-Newton methods are very popular for training neural networks. In the
next paragraphs, we will discuss some important aspects for the optimization of neural networks.

Parameter initialization The convergence of the optimization schemes depends significantly on the initial
guess for the weight matrices W and bias vectors b. In particular, due to the complex landscapes of typical
loss functions of neural networks (fig. 10.7), a bad initial choice may result in slow convergence or even

divergence of the optimization scheme. On the other hand, due to the high complexity of (deep) neural
networks, it is generally challenging to come up with good initialization strategies. In particular, a good
initialization strategy may strongly depend on the network architecture used. For a more detailed discussion,
see, for instance, [21, Section 8.4].
Let us here discuss a few commonly used heuristic strategies. A first approach is to sample the weights in the ith layer, that is, the coefficients of Wi, from the uniform distribution

    U(−1/√ni, 1/√ni),

where ni is the number of inputs of the layer. In [20], Glorot and Bengio suggest a slightly different approach, which also takes into account the number of outputs of the layer, ni+1. They call it a normalized initialization:

    U(−√(6/(ni + ni+1)), √(6/(ni + ni+1))).

Even though this formula has been derived based on very strong assumptions, which are generally not satisfied for neural networks, the popularity of the approach shows that the strategy also works well in practice. In particular, the formula is based on the assumption that the network corresponds to a composition of linear maps, that is, just the multiplication of the weight matrices.
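The two sampling strategies just described can be sketched as follows; the layer sizes are illustrative placeholders.

import numpy as np

rng = np.random.default_rng(5)

def uniform_init(n_in, n_out):
    # W_i ~ U(-1/sqrt(n_in), 1/sqrt(n_in))
    limit = 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def normalized_init(n_in, n_out):
    # normalized ("Glorot") initialization: U(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out)))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

W = normalized_init(n_in=64, n_out=32)
print(W.shape, W.min(), W.max())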
In [42], Saxe et al. suggest to initialize with random orthogonal matrices. Then, in order to account
for the nonlinearity in each layer, the weights are scaled with a factor g (also denoted as gain factor). In
fact, if the gain factor is chosen appropriately, this allows for training very deep networks, even if the weight
matrices are not orthogonal. If it is not chosen appropriately, the output and the gradients of very deep
networks can deteriorate to be either very large or almost zero.
Another strategy is the sparse initialization described in [31] by Martens, where all neurons are initialized with a fixed number of nonzero weights; the other weights of the neurons are initialized as zero. In practice, this means that, at initialization, the matrices Wi have a fixed number of nonzero entries per row.
The strategy for initialization of the bias vectors has to be compatible with the strategy for the weights.
As it turns out, a simple initialization with zero is compatible with most weight initialization schemes. One
counter example is the case of output data which is not mean centered. Then, one may want to add a
bias vector to the output layer to account for this; in eq. (10.7), we had defined the output layer without a
bias vector. In particular, one may initialize the bias in the output, depending on the initialization of the
remaining parameters of the network, to fit the marginal statistics of the output on the training set.

Mini-batch optimization   As can be seen in eq. (10.8), the loss function for neural networks is typically separable, that is, we have one additive term for each data point. In optimizing neural networks, mini-batch optimization is typically used. Mini-batch optimization is a variant of the stochastic gradient descent method, where in each gradient step one random term from the sum

    min_{W,b} Σ_{i=1}^{N} L(NN^α_{W,b}(xi), yi)

is selected to compute the gradient, that is,

    ∇_{W,b} L(NN^α_{W,b}(xi), yi);   (10.9)

cf. section 3.5. Only this gradient is then used to perform the update in the gradient descent method. In the next step, another random term from the remaining ones is chosen, until the gradient for each term in the sum (that is, for each data point) has been used once; the number of all iterations required to cover the whole sum eq. (10.8) again is also denoted as one epoch. The main argument for the feasibility of this approach is that, for a large data set, the expectation for a gradient update remains the same. This approach has two major advantages:

• The computation of the gradients and, as a result, each gradient step becomes cheaper.
• The convergence becomes more robust against getting stuck in local minima. This is because a local minimum of eq. (10.8) might not be a local minimum of a single term,

    L(NN^α_{W,b}(xi), yi),

anymore. Therefore, it is possible to escape local minima, which might not be the case for the classical gradient descent method.
On the other hand:
• One epoch of SGD is significantly more expensive than one epoch of the classical gradient descent
method, where an epoch is the same as a single iteration.
• Since, in each step, not the whole loss function is used, convergence can also be slower compared to
the classical gradient descent method.
The classical and the stochastic gradient descent methods are extreme cases of mini-batch gradient descent methods. In this approach, the index set {1, . . . , N} of the data points is first partitioned into disjoint subsets of cardinality k; the sets are denoted as batches and the cardinality k is denoted as the batch size. Let us denote those subsets as B1, . . . , BK. Then, in the jth step of mini-batch SGD, we use

    Σ_{i∈Bj} ∇_{W,b} L(NN^α_{W,b}(xi), yi)

for the gradient step. With

    ∪_{j=1,...,K} Bj = {1, . . . , N}   and   Bi ∩ Bj = ∅ for i ≠ j,

K iterations of mini-batch SGD yield one epoch. Mini-batch SGD with full batches simply corresponds to the gradient descent method, and mini-batch SGD with batch size one corresponds to SGD.
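The epoch structure of mini-batch SGD can be sketched as below; as a placeholder, the model is a simple linear least-squares fit rather than a neural network.

import numpy as np

rng = np.random.default_rng(6)
N, n = 200, 3
X = rng.standard_normal((N, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

w = np.zeros(n)
batch_size, step, epochs = 20, 1e-2, 50

for epoch in range(epochs):
    indices = rng.permutation(N)                     # shuffle the data set
    for start in range(0, N, batch_size):            # one pass over all batches = one epoch
        batch = indices[start:start + batch_size]
        residual = X[batch] @ w - y[batch]
        grad = 2.0 * X[batch].T @ residual / batch.size
        w -= step * grad                             # gradient step on the current batch

print(w)   # close to (1, -2, 0.5)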

Forward and backward propagation   In order to optimize neural networks via gradient descent or quasi-Newton methods, several computational steps are necessary. First, in each iteration, it is necessary to evaluate the neural network and the loss function for given input data. This can easily be done by going through the network based on the scheme in eq. (10.7); going through the network from input to output is also denoted as the forward propagation.
Then, in order to compute the update, we have to compute gradients of the loss function eqs. (10.8) and (10.9) with respect to the network parameters W and b. On the one hand, this could potentially be difficult because neural networks are highly nonlinear and can have a large number of parameters, which appear in different layers of the network. However, as we have seen before, neural networks are also built of very simple elementary building blocks; cf. eq. (10.7). In particular, they are just compositions of affine linear and nonlinear activation functions, and for a single network layer, it is quite simple to compute the derivatives of the output of the layer with respect to the parameters of the layer. The propagation through multiple layers can then be performed using simple derivation rules.
Exercise 10.2.

1. For a neural network with a single layer with an activation function of your choice, derive
formulae for the derivatives of the output
y = α (W x̃ + b)
with respect to the network parameters W and b.

2. Derive formulae for computing the derivatives of an MSE loss.

Exercise 10.3.
Implement the neural network from exercise 10.2, its derivatives, and a gradient descent algorithm
to optimize the network parameters for a given data set.

For deep and wide neural networks, the computation of the gradients can result in high computational costs if performed in a naive way. The backpropagation algorithm is a special case of automatic differentiation, which also performs the computation of the gradients based on simple derivation rules. However, the computations are arranged in such a way that they can be performed very efficiently. In section 10.3, we will discuss the three necessary steps in the optimization of neural networks: the construction of the computational graph of the neural network, the forward propagation, and the backward propagation.

Data normalization and batch-normalization Figure 10.8 nicely shows the effect of data normalization
on a simple two-dimensional linear regression with MSE loss. In particular, starting with some unnormalized
data set, we obtain a loss function which is a very stretched convex function; see also example 3.2. As we
can see, the resulting gradient iteration converges very slowly. In particular, after 100 iterations, we still
have not converged to a good data fit.
A simple standard normalization of the data improves the optimization significantly. In particular, for each feature, we first compute the mean

    μj = (1/N) Σ_{i=1}^{N} (xi)j

and the standard deviation

    σj = √( (1/N) Σ_{i=1}^{N} ((xi)j − μj)^2 )

for j = 1, . . . , n. Then, we transform each data point xi to have zero mean and standard deviation one by

    (x̂i)j = ((xi)j − μj) / σj   ∀j = 1, . . . , n.

This requires that σj ≠ 0, which we can assume since, otherwise, the feature is constant for all data points, such that we could simply remove it.
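The standard normalization described above takes only a few lines in NumPy; here, the rows of the data matrix play the role of the data points xi, and the values are illustrative.

import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(1000, 2))   # rows are data points x_i

mu = X.mean(axis=0)                 # feature-wise mean mu_j
sigma = X.std(axis=0)               # feature-wise standard deviation sigma_j (assumed nonzero)
X_hat = (X - mu) / sigma            # standardized data

print(X_hat.mean(axis=0), X_hat.std(axis=0))   # approximately 0 and 1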
The positive effect of this simple transformation, which is completely invertible, can be seen in fig. 10.8.
In particular, due to the much more favorable shape of the loss function of the least-squares problem, the
gradient descent iteration yields a much better fit within just 20 iterations.
Since the neural network is a composition of models, it could be beneficial for the optimization of each
layer if the input of the layer would be standard normalized as well. We can only perform the normalization
of the input of the first layer before the training. Since the inputs of the deeper layers depend on the
outputs of the corresponding previous layer, it cannot be performed before the optimization. In particular,
the outputs of the previous layer may change significantly during the optimization.
Therefore, in the batch-normalization approach, the normalization is performed on-the-fly for each batch in the mini-batch optimization. In particular, we append the normalization step to each layer in eq. (10.7). Therefore, let x̃1, . . . , x̃k be the input vectors of the lth layer for some batch of the optimization. Applying the lth layer yields

    ỹi = α(Wl x̃i + bl),

Figure 10.8: Comparing 100 gradient descent steps on a data set without normalization and 20 gradient descent steps on the same data set with normalization (panels: data set, loss function, gradient steps, and model fit for both cases). Visualization using the code from https://github.com/jermwatt/machine_learning_refined. See also [48].

for i = 1, . . . , k. Without batch normalization, these vectors would be directly used as inputs for the next layer. Instead, we normalize each ỹi as before:

    (ŷi)j = ((ỹi)j − μj) / σj,

where

    μj = (1/k) Σ_{i=1}^{k} (ỹi)j   and   σj = √( (1/k) Σ_{i=1}^{k} ((ỹi)j − μj)^2 ).

Then, we use these normalized vectors as inputs for the next layer.
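A minimal sketch of this normalization step, applied to one batch of layer outputs stored as rows, is shown below; the small constant added to the denominator is a numerical-safety detail not discussed above.

import numpy as np

rng = np.random.default_rng(8)
Y_tilde = rng.standard_normal((32, 16))      # batch of k = 32 layer outputs of width 16

def batch_normalize(Y, eps=1e-8):
    mu = Y.mean(axis=0)                      # per-component mean over the batch
    sigma = Y.std(axis=0)                    # per-component standard deviation over the batch
    return (Y - mu) / (sigma + eps)          # normalized outputs, used as input of the next layer

Y_hat = batch_normalize(Y_tilde)
print(Y_hat.mean(axis=0).round(6), Y_hat.std(axis=0).round(6))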

Hyperparameter optimization   One of the strengths of neural networks is their flexibility. In fact, they can be applied to different types of data sets of varying complexity, to classification and regression problems, and they can even be employed in unsupervised learning; see, for instance, section 10.4. At the same time, neural networks have a large number of hyperparameters, including:
• Network architecture: depth of the network, width of the layers, choice of activation function(s); see
also section 10.4 for some famous examples of network architectures
• Optimization method, such as variants of gradient descent or quasi-Newton methods; these methods
themselves introduce additional hyperparameters
• Weight initialization (as discussed before)

• Regularization techniques: norm regularization, dropout ([45]), early stopping (trying to stop the
optimization before overfitting), etc.
• ...
The most common way of hyperparameter optimization is grid-search with k-fold cross-validation. Al-
ternative approaches have already been discussed in section 8.

10.3 Computational graph, forward propagation, and backward propagation


In section 10.2, we had already mentioned that the optimization of neural networks consists of three essential
steps, which will be discussed in detail shortly:

• Setting up the computational graph


• Forward propagation
• Backward propagation

These steps will help us to efficiently evaluate a function as well as compute the derivatives of the function with respect to the inputs. Note that the forward and backward propagation as described here is also denoted as the reverse mode of automatic differentiation, which is particularly efficient for network type functions; for more details on this, see, for instance, the discussion in [48, Appendix B].
We will discuss these concepts for simple examples of nonlinear functions. However, the concepts are very general and can be applied to complex functions, for instance, neural networks.

Computational graph Any function which is given by an algebraic expression can be expressed as the
composition of elementary operations. Let us start with a very simple example: The function f with

f (x) = ax + b

can be decomposed into two functions

g (x) = ax and h (x) = x + b,

which only perform a single elementary operation, such that

f (x) = h (g (x)) .

These operations can be organized in a computational graph, which tracks the order in which these operations are performed. By doing so, it also helps to reveal data dependencies and parallelism of computations. Moreover, it can be used to optimize computations. Let us discuss this based on some simple examples:
Example 10.2. Computational graph with a single input
Let us consider the function

    f(x) = sin(x)^2 cos(x) + e^{cos(x)}.

We can write it as a computational graph in which the input x feeds into the nodes sin(·), cos(·), and cos(·); the sin(·) node feeds into a squaring node (·)^2; the (·)^2 node and the first cos(·) node feed into a multiplication node ×; the second cos(·) node feeds into an exponential node e^{(·)}; and the × and e^{(·)} nodes feed into a final addition node +.

Now, we can go through the graph from left to right, layer by layer:
1.  a = sin(x),  b = cos(x),  c = cos(x)

2.  d = a^2,  e = e^c

3.  f = d · b

4.  g = f + e

First of all, we notice that all operations in each step can be carried out in parallel. Moreover, we can notice that we have some redundancy in our computations. In particular, we could optimize the graph by a slight rearrangement:

[Optimized computational graph: x feeds into sin(·) and a single cos(·) node; sin(·) feeds into (·)^2; (·)^2 and cos(·) feed into ×; the same cos(·) node also feeds into e^{(·)}; × and e^{(·)} feed into +.]

By doing so, we save one application of cos.

Once the computational graph for a function has been set up, it can be reused and combined with other graphs to build more complex compositions of functions:
Example 10.3. Composing computational graphs and multiple inputs
Let us consider the function
g (x1 , x2 ) = f (x1 ) − f (x2 ) ,
where f is defined as in example 10.2. We obtain the following computational graph:

[Computational graph: x1 feeds into a node f(·), x2 feeds into a second node f(·), and both f(·) nodes feed into a subtraction node −.]

Here, each f (·) node of the graph corresponds to the computational graph in example 10.2. We
obtain:

1.
a = f (x1 )
b = f (x2 )

2.
c=a−b

Exercise 10.4.
Create the computational graph for a single-layer neural network of the form eq. (10.5) with two
inputs, three neurons, and a single output and a generic activation function α.

Note that, for a function with a fixed structure, such as a neural network with a fixed number of layers and neurons within the layers, the computational graph has to be computed only once. Even if the parameters of the network are optimized, the computational graph will remain the same.

Forward propagation   After we have created the computational graph of our function, let us discuss the next step, which is the forward propagation. In this step, we sweep through the computational graph from left to right and compute the values in the nodes as well as the partial derivatives of each node of the computational graph with respect to its input; in terms of the graph, we compute the partial derivatives of each child node with respect to its parent nodes. By doing so, instead of directly computing the derivatives with respect to the inputs, we can omit computing and storing a lot of unnecessary partial derivatives. For instance, consider example 10.3, where

    ∂a/∂x2 = 0   and   ∂b/∂x1 = 0.
Moreover, in neural networks, we have to compute the derivatives of the loss with respect to all network
parameters. As you will notice in exercise 10.8, once we have computed all the partial derivatives with
respect to the parent nodes, we have all information necessary to assemble all required derivatives.
Let us, again, discuss the forward propagation for a simple example.
Example 10.4. Forward propagation
Let us consider the function

    f(x1, x2, x3, x4) = (x1^2 + x2^2 + x3^2 + x4^2) / 4
with 4 inputs. We obtain the following computational graph:

[Computational graph: each input xi feeds into a squaring node (·)^2; the squares of x1 and x2 feed into a first addition node +, and the squares of x3 and x4 feed into a second addition node +; both sums feed into a further addition node +, which feeds into a final node ·/4.]

Now, let us propagate through the graph from left to right and compute both the evaluations and
the partial derivatives with respect to the corresponding inputs:
1.  a = x1^2,  ∂a/∂x1 = 2x1
    b = x2^2,  ∂b/∂x2 = 2x2
    c = x3^2,  ∂c/∂x3 = 2x3
    d = x4^2,  ∂d/∂x4 = 2x4

2.  e = a + b,  ∂e/∂a = 1,  ∂e/∂b = 1
    f = c + d,  ∂f/∂c = 1,  ∂f/∂d = 1

3.  g = e + f,  ∂g/∂e = 1,  ∂g/∂f = 1

4.  h = g/4,  ∂h/∂g = 1/4
In order to perform the backward propagation next, we store both the function evaluations and the
partial derivatives in each node of the computational graph; cf. example 10.5.

Exercise 10.5.
Take the graph from exercise 10.4 and perform the forward propagation.

Backward propagation   The final step for computing the derivatives of a function with respect to the inputs is to sweep through the computational graph backwards, that is, from right to left. This is called the backward propagation. In this step, using elementary rules for computing the derivatives, such as the chain rule, the product rule, and the linearity of the derivative, we can simply compute the derivatives of the outputs with respect to the inputs of the function. As mentioned before, this step relies heavily on the computations performed before in the forward propagation.
We continue example 10.4:
Example 10.5. Backward propagation
Let us consider the function

    f(x1, x2, x3, x4) = (x1^2 + x2^2 + x3^2 + x4^2) / 4

from example 10.4, where we can also find its computational graph. Now, given the output h, we propagate backwards through the computational graph to compute the partial derivatives

    ∂h/∂x1, ∂h/∂x2, ∂h/∂x3, ∂h/∂x4.

We get that

1.  ∂h/∂g = 1/4

2.  ∂h/∂e = (∂h/∂g) (∂g/∂e) = (1/4) · 1 = 1/4
    ∂h/∂f = (∂h/∂g) (∂g/∂f) = (1/4) · 1 = 1/4

3.  ∂h/∂a = (∂h/∂e) (∂e/∂a) = (1/4) · 1 = 1/4
    ∂h/∂b = (∂h/∂e) (∂e/∂b) = (1/4) · 1 = 1/4
    ∂h/∂c = (∂h/∂f) (∂f/∂c) = (1/4) · 1 = 1/4
    ∂h/∂d = (∂h/∂f) (∂f/∂d) = (1/4) · 1 = 1/4

4.  ∂h/∂x1 = (∂h/∂a) (∂a/∂x1) = (1/4) · 2x1 = x1/2
    ∂h/∂x2 = (∂h/∂b) (∂b/∂x2) = (1/4) · 2x2 = x2/2
    ∂h/∂x3 = (∂h/∂c) (∂c/∂x3) = (1/4) · 2x3 = x3/2
    ∂h/∂x4 = (∂h/∂d) (∂d/∂x4) = (1/4) · 2x4 = x4/2
Hence, computing the derivatives of the output with respect to any of the nodes in the graph results
from just multiplying the partial derivatives computed before in the forward propagation, using the
chain rule.

Example 10.5 shows nicely that, after creating the computational graph and performing the forward
propagation, computing all the derivatives just corresponds to a simple assembly of the previously computed
partial derivatives. Moreover, each step only involves adjacent nodes in the computational graph. These
aspects make the computations very efficient.
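The chain of computations from examples 10.4 and 10.5 can be reproduced in a few lines: the sketch below evaluates the graph node by node, stores the local partial derivatives, and assembles ∂h/∂xi by the chain rule, reproducing the closed-form result xi/2.

import numpy as np

def forward_backward(x1, x2, x3, x4):
    # forward propagation: node values and local partial derivatives
    a, da = x1 ** 2, 2 * x1
    b, db = x2 ** 2, 2 * x2
    c, dc = x3 ** 2, 2 * x3
    d, dd = x4 ** 2, 2 * x4
    e, f = a + b, c + d          # de/da = de/db = df/dc = df/dd = 1
    g = e + f                    # dg/de = dg/df = 1
    h = g / 4                    # dh/dg = 1/4
    # backward propagation: multiply the local derivatives along the paths (chain rule)
    dh_dg = 0.25
    dh_de = dh_df = dh_dg * 1.0
    grads = (dh_de * 1.0 * da, dh_de * 1.0 * db, dh_df * 1.0 * dc, dh_df * 1.0 * dd)
    return h, grads

h, grads = forward_backward(1.0, 2.0, 3.0, 4.0)
print(h, grads)                  # gradient equals (x1/2, x2/2, x3/2, x4/2)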
In the following exercises you can step-by-step apply this concept to neural networks.
Exercise 10.6.
Take the graph and forward propagation from exercises 10.4 and 10.5 and perform the backward
propagation to compute all the necessary derivatives for performing a step of the gradient descent
method.

Exercise 10.7.
Implement the algorithm derived in exercises 10.4 to 10.6 and optimize the small neural network to fit a function f : R2 → R based on noisy data. Therefore, choose your own function f and a set of 100 data points x1, . . . , x100 ∈ [0, 1]^2, evaluate f in all data points (yi = f(xi)), and add some noise from U(−ε, ε) for a small ε to each function evaluation yi.

Exercise 10.8.
Extend exercises 10.6 and 10.7 to deep networks with a variable number of layers.

Finally, note that, unlike the creation of the computational graph, the forward and backward propagation depend on the value of the input of the function. Therefore, during the optimization of a neural network, these steps have to be repeated in each step of the optimizer, even though the computational graph remains fixed.

10.4 Some examples of neural network architectures


Let us finally discuss some famous types of network architectures which are frequently used in machine
learning. Due to the rapid development of the research in machine learning, and in particular, neural
networks, this overview will be far from complete and may only be the starting point for a more detailed
study.

Convolutional neural networks   In order to motivate convolutional neural networks, let us first discuss what a discrete convolution is. Therefore, let us consider the data matrix I, which could be the matrix representation of a gray-scale image, and a kernel matrix K; the kernel matrix is also called a filter. Then, the convolution of I with the kernel K is given by

    S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n).   (10.10)

Here, for simplicity, we use the notation S (i, j) instead of Sij . The convolution operation is commutative,
in the sense that data matrix I and kernel matrix K can be flipped. This yields:
    S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n).   (10.11)

In both cases, the sum is performed over all valid indices. Otherwise, we define the result of I (m, n) K (i − m, j − n)
or I (i − m, j − n) K (m, n), respectively, to be zero.
One often instead uses the equivalent cross-correlation operation, which is essentially a convolution without flipping the kernel:

    S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n).

This is what is typically implemented in neural network libraries instead of the convolution introduced before. Note that, in neural networks, the weights of the kernel are typically not chosen by the user but optimized during training. Therefore, there is effectively no difference between the two operations. For simplicity, we will therefore focus on this operation and use the terms convolution and cross-correlation synonymously.
Let us consider a small example to understand the convolutional operation:
Example 10.6.
Let

    I = [ a b c d ; e f g h ; i j k l ]

and

    K = [ w x ; y z ].

Then, we obtain the following result from I ∗ K:

    S = I ∗ K = [ aw + bx + ey + fz   bw + cx + fy + gz   cw + dx + gy + hz ;
                  ew + fx + iy + jz   fw + gx + jy + kz   gw + hx + ky + lz ].   (10.12)

Notice that each entry in S is computed only from neighboring coefficients in I: the kernel matrix is shifted over the matrix I to compute the entries of S. For instance, the entry bw + cx + fy + gz only depends on the 2 × 2 block of I with the entries b, c, f, and g; again, four neighboring entries of I are used to compute one entry of S.
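A direct NumPy implementation of this cross-correlation (without padding and with stride one) is sketched below; the symbolic entries a, . . . , l and w, x, y, z are replaced by illustrative numbers.

import numpy as np

def cross_correlate2d(I, K):
    # S(i, j) = sum_{m,n} I(i + m, j + n) K(m, n), over valid indices only
    p, q = K.shape
    rows, cols = I.shape[0] - p + 1, I.shape[1] - q + 1
    S = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            S[i, j] = np.sum(I[i:i + p, j:j + q] * K)
    return S

I = np.arange(1, 13, dtype=float).reshape(3, 4)   # plays the role of [a b c d; e f g h; i j k l]
K = np.array([[1.0, 2.0], [3.0, 4.0]])            # plays the role of [w x; y z]
print(cross_correlate2d(I, K))                    # a 2 x 3 output, as in eq. (10.12)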

As can be seen in fig. 10.9, the choice of specific kernel matrices may help to identify certain features in an image, such as vertical and horizontal edges. The output of the convolution is also called a feature map,
   
Figure 10.9: Applying convolutions with two different kernel matrices, K = [0 0 0; 1 1 1; −1 −1 −1] and K = [0 1 −1; 0 1 −1; 0 1 −1], highlights horizontal and vertical edges in the image (panels: original image and the two filtered images).

and convolutions can be seen as a special technique for feature engineering. Thereby, only neighboring information is used; cf. eq. (10.12) in example 10.6. The size of the kernel matrix determines the locality of the convolutional operation.
Moreover, it directly follows from eqs. (10.10) and (10.11) that the convolution is a linear map with
respect to I, that is,
(a · I1 + b · I2 ) ∗ K = aI1 ∗ K + bI2 ∗ K,
for a, b ∈ R. As a result, a convolution can be written as a matrix multiplication
I ∗ K = IK, (10.13)
with a suitable matrix K. Moreover, due to the locality of the convolutional operation in eq. (10.12), it becomes clear that the matrix K is sparse, and the sparsity is determined by the size of the kernel matrix K.
Exercise 10.9.
Derive a formula for the matrix K defined by eq. (10.13).

Exercise 10.10.
Visualize the sparsity pattern of K for a full matrix I of size 5 × 5 and kernel matrix sizes 1 × 1, 2 × 2,
and 3 × 3.

Another observation from example 10.6 is that the dimension of the matrix reduces when applying the convolution. This can be avoided by extending the matrix I by layers of zeros. This approach is also called padding; cf. fig. 10.10 for a one-dimensional example with padding, where the input and output vectors have the same length. Conversely, in order to obtain an even stronger reduction of the dimension, striding can be used, which means that the kernel is shifted by more than one entry in I for computing the next entry in S; see, for example, [21, Chapter 9].
Note that convolutional operations are not restricted to data arranged as a two-dimensional matrix. They can also be employed for data which has a tensor product structure of any other dimension; see fig. 10.10 for a one-dimensional example.
Now, a convolutional neural network is basically a network in which all (or some of the) weight matrices
are replaced by convolutional operations. As discussed above, convolutional operations just correspond to

Figure 10.10: One-dimensional convolution with padding: each entry in the output vector s is computed by at most three neighboring entries in the vector i. The colors red, green, and blue correspond to the three entries x, y, and z of the kernel vector k.

multiplications with a special type of sparse matrix. In that sense, they are just a specialization of the networks described by eq. (10.7). In particular, with the same width of the layers as in a dense neural network, the number of trainable parameters is significantly lower for two reasons:
• The matrix K resulting from a convolution with a kernel K is sparse.
• When applying a kernel matrix of size n × m, the number of trainable parameters for this convolution is exactly n · m, independent of the size of the input matrix I. This means that, compared with a standard dense neural network, many weights are shared between several neurons.
Due to this extreme reduction in the number of trainable parameters, it becomes feasible to use multiple kernel matrices within each layer of a convolutional neural network. As a result, multiple feature maps can be learned; cf. fig. 10.9, where one kernel highlights horizontal edges and another filter highlights vertical edges. This also shows that, for complex images, it will be necessary to use multiple kernel matrices in order to prevent loss of image information necessary in the following layers. The different feature maps in a convolutional layer are also called channels. Another example of channels are the intensities of the colors red, green, and blue in the RGB format of a picture.
After the convolutional operations in a layer, as in dense neural networks (eq. (10.7)), we apply an activation function to make the layer nonlinear. Finally, another important type of operation on neighboring matrix entries is used, which gathers (statistical) information of nearby outputs. This type of layer is called a pooling layer. Two typical examples are:

• Average pooling: computes the average value within each neighborhood of entries; average pooling can also be written as a convolution.

• Max pooling: computes the maximum value within each neighborhood of entries; max pooling is nonlinear and can therefore not be written as a linear convolution.
For more details, see, for example, [21, Section 9.4]. Pooling operations are generally employed after convolution and activation, but they do not have to be employed after each convolutional layer.
Whether the use of convolutional neural networks is feasible depends on the structure of the data. If the data is arranged in a tensor product structure and there is a notion of neighboring entries, the use of convolutional neural networks can be reasonable. If we have input data which cannot be structured in such a way, the use of convolutional neural networks can be disadvantageous. However, convolutional operations can also be defined on unstructured data with some adjacency structure. In the unstructured case, the adjacency is given by a graph (section 2.8), and the corresponding type of convolutional neural network is denoted as a graph convolutional neural network; for the sake of brevity, we will not discuss this concept here.

Figure 10.11: Average and max pooling, illustrated on the input vector (1.0, 0.3, 2.0, 0.1, 1.2).

Figure 10.12: Simple residual layer (left), residual layer with linear map (middle), and residual block ranging over multiple hidden layers (right).

Figure 10.13: U-Net architecture with skip connections. Image taken from https://github.com/HarisIqbal88/PlotNeuralNet. See [38] for the U-Net architecture. When neglecting the skip connections, the U-Net has an autoencoder architecture with convolutional layers; cf. fig. 10.15.

Residual neural networks   The development of residual neural networks (ResNets) arose from the observation that the most successful neural networks for challenging (image) data sets are very deep networks; see, for instance, the ImageNet image recognition challenge [40]. As already mentioned before in section 10.2, it is important to make sure that the network outputs and gradients neither become extremely large nor vanish. In particular, it has been observed that, for very deep neural networks, the gradients vanish.
For ReLU activation functions, this can be attributed to the fact that, for half of the cases, the gradient
of the ReLU function
α (x) = max {0, x}
is zero; cf. fig. 10.3. In particular, consider a single layer with
hi+1 = α (Wi+1 hi + bi+1 ) . (10.14)
Let Wi+1 = (wkl)k,l. Then, we can consider a single entry of the output vector,

    (hi+1)j = α( Σ_{l=1}^{ni} wjl (hi)l + (bi+1)j ),

and we see that

    ∂(hi+1)j / ∂(hi)k = wjk   if Σ_{l=1}^{ni} wjl (hi)l + (bi+1)j ≥ 0,   and   0 otherwise.
In section 10.3, we saw that the partial derivatives of the different layers are multiplied or summed up when
propagating through the network backwards. It has been observed that for deep networks, this often leads
vanishing gradients; similar effects can be observed for other activation functions with low gradients in the
largest part of R.
In order to prevent this, we can modify the layer eq. (10.14) as follows:

    hi+1 = α(Wi+1 hi + bi+1) + hi;   (10.15)

cf. fig. 10.12 (left). This is also denoted as a residual layer. Resulting from the introduction of a residual layer, the gradients are modified accordingly,

    ∂(hi+1)j / ∂(hi)k = δjk + wjk   if Σ_{l=1}^{ni} wjl (hi)l + (bi+1)j ≥ 0,   and   δjk otherwise

(with δjk denoting the Kronecker delta),

Figure 10.14: Two data sets that can be approximated well by a one-dimensional encoding: linear (left) and
quadratic (right).

reducing the chance for vanishing gradients significantly. However, this requires matching dimensions of the two adjacent layers, that is, ni+1 = ni. In case the dimensions do not fit, one can instead use a linear map Ŵi+1 with appropriate dimensions; cf. fig. 10.12 (middle).
Instead of adding the input to the output of each individual layer, a certain number of hidden layers can be placed in between; cf. fig. 10.12 (right). This technique is used very often in practice. Instead of a residual layer, we then also denote this as a residual block. The connection in the computational graph which takes the input of a residual block and adds it to the output of the last hidden layer in the residual block is also called a skip connection. This name emphasizes that the input is skipping all the hidden layers and is directly propagated to the end of the residual block.
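A minimal sketch of the residual layer eq. (10.15) with ReLU activation and matching dimensions is given below; the weights are random placeholders.

import numpy as np

rng = np.random.default_rng(9)
relu = lambda z: np.maximum(0.0, z)

def residual_layer(h, W, b, alpha=relu):
    # h_{i+1} = alpha(W h + b) + h, cf. eq. (10.15); requires matching dimensions
    return alpha(W @ h + b) + h

n = 5
W = 0.1 * rng.standard_normal((n, n))
b = np.zeros(n)
h = rng.standard_normal(n)
print(residual_layer(h, W, b))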
A very successful type of convolutional neural network which uses skip connections is the U-Net. It has been introduced by Ronneberger et al. in [38]. Even though originally introduced for medical image processing, it is now frequently used for many different types of data with tensor product structure; see fig. 10.13 for a visualization.
Exercise 10.11.
Discuss the computational graph, forward propagation, and backward propagation for a residual layer of a feedforward neural network in detail.

Nonlinear autoencoders   Previously, in section 9.1, we had already introduced the linear autoencoder, that is, a linear map C which encodes (C^⊤) and decodes (CC^⊤) in such a way that the data points of a data set can be represented as well as possible,

    CC^⊤ xk ≈ xk;

cf. eq. (9.7). The main reason for constructing a (linear) autoencoder is to reduce the dimension of the data set, that is, if xk ∈ Rn and C^⊤ xk ∈ Rk, we want

    k << n.

If the data set follows approximately a linear relation, a linear autoencoder will perform well; see fig. 10.14
(left). If the data follows approximately a nonlinear relation, a linear autoencoder will perform poorly;
see fig. 10.14 (right). If the nonlinear relation is known, it might be possible to transform the data such that
the relation is approximately linear.

Figure 10.15: Nonlinear autoencoders based on a simple dense neural network: minimum number of layers (left) and additional hidden layers (right).

Of course, the nonlinear relation is typically unknown. As we have discussed in section 10.1, neural
networks are well-suited for approximating nonlinear relations; moreover, the nonlinearity can be optimized
automatically during the training. Therefore, let us discuss briefly whether we can employ neural networks
to perform a nonlinear dimensionality reduction; this is an example of using neural networks to perform an
unsupervised learning task. Therefore, consider the data set

{x1 , . . . , xK }

with x1, . . . , xK ∈ Rn. Now, we can reduce the dimension of the data set from n to k < n by training a neural network

    w = α(W1 x + b1),
    y = α(W2 w + b2),   (10.16)

with x, y ∈ Rn and w ∈ Rk, such that

    yi ≈ xi   ∀i = 1, . . . , K.
See fig. 10.15 (left) for an example of a corresponding network architecture. The reduced dimensional space is often also denoted as the latent space, and it is the space of the latent layer. As mentioned before, in order to obtain a dimensionality reduction, we need that k < n; therefore, the latent layer is also denoted as the bottleneck. More precisely, for the example of an MSE loss function, we train the network by solving the minimization problem

    min_{W1,W2,b1,b2} (1/K) Σ_{i=1}^{K} ‖α(W2 α(W1 xi + b1) + b2) − xi‖^2.

This minimization problem only depends on the data points x1, . . . , xK but does not require additional labels. Therefore, training an autoencoder network is an unsupervised learning task. After training the neural network, the encoding of xi is

    wi = α(W1 xi + b1).

Once the MSE has been optimized, we decode the encoding with

    yi = α(W2 wi + b2),

Figure 10.16: Layer of a recurrent neural network. A neuron can also be connected with itself.

yielding a good approximation of xi. Figure 10.15 (right) shows a further example with additional hidden layers, allowing for a higher degree of nonlinearity. The U-Net, which has been shown in fig. 10.13, is based on a deep autoencoder network with convolutional layers; however, due to the skip connections, the bottleneck does not necessarily learn an encoding of the input.
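A minimal sketch of the autoencoder eq. (10.16), consisting of encoder, decoder, and the MSE reconstruction loss, evaluated with random placeholder weights (no training loop is shown); the data is scaled to (0, 1) so that it matches the range of the sigmoid activation.

import numpy as np

rng = np.random.default_rng(10)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n, k, K = 8, 2, 100                               # data dimension, latent dimension, number of points
X = rng.uniform(size=(n, K))                      # data points as columns, scaled to (0, 1)
W1, b1 = rng.standard_normal((k, n)), np.zeros((k, 1))
W2, b2 = rng.standard_normal((n, k)), np.zeros((n, 1))

encode = lambda x: sigmoid(W1 @ x + b1)           # w = alpha(W1 x + b1)
decode = lambda w: sigmoid(W2 @ w + b2)           # y = alpha(W2 w + b2), cf. eq. (10.16)

Y = decode(encode(X))                             # reconstructions of all data points at once
mse = np.mean(np.sum((Y - X) ** 2, axis=0))       # loss to be minimized over W1, W2, b1, b2
print(mse)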

Comment on recurrent neural networks   Let us end this section on different neural network architectures with a short comment on recurrent neural networks (RNNs). In particular, they allow for interaction of neurons within a single layer. Therefore, they are particularly well-suited for time-dependent data, where the neurons interact in chronological order; cf. fig. 10.16, from left to right.
Example 10.7. RNNs for time series
Let us again consider the example of time-dependent sensor data already shown in section 2.

[Plot: time-dependent sensor data s over the time interval t ∈ [0, 10].]

Of course the corresponding data can be stored as a vector


 
s0 s1 s2 · · · s10

and used as input for a neural network. In the context of time-dependent data, the data at a
certain time si often depends on the data at previous times si−1 , si−2 , . . . . A very simple example
is temperature measurements during the course of a day. Another example of temporal data is
stock prices. In those cases, it makes sense to represent the time-dependence by an RNN
architecture, which explicitly accounts for the connections of neurons in chronological order.
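As a small illustration (the framework choice, the window length, and the numbers below are our own assumptions, not data from the notes), such a model can be set up with Keras' SimpleRNN layer, predicting the next sensor value from a short window of previous values:

import numpy as np
import tensorflow as tf

# Toy time series s_0, ..., s_10 standing in for the sensor data above (values are made up).
s = np.array([15.0, 14.0, 12.5, 11.0, 10.5, 9.0, 8.5, 8.0, 7.5, 7.0, 6.5], dtype="float32")

# Build (input window, next value) training pairs with a window of length 3.
window = 3
X = np.stack([s[i:i + window] for i in range(len(s) - window)])[..., None]  # (samples, time, 1)
y = s[window:]

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(8, input_shape=(window, 1)),   # recurrent layer over the time axis
    tf.keras.layers.Dense(1),                                # predicted next value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=50, verbose=0)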

11 Optimization with constraint learning
11.1 Introduction
Most of this course has been devoted to (numerical) linear algebraic and optimization techniques whose main
aim was to achieve a certain ML goal – to train a predictive or clustering tool. However, in real life even
ML is typically only a ‘servant’ of a certain bigger goal. What could such bigger goals look like?
For example, when we cluster, apart from purely technical applications such as image segmentation,
we typically want to do it in order to understand the groups in the data and, perhaps, to apply a customized
offer/treatment to a uniform group of patients, clients, etc. Think of the different ways you can cluster people
in:
• public transport to offer them suitable periodic ticket deals
• retail, to identify different groups that choose various levels of the prestige-price tradeoff, and to
propose to each of them a different product line (it is a fact of life that many ‘different’ store chains are
owned by the same owner and the main role of the differences is to offer to people different tradeoffs
of price/prestige/sustainability of a given brand).
• healthcare - if you were to develop several different treatment schedules
As you can see, in all these examples there are still decisions to be made after the clustering is done and the
clustering is only a ‘servant’ of these decisions.
What about supervised learning? Again, we make predictions in order to act upon them. Think of:
• classification model to judge which flights are most likely to be late – based on the probabilities, you
will be prioritizing the luggage belts or ground crews to certain parts of the airport to mitigate against
the delays
• regression model to measure the precise impact of a given treatment on a certain tissue in the body –
based on this you will be choosing between various treatments
• default/no default classification built to decide whether to grant someone a loan or not; regression
model to predict the sales price of a range of used cars.
Sometimes, the decisions you make on the basis of the ML models are going to be simple – to do something
or not to do it (grant a loan) or to simply rank certain decisions in the order in which to take them (prioritizing
various treatment schedules/teams at the airport). Then, there’s nothing too complicated to think about -
one can pick the best decisions (i.e. ’optimize’) almost by hand.
But sometimes, the decisions to be made on the basis of the ML model are complicated - they entail an
entire vector of decisions, which are related one to another in a nontrivial way. This means that in order to
pick the best possible decision, one needs a nontrivial tool. This nontrivial tool is optimization.
The goal is to introduce you to interesting situations in which first, an optimization model is used to
train an ML tool, and then, the ML tool is embedded inside an optimization problem which is to make the
’real decisions’ based on it.
As the goal set in this way can be arbitrarily complex, we need to ‘set the conditions in which it might
work’ or, to speak with military terminology, define the perimeter of the operation. If ML models, which can
be pretty complicated, are to be only a part of a bigger thing, then certainly the bigger thing to be solved
needs to be a reasonably solvable class of optimization problems.
In this way, the first part of this lecture will be a crash course on (mixed integer) linear programming,
which is by far the most scalable optimization technology, and which covers, volume-wise, probably 90% of
non-ML applications of optimization. The crash course is imported from an in-progress textbook [36]. 3 We
shall not cover the basic algorithms for solving these problems but they are very simple and you can learn
about them, for example, in [7].
3 You can find the hopefully self-explanatory Jupyter notebooks at http://jckantor.github.io/MO-book/intro.html

The shorter, second part of the lecture, in turn, will be an attempt to show you how to combine this with
machine learning in two or three specific contexts. To make it concrete, it is mostly imported from the survey
[18] that illustrates two ‘really applied’ applications.

11.2 Linear optimization and its modelling techniques


The simplest, and most scalable class of optimization problems is the one where the objective function and
the constraints are formulated using the simplest possible type of functions – linear functions. A linear
program (LP) is an optimization problem of the form
    min  cᵀx                                               (11.1)
    s.t. Ax ≥ b,
         x ≥ 0,
where the n (decision) variables are grouped in a vector x ∈ Rn , c ∈ Rn are the objective coefficients, and
the m linear constraints are described by the matrix A ∈ Rm×n and the vector b ∈ Rm .
Of course, linear problems could also (i) be maximization problems, (ii) involve equality constraints and
constraints of the form ≤, and (iii) have unbounded or non-positive decision variables xi . In fact, any LP
problem with such features can be easily converted to the ‘canonical’ LP form (11.1) by adding/removing
variables and/or multiplying specific inequalities by −1.
Example 11.1. Building microchips pt. 1 – problem formulation
The company BIM (Best International Machines) produces two types of microchips, logic chips (1gr
silicon, 1gr plastic, 4gr copper) and memory chips (1gr germanium, 1gr plastic, 2gr copper). Each
of the logic chips can be sold for a profit of 12, and each of the memory chips for a profit of 9. The
current stock of raw materials is as follows: 1000gr silicon, 1500gr germanium, 1750gr plastic, 4800gr
of copper. How many microchips of each type should be produced to maximize the profit while
respecting the raw material stock availability?
Let x1 denote the number of logic chips and x2 that of memory chips. This decision problem can be
formulated as an optimization problem of the following form:

    max  12x1 + 9x2
    s.t. x1 ≤ 1000          (silicon)
         x2 ≤ 1500          (germanium)
         x1 + x2 ≤ 1750     (plastic)                      (11.2)
         4x1 + 2x2 ≤ 4800   (copper)
         x1 , x2 ≥ 0

Using the modelling package Pyomo, this problem can be formulated in Python and solved using the
solver GLPK as follows.

import pyomo.environ as pyo

m = pyo.ConcreteModel('BIM')

m.x1 = pyo.Var()
m.x2 = pyo.Var()

# Objective: maximize the total profit.
m.profit = pyo.Objective(expr=12*m.x1 + 9*m.x2, sense=pyo.maximize)

# Raw material constraints.
m.silicon = pyo.Constraint(expr=m.x1 <= 1000)
m.germanium = pyo.Constraint(expr=m.x2 <= 1500)
m.plastic = pyo.Constraint(expr=m.x1 + m.x2 <= 1750)
m.copper = pyo.Constraint(expr=4*m.x1 + 2*m.x2 <= 4800)

# Nonnegativity of the decision variables.
m.x1domain = pyo.Constraint(expr=m.x1 >= 0)
m.x2domain = pyo.Constraint(expr=m.x2 >= 0)

pyo.SolverFactory('glpk').solve(m)

Several things can be said about the above example. First, it was rather straightforward to model – the
choice of the decision variables was obvious and the constraints were easy to formulate. Secondly, it was easy
to find the optimal solution by hand. Third, it is evident to the ‘naked eye’ that the solution found is indeed
optimal.
Surprisingly, also much larger and seemingly more complicated problems can be modelled using linear
constraints only. For such problems, however, we are often not able to find the solution by hand, and
one typically cannot judge ‘by eye’ that a particular solution is an optimal one. To move on to working
confidently with real-life problems, we need to gain more knowledge about LP.
In the following sections, we shall expand on what we learned so far. First, we will extend our capabilities
of modelling various situations using linear constraints. Secondly, we will provide a formal definition of the
search for the optimality certificate of a given solution. In the end, we will explain intuitively how numerical
algorithms for LP problems work.
For some expressions, like the ones in example 11.1, it is clear that we can write them down using linear
functions. However, there are important real-life objective functions and constraints for which it is difficult to
immediately see the same. At the same time, because LPs are by far the easiest problems to solve, a problem
should be expressed using linear constraints as long as it is possible. We will therefore provide a number of
useful LP modelling techniques.

11.2.1 Absolute values


Consider a situation where the objective function in a minimization problem contains absolute values |xi | of
(some of) the variables (or a linear combination of these with non-negative coefficients).
Although the objective is not linear when it contains absolute values |xi |, we can obtain an equivalent
linear formulation. For each such variable xi , introduce two new variables xi⁺ , xi⁻ ≥ 0 and replace each
occurrence of xi and |xi | by

    xi = xi⁺ − xi⁻ ,
    |xi | = xi⁺ + xi⁻ ,
    xi⁺ , xi⁻ ≥ 0.

It is easy to show that, for any solution of the modified problem in which either xi⁺ or xi⁻ is zero for every i,
the problems with and without absolute values are equivalent.
Note that the same reasoning can be used to reformulate absolute values involving entire expressions
such as |x1 − x4 | and constraints such as

    |x1 | + x2 ≤ 4,

but it cannot be used when the coefficient in front of the absolute value is negative.
Example 11.2. Wine quality prediction
Physical, chemical, and sensory quality properties were collected for a large number of red and white
variants of the Portuguese ”Vinho Verde” wine and then donated to the UCI machine learning
repository, see [13]. The dataset consists of n = 1599 measurements of 11 physical and chemical
characteristics plus an integer measure of sensory quality recorded on a scale from 3 to 8. Due to
privacy and logistic issues, there is no data about grape types, wine brand, wine selling price, etc.
The goal of the regression is to find coefficients mj and b that minimize the mean absolute deviation

(MAD), that is,

    min  MAD (ŷ) = (1/n) ∑_{i=1}^{n} |yi − ŷi |
    s.t. ŷi = ∑_{j=1}^{J} xi,j mj + b    ∀i = 1, . . . , n,

where xi,j are values of ’explanatory’ variables, in this case the 11 physical and chemical characteristics
of the wines.
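A minimal Pyomo sketch of this MAD regression, using the absolute-value splitting from above, could look as follows; the random placeholder data stands in for the wine data set of [13], and all variable names are our own.

import numpy as np
import pyomo.environ as pyo

# Placeholder data; in the example these would be the n wine samples with J = 11
# physical/chemical characteristics and the sensory quality scores from [13].
n, J = 50, 11
X = np.random.rand(n, J)
y = np.random.randint(3, 9, size=n).astype(float)

m = pyo.ConcreteModel('MAD regression')
m.I = pyo.RangeSet(0, n - 1)
m.Jidx = pyo.RangeSet(0, J - 1)

m.coef = pyo.Var(m.Jidx)                              # coefficients m_j
m.b = pyo.Var()                                       # intercept
m.ep = pyo.Var(m.I, domain=pyo.NonNegativeReals)      # positive part of the residual
m.em = pyo.Var(m.I, domain=pyo.NonNegativeReals)      # negative part of the residual

# y_i - yhat_i = ep_i - em_i; at the optimum, ep_i + em_i = |y_i - yhat_i|.
def residual_rule(m, i):
    yhat = sum(X[i, j] * m.coef[j] for j in m.Jidx) + m.b
    return y[i] - yhat == m.ep[i] - m.em[i]
m.residual = pyo.Constraint(m.I, rule=residual_rule)

m.mad = pyo.Objective(expr=sum(m.ep[i] + m.em[i] for i in m.I) / n, sense=pyo.minimize)
# pyo.SolverFactory('glpk').solve(m)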

11.2.2 Minimax objective


Another class of seemingly complicated objective functions that can be easily rewritten as an LP are those
stated as maxima over several linear functions. Given a finite set of indices K and a collection of vectors
{ck }k∈K , consider the minimax problem

    min  max_{k∈K} ckᵀx.                                   (11.3)

General expressions such as (11.3) can be linearized by introducing an auxiliary variable z and setting

    min  z
    s.t. ckᵀx ≤ z    ∀ k ∈ K.

This trick works because if all the quantities corresponding to different indices k ∈ K are below the auxiliary
variable z, then their maximum is also below z, and vice versa. Note that the absolute value function can
be rewritten as |xi | = max{xi , −xi }, hence the linearization of optimization problems involving absolute
values in the objective function is a special case of this.

11.2.3 Fractional objective


In some problems, one might be interested in minimizing a ratio of one quantity to the other, where both
depend on the decision variables. Although terms such as
    x1 / (2x2 + 5)
are clearly not linear expressions, under certain conditions it is also possible to minimize them using a
linear problem. To demonstrate this, take a finite index set I and consider the optimization problem with
a fractional objective function of the form

    min  (cᵀx + α) / (dᵀx + β)
    s.t. Ax ≤ b,
         x ≥ 0,

where the term dᵀx + β is either strictly positive or strictly negative over the entire feasible set of x.

Setting first t = (dᵀx + β)⁻¹ and then yi = xi t for every index i, we obtain the following equivalent
linear optimization problem

    min  cᵀy + αt
    s.t. Ay ≤ tb,
         dᵀy + βt = 1,
         t ≥ 0,
         y ≥ 0.

Note that the inequality for t should in fact be strict, i.e., t > 0, but in view of the assumption above on
dᵀx + β, relaxing the constraint to t ≥ 0 does not change the optimal solution.
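For instance (a small illustration added here for concreteness), minimizing x1 / (2x2 + 5) over a polyhedron Ax ≤ b, x ≥ 0 corresponds to c = (1, 0)ᵀ, α = 0, d = (0, 2)ᵀ, β = 5, so the equivalent LP reads

    min  y1
    s.t. Ay ≤ tb,
         2y2 + 5t = 1,
         t ≥ 0,  y ≥ 0,

and the original solution is recovered as x = y/t.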

11.3 Mixed-integer linear programming and modelling techniques


The particular feature of the linear programs was that as long as the decision variables satisfy all the
constraints, they can take any real value. However, there are many situations in which it makes sense to
restrict the solution space in a way that cannot be expressed using linear (in-)equality constraints. For
example, some numbers might need to be integers, such as the number of people to be assigned to a task.
Another situation is where certain constraints are to hold only if another constraint holds – for example, the
amount of power generated in a coal plant taking at least its minimum value only if the generator is turned
on.
Neither of these two examples can be formulated using linear constraints alone. For such and many other
situations, it is often possible that the problem can still be modelled as an LP, yet with an extra restriction
that some variables need to take integer values only.
A mixed-integer linear program (MILP) is an LP problem in which some variables are constrained to
be integers. Formally, it is defined as

    min_{x ∈ Rn}  cᵀx
    s.t. Ax ≤ b,
         xi ∈ Z,  i ∈ I,

where I ⊆ {1, . . . , n} is the set of indices identifying the variables that take integer values. Of course, if
the decision variables are required to be nonnegative, we could use the set N instead of Z. A special case
of integer variables are binary variables, which can take only values in B = {0, 1}. Consider the following
example.
Example 11.3. Building microchips with integers
The company BIM realizes that a 1% fraction of the copper always gets wasted while producing both
types of microchips, more specifically 1% of the required amount. This means that it actually takes
4.04 gr of copper to produce a logic chip and 2.02 gr of copper to produce a memory chip. If we
rewrite the linear problem in example 11.1 and modify accordingly the coefficients in the corresponding
constraints, we obtain the following problem

max 12x1 + 9x2


s.t. x1 ≤ 1000 (silicon)
x2 ≤ 1500 (germanium)
x1 + x2 ≤ 1750 (plastic)
4.04x1 + 2.02x2 ≤ 4800 (copper with waste)
x1 , x2 ≥ 0.

If we solve again we obtain a different optimal solution than the original one, namely (x1 , x2 ) ≈
(626.238, 1123.762) and an optimal value of roughly 17628.713. Note, in particular, that this new
optimal solution is not integer but, on the other hand, the LP above contains no constraint requiring
x1 and x2 to be integer.
In terms of production, of course we would simply produce entire chips but it is not clear how
to implement the fractional solution (x1 , x2 ) ≈ (626.238, 1123.762). Rounding down to (x1 , x2 ) =
(626, 1123) will intuitively yield a feasible solution, but we might be giving away some profit and/or
not using efficiently the available material. Rounding up to (x1 , x2 ) = (627, 1124) could possibly lead
to an infeasible solution for which the available material is not enough. We could of course inspect
by hand all these candidate integer solutions, but if the problem involved many more decision
variables or had a more complex structure, this would become much harder and possibly not lead to
the true optimal solution.
A much safer approach is to explicitly require the two decision variables to be nonnegative integers,
thus transforming the original into the following MILP:

max 12x1 + 9x2


s.t. x1 ≤ 1000 (silicon)
x2 ≤ 1500 (germanium)
x1 + x2 ≤ 1750 (plastic)
4.04x1 + 2.02x2 ≤ 4800 (copper with waste)
x1 , x2 ∈ N.

The optimal solution is (x1 , x2 ) = (626, 1124) with a profit of 17628. Note that for this specific
problem neither of the naive rounding strategies outlined above would have yielded the true optimal
solution. The Python code for obtaining the optimal solution using a MILP solver is given below.

import pyomo.environ as pyo

m = pyo.ConcreteModel('BIMperturbed')

# Integrality is enforced through the variable domains.
m.x1 = pyo.Var(within=pyo.NonNegativeIntegers)
m.x2 = pyo.Var(within=pyo.NonNegativeIntegers)

m.obj = pyo.Objective(expr=12*m.x1 + 9*m.x2, sense=pyo.maximize)

m.silicon = pyo.Constraint(expr=m.x1 <= 1000)
m.germanium = pyo.Constraint(expr=m.x2 <= 1500)
m.plastic = pyo.Constraint(expr=m.x1 + m.x2 <= 1750)
m.copper = pyo.Constraint(expr=4.04*m.x1 + 2.02*m.x2 <= 4800)

pyo.SolverFactory('glpk').solve(m)

MILP naturally applies to situations in which we need to deal with integer numbers, as when scheduling
people, as the following extensive example illustrates.
Example 11.4. Shift scheduling
This example concerns a model for scheduling weekly shifts for a small campus food store. It is
inspired by a Towards Data Science article, whose original implementation has been revised. Let us
look at the problem description from the original article.
A new food store has been opened at the University Campus which will be open 24 hours
a day, 7 days a week. Each day, there are three eight-hour shifts. Morning shift is from
6:00 to 14:00, evening shift is from 14:00 to 22:00 and night shift is from 22:00 to 6:00 of

the next day. During the night there is only one worker while during the day there are
two, except on Sunday when there is only one for each shift. Each worker will not exceed
a maximum of 40 hours per week and has to rest for 12 hours between two shifts. As for
the weekly rest days, an employee who rests one Sunday will also prefer to do the same
that Saturday. In principle, there are ten employees available, which is clearly over-sized.
The less the workers are needed, the more the resources for other stores.

This problem requires assignment of N workers to a predetermined set of shifts. There are three
shifts per day, seven days per week. These observations suggest the need for three ordered sets:
• WORKERS with N elements representing workers.
• DAYS labeling the days of the week.
• SHIFTS labeling the shifts each day.

The problem describes additional considerations that suggest the utility of several additional sets.
• SLOTS is an ordered set of (day, shift) pairs describing all of the available shifts during the
week.
• BLOCKS is an ordered set of all overlapping 24 hour periods in the week. An element of the
set contains the (day, shift) pairs in the corresponding period. This set will be used to limit
worker assignments to no more than one per 24 hour period.
• WEEKENDS is the set of all (day, shift) pairs on a weekend. This set will be used to
implement worker preferences on weekend scheduling.
These additional sets improve the readability of the model.

WORKERS = {w1 , w2 , . . . , wN } set of all workers


DAYS = {Mon, Tues, . . . , Sun} days of the week
SHIFTS = {morning, evening, night} 8 hour daily shifts
SLOTS = DAYS × SHIFTS ordered set of all (day, shift) pairs
BLOCKS ⊂ SLOTS × SLOTS × SLOTS all 24 hour blocks of consecutive slots
WEEKENDS ⊂ SLOTS subset of slots corresponding to weekends

The model parameters are

N = number of workers
WorkersRequiredd,s = number of workers required for each day, shift pair (d, s)

The decision variables are


assignw,d,s = 1 if worker w is assigned to the (day, shift) pair (d, s) ∈ SLOTS, and 0 otherwise,
weekendw = 1 if worker w is assigned to a weekend (day, shift) pair in WEEKENDS, and 0 otherwise,
neededw = 1 if worker w is needed during the week, and 0 otherwise.

Let us now look at the model constraints. Assign workers to each shift to meet staffing requirement.
    ∑_{w ∈ WORKERS} assignw,d,s ≥ WorkersRequiredd,s    ∀(d, s) ∈ SLOTS

Assign no more than 40 hours per week to each worker.


    8 ∑_{(d,s) ∈ SLOTS} assignw,d,s ≤ 40    ∀w ∈ WORKERS

Assign no more than one shift in each 24 hour period.

    assignw,d1 ,s1 + assignw,d2 ,s2 + assignw,d3 ,s3 ≤ 1    ∀w ∈ WORKERS,
                                                            ∀((d1 , s1 ), (d2 , s2 ), (d3 , s3 )) ∈ BLOCKS

Indicator if worker has been assigned any shift.


    ∑_{(d,s) ∈ SLOTS} assignw,d,s ≤ MSLOTS · neededw    ∀w ∈ WORKERS

Indicator if worker has been assigned a weekend shift.


    ∑_{(d,s) ∈ WEEKENDS} assignw,d,s ≤ MWEEKENDS · weekendw    ∀w ∈ WORKERS

The model objective is to minimize the overall number of workers needed to fill the shift and work
requirements while also attempting to meet worker preferences regarding weekend shift assignments.
This is formulated here as an objective for minimizing a weighted sum of the number of workers needed
to meet all shift requirements and the number of workers assigned to weekend shifts. The positive
weight γ determines the relative importance of these two measures of a desirable shift schedule.
    min  ∑_{w ∈ WORKERS} neededw + γ ∑_{w ∈ WORKERS} weekendw

Figure 11.1 is a visual representation of the optimal shift schedule obtained for a specific instance of
the problem.
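For illustration, a reduced Pyomo sketch of part of this model (staffing, the 40-hour limit, and the big-M link to neededw) could look as follows; the 24-hour-block and weekend-preference constraints are omitted, and the staffing function and the big-M value len(SLOTS) are our own assumptions.

import pyomo.environ as pyo

WORKERS = [f'w{i}' for i in range(1, 11)]
DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
SHIFTS = ['morning', 'evening', 'night']
SLOTS = [(d, s) for d in DAYS for s in SHIFTS]

def workers_required(d, s):
    # One worker at night and on Sundays, two otherwise (as in the problem statement).
    return 1 if (s == 'night' or d == 'Sun') else 2

m = pyo.ConcreteModel('shift scheduling sketch')
m.assign = pyo.Var(WORKERS, SLOTS, domain=pyo.Binary)
m.needed = pyo.Var(WORKERS, domain=pyo.Binary)

# Meet the staffing requirement for every (day, shift) slot.
m.staffing = pyo.Constraint(SLOTS, rule=lambda m, d, s:
    sum(m.assign[w, d, s] for w in WORKERS) >= workers_required(d, s))

# At most 40 hours (five 8-hour shifts) per worker and week.
m.max_hours = pyo.Constraint(WORKERS, rule=lambda m, w:
    8 * sum(m.assign[w, d, s] for (d, s) in SLOTS) <= 40)

# Big-M link with M = |SLOTS|: any assignment forces needed_w = 1.
m.link = pyo.Constraint(WORKERS, rule=lambda m, w:
    sum(m.assign[w, d, s] for (d, s) in SLOTS) <= len(SLOTS) * m.needed[w])

m.obj = pyo.Objective(expr=sum(m.needed[w] for w in WORKERS), sense=pyo.minimize)
# pyo.SolverFactory('glpk').solve(m)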

Previously, we claimed that every optimization problem that can be formulated as an LP is ‘easy’ to
solve. Does that mean that, analogously, every MILP problem is easy to solve? Not necessarily: due
to the significantly greater modelling capacity of MILP, it can indeed be used to model problems that are
fundamentally ‘difficult’, i.e., for which no efficient solution procedure is known, even if tools other than
MILP are allowed. Here, we present an example of a classical problem like this – the knapsack problem –
which can be used to model many resource allocation situations, e.g., on computational clusters.
Example 11.5. Resource allocation – Knapsack 0-1 problem
A traveler can only bring a single fixed-weight knapsack and must fill it with the most valuable items.
Given a finite set of n items, where each item i has a weight wi and a value vi , we want to select the
subset of items to put in the knapsack so that the total weight is less than or equal to a given limit

Figure 11.1: The optimal shift schedule.

W and the total value is as large as possible. It can be formulated as an MILP as follows:
    max_{x}  ∑_{i=1}^{n} vi xi
    s.t.  ∑_{i=1}^{n} wi xi ≤ W,
          xi ∈ B,  i = 1, . . . , n.

The knapsack problem is one of the most fundamental combinatorial optimization problems whose
variants often arise in resource allocation, where the decision maker has to choose a subset of non-
divisible tasks/resources under a fixed time/budget constraint. General versions of this problem are
routinely solved on online computational clusters, for example.
This problem is known to be NP-complete which, roughly speaking, means that it is widely believed
that there exists no algorithm that would not need to check, on a worst-case instance, essentially all
2^n candidate solutions.
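A minimal Pyomo sketch of the knapsack MILP, with made-up weights and values, is:

import pyomo.environ as pyo

values = [10, 13, 18, 31, 7, 15]     # assumed item values v_i
weights = [2, 3, 4, 7, 1, 3]         # assumed item weights w_i
W = 10                               # knapsack capacity
n = len(values)

m = pyo.ConcreteModel('knapsack')
m.x = pyo.Var(range(n), domain=pyo.Binary)
m.total_value = pyo.Objective(expr=sum(values[i] * m.x[i] for i in range(n)),
                              sense=pyo.maximize)
m.capacity = pyo.Constraint(expr=sum(weights[i] * m.x[i] for i in range(n)) <= W)
# pyo.SolverFactory('glpk').solve(m)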

As visible from the example, our enthusiasm for modelling a problem we encounter as an MILP can
sometimes lead us to, accidentally, modelling one of the well-known NP-hard problems. Does that mean
that MILP is an inefficient technology? No, because powerful solvers have been developed for MILPs that
allow us to efficiently solve optimization problems with thousands of integer variables.
To do that, we need to become familiar with techniques and tricks that allow us to model via integer
variables and MILP constraints situations that we could not think of at first encounter.

11.3.1 Variables taking a set of discontinuous values


MILP can be used to model variables taking on discontinuous values, for instance when either x = 0 or
l ≤ x ≤ u must hold. We introduce a binary variable y ∈ B with the interpretation that

    y = 0 if x = 0,    y = 1 if l ≤ x ≤ u.                 (11.4)

Now we can model the discontinuous variable x by the following linear constraints:
x ≤ uy,
x ≥ ly,
y ∈ B.
Indeed, by studying this system of constraints, you can see that the relationship (11.4) is enforced.

11.3.2 Variable enforcing a given constraint or not


In many optimization problems, we want to ensure that a certain constraint a> x ≤ b holds only under
specific conditions, which we can capture with a yes-no decision variable, say a binary variable y ∈ B. The
so-called big-M method gives a way to construct such an ‘optional’ constraint by writing the original
constraint as
a> x ≤ b + M (1 − y),
where M > 0 is a large number. In this constraint, if the binary variable y takes a ‘yes’ value 1, then the
right-hand side is equal to b and we recover the original constraints, which then ‘needs to hold’. Otherwise,
if y = 0 the right-hand side becomes so large due to the M (1 − y) = M term that effectively the constraint
does not impose any restriction on x.
In this example, the value M can typically be easily guessed from the problem properties – it should
be large enough to make the trick work, i.e., be greater than aᵀx − b for all reasonable x, but it should not
be too large because that would deteriorate the solver performance. We will now see this trick applied to several
specific situations.

11.3.3 Cost function with a fixed component


Often in production planning problems, the cost of producing a given good consists of a part which scales with
the production size, and a fixed part (machine setup costs, for example). Such a cost function f (x) with a
per-unit cost c and a fixed component k given by
    f (x) = 0         if x = 0,
    f (x) = k + cx    if x > 0,

can be modeled by adding a binary variable y ∈ B as follows:


f (x, y) = ky + cx,
x ≤ M y,
y ∈ B,
where M > 0 is a large positive constant that should be an upper bound on x.

11.3.4 Either-or constraints


The either-or constraints require that at least one of the following constraints must hold:
    a1ᵀx ≤ b1    or    a2ᵀx ≤ b2 .

For this, again the big-M method can be used where we need a new binary variable y ∈ B and two large
positive constants M1 , M2 > 0. The linearized constraints are then
    a1ᵀx ≤ b1 + M1 y,
    a2ᵀx ≤ b2 + M2 (1 − y),
    y ∈ B.

Example 11.6. Multi-plant production
Consider the following production problem

max profit
x,y≥0

s.t. profit = 40x + 30y


x ≤ 40 (demand)
x + y ≤ 80 (labor A)
2x + y ≤ 100 (labor B).

The optimal solution is (x, y) = (20, 60), which results in a profit of 2600.
Labor B is a relatively high cost for the production of product X. A new technology has been developed
with the potential to lower cost by reducing the time required to finish product X to 1.5 hours, but
requires a more highly skilled labor type C at a unit cost of 60 per hour.
It is our task to assess if the new technology is beneficial, i.e., whether adopting it would lead to
a higher profit. In this situation we have an either-or structure for the objective and for Labor B
constraint:

    profit = 40x + 30y,  2x + y ≤ 100   (old technology)
        or
    profit = 60x + 30y,  1.5x + y ≤ 100   (new technology).

Using MILP, we can formulate this problem as follows:

max profit
x,y≥0,z∈B

s.t. x ≤ 40 (demand)
x + y ≤ 80 (labor A)
profit ≤ 40x + 30y + M z
profit ≤ 60x + 30y + M (1 − z)
2x + y ≤ 100 + M z
1.5x + y ≤ 100 + M (1 − z).

where the variable z ∈ {0, 1} ‘activates’ the constraints related to the old or new technology, respec-
tively, and M is a big enough number.
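For concreteness, a Pyomo sketch of this either-or MILP could look as follows; the explicit value M = 1000 is our own choice, large enough given the variable ranges in this problem.

import pyomo.environ as pyo

M = 1000     # assumed big-M, sufficient for the ranges of x, y, and profit here

m = pyo.ConcreteModel('multi-plant')
m.x = pyo.Var(domain=pyo.NonNegativeReals)
m.y = pyo.Var(domain=pyo.NonNegativeReals)
m.profit = pyo.Var()
m.z = pyo.Var(domain=pyo.Binary)     # z = 0: old technology active, z = 1: new technology

m.demand = pyo.Constraint(expr=m.x <= 40)
m.laborA = pyo.Constraint(expr=m.x + m.y <= 80)
m.profit_old = pyo.Constraint(expr=m.profit <= 40*m.x + 30*m.y + M*m.z)
m.profit_new = pyo.Constraint(expr=m.profit <= 60*m.x + 30*m.y + M*(1 - m.z))
m.laborB_old = pyo.Constraint(expr=2*m.x + m.y <= 100 + M*m.z)
m.laborB_new = pyo.Constraint(expr=1.5*m.x + m.y <= 100 + M*(1 - m.z))

m.obj = pyo.Objective(expr=m.profit, sense=pyo.maximize)
# pyo.SolverFactory('glpk').solve(m)   # requires a MILP-capable solver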

11.3.5 If-then constraints


The if-then condition requires that if one condition, say

    A: a1ᵀx ≤ b1 ,

holds, then another condition, say

    B: a2ᵀx ≤ b2 ,

must hold. A situation like this can still be encoded as a linear model using the big-M method.

First notice that the implication A ⇒ B is logically equivalent to ¬A ∨ B. Using this trick, the if-then
condition is logically equivalent to requiring

    a1ᵀx > b1    or    a2ᵀx ≤ b2 .

Introducing two large constants M1 , M2 > 0 and a binary variable y, the either-or constraint is equivalent
to

    a1ᵀx > b1 − M1 y,
    a2ᵀx ≤ b2 + M2 (1 − y),
    y ∈ B.

Here, one needs to be careful because in MILP, strict constraints of the form a1ᵀx > b1 − M1 y cannot be
enforced as such and are always implemented as weak inequalities a1ᵀx ≥ b1 − M1 y, which in most contexts
is fine.

11.3.6 Products of variables


If the optimization problem contains the product of two variables, in a few special cases it is still possible to
“linearize” it at the cost of adding a new variable and a set of additional constraints.
The product x1 x2 of two binary variables x1 , x2 ∈ B can be replaced by a new variable y and the following
additional constraints

y ≤ x1 ,
y ≤ x2 ,
y ≥ x1 + x2 − 1,
y ∈ B.

Similarly, the product x1 x2 with x1 ∈ B and l ≤ x2 ≤ u can be replaced by a new variable y and the following
additional constraints

y ≤ ux1 ,
y ≥ lx1 ,
y ≤ x2 − l(1 − x1 ),
y ≥ x2 − u(1 − x1 ),
y ∈ R.

Exercise 11.1. Constructing an optimal classification tree


Suppose you have a sample (x1 , y 1 ), . . . , (xN , y N ) of data points where xi ∈ [0, 1]n and where y i ∈
{−1, 1}. Recall the section about tree-based learners. Your goal is to fit a depth-k classification tree
to this data set so as to minimize the number of misclassified samples. Set it up as a MILP.

11.4 Constraint learning


11.4.1 Introduction
Now, imagine that you are to solve an optimization problem in which certain relationships between decision
variables are not immediately expressed using MILP constraints but, instead, come from a trained
predictive model. Such an optimization problem can be formulated in general as:
min f (x)
x,y

s.t. g(x) ≤ 0
y = h(x)
θ(y) ≤ 0,
where f (x) is the cost function to be minimized in terms of the decision vector x, subject to constraints
g(x) ≤ 0 which can involve multiple smaller constraints. The problem entails another set of decision variables
y which are a result of using a predictive model h that transforms the ‘original’ decision variables x into y. The
requirement is that the ‘transformed decision variables’ y need to meet certain criteria, which are formulated
using the constraint θ(y) ≤ 0.
What are the two specific applications that could fit into such a description?
Example 11.7. Optimizing the World Food Programme supply chain [18]

Imagine you are to construct an optimal ‘food basket’ for a humanitarian crisis relief operation. In
constructing the meal, you are to pick from a number of ingredients, each of which has a certain
unit price. The goal is, of course, to keep the total cost of the food basket as low as possible, while
meeting two essential requirements:
• the food basket should meet minimum nutritional requirements in terms of proteins, vitamins
etc.
• the meal made out of such a food basket should meet a minimum ‘taste’ score, because otherwise
it is not likely to be eaten and some food is going to be wasted.
Formulating the objective in such a problem is easy – it is simply the inner product of a decision
vector x, where each entry measures how much of a given type of food (rice, dates, etc.) the basket
should include, and the vector c of unit prices.
Formulating the nutrition constraints is also straightforward – it will be something like:
Ax ≥ b,
where the matrix A stores information about how much of nutrient i (row) food type j (column)
contains per unit, and the i-th entry of b is the minimum requirement for nutrient i.
Solving the problem of minimizing the cost of a nutrition-requirement meeting food basket can be,
depending on the number of options, an easy or large-scale LP problem to solve, and can be formulated
with Pyomo exactly the same way as in the previous two subsections.
The ‘taste’ score is, however, trickier to formulate. How can we map a vector x into some kind of
taste value? We cannot really, unless we know how people value the taste of various baskets and try
to transform this data into a ‘predicted’ score of a potential new basket.
This is exactly where the predictive model comes in. In the WFP problem, this was done based on a
database of past food baskets x1 , . . . , xN along with the scores y 1 , . . . , y N they received from people.
Based on these scores, a regression tree that maps x to y was fitted:
y ≈ CART(x).
Using this tree, the optimization problem to be solved to optimize the food basket is:
    min_{x,y}  cᵀx
    s.t. Ax ≥ b,
         y ≥ ymin ,
         y = CART(x).

Example 11.8. Radiotherapy - [18]

Consider another example, in which the vector x defines the position and intensity of radiation of
various beams used in cancer radiotherapy. Normally speaking, for such a vector x, to quantify the
amount of radiation that has reached a certain area of the patient’s body, one needs to perform expen-
sive simulations based on the physics that describes the radiation propagation throughout different
tissues, taking into account the patient’s geometry.
Because such simulations can be time-costly and because radiotherapy is something that requires
rather fast decisions, a different approach has been proposed. Namely, for many patients i with
patient geometries wi and corresponding radiation schemes xi , a predictive model is trained that
predicts the amount of radiation to different parts of the patient’s body. For simplicity, let us assume
that a function:
fBad : (x, w) → R+
quantifies the amount of radiation sent to healthy organs that are around the cancer tumor, i.e.,
radiation sent to areas where radiation should not go to; and the function

fGood : (x, w) → R+

quantifies the amount of radiation sent to the cancer tumor. The benefit of such trained functions
is that they are very quick to compute given a vector x and geometry w. In the field of scientific
machine learning, training a function like this would be called model order reduction.
In rough terms, the optimization problem that one would like to solve is thus to maximize the amount
of radiation sent to the cancer tumor itself, subject to constraints on the amount of radiation sent to
the healthy tissues:

max fGood (x, w)


x
s.t. fBad (x, w) ≤ δ,

where δ stands for the upper bound on the radiation of non-cancerous areas.

In both of the above examples, it is ‘easier said than done’ to include such complicated, ML-generated
relationships as constraints in an optimization problem and still expect it to be easy to solve. The question
is, of course, how to encode the relationship

y = h(x)

into an optimization problem.

11.4.2 Modelling techniques


We now give a few examples of ML models that can be reformulated into MILP form, and we will
treat this as a modelling exercise.

Linear regression The easiest case of transforming decision variables x into a ‘score’ is, of course, when
the ML tool used is a linear regression tool of the form

y = w> x

because this linear equation is the only thing that needs to be added to the problem formulation.

Regression trees A slightly more involved case is the situation where we use a regression tree. In this
case, we have partitioned the domain space of x into L different regions where L is the number of leaves in

the tree, where each leaf is described using a set of linear constraints

Al x ≤ bl ,

with pl being the predictor value assigned to the l-th leaf, so that the predictive tool is

    y = pl    if Al x ≤ bl ,    for l = 1, . . . , L.

This relationship can be encoded

by means of the following system of constraints:



    A1 x ≤ b1 + (1 − z1 )M
    ...
    AL x ≤ bL + (1 − zL )M
    z1 + . . . + zL = 1
    y = p1 z1 + . . . + pL zL
    z1 , . . . , zL ∈ {0, 1}.

However, this reformulation can be a bit of an overkill: it might be more effective to parallelize the search for
the best x by solving L smaller optimization problems, simply ‘trying’ each of the leaves l for which pl ≥ ymin .
This has the benefit of avoiding the usage of integer variables.
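As a small illustration of the big-M encoding above, the following Pyomo sketch embeds a made-up two-leaf regression tree on a scalar x ∈ [0, 1] (split at 0.5, leaf predictions 0.2 and 0.8); on this domain M = 1 is large enough, and all names are our own.

import pyomo.environ as pyo

# Assumed toy tree: leaf 1 is x <= 0.5 with prediction 0.2,
# leaf 2 is x > 0.5 (written as -x <= -0.5) with prediction 0.8.
A = [[1.0], [-1.0]]
b = [0.5, -0.5]
p = [0.2, 0.8]
L = len(p)
M = 1.0

m = pyo.ConcreteModel('embedded regression tree')
m.x = pyo.Var(bounds=(0, 1))
m.y = pyo.Var()                              # the predicted value
m.z = pyo.Var(range(L), domain=pyo.Binary)   # leaf indicators

# Leaf constraints are active only for the selected leaf (big-M relaxation otherwise).
m.leaf = pyo.Constraint(range(L), rule=lambda m, l:
    A[l][0] * m.x <= b[l] + (1 - m.z[l]) * M)
m.one_leaf = pyo.Constraint(expr=sum(m.z[l] for l in range(L)) == 1)
m.prediction = pyo.Constraint(expr=m.y == sum(p[l] * m.z[l] for l in range(L)))

# The remaining problem (objective and other constraints) then uses m.x and m.y;
# here we simply maximize the predicted value as a placeholder.
m.obj = pyo.Objective(expr=m.y, sense=pyo.maximize)
# pyo.SolverFactory('glpk').solve(m)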

Neural networks with ReLU activation function When we use a dense neural network, the only
difficult part about formulating it using MILP constraints is the nonlinear transformation – everything else
is just a combination of products of the neural network’s weights and the decision variables.
So, how do we transform the output x of a given node into the value

y = max{0, x}?

Provided that the absolute value |x| can be bounded using a large enough number M , we can formulate it
as:


    y ≥ x
    y ≤ x + M (1 − z)
    y ≤ M z
    y ≥ 0
    z ∈ {0, 1}.
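A minimal Pyomo sketch of this encoding for a single node (the bound M = 10 is an assumption) is:

import pyomo.environ as pyo

M = 10.0                                     # assumed bound on |x|

m = pyo.ConcreteModel('relu node')
m.x = pyo.Var(bounds=(-M, M))                # pre-activation value of the node
m.y = pyo.Var(domain=pyo.NonNegativeReals)   # y >= 0
m.z = pyo.Var(domain=pyo.Binary)             # z = 1 if the node is 'active' (x >= 0)

m.lower = pyo.Constraint(expr=m.y >= m.x)
m.upper_active = pyo.Constraint(expr=m.y <= m.x + M * (1 - m.z))
m.upper_inactive = pyo.Constraint(expr=m.y <= M * m.z)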

Exercise 11.2.
Suppose you are to formulate an optimization problem where the constraint learning part consists of
a logistic regression function:
    y = exp(wᵀx) / (1 + exp(wᵀx)) .
How would you (approximately) model it using (mixed-integer) linear programming? After you have
tried, you can check a possible solution to this question in [5].

11.4.3 Overall strategy for optimization with constraint learning


One issue that has not been mentioned up till now is a very important one – it might happen that the optimal
solution chosen by the ‘big optimization problem’ lies very far away from all the samples used to train the

predictive model. While this is not forbidden by definition, common sense tells us that the predictive model’s
estimates for a solution like this can be far from exact.
For that reason, most of the authors developing optimization with constraint learning approaches suggest
using a ‘trust region approach’. That is, to define an area for the decision variable vector x for which the
predictive model is still trusted. This area can be defined, for example, as the convex hull of all the samples
xi so far, or some region around this convex hull.
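For instance (a small addition for concreteness), membership in the convex hull of the samples x1 , . . . , xN can itself be written with linear constraints by introducing weights λ1 , . . . , λN ≥ 0:

    x = ∑_{i=1}^{N} λi xi ,    ∑_{i=1}^{N} λi = 1,

so that adding this trust region keeps the overall problem an LP/MILP.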
Summarizing, a general approach for solving a decision problem based on a machine learning tool would
be as follows.

1. Identify what relationships inside the decision problem are not clear and are best discovered using ML.
2. Collect the data and train the ML tool you need.
3. If the decisions to be made on the basis of the ML tool are simple – stop here.

4. If the decisions to be made on the basis of the ML tool form a complicated array of decisions best
investigated using algorithms, set up an optimization problem with the ML tool embedded and solve
it.
5. Validate the obtained solution, for example by bootstrapping samples from your data set and
investigating the solution’s performance on them.

12 Exercise solutions
Exercise 2.22
1. For the sake of brevity, we define x := sign (v1 ) e1 . Then,

    w̃ := v − ‖v‖ x

and

    w := w̃ / ‖w̃‖ .
Then,
    Hp v = (I − 2wwᵀ) v = v − 2 (wᵀv) w = v − (2/‖w̃‖²) (w̃ᵀv) w̃ .

Here,

    ‖w̃‖² = 2 (‖v‖² − ‖v‖ xᵀv)                                    (12.1)

and

    2 (w̃ᵀv) = 2 (v − ‖v‖ x)ᵀ v = 2 (‖v‖² − ‖v‖ xᵀv) .            (12.2)

Since eqs. (12.1) and (12.2) are equal, we have

    Hp v = v − w̃ = v − (v − ‖v‖ x) = ‖v‖ x .

Now, plugging in the definition of x, we obtain the final result

    Hp v = ‖v‖ sign (v1 ) e1 .

References
[1] Charu Aggarwal. Linear algebra and optimization for machine learning. Springer, 2020.
[2] Charu C. Aggarwal. Neural networks and deep learning. Springer, 2018.
[3] Amir Beck. Introduction to nonlinear optimization: Theory, algorithms, and applications with
MATLAB. SIAM, 2014.
[4] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
[5] David Bergman, Teng Huang, Philip Brooks, Andrea Lodi, and Arvind U Raghunathan. Janos: an inte-
grated predictive and prescriptive modeling framework. INFORMS Journal on Computing, 34(2):807–
816, 2022.
[6] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082,
2017.
[7] Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization, volume 6. Athena
Scientific Belmont, MA, 1997.
[8] Christopher M. Bishop. Pattern recognition and machine learning. Information Science and Statistics.
Springer, New York, 2006.
[9] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university
press, 2004.
[10] Steven L Brunton and J Nathan Kutz. Data-driven science and engineering: Machine learning,
dynamical systems, and control. Cambridge University Press, 2019.
[11] Francois Chollet. Deep learning with Python. Simon and Schuster, 2021.
[12] Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization.
SIAM, 2009.
[13] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine
preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553,
November 2009.
[14] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,
signals and systems, 2(4):303–314, 1989.
[15] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal
Processing Magazine, 29(6):141–142, 2012.
[16] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of machine learning research, 12(7), 2011.
[17] Morris L Eaton. Multivariate statistics: a vector space approach. 1983.
[18] Adejuyigbe Fajemisin, Donato Maragno, and Dick den Hertog. Optimization with constraint learning:
A framework and survey. arXiv preprint arXiv:2110.02121, 2021.
[19] Peter I Frazier. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
[20] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural net-
works. In Proceedings of the thirteenth international conference on artificial intelligence and statistics,
pages 249–256. JMLR Workshop and Conference Proceedings, 2010.

[21] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Adaptive Computation and Machine
Learning series. MIT Press, 2016.
[22] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, Second Edition. Springer Series in Statistics. Springer New York, 2009.

[23] Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the state-of-the-art. Knowledge-Based
Systems, 212:106622, 2021.
[24] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture
6a overview of mini-batch gradient descent. http://www.cs.toronto.edu/~tijmen/csc321/slides/
lecture_slides_lec6.pdf, 2012.

[25] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–
257, 1991.
[26] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal
approximators. Neural networks, 2(5):359–366, 1989.
[27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[28] J Nathan Kutz. Data-driven modeling & scientific computation: methods for complex systems & big
data. Oxford University Press, 2013.
[29] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape
of neural nets. In Neural Information Processing Systems, 2018.
[30] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural
networks: A view from the width. Advances in neural information processing systems, 30, 2017.
[31] James Martens et al. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735–742,
2010.

[32] Andreas C Müller and Sarah Guido. Introduction to machine learning with Python: a guide for data
scientists. O’Reilly Media, Inc., 2016.
[33] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[34] Fernando Nogueira. Bayesian Optimization: Open source constrained global optimization tool for
Python, 2014–.
[35] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy
employed by v1? Vision research, 37(23):3311–3325, 1997.
[36] Krzysztof Postek, Alessandro Zocca, Joaquim Gromicho, and Jeffrey Kantor. Data-Driven Mathematical
Optimization in Python. Online, 2022.
[37] Ralph Tyrell Rockafellar. Convex analysis. Princeton university press, 2015.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In International Conference on Medical image computing and computer-assisted
intervention, pages 234–241. Springer, 2015.

[39] Walter Rudin. Functional analysis. McGraw-Hill, New York, 1991.

[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large
Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252,
2015.
[41] Yousef Saad. Iterative methods for sparse linear systems. SIAM, 2003.

[42] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics
of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[43] Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al. Learning with kernels: support vector
machines, regularization, optimization, and beyond. MIT press, 2002.

[44] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to
Algorithms. Cambridge University Press, 2014.
[45] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning
research, 15(1):1929–1958, 2014.

[46] Gilbert Strang. Linear algebra and learning from data. Wellesley-Cambridge Press Cambridge, 2019.
[47] Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. Siam, 1997.
[48] Jeremy Watt, Reza Borhani, and Aggelos K Katsaggelos. Machine learning refined: Foundations,
algorithms, and applications. Cambridge University Press, 2020.

[49] Laurence A Wolsey. Integer programming. John Wiley & Sons, 2020.
