
Organizational Information
Motivation
Machine Learning Types
Machine Learning Pipeline
Summary
Linear Regression - Motivation
Linear Regression - Overall Picture
Linear Regression - Model
Linear Regression - Optimization
Linear Regression - Basis Functions
Logistic Regression - Motivation
Logistic Regression - Framework
Overfitting and Underfitting
Problem Statement
Optimization
Kernel Trick
Hard and Soft Margin
Regression
Summary
Applications
Intuition
Mathematics
Summary
Perceptron
Multilayer Perceptron
Loss Function
Gradient Descent
Learning Process
Convolution
Pooling
Applications
Machine Learning for Engineers
Organizational Information

Course Organizers
Prof. Dr. Bjoern M. Eskofier
Machine Learning and Data Analytics Lab
University Erlangen-Nürnberg

Prof. Dr. Jörg Franke


Institute for Factory Automation and Production Systems
University Erlangen-Nürnberg

Prof. Dr. Nico Hanenkamp


Institute for Resource and Energy Efficient Production Systems
University Erlangen-Nürnberg

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2


Course Details
5 ECTS
4 SWS
Written Exam
• Exam is in English. The questions are based on understanding the methods and
ability to implement them.
• Two exercises should be successfully completed.
• Exam date will be announced for all universities.

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3


Course Structure
Lectures
1. Introduction to Machine Learning (today)
2. Linear Regression
3. Support Vector Machines
4. Dimensionality Reduction
5. Deep Neural Networks

Exercises
Two real-world industrial applications
• Exercise 1: Energy prediction using Support Vector Machines
• Exercise 2: Image classification using Convolutional Neural Networks

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4


Lecture Material
Pattern Recognition and Machine Learning
by C. Bishop, 2006

Deep Learning
by I. Goodfellow, Y. Bengio, A. Courville, 2016

Machine Learning: A Probabilistic Perspective
by K. Murphy, 2012

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5


Introduction to Machine Learning
Motivation

Four Industrial Revolutions
We are currently in the 4th Industrial Revolution.

• 1. Industrial Revolution (from about 1750): Steam machines allow industrialization.
• 2. Industrial Revolution (from about 1870): Mass division of labor with the help of electrical energy (keyword: conveyor belts).
• 3. Industrial Revolution (from about 1960): Electronics and IT enable automation-driven rationalization as well as varied series production.
• 4. Industrial Revolution (from about 2013): Interlinked production. Industrial production is interlinked with information and communication technology; machines communicate with each other and optimize the process. In April 2013, the national platform Industry 4.0 was founded at the Hannover Messe.
Source: Industrie 4.0 in Produktion, Automatisierung und Logistik, T. Bauernhansl, M. Hompel, B. Vogel-Heuser, 2014
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
Core of the Fourth Industrial Revolution is Digitalization
(Chart: percentage of Mechanical Engineering, Electrical Engineering, Computer Science and Systems Engineering in mechanical and plant engineering, 1960–2020.)

• In the 1960s, mechanical and plant engineering was dominated by mechanical engineering (around 95%).
• The share of mechanical engineering has been decreasing steadily with each decade, down to roughly 30% today.
• The impact of computer science (including AI and machine learning) has been increasing since the 1980s.
→ Mechanical and plant engineering is composed of multiple fields and has become an interdisciplinary area, unlike in the 1960s.
Source: Automatisierung 4.0, T. Schmertosch, M. Krabbes, 2018
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3
What drives that progress in Computer Science?

Of course: algorithmic advances (and not only in AI).
BUT: one significant force of progress IS AI.
• You can see that clearly in the number of AI publications and the applications in that area.

(Chart: annual number of AI papers, 2000–2015, total and for the USA, Europe and China.)

So now: What is Artificial Intelligence actually? And what is the difference to Machine Learning and Deep Learning?
Source: Left: Scopus, 2019


23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4
Artificial Intelligence vs. Machine Learning
• So what is Artificial Intelligence? What is Machine Learning? What is
Deep Learning?
• AI: algorithms mimic the behavior of humans
• Machine Learning: algorithms do that by learning from data (features are hand-crafted!)
• Deep Learning: algorithms do the learning from data fully automatically (the algorithm finds good features itself)

Source: https://www.ibm.com/blogs
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
Driving factors for the Advancement of AI
There are three important reasons for these advancements!
1. Increase of the computational power
2. Increase in the amount of available data
3. Development of new algorithms

Sources:
Left Image: https://datafloq.com
Middle Image:https://europeansting.com
Right Image: https://medium.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6
Increase of Computational Power
• Transistors are the building blocks of CPUs: small “blocks” in the computer doing “simple” calculations.
• The number of transistors is correlated with computational power.
• Moore’s Law: “The number of transistors in a dense integrated circuit (IC) doubles about every two years.”

(Chart: transistors per square millimeter by year, 1970–2020, with a logarithmic y-axis.)

• The number of transistors per mm² increases linearly on the logarithmic scale, i.e. it grows exponentially.
→ This means exponential growth of computational power, as stated in Moore’s Law.

Source: https://medium.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7
Computational Power of today’s supercomputers
TOP500 Supercomputers (June 2020)
• The increase in computational power is especially noticeable in supercomputers.
• Supercomputers have millions of cores (2020).
• The leaders (2020) are Japan, the USA and China.
• Germany is on the 13th place; the best German supercomputer is SuperMUC-NG.

#  | System               | Specs                                                                                  | Site           | Country | Cores      | Rmax (Pflop/s) | Power (MW)
1  | Supercomputer Fugaku | Fujitsu A64FX (48C, 2.2GHz), Tofu Interconnect D                                       | RIKEN R-CCS    | Japan   | 7,288,072  | 415.5          | 28.3
2  | Summit               | IBM POWER9 (22C, 3.07GHz), NVIDIA Volta GV100 (80C), Dual Rail Mellanox EDR Infiniband | DOE/SC/ORNL    | USA     | 2,414,592  | 148.6          | 10.1
3  | Sierra               | IBM POWER9 (22C, 3.1GHz), NVIDIA Tesla V100 (80C), Dual Rail Mellanox EDR Infiniband   | DOE/NNSA/LLNL  | USA     | 1,572,480  | 94.6           | 7.4
4  | Sunway TaihuLight    | Shenwei SW26010 (260C, 1.45GHz), Custom Interconnect                                   | NSCC in Wuxi   | China   | 10,649,600 | 93             | 15.4
5  | Tianhe-2A            | Intel Ivy Bridge (12C, 2.2GHz)                                                         | NSCC Guangzhou | China   | 4,981,760  | 61.4           | 18.5
13 | SuperMUC-NG          | ThinkSystem SD650, Xeon Platinum 8174 (24C, 3.1GHz), Intel Omni-Path                   | Leibniz RZ     | Germany | 305,856    | 19             | -

Source: https://www.top500.org
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
Increase in the Amount of Available Data
• The other big driving factor for AI: availability of DATA.
• The amount of created data jumped from 2 zettabytes (2010) to 47 zettabytes (2020).
• It comes from all things around us (cars, IoT, aircraft, …).
• According to General Electric (GE), each of its aircraft engines produces around twenty terabytes of data per hour.

https://www.statista.com https://www.forbes.com

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9


Development of New Algorithms
• 2006: Geoffrey Hinton et al. published a paper showing how to train deep neural networks; they branded this technique “Deep Learning”.
• 2015: Machine learning algorithms officially exceeded human capability in visual object recognition.

Source: A Roadmap for Foundational Research on Artificial Intelligence in Medical Imaging, Curtis P. Langlotz et al., 2018
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
Industrial Examples – Real Environments
KUKA Bottleflip Challenge (video):
• Engineers from KUKA tackled the “Bottleflip Challenge”: the robot flips a bottle in the air.
• Fun fact: the robot trained itself in a single night.

WAYMO driverless driving (video):
• Driving has traditionally been a human job.
• Autonomous driving could progress thanks to artificial neural networks and deep learning; a traditionally human job is carried out by machine learning algorithms.
Source: https://www.youtube.com/watch?v=HkCTRcN8TB0&t by machine learning algorithms
Source: https://www.youtube.com/watch?v=aaOB-ErYq6Y&t

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11


Game Examples – Defined Rules
AlphaGo and Go:
• Go is considered the most challenging classical game.
• In October 2015, AlphaGo (created by DeepMind) won 5–0 against the European champion Fan Hui.
• AlphaGo uses an “advanced search tree”, a deep network and reinforcement learning.

AlphaStar and StarCraft 2:
• StarCraft II is one of the most enduring and popular real-time strategy video games of all time and a widely popular e-sport.
• AlphaStar is the first AI to reach the top league without any game restrictions.

Source: https://deepmind.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12
Introduction to Machine Learning
Machine Learning Types

Machine Learning Categories
Machine Learning is divided into three categories:

• Supervised Learning: the algorithm learns based on feedback from a teacher.
  Example: classification of an image (cat vs. dog)
• Unsupervised Learning: the algorithm finds meaningful patterns in the data using a metric.
  Example: finding different buyer patterns
• Reinforcement Learning: learning based on positive & negative reward.
  Example: automated movement of a robot in the wild
Source: Business Intelligence and its relationship with the Big Data, Data Analytics and Data Science, Armando Arroyo, 2017
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
Supervised Learning
The agent observes some example Input (Features) and Output (Label)
pairs and learns a function that maps input to output.

Key Aspects:
• Learning is explicit
• Learning using direct feedback
• Data with labeled output

→ Resolves classification and regression problems

Source: https://www.kdnuggets.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4
Supervised Learning Problems

(Figure, left: a classification problem with features x0, x1 and class labels y = Cat / y = Dog; right: a regression problem with feature x and continuous label y.)

Source: https://www.kdnuggets.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
Regression
Regression is used to predict a continuous value.

Training is based on a set of input – output pairs (samples):
𝒟 = {(x₀, y₀), (x₁, y₁), …, (xₙ, yₙ)}

Sample: (xᵢ, yᵢ)
• xᵢ ∈ ℝᵐ is the feature vector of sample i
• yᵢ ∈ ℝ is the label value of sample i

(Figure: a regression dataset, feature x vs. label y.)

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6


Regression
Goal:
Find a relationship (function) which best expresses the relation between input and output!

That means, we fit a regression model f to all samples:
f(xᵢ) = yᵢ,  ∀(xᵢ, yᵢ) ∈ 𝒟

In this case f is a linear regression model (the black line in the figure).

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7
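To make this concrete, here is a minimal sketch (not part of the lecture) that fits a linear regression model f to a small, made-up dataset 𝒟 of (xᵢ, yᵢ) samples with scikit-learn; all numbers are illustrative assumptions.

```python
# Minimal sketch: fit f(x_i) ~ y_i on a tiny, made-up regression dataset D.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # feature vectors x_i (n x m, here m = 1)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])             # continuous label values y_i

model = LinearRegression()
model.fit(X, y)                                      # fit the linear model to all samples in D

print(model.coef_, model.intercept_)                 # learned slope and intercept of the line
print(model.predict(np.array([[6.0]])))              # predict y for an unseen x
```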


Example: Predicting Surface Quality
In this example, we consider a milling process. We take speed and feed
as input features and the value of surface roughness is predicted
Milling Process Input: Speed, Feed Output: Surface Quality

https://de.wikipedia.org/wiki/Zerspanen

Objective: Realization:
• Prediction of the surface quality • Speed and Feed data is gathered and the
based on the production parameters surface roughness measured for some trials
• By using regression algorithms surface
quality is predicted
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
Example: Prediction of Remaining Useful Life (RUL)
“Health Monitoring and Prediction (HMP)” system (developed by BISTel)
Objective:
• Machine components degrade over time
• This causes malfunctions in the production line
• Replace component before the malfunction!

Realization:
• Use data of malfunctioning systems
• Fit a regression model to the data and predict
the RUL
• Use the model on active systems and apply necessary maintenance

Source: https://www.bistel.com/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9
Classification
Classification is used to predict the class of the input.

Important:
The output belongs to only one class.

Sample: sᵢ = (xᵢ, yᵢ)
yᵢ ∈ L is the label of sample i

In this example: L = {Cat, Dog}

(Figure: samples with features x0, x1 and class labels y = Cat / y = Dog.)
Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
Classification
Goal:
Find a way to divide the input into the output classes!

That means, we find a decision boundary f for all samples:
f(xᵢ) = yᵢ,  ∀(xᵢ, yᵢ) ∈ 𝒟

In this case f is a linear decision boundary (the black line in the figure, separating y = Cat from y = Dog in the feature space x0, x1).

Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11
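As a counterpart to the regression sketch, the following minimal example (not part of the lecture) learns a linear decision boundary on made-up two-dimensional samples with the class labels cat and dog; logistic regression is used here only as one example of a linear classifier.

```python
# Minimal sketch: learn a linear decision boundary f on made-up 2-D "cat vs. dog" data.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 1.2], [1.5, 0.8], [2.0, 1.0],    # samples of class "cat"
              [4.0, 3.8], [4.5, 4.2], [5.0, 3.9]])   # samples of class "dog"
y = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

clf = LogisticRegression()
clf.fit(X, y)                                   # fits a linear boundary between the classes

print(clf.predict([[1.8, 1.1], [4.2, 4.0]]))    # expected: ['cat' 'dog']
```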
Example: Foreign object detection
Objective:
• Foreign object detection on a production part
• After a production step a chip can remain
on the piston rod
• Quality assurance: Only parts without chip
are allowed for the next production step
Realization:
• A camera system records 3,000 pictures
• All images are labeled by a human
• A machine learning classification algorithm was trained to differentiate between chip and no-chip situations
(Images: a piston rod without chip and a piston rod with chip.)

Source: Implementation and potentials of a machine vision system in a series production using deep learning and low-cost hardware, Hubert
Würschinger et al., 2020
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12
Classification algorithms
(Figure: several example datasets (rows) and the decision regions learned by different algorithms (columns): Nearest Neighbors, Linear SVM, RBF SVM, Gaussian Process. The number in the lower right of each panel is the classification accuracy.)

Different algorithms use different ways to classify data.
Therefore: the algorithms perform differently on the datasets.

Example: the linear SVM has bad results on the second dataset.
Reason: there is no linear way to distinguish the data.
Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13
Classification vs. Regression
Both are supervised learning problems, i.e. they require labeled datasets. The difference is how they are used for different machine learning problems.

Regression:
• The outputs are continuous or real values.
• We try to fit a regression model which predicts the output as accurately as possible.
• Regression algorithms are used to solve regression problems such as weather prediction, house-price prediction, stock-market prediction, etc.

Classification:
• The output variable must be a discrete value (class).
• We try to find a decision boundary which can divide the dataset into different classes.
• Classification algorithms are used to solve classification problems such as hand-written digit recognition (MNIST), speech recognition, identification of cancer cells, defective vs. non-defective solar cells, etc.

Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
Machine Learning Categories
Machine Learning is divided into three categories:

• Supervised Learning: the algorithm learns based on feedback from a teacher.
  Example: classification of an image (cat vs. dog)
• Unsupervised Learning: the algorithm finds meaningful patterns in the data using a metric.
  Example: finding different buyer patterns
• Reinforcement Learning: learning based on positive & negative reward.
  Example: automated movement of a robot in the wild
Source: Business Intelligence and its relationship with the Big Data, Data Analytics and Data Science, Armando Arroyo, 2017
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 15
Unsupervised Learning
Unsupervised learning observes some example Input (Features) – No
Labels! - and finds patterns based on a metric

Key Aspects:
• Learning is implicit
• Learning using indirect feedback
• Methods are self-organizing

Resolves clustering and dimensionality reduction problems

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 16


Example for Clustering
Iris Flower data set
Contains 50 samples each of the three species Iris setosa, Iris versicolor and Iris virginica.
Measured parameters:
petal length, petal width, sepal length, sepal width

Iris setosa Iris versicolor Iris virginica


Source: Fisher's Iris data set
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 17
Clustering
Goal: Identify similar samples and assign them the same label

Mostly used for data analysis, data exploration, and/or data preprocessing

Data should be:
• Homogeneous within a cluster (intra-cluster distance)
• Heterogeneous between clusters (inter-cluster distance)

(Scatter plot: petal length (1–7) vs. petal width (0–2.5) for the Iris samples.)
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 18
Clustering
Goal: Identify similar samples and assign them the same label

Mostly used for data analysis, data exploration, and/or data preprocessing

Data should be:
• Homogeneous within a cluster (intra-cluster distance)
• Heterogeneous between clusters (inter-cluster distance)

(Scatter plot: the same Iris data with the two identified clusters, Cluster 1 and Cluster 2, marked.)
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19
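The clustering above can be reproduced in a few lines; this is a minimal sketch (not part of the lecture) that runs k-means on the petal length and petal width of the Iris dataset, assuming scikit-learn's bundled copy of the data and a fixed choice of two clusters.

```python
# Minimal sketch: k-means clustering on the Iris petal measurements (no labels used).
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = iris.data[:, 2:4]                       # petal length and petal width only

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)              # cluster index (0 or 1) for every sample

print(kmeans.cluster_centers_)              # cluster centers in (petal length, petal width)
print(labels[:10])
```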
Clustering vs. Classification
Clustering does not have labeled data and may not find all classes!

(Figure, left: classification of the Iris data into Iris setosa, Iris versicolor and Iris virginica; right: clustering of the same data into only two clusters, Cluster 1 and Cluster 2. Both panels: petal length vs. petal width.)
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20
Clustering algorithms
(Figure: example datasets (rows) and the clusters found by K-Means, Affinity Propagation, Mean Shift, Spectral Clustering and Ward (columns).)

Different clustering methods produce different results, e.g. some algorithms “find” more clusters than others.

Example: K-Means performs “well” on the third dataset, but not on datasets one and two.
Reason: K-Means can only identify “circular” clusters.

Source: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 21
Curse of Dimensionality
“As the number of features or dimensions grows, the amount of data we
need to generalize accurately grows exponentially.” – Charles Isbell

The intuition in lower dimensions does not hold in higher dimensions:


• Almost all samples are close to at least one boundary
• Distances (e.g. Euclidean) between all samples are similar
• Features might be wrongly correlated with outputs
• Finding decision boundaries becomes more complex
→ Problems become much more difficult to solve!

Source: Chen L. (2009) Curse of Dimensionality. In: LIU L., ÖZSU M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.
https://doi.org/10.1007/978-0-387-39940-9_133
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 22
Example: Curse of Dimensionality
The production system has N sensors attached with either the input
set to “On” or “Off”

Question: How many samples do we need, to have all possible sensor


states in the dataset?

N = 1:   D = 2¹ = 2
N = 10:  D = 2¹⁰ = 1024
N = 100: D = 2¹⁰⁰ ≈ 1.27 × 10³⁰

For N = 100, the number of required samples is astronomically large, far more than any real dataset could ever contain!
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 24
Dimensionality Reduction
The goal:
Transform the samples from a high-dimensional to a lower-dimensional representation!

Ideally:
Find a representation which solves your problem!

Typical approaches:
• Feature selection
• Feature extraction

Example: an identical high-dimensional dataset (features S0–S8) is mapped to a lower-dimensional representation (features T0–T3) by applying a function f to the samples.

High-dimensional data:
        | S0  | S1   | S2   | S3   | S4  | S5 | S6  | S7 | S8
Sample0 | 0.2 | 0.1  | 11.1 | 2.2  | Off | 7  | 1.1 | 0  | 1.e-1
Sample1 | 1.2 | -0.1 | 3.1  | -0.1 | On  | 9  | 2.3 | -1 | 1.e-4
Sample2 | 2.7 | 1.1  | 0.1  | 0.1  | Off | 10 | 4.5 | -1 | 1.e-9
Sample3 | 3.1 | 0.1  | 1.1  | 0.2  | Off | 1  | 6.6 | -1 | 1.e-1

Lower-dimensional representation (after applying f):
        | T0   | T1   | T2 | T3
Sample0 | 11.3 | 0.1  | -1 | 7.8
Sample1 | 4.3  | -0.1 | -1 | 6.8
Sample2 | 2.8  | 1.1  | 1  | 7.1
Sample3 | 4.2  | 0.1  | 1  | 6.9
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 27
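As one possible feature-extraction method, the following minimal sketch (not part of the lecture) uses PCA to map made-up, purely numeric 9-dimensional samples to a 4-dimensional representation; the random data and the choice of PCA as the function f are assumptions for illustration only.

```python
# Minimal sketch: feature extraction with PCA as one example of a function f.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_high = rng.normal(size=(100, 9))      # hypothetical samples with 9 numeric features

pca = PCA(n_components=4)               # the "function f" here is a linear projection
X_low = pca.fit_transform(X_high)       # lower-dimensional representation of the samples

print(X_high.shape, "->", X_low.shape)  # (100, 9) -> (100, 4)
print(pca.explained_variance_ratio_)    # variance kept by each new feature
```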


Machine Learning Categories
Machine Learning is divided into three categories:

• Supervised Learning: the algorithm learns based on feedback from a teacher.
  Example: classification of an image (cat vs. dog)
• Unsupervised Learning: the algorithm finds meaningful patterns in the data using a metric.
  Example: finding different buyer patterns
• Reinforcement Learning: learning based on positive & negative reward.
  Example: automated movement of a robot in the wild
Source: Business Intelligence and its relationship with the Big Data, Data Analytics and Data Science, Armando Arroyo, 2017
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 28
Reinforcement Learning
Reinforcement learning observes some example input (features) – no labels! – and finds the optimal action, i.e. it maximizes its future reward.

Key Aspects:
• Learning is implicit
• Learning using indirect feedback
based on trials and reward signals
• Actions are affecting future measurements (i.e. inputs)

Resolves control and decision problems


i.e. controlling agents in games or robots
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 29
Reinforcement Learning
Goal: Agents should take actions in an environment which maximize the
cumulative reward.

To achieve this RL uses reward and punishment signals based on the


previous actions to optimize the model.

Source: Reinforcement Learning: An Introduction, Richard S. Sutton, Andrew G. Barto, 2018


23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 30
Reinforcement Learning vs. (Un)supervised Learning
(Un)supervised Learning:
• The feedback is given by a supervisor or a metric.
• Feedback is instantaneous.
• Learning uses static data (no re-recording necessary).
• Prediction does not affect future data; the data is assumed independent, identically distributed (i.i.d.).

Reinforcement Learning:
• No teacher/supervisor; the feedback is given by a reward signal.
• Feedback can be delayed, not instantaneous (credit assignment problem).
• Learning happens by interaction between environment and agent over time; training is based on trials (re-recording necessary).
• Active learning process: the agent’s actions affect the environment and therefore the subsequent data it receives. Actions have consequences!
→ The predictions (actions) affect future measurements, i.e. the measurements are not necessarily i.i.d.

Source: Reinforcement Learning: An Introduction, Richard S. Sutton, Andrew G. Barto, 2018


23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 31
Example: Flexible handling of components
Google designed an experiment to handle objects using a robot
After training the RL algorithm: The robots succeeded in 96% of the grasp
attempts across 700 trial grasps on previously unseen objects
Training objects:
Training and data
acquisition:
• 7 industrial robots
• 10 GPU's
• Numerous CPU's
• 580,000 gripping
attempts

Source: https://ai.googleblog.com/2018/06/scalable-deep-reinforcement-learning.html
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 32
Introduction to Machine Learning
Machine Learning Pipeline

The Machine Learning Pipeline
A concept that provides guidance in a machine learning project
• Step-by-step process
• Each step has a goal and a method

There exist multiple pipelines and concepts; however, the idea behind all pipelines and their steps is the same!

The 8-step machine learning pipeline by Artasanchez & Joshi:

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

Source: Artificial Intelligence with Python - Second Edition, by Alberto Artasanchez, Prateek Joshi
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
1. Problem Definition
Our aim with machine learning is to develop a solution for a problem. In order to develop a satisfying solution, we need to define the problem. This definition of the problem lays the foundation for our solution: it shows us what kind of data we need and what kind of algorithms we can use.
Examples:
• If we are trying to detect faulty equipment, we have a classification problem
• If we are trying to predict a continuous number, we have a regression problem

Prediction of the energy consumption of a milling machine – problem definition

• Want to predict the necessary energy to conduct a process step


• The prediction should be made on the basis of the milling parameters
• This gives us transparency about the process
• Further this should allow the optimization of the parameters

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3


2. Data Ingestion
• After defining the problem, we have to define which data we need to describe the problem
• The data should represent the problem and the information needed to successfully predict the target
value
• If we have defined the data we need, we can gather them in databases or record them with trials

Prediction of the energy consumption of a milling machine – data ingestion


Data: • Data can be collected with
• Feed rate sensors
• Speed • Trials can be conducted on an
• Path existing machine
• Energy consumption

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4


3. Data Preparation
We gathered the data and created our dataset. But it is highly unlikely that we can use our data as soon as
we gather it. We have to prepare it for our machine learning solution. For example:
• We can get rid of instances with missing values or outliers
• We can use dimensionality reduction for higher dimensional dataset,
• We can normalize the data to transform our data to a common scale.
The quality of a machine learning algorithm is highly dependent on the quality of data!

Prediction of the energy consumption of a milling machine – data preparation


• In our dataset we have outliers and missing values
• We have enough instances without outliers and missing values
• Therefore the easiest way to treat the instances with missing values and outliers is to delete them

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
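A minimal sketch (not part of the lecture) of the deletion strategy described above, using pandas; the column names and values are made up, and the quantile threshold is just one crude way to flag outliers.

```python
# Minimal sketch: delete instances with missing values and obvious outliers.
import pandas as pd

df = pd.DataFrame({
    "feed_rate": [800, 850, None, 820, 810],                # hypothetical milling data
    "speed":     [1200, 1250, 1230, 1190, 1210],
    "energy":    [0.046, 0.049, 0.047, 9.999, 0.045],       # 9.999 is an outlier
})

df = df.dropna()                                            # remove instances with missing values
df = df[df["energy"] < df["energy"].quantile(0.95)]         # crude outlier filter on the target

print(df)
```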


An Example Dataset and Terminology
Terminology: each row is an instance; each column is an attribute; a cell holds the value of the instance for that attribute; the last column is the label (target value).

Day | Weather | Temp. | Humidity | Wind   | Tennis recommended?
1   | Sunny   | Hot   | High     | Weak   | No
2   | Sunny   | Hot   | High     | Strong | No
3   | Cloudy  | Hot   | High     | Weak   | Yes
4   | Rainy   | Mild  | High     | Weak   | Yes
…   | …       | …     | …        | …      | …

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6


4. Data Segregation
Before we train a model we have to separate our data set:
• We have to separate the target variable
• Training Set: This set is used to fit our model
• Validation Set: the validation set is used to test the fit of the model and tune its hyperparameters
• Test Set: Test set is used to evaluate the performance of the final version of our model

Prediction of the energy consumption of a milling machine – data segregation


Attributes / independent variables: feed rate, speed, path
Target variable: energy consumption

• We use a training/test split of 80%/20%
• Because we do not perform hyperparameter optimization, we need no validation set

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7
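A minimal sketch (not part of the lecture) of the 80%/20% split with scikit-learn; the column names and values below are hypothetical stand-ins for the recorded milling trials.

```python
# Minimal sketch: separate the target variable and create an 80%/20% train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feed_rate":          [800, 820, 850, 790, 810, 830, 860, 805, 815, 845],
    "speed":              [1200, 1210, 1250, 1180, 1205, 1220, 1260, 1195, 1215, 1240],
    "path":               [60, 62, 65, 58, 61, 63, 66, 59, 60, 64],
    "energy_consumption": [0.046, 0.047, 0.051, 0.044, 0.046, 0.048, 0.052, 0.045, 0.047, 0.050],
})

X = df[["feed_rate", "speed", "path"]]       # attributes / independent variables
y = df["energy_consumption"]                 # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)    # 80% training, 20% test

print(len(X_train), "training samples,", len(X_test), "test samples")
```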


5. Model Training
Next step is to select an algorithm to train our data. So the question is which algorithm to use:
• Labeled data → supervised learning algorithms
• Prediction of a continuous value → regression algorithm
• Prediction of a class → classification algorithm
• There are numerous algorithms with their strengths and weaknesses
• To find out which algorithm is the best for our data set we have to test them

Prediction of the energy consumption of a milling machine – model training

We have labeled data → supervised problem.
We have a continuous value, energy consumption (kWh) → regression problem.

We use the following regression algorithms:
→ Linear regression
→ Random forest regression
→ Support vector regression

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
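A minimal sketch (not part of the lecture) that fits the three regression algorithms named above; the synthetic data only stands in for the milling measurements and is not from the example.

```python
# Minimal sketch: train linear regression, random forest regression and SVR on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 3))                                   # made-up features
y = 5 * X[:, 0] + 3 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 0.2, 200)  # made-up target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear regression": LinearRegression(),
    "random forest regression": RandomForestRegressor(random_state=42),
    "support vector regression": SVR(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "- test R^2:", round(model.score(X_test, y_test), 3))
```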


6. Model Evaluation
Training and evaluation of the model are iterative processes:
1. First, we train our model with the training set
2. Then we evaluate its performance on the validation set with evaluation metrics
3. Based on this information, we tune our algorithm’s hyperparameters

This iterative process continues until we decide that we cannot improve our algorithm anymore.
Then we use the test set to see the performance on unknown data.

Prediction of the energy consumption of a milling machine – model evaluation:
The Random Forest Regressor achieves the best results → we choose this model.

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9


Classification - Confusion Matrix and Accuracy
Confusion Matrix gives us a matrix as output and describes the performance of the model.

There are 4 important terms:
• True Positives: the cases in which we predicted YES, and the actual output was also YES
• True Negatives: the cases in which we predicted NO, and the actual output was NO
• False Positives: the cases in which we predicted YES, and the actual output was NO
• False Negatives: the cases in which we predicted NO, and the actual output was YES

Confusion matrix (n = 165):
            | Predicted: NO | Predicted: YES
Actual: NO  |      50       |      10
Actual: YES |       5       |     100

Classification accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.

Accuracy = 150/165 ≈ 0.909

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
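A minimal sketch (not part of the lecture) that computes a confusion matrix and the accuracy with scikit-learn; the eight labels below are made up and do not reproduce the 165-sample example above.

```python
# Minimal sketch: confusion matrix (rows = actual, columns = predicted) and accuracy.
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]
y_pred = ["no", "yes", "yes", "yes", "no",  "no", "yes", "no"]

print(confusion_matrix(y_true, y_pred, labels=["no", "yes"]))
print("accuracy:", accuracy_score(y_true, y_pred))   # correct predictions / all predictions
```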


Regression – Error Metrics
Mean Absolute Error (MAE):
• Average absolute difference between the original and predicted values
• Measures how far predictions were from the actual output
• Does not give an idea about the direction of the error

MAE = (1/N) · Σⱼ |yⱼ − ŷⱼ|   (sum over j = 1, …, N)

Mean Squared Error (MSE):
• Similar to MAE
• Takes the average of the squared difference between original and predicted values
• Larger errors become more pronounced, so that the model can focus on larger errors

MSE = (1/N) · Σⱼ (yⱼ − ŷⱼ)²   (sum over j = 1, …, N)

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11
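A minimal sketch (not part of the lecture) that evaluates both metrics directly with NumPy on made-up predictions.

```python
# Minimal sketch: MAE and MSE computed from their definitions.
import numpy as np

y_true = np.array([0.046, 0.052, 0.039, 0.061])   # made-up actual values
y_pred = np.array([0.044, 0.055, 0.041, 0.058])   # made-up predictions

mae = np.mean(np.abs(y_true - y_pred))            # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)             # Mean Squared Error

print("MAE:", mae)
print("MSE:", mse)
```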


7. Model Deployment
We have trained our model, evaluated it with evaluation metrics and tuned the hyperparameters with the help of our validation set. After finalization of our model, we evaluated it with the test set. Finally we are ready to deploy our model to our problem. Here we often have to consider the following issues:
• Real time requirements
• Robust hardware (sensors and processor)

Prediction of the energy consumption of a milling machine – model deployment

We are using the model to predict the energy consumption based on the production parameters.
Example: input Axis = 2, Feed = 800, Path = 60 → predicted energy consumption = 0.046

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12


8. Performance Monitoring
We must monitor the performance of our model and make sure it still produces satisfactory results.
Unsatisfactory results might be caused by different reasons. For example,
• Sensors can malfunction and provide wrong data
• The data can be out of the trained range of the model
Monitoring the performance can mean the monitoring of the input data regarding any changes as well as the
comparison of the prediction and the actual value after defined time periods

Prediction of the energy consumption of a milling machine – performance monitoring

We measure the actual energy consumption of 10 work steps after each maintenance procedure and calculate the error of the energy-consumption predictions.
• Within the defined range: the model is working sufficiently
• Outside the defined range: the model is working insufficiently → stop using it and perform a root cause analysis

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Candidate Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13


Introduction to Machine Learning
Summary

Summary
In this chapter we talked about:
• The history of machine learning and recent trends
• The different types of machine learning, such as supervised learning, unsupervised learning and reinforcement learning
• The steps involved in a common machine learning pipeline

After studying this chapter, you should be able to:


• Identify the type of machine learning needed to solve a given problem
• Understand the differences of regression and classification tasks
• Set up a generic project pipeline containing all relevant steps from problem definition to performance monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2


Machine Learning for Engineers
Linear Regression - Motivation

Motivation
Linear regression is used in multiple
scenarios each and every day!

Use Cases:
• Trend analysis in financial markets, sales and business
• Computer vision problems, i.e. registration & localization problems
• etc.

(Figure: financial development of FXCM with a linear regression trend channel.)

Source: https://www.tradingview.com/script/OZZpxf0m-Linear-Regression-Trend-Channel-with-Entries-Alerts/
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9
When do we use it?
The problem has to be simple:
• Dataset is small
• Linear model is enough i.e. Trend Analysis
• Linear models are the basis for complex models
i.e. Deep Networks

→ Let’s have a look at regression


i.e. prediction of a real-valued output

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10


Example: Growth Reference Dataset (WHO)

Age       | Height
4.1 years | 108 cm
5.2 years | 112 cm
5.6 years | 108 cm
5.7 years | 116 cm
6.2 years | 112 cm
6.3 years | 116 cm
6.6 years | 120 cm
6.7 years | 122 cm

The dataset contains age (5y – 12y) and height information from people in the USA.

We want to answer questions like:
• What is the height of my child when it is 14 years old?
• What is the height of my child when it is 30 years old?
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12
Example: 1. Visualize the data

(Scatter plot of the table above: age in years (5–12) on the x-axis, height in cm (110–145) on the y-axis.)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
Example: 2. Fit a model

What “model”, i.e. function, describes our data?
• Linear model
• Polynomial model
• Gaussian model

Answer: Linear model!

(Scatter plot of age vs. height with the fitted linear model drawn through the data.)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 17
Example: 3. Answer your Questions

What is the height of my child, when it is 14 years old?
→ 165 cm

What is the height of my child, when it is 30 years old?
→ 270 cm

(Plot: the fitted line extrapolated over ages 0–35; the height axis runs from 70 to 270 cm.)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19
Example: Actual Answer
What is the height of my child,
when it is 14 years old?
(We estimated 165 cm)
→ ~165 cm

What is the height of my child,


when it is 30 years old?
(We estimated 270 cm)
→ ~175 cm

Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20
Next in this Lecture:
• What is the Mathematical Framework?
• How do we fit a linear model?

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 21


Machine Learning for Engineers
Linear Regression - Mathematics

Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 24




Overall Picture
Linear Regression is a method to fit linear models to our data!

The linear model:
f(x) = w0 · x0 + w1 · x1 = y

with
x = (x0, x1)ᵀ = (1, Age in years)ᵀ
y = Height
w = weights

In our example, a good fit is:
f(x) = 70 · x0 + 6.5 · x1 = y

→ Finding good w0 and w1 is called fitting!

(Plot: age in years (5–12) vs. height in cm (110–145) with the fitted line.)

Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called the dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 30
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 31


The Linear Model
The previous description of the linear model is not accurate!
Reason: Real systems produce noise!

Linear model with noise:
f(x) = w0 · x0 + w1 · x1 + εᵢ

The εᵢ (and the collection ε over all samples) is called the residual error!

(Plot: age vs. height; εᵢ is the vertical distance between a sample and the fitted line.)

Note: In this case we assume that the noise εᵢ is Gaussian distributed.
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p. 19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 34
The Residual Error 𝝐
The residual error ε is not part of the input!
It is a random variable!

Typically: We assume ε is Gaussian distributed (i.e. Gaussian noise)¹:

p(ε) = (1 / (σ · √(2π))) · exp(−½ · ((ε − μ)/σ)²)

But other distributions are also possible!

(Plot: the Gaussian probability density over the event ε.)

¹ The short form is ε ~ 𝒩(μ, σ²)
Note: In our use case μ = 0 and σ is small. That means our samples deviate only “slightly”.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 35
The distribution of 𝑦
The noise is Gaussian:
p(ε) = 𝒩(ε | 0, σ²)

“Inserting” the linear function:
p(y | x, w, σ) = 𝒩(y | wᵀx, σ²)

This is the conditional probability density for the target variable y!

(Plot: for two inputs x₀ and x₁, the Gaussian densities p(y₀|x₀) and p(y₁|x₁) are centered on the regression line.)

Note for later use: p(y | x, θ) = p(y | x, w, σ)


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 36
General Linear Model
A more general formulation:

f(x) = Σⱼ wⱼ · xⱼ + ε = y   (sum over j = 1, …, D)

Vector notation:
f(x) = wᵀx + ε = y

Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 37
What parameters do we have to estimate?
The model is linear:

f(x) = Σⱼ wⱼ · xⱼ + ε = y   (sum over j = 1, …, D)

The model parameters are:
θ = {w0, …, wn, σ} = {w, σ}

Note: In this case, we assume that the noise is Gaussian distributed, i.e. ε ~ 𝒩(μ, σ²)
Note: We assume μ = 0: that means we only have to estimate σ! Homework: Why can we assume μ = 0?
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 39
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 40


What are optimal model parameters?
We define the optimal parameters as:

θ* = argmax_θ p(𝒟|θ)

That means:
• We want to find the optimal parameters 𝛉∗
• We search over all possible 𝛉
• We select the 𝛉 which most likely generated our training dataset 𝒟

Reminder: 𝜃 are the parameters, 𝒟 is the training dataset!


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 41
Intuitive Example
θ* = argmax_θ p(𝒟|θ)

Model: 𝒟 ~ 𝒩(μ, σ²)
Parameters: θ = {μ, σ}

θ = {9, 0.9} : Bad
θ = {4, 0.9} : Better
θ = {5, 1.5} : Best

(Plot: the empirical distribution of 𝒟 over the events 0–10, compared with the candidate Gaussians.)

Reminder: θ are the parameters, 𝒟 is the training dataset!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 44
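A minimal sketch (not part of the lecture) of this idea: on synthetic data drawn from 𝒩(5, 1.5²), the log-likelihood of the three candidate parameter sets from the slide is compared numerically. Only the candidate θ values come from the slide; the data and everything else are assumptions.

```python
# Minimal sketch: compare candidate thetas by their log-likelihood on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=5.0, scale=1.5, size=1000)       # synthetic training dataset (assumption)

def log_likelihood(data, mu, sigma):
    # l(theta) = sum_i log N(x_i | mu, sigma^2), using the i.i.d. assumption
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                  - (data - mu) ** 2 / (2.0 * sigma ** 2))

for mu, sigma in [(9, 0.9), (4, 0.9), (5, 1.5)]:
    print(f"theta = ({mu}, {sigma}): log-likelihood = {log_likelihood(D, mu, sigma):.1f}")
# theta = (5, 1.5) gives the highest log-likelihood, matching the "Best" choice above.
```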
Maximum Likelihood Estimation
This process is called Maximum Likelihood Estimation (MLE)

𝛉∗ = argmax 𝑝(𝒟|𝛉)
𝛉

Important: 𝑙 𝛉 = 𝑝 𝒟 𝛉 is called Likelihood!

Note: It can also written as 𝑝(𝒟|𝜽) = 𝑙(𝜽)


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 45
The Likelihood Function

We assume each sample is independent, identically distributed (i.i.d)!

ℒ(θ) = p(𝒟 | θ)
     = p(s₀, s₁, …, sₙ | θ)
     =* p(s₀ | θ) · p(s₁ | θ) · … · p(sₙ | θ)
     = ∏ᵢ p(sᵢ | θ)   (product over i = 1, …, N)

* using the i.i.d. assumption

→ This is the basis to judge, how good our parameters are!

Note: In our case a sample 𝒔𝑖 is 𝐱 𝑖 and 𝐲𝑖 ; Just assume both are “tied” together i.e. like a vector
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 46
The Log-Likelihood
The product of small numbers in a computer is unstable!

We transform the likelihood into the quasi-equivalent Log-Likelihood:


ℒ(θ) = ∏ᵢ p(sᵢ | θ)   ⇔   ℓ(θ) = Σᵢ log p(sᵢ | θ)   (i = 1, …, N)

* The logarithm „just“ scales the problem. Likelihood and Log-Likelihood both have to be maximized!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 47
The conditional distribution 𝑝(𝐬𝑖 |𝛉)
We know the conditional distribution is the Gaussian distribution of y (slide 35):

p(sᵢ | θ) = p(yᵢ | xᵢ, w, σ) = 𝒩(yᵢ | wᵀxᵢ, σ²)

We insert this into the log-likelihood:

Σᵢ log p(sᵢ | θ) = Σᵢ log[ (1/√(2πσ²)) · exp(−(yᵢ − wᵀxᵢ)² / (2σ²)) ]
                = −(1/(2σ²)) · Σᵢ (yᵢ − wᵀxᵢ)² − (N/2) · log(2πσ²)
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 51
Interpreting the Loss
We can “simplify” the current log-likelihood:

ℓ(θ) = −(1/(2σ²)) · Σᵢ (yᵢ − wᵀxᵢ)² − (N/2) · log(2πσ²)

The factor 1/(2σ²) and the term (N/2) · log(2πσ²) are constants with respect to w;
the remaining sum Σᵢ (yᵢ − wᵀxᵢ)² is the Residual Sum of Squares.

→ To maximize the log-likelihood, we only have to minimize the Residual Sum of Squares!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 55


The Residual Sum of Squares
We minimize the loss term:

L(θ) = Σᵢ (yᵢ − wᵀxᵢ)²   (i = 1, …, N)

For linear regression: the loss is shaped like a bowl with a unique minimum!

(Plot: the loss surface L(θ) over w₀ and w₁; the cross marks the optimal parameters, i.e. where the loss is minimal.)
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 56
Optimization
Multiple ways to find the minimum:
• Analytical solution
• Gradient descent
• Newton's method

… and of course many, many more methods!¹

¹ Examples: Particle Swarm Optimization, Genetic Algorithms, etc.


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 57
Analytical Solution
For the analytical solution, we want a “simpler” form¹:

L(θ) = Σᵢ (yᵢ − wᵀxᵢ)²

NLL(θ) = ½ · (y − xw)ᵀ(y − xw)

1) This form uses the “Negative Log Likelihood”, which can also be derived from the likelihood
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 59
Analytical Solution
For the analytical solution, we want a “simpler” form¹:

NLL(θ) = ½ · (y − xw)ᵀ(y − xw)

We find the minimum conventionally by using the derivative:

NLL′(θ) = xᵀxw − xᵀy

Equating the gradient to zero:

xᵀxw = xᵀy

1) This form uses the “Negative Log Likelihood”, which can also be derived from the likelihood
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 60
Analytical Solution
The solution, i.e. the minimum, is:
$$ \mathbf{w} = \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \mathbf{X}^T\mathbf{y} $$
with
$$ \mathbf{X}^T\mathbf{y} = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{y}_i \qquad \mathbf{X}^T\mathbf{X} = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T = \sum_{i=1}^{N} \begin{pmatrix} x_{i,1}^2 & \cdots & x_{i,1} x_{i,D} \\ \vdots & \ddots & \vdots \\ x_{i,D} x_{i,1} & \cdots & x_{i,D}^2 \end{pmatrix} $$

In practice too time consuming to compute!
Reason: The more training data, the longer the calculation!

Note: 𝑁 : Amount of Samples in the Training Dataset ; 𝐷 : Amount of Dimensions (length of the input vector)
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 61
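The closed-form solution above can be sketched in a few lines of Python (an illustrative sketch only, assuming NumPy and a small synthetic dataset; the variable names are not from the lecture). Solving the linear system 𝐗ᵀ𝐗𝐰 = 𝐗ᵀ𝐲 is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Synthetic data: N = 100 samples, D = 3 features (first column models the bias term)
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # design matrix, shape (N, D)
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)          # noisy targets

# Normal equations: solve (X^T X) w = X^T y instead of inverting X^T X
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to w_true
```

As noted on the slide, forming 𝐗ᵀ𝐗 grows with the amount of training data, which is why iterative methods are often preferred in practice.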
What is the quality of the fit?
The goodness of fit should be evaluated to verify the learned model!

A typical measure for quality is : 𝑅2

It is a measure for the percentage of variability of the dependent


variable (𝐲) in the data explained by the model!

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 62


The 𝑅𝟐 – Measure
The 𝑅𝟐 – Measure is calculated as:

Let $\bar{y}$ be the mean of the labels, then

• Total sum of squares: $SS_{tot} = \sum_i \left( \mathbf{y}_i - \bar{y} \right)^2$
• Residual sum of squares: $SS_{res} = \sum_i \left( \mathbf{y}_i - f(\mathbf{x}_i) \right)^2$
• Coefficient of determination: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

→ This is different from the Mean Squared Error (MSE)!

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 63
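As a small illustration of these definitions, a hedged Python sketch (assuming NumPy; the function name r_squared and the toy arrays are illustrative, not part of the lecture):

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.1, 1.3, 1.2, 1.5, 1.6])
y_pred = np.array([1.0, 1.25, 1.3, 1.45, 1.65])
print(r_squared(y, y_pred))
```

A model that always predicts the mean of 𝐲 gets R² = 0, which matches the third case on the next slide.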


Linear Regression: Evaluation

[Figure: three example fits]
• R² = 100%: the model completely explains the variation in y
• R² = 91.3%: the model partially explains the variation in y, but is still considered good
• R² = 0%: the model explains no variation in y, because it only predicts the mean of y

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 64


Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 65


Can we fit the following dataset?
House Prices in Springfield (USA)

[Figure: house prices in Mio. € plotted over the years 1900–2020]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 66
Polynomial Regression
House Prices in Springfield (USA)
The answer is – of course – yes!

[Figure: polynomial fit of the house prices in Mio. € over the years 1900–2020]

Polynomial regression fits a nonlinear relationship between the input 𝐱 and the corresponding conditional mean of 𝐲.

Example of a polynomial model:
$$ y = w_0 + w_1 x + w_2 x^2 + \dots + w_m x^m + \epsilon $$

Note: 𝑚 is called the degree of the polynomial


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 67
How can we use polynomials in linear regression?
Trick: Basis function expansion!

We can just apply a function before we predict our “linear line”:


$$ f(\mathbf{x}) = \sum_{j=1}^{D} w_j \, \mathbf{\Phi}_j(\mathbf{x}) + \epsilon = \mathbf{y} $$

That means: We transform our Input-Space by using the function 𝚽!

→ The estimation from “normal” linear regression is the same!


Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 68
Example: Polynomial Regression
House Prices in Springfield (USA)
The basis function for our example:
$$ \mathbf{\Phi}(x) = \begin{pmatrix} 1 & x & x^2 & x^3 & x^4 & x^5 & x^6 \end{pmatrix}^T $$

The model parameters are:1)
$$ \boldsymbol{\theta} = \{ w_0, w_1, w_2, w_3, w_4, w_5, w_6, \sigma \} $$

[Figure: degree-6 polynomial fit of the house prices in Mio. € over the years 1900–2020]

1) We assume that the noise is Gaussian distributed!


Note: The input here is NOT a vector! It is a scalar! However, the result is a 7-dim vector!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 69
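To make the basis function expansion concrete, a minimal Python sketch (assuming NumPy; the helper poly_basis and the synthetic year/price data are only illustrative):

```python
import numpy as np

def poly_basis(x, degree):
    """Phi(x) = [1, x, x^2, ..., x^degree] for a 1-D input array x."""
    return np.vander(x, N=degree + 1, increasing=True)

# Illustrative data: scaled years and house prices in Mio. EUR
rng = np.random.default_rng(0)
years = np.linspace(0.0, 1.0, 50)
prices = 1.2 + 0.3 * np.sin(4 * years) + 0.02 * rng.normal(size=50)

Phi = poly_basis(years, degree=6)                 # transform the input space
w = np.linalg.lstsq(Phi, prices, rcond=None)[0]   # "normal" linear regression on Phi
predictions = Phi @ w
```

The estimation step itself is unchanged; only the input space is transformed, which is exactly the point of the slide.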
Complexity of Polynomial Regression
The greater the degree of the polynomial, the more complex the relationships we can express!

[Figure: polynomial fits of degree m = 2 and degree m = 13]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 70
Basis Functions
We can use any arbitrary function, with one condition:

The resulting vector has to have a constant “length”,
i.e. the model stays linear with respect to the parameters!

Possible basis functions:


• $\mathbf{\Phi}(\mathbf{x}) = \begin{pmatrix} 1 & x_0 & x_1 & \dots & x_D \end{pmatrix}^T$
• $\mathbf{\Phi}(\mathbf{x}) = \begin{pmatrix} 1 & x & x^2 & \dots & x^m \end{pmatrix}^T$
• $\mathbf{\Phi}(\mathbf{x}) = \begin{pmatrix} 1 & x_0 & x_1 & x_0 x_1 & x_0^2 & x_1^2 & x_2^2 \end{pmatrix}^T$
• etc.
Note: That is the reason, why polynomial regression can be considered linear
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 71
Examples for Basis Functions
$\mathbf{\Phi}(\mathbf{x}) = \begin{pmatrix} 1 & x_0 & x_1 \end{pmatrix}^T$        $\mathbf{\Phi}(\mathbf{x}) = \begin{pmatrix} 1 & x_0 & x_1 & x_2 & x_1^2 & x_2^2 \end{pmatrix}^T$

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 72


Thank you for listening!
Machine Learning for Engineers
Logistic Regression - Motivation

Bilder: TF / Malter
Motivation
Logistic regression is the application of
linear regression to classification!

Use Cases:
• Credit Scoring
• Medicine
• Text Processing
• etc.

→ Especially useful for “explainability”


Source: https://activewizards.com/blog/5-real-world-examples-of-logistic-regression-application
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 75
When do we use it?
The problem has to be simple:
• Dataset is small
• Linear model is enough
• Basis for complex models

→ Let’s have a look at classification


i.e. prediction of a categorical-valued output

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 76


Example: Iris Flower Dataset
Label    Petal Width   Petal Length
Setosa   5.0 mm        9.2 mm
Versi.   9.2 mm        26.1 mm
Setosa   7.7 mm        18.9 mm
Versi.   9.1 mm        32.1 mm
Setosa   7.9 mm        15.5 mm
Setosa   5.7 mm        12 mm
Setosa   2.5 mm        13.5 mm
Versi.   15.0 mm       39 mm

The dataset contains 50 samples each of the flowers Iris setosa, Iris virginica and Iris versicolor.

We want to answer questions like:
My flower has a Petal Width of 7 mm and a Petal Length of 15 mm.
Is my flower an Iris Setosa or an Iris Versicolor?

Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 77
Example: Transform the Label
Label    Petal Width   Petal Length
Setosa 5.0 mm 9.2 mm
Versi. 9.2 mm 26.1 mm
Setosa 7.7 mm 18.9 mm
Versi. 9.1 mm 32.1 mm
Setosa 7.9 mm 15.5 mm
Setosa 5.7 mm 12 mm
Setosa 2.5 mm 13.5 mm
Versi. 15.0 mm 39 mm

Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 78
Example: Transform the Label
Label    Petal Width   Petal Length
0 5.0 mm 9.2 mm
1 9.2 mm 26.1 mm
0 7.7 mm 18.9 mm
1 9.1 mm 32.1 mm
0 7.9 mm 15.5 mm
0 5.7 mm 12 mm
0 2.5 mm 13.5 mm
1 15.0 mm 39 mm

Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 79
Example: 1. Visualize the data

Label   Petal Width   Petal Length
0       5.0 mm        9.2 mm
1       9.2 mm        26.1 mm
0       7.7 mm        18.9 mm
1       9.1 mm        32.1 mm
0       7.9 mm        15.5 mm
0       5.7 mm        12 mm
0       2.5 mm        13.5 mm
1       15.0 mm       39 mm

[Figure: scatter plot of the samples, Petal Width in mm over Petal Length in mm; Iris Setosa (0) and Iris Versicolor (1) form two separate clusters]

Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 81
Example: 2. Find a decision boundary
What decision boundary, i.e. which function, separates the data?

• Linear boundary
• Polynomial boundary
• Gaussian boundary

[Figure: scatter plot of the two classes, Petal Width in mm over Petal Length in mm]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 83
Example: 2. Find a decision boundary
What decision boundary, i.e. which function, separates the data?

• Linear boundary
• Polynomial boundary
• Gaussian boundary

Answer: Linear decision boundary!

[Figure: scatter plot of the two classes with a linear decision boundary, Petal Width in mm over Petal Length in mm]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 84
Example: 3. Answer your Question
My flower has a Petal Width of 7 mm and a Petal Length of 15 mm.

Is my flower an Iris Setosa or an Iris Versicolor?

[Figure: the query point (15 mm, 7 mm) plotted together with the two classes and the linear decision boundary, Petal Width in mm over Petal Length in mm]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 86
Example: 3. Answer your Question
My flower has a Petal Width of 7 mm and a Petal Length of 15 mm.

Is my flower an Iris Setosa or an Iris Versicolor?

→ Iris Setosa

[Figure: the query point lies on the Setosa side of the linear decision boundary, Petal Width in mm over Petal Length in mm]

Note: Setosa Versicolor | Reason: The point is on the “left side” of the decision boundary
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 87
Next in this Lecture:
• What is the Mathematical Framework?
• How do we classify using a linear model?

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 88


Thank you for listening!
Machine Learning for Engineers
Logistic Regression

Bilder: TF / Malter
Overview
• The Logistic Model
• Optimization

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 91




Logistic Regression
How do we describe the linear model as a decision boundary?

[Figure: scatter plot of the two classes with a linear decision boundary, Petal Width in mm over Petal Length in mm]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 93
Logistic Regression
How do we describe the linear model as a decision boundary?

Rule of thumb:
The larger the distance of the input 𝐱 to the decision boundary, the more certain it belongs to either Setosa (left of the line) or Versicolor (right of the line)!

[Figure: scatter plot with the linear decision boundary; points far from the boundary are “certain”, points close to it are “uncertain”; Petal Width in mm over Petal Length in mm]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 94
From distance to “probability”
The linear model:
$$ f(\mathbf{x}) = \sum_{j=1}^{D} w_j x_j $$

It calculates a signed distance1) between the input and the linear model.

→ How do we get a “probability”?

[Figure: scatter plot with the decision boundary and the signed distances f(x) of two query points, Petal Width in mm over Petal Length in mm]

Note: Setosa Versicolor | 1) negative (when left of the line), positive (when right of the line)
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 95
The sigmoid function
The sigmoid (logistic) function maps to the range [0, 1]!

That means the model is now:
$$ \mu(\mathbf{x}, \mathbf{w}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}} $$

→ “Only” the probability for the event Versicolor…

[Figure: sigmoid curve, output μ(x, w) over the input wᵀx]

Fun fact: The sigmoid function is sometimes lovingly called “squashing function”
Note: Here we already inserted the function f! Homework: What is the general form of the sigmoid function?
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 96
Bernoulli distribution
The Bernoulli distribution can model both events (yes-or-no event):
$$ p(y \,|\, \mathbf{x}, \mathbf{w}) = \mathrm{Ber}\bigl( y \,|\, \mu(\mathbf{x}, \mathbf{w}) \bigr) = \mu(\mathbf{x}, \mathbf{w})^{y} \, \bigl( 1 - \mu(\mathbf{x}, \mathbf{w}) \bigr)^{1-y} $$

[Figure: two example Bernoulli distributions — a voting example (Mr. A vs. Mrs. B) and a fair coin (heads vs. tails)]

Problem:
How do we get the label, based on the calculated probability?

Note: We use the above distribution for the MLE estimation! Basically we replace “𝑝 𝐬𝑖 𝛉 ” in the log-likelihood with this!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 97
The Decision Rule
Based on 𝜇(𝐱, 𝐰) we can decide which class is more “likely”!

The decision rule is:
$$ y = \begin{cases} 1, & \text{if } \mu(\mathbf{x}, \mathbf{w}) > 0.5 \\ 0, & \text{if } \mu(\mathbf{x}, \mathbf{w}) \le 0.5 \end{cases} $$

Question:
How do we find the optimal parameters?

[Figure: sigmoid curve, output μ(x, w) over the input wᵀx]
Note: Setosa Versicolor | 1) negative (when left of the line), positive (when right of the line)
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 98
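A minimal sketch of the model and decision rule in Python (assuming NumPy; the function names are illustrative, and the bias is assumed to be included in 𝐰 via a constant feature):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w):
    """Return mu(x, w) and the hard label from the 0.5 threshold."""
    mu = sigmoid(X @ w)                 # "probability" for class 1 (Versicolor)
    label = (mu > 0.5).astype(int)      # the decision rule from this slide
    return mu, label
```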
Overview
• The Logistic Model
• Optimization

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 99


Constructing the Loss
Reminder: the log-likelihood
$$ \boldsymbol{\ell}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log[p(y_i \,|\, \mathbf{x}_i, \boldsymbol{\theta})] $$

Inserting the Bernoulli distribution $p(y \,|\, \mathbf{x}, \mathbf{w}) = \mu(\mathbf{x}, \mathbf{w})^{y} \bigl( 1 - \mu(\mathbf{x}, \mathbf{w}) \bigr)^{1-y}$ gives:*

$$ \boldsymbol{\ell}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log\Bigl[ \mu(\mathbf{x}_i, \mathbf{w})^{y_i} \, \bigl( 1 - \mu(\mathbf{x}_i, \mathbf{w}) \bigr)^{1-y_i} \Bigr] $$

$$ = \sum_{i=1}^{N} \Bigl[ y_i \log \mu(\mathbf{x}_i, \mathbf{w}) + (1 - y_i) \log\bigl( 1 - \mu(\mathbf{x}_i, \mathbf{w}) \bigr) \Bigr] $$

* We just insert the Probability from Slide 99


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 102
The Cross Entropy Loss
Instead of Maximizing the Log-Likelihood, we minimize the Negative Log-
Likelihood!

This Loss is called the Cross Entropy:


$$ L(\boldsymbol{\theta}) = -\sum_{i=1}^{N} \Bigl[ y_i \log \mu(\mathbf{x}_i, \mathbf{w}) + (1 - y_i) \log\bigl( 1 - \mu(\mathbf{x}_i, \mathbf{w}) \bigr) \Bigr] $$

• Unique minimum
• No analytical solution possible!
→ Optimization with Gradient descent.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 104
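The cross entropy can be written down directly from the formula above; a small Python sketch (assuming NumPy, with a clipping constant added only as a numerical safeguard against log(0)):

```python
import numpy as np

def cross_entropy(y, mu, eps=1e-12):
    """Negative log-likelihood of the Bernoulli model (cross entropy)."""
    mu = np.clip(mu, eps, 1.0 - eps)    # numerical safeguard, not part of the formula
    return -np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
```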


Optimization Method: Gradient Descent
Observation:
The loss is like a
mountainous landscape!

Idea:
We find the minimum by “walking down” the slope of the mountain

[Figure: loss surface L(θ) over the parameters w0 and w1]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 105


Optimization Method: Gradient Descent
Approach:
1. Start with random weights
2. Calculate the direction of steepest descent
3. Step in that direction
4. Repeat from step 2

Result:
Some local minimum is reached!

[Figure: path of gradient descent steps on the loss surface L(θ) over w0 and w1]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 106


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
𝜃𝑖
Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 107


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
𝜃𝑖
Parameter 𝜃

Note: The gradient in matrix-vector form is: 𝐗 𝑇 (𝛍 − 𝐲)


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 108
Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
The Gradient for Logistic Regression:
$$ \nabla L(\theta_i) = \sum_{i} \bigl( \mu(\mathbf{x}_i, \mathbf{w}) - y_i \bigr) \, \mathbf{x}_i $$
𝜃𝑖
Parameter 𝜃

Note: The gradient in matrix-vector form is: 𝐗 𝑇 (𝛍 − 𝐲)


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 109
Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 )

𝜃𝑖 𝜃𝑖+1
Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 110


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 )
𝜂 is called the learning rate
• If 𝜂 is too large
→ Overshooting the minimum
𝜃𝑖 𝜃𝑖+1
• If 𝜂 is too small Parameter 𝜃
→ Minimum is not reached
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 111
Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 ) 𝐿(𝜃𝑖+1 )

5 𝑖 =𝑖+1

𝜃𝑖 𝜃𝑖+1
Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 113


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 ) 𝐿(𝜃𝑖+1 )

5 𝑖 =𝑖+1
𝐿(𝜃𝑖+2 )

𝜃𝑖 𝜃𝑖+1 𝜃𝑖+2
Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 114


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 ) 𝐿(𝜃𝑖+1 )

5 𝑖 =𝑖+1
𝐿(𝜃𝑖+2 )
𝐿(𝜃𝑖+3 )
Minimum

𝜃𝑖 𝜃𝑖+1 𝜃𝑖+2 𝜃𝑖+3


Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 115


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 ) 𝐿(𝜃𝑖+1 )

5 𝑖 =𝑖+1
𝐿(𝜃𝑖+2 )
𝐿(𝜃𝑖+3 )
Minimum

Repeat the process until the loss is minimal! 𝜃𝑖 𝜃𝑖+1 𝜃𝑖+2 𝜃𝑖+3
Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 116
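Putting the pieces together, a hedged sketch of gradient descent for logistic regression in Python (assuming NumPy; dividing the gradient by N is an optional averaging choice, and all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Gradient descent on the cross entropy; X has shape (N, D), y contains 0/1 labels."""
    w = np.zeros(X.shape[1])              # start with some (here: zero) weights
    for _ in range(n_iter):
        mu = sigmoid(X @ w)               # model output
        grad = X.T @ (mu - y)             # gradient of the loss (matrix-vector form)
        w = w - lr * grad / len(y)        # step against the gradient
    return w
```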


General weaknesses of Gradient Descent
• Multiple local minima are common

[Figure: loss curve over a parameter with two local minima]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 118


General weaknesses of Gradient Descent
• Multiple local minima are common
• Which minimum the model converges to depends heavily on the random initialization

[Figure: loss curve over a parameter with two local minima]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 119


General weaknesses of Gradient Descent
• Multiple local minima are common
• Which minimum the model converges to depends heavily on the random initialization
• Success depends on the learning rate 𝜂

https://de.serlo.org/mathe/funktionen/wichtige-funktionstypen-ihre-eigenschaften/polynomfunktionen-beliebigen-grades/polynom
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 120
Thank you for listening!
Machine Learning for Engineers
Overfitting and Underfitting

Bilder: TF / Malter
Overfitting and Underfitting
Optimal function vs.
Estimated function
For 𝑀 = 0 and 𝑀 = 1 the function fails to
model the data, as the chosen model
is too simple (Underfitting)

Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 123
Overfitting and Underfitting
Optimal function vs.
Estimated function
For 𝑀 = 0 and 𝑀 = 1 the function fails to
model the data, as the chosen model
is too simple (Underfitting)

For 𝑀 = 9 the function exactly models the


given training data, as the chosen model
Is too complex (Overfitting)

Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 124
Overfitting and Underfitting
Optimal function vs.
Estimated function
For 𝑀 = 0 and 𝑀 = 1 the function fails to
model the data, as the chosen model
is too simple (Underfitting)

For 𝑀 = 9 the function exactly models the


given training data, as the chosen model
Is too complex (Overfitting)

For 𝑀 = 3 the function closely matches


the expected function
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 125
Overfitting and Underfitting
UNDERFITTING                                   OVERFITTING
Error on the training data is very high        Error on the training data is very low
Testing error is high                          Testing error is high
Examples are M = 0 and M = 1                   Example is M = 9

Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 126
Complexity vs. Generalization Error
Plotting the error over all complexities typically reveals a “sweet spot” (i.e. an ideal complexity).

The prediction error is affected by two kinds of errors:
• Bias error
• Variance error

[Figure: training and testing prediction error over model complexity; the generalization error is the gap between them, and the ideal complexity lies at the minimum of the testing loss, between the underfitting and overfitting regimes]
Note: Complexity does not mean parameters! It means the mathematical complexity! i.e. the ability of the model to capture a relationship!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 127
Complexity vs. Generalization Error
Bias:
Error induced by the simplifying assumptions of the model

Variance:
Error induced by differing training data, i.e. how sensitive the model is to “noise”

Both errors oppose each other!

[Figure: prediction error over complexity; bias decreases and variance increases with growing complexity]


When reducing bias, you increase variance!
(and vice versa)
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 130
Complexity vs. Generalization Error
The sweet spot is located at the minimum of the test curve.

It has both a low bias and a low variance!

This spot is – usually – found using hyperparameter search.

[Figure: prediction error over complexity; the ideal complexity (minimum testing loss) balances low bias against low variance]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 131
Hyperparameter search
1)
Approach: Calculate the prediction error on the validation set and select
the hyperparameters with the smallest loss

Example for hyperparameters:


• The degree of a polynomial
• Learning rate 𝜂
• Different basis functions (see BFE)
• In neural networks: the number of neurons

→ There exist multiple methods to solve this!

1) We use this instead of the test set. This is important! You only touch the test set for the final evaluation!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 132
Hyperparameter search methods
There exist multiple ways to find good hyperparameters…
… manual search (Trial-and-Error)
… random search
… grid search
… Bayesian methods
… etc.

→ In practice you typically use Trial-and-Error & Grid-Search

Note: Basically every known optimization technique can be used. Examples: Particle Swarm Optimization, Genetic Algorithms, Ant Colony Optimization…
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 133
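In practice, grid search is usually done with a library. A hedged sketch assuming scikit-learn (the parameter grid and model are only examples, not prescribed by the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}          # candidate hyperparameters
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)                                    # uses cross validation internally
print(search.best_params_, search.best_score_)
```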
Hyperparameter search on the data splits
Typically you split the data into a training and testing split

Hyperparameter search is forbidden on the testing split!

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 134


Hyperparameter search on the data splits
Typically you split the data into a training and testing split

Hyperparameter search is forbidden on the testing split!

Solution:
Splitting the training data further into a train split and validation split

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 135


Hyperparameter search on the data splits
Typically you split the data into a training and testing split

Hyperparameter search is forbidden on the testing split!

Solution:
Splitting the training data further into a train split and validation split

If the dataset is large “enough” →

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 136


Hyperparameter search on the data splits
Typically you split the data into a training and testing split

Hyperparameter search is forbidden on the testing split!

Solution:
Splitting the training data further into a train split and validation split

If the dataset is large “enough” →


If the dataset is small … →

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 137


Cross Validation
Problem of the Static Split:
The data1) is split into 80% train set and 20% validation set

If you are “unlucky”: The validation split is not representative!


i.e. your prediction error is a bad estimate for the
generalization performance!

→ Solution: Cross Validation!

1) We assume the test data is already split and put away. A typical split of all the data is 80% train – 10% validation - 10% testing
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 138
K-Fold Cross Validation
Approach:
Split the training data into 𝑘 folds (in our example 𝑘 = 5)

For each split 𝑖:
1. Train on all folds, except the 𝑖-th fold
2. Evaluate on the 𝑖-th fold
Finally: Average over all results

[Figure: the training data (90%) is split into 5 folds; in iteration 𝑖 the 𝑖-th fold is used for validation and the remaining folds for training; the test set (10%) is held out]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 142
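A compact sketch of k-fold cross validation assuming scikit-learn (the model and k = 5 are only examples):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)   # one score per fold
print(scores, scores.mean())                   # finally: average over all results
```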


Cross Validation
K-Fold Cross validation is the typical form of cross validation

When k = N (i.e. each fold contains a single sample), the process is called leave-one-out cross validation (or LOOCV)

There exist other variants as well, like:


• Grouped Cross Validation
• Nested Cross Validation
• Stratified Cross Validation

→ Each has its unique use-case!

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 143


Thank you for listening!
Machine Learning for Engineers
Support Vector Machines – Problem Statement

Bilder: TF / Malter
Recap: Logistic Regression
Goal: Find a linear decision boundary separating the two classes

Problem: Multiple linear decision boundaries solve the problem with equal accuracy

Question: Which one is the best?

[Figure: classification of two classes (label y = Dog vs. label y = Cat) in the feature space (x0, x1) with four candidate boundaries A, B, C and D]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2


Intuition behind Support Vector Machines
A is closer to Dog at the top and to Cat at the bottom
B is far away from both Dog and Cat throughout the boundary
C is always closer to Dog than Cat
D almost touches Dog at the bottom and Cat at the top

[Figure: the four candidate boundaries A, B, C and D between the Dog and Cat samples in the feature space (x0, x1)]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3


Intuition behind Support Vector Machines
B is far away from both Dog and Cat throughout the boundary.

To put this in more mathematical terms, we say it has the largest margin 𝑚.

Updated goal: Find the linear decision boundary with the largest margin.

[Figure: boundary B with margin m to the closest samples of both classes in the feature space (x0, x1)]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4


Mathematical Model for Decision Boundary
Mathematically, any hyperplane is given by the equation
$$ \mathbf{w} \cdot \mathbf{x} - b = 0 $$

• 𝒘 is the normal vector
• 𝒙 is any arbitrary feature vector
• $\frac{b}{\lVert \mathbf{w} \rVert}$ is the distance of the hyperplane to the origin

[Figure: a hyperplane in the feature space (x0, x1) with normal vector w and distance b/‖w‖ to the origin]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5


Mathematical Model for Margins
The decision boundary is given by
$$ \mathbf{w} \cdot \mathbf{x} - b = 0 $$

Furthermore, by definition, the margin boundaries are given by
$$ \mathbf{w} \cdot \mathbf{x} - b = +1 $$
$$ \mathbf{w} \cdot \mathbf{x} - b = -1 $$

$\frac{2}{\lVert \mathbf{w} \rVert}$ is thus the distance between the two margins → maximize it

[Figure: decision boundary with the two margin boundaries at distance 2/‖w‖ in the feature space (x0, x1)]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6
Mathematical Model for Classes
Two-class problem: $y_1, \dots, y_n = \pm 1$

All training data $\mathbf{x}_1, \dots, \mathbf{x}_n$ needs to be correctly classified outside the margin:
$$ \mathbf{w} \cdot \mathbf{x}_i - b \ge +1 \quad \text{if } y_i = +1 $$
$$ \mathbf{w} \cdot \mathbf{x}_i - b \le -1 \quad \text{if } y_i = -1 $$

Due to the label choice, this can be simplified as
$$ y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 $$

[Figure: the two classes y = +1 and y = −1 separated by the margin in the feature space (x0, x1)]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7


Optimization of Support Vector Machines
This leads us to a constrained optimization problem.

1. Maximizing the margin $\frac{2}{\lVert \mathbf{w} \rVert}$, which is equal to minimizing $\frac{1}{2} \lVert \mathbf{w} \rVert^2$:
$$ \min_{\mathbf{w}, b} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2 $$

2. Subject to no misclassification on the training data:
$$ y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 $$

[Figure: maximum-margin boundary between the classes y = +1 and y = −1 in the feature space (x0, x1)]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8


Machine Learning for Engineers
Support Vector Machines – Optimization

Bilder: TF / Malter
Optimization of Support Vector Machines
In the previous section, we derived the constrained optimization problem
for Support Vector Machines (SVMs):
$$ \min_{\mathbf{w}, b} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2 $$
$$ \text{s.t.} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 $$

This is a quadratic programming problem, minimizing a quadratic function


subject to some inequality constraints.

⮩ We can thus introduce Lagrange multipliers

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10


Introduction of Lagrange Multipliers
For each of the 𝑛 inequality constraints, we introduce a Lagrange multiplier 𝛼𝑖:1)
$$ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{n} \alpha_i \bigl[ y_i (\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \bigr] $$

The Lagrange multipliers need to be maximized, resulting in the following


derived optimization problem:
$$ \min_{\mathbf{w}, b} \; \max_{\boldsymbol{\alpha}} \; \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) $$

1) While not relevant for the exam, a more detailed derivation is available in C. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006, Appendix E.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11
Primal versus Dual Formulation
This optimization problem is called the primal formulation, with solution 𝑝∗:
$$ p^* = \min_{\mathbf{w}, b} \; \max_{\boldsymbol{\alpha}} \; \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) $$

Alternatively, there's also the dual formulation, with solution 𝑑∗:
$$ d^* = \max_{\boldsymbol{\alpha}} \; \min_{\mathbf{w}, b} \; \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) $$

Since Slater’s condition holds for this convex optimization problem, we can
guarantee that 𝑝∗ = 𝑑 ∗ and solve the dual problem instead.

⮩ We can thus first solve the minimization problem for 𝒘 and 𝑏

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12


Solution using Partial Derivatives
We now want to minimize the following function for 𝒘 and 𝑏:
$$ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{n} \alpha_i \bigl[ y_i (\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \bigr] $$

At the minimum, we know that the derivatives $\frac{\partial \mathcal{L}}{\partial \mathbf{w}}$ and $\frac{\partial \mathcal{L}}{\partial b}$ need to be zero:
$$ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i $$
$$ \frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0 $$

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13


Solution using Partial Derivatives
We then eliminate 𝒘 and 𝑏 by inserting both equations into the function:
$$ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j (\mathbf{x}_i \cdot \mathbf{x}_j) $$

This directly leads us to the remaining maximization problem for 𝜶:
$$ \max_{\boldsymbol{\alpha}} \; \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) $$
$$ \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 $$
This quadratic programming problem can be solved using sequential
minimal optimization, which is not a topic of this lecture.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
The Karush–Kuhn–Tucker Conditions
The given problem fulfills the Karush-Kuhn-Tucker conditions1):
$$ \alpha_i \ge 0 $$
$$ y_i (\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \ge 0 $$
$$ \alpha_i \bigl[ y_i (\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \bigr] = 0 $$

Based on the last condition, we can derive that either $\alpha_i = 0$ or $y_i (\mathbf{w} \cdot \mathbf{x}_i - b) = 1$ for each of the training points.

💡 When 𝛼𝑖 is non-zero, the training point is on the margin and thus a so-
called “support vector”.
1) While not relevant for the exam, a more detailed derivation is available in C. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006, Appendix E.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 15
Optimization Summary
The dual formulation leads to a quadratic programming problem on 𝜶:1)
$$ \max_{\boldsymbol{\alpha}} \; \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j (\mathbf{x}_i \cdot \mathbf{x}_j) $$

The solution 𝜶∗ can then be used to compute 𝒘 and make predictions on previously unseen data points 𝒙:2)

$$ y = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} - b) = \operatorname{sgn}\Bigl( \sum_{i=1}^{n} \alpha_i^* y_i \, \mathbf{x}_i \cdot \mathbf{x} \Bigr) $$

1) This is simplified for the purposes of this summary. For the additional constraints that are also required refer to slide 14
2) Remember the equation for 𝒘 resulting from the derivative in slide 13
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 16
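To see the sparsity of the solution in practice, a hedged sketch assuming scikit-learn (the toy blobs and the large C, used here only to approximate a hard margin, are illustrative choices):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1e3).fit(X, y)

print(clf.support_vectors_)   # only the points with non-zero alpha_i
print(clf.dual_coef_)         # alpha_i * y_i for exactly these support vectors
```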
Machine Learning for Engineers
Support Vector Machines – Non-Linearity and the Kernel Trick

Bilder: TF / Malter
Recap: Basis Functions
House Prices in Springfield (USA)
For linear regression we transform the input space using a polynomial basis function 𝚽:
$$ \mathbf{\Phi}(x) = \begin{pmatrix} 1 & x & x^2 & x^3 & x^4 & x^5 & x^6 \end{pmatrix}^T $$

This can lead to
• Computational issues depending on the number of points
• Memory issues depending on the dimensionality of the output

[Figure: polynomial fit of the house prices in Mio. € over the years 1900–2020]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 18
Introduction of Basis Functions
We’ve previously derived the optimization and prediction functions for SVM.
Next, we can also introduce a basis function 𝚽(𝒙) to allow for non-linearity.
$$ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \bigl( \mathbf{\Phi}(\mathbf{x}_i) \cdot \mathbf{\Phi}(\mathbf{x}_j) \bigr) $$

$$ y = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} - b) = \operatorname{sgn}\Bigl( \sum_{i=1}^{n} \alpha_i^* y_i \, \mathbf{\Phi}(\mathbf{x}_i) \cdot \mathbf{\Phi}(\mathbf{x}) \Bigr) $$

💡 Note how the basis function is applied at each of the dot products.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19


Solving Computational Issues with Sparsity
There are 𝑛2 summands for the optimization and 𝑛 for the prediction, that
is, both functions naively scale with the number of points 𝑛.
$$ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \bigl( \mathbf{\Phi}(\mathbf{x}_i) \cdot \mathbf{\Phi}(\mathbf{x}_j) \bigr) $$

$$ y = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} - b) = \operatorname{sgn}\Bigl( \sum_{i=1}^{n} \alpha_i^* y_i \, \mathbf{\Phi}(\mathbf{x}_i) \cdot \mathbf{\Phi}(\mathbf{x}) \Bigr) $$

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20


Solving Computational Issues with Sparsity
However, from the Karush-Kuhn-Tucker conditions we know that 𝛼𝑖 is non-
zero only for the few support vectors.
$$ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \bigl( \mathbf{\Phi}(\mathbf{x}_i) \cdot \mathbf{\Phi}(\mathbf{x}_j) \bigr) $$

$$ y = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} - b) = \operatorname{sgn}\Bigl( \sum_{i=1}^{n} \alpha_i^* y_i \, \mathbf{\Phi}(\mathbf{x}_i) \cdot \mathbf{\Phi}(\mathbf{x}) \Bigr) $$

Most summands are thus zero. This property is called sparsity and greatly
simplifies the computations.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 21
Solving Memory Issues with Kernel Trick
Since the basis function 𝚽: ℝ𝑛 ↦ ℝ𝑚 is applied explicitly on each of the
points, the memory scales with the output dimensionality m.
$$ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \bigl( \mathbf{\Phi}(\mathbf{x}_i) \cdot \mathbf{\Phi}(\mathbf{x}_j) \bigr) $$

$$ y = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} - b) = \operatorname{sgn}\Bigl( \sum_{i=1}^{n} \alpha_i^* y_i \, \mathbf{\Phi}(\mathbf{x}_i) \cdot \mathbf{\Phi}(\mathbf{x}) \Bigr) $$

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 22


Solving Memory Issues with Kernel Trick
However, instead of explicitly computing 𝚽(𝒙𝑖 ) ⋅ 𝚽(𝒙𝑗 ), we can instead
replace it with a kernel function 𝐾(𝒙𝑖 , 𝒙𝑗 ).
$$ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \, K(\mathbf{x}_i, \mathbf{x}_j) $$

$$ y = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} - b) = \operatorname{sgn}\Bigl( \sum_{i=1}^{n} \alpha_i^* y_i \, K(\mathbf{x}_i, \mathbf{x}) \Bigr) $$

The kernel function usually doesn’t require explicit computation of the basis
function. This method of replacement is called the kernel trick.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 23
The Linear Kernel
The kernel function is given by:
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{2\sigma^2} \, \mathbf{x}_i \cdot \mathbf{x}_j $$
• 𝜎 is a length-scale parameter

• Feature space mapping is basically


just the identity function

• Only useful for linearly separable


classification Image from https://scikit-
learn.org/stable/auto_examples/svm/plot_svm_kernels.html

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 24


The Radial Basis Function Kernel
The kernel function is given by:
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{2\sigma^2} \right) $$
• 𝜎 is a length-scale parameter

• Useful for non-linear classification


with clusters

• Probably the most popular kernel


Image from https://scikit-
learn.org/stable/auto_examples/svm/plot_svm_kernels.html

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 25


The Polynomial Kernel
The kernel function is given by:
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \left( \frac{1}{2\sigma^2} \, \mathbf{x}_i \cdot \mathbf{x}_j + r \right)^{d} $$

• 𝜎 is a length-scale parameter

• 𝑟 is a free parameter for the trade-off


between lower- and higher-order

• 𝑑 is the degree of the polynomial, a


Image from https://scikit-
typical choice is 𝑑 = 2 learn.org/stable/auto_examples/svm/plot_svm_kernels.html

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 26
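A short comparison of the three kernels assuming scikit-learn (the toy dataset and parameters are illustrative; note that scikit-learn parameterizes the length scale via gamma rather than σ):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ["linear", "rbf", "poly"]:
    clf = SVC(kernel=kernel, gamma="scale", degree=2).fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))   # the RBF kernel usually wins on this non-linear data
```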


Machine Learning for Engineers
Support Vector Machines – Hard Margin and Soft Margin

Bilder: TF / Malter
Hard Margin and Soft Margin
Up until now we have always implicitly
𝑦 = +1
assumed classes are linearly separable
2
⮩ So-called hard margin 𝒘
𝒘

Feature 𝑥1
What happens when the classes 𝜉2
overlap and there is no solution? 𝜉1
𝑦 = −1
⮩ So-called soft margin 𝑏
𝒘
Introduce slack variables 𝜉1 , 𝜉2 that Feature 𝑥0
move points onto the margin

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 28


Soft Margin – Slack Variables
Slack variable 𝜉𝑖 is the distance
𝑦 = +1
required to correctly classify point 𝒙𝑖
2
If correctly classified and outside margin 𝒘
𝒘

Feature 𝑥1
• We know 𝑦𝑖 𝒘 ⋅ 𝒙𝑖 − 𝑏 ≥ 1 👍
𝜉1
• No correction, thus 𝜉𝑖 = 0 𝜉2
𝑦 = −1
𝑏
If incorrectly classified or inside margin
𝒘
• We know 𝑦𝑖 𝒘 ⋅ 𝒙𝑖 − 𝑏 < 1 👎 Feature 𝑥0
• Move by 𝜉𝑖 = 1 − 𝑦𝑖 (𝒘 ⋅ 𝒙𝑖 − 𝑏)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 29


Updated Optimization Problem
We can update the original optimization problem with the slack variables 𝝃,
which leads us to the following constrained optimization problem:
$$ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2 + \lambda \, \frac{1}{n} \sum_{i=1}^{n} \xi_i $$

$$ \text{s.t.} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i $$

The optimization procedure and kernel trick can similarly be derived for
this, but we’ll spare you the details 😉.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 30


Machine Learning for Engineers
Support Vector Machines – Regression

Bilder: TF / Malter
Intuition behind Support Vector Regression
The Support Vector Machine is a Regression
method for classification.
Let us now turn to regression. 𝜖

Label 𝑦
With training data 𝒙1 , 𝑦1 , … , (𝒙𝑛 , 𝑦𝑛 ),
we want to find a function of form:
𝜖
𝑦 =𝒘⋅𝒙+𝑏
For Support Vector Regression, we
Feature 𝑥
want to keep everything in a 𝜖-tube

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 32


Keeping Everything inside The Tube
Thus, for each point (𝒙𝑖 , 𝑦𝑖 ), the Regression
following conditions need to hold:
𝑦𝑖 ≤ 𝒘 ⋅ 𝒙𝑖 + 𝑏 + 𝜖 𝜖

Label 𝑦
𝑦𝑖 ≥ 𝒘 ⋅ 𝒙𝑖 + 𝑏 − 𝜖 𝜉+ 𝜉−
For outliers, we introduce slack 𝜖
variables 𝜉 + and 𝜉 − :
𝑦𝑖 ≤ 𝒘 ⋅ 𝒙𝑖 + 𝑏 + 𝜖 + 𝜉𝑖+
Feature 𝑥
𝑦𝑖 ≥ 𝒘 ⋅ 𝒙𝑖 + 𝑏 − 𝜖 − 𝜉𝑖−

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 33


Optimization of Support Vector Regression
This leads us to another constrained Regression
optimization problem.
1. Minimizing $\frac{1}{2} \lVert \mathbf{w} \rVert^2$ and the slack variables $\xi_i^+$ and $\xi_i^-$:
$$ \min_{\mathbf{w}, b} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \bigl( \xi_i^+ + \xi_i^- \bigr) $$
𝐶 is a free parameter specifying the trade-off for outliers

[Figure: ε-tube regression with slack variables ξ+ and ξ− for points outside the tube, label y over feature x]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 34


Optimization of Support Vector Regression
This leads us to another constrained Regression
optimization problem.
2. Subject to all 𝑦𝑖 being inside the tube specified by 𝜖:
$$ y_i \le \mathbf{w} \cdot \mathbf{x}_i + b + \epsilon + \xi_i^+ $$
$$ y_i \ge \mathbf{w} \cdot \mathbf{x}_i + b - \epsilon - \xi_i^- $$
$$ \xi_i^+ \ge 0, \qquad \xi_i^- \ge 0 $$

[Figure: ε-tube with slack variables ξ+ and ξ−, label y over feature x]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 35


Pros and Cons of Support Vector Regression
Advantages Regression
• Less susceptible to outliers when
𝜖
compared to linear regression

Label 𝑦
𝜉+ 𝜉−
Disadvantages

• Requires careful tuning of the free 𝜖


parameters 𝜖 and 𝐶

• Doesn’t scale to large data sets Feature 𝑥

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 36
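Support Vector Regression is available out of the box in common libraries; a hedged sketch assuming scikit-learn and a synthetic sine dataset (C and epsilon correspond to the free parameters discussed above):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)   # epsilon sets the width of the tube
reg.fit(X, y)
y_pred = reg.predict(X)
```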


Machine Learning for Engineers
Support Vector Machines – Summary

Bilder: TF / Malter
Summary of Support Vector Machines
Support Vector Machine Advantages

• Elegant to optimize, since they have one local and global optimum

• Efficiently scales to high-dimensional features due to the sparsity and


the kernel trick

Support Vector Machine Disadvantages

• Difficult to choose good combination of kernel function and other free


parameters, such as regularization coefficient 𝑪

• Does not scale to large amount of training data


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 38
References
1. D. Sontag. (2016). Introduction To Machine Learning: Support Vector Machines [Online].
Available: https://people.csail.mit.edu/dsontag/courses/ml16/
2. A. Zisserman. (2015). Lecture 2: The SVM classifier [Online]. Available:
https://www.robots.ox.ac.uk/~az/lectures/ml/
3. K. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press,
2012.
4. B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2018.
5. A. Kowalczyk, Support Vector Machines Succinctly. Morrisville, NC, USA: Syncfusion, 2017.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 39


Machine Learning for Engineers
Principal Component Analysis – Applications

Bilder: TF / Malter
Application: Image Compression
We can save data more efficiently
by applying PCA!

The MaD logo (266 × 266 × 3 px)


can be compressed from 212 kB to
Original 32 Components (97.27%)
• 32 components : 8 kB
• 16 components : 4 kB
• 8 components : 2 kB

Can be used for other data as well! 16 Components (93.62%) 8 Components (85.50%)
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 17
Application: Facial Recognition
Eigenfaces (1991) are a fast and efficient
way to recognize faces in a gallery of facial
images.
General Idea:
1. Generate a set of eigenfaces (see right)
based on a gallery
2. Compute the weight for each eigenface
based on the query face image
3. Use the weights to classify (and identify)
the person
Turk, M. and A. Pentland. “Face recognition using eigenfaces.” Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1991): 586-591.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 18
Application: Anomaly Detection
Detecting anomalous traffic in IP Networks is crucial for administration!
General Idea:
Transform the time series of IP packages (requests, etc.) using a PCA
→ The first k (typically 10) components represent
“normal” behavior
→ The components “after” k, represent
anomalies and/or noise
Packages mapped primarily to the latter component are classified as
anomalous!
Anukool Lakhina, Konstantina Papagiannaki, Mark Crovella, Christophe Diot, Eric D. Kolaczyk, and Nina Taft. 2004. Structural analysis of network traffic flows. SIGMETRICS Perform. Eval.
Rev. 32, 1 (June 2004), 61–72. DOI:https://doi.org/10.1145/1012888.1005697
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19
Limitation: The Adidas Problem
Let the data points be distributed like 𝑤2
the Adidas logo, with three implicit 𝑤1
classes (teal, red, blue stripe).

Feature 2
The principal directions 𝑤1 and 𝑤2
found by PCA are given.
The first principal direction 𝑤1 does
not preserve the information to
classify the points!
Feature 1

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20


Machine Learning for Engineers
Principal Component Analysis – Intuition

Bilder: TF / Malter
Intuition behind Principal Component Analysis
Consider the following data with Twin Heights
height of two twins ℎ1 and ℎ2 .
Now assume that

Twin height ℎ2
a) We can only plot one-dimensional
figures due to limitations1
b) We have complex models that
can only handle a single value1
Twin height ℎ1

1) These are obviously constructed limitations for didactical concerns. However, in real life you will also face limitations in visualization
(maximum of three dimensions) or model feasibility.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
Intuition behind Principal Component Analysis
How do we find a transformation from Twin Heights
two values to one value?
In this specific case we could

Twin height ℎ2
a) Keep the average height $\frac{1}{2}(h_1 + h_2)$ as one value
b) Discard the difference in height $h_1 - h_2$
Twin height ℎ1

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3


Intuition behind Principal Component Analysis
In this specific case we could Twin Heights
a) Keep the average height $\frac{1}{2}(h_1 + h_2)$ as one value
b) Discard the difference in height $h_1 - h_2$

[Figure: the data in the new coordinate frame, height difference over average height]
💡 This corresponds to a change in
basis or coordinate frame in the plot.
Average height
⮩ Rotation by matrix 𝑨

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4


Principal Component Analysis as Rotation
The Principal Component Analysis (PCA) is essentially a change in basis
using a rotation matrix 𝑨.
Twin Heights Twin Heights

Height difference
Twin height ℎ2

Rotation

$$ \mathbf{A} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} $$
Twin height ℎ1 Average height
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
Principal Component Analysis: Preserving Distances
💡 Idea: Preserve the distances Twin Heights
between each pair in the lower
dimension

Height difference
• Points that are far apart in the
original space remain far apart
• Points that are close in the original
space remain close

Average height

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6


Principal Component Analysis: Preserving Variance
The previous idea is closely coupled Twin Heights
with the “variance”.

Height difference
• Find those axes and dimensions

Low variance
with high variance
• Discard the axes and dimensions
with increasingly low variance
High variance
Goal: Algorithmically find directions
with high/low variance in data1 Average height

1) In this trivial case we could hand-engineer such a transformation. However, this is not generally the case.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7
Machine Learning for Engineers
Principal Component Analysis – Mathematics

Bilder: TF / Malter
Description of Input Data
Let 𝑿 be a 𝑛 × 𝑝 matrix of data Twin Heights
points, where
• 𝑛 is number of points (here 15)

Twin height ℎ2
• 𝑝 is number of features (here 2)
$$ \mathbf{X} = \begin{pmatrix} 1.75 & 1.77 \\ 1.67 & 1.68 \\ \vdots & \vdots \\ 2.01 & 1.98 \end{pmatrix} $$

[Figure: scatter plot of twin height h2 over twin height h1]

15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9


Description of Input Data
Let 𝑪 = 𝑿𝑇 𝑿 be the 𝑝 × 𝑝 covariance Twin Heights
matrix of data points1, where
• 𝑿 are data points

Twin height ℎ2
• 𝑝 is number of features (here 2)
The covariance matrix generalizes
the variance 𝜎 2 of a Gaussian
distribution 𝒩(𝜇, 𝜎 2 ) to higher
dimensions. Twin height ℎ1

1) Technically the covariance matrix of the transposed data points 𝑿𝑇 after centering to zero mean.
15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
Description of Desired Output Data
We’re interested in new axes that Twin Heights
rotate the system, here 𝑤1 and 𝑤2
These are columns of a matrix

Twin height ℎ2
𝜆2
𝑤2 𝜆1
0.7 0.7 𝑤1
𝑾=
0.7 −0.7
𝑤1 𝑤2

We’re also interested in the


“variance” of each axis, 𝜆1 and 𝜆2 Twin height ℎ1

𝜆1 = 2, 𝜆2 = 1

15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11


Relationship to Eigenvectors and -values
We’re in luck as this corresponds to Twin Heights
the eigenvectors and –values of 𝑪:
𝑪 = 𝑾𝑳𝑾𝑇

Twin height ℎ2
𝜆2
𝑤2 𝜆1
• 𝑾 is 𝑝 × 𝑝 matrix, where each 𝑤1
column is an eigenvector
• 𝑳 is a 𝑝 × 𝑝 matrix, with all
eigenvalues on diagonal in
decreasing order Twin height ℎ1

15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12


Relationship to Singular Value Decomposition
Alternatively, one can also perform a Twin Heights
Singular Value Decomposition (SVD)
on the data points 𝑿:

Twin height ℎ2
𝜆2
𝑿= 𝑼𝑺𝑾𝑇 𝑤2 𝜆1
𝑤1
• 𝑼 is 𝑛 × 𝑛 unitary matrix
• 𝑺 is 𝑛 × 𝑝 matrix of singular values
𝜎1 , … , 𝜎𝑝 on diagonal
Twin height ℎ1
• 𝑾 is a 𝑝 × 𝑝 matrix of singular vectors 𝒘1, …, 𝒘𝑝

15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13


Relationship to Singular Value Decomposition
Let us insert 𝑿 = 𝑼𝑺𝑾𝑇 into the definition of the covariance matrix 𝑪:
$$ \mathbf{C} = \mathbf{X}^T \mathbf{X} = (\mathbf{U}\mathbf{S}\mathbf{W}^T)^T \, \mathbf{U}\mathbf{S}\mathbf{W}^T = \mathbf{W}\mathbf{S}^T \mathbf{U}^T \mathbf{U} \mathbf{S} \mathbf{W}^T = \mathbf{W}\mathbf{S}^2 \mathbf{W}^T $$
• Eigenvalues correspond to
squared singular values 𝜆𝑖 = 𝜎𝑖2 Twin height ℎ1
• Eigenvectors directly correspond
to singular vectors
15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
Summary of Algorithm
1. Find principal directions 𝑤1 , … 𝑤𝑝 Twin Heights
and eigenvalues 𝜆1 , … , 𝜆𝑝

Twin height ℎ2
2. Project data points into new
coordinate frame using 𝑤2
𝑤1
𝑻 = 𝑿𝑾
3. Keep the 𝑞 most important
dimensions as determined by
𝜆1 , … 𝜆𝑞 (which are sorted by Twin height ℎ1
variance)

15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 15
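The three steps above translate almost directly into code; a hedged sketch assuming NumPy (the function pca and the toy twin-height data are illustrative; the eigenvalues are taken as the squared singular values, matching the unnormalized covariance 𝑪 = 𝑿ᵀ𝑿 used on the slides):

```python
import numpy as np

def pca(X, q):
    """Project the (n x p) data X onto its first q principal directions."""
    Xc = X - X.mean(axis=0)                            # center to zero mean
    U, S, Wt = np.linalg.svd(Xc, full_matrices=False)  # X = U S W^T
    W = Wt.T                                           # columns = principal directions
    eigenvalues = S ** 2                               # lambda_i = sigma_i^2
    T = Xc @ W[:, :q]                                  # projected coordinates
    return T, W[:, :q], eigenvalues

# Toy twin-height data: two strongly correlated features
rng = np.random.default_rng(0)
h1 = rng.normal(1.75, 0.10, size=100)
X = np.c_[h1, h1 + rng.normal(0.0, 0.02, size=100)]
T, W, lam = pca(X, q=1)
print(lam)   # almost all variance lies on the first direction
```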


Machine Learning for Engineers
Principal Component Analysis – Summary

Bilder: TF / Malter
Summary of Principal Component Analysis
• Rotate coordinate system such that all axes are sorted from most
variance to least variance
• Required axes 𝑾 determined using either
• Eigenvectors and –values of covariance matrix 𝑪 = 𝑾𝑳𝑾𝑇
• Singular Value Decomposition (SVD) of data points 𝑿 = 𝑼𝑺𝑾𝑇

• Subsequently drop axes with least variance


• Variance-based feature selection has limitations (see. Adidas problem)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 22


References
1. P. Liang. (2009). Practical Machine Learning: Dimensionality Reduction [Online]. Available:
https://people.eecs.berkeley.edu/~jordan/courses/294-fall09/lectures/dimensionality/
2. K. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press,
2012.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 23


Machine Learning for Engineers
Deep Learning – Perceptron

Bilder: TF / Malter
The Human Brain
The human brain is our reference for
an intelligent agent, that
a) … contains different areas
specialized for some tasks (e.g.,
the visual cortex)
b) … consists of neurons as the
fundamental unit of “computation”

⮩ Let us have a closer look at the inner workings of a neuron 💡

Image from https://ucresearch.tumblr.com/post/138868208574


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
The Human Neuron – Signaling Mechanism
1. Excitatory
stimuli reach
the neuron
⮩ Input
2. Threshold is
reached
3. Neuron fires
and triggers
action potential
⮩ Output

Image from https://www.khanacademy.org/science/biology/human-biology/neuron-nervous-system/a/the-synapse


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3
The Perceptron – Computational Model of a Neuron
1. Let’s start by adding some basic components (input and output), we’re
subsequently going to build the computational model step by step

Input 𝑥1

Input 𝑥2 Output 𝑦1

Input 𝑥3

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4


The Perceptron – Computational Model of a Neuron
2. Next, let’s add some weights to select and deselect input channels as
not all are relevant to our model neuron

Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2
Input 𝑥3
⋅ 𝑤3

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5


The Perceptron – Computational Model of a Neuron
3. Then, let’s add all the excitatory stimuli to the resting potential to
determine the current potential of our model neuron

Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2 +
Input 𝑥3
⋅ 𝑤3 𝑏

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6


The Perceptron – Computational Model of a Neuron
4. Finally, let’s apply a threshold function 𝜎 (usually the sigmoid) to
determine whether to send an action potential in the output

Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2 +
Input 𝑥3
⋅ 𝑤3 𝑏 1
𝜎 𝑥 =
1 + 𝑒 −𝑥

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7


The Perceptron – Computational Model of a Neuron
5. We can now write a perceptron as a mathematical function mapping the
inputs 𝑥1 , 𝑥2 , 𝑥3 to the output 𝑦1 using channel weights 𝑤1 , 𝑤2 , 𝑤3 , 𝑏:

$$ y_1 = \sigma(w_1 x_1 + w_2 x_2 + w_3 x_3 + b) = \sigma\Bigl( \sum_i w_i x_i + b \Bigr) $$
Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2 +
Input 𝑥3
⋅ 𝑤3 𝑏 1
𝜎 𝑥 =
1 + 𝑒 −𝑥
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
The Perceptron – Signaling Mechanism
Given a perceptron with parameters
𝑤1 = 4, 𝑤2 = 7, 𝑤3 = 11, 𝑏 = −10 and equation
𝑦1 = 𝜎 𝑤1 ⋅ 𝑥1 + 𝑤2 ⋅ 𝑥2 + 𝑤3 ⋅ 𝑥3 + 𝑏

Input 𝒙𝟏 = 𝟏, 𝒙𝟐 = 𝟎, 𝒙𝟑 = 𝟎 (“100”) Input 𝒙𝟏 = 𝟏, 𝒙𝟐 = 𝟏, 𝒙𝟑 = 𝟏 (“111”)


𝑦1 = 𝜎 4 ⋅ 1 + 7 ⋅ 0 + 11 ⋅ 0 − 10 𝑦1 = 𝜎 4 ⋅ 1 + 7 ⋅ 1 + 11 ⋅ 1 − 10
= 𝜎 4 + 0 + 0 − 10 = 𝜎 −6 = 𝜎 4 + 7 + 11 − 10 = 𝜎 12
= 0.00 = 1.00
⮩ Output 𝑦1 not activated ⮩ Output 𝑦1 is activated
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9
Generalizing the Threshold: Activation Function
Sigmoid Hyperbolic Tangent Rectified Linear Unit

1
𝜎 𝑥 = 𝜎 𝑥 = tanh(𝑥) 𝜎 𝑥 = max(𝑥, 0)
1 + 𝑒 −𝑥

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
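The perceptron from this section is only a few lines of code; a minimal sketch assuming NumPy, reusing the parameters from the signaling example (w = (4, 7, 11), b = −10):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def perceptron(x, w, b, activation=sigmoid):
    """y = sigma(sum_i w_i * x_i + b)"""
    return activation(np.dot(w, x) + b)

w = np.array([4.0, 7.0, 11.0])
b = -10.0
print(perceptron(np.array([1, 0, 0]), w, b))   # ~0.00 -> output not activated
print(perceptron(np.array([1, 1, 1]), w, b))   # ~1.00 -> output activated
```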


Machine Learning for Engineers
Deep Learning – Multilayer Perceptron

Bilder: TF / Malter
The Perceptron – A Recap
In the last section we learned about the perceptron, a computational model
representing a neuron.
Inputs: 𝑥1, …, 𝑥𝑛
Output: 𝑦1
Computation: $y_1 = \sigma\bigl( \sum_{i=1}^{n} w_i x_i + b \bigr)$

[Figure: a single perceptron with inputs x1, x2, x3 and output y1]

⮩ We can combine more than one perceptron for complex models 💡

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12


Combining Multiple Perceptron – Individual Notation
𝑥1 Let’s now combine three perceptrons with
different outputs 𝑦1 , 𝑦2 and 𝑦3 .
𝑥2 𝑦1
The computations for each of these are:
𝑥3
𝑦1 = 𝜎(𝒘𝟏𝟏 𝑥1 + 𝒘𝟏𝟐 𝑥2 + 𝒘𝟏𝟑 𝑥3 + 𝒃𝟏)
𝑦2 = 𝜎(𝒘𝟐𝟏 𝑥1 + 𝒘𝟐𝟐 𝑥2 + 𝒘𝟐𝟑 𝑥3 + 𝒃𝟐)
𝑦3 = 𝜎(𝒘𝟑𝟏 𝑥1 + 𝒘𝟑𝟐 𝑥2 + 𝒘𝟑𝟑 𝑥3 + 𝒃𝟑)
𝑥1
Note how they all have their own parameters
𝑥2 𝑦3 (weights 𝑤 and bias 𝑏)
𝑥3
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13
Combining Multiple Perceptrons – Matrix Notation
𝑥1 The outputs 𝑦, weights 𝑤, inputs 𝑥 and bias 𝑏 of
the perceptrons can be collected into vectors and a matrix.
𝑥2 𝑦1
We can thus rewrite the three computations1) as:
𝑥3
[𝑦1, 𝑦2, 𝑦3]ᵀ = 𝜎( [𝑤11 𝑤12 𝑤13; 𝑤21 𝑤22 𝑤23; 𝑤31 𝑤32 𝑤33] ⋅ [𝑥1, 𝑥2, 𝑥3]ᵀ + [𝑏1, 𝑏2, 𝑏3]ᵀ )
𝑥3
𝑥1 Or in a more simplified form:

𝑥2 𝑦3 𝒚 = 𝜎(𝑾 ⋅ 𝒙 + 𝒃)

1) This is a perfect opportunity to brush up your linear algebra skills. As an optional homework assignment, you can verify that the given matrix multiplication holds.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
Combining Multiple Perceptrons – Single Layer
Instead of depicting each computation as a node
𝑥1 𝑦1 or circle, we can also
• Depict each value (input 𝑥 and output 𝑦) as a
𝑥2 𝑦2 node or circle
• Depict each weighted connection as an arrow
𝑥3 𝑦3
This is the typical graphical representation1), with
𝑾, 𝒃 the underlying computation remaining
𝒚 = 𝜎(𝑾 ⋅ 𝒙 + 𝒃)

1) Note how the bias is implicitly assumed and not visualized.


| Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 15
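A minimal NumPy sketch of this single layer 𝒚 = 𝜎(𝑾 ⋅ 𝒙 + 𝒃); the concrete numbers are arbitrary and only for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer(x, W, b):
    # y = sigma(W @ x + b), with sigma applied element-wise
    return sigmoid(W @ x + b)

W = np.array([[ 0.2, -0.5, 1.0],   # weights w11 ... w13
              [ 0.7,  0.1, 0.0],   # weights w21 ... w23
              [-1.0,  0.3, 0.4]])  # weights w31 ... w33
b = np.array([0.1, -0.2, 0.0])     # bias b1, b2, b3
x = np.array([1.0, 2.0, 3.0])      # inputs x1, x2, x3

print(layer(x, W, b))              # outputs y1, y2, y3
```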
Combining Multiple Perceptrons – Multiple Layers
There’s no reason to limit ourselves to a
𝑥1 𝑦1 single layer.
𝑧1 We can chain multiple layers, with each
𝑥2 𝑦2 output being the input of the next:
𝑧2 𝒚 = 𝜎(𝑾1 ⋅ 𝒙 + 𝒃1 )
𝑥3 𝑦3
𝒛 = 𝜎(𝑾2 ⋅ 𝒚 + 𝒃2 )
𝑾1 , 𝒃1 𝑾2 , 𝒃2
Combining these leads us to:
𝒛 = 𝜎(𝑾2 ⋅ 𝜎(𝑾1 ⋅ 𝒙 + 𝒃1 ) + 𝒃2 )

| Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 16


Multilayer Perceptron – A Summary
• A multilayer perceptron is a neural
𝑥1 𝑦1 network with multiple perceptrons
𝑧1 • It is oftentimes organized into layers
𝑥2 𝑦2
• Each layer has its own set of
𝑧2 parameters (weights 𝑾𝑖 and bias 𝒃𝑖 )
𝑥3 𝑦3
• The underlying computation is a
𝑾1 , 𝒃1 𝑾2 , 𝒃2 matrix multiplication described by
𝒚𝑖+1 = 𝜎(𝑾𝑖 ⋅ 𝒚𝑖 + 𝒃𝑖 )

| Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 17


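A minimal sketch of such a multilayer perceptron as a chain of layers; the layer sizes and random initialization below are our own choices for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(x, params):
    # params is a list of (W, b) pairs, one per layer:
    # y^(i+1) = sigma(W^i @ y^i + b^i)
    y = x
    for W, b in params:
        y = sigmoid(W @ y + b)
    return y

rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 3)), np.zeros(3)),  # layer 1: W^1, b^1
          (rng.normal(size=(2, 3)), np.zeros(2))]  # layer 2: W^2, b^2

print(mlp(np.array([1.0, 2.0, 3.0]), params))      # outputs z1, z2
```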
Machine Learning for Engineers
Deep Learning – Loss Function

Bilder: TF / Malter
How do our models learn?
• Up until now, the parameters 𝑾𝑖 and 𝒃𝑖 of the multilayer perceptron
were assumed to be given
• We now aim to learn the parameters 𝑾𝑖 and 𝒃𝑖 based on some
example input and output data
• Let us assume we have a dataset of 𝒙 and corresponding 𝒚
(𝒙1 , 𝒚1 ) = ([1.2, 1.4, 1.3]ᵀ, [3.2, 0.2]ᵀ) and (𝒙2 , 𝒚2 ) = ([0.2, 0.4, 0.3]ᵀ, [0.2, 0.2]ᵀ) and …

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19


How do our models learn?
• We now want to ensure that the parameters 𝑾𝑖 and 𝒃𝑖 can correctly
predict the 𝒚𝑖 based on the 𝒙𝑖
• To do this, we feed 𝒙𝑖 into our model and compare the result with 𝒚𝑖

Inputs: 𝑥1 = 1.2, 𝑥2 = 1.4, 𝑥3 = 1.3
Predictions vs. targets: ŷ1 = 2.9 ≟ 3.2 and ŷ2 = 0.2 ≟ 0.2
𝑾1 , 𝒃1 𝑾2 , 𝒃2
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20
The Loss Function
• We need a comparison metric between the predicted outputs 𝒚̂𝑖 and the
expected outputs 𝒚𝑖
• This is called the loss function, which usually depends on the type of
problem the multilayer perceptron solves
• For regression, a common metric is the mean squared error
• For classification, a common metric is the cross entropy

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 21


Mean Squared Error
• For regression, each output 𝑦1 , … , 𝑦𝑛 is a real-valued target
• The difference ŷ𝑖 − 𝑦𝑖 tells us something about how far off we are, with
the mean squared error computed as

𝐿(𝒚̂, 𝒚) = (1/𝑛) ∑𝑖=1…𝑛 (ŷ𝑖 − 𝑦𝑖)²

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 22


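In code, the mean squared error is a one-liner; the numbers below are the prediction/target pair from the earlier example:

```python
import numpy as np

def mse(y_hat, y):
    # L(y_hat, y) = (1/n) * sum_i (y_hat_i - y_i)^2
    return np.mean((y_hat - y) ** 2)

print(mse(np.array([2.9, 0.2]), np.array([3.2, 0.2])))  # 0.045
```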
Cross Entropy – Class Probability and Softmax
• For classification, each output 𝑦1 , … , 𝑦𝑛 represents a class energy
• If the target class index is 𝑖, then only 𝑦𝑖 = 1 and all other 𝑦𝑗 = 0, 𝑖 ≠ 𝑗
(this is also called one-hot encoding)
• Example for class 1: 𝑦 = [1 0 0 0 0]ᵀ
• Example for class 3: 𝑦 = [0 0 1 0 0]ᵀ

• To ensure that we have probabilities, we apply a softmax transformation


ȳ𝑖 = exp(ŷ𝑖) / ∑𝑗=1…𝑛 exp(ŷ𝑗)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 23


Cross Entropy – Negative Log Likelihood
• After the transformation, each ȳ1 , … , ȳ𝑛 represents a class probability
• The negative log − log ȳ𝑖 for the target class index 𝑖 tells us how far off
we are, with the cross entropy calculated as

𝐿(𝒚̂, 𝒚) = − log ȳ𝑖 = − log( exp(ŷ𝑖) / ∑𝑗=1…𝑛 exp(ŷ𝑗) )

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 24


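A minimal NumPy sketch of the softmax transformation and the resulting cross-entropy loss (subtracting the maximum before exponentiating is a common numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax(y_hat):
    # y_bar_i = exp(y_hat_i) / sum_j exp(y_hat_j)
    e = np.exp(y_hat - np.max(y_hat))  # shift for numerical stability
    return e / np.sum(e)

def cross_entropy(y_hat, target_index):
    # L(y_hat, y) = -log(y_bar_i) for the target class index i
    return -np.log(softmax(y_hat)[target_index])

scores = np.array([2.0, 0.5, -1.0])  # class energies y_hat
print(softmax(scores))               # class probabilities, they sum to 1
print(cross_entropy(scores, 0))      # small loss, since class 0 has the largest energy
```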
How do our models learn? – A Summary
• How wrong is the current set of parameters 𝜽? Forward Pass
• How should we change the set of parameters by ∇𝜽? Backward Pass

∇𝜽 ← Backward Pass
Inputs: 𝑥1 = 1.2, 𝑥2 = 1.4, 𝑥3 = 1.3
Predictions vs. targets: ŷ1 = 2.9 ≟ 3.2 and ŷ2 = 0.2 ≟ 0.2
Forward Pass → 𝜽
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 25
Machine Learning for Engineers
Deep Learning – Gradient Descent

Bilder: TF / Malter
Parameter Optimization
• At this stage, we can compute the gradient of the parameters ∇𝜽
• The gradient tells us how we need to change the current parameters 𝜽
in order to make fewer errors on the given data
• We can use this in an iterative algorithm called gradient descent, with
the central equation being
𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖
• The learning rate 𝜇 tells us how quickly we should change the current
parameters 𝜽𝑖
⮩ Let us have a closer look at an illustrative example 💡
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 27
Gradient Descent – An Example
• Let us assume that the error function is
𝑓(𝜃) = 𝜃²
• The gradient of the error function is the
derivative 𝑓′(𝜃) = 2 ⋅ 𝜃
• We aim to find the minimum error at
the location 𝜃 = 0
• Our initial estimate for the location is
𝜃1 = 2
• The learning rate is 𝜇 = 0.25

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 28


Gradient Descent – An Example
• 1. Step: 𝜃1 = 2, ∇𝜃1 = 4
𝜃 2 = 𝜃1 − 𝜇∇𝜃1 = 2 − 0.25 ⋅ 4 = 1

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 29


Gradient Descent – An Example
• 1. Step: 𝜃1 = 2, ∇𝜃1 = 4
𝜃 2 = 𝜃1 − 𝜇∇𝜃1 = 2 − 0.25 ⋅ 4 = 1
• 2. Step: 𝜃 2 = 1, ∇𝜃 2 = 2
𝜃 3 = 𝜃 2 − 𝜇∇𝜃 2 = 1 − 0.25 ⋅ 2 = 0.5

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 30


Gradient Descent – An Example
• 1. Step: 𝜃1 = 2, ∇𝜃1 = 4
𝜃 2 = 𝜃1 − 𝜇∇𝜃1 = 2 − 0.25 ⋅ 4 = 1
• 2. Step: 𝜃 2 = 1, ∇𝜃 2 = 2
𝜃 3 = 𝜃 2 − 𝜇∇𝜃 2 = 1 − 0.25 ⋅ 2 = 0.5
• 3. Step: 𝜃 3 = 0.5, ∇𝜃 3 = 1
𝜃 4 = 𝜃 3 − 𝜇∇𝜃 3 = 0.5 − 0.25 ⋅ 1 = 0.25

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 31


Gradient Descent – An Example
• 1. Step: 𝜃1 = 2, ∇𝜃1 = 4
𝜃 2 = 𝜃1 − 𝜇∇𝜃1 = 2 − 0.25 ⋅ 4 = 1
• 2. Step: 𝜃 2 = 1, ∇𝜃 2 = 2
𝜃 3 = 𝜃 2 − 𝜇∇𝜃 2 = 1 − 0.25 ⋅ 2 = 0.5
• 3. Step: 𝜃 3 = 0.5, ∇𝜃 3 = 1
𝜃 4 = 𝜃 3 − 𝜇∇𝜃 3 = 0.5 − 0.25 ⋅ 1 = 0.25
• We incrementally approach the
expected target 𝜃 = 0

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 32


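The whole example fits into a few lines of Python; the printed values are exactly the iterates shown above (plus one more step):

```python
def gradient(theta):
    # derivative of the error function f(theta) = theta^2
    return 2.0 * theta

theta = 2.0  # initial estimate theta^1
mu = 0.25    # learning rate

for step in range(4):
    theta = theta - mu * gradient(theta)  # theta^(i+1) = theta^i - mu * grad
    print(step + 1, theta)                # 1.0, 0.5, 0.25, 0.125 -> approaches 0
```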
Gradient Descent – Limitations
• The learning rate 𝜇 can lead to slow
convergence if not properly configured
(see example for 𝜇 = 0.95)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 33


Gradient Descent – Limitations
• The learning rate 𝜇 can lead to slow
convergence if not properly configured
(see example for 𝜇 = 0.95)
• The starting parameters 𝜃1 can lead to
a solution stuck in a local minimum (see
example for 𝑓(𝜃) = 0.5𝜃⁴ − 0.25𝜃² + 0.1𝜃)
• There’s no guarantee of finding an
optimal solution (a global minimum)

Global Local
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 34
Gradient Descent – A Summary
• Gradient Descent incrementally adjusts the parameters 𝜽 based on the
gradient ∇𝜽 of the parameters
1. For each iteration
2. Compute the error of the parameters 𝜽
3. Compute the gradient of the parameters ∇𝜽
4. Update parameters using 𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖

• There is no guarantee that the algorithm converges to the global
optimum (e.g., due to learning rate 𝜇 or initial parameters 𝜽1 )

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 35


Machine Learning for Engineers
Deep Learning – Learning Process

Bilder: TF / Malter
How do our models learn?
• We’ve just talked about the concept of gradient descent, but largely
avoided the details for the function 𝑓(𝜽)1)
• The goal is to minimize the error or loss function across the complete
dataset 𝒙1 , 𝒚1 , … , (𝒙𝑛 , 𝒚𝑛 )
𝑓(𝜽) = ∑𝑖=1…𝑛 𝐿(𝒚̂𝑖 , 𝒚𝑖 ) = ∑𝑖=1…𝑛 𝐿(𝑔(𝒙𝑖 , 𝜽), 𝒚𝑖 )

• The function 𝑔(𝒙𝑖 , 𝜽) encapsulates the computations of a multilayer
perceptron with parameters 𝜽

1) In the previous section the function was 𝑓(𝜃) = 𝜃². For a multilayer perceptron, the function is of course different.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 37
How do our models learn?
• We thus need to loop over the training data 𝒙1 , 𝒚1 , … , (𝒙𝑛 , 𝒚𝑛 ) to
compute sums over individual data points
• This results in a slight modification to the gradient descent algorithm
1. For each epoch – Loop over training data
2. For each batch – Loop over pieces of training data
3. Compute the error of the parameters 𝜽
4. Compute the gradient of the parameters ∇𝜽
5. Update parameters using 𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 38


The Concept of Epochs
• Gradient descent requires multiple iterations to reach a minimum
• The optimization function 𝑓(𝜽) sums the loss functions 𝐿(𝒚̂𝑖 , 𝒚𝑖 ) of each
individual data point (𝒙𝑖 , 𝒚𝑖 ) in the training data
• We need to go through the training data multiple times to find suitable
parameters 𝜽 for our problem
⮩ Each such iteration is called an epoch

⮩ Requires the computation of the full sum, which is expensive 💡

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 39


The Concept of Batches
• Instead of computing the full sum of one epoch in one go, we
break it apart into multiple smaller pieces
• These pieces are usually of a specified size (e.g., 64), which tells us
how many training data points to use for each piece
• One epoch then consists of processing all pieces
⮩ Each such piece of the training data is called a batch
Batch1) (30 samples) Full training data (120 samples)

1) Note how we need 4 batches to go through all of the training data in this specific example.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 40
The Learning Process – A Summary
• The parameters 𝜽 are optimized by iterating through the training data
• For practical reasons, this is split into epochs and batches

Batch (30 samples) Full training data (120 samples)


Epoch 1 1. iteration 2. iteration 3. iteration 4. iteration

Epoch 2 5. iteration 6. iteration 7. iteration 8. iteration

Epoch 3 9. iteration 10. iteration 11. iteration 12. iteration



13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 41
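A schematic sketch of this epoch/batch loop. To keep the gradient computation short, a plain linear model stands in for the multilayer perceptron; the data, batch size and learning rate are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))                  # full training data: 120 samples
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3       # targets from a known linear rule

theta, b = np.zeros(3), 0.0                    # parameters to learn
mu, batch_size, n_epochs = 0.1, 30, 3          # 4 batches per epoch, as in the figure

for epoch in range(n_epochs):                  # loop over the training data
    for start in range(0, len(X), batch_size): # loop over pieces (batches)
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        y_hat = xb @ theta + b                           # forward pass
        grad_theta = 2 * xb.T @ (y_hat - yb) / len(xb)   # gradient of the MSE loss
        grad_b = 2 * np.mean(y_hat - yb)
        theta, b = theta - mu * grad_theta, b - mu * grad_b  # update step
    print(epoch + 1, np.mean((X @ theta + b - y) ** 2))      # loss shrinks each epoch
```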
Machine Learning for Engineers
Deep Learning – Convolution

Bilder: TF / Malter
The Visual Cortex
We’ve previously learned that the
brain has specialized regions.
• The visual cortex is in charge of
processing visual information
collected from the retinae
• However, cortical cells only
respond to stimuli within a small
receptive field

⮩ Let us look at the implications this has for neural networks 💡


Image from https://link.springer.com/referenceworkentry/10.1007/978-1-4614-7320-6_360-2
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 43
Multilayer Perceptron versus Cortical Neuron
• A multilayer perceptron’s neurons are connected to all neurons of the previous layer
• Cortical neurons are only connected to a small receptive field

Standard Cortical
neuron neuron
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 44
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦11 = 𝑤11 𝑥11 + 𝑤12 𝑥12 + 𝑤13 𝑥13 + 𝑤21 𝑥21 + 𝑤22 𝑥22 + 𝑤23 𝑥23
+ 𝑤31 𝑥31 + 𝑤32 𝑥32 + 𝑤33 𝑥33
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 =


𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 45
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦12 = 𝑤11 𝑥12 + 𝑤12 𝑥13 + 𝑤13 𝑥14 + 𝑤21 𝑥22 + 𝑤22 𝑥23 + 𝑤23 𝑥24
+ 𝑤31 𝑥32 + 𝑤32 𝑥33 + 𝑤33 𝑥34
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 =


𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 46
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦13 = 𝑤11 𝑥13 + 𝑤12 𝑥14 + 𝑤13 𝑥15 + 𝑤21 𝑥23 + 𝑤22 𝑥24 + 𝑤23 𝑥25
+ 𝑤31 𝑥33 + 𝑤32 𝑥34 + 𝑤33 𝑥35
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 =


𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 47
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦23 = 𝑤11 𝑥23 + 𝑤12 𝑥24 + 𝑤13 𝑥25 + 𝑤21 𝑥33 + 𝑤22 𝑥34 + 𝑤23 𝑥35
+ 𝑤31 𝑥43 + 𝑤32 𝑥44 + 𝑤33 𝑥45
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 48
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦22 = 𝑤11 𝑥22 + 𝑤12 𝑥23 + 𝑤13 𝑥24 + 𝑤21 𝑥32 + 𝑤22 𝑥33 + 𝑤23 𝑥34
+ 𝑤31 𝑥42 + 𝑤32 𝑥43 + 𝑤33 𝑥44
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦22 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 49
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦21 = 𝑤11 𝑥21 + 𝑤12 𝑥22 + 𝑤13 𝑥23 + 𝑤21 𝑥31 + 𝑤22 𝑥32 + 𝑤23 𝑥33
+ 𝑤31 𝑥41 + 𝑤32 𝑥42 + 𝑤33 𝑥43
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦21 𝑦22 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 50
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦31 = 𝑤11 𝑥31 + 𝑤12 𝑥32 + 𝑤13 𝑥33 + 𝑤21 𝑥41 + 𝑤22 𝑥42 + 𝑤23 𝑥43
+ 𝑤31 𝑥51 + 𝑤32 𝑥52 + 𝑤33 𝑥53
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦21 𝑦22 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33 𝑦31

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 51
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦32 = 𝑤11 𝑥32 + 𝑤12 𝑥33 + 𝑤13 𝑥34 + 𝑤21 𝑥42 + 𝑤22 𝑥43 + 𝑤23 𝑥44
+ 𝑤31 𝑥52 + 𝑤32 𝑥53 + 𝑤33 𝑥54
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦21 𝑦22 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33 𝑦31 𝑦32

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 52
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦33 = 𝑤11 𝑥33 + 𝑤12 𝑥34 + 𝑤13 𝑥35 + 𝑤21 𝑥43 + 𝑤22 𝑥44 + 𝑤23 𝑥45
+ 𝑤31 𝑥53 + 𝑤32 𝑥54 + 𝑤33 𝑥55
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦21 𝑦22 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33 𝑦31 𝑦32 𝑦33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 53
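A minimal NumPy sketch of this operation (note that, as on the slides, the kernel is slid over the input without flipping, i.e. a cross-correlation):

```python
import numpy as np

def conv2d(x, w):
    # Slide the 3x3 kernel w over the 5x5 input x with stride 1;
    # each output value depends only on the inputs under the kernel.
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return y

x = np.arange(25, dtype=float).reshape(5, 5)  # input values x11 ... x55
w = np.ones((3, 3)) / 9.0                     # example kernel: 3x3 averaging filter
print(conv2d(x, w))                           # 3x3 output y11 ... y33
```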
The Convolutional Neural Network – Multiple Filters
Instead of a single convolution operation we can also apply multiple
convolution operations in parallel on the same input.

Input 𝒙
The different outputs are
concatenated in the
channel dimension.
Kernel 𝒘

Output 𝒚 + + =

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 54


The Convolutional Neural Network – Multiple Layers
To incrementally process the input into relevant features, we again apply
multiple layers of convolutions.

Input Image → Convolution (64 kernels) → Convolution (128 kernels)

After many layers the number of features quickly grows 💡
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 55
Machine Learning for Engineers
Deep Learning – Pooling

Bilder: TF / Malter
Summarizing Features via Pooling
• Deeper in the network the number of features quickly grows
• It makes sense to summarize these features deeper in the network
• From a practical perspective, this reduces the number of computations
• From a theoretical perspective, these higher-level (i.e., global scale)
features don’t require a high spatial resolution

• This process is called pooling and works like a convolution, with the
kernel replaced by a pooling operation

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 57


The Pooling Operation
The output computation again only depends on a subset of inputs, but the
stride (spacing between subsets) is commonly larger:
𝑦11 = max(𝑥11 , 𝑥12 , 𝑥21 , 𝑥22 )

𝑥11 𝑥12 𝑥13 𝑥14

𝑥21 𝑥22 𝑥23 𝑥24 𝑦11


max 𝒙
𝑥31 𝑥32 𝑥33 𝑥34

𝑥41 𝑥42 𝑥43 𝑥44

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 58


The Pooling Operation
The output computation again only depends on a subset of inputs, but the
stride (spacing between subsets) is commonly larger:
𝑦12 = max(𝑥13 , 𝑥14 , 𝑥23 , 𝑥24 )

𝑥11 𝑥12 𝑥13 𝑥14

𝑥21 𝑥22 𝑥23 𝑥24 𝑦11 𝑦12


max 𝒙
𝑥31 𝑥32 𝑥33 𝑥34

𝑥41 𝑥42 𝑥43 𝑥44

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 59


The Pooling Operation
The output computation again only depends on a subset of inputs, but the
stride (spacing between subsets) is commonly larger:
𝑦21 = max(𝑥31 , 𝑥32 , 𝑥41 , 𝑥42 )

𝑥11 𝑥12 𝑥13 𝑥14

𝑥21 𝑥22 𝑥23 𝑥24 𝑦11 𝑦12


max 𝒙
𝑥31 𝑥32 𝑥33 𝑥34 𝑦21

𝑥41 𝑥42 𝑥43 𝑥44

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 60


The Pooling Operation
The output computation again only depends on a subset of inputs, but the
stride (spacing between subsets) is commonly larger:
𝑦22 = max(𝑥33 , 𝑥34 , 𝑥43 , 𝑥44 )

𝑥11 𝑥12 𝑥13 𝑥14

𝑥21 𝑥22 𝑥23 𝑥24 𝑦11 𝑦12


max 𝒙
𝑥31 𝑥32 𝑥33 𝑥34 𝑦21 𝑦22

𝑥41 𝑥42 𝑥43 𝑥44

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 61


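A minimal NumPy sketch of 2×2 max pooling with stride 2, matching the 4×4 example above:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    # Like a convolution, but the kernel is replaced by a pooling operation (max).
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size]
            y[i, j] = np.max(patch)
    return y

x = np.arange(16, dtype=float).reshape(4, 4)  # input values x11 ... x44
print(max_pool2d(x))                          # 2x2 output: [[ 5.  7.] [13. 15.]]
```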
The Pooling Operation
There are multiple methods, commonly used in Convolutional Neural
Networks (CNNs), to summarize the features:
Example for 𝒏 = 𝟒 | General case
• Maximum: 𝑦 = max(𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 ) | 𝑦 = max(𝑥1 , … , 𝑥𝑛 )
• Average: 𝑦 = (1/4)(𝑥1 + 𝑥2 + 𝑥3 + 𝑥4 ) = (1/4) ∑𝑖=1…4 𝑥𝑖 | 𝑦 = (1/𝑛) ∑𝑖=1…𝑛 𝑥𝑖
• L2-Norm: 𝑦 = √(𝑥1² + 𝑥2² + 𝑥3² + 𝑥4²) = √(∑𝑖=1…4 𝑥𝑖²) | 𝑦 = √(∑𝑖=1…𝑛 𝑥𝑖²)

⮩ There’s no general consensus on what the “best” method is 💡

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 62


The Convolutional Neural Network – Pooling
Adding pooling to the network reduces the number of features deeper in
the network using a (relatively) inexpensive operation.

Input Image → Convolution 1 → Pooling 1 → Convolution 2 → Pooling 2
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 63
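A small sketch of such a convolution/pooling stack, assuming PyTorch is available; the channel counts and image size are illustrative, not taken from the slides:

```python
import torch
import torch.nn as nn

# Two convolution blocks, each followed by max pooling, as in the figure above.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3),    # Convolution 1: 64 kernels
    nn.ReLU(),
    nn.MaxPool2d(2),                    # Pooling 1: halves the spatial resolution
    nn.Conv2d(64, 128, kernel_size=3),  # Convolution 2: 128 kernels
    nn.ReLU(),
    nn.MaxPool2d(2),                    # Pooling 2
)

x = torch.randn(1, 3, 32, 32)           # one 32x32 RGB input image
print(model(x).shape)                   # torch.Size([1, 128, 6, 6])
```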
Machine Learning for Engineers
Deep Learning – Applications

Bilder: TF / Malter
Full Image Classification
• Task: Plant leaf disease classification
of 14 different species
• Input: Image of plant leaf
• Output: Plant and disease class
grape with black measles potato with late blight
• Model: EfficientNet
• Ü. Atila, M. Uçar, K. Akyol, and E. Uçar, “Plant leaf disease
classification using EfficientNet deep learning model”, 2020.

healthy strawberry healthy tomato


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 65
Full Image Regression
• Task: Pose estimation based on
human key points (eyes, joints, …)
• Input: Image of human
• Output: Coordinates of 17 key points
• Model: PoseNet
• G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson,
and K. Murphy, “PersonLab: Person Pose Estimation and
Instance Segmentation with a Bottom-Up, Part-Based,
Geometric Embedding Model”, 2018.

• Animated image taken from TensorFlow Lite documentation at
https://www.tensorflow.org/lite/examples/pose_estimation/overview

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 66


Pixel-wise Image Classification (Segmentation)
• Task: Land cover classification in
satellite imagery
• Input: Satellite image
• Output: Land cover class per
individual pixel
• Model: Segnet, Unet
• Z. Han, Y. Dian, H. Xia, J. Zhou, Y. Jian, C. Yao, X. Wang,
and Y. Li, “Comparing Fully Deep Convolutional Neural
Networks for Land Cover Classification with High-Spatial-
Resolution Gaofen-2 Images”, 2020.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 67


Pixel-wise Image Regression
• Task: Relative depth
estimation from a single
image
• Input: Any image
• Output: Depth value per
individual pixel
• Model: MiDaS
• R. Ranftl, K. Lasinger, D. Hafner, K.
Schindler, and V. Koltun, “Towards Robust
Monocular Depth Estimation: Mixing
Datasets for Zero-shot Cross-dataset
Transfer”, 2019.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 68
