Slides
Motivation
Machine Learning Types
Machine Learning Pipeline
Summary
Linear Regression - Motivation
Linear Regression - Overall Picture
Linear Regression - Model
Linear Regression - Optimization
Linear Regression - Basis Functions
Logistic Regression - Motivation
Logistic Regression - Framework
Overfitting and Underfitting
Problem Statement
Optimization
Kernel Trick
Hard and Soft Margin
Regression
Summary
Applications
Intuition
Mathematics
Summary
Perceptron
Multilayer Perceptron
Loss Function
Gradient Descent
Learning Process
Convolution
Pooling
Applications
Machine Learning for Engineers
Organizational Information
Images: TF / Malter
Course Organizers
Prof. Dr. Bjoern M. Eskofier
Machine Learning and Data Analytics Lab
University Erlangen-Nürnberg
Exercises
Two real-world industrial applications
• Exercise 1: Energy prediction using Support Vector Machines
• Exercise 2: Image classification using Convolutional Neural Networks
Deep Learning
by I. Goodfellow, Y. Bengio, A. Courville, 2016
Four Industrial Revolutions
We are currently in the 4th Industrial Revolution:
• 1st Industrial Revolution (~1750): Steam machines allow industrialization
• 2nd Industrial Revolution: Division of labor and varied series production (keyword: conveyor belts)
• 3rd Industrial Revolution (~1960): Electronics and IT enable automation-driven rationalization
• 4th Industrial Revolution: Interlinked production; machines communicate with each other and optimize the process
Source: Industrie 4.0 in Produktion, Automatisierung und Logistik, T. Bauernhansl, M. Hompel, B. Vogel-Heuser, 2014
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
The Core of the Fourth Industrial Revolution is Digitalization
• In 1960, mechanical and plant engineering was dominated by mechanical engineering
• The share of mechanical engineering has been decreasing steadily with each decade, until today
• The impact of computer science (including AI and machine learning) has been increasing since the 1980s
• Mechanical and plant engineering is now composed of multiple fields (mechanical engineering, electrical engineering, computer science, systems engineering) and, unlike in the 1960s, has become an interdisciplinary area
(Figure: percentage of each engineering discipline in mechanical and plant engineering, 1960-2020)
Source: Automatisierung 4.0, T. Schmertosch, M. Krabbes, 2018
What Drives that Progress in Computer Science?
• Of course: algorithmic advances (not only in AI)
• BUT: one significant force of progress IS AI
• You can see that clearly in the number of AI publications and the applications in that area
• So now: What is Artificial Intelligence actually? And what is the difference to Machine Learning and Deep Learning?
(Figure: annual number of AI papers, 2000-2015, total and for the USA, Europe, and China)
Source: https://www.ibm.com/blogs
Driving Factors for the Advancement of AI
There are three important reasons for these advancements:
1. Increase of the computational power
2. Increase of the amount of available data
3. Development of new algorithms
Image sources: https://datafloq.com, https://europeansting.com, https://medium.com
Increase of Computational Power
• Transistors are the building blocks of CPUs: small "blocks" in the computer doing "simple" calculations
• The number of transistors correlates with computational power
• Moore's Law: "The number of transistors in a dense integrated circuit (IC) doubles about every two years."
• The number of transistors per mm² increases linearly on a logarithmic scale, and that is exponential growth
• This means exponential growth of computational power!
(Figure: transistors per square millimeter by year, 1970-2020, logarithmic y-axis)
Source: https://medium.com
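The doubling claim can be sketched numerically. This is a toy model of Moore's law; the Intel 4004's roughly 2,300 transistors in 1971 are an assumed anchor point, not a number from the slide:

```python
# Moore's law as a simple doubling model: the count doubles every two years.
# Anchor: Intel 4004, ~2,300 transistors in 1971 (an assumed reference point).
def transistors(year, base_year=1971, base_count=2300):
    """Projected transistor count under a doubling-every-two-years model."""
    return base_count * 2 ** ((year - base_year) / 2)

# One decade of doubling every two years is a factor of 2^5 = 32
growth_per_decade = transistors(1981) / transistors(1971)
```

Plotting `transistors(year)` on a logarithmic y-axis gives exactly the straight line described above.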
Computational Power of Today's Supercomputers
• The increase in computational power is especially noticeable in supercomputers
• Supercomputers have millions of cores (2020)
• The leaders (2020) are Japan, the USA, and China
• Germany is in 13th place

TOP500 Supercomputers (June 2020):
# | System | Specs | Site | Country | Cores | Rmax (Pflop/s) | Power (MW)
1 | Supercomputer Fugaku | Fujitsu A64FX (48C, 2.2GHz), Tofu Interconnect D | RIKEN R-CCS | Japan | 7,288,072 | 415.5 | 28.3
2 | Summit | IBM POWER9 (22C, 3.07GHz), NVIDIA Volta GV100 (80C), Dual Rail Mellanox EDR Infiniband | DOE/SC/ORNL | USA | 2,414,592 | 148.6 | 10.1
3 | Sierra | IBM POWER9 (22C, 3.1GHz), NVIDIA Tesla V100 (80C), Dual Rail Mellanox EDR Infiniband | DOE/NNSA/LLNL | USA | 1,572,480 | 94.6 | 7.4
4 | Sunway TaihuLight | Shenwei SW26010 (260C, 1.45GHz), Custom Interconnect | NSCC in Wuxi | China | 10,649,600 | 93 | 15.4
5 | Tianhe-2A | Intel Ivy Bridge (12C, 2.2GHz) | NSCC Guangzhou | China | 4,981,760 | 61.4 | 18.5
Source: https://www.top500.org
Increase in the Amount of Available Data
• The other big driving factor for AI is the availability of data
• The amount of created data jumped from 2 zettabytes (2010) to 47 zettabytes (2020)
• It comes from all the things around us (cars, IoT, aircraft, ...)
• According to General Electric (GE), each of its aircraft engines produces around 20 terabytes of data per hour
Sources: https://www.statista.com, https://www.forbes.com
Source: A Roadmap for Foundational Research on Artificial Intelligence in Medical Imaging, Curtis P. Langlotz et al., 2018
Industrial Examples in Real Environments
KUKA Bottleflip Challenge (video):
• Engineers from KUKA tackled the "Bottleflip Challenge": the robot flips a bottle in the air
• Fun fact: the robot trained itself in a single night
• Source: https://www.youtube.com/watch?v=HkCTRcN8TB0&t
WAYMO driverless driving (video):
• Driving has traditionally been a human job
• Autonomous driving could progress thanks to artificial neural networks and deep learning: a traditionally human job is carried out by machine learning algorithms
• Source: https://www.youtube.com/watch?v=aaOB-ErYq6Y&t
Source: https://deepmind.com
Introduction to Machine Learning
Machine Learning Types
Machine Learning Categories
Supervised Learning
Key Aspects:
• Learning is explicit
• Learning uses direct feedback
• Data with labeled output
Source: https://www.kdnuggets.com
Supervised Learning Problems
(Figure, left - classification: samples in the feature space (x₀, x₁) with labels y = Dog and y = Cat.
Figure, right - regression: samples with feature x and continuous label y.)
Source: https://www.kdnuggets.com
Regression
Regression is used to predict a continuous value.

The training dataset: 𝒟 = {(x₀, y₀), (x₁, y₁), ..., (xₙ, yₙ)}
A sample: (xᵢ, yᵢ)
• xᵢ ∈ ℝᵐ is the feature vector of sample i
• yᵢ ∈ ℝ is the label value of sample i

We fit a model f to all samples:
f(xᵢ) = yᵢ, ∀(xᵢ, yᵢ) ∈ 𝒟

Example (machining, https://de.wikipedia.org/wiki/Zerspanen):
Objective: Predict the surface quality based on the production parameters
Realization: Speed and feed data are gathered and the surface roughness is measured for some trials; by using regression algorithms the surface quality is predicted
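The machining example can be sketched in a few lines. The speed and roughness values below are purely illustrative, not measured data from the slide:

```python
import numpy as np

# Hypothetical trials: cutting speed (production parameter x_i) and
# measured surface roughness (label y_i); illustrative numbers only.
speed = np.array([100.0, 150.0, 200.0, 250.0])
roughness = np.array([2.0, 1.6, 1.3, 1.0])

# Fit f(x) = w0 + w1*x by least squares: "fit a model f to all samples"
A = np.column_stack([np.ones_like(speed), speed])   # design matrix with bias column
w, residuals, rank, sv = np.linalg.lstsq(A, roughness, rcond=None)

# Predict the surface roughness for a new, unseen cutting speed
predicted = w[0] + w[1] * 175.0
```

A new production parameter is then fed through the same model to predict the expected surface quality before the part is machined.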
Example: Prediction of Remaining Useful Life (RUL)
“Health Monitoring and Prediction (HMP)” system (developed by BISTel)
Objective:
• Machine components degrade over time
• This causes malfunctions in the production line
• Replace component before the malfunction!
Realization:
• Use data of malfunctioning systems
• Fit a regression model to the data and predict
the RUL
• Use the model on active systems and apply
necessary maintenance
Source: https://www.bistel.com/
Classification
Classification is used to predict the class of the input.

Important: The output belongs to exactly one class.

A sample: sᵢ = (xᵢ, yᵢ)
yᵢ ∈ L is the label of sample i
In this example: L = {Cat, Dog}
(Figure: samples in the feature space (x₀, x₁) with labels y = Dog and y = Cat)
Source: https://scikit-learn.org/
Classification
Goal: Find a way to divide the input into the output classes!

That means we find a decision boundary f for all samples:
f(xᵢ) = yᵢ, ∀(xᵢ, yᵢ) ∈ 𝒟

In this case f is a linear decision boundary (the black line in the figure).
(Figure: the Dog/Cat samples in the feature space (x₀, x₁) separated by a straight line)
Source: https://scikit-learn.org/
Example: Foreign object detection
Objective:
• Foreign object detection on a production part
• After a production step a chip can remain
on the piston rod
• Quality assurance: Only parts without chip
are allowed for the next production step
Realization:
• A camera system records 3,000 pictures
• All images are labeled by a human
• A machine learning classification algorithm was trained to differentiate between chip and no-chip situations
(Images: piston rod without chip, piston rod with chip)
Source: Implementation and potentials of a machine vision system in a series production using deep learning and low-cost hardware, Hubert
Würschinger et al., 2020
Classification algorithms
Below, you Nearest
Input Data can see example
Linear datasets
RBF andGaussian
the application of algorithms
Different different use
algorithms.Neighbors SVM SVM Processes different ways to classify
data
Therefore:
The algorithms perform
different on the datasets
Example:
Linear SVM’s has bad
results in the second
dataset
Reason: No linear way to
distinguish data
The lower right value shows the classification accuracy
Source: https://scikit-learn.org/
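To make "different ways to classify data" concrete, here is a minimal 1-nearest-neighbor classifier, one of the algorithm families named above, on a tiny made-up 2-D dataset (the points and labels are illustrative assumptions):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """1-NN: classify x with the label of its closest training sample."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

# Toy 2-D dataset: two well-separated groups (hypothetical values)
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])

label = nearest_neighbor_predict(X_train, y_train, np.array([2.8, 3.2]))
```

A linear SVM draws one straight boundary instead; on data like the second dataset in the figure, where no straight line separates the classes, a neighborhood-based rule like this can still succeed.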
Classification vs. Regression
Both are supervised learning problems, i.e. they require labeled datasets. The difference is the kind of machine learning problem they are used for.

Regression:
• The outputs are continuous (real) values
• We try to fit a regression model which predicts the output as accurately as possible
• Regression algorithms are used to solve regression problems such as weather prediction, house price prediction, or stock market prediction

Classification:
• The output variable must be a discrete value (a class)
• We try to find a decision boundary which divides the dataset into the different classes
• Classification algorithms are used to solve classification problems such as hand-written digits (MNIST), speech recognition, identification of cancer cells, or defective vs. non-defective solar cells
Source: https://scikit-learn.org/
Machine Learning Categories
Unsupervised Learning
Key Aspects:
• Learning is implicit
• Learning uses indirect feedback
• Methods are self-organizing
(Figure: the iris dataset, petal width vs. petal length; left panel with the true classes Iris virginica, Iris versicolor, and Iris setosa; right panel with the clusters found, Cluster 1 and Cluster 2)
Clustering Algorithms
Different clustering methods (e.g. K-Means, Affinity Propagation, Mean Shift, Spectral Clustering, Ward) produce different results; some algorithms "find" more clusters than others.

Example: K-Means performs "well" on the third dataset, but not on datasets one and two.
Reason: K-Means can only identify "circular" clusters.
Source: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
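The K-Means algorithm named above can be written in a few lines: alternate between assigning each sample to the nearest centroid and moving each centroid to the mean of its samples. The two blobs below are generated toy data, chosen as the "circular cluster" case K-Means handles well:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every sample to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned samples
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated "circular" blobs -- the case K-Means is good at
X = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5.0, 0.1, (20, 2))])
labels, centroids = kmeans(X, k=2)
```

On ring- or moon-shaped data (datasets one and two in the figure) this same procedure fails, because the centroid-distance assignment can only carve out convex, roughly circular regions.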
Curse of Dimensionality
“As the number of features or dimensions grows, the amount of data we
need to generalize accurately grows exponentially.” – Charles Isbell
Source: Chen L. (2009) Curse of Dimensionality. In: LIU L., ÖZSU M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.
https://doi.org/10.1007/978-0-387-39940-9_133
Example: Curse of Dimensionality
A production system has N sensors attached, each with its input set to "On" or "Off". The number D of distinct input states grows exponentially:
N = 1 : D = 2¹ = 2
N = 10 : D = 2¹⁰ = 1024
N = 100 : D = 2¹⁰⁰ ≈ 1.27 × 10³⁰
For N = 100, the number of possible states already exceeds, for example, the number of stars in the observable universe!
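The state count is straightforward to verify; Python's arbitrary-precision integers handle 2¹⁰⁰ exactly:

```python
# Each of the N binary ("On"/"Off") sensors doubles the number of input states
def num_states(n_sensors):
    return 2 ** n_sensors

states_1 = num_states(1)      # 2
states_10 = num_states(10)    # 1024
states_100 = num_states(100)  # about 1.27e30, computed exactly
```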
Dimensionality Reduction
The goal: Transform the samples from a high- to a lower-dimensional representation!
Ideally: Find a representation which solves your problem!

Example: four samples with nine features S0-S8 ...

         S0   S1    S2    S3    S4   S5  S6   S7  S8
Sample0  0.2  0.1   11.1  2.2   Off  7   1.1  0   1.e-1
Sample1  1.2  -0.1  3.1   -0.1  On   9   2.3  -1  1.e-4
Sample2  2.7  1.1   0.1   0.1   Off  10  4.5  -1  1.e-9
Sample3  3.1  0.1   1.1   0.2   Off  1   6.6  -1  1.e-1

... are transformed into four features T0, T1, T2, T3.
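One standard way to perform such a transformation is principal component analysis (PCA). PCA is not named on the slide, so this is an illustrative choice; the sketch below projects 9 (numeric) features down to 4, mirroring the S0-S8 to T0-T3 example, on randomly generated stand-in data:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Minimal PCA sketch: project samples onto the top principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top                         # lower-dimensional representation

# 4 samples with 9 numeric features (like S0..S8) reduced to 4 (like T0..T3)
X = np.random.default_rng(0).normal(size=(4, 9))
X_low = pca_reduce(X, n_components=4)
```

Note that a real pipeline would first encode the categorical "On"/"Off" column numerically before applying any such projection.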
Machine Learning Categories
Reinforcement Learning
Key Aspects:
• Learning is implicit
• Learning uses indirect feedback based on trials and reward signals
• Actions affect future measurements (i.e. inputs)
Source: https://ai.googleblog.com/2018/06/scalable-deep-reinforcement-learning.html
Introduction to Machine Learning
Machine Learning Pipeline
The Machine Learning Pipeline
A concept that provides guidance in a machine learning project
• Step-by-step process
• Each step has a goal and a method
Source: Artificial Intelligence with Python - Second Edition, by Alberto Artasanchez, Prateek Joshi
1. Problem Definition
Our aim with machine learning is to develop a solution for a problem. In order to develop a satisfying solution, we need to define the problem. This definition lays the foundation of our solution: it shows us what kind of data we need and what kind of algorithms we can use.
Examples:
• If we are trying to find faulty equipment, we have a classification problem
• If we are trying to predict a continuous number, we have a regression problem
Day | Weather | Temp. | Humidity | Wind   | Tennis recommended?
1   | Sunny   | Hot   | High     | Weak   | No
2   | Sunny   | Hot   | High     | Strong | No
3   | Cloudy  | Hot   | High     | Weak   | Yes
4   | Rainy   | Mild  | High     | Weak   | Yes
... | ...     | ...   | ...      | ...    | ...
Example (Performance Monitoring): We measure the actual energy consumption of 10 work steps after each maintenance procedure and calculate the error of the model's energy-consumption predictions.
• Error within the defined range: the model is working sufficiently
• Error outside the defined range: the model is working insufficiently → stop using it and run a root-cause analysis

The pipeline stages:
Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Candidate Model Evaluation → Model Deployment → Performance Monitoring
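The monitoring rule from the energy-consumption example can be sketched directly; the 0.5 kWh tolerance and the consumption values are assumed, illustrative numbers:

```python
# Performance monitoring: flag the model when the prediction error
# leaves the defined range (tolerance of 0.5 kWh is an assumed value).
def monitor(predicted_kwh, measured_kwh, tolerance_kwh=0.5):
    error = abs(predicted_kwh - measured_kwh)
    if error <= tolerance_kwh:
        return "working sufficiently"
    return "stop using and run root-cause analysis"

status_ok = monitor(predicted_kwh=10.2, measured_kwh=10.4)
status_bad = monitor(predicted_kwh=10.2, measured_kwh=12.9)
```

In practice this check would run after every maintenance procedure, feeding its result back into the pipeline's monitoring stage.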
Summary
In this chapter we talked about:
• The history of machine learning and recent trends
• The different types of machine learning, such as supervised learning, unsupervised learning, and
reinforcement learning
• The steps involved in a common machine learning pipeline
Motivation
Linear regression is used in multiple scenarios each and every day!
Use Cases:
• Trend analysis in financial markets, sales, and business (figure: financial development chart from FXCM)
• Computer vision problems, e.g. registration and localization problems
• etc.
Source: https://www.tradingview.com/script/OZZpxf0m-Linear-Regression-Trend-Channel-with-Entries-Alerts/
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9
When do we use it?
The problem has to be simple:
• The dataset is small
• A linear model is enough, e.g. for trend analysis
In addition:
• Linear models are the basis for complex models, e.g. deep networks
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: Growth Reference Dataset (WHO)
The dataset contains age (5y - 12y) and height information from people in the USA.

Age        Height
4.1 years  108 cm
5.2 years  112 cm
5.6 years  108 cm
5.7 years  116 cm
6.2 years  112 cm
6.3 years  116 cm
6.6 years  120 cm
6.7 years  122 cm

We want to answer questions like:
• What is the height of my child when it is 14 years old?
• What is the height of my child when it is 30 years old?
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: 1. Visualize the Data
(Figure: the age/height table above as a scatter plot; age in years (5-12) on the x-axis, height in cm (110-145) on the y-axis)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: 2. Fit a Model
(Figure: the age/height scatter plot, ready for a model to be fitted)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: 2. Fit a Model
What "model", i.e. function, describes our data?
• Polynomial model
• Gaussian model
(Figure: the age/height scatter plot)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: 3. Answer Your Questions
What is the height of my child when it is 14 years old? → 165 cm
What is the height of my child when it is 30 years old? → 270 cm
(Figure: the fitted line extrapolated over ages 0-35, heights 70-270 cm)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: Actual Answer
What is the height of my child when it is 14 years old? (We estimated 165 cm.)
→ ~165 cm
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Next in this Lecture:
• What is the Mathematical Framework?
• How do we fit a linear model?
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion
(Figure: the age/height scatter plot)
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called the dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Overall Picture
Linear Regression is a method to fit linear models to our data!

The linear model:
f(x) = w₀·x₀ + w₁·x₁ = y
with
x = (x₀, x₁) = (1, age in years)
y = height
w = (w₀, w₁) = the weights

In our example the fitted model is:
f(x) = 70·x₀ + 6.5·x₁ = y
→ Finding a good w₀ and w₁ is called fitting!

(Figure: the age/height scatter plot with the fitted line)
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called the dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
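Fitting can be sketched on the eight age/height samples from the WHO example table. Note that this tiny subset yields different weights than the slide's (w₀ = 70, w₁ = 6.5), which come from the full dataset:

```python
import numpy as np

# The age/height samples from the WHO example table
age = np.array([4.1, 5.2, 5.6, 5.7, 6.2, 6.3, 6.6, 6.7])
height = np.array([108, 112, 108, 116, 112, 116, 120, 122], dtype=float)

# Fit f(x) = w0 + w1 * x; polyfit returns the highest-degree coefficient first
w1, w0 = np.polyfit(age, height, deg=1)

# Extrapolate to age 14, as in the example question
predicted_height = w0 + w1 * 14.0
```

The same two-line fit-then-predict pattern is what the slides call "fitting" and "answering your questions".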
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion
(Figure: the scatter plot with the residual ϵᵢ of one sample marked)
Note: In this case we assume that the noise ϵᵢ is Gaussian distributed
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p. 19
The Linear Model
The previous description of the linear model is not accurate!
Reason: Real systems produce noise!

Linear model with noise¹:
f(x) = w₀·x₀ + w₁·x₁ + ϵᵢ

The ϵᵢ, and the sum ϵ over all samples, is called the residual error!
(Figure: the scatter plot with the vertical distance ϵᵢ between one sample and the fitted line)
1) In this case we assume that the noise ϵᵢ is Gaussian distributed
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p. 19
The Residual Error ϵ
The residual error ϵ is not part of the input; it is a random variable!

Typically we assume that ϵ is Gaussian distributed (i.e. Gaussian noise)¹:
p(ϵ) = (1 / (σ·√(2π))) · exp(−½ · ((ϵ − μ) / σ)²)

But other distributions are also possible!
(Figure: the Gaussian distribution: probability over events)
1) The short form is ϵ ~ 𝒩(μ, σ²)
Note: In our use case μ = 0 and σ is small; that means our samples deviate only slightly
The Distribution of y
The noise is Gaussian:
p(ϵ) = 𝒩(ϵ | 0, σ²)

"Inserting" the linear function:
p(y | x, w, σ) = 𝒩(y | wᵀx, σ²)

This is the conditional probability density for the target variable y!
(Figure: the scatter plot with the densities p(y₀|x₀) and p(y₁|x₁) drawn at x₀ and x₁)

The model in sum notation:
f(x) = Σⱼ₌₁ᴰ wⱼ·xⱼ + ϵ = y
Vector notation:
f(x) = wᵀx + ϵ = y
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p. 19
What Parameters Do We Have to Estimate?
The model is linear:
f(x) = Σⱼ₌₁ᴰ wⱼ·xⱼ + ϵ = y
→ We have to estimate the weights w and the noise parameter σ.
Note: In this case we assume that the noise is Gaussian distributed, i.e. ϵ ~ 𝒩(μ, σ²)
Note: We assume μ = 0; that means we only have to estimate σ! Homework: Why can we assume μ = 0?
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion

Optimization: Maximum Likelihood
θ* = argmax_θ p(𝒟 | θ)

That means:
• We want to find the optimal parameters θ*
• We search over all possible θ
• We select the θ which most likely generated our training dataset 𝒟
Reminder: θ are the parameters, 𝒟 is the training dataset!
Intuitive Example
θ* = argmax_θ p(𝒟 | θ)

Model: 𝒟 ~ 𝒩(μ, σ²), parameters θ = {μ, σ}
• θ = {9, 0.9} : bad
• θ = {4, 0.9} : better
• θ = {5, 1.5} : best
(Figure: histogram of the distribution of 𝒟 with the three candidate Gaussians overlaid)
Reminder: θ are the parameters, 𝒟 is the training dataset!
Maximum Likelihood Estimation
This process is called Maximum Likelihood Estimation (MLE):
θ* = argmax_θ p(𝒟 | θ)

ℒ(θ) = p(𝒟 | θ)
     = p(s₀, s₁, ..., sₙ | θ)
     =* p(s₀ | θ) · p(s₁ | θ) · ... · p(sₙ | θ)
     = ∏ᵢ₌₁ᴺ p(sᵢ | θ)
*) assuming the samples are independent and identically distributed
Note: In our case a sample sᵢ is xᵢ and yᵢ; just assume both are "tied" together, i.e. like a vector
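For a Gaussian model the MLE has a closed form, which makes the intuitive example above (θ = {5, 1.5} being best) easy to reproduce on generated data drawn from exactly that distribution:

```python
import numpy as np

# Synthetic training data D drawn from N(5, 1.5^2), the "best" theta above
D = np.random.default_rng(0).normal(loc=5.0, scale=1.5, size=10_000)

# Closed-form maximum likelihood estimates for a Gaussian:
mu_hat = D.mean()                                 # argmax over mu
sigma_hat = np.sqrt(((D - mu_hat) ** 2).mean())   # argmax over sigma (the biased MLE form)
```

With enough samples the estimates land very close to the generating parameters, which is exactly the "select the θ which most likely generated 𝒟" idea.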
The Log-Likelihood
The product of many small numbers is numerically unstable in a computer! We therefore maximize the log-likelihood instead*:
ℓ(θ) = log ℒ(θ) = Σᵢ₌₁ᴺ log p(sᵢ | θ)
*) The logarithm "just" rescales the problem: likelihood and log-likelihood are maximized by the same θ!
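The instability is easy to demonstrate: multiplying a few thousand small per-sample likelihoods underflows double precision to exactly zero, while the sum of their logarithms stays finite (the value 1e-3 for each likelihood is an arbitrary illustration):

```python
import math

# 2,000 hypothetical per-sample likelihood values, each small (~1e-3)
probs = [1e-3] * 2000

product = 1.0
for p in probs:
    product *= p    # (1e-3)^2000 = 1e-6000: underflows to exactly 0.0 in float64

# The log-likelihood of the same data stays a perfectly ordinary number
log_likelihood = sum(math.log(p) for p in probs)
```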
The Conditional Distribution p(sᵢ|θ)
We know the conditional is the Gaussian distribution of y (slide 35):
p(sᵢ | θ) = p(yᵢ | xᵢ, w, σ) = 𝒩(yᵢ | wᵀxᵢ, σ²)

Inserting this into the log-likelihood ℓ(θ) = Σᵢ₌₁ᴺ log p(sᵢ | θ) gives:
ℓ(θ) = −(1 / (2σ²)) · Σᵢ₌₁ᴺ (yᵢ − wᵀxᵢ)² − (N / 2) · log(2πσ²)

The factor 1/(2σ²) and the second term are constant with respect to w: maximizing ℓ(θ) over w is therefore the same as minimizing the sum of squared errors. This minimization can be done analytically or iteratively ... and of course many, many more methods exist!
Analytical Solution
For the analytical solution, we want a "simpler" form¹:
NLL(θ) = ½ (y − Xw)ᵀ(y − Xw)
1) This form uses the "Negative Log-Likelihood", which can also be derived from the likelihood
Analytical Solution
The solution, i.e. the minimum, is:
w = (XᵀX)⁻¹ Xᵀy
with
Xᵀy = Σᵢ₌₁ᴺ xᵢ·yᵢ
XᵀX = Σᵢ₌₁ᴺ xᵢ·xᵢᵀ (a D×D matrix whose (j,k) entry is Σᵢ xᵢ,ⱼ · xᵢ,ₖ)

In practice this can be too time-consuming to compute!
Reason: the more training data, the longer the calculation.
Note: N = number of samples in the training dataset; D = number of dimensions (length of the input vector)
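The normal equation can be evaluated directly in a few lines. The four samples below are made-up numbers lying exactly on the line y = 80 + 6x, chosen so the closed-form answer is easy to check:

```python
import numpy as np

# Design matrix X (bias column + one feature) and targets y; the illustrative
# samples lie exactly on y = 80 + 6*x, so the minimum is w = (80, 6).
X = np.array([[1.0, 5.0], [1.0, 6.0], [1.0, 7.0], [1.0, 8.0]])
y = np.array([110.0, 116.0, 122.0, 128.0])

# Closed-form solution w = (X^T X)^{-1} X^T y, written as a linear solve
w = np.linalg.solve(X.T @ X, X.T @ y)
```

Solving the linear system is preferred over forming the explicit inverse (XᵀX)⁻¹: it is both cheaper and numerically more stable.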
What is the Quality of the Fit?
The goodness of fit should be evaluated to verify the learned model! Three cases:
• The model can completely (100%) explain the variations in y
• The model can partially (91.3%) explain the variations in y, but is still considered good
• The model cannot explain any (0%) variation in y, because it does nothing more than predict the average of y
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
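The "share of explained variation in y" is the coefficient of determination, commonly written R². A minimal sketch, with a small illustrative target vector:

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination R^2: share of the variation in y the model explains."""
    ss_res = np.sum((y - y_pred) ** 2)     # unexplained (residual) variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean of y
    return 1.0 - ss_res / ss_tot

y_true = np.array([110.0, 116.0, 122.0, 128.0])

perfect = r_squared(y_true, y_true)                        # explains 100%
mean_only = r_squared(y_true, np.full(4, y_true.mean()))   # explains 0%
```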
Polynomial Regression
Can we also fit non-linear data, e.g. house prices in Springfield (USA)? The answer is, of course, yes:
f(x) = Σⱼ wⱼ·Φⱼ(x) + ϵ = y
(Figure: polynomial fits of degree m = 2 and degree m = 13 to the house-price data)
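The degree-2 versus degree-13 contrast can be reproduced on synthetic data (the quadratic generating function and noise level below are assumptions for illustration). A higher degree always drives the training error down, which is exactly why training error alone is not evidence of a good model:

```python
import numpy as np

# Hypothetical noisy samples from an underlying quadratic trend
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x ** 2 + rng.normal(0.0, 0.05, size=20)

fit_low = np.polyfit(x, y, deg=2)     # matches the data-generating degree
fit_high = np.polyfit(x, y, deg=13)   # enough freedom to chase the noise

# Training error (sum of squared residuals) for both models
err_low = np.sum((np.polyval(fit_low, x) - y) ** 2)
err_high = np.sum((np.polyval(fit_high, x) - y) ** 2)
```

The degree-13 model achieves the smaller training error by bending toward the noise, the overfitting behaviour the m = 13 panel of the figure illustrates.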
Basis Functions
We can use any arbitrary function Φⱼ, with one condition: the model must stay linear in the weights w, i.e. the basis functions themselves must not depend on w.
Motivation
Logistic regression is the application of
linear regression to classification!
Use Cases:
• Credit Scoring
• Medicine
• Text Processing
• etc.
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
Example: Transform the Label

Label    Petal Width   Petal Length
Setosa   5.0 mm        9.2 mm
Versi.   9.2 mm        26.1 mm
Setosa   7.7 mm        18.9 mm
Versi.   9.1 mm        32.1 mm
Setosa   7.9 mm        15.5 mm
Setosa   5.7 mm        12 mm
Setosa   2.5 mm        13.5 mm
Versi.   15.0 mm       39 mm
Source: Fisher, R.. "THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS." Annals of Human Genetics 7 (1936): 179-188.
Example: Transform the Label
Label    Petal Width    Petal Length
0        5.0 mm         9.2 mm
1        9.2 mm         26.1 mm
0        7.7 mm         18.9 mm
1        9.1 mm         32.1 mm
0        7.9 mm         15.5 mm
0        5.7 mm         12 mm
0        2.5 mm         13.5 mm
1        15.0 mm        39 mm
Example: 1. Visualize the data

[Scatter plot of the table above: Petal Length in mm (x-axis, 10–40) vs. Petal Width in mm (y-axis, 2.5–15.0)]
[The same scatter plot, now with a class legend: Iris Setosa (0), Iris Versicolor (1)]
Example: 2. Find a decision boundary

Several kinds of decision boundary can separate the two classes:
• Linear boundary
• Polynomial boundary
• Gaussian boundary

[Scatter plots: Petal Length in mm vs. Petal Width in mm, each with one of the boundary types drawn between the classes]

Example query: a new flower with a petal width of 15 mm is classified by checking on which side of the decision boundary it falls.
Note: Setosa / Versicolor | Reason: The point is on the “left side” of the decision boundary
Next in this Lecture:
• What is the Mathematical Framework?
• How do we classify using a linear model?
Overview
• The Logistic Model
• Optimization
Rule of thumb: the larger the distance of the input 𝐱 to the decision boundary, the more “certain” the classification; inputs close to the boundary are “uncertain”.

The linear model

𝑓(𝐱) = Σⱼ₌₁ᴰ 𝑤𝑗 𝑥𝑗

calculates a signed distance between the input and the decision boundary: negative when the input is left of the line, positive when it is right of the line.
The sigmoid function
The sigmoid (logistic) function maps any real input to the range (0, 1)!

That means the model is now:

𝜇(𝐱, 𝐰) = 1 / (1 + 𝑒^(−𝐰ᵀ𝐱))
Fun fact: The sigmoid function is sometimes lovingly called “squashing function”
Note: Here we already inserted the function f! Homework: What is the general form of the sigmoid function?
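A minimal sketch of the sigmoid and the resulting model 𝜇(𝐱, 𝐰); the input and weight values are assumed for illustration:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mu(x, w):
    """Model output mu(x, w) = sigmoid(w^T x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

print(sigmoid(0.0))   # 0.5 (exactly on the decision boundary)
print(sigmoid(20.0))  # close to 1 ("certain" positive classification)
```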
Bernoulli distribution
The Bernoulli distribution can model both outcomes of a yes-or-no event (e.g., heads vs. tails of a fair coin, or a vote for Mr. A vs. Mrs. B):

𝑝(y | 𝐱, 𝐰) = Ber(y | 𝜇(𝐱, 𝐰)) = 𝜇(𝐱, 𝐰)ʸ (1 − 𝜇(𝐱, 𝐰))^(1−y)

Problem: how do we get the label based on the calculated probability?
Note: We use the above distribution for the MLE estimation! Basically we replace “𝑝(𝐬𝑖 | 𝛉)” in the log-likelihood with this!
The Decision Rule
Based on 𝜇(𝐱, 𝐰) we can decide which class is more “likely”!

The decision rule is:

𝑦 = 1 if 𝜇(𝐱, 𝐰) > 0.5
𝑦 = 0 if 𝜇(𝐱, 𝐰) ≤ 0.5

Inserting the Bernoulli model into the log-likelihood gives:

𝓵(𝛉) = Σᵢ₌₁ᴺ log[𝑝(y𝑖 | 𝐱𝑖, 𝛉)] = Σᵢ₌₁ᴺ [ 𝑦𝑖 log 𝜇(𝐱𝑖, 𝐰) + (1 − 𝑦𝑖) log(1 − 𝜇(𝐱𝑖, 𝐰)) ]
• The negative log-likelihood 𝐿(𝛉) = −𝓵(𝛉) has a unique minimum
• No analytical solution is possible!
→ Optimization with gradient descent.

Idea: we find the minimum by “walking down” the slope of the loss surface 𝐿(𝛉).

The gradient for logistic regression:

∇𝐿(𝛉) = Σᵢ (𝜇(𝐱𝑖, 𝐰) − 𝑦𝑖) 𝐱𝑖

The update rule:

𝜃𝑖₊₁ = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖)

𝜂 is called the learning rate:
• If 𝜂 is too large → overshooting the minimum
• If 𝜂 is too small → the minimum is not reached
Example: Gradient descent

Algorithm:
1. Repeat for each iteration 𝑖:
2.   Calculate loss 𝐿(𝜃𝑖)
3.   Calculate gradient ∇𝐿(𝜃𝑖)
4.   Update 𝜃𝑖₊₁ = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖)
5.   𝑖 = 𝑖 + 1

Repeat the process until the loss is minimal!

[Plot: loss 𝐿(𝜃) over parameter 𝜃; the iterates 𝜃𝑖, 𝜃𝑖₊₁, 𝜃𝑖₊₂, 𝜃𝑖₊₃ step down the slope towards the minimum]

Caution: on non-convex loss functions (e.g. higher-degree polynomials), gradient descent can end up in a local minimum instead of the global minimum.
https://de.serlo.org/mathe/funktionen/wichtige-funktionstypen-ihre-eigenschaften/polynomfunktionen-beliebigen-grades/polynom
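Putting the pieces together, a sketch of gradient descent for logistic regression using the gradient Σᵢ(𝜇(𝐱𝑖, 𝐰) − 𝑦𝑖)𝐱𝑖 from above; the four 1-D training points (with a constant bias feature) are assumed for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D data with a bias feature: x = (1, value); labels are 0/1 as in the example.
X = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
y = [0, 0, 1, 1]

w = [0.0, 0.0]
eta = 0.5                       # learning rate
for _ in range(200):            # iterations of gradient descent
    # Gradient of the negative log-likelihood: sum_i (mu(x_i, w) - y_i) * x_i
    grad = [0.0, 0.0]
    for xi, yi in zip(X, y):
        mu = sigmoid(w[0] * xi[0] + w[1] * xi[1])
        grad[0] += (mu - yi) * xi[0]
        grad[1] += (mu - yi) * xi[1]
    w = [w[0] - eta * grad[0], w[1] - eta * grad[1]]

# Decision rule: predict 1 if mu(x, w) > 0.5
preds = [int(sigmoid(w[0] + w[1] * v) > 0.5) for _, v in X]
print(preds)  # [0, 0, 1, 1]
```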
Thank you for listening!
Machine Learning for Engineers
Overfitting and Underfitting
Overfitting and Underfitting
Optimal function vs.
Estimated function
For 𝑀 = 0 and 𝑀 = 1 the function fails to
model the data, as the chosen model
is too simple (Underfitting)
[Plot: prediction error vs. model complexity; the ideal complexity (i.e., the minimum of the testing loss) balances the bias error and the variance error]
Note: Complexity does not mean parameters! It means the mathematical complexity! i.e. the ability of the model to capture a relationship!
Complexity vs. Generalization Error
Bias: error induced by the simplifying assumptions of the model

Variance: error induced by differing training data, i.e. how sensitive the model is to “noise”

[Plot: prediction error vs. complexity; the bias falls from high to low with growing complexity, while the variance rises]

The ideal complexity has both a low bias and a low variance!
1) We use this instead of the test set. This is important! You only touch the test set for the final evaluation!
Hyperparameter search methods
There exist multiple ways to find good hyperparameters…
… manual search (Trial-and-Error)
… random search
… grid search
… Bayesian methods
… etc.
Note: Basically every known optimization technique can be used. Examples: Particle Swarm Optimization, Genetic Algorithms, Ant Colony Optimization…
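A minimal manual grid-search sketch over the list of methods above; the objective here is a hypothetical stand-in for a real validation loss:

```python
from itertools import product

def validation_loss(lr, degree):
    # Hypothetical stand-in: in practice you would train on the train split and
    # evaluate on the validation split with these hyperparameters.
    return (lr - 0.1) ** 2 + (degree - 3) ** 2

# Grid search: try every combination of the candidate hyperparameter values.
grid = {"lr": [0.01, 0.1, 1.0], "degree": [1, 2, 3, 4]}
best = min(product(grid["lr"], grid["degree"]),
           key=lambda p: validation_loss(*p))
print(best)  # (0.1, 3)
```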
Hyperparameter search on the data splits
Typically you split the data into a training and a testing split. Hyperparameters must not be tuned on the test split, since it is reserved for the final evaluation.

Solution: split the training data further into a train split and a validation split.
1) We assume the test data is already split and put away. A typical split of all the data is 80% train – 10% validation - 10% testing
K-Fold Cross Validation
Approach: set the test split (10%) aside and split the training data (90%) into 𝑘 folds (in our example 𝑘 = 5): Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5
When 𝑘 = 𝑁, i.e. one fold per training sample, the process is called leave-one-out cross-validation (LOOCV).
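A sketch of producing the 𝑘 folds (contiguous, without shuffling; 𝑛 = 10 and 𝑘 = 5 are assumed for illustration):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds (sketch; no shuffling)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, 5)
# Each fold serves once as validation; the remaining k-1 folds are training data.
for val in folds:
    train = [idx for f in folds if f is not val for idx in f]
    assert len(train) + len(val) == 10
print(folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```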
Recap: Logistic Regression
Goal: find a linear decision boundary separating the two classes (label 𝒚 = Dog vs. label 𝒚 = Cat).

Problem: multiple linear decision boundaries solve the problem with equal accuracy.

Question: which one is the best?

[Plot: Feature 𝑥0 vs. Feature 𝑥1 with candidate boundaries A, B, C, D between the Dog and Cat classes]

• B keeps a comparable distance to both classes throughout the boundary
• C always is closer to Dog than Cat
• D almost touches Dog at the bottom and Cat at the top

B keeps the largest distance to both classes; in mathematical terms, we say it has the largest margin 𝑚.

Updated goal: find the linear decision boundary with the largest margin.

The boundary is described by 𝒘 ⋅ 𝒙 − 𝑏 = 0, where 𝒘 is the normal vector. Furthermore, by definition, the margin boundaries are given by

𝒘 ⋅ 𝒙 − 𝑏 = +1
𝒘 ⋅ 𝒙 − 𝑏 = −1

2/‖𝒘‖ is thus the distance between the two margins → maximize it.
Mathematical Model for Classes
Two-class problem: 𝑦1, …, 𝑦𝑛 = ±1

All training data 𝒙1, …, 𝒙𝑛 needs to be correctly classified outside the margin:

𝒘 ⋅ 𝒙𝑖 − 𝑏 ≥ +1 if 𝑦𝑖 = +1
𝒘 ⋅ 𝒙𝑖 − 𝑏 ≤ −1 if 𝑦𝑖 = −1

Due to the label choice, this can be simplified as

𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) ≥ 1

1. Maximizing the margin width 2/‖𝒘‖ is equal to minimizing ½‖𝒘‖²:

min𝒘,𝑏 ½‖𝒘‖²

2. Subject to no misclassification on the training data:

𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) ≥ 1
Optimization of Support Vector Machines
In the previous section, we derived the constrained optimization problem
for Support Vector Machines (SVMs):
min𝒘,𝑏 ½‖𝒘‖²
s. t. 𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) ≥ 1
1) While not relevant for the exam, a more detailed derivation is available in C. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006, Appendix E.
Primal versus Dual Formulation
This optimization problem is called the primal formulation, with solution 𝑝∗ :
𝑝∗ = min𝒘,𝑏 max𝜶 ℒ(𝒘, 𝑏, 𝜶)

Since Slater’s condition holds for this convex optimization problem, we can
guarantee that 𝑝∗ = 𝑑∗ and solve the dual problem instead:

𝑑∗ = max𝜶 min𝒘,𝑏 ℒ(𝒘, 𝑏, 𝜶)
s. t. 𝛼𝑖 ≥ 0, Σᵢ₌₁ⁿ 𝛼𝑖𝑦𝑖 = 0

This quadratic programming problem can be solved using sequential
minimal optimization, which is not a topic of this lecture.
The Karush–Kuhn–Tucker Conditions
The given problem fulfills the Karush-Kuhn-Tucker conditions1):
𝛼𝑖 ≥ 0
𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) − 1 ≥ 0
𝛼𝑖 [𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) − 1] = 0
💡 When 𝛼𝑖 is non-zero, the training point is on the margin and thus a so-
called “support vector”.
1) While not relevant for the exam, a more detailed derivation is available in C. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006, Appendix E.
Optimization Summary
The dual formulation leads to a quadratic programming problem on 𝜶:1)
max𝜶 ℒ(𝒘, 𝑏, 𝜶) = Σᵢ₌₁ⁿ 𝛼𝑖 − ½ Σᵢ,ⱼ₌₁ⁿ 𝑦𝑖𝑦𝑗𝛼𝑖𝛼𝑗 (𝒙𝑖 ⋅ 𝒙𝑗)
1) This is simplified for the purposes of this summary. For the additional constraints that are also required refer to slide 14
2) Remember the equation for 𝒘 resulting from the derivative in slide 13
Machine Learning for Engineers
Support Vector Machines – Non-Linearity and the Kernel Trick
Recap: Basis Functions
House Prices in Springfield (USA): for linear regression we transform the scalar input with a polynomial basis function, e.g.

𝚽(𝑥) = (1  𝑥  𝑥²  𝑥³  𝑥⁴  𝑥⁵  𝑥⁶)ᵀ
💡 Note how the basis function is applied inside each of the dot products. Most summands of the dual sum are still zero, since only the support vectors have non-zero 𝛼𝑖. This property is called sparsity and greatly simplifies the computations.
Solving Memory Issues with Kernel Trick
Since the basis function 𝚽: ℝ𝑛 ↦ ℝ𝑚 is applied explicitly on each of the
points, the memory scales with the output dimensionality m.
ℒ(𝒘, 𝑏, 𝜶) = Σᵢ₌₁ⁿ 𝛼𝑖 − ½ Σᵢ,ⱼ₌₁ⁿ 𝑦𝑖𝑦𝑗𝛼𝑖𝛼𝑗 (𝚽(𝒙𝑖) ⋅ 𝚽(𝒙𝑗))

Instead, we can replace each dot product 𝚽(𝒙𝑖) ⋅ 𝚽(𝒙𝑗) by a kernel function 𝐾(𝒙𝑖, 𝒙𝑗). The kernel function usually doesn’t require explicit computation of the basis function. This method of replacement is called the kernel trick.
The Linear Kernel
The kernel function is given by:

𝐾(𝒙𝑖, 𝒙𝑗) = (1 / 2𝜎²) 𝒙𝑖 ⋅ 𝒙𝑗

• 𝜎 is a length-scale parameter
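The kernel trick can be checked numerically. For the degree-2 polynomial kernel (not the linear kernel above, but a common choice), 𝐾(𝒙, 𝒛) = (1 + 𝒙 ⋅ 𝒛)² equals the dot product of an explicit 6-dimensional basis expansion:

```python
import math

def phi(x):
    """Explicit degree-2 polynomial basis for 2-D input x = (x1, x2)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

def poly_kernel(x, z):
    """Kernel trick: K(x, z) = (1 + x.z)^2 without ever computing phi."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(explicit, poly_kernel(x, z))  # both 4.0
```

The kernel needs one dot product in 2-D instead of building 6-D feature vectors, which is exactly the memory saving described above.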
Hard Margin and Soft Margin
Up until now we have always implicitly assumed the classes are linearly separable
⮩ So-called hard margin

What happens when the classes overlap and there is no solution?
⮩ So-called soft margin

Introduce slack variables 𝜉𝑖 that move points onto the margin.

If correctly classified and outside the margin:
• We know 𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) ≥ 1 👍
• No correction, thus 𝜉𝑖 = 0

If incorrectly classified or inside the margin:
• We know 𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) < 1 👎
• Move by 𝜉𝑖 = 1 − 𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏)

The constraint becomes

s. t. 𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) ≥ 1 − 𝜉𝑖

The optimization procedure and kernel trick can similarly be derived for
this, but we’ll spare you the details 😉.
Intuition behind Support Vector Regression
The Support Vector Machine is a method for classification. Let us now turn to regression.

With training data (𝒙1, 𝑦1), …, (𝒙𝑛, 𝑦𝑛), we want to find a function of the form

𝑦 = 𝒘 ⋅ 𝒙 + 𝑏

For Support Vector Regression, we want to keep everything in an 𝜖-tube around the function:

𝑦𝑖 ≤ 𝒘 ⋅ 𝒙𝑖 + 𝑏 + 𝜖
𝑦𝑖 ≥ 𝒘 ⋅ 𝒙𝑖 + 𝑏 − 𝜖

For outliers, we introduce slack variables 𝜉𝑖⁺ and 𝜉𝑖⁻:

𝑦𝑖 ≤ 𝒘 ⋅ 𝒙𝑖 + 𝑏 + 𝜖 + 𝜉𝑖⁺
𝑦𝑖 ≥ 𝒘 ⋅ 𝒙𝑖 + 𝑏 − 𝜖 − 𝜉𝑖⁻

The optimization problem becomes

min𝒘,𝑏 ½‖𝒘‖² + 𝐶 Σᵢ₌₁ⁿ (𝜉𝑖⁺ + 𝜉𝑖⁻)
s. t. 𝑦𝑖 ≤ 𝒘 ⋅ 𝒙𝑖 + 𝑏 + 𝜖 + 𝜉𝑖⁺
      𝑦𝑖 ≥ 𝒘 ⋅ 𝒙𝑖 + 𝑏 − 𝜖 − 𝜉𝑖⁻
      𝜉𝑖⁺ ≥ 0, 𝜉𝑖⁻ ≥ 0

𝐶 is a free parameter specifying the trade-off for outliers, and the width of the tube is specified by 𝜖.
Disadvantages
Summary of Support Vector Machines
Support Vector Machine Advantages
• Elegant to optimize: the problem is convex, so its single local optimum is also the global optimum
Application: Image Compression
We can save data more efficiently
by applying PCA!
Can be used for other data as well!

[Figure: an image reconstructed with 16 components (93.62% of the variance) and with 8 components (85.50%)]
Application: Facial Recognition
Eigenfaces (1991) are a fast and efficient
way to recognize faces in a gallery of facial
images.
General Idea:
1. Generate a set of eigenfaces (see right)
based on a gallery
2. Compute the weight for each eigenface
based on the query face image
3. Use the weights to classify (and identify)
the person
Turk, M. and A. Pentland. “Face recognition using eigenfaces.” Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1991): 586-591.
Application: Anomaly Detection
Detecting anomalous traffic in IP Networks is crucial for administration!
General Idea:
Transform the time series of IP packets (requests, etc.) using a PCA
→ The first 𝑘 (typically 10) components represent “normal” behavior
→ The components “after” 𝑘 represent anomalies and/or noise

Packets mapped primarily to the latter components are classified as anomalous!
Anukool Lakhina, Konstantina Papagiannaki, Mark Crovella, Christophe Diot, Eric D. Kolaczyk, and Nina Taft. 2004. Structural analysis of network traffic flows. SIGMETRICS Perform. Eval.
Rev. 32, 1 (June 2004), 61–72. DOI:https://doi.org/10.1145/1012888.1005697
Limitation: The Adidas Problem
Let the data points be distributed like the Adidas logo, with three implicit classes (teal, red, blue stripe).

[Plot: Feature 1 vs. Feature 2, with the principal directions 𝑤1 and 𝑤2 drawn over the stripes]

The principal directions 𝑤1 and 𝑤2 found by PCA are given. The first principal direction 𝑤1 does not preserve the information needed to classify the points!
Intuition behind Principal Component Analysis
Consider the following data with the heights of two twins ℎ1 and ℎ2.

[Scatter plot: twin height ℎ1 vs. twin height ℎ2]

Now assume that
a) we can only plot one-dimensional figures due to limitations1)
b) we have complex models that can only handle a single value1)
1) These are obviously constructed limitations for didactical concerns. However, in real life you will also face limitations in visualization
(maximum of three dimensions) or model feasibility.
How do we find a transformation from two values to one value? In this specific case we could

a) keep the average height ½(ℎ1 + ℎ2) as one value
b) discard the difference in height ℎ1 − ℎ2

💡 This corresponds to a change in basis or coordinate frame in the plot, from (ℎ1, ℎ2) to (average height, height difference)
⮩ Rotation by the matrix

𝑨 = (1/√2) [ 1   1 ]
           [ 1  −1 ]
Principal Component Analysis: Preserving Distances
💡 Idea: preserve the distances between each pair of points in the lower dimension
• Points that are far apart in the original space remain far apart
• Points that are close in the original space remain close

To decide which dimension to discard, look at the variance along each axis:
• Find those axes and dimensions with high variance
• Discard the axes and dimensions with increasingly low variance

[Plot: in the rotated frame, “average height” has high variance and “height difference” has low variance]

Goal: algorithmically find directions with high/low variance in the data1)
1) In this trivial case we could hand-engineer such a transformation. However, this is not generally the case.
Machine Learning for Engineers
Principal Component Analysis – Mathematics
Description of Input Data
Let 𝑿 be a 𝑛 × 𝑝 matrix of data points, where
• 𝑛 is the number of points (here 15)
• 𝑝 is the number of features (here 2)

𝑿 = [ 1.75 1.77 ]
    [ 1.67 1.68 ]
    [  ⋮    ⋮   ]
    [ 2.01 1.98 ]

The covariance matrix1) generalizes the variance 𝜎² of a Gaussian distribution 𝒩(𝜇, 𝜎²) to higher dimensions.
1) Technically the covariance matrix of the transposed data points 𝑿𝑇 after centering to zero mean.
Description of Desired Output Data
We’re interested in new axes that rotate the system, here 𝑤1 and 𝑤2. These are the columns of a matrix

𝑾 = [ 0.7   0.7 ]
    [ 0.7  −0.7 ]

with associated eigenvalues 𝜆1 = 2, 𝜆2 = 1.

They come from the eigendecomposition of the covariance matrix, 𝑪 = 𝑾𝑳𝑾ᵀ, where
• 𝑾 is a 𝑝 × 𝑝 matrix, where each column is an eigenvector
• 𝑳 is a 𝑝 × 𝑝 matrix with all eigenvalues on the diagonal in decreasing order

Alternatively, we can use the Singular Value Decomposition (SVD) of the data,

𝑿 = 𝑼𝑺𝑾ᵀ

• 𝑼 is a 𝑛 × 𝑛 unitary matrix
• 𝑺 is a 𝑛 × 𝑝 matrix with the singular values 𝜎1, …, 𝜎𝑝 on the diagonal
• 𝑾 is a 𝑝 × 𝑝 matrix of singular vectors 𝒘1, …, 𝒘𝑝

The two views are connected through

𝑿ᵀ𝑿 = (𝑼𝑺𝑾ᵀ)ᵀ(𝑼𝑺𝑾ᵀ) = 𝑾𝑺ᵀ𝑼ᵀ𝑼𝑺𝑾ᵀ = 𝑾𝑺²𝑾ᵀ

• Eigenvalues correspond to squared singular values: 𝜆𝑖 = 𝜎𝑖²
• Eigenvectors directly correspond to singular vectors
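The correspondence 𝜆𝑖 = 𝜎𝑖² can be verified numerically; the data here is randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])
X = X - X.mean(axis=0)          # center the data first

# Eigendecomposition of X^T X = W S^2 W^T (eigvalsh returns ascending order)
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

# Singular values of X = U S W^T (returned in descending order)
singvals = np.linalg.svd(X, compute_uv=False)

# Eigenvalues correspond to squared singular values: lambda_i = sigma_i^2
print(np.allclose(eigvals, singvals**2))  # True
```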
Summary of Algorithm
1. Find the principal directions 𝑤1, …, 𝑤𝑝 and eigenvalues 𝜆1, …, 𝜆𝑝
2. Project the data points into the new coordinate frame using 𝑻 = 𝑿𝑾
3. Keep the 𝑞 most important dimensions as determined by 𝜆1, …, 𝜆𝑞 (which are sorted by variance)
Summary of Principal Component Analysis
• Rotate coordinate system such that all axes are sorted from most
variance to least variance
• Required axes 𝑾 determined using either
• Eigenvectors and –values of covariance matrix 𝑪 = 𝑾𝑳𝑾𝑇
• Singular Value Decomposition (SVD) of data points 𝑿 = 𝑼𝑺𝑾𝑇
The Human Brain
The human brain is our reference for
an intelligent agent, that
a) … contains different areas
specialized for some tasks (e.g.,
the visual cortex)
b) … consists of neurons as the
fundamental unit of “computation”
[Diagram: a perceptron with inputs 𝑥1, 𝑥2, 𝑥3 weighted by 𝑤1, 𝑤2, 𝑤3, a bias 𝑏, a summation, and an activation 𝜎 producing the output 𝑦1]

The perceptron weights each input, sums the weighted inputs together with the bias 𝑏, and applies the activation function

𝜎(𝑥) = 1 / (1 + 𝑒⁻ˣ)

so that

𝑦1 = 𝜎(𝑤1 ⋅ 𝑥1 + 𝑤2 ⋅ 𝑥2 + 𝑤3 ⋅ 𝑥3 + 𝑏) = 𝜎(Σᵢ 𝑤𝑖 ⋅ 𝑥𝑖 + 𝑏)
The Perceptron – Signaling Mechanism
Given a perceptron with parameters
𝑤1 = 4, 𝑤2 = 7, 𝑤3 = 11, 𝑏 = −10 and equation
𝑦1 = 𝜎 𝑤1 ⋅ 𝑥1 + 𝑤2 ⋅ 𝑥2 + 𝑤3 ⋅ 𝑥3 + 𝑏
Common activation functions 𝜎 are:

𝜎(𝑥) = 1 / (1 + 𝑒⁻ˣ) (sigmoid)    𝜎(𝑥) = tanh(𝑥)    𝜎(𝑥) = max(𝑥, 0) (ReLU)
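A sketch of this perceptron with the slide’s parameters 𝑤1 = 4, 𝑤2 = 7, 𝑤3 = 11, 𝑏 = −10; the input vectors are assumed for illustration:

```python
import math

def perceptron(x, w, b):
    """Single perceptron: y = sigmoid(sum_i w_i * x_i + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Parameters from the slide: w1 = 4, w2 = 7, w3 = 11, b = -10
w, b = [4.0, 7.0, 11.0], -10.0
print(perceptron([1.0, 0.0, 0.0], w, b))  # sigmoid(-6): close to 0
print(perceptron([1.0, 1.0, 1.0], w, b))  # sigmoid(12): close to 1
```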
The Perceptron – A Recap
In the last section we learned about the perceptron, a computational model
representing a neuron.
Inputs: 𝑥1, …, 𝑥𝑛
Output: 𝑦1
Computation: 𝑦1 = 𝜎(Σᵢ₌₁ⁿ 𝑤𝑖 ⋅ 𝑥𝑖 + 𝑏)

Several perceptrons stacked side by side form a layer with outputs 𝑦1, 𝑦2, 𝑦3, …; in matrix form, a layer computes 𝒚 = 𝜎(𝑾 ⋅ 𝒙 + 𝒃).
How do our models learn?
• Up until now, the parameters 𝑾𝑖 and 𝒃𝑖 of the multilayer perceptron
were assumed as given
• We now aim to learn the parameters 𝑾𝑖 and 𝒃𝑖 based on some
example input and output data
• Let us assume we have a dataset of 𝒙 and corresponding 𝒚
[Example: a small dataset of input vectors 𝒙𝑖 with corresponding target vectors 𝒚𝑖 is passed through a two-layer network with parameters 𝑾1, 𝒃1 and 𝑾2, 𝒃2, and the network outputs are compared against the targets]
The Loss Function
• ෝ𝑖 and the
We need a comparison metric between the predicted outputs 𝒚
expected outputs 𝒚𝑖
• This is called the loss function, which usually depends on the type of
problem the multilayer perceptron solves
• For regression, a common metric is the mean squared error
• For classification, a common metric is the cross entropy
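Minimal sketches of the two loss functions; the prediction/target pairs are assumed example values (2.9 vs. 3.2 and 0.2 vs. 0.2, echoing the example network above):

```python
import math

def mean_squared_error(y_pred, y_true):
    """Common regression loss: average squared difference."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def cross_entropy(y_pred, y_true):
    """Common (binary) classification loss; y_pred are probabilities in (0, 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(y_pred, y_true)) / len(y_true)

print(mean_squared_error([2.9, 0.2], [3.2, 0.2]))  # ~0.045
print(cross_entropy([0.9, 0.1], [1, 0]))           # ~0.105
```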
[Diagram: the forward pass through the network with parameters 𝜽 produces predictions that are compared to the targets; the backward pass produces the gradient ∇𝜽]
Machine Learning for Engineers
Deep Learning – Gradient Descent
Parameter Optimization
• At this stage, we can compute the gradient of the parameters ∇𝜽
• The gradient tells us how we need to change the current parameters 𝜽
in order to make fewer errors on the given data
• We can use this in an iterative algorithm called gradient descent, with
the central equation being
𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖
• The learning rate 𝜇 tells us how quickly we should change the current
parameters 𝜽𝑖
⮩ Let us have a closer look at an illustrative example 💡
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 27
Gradient Descent – An Example
• Let us assume that the error function is
𝑓 𝜃 = 𝜃2
• The gradient of the error function is the
derivative 𝑓 ′ 𝜃 = 2 ⋅ 𝜃
• We aim to find the minimum error at
the location 𝜃 = 0
• Our initial estimate for the location is
𝜃1 = 2
• The learning rate is 𝜇 = 0.25
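Running these settings, each update is 𝜃 − 0.25 ⋅ 2𝜃 = 0.5𝜃, so the estimate halves on every step:

```python
# Gradient descent on f(theta) = theta**2 with the slide's settings.
theta = 2.0   # initial estimate theta_1
mu = 0.25     # learning rate

trajectory = [theta]
for _ in range(10):
    grad = 2.0 * theta         # f'(theta) = 2 * theta
    theta = theta - mu * grad  # theta_{i+1} = theta_i - mu * grad
    trajectory.append(theta)

print(trajectory[:4])  # [2.0, 1.0, 0.5, 0.25]
```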
Gradient Descent – A Summary
• Gradient Descent incrementally adjusts the parameters 𝜽 based on the
gradient ∇𝜽 of the parameters
1. For each iteration
2. Compute the error of the parameters 𝜽
3. Compute the gradient of the parameters ∇𝜽
4. Update parameters using 𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖
How do our models learn?
• We’ve just talked about the concept of gradient descent, but largely
avoided the details for the function 𝑓(𝜽)1)
• The goal is to minimize the error or loss function across the complete
dataset 𝒙1 , 𝒚1 , … , (𝒙𝑛 , 𝒚𝑛 )
𝑓(𝜽) = Σᵢ₌₁ⁿ 𝐿(𝒚̂𝑖, 𝒚𝑖) = Σᵢ₌₁ⁿ 𝐿(𝑔(𝒙𝑖, 𝜽), 𝒚𝑖)
1) In the previous section the function was 𝑓 𝜃 = 𝜃 2 . For multilayer perceptron, the function is quite obviously different.
How do our models learn?
• We thus need to loop over the training data 𝒙1 , 𝒚1 , … , (𝒙𝑛 , 𝒚𝑛 ) to
compute sums over individual data points
• This results in a slight modification to the gradient descent algorithm
1. For each epoch – Loop over training data
2. For each batch – Loop over pieces of training data
3. Compute the error of the parameters 𝜽
4. Compute the gradient of the parameters ∇𝜽
5. Update parameters using 𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖
1) Note how we need 4 batches to go through all of the training data in this specific example.
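The epoch/batch structure above as a loop sketch; the gradient and update steps are left as comments, since they depend on the model:

```python
# Epoch/batch training-loop sketch; `data` stands in for (x_i, y_i) pairs.
data = list(range(16))
batch_size = 4  # -> 4 batches per epoch, as in the footnote's example

def batches(data, size):
    """Yield consecutive pieces of the training data."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

n_updates = 0
for epoch in range(3):                       # 1. loop over the training data
    for batch in batches(data, batch_size):  # 2. loop over pieces of it
        # 3. compute the error of the parameters on this batch
        # 4. compute the gradient of the parameters
        # 5. update: theta <- theta - mu * grad
        n_updates += 1

print(n_updates)  # 3 epochs * 4 batches = 12 parameter updates
```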
The Learning Process – A Summary
• The parameters 𝜽 are optimized by iterating through the training data
• For practical reasons, this is split into epochs and batches
The Visual Cortex
We’ve previously learned that the
brain has specialized regions.
• The visual cortex is in charge of
processing visual information
collected from the retinae
• However, cortical cells only respond to stimuli within a small receptive field

[Diagram: a standard neuron connected to all inputs vs. a cortical neuron connected only to a local receptive field]
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦11 = 𝑤11 𝑥11 + 𝑤12 𝑥12 + 𝑤13 𝑥13 + 𝑤21 𝑥21 + 𝑤22 𝑥22 + 𝑤23 𝑥23
+ 𝑤31 𝑥31 + 𝑤32 𝑥32 + 𝑤33 𝑥33
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15
𝑥21 𝑥22 𝑥23 𝑥24 𝑥25      𝑤11 𝑤12 𝑤13      𝑦11 𝑦12 𝑦13
𝑥31 𝑥32 𝑥33 𝑥34 𝑥35  ⋆  𝑤21 𝑤22 𝑤23  =  𝑦21 𝑦22 𝑦23
𝑥41 𝑥42 𝑥43 𝑥44 𝑥45      𝑤31 𝑤32 𝑤33      𝑦31 𝑦32 𝑦33
𝑥51 𝑥52 𝑥53 𝑥54 𝑥55

The kernel 𝒘 slides over the input 𝒙; each position of the kernel produces one output value 𝑦𝑖𝑗. With several kernels, the different outputs are concatenated in the channel dimension.
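A sketch of this valid convolution (implemented, as is common in CNNs, as cross-correlation); the 5 × 5 input and the averaging kernel are assumed values:

```python
import numpy as np

def conv2d_valid(x, w):
    """Slide the kernel over the input; each position yields one output value."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input as on the slide
w = np.ones((3, 3)) / 9.0                     # 3x3 averaging kernel (assumed)
y = conv2d_valid(x, w)
print(y.shape)  # (3, 3) output, matching the slide
```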
Summarizing Features via Pooling
• Deeper in the network the number of features quickly grows
• It makes sense to summarize these features deeper in the network
• From a practical perspective, this reduces the number of computations
• From a theoretical perspective, these higher-level (i.e., global scale)
features don’t require a high spatial resolution
• This process is called pooling and works like a convolution, with the
kernel replaced by a pooling operation
[Diagram: a typical architecture alternates the two layer types: Input Image → Convolution 1 → Pooling 1 → Convolution 2 → Pooling 2]
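A sketch of 2 × 2 max pooling, one common choice of pooling operation; the input values are assumed:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Max pooling with stride = size: keep the largest value per window."""
    H, W = x.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 2., 1.],
              [1., 0., 0., 3.]])
print(max_pool2d(x))  # the four windows' maxima: 4, 8, 1, 3
```

The 4 × 4 input shrinks to 2 × 2, which is exactly the reduction in spatial resolution (and computation) described above.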
Machine Learning for Engineers
Deep Learning – Applications
Full Image Classification
• Task: Plant leaf disease classification
of 14 different species
• Input: Image of plant leaf
• Output: Plant and disease class
[Example images: grape with black measles, potato with late blight]
• Model: EfficientNet
• Ü. Atila, M. Uçar, K. Akyol, and E. Uçar, “Plant leaf disease classification using EfficientNet deep learning model”, 2020.