Slides
Motivation
Machine Learning Types
Machine Learning Pipeline
Summary
Linear Regression - Motivation
Linear Regression - Overall Picture
Linear Regression - Model
Linear Regression - Optimization
Linear Regression - Basis Functions
Logistic Regression - Motivation
Logistic Regression - Framework
Overfitting and Underfitting
Problem Statement
Optimization
Kernel Trick
Hard and Soft Margin
Regression
Summary
Applications
Intuition
Mathematics
Summary
Perceptron
Multilayer Perceptron
Loss Function
Gradient Descent
Learning Process
Convolution
Pooling
Applications
Machine Learning for Engineers
Organizational Information
Images: TF / Malter
Course Organizers
Prof. Dr. Bjoern M. Eskofier
Machine Learning and Data Analytics Lab
University Erlangen-Nürnberg
Exercises
Two real-world industrial applications
• Exercise 1: Energy prediction using Support Vector Machines
• Exercise 2: Image classification using Convolutional Neural Networks
Deep Learning
by I. Goodfellow, Y. Bengio, A. Courville, 2016
Four Industrial Revolutions
We are currently in the 4th Industrial Revolution:
• 1st Industrial Revolution (~1750): Steam machines allow industrialization
• 2nd Industrial Revolution: Division of labor and varied series production (keyword: conveyor belts)
• 3rd Industrial Revolution (~1960): Electronics and IT enable automation-driven rationalization
• 4th Industrial Revolution: Interlinked production; machines communicate with each other and optimize the process
Source: Industrie 4.0 in Produktion, Automatisierung und Logistik, T. Bauernhansl, M. Hompel, B. Vogel-Heuser, 2014
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
The Core of the Fourth Industrial Revolution is Digitalization
• In 1960, mechanical and plant engineering was dominated by mechanical engineering
• The share of mechanical engineering has been decreasing steadily with each decade, until today
• The impact of computer science (including AI and machine learning) has been increasing since the 1980s
• Mechanical and plant engineering is now composed of multiple fields (mechanical engineering, electrical engineering, computer science, systems engineering) and, unlike in the 1960s, has become an interdisciplinary area
(Figure: percentage of each engineering discipline in mechanical and plant engineering, 1960-2020)
Source: Automatisierung 4.0, T. Schmertosch, M. Krabbes, 2018
What Drives that Progress in Computer Science?
• Of course: algorithmic advances (not only in AI)
• BUT: one significant force of progress IS AI
• You can see that clearly in the number of AI publications and the applications in that area
• So now: What is Artificial Intelligence actually? And what is the difference to Machine Learning and Deep Learning?
(Figure: annual number of AI papers, 2000-2015, total and for the USA, Europe, and China)
Source: https://www.ibm.com/blogs
Driving Factors for the Advancement of AI
There are three important reasons for these advancements:
1. Increase of the computational power
2. Increase of the amount of available data
3. Development of new algorithms
Image sources: https://datafloq.com, https://europeansting.com, https://medium.com
Increase of Computational Power
• Transistors are the building blocks of CPUs: small "blocks" in the computer doing "simple" calculations
• The number of transistors correlates with computational power
• Moore's Law: "The number of transistors in a dense integrated circuit (IC) doubles about every two years."
• The number of transistors per mm² increases linearly on a logarithmic scale, and that is exponential growth
• This means exponential growth of computational power!
(Figure: transistors per square millimeter by year, 1970-2020, logarithmic y-axis)
Source: https://medium.com
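The doubling claim can be sketched numerically. This is a toy model of Moore's law; the Intel 4004's roughly 2,300 transistors in 1971 are an assumed anchor point, not a number from the slide:

```python
# Moore's law as a simple doubling model: the count doubles every two years.
# Anchor: Intel 4004, ~2,300 transistors in 1971 (an assumed reference point).
def transistors(year, base_year=1971, base_count=2300):
    """Projected transistor count under a doubling-every-two-years model."""
    return base_count * 2 ** ((year - base_year) / 2)

# One decade of doubling every two years is a factor of 2^5 = 32
growth_per_decade = transistors(1981) / transistors(1971)
```

Plotting `transistors(year)` on a logarithmic y-axis gives exactly the straight line described above.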
Computational Power of Today's Supercomputers
• The increase in computational power is especially noticeable in supercomputers
• Supercomputers have millions of cores (2020)
• The leaders (2020) are Japan, the USA, and China
• Germany is in 13th place

TOP500 Supercomputers (June 2020):
# | System | Specs | Site | Country | Cores | Rmax (Pflop/s) | Power (MW)
1 | Supercomputer Fugaku | Fujitsu A64FX (48C, 2.2GHz), Tofu Interconnect D | RIKEN R-CCS | Japan | 7,288,072 | 415.5 | 28.3
2 | Summit | IBM POWER9 (22C, 3.07GHz), NVIDIA Volta GV100 (80C), Dual Rail Mellanox EDR Infiniband | DOE/SC/ORNL | USA | 2,414,592 | 148.6 | 10.1
3 | Sierra | IBM POWER9 (22C, 3.1GHz), NVIDIA Tesla V100 (80C), Dual Rail Mellanox EDR Infiniband | DOE/NNSA/LLNL | USA | 1,572,480 | 94.6 | 7.4
4 | Sunway TaihuLight | Shenwei SW26010 (260C, 1.45GHz), Custom Interconnect | NSCC in Wuxi | China | 10,649,600 | 93 | 15.4
5 | Tianhe-2A | Intel Ivy Bridge (12C, 2.2GHz) | NSCC Guangzhou | China | 4,981,760 | 61.4 | 18.5
Source: https://www.top500.org
Increase in the Amount of Available Data
• The other big driving factor for AI is the availability of data
• The amount of created data jumped from 2 zettabytes (2010) to 47 zettabytes (2020)
• It comes from all the things around us (cars, IoT, aircraft, ...)
• According to General Electric (GE), each of its aircraft engines produces around 20 terabytes of data per hour
Sources: https://www.statista.com, https://www.forbes.com
Source: A Roadmap for Foundational Research on Artificial Intelligence in Medical Imaging, Curtis P. Langlotz et al., 2018
Industrial Examples in Real Environments
KUKA Bottleflip Challenge (video):
• Engineers from KUKA tackled the "Bottleflip Challenge": the robot flips a bottle in the air
• Fun fact: the robot trained itself in a single night
• Source: https://www.youtube.com/watch?v=HkCTRcN8TB0&t
WAYMO driverless driving (video):
• Driving has traditionally been a human job
• Autonomous driving could progress thanks to artificial neural networks and deep learning: a traditionally human job is carried out by machine learning algorithms
• Source: https://www.youtube.com/watch?v=aaOB-ErYq6Y&t
Source: https://deepmind.com
Introduction to Machine Learning
Machine Learning Types
Machine Learning Categories
Supervised Learning
Key Aspects:
• Learning is explicit
• Learning uses direct feedback
• Data with labeled output
Source: https://www.kdnuggets.com
Supervised Learning Problems
(Figure, left - classification: samples in the feature space (x₀, x₁) with labels y = Dog and y = Cat.
Figure, right - regression: samples with feature x and continuous label y.)
Source: https://www.kdnuggets.com
Regression
Regression is used to predict a continuous value.

The training dataset: 𝒟 = {(x₀, y₀), (x₁, y₁), ..., (xₙ, yₙ)}
A sample: (xᵢ, yᵢ)
• xᵢ ∈ ℝᵐ is the feature vector of sample i
• yᵢ ∈ ℝ is the label value of sample i

We fit a model f to all samples:
f(xᵢ) = yᵢ, ∀(xᵢ, yᵢ) ∈ 𝒟

Example (machining, https://de.wikipedia.org/wiki/Zerspanen):
Objective: Predict the surface quality based on the production parameters
Realization: Speed and feed data are gathered and the surface roughness is measured for some trials; by using regression algorithms the surface quality is predicted
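The machining example can be sketched in a few lines. The speed and roughness values below are purely illustrative, not measured data from the slide:

```python
import numpy as np

# Hypothetical trials: cutting speed (production parameter x_i) and
# measured surface roughness (label y_i); illustrative numbers only.
speed = np.array([100.0, 150.0, 200.0, 250.0])
roughness = np.array([2.0, 1.6, 1.3, 1.0])

# Fit f(x) = w0 + w1*x by least squares: "fit a model f to all samples"
A = np.column_stack([np.ones_like(speed), speed])   # design matrix with bias column
w, residuals, rank, sv = np.linalg.lstsq(A, roughness, rcond=None)

# Predict the surface roughness for a new, unseen cutting speed
predicted = w[0] + w[1] * 175.0
```

A new production parameter is then fed through the same model to predict the expected surface quality before the part is machined.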
Example: Prediction of Remaining Useful Life (RUL)
“Health Monitoring and Prediction (HMP)” system (developed by BISTel)
Objective:
• Machine components degrade over time
• This causes malfunctions in the production line
• Replace component before the malfunction!
Realization:
• Use data of malfunctioning systems
• Fit a regression model to the data and predict
the RUL
• Use the model on active systems and apply
necessary maintenance
Source: https://www.bistel.com/
Classification
Classification is used to predict the class of the input.

Important: The output belongs to exactly one class.

A sample: sᵢ = (xᵢ, yᵢ)
yᵢ ∈ L is the label of sample i
In this example: L = {Cat, Dog}
(Figure: samples in the feature space (x₀, x₁) with labels y = Dog and y = Cat)
Source: https://scikit-learn.org/
Classification
Goal: Find a way to divide the input into the output classes!

That means we find a decision boundary f for all samples:
f(xᵢ) = yᵢ, ∀(xᵢ, yᵢ) ∈ 𝒟

In this case f is a linear decision boundary (the black line in the figure).
(Figure: the Dog/Cat samples in the feature space (x₀, x₁) separated by a straight line)
Source: https://scikit-learn.org/
Example: Foreign object detection
Objective:
• Foreign object detection on a production part
• After a production step a chip can remain
on the piston rod
• Quality assurance: Only parts without chip
are allowed for the next production step
Realization:
• A camera system records 3,000 pictures
• All images are labeled by a human
• A machine learning classification algorithm was trained to differentiate between chip and no-chip situations
(Images: piston rod without chip, piston rod with chip)
Source: Implementation and potentials of a machine vision system in a series production using deep learning and low-cost hardware, Hubert
Würschinger et al., 2020
Classification algorithms
Below, you Nearest
Input Data can see example
Linear datasets
RBF andGaussian
the application of algorithms
Different different use
algorithms.Neighbors SVM SVM Processes different ways to classify
data
Therefore:
The algorithms perform
different on the datasets
Example:
Linear SVM’s has bad
results in the second
dataset
Reason: No linear way to
distinguish data
The lower right value shows the classification accuracy
Source: https://scikit-learn.org/
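To make "different ways to classify data" concrete, here is a minimal 1-nearest-neighbor classifier, one of the algorithm families named above, on a tiny made-up 2-D dataset (the points and labels are illustrative assumptions):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """1-NN: classify x with the label of its closest training sample."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

# Toy 2-D dataset: two well-separated groups (hypothetical values)
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])

label = nearest_neighbor_predict(X_train, y_train, np.array([2.8, 3.2]))
```

A linear SVM draws one straight boundary instead; on data like the second dataset in the figure, where no straight line separates the classes, a neighborhood-based rule like this can still succeed.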
Classification vs. Regression
Both are supervised learning problems, i.e. they require labeled datasets. The difference is the kind of machine learning problem they are used for.

Regression:
• The outputs are continuous (real) values
• We try to fit a regression model which predicts the output as accurately as possible
• Regression algorithms are used to solve regression problems such as weather prediction, house price prediction, or stock market prediction

Classification:
• The output variable must be a discrete value (a class)
• We try to find a decision boundary which divides the dataset into the different classes
• Classification algorithms are used to solve classification problems such as hand-written digits (MNIST), speech recognition, identification of cancer cells, or defective vs. non-defective solar cells
Source: https://scikit-learn.org/
Machine Learning Categories
Unsupervised Learning
Key Aspects:
• Learning is implicit
• Learning uses indirect feedback
• Methods are self-organizing
(Figure: the iris dataset, petal width vs. petal length; left panel with the true classes Iris virginica, Iris versicolor, and Iris setosa; right panel with the clusters found, Cluster 1 and Cluster 2)
Clustering Algorithms
Different clustering methods (e.g. K-Means, Affinity Propagation, Mean Shift, Spectral Clustering, Ward) produce different results; some algorithms "find" more clusters than others.

Example: K-Means performs "well" on the third dataset, but not on datasets one and two.
Reason: K-Means can only identify "circular" clusters.
Source: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
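The K-Means algorithm named above can be written in a few lines: alternate between assigning each sample to the nearest centroid and moving each centroid to the mean of its samples. The two blobs below are generated toy data, chosen as the "circular cluster" case K-Means handles well:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every sample to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned samples
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated "circular" blobs -- the case K-Means is good at
X = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5.0, 0.1, (20, 2))])
labels, centroids = kmeans(X, k=2)
```

On ring- or moon-shaped data (datasets one and two in the figure) this same procedure fails, because the centroid-distance assignment can only carve out convex, roughly circular regions.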
Curse of Dimensionality
“As the number of features or dimensions grows, the amount of data we
need to generalize accurately grows exponentially.” – Charles Isbell
Source: Chen L. (2009) Curse of Dimensionality. In: LIU L., ÖZSU M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.
https://doi.org/10.1007/978-0-387-39940-9_133
Example: Curse of Dimensionality
A production system has N sensors attached, each with its input set to "On" or "Off". The number D of distinct input states grows exponentially:
N = 1 : D = 2¹ = 2
N = 10 : D = 2¹⁰ = 1024
N = 100 : D = 2¹⁰⁰ ≈ 1.27 × 10³⁰
For N = 100, the number of possible states already exceeds, for example, the number of stars in the observable universe!
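The state count is straightforward to verify; Python's arbitrary-precision integers handle 2¹⁰⁰ exactly:

```python
# Each of the N binary ("On"/"Off") sensors doubles the number of input states
def num_states(n_sensors):
    return 2 ** n_sensors

states_1 = num_states(1)      # 2
states_10 = num_states(10)    # 1024
states_100 = num_states(100)  # about 1.27e30, computed exactly
```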
Dimensionality Reduction
The goal: Transform the samples from a high- to a lower-dimensional representation!
Ideally: Find a representation which solves your problem!

Example: four samples with nine features S0-S8 ...

         S0   S1    S2    S3    S4   S5  S6   S7  S8
Sample0  0.2  0.1   11.1  2.2   Off  7   1.1  0   1.e-1
Sample1  1.2  -0.1  3.1   -0.1  On   9   2.3  -1  1.e-4
Sample2  2.7  1.1   0.1   0.1   Off  10  4.5  -1  1.e-9
Sample3  3.1  0.1   1.1   0.2   Off  1   6.6  -1  1.e-1

... are transformed into four features T0, T1, T2, T3.
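One standard way to perform such a transformation is principal component analysis (PCA). PCA is not named on the slide, so this is an illustrative choice; the sketch below projects 9 (numeric) features down to 4, mirroring the S0-S8 to T0-T3 example, on randomly generated stand-in data:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Minimal PCA sketch: project samples onto the top principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top                         # lower-dimensional representation

# 4 samples with 9 numeric features (like S0..S8) reduced to 4 (like T0..T3)
X = np.random.default_rng(0).normal(size=(4, 9))
X_low = pca_reduce(X, n_components=4)
```

Note that a real pipeline would first encode the categorical "On"/"Off" column numerically before applying any such projection.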
Machine Learning Categories
Reinforcement Learning
Key Aspects:
• Learning is implicit
• Learning uses indirect feedback based on trials and reward signals
• Actions affect future measurements (i.e. inputs)
Source: https://ai.googleblog.com/2018/06/scalable-deep-reinforcement-learning.html
Introduction to Machine Learning
Machine Learning Pipeline
The Machine Learning Pipeline
A concept that provides guidance in a machine learning project
• Step-by-step process
• Each step has a goal and a method
Source: Artificial Intelligence with Python - Second Edition, by Alberto Artasanchez, Prateek Joshi
1. Problem Definition
Our aim with machine learning is to develop a solution for a problem. In order to develop a satisfying solution, we need to define the problem. This definition lays the foundation of our solution: it shows us what kind of data we need and what kind of algorithms we can use.
Examples:
• If we are trying to find faulty equipment, we have a classification problem
• If we are trying to predict a continuous number, we have a regression problem
Day | Weather | Temp. | Humidity | Wind   | Tennis recommended?
1   | Sunny   | Hot   | High     | Weak   | No
2   | Sunny   | Hot   | High     | Strong | No
3   | Cloudy  | Hot   | High     | Weak   | Yes
4   | Rainy   | Mild  | High     | Weak   | Yes
... | ...     | ...   | ...      | ...    | ...
Example (Performance Monitoring): We measure the actual energy consumption of 10 work steps after each maintenance procedure and calculate the error of the model's energy-consumption predictions.
• Error within the defined range: the model is working sufficiently
• Error outside the defined range: the model is working insufficiently → stop using it and run a root-cause analysis

The pipeline stages:
Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Candidate Model Evaluation → Model Deployment → Performance Monitoring
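The monitoring rule from the energy-consumption example can be sketched directly; the 0.5 kWh tolerance and the consumption values are assumed, illustrative numbers:

```python
# Performance monitoring: flag the model when the prediction error
# leaves the defined range (tolerance of 0.5 kWh is an assumed value).
def monitor(predicted_kwh, measured_kwh, tolerance_kwh=0.5):
    error = abs(predicted_kwh - measured_kwh)
    if error <= tolerance_kwh:
        return "working sufficiently"
    return "stop using and run root-cause analysis"

status_ok = monitor(predicted_kwh=10.2, measured_kwh=10.4)
status_bad = monitor(predicted_kwh=10.2, measured_kwh=12.9)
```

In practice this check would run after every maintenance procedure, feeding its result back into the pipeline's monitoring stage.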
Summary
In this chapter we talked about:
• The history of machine learning and recent trends
• The different types of machine learning, such as supervised learning, unsupervised learning, and
reinforcement learning
• The steps involved in a common machine learning pipeline
Motivation
Linear regression is used in multiple scenarios each and every day!
Use Cases:
• Trend analysis in financial markets, sales, and business (figure: financial development chart from FXCM)
• Computer vision problems, e.g. registration and localization problems
• etc.
Source: https://www.tradingview.com/script/OZZpxf0m-Linear-Regression-Trend-Channel-with-Entries-Alerts/
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9
When do we use it?
The problem has to be simple:
• The dataset is small
• A linear model is enough, e.g. for trend analysis
In addition:
• Linear models are the basis for complex models, e.g. deep networks
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: Growth Reference Dataset (WHO)
The dataset contains age (5y - 12y) and height information from people in the USA.

Age        Height
4.1 years  108 cm
5.2 years  112 cm
5.6 years  108 cm
5.7 years  116 cm
6.2 years  112 cm
6.3 years  116 cm
6.6 years  120 cm
6.7 years  122 cm

We want to answer questions like:
• What is the height of my child when it is 14 years old?
• What is the height of my child when it is 30 years old?
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: 1. Visualize the Data
(Figure: the age/height table above as a scatter plot; age in years (5-12) on the x-axis, height in cm (110-145) on the y-axis)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: 2. Fit a Model
(Figure: the age/height scatter plot, ready for a model to be fitted)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: 2. Fit a Model
What "model", i.e. function, describes our data?
• Polynomial model
• Gaussian model
(Figure: the age/height scatter plot)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: 3. Answer Your Questions
What is the height of my child when it is 14 years old? → 165 cm
What is the height of my child when it is 30 years old? → 270 cm
(Figure: the fitted line extrapolated over ages 0-35, heights 70-270 cm)
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Example: Actual Answer
What is the height of my child when it is 14 years old? (We estimated 165 cm.)
→ ~165 cm
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Next in this Lecture:
• What is the Mathematical Framework?
• How do we fit a linear model?
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion
(Figure: the age/height scatter plot)
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called the dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
Overall Picture
Linear Regression is a method to fit linear models to our data!

The linear model:
f(x) = w₀·x₀ + w₁·x₁ = y
with
x = (x₀, x₁) = (1, age in years)
y = height
w = (w₀, w₁) = the weights

In our example the fitted model is:
f(x) = 70·x₀ + 6.5·x₁ = y
→ Finding a good w₀ and w₁ is called fitting!

(Figure: the age/height scatter plot with the fitted line)
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called the dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
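Fitting can be sketched on the eight age/height samples from the WHO example table. Note that this tiny subset yields different weights than the slide's (w₀ = 70, w₁ = 6.5), which come from the full dataset:

```python
import numpy as np

# The age/height samples from the WHO example table
age = np.array([4.1, 5.2, 5.6, 5.7, 6.2, 6.3, 6.6, 6.7])
height = np.array([108, 112, 108, 116, 112, 116, 120, 122], dtype=float)

# Fit f(x) = w0 + w1 * x; polyfit returns the highest-degree coefficient first
w1, w0 = np.polyfit(age, height, deg=1)

# Extrapolate to age 14, as in the example question
predicted_height = w0 + w1 * 14.0
```

The same two-line fit-then-predict pattern is what the slides call "fitting" and "answering your questions".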
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion
(Figure: the scatter plot with the residual ϵᵢ of one sample marked)
Note: In this case we assume that the noise ϵᵢ is Gaussian distributed
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p. 19
The Linear Model
The previous description of the linear model is not accurate!
Reason: Real systems produce noise!

Linear model with noise¹:
f(x) = w₀·x₀ + w₁·x₁ + ϵᵢ

The ϵᵢ, and the sum ϵ over all samples, is called the residual error!
(Figure: the scatter plot with the vertical distance ϵᵢ between one sample and the fitted line)
1) In this case we assume that the noise ϵᵢ is Gaussian distributed
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p. 19
The Residual Error ϵ
The residual error ϵ is not part of the input; it is a random variable!

Typically we assume that ϵ is Gaussian distributed (i.e. Gaussian noise)¹:
p(ϵ) = (1 / (σ·√(2π))) · exp(−½ · ((ϵ − μ) / σ)²)

But other distributions are also possible!
(Figure: the Gaussian distribution: probability over events)
1) The short form is ϵ ~ 𝒩(μ, σ²)
Note: In our use case μ = 0 and σ is small; that means our samples deviate only slightly
The Distribution of y
The noise is Gaussian:
p(ϵ) = 𝒩(ϵ | 0, σ²)

"Inserting" the linear function:
p(y | x, w, σ) = 𝒩(y | wᵀx, σ²)

This is the conditional probability density for the target variable y!
(Figure: the scatter plot with the densities p(y₀|x₀) and p(y₁|x₁) drawn at x₀ and x₁)

The model in sum notation:
f(x) = Σⱼ₌₁ᴰ wⱼ·xⱼ + ϵ = y
Vector notation:
f(x) = wᵀx + ϵ = y
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p. 19
What Parameters Do We Have to Estimate?
The model is linear:
f(x) = Σⱼ₌₁ᴰ wⱼ·xⱼ + ϵ = y
→ We have to estimate the weights w and the noise parameter σ.
Note: In this case we assume that the noise is Gaussian distributed, i.e. ϵ ~ 𝒩(μ, σ²)
Note: We assume μ = 0; that means we only have to estimate σ! Homework: Why can we assume μ = 0?
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion

Optimization: Maximum Likelihood
θ* = argmax_θ p(𝒟 | θ)

That means:
• We want to find the optimal parameters θ*
• We search over all possible θ
• We select the θ which most likely generated our training dataset 𝒟
Reminder: θ are the parameters, 𝒟 is the training dataset!
Intuitive Example
θ* = argmax_θ p(𝒟 | θ)

Model: 𝒟 ~ 𝒩(μ, σ²), parameters θ = {μ, σ}
• θ = {9, 0.9} : bad
• θ = {4, 0.9} : better
• θ = {5, 1.5} : best
(Figure: histogram of the distribution of 𝒟 with the three candidate Gaussians overlaid)
Reminder: θ are the parameters, 𝒟 is the training dataset!
Maximum Likelihood Estimation
This process is called Maximum Likelihood Estimation (MLE):
θ* = argmax_θ p(𝒟 | θ)

ℒ(θ) = p(𝒟 | θ)
     = p(s₀, s₁, ..., sₙ | θ)
     =* p(s₀ | θ) · p(s₁ | θ) · ... · p(sₙ | θ)
     = ∏ᵢ₌₁ᴺ p(sᵢ | θ)
*) assuming the samples are independent and identically distributed
Note: In our case a sample sᵢ is xᵢ and yᵢ; just assume both are "tied" together, i.e. like a vector
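For a Gaussian model the MLE has a closed form, which makes the intuitive example above (θ = {5, 1.5} being best) easy to reproduce on generated data drawn from exactly that distribution:

```python
import numpy as np

# Synthetic training data D drawn from N(5, 1.5^2), the "best" theta above
D = np.random.default_rng(0).normal(loc=5.0, scale=1.5, size=10_000)

# Closed-form maximum likelihood estimates for a Gaussian:
mu_hat = D.mean()                                 # argmax over mu
sigma_hat = np.sqrt(((D - mu_hat) ** 2).mean())   # argmax over sigma (the biased MLE form)
```

With enough samples the estimates land very close to the generating parameters, which is exactly the "select the θ which most likely generated 𝒟" idea.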
The Log-Likelihood
The product of many small numbers is numerically unstable in a computer! We therefore maximize the log-likelihood instead*:
ℓ(θ) = log ℒ(θ) = Σᵢ₌₁ᴺ log p(sᵢ | θ)
*) The logarithm "just" rescales the problem: likelihood and log-likelihood are maximized by the same θ!
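The instability is easy to demonstrate: multiplying a few thousand small per-sample likelihoods underflows double precision to exactly zero, while the sum of their logarithms stays finite (the value 1e-3 for each likelihood is an arbitrary illustration):

```python
import math

# 2,000 hypothetical per-sample likelihood values, each small (~1e-3)
probs = [1e-3] * 2000

product = 1.0
for p in probs:
    product *= p    # (1e-3)^2000 = 1e-6000: underflows to exactly 0.0 in float64

# The log-likelihood of the same data stays a perfectly ordinary number
log_likelihood = sum(math.log(p) for p in probs)
```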
The Conditional Distribution p(sᵢ|θ)
We know the conditional is the Gaussian distribution of y (slide 35):
p(sᵢ | θ) = p(yᵢ | xᵢ, w, σ) = 𝒩(yᵢ | wᵀxᵢ, σ²)

Inserting this into the log-likelihood ℓ(θ) = Σᵢ₌₁ᴺ log p(sᵢ | θ) gives:
ℓ(θ) = −(1 / (2σ²)) · Σᵢ₌₁ᴺ (yᵢ − wᵀxᵢ)² − (N / 2) · log(2πσ²)

The factor 1/(2σ²) and the second term are constant with respect to w: maximizing ℓ(θ) over w is therefore the same as minimizing the sum of squared errors. This minimization can be done analytically or iteratively ... and of course many, many more methods exist!
Analytical Solution
For the analytical solution, we want a "simpler" form¹:
NLL(θ) = ½ (y − Xw)ᵀ(y − Xw)
1) This form uses the "Negative Log-Likelihood", which can also be derived from the likelihood
Analytical Solution
The solution, i.e. the minimum, is:
w = (XᵀX)⁻¹ Xᵀy
with
Xᵀy = Σᵢ₌₁ᴺ xᵢ·yᵢ
XᵀX = Σᵢ₌₁ᴺ xᵢ·xᵢᵀ (a D×D matrix whose (j,k) entry is Σᵢ xᵢ,ⱼ · xᵢ,ₖ)

In practice this can be too time-consuming to compute!
Reason: the more training data, the longer the calculation.
Note: N = number of samples in the training dataset; D = number of dimensions (length of the input vector)
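The normal equation can be evaluated directly in a few lines. The four samples below are made-up numbers lying exactly on the line y = 80 + 6x, chosen so the closed-form answer is easy to check:

```python
import numpy as np

# Design matrix X (bias column + one feature) and targets y; the illustrative
# samples lie exactly on y = 80 + 6*x, so the minimum is w = (80, 6).
X = np.array([[1.0, 5.0], [1.0, 6.0], [1.0, 7.0], [1.0, 8.0]])
y = np.array([110.0, 116.0, 122.0, 128.0])

# Closed-form solution w = (X^T X)^{-1} X^T y, written as a linear solve
w = np.linalg.solve(X.T @ X, X.T @ y)
```

Solving the linear system is preferred over forming the explicit inverse (XᵀX)⁻¹: it is both cheaper and numerically more stable.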
What is the Quality of the Fit?
The goodness of fit should be evaluated to verify the learned model! Three cases:
• The model can completely (100%) explain the variations in y
• The model can partially (91.3%) explain the variations in y, but is still considered good
• The model cannot explain any (0%) variation in y, because it does nothing more than predict the average of y
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
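The "share of explained variation in y" is the coefficient of determination, commonly written R². A minimal sketch, with a small illustrative target vector:

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination R^2: share of the variation in y the model explains."""
    ss_res = np.sum((y - y_pred) ** 2)     # unexplained (residual) variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean of y
    return 1.0 - ss_res / ss_tot

y_true = np.array([110.0, 116.0, 122.0, 128.0])

perfect = r_squared(y_true, y_true)                        # explains 100%
mean_only = r_squared(y_true, np.full(4, y_true.mean()))   # explains 0%
```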
Polynomial Regression
Can we also fit non-linear data, e.g. house prices in Springfield (USA)? The answer is, of course, yes:
f(x) = Σⱼ wⱼ·Φⱼ(x) + ϵ = y
(Figure: polynomial fits of degree m = 2 and degree m = 13 to the house-price data)
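The degree-2 versus degree-13 contrast can be reproduced on synthetic data (the quadratic generating function and noise level below are assumptions for illustration). A higher degree always drives the training error down, which is exactly why training error alone is not evidence of a good model:

```python
import numpy as np

# Hypothetical noisy samples from an underlying quadratic trend
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x ** 2 + rng.normal(0.0, 0.05, size=20)

fit_low = np.polyfit(x, y, deg=2)     # matches the data-generating degree
fit_high = np.polyfit(x, y, deg=13)   # enough freedom to chase the noise

# Training error (sum of squared residuals) for both models
err_low = np.sum((np.polyval(fit_low, x) - y) ** 2)
err_high = np.sum((np.polyval(fit_high, x) - y) ** 2)
```

The degree-13 model achieves the smaller training error by bending toward the noise, the overfitting behaviour the m = 13 panel of the figure illustrates.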
Basis Functions
We can use any arbitrary function Φⱼ, with one condition: the model must stay linear in the weights w, i.e. the basis functions themselves must not depend on w.
Motivation
Logistic regression is the application of
linear regression to classification!
Use Cases:
• Credit Scoring
• Medicine
• Text Processing
• etc.
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
Example: Transform the Label

Label    Petal Width   Petal Length
Setosa   5.0 mm        9.2 mm
Versi.   9.2 mm        26.1 mm
Setosa   7.7 mm        18.9 mm
Versi.   9.1 mm        32.1 mm
Setosa   7.9 mm        15.5 mm
Setosa   5.7 mm        12 mm
Setosa   2.5 mm        13.5 mm
Versi.   15.0 mm       39 mm
Source: Fisher, R.. "THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS." Annals of Human Genetics 7 (1936): 179-188.
Example: Transform the Label
Label    Petal Width    Petal Length
0        5.0 mm         9.2 mm
1        9.2 mm         26.1 mm
0        7.7 mm         18.9 mm
1        9.1 mm         32.1 mm
0        7.9 mm         15.5 mm
0        5.7 mm         12 mm
0        2.5 mm         13.5 mm
1        15.0 mm        39 mm
Example: 1. Visualize the data

[Scatter plot of the table above: Petal Length in mm (x-axis, 10–40) vs. Petal Width in mm (y-axis, 2.5–15.0)]
[The same scatter plot, now with a class legend: Iris Setosa (0), Iris Versicolor (1)]
Example: 2. Find a decision boundary

Several kinds of decision boundary can separate the two classes:
• Linear boundary
• Polynomial boundary
• Gaussian boundary

[Scatter plots: Petal Length in mm vs. Petal Width in mm, each with one of the boundary types drawn between the classes]

Example query: a new flower with a petal width of 15 mm is classified by checking on which side of the decision boundary it falls.
Note: Setosa / Versicolor | Reason: The point is on the “left side” of the decision boundary
Next in this Lecture:
• What is the Mathematical Framework?
• How do we classify using a linear model?
Overview
• The Logistic Model
• Optimization
Rule of thumb: the larger the distance of the input 𝐱 to the decision boundary, the more “certain” the classification; inputs close to the boundary are “uncertain”.

The linear model

𝑓(𝐱) = Σⱼ₌₁ᴰ 𝑤𝑗 𝑥𝑗

calculates a signed distance between the input and the decision boundary: negative when the input is left of the line, positive when it is right of the line.
The sigmoid function
The sigmoid (logistic) function maps any real input to the range (0, 1)!

That means the model is now:

𝜇(𝐱, 𝐰) = 1 / (1 + 𝑒^(−𝐰ᵀ𝐱))
Fun fact: The sigmoid function is sometimes lovingly called “squashing function”
Note: Here we already inserted the function f! Homework: What is the general form of the sigmoid function?
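A minimal sketch of the sigmoid and the resulting model 𝜇(𝐱, 𝐰); the input and weight values are assumed for illustration:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mu(x, w):
    """Model output mu(x, w) = sigmoid(w^T x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

print(sigmoid(0.0))   # 0.5 (exactly on the decision boundary)
print(sigmoid(20.0))  # close to 1 ("certain" positive classification)
```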
Bernoulli distribution
The Bernoulli distribution can model both outcomes of a yes-or-no event (e.g., heads vs. tails of a fair coin, or a vote for Mr. A vs. Mrs. B):

𝑝(y | 𝐱, 𝐰) = Ber(y | 𝜇(𝐱, 𝐰)) = 𝜇(𝐱, 𝐰)ʸ (1 − 𝜇(𝐱, 𝐰))^(1−y)

Problem: how do we get the label based on the calculated probability?
Note: We use the above distribution for the MLE estimation! Basically we replace “𝑝(𝐬𝑖 | 𝛉)” in the log-likelihood with this!
The Decision Rule
Based on 𝜇(𝐱, 𝐰) we can decide which class is more “likely”!

The decision rule is:

𝑦 = 1 if 𝜇(𝐱, 𝐰) > 0.5
𝑦 = 0 if 𝜇(𝐱, 𝐰) ≤ 0.5

Inserting the Bernoulli model into the log-likelihood gives:

𝓵(𝛉) = Σᵢ₌₁ᴺ log[𝑝(y𝑖 | 𝐱𝑖, 𝛉)] = Σᵢ₌₁ᴺ [ 𝑦𝑖 log 𝜇(𝐱𝑖, 𝐰) + (1 − 𝑦𝑖) log(1 − 𝜇(𝐱𝑖, 𝐰)) ]
• The negative log-likelihood 𝐿(𝛉) = −𝓵(𝛉) has a unique minimum
• No analytical solution is possible!
→ Optimization with gradient descent.

Idea: we find the minimum by “walking down” the slope of the loss surface 𝐿(𝛉).

The gradient for logistic regression:

∇𝐿(𝛉) = Σᵢ (𝜇(𝐱𝑖, 𝐰) − 𝑦𝑖) 𝐱𝑖

The update rule:

𝜃𝑖₊₁ = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖)

𝜂 is called the learning rate:
• If 𝜂 is too large → overshooting the minimum
• If 𝜂 is too small → the minimum is not reached
Example: Gradient descent

Algorithm:
1. Repeat for each iteration 𝑖:
2.   Calculate loss 𝐿(𝜃𝑖)
3.   Calculate gradient ∇𝐿(𝜃𝑖)
4.   Update 𝜃𝑖₊₁ = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖)
5.   𝑖 = 𝑖 + 1

Repeat the process until the loss is minimal!

[Plot: loss 𝐿(𝜃) over parameter 𝜃; the iterates 𝜃𝑖, 𝜃𝑖₊₁, 𝜃𝑖₊₂, 𝜃𝑖₊₃ step down the slope towards the minimum]

Caution: on non-convex loss functions (e.g. higher-degree polynomials), gradient descent can end up in a local minimum instead of the global minimum.
https://de.serlo.org/mathe/funktionen/wichtige-funktionstypen-ihre-eigenschaften/polynomfunktionen-beliebigen-grades/polynom
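Putting the pieces together, a sketch of gradient descent for logistic regression using the gradient Σᵢ(𝜇(𝐱𝑖, 𝐰) − 𝑦𝑖)𝐱𝑖 from above; the four 1-D training points (with a constant bias feature) are assumed for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D data with a bias feature: x = (1, value); labels are 0/1 as in the example.
X = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
y = [0, 0, 1, 1]

w = [0.0, 0.0]
eta = 0.5                       # learning rate
for _ in range(200):            # iterations of gradient descent
    # Gradient of the negative log-likelihood: sum_i (mu(x_i, w) - y_i) * x_i
    grad = [0.0, 0.0]
    for xi, yi in zip(X, y):
        mu = sigmoid(w[0] * xi[0] + w[1] * xi[1])
        grad[0] += (mu - yi) * xi[0]
        grad[1] += (mu - yi) * xi[1]
    w = [w[0] - eta * grad[0], w[1] - eta * grad[1]]

# Decision rule: predict 1 if mu(x, w) > 0.5
preds = [int(sigmoid(w[0] + w[1] * v) > 0.5) for _, v in X]
print(preds)  # [0, 0, 1, 1]
```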
Thank you for listening!
Machine Learning for Engineers
Overfitting and Underfitting
Overfitting and Underfitting
Optimal function vs.
Estimated function
For 𝑀 = 0 and 𝑀 = 1 the function fails to
model the data, as the chosen model
is too simple (Underfitting)
[Plot: prediction error vs. model complexity; the ideal complexity (i.e., the minimum of the testing loss) balances the bias error and the variance error]
Note: Complexity does not mean parameters! It means the mathematical complexity! i.e. the ability of the model to capture a relationship!
Complexity vs. Generalization Error
Bias: error induced by the simplifying assumptions of the model

Variance: error induced by differing training data, i.e. how sensitive the model is to “noise”

[Plot: prediction error vs. complexity; the bias falls from high to low with growing complexity, while the variance rises]

The ideal complexity has both a low bias and a low variance!
1) We use this instead of the test set. This is important! You only touch the test set for the final evaluation!
Hyperparameter search methods
There exist multiple ways to find good hyperparameters…
… manual search (Trial-and-Error)
… random search
… grid search
… Bayesian methods
… etc.
Note: Basically every known optimization technique can be used. Examples: Particle Swarm Optimization, Genetic Algorithms, Ant Colony Optimization…
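A minimal manual grid-search sketch over the list of methods above; the objective here is a hypothetical stand-in for a real validation loss:

```python
from itertools import product

def validation_loss(lr, degree):
    # Hypothetical stand-in: in practice you would train on the train split and
    # evaluate on the validation split with these hyperparameters.
    return (lr - 0.1) ** 2 + (degree - 3) ** 2

# Grid search: try every combination of the candidate hyperparameter values.
grid = {"lr": [0.01, 0.1, 1.0], "degree": [1, 2, 3, 4]}
best = min(product(grid["lr"], grid["degree"]),
           key=lambda p: validation_loss(*p))
print(best)  # (0.1, 3)
```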
Hyperparameter search on the data splits
Typically you split the data into a training and a testing split. Hyperparameters must not be tuned on the test split, since it is reserved for the final evaluation.

Solution: split the training data further into a train split and a validation split.
1) We assume the test data is already split and put away. A typical split of all the data is 80% train – 10% validation - 10% testing
K-Fold Cross Validation
Approach: set the test split (10%) aside and split the training data (90%) into 𝑘 folds (in our example 𝑘 = 5): Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5
When 𝑘 = 𝑁, i.e. one fold per training sample, the process is called leave-one-out cross-validation (LOOCV).
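A sketch of producing the 𝑘 folds (contiguous, without shuffling; 𝑛 = 10 and 𝑘 = 5 are assumed for illustration):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds (sketch; no shuffling)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, 5)
# Each fold serves once as validation; the remaining k-1 folds are training data.
for val in folds:
    train = [idx for f in folds if f is not val for idx in f]
    assert len(train) + len(val) == 10
print(folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```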
Recap: Logistic Regression
Goal: find a linear decision boundary separating the two classes (label 𝒚 = Dog vs. label 𝒚 = Cat).

Problem: multiple linear decision boundaries solve the problem with equal accuracy.

Question: which one is the best?

[Plot: Feature 𝑥0 vs. Feature 𝑥1 with candidate boundaries A, B, C, D between the Dog and Cat classes]

• B keeps a comparable distance to both classes throughout the boundary
• C always is closer to Dog than Cat
• D almost touches Dog at the bottom and Cat at the top

B keeps the largest distance to both classes; in mathematical terms, we say it has the largest margin 𝑚.

Updated goal: find the linear decision boundary with the largest margin.

The boundary is described by 𝒘 ⋅ 𝒙 − 𝑏 = 0, where 𝒘 is the normal vector. Furthermore, by definition, the margin boundaries are given by

𝒘 ⋅ 𝒙 − 𝑏 = +1
𝒘 ⋅ 𝒙 − 𝑏 = −1

2/‖𝒘‖ is thus the distance between the two margins → maximize it.
Mathematical Model for Classes
Two-class problem: 𝑦1, …, 𝑦𝑛 = ±1

All training data 𝒙1, …, 𝒙𝑛 needs to be correctly classified outside the margin:

𝒘 ⋅ 𝒙𝑖 − 𝑏 ≥ +1 if 𝑦𝑖 = +1
𝒘 ⋅ 𝒙𝑖 − 𝑏 ≤ −1 if 𝑦𝑖 = −1

Due to the label choice, this can be simplified as

𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) ≥ 1

1. Maximizing the margin width 2/‖𝒘‖ is equal to minimizing ½‖𝒘‖²:

min𝒘,𝑏 ½‖𝒘‖²

2. Subject to no misclassification on the training data:

𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) ≥ 1
Optimization of Support Vector Machines
In the previous section, we derived the constrained optimization problem
for Support Vector Machines (SVMs):
min𝒘,𝑏 ½‖𝒘‖²
s. t. 𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) ≥ 1
1) While not relevant for the exam, a more detailed derivation is available in C. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006, Appendix E.
Primal versus Dual Formulation
This optimization problem is called the primal formulation, with solution 𝑝∗ :
𝑝∗ = min𝒘,𝑏 max𝜶 ℒ(𝒘, 𝑏, 𝜶)

Since Slater’s condition holds for this convex optimization problem, we can
guarantee that 𝑝∗ = 𝑑∗ and solve the dual problem instead:

𝑑∗ = max𝜶 min𝒘,𝑏 ℒ(𝒘, 𝑏, 𝜶)
s. t. 𝛼𝑖 ≥ 0, Σᵢ₌₁ⁿ 𝛼𝑖𝑦𝑖 = 0

This quadratic programming problem can be solved using sequential
minimal optimization, which is not a topic of this lecture.
The Karush–Kuhn–Tucker Conditions
The given problem fulfills the Karush-Kuhn-Tucker conditions1):
𝛼𝑖 ≥ 0
𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) − 1 ≥ 0
𝛼𝑖 [𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) − 1] = 0
💡 When 𝛼𝑖 is non-zero, the training point is on the margin and thus a so-
called “support vector”.
1) While not relevant for the exam, a more detailed derivation is available in C. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006, Appendix E.
Optimization Summary
The dual formulation leads to a quadratic programming problem on 𝜶:1)
max𝜶 ℒ(𝒘, 𝑏, 𝜶) = Σᵢ₌₁ⁿ 𝛼𝑖 − ½ Σᵢ,ⱼ₌₁ⁿ 𝑦𝑖𝑦𝑗𝛼𝑖𝛼𝑗 (𝒙𝑖 ⋅ 𝒙𝑗)
1) This is simplified for the purposes of this summary. For the additional constraints that are also required refer to slide 14
2) Remember the equation for 𝒘 resulting from the derivative in slide 13
Machine Learning for Engineers
Support Vector Machines – Non-Linearity and the Kernel Trick
Recap: Basis Functions
House Prices in Springfield (USA): for linear regression we transform the scalar input with a polynomial basis function, e.g.

𝚽(𝑥) = (1  𝑥  𝑥²  𝑥³  𝑥⁴  𝑥⁵  𝑥⁶)ᵀ
💡 Note how the basis function is applied inside each of the dot products. Most summands of the dual sum are still zero, since only the support vectors have non-zero 𝛼𝑖. This property is called sparsity and greatly simplifies the computations.
Solving Memory Issues with Kernel Trick
Since the basis function 𝚽: ℝ𝑛 ↦ ℝ𝑚 is applied explicitly on each of the
points, the memory scales with the output dimensionality m.
ℒ(𝒘, 𝑏, 𝜶) = Σᵢ₌₁ⁿ 𝛼𝑖 − ½ Σᵢ,ⱼ₌₁ⁿ 𝑦𝑖𝑦𝑗𝛼𝑖𝛼𝑗 (𝚽(𝒙𝑖) ⋅ 𝚽(𝒙𝑗))

Instead, we can replace each dot product 𝚽(𝒙𝑖) ⋅ 𝚽(𝒙𝑗) by a kernel function 𝐾(𝒙𝑖, 𝒙𝑗). The kernel function usually doesn’t require explicit computation of the basis function. This method of replacement is called the kernel trick.
The Linear Kernel
The kernel function is given by:

𝐾(𝒙𝑖, 𝒙𝑗) = (1 / 2𝜎²) 𝒙𝑖 ⋅ 𝒙𝑗

• 𝜎 is a length-scale parameter
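The kernel trick can be checked numerically. For the degree-2 polynomial kernel (not the linear kernel above, but a common choice), 𝐾(𝒙, 𝒛) = (1 + 𝒙 ⋅ 𝒛)² equals the dot product of an explicit 6-dimensional basis expansion:

```python
import math

def phi(x):
    """Explicit degree-2 polynomial basis for 2-D input x = (x1, x2)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

def poly_kernel(x, z):
    """Kernel trick: K(x, z) = (1 + x.z)^2 without ever computing phi."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(explicit, poly_kernel(x, z))  # both 4.0
```

The kernel needs one dot product in 2-D instead of building 6-D feature vectors, which is exactly the memory saving described above.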
Hard Margin and Soft Margin
Up until now we have always implicitly assumed the classes are linearly separable
⮩ So-called hard margin

What happens when the classes overlap and there is no solution?
⮩ So-called soft margin

Introduce slack variables 𝜉𝑖 that move points onto the margin.

If correctly classified and outside the margin:
• We know 𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) ≥ 1 👍
• No correction, thus 𝜉𝑖 = 0

If incorrectly classified or inside the margin:
• We know 𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) < 1 👎
• Move by 𝜉𝑖 = 1 − 𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏)

The constraint becomes

s. t. 𝑦𝑖(𝒘 ⋅ 𝒙𝑖 − 𝑏) ≥ 1 − 𝜉𝑖

The optimization procedure and kernel trick can similarly be derived for
this, but we’ll spare you the details 😉.
Intuition behind Support Vector Regression
The Support Vector Machine is a method for classification. Let us now turn to regression.

With training data (𝒙1, 𝑦1), …, (𝒙𝑛, 𝑦𝑛), we want to find a function of the form

𝑦 = 𝒘 ⋅ 𝒙 + 𝑏

For Support Vector Regression, we want to keep everything in an 𝜖-tube around the function:

𝑦𝑖 ≤ 𝒘 ⋅ 𝒙𝑖 + 𝑏 + 𝜖
𝑦𝑖 ≥ 𝒘 ⋅ 𝒙𝑖 + 𝑏 − 𝜖

For outliers, we introduce slack variables 𝜉𝑖⁺ and 𝜉𝑖⁻:

𝑦𝑖 ≤ 𝒘 ⋅ 𝒙𝑖 + 𝑏 + 𝜖 + 𝜉𝑖⁺
𝑦𝑖 ≥ 𝒘 ⋅ 𝒙𝑖 + 𝑏 − 𝜖 − 𝜉𝑖⁻

The optimization problem becomes

min𝒘,𝑏 ½‖𝒘‖² + 𝐶 Σᵢ₌₁ⁿ (𝜉𝑖⁺ + 𝜉𝑖⁻)
s. t. 𝑦𝑖 ≤ 𝒘 ⋅ 𝒙𝑖 + 𝑏 + 𝜖 + 𝜉𝑖⁺
      𝑦𝑖 ≥ 𝒘 ⋅ 𝒙𝑖 + 𝑏 − 𝜖 − 𝜉𝑖⁻
      𝜉𝑖⁺ ≥ 0, 𝜉𝑖⁻ ≥ 0

𝐶 is a free parameter specifying the trade-off for outliers, and the width of the tube is specified by 𝜖.
Disadvantages
Summary of Support Vector Machines
Support Vector Machine Advantages
• Elegant to optimize: the problem is convex, so its single local optimum is also the global optimum
Application: Image Compression
We can save data more efficiently
by applying PCA!
Can be used for other data as well!

[Figure: an image reconstructed with 16 components (93.62% of the variance) and with 8 components (85.50%)]
Application: Facial Recognition
Eigenfaces (1991) are a fast and efficient
way to recognize faces in a gallery of facial
images.
General Idea:
1. Generate a set of eigenfaces (see right)
based on a gallery
2. Compute the weight for each eigenface
based on the query face image
3. Use the weights to classify (and identify)
the person
Turk, M. and A. Pentland. “Face recognition using eigenfaces.” Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1991): 586-591.
Application: Anomaly Detection
Detecting anomalous traffic in IP Networks is crucial for administration!
General Idea:
Transform the time series of IP packets (requests, etc.) using a PCA
→ The first 𝑘 (typically 10) components represent “normal” behavior
→ The components “after” 𝑘 represent anomalies and/or noise

Packets mapped primarily to the latter components are classified as anomalous!
Anukool Lakhina, Konstantina Papagiannaki, Mark Crovella, Christophe Diot, Eric D. Kolaczyk, and Nina Taft. 2004. Structural analysis of network traffic flows. SIGMETRICS Perform. Eval.
Rev. 32, 1 (June 2004), 61–72. DOI:https://doi.org/10.1145/1012888.1005697
Limitation: The Adidas Problem
Let the data points be distributed like the Adidas logo, with three implicit classes (teal, red, blue stripe).

[Plot: Feature 1 vs. Feature 2, with the principal directions 𝑤1 and 𝑤2 drawn over the stripes]

The principal directions 𝑤1 and 𝑤2 found by PCA are given. The first principal direction 𝑤1 does not preserve the information needed to classify the points!
Intuition behind Principal Component Analysis
Consider the following data with the heights of two twins ℎ1 and ℎ2.

[Scatter plot: twin height ℎ1 vs. twin height ℎ2]

Now assume that
a) we can only plot one-dimensional figures due to limitations1)
b) we have complex models that can only handle a single value1)
1) These are obviously constructed limitations for didactical concerns. However, in real life you will also face limitations in visualization
(maximum of three dimensions) or model feasibility.
How do we find a transformation from two values to one value? In this specific case we could

a) keep the average height ½(ℎ1 + ℎ2) as one value
b) discard the difference in height ℎ1 − ℎ2

💡 This corresponds to a change in basis or coordinate frame in the plot, from (ℎ1, ℎ2) to (average height, height difference)
⮩ Rotation by the matrix

𝑨 = (1/√2) [ 1   1 ]
           [ 1  −1 ]
Principal Component Analysis: Preserving Distances
💡 Idea: preserve the distances between each pair of points in the lower dimension
• Points that are far apart in the original space remain far apart
• Points that are close in the original space remain close

To decide which dimension to discard, look at the variance along each axis:
• Find those axes and dimensions with high variance
• Discard the axes and dimensions with increasingly low variance

[Plot: in the rotated frame, “average height” has high variance and “height difference” has low variance]

Goal: algorithmically find directions with high/low variance in the data1)
1) In this trivial case we could hand-engineer such a transformation. However, this is not generally the case.
Machine Learning for Engineers
Principal Component Analysis – Mathematics
Description of Input Data
Let 𝑿 be a 𝑛 × 𝑝 matrix of data points, where
• 𝑛 is the number of points (here 15)
• 𝑝 is the number of features (here 2)

𝑿 = [ 1.75 1.77 ]
    [ 1.67 1.68 ]
    [  ⋮    ⋮   ]
    [ 2.01 1.98 ]

The covariance matrix1) generalizes the variance 𝜎² of a Gaussian distribution 𝒩(𝜇, 𝜎²) to higher dimensions.
1) Technically the covariance matrix of the transposed data points 𝑿𝑇 after centering to zero mean.
Description of Desired Output Data
We’re interested in new axes that rotate the system, here 𝑤1 and 𝑤2. These are the columns of a matrix

𝑾 = [ 0.7   0.7 ]
    [ 0.7  −0.7 ]

with associated eigenvalues 𝜆1 = 2, 𝜆2 = 1.

They come from the eigendecomposition of the covariance matrix, 𝑪 = 𝑾𝑳𝑾ᵀ, where
• 𝑾 is a 𝑝 × 𝑝 matrix, where each column is an eigenvector
• 𝑳 is a 𝑝 × 𝑝 matrix with all eigenvalues on the diagonal in decreasing order

Alternatively, we can use the Singular Value Decomposition (SVD) of the data,

𝑿 = 𝑼𝑺𝑾ᵀ

• 𝑼 is a 𝑛 × 𝑛 unitary matrix
• 𝑺 is a 𝑛 × 𝑝 matrix with the singular values 𝜎1, …, 𝜎𝑝 on the diagonal
• 𝑾 is a 𝑝 × 𝑝 matrix of singular vectors 𝒘1, …, 𝒘𝑝

The two views are connected through

𝑿ᵀ𝑿 = (𝑼𝑺𝑾ᵀ)ᵀ(𝑼𝑺𝑾ᵀ) = 𝑾𝑺ᵀ𝑼ᵀ𝑼𝑺𝑾ᵀ = 𝑾𝑺²𝑾ᵀ

• Eigenvalues correspond to squared singular values: 𝜆𝑖 = 𝜎𝑖²
• Eigenvectors directly correspond to singular vectors
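The correspondence 𝜆𝑖 = 𝜎𝑖² can be verified numerically; the data here is randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])
X = X - X.mean(axis=0)          # center the data first

# Eigendecomposition of X^T X = W S^2 W^T (eigvalsh returns ascending order)
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

# Singular values of X = U S W^T (returned in descending order)
singvals = np.linalg.svd(X, compute_uv=False)

# Eigenvalues correspond to squared singular values: lambda_i = sigma_i^2
print(np.allclose(eigvals, singvals**2))  # True
```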
Summary of Algorithm
1. Find the principal directions 𝑤1, …, 𝑤𝑝 and eigenvalues 𝜆1, …, 𝜆𝑝
2. Project the data points into the new coordinate frame using 𝑻 = 𝑿𝑾
3. Keep the 𝑞 most important dimensions as determined by 𝜆1, …, 𝜆𝑞 (which are sorted by variance)
Summary of Principal Component Analysis
• Rotate coordinate system such that all axes are sorted from most
variance to least variance
• Required axes 𝑾 determined using either
• Eigenvectors and –values of covariance matrix 𝑪 = 𝑾𝑳𝑾𝑇
• Singular Value Decomposition (SVD) of data points 𝑿 = 𝑼𝑺𝑾𝑇
The Human Brain
The human brain is our reference for
an intelligent agent, that
a) … contains different areas
specialized for some tasks (e.g.,
the visual cortex)
b) … consists of neurons as the
fundamental unit of “computation”
[Diagram: a perceptron with inputs 𝑥1, 𝑥2, 𝑥3 weighted by 𝑤1, 𝑤2, 𝑤3, a bias 𝑏, a summation, and an activation 𝜎 producing the output 𝑦1]

The perceptron weights each input, sums the weighted inputs together with the bias 𝑏, and applies the activation function

𝜎(𝑥) = 1 / (1 + 𝑒⁻ˣ)

so that

𝑦1 = 𝜎(𝑤1 ⋅ 𝑥1 + 𝑤2 ⋅ 𝑥2 + 𝑤3 ⋅ 𝑥3 + 𝑏) = 𝜎(Σᵢ 𝑤𝑖 ⋅ 𝑥𝑖 + 𝑏)
The Perceptron – Signaling Mechanism
Given a perceptron with parameters
𝑤1 = 4, 𝑤2 = 7, 𝑤3 = 11, 𝑏 = −10 and equation
𝑦1 = 𝜎 𝑤1 ⋅ 𝑥1 + 𝑤2 ⋅ 𝑥2 + 𝑤3 ⋅ 𝑥3 + 𝑏
Common activation functions 𝜎 are:

𝜎(𝑥) = 1 / (1 + 𝑒⁻ˣ) (sigmoid)    𝜎(𝑥) = tanh(𝑥)    𝜎(𝑥) = max(𝑥, 0) (ReLU)
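A sketch of this perceptron with the slide’s parameters 𝑤1 = 4, 𝑤2 = 7, 𝑤3 = 11, 𝑏 = −10; the input vectors are assumed for illustration:

```python
import math

def perceptron(x, w, b):
    """Single perceptron: y = sigmoid(sum_i w_i * x_i + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Parameters from the slide: w1 = 4, w2 = 7, w3 = 11, b = -10
w, b = [4.0, 7.0, 11.0], -10.0
print(perceptron([1.0, 0.0, 0.0], w, b))  # sigmoid(-6): close to 0
print(perceptron([1.0, 1.0, 1.0], w, b))  # sigmoid(12): close to 1
```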
The Perceptron – A Recap
In the last section we learned about the perceptron, a computational model
representing a neuron.
Inputs: 𝑥1, …, 𝑥𝑛
Output: 𝑦1
Computation: 𝑦1 = 𝜎(Σᵢ₌₁ⁿ 𝑤𝑖 ⋅ 𝑥𝑖 + 𝑏)

Several perceptrons stacked side by side form a layer with outputs 𝑦1, 𝑦2, 𝑦3, …; in matrix form, a layer computes 𝒚 = 𝜎(𝑾 ⋅ 𝒙 + 𝒃).
How do our models learn?
• Up until now, the parameters 𝑾𝑖 and 𝒃𝑖 of the multilayer perceptron
were assumed as given
• We now aim to learn the parameters 𝑾𝑖 and 𝒃𝑖 based on some
example input and output data
• Let us assume we have a dataset of 𝒙 and corresponding 𝒚
[Example: a small dataset of input vectors 𝒙𝑖 with corresponding target vectors 𝒚𝑖 is passed through a two-layer network with parameters 𝑾1, 𝒃1 and 𝑾2, 𝒃2, and the network outputs are compared against the targets]
The Loss Function
• ෝ𝑖 and the
We need a comparison metric between the predicted outputs 𝒚
expected outputs 𝒚𝑖
• This is called the loss function, which usually depends on the type of
problem the multilayer perceptron solves
• For regression, a common metric is the mean squared error
• For classification, a common metric is the cross entropy
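Minimal sketches of the two loss functions; the prediction/target pairs are assumed example values (2.9 vs. 3.2 and 0.2 vs. 0.2, echoing the example network above):

```python
import math

def mean_squared_error(y_pred, y_true):
    """Common regression loss: average squared difference."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def cross_entropy(y_pred, y_true):
    """Common (binary) classification loss; y_pred are probabilities in (0, 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(y_pred, y_true)) / len(y_true)

print(mean_squared_error([2.9, 0.2], [3.2, 0.2]))  # ~0.045
print(cross_entropy([0.9, 0.1], [1, 0]))           # ~0.105
```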
[Diagram: the forward pass through the network with parameters 𝜽 produces predictions that are compared to the targets; the backward pass produces the gradient ∇𝜽]
Machine Learning for Engineers
Deep Learning – Gradient Descent
Parameter Optimization
• At this stage, we can compute the gradient of the parameters ∇𝜽
• The gradient tells us how we need to change the current parameters 𝜽
in order to make fewer errors on the given data
• We can use this in an iterative algorithm called gradient descent, with
the central equation being
𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖
• The learning rate 𝜇 tells us how quickly we should change the current
parameters 𝜽𝑖
⮩ Let us have a closer look at an illustrative example 💡
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 27
Gradient Descent – An Example
• Let us assume that the error function is
𝑓 𝜃 = 𝜃2
• The gradient of the error function is the
derivative 𝑓 ′ 𝜃 = 2 ⋅ 𝜃
• We aim to find the minimum error at
the location 𝜃 = 0
• Our initial estimate for the location is
𝜃1 = 2
• The learning rate is 𝜇 = 0.25
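Running these settings, each update is 𝜃 − 0.25 ⋅ 2𝜃 = 0.5𝜃, so the estimate halves on every step:

```python
# Gradient descent on f(theta) = theta**2 with the slide's settings.
theta = 2.0   # initial estimate theta_1
mu = 0.25     # learning rate

trajectory = [theta]
for _ in range(10):
    grad = 2.0 * theta         # f'(theta) = 2 * theta
    theta = theta - mu * grad  # theta_{i+1} = theta_i - mu * grad
    trajectory.append(theta)

print(trajectory[:4])  # [2.0, 1.0, 0.5, 0.25]
```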
Gradient Descent – A Summary
• Gradient Descent incrementally adjusts the parameters 𝜽 based on the
gradient ∇𝜽 of the parameters
1. For each iteration
2. Compute the error of the parameters 𝜽
3. Compute the gradient of the parameters ∇𝜽
4. Update parameters using 𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖
How do our models learn?
• We’ve just talked about the concept of gradient descent, but largely
avoided the details for the function 𝑓(𝜽)1)
• The goal is to minimize the error or loss function across the complete
dataset 𝒙1 , 𝒚1 , … , (𝒙𝑛 , 𝒚𝑛 )
𝑓(𝜽) = Σᵢ₌₁ⁿ 𝐿(𝒚̂𝑖, 𝒚𝑖) = Σᵢ₌₁ⁿ 𝐿(𝑔(𝒙𝑖, 𝜽), 𝒚𝑖)
1) In the previous section the function was 𝑓 𝜃 = 𝜃 2 . For multilayer perceptron, the function is quite obviously different.
How do our models learn?
• We thus need to loop over the training data 𝒙1 , 𝒚1 , … , (𝒙𝑛 , 𝒚𝑛 ) to
compute sums over individual data points
• This results in a slight modification to the gradient descent algorithm
1. For each epoch – Loop over training data
2. For each batch – Loop over pieces of training data
3. Compute the error of the parameters 𝜽
4. Compute the gradient of the parameters ∇𝜽
5. Update parameters using 𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖
1) Note how we need 4 batches to go through all of the training data in this specific example.
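The epoch/batch structure above as a loop sketch; the gradient and update steps are left as comments, since they depend on the model:

```python
# Epoch/batch training-loop sketch; `data` stands in for (x_i, y_i) pairs.
data = list(range(16))
batch_size = 4  # -> 4 batches per epoch, as in the footnote's example

def batches(data, size):
    """Yield consecutive pieces of the training data."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

n_updates = 0
for epoch in range(3):                       # 1. loop over the training data
    for batch in batches(data, batch_size):  # 2. loop over pieces of it
        # 3. compute the error of the parameters on this batch
        # 4. compute the gradient of the parameters
        # 5. update: theta <- theta - mu * grad
        n_updates += 1

print(n_updates)  # 3 epochs * 4 batches = 12 parameter updates
```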
The Learning Process – A Summary
• The parameters 𝜽 are optimized by iterating through the training data
• For practical reasons, this is split into epochs and batches
The Visual Cortex
We’ve previously learned that the
brain has specialized regions.
• The visual cortex is in charge of
processing visual information
collected from the retinae
• However, cortical cells only respond to stimuli within a small receptive field

[Diagram: a standard neuron connected to all inputs vs. a cortical neuron connected only to a local receptive field]
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦11 = 𝑤11 𝑥11 + 𝑤12 𝑥12 + 𝑤13 𝑥13 + 𝑤21 𝑥21 + 𝑤22 𝑥22 + 𝑤23 𝑥23
+ 𝑤31 𝑥31 + 𝑤32 𝑥32 + 𝑤33 𝑥33
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15
𝑥21 𝑥22 𝑥23 𝑥24 𝑥25      𝑤11 𝑤12 𝑤13      𝑦11 𝑦12 𝑦13
𝑥31 𝑥32 𝑥33 𝑥34 𝑥35  ⋆  𝑤21 𝑤22 𝑤23  =  𝑦21 𝑦22 𝑦23
𝑥41 𝑥42 𝑥43 𝑥44 𝑥45      𝑤31 𝑤32 𝑤33      𝑦31 𝑦32 𝑦33
𝑥51 𝑥52 𝑥53 𝑥54 𝑥55

The kernel 𝒘 slides over the input 𝒙; each position of the kernel produces one output value 𝑦𝑖𝑗. With several kernels, the different outputs are concatenated in the channel dimension.
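A sketch of this valid convolution (implemented, as is common in CNNs, as cross-correlation); the 5 × 5 input and the averaging kernel are assumed values:

```python
import numpy as np

def conv2d_valid(x, w):
    """Slide the kernel over the input; each position yields one output value."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input as on the slide
w = np.ones((3, 3)) / 9.0                     # 3x3 averaging kernel (assumed)
y = conv2d_valid(x, w)
print(y.shape)  # (3, 3) output, matching the slide
```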
Summarizing Features via Pooling
• Deeper in the network the number of features quickly grows
• It makes sense to summarize these features deeper in the network
• From a practical perspective, this reduces the number of computations
• From a theoretical perspective, these higher-level (i.e., global scale)
features don’t require a high spatial resolution
• This process is called pooling and works like a convolution, with the
kernel replaced by a pooling operation
[Diagram: a typical architecture alternates the two layer types: Input Image → Convolution 1 → Pooling 1 → Convolution 2 → Pooling 2]
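A sketch of 2 × 2 max pooling, one common choice of pooling operation; the input values are assumed:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Max pooling with stride = size: keep the largest value per window."""
    H, W = x.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 2., 1.],
              [1., 0., 0., 3.]])
print(max_pool2d(x))  # the four windows' maxima: 4, 8, 1, 3
```

The 4 × 4 input shrinks to 2 × 2, which is exactly the reduction in spatial resolution (and computation) described above.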
Machine Learning for Engineers
Deep Learning – Applications
Full Image Classification
• Task: Plant leaf disease classification
of 14 different species
• Input: Image of plant leaf
• Output: Plant and disease class
[Example images: grape with black measles, potato with late blight]
• Model: EfficientNet
• Ü. Atila, M. Uçar, K. Akyol, and E. Uçar, “Plant leaf disease classification using EfficientNet deep learning model”, 2020.