
Very Deep Learning

Lecture 02

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
afzal.tukl@gmail.com



Recap



Hierarchy

[Diagram: Artificial Intelligence ⊃ Machine Learning ⊃ {Shallow Learning, Deep Learning}; learning paradigms: Supervised, Unsupervised, Self-supervised, Reinforcement; under Supervised: Regression, Classification, Structured Prediction; under Regression: Linear Regression]



Supervised Learning

Input → Model → Output

x → $f_w$ → y

Learning: the estimation of the parameters $w$ from the training data $\{(x_i, y_i)\}_{i=1}^{n}$
Inference: making a prediction for an unknown point $x$, i.e., $y = f_w(x)$
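To make the learning/inference distinction concrete, here is a minimal sketch (illustrative, not from the slides) that estimates the parameters of a simple linear model $f_w(x) = w_0 + w_1 x$ from training data and then predicts at a new point; the toy data and names are assumptions.

```python
import numpy as np

# Toy training data {(x_i, y_i)} - illustrative only
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])

# Learning: estimate w = (w0, w1) from the training data via least squares
X = np.stack([np.ones_like(x), x], axis=1)   # design matrix with a bias column
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Inference: predict y = f_w(x) at an unknown point x
x_new = 4.0
y_new = w[0] + w[1] * x_new
print(w, y_new)
```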





Examples Supervised Learning (Regression)

Input → Model → Output

[input image] → $f_w$ → 31.45

Mapping: $f_w(x) \in \mathbb{R}$ (real-valued output)





Capacity, Overfitting, Underfitting
◼ Terminology
^ Capacity: complexity of the functions that can be represented by the model f
^ Underfitting: Model too simple, does not achieve low error on training set
^ Overfitting: Training error small, but test error (= generalization error) large

[Figure: polynomial fits of increasing capacity: too low capacity, almost there, about right capacity, too high]
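As an illustrative sketch (not from the slides), underfitting and overfitting can be seen by sweeping the polynomial degree and comparing training and test error; the data and degrees below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1D data: noisy sine, split into a training and a test part
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)
x_tr, y_tr, x_te, y_te = x[:20], y[:20], x[20:], y[20:]

for degree in (1, 3, 15):                    # too low / about right / too high capacity
    w = np.polyfit(x_tr, y_tr, degree)       # least-squares polynomial fit
    err_tr = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    err_te = np.mean((np.polyval(w, x_te) - y_te) ** 2)
    print(f"degree={degree:2d}  train MSE={err_tr:.3f}  test MSE={err_te:.3f}")
```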



Capacity, Overfitting, Underfitting
◼ General approach: split the dataset into training, validation and test sets
^ Choose the values of the hyperparameters (e.g., degree of the polynomial, learning rate of a neural net, ...) using the validation set
^ Important: evaluate only once on the test set
◼ It is very (!!) important to use independent test data
^ Typically 50% for training
^ 20% for validation
^ 30% for testing
◼ However, the split might change
^ depending on the amount of data available
◼ For small datasets we use cross-validation

[Pie chart of the data split: training 50%, validation 20%, testing 30%]
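A minimal sketch of such a 50/20/30 split (illustrative; the function name and toy data are assumptions, and a fixed random seed keeps the split reproducible):

```python
import numpy as np

def split_dataset(x, y, seed=0):
    """Shuffle and split the data into 50% train, 20% validation, 30% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_tr = int(0.5 * len(x))
    n_va = int(0.2 * len(x))
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (x[tr], y[tr]), (x[va], y[va]), (x[te], y[te])

# Select hyperparameters on the validation set, evaluate once on the test set
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x)
(train, val, test) = split_dataset(x, y)
```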



Hierarchy

[Diagram as above, with Ridge Regression now added under Regression]



Shallow Learning
Ridge Regression



Ridge Regression
◼ Polynomial curve model: $f_w(x) = \sum_{m=0}^{M} w_m x^m = w^\top \phi(x)$

◼ Ridge regression: add a regularization term weighted by $\lambda$ to discourage large parameters,
  $\hat{w} = \arg\min_w \sum_{i=1}^{n} \big(y_i - w^\top \phi(x_i)\big)^2 + \lambda \lVert w \rVert_2^2$

◼ It has a closed-form solution: $\hat{w} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$, where $\Phi$ is the design matrix with rows $\phi(x_i)^\top$
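A minimal numpy sketch of this closed-form solution (the polynomial degree and λ below are illustrative choices):

```python
import numpy as np

def ridge_fit(x, y, degree=9, lam=1e-3):
    """Closed-form ridge regression: w = (Phi^T Phi + lam * I)^-1 Phi^T y."""
    Phi = np.vander(x, degree + 1, increasing=True)   # polynomial design matrix
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=x.shape)
w = ridge_fit(x, y)        # a larger lam shrinks the parameters more strongly
```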



Ridge Regression

◼ M = 15
◼ Leftmost: plain linear regression
◼ Others, left to right: weak to strong regularization



Estimators, Bias and Variance



Estimators, Bias and Variance

◼ Point Estimator
^ A point estimator is a function (the estimator) that maps a dataset to a model parameter (the estimate): $\hat{w} = g\big(\{(x_i, y_i)\}_{i=1}^{n}\big)$
^ Example: the ridge regression estimator $\hat{w} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$
^ The hat on w ($\hat{w}$) signifies that it is an estimate
^ A good estimator returns estimates close to the true parameters
^ The data are drawn from a random process, $(x_i, y_i) \sim p_{data}(\cdot)$
^ Thus any function of the data, including the estimate, is itself a random variable



Estimators, Bias and Variance



Estimators, Bias, Variance

► $\hat{w}$ is unbiased ⇔ Bias($\hat{w}$) = 0
► A good estimator has little bias
► A good estimator has low variance

◼ Bias-Variance Dilemma:
► Statistical learning theory tells us that we can't have both ⇒ there is a trade-off
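For completeness, these statements refer to the textbook definitions of the bias and variance of an estimator:

\[
\operatorname{Bias}(\hat{w}) = \mathbb{E}\big[\hat{w}\big] - w,
\qquad
\operatorname{Var}(\hat{w}) = \mathbb{E}\big[(\hat{w} - \mathbb{E}[\hat{w}])^2\big],
\]

where the expectations are taken over datasets drawn from $p_{data}$.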



Bias, Variance

◼ Datasets = 100
◼ True model = green line
◼ Variance refers to an algorithm's sensitivity to the specific set of training data
◼ Bias occurs when limited flexibility prevents learning the true signal

[Figure: fits across the 100 datasets for λ = 0.0000001 (left) and λ = 5 (right)]
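An illustrative simulation of this kind of experiment (a sketch under assumed data and model choices, not the lecture's exact setup): fit ridge regression on many resampled datasets and compare the spread of the fitted curves for a small versus a large λ.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)              # the "true model" (green line)
x_grid = np.linspace(0, 1, 50)
Phi_grid = np.vander(x_grid, 10, increasing=True)

for lam in (1e-7, 5.0):                               # weak vs. strong regularization
    preds = []
    for _ in range(100):                              # 100 independent datasets
        x = rng.uniform(0, 1, 25)
        y = true_f(x) + 0.3 * rng.normal(size=x.shape)
        Phi = np.vander(x, 10, increasing=True)
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)
        preds.append(Phi_grid @ w)
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"lambda={lam:g}  bias^2={bias2:.3f}  variance={var:.3f}")
```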



Estimators, Bias, Variance

The bias-variance trade-off:



Terms
◼ Last lecture
^ What is AI
^ Types of AI
^ What is machine learning
^ What is deep learning
^ What are the types of learning
^ Linear Regression, Polynomial Fitting
^ Capacity, Overfitting, Underfitting

◼ This Lecture
^ Ridge regression
^ Estimators, Bias, Variance



Maximum Likelihood Estimation



Maximum Likelihood Estimation

◼ We now reinterpret our results by taking a probabilistic point of view

◼ Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a set of samples drawn i.i.d. from the data distribution $p_{data}$
◼ Let the model $p(y \mid x; w)$ be a parametric family of probability distributions
◼ The conditional maximum likelihood estimator for $w$ is given by
  $\hat{w}_{ML} = \arg\max_w \prod_{i=1}^{N} p(y_i \mid x_i; w) = \arg\max_w \sum_{i=1}^{N} \log p(y_i \mid x_i; w)$



Maximum Likelihood Estimation

◼ Example
^ Assuming a Gaussian model $p(y \mid x; w) = \mathcal{N}\big(y;\, f_w(x), \sigma^2\big)$, we obtain the least-squares objective
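A short version of the derivation this slide refers to (standard steps, assuming a fixed noise variance $\sigma^2$):

\[
\begin{aligned}
\hat{w}_{ML}
  &= \arg\max_w \sum_{i=1}^{N} \log \mathcal{N}\big(y_i;\, f_w(x_i), \sigma^2\big) \\
  &= \arg\max_w \sum_{i=1}^{N} \Big( -\tfrac{1}{2\sigma^2}\big(y_i - f_w(x_i)\big)^2 - \tfrac{1}{2}\log(2\pi\sigma^2) \Big) \\
  &= \arg\min_w \sum_{i=1}^{N} \big(y_i - f_w(x_i)\big)^2
\end{aligned}
\]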



Maximum Likelihood Estimation

◼ We see that choosing $p(y \mid x; w)$ as a Gaussian distribution causes
  maximum likelihood to yield exactly the same least-squares estimator
  that was derived before

◼ Variations:
^ If we instead choose $p(y \mid x; w)$ as a Laplace distribution, we obtain the $\ell_1$ norm, and the expression becomes a sum of absolute errors
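The corresponding expression under the Laplace assumption (standard result, with a fixed scale parameter $b$):

\[
p(y \mid x; w) = \tfrac{1}{2b} \exp\!\Big( -\tfrac{\lvert y - f_w(x) \rvert}{b} \Big)
\quad\Rightarrow\quad
\hat{w}_{ML} = \arg\min_w \sum_{i=1}^{N} \big\lvert\, y_i - f_w(x_i) \,\big\rvert
\]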



Maximum Likelihood Estimation

◼ We see that choosing $p(y \mid x; w)$ as a Gaussian distribution causes
  maximum likelihood to yield exactly the same least-squares estimator
  that was derived before

◼ Consistency: as the number of training samples $N \to \infty$, the
  maximum likelihood estimate converges to the true parameters
◼ Efficiency: the ML estimate converges most quickly as N increases
◼ These theoretical considerations make ML more appealing





Examples Supervised Learning (Classification)

Input → Model → Output

[input image] → $f_w$ → cat

Mapping: $f_w(x) \in \{0, 1\}$



Hierarchy

[Diagram as above, with Logistic Regression now added under Classification]



Logistic Regression



Logistic Regression

◼ We have already seen the maximum likelihood estimator

◼ We now perform binary classification, i.e., $y_i \in \{0, 1\}$

◼ How should we choose the model $p(y \mid x; w)$ in this case?

◼ Answer: the Bernoulli distribution
  $p(y \mid x; w) = \hat{y}^{\,y} (1 - \hat{y})^{\,1 - y}$, where $\hat{y} = f_w(x)$ is predicted by the model



Logistic Regression

◼ In summary, we have assumed the Bernoulli distribution
  $p(y \mid x; w) = \hat{y}^{\,y} (1 - \hat{y})^{\,1 - y}$, where $\hat{y} = f_w(x)$
◼ The question is how to choose $f_w(x)$
◼ We are working with a discrete distribution, i.e., the prediction $\hat{y}$ must lie in $[0, 1]$
◼ We can choose the sigmoid (logistic) function applied to a linear model, $f_w(x) = \sigma(w^\top x)$
◼ The sigmoid is given as follows: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$



Logistic Regression

◼ Putting it together

◼ In machine learning we use the more general term 'loss function' rather than 'error function'
◼ We minimize the dissimilarity between the empirical data distribution
(defined by the training set) and the model distribution



Logistic Regression

◼ Binary Cross-Entropy Loss
  $L(w) = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\big]$, with $\hat{y}_i = f_w(x_i)$

◼ For $y_i = 1$ the loss $L$ is minimized if $\hat{y}_i = 1$
◼ For $y_i = 0$ the loss $L$ is minimized if $\hat{y}_i = 0$
◼ Thus, $L$ is minimal if $\hat{y}_i = y_i$
◼ Can be extended to more than two classes
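A small numpy sketch of this loss (illustrative; the clipping constant is only for numerical safety and is not part of the definition):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between labels in {0, 1} and predictions in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# The loss is small when the predictions match the labels
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.8])))
```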



Logistic Regression

◼ A simple 1D example

◼ Dataset X with positive ($y_i = 1$) and negative ($y_i = 0$) samples

Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a



Logistic Regression

◼ A simple 1D example

►Logistic regressor $f_w(x) = \sigma(w_0 + w_1 x)$ fit to dataset X

Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a



Logistic Regression

◼ A simple 1D example

◼ Probabilities of the classifier $f_w(x_i)$ for positive samples ($y_i = 1$)


Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a



Logistic Regression

◼ A simple 1D example

►Probabilities of the classifier $f_w(x_i)$ for negative samples ($y_i = 0$)

Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a



Logistic Regression

◼ A simple 1D example

◼ Putting both together


Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a





Logistic Regression

◼ Maximum Likelihood for Logistic Regression:
  $\hat{w} = \arg\min_w L(w) = \arg\min_w -\sum_{i=1}^{N} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\big]$

  with $\hat{y}_i = \sigma(w^\top x_i)$ and $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

◼ How do we find the minimizer of this loss?

◼ In contrast to linear regression, the loss is not quadratic in $w$
◼ We must apply iterative gradient-based optimization. The gradient is
  $\nabla_w L(w) = \sum_{i=1}^{N} \big( \hat{y}_i - y_i \big)\, x_i$



Logistic Regression

◼ Gradient Descent
^ Pick the step size $\eta$ and tolerance $\epsilon$
^ Initialize $w^{(0)}$
^ Repeat $w^{(t+1)} = w^{(t)} - \eta \, \nabla_w L(w^{(t)})$ until $\lVert \nabla_w L(w^{(t)}) \rVert \le \epsilon$

◼ Variants
^ Line Search
^ Conjugate gradients (source: https://en.wikipedia.org/wiki/Conjugate_gradient_method)

^ L-BFGS
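A compact sketch of plain gradient descent on the logistic regression loss from the previous slide (the step size, tolerance, and toy data are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, tol=1e-6, max_iter=10000):
    """Gradient descent on the mean binary cross-entropy loss of logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient of the mean BCE loss
        if np.linalg.norm(grad) <= tol:              # stop when the gradient is small
            break
        w -= eta * grad
    return w

# Toy 1D data with a bias column; positive samples have larger x
X = np.array([[1, -2.0], [1, -1.0], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic(X, y)
print(w, sigmoid(X @ w))
```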



Logistic Regression

◼ Examples in 2D

◼ Logistic Regression model: $f_w(x) = \sigma(w_0 + w_1 x_1 + w_2 x_2)$



Logistic Regression

◼ Maximizing the Log-Likelihood is equivalent to minimizing the Cross-Entropy or the KL divergence
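In symbols (standard identity, writing $\hat{p}_{data}$ for the empirical data distribution and $p_w$ for the model distribution):

\[
\arg\max_w \; \mathbb{E}_{(x,y) \sim \hat{p}_{data}} \big[ \log p(y \mid x; w) \big]
= \arg\min_w \; H\big(\hat{p}_{data}, p_w\big)
= \arg\min_w \; D_{KL}\big(\hat{p}_{data} \,\Vert\, p_w\big),
\]

since the entropy of the data distribution does not depend on $w$.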



Summary (Second Part)

Maximum Likelihood Estimation

• Maximum Likelihood Estimation is Least-Squares Estimation when a Gaussian distribution is assumed



Summary (Second Part)

Maximum Likelihood Estimation


• Logistic Regression
• Maximum Likelihood Estimation of the Bernoulli distribution
• Example of Binary Classification using Logistic Regression
• Optimization
  • Gradient Descent
• Relation to Information Theory
  • Maximizing the Log-Likelihood is equivalent to minimizing the Cross-Entropy or KL divergence



Computational Graphs



Thanks a lot for your attention

