
Very Deep Learning

Lecture 02

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
afzal.tukl@gmail.com



Recap



Hierarchy

[Diagram: Artificial Intelligence ⊃ Machine Learning ⊃ {Shallow Learning, Deep Learning}; learning paradigms: Supervised, Unsupervised, Self-supervised, Reinforcement; under Supervised: Regression, Classification, Structured Prediction; under Regression: Linear Regression]



Supervised Learning

Input → Model → Output

x → $f_w$ → y

Learning: the estimation of the parameters $w$ from the training data $\{(x_i, y_i)\}_{i=1}^{n}$
Inference: making a prediction for an unknown point $x$, i.e., $y = f_w(x)$
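To make the learning/inference distinction concrete, here is a minimal sketch (illustrative, not from the slides) that estimates the parameters of a simple linear model $f_w(x) = w_0 + w_1 x$ from training data and then predicts at a new point; the toy data and names are assumptions.

```python
import numpy as np

# Toy training data {(x_i, y_i)} - illustrative only
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])

# Learning: estimate w = (w0, w1) from the training data via least squares
X = np.stack([np.ones_like(x), x], axis=1)   # design matrix with a bias column
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Inference: predict y = f_w(x) at an unknown point x
x_new = 4.0
y_new = w[0] + w[1] * x_new
print(w, y_new)
```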





Examples Supervised Learning (Regression)

Input → Model → Output

[input image] → $f_w$ → 31.45

Mapping: $f_w(x) \in \mathbb{R}$ (real-valued output)





Capacity, Overfitting, Underfitting
◼ Terminology
^ Capacity: complexity of the functions that can be represented by the model f
^ Underfitting: Model too simple, does not achieve low error on training set
^ Overfitting: Training error small, but test error (= generalization error) large

[Figure: polynomial fits of increasing capacity: too low capacity, almost there, about right capacity, too high]
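As an illustrative sketch (not from the slides), underfitting and overfitting can be seen by sweeping the polynomial degree and comparing training and test error; the data and degrees below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1D data: noisy sine, split into a training and a test part
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)
x_tr, y_tr, x_te, y_te = x[:20], y[:20], x[20:], y[20:]

for degree in (1, 3, 15):                    # too low / about right / too high capacity
    w = np.polyfit(x_tr, y_tr, degree)       # least-squares polynomial fit
    err_tr = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    err_te = np.mean((np.polyval(w, x_te) - y_te) ** 2)
    print(f"degree={degree:2d}  train MSE={err_tr:.3f}  test MSE={err_te:.3f}")
```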



Capacity, Overfitting, Underfitting
◼ General approach: split the dataset into training, validation and test sets
^ Choose the values of the hyperparameters (e.g., degree of the polynomial, learning rate of a neural net, ...) using the validation set
^ Important: evaluate only once on the test set
◼ It is very (!!) important to use independent test data
^ Typically 50% for training
^ 20% for validation
^ 30% for testing
◼ However, the split might change
^ depending on the amount of data available
◼ For small datasets we use cross-validation

[Pie chart of the data split: training 50%, validation 20%, testing 30%]
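A minimal sketch of such a 50/20/30 split (illustrative; the function name and toy data are assumptions, and a fixed random seed keeps the split reproducible):

```python
import numpy as np

def split_dataset(x, y, seed=0):
    """Shuffle and split the data into 50% train, 20% validation, 30% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_tr = int(0.5 * len(x))
    n_va = int(0.2 * len(x))
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (x[tr], y[tr]), (x[va], y[va]), (x[te], y[te])

# Select hyperparameters on the validation set, evaluate once on the test set
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x)
(train, val, test) = split_dataset(x, y)
```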



Hierarchy

[Diagram as above, with Ridge Regression now added under Regression]



Shallow Learning
Ridge Regression



Ridge Regression
◼ Polynomial curve model: $f_w(x) = \sum_{m=0}^{M} w_m x^m = w^\top \phi(x)$

◼ Ridge regression: add a regularization term weighted by $\lambda$ to discourage large parameters,
  $\hat{w} = \arg\min_w \sum_{i=1}^{n} \big(y_i - w^\top \phi(x_i)\big)^2 + \lambda \lVert w \rVert_2^2$

◼ It has a closed-form solution: $\hat{w} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$, where $\Phi$ is the design matrix with rows $\phi(x_i)^\top$
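A minimal numpy sketch of this closed-form solution (the polynomial degree and λ below are illustrative choices):

```python
import numpy as np

def ridge_fit(x, y, degree=9, lam=1e-3):
    """Closed-form ridge regression: w = (Phi^T Phi + lam * I)^-1 Phi^T y."""
    Phi = np.vander(x, degree + 1, increasing=True)   # polynomial design matrix
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=x.shape)
w = ridge_fit(x, y)        # a larger lam shrinks the parameters more strongly
```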



Ridge Regression

◼ M = 15
◼ Leftmost: plain linear regression
◼ Others, left to right: weak to strong regularization



Estimators, Bias and Variance



Estimators, Bias and Variance

◼ Point Estimator
^ A point estimator is a function (the estimator) that maps a dataset to a model parameter (the estimate): $\hat{w} = g\big(\{(x_i, y_i)\}_{i=1}^{n}\big)$
^ Example: the ridge regression estimator $\hat{w} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$
^ The hat on w ($\hat{w}$) signifies that it is an estimate
^ A good estimator returns estimates close to the true parameters
^ The data are drawn from a random process, $(x_i, y_i) \sim p_{data}(\cdot)$
^ Thus any function of the data, including the estimate, is itself a random variable



Estimators, Bias and Variance



Estimators, Bias, Variance

► $\hat{w}$ is unbiased ⇔ Bias($\hat{w}$) = 0
► A good estimator has little bias
► A good estimator has low variance

◼ Bias-Variance Dilemma:
► Statistical learning theory tells us that we can't have both ⇒ there is a trade-off
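For completeness, these statements refer to the textbook definitions of the bias and variance of an estimator:

\[
\operatorname{Bias}(\hat{w}) = \mathbb{E}\big[\hat{w}\big] - w,
\qquad
\operatorname{Var}(\hat{w}) = \mathbb{E}\big[(\hat{w} - \mathbb{E}[\hat{w}])^2\big],
\]

where the expectations are taken over datasets drawn from $p_{data}$.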



Bias, Variance

◼ Datasets = 100
◼ True model = green line
◼ Variance refers to an algorithm's sensitivity to the specific set of training data
◼ Bias occurs when limited flexibility prevents learning the true signal

[Figure: fits across the 100 datasets for λ = 0.0000001 (left) and λ = 5 (right)]
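An illustrative simulation of this kind of experiment (a sketch under assumed data and model choices, not the lecture's exact setup): fit ridge regression on many resampled datasets and compare the spread of the fitted curves for a small versus a large λ.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)              # the "true model" (green line)
x_grid = np.linspace(0, 1, 50)
Phi_grid = np.vander(x_grid, 10, increasing=True)

for lam in (1e-7, 5.0):                               # weak vs. strong regularization
    preds = []
    for _ in range(100):                              # 100 independent datasets
        x = rng.uniform(0, 1, 25)
        y = true_f(x) + 0.3 * rng.normal(size=x.shape)
        Phi = np.vander(x, 10, increasing=True)
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)
        preds.append(Phi_grid @ w)
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"lambda={lam:g}  bias^2={bias2:.3f}  variance={var:.3f}")
```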



Estimators, Bias, Variance

The bias-variance trade-off:



Terms
◼ Last lecture
^ What is AI
^ Types of AI
^ What is machine learning
^ What is deep learning
^ What are the types of learning
^ Linear Regression, Polynomial Fitting
^ Capacity, Overfitting, Underfitting

◼ This Lecture
^ Ridge regression
^ Estimators, Bias, Variance



Maximum Likelihood Estimation



Maximum Likelihood Estimation

◼ We now reinterpret our results by taking a probabilistic point of view

◼ Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a set of samples drawn i.i.d. from the data distribution $p_{data}$
◼ Let the model $p(y \mid x; w)$ be a parametric family of probability distributions
◼ The conditional maximum likelihood estimator for $w$ is given by
  $\hat{w}_{ML} = \arg\max_w \prod_{i=1}^{N} p(y_i \mid x_i; w) = \arg\max_w \sum_{i=1}^{N} \log p(y_i \mid x_i; w)$



Maximum Likelihood Estimation

◼ Example
^ Assuming a Gaussian model $p(y \mid x; w) = \mathcal{N}\big(y;\, f_w(x), \sigma^2\big)$, we obtain the least-squares objective
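A short version of the derivation this slide refers to (standard steps, assuming a fixed noise variance $\sigma^2$):

\[
\begin{aligned}
\hat{w}_{ML}
  &= \arg\max_w \sum_{i=1}^{N} \log \mathcal{N}\big(y_i;\, f_w(x_i), \sigma^2\big) \\
  &= \arg\max_w \sum_{i=1}^{N} \Big( -\tfrac{1}{2\sigma^2}\big(y_i - f_w(x_i)\big)^2 - \tfrac{1}{2}\log(2\pi\sigma^2) \Big) \\
  &= \arg\min_w \sum_{i=1}^{N} \big(y_i - f_w(x_i)\big)^2
\end{aligned}
\]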



Maximum Likelihood Estimation

◼ We see that choosing $p(y \mid x; w)$ as a Gaussian distribution causes
  maximum likelihood to yield exactly the same least-squares estimator
  that was derived before

◼ Variations:
^ If we instead choose $p(y \mid x; w)$ as a Laplace distribution, we obtain the $\ell_1$ norm, and the expression becomes a sum of absolute errors
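The corresponding expression under the Laplace assumption (standard result, with a fixed scale parameter $b$):

\[
p(y \mid x; w) = \tfrac{1}{2b} \exp\!\Big( -\tfrac{\lvert y - f_w(x) \rvert}{b} \Big)
\quad\Rightarrow\quad
\hat{w}_{ML} = \arg\min_w \sum_{i=1}^{N} \big\lvert\, y_i - f_w(x_i) \,\big\rvert
\]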



Maximum Likelihood Estimation

◼ We see that choosing $p(y \mid x; w)$ as a Gaussian distribution causes
  maximum likelihood to yield exactly the same least-squares estimator
  that was derived before

◼ Consistency: as the number of training samples $N \to \infty$, the
  maximum likelihood estimate converges to the true parameters
◼ Efficiency: the ML estimate converges most quickly as N increases
◼ These theoretical considerations make ML more appealing





Examples Supervised Learning (Classification)

Input → Model → Output

[input image] → $f_w$ → cat

Mapping: $f_w(x) \in \{0, 1\}$



Hierarchy

[Diagram as above, with Logistic Regression now added under Classification]



Logistic Regression



Logistic Regression

◼ We have already seen the maximum likelihood estimator

◼ We now perform binary classification, i.e., $y_i \in \{0, 1\}$

◼ How should we choose the model $p(y \mid x; w)$ in this case?

◼ Answer: the Bernoulli distribution
  $p(y \mid x; w) = \hat{y}^{\,y} (1 - \hat{y})^{\,1 - y}$, where $\hat{y} = f_w(x)$ is predicted by the model



Logistic Regression

◼ In summary, we have assumed the Bernoulli distribution
  $p(y \mid x; w) = \hat{y}^{\,y} (1 - \hat{y})^{\,1 - y}$, where $\hat{y} = f_w(x)$
◼ The question is how to choose $f_w(x)$
◼ We are working with a discrete distribution, i.e., the prediction $\hat{y}$ must lie in $[0, 1]$
◼ We can choose the sigmoid (logistic) function applied to a linear model, $f_w(x) = \sigma(w^\top x)$
◼ The sigmoid is given as follows: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$



Logistic Regression

◼ Putting it together

◼ In machine learning we use the more general term 'loss function' rather than 'error function'
◼ We minimize the dissimilarity between the empirical data distribution
(defined by the training set) and the model distribution



Logistic Regression

◼ Binary Cross-Entropy Loss
  $L(w) = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\big]$, with $\hat{y}_i = f_w(x_i)$

◼ For $y_i = 1$ the loss $L$ is minimized if $\hat{y}_i = 1$
◼ For $y_i = 0$ the loss $L$ is minimized if $\hat{y}_i = 0$
◼ Thus, $L$ is minimal if $\hat{y}_i = y_i$
◼ Can be extended to more than two classes
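A small numpy sketch of this loss (illustrative; the clipping constant is only for numerical safety and is not part of the definition):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between labels in {0, 1} and predictions in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# The loss is small when the predictions match the labels
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.8])))
```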



Logistic Regression

◼ A simple 1D example

◼ Dataset X with positive ($y_i = 1$) and negative ($y_i = 0$) samples

Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a



Logistic Regression

◼ A simple 1D example

►Logistic regressor $f_w(x) = \sigma(w_0 + w_1 x)$ fit to dataset X

Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a



Logistic Regression

◼ A simple 1D example

◼ Probabilities of the classifier $f_w(x_i)$ for positive samples ($y_i = 1$)


Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a



Logistic Regression

◼ A simple 1D example

►Probabilities of the classifier $f_w(x_i)$ for negative samples ($y_i = 0$)

Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a



Logistic Regression

◼ A simple 1D example

◼ Putting both together


Source: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a





Logistic Regression

◼ Maximum Likelihood for Logistic Regression:
  $\hat{w} = \arg\min_w L(w) = \arg\min_w -\sum_{i=1}^{N} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\big]$

  with $\hat{y}_i = \sigma(w^\top x_i)$ and $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

◼ How do we find the minimizer of this loss?

◼ In contrast to linear regression, the loss is not quadratic in $w$
◼ We must apply iterative gradient-based optimization. The gradient is
  $\nabla_w L(w) = \sum_{i=1}^{N} \big( \hat{y}_i - y_i \big)\, x_i$



Logistic Regression

◼ Gradient Descent
^ Pick the step size $\eta$ and tolerance $\epsilon$
^ Initialize $w^{(0)}$
^ Repeat $w^{(t+1)} = w^{(t)} - \eta \, \nabla_w L(w^{(t)})$ until $\lVert \nabla_w L(w^{(t)}) \rVert \le \epsilon$

◼ Variants
^ Line Search
^ Conjugate gradients (source: https://en.wikipedia.org/wiki/Conjugate_gradient_method)

^ L-BFGS
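A compact sketch of plain gradient descent on the logistic regression loss from the previous slide (the step size, tolerance, and toy data are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, tol=1e-6, max_iter=10000):
    """Gradient descent on the mean binary cross-entropy loss of logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient of the mean BCE loss
        if np.linalg.norm(grad) <= tol:              # stop when the gradient is small
            break
        w -= eta * grad
    return w

# Toy 1D data with a bias column; positive samples have larger x
X = np.array([[1, -2.0], [1, -1.0], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic(X, y)
print(w, sigmoid(X @ w))
```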



Logistic Regression

◼ Examples in 2D

◼ Logistic Regression model: $f_w(x) = \sigma(w_0 + w_1 x_1 + w_2 x_2)$



Logistic Regression

◼ Maximizing the Log-Likelihood is equivalent to minimizing the Cross-Entropy or the KL divergence
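In symbols (standard identity, writing $\hat{p}_{data}$ for the empirical data distribution and $p_w$ for the model distribution):

\[
\arg\max_w \; \mathbb{E}_{(x,y) \sim \hat{p}_{data}} \big[ \log p(y \mid x; w) \big]
= \arg\min_w \; H\big(\hat{p}_{data}, p_w\big)
= \arg\min_w \; D_{KL}\big(\hat{p}_{data} \,\Vert\, p_w\big),
\]

since the entropy of the data distribution does not depend on $w$.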



Summary (Second Part)

Maximum Likelihood Estimation

• Maximum Likelihood Estimation is Least-Squares Estimation when a Gaussian distribution is assumed



Summary (Second Part)

Maximum Likelihood Estimation


• Logistic Regression
• Maximum Likelihood Estimation of the Bernoulli distribution
• Example of Binary Classification using Logistic Regression
• Optimization
  • Gradient Descent
• Relation to Information Theory
  • Maximizing the Log-Likelihood is equivalent to minimizing the Cross-Entropy or KL divergence



Computational Graphs



Thanks a lot for your attention

