
Artificial Intelligence & Data Mining

WEN Bihan (Asst Prof)


Homepage: www.bihanwen.com

1
Regularization and Optimization

WEN Bihan (Asst Prof)


Homepage: www.bihanwen.com

2
Outline

• Learning and Supervision

• Bias and Variance

• Overfitting and Underfitting

• Diagnosing the model training

3
Carry-on Questions

• What are the types of supervision?

• How to measure the degree of overfitting?

• What are methods to prevent overfitting?

4
Supervised vs Unsupervised Learning

From what we have learned so far:

• Classification is supervised:

• A training dataset with pre-defined class labels is provided.

• Clustering is unsupervised:

• No training dataset with pre-existing labels is given.

5
Example – Image Classification
Input images → desired output labels:

apple

pear

tomato

cow

dog

horse

6
The Basic Supervised Learning Framework
[Pipeline diagram]
Training time: Training Samples → Features → Training (with Training Labels) → Learned model
Testing time: Test Sample → Features → Learned model → Prediction
7
The Basic Supervised Learning Framework

𝑦 = 𝑓𝜃(𝒙)

(𝑦: output, 𝑓𝜃: prediction function, 𝒙: input)

• Learning / Training: given a training set of labeled examples {(𝒙1, 𝑦1), …, (𝒙𝑁, 𝑦𝑁)}, estimate the parameters 𝜃 of the prediction function 𝑓𝜃.

• Inference / Testing: apply the learned 𝑓𝜃 to an unseen test example 𝒙 and output the predicted value 𝑦 = 𝑓𝜃(𝒙).
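
A minimal sketch of this train-then-predict pattern, assuming a linear prediction function fitted by least squares (the function names below are illustrative, not from the slides):

```python
# Train-then-predict sketch: assume f_theta(x) = theta^T [x, 1] and a
# squared-error fit; real models differ, but the interface is the same.
import numpy as np

def train(X, y):
    """Estimate theta from labeled pairs {(x_i, y_i)} via least squares."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias feature
    theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return theta

def predict(theta, x):
    """Inference: apply the learned f_theta to an unseen example x."""
    return np.append(x, 1.0) @ theta

# Usage: fit on a toy training set, then predict for a new input.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])            # roughly y = 2x + 1
theta = train(X_train, y_train)
print(predict(theta, np.array([4.0])))               # about 9.0
```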

8
Examples of Supervised Methods

• Nearest Neighbor

• Linear classification / regression

• Structured Prediction

9
Supervised Learning - Nearest Neighbor

[Figure: training examples from class 1, training examples from class 2, and a test example]

f(x) = label of the training example nearest to x

• All we need is a distance function for our inputs

10
K-Nearest Neighbor
• For a new point, find the k closest points in the training data
• Predict the class label by a majority vote over the labels of those k points (see the sketch below)

[Figure: example with k = 5]
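
A minimal k-NN sketch in the spirit of the figure, assuming Euclidean distance and majority voting (function and variable names are illustrative):

```python
# k-nearest-neighbor classification: Euclidean distance plus a majority
# vote among the k closest training points (k = 5 as in the figure).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Distance from the query point x to every training example.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest training points.
    nearest = np.argsort(dists)[:k]
    # Vote for the class label using the labels of those k points.
    return Counter(y_train[nearest]).most_common(1)[0][0]
```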

11
Supervised Learning – Linear Classifier

• Find a linear function to separate the classes:

f(x) = sgn(w1x1 + w2x2 + … + wDxD + b) = sgn(w ⋅ x + b)
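
A one-line sketch of this decision rule, assuming NumPy arrays for w and x:

```python
# Linear classifier: predict the sign of the affine score w . x + b.
import numpy as np

def linear_classify(w, b, x):
    # Returns +1 or -1 depending on which side of the hyperplane x lies
    # (np.sign returns 0 only for points exactly on the boundary).
    return np.sign(w @ x + b)
```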

12
Supervised Learning – Linear Classifier

13
Supervised Learning – Structured Prediction

Example: a sentence (input) is mapped to its parse tree (output).

14
Examples of Unsupervised Learning

• Clustering

• Principal Components Analysis (PCA)

15
Unsupervised Learning – Clustering

• Discover groups of “similar” data points.

• Only unlabeled data as input, without pre-defined classes.

16
Unsupervised Learning – Dimensionality
Reduction
• Discover a lower-dimensional subspace on which the data lives.

• Example: Principal Components Analysis (PCA)

17
Beyond: Reinforcement Learning

• Learn from rewards in a sequential environment.

https://deepmind.com/research/alphago/

18
Types of Supervision

• Unsupervised (no labels)

• Weakly supervised (noisy labels, or labels not exactly for the task of interest)

• Semi-supervised (labels for a small portion of the training data)

• Supervised (clean, complete training labels for the task of interest)

19
Learning Effectiveness

• Potential Problems

1. Do you have sufficient data for supervision? - Overfitting

20
Learning Effectiveness

• Potential Problems

1. Do you have sufficient data for supervision? - Overfitting

2. Is your model complex / rich enough for the problem? - Underfitting

21
Learning Effectiveness

• Potential Problems

1. Do you have sufficient data for supervision? - Overfitting

2. Is your model complex / rich enough for the problem? - Underfitting

• We wish to understand what happened

• Solution: Statistical Learning Theory

22
Basics on statistical learning theory

• Why do we need to study statistical learning?

• We cannot know exactly how well an algorithm will work in practice (the true "risk", a measure of effectiveness),

• because we do not know the true distribution of the data that the algorithm will work on.

• But we can instead measure its performance on a known set of data (the "empirical" risk).

• Empirical Risk Minimization is the core idea of statistical learning.

23
Basics on statistical learning theory

• Expected (true) Risk:

• ℎ(𝑥) is the function predicting 𝑦.

• 𝑙(ℎ(𝑥), 𝑦) measures the distance between 𝑦 and the predicted ℎ(𝑥).

• (𝑥, 𝑦) follows some underlying distribution 𝑝(𝑥, 𝑦): some (𝑥, 𝑦) appear more often in practice, and thus need higher weight.

• The expected (true) risk measures how well ℎ(𝑥) approximates 𝑦.

• In practice, we do not have full access to the distribution 𝑝(𝑥, 𝑦),

• so we cannot evaluate the expected risk explicitly.
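
The formula on this slide is not reproduced above; a standard formulation consistent with the description is:

```latex
R(h) \;=\; \mathbb{E}_{(x,y)\sim p(x,y)}\big[\, l(h(x), y) \,\big]
      \;=\; \int l(h(x), y)\, p(x, y)\, dx\, dy
```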

24
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Though we do not have full access to the distribution 𝑝(𝑥, 𝑦), we can collect a labeled dataset: a limited number of samples (𝑥^(𝑖), 𝑦^(𝑖)) drawn from 𝑝(𝑥, 𝑦).

• Instead of integrating, we take the average distance between 𝑦^(𝑖) and the predicted ℎ(𝑥^(𝑖)): all samples have equal weight.

• In practice, we can calculate the empirical risk given a labeled dataset.

• Empirical Risk approximates the Expected Risk.
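
Again, the slide's formula is not reproduced above; a standard empirical-risk formulation consistent with the description (with N labeled samples) is:

```latex
\hat{R}(h) \;=\; \frac{1}{N} \sum_{i=1}^{N} l\big(h(x^{(i)}),\, y^{(i)}\big),
\qquad (x^{(i)}, y^{(i)}) \sim p(x, y)
```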

25
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Limitations of learning the function ℎ(𝑥) in practice:

1. We need assumptions on the ℎ(𝑥) to be learned, ℎ ∈ ℋ (the specific model you use, e.g., a linear regressor or a neural network).

2. We can only minimize the empirical risk instead of the expected risk.

26
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Limitations of learning the function ℎ(𝑥):

[Figure: the best possible ℎ(⋅), the ℎ reachable with limitation 1, and the ℎ reached with limitations 1 + 2]
27
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Total Learning Error = (error by limitation 1) + (error by limitation 2)

28
Basics on statistical learning theory

[Figure: ℋ, the set of all possible functions you can learn using a specific model]

• ℎ ∈ ℋ is the learnable function space based on our assumptions.

• 𝐼 is the size / complexity of the training dataset.


29
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Model Complexity (informally):


• How many parameters in 𝑓𝜃(⋅) do we have to learn?

• Neural Networks: #hidden neurons

30
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Empirical Error:
• In a given dataset, the percentage of items that are misclassified by 𝑓𝜃(⋅).

• Here we refer to the testing dataset.

31
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Expected Error:

• For an item randomly drawn from the underlying distribution, the probability that it is misclassified by 𝑓𝜃(⋅).

32
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Variance:
• Type of error that occurs due to a model's sensitivity to small fluctuations in the
training set.

• Variance increases with model complexity.


33
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Bias:
• Type of error that occurs due to wrong / inaccurate assumptions made in the
learning algorithm.

• Bias is high when the model is (too) simple


34
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Expected error of a classifier ≈ bias² + variance (+ noise)
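
For squared loss, a standard way to make this precise (not spelled out on the slide) is the following, with the expectation taken over training sets and 𝜎² the irreducible noise:

```latex
\mathbb{E}\left[\big(y - f_\theta(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[f_\theta(x)] - f^{*}(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\big(f_\theta(x) - \mathbb{E}[f_\theta(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}},
\qquad f^{*}(x) = \mathbb{E}[\,y \mid x\,].
```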

35
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Simple model: high bias and low variance
• Complex model: low bias and high variance
36
Bias and Variance

• The trade-off between bias and variance (of a model)

• Bullseye (center) = target model; Darts (crosses) = learned models

37
Basics on statistical learning theory

[Figure: ℋ, the set of all possible functions you can learn using a specific model]

• ℎ ∈ ℋ is the learnable function space based on our assumptions.

• 𝐼 is the size / complexity of the training dataset.


38
Overfitting vs Underfitting

• What is a good model?

[Figure: fits by a simple model, a complex model, and a good model]
39
Overfitting vs Underfitting

• Simple Model

• High Bias

• Causes an algorithm to miss relevant relations between the input features and the target outputs.

• Complex Model

• High Variance

• Causes an algorithm to model the noise in the training set.

40
Overfitting vs Underfitting

• Simple Model

• High Bias - Underfitting

• Complex Model

• High Variance - Overfitting

41
Overfitting vs Underfitting

• Training a classifier 𝑓𝜃 (𝑥)

• Simple model: high bias and low variance
• Complex model: low bias and high variance
42
Overfitting vs Underfitting

• Training a classifier 𝑓𝜃 (𝑥)

• Underfitting: high bias and low variance
• Overfitting: low bias and high variance
43
Overfitting vs Underfitting

• Measure overfitting by training and testing / validation errors

44
Overfitting vs Underfitting

• Overfitting ≈ Testing / Validation Error – Training Error

• The gap measures the degree of overfitting.

45
Overfitting vs Underfitting

• Overfitting ≈ Testing / Validation Error – Training Error

• Overfitting: large gap between training and test errors

• Underfitting: small gap between training and test errors

46
Overfitting vs Underfitting

• Bias-Variance Tradeoff:

• The fundamental dilemma of simultaneously minimizing two sources of error that prevent ML algorithms from generalizing beyond their training set.

• The bias is error from erroneous assumptions in the learning algorithm. High
bias can cause an algorithm to miss the relevant relations between features
and target outputs (e.g., model is too simple -> underfitting).

• The variance is error from sensitivity to small fluctuations in the training set.
High variance can cause an algorithm to model the random noise in the
training data, rather than the intended outputs (e.g., model is too
complicated -> overfitting).

47
Overfitting vs Underfitting

• Monitoring the bias-variance trade-off:

• Separate a validation dataset.

• Learn parameters on the training data.


• Measure accuracy on the held-out validation dataset (with the known labels).

• Peek at the validation set to prevent overfitting and underfitting (see the sketch below).
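
A minimal sketch of this workflow, assuming hypothetical train_model() and error_rate() helpers standing in for whatever model is being fit:

```python
# Hold out a validation set, fit on the rest, and compare errors.
import numpy as np

def split_train_val(X, y, val_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Usage (train_model and error_rate are placeholders, not a specific API):
# X_tr, y_tr, X_val, y_val = split_train_val(X, y)
# model = train_model(X_tr, y_tr)                  # learn parameters on training data
# gap = error_rate(model, X_val, y_val) - error_rate(model, X_tr, y_tr)
# A large gap suggests overfitting; high errors with a small gap suggest underfitting.
```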

48
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

• Dropout: during training, some of a layer's outputs are randomly ignored or "dropped out" (see the sketch below).
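
A minimal sketch of the idea, assuming the common "inverted dropout" variant in which the surviving activations are rescaled during training:

```python
# Inverted dropout on one layer's outputs: randomly zero each output with
# probability p during training and rescale the survivors by 1/(1 - p) so
# the expected activation is unchanged; at test time the layer is untouched.
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    rng = rng or np.random.default_rng()
    if not training or p == 0.0:
        return activations
    mask = (rng.random(activations.shape) >= p) / (1.0 - p)
    return activations * mask
```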

49
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

• Early Stopping: sample the model every few iterations of training, check how well it works on the validation set, and stop when the validation error reaches its minimum (see the sketch below).
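
A minimal sketch of the stopping rule itself, applied to a toy validation-error curve (the `patience` budget is an added assumption, not from the slides):

```python
# Early stopping: keep the epoch with the lowest validation error, and stop
# once the error has not improved for `patience` consecutive checks.
def early_stop_epoch(val_errors, patience=3):
    best_err, best_epoch, stale = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, stale = err, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_err

# Toy validation curve that bottoms out and then rises again:
print(early_stop_epoch([0.9, 0.6, 0.4, 0.35, 0.36, 0.38, 0.40, 0.45]))
# -> (3, 0.35): training stops after epoch 6 and keeps the epoch-3 model.
```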

50
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

• Dropout: During training, some number of layer outputs are randomly ignored or
“dropped out”.

• Early Stopping: Sample the model every few iterations of training, check how well it
works with the validation set, and stop when the validation error reaches the minimum.

• Weight Sharing: Instead of training each neuron independently, we can force their parameters to be the same. Example: Recurrent Neural Networks (RNNs).

51
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

2. Increase the training data complexity / size, to reduce the variance.

• Add more training data

• Data Augmentation: modify the data available in a realistic but randomized way, to
increase the variety of data seen during training

52
Data augmentation
• Introduce transformations not adequately sampled in the training data

• Geometric: flipping, rotation, shearing, multiple crops

[Figure: flipping & rotation; cropping]

53
Data augmentation
• Introduce transformations not adequately sampled in the training data

• Geometric: flipping, rotation, shearing, multiple crops

• Photometric: color transformations

54
Data augmentation
• Introduce transformations not adequately sampled in the training data

• Geometric: flipping, rotation, shearing, multiple crops

• Photometric: color transformations

• Other: scaling, adding noise, compression artifacts, lens distortions, etc.

55
Data augmentation
• Introduce transformations not adequately sampled in the training data

• Geometric: flipping, rotation, shearing, multiple crops

• Photometric: color transformations

• Other: scaling, adding noise, compression artifacts, lens distortions, etc.

• Limited only by data assumptions + time/memory constraints!

• Avoid introducing obvious artifacts
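
A minimal sketch of randomized geometric augmentation on an image array of shape (H, W, C), assuming horizontal flips and random crops are realistic for the task at hand:

```python
# Random horizontal flip plus a random square crop, applied on the fly
# during training to increase the variety of data the model sees.
import numpy as np

def augment(image, crop_size, rng=None):
    rng = rng or np.random.default_rng()
    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Random crop of size crop_size x crop_size.
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]
```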

56
Diagnosing the model training

• Important statistics:

• Training / Validation / Testing Error Curves

• Training parameters:

1. Learning Rate

2. Model Regularization

3. Number of Iterations / Epochs

57
Diagnosing learning rates

[Figure: loss curves under different learning rates – a typical phenomenon. Image source: Stanford CS231n]

• Why does the learning curve look like this?

[Figure: a typical loss curve during training. Image source: Stanford CS231n]
Debugging learning curves

[Figure: six example training / validation error curves. Image source: Stanford CS231n]

• Not training → bug in the update calculation?
• Error increasing → bug in the update calculation?
• Error decreasing → not converged yet
• Slow start → suboptimal initialization?
• Gap between training and validation errors → possible overfitting; a large gap → definite overfitting


Early stopping

• Idea: do not train a network to achieve too low training error


• Monitor validation error to decide when to stop
What we have learned

• Learning and Supervision

• Types of learning
• Examples of each learning type

• Bias and Variance

• Basics of statistical learning theory

• Overfitting and Underfitting


• How to measure the degree of overfitting
• How to prevent overfitting

• Diagnosing the model training

63
Carry-on Questions

• What are the types of supervision?

• Unsupervised / Weakly Supervised / Semi-Supervised / Supervised Learning

• How to measure the degree of overfitting?

• The gap between the testing error and training error

• What are methods to prevent overfitting?

• Reduce model expressiveness: dropout, early stopping, weight sharing, etc.


• Increase data richness: add more training data, data augmentation, etc.

64
