
Artificial Intelligence & Data Mining

WEN Bihan (Asst Prof)


Homepage: www.bihanwen.com

1
Regularization and Optimization

WEN Bihan (Asst Prof)


Homepage: www.bihanwen.com

2
Outline

• Learning and Supervision

• Bias and Variance

• Overfitting and Underfitting

• Diagnosing the model training

3
Carry-on Questions

• What are the types of supervision?

• How to measure the degree of overfitting?

• What are methods to prevent overfitting?

4
Supervised vs Unsupervised Learning

From what we have learned so far:

• Classification is supervised:

• A training dataset with pre-defined class labels is provided.

• Clustering is unsupervised:

• No training dataset with pre-existing labels is given.

5
Example – Image Classification
Input images → desired output labels:

apple

pear

tomato

cow

dog

horse

6
The Basic Supervised Learning Framework
[Pipeline diagram]
Training time: Training Samples → Features → Training (with Training Labels) → Learned model
Testing time: Test Sample → Features → Learned model → Prediction
7
The Basic Supervised Learning Framework

𝑦 = 𝑓𝜃(𝒙)

(𝑦: output, 𝑓𝜃: prediction function, 𝒙: input)

• Learning / Training: given a training set of labeled examples {(𝒙1, 𝑦1), …, (𝒙𝑁, 𝑦𝑁)}, estimate the parameters 𝜃 of the prediction function 𝑓𝜃.

• Inference / Testing: apply the learned 𝑓𝜃 to an unseen test example 𝒙 and output the predicted value 𝑦 = 𝑓𝜃(𝒙).
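
A minimal sketch of this train-then-predict pattern, assuming a linear prediction function fitted by least squares (the function names below are illustrative, not from the slides):

```python
# Train-then-predict sketch: assume f_theta(x) = theta^T [x, 1] and a
# squared-error fit; real models differ, but the interface is the same.
import numpy as np

def train(X, y):
    """Estimate theta from labeled pairs {(x_i, y_i)} via least squares."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias feature
    theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return theta

def predict(theta, x):
    """Inference: apply the learned f_theta to an unseen example x."""
    return np.append(x, 1.0) @ theta

# Usage: fit on a toy training set, then predict for a new input.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])            # roughly y = 2x + 1
theta = train(X_train, y_train)
print(predict(theta, np.array([4.0])))               # about 9.0
```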

8
Examples of Supervised Methods

• Nearest Neighbor

• Linear classification / regression

• Structured Prediction

9
Supervised Learning - Nearest Neighbor

[Figure: training examples from class 1, training examples from class 2, and a test example]

f(x) = label of the training example nearest to x

• All we need is a distance function for our inputs

10
K-Nearest Neighbor
• For a new point, find the k closest points in the training data
• Predict the class label by a majority vote over the labels of those k points (see the sketch below)

[Figure: example with k = 5]
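
A minimal k-NN sketch in the spirit of the figure, assuming Euclidean distance and majority voting (function and variable names are illustrative):

```python
# k-nearest-neighbor classification: Euclidean distance plus a majority
# vote among the k closest training points (k = 5 as in the figure).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Distance from the query point x to every training example.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest training points.
    nearest = np.argsort(dists)[:k]
    # Vote for the class label using the labels of those k points.
    return Counter(y_train[nearest]).most_common(1)[0][0]
```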

11
Supervised Learning – Linear Classifier

• Find a linear function to separate the classes:

f(x) = sgn(w1x1 + w2x2 + … + wDxD + b) = sgn(w ⋅ x + b)
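
A one-line sketch of this decision rule, assuming NumPy arrays for w and x:

```python
# Linear classifier: predict the sign of the affine score w . x + b.
import numpy as np

def linear_classify(w, b, x):
    # Returns +1 or -1 depending on which side of the hyperplane x lies
    # (np.sign returns 0 only for points exactly on the boundary).
    return np.sign(w @ x + b)
```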

12
Supervised Learning – Linear Classifier

13
Supervised Learning – Structured Prediction

Example: a sentence (input) is mapped to its parse tree (output).

14
Examples of Unsupervised Learning

• Clustering

• Principal Components Analysis (PCA)

15
Unsupervised Learning – Clustering

• Discover groups of “similar” data points.

• Only unlabeled data as input, without pre-defined classes.

16
Unsupervised Learning – Dimensionality
Reduction
• Discover a lower-dimensional subspace on which the data lives.

• Example: Principal Components Analysis (PCA)

17
Beyond: Reinforcement Learning

• Learn from rewards in a sequential environment.

https://deepmind.com/research/alphago/

18
Types of Supervision

• Unsupervised (no labels)

• Weakly supervised (noisy labels, or labels not exactly for the task of interest)

• Semi-supervised (labels for a small portion of the training data)

• Supervised (clean, complete training labels for the task of interest)

19
Learning Effectiveness

• Potential Problems

1. Do you have sufficient data for supervision? - Overfitting

20
Learning Effectiveness

• Potential Problems

1. Do you have sufficient data for supervision? - Overfitting

2. Is your model complex / rich enough for the problem? - Underfitting

21
Learning Effectiveness

• Potential Problems

1. Do you have sufficient data for supervision? - Overfitting

2. Is your model complex / rich enough for the problem? - Underfitting

• We wish to understand what happened

• Solution: Statistical Learning Theory

22
Basics on statistical learning theory

• Why do we need to study statistical learning?

• We cannot know exactly how well an algorithm will work in practice (the true "risk", a measure of effectiveness),

• because we do not know the true distribution of the data that the algorithm will work on.

• But we can instead measure its performance on a known set of data (the "empirical" risk).

• Empirical Risk Minimization is the core idea of statistical learning.

23
Basics on statistical learning theory

• Expected (true) Risk:

• ℎ(𝑥) is the function predicting 𝑦.

• 𝑙(ℎ(𝑥), 𝑦) measures the distance between 𝑦 and the predicted ℎ(𝑥).

• (𝑥, 𝑦) follows some underlying distribution 𝑝(𝑥, 𝑦): some (𝑥, 𝑦) appear more often in practice, and thus need higher weight.

• The expected (true) risk measures how well ℎ(𝑥) approximates 𝑦.

• In practice, we do not have full access to the distribution 𝑝(𝑥, 𝑦),

• so we cannot evaluate the expected risk explicitly.
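
The formula on this slide is not reproduced above; a standard formulation consistent with the description is:

```latex
R(h) \;=\; \mathbb{E}_{(x,y)\sim p(x,y)}\big[\, l(h(x), y) \,\big]
      \;=\; \int l(h(x), y)\, p(x, y)\, dx\, dy
```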

24
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Though we do not have full access to the distribution 𝑝(𝑥, 𝑦), we can collect a labeled dataset: a limited number of samples (𝑥^(𝑖), 𝑦^(𝑖)) drawn from 𝑝(𝑥, 𝑦).

• Instead of integrating, we take the average distance between 𝑦^(𝑖) and the predicted ℎ(𝑥^(𝑖)): all samples have equal weight.

• In practice, we can calculate the empirical risk given a labeled dataset.

• Empirical Risk approximates the Expected Risk.
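
Again, the slide's formula is not reproduced above; a standard empirical-risk formulation consistent with the description (with N labeled samples) is:

```latex
\hat{R}(h) \;=\; \frac{1}{N} \sum_{i=1}^{N} l\big(h(x^{(i)}),\, y^{(i)}\big),
\qquad (x^{(i)}, y^{(i)}) \sim p(x, y)
```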

25
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Limitations of learning the function ℎ(𝑥) in practice:

1. We need assumptions on the ℎ(𝑥) to be learned, ℎ ∈ ℋ (the specific model you use, e.g., a linear regressor or a neural network).

2. We can only minimize the empirical risk instead of the expected risk.

26
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Limitations of learning the function ℎ(𝑥):

[Figure: the best possible ℎ(⋅), the ℎ reachable with limitation 1, and the ℎ reached with limitations 1 + 2]
27
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Total Learning Error = (error by limitation 1) + (error by limitation 2)

28
Basics on statistical learning theory

[Figure: ℋ, the set of all possible functions you can learn using a specific model]

• ℎ ∈ ℋ is the learnable function space based on our assumptions.

• 𝐼 is the size / complexity of the training dataset.


29
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Model Complexity (informally):


• How many parameters in 𝑓𝜃(⋅) do we have to learn?

• Neural Networks: #hidden neurons

30
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Empirical Error:
• In a given dataset, the percentage of items that are misclassified by 𝑓𝜃(⋅).

• Here we refer to the testing dataset.

31
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Expected Error:

• For an item randomly drawn from the underlying distribution, the probability that it is misclassified by 𝑓𝜃(⋅).

32
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Variance:
• Type of error that occurs due to a model's sensitivity to small fluctuations in the
training set.

• Variance increases with model complexity.


33
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Bias:
• Type of error that occurs due to wrong / inaccurate assumptions made in the
learning algorithm.

• Bias is high when the model is (too) simple


34
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Expected error of a classifier ≈ bias² + variance (+ noise)
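
For squared loss, a standard way to make this precise (not spelled out on the slide) is the following, with the expectation taken over training sets and 𝜎² the irreducible noise:

```latex
\mathbb{E}\left[\big(y - f_\theta(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[f_\theta(x)] - f^{*}(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\big(f_\theta(x) - \mathbb{E}[f_\theta(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}},
\qquad f^{*}(x) = \mathbb{E}[\,y \mid x\,].
```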

35
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Simple model: high bias and low variance
• Complex model: low bias and high variance
36
Bias and Variance

• The trade-off between bias and variance (of a model)

• Bullseye (center) = target model; Darts (crosses) = learned models

37
Basics on statistical learning theory

[Figure: ℋ, the set of all possible functions you can learn using a specific model]

• ℎ ∈ ℋ is the learnable function space based on our assumptions.

• 𝐼 is the size / complexity of the training dataset.


38
Overfitting vs Underfitting

• What is a good model?

[Figure: fits by a simple model, a complex model, and a good model]
39
Overfitting vs Underfitting

• Simple Model

• High Bias

• Causes an algorithm to miss relevant relations between the input features and the target outputs.

• Complex Model

• High Variance

• Causes an algorithm to model the noise in the training set.

40
Overfitting vs Underfitting

• Simple Model

• High Bias - Underfitting

• Complex Model

• High Variance - Overfitting

41
Overfitting vs Underfitting

• Training a classifier 𝑓𝜃 (𝑥)

• Simple model: high bias and low variance
• Complex model: low bias and high variance
42
Overfitting vs Underfitting

• Training a classifier 𝑓𝜃 (𝑥)

• Underfitting: high bias and low variance
• Overfitting: low bias and high variance
43
Overfitting vs Underfitting

• Measure overfitting by training and testing / validation errors

44
Overfitting vs Underfitting

• Overfitting ≈ Testing / Validation Error – Training Error

• The gap measures the degree of overfitting.

45
Overfitting vs Underfitting

• Overfitting ≈ Testing / Validation Error – Training Error

• Overfitting: large gap between training and test errors

• Underfitting: small gap between training and test errors

46
Overfitting vs Underfitting

• Bias-Variance Tradeoff:

• The fundamental dilemma of simultaneously minimizing two sources of error that prevent ML algorithms from generalizing beyond their training set.

• The bias is error from erroneous assumptions in the learning algorithm. High
bias can cause an algorithm to miss the relevant relations between features
and target outputs (e.g., model is too simple -> underfitting).

• The variance is error from sensitivity to small fluctuations in the training set.
High variance can cause an algorithm to model the random noise in the
training data, rather than the intended outputs (e.g., model is too
complicated -> overfitting).

47
Overfitting vs Underfitting

• Monitoring the bias-variance trade-off:

• Separate a validation dataset.

• Learn parameters on the training data.


• Measure accuracy on the held-out validation dataset (with the known labels).

• Peek at the validation set to prevent overfitting and underfitting (see the sketch below).
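
A minimal sketch of this workflow, assuming hypothetical train_model() and error_rate() helpers standing in for whatever model is being fit:

```python
# Hold out a validation set, fit on the rest, and compare errors.
import numpy as np

def split_train_val(X, y, val_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Usage (train_model and error_rate are placeholders, not a specific API):
# X_tr, y_tr, X_val, y_val = split_train_val(X, y)
# model = train_model(X_tr, y_tr)                  # learn parameters on training data
# gap = error_rate(model, X_val, y_val) - error_rate(model, X_tr, y_tr)
# A large gap suggests overfitting; high errors with a small gap suggest underfitting.
```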

48
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

• Dropout: during training, some of a layer's outputs are randomly ignored or "dropped out" (see the sketch below).
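
A minimal sketch of the idea, assuming the common "inverted dropout" variant in which the surviving activations are rescaled during training:

```python
# Inverted dropout on one layer's outputs: randomly zero each output with
# probability p during training and rescale the survivors by 1/(1 - p) so
# the expected activation is unchanged; at test time the layer is untouched.
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    rng = rng or np.random.default_rng()
    if not training or p == 0.0:
        return activations
    mask = (rng.random(activations.shape) >= p) / (1.0 - p)
    return activations * mask
```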

49
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

• Early Stopping: sample the model every few iterations of training, check how well it works on the validation set, and stop when the validation error reaches its minimum (see the sketch below).
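
A minimal sketch of the stopping rule itself, applied to a toy validation-error curve (the `patience` budget is an added assumption, not from the slides):

```python
# Early stopping: keep the epoch with the lowest validation error, and stop
# once the error has not improved for `patience` consecutive checks.
def early_stop_epoch(val_errors, patience=3):
    best_err, best_epoch, stale = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, stale = err, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_err

# Toy validation curve that bottoms out and then rises again:
print(early_stop_epoch([0.9, 0.6, 0.4, 0.35, 0.36, 0.38, 0.40, 0.45]))
# -> (3, 0.35): training stops after epoch 6 and keeps the epoch-3 model.
```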

50
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

• Dropout: During training, some number of layer outputs are randomly ignored or
“dropped out”.

• Early Stopping: Sample the model every few iterations of training, check how well it
works with the validation set, and stop when the validation error reaches the minimum.

• Weight Sharing: Instead of training each neuron independently, we can force their parameters to be the same. Example: Recurrent Neural Networks (RNNs).

51
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

2. Increase the training data complexity / size, to reduce the variance.

• Add more training data

• Data Augmentation: modify the data available in a realistic but randomized way, to
increase the variety of data seen during training

52
Data augmentation
• Introduce transformations not adequately sampled in the training data

• Geometric: flipping, rotation, shearing, multiple crops

[Figure: flipping & rotation; cropping]

53
Data augmentation
• Introduce transformations not adequately sampled in the training data

• Geometric: flipping, rotation, shearing, multiple crops

• Photometric: color transformations

54
Data augmentation
• Introduce transformations not adequately sampled in the training data

• Geometric: flipping, rotation, shearing, multiple crops

• Photometric: color transformations

• Other: scaling, adding noise, compression artifacts, lens distortions, etc.

55
Data augmentation
• Introduce transformations not adequately sampled in the training data

• Geometric: flipping, rotation, shearing, multiple crops

• Photometric: color transformations

• Other: scaling, adding noise, compression artifacts, lens distortions, etc.

• Limited only by data assumptions + time/memory constraints!

• Avoid introducing obvious artifacts
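
A minimal sketch of randomized geometric augmentation on an image array of shape (H, W, C), assuming horizontal flips and random crops are realistic for the task at hand:

```python
# Random horizontal flip plus a random square crop, applied on the fly
# during training to increase the variety of data the model sees.
import numpy as np

def augment(image, crop_size, rng=None):
    rng = rng or np.random.default_rng()
    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Random crop of size crop_size x crop_size.
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]
```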

56
Diagnosing the model training

• Important statistics:

• Training / Validation / Testing Error Curves

• Training parameters:

1. Learning Rate

2. Model Regularization

3. Number of Iterations / Epochs

57
Diagnosing learning rates

[Figure: loss curves under different learning rates – a typical phenomenon. Image source: Stanford CS231n]

• Why does the learning curve look like this?

[Figure: a typical loss curve during training. Image source: Stanford CS231n]
Debugging learning curves

[Figure: six example training / validation error curves. Image source: Stanford CS231n]

• Not training → bug in the update calculation?
• Error increasing → bug in the update calculation?
• Error decreasing → not converged yet
• Slow start → suboptimal initialization?
• Gap between training and validation errors → possible overfitting; a large gap → definite overfitting


Early stopping

• Idea: do not train a network to achieve too low training error


• Monitor validation error to decide when to stop
What we have learned

• Learning and Supervision

• Types of learning
• Examples of each learning type

• Bias and Variance

• Basics of statistical learning theory

• Overfitting and Underfitting


• How to measure the degree of overfitting
• How to prevent overfitting

• Diagnosing the model training

63
Carry-on Questions

• What are the types of supervision?

• Unsupervised / Weakly Supervised / Semi-Supervised / Supervised Learning

• How to measure the degree of overfitting?

• The gap between the testing error and training error

• What are methods to prevent overfitting?

• Reduce model expressiveness: dropout, early stopping, weight sharing, etc.


• Increase data richness: add more training data, data augmentation, etc.

64
