
Machine Learning Model

Machine Learning 2021

UML book, chapter 2
Slides: P. Zanuttigh (derived from F. Vandin's slides)
Machine Learning

❑ Machine learning (ML) is a set of methods that give computer systems the ability to "learn" from (training) data in order to make predictions about novel data samples, without being explicitly programmed
❑ ML techniques are data-driven methods
❑ Training data can be provided with or without the corresponding correct predictions (labels)
o Unsupervised learning: no labels are provided for the training data
o Supervised learning: training data comes with labels
Unsupervised Learning

[Diagram: unlabeled training data → training procedure → ML model (training: estimate parameters); data to be analyzed → ML model → predicted label]
Supervised Learning

[Diagram: training data with labels → training procedure → ML model (training: estimate parameters); data to be analyzed → ML model → predicted label]

In most of the course we will focus on supervised learning
Training Data Representation

[Diagram: domain set (1) → training set (3) with labels (2) → ML algorithm → predicted label (2)]

The machine learning algorithm has access to:

1. Domain set (or instance space) 𝒳: set of all possible objects to make predictions about
• 𝑥 ∈ 𝒳 is a domain point or instance
• It is typically (but not always) represented by a vector of features
2. Label set 𝒴: set of possible labels
• Simplest case: binary classification, 𝒴 = {0,1}
3. Training set 𝑆 = ((𝑥₁, 𝑦₁), … , (𝑥ₘ, 𝑦ₘ)): finite sequence of labeled (→ supervised learning) domain points (in 𝒳 × 𝒴)
• It is the input of the ML algorithm!
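For concreteness, a minimal sketch (my own illustration, with made-up feature values) of how such a training set could be represented in Python, assuming instances are feature vectors and labels are binary:

```python
import numpy as np

# Hypothetical example: m = 4 instances from the domain set X (feature vectors),
# each paired with a binary label from Y = {0, 1}
X_train = np.array([[170.0, 65.0],   # x_1: two numeric features per instance
                    [182.0, 80.0],   # x_2
                    [165.0, 55.0],   # x_3
                    [175.0, 72.0]])  # x_4
y_train = np.array([0, 1, 0, 1])     # labels y_i = f(x_i); f is unknown to the learner

# The training set S is the finite sequence of pairs (x_i, y_i)
S = list(zip(X_train, y_train))
print(len(S))  # m = 4 training samples
```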
Make Predictions on Data

[Diagram: data sampled from D (5) → classifier h (4) → predicted label (2); compared with the real label given by f (5) to compute the error (6)]

4. Prediction rule ℎ: 𝒳 → 𝒴 (sometimes also called 𝑓̂)
• The learner's output, also called predictor, hypothesis or classifier
• 𝐴(𝑆): prediction rule produced by ML algorithm A when training set S is given to it
5. Data-generation model: instances are
• Generated by a probability distribution 𝒟 over 𝒳 (NOT KNOWN BY THE ML ALGORITHM)
• Labeled according to a function 𝑓 (NOT KNOWN BY THE ML ALGORITHM)
• Training set: ∀ 𝑥ᵢ ∈ 𝑆, sample 𝑥ᵢ according to 𝒟, then label it as 𝑦ᵢ = 𝑓(𝑥ᵢ)
6. Measure of success = error of the classifier = probability that it does not predict the correct label on a random data point generated by 𝒟
Data Generating Distribution

Example: 𝑥 ∈ 𝒳 = ℝ⁺, event 𝐴: 170 < 𝑥 < 180

𝐷(𝐴) = 𝐷({𝑥: 170 < 𝑥 < 180}) = 0.3

𝜋(𝑥) = 1 if 170 < 𝑥 < 180, 0 otherwise

[Figure: density of D over X (axis marked 160, 170, 180, 200), with the event A = (170, 180) highlighted; D(A) is the corresponding probability mass]

❑ Samples 𝑥 ∈ 𝑋 are produced by a probability distribution D: 𝑥 ~ D
❑ Consider a domain subset 𝐴 ⊂ 𝑋:
o A: event, expressed by 𝜋: 𝑋 → {0,1}, i.e., 𝐴 = {𝑥 ∈ 𝑋: 𝜋(𝑥) = 1}
o 𝐷(𝐴): probability of observing a point 𝑥 ∈ 𝐴 (it is a number in the 0-1 range)
o We get that 𝑃_{𝑥~𝐷}[𝜋(𝑥) = 1] = 𝐷(𝐴)
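A quick sketch of this idea (my own illustration: the slide does not specify D, so a Gaussian is assumed here purely for concreteness): D(A) = P_{x~D}[π(x) = 1] can be estimated by sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical choice of D, just for illustration: x ~ N(175, 10^2)
samples = rng.normal(loc=175.0, scale=10.0, size=1_000_000)

# Indicator function pi(x) for the event A = {x : 170 < x < 180}
pi = (samples > 170) & (samples < 180)

# Empirical estimate of D(A) = P_{x~D}[pi(x) = 1]
print(pi.mean())  # ≈ 0.38 for this assumed D (the slide's 0.3 corresponds to some other D)
```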
Measure of Success: Loss Function

Recall:
❑ Assume a domain subset 𝐴 ⊂ 𝑋
❑ A: event, expressed by 𝜋: 𝑋 → {0,1}, i.e., 𝐴 = {𝑥 ∈ 𝑋: 𝜋(𝑥) = 1}
❑ 𝐷(𝐴): probability of observing a point 𝑥 ∈ 𝐴
❑ We get that 𝑃_{𝑥~𝐷}[𝜋(𝑥) = 1] = 𝐷(𝐴)

Error of a prediction rule ℎ: 𝑋 → 𝑌 in classification problems:

𝐿_{𝐷,𝑓}(ℎ) ≝ 𝑃_{𝑥~𝐷}[ℎ(𝑥) ≠ 𝑓(𝑥)] = 𝐷({𝑥: ℎ(𝑥) ≠ 𝑓(𝑥)})

Notes:
❑ 𝐿_{𝐷,𝑓}(ℎ): the loss depends on the distribution D and on the labelling function f
❑ 𝐿_{𝐷,𝑓}(ℎ) has many different names: generalization error, true error, true risk, loss
❑ Often f is omitted: 𝐿_𝐷(ℎ)
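To make the definition concrete, a small sketch (my own, with an assumed D and labelling function f): the true error L_{D,f}(h) can be approximated by sampling from D and comparing h(x) with f(x).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed for illustration: D uniform on [0, 1], true labels f(x) = 1 iff x > 0.5
f = lambda x: (x > 0.5).astype(int)

# A slightly wrong hypothesis: it thresholds at 0.6 instead of 0.5
h = lambda x: (x > 0.6).astype(int)

# Monte Carlo approximation of L_{D,f}(h) = P_{x~D}[h(x) != f(x)]
x = rng.uniform(0.0, 1.0, size=1_000_000)
print(np.mean(h(x) != f(x)))  # ≈ 0.1 = D({x : 0.5 < x <= 0.6})
```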
Empirical Risk Minimization
❑ The learner outputs ℎ_𝑆: 𝒳 → 𝒴 (note the dependency on S!)
❑ Goal: find ℎ_𝑆 which minimizes the generalization error 𝐿_{𝐷,𝑓}(ℎ)
o But 𝐿_{𝐷,𝑓}(ℎ) is unknown!

❑ What about considering the error on the training data?

❑ Training error: 𝐿_𝑆(ℎ) ≜ |{𝑖: ℎ(𝑥ᵢ) ≠ 𝑦ᵢ, 1 ≤ 𝑖 ≤ 𝑚}| / 𝑚 = (# wrong predictions) / (# training samples)
o Assuming a classification problem and the 0-1 loss; otherwise the definition differs
o Also called empirical error or empirical risk

❑ Empirical Risk Minimization (ERM): output a predictor h minimizing 𝐿_𝑆(ℎ)
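A minimal sketch of the training error under the 0-1 loss (the helper name and the toy data below are my own, for illustration):

```python
import numpy as np

def training_error(h, X, y):
    """Empirical risk L_S(h) with 0-1 loss: fraction of wrong predictions."""
    return np.mean(h(X) != y)

# Toy training set, chosen only for illustration
X_train = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y_train = np.array([0, 0, 0, 1, 1, 1])

h = lambda x: (x > 0.5).astype(int)  # a candidate predictor
print(training_error(h, X_train, y_train))  # 0.0 on this training set
```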
Is the training error a good measure of the true error?

Assume the following D:
• Instance x is taken uniformly at random in the square
• f: the label is 0 if x is in the upper side, 1 if in the lower side (red vs blue)
• The area of the two sides is the same
Is the training error a good measure of the true error?
• Training set: the samples in the figure

[Figure: the square with the training samples; the label-0 samples happen to fall in the left side and the label-1 samples in the right side]

Consider this predictor:

ℎ_𝑆(𝑥) = 0 if 𝑥 is in the left side, 1 if 𝑥 is in the right side

• 𝐿_𝑆(ℎ_𝑆) = 0
• It minimizes the training loss (i.e., the empirical risk)!
• Is it a good predictor?
Is the training error a good measure?

• 𝐿_{𝐷,𝑓}(ℎ_𝑆) = 1/2
• Same loss as a random guess
• Poor performance: overfitting on the training data!
• In this case: very good performance on the training set and poor performance in general
• When does ERM lead to good performance w.r.t. the generalization error?
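A simulation sketch of this example (the exact sample positions are my assumption; the slides only show them in a figure): the left/right predictor fits the training samples perfectly yet is no better than chance on fresh data from D.

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda X: (X[:, 1] < 0.5).astype(int)   # true labels: 0 in the upper half, 1 in the lower half
h = lambda X: (X[:, 0] > 0.5).astype(int)   # predictor h_S: 0 in the left half, 1 in the right half

# Training samples assumed to lie in the upper-left and lower-right quadrants,
# so the left/right split happens to separate them perfectly
X_train = np.array([[0.2, 0.8], [0.3, 0.6], [0.1, 0.9],   # upper-left, label 0
                    [0.7, 0.2], [0.9, 0.4], [0.8, 0.1]])  # lower-right, label 1
y_train = f(X_train)

print(np.mean(h(X_train) != y_train))  # L_S(h_S) = 0.0: perfect on the training set

# True error L_{D,f}(h_S): evaluate on fresh samples drawn uniformly from the square
X_test = rng.uniform(size=(1_000_000, 2))
print(np.mean(h(X_test) != f(X_test)))  # ≈ 0.5: no better than random guessing
```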
Hypothesis Class
❑ Apply ERM over a restricted set of possible hypotheses
▪ ℋ = hypothesis class
▪ Each ℎ ∈ ℋ is a function ℎ: 𝒳 → 𝒴
▪ Restricting to a set of hypotheses → making assumptions (priors) on the problem at hand

❑ 𝐸𝑅𝑀_ℋ learner (a sketch follows below):

𝐸𝑅𝑀_ℋ(𝑆) ∈ argmin_{ℎ∈ℋ} 𝐿_𝑆(ℎ)

o The ∈ is used because there can be multiple optimal solutions

❑ Which hypothesis classes ℋ do not lead to overfitting?
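A minimal sketch of the ERM_ℋ rule over a finite class (the class of one-dimensional threshold classifiers and the toy data are my own choices, for illustration):

```python
import numpy as np

def erm(H, X, y):
    """Return a hypothesis in H minimizing the empirical risk L_S(h)."""
    return min(H, key=lambda h: np.mean(h(X) != y))

# Finite hypothesis class: 11 threshold classifiers h_t(x) = 1[x > t]
def make_threshold(t):
    return lambda x: (x > t).astype(int)

H = [make_threshold(t) for t in np.linspace(0.0, 1.0, 11)]

# Toy training set (labels follow a threshold near 0.5, unknown to the learner)
X_train = np.array([0.05, 0.2, 0.45, 0.55, 0.7, 0.95])
y_train = np.array([0, 0, 0, 1, 1, 1])

h_S = erm(H, X_train, y_train)
print(np.mean(h_S(X_train) != y_train))  # empirical risk of the ERM output: 0.0
```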


Assumptions
1. Assume ℋ is a finite hypothesis class, i.e., |ℋ| < ∞
2. Let ℎ_𝑆 be the output of 𝐸𝑅𝑀_ℋ(𝑆), i.e., ℎ_𝑆 ∈ argmin_{ℎ∈ℋ} 𝐿_𝑆(ℎ)

❑ Two further assumptions:

3. Realizability: there exists ℎ* ∈ ℋ such that 𝐿_{𝐷,𝑓}(ℎ*) = 0
4. i.i.d.: the examples in the training set are independently and identically distributed (i.i.d.) according to D, that is, 𝑆 ~ 𝐷^𝑚

• Note: these assumptions are very difficult to satisfy in practice

❑ The realizability assumption implies that 𝐿_𝑆(ℎ*) = 0 (with probability 1 over the choice of S)
❑ Can we learn ℎ*?
PAC Learning
Probably Approximately Correct (PAC) learning

Since the training data comes from D:
❑ we can only be approximately correct
❑ we can only be probably correct

Parameters:
❑ accuracy parameter 𝜖: we are satisfied with a good ℎ_𝑆 for which 𝐿_{𝐷,𝑓}(ℎ_𝑆) ≤ 𝜖
❑ confidence parameter 𝛿: we want ℎ_𝑆 to be a good hypothesis with probability ≥ 1 − 𝛿
Theorem
Let ℋ be a finite hypothesis class. Let 𝛿 ∈ (0,1), 𝜖 ∈ (0,1) and 𝑚 ∈ ℕ be such that:

𝑚 ≥ log(|ℋ|/𝛿) / 𝜖

Notice: m grows (logarithmically) with |ℋ| and with 1/𝛿, and is inversely proportional to 𝜖

Then, for any f and any D for which the realizability assumption holds, with probability ≥ 1 − 𝛿 (probably) we have that for every ERM hypothesis ℎ_𝑆, computed on a training set S of size m sampled i.i.d. from D, it holds that

𝐿_{𝐷,𝑓}(ℎ_𝑆) ≤ 𝜖 (approximately correct)

m: size of the training set (i.e., S contains m i.i.d. samples)
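A one-line sketch of the bound in code (the values of |ℋ|, 𝛿 and 𝜖 below are assumed, just to show the orders of magnitude involved):

```python
import math

def sample_complexity(H_size, delta, epsilon):
    """Sufficient training-set size: m >= log(|H| / delta) / epsilon (natural log)."""
    return math.ceil(math.log(H_size / delta) / epsilon)

# Illustration: |H| = 1000 hypotheses, confidence 95% (delta = 0.05), accuracy eps = 0.01
print(sample_complexity(1000, 0.05, 0.01))  # 991 samples suffice under the theorem
```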


Idea of the Proof
❑ The critical issue is the training sets leading to a "misleading" predictor h with 𝐿_𝑆(ℎ) = 0 but 𝐿_{𝐷,𝑓}(ℎ) > 𝜖
❑ Place an upper bound on the probability of sampling m instances leading to such a misleading training set, i.e., producing a "misleading" predictor
❑ Using the union bound, after various mathematical computations, the bound of the theorem can be obtained
❑ Message of the theorem: if ℋ is a finite class then ERM will not overfit, provided it is computed on a sufficiently big training set
❑ The proof is not part of the course, but you can find it in the book if you are interested
Theorem: Graphical Illustration

[Figure: graphical illustration of the theorem; the highlighted region marks the misleading samples]
Proof: some notes (1)

𝐷({𝑥ᵢ: ℎ(𝑥ᵢ) = 𝑦ᵢ}) = 1 − 𝐿_{𝐷,𝑓}(ℎ) ≤ 1 − 𝜖
            (step 1)           (step 2)

❑ In this step we are considering a single sample 𝑥ᵢ
1. First step: 𝐷({𝑥ᵢ: ℎ(𝑥ᵢ) = 𝑦ᵢ}) is the probability of a correct prediction (i.e., 1 − the probability of error)
2. Second step: ℎ ∈ ℋ_𝐵 (the set of bad hypotheses) → the probability of error for h is bigger than 𝜖, i.e., 𝐿_{𝐷,𝑓}(ℎ) > 𝜖

The proof is not part of the course. These are just some notes on the critical steps; refer to the book and the lecture notes for the complete proof.
Proof: some notes (2)

𝐷^𝑚({𝑆|_𝑥 : 𝐿_{𝐷,𝑓}(ℎ_𝑆) > 𝜖}) ≤ Σ_{ℎ∈ℋ_𝐵} 𝐷^𝑚({𝑆|_𝑥 : 𝐿_𝑆(ℎ) = 0})

𝐷^𝑚({𝑆|_𝑥 : 𝐿_𝑆(ℎ) = 0}) ≤ 𝑒^(−𝜖𝑚)

❑ First equation: from the union bound
❑ Second equation: consequence of the previous slide's result: each of the m i.i.d. samples is predicted correctly with probability ≤ 1 − 𝜖, hence 𝐷^𝑚({𝑆|_𝑥 : 𝐿_𝑆(ℎ) = 0}) ≤ (1 − 𝜖)^𝑚 ≤ 𝑒^(−𝜖𝑚), using 1 − 𝜖 ≤ 𝑒^(−𝜖)
❑ By combining the two equations (substituting the second bound into the sum):

𝐷^𝑚({𝑆|_𝑥 : 𝐿_{𝐷,𝑓}(ℎ_𝑆) > 𝜖}) ≤ Σ_{ℎ∈ℋ_𝐵} 𝑒^(−𝜖𝑚) = |ℋ_𝐵| 𝑒^(−𝜖𝑚) ≤ |ℋ| 𝑒^(−𝜖𝑚)

❑ ℋ_𝐵 is a subset of ℋ → last inequality
❑ Notice the difference between the true error 𝐿_{𝐷,𝑓}(ℎ_𝑆) and the empirical error 𝐿_𝑆(ℎ)

The proof is not part of the course.
Proof: some notes (3)

❑ Thesis of the theorem: the probability of having a small error is ≥ 1 − 𝛿
o This corresponds to: the probability of having a large error is ≤ 𝛿
o i.e., we need to show that: 𝐷^𝑚({𝑆|_𝑥 : 𝐿_{𝐷,𝑓}(ℎ_𝑆) > 𝜖}) ≤ 𝛿

❑ We have obtained:
𝐷^𝑚({𝑆|_𝑥 : 𝐿_{𝐷,𝑓}(ℎ_𝑆) > 𝜖}) ≤ |ℋ| 𝑒^(−𝜖𝑚)

❑ Finally: to satisfy the theorem we need to find m for which the right-hand side |ℋ|𝑒^(−𝜖𝑚) is smaller than 𝛿:
o Set 𝑚 ≥ log(|ℋ|/𝛿)/𝜖

The proof is not part of the course.
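For completeness, the algebra behind the last bullet, written out (a standard manipulation consistent with the slides, not extra material):

```latex
\begin{align*}
|\mathcal{H}|\, e^{-\epsilon m} \le \delta
  &\iff e^{-\epsilon m} \le \frac{\delta}{|\mathcal{H}|}
   \iff -\epsilon m \le \log\frac{\delta}{|\mathcal{H}|} \\
  &\iff m \ge \frac{1}{\epsilon}\,\log\frac{|\mathcal{H}|}{\delta}
\end{align*}
```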
