[Diagram: training data (unlabeled) → training procedure → ML model (training: estimate parameters); data to be analyzed → ML model → predicted label]
Supervised Learning
[Diagram: training data with labels → training procedure → ML model (training: estimate parameters); data to be analyzed → ML model → predicted label]
In most of the course we will focus on supervised learning.
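The pipeline in the diagram above can be sketched in a few lines of Python. The data and the learning rule (a 1-nearest-neighbour lookup standing in for the generic "training procedure") are illustrative assumptions, not from the slides:

```python
# Minimal sketch of the supervised-learning pipeline:
# labeled training data -> training procedure -> model -> predicted label.
# The 1-nearest-neighbour rule and the toy data are assumptions for illustration.

def train(labeled_data):
    """'Training': this toy model simply stores the labeled examples."""
    return list(labeled_data)

def predict(model, x):
    """Predict the label of the stored example closest to x."""
    nearest = min(model, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Training data with labels: (feature, label) pairs.
S = [(1.0, "A"), (2.0, "A"), (8.0, "B"), (9.0, "B")]
model = train(S)

print(predict(model, 1.5))  # "A"
print(predict(model, 8.5))  # "B"
```

Any supervised learner follows this shape: `train` estimates parameters from labeled examples, and `predict` maps new data to a predicted label.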
Training Data Representation
[Diagram: each training example pairs an element of the domain set (1) with a label (2); the model outputs a predicted label]
Notes:
❑ L_{D,f}(h): the loss depends on the distribution D and on the labelling function f
❑ L_{D,f}(h) has many different names: generalization error, true error, true risk, loss
❑ Often f is omitted: L_D(h)
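In a real learning problem L_{D,f}(h) is unknown, but in a synthetic experiment where D and f are chosen by us it can be estimated by Monte Carlo sampling. The distribution, labelling function, and hypothesis below are illustrative assumptions:

```python
import random

# Sketch: estimate the true error L_{D,f}(h) = P_{x~D}[h(x) != f(x)]
# by sampling from a known synthetic D. The choices of D (uniform on
# [0,1]), f, and h are assumptions made for this illustration.

random.seed(0)

def f(x):  # true labelling function (assumed)
    return 1 if x >= 0.5 else 0

def h(x):  # some hypothesis, deliberately shifted from f
    return 1 if x >= 0.6 else 0

def true_error(h, f, n=100_000):
    """Monte Carlo estimate of L_{D,f}(h) with x ~ Uniform[0, 1]."""
    draws = (random.random() for _ in range(n))
    return sum(h(x) != f(x) for x in draws) / n

print(round(true_error(h, f), 2))  # ~0.10: h and f disagree on [0.5, 0.6)
```

Here the true error is exactly the probability mass of the region where h and f disagree, which the sample average approximates.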
Empirical Risk Minimization
❑ Learner outputs h_S : 𝒳 → 𝒴 (note the dependency on S!)
❑ Goal: find h_S which minimizes the generalization error L_{D,f}(h_S)
  o But L_{D,f}(h_S) is unknown!
Assume the following D: [figure lost]

h_S(x) = 0 if x is in the left side, 1 if x is in the right side

• L_S(h_S) = 0: h_S minimizes the training loss (i.e., the empirical risk)!
• Is it a good predictor? L_{D,f}(h_S) = 1/2
• Is the training error a good measure?
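The figure for this slide is lost, but the same failure can be reproduced with an assumed distribution in the same spirit: x uniform on [0, 1] and the label an independent fair coin flip, so no rule based on x can beat error 1/2, yet a memorizing hypothesis still achieves zero empirical risk:

```python
import random

# Sketch under assumed D: x ~ Uniform[0,1], label y an independent
# fair coin flip (so the best possible true error is 1/2).
# A hypothesis that memorizes S has L_S(h_S) = 0 but L_{D,f}(h_S) ~ 1/2.

random.seed(1)

S = [(random.random(), random.randint(0, 1)) for _ in range(100)]
memory = dict(S)

def h_S(x):
    """Memorize the training set; answer 0 on unseen points."""
    return memory.get(x, 0)

# Empirical risk: zero, because every training point is memorized.
L_S = sum(h_S(x) != y for x, y in S) / len(S)

# True risk, estimated on fresh samples: about 1/2.
fresh = [(random.random(), random.randint(0, 1)) for _ in range(100_000)]
L_D = sum(h_S(x) != y for x, y in fresh) / len(fresh)

print(L_S)            # 0.0
print(round(L_D, 1))  # 0.5
```

This is the slide's point in executable form: minimizing the empirical risk alone says nothing about the true risk.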
Parameters:
❑ accuracy parameter ε: we are satisfied with a good h_S for which L_{D,f}(h_S) ≤ ε
❑ confidence parameter δ: we want h_S to be a good hypothesis with probability ≥ 1 − δ
Theorem
Let ℋ be a finite hypothesis class. Let δ ∈ (0,1), ε ∈ (0,1) and m ∈ ℕ such that:

m ≥ log(|ℋ|/δ) / ε

(Notice: m grows with |ℋ| and is inversely proportional to δ and ε.)

Then, for any f and any D for which the realizability assumption holds, with probability ≥ 1 − δ we have that for every ERM hypothesis h_S, computed on a training set S of size m sampled i.i.d. from D, it holds that:

L_{D,f}(h_S) ≤ ε
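The sample-complexity bound in the theorem is easy to evaluate numerically; the class sizes and ε, δ values below are illustrative choices:

```python
from math import ceil, log

# Sketch of the theorem's sample-complexity bound:
# m >= log(|H| / delta) / epsilon. Numbers below are illustrative.

def sample_complexity(h_size, epsilon, delta):
    """Smallest integer m with m >= log(|H|/delta) / epsilon."""
    return ceil(log(h_size / delta) / epsilon)

print(sample_complexity(1000, 0.1, 0.05))   # |H| = 1000: m = 100
print(sample_complexity(10**6, 0.1, 0.05))  # |H| = 10^6: m = 169
```

Note how mild the dependence on |ℋ| is: growing the class from 10³ to 10⁶ hypotheses (a factor of 1000) raises the required sample size by less than a factor of two, because |ℋ| enters only through a logarithm.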
Proof idea: bound the probability of drawing misleading samples, i.e. training sets on which some bad hypothesis has zero empirical risk.
Demonstration: some notes (1)

Let ℋ_B = {h ∈ ℋ : L_{D,f}(h) > ε} be the set of bad hypotheses. For any h ∈ ℋ_B:

D({x_i : h(x_i) = y_i}) = 1 − L_{D,f}(h) ≤ 1 − ε     (1)

Under realizability the ERM hypothesis has zero empirical risk, so h_S can be bad only if the sample makes some bad hypothesis look perfect:

D^m({S|_x : L_{D,f}(h_S) > ε}) ≤ D^m(⋃_{h∈ℋ_B} {S|_x : L_S(h) = 0})     (2)

Since the m examples are sampled i.i.d., for each fixed h ∈ ℋ_B, using (1) and 1 − ε ≤ e^{−ε}:

D^m({S|_x : L_S(h) = 0}) = (1 − L_{D,f}(h))^m ≤ (1 − ε)^m ≤ e^{−εm}

By the union bound applied to (2):

D^m({S|_x : L_{D,f}(h_S) > ε}) ≤ Σ_{h∈ℋ_B} e^{−εm} = |ℋ_B| e^{−εm} ≤ |ℋ| e^{−εm}

❑ We have obtained:

D^m({S|_x : L_{D,f}(h_S) > ε}) ≤ |ℋ| e^{−εm}
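This final bound closes the loop with the theorem: plugging in m ≥ log(|ℋ|/δ)/ε makes the failure probability |ℋ| e^{−εm} at most δ. A quick numerical check (with the same illustrative values as before):

```python
from math import ceil, exp, log

# Sketch, illustrative numbers: with m chosen as in the theorem,
# the failure probability bound |H| * e^{-epsilon*m} drops below delta.

H_size, epsilon, delta = 1000, 0.1, 0.05
m = ceil(log(H_size / delta) / epsilon)

failure_bound = H_size * exp(-epsilon * m)
print(m)                         # 100
print(failure_bound <= delta)    # True
```

Algebraically: m ≥ log(|ℋ|/δ)/ε implies εm ≥ log(|ℋ|/δ), hence |ℋ| e^{−εm} ≤ |ℋ| · (δ/|ℋ|) = δ, which is exactly the "with probability ≥ 1 − δ" guarantee of the theorem.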