CS-E4710 Machine Learning: Supervised
Methods
Lecture 2: Statistical learning theory
Juho Rousu
September 12, 2023
Department of Computer Science
Aalto University
Generalization
• Our aim is to predict the outputs of future examples as well as possible, not only those of the training sample
• We would like to minimize the generalization error, or the (true)
risk
R(h) = E_{(x,y)∼D}[L(h(x), y)],
where L(y, y′) is a suitable loss function (e.g. the zero-one loss; see the sketch below)
• Assuming future examples are drawn independently from the same distribution D that generated the training examples (the i.i.d. assumption)
• But we do not know D!
• What can we say about R(h) based on training examples and the
hypothesis class H alone? Two possibilities:
• Empirical evaluation through testing
• Statistical learning theory (Lectures 2 and 3)
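As an illustration (not part of the original slides), the following sketch approximates R(h) by Monte Carlo on a synthetic distribution D using the zero-one loss, and contrasts it with the empirical risk on a small training sample; the concept, hypothesis and all names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def true_concept(X):
    # toy target concept: label 1 inside the unit disc, 0 outside
    return (X[:, 0]**2 + X[:, 1]**2 <= 1.0).astype(int)

def hypothesis(X):
    # a deliberately imperfect hypothesis: an axis-aligned square
    return ((np.abs(X[:, 0]) <= 1.0) & (np.abs(X[:, 1]) <= 1.0)).astype(int)

def zero_one_risk(h, X, y):
    # average zero-one loss of h on the labelled sample (X, y)
    return np.mean(h(X) != y)

X_train = rng.uniform(-2, 2, size=(50, 2))          # small training sample
X_big = rng.uniform(-2, 2, size=(1_000_000, 2))     # large sample approximating D

print("empirical risk:", zero_one_risk(hypothesis, X_train, true_concept(X_train)))
print("approx. R(h): ", zero_one_risk(hypothesis, X_big, true_concept(X_big)))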
Additional reading
• Lectures 2-4 are mostly based on the Mohri et al. book, chapters 2-4
• Available online via the Aalto library in Ebook Central:
https://ebookcentral.proquest.com/lib/aalto-ebooks/detail.action?pq-origsite=primo&docID=6246520
• The book goes much deeper into the theory (e.g. proofs of the theorems) than we do in this course
Probably approximately correct
learning
Probably Approximately Correct learning framework
• The Probably Approximately Correct (PAC) learning framework formalizes the notion of generalization in machine learning
• Ingredients:
• input space X containing all possible inputs x
• set of possible labels Y (in binary classification Y = {0, 1} or
Y = {−1,+1})
• Concept class C containing concepts C : X → Y (to be learned); concept C gives a label C(x) for each input x
• unknown probability distribution D
• training sample S = ((x1, C(x1)), . . . , (xm, C(xm))), where the inputs xi are drawn independently from D
• hypothesis class H, in the basic case H = C but this assumption can
be relaxed
• The goal in PAC learning is to learn a hypothesis with a low
generalization error
R(h) = E_{x∼D}[L_{0/1}(h(x), C(x))] = Pr_{x∼D}(h(x) ≠ C(x))
PAC learnability
• A class C is PAC-learnable if there exists an algorithm A that, given a training sample S, outputs a hypothesis hS ∈ H whose generalization error satisfies
Pr(R(hS) ≤ ε) ≥ 1 − δ
• for any distribution D and arbitrary ε, δ > 0, with a sample size m = |S| that grows at most polynomially in 1/ε and 1/δ
• for any concept C ∈ C
• In addition, if A runs in time polynomial in m, 1/ε, and 1/δ, the class is called efficiently PAC-learnable
Interpretation
Let us interpret the bound
Pr(R(hS) ≤ ε) ≥ 1 − δ
• ε sets the level of generalization error that is of interest to us; say we are content with predicting incorrectly on 10% of the new data points: ε = 0.1
• 1 − δ sets the level of confidence: if we are content with the training algorithm failing to provide a good hypothesis 5% of the time, then δ = 0.05
• Sample size and running time should not explode when we make ε and δ stricter: hence the requirement of polynomial growth
• The event ”low generalization error”, {R(hS) ≤ ε}, is random because we cannot know beforehand which hypothesis hS ∈ H the algorithm will select; it depends on the random sample S
Generalization error bound vs. test error
• Generalization error bounds concern
the tail of the error distribution
• We want a high generalization error to be a rare event
• The expected generalization error might be considerably lower
• It reflects average behaviour, where most data distributions and concepts are ”not bad”
• We therefore expect a gap between the expected error and the tail
• The smaller the failure probability δ, the larger the gap
Example: learning axis-aligned
rectangles
Learning setup
• The goal is to learn a
rectangle R (representing the
true concept) that includes
all blue points and excludes
all red points
• The hypotheses will also be rectangles (here R′), which will in general make both false positive predictions (predicting blue when the true label is red) and false negative predictions (predicting red when the true label is blue)
Example: learning axis-aligned rectangles
• We will use a simple algorithm: choose the tightest rectangle that contains all blue points (sketched in code below)
• Note that this will be a
consistent hypothesis: no
errors on the training data
• The hypothesis R′ will only make false negative errors compared to the true concept R, no false positive errors (Q: Why is that?)
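A minimal code sketch (an assumption for illustration, not the course's code) of the tightest-rectangle learner: the learned rectangle R′ is the bounding box of the positive (blue) training points, so it lies inside R and can never produce a false positive.

import numpy as np

def learn_tightest_rectangle(X, y):
    # bounding box (xmin, xmax, ymin, ymax) of the positive ("blue") points
    pos = X[y == 1]
    return pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max()

def predict(rect, X):
    # label 1 inside the learned rectangle, 0 outside
    xmin, xmax, ymin, ymax = rect
    inside = ((X[:, 0] >= xmin) & (X[:, 0] <= xmax) &
              (X[:, 1] >= ymin) & (X[:, 1] <= ymax))
    return inside.astype(int)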
Example: learning axis-aligned rectangles
Questions:
• Is the class of axis-aligned
rectangles PAC-learnable?
• How much training data will
be needed to learn?
• We need to bound the probability of outputting a bad hypothesis, one with R(R′) > ε, so that with high probability 1 − δ the output is good
• We can assume Pr_D(R) > ε (otherwise R(R′) ≤ ε trivially, since R′ ⊆ R and errors can only occur inside R)
Example: learning axis-aligned rectangles
Let r1, r2, r3, r4 be rectangles along the sides of R such that Pr_D(ri) = ε/4
• Their union satisfies Pr_D(r1 ∪ r2 ∪ r3 ∪ r4) ≤ ε
• Errors can only occur within R − R′ (false negative predictions)
• If R′ intersects all four regions r1, . . . , r4, then R − R′ is contained in r1 ∪ r2 ∪ r3 ∪ r4, and hence R(R′) ≤ ε
Thus, if R(R′) > ε, then R′ must miss at least one of the four regions
Example: learning axis-aligned rectangles
• The events A = {R′ intersects all four rectangles r1, . . . , r4} and B = {R(R′) ≤ ε} satisfy A ⊆ B
• The complement events A^C = {R′ misses at least one rectangle} and B^C = {R(R′) > ε} satisfy B^C ⊆ A^C
• B^C is the bad event (high generalization error); we want it to have low probability
• Since B^C ⊆ A^C, we have Pr(B^C) ≤ Pr(A^C)
• Our task is thus to upper bound Pr(A^C)
Example: learning axis-aligned rectangles
• Each ri has probability mass ε/4 by our design
• Probability of one example missing one rectangle: 1 − ε/4
• Probability of all m examples missing one rectangle: (1 − ε/4)^m (m independent draws from D)
• Probability that the sample misses at least one of the four rectangles (union bound over the rectangles):
Pr(A^C) ≤ 4(1 − ε/4)^m
(this bound is checked numerically in the simulation sketched below)
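The bound can be checked empirically. The simulation below (illustration only; the rectangle, distribution and parameter values are arbitrary choices) repeats the learning experiment many times under a uniform distribution on the unit square and compares the observed frequency of {R(R′) > ε} with 4(1 − ε/4)^m.

import numpy as np

rng = np.random.default_rng(1)
R = (0.2, 0.8, 0.3, 0.7)                   # true rectangle (xmin, xmax, ymin, ymax)
eps, m, n_rep, n_test = 0.1, 100, 2000, 20000

def label(rect, X):
    xmin, xmax, ymin, ymax = rect
    return ((X[:, 0] >= xmin) & (X[:, 0] <= xmax) &
            (X[:, 1] >= ymin) & (X[:, 1] <= ymax)).astype(int)

X_test = rng.uniform(0, 1, size=(n_test, 2))   # large test set approximating D
y_test = label(R, X_test)
bad = 0
for _ in range(n_rep):
    X = rng.uniform(0, 1, size=(m, 2))
    y = label(R, X)
    if y.sum() == 0:                  # no positive points: R' predicts 0 everywhere,
        err = y_test.mean()           # so it errs exactly on R
    else:
        pos = X[y == 1]
        Rp = (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max())
        err = np.mean(label(Rp, X_test) != y_test)
    bad += (err > eps)

print("observed Pr(R(R') > eps):", bad / n_rep)
print("bound 4*(1 - eps/4)**m :", 4 * (1 - eps / 4) ** m)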
Example: learning axis-aligned rectangles
• We can use the general inequality ∀x: 1 − x ≤ exp(−x) to obtain:
Pr(R(R′) > ε) ≤ 4(1 − ε/4)^m ≤ 4 exp(−mε/4)
• We want this probability to be small (at most δ):
4 exp(−mε/4) ≤ δ
⇔ m ≥ (4/ε) log(4/δ)
• The last inequality is our first generalization error bound, or more precisely a sample complexity bound (evaluated numerically below)
Note: following Mohri et al. (2018), log denotes the natural logarithm: log(exp(x)) = x
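For concreteness, a small helper (hypothetical name) that evaluates the sample complexity bound, here for ε = 0.1 and δ = 0.05:

import math

def rectangle_sample_complexity(eps, delta):
    # m >= (4/eps) * log(4/delta) guarantees Pr(R(R') <= eps) >= 1 - delta
    return math.ceil(4 / eps * math.log(4 / delta))

print(rectangle_sample_complexity(0.1, 0.05))   # 176 examples suffice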
Plotting the behaviour of the bound
• Left: the sample complexity, the number of examples needed to reach a given generalization error level, m(ε, δ) = (4/ε) log(4/δ)
• Right: the generalization bound as a function of the training sample size, ε(m, δ) = (4/m) log(4/δ)
• Three different confidence levels δ are plotted (a possible plotting script is sketched below)
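One possible way (assumed, not the original course code) to reproduce both panels with matplotlib:

import numpy as np
import matplotlib.pyplot as plt

eps = np.linspace(0.01, 0.3, 200)     # generalization error levels
m = np.arange(10, 1001)               # training sample sizes
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for delta in (0.01, 0.05, 0.1):
    ax1.plot(eps, 4 / eps * np.log(4 / delta), label=f"delta={delta}")
    ax2.plot(m, 4 / m * np.log(4 / delta), label=f"delta={delta}")
ax1.set_xlabel("epsilon")
ax1.set_ylabel("sample complexity m(eps, delta)")
ax2.set_xlabel("sample size m")
ax2.set_ylabel("bound eps(m, delta)")
ax1.legend()
ax2.legend()
plt.show()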
Plotting the behaviour of the bound
Typical behaviour of machine learning algorithms is revealed:
• increasing the sample size decreases the generalization error
• extra data gives less and less additional benefit as the sample size grows (law of diminishing returns)
• requiring a higher level of confidence (smaller δ) requires more data for the same level of error
Generalization error bound vs. expected test error
• The error bounds hold for any concept in the class
• including concepts that are harder to learn than the ”average” concept
• They hold for any distribution D generating the data
• including adversarially generated distributions (aiming to make learning harder)
• We bound the probability of being in the high-error tail of the distribution (not the convergence to the mean or median generalization error)
For these reasons, empirically estimated test errors might be considerably lower than the bounds suggest
Half-time poll: Personalized email spam filtering system
A company is developing a personalized email spam filtering system. The system is tuned personally for each customer using the customer's data on top of the existing training data. The company has a choice of three machine learning algorithms with different performance characteristics. So far, the company has tested the three algorithms on a small set of test users.
Which algorithm should the company choose?
1. Algorithm 1, which guarantees error rate of less than 10% for 99%
of the future customer base
2. Algorithm 2, which guarantees error rate of less than 5% for 90% of
the future customer base
3. Algorithm 3, which empirically has error rate of 1% on the current
user base
Answer the poll in MyCourses by 11:15: go to the Lectures page and scroll down to ”Lecture 2 poll”:
https://mycourses.aalto.fi/course/view.php?id=37029&section=2
Guarantees for finite hypothesis
sets
Finite hypothesis classes
• Finite concept classes arise when:
• Input variables have finite domains or they are converted to such in
preprocessing (e.g. discretizing real values), and
• The representations of the hypotheses have finite size (e.g. the
number of times a single variable can appear)
• A typical example is subclasses of Boolean formulae: expressions combining binary input variables (literals) with logical operators (AND, OR, NOT, ...)
• Finite concept classes are among the most thoroughly analyzed hypothesis classes in statistical learning theory
Example: Boolean conjunctions
• Aldo likes to do sport only when the weather is suitable
• Aldo has given examples of suitable and unsuitable weather
• Let us build a classifier for Aldo to decide whether to do sports today
• As the classifier we use rules in the form of Boolean conjunctions (Boolean formulae containing AND and NOT, but not OR): e.g. if (Sky=Sunny) AND NOT(Wind=Strong) then (EnjoySport=1); one way to represent such rules in code is sketched below
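One hypothetical way to represent such a conjunction in code is as a list of (attribute, value, negated) literals; the example rule above then reads:

def make_conjunction(literals):
    # literals: list of (attribute, value, negated) triples joined with AND
    def h(example):                     # example is a dict, e.g. {"Sky": "Sunny", ...}
        for attr, value, negated in literals:
            holds = (example[attr] == value)
            if holds == negated:        # literal violated
                return 0
        return 1
    return h

rule = make_conjunction([("Sky", "Sunny", False), ("Wind", "Strong", True)])
print(rule({"Sky": "Sunny", "Wind": "Weak"}))    # 1: EnjoySport
print(rule({"Sky": "Sunny", "Wind": "Strong"}))  # 0: stay home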
Finite hypothesis class - consistent case
• Sample complexity bound relying on the size of the hypothesis class (Mohri et al., 2018): Pr(R(hS) ≤ ε) ≥ 1 − δ if
m ≥ (1/ε) (log(|H|) + log(1/δ))
• An equivalent generalization error bound:
R(h) ≤ (1/m) (log(|H|) + log(1/δ))
• Holds for any finite hypothesis class assuming there is a consistent hypothesis, one with zero empirical risk
• The extra term compared to the rectangle learning example is the term (1/ε) log(|H|)
• The more hypotheses there are in H, the more training examples are needed (the bound is evaluated numerically below)
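Both forms of the bound are straightforward to evaluate numerically; a sketch with hypothetical helper names:

import math

def sample_complexity(H_size, eps, delta):
    # consistent case: m >= (1/eps) * (log|H| + log(1/delta))
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def error_bound(H_size, m, delta):
    # consistent case: R(h) <= (1/m) * (log|H| + log(1/delta))
    return (math.log(H_size) + math.log(1 / delta)) / m

print(sample_complexity(3**6, 0.1, 0.05))      # 96 examples for Aldo's conjunctions
print(round(error_bound(3**6, 100, 0.05), 3))  # bound after 100 examples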
Example: Boolean conjunctions
• How many different conjunctions can be built (= |H|)?
• Each variable can appear as is, appear negated with ”NOT”, or be excluded from the rule: 3 possibilities
• The total number of hypotheses is thus 3^d, where d is the number of variables
• We have six variables in total, giving us |H| = 3^6 = 729 different hypotheses (verified by the enumeration sketch below)
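A small illustrative check (not from the slides) that enumerating the three choices per variable indeed yields 3^d conjunctions:

from itertools import product

def count_conjunctions(d):
    # each variable independently: positive literal, negated literal, or absent
    return sum(1 for _ in product(("pos", "neg", "absent"), repeat=d))

print(count_conjunctions(6))   # 729 == 3**6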
Plotting the bound for Aldo’s problem using Boolean conjunctions
• On the left, the generalization bound is shown for different values of
δ, using d = 6 variables
• On the right, the bound is shown for an increasing number of input variables d, using δ = 0.05
Arbitrary boolean formulae
• What about using arbitrary Boolean formulae?
• How many Boolean formulae of d variables are there?
• There are 2^d possible input vectors, so the size of the input space is |X| = 2^d
• We can define a Boolean formula that outputs 1 on an arbitrary subset S ⊆ X and 0 outside that subset:
f_S(x) = (x = x1) OR (x = x2) OR · · · OR (x = x_|S|)
• We can pick the subset in 2^|X| ways (Why?)
• Thus we have |H| = 2^(2^d) different Boolean formulae
• Our generalization bound gives
m ≥ (1/ε) (2^d log 2 + log(1/δ))
• Thus we need an exponential number of examples with respect to the number of variables; the hypothesis class is considered not PAC-learnable! (A numerical illustration follows below.)
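Plugging |H| = 2^(2^d) into the consistent-case sample complexity shows the blow-up numerically; a rough illustrative sketch (ε = 0.1, δ = 0.05):

import math

def formula_sample_complexity(d, eps, delta):
    # m >= (1/eps) * (2**d * log 2 + log(1/delta)) since log|H| = 2**d * log 2
    return math.ceil((2**d * math.log(2) + math.log(1 / delta)) / eps)

for d in (6, 10, 20):
    print(d, formula_sample_complexity(d, 0.1, 0.05))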
Plotting the bound for arbitrary Boolean formulae
• With d = 6 variables we need ca. 500 examples to get the bound below 0.07 (left picture)
• Increasing the number of variables quickly raises the sample complexity to 10^6 and beyond (right picture)
Proof outline of the PAC bound
for finite hypothesis classes
Proof outline (Mohri et al., 2018)
• Consider any hypothesis h ∈ H with R(h) > ε
• For h to be consistent, i.e. R̂(h) = 0, all training examples need to miss the region where h makes an error
• The probability of this event is
Pr(R̂(h) = 0 | R(h) > ε) ≤ (1 − ε)^m
• m independent trials, each missing the error region with probability at most 1 − ε
• This bounds the probability that a single fixed hypothesis with high error appears consistent on the sample
Proof outline
• But we do not know in advance which consistent hypothesis h will be selected by our learning algorithm
• Hence our result will need to hold for all consistent hypotheses
• This is an example of a uniform convergence bound
• We wish to upper bound the probability that some h ∈ H is consistent (R̂(h) = 0) and has a high generalization error R(h) > ε, for a fixed ε > 0:
Pr(∃h ∈ H : R̂(h) = 0 ∧ R(h) > ε)
• Above ∧ is the logical ”and”
Proof outline
• We can replace ∃ by enumerating all hypotheses in H using logical or (∨):
Pr(∃h ∈ H : R̂(h) = 0 ∧ R(h) > ε) =
Pr({R̂(h1) = 0 ∧ R(h1) > ε} ∨ {R̂(h2) = 0 ∧ R(h2) > ε} ∨ · · · )
• Using the fact that Pr(A ∪ B) ≤ Pr(A) + Pr(B) and Pr(A ∩ C) ≤ Pr(A | C) for any events A, B and C, the above is upper bounded by
∑_{h∈H} Pr(R̂(h) = 0 ∧ R(h) > ε) ≤ ∑_{h∈H} Pr(R̂(h) = 0 | R(h) > ε) ≤ |H|(1 − ε)^m
• The last inequality follows from applying Pr(R̂(h) = 0 | R(h) > ε) ≤ (1 − ε)^m to each of the |H| summands
Proof outline
• We have established
Pr(∃h ∈ H : R̂(h) = 0 ∧ R(h) > ε) ≤ |H|(1 − ε)^m ≤ |H| exp(−mε)
• Set the right-hand side equal to δ and solve for m to obtain the bound:
δ = |H| exp(−mε)
log δ = log(|H|) − mε
m = (1/ε) (log(|H|) + log(1/δ))
Finite hypothesis class - inconsistent case
• So far we have assumed that there is a consistent hypothesis h ∈ H, one that achieves zero empirical risk on the training sample
• In practice this is often not the case
• However as long as the empirical risk R̂(h) is small, a low
generalization error can still be achieved
• Generalization error bound (Mohri et al., 2018): Let H be a finite hypothesis set. Then for any δ > 0, with probability at least 1 − δ, we have for all h ∈ H:
R(h) ≤ R̂(h) + √( (log(|H|) + log(2/δ)) / (2m) )
• We see the same dependency on log(|H|)/m as in the consistent case, but now under a square root
• Slower convergence w.r.t. the number of examples (see the numerical sketch below)
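A brief numeric sketch (hypothetical helper name) of how the square-root gap added to R̂(h) shrinks with m, here for Aldo's |H| = 3^6 and δ = 0.05:

import math

def generalization_gap(H_size, m, delta):
    # sqrt((log|H| + log(2/delta)) / (2m)), added to the empirical risk R_hat(h)
    return math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * m))

for m in (100, 1000, 10000):
    print(m, round(generalization_gap(3**6, m, 0.05), 3))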
Summary
• Probably approximately correct learning is a theoretical framework
for analysing the generalization performance of machine learning
algorithms
• PAC theory is concerned with upper bounding the probability δ of ”bad” events, those with high generalization error (> ε)
• For finite hypothesis classes, the logarithm log(|H|) of the number of hypotheses in the class affects the number of examples needed to obtain a given level of risk with high probability