STRONGLY RECOMMENDED:
• Linear algebra
– Matrices, vectors, systems of linear equations
– Eigenvectors, matrix rank
– Singular value decomposition
• Multivariable calculus
– Derivatives, integration, tangent planes
– Optimization, Lagrange multipliers
• Good programming skills: Python highly recommended
Example applications:
• Classification: spam vs. not spam; face recognition
• Regression: predicting a quantity such as temperature (e.g., 72° F)
• Ranking: comparing items, as in web search; given an image, find similar images (http://www.tiltomo.com/)
• Collaborative filtering: recommendation systems
Machine learning competition with a $1 million prize
• Clustering: grouping a set of images [Goldberger et al.]; clustering web search results
• Embedding: visualizing high-dimensional data
  – Embedding images: images have thousands or millions of pixels [Joseph Turian]
  – Embedding words [Joseph Turian]
• Structured prediction: among the many possible structured outputs, how do we choose the best one?
Occam’s Razor Principle
• William of Occam: Monk living in the 14th century
• Principle of parsimony: prefer the simplest hypothesis consistent with the data
[Samy Bengio]
Key Issues in Machine Learning
• How do we choose a hypothesis space?
– Often we use prior knowledge to guide this choice
• How can we gauge the accuracy of a hypothesis on unseen
data?
– Occam’s razor: use the simplest hypothesis consistent with data!
This will help us avoid overfitting.
– Learning theory will help us quantify our ability to generalize as
a function of the amount of training data and the hypothesis space
• How do we find the best hypothesis?
– This is an algorithmic question, the main topic of computer
science
• How to model applications as machine learning problems?
(engineering challenge)
Probability Theory refresher
Chapter 2
Bayesian Decision Theory
Key Principle
Bayes Theorem
Machine Learning, Tapas Kumar Mishra
Bayes Theorem
Thomas Bayes (1702–1761)
X: the observed sample, also called evidence (e.g., the length of a fish)
H: the hypothesis (e.g., the fish belongs to the "salmon" category)
P(H): the prior probability that H holds (e.g., the probability of catching a salmon)
P(X|H): the likelihood of observing X given that H holds (e.g., the probability that a fish known to be a salmon is 3 inches long)
P(X): the evidence probability that X is observed (e.g., the probability of observing a fish of 3-inch length)
P(H|X): the posterior probability that H holds given X (e.g., the probability that a fish is a salmon given that its length is 3 inches)
Bayes theorem: P(H|X) = P(X|H) P(H) / P(X)
Bayes theorem converts the prior probability P(H) into the posterior probability P(H|X) using the observation X for a test example (e.g., fish lightness).
Question
If a positive test result is returned for some person, does he/she have this kind of cancer or not?
No cancer!
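A minimal sketch of this calculation. The slide's own figures are not reproduced in the text, so the prior and test accuracies below are assumed illustrative values:

```python
# Bayes rule for the cancer-test question (assumed, illustrative numbers)
p_cancer = 0.008            # prior P(cancer)
p_pos_given_cancer = 0.98   # likelihood P(+ | cancer)
p_pos_given_healthy = 0.03  # false-positive rate P(+ | no cancer)

# evidence P(+) via the law of total probability
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
# posterior P(cancer | +)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # 0.209 -> below 0.5, so decide "no cancer"
```

Even with a fairly accurate test, the small prior keeps the posterior well below 0.5, which is why the answer is "no cancer".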
Quantities to know: priors and likelihoods, estimated by counting relative frequencies via collected samples (the original slide counts the cars of each class in two sets of images).
Decision rule: decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2.
Error rate: the probability that the chosen action is wrong, P(error|x) = min{P(ω1|x), P(ω2|x)}.
Discriminant functions
Various discriminant functions can produce identical classification results: if f is monotonically increasing, then g_i(x) and f(g_i(x)) induce the same decision regions.
Expected value
Discrete case: E[X] = Σ_i x_i p(x_i)
Continuous case: E[X] = ∫ x f(x) dx
Variance
Discrete case: Var(X) = Σ_i (x_i − E[X])² p(x_i)
Continuous case: Var(X) = ∫ (x − E[X])² f(x) dx
Expected vector: μ = E[x], where each component μ_i = E[x_i] is computed from the marginal pdf on the i-th component.
Notation: the covariance matrix Σ is symmetric and positive semidefinite.
For Gaussian class-conditional densities, the discriminant functions involve the squared Mahalanobis distance (x − μ_i)ᵀ Σ_i⁻¹ (x − μ_i). In the linear case they reduce to g_i(x) = w_iᵀx + w_i0 (weight vector and threshold/bias); in the general case, g_i(x) = xᵀW_i x + w_iᵀx + w_i0 (quadratic matrix, weight vector, threshold/bias).
◼ Gaussian/Normal density
f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
Complementary Event
P(A) = 1 − P(not A)
Joint Probability
The joint probability P(A ∩ B) is the probability of two events in conjunction, i.e., of both events occurring together.
For the union: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Independent Events
Two events A and B are independent if P(A ∩ B) = P(A) P(B)
Example on Independence
An urn contains three balls; E_i is the event of drawing ball i, with P(E_i) = 1/3 for i = 1, 2, 3.
Conditional probability: P(A | B) = P(A ∩ B) / P(B)
Case 1: drawing with replacement of the ball. The second draw is independent of the first draw:
P(E1 | E2) = P(E1 ∩ E2) / P(E2) = (1/3 · 1/3) / (1/3) = 1/3 = P(E1)
Case 2: drawing without replacement of the ball. The second draw is dependent on the first draw:
P(E1 | E2) = P(E1 ∩ E2) / P(E2) = (1/3 · 1/2) / (1/3) = 1/2 ≠ P(E1)
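The two cases can be checked with exact fractions; a small sketch of the arithmetic above:

```python
from fractions import Fraction

p_e2 = Fraction(1, 3)                            # P(E2) on a single draw
joint_with = Fraction(1, 3) * Fraction(1, 3)     # with replacement: independent draws
joint_without = Fraction(1, 3) * Fraction(1, 2)  # without: ball 1 is one of the 2 left
print(joint_with / p_e2)     # 1/3  = P(E1) -> independent
print(joint_without / p_e2)  # 1/2 != P(E1) -> dependent
```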
Bayes' Rule
We know that P(A ∩ B) = P(B ∩ A).
Using the definition of conditional probability, we have
P(A | B) P(B) = P(B | A) P(A), i.e., P(A | B) = P(B | A) P(A) / P(B)
Law of Total Probability
For a partition {B_i} of the sample space: P(A) = Σ_i P(A | B_i) P(B_i)
Random variable: a function X: Ω → S mapping each event E_i to a value x_i (e.g., the event A = {3}).
A random variable: Examples.
Random Variable Types
► Discrete random variable: possible values are discrete (countable sample space, e.g., integer values): X: Ω → {1, 2, 3, 4, ...}, E_i ↦ x_i
► Continuous random variable: possible values are continuous (uncountable sample space, real values), e.g., X: Ω → [1.4, 32.3], E_i ↦ x_i
Discrete Random Variable
The probability distribution of a discrete random variable is called the Probability Mass Function (PMF):
p(x_i) = P(X = x_i)
Properties of the PMF: 0 ≤ p(x_i) ≤ 1 and Σ_i p(x_i) = 1
Cumulative Distribution Function (CDF): P(X ≤ x) = Σ_{x_i ≤ x} p(x_i)
Mean value: μ_X = E(X) = Σ_{i=1}^n x_i p(x_i)
Discrete Random Variable
Mean (expected) value: μ_X = E(X) = Σ_{i=1}^n x_i p(x_i)
Variance (general equation): V(X) = σ_X² = E[(X − E(X))²] = E(X²) − [E(X)]²
For a discrete RV: V(X) = Σ_{i=1}^n (x_i − μ_X)² p(x_i) = Σ_{i=1}^n x_i² p(x_i) − (Σ_{i=1}^n x_i p(x_i))²
Discrete Random Variable: Example
Random variable: grades of the students
Student ID: 1 2 3 4 5 6 7 8 9 10
Grade:      3 2 3 1 2 3 1 3 2 2
p(1) = P(X = 1) = 2/10 = 0.2,  with 0 ≤ p(1) ≤ 1
p(2) = P(X = 2) = 4/10 = 0.4,  with 0 ≤ p(2) ≤ 1
p(3) = P(X = 3) = 4/10 = 0.4,  with 0 ≤ p(3) ≤ 1
Σ_i p(x_i) = p(1) + p(2) + p(3) = 1
P(X ≤ 3) = Σ_{x_i ≤ 3} p(x_i) = p(1) + p(2) + p(3) = 1
E(X) = Σ_i x_i p(x_i) = 1 · 0.2 + 2 · 0.4 + 3 · 0.4 = 2.2
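The grades example can be reproduced directly; a small sketch computing the PMF, mean, and variance from the raw grades:

```python
from collections import Counter

grades = [3, 2, 3, 1, 2, 3, 1, 3, 2, 2]
n = len(grades)
pmf = {g: c / n for g, c in sorted(Counter(grades).items())}  # p(x_i)
mean = sum(g * p for g, p in pmf.items())                     # E(X)
var = sum((g - mean) ** 2 * p for g, p in pmf.items())        # V(X)
print(pmf)             # {1: 0.2, 2: 0.4, 3: 0.4}
print(round(mean, 2))  # 2.2
print(round(var, 2))   # 0.56
```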
Continuous Random Variable
The probability distribution of a continuous random variable is called the Probability Density Function (PDF), f(x).
The probability of any single value is always 0: P(X = x_i) = 0 (the sample space is infinite).
For a continuous random variable we therefore compute P(a ≤ X ≤ b).
Properties of the PDF:
1. f(x) ≥ 0 for all x in R_X
2. ∫_{R_X} f(x) dx = 1
3. f(x) = 0 if x is not in R_X
Continuous Random Variable
Cumulative Distribution Function (CDF): F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Mean/expected value: μ_X = E(X) = ∫_{−∞}^{+∞} x f(x) dx
Variance: V(X) = ∫_{−∞}^{+∞} (x − μ_x)² f(x) dx = ∫_{−∞}^{+∞} x² f(x) dx − μ_x²
Discrete versus Continuous Random Variables
Discrete: P(X ≤ x) = Σ_{x_i ≤ x} p(x_i)
Continuous: F(x) = ∫_{−∞}^x f(t) dt;  f(x) = 0 if x is not in R_X;  P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Continuous Random Variables: Example
Exponential distribution Exp(μ):
f(x) = (1/μ) exp(−x/μ) for x ≥ 0; 0 otherwise
Continuous Random Variables: Example
With μ = 2 (e.g., x is a waiting time in minutes):
f(x) = (1/2) e^{−x/2} for x ≥ 0; 0 otherwise
Continuous Random Variables: Example
Probability that the customer waits exactly 3 minutes:
P(X = 3) = P(3 ≤ X ≤ 3) = ∫_3^3 (1/2) e^{−x/2} dx = 0
Probability that the customer waits between 2 and 3 minutes:
P(2 ≤ X ≤ 3) = ∫_2^3 (1/2) e^{−x/2} dx ≈ 0.145
Equivalently, using the CDF F(x) = 1 − e^{−x/2}:
P(2 ≤ X ≤ 3) = F(3) − F(2) = (1 − e^{−3/2}) − (1 − e^{−1}) ≈ 0.145
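The two probabilities above can be checked with the CDF of Exp(μ = 2):

```python
import math

# Waiting time ~ Exp(mu = 2): CDF F(x) = 1 - exp(-x/2)
def F(x):
    return 1 - math.exp(-x / 2)

print(F(3) - F(3))            # 0.0 -> P(X = 3) = 0 for a continuous RV
print(round(F(3) - F(2), 3))  # 0.145 -> P(2 <= X <= 3)
```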
Continuous Random Variables: Example
E(X) = ∫_0^∞ x (1/2) e^{−x/2} dx = [−x e^{−x/2}]_0^∞ + ∫_0^∞ e^{−x/2} dx = 0 + 2 = 2
σ = √V(X) = 2 (for Exp(μ), the mean and the standard deviation both equal μ)
Variance
Standard deviation: σ_X = √(σ_X²) = √V(X)
Coefficient of Variation
CV(X) = √V(X) / E(X) = σ_X / μ_X
Discrete Probability Distribution
Probability Mass Function (PMF)
Formally, the probability distribution or probability mass function (PMF) of a discrete random variable X is a function that gives the probability p(x_i) that the random variable equals x_i, for each value x_i:
p(x_i) = P(X = x_i)
It satisfies the following conditions:
0 ≤ p(x_i) ≤ 1 and Σ_i p(x_i) = 1
Continuous Random Variable
Probability Density Function (PDF)
For the case of continuous variables, we do not want to
ask what the probability of "1/6" is, because the answer is
always 0...
Rather, we ask what is the probability that the value is in
the interval (a,b).
So for continuous variables, we care about the derivative of
the distribution function at a point (that's the derivative of an
integral). This is called a probability density function
(PDF).
The probability that a random variable has a value in a set A is
the integral of the p.d.f. over that set A.
Probability Density Function (PDF)
The Probability Density Function (PDF) of a continuous
random variable is a function that can be integrated to
obtain the probability that the random variable takes a value in
a given interval.
More formally, the probability density function, f(x), of a
continuous random variable X is the derivative of the
cumulative distribution function F(x):
f(x) = dF(x)/dx
Since F(x) = P(X ≤ x), it follows that:
F(b) − F(a) = P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Cumulative Distribution Function (CDF)
For −∞ < x < +∞: F(x) = P(X ≤ x)
Cumulative Distribution Function (CDF)
Discrete case: F(x) = P(X ≤ x) = Σ_{x_i ≤ x} P(X = x_i) = Σ_{x_i ≤ x} p(x_i)
Continuous case: F(b) − F(a) = P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Cumulative Distribution Function (CDF)
► Example
► Discrete case: Suppose a random variable X has the
following probability mass function p(xi):
xi 0 1 2 3 4 5
p(xi) 1/32 5/32 10/32 10/32 5/32 1/32
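The CDF of this PMF can be accumulated directly; a small sketch using exact fractions:

```python
from fractions import Fraction

# PMF from the table: p(x) = (1, 5, 10, 10, 5, 1)/32 for x = 0..5
pmf = {x: Fraction(c, 32) for x, c in zip(range(6), [1, 5, 10, 10, 5, 1])}
assert sum(pmf.values()) == 1  # a valid PMF sums to 1

cdf, running = {}, Fraction(0)
for x in sorted(pmf):          # F(x) = sum of p(x_i) for x_i <= x
    running += pmf[x]
    cdf[x] = running
print(cdf[2])  # 1/2
print(cdf[5])  # 1
```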
Mean or Expected Value

Variance
V(X) = σ_X² = E[(X − E(X))²] = E(X²) − [E(X)]²
σ_X = √(σ_X²) = √V(X)
Sampling Distributions, Confidence Intervals, Hypothesis Testing
Sampling Distribution of the Means
• Central Limit Theorem: if X̄ is the mean of a random sample of size n taken from a population with mean μ and finite variance σ², then, as n → ∞, the limiting form of the distribution of
z = (X̄ − μ) / (σ/√n)
is the standard normal distribution N(0, 1).
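A quick sanity check of the theorem by simulation; the Uniform(0, 1) population below is an assumed choice, not from the slides:

```python
import random
import statistics

random.seed(0)
n, trials = 30, 2000
# population: Uniform(0, 1) with mu = 0.5 and sigma = sqrt(1/12) ~= 0.289
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(trials)]
print(round(statistics.fmean(means), 2))           # close to mu = 0.5
print(round(statistics.stdev(means) * n**0.5, 2))  # close to sigma ~= 0.29
```

The standard deviation of the sample means shrinks like σ/√n, which is what the z-statistic above normalizes away.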
Example:
• Let X̄1, X̄2, …, X̄n be the sample means of samples S1, S2, …, Sn drawn from an independent and identically distributed population with mean μ and standard deviation σ. From the central limit theorem we know that the sample means X̄i follow a normal distribution with mean μ and standard deviation σ/√n, and the variable Z = (X̄i − μ)/(σ/√n) follows the standard normal distribution. Therefore
P(X̄ − Z_{α/2} σ/√n ≤ μ ≤ X̄ + Z_{α/2} σ/√n) = 1 − α
9/11/2023 @TKMISHRA ML NITRKL 14
CI for Different Significance Values
• That is, the probability that the population mean takes a value between X̄ − Z_{α/2} σ/√n and X̄ + Z_{α/2} σ/√n is 1 − α.
• The absolute values of Z_{α/2} for various values of α, and the corresponding confidence interval for the population mean when the population standard deviation is known:
α      |Z_{α/2}|   Confidence interval
0.1    1.64        X̄ ± 1.64 σ/√n
0.05   1.96        X̄ ± 1.96 σ/√n
0.02   2.33        X̄ ± 2.33 σ/√n
0.01   2.58        X̄ ± 2.58 σ/√n
(a) Calculate the 95% confidence interval for the population mean.
(b) What is the probability that the population mean is greater than
4.73 days?
Note that 4.73 is the upper limit of the 95% confidence interval from part (a), thus the probability
that the population mean is greater than 4.73 is approximately 0.025.
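A minimal sketch of such a known-σ interval; the sample values below are assumed illustrative numbers, not the exercise's own data:

```python
import math

# 95% CI for a population mean with known sigma (assumed, illustrative values)
xbar, sigma, n, z = 4.5, 0.8, 50, 1.96
half = z * sigma / math.sqrt(n)
lo, hi = xbar - half, xbar + half
print(round(lo, 2), round(hi, 2))
# P(mu > hi) is approximately alpha/2 = 0.025, as in part (b) above
```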
• William Gossett (Student, 1908) proved that if the population follows a normal
distribution and the standard deviation is calculated from the sample, then the statistic
given in Eq will follow a t-distribution with (n − 1) degrees of freedom
t = (X̄ − μ) / (S/√n)
• Here S is the standard deviation estimated from the sample (so S/√n is the standard error). The t-distribution is very similar to the standard normal distribution; it has a bell shape and its mean, median, and mode are equal to zero, as in the case of the standard normal distribution. The major difference is that the t-distribution has broader tails than the standard normal distribution. However, as the degrees of freedom increase, the t-distribution converges to the standard normal distribution.
• In the above equation, t_{α/2, n−1} is the value of t under the t-distribution for which the cumulative probability F(t) = α/2 when the degrees of freedom are (n − 1).
• An online grocery store is interested in estimating the basket size (number of items
ordered by the customer) of its customers so that it can optimize its size of crates used
for delivering the grocery items. From a sample of 70 customers, the average basket size
was estimated as 24 and the standard deviation estimated from the sample was 3.8.
Calculate the 95% confidence interval for the basket size of the customer order.
Solution
We know that X̄ = 24, n = 70, S = 3.8, and t_{0.025, 69} = 1.995.
Thus the 95% confidence interval for the size of the basket is
24 ± 1.995 × 3.8/√70 = (23.09, 24.91).
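The basket-size interval can be reproduced directly from the slide's numbers:

```python
import math

# 95% t-confidence interval for the basket-size example
xbar, s, n = 24, 3.8, 70
t = 1.995  # t_{0.025, 69} from the slide
half = t * s / math.sqrt(n)
print(round(xbar - half, 2), round(xbar + half, 2))  # 23.09 24.91
```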
HYPOTHESIS TESTING
INTRODUCTION TO HYPOTHESIS TESTING
3) Identify the test statistic to be used for testing the validity of the null
hypothesis. Test statistic will enable us to calculate the evidence in
support of null hypothesis. The test statistic will depend on the
probability distribution of the sampling distribution; for example, if the
test is for mean value and the mean is calculated from a large sample
and if the population standard deviation is known, then the sampling
distribution will be a normal distribution and the test statistic will be a Z-
statistic (standard normal statistic).
HYPOTHESIS TESTING STEPS
4. Decide the criteria for rejection and retention of the null hypothesis. This is called the significance value, traditionally denoted by the symbol α. The value of α will depend on the context; usually 0.1, 0.05, and 0.01 are used.
6. Take the decision to reject or retain the null hypothesis based on the p-value and the significance value α. The null hypothesis is rejected when the p-value is less than α and retained when the p-value is greater than or equal to α.
H0: μ ≤ 100,000
HA: μ > 100,000
Z-statistic: Z = (X̄ − μ) / (σ/√n)
• The critical value in this case will depend on the significance value and on whether it is a one-tailed or two-tailed test:
α      One-tailed critical values   Two-tailed critical values
0.1    −1.28 or 1.28                −1.64 and 1.64
0.05   −1.64 or 1.64                −1.96 and 1.96
0.01   −2.33 or 2.33                −2.58 and 2.58
Z = (X̄ − μ) / (σ/√n) = (4250 − 4200) / (3200/√40000) = 3.125
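The Z-statistic above is a one-line computation:

```python
import math

# One-sample Z-statistic from the example's numbers
xbar, mu, sigma, n = 4250, 4200, 3200, 40000
z = (xbar - mu) / (sigma / math.sqrt(n))
print(z)  # 3.125
```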
Sample data (n = 40):
16 16 30 37 25 22 19 35 27 32
34 28 24 35 24 21 32 29 24 35
28 29 18 31 28 33 32 24 25 22
21 27 41 23 23 16 24 38 26 28
σ/√n = 12.5/√40 = 1.9764
Solution Continued…
• The critical value of the left-tailed test for α = 0.05 is −1.645.
• Since the critical value is less than the Z-statistic value, we fail to reject the null hypothesis. The p-value for Z = −1.4926 is 0.06777, which is greater than the value of α.
Z = (X̄ − μ) / (σ/√n) = (84 − 82) / (11.03/√100) = 1.8132
601 627 330 364 562 353 583 254 528 470
408 601 593 729 402 530 708 599 439 762
292 636 444 286 636 667 252 335 457 632
t-statistic = (X̄ − μ) / (S/√n) = (429.55 − 500) / (195.0337/√40) = −2.2845
t = (X̄ − μ) / (S/√n) = (19.5 − 16.8) / (6.6/√50) = 2.8927
t = (d̄ − D) / (S_d/√n) = (11.5 − 0) / (95.6757/√20) = 0.5375
Z = ((X̄1 − X̄2) − (μ1 − μ2)) / √(σ1²/n1 + σ2²/n2)
Table columns: Specialization | Sample Size | Estimated Mean Salary (in Rupees) | Population Standard Deviation (the table body is not reproduced here).
Since the Z-statistic value is higher than the Z-critical value, we reject the null hypothesis.
where S_p² is the pooled variance of the two samples, given by
S_p² = ((n1 − 1) S1² + (n2 − 1) S2²) / (n1 + n2 − 2)
t = ((X̄1 − X̄2) − (μ1 − μ2)) / √(S_p² (1/n1 + 1/n2))
Group                      Sample Size   Increase in Height (in cm)   Standard Deviation
Drink health drink         80            7.6 cm                       1.1 cm
Do not drink health drink  80            6.3 cm                       1.3 cm
So n1 = n2 = 80, S1 = 1.1, and S2 = 1.3.
The null and alternative hypotheses are
H0: μ1 − μ2 ≤ 1.2
HA: μ1 − μ2 > 1.2
Pooled variance:
S_p² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2) = (79 × 1.1² + 79 × 1.3²)/(80 + 80 − 2) = 1.45
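The pooled-variance test for the health-drink example works out as:

```python
import math

# Pooled-variance two-sample t-test for the health-drink example
n1, n2 = 80, 80
x1, x2 = 7.6, 6.3   # mean increase in height per group
s1, s2 = 1.1, 1.3   # sample standard deviations
d0 = 1.2            # hypothesized difference under H0

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t = ((x1 - x2) - d0) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(sp2, 2))  # 1.45
print(round(t, 3))    # 0.525
```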
When the population standard deviations are unknown and unequal, the standard error of the difference is estimated by
S_u = √(S1²/n1 + S2²/n2)
and the test statistic is
t = ((X̄1 − X̄2) − (μ1 − μ2)) / √(S1²/n1 + S2²/n2)
Group                   Sample Size   Estimated Mean (in years)   Standard Deviation
Couples with no Degree  120           10.1 years                  2.4 years
Couples with Degree     100           9.5 years                   3.1 years
t = ((10.1 − 9.5) − 0) / √(2.4²/120 + 3.1²/100)
August 3, 2022
Bayes Optimal classifier
Suppose you find a coin and it’s ancient and very valuable.
Naturally, you ask yourself, ”What is the probability that this coin
comes up heads when I toss it?”
You toss it n = 10 times and obtain the following sequence of
outcomes: D = {H, T , T , H, H, H, T , T , T , T }. Based on these
samples, how would you estimate P(H)?
P(D | θ) = C(n_H + n_T, n_H) θ^{n_H} (1 − θ)^{n_T}    (1)
The maximum likelihood estimate is θ̂ = n_H/(n_H + n_T); for the sequence above, θ̂ = 4/10 = 0.4.
Assume you have a hunch that θ is close to 0.5, but your sample size is small, so you don't trust your estimate.
Simple fix: add 2m imaginary throws that would result in θ0 (e.g., θ0 = 0.5). Add m heads and m tails to your data:
θ̂ = (n_H + m) / (n_H + n_T + 2m)
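Both estimates for the coin sequence above, with the smoothing strength m as a free choice (m = 5 is an assumed value for illustration):

```python
# D = {H,T,T,H,H,H,T,T,T,T}: n_H = 4 heads, n_T = 6 tails
n_h, n_t = 4, 6
theta_mle = n_h / (n_h + n_t)
print(theta_mle)  # 0.4

# smoothing: 2m imaginary throws at theta_0 = 0.5 (m = 5 chosen for illustration)
m = 5
theta_smooth = (n_h + m) / (n_h + n_t + 2 * m)
print(theta_smooth)  # 0.45
```

Larger m pulls the estimate closer to 0.5; as the real sample grows, the imaginary throws matter less and less.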
P(θ) = θ^{α−1} (1 − θ)^{β−1} / B(α, β)    (8)
August 7, 2022
Supervised ML Setup
D = {(x1 , y1 ), . . . , (xn , yn )} ⊆ Rd × C
where:
Rd is the d-dimensional feature space
xi is the input vector of the i th sample
yi is the label of the i th sample
C is the label space
Tapas Kumar Mishra Bayes Classifier and Naive Bayes
The data points (xi , yi ) are drawn from some (unknown)
distribution P(X , Y ). Ultimately we would like to learn a function
h such that for a new pair (x, y ) ∼ P, we have h(x) = y with high
probability (or h(x) ≈ y ).
Our training consists of the set D = {(x1 , y1 ), . . . , (xn , yn )}
drawn from some unknown distribution P(X , Y ).
Because all pairs are sampled i.i.d., we obtain
Bayes Optimal Classifier
The Bayes optimal classifier predicts the most likely label under the true conditional distribution:
h_opt(x) = argmax_y P(y | x)
Assume for example an email x can either be classified as spam (+1) or ham (−1). For the same email x the conditional class probabilities are:
P(+1 | x) = 0.8,  P(−1 | x) = 0.2
In this case the Bayes optimal classifier would predict the label y* = +1 as it is most likely, and its error rate would be ε_BayesOpt = 0.2.
So how can we estimate P̂(y | x)?
Previously we have derived that P̂(y) = Σ_{i=1}^n I(y_i = y) / n.
Similarly, P̂(x) = Σ_{i=1}^n I(x_i = x) / n and P̂(y, x) = Σ_{i=1}^n I(x_i = x ∧ y_i = y) / n.
We can put these two together:
P̂(y | x) = P̂(y, x) / P̂(x) = Σ_{i=1}^n I(x_i = x ∧ y_i = y) / Σ_{i=1}^n I(x_i = x)
The Venn diagram illustrates that the MLE method estimates
P̂(y | x) = |C| / |B|
where B is the set of training points with x_i = x and C is the subset of B that also has y_i = y.
Problem: But there is a big problem with this method. The MLE estimate is only good if there are many training vectors with exactly the same features as x!
In high-dimensional spaces (or with continuous x), this never happens! So |B| → 0 and |C| → 0.
For example, the set needed to estimate P(y = yes | x1 = dallas, x2 = female, x3 = 5) is always empty!
Naive Bayes
P(y | x) = P(x | y) P(y) / P(x)
Estimating P(y) is easy. For example, if Y takes on discrete binary values, estimating P(Y) reduces to coin tossing: we simply need to count how many times we observe each outcome (in this case each class):
P̂(y = c) = Σ_{i=1}^n I(y_i = c) / n = π̂_c
Estimating P(x | y), however, is not easy! The additional assumption that we make is the Naive Bayes assumption.
Naive Bayes assumption
P(x | y) = ∏_{α=1}^d P(x_α | y), where x_α = [x]_α is the value for feature α
That is, the feature values are assumed to be conditionally independent given the label.
MLE
So, for now, let's pretend the Naive Bayes assumption holds. Then the Bayes classifier can be defined as
h(x) = argmax_y P(y) ∏_{α=1}^d P(x_α | y)
Now that we know how we can use our assumption to make the estimation of P(y | x) tractable, there are 3 notable cases in which we can use our naive Bayes classifier:
Case #1: Categorical features.
Case #2: Multinomial features.
Case #3: Continuous features (Gaussian Naive Bayes).
Case #1: Categorical features
Features: [x]_α ∈ {f_1, f_2, ..., f_{K_α}}
Each feature α falls into one of K_α categories. (Note that the case with binary features is just a specific case of this, where K_α = 2.) An example of such a setting may be medical data where one feature could be gender (male/female) or marital status (single/married/widowed).
Case #1: Categorical features
Model P(x_α | y): P(x_α = j | y = c) = [θ_jc]_α  and  Σ_{j=1}^{K_α} [θ_jc]_α = 1
Case #1: Categorical features
Parameter estimation (with smoothing parameter l):
[θ̂_jc]_α = (Σ_{i=1}^n I(y_i = c) I(x_iα = j) + l) / (Σ_{i=1}^n I(y_i = c) + l K_α)    (1)
Case #1: Categorical features
Prediction:
argmax_y P(y = c | x) ∝ argmax_y π̂_c ∏_{α=1}^d [θ̂_jc]_α   (with j = x_α)
= argmax_y (Σ_{i=1}^n I(y_i = c) / n) ∏_{α=1}^d (Σ_{i=1}^n I(y_i = c) I(x_iα = j) + l) / (Σ_{i=1}^n I(y_i = c) + l K_α)
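A minimal sketch of eq (1) and the prediction rule on a tiny categorical data set; the data and labels below are assumed for illustration:

```python
# Categorical Naive Bayes sketch (toy data, hypothetical):
# each sample has two categorical features
X = [("m", "single"), ("f", "married"), ("m", "married"),
     ("f", "single"), ("m", "single")]
y = [+1, -1, -1, +1, +1]
l, K = 1, 2  # smoothing parameter; both features have K = 2 categories here

classes = sorted(set(y))
prior = {c: y.count(c) / len(y) for c in classes}  # pi_c

def theta(j, c, a):
    # [theta_jc]_a: smoothed estimate of P(x_a = j | y = c), as in eq (1)
    num = sum(1 for xi, yi in zip(X, y) if yi == c and xi[a] == j) + l
    den = sum(1 for yi in y if yi == c) + l * K
    return num / den

def predict(x):
    scores = {c: prior[c] * theta(x[0], c, 0) * theta(x[1], c, 1)
              for c in classes}
    return max(scores, key=scores.get)

print(predict(("m", "single")))  # 1
```

The smoothing term l plays the same role as the imaginary coin throws earlier: it keeps unseen (feature, class) pairs from getting probability zero.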
Case #2: Multinomial features
Here the features are counts, e.g., bag-of-words counts for text: x_α is the number of times word α appears in a document containing m = Σ_{α=1}^d x_α words in total.
Case #2: Multinomial features
Parameter estimation (with smoothing parameter l):
θ̂_αc = (Σ_{i=1}^n I(y_i = c) x_iα + l) / (Σ_{i=1}^n I(y_i = c) m_i + l · d)    (3)
where m_i = Σ_{α=1}^d x_iα is the total count in sample i.
Case #2: Multinomial features
Prediction:
argmax_c P(y = c | x) ∝ argmax_c π̂_c ∏_{α=1}^d (θ̂_αc)^{x_α}
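A minimal sketch of eq (3) and this prediction rule; the toy word-count matrix below is assumed for illustration:

```python
import math

# Multinomial Naive Bayes sketch (toy data, hypothetical):
# d = 3 "words"; X[i][a] = count of word a in document i
X = [[3, 0, 1], [2, 1, 0], [0, 4, 1], [1, 3, 2]]
y = [+1, +1, -1, -1]
l, d = 1, 3  # smoothing parameter and number of features

classes = sorted(set(y))
prior = {c: y.count(c) / len(y) for c in classes}  # pi_c
theta = {}
for c in classes:
    docs = [xi for xi, yi in zip(X, y) if yi == c]
    total = sum(sum(doc) for doc in docs)          # sum of m_i over class c
    theta[c] = [(sum(doc[a] for doc in docs) + l) / (total + l * d)
                for a in range(d)]                 # eq (3)

def predict(x):
    scores = {c: prior[c] * math.prod(theta[c][a] ** x[a] for a in range(d))
              for c in classes}
    return max(scores, key=scores.get)

print(predict([2, 0, 0]))  # 1
```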
Case #3: Continuous features (Gaussian Naive Bayes)
Features: [x]_α ∈ R (real values).
Model P(x_α | y = c) as a Gaussian:
P(x_α | y = c) = N(μ_αc, σ_αc²) = (1/√(2π σ_αc²)) exp(−(x_α − μ_αc)²/(2σ_αc²))
where μ_αc and σ_αc² are estimated as the sample mean and variance of feature α over the training points of class c.
Prediction: argmax_c π̂_c ∏_{α=1}^d P(x_α | y = c)
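A minimal Gaussian Naive Bayes sketch, fitting one Gaussian per (feature, class) pair; the toy data below is assumed for illustration:

```python
import statistics

# Gaussian Naive Bayes sketch (toy data, hypothetical)
X = [[1.0, 2.1], [0.9, 1.9], [3.0, 4.2], [3.2, 3.8]]
y = [+1, +1, -1, -1]
classes = sorted(set(y))

# per-class, per-feature (mean, std) estimated from the training points
params = {}
for c in classes:
    rows = [xi for xi, yi in zip(X, y) if yi == c]
    cols = list(zip(*rows))
    params[c] = [(statistics.mean(col), statistics.pstdev(col)) for col in cols]

def predict(x):
    def score(c):
        s = y.count(c) / len(y)  # pi_c
        for xa, (mu, sd) in zip(x, params[c]):
            s *= statistics.NormalDist(mu, sd).pdf(xa)
        return s
    return max(classes, key=score)

print(predict([1.1, 2.0]))  # 1
```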
Naive Bayes is a linear classifier
That is, w⊤x + b > 0 ⟺ h(x) = +1.
Naive Bayes is a linear classifier
As before, we define P(x_α | y = +1) ∝ θ_{α+}^{x_α} and P(Y = +1) = π_+:
Naive Bayes is a linear classifier
Simplifying this further leads to
w⊤x + b > 0
⟺ Σ_{α=1}^d [x]_α (log(θ_{α+}) − log(θ_{α−})) + log(π_+) − log(π_−) > 0
   (plugging in the definitions [w]_α = log(θ_{α+}) − log(θ_{α−}) and b = log(π_+) − log(π_−))
⟺ exp( Σ_{α=1}^d [x]_α (log(θ_{α+}) − log(θ_{α−})) + log(π_+) − log(π_−) ) > 1
   (exponentiating both sides)
⟺ ( ∏_{α=1}^d exp(log θ_{α+}^{[x]_α}) / exp(log θ_{α−}^{[x]_α}) ) · exp(log(π_+)) / exp(log(π_−)) > 1
   (because a log(b) = log(b^a) and exp(a − b) = e^a / e^b)
Naive Bayes is a linear classifier
Simplifying this further leads to
⟺ ( ∏_{α=1}^d θ_{α+}^{[x]_α} / θ_{α−}^{[x]_α} ) · (π_+ / π_−) > 1
   (because exp(log(a)) = a and e^{a+b} = e^a e^b)
⟺ ( ∏_{α=1}^d P([x]_α | Y = +1) · π_+ ) / ( ∏_{α=1}^d P([x]_α | Y = −1) · π_− ) > 1
   (because P([x]_α | Y = −1) ∝ θ_{α−}^{x_α})
⟺ P(x | Y = +1) π_+ / (P(x | Y = −1) π_−) > 1   (by the naive Bayes assumption)
⟺ P(Y = +1 | x) / P(Y = −1 | x) > 1   (by Bayes rule; π_+ = P(Y = +1))
⟺ P(Y = +1 | x) > P(Y = −1 | x)
⟺ argmax_y P(Y = y | x) = +1
i.e., the point x lies on the positive side of the hyperplane iff Naive Bayes predicts +1.
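The equivalence above can be checked numerically: with Bernoulli features, the naive Bayes log-odds collapse to a linear function w·x + b. A minimal sketch (all parameter values below are made up for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
# Hypothetical Bernoulli parameters: theta_pos[a] = P([x]_a = 1 | Y = +1), etc.
theta_pos = rng.uniform(0.1, 0.9, d)
theta_neg = rng.uniform(0.1, 0.9, d)
pi_pos, pi_neg = 0.6, 0.4  # class priors

def posterior_ratio(x):
    # P(x | Y = +1) pi+ / (P(x | Y = -1) pi-) under the naive Bayes assumption
    num = np.prod(theta_pos**x * (1 - theta_pos)**(1 - x)) * pi_pos
    den = np.prod(theta_neg**x * (1 - theta_neg)**(1 - x)) * pi_neg
    return num / den

# Taking logs of the product turns the decision rule into a hyperplane w.x + b > 0
w = np.log(theta_pos / theta_neg) - np.log((1 - theta_pos) / (1 - theta_neg))
b = np.log(pi_pos / pi_neg) + np.log((1 - theta_pos) / (1 - theta_neg)).sum()

for _ in range(200):
    x = rng.integers(0, 2, size=d)
    # Naive Bayes predicts +1 exactly when x lies on the positive side
    assert (posterior_ratio(x) > 1) == (w @ x + b > 0)
```

The loop verifies, on random binary inputs, that the posterior ratio test and the hyperplane test always agree.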
Validation of the Linear Regression Model
Validation of the Simple Linear Regression Model
The above measures and tests are essential, but not exhaustive.
Coefficient of Determination (R-Square or R²)

Yi = β₀ + β₁Xi + εi

The variation in Y splits into the variation in Y explained by the model and the variation in Y not explained by the model.
In the absence of a predictive model for Yi, users would fall back on the mean value of Yi. Thus, the total variation is measured as the difference between Yi and the mean value of Yi (i.e., Yi − Ȳ).
Description of total variation, explained variation and unexplained variation:
• Total variation (SST): Yi − Ȳ. Total variation is the difference between the actual value and the mean value.
• Variation explained by the model (SSR): Ŷi − Ȳ. Variation explained by the model is the difference between the estimated value of Yi and the mean value of Y.
• Variation not explained by the model (SSE): Yi − Ŷi. Variation not explained by the model is the difference between the actual value and the predicted value of Yi (error in prediction).
The relationship between the total variation, explained variation and the unexplained variation is given as follows:

Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)
(total variation in Y) = (variation in Y explained by the model) + (variation in Y not explained by the model)

Squaring and summing over i gives SST = SSR + SSE, where SST is the sum of squares of total variation, SSR is the sum of squares of variation explained by the regression model and SSE is the sum of squares of errors or unexplained variation.
Coefficient of Determination or R-Square
The coefficient of determination (R²) is given by

R² = (explained variation) / (total variation) = SSR / SST = Σ_{i=1}^n (Ŷi − Ȳ)² / Σ_{i=1}^n (Yi − Ȳ)²

Equivalently,

R² = 1 − SSE/SST = 1 − Σ_{i=1}^n (Yi − Ŷi)² / Σ_{i=1}^n (Yi − Ȳ)²
Coefficient of Determination or R-Square
Thus, R2 is the proportion of variation in response variable Y explained
by the regression model. Coefficient of determination (R2) has the
following properties:
Year | Facebook users | Deaths due to helium poisoning (UK)
2004 | 1    | 2
2005 | 6    | 2
2006 | 12   | 2
2007 | 58   | 2
2008 | 145  | 11
2009 | 360  | 21
2010 | 608  | 31
2011 | 845  | 40
2012 | 1056 | 51
Facebook users versus helium poisoning in UK
The R-square value for the regression model between the number of deaths due to helium poisoning in the UK and the number of Facebook users is 0.9928. That is, 99.28% of the variation in the number of deaths due to helium poisoning in the UK is explained by the number of Facebook users. This is a spurious correlation: a high R² by itself does not establish any causal relationship.
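The R² formulas can be checked on a small fitted line; the data below is made up for illustration:

```python
import numpy as np

# Hypothetical data: Y roughly linear in X, with a little noise
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Fit simple linear regression by least squares
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

SST = np.sum((Y - Y.mean())**2)      # total variation
SSR = np.sum((Y_hat - Y.mean())**2)  # variation explained by the model
SSE = np.sum((Y - Y_hat)**2)         # unexplained variation (errors)

r2 = 1 - SSE / SST
# SST = SSR + SSE, and R^2 = SSR/SST = 1 - SSE/SST
assert np.isclose(SST, SSR + SSE)
assert np.isclose(r2, SSR / SST)
```

Both definitions of R² agree because the decomposition SST = SSR + SSE holds for least-squares fits.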
Hypothesis Test for Regression Coefficient (t-Test)
In the above equation, Se is the standard error of estimate (or standard error of the residuals), which measures the accuracy of prediction and is given by

Se = √( Σ_{i=1}^n ei² / (n − 2) ) = √( Σ_{i=1}^n (Yi − Ŷi)² / (n − 2) )

The standard error of β̂₁ is

Se(β̂₁) = Se / √( Σ_{i=1}^n (Xi − X̄)² ) = √( Σ_{i=1}^n (Yi − Ŷi)² / (n − 2) ) / √( Σ_{i=1}^n (Xi − X̄)² )
The null and alternative hypotheses for the SLR model can be stated as follows:
H0: There is no relationship between X and Y
HA: There is a relationship between X and Y
• β₁ = 0 would imply that there is no linear relationship between the response variable Y and the explanatory variable X. Thus, the null and alternative hypotheses can be restated as follows:
H0: β₁ = 0
HA: β₁ ≠ 0
• The corresponding t-statistic is given as

t = (β̂₁ − β₁) / Se(β̂₁) = (β̂₁ − 0) / Se(β̂₁) = β̂₁ / Se(β̂₁)
Confidence Interval for Regression Coefficients β₀ and β₁
The standard errors of the estimates of β₀ and β₁ are given by

Se(β̂₀) = Se × √( Σ_{i=1}^n Xi² / (n × SSX) )        Se(β̂₁) = Se / √(SSX)

where Se = √( Σ_{i=1}^n (Yi − Ŷi)² / (n − 2) ) is the standard error of residuals and SSX = Σ_{i=1}^n (Xi − X̄)².

The interval estimate or (1 − α)100% confidence interval for β₀ and β₁ is given by

β̂₀ ∓ t_{α/2, n−2} Se(β̂₀)        β̂₁ ∓ t_{α/2, n−2} Se(β̂₁)
Multiple Linear Regression
• Multiple linear regression means linear in the regression parameters (the beta values). The following are examples of multiple linear regression models:

Y = β₀ + β₁x₁ + β₂x₂ + ... + β_k x_k + ε
Y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + β₄x₂² + ... + β_k x_k + ε
R² = 1 − SSE/SST = 1 − Σ_{i=1}^n (Yi − Ŷi)² / Σ_{i=1}^n (Yi − Ȳ)²

• SSE is the sum of squares of errors and SST is the sum of squares of total deviation. In the case of MLR, SSE decreases as the number of explanatory variables increases, while SST remains constant.

Adjusted R-Square = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
Statistical Significance of Individual Variables in MLR – t-test

β̂ = (XᵀX)⁻¹XᵀY

Alternatively,
• H0: βi = 0
• HA: βi ≠ 0
The corresponding test statistic is given by

t = (β̂i − 0) / Se(β̂i) = β̂i / Se(β̂i)
Validation of Overall Regression Model – F-test
H0: β₁ = β₂ = β₃ = … = β_k = 0
H1: Not all βs are zero.

F = [(SST − SSE)/k] / [SSE/(n − k − 1)] ~ F_{k, n−k−1}

F-test for the overall fit of the model
• The critical value F(α, k, n−k−1) can be found from an F-table.
• The existence of a regression relation by itself does not assure that useful predictions can be made by using it.
• Note that when k = 1, this test reduces to the F-test for testing in simple linear regression whether or not β₁ = 0.
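For k = 1 the relationship between the two tests can be checked numerically: the F-statistic equals the square of the t-statistic. A sketch on made-up data:

```python
import numpy as np

# Hypothetical data for a simple linear regression (k = 1)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
Y = np.array([2.3, 4.1, 5.8, 8.2, 9.9, 12.3, 13.8, 16.1])
n, k = len(X), 1

SSX = np.sum((X - X.mean())**2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / SSX
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

SSE = np.sum((Y - Y_hat)**2)
SST = np.sum((Y - Y.mean())**2)

Se = np.sqrt(SSE / (n - 2))   # standard error of residuals
Se_b1 = Se / np.sqrt(SSX)     # standard error of beta1-hat
t = b1 / Se_b1                # t-statistic for H0: beta1 = 0

F = ((SST - SSE) / k) / (SSE / (n - k - 1))
# For k = 1 the overall F-test reduces to the t-test: F = t^2
assert np.isclose(F, t**2)
```

The identity F = t² holds algebraically for any simple-regression dataset, since SST − SSE = SSR = β̂₁²·SSX.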
Linear Regression
Supervised learning
Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Delhi, India:
Tapas Kumar Mishra Linear Regression
Given data like this, how can we learn to predict the prices of other
houses in Delhi, as a function of the size of their living areas?
To establish notation for future use, we'll use
x^(i) to denote the input variables (living area in this example), also called input features, and
y^(i) to denote the output or target variable that we are trying to predict (price).
A pair (x^(i), y^(i)) is called a training example, and
the dataset that we'll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set.
We will also use X to denote the space of input values, and Y the space of output values.
In this example, X = Y = R.
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a good predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis.
When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem.
When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.
Linear Regression
To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house:
Here, the x's are two-dimensional vectors in R².
For instance, x₁^(i) is the living area of the i-th house in the training set, and x₂^(i) is its number of bedrooms.
To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer.
As an initial choice, let's say we decide to approximate y as a linear function of x:

hθ(x) = θ0 + θ1 x1 + θ2 x2        (1)
Here, the θi's are the parameters (also called weights) parameterizing the space of linear functions mapping from X → Y.
When there is no risk of confusion, we will drop the θ subscript in hθ(x), and write it more simply as h(x).
To simplify our notation, we also introduce the convention of letting x0 = 1 (this is the intercept term), so that

h(x) = Σ_{j=0}^n θj xj = θᵀx.        (2)
Now, given a training set, how do we pick, or learn, the parameters θ?
One reasonable method seems to be to make h(x) close to y, at least for the training examples we have.
To formalize this, we will define a function that measures, for each value of the θs, how close the h(x^(i))s are to the corresponding y^(i)s.
We define the cost function:

J(θ) = (1/2) Σ_{i=1}^m (h(x^(i)) − y^(i))².        (3)
LMS Algorithm: least mean square
LMS for a single instance
Let's first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. We have:
For a single training example, this gives the update rule:

θj := θj + α(y^(i) − h(x^(i))) xj^(i)        (LMS update rule)        (5)
LMS for the full dataset
The reader can easily verify that the quantity in the summation in the update rule above is just ∂J(θ)/∂θj (for the original definition of J).
So, this is simply gradient descent on the original cost function J.
This method looks at every example in the entire training set on every step, and is called batch gradient descent.
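Batch gradient descent can be sketched in a few lines; the synthetic data, learning rate, and iteration count below are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 2
# x0 = 1 intercept column plus n synthetic features
X = np.hstack([np.ones((m, 1)), rng.uniform(0, 1, (m, n))])
theta_true = np.array([3.0, 1.5, -2.0])
y = X @ theta_true + rng.normal(0, 0.01, m)

theta = np.zeros(n + 1)
alpha = 0.5
for _ in range(2000):
    # Gradient of J(theta) = (1/2) sum_i (h(x^(i)) - y^(i))^2, scaled by 1/m
    # so the step size does not depend on the training-set size
    grad = X.T @ (X @ theta - y) / m
    theta -= alpha * grad

# Gradient descent recovers the parameters that generated the data
assert np.allclose(theta, theta_true, atol=0.05)
```

Every step uses the whole training set (the matrix product X.T @ (...) sums over all m examples), which is exactly what makes this the batch variant.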
J(θ) = (1/2) Σ_{i=1}^m (h(x^(i)) − y^(i))².        (6)
The ellipses shown above are the contours of a quadratic function.
Also shown is the trajectory taken by gradient descent, which was initialized at (48, 30).
The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through.
When we run batch gradient descent to fit θ on our previous dataset, to learn to predict housing price as a function of living area, we obtain θ0 = 71.27, θ1 = 0.1345. If we plot hθ(x) as a function of x (area), along with the training data, we obtain the following figure:
Stochastic Gradient Descent
Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if m is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at.
Often, stochastic gradient descent gets θ close to the minimum much faster than batch gradient descent.
But it may never converge.
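The stochastic variant applies the LMS update one example at a time. A sketch on synthetic 1-D data (the learning rate and epoch count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200
x = rng.uniform(0, 1, m)
X = np.column_stack([np.ones(m), x])       # intercept term x0 = 1
y = 4.0 + 2.5 * x + rng.normal(0, 0.05, m)

theta = np.zeros(2)
alpha = 0.1
for epoch in range(200):
    for i in rng.permutation(m):           # one example per update
        err = y[i] - theta @ X[i]
        theta += alpha * err * X[i]        # LMS update for a single example

# theta wanders near (4.0, 2.5) rather than converging exactly,
# but gets close after only a few passes over the data
assert np.allclose(theta, [4.0, 2.5], atol=0.15)
```

With a constant step size the iterates keep oscillating around the minimum, which is the "may never converge" caveat from the slide; decaying alpha over time is the usual fix.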
Normal Equation
We need a θ such that Xθ = ~y. So, we need to solve for θ (assuming XᵀX is invertible):

Xθ = ~y
⟹ XᵀXθ = Xᵀ~y
⟹ (XᵀX)⁻¹(XᵀX)θ = (XᵀX)⁻¹Xᵀ~y
⟹ θ = (XᵀX)⁻¹Xᵀ~y
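The closed form can be evaluated directly; in practice one solves the linear system XᵀXθ = Xᵀy rather than explicitly inverting XᵀX. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 100
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.01, m)

# theta = (X^T X)^{-1} X^T y, solved as a linear system for numerical stability
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with NumPy's least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(theta, theta_lstsq)
```

Unlike gradient descent, this requires no learning rate and no iterations, at the cost of an O(n³) solve that becomes expensive when the number of features is large.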
Probabilistic Interpretation
In words, we assume that the data is drawn from a "line" wᵀx through the origin (one can always add a bias/offset through an additional dimension).
For each data point with features x^(i), the label y is drawn from a Gaussian with mean wᵀx^(i) and variance σ².
Our task is to estimate the slope w from the data.
Estimating with MLE
Estimating with MAP
w = argmax_w [∏_{i=1}^m P(y^(i) | x^(i), w) P(x^(i))] P(w)
  = argmax_w [∏_{i=1}^m P(y^(i) | x^(i), w)] P(w)
  = argmax_w Σ_{i=1}^m log P(y^(i) | x^(i), w) + log P(w)
  = argmin_w (1/(2σ²)) Σ_{i=1}^m (wᵀx^(i) − y^(i))² + (1/(2τ²)) wᵀw
  = argmin_w (1/m) Σ_{i=1}^m (wᵀx^(i) − y^(i))² + λ‖w‖₂²,        where λ = σ²/(mτ²)
w = argmin_w (1/m) Σ_{i=1}^m (wᵀx^(i) − y^(i))² + λ‖w‖₂²,        λ = σ²/(mτ²)
Ordinary Least Squares:
• min_w (1/m) Σ_{i=1}^m (xiᵀw − yi)²
• Squared loss.
• No regularization.
• Closed form: w = (XᵀX)⁻¹Xᵀ~y

Ridge Regression:
• min_w (1/m) Σ_{i=1}^m (xiᵀw − yi)² + λ‖w‖₂²
• Squared loss.
• ℓ2-regularization.
• Closed form: w = (XᵀX + λI)⁻¹Xᵀ~y
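The two closed forms differ only by the λI term, and the regularizer shrinks the solution toward zero. A comparison on illustrative synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 60, 3
X = rng.normal(size=(m, d))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, m)
lam = 1.0

# OLS closed form: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge closed form: w = (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# l2-regularization shrinks the weight vector toward zero
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```

A side benefit of the λI term is that XᵀX + λI is always invertible for λ > 0, so ridge remains well-posed even when XᵀX is singular.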
Locally weighted linear regression
Consider the problem of predicting y from x ∈ R. The leftmost figure below shows the result of fitting a line y = θ0 + θ1x to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good. This is underfitting: the structure of the data is not captured by the model.
Instead, if we had added an extra feature x², and fit y = θ0 + θ1x + θ2x², then we obtain a slightly better fit to the data. Naively, it might seem that the more features we add, the better.
However, there is a danger of adding too many features. The figure below is the result of fitting a 5th-order polynomial y = Σ_{j=0}^5 θj x^j. Even though the fitted curve passes through the data perfectly, it is not a good predictor of y (housing prices) for different x (living area). This is overfitting.
In the original linear regression algorithm, to make a prediction at a query point x (to evaluate h(x)), we would:
1. Fit θ to minimize Σᵢ (y^(i) − θᵀx^(i))².
2. Output θᵀx.
In contrast, the locally weighted linear regression algorithm does the following:
1. Fit θ to minimize Σᵢ z^(i) (y^(i) − θᵀx^(i))².
2. Output θᵀx.
Here the z^(i) are non-negative valued weights.
If z^(i) is large for a particular value of i, then in picking θ, we will try hard to make (y^(i) − θᵀx^(i))² small.
If z^(i) is small for a particular value of i, then (y^(i) − θᵀx^(i))² is ignored in the fit.
A fairly standard choice for the weights is

z^(i) = exp( −(x^(i) − x)² / (2τ²) )

• The weights depend on the query point x.
• If |x^(i) − x| is small, z^(i) is close to 1, and if |x^(i) − x| is large, z^(i) is small.
• Hence, θ is chosen giving a much higher weight to the training examples close to the query point x.
• τ is the bandwidth parameter.
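Putting the pieces together, locally weighted regression refits θ for each query point using these Gaussian weights. A minimal sketch on 1-D data, with x0 = 1 added for an intercept (the data and the bandwidth τ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(0, 0.05, 100)   # curved data a global line underfits
X = np.column_stack([np.ones_like(x), x])  # intercept column
tau = 0.3                                  # bandwidth parameter

def lwr_predict(x_query):
    # z^(i) = exp(-(x^(i) - x)^2 / (2 tau^2)): weights favor nearby points
    z = np.exp(-(x - x_query)**2 / (2 * tau**2))
    W = np.diag(z)
    # Weighted least squares: minimize sum_i z_i (y_i - theta^T x_i)^2
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta @ np.array([1.0, x_query])

# Near x = pi/2, sin(x) = 1; the local linear fit tracks the curve closely
assert abs(lwr_predict(np.pi / 2) - 1.0) < 0.1
```

Note the non-parametric flavor: there is no single θ; the whole training set must be kept around, and a new weighted fit is solved at every query point.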
Parametric vs non-Parametric
Logistic Regression
Classification problem
Tapas Kumar Mishra Logistic Regression
Logistic regression

hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

where g(a) = 1/(1 + e^(−a)) is the logistic/sigmoid function.
Note that g(z) → 1 as z → ∞ and g(z) → 0 as z → −∞.
Moreover, g(z) (hence h(x)) is always bounded between 0 and 1.
We always set x0 = 1 so that θᵀx = θ0 + Σ_{j=1}^n θj xj.
Useful property: g′(z) = g(z)(1 − g(z)).
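Both the boundedness and the derivative identity can be verified numerically; a quick sketch:

```python
import numpy as np

def g(z):
    # logistic/sigmoid function
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 101)

# g(z) is strictly between 0 and 1
assert np.all((g(z) > 0) & (g(z) < 1))

# g'(z) = g(z)(1 - g(z)): compare against a central finite difference
h = 1e-6
numeric = (g(z + h) - g(z - h)) / (2 * h)
assert np.allclose(numeric, g(z) * (1 - g(z)), atol=1e-8)
```

The identity g′ = g(1 − g) is what makes the gradient of the log-likelihood below come out so cleanly.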
Let us assume that
P(y = 1 | x; θ) = hθ(x);
P(y = 0 | x; θ) = 1 − hθ(x).
This can be written compactly as

P(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y).
Our job is to maximize L(θ).
To maximize the likelihood, we will use gradient ascent:

θ := θ + α∇θ ℓ(θ).

We start by taking just one training example (x, y) and take derivatives to derive the stochastic gradient ascent rule.
This gives the stochastic ascent rule

θj := θj + α(y^(i) − hθ(x^(i))) xj^(i).

This is a similar-looking rule compared to the LMS update rule!
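The stochastic ascent rule can be run as-is on synthetic data; everything below (data, learning rate, epoch count) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(6)
m = 500
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # x0 = 1
theta_true = np.array([-0.5, 2.0])
p = 1.0 / (1.0 + np.exp(-(X @ theta_true)))
y = (rng.uniform(size=m) < p).astype(float)             # labels drawn from the model

theta = np.zeros(2)
alpha = 0.05
for epoch in range(100):
    for i in rng.permutation(m):
        h = 1.0 / (1.0 + np.exp(-(theta @ X[i])))
        theta += alpha * (y[i] - h) * X[i]              # stochastic ascent rule

# The learned hyperplane theta^T x = 0 separates the classes well
pred = (X @ theta > 0).astype(float)
accuracy = (pred == y).mean()
assert accuracy > 0.75
```

Despite the identical-looking update, this is not the LMS algorithm: h is now the sigmoid of θᵀx rather than θᵀx itself, so a different objective (the log-likelihood) is being climbed.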
Logistic Regression is the discriminative counterpart to Naive Bayes.
In Naive Bayes, we first model P(x|y) for each label y, and then obtain the decision boundary that best discriminates between these two distributions.
In Logistic Regression we do not attempt to model the data distribution P(x|y); instead, we model P(y|x) directly.
The fact that we don't make any assumption about P(x|y) allows logistic regression to be more flexible, but such flexibility also requires more data to avoid overfitting.
Typically, in scenarios with little data and if the modeling assumption is appropriate, Naive Bayes tends to outperform Logistic Regression.
However, as data sets become large, logistic regression often outperforms Naive Bayes, which suffers from the fact that the assumptions made on P(x|y) are probably not exactly correct.
If the assumptions hold exactly, i.e. the data is truly drawn from the distribution that we assumed in Naive Bayes, then Logistic Regression and Naive Bayes converge to the exact same result in the limit.
Optimizing the training process:
underfitting, overfitting, testing, and regularization
• Let’s say that we have to study for a test.
• Several things could go wrong during our study process.
• Maybe we didn’t study enough. There’s no way to fix that, and we’ll likely
perform poorly on our test. ---------- Underfitting
• What if we studied a lot, but in the wrong way? For example, instead of focusing
on learning, we decided to memorize the entire textbook word for word. Will
we do well on our test? It’s likely that we won’t, because we simply memorized
everything without learning. ----------Overfitting
• The best option, of course, would be to study for the exam properly, in a
way that enables us to answer new questions on the topic that we haven’t
seen before. -----------Generalization
• Notice that model 1 is too simple, because it is a line trying to fit a quadratic dataset. There is no way
we’ll find a good line to fit this dataset, because the dataset simply does not look like a line.
Therefore, model 1 is a clear example of underfitting.
• Model 2, in contrast, fits the data pretty well. This model neither overfits nor underfits.
• Model 3 fits the data extremely well, but it completely misses the point. The data is meant to look
like a parabola with a bit of noise, and the model draws a very complicated polynomial of degree 10
that manages to go through each one of the points but doesn’t capture the essence of the data.
Model 3 is a clear example of overfitting.
How do we get the computer to pick the right
model?
By testing
• Testing a model consists of picking a small set of the points in the dataset and choosing to use
them not for training the model but for testing the model’s performance. This set of points is
called the testing set.
• The remaining set of points (the majority), which we use for training the
model, is called the training set.
• Once we’ve trained the model on the training set, we use the
testing set to evaluate the model.
• In this way, we make sure that the model is good at generalizing
to unseen data, as opposed to memorizing the training set.
• Going back to the exam analogy, let’s imagine training and testing this way.
• Let’s say that the book we are studying for the exam has
100 questions at the end.
• We pick 80 of them to train, which means we study them carefully, look
up the answers, and learn them.
• Then we use the remaining 20 questions to test ourselves—we
try to answer them without looking at the book, as in an exam setting.
• Looking at the top row we can see that model 1 has a large training
error, model 2 has a small training error, and model 3 has a tiny
training error (zero, in fact). Thus, model 3 does the best job on the
training set.
• Model 1 still has a large testing error, meaning that this is simply a bad
model, underperforming on both the training and the testing set: it
underfits.
Can we use our testing data for training the model? No.
• We broke the golden rule in the previous example.
• Recall that we had three polynomial regression models: one of degree
1, one of degree 2, and one of degree 10, and we didn’t know which one
to pick.
• We used our training data to train the three models, and then we used
the testing data to decide which model to pick.
• We are not supposed to use the testing data to train our model or to
make any decisions on the model or its hyperparameters.
Solution: Validation Set
We break our dataset into the following three sets:
• Training set: for training all our models
• Validation set: for making decisions on which model to use
• Testing set: for checking how well our model did
• Imagine that we have a different and much more complex dataset, and we are trying to build a
polynomial regression model to fit it. We want to decide the degree of our model among the
numbers between 0 and 10 (inclusive).
• The way to decide which model to use is to pick the one that has the smallest validation error.
• However, plotting the training and validation errors can give us some valuable information and
help us examine trends.
The model
complexity graph
Another way to avoid overfitting: regularization
Now it is clear that roofer 2 is the best one, which means that optimizing performance and
complexity at the same time yields good results that are also as simple as possible. This is what
regularization is about: measuring performance and complexity with two different error functions,
and adding them to get a more robust error function.
Regularization- Measuring how complex a model is: L1 and L2 norm
• In the roofer analogy, our goal was to find a roofer that provided both good quality
and low complexity. We did this by minimizing the sum of two numbers: the measure of quality
and the measure of complexity. Regularization consists of applying the same principle to our
machine learning model.
• Regression error: a measure of the quality of the model. In this case, it can be the absolute
or square error
• Regularization term: a measure of the complexity of the model. It can be the L1 or the L2
norm of the model.
• Error = Regression error + λ · Regularization term
• λ is the regularization hyperparameter.
• Lasso regression error = Regression error + λ L1 norm
• Ridge regression error = Regression error + λ L2 norm
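Ridge regression has a closed-form solution, which makes the effect of λ easy to see directly. Below is a minimal NumPy sketch (the `ridge_fit` helper name and the toy data are made up for illustration; this is not code from the lecture):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form L2-regularized least squares:
    # w = (X^T X + lam * I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.eye(2)                 # toy design matrix
y = np.array([2.0, 4.0])
print(ridge_fit(X, y, 0.0))   # lam = 0: the unregularized solution [2, 4]
print(ridge_fit(X, y, 1.0))   # lam = 1: weights shrink towards zero, [1, 2]
```

Increasing λ trades a larger regression error for smaller (simpler) weights, exactly the performance-versus-complexity trade-off described above. Lasso (the L1 norm) has no closed form and is usually solved iteratively.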
Regularization- Effects of L1 and L2 regularization
Tapas Kumar Mishra Bias-Variance Tradeoff
As usual, we are given a dataset D = {(x1 , y1 ), . . . , (xn , yn )},
drawn i.i.d. from some distribution P(X , Y ). Throughout this
lecture we assume a regression setting, i.e. y ∈ R.
In this lecture we will decompose the generalization error of a
classifier into three rather interpretable terms.
Before we do that, let us consider that for any given input x
there might not exist a unique label y.
For example, if your vector x describes the features of a house (e.g.
#bedrooms, square footage, ...) and the label y its price, you
could imagine two houses with identical descriptions selling for
different prices.
So for any given feature vector x, there is a distribution over
possible labels. We therefore define the following, which will
come in useful later on:
Expected Label (given x ∈ R^d):
ȳ(x) = E_{y|x}[Y] = ∫_y y Pr(y|x) dy.
The expected label denotes the label you would expect to obtain,
given a feature vector x.
We draw our training set D, consisting of n inputs, i.i.d. from the
distribution P. As a second step we typically call some machine
learning algorithm A on this data set to learn a hypothesis (aka
classifier). Formally, we denote this process as h_D = A(D).
For a given h_D, learned on data set D with algorithm A, we can
compute the generalization error (as measured in squared loss) as
follows:
Expected Test Error (given h_D):
E_{(x,y)∼P}[(h_D(x) − y)²] = ∫_x ∫_y (h_D(x) − y)² Pr(x, y) dy dx.
The previous statement is true for a given training set D.
However, remember that D itself is drawn from P^n, and is
therefore a random variable.
Further, h_D is a function of D, and is therefore also a random
variable. And we can of course compute its expectation:
Expected Classifier (given A):
h̄ = E_{D∼P^n}[h_D] = ∫_D h_D Pr(D) dD
We can also use the fact that h_D is a random variable to compute
the expected test error given only A, taking the expectation also
over D.
Expected Test Error (given A):
E_{(x,y)∼P, D∼P^n}[(h_D(x) − y)²] = ∫_D ∫_x ∫_y (h_D(x) − y)² P(x, y) P(D) dx dy dD
To be clear, D is our training points and the (x, y) pairs are the
test points.
We are interested in exactly this expression, because it evaluates
the quality of a machine learning algorithm A with respect to a
data distribution P(X , Y ). In the following we will show that this
expression decomposes into three meaningful terms.
Decomposition of Expected Test Error
E_{x,y,D}[(h_D(x) − y)²] = E_{x,y,D}[((h_D(x) − h̄(x)) + (h̄(x) − y))²]
= E_{x,D}[(h_D(x) − h̄(x))²] + 2 E_{x,y,D}[(h_D(x) − h̄(x))(h̄(x) − y)] + E_{x,y}[(h̄(x) − y)²]
where the middle term equals 0.
The middle term of the above equation is 0, as we show below:
E_{x,y,D}[(h_D(x) − h̄(x))(h̄(x) − y)]
= E_{x,y}[E_D[h_D(x) − h̄(x)] (h̄(x) − y)]
= E_{x,y}[(E_D[h_D(x)] − h̄(x)) (h̄(x) − y)]
= E_{x,y}[(h̄(x) − h̄(x)) (h̄(x) − y)]
= E_{x,y}[0]
= 0
Returning to the earlier expression, we’re left with the variance and
another term:
E_{x,y,D}[(h_D(x) − y)²] = E_{x,D}[(h_D(x) − h̄(x))²] (Variance) + E_{x,y}[(h̄(x) − y)²]
We can break down the second term in the above equation as
follows:
E_{x,y}[(h̄(x) − y)²] = E_{x,y}[((h̄(x) − ȳ(x)) + (ȳ(x) − y))²]
= E_{x,y}[(ȳ(x) − y)²] (Noise) + E_x[(h̄(x) − ȳ(x))²] (Bias²)
+ 2 E_{x,y}[(h̄(x) − ȳ(x))(ȳ(x) − y)] (= 0)
The third term in the equation above is 0, as we show below:
E_{x,y}[(h̄(x) − ȳ(x))(ȳ(x) − y)]
= E_x[E_{y|x}[ȳ(x) − y] (h̄(x) − ȳ(x))]
= E_x[(ȳ(x) − E_{y|x}[y]) (h̄(x) − ȳ(x))]
= E_x[(ȳ(x) − ȳ(x)) (h̄(x) − ȳ(x))]
= E_x[0]
= 0
This gives us the decomposition of expected test error as follows:
E_{x,y,D}[(h_D(x) − y)²] (Expected Test Error)
= E_{x,D}[(h_D(x) − h̄(x))²] (Variance)
+ E_{x,y}[(ȳ(x) − y)²] (Noise)
+ E_x[(h̄(x) − ȳ(x))²] (Bias²)
Tapas Kumar Mishra Bias-Variance Tradeoff
This gives us the decomposition of expected test error as follows
h i h 2 i h i
Ex,y ,D (hD (x) − y )2 = Ex,D hD (x) − h̄(x) + Ex,y (ȳ (x) − y )2 +
| {z } | {z } | {z }
Expected Test Error Variance Noise
h 2 i
Ex h̄(x) − ȳ (x)
| {z }
Bias2
13/22
Tapas Kumar Mishra Bias-Variance Tradeoff
Variance: E_{x,D}[(h_D(x) − h̄(x))²]
Captures how much your classifier changes if you train on a
different training set.
How “over-specialized” is your classifier to a particular training set
(overfitting)?
If we have the best possible model for our training data, how far
off are we from the average classifier?
Bias²: E_x[(h̄(x) − ȳ(x))²]
What is the inherent error that you obtain from your classifier even
with infinite training data?
This is due to your classifier being “biased” towards a particular kind of
solution (e.g. a linear classifier).
In other words, bias is inherent to your model.
Noise: E_{x,y}[(ȳ(x) − y)²]
How big is the data-intrinsic noise?
This error measures ambiguity due to your data distribution and
feature representation. You can never beat this; it is an aspect of
the data.
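The decomposition can also be checked numerically. The Monte Carlo sketch below uses made-up choices (a quadratic true function, Gaussian noise, and degree-1 polynomial fits, which underfit): it estimates variance, bias², and noise, and verifies that they sum to the expected test error.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2          # hypothetical "true" function ȳ(x): a parabola
sigma = 0.5                   # standard deviation of the label noise
xs = np.linspace(-1, 1, 50)   # fixed test inputs

# Draw many training sets D from P and learn h_D = A(D) on each;
# here A fits a degree-1 polynomial, i.e. a line, to the parabola.
preds = []
for _ in range(200):
    x_tr = rng.uniform(-1, 1, 30)
    y_tr = f(x_tr) + rng.normal(0, sigma, 30)
    w = np.polyfit(x_tr, y_tr, 1)
    preds.append(np.polyval(w, xs))
preds = np.array(preds)                        # shape (200, 50): h_D(x)

h_bar = preds.mean(axis=0)                     # average classifier h̄(x)
variance = ((preds - h_bar) ** 2).mean()       # E_{x,D}[(h_D(x) - h̄(x))²]
bias2 = ((h_bar - f(xs)) ** 2).mean()          # E_x[(h̄(x) - ȳ(x))²]
noise = sigma ** 2                             # E_{x,y}[(ȳ(x) - y)²]
total = ((preds - f(xs)) ** 2).mean() + noise  # expected test error

print(f"variance={variance:.3f} bias2={bias2:.3f} noise={noise:.3f}")
assert abs(total - (variance + bias2 + noise)) < 1e-9
```

Because the model class (lines) cannot represent the parabola, the bias² term dominates here; replacing the degree-1 fit with a high-degree one would shift error into the variance term instead.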
Figure: Graphical illustration of bias and variance.
Figure: The variation of bias and variance with model complexity.
This is similar to the concept of overfitting and underfitting: more
complex models overfit, while the simplest models underfit.
Detecting High Bias and High Variance
The graph above plots the training error and the test error and can
be divided into two overarching regimes. In the first regime (on the
left side of the graph), training error is below the desired error
threshold (denoted by ε), but test error is significantly higher.
Figure: Test and training error as the number of training instances
increases.
In the second regime (on the right side of the graph), test error is
remarkably close to training error, but both are above the desired
tolerance of ε.
Regime 1 (High Variance)
Symptoms:
Training error is much lower than test error
Training error is lower than ε
Test error is above ε
Remedies:
Add more training data
Reduce model complexity – complex models are prone to high
variance
Bagging (will be covered later in the course)
Regime 2 (High Bias): the model being used is not robust enough
to produce an accurate prediction
Symptoms:
Training error is higher than ε, but close to test error.
Remedies:
Use more complex model (e.g. kernelize, use non-linear
models)
Add features
Boosting (will be covered later in the course)
Model Selection
Performance estimation techniques
Always evaluate models as if they are predicting future data
We do not have access to future data, so we pretend that some data is hidden
Simplest way: the holdout (simple train-test split)
Randomly split data (and labels) into training, validation, and test sets (e.g. 60%-20%-20%)
Train (fit) a model on the training data, minimize error on the validation set, and score on the test data
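The holdout split above can be sketched with NumPy alone (toy data and a 60%-20%-20% split; in practice a library helper such as scikit-learn's `train_test_split` is more common):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(100).reshape(100, 1)   # toy feature matrix: 100 samples
y = np.arange(100)                   # toy labels

idx = rng.permutation(len(X))        # shuffle indices before splitting
train_idx = idx[:60]                 # 60% training
val_idx = idx[60:80]                 # 20% validation (model decisions)
test_idx = idx[80:]                  # 20% test (final score only)

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Shuffling first matters: if the data is ordered (e.g. by class), an unshuffled split would put systematically different samples in each set.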
K-fold Cross-validation
Each random split can yield very different models (and scores)
e.g. all easy (or hard) examples could end up in the test set
Split data into k equal-sized parts, called folds
Create k splits, each time using a different fold as the test set
Compute k evaluation scores, aggregate afterwards (e.g. take the mean)
Large k gives better estimates (more training data), but is expensive
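The k-fold procedure above can be sketched as an index generator (NumPy-only; the `kfold_indices` helper name is made up for illustration):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # Shuffle the n sample indices, split them into k roughly equal folds,
    # and yield (train, test) index pairs, holding out a different fold each time.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

for train, test in kfold_indices(10, 5):
    pass  # fit on train, evaluate on test, then aggregate the k scores
```

Every sample lands in exactly one test fold, so each point is used for evaluation exactly once across the k splits.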
Stratified K-Fold cross-validation
The bootstrap
Sample n (dataset size) data points, with replacement, as the training set (the bootstrap)
On average, bootstraps include 66% of all data points (some are duplicates)
Use the unsampled (out-of-bootstrap) samples as the test set
Repeat k times to obtain k scores
Repeated cross-validation
Cross-validation is still biased in that the initial split can be made in many ways
Repeated, or n-times-k-fold cross-validation:
Shuffle data randomly, do k-fold cross-validation
Repeat n times, yields n times k scores
Unbiased, very robust, but n times more expensive
Cross-validation with groups
Every new sample is evaluated only once, then added to the training set
Can also be done in batches (of n samples at a time)
TimeSeriesSplit
In the kth split, the first k folds form the train set and the (k+1)th fold is the validation set
Often, a maximum training set size (or window) is used
More robust against concept drift (change in data over time)
Choosing a performance estimation procedure
No strict rules, only guidelines:
Always use stratification for classification (sklearn does this by default)
Use holdout for very large datasets (e.g. >1,000,000 examples)
Or when learners don't always converge (e.g. deep learning)
Choose k depending on dataset size and resources
Use leave-one-out for very small datasets (e.g. <100 examples)
Use cross-validation otherwise
Most popular (and theoretically sound): 10-fold CV
Literature suggests 5x2-fold CV is better
Use grouping or leave-one-subject-out for grouped data
Use train-then-test for time series
Binary classification
https://en.wikipedia.org/wiki/Precision_and_recall
Multi-class classification
Train models per class: one class viewed as positive, the other(s) as negative, then average
micro-averaging: count total TP, FP, TN, FN (every sample equally important)
micro-precision, micro-recall, micro-F1, and accuracy are all the same
micro-precision = ∑_{c=1}^{C} TP_c / (∑_{c=1}^{C} TP_c + ∑_{c=1}^{C} FP_c),
which for C = 2 reduces to (TP + TN) / (TP + TN + FP + FN)
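Micro-averaged precision can be sketched directly from this definition (the `micro_precision` helper is a made-up name; note that for single-label problems every misclassified sample counts once as an FP, so the value coincides with accuracy):

```python
import numpy as np

def micro_precision(y_true, y_pred, classes):
    # Sum TP and FP over all classes, then compute a single pooled precision.
    tp = sum(np.sum((y_pred == c) & (y_true == c)) for c in classes)
    fp = sum(np.sum((y_pred == c) & (y_true != c)) for c in classes)
    return tp / (tp + fp)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(micro_precision(y_true, y_pred, classes=[0, 1, 2]))  # 4/6 ≈ 0.667
```

Macro-averaging would instead compute precision per class and average the C values, weighting rare classes as heavily as common ones.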
Other useful classification metrics
Cohen's Kappa
Measures 'agreement' between different models (aka inter-rater agreement)
To evaluate a single model, compare it against a model that does random guessing
Similar to accuracy, but taking into account the possibility of predicting the
right class by chance
Can be weighted: different misclassifications given different weights
1: perfect prediction, 0: random prediction, negative: worse than random
With p0 = accuracy, and pe = accuracy of the random classifier:
κ = (p0 − pe) / (1 − pe)
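The formula can be sketched as follows (the `cohens_kappa` helper is a made-up name; `pe` is computed from the class marginals of the true and predicted labels, i.e. the chance agreement of a random classifier with the same label frequencies):

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    n = len(y_true)
    # p0: observed accuracy
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # pe: expected accuracy of a random classifier with the same marginals
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pe = sum(true_counts[c] * pred_counts.get(c, 0) for c in true_counts) / n ** 2
    return (p0 - pe) / (1 - pe)

print(cohens_kappa(["a", "a", "b", "b"], ["a", "a", "b", "b"]))  # 1.0
```

As stated above, perfect prediction gives 1, chance-level prediction gives 0, and worse-than-chance prediction is negative.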
The best trade-off between precision and recall depends on your application
You can have arbitrarily high recall, but you often want reasonable precision, too.
Plotting precision against recall for all possible thresholds yields a precision-recall curve
Change the threshold until you find a sweet spot in the precision-recall trade-off
Often jagged at high thresholds, when there are few positive examples left
Model selection
Plotting TPR against FPR for all possible thresholds yields a Receiver Operating
Characteristic (ROC) curve
TPR = TP / (TP + FN), FPR = FP / (FP + TN)
Change the threshold until you find a sweet spot in the TPR-FPR trade-off
Lower thresholds yield higher TPR (recall) and higher FPR, and vice versa
Visualization
Histograms show the amount of points with a certain decision value (for each class)
TPR can be seen from the positive predictions (top histogram)
FPR can be seen from the negative predictions (bottom histogram)
Model selection
Between 0 and 1, but negative if the model is worse than just predicting the mean
Easier to interpret (higher is better).
Decision tree learning
Inductive inference with decision trees
▪ Inductive reasoning is a method of reasoning in which
a body of observations is considered in order to derive a
general principle.
▪ Decision trees are one of the most widely used and
practical methods of inductive inference
▪ Features
▪ Method for approximating discrete-valued functions
(including boolean)
▪ Learned functions are represented as decision trees (or
if-then-else rules)
▪ Expressive hypothesis space, including disjunction
Decision tree representation (PlayTennis)
When to use Decision Trees
▪ Problem characteristics:
▪ Instances can be described by attribute value pairs
▪ Target function is discrete valued
▪ Disjunctive hypothesis may be required
▪ Possibly noisy training data samples
▪ Robust to errors in training data
▪ Missing attribute values
▪ Different classification problems:
▪ Equipment or medical diagnosis
▪ Credit risk analysis
▪ Several tasks in natural language processing
Top-down induction of Decision Trees
▪ ID3 (Quinlan, 1986) is a basic algorithm for learning DTs
▪ Given a training set of examples, the algorithm for building a DT
performs a search in the space of decision trees
▪ The construction of the tree is top-down. The algorithm is greedy.
▪ The fundamental question is “which attribute should be tested next?
Which question gives us more information?”
▪ Select the best attribute
▪ A descendant node is then created for each possible value of this
attribute, and examples are partitioned according to this value
▪ The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
Which attribute is the best classifier?
ID3: algorithm
ID3(X, T, Attrs)   X: training examples,
                   T: target attribute (e.g. PlayTennis),
                   Attrs: other attributes, initially all attributes
Create Root node
If all X's are +, return Root with class +
If all X's are −, return Root with class −
If Attrs is empty, return Root with class the most common value of T in X
else
  A ← best attribute; decision attribute for Root ← A
  For each possible value vi of A:
    - add a new branch below Root, for test A = vi
    - Xi ← subset of X with A = vi
    - If Xi is empty then add a new leaf with class the most common value of T in X
      else add the subtree generated by ID3(Xi, T, Attrs − {A})
return Root
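The pseudocode above can be sketched in Python as follows. This is a simplified rendering (dict-based examples, tuple-encoded trees, illustrative data), not a full ID3 implementation; in particular, it only branches on values actually present in the data, so the empty-subset case never arises:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def id3(examples, target, attrs):
    """examples: list of dicts; target: key of the class label.
    Returns a class label (leaf) or a tuple (attribute, {value: subtree})."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # all examples share one class
        return labels[0]
    if not attrs:                             # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                              # information gain of attribute a
        rem = 0.0
        for v in {e[a] for e in examples}:
            sub = [e[target] for e in examples if e[a] == v]
            rem += len(sub) / len(examples) * entropy(sub)
        return entropy(labels) - rem

    best = max(attrs, key=gain)               # greedy choice of the test attribute
    return (best, {
        v: id3([e for e in examples if e[best] == v], target,
               [a for a in attrs if a != best])
        for v in {e[best] for e in examples}
    })

# Tiny PlayTennis-style example (made up for illustration):
data = [
    {'Outlook': 'Sunny',    'Humidity': 'High',   'Play': 'No'},
    {'Outlook': 'Sunny',    'Humidity': 'Normal', 'Play': 'Yes'},
    {'Outlook': 'Overcast', 'Humidity': 'High',   'Play': 'Yes'},
    {'Outlook': 'Overcast', 'Humidity': 'Normal', 'Play': 'Yes'},
]
tree = id3(data, 'Play', ['Outlook', 'Humidity'])
print(tree)
```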
Inductive bias in decision tree learning
(Outlook=Sunny) ∧ (Humidity=High) ⇒ (PlayTennis=No)
Why converting to rules?
▪ Each distinct path produces a different rule: a condition can be
removed based on a local (contextual) criterion, whereas node
pruning is global and affects all the rules
▪ In rule form, tests are not ordered and no book-keeping is
involved when conditions (nodes) are removed
▪ Converting to rules improves readability for humans
Dealing with continuous-valued attributes
▪ So far discrete values for attributes and for outcome.
▪ Given a continuous-valued attribute A, dynamically create a
new attribute Ac
Ac = True if A < c, False otherwise
▪ How to determine threshold value c ?
▪ Example. Temperature in the PlayTennis example
▪ Sort the examples according to Temperature
Temperature 40  48 | 60  72  80 | 90
PlayTennis  No  No | Yes Yes Yes | No
▪ Determine candidate thresholds by averaging consecutive values where
there is a change in classification: (48+60)/2=54 and (80+90)/2=85
▪ Evaluate candidate thresholds (attributes) according to information gain.
The best is Temperature > 54. The new attribute then competes with the
other ones
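The threshold-generation step above can be sketched as follows (the function name and the reuse of the slide's temperatures are illustrative):

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [
        (pairs[i][0] + pairs[i + 1][0]) / 2
        for i in range(len(pairs) - 1)
        if pairs[i][1] != pairs[i + 1][1]
    ]

temps = [40, 48, 60, 72, 80, 90]
play  = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
# Label changes between 48->60 and 80->90:
print(candidate_thresholds(temps, play))  # [54.0, 85.0]
```

Each candidate c then defines a boolean attribute Ac = (Temperature > c), evaluated by information gain like any other attribute.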
Problems with information gain
▪ Natural bias of information gain: it favours attributes with
many possible values.
▪ Consider the attribute Date in the PlayTennis example.
▪ Date would have the highest information gain since it perfectly
separates the training data.
▪ It would be selected at the root resulting in a very broad tree
▪ Very good on the training, this tree would perform poorly in predicting
unknown instances. Overfitting.
▪ The problem is that the partition is too specific, too many small
classes are generated.
▪ We need to look at alternative measures …
An alternative measure: gain ratio
SplitInformation(S, A) = − Σ_{i=1..c} (|Si| / |S|) log2 (|Si| / |S|)

▪ Si are the sets obtained by partitioning on value i of A
▪ SplitInformation measures the entropy of S with respect to the values of A. The
more uniformly dispersed the data, the higher it is.

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
▪ GainRatio penalizes attributes that split examples in many small classes such as
Date. Let |S| = n; Date splits the examples into n classes
▪ SplitInformation(S, Date) = −[(1/n)log2(1/n) + … + (1/n)log2(1/n)] = −log2(1/n) = log2 n
▪ Compare with A, which splits the data into two even classes:
▪ SplitInformation(S, A) = −[(1/2)log2(1/2) + (1/2)log2(1/2)] = −[−1/2 − 1/2] = 1
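The two worked values above can be checked with a short sketch (function names are illustrative):

```python
from math import log2

def split_information(sizes):
    """Entropy of S with respect to the partition sizes |S_i|."""
    total = sum(sizes)
    return -sum((s / total) * log2(s / total) for s in sizes)

def gain_ratio(gain, sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(sizes)

n = 14
print(split_information([1] * n))   # Date-like split into n singletons: log2(14) ≈ 3.81
print(split_information([7, 7]))    # even binary split: 1.0
```

The same Gain is thus divided by roughly 3.81 for Date but only by 1 for an even binary attribute, which is exactly the penalty the slides describe.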
Adjusting gain-ratio
▪ Problem: SplitInformation(S, A) can be zero or very small
when |Si| ≈ |S| for some value i
▪ To mitigate this effect, the following heuristic has been used:
1. compute Gain for each attribute
2. apply GainRatio only to attributes with Gain above average
Handling incomplete training data
▪ How to cope with the problem that the value of some attribute
may be missing?
▪ Example: Blood-Test-Result in a medical diagnosis problem
▪ The strategy: use the other examples to guess the missing attribute value
1. Assign the value that is most common among the training examples at
the node
2. Assign a probability to each value, based on frequencies, and assign
values to missing attribute, according to this probability distribution
▪ Missing values in new instances to be classified are treated
accordingly, and the most probable classification is chosen
(C4.5)
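Strategy 1 above can be sketched as follows (the function name, the use of None for "missing", and the example data are illustrative):

```python
from collections import Counter

def fill_with_mode(examples, attr):
    """Strategy 1: replace a missing value (None) of `attr` with the most
    common observed value among the examples at the current node."""
    observed = [e[attr] for e in examples if e[attr] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [dict(e, **{attr: mode}) if e[attr] is None else e for e in examples]

node = [{'Blood-Test': 'pos'}, {'Blood-Test': 'pos'}, {'Blood-Test': None}]
print(fill_with_mode(node, 'Blood-Test'))
```

Strategy 2 would instead fractionally distribute the example over all observed values in proportion to their frequencies, which is closer to what C4.5 actually does.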
Handling attributes with different costs
▪ Instance attributes may have an associated cost: we would
prefer decision trees that use low-cost attributes
▪ ID3 can be modified to take costs into account:
1. Tan and Schlimmer (1990):
   Gain²(S, A) / Cost(S, A)
2. Nunez (1988):
   (2^Gain(S, A) − 1) / (Cost(A) + 1)^w,   w ∈ [0, 1]
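Both cost-sensitive selection measures are one-liners; a hedged sketch (function names and example numbers are illustrative):

```python
def tan_schlimmer(gain, cost):
    """Tan and Schlimmer (1990): Gain^2(S, A) / Cost(S, A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w):
    """Nunez (1988): (2^Gain - 1) / (Cost + 1)^w, with w in [0, 1]
    controlling how strongly cost is penalised (w = 0 ignores cost)."""
    return (2 ** gain - 1) / (cost + 1) ** w

# A cheap attribute with modest gain can beat an expensive one with higher gain:
print(tan_schlimmer(gain=0.5, cost=1))   # 0.25
print(tan_schlimmer(gain=1.0, cost=10))  # 0.1
```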
Gini (impurity) Index
▪ The Gini index is a measure of diversity in a dataset. In other
words, if we have a set in which all the elements are similar,
this set has a low Gini index, and if all the elements are
different, it has a large Gini index.
▪ For clarity, consider the following two sets of 10 colored balls
(where any two balls of the same color are indistinguishable):
▪ • Set 1: eight red balls, two blue balls
▪ • Set 2: four red balls, three blue balls, two yellow balls, one green
ball
▪ Set 1 looks more pure than set 2, because set 1 contains
mostly red balls and a couple of blue ones, whereas set 2 has
many different colors. Next, we devise a measure of impurity
that assigns a low value to set 1 and a high value to set 2.
Gini (impurity) Index
▪ If we pick two random elements of the set, what is the
probability that they have a different color ? The two elements
don’t need to be distinct; we are allowed to pick the same
element twice.
▪ P(picking two balls of different color) = 1 – P(picking two balls
of the same color)
▪ P(picking two balls of the same color) = P(both balls are color 1)
+ P(both balls are color 2) + … + P(both balls are color n)
▪ P(both balls are color i) = pi²
▪ P(picking two balls of different colors) = 1 − p1² − p2² − … − pn²
Gini (impurity) Index
▪ Gini impurity index:
In a set with m elements and n classes, with ai elements belonging
to the i-th class, the Gini impurity index is
Gini = 1 − p1² − p2² − … − pn², where pi = ai / m
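The formula applied to the two ball sets from the earlier slide, as a short sketch (the function name is illustrative):

```python
def gini(counts):
    """Gini impurity: 1 - sum_i p_i^2, where p_i = a_i / m."""
    m = sum(counts)
    return 1 - sum((a / m) ** 2 for a in counts)

print(gini([8, 2]))        # set 1: mostly red -> low impurity (~0.32)
print(gini([4, 3, 2, 1]))  # set 2: mixed colors -> high impurity (~0.70)
```

As intended, the nearly-pure set 1 gets a lower index than the diverse set 2.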
sample x.
● The labels of the 3 neighbors are 2×(+1) and
outputs
● Classification rule: For a test input x, assign