
Introduction to Machine Learning

Tapas Kumar Mishra


NIT Rourkela
Logistics
• Class webpage:
– https://mishra-tapas.github.io/ml.html
Prerequisites
REQUIRED:
• Basic algorithms
– Dynamic programming, algorithmic analysis

STRONGLY RECOMMENDED:
• Linear algebra
– Matrices, vectors, systems of linear equations
– Eigenvectors, matrix rank
– Singular value decomposition
• Multivariable calculus
– Derivatives, integration, tangent planes
– Optimization, Lagrange multipliers
• Good programming skills: Python highly recommended
Source Materials

• C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007
• K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012
• T. Mitchell, Machine Learning, McGraw-Hill, 1997
• R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley, 2001
What is Machine Learning ?
(by examples)
Classification

from data to discrete classes


Spam filtering
data ➔ prediction

Spam
vs.
Not Spam
Face recognition

Example training images for each orientation

[slide credit: ©2009 Carlos Guestrin]


Weather prediction
Regression

predicting a numeric value


Stock market
Weather prediction revisited

Temperature

72° F
Ranking

comparing items
Web search
Given image, find similar images

http://www.tiltomo.com/
Collaborative Filtering
Recommendation systems
Recommendation systems
Machine learning competition with a $1 million prize
Clustering

discovering structure in data


Clustering Data: Group similar things
Clustering images

Set of Images

[Goldberger et al.]
Clustering web search results
Embedding

visualizing data
Embedding images

• Images have thousands or millions of pixels.

• Can we give each image a coordinate, such that similar images are near each other?

[Saul & Roweis ’03; slide credit: ©2009 Carlos Guestrin]
Embedding words

[Joseph Turian]
Embedding words (zoom in)

[Joseph Turian]
Structured prediction

from data to structured outputs


Speech recognition
Natural language processing

I need to hide a body


noun, verb, preposition, …
Growth of Machine Learning
• Machine learning is preferred approach to
– Speech recognition, Natural language processing
– Computer vision
– Medical outcomes analysis
– Robot control
– Computational biology
– Sensor networks
– …
• This trend is accelerating
– Big data
– Improved machine learning algorithms
– Faster computers
– Good open-source software
Course roadmap
– Bayesian Decision Theory
– Linear and Logistic regression
– Decision trees
– Dimensionality Reduction
– SVMs, kernel methods
– Neural Networks
– Unsupervised learning
– Ensemble Methods
Supervised Learning: find f
• Given: Training set {(xi, yi) | i = 1 … N}
• Find: A good approximation to f : X → Y
Examples: what are X and Y ?
• Spam Detection
– Map email to {Spam, Not Spam}
• Digit recognition
– Map pixels to {0,1,2,3,4,5,6,7,8,9}
• Stock Prediction
– Map news, historic prices, etc. to ℝ (the real numbers)
A Supervised Learning Problem
• Our goal is to find a function f : X → Y
• Dataset:
– X = {0,1}^4
– Y = {0,1}

• Question 1: How should we pick the hypothesis space, the set of possible functions f ?
• Question 2: How do we find the best f in the hypothesis space?
Most General Hypothesis Space
Consider all possible boolean functions over four input features!
Dataset:
• With four binary features there are 2^4 = 16 possible inputs, so there are 2^16 possible hypotheses

• 2^9 are consistent with our dataset (the training examples fix the output on 7 of the 16 inputs, leaving 2^9 hypotheses free)

• How do we choose the best one?
Occam’s Razor Principle
• William of Occam: Monk living in the 14th century
• Principle of parsimony:
“One should not increase, beyond what is necessary, the number of entities required to explain anything”

• When many solutions are available for a given problem, we should select the simplest one
• But what do we mean by simple?
• We will use prior knowledge of the problem to be solved to define what a simple solution is

Example of a prior: smoothness

[Samy Bengio]
Key Issues in Machine Learning
• How do we choose a hypothesis space?
– Often we use prior knowledge to guide this choice
• How can we gauge the accuracy of a hypothesis on unseen
data?
– Occam’s razor: use the simplest hypothesis consistent with data!
This will help us avoid overfitting.
– Learning theory will help us quantify our ability to generalize as
a function of the amount of training data and the hypothesis space
• How do we find the best hypothesis?
– This is an algorithmic question, the main topic of computer
science
• How to model applications as machine learning problems?
(engineering challenge)
Probability Theory refresher
Chapter 2
Bayesian Decision Theory



Bayesian Decision Theory
Bayesian decision theory is a statistical approach to
pattern recognition
The fundamentals of most ML algorithms are rooted in Bayesian decision theory
Basic Assumptions
 The decision problem is posed (formalized) in
probabilistic terms
 All the relevant probability values are known

Key Principle
Bayes Theorem
Bayes Theorem
Bayes theorem:

P(H|X) = P(X|H) · P(H) / P(X)

X: the observed sample (also called evidence; e.g. the length of a fish)
H: the hypothesis (e.g. the fish belongs to the “salmon” category)
P(H): the prior probability that H holds (e.g. the probability of catching a salmon)
P(X|H): the likelihood of observing X given that H holds (e.g. the probability of observing a 3-inch fish which is a salmon)
P(X): the evidence probability that X is observed (e.g. the probability of observing a fish with 3-inch length)
P(H|X): the posterior probability that H holds given X (e.g. the probability of the fish being salmon given its length is 3 inches)

[portrait: Thomas Bayes (1702–1761)]
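As a quick sanity check of the formula, a minimal Python sketch (the numbers are made up purely for illustration):

def posterior(prior, likelihood, evidence):
    # Bayes theorem: P(H|X) = P(X|H) * P(H) / P(X)
    return likelihood * prior / evidence

# Hypothetical fish example: P(salmon) = 0.3, p(3-inch | salmon) = 0.5,
# p(3-inch) = 0.25  ->  P(salmon | 3-inch) = 0.6
print(posterior(prior=0.3, likelihood=0.5, evidence=0.25))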


A Specific Example
State of Nature
 Future events that might occur
e.g. the next fish arriving along the conveyor belt
 The state of nature is unpredictable
e.g. it is hard to predict what type will emerge next
 From a statistical/probabilistic point of view, the state of nature should be regarded as a random variable
e.g. let ω denote the (discrete) random variable representing the state of nature (class) of fish types


Prior Probability
Prior Probability
The prior probability is the probability distribution which reflects one’s prior knowledge on the random variable

Probability distribution (for a discrete random variable)
Let P(ωi) be the probability distribution on the random variable ω with c possible states of nature ω1, …, ωc, such that:

P(ωi) ≥ 0,  Σ_{i=1..c} P(ωi) = 1

P(ω1) = P(ω2): the catch produced as much sea bass as salmon
P(ω1) > P(ω2): the catch produced more sea bass than salmon


Decision Before Observation
The Problem
To make a decision on the type of fish arriving next, where
1) the prior probability is known; 2) no observation is allowed

Naive Decision Rule
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

 This is the best we can do without observation
 Fixed prior probabilities ➔ same decision all the time
 Good when P(ω1) is much greater (or much smaller) than P(ω2)
 Poor when P(ω1) is close to P(ω2) [only a 50% chance of being right if P(ω1) = P(ω2)]
➔ Incorporate observations into the decision!


Probability Density Function (pdf)
Probability density function (pdf) (for a continuous random variable)
Let p(x) be the probability density function on the continuous random variable x taking values in R, such that:

p(x) ≥ 0,  ∫ p(x) dx = 1

 For a continuous random variable, it no longer makes sense to talk about the probability that x has a particular value (it is almost always zero)
 We instead talk about the probability of x falling into a region R, say R = (a, b), which can be computed with the pdf:

P(a < x < b) = ∫_a^b p(x) dx


Incorporate Observations
The Problem
Suppose the fish lightness measurement x is observed; how could we incorporate this knowledge into the decision?
Class-conditional probability density function
It is a probability density function (pdf) for x given that the state of nature (class) is ωi, i.e.:

p(x | ωi)

 The class-conditional pdf describes the difference in the distribution of observations under different classes


Class-Conditional PDF
An illustrative example
h-axis: lightness of fish scales
v-axis: class-conditional pdf values
black curve: sea bass
red curve: salmon

 The area under each curve is 1.0 (normalization)
 Sea bass is somewhat brighter than salmon

[figure: class-conditional pdf for lightness]


Decision After Observation
Known:
– Prior probability P(ωi)
– Class-conditional pdf p(x | ωi)
– Observation x for the test example (e.g. fish lightness)
Unknown:
– Posterior probability P(ωi | x), the quantity we want to use in the decision (exploiting the observation information)

Bayes Formula
P(ωi | x) = p(x | ωi) P(ωi) / p(x)
Converts the prior probability to the posterior probability


Bayes Formula Revisited
Joint probability density function: p(ωi, x) = P(ωi | x) p(x) = p(x | ωi) P(ωi)
Marginal distribution (law of total probability):

p(x) = Σ_j p(x | ωj) P(ωj)


Bayes Formula Revisited (Cont.)

Bayes Decision Rule
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

 p(x | ωi) and P(ωi) are assumed to be known
 p(x) is irrelevant for the Bayesian decision (it serves as a normalization factor, not related to any state of nature)


Bayes Formula Revisited (Cont.)

Special Case I: Equal prior probability P(ω1) = P(ω2)
➔ The decision depends only on the likelihoods p(x | ωi)

Special Case II: Equal likelihood p(x | ω1) = p(x | ω2)
➔ The rule degenerates to the naive decision rule based on priors

Normally, the prior probability and the likelihood function act together in the Bayesian decision process
Bayes Formula Revisited (Cont.)
An illustrative example
What will the posterior probability for either type of fish look like?

[figure: class-conditional pdf for lightness]


Bayes Formula Revisited (Cont.)
An illustrative example
h-axis: lightness of fish scales
v-axis: posterior probability for either type of fish
black curve: sea bass
red curve: salmon

 For each value of x, the higher curve yields the output of the Bayesian decision
 For each value of x, the posteriors of the two curves sum to 1.0

[figure: posterior probability for either type of fish]


Another Example
Problem statement
 A new medical test is used to detect whether a patient has a certain cancer or not; the test result is either + (positive) or − (negative)
 For a patient with this cancer, the probability of returning a positive test result is 0.98
 For a patient without this cancer, the probability of returning a negative test result is 0.97
 The probability for any person to have this cancer is 0.008

Question
If a positive test result is returned for some person, does he/she have this kind of cancer or not?


Another Example (Cont.)

No cancer!
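Working it out with Bayes theorem (all numbers come from the problem statement):

P(+ | cancer) · P(cancer) = 0.98 × 0.008 = 0.00784
P(+ | no cancer) · P(no cancer) = 0.03 × 0.992 = 0.02976
P(+) = 0.00784 + 0.02976 = 0.0376

P(cancer | +) = 0.00784 / 0.0376 ≈ 0.21
P(no cancer | +) = 0.02976 / 0.0376 ≈ 0.79

Since 0.21 < 0.79, the maximum-posterior decision is “no cancer”.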



Feasibility of Bayes Formula

To compute the posterior probability P(ωi | x), we need to know:

Prior probability: P(ωi)    Likelihood: p(x | ωi)

How do we know these probabilities?
 A simple solution: counting relative frequencies
 An advanced solution: conduct density estimation


A Further Example
Problem statement
Based on the height of a car on some campus, decide whether it costs more than $50,000 or not

Quantities to know: the priors P(ω1), P(ω2) and the class-conditional pdfs p(x | ω1), p(x | ω2), where ω1: price > $50,000 and ω2: price ≤ $50,000

➔ Counting relative frequencies via collected samples


A Further Example (Cont.)
Collecting samples
Suppose we have randomly picked 1209 cars on the campus, got prices from their owners, and measured their heights

Compute P(ω1), P(ω2):
# cars in ω1: 221  ➔  P(ω1) = 221/1209 ≈ 0.183
# cars in ω2: 988  ➔  P(ω2) = 988/1209 ≈ 0.817


A Further Example (Cont.)
Compute p(x | ωi):
Discretize the height spectrum (say [0.5m, 2.5m]) into 20 intervals each of length 0.1m, and then count the number of cars falling into each interval for either class
Suppose x falls into the interval Ix = [1.0m, 1.1m]

For ω1, # cars in Ix is 46  ➔  p̂(x ∈ Ix | ω1) = 46/221
For ω2, # cars in Ix is 59  ➔  p̂(x ∈ Ix | ω2) = 59/988


A Further Example (Cont.)
Question
For a car with height 1.05m, is its price greater than $50,000?
Estimated quantities: the priors and likelihoods counted above
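Plugging the counted estimates into the Bayes formula (1.05m falls into Ix = [1.0m, 1.1m]):

P(ω1 | x) ∝ p(x | ω1) P(ω1) = (46/221) × (221/1209) = 46/1209
P(ω2 | x) ∝ p(x | ω2) P(ω2) = (59/988) × (988/1209) = 59/1209

Normalizing: P(ω1 | x) = 46/105 ≈ 0.44 and P(ω2 | x) = 59/105 ≈ 0.56
➔ Decide ω2: the price is more likely not greater than $50,000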


Is Bayes Decision Rule Optimal?
Bayes Decision Rule (in case of two classes)
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

Whenever we observe a particular x, the probability of error is:
P(error | x) = P(ω1 | x) if we decide ω2;  P(ω2 | x) if we decide ω1

Under the Bayes decision rule, we have
P(error | x) = min[ P(ω1 | x), P(ω2 | x) ]

For every x, we ensure that P(error | x) is as small as possible ➔ the average probability of error over all possible x must be as small as possible


Bayes Decision Rule – The General Case
➢ By allowing the use of more than one feature
x ∈ R^d (d-dimensional Euclidean space)

➢ By allowing more than two states of nature
{ω1, ω2, …, ωc} (finite set of c states of nature)

➢ By allowing actions other than merely deciding the state of nature
{α1, α2, …, αa} (finite set of a possible actions)


Bayes Decision Rule – The General Case (Cont.)
➢ By introducing a loss function more general than the probability of error

λ(αi | ωj): the loss incurred for taking action αi when the state of nature is ωj
For ease of reference, usually written as λij

A simple loss function:
Class \ Action    α1    α2      α3
ω1                 5    50  10,000
ω2                60     3       0


Bayes Decision Rule – The General Case (Cont.)
The problem
Given a particular x, we have to decide which action to take

We need to know the loss of taking each action. If the true state of nature is ωj and the action taken is αi, we incur the loss λ(αi | ωj). However, the true state of nature is uncertain ➔ use the expected (average) loss


Bayes Decision Rule – The General Case (Cont.)
Expected loss (average by enumerating over all possible states of nature):

R(αi | x) = Σ_{j=1..c} λ(αi | ωj) P(ωj | x)

λ(αi | ωj): the incurred loss of taking action αi in case of true state of nature ωj
P(ωj | x): the probability of ωj being the true state of nature

 The expected loss is also named the (conditional) risk


Bayes Decision Rule – The General Case (Cont.)
Suppose we have the loss table above. For a particular x:

R(α1 | x) = λ(α1 | ω1) P(ω1 | x) + λ(α1 | ω2) P(ω2 | x) = 5 P(ω1 | x) + 60 P(ω2 | x)

Similarly, we can get R(α2 | x) and R(α3 | x)


Bayes Decision Rule – The General Case (Cont.)
The task: find a mapping from patterns to actions, α = α(x)
In other words, for every x, the decision function α(x) assumes one of the a actions
Overall risk R (the expected loss with decision function α):

R = ∫ R(α(x) | x) p(x) dx

R(α(x) | x): conditional risk for pattern x with action α(x);  p(x): pdf for patterns


Bayes Decision Rule – The General Case (Cont.)

For every x, we ensure that the conditional risk is as small as possible ➔ the overall risk over all possible x must be as small as possible

Bayes decision rule (general case):
For every x, take the action that minimizes the conditional risk, α* = argmin_{αi} R(αi | x)

 The resulting overall risk is called the Bayes risk (denoted as R*)
 The best performance achievable given p(x) and the loss function


Two-Category Classification
Special case
 c = 2 (two states of nature)

λij = λ(αi | ωj): the loss incurred for deciding ωi when the true state of nature is ωj

The conditional risks:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Is R(α1 | x) < R(α2 | x)?  yes ➔ decide ω1;  no ➔ decide ω2


Two-Category Classification (Cont.)
Starting from R(α1 | x) < R(α2 | x):

by definition: λ11 P(ω1 | x) + λ12 P(ω2 | x) < λ21 P(ω1 | x) + λ22 P(ω2 | x)
by re-arrangement: (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)
by Bayes theorem: (λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2)

Decide ω1 if the likelihood ratio exceeds a constant θ independent of x:

p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)] = θ

(the loss for being in error is ordinarily greater than the loss for being correct, so λ21 − λ11 > 0)


Minimum-Error-Rate Classification
Classification setting
 {ω1, ω2, …, ωc} (c possible states of nature); each action αi: deciding ωi

Zero-one (symmetrical) loss function:
λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j

 Assign no loss (i.e. 0) to a correct decision
 Assign a unit loss (i.e. 1) to any incorrect decision (equal cost)


Minimum-Error-Rate Classification (Cont.)

Under the zero-one loss, the conditional risk equals the error rate, the probability that action αi is wrong:

R(αi | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)

Minimum error rate ➔ decide the ωi that maximizes the posterior P(ωi | x)


Discriminant Function
Classification: Pattern ➔ Category (decide among c categories)

Discriminant functions gi(x), i = 1, …, c
 A useful way to represent classifiers
 One function per category; decide ωi if gi(x) > gj(x) for all j ≠ i


Discriminant Function (Cont.)
Minimum risk: gi(x) = −R(αi | x)
Minimum-error-rate: gi(x) = P(ωi | x)

Various discriminant functions ➔ identical classification results
If f is a monotonically increasing function, then f(gi(x)) and gi(x) are equivalent in decision

e.g.: gi(x) = p(x | ωi) P(ωi),  gi(x) = ln p(x | ωi) + ln P(ωi)


Discriminant Function (Cont.)
Decision region
c discriminant functions ➔ c decision regions

Ri = {x | gi(x) > gj(x) for all j ≠ i}, where Ri ⊂ R^d and ∪i Ri = R^d

Decision boundary:
surface in feature space where ties occur among the several largest discriminant functions
Expected Value
Expected value, a.k.a. expectation, mean or average of a random variable x

Discrete case: E[x] = Σ_x x P(x)

Continuous case: E[x] = ∫ x p(x) dx      Notation: μ = E[x]


Expected Value (Cont.)
Given random variable x and function f(x), what is the expected value of f(x)?

Discrete case: E[f(x)] = Σ_x f(x) P(x)
Continuous case: E[f(x)] = ∫ f(x) p(x) dx

Variance
Discrete case: Var[x] = Σ_x (x − μ)² P(x)
Continuous case: Var[x] = ∫ (x − μ)² p(x) dx

Notation: σ² = Var[x]  (σ: standard deviation)
Gaussian Density – Univariate Case
Gaussian density, a.k.a. normal density, for a continuous random variable:

p(x) = (1 / √(2π σ²)) exp( −(x − μ)² / (2σ²) )


Vector Random Variables
p(x) = p(x1, …, xd)  (joint pdf)
p(xi)  (marginal pdf on the i-th component)

Expected vector: μ = E[x] = (μ1, …, μd)^T, where each μi = E[xi] is computed from the marginal pdf on the i-th component


Vector Random Variables (Cont.)
Covariance matrix: Σ = E[(x − μ)(x − μ)^T]

Properties of Σ:
 Symmetric
 Positive semidefinite

Each entry σij = E[(xi − μi)(xj − μj)] is computed from the marginal pdf on the pair of random variables (xi, xj)


Gaussian Density – Multivariate Case

p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2)(x − μ)^T Σ^{−1} (x − μ) )

x: d-dimensional column vector
μ: d-dimensional mean vector
Σ: d × d covariance matrix


Discriminant Functions for Gaussian Density
Minimum-error-rate classification: gi(x) = ln p(x | ωi) + ln P(ωi)
For p(x | ωi) = N(μi, Σi):

gi(x) = −(1/2)(x − μi)^T Σi^{−1} (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)

The term (d/2) ln 2π is constant and can be ignored


Case I: Σi = σ²I

Covariance matrix: σ² times the identity matrix I

gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi)


Case I: Σi = σ²I (Cont.)

Expanding ‖x − μi‖² = x^T x − 2 μi^T x + μi^T μi, the term x^T x is the same for all states of nature and can be ignored

Linear discriminant functions:
gi(x) = wi^T x + wi0
wi = μi / σ²  (weight vector)
wi0 = −μi^T μi / (2σ²) + ln P(ωi)  (threshold/bias)


Case II: Σi = Σ

Covariance matrix: identical for all classes

gi(x) = −(1/2)(x − μi)^T Σ^{−1} (x − μi) + ln P(ωi)
(x − μi)^T Σ^{−1} (x − μi): squared Mahalanobis distance
When Σ = I, it reduces to the Euclidean distance


Case II: Σi = Σ (Cont.)

The quadratic term x^T Σ^{−1} x is the same for all states of nature and can be ignored

Linear discriminant functions:
gi(x) = wi^T x + wi0
wi = Σ^{−1} μi  (weight vector)
wi0 = −(1/2) μi^T Σ^{−1} μi + ln P(ωi)  (threshold/bias)


Case III: Σi arbitrary

Quadratic discriminant functions:
gi(x) = x^T Wi x + wi^T x + wi0
Wi = −(1/2) Σi^{−1}  (quadratic matrix)
wi = Σi^{−1} μi  (weight vector)
wi0 = −(1/2) μi^T Σi^{−1} μi − (1/2) ln |Σi| + ln P(ωi)  (threshold/bias)
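To make the discriminant functions concrete, here is a minimal NumPy sketch that evaluates gi(x) = ln p(x | ωi) + ln P(ωi) directly for Gaussian class-conditionals; this covers Case III (and hence Cases I and II as special cases). The means, covariances, and priors are made-up illustration values:

import numpy as np

def gaussian_log_discriminant(x, mu, Sigma, prior):
    # g_i(x) = ln p(x | w_i) + ln P(w_i) for a Gaussian class-conditional
    d = len(mu)
    diff = x - mu
    Sigma_inv = np.linalg.inv(Sigma)
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Two hypothetical classes in 2-D
mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 1.0]])]
priors = [0.6, 0.4]

x = np.array([1.0, 1.5])
scores = [gaussian_log_discriminant(x, m, S, p)
          for m, S, p in zip(mus, Sigmas, priors)]
print("decide class", int(np.argmax(scores)) + 1)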


Summary
◼ Bayesian Decision Theory
❑ Basic concepts
◼ States of nature
◼ Probability distribution, probability density function (pdf )
◼ Class‐conditional pdf
◼ Joint pdf, marginal distribution, law of total probability
❑ Bayes theorem
◼ Prior + likelihood + observation ➔ Posterior probability
❑ Bayes decision rule
◼ Decide the state of nature with maximum posterior



Summary (Cont.)
◼ Feasibility of Bayes decision rule
❑ Prior probability + likelihood
❑ Solution I: counting relative frequencies
❑ Solution II: conduct density estimation
◼ Bayes decision rule: The general scenario
❑ Allowing more than one feature
❑ Allowing more than two states of nature
❑ Allowing actions other than merely deciding the state of nature
❑ Loss function: λ(αi | ωj)


Summary (Cont.)
◼ Expected loss (conditional risk)

Average by enumerating over all possible states of nature

◼ General Bayes decision rule


❑ Decide the action with minimum expected loss
◼ Minimum‐error‐rate classification
❑ Actions ➔ Decide states of nature
❑ Zero‐one loss function
◼ Assign no loss/unit loss for correct/incorrect decisions
Summary (Cont.)
◼ Discriminant functions
❑ General way to represent classifiers
❑ One function per category
❑ Induce decision regions and decision boundaries

◼ Gaussian/Normal density

◼ Discriminant functions for Gaussian pdf


linear discriminant function
quadratic discriminant function
Probability Review
Fundamentals

 We measure the probability for Random Events


 How likely an event would occur
 The set of all possible events is called Sample Space
 In each experiment, an event may occur with a certain
probability (Probability Measure)
 Example:
 Tossing a die with 6 faces
 The sample space is {1, 2, 3, 4, 5, 6}
 Getting the event « 2 » in one experiment has probability 1/6
Probability

 The probability of every set of possible events is between 0 and 1, inclusive.
 The probability of the whole set of outcomes is 1.
 The sum of all probabilities is equal to one
 Example for a die: P(1)+P(2)+P(3)+P(4)+P(5)+P(6)=1
 If A and B are two events with no common outcomes, then the probability of their union is the sum of their probabilities.
 Event E1={1}, Event E2={6}
 P(E1 ∪ E2) = P(E1) + P(E2)
Complementary Event

 The complementary event of A is not(A)
 P(A) = 1 − P(not A)
 The probability that event A will not happen is 1 − P(A).

 Example
 Event E1={1}
 The probability to get a value different from {1} is 1 − P(E1).
Joint Probability

 Event Union (∪ = OR), Event Intersection (∩ = AND)

 Joint Probability p(A ∩ B)
 The probability of two events in conjunction, i.e. the probability of both events occurring together.

p(A ∪ B) = p(A) + p(B) − p(A ∩ B)

 Independent Events
 Two events A and B are independent if

p(A ∩ B) = p(A) · p(B)
Example on Independence
E1: Drawing Ball 1, P(E1) = 1/3
E2: Drawing Ball 2, P(E2) = 1/3        (independence: p(A ∩ B) = p(A) · p(B))
E3: Drawing Ball 3, P(E3) = 1/3

Case 1: Drawing with replacement of the ball
The second draw is independent of the first draw
p(E1 ∩ E2) = 1/3 × 1/3 = 1/9 = p(E1) · p(E2)

Case 2: Drawing without replacement of the ball
The second draw is dependent on the first draw
p(E1 ∩ E2) = 1/3 × 1/2 = 1/6 ≠ p(E1) · p(E2)

Quiz: Show that in Case 2, we have p(E1 ∪ E2) = 1/2
Conditional Probability
 Conditional probability p(A|B) is the probability of some event A, given the occurrence of some other event B.

p(A | B) = p(A ∩ B) / p(B)        p(B | A) = p(B ∩ A) / p(A)

 If A and B are independent, then p(A | B) = p(A) and p(B | A) = p(B)
 If A and B are independent, the conditional probability of A given B is simply the individual probability of A alone; same for B given A.
 p(A) is the prior probability;
 p(A|B) is called a posterior probability.
 Once you know B is true, the universe you care about shrinks to B.
Example on Independence
E1: Drawing Ball 1, P(E1) = 1/3
E2: Drawing Ball 2, P(E2) = 1/3        (using p(A | B) = p(A ∩ B) / p(B))
E3: Drawing Ball 3, P(E3) = 1/3

Case 1: Drawing with replacement of the ball
The second draw is independent of the first draw
p(E1 | E2) = p(E1 ∩ E2) / p(E2) = (1/9) / (1/3) = 1/3 = p(E1)

Case 2: Drawing without replacement of the ball
The second draw is dependent on the first draw
p(E1 | E2) = p(E1 ∩ E2) / p(E2) = (1/6) / (1/3) = 1/2 ≠ p(E1)
Bayes’ Rule

 We know that p(A ∩ B) = p(B ∩ A)
 Using the conditional probability definition, we have

p(A | B) · p(B) = p(B | A) · p(A)

 The Bayes rule is:

p(A | B) = p(B | A) · p(A) / p(B)
Law of Total Probability

 In probability theory, the law of total probability means the prior probability of A, P(A), is equal to the expected value of the posterior probability of A.
 That is, for any random variable N,

p(A) = E[ p(A | N) ]

 where p(A|N) is the conditional probability of A given N.
Law of Total Probability

 Law of alternatives: The term law of total probability is sometimes taken to mean the law of alternatives, which is a special case of the law of total probability applying to discrete random variables.
 If { Bn : n = 1, 2, 3, ... } is a finite partition of a probability space and each set Bn is measurable, then for any event A we have

p(A) = Σn p(A ∩ Bn)

 or, alternatively (using the rule of conditional probability),

p(A) = Σn p(A | Bn) · p(Bn)
Example: Law of Total Probability
Sample space S = {1, 2, 3, 4, 5, 6, 7}

Partitions: B1 = {1, 5}, B2 = {2, 3, 6}, B3 = {4, 7}

Event A = {3}

Law of total probability:

p(A) = p({3}) = p(A ∩ B1) + p(A ∩ B2) + p(A ∩ B3)
     = 0 + p({3} ∩ {2, 3, 6}) + 0 = p({3})
A random variable

► A random variable is a measurable function from a sample space of events Ω to the measurable space S of possible values of the variable

X : Ω → S
     Ei ↦ xi  (Event ↦ Value)

Each value/event has a probability of occurrence
A random variable

► A random variable is also known as a stochastic variable.
► A random variable is not a variable. It is a function. It maps from the sample space to the real numbers.
► A random variable is defined as a quantity whose values are random and to which a probability distribution is assigned.
A random variable: Examples

► The number of packets that arrive at the destination
► The waiting time of a customer in a queue
► The number of cars that enter the parking lot each hour
► The number of students that succeed in the exam
Random Variable Types
► Discrete Random Variable:
► possible values are discrete (countable sample space, integer values)
X : Ω → {1, 2, 3, 4, ...},  Ei ↦ xi
► Continuous Random Variable:
► possible values are continuous (uncountable space, real values)
X : Ω → [1.4, 32.3],  Ei ↦ xi
Discrete Random Variable
 The probability distribution for a discrete random variable is called the Probability Mass Function (PMF).
p(xi) = P(X = xi)
 Properties of the PMF:
0 ≤ p(xi) ≤ 1  and  Σi p(xi) = 1
 Cumulative Distribution Function (CDF):
P(X ≤ x) = Σ_{xi ≤ x} p(xi)
 Mean value:
μX = E(X) = Σ_{i=1..n} xi · p(xi)
Discrete Random Variable
 Mean (expected) value:
μX = E(X) = Σ_{i=1..n} xi · p(xi)
 Variance (general equation):
V(X) = σX² = E(X − E(X))² = E(X²) − E(X)²
 For a discrete RV:
V(X) = Σ_{i=1..n} (xi − μX)² · p(xi)
 Equivalently:
V(X) = Σ_{i=1..n} xi² · p(xi) − ( Σ_{i=1..n} xi · p(xi) )²
Discrete Random Variable: Example
 Random Variable: Grades of the students

Student ID  1  2  3  4  5  6  7  8  9  10
Grade       3  2  3  1  2  3  1  3  2  2

Probability Mass Function (PMF):

p(1) = P(X = 1) = 2/10 = 0.2,  0 ≤ p(1) ≤ 1
p(2) = P(X = 2) = 4/10 = 0.4,  0 ≤ p(2) ≤ 1
p(3) = P(X = 3) = 4/10 = 0.4,  0 ≤ p(3) ≤ 1
Discrete Random Variable: Example
 Random Variable: Grades of the students
Student ID  1  2  3  4  5  6  7  8  9  10
Grade       3  2  3  1  2  3  1  3  2  2

PMF property: Σi p(xi) = p(1) + p(2) + p(3) = 1

Cumulative Distribution Function:
P(X ≤ x) = Σ_{xi ≤ x} p(xi)

P(X ≤ 2) = p(1) + p(2) = 0.2 + 0.4 = 0.6
P(X ≤ 3) = p(1) + p(2) + p(3) = 1
Discrete Random Variable: Example
 Random Variable: Grades of the students
Student ID  1  2  3  4  5  6  7  8  9  10
Grade       3  2  3  1  2  3  1  3  2  2

The mean value:

μ = Σi xi · p(xi) = 1 × 0.2 + 2 × 0.4 + 3 × 0.4 = 2.2

(equivalently, the sum of the grades divided by 10: 22/10 = 2.2)
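These numbers can be checked in a few lines of NumPy; a minimal sketch reproducing the PMF, mean, and variance of the grades above:

import numpy as np

grades = np.array([3, 2, 3, 1, 2, 3, 1, 3, 2, 2])
values, counts = np.unique(grades, return_counts=True)
pmf = counts / len(grades)                # [0.2, 0.4, 0.4] for grades 1, 2, 3
mean = np.sum(values * pmf)               # 2.2
var = np.sum(values**2 * pmf) - mean**2   # E(X^2) - E(X)^2 = 0.56
print(pmf, mean, var)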
Continuous Random Variable
 The probability distribution for a continuous random variable is called the Probability Density Function (PDF), f(x).
 The probability of any single value is always 0: p(x = xi) = 0
 The sample space is infinite
 For a continuous random variable, we compute p(a ≤ x ≤ b)
 Properties of the PDF:
1. f(x) ≥ 0, for all x in RX
2. ∫_{RX} f(x) dx = 1
3. f(x) = 0, if x is not in RX
Continuous Random Variable
 Cumulative Distribution Function (CDF):

P(X ≤ x) = ∫_{−∞}^{x} f(t) dt

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

 Mean/expected value:

μX = E(X) = ∫_{−∞}^{+∞} x · f(x) dx

 Variance:

V(X) = ∫_{−∞}^{+∞} (x − μx)² · f(x) dx  and  V(X) = ∫_{−∞}^{+∞} x² · f(x) dx − μx²
Discrete versus Continuous Random Variables

Discrete Random Variable
 Finite/countable sample space, e.g. {0, 1, 2, 3}
 Probability Mass Function (PMF): p(xi) = P(X = xi)
   1. p(xi) ≥ 0, for all i
   2. Σ_{i} p(xi) = 1
 CDF: P(X ≤ x) = Σ_{xi ≤ x} p(xi)

Continuous Random Variable
 Infinite sample space, e.g. [0, 1], [2.1, 5.3]
 Probability Density Function (PDF): f(x)
   1. f(x) ≥ 0, for all x in RX
   2. ∫_{RX} f(x) dx = 1
   3. f(x) = 0, if x is not in RX
 CDF: P(X ≤ x) = ∫_{−∞}^{x} f(t) dt
 P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Continuous Random Variables: Example

Example: modeling the waiting time in a queue
 People waiting for service in a bank queue
 Time is a continuous random variable
 Random time is typically modeled with an exponential distribution, Exp(μ), where μ is the mean value:

f(x) = (1/μ) · exp(−x/μ),  x ≥ 0
f(x) = 0,  otherwise
Continuous Random Variables: Example

 We assume that the average waiting time of one customer is 2 minutes

PDF: f(time)

f(x) = (1/2) e^{−x/2},  x ≥ 0
f(x) = 0,  otherwise
Continuous Random Variables: Example
 The probability that the customer waits exactly 3 minutes is:
P(x = 3) = P(3 ≤ x ≤ 3) = ∫_3^3 (1/2) e^{−x/2} dx = 0
 The probability that the customer waits between 2 and 3 minutes is:
P(2 ≤ x ≤ 3) = ∫_2^3 (1/2) e^{−x/2} dx = 0.145
Using the CDF: P(2 ≤ X ≤ 3) = F(3) − F(2) = (1 − e^{−3/2}) − (1 − e^{−1}) = 0.145
 The probability that the customer waits less than 2 minutes:
P(0 ≤ X ≤ 2) = F(2) − F(0) = F(2) = 1 − e^{−1} = 0.632
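These probabilities can be sanity-checked with SciPy's exponential distribution (here scale = mean = 2 minutes); a minimal sketch:

from scipy import stats

wait = stats.expon(scale=2)          # Exp with mean 2 minutes
print(wait.cdf(3) - wait.cdf(2))     # P(2 <= X <= 3) ~ 0.145
print(wait.cdf(2))                   # P(X < 2) ~ 0.632
print(wait.mean(), wait.var())       # 2.0, 4.0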
Continuous Random Variables: Example

Expected Value and Variance
 The mean waiting time in the previous example is:

E(X) = ∫_0^∞ x · (1/2) e^{−x/2} dx = [−x e^{−x/2}]_0^∞ + ∫_0^∞ e^{−x/2} dx = 0 + 2 = 2

 To compute the variance of X, we first compute E(X²):

E(X²) = ∫_0^∞ x² · (1/2) e^{−x/2} dx = [−x² e^{−x/2}]_0^∞ + 2 ∫_0^∞ x e^{−x/2} dx = 8

 Hence, the variance and standard deviation of the waiting time are:

V(X) = 8 − 2² = 4
σ = √V(X) = 2
Variance

 The standard deviation is defined as the square root of the variance, i.e.:

σX = √(σX²) = √V(X)
Coefficient of Variation

 The coefficient of variation of the random variable X is defined as:

CV(X) = √V(X) / E(X) = σX / μX
Discrete Probability Distribution

 The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values.

 It is also sometimes called the probability function or the probability mass function (PMF) for a discrete random variable.
Probability Mass Function (PMF)
 Formally
 the probability distribution or probability mass function (PMF) of a discrete random variable X is a function that gives the probability p(xi) that the random variable equals xi, for each value xi:
p(xi) = P(X = xi)
 It satisfies the following conditions:
0 ≤ p(xi) ≤ 1
Σi p(xi) = 1
Continuous Random Variable

 A continuous random variable is one which takes an infinite number of possible values.
 Continuous random variables are usually measurements.
 Examples include height, weight, the amount of sugar in an orange, the time required to run a mile.
Probability Density Function (PDF)
 For the case of continuous variables, we do not want to ask what the probability of "1/6" is, because the answer is always 0...
 Rather, we ask what is the probability that the value is in the interval (a, b).
 So for continuous variables, we care about the derivative of the distribution function at a point (that's the derivative of an integral). This is called a probability density function (PDF).
 The probability that a random variable has a value in a set A is the integral of the p.d.f. over that set A.
Probability Density Function (PDF)
 The Probability Density Function (PDF) of a continuous random variable is a function that can be integrated to obtain the probability that the random variable takes a value in a given interval.
 More formally, the probability density function, f(x), of a continuous random variable X is the derivative of the cumulative distribution function F(x):

f(x) = (d/dx) F(x)

 Since F(x) = P(X ≤ x), it follows that:

F(b) − F(a) = P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Cumulative Distribution Function (CDF)

 The Cumulative Distribution Function (CDF) is a function giving the probability that the random variable X is less than or equal to x, for every value x.
 Formally
 the cumulative distribution function F(x) is defined to be:

F(x) = P(X ≤ x),  −∞ < x < +∞
Cumulative Distribution Function (CDF)

 For a discrete random variable, the cumulative distribution function is found by summing up the probabilities as in the example below.

F(x) = P(X ≤ x) = Σ_{xi ≤ x} P(X = xi) = Σ_{xi ≤ x} p(xi),  −∞ < x < +∞

 For a continuous random variable, the cumulative distribution function is the integral of its probability density function f(x):

F(b) − F(a) = P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Cumulative Distribution Function (CDF)

► Example
► Discrete case: Suppose a random variable X has the following probability mass function p(xi):

xi     0     1     2      3      4      5
p(xi)  1/32  5/32  10/32  10/32  5/32   1/32

► The cumulative distribution function F(x) is then:

xi     0     1     2      3      4      5
F(xi)  1/32  6/32  16/32  26/32  31/32  32/32
Mean or Expected Value

Expectation of a discrete random variable X:

μX = E(X) = Σ_{i=1..n} xi · p(xi)

Expectation of a continuous random variable X:

μX = E(X) = ∫_{−∞}^{+∞} x · f(x) dx
Example: Mean and variance

 When a die is thrown, each of the possible faces 1, 2, 3, 4, 5, 6 (the xi's) has a probability of 1/6 (the p(xi)'s) of showing. The expected value of the face showing is therefore:

µ = E(X) = (1 × 1/6) + (2 × 1/6) + (3 × 1/6) + (4 × 1/6) + (5 × 1/6) + (6 × 1/6) = 3.5

 Notice that, in this case, E(X) is 3.5, which is not a possible value of X.
Variance

 The variance is a measure of the 'spread' of a distribution about its average value.
 Variance is symbolized by V(X) or Var(X) or σ².
 The mean is a way to describe the location of a distribution;
 the variance is a way to capture its scale or degree of being spread out. The unit of variance is the square of the unit of the original variable.
Variance

 The variance of the random variable X is defined as:

V(X) = σX² = E(X − E(X))² = E(X²) − E(X)²

 where E(X) is the expected value of the random variable X.

 The standard deviation is defined as the square root of the variance, i.e.:

σX = √(σX²) = √V(X)
Sampling Distributions, Confidence Interval, Hypothesis Testing
Sampling Distribution of the means
• Central Limit Theorem: if X̄ is the mean of a random sample of size n taken from a population with mean μ and finite variance σ², then the limiting form of the distribution of

z = (X̄ − μ) / (σ / √n)

is a standard normal distribution N(0,1) as n → ∞.
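A quick simulation makes the theorem concrete; a minimal sketch, using a deliberately non-normal (exponential) population:

import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 10_000
# Exponential population: mean 2, variance 4 (far from normal)
samples = rng.exponential(scale=2, size=(trials, n))
z = (samples.mean(axis=1) - 2) / (2 / np.sqrt(n))
print(z.mean(), z.std())   # ~0 and ~1: z is approximately N(0,1)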


Sampling Distribution of S2: 2

Theorem 2:

9/11/2023 @TKMISHRA ML NITRKL 4


Sampling Distribution of S²

Example:



Sampling Distribution of S²: t
Theorem 3: If X̄ and S² are the mean and variance of a random sample of size n from a normal population with mean μ, then the statistic

T = (X̄ − μ) / (S / √n)

has a t-distribution with n − 1 degrees of freedom.
Sampling Distribution of S²: F
Theorem 4: If S1² and S2² are the variances of independent random samples of sizes n1 and n2 taken from normal populations with variances σ1² and σ2², then the statistic

F = (S1²/σ1²) / (S2²/σ2²)

has an F-distribution with n1 − 1 and n2 − 1 degrees of freedom.


Confidence Intervals



Interval Estimate and Confidence Level

• An interval estimate of a population parameter such as the mean or standard deviation is an interval or range of values within which the true parameter value is likely to lie with a certain probability.

• The confidence level, usually written as (1 − α)100%, on the interval estimate of a population parameter is the probability that the interval estimate will contain the population parameter. When α = 0.05, 95% is the confidence level and 0.95 is the probability that the interval estimate will contain the population parameter.


Significance and Confidence Level

• The value of α is called the significance level

• 95% confidence implies that in 19 out of 20 cases, the true population mean will be within the interval estimate.

• The confidence interval is the interval estimate of the population parameter estimated from a sample using a specified confidence level


Confidence Interval for Population Mean

• Let X̄1, X̄2, …, X̄n be the sample means of samples S1, S2, …, Sn that are drawn from an independent and identically distributed population with mean μ and standard deviation σ. From the central limit theorem we know that the sample means X̄i follow a normal distribution with mean μ and standard deviation σ/√n. The variable

Z = (X̄i − μ) / (σ / √n)

follows a standard normal distribution.


Assume that we are interested in finding the (1 − α)100% confidence interval for the population mean. We can distribute α (the probability of not observing the true population mean in the interval) equally (α/2) on either side of the distribution as shown in the figure.


CI for the population mean when the population standard deviation is known

• In general, the (1 − α)100% confidence interval for the population mean when the population standard deviation is known can be written as

X̄ ± Zα/2 · σ / √n

• The above equation is valid for large sample sizes, irrespective of the distribution of the population. It is equivalent to

P( X̄ − Zα/2 · σ/√n ≤ μ ≤ X̄ + Zα/2 · σ/√n ) = 1 − α
CI for Different Significance Values

• That is, the probability that the population mean takes a value between X̄ − Zα/2 · σ/√n and X̄ + Zα/2 · σ/√n is 1 − α.

• The absolute values of Zα/2 for various values of α, and the corresponding confidence interval for the population mean when the population standard deviation is known, are shown below:

α     |Zα/2|   Confidence interval
0.1   1.64     X̄ ± 1.64 · σ/√n
0.05  1.96     X̄ ± 1.96 · σ/√n
0.02  2.33     X̄ ± 2.33 · σ/√n
0.01  2.58     X̄ ± 2.58 · σ/√n


Example

A sample of 100 patients was chosen to estimate the length of stay


(LoS) at a hospital. The sample mean was 4.5 days and the population
standard deviation was known to be 1.2 days.

(a) Calculate the 95% confidence interval for the population mean.
(b) What is the probability that the population mean is greater than
4.73 days?



Solution

(a) 95% confidence interval for the population mean: We know that X̄ = 4.5 and σ = 1.2, and thus
σ/√n = 1.2/√100 = 0.12

The 95% confidence interval is given by

( X̄ − Zα/2 · σ/√n , X̄ + Zα/2 · σ/√n ) = (4.5 − 1.96 × 0.12, 4.5 + 1.96 × 0.12) = (4.2648, 4.7352)

(b) Note that 4.73 is the upper limit of the 95% confidence interval from part (a), thus the probability that the population mean is greater than 4.73 is approximately 0.025.
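The same interval can be computed programmatically; a minimal SciPy sketch using the numbers above:

import math
from scipy import stats

n, xbar, sigma, alpha = 100, 4.5, 1.2, 0.05
se = sigma / math.sqrt(n)               # 0.12
z = stats.norm.ppf(1 - alpha / 2)       # ~1.96
print(xbar - z * se, xbar + z * se)     # ~(4.2648, 4.7352)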


Confidence Interval for Population Mean when Standard Deviation is Unknown

• William Gossett (Student, 1908) proved that if the population follows a normal distribution and the standard deviation is calculated from the sample, then the statistic

t = (X̄ − μ) / (S / √n)

will follow a t-distribution with (n − 1) degrees of freedom.

• Here S is the standard deviation estimated from the sample (standard error). The t-distribution is very similar to the standard normal distribution; it has a bell shape and its mean, median, and mode are equal to zero as in the case of the standard normal distribution. The major difference is that the t-distribution has broader tails compared to the standard normal distribution. However, as the degrees of freedom increase, the t-distribution converges to the standard normal distribution.


Confidence Interval for Population Mean when Standard Deviation is Unknown

• The (1 − α)100% confidence interval for the mean of a population that follows a normal distribution, when the population standard deviation is unknown, is given by

X̄ ± tα/2,n−1 · S / √n

• In the above equation, the value tα/2,n−1 is the value of t under the t-distribution for which the cumulative probability F(t) = α/2 when the degrees of freedom is (n − 1).


• The absolute values of tα/2,n−1 for different values of α are shown below along with the corresponding Zα/2 values.

α     |tα/2,10|  |tα/2,50|  |tα/2,500|  |Zα/2|
0.1   1.812      1.675      1.647       1.64
0.05  2.228      2.008      1.964       1.96
0.02  2.763      2.403      2.333       2.33
0.01  3.169      2.677      2.585       2.58

It is evident from the table that the values of tα/2,n−1 and Zα/2 converge for higher degrees of freedom. In fact, as the sample size nears 100, the t-distribution gets very close to a normal distribution.


Example

• An online grocery store is interested in estimating the basket size (number of items ordered by the customer) of its customers so that it can optimize the size of crates used for delivering the grocery items. From a sample of 70 customers, the average basket size was estimated as 24 and the standard deviation estimated from the sample was 3.8. Calculate the 95% confidence interval for the basket size of the customer order.

Solution

We know that X̄ = 24, n = 70, S = 3.8, and t0.025,69 = 1.995

The confidence interval for the size of the basket is given by

X̄ ± tα/2,n−1 · S/√n = 24 ± 1.995 × 3.8/√70 = 24 ± 0.9061

Thus the 95% confidence interval for the size of the basket is (23.09, 24.91).
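A minimal SciPy sketch of the same calculation:

import math
from scipy import stats

n, xbar, s, alpha = 70, 24, 3.8, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # ~1.995
half = t_crit * s / math.sqrt(n)                # ~0.906
print(xbar - half, xbar + half)                 # ~(23.09, 24.91)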
HYPOTHESIS TESTING
INTRODUCTION TO HYPOTHESIS TESTING

Hypothesis testing is a statistical process of either rejecting or retaining a claim or belief or association related to a business context, product, service, processes, etc.


INTRODUCTION TO HYPOTHESIS TESTING

• A hypothesis test consists of two complementary statements called the null hypothesis and the alternative hypothesis, and only one of them is true

• Hypothesis testing is an integral part of many predictive analytics techniques such as multiple linear regression and logistic regression


HYPOTHESIS TESTING STEPS
1. Describe the hypothesis in words. The hypothesis is described using a population parameter (such as mean, standard deviation, proportion, etc.) about which a claim (hypothesis) is made. A few sample claims (hypotheses) are:

• Average time spent by women using social media is more than men.
• Customers with more than one mobile handset are more likely to die early.


HYPOTHESIS TESTING STEPS
2) Based on the claim made in step 1, define the null and alternative hypotheses. Initially we believe that the null hypothesis is true. In general, the null hypothesis means that there is no relationship between the two variables under consideration (for example, the null hypothesis for the claim ‘women use social media more than men’ will be ‘there is no relationship between gender and the average time spent on social media’). Null and alternative hypotheses are defined using a population parameter.

3) Identify the test statistic to be used for testing the validity of the null hypothesis. The test statistic will enable us to calculate the evidence in support of the null hypothesis. The test statistic will depend on the probability distribution of the sampling distribution; for example, if the test is for a mean value and the mean is calculated from a large sample and if the population standard deviation is known, then the sampling distribution will be a normal distribution and the test statistic will be a Z-statistic (standard normal statistic).
HYPOTHESIS TESTING STEPS
4. Decide the criteria for rejection and retention of the null hypothesis. This is called the significance value, traditionally denoted by the symbol α. The value of α will depend on the context and usually 0.1, 0.05, and 0.01 are used.

5. Calculate the p-value (probability value), which is nothing but the conditional probability of observing the test statistic value when the null hypothesis is true. In simple terms, the p-value is the evidence in support of the null hypothesis.

6. Take the decision to reject or retain the null hypothesis based on the p-value and significance value α. The null hypothesis is rejected when the p-value is less than α and the null hypothesis is retained when the p-value is greater than or equal to α.


Null and Alternative Hypothesis

• The null hypothesis, usually denoted as H0 (read “H zero” or “H naught”), refers to the statement that there is no relationship or no difference between different groups with respect to the value of a population parameter.

• The alternative hypothesis, usually denoted as HA (or H1), is the complement of the null hypothesis.


Hypothesis statement to definition of null and alternative hypothesis

1. Average annual salary of machine learning experts is different for males and females.
(In this case, the null hypothesis is that there is no difference in male and female salary of machine learning experts)
H0: μm = μf
HA: μm ≠ μf
μm and μf are the average annual salaries of male and female machine learning experts, respectively.

2. On average, people with a Ph.D. in analytics earn more than people with a Ph.D. in engineering.
H0: μa ≤ μe
HA: μa > μe
μa = average annual salary of people with a Ph.D. in analytics.
μe = average annual salary of people with a Ph.D. in engineering.

It is essential to have the equality sign in the null hypothesis statement.
Test Statistic
• The test statistic is the standardized difference between the estimated value of the parameter being tested calculated from the sample(s) and the hypothesis value (that is, the standardized difference between X̄ and μ), in order to establish the evidence in support of the null hypothesis.

• It measures the standardized distance (measured in terms of the number of standard deviations) between the value of the parameter estimated from the sample(s) and the value stated in the null hypothesis.


P-Value
• The p-value is the conditional probability of observing the statistic value when the null hypothesis is true.

• For example, consider the following hypothesis: average annual salary of machine learning experts is at least 100,000. The corresponding null hypothesis is H0: μm ≥ 100,000. Assume that the estimated value of the salary from a sample is 110,000 (that is, X̄ = 110,000), and assume that the standard deviation of the population is known and the standard error of the sampling distribution is 5000 (that is, σ/√n = 5000, where n is the sample size using which X̄ = 110,000 was calculated).
Hypothesis Testing
• The standardized distance between the estimated salary and the hypothesis salary is (110,000 − 100,000)/5000 = 2.

• That is, the standardized distance between the estimated value and the hypothesis value is 2, and we can now find the probability of observing this statistic value from the sample if the null hypothesis is true (that is, if μm ≥ 100,000).

• A large standardized distance between the estimated value and the hypothesis value will result in a low p-value.

• Note that the value 2 is actually a value under a standard normal distribution since it is calculated from

Z = (X̄ − μ) / (σ / √n)
Standard normal distribution and the p-value
corresponding to Z = 2 are shown below:



Hypothesis Testing
• The probability of observing a value of 2 or higher from a standard normal distribution is 0.02275.

• That is, if the population mean is 100,000 and the standard error of the sampling distribution is 5000, then the probability of observing a sample mean greater than or equal to 110,000 is 0.02275.

• The value 0.02275 is the p-value, which is the evidence in support of the statement in the null hypothesis.

p-value = P(observing the test statistic value | null hypothesis is true)
Decision Criteria – Significance Value

• The significance level, usually denoted by α, is the criterion used for taking the decision regarding the null hypothesis (reject or retain) based on the calculated p-value.

• The significance value α is the maximum threshold for the p-value.

• The decision to reject or retain will depend on whether the calculated p-value crosses the threshold value α or not.


Decision making under hypothesis testing

Criteria       Decision
p-value < α    Reject the null hypothesis
p-value ≥ α    Retain (or fail to reject) the null hypothesis


Example 1
Statement 1 − Salary of machine learning experts on average is at least US $100,000:
The null and alternative hypotheses in this case are given by

H0: μm ≤ 100,000
HA: μm > 100,000

where μm is the average annual salary of machine learning experts. Note that the equality symbol is always part of the null hypothesis since we have to measure the difference between the estimated value from the sample and the hypothesis value. In this case, the reject or retain decision will depend on the direction of deviation of the parameter estimated from the sample from the hypothesis value.
Solution
The figure below shows the rejection region on the right side of the distribution. Since the rejection region is only on one side, this is a one-tailed test. Specifically, since the alternative hypothesis in this case is μm > 100,000, this is called a right-tailed test.


Example 2
• Statement 2 − Average waiting time at the London Heathrow airport security check is less than 30 minutes:
The null and alternative hypotheses in this case are given by
H0: μw ≥ 30
HA: μw < 30

where μw is the average waiting time at the London Heathrow security check. In this case, the rejection region will be on the left side of the distribution (known as a left-tailed test) as shown in the figure.


Solution

Rejection region in case of left-sided test



Example 3
Statement 3 − Average salary of male and female MBA students at graduation is different:

The null and alternative hypotheses in this case are given by

H0: μm = μf
HA: μm ≠ μf

where μm and μf are the average salaries of male and female MBA students, respectively, at the time of graduation.
In this case, the rejection region will be on either side of the distribution; if the significance level is α then the rejection region will be α/2 on either side of the distribution. Since the rejection region is on either side of the distribution, it will be a two-tailed test.
Solution

Rejection region in case of two-tailed test



Hypothesis Testing for Population Mean with Known Variance: Z-Test
• The Z-test (also known as a one-sample Z-test) is used when a claim (hypothesis) is made about a population parameter such as the population mean or proportion, and the population variance is known.
• Since the hypothesis test is carried out with just one sample, this test is also known as a one-sample Z-test.
• The test statistic for the Z-test of a population mean when the population variance is known is given by

Z-statistic = (X̄ − μ) / (σ / √n)

• The critical value in this case will depend on the significance value α and whether it is a one-tailed or two-tailed test


Critical value for different values of α

Approximate Critical Values
α     Left-tailed test   Right-tailed test   Two-tailed test
0.1   −1.28              1.28                −1.64 and 1.64
0.05  −1.64              1.64                −1.96 and 1.96
0.01  −2.33              2.33                −2.58 and 2.58


Condition for rejection of null hypothesis H0

Type of Test       Condition                          Decision
Left-tailed test   Z-statistic < Critical value       Reject H0
                   Z-statistic ≥ Critical value       Retain H0
Right-tailed test  Z-statistic > Critical value       Reject H0
                   Z-statistic ≤ Critical value       Retain H0
Two-tailed test    |Z-statistic| > |Critical value|   Reject H0
                   |Z-statistic| ≤ |Critical value|   Retain H0


Example
• An agency based out of Bangalore claimed that the average monthly disposable income of families living in Bangalore is greater than INR 4200, with a standard deviation of INR 3200. From a random sample of 40,000 families, the average disposable income was estimated as INR 4250. Assume that the population standard deviation is INR 3200. Conduct an appropriate hypothesis test at 95% confidence level (α = 0.05) to check the validity of the claim made by the agency.


Solution
Claim: Average disposable income is more than INR 4200.
Let μ and σ denote the mean and standard deviation in the population. The corresponding null and alternative hypotheses are
H0: μ ≤ 4200
HA: μ > 4200
Since we know the population standard deviation, we can use the Z-test. The corresponding Z-statistic is given by

Z = (X̄ − μ) / (σ / √n) = (4250 − 4200) / (3200 / √40000) = 3.125


Solution Continued…
This is a right-tailed test.
The corresponding Z-critical value at α = 0.05 for a right-tailed test is approximately 1.64.
Since the calculated Z-statistic value is greater than the Z-critical value, we reject the null hypothesis.

The corresponding p-value = 0.00088.
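A minimal SciPy sketch of this Z-test using the summary statistics above:

import math
from scipy import stats

n, xbar, mu0, sigma = 40000, 4250, 4200, 3200
z = (xbar - mu0) / (sigma / math.sqrt(n))   # 3.125
p_value = stats.norm.sf(z)                  # right tail, ~0.00088
z_crit = stats.norm.ppf(0.95)               # ~1.645
print(z, p_value, z > z_crit)               # reject H0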


Critical value, Z-statistic value, and corresponding p-value.



Example
A passport office claims that passport applications are processed within 30 days of submitting the application form and all necessary documents. The table below shows the processing time of 40 passport applicants. The population standard deviation of the processing time is 12.5 days.
Conduct a hypothesis test at significance level α = 0.05 to verify the claim made by the passport office.

16 16 30 37 25 22 19 35 27 32
34 28 24 35 24 21 32 29 24 35
28 29 18 31 28 33 32 24 25 22
21 27 41 23 23 16 24 38 26 28


Solution
Null and alternative hypotheses in this case are given by
H0: μ ≥ 30
HA: μ < 30
From the data in the table, the estimated sample mean is 27.05 days.
The standard deviation of the sampling distribution is
σ/√n = 12.5/√40 = 1.9764
The value of the Z-statistic is given by

Z = (X̄ − μ) / (σ / √n) = (27.05 − 30) / 1.9764 = −1.4926
Solution Continued…
• The critical value of the left-tailed test for α = 0.05 is −1.644.
• Since the critical value is less than the Z-statistic value, we fail to reject the null hypothesis. The p-value for Z = −1.4926 is 0.06777, which is greater than the value of α.

• That is, there is no strong evidence against the null hypothesis, so we retain the null hypothesis, which is μ ≥ 30. The figure below shows the calculated Z-statistic value and the rejection region.
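Since the raw processing times are given, the whole test can be reproduced directly; a minimal sketch:

import numpy as np
from scipy import stats

times = [16,16,30,37,25,22,19,35,27,32,
         34,28,24,35,24,21,32,29,24,35,
         28,29,18,31,28,33,32,24,25,22,
         21,27,41,23,23,16,24,38,26,28]
xbar = np.mean(times)                     # 27.05
z = (xbar - 30) / (12.5 / np.sqrt(40))    # ~ -1.4926
p_value = stats.norm.cdf(z)               # left tail, ~0.0678
print(xbar, z, p_value)                   # retain H0 at alpha = 0.05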


Left-tailed test



Example
According to the company IQ Research, the average Intelligence Quotient (IQ) of Indians is 82, derived from research carried out by Professor Richard Lynn, a British professor of psychology, using data collected from 2002 to 2006 (Source: IQ Research). The population standard deviation of IQ is estimated as 11.03. Based on a sample of 100 people from India, the sample IQ was estimated as 84.
(a) Conduct an appropriate hypothesis test at α = 0.05 to validate the claim of IQ Research (that the average IQ of Indians is 82).


Solution
(a) Hypothesis test: It is given that μ = 82, σ = 11.03, n = 100, and X̄ = 84.
The null and alternative hypotheses in this case are:
H0: μ = 82
HA: μ ≠ 82
Since the direction of the alternative hypothesis is both ways, we have a two-tailed Z-test. The test statistic is given by

Z = (X̄ − μ) / (σ / √n) = (84 − 82) / (11.03 / √100) = 1.8132


Solution Continued…
• For a two-tailed test, the critical values at α/2 = 0.025 are −1.96 and 1.96.

• Since the calculated Z-statistic value is less than the critical value, we fail to reject the null hypothesis (retain the null hypothesis).

• Since the Z-statistic value is 1.8132 and falls on the right tail, we first calculate the area of the normal distribution beyond 1.8132, which is equal to 0.0348.

• Since this is a two-tailed test, the p-value is twice the area to the right side of the Z-statistic value, which is 2 × 0.0348 ≈ 0.0698; that is, the p-value in this case is 0.0698.


Statistic, critical values, and the rejection region



Hypothesis Test for Population Mean Under Unknown Population Variance: t-Test

• We use the fact that the sampling distribution of a sample from a population that follows a normal distribution with unknown variance follows a t-distribution with (n − 1) degrees of freedom.

• In many cases the population variance (and thus the standard deviation) will not be known. In such cases we will have to estimate the variance using the sample itself.

• Let S be the standard deviation estimated from the sample of size n.


t-test continued…
• Then the statistic (X̄ − μ) / (S / √n) will follow a t-distribution with (n − 1) degrees of freedom if the sample is drawn from a population that follows a normal distribution.

• Here 1 degree of freedom is lost since the standard deviation is estimated from the sample.

• Thus, we use the t-statistic (hence the test is called a t-test) to test the hypothesis when the population standard deviation is unknown:

t-statistic = (X̄ − μ) / (S / √n)
Example
Aravind Productions (AP) is a newly formed movie
production house based out of Mumbai, India. AP was
interested in understanding the production cost required
for producing a Bollywood movie. The industry believes
that the production house will require at least INR 500
million (50 crore) on average. It is assumed that the
Bollywood movie production cost follows a normal
distribution. Production costs of 40 Bollywood movies, in millions of rupees, are shown in Table 6.7. Conduct an appropriate hypothesis test at α = 0.05 to check whether the belief about the average production cost is correct.
Production cost of Bollywood movies (INR millions):

601 627 330 364 562 353 583 254 528 470
125 60 101 110 60 252 281 227 484 402
408 601 593 729 402 530 708 599 439 762
292 636 444 286 636 667 252 335 457 632
Solution
It is given that the production cost of Bollywood movies
follows a normal distribution; however, the standard
deviation of the population is not known and we need to
estimate the standard deviation value from the sample.
Thus, we have to use the t-test for testing the hypothesis.
From the sample data in Table 6.7 we get the following values:
n = 40, X̄ = 429.55, and S = 195.0337
The null and alternative hypotheses are
• H0: μ ≤ 500
• HA: μ > 500
Solution Continued…
The corresponding test statistic is
t = (X̄ − μ)/(S/√n) = (429.55 − 500)/(195.0337/√40) = −2.2845
Note that this is a one-tailed (right-tailed) test, and the critical t-value at α = 0.05 for a right-tailed test is tcritical = 1.6848.
Solution Continued…
Since the t-statistic value is less than the critical t-value, we retain the null hypothesis. The t-statistic value and the critical value for the t-test are shown in the figure.
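A sketch of the same right-tailed one-sample t-test with scipy.stats.ttest_1samp (not from the slides; `cost` holds the 40 values from the table above):

```python
import numpy as np
from scipy import stats

cost = np.array([601, 627, 330, 364, 562, 353, 583, 254, 528, 470,
                 125,  60, 101, 110,  60, 252, 281, 227, 484, 402,
                 408, 601, 593, 729, 402, 530, 708, 599, 439, 762,
                 292, 636, 444, 286, 636, 667, 252, 335, 457, 632])

# right-tailed test of H0: mu <= 500 vs HA: mu > 500
t_stat, p_right = stats.ttest_1samp(cost, popmean=500, alternative="greater")
t_crit = stats.t.ppf(0.95, df=len(cost) - 1)            # ~1.685
print(t_stat)                                           # -2.2845
print("reject H0" if t_stat > t_crit else "retain H0")  # retain H0
```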
Example
According to statistics released by the Department of Civil Aviation, the average delay of flights is equal to 16.8 minutes; flight delays are assumed to follow a normal distribution. However, from a sample of 50 flights, the average delay was estimated to be 19.5 minutes and the sample standard deviation was 6.6 minutes.
Conduct a hypothesis test to check the claim that the average delay is equal to 16.8 minutes at α = 0.01.
Solution
Given n = 50, X̄ = 19.5, S = 6.6.
The null and alternative hypotheses are
H0: μ = 16.8
HA: μ ≠ 16.8
The corresponding t-statistic value is
t = (X̄ − μ)/(S/√n) = (19.5 − 16.8)/(6.6/√50) = 2.8927
Solution Continued…
• The critical t-value for a two-tailed t-test with α = 0.01 and degrees of freedom = 49 is 2.68.
• Since the calculated t-statistic value is greater than the t-critical value, we reject the null hypothesis. The corresponding p-value is 0.0057. The t-statistic, t-critical value, and the rejection and retention regions are shown in the figure.
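The p-value here can be recovered from the summary statistics alone; a minimal sketch (not from the slides) using scipy.stats.t:

```python
import math
from scipy import stats

n, xbar, s, mu0, alpha = 50, 19.5, 6.6, 16.8, 0.01

t = (xbar - mu0) / (s / math.sqrt(n))          # 2.8927
p_value = 2 * stats.t.sf(abs(t), df=n - 1)     # 0.0057
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # ~2.68
print("reject H0" if abs(t) > t_crit else "retain H0")  # reject H0
```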
Paired Sample t-Test
• In a paired t-test, the data related to the parameter is captured twice from the same subject, once before the intervention and once after the intervention.
• Alternatively, the paired t-test can be used for comparing two different interventions, such as two different promotion strategies, applied to the same subjects.
Examples of paired t-test
❖ Body weight of subjects before and after
attending a yoga training program.
❖ Cholesterol levels of subjects before and after
attending meditation training.
❖ Amount of time spent by subjects on the internet
before and after marriage.
❖ Quantity of alcohol consumed by people before
and after breakup.
❖ Level of cortisol among students during and after
exam.
Paired t-Test
Assume that the mean difference in the parameter value before and after the treatment is d̄ and the corresponding standard deviation of the differences is Sd. Let D be the hypothesized mean difference. Then the statistic below follows a t-distribution with (n − 1) degrees of freedom:
t = (d̄ − D)/(Sd/√n)
Here we assume that the differences follow a normal distribution.
Example
The table shows data on alcohol consumption before and after breakup. Conduct a paired t-test to check whether alcohol consumption is higher after the breakup (that is, d > 0) at 95% confidence (α = 0.05).
Average weekly consumption of alcohol (in ml) before and after breakup
S. No. Before Breakup (X1) After Breakup (X2) Difference (X2 − X1)
1 470 408 −62
2 354 439 85
3 496 321 −175
4 351 437 86
5 349 335 −14
6 449 344 −105
7 378 318 −60
8 359 492 133
9 469 531 62
10 329 417 88
11 389 358 −31
12 497 391 −106
13 493 398 −95
14 268 394 126
15 445 508 63
16 287 399 112
17 338 345 7
18 271 341 70
19 412 326 −86
20 335 467 132
Solution
The mean difference, that is, the mean of (X2 − X1), is 11.5 and
the corresponding sample standard deviation is 95.67.
The null and alternative hypotheses are (when the claim is
that the difference is greater than zero):
H0: d  0
HA d > 0
The value of the test statistic is
t = (d̄ − D)/(Sd/√n) = (11.5 − 0)/(95.6757/√20) = 0.5375
Solution Continued…
• The critical t-value for a one-tailed test with α = 0.05 and df = 19 is 1.7291.
• Since the t-statistic value is 0.5375, which is less than the critical value, we retain the null hypothesis and conclude that the increase in alcohol consumption after breakup is not greater than zero.
• The corresponding right-tail p-value is about 0.30 (the area above t = 0.5375 with 19 degrees of freedom).
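A sketch of the same paired test via scipy.stats.ttest_rel (not from the slides; the arrays repeat the before/after columns of the table):

```python
import numpy as np
from scipy import stats

before = np.array([470, 354, 496, 351, 349, 449, 378, 359, 469, 329,
                   389, 497, 493, 268, 445, 287, 338, 271, 412, 335])
after  = np.array([408, 439, 321, 437, 335, 344, 318, 492, 531, 417,
                   358, 391, 398, 394, 508, 399, 345, 341, 326, 467])

# right-tailed paired test of H0: mean(after - before) <= 0
t_stat, p_right = stats.ttest_rel(after, before, alternative="greater")
print(t_stat, p_right)   # 0.5375, ~0.30 -> retain H0
```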
Two-Sample Z-Test
Assume that μ1 and μ2 are the population means. Our interest is to test a hypothesis on the difference between μ1 and μ2, that is (μ1 − μ2). If X̄1 and X̄2 are the estimated mean values from two samples drawn from the two populations, the statistic (X̄1 − X̄2) follows a normal distribution with mean (μ1 − μ2) and standard deviation
√(σ1²/n1 + σ2²/n2)
where n1 and n2 are the sample sizes of the two samples. The corresponding Z-statistic is given by
Z = ((X̄1 − X̄2) − (μ1 − μ2)) / √(σ1²/n1 + σ2²/n2)
Example
The Dean of St Peter School of Management Education
(SPSME) believes that the graduating students with
specialization in Marketing earn at least INR 5000 more per
month than the students with specialization in Operations
Management. To verify his belief, the Dean collected a
sample data from his graduating students, given in the table below. Conduct an appropriate hypothesis test at α = 0.05 to check whether the difference in monthly salary is at least 5000 more for students with marketing specialization compared to operations specialization. Assume that the salaries of students with marketing specialization and operations specialization follow normal distributions.
Sample values on Marketing and Operations Students

Specialization   Sample Size   Estimated Mean Salary per Month (Rupees)   Population Standard Deviation
Marketing        120           67,500.00                                  7,200
Operations       45            58,950.00                                  4,600
Solution
We have n1 = 120, n2 = 45, X̄1 = 67,500, X̄2 = 58,950, σ1 = 7,200, and σ2 = 4,600.
The null and alternative hypotheses are
H0: μ1 − μ2 ≤ 5000
HA: μ1 − μ2 > 5000
The corresponding test statistic value is
Z = ((67500 − 58950) − 5000) / √(7200²/120 + 4600²/45) = 3550/949.85 = 3.7374
The critical value of Z at α = 0.05 is 1.64 [= NORMSINV(1 − 0.05)]. Since the Z-statistic value is higher than the Z-critical value, we reject the null hypothesis. The corresponding p-value is 9.29 × 10⁻⁵.
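SciPy has no built-in two-sample Z-test, but the computation takes only a few lines; a sketch (not from the slides) from the summary statistics:

```python
import math
from scipy import stats

n1, x1, sd1 = 120, 67500, 7200   # marketing (population sigma known)
n2, x2, sd2 = 45, 58950, 4600    # operations
delta0 = 5000                    # hypothesized difference under H0

se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)  # 949.85
z = ((x1 - x2) - delta0) / se              # 3.7374
p_right = stats.norm.sf(z)                 # ~9.29e-05
print("reject H0" if p_right < 0.05 else "retain H0")  # reject H0
```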
Two-Sample t-Test
Difference in Two Population Means when
Population Standard Deviations are Unknown and
Believed to be Equal: Two-Sample t-Test
• In this section we discuss the hypothesis test for the difference in two population means when the standard deviations of the populations are unknown.
• Hence we need to estimate them from the samples drawn from these two populations.
• An additional assumption we make here is that the standard deviations of the two populations are equal (though unknown).
Then the sampling distribution of the difference in estimated means (X̄1 − X̄2) follows a t-distribution with (n1 + n2 − 2) degrees of freedom, mean (μ1 − μ2), and standard deviation
√(Sp² (1/n1 + 1/n2))
where Sp² is the pooled variance of the two samples, given by
Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)
The corresponding t-statistic is
t = ((X̄1 − X̄2) − (μ1 − μ2)) / √(Sp² (1/n1 + 1/n2))
Example
A company makes a claim that children (in the age group 7 to 12) who drink their health drink will grow taller than children who do not drink that health drink. The data in the table shows the average increase in height over a one-year period for two groups: one drinking the health drink and the other not drinking it. At α = 0.05, test whether the increase in height for the children who drink the health drink is at least 1.2 cm more.
Group                      Sample Size   Increase in Height (cm) during the Test Period   Standard Deviation Estimated from Sample
Drink health drink         80            7.6 cm                                           1.1 cm
Do not drink health drink  80            6.3 cm                                           1.3 cm
Solution
We have n1 = 80, n2 = 80, X̄1 = 7.6, X̄2 = 6.3, S1 = 1.1, and S2 = 1.3 (both standard deviations are estimated from the samples).
The null and alternative hypotheses are
H0: μ1 − μ2 ≤ 1.2
HA: μ1 − μ2 > 1.2
The pooled variance is
Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2) = (79 × 1.1² + 79 × 1.3²) / (80 + 80 − 2) = 1.45
Solution Continued…
The t-statistic is
t = ((7.6 − 6.3) − 1.2) / √(1.45 × (1/80 + 1/80)) = 0.5252
The t-critical value for a one-tailed t-test with α = 0.05 and degrees of freedom = 158 (80 + 80 − 2) is 1.6546. Since the calculated t-statistic value is less than the t-critical value, we retain the null hypothesis. That is, the difference between the two groups is at most 1.2 cm, and the corresponding right-tailed test has a p-value of 0.3.
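A sketch with scipy.stats.ttest_ind_from_stats (not from the slides). Because the hypothesized difference is 1.2 rather than 0, one mean is shifted by 1.2 before calling the function:

```python
from scipy import stats

# H0: mu1 - mu2 <= 1.2; shift mean1 down by 1.2 so the test is against 0
t_stat, p_right = stats.ttest_ind_from_stats(
    mean1=7.6 - 1.2, std1=1.1, nobs1=80,
    mean2=6.3, std2=1.3, nobs2=80,
    equal_var=True, alternative="greater")  # pooled-variance t-test
print(t_stat, p_right)  # 0.5252, ~0.30 -> retain H0
```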
Two-Sample t-Test with Unequal Variance

Difference in Two Population Means when Population Standard Deviations are Unknown and Not Equal: Two-Sample t-Test with Unequal Variance
We need to estimate the standard deviations from the samples drawn from these two populations. Then the sampling distribution of the difference in estimated means (X̄1 − X̄2) follows a t-distribution with mean (μ1 − μ2) and standard deviation
Su = √(S1²/n1 + S2²/n2)
The corresponding degrees of freedom is given by
df = ⌊ Su⁴ / ( (S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1) ) ⌋
where ⌊ ⌋ implies rounding down to the nearest integer. The t-statistic for testing two populations with unequal variance is given by
t = ((X̄1 − X̄2) − (μ1 − μ2)) / √(S1²/n1 + S2²/n2)
Example
A researcher is interested in finding the average duration of marriage based on educational qualifications. Two groups were considered for the study: Group 1 consisted of couples with no Bachelor's degree (both partners) and Group 2 consisted of couples where both partners have a Bachelor's degree or higher. The data in the table shows the average duration of marriage in years. At α = 0.05, test whether the average duration of marriage is more for couples with no Bachelor's degree compared to couples with a Bachelor's degree.
Group                   Sample Size   Duration of Marriage in Years   Standard Deviation Estimated from Sample
Couples with no Degree  120           10.1 years                      2.4 years
Couples with Degree     100           9.5 years                       3.1 years
Solution
We have n1 = 120, n2 = 100, X̄1 = 10.1, X̄2 = 9.5, S1 = 2.4, and S2 = 3.1 (both estimated from the samples).
The null and alternative hypotheses are
H0: μ1 − μ2 ≤ 0
HA: μ1 − μ2 > 0
The t-statistic value is
t = ((X̄1 − X̄2) − (μ1 − μ2)) / √(S1²/n1 + S2²/n2) = (10.1 − 9.5 − 0) / √(2.4²/120 + 3.1²/100) = 1.5805
Solution Continued…
The corresponding degrees of freedom is
df = ⌊ Su⁴ / ( (S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1) ) ⌋ = ⌊ 0.0207 / 0.000113 ⌋ = ⌊184.33⌋ = 184
The critical value of t for α = 0.05 and df = 184 is 1.6531. Since the t-statistic is less than the critical value of t, we retain the null hypothesis.
That is, the difference in duration of marriage between the two groups is less than or equal to zero.
The corresponding p-value is 0.05785.
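Welch's unequal-variance test is the equal_var=False mode of the same SciPy helper; a sketch (not from the slides; note that SciPy uses the fractional Welch-Satterthwaite degrees of freedom rather than flooring):

```python
from scipy import stats

t_stat, p_right = stats.ttest_ind_from_stats(
    mean1=10.1, std1=2.4, nobs1=120,
    mean2=9.5, std2=3.1, nobs2=100,
    equal_var=False, alternative="greater")  # Welch's t-test
print(t_stat, p_right)  # 1.5805, ~0.058 -> retain H0 at alpha = 0.05
```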
Estimating Probabilities from data
Tapas Kumar Mishra
August 3, 2022
Bayes Optimal classifier

If we are provided with P(X, Y) we can predict the most likely label for x, formally argmax_y P(y|x).
It is therefore worth considering whether we can estimate P(X, Y) directly from the training data.
If this is possible (to a good approximation), we could then use the Bayes Optimal classifier in practice on our estimate of P(X, Y).
Bayes Optimal classifier

In fact, many supervised learning algorithms can be viewed as estimating P(X, Y).
Generally, they fall into two categories:
When we estimate P(X, Y) = P(X|Y)P(Y), we call it generative learning.
When we only estimate P(Y|X) directly, we call it discriminative learning.
So how can we estimate probability distributions from samples?
Simple scenario: coin toss

Suppose you find a coin and it's ancient and very valuable. Naturally, you ask yourself, "What is the probability that this coin comes up heads when I toss it?"
You toss it n = 10 times and obtain the following sequence of outcomes: D = {H, T, T, H, H, H, T, T, T, T}. Based on these samples, how would you estimate P(H)?
Simple scenario: coin toss

We observed nH = 4 heads and nT = 6 tails. So, intuitively,
P(H) ≈ nH / (nH + nT) = 4/10 = 0.4
Can we derive this more formally?
Maximum Likelihood Estimation (MLE)

The estimator we just mentioned is the Maximum Likelihood Estimate (MLE). For MLE you typically proceed in two steps:
First, you make an explicit modelling assumption about what type of distribution your data was sampled from.
Second, you set the parameters of this distribution so that the data you observed is as likely as possible.
Maximum Likelihood Estimation (MLE)

Let us return to the coin example. A natural assumption about a coin toss is that the distribution of the observed outcomes is a binomial distribution.
The binomial distribution has two parameters n and θ, and it captures the distribution of n independent Bernoulli (i.e. binary) random events that have a positive outcome with probability θ.
In our case n is the number of coin tosses, and θ could be the probability of the coin coming up heads (e.g. P(H) = θ).
Maximum Likelihood Estimation (MLE)

Formally, the binomial distribution is defined as
P(D | θ) = C(nH + nT, nH) · θ^nH · (1 − θ)^nT,   (1)
where C(·, ·) denotes the binomial coefficient, and it computes the probability that we would observe exactly nH heads and nT tails if a coin was tossed n = nH + nT times and its probability of coming up heads is θ.
MLE Principle

Find θ̂ that maximizes the likelihood of the data, P(D; θ):
θ̂_MLE = argmax_θ P(D; θ)   (2)
Often we can solve this maximization problem with a simple two-step procedure:
1. Plug in all the terms for the distribution, and take the log of the function.
2. Compute its derivative, and equate it with zero.
Taking the log of the likelihood (often referred to as the log-likelihood) does not change its maximum (as the log is a monotonic function and the likelihood is positive), but it turns all products into sums, which are much easier to deal with when you differentiate.
Equating the derivative with zero is a standard way to find an extreme point. (To be precise, you should verify that it really is a maximum and not a minimum, by verifying that the second derivative is negative.)
Returning to our binomial distribution, we can now plug in the definition and compute the log-likelihood:
θ̂_MLE = argmax_θ P(D; θ)   (3)
= argmax_θ C(nH + nT, nH) θ^nH (1 − θ)^nT   (4)
= argmax_θ log C(nH + nT, nH) + nH·log(θ) + nT·log(1 − θ)   (5)
= argmax_θ nH·log(θ) + nT·log(1 − θ)   (6)
We can then solve for θ by taking the derivative and equating it with zero. This results in
nH/θ = nT/(1 − θ)  ⟹  nH − nH·θ = nT·θ  ⟹  θ = nH/(nH + nT)   (7)
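A quick numerical sanity check of Eq. (7), as a sketch (not from the slides): the closed-form MLE matches the maximizer of the log-likelihood on a grid.

```python
import numpy as np

n_h, n_t = 4, 6
theta_mle = n_h / (n_h + n_t)           # closed form from Eq. (7): 0.4

thetas = np.linspace(0.01, 0.99, 999)
log_lik = n_h * np.log(thetas) + n_t * np.log(1 - thetas)  # Eq. (6)
print(theta_mle, thetas[np.argmax(log_lik)])               # both ~0.4
```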
MLE

MLE gives the parameter setting that best explains the data you observed.
If n is large and your model/distribution is correct (that is, H includes the true model), then MLE finds the true parameters.
But MLE can overfit the data if n is small. It works well when n is large.
If you do not have the correct model (and n is small) then MLE can be terribly wrong!
For example, suppose you observe H,H,H,H,H. What is θ̂_MLE?
Simple scenario: coin toss with prior knowledge

Assume you have a hunch that θ is close to 0.5, but your sample size is small, so you don't trust your estimate.
Simple fix: add 2m imaginary throws that would result in θ0 (e.g. θ0 = 0.5); that is, add m heads and m tails to your data:
θ̂ = (nH + m) / (nH + nT + 2m)
For large n, this is an insignificant change. For small n, it incorporates your "prior belief" about what θ should be. Can we derive this formally?
The Bayesian Way

Model θ as a random variable, drawn from a distribution P(θ). Note that θ is not a random variable associated with an event in a sample space.
In frequentist statistics, this is forbidden.
In Bayesian statistics, this is allowed: you can specify a prior belief P(θ) defining what values you believe θ is likely to take on.
Now we can look at P(θ | D) = P(D | θ)P(θ) / P(D) (recall Bayes rule!), where
P(θ) is the prior distribution over the parameter(s) θ, before we see any data.
P(D | θ) is the likelihood of the data given the parameter(s) θ.
P(θ | D) is the posterior distribution over the parameter(s) θ after we have observed the data.
A natural choice for the prior P(θ) is the Beta distribution:
P(θ) = θ^(α−1) (1 − θ)^(β−1) / B(α, β)   (8)
where B(α, β) = Γ(α)Γ(β)/Γ(α+β) is the normalization constant (if this looks scary, don't worry about it; it is just there to make sure everything integrates to 1).
Why is the Beta distribution a good fit?
It models probabilities (θ lives on [0, 1]).
It is of the same distributional family as the binomial distribution (a conjugate prior), so the math turns out nicely:
P(θ | D) ∝ P(D | θ)P(θ) ∝ θ^(nH+α−1) (1 − θ)^(nT+β−1)   (9)
Maximum a Posteriori Probability Estimation (MAP)

Find θ̂ that maximizes the posterior distribution P(θ | D):
θ̂_MAP = argmax_θ P(θ | D)   (10)
= argmax_θ log P(D | θ) + log P(θ)   (11)
Maximum a Posteriori Probability Estimation (MAP)

For our coin flipping scenario, we get:
θ̂_MAP = argmax_θ P(θ | Data)
= argmax_θ P(Data | θ)P(θ) / P(Data)   (by Bayes rule)
= argmax_θ log P(Data | θ) + log P(θ)
= argmax_θ nH·log(θ) + nT·log(1 − θ) + (α − 1)·log(θ) + (β − 1)·log(1 − θ)
= argmax_θ (nH + α − 1)·log(θ) + (nT + β − 1)·log(1 − θ)
⟹ θ̂_MAP = (nH + α − 1) / (nH + nT + β + α − 2)
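A sketch (not from the slides) comparing MLE and MAP on the ten tosses; α = β = 5 is an assumed Beta prior encoding a hunch that θ is near 0.5.

```python
n_h, n_t = 4, 6          # observed heads and tails
alpha, beta = 5, 5       # assumed Beta prior pseudo-counts

theta_mle = n_h / (n_h + n_t)                                   # 0.400
theta_map = (n_h + alpha - 1) / (n_h + n_t + alpha + beta - 2)  # 0.444
print(theta_mle, theta_map)  # MAP is pulled toward the prior mean 0.5
```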
MAP: a few comments

The MAP estimate is identical to MLE with α − 1 hallucinated heads and β − 1 hallucinated tails.
As n → ∞, θ̂_MAP → θ̂_MLE, as α − 1 and β − 1 become irrelevant compared to very large nH, nT.
MAP is a great estimator if an accurate prior belief is available (and mathematically tractable).
If n is small, MAP can be very wrong if the prior belief is wrong!
Machine Learning and estimation

As always, the differences are subtle. In MLE we maximize log[P(D; θ)]; in MAP we maximize log[P(D|θ)] + log[P(θ)].
So essentially in MAP we only add the term log[P(θ)] to our optimization. This term is independent of the data and penalizes the parameters θ if they deviate too much from what we believe is reasonable.
We will later revisit this as a form of regularization, where log[P(θ)] will be interpreted as a measure of classifier complexity.
Bayes Classifier and Naive Bayes
Tapas Kumar Mishra
August 7, 2022
Supervised ML Setup

Let us formalize the supervised machine learning setup. Our training data comes in pairs of inputs (x, y), where x ∈ R^d is the input instance and y its label.
The entire training data is denoted as
D = {(x1, y1), . . . , (xn, yn)} ⊆ R^d × C
where:
R^d is the d-dimensional feature space
xi is the input vector of the i-th sample
yi is the label of the i-th sample
C is the label space
The data points (xi, yi) are drawn from some (unknown) distribution P(X, Y). Ultimately we would like to learn a function h such that for a new pair (x, y) ∼ P, we have h(x) = y with high probability (or h(x) ≈ y).
Our training data consists of the set D = {(x1, y1), . . . , (xn, yn)}, drawn from some unknown distribution P(X, Y).
Because all pairs are sampled i.i.d., we obtain
P(D) = P((x1, y1), . . . , (xn, yn)) = Π_{α=1}^{n} P(xα, yα).
If we do have enough data, we could estimate P(X, Y) similarly to the coin example in the previous lecture, where we imagine a gigantic die that has one side for each possible value of (x, y). We can estimate the probability that one specific side comes up through counting:
P̂(x, y) = (Σ_{i=1}^{n} I(xi = x ∧ yi = y)) / n,
where I(xi = x ∧ yi = y) = 1 if xi = x and yi = y, and 0 otherwise.
Bayes Optimal Classifier

Of course, if we are primarily interested in predicting the label y from the features x, we may estimate P(Y|X) directly instead of P(X, Y).
We can then use the Bayes Optimal Classifier for a specific P̂(y|x) to make predictions.
The Bayes optimal classifier predicts y* = h_opt(x) = argmax_y P(y|x).
Although the Bayes optimal classifier is as good as it gets, it can still make mistakes: it is always wrong if a sample does not have the most likely label. We can compute the probability of that happening precisely (which is exactly the error rate):
ε_BayesOpt = 1 − P(h_opt(x)|x) = 1 − P(y*|x)
Assume for example an email x can either be classified as spam (+1) or ham (−1). For the same email x the conditional class probabilities are:
P(+1|x) = 0.8, P(−1|x) = 0.2
In this case the Bayes optimal classifier would predict the label y* = +1 as it is the most likely, and its error rate would be ε_BayesOpt = 0.2.
So how can we estimate P̂(y|x)?
Previously we derived that P̂(y) = (Σ_{i=1}^{n} I(yi = y)) / n.
Similarly, P̂(x) = (Σ_{i=1}^{n} I(xi = x)) / n and P̂(y, x) = (Σ_{i=1}^{n} I(xi = x ∧ yi = y)) / n.
We can put these together:
P̂(y|x) = P̂(y, x) / P̂(x) = (Σ_{i=1}^{n} I(xi = x ∧ yi = y)) / (Σ_{i=1}^{n} I(xi = x))
The Venn diagram illustrates that the MLE method estimates P̂(y|x) as
P̂(y|x) = |C| / |B|
Problem: There is a big problem with this method. The MLE estimate is only good if there are many training vectors with exactly the same features as x!
In high-dimensional spaces (or with continuous x), this never happens! So |B| → 0 and |C| → 0.
For example, the set matching P(y = yes | x1 = dallas, x2 = female, x3 = 5) is always empty!
Naive Bayes

We can approach this dilemma with a simple trick, and an additional assumption. The trick is to estimate P(y) and P(x|y) instead, since, by Bayes rule,
P(y|x) = P(x|y)P(y) / P(x).
Recall from Estimating Probabilities from Data that estimating P(y) and P(x|y) is called generative learning.
Naive Bayes

P(y|x) = P(x|y)P(y) / P(x).
Estimating P(y) is easy. For example, if Y takes on discrete binary values, estimating P(Y) reduces to coin tossing: we simply need to count how many times we observe each outcome (in this case each class):
P(y = c) = (Σ_{i=1}^{n} I(yi = c)) / n = π̂c
Naive Bayes

P(y|x) = P(x|y)P(y) / P(x).
Estimating P(x|y), however, is not easy! The additional assumption that we make is the Naive Bayes assumption.
Naive Bayes assumption

P(x|y) = Π_{α=1}^{d} P(xα|y), where xα = [x]α is the value of feature α
i.e., feature values are independent given the label! This is a very bold assumption.
For example, a setting where the Naive Bayes classifier is often used is spam filtering. Here, the data is emails and the label is spam or not-spam.
The Naive Bayes assumption implies that the words in an email are conditionally independent, given that you know whether an email is spam or not. Clearly this is not true.
Given a disease, all its symptoms are treated as independent (a causal relationship).
(Figure) Illustration of the Naive Bayes algorithm: we estimate P(xα|y) independently in each dimension (middle two images) and then obtain an estimate of the full data distribution by assuming conditional independence, P(x|y) = Π_α P(xα|y) (rightmost image).
MLE

So, for now, let's pretend the Naive Bayes assumption holds. Then the Bayes Classifier can be defined as
h(x) = argmax_y P(y|x)
= argmax_y P(x|y)P(y) / P(x)
= argmax_y P(x|y)P(y)   (P(x) does not depend on y)
= argmax_y Π_{α=1}^{d} P(xα|y) · P(y)   (by the naive Bayes assumption)
= argmax_y Σ_{α=1}^{d} log(P(xα|y)) + log(P(y))   (as log is a monotonic function)
Estimating log(P(xα|y)) is easy, as we only need to consider one dimension. And estimating P(y) is not affected by the assumption.
Estimating P([x]α | y)

Now we know how to use our assumption to make the estimation of P(y|x) tractable. There are 3 notable cases in which we can use our naive Bayes classifier:
Case #1: Categorical features.
Case #2: Multinomial features.
Case #3: Continuous features (Gaussian Naive Bayes).
Case #1: Categorical features

Features: [x]α ∈ {f1, f2, · · · , fKα}
Each feature α falls into one of Kα categories. (Note that the case with binary features is just a specific case of this, where Kα = 2.) An example of such a setting may be medical data, where one feature could be gender (male / female) or marital status (single / married / widowed).
Case #1: Categorical features

Model P(xα | y):
P(xα = j | y = c) = [θjc]α   and   Σ_{j=1}^{Kα} [θjc]α = 1
where [θjc]α is the probability of feature α having the value j, given that the label is c. The constraint indicates that xα must have one of the categories {1, . . . , Kα}.
Case #1: Categorical features

Parameter estimation:
[θ̂jc]α = (Σ_{i=1}^{n} I(yi = c) I(xiα = j) + l) / (Σ_{i=1}^{n} I(yi = c) + lKα)   (1)
where xiα = [xi]α and l is a smoothing parameter. By setting l = 0 we get an MLE estimator; l > 0 leads to MAP. If we set l = +1 we get Laplace smoothing.
In words (without the l hallucinated samples) this means:
(# of samples with label c that have feature α with value j) / (# of samples with label c)
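A sketch of Eq. (1) on toy arrays (not from the slides; the data and the helper name categorical_theta are made up for illustration):

```python
import numpy as np

def categorical_theta(x_col, y, j, c, l=1, K=2):
    """Smoothed estimate of P(x_alpha = j | y = c), as in Eq. (1)."""
    num = np.sum((y == c) & (x_col == j)) + l
    den = np.sum(y == c) + l * K
    return num / den

# toy feature: city in {0: Dallas, 1: New York}; label in {0: No, 1: Yes}
city = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y    = np.array([1, 1, 0, 0, 0, 1, 0, 0])
print(categorical_theta(city, y, j=0, c=1))  # Laplace-smoothed P(Dallas|Yes)
```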
Case #1: Categorical features

Prediction:
argmax_y P(y = c | x) ∝ argmax_y π̂c Π_{α=1}^{d} [θ̂jc]α
= argmax_y (Σ_{i=1}^{n} I(yi = c) / n) · Π_{α=1}^{d} (Σ_{i=1}^{n} I(yi = c) I(xiα = j) + l) / (Σ_{i=1}^{n} I(yi = c) + lKα)
Example

The various probabilities we get from the example table are:
P(Dallas|Yes) = 2/3 and P(Dallas|No) = 2/5
P(NewYorkCity|Yes) = 1/3 and P(NewYorkCity|No) = 3/5
P(Dallas) = 4/8 and P(NewYorkCity) = 4/8
P(Yes) = 3/8 and P(No) = 5/8
Example
So, the various probabilities we get from the table are:
P(Male|Yes) = 1/3 and P(Male|No) = 3/5
P(Female|Yes) = 2/3 and P(Female|No) = 2/5
P(Male) = 5/8 and P(Female) = 3/8
P(Yes) = 3/8 and P(No) = 5/8
Example
Prediction for X = (City = Dallas, Gender = Female):
P(X|illness = No) = P(Dallas|No) × P(Female|No) = 2/5 × 2/5 = 4/25.
P(X|illness = Yes) = P(Dallas|Yes) × P(Female|Yes) = 2/3 × 2/3 = 4/9.
Since P(X|No) × P(No) < P(X|Yes) × P(Yes), Bayes rule gives P(No|X) < P(Yes|X), so Class = Yes.
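The arithmetic above is small enough to verify directly; the snippet below (an illustrative sketch with the slides' numbers hard-coded) reproduces the comparison of the two unnormalized posteriors.

```python
# Reproducing the slides' arithmetic for X = (City = Dallas, Gender = Female).
p_x_no  = (2/5) * (2/5)   # P(Dallas|No)  * P(Female|No)  = 4/25
p_x_yes = (2/3) * (2/3)   # P(Dallas|Yes) * P(Female|Yes) = 4/9
p_no, p_yes = 5/8, 3/8    # class priors P(No), P(Yes)

# Compare P(X|c) * P(c) for the two classes.
print(p_x_no * p_no, p_x_yes * p_yes)   # 0.1 < 0.1667 -> predict Yes
```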
Case #2: Multinomial features
Here, feature values don't represent categories (e.g. male/female) but counts, so we need to use a different model. E.g., in text document categorization, feature value xα = j means that in this particular document x, the α-th word in my dictionary appears j times.
Think of starting with an all-zero vector of dimension d = V. We roll a V-sided die to sample the first word and increase the count of the vector at that position by 1. We keep repeating this experiment until we have m words; that is the email.
Case #2: Multinomial features
Let us consider the example of spam filtering. Imagine the α-th word is indicative of "spam". Then xα = 10 means that this email is likely spam (as word α appears 10 times in it), and another email with x′α = 20 should be even more likely to be spam (as the spammy word appears twice as often).
With categorical features this is not guaranteed. It could be that the training set does not contain any email that contains word α exactly 20 times. In this case you would simply get the hallucinated smoothing values for both spam and not-spam, and the signal is lost.
Case #2: Multinomial features
We need a model that incorporates our knowledge that features are counts.
Features:
$$x_\alpha \in \{0, 1, 2, \dots, m\} \quad\text{and}\quad m = \sum_{\alpha=1}^{d} x_\alpha \qquad (2)$$
Each feature α represents a count and m is the length of the sequence. An example of this could be the count of a specific word α in a document of length m, where d is the size of the vocabulary.
Case #2: Multinomial features
Model P(x | y): use the multinomial distribution
$$P(x \mid m, y = c) = \frac{m!}{x_1!\, x_2! \cdots x_d!} \prod_{\alpha=1}^{d} (\theta_{\alpha c})^{x_\alpha}$$
where θαc is the probability of selecting word α, xα is the number of times α is chosen, and Σ_{α=1}^d θαc = 1.
So we can use this to generate a spam email, i.e., a document x of class y = spam, by picking m words independently at random from the vocabulary of d words using P(x | y = spam).
Case #2: Multinomial features
Parameter estimation:
$$\hat\theta_{\alpha c} = \frac{\sum_{i=1}^{n} I(y_i = c)\, x_{i\alpha} + l}{\sum_{i=1}^{n} I(y_i = c)\, m_i + l \cdot d} \qquad (3)$$
where m_i = Σ_{β=1}^d x_{iβ} denotes the number of words in document i. The numerator sums up all counts for feature xα and the denominator sums up all counts of all features across all data points. E.g.,
$$\frac{\#\text{ of times word } \alpha \text{ appears in all spam emails}}{\#\text{ of words in all spam emails combined}}.$$
Again, l is the smoothing parameter.
Case #2: Multinomial features
Prediction:
$$\operatorname*{argmax}_c\, P(y = c \mid x) \propto \operatorname*{argmax}_c\, \hat\pi_c \prod_{\alpha=1}^{d} (\hat\theta_{\alpha c})^{x_\alpha}$$
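As with the categorical case, the multinomial estimator in Eq. (3) and the prediction rule translate directly into code. Below is a minimal sketch, assuming a count matrix X and labels y laid out as in the comments; the helper names are hypothetical.

```python
import numpy as np

def fit_multinomial_nb(X, y, l=1.0):
    """Estimate theta_hat[alpha, c] per Eq. (3) from document counts.

    X : (n, d) count matrix, X[i, a] = count of word a in document i
    y : (n,) labels in {0, ..., C-1}; l : smoothing parameter
    """
    n, d = X.shape
    C = int(y.max()) + 1
    priors = np.zeros(C)
    theta = np.zeros((d, C))
    for c in range(C):
        Xc = X[y == c]                                            # class-c documents
        theta[:, c] = (Xc.sum(axis=0) + l) / (Xc.sum() + l * d)   # Eq. (3)
        priors[c] = Xc.shape[0] / n
    return priors, theta

def predict_multinomial_nb(x, priors, theta):
    """argmax_c log(pi_c) + sum_alpha x_alpha * log(theta[alpha, c])."""
    return int(np.argmax(np.log(priors) + x @ np.log(theta)))
```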
Case #3: Continuous features (Gaussian Naive Bayes)
Illustration of Gaussian NB: each class-conditional feature distribution P(xα|y) is assumed to originate from an independent Gaussian distribution with its own mean µ_{α,y} and variance σ²_{α,y}.
Case #3: Continuous features (Gaussian Naive Bayes)
Features:
$$x_\alpha \in \mathbb{R} \quad\text{(each feature takes on a real value)} \qquad (4)$$
Case #3: Continuous features (Gaussian Naive Bayes)
Model P(xα | y): use a Gaussian distribution
$$P(x_\alpha \mid y = c) = \mathcal{N}\!\left(\mu_{\alpha c}, \sigma_{\alpha c}^2\right) = \frac{1}{\sqrt{2\pi}\,\sigma_{\alpha c}}\, e^{-\frac{1}{2}\left(\frac{x_\alpha - \mu_{\alpha c}}{\sigma_{\alpha c}}\right)^2} \qquad (5)$$
Note that the model specified above is based on our assumption about the data: that each feature α comes from a class-conditional Gaussian distribution. The full distribution is P(x|y) ∼ N(µ_y, Σ_y), where Σ_y is a diagonal covariance matrix with [Σ_y]_{α,α} = σ²_{α,y}.
Case #3: Continuous features (Gaussian Naive Bayes)
Parameter estimation: as always, we estimate the parameters of the distributions for each dimension and class independently. Gaussian distributions have only two parameters, the mean and the variance. The mean µ_{αc} is estimated by the average feature value of dimension α over all samples with label c; the variance is the average squared deviation from this mean:
$$\mu_{\alpha c} \leftarrow \frac{1}{n_c} \sum_{i=1}^{n} I(y_i = c)\, x_{i\alpha} \quad\text{where } n_c = \sum_{i=1}^{n} I(y_i = c) \qquad (6)$$
$$\sigma_{\alpha c}^2 \leftarrow \frac{1}{n_c} \sum_{i=1}^{n} I(y_i = c)\,(x_{i\alpha} - \mu_{\alpha c})^2 \qquad (7)$$
Case #3: Continuous features (Gaussian Naive Bayes)
Prediction:
$$\operatorname*{argmax}_c\, P(y = c \mid x) \propto \operatorname*{argmax}_c\, \hat\pi_c\, P(x \mid y = c)$$
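A compact sketch of Gaussian Naive Bayes implementing Eqs. (6) and (7); the function names are assumptions, and prediction sums the per-dimension Gaussian log-likelihoods before adding the log prior.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class, per-feature means and variances (Eqs. 6 and 7)."""
    C = int(y.max()) + 1
    priors = np.array([(y == c).mean() for c in range(C)])
    mu = np.stack([X[y == c].mean(axis=0) for c in range(C)])   # (C, d), Eq. (6)
    var = np.stack([X[y == c].var(axis=0) for c in range(C)])   # (C, d), Eq. (7)
    return priors, mu, var

def predict_gaussian_nb(x, priors, mu, var):
    """Sum per-dimension Gaussian log-densities, then add the log prior."""
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)
    return int(np.argmax(np.log(priors) + log_lik))
```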
Example
The likelihood for the class Illness = No is
$$P(\text{Income} = 100000 \mid \text{Illness} = \text{No}) = \frac{1}{42397.5\,\sqrt{2\pi}}\, e^{-\frac{(100000 - 68386.66)^2}{2 \cdot 42397.5^2}} = 0.00000712789$$
The likelihood for the class Illness = Yes is
$$P(\text{Income} = 100000 \mid \text{Illness} = \text{Yes}) = \frac{1}{28972.49\,\sqrt{2\pi}}\, e^{-\frac{(100000 - 79571.8)^2}{2 \cdot 28972.49^2}} = 0.0000107421$$
Example
Prediction for X = (City = Dallas, Gender = Female, Income = 100K):
P(X|illness = No) = P(Dallas|No) × P(Female|No) × P(Income = 100K|No) = 2/5 × 2/5 × 0.00000712789 = 0.00000114046.
P(X|illness = Yes) = P(Dallas|Yes) × P(Female|Yes) × P(Income = 100K|Yes) = 2/3 × 2/3 × 0.0000107421 = 0.00000477426.
Since P(X|No) × P(No) < P(X|Yes) × P(Yes), Bayes rule gives P(No|X) < P(Yes|X), so Class = Yes.
Naive Bayes is a linear classifier
Naive Bayes leads to a linear decision boundary in many common cases. Illustrated here is the case where P(xα|y) is Gaussian and where σ_{α,c} is identical for all c (but can differ across dimensions α). The boundaries of the ellipsoids indicate regions of equal probability P(x|y). The black decision line indicates the decision boundary where P(y = 1|x) = P(y = 2|x).
Naive Bayes is a linear classifier
1. Suppose that y_i ∈ {−1, +1} and features are multinomial.
We can show that
$$h(\mathbf{x}) = \operatorname*{argmax}_y\, P(y) \prod_{\alpha=1}^{d} P(x_\alpha \mid y) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$$
That is,
$$\mathbf{w}^\top \mathbf{x} + b > 0 \iff h(\mathbf{x}) = +1.$$
Naive Bayes is a linear classifier
As before, we define P(xα|y = +1) ∝ θα+ and P(Y = +1) = π+:
$$[\mathbf{w}]_\alpha = \log(\theta_{\alpha+}) - \log(\theta_{\alpha-}) \qquad (8)$$
$$b = \log(\pi_+) - \log(\pi_-) \qquad (9)$$
If we use the above to do classification, we can compute w⊤x + b.
Naive Bayes is a linear classifier
Simplifying this further leads to
$$\mathbf{w}^\top \mathbf{x} + b > 0$$
$$\iff \sum_{\alpha=1}^{d} [\mathbf{x}]_\alpha \underbrace{\left(\log(\theta_{\alpha+}) - \log(\theta_{\alpha-})\right)}_{[\mathbf{w}]_\alpha} + \underbrace{\log(\pi_+) - \log(\pi_-)}_{b} > 0 \quad \text{(plugging in the definitions of } \mathbf{w}, b\text{)}$$
$$\iff \exp\left(\sum_{\alpha=1}^{d} [\mathbf{x}]_\alpha \left(\log(\theta_{\alpha+}) - \log(\theta_{\alpha-})\right) + \log(\pi_+) - \log(\pi_-)\right) > 1 \quad \text{(exponentiating both sides)}$$
$$\iff \frac{\exp\left(\sum_{\alpha=1}^{d} \log \theta_{\alpha+}^{[\mathbf{x}]_\alpha} + \log(\pi_+)\right)}{\exp\left(\sum_{\alpha=1}^{d} \log \theta_{\alpha-}^{[\mathbf{x}]_\alpha} + \log(\pi_-)\right)} > 1 \quad \text{(because } a \log(b) = \log(b^a) \text{ and } \exp(a - b) = \tfrac{e^a}{e^b}\text{)}$$
Naive Bayes is a linear classifier
Simplifying this further leads to
$$\iff \frac{\left(\prod_{\alpha=1}^{d} \theta_{\alpha+}^{[\mathbf{x}]_\alpha}\right) \pi_+}{\left(\prod_{\alpha=1}^{d} \theta_{\alpha-}^{[\mathbf{x}]_\alpha}\right) \pi_-} > 1 \quad \text{(because } \exp(\log(a)) = a \text{ and } e^{a+b} = e^a e^b\text{)}$$
$$\iff \frac{\prod_{\alpha=1}^{d} P([\mathbf{x}]_\alpha \mid Y = +1)\; \pi_+}{\prod_{\alpha=1}^{d} P([\mathbf{x}]_\alpha \mid Y = -1)\; \pi_-} > 1 \quad \text{(because } P([\mathbf{x}]_\alpha \mid Y = -1) = \theta_{\alpha-}^{x_\alpha}\text{)}$$
$$\iff \frac{P(\mathbf{x} \mid Y = +1)\, \pi_+}{P(\mathbf{x} \mid Y = -1)\, \pi_-} > 1 \quad \text{(by the naive Bayes assumption)}$$
$$\iff \frac{P(Y = +1 \mid \mathbf{x})}{P(Y = -1 \mid \mathbf{x})} > 1 \quad \text{(by Bayes rule, with } \pi_+ = P(Y = +1)\text{)}$$
$$\iff P(Y = +1 \mid \mathbf{x}) > P(Y = -1 \mid \mathbf{x}) \iff \operatorname*{argmax}_y P(Y = y \mid \mathbf{x}) = +1$$
i.e. the point x lies on the positive side of the hyperplane iff Naive Bayes predicts +1.
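Eqs. (8) and (9) translate directly into code: given fitted multinomial parameters for the two classes, the weight vector and bias of the equivalent linear classifier follow in two lines. The names below are assumptions made for this sketch.

```python
import numpy as np

def nb_to_linear(theta_plus, theta_minus, pi_plus, pi_minus):
    """Build (w, b) from multinomial NB parameters, per Eqs. (8) and (9).

    theta_plus, theta_minus : (d,) word probabilities for classes +1 / -1
    pi_plus, pi_minus       : class priors
    """
    w = np.log(theta_plus) - np.log(theta_minus)   # Eq. (8)
    b = np.log(pi_plus) - np.log(pi_minus)         # Eq. (9)
    return w, b

# For any count vector x, sign(w @ x + b) = +1 exactly when naive Bayes
# assigns the higher unnormalized posterior to the +1 class.
```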
Naive Bayes is a linear classifier
2. In the case of continuous features (Gaussian Naive Bayes), we can show that
$$P(y \mid \mathbf{x}) = \frac{1}{1 + e^{-y(\mathbf{w}^\top \mathbf{x} + b)}}$$
This model is also known as logistic regression. NB and LR produce asymptotically the same model if the Naive Bayes assumption holds.
Validation of the Linear Regression Model
Validation of the Simple Linear Regression Model
It is important to validate the regression model to ensure its validity and goodness of fit before it can be used for practical applications. The following measures are used to validate simple linear regression models:
• Coefficient of determination (R-square).
• Hypothesis test for the regression coefficient.
• Analysis of variance for overall model validity (relevant more for multiple linear regression).
These measures and tests are essential, but not exhaustive.
Coefficient of Determination (R-Square or R²)
• The coefficient of determination (R-square or R²) measures the percentage of variation in Y explained by the model (β₀ + β₁X).
• The simple linear regression model can be broken into explained variation and unexplained variation:
$$\underbrace{Y_i}_{\text{variation in } Y} = \underbrace{\beta_0 + \beta_1 X_i}_{\text{variation in } Y \text{ explained by the model}} + \underbrace{\varepsilon_i}_{\text{variation in } Y \text{ not explained by the model}}$$
In the absence of a predictive model for Y_i, users would use the mean value of Y_i. Thus the total variation is measured as the difference between Y_i and the mean value of Y_i (i.e., Y_i − Ȳ).
Description of total variation, explained variation and unexplained variation:
• Total variation (SST): measured by (Y_i − Ȳ), the difference between the actual value and the mean value.
• Variation explained by the model (SSR): measured by (Ŷ_i − Ȳ), the difference between the estimated value of Y_i and the mean value of Y.
• Variation not explained by the model (SSE): measured by (Y_i − Ŷ_i), the difference between the actual value and the predicted value of Y_i (the error in prediction).
The relationship between the total variation, explained variation and unexplained variation is given as follows:
$$\underbrace{Y_i - \bar{Y}}_{\text{total variation in } Y} = \underbrace{(\hat{Y}_i - \bar{Y})}_{\text{variation explained by the model}} + \underbrace{(Y_i - \hat{Y}_i)}_{\text{variation not explained by the model}}$$
It can be proved mathematically that the sum of squares of total variation equals the sum of squares of explained variation plus the sum of squares of unexplained variation:
$$\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{SST} = \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{SSE}$$
where SST is the sum of squares of total variation, SSR is the sum of squares of variation explained by the regression model, and SSE is the sum of squares of errors (unexplained variation).
Coefficient of Determination or R-Square
The coefficient of determination (R²) is given by
$$R^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
Coefficient of Determination or R-Square
Thus, R² is the proportion of variation in the response variable Y explained by the regression model. The coefficient of determination (R²) has the following properties:
• The value of R² lies between 0 and 1.
• A higher value of R² implies a better fit, but one should be aware of spurious regression.
• Mathematically, the square of the correlation coefficient equals the coefficient of determination (i.e., r² = R²).
• We do not put any minimum threshold on R²; a higher value of R² simply implies a better fit.
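A small numpy sketch (illustrative, not from the slides) that computes R² from SSE and SST exactly as defined above:

```python
import numpy as np

def r_square(y, y_hat):
    """R^2 = 1 - SSE/SST (= SSR/SST for a least-squares fit with intercept)."""
    sse = np.sum((y - y_hat) ** 2)        # unexplained variation
    sst = np.sum((y - np.mean(y)) ** 2)   # total variation
    return 1.0 - sse / sst
```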
Spurious Regression
Number of Facebook users and the number of people who died of helium poisoning in the UK:

Year | Number of Facebook users in millions (X) | Number of people who died of helium poisoning in UK (Y)
2004 | 1    | 2
2005 | 6    | 2
2006 | 12   | 2
2007 | 58   | 2
2008 | 145  | 11
2009 | 360  | 21
2010 | 608  | 31
2011 | 845  | 40
2012 | 1056 | 51
Facebook users versus helium poisoning in the UK
The regression model is given as Y = 1.9967 + 0.0465X.
The R-square value for the regression model between the number of deaths due to helium poisoning in the UK and the number of Facebook users is 0.9928. That is, 99.28% of the variation in the number of deaths due to helium poisoning in the UK is "explained" by the number of Facebook users, a clearly spurious relationship.
Hypothesis Test for the Regression Coefficient (t-Test)
• The regression coefficient (β₁) captures the existence of a linear relationship between the response variable and the explanatory variable.
• If β₁ = 0, we can conclude that there is no statistically significant linear relationship between the two variables.
The standard error of β̂₁ is given by
$$S_e(\hat\beta_1) = \frac{S_e}{\sqrt{\sum (X_i - \bar{X})^2}}$$
where S_e is the standard error of estimate (or standard error of the residuals), which measures the accuracy of prediction and is given by
$$S_e = \sqrt{\frac{\sum_{i=1}^{n} \varepsilon_i^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$$
The denominator in the above equation is (n − 2) since β₀ and β₁ are estimated from the sample in estimating Ŷ_i, and thus two degrees of freedom are lost. The standard error of β̂₁ can therefore be written as
$$S_e(\hat\beta_1) = \frac{S_e}{\sqrt{\sum (X_i - \bar{X})^2}} = \frac{\sqrt{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 / (n-2)}}{\sqrt{\sum (X_i - \bar{X})^2}}$$
The null and alternative hypotheses for the SLR model can be stated as follows:
H₀: There is no relationship between X and Y
H_A: There is a relationship between X and Y
• β₁ = 0 would imply that there is no linear relationship between the response variable Y and the explanatory variable X. Thus, the null and alternative hypotheses can be restated as follows:
H₀: β₁ = 0
H_A: β₁ ≠ 0
• The corresponding t-statistic is given by
$$t = \frac{\hat\beta_1 - \beta_1}{S_e(\hat\beta_1)} = \frac{\hat\beta_1 - 0}{S_e(\hat\beta_1)} = \frac{\hat\beta_1}{S_e(\hat\beta_1)}$$
Confidence Interval for Regression coefficients 0 and
1
 
The standard error of estimates of and  1are given by
0

𝑆𝑒 × σ𝑛𝑖=1 𝑋𝑖2
∧ 𝑆𝑒
∧ 𝑆𝑒 (𝛽1 ) =
𝑆𝑒 (𝛽0 ) = 𝑆𝑆𝑋
𝑛 × 𝑆𝑆𝑋

∧ 2
where 𝑌𝑖 − 𝑌𝑖
𝑆𝑒 =
𝑛−2

n −
2
Where Se is the standard error of residuals and SSX =  (Xi − X )
i =1

The interval estimate or (1-)100% confidence interval for  0and

 1 are given by
∧ ∧ ∧ ∧
𝛽1 ∓ 𝑡𝛼/2,𝑛−2 𝑆𝑒 (𝛽1 ) 𝛽0 ∓ 𝑡𝛼/2,𝑛−2 𝑆𝑒 (𝛽0 )
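The t-test and confidence interval above can be sketched in a few lines, assuming numpy and scipy are available; np.polyfit is used here only as a convenient least-squares fitter, and the function name is an assumption.

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y, alpha=0.05):
    """t-statistic and (1 - alpha) CI for beta_1 in simple linear regression."""
    n = len(x)
    b1, b0 = np.polyfit(x, y, 1)               # least-squares slope and intercept
    y_hat = b0 + b1 * x
    se = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))   # std. error of residuals
    ssx = np.sum((x - np.mean(x)) ** 2)
    se_b1 = se / np.sqrt(ssx)                   # standard error of beta_1_hat
    t = b1 / se_b1                              # test statistic for H0: beta_1 = 0
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t, (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
```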
Multiple Linear Regression
• Multiple linear regression means linear in the regression parameters (the beta values). The following are examples of multiple linear regression models:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_2^2 + \dots + \beta_k x_k + \varepsilon$$
An important task in multiple regression is to estimate the beta values (β₁, β₂, β₃, etc.).
Coefficient of Multiple Determination (R-Square) and Adjusted R-Square
As in the case of simple linear regression, R-square measures the proportion of variation in the dependent variable explained by the model. The coefficient of multiple determination (R-square or R²) is given by
$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
• SSE is the sum of squares of errors and SST is the sum of squares of total deviation. In the case of MLR, SSE will decrease as the number of explanatory variables increases, while SST remains constant.
• To counter this, the R² value is adjusted by normalizing both SSE and SST with the corresponding degrees of freedom. The adjusted R-square is given by
$$\text{Adjusted } R^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$$
Statistical Significance of Individual Variables in MLR: t-test
Checking the statistical significance of individual variables is achieved through a t-test. Note that the estimate of the regression coefficients is given by
$$\hat\beta = (X^\top X)^{-1} X^\top Y$$
This means the estimated value of a regression coefficient is a linear function of the response variable. Since we assume that the residuals follow a normal distribution, Y follows a normal distribution and the estimate of the regression coefficient also follows a normal distribution. Since the standard deviation of the regression coefficient is estimated from the sample, we use a t-test.
The null and alternative hypotheses for an individual independent variable X_i and the dependent variable Y are given, respectively, by
• H₀: There is no relationship between independent variable X_i and dependent variable Y
• H_A: There is a relationship between independent variable X_i and dependent variable Y
Alternatively,
• H₀: β_i = 0
• H_A: β_i ≠ 0
The corresponding test statistic is given by
$$t = \frac{\hat\beta_i - 0}{S_e(\hat\beta_i)} = \frac{\hat\beta_i}{S_e(\hat\beta_i)}$$
Validation of the Overall Regression Model: F-test
Analysis of Variance (ANOVA) is used to validate the overall regression model. If there are k independent variables in the model, then the null and alternative hypotheses are, respectively,
H₀: β₁ = β₂ = β₃ = … = β_k = 0
H₁: Not all βs are zero.
The F-statistic is given by
$$F = \frac{(SST - SSE)/k}{SSE/(n-k-1)} \sim F_{k,\,n-k-1}$$
F-test for the overall fit of the model
• The decision rule at significance level α is: reject H₀ if F ≥ F(α; k, n − k − 1), where the critical value F(α; k, n − k − 1) can be found from an F-table.
• The existence of a regression relation by itself does not assure that useful predictions can be made by using it.
• Note that when k = 1, this test reduces to the F-test for testing in simple linear regression whether or not β₁ = 0.
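A sketch of the overall F-test; since it uses the same sums of squares, the adjusted R-square from the earlier slide is computed alongside. The function name and argument layout are assumptions made for this example.

```python
import numpy as np
from scipy import stats

def overall_f_test(y, y_hat, k, alpha=0.05):
    """F = [(SST - SSE)/k] / [SSE/(n-k-1)], plus the adjusted R-square."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    F = ((sst - sse) / k) / (sse / (n - k - 1))
    f_crit = stats.f.ppf(1 - alpha, k, n - k - 1)      # F(alpha; k, n-k-1)
    adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
    return F, F > f_crit, adj_r2   # reject H0 when F exceeds the critical value
```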
Linear Regression

Tapas Kumar Mishra

August 22, 2022

Supervised learning
Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Delhi, India.
Given data like this, how can we learn to predict the prices of other houses in Delhi, as a function of the size of their living areas?
To establish notation for future use, we'll use
x^(i) to denote the input variables (living area in this example), also called input features, and
y^(i) to denote the output or target variable that we are trying to predict (price).
A pair (x^(i), y^(i)) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set.
We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = R.
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a good predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis.
When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.
Linear Regression
To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house.
Here, the x's are two-dimensional vectors in R². For instance, x₁^(i) is the living area of the i-th house in the training set, and x₂^(i) is its number of bedrooms.
To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer. As an initial choice, let's say we decide to approximate y as a linear function of x:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \qquad (1)$$
Here, the θᵢ's are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y. When there is no risk of confusion, we will drop the θ subscript in h_θ(x) and write it more simply as h(x). To simplify our notation, we also introduce the convention of letting x₀ = 1 (this is the intercept term), so that
$$h(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x. \qquad (2)$$
Linear regression stands for a function that is linear in the regression coefficients.
Now, given a training set, how do we pick, or learn, the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the θ's, how close the h(x^(i))'s are to the corresponding y^(i)'s. We define the cost function:
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h(x^{(i)}) - y^{(i)}\right)^2. \qquad (3)$$
This gives the OLS model: the ordinary least squares regression model.
LMS Algorithm: least mean squares
We want to choose θ so as to minimize J(θ). To do so, let's use a search algorithm that starts with some initial guess for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ).
Specifically, let's consider the gradient descent algorithm, which starts with some initial θ and repeatedly performs the update:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad (4)$$
Here, α is the learning rate. We update simultaneously for every value of j in the range 0 to n.
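As a sketch, batch gradient descent for this cost function takes only a few lines of numpy; the step size alpha and iteration count below are illustrative choices, not prescriptions.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    """Minimize J(theta) = 1/2 sum_i (theta^T x_i - y_i)^2.

    X : (m, n) design matrix with a leading column of ones (intercept term).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)   # dJ/dtheta, summed over all m examples
        theta = theta - alpha * grad   # simultaneous update of every theta_j
    return theta
```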
LMS for a single instance
Let's first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. Differentiating J(θ) = ½(h(x) − y)² with respect to θ_j gives (h(x) − y) x_j.
For a single training example, this gives the update rule:
$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h(x^{(i)})\right) x_j^{(i)} \qquad \text{(LMS update rule)} \qquad (5)$$
The magnitude of the update is proportional to the error term (y^(i) − h(x^(i))):
if the prediction nearly matches y^(i), little change is required;
if the prediction has a large error with respect to y^(i), a large parameter change is required.
LMS for the full dataset
The reader can easily verify that the quantity in the summation in the update rule above is just ∂J(θ)/∂θ_j (for the original definition of J). So this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent.
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h(x^{(i)}) - y^{(i)}\right)^2. \qquad (6)$$
Consider the equation y = 2 + 2x₁.
The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48, 30). The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through.
When we run batch gradient descent to fit θ on our previous dataset, to learn to predict housing price as a function of living area, we obtain θ₀ = 71.27, θ₁ = 0.1345. If we plot h_θ(x) as a function of x (area), along with the training data, we obtain the following figure:
Stochastic Gradient Descent
In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also called incremental gradient descent).
Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if m is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at.
Often, stochastic gradient descent gets θ close to the minimum much faster than batch gradient descent. But it may never converge.
Normal Equation
Given a training set, define the design matrix X to be the m-by-n matrix (actually m-by-(n+1), if we include the intercept term) that contains the training examples' input values in its rows:
$$X = \begin{bmatrix} -(x^{(1)})^T- \\ -(x^{(2)})^T- \\ \vdots \\ -(x^{(m)})^T- \end{bmatrix}$$
Also, let ~y be the m-dimensional vector containing all the target values from the training set:
$$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
We need a θ such that Xθ = ~y, so we solve for θ:
$$X\theta = \vec{y} \implies X^T X \theta = X^T \vec{y} \implies (X^T X)^{-1}(X^T X)\theta = (X^T X)^{-1} X^T \vec{y} \implies \theta = (X^T X)^{-1} X^T \vec{y}$$
(X^T X)^{-1} X^T is the Moore-Penrose pseudoinverse.
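In code the normal equation is a one-liner; the sketch below assumes numpy and a design matrix X that already includes the intercept column.

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y; prefer a solver over an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and more robust if X^T X is singular or ill-conditioned:
# theta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
```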
Probabilistic Interpretation
Data assumption: y^(i) ∈ R.
Model assumption: y^(i) = w^T x^(i) + ε^(i), where ε^(i) ∼ N(0, σ²).
$$\Rightarrow y^{(i)} \mid x^{(i)} \sim N(\mathbf{w}^\top x^{(i)}, \sigma^2)$$
$$\Rightarrow P(y^{(i)} \mid x^{(i)}, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\mathbf{w}^\top x^{(i)} - y^{(i)})^2}{2\sigma^2}}$$
In words, we assume that the data is drawn from a "line" w^T x through the origin (one can always add a bias/offset through an additional dimension). For each data point with features x^(i), the label y is drawn from a Gaussian with mean w^T x^(i) and variance σ². Our task is to estimate the slope w from the data.
Estimating with MLE
$$\mathbf{w} = \operatorname*{argmax}_{\mathbf{w}}\, P(y^{(1)}, x^{(1)}, \dots, y^{(m)}, x^{(m)} \mid \mathbf{w})$$
$$= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y^{(i)}, x^{(i)} \mid \mathbf{w}) \quad \text{(because data points are sampled iid)}$$
$$= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \mathbf{w})\, P(x^{(i)} \mid \mathbf{w}) \quad \text{(chain rule of probability)}$$
$$= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \mathbf{w})\, P(x^{(i)}) \quad \text{(} x^{(i)} \text{ is independent of } \mathbf{w}\text{; we only model } P(y^{(i)} \mid x)\text{)}$$
$$= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \mathbf{w}) \quad \text{(} P(x^{(i)}) \text{ is a constant and can be dropped)}$$
$$\mathbf{w} = \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \mathbf{w})$$
$$= \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}, \mathbf{w}) \quad \text{(log is a monotonic function)}$$
$$= \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^{m} \left[ \log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log\!\left(e^{-\frac{(\mathbf{w}^\top x^{(i)} - y^{(i)})^2}{2\sigma^2}}\right) \right] \quad \text{(plugging in the probability distribution)}$$
$$= \operatorname*{argmax}_{\mathbf{w}} -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (\mathbf{w}^\top x^{(i)} - y^{(i)})^2 \quad \text{(the first term is a constant)}$$
$$= \operatorname*{argmin}_{\mathbf{w}} \frac{1}{m} \sum_{i=1}^{m} (\mathbf{w}^\top x^{(i)} - y^{(i)})^2$$
(We conventionally minimize; the 1/m factor makes the loss interpretable as a mean.)
Estimating with MAP
Additional model assumption: w_i ∼ N(0, τ²), i.e.,
$$P(\mathbf{w}) = \frac{1}{\sqrt{2\pi\tau^2}}\, e^{-\frac{\mathbf{w}^\top \mathbf{w}}{2\tau^2}}$$
Estimating with MAP
$$\mathbf{w} = \operatorname*{argmax}_{\mathbf{w}}\, P(\mathbf{w} \mid y^{(1)}, x^{(1)}, \dots, y^{(m)}, x^{(m)})$$
$$= \operatorname*{argmax}_{\mathbf{w}} \frac{P(y^{(1)}, x^{(1)}, \dots, y^{(m)}, x^{(m)} \mid \mathbf{w})\, P(\mathbf{w})}{P(y^{(1)}, x^{(1)}, \dots, y^{(m)}, x^{(m)})}$$
$$= \operatorname*{argmax}_{\mathbf{w}}\, P(y^{(1)}, x^{(1)}, \dots, y^{(m)}, x^{(m)} \mid \mathbf{w})\, P(\mathbf{w})$$
$$= \operatorname*{argmax}_{\mathbf{w}} \left[\prod_{i=1}^{m} P(y^{(i)}, x^{(i)} \mid \mathbf{w})\right] P(\mathbf{w})$$
$$= \operatorname*{argmax}_{\mathbf{w}} \left[\prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \mathbf{w})\, P(x^{(i)} \mid \mathbf{w})\right] P(\mathbf{w})$$
"m #
Y
(i) (i) (i)
w = argmax P(y |x , w)P(x ) P(w)
w
i=1
"m #
Y
= argmax P(y (i) |x(i) , w) P(w)
w
i=1
m
X
= argmax log P(y (i) |x(i) , w) + log P(w)
w
i=1
m
1 X > (i) 1
= argmin 2
(w x − y (i) )2 + 2 w> w
w 2σ 2τ
i=1
n
1 X
σ2
= argmin (w> x(i) − y (i) )2 + λ||w||22 λ= mτ 2
w m
i=1

29/39
Tapas Kumar Mishra Linear Regression
"m #
Y
(i) (i) (i)
w = argmax P(y |x , w)P(x ) P(w)
w
i=1
"m #
Y
= argmax P(y (i) |x(i) , w) P(w)
w
i=1
m
X
= argmax log P(y (i) |x(i) , w) + log P(w)
w
i=1
m
1 X > (i) 1
= argmin 2
(w x − y (i) )2 + 2 w> w
w 2σ 2τ
i=1
n
1 X
σ2
= argmin (w> x(i) − y (i) )2 + λ||w||22 λ= mτ 2
w m
i=1

29/39
Tapas Kumar Mishra Linear Regression
"m #
Y
(i) (i) (i)
w = argmax P(y |x , w)P(x ) P(w)
w
i=1
"m #
Y
= argmax P(y (i) |x(i) , w) P(w)
w
i=1
m
X
= argmax log P(y (i) |x(i) , w) + log P(w)
w
i=1
m
1 X > (i) 1
= argmin 2
(w x − y (i) )2 + 2 w> w
w 2σ 2τ
i=1
n
1 X
σ2
= argmin (w> x(i) − y (i) )2 + λ||w||22 λ= mτ 2
w m
i=1

29/39
Tapas Kumar Mishra Linear Regression
"m #
Y
(i) (i) (i)
w = argmax P(y |x , w)P(x ) P(w)
w
i=1
"m #
Y
= argmax P(y (i) |x(i) , w) P(w)
w
i=1
m
X
= argmax log P(y (i) |x(i) , w) + log P(w)
w
i=1
m
1 X > (i) 1
= argmin 2
(w x − y (i) )2 + 2 w> w
w 2σ 2τ
i=1
n
1 X
σ2
= argmin (w> x(i) − y (i) )2 + λ||w||22 λ= mτ 2
w m
i=1

29/39
Tapas Kumar Mishra Linear Regression
"m #
Y
(i) (i) (i)
w = argmax P(y |x , w)P(x ) P(w)
w
i=1
"m #
Y
= argmax P(y (i) |x(i) , w) P(w)
w
i=1
m
X
= argmax log P(y (i) |x(i) , w) + log P(w)
w
i=1
m
1 X > (i) 1
= argmin 2
(w x − y (i) )2 + 2 w> w
w 2σ 2τ
i=1
n
1 X
σ2
= argmin (w> x(i) − y (i) )2 + λ||w||22 λ= mτ 2
w m
i=1

29/39
Tapas Kumar Mishra Linear Regression
$$\mathbf{w} = \operatorname*{argmin}_{\mathbf{w}} \frac{1}{m} \sum_{i=1}^{m} (\mathbf{w}^\top x^{(i)} - y^{(i)})^2 + \lambda \lVert \mathbf{w} \rVert_2^2, \qquad \lambda = \frac{\sigma^2}{m\tau^2}$$
This objective is known as Ridge Regression. It has a closed-form solution
$$\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \vec{y}$$
where X and ~y are the design matrix and target vector defined before.
Ordinary Least Squares:
  min_w (1/m) Σ_{i=1}^m (x_i^T w − y_i)².
  Squared loss.
  No regularization.
  Closed form: w = (X^T X)^(−1) X^T y.

Ridge Regression:
  min_w (1/m) Σ_{i=1}^m (x_i^T w − y_i)² + λ ||w||²₂.
  Squared loss.
  l2-regularization.
  Closed form: w = (X^T X + λI)^(−1) X^T y.
Locally weighted linear regression
Consider the problem of predicting y from x ∈ R. The leftmost
figure below shows the result of fitting a line y = θ0 + θ1 x to a
dataset. We see that the data doesn't really lie on a straight line,
so the fit is not very good. This is Underfitting: the structure of
the data is not captured by the model.
Instead, if we add an extra feature x² and fit y = θ0 + θ1 x + θ2 x²,
we obtain a slightly better fit to the data. Naively, it might seem
that the more features we add, the better.
However, there is a danger in adding too many features. The figure
below shows the result of fitting a 5th-order polynomial
y = Σ_{j=0}^5 θ_j x^j. Even though the fitted curve passes through
the data perfectly, it is not a good predictor of y (housing prices)
for different x (living area). This is Overfitting.
In the original linear regression algorithm, to make a prediction at
a query point x (i.e., to evaluate h(x)), we would:
  1. Fit θ to minimize Σ_i (y^(i) − θ^T x^(i))².
  2. Output θ^T x.
In contrast, the locally weighted linear regression algorithm does
the following:
  1. Fit θ to minimize Σ_i z^(i) (y^(i) − θ^T x^(i))².
  2. Output θ^T x.
Here the z^(i) are non-negative weights.
If z^(i) is large for a particular i, then in picking θ we will try
hard to make (y^(i) − θ^T x^(i))² small.
If z^(i) is small for a particular i, then (y^(i) − θ^T x^(i))² is
largely ignored in the fit.
A fairly standard choice for the weights is

    z^(i) = exp( −(x^(i) − x)² / (2τ²) )

The weights depend on the query point x.
If |x^(i) − x| is small, z^(i) is close to 1; if |x^(i) − x| is
large, z^(i) is small.
Hence, θ is chosen giving a much higher weight to the training
examples close to the query point x.
τ is the bandwidth parameter.
Parametric vs non-parametric

Unweighted linear regression is parametric: once we have fitted the
θi's and stored them, we no longer need the training data to make
future predictions.
Locally weighted linear regression is non-parametric: to make future
predictions, we need to keep the entire training set around.
Logistic Regression

Tapas Kumar Mishra

August 22, 2022

Classification problem

A classification problem is just like a regression problem; the
difference is that y takes a small number of discrete values.
We will focus on the binary classification problem, where y can
take only two values: 0 and 1.
In spam classification, x^(i) may be the feature vector of the ith
email, with y = 1 denoting that the email is spam and y = 0 that it
is not.
y = 0 is called the negative class and y = 1 the positive class.
y^(i) is the label of the ith instance.
Logistic regression

    h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x))

where g(a) = 1 / (1 + e^(−a)) is the logistic (sigmoid) function.
Note that g(z) → 1 as z → ∞ and g(z) → 0 as z → −∞.
Moreover, g(z) (and hence h(x)) is always bounded between 0 and 1.
We always set x0 = 1, so that θ^T x = θ0 + Σ_{j=1}^n θ_j x_j.
Useful property: g′(z) = g(z)(1 − g(z)).
Let us assume that
    P(y = 1 | x; θ) = h_θ(x);
    P(y = 0 | x; θ) = 1 − h_θ(x).
This can be written compactly as

    p(y | x; θ) = (h_θ(x))^y (1 − h_θ(x))^(1−y)

Assuming that all training examples were generated independently,
the likelihood L(θ) is given by L(θ) = p(y | X; θ).
Our job is to maximize L(θ).

argmax_θ L(θ) = argmax_θ p(y | X; θ)
  = argmax_θ ∏_{i=1}^m p(y^(i) | x^(i); θ)
  = argmax_θ ∏_{i=1}^m (h_θ(x^(i)))^{y^(i)} (1 − h_θ(x^(i)))^{1−y^(i)}

Taking the log, as it is easier to handle:
  = argmax_θ Σ_{i=1}^m [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
To maximize the likelihood, we use gradient ascent:

    θ := θ + α ∇_θ ℓ(θ)

We start by taking just one training example (x, y) and take
derivatives to derive the stochastic gradient ascent rule.
This gives the stochastic gradient ascent rule

    θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

This looks just like the LMS update rule!
Logistic Regression is the discriminative counterpart to Naive
Bayes.
In Naive Bayes, we first model P(x|y) for each label y, and then
obtain the decision boundary that best discriminates between these
two distributions.
In Logistic Regression we do not attempt to model the data
distribution P(x|y); instead, we model P(y|x) directly.
The fact that we don't make any assumption about P(x|y) allows
logistic regression to be more flexible, but such flexibility also
requires more data to avoid overfitting.
Typically, in scenarios with little data where the modeling
assumption is appropriate, Naive Bayes tends to outperform Logistic
Regression.
However, as datasets become large, logistic regression often
outperforms Naive Bayes, which suffers from the fact that the
assumptions made on P(x|y) are probably not exactly correct.
If the assumptions hold exactly, i.e. the data is truly drawn from
the distribution assumed in Naive Bayes, then Logistic Regression
and Naive Bayes converge to the exact same result in the limit.
Optimizing the training process:
Underfitting, overfitting, testing, and regularization
• Let's say that we have to study for a test.
• Several things could go wrong during our study process.
• Maybe we didn't study enough. There's no way to fix that, and we'll likely
perform poorly on our test. ---------- Underfitting
• What if we studied a lot, but in the wrong way? For example, instead of focusing
on learning, we decided to memorize the entire textbook word for word. Will
we do well on our test? It's likely that we won't, because we simply memorized
everything without learning. ---------- Overfitting
• The best option, of course, would be to study for the exam properly and in a
way that enables us to answer new questions on the topic that we haven't seen
before. ---------- Generalization
• Notice that model 1 is too simple, because it is a line trying to fit a quadratic dataset. There is no way
we’ll find a good line to fit this dataset, because the dataset simply does not look like a line.
Therefore, model 1 is a clear example of underfitting.
• Model 2, in contrast, fits the data pretty well. This model neither overfits nor underfits.
• Model 3 fits the data extremely well, but it completely misses the point. The data is meant to look
like a parabola with a bit of noise, and the model draws a very complicated polynomial of degree 10
that manages to go through each one of the points but doesn’t capture the essence of the data.
Model 3 is a clear example of overfitting.
How do we get the computer to pick the right
model?
By testing
• Testing a model consists of picking a small set of the points in the dataset and choosing to use
them not for training the model but for testing the model’s performance. This set of points is
called the testing set.
• The remaining set of points (the majority), which we use for training the
model, is called the training set.
• Once we’ve trained the model on the training set, we use the
testing set to evaluate the model.
• In this way, we make sure that the model is good at generalizing
to unseen data, as opposed to memorizing the training set.
• Going back to the exam analogy, let’s imagine training and testing this way.
• Let’s say that the book we are studying for in the exam has
100 questions at the end.
• We pick 80 of them to train, which means we study them carefully, look
up the answers, and learn them.
• Then we use the remaining 20 questions to test ourselves—we
try to answer them without looking at the book, as in an exam setting.
• Looking at the top row, we can see that model 1 has a large training
error, model 2 has a small training error, and model 3 has a tiny
training error (zero, in fact). Thus, model 3 does the best job on the
training set.

• Model 1 still has a large testing error, meaning that this is simply a bad
model, underperforming with the training and the testing set: it
underfits.

• Model 2 has a small testing error, which means it is a good model,
because it fits both the training and the testing set well.

• Model 3, however, produces a large testing error. Because it
did such a terrible job fitting the testing set, yet such a good job
fitting the training set, we conclude that model 3 overfits.
How do we pick the testing set, and how big should it be?

• A portion of the dataset is picked randomly (or based on some
features, in the case of temporal data) as the test set.
• In practice, 10-20% of the data is kept as the test set.
Can we use our testing data for training the model? No.
• We broke the golden rule in the previous example.
• Recall that we had three polynomial regression models: one of degree
1, one of degree 2, and one of degree 10, and we didn’t know which one
to pick.
• We used our training data to train the three models, and then we used
the testing data to decide which model to pick.
• We are not supposed to use the testing data to train our model or to
make any decisions on the model or its hyperparameters.
Solution: Validation Set
We break our dataset into the following three sets:
• Training set: for training all our models
• Validation set: for making decisions on which model to use
• Testing set: for checking how well our model did

It is common to use a 60-20-20 split or an 80-10-10 split; in other
words, 60% training, 20% validation, 20% testing, or
80% training, 10% validation, 10% testing.
A numerical way to decide how complex our model should be:
the model complexity graph

• Imagine that we have a different and much more complex dataset, and we are trying to build a
polynomial regression model to fit it. We want to decide the degree of our model among the
numbers between 0 and 10 (inclusive).
• The way to decide which model to use is to pick the one that has the smallest validation error.
• However, plotting the training and validation errors can give us some valuable information and
help us examine trends.

The model complexity graph
Another alternative to avoiding overfitting: Regularization

• Simple models tend to underfit, and complex models tend to overfit.
• In the previous methods, we tested several models and selected the one that
best balanced performance and complexity.
• In contrast, when we use regularization, we don't need to train several models.
We simply train the model once, but during the training we try to not only
improve the model's performance but also reduce its complexity.
Another alternative to avoiding overfitting: Regularization
• Performance (in mL of water leaked)
Roofer 1: 1000 mL water
Roofer 2: 1 mL water
Roofer 3: 0 mL water
• Complexity (in price)
Roofer 1: $1
Roofer 2: $100
Roofer 3: $100,000
• Performance + complexity
Roofer 1: 1001
Roofer 2: 101
Roofer 3: 100,000

Now it is clear that roofer 2 is the best one, which means that optimizing performance and
complexity at the same time yields good results that are also as simple as possible. This is what
regularization is about: measuring performance and complexity with two different error functions,
and adding them to get a more robust error function.
Regularization - Measuring how complex a model is: L1 and L2 norm

• Notice that a model with more coefficients, or coefficients of higher value,
tends to be more complex. Therefore, any formula that matches this will work,
such as the following:
• The sum of the absolute values of the coefficients
• The sum of the squares of the coefficients
The first one is called the L1 norm, and the second one is called the L2 norm.
Regularization - Modifying the error function

• In the roofer analogy, our goal was to find a roofer that provided both good quality
and low complexity. We did this by minimizing the sum of two numbers: the measure of quality
and the measure of complexity. Regularization consists of applying the same principle to our
machine learning model.
• Regression error: a measure of the quality of the model. In this case, it can be the absolute
or square errors.
• Regularization term: a measure of the complexity of the model. It can be the L1 or the L2
norm of the model.
• Error = Regression error + λ · Regularization term
• λ is the regularization hyperparameter.
• Lasso regression error = Regression error + λ · L1 norm
• Ridge regression error = Regression error + λ · L2 norm
Regularization - Effects of L1 and L2 regularization

• A quick rule of thumb for deciding whether to use L1 or L2 regularization:
• If we have too many features and we'd like to get rid of most of them, L1 regularization is
perfect for that.
• If we have only a few features and believe they are all relevant, then L2 regularization is
what we need, because it won't get rid of our useful features.
Bias-Variance Tradeoff

Tapas Kumar Mishra

August 28, 2022


As usual, we are given a dataset D = {(x1, y1), ..., (xn, yn)},
drawn i.i.d. from some distribution P(X, Y). Throughout this
lecture we assume a regression setting, i.e. y ∈ R.
In this lecture we will decompose the generalization error of a
classifier into three rather interpretable terms.
Before we do that, let us consider that for any given input x
there might not exist a unique label y.
For example, if your vector x describes the features of a house
(e.g. #bedrooms, square footage, ...) and the label y its price, you
could imagine two houses with identical descriptions selling for
different prices.
So for any given feature vector x, there is a distribution over
possible labels. We therefore define the following, which will come
in useful later on:
Expected Label (given x ∈ R^d):

    ȳ(x) = E_{y|x}[Y] = ∫_y y Pr(y|x) dy

The expected label denotes the label you would expect to obtain,
given a feature vector x.
We draw our training set D, consisting of n inputs, i.i.d. from the
distribution P. As a second step we typically call some machine
learning algorithm A on this dataset to learn a hypothesis (aka
classifier). Formally, we denote this process as h_D = A(D).
For a given h_D, learned on dataset D with algorithm A, we can
compute the generalization error (as measured in squared loss) as
follows:
Expected Test Error (given h_D):

    E_{(x,y)∼P}[(h_D(x) − y)²] = ∫_x ∫_y (h_D(x) − y)² Pr(x, y) dy dx
The previous statement is true for a given training set D.
However, remember that D itself is drawn from P^n, and is
therefore a random variable.
Further, h_D is a function of D, and is therefore also a random
variable. And we can of course compute its expectation:
Expected Classifier (given A):

    h̄ = E_{D∼P^n}[h_D] = ∫_D h_D Pr(D) dD

where Pr(D) is the probability of drawing D from P^n. Here, h̄ is a
weighted average over functions.
We can also use the fact that h_D is a random variable to compute
the expected test error given only A, taking the expectation also
over D.
Expected Test Error (given A):

    E_{(x,y)∼P, D∼P^n}[(h_D(x) − y)²] = ∫_D ∫_x ∫_y (h_D(x) − y)² P(x, y) P(D) dy dx dD

To be clear, D is our training points and the (x, y) pairs are the
test points.
We are interested in exactly this expression, because it evaluates
the quality of a machine learning algorithm A with respect to a
data distribution P(X, Y). In the following we will show that this
expression decomposes into three meaningful terms.
Decomposition of Expected Test Error

E_{x,y,D}[(h_D(x) − y)²]
  = E_{x,y,D}[ ( (h_D(x) − h̄(x)) + (h̄(x) − y) )² ]
  = E_{x,D}[(h_D(x) − h̄(x))²]
    + 2 E_{x,y,D}[(h_D(x) − h̄(x))(h̄(x) − y)]        (this term is 0)
    + E_{x,y}[(h̄(x) − y)²]                           (1)
The middle term of the above equation is 0, as we show below:

E_{x,y,D}[(h_D(x) − h̄(x))(h̄(x) − y)]
  = E_{x,y}[ E_D[h_D(x) − h̄(x)] (h̄(x) − y) ]
  = E_{x,y}[ (E_D[h_D(x)] − h̄(x)) (h̄(x) − y) ]
  = E_{x,y}[ (h̄(x) − h̄(x)) (h̄(x) − y) ]
  = E_{x,y}[0]
  = 0
E_{x,y,D}[(h_D(x) − y)²]
  = E_{x,D}[(h_D(x) − h̄(x))²] + 2 E_{x,y,D}[(h_D(x) − h̄(x))(h̄(x) − y)] + E_{x,y}[(h̄(x) − y)²]   (2)

Returning to the earlier expression, with the middle term gone we
are left with the variance and another term:

    E_{x,y,D}[(h_D(x) − y)²] = E_{x,D}[(h_D(x) − h̄(x))²] + E_{x,y}[(h̄(x) − y)²]   (3)
                               (Variance)                  (decomposed next)
We can break down the second term in the above equation as follows:

E_{x,y}[(h̄(x) − y)²]
  = E_{x,y}[ ( (h̄(x) − ȳ(x)) + (ȳ(x) − y) )² ]
  = E_{x,y}[(ȳ(x) − y)²] + E_x[(h̄(x) − ȳ(x))²] + 2 E_{x,y}[(h̄(x) − ȳ(x))(ȳ(x) − y)]   (4)
    (Noise)                (Bias²)                (this term is 0)
The third term in the equation above is 0, as we show below:

E_{x,y}[(h̄(x) − ȳ(x))(ȳ(x) − y)]
  = E_x[ E_{y|x}[ȳ(x) − y] (h̄(x) − ȳ(x)) ]
  = E_x[ (ȳ(x) − E_{y|x}[y]) (h̄(x) − ȳ(x)) ]
  = E_x[ (ȳ(x) − ȳ(x)) (h̄(x) − ȳ(x)) ]
  = E_x[0]
  = 0
This gives us the decomposition of the expected test error:

    E_{x,y,D}[(h_D(x) − y)²] = E_{x,D}[(h_D(x) − h̄(x))²] + E_x[(h̄(x) − ȳ(x))²] + E_{x,y}[(ȳ(x) − y)²]
    (Expected Test Error)      (Variance)                  (Bias²)                (Noise)
Variance: E_{x,D}[(h_D(x) − h̄(x))²]

Captures how much your classifier changes if you train on a
different training set.
How "over-specialized" is your classifier to a particular training
set (overfitting)?
If we have the best possible model for our training data, how far
off are we from the average classifier?
Bias²: E_x[(h̄(x) − ȳ(x))²]

What is the inherent error that you obtain from your classifier,
even with infinite training data?
This is due to your classifier being "biased" towards a particular
kind of solution (e.g. a linear classifier).
In other words, bias is inherent to your model.
Noise: E_{x,y}[(ȳ(x) − y)²]

How big is the data-intrinsic noise?
This error measures ambiguity due to your data distribution and
feature representation. You can never beat this; it is an aspect of
the data.
Figure: Graphical illustration of bias and variance.
Figure: The variation of bias and variance with model complexity.
This is similar to the concept of overfitting and underfitting: more
complex models overfit while the simplest models underfit.
Detecting High Bias and High Variance

Figure: Test and training error as the number of training instances
increases.

The graph above plots the training error and the test error and can
be divided into two overarching regimes. In the first regime (on the
left side of the graph), training error is below the desired error
threshold (denoted by ε), but test error is significantly higher.
Figure: Test and training error as the number of training instances
increases.

In the second regime (on the right side of the graph), test error is
remarkably close to training error, but both are above the desired
tolerance of ε.
Regime 1 (High Variance)
Symptoms:
  Training error is much lower than test error
  Training error is lower than ε
  Test error is above ε
Remedies:
  Add more training data
  Reduce model complexity: complex models are prone to high variance
  Bagging (will be covered later in the course)
Regime 2 (High Bias): the model being used is not robust enough to
produce an accurate prediction
Symptoms:
  Training error is higher than ε, but close to test error.
Remedies:
  Use a more complex model (e.g. kernelize, use non-linear models)
  Add features
  Boosting (will be covered later in the course)
Model Selection
Performance estimation techniques
Always evaluate models as if they are predicting future data
We do not have access to future data, so we pretend that some data is hidden
Simplest way: the holdout (simple train-test split)
Randomly split data (and labels) into training, validation, and test set (e.g. 60%-20%-20%)
Train (fit) a model on the training data, minimize error on the validation set, and score on the test data
K-fold Cross-validation

Each random split can yield very different models (and scores)
e.g. all easy (or hard) examples could end up in the test set
Split data into k equal-sized parts, called folds
Create k splits, each time using a different fold as the test set
Compute k evaluation scores, aggregate afterwards (e.g. take the mean)
Large k gives better estimates (more training data), but is expensive
Stratified K-Fold cross-validation

If the data is unbalanced, some classes have only a few samples
It is then likely that some classes are not present in the test set
Stratification: proportions between classes are conserved in each fold
  Order examples per class
  Separate the samples of each class in k sets (strata)
  Combine corresponding strata into folds
Leave-One-Out cross-validation

k-fold cross-validation with k equal to the number of samples
Completely unbiased (in terms of data splits), but computationally expensive
Actually generalizes less well towards unseen data
  The training sets are correlated (they overlap heavily)
  Overfits on the data used for (the entire) evaluation
  A different sample of the data can yield different results
Recommended only for small datasets
The Bootstrap

Sample n (dataset size) data points, with replacement, as the training set (the bootstrap)
On average, a bootstrap includes about 63% of all data points (some are duplicates)
Use the unsampled (out-of-bootstrap) samples as the test set
Repeat k times to obtain k scores
Repeated cross-validation

Cross-validation is still biased in that the initial split can be made in many ways
Repeated, or n-times-k-fold, cross-validation:
  Shuffle the data randomly, do k-fold cross-validation
  Repeat n times, yielding n times k scores
Unbiased, very robust, but n times more expensive
Cross-validation with groups

Sometimes the data contains inherent groups:


Multiple samples from same patient, images from same person,...
Data from the same person may end up in the training and test set
We want to measure how well the model generalizes to other people
Make sure that data from one person are in either the train or test set
This is called grouping or blocking
Leave-one-subject-out cross-validation: test set for each subject/group
Time series
When the data is ordered, random test sets are not a good idea
Test-then-train (prequential evaluation)
  Every new sample is evaluated only once, then added to the training set
  Can also be done in batches (of n samples at a time)
TimeSeriesSplit
  In the kth split, the first k folds are the train set and the (k+1)th fold is the validation set
  Often, a maximum training set size (or window) is used
  More robust against concept drift (change in the data over time)
Choosing a performance estimation procedure
No strict rules, only guidelines:
Always use stratification for classification (sklearn does this by default)
Use holdout for very large datasets (e.g. >1,000,000 examples)
  Or when learners don't always converge (e.g. deep learning)
Choose k depending on dataset size and resources
  Use leave-one-out for very small datasets (e.g. <100 examples)
  Use cross-validation otherwise
  Most popular (and theoretically sound): 10-fold CV
  Literature suggests 5x2-fold CV is better
Use grouping or leave-one-subject-out for grouped data
Use test-then-train for time series
Binary classification

We have a positive and a negative class
Two different kinds of errors:
  False Positive (type I error): the model predicts positive while the true label is negative
  False Negative (type II error): the model predicts negative while the true label is positive
They are not always equally important
  Which side do you want to err on for a medical test?
Confusion matrices

We can represent all predictions (correct and incorrect) in a confusion matrix
  n by n array (n is the number of classes)
  Rows correspond to true classes, columns to predicted classes
  Count how often samples belonging to a class C are classified as C or any other class
For binary classification, we label these true negative (TN), true positive (TP),
false negative (FN), and false positive (FP):

             | Predicted Neg | Predicted Pos
  Actual Neg | TN            | FP
  Actual Pos | FN            | TP

confusion_matrix(y_test, y_pred):
[[48  5]
 [ 5 85]]
Predictive accuracy

Accuracy can be computed based on the confusion matrix
Not useful if the dataset is very imbalanced
  E.g. credit card fraud: is 99.99% accuracy good enough?

    Accuracy = (TP + TN) / (TP + TN + FP + FN)     (1)

3 models: very different predictions, same accuracy.
Precision

Use when the goal is to limit FPs
  Clinical trials: you only want to test drugs that really work
  Search engines: you want to avoid bad search results

    Precision = TP / (TP + FP)     (2)
Recall

Use when the goal is to limit FNs
  Cancer diagnosis: you don't want to miss a serious disease
  Search engines: you don't want to omit important hits
Also known as sensitivity, hit rate, or true positive rate (TPR)

    Recall = TP / (TP + FN)     (3)
F1-score

Trades off precision and recall:

    F1 = 2 · (precision · recall) / (precision + recall)     (4)
Classification measure Zoo

https://en.wikipedia.org/wiki/Precision_and_recall
Multi-class classification

Train models per class: one class viewed as positive, the other(s) as negative, then average
micro-averaging: count total TP, FP, TN, FN (every sample equally important)
  micro-precision, micro-recall, micro-F1, and accuracy are all the same

    micro-Precision = (Σ_{c=1}^C TP_c) / (Σ_{c=1}^C TP_c + Σ_{c=1}^C FP_c)

which for C = 2 reduces to (TP + TN) / (TP + TN + FP + FN).
Other useful classification metrics

Cohen's Kappa
  Measures 'agreement' between different models (aka inter-rater agreement)
  To evaluate a single model, compare it against a model that does random guessing
    Similar to accuracy, but taking into account the possibility of predicting the
    right class by chance
  Can be weighted: different misclassifications given different weights
  1: perfect prediction, 0: random prediction, negative: worse than random
  With p₀ = accuracy and p_e = accuracy of a random classifier:

      κ = (p₀ − p_e) / (1 − p_e)

Matthews correlation coefficient
  Corrects for imbalanced data; an alternative to balanced accuracy or AUROC
  1: perfect prediction, 0: random prediction, -1: inverse prediction

      MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Precision-Recall curve

The best trade-off between precision and recall depends on your application
  You can have arbitrarily high recall, but you often want reasonable precision, too
Plotting precision against recall for all possible thresholds yields a precision-recall curve
  Change the threshold until you find a sweet spot in the precision-recall trade-off
  Often jagged at high thresholds, when there are few positive examples left
Model selection

Some models can achieve trade-offs that others can't


Your application may require very high recall (or very high precision)
Choose the model that offers the best trade-off, given your application
The area under the PR curve (AUPRC) gives the best overall model
Receiver Operating Characteristics (ROC)

Trade off the true positive rate TPR = TP / (TP + FN) with the
false positive rate FPR = FP / (FP + TN)
Plotting TPR against FPR for all possible thresholds yields a
Receiver Operating Characteristics curve
  Change the threshold until you find a sweet spot in the TPR-FPR trade-off
  Lower thresholds yield higher TPR (recall) and higher FPR, and vice versa
Visualization

Histograms show the number of points with a certain decision value (for each class)
  TPR = TP / (TP + FN) can be seen from the positive predictions (top histogram)
  FPR = FP / (FP + TN) can be seen from the negative predictions (bottom histogram)
Model selection

Again, some models can achieve trade-offs that others can't
Your application may require minimizing FPR (low FP) or maximizing TPR (low FN)
The area under the ROC curve (AUROC or AUC) gives the best overall model
  Frequently used for evaluating models on imbalanced data
  Random guessing (TPR=FPR) or predicting the majority class (TPR=FPR=1): 0.5 AUC
Regression metrics
Most commonly used are:

  mean squared error:  MSE = Σ_i (y_pred_i − y_actual_i)² / n
    root mean squared error (RMSE = √MSE) is often used as well

  mean absolute error:  MAE = Σ_i |y_pred_i − y_actual_i| / n
    Less sensitive to outliers and large errors

R squared

  R² = 1 − Σ_i (y_pred_i − y_actual_i)² / Σ_i (y_mean − y_actual_i)²

  Ratio of the variation explained by the model to the total variation
  Between 0 and 1, but negative if the model is worse than just predicting the mean
  Easier to interpret (higher is better).
Decision tree learning
Inductive inference with decision trees
▪ Inductive reasoning is a method of reasoning in which a body of
observations is considered in order to derive a general principle.
▪ Decision trees are one of the most widely used and practical
methods of inductive inference.
▪ Features
  ▪ Method for approximating discrete-valued functions (including boolean)
  ▪ Learned functions are represented as decision trees (or if-then-else rules)
  ▪ Expressive hypothesis space, including disjunction
Decision tree representation (PlayTennis)

Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong → No
Decision trees expressivity
▪ Decision trees represent a disjunction of conjunctions of
constraints on attribute values:
  (Outlook = Sunny ∧ Humidity = Normal)
  ∨ (Outlook = Overcast)
  ∨ (Outlook = Rain ∧ Wind = Weak)
Decision trees representation
When to use Decision Trees
▪ Problem characteristics:
▪ Instances can be described by attribute value pairs
▪ Target function is discrete valued
▪ Disjunctive hypothesis may be required
▪ Possibly noisy training data samples
▪ Robust to errors in training data
▪ Missing attribute values
▪ Different classification problems:
▪ Equipment or medical diagnosis
▪ Credit risk analysis
▪ Several tasks in natural language processing
Top-down induction of Decision Trees
▪ ID3 (Quinlan, 1986) is a basic algorithm for learning DT's
▪ Given a training set of examples, the algorithms for building DT
performs search in the space of decision trees
▪ The construction of the tree is top-down. The algorithm is greedy.
▪ The fundamental question is “which attribute should be tested next?
Which question gives us more information?”
▪ Select the best attribute
▪ A descendent node is then created for each possible value of this
attribute and examples are partitioned according to this value
▪ The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
Which attribute is the best classifier?

▪ A statistical property called information gain measures how well
a given attribute separates the training examples
▪ Information gain uses the notion of entropy, commonly used in
information theory
▪ Information gain = expected reduction of entropy
Entropy in binary classification
▪ Entropy measures the impurity of a collection of examples. It
depends on the distribution of the random variable p.
  ▪ S is a collection of training examples
  ▪ p₊ is the proportion of positive examples in S
  ▪ p₋ is the proportion of negative examples in S

Entropy(S) ≡ −p₊ log₂ p₊ − p₋ log₂ p₋          [0 log₂ 0 = 0]
Entropy([14+, 0−]) = −14/14 log₂(14/14) − 0 log₂ 0 = 0
Entropy([9+, 5−]) = −9/14 log₂(9/14) − 5/14 log₂(5/14) = 0.94
Entropy([7+, 7−]) = −7/14 log₂(7/14) − 7/14 log₂(7/14) = 1/2 + 1/2 = 1   [log₂(1/2) = −1]

Note: the log of a number < 1 is negative; 0 ≤ p ≤ 1, so 0 ≤ entropy ≤ 1.
Entropy
Entropy in general
▪ Entropy measures the amount of information in a random variable:

  H(X) = −p₊ log₂ p₊ − p₋ log₂ p₋          X = {+, −}
  (binary classification: a two-valued random variable)

  H(X) = −Σ_{i=1}^c p_i log₂ p_i = Σ_{i=1}^c p_i log₂(1/p_i)          X = {1, ..., c}
  (classification into c classes)

Example: rolling a die with 8 equally probable sides:
  H(X) = −Σ_{i=1}^8 (1/8) log₂(1/8) = −log₂(1/8) = log₂ 8 = 3
Entropy and information theory
▪ Entropy specifies the average length (in bits) of the message
needed to transmit the outcome of a random variable. This depends
on the probability distribution.
▪ An optimal-length code assigns −log₂ p bits to a message with
probability p. The most probable messages get shorter codes.
▪ Example: 8-sided [unbalanced] die

  side:        1     2     3     4     5     6     7     8
  probability: 4/16  4/16  2/16  2/16  1/16  1/16  1/16  1/16
  code length: 2     2     3     3     4     4     4     4

  E = (1/4 · log₂ 4) × 2 + (1/8 · log₂ 8) × 2 + (1/16 · log₂ 16) × 4 = 1 + 3/4 + 1 = 2.75
Information gain as entropy reduction
▪ Information gain is the expected reduction in entropy caused by
partitioning the examples on an attribute.
▪ The higher the information gain, the more effective the attribute
is in classifying the training data.
▪ Expected reduction in entropy knowing A:

  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

  Values(A): possible values for A
  S_v: subset of S for which A has value v
Example: expected information gain
▪ Let
  ▪ Values(Wind) = {Weak, Strong}
  ▪ S = [9+, 5−]
  ▪ S_Weak = [6+, 2−]
  ▪ S_Strong = [3+, 3−]
▪ Information gain due to knowing Wind:
  Gain(S, Wind) = Entropy(S) − 8/14 Entropy(S_Weak) − 6/14 Entropy(S_Strong)
                = 0.94 − 8/14 × 0.811 − 6/14 × 1.00
                = 0.048
Which attribute is the best classifier?
Example
First step: which attribute to test at the root?

▪ Which attribute should be tested at the root?
  ▪ Gain(S, Outlook) = 0.246
  ▪ Gain(S, Humidity) = 0.151
  ▪ Gain(S, Wind) = 0.084
  ▪ Gain(S, Temperature) = 0.029
▪ Outlook provides the best prediction for the target
▪ Let's grow the tree:
  ▪ add to the tree a successor for each possible value of Outlook
  ▪ partition the training samples according to the value of Outlook
After the first step
Second step
▪ Working on the Outlook=Sunny node:
  Gain(S_Sunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970
  Gain(S_Sunny, Wind) = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019
  Gain(S_Sunny, Temp.) = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570
▪ Humidity provides the best prediction for the target
▪ Let's grow the tree:
  ▪ add to the tree a successor for each possible value of Humidity
  ▪ partition the training samples according to the value of Humidity
Second and third steps

Leaves: {D1, D2, D8} → No;  {D9, D11} → Yes;  {D4, D5, D10} → Yes;  {D6, D14} → No
ID3: algorithm
ID3(X, T, Attrs) X: training examples:
T: target attribute (e.g. PlayTennis),
Attrs: other attributes, initially all attributes
Create Root node
If all X's are +, return Root with class +
If all X's are –, return Root with class –
If Attrs is empty return Root with class most common value of T in X
else
A  best attribute; decision attribute for Root  A
For each possible value vi of A:
- add a new branch below Root, for test A = vi
- Xi  subset of X with A = vi
- If Xi is empty then add a new leaf with class the most common value of T in X
else add the subtree generated by ID3(Xi, T, Attrs − {A})
return Root
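A minimal Python sketch of the pseudocode above, assuming examples are stored as dicts mapping attribute names to values; the dict-based tree layout and helper names are illustrative choices, not from the slides. For simplicity it branches only on attribute values actually present in X, so the empty-subset case of the pseudocode never arises:

```python
import math
from collections import Counter

def entropy_of(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(X, target, attr):
    """Expected entropy reduction from partitioning X on attr."""
    base = entropy_of([x[target] for x in X])
    for v in {x[attr] for x in X}:
        Xv = [x for x in X if x[attr] == v]
        base -= len(Xv) / len(X) * entropy_of([x[target] for x in Xv])
    return base

def id3(X, target, attrs):
    labels = [x[target] for x in X]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:          # all examples share one class: leaf
        return labels[0]
    if not attrs:                      # no attributes left: majority leaf
        return majority
    A = max(attrs, key=lambda a: info_gain(X, target, a))
    node = {"attr": A, "default": majority, "branches": {}}
    for v in {x[A] for x in X}:        # one branch per observed value of A
        Xv = [x for x in X if x[A] == v]
        node["branches"][v] = id3(Xv, target, attrs - {A})
    return node

# Usage (hypothetical): id3(examples, "PlayTennis",
#                           {"Outlook", "Humidity", "Wind", "Temperature"})
```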
Inductive bias in decision tree learning
▪ The inductive bias (also known as learning bias) of a learning
algorithm is the set of assumptions that the learner uses to
predict outputs for inputs that it has not encountered.
▪ What is the inductive bias of DT learning?
1. Shorter trees are preferred over longer trees
2. Prefer trees that place high information gain attributes close to
the root
Prefer shorter hypotheses: Occam's razor
▪ Why prefer shorter hypotheses?
▪ Arguments in favor:
▪ There are fewer short hypotheses than long ones
▪ If a short hypothesis fits the data, it is unlikely to be a coincidence
▪ Elegance and aesthetics
▪ Arguments against:
▪ Not every short hypothesis is a reasonable one.
▪ Occam's razor:"The simplest explanation is usually the best one."
Issues in decision trees learning
▪ Overfitting
▪ Extensions
▪ Continuous valued attributes
▪ Alternative measures for selecting attributes
▪ Handling training examples with missing attribute values
▪ Handling attributes with different costs
▪ Improving computational efficiency
▪ Most of these improvements in C4.5 (Quinlan, 1993)
Overfitting: definition
▪ Building trees that “adapt too much” to the training examples
may lead to “overfitting”.
▪ Consider error of hypothesis h over
▪ training data: errorD(h) empirical error
▪ entire distribution X of data: errorX(h) expected error
▪ Hypothesis h overfits training data if there is an alternative
hypothesis h' ∈ H such that
errorD(h) < errorD(h’) and
errorX(h’) < errorX(h)
i.e. h’ behaves better over unseen data
Example
D15: Sunny, Hot, Normal, Strong → PlayTennis = No
Overfitting in decision trees
Outlook=Sunny, Temp=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No
The new noisy example causes splitting of the second leaf node.
Overfitting in decision tree learning
Avoid overfitting in Decision Trees
▪ Two strategies:
1. Stop growing the tree earlier, before perfect classification
2. Allow the tree to overfit the data, and then post-prune the tree
▪ Training and validation set
▪ split the training in two parts (training and validation) and use
validation to assess the utility of post-pruning
▪ Reduced error pruning
▪ Rule pruning
Reduced-error pruning (Quinlan 1987)
▪ Each node is a candidate for pruning
▪ Pruning consists in removing a subtree rooted in a node: the
node becomes a leaf and is assigned the most common
classification
▪ Nodes are removed only if the resulting tree performs no
worse on the validation set.
▪ Nodes are pruned iteratively: at each iteration the node
whose removal most increases accuracy on the validation set is
pruned.
▪ Pruning stops when no pruning increases accuracy
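A sketch of reduced-error pruning over the dict-based trees produced by the ID3 sketch earlier (each internal node stores its majority class in "default"); the greedy loop below is one plausible reading of the procedure described above, not the exact published algorithm:

```python
import copy

def classify(tree, x):
    # Walk the tree to a leaf label; unseen values fall back to the majority.
    while isinstance(tree, dict):
        tree = tree["branches"].get(x.get(tree["attr"]), tree["default"])
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, x) == x[target] for x in examples) / len(examples)

def internal_paths(tree, path=()):
    """Yield the branch-value path to every internal (dict) node."""
    if isinstance(tree, dict):
        yield path
        for v, sub in tree["branches"].items():
            yield from internal_paths(sub, path + (v,))

def prune_at(tree, path):
    """Copy of tree with the node at `path` replaced by its majority leaf."""
    new = copy.deepcopy(tree)
    if not path:
        return new["default"]
    node = new
    for v in path[:-1]:
        node = node["branches"][v]
    node["branches"][path[-1]] = node["branches"][path[-1]]["default"]
    return new

def reduced_error_prune(tree, validation, target):
    best_acc = accuracy(tree, validation, target)
    while isinstance(tree, dict):
        best = None
        for path in internal_paths(tree):
            cand = prune_at(tree, path)
            acc = accuracy(cand, validation, target)
            if acc >= best_acc:        # prune only if it performs no worse
                best, best_acc = cand, acc
        if best is None:               # no candidate helps: stop
            return tree
        tree = best                    # keep the most helpful prune, repeat
    return tree
```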
Effect of reduced error pruning
Rule post-pruning
1. Create the decision tree from the training set
2. Convert the tree into an equivalent set of rules
▪ Each path corresponds to a rule
▪ Each node along a path corresponds to a pre-condition
▪ Each leaf classification corresponds to the post-condition
3. Prune (generalize) each rule by removing those preconditions
whose removal improves accuracy …
▪ … over validation set
▪ … over training with a pessimistic, statistically inspired, measure
4. Sort the rules in estimated order of accuracy, and consider
them in sequence when classifying new instances
Converting to rules
Example: (Outlook=Sunny) ∧ (Humidity=High) ⇒ (PlayTennis=No)
Why convert to rules?
▪ Each distinct path produces a different rule: a condition
removal may be based on a local (contextual) criterion. Node
pruning is global and affects all the rules
▪ In rule form, tests are not ordered and there is no book-
keeping involved when conditions (nodes) are removed
▪ Converting to rules improves readability for humans
Dealing with continuous-valued attributes
▪ So far discrete values for attributes and for outcome.
▪ Given a continuous-valued attribute A, dynamically create a
new attribute Ac
Ac = True if A < c, False otherwise
▪ How to determine threshold value c ?
▪ Example. Temperature in the PlayTennis example
▪ Sort the examples according to Temperature
Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No
▪ Determine candidate thresholds by averaging consecutive values where
there is a change in classification: (48+60)/2 = 54 and (80+90)/2 = 85
▪ Evaluate candidate thresholds (attributes) according to information gain.
The best is Temperature > 54. The new attribute competes with the other
ones.
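A small sketch of the threshold-candidate step, reproducing the Temperature numbers above (`candidate_thresholds` is our own name):

```python
def candidate_thresholds(values, labels):
    """Midpoints of consecutive sorted values where the class changes."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, l1), (v2, l2) in zip(pairs, pairs[1:])
            if l1 != l2]

temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, play))   # [54.0, 85.0]
```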
Problems with information gain
▪ Natural bias of information gain: it favours attributes with
many possible values.
▪ Consider the attribute Date in the PlayTennis example.
▪ Date would have the highest information gain since it perfectly
separates the training data.
▪ It would be selected at the root resulting in a very broad tree
▪ Very good on the training, this tree would perform poorly in predicting
unknown instances. Overfitting.
▪ The problem is that the partition is too specific, too many small
classes are generated.
▪ We need to look at alternative measures …
An alternative measure: gain ratio
SplitInformation(S, A) ≡ − Σi=1..c (|Si| / |S|) log2 (|Si| / |S|)
▪ Si are the sets obtained by partitioning on value i of A
▪ SplitInformation measures the entropy of S with respect to the values of A. The
more uniformly dispersed the data the higher it is.
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)
▪ GainRatio penalizes attributes that split the examples into many small
classes, such as Date. Let |S| = n; Date splits the examples into n classes
▪ SplitInformation(S, Date) = −[(1/n log2 1/n) + … + (1/n log2 1/n)] = −log2 (1/n) = log2 n
▪ Compare with A, which splits data in two even classes:
▪ SplitInformation(S, A) = −[(1/2 log2 1/2) + (1/2 log2 1/2)] = −[−1/2 − 1/2] = 1
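A quick sketch confirming the two SplitInformation extremes above (taking n = 14, as in PlayTennis, is an illustrative choice):

```python
import math

def split_information(subset_sizes):
    """Entropy of S with respect to the partition induced by A."""
    n = sum(subset_sizes)
    return -sum((s / n) * math.log2(s / n) for s in subset_sizes if s > 0)

n = 14
print(split_information([1] * n))        # Date-like split: log2(14) ≈ 3.81
print(split_information([n // 2] * 2))   # even two-way split: 1.0
```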
Adjusting gain-ratio
▪ Problem: SplitInformation(S, A) can be zero or very small
when |Si | ≈ |S | for some value i
▪ To mitigate this effect, the following heuristic has been used:
1. compute Gain for each attribute
2. apply GainRatio only to attributes with Gain above average
Handling incomplete training data
▪ How to cope with the problem that the value of some attribute
may be missing?
▪ Example: Blood-Test-Result in a medical diagnosis problem
▪ The strategy: use other examples to guess the missing attribute value
1. Assign the value that is most common among the training examples at
the node
2. Assign a probability to each value, based on frequencies, and assign
values to missing attribute, according to this probability distribution
▪ Missing values in new instances to be classified are treated
accordingly, and the most probable classification is chosen
(C4.5)
Handling attributes with different
costs
▪ Instance attributes may have an associated cost: we would
prefer decision trees that use low-cost attributes
▪ ID3 can be modified to take into account costs:
1. Tan and Schlimmer (1990): select attributes by
   Gain²(S, A) / Cost(S, A)
2. Nunez (1988): select attributes by
   (2^Gain(S, A) − 1) / (Cost(A) + 1)^w,  where w ∈ [0, 1]
Gini (impurity) Index
▪ The Gini index is a measure of diversity in a dataset. In other
words, if we have a set in which all the elements are similar,
this set has a low Gini index, and if all the elements are
different, it has a large Gini index.
▪ For clarity, consider the following two sets of 10 colored balls
(where any two balls of the same color are indistinguishable):
▪ Set 1: eight red balls, two blue balls
▪ Set 2: four red balls, three blue balls, two yellow balls, one green ball
▪ Set 1 looks more pure than set 2, because set 1 contains
mostly red balls and a couple of blue ones, whereas set 2 has
many different colors. Next, we devise a measure of impurity
that assigns a low value to set 1 and a high value to set 2.
Gini (impurity) Index
▪ If we pick two random elements of the set, what is the probability
that they have a different color? The two elements don't need to be
distinct; we are allowed to pick the same element twice.
▪ P(picking two balls of different color) = 1 – P(picking two balls
of the same color)
▪ P(picking two balls of the same color) = P(both balls are color 1)
+ P(both balls are color 2) + … + P(both balls are color n)
▪ P(both balls are color i) = pi²
▪ P(picking two balls of different colors) = 1 – p1² – p2² – … – pn²
Gini (impurity) Index
▪ Gini impurity index:
In a set with m elements and n classes, with ai elements belonging
to the i-th class, the Gini impurity index is
Gini = 1 – p1² – p2² – … – pn² , where pi = ai / m
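A minimal sketch of this formula, checked against the two ball sets above:

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i^2) from per-class counts a_i."""
    m = sum(counts)
    return 1 - sum((a / m) ** 2 for a in counts)

print(round(gini([8, 2]), 2))        # Set 1 (mostly red): 0.32
print(round(gini([4, 3, 2, 1]), 2))  # Set 2 (mixed colors): 0.70
```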
Gini (impurity) Index
Split on Gender:
1. Female: Gini = 1 − (0.2)² − (0.8)² = 0.32
2. Male: Gini = 1 − (0.65)² − (0.35)² = 0.45
3. Gini(Gender) = (10/30) × 0.32 + (20/30) × 0.45 = 0.406
Split on Class:
1. IX: Gini = 1 − (0.43)² − (0.57)² = 0.49
2. X: Gini = 1 − (0.56)² − (0.44)² = 0.49
3. Gini(Class) = (14/30) × 0.49 + (16/30) × 0.49 = 0.49
The attribute producing the lowest Gini impurity index is selected for
the split.
References
▪ Tom Mitchell, Machine Learning, McGraw-Hill International
Editions, 1997 (Chapter 3).
Dimensionality
Constrained Optimization - Lagrange Multipliers
• For a rectangle whose perimeter is 20 m, find the dimensions that
will maximize the area.
• Solution
• The area A of a rectangle with width x and height y is A=xy . The
perimeter P of the rectangle is then given by the formula P=2x+2y . Since
we are given that the perimeter P=20 , this problem can be stated as:
• Maximize : f(x,y)=xy
• given : 2x+2y=20
• Substituting y = 10 − x gives f(x) = x(10 − x) = 10x − x²
• f′(x) = 10 − 2x = 0 ⇒ x = 5, and f″(5) = −2 < 0, so the Second Derivative Test tells
us that x = 5 is a local maximum for f
• y = 10 − x = 5, so the maximizing rectangle is the 5 m × 5 m square
Constrained Optimization - Lagrange Multipliers
• Notice in the above example that the ease of the solution
depended on being able to solve for one variable in terms of
the other in the equation 2x + 2y = 20.
• But what if that were not possible (which is often the case)? In
this section we will use a general method, called the Lagrange
multiplier method, for solving constrained optimization
problems:
Constrained Optimization - Lagrange Multipliers
Constrained Optimization - Lagrange Multipliers
• For a rectangle whose perimeter is 20 m, use the Lagrange
multiplier method to find the dimensions that will maximize the
area.
• Maximize : f(x,y)=xy given : 2x+2y=20
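A minimal sketch of the Lagrange computation: write the constraint as g(x, y) = 2x + 2y − 20 = 0 and require ∇f = λ∇g. This gives y = 2λ and x = 2λ, hence x = y; substituting into 2x + 2y = 20 yields x = y = 5, the same 5 m × 5 m square found before.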
Constrained Optimization - Lagrange Multipliers
Curse of Dimensionality: Complexity
KNN
● A binary classification example with k = 3. The green point in the center is the test sample x. The labels of the 3 neighbors are 2 × (+1) and 1 × (−1), resulting in the majority prediction (+1).
● Assumption: similar inputs have similar outputs.
● Classification rule: for a test input x, assign the most common label amongst its k most similar training inputs.
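A minimal k-NN sketch matching the rule above (the 2-D points are made up to mirror the picture; `math.dist` is the Euclidean distance from the standard library):

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Majority vote over the k training points nearest to x."""
    nearest = sorted(train, key=lambda pt: math.dist(pt[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical data: two +1 points near the test sample, two -1 points farther away.
train = [((1, 1), +1), ((1, 2), +1), ((3, 3), -1), ((4, 4), -1)]
print(knn_predict(train, (1.5, 1.5), k=3))   # +1
```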
Curse of Dimensionality: Number of Samples
The Curse of Dimensionality
• We should try to avoid creating a lot of features
• Often there is no choice: the problem starts with many features
• Example: Face Detection
The Curse of Dimensionality