
Naïve Bayes

Jia-Bin Huang
ECE-5424G / CS-5824 Virginia Tech Spring 2019
Administrative
• HW 1 out today. Please start early!

• Office hours
• Chen: Wed 4pm-5pm
• Shih-Yang: Fri 3pm-4pm
• Location: Whittemore 266
Linear Regression
• Model representation
  h_θ(x) = θᵀx = θ₀ + θ₁x₁ + ⋯ + θₙxₙ

• Cost function
  J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

• Gradient descent for linear regression
  Repeat until convergence { θⱼ ≔ θⱼ − α (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ }

• Features and polynomial regression
  Can combine features; can use different functions to generate features (e.g., polynomial)

• Normal equation
  θ = (XᵀX)⁻¹ Xᵀy
x₀    Size in feet² (x₁)   Number of bedrooms (x₂)   Number of floors (x₃)   Age of home in years (x₄)   Price ($) in 1000's (y)
1     2104                 5                         1                       45                          460
1     1416                 3                         2                       40                          232
1     1534                 3                         2                       30                          315
1     852                  2                         1                       36                          178
…     …

        1  2104  5  1  45                460
X  =    1  1416  3  2  40         y  =   232
        1  1534  3  2  30                315
        1   852  2  1  36                178

θ = (XᵀX)⁻¹ Xᵀy                                        Slide credit: Andrew Ng
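
A minimal numpy sketch of the normal equation on the four examples above; the pseudo-inverse is used because with only four examples and five parameters XᵀX is singular:

import numpy as np

# Design matrix: a leading column of ones (x0) plus the four features from the table.
X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# Normal equation: theta = (X^T X)^{-1} X^T y.
# np.linalg.pinv handles the singular X^T X here (more parameters than examples).
theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)
print(theta)
print(X @ theta)   # ~ [460, 232, 315, 178]: the four training prices are fit exactly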
Least square solution

Justification/interpretation 1
• Geometric interpretation
  [Figure: the target vector y, its projection Xθ onto the column space of X, and the residual Xθ − y]

• Col(X): the column space of X, i.e., the span of the columns of X

• The residual Xθ − y is orthogonal to the column space of X
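
Spelling out that orthogonality condition (assuming XᵀX is invertible) recovers the normal equation:

\[
X^\top (X\theta - y) = 0
\;\Longrightarrow\;
X^\top X\,\theta = X^\top y
\;\Longrightarrow\;
\theta = (X^\top X)^{-1} X^\top y .
\]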


Justification/interpretation 2
• Probabilistic model
• Assume a linear model with Gaussian errors:
  y⁽ⁱ⁾ = θᵀx⁽ⁱ⁾ + ε⁽ⁱ⁾,  ε⁽ⁱ⁾ ~ N(0, σ²)

• Solving for the maximum likelihood estimate of θ recovers the least-squares solution

Image credit: CS 446@UIUC
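
One way to spell out that maximum-likelihood step, under the Gaussian-error assumption above:

\[
\log \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big)
= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\Big(-\tfrac{\big(y^{(i)} - \theta^\top x^{(i)}\big)^2}{2\sigma^2}\Big)
= \mathrm{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \big(y^{(i)} - \theta^\top x^{(i)}\big)^2,
\]

so maximizing the likelihood over θ is exactly minimizing the least-squares cost J(θ).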


Justification/interpretation 3
• Loss minimization

• Least squares loss: ℓ(h_θ(x), y) = (h_θ(x) − y)²

• Empirical Risk Minimization (ERM):
  min_θ (1/m) Σᵢ ℓ(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)

m training examples, n features

Gradient Descent                              Normal Equation
• Need to choose the learning rate α          • No need to choose α
• Needs many iterations                       • Don't need to iterate
• Works well even when n is large             • Need to compute (XᵀX)⁻¹
                                              • Slow if n is very large

Slide credit: Andrew Ng
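
A minimal batch gradient-descent sketch for the update rule above (the learning rate and iteration count below are illustrative choices, not values from the slides):

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for J(theta) = (1/2m) * sum((X @ theta - y) ** 2)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m   # dJ/dtheta
        theta = theta - alpha * gradient       # simultaneous update of every theta_j
    return theta

In practice the features should be scaled (e.g., mean normalization) so that a single learning rate α works across all dimensions.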


Things to remember
• Model representation
  h_θ(x) = θᵀx = θ₀ + θ₁x₁ + ⋯ + θₙxₙ

• Cost function
  J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

• Gradient descent for linear regression
  Repeat until convergence { θⱼ ≔ θⱼ − α (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ }

• Features and polynomial regression
  Can combine features; can use different functions to generate features (e.g., polynomial)

• Normal equation
  θ = (XᵀX)⁻¹ Xᵀy
Today’s plan
• Probability basics

• Estimating parameters from data


• Maximum likelihood (ML)
• Maximum a posteriori estimation (MAP)

• Naïve Bayes
Today’s plan
• Probability basics

• Estimating parameters from data


• Maximum likelihood (ML)
• Maximum a posteriori estimation (MAP)

• Naïve Bayes
Random variables
• Outcome space S
• Space of possible outcomes

• Random variables
• Functions that map outcomes to real numbers

• Event E
• Subset of S
Visualizing probability
[Venn diagram: sample space with area 1; the event "A is true" is a blue circle, "A is false" is the rest]

P(A) = area of the blue circle

Visualizing probability

[Venn diagram: "A is true" and "A is false" partition the sample space]

P(A) + P(¬A) = 1
Visualizing probability
[Venn diagram: A split by B into A∧B and A∧¬B]

P(A) = P(A, B) + P(A, ¬B)
Visualizing conditional probability
[Venn diagram: events A and B, with overlap A∧B]

P(A|B) = P(A, B) / P(B)

Corollary: the chain rule
P(A, B) = P(A|B) P(B)
Bayes rule
[Venn diagram: events A and B, with overlap A∧B; portrait of Thomas Bayes]

P(A|B) = P(B|A) P(A) / P(B)

Corollary: the chain rule
P(A, B) = P(A|B) P(B) = P(B|A) P(A)
Other forms of Bayes rule

P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|¬A) P(¬A))
Applying Bayes rule

• A = you have the flu
• B = you just coughed

• Assume: P(A) = 0.05, P(B|A) = 0.8, P(B|¬A) = 0.2

• What is P(flu | cough) = P(A|B)?

P(A|B) = (0.8 × 0.05) / (0.8 × 0.05 + 0.2 × 0.95) ≈ 0.17


Slide credit: Tom Mitchell
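
The same computation as a short Python sketch (numbers taken from the assumption above):

p_flu = 0.05                 # P(A)
p_cough_given_flu = 0.8      # P(B|A)
p_cough_given_noflu = 0.2    # P(B|~A)

p_cough = p_cough_given_flu * p_flu + p_cough_given_noflu * (1 - p_flu)   # P(B)
p_flu_given_cough = p_cough_given_flu * p_flu / p_cough                   # Bayes rule
print(round(p_flu_given_cough, 2))   # ~0.17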
Why are we learning this?

[Diagram: input x → hypothesis h → output y; the hypothesis h is learned from training data]
Joint distribution

• Making a joint distribution of M variables:
1. Make a truth table listing all combinations of values
2. For each combination of values, say how probable it is
3. Probabilities must sum to 1

A   B   C   Prob
0   0   0   0.30
0   0   1   0.05
0   1   0   0.10
0   1   1   0.05
1   0   0   0.05
1   0   1   0.10
1   1   0   0.25
1   1   1   0.10

Slide credit: Tom Mitchell


Using joint distribution
• Can ask for the probability of any logical expression E
  involving these variables: P(E) = Σ_{rows matching E} P(row)

Slide credit: Tom Mitchell
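
A small sketch of such a query; the dictionary below mirrors the truth table from the previous slide, and P(A ∨ B) is just one example expression:

# Joint distribution over (A, B, C), copied from the truth table.
joint = {(0, 0, 0): 0.30, (0, 0, 1): 0.05,
         (0, 1, 0): 0.10, (0, 1, 1): 0.05,
         (1, 0, 0): 0.05, (1, 0, 1): 0.10,
         (1, 1, 0): 0.25, (1, 1, 1): 0.10}
assert abs(sum(joint.values()) - 1.0) < 1e-9    # probabilities sum to 1

# P(E): sum the probabilities of the rows where the expression E holds.
p_a_or_b = sum(p for (a, b, c), p in joint.items() if a or b)
print(p_a_or_b)   # 0.65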


The solution: learn the joint distribution?
• Main problem: learning the joint distribution may require more data than we have

• Say, learning a joint distribution with 100 attributes

• # of rows in this table? 2¹⁰⁰ ≥ 10³⁰

• # of people on earth? about 10⁹

Slide credit: Tom Mitchell


What should we do?
1. Be smart about
how we estimate probabilities from sparse data
• Maximum likelihood estimates (ML)
• Maximum a posteriori estimates (MAP)

2. Be smart about
how to represent joint distributions
• Bayes network, graphical models (more on this later)
Slide credit: Tom Mitchell
Today’s plan
• Probability basics

• Estimating parameters from data


• Maximum likelihood (ML)
• Maximum a posteriori (MAP)

• Naïve Bayes
Estimating the probability
• Flip the coin repeatedly, observing X ∈ {1, 0} (1 = heads, 0 = tails), with P(X = 1) = θ
• It turns up heads α₁ times
• It turns up tails α₀ times
• Your estimate for θ is?

• Case A: 100 flips: 51 Heads (α₁ = 51), 49 Tails (α₀ = 49)

• Case B: 3 flips: 2 Heads (α₁ = 2), 1 Tail (α₀ = 1)

Slide credit: Tom Mitchell
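
A one-line MLE sketch for the two cases (the function name is just illustrative):

def mle_heads(num_heads, num_tails):
    """MLE of theta = P(X = 1): the observed fraction of heads."""
    return num_heads / (num_heads + num_tails)

print(mle_heads(51, 49))   # Case A: 0.51
print(mle_heads(2, 1))     # Case B: ~0.67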


Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE)
Choose θ that maximizes the probability of the observed data

• Maximum a posteriori estimation (MAP)

Choose θ that is most probable given the prior probability and the
data

Slide credit: Tom Mitchell


Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE)
Choose θ that maximizes P(D | θ):
  θ̂_MLE = argmax_θ P(D | θ)

• Maximum a posteriori estimation (MAP)

Choose θ that maximizes P(θ | D):
  θ̂_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)
Slide credit: Tom Mitchell


Maximum likelihood estimate
• Each flip yields a Boolean value for X: X ∈ {1, 0}, with P(X = 1) = θ and P(X = 0) = 1 − θ

• A data set D of n independent, identically distributed (iid) flips
  produces α₁ ones and α₀ zeros:

  P(D | θ) = θ^α₁ (1 − θ)^α₀,     θ̂_MLE = argmax_θ P(D | θ) = α₁ / (α₁ + α₀)

Slide credit: Tom Mitchell


Beta prior distribution

P(θ) = Beta(β₁, β₀) = θ^(β₁−1) (1 − θ)^(β₀−1) / B(β₁, β₀)
Slide credit: Tom Mitchell
Maximum a posteriori (MAP) estimate
• Data set D of n iid flips, X ∈ {1, 0},
  produces α₁ ones and α₀ zeros

• Assume prior P(θ) = Beta(β₁, β₀) (conjugate prior: closed-form representation of the posterior)

  θ̂_MAP = argmax_θ P(θ | D) = (α₁ + β₁ − 1) / (α₁ + β₁ + α₀ + β₀ − 2)

Slide credit: Tom Mitchell
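
A sketch of that MAP estimate (the mode of the Beta posterior); the prior hyperparameters act like "imaginary" flips, and the values β₁ = β₀ = 3 below are only an example:

def map_heads(num_heads, num_tails, beta1=3, beta0=3):
    """MAP estimate of theta under a Beta(beta1, beta0) prior: the posterior mode."""
    return (num_heads + beta1 - 1) / (num_heads + num_tails + beta1 + beta0 - 2)

print(map_heads(51, 49))   # Case A: ~0.51, the prior barely matters with 100 flips
print(map_heads(2, 1))     # Case B: ~0.57, pulled toward 0.5 from the MLE of ~0.67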


Some terminology
• Likelihood function: P(D | θ)
• Prior: P(θ)
• Posterior: P(θ | D)

• Conjugate prior:
P(θ) is the conjugate prior for a likelihood function P(D | θ) if
the prior and the posterior P(θ | D) have the same form.
• Example (coin flip problem)
• Prior: Beta. Likelihood: Binomial
• Posterior: Beta
Slide credit: Tom Mitchell
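
Spelled out for the coin-flip problem (with α₁ observed heads and α₀ observed tails):

\[
P(\theta \mid D) \;\propto\; P(D \mid \theta)\, P(\theta)
\;\propto\; \theta^{\alpha_1}(1-\theta)^{\alpha_0}\,\theta^{\beta_1-1}(1-\theta)^{\beta_0-1}
= \theta^{\alpha_1+\beta_1-1}(1-\theta)^{\alpha_0+\beta_0-1},
\]

i.e., the posterior is again a Beta distribution, Beta(α₁ + β₁, α₀ + β₀), the same form as the prior.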
How many parameters?
• Suppose we want to estimate P(Y | X), where X = ⟨X₁, …, Xₙ⟩
  and the Xᵢ and Y are Boolean random variables

To estimate P(Y | X₁, …, Xₙ), we need one parameter for each of the 2ⁿ settings of ⟨X₁, …, Xₙ⟩

When X = (Gender, Hours-worked)?

When X = ⟨X₁, …, Xₙ⟩ for large n?

Slide credit: Tom Mitchell


Can we reduce parameters using Bayes rule?
  P(Y | X₁, …, Xₙ) = P(X₁, …, Xₙ | Y) P(Y) / P(X₁, …, Xₙ)

• How many parameters for P(X₁, …, Xₙ | Y)?  2(2ⁿ − 1)

• How many parameters for P(Y)?  1
Slide credit: Tom Mitchell


Today’s plan
• Probability basics

• Estimating parameters from data


• Maximum likelihood (ML)
• Maximum a posteriori estimation (MAP)

• Naïve Bayes
Naïve Bayes
• Assumption:
  P(X₁, …, Xₙ | Y) = Πᵢ P(Xᵢ | Y)

• i.e., Xᵢ and Xⱼ are conditionally independent
  given Y, for i ≠ j
Slide credit: Tom Mitchell


Conditional independence
• Definition: X is conditionally independent of Y given Z if the probability
distribution governing X is independent of the value of Y, given the value
of Z:  P(X | Y, Z) = P(X | Z)

Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Slide credit: Tom Mitchell


Applying conditional independence
• Naïve Bayes assumes the Xᵢ are conditionally independent given Y
  e.g., P(X₁ | X₂, Y) = P(X₁ | Y)

General form: P(X₁, …, Xₙ | Y) = Πᵢ P(Xᵢ | Y)

How many parameters to describe P(X₁, …, Xₙ | Y)? P(Y)?
• Without the conditional independence assumption?  2(2ⁿ − 1) and 1
• With the conditional independence assumption?  2n and 1
Slide credit: Tom Mitchell
Naïve Bayes classifier
• Bayes rule:
  P(Y = yₖ | X₁, …, Xₙ) = P(Y = yₖ) P(X₁, …, Xₙ | Y = yₖ) / Σⱼ P(Y = yⱼ) P(X₁, …, Xₙ | Y = yⱼ)

• Assume conditional independence among the Xᵢ's:
  P(Y = yₖ | X₁, …, Xₙ) ∝ P(Y = yₖ) Πᵢ P(Xᵢ | Y = yₖ)

• Pick the most probable Y:
  Y ← argmax_{yₖ} P(Y = yₖ) Πᵢ P(Xᵢ | Y = yₖ)
Slide credit: Tom Mitchell


Naïve Bayes algorithm – discrete
• For each value yₖ of Y
  Estimate P(Y = yₖ)
  For each value xᵢⱼ of each attribute Xᵢ
  Estimate P(Xᵢ = xᵢⱼ | Y = yₖ)

• Classify X^new:  Y^new ← argmax_{yₖ} P(Y = yₖ) Πᵢ P(Xᵢ^new | Y = yₖ)
Estimating parameters: discrete
• Maximum likelihood estimates (MLE)
  P̂(Y = yₖ) = #D{Y = yₖ} / |D|
  P̂(Xᵢ = xᵢⱼ | Y = yₖ) = #D{Xᵢ = xᵢⱼ ∧ Y = yₖ} / #D{Y = yₖ}
Slide credit: Tom Mitchell


• F = 1 iff you live in Fox Ridge
• S = 1 iff you watched the Super Bowl last night
• D = 1 iff you drive to VT
• G = 1 iff you went to the gym in the last month

P(F | S, D, G) ∝ P(F) P(S | F) P(D | F) P(G | F)
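
A minimal counting-based sketch of a discrete Naïve Bayes classifier in this spirit; the tiny data set below (columns S, D, G with label F) is made up purely for illustration:

import numpy as np

# Hypothetical binary data: columns are (S, D, G), label is F.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1],
              [0, 0, 0], [1, 1, 1], [0, 1, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

def train_nb(X, y):
    """MLE estimates: P(F = 1) and P(X_i = 1 | F = c) for each attribute i."""
    prior = y.mean()                                      # P(F = 1)
    cond = {c: X[y == c].mean(axis=0) for c in (0, 1)}    # P(X_i = 1 | F = c)
    return prior, cond

def predict_nb(x, prior, cond):
    """Pick the class c maximizing P(F = c) * prod_i P(X_i = x_i | F = c)."""
    scores = {}
    for c in (0, 1):
        p = cond[c]
        likelihood = np.prod(np.where(x == 1, p, 1 - p))
        scores[c] = (prior if c == 1 else 1 - prior) * likelihood
    return max(scores, key=scores.get)

prior, cond = train_nb(X, y)
print(predict_nb(np.array([1, 0, 1]), prior, cond))   # -> 1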


Naïve Bayes: Subtlety #1
• Often the Xᵢ are not really conditionally independent

• Naïve Bayes often works pretty well anyway

• It often gives the right classification, even when not the right probability [Domingos
  & Pazzani, 1996]

• What is the effect on the estimated P(Y | X)?

• What if we have two copies of a feature, i.e., Xᵢ = Xₖ?
Slide credit: Tom Mitchell


Naïve Bayes: Subtlety #2
MLE estimate for P(Xᵢ = xᵢⱼ | Y = yₖ) might be zero.
(for example, Xᵢ = birthdate, xᵢⱼ = Feb_4_1995)

• Why worry about just one parameter out of many?
  A single zero factor drives the whole product P(Y = yₖ) Πᵢ P(Xᵢ | Y = yₖ) to zero, no matter what the other attributes say

• What can we do to address this?

• MAP estimates (adding “imaginary” examples)

Slide credit: Tom Mitchell


Estimating parameters: discrete
• Maximum likelihood estimates (MLE)
  P̂(Xᵢ = xᵢⱼ | Y = yₖ) = #D{Xᵢ = xᵢⱼ ∧ Y = yₖ} / #D{Y = yₖ}

• MAP estimates (Dirichlet priors): add m “imaginary” examples for each attribute value
  P̂(Xᵢ = xᵢⱼ | Y = yₖ) = (#D{Xᵢ = xᵢⱼ ∧ Y = yₖ} + m) / (#D{Y = yₖ} + m·J),  where J is the number of values Xᵢ can take
Slide credit: Tom Mitchell
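
A sketch of that smoothed estimate; the argument "pseudo" plays the role of the imaginary examples per attribute value (the names are illustrative):

def smoothed_estimate(count_xij_and_yk, count_yk, num_values, pseudo=1):
    """MAP / Dirichlet-prior estimate of P(X_i = x_ij | Y = y_k) with pseudo-counts."""
    return (count_xij_and_yk + pseudo) / (count_yk + pseudo * num_values)

# An attribute value never observed with this class still gets a nonzero probability:
print(smoothed_estimate(0, 10, num_values=2))   # 1/12 instead of the MLE's 0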


What if we have continuous Xᵢ?
• Gaussian Naïve Bayes (GNB): assume
  P(Xᵢ = x | Y = yₖ) = N(x; μᵢₖ, σᵢₖ)

• Additional possible assumptions on the variance σᵢₖ:
• it is independent of Y (σᵢ)
• it is independent of Xᵢ (σₖ)
• it is independent of both Xᵢ and Y (σ)

Slide credit: Tom Mitchell


Naïve Bayes algorithm – continuous
• For each value yₖ of Y
  Estimate P(Y = yₖ)
  For each attribute Xᵢ, estimate the
  class-conditional mean μᵢₖ and variance σ²ᵢₖ

• Classify X^new:  Y^new ← argmax_{yₖ} P(Y = yₖ) Πᵢ N(Xᵢ^new; μᵢₖ, σ²ᵢₖ)

Slide credit: Tom Mitchell
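
A minimal Gaussian Naïve Bayes sketch along these lines (per-class, per-attribute mean and variance); the two-feature data set at the bottom is made up for illustration:

import numpy as np

def train_gnb(X, y):
    """Estimate P(Y = c) plus the class-conditional mean and variance of each attribute."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),            # P(Y = c)
                     Xc.mean(axis=0),             # mu_ic
                     Xc.var(axis=0) + 1e-9)       # sigma_ic^2 (small floor for stability)
    return params

def predict_gnb(x, params):
    """Classify by maximizing log P(Y = c) + sum_i log N(x_i; mu_ic, sigma_ic^2)."""
    def log_score(c):
        prior, mu, var = params[c]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=log_score)

X = np.array([[5.0, 1.2], [4.8, 1.0], [6.5, 2.1], [6.7, 2.3]])
y = np.array([0, 0, 1, 1])
print(predict_gnb(np.array([6.4, 2.0]), train_gnb(X, y)))   # -> 1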


Things to remember
• Probability basics

• Estimating parameters from data


• Maximum likelihood (ML): maximize P(D | θ)
• Maximum a posteriori estimation (MAP): maximize P(θ | D)

• Naïve Bayes
Next class
• Logistic regression
