
Naïve Bayes

Jia-Bin Huang
ECE-5424G / CS-5824 Virginia Tech Spring 2019
Administrative
• HW 1 out today. Please start early!

• Office hours
• Chen: Wed 4pm-5pm
• Shih-Yang: Fri 3pm-4pm
• Location: Whittemore 266
Linear Regression
• Model representation
  h_θ(x) = θᵀx = θ₀ + θ₁x₁ + ⋯ + θₙxₙ

• Cost function
  J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

• Gradient descent for linear regression
  Repeat until convergence { θⱼ ≔ θⱼ − α (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ }

• Features and polynomial regression
  Can combine features; can use different functions to generate features (e.g., polynomial)

• Normal equation
  θ = (XᵀX)⁻¹ Xᵀy
x₀    Size in feet² (x₁)   Number of bedrooms (x₂)   Number of floors (x₃)   Age of home in years (x₄)   Price ($) in 1000's (y)
1     2104                 5                         1                       45                          460
1     1416                 3                         2                       40                          232
1     1534                 3                         2                       30                          315
1     852                  2                         1                       36                          178
…     …

        1  2104  5  1  45                460
X  =    1  1416  3  2  40         y  =   232
        1  1534  3  2  30                315
        1   852  2  1  36                178

θ = (XᵀX)⁻¹ Xᵀy                                        Slide credit: Andrew Ng
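
A minimal numpy sketch of the normal equation on the four examples above; the pseudo-inverse is used because with only four examples and five parameters XᵀX is singular:

import numpy as np

# Design matrix: a leading column of ones (x0) plus the four features from the table.
X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# Normal equation: theta = (X^T X)^{-1} X^T y.
# np.linalg.pinv handles the singular X^T X here (more parameters than examples).
theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)
print(theta)
print(X @ theta)   # ~ [460, 232, 315, 178]: the four training prices are fit exactly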
Least square solution

Justification/interpretation 1
• Geometric interpretation
  [Figure: the target vector y, its projection Xθ onto the column space of X, and the residual Xθ − y]

• Col(X): the column space of X, i.e., the span of the columns of X

• The residual Xθ − y is orthogonal to the column space of X
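
Spelling out that orthogonality condition (assuming XᵀX is invertible) recovers the normal equation:

\[
X^\top (X\theta - y) = 0
\;\Longrightarrow\;
X^\top X\,\theta = X^\top y
\;\Longrightarrow\;
\theta = (X^\top X)^{-1} X^\top y .
\]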


Justification/interpretation 2
• Probabilistic model
• Assume a linear model with Gaussian errors:
  y⁽ⁱ⁾ = θᵀx⁽ⁱ⁾ + ε⁽ⁱ⁾,  ε⁽ⁱ⁾ ~ N(0, σ²)

• Solving for the maximum likelihood estimate of θ recovers the least-squares solution

Image credit: CS 446@UIUC
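
One way to spell out that maximum-likelihood step, under the Gaussian-error assumption above:

\[
\log \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big)
= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\Big(-\tfrac{\big(y^{(i)} - \theta^\top x^{(i)}\big)^2}{2\sigma^2}\Big)
= \mathrm{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \big(y^{(i)} - \theta^\top x^{(i)}\big)^2,
\]

so maximizing the likelihood over θ is exactly minimizing the least-squares cost J(θ).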


Justification/interpretation 3
• Loss minimization

• Least squares loss: ℓ(h_θ(x), y) = (h_θ(x) − y)²

• Empirical Risk Minimization (ERM):
  min_θ (1/m) Σᵢ ℓ(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)

m training examples, n features

Gradient Descent                              Normal Equation
• Need to choose the learning rate α          • No need to choose α
• Needs many iterations                       • Don't need to iterate
• Works well even when n is large             • Need to compute (XᵀX)⁻¹
                                              • Slow if n is very large

Slide credit: Andrew Ng
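
A minimal batch gradient-descent sketch for the update rule above (the learning rate and iteration count below are illustrative choices, not values from the slides):

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for J(theta) = (1/2m) * sum((X @ theta - y) ** 2)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m   # dJ/dtheta
        theta = theta - alpha * gradient       # simultaneous update of every theta_j
    return theta

In practice the features should be scaled (e.g., mean normalization) so that a single learning rate α works across all dimensions.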


Things to remember
• Model representation
  h_θ(x) = θᵀx = θ₀ + θ₁x₁ + ⋯ + θₙxₙ

• Cost function
  J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

• Gradient descent for linear regression
  Repeat until convergence { θⱼ ≔ θⱼ − α (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ }

• Features and polynomial regression
  Can combine features; can use different functions to generate features (e.g., polynomial)

• Normal equation
  θ = (XᵀX)⁻¹ Xᵀy
Today’s plan
• Probability basics

• Estimating parameters from data


• Maximum likelihood (ML)
• Maximum a posteriori estimation (MAP)

• Naïve Bayes
Today’s plan
• Probability basics

• Estimating parameters from data


• Maximum likelihood (ML)
• Maximum a posteriori estimation (MAP)

• Naïve Bayes
Random variables
• Outcome space S
• Space of possible outcomes

• Random variables
• Functions that map outcomes to real numbers

• Event E
• Subset of S
Visualizing probability
[Venn diagram: sample space with area 1; the event "A is true" is a blue circle, "A is false" is the rest]

P(A) = area of the blue circle

Visualizing probability

[Venn diagram: "A is true" and "A is false" partition the sample space]

P(A) + P(¬A) = 1
Visualizing probability
[Venn diagram: A split by B into A∧B and A∧¬B]

P(A) = P(A, B) + P(A, ¬B)
Visualizing conditional probability
[Venn diagram: events A and B, with overlap A∧B]

P(A|B) = P(A, B) / P(B)

Corollary: the chain rule
P(A, B) = P(A|B) P(B)
Bayes rule
[Venn diagram: events A and B, with overlap A∧B; portrait of Thomas Bayes]

P(A|B) = P(B|A) P(A) / P(B)

Corollary: the chain rule
P(A, B) = P(A|B) P(B) = P(B|A) P(A)
Other forms of Bayes rule

P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|¬A) P(¬A))
Applying Bayes rule

• A = you have the flu
• B = you just coughed

• Assume: P(A) = 0.05, P(B|A) = 0.8, P(B|¬A) = 0.2

• What is P(flu | cough) = P(A|B)?

P(A|B) = (0.8 × 0.05) / (0.8 × 0.05 + 0.2 × 0.95) ≈ 0.17


Slide credit: Tom Mitchell
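
The same computation as a short Python sketch (numbers taken from the assumption above):

p_flu = 0.05                 # P(A)
p_cough_given_flu = 0.8      # P(B|A)
p_cough_given_noflu = 0.2    # P(B|~A)

p_cough = p_cough_given_flu * p_flu + p_cough_given_noflu * (1 - p_flu)   # P(B)
p_flu_given_cough = p_cough_given_flu * p_flu / p_cough                   # Bayes rule
print(round(p_flu_given_cough, 2))   # ~0.17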
Why are we learning this?

[Diagram: input x → hypothesis h → output y; the hypothesis h is learned from training data]
Joint distribution

• Making a joint distribution of M variables:
1. Make a truth table listing all combinations of values
2. For each combination of values, say how probable it is
3. Probabilities must sum to 1

A   B   C   Prob
0   0   0   0.30
0   0   1   0.05
0   1   0   0.10
0   1   1   0.05
1   0   0   0.05
1   0   1   0.10
1   1   0   0.25
1   1   1   0.10

Slide credit: Tom Mitchell


Using joint distribution
• Can ask for the probability of any logical expression E
  involving these variables: P(E) = Σ_{rows matching E} P(row)

Slide credit: Tom Mitchell
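
A small sketch of such a query; the dictionary below mirrors the truth table from the previous slide, and P(A ∨ B) is just one example expression:

# Joint distribution over (A, B, C), copied from the truth table.
joint = {(0, 0, 0): 0.30, (0, 0, 1): 0.05,
         (0, 1, 0): 0.10, (0, 1, 1): 0.05,
         (1, 0, 0): 0.05, (1, 0, 1): 0.10,
         (1, 1, 0): 0.25, (1, 1, 1): 0.10}
assert abs(sum(joint.values()) - 1.0) < 1e-9    # probabilities sum to 1

# P(E): sum the probabilities of the rows where the expression E holds.
p_a_or_b = sum(p for (a, b, c), p in joint.items() if a or b)
print(p_a_or_b)   # 0.65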


The solution: learn the joint distribution?
• Main problem: learning the joint distribution may require more data than we have

• Say, learning a joint distribution with 100 attributes

• # of rows in this table? 2¹⁰⁰ ≥ 10³⁰

• # of people on earth? about 10⁹

Slide credit: Tom Mitchell


What should we do?
1. Be smart about
how we estimate probabilities from sparse data
• Maximum likelihood estimates (ML)
• Maximum a posteriori estimates (MAP)

2. Be smart about
how to represent joint distributions
• Bayes network, graphical models (more on this later)
Slide credit: Tom Mitchell
Today’s plan
• Probability basics

• Estimating parameters from data


• Maximum likelihood (ML)
• Maximum a posteriori (MAP)

• Naïve Bayes
Estimating the probability
• Flip the coin repeatedly, observing X ∈ {1, 0} (1 = heads, 0 = tails), with P(X = 1) = θ
• It turns up heads α₁ times
• It turns up tails α₀ times
• Your estimate for θ is?

• Case A: 100 flips: 51 Heads (α₁ = 51), 49 Tails (α₀ = 49)

• Case B: 3 flips: 2 Heads (α₁ = 2), 1 Tail (α₀ = 1)

Slide credit: Tom Mitchell
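
A one-line MLE sketch for the two cases (the function name is just illustrative):

def mle_heads(num_heads, num_tails):
    """MLE of theta = P(X = 1): the observed fraction of heads."""
    return num_heads / (num_heads + num_tails)

print(mle_heads(51, 49))   # Case A: 0.51
print(mle_heads(2, 1))     # Case B: ~0.67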


Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE)
Choose θ that maximizes the probability of the observed data

• Maximum a posteriori estimation (MAP)

Choose θ that is most probable given the prior probability and the
data

Slide credit: Tom Mitchell


Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE)
Choose θ that maximizes P(D | θ):
  θ̂_MLE = argmax_θ P(D | θ)

• Maximum a posteriori estimation (MAP)

Choose θ that maximizes P(θ | D):
  θ̂_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)
Slide credit: Tom Mitchell


Maximum likelihood estimate
• Each flip yields a Boolean value for X: X ∈ {1, 0}, with P(X = 1) = θ and P(X = 0) = 1 − θ

• A data set D of n independent, identically distributed (iid) flips
  produces α₁ ones and α₀ zeros:

  P(D | θ) = θ^α₁ (1 − θ)^α₀,     θ̂_MLE = argmax_θ P(D | θ) = α₁ / (α₁ + α₀)

Slide credit: Tom Mitchell


Beta prior distribution

P(θ) = Beta(β₁, β₀) = θ^(β₁−1) (1 − θ)^(β₀−1) / B(β₁, β₀)
Slide credit: Tom Mitchell
Maximum a posteriori (MAP) estimate
• Data set D of n iid flips, X ∈ {1, 0},
  produces α₁ ones and α₀ zeros

• Assume prior P(θ) = Beta(β₁, β₀) (conjugate prior: closed-form representation of the posterior)

  θ̂_MAP = argmax_θ P(θ | D) = (α₁ + β₁ − 1) / (α₁ + β₁ + α₀ + β₀ − 2)

Slide credit: Tom Mitchell
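
A sketch of that MAP estimate (the mode of the Beta posterior); the prior hyperparameters act like "imaginary" flips, and the values β₁ = β₀ = 3 below are only an example:

def map_heads(num_heads, num_tails, beta1=3, beta0=3):
    """MAP estimate of theta under a Beta(beta1, beta0) prior: the posterior mode."""
    return (num_heads + beta1 - 1) / (num_heads + num_tails + beta1 + beta0 - 2)

print(map_heads(51, 49))   # Case A: ~0.51, the prior barely matters with 100 flips
print(map_heads(2, 1))     # Case B: ~0.57, pulled toward 0.5 from the MLE of ~0.67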


Some terminology
• Likelihood function: P(D | θ)
• Prior: P(θ)
• Posterior: P(θ | D)

• Conjugate prior:
P(θ) is the conjugate prior for a likelihood function P(D | θ) if
the prior and the posterior P(θ | D) have the same form.
• Example (coin flip problem)
• Prior: Beta. Likelihood: Binomial
• Posterior: Beta
Slide credit: Tom Mitchell
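
Spelled out for the coin-flip problem (with α₁ observed heads and α₀ observed tails):

\[
P(\theta \mid D) \;\propto\; P(D \mid \theta)\, P(\theta)
\;\propto\; \theta^{\alpha_1}(1-\theta)^{\alpha_0}\,\theta^{\beta_1-1}(1-\theta)^{\beta_0-1}
= \theta^{\alpha_1+\beta_1-1}(1-\theta)^{\alpha_0+\beta_0-1},
\]

i.e., the posterior is again a Beta distribution, Beta(α₁ + β₁, α₀ + β₀), the same form as the prior.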
How many parameters?
• Suppose we want to estimate P(Y | X), where X = ⟨X₁, …, Xₙ⟩
  and the Xᵢ and Y are Boolean random variables

To estimate P(Y | X₁, …, Xₙ), we need one parameter for each of the 2ⁿ settings of ⟨X₁, …, Xₙ⟩

When X = (Gender, Hours-worked)?

When X = ⟨X₁, …, Xₙ⟩ for large n?

Slide credit: Tom Mitchell


Can we reduce parameters using Bayes rule?
  P(Y | X₁, …, Xₙ) = P(X₁, …, Xₙ | Y) P(Y) / P(X₁, …, Xₙ)

• How many parameters for P(X₁, …, Xₙ | Y)?  2(2ⁿ − 1)

• How many parameters for P(Y)?  1
Slide credit: Tom Mitchell


Today’s plan
• Probability basics

• Estimating parameters from data


• Maximum likelihood (ML)
• Maximum a posteriori estimation (MAP)

• Naïve Bayes
Naïve Bayes
• Assumption:
  P(X₁, …, Xₙ | Y) = Πᵢ P(Xᵢ | Y)

• i.e., Xᵢ and Xⱼ are conditionally independent
  given Y, for i ≠ j
Slide credit: Tom Mitchell


Conditional independence
• Definition: X is conditionally independent of Y given Z if the probability
distribution governing X is independent of the value of Y, given the value
of Z:  P(X | Y, Z) = P(X | Z)

Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Slide credit: Tom Mitchell


Applying conditional independence
• Naïve Bayes assumes the Xᵢ are conditionally independent given Y
  e.g., P(X₁ | X₂, Y) = P(X₁ | Y)

General form: P(X₁, …, Xₙ | Y) = Πᵢ P(Xᵢ | Y)

How many parameters to describe P(X₁, …, Xₙ | Y)? P(Y)?
• Without the conditional independence assumption?  2(2ⁿ − 1) and 1
• With the conditional independence assumption?  2n and 1
Slide credit: Tom Mitchell
Naïve Bayes classifier
• Bayes rule:
  P(Y = yₖ | X₁, …, Xₙ) = P(Y = yₖ) P(X₁, …, Xₙ | Y = yₖ) / Σⱼ P(Y = yⱼ) P(X₁, …, Xₙ | Y = yⱼ)

• Assume conditional independence among the Xᵢ's:
  P(Y = yₖ | X₁, …, Xₙ) ∝ P(Y = yₖ) Πᵢ P(Xᵢ | Y = yₖ)

• Pick the most probable Y:
  Y ← argmax_{yₖ} P(Y = yₖ) Πᵢ P(Xᵢ | Y = yₖ)
Slide credit: Tom Mitchell


Naïve Bayes algorithm – discrete
• For each value yₖ of Y
  Estimate P(Y = yₖ)
  For each value xᵢⱼ of each attribute Xᵢ
  Estimate P(Xᵢ = xᵢⱼ | Y = yₖ)

• Classify X^new:  Y^new ← argmax_{yₖ} P(Y = yₖ) Πᵢ P(Xᵢ^new | Y = yₖ)
Estimating parameters: discrete
• Maximum likelihood estimates (MLE)
  P̂(Y = yₖ) = #D{Y = yₖ} / |D|
  P̂(Xᵢ = xᵢⱼ | Y = yₖ) = #D{Xᵢ = xᵢⱼ ∧ Y = yₖ} / #D{Y = yₖ}
Slide credit: Tom Mitchell


• F = 1 iff you live in Fox Ridge
• S = 1 iff you watched the Super Bowl last night
• D = 1 iff you drive to VT
• G = 1 iff you went to the gym in the last month

P(F | S, D, G) ∝ P(F) P(S | F) P(D | F) P(G | F)
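
A minimal counting-based sketch of a discrete Naïve Bayes classifier in this spirit; the tiny data set below (columns S, D, G with label F) is made up purely for illustration:

import numpy as np

# Hypothetical binary data: columns are (S, D, G), label is F.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1],
              [0, 0, 0], [1, 1, 1], [0, 1, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

def train_nb(X, y):
    """MLE estimates: P(F = 1) and P(X_i = 1 | F = c) for each attribute i."""
    prior = y.mean()                                      # P(F = 1)
    cond = {c: X[y == c].mean(axis=0) for c in (0, 1)}    # P(X_i = 1 | F = c)
    return prior, cond

def predict_nb(x, prior, cond):
    """Pick the class c maximizing P(F = c) * prod_i P(X_i = x_i | F = c)."""
    scores = {}
    for c in (0, 1):
        p = cond[c]
        likelihood = np.prod(np.where(x == 1, p, 1 - p))
        scores[c] = (prior if c == 1 else 1 - prior) * likelihood
    return max(scores, key=scores.get)

prior, cond = train_nb(X, y)
print(predict_nb(np.array([1, 0, 1]), prior, cond))   # -> 1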


Naïve Bayes: Subtlety #1
• Often the Xᵢ are not really conditionally independent

• Naïve Bayes often works pretty well anyway

• It often gives the right classification, even when not the right probability [Domingos
  & Pazzani, 1996]

• What is the effect on the estimated P(Y | X)?

• What if we have two copies of a feature, i.e., Xᵢ = Xₖ?
Slide credit: Tom Mitchell


Naïve Bayes: Subtlety #2
MLE estimate for P(Xᵢ = xᵢⱼ | Y = yₖ) might be zero.
(for example, Xᵢ = birthdate, xᵢⱼ = Feb_4_1995)

• Why worry about just one parameter out of many?
  A single zero factor drives the whole product P(Y = yₖ) Πᵢ P(Xᵢ | Y = yₖ) to zero, no matter what the other attributes say

• What can we do to address this?

• MAP estimates (adding “imaginary” examples)

Slide credit: Tom Mitchell


Estimating parameters: discrete
• Maximum likelihood estimates (MLE)
  P̂(Xᵢ = xᵢⱼ | Y = yₖ) = #D{Xᵢ = xᵢⱼ ∧ Y = yₖ} / #D{Y = yₖ}

• MAP estimates (Dirichlet priors): add m “imaginary” examples for each attribute value
  P̂(Xᵢ = xᵢⱼ | Y = yₖ) = (#D{Xᵢ = xᵢⱼ ∧ Y = yₖ} + m) / (#D{Y = yₖ} + m·J),  where J is the number of values Xᵢ can take
Slide credit: Tom Mitchell
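
A sketch of that smoothed estimate; the argument "pseudo" plays the role of the imaginary examples per attribute value (the names are illustrative):

def smoothed_estimate(count_xij_and_yk, count_yk, num_values, pseudo=1):
    """MAP / Dirichlet-prior estimate of P(X_i = x_ij | Y = y_k) with pseudo-counts."""
    return (count_xij_and_yk + pseudo) / (count_yk + pseudo * num_values)

# An attribute value never observed with this class still gets a nonzero probability:
print(smoothed_estimate(0, 10, num_values=2))   # 1/12 instead of the MLE's 0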


What if we have continuous Xᵢ?
• Gaussian Naïve Bayes (GNB): assume
  P(Xᵢ = x | Y = yₖ) = N(x; μᵢₖ, σᵢₖ)

• Additional possible assumptions on the variance σᵢₖ:
• it is independent of Y (σᵢ)
• it is independent of Xᵢ (σₖ)
• it is independent of both Xᵢ and Y (σ)

Slide credit: Tom Mitchell


Naïve Bayes algorithm – continuous
• For each value yₖ of Y
  Estimate P(Y = yₖ)
  For each attribute Xᵢ, estimate the
  class-conditional mean μᵢₖ and variance σ²ᵢₖ

• Classify X^new:  Y^new ← argmax_{yₖ} P(Y = yₖ) Πᵢ N(Xᵢ^new; μᵢₖ, σ²ᵢₖ)

Slide credit: Tom Mitchell
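
A minimal Gaussian Naïve Bayes sketch along these lines (per-class, per-attribute mean and variance); the two-feature data set at the bottom is made up for illustration:

import numpy as np

def train_gnb(X, y):
    """Estimate P(Y = c) plus the class-conditional mean and variance of each attribute."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),            # P(Y = c)
                     Xc.mean(axis=0),             # mu_ic
                     Xc.var(axis=0) + 1e-9)       # sigma_ic^2 (small floor for stability)
    return params

def predict_gnb(x, params):
    """Classify by maximizing log P(Y = c) + sum_i log N(x_i; mu_ic, sigma_ic^2)."""
    def log_score(c):
        prior, mu, var = params[c]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=log_score)

X = np.array([[5.0, 1.2], [4.8, 1.0], [6.5, 2.1], [6.7, 2.3]])
y = np.array([0, 0, 1, 1])
print(predict_gnb(np.array([6.4, 2.0]), train_gnb(X, y)))   # -> 1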


Things to remember
• Probability basics

• Estimating parameters from data


• Maximum likelihood (ML): maximize P(D | θ)
• Maximum a posteriori estimation (MAP): maximize P(θ | D)

• Naïve Bayes
Next class
• Logistic regression
