
Artificial Intelligence and Machine Learning

A.A. 2020/2021
Tatiana Tommasi
Bayes’ Rule

http://www-inst.eecs.berkeley.edu/~cs188/fa19/
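For reference, the rule itself, with A the hypothesis and B the evidence:

$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} $$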
Naïve Bayes Classifier

Slide Credit: Barnabás Póczos & Alex Smola
Maximum a Posteriori (MAP) & Bayesian Learning

● hypothesis h: a probabilistic theory about how the domain works


● Bayesian learning: calculates the probability of each hypothesis given the data and makes predictions on that basis
→ predictions are made by using all the hypotheses, weighted by their probabilities

d = observed data
X = unknown quantity

Predictions made according to a MAP hypothesis h_MAP are approximately Bayesian to the extent that P(X | d) ≈ P(X | h_MAP).
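For reference, the full Bayesian prediction that this approximates (d the observed data, X the unknown quantity, h_i the hypotheses):

$$ P(X \mid \mathbf{d}) = \sum_i P(X \mid h_i)\,P(h_i \mid \mathbf{d}) \;\approx\; P(X \mid h_{\text{MAP}}) $$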

When conditional independence is satisfied, Naive Bayes corresponds to MAP classification.
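A sketch of the correspondence, writing y for the class and x_1,…,x_n for the features (symbols assumed here, not fixed by the slide); conditional independence lets the class-conditional likelihood factorize:

$$ y_{\text{MAP}} = \arg\max_y P(y \mid x_1,\dots,x_n) = \arg\max_y P(y)\prod_{i=1}^{n} P(x_i \mid y) $$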

Independent and Identically Distributed (i.i.d.)

● stationary assumption: there is a probability distribution over examples that remains stationary over time

● each example data point (before we see it) is a random variable E_j whose observed value e_j = (x_j, y_j) is sampled from that distribution, independently of the previous examples

i.i.d.: independent and identically distributed

The i.i.d. assumption connects the past to the future; without some such connection, all bets are off and the future could be anything.
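In symbols, for the example variables E_j defined above, independence and identical distribution read:

$$ P(E_j \mid E_{j-1}, E_{j-2}, \dots) = P(E_j), \qquad P(E_j) = P(E_{j-1}) = P(E_{j-2}) = \cdots $$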

Maximum A Posteriori → Maximum Likelihood (ML)

● the prior can be used to penalize complexity

● more complex hypotheses have lower prior probability

● if we consider a uniform prior over the space of hypotheses, the posterior is maximized by the same hypothesis that maximizes the likelihood → Maximum Likelihood (ML)

● it provides a good approximation to Bayesian and MAP learning when the data set is large
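In symbols, with hypotheses h and data d:

$$ h_{\text{MAP}} = \arg\max_h P(d \mid h)\,P(h) \qquad\Longrightarrow\qquad h_{\text{ML}} = \arg\max_h P(d \mid h) \quad \text{when } P(h) \text{ is uniform} $$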

Estimating Probabilities
We are often faced with the situation of having random data which we know (or believe) is drawn from a
parametric model, whose parameters we do not know.

Example: the outcome of tossing a coin described by a distribution with unknown parameter θ.

In this case we would like to use the data to estimate the value of the parameter θ.

Flips are i.i.d.:


● Independent events
● Identically distributed

Maximum Likelihood Estimation: choose θ that maximizes the probability of observed data
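In symbols, writing D for the observed data:

$$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(\mathcal{D} \mid \theta) $$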

Slide Credit: Barnabás Póczos & Alex Smola
Maximum Likelihood Estimation

I have a coin, I flip it: what is the probability of getting heads? With 3 heads observed out of 5 flips, the estimated probability is 3/5 = the frequency of heads.

Why? The frequency of heads is exactly the MLE for this problem.
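A worked version, writing α_H and α_T for the head and tail counts (hypothetical symbols, not fixed by the slide):

$$ L(\theta) = \theta^{\alpha_H}(1-\theta)^{\alpha_T}, \qquad \frac{d}{d\theta}\log L(\theta) = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0 \;\Longrightarrow\; \hat{\theta} = \frac{\alpha_H}{\alpha_H + \alpha_T} = \frac{3}{5} $$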

Slide Credit: Barnabás Póczos & Alex Smola
Maximum Likelihood Estimation

This lays out one standard method for parameter learning:

1. write down an expression for the likelihood of the data as a function of the parameter(s)
2. write down the derivative with respect to each parameter
3. find the parameter value(s) such that the derivatives are zero
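A minimal NumPy sketch of these three steps for the coin example; the data array is hypothetical (3 heads in 5 flips), and the closed form θ̂ = k/n comes from setting the derivative k/θ − (n−k)/(1−θ) to zero:

```python
import numpy as np

# Hypothetical record of 5 flips (1 = heads): 3 heads, 2 tails
flips = np.array([1, 1, 0, 1, 0])

# Step 1: likelihood of the data as a function of theta (in log form)
def log_likelihood(theta, flips):
    k, n = flips.sum(), len(flips)
    return k * np.log(theta) + (n - k) * np.log(1 - theta)

# Steps 2-3: d/dtheta log L = k/theta - (n-k)/(1-theta) = 0  =>  theta = k/n
theta_mle = flips.sum() / len(flips)
print(theta_mle)  # 0.6

# Sanity check: the closed form also maximizes the log-likelihood on a grid
grid = np.linspace(0.01, 0.99, 99)
assert abs(grid[np.argmax(log_likelihood(grid, flips))] - theta_mle) < 0.01
```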

Slide Credit: Barnabás Póczos & Alex Smola
Discrete Probability Distributions
Bernoulli Distribution -- Ber(θ)
Single toss of a (possibly biased) coin

Ω = {head, tail}
P(head) = θ, P(tail) = 1 − θ
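Compactly, the probability mass function can be written as:

$$ P(x; \theta) = \theta^{x}(1-\theta)^{1-x}, \qquad x \in \{0, 1\} $$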


Binomial Distribution -- Bin(n,θ)

Toss a single (possibly biased) coin n times and report the number k of times it comes up heads and n−k the number of times it comes up tails.

Ω = {n-long head/tail sequences}, |Ω| = 2^n

number of heads k ∈ {0, 1, 2, ..., n}
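The probability of observing exactly k heads in n flips is:

$$ P(k; n, \theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}, \qquad k \in \{0, 1, \dots, n\} $$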

Maximum a Posteriori for coin flip

Likelihood function is binomial

Beta Prior distribution
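A sketch of the update, assuming a Beta(α, β) prior (α and β are the usual hyperparameter names, not fixed by the slide); with k heads in n flips, the posterior is again a Beta:

$$ p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad p(\theta \mid \mathcal{D}) \propto \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1} \;\Longrightarrow\; \hat{\theta}_{\text{MAP}} = \frac{k+\alpha-1}{n+\alpha+\beta-2} $$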

Slide Credit: Carlos Guestrin
[figure: posterior over θ for increasing numbers of flips]

As we get more samples, the effect of the prior is "washed out".

The Beta prior is equivalent to extra flips: when the number of flips is large the prior is "forgotten", but when the number of flips is small the prior is important!
Continuous Random Variables
A random variable X is continuous if its set of possible values is an entire interval of numbers.

This explains why P(X=x)=0 for continuous distributions
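This follows because probabilities of a continuous variable are integrals of a density f, so a single point carries no mass:

$$ P(a \le X \le b) = \int_a^b f(x)\,dx, \qquad P(X = x) = \int_x^x f(t)\,dt = 0 $$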

Slide Credit: Barnabás Póczos & Alex Smola
Probability Density Function

Intuitively, one can think of f(x)dx as the probability of X falling within the infinitesimal interval [x, x+dx]: P(x < X < x+dx) = f(x) dx.
Slide Credit: Barnabás Póczos & Alex Smola
Probability Density Function

Cumulative Distribution Function
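The CDF accumulates the density up to a point, and differentiating it recovers the density:

$$ F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt, \qquad f(x) = \frac{dF(x)}{dx} $$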
Slide Credit: Barnabás Póczos & Alex Smola
Uniform Distribution
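For reference, the density of a uniform distribution on an interval [a, b]:

$$ f(x) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\[4pt] 0 & \text{otherwise} \end{cases} $$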

Slide Credit: Barnabás Póczos & Alex Smola
Normal (Gaussian) Distribution
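For reference, the density with mean μ and variance σ²:

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$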

Slide Credit: Barnabás Póczos & Alex Smola
Moments

discrete case: f(x) = probability mass

continuous case: f(x) = probability density
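The first two moments in the two cases:

$$ \mathbb{E}[X] = \sum_x x\,f(x) \;\;\text{(discrete)}, \qquad \mathbb{E}[X] = \int x\,f(x)\,dx \;\;\text{(continuous)} $$

$$ \operatorname{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 $$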

Slide Credit: Francesco Orabona
MLE for Gaussian mean and variance
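A sketch of the standard results, obtained by setting the derivatives of the log-likelihood of n i.i.d. samples x_1,…,x_n to zero:

$$ \hat{\mu}_{\text{ML}} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^{n} \big(x_i - \hat{\mu}_{\text{ML}}\big)^2 $$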

Slide Credit: Barnabás Póczos & Alex Smola
Exercise on MLE and MAP

Suppose data set D={2,5,9,5,4,8} is an i.i.d. sample from a Poisson distribution with an unknown
parameter λ. Find the maximum likelihood estimate of λ

To find the λ that maximizes the likelihood, we will:
● take the logarithm (a monotonic function) to simplify the calculation
● find its derivative with respect to λ
● equate it to zero to find the maximum
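A worked version of these steps (the Poisson pmf is P(x | λ) = λ^x e^{−λ}/x!; here n = 6 and Σ_i x_i = 2+5+9+5+4+8 = 33):

$$ L(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}, \qquad \ell(\lambda) = \Big(\sum_i x_i\Big)\ln\lambda - n\lambda - \sum_i \ln(x_i!) $$

$$ \frac{d\ell}{d\lambda} = \frac{\sum_i x_i}{\lambda} - n = 0 \;\Longrightarrow\; \hat{\lambda}_{\text{ML}} = \frac{\sum_i x_i}{n} = \frac{33}{6} = 5.5 $$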

Exercise on MLE and MAP
Suppose data set D={2,5,9,5,4,8} is an i.i.d. sample from a Poisson distribution with an unknown
parameter λ. But now we are also given additional information: suppose the prior knowledge about λ can
be expressed using a gamma distribution 𝛤(x|k,θ) with parameters k=3, θ=1.
Find the maximum a posteriori estimate of λ
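A worked sketch, using the gamma density p(λ) ∝ λ^{k−1} e^{−λ/θ} and multiplying it by the Poisson likelihood from before (n = 6, Σ_i x_i = 33, k = 3, θ = 1):

$$ p(\lambda \mid \mathcal{D}) \;\propto\; \Big(\prod_i \lambda^{x_i} e^{-\lambda}\Big)\,\lambda^{k-1} e^{-\lambda/\theta} \;=\; \lambda^{\sum_i x_i + k - 1}\, e^{-\lambda\,(n + 1/\theta)} $$

$$ \frac{d}{d\lambda}\log p(\lambda \mid \mathcal{D}) = \frac{\sum_i x_i + k - 1}{\lambda} - \Big(n + \frac{1}{\theta}\Big) = 0 \;\Longrightarrow\; \hat{\lambda}_{\text{MAP}} = \frac{33 + 3 - 1}{6 + 1} = 5 $$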

Exercise on MLE and MAP

Suppose data set D={2,5,9,5,4,8} is an i.i.d. sample from a Poisson distribution with an unknown
parameter λ.

It can be shown that as n → ∞ the expected value of the difference (λ_MAP − λ_ML) goes to 0.
In other words, large amounts of data diminish the importance of prior knowledge.

Discriminative vs Generative
Many supervised learning methods can be viewed as estimating P(X,Y). Generally they fall into two categories:
● When we estimate P(X,Y) = P(X|Y)P(Y) we call it generative learning (e.g. Naive Bayes, which models P(X|Y) and P(Y)).
● When we only estimate P(Y|X) we call it discriminative learning.
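The two views are linked by Bayes' rule: a generative model still yields the predictive quantity via

$$ P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)} $$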

Slide Credit: Barnabás Póczos & Alex Smola
Frequentists vs Bayesian -- MLE vs MAP

The frequentist's (MLE) objection: "You give a different answer for different priors."

The Bayesian's (MAP) objection: "You are not good when the number of samples is small."

Slide Credit Idea: Barnabás Póczos & Alex Smola
How good is the estimator?
Consider two properties of the estimator:

● Bias
● Variance

The Bias
● The bias of an estimator θ̂ of a parameter θ is defined as bias(θ̂) = E[θ̂] − θ

● The estimator is unbiased if bias(θ̂) = 0

Estimator of Bernoulli Mean

● The Bernoulli distribution for a binary variable x ∊ {0,1} with mean θ has the form P(x; θ) = θ^x (1 − θ)^(1−x)

● An estimator for θ given samples {x^(1), …, x^(m)} is the sample mean θ̂_m = (1/m) Σ_i x^(i)

● E[θ̂_m] = (1/m) Σ_i E[x^(i)] = θ, so the estimator is unbiased

The Bias

Estimator of Gaussian Mean

● {x^(1), …, x^(m)} distributed i.i.d. according to p(x^(i)) = N(x^(i); μ, σ²)

● The sample mean μ̂_m = (1/m) Σ_i x^(i) is an estimator of the mean parameter

● E[μ̂_m] = (1/m) Σ_i E[x^(i)] = μ, so the sample mean is unbiased
The Bias

Estimator of Gaussian Variance

● The sample variance is σ̂²_m = (1/m) Σ_i (x^(i) − μ̂_m)²

● We are interested in computing bias(σ̂²_m) = E[σ̂²_m] − σ²

● We begin by evaluating E[σ̂²_m]; a direct computation gives E[σ̂²_m] = ((m−1)/m) σ²


● The sample variance is therefore biased: bias(σ̂²_m) = ((m−1)/m) σ² − σ² = −σ²/m ≠ 0

● The unbiased sample variance estimator is σ̃²_m = (1/(m−1)) Σ_i (x^(i) − μ̂_m)²
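A quick simulation sketch of this bias, assuming draws from a standard normal so the true σ² = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 5, 200_000          # small m makes the bias visible

# Many datasets of size m drawn from N(0, 1); the true variance is 1
data = rng.normal(size=(trials, m))

biased = data.var(axis=1, ddof=0).mean()    # divides by m (ML estimator)
unbiased = data.var(axis=1, ddof=1).mean()  # divides by m - 1

print(f"mean of biased estimator:   {biased:.3f}  (theory: (m-1)/m = {(m - 1) / m})")
print(f"mean of unbiased estimator: {unbiased:.3f}  (theory: 1)")
```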

The Variance
● The variance indicates how much we expect the estimator to vary as a function of the data sample

● Just as we computed the expectation of the estimator to determine its bias, we can compute its
variance

● The variance of an estimator is simply Var(θ̂), where the random variable is the training set.

If two estimators of a parameter are both unbiased, the best is the one with the least amount of variability.


● The square root of the variance is called the standard error, denoted SE(θ̂)

● SE(θ̂) measures how we would expect the estimate to vary as we obtain different samples from the same distribution

● The standard error of the mean is given by SE(μ̂_m) = √(Var[(1/m) Σ_i x^(i)]) = σ/√m

Here σ² is the true variance of the samples x^(i).


The SE is often computed using an estimate of σ: the resulting estimator is not unbiased, but the approximation is reasonable.

Mean Squared Error

The mean squared error of an estimator θ̂ of a parameter θ is MSE(θ̂) = E[(θ̂ − θ)²].

Clearly, if this is zero then the estimator is perfect. There are of course other ways of measuring the error of an estimator, e.g. the mean absolute error E[|θ̂ − θ|].

Slide Credit: Rui Castro


Bias / Variance Decomposition
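The decomposition itself, obtained by adding and subtracting E[θ̂] inside the square:

$$ \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2 + \operatorname{Var}(\hat{\theta}) = \text{bias}^2 + \text{variance} $$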

Slide Credit: Rui Castro
Example - Rate of a Poisson Process

It is reasonable to assume that users access a webserver according to a Poisson process with unknown rate λ. Suppose you measure the number of users in 4 one-hour periods:

Two estimators are proposed: which one is better?

Slide Credit: Rui Castro
Example - Rate of a Poisson Process

Let’s first see what the bias of the estimators is


It seems the first estimator might be more desirable...

Slide Credit: Rui Castro
Example - Rate of a Poisson Process

Which estimator should we prefer?

The answer depends on the unknown value of λ. If someone tells us that every hour hundreds of people access the webserver, then we know which estimator to choose.
Slide Credit: Rui Castro
Summary

● Maximum a Posteriori

● Maximum Likelihood

● i.i.d. data

● Discrete and Continuous Distributions

● Bias / Variance

● Mean Squared Error

Expected Values and Variance

Slide Credit: cs229-prob, Stanford
