
Machine Learning

Module 5:
Probability
Faculty: Santosh Chapaneri

Why Probability for ML?


• Probability theory can be applied to any situation involving uncertainty.

• In ML, uncertainty arises in many forms, e.g. arriving at the best prediction of the future given past data, arriving at the best model based on given data, or arriving at a confidence level when predicting a future outcome from past data.

• In ML, we train the system using training data and expect the ML algorithm to capture the behaviour of the larger set of actual data.

• If we have observations on a subset of events, called a ‘sample’, then there will be some uncertainty in attributing the sample results to the whole set, or population.

santosh.chapaneri@ieee.org

Probability – Properties


Probability – Properties

(marginal)


Probability – Conditional
• P(A | B) = the probability of event A given that event B happened.

• P(A | B) is the probability measure of the event A after observing the occurrence of event B.

• Two events are called independent if and only if P(A ∩ B) = P(A) P(B) (or equivalently, P(A | B) = P(A)).

• Therefore, independence is equivalent to saying that observing B has no effect on the probability of A.


Probability – Conditional
• Q1: In a toy-making shop, the automated machine produces a few defective pieces. It is observed that in a lot of 1,000 toy parts, 25 are defective. If two random samples are selected for testing without replacement from the lot, calculate the probability that both samples are defective.
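A quick numeric check of Q1 (a sketch; the fractions follow directly from the problem statement):

```python
# Q1: both of two samples drawn without replacement are defective.
# First draw: 25 defective out of 1000; second draw: 24 out of the 999 remaining.
p_both = (25 / 1000) * (24 / 999)
print(round(p_both, 6))  # ≈ 0.000601
```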


Probability – Bayes’ Rule


• Posterior = Likelihood x Prior / Marginal
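In symbols, the rule on this slide reads (standard form, reconstructed since the slide’s equation was not extracted):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
P(B) = \sum_i P(B \mid A_i)\, P(A_i)
```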


Probability – Bayes’ Rule


• Q2: Suppose a new home HIV test has 95% sensitivity and 98% specificity, and is to be used in a population of size 100,000 with an HIV prevalence of 1/1000. What is the probability of a person being truly positive given that the person tests positive?
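A minimal sketch of the Q2 computation via Bayes’ rule (the numbers follow from the stated sensitivity, specificity, and prevalence):

```python
# Q2: P(truly HIV-positive | test positive).
sens, spec, prev = 0.95, 0.98, 1 / 1000
# Total probability of a positive test: true positives + false positives.
p_pos = sens * prev + (1 - spec) * (1 - prev)
ppv = sens * prev / p_pos  # Bayes' rule
print(round(ppv, 4))  # ≈ 0.0454
```

Of the 100,000 people, about 100 are infected (95 of them test positive), while roughly 1,998 of the uninfected also test positive, so only about 4.5% of positive tests are true positives.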


Probability – Bayes’ Rule


• Q3: The probability that an email with the sender-name words ‘mass’ and ‘bulk’ is spam = 0.8.

• Probability of a false alarm = 0.1 (marked as spam even if the email is not spam).

• Prior knowledge: only 0.4% of the total emails received are spam.

• Let x be the event of being marked as spam when the sender name has the words ‘mass’ or ‘bulk’, and y be the event of a mail really being spam.

• Compute p(y = 1 | x = 1).
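A sketch of the Q3 computation, reading the bullets above as p(x=1 | y=1) = 0.8, p(x=1 | y=0) = 0.1, and p(y=1) = 0.004:

```python
p_x_given_spam = 0.8   # p(x=1 | y=1): marked spam given it really is spam
p_x_given_ham = 0.1    # p(x=1 | y=0): false alarm rate
p_spam = 0.004         # prior p(y=1)

# Total probability of being marked as spam.
p_x = p_x_given_spam * p_spam + p_x_given_ham * (1 - p_spam)
p_spam_given_x = p_x_given_spam * p_spam / p_x  # Bayes' rule: p(y=1 | x=1)
print(round(p_spam_given_x, 4))  # ≈ 0.0311
```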


Probability – Bayes’ Rule


• Q4: Consider a woman in her 40s who decides to have a mammogram, a medical test for breast cancer. Suppose you are told the test has a sensitivity of 80%, the prior probability of having breast cancer is 0.4%, and the false positive rate is 10%. If the test is positive, what is the probability that she has cancer?
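Q4 has the same structure; a minimal sketch with the stated numbers:

```python
sens = 0.80    # P(test positive | cancer)
prior = 0.004  # P(cancer)
fpr = 0.10     # P(test positive | no cancer)

p_pos = sens * prior + fpr * (1 - prior)  # total probability of a positive test
p_cancer = sens * prior / p_pos           # Bayes' rule: P(cancer | positive)
print(round(p_cancer, 4))  # ≈ 0.0311
```

Despite the positive test, the probability is only about 3%: the disease is rare, so false positives dominate.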


Probability – Random Variables


• Consider an experiment in which we flip 10 coins, and we want to know the number of coins that come up heads. Here, the elements of the sample space are length-10 sequences of H and T.

• In practice, we usually do not care about the probability of obtaining any particular sequence of heads and tails. Instead, we usually care about real-valued functions of outcomes, such as the number of heads that appear among our 10 tosses, or the length of the longest run of tails.

• These functions are known as random variables.


Probability – Random Variables


• Suppose that X(ω) is the number of heads which occur in a sequence of tosses ω. Given that 10 coins are tossed, X(ω) can take only a finite number of values, so it is known as a discrete random variable.

• Here, the probability of the set associated with a random variable X taking on some specific value k is P(X = k) = P({ω : X(ω) = k}).

• Suppose that X(ω) is a random variable indicating the amount of time it takes for a radioactive particle to decay. In this case, X(ω) takes on an infinite number of possible values, so it is called a continuous random variable.

• We denote the probability that X takes on a value between two real constants a and b (where a < b) as P(a ≤ X ≤ b) = P({ω : a ≤ X(ω) ≤ b}).


Random Variables – CDF


• Cumulative Distribution Function
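The slide’s formulas were not extracted; the standard definition and properties are:

```latex
F_X(x) \triangleq P(X \le x), \qquad
0 \le F_X(x) \le 1, \qquad
F_X \ \text{is non-decreasing}, \qquad
\lim_{x \to -\infty} F_X(x) = 0, \qquad
\lim_{x \to \infty} F_X(x) = 1
```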


Random Variables – PMF


• Probability Mass Function
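Again reconstructing the standard definition (the slide’s equations were not extracted):

```latex
p_X(k) \triangleq P(X = k), \qquad
0 \le p_X(k) \le 1, \qquad
\sum_k p_X(k) = 1
```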


Random Variables – PDF


• Probability Density Function
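Reconstructing the standard definition (the slide’s equations were not extracted):

```latex
f_X(x) \triangleq \frac{dF_X(x)}{dx}, \qquad
P(a \le X \le b) = \int_a^b f_X(x)\, dx, \qquad
\int_{-\infty}^{\infty} f_X(x)\, dx = 1
```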


Random Variables – Expectation


• Mean of random variable

• Intuitively, the expectation of g(X) can be thought of as a “weighted average” of the values that g(x) can take on for different values of x, where the weights are given by p(x) or f(x).
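In symbols (standard forms, since the slide’s equations were not extracted):

```latex
E[g(X)] = \sum_x g(x)\, p(x) \quad \text{(discrete)}, \qquad
E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx \quad \text{(continuous)}
```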


Random Variables – Variance


• The variance of a random variable X is a measure of how concentrated the distribution of X is around its mean.
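In symbols (standard form, since the slide’s equation was not extracted):

```latex
\mathrm{Var}[X] \triangleq E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2
```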


Random Variables – Variance


• Q5: Calculate the mean and the variance of the uniform random variable X with PDF
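The PDF on the slide was not extracted; assuming X ~ U[a, b] with f(x) = 1/(b − a) on [a, b], the worked answer is:

```latex
E[X] = \int_a^b \frac{x}{b-a}\, dx = \frac{a+b}{2}, \qquad
E[X^2] = \int_a^b \frac{x^2}{b-a}\, dx = \frac{a^2 + ab + b^2}{3},
```
```latex
\mathrm{Var}[X] = E[X^2] - (E[X])^2
= \frac{a^2 + ab + b^2}{3} - \frac{(a+b)^2}{4}
= \frac{(b-a)^2}{12}
```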


Discrete Random Variables


Continuous Random Variables


PDF/CDF of Random Variables


Statistics of Random Variables


Multiple Random Variables


(Joint CDF)

(Marginal CDF)


Multiple Random Variables


(Joint PMF)

(Marginal PMF)

(Joint PDF)

(Marginal PDF)
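The marginalization formulas (standard forms; the slide’s equations were not extracted):

```latex
p_X(x) = \sum_y p_{X,Y}(x, y), \qquad
f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy
```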


Multiple Random Variables


Two random variables X and Y are independent if F_{X,Y}(x, y) = F_X(x) F_Y(y) for all x and y; equivalently, their joint PMF or PDF factorizes into the product of the marginals.


Covariance of Random Variables

When Cov[X, Y ] = 0, we say that X and Y are uncorrelated.
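In symbols (standard form, since the slide’s equation was not extracted):

```latex
\mathrm{Cov}[X, Y] \triangleq E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y]
```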


Central Limit Theorem


• The central limit theorem tells us that the sum of a set of random variables, which is itself a random variable, has a distribution that becomes Gaussian as the number of terms in the sum increases.
• Consider N IID variables, each U[0, 1], and find the distribution of their average. For large N, this distribution tends to a Gaussian.
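The U[0, 1] example can be checked numerically (a sketch using NumPy; the choice of 100,000 repetitions is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                         # terms in each average
avgs = rng.random((100_000, N)).mean(axis=1)   # 100,000 sample averages

# CLT prediction: approximately Gaussian with mean 1/2 and variance 1/(12N).
print(round(avgs.mean(), 3), round(avgs.var(), 5))
```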


Sampling Distributions
• An important application of statistics in machine learning is how to draw a conclusion about a set or population based on the probability model of random samples of the set.

• E.g. based on the malignancy sample test results of some random tumor cases, we want to estimate the proportion of all tumors which are malignant, and thus advise the doctors on the requirement or non-requirement of biopsy on each tumor case.

• Different random samples may give different estimates.

• If we can get some knowledge about the variability of all possible estimates derived from the random samples, then we should be able to arrive at reasonable conclusions.

Sampling Distributions
• Population is the finite set of objects being investigated.

• Random sample refers to a sample of objects drawn from a population in a way that every member of the population has the same chance of being chosen.

• Sampling distribution refers to the probability distribution of a random variable defined in a space of random samples.

• Sampling with Replacement: while choosing samples from the population, if each object chosen is returned to the population before the next object is chosen, then it is called sampling with replacement. In this case, repetitions are allowed. The number of ordered samples of size n from a population of size N is N^n, since each object can be repeated. Probability of each sample = 1/N^n.


Sampling Distributions
• Sampling with Replacement: choose a random sample of 2 patients from a population of 3 patients {A, B, C}, with replacement allowed. There are 9 such ordered pairs: (A, A), (A, B), (A, C), (B, A), (B, B), (B, C), (C, A), (C, B), (C, C). The number of random samples of 2 from the population of 3 is 3² = 9.

• Sampling without Replacement: if we do not return the object being chosen to the population before choosing the next object, then the unordered subset is called a sample without replacement. The number of such samples of size n that can be drawn from a population of size N is C(N, n) = N! / (n! (N − n)!).
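The two counting rules can be checked against the patients example (a sketch using the standard library):

```python
from itertools import combinations, product

population = ["A", "B", "C"]
with_repl = list(product(population, repeat=2))    # ordered pairs, replacement allowed
without_repl = list(combinations(population, 2))   # unordered pairs, no replacement
print(len(with_repl), len(without_repl))  # 9 3  (i.e. N**n and C(N, n))
```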


Sampling Distributions – Mean and Var


• Let X be a random variable with mean μ and standard deviation σ in a population of size N.

• A random sample of size n drawn from the population will generate n values x1, x2, …, xn for X.

• When samples are drawn with replacement, these values are independent of each other and can be considered as values of n independent random variables X1, X2, …, Xn, each having mean μ and variance σ².

• The sample mean is X̄ = (x1 + x2 + … + xn) / n.

• When samples are drawn without replacement, these values are not independent of each other.


Hypothesis Testing
• A hypothesis is a statement about one or more populations.

• It is usually concerned with the parameters of the population, e.g. a hospital administrator may want to test the hypothesis that the average length of stay of patients admitted to the hospital is 5 days.

• Null hypothesis H0: the hypothesis to be tested.

• Alternative hypothesis H1: a statement of what we believe is true if our sample data cause us to reject the null hypothesis.

• The level of significance α is the probability of rejecting H0 when H0 is in fact true (a Type I error).


Hypothesis Testing


Monte Carlo Approximation


• Finding the distribution of a function of random variables is, in practical situations, difficult to compute using the change of variables formula.

• Monte Carlo approximation provides a simple but powerful alternative to this.

• Generate S samples x1, …, xS from the distribution.

• Given the samples, we can approximate the distribution of f(X) by using the empirical distribution of {f(xs)} for s = 1, …, S.

• We can use Monte Carlo to approximate the expected value of any function of a random variable: we simply draw samples, and then compute the arithmetic mean of the function applied to the samples.
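A minimal Monte Carlo sketch (the target distribution N(0, 1) and the function f(x) = x² are illustrative choices, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(42)
S = 1_000_000
x = rng.normal(0.0, 1.0, S)   # S samples from p(x) = N(0, 1)

# E[f(X)] ≈ (1/S) * sum_s f(x_s); here f(x) = x**2, whose true value E[X²] = 1.
estimate = (x ** 2).mean()
print(round(estimate, 2))  # ≈ 1.0
```

The error of the estimate shrinks like 1/√S, so a million samples give roughly three digits of accuracy here.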
