
Probability Refresher
CS771: Introduction to Machine Learning
Purushottam Kar

What is Probability?
It depends on whom we ask
A statistician may claim that probability measures how frequently something happens
  "If I recommend an iPhone to 1000 female customers aged 25-30 years, roughly 600 of them will make a purchase"
A logician may claim that probability measures the amount of uncertainty in a certain statement
  "If John makes a credit card transaction worth more than ₹10,000, then there is a 70% chance it is fraudulent since he never spends so much"
A measure theoretician may claim that probability assigns positive scores to outcomes in a way that lets two scores be easily compared
  "This customer is more likely to be in the 20-30 age group than the 50+ group"
Machine Learning subscribes to all these views in one way or another
Sample Space
Denotes an exhaustive enumeration of all possible outcomes that either have happened or could happen (even if extremely unlikely)
A Toy RecSys Problem: our website has 10 products on sale. Users visit our website, browse, and are shown one ad. Depending on their experience, they either purchase one of the 10 products or don't purchase anything. We record the gender and age of the customer and how many seconds they spend on the website.
Sample Space: all possible tuples of (Gender, Age, Time Spent, Ad Shown, Purchase)
Sample spaces are often infinite in size in real settings since they enumerate all possibilities, even very unlikely ones
Events
An event is simply a description of useful facts about an outcome
Notice that events may choose to precisely describe certain aspects and neglect others e.g. among the events given here, some are very specific about the gender of the user, others are not; some are concerned with whether a purchase was made, others are merely tracking time spent etc.
  "A male user in age group 20-30 years visiting our website" is an event
  "A female user being shown an ad for P2 (a laptop)" is an event
  "A user buying something that was shown as an ad" is an event
  "A user buying something that was not shown as an ad" is an event
  "A user spending more than 200 seconds on the website" is an event
ML can be used to do several useful things
Tell us how frequently an event occurs, or whether one event is more likely than another
  What fraction of male customers aged 35-40 purchase P6 (a phone) if shown an ad?
  What fraction of female customers purchase P2 (a laptop) whether shown an ad or not?
  Is it more likely that a purchase will be made if I show a mobile ad or a laptop ad?
  Is it more likely that a 20-25 year old will purchase if I show a mobile vs laptop ad?
Tell us how confident the ML algorithm is while giving the above replies
Random Variables
Random variables are simply a way to express useful facts about events as numbers so that we can do math with them
Random variables can be categorical or numerical
  Categorical: X = 1 if female, X = 2 if male, X = 3 if transgender
  Numerical (Discrete): Y = age of the person in years
  Numerical (Continuous): Z = number of seconds spent on the website
  Indicator: W = 1 if purchase made on ad shown, W = 0 otherwise
I could have also defined a random variable such that S = 1 if a purchase was made (whether or not it was the product whose ad was shown) and S = 0 otherwise. What I define as a random variable (or even an event) is totally up to my creativity.
We had earlier seen that features for data points in ML may be categorical, numerical etc. This is no accident. Random variables can indeed be seen as giving us "features" for the outcome of the experiment.
Example Outcome: a male customer aged 25 years spent 18 minutes on our website but did not purchase the product whose ad was shown
  X = 2, Y = 25, Z = 1080, W = 0
Can arrange many random variables as vectors too e.g. [2, 25, 1080, 0]
Probability Distribution
For the purpose of ML, a probability distribution serves two purposes
Given an event, it can tell us how likely that event is
  Given two events, this also allows us to ask which one is more likely
  For starters, a 120 year old human being is almost certainly a woman, not a man
  Note that random variables can be used to define events too e.g. W = 1 is an event (that a purchase was made on the product whose ad was shown)
Generate a sample outcome
  We want outcomes that are more likely to be generated more often than those that are rare e.g. "a 120 year old man who is shown an ad for P8, spent 1000 seconds but did not purchase anything" is not a very likely outcome
  Sometimes, we may be interested in a sample outcome with some restrictions e.g. we can request a random "female customer who is shown an ad for P6". In this case, we only want outcomes that satisfy the restriction, but we would still like those that are more likely to be generated more often.
  In this case, we are interested in getting samples of female customers who are shown an ad for P6. For example, if such customers are more likely to buy P6 then we would like to see W = 1 more frequently for these samples too!
Getting Started
Sample space: [figure: 24 coloured balls arranged in four rows, numbered 1 to 6 within each row]
Initially, to get used to things, it is good to think of probability in terms of proportions or frequency
Probability as Proportions
Sample/Outcome: pick one ball
Sample space: the same 24 balls as before
Assume that picking any ball is equally likely. In other words, the probability of picking any ball is 1/24 since there are only 24 balls.
Use this toy setting to get comfortable with concepts since in this case, the probability of some event is simply the proportion of outcomes on which that event happens
For now, we will only look at discrete random variables (categorical/numeric)
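A minimal sketch of this toy setting in Python, assuming the four-rows-of-six layout from the figure (the balls' colours are omitted here since they are not needed for these events):

```python
from fractions import Fraction

# 24 equally likely outcomes: (row, number) pairs, four rows of balls numbered 1-6
balls = [(row, number) for row in range(1, 5) for number in range(1, 7)]

def prob(event):
    """Probability of an event = proportion of equally likely outcomes where it holds."""
    favourable = [b for b in balls if event(b)]
    return Fraction(len(favourable), len(balls))

print(prob(lambda b: b[1] == 2))   # P(number = 2)  = 4/24  = 1/6
print(prob(lambda b: b[1] <= 3))   # P(number <= 3) = 12/24 = 1/2
```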
Probability as Proportions
Sample/Outcome: pick one ball
Define two random variables (r.v.): let X denote the number on the ball and Y denote the colour of the ball
P(X = 2) = proportion of samples for which we have X = 2
  Since each of the four rows contains exactly one ball numbered 2, we have P(X = 2) = 4/24 = 1/6
Probability as Proportions
Sample/Outcome: pick one ball
The same recipe works for any value of either r.v.
P(Y = red) = proportion of samples for which we have Y = red, i.e. the number of red balls in the figure divided by 24
Probability beyond Proportions
Suppose now that not all samples are equally likely, i.e. not all balls are equally likely to be picked
The probability of picking each ball (rows as in the figure, balls numbered 1-6):

  Ball:   1     2     3     4     5     6
  Row 1:  1/24  1/24  1/96  1/24  1/6   1/96
  Row 2:  1/48  1/24  1/12  1/96  1/48  1/24
  Row 3:  1/16  1/24  1/48  1/96  1/12  1/96
  Row 4:  1/96  1/24  1/24  1/48  1/24  1/12

Note: the sum of probabilities over all 24 balls is still 1
P(X = 2) = sum of probabilities of samples for which X = 2, e.g. here P(X = 2) = 1/24 + 1/24 + 1/24 + 1/24 = 1/6
P(Y = red) = sum of probabilities of samples for which Y = red
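The same computation in code. This is a direct transcription of the table above into a dictionary, checking that the weights form a valid distribution and summing the qualifying outcomes:

```python
from fractions import Fraction as F

# Ball probabilities keyed by (row, number), copied from the table above
P = {
    (1, 1): F(1, 24), (1, 2): F(1, 24), (1, 3): F(1, 96), (1, 4): F(1, 24), (1, 5): F(1, 6),  (1, 6): F(1, 96),
    (2, 1): F(1, 48), (2, 2): F(1, 24), (2, 3): F(1, 12), (2, 4): F(1, 96), (2, 5): F(1, 48), (2, 6): F(1, 24),
    (3, 1): F(1, 16), (3, 2): F(1, 24), (3, 3): F(1, 48), (3, 4): F(1, 96), (3, 5): F(1, 12), (3, 6): F(1, 96),
    (4, 1): F(1, 96), (4, 2): F(1, 24), (4, 3): F(1, 24), (4, 4): F(1, 48), (4, 5): F(1, 24), (4, 6): F(1, 12),
}

assert sum(P.values()) == 1   # probabilities over the whole sample space sum to 1

# P(X = 2): sum the probabilities of all balls carrying the number 2
p_x2 = sum(p for (row, num), p in P.items() if num == 2)
print(p_x2)                   # 1/6
```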
Rules of Probability
Let Ω denote the sample space (set of all possible outcomes)
Let X be any (discrete) random variable and let supp(X) denote the set of (numerical) values X could possibly take (even unlikely values)
The set supp(X) is called the support of the random variable X
In the previous example, supp(X) = {1, 2, 3, 4, 5, 6} and supp(Y) is the set of ball colours
For any outcome ω ∈ Ω, let X(ω) denote the value of X on that outcome
P(ω) is the probability with which an outcome ω happens, e.g. in the table above, the ball numbered 3 in the first row has P(ω) = 1/96
For any value x ∈ supp(X) (i.e. any valid value), we define
  P(X = x) = Σ over all ω ∈ Ω with X(ω) = x of P(ω)
Sometimes we use the lazy notation P(x) to denote P(X = x)
Rules of Probability
No matter how we define our random variable, if it is discrete valued, then the following must hold
For all x ∈ supp(X), we must have P(X = x) ≥ 0
  If P(X = x) = 0 then we say x is an impossible value for the random variable X
  If P(X = x) = 1 then we say that X almost surely takes the value x
We must have Σ over all x ∈ supp(X) of P(X = x) = 1
  Another way of saying that when we get a sample, the random variable must take some valid value on that sample; it cannot remain undefined!
  It is a different thing that we (e.g. the ML algo) may not know what value X has taken on that sample, but there must be some hidden value it did take
An immediate consequence of the above two rules is that for all x ∈ supp(X), we must have P(X = x) ≤ 1
Probability Mass Function (PMF)
A fancy name for a function that tells us the probability of the random variable taking any particular value
For a discrete random variable X, its PMF tells us, for any x ∈ supp(X), the probability of X taking the value x i.e. P(X = x)
Note that this will always give us values in [0, 1], and in a way such that if P(X = x) is large, we will sample that value more often than a value for which the PMF is small
That is correct. E.g. in our toy setting (where not all samples are equally likely), if we sample Y (recall that Y encodes colour), we are almost twice as likely to get one colour as another, and there is a comparatively much smaller chance that we would get the rarest colour.
Warning: papers/books often use lazy or confusing notation – take care
  Often the blackboard letter P, i.e. ℙ, is used to denote the PMF, i.e. ℙ[X = x]
  Sometimes we may write ℙ_X to emphasize that this is the PMF for X and not some other r.v.
  Sometimes, P_X(x) is also used to refer to the PMF of the random variable X
Sampling from a PMF: x ~ ℙ_X or x ~ P_X or even x ~ X means that we generated an outcome ω according to the probability distribution, e.g. ω ~ P, and are looking at x = X(ω)
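A small sketch of sampling from a PMF; the colour weights below are illustrative stand-ins, not the actual slide figure's values:

```python
import random

# Hypothetical PMF for the colour r.v. Y (assumed weights, for illustration only)
pmf_Y = {"red": 0.5, "green": 0.3, "blue": 0.2}

# Draw samples y ~ P_Y: values with larger P(Y = y) show up proportionally more often
values, weights = zip(*pmf_Y.items())
samples = random.choices(values, weights=weights, k=10_000)
for v in values:
    print(v, samples.count(v) / len(samples))   # empirical frequencies approach the PMF
```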
Joint Probability
P(X = x, Y = y) = proportion of samples for which we have both X = x and Y = y
Let us look at the uniform case first
  E.g. in the figure, 3 of the 4 balls numbered 2 are red, so P(X = 2, Y = red) = 3/24 = 1/8
Notation: P(X = x, Y = y) means the same as P(X = x ∧ Y = y)
If not all samples are equally likely, then we similarly look at the sum of the probabilities of all samples where X = x and Y = y
  E.g. if exactly two balls satisfy both conditions and their probabilities (from the table above) are 1/24 and 1/16, then the joint probability is 1/24 + 1/16 = 5/48
  A combination that occurs on no ball still has probability zero
A PMF for the Joint Distribution?
The joint probabilities also form a valid distribution
For any x ∈ supp(X) and y ∈ supp(Y), we have P(X = x, Y = y) ≥ 0
The sum of probabilities over all values of x and y adds up to 1 too
Proof: Recall that we defined P(X = x) = Σ over ω with X(ω) = x of P(ω), where P(ω) is the probability with which an outcome ω happens. For the joint probability, we are instead interested in all samples where X(ω) = x AND Y(ω) = y, i.e.
  P(X = x, Y = y) = Σ over ω with X(ω) = x and Y(ω) = y of P(ω)
Thus, we have Σ over x, y of P(X = x, Y = y) = Σ over all ω ∈ Ω of P(ω) = 1, since every outcome takes exactly one pair of values (x, y)
However, since each joint probability is non-negative and they sum to 1, we conclude that we must also have P(X = x, Y = y) ≤ 1
Keep in mind that this result holds for any two (or even more than two) r.v.s, no matter how they are defined. This result holds even if the two r.v.s are clones of each other! This is so because the proof of this result never uses facts such as which r.v. uses colour in its definition and which does not. Even if both were defined using the colour of the ball, this result would still be true.
The PMF for this joint distribution is simply a function that takes two inputs, namely x and y, and gives us P(X = x, Y = y). Often the notation ℙ_XY or ℙ_{X,Y} is used to refer to this joint distribution.
Just as before, we can sample from this PMF, except in this case the PMF would return two numbers instead of one, i.e. (x, y), since what sampling would do is obtain an outcome ω and simply return (X(ω), Y(ω)).
Joint Distributions on more R.V.s
Suppose we had another r.v., say Z, indicating the row in which the ball is listed, i.e. Z = 1 if the ball is in the first row, Z = 2 for the second row, etc.
We could define a joint probability distribution on the three RVs now, i.e. P(X = x, Y = y, Z = z)
[The same probability table as before, with its four rows now labelled Z = 1, Z = 2, Z = 3, Z = 4]
Marginal Probability
When we had only two RVs (namely X, Y) we looked at how they behave at the same time (by looking at P(X = x, Y = y)) or how they behaved on their own (by looking at P(X = x) and P(Y = y))
Now that we have three RVs (X, Y, Z) we can look at how they behave
  At the same time, by looking at P(X = x, Y = y, Z = z)
  On their own, by looking at P(X = x), P(Y = y), P(Z = z)
  Two at a time, by looking at P(X = x, Y = y), P(Y = y, Z = z), P(X = x, Z = z)
The smaller distributions are called marginal probability distributions and they look at how a subset of the RVs behaves
Marginal distributions are also proper distributions (the same proof works) and hence they also have PMFs associated with them
Obtaining Marginal PMF from Joint PMF
If we have the joint PMF for a set of RVs, say X, Y, Z, then obtaining the marginal PMF for any subset of these RVs is very simple
Involves a process called marginalization: uses the proof we saw earlier
Suppose we are interested in P(X = x, Y = y), i.e. we don't care about Z
In this case we say that Z has been marginalized out
The earlier argument can be reused to show that for any x, y
  P(X = x, Y = y) = Σ over z ∈ supp(Z) of P(X = x, Y = y, Z = z)
This shows that the marginal PMF can be read off directly from the joint PMF
Similarly, P(X = x) = Σ over y ∈ supp(Y), z ∈ supp(Z) of P(X = x, Y = y, Z = z)
In this case we say that both Y and Z have been marginalized out
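A sketch of marginalization in code. The joint PMF below is made up for illustration (it sums to 1, which is all the rule needs); `keep` lists which coordinates survive:

```python
from fractions import Fraction as F

# Made-up joint PMF over (X, Y, Z); keys are (x, y, z) assignments
joint = {
    (1, "red", 1):  F(1, 8), (1, "red", 2):  F(1, 8),
    (1, "blue", 1): F(1, 8), (1, "blue", 2): F(1, 8),
    (2, "red", 1):  F(1, 4), (2, "blue", 2): F(1, 4),
}

def marginal(joint, keep):
    """Sum out every coordinate not listed in `keep` (a tuple of indices)."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, F(0)) + p
    return out

print(marginal(joint, keep=(0, 1)))   # P(X = x, Y = y): Z marginalized out
print(marginal(joint, keep=(0,)))     # P(X = x): both Y and Z marginalized out
```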
Conditional Probability
Perhaps one of the most important concepts w.r.t. ML applications
Let us look at the uniform case first
Notice: if we focus only on balls with the number 2 written on them, most (3/4) of those balls are red
Contrast: if the number on the ball is 3, nothing as strong can be said
P(Y = red | X = 2) = proportion of samples with Y = red among those samples where X = 2
In this case, P(Y = red | X = 2) = 3/4
Conditional Probability
The previous way of defining conditional probability makes it cumbersome to extend to more general settings
Let us use a different (but equivalent) way of defining conditional probability
  P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
(Dividing the numerator and the denominator of the proportion-based definition by the total number of samples, i.e. 24, shows the two definitions agree)
The above is just another way of saying P(X = x, Y = y) = P(Y = y | X = x) · P(X = x)
What if P(X = x) happens to be 0? Won't we get a divide-by-zero error?
Yes; although there are ways to get around this, for this course we will avoid such cases or else, if convenient, define the conditional probability to be 0 in such cases
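A sketch of this definition in code, using a small made-up joint PMF chosen so that the answer matches the 3/4 from the uniform example above:

```python
from fractions import Fraction as F

# Made-up joint PMF over (X, Y); probabilities sum to 1
joint = {(1, "red"): F(1, 4), (1, "blue"): F(1, 4),
         (2, "red"): F(3, 8), (2, "blue"): F(1, 8)}

def conditional(joint, x):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x), for every y."""
    p_x = sum(p for (xx, _), p in joint.items() if xx == x)
    if p_x == 0:
        return {}   # avoid divide-by-zero: condition only on possible values
    return {y: p / p_x for (xx, y), p in joint.items() if xx == x}

print(conditional(joint, 2))   # {'red': 3/4, 'blue': 1/4}
```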
Conditional Probability
[The same non-uniform probability table as before; conditional probabilities in the non-uniform case are computed from it using the definition above]
A PMF for the Conditional Distribution?
Conditional probability values also form a distribution
For any value x of X we have, by our marginalization argument
  Σ over y of P(Y = y | X = x) = Σ over y of P(X = x, Y = y) / P(X = x) = P(X = x) / P(X = x) = 1
This implies 0 ≤ P(Y = y | X = x) ≤ 1
Thus, we can readily define a PMF for conditional distributions as well, one that takes in two values x, y and gives P(Y = y | X = x)
We can similarly define P(X = x | Y = y) as well
Notation used: ℙ[Y | X] or ℙ_{Y|X}
May ask for a sample y ~ ℙ[Y | X = x] or y ~ (Y | X = x) too!
To sample from ℙ[Y | X = x], we consider the set of only those outcomes in the sample space where X(ω) = x, sample an outcome ω from this set with probability proportional to P(ω), and then return Y(ω)
Marginal Conditional Probability????
The operations of marginalization and conditioning can be used to define lots of different kinds of PMFs
For example, consider P(X = x | Z = z)
If we marginalize Y out of the PMF for P(X = x, Y = y | Z = z), we will get the PMF for P(X = x | Z = z)
  P(X = x | Z = z) = Σ over y of P(X = x, Y = y | Z = z)
Can prove the above result using the same marginalization argument
Note that this means that this PMF can be derived from the PMF for the joint distribution, i.e.
  P(X = x | Z = z) = Σ over y of P(X = x, Y = y, Z = z) / P(Z = z)
Marginal Conditional Probability????
The operations of marginalization and conditioning can be used to define lots of different kinds of PMFs
For example, consider P(X = x, Y = y | Z = z)
We can show that this is nothing but P(X = x, Y = y, Z = z) / P(Z = z)
Try proving this result using the marginalization and conditioning rules
Yet again, this means that this PMF can be derived from the PMF for the joint distribution
Rules of Probability
Sum Rule (Marginalization Rule), aka Law of Total Probability
  P(X = x) = Σ over y of P(X = x, Y = y)
Product Rule (Conditioning Rule)
  P(X = x, Y = y) = P(X = x | Y = y) · P(Y = y) = P(Y = y | X = x) · P(X = x)
Combine to get
  P(X = x) = Σ over y of P(X = x | Y = y) · P(Y = y)
  or more explicitly, P(X = x) = Σ over y ∈ supp(Y) of P(X = x | Y = y) · P(Y = y)
Chain Rule (Iterated Conditioning Rule)
  P(X_1 = x_1, ..., X_n = x_n) = P(X_1 = x_1) · P(X_2 = x_2 | X_1 = x_1) · ... · P(X_n = x_n | X_1 = x_1, ..., X_{n-1} = x_{n-1})
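A quick numerical check of the sum and product rules on the same made-up joint PMF from the conditional-probability sketch above:

```python
from fractions import Fraction as F

joint = {(1, "red"): F(1, 4), (1, "blue"): F(1, 4),
         (2, "red"): F(3, 8), (2, "blue"): F(1, 8)}

def P_X(x): return sum(p for (xx, y), p in joint.items() if xx == x)
def P_Y(y): return sum(p for (x, yy), p in joint.items() if yy == y)
def P_Y_given_X(y, x): return joint[(x, y)] / P_X(x)

# Sum rule / law of total probability: P(Y = y) = sum over x of P(Y = y | X = x) P(X = x)
for y in ("red", "blue"):
    assert P_Y(y) == sum(P_Y_given_X(y, x) * P_X(x) for x in (1, 2))

# Product rule: P(X = x, Y = y) = P(Y = y | X = x) P(X = x)
for (x, y), p in joint.items():
    assert p == P_Y_given_X(y, x) * P_X(x)

print("sum and product rules verified")
```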
Bayes Theorem
The foundation of Bayesian Machine Learning
P(X = x, Y = y) = P(X = x | Y = y) · P(Y = y) and P(Y = y, X = x) = P(Y = y | X = x) · P(X = x)
However, P(X = x, Y = y) and P(Y = y, X = x) are the same thing
Thus, P(X = x | Y = y) · P(Y = y) = P(Y = y | X = x) · P(X = x)
This gives us
  P(X = x | Y = y) = P(Y = y | X = x) · P(X = x) / P(Y = y)
Similarly,
  P(Y = y | X = x) = P(X = x | Y = y) · P(Y = y) / P(X = x)
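A sketch in the spirit of the fraud example from the start of the lecture; all three numbers below are assumptions made up for illustration, not data:

```python
from fractions import Fraction as F

# F = 1 if a transaction is fraudulent, L = 1 if it is large (> Rs 10,000)
p_F = F(1, 100)            # prior: 1% of transactions are fraudulent (assumed)
p_L_given_F = F(7, 10)     # fraudulent transactions are often large (assumed)
p_L_given_notF = F(1, 20)  # legitimate ones rarely are (assumed)

# Sum rule for the evidence, then Bayes theorem for the posterior
p_L = p_L_given_F * p_F + p_L_given_notF * (1 - p_F)
p_F_given_L = p_L_given_F * p_F / p_L
print(p_F_given_L)         # 14/113, i.e. roughly a 12% chance of fraud given a large transaction
```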
Marginal, Joint and Conditional Probability
In most settings, we would have defined tons of random variables on our outcomes to capture interesting things about the outcomes, say
  G: what is the gender of the person visiting our website
  A: what is the age of the person
  T: how many seconds did they spend on our website
  D: what ad were they shown
  S: what purchase did they make
ML algos like to ask and answer interesting questions about these random variables
Marginal, Joint and Conditional Probability give us the language to speak when asking and answering these questions
Using Probability to do ML
Arguably the most interesting random variable of these is S (what purchase was made)
Recommendation Systems (RecSys) would like to know what value S would take if we know the values of G, A, T, D
Of these, the website cannot control G, A, T but it does control D (which ad is shown)
The whole enterprise of recommendation and ad placement can be summarized in the following statement:
  "Given values g, a, t, find a value d of D such that the probability of a purchase, P(S = s | G = g, A = a, T = t, D = d), is as close to 1 as possible"
ML algos for RecSys can learn distributions (the models for these ML algos are distributions) such that they mimic reality, i.e. if the model says the purchase probability is close to 1, the user really does buy something
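A crude sketch of this idea: estimate P(S = 1 | D = d) from logged data by counting, i.e. probability as proportion. The log below is fabricated for illustration:

```python
from collections import Counter

# Fabricated log of (ad shown, purchase made?) pairs
log = [("mobile", 1), ("mobile", 0), ("laptop", 0),
       ("mobile", 1), ("laptop", 1), ("laptop", 0)]

shown = Counter(d for d, s in log)
bought = Counter(d for d, s in log if s == 1)
for d in shown:
    print(d, bought[d] / shown[d])   # empirical estimate of P(S = 1 | D = d)

# A RecSys would then show the ad d that maximizes this estimated probability
```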
Creating Events from Random Variables??
In the RecSys example, we saw the expression P(S = s | G = g, A = a, T = t, D = d)
Here S = s is an event (that the user purchased something) and we are interested in the probability of this event given G = g, A = a, T = t, D = d
Recall: events are merely a description of interesting facts about an outcome
We can also have events like Y ≤ y (that the random variable takes a value less than or equal to y). Similarly, Y > y, Y ≠ y etc. are also valid events.
Similarly, (X = 1 ∧ Y = green) is also an event (that the ball we picked is green and has the number 1 stamped on it)
X = 1 is also an event (that the ball has the number 1 stamped on it)
Y = green is also an event (that the ball is green coloured)
Given an event, it may happen on certain outcomes and not happen on others, e.g. if the event is Y = green then this event will not have taken place if we pick a blue ball
Thus, given any collection of random variables, we can create events out of them and ask interesting questions about the r.v.s
Event Calculus
The term calculus, in general, means a system of rules; it does not necessarily mean differentiation or integration
Let A, B be events (possibly defined using same/different r.v.s etc.)
¬A is also an event
  Called the negation or complement of A (A did not happen)
A ∪ B is also an event
  Called the union of the two events (either A happened or B happened or both happened)
A ∩ B is also an event
  Called the intersection of the two events (both A and B happened)
De Morgan's Laws: for any two events A, B, we must have ¬(A ∪ B) = ¬A ∩ ¬B as well as ¬(A ∩ B) = ¬A ∪ ¬B
  ¬(A ∪ B) tells us that saying "It is not the case that either A happened or B happened or both happened" is just a funny way of saying "A did not happen and B did not happen, i.e. neither happened"
  ¬(A ∩ B) tells us that saying "It is not the case that both A and B happened" is just another way of saying "Either A did not happen or B did not happen or both did not happen"
This means we can be more creative in defining events, e.g. differences such as A but not B, and other combinations
Event Calculus
Example: let A = (X = 2) and B = (Y = red) be events
¬A is also an event (the number on the ball is something other than 2)
¬B is also an event (the colour of the ball is something other than red)
A ∪ B is also an event (either I have a red ball, or a ball with the number 2 written on it, or else a red ball with the number 2 written on it)
A ∩ B is also an event (I have a red ball and the number on it is 2)
De Morgan's Laws: for any two events A, B, we must have ¬(A ∪ B) = ¬A ∩ ¬B as well as ¬(A ∩ B) = ¬A ∪ ¬B
Caution: De Morgan's Laws always hold no matter how we defined the events. They do not require the events to be defined on separate r.v.s etc.
Rules of Probability for Complex Events
Suppose we know the probabilities of two events A, B
Complement Rule: P(¬A) = 1 − P(A)
Union Rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
These rules can be proved using a similar proof technique to the one we used for the joint/marginal probability derivations in the last lecture
We can derive an "intersection rule" using De Morgan's laws and these rules
  P(A ∩ B) = 1 − P(¬(A ∩ B)) = 1 − P(¬A ∪ ¬B) = 1 − P(¬A) − P(¬B) + P(¬A ∩ ¬B)  (apply the complement and union rules)
The above allow us to calculate more interesting probabilities
We used only the marginal and the joint probability distributions
The above rules hold even for conditional probability
  P(¬A | C) = 1 − P(A | C)
  P(A ∪ B | C) = P(A | C) + P(B | C) − P(A ∩ B | C)
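A sketch verifying De Morgan's laws and the complement/union rules on the 24-ball sample space, treating events as sets of outcomes; the colouring below is a made-up placeholder, not the slide figure's actual colours:

```python
from fractions import Fraction as F

balls = {(row, num) for row in range(1, 5) for num in range(1, 7)}
colour = {b: ("red" if b[1] % 2 == 0 else "blue") for b in balls}   # hypothetical colouring

A = {b for b in balls if b[1] == 2}              # event: number on ball is 2
B = {b for b in balls if colour[b] == "red"}     # event: ball is red

def P(E): return F(len(E), len(balls))           # uniform case: probability = proportion

# De Morgan's laws as set identities
assert balls - (A | B) == (balls - A) & (balls - B)
assert balls - (A & B) == (balls - A) | (balls - B)

# Complement and union rules
assert P(balls - A) == 1 - P(A)
assert P(A | B) == P(A) + P(B) - P(A & B)
print("De Morgan, complement and union rules verified")
```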
Rules of Conditional Probability
Let C denote an event
  For example, C = (a user with gender g, age a spent t sec on the website and was shown ad d)
Let A, B denote two events
  A, B may be related/unrelated to C; it does not matter since the rules always hold
Conditional Complement Rule: P(¬A | C) = 1 − P(A | C)
Conditional Union Rule: P(A ∪ B | C) = P(A | C) + P(B | C) − P(A ∩ B | C)
Conditional Implication Rule: if A implies B, then P(B | A) = 1; on the other hand, if A implies ¬B, then P(B | A) = 0
  Caution: not a standard rule you would find in textbooks
In fact, Bayes Theorem applies to events just as well, i.e. if A, B are events, then P(A | B) = P(B | A) · P(A) / P(B)
  Proof: create a random variable for each event, e.g. X = 1 if event A happens and X = 0 if it does not happen. Similarly define a random variable Y to tell us whether B happened or not. Now use the Bayes theorem on X, Y and we are done!
The random variables we defined above to tell us whether some event happened or not are called indicator random variables since they indicate whether an event took place (in which case the r.v. takes value 1) or not (in which case the r.v. takes value 0). Notation: 𝕀{blah} = 1 if blah is true, else 0.
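A sketch of indicator random variables in action: conditional probabilities of events become ratios of averages of 0/1 indicators, P(A | B) ≈ mean(1_A · 1_B) / mean(1_B). The dice setup is an assumption chosen for illustration:

```python
import random

random.seed(0)
# Simulate two fair die throws many times
throws = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(100_000)]

ind_A = [1 if x + y == 7 else 0 for x, y in throws]   # A: the two throws sum to 7
ind_B = [1 if x == 3 else 0 for x, y in throws]       # B: the first throw is 3

p_B = sum(ind_B) / len(throws)
p_AB = sum(a * b for a, b in zip(ind_A, ind_B)) / len(throws)
print(p_AB / p_B)   # estimate of P(A | B); the exact value is 1/6
```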
Independence of Random Variables
Two r.v.s X, Y are said to be independent if, for all x ∈ supp(X) and y ∈ supp(Y), we have
  P(X = x, Y = y) = P(X = x) · P(Y = y)
Notation: if X, Y are independent, then we often write X ⊥ Y
X ⊥ Y means that X and Y will both take values according to their own (marginal) PMFs, happily unmindful of the values the other r.v. is taking!
We can show that if two r.v.s are independent, then the value one r.v. takes does not influence the other r.v. to take some value more preferentially than others in any way
  Proof: if X ⊥ Y, then P(Y = y | X = x) = P(X = x, Y = y) / P(X = x) = P(X = x) · P(Y = y) / P(X = x) = P(Y = y)
  Similarly, we also get P(X = x | Y = y) = P(X = x)
Can you show that if P(X = x | Y = y) = P(X = x) for all x, y, then we must always have P(Y = y | X = x) = P(Y = y) for all x, y as well? Hint: use the definition of independence.
For the colour arrangement shown in the uniform figure, one can verify that P(X = x, Y = y) = P(X = x) · P(Y = y) for all x, y, i.e. the number and the colour of the ball are independent there
Independence of r.v.s makes life extremely simple in ML algorithms but is very precious; it is not always available
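A sketch of checking independence directly from the definition, on two made-up joint PMFs:

```python
from fractions import Fraction as F
from itertools import product

def is_independent(joint, xs, ys):
    """Check P(X = x, Y = y) == P(X = x) P(Y = y) for all pairs (x, y)."""
    pX = {x: sum(joint.get((x, y), F(0)) for y in ys) for x in xs}
    pY = {y: sum(joint.get((x, y), F(0)) for x in xs) for y in ys}
    return all(joint.get((x, y), F(0)) == pX[x] * pY[y] for x, y in product(xs, ys))

# All four combinations equally likely: X and Y are independent
indep = {(x, y): F(1, 4) for x, y in product((1, 2), ("red", "blue"))}
print(is_independent(indep, (1, 2), ("red", "blue")))   # True

# X fully determines Y: clearly not independent
dep = {(1, "red"): F(1, 2), (2, "blue"): F(1, 2)}
print(is_independent(dep, (1, 2), ("red", "blue")))     # False
```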
Conditional Independence
If X, Y, Z are three r.v.s such that for all x, y, z we have
  P(X = x, Y = y | Z = z) = P(X = x | Z = z) · P(Y = y | Z = z)
then we say that X and Y are conditionally independent given Z
Notation: X ⊥ Y | Z
If X, Y are independent, then it is not necessary that they continue to be independent even if conditioned on a third random variable
  This is the earlier example where we verified that X ⊥ Y. However, it is easy to see that we do not have X ⊥ Y | Z (with Z the row of the ball), since within a single row the product formula above can be seen to fail.
Conditional Independence
Even if X, Y are not independent, it is still possible that there exists a third random variable Z that makes X, Y conditionally independent, i.e. X ⊥ Y | Z
Example: suppose the ball probabilities are now

  Ball:                1     2     3     4     5     6
  Rows 1 and 2, each:  1/96  1/96  2/96  2/96  9/96  9/96
  Rows 3 and 4, each:  1/24  1/24  1/24  1/24  1/24  1/24

The r.v.s number and colour are not independent here; to see this, note that the joint probabilities no longer factor into products of the marginals
However, if we condition on a new random variable Z that distinguishes the first two rows (Z = 1) from the last two rows (Z = 2), then X and Y become conditionally independent r.v.s, i.e. we do not have X ⊥ Y but we do have X ⊥ Y | Z
Conditional Independence
Slightly more "practical" examples
If I throw a fair die twice, the outcome of the first throw in no way influences the second outcome, i.e. the two outcomes are independent of each other and each takes values in {1, 2, 3, 4, 5, 6}. However, if I am additionally told the sum of the two numbers, then the throws are no longer independent given this information, since (for example, if the sum is 7) knowing the first throw now pins down the second one completely.
In the above, X1 is the number on the first throw, X2 denotes the second, Z is the sum
The sample space is {1, ..., 6} × {1, ..., 6}, i.e. all possible outcomes of two throws
Examples where non-independent r.v.s become conditionally independent are most commonly found in a branch of ML called graphical models.
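A sketch of this dice example, computing the relevant conditional probabilities exactly over the 36 equally likely outcomes:

```python
from fractions import Fraction as F

# All 36 equally likely outcomes of two fair die throws
outcomes = [(x1, x2) for x1 in range(1, 7) for x2 in range(1, 7)]

def P(event):
    return F(sum(1 for o in outcomes if event(o)), len(outcomes))

# Unconditionally: P(X2 = 3 | X1 = 4) = P(X2 = 3) = 1/6 (independence)
print(P(lambda o: o[0] == 4 and o[1] == 3) / P(lambda o: o[0] == 4))          # 1/6

# Conditioning on the sum Z = 7: now X1 pins down X2 completely
p = P(lambda o: o == (4, 3)) / P(lambda o: o[0] == 4 and o[0] + o[1] == 7)
print(p)   # 1, whereas P(X2 = 3 | Z = 7) alone is only 1/6
```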
Random Vectors
Random vectors can be thought of as simply a collection of random variables arranged in an array
No restriction on the random variables being independent or uncorrelated
The PMF/PDF of X = [X1, X2, ..., Xn] is simply the joint PMF/PDF of X1, X2, ..., Xn
Can talk about marginal/conditional probabilities among X1, ..., Xn
Think of X as just a bunch of r.v.s
Since the PMF/PDF of X is simply a joint PMF/PDF, all the probability laws we learnt earlier continue to hold if we apply them correctly
  Chain Rule, Sum Rule, Product Rule, Bayes Rule
  Conditional/marginal variants of all these rules
