Probability Refresher
CS771: Introduction to Machine Learning
Purushottam Kar
CS771: Intro to ML
What is Probability?
Depends on whom we ask this question
A statistician may claim probability is a way of measuring how frequently
something happens
“If I recommend an iPhone to 1000 female customers aged 25-30 years, roughly
600 of them will make a purchase”
A logician may claim that probability is a way of measuring the amount of
uncertainty in a certain statement
“If John makes a credit card transaction worth more than ₹10,000, then there is
a 70% chance it is fraudulent since he never spends so much”
A measure theoretician may claim probability is a way of assigning positive
scores to outcomes so that any two scores can be easily compared
“This customer is more likely to be in the 20-30 age group than the 50+ group”
Machine Learning subscribes to all these views in one way or another
Sample Space
Denotes an exhaustive enumeration of all possible outcomes that
either have happened or could happen (even if extremely unlikely)
A Toy RecSys Problem: our website has 10 products on sale. Users
visit our website, browse and are shown one ad. Depending on their
experience, they either purchase one of the 10 products or don’t
purchase anything. We record gender, age of customer and how
many seconds they spend on the website.
Sample Space: all tuples of the form (Gender, Age, Time Spent, Ad Shown, Purchase)
Sample spaces are often infinite in size in real settings since they
enumerate all possibilities, even very unlikely ones
Events

An event is simply a description of useful facts about an outcome

Notice that events may choose to precisely describe certain aspects and
neglect others e.g. in the events given here, some events are very specific
about the gender of the user, others are not. Some are concerned about
whether a purchase was made, others are merely tracking time spent etc.

“A male user in age group 20-30 years visiting our website” is an event
“A female user being shown an ad for P2 (a laptop)” is an event
“A user buying something that was shown as an ad” is an event
“A user buying something that was not shown as an ad” is an event
“A user spending more than 200 seconds on the website” is an event
ML can be used to do several useful things
Tell us how frequently an event occurs / whether one event is more likely than another
What fraction of male customers aged 35-40 purchase P6 (a phone) if shown an ad?
What fraction of female customers purchase P2 (a laptop) whether ad shown or not?
Is it more likely that a purchase will be made if I show a mobile ad or a laptop ad?
Is it more likely that a 20-25 year old will purchase if I show a mobile vs laptop ad?
Tell us how confident the ML algorithm is while giving the above replies
Random Variables

Random variables are simply a way to express useful facts about
events as numbers so that we can do math with them

We had earlier seen that features for data points in ML may be
categorical, numerical etc. This is no accident. Random variables can
indeed be seen as giving “features” for the outcome of the experiment

Random variables can be categorical or numerical
Categorical: X = 1 if female, X = 2 if male, X = 3 if transgender
Numerical (Discrete): Y = age of person in years
Numerical (Continuous): Z = number of seconds spent on the website
Indicator: W = 1 if purchase made on ad shown, W = 0 otherwise

I could have also defined a random variable such that S = 1 if a purchase was
made (whether or not on the ad shown) and S = 0 otherwise. What I define as
a random variable (or even an event) is totally up to my creativity

Example Outcome: A male customer aged 25 years spent 18 minutes
on our website but did not purchase the product whose ad was shown
X = 2, Y = 25, Z = 1080, W = 0
Can arrange many random variables as vectors too e.g. [2, 25, 1080, 0]
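The mapping from an outcome to the values its random variables take can be sketched in code. A minimal sketch, assuming the slide's encodings (X for gender, Y for age in years, Z for seconds on site, W as the ad-purchase indicator); the function name is hypothetical:

```python
# Encode an outcome of the toy RecSys experiment as random-variable values.
# Encodings follow the slide: X (gender code), Y (age in years),
# Z (seconds on site), W (1 iff the advertised product was purchased).

def to_feature_vector(gender, age_years, minutes_on_site, bought_ad_product):
    X = {"female": 1, "male": 2, "transgender": 3}[gender]
    Y = age_years
    Z = minutes_on_site * 60          # convert minutes to seconds
    W = 1 if bought_ad_product else 0
    return [X, Y, Z, W]

# The example outcome from the slide: male, 25 years, 18 minutes, no purchase
print(to_feature_vector("male", 25, 18, False))  # → [2, 25, 1080, 0]
```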
Probability Distribution

For the purpose of ML, a probability distribution serves two purposes

Tell us how likely an event is
Given two events, this also allows us to ask which one is more likely.
For starters, a 120 year old human being is almost certainly a woman, not a man
Note that random variables can be used to define events too e.g. W = 1 is an
event (that a purchase was made on the product whose ad was shown)

Generate a sample outcome
We want outcomes that are more likely to be generated more often than those
that are rare e.g. “120 year old man who is shown an ad for P8, spent 1000
seconds but did not purchase anything” is not a very likely outcome
Sometimes, we may be interested in a sample outcome with some restrictions
e.g. we can request a random “female customer who is shown an ad for
P6”. In this case, we are interested in getting samples of female customers who
are shown an ad for P6, and would like those samples that are more likely to be
generated more often. For example, if such customers are more likely to
buy P6 then we would like W = 1 more frequently for these samples too!
Getting Started

Sample space: [Figure: balls numbered 1 to 6, arranged in rows of different colours]
Note: sample spaces always enumerate all possible outcomes
Probability as Proportions

Sample/Outcome: pick one ball
Sample space: [Figure: the grid of balls numbered 1 to 6]
Define two random variables (r.v.):
X = the number on the ball
Y = the colour of the ball
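When all balls are equally likely, every probability is just a proportion of the sample space. A minimal sketch of this counting view; the particular assignment of colours to balls below is hypothetical (the original figure is not recoverable), the counting logic is the point:

```python
from fractions import Fraction

# A hypothetical sample space of 18 equally likely balls: (number, colour)
balls = [(n, c) for c in ["red", "green", "blue"] for n in range(1, 7)]

def prob(event):
    """Probability of an event = proportion of outcomes satisfying it."""
    return Fraction(sum(1 for b in balls if event(b)), len(balls))

print(prob(lambda b: b[0] == 2))      # P[X = 2]: 3 of 18 balls → 1/6
print(prob(lambda b: b[1] == "red"))  # P[Y = red]: 6 of 18 balls → 1/3
```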
Probability beyond Proportions

Suppose now that not all samples are equally likely, i.e. not all balls
are equally likely to be picked

[Table: each ball now carries its own probability of being picked, e.g.
1/24, 1/24, 1/96, 1/24, 1/6, 1/96 for the balls numbered 1-6 in one row and
1/16, 1/24, 1/48, 1/96, 1/12, 1/96 in another; the sum of probabilities over
all balls is 1]
Rules of Probability

Let Ω denote the sample space (the set of all possible outcomes)
Let X be any (discrete) random variable and let supp(X) denote the set of
(numerical) values X could possibly take (even unlikely values)
The set supp(X) is called the support of the random variable X
In the previous example, supp(X) = {1, 2, 3, 4, 5, 6}
For any outcome ω ∈ Ω, let X(ω) denote the value of X on that outcome
For example, X(ω) = 3 for a ball ω with the number 3 on it (a second r.v. Y encodes colour)
P[X = 3] is the probability with which an outcome ω with X(ω) = 3 happens
Rules of Probability

No matter how we define our random variable, if it is discrete valued,
then the following must hold
For all x ∈ supp(X), we must have P[X = x] ≥ 0
If P[X = x] = 0 then we say x is an impossible value for the random variable X
If P[X = x] = 1 then we say that X almost surely takes the value x
We must have Σ_{x ∈ supp(X)} P[X = x] = 1
Another way of saying that when we get a sample, the random variable must
take some valid value on that sample, it cannot remain undefined!
It is a different thing that we (e.g. the ML algo) may not know what value X has
taken on that sample, but there must be some hidden value it did take
An immediate consequence of the above two rules is that, for all
x ∈ supp(X), we must have P[X = x] ∈ [0, 1]
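The two rules above translate directly into a checker. A minimal sketch; the example PMF (a loaded die) and the function name are hypothetical:

```python
from fractions import Fraction

def is_valid_pmf(pmf):
    """Check the two rules: every P[X = x] >= 0, and the values sum to 1."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

# A hypothetical PMF over supp(X) = {1, ..., 6} (a loaded die)
pmf = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 8),
       4: Fraction(1, 8), 5: Fraction(1, 8), 6: Fraction(1, 8)}
print(is_valid_pmf(pmf))                  # True
print(is_valid_pmf({1: Fraction(1, 2)}))  # False: sums to 1/2, not 1
```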
Probability Mass Function (PMF)

A fancy name for a function that tells us the probability of the random
variable taking any particular value
For a discrete random variable X, its PMF tells us, for any value
x ∈ supp(X), the probability of X taking the value x i.e. P[X = x]

Note that this will always give us non-negative values and that too
in a way so that if P[X = x] is large, we will get that value x more
often than a value x′ for which P[X = x′] is small

That is correct. E.g. in our toy setting (where not all samples are equally
likely), if we sample a ball (recall that Y encodes its colour) then we are almost
twice as likely to get one colour as another, and there is a comparatively much
smaller chance that we would get a third

Warning: papers/books often use lazy or confusing notation – take
care
Often the blackboard letter ℙ is used to denote the PMF i.e. ℙ[X = x]
Sometimes we may write ℙ_X to emphasize that this is the PMF for X and not some other r.v.
Sometimes, P(x) is also used to refer to the PMF of the random variable X
Sampling from a PMF: x ~ ℙ, x ~ ℙ_X, or even x ~ P(x) means that we generated an
outcome ω, e.g. one with X(ω) = 3, according to the probability distribution and are
looking at x = X(ω)
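Drawing samples x ~ ℙ can be sketched with the standard library's weighted sampling. The PMF over colours below is hypothetical; more probable values come up proportionally more often:

```python
import random
from collections import Counter

# A hypothetical PMF over ball colours
pmf = {"red": 0.5, "green": 0.3, "blue": 0.2}

def sample(pmf, k=1):
    """Draw k values distributed according to the PMF."""
    values, weights = zip(*pmf.items())
    return random.choices(values, weights=weights, k=k)

random.seed(0)
draws = Counter(sample(pmf, k=10_000))
print(draws)  # frequencies roughly proportional to 0.5 : 0.3 : 0.2
```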
Joint Probability

P[X = x, Y = y] denotes the proportion of samples for which we have
both X = x and Y = y
Let us look at the uniform case first
[Figure: the grid of equally likely balls numbered 1 to 6]

Notation: P[X = x, Y = y] means the same as P[(X = x) ∧ (Y = y)]

If not all samples are equally likely, then P[X = x, Y = y] is still the total
probability of all samples with X = x and Y = y, e.g. adding the probabilities
of the two matching balls gives 1/24 + 1/16 = 5/48
[Table: the probabilities of the individual balls, e.g.
1/24, 1/24, 1/96, 1/24, 1/6, 1/96 in one row]
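The "add up all matching outcomes" computation above can be sketched directly. The outcome probabilities below are hypothetical but include the slide's 1/24 + 1/16 = 5/48 case: two different balls (in different rows) agree on both number and colour:

```python
from fractions import Fraction

# Hypothetical outcome probabilities: (row, number, colour) -> P[ball].
# Two different balls can agree on number AND colour (they sit in
# different rows), so P[X = x, Y = y] sums over all of them.
p = {(1, 2, "red"):   Fraction(1, 24),
     (3, 2, "red"):   Fraction(1, 16),
     (1, 5, "green"): Fraction(1, 6),
     (2, 1, "blue"):  Fraction(35, 48)}   # probabilities sum to 1

def joint(number, colour):
    """P[X = number, Y = colour] = sum of matching outcome probabilities."""
    return sum((pr for (_, n, c), pr in p.items()
                if n == number and c == colour), Fraction(0))

print(joint(2, "red"))  # 1/24 + 1/16 = 5/48
```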
Joint Distributions on more R.V.s

Suppose we had another r.v., say Z, indicating the row in
which the ball is listed i.e. Z = 1
if the ball is in the first row, Z = 2 for the second row, etc.

[Table: the grid of balls with rows labelled Z = 1, Z = 2, Z = 3, each ball
carrying its own probability e.g. 1/24, 1/24, 1/96, 1/24, 1/6, 1/96 in the
first row]
Marginal Probability

When we had only two RVs (namely X, Y) we looked at how they behave
at the same time (by looking at P[X = x, Y = y]) or how they behaved on their own (by
looking at P[X = x] and P[Y = y])
Now that we have three RVs (X, Y, Z) we can look at how they behave
At the same time, by looking at P[X = x, Y = y, Z = z]
On their own, by looking at P[X = x], P[Y = y], P[Z = z]
Two at a time, by looking at P[X = x, Y = y], P[Y = y, Z = z], P[X = x, Z = z]
These distributions are called marginal probability distributions and
they look at how a subset of the RVs behaves
Marginal distributions are also proper distributions (the same proof works)
and hence they also have PMFs associated with them
Obtaining Marginal PMF from Joint PMF

If we have the joint PMF for a set of RVs, say X, Y, Z, then obtaining the
marginal PMF for any subset of these RVs is very simple
Involves a process called marginalization: uses the proof we saw earlier
Suppose we are interested in P[X = x, Y = y] i.e. we don’t care about Z
In this case we say that Z has been marginalized out
The earlier argument can be reused to show that for any x, y
P[X = x, Y = y] = Σ_{z ∈ supp(Z)} P[X = x, Y = y, Z = z]
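Marginalizing Z out is just a grouped sum over the joint PMF. A minimal sketch with a hypothetical joint PMF over (x, y, z) triples:

```python
from fractions import Fraction

# A hypothetical joint PMF over (x, y, z) triples
joint = {(1, "red", 1):  Fraction(1, 4),
         (1, "red", 2):  Fraction(1, 8),
         (2, "blue", 1): Fraction(1, 2),
         (2, "red", 2):  Fraction(1, 8)}

def marginalize_out_z(joint):
    """P[X = x, Y = y] = sum over z of P[X = x, Y = y, Z = z]."""
    marg = {}
    for (x, y, z), pr in joint.items():
        marg[(x, y)] = marg.get((x, y), Fraction(0)) + pr
    return marg

print(marginalize_out_z(joint))
# (1, 'red') gets 1/4 + 1/8 = 3/8; the marginal still sums to 1
```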
Conditional Probability

Perhaps one of the most important concepts w.r.t. ML applications
Let us look at the uniform case first
Notice: if we focus only on balls with the number 2 written on them,
most (3/4) of those balls are red i.e. P[Y = red | X = 2] = 3/4
[Figure: the grid of equally likely balls numbered 1 to 6]
Conditional Probability

[Table: the grid of balls with their individual probabilities, e.g.
1/24, 1/24, 1/96, 1/24, 1/6, 1/96 in the first row; conditional
probabilities are computed from these when balls are not equally likely]
A PMF for the Conditional Distribution?

Conditional probability values also form a distribution
This implies Σ_{y} P[Y = y | X = x] = 1
Thus, we can readily define a PMF for conditional distributions as well:
a function that takes in two values x, y and gives
P[Y = y | X = x] = P[X = x, Y = y] / P[X = x]
We can similarly define P[X = x | Y = y] as well
To sample from the conditional distribution given X = x, we consider the set of
only those outcomes in the sample space where X = x, then sample an outcome
from this set, with more probable outcomes generated more often
Notation used: ℙ[Y | X], ℙ_{Y|X}
May ask for a sample y ~ ℙ[Y | X = x] too!
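The conditional PMF formula P[Y = y | X = x] = P[X = x, Y = y] / P[X = x] is a two-line computation over a joint PMF. A minimal sketch with hypothetical numbers (chosen so that P[Y = red | X = 2] = 3/4, echoing the uniform-case slide):

```python
from fractions import Fraction

# A hypothetical joint PMF over (x, y) pairs
joint = {(2, "red"):  Fraction(3, 8),
         (2, "blue"): Fraction(1, 8),
         (5, "red"):  Fraction(1, 4),
         (5, "blue"): Fraction(1, 4)}

def conditional(joint, x):
    """PMF of Y given X = x: P[Y = y | X = x] = P[X = x, Y = y] / P[X = x]."""
    p_x = sum(pr for (xx, _), pr in joint.items() if xx == x)
    return {y: pr / p_x for (xx, y), pr in joint.items() if xx == x}

print(conditional(joint, 2))  # red: 3/4, blue: 1/4 — sums to 1
```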
Marginal Conditional Probability????

The operations of marginalization and conditioning can be used to
define lots of different kinds of PMFs
For example consider P[X = x, Y = y | Z = z]
If we marginalize Y out of the PMF for P[X, Y | Z], we will get the PMF for P[X | Z]
i.e. P[X = x | Z = z] = Σ_y P[X = x, Y = y | Z = z]
Can prove the above result using the same marginalization argument
Note that this means that this PMF can be derived from the PMF for
the joint distribution i.e. P[X = x, Y = y | Z = z] = P[X = x, Y = y, Z = z] / P[Z = z]
Marginal Conditional Probability????

The operations of marginalization and conditioning can be used to
define lots of different kinds of PMFs
For example consider P[X = x | Y = y, Z = z]
We can show that this is nothing but P[X = x, Y = y, Z = z] / P[Y = y, Z = z]
Try proving this result using the marginalization and conditioning rules
Yet again, this means that this PMF can be derived from the PMF for
the joint distribution
Rules of Probability

Sum Rule (Marginalization Rule) – aka Law of Total Probability
P[X = x] = Σ_y P[X = x, Y = y]
We may use the fact that P[X = x, Y = y] = P[X = x | Y = y] · P[Y = y]
and also that P[Y = y] = Σ_x P[X = x, Y = y] etc to show that we also have
P[X = x] = Σ_y P[X = x | Y = y] · P[Y = y]
or more explicitly, P[X = x] = Σ_{y ∈ supp(Y)} P[X = x | Y = y] · P[Y = y]
Product Rule (Conditioning Rule)
P[X = x | Y = y] = P[X = x, Y = y] / P[Y = y]
Combine to get P[X = x, Y = y] = P[X = x | Y = y] · P[Y = y]
or more explicitly, P[X = x, Y = y] = P[Y = y | X = x] · P[X = x] = P[X = x | Y = y] · P[Y = y]
Chain rule (Iterated Conditioning Rule)
P[X¹ = x¹, …, Xⁿ = xⁿ] = P[X¹ = x¹] · P[X² = x² | X¹ = x¹] ⋯ P[Xⁿ = xⁿ | X¹ = x¹, …, Xⁿ⁻¹ = xⁿ⁻¹]
Bayes Theorem

The foundation of Bayesian Machine Learning
P[X = x, Y = y] = P[X = x | Y = y] · P[Y = y]
and P[Y = y, X = x] = P[Y = y | X = x] · P[X = x]
However, P[X = x, Y = y] and P[Y = y, X = x] are the same thing
Thus, P[X = x | Y = y] · P[Y = y] = P[Y = y | X = x] · P[X = x]
This gives us P[X = x | Y = y] = P[Y = y | X = x] · P[X = x] / P[Y = y]
Similarly, P[Y = y | X = x] = P[X = x | Y = y] · P[Y = y] / P[X = x]
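Bayes theorem can be exercised on the fraud example from the first slide. All the numbers below are hypothetical (1% fraud rate, 70% of frauds exceed ₹10,000, 5% of legitimate transactions do); the sum rule supplies the denominator P[Y = y]:

```python
from fractions import Fraction

def bayes(p_y_given_x, p_x, p_y):
    """P[X = x | Y = y] = P[Y = y | X = x] * P[X = x] / P[Y = y]."""
    return p_y_given_x * p_x / p_y

# Hypothetical numbers: 1% of transactions are fraud (X = fraud);
# 70% of frauds are large (Y = large), but only 5% of legitimate ones are.
p_fraud = Fraction(1, 100)
p_large_given_fraud = Fraction(7, 10)
# Sum rule: P[large] = P[large|fraud]P[fraud] + P[large|ok]P[ok]
p_large = p_large_given_fraud * p_fraud + Fraction(5, 100) * (1 - p_fraud)

print(bayes(p_large_given_fraud, p_fraud, p_large))  # P[fraud | large]
```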
Marginal, Joint and Conditional Probability

In most settings, we would have defined tons of random variables on
our outcomes to capture interesting things about the outcomes e.g. one r.v. each for:
what is the gender of the person visiting our website
what is the age of the person
how many seconds did they spend on our website
what ad were they shown
what purchase did they make
ML algos like to ask and answer interesting questions about these
random variables
Marginal, Joint and Conditional Probability give us the language to
speak when asking and answering these questions
Using Probability to do ML

Arguably the most interesting random variable of these is the purchase r.v.
Recommendation Systems (RecSys) would like to know what value it
would take if we know the values of the other r.v.s (gender, age, time spent, ad shown)
Of these, the website cannot control the user’s gender, age or time spent,
but it does control which ad is shown
The whole enterprise of recommendation and ad placement can be
summarized in the following statement:
“Given the values of the r.v.s we cannot control, find a value of the ad r.v.
such that the probability of a purchase is as close to 1 as possible”
ML algos for RecSys can learn distributions (the models for these
ML algos are distributions) such that they mimic reality i.e. if the
model says a purchase is very likely, the user really does buy something
Creating Events from Random Variables??

In the RecSys example, we saw an expression of the form P[S = 1 | …]
Here S = 1 is an event (that the user purchased something) and we are
interested in the probability of this event given the values of the other r.v.s
Recall: events are merely a description of interesting facts about an outcome
We can also have events like X < x (that the random variable takes a
value less than x). Similarly, X ≤ x, X ≥ x etc. are also valid events
Similarly (X = 1 ∧ Y = green) is also an event (that the ball we picked is green and has the
number 1 stamped on it)
X = 1 is also an event (that the ball has the number 1 stamped on it)
Y = green is also an event (that the ball is green coloured)
Given an event, it may happen on certain outcomes, not happen on others e.g.
if the event is Y = green then this event will not have taken place if we pick a blue ball
Thus, given any collection of random variables, we can create events
out of them and ask interesting questions about the r.v.s
Event Calculus

The term calculus, in general, means a system of rules – it does not
necessarily mean differentiation or integration

Let A, B be events (possibly defined using same/different r.v.s etc)
¬A is also an event
Called the negation or complement of A (A did not happen)
A ∨ B is also an event
Called the union of the two events (either A happened or B happened or both happened)
A ∧ B is also an event
Called the intersection of the two events (both A and B happened)
De Morgan’s Laws: for any two events A, B, we must have
¬(A ∨ B) = ¬A ∧ ¬B as well as ¬(A ∧ B) = ¬A ∨ ¬B
¬(A ∨ B) tells us that saying “It is not the case that either A happened or B
happened or both happened” is just a funny way of saying “A did not happen
and B did not happen i.e. neither happened”
¬(A ∧ B) tells us that saying “It is not the case that both A and B happened”
is just another way of saying “Either A did not happen or B did not happen
or both did not happen”
This means we can be more creative in defining events,
e.g. by combining several events using negations, unions and intersections
Event Calculus

Example: let A = (X = 2), B = (Y = red) be events
¬A is also an event (the number on the ball is something other than 2)
¬B is also an event (the colour of the ball is something other than red)
A ∨ B is also an event (either I have a red ball or a ball with the number 2 written
on it or else a red ball with the number 2 written on it)
A ∧ B is also an event (I have a red ball and the number on it is 2)
De Morgan’s Laws: for any two events A, B, we must have
¬(A ∨ B) = ¬A ∧ ¬B as well as ¬(A ∧ B) = ¬A ∨ ¬B
Caution: De Morgan’s Laws always hold no matter how we defined the
events. They do not require events to be defined on separate r.v.s etc.
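Since De Morgan's Laws hold no matter how the events are defined, they can be checked exhaustively over every happened/did-not-happen combination for A and B. A minimal sketch:

```python
from itertools import product

# Verify De Morgan's laws over every truth assignment for events A, B:
#   not(A or B)  == (not A) and (not B)
#   not(A and B) == (not A) or  (not B)
for a, b in product([False, True], repeat=2):
    assert (not (a or b)) == ((not a) and (not b))
    assert (not (a and b)) == ((not a) or (not b))
print("De Morgan's laws hold for all four cases")
```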
Rules of Probability for Complex Events

Complement rule: P[¬A] = 1 − P[A]
Union rule: P[A ∨ B] = P[A] + P[B] − P[A ∧ B]
We can derive an “intersection rule” using De Morgan’s laws and these rules:
P[A ∧ B] = 1 − P[¬(A ∧ B)] = 1 − P[¬A ∨ ¬B] = 1 − P[¬A] − P[¬B] + P[¬A ∧ ¬B]
Independence of Random Variables

Two r.v.s X, Y are said to be independent if, for all x and y, we have
P[X = x, Y = y] = P[X = x] · P[Y = y]

Independence means that X and Y will both take values
according to their own (marginal) PMFs,
happily unmindful of the values the other r.v. is taking!

We can show that if two r.v.s are independent, then the value one r.v. takes
does not influence the other r.v. to take some value more preferentially than
others in any way: we have P[Y = y | X = x] = P[Y = y] even if we condition on X = x
Proof: P[Y = y | X = x] = P[X = x, Y = y] / P[X = x]
= (P[X = x] · P[Y = y]) / P[X = x] = P[Y = y]

Can you show that if P[Y = y | X = x] = P[Y = y] (i.e. for all x, y) then we must
always have P[X = x, Y = y] = P[X = x] · P[Y = y] (i.e. for all x, y) as well?
Hint: use the definition of independence

On the ball example, we can verify whether P[X = x, Y = y] = P[X = x] · P[Y = y]
holds for all x, y
[Figure: grids of balls used to check whether the joint probabilities factorize]

Independence of r.v.s makes life extremely simple in ML algorithms
but is very precious – not always available
Notation: If X, Y are independent, then we often write X ⊥ Y
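The factorization test for independence can be sketched by computing both marginals from a joint PMF and comparing products against the joint. The two example joints are hypothetical: the first factorizes, the second does not:

```python
from fractions import Fraction

def is_independent(joint):
    """X, Y independent iff P[X=x, Y=y] = P[X=x] * P[Y=y] for all x, y."""
    px, py = {}, {}
    for (x, y), pr in joint.items():
        px[x] = px.get(x, Fraction(0)) + pr
        py[y] = py.get(y, Fraction(0)) + pr
    # Check every (x, y) pair, treating missing entries as probability 0
    return all(joint.get((x, y), Fraction(0)) == px[x] * py[y]
               for x in px for y in py)

indep = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
         (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4)}
dep   = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}
print(is_independent(indep))  # True
print(is_independent(dep))    # False
```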
Conditional Independence

If X, Y, Z are three r.v.s such that for all x, y, z we have
P[X = x, Y = y | Z = z] = P[X = x | Z = z] · P[Y = y | Z = z]
then we say that X and Y are conditionally independent given Z
Notation: X ⊥ Y | Z

The r.v.s number and colour are not independent here. However, if we
condition on a new random variable Z that distinguishes the first two rows
from the last two rows, then number and colour become independent given Z.
(Two r.v.s may also fail to be independent even if conditioned on a third
random variable.)

Even if X, Y are not independent, it is still possible that there may exist a
third random variable Z such that conditioning on Z makes X, Y
conditionally independent i.e. X ⊥ Y | Z

[Table: the grid of balls with probabilities, split into slices Z = 1 and Z = 2
within which number and colour are independent]
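Conditional independence can be checked slice by slice: condition on a value of Z, then run the same factorization test within that slice. A minimal sketch with a hypothetical three-variable joint PMF whose Z-slices both factorize:

```python
from fractions import Fraction

def cond_independent(joint3, z):
    """Check P[X=x, Y=y | Z=z] == P[X=x | Z=z] * P[Y=y | Z=z] for this z."""
    pz = sum(pr for (_, _, zz), pr in joint3.items() if zz == z)
    pxy, px, py = {}, {}, {}
    for (x, y, zz), pr in joint3.items():
        if zz != z:
            continue                      # keep only the Z = z slice
        pxy[(x, y)] = pxy.get((x, y), Fraction(0)) + pr / pz
        px[x] = px.get(x, Fraction(0)) + pr / pz
        py[y] = py.get(y, Fraction(0)) + pr / pz
    return all(pxy.get((x, y), Fraction(0)) == px[x] * py[y]
               for x in px for y in py)

# Hypothetical joint where X and Y are independent within each Z-slice
joint3 = {(0, 0, 1): Fraction(1, 8), (0, 1, 1): Fraction(1, 8),
          (1, 0, 1): Fraction(1, 8), (1, 1, 1): Fraction(1, 8),
          (0, 0, 2): Fraction(1, 2)}
print(cond_independent(joint3, 1))  # True
print(cond_independent(joint3, 2))  # True (trivially: a single outcome)
```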