
INDEX

S. No Topic Page No.


Week 1
1 Introduction to Probability - A box of chocolates 1
2 Introduction to Probability - Axiomatic Approach to Probability Theory 9
3 Introduction to Probability - Verifying Matrix Multiplication (Statement, Algorithm & Independence) 23
4 Introduction to Probability - Verifying Matrix Multiplication (Correctness & Law of Total Probability) 37
5 Introduction to Probability - How Strong is your Network? 55
6 Introduction to Probability - How to Understand the World? Play with it! 75
7 Tutorial 1 87
8 Tutorial 2 89
Week 2
9 Discrete Random Variables - Basic Definitions 91
10 Discrete Random Variables - Linearity of Expectation & Jensen's Inequality 105
11 Discrete Random Variables - Conditional Expectation I 118
12 Discrete Random Variables - Conditional Expectation II 129
13 Discrete Random Variables - Geometric Random Variables & Collecting Coupons 145
14 Discrete Random Variables - Randomized Selection 156
Week 3
15 Tail Bounds I - Markov's Inequality 166
16 Tail Bounds I - The Second Moment, Variance & Chebyshev's Inequality 185
17 Tail Bounds I - Median via Sampling 200
18 Tail Bounds I - Median via Sampling - Analysis 213
19 Tail Bounds I - Moment Generating Functions and Chernoff Bounds 235
Week 4
20 Tail Bounds I - Parameter Estimation 257
21 Tail Bounds I - Control Group Selection 267
22 Applications of Tail Bounds - Routing in Sparse Networks 277
23 Applications of Tail Bounds - Analysis of Valiant's Routing 313
24 Applications of Tail Bounds - Random Graphs 313
Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module - 01
Introduction to Probability
Lecture - 01
Segment 1: A box of chocolates

So, we are now going to start with Module 1. I am going to start with the first segment, which I am going to title "A box of chocolates"; we will see what the box of chocolates is about shortly.

(Refer Slide Time: 00:33)

So, let us see the plan for the segment. We are going to talk about a crooked confectioner whose goal is to sell as many chocolates as he can.

And your goal is to basically just get the chocolate that you want, not buy all the chocolates that the confectioner is trying to sell you. The problem, as we will see, is that it is hard to fight against this crooked confectioner, unless you use a random strategy, ok.

And that is what we are going to see. It is going to be a very simple setting, but I want it to illustrate the power of randomization, and also to point out the need for a formal understanding of randomization and probability theory in the context of computer science, ok.

That is the very goal of this course, and today's segment is just to illustrate the importance of this area.

So, without further ado, let us see what is in the box. We have 100 chocolates: 50 with nuts and 50 without nuts. That is the situation, ok.

(Refer Slide Time: 01:38)

The thing is, you only want 1 chocolate; 1 chocolate is all you want because you are health conscious, but you want a chocolate with nuts in it, ok. You do not want a chocolate without nuts in it, ok. But the confectioner, knowing the situation, has wrapped up all the chocolates, so you cannot see what is inside each one.

So, what you are supposed to do is come up with a strategy to get the first nutty chocolate, and that is it, ok. The confectioner's goal, on the other hand, is to try and sell you as many chocolates as possible, because the moment you unwrap a chocolate you have bought it. So, all right, that is the situation. Just in case you are getting a little worried whether this whole course is going to be about chocolates,

let us actually see a bit more of a computer science application with the exact same scenario, ok. Say you have a data center with 100 servers, and now a new job arrives; it could be a web request, it could be any request for running a program. There are some 50 servers that have the capacity to run your program; the problem is, you do not know which ones, ok.

Each attempt to place your program on a particular server is going to cost you some money. So, if you repeatedly keep trying servers that do not have enough capacity, you are spending a lot of effort, ok. It is therefore in your interest to find a server that has the capacity to run your program, ok. If you think about it, this is exactly the box of chocolates problem, rephrased in a language that is more amenable to computer scientists.

So, what do you do? Here is what you would typically do if you come from an undergraduate algorithms course. (Some algorithms courses do cover newer topics, and that is great, but a typical undergraduate algorithms course would have taught you deterministic algorithms.) You would pick chocolates in some arbitrary sequence, ok, unwrap them one by one, check whether each is nutty or not, and keep repeating until you find the first nutty chocolate.

What is the problem with this? Well, the worst case is very bad. You can check 50 different chocolates and fail every time; you are only guaranteed to succeed on the 51st pick, right. This is what your entire algorithms course trains you to do: come up with algorithms that give good worst-case guarantees, and this is the best worst-case guarantee you can have here, ok.

(Refer Slide Time: 04:45)

Can we do any better at all if we insist on deterministic algorithms? Of course not; there is no way around it, and the argument is very simple, right.

All you have to do is remember that this is a 2-player game between you and the confectioner. All the confectioner has to do is figure out the sequence in which you pick the chocolates, and place all the non-nutty chocolates first in that sequence. You are then forced to pick 50 non-nutty chocolates

(Refer Slide Time: 05:19)

before you get the first nutty chocolate. This is where the problem lies.

So, the obvious question is: can we foil the scheming confectioner? How do we do that? Well, this is where we need to approach the problem in a slightly different way, ok. Whatever you learned in the undergraduate algorithms course, we need not unlearn, but we need to add some tools to it. In particular, we want to try something beyond deterministic algorithms: we want a randomized algorithm, ok.

So, here is the randomized algorithm: while a nutty chocolate has not been found, buy a random chocolate and unwrap it.

That is the simple algorithm; is it any better? That is the question. By now any reasonable person would have a vague sense that it is better, but let us see.
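The strategy just described can be sketched as a short simulation in Python (an illustrative sketch, not from the lecture; the list representation of the box and the function name are my own):

```python
import random

def random_pick_strategy(box):
    """Buy uniformly random chocolates until a nutty one is found.

    `box` is a list of booleans (True = nutty); returns the number of
    chocolates bought.  This is a simulation sketch of the randomized
    strategy, not code from the lecture.
    """
    remaining = list(box)
    bought = 0
    while True:
        i = random.randrange(len(remaining))  # pick a wrapped chocolate at random
        bought += 1
        if remaining.pop(i):                  # unwrap it: is it nutty?
            return bought

# 50 non-nutty chocolates placed first: the adversary's worst-case order,
# which is irrelevant to the random strategy.
box = [False] * 50 + [True] * 50
trials = [random_pick_strategy(box) for _ in range(10_000)]
print(sum(trials) / len(trials))  # empirical average; the true expectation is 101/51 ≈ 1.98
```

Note that the adversarial ordering no longer matters: the number of chocolates bought depends only on the coin flips, not on how the confectioner arranged the box.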

(Refer Slide Time: 06:14)

So, let us try to understand this issue. First we will ask a wrong question, because sometimes
to appreciate something you first have to ask the wrong question and then slowly work your
way to asking the right question. Can this randomized algorithm possibly take 51 iterations?
If you are talking about possibility yes it can actually take 51 iterations, but this is missing the
point ok.

So, let us try to rephrase the question: how likely is it that this algorithm will take 51 iterations? Some mathematical jargon could be thrown at this, but given that this is the first segment of the first module, let us just go with something we know, ok: we know it is very unlikely. That is all we know for now; we do not know how unlikely, and that is something we need to figure out, right.

(Refer Slide Time: 07:08)

So, we have kind of reached a situation where we think we have the right solution, given the level of information we have. So far we have come up with an algorithm that, we know, will do the bad thing, basically run for too long, only with very low likelihood; that much alone we know, ok. But we are left with an intellectual dissatisfaction, right, because we do not know what this likelihood is. When we say "very unlikely", we do not know what that means, right.

So, the important question is: can it be quantified, can we put a number to this? And if we have 2 possibilities, one somewhat unlikely and the other very unlikely, we want to be able to know how different they are; we want to be able to compare and quantify them.

And the other thing is, we also want to understand what is so special about randomization: would deterministic sequential algorithms also have the same benefits, and if not, why not? That is another philosophical question that might rankle you. For now, what we have understood so far is that there are ways to overcome these sorts of pesky situations.

But we need the right tools. In particular, what we are of course heading towards are the tools of probabilistic analysis, to formally and quantifiably address this vague notion of likelihood. Everybody has this notion of likelihood; the gamblers in the 16th century had the notion of likelihood, right.

But it took about 400 years to formalize this, and a lot of probability courses will just jump to the formalisms. I want you to take some time to realize that it took about 400 years; the first attempt at probability theory was in the 16th century, at least as far as we know. If enough material from Indian history were available, we might even be able to find out that India invented probability theory, but we do not have that evidence.

A lot of things were invented here which we do not know about, right, but we do have clear evidence that there were attempts in the 16th century. What we are going to talk about in the next segment is how Kolmogorov formalized this notion and helped us get past the vagueness. He gave a very concrete way to approach this vague notion of likelihood, ok.

(Refer Slide Time: 09:58)

So, that is going to be the topic for our next lecture. I have mentioned pretty much all the concluding remarks already, so I will quickly recap them: we need to move away from pure worst-case analysis, we need to add more realistic ways to approach algorithm design, and we need to come up with formal ways to understand and appreciate it, so that someone can trust our claims.

So, it has to be formally proved that the desired outcome is sufficiently likely; we have to do this in a very rigorous manner, and we need the right mathematical ideas. That is going to be the topic of this course. And of course, as I mentioned, the fun fact is that what we are going to study is something invented by gamblers.

(Refer Slide Time: 10:45)

So, the next segment will be on the axiomatic approach to probability theory; this is the formal way to approach probability theory, and it will be covered in the next segment.

Thank you.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module - 01
Introduction to Probability
Lecture - 02
Segment 2: Axiomatic Approach to Probability Theory

So, we are now going to talk about a more formal, or what is called axiomatic, approach to Probability Theory. So, what is the plan? You may recall that we have already established a need for formalism and for a quantifiable way to understand nature.
(Refer Slide Time: 00:37)

If you will recall, one of the examples that we talked about in the last segment was getting a nutty chocolate, ok.

What we want is a precise mathematical basis, and we are going to provide just that. I want to impress again, repeating myself a little, that it took about 300 to 400 years to get to this level of formalism. It is easy to assume that this level of formalism and clarity comes for free, because someone has given it to you, but if you are interested in creating theory and new ideas that are going to last that long, it is important for you to learn how these things come about, ok?

So, take some time to appreciate the clarity with which people have formalized this notion.
Basically, probability theory is used to understand some natural phenomenon, and we want to capture that by a notion called an experiment. Then, on top of the notion of an experiment, we are going to lay down a notion called the probability space. On this basis we will conduct the rest of our understanding of probability theory in this course, and pretty much all of modern probability theory rests on it as well. You will see some examples in today's segment.

(Refer Slide Time: 02:01)

So, let us start with this basic notion called an experiment. This is the very basis of probability theory; it is often not explicitly mentioned, but you have to realize that it is the very basis. It also goes by the term trial. What is an experiment? It is an activity, typically a natural phenomenon or natural activity, but sometimes it can also be performed artificially. For example, a computer can perform something in a random fashion, and that can also be treated as an experiment, ok.

It is something that can be repeated as many times as needed, at least mentally speaking, and it has a well-defined set of outcomes. So, the experiment should have a clear set of outcomes. Each time the activity is repeated, you can think of it as nature choosing one of the outcomes through some means; we are not going to get into how nature behaves in specific terms.

We are just going to assume that nature picks one outcome, and the set of all possible outcomes is called the sample space, ok. So, given an experiment, you will have a non-empty sample space. If the sample space has exactly one element, then the activity is a deterministic activity, but if the sample space has more than one element, then it is a randomized or probabilistic experiment, ok.

(Refer Slide Time: 03:44)

So, let us look at some examples; this is going to be a repetition of ideas that you have already seen before. A coin toss is probably the most fundamental experiment that you can think of, and of course its sample space is {heads, tails}; these are the 2 elements in the sample space. Throwing a die is another experiment, with sample space ranging from 1 through 6, ok.

These are all things that you have seen. Connecting with the example that we discussed in the last segment, the sample space there had one of 2 things: either a nutty chocolate or a non-nutty chocolate. Here is a more complex experiment, ok.

(Refer Slide Time: 04:22)

So, we are just going to put together a few experiments and create a new experiment; you can do that.

What we are going to do is toss a coin first, ok. If the outcome is heads we will do something; otherwise we will do something else. If the outcome is heads, then we will roll a die. If the die roll comes up even, we repeatedly pick a card from a well-shuffled deck of cards until we get a red card; if the die roll comes up odd, we repeatedly pick a card from the well-shuffled deck until we get a black card. If, on the other hand, the original coin toss was tails, well, you just pick a card from a well-shuffled deck, and that is your outcome.

If you did not follow all the details, do not worry, because the point was not for you to follow all the details; the point was to illustrate that you can put together smaller experiments in some interesting way and create larger experiments. What is the outcome of this complex experiment on the slide? You have the 52 cards as your sample space.

At the end you are drawing a card from this sample space, right. All the inner workings are what nature does, if you will, ok. What matters is that the sample space of this experiment is the 52 cards, and you are going to draw one from that, ok. And importantly, I hope you see here the makings of an algorithm. The way we have expressed this experiment should remind you of pseudocode from an algorithms course, and that is exactly the point.
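The pseudocode-like reading can be made literal; here is an illustrative Python sketch (the function name and deck representation are my own; the branch structure follows the description above, with repeated picks treated as draws from a freshly shuffled deck):

```python
import random

def composite_experiment():
    """One run of the composite experiment; returns the card drawn."""
    deck = [(rank, suit) for rank in range(1, 14)
            for suit in ("hearts", "diamonds", "clubs", "spades")]
    if random.random() < 0.5:               # coin toss: heads
        die = random.randint(1, 6)          # roll a die
        want_red = (die % 2 == 0)           # even -> stop at a red card, odd -> black
        while True:
            card = random.choice(deck)      # pick from a well-shuffled deck
            is_red = card[1] in ("hearts", "diamonds")
            if is_red == want_red:
                return card
    else:                                   # tails: pick a single card
        return random.choice(deck)

print(composite_experiment())
```

Whatever happens inside, the value returned is always one of the 52 cards: the inner coin tosses, die rolls, and repeated draws are the "inner workings", and the sample space of the overall experiment is just the deck.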

(Refer Slide Time: 06:09)

So, any algorithm that we design with the capability of tossing coins is essentially an experiment. The algorithms that we are going to talk about have at their disposal coins that they can toss, not for buying anything, but just for tossing, and based on the outcomes of those coin tosses they can make decisions, right. So, any randomized algorithm is basically an experiment, ok.

And if it is an experiment, you should be able to define some outcomes. Without looking at specific algorithms yet, let us look at some interesting ways you can come up with outcomes. The most basic way of looking at outcomes is whether your algorithm has arrived at the correct output or the wrong output. This algorithm is tossing coins along the way, making lots of decisions, going in some random direction; what if it leads to a wrong output? That is a serious concern that you should have, ok.

If that is your concern, the outcomes you care about will be correct output versus incorrect output, ok. Now let us assume you have established that your algorithm is correct, or at least mostly correct. Then you might be interested in the running time, ok.

Say, at a high level, you just want the algorithm to execute in reasonable time; then you can define the outcomes to be either efficient or inefficient, ok.

13
And if you want to be even more nuanced about it, you can look for some fine-grained outcomes, ok. An example context that we will be seeing in this course: after you run an algorithm, you want to ask whether the running time has been small, as in within O(n log n), or not. You then want to argue, and here I am going to use terms that I have not yet defined, that the running time is O(n log n) most of the time; the technical term is "with high probability", which I abbreviate as whp.

So, for any algorithm that you run with the ability to toss coins, you have to think about the outcomes and reason about them. You see how these outcomes naturally fit with our understanding of algorithms: if you had a basic course in algorithms, all the questions that you talked about there can be phrased in this randomized or probabilistic way, ok. That is the important insight that you should have now.

We defined an experiment.

(Refer Slide Time: 08:53)

It is basically something that has outcomes; we now want to provide some quantification of these outcomes, ok. That is where the probability space comes in. You start with a sample space.

(Refer Slide Time: 09:05)

Of course, if there is a sample space, that means there is an experiment from which the sample space is coming, ok. The outcomes listed in the sample space are typically called simple or elementary events, but you can also define some more interesting events: an allowable event, or just an event, is basically any subset of this sample space. Let me give you a quick example before we move on.

Say you roll a die; the sample space is {1, 2, 3, 4, 5, 6}, but you can also define an event such as "an even number", which is {2, 4, 6}; that is a subset of the sample space, and that is an event. Odd, even, any subset of 1 through 6 is going to be an allowable event. The set of all allowable events we call F. We are going to stop at this level for now: basically, you have a sample space and a set of subsets of the sample space, and that set is what we care about and call F.

If you are interested in exploring a little more, I would ask you to Google these 2 terms: sigma algebra and measurable space, ok. You will learn a little more about the formalism that goes behind this. By the way, I am going to be posting these slides as we go along, just to let you know, so you may use them as a reminder to Google these 2 terms and get a more formal understanding of what we mean by these things, ok.

(Refer Slide Time: 10:48)

Now we can quantify them. We define what is called a probability function: basically, it is a function that takes as input one of the events, that is, one of the subsets in F, and maps it to a real value. This is the quantification part.

So, any event that you can think of has a real value associated with it, and that real value tells you how likely nature is to choose that particular event. And now we come to the axioms. There are 3 axioms that this probability function has to obey, and together they define a probability space. The first one is basically a non-negativity requirement.

That is, the probability we talk about is at least 0. The second axiom: the entire sample space is itself an event, if you think about it, and the probability of that event must be 1. The third axiom concerns any set of pairwise disjoint events. Remember, events are sets, so you can talk about pairwise disjoint sets. Take pairwise disjoint events and consider the new event obtained by taking the union of all of them, the events E1, E2, and so on.

The probability of that union has to be equal to the sum of the individual probabilities. These 3 are the basic axioms; they are very simple and very intuitive, and they help capture the notion of probability. The entire field of probability theory rests on these axioms. So, as simple as they are, they are also very important; let us be very clear about that, ok.
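As a quick sanity check (mine, not from the lecture), the three axioms can be verified numerically for the uniform probability function on a fair die, where every event E ⊆ {1, ..., 6} gets Pr(E) = |E|/6:

```python
from itertools import chain, combinations
from fractions import Fraction

sample_space = frozenset(range(1, 7))   # a fair die

def pr(event):
    """Uniform probability function: Pr(E) = |E| / |sample space|."""
    return Fraction(len(event), len(sample_space))

# F = all subsets of the sample space (the full collection of allowable events)
events = [frozenset(s) for s in chain.from_iterable(
    combinations(sample_space, k) for k in range(7))]

# Axiom 1: every event has probability >= 0
assert all(pr(e) >= 0 for e in events)
# Axiom 2: the entire sample space has probability 1
assert pr(sample_space) == 1
# Axiom 3 (for a pair): disjoint events, Pr(union) = sum of probabilities
odd, even = frozenset({1, 3, 5}), frozenset({2, 4, 6})
assert pr(odd | even) == pr(odd) + pr(even)
print("all three axioms hold for the uniform die")
```

Of course, the axioms hold for many probability functions other than the uniform one; this is just the simplest instance to check.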

(Refer Slide Time: 12:50)

Now, let us look at some examples. Let us go back to the experiment of tossing a coin. To understand the notion of a probability space, we are going to impose a quantification, and we are going to do that implicitly by calling it a fair coin.

When we call it a fair coin, what we are doing is assigning, to each of these outcomes, the value half. Because it is fair, they both have the same value, which means they are both equally likely. This is of course an experiment that I am sure you know very well, but I just want to take these basic concepts that you probably already know and make sure they are set well within the theory of probability. Now you can go to the next level: what if you toss the coin twice?

(Refer Slide Time: 13:45)

So, what is the sample space? At least, what is the cardinality of the sample space when you toss the coin twice? It is 4, ok. And one more question for you: can you think of some meaningful events that are not elementary?

For example, in this experiment of tossing a coin twice, you can think of a non-elementary event: both coin toss outcomes have to be the same. That would be the event corresponding to either HH or TT, both heads or both tails, ok.
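Sketched in Python (illustrative; the sample space is enumerated explicitly, and the 1/2 below follows from the fair-coin assumption):

```python
from itertools import product
from fractions import Fraction

# Sample space of two coin tosses: HH, HT, TH, TT
sample_space = list(product("HT", repeat=2))
assert len(sample_space) == 4

# Non-elementary event: both tosses come out the same (HH or TT)
same = [w for w in sample_space if w[0] == w[1]]
pr_same = Fraction(len(same), len(sample_space))
print(pr_same)  # 1/2, since the 4 outcomes are equally likely for a fair coin
```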

(Refer Slide Time: 14:26)

So, in rolling 2 dice, what is the cardinality of the sample space? It is 36. Now we can think of some interesting events. This is the sample space, but events you can define based on your requirement, right. Let us think of a few events: E1 is the event that both are odd; E2 is the event that both are even; E3 is, let us say, the event that the sum is even. That is, by the way, the same as the event E1 ∪ E2, because if the sum has to be even, either both have to be odd or both have to be even. E4 is the event that the product is even, ok. The only way you cannot have the product be even is if both are odd, so basically E4 is the sample space minus E1.

(Refer Slide Time: 15:24)

What we are going to do now is argue that the probability of E1 and the probability of E2 are both 1/4. Intuitively, it should already make sense to you if you think about it, but we will work through it carefully, and you can similarly work out E3 and E4.

(Refer Slide Time: 15:40)

So, here is how we are going to work this out: we have 2 dice, die 1 and die 2, that are going to be rolled, right. So, there are a total of 36 outcomes.

(Refer Slide Time: 15:52)

Let us look at the event that the first die is odd. On the slide, the outcomes which fall into this event, that the first die is odd, are marked in green.

Now, there is another event: that the second die comes out odd, ok. I am showing that with these stars; you have stars basically corresponding to those rows.

(Refer Slide Time: 16:32)

Now we are ready to understand what the probability of E1 is. E1 is the event that both dice are odd, ok. This means you only consider those outcomes which are both green and have a star in them, ok, and those are what I have marked here in purple.

Implicit in this is the assumption that all the outcomes are equally likely. Now you can use the third axiom to just add up the probabilities of the individual outcomes that are marked purple, and you will get 9 over 36, which works out to 1/4. This should make perfect intuitive sense, but it is good to see it in the context of the well-defined notion of a probability space.
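The counting argument can be checked by brute force; here is a small illustrative Python sketch (the event encodings as predicates are my own):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes
assert len(outcomes) == 36

def pr(event):
    # Axiom 3 with equally likely outcomes: add up 1/36 per outcome in the event
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

E1 = lambda w: w[0] % 2 == 1 and w[1] % 2 == 1   # both odd
E2 = lambda w: w[0] % 2 == 0 and w[1] % 2 == 0   # both even
E3 = lambda w: (w[0] + w[1]) % 2 == 0            # sum even
E4 = lambda w: (w[0] * w[1]) % 2 == 0            # product even

assert pr(E1) == Fraction(1, 4) == pr(E2)        # 9/36 each
assert pr(E3) == pr(E1) + pr(E2)                 # E3 = E1 ∪ E2, and E1, E2 are disjoint
assert pr(E4) == 1 - pr(E1)                      # E4 = sample space minus E1
print(pr(E1), pr(E3), pr(E4))
```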

(Refer Slide Time: 17:25)

So, with that we are ready to conclude this segment. Remember, the last segment left off saying we need a clear, formal, and quantifiable understanding of the likelihood of some natural phenomenon, and that is exactly what we have done in this segment. We formalized that natural something as the outcome of an experiment, ok.

On that basis we defined what is called a probability space; it is basically an axiomatic approach in which we defined the notion of a probability function by giving 3 clear axioms that it has to follow. Then we saw some examples, quantified them, and now we have some understanding of how these things work.

In the next segment, what we are going to see is an algorithmic example, and along the way we will also learn a few more concepts in probability theory.

Thank you.

Probability & Computing
Prof. John Augustine
Department Of Computer Science and Engineering
Indian Institute of Technology, Madras

Module – 01
Introduction to Probability
Lecture - 03
Segment 3: Verifying Matrix Multiplication (Statement, Algorithm & Independence)

So, now we are going to start the third segment of the first module, and in this segment, well, let us first recall where we are.
(Refer Slide Time: 00:26)

We now have a formal understanding of terms like sample space, events, probability function, and so on. The goal for this segment is to introduce the notion of independence, which is very fundamental in probability theory, and also the related notion of conditional probability, and to work through some examples.

And we are going to do this in the context of an algorithmic problem, ok: the problem of verifying matrix multiplication, for which we are going to provide a very simple randomized algorithm.

While we are doing that, we are going to leave some terms undefined, and those terms are the ones that lead to an understanding of conditional probability and independence. We will then clarify those terms and finally end with a well-defined description of the randomized algorithm, ok. We are still going to leave a few questions hanging at the end of this segment. For example, whether the algorithm is correct is going to be left out of this segment; we will address that in a subsequent segment.

(Refer Slide Time: 01:40)

So, here is the problem statement: you are given 3 n × n matrices A, B, and C, and you are asked to check if A times B equals C; you output yes if AB = C and no otherwise, ok. Of course, you all know how to solve this very easily: you just multiply A and B, ok. There is only one downside: the normal naive multiplication is going to take Θ(n^3) time, ok.

This is fine for small matrices, but nowadays the applications for matrix manipulation come from big data, right, where we are talking about very large matrices, like 10000 × 10000 or something like that.

And so, an n^3-time algorithm becomes prohibitive, ok. So, you need to somehow speed this up. There are some fancy algorithms you can use; you might have heard of Strassen's algorithm, anybody? That brings it down to about n^2.81. And then even fancier algorithms, which only theoreticians think about, bring it down to n^2.37. I think there has been some improvement in this exponent; I think it was n^2.38 for like 20 years, and then somebody, actually 2 people independently, came up with brilliant ideas to improve it from n^2.38 to n^2.37 or something like that.

So, yeah, that is something you could do if you wanted to, but it is going to be a very complicated algorithm. Our interest is to try and do this even more efficiently, taking advantage of the fact that we can design algorithms that can toss coins, ok.

(Refer Slide Time: 03:40)

So, here is how this randomized algorithm is going to work, ok. We are going to generate a column vector r, and we are going to do it in a very specific way: there are n bits in this column vector, and each of those bits has to be chosen uniformly at random.

Another way of thinking about it: just flip a coin; if it is heads make the bit a 1, if it is tails make it a 0, and it has to be a fair coin, ok. We also want each of these n bits to be independent of each other, ok. This is again a term that we have not defined yet, but it just means that these coin flips have to be done without influencing each other; we will formally define this term shortly, ok.

So, you will get basically r1, r2, r3, and so on up to rn; this is the column vector. The same column vector shows up on both sides, ok. Here is what we are going to do: we are going to compute ABr, that is, first the matrix-vector product Br, and then A times that vector, and we will check whether the result is equal to Cr.

If these 2 are equal, we are going to output AB = C; otherwise, if they turn out to be unequal, we are going to output AB ≠ C, ok. Now, the nice thing that you should notice immediately is that ABr can be computed in O(n^2) time. Why? Because you first perform the multiplication Br, which is an O(n^2) matrix-vector multiplication that gives you a vector, and then you perform the multiplication of A with that vector, which is again O(n^2), ok.

This might be reminiscent of the matrix chain multiplication that you might have studied in your algorithms course, right; we are exploiting the same phenomenon here as well. So, what we get is an O(n^2)-time computation of the left hand side, and of course we can compute the right hand side, Cr, in O(n^2) time as well. So, this algorithm takes only O(n^2) time; is that very clear, hopefully?

So, you just generate a random vector with each element chosen uniformly and independently at random, call it 𝑟, compute 𝐴𝐵𝑟 on the left hand side and 𝐶𝑟 on the right hand side, and check if they are equal. So, these are the terms that we need to understand, but here is one other pesky question that should rankle you: why is this algorithm correct? We are asked to check whether 𝐴𝐵 = 𝐶, but what we are checking is whether 𝐴𝐵𝑟 = 𝐶𝑟. That is something we need to worry about; we will defer that question to a later slide.
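The procedure just described can be sketched in Python as follows (a minimal illustrative sketch; `freivalds_round` is my own name for it, and plain lists of lists stand in for matrices):

```python
import random

def freivalds_round(A, B, C):
    """One round of the randomized check: draw a uniform 0/1 vector r
    and compare A(Br) with Cr.  Both sides need only matrix-vector
    products, so the whole round runs in O(n^2) time."""
    n = len(A)
    r = [random.randint(0, 1) for _ in range(n)]       # n independent fair bits
    Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
    ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
    Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
    return ABr == Cr   # True: report "AB = C"; False: report "AB != C"

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
AB = [[19, 22], [43, 50]]                              # the true product
print(freivalds_round(A, B, AB))                       # True
```

When C really equals AB, the check passes for every choice of 𝑟; the interesting case, taken up later, is when it passes even though AB ≠ C.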

Now, I want to explain this notion of independence ok.

(Refer Slide Time: 07:07)

And in order to do that, there is a very nice saying: if you want to carve an elephant, how do you do it? You take a large rock and chisel away everything that does not look like an elephant, and what you are left with is an elephant. That is what we are going to do.

So, to understand independence, we are going to chisel away all notions of dependence, and what we are left with is independence. We are going to take 2 events A and B, see how they can depend on each other, and define independence as the opposite. For that we will need the notation 𝑃𝑟(𝐴|𝐵).

(Refer Slide Time: 07:33)

So, what does this mean? It is the probability of an event A given a guarantee that event B occurred. Another way to think of it is the following: you are running an experiment, and A and B are 2 events — remember, events are subsets of the sample space.

When you say (𝐴|𝐵), what you are doing is defining a new experiment. In this new experiment, you take the old experiment and repeat it over and over and over again until the event B occurs; when B occurs, whatever outcome you have is the outcome of this new experiment. Does that make sense? Let me repeat what I am saying: you start out with an experiment, and you are concerned about 2 events A and B.

You want to understand the probability of A given B. At least one way to think about it is to create a new experiment in which you repeat the old experiment until B occurs; whatever outcome you got on the run where B occurred is the outcome of the new experiment. In that experiment, you want to understand the probability of A, and that is this probability of A given B.

So, let us look at an example: consider 2 dice that are rolled. One way to exercise our understanding of this notation is to ask: what is the probability that the sum is even, given that the first outcome is odd?

So, let us say you roll 2 dice. If the first outcome is even, your condition — B is called the condition — is not satisfied.

So, that is not good enough; you keep rolling the 2 dice until the first die is odd. The moment the first die is odd, you can ask: is my sum even? That is this notion of conditional probability.

(Refer Slide Time: 10:20)

So, let us work that out. You roll 2 dice, and you condition on the first being odd. Another way of thinking about it is to look at the sample space when 2 dice are rolled, and then simply remove the outcomes that are not relevant to this condition. If you condition on the first die being odd, only these outcomes — the ones marked green — are relevant.

So, now ignore all the other outcomes and worry only about these. In other words, your sample space gets redefined: this is the sample space of the new experiment that I talked about.

(Refer Slide Time: 11:13)

Now you can ask: what is the probability that the sum is even given that the first is odd?

So, you ignore all the outcomes that are not relevant, and among the relevant ones marked green, you ask which outcomes have an even sum; those are marked with a smiley face.

And now you can see that there are a total of 18 equally likely outcomes marked green, out of which 9 have a smiley face. So, the probability is half.
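This count can be verified with a quick enumeration of the 36 equally likely outcomes (a small illustrative sketch, not part of the lecture):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))        # all 36 rolls of two dice
B = [o for o in outcomes if o[0] % 2 == 1]             # condition: first die odd
A_and_B = [o for o in B if sum(o) % 2 == 0]            # ...and the sum is even

print(len(B), len(A_and_B))                            # 18 9
print(len(A_and_B) / len(B))                           # 0.5
```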

(Refer Slide Time: 11:51)

So, now let us try to arrive at a formula for 𝑃𝑟(𝐴|𝐵); let us first try something. A first guess is 𝑃𝑟(𝐴 ∩ 𝐵), but this is not quite hitting the mark, because there is this other aspect to conditional probability: we need to reduce the sample space somehow.

(Refer Slide Time: 12:26)

So, what we are going to have to do is throw away all of the sample space outside of B. When you talk about 𝑃𝑟(𝐴|𝐵), you consider B as your sample space, and within it you ask: what is the probability that A occurred, given that B occurred?

So, your sample space is B, and out of that sample space, what is the probability that your event of interest A occurred? How do you achieve that? The probability 𝑃𝑟(𝐴 ∩ 𝐵) gives you this portion, but it is measured with respect to the overall sample space.

So, now you have to restrict it to just B, and you achieve that by, in some sense, normalizing over B — that is, dividing by 𝑃𝑟(𝐵). So, the formula is 𝑃𝑟(𝐴|𝐵) = 𝑃𝑟(𝐴 ∩ 𝐵)/𝑃𝑟(𝐵). I could have just written the formula and walked away, but I would like you to think about how these formulas come about.

(Refer Slide Time: 13:47)

So, now let us talk about dependence. Remember, we have to understand dependence in order to chisel it away, so that we are left with independence.

It is easy to think about dependence: when can we say that A depends on B? If, when B occurs, the probability of A somehow changes, then you know that A depends on B — the fact that B occurred changed the probability of A. So, that is what we are going to write down.

So, now we are ready to chisel it away: A does not depend on B if 𝑃𝑟(𝐴|𝐵) = 𝑃𝑟(𝐴). Now, we have a formula for the probability of A given B, so just applying that formula, A does not depend on B if and only if 𝑃𝑟(𝐴 ∩ 𝐵)/𝑃𝑟(𝐵) = 𝑃𝑟(𝐴).

So, now we come to the classic definition of independence — this, you might recall, is the way independence is often defined: an event A is independent of B if 𝑃𝑟(𝐴 ∩ 𝐵) equals the product of the 2 individual probabilities, 𝑃𝑟(𝐴) × 𝑃𝑟(𝐵). You might have seen this as the very definition, but hopefully the intuition is now clear as to how we arrive at it.

And here you should notice something: we were talking about A depending on B, but look at the condition for independence — you can interchange A and B and still have the same condition. So, there is a beauty here, a symmetry: if A is independent of B, then B is also independent of A.

(Refer Slide Time: 15:51)

So, you can actually work out the exact same argument for B, and you will find that independence is symmetric: B does not depend on A if and only if A does not depend on B. So, to summarize, A and B are independent if 𝑃𝑟(𝐴|𝐵) = 𝑃𝑟(𝐴), equivalently 𝑃𝑟(𝐵|𝐴) = 𝑃𝑟(𝐵), equivalently 𝑃𝑟(𝐴 ∩ 𝐵) = 𝑃𝑟(𝐴)𝑃𝑟(𝐵); all of these are equivalent statements. Any questions on the notion of independence?
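These equivalent conditions can be checked by enumeration. For the two-dice example above, it happens that the events A = "sum is even" and B = "first die is odd" are independent (a small sketch using exact fractions, not part of the lecture):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))           # sample space: 36 rolls

def pr(event):
    """Exact probability of an event over the uniform sample space."""
    return Fraction(sum(1 for o in omega if event(o)), len(omega))

A = lambda o: sum(o) % 2 == 0                          # sum is even
B = lambda o: o[0] % 2 == 1                            # first die is odd
and_AB = lambda o: A(o) and B(o)

print(pr(and_AB) == pr(A) * pr(B))                     # True: A, B independent
print(pr(and_AB) / pr(B) == pr(A))                     # True: Pr(A|B) = Pr(A)
```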

(Refer Slide Time: 16:33)

So, now we can revisit the algorithm. We have to generate a column vector 𝑟 of n bits chosen uniformly and independently at random. We now know the notion of independence, but how do we actually do that?

For each random bit, for 𝑖 = 1 to 𝑛: toss a fair coin, set bit 𝑖 to 1 if heads and to 0 otherwise, and completely forget the outcome of the toss, because the next toss has to be completely independent of the previous one.

You repeat this n times, and you get the vector 𝑟 that you need.

(Refer Slide Time: 17:19)

Now, there is an important claim here about the 2ⁿ binary column vectors. If you have a binary vector of n bits, the total number of possible such vectors is 2ⁿ, and the claim is that each of these 2ⁿ binary column vectors is equally likely; we want to be able to argue that.

So, what we know is that each individual bit is equally likely to be 0 or 1. How does that imply that the entire n-bit binary vector is drawn from a sample space of 2ⁿ binary strings, each equally likely? That is the question. So, here is how we are going to do it: remember, our sample space is the set of 2ⁿ binary vectors; pick any arbitrary one and call it 𝑟1𝑟2 … 𝑟𝑛.

Remember, each bit was generated completely independently of the others. And if they are independent, then the probability of 𝑟1 being a certain value, 𝑟2 being a certain value, and so on — those individual probabilities can be multiplied, because that is the formula for independence; we showed it for 2 events, but it extends to n events as well.

So, the probability that you get the appropriate 𝑟1 is half, the appropriate 𝑟2 is half, and so on. Multiplying half with itself n times, you get 2⁻ⁿ. So, what have we established? If you pick any one of the 2ⁿ binary vectors, the probability that you get that particular binary vector is 1/2ⁿ — that is, they are all equally likely.
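A quick empirical check of this claim for n = 3 (an illustrative sketch; the trial count and tolerance are arbitrary choices of mine):

```python
import random
from collections import Counter

n, trials = 3, 80_000
counts = Counter(
    tuple(random.randint(0, 1) for _ in range(n))      # n independent fair bits
    for _ in range(trials)
)
print(len(counts))                                     # 8: every vector shows up
for vec in counts:
    # each of the 2^n vectors appears with frequency close to 1/8
    assert abs(counts[vec] / trials - 1 / 2**n) < 0.01
```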

(Refer Slide Time: 19:24)

So, now hopefully if we revisit the randomized algorithm all the terms are clear to us.

(Refer Slide Time: 19:35)

We know exactly what we mean by uniformly and independently at random, so the algorithm is clear to us. Along the way, we have understood the notions of independence of events and conditional probability; we still do not know whether the algorithm is correct.

(Refer Slide Time: 19:42)

We know the running time: the algorithm is guaranteed to run in 𝑂(𝑛²) time. We do not yet know the correctness, and we will see that in the next segment.

Thank you.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module - 01
Introduction to Probability
Lecture – 04
Segment 4: Verifying Matrix Multiplication
(Correctness & Law of Total Probability)

Let us get started with Module 1, Segment 4. We are going to continue our discussion of Verifying Matrix Multiplication. We saw an algorithm in the previous segment, and today we are going to prove that the algorithm is correct under some definition of correctness. So, we will have to redefine the notion of correctness a little bit for randomized algorithms, at least for some types of randomized algorithms.

And along the way, we will be introducing a few things: the principle of deferred decisions and the law of total probability. We will discuss these 2 principles as we go along.

(Refer Slide Time: 00:55)

And then we will give a proof of correctness; the initial proof of correctness will be a probabilistic statement.

So, it will come with certain probability guarantees. What we will do then is show how to boost the probability of correctness to pretty much any extent that we want. So, that is the plan for today's segment.

(Refer Slide Time: 01:16)

Let us start with some bad news first, hopefully to set the stage for some good news later. If you recall the algorithm: you chose a random vector 𝑟, computed 𝐴𝐵𝑟 on the left hand side and 𝐶𝑟 on the right hand side, and checked if they were equal; if they were equal you reported 𝐴𝐵 = 𝐶, otherwise not. So, that is the algorithm.

And if you deterministically choose 𝑟, you can always find an 𝑟 for which the algorithm is going to be incorrect. So, an undergraduate algorithms mindset would not suffice to handle this problem, because you would immediately think about the worst case, and in the deterministic worst-case sense this algorithm is incorrect. But our hope lies in the fact that 𝑟 is chosen uniformly at random, with each bit independent of the others.

(Refer Slide Time: 02:09)

And so, we are going to redefine the notion of correctness, and we are going to prove the following statement. Remember, when you run the algorithm, there are 2 outcomes; in this case our sample space is correct versus incorrect.

And we want to argue that the probability that the algorithm is incorrect is at most some δ, and we will start with δ being just a half. So, to begin with, the only guarantee we give is that half the time our algorithm will be correct — the other half it could possibly be wrong — and then we will see what we can do about that later.

(Refer Slide Time: 02:54)

So, let us look at it a little more carefully. When 𝐴𝐵 = 𝐶, our algorithm will never be incorrect. Why? Because whatever 𝑟 you choose, 𝐴𝐵𝑟 = 𝐶𝑟, since AB and C are equal. So, we are always going to be correct when 𝐴𝐵 = 𝐶. The only interesting case, therefore, is when 𝐴𝐵 ≠ 𝐶, and a correct algorithm should then be able to say that 𝐴𝐵 ≠ 𝐶.

What is left to show is that, given that 𝐴𝐵 ≠ 𝐶, our algorithm will be incorrect with probability at most δ. For the rest of this segment we will therefore assume that 𝐴𝐵 ≠ 𝐶. Now we define D = AB − C, and we can think of 𝐷𝑟 = 𝐴𝐵𝑟 − 𝐶𝑟.

So, remember our algorithm checks whether 𝐴𝐵𝑟 = 𝐶𝑟; an equivalent check is to see whether 𝐷𝑟 = 0. And if it turns out to be 0, the algorithm is incorrect. Why? Because we have made the assumption that 𝐴𝐵 ≠ 𝐶, and 𝐷𝑟 = 0 would fool us into thinking 𝐴𝐵 = 𝐶.

(Refer Slide Time: 04:26)

So, now we want to apply a principle called the principle of deferred decisions; it is just a technique, and in case you did not know, deferred means delayed. Some decisions are going to be delayed, and others are going to be assumed to have happened.

So, why are we applying the principle of deferred decisions here? We have to contend with 𝑟, which is an n-bit vector. The way we are going to deal with it is to fix all elements except one. On that one element alone, among the n elements, we defer the decision. And then we will argue, based on that deferred element, that the probability of being incorrect is at most half.

(Refer Slide Time: 05:20)

So, let us first illustrate the principle of deferred decisions in a simpler setting. Toss a coin 10 times: what is the probability that the number of heads is odd? That is the question at hand. Here is one way we can do it: we know that there are 2¹⁰ equally likely outcomes, because there are 10 coin tosses. Out of them, how many outcomes have an odd number of heads? It is going to be (10 choose 1) + (10 choose 3) + (10 choose 5) + (10 choose 7) + (10 choose 9).

So, if you want to argue the answer directly, you have the summation of all these 10-choose-odd-numbers in the numerator and 2¹⁰ in the denominator, and it is going to get a little messy to work out what that is; we are not going to bother with that.

(Refer Slide Time: 06:27)

But let us see how the principle of deferred decisions works in this case. We assume that the first 9 tosses have been completed and that there are some 𝑙 heads among them; we do not know what that value 𝑙 is.

So, now we defer the decision on the last coin toss alone. And if you think about it, the last toss is equally likely to be heads or tails. Now see what happens: if 𝑙 is even, we get an odd number of heads exactly when the last toss is heads, which happens with probability half; if, on the other hand, 𝑙 is odd, we get an odd number of heads exactly when the last toss is tails, again with probability half.

And in either case, we can conclude — at least intuitively at this point — that the total number of heads is odd with probability half. This is just an intuition right now; we will need the law of total probability to formalize it.
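The messy direct sum from the earlier slide can be evaluated numerically, and it agrees with the deferred-decision intuition (a quick sketch):

```python
from math import comb

# Direct count: outcomes of 10 tosses with an odd number of heads.
odd_heads = sum(comb(10, k) for k in range(1, 11, 2))
print(odd_heads)                # 512, exactly half of 2**10 = 1024
print(odd_heads / 2**10)        # 0.5, as the deferred-decision argument predicts
```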

(Refer Slide Time: 07:35)

So, let us get back to the problem of verifying matrix multiplication. Now, look at D: remember 𝐷 = 𝐴𝐵 − 𝐶, and we are assuming 𝐴𝐵 ≠ 𝐶. So, there must be a nonzero entry in D, and for simplicity we will just assume that this nonzero entry is the very first one, 𝐷11. And remember, the bad event whose probability we are trying to show is small is 𝐷𝑟 = 0.

(Refer Slide Time: 08:17)

So, when we say 𝐷𝑟 = 0, we can just apply the formula, and we are going to work with the first row of D alone.

So, this is the first row of D multiplied with the elements of 𝑟, and we ask: what is the probability that this entry is 0? If you think about it, this entry being 0 — it is the first entry of 𝐷𝑟 — is a necessary condition for the bad event 𝐷𝑟 = 0. We will limit ourselves to proving that this first entry alone is 0 with some bounded probability. So, now, let us expand this summation out.

So, what we are doing is isolating the first term. Remember, we want to use the principle of deferred decisions, so we isolate the first term — the term on which we are going to defer the decision — and the rest of the terms sit in the summation. Then we isolate just 𝑟1 and solve for it: since 𝐷11 ≠ 0, the first entry of 𝐷𝑟 is 0 exactly when 𝑟1 = −(𝐷12𝑟2 + 𝐷13𝑟3 + ⋯ + 𝐷1𝑛𝑟𝑛)/𝐷11. For now, focus on the fact that we have isolated 𝑟1, the first bit of our random vector.

So, now, our statement simplifies to showing that the probability that the first bit 𝑟1 equals this particular right hand side quantity is at most δ.

It seems slightly messy, but the nice thing we have done is isolate our focus on just 𝑟1. So, now, we have the ability to apply the principle of deferred decisions: we set aside the other random bits and focus our efforts on what happens to 𝑟1.

(Refer Slide Time: 10:40)

But if you want to be careful and formal about it, what we are actually going to apply at this point is the law of total probability, which I will talk about now. How does this law work? Assume that Ω is a sample space, and that the sample space is broken into 𝐸1, 𝐸2, 𝐸3, 𝐸4, and so on. These are mutually disjoint subsets of Ω, and when you take the union of all the 𝐸1, 𝐸2, 𝐸3, 𝐸4, and so on, it should actually equal Ω.

So, basically the 𝐸𝑖 form a partition of the sample space, and our interest is in a particular event B — the oval shown over here. And what the law of total probability says is that you can compute the probability of B by looking at the intersections of B with the 𝐸𝑖.

(Refer Slide Time: 11:32)

And if you look at this picture, it is a very intuitive thing to see: when you intersect B with 𝐸1, you get that portion of the event covered. Remembering the axioms of probability theory, you just add up these individual intersections and you get the probability of the event B.

And now recall the formula for conditional probability: 𝑃𝑟(𝐵|𝐸𝑖) = 𝑃𝑟(𝐵 ∩ 𝐸𝑖)/𝑃𝑟(𝐸𝑖). Jiggling the terms around, we get 𝑃𝑟(𝐵) as the sum over 𝑖 of 𝑃𝑟(𝐵|𝐸𝑖) 𝑃𝑟(𝐸𝑖); this is essentially the law of total probability. How are we going to apply it over here?
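Here is the law at work on the two-dice experiment, partitioning by the value of the first die (an illustrative sketch of my own, using exact fractions):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))           # 36 equally likely rolls

def pr(pred):
    return Fraction(sum(1 for o in omega if pred(o)), len(omega))

B = lambda o: sum(o) % 2 == 0                          # event: sum is even

total = Fraction(0)
for v in range(1, 7):                                  # E_v: "first die shows v"
    E_v = lambda o, v=v: o[0] == v                     # the E_v partition omega
    pr_B_given_Ev = pr(lambda o: B(o) and E_v(o)) / pr(E_v)
    total += pr_B_given_Ev * pr(E_v)                   # sum of Pr(B|E_v) Pr(E_v)

print(total == pr(B) == Fraction(1, 2))                # True
```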

(Refer Slide Time: 12:47)

And remember, our bad event is 𝐷𝑟 = 0. What is its probability? You can take this bad event 𝐷𝑟 = 0 and intersect it with a bunch of events 𝐸𝑖, and if you can compute these individual intersection probabilities, you can just add them up.

So, that is the law of total probability; what are the 𝐸𝑖 that we are going to take? This is where it connects to the principle of deferred decisions. How many 𝐸𝑖 are there? There are 2ⁿ⁻¹ of them: 𝑖 ranges from 0 to 2ⁿ⁻¹ − 1, and 𝐸𝑖 is the event that the rest of the random bits correspond to the binary number 𝑖.

So, the bits 𝑟2 to 𝑟𝑛 — not the first bit — can take on some binary value, and our intention is to not worry about any of them. And if you think about it, the union of all of these events is the whole sample space, and they are mutually disjoint, because different events correspond to the remaining random bits evaluating to different binary numbers.

So, now we can apply the principle of deferred decisions, in the following manner.

(Refer Slide Time: 14:20)

So, we are now isolating our focus to 𝑟1, which we already showed we can do. And remember, the random bits were chosen independently, so whether 𝑟1 equals this quantity is independent of the 𝐸𝑖, because the 𝐸𝑖 depend only on the rest of the random bits. Because of that independence, we can convert each intersection into a multiplication, and we get the expression in this form.

For the first inequality, let me be a little careful here. 𝐷𝑟 = 0 is the original bad event: 𝐷𝑟 is a vector, and every element of it has to be 0. What we are doing is focusing only on the first element and bounding just the probability that the first element is 0. Requiring only the first element to be 0 is a weaker condition than 𝐷𝑟 = 0, so the corresponding event is larger.

So, if you draw the sample space, the bad event 𝐷𝑟 = 0 — which requires all the elements to be 0 — sits inside a larger event that only requires the first element to be 0. We are going to argue that this outer event itself has small probability, which in turn bounds the bad event. Does that make sense?

Now let us look at the 𝐸𝑖. Take the random bits 𝑟2, 𝑟3, and so on up to 𝑟𝑛; each can take the value 0 or 1, so together they can represent the binary numbers from 0 to 2ⁿ⁻¹ − 1. These are all the possible ways they can vary, and 𝐸𝑖 is the case where they take the value 𝑖.

For example, 𝐸1 is the event that the binary string 𝑟2 … 𝑟𝑛 has every bit 0 except the least significant bit, so that it evaluates to 1; 𝐸0 is when all the bits are 0; 𝐸2 is when the string looks like … 1 0; and so on. Together these cover the entire sample space, which is exactly what we wanted for the law of total probability.

So, for each of the 𝐸𝑖, we intersect it with the event that we care about — the outer event, which plays the role of the oval B in the law of total probability figure. Now we just apply the formula. The question is: why are the event that 𝑟1 equals this quantity and the event 𝐸𝑖 independent?

Under the principle of deferred decisions, once we condition on 𝐸𝑖 — that is, once 𝑟2 through 𝑟𝑛 are fixed — the right hand side is some fixed quantity. And whether 𝑟1 equals that quantity or not depends purely on 𝑟1, which was chosen completely independently of the other random bits; that is why this event operates independently of 𝐸𝑖. At this point in time we have fixed that quantity, and that is the reason.

So, now we just plug in values. How will 𝑟1 equal the quantity on the right hand side? Well, 𝑟1 takes one of 2 possible values, 0 or 1, each with probability half, and at most one of these two values can equal that fixed right hand side quantity: if it equals the right hand side when 𝑟1 is 0, then it would not be equal when 𝑟1 is 1.

So, with probability at most half the two are equal; that is where this half comes from, and the probabilities of the 𝐸𝑖 stay as they are. We take the half outside, and what remains is the sum of the probabilities of events that span the entire sample space, which equals 1. That leaves us with a probability of at most half.

So, this is how we get the fact that the probability of this bad event is at most half. The choice of 𝑟1, the bit with index 1, was arbitrary; it goes back to the fact that some entry of D was nonzero, and we used that fact to work with the corresponding entry of 𝑟.
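A quick simulation of this bound, using a hypothetical 3×3 difference matrix D whose only nonzero entry is 𝐷11 (the matrix and trial count are arbitrary choices of mine):

```python
import random

D = [[5, 0, 0],                 # D = AB - C with a nonzero entry D11
     [0, 0, 0],
     [0, 0, 0]]
n, trials, fooled = 3, 40_000, 0
for _ in range(trials):
    r = [random.randint(0, 1) for _ in range(n)]
    Dr = [sum(D[i][j] * r[j] for j in range(n)) for i in range(n)]
    fooled += (Dr == [0] * n)   # bad event: Dr = 0 even though D != 0
print(fooled / trials)          # close to 0.5: here Dr = 0 exactly when r1 = 0
```

For this particular D the bound is tight, since 𝐷𝑟 = 0 happens precisely when 𝑟1 = 0.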

(Refer Slide Time: 20:28)

So, what we have shown is that δ, the upper bound on the probability of the bad event, is half. Is that good enough? Certainly not; you would not bet your life on something where the probability of the bad event is close to half.

So, how do we work with that? One thing we can take advantage of is the fact that this algorithm has one-sided error: when 𝐴𝐵 = 𝐶 we are always going to be correct, and when 𝐴𝐵 ≠ 𝐶 we are going to be wrong with probability at most half.

(Refer Slide Time: 21:05)

So, what we want to do now is ensure that the probability of error comes down to some arbitrary δ*. You decide what δ* is; it could be, say, 0.0001.

Now, what we are going to do is repeat this algorithm some k times and ensure that we bring the probability of error down to at most your favorite δ*. Of course, since the error is one-sided, when 𝐴𝐵 = 𝐶 we are always going to be correct, so we are done with that case; we only need to worry about the case where 𝐴𝐵 ≠ 𝐶.

So, what is the probability that the bad event 𝐷𝑟 = 0 happens in all of the k repetitions, given that 𝐴𝐵 ≠ 𝐶? That is the question, and remember, these repetitions are going to be independent repetitions.

So, since they are independent, you can simply multiply: the probability is (1/2)ᵏ, and we want this (1/2)ᵏ to be bounded by δ*. Which means that you have to run this 𝑘 ≥ 𝑙𝑜𝑔₂(1/δ*) number of times. And remember, 𝑙𝑜𝑔(1/δ*) is actually a fairly small quantity: the log is basically the number of bits needed to represent a quantity.

For example, if your δ* is 0.001, then 1/δ* = 1000, and 𝑙𝑜𝑔₂(1000) is about 10, since 𝑙𝑜𝑔₂(1024) = 10.

So, that is all: if you just repeat it about 10 times, you bring the probability of error down to your favorite δ*.

(Refer Slide Time: 23:13)

So, putting things together: this is the algorithm that we have already seen, and we just have to wrap it in a for loop — that is it.

(Refer Slide Time: 23:27)

So, let us remind ourselves what the claim is: our algorithm is always correct when 𝐴𝐵 = 𝐶; when 𝐴𝐵 ≠ 𝐶, our algorithm will be correct with probability at least 1 − δ*.

So, let us remember: it is going to be incorrect with probability at most δ*, so it is going to be correct with probability at least 1 − δ*. And what is the running time? Well, we know that the original algorithm had 𝑂(𝑛²) running time; we are wrapping it in a for loop that takes 𝑙𝑜𝑔(1/δ*) iterations, giving 𝑂(𝑛² 𝑙𝑜𝑔(1/δ*)).

Now, one way to think about this δ*: in computer science we are often interested in scalability — the larger the problem, the stronger we want our guarantees to be. So, let us say we want δ* = 1/𝑛. When we get a correctness guarantee of this form, where the incorrectness probability is at most 1/𝑛 and therefore the correctness probability is at least 1 − 1/𝑛, we say that the algorithm is correct with high probability; this is a standard term.

So, when we get the probability of correctness to be at least 1 − 1/𝑛, we say it is correct with high probability. And how do we ensure that? That is easy: with 1/δ* = 𝑛, the running time simply becomes 𝑂(𝑛² 𝑙𝑜𝑔 𝑛).

So, if you just add a factor of 𝑙𝑜𝑔 𝑛, you get correctness with high probability. And with that, we are down to the concluding slide.

(Refer Slide Time: 25:15)

So, what did we see? Let us conclude our segment. In the previous segment, we saw the algorithm to verify matrix multiplication; in this one, we carefully went through the analysis of that algorithm, studied the principle of deferred decisions and the law of total probability, and showed that the algorithm is correct with high probability.

So, with that, let us look forward to the next segment: it is going to be another exciting algorithm, called Karger's Mincut algorithm.

Thank you.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module – 01
Introduction to Probability
Lecture - 05
Segment 5: How Strong is your Network?

(Refer Slide Time: 00:19)

We are going to be talking about how strong a network is. Well, let us look at the plan for this segment. Recall that in the last segment we studied a randomized algorithm for verifying matrix multiplication, and we introduced the probabilistic notion of correctness — what we mean by an algorithm being correct with high probability.

In the current segment, we are going to look at the problem of how strong a network is and formalize it in the following way: we define something called a min cut in a graph. Informally speaking, it is the fewest number of edges that you need to remove in order to disconnect the graph; if that number is very small, then the graph in some sense is not strong enough. That is the intuition that leads to this formalism, and we present a very simple, elegant randomized algorithm to solve this problem.

(Refer Slide Time: 01:21)

So, let us look at the motivation. As I said, we are trying to figure out how strong a network is. Here are 2 examples of networks, and the question is which of these networks is robust. If you look at the one on the right, it is completely connected. Let us play the devil's advocate and try to disconnect this network: you would have to destroy a lot of edges before it becomes disconnected; even to isolate one node, you would have to destroy all its incident edges, and that is a lot. On the contrary, if you look at the graph on the left, you only have to delete that one link and you have disconnected the network.

And so, this makes the network on the left very brittle, and we want to find out how brittle — how easy to break — a network is. We formalize this as the min cut problem.

(Refer Slide Time: 02:39)

So, we assume that our input is a graph with vertex set 𝑉 and edge set 𝐸; this represents our
network. Typically, you can think of the vertices as being the computer nodes and an edge as
being the ability for 2 nodes to talk to each other. And so, if you can create a cut in this
network, that means there is a portion of the network that cannot talk to the rest of the
network, and that is not a good situation; we want to avoid that. So, a cut is a partition of
the vertex set 𝑉 into 2 sets, 𝑆 and the remaining 𝑉\𝑆, and the cut set corresponding to a
particular cut is the set of edges with one end in 𝑆 and the other end in 𝑉\𝑆.

So, all the edges that go across the cut are called the cut set. Our goal is to find the
partition, that is, the set 𝑆, such that the corresponding cut set has the least cardinality;
we want to find the smallest cut, and that is called the min cut. So, we want to find the min
cut.
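As a concrete illustration of this definition (my own sketch, not from the lecture), the min cut of a small graph can be computed by brute force: enumerate every partition (𝑆, 𝑉\𝑆) and count the edges crossing it.

```python
from itertools import combinations

def min_cut_brute_force(vertices, edges):
    """Return the size of the smallest cut set over all partitions (S, V\\S)."""
    best = None
    vs = list(vertices)
    rest = vs[1:]
    # Fixing vs[0] inside S enumerates each partition exactly once.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            S = {vs[0], *extra}
            if len(S) == len(vs):
                continue  # S must be a proper subset, so V\S is nonempty
            crossing = sum(1 for u, v in edges if (u in S) != (v in S))
            best = crossing if best is None else min(best, crossing)
    return best

# A 4-cycle stays connected after removing any single edge, so its min cut is 2.
print(min_cut_brute_force(range(4), [(0, 1), (1, 2), (2, 3), (3, 0)]))  # 2
```

This takes time exponential in the number of vertices, which is exactly why a faster randomized algorithm is interesting.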

(Refer Slide Time: 04:13)

So, the algorithm we are going to look at was invented by David Karger. So, we call it
Karger’s Mincut algorithm.

(Refer Slide Time: 04:21)

There is, you know, we have to keep in mind that there is a flow-based algorithm that could
do quite well, but it is still not very efficient: the most straightforward application of the
flow-based algorithm would lead to something of the order of n^5, and we want to do better
than that ok. Moreover, Karger’s algorithm is very elegant and nice. So, you will see this.
And the algorithm here is very simple: you just repeat this process of contraction 𝑛 − 2
times, because each time you contract, you will be coalescing two vertices. So, you can do
that 𝑛 − 2 times and you are left with two vertices.

So, repeat until the graph has only two vertices: pick an edge uniformly at random from all
the edges. The thing to note here is that, as the algorithm progresses, you might end up
having multiple edges going between vertices. So, when you pick an edge, you should be careful
not to pick two vertices and then consider the edge connecting them, because some pairs of
vertices could have a lot of edges going between them and others may have much fewer.

So, you really have to be careful to pick the edge uniformly at random over the set of all
edges ok. For example, say this pair of vertices has lots of edges going between them, and
there is another pair of vertices with only one edge going between them; the probability that
this one edge is chosen should be exactly equal to the probability that any one of those other
edges over here is chosen ok.

So, pick such an edge uniformly at random and then look at the two vertices that it is
connecting; let us say this is u and this is v. Now, knowing that this edge connects u and v,
we have to contract, or coalesce, u and v into a single vertex. So, you have to contract that
edge, and when you do that; for example, let me add a couple more edges to illustrate the
point: let us say this is our graph, and when you contract u and v, you have to put u and v
together as one vertex.

So, this is the coalesced vertex uv, and let us just call these other vertices x and y ok. All
the edges that go between u and v will form self loops when you contract them, so they are
all removed; you remove all the self loops, just do not consider them. But you need to be
careful about the other edges. So, let us put down x and y: there was one edge going between
u and x; now there is no u, there is only a uv, but you see that edge still has to be put in
over here.

And between v and x there is an edge, so that has to be a separate edge; between v and y
there is an edge; and between x and y there previously was an edge. So, this is the new graph
that we get ok. So, this is the algorithm: you just keep contracting; you pick an edge
uniformly at random, you contract it and create a new graph, and this new graph will have one
fewer vertex. Keep in mind some of the vertices are going to become these sort of compound
vertices that contain multiple original vertices in them. We do this until we are left with
just two vertices, and when we are left with two vertices, what will the picture look like?
Well, it is going to look something like this: there are going to be 2 sort of giant vertices
that include a lot of coalesced original vertices, and there are going to be edges across
them.

All the original vertices in one of these coalesced vertices, you call them 𝑆, and all the
original vertices in the other coalesced vertex are going to be 𝑉\𝑆; at least, this is going
to be our candidate 𝑆 and 𝑉\𝑆, and our hope is that this edge set will be the min cut. This,
of course, is not guaranteed, but we hope to be able to show that with some reasonable
probability, indeed, that will be the min cut.

(Refer Slide Time: 09:57)

So, let us look at the execution of this algorithm in this example. Here you have a sequence
of graphs ok. This is the original graph, and we have chosen to contract this edge.

Because of that contraction, the two vertices have become one vertex, and now there are one,
two, three, four, five, six edges going away from this pair of vertices; well, actually there
is a seventh one over here. So, those seven edges must find their place here, and you notice
that is the case: there are these seven edges incident on this new coalesced vertex. Also,
there must have been one self loop that was created, and that was removed. And now you pick
one more edge to contract; this is the edge that is chosen for contraction, and after
contraction you get this coalesced vertex, and you continue on ok.

(Refer Slide Time: 11:15)

So, now you have chosen to contract this edge and you get this vertex as the coalesced vertex.
Now notice that you are trying to contract this edge and you get this coalesced vertex and you
continue on.

(Refer Slide Time: 11:37)

So, you have chosen to contract this edge, you get this coalesced vertex, and you continue
proceeding.

(Refer Slide Time: 11:46)

This way, you are left with two final coalesced vertices and these 3 edges; incidentally,
this coalesced vertex and this coalesced vertex respectively contain all these vertices and
all these vertices. And if you think about it, the min cut indeed was these 3 edges that
separated the top 5 vertices from the bottom 5 vertices. So, this shows an example of a good
run of this algorithm, because it has actually led to us finding the min cut.

(Refer Slide Time: 12:33)

But of course, that is not at all guaranteed; it is a probabilistic thing, and you need to be
able to show that this happens with some reasonable probability. So, let us try to analyze
that ok. So, what is it that we are going to prove? Let 𝑆 be the set of original vertices
that led to one of the final two vertices.

So, remember, the algorithm ends with 2 final vertices, and 𝑆 is the set of original vertices
that led to one of them. Now, with probability at least $1/\binom{n}{2}$, that is, one over
some polynomial in n, 𝑆 and 𝑉\𝑆 induce the smallest cut. So, the cut set produced in this
way is going to be the min cut with probability at least $1/\binom{n}{2}$. This is what we
need to show ok. Notice that this probability is fairly small, one over some polynomial in n,
but bear with me; we will figure out how to deal with that.

So, how do we prove this lemma? Actually, we do not care about all possible min cuts; a graph
can have lots of min cuts, and a good exercise for you would be to come up with graphs that
have lots and lots of min cuts. So, now, let us pick one min cut: let 𝐶* be a cut set of
minimum cardinality ok. So, instead of proving something strong like the probability of
finding any min cut, we are just going to show that finding this one particular min cut
happens with probability at least $1/\binom{n}{2}$, and this is actually good enough if you
think about it ok. So, we are going to limit ourselves to proving this.

(Refer Slide Time: 15:05)

So, let us try to figure out how to do that, and think about how this event may fail to
happen; what could go wrong? As long as the edges that you keep picking avoid the min cut,
this particular 𝐶*, we are fine. Let us say the graph looks something like this, and this is
the min cut, the 𝐶* that we are talking about. As long as the contractions happen over here,
on other edges that are not part of the min cut, we are going to be fine; the min cut is
going to survive till the end. The moment we pick a min cut edge, we have lost it ok. So,
that is the intuition that we want to formalize ok.

So, towards that, let 𝑒1, 𝑒2 and so on up to 𝑒𝑛−2 be the sequence of edges contracted by the
algorithm. Now, as I said, if one of these 𝑒1 through 𝑒𝑛−2 is a member of this min cut, that
is bad ok. So, the other way of saying it is that the algorithm succeeds if none of them are
in 𝐶* ok. So, this probability that we care about, the probability that the cut produced by
the algorithm is 𝐶*, is equal to the probability that the last edge that we contracted was
not in 𝐶*, and the last but one edge that we contracted was not in 𝐶*, and so on up to the
first edge that was contracted ok. I have just gone in the reverse order for a reason.

So, now, this is the probability of the intersection of several events; all of these things
will have to happen in order for the algorithm to succeed. A lot of stars have to line up.
One thing we need to be careful about is that these are not independent events; there could
be dependencies. So, we should be a little bit careful; we cannot just multiply their
individual probabilities. But given this situation, we can write this intersection of all
these events in the following manner: you isolate the conditional probability of just the
very first event listed over here, conditioned on all these other things happening ok. But if
you were to do that, then you also have to multiply it by the probability of whatever
condition you imposed ok.

(Refer Slide Time: 18:27)

So, that same thing is written over here, but now the same structure holds: here, if you
notice, it is just the probability of the intersection of several events, except that
𝑒𝑛−2 ∉ 𝐶* has been removed. But the rest of them are still an intersection of events, which
means we can repeat the same procedure: you isolate 𝑒𝑛−3 ∉ 𝐶* conditioned on the rest of
them. But then, if you were to do that, you would have to multiply it by the probability of
this condition, which means you can then isolate 𝑒𝑛−4 conditioned on the remaining
intersections, and so on.

So, when you work that out, you are going to get this sequence of conditional probabilities,
and the last one, of course, is the probability of 𝑒1 ∉ 𝐶*, after which you cannot apply
this any further. So, this boils down to computing one such conditional probability, because
the probability that we want is the product of a sequence of such conditional probabilities.
By now you may be wondering what the intuition is ok. So, let us actually look at a picture
that gives us the intuition of exactly what happens; let us think of the execution of this
algorithm.

At the very beginning, you are in some state; you have still not picked any edge. Then you
pick an edge; that edge could be a member of the min cut 𝐶* or not. If it is a member of 𝐶*,
basically that means the very first edge that you picked belongs to 𝐶*, and that is very bad;
this execution will not succeed. Otherwise, something good happens: 𝑒1 is not in 𝐶*, great.
So, the execution has survived the first iteration ok. But then again, if the second edge
that was chosen, 𝑒2, belongs to 𝐶*, again we are out of luck; the execution has headed off
into a bad situation. Otherwise it is good, and this continues on ok.

So, at any point in time, let us look at what happens. If you have managed to reach this
point, that is basically saying there was a sequence of iterations and this is the i-th
iteration. How can you say whether the i-th iteration will succeed or fail? Well, first of
all, you should have reached the i-th iteration. What does it mean to have reached the i-th
iteration? All of these previous steps had to have been good; only then would you have even
reached here, and that is exactly what is captured by this condition: all of the previous
edges 𝑒𝑖−1, 𝑒𝑖−2 and so on down to 𝑒1 must have missed the min cut; only then would you have
even come to this point.

Now, conditioned on having come to this point, you can ask what the probability is that you
would either go down the bad path or continue on the good path, and that is exactly the
intuition here. So, now we can come back to the formalism and look at each of these
conditional probabilities. In particular, we are going to focus on this one conditional
probability and get the right expression for that particular probability.

(Refer Slide Time: 23:20)

Let us denote the size of the min cut as k. Now, if you think about it, as the edge
contractions take place, the min cut of the graph that we obtain is never going to be less
than k. Why is that? Well, any cut in the graph in the i-th iteration actually corresponds to
a cut in the original graph: if the number of edges going across a cut in the i-th iteration
were less than k, it would also be less than k in the original graph, which is impossible.

So, k is going to be a lower bound on the min cut as the algorithm progresses; all the graphs
that we encounter till the very end are going to have a min cut of size at least k. That
means the minimum degree will also have to be at least k, because if some vertex had degree
smaller than k, then you could create a smaller cut: the low-degree vertex becomes one side
of the cut and all the other vertices become the other side. So, clearly the minimum degree
of the graph G that we have is at least k, and as I mentioned, this is going to be the case
throughout the execution. So, look at any coalesced vertex; it contains a lot of the original
nodes, and if you look at all the edges going out of that vertex, there cannot be fewer than
k of them ok.

(Refer Slide Time: 25:23)

So, just before the i-th contraction, how many vertices are there? Well, 𝑖 − 1 contractions
have taken place, so you are left with 𝑛 − (𝑖 − 1) vertices, which is 𝑛 − 𝑖 + 1, and each
one of them has degree at least k. So, how do you count the number of edges? Well, let us
count the total degree: there are 𝑛 − 𝑖 + 1 vertices, each of degree at least k. But the
total degree double counts the number of edges, so if you divide by 2, you get a lower bound
of 𝑘(𝑛 − 𝑖 + 1)/2 on the number of edges. So, now, let us go back to this probability;
remember, this probability term is what we want to compute ok.

We know that the number of edges is at least this quantity, and we have reached the i-th
iteration of the execution without destroying the min cut 𝐶*. How do we ensure.

(Refer Slide Time: 26:45)

So, what is the probability?

(Refer Slide Time: 26:46).

That the i-th edge that is chosen also avoids the min cut? Well, this is the probability of a
good event, that the i-th edge did not belong to the min cut, and that is going to be 1 minus
the probability of the bad event, which is that the i-th edge is a member of the min cut 𝐶*.
So, the probability that the i-th edge is a member of the min cut is what we need to put
here.

So, how do we get this expression here? Well, there are k edges in the min cut, and you will
choose one of them with probability k over the total number of edges; this is the probability
with which you are going to choose one of these min cut edges. So, this is the probability of
the bad event, and 1 minus the probability of the bad event gives the probability of the good
event. And keep in mind that 𝑘(𝑛 − 𝑖 + 1)/2 is just a lower bound on the number of edges.

So, when you replace the total number of edges by a lower bound, the probability of the bad
event could potentially become a little bit larger, and 1 minus it could therefore only
become a little bit smaller. As a result, you have this inequality here: the conditional
probability is greater than or equal to the whole expression on the right hand side ok.
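Written out symbolically, the bound just described is the following (a reconstruction of the slide's computation, with k the min cut size and |E_i| the number of edges just before the i-th contraction):

```latex
\Pr\left[e_i \notin C^{*} \,\middle|\, e_1 \notin C^{*}, \dots, e_{i-1} \notin C^{*}\right]
  = 1 - \frac{k}{|E_i|}
  \ge 1 - \frac{k}{k(n-i+1)/2}
  = 1 - \frac{2}{n-i+1}
  = \frac{n-i-1}{n-i+1}.
```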

So, now, if you work that out, it is going to come out as (𝑛 − 𝑖 − 1)/(𝑛 − 𝑖 + 1); this is
the expression we get for this one probability term. So, now, going back to this original
probability that the cut produced by the algorithm is 𝐶*, remember we wrote that as the
product of multiple conditional probabilities, and for each one we have
(𝑛 − 𝑖 − 1)/(𝑛 − 𝑖 + 1).

So, if we plug that in, we get this telescoping product, and the nice thing is there are
going to be a lot of cancellations: this 𝑛 − 2 will cancel out with this one, there will be
an 𝑛 − 3 cancelling out with the next 𝑛 − 3 over here, and so on. Here, if you see, the 3
will cancel out over here, the 4 will cancel out over here, and so on. What you will be left
with is the two terms 𝑛 and 𝑛 − 1 in the denominator and this 2 in the numerator, giving
2/𝑛(𝑛 − 1), which is exactly what we want. So, with that we have proven the lemma ok.
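As a quick sanity check of the telescoping (my own illustration, not part of the lecture), one can multiply out the per-step bounds exactly and confirm the product collapses to 2/𝑛(𝑛 − 1):

```python
from fractions import Fraction

def survival_probability(n):
    """Product of the per-step bounds (n-i-1)/(n-i+1) for i = 1 .. n-2."""
    p = Fraction(1)
    for i in range(1, n - 1):
        p *= Fraction(n - i - 1, n - i + 1)
    return p

# The telescoping product collapses to 2 / (n (n - 1)) for every n >= 3.
for n in range(3, 20):
    assert survival_probability(n) == Fraction(2, n * (n - 1))
print(survival_probability(10))  # 1/45
```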

(Refer Slide Time: 30:01)

So, now, we need to figure out a way to boost the probability of success. Remember, we only
have a probability of success that is something like 2/𝑛(𝑛 − 1), and that is pretty low ok.
So, let us see what happens when we repeat this some number of times.

In particular, we are going to repeat it $C\binom{n}{2}\log n$ times. What is the probability
that all of these $C\binom{n}{2}\log n$ repetitions will fail? Well, the probability that one
of them will fail is at most $1 - 1/\binom{n}{2}$, and since these are all independent
repetitions, the probability that all of them fail is the product over this many repetitions;
you can just raise it to the power $C\binom{n}{2}\log n$. And we have the commonly used
inequality $1 - x \le e^{-x}$, so this $1 - 1/\binom{n}{2}$ quantity is at most
$e^{-1/\binom{n}{2}}$, but then you also have the $C\binom{n}{2}\log n$ in the exponent.

So, overall you get $e$ to the minus this exponent, and the nice thing is that this
$\binom{n}{2}$ over here and the $\binom{n}{2}$ over here cancel out, so this finally works
out to $e^{-C\log n} = 1/n^{C}$ ok. So, the nice thing is, what we have now shown is that the
probability that all of these repetitions will fail is at most one over some arbitrarily
large polynomial in n ok. So, this leads us to an algorithm that will succeed with high
probability: you keep repeating the contraction algorithm that we have discussed,
$C\binom{n}{2}\log n$ times, and each time you repeat, you see what cut you get; if it is
better than all the cuts that you have seen before, you update ok.

So, at the end, you would have the best cut seen in all these many repetitions, and that
candidate min cut will actually be the true min cut with probability at least $1 - 1/n^{C}$,
and that is exactly what we mean by with high probability.
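The boosting argument can be checked numerically (a small illustration of my own, not from the lecture): the bound $(1-x)^t \le e^{-xt}$ predicts that $C\binom{n}{2}\log n$ independent repetitions fail simultaneously with probability at most $n^{-C}$.

```python
import math

def failure_bound(n, C):
    """Probability that all C * C(n,2) * ln(n) independent runs miss C*,
    assuming each run succeeds with probability exactly 1 / C(n,2)."""
    m = math.comb(n, 2)       # C(n, 2), the binomial coefficient
    t = C * m * math.log(n)   # number of repetitions
    return (1 - 1 / m) ** t

# (1 - 1/m)^(C m ln n) <= e^(-C ln n) = n^(-C) for every n and C here.
for n in [10, 50, 200]:
    for C in [1, 2, 3]:
        assert failure_bound(n, C) <= n ** (-C)
print(failure_bound(100, 2))  # a tiny number, at most 100**-2 = 1e-4
```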

(Refer Slide Time: 33:19)

So that is good; we have got the algorithm working correctly. What we need to do is make sure
that we have not designed a very bad algorithm in terms of running time. But keep in mind
that this is a Monte Carlo algorithm: we are not guaranteed to have found the min cut; we
only guarantee that the cut that we claim as a min cut is indeed going to be the min cut with
high probability. And this contraction algorithm of Karger can be carefully designed to run
in $O(n^2)$ time.

So, the overall running time, because remember we also repeat it another $C\binom{n}{2}\log n$
number of times, is going to be $O(n^4 \log n)$, and that is going to be better than the
flow-based algorithm.

(Refer Slide Time: 34:19)

So, let us conclude this segment. We have studied the problem of finding a min cut in a
graph; we discussed an elegant randomized algorithm invented by David Karger and showed that
if you repeat it appropriately, it is going to be correct with high probability, while being
faster than the usual deterministic algorithm that we think about in this context.

(Refer Slide Time: 34:50)

In the next segment, we are going to study Bayes’ law, and we will see it via an example
where we need to understand something about the world around us. We will repeat some
experiments, and as we run these experiments, we will get a better and better understanding
of the world around us. That is what is meant by the prompt for the next segment: how to
understand the world? Play with it! Run some experiments and update your understanding of
the world.

So, with that we end this segment.

Probability & Computing
Prof. John Augustine
Dept. Of Computer Science and Engineering
Indian Institute of Technology, Madras

Module – 01
Introduction to Probability
Lecture - 06
Segment 6: How to Understand the World? Play with it!

We are now on segment 6 of module 1, and the title of this segment is how to understand the
world? Play with it. This actually makes a lot of sense: if you look at a small child, how
does it understand the world around it? It plays little games, and then it builds an
understanding of how the world works, and that is how the child develops.

And what we are going to see is somewhat similar to that, but in a much more simplified
sense. Let us say we want to try and understand and build a probabilistic model of some
system, but we do not know what the right model is. So, what we do is we play with it; we
repeat some experiment to try to understand what the model should be.

And over the course of time, we get a good sense of what the model should be and that is the
goal.

(Refer Slide Time: 01:18)

What we want to look at in this segment is Bayes’ law, which provides the basis for this sort
of playing with the world in order to build the right model ok. And this finds a lot of
application, especially in Bayesian statistics.

(Refer Slide Time: 01:41)

And so, without further ado, let us look at a simple example where this sort of context plays
out. Here is what we mean by the world around us: we have three coins; one of them is biased,
so that it will land heads with probability two thirds, and the other two are unbiased. But
the problem is we do not know which one is biased and which ones are unbiased.

So, we are just going to assume that they are all randomly permuted, and this is the world
that we are given. But we would like to build an improved understanding of this world, build
the right model; we want to find the biased coin, though we do not know which one it is. And
so, what would be a natural thing to do? Well, let us try to play with the world, play with
these coins, toss them around, to see which one is likely to be the biased coin. So, this is
what we mean by playing with our context.

So, we can do that and get a sense of which one is biased, but we want to have a rigorous
quantification of our understanding, and that is where our attempt at modelling the situation
comes in ok.

(Refer Slide Time: 03:11)

So, what we do is, before we start experimenting with these coins, we start with some basic,
reasonable understanding of which coin is biased and which one is not. Because this is before
we start any experimentation, it is often called the prior ok. And let us say we toss each of
these coins once and ask ourselves what the outcome is; say the outcome is heads, tails and
tails for the three coins.

Now, having seen this outcome, we would like to make some inference about which coin is
biased and which coins are unbiased. It will not be a deterministic inference, but a
probabilistic inference, and because this is an inference that we make after the experiment
has been conducted, it is called the posterior understanding or the posterior model.

So, before the experiment we had a prior model, and after the experiment we get a posterior
model. But this is only based on one experiment, so maybe we could try to improve upon this
posterior understanding. And how do we do that? We simply repeat the experiment a few times.

(Refer Slide Time: 04:54)

So, here is a pictorial representation of how we can do that. We start with a reasonable
prior model, we conduct an experiment, and based on the outcome of the experiment, we develop
a posterior model. This posterior model is an improvement over the initial prior model.

So, we then convert that posterior into our prior and conduct the experiment again, and when
we conduct the experiment again, we get an even more improved posterior model. With that as
our prior we repeat the experiment, and so on and so forth. At some point, we are going to
reach a stable situation where there is not much more improvement, and that indicates that we
have reached a good posterior model, so we break out of this loop.

So, this is very typical Bayesian inference in this context. The question is: can we
rigorously quantify our confidence in our understanding? We have run these experiments a few
times; how good is our understanding? These are some of the questions that we would like to
answer.

(Refer Slide Time: 06:14)

And the key underlying principle that will help us approach these questions is Bayes’ law.

So, to understand Bayes’ law, let us consider some events 𝐸1 through 𝐸𝑛; the probabilities
of these events are what we are interested in. This is the model that we want to build, and
we are going to assume that these are disjoint events. 𝐵 is some other event; this is usually
the outcome of some experiment, and based on this outcome, we want to update the
probabilities of these events. So, we basically want to get the probabilities of these events
𝐸𝑗, given that after the experiment we got this 𝐵 as our outcome.

So, this is what we want, and it is a conditional probability, so we can apply the formula;
that is very straightforward. And Bayes’ law is very simple: you just look at the
denominator, which is just the probability of the event 𝐵, and you expand that out using the
law of total probability; similarly, for the numerator you apply the formula for conditional
probability.

But the important thing is that on the right hand side you are going to condition on the 𝐸𝑗
events. So, what you notice here is: on the left hand side we want the probability of the
events 𝐸𝑗 conditioned on the outcome of the experiment, but on the right hand side what we
are going to do is switch the conditionalities.

So, conditioning on the 𝐸𝑗s, we want to plug in the probabilities of the event 𝐵. And why is
this important and useful? Well, because the 𝐸𝑗s on the right hand side correspond to our
prior model.

So, based on our prior assumption, we can actually compute things of this nature: given that
we have a prior understanding of 𝐸𝑗, we will be able to compute the probability of 𝐵. So,
this is something we will typically be able to handle, and that is the type of probabilities
we have on the right hand side.

And because on the right hand side we have conditional probabilities of this type that we can
actually compute, what Bayes’ law gives us is a way to compute the left hand side, where the
conditionality is reversed and the focus now is on getting the probabilities of the 𝐸𝑗s,
which is the posterior model ok.
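The formula being described can be written out explicitly; this is the standard statement of Bayes' law, with the law of total probability expanding the denominator (for disjoint events 𝐸1, …, 𝐸𝑛 that together cover the outcome 𝐵):

```latex
\Pr[E_j \mid B]
  = \frac{\Pr[E_j \cap B]}{\Pr[B]}
  = \frac{\Pr[B \mid E_j]\,\Pr[E_j]}{\sum_{i=1}^{n} \Pr[B \mid E_i]\,\Pr[E_i]}.
```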

(Refer Slide Time: 09:57)

Let us actually work out an example, and things will become a lot clearer ok. So, let us go
back to the coin tossing example: we have three coins and we do not know which one is biased.
So, let 𝐸𝑖 denote the event that the i-th coin is biased ok. So, 𝐸1 means the first coin is
biased, 𝐸2 means the second coin is biased, and 𝐸3 means the third coin is biased, and of
course, these are all disjoint events.

And we want to understand and build the model of which coin is the biased coin ok. Initially
we do not know anything about these coins, so the reasonable prior to begin with is that with
probability one third, each one of them is the biased coin; that is pretty much all we can do
because we do not have any more information about the coins ok.

(Refer Slide Time: 10:56)

So, this is our prior model. Now, let us try to apply Bayes’ law; for that we will need to
run some experiment and use the outcome of that experiment, the event 𝐵, in order to make
some improvements to our model.

So, basically, compute the posterior model. Now, let us say we toss each coin, and 𝐵 is the
combined outcome of those three tosses. Based on 𝐵, we would like to understand what the
probability of 𝐸𝑖 is, given that 𝐵 came out as the outcome.

So, this is what we want to compute; these are the posterior probabilities.

(Refer Slide Time: 11:55)

And let us look at an example to see how this can be worked out. So, let us assume that this
is our outcome 𝐵: the first coin came out heads, and the second and the third coins both
came out tails ok. Given that this is our outcome, how does this influence our understanding
of whether the first coin is the biased one? That is the event 𝐸1; 𝐸1, remember, refers to
the event that the first coin is biased, and we want to understand this probability ok.

Now, all we have to do is apply the Bayes’ law formula, and notice that on the right hand
side the conditionality is switched; we are now conditioning on 𝐸1 ok.

(Refer Slide Time: 12:56)

That is nice, because for each of these we already know the values. We know this one third;
that is the prior model.

We also need, given that each coin is equally likely to be biased, the probability that you
will see the outcome 𝐵. What is that? Well, this one third refers to the probability of 𝐸1.
Remember, 𝐵 is heads, tails, tails ok. So, what is the probability of getting a heads given
𝐸1, which means that the first coin is biased?

So, given that the first coin is biased, what is the probability that that coin will come out
heads? Well, we know that it is two thirds ok. And given that the first coin is biased, what
is the probability that the second coin will come out tails? Well, if the first coin is
biased, then the second and the third coins are unbiased, so they will come out tails with
probability half each ok.

So, that means the probability of the outcome 𝐵 given 𝐸1 is the product two thirds times one
half times one half, because these three coins are tossed independently. So, now, you see
that we are able to get all the numerator values, and in similar fashion you can fill out all
the denominator values as well.

And if you work it out, it comes out to a half. So, if we see the outcome heads, tails, tails, then our posterior understanding of the first coin changes: from one third, which was the prior probability that the first coin is biased, noting that it was the only coin that showed up heads, we have been able to update our belief. It now looks more likely that the first coin is the biased one. We cannot say so deterministically, and we are still far from being sure about it, but certainly our belief that the first coin is the biased coin has been bumped up, and it has gone to about a half.

And if you work out the other two, you will find that these are both a quarter; these are the probabilities that 𝐸2 and 𝐸3 hold, i.e., that the second or third coin is the biased one, given this

particular outcome. And this makes sense: since both of them showed up tails, they are less likely to be the biased coin.
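The update above can be checked mechanically. Here is a minimal Python sketch (the function name `posterior` is mine; the 2/3 heads-probability for the biased coin and the heads-tails-tails outcome follow the lecture's running example, and exact fractions are used so the numbers match):

```python
from fractions import Fraction

def posterior(prior, outcome, p_biased=Fraction(2, 3)):
    """One Bayes'-law update. prior[i] = Pr(coin i is biased);
    outcome is a tuple like ('H', 'T', 'T'), one toss per coin."""
    p_fair = Fraction(1, 2)
    numerators = []
    for i in range(len(prior)):
        # Pr(outcome | coin i is biased): coin i uses the biased
        # probabilities, every other coin is fair.
        likelihood = Fraction(1)
        for j, o in enumerate(outcome):
            if j == i:
                likelihood *= p_biased if o == 'H' else 1 - p_biased
            else:
                likelihood *= p_fair
        numerators.append(likelihood * prior[i])
    total = sum(numerators)  # denominator: law of total probability
    return [num / total for num in numerators]

prior = [Fraction(1, 3)] * 3
post = posterior(prior, ('H', 'T', 'T'))
# post comes out to [1/2, 1/4, 1/4], matching the lecture's numbers
```

Feeding `post` back in as the next prior gives exactly the repeated-refinement loop described next in the lecture.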

So, this one experiment changed our understanding a little bit, but we still do not have complete understanding; we will have to perform this experiment again. This one half, one fourth, one fourth will become our new prior model, and we will have to repeat the experiment to further refine our understanding of the world. That is what is shown over here: we started off with one third, one third, one third, and we ran an experiment.

(Refer Slide Time: 16:50)

We updated our probabilities and used these updated probabilities to run the experiment again. With each repetition we get a better posterior model, namely the individual probabilities of 𝐸1, 𝐸2 and 𝐸3; that becomes our prior model, and we run the

experiment again. We go through this loop a few times; at some point it will stabilize, and we will know that this is the right answer. At that point we are done.

(Refer Slide Time: 17:29)

So, this is the general framework of applying Bayes' law to understand the world around us, and in this segment we have shown how to apply Bayes' law in a fairly simple setting, wherein we try to figure out which of the coins might be the unfair coin.

So, with that we can conclude our discussion of Bayes' law. Of course, this is just a mere introduction; there is a lot more to this.

(Refer Slide Time: 18:14)

And I urge you to Google it and explore some more about it. And with this we come to the end of the first module. So far we have only worked with probabilities and events; an important notion for us, especially in the algorithm design context, is random variables, which we will be studying in detail in the upcoming second module.

Thank you.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Lecture - 07
Tutorial 1

(Refer Slide Time: 00:12)

Let us now look at a very simple problem. In this problem, two players are playing a tournament; in each game there is a winner and a loser, and we stop as soon as one of the players, the overall winner, wins 𝑛 games.

The two players are evenly matched, so if you look at any one particular game in isolation, the probability that a given player wins is exactly a half, independent of all other games. Now the question is: what is the probability that the loser has won 𝑘 out of all the games that were played?

So, we are interested in the case where the winner has won 𝑛 games and the loser has won 𝑘 games. The winner is always going to be the one who wins the very last game: if you list down the sequence of games, the very last game must have been won by the winner.

Among the rest of the games, some were won by the winner and some by the loser, and exactly 𝑘 of them were won by the loser. That is the event we are interested in: the loser wins exactly 𝑘 games among the first 𝑛 + 𝑘 − 1 games.

So, if you fix one such configuration, clearly the probability that you get this configuration is going to be nothing but (1/2)^(𝑛+𝑘). Why is that? Because there are a total of 𝑛 + 𝑘 games played, out of which 𝑘 were won by the losing player, and each game independently goes either way with probability one half.

But we want to ensure that those 𝑘 losses occurred within the first 𝑛 + 𝑘 − 1 games, and we should account for the fact that any such subset of 𝑘 games could have been won by the losing player.

So, that is for one particular configuration. How many such configurations are there? Well, there is a total of 𝑛 + 𝑘 − 1 such games, out of which 𝑘 have to be won by the loser, and so there are C(𝑛+𝑘−1, 𝑘), i.e., 𝑛+𝑘−1 choose 𝑘, such configurations possible, each occurring with this probability. So, the final answer is

C(𝑛+𝑘−1, 𝑘) · (1/2)^(𝑛+𝑘).
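The counting argument can be verified by brute force for small 𝑛. The sketch below (function names are mine) compares the formula against an exact enumeration of all game sequences, interpreting the count as fixing which player is the series winner; padding every sequence to the maximum possible length 2𝑛 − 1 does not change the probabilities:

```python
from fractions import Fraction
from itertools import product
from math import comb

def formula(n, k):
    # The answer above: Pr(a fixed player wins a first-to-n series
    # while the opponent wins exactly k games).
    return Fraction(comb(n + k - 1, k), 2 ** (n + k))

def exact(n, k):
    # Enumerate every length-(2n-1) sequence of game results
    # ('A' or 'B' wins each game); each has probability 1/2^(2n-1).
    # Games after the series has been decided are simply ignored.
    hits = 0
    for seq in product('AB', repeat=2 * n - 1):
        a = b = 0
        for g in seq:
            if a == n or b == n:
                break  # series already over
            a += g == 'A'
            b += g == 'B'
        if a == n and b == k:  # A is the winner, B won k games
            hits += 1
    return Fraction(hits, 2 ** (2 * n - 1))
```

For instance, `exact(3, k)` agrees with `formula(3, k)` for every k from 0 to 2.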

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Lecture - 08
Tutorial 2

(Refer Slide Time: 00:16)

Let us look at exercise 1.5 in the textbook by Mitzenmacher and Upfal, first edition; it is on page 17. What is the question? You have 10 dice; you roll these 10 dice and sum all the numbers that come up on these 10 rolls. Now the question is: what is the probability that the sum is divisible by 6? It might be tricky to think about it as is.

But let us try to apply the principle of deferred decisions. What we are going to do is focus our efforts on the tenth die and assume that the other 9 dice have already been rolled. Let us say that the sum of these nine dice happens to be S, and the tenth die, which is what we are focusing on, comes up with a value X. So, the question becomes: what is the probability that S + X is divisible by 6?

Well, this makes it a lot easier to think about, because whatever the value of S, there is exactly one value of X that, when added to S, makes S + X divisible by 6, and that is not difficult to see. Let us take an example.

Let us say S is equal to 37. X can take the values from 1 to 6, and the only value of X for which S + X is divisible by 6 is X = 5, in which case S + X will be 42; that is the only value for which this divisibility happens.

And of course, what is the probability that X takes the value 5? That is 1 over 6. So, this probability is simply 1 over 6. One thing we need to be careful about is that when we formally write this up, we need to apply the law of total probability, and the law of total probability has to be applied over all possible values of S.

Because S can take a range of values, and for every possible value of S, the probability of S + X being divisible by 6 is 1/6. So, we need to carefully apply the law of total probability to actually arrive at this result, and that is what is expected of you in this exercise.
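This reasoning can be confirmed exactly with a few lines of code. The sketch below (the function name is mine) computes the distribution of the running sum modulo 6 one die at a time, which is essentially the law-of-total-probability bookkeeping described above:

```python
from fractions import Fraction

def prob_sum_divisible(num_dice, d=6):
    # dist[r] = Pr(sum so far is congruent to r mod d);
    # fold in one fair d-sided die at a time.
    dist = {0: Fraction(1)}
    for _ in range(num_dice):
        new = {r: Fraction(0) for r in range(d)}
        for r, p in dist.items():
            for face in range(1, d + 1):
                new[(r + face) % d] += p * Fraction(1, d)
        dist = new
    return dist[0]  # Pr(total sum is divisible by d)
```

`prob_sum_divisible(10)` comes out to exactly 1/6, as the argument predicts; in fact it is 1/6 for any number of dice, since each die is uniform over the residues mod 6.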

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module – 02
Discrete Random Variables
Lecture – 09
Segment 1: Basic Definitions

In the first module we did the basics, the axioms of probability. In the second module we are going to go one step ahead: we are going to try and understand what discrete random variables are and how they help us in the context of computing. So, the plan for today is to define what a discrete random variable is, along with some related notions: independence and expectation.

And these are again things that you might have seen or you might be able to relate to, but
nevertheless let us try to go through them you know formally and clearly.

(Refer Slide Time: 00:52)

So, let us start with a game. I like this game because it is loaded in my favour. What you do is you toss a coin: if it lands heads, you have to pay me two rupees; otherwise I pay you 1 rupee. Clearly a very unfair game, but nevertheless let us go with that.

And so, here, if you think about it, in this game the outcomes are not just heads and tails.

(Refer Slide Time: 01:21)

They are quantities. Let us come to another example that is closer to computing, which is what we ultimately want to understand. It is a simple code snippet: you initialize a variable L to 10,000 and put it into a loop; in each iteration of the loop we check whether L is still greater than 1, and each time you pick a random number between 1 and L and assign that value to L.
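The snippet described above can be written out as runnable code; this is a sketch assuming Python's `random.randint`, which is inclusive on both ends:

```python
import random

def count_iterations(start=10_000):
    # The loop from the lecture: repeatedly replace L by a uniform
    # random integer in {1, ..., L} until L reaches 1, and return
    # how many times the while loop ran.
    L = start
    iterations = 0
    while L > 1:
        L = random.randint(1, L)  # inclusive on both ends
        iterations += 1
    return iterations
```

Running `count_iterations()` a few times gives small counts in practice, even though, as discussed below, the loop could in principle run arbitrarily long.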

So, for a typical code snippet like this, you will be interested in understanding how many times the while loop executes; that would tell us something about the running time of the algorithm, and clearly that is of interest to us in computing. This points us to one thing: it is essential to have a way to quantify things. In the last module we just thought of events as heads or tails, picking a card from a deck of cards, and so on and so forth; those are events.

But we want to associate the events with some quantities and we need to do that precisely.

(Refer Slide Time: 02:38)

And it is not that difficult: we do it in the most straightforward way possible, by defining something called a random variable.

What is a random variable? It is simply a function: it takes as input an element of the sample space and outputs a real value for that outcome. In other words, if you think of the example shown over here, you have the sample space Ω, and the random variable X partitions it: these are all the outcomes that evaluate to 0 under X, and these are all the outcomes that evaluate to 1. So, you can see that this is a very natural thing to do: each outcome has some real-valued quantity associated with it, we assign that quantity to that outcome, and we get the random variable.

That is all a random variable is. We typically use an uppercase letter to denote random variables, and often we are interested in discrete random variables, which means that the number of different values the variable can take is finite or countably infinite.

(Refer Slide Time: 03:58)

So, for example, if you go back to this one, is it finite? Yes, it takes just two values. What about this one? Here we are interested in the number of iterations; the experiment is the execution of this code snippet. Is the number of iterations finite? No. Why not?

Student: I can keep choosing some number right (Refer Time: 04:33).

Yeah. So, this can actually be infinite. Why? Because each time you choose a random number between 1 and L, if you keep on choosing L itself you are not going to reduce the value of L. So, if you think about the possibilities, it is quite possible that you will stay at L for as long as you can think of, and essentially the number of iterations is not finite; but it is countably infinite.

Student: (Refer Time: 05:11). If you keep choosing L, it will still be uniformly random, right?

In terms of the sample space, yes, that is still a possibility. It is not a very likely outcome, but even choosing uniformly at random, with probability 1/𝐿 you will be choosing L again, so it is still a possibility; it has positive probability. It is not likely, especially to keep on choosing L repeatedly, but it is a possibility.

Student: We keep reducing our sample space (Refer Time: 05:46).

It depends on your precise understanding of the experiment. In this case, the experiment is the entire execution of that code snippet. So, the sample space is the set of outcomes after the entire code snippet has executed and you have come out of the while loop. The sample space really, in this case, is: did this code snippet execute once, or twice, or thrice, or four times, five times, and so on; those are the outcomes. So, you see there is an infinite number of outcomes, and with each of those outcomes we associate the number of iterations as the random variable. Keep in mind that there is an infinite number of outcomes possible.

Back to our notion of a random variable, it is important to realize one thing. When you take a random variable X and think of it as having a value, say in this case 𝑋 = − 1, immediately what you are thinking of is an event. Why? Because when you say 𝑋 = − 1, what you have done is isolated your attention to all those outcomes which, when you apply the function X, lead to a value of − 1. So, going back to our Venn diagram, it is clear that this is an event. That is something important to keep in mind.

(Refer Slide Time: 07:18)

So, let us come back to our coin game example, probably my favourite more than yours. What do we have here? There are two outcomes. Let us look at it, just for the fun of it, from my perspective: 2 is what I gain if the coin comes up heads, and − 1 is what I get, a loss for me, if the coin's outcome is tails.

So, now one can ask: what am I likely to get? It is a notion that we need to develop. In this case I am equally likely to gain 2 rupees and to lose 1 rupee; both of these events have equal likelihood.

(Refer Slide Time: 08:16)

So, coming back to this loop: the number of iterations of the while loop in the code snippet is again a random variable, and as we saw, it can take a countably infinite number of values.

(Refer Slide Time: 08:30)

Remember that these are random events. With N denoting the number of iterations, what is the probability that N = 1?

Student: (Refer Time: 08:46).

1/10000, yes: with probability 1/10000 you would choose the value of L to be 1, in which case the loop exits immediately. And what slight change can I make in order to give this a finite set of outcomes? Let us go back to this: how do I make this sample space finite?

Student: (Refer Time: 09:09). Let’s say, stop after some time.

That is one way. So, suppose I just want to play with this line.

Student: (Refer Time: 09:23). 1 to L-1

Yes. So, if you choose a uniform random number from 1 to 𝐿 − 1, you are certainly going to be decreasing L in every iteration, so the number of iterations is not going to exceed 10,000. Now, I am going to leave this as homework: what is the probability that N = 2, i.e., that the loop exits in exactly two iterations? That is a little homework for you.
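For the homework, a quick Monte Carlo sketch (not the requested closed-form derivation; the function name is mine) can at least sanity-check whatever answer you derive:

```python
import random

def estimate_pr_n_equals(target, start=10_000, trials=100_000):
    # Monte Carlo estimate of Pr(N = target), where N is the number
    # of iterations of the while loop above; the exact closed form
    # is the homework, so this only serves as a cross-check.
    hits = 0
    for _ in range(trials):
        L, n = start, 0
        while L > 1:
            L = random.randint(1, L)
            n += 1
        hits += (n == target)
    return hits / trials
```

For example, with `start=2` the loop exits in one iteration exactly when the first pick is 1, so `estimate_pr_n_equals(1, start=2)` should come out near one half.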

(Refer Slide Time: 09:49)

Now that we know what a random variable is, let us think of some standard things. For example, we already know the notion of independence for events, and since random variables are closely tied to events, the notion of independence should be extendable here as well.

But we have to be a little careful about what we mean by two random variables X and Y being independent. They are independent if and only if, and the important thing is that this has to hold for all values x and y, 𝑃𝑟((𝑋 = 𝑥) ∩ (𝑌 = 𝑦)) = 𝑃𝑟(𝑋 = 𝑥) · 𝑃𝑟(𝑌 = 𝑦). This is a very natural definition; you just have to realize that it has to hold for all values x and y.

Think of an example this way. You look at the sample space and divide it up into a grid: X = 1 is this entire column, X = 2 is this entire column, X = 3 is this entire column, while Y = 1, Y = 2 and Y = 3 are the rows. Now, if you take any pair, this condition for independence holds, if you think about it. So, this is one example showing how you can get independence between two random variables.
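Two fair dice give a concrete instance of this grid picture: X is the first die and Y the second. A small sketch (names are mine) that checks the product condition for every pair (x, y), using exact fractions:

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair dice; X is the first die, Y the second.
omega = list(product(range(1, 7), range(1, 7)))
p = Fraction(1, 36)  # every outcome is equally likely

def pr(event):
    # Probability of an event, given as a predicate on outcomes.
    return sum(p for w in omega if event(w))

independent = True
for x in range(1, 7):
    for y in range(1, 7):
        joint = pr(lambda w: w[0] == x and w[1] == y)
        marg = pr(lambda w: w[0] == x) * pr(lambda w: w[1] == y)
        if joint != marg:
            independent = False
# independent stays True: the product rule holds for every (x, y)
```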

(Refer Slide Time: 11:37)

And the notion of mutual independence also extends; we saw mutual independence as part of our tutorial last class. So, that notion extends too.

So, now, if you have k random variables 𝑋1 through 𝑋𝑘, then for any subset I of the indices {1, ..., k}, and for any values 𝑥𝑖 assignable to the random variables 𝑋𝑖 with i in the chosen subset I, the independence condition has to hold. What is that condition? If you take the intersection of all the events 𝑋𝑖 = 𝑥𝑖 for i in I, the probability of that intersection should equal the product of the individual probabilities:

𝑃𝑟(∩_{i ∈ I} (𝑋𝑖 = 𝑥𝑖)) = ∏_{i ∈ I} 𝑃𝑟(𝑋𝑖 = 𝑥𝑖).

This is again a direct extension of what we already have.

Now comes an important topic: expectation. We have already alluded to this; a random variable is basically measuring some quantity. So, for example:

(Refer Slide Time: 12:55)

So, in the game that we played, a natural question I can ask is: how much do I expect to win? If I get lucky, I gain 2 rupees; if I am unlucky, I lose 1 rupee.

What I can do is weight these values, my gain and my loss, by their individual probabilities, and I get the value half: 2 · (1/2) + (− 1) · (1/2) = 1/2. That indicates how much I am likely to get, and it makes sense: if I am lucky I get 2, otherwise I get − 1, and the value one half sits in between.

(Refer Slide Time: 13:37)

In general, you can generalize this notion to any random variable. Consider all the possible values that the random variable can take, denoted by little x over here, and sum them all up, weighted by their probabilities:

𝐸[𝑋] = ∑_𝑥 𝑥 · 𝑃𝑟(𝑋 = 𝑥).

Going back to our loop example: yes, I remember I said that it can repeatedly keep choosing 10,000, but the probability that it will repeatedly keep choosing 10,000 becomes very, very small. So, when you are computing the expected value, you weight each possible value of the random variable by the probability with which that value occurs, and then you sum over all possible values the random variable can take; what you get is the expectation of the random variable 𝑋.

So, this is a very natural quantity to study when you think of a random variable. In fact, it is usually the first quantity you will want to understand about a random variable.

(Refer Slide Time: 14:54)

One quick question: will E[X] always be a legal value that X can take? No; in fact, most of the time it will not be.

For example, in the game that we played, the expectation we calculated was half, and that is not a legal outcome at all; it is just a measure of how much I am likely to get.

(Refer Slide Time: 15:14)

So, let us look at a very simple example. This will serve as an example for what we are asking, but it is also a very important experiment: an experiment with just two outcomes, typically called success and failure, is called a Bernoulli trial.

Any two-outcome experiment is a Bernoulli trial; you can assign one outcome the notion of success and the other the notion of failure. And a typical thing to do is to define a random variable as well, one that takes the value 1 corresponding to success and 0 otherwise.

So, now you can again ask: what is the expectation of X? A typical thing would be to generalize: people often think of Bernoulli trials as fair coin tosses, but more generally success can happen with some probability p, and failure will occur with probability 1 − 𝑝.

So, what is the expectation? It is very simple: you take the individual values, 0 comes with probability weight 1 − 𝑝 and 1 occurs with probability weight p; when you sum them up you get p, and this is the expectation of the random variable X. Another question you can easily ask: for a die, the outcome is a uniformly random number, either 1, 2, 3, up to 6.

If you work it out, you will get the expectation to be 3.5; not very difficult to work that out.

(Refer Slide Time: 17:01)

Can X with countably infinite values have a finite expectation? We already looked at one countably infinite sample space and random variable; can such a variable have a finite expectation? Any thoughts on this, what do you think?

Student: (Refer Time: 17:19).

Yeah, it can. So, let me introduce one more important random variable. Here X is the number of tosses, just think of a fair coin for now, to get the first heads. And if you think about the sample space, you can actually get tails, tails, tails for an arbitrarily long stretch followed by a heads.

So, the sample space is countably infinite. This is called the geometric random variable: the number of times you have to toss before you get the first heads. But intuitively, for a fair coin, if heads and tails are equally likely, within about two tosses you are likely to see the first heads. That is a very intuitive statement I am making; the formal proof we will see in a short while. So, this is an example where you have a countably infinite sample space, but nevertheless the expectation is small.
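This intuition is easy to test empirically. A short simulation sketch (function name is mine; a fixed seed makes the run reproducible):

```python
import random

def tosses_until_first_heads():
    # One geometric trial: toss a fair coin until the first heads
    # shows up, and return the number of tosses used (at least 1).
    n = 1
    while random.random() >= 0.5:
        n += 1
    return n

random.seed(0)
trials = 100_000
avg = sum(tosses_until_first_heads() for _ in range(trials)) / trials
# avg lands close to 2, the expectation claimed in the lecture,
# even though individual trials can be arbitrarily long
```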

(Refer Slide Time: 18:28)

So, we come to the end of this segment. We have introduced the notion of random variables, along with expectation, independence of random variables, and so on. We will study a little more about expectation in the next segment.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module – 02
Discrete Random Variables
Lecture - 10
Segment 2: Linearity of Expectation & Jensen's Inequality

Ok. So, now we start the second segment of module 2, where we will be continuing our discussion on random variables, and we will be looking at two important properties of expectation: one is the linearity of expectation and the other is called Jensen's inequality.

(Refer Slide Time: 00:37)

We will study linearity of expectation through the binomial distribution, a very fundamental distribution that shows up heavily in computing. And we will introduce Jensen's inequality by trying to understand the area of a random square.

So, let us start with linearity of expectation; it is a very useful theorem, let me state it.

(Refer Slide Time: 01:07)

So, let us consider two random variables X and Y with finite expectations; their sample spaces do not need to be finite, but their expectations must be, and a and b are arbitrary constants. We are interested in the expectation 𝐸[𝑎𝑋 + 𝑏𝑌].

Now, when you think about it, 𝑎𝑋 + 𝑏𝑌 is itself a random variable. You can take two random variables, add them, apply functions to them, and the outcome will again be a random variable. Why? Because it is just a composition of functions: remember X itself is a function, Y itself is a function; you compose them and get a new function on the sample space, and therefore it is a random variable.

So, you can ask: what is the expectation of the random variable 𝑎𝑋 + 𝑏𝑌? As it turns out, it equals 𝑎𝐸[𝑋] + 𝑏𝐸[𝑌]. This is the most natural thing you would suspect its value to be, and that is exactly what we get. What is surprising is that, later on, we will see other quantities and other ways of understanding a random variable for which this would not hold unless you have additional conditions, like independence of the random variables. So, here the important thing is that there is no restriction: for any two random variables X and Y, 𝐸[𝑎𝑋 + 𝑏𝑌] = 𝑎𝐸[𝑋] + 𝑏𝐸[𝑌].

So, this is the linearity of expectation, and the proof is also quite simple. What is 𝐸[𝑎𝑋 + 𝑏𝑌]? You consider all possible values of X and Y: possible values x of X and y of Y, and it is just the weighted sum of those values, 𝑎𝑥 + 𝑏𝑦, weighted by the probabilities. Remember a and b are constants, so they do not have probabilities associated with them; but we take the probability that the random variable X takes the value x and the random variable Y takes the value y, and sum over all possible values that X and Y can take. Then we play with the summation.

(Refer Slide Time: 03:41)

Now what do we do? We rearrange the summation. Starting from

𝐸[𝑎𝑋 + 𝑏𝑌] = ∑_𝑥 ∑_𝑦 (𝑎𝑥 + 𝑏𝑦) 𝑃𝑟((𝑋 = 𝑥) ∩ (𝑌 = 𝑦)),

we split it into the 𝑎𝑥 term and the 𝑏𝑦 term. In the first term we write the sum over x first. The constant a is unaffected by the values x and y take, so it comes all the way outside; x stays fixed while we sum over y, so it can come out of the inner summation over y, but not out of the summation over x; and the probability depends on y as well, so it cannot come out of the inner summation. That gives the first term:

𝑎 ∑_𝑥 𝑥 ∑_𝑦 𝑃𝑟((𝑋 = 𝑥) ∩ (𝑌 = 𝑦)).

Similarly, in the second term the constant b comes all the way out, and with the sum over y written first, y comes out of the inner summation over x but not out of the outer one:

𝑏 ∑_𝑦 𝑦 ∑_𝑥 𝑃𝑟((𝑋 = 𝑥) ∩ (𝑌 = 𝑦)).

This is a very straightforward thing to do if you think about it.

Now, what about the inner quantity? 𝑋 = 𝑥 is one event, and you are summing its intersection with 𝑌 = 𝑦 over all possible values of y. What does that remind you of? It is the law of total probability: the events 𝑌 = 𝑦 partition the entire sample space.

(Refer Slide Time: 05:40)

So, what do you get out of this? The inner sum is nothing but the probability that 𝑋 = 𝑥, and similarly you get the probability that 𝑌 = 𝑦 over here. But now, what is ∑_𝑥 𝑥 · 𝑃𝑟(𝑋 = 𝑥)? It is nothing but the definition of 𝐸[𝑋]; similarly the other term is the definition of 𝐸[𝑌]. You get 𝑎𝐸[𝑋] + 𝑏𝐸[𝑌], and that ends the proof of this theorem.
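The theorem can be sanity-checked numerically. The sketch below (names are mine) deliberately uses a dependent pair, X a fair die and Y = X mod 2, to illustrate that linearity needs no independence assumption:

```python
from fractions import Fraction

# A deliberately dependent pair: roll one fair die, let X be its
# value and Y = X mod 2, so Y is completely determined by X.
outcomes = [(v, v % 2, Fraction(1, 6)) for v in range(1, 7)]

def E(f):
    # Expectation of f(X, Y) over the joint distribution.
    return sum(f(x, y) * p for x, y, p in outcomes)

a, b = 3, -5
lhs = E(lambda x, y: a * x + b * y)
rhs = a * E(lambda x, y: x) + b * E(lambda x, y: y)
# lhs equals rhs even though X and Y are heavily dependent
```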

We will now try to apply this theorem; a very simple application, but nevertheless please bear with me, because simplicity here signals importance. This is generally true: when things have a very simple, clear explanation, they will show up again and again, and that means they are important.

(Refer Slide Time: 06:36)

The binomial distribution is simply this. Consider n Bernoulli trials, each with bias p; by bias p we mean that success happens with probability p, or you can think of it as heads occurring with probability p. The number of heads obtained in this experiment follows the binomial distribution. So, you toss a biased coin n times, with probability p of heads each time, and count the number of times heads appears: that count is binomially distributed. We denote it 𝐵(𝑛, 𝑝); the two parameters that define this distribution are the number of coin tosses n and the probability p. And when a random variable X has this binomial distribution, we say X is drawn from 𝐵(𝑛, 𝑝), written 𝑋 ∼ 𝐵(𝑛, 𝑝).

Let us get our feet wet (Refer Time: 07:27): what is the probability that 𝑋 = 𝑖? What are we asking? X can take values from 0, when no heads appear, up to n, when all the tosses come out heads; we pick a particular value i and ask for the probability that X takes that value.

(Refer Slide Time: 07:49)

So, let us focus on a particular situation, where there is a particular set of i locations where we want heads to occur, with tails in the rest.

At each location where we want tails, we are asking for an event of probability 1 − 𝑝, because we want tails to occur there; and at each location where we want heads, that event occurs with probability p.

So, we pick some i locations for H and assign probability p to each of them; the remaining ones get 1 − 𝑝. Since the tosses are independent, you can multiply: if you specify the locations for the heads, the outcome has probability 𝑝^𝑖 (1 − 𝑝)^(𝑛−𝑖). This probability works when you have specified the locations for the heads.

But, obviously, you cannot do that; you do not know where the heads will occur. So, you have to consider all the C(𝑛, 𝑖), i.e., 𝑛 choose 𝑖, possible ways in which the i heads can occur. These are C(𝑛, 𝑖) mutually disjoint events, and for mutually disjoint events you do not multiply, you add them up. There are C(𝑛, 𝑖) of them, each of probability 𝑝^𝑖 (1 − 𝑝)^(𝑛−𝑖); adding 𝑝^𝑖 (1 − 𝑝)^(𝑛−𝑖) to itself C(𝑛, 𝑖) times is where the multiplication comes from. So,

𝑃𝑟(𝑋 = 𝑖) = C(𝑛, 𝑖) 𝑝^𝑖 (1 − 𝑝)^(𝑛−𝑖).
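This formula can be sketched directly in code, using Python's `math.comb` for the binomial coefficient and exact fractions so the disjoint events sum to exactly 1:

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, i, p):
    # Pr(X = i) for X ~ B(n, p): choose which i of the n tosses are
    # heads, then multiply the independent per-toss probabilities.
    return comb(n, i) * p**i * (1 - p)**(n - i)

p = Fraction(1, 3)
pmf = [binom_pmf(10, i, p) for i in range(11)]
# The events X = 0, ..., X = 10 are disjoint and cover everything,
# so these probabilities sum to exactly 1.
```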

Now, let us look at computing the expectation of this random variable.

(Refer Slide Time: 09:47)

So, by the formula, we get

𝐸[𝑋] = ∑_{𝑖=0}^{𝑛} 𝑖 · C(𝑛, 𝑖) 𝑝^𝑖 (1 − 𝑝)^(𝑛−𝑖).

You sum over all possible values i that X can take, from 0, when no heads occur, up to n, when all of them are heads, weighted by their individual probabilities. And I do not know how to deal with this summation directly; I am sure there is some way to get it to work, but it is a little messy. What saves the day for us is that we can apply linearity of expectation.

(Refer Slide Time: 10:42)

So, there is a very simple way to compute the expectation of the binomial distribution: we break the binomial distribution down into individual Bernoulli trials, and then rebuild using linearity of expectation. So, let us break it down first. There are n coin tosses; we define n new random variables, where Xᵢ = 1 if the i-th toss is heads, and Xᵢ = 0 otherwise, ok.

And we know E[Xᵢ] = p. So now, the total number of heads that we are going to get is simply the summation X = X₁ + X₂ + ... + Xₙ: take each toss; if you get a heads, it accounts one towards this capital X, otherwise not.

So, it is just a summation, and clearly you can take expectations on both sides: E[X] = E[X₁ + X₂ + ... + Xₙ]. At this point we do not know what to do with the right-hand side directly, but if you just apply linearity of expectation you immediately do: E[X₁] is p, E[X₂] is p, up to E[Xₙ] which is p, and there are n such terms. So the total expectation is np. That is linearity of expectation for you: very powerful, very simple, and very useful for us.
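The indicator-variable argument can also be sanity-checked by simulation. Below is a small sketch of my own (not lecture code; the parameter values are arbitrary) comparing the empirical mean of the number of heads against np.

```python
import random

def count_heads(n: int, p: float) -> int:
    # X = X_1 + ... + X_n, where X_i = 1 iff the i-th toss is heads.
    return sum(1 for _ in range(n) if random.random() < p)

random.seed(1)
n, p, trials = 20, 0.4, 20000
empirical = sum(count_heads(n, p) for _ in range(trials)) / trials
print(empirical)  # should be close to n*p = 8.0
```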

In the same context of random variables, let us look at Jensen's inequality.

So, let us motivate that by a simple example: suppose you consider a random square. What do I mean by that?

(Refer Slide Time: 12:16)

So, the edge length X is chosen uniformly at random from the range 1 to 99; let us say it is an integer value. So, a question we can ask: what is the expected value of its area? You take X² and ask: what is E[X²]? And one question that should come to mind is whether E[X²] is the same as (E[X])²; in this context, we are going to see how these two relate to each other.

(Refer Slide Time: 12:56)

For that, notice that this X² is a convex function. So, let us just be precise about what we mean by a convex function. Here is the definition; the slide states it as equivalent statements. So, f is convex if the following holds.

Suppose we have this function f shown over here, just this curve. It is convex if, for any two points x₁ and x₂ that you choose, you look at some intermediate point, defined by a parameter λ ∈ [0, 1]: you take λx₁ + (1 − λ)x₂ and get a point somewhere in the middle. That is what is shown in the blue expression here, λx₁ + (1 − λ)x₂. Now you apply f to that point; that is the left-hand side, f(λx₁ + (1 − λ)x₂), giving this purple point on the curve, and you compare it with another point.

So, now, think of f(x₁) over here and f(x₂) over here, and take the corresponding point, parameterized by λ, on the segment connecting f(x₁) to f(x₂); that is λf(x₁) + (1 − λ)f(x₂). The comparison of these two points gives you the definition of whether the curve is convex or not. For convexity, what you need is that the point on the curve should be less than or equal to the intermediate point obtained by joining f(x₁) and f(x₂) with a line segment: f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂).

And that should be true for any choice of x₁, x₂ and any choice of λ, ok. Many familiar functions such as x² and x⁴ are convex. Another way to think about it: if you take the second derivative and it is non-negative, then again you can say that the function is convex, ok.
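The defining inequality is easy to probe numerically. Here is a sketch of mine (not from the lecture) that checks f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂) for f(x) = x² at many random triples.

```python
import random

def chord_above_curve(f, x1, x2, lam):
    """Convexity check at one triple (x1, x2, lambda): the curve at the
    intermediate point must lie on or below the chord joining
    (x1, f(x1)) and (x2, f(x2)). Small slack absorbs float rounding."""
    mid = lam * x1 + (1 - lam) * x2
    return f(mid) <= lam * f(x1) + (1 - lam) * f(x2) + 1e-6

random.seed(0)
f = lambda x: x * x  # second derivative is 2 >= 0, so f is convex
checks = [chord_above_curve(f,
                            random.uniform(-50, 50),
                            random.uniform(-50, 50),
                            random.random())
          for _ in range(10000)]
print(all(checks))  # True
```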

(Refer Slide Time: 15:22)

So, what does Jensen's inequality say? If f is a convex function, then E[f(X)] ≥ f(E[X]), ok. This is a quite fundamental inequality that shows up every once in a while, so let us quickly look at the proof. We are going to do the simple version, where the function has a Taylor expansion; the result holds more generally, just to be clear. And you can actually try proving it more generally: there is an exercise in the textbook which gives you a hint to do so. But for our purposes we will just quickly prove it under the assumption that there is a Taylor expansion.

So, another standard notation: E[X] is often denoted as just µ when the context is clear. So, I am just going to use µ to represent E[X], and remember, µ is essentially just a scalar value. So, f(x) can be written as

f(x) = f(µ) + f'(µ)(x − µ) + f''(c)(x − µ)²/2, for some c between x and µ,

by Taylor's theorem. Note, you take the Taylor series and capture the first two terms as is, but then the rest of the terms are captured by this third (remainder) term over here. The nice thing you have is, recall that convexity gives you that the second derivative is always going to be greater than or equal to 0. So, this third term over here is going to be a non-negative term.

So, what do we do? We simply drop the third term and replace the equality by an inequality: f(x) ≥ f(µ) + f'(µ)(x − µ).

(Refer Slide Time: 17:25)

So, now we apply expectation on both sides: E[f(X)] ≥ E[f(µ) + f'(µ)(X − µ)]. On the right-hand side you have an expectation of a sum of two terms, so you apply linearity of expectation. So far we are doing things that are quite straightforward.

Now, here, what are we doing? This f'(µ) is just a scalar value, ok. Remember E[aX] = aE[X]; we have already seen that. So we simply pull the scalar out, and the second term becomes f'(µ)(E[X] − µ). The expectation of a constant is just the constant itself: E[f(µ)] is again just f(µ), because f(µ) is just a constant or scalar value, and E[X] is anyway just µ. So here you have µ − µ; that term cancels out, and you are left with E[f(X)] ≥ f(µ) = f(E[X]).

(Refer Slide Time: 18:51)

So, this establishes Jensen's inequality, ok. Just going back to the example: if you apply this, what is E[X]? E[X] is going to be 50, since X ranges uniformly over 1 to 99. So (E[X])² = 50² = 2500, but E[X²], if you work it out, is going to be a larger quantity (about 3316.7, in fact). So, that is just an example where it works out that way.

The important thing to keep in mind is that we are actually working our way towards understanding some other measures. So, for example, notice that E[X²] is larger than (E[X])². The difference, as it turns out, is actually an important measure: it tells you how much a random variable tends to deviate from its mean value, its expected value, ok. So, that itself is an important measure.
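For the random square, Jensen's inequality can be verified exactly, since X is uniform over the integers 1 through 99. A quick sketch (my own, not lecture code):

```python
# X uniform on the integers 1..99: compute (E[X])^2 and E[X^2] exactly.
values = range(1, 100)
n = len(values)                        # 99 equally likely outcomes
e_x = sum(values) / n                  # E[X] = 50
e_x2 = sum(v * v for v in values) / n  # E[X^2]

print(e_x ** 2, e_x2)  # 2500.0 vs about 3316.67
assert e_x2 >= e_x ** 2  # Jensen: E[f(X)] >= f(E[X]) for f(x) = x^2
```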

(Refer Slide Time: 19:51)

So, with that we end this segment, where we studied the expected value of a random variable, proved the linearity of expectation, and now Jensen's inequality. With that, we have to get ready for conditional expectation, which gives some more understanding of how expectation works.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module – 02
Discrete Random Variables
Lecture - 11
Segment 3: Conditional Expectation I

Let us start the third segment in our second module. So far, we have looked at the expectation of a random variable.

(Refer Slide Time: 00:25)

What we will do this time is look at extending the idea to include what is called conditional expectation. There are two variants of this notion; we look at the first variant in this segment, and the second variant in the next. They are related notions; it is just that the same term, conditional expectation, is used to denote two slightly different concepts.

Let me motivate that by looking at what is called a branching process. It is a very simple
process, but it is a very very useful process that shows up a lot.

(Refer Slide Time: 01:00)

And so I motivate this in the context of trying to understand the population of a species. So, for example, let us take rabbits. Normally, when you try to model populations, a lot of times what they do is simplify and just focus on the females in the species, ok. So, you start off with one mother, if you will, and you make the assumption that each mother gives birth once, and this is again an assumption, to a random number of daughters.

And this random number is drawn from the binomial distribution with parameters n and p. This is of course just a modelling assumption; the real world may not work exactly this way, but we are just going to use this to understand how things work, ok.

So, let us say the first mother gives birth to some 4 daughters, and in their generation those 4 give birth to their own daughters, some 3, some 2 and so on, and so you have generations where they give birth to daughters. As you can see, this is obviously a population that branches out; so it is often called a branching process, and it shows up a lot in evolutionary systems, and even in computer science, where one process can spawn other processes.

So, what is clear is that if you fix a mother, you know the expected number of daughters, and that is np. But what is not clear is the expected number of daughters in some generation i. In the first generation we know it, but how do we extend that to an arbitrary i-th generation? That is the question.

And of course, intuitively it should be np in the first generation, (np)² in the second generation, (np)³ in the third generation, and so on. There is some intuition here, but what we have studied so far does not guarantee it; the intuition might work out, but the formalism is what we need, and that is what we are going to develop in this and the next segment.

(Refer Slide Time: 03:30)

So, let us see. For example, if you know that at some particular generation there are k mothers, then the expected number of daughters is knp; this is applying linearity of expectation over the k mothers. So, I can write it this way: E[number of daughters | number of mothers = k] = knp, ok. And this is something we just seem to have made up, but actually there is a formal definition in general: you take any two random variables X and Y, and you have what is called the conditional expectation of X conditioned on Y taking a particular value.

(Refer Slide Time: 04:01)

So, in this case, in the previous slide, Y represented the number of mothers and X represents the number of daughters, ok. So, when the number of mothers is a specific value, you can ask what the expected number of daughters is. And as you would expect, the formula is very straightforward. If there were no condition, it would be E[X] = Σ_x x·Pr(X = x); because of the conditioning here, you add the condition to the probability as well:

E[X | Y = y] = Σ_x x·Pr(X = x | Y = y).

Essentially, what are we doing? We are taking the sample space and limiting it to just the one portion where Y = y; that is all we are doing. And you can generalize this a little bit: remember, Y = y is just an event. Unfortunately, we are reusing the letter E here, ok; so, think of E[X | F] for some other event F, which can be written in a similar fashion. Apologies for reusing the letter E here, ok. This is the first type of conditional expectation, where you are conditioning on a particular event.
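The formula translates directly into code. Here is a minimal sketch of mine (the function name and toy distribution are made up) that computes E[X | Y = y] from a joint pmf given as a dictionary.

```python
def cond_expectation(joint: dict, y: int) -> float:
    """E[X | Y = y] from a joint pmf {(x, y): probability}.
    Restrict the sample space to the slice Y = y, renormalize,
    and take the weighted average of x."""
    pr_y = sum(p for (x, yy), p in joint.items() if yy == y)
    return sum(x * p / pr_y for (x, yy), p in joint.items() if yy == y)

# Toy joint distribution of (X, Y), just for illustration.
joint = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.4}
print(cond_expectation(joint, 1))  # Pr(X=1 | Y=1) = 0.4/0.6, about 0.667
```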

(Refer Slide Time: 05:33)

So, let us look at some examples. Let X₁ and X₂ be two random numbers obtained from two independent calls to a random number generator, each uniform over the range 1 to 10; or you can think of it as a 10-sided die, if you will. We have E[X₁] = E[X₂] = 5.5. Let us look at the sum of those two outcomes: X = X₁ + X₂; E[X], we know by linearity of expectation, is 11.

The question is: what is E[X] given that the first random number generator generated 2? In this case, what we are doing is summing over all the values, but X₁ is fixed at 2, so the only variable is really X₂. So, how are we applying this? We are just varying the values that X₂ can take, ranging from 1 to 10: the first outcome is fixed at 2, the second outcome is some value i, and that happens with probability 1/10.

Remember, the whole sample space is limited to just the second random number generator; the first one is fixed at 2, ok. So that is why you have a 1/10 here:

E[X | X₁ = 2] = Σ_{i=1}^{10} (2 + i)·(1/10) = 7.5.

This makes a lot of sense: when the first one is fixed at 2, the second one contributes 5.5 to the expectation, and you get 7.5.
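The corrected value of 7.5 can be confirmed by brute-force enumeration over the 10 × 10 equally likely outcomes; a sketch of mine (not lecture code):

```python
from fractions import Fraction

# All 100 equally likely outcomes of two independent draws from 1..10.
outcomes = [(x1, x2) for x1 in range(1, 11) for x2 in range(1, 11)]

# Condition on X1 = 2: keep only that slice of the sample space,
# then average the sum X1 + X2 over the surviving outcomes.
conditioned = [x1 + x2 for x1, x2 in outcomes if x1 == 2]
e = Fraction(sum(conditioned), len(conditioned))
print(e)  # 15/2, i.e. 7.5
```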

(Refer Slide Time: 08:10)

Another interesting variant; let us hope I have not made any mistakes here. Let us see E[X₂ | X = 3].

Here, what are we doing? This capital X is the sum, so we are conditioning on the fact that the sum is 3, and asking: what is E[X₂]? But now X₂ can only take a few values: it can take either the value 1 or 2; it cannot take the value 3 or more. Why? Because X₁ is at least 1, so if X₂ takes the value 3 or more, the sum would exceed 3 and the conditioning would not hold, ok. So you are left with either 1 or 2:

E[X₂ | X = 3] = 1·Pr(X₂ = 1 | X = 3) + 2·Pr(X₂ = 2 | X = 3).

And if you work that out, just applying the formula for conditional probability, it works out to exactly 1.5, and that should make intuitive sense to you. Given that the sum is 3, X₂ will either be 1 or 2 with equal likelihood, so 1.5 is the right answer, and that is what we get through formal verification.
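This value can also be confirmed by enumeration over the 100 equally likely outcomes (again a sketch with my own variable names):

```python
from fractions import Fraction

outcomes = [(x1, x2) for x1 in range(1, 11) for x2 in range(1, 11)]

# Condition on the event X1 + X2 = 3; only (1, 2) and (2, 1) survive.
slice_ = [x2 for x1, x2 in outcomes if x1 + x2 == 3]
e = Fraction(sum(slice_), len(slice_))
print(e)  # 3/2
```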

(Refer Slide Time: 09:38)

So, some quick properties of this. Basically this is just a repetition of ideas that we already know, but we want to view them through this conditional expectation notion. We know the law of total probability; does it apply when viewed from the conditional expectation point of view? Yes, as it turns out. The claim is this:

E[X] = Σ_y Pr(Y = y)·E[X | Y = y];

the sum over y basically covers the entire sample space, and in each term we have the expectation of X conditioned according to Y.

So, how do we show this? Let us start from the right-hand side: we have this expectation term, and we simply expand it out by applying the formula.

(Refer Slide Time: 10:33)

And from that expression we collect the summations. So, we get Σ_x Σ_y with everything inside of it. But now, what we have over here is Pr(X = x | Y = y)·Pr(Y = y), and that is of course Pr(X = x ∩ Y = y), by the conditional probability formula.

And now, this x does not depend on y, so it can be brought out, and what we have inside is Σ_y Pr(X = x ∩ Y = y). And what is this?

Student: Law of total probability.

Law of total probability; that is going to be just Pr(X = x). So, you get Σ_x x·Pr(X = x) = E[X], ok.
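The claim E[X] = Σ_y Pr(Y = y)·E[X | Y = y] can be checked on any small joint distribution; here is a sketch of mine (the joint pmf is made-up illustrative data):

```python
# Joint pmf of (X, Y) as {(x, y): probability}.
joint = {(1, 0): 0.2, (2, 0): 0.1, (1, 1): 0.3, (3, 1): 0.4}

def pr_y(y):
    return sum(p for (x, yy), p in joint.items() if yy == y)

def e_x_given_y(y):
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / pr_y(y)

ys = {y for _, y in joint}
total = sum(pr_y(y) * e_x_given_y(y) for y in ys)  # right-hand side
e_x = sum(x * p for (x, _), p in joint.items())    # E[X] directly
print(total, e_x)
assert abs(total - e_x) < 1e-12
```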

(Refer Slide Time: 11:23)

So, that is the law of total probability. What about linearity of expectation? Again, that will also hold. I will spare you the proof, but essentially what are we asking? Let the conditioning be on some event F. Then the expectation of the sum of several random variables, each conditioned on the event F, is equal to the sum of their individual expectations, each conditioned on the same F. This is as you would expect, ok; you can work the details out.

Basically, this is a very intuitive thing to state. All that conditioning does is redefine the sample space: it carves out a new sample space, but it is a sample space nevertheless, from which a probability space is defined. Therefore linearity of expectation should hold; that is the intuition here, ok.

(Refer Slide Time: 12:22)

So, back to the branching process. What does this conditional expectation give you? When you condition on the number of mothers, you get the expected number of daughters. What this tells you is: if at the (i − 1)-th generation there are some k mothers, then in the next generation there will be knp daughters on expectation. So, it gives you a step in the induction, if you will, but it still does not tell you what happens over the course of the entire population's evolution.

So, we are still short of understanding what the expected number of daughters in generation i will be, ok.

(Refer Slide Time: 13:18)

This question, the red question, is still not addressed: we know the answer conditioned on the previous generation, but we do not know it in general, ok.

So, that is still something that needs to be addressed, and that is what we are going to address in the next segment, where we will talk about the second notion of conditional expectation. We will revisit this branching process and try to understand the answer to that question, ok.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module – 02
Discrete Random Variables
Lecture - 12
Conditional Expectation II

So, let us start with segment 4 of module 2. Here we will be talking about conditional expectation, as a variation of what we have already looked at; so, I call it conditional expectation II.

(Refer Slide Time: 00:30)

What is the plan for the segment? We will just revisit this branching process that we talked
about in the last segment.

Then, in order to understand this branching process, we particularly want to know the expected number of children at an arbitrary generation i. To understand that question we will introduce this notion of conditional expectation, version 2 if you will, and describe some properties of that notion. Then we get back to the branching process and try to answer that question.

(Refer Slide Time: 01:03)

So, a quick reminder of the branching process. We start with one mother, if you will; that mother gives birth to some number of daughters, those daughters become mothers and give birth to their own daughters, and so on and so forth. The way we are modelling it, each mother gives birth once to a random number of daughters, and the random number is chosen from the binomial distribution with parameters n and p. Each mother, we know, on expectation gives birth to np daughters. You also know, for example, that if there are k mothers in one generation, then in the next generation the expected number is knp.

But what we do not know is how to generalize this to an arbitrary generation i, ok. Intuitively, in the first generation there will be np daughters on expectation, and those np in turn will each produce np daughters, so it will be (np)² in the second generation, and so on and so forth. So in an arbitrary i-th generation, intuitively, it is (np)^i.

We want to make sure that we have a clear formal basis for this intuition, ok.

(Refer Slide Time: 02:19)

So, that brings us to this notion of conditional expectation, the second version, in which we define E[Y | Z]. This E[Y | Z] is itself a random variable: it takes the value E[Y | Z = z] whenever the random variable Z takes the value z.

So, if you think of the sample space as being divided into regions, with various regions corresponding to various values of Z: if you condition on a region, you get an expectation of the random variable Y, and that expectation itself is now a random variable over the entire sample space. That is this E[Y | Z].

(Refer Slide Time: 03:17)

In a similar fashion we can also view the probability of some event E over this random
variable Z.

So, going back to this picture again: Z, let us say, divides up the sample space into regions, and in each region you have a certain conditional probability of the event E. That now becomes a random variable, and that is what we denote by Pr(E | Z).

So, as we talked about, this E[Y | Z] and Pr(E | Z) take on different values based on the value taken by Z; in that sense, they are random variables.

(Refer Slide Time: 04:04)

Let us work out a formula for the conditional expectation in terms of conditional probability; we are asking what E[Y | Z] is.

Now, it is an expectation, so we sum through all possible values of Y, and in each case we just use the conditional probability:

E[Y | Z] = Σ_y y·Pr(Y = y | Z).

So, it is a very natural formula that we can have for the conditional expectation E[Y | Z].

(Refer Slide Time: 04:44)

So, let us work out a simple example; hopefully this will be helpful in making the idea concrete. Recall, we have already talked about this example where we have two random numbers generated uniformly at random between 1 and 10, ok. We call them X₁ and X₂, and X = X₁ + X₂.

So, now we ask: what is E[X | X₁]? We can apply the formula from the previous slide: you are looking for the expectation of X, so you run through all the values that X can take, weighted by their individual probabilities, each conditioned on X₁:

E[X | X₁] = Σ_x x·Pr(X = x | X₁).

Whatever that means; let us see how that works out.

So, now, what are the values that this X can take? Well, it hinges on X₁: whatever value X₁ takes, X can take a value from X₁ + 1 all the way up to X₁ + 10. So this x runs from X₁ + 1 to X₁ + 10, and of course you sum the individual values weighted by their probabilities. Now, what is this Pr(X = x | X₁)? What is this basically talking about? X has to take on a specific value x conditioned on the value of X₁. In other words, since x = X₁ + X₂, that happens exactly when X₂ takes the value x − X₁. What is the probability of that specific event happening? It is going to be 1/10.

So, this whole probability is basically 1/10, and now you can expand this out:

E[X | X₁] = Σ_{x = X₁+1}^{X₁+10} x·(1/10).

If you work out this sum, x ranges from X₁ + 1 to X₁ + 10, so you are going to have X₁ added 10 times, each time weighted by 1/10, which averages out to just X₁; plus you are adding the quantities 1 through 10, each weighted by 1/10. You can think of it as a value X₂ running from 1 to 10, weighted by the probability 1/10. So the second term is basically Σ_{x₂=1}^{10} x₂·(1/10), which works out to 5.5. So, what is it ultimately going to be? This expectation is going to be

E[X | X₁] = X₁ + 5.5.

You can work out the details, but let us look at the intuition here. We are asking what E[X | X₁] is, and remember, this is a random variable. So on the right-hand side you cannot have a constant; you have to again have a random variable. The random variable is X₁, and over and above that random variable you add a 5.5. That makes a lot of sense, because whatever your X₁ value is, your X is going to be, on expectation, 5.5 plus that original value of X₁, ok.

So, hopefully the intuition at least is very clear. I would recommend that you go through the details to convince yourself, ok.
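Since E[X | X₁] is a random variable, we can tabulate its value at every possible X₁ and confirm it equals X₁ + 5.5 in each case; a sketch of mine, using exact fractions:

```python
from fractions import Fraction

def e_x_given_x1(x1: int) -> Fraction:
    # X = x1 + X2, with X2 uniform on 1..10, each value with prob 1/10.
    return sum(Fraction(x1 + x2, 10) for x2 in range(1, 11))

# The conditional expectation is X1 + 5.5 at every value of X1.
for x1 in range(1, 11):
    assert e_x_given_x1(x1) == x1 + Fraction(11, 2)
print(e_x_given_x1(4))  # 19/2
```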

(Refer Slide Time: 09:09)

Here we notice that this E[X | X₁], or any of these conditional expectations, is itself a random variable. So, it is a meaningful question to ask: what is the expectation of E[X | X₁]? Well, let us work it out in this example. We already worked out that E[X | X₁] is X₁ + 5.5, so we can plug that into the interior of this expectation.

And now we get two terms, so we can apply linearity of expectation: it is going to be E[X₁] + E[5.5]. E[5.5] is going to be just 5.5, and E[X₁] is going to be again 5.5; remember, it is a uniform random number between 1 and 10. So it works out to 11. And if you think about it, that is also E[X], because X = X₁ + X₂, so E[X] = E[X₁] + E[X₂] = 5.5 + 5.5 = 11. So, when you take the expectation of the conditional expectation, what you end up getting is just the expectation of the original random variable X, ok.

(Refer Slide Time: 10:39)

So, this is true in the example. The question is: can this be generalized? And as you would expect, yes, you can generalize it.

In general, what we are claiming is: remember, E[Y | Z] is a random variable; if you take the expectation of that random variable, you get back E[Y]. So, I am going to give you an illustrative way to look at it; at least this helped me make some sense out of it, so maybe it will help you as well, and then we will formalize it, ok.

So, what I am going to think of is this experiment. We have a geographic area and we are looking at people in this geography. Think of the sample space as people in some region, and the experiment is to choose a person uniformly at random. For the person that you choose, Y corresponds to the height of that person, and Z corresponds to the district number; there are various districts in this region. So now, what you can think of is E[Y | Z = 2]: basically, restrict yourself to the second district and ask what the expected height of the people in the second district is.

So now, in this context, we ask: what is E[E[Y | Z]]? This is an expectation of a random variable, so let us look at what it is about. Each of these regions has a certain Z value associated with it, and each of them has a value of E[Y | Z]. So, when we want to take the expectation, what do we do? For each of these regions we take E[Y | Z = z], weight it by Pr(Z = z), and sum over all possible z values: that gives us the left-hand side. The way to think about it is: each one of these is a district; you look at the average height for each district, but then you weight it by the size of that district. That is what you have on the left-hand side.

(Refer Slide Time: 13:01)

Now, let us play with this intuition. What happens if someone comes and redraws the districts? Some politician comes and redraws the districts, so now Z becomes Z'. You can still ask what E[Y | Z'] is, now given Z'. And intuitively, what are you doing? You are computing the weighted average height of people across this new districting.

And there is some intuition that should tell you that, look, these are essentially the same. They are just weighted slightly differently based on two different districtings, but they are essentially referring to the same quantity, and that is what is on the right-hand side: E[Y].

Student: (Refer Time: 13:48) only the intuition.

Yes, that is correct, it is only the intuition so far. So, let us work it out more formally now.

(Refer Slide Time: 13:58)

Let us try to work out this intuition, ok. So, let us restate the claim. What we are claiming is that the left-hand side, which we worked out as Σ_z E[Y | Z = z]·Pr(Z = z), the expectation of Y given specific values of Z, weighted by the probabilities of Z taking those values, is equal to E[Y] on the right-hand side.

So, we take the left-hand side: we have this E[Y | Z = z], and we can expand that out, since we have a formula for it. You run through all possible values y that Y can take, weighted by the conditional probabilities, and sum them up. I am just restating that here:

Σ_z Pr(Z = z) Σ_y y·Pr(Y = y | Z = z).

So, now what we are going to do is interchange the summations: I bring the Σ_y outside and take the Pr(Z = z) inside the inner summation. So I get this expression:

Σ_y y Σ_z Pr(Y = y | Z = z)·Pr(Z = z).

(Refer Slide Time: 14:46)

Here, what are we doing? Let us go back: what we have here is a conditional probability, Pr(Y = y | Z = z), times Pr(Z = z); this is simply nothing but Pr(Y = y ∩ Z = z), just the conditional probability formula. Now, how do we get the next term?

Student: Summation over z.

Yes, this is a summation over all z. So, this is what property?

Student: Law of total probability.

Law of total probability, right. You are considering all values of Z, so this covers the entire sample space. So, this is just the law of total probability, and you get Σ_y y·Pr(Y = y), which is E[Y].

Student: What is the initial expression for E[Y | Z], the one inside the bigger expectation?

E[Y] given Z.

Student: (Refer Time: 16:12) How can you write that in terms of Z = z? When you expand, why can you say that E[E[Y | Z]] is that sum over z?

Let us go back; are you talking about this entire expression?

Student: I am talking about the one inside the first expectation.

So, for that we have this other expression.

Student: Yes; so how did we convert that into the Z = z expression?

(Refer Slide Time: 16:27)

Let us look at that. Now we are looking at the expectation of this random variable, right. So, think of it this way: this is an expectation of a random variable, so what is inside is a random variable, ok. What is that random variable going to take on? For each district, if you will (each event, if you will), it is going to take on a value, ok. What is that value? It is going to take on 𝐸[𝑌|𝑍 = 𝑧].

So, in this district it is going to take on the average height of people in that district, that is 𝐸[𝑌|𝑍 = 𝑧]. For each district it is going to take on a different value, ok. So, that is the random variable inside 𝐸[𝑌|𝑍]. When we take the expectation of that, we sum up all those values. So, do not for a minute forget that this is an expectation: you are summing up over all these values, but then you have to weight them by the district probabilities.
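Since the lecture leans on the districts-and-heights picture, here is a minimal sketch in Python that checks the identity 𝐸[𝐸[𝑌|𝑍]] = 𝐸[𝑌] exactly on a toy example. The joint distribution (two districts, a few heights) is entirely made up for illustration; it is not from the lecture.

```python
# Exact check of the tower property E[E[Y | Z]] = E[Y] on a toy
# "districts" example: Z is the district, Y is a person's height.
# The joint distribution below is made up purely for illustration.

# joint[(y, z)] = Pr(Y = y and Z = z); probabilities sum to 1.
joint = {
    (150, 'A'): 0.2, (170, 'A'): 0.1,   # district A, Pr(Z = A) = 0.3
    (160, 'B'): 0.3, (180, 'B'): 0.4,   # district B, Pr(Z = B) = 0.7
}

# E[Y] directly: sum over the whole sample space.
e_y = sum(y * p for (y, _), p in joint.items())

# E[E[Y | Z]]: for each district z, the inner conditional expectation
# E[Y | Z = z] is weighted by Pr(Z = z) -- the "district average" picture.
districts = {z for (_, z) in joint}
e_e_y_given_z = 0.0
for z in districts:
    pz = sum(p for (_, zz), p in joint.items() if zz == z)
    e_y_given_z = sum(y * p for (y, zz), p in joint.items() if zz == z) / pz
    e_e_y_given_z += e_y_given_z * pz

print(e_y, e_e_y_given_z)   # both are ~167: the two computations agree
assert abs(e_y - e_e_y_given_z) < 1e-9
```

Changing the made-up probabilities (as long as they still sum to 1) leaves the two numbers equal, which is exactly the content of the claim.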

(Refer Slide Time: 17:47)

Now, let us go back to the branching process. This is interesting because now we have built the machinery to address this question about the branching process. We want to know, at an arbitrary generation i, what the expected number is without conditioning on the previous generation. If we knew the previous generation, we already had the tools and techniques to answer that question.

(Refer Slide Time: 18:15)

If we cannot condition on the previous generation alone, what do we do? Let us see how we can work this out. So, let us use the random variable 𝑌𝑖 to denote the number of females at the end of generation i. Now we can ask something we already know how to do; it should not be too surprising. What is 𝐸[𝑌𝑖 | 𝑌𝑖−1 = 𝑦𝑖−1]? That is the expected number of females at the end of some generation given that in the previous generation you had a certain specific number.

So, remember, 𝑌𝑖 is a random variable, but 𝑦𝑖−1 is a specific number. If you know that in the previous generation you had a specific number of females, then you know that each one of those females is going to have, on expectation, 𝑛𝑝 daughters. So, what is the total expected number of daughters at the end of the ith generation? It is the number of females in the previous generation times 𝑛𝑝, that is, 𝐸[𝑌𝑖 | 𝑌𝑖−1 = 𝑦𝑖−1] = 𝑦𝑖−1𝑛𝑝. This should be straightforward. So, let us hang on to that.

What we want is 𝐸[𝑌𝑖], and notice there is no conditioning here; we just want to look at the ith generation and ask what the expected number of daughters at the end of the ith generation is. And this is where we go back to the claim that we have here, ok. What we are looking at is the right hand side, and I am going to apply the left hand side.

So, now we know that 𝐸[𝑌𝑖] is nothing but 𝐸[𝐸[𝑌𝑖|𝑌𝑖−1]], ok. This is just applying the claim that we worked out just now, in the opposite direction. And this inner part, 𝐸[𝑌𝑖|𝑌𝑖−1], intuitively is nothing but 𝑌𝑖−1 times 𝑛𝑝: whatever the previous number was, we leave it as a random variable 𝑌𝑖−1, ok, I am not making it specific. But now the 𝑛𝑝 can come out because those are just parameters, numbers, so it comes out by linearity of expectation, giving 𝑛𝑝 𝐸[𝑌𝑖−1], ok.

So, what we have done is made an interesting jump, in the sense that we have made one step of an inductive argument. Basically, what we are saying is: if we can represent the number of females in the previous generation by a random variable 𝑌𝑖−1, the expected number of females in the current generation is going to be 𝑛𝑝 𝐸[𝑌𝑖−1].

Now, what does this mean? Remember, this we could do only because we took the time to understand this second notion of conditional expectation, and we claimed that this notion satisfies the identity on the left hand side. So, now, let us look at what we have done: the expected number of females in the ith generation is 𝑛𝑝 times the expected number of females in the previous generation, ok.

So, then it will be (𝑛𝑝)^2 times the expected number of females in the grandmother generation, and so on. And what we know is that at the very start of this branching process we started with one female, and so 𝑌0, if you will, is 1.

At the end of the first... I guess it depends on how you start counting the generations, but at the very beginning there is just one female. So, you can think of this as basically an inductive argument in which the base case is that initially you start off with one female. So, if you substitute, you are going to get 𝐸[𝑌𝑖] = (𝑛𝑝)^𝑖. We have basically understood, at least in expectation, this branching process.
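The conclusion 𝐸[𝑌𝑖] = (𝑛𝑝)^𝑖 can be sanity-checked by simulation. A rough Monte Carlo sketch, assuming (as a modelling choice of this sketch, not stated in the lecture) that each female independently has Binomial(n, p) daughters; the values n = 3, p = 0.5, 4 generations and 20000 trials are arbitrary illustrative picks:

```python
import random

random.seed(1)

# Monte Carlo check of E[Y_i] = (np)^i for the branching process.
# Each female independently has Binomial(n, p) daughters.
n, p, generations, trials = 3, 0.5, 4, 20000

def binomial(n, p):
    """Number of successes in n independent coin flips of bias p."""
    return sum(random.random() < p for _ in range(n))

total = 0
for _ in range(trials):
    females = 1                      # generation 0: a single female
    for _ in range(generations):
        females = sum(binomial(n, p) for _ in range(females))
    total += females

empirical = total / trials
exact = (n * p) ** generations       # the lecture's (np)^i
print(empirical, exact)              # empirical mean should be close to exact
assert abs(empirical - exact) < 0.3
```

Note that individual runs fluctuate a lot (the process can die out or explode); only the average over many trials tracks (𝑛𝑝)^𝑖, which is precisely the "in expectation" qualifier above.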

(Refer Slide Time: 22:18)

In this segment we studied the second notion of conditional expectation and applied it to this branching process to understand the question of the expected number of female rabbits after i generations. Importantly, we confirmed an intuition: we had an intuitive sense of the expected number of females in the ith generation, and now we have a way to dot the i's and cross the t's and claim that our intuition is in fact rigorously correct.

So, with that we conclude this segment.

(Refer Slide Time: 22:55)

In the next segment we will be talking about geometric random variables, which we have already briefly talked about, and we will apply them to a very important problem called the coupon collector's problem. With that we will end this segment.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module – 02
Discrete Random Variables
Lecture - 13
Discrete Random Variables – Geometric Random Variables & Collecting Coupons

Let us get started. We are in the 5th segment of module 2, where we have been talking about random variables; in particular, we talked about the expectation of a random variable and conditional expectation.

In this segment and the next we are going to talk about some algorithmic ideas. We are going to talk about geometric random variables and then apply them to understanding a problem called the coupon collector's problem; that is going to be the topic for today.

(Refer Slide Time: 00:47)

And of course, we have already seen geometric random variables, so in that sense we will just revisit them. We will understand one important property of this random variable called the memoryless property, then work out its expectation, understand the coupon collector's problem and analyze its expected time, ok.

(Refer Slide Time: 01:15)

So, let us revisit the definition of the geometric random variable. The best way to illustrate it is this: it is the number of flips of a coin with some bias p until you get the first heads, ok. So, this geometric random variable comes with the parameter p, and that shows up in this intuitive definition.

More formally, if you have a geometric random variable X, it has support 1, 2, 3 and so on, all the integers starting from one. When I say support, what that means is these are the values for which the probability is non-zero. So, for i ranging from one onwards, what is the probability that the random variable X takes the value i? It means the previous 𝑖 − 1 flips must have been tails, and that is why you have (1 − 𝑝), the probability that you get tails, raised to the power 𝑖 − 1, followed by the probability p that you get a heads.

So, this is the formal definition of the geometric random variable. For it to be a proper distribution, you need the property that the probability of the sample space equals 1, ok. We can verify that just by summing, over all possible values i, the probability that X takes the value i.

So, now we apply the definition, which we have already seen before, and notice that what we have here is essentially the sum of a geometric series; if you apply that formula you end up getting 1. So, that fits our requirement that the probability of the sample space must be 1.
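That normalization can be checked numerically; a small sketch (the bias p = 0.3 and the truncation point N = 200 are arbitrary choices of this sketch):

```python
# Numerical check that the geometric pmf Pr(X = i) = (1-p)^(i-1) * p
# sums to 1: the partial sum up to N equals 1 - (1-p)^N, which
# tends to 1 as N grows.
p, N = 0.3, 200
partial = sum((1 - p) ** (i - 1) * p for i in range(1, N + 1))
print(partial)                        # ~1.0 up to floating error
assert abs(partial - (1 - (1 - p) ** N)) < 1e-12
assert abs(partial - 1.0) < 1e-12     # (0.7)^200 is astronomically small
```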

(Refer Slide Time: 03:18)

Then comes the very important memoryless property. This is a crucial property; it can sometimes be a little unintuitive. Let us go back to the tossing-of-coins way of looking at this distribution. Suppose you toss the coin a few times and you have been repeatedly getting tails; say you have done this 5 times and repeatedly got 5 tails.

Does that history have any bearing on how many more coin tosses you will need before you get a heads? As it turns out, the history will not have any bearing, because these are independent coin flips, and that is the intuition this memoryless property is capturing. The previous coin tosses, if you were unlucky enough to have gotten tails, will not somehow influence you to get a heads quickly, ok.

And this is something that goes counter to a lot of our thinking, because people talk about things like, "I had good things happen to me and now I am worried about something bad happening," or "bad things have been happening to me, so I ought to be getting a good thing sometime soon." If this memorylessness property shows up in life as well, then that will not be the case. I do not know whether it shows up in life or not, but this distribution is completely memoryless, ok.

So, let us formally see why that is the case. How do we express it formally? We are asking: what is 𝑃𝑟(𝑋 = 𝑖 + 𝑘|𝑋 > 𝑘)? What does that mean? For the first k coin tosses you have observed tails, so you know that X has to be some value greater than k, ok. So, what is this conditional probability on the left hand side? Now look at the right hand side: k is completely eliminated. The fact that you have seen k tails is completely eliminated on the right hand side; it is just 𝑃𝑟(𝑋 = 𝑖).

So, let us see why this is true in a formal way. We take the left hand side and just apply the formula for conditional probability. In the numerator, what do we have? 𝑃𝑟((𝑋 = 𝑖 + 𝑘) ∩ (𝑋 > 𝑘)), ok. If 𝑋 = 𝑖 + 𝑘 and i and k are both positive quantities, then X is automatically greater than k, so what you have in the numerator is just the probability that you get 𝑖 + 𝑘 − 1 tails followed by a heads. In the denominator, let us be a little bit careful: what we have is 𝑃𝑟(𝑋 > 𝑘).

So, if 𝑋 > 𝑘 it can take the value 𝑘 + 1, 𝑘 + 2, and so on, and you have to sum over all those possibilities. So, we run it through a summation: 𝑃𝑟(𝑋 > 𝑘) = ∑_{𝑗=𝑘}^{∞} (1 − 𝑝)^𝑗 𝑝, where the first j coin tosses have to be tails, followed by a heads. Of course, you have p's, one in the numerator and one in the denominator, so they will cancel out.

And one thing I would like you to work out on your own: this summation ∑_{𝑗=𝑘}^{∞} (1 − 𝑝)^𝑗, if you work through it, is going to end up being (1 − 𝑝)^𝑘/𝑝. So, it will come out this way in the derivation: the (1 − 𝑝)^𝑘 will cancel out with the k over here, and you will get (1 − 𝑝)^(𝑖−1) 𝑝, which is nothing but 𝑃𝑟(𝑋 = 𝑖). So, this formally verifies our intuition of the memoryless property.
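The memoryless identity can also be verified directly from the pmf, without any simulation; a short sketch (p, i and k below are arbitrary illustrative values):

```python
# Direct check of the memoryless property
# Pr(X = i + k | X > k) = Pr(X = i) for a geometric(p) variable.
p, i, k = 0.25, 4, 7

def pmf(j):
    """Pr(X = j): j - 1 tails followed by a heads."""
    return (1 - p) ** (j - 1) * p

tail = (1 - p) ** k                  # Pr(X > k): first k flips all tails

conditional = pmf(i + k) / tail      # Pr(X = i + k | X > k)
print(conditional, pmf(i))           # equal up to floating error
assert abs(conditional - pmf(i)) < 1e-12
```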

(Refer Slide Time: 07:43)

So, now let us look at the geometric random variable and ask: what is its expectation? We want to claim that it is 1/𝑝, and this should make intuitive sense. Suppose the coin has a very small bias; that means it is going to take more coin tosses to get the first heads.

Say the bias is 1/10, ok. So, roughly only 1/10 of the coin tosses you see are going to be heads, so you will have to toss roughly 10 times before you see the first heads. That is the intuitive statement, and when we formalize it we will state it as 𝐸[𝑋] = 1/𝑝. So, let us see why that is exactly correct. Let us focus on the first coin flip. The first coin flip can either be a tails or a heads. So, let us define another random variable Y, a Bernoulli random variable: 𝑌 = 0 if the first flip is tails and 𝑌 = 1 if the first flip is a heads. And now 𝐸[𝑋] can be written in this fashion.

So, there are two possibilities: either the first flip is a tails or the first flip is a heads. Let us look at this line: depending on whether you get tails or heads, the expectation becomes conditional on that. So, we have 𝐸[𝑋] conditioned on 𝑌 = 0 here and on 𝑌 = 1 here. Let us see how that plays out.

It will be easier to see this part first. When 𝑌 = 1, what does that mean? It just means the very first coin flip was a heads; this probability itself is p, and the conditional expectation becomes 1 because in the very first flip you got a heads, ok. In the other part the probability is 1 − 𝑝, but what about the conditional expectation? Well, when you say 𝑌 = 0 it means the first flip failed, so you cannot have 𝑋 = 1 in this case. So, instead of conditioning on Y we can condition on 𝑋 > 1, because 𝑋 = 1 is out of the question now.

So, now what is 𝐸[𝑋|𝑋 > 1]? When you think about it, we are now applying the memoryless property. When it is guaranteed that X is greater than 1, the first flip has to be a tails, after which the entire memory is lost and it is like starting the experiment all over again, ok. So, the 1 accounts for the fact that the first coin flip was a tails; the memory of it is completely lost and you have to restart the experiment. So, X conditioned on 𝑋 > 1 is distributed just like X + 1, and 𝐸[𝑋|𝑋 > 1] = 𝐸[𝑋] + 1.

Now, we can apply linearity of expectation (I am skipping a step here) and then multiply through by 1 − 𝑝; you will get these terms and a few cancellations. The p and the −𝑝 will cancel out. Let us watch this (Refer Time: 11:32) a bit carefully: this 𝐸[𝑋] will cancel out that 𝐸[𝑋], this p will cancel out that p, and what you will have is 𝑝𝐸[𝑋] = 1, which then gives this form.

So, this again confirms our intuition that the expectation of a geometric random variable is 1/p.
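A quick numerical check of 𝐸[𝑋] = 1/𝑝, truncating the series ∑ 𝑖 (1 − 𝑝)^(𝑖−1) 𝑝; the choice p = 0.1 matches the bias-1/10 intuition above, and the truncation point N is an arbitrary choice of this sketch:

```python
# Numerical check that E[X] = 1/p for a geometric(p) random variable,
# by truncating the series sum_i i * (1-p)^(i-1) * p.
p, N = 0.1, 2000
e_x = sum(i * (1 - p) ** (i - 1) * p for i in range(1, N + 1))
print(e_x)                       # ~10.0, i.e. 1/p
assert abs(e_x - 1 / p) < 1e-6   # (0.9)^2000 truncation error is negligible
```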

(Refer Slide Time: 12:01)

Now comes a very interesting problem: the coupon collector's problem. A fun way to explain it is this. Suppose you are buying something in a store, and each time you buy you get a nice sticker, ok.

Let us say there are n different types of stickers. Each time you buy this box of chocolates, or whatever it is, you get a random sticker out of the n different stickers the company has made available. And you are asking: how many times should I buy this box of chocolates before I get at least one copy of all the n stickers?

So, let us state that formally. You are given a collection of n coupons (or stickers, whatever you want to call them). When you buy the box of chocolates you get one of them, ok. Think of it as sampling a random coupon, and you repeat this process until you have gotten all the possible coupons at least once, ok.

So, maybe the next time you buy you get this coupon, and the time after that, unfortunately, you get something that you have already seen, ok. Next time you get something new; then you again get unlucky and get something you have already seen before, and finally you get to see the last coupon. At this point you have seen all the 4 different coupons, right. That is the coupon collector's problem. The question is: how many times should we buy the box of chocolates before we get to see all the stickers? Another way of stating it: how many iterations of this procedure should be executed before we have gotten all the coupons? This simple problem shows up in a lot of sampling situations, so it is important to understand it. It can be analyzed quite thoroughly, but for now we are going to focus on understanding the expected number of iterations, ok.

(Refer Slide Time: 14:21)

So, this is the theorem we want to prove. Let X be the number of iterations of the coupon collector's procedure, that is, the number of times you sample. Then 𝐸[𝑋] = 𝑛 ln(𝑛) + Θ(𝑛). How does the proof go? We do a breaking up of X: we break X into smaller pieces 𝑋𝑖, where 𝑋𝑖 is the number of iterations, after you have seen 𝑖 − 1 different coupons, until you see the ith distinct coupon, ok.

So, let us make sure we understand what this means, ok.

(Refer Slide Time: 15:09)

So, this is the timeline: you are sampling over time, and the very first time you buy something you are going to see something new. Your 𝑋1 is the number of iterations, having seen 0 different coupons, until you see the first new coupon. So, 𝑋1 = 1, ok: the very first time we buy something we will get something new. At this point in time you have one coupon, and there are 𝑛 − 1 coupons that you have not seen, ok.

So, now you ask: how many times should I buy before I see one more new coupon? That is going to be your 𝑋2, ok. What is 𝐸[𝑋2]? If you think about it, 𝑋2 is a geometric random variable whose success probability p is (𝑛 − 1)/𝑛, because there are 𝑛 − 1 coupons you have not seen before out of a total of n, and if you get any one of them you have seen a new coupon. And what is 𝐸[𝑋2]? That is 1/𝑝, which is 𝑛/(𝑛 − 1).

Similarly, 𝐸[𝑋3], if you work it out, is going to be 𝑛/(𝑛 − 2), and so on; the pattern continues, ok. This should fit your intuition: early on these quantities are going to be very close to one, 𝐸[𝑋2] is going to be close to one, 𝐸[𝑋3] is going to be close to one, and so on, because early on it is going to be easy to find new coupons, ok. But as you start collecting coupons it is going to get harder and harder to see a new coupon, because every time you buy, it is likely that you are going to find a coupon that you have already collected.

In particular, if you look at the very last one, 𝑋𝑛 is going to take, in expectation, n iterations before you find that coupon, because you have seen 𝑛 − 1 coupons and there is only one coupon that you have not seen out of a total of n. So, your p reduces to 1/𝑛 and the expectation becomes 1/𝑝, which is equal to n, ok, and that should fit your intuition.

(Refer Slide Time: 17:48)

So, now that we know the expectations, let us plug them into our understanding of X. Clearly, we are just breaking time into 𝑋1, 𝑋2 and so on up to 𝑋𝑛, right. So, capital X is simply the summation of these 𝑋𝑖s, and we can apply linearity of expectation and the formula for 𝐸[𝑋𝑖].

So, that is going to be ∑_{𝑖=1}^{𝑛} 𝑛/(𝑛 − 𝑖 + 1); n is common, so you pull it out and get 𝑛 ∑_{𝑖=1}^{𝑛} 1/𝑖.

What is ∑_𝑖 1/𝑖? That is nothing but the nth harmonic number, and we have a formula for that; the textbook goes through the details of how these formulas are arrived at, but we will skip those details. Essentially, the nth harmonic number is between ln(𝑛) and ln(𝑛) + 1, so you can write it as ln(𝑛) + Θ(1), and with that we get the result. Yes?

Student: (Refer Time: 18:52).

Ok, what is... what is the typo?

Student: E of X is Θ of (Refer Time: 18:59), yeah. So, the statement (Refer Time: 19:01).

Oh, ok, yes, yeah, thank you. So, with that we conclude the proof of this theorem.
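The theorem can be checked empirically; a minimal Monte Carlo sketch (n = 50 coupons and 2000 trials are illustrative choices of this sketch, and the comparison is against 𝑛𝐻𝑛, the exact expectation derived above):

```python
import random

random.seed(7)

# Monte Carlo check of the coupon collector expectation E[X] = n * H_n,
# where H_n is the n-th harmonic number (between ln n and ln n + 1).
n, trials = 50, 2000

def collect(n):
    """Draw uniform coupons until all n types have been seen."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

empirical = sum(collect(n) for _ in range(trials)) / trials
exact = n * sum(1 / i for i in range(1, n + 1))   # n * H_n
print(empirical, exact)   # both ~225 for n = 50
assert abs(empirical - exact) < 15
```

For n = 50, 𝑛𝐻𝑛 is about 225, which sits between 𝑛 ln 𝑛 ≈ 196 and 𝑛(ln 𝑛 + 1) ≈ 246, matching the ln(𝑛) + Θ(1) bound on the harmonic number.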

(Refer Slide Time: 19:19)

So, we can conclude this segment. Just to remind ourselves: we revisited geometric random variables and showed that the expectation is 1/𝑝 when the parameter is p. And we looked at the coupon collector's problem and showed that the expected number of times we need to buy the box of chocolates, if you will, is 𝑛 ln(𝑛) + Θ(𝑛).

In the next segment we are going to look at another interesting algorithmic problem: finding the median, or more generally the k-selection problem.

Thank you.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module – 02
Discrete Random Variables
Lecture - 14
Randomized Selection

So, now we are in segment 6 of module 2, and we are going to talk about the problem of randomized selection; a special case of it is the problem of finding the median in an array of numbers.

(Refer Slide Time: 00:30)

What we will do is present a recursive algorithm to find the kth smallest element in an array of n numbers; this is going to be, of course, a randomized algorithm. Chances are that you have seen a deterministic algorithm for this problem, and if you recall, the deterministic algorithm is not a trivial algorithm; it is somewhat complicated. The algorithm that we are going to present today is very simple and elegant, ok.

And we will exercise our understanding of conditional expectation to analyze this algorithm. The reference for this algorithm is Dasgupta, Papadimitriou and Vazirani; it is a basic book on algorithms, but they do not have a complete analysis, so you have to pay attention to the analysis here.

(Refer Slide Time: 01:25)

So, let us formally define the problem. Your input is an arbitrarily ordered array S of n numbers; repetitions are ok. It is arbitrarily ordered; of course, if it were sorted then it would be easy to solve this problem. You are also given an input parameter k, and you are required to find the kth smallest element in the array S. Of course, when k is n/2 you get the median. Importantly, we want this algorithm to be simple, because we already know a complicated algorithm for it, ok.

So, without further ado, let us just make sure the problem is clear to everybody, ok.

(Refer Slide Time: 02:16)

So, let us proceed to understanding the algorithm; it is basically recursive. We are going to call this high level function 𝑠𝑒𝑙𝑒𝑐𝑡(.), and the inputs will be the array S and the parameter k, ok. Here is where the randomness comes in: we start by picking a number uniformly at random from the set S.

So, it is basically some element in this array, ok. We look at that value v and use it to partition S into 3 sub-arrays. This is a very intuitive algorithm if you think about it: you basically run a for loop through the entire list of elements in S, and each element that is < 𝑣 you include in the first sub-array 𝑆←, if you will; whenever the number you encounter is > 𝑣 you include it in 𝑆→ (Refer Time: 03:23), ok; and because of duplications you could have numbers that = 𝑣, which you just put into 𝑆↓, ok. So, you get 3 sub-lists.

(Refer Slide Time: 03:37)

So, here is a quick example. Let us say this is the array that you have, and you pick a random element within the array, say 9. All those numbers that are less than 9 you put in 𝑆←; you have 3 nines, so you put them in 𝑆↓; and there is one element greater than 9, so you put that in 𝑆→.

Let us actually look at this. Suppose you want to find the third smallest element in this array; which sub-list will you look into? Clearly you look into 𝑆←, ok. If, on the other hand, you are looking for the 8th smallest element in the array (there are ten elements in this example), you intuitively know that you only have to look at 𝑆↓, and so on. So, the moment you break the array up into these sub-lists, you know which way to recurse, ok. That is exactly what we are doing here.

(Refer Slide Time: 04:38)

If the input parameter k is at most |𝑆←|, then we know that the answer is going to be somewhere in this sub-array, so we simply invoke 𝑠𝑒𝑙𝑒𝑐𝑡(𝑆←, 𝑘). Otherwise, we check whether the k we are looking for is greater than the cardinality of 𝑆← plus the cardinality of 𝑆↓, ok. If k is greater than these two cardinalities combined, then the kth smallest element is not going to fall in the first two sub-arrays; it is going to be in the third sub-array, ok. So, in that case we will recurse into the third sub-array. But we need to be a bit careful here: what will the parameter be over here?

Student: k minus (Refer Time: 05:39).

𝑘 − |𝑆←|. So, you will have to do 𝑘 − |𝑆←|... exactly.

Student: (Refer Time: 05:57).

Yeah, yeah, ok. So, that is going to be our new parameter. Otherwise, what option are we left with? If we are not recursing into 𝑆← or 𝑆→, we are left with 𝑆↓, but all the elements in the middle array are just the element v. So, we simply return v. It is a very intuitive, simple algorithm. So, any questions?

Student: (Refer Time: 06:38).

Oh, where?

Student: (Refer Time: 06:42) k minus (Refer Time: 06:48) the cardinalities should be (Refer Time: 06:54), it should be k minus of (Refer Time: 06:57).

Ha, ok, yeah, sorry: the parameter should be 𝑘 − |𝑆←| − |𝑆↓|.

Student: (Refer Time: 07:04).

Yeah, thank you, yeah. So, other than that, is the algorithm clear to everybody?
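The algorithm just described can be sketched directly in Python. The function name select and the 1-indexed convention for k are assumptions of this sketch, and the recursion parameter for the right sub-array uses the corrected value 𝑘 − |𝑆←| − |𝑆↓|:

```python
import random

# A direct sketch of the select(S, k) routine from the lecture:
# pick a uniformly random pivot v, partition S into elements
# smaller than v, equal to v, and greater than v, and recurse.
# k is 1-indexed: select(S, 1) is the minimum.
def select(S, k):
    v = random.choice(S)
    left  = [x for x in S if x < v]    # S_left
    mid   = [x for x in S if x == v]   # S_down (copies of v)
    right = [x for x in S if x > v]    # S_right
    if k <= len(left):
        return select(left, k)
    if k > len(left) + len(mid):
        # skip past everything <= v; this is the corrected parameter
        return select(right, k - len(left) - len(mid))
    return v                           # answer lies among the copies of v

# usage: the 3rd smallest of an unsorted list with duplicates
data = [9, 2, 9, 14, 3, 9, 7, 1, 5, 4]
print(select(data, 3))   # 3
assert select(data, 3) == sorted(data)[2]
```

Each recursive call only needs the one sweep over its sub-array that the partition performs, which is exactly the property the analysis below relies on.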

Let us now analyze this algorithm. We are going to see that the running time of this algorithm is 𝑂(𝑛) in expectation. Let us try to understand why that is the case.

(Refer Slide Time: 07:27)

Let us use 𝐿𝑖 to denote the number of elements in the ith recursive call. We start with n elements, so 𝐿0 = 𝑛.

Notice that the time spent in the ith recursive call is proportional to 𝐿𝑖. We might do other things, but essentially what we do in the ith recursive call is sweep through the list of elements we are focusing on, to determine which ones are less than the pivot element and which ones are greater. That requires one sweep, and therefore the time spent in the ith recursive call is proportional to the size of the sub-array being focused on, ok.

(Refer Slide Time: 08:25)

So, now let us make a claim. What we claim is that the expected size of the array at the ith recursive call, that is 𝐿𝑖, given that the size of the array in the previous recursive call was some fixed 𝑙𝑖−1, is at most a fraction 7/8 of the previous size: 𝐸[𝐿𝑖 | 𝐿𝑖−1 = 𝑙𝑖−1] ≤ (7/8) 𝑙𝑖−1, ok.

Let us try to understand why that is the case. Look at the pivot v that we chose. If we are lucky, that pivot is going to fall between the 25th and the 75th percentiles, and clearly the probability of being lucky is one-half. So, how do these ideas play out? Let us be a little more careful: what we want to bound is 𝐸[𝐿𝑖] given this condition.

Well, as we noted, if we are lucky the size of the array is going to shrink; otherwise we could be unlucky, and it is hard to figure out exactly what the size is going to be. But since we are just interested in an upper bound, we can simply assume that there is no shrink in the size of the array at all, ok.

Being lucky and unlucky each happen with probability half. Now, you may ask why we claim that the size of the array shrinks to three-fourths of the previous size when we are lucky. To understand this, you should realize that the two extremes within the lucky region are the 25th percentile and the 75th percentile. These two extremes are symmetric, so let us just focus on what happens if the pivot element we choose is at the 25th percentile.

That is going to divide the array into a quarter and the remaining three-fourths, and the element we are looking for may, in the worst case, lie in the larger of those two parts, which is three-fourths of the array. Therefore, the shrink in the size of the array is a factor of at least three-fourths, provided we get lucky; that is exactly why we have this three-fourths over here. And when you evaluate the expression (1/2)(3/4)𝑙𝑖−1 + (1/2)𝑙𝑖−1 you get (7/8) 𝑙𝑖−1.
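The (7/8) claim can be checked empirically for the median case; a small sketch (array size m = 1001 and the trial count are arbitrary choices, and it simulates only the rank of a uniformly random pivot over distinct elements, which is all the claim depends on):

```python
import random

random.seed(3)

# Empirical check of the shrink claim: when selecting the median,
# one random-pivot partition leaves on average at most (7/8) of the
# elements in the sub-array we recurse into.
m, trials = 1001, 20000
k = (m + 1) // 2                     # rank of the median (1-indexed)

total_next = 0
for _ in range(trials):
    r = random.randrange(1, m + 1)   # rank of a uniformly random pivot
    if r < k:
        total_next += m - r          # recurse into the right part
    elif r > k:
        total_next += r - 1          # recurse into the left part
    # r == k: pivot is the answer, subproblem size 0

avg_next = total_next / trials
print(avg_next, 7 * m / 8)   # average ~0.75*m, comfortably below 7m/8
assert avg_next <= 7 * m / 8
```

The empirical average comes out near (3/4)m because the 7/8 bound deliberately gives away the "unlucky" half of the cases; the true expected shrink is even better than the claim.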

(Refer Slide Time: 11:31)

Now, we want to bound 𝐸[𝐿𝑖]. We know that 𝐸[𝐿𝑖] is nothing but 𝐸[𝐸[𝐿𝑖|𝐿𝑖−1]], ok, and keep in mind that we already have a handle on the conditional expectation inside: it is at most (7/8) 𝐿𝑖−1. We worked it out for a specific value 𝑙𝑖−1, but for the random variable 𝐿𝑖−1 it is also going to be at most (7/8) 𝐿𝑖−1, and by linearity of expectation the (7/8) comes outside, so 𝐸[𝐿𝑖] ≤ (7/8) 𝐸[𝐿𝑖−1].

You can continue this recursively, and as a result we get 𝐸[𝐿𝑖] ≤ (7/8)^𝑖 𝑛, because when the recursion ends at 𝐿0 you will just have n over there, ok.

(Refer Slide Time: 12:41)

So, how does this impact the running time? We claim that the expected running time is 𝑂(𝑛); let us see why that is the case. Let us denote the running time by T.

If you recall, the running time in each recursive call is proportional to the number of elements considered; let r capture that proportionality, so r is some constant. Then the running time T is at most 𝑟(𝐿0 + 𝐿1 + 𝐿2 +...), r times the number of elements in the 0th recursive call, the first recursive call, and so on.

If T is that value, then by linearity of expectation, taking the expectation into the expression, you get 𝐸[𝑇] ≤ 𝑟(𝐸[𝐿0] + 𝐸[𝐿1] + 𝐸[𝐿2] +...), ok. And we have already derived a bound on 𝐸[𝐿𝑖], so we can plug it into this expression. What we get is the sum of a geometric series, which is not hard to evaluate, and we get an upper bound of 8𝑟𝑛 for the expected running time, which, because r and 8 are both constants, is 𝑂(𝑛).
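The final geometric-series step can be checked numerically; a tiny sketch (r = 2 and n = 10000 are arbitrary illustrative constants of this sketch):

```python
# Numerical check of the running-time bound: with E[L_i] <= (7/8)^i * n,
# the series r * sum_i (7/8)^i * n converges to 8*r*n, never exceeding it.
r, n = 2.0, 10_000
series_sum = r * sum((7 / 8) ** i * n for i in range(200))
print(series_sum, 8 * r * n)   # the partial sum approaches 8rn from below
assert abs(series_sum - 8 * r * n) < 1e-3
```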

(Refer Slide Time: 14:18)

So, let us conclude this segment. We introduced the k-selection problem, which is a generalization of the median problem, and provided a very simple randomized algorithm, ok. We analyzed its running time and showed that it runs in 𝑂(𝑛) time in expectation. But there is one question that I want you to ask yourself. This is an expected running time analysis; how much can you trust this answer? Can you be sure that it will not take too much time in the worst case, ok? What does that even mean, ok?

These are questions we need to ask ourselves, because when you claim something in expectation you are not guaranteed to finish in 𝑂(𝑛) time; there could be situations where the actual running time is much higher than n. If you are going to run a mission-critical operation in which this shows up as a critical piece, you need better guarantees than this. So, that is something that should rankle you, and hopefully we will be able to address it in a subsequent module, ok.

So, with that we come to the end of module 2, where we talked about random variables and the expectation of random variables, and we looked at a couple of examples.

(Refer Slide Time: 15:55)

In the next module we will be talking about tail bounds, and we will address exactly this question that is hopefully rankling you. We will try to go beyond expected time analysis (not go away from it; expected time analysis is also important), but we will also try to make claims that hold with high probability. Then we have this extra guarantee, with high probability (and we can actually make it arbitrarily high), that the running times will be good enough for us, ok.

So, that will be the focus of the next module. Actually, not just one module; it might spill over into multiple modules, ok. So, with that we will close module 2.

Thank you.

165
Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology Madras

Module – 03
Tail Bounds I
Lecture - 15
Tail Bounds I - Markov’s Inequality

So, we are now going to start a new module. In this module we are going to be concerned
with tail bounds. What does that mean? Well, in algorithm design we are interested in random
variables like running time, space complexity, and so on, and we want to understand such
random variables. The first level of understanding comes from the expectation of the random
variable, but that is not all we want to know; we also want to understand how the random
variable behaves with high probability. For example, the expectation could be small, but if an
algorithm takes a lot of time every once in a while, then that might be a cause for concern.
So, we want to be able to argue that the probability with which an algorithm takes a long
time is very small, and tail bounds help us make such arguments.

(Refer Slide Time: 01:22)

So, in today’s segment we are going to look at Markov’s inequality, which is pretty much
the first starting point for this sort of tail inequality. We will study Markov’s inequality

166
using randomized selection as the example problem; we will then prove Markov’s
inequality, and then we will apply it to this randomized selection problem.

And we will also see how Markov’s inequality plays out on the binomial distribution. It turns
out that it is actually quite weak in general, but under certain conditions it works, and
even though it is weak in general, it is actually the starting point for all the other tail
inequalities that we will be studying.

(Refer Slide Time: 02:15)

So, without further ado, let us look at the selection problem. We are given an arbitrarily
ordered array S of n numbers; this is the input array, in arbitrary order. We are also given a
parameter k, which specifies the index, in sorted order, of the item that we want to output.
In other words, we want to find the kth smallest element in S, and we would like the
algorithm to be as simple as possible.

(Refer Slide Time: 02:54)

167
So, here is the randomized selection algorithm. It is kind of like quicksort: it is a
recursive algorithm. You take the entire array S and you have this parameter k; if k is, say,
10, we want to output the tenth smallest element. So, of course, k has to be in the range 1
to the size of the array.

Now, just like quicksort, you pick a random pivot v, picked uniformly at random from the set
S, and you partition S into three parts: 𝑆← is all the items in S that are < 𝑣, 𝑆↓ is the
items that are = 𝑣, and 𝑆→ is all the items in S that are > 𝑣. This can be done in a single
sweep through the array S.

(Refer Slide Time: 04:14)

168
And now it is quite intuitive to see where the kth smallest element will lie. If the
cardinality of 𝑆← is greater than or equal to k, then the kth smallest element is in 𝑆←, so
what we can do is simply recurse into that part. On the other hand, if k is strictly greater
than the combined size of the first two parts, 𝑆← and 𝑆↓, then the kth smallest element is
clearly in 𝑆→. So, you recurse into 𝑆→, but you have to pass a slightly updated parameter:
k is no longer the right value, you have to subtract the sizes of 𝑆← and 𝑆↓. Maybe a
picture will help. If you look at the array S, what we have done is split it into 𝑆←, 𝑆↓
and 𝑆→.

Step 3 takes care of the case where the kth smallest element is in 𝑆←; step 4 takes care of
the case where it lies in 𝑆→, and because we are going to recurse into that region alone,
we have to subtract the sizes of the first two regions from k, so that we focus on the right
position within 𝑆→. And of course, if steps 3 and 4 do not apply, that means the kth
smallest element is in 𝑆↓, in which case remember all of these items have value equal to v,
the pivot element we chose. So, we can simply return v.
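As a quick sketch, the procedure just described might look like this in Python (the function name and the list-comprehension partition are my own illustration choices, not from the lecture):

```python
import random

def randomized_select(S, k):
    """Return the k-th smallest element of S (1-indexed): pick a uniformly
    random pivot v, partition S into items < v, == v, > v, and recurse into
    the part that must contain the answer."""
    v = random.choice(S)                      # pivot chosen uniformly at random
    S_left  = [x for x in S if x < v]         # items smaller than the pivot
    S_mid   = [x for x in S if x == v]        # items equal to the pivot
    S_right = [x for x in S if x > v]         # items larger than the pivot
    if len(S_left) >= k:                      # step 3: answer lies left of v
        return randomized_select(S_left, k)
    if k > len(S_left) + len(S_mid):          # step 4: answer lies right of v
        return randomized_select(S_right, k - len(S_left) - len(S_mid))
    return v                                  # otherwise the answer is v itself
```

The partition is a single pass over S (three comprehensions here, purely for readability), and only one of the two recursive calls is ever taken.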

(Refer Slide Time: 06:57)

So, this is how the randomized k-selection algorithm works, and we would like to understand
how good this algorithm is. Recall that we have already looked at this algorithm

169
and we know its expected running time, which is quite good. We studied the expected running
time by defining the random variables 𝐿𝑖 and 𝐿0; recall that 𝐿𝑖 denotes the number of
elements in the ith recursive call, and 𝐿0 is simply the number of items we started with,
which is just |𝑆|.

We showed that 𝐸[𝐿𝑖] ≤ (7/8)^𝑖 𝑛, and using that we were able to show that the expected
running time T is the expectation of the sum of the 𝐿𝑖's, scaled by r. The scaling captures
the fact that 𝐿𝑖 just counts the number of elements, whereas the running time of a call
might be a little bit more, but still only a linear amount more. Working this expectation
out, using linearity of expectation and so on, it comes to 𝑂(𝑛). So, we have seen this
already. What we really want to show now is that the running time is not too large with
high probability. When we claim that some event holds with high probability, what we mean
is that its probability is of the form 1 − (1/𝑛)^𝑑.

An equivalent way of saying the same thing is that the probability of the runtime being
≥ δ𝑛 log(𝑛), for some constant δ, should be at most (1/𝑛)^𝑑. These are two equivalent ways
of stating a high-probability claim: if you look at the good event, the runtime being
𝑂(𝑛 log(𝑛)), you need to show that it holds with high probability; if you focus on the bad
event, the runtime growing larger than some δ𝑛 log(𝑛), then the probability of the bad
event should be very small, at most (1/𝑛)^𝑑. These are two equivalent statements, with the
δ absorbed into the constant of the notation.

(Refer Slide Time: 10:25)

170
So, how are we going to achieve this? We will prove the following theorem: the running time
is 𝑂(𝑛 log(𝑛)) with high probability. And we will prove that 𝐿𝑖 becomes small with high
probability when i becomes Θ(log(𝑛)). So, as you recurse further and further, after roughly
the Θ(log(𝑛))th recursion the size of the array becomes a constant, and once it becomes a
constant the remaining running time is only a constant, so we do not care about it. If we
are able to show this, then the running time is essentially limited by the summation, for i
from 0 to some Θ(log(𝑛)), of the individual 𝐿𝑖's; of course, there might be a scaling
factor r, but that is only a constant. And notice that if after the Θ(log(𝑛))th iteration
𝐿𝑖 becomes small, we can simply stop after summing over the first Θ(log(𝑛)) values of 𝐿𝑖.

And so, this can be upper bounded as follows: essentially, we replace each 𝐿𝑖 by the very
gross upper bound n. So, clearly the running time is upper bounded by ∑_{i=0}^{Θ(log(𝑛))} 𝑛,
and it is not hard to see that this is 𝑂(𝑛 log(𝑛)). So, what remains to be shown is that
𝑃𝑟(𝐿𝑖 ≥ 1) ≤ 1/𝑛 when i is some Θ(log(𝑛)).

(Refer Slide Time: 12:58)

171
So, recall that we have already shown something about the 𝐿𝑖's: we have seen that 𝐸[𝐿𝑖] is
at most (7/8)^𝑖 𝑛, and we can rewrite that by taking the (7/8)^𝑖 into the denominator as
its reciprocal, giving 𝐸[𝐿𝑖] ≤ 𝑛/(8/7)^𝑖.

Now remember, we are interested in the case where i is some Θ(log(𝑛)). So, what we do is
set i to be 𝑐 log_{8/7}(𝑛), where c is some constant. Let us see how this plays out. When
you plug this value of i into the bound, 𝐸[𝐿𝑖] becomes at most 𝑛/(8/7)^{𝑐 log_{8/7}(𝑛)}.
But what is the quantity (8/7)^{log_{8/7}(𝑛)}? That is nothing but n, and with the c in the
exponent it turns out to be 𝑛^𝑐. So, the whole bound becomes at most 𝑛 · (1/𝑛^𝑐), which is
𝑛^{1−𝑐}. So, what have we shown? We have shown that 𝐸[𝐿𝑖] ≤ 𝑛^{1−𝑐} when i takes on this
value. That is pretty small.

For example, when c is 2 this becomes 𝑛^{−1}, and that is a very small fraction. But that
is not exactly what we want; we want to show something like 𝑃𝑟(𝐿𝑖 ≥ 1) ≤ 1/𝑛.
show something like that ok.

(Refer Slide Time: 15:46)

172
So, we need a new tool for this, and this is where Markov’s inequality shows up. We have
been able to bring 𝐸[𝐿𝑖] down to something very small; we now have to exploit that. So,
Markov’s inequality states the following: let X be a non-negative random variable. Then
consider any value a and ask what is 𝑃𝑟(𝑋 ≥ 𝑎); that is upper bounded by 𝐸[𝑋]/𝑎. A lot of
times a picture helps understand what we are talking about. Here is the distribution of X;
it can be a complicated distribution, but we want X to be non-negative, that is an
important requirement.

So, let us say the expectation lies around here. What we are asking is: what is
𝑃𝑟(𝑋 ≥ 𝑎)? That is the area under the shaded portion, and the inequality says that it is at
most 𝐸[𝑋]/𝑎. And you immediately notice that this inequality is only meaningful as long as
a is at least 𝐸[𝑋]: if a is smaller than 𝐸[𝑋], then the right hand side becomes greater
than 1, and that is useless because we know that probabilities are always bounded by 1.

So, that is not very meaningful, and in general Markov’s inequality tends to be very weak,
but it is still fundamental, because its power shows up when the expectation becomes small:
then you are able to get a very small upper bound on this probability. Remember, this is
typically going to be the probability of some bad event, and so you want this right hand
side to be small; the shaded region is the bad region. For

173
example, you do not want your running time to be too large, that is the bad situation, and
you want to argue that the probability of that bad situation is very small.

So, most often you are interested in ensuring that this right hand side is as small as
possible, and in the context of Markov’s inequality that means it is only useful as long as
the expectation is sufficiently small.

(Refer Slide Time: 18:33)

So, let us quickly prove this inequality. Let us define an indicator variable I that is 1
whenever 𝑋 ≥ 𝑎 and 0 otherwise. This immediately tells us that 𝐼 ≤ 𝑋/𝑎: if 𝑋 ≥ 𝑎 then
𝐼 = 1 and the right hand side is at least 1; when X < a, I drops to 0 and the right hand
side is still non-negative, remember 𝑋 ≥ 0 is a requirement of Markov’s inequality.

So, now let us look at 𝐸[𝐼]. Because I is a 0/1 random variable, this works out to just
𝑃𝑟(𝐼 = 1). Why? Because if you work it out it is 𝑃𝑟(𝐼 = 1) × 1 + 𝑃𝑟(𝐼 = 0) × 0, so the
second term vanishes; that is why we simply write it as 𝑃𝑟(𝐼 = 1), and that probability is
nothing but 𝑃𝑟(𝑋 ≥ 𝑎).

So, remember, that is exactly what we are interested in. This 𝑃𝑟(𝑋 ≥ 𝑎) is nothing but
𝐸[𝐼], these are all equalities so far; and since 𝐼 ≤ 𝑋/𝑎, 𝐸[𝐼] is at most 𝐸[𝑋/𝑎], which by
linearity of expectation is nothing but 𝐸[𝑋]/𝑎,

174
and that is exactly what we want. So, now we have a handle on Markov’s inequality, which
says that 𝑃𝑟(𝑋 ≥ 𝑎) ≤ 𝐸[𝑋]/𝑎. Remember this; now we are going to go back to
the selection problem.
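With the proof in hand, a small Monte Carlo experiment can illustrate both that the bound holds and how loose it typically is (this is my own sketch; `markov_check` is a made-up helper, not from the lecture):

```python
import random

def markov_check(sample, a, reps=50_000, seed=1):
    """Empirically compare Pr(X >= a) with Markov's bound E[X]/a,
    where sample(rng) draws one value of a non-negative random variable."""
    rng = random.Random(seed)
    xs = [sample(rng) for _ in range(reps)]
    mean = sum(xs) / reps                       # empirical E[X]
    tail = sum(x >= a for x in xs) / reps       # empirical Pr(X >= a)
    return tail, mean / a

# X = number of heads in 10 fair coin flips, so X >= 0 and E[X] = 5.
tail, bound = markov_check(lambda rng: sum(rng.random() < 0.5 for _ in range(10)), a=8)
assert tail <= bound  # the empirical tail sits far below Markov's bound here
```

Here the true tail is about 0.055 while Markov only guarantees it is below 5/8, which already previews the weakness discussed later in this segment.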

(Refer Slide Time: 21:04)

So, what do we know in the selection problem? When i is some 𝑐 log_{8/7}(𝑛) we get
𝐸[𝐿𝑖] ≤ 𝑛^{1−𝑐}, and remember, this expectation is sufficiently small for that choice of i.
Now, what we are trying to show is something of the form 𝑃𝑟(𝐿𝑖 ≥ 1) ≤ 1/𝑛; this is the form
that we want. To get this form we apply Markov’s inequality; the nice thing is that 𝐿𝑖 is
non-negative, it is the size of an array, so we can apply Markov’s inequality.

So, we ask: what is 𝑃𝑟(𝐿𝑖 ≥ 𝑛^{2−𝑐})? That is at most 𝐸[𝐿𝑖]/𝑛^{2−𝑐}, just by applying
Markov’s inequality. And what is 𝐸[𝐿𝑖]? It is at most 𝑛^{1−𝑐}, so the bound is
𝑛^{1−𝑐}/𝑛^{2−𝑐}, which works out to 1/𝑛. Now, if you set c equal to 2, you get exactly
𝑃𝑟(𝐿𝑖 ≥ 1) ≤ 1/𝑛, and if you increase the value of c beyond 2 the bound only gets
stronger; so, this is exactly what we want.
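Written out as a single chain:

```latex
\Pr\!\left(L_i \ge n^{2-c}\right)
\;\le\; \frac{\mathbb{E}[L_i]}{n^{2-c}}
\;\le\; \frac{n^{1-c}}{n^{2-c}}
\;=\; \frac{1}{n},
\qquad\text{so for } c = 2:\quad \Pr\!\left(L_i \ge 1\right) \le \frac{1}{n}.
```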

175
And with that we complete the proof of the theorem that the running time of this randomized
selection algorithm is 𝑂(𝑛 log(𝑛)) with high probability. What we have shown in this slide
is that 𝑃𝑟(𝐿𝑖 ≥ 1) ≤ 1/𝑛. To recall how the proof completes: given that 𝐿𝑖 is, with high
probability, not even going to reach 1 beyond the first Θ(log(𝑛)) recursive calls, you can
grossly upper bound each of those first 𝐿𝑖's by n, and the summation of these n's over the
Θ(log(𝑛)) iterations is 𝑂(𝑛 log(𝑛)); this holds with high probability.

(Refer Slide Time: 24:21)

So, now we are going to ask a somewhat intriguing question. What we have shown so far is
that the probability of the running time being ≥ δ𝑛 log(𝑛) is at most 1/𝑛; essentially, the
probability that the running time grows too large, where too large here means δ𝑛 log(𝑛), is
at most 1/𝑛. But we also know that the expectation of the running time is 𝑂(𝑛). So, there
is a gap: the expectation is 𝑂(𝑛), but we have only been able to show that the running time
with high probability stays within 𝑂(𝑛 log(𝑛)).

So, the natural question is: can we say something, if not with high probability then at
least with some probability tending towards 1, for the event that T belongs to 𝑂(𝑛)? Say,
what is the probability that T is at most 50𝑛, where 50 is a constant? Can we argue somehow
that that probability will be 1 − 𝑜(1)? The 𝑜(1)

176
can be something like 1/log(𝑛) or 1/√𝑛; what it cannot be is a constant, since 𝑜(1) means
something sub-constant. So, basically, we are asking for a probability that tends to 1 as n
increases. Is that possible? As it turns out, the answer is no. So, you have to live with
this divide between the expected running time and the bound we have on the running time
with high probability; there will have to be a separation.

(Refer Slide Time: 26:43)

So, what we are going to show is that the probability of the bad event, that the running
time is Ω(𝑛) with a large constant, is at least a constant. For the high probability claim
to hold, this bad event should have had probability 𝑜(1), but that is not going to happen,
because we are going to show that it is Ω(1). And we are going to show that in a carefully
constructed manner. We will make the bad event 𝑇 ∈ Ω(𝑛) specific: the bad event we consider
is 𝑇 ≥ 𝑐𝑛/2, where c can be any constant.

So, now we ask: what is 𝑃𝑟(𝑇 ≥ 𝑐𝑛/2)? We are going to argue that this probability is at
least some constant; remember c is a constant, so any fixed function of c is itself a
constant. We will argue that this probability is at least such a constant, which is exactly
what we mean by Ω(1). How do we make this argument?

177
Let us look at the execution of this randomized selection algorithm, but view it from the
point of view that the items are sorted. In reality the items will not be sorted, but for
the sake of analysis we observe what happens in the sorted order. So, this is the sorted
order, and we are going to break at least the first half of the sorted order into little
pieces: there are going to be c such intervals, each consisting of 𝑛/2𝑐 items. Basically,
the first half is 𝑛/2 items, and 𝑛/2 divided into c pieces is 𝑛/2𝑐 items each. In this
case we are going to assume that 𝑘 = 𝑛/2, so we are basically trying to find the median
element.

So, what could go wrong in this sorted view? Well, remember, in each recursive call we pick
a random pivot. So, what is the probability that the very first pivot lands in the very
first interval? The width of that interval is 𝑛/2𝑐, so the pivot lands in that interval
with probability 𝑛/2𝑐 divided by the total size of the array, that is, the current size.
And what about the next iteration?

So, let us say that in the first recursive call you pick a pivot in that region. In the
second recursive call, what is the probability that you pick a pivot in the second
interval? The width is again 𝑛/2𝑐, but the array might now be smaller, because its size has
been reduced a little bit; only a little bit, though, because you have probably eliminated
just the first interval. So, I am not going to worry about the exact size; I am just going
to call it the current size, and so on and so forth.

In each recursive call, the bad thing that could happen is that the pivot gets chosen from
this tiny sliver of items at the far left, and those bad events happen with these
probabilities, 𝑛/2𝑐 divided by whatever the current size is. I am only interested in a
lower bound for this probability, so I can replace it by a smaller quantity. What do I do?
I look at the denominator, the current size; if I replace it by a larger quantity, the
fraction can only become smaller, so it remains a valid lower bound.

So, that being the case, I am going to replace the denominator by n, because n is the full
size of the original array and the current size can only be smaller. Each term becomes
(𝑛/2𝑐)/𝑛; the n cancels out and I get 1/(2𝑐) per call, and over the c calls the product is
(2𝑐)^{−𝑐}, which is a constant. So, what is the final outcome of this argument? We have
been able to show that the probability of the bad event, that the running

178
time is ≥ 𝑐𝑛/2, is at least a constant. So, this means that you will not be able to prove
any high probability result for a running time of 𝑂(𝑛); there is clearly a separation
between the expected time analysis and the running time analysis with high probability.
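Collecting the estimates above into one chain (for k = n/2, with c a constant):

```latex
\Pr\!\left(T \ge \tfrac{cn}{2}\right)
\;\ge\; \prod_{j=1}^{c} \frac{n/(2c)}{\text{current size}}
\;\ge\; \prod_{j=1}^{c} \frac{n/(2c)}{n}
\;=\; \left(\frac{1}{2c}\right)^{\!c} \;=\; \Omega(1).
```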

So, you may wonder at this point, especially those of you who have studied the selection
problem in an undergraduate algorithms course and recall that there exists a deterministic
𝑂(𝑛) time algorithm: why not something better? As it turns out, a little while later we are
actually going to see an improved algorithm which runs in 𝑂(𝑛) time with high probability;
that will be a different algorithm. But this randomized selection algorithm, the one that
resembles quicksort, is unfortunately going to have this sort of chasm between the expected
time analysis and the high probability analysis.

(Refer Slide Time: 34:27)

So, this whole randomized selection exercise allowed us to explore the notion of Markov’s
inequality; we were able to exploit it to get a bound on the running time. Let us now use
Markov’s inequality to bound a binomial random variable. It will help us get a sense of
what the inequality is useful for and how to use it.

179
(Refer Slide Time: 34:51)

So, now, let us assume that X is drawn from the binomial distribution with parameters n and
p = 1/2, meaning n coins are tossed and each coin is unbiased. The question we ask in this
sort of tail bound analysis is: how much does X deviate from its expectation? That is, what
range of values of X has significant probability?

So, here is an experiment that I quickly coded up. I set n equal to 10,000 and I draw X
repeatedly, 1000 times, and I ask how much it deviates from the expectation. Remember, the
expectation here is 𝑛/2, so when n is 10,000, 𝐸[𝑋] should be 5000. The question is how much
X deviates from this 5000, and you may want to pause here to ask yourself what that
deviation will be.

Can we say that most of the time it is going to be within, say, 2500 to 7500, or are you
going to see a wider range of numbers? For example, remember we are repeating the
experiment 1000 times; how many times is X going to be less than 2000, how many times is it
going to be greater than 9000? These are the type of questions that tail bounds ask, and
so you may want to pause a little to try and answer the question intuitively. After you
have paused and given it some thought, let us come back to this. Now, having thought
about it, I will show you a picture that might surprise you.
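The experiment just described can be reproduced in a few lines (my own code, not the lecture's script; the bit-counting trick and the helper name `binomial_halves` are illustration choices):

```python
import random

def binomial_halves(n, reps, seed=0):
    """Draw X ~ Binomial(n, 1/2) `reps` times and return the deviations
    X - n/2. Counting the 1s among n fair random bits is exactly a
    Binomial(n, 1/2) sample."""
    rng = random.Random(seed)
    return [bin(rng.getrandbits(n)).count("1") - n // 2 for _ in range(reps)]

devs = binomial_halves(n=10_000, reps=1000)
# The standard deviation of X is sqrt(n)/2 = 50, so deviations of a few
# hundred are already very rare, and 2500 is out of the question.
print(min(devs), max(devs))
```

Running this, the extremes stay within a couple of hundred of 0, matching the plot the lecture shows next.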

180
(Refer Slide Time: 37:08)

So, here I am plotting on the x axis 𝐸[𝑋] − 𝑋. The 0 here refers to the 5000 mark; 120, for
example, refers to 5000 − 120, that is, the 4880 mark, and −140 refers to 5000 + 140, that
is, 5140. As a rule of thumb, out of these 1000 repetitions, most of the time you are
within plus or minus 120 or 130. We never went below 4850 and we never went above 5150; so,
clearly, we never came anywhere near 2500 or 7500.

So, this binomial distribution, when you think about it, is actually very tightly
concentrated around its expectation, and this is an important intuition that you need to
develop: even though it is random, it is highly predictable. Now let us see if Markov’s
inequality is really any good here. How good is its prediction? Is it able to tell that
this random variable X, drawn from the binomial distribution, is going to be close to its
expectation or not? Let us see what it says.

181
(Refer Slide Time: 39:10)

Let us ask: what is 𝑃𝑟(𝑋 > 5000 + 160)? Markov’s bound is the expectation over the
threshold, which is 5000/5160, about 0.96. And what actually happened? Out of the 1000 reps
that we did for this experiment, none of them exceeded 5160. So, the probability should be
much closer to 0, but we are getting a bound of 0.96. This clearly tells you that Markov’s
inequality is very weak, in this context at least. It is not always weak; sometimes it is
really the best starting point that you can have for these tail bound analyses, but in this
context it is weak. So, we need slightly more nuanced tail bounds to analyse the binomial
distribution, and that is what we will be developing in the rest of this module.
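Numerically, the contrast looks like this (again my own sketch, reusing the bit-counting sampler from before):

```python
import random

rng = random.Random(7)
n, reps, a = 10_000, 1000, 5160

# X ~ Binomial(n, 1/2), sampled as the number of 1s among n fair random bits.
samples = [bin(rng.getrandbits(n)).count("1") for _ in range(reps)]

markov_bound = (n / 2) / a                       # E[X]/a = 5000/5160, about 0.969
empirical = sum(x >= a for x in samples) / reps  # almost always 0 or very near it
print(f"Markov bound: {markov_bound:.3f}, empirical tail: {empirical:.3f}")
```

The true tail probability here is well under 0.001, so the Markov bound of roughly 0.97 overshoots by about three orders of magnitude.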

(Refer Slide Time: 40:22)

182
So, to conclude: in addition to the expected running time, we also typically want high
probability guarantees on the running time. We want to be able to say that the running time
does not exceed some quantity with high probability, or that it exceeds that value only
with very small probability; these are equivalent ways of saying the same thing. We have
looked at the first fundamental step towards these sorts of tail bounds, Markov’s
inequality, and we have noticed that when the expectation can be brought low, Markov is
quite good, but not so good otherwise. It was not very good for the binomial distribution,
where the expectation was already high, like 5000.

So, the trick is to really exploit this feature of Markov’s inequality: whatever random
variable we want to bound, if its expectation is high, instead of bounding it directly we
recast it, so that we are looking at a related random variable with low expectation, and
then apply Markov’s inequality and thereby get tight bounds.

(Refer Slide Time: 41:44)

183
So, with that we conclude the first segment. In the next segment we will look at a slightly
tighter inequality called Chebyshev’s inequality.

184
Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology Madras

Module – 03
Tail Bounds I
Lecture - 16
Tail Bounds I - The Second Moment, Variance & Chebyshev’s Inequality

So, now we are starting the second segment of module three. We will see a few definitions
and introduce the next tail bound.

(Refer Slide Time: 00:21)

So, the definitions will in particular cover the kth moment; the second moment will be used
to define something called the variance, and a related notion called the covariance. We
will run through some examples to understand the notion of variance, and use the variance
to get a better bound called Chebyshev’s inequality. We will see that the binomial
distribution is captured better by Chebyshev’s inequality; that is the goal for this
segment.

185
(Refer Slide Time: 00:58)

The kth moment is simply 𝐸[𝑋^𝑘], and the first moment is something that you have already
seen extensively: 𝐸[𝑋^1] is nothing but 𝐸[𝑋], the expectation. The second moment, along
with the first moment, yields the notion of variance, and if you recall, 𝑥^2 is a convex
function.

So, 𝐸[𝑋^2] is going to be at least (𝐸[𝑋])^2; that was Jensen’s inequality. And that
difference, if you recall from when we looked at Jensen’s inequality, actually measures how
much the random variable deviates from the mean, and that is essentially the idea that we
are going to capture here.

So, 𝑣𝑎𝑟(𝑋) is nothing but 𝐸[(𝑋 − 𝐸[𝑋])^2]; this is one definition. The other definition is
𝐸[𝑋^2] − (𝐸[𝑋])^2. Here 𝐸[𝑋^2] is the second moment and 𝐸[𝑋] is the first moment; putting
the two together you get the variance. These two definitions are equivalent, and a quick
homework for you would be to check that they are in fact equivalent.
So, if you just apply the formulas and run through, you should be able to get them.
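For reference, the homework expansion goes through in one line:

```latex
\operatorname{var}(X)
= \mathbb{E}\!\left[(X - \mathbb{E}[X])^{2}\right]
= \mathbb{E}\!\left[X^{2} - 2X\,\mathbb{E}[X] + (\mathbb{E}[X])^{2}\right]
= \mathbb{E}[X^{2}] - 2(\mathbb{E}[X])^{2} + (\mathbb{E}[X])^{2}
= \mathbb{E}[X^{2}] - (\mathbb{E}[X])^{2}.
```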

186
(Refer Slide Time: 02:29)

So, let us now try to understand what 𝑣𝑎𝑟(𝑋 + 𝑌) is, where X and Y are two random
variables. Remember, this is in line with how we approached expectation: we worked out the
expectation of the sum of two random variables, and we want to know if something similar
comes up with variance as well.

So, we apply the formula: 𝑣𝑎𝑟(𝑋 + 𝑌) is simply 𝐸[(𝑋 + 𝑌 − 𝐸[𝑋 + 𝑌])^2], which, when you
apply the linearity of expectation, becomes 𝐸[(𝑋 + 𝑌 − 𝐸[𝑋] − 𝐸[𝑌])^2]; the whole square,
that is 𝑣𝑎𝑟(𝑋 + 𝑌). And I have coloured it in yellows and blues on the slide because I am
going to regroup the terms according to their colours: inside the square I consider
𝑋 − 𝐸[𝑋] as one term and 𝑌 − 𝐸[𝑌] as the other term.

When you square it you get all the cross terms, and then you apply the linearity of
expectation over all those terms; I will just skip the details. Essentially what you get is
𝐸[(𝑋 − 𝐸[𝑋])^2] + 𝐸[(𝑌 − 𝐸[𝑌])^2] plus 2 times a strange looking object.
𝐸[(𝑋 − 𝐸[𝑋])^2] is nothing but 𝑣𝑎𝑟(𝑋), then you get 𝑣𝑎𝑟(𝑌), plus 2 times this strange
object. And again, another small homework: if you work it out, this object is actually
nothing but 𝐸[𝑋𝑌] − 𝐸[𝑋]𝐸[𝑌], and it is called the covariance.

187
So, if you look at 𝑣𝑎𝑟(𝑋 + 𝑌), we have shown it is 𝑣𝑎𝑟(𝑋) + 𝑣𝑎𝑟(𝑌) plus something. We want
to know whether that something becomes 0, and under what condition it becomes 0, because
that would then parallel what happened with expectation, where 𝐸[𝑋 + 𝑌] = 𝐸[𝑋] + 𝐸[𝑌]
always holds.

(Refer Slide Time: 04:42)

So, this object will become 0 when the two terms 𝐸[𝑋𝑌] and 𝐸[𝑋]𝐸[𝑌] are equal. Can they be
made equal, and under what condition? Let us actually work that out. For 𝐸[𝑋𝑌] you apply
the formula ∑_𝑥 ∑_𝑦 𝑥𝑦 𝑃𝑟((𝑋 = 𝑥) ∩ (𝑌 = 𝑦)), but then you are stuck: you do not know what
to do with 𝑃𝑟((𝑋 = 𝑥) ∩ (𝑌 = 𝑦)), except when the two random variables are independent. If
the two random variables are independent, then you can replace it with
𝑃𝑟(𝑋 = 𝑥) 𝑃𝑟(𝑌 = 𝑦), in which case the whole sum works out to be 𝐸[𝑋]𝐸[𝑌].

So, essentially what that means is that when the two random variables X and Y are
independent, you get something similar to the linearity of expectation; otherwise you lose
it. But nevertheless, this quantity, the covariance, actually has some meaning: when the
variables are independent it is 0, and when they are not independent,
there is some meaning to it.

188
(Refer Slide Time: 06:02)

So, the best way to at least get some sense of it is to actually work out some examples. So, let
us take two random variables let us say they are outcomes of coin tosses and 1 if its heads, 0
if its tails or something like that . In this case so, what is what; what is this quantity of
𝐸[𝑋𝑌] − 𝐸[𝑋]𝐸[𝑌] should be immediate, it is 0 why because these are two independent
random variables ok.

So, let us do something silly here let us limit the sample space to one of these. So, let us let us
just focus on this one ok. So, in this case we are we are not considering this outcome. So, the
sample space is limited to the three outcomes shown in red ok. So, then now and let us say
this is always X,Y; so, what is the 𝐸[𝑋] here? this one is a 0, 0 and 1; so, the 𝐸[𝑋] = 1/3
right.

So, I am trying to work out this thing 𝐸[𝑌]; what is 𝐸[𝑌]? that is also 1/3 and what is the
𝐸[𝑋𝑌]? 0 why because at least one of them is always remaining as 0. So, it is; so, it is going
to be − 1/9 and you can work out something like that for this other example shown here as
well ok. Let us do something similar what happened what about this case? So, let us again
pick the one that will ok.

So, here let us lets again work on this one ok. So, what is the 𝐸[𝑋] here? It is going to be 2/3;
what is the 𝐸[𝑌]? It os going to be 1/3; what is the 𝐸[𝑋𝑌]? 1/3 and so, that will turn out to
be 1/3 − 2/9right. So, it is going to be 1/9 that is what do you think would be the case over

189
here? Well 𝐸[𝑋𝑌]is in this case is 𝐸[𝑋𝑌] is 0. Here if you work it out its going to be what is
that going to be what is 𝐸[𝑋𝑌]? It is it is going to be 1/2 − 𝐸[𝑋] again − 1/4 I think well
1 1
2
+ 4
Why are we going through all of this?

So, it is what we would we sort of guess is its ranging from − 1/4 to 1/4 what is this
measure telling you? So, why is this for example, this 1/4what is the intuition here as to why
the covariance is maximized here?

Student: (Refer Time: 10:09).

Yes, it is like you have tied the two coins together: if one happens to be a tails, the other one is also a tails. So, they are what is called positively correlated. Whereas here, you have tied them together so that when one appears tails, the other one always appears heads. So, they are negatively correlated: when one is high, the other one automatically becomes low, and so on.

So, this covariance is actually a very meaningful object; it measures how connected these two variables are. And when they are completely unconnected, the technical term being independent, it is equal to 0, right. So, this is a very useful object, but for our purposes we will stop at this level of intuition, because what we want to do is use the notion of variance to go on to tail bounds; still, I want to make sure this notion is also clear in our minds, at least at the intuitive level.
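These hand computations can be checked with a few lines of Python; this snippet is my own illustration, not part of the lecture. The four distributions encode the scenarios discussed above: two independent fair coins, the restricted three-outcome sample space, and the two tied-coin cases.

```python
from itertools import product

def cov(outcomes):
    """Covariance E[XY] - E[X]E[Y] from a list of (probability, x, y) triples."""
    ex = sum(p * x for p, x, y in outcomes)
    ey = sum(p * y for p, x, y in outcomes)
    exy = sum(p * x * y for p, x, y in outcomes)
    return exy - ex * ey

# Two fair independent coins: four equally likely (x, y) outcomes.
independent = [(0.25, x, y) for x, y in product((0, 1), repeat=2)]

# Sample space restricted to three outcomes, dropping (1, 1), as in the lecture.
restricted = [(1 / 3, x, y) for x, y in product((0, 1), repeat=2) if (x, y) != (1, 1)]

# Coins tied together: both always land the same way (positively correlated).
tied = [(0.5, 0, 0), (0.5, 1, 1)]

# Coins tied in opposition: one heads forces the other tails (negatively correlated).
opposed = [(0.5, 0, 1), (0.5, 1, 0)]

print(cov(independent))  # 0.0
print(cov(restricted))   # about -1/9
print(cov(tied))         # 0.25
print(cov(opposed))      # -0.25
```

Each distribution is just a list of (probability, x, y) triples, so the same `cov` helper covers all four cases.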

(Refer Slide Time: 11:10)

So, let us quickly look at some variances. So, let us look at the Bernoulli random variable.

(Refer Slide Time: 11:17)

So, 𝑋 = 1 with probability 𝑝 and 𝑋 = 0 with probability (1 − 𝑝); then 𝐸[𝑋] = 𝑝 and 𝐸[𝑋²] = 𝑝, because when you square a 0-1 random variable it remains a 0-1 random variable. So, if you run 𝑣𝑎𝑟(𝑋) through the formula, you are going to get 𝑝(1 − 𝑝). The binomial random variable is just the summation of n Bernoulli random variables; so, due to independence, you can just sum up their variances.

(Refer Slide Time: 11:52)

So, you get 𝑛𝑝(1 − 𝑝). The geometric random variable, on the other hand, is a little bit trickier, because you do not have this neat summation; you do not know how long you are going to do these iterations, right.

Student: The binomial (Refer Time: 12:03) they are independent. So, we.

Sorry did I say dependent?

Student: No.

They are independent; that is why the 𝑣𝑎𝑟(𝑌) can be written as the sum of the variances of the individual 𝑋𝑖's. So, here I am using 𝑋𝑖's to denote them.

Student: Can you get a similar expression for the covariance here?

The covariances will be 0, right; oh I see, you are talking about the more general thing where it is not just 𝑋 + 𝑌.

Student: (Refer Time: 12:35).

So, that will be a little bit more complicated, but it is something that can be written down. But since all of those covariance terms are going to be 0 because of independence, and we are kind of sweeping some details under the rug for now, essentially the variance of the sum is just the sum of the variances.
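As a quick sanity check (my own, not from the lecture; the parameter values are arbitrary), one can compute the binomial variance directly from the pmf and compare it with the 𝑛𝑝(1 − 𝑝) that summing the Bernoulli variances predicts:

```python
import math

def binom_variance_exact(n, p):
    """Var(X) for X ~ Binomial(n, p), computed directly from the pmf."""
    pmf = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    return sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

n, p = 20, 0.3
# The sum of the n Bernoulli variances (all covariance terms vanish):
print(binom_variance_exact(n, p), n * p * (1 - p))  # both 4.2, up to float error
```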

Let us now consider the geometric random variable: X is a geometric random variable with parameter p. So, this means you toss a coin with bias 𝑝 until you see the first heads. And by now we should know that 𝐸[𝑋] = 1/𝑝; you can work that out. Our question now is: what is the variance of this random variable X? Towards understanding the variance, we first want to compute 𝐸[𝑋²], and that is what we are going to do now.

(Refer Slide Time: 13:40)

Let us consider what 𝐸[𝑋²] is, and we will write it out using the law of total expectation. Now, to compute 𝐸[𝑋²], you can break the universe into two parts. The first part corresponds to where the first coin flip lands tails, and the second part corresponds to the first flip landing heads.

So, now our 𝐸[𝑋²] is split into these two parts, and of course, they have to be weighted by their corresponding probabilities. Let us look at the second expression, the one where the first flip is heads. If the first flip is heads, then X is going to be 1, because you have seen the heads and therefore you are not going to toss anymore. So, then 𝑋² is also going to be 1; this whole term becomes a 1. And we know the probability of flipping a heads is 𝑝, because it is a 𝑝-biased coin. So, the second term here is just going to be 𝑝.

Let us now consider the first expression. Well, the probability that the first flip is a tails is 1 − 𝑝, and let us consider what happens when the first flip is tails. This is the geometric random variable, so there is this memorylessness property: you have to set aside the first coin toss and then start all over and wait for the first heads to occur. So, in essence, after the first flip has been set aside, you really are in a situation where you have to repeat the geometric random variable from the start.

So, that is why we have 𝑋', which is another geometric random variable with the same parameter 𝑝, plus 1; this 1 is where you set aside the first coin toss, and then you start over with a geometric random variable 𝑋'. This whole thing is within a square, so it is going to be (𝑋' + 1)², and we want the expectation of that; it is going to be this part.

And now, since it is a square, you can expand it and apply linearity of expectation, and we will be getting 𝐸[(𝑋')²] + 2𝐸[𝑋'] + 1. So, when we expand it out, we are going to get this expression for 𝐸[𝑋²]. But that is just 𝐸[𝑋²]; the variance is given by 𝐸[𝑋²] − 𝐸[𝑋]², that is the formula for variance if you recall. So, if you apply that, you are going to end up with the variance of a geometric random variable with parameter 𝑝 being (1 − 𝑝)/𝑝². So, with that we have seen a couple of examples of computing the variance for random variables.
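As a check on this derivation (again my own sketch, not from the lecture; the bias and sample count are arbitrary), one can simulate the coin-tossing experiment and compare the empirical mean and variance against 1/𝑝 and (1 − 𝑝)/𝑝²:

```python
import random

def geometric_sample(p, rng):
    """Number of p-biased coin tosses up to and including the first heads."""
    tosses = 1
    while rng.random() >= p:  # each toss is heads with probability p
        tosses += 1
    return tosses

rng = random.Random(0)
p = 0.25
samples = [geometric_sample(p, rng) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, 1 / p)          # empirical mean vs E[X] = 1/p = 4
print(var, (1 - p) / p**2)  # empirical variance vs (1 - p)/p^2 = 12
```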

(Refer Slide Time: 17:52)

We are now ready to talk about this tail inequality called Chebyshev's inequality. And as you may expect, Chebyshev's inequality depends on the variance of a random variable; in order to be able to apply this inequality, you need to know the variance of that random variable. So, let us take X to be any random variable; notice that X need not be non-negative, X can be anything. And now 𝑎 is some parameter. In this context we are not just going to bound the upper tail like we did in Markov's inequality; we are going to bound both the upper and the lower tails.

So, precisely speaking, we are interested in the event |𝑋 − 𝐸[𝑋]| ≥ 𝑎. And if you think about it, that corresponds to these regions within the distribution. So, this is your expectation, and you want to talk about 𝑋 − 𝐸[𝑋]; that can fall anywhere on this line, and you are particularly interested in the event that 𝑋 − 𝐸[𝑋] ≥ 𝑎 or, because we consider the absolute value, 𝑋 − 𝐸[𝑋] ≤ − 𝑎. These are the two things that we care about, and that is why the areas under these shaded portions are what we care about. And that tail is bounded by 𝑣𝑎𝑟(𝑋)/𝑎²; so, this is Chebyshev's inequality.

(Refer Slide Time: 19:48)

And it is not hard to prove this; so, let us see how it can be proved. This is what we want, straight from the statement, and inside this probability we have |𝑋 − 𝐸[𝑋]| ≥ 𝑎. Now just square the terms on both sides of this inequality; it is not going to change the event at all. In fact, it has a nice property: it gets rid of the absolute value, and the square on the left side is going to be non-negative.

And so, now this term becomes 𝑃𝑟((𝑋 − 𝐸[𝑋])² ≥ 𝑎²). Because the left side is a non-negative random variable, we can simply apply Markov's inequality, and you can now see that this more fancy looking tail inequality essentially depends on Markov's inequality.

So, (𝑋 − 𝐸[𝑋])² is a non-negative random variable, and the parameter here is 𝑎² in this context. So, we can apply Markov's inequality and we get the expectation of that random variable divided by 𝑎²; that is Markov's inequality. And what is 𝐸[(𝑋 − 𝐸[𝑋])²]? That is nothing but the 𝑣𝑎𝑟(𝑋); that is just another formula for the 𝑣𝑎𝑟(𝑋). And so, the bound is going to be 𝑣𝑎𝑟(𝑋)/𝑎², which is exactly Chebyshev's inequality.
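The whole chain, exact tail probability versus the 𝑣𝑎𝑟(𝑋)/𝑎² bound, can be verified by enumeration on any small distribution; here is a sketch of my own using a fair six-sided die (not an example from the lecture):

```python
def tail_prob_and_bound(pmf, a):
    """Exact Pr(|X - E[X]| >= a) and the Chebyshev bound var(X)/a^2.

    pmf maps each value of X to its probability."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    tail = sum(p for x, p in pmf.items() if abs(x - mean) >= a)
    return tail, var / a**2

die = {x: 1 / 6 for x in range(1, 7)}  # a fair six-sided die
for a in (1, 2, 2.5):
    tail, bound = tail_prob_and_bound(die, a)
    print(a, tail, bound)  # the exact tail never exceeds the bound
```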

(Refer Slide Time: 21:50)

And the same inequality can be slightly rewritten in a couple of forms; so, here is one form. If you write this tail as 𝑡𝐸[𝑋], you simply get the 𝑡 in over here; the denominator is going to be the square of whatever you put in over here. So, it is going to be 𝑣𝑎𝑟(𝑋)/(𝑡²𝐸[𝑋]²). And now for another form, let us define a term called the standard deviation, often denoted σ(𝑋); that is nothing but the square root of the variance of the random variable. The nice thing about the standard deviation is that it is in the same units as the random variable X. Suppose X measures, say, distance in meters; the variance is 𝐸[𝑋²] − 𝐸[𝑋]².

So, when you think about it, the unit of the variance is going to be meters squared, whereas the standard deviation takes the square root of the variance and therefore brings it back to meters. That is one of the nice features of the standard deviation, and it is quite commonly used in practice. The common question people ask about a random variable is how large its standard deviation is, because that tells you how much the random variable is likely to deviate from its expectation.

So, what is the probability that a random variable will deviate from its expectation by more than 𝑡σ(𝑋)? That is going to be at most 1/𝑡², again simply by applying Chebyshev's inequality. What will we have over here? This will become 𝑣𝑎𝑟(𝑋)/(𝑡²σ(𝑋)²), but the variance is nothing but σ(𝑋)². And so, those two cancel out and you will be left with 1/𝑡²; that is how you get this inequality.

(Refer Slide Time: 24:44)

So, let us go back to the binomial distribution and test out this new tail bound that we have, Chebyshev's inequality. Remember that in our experiment in the last segment we drew several samples from the binomial distribution with parameters 10000 and bias 0.5. This means that the expectation is going to be 5000, and you can work out that the variance is going to be 2500, which means that the standard deviation is going to be 50, just the square root of 2500.

So, now you can ask: what is the probability that X is going to deviate from the expectation by more than 150? If you recall, the number that we drew from the binomial distribution almost never went beyond 5150, and it almost never went below 4850; so, it was within the plus or minus 150 range, and that is what we are asking. If you work it out, this is going to turn out to be 1/𝑡² with 𝑡 = 3, which is 1/9. This is a much more accurate bound, and Chebyshev's is clearly more powerful than Markov's in this context; but in a strange way, it is essentially Markov's inequality applied in a more appropriate way. That is exactly what the proof of Chebyshev's inequality says.
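These numbers are easy to reproduce; the following sketch (my own, not from the lecture) computes the Chebyshev bound for this exact setting and also estimates the true tail by simulation, which comes out far below 1/9:

```python
import random

n, p = 10_000, 0.5
mean, var = n * p, n * p * (1 - p)  # 5000 and 2500
sigma = var ** 0.5                  # 50

# Deviating by 150 means deviating by t = 3 standard deviations,
# so Chebyshev bounds the two-sided tail by 1/t^2 = 1/9.
t = 150 / sigma
print(1 / t**2)  # 1/9, about 0.111

# A quick simulation suggests the true tail is much smaller than 1/9.
rng = random.Random(1)
trials = 500
hits = sum(abs(sum(rng.random() < p for _ in range(n)) - mean) >= 150
           for _ in range(trials))
print(hits / trials)
```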

(Refer Slide Time: 26:40)

So, with that we come to the end of our segment on Chebyshev's inequality. What we did was introduce some definitions and terms that were crucial to understanding Chebyshev's inequality. We stated and proved it, and we showed that, at least in the context of binomial random variables, it is a better way to bound the random variable than Markov's inequality.

(Refer Slide Time: 27:14)

And in the next segment, we are going to go back to the problem of selecting the kth element. In fact, to keep it a little simple, we will just be interested in finding the median of a given array of numbers; but this time the good thing is that we are going to ensure that the running time is O(n) with high probability, not just in expectation. So, that is going to be the focus of the next segment.

Thank you.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology Madras

Module – 3
Tail Bounds I
Lecture - 17
Segment 3: Median via sampling

So, we are now starting segment 3 in module 3, where we are going to be talking about a median algorithm, an algorithm to find the median; the algorithmic technique we are going to use is sampling.

(Refer slide Time: 00:29)

So, this algorithm is going to be a linear time algorithm, meaning it is going to be 𝑂(𝑛), where n is the size of the array. As stated, it is going to be a Monte Carlo algorithm; we can talk about how to convert it into a Las Vegas algorithm. So, let us just state the problem.

(Refer slide Time: 00:52)

Now, I am going to focus only on the median, but extending it to the selection problem is not going to be difficult. So, you are given a set S of n numbers, arbitrarily ordered, and our goal is to find the median element in S.

(Refer slide Time: 01:16)

So, here is how the algorithm works: you take the unsorted array S and we sample a multiset R of 𝑛^(3/4) elements. Each one of them is chosen independently and uniformly at random from S; just pick a random number from 1 to n, and whatever element index it lands on, drag that element into the set R.

Student: It is with replacement.

It is with replacement. So, basically, that is why this R will be a multiset: you could be sampling the same element more than once. This is often done because it makes the analysis simpler; you can take advantage of independence and things like that. You could probably do a little bit better if you sample without replacement, but it is probably not worth it.

It is not worth it for a few reasons. a) It is not worth it from an algorithmic point of view, because if you have to do this without replacement, then S has to be suitably updated after you pick an element, to avoid picking it again. So, there is an algorithmic complication that comes with it. There is also an analysis complication: dependencies can creep in, so you need to be a little bit careful. With all of that, a lot of times, if we can do away with it, we will just do the sampling with replacement; then life becomes very simple, both algorithmically and from the analysis point of view.

(Refer slide Time: 03:31)

So, now we have the sample set R; what are we going to do with it? We take that sample set R and we sort it. Immediately you should be worried, since sorting is expensive; but look, the size of the array that we are sorting is only 𝑛^(3/4).

So, now if you use an 𝑛 log(𝑛) time algorithm to sort it, you are still going to be taking only 𝑜(𝑛) time. So, we will sort it, and our hope is the following: we will be able to spot elements in this sample that are close to the median, one guaranteed to be less than the median and one guaranteed to be a little to the right of the median.

Let us see why that is the case. So, once we sort it, this is the sorted array R, and I have shown it over here; this is the middle element in the array R, and these are just the sampled elements. Now, from the middle element you walk √𝑛 steps to the left, and whatever element is there, we call it the element d. From the middle element you walk √𝑛 steps to the right, and whatever element you get over there, you call it u. The rest of the algorithm is going to hinge on the condition, if you will, that d is less than the median in the original array but close to the median, and similarly, u is greater than the median but close to the median.

(Refer slide Time: 05:20)

If that is the case, then what can you do? Well, you can scan through the entire array S and pick out all the elements lying between the values of d and u; so, create this list of elements C. This will require a full scan, which is 𝑂(𝑛) time, but that is still 𝑂(𝑛). What is the advantage of this? Now you have collected a set of elements C, and if the good event occurred, d is less than the median, u is greater than the median, and this C itself is not too large. Then what can we do in order to find the median?

How do we find the median? Well, it is not very difficult to see, but there is an important distinction to keep in mind.

(Refer slide Time: 06:24)

So, I am taking S, and I am just going to ask you to view S from the sorted viewpoint. Remember, we do not have the sorted array S, because sorting it would require Θ(𝑛 log(𝑛)) time; but just for thinking about how this algorithm works, view it in the sorted form. Now, take this C over here; sort it, and let us say this C is sufficiently small. You take the sorted C and you superimpose it on S; where will the median lie? Remember, the smallest element in C is the element d and the largest element is u, and we are at least hoping at the moment that d is less than the median but close to it, and u is greater than the median but close to it. So, where will the median lie? It is somewhere in this sorted C, right?

(Refer slide Time: 07:48)

So, now that we know that that is the case, what do we do? We sort C, and just as a sanity check, we should again scan and make sure that the number of elements in S that are less than d is not more than 𝑛/2. When would that happen? That would have happened if this whole superimposed array is to the right of the median element; that is a bad event, and if that was the case, we have failed. Similarly, the number of elements greater than u should also not be greater than 𝑛/2; in that case also we have failed. We just need to make sure that the median is in fact in C, and we also have to make sure, and this will happen with high probability, that the size of this set C is sufficiently small. If that is the case, then we can sort C; and if we can sort C, then we should be able to find the appropriate median element. So, let us see how that works out.

So, again, go back to this picture here. We know that the median is somewhere in the middle of this sorted C that is being superimposed on S. What we can do is, in our scan, count the number of elements that are less than d; call that count 𝑙. For simplicity, let us just think of the case where the elements are not repeated; so, all the elements are distinct. Now we want to spot the element in C that is exactly the median element. If you look, the median element is going to be the 𝑛/2th element.

So, the median element is going to be somewhere over here; it is the 𝑛/2th element of S. If we want to spot the 𝑛/2th element of S, we have to find the appropriate element in C. What is that going to be? It is going to be the (𝑛/2 − 𝑙 + 1)th element in C; that will be the median element of S. So, that is it; that gives us the ability to spot the median.
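None of this code appears in the lecture, but the whole pipeline, sample R, pick d and u, build C, sort it, and index into it, can be sketched as follows; the failure checks mirror the conditions described above, the cap on |C| anticipates the 4𝑛^(3/4) bound discussed later, and all helper names are my own.

```python
import random

def median_via_sampling(S, rng=random):
    """One attempt of the sampling-based median algorithm (Monte Carlo).

    Returns the median of S, or None if one of the bad events occurs.
    Assumes distinct elements, as in the lecture's simplification."""
    n = len(S)
    r = int(n ** 0.75)                           # |R| = n^(3/4)
    step = int(n ** 0.5)                         # walk sqrt(n) from the middle of R
    R = sorted(rng.choice(S) for _ in range(r))  # sample with replacement
    d = R[max(r // 2 - step, 0)]
    u = R[min(r // 2 + step, r - 1)]

    C = [x for x in S if d <= x <= u]            # one O(n) scan
    l = sum(1 for x in S if x < d)               # count of elements below d

    # Bad events: [d, u] fails to span the median, or C too large to sort cheaply.
    if l > n // 2 or sum(1 for x in S if x > u) > n // 2 or len(C) > 4 * r:
        return None
    C.sort()                                     # o(n) time since |C| <= 4 n^(3/4)
    return C[n // 2 - l]                         # the (n/2 - l + 1)-th smallest in C

rng = random.Random(7)
S = list(range(1, 10_002))                       # n = 10001, true median = 5001
rng.shuffle(S)
print(median_via_sampling(S, rng))               # 5001, unless the attempt fails (None)
```

On a shuffled array of distinct numbers, a single attempt returns the exact median except on the low-probability bad events, in which case it reports failure by returning None.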

(Refer slide Time: 10:56)

So, let us just make sure that everything is correct, at least from the running time point of view. The sampling of R from S takes 𝑂(𝑛^(3/4)) time, because we are just generating random numbers in the range 1 to n and collecting the corresponding elements into this array R.

And sorting of R is going to take again 𝑜(𝑛) time, because it is going to take 𝑂(𝑛^(3/4) log(𝑛)).

Student: What is the (Refer Time: 11:34) 𝑛^(3/4)?

That is a very good question; a lot of times what you do is work with an idea. So, this will give me an opportunity to explain how randomized algorithms work: you have an intuition as to how the algorithm works, and then you have to work through it, play with it, and basically reverse engineer the parameters. You just need to find one set of right parameters for which the argument goes through.

Student: (Refer Time: 12:05).

It does not have to be empirical. In this case, what you have to do is actually try one set of values and work through the analysis. At the end, if it fits the required proof, then you are done; but if it does not, you can work your way backwards in the analysis and say, oh, at this point the probability is not good enough, so I need to bring down the probability a little bit, which, working backwards in the analysis, means I should have sampled a little bit more over here; so, increase the sample a little bit over here. It is a reverse engineering process: you kind of have to work your way backwards and finally play with the values until you get one set of parameters for which the entire algorithm and analysis go through.

So.

Student: (Refer Time: 12:59) say it is 1 minus some c between 0 and 1.

Yes, that is a good point. So, you can keep things general, but keeping things general sometimes works and sometimes does not.

Student: Ya it will become messy.

It will become very messy. So, what I end up doing, and this is a personal thing, is that I just put in specific values. Putting in specific values also means that the intuition is not lost; if there are a lot of parameters floating around, the intuition can get lost. So, if you put in specific values, then you still have this intuition, and you keep playing with your intuition until you get some values. One thing you can do is, after you play with your intuition and get one set of values, generalize it and optimize the constants if you want.

Student: Because you take a lot of values between 0 and 1; let us say you have to prove that it is order of n, right. So, intuitively you know that this power has to be between 0 and 1.

Correct ya.

Student: If I run a rigorous experiment, will it not be that exactly 3/4? I might get a special output.

Ya, you could. So, for example, as long as the size of R is 𝑂(𝑛/ log(𝑛)), you are fine, because you are doing a sorting, and when you do the sorting you do not want to exceed 𝑂(𝑛); that is the important thing. But these are just one set of values for which it works, and when we have that, we are satisfied.

So, the array R can be sorted in 𝑜(𝑛) time because it is of size 𝑂(𝑛^(3/4)). From there, we need to compute this set C where the elements lie between d and u; that requires one full scan of S, right, but that is again still 𝑂(𝑛). All other operations are checking conditions here and there.

Student: sorting again right.

Yes. So, there will be one more sort; oh ya, the C will also be sorted. But C again is of size, you are right, at most 𝑛^(3/4), because C is a subset of R; oh sorry, no, I take it back. C is not a subset of R, but it is

Student: (Refer Time: 15:22) bounded by.

Correct, you are right. So, what we need to prove is that d and u are actually very close to the median; and if they are very close to the median, the number of elements

Student: (Refer Time: 15:34).

Ya, the number of elements that get dropped into C will also be small, and so I have to add that one point over here; you are right. So, thanks for pointing that out.

Student: (Refer Time: 15:43).

Ya.

Student: (Refer Time: 15:44) 1 before.

Ya 1 before

Student: I will ask him here we seeing that C is greater than.

Ya

Student: (Refer Time: 15:55).

Ya, this is where that is appropriate; you are absolutely right. So, here what we need to show is that, with high probability, C is going to be small. And how will C be small? If d and u are close to the median, but still to the left and to the right of the median. If d and u are far away from the median, then the set of elements that you are collecting in C can become large, right. What we will do is prove, in the next segment, that |𝐶| ≤ 4𝑛^(3/4) with high probability.

In the unfortunate event that it exceeds this, it is a bad event. Actually, this condition is a little too strong: as long as C is 𝑜(𝑛), or 𝑂(𝑛/ log(𝑛)) itself, we can actually proceed, because we can still afford a sort on that set of elements.

Student: Is there any (Refer Time: 16:57).

Oh god I want you to work for your points.

So, all right, let us continue. So, now when you add up all the running times, it is at most 𝑂(𝑛).

(Refer slide Time: 17:26)

And ya, as I said, there are pending things here. What we have seen is the algorithm; hopefully the idea is clear, it is a very simple algorithm. There are analysis issues that are pending, though. What is clear is that the algorithm either fails or is correct, and when it fails, we know that it has failed. What we need to be able to show is that the probability of failure is small. How can it fail? It can fail for a few reasons; as was pointed out, it can fail because the set C, the array C, somehow is too large.

Student: Could it be that it does not fail; would you just say that the running time is larger than 𝑂(𝑛)?

That is correct ya .

Student: So.

But if it is really large, then you are no longer going to be in the 𝑂(𝑛) region.

Student: Ya, so, but are you saying that if it exceeds that size.

Hm.

Student: Then the median that it finds will not be close to the real median, I think.

No, the other failure is more egregious; the other failure is where the median is not in C. So, there are two conditions.

Student: Ya.

Which can fail: one is just a running time issue.

Student: Ya.

The other, the median not being in C, is the real failure.

Student: (Refer Time: 18:43) Monte Carlo.

Ya. So, currently the stated form of the algorithm is a Monte Carlo algorithm, meaning that it can fail to even produce a correct answer; but we can convert it into a Las Vegas algorithm. I will discuss that shortly.

Student: But currently also it is having a problem with its running time.

Yes ya.

Student: The other one is whether the median it finds is exactly the median or close to it.

No, it is exactly the median. If the algorithm passes this condition and we actually proceed here, and the median is in C, then we are going to find the exact median.

So, we know that the output is either fail or correct. So, what we will prove, and I am going to leave it as a claim for now, is that the probability of failure is at most 𝑛^(−1/4). So, this is just a fixed constant in the exponent, 𝑛^(−1/4). And if you recall, in the previous cases, when we wanted to prove something with high probability, the bad event had 1/𝑛 probability. So, this is just 𝑛^(−1/4); would that be an issue, and what can we do to avoid this?

Student: (Refer Time: 20:21).

We want to avoid this 𝑛^(−1/4), which seems a little weak; let us say we want to be able to get this down to 𝑛^(−1). What do we do? We simply repeat it 4 times, and if it fails all 4 times then let us say we give up; the 4 repetitions will ensure that the probability of failure is 𝑛^(−1). But in general, for any 𝑛^(−α), you just have to repeat this some 4α times or so, and you will be able to ensure that you get that particular probability of failure.
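The arithmetic behind this repetition trick can be sketched numerically; this snippet is my own illustration, not from the lecture:

```python
# One run fails with probability q = n^(-1/4); k independent runs all failing
# has probability q^k. So k = 4 gives n^(-1), and k = 4*alpha gives n^(-alpha).
n = 10_000
q = n ** -0.25        # single-run failure probability: 0.1 for n = 10^4
print(q ** 4, 1 / n)  # four failures in a row: 1e-4, matching n^(-1)
```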

(Refer slide Time: 21:12)

So, with that we conclude this segment. Hopefully the algorithm is very clear; the analysis will require us to work with Chebyshev's inequality, which we will see in the next segment.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology Madras

Module – 03
Tail Bounds I
Lecture - 18
Segment 4: Median via Sampling – Analysis

We are now in segment four of module three. We have already seen a median algorithm based on sampling; we just saw how the algorithm works, and it is also reasonably clear that it is a linear time algorithm. What is not clear is why it is correct and how to analyse it.

(Refer Slide Time: 00:42)

And that is going to be the plan for today; we are going to focus on the analysis of correctness of this median algorithm. We will be using Chebyshev's inequality; in particular, we are going to prove the following claim: that the algorithm fails with probability at most 𝑛^(−1/4).

So, we already alluded to this claim; we are going to see how it is proved. And if you recall, Chebyshev's inequality requires you to know the expectation and the variance of the random variable. With these two pieces of information, you will be able to bound the random variables that show up in the analysis.

(Refer Slide Time: 01:25)

So, I will first recall the median problem, the algorithm, and some of the details. Of course, you have a set S, arbitrarily ordered, and you need to find the median; that is the median problem.

(Refer Slide Time: 01:42)

And let us recall the algorithm, and keep in mind that there are three major objects in this algorithm that interplay. There is the set S, or the array S if you will; it has the elements from which you need to find the median. That is the input set.

And then the first thing that happens is that you sample a multiset R; that is the second object. You have a multiset R of size 𝑛^(3/4); these are elements chosen uniformly and independently at random from the set S.

(Refer Slide Time: 02:24)

This is the sample. And the third object; well, before we go to the third object, what we do is take this second object R, sort it, and pick out two elements d and u from this array. Our intuition is that d should be less than the median but close to the median; it should not be too far from the median. Similarly, u should be greater than the median, but again close to the median. So, this is the intuition based on which the algorithm works; of course, in our analysis today we will be proving this intuition formally.

(Refer Slide Time: 03:02)

And now comes the third object. Based on these two elements d and u, you go back to the set S and pick out all those elements in S that lie within the range d to u; that is the third object C. So, these three objects interplay quite a bit, and how they interplay is crucial to understanding how to analyse this algorithm.

(Refer Slide Time: 03:28)

So, what is the intuition here? For this third object, remember, our hope is that d is to the left of the median and u is to the right of the median. Now, when we are trying to understand the analysis, sometimes we look at S, but we look at it in the sorted view. The sorted view is not available to the algorithm; remember, the analysis is one thing, the algorithm is another. The algorithm works on an unsorted list, but for understanding how the algorithm works, in the analysis you can view the sorted array and play with it, and that is what we are doing.

So, in this sorted array, what you do is you can superimpose the array C, and remember, array C is something that we actually sort in the algorithm as well. And in this array C, what we do is: well, d, the smallest element in this array, we hope is less than the median in S, and u we hope is to the right of the median. So, this sorted set spans the median. So, it must encompass the median, but it should not be too large; it should be something like n^{3/4}. Which means that, because it spans the median, you can spot the median of S within the array C; this is the intuition that I hope you already have. Any questions on how this algorithm works; any issues at all? Because understanding this algorithm is absolutely imperative to understanding the analysis.
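To make the three objects R, (d, u) and C concrete, here is a minimal Python sketch of this sampling-based selection procedure. The parameter choices (|R| = n^{3/4}, the √n offsets around the middle of R, and the 4n^{3/4} size check) follow the lecture; the function name `approx_median` and the exact tie-breaking details are my own illustrative choices, not part of the course material.

```python
import math
import random

def approx_median(S):
    """Sketch of the sampling-based median algorithm from the lecture.

    Sample R of size n^(3/4) with replacement, sort it, pick d and u at
    sqrt(n) positions below/above the middle of R, collect C = elements
    of S lying in [d, u], and read the median of S out of sorted C.
    Returns None when one of the 'bad events' occurs.
    """
    n = len(S)
    r_size = round(n ** 0.75)
    offset = int(math.sqrt(n))
    R = sorted(random.choice(S) for _ in range(r_size))
    d = R[max(r_size // 2 - offset, 0)]
    u = R[min(r_size // 2 + offset, r_size - 1)]
    C = sorted(x for x in S if d <= x <= u)
    below = sum(1 for x in S if x < d)      # rank of d within S
    # Bad events: C too large, or C fails to span the median of S.
    if len(C) > 4 * n ** 0.75 or below > n // 2 or below + len(C) <= n // 2:
        return None
    return C[n // 2 - below]                # the rank-(n//2) element of S

random.seed(1)
print(approx_median(list(range(10001))))   # 5000 unless a bad event occurred
```

Note that whenever the bad-event checks pass, the returned value is exactly the rank-n/2 element of S, which is the point of the analysis that follows.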

(Refer Slide Time: 05:07)

So here are the three objects ok. Now, recall that we hope some good things will happen. What was the first good thing that we hoped would happen? The element d is to the left of the median ok. And let us ask ourselves: how does that happen? If you think about it, the right event should occur right away when you sample the elements of R ok. What is this element d? You go back to the fact that you first sample this multiset of elements R. You sorted it, found the median, you walked left √n steps, and the element you landed on ended up being the leftmost element in C, and that has to fall in this left region of S ok.

Similarly, the other good thing that must happen is with your u; this is a symmetric good event: u should be from the right part of the array S ok. And the third good thing you want is that the set C should not be too large; it should be no more than some 4n^{3/4}. And here is an important principle. So, there are these good events that you want, and you want to make sure that these good events happen with high probability ok, but often that is hard to prove directly. So, what do we do? We focus on the corresponding bad events, because those are often easier to bound ok.

(Refer Slide Time: 07:09)

So, let us take these good events and convert them into the appropriate bad events ok. So, what is the bad event? The good event was that d falls on the left side; the bad event is that d falls on the right side; that is one bad event. I am going to call that bad event number one ok; I am going to denote that by a one enclosed in a circle. And of course, the mirror image is also easy to see: the other bad event is that u falls on the left side; u should rightfully fall to the right of the median, but falling on the left side is the bad event ok.

And you want |C| ≤ 4n^{3/4}; the bad event is just the opposite. We will understand the third bad event more carefully and more slowly later on ok, but hopefully for the first two events the corresponding bad events are easy to visualize ok.

(Refer Slide Time: 08:05)

So, let us formally write down the bad events, because what we have here is a pictorial representation. This d over here falling on the right side is the first bad event. Let us try to write it out formally, and let us make sure we understand how this works. For this we first define a random variable Y1 ok. Y1 is simply the number of those elements in R (remember, R is the multiset of elements that we sampled) that are less than or equal to the median; m here represents the median, and we are asking how large that set is ok. And this is the random variable Y1 ≜ |{r ∈ R : r ≤ m}|; and the bad situation is when this Y1 is less than a certain quantity ok. So, this will require us to understand exactly why that is the case ok.

So, let us go back to this picture. Y1 counts all those elements in R that are less than or equal to the median. So, please pay attention to this part: you are counting the elements in R ok, but only those elements that fall to the left of the median, alright. So, let us see why that is a bad situation. In particular, the bad event is when that quantity is less than n^{3/4}/2 − √n ok.

So, let us see why that is the case. What I would suggest that you do is stare at this picture a little bit and convince yourself that this bad event is actually captured by this formal statement ok.

So, what happens if the number of elements sampled from this region to the left is too small? By the way, I think there is one more small typo here: this median element has to be at position n^{3/4}/2 ok. So, let me correct that over here. This whole array is of length n^{3/4}, so this middle element is actually not at n^{3/4}; it is at n^{3/4} divided by 2 ok.

So, apologies for the typo; here also we have n^{3/4}/2 − √n, and on the other side it is n^{3/4}/2 + √n ok. Now, what is Y1? Y1 counts all those elements from R that fall in this range on the left side ok. If these are too few, where will d fall? In other words, if those elements that are from the left of the median are just a small set of elements over here, where will d fall?

Student: (Refer Time: 12:41).

Ha, so that is the intuition here. When this Y1 is too small, particularly when it is less than this quantity, those elements are just going to occupy a small part on the left of R, which means that this d will fall in this region ok.

Student: So why that specific quantity?

The specific quantity? Oh;

Student: No, no. How does it come to less than n^{3/4}/2 − √n? Why that quantity?

Ok, so, we are trying to capture a bad event right. The bad event is that d falls in this region
ok.

Student: Yeah.

And now we have to formally capture that bad event, and we want to capture it in such a way that we can apply Chebyshev's inequality ok. We want to define this random variable Y1 for which we can find the expectation and variance, and be able to say that Y1 will be small only with small probability; that is the bad event: when Y1 is too small, d falls in the bad region.

Student: But the d will still be left of the median.

No; when Y1 is small, what happens? Think about it. What is Y1? Y1 is the count of the elements in R that are sampled from the left side ok. If that is a very small part of this set R, and d is positioned over here, then the elements corresponding to Y1 are coming from this small set inside R, and that will naturally push d to the right of the median. So, what is the precise definition here? You define Y1, and the bad event is that it is less than n^{3/4}/2 − √n. What does that mean? When Y1 is less than this quantity, all the elements that were sampled left of the median are only occupying the portion to the left of d ok.

They are only occupying the portion to the left of d, which means, since the element d is chosen purely based on a position within R (you are finding the middle element and walking back √n steps ok), that d's position in R is fixed, but all the elements that correspond to Y1 are to its left. If you think about it, d will then be pushed into the bad region. This is important; make sure you understand this, because a lot of the setting up of the variables to work out the analysis is captured in this step ok. Shall we proceed?

If you want, I can give you a minute to stare at this picture. That is a very good point; thanks for pointing that out. So, Bharath's recommendation is to just think of the extreme situation where Y1 is simply equal to 0; that basically means that no element was sampled at all from this region. So, even the smallest element of R is over here, which means that d is also going to be in this part. So, that is an extreme situation.

It is not like the procedure magically knows what the median is and keeps d at that point, right. The algorithm is just running without knowledge of what the median is, right; there is no magic about this. You are just sampling elements. So, you do not know where d lies really, I mean, where the median is, all right ok.

So, let us move on. So, this is the formal way to represent the bad event ok. If the first event is captured this way, the second event is going to be a mirror image ok. Here it is going to be another variable Y2, where you are looking at all the elements greater than or equal to the median in the sample set R, and the bad event would be that the cardinality of that set, that is Y2, is fewer than n^{3/4}/2 − √n; it is too small ok. This is just a mirror image, so not too tricky. And for the third one I am going to defer the details till later, but we will call it 3. So, we have listed the three bad events and we have given them names 1, 2 and 3 ok. 1 and 2 are essentially the same. So, we are only going to bound 1, and the same bound will hold for 2 as well ok.

What we are going to do is apply a union bound. This is a very, very important thing; simple, but extremely useful when you have a few bad things that can go wrong. In this case there are three bad things that can happen; and if you can ensure that each one of those bad things happens with very small probability, then the probability that any one of those bad things happens is at most the sum of the individual probabilities. This just comes from the inclusion-exclusion principle, throwing away the terms that we do not need.
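As a quick illustration (my own, not from the lecture), here is a small Monte Carlo check of the union bound on three deliberately overlapping "bad" events of a die roll: the count of trials where at least one bad event occurs never exceeds the sum of the individual counts, because the per-trial indicator of the union is at most the sum of the indicators.

```python
import random

random.seed(0)
trials = 100_000
count_union = count_a = count_b = count_c = 0
for _ in range(trials):
    die = random.randint(1, 6)
    e1, e2, e3 = (die == 6), (die >= 5), (die % 2 == 0)  # correlated bad events
    count_a += e1
    count_b += e2
    count_c += e3
    count_union += (e1 or e2 or e3)
# Per trial, 1[e1 or e2 or e3] <= 1[e1] + 1[e2] + 1[e3], so this is always True:
print(count_union <= count_a + count_b + count_c)
```

Note that the bound holds even though the events here are heavily correlated; independence plays no role in the union bound.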

(Refer Slide Time: 18:12)

So let us look at this first bad event ok; and recall that this is how we formally defined it. So, Y1 is the cardinality of this set, and the bad case is when it is too small ok. So, how do we bound this? Remember, we carefully defined this random variable so that we can apply Chebyshev's inequality, which means that we need to be able to find the expectation and the variance ok.

So, we want to break Y1 into smaller pieces. What is Y1? It is basically the count of the set of elements that are less than or equal to the median ok.

So, go back to the sampling algorithm: you are sampling elements into this set R. Define Xi = 1 if the ith sample is less than or equal to the median. Remember, you have this less-than-or-equal-to here, right; that is what is defining your set whose cardinality is Y1 ok. So, what you do is basically break Y1 into individual Xi's: Xi = 1 if the ith sample is at most the median value, and 0 otherwise.

So, now the nice thing is that Xi is very easy to work with, because E[Xi] is nothing but a half, I mean, close to a half. If you look at the textbook, they are very careful and precise about computing the median; I am being a little slack here, because it depends on how you define the median. Is it the middle element? If n is odd you have a middle element; if it is even you do not have exactly a middle element, so you look at, say, the position just left of the middle. We are ignoring that and saying, look, E[Xi] = 1/2 approximately; close enough ok.

And when you have the Xi's, you can very easily compute Y1; Y1 is simply the summation of the Xi's, which means you can compute E[Y1], again by linearity of expectation. And so E[Y1] = n^{3/4}/2. Why? Because the set R has cardinality n^{3/4}, and each time you sample, the expected contribution towards Y1 is a half, right. So, that is n^{3/4} · (1/2). That is the expectation; of course, that is only half the story, you need the variance as well.

(Refer Slide Time: 20:43)

Again, it is easy to compute the variance for the individual Xi; it is p(1 − p), because if you recall, this is just a Bernoulli random variable. And again, the p's and the (1 − p)'s are roughly a half ok. So, each variance is roughly 1/4, and this is where independence helps; I recall somebody asking why we sample with replacement rather than without. If you do it with replacement you get independence, and so now you can compute Var(Y1) simply by summing the individual variances ok. Remember, the individual variances are a quarter, and i runs from 1 to n^{3/4}. So, you get Var(Y1) = n^{3/4} · (1/4) ok.
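Since Y1 is (up to the median convention) a Binomial(n^{3/4}, 1/2) variable, these two formulas are easy to sanity-check numerically. The sketch below is my own illustration, with n chosen so that n^{3/4} = 8000 exactly; it compares Monte Carlo estimates of the mean and variance against n^{3/4}/2 and n^{3/4}/4.

```python
import random

random.seed(0)
n = 160_000                 # chosen so n^(3/4) = 8000 and sqrt(n) = 400 exactly
r_size = round(n ** 0.75)   # 8000 samples, each at most the median w.p. ~1/2
trials = 200
ys = [sum(random.random() < 0.5 for _ in range(r_size)) for _ in range(trials)]
mean = sum(ys) / trials
var = sum((y - mean) ** 2 for y in ys) / trials
print(abs(mean - r_size / 2) < 50)      # True: close to E[Y1] = n^(3/4)/2 = 4000
print(abs(var - r_size / 4) < 1000)     # True: close to Var(Y1) = n^(3/4)/4 = 2000
```

The tolerances are generous because 200 trials give only a rough variance estimate; the point is just that the empirical values sit near the computed expectation and variance.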

(Refer Slide Time: 21:45)

So, we have the two pieces: the expectation and the variance. So, let us proceed from here. Let us go back to stating the bad event. The probability of the first bad event, if you recall, is Pr(Y1 < n^{3/4}/2 − √n) ok; but what is n^{3/4}/2? That is nothing but E[Y1] ok. So, what I am going to do now is rearrange the terms just a little bit.

So, let us make sure we get a picture of what is going on; often, while working with these notations, we can lose track of the intuition of what is being asked. So, let us say we have this Y1 with some distribution; this is E[Y1], and this is E[Y1] − √n, and we are asking: what is this probability? This is the tail bound, right. We are asking: what is the probability that Y1 is going to be less than this quantity ok.

And if you recall, Chebyshev's inequality had bounds on both sides. So, it is not just this part; it also has this part ok. So, to make life simple for us, we are going to convert this into a two-sided event and just bound both tails together. This only adds probability to the bad event, and if the upper bound is still within what we care about, then we are fine, right. So, how are we going to capture this probability? Basically, you are looking at Y1 − E[Y1], and when the absolute value of this quantity exceeds √n, then you are falling into either this region or this region; that is what is captured in this event ok.

And this form is now exactly the way we want for applying Chebyshev's inequality ok. The probability that the absolute value of a random variable minus its expectation exceeds some quantity, √n in this case, is at most Var(Y1) divided by the square of this quantity, right. So, if it was a here, it will be a^2 there, right. So, it is (√n)^2 = n here.

So, it is n over here, and if you work it out, Var(Y1)/n = (n^{3/4}/4)/n; it works out to n^{-1/4}/4 ok. This is one of the three different bad events, and it is contributing n^{-1/4}/4 when you plug it into the union bound. You can do the exact same thing and argue the same for the probability of the second bad event as well; that is also going to be ≤ n^{-1/4}/4 ok.

And then what we are going to do is show that the third event is going to be ≤ n^{-1/4}/2. When you add up these probabilities, n^{-1/4}/4 here, n^{-1/4}/4 here and n^{-1/4}/2 here, it adds up to n^{-1/4}, and that is where we complete our analysis ok.
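To see how this bound behaves in practice, here is a small simulation (my own illustration, not from the lecture) of bad event 1 for a modest n: Y1 is drawn as a Binomial(n^{3/4}, 1/2), and its empirical chance of dropping below n^{3/4}/2 − √n is compared with the Chebyshev bound n^{-1/4}/4.

```python
import random

random.seed(0)
n = 10_000
r_size, root_n = 1000, 100        # n^(3/4) and sqrt(n) for this n
threshold = r_size / 2 - root_n   # bad event 1: Y1 < 400
trials = 1000
bad = sum(
    sum(random.random() < 0.5 for _ in range(r_size)) < threshold
    for _ in range(trials)
)
bound = n ** -0.25 / 4            # Chebyshev bound: n^(-1/4)/4 = 0.025
print(bad / trials, "<=", bound)
```

The empirical rate typically comes out far below the bound, which is consistent with Chebyshev's inequality being rather loose; the stronger Chernoff bounds of the next segment capture this gap.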

(Refer Slide Time: 25:37)

So, that is what I am stating here: the probability bound for the first bad event equals that of the second bad event. So, we are going to skip the second bad event ok.

(Refer Slide Time: 25:44)

Now, we are looking at the third bad event; we need to again be a bit careful about how to approach this bad event ok. This is the bad event that C, this third object in our algorithm, is too large ok. In particular, you do not want it to be larger than 4n^{3/4} ok.

So, let us say the bad event occurs, which means now your |C| > 4n^{3/4}. Then you can look at the bottom 2n^{3/4} elements and the top 2n^{3/4} elements, and they are going to be disjoint, right, because the overall width is more than 4n^{3/4} ok. So, you are going to notice that one of a few bad things must happen in this case.

One possibility is that the top 2n^{3/4} elements in C are all greater than the median; if you go back here, the bottom set is then spanning across the median, but at least one of the two parts is fully on one side of the median. The other thing that could happen is that one part is fully to the left and the other fully to the right; this is also a possibility: the top 2n^{3/4} elements in C are greater than the median, and the bottom red portion is less than the median. The third possibility is that the right part overlaps the median, but the bottom 2n^{3/4} elements in C are less than the median.

(Refer Slide Time: 27:37)

So, there are these three possibilities, but the best way to capture it is: if the third bad event happens, then at least one of these two things happens, namely, either the bottom part is completely to the left of the median or the top part is completely to the right. So, basically, the third event can be broken into two smaller bad events, and these are again symmetric bad events. So, what we will do is just focus on one of them.

And just to be clear, each of these is a bad event, but if one of them occurs it does not necessarily mean that the third bad event occurs ok. It is the other way around: if the third bad event occurs, then at least one of these two smaller bad events occurs. So, if you were to look at it from a Venn diagram point of view, let us say that this is the third bad event ok. We cannot say that if one of the smaller events were to happen, that implies 3; but what we can say is that if 3 were to occur, then one of these two occurs.

So, what we are really doing is taking a larger event, the OR of these two events, and we are going to bound this larger event. For that, we are going to look at the two separately, and by symmetry just focus on the first of them. Both of the smaller events can also happen, or neither can happen; it is a little bit more complicated, so let us not worry about that.

(Refer Slide Time: 30:14)

These are all possibilities here, but let us just focus on this one bad event; basically, what we are focusing on is that the top 2n^{3/4} elements in C are greater than the median ok. So, how will this happen? What is the rightmost element in this C? That is this element u, and the intuition here is that the element u occurs too far away from the median element: this portion is of length greater than 2n^{3/4}, and that is why this red portion, which is of size 2n^{3/4}, is fully to the right of the median ok.

(Refer Slide Time: 30:58)

So, now, for this to have occurred, the problem actually started way back when we sampled R; so that is why we are going back to R.

So, this is our u over here in R, the original sample, and somehow this u is falling at some position greater than 2n^{3/4} above the median element in the sorted order of S ok. How would that happen? Again, it is in a slightly counterintuitive way, as we are going to see. How does u get pushed so much to the right? You look at all those elements in the rightmost portion of this set R; that is basically this portion above u, and there are n^{3/4}/2 − √n of them, because you are looking at these elements over here to the right of u ok.

All of these: where should these samples have come from? They should have all been contained within the top n/2 − 2n^{3/4} elements of S; that is this portion. All of these elements should have been contained over here; that is how this u can get pushed to the right ok. I will let you stare at this picture for just a second ok. Why is this happening? u gets pushed to the right, more than 2n^{3/4} from the median. How is this happening? If you look at the sample set R and you look at the top n^{3/4}/2 − √n samples, all of them are drawn from this region. So, that is how u gets pushed so much to the right ok.

(Refer Slide Time: 32:56)

So, now we can write out this bad event ok. For that we are going to define this random variable X: it is the number of samples in R that are drawn from the top n/2 − 2n^{3/4} elements in S ok. So, again you define a random variable and break it into its components: Xi = 1 if the ith sample meets the requirement of X; otherwise it is 0.

So, this part should be obvious to you. X is simply the summation of the individual Xi's. So, E[X] is simply the summation of the individual expectations, but here we need to be careful. What is E[Xi]? It is the probability that a sample falls in the top n/2 − 2n^{3/4} region, right; and what is that probability? It is (n/2 − 2n^{3/4})/n. Summing over the n^{3/4} samples, you get E[X] = n^{3/4} · (n/2 − 2n^{3/4})/n, which works out to this quantity n^{3/4}/2 − 2√n ok.

(Refer Slide Time: 34:24)

Now, the variance can be computed similarly. You have the individual probabilities, and here what we are using is that any time you have p(1 − p) with p between 0 and 1, that quantity is going to be at most 1/4 ok. So, the variance is going to be the summation of n^{3/4} terms of the form p(1 − p). So, it works out to be at most n^{3/4}/4; this you can verify ok. Now we simply apply Chebyshev's inequality.

So, what is the bad event? That X, the number of elements that we sampled from this top n/2 − 2n^{3/4} region, is at least this quantity n^{3/4}/2 − √n. And now set this up carefully, because you have this event X greater than this quantity, and the expectation is n^{3/4}/2 − 2√n.

So, if you look at this n^{3/4}/2 − √n, it is exactly E[X] + √n; so I get it into this form, so that I can write it as Pr(X − E[X] ≥ √n). Now I again do the same trick of bounding not just this tail but both tails, and that is going to be at most Pr(|X − E[X]| ≥ √n).

And this is now exactly the form that we want. So, we can apply Chebyshev's inequality: it is nothing but Var(X)/(√n)^2 = Var(X)/n, and it works out to be n^{-1/4}/4 ok. All of the bad events had the same type of probability bound. So, the first bad event had this bound, and the second bad event by symmetry has the same upper bound on its probability.

Further, the third bad event was broken into two pieces, and one of them has this probability bound n^{-1/4}/4, which means that its mirror image will also have the same upper bound. So, when you add up all the bad events, there are 4 of them if you consider the third one broken into 2, each with the same bound n^{-1/4}/4.

(Refer Slide Time: 37:12)

So, that is what I am showing over here. You have these two mirror-image, I mean symmetric, bad events corresponding to 3.

(Refer Slide Time: 37:18)

So, you put them all together and you get the upper bound: the probability of any bad event occurring is at most n^{-1/4}, and this is exactly what we wanted.

(Refer Slide Time: 37:29)

So, with that we can conclude. I hope you got an appreciation for how Chebyshev's inequality was applied in this context; as you rightfully pointed out, it is about defining the right variables and setting things up the right way so that you can connect your intuition with the formal analysis ok. And one reason we do this is that intuition is often great, but intuition can also be deceptive ok. When I started teaching probability in computing, I thought that the previous median algorithm would also work with high probability; that was my intuition, and I was proven wrong. I was actually able to prove myself wrong by showing that you cannot prove that one works with high probability; you need this algorithm to prove the high probability bound on finding the median ok.

So, intuition is great to get you started, but often intuition can be misleading. So, you do need to learn how to analyse your algorithms more carefully ok.

(Refer Slide Time: 38:43)

So, hopefully you are getting an appreciation for that. With that we will conclude; in the next segment we will actually look at Chernoff bounds, which are a little bit more powerful than Chebyshev's, but in some sense restricted ok. So, that is it.

Thank you.

Probability & Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology Madras

Module – 03
Tail Bounds I
Lecture - 19
Segment 5: Moment Generating Functions and Chernoff Bounds

(Refer Slide Time: 00:18)

And now we are going to see Chernoff bounds. It is not just one bound but a family of bounds, tied together by the technique with which we derive them, and along the way we will try to understand an important notion called moment generating functions. These moment generating functions are quite nice, because they capture all the moments. If you recall, the nth moment is E[X^n]; the moment generating function is like a bag that just collects all of them together.

So, if you recall, Markov's inequality required the first moment, and Chebyshev's required the second moment. Why is Chernoff so powerful? Because it includes information from all the moments, and as it turns out, if you have all the moments you actually uniquely define the random variable. So, the moment generating function characterizes random variables uniquely. That is why this is significantly more powerful than Chebyshev's inequality. And we will work out Chernoff bounds for the sum of Poisson trials; it is basically a generalization of the sum of Bernoulli trials, which is what we call the binomial distribution; it is this slightly more general version for which we will work out the Chernoff bound ok.

So, here is the Chernoff bound technique at a very high level. Let us set it against the backdrop of Chebyshev's. How did we prove Chebyshev's inequality? We took a random variable X ok.

(Refer Slide Time: 02:14)

We asked: what is Pr(|X − E[X]| ≥ a)? Well, we did not know how to handle that directly, but we squared it, and when we squared it, if you recall, the event stays essentially the same, but now you are able to apply the second moment, right; and that is what we did for Chebyshev's ok.

(Refer Slide Time: 02:44)

We are just going to push this further to the extent possible. What we are going to do is not square it, but take the exponential ok. So, now, let us say you want Pr(X ≥ a); what we do is pick a suitable t, some quantity greater than 0 ok, and we morph this event X ≥ a into e^{tX} ≥ e^{ta}; basically, we are exponentiating on both sides.

So, the inequality stays the same, the event is really capturing the same thing, and so the two have equal probabilities. And you can easily see the analogy between this and Chebyshev's inequality; it is essentially the same idea, but here we are pushing it to the limit: not just squaring, but exponentiating. And the nice thing is that what we have here is a right tail: we were asking what is the probability that X exceeds some value a, and that is giving you the right end of the tail.

For the left end of the tail, all you need to do is pick a t that is less than 0. With t < 0, instead of a greater-than-or-equal-to event you start from Pr(X ≤ a); that will be the left tail. What happens over here when you exponentiate, raising both sides to the power of e^t with t < 0, is that the negative exponent flips the inequality: X ≤ a becomes e^{tX} ≥ e^{ta} ok.

So, the inequality appropriately turns into a greater-than, which means that again you can apply Markov's inequality, and so on and so forth, right; that is exactly what we are doing here. For the moment, let us not worry about the left tail and continue to focus on the right tail. What you can now do is apply Markov's inequality: we have a non-negative random variable e^{tX} here, and we are asking what is the probability that it is more than some positive constant, so you can apply Markov's inequality. So, you see how Markov's inequality is still the basis upon which all the other tail bounds are derived, but nevertheless we are able to do fancy things now ok.

So, now we have to complete the story. We need a good bound for this numerator over here, the E[e^{tX}], and that is something we will focus our efforts on ok. But if we have the right bounds for the numerator, what we can do is play with the value of t. This is true for any t greater than 0, which means that for the best bounds you may need to choose the right value of t ok.

And then, hopefully, the terms will cancel out nicely, and you will get a bound that is a Chernoff bound. Basically, this is a general technique for deriving these sorts of bounds: take the random variable, ask what is the probability that it exceeds a particular value, exponentiate both sides of that event, apply Markov's inequality, and bound the numerator. This is a very standard technique; it is called the Chernoff bound technique, and any bound derived from it is a Chernoff bound ok.
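The whole recipe can be exercised numerically. The sketch below (my own illustration, not from the lecture) takes X ~ Binomial(m, p), whose MGF is M(t) = (1 − p + p·e^t)^m, scans a grid of t > 0 to pick a good value, and checks that e^{−ta}·M(t) really does sit above the exact tail probability.

```python
import math

def binom_tail(m, p, a):
    """Exact Pr(X >= a) for X ~ Binomial(m, p)."""
    return sum(math.comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(a, m + 1))

def chernoff_bound(m, p, a):
    """Minimize e^(-t*a) * E[e^(tX)] over a grid of t > 0: the Chernoff recipe."""
    best = 1.0
    for i in range(1, 500):
        t = i / 100                                    # grid over (0, 5]
        mgf = (1 - p + p * math.exp(t)) ** m           # MGF of Binomial(m, p)
        best = min(best, math.exp(-t * a) * mgf)
    return best

m, p, a = 100, 0.5, 75
exact = binom_tail(m, p, a)
bound = chernoff_bound(m, p, a)
print(exact <= bound)    # True: valid for every t > 0, hence also for the best one
```

Since the Markov step makes e^{−ta}·M(t) an upper bound for every t > 0, minimizing over t can only tighten it; analytically, one would differentiate in t instead of scanning a grid.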

So, just to recall, now we have to worry about this E[e^{tX}] ok.

(Refer Slide Time: 07:15)

And as it turns out, that fits very nicely with a well-known notion called the moment generating function. What is the moment generating function of a random variable X? It is exactly what we want: it is denoted M_X (this subscript has to be a capital X), and M_X(t) ≜ E[e^{tX}]. This is exactly what we want, and it is a well-defined notion, the moment generating function. Let us try to understand this object.

(Refer Slide Time: 07:51)

First theorem about the moment generating function: it captures all the moments. This is something that we talked about a little while ago, right; why are Chernoff bounds so powerful? Because they capture all the moments, not just the first and second moments. And this holds under some conditions: here the condition is that expectation and differentiation can be interchanged. If we can interchange them, then we have this theorem ok. So, what are we claiming here? E[X^n] is the nth moment, and you get it simply by taking the nth derivative of the moment generating function and evaluating it at the position t = 0 ok. By doing that you get the nth moment of the random variable X; that is the nice thing. So, what is happening here is that the moment generating function is somehow capturing all the moments: as you take the derivatives, each derivative evaluated at 0 gives you the appropriate moment.

(Refer Slide Time: 09:20)

The proof is quite simple; you look at the nth derivative of the moment generating function, and one important thing to notice is that you are taking the nth derivative not with respect to X, but with respect to t, that is, d^n/dt^n. So that is the nth derivative of, well, what is the moment generating function? It is E[e^{tX}]. So you apply that, and remember we have said that differentiation and expectation can be interchanged; I am just interchanging them and taking the differentiation inside the expectation. Here the nth derivative of e^{tX} is X^n e^{tX}; remember, this is the derivative with respect to t.

And now that you have this form, you can start evaluating the expression at t = 0. When you evaluate at t = 0, the exponential term becomes e^{0·X} = 1, so you are just left with E[X^n], and that is what we have over here. So the nth derivative of the moment generating function, evaluated at 0, is E[X^n], which is the nth moment.

(Refer Slide Time: 10:40)

Let us look at one specific example where this is applied. So now let us go back to our favourite Bernoulli random variable; we just want to confirm this. Basically, what are we doing here?

(Refer Slide Time: 10:53)

We are taking the nth derivative and evaluating it at 0, and we should get the moments; let us just test this out for the simplest possible case you can think of. So what is the moment generating function? It is E[e^{tX}]. And what is this E[e^{tX}]?

Well, the variable e^{tX} is defined based on X, and X will be 1 with probability p and 0 with probability (1 − p). When X = 1, you get e^{t·1}, which is just e^t, and that happens with probability p, giving the term p·e^t; when X = 0, you get e^0, which is just 1, and that happens with probability (1 − p). So the moment generating function is p·e^t + (1 − p). You can then play with it a little bit and apply the fact that 1 + x ≤ e^x to get it into the form e^{p(e^t − 1)}, but for now the form that we want is the one shown in yellow.

So now we have a moment generating function. Let us take the first derivative of this expression, which is with respect to t: you are left with p·e^t, and when you evaluate it at t = 0, e^t = 1, so you are left with p, and that is E[X]. If you recall, this fits exactly the theorem that we just proved: take the first derivative, evaluate it at t = 0, and you get back E[X]. You can do the same thing with the second derivative; I will leave you to work that out on your own time, but essentially the second moment of the Bernoulli random variable is also going to be p, and you are going to get the same answer.

(Refer Slide Time: 13:05)

The other thing is that the moment generating function not only captures all the moments; in capturing all the moments, it fully defines the random variable. Here we are just going to state the theorem and skip the proof. What is the theorem stating? If you have two random variables X and Y, and their moment generating functions are exactly equal in the vicinity of 0, meaning for all t ∈ (−δ, δ) for some δ > 0, then we are guaranteed that X and Y have the same distribution. If anything, this tells you that the moment generating function really gets to the heart of what a random variable is; it can characterize the random variable. And there is one more nice theorem before we get back to the Chernoff bounds.

(Refer Slide Time: 14:00)

What is this? This, again, is a way to exploit independence of random variables. You have two random variables X and Y; X + Y is itself a random variable, so there is a well defined notion of the moment generating function of X + Y, and that is the left hand side. The nice thing is that it is just the product of the individual moment generating functions when X and Y are independent; this is not necessarily true when X and Y are dependent, but when they are independent you have this. It is very useful because, later on, when we are trying to bound, say, the binomial random variable or a sum of Poisson random variables, we will exploit independence, and this is a crucial requirement for that.

So why is it true? You take the moment generating function of X + Y and apply the definition: it is nothing but E[e^{t(X+Y)}], and you expand that out to E[e^{tX} e^{tY}]. The nice thing is that e^{tX} and e^{tY} are two random variables; if X and Y are independent, then e^{tX} and e^{tY} are also independent, and if they are independent, the expectation of their product is the product of their expectations. That is what we have over here, and that step is by independence. And what is each of these multiplicative terms? Nothing but the moment generating function of X and of Y, respectively. Is what we have talked about so far clear? Any questions?

(Refer Slide Time: 15:52)

Now we are ready to get back to the Chernoff bounds. We will work out a particular example: the sum of Poisson trials. If you recall, the binomial distribution is the sum of n Bernoulli trials; the sum of Poisson trials is just a generalization of that. So everything we say from now on with respect to Poisson trials is going to be applicable to the binomial distribution as well; in fact, it is one step more general.

What is the general part of it? In the binomial distribution, each Bernoulli trial must have the same probability of success p. When we go to a sum of Poisson trials, we generalize: we allow each Poisson trial X_i to be 1 with some probability p_i and 0 with probability 1 − p_i, and these p_i's can be different for each X_i. So X = X_1 + X_2 + ... + X_n, where each trial can have its own probability of success. Of course, if all of these p_i's are equal, you get back the binomial distribution. Now that we have established that the sum of Poisson trials is essentially the binomial distribution generalized a little bit, let us work out Chernoff bounds for this X.

(Refer Slide Time: 17:29)

So remember, to apply the Chernoff technique we will need e^{tX}, which means we need the moment generating function. The nice thing is that we can immediately start applying the theorem based on independence: since the individual X_i's are independent, the moment generating function for X is the product of the moment generating functions of the individual X_i's. And what is the moment generating function of an individual X_i? If you recall, we worked this out a little while ago for the Bernoulli trial with probability of success p; a few slides ago we bounded it by the form e^{p_i(e^t − 1)}.

So we just apply that, and now you have a product of several exponentials, which means the product of the exponentials is just the exponential of the sum. And what is that sum? One thing needs to be slightly clarified: it may look like the e^t − 1 is in the subscript over here, but the exponent is actually p_i(e^t − 1), and the factor (e^t − 1) can come out of the summation because the summation is with respect to i.

So you get e^{(e^t − 1) Σ_i p_i}. And what is Σ_i p_i? Each p_i corresponds to the random variable X_i, with E[X_i] = p_i, so Σ_i p_i = E[X], and we are going to denote that by µ. So we have got the moment generating function: M_X(t) ≤ e^{(e^t − 1)µ}, which means we can now go back to the Chernoff technique and apply it.

(Refer Slide Time: 19:57)

So before we start applying it, let me state the bounds that we can get. Only one bound am I going to prove; the rest of them you will have to prove on your own. The textbook has them, so you should take some time to work through them.

If you recall, I said there is this one framework, the Chernoff bound technique, from which you can derive multiple bounds; that is why we are going to give multiple bounds now. The first bound: for any δ > 0, consider Pr(X ≥ (1 + δ)µ). Just to give yourself a picture of what is going on here: you have a random variable X with some distribution and E[X] = µ, and some point (1 + δ)µ to the right of the mean. We are asking what the probability is that X falls in this right tail, and that is at most the right hand side quantity (e^δ/(1 + δ)^{1+δ})^µ. It looks complicated, but there are nicer forms; this is why it helps to have multiple Chernoff bounds.

So let us look at the second form. Here we need to restrict δ to be at most 1. What is Pr(X ≥ (1 + δ)µ)? It is the same story, you are trying to bound the right tail, and it is at most e^{−µδ²/3}. This might look like just mathematical jargon, but let us pause for a moment to see why it is powerful. In the binomial distribution, what is your µ? µ is the mean, right? So if you have, say, 10000 coin tosses, your µ is 5000, and where does µ appear in the tail bound probability? µ appears in the exponent, as e^{−µδ²/3}.

So as the mean increases, the probability decreases exponentially. That is the power of these Chernoff bounds: the probability of deviating from the expectation, this tail bound, drops exponentially. In the coin flipping example that we have worked out in the past, it will show up as e^{−5000·δ²/3}. The δ² and the 3 are relatively small quantities, but you can immediately see that as the number of coin flips gets larger and larger, the probability of deviating significantly from the mean drops very, very quickly.

So hopefully this intuition is making sense to you, because it is very important; this is what tells you about the power of the Chernoff bound. And there is one more convenient form.

Student: (Refer Time: 23:32).

Well, it is positive for the numerator; it is positive also for the denominator, and if the denominator is dominating, then it is again going to behave the same way. What is the third form? For some R > 6µ, Pr(X ≥ R) ≤ 2^{−R}. Basically, this is useful when you want to ask what the probability is of deviating beyond 6µ; in some cases this makes sense, and again you see this exponential drop.

Student: From the (Refer Time: 24:11).

Yes. So that makes sense, particularly for things like the normal distribution.

Student: Yeah.

When you work with the standard deviation, you are actually working with Chebyshev's inequality or some related inequality, and that is what is usually done when you do not have a handle on the moment generating function. What you are saying is particularly true for a lot of fields, but in computer science we often define the algorithm ourselves, which means that we control what the random variables are going to look like, which means that we have more power in our hands. We can go to Chernoff bounds, and that is what will happen very commonly: we end up being able to exploit Chernoff bounds.

Because we have access to all the moments (we do not explicitly think about it all the time, but implicitly we have access to all of them), we will be applying the more powerful techniques. And this is significantly more powerful than just asking how many standard deviations away from the mean you are. So I think your question is also motivated by this, right?

Student: Yeah.

This is just one convenient form; it might be useful in some cases where the mean is small. If the mean is small, then you can ask: what is the probability of exceeding 6 times that small mean? Then this might be a convenient form to use. And again, it goes to show that it is basically the same technique; all three bounds are derived using the same technique. It is just playing with the value of t (if you recall, there is a parameter t showing up in the derivation of Chernoff bounds), and with other minor mathematical jugglery you get these convenient forms.

So when you think about applying them, you just choose the most appropriate one; typically the second form is the most commonly used one.
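To see the second form in action, here is a small Monte Carlo sketch of my own (parameters are arbitrary) comparing the empirical right-tail probability of a fair-coin binomial against the bound e^{−µδ²/3}:

```python
import math
import random

random.seed(1)

n, p, delta = 1000, 0.5, 0.05
mu = n * p
threshold = (1 + delta) * mu  # 525 heads out of 1000 flips

# Empirical right-tail probability over repeated experiments
trials = 2000
hits = sum(
    1 for _ in range(trials)
    if sum(random.random() < p for _ in range(n)) >= threshold
)
empirical = hits / trials

# Second-form Chernoff bound for the right tail
bound = math.exp(-mu * delta ** 2 / 3)
print(empirical, round(bound, 3))
```

The empirical frequency comes out well below the bound, as it must; the bound is loose here but tightens exponentially as µ grows.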

(Refer Slide Time: 26:17)

And as I said, the same technique can also be applied to the left tail; we will see that, but for now let us actually look at the proof. Just to be clear, all three bounds have proofs in the textbook. I am only going to go through the first inequality, so that we have one inequality nailed down and we know how to do it; the others I still expect you to go through on your own.

So we are now focusing on the first inequality: the event X ≥ (1 + δ)µ, where δ is any positive value, and we need to prove that the probability is at most the right hand side. Again, we are going to apply the exact Chernoff bound technique that we have already seen before. What is Pr(X ≥ (1 + δ)µ)? Well, you take the exponential on both sides to get Pr(e^{tX} ≥ e^{t(1+δ)µ}); that is the standard technique. Then apply Markov's inequality, so you get E[e^{tX}] divided by e^{t(1+δ)µ}, the right hand side of the event.

But we already know this: a few slides ago we derived the moment generating function for the random variable X, and it is at most e^{(e^t − 1)µ}, if you recall.

So I am just applying that over here, and you get this form. Now comes the trick, because you have to choose a particular value of t that will be convenient for you. You see a lot of e^t's, and if you want to cancel things out, what do you do? When you have e to the ln of something, the e and the ln cancel each other out, right?

(Refer Slide Time: 28:17)

So that is what we are going to exploit. We are going to set t = ln(1 + δ), and because δ > 0, this ln(1 + δ) is also > 0, which means the choice is valid.

So going back to this form: if you apply t = ln(1 + δ), then e^t becomes 1 + δ, so in the numerator e^{(e^t − 1)µ} becomes e^{(1 + δ − 1)µ} = e^{δµ}, and in the denominator e^{t(1+δ)µ} becomes (1 + δ)^{(1+δ)µ}. Taking the whole thing to the power µ, you get exactly (e^δ/(1 + δ)^{1+δ})^µ, which is exactly the form that we wanted. So this is the first bound. The other bounds are derived in a similar fashion; in fact, this is one of the best forms that you can get, and the other bounds are obtained by just playing with it.

Those are left as homework for you. Any questions before I move on about how this derivation works out?

(Refer Slide Time: 29:58)

These are just two forms for the left tail; again, they look quite similar because their derivations follow essentially the same pattern. Because it is the left tail, you need to consider δ values in the range 0 to 1, and here the event is X ≤ (1 − δ)µ: this is your µ, and this is going to be your (1 − δ)µ to its left.

So you are asking what the probability of this left tail is. That is again at most some quantity which, if you notice, looks very similar to the type of bounds we had for the right tail. The second form is also there: Pr(X ≤ (1 − δ)µ) ≤ e^{−µδ²/2}, and usually this form is the more useful one. In fact, you can take the second forms of the left and right tails and combine them.

(Refer Slide Time: 31:03)

In this case, we have something similar in the right tail as well, so we can put them both together, apply the union bound, and get a combination of both right and left tails. Because it is a combination of the right and left tails, you have Pr(|X − µ| ≥ δµ) ≤ 2e^{−µδ²/3}.

Student: (Refer Time: 31:28).

What was that?

Student: Divided by 2 or 3?

The left tail alone is divided by 2, but this happens not to be a typo: when you apply the union bound, you are a little bit relaxed about how you apply it, and you take the weaker of the two exponents for both tails. The first one is, I believe, the tightest, but it is really not so much about tightness in these cases, because tightness typically just gives you a benefit in the constants. The more appropriate question would be which one is the most convenient to use; of course, in some cases one might be convenient but might not give you a good enough bound, right?

So the first one is usually the tightest because it is pretty close to what you want. The reason why I even went through the proof of the Chernoff technique is that there are some cases where you actually have to redo the proof for a specific t value that fits the application. Remember, here we applied a particular t value that nicely cancels things out, and it is great for a general form, but there are applications for which you really have to go through the derivation and choose the appropriate t value yourself.

(Refer Slide Time: 32:54)

So it is good to know the general derivation technique. We are now almost done; the only thing I want to point out now is how the Chernoff bounds we have seen so far fare with respect to coin flips, the example that we have seen. Here X is the number of heads out of n coin flips, so this is essentially the binomial distribution, and since we know how to handle sums of Poisson trials, we can handle the binomial distribution as just a special case.

So we want to ask: in the particular example we had, n was 10000, so the mean would have been 5000, and we are asking what the probability is that X deviates from the mean by a quantity which, if you ignore the constants, is essentially √(n ln n). So for a rough understanding of what we are talking about here: what is √10000? It is 100, right? So we are asking what the probability is that X deviates from the mean by more than roughly 100; you can ignore the constants and the ln(·) term, since ln(·) is small and the constants are also small. Let us see.

How do you work this out? On the right hand side you need it to be of the form δµ, so you should work out the appropriate δ. What we have over here is exactly the tail probability that we want to understand, but now we have to fit it into the Chernoff bound form. So what is your µ? µ = n/2. To extract the δ, you multiply the deviation by (n/2)·(2/n): the factor n/2 is nothing but your µ, and whatever is left is your δ.

So now we have both δ and µ, and we can apply the formula; if you recall, the combined formula is 2e^{−µδ²/3}. When you apply it, you get 2/n. And for a similar range, what did we get out of Chebyshev's inequality? I think we got something like 1/9: the probability of deviating from 5000 by more than about 160 was something like 1/9. But now what are we getting? We are getting 2/n, which is 2/10000. So you see how the tail probability is much tighter when you use the Chernoff bounds. That is generally the idea: wherever possible you want to use the Chernoff bounds, but when it is not possible, you live with Chebyshev's or Markov's inequality.

(Refer Slide Time: 37:33)

So with that we can conclude our segment for today. We introduced the Chernoff bounds technique, and along the way we picked up an understanding of moment generating functions: we looked at some of their properties, and we worked out the derivation of a Chernoff bound for sums of Poisson trials, at least one form of it. There were other forms that we stated without actually looking at the derivations, and then we applied the bound to the coin flips, where we immediately saw how powerful Chernoff is compared to Chebyshev; I hope you can appreciate this difference. So that is what we have seen so far; in this module we have seen two slightly larger topics, the analysis of the median algorithm in the earlier segments and this Chernoff bounds technique, which spans five segments of module 3.

(Refer Slide Time: 38:42)

So in the next module what we are going to do is focus on applications. Now that we know all these bounds, Markov's inequality, Chebyshev's, Chernoff's, and so on, we are going to look at some algorithmic contexts and start applying them there. Hopefully these bounds will then come alive to you, and you will actually see how they help in computer science. That will be the topic for the next module.

Thanks.

Probability and Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology Madras

Module – 04
Tail Bounds I
Lecture - 20
Segment 1: Parameter Estimation

So we are going to start a new module today. This module is basically an extension of module 3: we are going to look at some applications of tail bounds, particularly from an algorithmic and computer science point of view. The first one is estimating a parameter.

(Refer Slide Time: 00:38)

So let me give ourselves some context: we have looked at all of these inequalities, and now we want to look at the following problem.

(Refer Slide Time: 00:51)

So what is given? You are given a population of N people (think of a country like India), and you are given two small constants ϵ and δ, and you want to find the fraction p of the people who like something, say nutty chocolates. More practically, this is used to estimate, for example, the fraction of people who want to vote for or against a certain policy. And what are ϵ and δ? They specify the accuracy with which you want to estimate this parameter p; in particular, you want an (ϵ, δ)-approximation of the parameter p.

What is an (ϵ, δ)-approximation? It is basically an interval [p̃ − ϵ, p̃ + ϵ] around your estimate p̃, where ϵ is the error term. And what role does δ play? (1 − δ) is often called the confidence: your estimate should be correct with probability at least (1 − δ). Another way of stating that is that your estimate can be incorrect with probability at most δ; that is the way I have stated it in this particular case.

258
(Refer Slide Time: 02:16)

Is everybody clear about this problem statement? It is a very simple problem: you just need to estimate the fraction of the people who like something. The nice thing is that the algorithm we are going to consider for this is extremely simple: we are going to run a for loop n times, and each time we sample someone uniformly at random and independently and ask them, do you like nutty chocolates or not? If they like nutty chocolates, we increment the counter X; otherwise we do not.

Then what do we do? We estimate p, and denote that estimate by p̃, as simply X/n, after we have iterated through the n samples. So the algorithm is very straightforward; the only question we need to worry about is what the value of n should be. Obviously, if you just sample 1 person you are not going to get an accurate estimate, and sampling all N people is not very useful either; that is too expensive. The population of India is, what, some 1.3 billion, I think.

(Refer Slide Time: 03:44)

So let us play with our intuition a little bit. Consider two countries, India and Sri Lanka. (We have been having cricket matches recently and have been doing pretty well, so it feels good.) India has 1.324 billion people and Sri Lanka has 21 million people. How does this n change between the two countries? That would be an interesting question to ask: out of curiosity, how many more samples do you need for India? A common intuition is that a country like India needs a lot more effort to get the estimates right, so let us see what happens.

(Refer Slide Time: 04:32)

So going back to the algorithm: X is the total number of sampled people who said they like nutty chocolates. X is a random variable here. Why? The population is fixed; there are N people; the randomness comes from the fact that we are sampling from this population. And p̃ represents the fraction of the people in the sample who liked nutty chocolates; remember, we calculated p̃ to be nothing but X/n, and we are just rewriting it that way here.

However, E[X] is a well defined quantity. Even though we do not know p, p is a fixed parameter: it is the exact fraction of the people in, say, India who like nutty chocolates. So p is a fixed quantity, and therefore E[X] is also a fixed quantity; we do not know it, but it is fixed. In this algorithm, what are the ways in which things can go wrong? Once again, we need to figure out what the bad events are and make sure that the bad events happen with low probability.

Remember, our estimate comes with the range [p̃ − ϵ, p̃ + ϵ], so the true p landing below or above that range is bad: one bad event is the actual p value being less than the left end of the interval, and the other is the p value being to the right of the interval. These are the two ways in which things can go wrong. So let us try to formally state these two bad events; I hope you see the pattern repeating itself here: you clearly specify what the bad event is, and you try to capture it in a way that fits a known tail bound.

So how do we define the first bad event? The actual p being less than p̃ − ϵ is the same as p̃ > p + ϵ, and this concerns X because p̃ = X/n. Take this inequality and multiply throughout by n: on the left you isolate the np̃ term, which is just X, and on the right hand side you get np plus nϵ. So the bad event is X > np + nϵ.

Now you can factor out np, so you get X > np(1 + ϵ/p); but np, of course, is nothing but E[X]. So ultimately this bad event, written in a form that can be captured by Chernoff bounds, is X > E[X](1 + ϵ/p); remember, that is the form in which you want Chernoff bounds, X > (1 + δ)µ.

So you notice that we have gotten exactly the right form. The same thing can be done with the other bad event as well: you will again get X < (1 − ϵ/p)E[X]. All of this is basically taking the bad event and rephrasing it in a manner such that it can be plugged into the Chernoff bounds. So then we can do that.

(Refer Slide Time: 09:32)

So our analysis is basically the following: what is the probability that at least one of the bad events occurs? There are two bad events that we listed, and at least one of them must occur for the estimate to be wrong. That probability is the probability of the union of the two events, shown in red over here, and in the previous slide we worked out a formal way to express those bad events in a manner amenable to Chernoff bounds.

That is what is written over here. Now this is the union of two bad events, and if you recall the union bound, the probability of a union of two events is at most the sum of their individual probabilities. So now you can simply apply the Chernoff bounds to each term.

Let us see: here you have µ = E[X], and ϵ/p plays the role of δ. Just to be clear, this δ is different from the δ we are using in this segment as the confidence parameter; this δ comes from the way we stated the Chernoff bounds, so apologies for reusing the same symbol. If you recall the Chernoff bound, it says Pr(X > (1 + δ)µ) ≤ e^{−µδ²/3}, and that is what is showing up over here; in place of (1 + δ) you have (1 + ϵ/p).

Student: (Refer Time: 11:40).

So here... oh yeah, let me see why that is. Yes, you are right; there is a typo here. The quantity that gets squared is ϵ/p, so the exponent has ϵ²/p², and when you multiply by µ = np, one of the p's gets cancelled; the square continues to play a role, and you are left with nϵ²/p in the exponent.

I am also approximating: instead of dividing by 2p in the exponent for the left-tail term, I just divide by 3p, which only gives a larger bound; since we just want an upper bound, I am combining the two individual terms into 2e^{−nϵ²/3p}. And recall that we want to make sure this probability is at most δ; remember, δ is one of the input parameters, and the probability that we run into a bad event should be at most δ. That is where this δ shows up.

So now let us work with this inequality, 2e^{−nϵ²/3p} ≤ δ. First take the 2 to the other side, giving e^{−nϵ²/3p} ≤ δ/2, and to get rid of the negative exponent, take the reciprocal on both sides: e^{nϵ²/3p} ≥ 2/δ. Now take ln on both sides, which gives nϵ²/3p ≥ ln(2/δ), and take the ϵ²/3p to the other side, which gives n ≥ (3p/ϵ²) ln(2/δ).

How do we interpret this? What does this even mean? It just means that as long as our value of n is at least this much, we are fine. But there is one pesky issue: there is a p showing up, and p is really what we want to estimate. What do we do about that? Exactly: since p ≤ 1, you just take a large enough value of n, simply get rid of the p, and your value of n is going to continue to be sufficiently large.

(Refer Slide Time: 15:27)

So basically, at the end of the day, what you have, after getting rid of the p, is a sufficient bound on n, namely n ≥ (3/ϵ²) ln(2/δ), to ensure that you are going to get an (ϵ, δ)-approximation (modulo the typo on the slide). Which brings us to the issue that we talked about: if you look at this, Bharath's intuition was correct. What is the surprising fact over here?

There is no occurrence of N, which means that whether you are working in Sri Lanka or in India, N does not appear in this bound. Your estimate is only going to depend on ϵ and δ, not on N. That is often a surprising fact, because what you may fail to realize is that, regardless of the size of the population, your algorithm is just based on sampling: each time you sample, you get someone who likes nutty chocolates with some probability p, and that does not change based on whether it is Sri Lanka or India. That is the intuition going on here.

Student: (Refer Time: 17:10).

That is correct; you are absolutely right. In fact, that brings me to the

(Refer Slide Time: 17:16)

conclusions, where I want to emphasize that the result really heavily depends on the fact that we are taking uniformly random and independent samples. And as you can tell from the recent past, predictions go wrong all the time; we can also take a leaf from Niels Bohr's book and say that prediction is very difficult, especially if it is about the future.

Student: (Refer Time: 17:50).

So yes, sometimes, as you start getting older, your memory becomes bad, and even prediction about the past becomes difficult [laughter]. So, with that we conclude this segment.

(Refer Slide Time: 18:09)

So, next we are going to see one more simple application of the Chernoff bound, and we conclude with that.

Thank you.

Probability and Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module – 04
Tail Bounds I
Lecture – 21
Segment 2: Control Group Selection

So, let us start with the second segment of module 4. It is about how experiments are done in the real world, especially in the context of drug trials, or of trying to understand the effectiveness of a campaign, or something like that.

(Refer Slide Time: 00:21)

So, basically, what you do is collect a group of volunteers — they are also called subjects — and these are typically people who have consented to the experiment.

What you want to do is separate this group of volunteers into a control group and an experimental group. You administer the drug to the experimental group, and you administer a placebo — something that looks like the drug but is just an inert substance — to the control group, and you want to be able to say that there has been significant improvement in the experimental group compared to the control group.

Even before we go into interpreting the outcome of the experiment, an important prerequisite for getting things right is to ensure that the separation into control and experimental groups is done appropriately. Suppose, for example, that the drug acts better on people with blue eyes than on those without, and suppose all the blue-eyed people get put in the control group and the non-blue-eyed people in the experimental group. Then your experiment is going to give the wrong result: it will say the drug has no effect, whereas you might have been able to figure out later on that it does have some effect.

So, basically, what you need to do is characterize each individual, and we do this with a feature vector. Let us say you have some n features, and let us think of them as binary: blue eyes versus not blue eyes, curly hair versus non-curly hair, tall versus short, and so on.

So, you characterize each subject by this sort of feature vector, and you want to make sure that, for every feature, there is a sufficient number of people with it in the control group and a sufficient number in the experimental group. You want the separation to avoid situations where a particular feature is found only in the control group or only in the experimental group. So, this is the context in which this problem is studied.

(Refer Slide Time: 03:26)

So, let us try to formalize this. You are given an input matrix A; it is an n × m matrix, where m is the number of subjects — each column corresponds to a person — and n is the number of features. Each feature could be, say, blue eyes or tall. If the i-th feature of the j-th person is present, you put a 1 in entry (i, j); otherwise it is 0. That is how you interpret this n × m matrix. And what is the required output? It is a vector b⃗ whose entries are either −1 or +1: −1 means that person goes into the control group, and +1 means that person goes into the experimental group. So, how do you ensure that the separation is good? Let us take one feature, for example the first.

Basically, when you multiply the matrix A with the vector b⃗, you take the dot product of the first row with b⃗, and by doing so you get c₁. All the people who do not have the feature do not contribute to this dot product; the people with the feature are the ones who contribute. In the perfect case, where the contributors in the control group and the contributors in the experimental group are equal in number, what will the value of c₁ be? It will be.

Student: 0

It will be 0, because it is −1 times the number of contributors in the control group plus +1 times the number of contributors in the experimental group. So, the best value you can hope for is c₁ = 0, but that is too much to hope for; we want, however, to minimize c₁. And we do not just want to minimize c₁ — that is just one feature — we want to minimize across all the features. In other words, you want to minimize.

The maximum of the absolute values |cᵢ| over all the features. Remember, 0 is the perfect value, and cᵢ can err on either side, so you take the absolute value and minimize the maximum absolute value over all the cᵢ's. And how do you get this maximum? It is exactly ||A·b⃗||∞: doing the multiplication gives you the vector c⃗, and picking the maximum absolute entry is the L-infinity norm. This quantity has to be minimized. So, that is the goal.

(Refer Slide Time: 07:36)

So, again we are going to apply the most obvious algorithm; we are going to do the simplest thing possible. Each bᵢ — remember, bᵢ is either +1 or −1 — is going to be, before someone points it out, −1 with probability half and +1 with probability half. This is the most obvious algorithm one could think of, so we need to figure out a way to analyze it: how good is such a silly-looking algorithm? We have to figure out whether it is any good.
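As a toy illustration (the function names and the tiny matrix below are my own, not the lecture's), the algorithm and the quantity it is judged by fit in a few lines; A is the n × m feature matrix and b⃗ the random ±1 split:

```python
import random

def random_split(m, rng=random.Random(0)):
    """The 'obvious' algorithm: each of the m subjects goes to the
    control group (-1) or the experimental group (+1) with
    probability half each, independently."""
    return [rng.choice((-1, 1)) for _ in range(m)]

def discrepancy(A, b):
    """||A.b||_inf: the worst imbalance of any single feature
    between the two groups."""
    return max(abs(sum(a_ij * b_j for a_ij, b_j in zip(row, b))) for row in A)

A = [[1, 1, 1, 1],   # feature 1: all four subjects have it
     [1, 0, 1, 0]]   # feature 2: only subjects 0 and 2 have it
b = [1, -1, 1, -1]   # balances feature 1 perfectly (c1 = 0),
                     # but puts both feature-2 subjects together (c2 = 2)
```

Here discrepancy(A, b) is 2: the split is perfect for the first feature but maximally unbalanced for the second.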

(Refer Slide Time: 08:30)

So, again we are going to apply Chernoff bounds, and we are going to use another variant that we have not seen; it is actually not hard to prove, and the book has the proof of this variation as well. Here X₁, ..., Xₙ are either +1 or −1 with probability half each, again uniformly and independently at random, and X is the sum of the individual Xᵢ's. In this case the mean is 0: each term is either +1 or −1, so the mean of the sum is 0. We ask: what is Pr(X ≥ a)? What is proved is that it is at most e^(−a²/(2n)). And if you want to bound both ends of the tail, then Pr(|X| ≥ a) is at most 2e^(−a²/(2n)). So, it is fairly straightforward.

Student: (Refer Time: 09:54) There it was divided by 3; here the right side has a 2?

Correct. Here the proof is a little bit different, because this setting is symmetric. Notice that there you were looking at 0/1 variables, while here you are looking at −1/+1 variables, so the sum is symmetric about 0. When you prove a tail bound on the right side, the exact same tail bound holds on the left side as well. So, you can simply use that symmetry and the union bound to get the 2 in 2e^(−a²/(2n)).
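This ±1 tail bound can be sanity-checked with a quick simulation (an illustration, not a proof; the parameters below are my own choice):

```python
import math
import random

def two_sided_tail(n, a, trials=5000, rng=random.Random(1)):
    """Empirical estimate of Pr(|X| >= a), where X is the sum of
    n independent uniform +/-1 random variables."""
    hits = 0
    for _ in range(trials):
        x = sum(rng.choice((-1, 1)) for _ in range(n))
        if abs(x) >= a:
            hits += 1
    return hits / trials

n, a = 100, 30
empirical = two_sided_tail(n, a)
chernoff = 2 * math.exp(-a ** 2 / (2 * n))   # 2e^{-900/200}, about 0.022
```

The empirical tail probability comes out well below the Chernoff bound, as it must: the bound is an over-estimate of the true tail.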

(Refer Slide Time: 10:35)

So, now just keep this bound in mind; we are going to apply it. Recall what we need to prove: that the probability that ||A·b⃗||∞ ≥ √(4m ln(n)) is small. Since ln(n) is relatively small, essentially what we are saying is that, for every feature, the imbalance — how many more people with that feature land in the control group than in the experimental group, or vice versa — is at most roughly √m. So, that is the way to interpret it.

So, we want to say that the probability of exceeding √(4m ln(n)) is at most 2/n, and remember this is across all the n features. So, what we are going to do is focus on one feature first; without loss of generality, let us focus on the first feature, that is, row 1. If the number of 1's in that row is small — itself at most √(4m ln(n)) — then we are immediately done: however you split the people, the imbalance between the control group and the experimental group cannot exceed that number. And in a sense this is fine, because it is a feature that affects a very small number of people; so you immediately get what you want, namely that |c₁| is at most √(4m ln(n)), and you are done. We therefore focus on the case where the number of 1's is greater than this quantity; only then do we have to care about making sure that the control group and the experimental group are roughly balanced.

(Refer Slide Time: 13:23)

In this case, let us look at c₁: c₁ is nothing but the dot product ∑_{j=1}^{m} a₁ⱼ bⱼ. And we know that the number of terms that are 1's — the number of j's with a₁ⱼ = 1 — is at least √(4m ln(n)). So, now, what is the probability that this c₁ is greater than that quantity?

Remember, the bⱼ's are the random variables here — the ones your algorithm chose randomly — while each a₁ⱼ was given to you, either 0 or 1. Let us be a little careful about what is going on: when a₁ⱼ = 0, that j-th term plays no role and you can ignore it. The only terms that matter are those with a₁ⱼ = 1, and there are at least √(4m ln(n)) of them. And you still want to bound c₁, which is the sum over all of them: some bⱼ's are +1 and some are −1.

So, the expectation is still 0, and you have the same situation as before: a distribution with mean 0 in which each term that matters is either +1 or −1, and you are asking for the probability of lying beyond the bound on either side. That is what is being asked here.

So, we can simply apply the formula. There is a square root on the right-hand side of the event, and the formula has a square in the exponent, so the probability is at most 2e^(−4m ln(n)/(2k)), where k is the number of 1's in the first row, which is greater than √(4m ln(n)) — those are the only terms that matter. But again, if you just want an upper bound, you can replace the k by n and it is still fine. So, if you make that an n —

Student: (Refer Time: 17:00).

Oh, ok — yes, you are right: you can make it an m, sorry, not an n. The TA was the one who pointed that out, so you get a day off. Now the m cancels out, and the 4/2 leaves a 2, so you get 2e^(−2 ln(n)) = 2n⁻², which is what we have here: 2/n². But so far we have only bounded this probability for one feature, and there are a total of n features.

So, again, apply the union bound over all n features: you multiply this bound by n. This is why we carefully did this reverse engineering — knowing that we want the final bound to be of the form 1/n, we bound each individual probability by 2/n², so that the n and the n² combine to leave 2/n. And that essentially proves the theorem for us.
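The guarantee can also be checked empirically on random instances (a small illustrative experiment of my own; the random 0/1 matrix is not from the lecture):

```python
import math
import random

rng = random.Random(7)
n, m = 20, 400                       # n features, m subjects
A = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]
bound = math.sqrt(4 * m * math.log(n))   # sqrt(4 m ln n), about 69.2

trials, failures = 200, 0
for _ in range(trials):
    b = [rng.choice((-1, 1)) for _ in range(m)]
    disc = max(abs(sum(a * x for a, x in zip(row, b))) for row in A)
    if disc >= bound:                # the 'bad' event of the theorem
        failures += 1

# The theorem promises Pr(bad) <= 2/n = 0.1; empirically it is far smaller.
```

On random splits the discrepancy essentially never reaches √(4m ln n), consistent with the union-bound argument above.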

(Refer Slide Time: 18:33)

So, with that we can conclude. Again, we have seen a very fundamental problem that shows up all the time, with an obvious algorithm that practitioners use all the time; what we have done is give a formal analysis — a proof that the obvious algorithm is in fact a good algorithm. And if time permits, we will address techniques for showing that no significantly better solution exists in the worst case.

In other words, one can always come up with input matrices for which no significantly better b⃗ exists, so you cannot come up with a much better output. This is essentially the best you can hope for.

(Refer Slide Time: 19:29)

So, with that we will conclude this segment. The next segment will be a little more non-trivial, but at the same time a lot more fun in some sense: it is going to be about routing, and it is a contribution by Leslie Valiant, who recently won the Turing award for this and a few other similar contributions. So, that will be the next segment.

Thank you.

Probability and Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology Madras

Module - 04
Applications of Tail Bounds
Lecture – 22
Routing in Sparse Networks

(Refer Slide Time: 00:33)

So, we have seen a couple of applications of tail bounds in this module, and now we are going to see something that is actually quite famous: routing in sparse networks. Here is the plan for this segment. First, we will define a low-degree network. If you think about networks in data centers and the like, you need low-degree, sparse networks: you still want low diameter, so that latencies are small, but at the same time you do not want very high degree or a very large number of edges — dense graphs — because then you have to invest in a lot of wire. A hypercube is one such example that is both low degree and low diameter.

So, we are going to define the hypercube, and then take an illustrative problem, the permutation routing problem, which we will define. We will point out that deterministic algorithms are bad for solving permutation routing; we will not prove this, but we will point it out. Then, in this segment, we will discuss a very simple and very neat algorithm — possibly a little counterintuitive, but very nice and elegant — and in the next segment we will analyze its properties.

(Refer Slide Time: 01:53)

Just to motivate: this problem is set in the context of congested networks. We are all used to traffic congestion, slow internet, and the like; and sometimes, when you see how flyovers are built in China, you wonder what thought went into them.
(Refer Slide Time: 02:15)

So, let us actually look at some science behind this — something called Braess's paradox. Let me briefly talk about it. This paradox shows that "smart" ideas may not really lead to improvements. Let us see what happens here.

So, you have two cities S and T, and two ways to go from S to T. Let us say there are some 20,000 cars that want to go from S to T, and each route has two segments (for now, ignore the dotted route). On the top route, there is a segment labelled L/200: the time it takes a car to go from S to A is the load — the number of cars using that segment — divided by 200. You can think of it as a small road: the more cars that use it, the more congestion, and therefore the more delay.

But the other segment, from A to T, has a fixed travel time — like a highway with plenty of lanes, where the latency does not depend on the number of cars. The bottom route is similar, except that the constant-time segment comes first, followed by the segment whose delay depends on the load. Now, in our example there are 20,000 cars, and it is not hard to see what happens without the middle road connecting A and B (ignore it for the moment).

With a little game theory, you expect that if one side gets too crowded, people will start using the other side, so the cars will split roughly into 10,000 on each route. It is not hard to see that the latency is then 160 units of time: with 10,000 cars on a route, dividing by 200 gives 50 units for the congested segment, plus 110 units for the fixed segment, for a total of 160 units.

But let us see what happens if you build a very high-speed corridor connecting A and B — a very fast, high-investment road that takes zero time from A to B. People will try to game the system: the natural thing is to go from S to A, use the very fast connection from A to B, and then go from B to T. This has a snowballing effect, and ultimately everybody will want to do it.

And if everybody does that, we are in a bad situation, because the S-to-A segment now takes 20,000/200 = 100 units, the middle takes 0, and B to T takes another 100 units — 200 in total. So the Nash equilibrium, if you will, becomes this bad situation. This is just to illustrate that congestion is important and funny things can happen; we have to be very careful, and being so-called smart is not necessarily the right thing to do.
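The two equilibria in this example reduce to simple arithmetic (numbers as in the lecture; the fixed segment takes 110 units):

```python
def latency_without_shortcut(cars=20000):
    """Cars split evenly over the two routes; each route costs
    (load / 200) on the small road plus 110 on the highway."""
    per_route = cars // 2
    return per_route / 200 + 110          # 50 + 110 = 160

def latency_with_shortcut(cars=20000):
    """At the new equilibrium, everyone drives S->A (load/200),
    the free A->B link, then B->T (load/200)."""
    return cars / 200 + 0 + cars / 200    # 100 + 0 + 100 = 200
```

Adding the zero-cost link raises everyone's travel time from 160 to 200 units — the paradox in one line of arithmetic.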

(Refer Slide Time: 06:24)

So, this brings us to a contribution by a recent Turing award winner, Leslie Valiant, known for many things; we will be focusing particularly on the theory of parallel and distributed computing. One of his famous contributions there is what we are going to talk about today.

(Refer Slide Time: 06:51)

So, first, to set it up, we have an n-dimensional hypercube. What is that? Each node in the hypercube is denoted by an n-bit id — the number of bits is n — so the total number of nodes in the n-dimensional hypercube is N = 2ⁿ. When are two nodes connected? They are connected if and only if their ids differ in exactly 1 bit. So, the node 1011011 will be connected to exactly seven other nodes: for example, to 0011011, because they differ in the first bit, and to 1111011, because they differ in the second bit, and so on.
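The neighbor rule is easy to state in code (a small helper of my own, with node ids as bit strings):

```python
def neighbors(node_id: str):
    """All ids differing from node_id in exactly one bit position:
    these are the n neighbors of node_id in the n-dimensional hypercube."""
    flip = {"0": "1", "1": "0"}
    return [node_id[:i] + flip[bit] + node_id[i + 1:]
            for i, bit in enumerate(node_id)]

nbrs = neighbors("1011011")   # the 7-bit example node above
```

As expected, the 7-bit node has exactly 7 neighbors, including 0011011 (first bit flipped) and 1111011 (second bit flipped).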

(Refer Slide Time: 07:59)

So, now let us look at how a hypercube actually looks; this is an example of a four-dimensional hypercube. Notice that each node is denoted by a four-bit string, its id. Take some node: who are its neighbors? It is connected to the node that differs from it in the first bit — that is why they are connected. Take another node that differs from it in two bits: they are not connected. This is how an n-dimensional hypercube looks. It gets a little hard to visualize as the dimension increases, but for now the only intuition you need is that two nodes are connected exactly when their ids differ in 1 bit.

(Refer Slide Time: 09:03)

So, this is the network topology or structure. We also want to understand the rules by which this network plays. It is a synchronous network, which means everybody has a global clock — like a conductor saying "time step one" and everybody does something, then "time step two" and the next step occurs, and so on.

So, everybody follows the same clock. And along any edge in the network, at most one packet of information can cross per time step. This is where congestion issues start to show up: it is a restriction, but a realistic one, and if it were not there, what we are going to study would not be interesting.

Now you may ask: if only one packet can go and a bunch of packets are waiting, how do you handle that? You put them in a first-in-first-out queue. There is one such queue for every edge incident at a vertex: all the packets waiting to go along that edge are put in the queue and sent out one by one. So, each vertex actually maintains n queues.

(Refer Slide Time: 10:34)

So, this is the permutation routing problem — the problem we want to solve in this hypercube network setting. Each node i has a packet pᵢ, and at the end, each pᵢ, which starts at node i, must reach node π(i), where π is a permutation. So, everyone starts with a packet, and everyone ends up with the corresponding packet. The goal is to minimize the number of rounds required to complete this permutation routing.

And you can see the issue: each node has a packet, so there are N packets, all jostling with each other to reach their final destinations. Along each edge, only one packet can go per time step, so if there is too much congestion, packets have to wait along the way for several steps, and the total number of rounds needed for all the packets to reach their destinations can become uncomfortably large. That is what we want to minimize.

(Refer Slide Time: 12:01)

So, let us start with the bad news first. We are interested in oblivious algorithms, a very natural class of algorithms in the distributed setting. If a packet has to go from S to some T — T being π(S), the point it must reach under the permutation — then the route it takes should depend only on S and T.

This makes sense, because in a distributed setting you do not know what the other packets are doing, or what their source-destination pairs are; those should not be part of your concern. If you are a packet going from S to T, your path should depend only on S and T — a very natural restriction. So, we are only interested in oblivious algorithms.

Let us focus first on deterministic algorithms, and let me state the bad news. Fix any deterministic oblivious algorithm — any algorithm you can think of — on any network of out-degree n with N nodes in total. Then there always exists a permutation π, assigning to each starting vertex the node its packet must reach, that forces the algorithm to take at least Ω(√(N/n)) rounds.

You can more or less ignore the n for understanding purposes — n is a small quantity, log(N). What you need to focus your attention on is the dependence on √N, which can be quite large — uncomfortably large. That is what we want to try to avoid.

(Refer Slide Time: 14:12)

So, naturally we want to try to see if there is some randomization we can do.

(Refer Slide Time: 14:17)

So, here is the good news: there exists a randomized algorithm for permutation routing on the n-dimensional hypercube that routes all packets to their destinations in O(n) rounds. What used to be √(N/n) has now become roughly the logarithm of that. To get a rough sense of the difference, say N = 1 million, so √N is about 1000; but n is just the number of bits needed to represent 1 million, which is about 20 — quite small. So, you see this is a significant improvement; of course we are ignoring constants, but constants should not play too much of a role here. And this is guaranteed with probability 1 − 1/N, that is, with high probability. So, this is the good news.

(Refer Slide Time: 15:36)

So, let me give an algorithm. This is called the bit-fixing algorithm, a very natural way to route packets. Remember, each packet starts from a node; let us look at a particular packet starting at the node with address (a₁, ..., aₙ) and destined for the node with address (b₁, ..., bₙ). Let us work through a simple example: say the source is 1011 and the destination is 0100 — two nodes in the network.

From the source there are four neighbors, but the bit-fixing algorithm scans the bits from left to right. Is the first bit different between where the packet is now and where it finally has to go? Yes. Then you fix that bit: changing that bit alone means there is a neighbor in which exactly that bit differs, so you go to that neighbor — the node in which the first bit alone is different, 0011.

The second bit is again different between where the packet currently is and where it finally has to go, so you fix that bit too: it becomes 0111. The third bit also needs to be fixed — pretty much all of them do in this case — giving 0101, and finally you fix the last bit and reach the destination, 0100. In this manner you go from one node to its neighbor by fixing bits from left to right. Why is this not good enough — what sort of an algorithm is this?

Student: Deterministic, deterministic.

(Refer Slide Time: 17:51)

It is a deterministic algorithm, and we know what that means: some permutation will force it to take Ω(√(N/n)) rounds. It is a very nice-looking, simple, clean algorithm, but it is not good enough, because it is deterministic, and as soon as an algorithm is deterministic, the bad-news scenario applies. Keep in mind that this is done by every packet, since every packet has a starting location and a destination location.
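The bit-fixing rule can be written directly (a sketch with ids as bit strings; it reproduces the 1011 → 0100 walk above):

```python
def bit_fixing_route(src: str, dst: str):
    """Route from src to dst by fixing differing bits left to right;
    returns the full sequence of nodes visited, starting at src."""
    path, cur = [src], list(src)
    for i in range(len(src)):
        if cur[i] != dst[i]:
            cur[i] = dst[i]        # step to the neighbor with bit i fixed
            path.append("".join(cur))
    return path

route = bit_fixing_route("1011", "0100")
# visits 1011 -> 0011 -> 0111 -> 0101 -> 0100
```

Each hop flips exactly one bit, so every step in the path crosses a single hypercube edge.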

(Refer Slide Time: 18:34)

So, how do we salvage the situation? We have to insert some randomization. How did Valiant achieve this? It is a very nice algorithm: it takes advantage of bit-fixing, but sets it up in a randomized fashion, in two phases — and this is the beauty of the algorithm. Recall that each packet p must be routed from its source s_p to its destination d_p.

Here is what you do. In phase one, packet p is routed from its source to a completely randomly chosen intermediate destination using bit-fixing: the source is where the packet starts, and you pick a random n-bit string that names the random destination the packet has to go to. And you know how to get there: you bit-fix your way to that location.

Then, in phase two, the packet makes its way from that random location to its final destination. So, it is a very simple algorithm — just two phases of bit-fixing: each packet goes from its starting location to a random location, and from that random location to its final destination, both via bit-fixing. Bit-fixing by itself did not work, but this way we have salvaged it. Is each phase still a permutation routing?

So, basically d_p is π(s_p), but each phase by itself is not a permutation routing — you are absolutely right. There can be collisions on the receiving end: multiple nodes could choose the same random intermediate node r_p. That is a possibility, and it is completely fine; it only depends on the local memory of each node, and there is usually enough, because we anyway have to maintain the queues in each machine. So, that is not the problem. The problem is that, through the links, you can only send packets one by one.
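Valiant's trick composes two bit-fixing walks through a uniformly random intermediate node. Here is a sketch of the route construction only (queueing and timing are what the analysis is about; the helper names are mine):

```python
import random

def fix_bits(src: str, dst: str):
    """Bit-fixing path from src to dst, scanning bits left to right."""
    path, cur = [src], list(src)
    for i in range(len(src)):
        if cur[i] != dst[i]:
            cur[i] = dst[i]
            path.append("".join(cur))
    return path

def valiant_route(src: str, dst: str, rng=random.Random(0)):
    """Phase 1: bit-fix from src to a uniformly random node r.
    Phase 2: bit-fix from r to the true destination dst."""
    r = "".join(rng.choice("01") for _ in src)
    return fix_bits(src, r) + fix_bits(r, dst)[1:]   # r listed once

route = valiant_route("1011", "0100")
```

Every hop in the combined route still crosses exactly one hypercube edge; only the detour through r is random.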

(Refer Slide Time: 20:36)

So, this really is the algorithm — any questions on it? Hopefully everything is clear; we will focus on the analysis in the next segment. In this segment we have just seen a very simple algorithm, and at least the first time I saw it, it was counterintuitive to me — in some sense intuitive, in some sense counterintuitive. There is a subtlety in here.

Student: (Refer Time: 21:01).

Yes, there is a subtlety here. If you define the algorithm so that phase one has to complete before phase two starts, then even if one packet reaches its intermediate node before another, both will wait at their intermediate destinations until it is time for phase two to start. But you can also consider the other version, where —

Student: Perfect.

Not waiting is also perfectly legal.

Student: Perfectly legal.

It is cleaner, but even if there are collisions, it is not a problem. There will surely be collisions, and they will spread out; we will actually study how to analyze the exact issue you are talking about. A good fraction of the nodes will not receive any packet at all as an intermediate destination, which means a good fraction of the others will have collisions.

And in fact you can also analyze how many packets will end up in I mean the maximum
number of packets that will end up in some node and that will turn out is it is actually
something like 𝑛in this case𝑛/ log(𝑛)roughly speaking. And we will analyze these things in
forthcoming lectures ok. So, this is basically what you are alluding to is balls and bins
analysis.
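This balls-and-bins behavior is easy to observe empirically; here is a sketch of my own (the function name and the printed columns are my choosing), throwing N = 2^n packets into N random intermediate nodes as in phase one:

```python
import math
import random
from collections import Counter

def balls_and_bins(n, seed=0):
    """Throw N = 2**n packets into N = 2**n uniformly random intermediate
    nodes (phase one) and report the maximum load and the empty fraction."""
    random.seed(seed)
    N = 2 ** n
    loads = Counter(random.randrange(N) for _ in range(N))
    max_load = max(loads.values())
    empty_fraction = (N - len(loads)) / N   # nodes receiving no packet
    return max_load, empty_fraction

for n in (8, 12, 16):
    max_load, empty = balls_and_bins(n)
    print(n, max_load, round(empty, 3), round(n / math.log2(n), 1))
```

The empty fraction hovers around 1/e ≈ 0.37, and the maximum load grows slowly, in the ballpark of the 𝑛/log(𝑛) figure mentioned above.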

(Refer Slide Time: 22:33)

So, of course, we have to ask ourselves how we can establish that this is a good algorithm. It is a very clean, neat algorithm, but whether it is good is what we are going to see in the next segment, ok.

Probability and Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module - 04
Application of Tail Bounds
Lecture - 23
Analysis of Valiant's Routing

(Refer Slide Time: 00:23)

So, now we are in segment four of module four, in which we are going to analyze Valiant's routing, which we saw in the previous segment. We need to ask ourselves if that algorithm is any good.

(Refer Slide Time: 00:27)

And let me quickly remind you of what the context is. We have a synchronous network on an n-dimensional hypercube; each node i has a packet 𝑝𝑖, and the packet 𝑝𝑖 must reach a destination node π(𝑖), where π is some arbitrary permutation. You must minimize the number of rounds; that is the problem.

(Refer Slide Time: 00:56)

And recall that there was bad news for deterministic algorithms: basically, there is a Ω(√(𝑁/𝑛)) term that you cannot avoid.

(Refer Slide Time: 01:07)

So, what we did was we randomized; we came up with a randomized algorithm.

(Refer Slide Time: 01:12)

Basically, our claim, which we have not proved but hope to prove in today's segment, is that this randomized algorithm will complete permutation routing in 𝑂(𝑛) rounds, and that it will succeed with high probability, ok.

(Refer Slide Time: 01:32)

And what is this? Well, this randomized algorithm uses bit-fixing as a sort of subroutine. What is bit-fixing? Basically, it is a way for a packet to go from one node to another node by fixing bits one by one, from left to right. But what is important is that bit-fixing on its own is a deterministic algorithm, and so it is not good enough.
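The bit-fixing subroutine can be sketched as follows (my own code, not from the lecture; addresses are n-bit integers, with the leftmost bit taken to be the most significant):

```python
def bit_fixing_path(src, dst, n):
    """Hypercube path from src to dst: scan bit positions from left (most
    significant) to right, flipping each bit where the current address
    disagrees with dst."""
    path, cur = [src], src
    for j in range(n - 1, -1, -1):       # leftmost bit first
        if (cur ^ dst) >> j & 1:         # bit j disagrees: fix it
            cur ^= 1 << j
            path.append(cur)
    return path

# 000 -> 110 on the 3-cube: fix bit 2, then bit 1
print(bit_fixing_path(0b000, 0b110, 3))   # [0, 4, 6], i.e. 000 -> 100 -> 110
```

Positions where the two addresses already agree are simply skipped, so the path has at most n edges.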

(Refer Slide Time: 01:53)

Which means that you need a slight modification: the randomized algorithm, where we go through two phases, in which the packet goes from the starting node to a random destination using bit-fixing, and then from that random destination to the final destination, again using bit-fixing. So, these are the two phases that we need, and this algorithm, we claim, has the nice property that it terminates in 𝑂(𝑛) rounds with high probability.

(Refer Slide Time: 02:26)

So, now we are ready for the analysis; some preliminary thoughts. One thing to start off with: this clearly needs Ω(𝑛) rounds. Why is that? Bit-fixing is one specific strategy, but more fundamentally the diameter of this network is n, and bit-fixing gives you the intuition as to why the diameter is n. Because the diameter is n, there will always be permutations where a packet has to traverse n edges, and therefore it will take at least Ω(𝑛) rounds.

Student: (Refer Time: 03:02).

Yeah, that is the other thing. Only the intermediate destination is random; the source and the final destination the adversary can actually decide. For example, the adversary decides that the packet at 00000 has to go to destination 11111; then all n bits have to be flipped. So, when we want to prove 𝑂(𝑛), we want to make sure that there is enough intuition, enough hope, to even proceed forward, ok.

So, let us try to establish that hope. There are N packets, and each of them will have to walk for at most 𝑂(𝑛) steps, ok. So, that means there are about 𝑁𝑛 hops that have to take place overall. And how many edges are there? Well, n is the degree of each vertex, and there are N vertices, so there are 𝑁𝑛/2 edges.

So, if you look at the number of hops over the number of edges, that is just a constant. So, this is sort of the good news, if you will, in the sense that there are sufficiently many edges for you to get the packets across. We do not know how yet, but if we did not even have sufficiently many edges, we would be in bad shape; we do have them. So, there is hope.
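This back-of-the-envelope count can be written out as a tiny sketch (my own, with names of my choosing):

```python
def hops_per_edge(n):
    """Total hop demand vs. number of links on the n-dimensional hypercube."""
    N = 2 ** n              # number of nodes (and of packets)
    total_hops = N * n      # each packet crosses at most n edges
    num_edges = N * n // 2  # degree n, N vertices, each edge counted twice
    return total_hops / num_edges

print(hops_per_edge(10))    # 2.0, a constant, for every n
```

So the demand per link is constant, which is why an 𝑂(𝑛) schedule is at least plausible.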

But the issue that we need to be aware of is this: let us say one or two packets get stuck somewhere; they would cause a delay, and therefore a lot of other packets come and get stuck there. And then there is a ripple: because of that delay a lot more packets get stuck, and so on. This ripple effect is what we should be able to claim will not happen.

(Refer Slide Time: 05:13)

So, let us get into the analysis with some preliminaries. One thing we are going to do is focus only on phase one; I should probably not use the phrase forward direction. Phase two we are going to essentially hand-wave in this segment, but it is an exercise for you to think about how whatever we discuss in class applies to phase two as well, and to complete the thought process needed to convince yourself that phase two will also work fine, ok.

So, we are going to focus on phase one, ok. And in phase one let us focus on a particular packet p. This particular packet p is starting from 𝑠𝑝 and it is going to 𝑟𝑝, and we have a picture for that. Along the way, at each edge that it traverses, there is going to be a queue. So, if the queue is not empty, the packet will have to get into the queue, wait for its turn to get out of the queue, and then traverse the edge, ok. So, this is the context.

(Refer Slide Time: 06:47)

The path this packet p uses we are going to denote by P, and it can have at most n edges. Why? Because we are doing bit-fixing. Another important property: each of these edges represents a bit that was fixed, in sequence from left to right. Some of the bits might be skipped, because the current address and the destination already matched there. But you will never have a bit on the right side get fixed and then a bit on the left side; you always fix from left to right, all right.

(Refer Slide Time: 07:30)

Let us make a first attempt and see where it takes us, ok. We want to bound 𝑇(𝑝), the time taken by this packet p to traverse its path P. Whenever we have a large random variable, we want to try and break it up into smaller pieces. So, how do we do that? We define something called 𝑋(𝑒); remember that P is made up of a number of edges.

So, this e is one of those edges, and 𝑋(𝑒) is 1 plus the time spent by packet p in the queue at edge e, ok. Why the one? One is the actual time it takes for the packet to go through that edge; but before it even went through the edge, it had to spend some time in the queue, and that is taken care of by the other part, ok.

And now you can see that 𝑇(𝑝) = ∑_{𝑒∈𝑃} 𝑋(𝑒), ok. So, we have a way to take this 𝑇(𝑝) and break it up into smaller pieces, and now hopefully we can use Chernoff bounds, ok. What might be an issue with this? Here is the problem. Let us first talk about a minor issue that could possibly be addressed: 𝑋(𝑒) is not a zero-one variable, ok, but that can be addressed.

Because we can go back to the Chernoff bounds technique and massage it to work for these types of variables. But even if we did that, the 𝑋(𝑒)'s are not independent, and for Chernoff bounds to work we need independence, ok; so that is the problem. They are not independent, so this will not work.

(Refer Slide Time: 09:24)

So, we have to refine the analysis somehow, and in a particular way: we want to go after a quantity that can be broken into small independent pieces, so that we can use Chernoff, ok. So, let us try to see how that can happen. Let us get back to our path P, ok.

(Refer Slide Time: 09:39)

And let us try to see how that path P might interact with some other path 𝑃', ok. So, now you have a packet 𝑝' making its way through a path 𝑃'. Here is an interesting property. One possibility is that they never converge; the sets of edges they use are disjoint. That is great: it means there is no interaction. But if they were actually going to interact, there would be a vertex at which they come together; they converge at that point, they may follow each other for a little while, and then at some point they diverge. Once they diverge, will they ever reconverge? They will never reconverge, because the moment they diverged, at that point they fixed different bits. And when they fix different bits, their left sides, their prefixes, are different from then on, which means they are never going to find a way to converge again, ok. So, this is a good property to keep in mind. It will not come in right now; it will come in shortly, but it is an easy to understand property, ok.
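This divergence property can be checked empirically (my own sketch, not from the lecture: generate random bit-fixing paths, find a vertex where two paths leave by different edges, and verify that their remainders never meet again):

```python
import random

def bit_fixing_path(src, dst, n):
    """Left-to-right bit fixing on the n-cube (addresses are n-bit ints)."""
    path, cur = [src], src
    for j in range(n - 1, -1, -1):
        if (cur ^ dst) >> j & 1:
            cur ^= 1 << j
            path.append(cur)
    return path

def diverge_then_meet(p1, p2):
    """Look for a vertex v on both paths where the two packets leave by
    different edges; report True if the two suffixes after v ever share a
    vertex again (the lemma says they cannot)."""
    for i, v in enumerate(p1[:-1]):
        if v in p2:
            k = p2.index(v)
            if k < len(p2) - 1 and p1[i + 1] != p2[k + 1]:
                # they diverge at v; their remainders must stay disjoint
                if set(p1[i + 1:]) & set(p2[k + 1:]):
                    return True
    return False

random.seed(7)
n, N = 8, 2 ** 8
for _ in range(2000):
    s1, d1, s2, d2 = (random.randrange(N) for _ in range(4))
    assert not diverge_then_meet(bit_fixing_path(s1, d1, n),
                                 bit_fixing_path(s2, d2, n))
print("after diverging, no pair of bit-fixing paths met again")
```

The reason, as in the lecture: at the divergence vertex the two packets fix different bit positions, and the more significant of those two bits is frozen differently in the two packets from then on.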

(Refer Slide Time: 11:09)

Let us now try to understand these interactions, ok. Let me give you a sneak preview of what we are going after: H, which is the number of packets whose paths interact with path P. This quantity we can break down, ok. How are we going to break it down? We define 𝐻(𝑝'), which is H parameterized by 𝑝', to be 1 if this other packet 𝑝' uses any edge used by p, basically if at some point they converge, and 0 otherwise. And now, if you think about it, 𝐻(𝑝') and 𝐻(𝑝'') when 𝑝' ≠ 𝑝'' are independent.

Why are these two quantities independent?

Student: (Refer Time: 12:00).

𝑝'' is just another packet. So, if you have two packets, H of one packet and H of the other packet are independent: whether either of them interacts with path P depends only on the random destinations chosen in phase one.

Student: (Refer Time: 12:21).

It is random. So, for each packet the destination was chosen independently at random, right. If you have a packet starting at some node, that node tossed n coins and that defined where the destination was. It had nothing to do with what random destination was chosen by another node, ok. So, the paths chosen by 𝑝' and 𝑝'' are completely independent of each other, and therefore whether each interacts with this path P that we are interested in is also independent of the other. Note that 𝐻(𝑝') is 1 only if 𝑝' uses an edge of P, not just a node; the reason is that if it only shares a node, then there is no delay caused by it, ok.

So, now, once we have established this independence, we can use the fact that H is the summation of these individual 𝐻(𝑝')'s.

Student: (Refer Time: 13:25).

Ok, so what we care about is a path P from 𝑠𝑝 to 𝑟𝑝. And we are asking about another path 𝑃', chosen by this packet 𝑝', and then this other path 𝑃'' chosen by 𝑝''. Well, in this picture I have drawn them as if they are interacting with this path P that we care about, but whether each of them interacts or not is independent of the other. Why? Consider the path 𝑃''; it belongs to packet 𝑝'', and it goes from the source of 𝑝'' all the way to the random destination of 𝑝''. The source is fixed, but the random destination was chosen completely uniformly at random. So, now, there is a certain probability with which this path 𝑃'' will interact with P, ok. But what is our claim? It is that this is independent of whether 𝑃' will interact with P or not, ok.

We are not claiming that their probabilities of interaction are equal, or unequal, or anything like that. What we are claiming is that whether 𝑃'' interacts or not depends purely on its own randomness, ok. And as a result, these two quantities 𝐻(𝑝') and 𝐻(𝑝'') are going to be independent of each other.

(Refer Slide Time: 15:33)

So, now this is our road map, ok. We are going to prove that H, the number of packets that interact with this path P, is at most 6n with high probability; this is primarily what we are going to focus on. And then, given that only 6n other packets interact with this path, we are going to prove that the overall time is going to be at most 𝑂(𝑛). This second statement is actually a deterministic claim, which is why we will spend more time on the first, all right.

(Refer Slide Time: 16:11)

So, in order to do this, let us get a few definitions down. Remember we are focusing on a path P, and e is some edge in that path; each edge is fixing some bit, and in this case e is fixing the jth bit, ok.

So, I am going to ask a question: which packets can reach u and then have the possibility of traversing e? Let us have a picture: e is going from u to v; u has this address over here and v has this address over here, ok, and they differ in the jth bit. So, which packets can reach u, go through to v, and move on? You need to be a little bit careful about the prefixes and suffixes.

(Refer Slide Time: 17:35)

Well, let us be a bit careful here. Which packets can reach u? It is not about their prefix; it is about their suffix, if you think about it. The prefixes can somehow get bit-fixed and changed on the way to u, but if a packet has reached u with the jth bit still to be fixed, then all the bits from position j onward of its starting address must match those of vertex u, because bit-fixing never touches a bit before reaching its position; do you see that? And for the bits to the left of j, its random destination must have fixed them so that they match u.

Then, for the packet to actually traverse e, the jth bit of its destination should be 𝑣𝑗, so that crossing e is what fixes that bit; what it does after that particular bit gets fixed, we do not care. So the part of the destination beyond j can be anything. And note that since e is an actual edge of the hypercube, 𝑣𝑗 is the complement of 𝑢𝑗; that is the bit being fixed.

Student: Two packets that are reaching the same vertex need not be fixing the same bit, right?

They can be at different stages, but if they have to go through that edge, they have to fix the exact same bit. In this case they have to fix the jth bit if they are to go through that edge, ok.

(Refer Slide Time: 19:39)

So, let us proceed for now. Here, let us ask a few questions and try to answer them, now that we know the structure of the packets that can go through that edge. How many packets have a starting address of the form don't-cares followed by 𝑢𝑗, 𝑢𝑗+1, and so on? That is, of course, 2^{j-1}.

What is the probability of such a packet reaching u? Well, each of the first j-1 destination bits should have been randomly chosen so that the packet fixes its way to this particular node; each of them, out of the two choices, must be the correct choice, so that is 1/2^{j-1}. Now think of the expected number of packets along e: even the expected number of packets that reach that particular vertex u with e still to cross is at most 2^{j-1} · 1/2^{j-1} = 1, ok. So, let me pause to make sure you get some time to think about it, ok.

So, let us go through the questions one by one. How many packets have a starting address of the form shown here, and why do we care? We are talking about those packets that can reach this node u with the jth bit still to fix, and that is basically 2^{j-1}, because there are j-1 don't-cares. All the rest of the bits are fixed: if you look at the packets that can come to this particular node u, you cannot play with those values, but you can play with the first j-1 bits, so there are 2^{j-1} options, ok. That is the answer to the first question.

So, now, let us look at the second question: what is the probability of such a packet reaching u? What does such a packet look like? It has an address where the first j-1 bits can be anything, and the rest are fixed, right. What is the probability that such a packet makes its way to u? Well, in the bit-fixing sequence, each of those first j-1 bits has to be fixed just right, so that the address matches u exactly. In the very first position, whatever the star was, the random value associated with the destination should match the first bit of u; that happens with probability half.

The second bit also should match the second bit of u, again with probability half, and so on. Each random bit has to be chosen appropriately, each happening with probability half, and there are j-1 such choices. So, this probability is 1/2^{j-1}. And here we are only talking about reaching u; going through the particular edge e is a further requirement.

Student: (Refer Time: 22:50) jth position only how can (Refer Time: 22:53).

So, we are concerned with the number of packets that go through this particular edge e; this is the edge e we are going to focus on.

Student: (Refer Time: 23:05).

That is perfectly fine, because that packet just happened to mostly match u in terms of the bits that needed to be fixed; that is perfectly fine. But to go through e, if you look at the n-bit string and the jth bit, none of the bits from j onward should change anymore before reaching u; only then will the focus be on the jth bit, and that is what is important for us.

Student: (Refer Time: 23:46).

We will take this offline if you need a little bit more, ok. So, now, edge e is one out of at most n edges in the path P. So, what is the expectation of the number of other packets that interact with the path P? That will be at most n, because you are adding an expected value of at most 1 over all the edges, ok.
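As a sanity check on this expectation bound, here is a Monte Carlo sketch of my own (not from the lecture), conditioning on P being the longest possible path, from the all-zeros node to the all-ones node:

```python
import random

def bit_fixing_path(src, dst, n):
    """Left-to-right bit fixing on the n-cube (addresses are n-bit ints)."""
    path, cur = [src], src
    for j in range(n - 1, -1, -1):
        if (cur ^ dst) >> j & 1:
            cur ^= 1 << j
            path.append(cur)
    return path

def edges(path):
    return {frozenset(e) for e in zip(path, path[1:])}

def estimate_H(n, trials=200, seed=0):
    """Monte Carlo estimate of E[H]: fix P as the path from 0...0 to 1...1
    and count other packets whose random phase-one path shares an edge."""
    random.seed(seed)
    N = 2 ** n
    P = edges(bit_fixing_path(0, N - 1, n))
    total = 0
    for _ in range(trials):
        for s in range(1, N):                   # every other packet
            r = random.randrange(N)             # its random destination
            if edges(bit_fixing_path(s, r, n)) & P:
                total += 1
    return total / trials

n = 8
print("estimated E[H] =", estimate_H(n), " bound =", n)
```

The estimate comes out comfortably below the bound of n, consistent with the per-edge expectation of at most 1.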

(Refer Slide Time: 24:15)

So, now we have this quantity H; we have established at least an upper bound on its expectation, and we need to make sure that we can apply Chernoff bounds. We know that the individual 𝐻(𝑝')'s add up to H. What are they? They are independent.

So, we know that too. And we are not making any claims about their probabilities at this point: the probabilities of each 𝐻(𝑝') being 1 or 0 may be different for each, but that only means that H is a sum of random variables indicating Poisson trials, which is perfectly fine, because our Chernoff bounds are applicable here, ok.

(Refer Slide Time: 25:03)

Which means, and the slide has only one inequality: 𝑃𝑟(𝐻 ≥ 6𝑛) ≤ 2^{-6n}. This is the third inequality that I gave in the Chernoff bounds; see how low this is. µ is at most n, so the bound applies and gives at most 2^{-6n}.

Student: Capital.

So, 2^{-6n} is nothing but 1/N^6, with capital N = 2^n the number of nodes, ok.
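Spelled out as a sketch (using the Chernoff form Pr[X ≥ R] ≤ 2^{-R} for R ≥ 6µ, which appears to be the "third inequality" referred to; the closing union-bound remark is my addition, not from this passage):

```latex
\mu = \mathbb{E}[H] \le n
\quad\Longrightarrow\quad
\Pr[H \ge 6n] \;\le\; 2^{-6n} \;=\; \left(2^{n}\right)^{-6} \;=\; N^{-6},
```

where N = 2^n is the number of nodes. A union bound over all N packets then leaves a failure probability of at most N · N^{-6} = N^{-5}, so every packet's phase-one interaction count is small with high probability.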

(Refer Slide Time: 25:42)

So, now let us see the second step in the analysis.

(Refer Slide Time: 25:47)

What we have done is establish that H is at most 6n with high probability; it exceeds 6n with very low probability. So, now, we are just going to assume that H is at most 6n, and ask how long it will take p to traverse this path P. The answer is at most 7n, and the reason is a pipelining effect that is going on here, ok. So, let us look at this path P from 𝑠𝑝 to 𝑟𝑝; the number of other packets that use any of its edges is at most 6n, ok. Let us assume the worst case: whenever p reaches a particular vertex and is in some queue, it always gets the worst of any contention; remember, you asked how contentions are resolved. When a bunch of packets reach a particular vertex at the same time, and there are already some packets in the queue, the new ones get added at the end of the queue, and p is always unlucky: it is the last guy.

But when you think about it, there is a nice pipelining effect that goes on here. Let me see how to explain this. Say p is at some location at this point; there is a bunch of packets waiting in the queue here, another bunch waiting in the queue at the next vertex, and so on, ok. Consider the train: there is p, and ahead of p all these packets in a series of vertices, until at some point there is either an empty vertex or the train is touching the destination 𝑟𝑝, ok.

What can you say about the head of the train, the packet at the very beginning of this train? In each time step that packet is going to move one step forward. I am hand-waving here: the train itself could also get a little longer because some packets join, and things like that. But the position of the head will always keep moving forward, ok.

So, if the head started at 𝑠𝑝 and always kept moving forward, then within n rounds the head will have reached 𝑟𝑝, ok. After 𝑟𝑝 is reached by the head of the train, in every time step at least one packet from the head of the train is going to be knocked into 𝑟𝑝. How many such packets can get into 𝑟𝑝 ahead of p? At most 6n, and once that happens, p itself is able to enter 𝑟𝑝.

So, it takes at most n rounds for the head of the train to hit 𝑟𝑝, and then at most 6n rounds for the entire train to fall into 𝑟𝑝; is that argument somewhat clear, ok? And that establishes that in at most 7n rounds your packet p is going to reach its destination, ok.

(Refer Slide Time: 29:33)

So, with that we will conclude. What we have done is analyze the first phase of Valiant's routing. Some questions remain: for example, we have not talked about phase two. The other question, which we discussed a little while ago: say a packet finishes its phase one; should it wait for all other packets to finish their phase one? No, it can start its phase two, and why that is safe is again something you need to convince yourself of, ok.

(Refer Slide Time: 30:07)

So, I leave you with those two questions, and we will see some more examples in the next segment or so, ok.

Probability and Computing
Prof. John Augustine
Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Module - 04
Application of Tail Bounds
Lecture – 24
Segment 5: Random Graphs

So, let us get started. We are in segment 5 of module 4. Because segment 4 was a little bit on the longer side, I am going to just end the module with segment 5, and hopefully it is a short one.

(Refer Slide Time: 00:37)

So, what we are going to do today is introduce a very beautiful random graph model, the Erdos-Renyi model. It is very, very simple, and that is probably why it is so beautiful; it leads to a lot of interesting things, in particular this notion called phase transition. We will define that and we will show at least one phase transition, ok.

(Refer Slide Time: 01:02)

So, without further ado, let us get started with the definition of the Erdos-Renyi graph model. A little bit of history: the paper was published probably in the early sixties, and since then this model has really gotten the attention of people; lots of very pretty results have been proved.

Basically, what is this? It is a random graph, denoted 𝐺𝑛,𝑝. Essentially, it is a probability space over all graphs with n labeled vertices, and if it is a probability space, then there must be a probability function associated with it.

How do you get the probability function? Implicitly: for each pair of vertices you create an edge with probability p. That accounts for the two parameters that show up in the notation 𝐺𝑛,𝑝: n is the number of vertices, and p is the probability with which any pair of vertices you pick will have an edge, ok.

So, one way to think about it: you create the n vertices, and for each pair you toss a biased coin with probability p of heads; if it shows heads, put an edge between them, otherwise move on to the next pair. These coins are all independent, of course. So it is a very elegant, simple way to define a random graph, ok.
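A minimal sampler for this model (my own sketch in Python, with names of my choosing):

```python
import random
from itertools import combinations

def erdos_renyi_gnp(n, p, seed=None):
    """Sample from G_{n,p}: each of the C(n,2) vertex pairs becomes an edge
    independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

g = erdos_renyi_gnp(100, 0.05, seed=42)
print(len(g))   # concentrates around E = 0.05 * C(100,2) = 247.5 edges
```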

And actually closer to what Erdos and Renyi originally defined is the notion 𝐺𝑛,𝑚, ok. These two are very closely related: 𝐺𝑛,𝑚 is again a probability space over all n-vertex graphs, but now you choose m edges uniformly at random from all (n choose 2) possible edges, ok.

And the two models are essentially similar if you set the probability p = m/(n choose 2); then both will be about the same. They are not exactly the same, but about the same. Our focus, of course, will be only on 𝐺𝑛,𝑝.

(Refer Slide Time: 03:50)

And we introduce a notation: "𝐺𝑛,𝑝 models P", or "has the property P"; it denotes the event that a random graph drawn from this probability space has the property P. What is the property P? It can be any graph-theoretic property: the graph is connected, or bipartite, or has a clique of size 4, or whatever.

For any property, we can ask whether a graph drawn from this probability space has that property or not; whether it has the property is the event, and then you can talk about the probability of that event, ok. And what do we mean by phase transitions? A property has a phase transition if it has a nice threshold function, some function of n against which we compare p, the probability with which edges are chosen.

If p is asymptotically below the threshold function, then the probability that you see the property tends towards 0, which means that most likely you do not have the property. But when you raise the probability just above the threshold, the probability that you see the property skyrockets towards 1; the property is almost always there, all right.

So, this is a phase transition: when you can come up with such a threshold function, we say that the property experiences a phase transition at that threshold function, ok.
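In symbols, a sketch of the definition just given (writing r(n) for the threshold function; that name is my choosing):

```latex
\Pr\big[G_{n,p} \models P\big] \;\longrightarrow\;
\begin{cases}
0, & \text{if } p = o\big(r(n)\big),\\[2pt]
1, & \text{if } p = \omega\big(r(n)\big).
\end{cases}
```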

(Refer Slide Time: 05:50)

So, today we are going to study one such phase transition: for the property P that says the graph has a clique of size 4, ok. One way to think about it: when your edge probability is very, very small and you create this graph on n vertices, you do not see cliques of size 4; you might see cliques of smaller size, but it is very difficult to see cliques of size 4.

But when you increase it just beyond the threshold, you suddenly start to see cliques of size 4. In particular, here the threshold function is n^{-2/3}: when you set the probability to be o(n^{-2/3}), the probability of seeing a clique of size 4 tends towards 0, and the moment the probability of having edges is ω(n^{-2/3}), the probability that you see the property tends towards 1, ok. So, let us see how we can prove this theorem; the first claim is reasonably straightforward.

(Refer Slide Time: 07:26)

So, the first claim is: when the probability is, technically, o(n^{-2/3}), we want to show that you do not see cliques of size 4.

So, how do we do that? Well, let us start by listing all possible cliques; there are (n choose 4) candidate 4-cliques, ok. You list them one by one in some arbitrary order. We are ultimately going to look at the number of such cliques that get formed; in particular, what we are finally shooting for is 𝑃𝑟(𝑋 = 0).

We want to understand the probability that you do not see any clique. Towards that, we are going to set up the variables so that X can be broken into these 𝑋𝑖's: 𝑋𝑖 = 1 if the i-th candidate 𝐶𝑖 is a 4-clique, and 0 otherwise; it is an indicator random variable. So, clearly, the number of 4-cliques is given by X, which is the summation of all these individual 𝑋𝑖's.

(Refer Slide Time: 08:46)

And you will notice that X is integral, and X is always going to be greater than or equal to 0. So, now we can ask: what is 𝑃𝑟(𝑋 ≥ 1)? We want to argue that this probability becomes very small, little o(1).

So, what is 𝑃𝑟(𝑋 ≥ 1)? First of all, we claim that 𝑃𝑟(𝑋 ≥ 1) ≤ 𝐸[𝑋]. Why is that? I was thinking of a slightly different way to do this, but yes.

Student: (Refer Time: 09:24).

You can do it this way; by Markov's inequality, let us even put it that way, all right, ok. So, then, this is at most 𝐸[𝑋], and we need to bound 𝐸[𝑋]. Actually, we can get the exact value of 𝐸[𝑋]: first of all, there are (n choose 4) such indicator random variables, and each one of them is 1 with probability p^6. Why is p^6 showing up here?

Student: (Refer Time: 10:13).

Yeah, because you are talking about 4 vertices and asking whether all 6 pairs among them are connected by edges; there are 6 edges, and that is why you have p^6, ok.

And we know that in this case the p value is o(n^{-2/3}), so p^6 = o(n^{-4}). And (n choose 4) is O(n^4), so overall E[X] = O(n^4) · o(n^{-4}) = o(1). Convinced about that? So, the first case was nice and easy. So, let us look at the second case.
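A small check of this first-moment computation (my own sketch, not from the lecture; exact clique counting is feasible only for small n):

```python
import math
import random
from itertools import combinations

def expected_4_cliques(n, p):
    """First moment: E[X] = C(n,4) * p**6 (4 vertices, all 6 pairs present)."""
    return math.comb(n, 4) * p ** 6

def count_4_cliques(n, p, seed=0):
    """Sample G_{n,p} and count 4-cliques exactly."""
    rng = random.Random(seed)
    adj = {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}
    return sum(1 for q in combinations(range(n), 4)
               if all(pair in adj for pair in combinations(q, 2)))

n = 40
for c in (0.2, 1.0, 5.0):              # p = c * n**(-2/3): below / at / above
    p = c * n ** (-2 / 3)
    print(c, round(expected_4_cliques(n, p), 3), count_4_cliques(n, p))
```

Well below the n^{-2/3} threshold the expectation is tiny and no 4-clique appears; well above it the expectation is large and many appear, illustrating the two cases of the theorem.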

(Refer Slide Time: 11:18)

So, this is the case where the threshold function, I mean the probability function, is ω(n^{-2/3}). It is more than n^{-2/3}, and in this case we want to show that the probability with which the property holds is tending towards 1; remember ϵ is a very small quantity.

In fact, for any small ε we want to prove that it is greater than 1 − ε. Another way to prove this is just the opposite: Pr(X = 0) ≤ ε. This is the case where you do not see any 4-clique, in some sense the bad event in this context. The bad event is happening with less than ε probability, ok. So, let us see.

So now, here, let us look at E[X]; in this case, remember, the p value is ω(n^{-2/3}). So when you plug it into that formula, remember we said (n choose 4) times the expectation of each individual X_i, you get this quantity: (n choose 4) is Θ(n^4), and p^6 is going to be ω(n^{-4}), ok.

So overall it is going to tend towards infinity, and so this seems promising. Why? Because now what we are able to see is that E[X] is tending towards infinity, which means that X = 0 is, intuitively, an unlikely event, ok. But that is not technically sufficient on its own. Let us be precise about this: what we will show is that Pr(X = 0) ≤ ε.

So, why is this not sufficient? Let me ask you this: why is this not sufficient? You could have situations where the expectation is very large anyway. For example, say X takes the value 0 with probability ε and some very large value with probability 1 − ε. If that large value tends towards infinity, then E[X] also goes to infinity, but that won't satisfy the requirement that you have. So that is why we need to be a bit careful here.

And this is where it connects to tail bounds. Remember, we are in the context of tail bounds; we are trying to understand applications of tail bounds. We need a new trick, a tail-bound-based trick.

(Refer Slide Time: 15:00)

And what we are going to see is what's called the second moment method, where we are actually going to apply Chebyshev's inequality; in fact, we are just going to rewrite Chebyshev's in the manner that we want and then try to apply it.

So now we are going to assume that X is a nonnegative integer random variable. Now we want Pr(X = 0), and this is exactly what we wanted over here, ok. The second moment method uses the inequality Pr(X = 0) ≤ var(X)/E[X]^2.

And it is actually coming from a very straightforward application of Chebyshev's. You want to know Pr(X = 0), and here is E[X]; what Chebyshev's does is bound both tails: it bounds Pr(|X − E[X]| ≥ a) for some parameter a, and in this case that parameter is nothing but E[X] itself, because the 0 mark sits at distance E[X] from the mean.

So you can basically write the event X = 0 as a subset of the event |X − E[X]| ≥ E[X], and then you apply Chebyshev's inequality and you get Pr(X = 0) ≤ Pr(|X − E[X]| ≥ E[X]) ≤ var(X)/E[X]^2.
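To see the inequality Pr(X = 0) ≤ var(X)/E[X]^2 in action, here is a small sanity check on a hand-built nonnegative integer distribution; the numbers are arbitrary, purely illustrative:

```python
from fractions import Fraction

# A toy distribution for a nonnegative integer random variable X:
# value -> probability.
dist = {0: Fraction(1, 4), 1: Fraction(1, 2), 3: Fraction(1, 4)}

EX = sum(x * q for x, q in dist.items())        # E[X]   = 5/4
EX2 = sum(x * x * q for x, q in dist.items())   # E[X^2] = 11/4
var = EX2 - EX * EX                             # var(X) = 19/16
p_zero = dist[0]

# The second moment method: Pr(X = 0) <= var(X) / E[X]^2.
assert p_zero <= var / EX**2
print(p_zero, var / EX**2)
```

Here the bound gives 19/25, comfortably above the true Pr(X = 0) = 1/4.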

In our context, going back: we want to be able to say that this is o(1). Basically, what we need to show is that var(X) is o(E[X]^2). If we show that, the whole thing will fall into place, so that is going to be our focus from now on.

(Refer Slide Time: 17:30)

So now we need to understand the variance, for which let us try to bound the variance when we have a certain property: X is the summation of 0-1 random variables. If that is the case, it obviously looks very similar to the variance formula that we are already familiar with, so we just need to do some minor massaging, ok.

(Refer Slide Time: 17:57)

So, we know that var(X) is the summation of the individual variances plus these covariance terms: var(X) = Σ_i var(X_i) + Σ_{i≠j} cov(X_i, X_j). The covariance terms already match what we have over here, so we do not worry about them; we only worry about the sum over all the individual variance terms, and there you can just apply the formula var(X_i) = E[X_i^2] − (E[X_i])^2; that is just a formula, ok.

But now we are going to take advantage of the fact that X_i is a 0-1 random variable. Since it is 0-1, X_i^2 is 1 with probability p and 0 with probability 1 − p, exactly like X_i itself, because 1^2 again shows up as 1; so E[X_i^2] is nothing but E[X_i]. And here we want an inequality: the (E[X_i])^2 being subtracted is a positive quantity, so we can simply drop it, since we want the less-than-or-equal-to over here, and get var(X_i) ≤ E[X_i]. So the sum of the variance terms is at most the sum over all E[X_i], which by linearity of expectation is E[X]; hence var(X) ≤ E[X] plus the covariance terms. So this is what we have. So now we need to bound the covariance terms, so we will just go ahead and bound them.
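The variance bound above can be checked exhaustively on a tiny instance. This sketch enumerates every graph on 5 vertices under G(5, 1/2); all names are made up for illustration:

```python
import itertools
from fractions import Fraction

n, p = 5, Fraction(1, 2)
edges = list(itertools.combinations(range(n), 2))   # the 10 possible edges
cands = list(itertools.combinations(range(n), 4))   # candidate sets C_i

def weight(mask):
    # Probability of one particular edge pattern under G(n, p).
    m = bin(mask).count("1")
    return p**m * (1 - p)**(len(edges) - m)

def indicators(mask):
    present = {edges[i] for i in range(len(edges)) if mask >> i & 1}
    return [int(all(pr in present for pr in itertools.combinations(C, 2)))
            for C in cands]

k = len(cands)
EXi = [Fraction(0)] * k
EXiXj = [[Fraction(0)] * k for _ in range(k)]
for mask in range(1 << len(edges)):                 # all 2^10 graphs
    w, xs = weight(mask), indicators(mask)
    for i in range(k):
        EXi[i] += w * xs[i]
        for j in range(k):
            EXiXj[i][j] += w * xs[i] * xs[j]

EX = sum(EXi)                                       # = 5 * (1/2)^6 = 5/64
varX = sum(EXiXj[i][j] for i in range(k) for j in range(k)) - EX * EX
var_sum = sum(EXiXj[i][i] - EXi[i] ** 2 for i in range(k))
cov_sum = sum(EXiXj[i][j] - EXi[i] * EXi[j]
              for i in range(k) for j in range(k) if i != j)

assert varX == var_sum + cov_sum   # var(X) = sum var(X_i) + cov terms
assert varX <= EX + cov_sum        # since each var(X_i) <= E[X_i]
```

Exact arithmetic with `Fraction` makes both identities hold with equality, not just up to floating-point error.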

(Refer Slide Time: 19:45)

So now, just to recall: we want to show that Pr(X = 0) < ε, we are applying the second moment method, and we have a way to bound the variance, which is what is written over here: E[X] plus the summation of all the covariance terms.

Now, a few things we need to worry about here. We need to worry about the covariance terms, and in the denominator we have an E[X]^2, ok. If you think about it, recall what we want to show: we want to show that the variance is o(E[X]^2). So the E[X] term is not going to be too much of a problem; it is the covariance terms that we need to be careful about now, ok.

But anyway, let us go through all these terms. Just to recall, the covariance has this formula: cov(X_i, X_j) = E[X_i X_j] − E[X_i] E[X_j]. And we already know E[X], so E[X]^2 is going to work out to Θ(n^8 p^12); that is just by applying the formula.

(Refer Slide Time: 21:20)

So let us now focus on the covariance terms. And for the covariance we need to be a bit careful, because remember this covariance is over all (i, j) pairs, and funny things can happen. So let us break it down into sub-cases. The first sub-case is that |C_i ∩ C_j| is either 0 or 1, and I claim that this is the easy case, where all the covariance terms are zero. Why is that?

Student: (Refer Time: 22:03).

Independent, you are saying; but no, we need to see why they are independent. So let us look at them one by one: what is the situation when |C_i ∩ C_j| = 0?

Student: (Refer Time: 22:22).

Yeah, the cliques are completely disjoint: there is one and then the other one, and what happens in one set of 4 vertices will have no bearing on the other set of 4 vertices. What about |C_i ∩ C_j| = 1? What does the picture look like; what do we have to do to this picture? Yeah, one of the vertices is common.

So basically, let us redraw. Now we are asking: are these 4 forming a clique, and are these 4 forming a clique, and why are these still independent?

Student: (Refer Time: 23:12).

Yeah the.

Student: (Refer Time: 23:14).

So, one vertex might be common, but the edges are all chosen independently: whether this edge is chosen or not has no bearing on whether that edge is chosen or not, ok. So even though there is a common vertex, whether the cliques are formed or not are independent events.

So the covariance terms for those cases are 0. What about the case where |C_i ∩ C_j| = 2? Let us draw the picture for it: we have two common vertices, and then the rest of the vertices look like this.

So we are asking: are these going to form a clique, and are these going to form a clique, ok? Now there is essentially just one edge that is common, so you start to see some amount of dependence showing up. So now we are going to claim that the covariance is upper bounded by p^11. Why is that? Well, for one thing, we want an upper bound, so we can ignore the subtracted E[X_i]E[X_j] term; we only have to worry about E[X_i X_j]. Now, what is X_i times X_j? It will be 1 when both cliques are formed and 0 otherwise. How do you get to form both cliques; how many edges should actually get formed? If you look at the two cliques separately, it is a total of 12 edges, but one of those edges is common, so you will need 11 edges to form, which happens with probability p^11. So that will be the covariance bound for this case with one common edge. A very similar argument can be used for the case where there are 3 common vertices. Let us draw the picture.

So you have 3 common vertices, and we are asking: are these forming a clique, and are these forming a clique, ok? Now there will be 3 common edges, and then 6 other edges (3 in each clique), for a total of 9 distinct edges; so the covariance is at most p^9. So now we have the covariance terms individually; we just need to put them into the summation, ok.
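These edge counts for the overlap cases are easy to verify mechanically; a small sketch (the helper name is made up):

```python
import itertools

def distinct_edges(C1, C2):
    """Number of distinct edges needed for both C1 and C2 to be 4-cliques."""
    e1 = set(itertools.combinations(sorted(C1), 2))
    e2 = set(itertools.combinations(sorted(C2), 2))
    return len(e1 | e2)

# Sharing 2 vertices, the two 4-sets share 1 edge: 6 + 6 - 1 = 11.
assert distinct_edges((0, 1, 2, 3), (2, 3, 4, 5)) == 11
# Sharing 3 vertices, they share 3 edges: 6 + 6 - 3 = 9.
assert distinct_edges((0, 1, 2, 3), (1, 2, 3, 4)) == 9
# Sharing 0 or 1 vertices, they share no edge: 12 distinct edges.
assert distinct_edges((0, 1, 2, 3), (3, 4, 5, 6)) == 12
assert distinct_edges((0, 1, 2, 3), (4, 5, 6, 7)) == 12
```

The 11-edge and 9-edge counts are exactly why the covariance bounds are p^11 and p^9 in the two dependent cases.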

(Refer Slide Time: 26:07)

So, recall that this is the variance we want to compute. E[X] is the term (n choose 4) p^6. We have already seen that whenever C_i and C_j have an intersection of either 0 or 1 elements, the covariance is zero, so we have just discarded those terms. So we only have to worry about the cases where they have 2 common vertices or 3 common vertices.

So this is the case where they have 2 common vertices. You may recall that when they had 2 common vertices, the covariance itself was at most p^11; but how many such covariance terms will there be? That is going to be the number of ways in which you can choose 4 vertices for the first clique and 4 vertices for the second clique such that they have 2 common vertices, ok.

How do you count that? First of all, remember the picture: you need to first get 6 vertices; after you get the 6 vertices, you have to choose 2 of them to be the common vertices. So (n choose 6) is choosing the 6 vertices; then out of those 6 vertices you want 2 to be in the middle, which is (6 choose 2).

Then you choose 2 of the remaining 4 to go here, (4 choose 2), and of course the remaining 2 will go there. That is the number of such covariance terms you will have. And a similar argument can be made for this one: here again, recall the picture with 3 common vertices. Out of the n vertices you choose 5, (n choose 5); out of those 5, 3 of them go in the middle, (5 choose 3); and then one goes on the left and one on the right, alright.
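These counting formulas can be cross-checked by brute force for a small n; a sketch under that assumption (the function name is made up):

```python
import itertools
import math

def count_pairs(n, overlap):
    """Ordered pairs (C_i, C_j) of 4-vertex sets sharing `overlap` vertices."""
    cands = list(itertools.combinations(range(n), 4))
    return sum(1 for Ci in cands for Cj in cands
               if len(set(Ci) & set(Cj)) == overlap)

n = 7
# 2 common vertices: choose 6 vertices, 2 common, 2 of the rest for C_i.
assert count_pairs(n, 2) == math.comb(n, 6) * math.comb(6, 2) * math.comb(4, 2)
# 3 common vertices: choose 5 vertices, 3 common, then split the last 2.
assert count_pairs(n, 3) == math.comb(n, 5) * math.comb(5, 3) * 2
```

For n = 7 both sides come out to 630 and 420 respectively, matching the formulas in the text.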

So now what we do is plug in the terms. In the interest of time, what I am going to do is just point out; remember we need to be a bit careful with p here, ok, so for the moment just ignore this line.

So let us just write these things in terms of n and p. This first term is going to be Θ(n^4 p^6), and that is clearly o(n^8 p^12); then this one is going to be Θ(n^6 p^11), again o(n^8 p^12); and this one is going to be Θ(n^5 p^9), again o(n^8 p^12).

So you get var(X) = o(E[X]^2), which is exactly what we wanted to show. That really brings us to the end of this segment, because with this we have shown that, going back, Pr(X = 0) < ε for any arbitrarily small ε; that is exactly what we wanted to show, and it completes all the cases. Well, let me conclude.

(Refer Slide Time: 29:49)

So, what we have done: we have introduced random graphs, defined threshold functions, and shown one phase transition; along the way we learned a nice method called the second moment method.

In the next module, which in this classroom setting we will be doing after our first quiz, we will go into things like balls into bins and some other computer science applications. So, thank you. Now, some easy questions first. Look at the random graph G_{n,1/2}, ok.

So here I want you to argue two things. One is to argue that every vertex has more than (1/2 − ε)n incident edges with high probability; of course, this should be very easy. And the second thing I want you to prove is that, with high probability, the diameter of this graph is at most two. These are a couple of things that you should be able to prove quite easily.
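A quick simulation sketch for these two exercises; the sample size, seed, and degree threshold here are arbitrary choices, and a simulation only illustrates the claims, it does not prove them:

```python
import itertools
import random

def sample_gnp(n, p, rng):
    """Sample a graph from G(n, p) as an adjacency-set dictionary."""
    adj = {v: set() for v in range(n)}
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

rng = random.Random(1)
n = 200
adj = sample_gnp(n, 0.5, rng)

# Exercise 1: every degree should be close to n/2.
min_deg = min(len(adj[v]) for v in adj)

# Exercise 2: diameter <= 2 iff every non-adjacent pair of vertices
# has at least one common neighbour.
diam_at_most_2 = all(v in adj[u] or adj[u] & adj[v]
                     for u, v in itertools.combinations(range(n), 2))
print(min_deg, diam_at_most_2)
```

On a typical sample the minimum degree is close to n/2 and every pair of vertices is within distance two.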

The other thing, and this is for arbitrary p: prove that f(n) = log(n)/n is a threshold function for connectivity, that is, for the property that the graph is connected. These are some interesting problems to work on; the first two you should be able to do in about 10 minutes.
