# Probability

A random experiment has an unpredictable outcome. Examples include tossing a coin, playing the lottery, driving a car, and taking an exam.

## Interpretations of Probability

1. Long-term frequency, as in the limiting proportion of 6's that appear in many tosses of a fair die.
2. Measure of belief: P(I survive a surgical operation). This experiment cannot be repeated, so the only interpretation is that of a measure of belief.

## Sample Space

The set of all possible outcomes is called the sample space and is denoted $\Omega$. An event is a subset of the sample space $\Omega$.

## Distribution Function

We define the distribution function for a finite sample space to be a non-negative real-valued function $m$ that is defined on $\Omega$ and such that:

$$m(\omega) \ge 0 \text{ for all } \omega \in \Omega, \qquad \sum_{\omega \in \Omega} m(\omega) = 1.$$

When the sample space is either finite or countably infinite, we say the distribution is discrete.

To an experiment we associate a number: we either number the outcomes or assign winnings to them. This number is called a random variable; formally, it is a function from the outcomes into the reals.

Random Variable: We call the numerical outcome of a chance experiment a random variable, and we denote it by $X$.

Example 1: When rolling a die the sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$.

## Definition of Probability

Definition: Given an outcome space $\Omega$, a discrete random variable and its distribution function $m$, we define the probability of an event $E$ as

$$P(E) = \sum_{\omega \in E} m(\omega).$$

Consequence: Direct consequences of the definition are that:

$$P(\Omega) = 1, \qquad P(\emptyset) = 0,$$

where $\emptyset$ is the empty set.

Example: To pick "at random" among 100 people means to give everyone an equal chance of being picked: each person is equally likely to be chosen. Implicit in many everyday uses of "random" is the idea that the draw should be fair, or unbiased.

Let $\Omega$ be the outcome space and $A$ some event.

Property 1: If all outcomes in $\Omega$ are equally likely, the probability of the event $A$ is the number of outcomes in $A$ over the number of outcomes in $\Omega$:

$$P(A) = \frac{\#(A)}{\#(\Omega)}.$$

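Example: Suppose the experiment is to draw a number between 1 and 100 uniformly at random, so that each number has the same probability. Then $\Omega = \{1, 2, \ldots, 100\}$ and $m(i) = 1/100$ for every $i$.

## Set Facts

I defined in class the union ($A \cup B$, or "A or B"), intersection ($A \cap B$, or $AB$) and difference ($A \setminus B$) of two sets, and the complement of a set ($A^c$ or $\bar{A}$).

Rules for set manipulation:

(a) Distributive Rule: $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$ and $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$.

(b) De Morgan's Laws: $(A \cup B)^c = A^c \cap B^c$ and $(A \cap B)^c = A^c \cup B^c$. More generally, $\left(\bigcup_i A_i\right)^c = \bigcap_i A_i^c$ and $\left(\bigcap_i A_i\right)^c = \bigcup_i A_i^c$.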

Events $A$ and $B$ are said to be mutually exclusive if $A \cap B = \emptyset$.

Note that this is a property of the sets, not of the probability.

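## Properties of Probabilities

We divide the properties into two groups.

Basic Axioms of Probability:

(a) $P(A) \ge 0$ for any set $A$.

(b) $P(\Omega) = 1$, where $\Omega$ is the sample space.

(c) If $E_1, E_2, \ldots, E_n$ is a sequence of mutually exclusive events, that is $E_i \cap E_j = \emptyset$ for all $i$ and $j$ different, we have the finite additivity property:

$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i).$$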

Consequences:

Property 1: $P(E^c) = 1 - P(E)$

Property 2:

Property 3: $P(E \cup F) = P(E) + P(F) - P(E \cap F)$

Special Case: If $E$ and $F$ are disjoint (mutually exclusive), then: $P(E \cup F) = P(E) + P(F)$

Property 4: If

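Definition: We define a partition of the sample space $\Omega$ to be a sequence of pairwise disjoint sets $F_1, F_2, \ldots, F_n$ whose union is $\Omega$.

Property 5: If $F_1, F_2, \ldots, F_n$ forms a partition of $\Omega$, then for any event $E$:

$$P(E) = \sum_{i=1}^{n} P(E \cap F_i).$$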

Property 6:

Property 7:

Moreover:

Example: Toss a coin until heads comes up.

Looking at the tree diagram, we see that P(1)=1/2, P(2)=1/4, P(3)=1/8, and so on. The distribution function in this case is $m(i) = 1/2^i$. But we have to check that $\sum_{i \ge 1} m(i) = 1$, which holds because this is a geometric series with sum 1.

What is the probability that heads turns up after an even number of tosses? Summing over the even values of $i$ gives $\sum_{j \ge 1} 4^{-j} = 1/3$.

## Densities

A continuous random variable takes on an uncountable infinity of possible values; here we define it with the help of a density function. A density is a non-negative function $f$ defined on all the reals and such that its integral is equal to 1:

$$\int_{-\infty}^{\infty} f(x)\,dx = 1.$$

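For $B = [a, b]$ an interval:

$$P(X \in B) = P(a \le X \le b) = \int_a^b f(x)\,dx.$$

If we take $b = a$, we see that for every $a$ the probability that a continuous random variable takes on exactly that value is 0. This is the big difference with discrete random variables.

This implies:

$$P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b).$$

Intuitively, for a very small width $dx$, the probability of landing in $[x, x + dx]$ will be proportional to the density at $x$:

$$P(x \le X \le x + dx) \approx f(x)\,dx.$$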

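## Cumulative Distribution Function

Definition: Let $X$ be a continuous real-valued random variable. Its cumulative distribution function (cdf) is:

$$F(x) = P(X \le x).$$

Theorem: If $X$ has a density $f(x)$ then:

1. $F(x) = \int_{-\infty}^{x} f(t)\,dt$ is the cdf, and
2. $\frac{d}{dx} F(x) = f(x)$.

Proof: Property 1 comes from the definition of probability as a function of the density:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$$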

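Property 2 is due to the fundamental theorem of calculus.

## Examples with functions of Uniform Random Numbers

Example 1: The cdf of a uniform random variable $U$ on $[0,1]$ is:

$$F_U(x) = \begin{cases} 0 & x < 0 \\ x & 0 \le x \le 1 \\ 1 & x > 1. \end{cases}$$

Thus by computing the derivative we find the density of the uniform random variable to be:

$$f_U(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

The box shape we already knew!

Example 2: The square of a uniform random variable, $Y = U^2$.

We start by computing the cdf: for $0 \le y \le 1$,

$$F_Y(y) = P(U^2 \le y) = P(U \le \sqrt{y}) = \sqrt{y}.$$

We obtain the density by differentiating this cdf:

$$f_Y(y) = \frac{1}{2\sqrt{y}}, \qquad 0 < y \le 1.$$

Example 3: The sum of two uniform random variables: $Z = U_1 + U_2$.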

## Enumeration Rules

How many different 5-girl teams can I make from a set of 10 girls, if each girl's position matters in the choice? First we choose the center: there are 10 choices. Then we choose the power forward: 9 choices; then the small forward: 8 choices; then the shooting guard: 7 choices; then the last (point guard): 6 choices. The rule is to multiply the number of choices; I showed a way to see this in class with trees. The total number of different teams is:

$$10 \times 9 \times 8 \times 7 \times 6 = 30240.$$

1. Multiplication Rule: If $k$ successive choices are to be made with exactly $n_j$ choices at stage $j$, then the total number of successive choices is

$$n_1 \times n_2 \times \cdots \times n_k.$$

2. Number of orderings: An ordering (or arrangement, or permutation of $k$ out of $n$) is a sequence of length $k$ out of $n$ choices with no duplications. The number of such arrangements is:

$$n (n-1) \cdots (n-k+1) = \frac{n!}{(n-k)!}.$$

3. Permutations: If the total number of possible choices is the same as the number of elements we want to pick, this is called a permutation. There are

$$n (n-1) \cdots 2 \cdot 1$$

permutations of $n$ elements. We have a special name for this number: "factorial n" $= n!$. This number gets big very fast: 3! = 6, 5! = 120, 10! = 3628800, 20! = 2432902008176640000, 50! = 30414093201713378043612608166064768844377641568960512000000000000.

4. Combinations: The number of ways of choosing $k$ out of $n$ objects is:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$

Property 1:

Property 2:
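The enumeration rules above can be checked numerically. The following is a minimal sketch using Python's standard `math` module (`math.perm`, `math.comb` and `math.prod` need Python 3.8 or later); the variable names are illustrative and not from the lecture.

```python
import math

# Ordered 5-player team chosen from 10 players: 10 * 9 * 8 * 7 * 6.
ordered_teams = math.perm(10, 5)

# Multiplication rule: multiply the number of choices at each stage.
stage_choices = [10, 9, 8, 7, 6]
by_multiplication = math.prod(stage_choices)

# If positions did not matter, each team would be counted 5! times.
unordered_teams = math.comb(10, 5)

print(ordered_teams, by_multiplication, unordered_teams)
print(math.factorial(20))
```

The ordered count divided by 5! gives the number of combinations, which is the relation between rules 2 and 4.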


## Birthday Problem

I did the experiment of asking 31 people their birthdays, and I bet that there would be a match. Was I taking a big risk? Try out the experiment yourselves with the Birthday applet.

Supposing that the probability of each day is $1/365$ (which is not quite true), we will use the complement trick: it is easier to compute the probability of no match than the probability of at least one match, for which all kinds of possibilities have to be enumerated.

$$P(\text{no match among 31}) = \frac{365 \times 364 \times \cdots \times 335}{365^{31}} \approx 0.27.$$

I had a good chance of winning! The general formula for $k$ people in the room:

$$P(\text{no match among } k) = \prod_{i=1}^{k-1}\left(1 - \frac{i}{365}\right).$$

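We use the approximation, for small $x$, that $\log(1 - x) \approx -x$, so that the log of the no-match probability is approximately

$$-\sum_{i=1}^{k-1} \frac{i}{365} = -\frac{k(k-1)}{730},$$

giving

$$P(\text{no match among } k) \approx e^{-k(k-1)/730}.$$

For $k = 31$ this approximation would have given $e^{-930/730} \approx 0.28$.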
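To compare the exact no-match probability with the exponential approximation, here is a small sketch. It assumes 365 equally likely birthdays, as in the text; the function names are made up for illustration.

```python
import math

def p_no_match(k: int, days: int = 365) -> float:
    """Exact probability that k people all have distinct birthdays."""
    p = 1.0
    for i in range(1, k):
        p *= 1.0 - i / days
    return p

def p_no_match_approx(k: int, days: int = 365) -> float:
    """Approximation obtained from log(1 - x) ~ -x."""
    return math.exp(-k * (k - 1) / (2 * days))

k = 31
print(f"exact  P(match) = {1 - p_no_match(k):.4f}")
print(f"approx P(match) = {1 - p_no_match_approx(k):.4f}")
```

Changing `k` shows how quickly the chance of a match grows; the approximation stays within about a percentage point for rooms of this size.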

Or

In fact there is a term in n left because the ratio of this approximation to n! still increases (in fact like .

The true approximation is given by Stirling's formula, which uses the notion of asymptotic equivalence.

Definition: Two sequences $a_k$ and $b_k$ are said to be asymptotically equivalent if

$$\lim_{k \to \infty} \frac{a_k}{b_k} = 1.$$

Theorem 1 (Stirling's Formula): $n!$ is asymptotically equivalent to the sequence defined as $n^n e^{-n} \sqrt{2 \pi n}$.

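In class I showed that the probability that a random permutation has at least one fixed point can be computed using the Inclusion-Exclusion Principle for $n$ events:

$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i} P(E_i) - \sum_{i<j} P(E_i \cap E_j) + \sum_{i<j<k} P(E_i \cap E_j \cap E_k) - \cdots + (-1)^{n+1} P(E_1 \cap \cdots \cap E_n),$$

taking $E_i$ to be the event that the $i$th point is a fixed point. We can see that $P(E_i) = \frac{1}{n}$ and that

$$P(E_i \cap E_j) = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)}.$$

As there are $\binom{n}{2}$ pairs $i<j$ and $\binom{n}{3}$ triples $i<j<k$, the formula gives:

$$P(\text{at least one fixed point}) = 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n+1} \frac{1}{n!}.$$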
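As a sanity check on the inclusion-exclusion result, the sketch below compares the alternating sum with a brute-force count over all permutations for small $n$. It uses exact fractions; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def p_fixed_point_formula(n: int) -> Fraction:
    """Alternating sum 1 - 1/2! + 1/3! - ... from inclusion-exclusion."""
    return sum(Fraction((-1) ** (j + 1), factorial(j)) for j in range(1, n + 1))

def p_fixed_point_brute(n: int) -> Fraction:
    """Fraction of the n! permutations having at least one fixed point."""
    hits = sum(
        any(perm[i] == i for i in range(n))
        for perm in permutations(range(n))
    )
    return Fraction(hits, factorial(n))

for n in range(1, 8):
    assert p_fixed_point_formula(n) == p_fixed_point_brute(n)
    print(n, float(p_fixed_point_formula(n)))
```

The printed values settle near 0.632 after only a few terms.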

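Note that this probability hardly depends on $n$: the sum converges very quickly to $1 - 1/e \approx 0.632$.

## Multinomial Coefficients and applications

How many ways are there of assigning 10 police officers to 3 different tasks: Patrol, of which there must be 5; Office, of which there must be 2; Reserve, of which there will be 3? We saw how the answer comes out to be:

$$\frac{10!}{5!\,2!\,3!} = 2520.$$

In general there will be

$$\binom{n}{k_1, k_2, \ldots, k_r} = \frac{n!}{k_1!\,k_2! \cdots k_r!}$$

ways to divide up $n$ objects into $r$ different categories so that there will be $k_1$ objects in category 1, $k_2$ in category 2, and so on ($k_1 + k_2 + \cdots + k_r = n$).

## Definition and Properties

Definition of the conditional probability of $A$ given $B$:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0.$$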

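Averaging rule: $P(E) = P(E \mid F)P(F) + P(E \mid F^c)P(F^c)$

General Averaging Rule: Let $F_1, \ldots, F_n$ be a partition of $\Omega$, by which I mean that the events $F_i$ are all mutually exclusive and that their union is $\Omega$. Then:

$$P(E) = \sum_{i=1}^{n} P(E \mid F_i)\,P(F_i).$$

Bayes Rule (how to find the opposite conditional probability from the ones given):

$$P(F \mid E) = \frac{P(E \mid F)\,P(F)}{P(E \mid F)\,P(F) + P(E \mid F^c)\,P(F^c)}.$$

General Bayes Rule: Let $F_1, \ldots, F_n$ be a partition of $\Omega$. Then $E \cap F_1, \ldots, E \cap F_n$ is a partition of $E$, and:

$$P(F_i \mid E) = \frac{P(E \mid F_i)\,P(F_i)}{\sum_{j=1}^{n} P(E \mid F_j)\,P(F_j)}.$$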

## Conditional Probability IS a probability

Conditional probability obeys the three axioms necessary for being a probability, provided of course that it is defined, i.e. the event that we are conditioning on has non-zero probability.

## Independence

Definition: Two events $E$ and $F$ are said to be independent if

$$P(F \mid E) = P(F).$$

Example: We draw two cards one at a time from a shuffled deck of 52 cards. Are the two following events independent?

E: The first card is a heart.

F: The second card is a queen.

From the definition of conditional probability, we need to find $P(F \mid E)$ by computing $P(E \cap F)$ and $P(E)$:

$$P(E) = \frac{13}{52} = \frac{1}{4}, \qquad P(E \cap F) = \frac{1 \cdot 3 + 12 \cdot 4}{52 \cdot 51} = \frac{1}{52}, \qquad P(F \mid E) = \frac{1/52}{1/4} = \frac{1}{13} = P(F).$$

So $E$ and $F$ are independent. We also showed that, equivalently, $E$ and $F$ are independent if and only if:

$$P(E \cap F) = P(E)\,P(F).$$

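Beware: this multiplication rule is available ONLY if the events ARE independent.

Example: De Méré's problem asks whether it is more likely to get at least one double six in 24 throws of a pair of dice, or at least one six in 4 throws of a die. Essential to this argument is the fact that each throw is independent of the preceding ones. The easier probability to compute is the complementary one:

$$P(\text{no six in 4 throws}) = \left(\tfrac{5}{6}\right)^4 \approx 0.482, \qquad P(\text{at least one six}) \approx 0.518,$$

and, for the pair of dice, the complementary event

$$P(\text{no double six in 24 throws}) = \left(\tfrac{35}{36}\right)^{24} \approx 0.509, \qquad P(\text{at least one double six}) \approx 0.491.$$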
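De Méré's two bets can be compared exactly with the complement trick. This is a short sketch with exact fractions; the helper name `p_at_least_one` is invented for the example and relies on the independence of the throws.

```python
from fractions import Fraction

def p_at_least_one(p_success: Fraction, throws: int) -> Fraction:
    """P(at least one success in `throws` independent throws)."""
    return 1 - (1 - p_success) ** throws

p_one_six = p_at_least_one(Fraction(1, 6), 4)
p_double_six = p_at_least_one(Fraction(1, 36), 24)

# Smallest number of throws of a pair of dice for which a double six
# becomes more likely than not.
throws_needed = next(
    n for n in range(1, 100)
    if p_at_least_one(Fraction(1, 36), n) > Fraction(1, 2)
)

print(float(p_one_six), float(p_double_six), throws_needed)
```

The first bet is slightly favourable and the second slightly unfavourable, which is the whole point of the problem.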

## Independence of more than two events

I gave an example in class of three events $E$, $F$ and $G$ such that $E$ and $F$ are independent and $E$ and $G$ are independent, but $F \cap G$ and $E$ are NOT independent. In general, we say three events are mutually independent iff all pairs of events are independent and

$$P(E \cap F \cap G) = P(E)\,P(F)\,P(G).$$

## The Craps principle

Suppose a game has to be replayed as long as neither player has won a round, where $P(A \text{ wins}) = p_a$, $P(B \text{ wins}) = p_b$ and $P(\text{Draw}) = p_d$, with $p_a + p_b + p_d = 1$.

I showed that the probability of A winning is:

$$P(A \text{ wins the game}) = \frac{p_a}{p_a + p_b}.$$

Intuitively, we know this: the only probabilities that matter are the relative chances of A and B winning; $p_d$ does not matter.

I went on to define craps. The game is played with 2 dice. A first throw can finish in a loss (if I throw 2, 3 or 12), a win (if I throw 7 or 11) or a draw-replay (if I throw 4, 5, 6, 8, 9 or 10; from here on I have to remember what I threw, which is called my "point"). If I have a replay number, I go on throwing the two dice until either I get the same number I had at first (I win) or I get a 7 (I lose).

Example of computation for craps: Suppose I tell you I have just won at craps, and I ask you for the probability that I rolled a 4 on the first throw. In terms of events I want $P(\text{4 first throw} \mid \text{win})$. To compute this we have to know $P(\text{win})$, and in class I showed that

$$P(\text{4 first throw} \mid \text{win}) = \frac{P(\text{win} \mid \text{4 first throw})\,P(\text{4 first throw})}{P(\text{win})}.$$

The two terms in the numerator are both directly computable:

$$P(\text{4 first throw}) = \frac{3}{36}, \qquad P(\text{win} \mid \text{4 first throw}) = \frac{3}{3 + 6} = \frac{1}{3}$$

(by the craps principle) and

$$P(\text{win}) = \frac{244}{495} \approx 0.493.$$

Finally we have enough parts to compute:

$$P(\text{4 first throw} \mid \text{win}) = \frac{\frac{1}{3} \cdot \frac{3}{36}}{\frac{244}{495}} = \frac{55}{976} \approx 0.056.$$
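The craps computation can be reproduced exactly by enumerating the 36 outcomes of two dice and applying the craps principle to each possible point. This is a sketch under the rules stated above; the names `p_sum` and `p_win_given_first` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Distribution of the sum of two fair dice.
ways = {}
for a, b in product(range(1, 7), repeat=2):
    ways[a + b] = ways.get(a + b, 0) + 1
p_sum = {s: Fraction(w, 36) for s, w in ways.items()}

def p_win_given_first(s: int) -> Fraction:
    """P(win | first throw shows sum s)."""
    if s in (7, 11):
        return Fraction(1)
    if s in (2, 3, 12):
        return Fraction(0)
    # Craps principle: on the replays only "point" versus 7 matters.
    return p_sum[s] / (p_sum[s] + p_sum[7])

# Averaging rule over the first throw, then Bayes rule.
p_win = sum(p_sum[s] * p_win_given_first(s) for s in p_sum)
p_four_given_win = p_sum[4] * p_win_given_first(4) / p_win

print(p_win, p_four_given_win)
```

Replacing 4 by another point in the last computation gives the posterior probability of any other first throw.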

## Joint Distributions

Examples: Suppose we consider three tosses of a coin, associating a 1 to heads and a 0 to tails each time, and call $X_i$ the random variable that results from trial $i$. Then we could consider the random vector $X = (X_1, X_2, X_3)$ that describes the three tosses. The state space for $X$ is $\{0,1\}^3$, and we can compute the probability distribution on this space as products of the individual coordinates' distributions, because the random variables $X_i$ are independent.

Here is an example based on the same experiment, but where the random variables are different and are not independent. Let $Y_i$ be the number of heads up to and including the $i$th toss, and let $Y = (Y_1, Y_2, Y_3)$. The state space (or sample space) for $Y$ is again a set of triplets; however, some triplets are impossible. For instance $P(Y_2 = 0 \mid Y_1 = 1) = 0$. The coordinate random variables are not independent, and we have to give the distribution of all the vectors one by one because we cannot build them up from the marginals.

Example: In the example on color blindness, suppose I consider the binary random variables associated with color blindness and gender (associate 0 if male, 1 if female). These are called indicator variables. We can tabulate the probabilities of all 4 possible pairs of outcomes as:

So that from this table of joint distribution we read:

In general, when we build the joint distribution of two random variables we can make such a two-way table; for more variables this is of course impossible.

Definition: In the case of two random variables $X$ and $Y$ we define the joint probability mass function of $X$ and $Y$ as:

$$P(x, y) = P(X = x, Y = y).$$

The row sums and column sums produce the complete distribution functions for the coordinate random variables. They are called the marginal probabilities; here for instance we have:

In general, given the joint distribution $P(x, y)$ on the pairs $(x, y)$ for two random variables $X$ and $Y$, we have the marginal distributions

$$p_X(x) = \sum_{y} P(x, y), \qquad p_Y(y) = \sum_{x} P(x, y).$$

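## Gambler's Ruin

Suppose we do a sequence of Bernoulli($p$) trials; we call them coin tosses, with a $p$-coin. A wins the toss if the coin comes up heads, and B wins if it comes up tails. If A wins he takes 1 from B; if not, the inverse happens. The overall game is played until one of the players goes bankrupt. Call $E_i$ the event that A ends up with all the money when A starts with $i$ out of a total of $N$. Then we do a "one-step" analysis: taking $P_i = P(E_i)$ and $q = 1 - p$, we can show that:

$$P_i = p\,P_{i+1} + q\,P_{i-1}, \qquad 0 < i < N.$$

Now we use the two boundary conditions, $P_0 = 0$ and $P_N = 1$, and the recurrence above leads to the following conclusions (using the sum of the geometric series of term $q/p$):

$$P_i = \begin{cases} \dfrac{1 - (q/p)^i}{1 - (q/p)^N} & \text{if } p \ne \tfrac{1}{2}, \\[2ex] \dfrac{i}{N} & \text{if } p = \tfrac{1}{2}. \end{cases}$$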
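The closed form for the gambler's ruin can be checked against the one-step recurrence and the boundary conditions. The sketch below assumes the standard setup in which A starts with $i$ units out of a total of $N$; the coin bias 9/19 and $N = 10$ are arbitrary illustrative values, not from the lecture.

```python
from fractions import Fraction

def ruin_win_prob(i: int, n_total: int, p: Fraction) -> Fraction:
    """P(A ends with all n_total units | A starts with i), closed form."""
    q = 1 - p
    if p == q:
        return Fraction(i, n_total)
    r = q / p
    return (1 - r ** i) / (1 - r ** n_total)

p = Fraction(9, 19)  # hypothetical p-coin, slightly unfavourable to A
N = 10
probs = [ruin_win_prob(i, N, p) for i in range(N + 1)]

# Boundary conditions and one-step recurrence, checked exactly.
assert probs[0] == 0 and probs[N] == 1
for i in range(1, N):
    assert probs[i] == p * probs[i + 1] + (1 - p) * probs[i - 1]

print([round(float(x), 4) for x in probs])
```

Even a small disadvantage on each toss makes A's chance of winning everything from an even start noticeably smaller than one half.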

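## Joint Probability Distribution for n variables

Example of the multinomial: Consider a sequence of $n$ identical experiments, each with one of $r$ possible outcomes, with probabilities $p_1, \ldots, p_r$. Denote by $X_i$ the number of the outcomes that result in $i$. The joint distribution of $(X_1, \ldots, X_r)$ is the multinomial:

$$P(X_1 = k_1, \ldots, X_r = k_r) = \frac{n!}{k_1! \cdots k_r!}\, p_1^{k_1} \cdots p_r^{k_r}, \qquad k_1 + \cdots + k_r = n.$$

The $X_i$'s are not independent.

Mutually Independent events: The events $E_1, \ldots, E_n$ are said to be mutually independent iff for every subset $E_{i_1}, \ldots, E_{i_k}$ of events we have the multiplicative property of intersection:

$$P(E_{i_1} \cap \cdots \cap E_{i_k}) = P(E_{i_1}) \cdots P(E_{i_k}).$$

Mutually Independent random variables: The random variables $X_1, \ldots, X_n$ are said to be mutually independent iff for any vector $(x_1, \ldots, x_n)$ in the cartesian product state space we have:

$$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n).$$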

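## Continuous Random Variables

The joint density $f(x, y)$ is the probability per unit area near $(x, y)$:

$$P(X \in dx, Y \in dy) = f(x, y)\,dx\,dy.$$

The joint density is a non-negative function that integrates to 1.

Distribution Function:

$$F(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v)\,dv\,du.$$

Marginal Distributions:

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy, \qquad f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx.$$

Example: $(X, Y)$ uniform on $0 < x < y < 1$. What are the joint and marginal densities? $f(x, y) = c$ for $0 < x < y < 1$; as the area is $1/2$, take $c = 2$.

Marginal Densities:

$$f_X(x) = \int_x^1 2\,dy = 2(1 - x), \qquad f_Y(y) = \int_0^y 2\,dx = 2y, \qquad 0 < x, y < 1.$$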

We remarked that :

## Independent Random Variables

Reminder for discrete random variables: $P(x, y) = p_X(x)\,p_Y(y)$

Independence for Continuous Random Variables: We no longer have individual probability mass functions that are non-zero, so we use the densities:

$$f(x, y) = f_X(x)\,f_Y(y),$$

which is also equivalent to the product of the distribution functions being equal to the joint distribution function:

$$F(x, y) = F_X(x)\,F_Y(y).$$

## Conditional Distributions

Remember the formula for conditional probability:

$$P(E \mid F) = \frac{P(E \cap F)}{P(F)}.$$

Now suppose that $E$ and $F$ are events defined with regard to random variables $X$ and $Y$. We cannot always use the actual probability of events, because an event such as $\{X = x\}$ has probability 0.

Conditional Density:

$$f_{Y \mid X}(y \mid x) = \frac{f(x, y)}{f_X(x)}.$$

I made detailed geometrical remarks in lectures; if you missed the lecture please come and see me, I have special handouts for you.

## Bayes Billiard Balls

First argument: Suppose we throw 1 red billiard ball on the table and measure how far it goes on the scale from 0 to 1; call this value $x$. Then we throw $n$ balls. What is the distribution of the number $K$ of balls to the left of the red ball?

$$P(K = k \mid x) = \binom{n}{k} x^k (1 - x)^{n-k}.$$

Now what is the distribution of $x$? It is Uniform(0,1), and the overall distribution of the number of successes is obtained by integrating over all possible $x$'s:

$$P(K = k) = \int_0^1 \binom{n}{k} x^k (1 - x)^{n-k}\,dx.$$

Second way: Suppose I throw all $n + 1$ balls down first, and then choose which is to be the red one. Then the probability that the red one has $k$ to the left of it is $\frac{1}{n+1}$. So we have:

$$\int_0^1 \binom{n}{k} x^k (1 - x)^{n-k}\,dx = \frac{1}{n+1},$$

which tells us that:

$$\int_0^1 x^k (1 - x)^{n-k}\,dx = \frac{1}{(n+1)\binom{n}{k}} = \frac{k!\,(n-k)!}{(n+1)!}.$$

This constant actually has a special name: it is $B(k+1, n-k+1)$, and it appears in the following important density function, whose support is $[0,1]$.

## Beta Random Variable

$$f(x) = \frac{x^{a-1} (1 - x)^{b-1}}{B(a, b)}, \qquad 0 \le x \le 1.$$
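The billiard-ball identity can be checked numerically: the integral of $x^k(1-x)^{n-k}$ over $[0,1]$ should equal $k!\,(n-k)!/(n+1)!$, so that multiplying by $\binom{n}{k}$ gives $1/(n+1)$ for every $k$. The sketch below uses a plain midpoint rule rather than any special library; the function names and the choice $n = 6$ are illustrative.

```python
from math import comb, factorial

def beta_integral_numeric(k: int, n: int, steps: int = 20_000) -> float:
    """Midpoint-rule estimate of the integral of x^k (1-x)^(n-k) on [0, 1]."""
    h = 1.0 / steps
    return h * sum(
        ((i + 0.5) * h) ** k * (1.0 - (i + 0.5) * h) ** (n - k)
        for i in range(steps)
    )

def beta_integral_exact(k: int, n: int) -> float:
    """Closed form k! (n-k)! / (n+1)! from the second argument."""
    return factorial(k) * factorial(n - k) / factorial(n + 1)

n = 6
for k in range(n + 1):
    numeric = beta_integral_numeric(k, n)
    exact = beta_integral_exact(k, n)
    # comb(n, k) times the integral should be 1 / (n + 1) for every k.
    print(k, round(comb(n, k) * numeric, 6), round(comb(n, k) * exact, 6))
```

Every row prints the same value, which is the statement that the number of balls to the left of the red one is uniform on $0, 1, \ldots, n$.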

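## Order Statistics

$X_1, X_2, \ldots, X_n$ are $n$ iid continuous random variables with a common density $f$ and distribution function $F$. Define:

$$X_{(1)} = \min(X_1, \ldots, X_n), \qquad X_{(n)} = \max(X_1, \ldots, X_n).$$

The ordered values $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ of the iid sample are known as the order statistics. The question we will try to answer is: what is the formula for the density of the $k$th order statistic? We start with the extremes, which are easy to handle.

First method:

Second Method:

The advantage of this method is that it can be generalized. $X_{(k)}$ is the $k$th smallest of $X_1, \ldots, X_n$:

$$f_{(k)}(x) = n \binom{n-1}{k-1} f(x)\,F(x)^{k-1} \bigl(1 - F(x)\bigr)^{n-k}.$$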

Example with a Uniform Random Variable: The density of a Uniform(0,1) random variable is

$$f(x) = 1, \qquad 0 < x < 1,$$

and the cumulative distribution function is the identity between 0 and 1. So applying the above formula for the $k$th order statistic of $n$ independent Uniform(0,1) random variables gives the density

$$f_{(k)}(x) = n \binom{n-1}{k-1} x^{k-1} (1 - x)^{n-k}, \qquad 0 < x < 1.$$

This is in fact a density we will encounter several times: it is the Beta($k$, $n-k+1$) density.

Example: Suppose we have five independent uniforms $U_1, U_2, \ldots, U_5$. Find the joint density of $U_{(2)}$ and $U_{(4)}$:

$$f(x, y) = \frac{5!}{1!\,1!\,1!}\, x\,(y - x)\,(1 - y) = 120\,x\,(y - x)\,(1 - y).$$

This is the density of which I showed you a picture in class when I defined joint densities; it is only non-zero for $0 < x < y < 1$.

## Binomial Distribution

Suppose we repeat a Bernoulli($p$) experiment $n$ times and count the number $X$ of successes. The distribution of $X$ is called the Binomial $B(n, p)$ distribution.

Probability mass function:

$$P(X = k) = \binom{n}{k} p^k q^{n-k}, \qquad k = 0, 1, \ldots, n, \quad q = 1 - p.$$

## Odds Ratios and Mode

The odds of $k$ successes relative to $k-1$ successes are:

$$\frac{P(X = k)}{P(X = k-1)} = \frac{n - k + 1}{k} \cdot \frac{p}{q}.$$

This is very useful for computing the probability mass function of the binomial by recursion.

Property: For $X$ a $B(n, p)$ random variable with probability of success $p$ neither 0 nor 1, as $k$ varies from 0 to $n$, $P(X = k)$ first increases monotonically and then decreases monotonically (it is unimodal), reaching its highest value when $k$ is the largest integer less than or equal to $(n+1)p$, that is $k = \lfloor (n+1)p \rfloor$.

Proof:

$$\frac{P(X = k)}{P(X = k-1)} \ge 1$$

is equivalent to

$$(n - k + 1)\,p \ge k\,(1 - p), \quad \text{that is,} \quad k \le (n+1)p.$$

The value where the probability mass function takes on its maximum is called the mode.

## Negative Binomial

This is the number of trials until $r$ successes are obtained in a sequence of Bernoulli($p$) trials:

$$P(X = k) = \binom{k-1}{r-1} p^r q^{k-r}, \qquad k = r, r+1, \ldots$$
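Returning to the binomial odds ratio of the previous section, here is a minimal sketch of the recursion it suggests: start from $P(X = 0) = q^n$ and multiply by the odds ratio to step from $k-1$ to $k$. The values $n = 10$ and $p = 0.3$ are arbitrary illustrative choices.

```python
from math import comb, floor

def binomial_pmf_by_recursion(n: int, p: float) -> list:
    """Binomial pmf built from P(k) / P(k-1) = (n - k + 1) / k * p / q."""
    q = 1.0 - p
    pmf = [q ** n]  # P(X = 0)
    for k in range(1, n + 1):
        pmf.append(pmf[-1] * (n - k + 1) / k * p / q)
    return pmf

n, p = 10, 0.3
pmf = binomial_pmf_by_recursion(n, p)
mode = max(range(n + 1), key=lambda k: pmf[k])

print(mode, floor((n + 1) * p))
```

The location of the maximum agrees with the $\lfloor (n+1)p \rfloor$ rule, and the recursion avoids computing large factorials.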

In fact the Geometric is the special case where $r = 1$.

## Geometric

This is the number of trials until a success is obtained in a sequence of Bernoulli($p$) trials:

$$P(X = k) = p\,q^{k-1}, \qquad k = 1, 2, \ldots$$

## Hypergeometric Random Variable

A sample of size $n$ is chosen at random from an urn containing $N$ balls of which $m$ are white. This is called a draw without replacement. The random variable we are going to study is $X$, the number of white balls selected:

$$P(X = i) = \frac{\binom{m}{i} \binom{N-m}{n-i}}{\binom{N}{n}}.$$

This only takes on positive values for:

$$\max(0,\, n - (N - m)) \le i \le \min(n, m).$$

Example: Tagging animals, $N$ unknown. We conduct a tagging experiment by catching and marking $m$ animals, and later recapture a sample of size $n$, of which $i$ are found to be tagged. We estimate $N$ by maximising the probability $P(X = i) = P_i(N)$. To see which $N$ maximises this we notice that:

$$\frac{P_i(N)}{P_i(N-1)} = \frac{(N - m)(N - n)}{N\,(N - m - n + i)}.$$

This ratio is greater than 1 if and only if

$$N < \frac{mn}{i}.$$

So $P_i(N)$ increases and then decreases, and is maximal at $\lfloor mn/i \rfloor$. This is what is called the maximum likelihood estimate of $N$.

Example: There are $m = 50$ tagged deer in the forest (we mark them and release them). In a subsequent catch of $n = 40$, 4 are found to have marks; then $\hat{N} = \lfloor 50 \times 40 / 4 \rfloor = 500$.
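The maximum likelihood estimate in the deer example can be found by brute force, evaluating the hypergeometric probability for each candidate population size. This is a sketch with exact fractions; the search bound of 2000 is an arbitrary assumption, and `p_recapture` is a made-up name.

```python
from fractions import Fraction
from math import comb

def p_recapture(N: int, m: int, n: int, i: int) -> Fraction:
    """Hypergeometric P(X = i): i tagged in a sample of n, with m tagged among N."""
    return Fraction(comb(m, i) * comb(N - m, n - i), comb(N, n))

m, n, i = 50, 40, 4
# N must be at least m + (n - i): all tagged animals plus the untagged ones caught.
candidates = range(m + n - i, 2001)
likelihood = {N: p_recapture(N, m, n, i) for N in candidates}
best_value = max(likelihood.values())
maximisers = [N for N in candidates if likelihood[N] == best_value]

print(maximisers, float(best_value))
```

Because $mn/i$ is an integer here, the likelihood ratio equals exactly 1 at $N = 500$, so 499 and 500 tie for the maximum; the floor rule picks 500.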

Remark: If we supposed that the number of tagged animals found is binomial, we would get the same conclusion.

## Poisson random variable

Motivation: Situations occur where an event happens at random over a period of time: a tap drips a drop about every 5 minutes; a police office receives emergency calls; typos occur on a page. We have to take a period of time over which the rate is about unchanged (not like the police calls in early morning versus late afternoon).

Definition: A discrete random variable taking on values 0, 1, 2, ... with the probability mass function

$$P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots$$

is called a Poisson($\lambda$) random variable. We can check that this is a probability mass function because

$$\sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1.$$

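## Special Continuous Distributions

Change of Variables Theorem: Let $Y = g(X)$ for a one-to-one function $g$ (you can recognize this by whether it is always strictly increasing or always strictly decreasing), and let $X$'s density be $f_X$ and $Y$'s density $f_Y$. Then:

$$f_Y(y) = f_X\bigl(g^{-1}(y)\bigr) \left| \frac{d}{dy} g^{-1}(y) \right|.$$

I did several examples in lecture 11/4; the most important is the affine transformation $Y = g(X) = aX + b$ with $a \ne 0$. Then $g^{-1}(y) = \frac{y - b}{a}$, whose derivative $\frac{d}{dy} g^{-1}(y) = \frac{1}{a}$ gives the new density as:

$$f_Y(y) = \frac{1}{|a|}\, f_X\!\left(\frac{y - b}{a}\right).$$

This is the most important transformation.

## Exponential Random Variable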

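Density:

$$f(t) = \lambda e^{-\lambda t}, \qquad t \ge 0.$$

This is used to model waiting times, at telephone booths, at post offices, and in science for the time until an atom decays in radioactive decay. Call $T$ the time to decay for an atom; then $P(T > t) = e^{-\lambda t}$.

Physicists do not use this parametrization; they use the notion of half-life $h$, which is the median of the exponential:

$$e^{-\lambda h} = \frac{1}{2}, \qquad \text{so that } h = \frac{\ln 2}{\lambda}.$$

Example: Strontium 90 is a dangerous component of nuclear fallout. It has a half-life of 28 years, so that $\lambda = \frac{\ln 2}{28} \approx 0.0248$ per year.

How long does it take for 99% of the strontium to disappear? We need $e^{-\lambda t} = 0.01$, that is

$$t = \frac{\ln 100}{\lambda} = 28 \cdot \frac{\ln 100}{\ln 2} \approx 186 \text{ years}.$$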
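The strontium question reduces to solving $e^{-\lambda t} = 0.01$ with $\lambda = \ln 2 / h$. A small sketch of that computation follows; the function names are illustrative and the half-life of 28 years is the value quoted in the text.

```python
import math

def decay_rate(half_life: float) -> float:
    """Rate lambda of the exponential whose median is the half-life."""
    return math.log(2) / half_life

def time_until_fraction_left(fraction_left: float, half_life: float) -> float:
    """Solve exp(-lambda * t) = fraction_left for t."""
    return -math.log(fraction_left) / decay_rate(half_life)

t99 = time_until_fraction_left(0.01, 28.0)
print(round(t99, 1), "years")
```

Equivalently, 99% decay takes $\log_2 100 \approx 6.64$ half-lives, whatever the substance.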

## Standard Normal Random Variable

A continuous random variable $Z$ is defined by a density function $f$:

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \qquad -\infty < z < \infty.$$

Standard Normal Integrals:

Important fact:

## Normal Random Variable

If we take an affine transformation of a standard Normal random variable, $Y = aZ + b$ with $a > 0$, the new density of $Y$ is

$$f_Y(y) = \frac{1}{a\sqrt{2\pi}}\, e^{-(y - b)^2 / (2a^2)}.$$

This is called the Normal variable with parameters $b$ and $a^2$, denoted by $N(b, a^2)$.

For any such transformation we have: if $Y = \sigma Z + \mu$, then the density of $Y$ is

$$f_Y(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y - \mu)^2 / (2\sigma^2)}.$$

The two parameters that are needed to define a normal are $\mu = E[Y]$ and $\sigma = \sqrt{\mathrm{Var}(Y)}$. In general, if you have a Normal random variable with parameters $\mu$ and $\sigma^2$, the probabilities cannot be computed from a closed-form formula, so we standardize it. Say $Y \sim N(b, a^2)$. Then

$$Z = \frac{Y - b}{a}$$

is a standard $N(0,1)$ variable, so we can use the distribution function $\Phi$ of the standard Normal:

$$P(Y \le y) = \Phi\!\left(\frac{y - b}{a}\right).$$

How was the constant $\frac{1}{\sqrt{2\pi}}$ found?

We consider points distributed on the plane, with two independent standard Normal marginals, and we ask: what is the distribution of the distance to the origin? We suppose we don't know the constant of integration in the Normal density; call it $c$. As the coordinates are independent, the joint density is the product of the marginal densities:

$$f(x, y) = c\,e^{-x^2/2} \cdot c\,e^{-y^2/2} = c^2 e^{-(x^2 + y^2)/2}.$$

First we note that this takes the same value, whatever the point $(x, y)$, as long as it is at the same distance from the origin; call this distance $r = \sqrt{x^2 + y^2}$.

What is the density of the distance $R$ to the center, for a random point taken from this joint density? Here is its picture in 3-d:

From which we get:

$$f_R(r) = 2\pi r\,c^2 e^{-r^2/2}, \qquad r > 0.$$

This has to integrate to one:

$$\int_0^\infty 2\pi r\,c^2 e^{-r^2/2}\,dr = 2\pi c^2 = 1, \qquad \text{so } c = \frac{1}{\sqrt{2\pi}}.$$

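The expected value of the negative binomial NB($r$, $p$) is $\frac{r}{p}$, by writing it as the sum of $r$ Geometric($p$) random variables.

One would compute the expected number of cereal boxes to buy if there were $N$ objects to be collected (each equally likely to be in a box) in the same way, as a sum of geometric waiting times:

$$E[\text{number of boxes}] = \frac{N}{N} + \frac{N}{N-1} + \cdots + \frac{N}{1} = N \left(1 + \frac{1}{2} + \cdots + \frac{1}{N}\right).$$

And the number of matches in the matching problem can be written as the sum of the indicator variables $X_i$, where $X_i = 1$ if the $i$th person gets his/her own hat. As $E[X_i] = P(X_i = 1) = \frac{1}{n}$, we have

$$E\left[\sum_{i=1}^{n} X_i\right] = n \cdot \frac{1}{n} = 1.$$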

## Conditional Expectation

It is often useful to use conditional probabilities for the computation of expected values. The definition of the conditional expectation follows from the definition of conditional probability:

$$E[X \mid F] = \sum_{x} x\,P(X = x \mid F).$$

This is used through the following equivalent of the law of total probability: if $F_j$, $j = 1, \ldots, M$, is a partition of the state space $\Omega$,

$$E[X] = \sum_{j=1}^{M} E[X \mid F_j]\,P(F_j).$$

## Variance

$E[X]$ does not say anything about the spread of the values. This is measured by

$$\mathrm{Var}(X) = E\bigl[(X - E[X])^2\bigr],$$

which can also be written in the computational formula:

$$\mathrm{Var}(X) = E[X^2] - (E[X])^2.$$

The unit in which this is measured is not the same as that of $X$, so we very often use the standard deviation

$$SD(X) = \sqrt{\mathrm{Var}(X)}.$$

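For the Bernoulli($p$): $X^2 = X$, so $\mathrm{Var}(X) = E(X^2) - (E(X))^2 = p - p^2 = p(1 - p) = pq$.

Property 1: For two independent random variables $X$ and $Y$: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

This essential property allowed us to compute the variance of a binomial $S_n$, because we can write a binomial as the sum of $n$ independent Bernoulli($p$) random variables $X_i$, so that:

$$\mathrm{Var}(S_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = npq.$$

Example: What is the variance of the geometric? Using the computational formula, we compute first $E(X^2)$:

$$E(X^2) = \sum_{k \ge 1} k^2 p\,q^{k-1} = \frac{1 + q}{p^2}, \qquad \text{so } \mathrm{Var}(X) = \frac{1 + q}{p^2} - \frac{1}{p^2} = \frac{q}{p^2}.$$

From the sums of independent variables theorem, and the fact that a Negative Binomial $Y_r$ can be written as the sum of $r$ independent Geometrics, we have:

$$\mathrm{Var}(Y_r) = \frac{r q}{p^2}.$$

I also showed that the variance of the Poisson($\lambda$) random variable is $\lambda$, a fact that helps recognize a Poisson random variable.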