
Common Probability distributions

Discrete probability distributions

1. The Bernoulli distribution

Suppose there are two mutually exclusive outcomes in a given experiment. Examples are (head, tail),
(success, failure), (male, female) etc.

Generally the outcome space S is

S = {success, failure}

Define the random variable X such that

X(success) = 1 and X(failure) = 0

The probability of success is p. Hence, the probability of failure is 1-p

2. Binomial distribution

Doing N Bernoulli trials and summing the outcomes gives the Binomial random variable

B = ∑_{i=1}^{N} X_i

Now, assume the following

 N random trials
 The trials are independent
 The probability of success is the same from trial to trial

Then, the binomial probability distribution is given by


p(B = a) = C(N, a) p^a (1 − p)^(N − a), where C(N, a) = N!/(a!(N − a)!)

where N = number of random trials

p = probability of success

a = number of successes in N trials

The mean of the binomial distribution is

μ = Np

The standard deviation is

σ = √(Np(1 − p))
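As a quick numerical check, here is a minimal Python sketch (the values N = 10 and p = 0.3 are illustrative, not from the text) that builds the binomial probabilities from the formula above and confirms that they sum to one and reproduce the mean Np and standard deviation √(Np(1 − p)).

```python
from math import comb, sqrt

def binomial_pmf(a: int, N: int, p: float) -> float:
    """Probability of exactly a successes in N independent trials."""
    return comb(N, a) * p**a * (1 - p)**(N - a)

N, p = 10, 0.3                                   # illustrative values, not from the text
pmf = [binomial_pmf(a, N, p) for a in range(N + 1)]

print(sum(pmf))                                  # ~1.0: the probabilities sum to one
print(sum(a * q for a, q in enumerate(pmf)))     # ~3.0: matches the mean N*p
print(N * p, sqrt(N * p * (1 - p)))              # mean and standard deviation from the formulas
```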

3. The Poisson distribution

This is a discrete distribution used to determine the probability of a specific number of successes per unit of time, when the events/successes are independent and the average number of successes per unit of time remains constant. The probability distribution function is given by

p(x) = (λ^x e^(−λ)) / x!

where x = designated number of successes

p(x) = probability of x successes

λ = average number of successes per unit of time

The variance and the mean of the Poisson distribution are equal:

μ = σ² = λ
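A similar sketch for the Poisson distribution (λ = 2.5 is an illustrative value) computes the probabilities from the formula and confirms numerically that the mean and variance are both equal to λ.

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x successes when the average rate per unit of time is lam."""
    return lam**x * exp(-lam) / factorial(x)

lam = 2.5                                        # illustrative average rate
pmf = [poisson_pmf(x, lam) for x in range(60)]   # 60 terms are plenty for lam = 2.5

mean = sum(x * q for x, q in enumerate(pmf))
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))
print(mean, var)                                 # both ~2.5: mean and variance equal lam
```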

4. Discrete uniform distribution

Suppose there are N possible outcomes each equally likely, the probability function is

f(x) = 1/N for x = 1, 2, …, N

Continuous distributions

1. The normal distribution

The density function for a normal random variable is

f(x) = (1 / √(2πσ_x²)) e^(−(x − μ_x)² / (2σ_x²))

2. The exponential distribution

The density function of the exponential distribution is

f(x) = λe^(−λx) for λ > 0, x > 0

The area below the curve falls as x increases. This makes it desirable for analyzing problems involving
the time before a given event occurs (duration). Example: the probability that a machine will remain
operational for x years before failing.

The cumulative density function of the exponential distribution is given by

P(X ≤ x) = 1 − e^(−λx)

where λ = the mean number of occurrences for the interval of interest

The mean of the exponential distribution is

E ( X ) =1/ λ

The variance is

σ² = 1/λ²
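The following sketch (the rate λ = 0.5 per year and the 3-year horizon are illustrative assumptions) evaluates the exponential CDF for a machine-failure question of the kind described above, and checks the mean 1/λ and variance 1/λ² by simulation with Python's standard library.

```python
import random
from math import exp

lam = 0.5            # assumed failure rate (events per year), purely illustrative
x = 3.0              # horizon: probability the machine fails within 3 years

print(1 - exp(-lam * x))          # CDF: P(X <= 3) = 1 - e^(-1.5), about 0.78

# Simulation check of the mean 1/lam and variance 1/lam^2
random.seed(0)
draws = [random.expovariate(lam) for _ in range(100_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)                  # approximately 2.0 and 4.0
```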

3. The uniform distribution

The uniform distribution is given by

f(x) = 1 / (b − a)

Where x can assume any value in the interval [a,b]

Mathematics of expectations

Probability distributions provide a great deal of detailed information about random variables. In the interest of parsimony, and at the cost of some detail, we summarize a distribution with a few summary statistics. These statistics tell us the central tendency (typical value) and dispersion of the distribution.

Central tendency

1. Mean

The mean of a random variable X is

E(X) = ∑_{i=1}^{n} p_i x_i   if X is discrete

E(X) = ∫_{−∞}^{∞} x f(x) dx   if X is continuous

The probabilities are the weights. The mean is the position of the fulcrum that will keep the distribution in
balance.

Properties

For any random variable X and constants a and b

E(aX + b) = aE(X) + b

E(a) = a

E(bX) = bE(X)

For any function g(X)

E(g(X)) = ∑_{i=1}^{n} p_i g(x_i)   if X is discrete

E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx   if X is continuous

Note

In general, E(g(X)) ≠ g(E(X))
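A tiny numerical illustration of this note, using a made-up discrete distribution and g(x) = x², shows the two quantities differ:

```python
# X takes the values 1..4 with equal probability; g(x) = x**2
values = [1, 2, 3, 4]
probs = [0.25, 0.25, 0.25, 0.25]

mean_x = sum(p * x for p, x in zip(probs, values))          # E(X) = 2.5
e_g = sum(p * x**2 for p, x in zip(probs, values))          # E(g(X)) = 7.5
g_e = mean_x**2                                             # g(E(X)) = 6.25

print(e_g, g_e)                                             # 7.5 vs 6.25: not equal
```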

Moment

E(X^r) is referred to as the rth moment of X about the origin. The mean is the first moment of X about the origin.

E((X − μ)^r) is referred to as the rth moment of X about its mean.

2. The median

If the number m satisfies the two conditions

p(X < m) ≤ 1/2 and p(X > m) ≤ 1/2

Then m is the median of the distribution of X

3. The mode

The mode is the value of X associated with the largest probability (probability density). The value of x
associated with the peak of the density curve. It is the most likely outcome.

Comparisons

 The mean is unique


 The mode and median may not be unique (a distribution may have more than one median or mode, possibly of different values)
 Mean is more affected by outliers

Comparison based on desired objective

 If the objective is to minimize the mean squared deviation, the best estimator is the mean of X
 If the objective is to minimize the mean absolute deviation, the best estimator is median of X
 If the objective is to maximize the probability that the prediction error is exactly zero, the mode of X is best (see the sketch below).
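A small sketch (the values and probabilities of the discrete distribution are invented for illustration) searches a grid of candidate predictions and confirms which summary statistic is optimal under each objective:

```python
# A skewed discrete distribution; the values and probabilities are invented for illustration
values = [0, 1, 2, 10]
probs = [0.4, 0.3, 0.2, 0.1]

def mse(c):   return sum(p * (x - c) ** 2 for x, p in zip(values, probs))   # mean squared deviation
def mad(c):   return sum(p * abs(x - c) for x, p in zip(values, probs))     # mean absolute deviation
def p_hit(c): return sum(p for x, p in zip(values, probs) if x == c)        # P(prediction error = 0)

candidates = [v / 10 for v in range(0, 101)]        # grid of possible predictions, 0.0 to 10.0
mean = sum(p * x for x, p in zip(values, probs))    # 1.7

print(mean, min(candidates, key=mse))               # both 1.7: the mean minimizes squared deviation
print(min(candidates, key=mad))                     # 1.0: the median minimizes absolute deviation
print(max(values, key=p_hit))                       # 0: the mode maximizes P(error = 0)
```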

Measures of Dispersion

1. The range

The range is simply the difference between the maximum and the minimum values of X:

Range(X) = Max(X) − Min(X)

The range is highly sensitive to extreme values.

2. The variance (the standard deviation)

The variance of X is measured as

σ_x² = ∑_{i=1}^{n} p_i (x_i − E(X))²   if X is a discrete random variable

σ_x² = ∫_{−∞}^{∞} (x − E(X))² f(x) dx   if X is a continuous random variable

The variance is not in the same unit of measure as the mean: if X is in dollars, the variance is in squared dollars. To express the dispersion in the same unit of measure as the mean, the remedy is to take the square root of the variance, which gives the standard deviation.

Variance is the second moment of X around the mean

Properties

var(X) = E(X²) − [E(X)]²

This means, the variance is equal to the mean of the square minus the square of the mean

var ( a )=0

This says that a constant doesn’t vary.

If Y = bX + a, then

var(Y) = b² var(X)

Standard deviation(Y) = |b| · standard deviation(X)

Variances of nonlinear functions of X are given as follows

σ_{g(X)}² = ∑_{i=1}^{n} p_i (g(x_i) − E(g(X)))²   if X is a discrete random variable

σ_{g(X)}² = ∫_{−∞}^{∞} (g(x) − E(g(X)))² f(x) dx   if X is a continuous random variable

Chebyshev’s inequality

p(X ≤ μ_x − kσ_x or X ≥ μ_x + kσ_x) ≤ 1/k²

Equivalently,

p(|X − μ_x| ≥ kσ_x) ≤ 1/k²

This result doesn’t depend on any assumption about the distribution of the random variable.

The proof is based on the Markov Inequality

p(X ≥ t) ≤ E(X)/t   for a nonnegative random variable X and any t > 0
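A minimal simulation sketch (an exponential population is assumed purely for illustration) checks both bounds numerically: the observed tail frequencies never exceed 1/k², and P(X ≥ t) never exceeds E(X)/t for a nonnegative variable.

```python
import random

random.seed(1)
draws = [random.expovariate(1.0) for _ in range(100_000)]   # a skewed, non-normal population
mu = sum(draws) / len(draws)
sigma = (sum((d - mu) ** 2 for d in draws) / len(draws)) ** 0.5

for k in (1.5, 2, 3):
    tail = sum(1 for d in draws if abs(d - mu) >= k * sigma) / len(draws)
    print(k, tail, 1 / k**2)      # the observed tail frequency never exceeds the 1/k^2 bound

t = 4.0
tail = sum(1 for d in draws if d >= t) / len(draws)
print(tail, mu / t)               # Markov: P(X >= t) <= E(X)/t for this nonnegative variable
```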

Multivariate distributions

With multivariate distributions, we look at two or more random variables at once and their joint
distribution functions. It is possible to investigate relationships between two or more random variables in
this way.

Bivariate distributions
Discrete Bivariate distributions

Consider two random variables, X and Y. If the possible values of X are x_1, x_2, …, x_n and the possible values of Y are y_1, y_2, …, y_n, then there will be a finite set of pairs (x, y) that X and Y may jointly assume.

If we attach a probability to each of the different joint outcomes, we have a discrete bivariate probability distribution, or a joint probability function. The joint discrete probability distribution function for two random variables, X and Y, is the function f(x, y) such that for any point (x, y) in the X-Y plane, f(x, y) = p(X = x and Y = y). Thus, f(x, y) gives us the joint probability that the two random variables, X and Y, assume the joint outcome (x, y).

                  Y = 1            Y = 2            Y = 3            Marginal probability of X
X = 0             p(X=0 & Y=1)     p(X=0 & Y=2)     p(X=0 & Y=3)     P(X=0)
X = 1             p(X=1 & Y=1)     p(X=1 & Y=2)     p(X=1 & Y=3)     P(X=1)
Marginal of Y     P(Y=1)           P(Y=2)           P(Y=3)

Properties of Joint Probability distributions

Property I

f(x, y) = p(X = x and Y = y) ≥ 0 for all (x, y) pairs. The probabilities assigned to the joint outcomes are nonnegative.

Property 2

∑_{all x} ∑_{all y} f(x, y) = 1   The sum of the probabilities of all joint outcomes is equal to 1.

Cumulative joint probability function


A function F(x, y) that gives the probability that X is less than or equal to a given value x and, jointly, Y is less than or equal to a given value y:

F(x, y) = p(X ≤ x and Y ≤ y) = ∑_{s ≤ x} ∑_{t ≤ y} f(s, t)

This is known as the joint distribution function or the cumulative joint probability function. It is simply the sum over all outcomes satisfying X ≤ x and Y ≤ y.

Univariate and Bivariate distributions (Marginal distributions)

Given the bivariate distribution of X and Y we may want to move back to obtain the univariate
distributions of X and Y. These are known as the marginal probability distributions for X and Y. The
marginal distributions for X and Y, respectively are given by

g(x) = p(X = x) = ∑_{all y} f(x, y)

h(y) = p(Y = y) = ∑_{all x} f(x, y)

This means that if we want the probability that X equals a specific value x, we simply sum the joint probabilities across all values of y after fixing X = x. Doing this for all values of X gives us the marginal probability distribution of X. Similarly, we do the same for Y to obtain the marginal distribution of Y.

Conditional distributions

When we are interested to determine the distribution of one random variable conditional upon a second
random variable assuming a specific value, we compute the conditional probability distribution for the
variable. The conditional probability distributions of X and Y are given by

f_c(x | y) = p(X = x | Y = y) = f(x, y) / h(y)

f_c(y | x) = p(Y = y | X = x) = f(x, y) / g(x)
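The following sketch works through a small hypothetical joint probability table (the numbers are invented) with numpy: it checks that the entries sum to one, computes the marginal distributions of X and Y by summing across rows and columns, and forms a conditional distribution by dividing by a marginal.

```python
import numpy as np

# Hypothetical joint probability table f(x, y): rows are X = 0, 1; columns are Y = 1, 2, 3
f = np.array([[0.10, 0.20, 0.10],
              [0.15, 0.25, 0.20]])
assert np.isclose(f.sum(), 1.0)      # Property 2: the joint probabilities sum to one

g = f.sum(axis=1)                    # marginal distribution of X: sum across all y
h = f.sum(axis=0)                    # marginal distribution of Y: sum across all x
print(g)                             # [0.4 0.6]
print(h)                             # [0.25 0.45 0.3]

# Conditional distribution of X given Y = 2 (the second column): f(x, y) / h(y)
print(f[:, 1] / h[1])                # [0.444... 0.555...]
```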

Continuous bivariate distributions

We denote the joint density by f(x, y), with the property that

p(a ≤ X ≤ b and c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx

This function (f(x,y)) is called the bivariate joint probability density function for the random
variables X and Y

Properties

p(a ≤ X ≤ b and c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx ≥ 0

p(−∞ ≤ X ≤ ∞ and −∞ ≤ Y ≤ ∞) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dy dx = 1

Cumulative joint density function

A function F(x, y) that gives the probability that X is less than or equal to a given value x and, jointly, Y is less than or equal to a given value y:

F(x, y) = p(X ≤ x and Y ≤ y) = p(−∞ ≤ X ≤ x and −∞ ≤ Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(s, t) dt ds

This is known as the cumulative joint density function or the cumulative joint distribution function.

Properties of cumulative density functions

Property I: F(−∞, −∞) = 0. The probability that X is less than negative infinity and Y is less than negative infinity is zero.

Property 2: F(∞, ∞) = 1. The probability that X is less than positive infinity and Y is less than positive infinity is one.

Property 3

∂²F(x, y) / ∂x∂y = f(x, y)

Property 4

p(a ≤ X ≤ b and c ≤ Y ≤ d) = F(b, d) − F(a, d) − F(b, c) + F(a, c)

Marginal density function


g(x) = ∫_{−∞}^{∞} f(x, y) dy

h(y) = ∫_{−∞}^{∞} f(x, y) dx

Independence
Two random variables, X and Y are said to be independent if and only if

f(x, y) = g(x)h(y)

In general, N random variables X_1, X_2, …, X_N are independent if and only if

f(X_1, X_2, …, X_N) = f_{X_1}(X_1) f_{X_2}(X_2) … f_{X_N}(X_N)

Expectations

The expectation of a random variable given the bivariate distribution of this variable and one or more
other random variables can be computed in two ways. First, one can simply obtain the marginal
distribution function for the variable of interest and compute its expectation as usual based on the
marginal distribution. Second, one can also directly compute the expectation of a given random variable
based on the joint density function.

Discrete random variables

E(X) = ∑_{all x} ∑_{all y} x f(x, y)

var(X) = ∑_{all x} ∑_{all y} (x − E(X))² f(x, y)

E(Y) = ∑_{all x} ∑_{all y} y f(x, y)

var(Y) = ∑_{all x} ∑_{all y} (y − E(Y))² f(x, y)

Continuous random variables


E(X) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dy dx

var(X) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − E(X))² f(x, y) dy dx

E(Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dy dx

var(Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (y − E(Y))² f(x, y) dy dx

Expectations of functions of several random variables

Property 1

If X_1, X_2, …, X_N are N random variables,

E(X_1 + X_2 + … + X_N) = E(X_1) + E(X_2) + … + E(X_N)

E(∑_{i=1}^{N} X_i) = ∑_{i=1}^{N} E(X_i)

The expected value of a sum is the sum of the expected values. This is true whether or not the
variables are independent.

E(∑_{i=1}^{N} a_i X_i) = ∑_{i=1}^{N} a_i E(X_i)

Property 2
Let X and Y be two random variables with a joint density function f(x,y). If g(x,y) is some function of X
and Y,

E(g(X, Y)) = ∑_{all x} ∑_{all y} g(x, y) f(x, y)   if X and Y are discrete

E(g(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dy dx   if X and Y are continuous

Property 4

If X and Y are two independent random variables, then E ( XY )=E ( X ) E (Y ) .

Property 5 If X 1 , X 2 , … , X N are N independent random variables, then

var(X_1 + X_2 + … + X_N) = var(X_1) + var(X_2) + … + var(X_N)

var(∑_{i=1}^{N} X_i) = ∑_{i=1}^{N} var(X_i)

The variance of a sum of independent random variables is the sum of the individual variances.

Conditional expectations

Often, our interest lies in the expected value of a random variable conditioned on specific value of
another variable. The conditional expectation is given by

E(Y | x) = ∑_{all y} y f(y | x) = ∑_{all y} y f(x, y)/g(x)   if Y is discrete

E(Y | x) = ∫_{−∞}^{∞} y f(x, y)/g(x) dy   if Y is continuous

Measures of association
Besides the various summary measures used for individual random variables, when dealing with two or
more random variables, we can also measure association between the variables. Two measures of
association are commonly used: covariance and correlation.

Covariance

The covariance of two random variables X and Y is given by

cov(X, Y) = E[(X − E(X))(Y − E(Y))]

Covariance indicates the direction of association through its sign (+, −, 0). However, it is sensitive to the units of measurement, so its magnitude is a doubtful indicator of the strength of association.

Properties

Property I

Since covariance is an expected value it is obtained as follows

cov(X, Y) = ∑_{all x} ∑_{all y} (x − E(X))(y − E(Y)) f(x, y)   if X and Y are discrete

cov(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − E(X))(y − E(Y)) f(x, y) dy dx   if X and Y are continuous

Property 3

cov ( X , Y )=E ( XY ) −E ( X ) E (Y ).

Property 4

If X and Y are independent, then


cov ( X , Y )=0

While independence implies zero covariance, the converse is not true. That is, zero covariance doesn’t
necessarily imply independence.

Rationale: covariance is a measure of linear association between two random variables. Hence, it is possible that two variables with a strong nonlinear association still have a covariance close to zero (covariance measures only linear association), as the sketch below illustrates.
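A short sketch of this rationale: Y = X² is completely determined by X, yet when X is symmetric around zero the sample covariance is essentially zero (the standard normal choice for X is an assumption made only for the illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)      # symmetric around zero (assumed for the illustration)
y = x ** 2                        # Y is completely, but nonlinearly, determined by X

cov = np.mean((x - x.mean()) * (y - y.mean()))
print(cov)                        # close to 0, despite the perfect nonlinear dependence
```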

Property 5

If X=Y, then

cov(X, Y) = E[(X − E(X))(X − E(X))] = var(X)

The covariance of a random variable with itself is its variance.

Property 6

If X and Y are two random variables, then

var(X + Y) = var(X) + var(Y) + 2cov(X, Y)

var(X − Y) = var(X) + var(Y) − 2cov(X, Y)

Correlation

The correlation between two random variables X and Y is given by

Corr(X, Y) = E[ (X − E(X))(Y − E(Y)) / (σ_X σ_Y) ] = cov(X, Y) / (σ_X σ_Y)

The advantage of the correlation measure is that it is unit free: it gives a measure of the degree of association between two random variables that is not influenced by the units of measurement used.

Friedman’s Contention

Let the level of economic activity without government intervention be X. Let the amount of economic
activity increment due to government intervention (stabilizing policy) be Y. Both are random variables.

Traditional argument is that the mean level of economic activity with intervention is more stable than
without intervention. This mean level is given by

E( X +Y )

This is more constant than when there is no government intervention.

Friedman’s argument is that economic swings are due mainly to variance of GDP than mean level of
GDP. As a result the variance with intervention is

Var(X + Y) = var(X) + var(Y) + 2cov(X, Y)

Therefore, intervention is contributing to stabilize the economy only if

var ( Y ) +2 cov ( X ,Y )<0

This means the covariance must be negative, and large enough in magnitude, to offset the variance of the intervention variable. This alone is not a sufficient argument to counter the traditional point. Rather, Friedman argued that the covariance term is negligible because intervention affects the economy with a long and variable lag, making the covariance very small. Hence, rather than stabilizing the economy, these measures may be surprisingly destabilizing.

Properties of correlation
Property 1

−1 ≤ ρ XY ≤ 1

The proof is based on the Cauchy–Schwarz inequality, which states that for any two random variables U and V,

[E(UV)]² ≤ E(U²)E(V²)

Property 2

|ρ_XY| = 1 if and only if Y = a + bX

Only if there is a perfect linear relationship between the two variables will the correlation between them equal one in absolute value.

Property 3

Correlation doesn’t imply causality. Two things may move together but there may be no causal relation
whatsoever between them.

Sampling and sampling distributions

The population is the universe of individuals of interest and contains all the characteristics of interest.

The sample is the part of the population taken for study. To draw reasonable conclusions about the population based on sample observations, the sample must be selected so that it mimics the population. It is a replica, in miniature, of the population.
Of the many ways to select a sample, simple random sampling is one way to ensure that the sample is representative of the population of interest. Simple random sampling is characterized as follows:

1) Each member of the population has equal chance of being selected


2) The observations are independent: which member is selected doesn’t affect the chance that
another is selected
3) Since the items are from the same population, simple random sample observations are said to be
identically and independently distributed

Why sample

 If population is infinite, obviously one can’t do a census of it


 Cost saving: samples provide reasonably adequate information at much lower cost than census
 Time: a census is time consuming, while sampling reduces the time required to gather information and allows quick decisions to be made based on the information

What sample size

 This depends critically on the precision of the inference we want to make about the population

Populations, probability distributions and data

A sample of size n from a population can be thought of as a set of random variables: X1, X2, …, Xn. The value of the first observation (X1), for instance, is unknown until a particular sample is observed. In one sample it may assume one specific value, in a second sample another value, and so on. Hence, it is a random variable. Similarly, all the observations can be thought of as realizations of random variables. The distribution of each of these variables is determined by the distribution in the population. If a population has mean V and variance D, then each of these random variables has mean V and variance D.

But, once we have taken a specific sample of size n, we have obtained real values that are simply
realizations of the random variables. If we take another sample, we may get yet different realizations for
each of our random variables.

The realizations are the data for our study.


Suppose X1, X2, …, Xn form a simple random sample of size n taken from a population with an unknown parameter value θ. For instance, this parameter may be the mean value of the population, μ. Now, we want to infer the value of this parameter based on our sample observations. To do so, we form a sample statistic. A sample statistic is a real valued function

T = r(X1, X2, X3, …, Xn)

of the random variables X1, X2, …, Xn.

It is a formula that shows how to combine the sample data to form a point estimate of the population
parameter. As such it is called an estimator. Once data is obtained, and we have realized values x1,
x2…xn we plug these into the formula and get an estimate of the population parameter. This estimate
is called a point estimate (it is a single number instead of a range of numbers).

Sample statistics

Since T is a function of random variables, it is also a random variable. That is, its value depends on
the specific sample drawn and since which sample is to be drawn is unknown, so is the value of T
uncertain. Thus, the sample statistic is itself a random variable with probability distribution. The
distribution of a sample statistic is known as a sampling distribution.

The normal probability distribution

 Symmetric
 Bell shaped
 Continuous

Properties

If a random variable X is normally distributed with mean μ_X and variance σ_X², we write

X ~ N(μ_X, σ_X²)

The central limit theorem for the sample mean is written as

X̄ ~ N(μ_X, σ_X̄²)

where

σ_X̄² = σ_X² / n

Property 2

The normal distribution is completely characterized by two parameters: μ_X and σ_X². For every different value of the mean and variance we have a different normal distribution. Different means change the center and different variances change the spread of the distribution.

Property 3

The normal distribution is symmetric, with probability 0.5 on each side of the mean, and mean = mode = median for the normal distribution.

Property 4

Any linear function of a normal random variable is also normally distributed. If


X ~ N(μ_X, σ_X²) and Y = a + bX, then

E(Y) = a + bμ_X   and   Var(Y) = b²σ_X²

and

Y ~ N(a + bμ_X, b²σ_X²)

Special case

If in the above linear relation we let

a = −μ_x / σ_x   and   b = 1 / σ_x

We have

(X − μ_x) / σ_x ~ N(0, 1)

Property

Any linear combination of normal random variables is normally distributed. It is from this result that X̄ is normally distributed even when the sample size is not large, as long as the population, and hence each Xi, is normally distributed.

Sampling distribution of the sample mean

The mean of a sample is

X̄ = (1/n) ∑_{i=1}^{n} X_i

This is the same as the population mean except that the sample mean is a random variable while the
population mean is a constant.

Since a sample statistic is a random variable, and since the sample mean is a sample statistic, the sample mean is a random variable and has a mean and a variance.

The mean of the sample mean is

E(X̄) = E((1/n) ∑_{i=1}^{n} X_i)

E(X̄) = (1/n) E(∑_{i=1}^{n} X_i)

E(X̄) = (1/n) ∑_{i=1}^{n} E(X_i)

E(X̄) = (1/n) ∑_{i=1}^{n} μ_X

E(X̄) = (1/n) n μ_X

E(X̄) = μ_X

The mean of the sample mean is the population mean. Here use is made of the fact that each random variable (observation) has the same mean and variance as the population, and that the population mean is a constant.

Variance of the sample mean is

Var(X̄) = var((1/n) ∑_{i=1}^{n} X_i)

var(X̄) = (1/n²) ∑_{i=1}^{n} var(X_i)   (using independence of the observations)

var(X̄) = (1/n²) ∑_{i=1}^{n} σ_X²

var(X̄) = (1/n²) n σ_X²

var(X̄) = σ_X² / n

The variance of the sample mean is simply the population variance of the variable X (whose value is not
known to us) divided by the sample size.
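A brief simulation sketch (a normal population with variance 4 is assumed only for illustration) shows the variance of the sample mean tracking σ²/n as n grows:

```python
import numpy as np

rng = np.random.default_rng(7)
pop_var = 4.0                                     # assumed population variance, for illustration

for n in (5, 20, 80):
    means = rng.normal(10.0, np.sqrt(pop_var), size=(50_000, n)).mean(axis=1)
    print(n, means.var(ddof=0), pop_var / n)      # simulated var of the sample mean tracks σ²/n
```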

The distribution of the sample mean

Having derived the mean of the sample mean and its variance (standard deviation), the next question is
what is its exact shape or how is it distributed. The central limit theorem provides an answer to this.

Central limit theorem


If X_1, X_2, X_3, …, X_n constitute a simple random sample from a population with mean μ_X and variance σ_X², then the random variable

(X̄ − μ_x) / √(σ_x² / n)

has an approximately normal distribution. The approximation improves with the size of n. The sampling distribution of the sample mean follows a normal distribution when n is large (n > 30). This holds regardless of the distribution of the population. On the other hand, even if the sample size is small, as long as the population has a normal distribution, the result still holds.
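The following sketch (an exponential population with mean 1, sample size 50, and 20,000 replications, all illustrative choices) draws repeated samples from a skewed population and checks that the standardized sample mean behaves approximately like a standard normal variable:

```python
import numpy as np

rng = np.random.default_rng(42)
lam, n, reps = 1.0, 50, 20_000                    # skewed exponential population, n = 50

means = rng.exponential(1 / lam, size=(reps, n)).mean(axis=1)

mu, sigma = 1 / lam, 1 / lam                      # population mean and standard deviation
z = (means - mu) / (sigma / np.sqrt(n))           # standardized sample means

print(means.mean(), mu)                           # about 1.0: E of the sample mean equals μ
print(means.std(ddof=0), sigma / np.sqrt(n))      # about 0.14: sd of the sample mean is σ/√n
print(np.mean(np.abs(z) <= 1.96))                 # about 0.95, as the normal approximation predicts
```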

The Chi-Squared and T-distributions

As was done for the sample mean, it is also possible to derive the sampling distribution of the sample variance.

The sample variance is obtained by

S_X² = (1/(n − 1)) ∑_{i=1}^{n} (x_i − X̄)²

The sample variance differs from population variance only in one aspect: the denominator is n-1 instead
of n. This has its own advantages and that is the reason for it.

If we draw a random sample of size n repeatedly we will have different sample variances. As such, the
sample variance is a random variable and has its own probability distribution. What distribution does it
follow?

The chi-squared distribution

Let Z1, Z2, Z3,…Zn be n independent random variables each with a standard normal distribution. Then
∑_{i=1}^{n} Z_i² ~ χ_n²

where χ_n² denotes the chi-squared distribution. If we square and sum n standard normal variables, we have a new
random variable which has a chi-squared distribution. The shape of the chi-squared distribution depends
on n, which is known as the degrees of freedom.

The mean of the chi-squared distribution is

E(χ_n²) = n

In this relation, observe that

(n − 1)S_x² / σ_X² ~ χ_{n−1}²

With n-1 degrees of freedom.

The t-Distribution
Let z and χ_v² be two independent random variables, where z ~ N(0, 1) and χ_v² is chi-squared distributed with v degrees of freedom. Then, if we form the new random variable

t = z / √(χ_v² / v)

This variable has the t distribution with v degrees of freedom.

The mean of the t distribution with v degrees of freedom is

E(t ( v ) )=0

Properties of the t-distribution

 The t-distribution is symmetrical and bell shaped.


 It is nearly identical to the normal distribution. When N ≥ 120, the t-distribution and the z distribution are almost identical. For N < 120, the t-distribution has more probability in the tails than the z distribution.

The importance of the t-distribution

It was observed that the sample mean is normally distributed. That is,

(X̄ − μ_x) / (σ_x / √n) ~ N(0, 1)

But, usually we don’t know the population standard deviation. Let’s assume we can replace it with the
sample standard deviation, Sx. Now, we have a new random variable given by

(X̄ − μ_x) / (s_x / √n)
But, this random variable is not normally distributed. Rather, it has the t-distribution with n-1 degrees of
freedom. This is clear from the following

(X̄ − μ_x) / (s_x / √n) = [(X̄ − μ_x) / (σ_x / √n)] / √[ ((n − 1)s_x² / σ_x²) / (n − 1) ]

The denominator on the right side is distributed as

√(χ_{n−1}² / (n − 1))

In other words, the random variable (X̄ − μ_x) / (s_x / √n) is put in the form

z / √(χ_{n−1}² / (n − 1)) ~ t_{n−1}

That is, it is a random variable that follows the t-distribution with n − 1 degrees of freedom.

The F distribution

Consider two independent random variables, χ_{v1}² and χ_{v2}², which are both chi-squared distributed. A new random variable can be formed as follows:

F = (χ_{v1}² / v1) ÷ (χ_{v2}² / v2)

This new variable has the F distribution. There are two parameters that characterize this distribution: the
numerator degrees of freedom and denominator degrees of freedom.
Estimation

Now, any population distribution can be characterized by a few parameters. The normal distribution, for
instance, is characterized by its mean and variance. The exponential distribution is characterized by λ and
so on. The interest is in making an estimate of these population parameters.

Potentially, there is an infinite number of parameter estimators. An estimator is a rule that indicates how to compute an estimate of a population parameter from the given data. An actual number obtained after plugging data into the estimator function is an estimate. An estimate may be a point estimate or an interval estimate. From the infinite number of potential estimators we have to choose the one that provides the best estimate based on some criteria.

Finite sample properties

Denote the population parameter of interest by θ and denote the estimator by θ̂. There are many desirable characteristics of an estimator. Typically, the properties used to judge between estimators are classified as

 Finite sample properties : properties that hold regardless of the size of the sample used to
generate the estimate
 Asymptotic properties: often properties of finite sample estimators are difficult to derive and we
must resort to asymptotic properties that hold only when the sample size is large.

Criteria 1: Unbiasedness

An estimator is said to be unbiased when the expected value of the estimator equals the true value of the
population parameter. That is,

E(θ̂) = θ

The bias of an estimator is


Bias = E(θ̂) − θ

Potentially there are several unbiased estimators of the true population parameter. Hence, we need more
criteria.

The desirability of dividing by n-1 for sample variance

It is noted that the sample variance is given by

S_X² = (1/(n − 1)) ∑_{i=1}^{n} (x_i − X̄)²

This is unbiased. But, the alternative is biased:

S_X² = (1/n) ∑_{i=1}^{n} (x_i − X̄)²

When this is used, the expected value of the estimator becomes

E(S_X²) = ((n − 1)/n) var(X)

The bias factor, (n − 1)/n, approaches 1 as n increases, so the bias gets smaller as n increases.
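A small simulation sketch (σ² = 9 and n = 5 are illustrative) shows the n − 1 divisor giving an unbiased estimate on average, while the n divisor is centered at (n − 1)/n times the true variance:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, reps = 9.0, 5, 200_000                 # a small n makes the bias easy to see

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)         # divide by n - 1
s2_biased = samples.var(axis=1, ddof=0)           # divide by n

print(s2_unbiased.mean())                         # about 9.0 = σ²
print(s2_biased.mean())                           # about 7.2 = ((n - 1)/n) σ²
```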

Criteria 2: Efficiency (minimum variance estimator)

If two estimators are both unbiased, centered on the true parameter, they will give us the correct value on
average. But, the two estimators may have different variances. Then we choose the one whose estimates
are more concentrated around the true value. We say that one estimator (θ̂₁) is more efficient than another (θ̂₂) if it has a smaller sampling variance. That is,

var(θ̂₁) ≤ var(θ̂₂)


In actual practice we don't observe the true parameter value, and though repeated sampling is conceptually possible, we only observe a single estimate. Thus, the chance that this estimate is close to the true value is increased if the estimator has a smaller variance.

Mean squared error

Often there is a tradeoff involved: one estimator may be efficient but biased and another might be
unbiased but less efficient. Which one is best under this situation depends on several considerations. In
general, one way of formalizing these points is by defining a loss function that explicitly shows the loss or costs associated with deviations between θ̂ and θ. Let the loss function be L(θ̂, θ), which measures the loss when the population parameter is θ and we estimate it to be θ̂. The most commonly used loss function is the mean squared error loss function, where

MSE = E(θ̂ − θ)²

We can decompose MSE as follows

MSE = var(θ̂) + bias²

In other words, MSE shows both efficiency and bias at once.

The sample variance estimator

It can be shown that the sample estimator

S_X² = (1/(n − 1)) ∑_{i=1}^{n} (x_i − X̄)²

is unbiased but has a higher MSE than the estimator

S_X² = (1/n) ∑_{i=1}^{n} (x_i − X̄)²
Infinite sample property:

Finite sample properties relate to an estimator’s sampling distribution. If an estimator is unbiased, it is


unbiased regardless of the sample size drawn to estimate it. Often, it is impossible to find an estimator possessing the required properties in a finite sample. In this situation, we consider the sampling distributions of the estimators in the limiting/asymptotic case. That is, the properties of the estimators when the sample size is large are compared to choose between estimators.

Asymptotic unbiasedness

An estimator is said to be asymptotically unbiased when

lim_{n→∞} E(θ̂) = θ

Consistency

An estimator is said to be consistent when the probability that the estimator differs from the actual population parameter by more than a small value ε goes to zero as n goes to infinity:

lim_{n→∞} p(|θ̂ − θ| ≥ ε) = 0 for any ε > 0

Asymptotic efficiency

An estimator is said to be asymptotically efficient if it converges to the true population parameter value
more quickly than any other estimator

Sufficient estimator

An estimator is sufficient if the conditional joint distribution of the sample observations, given the estimator value, doesn't depend on the population parameter value.

Methods of finding estimators


To actually find good estimators with more of the desirable features listed above, we use alternative
methods of finding estimators: method of moments, method of maximum likelihood, and the method of
least squares.

Method of moments

The method seeks to equate the moments implied by a statistical model of the population distribution with
the actual moments observed in the sample. It assumes that relations that hold for the population also hold
for the sample.

The method of moments doesn’t necessarily generate unique estimators. One can start with the mean or
the variance etc and still may get different estimators. This however has benefits. The different estimators
should give the same result if the assumed distribution of the population is correct.

Method of maximum likelihood

The main idea of maximum likelihood is that the data we observe are more likely to be associated with some distributions than with others. Typically, we start by assuming that our data are drawn from some population distribution with unknown parameter θ. We then select our estimate of this parameter so as to maximize the likelihood of seeing the data we actually saw.

Formally, consider a random drawing of the sample observations X1, X2, …, Xn from a distribution f(X | θ) with unknown parameter θ. The likelihood of observing the sample we saw is the joint density function

L(θ) = f(X1 | θ) f(X2 | θ) … f(Xn | θ)

This is known as the likelihood function. For convenience we convert this into a logarithm form and get
the log likelihood function. Then we maximize

ln L(θ) = ln f(X1 | θ) + ln f(X2 | θ) + … + ln f(Xn | θ)

With respect to θ
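As a concrete sketch of this recipe (the exponential model and the true value λ = 0.5 are assumptions made for illustration), the following code maximizes the exponential log likelihood over a grid of candidate λ values and compares the result with the analytical maximizer 1/x̄:

```python
import numpy as np

rng = np.random.default_rng(11)
true_lam = 0.5                                    # assumed true parameter, for illustration
x = rng.exponential(1 / true_lam, size=5_000)     # the observed data

# Exponential log likelihood: ln L(lam) = n ln(lam) - lam * sum(x_i)
lams = np.linspace(0.05, 2.0, 400)
loglik = len(x) * np.log(lams) - lams * x.sum()

lam_hat_grid = lams[loglik.argmax()]              # maximize over a grid of candidate values
lam_hat_formula = 1 / x.mean()                    # analytical maximizer for this model

print(lam_hat_grid, lam_hat_formula)              # both close to 0.5
```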

Properties
 If an ML Estimator is found to be unbiased, it is the most efficient possible estimator of all
possible estimators. If we can show the estimator is unbiased, we know it is the best.
 Almost all ML estimators are consistent
 Most ML estimators follow a normal distribution when the sample size is large. That is, they are asymptotically normal.

Method of least squares

Though it is not as general as the two other methods, it is also widely used in some situations especially in
relation to linear models.

It selects the single estimator to minimize the sum of squared errors (SSE).

For instance, if we have n random variables, Y1,Y2,…., Yn, we can represent each random observation as

Y_i = μ_Y + ε_i

Since each observation provides an unbiased estimator of the mean value of the population,

E(Y_i) = μ_Y

because

E(ε_i) = 0

and

Var(ε_i) = var(Y_i)

Now, for each observation i we have

ε_i = Y_i − μ_Y

The least squares estimator is found by minimizing the function


∑_{i=1}^{n} ε_i² = ∑_{i=1}^{n} (Y_i − μ_Y)²

Thus, minimizing this with respect to μ_Y yields the least squares estimator

μ̂_Y = (∑_i Y_i) / n

Interval estimation and hypothesis testing

Often we want to develop a range of numbers within which we expect the population parameter to lie
with certain level of confidence. An interval estimation is intended for this purpose. Put simply, interval
estimation involves determining an upper and lower bound within which the population parameter resides
with a certain level of confidence. The degree of our certainty about the estimate is referred to as the confidence coefficient and is denoted by 1 − α. The corresponding interval is the 100(1 − α)% confidence interval.

An interval estimation of the population mean

An interval estimate of the population mean, μ_X, consists of two bounds within which we expect μ_X to reside. That is,

LB ≤ μ_X ≤ UB

Where LB and UB are the lower bound and upper bound, respectively.

The probability that μ_X lies within the provided interval is known as the confidence coefficient and is denoted by 1 − α. We call α the significance level. For a specific value of α we refer to LB ≤ μ_X ≤ UB as the 100(1 − α)% confidence interval.

Confidence interval : Interpretation


One possible interpretation (the frequentist interpretation) is as follows. Suppose we repeatedly draw a large number of samples of size n from a population that has a mean of μ_X and construct the interval from each sample; then approximately 100(1 − α)% of these intervals will contain μ_X.

Deterministic linear model

y = f(x)

That is, the value of y is entirely determined by the value of x

Non deterministic model

y=f ( x )+ ϵ

In this model, the value of y is also affected by other random factors: factors other than x.

The error term (epsilon) arises due to many factors such as

 Unpredictability of human behavior


 The presence of left out variables (variables not included in the model that may affect y)
 Measurement error in the dependent variable y

Because the error term is a random variable, and y is a linear function of this random error, then y is also
a random variable.

For every value of x, there are a number of possible values of y, which vary around the mean value of y; the distance from this mean value is what the error term represents.

Linear regression model


y_i = B_0 + B_1 x_i + ε_i

Assumptions

 Assumption 1: the expected value of the error term is zero, E(ε_i) = 0, which implies

E(y_i) = B_0 + B_1 x_i

The additional assumptions are needed for hypothesis testing on the estimated population parameters (the coefficient of x and the constant).

Assumption 2: the error term variance is the same for all observations

var(ε_i) = σ_ε² for all i

When this is not satisfied, there is said to be heteroscedasticity (as opposed to homoscedasticity).

This assumption means that the errors are drawn from distributions with the same variance.

Assumption 3: the covariance of the error terms for any two observations is zero

Cov(ε_i, ε_j) = 0 for all i and j where i ≠ j

The two assumptions together imply that the errors are identically and independently distributed random variables. A minimal least-squares fit under these assumptions is sketched below.
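A minimal least-squares sketch under these assumptions (the true values B₀ = 2, B₁ = 0.7 and the uniform design for x are illustrative): it simulates data from the model and recovers the coefficients with the usual least-squares formulas.

```python
import numpy as np

rng = np.random.default_rng(5)
B0, B1, n = 2.0, 0.7, 500                         # assumed true parameters, for illustration

x = rng.uniform(0, 10, n)
eps = rng.normal(0, 1.0, n)                       # mean-zero, constant-variance, independent errors
y = B0 + B1 * x + eps

# Least squares estimates of the slope and intercept
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

print(b0_hat, b1_hat)                             # close to 2.0 and 0.7
```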
