
Statistics

Chapter Two: Estimation


Estimation: Suppose θ is a parameter of a population that is to be estimated. Depending on the
information available about θ, two types of situations may occur:
Case-I: The population parameter θ is totally unknown, and on the basis of a sample drawn
from the population we may construct a function whose values will estimate θ. The aim will be
to construct the function in such a way that it gives us a good estimate. This case refers to point
estimation and interval estimation.
Case-II: We are not completely in the dark about the value of the unknown parameter θ, but
we do not have pinpoint information regarding its true value. In this situation, based on the
information we have regarding the value of θ, we may claim that θ takes a value θ = θ0 or
θ = θ1. What remains is to prove or disprove our claim. Thus, regarding the value of θ, we have
some hypothesis and we need to test that hypothesis. This case we will refer to as testing of hypothesis.

Point estimation:
A statistic Θ̂ = Θ̂(X1, X2, · · · , Xn) is called an estimator of θ if the observed values of Θ̂ or the
distribution of Θ̂ are/is able to approximate θ as well as possible (mathematically we say that Θ̂
has central tendency at θ). For a certain observation (x1, x2, · · · , xn) of the sample, the value of Θ̂,
denoted by θ̂(x1, x2, · · · , xn), is called an estimate or point estimate of θ.
Example. For example, the value x̄ of the statistic X̄, computed from a sample of size n, is a
point estimate of the population parameter µ. Similarly, p̂ = x/n is a point estimate of the true
proportion p for a binomial experiment.
Caution: An estimator is not expected to estimate the population parameter without error. We
do not expect X̄ to estimate µ exactly, but we certainly hope that it is not far off.

Properties defining a good point estimator:


(I) Unbiasedness: A point estimator Θ̂ is called an unbiased estimator of the parameter θ if E(Θ̂) = θ.
Example. X̄ and S² are unbiased estimators of the population mean and the population variance,
respectively; but (X1 − X̄)² + (X2 − X̄)² + · · · + (Xn − X̄)² is not an unbiased estimator of σ².

The bias of an estimator is measured by the quantity Bias(Θ̂), defined by


Bias(Θ̂) = E(Θ̂) − θ = E(Θ̂ − θ).

The mean squared error (MSE) of an estimator is measured by the quantity MSE(Θ̂), defined
by MSE(Θ̂) = E[(Θ̂ − θ)²]. Note that if Θ̂ is unbiased, then MSE(Θ̂) = Var(Θ̂) = σ²_Θ̂.
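
As a quick illustration (a minimal simulation sketch, not from the text; NumPy, the N(µ, σ²) population, and the sample size n = 10 are illustrative assumptions), one can check the bias and MSE of S² numerically against the estimator that divides by n instead of n − 1:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)          # S^2, divisor n - 1 (unbiased)
v = x.var(axis=1, ddof=0)           # divisor n (biased downward)

print("true sigma^2         :", sigma**2)
print("mean of S^2          :", s2.mean())      # close to sigma^2 = 4
print("mean of biased est.  :", v.mean())       # close to (n-1)/n * sigma^2 = 3.6
print("estimated MSE of S^2 :", ((s2 - sigma**2)**2).mean())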

(II) Consistency: A point estimator Θ̂ = Θ̂(n) (which is obviously a function of n, where n is the
sample size) for the parameter θ is consistent if for every ε > 0 the following holds: P (|Θ̂(n) − θ| >
ε) → 0 as n → ∞.
Example. The sample mean X̄ is a consistent estimator of the population mean µ. (Use Cheby-
shev's inequality to prove it.) The sample variance S² is a consistent estimator of the population
variance σ². (Use the fact that (n − 1)S²/σ² ∼ χ²(n − 1) together with Chebyshev's inequality to prove it.)
Theorem. If Θ̂ = Θ̂(n) is an estimator of θ such that E(Θ̂(n)) → θ and Var(Θ̂(n)) → 0 as n → ∞,
then Θ̂ is a consistent estimator of θ.

Application of the theorem: X̄ is a consistent estimator of the population mean µ, since E(X̄) = µ
and Var(X̄) = σ²/n → 0 as n → ∞.
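
A minimal simulation sketch (NumPy assumed; the exp(1) population, ε = 0.1, and the sample sizes are illustrative choices) showing P(|X̄ − µ| > ε) shrinking as n grows:

import numpy as np

rng = np.random.default_rng(1)
mu, eps, reps = 1.0, 0.1, 2_000        # exp(1) population has mean mu = 1

for n in (10, 100, 1_000, 10_000):
    xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - mu) > eps))   # empirical P(|X̄ - µ| > ε)
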
(III) Efficiency: A point estimator Θ̂1 is called more efficient than a point estimator Θ̂2 if σ²_Θ̂1 ≤ σ²_Θ̂2. Note
that if Θ̂1 and Θ̂2 are unbiased, then Θ̂1 is more efficient than Θ̂2 if MSE(Θ̂1) ≤ MSE(Θ̂2).

Remark: According to Fisher, consistency is not enough to say that an unbiased estimator Θ̂
is good. It is also needed that the estimator should converge to the corresponding parameter as
speedily as possible. The efficiency of an estimator determines how speedily it converges to the
parameter: a more efficient estimator converges more speedily.
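
A minimal simulation sketch (NumPy assumed; the N(0, 1) population and the sample size are illustrative) comparing two unbiased estimators of a normal mean, the sample mean and the sample median: the mean has the smaller variance, so it is the more efficient of the two.

import numpy as np

rng = np.random.default_rng(2)
n, reps = 25, 50_000
x = rng.normal(0.0, 1.0, size=(reps, n))

var_mean = x.mean(axis=1).var()
var_median = np.median(x, axis=1).var()
print("Var(sample mean)   :", var_mean)        # about sigma^2/n = 0.04
print("Var(sample median) :", var_median)      # about pi*sigma^2/(2n) ≈ 0.063
print("ratio (median/mean):", var_median / var_mean)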

(IV) Sufficiency: A point estimator Θ̂ = Θ̂(X1, X2, · · · , Xn) (which is a function of the random
variables X1, X2, · · · , Xn constituting a sample) is sufficient for the underlying parameter θ if the con-
ditional probability distribution of (X1, X2, · · · , Xn), given the statistic Θ̂, does not depend on the
parameter θ, i.e., the expression P[(X1, X2, · · · , Xn) | Θ̂] is free of the parameter θ.
Example. 1. The sample mean is a sufficient estimator (for instance, of µ in a normal population).
2. Consider a b(m, p) (or Poisson p(λ)) variate population. Let (X1, X2, · · · , Xn) be a sample. Then it
is easy to see that X1 + X2 + · · · + Xn is a sufficient estimator of p (or λ).

Theorem (Fisher-Neyman factorization theorem). Let S be a population with parameter θ. For
a random sample (X1, X2, · · · , Xn), an estimator Θ̂ = Θ̂(X1, X2, · · · , Xn) of θ is sufficient if and
only if there exist non-negative functions h and g such that

f(x1, x2, · · · , xn; θ) = h(x1, x2, · · · , xn) g(Θ̂(x1, x2, · · · , xn); θ).

Corollary. If Θ̂ is a sufficient estimator for θ, then aΘ̂ is also a sufficient estimator for θ, where
a is a non-zero constant.
Example (Contd). 1. X1 + X2 + · · · + Xn and X̄ are sufficient statistics for µ in a N(µ, σ²)-
population.

2. Max{X1, X2, · · · , Xn} is a sufficient statistic for θ in a U[0, θ]-population.

3. X1 + X2 + · · · + Xn and X̄ are sufficient statistics for λ in an exp(λ)-population.
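
As an illustration of the factorization theorem (a sketch, not part of the original text), consider the Poisson case of example 2 above. For a sample (X1, X2, · · · , Xn) from a p(λ) population, the joint pmf is

f(x1, x2, · · · , xn; λ) = e^(−nλ) λ^(x1 + x2 + · · · + xn) / (x1! x2! · · · xn!),

which factors as h(x1, x2, · · · , xn) g(Θ̂(x1, x2, · · · , xn); λ) with h(x1, x2, · · · , xn) = 1/(x1! x2! · · · xn!), g(t; λ) = e^(−nλ) λ^t and Θ̂ = X1 + X2 + · · · + Xn; hence X1 + X2 + · · · + Xn is sufficient for λ, as claimed.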


Methods of finding point estimators:
(I) Method of moments.
(II) Method of Maximum Likelihood (ML).
(III) Method of least squares.
(IV) Method of minimum Chi-square.

However, the most important and widely used method of estimation is the maximum likelihood
method, and the resulting estimators are known as maximum likelihood estimators (m.l.e.).

Maximum Likelihood method of estimation:


Suppose we are given an f(x, θ) variate population where θ is a parameter. Consider a
random sample (X1, X2, · · · , Xn) of size n from the population. Obviously the joint pmf/pdf of
(X1, X2, · · · , Xn) is given by f(x1, x2, · · · , xn; θ) = f(x1, θ) f(x2, θ) · · · f(xn, θ), where (x1, x2, · · · , xn)
denotes an observation of the sample (X1, X2, · · · , Xn). Define a function L by

L(θ) = f(x1, x2, · · · , xn; θ) = f(x1, θ) f(x2, θ) · · · f(xn, θ).

We call this function L(θ) the likelihood function. Our problem is to find the value of θ, say θ̂,
which makes the likelihood function L(θ) a maximum. Obviously θ̂ will be a function of (x1, x2, · · · , xn)
(i.e., θ̂ = θ̂(x1, x2, · · · , xn)). Thus the corresponding maximum likelihood estimator will be Θ̂ =
θ̂(X1, X2, · · · , Xn).
Taking logarithms on both sides of the above equality we get logₑ L(θ) = logₑ f(x1, θ) + logₑ f(x2, θ) + · · · + logₑ f(xn, θ).
So, to maximize L(θ) it is enough to maximize logₑ L(θ). Thus if θ̂ is the m.l.e. of θ, then
logₑ L(θ̂) = Max logₑ L(θ). Hence, θ̂ can be obtained by solving ∂/∂θ logₑ L(θ) = 0, which will yield
θ = θ̂(x1, x2, · · · , xn) such that ∂²/∂θ² logₑ L(θ) < 0 at θ = θ̂.
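
As a quick worked illustration (not from the text), consider an exp(λ) population with pdf f(x, λ) = λ e^(−λx), x > 0. Then logₑ L(λ) = n logₑ λ − λ(x1 + x2 + · · · + xn), so ∂/∂λ logₑ L(λ) = n/λ − (x1 + x2 + · · · + xn) = 0 gives λ̂ = n/(x1 + x2 + · · · + xn) = 1/x̄, and ∂²/∂λ² logₑ L(λ) = −n/λ² < 0 confirms a maximum. Hence the m.l.e. of λ is 1/X̄.
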
Remark:
1. If θ is a vector (θ1, θ2, · · · , θs), then we need to consider the s equations
∂/∂θ1 logₑ L(θ) = ∂/∂θ2 logₑ L(θ) = · · · = ∂/∂θs logₑ L(θ) = 0 (called the likelihood equations).
2. In some cases, the derivative may not exist at θ = θ̂, and then the usual method of differ-
entiation for finding the m.l.e. fails. In such cases we have to adopt some other method, e.g., an
algebraic method using some inequality.
3. m.l.e.s are generally sufficient estimators or functions of sufficient estimators, if they exist.
4. m.l.e.s are consistent estimators.
5. m.l.e.s are not, in general, unbiased. But they can be made unbiased by some simple
modifications.
6. The variance of the m.l.e. θ̂ is given by var(θ̂) = 1 / ( −n E[ ∂²/∂θ² logₑ f(x, θ) ] )
(Cramér-Rao inequality).
7. If m.l.e.s are unbiased, then they can be seen to be most efficient.
8. m.l.e.s possess the invariance property in the sense that if θ̂ is the m.l.e. of θ, then g(θ̂) will
be the m.l.e. of g(θ).
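
A minimal numerical sketch (NumPy and SciPy assumed; the exp(λ) model and the true λ are illustrative) that maximizes the log-likelihood directly and checks the answer against the closed form λ̂ = 1/x̄ obtained in the worked illustration above:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
lam_true, n = 2.0, 500
x = rng.exponential(1 / lam_true, size=n)      # exp(lam) has mean 1/lam

def neg_log_lik(lam):
    # -log L(lam) for the density lam * exp(-lam * x)
    return -(n * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
print("numerical m.l.e. :", res.x)
print("closed form 1/x̄  :", 1 / x.mean())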

Standard Error of a Point Estimate: Suppose θ is a population parameter and Θ̂ is a point
estimator of θ. Then the standard error of Θ̂ is defined to be σ_Θ̂. Thus the standard error of the estimator
X̄ (from a sample of size n) of the population mean µ is σ_X̄ = σ/√n, which tends to 0 as n tends
to ∞.

Interval estimation:
So far we have discussed the criteria which a good point estimator should satisfy and
the method(s) of obtaining such point estimators. We may, however, note that any single point
estimator, however satisfactory it may be, is seldom likely to give us an idea about the error of
the estimate. It is true that estimation accuracy increases with large samples, but there is still
no reason to expect a point estimate (giving a single value only!) from a given sample
to be exactly equal to the population parameter it is supposed to estimate. It is therefore pro-
posed to find an interval in which the population parameter may be expected to lie (on the basis of
random samples) with a specified degree of confidence. This is the problem of estimation by interval.

Suppose we are given an f(x, θ) variate population where θ is a parameter. The problem of
interval estimation of θ consists of the following:
Let 0 ≤ α ≤ 1. Choosing a sample (X1, X2, · · · , Xn) of size n from the population, we need to
find two statistics Θ̂L and Θ̂U, where Θ̂L < Θ̂U for every observation, such that

P (θ ∈ [Θ̂L , Θ̂U ]) = 1 − α, i.e., P (Θ̂L ≤ θ ≤ Θ̂U ) = 1 − α.


Our aim will be to find Θ̂L and Θ̂U in such a way that the length of the interval [Θ̂L, Θ̂U] remains
as small as possible and 1 − α remains as large as possible.

On the basis of a particular observation (x1, x2, · · · , xn) of the sample, the computed interval [θ̂L, θ̂U]
is called a 100(1 − α)% confidence interval, the fraction 1 − α is called the confidence coefficient or
the degree of confidence, and the endpoints θ̂L and θ̂U are called the lower and upper confidence
limits, respectively. For a particular observation we say that one may be 100(1 − α)% confident that
the value of the parameter θ will lie within the observed interval [θ̂L, θ̂U].

Remarks:
1. The interval [Θ̂L , Θ̂U ] is a random interval.
2. The interpretation of the above equality is: whatever the true value of θ may be, the proba-
bility that θ will lie in the random interval is 1 − α. In other words, for repeated observations
(x1 , x2 , · · · , xn ) of the random sample (X1 , X2 , · · · , Xn ) if we compute the values of Θ̂L and Θ̂U ,
say, θ̂L and θ̂U , respectively, then in about 100(1−α)% of the cases, the interval [θ̂L , θ̂U ] will contain
θ and in 100α% cases it will fail to do so, i.e., if α is very small, then the probability that [θ̂L , θ̂U ]
will contain θ is very high.

Routine method for finding confidence interval:


Suppose we are given an f(x, θ) variate population where θ = (θ1, θ2, · · · , θs) is a parameter. We
will follow, in general, the following steps to obtain a confidence interval of one of the parameters,
say, θ1:
1. Choose α ∈ [0, 1].
2. On the basis of a random sample (X1, X2, · · · , Xn), we choose a point estimator Θ̂ =
Θ̂(X1, X2, · · · , Xn) of θ1 whose sampling distribution depends upon θ1 but is independent of the
other parameters θ2, θ3, · · · , θs.
3. If needed for simplicity, construct another function Ψ(Θ̂, θ1) of which we know the sampling
distribution and which is free of all the parameters; otherwise we may take Ψ(Θ̂, θ1) = Θ̂ itself.
4. Choose two constants λ1 and λ2 (λ1 < λ2) such that P(Ψ(Θ̂, θ1) ∈ [λ1, λ2]) = 1 − α. Note that
this is the same as choosing two constants λ1 and λ2 depending upon α1 and α2, respectively, where
α1 + α2 = α, such that P[Ψ(Θ̂, θ1) ≤ λ1] = α1 and P[Ψ(Θ̂, θ1) ≥ λ2] = α2, so that P(Ψ(Θ̂, θ1) ∈
[λ1, λ2]) = 1 − α.
5. We rewrite the inequality λ1 ≤ Ψ(Θ̂, θ1 ) ≤ λ2 as δ1 (Θ̂) ≤ θ1 ≤ δ2 (Θ̂) where δ1 (Θ̂) and δ2 (Θ̂)
are independent of θ1 .
6. Set Θ̂L = δ1 (Θ̂) and Θ̂U = δ2 (Θ̂), so that P (θ1 ∈ [Θ̂L , Θ̂U ]) = 1 − α.

Note: Observe that δ1 (Θ̂) and δ2 (Θ̂) are random variables and hence [δ1 (Θ̂), δ2 (Θ̂)] is a random
interval and that P (θ1 ∈ [δ1 (Θ̂), δ2 (Θ̂)]) = P (Ψ(Θ̂, θ1 ) ∈ [λ1 , λ2 ]) = 1 − α.

Example (1). Interval estimation of the mean µ of a population:

Case: Variance σ 2 is known and sample size is large.


Suppose (X1, X2, · · · , Xn) is a large sample from the population. By known methods we find
that X̄ is a very good estimator of the mean µ. Consider the function Ψ(X̄, µ) = (X̄ − µ)/(σ/√n) which,
obviously, is independent of any other parameter except µ, since σ is given to be known, and
of which we know the distribution: since n is large, using the Central Limit Theorem we get that
Ψ(X̄, µ) = (X̄ − µ)/(σ/√n) ∼ N(0, 1) (approximately), which is obviously independent of all parameters.
Since the normal pdf is symmetric, we can easily choose a symmetric 100(1 − α)% confidence interval
of Ψ(X̄, µ) from which we will find the desired confidence interval of µ. From the Normal table we choose
z_{α/2} such that

P(Ψ(X̄, µ) ∈ [−z_{α/2}, z_{α/2}]) = 1 − α, i.e., P(−z_{α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{α/2}) = 1 − α.

Figure 1: P(Z ∈ [−z_{α/2}, z_{α/2}]) = 1 − α

From this we get that P(X̄ − z_{α/2} σ/√n ≤ µ ≤ X̄ + z_{α/2} σ/√n) = 1 − α. Thus the 100(1 − α)% confidence interval
for the mean µ, for known variance and for a large sample, is [x̄ − z_{α/2} σ/√n, x̄ + z_{α/2} σ/√n], where x̄ is a
realization of X̄.
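
A minimal computational sketch (NumPy and SciPy assumed; the true mean, the known σ, n, and α are illustrative choices) of the large-sample z-interval just derived:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu_true, sigma, n, alpha = 12.0, 0.2, 50, 0.05   # sigma treated as known
x = rng.normal(mu_true, sigma, size=n)

z = norm.ppf(1 - alpha / 2)                      # z_{alpha/2}, about 1.96 for alpha = 0.05
half_width = z * sigma / np.sqrt(n)
print("95% CI for the mean:", (x.mean() - half_width, x.mean() + half_width))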

Case: Variance is unknown.

Figure 2: P(T ∈ [−t_{α/2}, t_{α/2}]) = 1 − α

Here we need to assume that the population approximately follows a Normal distribution with
mean µ. Since the variance is unknown, we choose Ψ(X̄, µ) = (X̄ − µ)/(S/√n) which, obviously, is independent of
any other parameter except µ, and of which we know the distribution: Ψ(X̄, µ) = (X̄ − µ)/(S/√n) ∼ t(n − 1),
which is obviously independent of all parameters, where S² is the sample variance. Using similar
arguments we can show that the 100(1 − α)% confidence interval of µ when the variance is unknown is
[x̄ − t_{α/2} s/√n, x̄ + t_{α/2} s/√n], where s is a realization of S and t_{α/2} is the upper 100(α/2)% point of
the t-distribution with n − 1 degrees of freedom.
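
A corresponding sketch (NumPy and SciPy assumed; the data and α are illustrative) for the t-interval when the variance is unknown:

import numpy as np
from scipy.stats import t

x = np.array([9.8, 10.2, 10.4, 9.9, 10.1, 10.6, 9.7, 10.3])   # a small sample, n = 8
alpha = 0.05
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

t_crit = t.ppf(1 - alpha / 2, df=n - 1)      # t_{alpha/2} with n - 1 degrees of freedom
half_width = t_crit * s / np.sqrt(n)
print("95% CI for the mean:", (xbar - half_width, xbar + half_width))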

Case: Variance is unknown and sample size is very large.


The arguments to find the confidence interval in this case will be the same as in the first case
except that instead of σ we will use S, where S² is the sample variance, and the 100(1 − α)%
confidence interval for the mean µ will be [x̄ − z_{α/2} s/√n, x̄ + z_{α/2} s/√n].
Maximum error in estimating µ by X̄: Suppose x̄ is an observation of X̄. While estimating
µ by X̄, for the specified observation the error is |µ − x̄|. Now, since for different observations of
the random sample the values of x̄ are different and randomly picked by X̄, the error is also
random, measured by |µ − X̄|. So, to measure the absolute error we can only talk of a maximum
possible error together with its probability of occurrence. Since any observation x̄ of X̄ and the value of µ both
lie in the confidence interval, roughly speaking, the maximum error will be at most half the length of the confidence
interval. We have the following results to measure this error:

Theorem (Variance is known and sample is large). If X̄ is used as an estimator of µ, then for
an observation from a sample of size n, the probability that the error will not exceed z_{α/2} σ/√n is 1 − α.

i.e., if x̄ is used as an estimate of µ, then one can be 100(1 − α)% confident that the error will
not exceed z_{α/2} σ/√n.

Theorem (Sample is small). If X̄ is used as an estimator of µ, then for an observation from a
sample of size n, the probability that the error will not exceed t_{α/2} s/√n is 1 − α.

Theorem (Variance is unknown and sample is large). If X̄ is used as an estimator of µ, then for
an observation from a sample of size n, the probability that the error will not exceed z_{α/2} s/√n is 1 − α.

Sample size needed for a specified error: The following results are straightforward from the
above theorems.

Corollary (Variance is known and sample is large). If X̄ is used as an estimator of µ, then for
an observation the error will not exceed a specified amount e with probability 1 − α if the sample
has size n = (z_{α/2} σ / e)².

i.e., if x̄ is used as an estimate of µ, then one can be 100(1 − α)% confident that the error will
not exceed a specified amount e if the sample size is n = (z_{α/2} σ / e)².

Corollary (Variance is unknown and sample is large). If X̄ is used as an estimator of µ, then
for an observation the error will not exceed a specified amount e with probability 1 − α if the sample
has size n = (z_{α/2} s / e)².
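
A minimal sketch (SciPy assumed; σ, e, and α are illustrative) of the sample-size computation, rounding up since n must be an integer:

import math
from scipy.stats import norm

sigma, e, alpha = 2.5, 0.5, 0.05
z = norm.ppf(1 - alpha / 2)
n = math.ceil((z * sigma / e) ** 2)     # n = (z_{alpha/2} * sigma / e)^2, rounded up
print("required sample size:", n)
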
One sided confidence intervals for µ: A similar construction as before shows that one can have the
following types of 100(1 − α)% confidence intervals for µ:
(I) When variance is known and sample is large:
1. (−∞, x̄ + z_α σ/√n];  2. [x̄ − z_α σ/√n, ∞).
(Sketch of the proof: start with
1. P(−z_α < (X̄ − µ)/(σ/√n) < ∞) = 1 − α;  2. P(−∞ < (X̄ − µ)/(σ/√n) < z_α) = 1 − α.)
(II) When sample is small:
1. (−∞, x̄ + t_α s/√n];  2. [x̄ − t_α s/√n, ∞).
(III) When variance is unknown and sample is large:
1. (−∞, x̄ + z_α s/√n];  2. [x̄ − z_α s/√n, ∞).
Remark: A similar construction as before shows that one can have the following types of 100(1 − 2α)/2 %
confidence intervals for µ:
(I) When variance is known and sample is large:
1. [x̄, x̄ + z_α σ/√n];  2. [x̄ − z_α σ/√n, x̄].
(II) When sample is small:
1. [x̄, x̄ + t_α s/√n];  2. [x̄ − t_α s/√n, x̄].
(III) When variance is unknown and sample is large:
1. [x̄, x̄ + z_α s/√n];  2. [x̄ − z_α s/√n, x̄].

Example (2). Interval estimation of the proportion p of a population:

Suppose we are given a binomial/Bernoulli population with proportion p and we have chosen
a random sample (X1, X2, · · · , Xn) from b(1, p). By some standard method of point estimation we
know that a good point estimator of the proportion p is given by the statistic P̂ = (X1 + X2 + · · · + Xn)/n = X̄
(which many authors write as X/n, where X counts the number of successes in n independent trials).
Denote by p̂ an observation of P̂. If the unknown proportion p is not expected to be too close to 0
or 1, we can establish a confidence interval for p by considering the sampling distribution of P̂. So
we suppose that p is neither close to 0 nor close to 1.
Since each Xi ∼ b(1, p), X1 + X2 + · · · + Xn ∼ b(n, p) and hence µ_P̂ = p and σ²_P̂ = p(1 − p)/n. Consider
the function Ψ(P̂, p) = (P̂ − p)/√(p(1 − p)/n) which, obviously, depends only upon p. Now, suppose n is
sufficiently large to apply the Central Limit Theorem to Ψ(P̂, p), to get Ψ(P̂, p) ∼ N(0, 1) (approxi-
mately), which is free of any parameter. Thus, as in the previous case, we can choose z_{α/2}, the upper
100(α/2)% point of the standard normal distribution, such that P(−z_{α/2} ≤ (P̂ − p)/√(p(1 − p)/n) ≤ z_{α/2}) =
1 − α, where α is given. When n is very large, for an observation (x1, x2, · · · , xn), p̂ = x̄ will be very
close to the original value of p, and hence for sufficiently large n we may introduce a very small error
(replacing p by p̂) to get, from the above equality, approximately a 100(1 − α)% confidence interval
of p in the form

[p̂ − z_{α/2} √(p̂(1 − p̂)/n), p̂ + z_{α/2} √(p̂(1 − p̂)/n)]  or  [p̂ − z_{α/2} √(p̂q̂/n), p̂ + z_{α/2} √(p̂q̂/n)],

where q̂ = 1 − p̂, i.e., P(P̂ − z_{α/2} √(P̂(1 − P̂)/n) ≤ p ≤ P̂ + z_{α/2} √(P̂(1 − P̂)/n)) ≈ 1 − α.
If we don’t use the approximation, then solving a quadratic equation we can get a 100(1 − α)%
confidence interval for p in the following form:

[ (p̂ + z²_{α/2}/(2n)) / (1 + z²_{α/2}/n) − (z_{α/2} / (1 + z²_{α/2}/n)) √(p̂q̂/n + z²_{α/2}/(4n²)) ,
  (p̂ + z²_{α/2}/(2n)) / (1 + z²_{α/2}/n) + (z_{α/2} / (1 + z²_{α/2}/n)) √(p̂q̂/n + z²_{α/2}/(4n²)) ].
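
A minimal sketch (SciPy assumed; the observed count, n, and α are illustrative) computing both the approximate interval and the interval obtained from the quadratic form above:

import math
from scipy.stats import norm

x_successes, n, alpha = 56, 200, 0.05
p_hat = x_successes / n
q_hat = 1 - p_hat
z = norm.ppf(1 - alpha / 2)

# Approximate interval: p_hat ± z_{alpha/2} * sqrt(p_hat * q_hat / n)
half = z * math.sqrt(p_hat * q_hat / n)
print("approximate 95% CI   :", (p_hat - half, p_hat + half))

# Interval from solving the quadratic (no replacement of p by p_hat in the variance)
center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
half = (z / (1 + z**2 / n)) * math.sqrt(p_hat * q_hat / n + z**2 / (4 * n**2))
print("quadratic-form 95% CI:", (center - half, center + half))
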
Maximum error in estimating p by P̂:

Theorem. If P̂ is used as an estimator of p, then for an observation from a sample of size n, the
probability that the error will not exceed z_{α/2} √(p̂q̂/n) is 1 − α.
Sample size needed for a specified error: The following are straightforward from the previous
theorem.

Corollary. If P̂ is used as an estimator of p, then for an observation the error will be less than a
specified amount e with probability 1 − α if the sample has size approximately n = z²_{α/2} p̂q̂ / e².

Corollary. If P̂ is used as an estimator of p, then for an observation the error will not exceed a
specified amount e with probability at least 1 − α if the sample has size n = z²_{α/2} / (4e²).
(The proof follows from the previous corollary putting p̂ = q̂ = 1/2.)
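
A minimal sketch (SciPy assumed; e, α, and the preliminary p̂ are illustrative) of the two sample-size formulas above:

import math
from scipy.stats import norm

e, alpha, p_hat = 0.02, 0.05, 0.28      # p_hat from a preliminary sample (illustrative)
z = norm.ppf(1 - alpha / 2)

n_with_phat = math.ceil(z**2 * p_hat * (1 - p_hat) / e**2)   # first corollary
n_worst_case = math.ceil(z**2 / (4 * e**2))                  # second corollary (p̂ = q̂ = 1/2)
print("n using preliminary p̂:", n_with_phat)
print("n in the worst case   :", n_worst_case)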

Example (3). Interval estimation of the variance σ² of a normal N(µ, σ²) population:

We choose a sample (X1, X2, · · · , Xn) of size n from the population. It is well known that S² is
a good point estimator of σ². We set Φ(S², σ²) = (n − 1)S²/σ², which, obviously, depends only upon
the parameter σ². By a theorem in sampling distributions we know that (n − 1)S²/σ² ∼ χ²(n − 1).

Figure 3: P(χ² ∈ [χ²_{1−α/2}, χ²_{α/2}]) = 1 − α

From the χ² distribution table we can find χ²_{α/2} and χ²_{1−α/2}, the upper and lower 100(α/2)% points of
the χ²(n − 1) distribution, such that P(Φ(S², σ²) ∈ [χ²_{1−α/2}, χ²_{α/2}]) = 1 − α, i.e.,
P(χ²_{1−α/2} ≤ (n − 1)S²/σ² ≤ χ²_{α/2}) = 1 − α. From this
we get (n − 1)S²/χ²_{α/2} ≤ σ² ≤ (n − 1)S²/χ²_{1−α/2}, so that the 100(1 − α)% confidence interval for σ² is
[(n − 1)s²/χ²_{α/2}, (n − 1)s²/χ²_{1−α/2}], where s² is a realization of S².
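
A minimal sketch (NumPy and SciPy assumed; the data and α are illustrative) of the variance interval just derived:

import numpy as np
from scipy.stats import chi2

x = np.array([46.4, 46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2, 46.0])
alpha = 0.05
n, s2 = len(x), x.var(ddof=1)

lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)   # divide by chi^2_{alpha/2}
upper = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)       # divide by chi^2_{1-alpha/2}
print("95% CI for sigma^2:", (lower, upper))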
