
Week 2.

Quantiles of Data, Testing Normality Heuristically,
Unbiased Estimators, Measures of Center and Variation

1 Quantiles of a Random Variable

Definition 1.1 (p-quantile of a random variable). If X is a random variable and 0 < p < 1, then a p-quantile (also called a p-percentile) of X (or, of the distribution of X) is any number x such that P(X ≤ x) ≥ p and P(X ≥ x) ≥ 1 − p.
Remark 1.2. It is not hard to see that when p is in the range of the CDF F of X then the set of p-quantiles is the closure of the (possibly degenerate) interval on which F(x) = p; and it consists of the single point

    sup{x | F(x) < p}                                   (1)

when p is not in the range of F. In this case (due to the left continuity of F) we can also write max in (1) instead of sup.

Remark 1.3. The p-quantiles are sometimes called lower p-quantiles and the (1 − p)-quantiles are called upper p-quantiles. We will usually consider upper quantiles.

Figure 1: An α - upper quantile of the standard normal distribution

Exercise 1.4. Prove the statements in Remark 1.2.
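As a quick illustration of Definition 1.1 and Remark 1.3 in R (a minimal sketch; qnorm and qbinom return lower quantiles by default):

    qnorm(0.95)                         # 0.95-quantile of N(0,1), approx. 1.645; equivalently the upper 0.05-quantile
    qnorm(0.05, lower.tail = FALSE)     # the same value, requested directly as an upper quantile
    qbinom(0.3, size = 1, prob = 0.5)   # returns 0: p = 0.3 is not in the range of the Bernoulli(0.5) CDF,
                                        # and 0 is its unique 0.3-quantile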

2 Quantiles of Data

Definition 2.1 (p-quantile of data). If p ∈ (0, 1) then a p-quantile (or a p-percentile) of the data is a p-quantile of the empirical distribution function corresponding to the data.
Remark 2.2. By the definition of the p-quantile and the empirical distribution function, the p-quantile of the data (x_1, …, x_n) is any number x such that

    #{i : x_i ≤ x} ≥ pn   and   #{i : x_i ≥ x} ≥ (1 − p)n.

Proposition 2.3. The number x is a p-quantile of the data (x_1, …, x_n) if and only if

    x = x_([pn]+1)              if pn is not an integer,
    x_(pn) ≤ x ≤ x_(pn+1)       if pn is an integer,

where [pn] denotes the integer part of pn.

Remark 2.4. There is no general agreement on what the value of the p-quantile of data should be when pn is an integer. In the Bhattacharyya-Johnson book the p-quantile of the data (x_1, …, x_n) is defined as (x_(pn) + x_(pn+1))/2 if pn is an integer. In R we can obtain it by the command "quantile(data, p, type = 2)". There are 9 types of quantiles in R. For instance, the 0.25, 0.5 and 0.75 quantiles of the data (2, 4, 5, 7, 11, 13, 17, 19) are 4.5, 9 and 15, respectively, using type = 2.
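For the data of this remark the type 2 quantiles can be checked directly:

    x <- c(2, 4, 5, 7, 11, 13, 17, 19)
    quantile(x, probs = c(0.25, 0.5, 0.75), type = 2)
    ##  25%  50%  75%
    ##  4.5  9.0 15.0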

Definition 2.5 (Quartiles and the median). A 0.25-, 0.5- and a 0.75-quantile is called a 1st, 2nd and 3rd quartile, respectively. By Proposition 2.3 they are not always unique, and their computed values can depend on the software we use. But for the value of the 2nd quartile of the data (x_1, …, x_n) there is widespread agreement: it is (x_(n/2) + x_(n/2+1))/2 if n is even. It is called the median of the data and it is denoted by x̃_n = x̃.
Example 2.6. Suppose that our data are
    x = (2, 4, 5, 7, 11, 13, 17, 19).                                   (2)

Then R produces a six-number summary (minimum, 1st quartile, median, mean, 3rd quartile, maximum) for the command "summary(c(2,4,5,7,11,13,17,19))".

In the command "summary" R uses the 7th type (out of its 9 types) for the pth quantile, that is, if (n − 1)p + 1 = i + f is the decomposition into integer and fractional part, then the output is (1 − f)x_(i) + f x_(i+1).
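A sketch of the corresponding R session (the quartiles 4.75, 9 and 14 and the mean 9.75 follow from the type 7 rule above; the printed formatting may differ slightly):

    x <- c(2, 4, 5, 7, 11, 13, 17, 19)
    summary(x)
    ##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##   2.00    4.75    9.00    9.75   14.00   19.00
    quantile(x, probs = c(0.25, 0.5, 0.75), type = 7)   # the quartiles used by summary()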

The data can also be summarized in R visually by the so-called boxplot (Figure 2):

Figure 2: The boxplot for data (2).
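A plot in the spirit of Figure 2 can be produced with, for example:

    boxplot(c(2, 4, 5, 7, 11, 13, 17, 19))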

3 Testing Normality Heuristically


[Figure 3 consists of three Q-Q plots, each with qnorm(ppoints(70)) on the horizontal axis; the vertical axes are qchisq(ppoints(170), df = 3), qt(ppoints(170), df = 3) and qnorm(ppoints(70), mean = 2, sd = 1), respectively.]

(a) χ²(3) against N(0,1)   (b) t(3) against N(0,1)   (c) N(2,1) against N(0,1)

Figure 3: Normal Q-Q plots for various distributions

We would like to find evidence of whether our data come from a normal distribution. We can plot the quantiles of the data against the quantiles of the distribution N(0, 1). The resulting plot is called the normal quantile-quantile plot (or the Normal Q-Q plot in R). If our data are independent realizations of the normal distribution N(µ, σ²) then it is easily seen that the plotted points mentioned above will approximately lie on the straight line y = σx + µ. It is just a visual check, so it is somewhat subjective. But it allows us to see whether our assumption of normality is plausible, and if not, how the assumption is violated. Certainly we can plot the quantiles of our data against the quantiles of any distribution that we suspect the data come from.
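In R such a plot can be drawn, for instance, as follows (a minimal sketch; the vector scores is a placeholder for the actual data):

    set.seed(1)
    scores <- rnorm(70, mean = 40, sd = 12)          # placeholder data, here generated as normal
    qqnorm(scores)                                   # sample quantiles against N(0,1) quantiles
    qqline(scores)                                   # reference line through the 1st and 3rd quartiles
    ## the same plot in the explicit style used for Figure 5:
    qqplot(qnorm(ppoints(length(scores))), scores)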

[Normal Q-Q plot: Theoretical Quantiles on the horizontal axis, Sample Quantiles on the vertical axis.]

Figure 4: Normal QQ-plot of students' test scores



Figure 5: Normal QQ-plot for data from chi-square distribution with 3 degrees of freedom
in R using the qqplot(qnorm(ppoints(70)), rchisq(ppoints(70),df=3)) command

4 Measures of Center

Definition 4.1. Let (X_1, …, X_n) be a sample. Then the random variable

    X̄_n = X̄ = (1/n) ∑_{i=1}^n X_i

is called the sample mean.
Remark 4.2. X̄_n is the expected value of the empirical distribution function corresponding to the sample (X_1, …, X_n); that is, if (x_1, …, x_n) is a realization of (X_1, …, X_n), then x̄_n is the expected value of the EDF corresponding to the data (x_1, …, x_n).¹
Theorem 4.3 (Steiner). If Var X < ∞ then E(X − a)² = Var X + (EX − a)².
Proof. E(X − a)² = EX² − 2aEX + a² = Var X + (EX)² − 2aEX + a² = Var X + (EX − a)².
Corollary 4.4. If Var X < ∞ then the minimum of E(X − a)² is uniquely taken at a = EX.

If X takes each of the values x_1, …, x_n with probability 1/n then we have

Corollary 4.5.

    ∑_{i=1}^n (x_i − a)² = ∑_{i=1}^n (x_i − x̄)² + n(x̄ − a)².

Corollary 4.6. The minimum of ∑_{i=1}^n (x_i − a)² is uniquely taken at a = x̄.
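Corollary 4.5 is easy to check numerically, for instance with the data (2) and an arbitrary a:

    x <- c(2, 4, 5, 7, 11, 13, 17, 19)
    a <- 3                                                # any fixed number
    sum((x - a)^2)                                        # left-hand side of Corollary 4.5 (638 here)
    sum((x - mean(x))^2) + length(x) * (mean(x) - a)^2    # right-hand side, the same value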

Definition 4.7. If (X_1, …, X_n) is a random sample then the random variable

    X̃_n = X̃ = X_([n/2]+1)                     if n is odd,  and
    X̃_n = X̃ = (1/2)(X_(n/2) + X_(n/2+1))      if n is even,

is called the sample median.
Remark 4.8. The sample median is a 0.5-quantile of the corresponding empirical distribution function.
Theorem 4.9. If E|X| < ∞ then the minimum of E|X − a| is achieved at the 0.5-quantile(s) of X and nowhere else.
Proof. Suppose E|X| < ∞. Then E|X − a| < ∞ for every a (Why?). It is enough to prove the theorem when 0 is a 0.5-quantile of X (Why?). Suppose therefore that 0 is a 0.5-quantile of X. If a > 0 then E(|X − a| − |X|) ≥ aP(X ≤ 0) − aP(X > 0) ≥ 0. If a < 0, apply the above to −X.
If 0 is not a 0.5-quantile of X, suppose first that P(X ≤ 0) < 0.5. Then P(X ≤ ε) < 0.5 for some ε > 0 and E(|X − ε| − |X|) ≤ εP(X ≤ ε) − εP(X > ε) < 0. If P(X ≥ 0) < 0.5 then the proof is similar.
¹ Since in this case we estimate the expected value of the parent distribution by the expected value of the empirical distribution, we say that X̄_n is the plug-in estimator of EX. What would be the plug-in estimator of Var X?

Corollary 4.10. The minimum of ∑_{i=1}^n |x_i − a| is achieved at the 0.5-quantile(s) of the data (x_1, …, x_n).
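A quick numerical illustration of Corollary 4.10, again with the data (2):

    x <- c(2, 4, 5, 7, 11, 13, 17, 19)
    f <- function(a) sum(abs(x - a))      # total absolute deviation from a
    f(median(x)); f(8); f(10)             # all equal 42: every a in [x_(4), x_(5)] = [7, 11] is a minimiser
    f(6); f(12)                           # both equal 44, strictly larger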

Proposition 4.11 (Alternative proof for Thm 4.9). If E|X| < ∞ then E|X − a| < ∞ for every a, and if m is a 0.5-quantile of X then m ≤ a_1 < a_2 implies E|X − a_1| ≤ E|X − a_2|, with strict inequality if a_2 is not a 0.5-quantile of X.
Proof. That E|X − a| < ∞ if E|X| < ∞ follows from the triangle inequality. It is easy to see that

    |X − a_1| − |X − a_2| = a_1 − a_2 if X ≤ a_1,   and   |X − a_1| − |X − a_2| ≤ a_2 − a_1 if X > a_1,   (3)

therefore

    E(|X − a_1| − |X − a_2|)
      ≤ (a_1 − a_2)P(X ≤ a_1) + (a_2 − a_1)P(X > a_1)     (due to (3))
      = (a_2 − a_1)(P(X > a_1) − P(X ≤ a_1))
      ≤ 0                                                 (due to m ≤ a_1).

Equality throughout would force P(a_1 < X < a_2) = 0 and P(X ≤ a_1) = P(X > a_1) = 1/2, hence P(X ≥ a_2) = 1/2 and P(X ≤ a_2) ≥ 1/2, so a_2 would be a 0.5-quantile; this gives the strict inequality.

5 Unbiased Estimators

Definition 5.1. If θ̂ = θ̂(X_1, …, X_n) is an estimator of θ (hence it is a statistic by which we would like to estimate the unknown parameter θ from the sample) then we can define the quantity Bias(θ̂) = E_θ θ̂ − θ (with E_θ X denoting the expectation of the random variable X when the true parameter is θ). The estimator θ̂ is called unbiased when Bias(θ̂) = 0.

Remark 5.2. The sample mean is an unbiased estimator of the expected value, since E X̄_n = EX.

Sometimes we want to estimate a function g(θ) of the parameter θ, for instance, some quantile of the parent distribution. Then we say that the estimator θ̂ is unbiased for g(θ) if E_θ θ̂ = g(θ).
Unfortunately, unbiased estimators do not always exist. Consider, for example, the model {B(n, θ), 0 < θ < 1} with fixed n and take g(θ) = 1/θ. If θ̂ were an unbiased statistic for 1/θ, then

    E_θ θ̂ = ∑_{i=0}^n θ̂(i) (n choose i) θ^i (1 − θ)^(n−i) = 1/θ

would have to hold for every θ, which is impossible, since the left-hand side is a polynomial in θ while 1/θ is not.

Sometimes we can find only silly unbiased estimators:

Example 5.3. Let n = 1 and X_1 ∼ P(θ) (so we have a sample of size one) and let g(θ) = e^(−3θ). Then θ̂ = θ̂(X_1) = (−2)^(X_1) is the only unbiased estimator of g(θ).

Proof. E_θ(θ̂) = ∑_{k=0}^∞ (−2)^k · (θ^k/k!) e^(−θ) = e^(−2θ) e^(−θ) = e^(−3θ). The estimator is unique because unbiasedness of an estimator δ means ∑_{k=0}^∞ δ(k) θ^k/k! = e^(−2θ) = ∑_{k=0}^∞ (−2)^k θ^k/k! for every θ > 0, and the coefficients of a power series are uniquely determined, so δ(k) = (−2)^k.

Remark 5.4. The situation is no better if we have a sample of size n in Example 5.3. (Why?)
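A Monte Carlo check of the identity in the proof of Example 5.3 (a sketch; θ = 0.3 and the number of replications are arbitrary illustrative choices):

    set.seed(1)
    theta <- 0.3
    x <- rpois(1e6, lambda = theta)      # many independent copies of X_1
    mean((-2)^x)                         # Monte Carlo estimate of E_theta[(-2)^X_1]
    exp(-3 * theta)                      # g(theta) = e^(-3 theta), approx. 0.4066; the two numbers should be close

The estimator is "silly" in the sense that it takes only the values 1, −2, 4, −8, …, although g(θ) always lies in (0, 1).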

6 Measures of Variation

Let (X_1, …, X_n) be a sample. Then the variance of the empirical distribution function (which is called the sample variance or uncorrected sample variance) is easily seen to be S_n² = (1/n) ∑_{i=1}^n (X_i − X̄)², therefore we expect this (plug-in) estimator to be a good estimator of the variance σ² of the parent distribution. Well, it is good, but only almost unbiased:
Theorem 6.1.

    E_σ S_n² = ((n − 1)/n) σ²,

hence S_n*² = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄)² (which is called the unbiased sample variance or corrected sample variance) is an unbiased estimator of σ².
Proof. If E_σ X_i = µ, i = 1, …, n, then by Steiner's theorem (Theorem 4.3, used in the form of Corollary 4.5),

    E_σ S_n² = E( (1/n) ( ∑_{i=1}^n (X_i − µ)² − n(X̄ − µ)² ) ) = (1/n)(nσ² − σ²),

since E(X_i − µ)² = σ² and E(X̄ − µ)² = Var X̄ = σ²/n.
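Theorem 6.1 can also be illustrated by simulation (a sketch; the sample size, σ and the number of replications are arbitrary choices):

    set.seed(1)
    n <- 5; sigma <- 2                                                          # so sigma^2 = 4
    S2 <- replicate(1e5, { x <- rnorm(n, sd = sigma); mean((x - mean(x))^2) })  # uncorrected S_n^2
    mean(S2)                     # close to ((n - 1)/n) * sigma^2 = 3.2
    mean(S2) * n / (n - 1)       # close to sigma^2 = 4; the corrected version is what var() returns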

Definition 6.2. If (X_1, …, X_n) is a sample then the estimator θ̂_n of θ is called asymptotically unbiased if lim_{n→∞} E_θ θ̂_n = θ.
Remark 6.3. The sample variance is an asymptotically unbiased estimator of the variance
by Theorem 6.1.
