
Week 2.

Quantiles of Data, Testing Normality Heuristically,
Unbiased Estimators, Measures of Center and Variation

1 Quantiles of a Random Variable

Definition 1.1 (p-quantile of a random variable). If X is a random variable and 0 < p < 1, then a p-quantile (also called a p-percentile) of X (or, of the distribution of X) is any number x such that P(X ≤ x) ≥ p and P(X ≥ x) ≥ 1 − p.
Remark 1.2. It is not hard to see that when p is in the range of the CDF F of X then the set of p-quantiles is the closure of the (possibly degenerate) interval on which F(x) = p; and it consists of the single point

    sup{x | F(x) < p}                                   (1)

when p is not in the range of F. In this case (due to the left continuity of F) we can also write max in (1) instead of sup.

Remark 1.3. The p-quantiles are sometimes called lower p-quantiles and the (1 − p)-quantiles are called upper p-quantiles. We will usually consider upper quantiles.

Figure 1: An α - upper quantile of the standard normal distribution

Exercise 1.4. Prove the statements in Remark 1.2.
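As a quick illustration of Definition 1.1 and Remark 1.3 in R (a minimal sketch; qnorm and qbinom return lower quantiles by default):

    qnorm(0.95)                         # 0.95-quantile of N(0,1), approx. 1.645; equivalently the upper 0.05-quantile
    qnorm(0.05, lower.tail = FALSE)     # the same value, requested directly as an upper quantile
    qbinom(0.3, size = 1, prob = 0.5)   # returns 0: p = 0.3 is not in the range of the Bernoulli(0.5) CDF,
                                        # and 0 is its unique 0.3-quantile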

2 Quantiles of Data

Definition 2.1 (p-quantile of data). If p ∈ (0, 1) then a p-quantile (or a p-percentile) of the data is a p-quantile of the empirical distribution function corresponding to the data.
Remark 2.2. By the definition of the p-quantile and the empirical distribution function, the p-quantile of the data (x_1, …, x_n) is any number x such that

    #{i : x_i ≤ x} ≥ pn   and   #{i : x_i ≥ x} ≥ (1 − p)n.

Proposition 2.3. The number x is a p-quantile of the data (x_1, …, x_n) if and only if

    x = x_([pn]+1)              if pn is not an integer,
    x_(pn) ≤ x ≤ x_(pn+1)       if pn is an integer,

where [pn] denotes the integer part of pn.

Remark 2.4. There is no general agreement on what the value of the p-quantile of data should be when pn is an integer. In the Bhattacharyya-Johnson book the p-quantile of the data (x_1, …, x_n) is defined as (x_(pn) + x_(pn+1))/2 if pn is an integer. In R we can obtain it by the command "quantile(data, p, type = 2)". There are 9 types of quantiles in R. For instance, the 0.25, 0.5 and 0.75 quantiles of the data (2, 4, 5, 7, 11, 13, 17, 19) are 4.5, 9 and 15, respectively, using type = 2.
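For the data of this remark the type 2 quantiles can be checked directly:

    x <- c(2, 4, 5, 7, 11, 13, 17, 19)
    quantile(x, probs = c(0.25, 0.5, 0.75), type = 2)
    ##  25%  50%  75%
    ##  4.5  9.0 15.0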

Definition 2.5 (Quartiles and the median). A 0.25-, 0.5- and a 0.75-quantile is called a 1st, 2nd and 3rd quartile, respectively. By Proposition 2.3 they are not always unique, and their computed values can depend on the software we use. But for the value of the 2nd quartile of the data (x_1, …, x_n) there is widespread agreement: it is (x_(n/2) + x_(n/2+1))/2 if n is even. It is called the median of the data and it is denoted by x̃_n = x̃.
Example 2.6. Suppose that our data are
    x = (2, 4, 5, 7, 11, 13, 17, 19).                                   (2)

Then R produces a six-number summary (minimum, 1st quartile, median, mean, 3rd quartile, maximum) for the command "summary(c(2,4,5,7,11,13,17,19))".

In the command "summary" R uses the 7th type (out of its 9 types) for the pth quantile, that is, if (n − 1)p + 1 = i + f is the decomposition into integer and fractional part, then the output is (1 − f)x_(i) + f x_(i+1).
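A sketch of the corresponding R session (the quartiles 4.75, 9 and 14 and the mean 9.75 follow from the type 7 rule above; the printed formatting may differ slightly):

    x <- c(2, 4, 5, 7, 11, 13, 17, 19)
    summary(x)
    ##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##   2.00    4.75    9.00    9.75   14.00   19.00
    quantile(x, probs = c(0.25, 0.5, 0.75), type = 7)   # the quartiles used by summary()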

The data can also be summarized in R visually by the so-called boxplot (Figure 2):

Figure 2: The boxplot for data (2).
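A plot in the spirit of Figure 2 can be produced with, for example:

    boxplot(c(2, 4, 5, 7, 11, 13, 17, 19))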

3 Testing Normality Heuristically


[Figure 3 consists of three Q-Q plots, each with qnorm(ppoints(70)) on the horizontal axis; the vertical axes are qchisq(ppoints(170), df = 3), qt(ppoints(170), df = 3) and qnorm(ppoints(70), mean = 2, sd = 1), respectively.]

(a) χ²(3) against N(0,1)   (b) t(3) against N(0,1)   (c) N(2,1) against N(0,1)

Figure 3: Normal Q-Q plots for various distributions

We would like to find evidence of whether our data come from a normal distribution. We can plot the quantiles of the data against the quantiles of the distribution N(0, 1). The resulting plot is called the normal quantile-quantile plot (or the Normal Q-Q plot in R). If our data are independent realizations of the normal distribution N(µ, σ²) then it is easily seen that the plotted points mentioned above will approximately lie on the straight line y = σx + µ. It is just a visual check, so it is somewhat subjective. But it allows us to see whether our assumption of normality is plausible, and if not, how the assumption is violated. Certainly we can plot the quantiles of our data against the quantiles of any distribution that we suspect the data come from.
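In R such a plot can be drawn, for instance, as follows (a minimal sketch; the vector scores is a placeholder for the actual data):

    set.seed(1)
    scores <- rnorm(70, mean = 40, sd = 12)          # placeholder data, here generated as normal
    qqnorm(scores)                                   # sample quantiles against N(0,1) quantiles
    qqline(scores)                                   # reference line through the 1st and 3rd quartiles
    ## the same plot in the explicit style used for Figure 5:
    qqplot(qnorm(ppoints(length(scores))), scores)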

[Normal Q-Q plot: Theoretical Quantiles on the horizontal axis, Sample Quantiles on the vertical axis.]

Figure 4: Normal QQ-plot of students' test scores



Figure 5: Normal QQ-plot for data from chi-square distribution with 3 degrees of freedom
in R using the qqplot(qnorm(ppoints(70)), rchisq(ppoints(70),df=3)) command

4 Measures of Center

Definition 4.1. Let (X_1, …, X_n) be a sample. Then the random variable

    X̄_n = X̄ = (1/n) ∑_{i=1}^n X_i

is called the sample mean.
Remark 4.2. X̄_n is the expected value of the empirical distribution function corresponding to the sample (X_1, …, X_n); that is, if (x_1, …, x_n) is a realization of (X_1, …, X_n), then x̄_n is the expected value of the EDF corresponding to the data (x_1, …, x_n).¹
Theorem 4.3 (Steiner). If Var X < ∞ then E(X − a)² = Var X + (EX − a)².
Proof. E(X − a)² = EX² − 2aEX + a² = Var X + (EX)² − 2aEX + a² = Var X + (EX − a)².
Corollary 4.4. If Var X < ∞ then the minimum of E(X − a)² is uniquely taken at a = EX.

If X takes each of the values x_1, …, x_n with probability 1/n then we have

Corollary 4.5.

    ∑_{i=1}^n (x_i − a)² = ∑_{i=1}^n (x_i − x̄)² + n(x̄ − a)².

Corollary 4.6. The minimum of ∑_{i=1}^n (x_i − a)² is uniquely taken at a = x̄.
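Corollary 4.5 is easy to check numerically, for instance with the data (2) and an arbitrary a:

    x <- c(2, 4, 5, 7, 11, 13, 17, 19)
    a <- 3                                                # any fixed number
    sum((x - a)^2)                                        # left-hand side of Corollary 4.5 (638 here)
    sum((x - mean(x))^2) + length(x) * (mean(x) - a)^2    # right-hand side, the same value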

Definition 4.7. If (X_1, …, X_n) is a random sample then the random variable

    X̃_n = X̃ = X_([n/2]+1)                     if n is odd,  and
    X̃_n = X̃ = (1/2)(X_(n/2) + X_(n/2+1))      if n is even,

is called the sample median.
Remark 4.8. The sample median is a 0.5-quantile of the corresponding empirical distribution function.
Theorem 4.9. If E|X| < ∞ then the minimum of E|X − a| is achieved at the 0.5-quantile(s) of X and nowhere else.
Proof. Suppose E|X| < ∞. Then E|X − a| < ∞ for every a (Why?). It is enough to prove the theorem when 0 is a 0.5-quantile of X (Why?). Suppose therefore that 0 is a 0.5-quantile of X. If a > 0 then E(|X − a| − |X|) ≥ aP(X ≤ 0) − aP(X > 0) ≥ 0. If a < 0, apply the above to −X.
If 0 is not a 0.5-quantile of X, suppose first that P(X ≤ 0) < 0.5. Then P(X ≤ ε) < 0.5 for some ε > 0 and E(|X − ε| − |X|) ≤ εP(X ≤ ε) − εP(X > ε) < 0. If P(X ≥ 0) < 0.5 then the proof is similar.
¹ Since in this case we estimate the expected value of the parent distribution by the expected value of the empirical distribution, we say that X̄_n is the plug-in estimator of EX. What would be the plug-in estimator of Var X?

Corollary 4.10. The minimum of ∑_{i=1}^n |x_i − a| is achieved at the 0.5-quantile(s) of the data (x_1, …, x_n).
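A quick numerical illustration of Corollary 4.10, again with the data (2):

    x <- c(2, 4, 5, 7, 11, 13, 17, 19)
    f <- function(a) sum(abs(x - a))      # total absolute deviation from a
    f(median(x)); f(8); f(10)             # all equal 42: every a in [x_(4), x_(5)] = [7, 11] is a minimiser
    f(6); f(12)                           # both equal 44, strictly larger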

Proposition 4.11 (Alternative proof for Thm 4.9). If E|X| < ∞ then E|X − a| < ∞ for every a, and if m is a 0.5-quantile of X then m ≤ a_1 < a_2 implies E|X − a_1| ≤ E|X − a_2|, with strict inequality if a_2 is not a 0.5-quantile of X.
Proof. That E|X − a| < ∞ if E|X| < ∞ follows from the triangle inequality. It is easy to see that

    |X − a_1| − |X − a_2| = a_1 − a_2 if X ≤ a_1,   and   |X − a_1| − |X − a_2| ≤ a_2 − a_1 if X > a_1,   (3)

therefore

    E(|X − a_1| − |X − a_2|)
      ≤ (a_1 − a_2)P(X ≤ a_1) + (a_2 − a_1)P(X > a_1)     (due to (3))
      = (a_2 − a_1)(P(X > a_1) − P(X ≤ a_1))
      ≤ 0                                                 (due to m ≤ a_1).

Equality throughout would force P(a_1 < X < a_2) = 0 and P(X ≤ a_1) = P(X > a_1) = 1/2, hence P(X ≥ a_2) = 1/2 and P(X ≤ a_2) ≥ 1/2, so a_2 would be a 0.5-quantile; this gives the strict inequality.

5 Unbiased Estimators

Definition 5.1. If θ̂ = θ̂(X_1, …, X_n) is an estimator of θ (hence it is a statistic by which we would like to estimate the unknown parameter θ from the sample) then we can define the quantity Bias(θ̂) = E_θ θ̂ − θ (with E_θ X denoting the expectation of the random variable X when the true parameter is θ). The estimator θ̂ is called unbiased when Bias(θ̂) = 0.

Remark 5.2. The sample mean is an unbiased estimator of the expected value, since E X̄_n = EX.

Sometimes we want to estimate a function g(θ) of the parameter θ, for instance, some quantile of the parent distribution. Then we say that the estimator θ̂ is unbiased for g(θ) if E_θ θ̂ = g(θ).
Unfortunately, unbiased estimators do not always exist. Consider, for example, the model {B(n, θ), 0 < θ < 1} with fixed n and take g(θ) = 1/θ. If θ̂ were an unbiased statistic for 1/θ, then

    E_θ θ̂ = ∑_{i=0}^n θ̂(i) (n choose i) θ^i (1 − θ)^(n−i) = 1/θ

would have to hold for every θ, which is impossible, since the left-hand side is a polynomial in θ while 1/θ is not.

Sometimes we can find only silly unbiased estimators:

Example 5.3. Let n = 1 and X_1 ∼ P(θ) (so we have a sample of size one) and let g(θ) = e^(−3θ). Then θ̂ = θ̂(X_1) = (−2)^(X_1) is the only unbiased estimator of g(θ).

Proof. E_θ(θ̂) = ∑_{k=0}^∞ (−2)^k · (θ^k/k!) e^(−θ) = e^(−2θ) e^(−θ) = e^(−3θ). The estimator is unique because unbiasedness of an estimator δ means ∑_{k=0}^∞ δ(k) θ^k/k! = e^(−2θ) = ∑_{k=0}^∞ (−2)^k θ^k/k! for every θ > 0, and the coefficients of a power series are uniquely determined, so δ(k) = (−2)^k.

Remark 5.4. The situation is no better if we have a sample of size n in Example 5.3. (Why?)
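A Monte Carlo check of the identity in the proof of Example 5.3 (a sketch; θ = 0.3 and the number of replications are arbitrary illustrative choices):

    set.seed(1)
    theta <- 0.3
    x <- rpois(1e6, lambda = theta)      # many independent copies of X_1
    mean((-2)^x)                         # Monte Carlo estimate of E_theta[(-2)^X_1]
    exp(-3 * theta)                      # g(theta) = e^(-3 theta), approx. 0.4066; the two numbers should be close

The estimator is "silly" in the sense that it takes only the values 1, −2, 4, −8, …, although g(θ) always lies in (0, 1).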

6 Measures of Variation

Let (X_1, …, X_n) be a sample. Then the variance of the empirical distribution function (which is called the sample variance or uncorrected sample variance) is easily seen to be S_n² = (1/n) ∑_{i=1}^n (X_i − X̄)², therefore we expect this (plug-in) estimator to be a good estimator of the variance σ² of the parent distribution. Well, it is good, but only almost unbiased:
Theorem 6.1.

    E_σ S_n² = ((n − 1)/n) σ²,

hence S_n*² = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄)² (which is called the unbiased sample variance or corrected sample variance) is an unbiased estimator of σ².
Proof. If E_σ X_i = µ, i = 1, …, n, then by Steiner's theorem (Theorem 4.3, used in the form of Corollary 4.5),

    E_σ S_n² = E( (1/n) ( ∑_{i=1}^n (X_i − µ)² − n(X̄ − µ)² ) ) = (1/n)(nσ² − σ²),

since E(X_i − µ)² = σ² and E(X̄ − µ)² = Var X̄ = σ²/n.
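Theorem 6.1 can also be illustrated by simulation (a sketch; the sample size, σ and the number of replications are arbitrary choices):

    set.seed(1)
    n <- 5; sigma <- 2                                                          # so sigma^2 = 4
    S2 <- replicate(1e5, { x <- rnorm(n, sd = sigma); mean((x - mean(x))^2) })  # uncorrected S_n^2
    mean(S2)                     # close to ((n - 1)/n) * sigma^2 = 3.2
    mean(S2) * n / (n - 1)       # close to sigma^2 = 4; the corrected version is what var() returns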

Definition 6.2. If (X_1, …, X_n) is a sample then the estimator θ̂_n of θ is called asymptotically unbiased if lim_{n→∞} E_θ θ̂_n = θ.
Remark 6.3. The sample variance is an asymptotically unbiased estimator of the variance
by Theorem 6.1.
