Remark 1.3. The p-quantiles are sometimes called lower p-quantiles and the 1 − p-quantiles are called upper p-quantiles. We will usually consider upper quantiles.
2 Quantiles of Data
Remark 2.4. There is no general agreement on what the value of the p-quantile of data should be when pn is an integer. In the Bhattacharyya–Johnson book the p-quantile of the data (x_1, ..., x_n) is defined as (x_(pn) + x_(pn+1))/2 if pn is an integer. In R we can obtain it by the command "quantile(data, p, type = 2)". There are 9 types of quantiles in R. For instance, the 0.25, 0.5 and 0.75 quantiles of the data (2, 4, 5, 7, 11, 13, 17, 19) are 4.5, 9 and 15, respectively, using type = 2.
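The averaging rule above is easy to reproduce outside R. Below is a minimal Python sketch (Python standing in for R here; the function name quantile_type2 is my own) that averages x_(pn) and x_(pn+1) when pn is an integer:

```python
import numpy as np

def quantile_type2(data, p):
    """R's type-2 quantile: average the adjacent order statistics
    x_(pn) and x_(pn+1) when pn is an integer."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    k = p * n
    if k == int(k) and 0 < int(k) < n:
        k = int(k)
        return 0.5 * (x[k - 1] + x[k])   # 0-based indexing: x_(k) is x[k-1]
    return x[int(np.ceil(k)) - 1]        # otherwise the next order statistic

data = [2, 4, 5, 7, 11, 13, 17, 19]
print([quantile_type2(data, p) for p in (0.25, 0.5, 0.75)])  # [4.5, 9.0, 15.0]
```

For the data of the example this returns 4.5, 9 and 15 at p = 0.25, 0.5 and 0.75, matching quantile(data, p, type = 2) in R.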
Definition 2.5. (Quartiles and the median) A 0.25-, 0.5- and 0.75-quantile is called a 1st, 2nd and 3rd quartile, respectively. By Proposition 2.3 they are not always unique, and they depend on the software we use. But for the value of the 2nd quartile of the data (x_1, ..., x_n) there is widespread agreement: it is (1/2)(x_(n/2) + x_(n/2+1)) if n is even. It is called the median of the data and is denoted by x̃_n = x̃.
Example 2.6. Suppose that our data are:
x = (2, 4, 5, 7, 11, 13, 17, 19). (2)
Then the software R summarizes the data with the command "summary(c(2,4,5,7,11,13,17,19))". In the command "summary" R uses the 7th type (out of its 9 types) for the pth quantile, that is, if (n − 1)p + 1 = i + f, with i its integer part and f its fractional part, then the output is (1 − f)x_(i) + f x_(i+1).
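The type-7 rule can likewise be sketched in Python (again a stand-in for R; quantile_type7 is a hypothetical name). It is also the default interpolation used by numpy.quantile, so the two can be compared directly:

```python
import numpy as np

def quantile_type7(data, p):
    """R's type-7 quantile: (n-1)p + 1 = i + f, output (1-f)*x_(i) + f*x_(i+1)."""
    x = np.sort(np.asarray(data, dtype=float))
    h = (len(x) - 1) * p                 # equals (i - 1) + f with 1-based i
    i = int(np.floor(h))
    f = h - i
    if i + 1 >= len(x):
        return x[-1]
    return (1 - f) * x[i] + f * x[i + 1]

data = [2, 4, 5, 7, 11, 13, 17, 19]
print([quantile_type7(data, p) for p in (0.25, 0.5, 0.75)])  # [4.75, 9.0, 14.0]
print(np.quantile(data, [0.25, 0.5, 0.75]))                  # numpy's default agrees
```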
The data can also be summarized in R visually by the so-called boxplot (Figure 2):
Figure 2: The boxplot for data (2).
[Figure: three quantile-quantile plots against N(0,1) quantiles, with qchisq(ppoints(170), df = 3), qt(ppoints(170), df = 3) and qnorm(ppoints(70), mean = 2, sd = 1) on the vertical axes.
(a) chi2(3) against N(0,1)   (b) t(3) against N(0,1)   (c) N(2,1) against N(0,1)]
We would like to find evidence whether our data come from a normal distribution. We can plot
the quantiles of the data against the quantiles of the distribution N(0, 1). The resulting
plot is called the normal quantile-quantile plot (or, the Normal Q-Q plot in R). If our
data are independent realizations of the normal distribution N(µ, σ 2 ) then it is easily
seen that the plotted points mentioned above will approximately lie on a straight line
y = σx + µ. It's just a visual check, so it is somewhat subjective. But it allows us to see
if our assumption on normality is plausible, and if not, how the assumption is violated.
Certainly we can plot the quantiles of our data against the quantiles of any distribution
that we suspect the data come from.
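A normal Q-Q plot amounts to plotting the sorted data against N(0,1) quantiles at ppoints-style positions. Here is a minimal Python sketch (numpy plus the standard library; the simulated N(2,1) sample is a hypothetical stand-in for real data), which also fits the line y = σx + µ to the points:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=200)    # hypothetical N(2,1) sample

n = len(data)
a = 0.5 if n > 10 else 3 / 8                       # R's ppoints(n) convention
probs = (np.arange(1, n + 1) - a) / (n + 1 - 2 * a)

theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])  # N(0,1) quantiles
sample = np.sort(data)                                            # data quantiles

# For N(mu, sigma^2) data the points lie near the line y = sigma*x + mu:
slope, intercept = np.polyfit(theoretical, sample, 1)
print(round(slope, 2), round(intercept, 2))        # close to sigma = 1, mu = 2
```

Plotting (theoretical, sample) with any plotting library gives the Q-Q plot itself; the fitted slope and intercept estimate σ and µ.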
[Figure: Normal Q-Q plot; Sample Quantiles plotted against Theoretical Quantiles, with qnorm(ppoints(70)) on the horizontal axis.]
Figure 5: Normal QQ-plot for data from chi-square distribution with 3 degrees of freedom
in R using the qqplot(qnorm(ppoints(70)), rchisq(ppoints(70),df=3)) command
4 Measures of Center
the empirical distribution, we say that X̄_n is the plug-in estimator of EX. What would be the plug-in estimator of Var X?
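As a numerical aside (a Python sketch, numpy assumed; the data are from Example 2.6): the expectation of the empirical distribution is exactly the sample mean, and its variance is the uncorrected sample variance that reappears in Section 6:

```python
import numpy as np

x = np.array([2, 4, 5, 7, 11, 13, 17, 19], dtype=float)
n = len(x)

# The empirical distribution puts mass 1/n on each observation,
# so its expectation is exactly the sample mean:
plug_in_mean = np.sum(x / n)

# Its variance, (1/n) * sum (x_i - mean)^2, is the uncorrected sample
# variance -- numpy's default var (ddof=0):
plug_in_var = np.sum((x - plug_in_mean) ** 2) / n

print(plug_in_mean, plug_in_var)   # 9.75 34.1875
```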
Corollary 4.10. The minimum of Σ_{i=1}^n |x_i − a| is achieved at the 0.5-quantile(s) of (x_1, ..., x_n).
Proposition 4.11. (Alternative proof for Thm 4.9.) If E|X| < ∞ then E|X − a| < ∞ for every a, and if m is a 0.5-quantile of X then m ≤ a_1 < a_2 implies E|X − a_1| ≤ E|X − a_2|, with strict inequality if a_2 is not a 0.5-quantile of X.
Proof. That E|X − a| < ∞ if E|X| < ∞ follows from the triangle inequality. It is easy to see that

|X − a_1| − |X − a_2|  = a_1 − a_2 if X ≤ a_1,   and   ≤ a_2 − a_1 if X > a_1,   (3)

therefore

E(|X − a_1| − |X − a_2|)
≤ (a_1 − a_2)P(X ≤ a_1) + (a_2 − a_1)P(X > a_1)   (due to (3))
= (a_2 − a_1)(P(X > a_1) − P(X ≤ a_1))
≤ 0   (since m ≤ a_1 implies P(X ≤ a_1) ≥ 1/2 ≥ P(X > a_1)).
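Corollary 4.10 can be checked numerically. In the sketch below (Python, numpy assumed; the data are from Example 2.6) n is even, so every point between x_(4) = 7 and x_(5) = 11, i.e. every 0.5-quantile, minimizes the sum of absolute deviations:

```python
import numpy as np

x = np.array([2, 4, 5, 7, 11, 13, 17, 19])

def total_abs_dev(a):
    # the objective of Corollary 4.10: sum over i of |x_i - a|
    return float(np.sum(np.abs(x - a)))

# the minimum value 42 is attained on the whole interval [7, 11]:
print(total_abs_dev(7), total_abs_dev(9), total_abs_dev(11))   # 42.0 42.0 42.0
# and strictly exceeded outside of it:
print(total_abs_dev(6.9), total_abs_dev(11.1))
```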
5 Unbiased Estimators
Remark 5.2. The sample mean is an unbiased estimator of the expected value, since E X̄_n = EX.
Sometimes we want to estimate a function g(θ) of the parameter θ, for instance, some quantile of the parent distribution. Then we say that the estimator θ̂ is unbiased for g(θ) if E_θ θ̂ = g(θ).
Unfortunately, unbiased estimators do not always exist. Consider, for example, the model {B(n, θ), 0 < θ < 1} with fixed n and take g(θ) = 1/θ. If θ̂ were an unbiased statistic for 1/θ, then

E_θ θ̂ = Σ_{i=0}^n θ̂(i) (n choose i) θ^i (1 − θ)^{n−i} = 1/θ

would have to hold for every θ, which is impossible: the left-hand side is a polynomial in θ, while 1/θ is not.
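The impossibility argument can be illustrated concretely (a Python sketch; the candidate estimator est is a hypothetical choice, its value at i = 0 being arbitrary): whatever values θ̂(0), ..., θ̂(n) we pick, E_θ θ̂ is a polynomial in θ and therefore cannot coincide with 1/θ:

```python
from math import comb

n = 10

def expected_value(est, theta):
    # E_theta[est(X)] for X ~ B(n, theta): a polynomial in theta of degree <= n
    return sum(est(i) * comb(n, i) * theta**i * (1 - theta)**(n - i)
               for i in range(n + 1))

# a hypothetical candidate for estimating 1/theta:
est = lambda i: n / i if i > 0 else 0.0

for theta in (0.1, 0.5, 0.9):
    print(theta, expected_value(est, theta), 1 / theta)  # never matches 1/theta

# by contrast, X/n is unbiased for theta itself:
print(round(expected_value(lambda i: i / n, 0.3), 12))   # 0.3
```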
Proof. E_θ(θ̂) = Σ_{k=0}^∞ (−2)^k · (θ^k / k!) · e^{−θ} = e^{−2θ} e^{−θ} = e^{−3θ}.
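The series computation in the proof can be verified numerically, e.g. in Python with only the standard library (θ = 0.7 is an arbitrary test value):

```python
from math import exp, factorial

theta = 0.7   # arbitrary test value

# E_theta[(-2)**X] for X ~ Poisson(theta): sum_k (-2*theta)^k / k! * e^(-theta)
series = sum((-2 * theta) ** k / factorial(k) * exp(-theta) for k in range(60))

print(series, exp(-3 * theta))   # the two agree: both equal e^(-3*theta)
```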
Remark 5.4. The situation is no better if we have a sample of size n in Example 5.3. (Why?)
6 Measures of Variation
Let (X_1, ..., X_n) be a sample. Then the variance of the empirical distribution function (which is called the sample variance or uncorrected sample variance) is easily seen to be S_n² = (1/n) Σ_{i=1}^n (X_i − X̄)², therefore we expect this (plug-in) estimator to be a good estimator of the variance σ² of the parent distribution. Well, it is good, but only almost unbiased:
Theorem 6.1.

E_σ S_n² = ((n − 1)/n) σ²,

hence S_n*² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)² (which is called the unbiased sample variance or corrected sample variance) is an unbiased estimator of σ².
Proof. If E_σ X_i = µ, i = 1, ..., n, then by Steiner (Proposition 4.3),

E_σ S_n² = (1/n) E( Σ_{i=1}^n (X_i − µ)² − n(X̄ − µ)² ) = (1/n)(nσ² − σ²) = ((n − 1)/n) σ².
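Theorem 6.1 can also be illustrated by simulation (a Python sketch, numpy assumed; n, σ² and the number of replications are arbitrary choices): averaging many realizations of S_n² lands near ((n − 1)/n)σ², while the corrected version lands near σ²:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 5, 4.0, 200_000          # hypothetical simulation settings
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

s2_uncorrected = samples.var(axis=1)          # divides by n
s2_corrected = samples.var(axis=1, ddof=1)    # divides by n - 1

print(s2_uncorrected.mean(), (n - 1) / n * sigma2)  # both close to 3.2
print(s2_corrected.mean(), sigma2)                  # both close to 4.0
```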
Definition 6.2. If (X_1, ..., X_n) is a sample then the estimator θ̂_n of θ is called asymptotically unbiased if lim_{n→∞} E_θ θ̂_n = θ.
Remark 6.3. The sample variance is an asymptotically unbiased estimator of the variance
by Theorem 6.1.