
Lab 4 Density estimation and Box-Cox transformation
Lian Heng

Histogram and Kernel density estimation
A histogram can be regarded as a crude density estimator (hist(x,20)).
Mathematically, the histogram can be defined as the function

\[ \hat{f}(y) = \sum_{i=1}^{n} I\{Y_i \in \mathrm{Bin}(y)\}, \]

where Bin(y) is the bin that contains y.

To use it as a density estimator, we should use the scaled version
(hist(x,20,freq=F))

\[ \hat{f}(y) = \frac{1}{nb} \sum_{i=1}^{n} I\{Y_i \in \mathrm{Bin}(y)\}, \]

where b is the bin width (it can be shown that, for the scaled version,
\( \int \hat{f}(y)\,dy = 1 \)).
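The claim that the scaled histogram integrates to 1 can be checked numerically. The following is a minimal sketch in R, not the lab code: the simulated sample is an assumption, and b is taken to be the common bin width returned by hist().

```r
# Simulated sample; the data are an assumption made only for illustration.
set.seed(1)
x <- rnorm(200)

# Ask for about 20 bins and compute the histogram without drawing it.
# hist() treats a single number for `breaks` as a suggestion, so the
# actual number of bins may differ from 20.
h <- hist(x, breaks = 20, plot = FALSE)

n <- length(x)
b <- diff(h$breaks)[1]        # common bin width

# Scaled histogram: counts / (n * b), the heights drawn when freq = FALSE.
f_hat <- h$counts / (n * b)

# Area under the scaled histogram = sum of (bar height * bar width).
area <- sum(f_hat * b)
print(area)

# The hand-computed heights agree with the densities stored by hist().
print(all.equal(f_hat, h$density))
```

Since every observation falls in exactly one bin, the counts sum to n and the area reduces to n / n = 1.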

Figure 1: Illustration of histogram.

A much better estimator is the kernel density estimator (KDE).
The estimator takes its name from the so-called kernel function,
denoted here by K, which is a probability density function that is
symmetric about 0. The standard normal density function is a common
choice for K and will be used here. The kernel density estimator
based on \( Y_1, \ldots, Y_n \) is

\[ \hat{f}(y) = \frac{1}{nb} \sum_{i=1}^{n} K\left( \frac{y - Y_i}{b} \right), \]

where b, which is called the bandwidth, determines the resolution of
the estimator.
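To make the formula concrete, here is a minimal sketch in R that evaluates the estimator directly with the standard normal kernel and compares it with the built-in density() function. It is not the lab code; the helper names kde and wiggle, the simulated data, and the bandwidth values are assumptions chosen for illustration.

```r
set.seed(1)
Y <- rnorm(100)               # simulated sample (assumption)

# KDE with the standard normal kernel K = dnorm:
# f_hat(y) = 1/(n*b) * sum_i K((y - Y_i) / b)
kde <- function(y, Y, b) {
  sapply(y, function(y0) sum(dnorm((y0 - Y) / b)) / (length(Y) * b))
}

b <- 0.4
grid <- seq(-3, 3, length.out = 61)
f_manual <- kde(grid, Y, b)

# density() with a Gaussian kernel uses bw as the kernel's standard
# deviation, which plays the role of b here. It evaluates the estimate
# on its own grid, so interpolate to compare at the same points.
d <- density(Y, bw = b, kernel = "gaussian", from = -4, to = 4, n = 512)
f_builtin <- approx(d$x, d$y, xout = grid)$y
print(max(abs(f_manual - f_builtin)))   # small approximation error

# Effect of b: a small bandwidth gives a wiggly estimate, a large one a
# smooth estimate. Total variation on the grid is one crude measure.
wiggle <- function(f) sum(abs(diff(f)))
print(c(small_b = wiggle(kde(grid, Y, 0.1)),
        large_b = wiggle(kde(grid, Y, 1))))
```

The two estimates are not expected to match exactly, because density() uses a binned approximation rather than the direct sum.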
My code illustrates the similarity of the histogram and the KDE, and
the effect of b. My code also shows how to compare the KDE
(nonparametric in nature) with some parametric density estimators (for
example, fitting the data using a normal or t density).
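One way such a comparison might look is sketched below in R. This is not the lab code: the simulated heavy-tailed sample, the variable names, and the choice to fit the t density by maximum likelihood with fitdistr() from the MASS package (with df fixed at 5) are all assumptions made for illustration.

```r
library(MASS)   # provides fitdistr()

set.seed(1)
x <- 10 + 2 * rt(500, df = 5)   # simulated heavy-tailed sample (assumption)

# Nonparametric estimate: KDE with the default bandwidth.
d <- density(x)

# Parametric fit 1: normal density with estimated mean and sd.
mu_hat <- mean(x)
sigma_hat <- sd(x)

# Parametric fit 2: location-scale t density with df fixed at 5;
# fitdistr() estimates the location m and the scale s.
fit_t <- suppressWarnings(fitdistr(x, "t", df = 5))
m_hat <- unname(fit_t$estimate["m"])
s_hat <- unname(fit_t$estimate["s"])

# Evaluate both fitted densities on the grid used by the KDE.
f_norm <- dnorm(d$x, mean = mu_hat, sd = sigma_hat)
f_t <- dt((d$x - m_hat) / s_hat, df = 5) / s_hat

# Overlay the three curves. The guard only skips plotting in batch
# mode; remove it to draw on the default device there as well.
if (interactive()) {
  plot(d, main = "KDE vs. fitted normal and t",
       ylim = range(d$y, f_norm, f_t))
  lines(d$x, f_norm, lty = 2)
  lines(d$x, f_t, lty = 3)
  legend("topright", legend = c("KDE", "normal", "t (df = 5)"), lty = 1:3)
}
```

The division by s_hat in f_t is needed because dt() is the density of the standardized variable; without it the fitted t curve would not integrate to 1.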

Box-Cox transformation
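The logarithm transformation is probably the most widely used
transformation in data analysis, followed by the square-root
transformation. They are special cases of the Box-Cox transformation
(α = 0 and, up to a linear rescaling, α = 1/2, respectively)

\[ y^{(\alpha)} = \begin{cases} \dfrac{y^{\alpha} - 1}{\alpha}, & \alpha \neq 0, \\ \log(y), & \alpha = 0. \end{cases} \]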
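The definition translates directly into a short R function. This is a minimal sketch, not the lab code; box_cox is a hypothetical helper name, and the check that y is positive is an added assumption (powers and logarithms of non-positive values are not meaningful here).

```r
# Box-Cox transformation of positive data y with parameter alpha.
# `box_cox` is a hypothetical helper name used only for illustration.
box_cox <- function(y, alpha) {
  stopifnot(all(y > 0))
  if (alpha == 0) log(y) else (y^alpha - 1) / alpha
}

y <- c(0.5, 1, 2, 10)

print(box_cox(y, 1))     # equals y - 1: a shift only
print(box_cox(y, 0.5))   # equals 2 * (sqrt(y) - 1): rescaled square root
print(box_cox(y, 0))     # equals log(y)

# The alpha = 0 case is the limit of the alpha != 0 case:
# for a tiny alpha the result is already very close to log(y).
print(max(abs(box_cox(y, 1e-6) - log(y))))
```

The last line illustrates why the logarithm is the natural definition at α = 0: (y^α − 1)/α tends to log(y) as α tends to 0.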

Figure 2: Illustration of kernel density estimator.

The Box-Cox transformation is often used to transform right-skewed
data (using α < 1) or left-skewed data (using α > 1) to a roughly
symmetric distribution. For α < 1 (in particular the log
transformation) the transformation is concave, so it compresses large
values more than small ones and pulls in the long right tail, as
shown in the following picture:

Figure 3: Illustration of logarithm transformation.

My code illustrates the effect of transformation.
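The effect can also be seen numerically. The sketch below, which is not the lab code, simulates right-skewed (lognormal) data, compares the sample skewness before and after transformation, and then lets boxcox() from the MASS package pick α from a grid. The simulated data, the helper name skew, and the grid of α values are assumptions made for illustration.

```r
library(MASS)   # provides boxcox()

set.seed(1)
y <- rlnorm(1000)   # right-skewed simulated data (assumption)

# Sample skewness: roughly 0 for a symmetric distribution.
# `skew` is a hypothetical helper name.
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

print(c(raw  = skew(y),
        sqrt = skew(sqrt(y)),
        log  = skew(log(y))))

# Profile log-likelihood of alpha for the intercept-only model y ~ 1,
# evaluated on a grid (boxcox() calls this parameter lambda).
bc <- boxcox(y ~ 1, lambda = seq(-1, 1, by = 0.01), plotit = FALSE)
alpha_hat <- bc$x[which.max(bc$y)]
print(alpha_hat)
```

Because the logarithm of lognormal data is exactly normal, the selected α should be close to 0 for this sample; for real data the selected value depends on the data.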

Task
Using the CPSch3 data (average hourly earnings data from the Current
Population Survey) in the Ecdat package, we look at earnings for
males. The QQ-plot, box plot, and KDE all suggest that a square-root
transformation is reasonably good at transforming the data to a
normal distribution. Use the boxcox() function (in the MASS
package) to find the best value of α, which is used to transform
the data (simply use y^α instead of (y^α − 1)/α). After
transformation, plot the KDE, the fitted normal density, and the
fitted t distribution with df = 5. Visually, does the normal or the t
provide a better fit to the transformed data? Fill in the missing part
of the code and upload the plot showing the three densities as a .jpg
file on Canvas (do not submit the picture for the result of
boxcox()). In the comment box on Canvas, paste the two-line
code (one line for using boxcox() and another line for doing
the transformation y=...) and state whether you think
the normal distribution or the t distribution is a better fit
to the transformed data.
