transformation
Lian Heng
Figure 1: Illustration of a histogram.
A much better estimator is the kernel density estimator (KDE).
The estimator takes its name from the so-called kernel function, de-
noted here by K, which is a probability density function that is
symmetric about 0. The standard normal density function is a com-
mon choice for K and will be used here. The kernel density estimator
based on $Y_1, \ldots, Y_n$ is
$$
\hat{f}(y) = \frac{1}{nb} \sum_{i=1}^{n} K\left(\frac{y - Y_i}{b}\right),
$$
where $b > 0$ is the bandwidth.
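As a minimal sketch of the estimator above, the following plain-Python code evaluates the KDE with the standard normal kernel at a few points. The sample values and the bandwidth b = 0.5 are made up for illustration and are not taken from the course data; the function names are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, the kernel K used in the text."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(y, data, b):
    """Kernel density estimate at y: (1 / (n b)) * sum_i K((y - Y_i) / b)."""
    n = len(data)
    return sum(gaussian_kernel((y - yi) / b) for yi in data) / (n * b)

# Hypothetical sample and bandwidth, chosen only for illustration.
sample = [1.2, 1.9, 2.1, 2.4, 3.0, 3.3, 4.8]
bandwidth = 0.5

for y in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(f"f_hat({y:.1f}) = {kde(y, sample, bandwidth):.4f}")
```

Because K is itself a density, the estimate is nonnegative and integrates to 1 for any bandwidth b > 0.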
Box-Cox transformation
The logarithm transformation is probably the most widely used
transformation in data analysis, followed by the square-root
transformation. They are special cases of the Box-Cox transformation
$$
y^{(\alpha)} =
\begin{cases}
\dfrac{y^{\alpha} - 1}{\alpha}, & \alpha \neq 0, \\
\log(y), & \alpha = 0.
\end{cases}
$$
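The piecewise definition can be sketched in a few lines of plain Python. The function name box_cox and the test value y = 7.5 are hypothetical; the loop is only meant to suggest that the α ≠ 0 branch approaches log(y) as α shrinks toward 0, which is why the log is taken as the α = 0 case.

```python
import math

def box_cox(y, alpha):
    """Box-Cox transform of a positive value y, following the piecewise definition."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if alpha == 0:
        return math.log(y)
    return (y ** alpha - 1.0) / alpha

y = 7.5  # hypothetical positive observation
for alpha in (1.0, 0.5, 0.1, 0.001, 0.0):
    print(f"alpha = {alpha:<5}  y^(alpha) = {box_cox(y, alpha):.6f}")
```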
Figure 2: Illustration of kernel density estimator.
The Box-Cox transformation is often used to transform right-skewed
data (using α < 1) or left-skewed data (using α > 1) to a roughly
symmetric distribution. For α < 1 the transformation is concave, so it
compresses large values more than small ones and pulls in a long right
tail. An explanation of why α < 1 (in particular the log
transformation) can deal with right-skewed data is the following
picture:
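The effect can also be checked numerically. The sketch below, in plain Python, computes the sample skewness of simulated lognormal data (an assumption made for illustration, not the CPSch3 data) before and after the square-root (α = 1/2) and log (α = 0) transformations. For data of this kind the skewness should shrink toward zero as α decreases from 1 to 0.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(0)  # fixed seed so the run is reproducible
# Simulated right-skewed data (lognormal); NOT the CPSch3 earnings data.
raw = [random.lognormvariate(0.0, 0.5) for _ in range(5000)]

skew_raw = skewness(raw)                             # alpha = 1 (no transformation)
skew_sqrt = skewness([math.sqrt(v) for v in raw])    # alpha = 1/2
skew_log = skewness([math.log(v) for v in raw])      # alpha = 0

print(f"skewness, raw data:    {skew_raw:.3f}")
print(f"skewness, square root: {skew_sqrt:.3f}")
print(f"skewness, log:         {skew_log:.3f}")
```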
Task
Using the CPSch3 data (average hourly earnings data from the Current
Population Survey) in the Ecdat package, we look at earnings for
males. QQ-plots, box plots, and KDEs all suggest that a square-root
transformation does a reasonably good job of transforming the data to
a normal distribution. Use the boxcox() function (in the MASS package)
to find the best value of α, and use it to transform the data (simply
use y^α instead of (y^α − 1)/α). After the transformation, plot the
KDE, the fitted normal density, and the fitted t density with df = 5.
Visually, does the normal or the t distribution provide a better fit
to the transformed data? Fill in the missing part of the code and
upload the plot showing the three densities as a .jpg file on Canvas
(do not submit the picture for the result of boxcox()). In the comment
box on Canvas, paste the two lines of code (one line calling boxcox()
and one line doing the transformation y=...) and state whether you
think the normal distribution or the t distribution is a better fit
to the transformed data.
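boxcox() computes the profile log-likelihood of the Box-Cox parameter, and the best α is the value that maximizes it. As a rough sketch of that idea, and not a reimplementation of the MASS function, the plain-Python code below evaluates the profile log-likelihood of an intercept-only normal model on a grid of α values. The data are simulated lognormal values (an assumption for illustration, not the CPSch3 earnings), for which the maximizer should fall near α = 0. The names box_cox and profile_loglik and the grid are hypothetical choices.

```python
import math
import random

def box_cox(y, alpha):
    """Box-Cox transform of a positive value y."""
    return math.log(y) if alpha == 0 else (y ** alpha - 1.0) / alpha

def profile_loglik(data, alpha):
    """Profile log-likelihood of alpha (up to an additive constant) for an
    intercept-only normal model fitted to the Box-Cox transformed data."""
    n = len(data)
    z = [box_cox(y, alpha) for y in data]
    mean = sum(z) / n
    sigma2 = sum((v - mean) ** 2 for v in z) / n      # MLE of the variance
    # Log-Jacobian of the transformation: (alpha - 1) * sum(log y_i).
    log_jacobian = (alpha - 1.0) * sum(math.log(y) for y in data)
    return -0.5 * n * math.log(sigma2) + log_jacobian

random.seed(1)  # fixed seed so the run is reproducible
# Simulated positive, right-skewed data; NOT the CPSch3 earnings data.
data = [random.lognormvariate(2.0, 0.5) for _ in range(2000)]

grid = [k / 20 for k in range(-20, 41)]               # alpha from -1 to 2, step 0.05
best_alpha = max(grid, key=lambda a: profile_loglik(data, a))
print(f"alpha maximizing the profile log-likelihood: {best_alpha:.2f}")
```

The Jacobian term is what makes likelihoods at different α comparable; without it, the criterion would simply favor whichever α shrinks the variance of the transformed values.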