The graphical methods help in detecting violations of the basic assumptions in regression analysis. Now we consider methods and procedures for building models through data transformation when some of the assumptions are violated.
For example, if the study variable (y) in a simple linear regression model is a Poisson random variable, then its variance is the same as its mean. Since the mean of y is related to the explanatory variable x, the variance of y will be proportional to x. In such cases, variance-stabilizing transformations are useful.
As another example, if y is a proportion, i.e., 0 ≤ yi ≤ 1, then the variance of y is proportional to E(y)[1 − E(y)].
Some commonly used variance-stabilizing transformations, in increasing order of their strength, are as follows:

Relation of σ² to E(y)        Transformation
σ² ∝ E(y)                     y* = √y           (Poisson data)
σ² ∝ E(y)[1 − E(y)]           y* = sin⁻¹(√y)    (binomial proportions, 0 ≤ yi ≤ 1)
σ² ∝ [E(y)]²                  y* = ln(y)
σ² ∝ [E(y)]³                  y* = 1/√y
σ² ∝ [E(y)]⁴                  y* = 1/y
After making the suitable transformation, use y* as the study variable in the respective case.
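As an illustration, the transformations in the table can be coded directly. The sketch below is my own (the helper name `stabilize` is not from the notes):

```python
import math

# Variance-stabilizing transformations, ordered from mild to strong.
# Each maps an observation y to the transformed value y*.
TRANSFORMS = {
    "sqrt":        lambda y: math.sqrt(y),             # sigma^2 prop. to E(y), Poisson counts
    "arcsin_sqrt": lambda y: math.asin(math.sqrt(y)),  # sigma^2 prop. to E(y)[1-E(y)], proportions
    "log":         lambda y: math.log(y),              # sigma^2 prop. to E(y)^2
    "inv_sqrt":    lambda y: 1.0 / math.sqrt(y),       # sigma^2 prop. to E(y)^3
    "reciprocal":  lambda y: 1.0 / y,                  # sigma^2 prop. to E(y)^4
}

def stabilize(y, kind):
    """Apply the chosen variance-stabilizing transformation to a single value y."""
    return TRANSFORMS[kind](y)

print(stabilize(4.0, "sqrt"))          # 2.0
print(stabilize(0.25, "arcsin_sqrt"))  # arcsin(0.5), about 0.5236
```

The transformed values y* would then replace y as the study variable before fitting the regression.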
Regression Analysis | Chapter 5 | Transf. Weight. Correct Model Inadequacies | Shalabh, IIT Kanpur
The strength of a transformation depends on the amount of curvature in the relationship between the study and explanatory variables. The transformations mentioned here range from relatively mild to relatively strong: the square root transformation is relatively mild, and the reciprocal transformation is relatively strong.
In general, a mild transformation is applied when the minimum and maximum values do not differ much (e.g., ymax/ymin < 2 or 3); such a transformation has little effect on the curvature. On the other hand, when the minimum and maximum differ greatly, a strong transformation is needed, and it will have a strong effect on the analysis.
In the presence of non-constant variance, the OLSE remains unbiased but loses the minimum variance property.
When the study variable has been transformed as y*, the predicted values are on the transformed scale. It is often necessary to convert the predicted values back to the original units (y).
However, when the inverse transformation is applied directly to the predicted values, it gives an estimate of the median of the distribution of the study variable instead of the mean. So one needs to be careful while doing so.
Confidence intervals and prediction intervals may be converted directly from one metric to another, because interval estimates are percentiles of a distribution and percentiles are unaffected by monotone transformations. One may note, however, that the resulting intervals may not remain the shortest possible intervals.
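The median-versus-mean point can be checked numerically. In this sketch (my own example, with made-up parameters), ln y is normal with mean mu and standard deviation s, so back-transforming exp(mu) recovers the median of y, while the mean of y carries the extra factor exp(s²/2):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s = 1.0, 0.5                           # parameters of ln(y)
y = np.exp(rng.normal(mu, s, 200_000))     # y is lognormal

# exp(mu) estimates the MEDIAN of y, not its mean.
print(np.median(y), np.exp(mu))            # both close to 2.718
# The mean of y is larger: exp(mu + s^2 / 2).
print(np.mean(y), np.exp(mu + s**2 / 2))   # both close to 3.080
```

This is why naively back-transforming a fitted value from a log-scale model systematically underestimates the conditional mean of y.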
In some cases, a nonlinear model can be linearized by using a suitable transformation. Such nonlinear models are called intrinsically or transformably linear. The advantage of transforming a nonlinear function into a linear function is that the statistical tools developed for the linear regression model become available.
For example, exact hypothesis tests, confidence interval estimation, etc., are developed for the linear regression model. Once the nonlinear function is transformed to a linear function, all such tools can be readily applied and there is no need to develop them separately.
1. If the relationship between y and x is of the power form y = β0 x^β1, then using the transformations y* = log y, x* = log x, i.e., by taking logarithms on both sides, the model becomes
log y = log β0 + β1 log x
or y* = β0* + β1 x*
where β0* = log β0, and the model becomes a linear model. Note that the parameter β0 changes to β0* = log β0 in the transformed model.
2. If the curve between y and x indicates exponential growth or decay, then the possible linearizable function is of the form y = β0 exp(β1 x). Taking logarithms on both sides gives
ln y = ln β0 + β1 x
or y* = β0* + β1 x
where y* = ln y and β0* = ln β0.
3. If the curve suggests a logarithmic relationship, then the possible linearizable function is of the form
y = β0 + β1 log x
which can be written as
y = β0 + β1 x*
using the transformation x* = log x.
4. The function y = x/(β0 x − β1) can be written as y* = β0 + β1 x*, which becomes a linear model by using the transformations y* = 1/y and x* = −1/x.
• Based on the observed behaviour of the plots, one can choose an appropriate curve and use the linearized form of the function.
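As a concrete check of the power-function case, noise-free data generated from y = β0 x^β1 are recovered exactly by least squares on the log-log scale (a sketch with made-up values β0 = 2, β1 = 1.5):

```python
import numpy as np

# Noise-free data from the power model y = b0 * x**b1 with b0 = 2, b1 = 1.5.
x = np.linspace(1.0, 10.0, 50)
y = 2.0 * x**1.5

# Linearize: ln(y) = ln(b0) + b1 * ln(x), then fit a straight line by OLS.
b1_hat, logb0_hat = np.polyfit(np.log(x), np.log(y), 1)
print(b1_hat, np.exp(logb0_hat))  # recovers 1.5 and 2.0
```

With noisy data the same fit gives least squares estimates on the log scale, and the remarks above about the error structure then apply.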
• When such transformations are used, the form of ε often gets changed as well. For example, in the case of
y = β0 exp(β1 x) ε,
ln y = ln β0 + β1 x + ln ε
or y* = β0* + β1 x + ε*.
This implies that if the transformed errors ε* are to be normally distributed, the multiplicative error ε in the original model must be lognormally distributed. Many times we ignore this aspect and continue to assume that the random errors are still normally distributed. In such cases, the residuals from the transformed model should be checked for the validity of the assumptions.
• When such transformations are used, the OLSE has the desired properties with respect to the transformed data, not the original data.
Box-Cox method
A useful family of transformations on the study variable is the power transformation y^λ, where λ is a parameter to be determined from the data. For example, if λ = 0.5, the transformation is the square root, and √y is used as the study variable in place of y.
Now the linear regression model has parameters β, σ² and λ. The Box-Cox method tells how to estimate λ and the parameters of the model simultaneously using the method of maximum likelihood.
Note that as λ approaches zero, y^λ approaches 1. So there is a problem at λ = 0, because this would make all the observations on y equal to unity, and it is meaningless for all the observations on the study variable to be constant. So there is a discontinuity at λ = 0. One approach to resolve this difficulty is to use (y^λ − 1)/λ as the study variable. Note that as λ → 0, (y^λ − 1)/λ → ln y. So a possible solution is to use the transformed study variable
W = (y^λ − 1)/λ   for λ ≠ 0
  = ln y          for λ = 0.
So the family W is continuous in λ. Still it has a drawback: as λ changes, the values of W change dramatically, so it is difficult to obtain the best value of λ. If different analysts obtain different values of λ, they will fit different models, and it may then not be appropriate to compare models with different values of λ. So it is preferable to use the alternative form
V = y^(λ) = (y^λ − 1)/(λ y*^(λ−1))   for λ ≠ 0
          = y* ln y                  for λ = 0
where y* is the geometric mean of the yi's, y* = (y1 y2 … yn)^(1/n), which is a constant.
When V is applied to each yi, we get V = (V1, V2, …, Vn)′ as a vector of observations on the transformed study variable.
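The scaled family V can be coded directly; the sketch below (the function name is my own) verifies that the λ ≠ 0 branch approaches the λ = 0 branch, y* ln y, as λ → 0, so the family is continuous:

```python
import numpy as np

def box_cox_v(y, lam):
    """Scaled Box-Cox family: V = (y**lam - 1) / (lam * gm**(lam - 1)) for lam != 0,
    and V = gm * ln(y) at lam = 0, where gm is the geometric mean of y."""
    y = np.asarray(y, dtype=float)
    gm = np.exp(np.mean(np.log(y)))   # geometric mean (y1 * ... * yn)**(1/n)
    if lam == 0.0:
        return gm * np.log(y)
    return (y**lam - 1.0) / (lam * gm**(lam - 1.0))

y = np.array([1.0, 2.0, 5.0, 10.0])
# Continuity in lambda: a tiny lambda gives the same values as the lam = 0 branch.
print(np.allclose(box_cox_v(y, 1e-9), box_cox_v(y, 0.0)))  # True
```

The division by the constant gm**(lam − 1) is exactly the rescaling discussed next, which makes residual sums of squares comparable across values of λ.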
The quantity λ y*^(λ−1) in the denominator is related to the Jacobian of the transformation: the nth power of y*^(λ−1) equals J(W → y). To see how, let
Wi = yi^(λ) = (yi^λ − 1)/λ,   λ ≠ 0,
and let y = (y1, y2, …, yn)′, W = (W1, W2, …, Wn)′.
Note that if W1 = (y1^λ − 1)/λ, then
∂W1/∂y1 = λ y1^(λ−1)/λ = y1^(λ−1)
∂W1/∂y2 = 0.
In general,
∂Wi/∂yj = yi^(λ−1)   if i = j
        = 0           if i ≠ j.
For a single observation, the Jacobian of the transformation is
J(yi → Wi) = ∂yi/∂Wi = 1/(∂Wi/∂yi) = 1/yi^(λ−1).
Collecting the partial derivatives, J(W → y) = det(∂W/∂y) is the determinant of a diagonal matrix with diagonal entries y1^(λ−1), y2^(λ−1), …, yn^(λ−1), so
J(W → y) = ∏_{i=1}^n yi^(λ−1) = (∏_{i=1}^n yi)^(λ−1)
and
J(y → W) = 1/J(W → y) = (1/∏_{i=1}^n yi)^(λ−1).
This is the Jacobian for transforming the whole vector y into the whole vector W. If an individual yi is to be transformed into Wi, take the corresponding geometric-mean factor
J(yi → Wi) = (1/(∏_{i=1}^n yi)^(1/n))^(λ−1).
The quantity J(y → W) = (1/∏_{i=1}^n yi)^(λ−1) ensures that unit volume is preserved in moving from the set of yi to the set of Vi. This is the factor which scales the transformation and ensures that the residual sums of squares obtained for different values of λ can be compared.
Now consider the model
y^(λ) = Xβ + ε
where y^(λ) = (y^λ − 1)/(λ y*^(λ−1)) and ε ~ N(0, σ²I).
The likelihood function is
L(y^(λ)) = (1/(2πσ²))^(n/2) exp(−Σ_{i=1}^n εi²/(2σ²))
         = (1/(2πσ²))^(n/2) exp(−ε′ε/(2σ²))
         = (1/(2πσ²))^(n/2) exp(−(y^(λ) − Xβ)′(y^(λ) − Xβ)/(2σ²))
so that
ln L(y^(λ)) = −(n/2) ln σ² − (y^(λ) − Xβ)′(y^(λ) − Xβ)/(2σ²)   (ignoring the constant).
Solving
∂ ln L(y^(λ))/∂β = 0
∂ ln L(y^(λ))/∂σ² = 0
gives the maximum likelihood estimators
β̂(λ) = (X′X)⁻¹ X′ y^(λ)
σ̂²(λ) = (1/n) y^(λ)′[I − X(X′X)⁻¹X′] y^(λ) = y^(λ)′ H y^(λ)/n
for a given value of λ, where H = I − X(X′X)⁻¹X′.
Substituting these estimates into the log-likelihood function ln L(y^(λ)) gives, apart from an additive constant,
L(λ) = −(n/2) ln σ̂² = −(n/2) ln[SS_res(λ)]
where SS_res(λ) is the sum of squares due to residuals, which is a function of λ. Now maximize L(λ) with respect to λ. It is difficult to obtain any closed form for the estimator of λ, so we maximize it numerically.
The function −(n/2) ln[SS_res(λ)] is called the Box-Cox objective function.
Let λmax be the value of λ which maximizes the Box-Cox objective function, i.e., minimizes SS_res(λ). Then under fairly general conditions, the statistic 2[L(λmax) − L(λ)] has approximately a χ²(1) distribution. This result is based on the large-sample behaviour of the likelihood ratio statistic, and is explained as follows:
Write the likelihood ratio as η = [SS_res(λmax)/SS_res(λ)]^(n/2). Then
ln η = (n/2) ln[SS_res(λmax)/SS_res(λ)]
−ln η = (n/2) ln[SS_res(λ)/SS_res(λmax)]
      = (n/2) ln[SS_res(λ)] − (n/2) ln[SS_res(λmax)]
      = −L(λ) + L(λmax)
where
L(λ) = −(n/2) ln[SS_res(λ)]
L(λmax) = −(n/2) ln[SS_res(λmax)].
Since under certain regularity conditions −2 ln η converges in distribution to χ²(1) when the null hypothesis is true, we have
−2 ln η ~ χ²(1)
or −ln η ~ χ²(1)/2
or L(λmax) − L(λ) ~ χ²(1)/2.
Computational procedure
The maximum likelihood estimate of λ corresponds to the value of λ for which the residual sum of squares from the fitted model, SS_res(λ), is a minimum. To determine such a λ, we proceed computationally as follows:
- Fit the model to y^(λ) for various values of λ. For example, start with values in (−1, 1), then take values in (−2, 2), and so on. About 15 to 20 values of λ are usually sufficient for estimating the optimum value.
- Plot SS_res(λ) versus λ and choose the value of λ at which SS_res(λ) is minimized.
Note that the value of λ cannot be selected by directly comparing the residual sums of squares from the regressions of y^λ on x, because for each λ the residual sum of squares is measured on a different scale.
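The procedure above can be sketched as a simple grid search. The data are synthetic, constructed so that √y is linear in x, i.e., the true λ is about 0.5 (my own example, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.linspace(1.0, 10.0, n)
# Synthetic data: sqrt(y) = 1 + 0.5*x + small noise, so lambda near 0.5 fits best.
y = (1.0 + 0.5 * x + rng.normal(0.0, 0.05, n)) ** 2

X = np.column_stack([np.ones(n), x])
gm = np.exp(np.mean(np.log(y)))          # geometric mean of y

def ss_res(lam):
    """Residual sum of squares using the scaled response V at this lambda."""
    if abs(lam) < 1e-8:                  # lambda = 0 branch: V = gm * ln(y)
        v = gm * np.log(y)
    else:
        v = (y**lam - 1.0) / (lam * gm**(lam - 1.0))
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    e = v - X @ beta
    return e @ e

# The scaling by gm**(lam - 1) makes the residual sums of squares comparable
# across lambda, so the minimizer can be read off a simple grid.
grid = np.linspace(-1.0, 2.0, 61)
best = min(grid, key=ss_res)
print(round(best, 2))  # close to 0.5
```

In practice one would plot ss_res over the grid, as the notes suggest, rather than report only the minimizer.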
It is better to use simple values of λ. For example, the practical difference between λ = 0.5 and λ = 0.58 is likely to be small, but λ = 0.5 is much easier to interpret.
It is entirely acceptable to use y^(λ) as the response in the final model; this model only differs by a change of scale and origin from one fitted in y^λ. A confidence interval for λ is helpful in selecting the final value of λ. For example, suppose λ̂ = 0.58 is the value of λ which minimizes the sum of squares due to residuals. If λ = 0.5 lies in the confidence interval, then one may use the square root transformation because it is easier to explain. Furthermore, if λ = 1 lies in the confidence interval, it may be concluded that no transformation is necessary.
In applying the method of maximum likelihood to the regression model, we are essentially maximizing
L(λ) = −(n/2) ln[SS_res(λ)]
or, equivalently, minimizing SS_res(λ).
An approximate 100(1 − α)% confidence interval for λ consists of those values of λ that satisfy
L(λ̂) − L(λ) ≤ χ²α(1)/2
where χ²α(1) is the upper α percentage point of the chi-square distribution with one degree of freedom.
- On the plot of L(λ) versus λ, draw a horizontal line at height L(λ̂) − χ²α(1)/2. This line cuts the curve L(λ) at two points.
- The locations of these two points on the λ-axis define the two end points of the approximate confidence interval.
- If instead the sum of squares due to residuals is minimized and SS_res(λ) is plotted versus λ, then the line must be drawn at the height
SS* = SS_res(λ̂) exp[χ²α(1)/n]
where λ̂ is the value of λ which minimizes the sum of squares due to residuals. To see this:
L(λ̂) − χ²α(1)/2 = −(n/2) ln[SS_res(λ̂)] − χ²α(1)/2
= −(n/2){ln[SS_res(λ̂)] + χ²α(1)/n}
= −(n/2){ln[SS_res(λ̂)] + ln exp[χ²α(1)/n]}
= −(n/2) ln{SS_res(λ̂) exp[χ²α(1)/n]}
= −(n/2) ln SS*.
Using the expansion of the exponential function,
exp(t) = 1 + t + t²/2! + … ≈ 1 + t,
the height can also be written approximately as SS* ≈ SS_res(λ̂)[1 + χ²α(1)/n].
These expressions are based on the fact that χ²α(1) = Z²α/2 ≈ t²α/2(γ) when the degrees of freedom γ are not too small. It is debatable whether to use γ or n in this expression, but practically the difference between the resulting confidence intervals is very small.
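A quick numerical check of the two forms of the cut-off, using the known upper 5% point χ²_0.05(1) ≈ 3.841 and illustrative values of n and SS_res(λ̂) (the numbers are made up):

```python
import math

chi2 = 3.841      # upper 5% point of the chi-square distribution with 1 df
n = 30            # illustrative sample size
ss_min = 100.0    # illustrative minimum residual sum of squares SS_res(lambda_hat)

exact = ss_min * math.exp(chi2 / n)     # SS* = SS_res(lambda_hat) * exp(chi2/n)
approx = ss_min * (1.0 + chi2 / n)      # first-order approximation exp(t) = 1 + t
print(exact, approx)                    # the two heights differ by well under 1%
```

For moderate n the approximation is ample for drawing the cut-off line on the SS_res(λ) plot.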
Transformation on explanatory variables: the Box-Tidwell procedure
The Box-Cox transformation was originally introduced to reduce non-normality in the data; it also helps in reducing nonlinearity. The approach is to find transformations which attempt to reduce the residuals associated with outliers and also reduce the problem of non-constant error variance, provided there was no acute nonlinearity to begin with.
We may also want to select an appropriate transformation on the explanatory variable so that the relationship between y and the transformed explanatory variable is as simple as possible.
The Box-Tidwell procedure is a general analytical procedure for determining the form of the transformation on x. Suppose the study variable y is related to powers of the explanatory variables. The Box-Tidwell procedure transforms the explanatory variables as
zij = (xij^αj − 1)/αj   when αj ≠ 0,  i = 1, 2, …, n;  j = 1, 2, …, k
    = ln xij            when αj = 0.
We need to estimate the αj's. Since the study variable is not being transformed, we need not worry about rescaling the residual sums of squares to make them comparable, as was necessary in the Box-Cox method.
We consider this for the simple linear regression model rather than a general nonlinear regression model. Assume y is related to ξ = x^α as
E(y) = f(ξ, β0, β1) = β0 + β1 ξ
where
ξ = x^α   if α ≠ 0
  = ln x   if α = 0.
Take the initial guess as α0 = 1, so that ξ0 = x. Expanding about the initial guess in a Taylor series and ignoring terms of order higher than one gives
E(y) = f(ξ0, β0, β1) + (α − α0) [df(ξ, β0, β1)/dα]_{α=α0, ξ=ξ0}
     = β0 + β1 x + (α − 1) [df(ξ, β0, β1)/dα]_{α=α0, ξ=ξ0}.
If the term [df(ξ, β0, β1)/dα]_{α=α0, ξ=ξ0} were known, it could be treated just like an additional explanatory variable, and the parameters β0, β1 and α could then be estimated by the least squares method.
By the chain rule,
[df(ξ, β0, β1)/dα]_{α=α0, ξ=ξ0} = [df(ξ, β0, β1)/dξ]_{ξ=ξ0} · [dξ/dα]_{α=α0}.
Here
[df(ξ, β0, β1)/dξ]_{ξ=ξ0} = d(β0 + β1 x)/dx = β1
and, since ξ = x^α, [dξ/dα]_{α=α0=1} = x ln x, so the additional term is (α − 1) β1 x ln x.
The computational procedure is therefore as follows. First fit
ŷ = β̂0 + β̂1 x
by the least squares method. Then define ω = x ln x and fit
ŷ = β̂0* + β̂1* x + γ̂ω.
Comparing with the Taylor expansion, γ estimates (α − 1)β1, so
γ̂ = (α̂ − 1)β̂1
or α̂1 = γ̂/β̂1 + 1
where β̂1 is the estimate obtained from the first regression.
This procedure may be repeated using the new regressor x* = x^α̂1 in the calculation.
This procedure generally converges rapidly; usually the first-stage result α̂1 is a satisfactory estimate of α. Round-off error is a potential problem: if enough decimal places are not carried, the successive values of α̂ may oscillate badly. If the standard deviation of the errors (σ) is large, or if the range of the explanatory variable is very small relative to its mean, the estimator may face convergence problems. Such a situation suggests that the data do not support the need for any transformation.
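The iteration can be sketched as follows on noise-free synthetic data with true exponent α = 0.5 (my own example; with exact data α = 0.5 is a fixed point of the update, since the fit there is perfect and γ̂ = 0):

```python
import numpy as np

x = np.linspace(1.0, 10.0, 40)
y = 1.0 + 2.0 * np.sqrt(x)              # true model: y = b0 + b1 * x**0.5

def ols(Z, t):
    """Least squares coefficients of t on the columns of Z."""
    beta, *_ = np.linalg.lstsq(Z, t, rcond=None)
    return beta

alpha = 1.0                              # initial guess alpha0 = 1
for _ in range(10):
    xt = x**alpha                        # current regressor x* = x**alpha
    # Step 1: fit y = b0 + b1*x* to get b1_hat.
    b1 = ols(np.column_stack([np.ones_like(xt), xt]), y)[1]
    # Step 2: add w = x* ln(x*) and fit y = b0 + b1*x* + gamma*w.
    w = xt * np.log(xt)
    gamma = ols(np.column_stack([np.ones_like(xt), xt, w]), y)[2]
    # Update: the exponent correction relative to the current regressor
    # is gamma/b1 + 1, so the overall exponent is multiplied by it.
    alpha *= gamma / b1 + 1.0

print(round(alpha, 3))  # close to 0.5
```

With double precision the oscillation mentioned in the notes does not arise here; with noisy data a tolerance-based stopping rule would replace the fixed iteration count.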