
Chapter 5

Transformation and Weighting to Correct Model Inadequacies

Graphical methods help in detecting violations of the basic assumptions in regression analysis. We now consider methods and procedures for building models through data transformations when some of these assumptions are violated.

Variance stabilizing transformations


In regression analysis, it is assumed that the variance of the disturbances is constant, i.e., Var(εᵢ) = σ², i = 1, 2, ..., n. Suppose this assumption is violated. A common reason for such a violation is that the study variable follows a probability distribution in which the variance is functionally related to the mean.

For example, if the study variable (y) in a simple linear regression model is a Poisson random variable, then its variance equals its mean. Since the mean of y is related to the explanatory variable x, the variance of y will be proportional to x. In such cases, variance-stabilizing transformations are useful.

In another example, if y is a proportion, i.e., 0 ≤ yᵢ ≤ 1, then the variance of y is proportional to E(y)[1 − E(y)]. In such cases, a variance-stabilizing transformation is again useful.

Some commonly used variance-stabilizing transformations, in increasing order of strength, are as follows:

Relation of σ² to E(y)          Transformation
σ² ∝ constant                   y* = y (no transformation)
σ² ∝ E(y)                       y* = √y (Poisson data)
σ² ∝ E(y)[1 − E(y)]             y* = sin⁻¹(√y) (binomial proportions 0 ≤ yᵢ ≤ 1)
σ² ∝ [E(y)]²                    y* = ln(y)
σ² ∝ [E(y)]³                    y* = 1/√y
σ² ∝ [E(y)]⁴                    y* = 1/y

After making the suitable transformation, use y* as the study variable in the respective case.
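As a small illustration (a sketch using NumPy with simulated data; all names here are ours, not from the notes), the square-root transformation stabilizes the variance of Poisson-type responses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson responses: the variance equals the mean, so it grows with x.
x = np.repeat([2.0, 8.0, 32.0], 500)
y = rng.poisson(lam=x)

for mean in [2.0, 8.0, 32.0]:
    grp = y[x == mean].astype(float)
    # After y* = sqrt(y), the group variances are all close to 1/4.
    print(mean, round(grp.var(), 2), round(np.sqrt(grp).var(), 3))
```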

The strength of a transformation depends on the amount of curvature in the relationship between the study and explanatory variables. The transformations listed above range from relatively mild to relatively strong: the square root transformation is relatively mild, while the reciprocal transformation is relatively strong.

In general, a mild transformation is applied when the minimum and maximum values do not differ much (e.g., y_max/y_min < 2 or 3), and such a transformation has little effect on the curvature. On the other hand, when the minimum and maximum differ greatly, a strong transformation is needed, and it will have a strong effect on the analysis.

In the presence of non-constant variance, the OLSE remains unbiased but loses the minimum variance property.

When the study variable has been transformed to y*, the predicted values are on the transformed scale. It is often necessary to convert the predicted values back to the original units (y).

Applying the inverse transformation directly to the predicted values gives an estimate of the median of the distribution of the study variable rather than the mean, so one needs to be careful while doing so.

Confidence intervals and prediction intervals may be converted directly from one metric to another because the interval limits are percentiles of a distribution, and percentiles are unaffected by monotone transformations. Note, however, that the resulting intervals may not remain the shortest possible intervals.
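The median point above can be checked numerically (a sketch with simulated lognormal data; for such data, back-transforming the mean of ln y recovers the median of y, while the mean of y is larger):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.lognormal(mean=1.0, sigma=0.8, size=200_000)

back = np.exp(np.log(y).mean())          # inverse transform applied to the mean of ln y
print(back, np.median(y), y.mean())      # back ~ median(y); mean(y) is about exp(0.8**2/2) times larger
```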

Transformations to linearize the model


The basic assumption in linear regression analysis is that the relationship between the study variable and the explanatory variables is linear. Suppose this assumption is violated. Such a violation can be detected through the scatter plot matrix, scatter diagrams, partial regression plots, lack-of-fit tests, etc.

In some cases, a nonlinear model can be linearized by a suitable transformation. Such nonlinear models are called intrinsically or transformably linear. The advantage of transforming a nonlinear function into a linear one is that the statistical tools developed for the linear regression model become directly available.

For example, exact hypothesis tests, confidence interval estimation, etc. have been developed for the linear regression model. Once the nonlinear function is transformed to a linear one, all such tools can be readily applied, and there is no need to develop them separately.

Some linearizable functions are as follows:


1. If the curve between y and x looks as follows (figure omitted), then the possible linearizable function is of the form

y = β0 x^β1.

Using the transformation y* = ln y, x* = ln x, i.e., by taking logarithms on both sides, the model becomes

ln y = ln β0 + β1 ln x

or y* = β0* + β1 x*

where β0* = ln β0, and the model becomes a linear model. Note that the parameter β0 changes to ln β0 in the transformed model. (A code sketch after case 4 below illustrates this fit.)

2. If the curve between y and x looks as follows (figure omitted), then the possible linearizable function is of the form

y = β0 exp(β1 x).

Taking log_e (ln) on both sides,

ln y = ln β0 + β1 x

or y* = β0* + β1 x

where y* = ln y and β0* = ln β0.

So y* = ln y is the transformation needed in this case. The intercept term β0 becomes ln β0 in the transformed model.

3. If the curve between y and x looks as follows (figure omitted),
then the possible linearizable function is of the form

y = β0 + β1 log x

which can be written as

y = β0 + β1 x*

using the transformation x* = log x.

4. If the curve between y and x looks as follows (figure omitted), then the possible linearizable function is of the form

y = x / (β0 x − β1)

which can be written as

1/y = β0 − β1/x

or y* = β0 + β1 x*,

which becomes a linear model by using the transformations y* = 1/y and x* = −1/x.
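As a quick illustration of case 1 (a sketch with simulated data; np.polyfit here is just ordinary least squares on the transformed scale), regressing ln y on ln x recovers β1 as the slope and ln β0 as the intercept:

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 3.0, 0.5
x = np.linspace(1.0, 50.0, 200)
y = beta0 * x**beta1 * rng.lognormal(sigma=0.05, size=x.size)  # multiplicative error

# Transformed model: y* = ln(beta0) + beta1 * x*  with  y* = ln y, x* = ln x
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
print(slope, np.exp(intercept))          # approximately beta1 and beta0
```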

• Based on the observed behaviour of the plots, one can choose the appropriate curve and use the linearized form of the function.

• When such transformations are used, the form of ε often changes as well. For example, in the case of

y = β0 exp(β1 x) ε,

ln y = ln β0 + β1 x + ln ε

or y* = β0* + β1 x + ε*.

The errors in the transformed model are normally distributed only if the multiplicative error in the original model is lognormally distributed. Many times we ignore this aspect and continue to assume that the random errors are still normally distributed. In such cases, the residuals from the transformed model should be checked for the validity of the assumptions.

• When such transformations are used, the OLSE has the desired properties with respect to the
transformed data and not the original data.

Analytical methods for selecting a transformation on the study variable


The Box-Cox method
Suppose departures from normality and/or constant variance of the study variable y can be corrected through a power transformation on y. This means y is to be transformed as y^λ, where λ is a parameter to be determined. For example, if λ = 0.5, then the transformation is the square root, and √y is used as the study variable in place of y.

Now the linear regression model has parameters β, σ², and λ. The Box-Cox method shows how to estimate λ and the model parameters simultaneously using the method of maximum likelihood.

Note that as λ approaches zero, y^λ approaches 1. So there is a problem at λ = 0, because this makes all the observations on y equal to unity, and it is meaningless for all observations on the study variable to be constant. Thus there is a discontinuity at λ = 0. One approach to resolving this difficulty is to use (y^λ − 1)/λ as the study variable. Note that as λ → 0, (y^λ − 1)/λ → ln y. So a possible solution is to define the transformed study variable as

W = (y^λ − 1)/λ   for λ ≠ 0,
W = ln y          for λ = 0.

So the family W is continuous in λ. Still, it has a drawback: as λ changes, the values of W change dramatically, which makes it difficult to find the best value of λ. If different analysts obtain different values of λ, they will fit different models, and it may then not be appropriate to compare models with different values of λ. So it is preferable to use the alternative form
V = y^(λ) = (y^λ − 1)/(λ y*^(λ−1))   for λ ≠ 0,
V = y^(λ) = y* ln y                  for λ = 0,

where y* is the geometric mean of the yᵢ's, y* = (y1 y2 ... yn)^(1/n), which is a constant.

For calculation purposes, we can use

ln y* = (1/n) Σᵢ₌₁ⁿ ln yᵢ.

When V is applied to each yᵢ, we get V = (V1, V2, ..., Vn)' as the vector of observations on the transformed study variable, and we use it to fit the linear model

V = Xβ + ε

using the least squares or maximum likelihood method.
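A minimal sketch of this fit (simulated data; boxcox_v is our own helper name, and np.linalg.lstsq is plain least squares):

```python
import numpy as np

def boxcox_v(y, lam):
    """Scaled Box-Cox transform V = y^(lambda); requires y > 0."""
    gm = np.exp(np.mean(np.log(y)))              # geometric mean y*
    if lam == 0.0:
        return gm * np.log(y)
    return (y**lam - 1.0) / (lam * gm**(lam - 1.0))

rng = np.random.default_rng(3)
x = rng.uniform(1.0, 10.0, size=100)
y = (2.0 + 0.7 * x + rng.normal(scale=0.3, size=100))**2   # right-skewed response

X = np.column_stack([np.ones_like(x), x])                  # design matrix with intercept
V = boxcox_v(y, lam=0.5)
beta_hat, *_ = np.linalg.lstsq(X, V, rcond=None)           # fit V = X beta + eps
print(beta_hat)
```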

The quantity λy*^(λ−1) in the denominator is related to the Jacobian of the transformation: the nth power of y*^(λ−1) equals the Jacobian determinant. To see how:

We want to convert yᵢ into yᵢ^(λ) as

Wᵢ = yᵢ^(λ) = (yᵢ^λ − 1)/λ;   λ ≠ 0.

Let y = (y1, y2, ..., yn)' and W = (W1, W2, ..., Wn)'.

Note that if W1 = (y1^λ − 1)/λ, then

∂W1/∂y1 = λ y1^(λ−1)/λ = y1^(λ−1)
∂W1/∂y2 = 0.

In general,

∂Wᵢ/∂yⱼ = yᵢ^(λ−1) if i = j, and 0 if i ≠ j.
The Jacobian of the transformation for a single observation is

J(yᵢ → Wᵢ) = ∂yᵢ/∂Wᵢ = 1/(∂Wᵢ/∂yᵢ) = 1/yᵢ^(λ−1).

For the whole vector, the matrix of partial derivatives ∂Wᵢ/∂yⱼ is diagonal with entries y1^(λ−1), y2^(λ−1), ..., yn^(λ−1), so

J(W → y) = ∏ᵢ₌₁ⁿ yᵢ^(λ−1) = (∏ᵢ₌₁ⁿ yᵢ)^(λ−1)

and

J(y → W) = 1/J(W → y) = (1/∏ᵢ₌₁ⁿ yᵢ)^(λ−1).
This is the Jacobian for transforming the whole vector y into the whole vector W. If an individual yᵢ is to be transformed into Wᵢ, then taking the geometric mean gives

J(yᵢ → Wᵢ) = (1/(∏ᵢ₌₁ⁿ yᵢ)^(1/n))^(λ−1) = 1/y*^(λ−1).

The quantity J(y → W) = 1/∏ᵢ₌₁ⁿ yᵢ^(λ−1) ensures that unit volume is preserved in moving from the set of yᵢ to the set of Vᵢ. This is a scaling factor which ensures that the residual sums of squares obtained from different values of λ can be compared.
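A quick numeric check of this scaling (a sketch; names are ours): the determinant of the diagonal Jacobian equals the nth power of y*^(λ−1), so dividing by y*^(λ−1) per observation preserves unit volume overall.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.uniform(0.5, 4.0, size=6)
lam = 0.5

jac = np.prod(y**(lam - 1.0))             # J(W -> y) = prod_i y_i^(lambda-1)
gm = np.exp(np.mean(np.log(y)))           # geometric mean y*
print(jac, (gm**(lam - 1.0))**len(y))     # the two values agree up to rounding
```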

To find the appropriate family, consider

V = y^(λ) = Xβ + ε

where y^(λ) = (y^λ − 1)/(λ y*^(λ−1)) and ε ~ N(0, σ²I).

Applying the method of maximum likelihood, the likelihood function for y^(λ) is

L(y^(λ)) = (1/(2πσ²))^(n/2) exp(− Σᵢ₌₁ⁿ εᵢ² / (2σ²))
         = (1/(2πσ²))^(n/2) exp(− ε'ε / (2σ²))
         = (1/(2πσ²))^(n/2) exp(− (y^(λ) − Xβ)'(y^(λ) − Xβ) / (2σ²))

ln L(y^(λ)) = −(n/2) ln σ² − (y^(λ) − Xβ)'(y^(λ) − Xβ)/(2σ²)   (ignoring the constant).
Solving

∂ ln L(y^(λ))/∂β = 0
∂ ln L(y^(λ))/∂σ² = 0

gives the maximum likelihood estimators

β̂(λ) = (X'X)⁻¹X'y^(λ)
σ̂²(λ) = (1/n) y^(λ)'[I − X(X'X)⁻¹X']y^(λ) = y^(λ)'Hy^(λ)/n,  where H = I − X(X'X)⁻¹X',

for a given value of λ.
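These two estimators translate directly into matrix code (a sketch; the helper name is ours, and v stands for the transformed response y^(λ) at a given λ):

```python
import numpy as np

def ml_estimates(X, v):
    """MLE for fixed lambda: beta_hat = (X'X)^{-1} X'v and sigma2_hat = RSS / n."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ v)   # normal equations
    resid = v - X @ beta_hat
    sigma2_hat = resid @ resid / len(v)            # v'[I - X(X'X)^{-1}X']v / n
    return beta_hat, sigma2_hat
```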

Substituting these estimates into the log-likelihood function ln L(y^(λ)) gives, apart from a constant,

L(λ) = −(n/2) ln σ̂²(λ) = −(n/2) ln [SS_res(λ)]

where SS_res(λ) is the residual sum of squares, which is a function of λ. Now maximize L(λ) with respect to λ. It is difficult to obtain a closed form for the estimator of λ, so we maximize it numerically.

The function −(n/2) ln [SS_res(λ)] is called the Box-Cox objective function.
Let λmax be the value of λ which maximizes the Box-Cox objective function (equivalently, minimizes SS_res(λ)). Then under fairly general conditions, for any other λ,

n ln [SS_res(λ)] − n ln [SS_res(λmax)]

has approximately a χ²(1) distribution. This result is based on the large-sample behaviour of the likelihood ratio statistic, as explained below:

The likelihood ratio test statistic in our case is

η ≡ ηₙ = max_{Ω₀} L / max_Ω L
       = max_{Ω₀} (1/σ²)^(n/2) / max_Ω (1/σ²)^(n/2)
       = [1/σ̂²(λ)]^(n/2) / [1/σ̂²(λmax)]^(n/2)
       = [(1/SS_res(λ)) / (1/SS_res(λmax))]^(n/2).

Therefore

ln η = (n/2) ln [SS_res(λmax)/SS_res(λ)]

−ln η = (n/2) ln [SS_res(λ)/SS_res(λmax)]
      = (n/2) ln [SS_res(λ)] − (n/2) ln [SS_res(λmax)]
      = −L(λ) + L(λmax)

where

L(λ) = −(n/2) ln [SS_res(λ)]
L(λmax) = −(n/2) ln [SS_res(λmax)].
Since under certain regularity conditions −2 ln ηₙ converges in distribution to χ²(1) when the null hypothesis is true,

−2 ln η ~ χ²(1)
or −ln η ~ χ²(1)/2
or L(λmax) − L(λ) ~ χ²(1)/2.

Computational procedure
The maximum likelihood estimate of λ corresponds to the value of λ for which the residual sum of squares from the fitted model, SS_res(λ), is a minimum. To determine such a λ, we proceed computationally as follows (see the sketch after this list):

- Fit y^(λ) for various values of λ. For example, start with values in (−1, 1), then take values in (−2, 2), and so on. About 15 to 20 values of λ are usually sufficient for estimating the optimum value.
- Plot SS_res(λ) versus λ.
- Find the value of λ which minimizes SS_res(λ) from the graph.
- A second iteration can be performed using a finer mesh of values if desired.
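A sketch of this grid search (reusing the boxcox_v helper and the X, y arrays from the earlier sketch; matplotlib could be used for the plotting step):

```python
import numpy as np

def ss_res(X, y, lam):
    v = boxcox_v(y, lam)                              # scaled transform, so SS_res is comparable
    beta_hat, *_ = np.linalg.lstsq(X, v, rcond=None)
    r = v - X @ beta_hat
    return r @ r

lambdas = np.linspace(-2.0, 2.0, 17)                  # about 15-20 trial values
ss = np.array([ss_res(X, y, lam) for lam in lambdas])
lam_hat = lambdas[np.argmin(ss)]
print(lam_hat)                                        # refine on a finer mesh if desired
```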

Note that the value of λ cannot be selected by directly comparing the residual sums of squares from the regressions of y^λ on x, because for each λ the residual sum of squares is measured on a different scale.

It is better to use simple values of λ. For example, the practical difference between λ = 0.5 and λ = 0.58 is likely to be small, but λ = 0.5 is much easier to interpret.

Once λ is selected, then use

• y^λ as the study variable if λ ≠ 0,
• ln y as the study variable if λ = 0.

It is entirely acceptable to use y^(λ) as the response for the final model. This model will have a scale difference and an origin shift in comparison to the model using y^λ (or ln y) as the response.

An approximate confidence interval for λ


We can find an approximate confidence interval for the transformation parameter λ. This interval helps in selecting the final value of λ. For example, suppose λ̂ = 0.58 is the value of λ which minimizes the residual sum of squares. If λ = 0.5 lies in the confidence interval, then one may use the square root transformation because it is easier to explain. Furthermore, if λ = 1 lies in the confidence interval, then it may be concluded that no transformation is necessary.

In applying the method of maximum likelihood to the regression model, we are essentially maximizing

L(λ) = −(n/2) ln [SS_res(λ)]

or, equivalently, minimizing SS_res(λ).

An approximate 100(1 − α)% confidence interval for λ consists of those values of λ that satisfy

L(λ̂) − L(λ) ≤ χ²_α(1)/2

where χ²_α(1) is the upper α percentage point of the chi-square distribution with one degree of freedom.

The approximate confidence interval is constructed using the following steps:

- Draw a plot of L(λ) versus λ.
- Draw a horizontal line at height

L(λ̂) − χ²_α(1)/2

on the vertical scale.

- This line will cut the graph of L(λ) at two points.
- The locations of these two points on the λ-axis define the two end points of the approximate confidence interval.
- If the residual sum of squares is minimized and SS_res(λ) versus λ is plotted, then the line must be drawn at the height

SS* = SS_res(λ̂) exp(χ²_α(1)/n)

where λ̂ is the value of λ which minimizes the residual sum of squares. To see why:

L(λ̂) − χ²_α(1)/2 = −(n/2) ln [SS_res(λ̂)] − χ²_α(1)/2
                 = −(n/2) {ln [SS_res(λ̂)] + χ²_α(1)/n}
                 = −(n/2) {ln [SS_res(λ̂)] + ln exp(χ²_α(1)/n)}
                 = −(n/2) ln [SS_res(λ̂) · exp(χ²_α(1)/n)]
                 = −(n/2) ln SS*.
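The corresponding computation (a sketch, assuming SciPy is available for the chi-square percentile and reusing lambdas and ss from the grid-search sketch; the grid must be fine enough for the endpoints to be meaningful):

```python
import numpy as np
from scipy.stats import chi2

n = len(y)
ss_star = ss.min() * np.exp(chi2.ppf(0.95, df=1) / n)   # cutoff height SS* for alpha = 0.05

inside = lambdas[ss <= ss_star]                         # lambda values satisfying the inequality
print(inside.min(), inside.max())                       # approximate 95% confidence limits
```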
2
Using the expansion of the exponential function,

exp(t) = 1 + t + t²/2! + ... ≈ 1 + t,

we can approximate exp(χ²_α(1)/n) by 1 + χ²_α(1)/n. So in place of exp(χ²_α(1)/n) in the confidence interval procedure, we can use any of the following:

1 + Z²_{α/2}/γ   or   1 + Z²_{α/2}/n
1 + t²_{α/2}/γ   or   1 + t²_{α/2}/n
1 + χ²_{α/2}/γ   or   1 + χ²_{α/2}/n

where γ is the number of degrees of freedom associated with the residual sum of squares.

These expressions are based on the fact that χ²(1) = Z² ≈ t²_γ when γ is sufficiently large.

It is debatable whether to use γ or n, but practically the difference between the resulting confidence intervals is very small.

The Box-Cox transformation was originally introduced to reduce non-normality in the data. It also helps in reducing nonlinearity. The approach seeks transformations which reduce the residuals associated with outliers and also alleviate the problem of non-constant error variance, provided there was no acute nonlinearity to begin with.

Transformation on explanatory variables: Box and Tidwell procedure


Suppose the relationship between y and one or more of the explanatory variables is nonlinear, while the other usual assumptions (a normally and independently distributed study variable with constant variance) are at least approximately satisfied.

We want to select an appropriate transformation on the explanatory variables so that the relationship between y and the transformed explanatory variables is as simple as possible.

The Box and Tidwell procedure is a general analytical method for determining the form of the transformation on x.

Suppose that the study variable y is related to powers of the explanatory variables. The Box and Tidwell procedure chooses the transformed explanatory variables as

zᵢⱼ = (xᵢⱼ^αⱼ − 1)/αⱼ   when αⱼ ≠ 0,   i = 1, 2, ..., n; j = 1, 2, ..., k,
zᵢⱼ = ln xᵢⱼ            when αⱼ = 0.

We need to estimate the αⱼ's. Since the dependent variable is not being transformed, we need not worry about changes of scale, and we minimize

Σᵢ₌₁ⁿ [yᵢ − β0 − β1 zᵢ₁ − ... − βk zᵢₖ]²

using nonlinear least squares techniques.

We consider this for the simple linear regression model instead of the nonlinear regression model.

Assume y is related to ξ = x^α as

E(y) = f(ξ, β0, β1) = β0 + β1 ξ

where

ξ = x^α    if α ≠ 0
ξ = ln x   if α = 0

and β0, β1 and α are the unknown parameters.

Suppose α0 is the initial guess of the constant α. Usually the first guess is α0 = 1, so that ξ = x and no transformation is applied in the first iteration.

Expanding about the initial guess in a Taylor series and ignoring terms of order higher than one gives

E(y) = f(ξ0, β0, β1) + (α − α0) [df(ξ, β0, β1)/dα]_{α=α0, ξ=ξ0}
     = β0 + β1 x + (α − 1) [df(ξ, β0, β1)/dα]_{α=α0, ξ=ξ0}.

Suppose the term [df(ξ, β0, β1)/dα]_{α=α0, ξ=ξ0} were known. Then it could be treated just like an additional explanatory variable, and the parameters β0, β1 and α could be estimated by the least squares method.

The estimate of α can be considered as an improved estimate of the transformation parameter.


This term can be written as

[df(ξ, β0, β1)/dα]_{α=α0, ξ=ξ0} = [df(ξ, β0, β1)/dξ]_{ξ=ξ0} · [dξ/dα]_{α=α0}.

Since the form of the transformation is known, i.e., ξ = x^α,

dξ/dα = x^α ln x,

which at α0 = 1 equals x ln x.

Furthermore

[df(ξ, β0, β1)/dξ]_{ξ=ξ0} = d(β0 + β1 x)/dx = β1.

So β1 can be estimated by fitting the model

ŷ = β̂0 + β̂1 x

by the least squares method.

Then an “adjustment” to the initial guess α0 = 1 is computed by defining a second regression variable

ω = x ln x

and estimating the parameters in

E(y) = β0* + β1* x + (α − 1)β1 ω = β0* + β1* x + γω

by least squares. This gives

ŷ = β̂0* + β̂1* x + γ̂ω   with   γ̂ = (α̂ − 1)β̂1,

or

α1 = γ̂/β̂1 + 1

as the revised estimate of α.

Note that β̂1 is obtained from ŷ = β̂0 + β̂1 x, while γ̂ is obtained from ŷ = β̂0* + β̂1* x + γ̂ω. Generally, β̂1 and β̂1* will differ.

This procedure may be repeated using a new regressor x* = x^α1 in the calculations.

This procedure generally converges rapidly; usually the first-stage result α1 is a satisfactory estimate of α. Round-off error is a potential problem: if enough decimal places are not carried, the successive values of α may oscillate badly. If the standard deviation of the errors (σ) is large, or if the range of the explanatory variable is very small relative to its mean, the procedure may face convergence problems. Such situations imply that the data do not support the need for any transformation.
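A sketch of one Box-Tidwell step for the simple model (simulated data; the helper name is ours):

```python
import numpy as np

def box_tidwell_step(x, y):
    """One update for E(y) = b0 + b1 * x^alpha, starting from alpha0 = 1."""
    X1 = np.column_stack([np.ones_like(x), x])
    b1 = np.linalg.lstsq(X1, y, rcond=None)[0][1]          # first fit: y on x
    w = x * np.log(x)                                      # second regressor omega = x ln x
    X2 = np.column_stack([np.ones_like(x), x, w])
    gamma = np.linalg.lstsq(X2, y, rcond=None)[0][2]       # coefficient of omega
    return gamma / b1 + 1.0                                # alpha1 = gamma_hat/beta1_hat + 1

rng = np.random.default_rng(4)
x = rng.uniform(1.0, 10.0, size=200)
y = 1.0 + 2.0 * np.sqrt(x) + rng.normal(scale=0.1, size=200)  # true alpha = 0.5

alpha1 = box_tidwell_step(x, y)
print(alpha1)          # first-step estimate; repeat with x**alpha1 to refine
```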
