
Applied Statistics and Computing Lab

WHY ADDED VARIABLE PLOTS?


There are several ways to look at an added variable plot. We outline below what an added variable
plot is and some of the different ways to approach it.
• What is an added variable plot?
An added variable plot displays the residuals from the regression of the response on a
subset of the regressors against the residuals from the regression of the new regressor on the
same subset of regressors. For example, suppose our response is Y and we have four possible
predictors X1, X2, X3 and X4. We want to know whether X4 adds anything over and above
X1, X2 and X3, which are already in our model. For this one can use an added variable plot,
constructed as follows.
1. Fit a regression of Y on X1, X2 and X3 and save the residuals, say r1
2. Fit a regression of X4 on X1, X2 and X3 and save the residuals, say r2
3. Plot r1 against r2.
This is the added variable plot. It is customary also to fit a regression of r1 on r2 and to add
this line to the same plot.
Of course, all software packages give these plots automatically on request; a minimal sketch
of the three steps is given below.
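To make the three steps concrete, here is a sketch in Python using numpy, statsmodels and matplotlib. These libraries and the function name are our own choices for illustration, not part of the original note (statsmodels, for instance, also ships a ready-made sm.graphics.plot_partregress).

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    def added_variable_plot(y, X_in, x_new):
        """Steps 1-3: residuals of y ~ X_in plotted against residuals of x_new ~ X_in."""
        X = sm.add_constant(X_in)
        r1 = sm.OLS(y, X).fit().resid       # step 1: part of y unexplained by X_in
        r2 = sm.OLS(x_new, X).fit().resid   # step 2: part of x_new unexplained by X_in
        line = sm.OLS(r1, r2).fit()         # the customary fitted line (slope only:
                                            # both residual sets already have mean zero)
        plt.scatter(r2, r1)                 # step 3: plot r1 against r2
        plt.plot(np.sort(r2), line.params[0] * np.sort(r2), color="red")
        plt.xlabel("residuals of x_new on X_in")
        plt.ylabel("residuals of y on X_in")
        plt.show()

    # Example: does X4 add anything over X1, X2 and X3?
    # added_variable_plot(y, np.column_stack([x1, x2, x3]), x4)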
• What is an added variable plot used for?
An added variable plot is used when you want to know what a particular regressor
contributes over and above those already present in the model. It also frequently serves
as a guide to the right functional form in which the said regressor should enter the regression
equation.
• Why an added variable plot? Why not just fit the model?
First, an added variable plot gives you more detail than the numerical summary from
a regression. The pattern of points on the graph is often a clue to the correct functional form
of the regressor. One can also judge whether the relationship is sustained across the range of
the regressor, or driven only by a few (influential) points.
• Why not just look at the scatter of Y against X?
We are talking about conditional effects here, while the scatter of Y on X gives only the
raw effect. There are at least two different ways to approach this.
The first is when one already has a model, say Y on three regressors X1, X2 and X3, and we
want to investigate the effect of X1 on Y in the presence of X2 and X3, the effect of X2 on Y
in the presence of X1 and X3, and the effect of X3 on Y in the presence of X1 and X2. This
is effectively looking at the individual effects of regressors in a multiple regression.
The second way to look at this is in the context of stepwise regression. Given that you
already have a regression model of Y on X1, what is the best variable to enter next into this
equation? Added variable plots can form part of the answer here, supplementing the usual R2
or Cp or AIC / BIC criteria. In view of what is noted in the previous point, such an approach
can be expected to be more informative and robust.


• Why not plot the residuals against the regressor itself?


This is a controversial issue. People have done it, and still do it. But both theoretically and
practically the added variable plot is the better thing to do. We explain why below, and
additionally give examples of situations where the added variable plot is more informative.
To start with, there is a very practical reason to look at the added variable plot: the
following theorem.

• Theorem :
The relationship observed in the added variable plot is always at least as strong as
that observed in the plot of residuals against the new regressor.
Proof:
Let the response be denoted by Y, which we regress on k regressors contained in the data
matrix X. We wish to bring in a new regressor u. Assume without loss of generality that each
variable has zero mean and unit variance, and that the variables are jointly multivariate normal:

$$ Z = \begin{bmatrix} X \\ u \\ y \end{bmatrix} \sim N\left( 0_{(k+2) \times 1},\; \begin{bmatrix} \Sigma_{XX} & \Sigma_{XU} & \Sigma_{XY} \\ \Sigma_{UX} & 1 & \Sigma_{UY} \\ \Sigma_{YX} & \Sigma_{YU} & 1 \end{bmatrix} \right) $$

where $\Sigma_{XX}$ is $k \times k$, $\Sigma_{XU}$ and $\Sigma_{XY}$ are $k \times 1$, and $\Sigma_{UX}$, $\Sigma_{YX}$ and $\Sigma_{YU}$ are their transposes.
In this setting, one can write

$$ E[Y \mid X] = \Sigma_{YX}\,\Sigma_{XX}^{-1} X $$

with the residual e as

$$ e = y - \Sigma_{YX}\,\Sigma_{XX}^{-1} X $$

with variance

$$ \mathrm{Var}(e) = 1 - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}. $$

The correlation coefficient between e and the new regressor u (using $\mathrm{Var}(u) = 1$) is

$$ \rho(e, u) = \frac{\mathrm{Cov}(e, u)}{\sqrt{\mathrm{Var}(e)\,\mathrm{Var}(u)}} = \frac{\Sigma_{YU} - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XU}}{\sqrt{1 - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}}}. $$
On the other hand, if we regress u on the regressors in X, we obtain the regression line

$$ E[u \mid X] = \Sigma_{UX}\,\Sigma_{XX}^{-1} X $$

with residual

$$ f = u - \Sigma_{UX}\,\Sigma_{XX}^{-1} X $$

with variance

$$ \mathrm{Var}(f) = 1 - \Sigma_{UX}\,\Sigma_{XX}^{-1}\,\Sigma_{XU}. $$

The correlation coefficient between the two sets of residuals is

$$ \mathrm{Corr}(e, f) = \frac{\mathrm{Cov}(e, f)}{\sqrt{\mathrm{Var}(e)\,\mathrm{Var}(f)}} = \frac{\Sigma_{YU} - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XU}}{\sqrt{\left(1 - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}\right)\left(1 - \Sigma_{UX}\,\Sigma_{XX}^{-1}\,\Sigma_{XU}\right)}}. $$

Comparing the two expressions, $\mathrm{Corr}(e, f)$ has an extra factor $\sqrt{\mathrm{Var}(f)}$
in the denominator. Since $\Sigma_{UX}\Sigma_{XX}^{-1}\Sigma_{XU}$ is the squared multiple
correlation coefficient of u with the set of existing regressors X, this factor is always less than
or equal to 1. Hence $|\mathrm{Corr}(e, f)|$ is always greater than or equal to $|\rho(e, u)|$.
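For the sceptical reader, here is a quick numerical check of the theorem; a minimal sketch assuming numpy and statsmodels are available (the data-generating coefficients are arbitrary choices of ours).

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 5000
    X = rng.normal(size=(n, 3))
    u = X @ [0.5, -0.3, 0.2] + rng.normal(size=n)   # u correlated with X
    y = X @ [1.0, 0.5, -0.5] + 0.4 * u + rng.normal(size=n)

    Xc = sm.add_constant(X)
    e = sm.OLS(y, Xc).fit().resid   # residuals of y on X
    f = sm.OLS(u, Xc).fit().resid   # residuals of u on X

    corr_e_u = np.corrcoef(e, u)[0, 1]
    corr_e_f = np.corrcoef(e, f)[0, 1]
    print(abs(corr_e_f) >= abs(corr_e_u))   # True: the AV-plot relationship is stronger

In fact Cov(e, u) = Cov(e, f) holds exactly in-sample, since e is orthogonal to the fitted part of u; the only difference between the two correlations is the smaller standard deviation of f in the denominator.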

Another way to look at this is the following. When you regress Y on X1, X2 and X3, what
you have left as the residual is the part of Y unexplained by X1, X2 and X3, for which you
seek an explanation from X4. But in a regression setup, X4 as a whole will not contribute;
what will contribute is the part of X4 unexplained by X1, X2 and X3, which is exactly the
residual from regressing X4 on X1, X2 and X3. Hence the added variable plot is the "true"
picture of the contribution of X4 if it is entered.
There is also a layman's explanation. Suppose we have the regression we talked about
previously, namely Y on X1, X2, X3 and X4. We want to fit a multiple regression to this
data, but let us suppose that we unfortunately have forgotten how to do this. Luckily,
however, we do remember our simple regression. Is there a way to proceed?
Yes! Proceed as follows: fit Y on X1 and save the residuals, say r1. Also replace X2 by its
residuals on X1, regress r1 on this adjusted X2, and save the residuals, say r2. Continue in
the same way with X3 and then X4, each time adjusting the new regressor for the ones that
came before it. The coefficient obtained at the final step is exactly the coefficient X4 would
have in the multiple regression, and by entering the variables in a different order each
coefficient can be recovered in turn, using nothing but simple regressions!
Seen this way, an added variable plot is nothing but taking apart the black box of multiple
regression and looking inside.
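This is the content of the Frisch–Waugh–Lovell theorem: the slope of the added variable plot equals the coefficient of X4 in the full multiple regression. Below is a minimal sketch verifying this numerically, again assuming numpy and statsmodels (our choice of tools, with arbitrary illustrative coefficients).

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    X = rng.normal(size=(n, 4))                 # columns play the roles of X1..X4
    y = X @ [2.0, -1.0, 0.5, 3.0] + rng.normal(size=n)

    full = sm.OLS(y, sm.add_constant(X)).fit()  # multiple regression of Y on X1..X4

    X123 = sm.add_constant(X[:, :3])
    r1 = sm.OLS(y, X123).fit().resid            # Y adjusted for X1, X2, X3
    r2 = sm.OLS(X[:, 3], X123).fit().resid      # X4 adjusted for X1, X2, X3
    av_slope = sm.OLS(r1, r2).fit().params[0]   # slope of the added variable plot
    # (no intercept needed: both residual sets already have mean zero)

    print(np.isclose(av_slope, full.params[4])) # True: params[4] is the X4 coefficient
                                                # (params[0] is the constant)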
• Illustration 1 : Where a regressor has a lot of information irrelevant to the
response
Consider the situation where one wants to infer the economic status of individuals, and
one of the regressors is consumption of electricity. It is well known that both extreme heat
and extreme cold can lead to an increase in this variable, regardless of the person's status.
Our economist has the idea that including the average temperature at the respondent's
location as a regressor could remove this effect. Is he right? And can the added variable plot
help us in this regard?
We simulate this situation below. We generate a bivariate random sample (X, Y) of
size 40 where the correlation between the variables is zero. One can think of these as the
average temperature at the respondent's location and the economic status index of 40
respondents. We then generate another variable Z as the linear combination of X and Y
given by 0.9X + 0.001Y. Thus our Z is almost totally swamped by the X variable, which is
really uncorrelated with Y, but it still has a hint of Y.
Running a regression of Y on Z does not pick up this little Y component in Z.

Model Summary of Y on Z

            Estimate   Std. Error   p-value
Intercept   0.046      0.133        0.73
Z           0.143      0.148        0.39

But look at the added variable plot of X below. It shows that X is highly significant. On the
other hand, the plot of the bare residuals from this regression against X shows no pattern at
all, leading one to the erroneous conclusion that X is not important!
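A minimal sketch of this simulation, assuming numpy and statsmodels (our choice of tools; the random draws differ from the original, so the exact numbers will not match the tables).

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 40
    X = rng.normal(size=n)          # average temperature (uncorrelated with Y)
    Y = rng.normal(size=n)          # economic status index
    Z = 0.9 * X + 0.001 * Y         # Z swamped by X, with a hint of Y

    print(sm.OLS(Y, sm.add_constant(Z)).fit().summary())   # Z alone: insignificant
    print(sm.OLS(Y, sm.add_constant(np.column_stack([Z, X]))).fit().summary())
    # With X added, both become enormously significant: by construction
    # Y = 1000*Z - 900*X holds exactly, which explains the huge coefficients
    # and tiny standard errors in the model summary reproduced next.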

Just for verification, we reproduce the model summary of Y on X and Z below. It can be
seen clearly that both are highly significant.
Model Summary of Y on Z and X

            Estimate        Std. Error      p-value
Intercept   2.4 x 10^-14    1.57 x 10^-14   0.135
Z           1000            1.91 x 10^-11   0 ***
X           -900            1.72 x 10^-11   0 ***

• Illustration 2 : The case of an omitted variable with a hint of new information


In economics, one frequently encounters the case of a hard-to-observe variable for which
only a poor proxy is available, in the sense that the proxy contains very little new
information on the unobserved variable, and its high correlation with the variables already in
the regression casts doubt on its usefulness.
For example, consider the case of an economist who wants to relate Expenditure to Income
for employees. He suspects that the variables of interest here are base salary, real assets and
performance incentives. However, only salary and real assets can be observed by our
economist. The performance incentives cannot be observed, but our friend wants to estimate
these from the income tax paid (as opposed to tax on assets). Assume further that income
from assets is negligible. Now income tax is highly correlated with salary and carries only a
little information on these incentives. Is it useful to include this tax variable?


This is the situation we simulate below. We generate a trivariate normal sample (X1, X2, X4) of
size 40. Here X1 and X2 can be thought of as salary and real assets, and are the variables already
in our model. X4 is our unobserved variable, namely performance incentives. We construct X3 as
the linear combination of X1 and X4 given by 0.95*X1 + 0.05*X4. We further generate Y as
3*X1 + 8*X2 + 0.5*X4, the "true" underlying relationship.
As a start we fit the model Y on X1 and X2. Both variables are highly significant in this regression.

Model Summary of Y on X1 and X2

            Estimate   Std. Error   p-value
Intercept   26.31      31.762       0.413
X1          3.36       0.397        3.55 x 10^-10 ***
X2          7.49       0.451        0 ***

The correlation of X3 with X1 is over 0.99. Is there any benefit to including X3 in the regression at
all? And can the added variable plot help us judge this?
Reproduced below is the plot of the raw residuals against the regressor X3. One could infer from
this plot that X3 is not important. What happens is that the residuals, being orthogonal to X1, are
nearly orthogonal to X3 as well, giving the appearance of no relationship.
But now looking at the added variable plot, we can immediately see that X3 definitely has
something to add. Effectively the added variable plot “strips away” the X1 from X3 and lets us see
the real contribution of X3.
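A sketch of this simulation in the same spirit, once more assuming numpy and statsmodels; the draws (and hence the exact numbers) differ from the original.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n = 40
    X1 = rng.normal(size=n)              # salary
    X2 = rng.normal(size=n)              # real assets
    X4 = rng.normal(size=n)              # unobserved performance incentives
    X3 = 0.95 * X1 + 0.05 * X4           # income tax: mostly salary, a hint of X4
    Y = 3 * X1 + 8 * X2 + 0.5 * X4       # the "true" relationship

    base = sm.add_constant(np.column_stack([X1, X2]))
    r1 = sm.OLS(Y, base).fit().resid     # raw residuals after X1 and X2
    r2 = sm.OLS(X3, base).fit().resid    # X3 with X1 and X2 stripped away

    print(np.corrcoef(X1, X3)[0, 1])     # about 0.999: X3 looks redundant
    print(np.corrcoef(r1, X3)[0, 1])     # near zero: the raw residual plot misleads
    print(np.corrcoef(r1, r2)[0, 1])     # 1.0: the AV plot exposes the X4 component
    # (exactly 1 here because Y has no noise beyond 0.5*X4, so r1 = 10 * r2)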

• Illustration 3 : The case of an outlying point in the X-space


It can sometimes happen that an observation is far out in the space of regressors, but still
exhibits the same relationship between the response on one hand and the set of regressors
on the other.
For example, consider again the task of estimating expenditure from income. Most of the
group under study has similar incomes, but one of the respondents, say the star performer,
earns much more than everyone else, while still having expenditures in roughly the same
ratio as the others. In such cases, one can show that the added variable plot still presents the
true picture, while the plot of residuals against the raw regressor can present a misleading
one.
To be concrete, we simulate 39 rows of data from a bivariate normal distribution,
say X1 and X2, with means of 12 and 3 respectively. One can think of these as the base
salary and the standard perks being paid out. We then generate a third random variable X3
as a linear combination of these two plus some random noise. This variable can be thought
of as the performance incentive. We then add a 40th row (X1, X2, X3) where X1 and X2 are
now drawn with means of 20 and 10 respectively, but X3 is still the same linear combination
as before. One can think of this row as the "star performer" row. Finally we generate the
response Y as a linear combination of X1, X2 and X3, again with random noise. One can
think of Y as the expenditure.
A regression of Y on X1 and X2 has an R2 of 0.83 with both X1 and X2 significant. X3 is
only moderately correlated with X1 and X2, and is the natural candidate to enter into the
regression next. We want to evaluate this using both the plots at our disposal.
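A minimal sketch of this setup, again assuming numpy and statsmodels; the coefficients and noise scales below are illustrative stand-ins for the unstated originals.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 39
    X1 = rng.normal(12, 1, size=n)                    # base salary
    X2 = rng.normal(3, 1, size=n)                     # standard perks
    X3 = 2 * X1 + 3 * X2 + rng.normal(0, 1, size=n)   # performance incentive

    # 40th row: the "star performer", far out in X-space but on the same relationship
    X1 = np.append(X1, rng.normal(20, 1))
    X2 = np.append(X2, rng.normal(10, 1))
    X3 = np.append(X3, 2 * X1[-1] + 3 * X2[-1] + rng.normal())

    Y = 1.0 * X1 + 2.0 * X2 + 1.5 * X3 + rng.normal(0, 1, size=40)   # expenditure

    base = sm.add_constant(np.column_stack([X1, X2]))
    r1 = sm.OLS(Y, base).fit().resid
    r2 = sm.OLS(X3, base).fit().resid
    # Plotting r1 against X3 shows the star performer as an isolated point far to
    # the right; plotting r1 against r2 (the AV plot) pulls it back into line.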

Looking at the two plots, one can immediately notice the isolated point at the middle of the
right edge in the plot on the left. Knowing, however, that all points exhibit the same
relationship between the regressors and the response, it is the added variable plot which
gives us the correct picture here.

Indian School of Business, Hyderabad
