• Theorem :
The relationship observed in the added variable plot is always at least as strong as
that observed in the plot of residuals against the new regressor.
Proof:
Let the response be denoted by Y, which we regress on k regressors contained in the data
matrix X. We wish to bring in a new regressor u. Assume without loss of generality that each
variable has zero mean and unit variance, and that they are jointly multivariate normal:
    Z = (X', u, Y)' ~ N(0, Σ),    Σ = [ Σ_XX   Σ_Xu   Σ_XY ]
                                      [ Σ_uX    1     σ_uY ]
                                      [ Σ_YX   σ_Yu    1   ]

where X is k x 1, Σ_XX is k x k, Σ_Xu and Σ_XY are k x 1, and σ_uY = σ_Yu is a scalar.
The residual from regressing Y on X is
    e = Y − Σ_YX Σ_XX^{-1} X
with variance
    Var(e) = 1 − Σ_YX Σ_XX^{-1} Σ_XY ,
and the residual from regressing u on X is
    f = u − Σ_uX Σ_XX^{-1} X
with variance
    Var(f) = 1 − Σ_uX Σ_XX^{-1} Σ_Xu .
Now compare the two correlation coefficients. Since e is orthogonal to X, we have
Cov(e, u) = Cov(e, f), so both coefficients share the same numerator. Corr(e, f), however,
carries √Var(f) in its denominator where Corr(e, u) carries √Var(u) = 1, and
Var(f) = 1 − Σ_uX Σ_XX^{-1} Σ_Xu is one minus the squared multiple correlation of u with
the existing regressors X, hence at most 1. Therefore |Corr(e, f)| ≥ |Corr(e, u)|: the added
variable plot never shows a weaker relationship than the plot of residuals against u itself.
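The inequality is easy to check numerically. Below is a small sketch in Python with numpy; the sample size, loadings, and noise levels are arbitrary illustrative choices, not anything prescribed by the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate X (k regressors), a new regressor u correlated with X,
# and a response y depending on both. All loadings are arbitrary.
n, k = 10_000, 3
X = rng.standard_normal((n, k))
u = X @ np.array([0.5, 0.3, -0.2]) + 0.4 * rng.standard_normal(n)
y = X @ np.array([1.0, -0.5, 0.2]) + 0.8 * u + rng.standard_normal(n)

def residuals(target, X):
    """Residuals of an OLS fit of target on X (intercept included)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return target - A @ beta

e = residuals(y, X)  # part of y unexplained by X
f = residuals(u, X)  # part of u unexplained by X

corr_ef = np.corrcoef(e, f)[0, 1]  # added variable plot
corr_eu = np.corrcoef(e, u)[0, 1]  # residuals vs the raw regressor

# e is orthogonal to X, so both correlations share the same numerator,
# while sd(f) <= sd(u); the added variable plot correlation is larger.
print(abs(corr_ef) >= abs(corr_eu))  # True
```

The orthogonality of e to the columns of X holds exactly in-sample, so the inequality is not merely approximate here.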
Another way to look at this is the following. When you regress Y on X1, X2 and X3, what
you have left as the residual is the part of Y unexplained by X1, X2 and X3 for which you
seek an explanation from X4. But in a regression setup, X4 as a whole will not contribute –
instead what will contribute is the part of X4 unexplained by X1, X2 and X3. That part is
precisely the residual of X4 regressed on X1, X2 and X3. Hence the added variable plot is the
“true” picture of the contribution of X4 if it is entered.
There is also a layman's explanation. Suppose we have the regression we talked about
previously, namely Y on X1, X2, X3 and X4. We want to fit a multiple regression to this
data, but let us suppose that we have unfortunately forgotten how to do it. Luckily,
however, we do remember our simple regression. Is there a way to proceed?
Yes! Proceed as follows : regress Y on X1, X2 and X3 and save the residuals, say e. Then
regress X4 on X1, X2 and X3 and save those residuals, say f. A simple regression of e on f
now yields exactly the coefficient X4 would receive in the full multiple regression, and
repeating the recipe with the roles of the regressors interchanged recovers each of the
other coefficients as well. Your multiple regression, rebuilt from simple ones!
Seen this way, an added variable plot is nothing but taking apart the black box of multiple
regression and looking inside.
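This residual-on-residual recipe (the Frisch–Waugh–Lovell result) can be verified in a few lines of Python with numpy; the design matrix and coefficients below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Four correlated regressors and a response; all numbers are illustrative.
n = 200
X = rng.standard_normal((n, 4)) @ rng.standard_normal((4, 4))
y = X @ np.array([3.0, 8.0, 0.5, 1.5]) + rng.standard_normal(n)

def ols(target, design):
    """OLS coefficients (intercept first) of target on the given columns."""
    A = np.column_stack([np.ones(len(target)), design])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return beta

# Route 1: the full multiple regression; X4's coefficient is beta_full[4].
beta_full = ols(y, X)

# Route 2: residuals of y on X1..X3, residuals of X4 on X1..X3,
# then one simple regression of the first set on the second.
X123 = X[:, :3]
A123 = np.column_stack([np.ones(n), X123])
e = y - A123 @ ols(y, X123)              # y stripped of X1..X3
f = X[:, 3] - A123 @ ols(X[:, 3], X123)  # X4 stripped of X1..X3
slope = (e @ f) / (f @ f)                # slope of the added variable plot

print(np.isclose(beta_full[4], slope))  # True: the two routes agree
```

The slope of the added variable plot is thus literally the multiple-regression coefficient of the entering variable.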
• Illustration 1 : Where a regressor has a lot of information irrelevant to the
response
Consider the situation where one wants to infer the economic status of individuals, and
one of the regressors is consumption of electricity. It is well known that both extreme heat
and extreme cold can increase this variable, regardless of the person's status. Our
economist has the idea that including the average temperature at the respondent's location as
a regressor could remove this effect. Is he right? And can the added variable plot help us in
this regard?
Model Summary of Y on Z

            Estimate   Std. Error   p-value
Intercept   0.046      0.133        0.73
Z           0.143      0.148        0.39
We simulate the same situation below. We generate a bivariate random sample (X,Y) of
size 40 where the correlation between the variables is zero. One can think of these as the
average temperature at the respondent's location and the economic status index of 40
respondents. We then generate another variable Z as a linear combination of X and Y as
0.9X + 0.001Y. Thus our Z is almost totally swamped by the X variable which is really
uncorrelated with Y, but it still has a hint of Y.
Running a regression of Y on Z does not pick up this little Y component in Z. But look at
the added variable plot of X below. It shows that X is highly significant. On the other hand,
plotting the raw residuals from this regression against X shows no pattern at all, leading one
to the erroneous conclusion that X is not important!
Just for verification, we reproduce the model summary of Y on X and Z below. It can be
seen clearly that both are highly significant.
Model Summary of Y on Z and X

            Estimate        Std. Error      p-value
Intercept   2.4 x 10^-14    1.57 x 10^-14   0.135
Z           1000            1.91 x 10^-11   0 ***
X           -900            1.72 x 10^-11   0 ***
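This simulation is easy to reproduce in outline. The sketch below follows the recipe in the text (n = 40, Z = 0.9X + 0.001Y); the seed and the use of numpy are my own choices, so the exact figures will differ from the tables above.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed; the tables used other draws

# 40 respondents: temperature X and economic status Y, independent by design.
n = 40
X = rng.standard_normal(n)
Y = rng.standard_normal(n)
Z = 0.9 * X + 0.001 * Y  # electricity use: swamped by X, with a hint of Y

def fit(target, *cols):
    """OLS with intercept; returns (coefficients, residuals)."""
    A = np.column_stack([np.ones(len(target)), *cols])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return beta, target - A @ beta

# Y on Z alone: Z appears to explain almost nothing.
_, e_z = fit(Y, Z)
r2_z = 1 - (e_z @ e_z) / ((Y - Y.mean()) @ (Y - Y.mean()))

# Y on Z and X together: Y = 1000*Z - 900*X holds identically,
# so the joint fit recovers those coefficients (up to rounding).
beta_zx, _ = fit(Y, Z, X)
print(round(r2_z, 3), np.round(beta_zx[1:], 3))
```

Inverting Z = 0.9X + 0.001Y gives Y = 1000Z − 900X exactly, which is why the joint fit is essentially perfect while the marginal fit of Y on Z sees almost nothing.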
This is the situation we simulate below. We generate a trivariate normal sample (X1,X2,X4) of size
40. Here X1 and X2 can be thought of as salary and real assets and are variables already in our
model. X4 is our unobserved variable, namely performance incentives. We construct X3 as a
linear combination of X1 and X4 as 0.95*X1 + 0.05*X4. We further generate Y as 3*X1 + 8*X2 +
0.5*X4 as the “true” underlying relationship.
As a start we fit the model Y on X1 and X2. Both variables are highly significant in this regression.
The correlation of X3 with X1 is over 0.99. Is there any benefit to including X3 in the
regression at all? And can the added variable plot help us judge this?
Reproduced below is the plot of the raw residuals against the regressor X3. One could infer
from this plot that X3 is not important. What happens is that the residuals, being orthogonal
to X1, are nearly orthogonal to X3 as well, giving the appearance of no relationship.
But now looking at the added variable plot, we can immediately see that X3 definitely has
something to add. Effectively the added variable plot “strips away” the X1 from X3 and lets us see
the real contribution of X3.
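The contrast can be quantified: the raw-residual plot's correlation here is tiny, while the added variable plot's is essentially perfect. A sketch in Python with numpy follows; the correlation structure of X1, X2, X4 and the seed are my own assumptions, while the combinations 0.95*X1 + 0.05*X4 and 3*X1 + 8*X2 + 0.5*X4 are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed

# Trivariate sample: X1, X2 are in the model, X4 is unobserved.
# The covariance matrix below is an assumption of this sketch.
n = 40
cov = np.array([[1.0, 0.3, 0.4],
                [0.3, 1.0, 0.2],
                [0.4, 0.2, 1.0]])
X1, X2, X4 = rng.multivariate_normal(np.zeros(3), cov, size=n).T
X3 = 0.95 * X1 + 0.05 * X4      # almost collinear with X1
Y = 3 * X1 + 8 * X2 + 0.5 * X4  # the "true" relationship

def resid(target, *cols):
    """Residuals of an OLS fit (intercept included)."""
    A = np.column_stack([np.ones(len(target)), *cols])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return target - A @ beta

e = resid(Y, X1, X2)    # what X1 and X2 leave unexplained in Y
f3 = resid(X3, X1, X2)  # the part of X3 not already carried by X1, X2

print(np.corrcoef(e, X3)[0, 1])  # raw plot: near zero
print(np.corrcoef(e, f3)[0, 1])  # added variable plot: 1.0 here
```

Since Y has no noise term in this construction, both e and f3 are exact multiples of the residual of X4 on X1 and X2, which is why the added variable plot correlation comes out as exactly one.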
earns much higher than everyone else, but still has expenditures in roughly the same ratio as
the others. In such cases, one can show that the added variable plot still presents the true
picture, but the plot of residuals against the raw regressor can present a misleading one.
For something concrete, we simulated 39 rows of data from a bivariate normal distribution,
say X1 and X2 with means of 12 and 3 respectively. One can think of these as the Base
Salary and the standard perks being paid out. We then generate a third random variable X3
as a linear combination of these two with some random noise. This variable can be thought
of as the performance incentive. We then add a 40th row (X1, X2, X3) where the means of
X1 and X2 are now 20 and 10 respectively, but X3 still is the same linear combination as
before. One can think of this row as the “star performer” row. Finally we generate the
response Y as a linear combination of X1, X2 and X3 again with random noise. One can
think of Y as the expenditure.
A regression of Y on X1 and X2 has an R2 of 0.83 with both X1 and X2 significant. X3 is
only moderately correlated with X1 and X2, and is the natural candidate to enter into the
regression next. We want to evaluate this using both the plots at our disposal.
Looking at the two plots, one can immediately notice the isolated point in the middle of the
right edge in the plot on the left. Knowing, however, that all points share the same
relationship between the regressors and the response, it is the added variable plot which
gives us the correct picture here.
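A sketch of this simulation in Python with numpy; the sample sizes and means follow the description above, but the particular linear combinations and noise levels are my own assumptions, since the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

# 39 ordinary employees, then one "star performer" as row 40.
X1 = np.append(12 + rng.standard_normal(39), 20 + rng.standard_normal())
X2 = np.append(3 + rng.standard_normal(39), 10 + rng.standard_normal())
# Performance incentive: the same linear rule for everyone (weights assumed).
X3 = 2 * X1 + X2 + rng.standard_normal(40)
# Expenditure: again one rule for all 40 rows (weights assumed).
Y = X1 + 2 * X2 + 3 * X3 + rng.standard_normal(40)

def resid(target, *cols):
    """Residuals of an OLS fit (intercept included)."""
    A = np.column_stack([np.ones(len(target)), *cols])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return target - A @ beta

e = resid(Y, X1, X2)   # residuals of Y on X1, X2
f = resid(X3, X1, X2)  # residuals of X3 on X1, X2

# The star row is extreme in raw X3 ...
print(int(np.argmax(X3)))  # 39 (0-based): the star performer
# ... but on the added variable plot it follows the same e-vs-f line as
# everyone else, so that plot's correlation is at least as strong.
print(abs(np.corrcoef(e, f)[0, 1]) >= abs(np.corrcoef(e, X3)[0, 1]))  # True
```

Because the star row obeys the same generating rule as the rest, it sits on the common line in the added variable plot rather than appearing as an outlier, whereas in the raw-residual plot it stands isolated.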