
Multivariate Linear Regression (2)

Prediction
• We do regression because we want to carry out predictions
▶ what is the predicted response variable for a given new value of the
covariates?
▶ with xnew = (x1,new , . . . , xp,new ) the response is

ynew = β0 + β1 x1,new + . . . + βp xp,new + εnew

where the systematic part β0 + β1 x1,new + . . . + βp xp,new equals E [Ynew |X = xnew ]

▶ ybnew = estimate of E [Ynew |X = xnew ]

• But how confident can we be in our predictions?


▶ regression coefficients are estimates ⇝ uncertainty in βb ⇝ confidence limits
▶ even with the “true” model, the regression model does not allow exact predictions ⇝ uncertainty in εnew ⇝ prediction limits

Confidence and prediction intervals

• Under the normality assumption, confidence and prediction limits can


be obtained

▶ confidence limits for E [Ynew |X = xnew ]:


ybnew ± tn−p−1,α/2 sε √[ (1, xnew )T (X T X )−1 (1, xnew ) ]

▶ wider prediction limits for Ynew :


ybnew ± tn−p−1,α/2 sε √[ 1 + (1, xnew )T (X T X )−1 (1, xnew ) ]

• Alternatively, simulation methods (⇝ bootstrapping) can be used for both ⇝ extends easily to more complicated predictions
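As an illustration, the two formulas above can be evaluated directly with numpy; the data below are simulated and all names are placeholders, so this is a sketch of the computation, not a reference implementation.

```python
import numpy as np
from scipy import stats

# simulated data standing in for a real design matrix (illustrative only)
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
Xd = np.column_stack([np.ones(n), X])              # design with intercept column
y = Xd @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # OLS estimates
resid = y - Xd @ beta_hat
s_eps = np.sqrt(resid @ resid / (n - p - 1))       # residual standard error

x_new = np.array([1.0, 0.3, -0.2])                 # (1, x_new)
h = x_new @ np.linalg.solve(Xd.T @ Xd, x_new)      # (1,x_new)^T (X^T X)^{-1} (1,x_new)
t_crit = stats.t.ppf(0.975, df=n - p - 1)          # t_{n-p-1, alpha/2}, alpha = 5%
y_hat = x_new @ beta_hat

ci = (y_hat - t_crit * s_eps * np.sqrt(h),         # confidence limits for E[Y_new | x_new]
      y_hat + t_crit * s_eps * np.sqrt(h))
pi = (y_hat - t_crit * s_eps * np.sqrt(1 + h),     # wider prediction limits for Y_new
      y_hat + t_crit * s_eps * np.sqrt(1 + h))
```

The prediction interval always contains the confidence interval, since its half-width adds the irreducible noise term 1 under the square root.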

Categorical Variables

• So far covariates were assumed to be quantitative variables ⇝ the


product β · X makes sense

• frequently covariates are categorical: qualitative variables

▶ binary variables

▶ discrete variables taking a few values (eg counts)

▶ ordered or not

▶ continuous variables converted to categorical by dividing their range into bins

• the response must always be quantitative!

. . . Categorical Variables

• Values taken by a categorical variable: levels;


how can a categorical variable be inserted in a regression equation?

▶ X categorical with levels l1 , . . . , lg : one β for each level is required

▶ dummy coding approach: choose a comparison/baseline/control level,


say l1

▶ the equation is (say with only one covariate)

E [Y |X = l1 ] = β1
E [Y |X = lk ] = β1 + βk , k = 2, . . . , g

⇝ β1 = average response when the baseline level l1 applies


⇝ βk = change in the average response wrt the baseline level when
level lk , k > 1 applies
. . . Categorical Variables

• Formally, write the model using indicators:

Yi = β1 + β2 1{Xi =l2 } + . . . + βg 1{Xi =lg } + εi , i = 1, . . . , n

with indicator covariates Zik = 1{Xi =lk } , k = 2, . . . , g

or in short
Yi = β1 + βk + εi if level lk , k > 1, applies (Yi = β1 + εi at the baseline)
⇝ linearity always holds for such a model

▶ one categorical variable ⇝ multiple covariates

▶ intercept = baseline

▶ baseline: most important level/one for which most observations are


available
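A minimal numpy sketch of this encoding (data and level labels invented for illustration): with only indicator covariates, the fitted coefficients are exactly the baseline group mean and the shifts of the other groups.

```python
import numpy as np

# toy data: categorical covariate with g = 3 levels, baseline l1 (illustrative)
levels = np.array(["l1", "l2", "l3", "l2", "l1", "l3", "l3", "l2"])
y = np.array([1.0, 2.1, 3.2, 1.9, 0.8, 3.1, 2.9, 2.0])

Z2 = (levels == "l2").astype(float)        # Z_i2 = 1{X_i = l2}
Z3 = (levels == "l3").astype(float)        # Z_i3 = 1{X_i = l3}
Xd = np.column_stack([np.ones(len(y)), Z2, Z3])

beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
# beta_hat[0]: average response at baseline l1
# beta_hat[1], beta_hat[2]: changes of l2, l3 relative to l1
```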

. . . Categorical Variables

• When a categorical variable X1 (g levels l1 , . . . , lg ) is combined with


other covariates

▶ with a quantitative covariate X2 (say with slope γ, common to all levels):

E [Y |Xi1 = lk , Xi2 = x2 ] = β1 + βk + γ x2 ,

k = 2, . . . , g ⇝ change of intercept

▶ with a categorical covariate X2 (f levels s1 , . . . , sf , with shifts δh for its levels)

E [Y |Xi1 = lk , Xi2 = sh ] = β1 + βk + δh ,

k = 2, . . . , g , h = 2, . . . , f
⇝ here l1 and s1 are the baseline levels for X1 and X2 respectively

▶ similarly with combinations of categorical/quantitative


. . . Categorical Variables
• Example. Data on selling price of 100 homes; response: price in $;
covariates (p = 5):
▶ size: size in ft2
▶ tax: annual tax bill in $
▶ beds: number of bedrooms
▶ baths: number of bathrooms
▶ new: 1 if house is new and 0 if old

price   size  tax   beds  baths  new
279900  2048  3104  4     2      0
146500   912  1173  2     1      0
237700  1654  3076  4     2      0
499900  3153  2997  3     2      1
200000  2068  1608  3     2      0
...

. . . Categorical Variables
• Model is

pricei = β0 + β1 · sizei + β2 · taxi + β3 · bedsi + β4 · bathsi + β5 · newi + εi
       = β0 + β1 · sizei + β2 · taxi + β3 · bedsi + β4 · bathsi + εi          if house is old
       = (β0 + β5 ) + β1 · sizei + β2 · taxi + β3 · bedsi + β4 · bathsi + εi  if house is new

▶ adjustment to the intercept for new homes


▶ what if the number of bathrooms is treated as categorical with values
0, 1, 2?
. . . Categorical Variables

• Result:

βbi SE t p-value
(Intercept) 4526.00 24470.00 0.1849 8.531 · 10−1
size 68.35 13.94 4.9040 3.916 · 10−6
tax 38.14 6.81 5.5960 2.158 · 10−7
beds -11260.00 9115.00 -1.2350 2.198 · 10−1
baths -2114.00 11470.00 -0.1844 8.541 · 10−1
new 41710.00 16890.00 2.4700 1.531 · 10−2

▶ R 2 = 0.793

▶ intercept, beds and baths not significant at 5% level (p-values > 0.05)

Inference for Multiple Coefficients

• Test a null hypothesis about a subset of the regression slopes (say the
first 1 ≤ q < p)

▶ H0 : β1 = . . . = βq = 0; vs

▶ H1 : at least one of β1 , . . . , βq is ̸= 0

• Note:

▶ “overall significance test” is a special case

▶ useful to compare nested models

▶ drop more than one covariate at a time: significance test on categorical


variables (multiple covariates)
. . . Inference for Multiple Coefficients

• Parametric statistics: model M0 is nested in model M1 (full model) if


it can be obtained from M1 by constraining some parameters

• In the context of the null and alternative hypotheses

▶ regressing against covariates q + 1, . . . , p (model M0 ) nested in


regressing against all covariates 1, . . . , p (M1 )

▶ let RSSk , RegSSk , Rk2 , k = 0, 1 the RSS, RegSS and R squared under
the two models

▶ clearly RSS0 ≥ RSS1 and RegSS0 ≤ RegSS1 since

RSS0 + RegSS0 = RSS1 + RegSS1 = TSS

further R02 ≤ R12

. . . Inference for Multiple Coefficients

• Test a null hypothesis about a subset of the regression slopes

▶ if H0 is false ⇝ RSS0 − RSS1 = RegSS1 − RegSS0 should be large

▶ the test statistic is (under H0 )

F0 = [(RegSS1 − RegSS0 )/q] / [RSS1 /(n − p − 1)] = [(RSS0 − RSS1 )/q] / [RSS1 /(n − p − 1)] ∼ Fq,n−p−1

reject H0 if F0 > Fq,n−p−1,1−α

• Alternatively, a general purpose likelihood ratio test for nested models


can be used ⇝ see GLM
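The test can be packaged as a small helper; a sketch, with names of my own choosing (nested_f_test is not a standard library function):

```python
from scipy import stats

def nested_f_test(rss0, rss1, q, n, p, alpha=0.05):
    """F test of H0: the q coefficients dropped in the nested model are all 0.

    rss0, rss1: residual sums of squares of the nested and full model;
    p: number of covariates in the full model; n: sample size.
    """
    f0 = ((rss0 - rss1) / q) / (rss1 / (n - p - 1))
    p_value = stats.f.sf(f0, q, n - p - 1)   # P(F_{q, n-p-1} > f0)
    return f0, p_value, p_value < alpha      # last entry: reject H0?
```

For example, with illustrative numbers rss0 = 120, rss1 = 100, q = 2, n = 100, p = 5, this gives F0 = (20/2)/(100/94) = 9.4, well beyond the usual critical values.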

. . . Model Building

• Example. Home prices: fit model with baths and beds removed:
R 2 = 0.79

βbi SE t p-value
(Intercept) -21350.00 13310.00 -1.604 1.12 · 10−1
size 61.70 12.50 4.937 3.34 · 10−6
tax 37.23 6.73 5.528 2.78 · 10−7
new 46370.00 16460.00 2.818 5.87 · 10−3

• H0 : β3 = β4 = 0
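Since RSS = (1 − R2)·TSS, the F statistic can also be computed from the two R2 values; a rough check using the rounded figures reported on these slides (so the result is approximate):

```python
from scipy import stats

# rounded values from the slides: R^2 = 0.790 without beds/baths, 0.793 with them
n, p, q = 100, 5, 2                     # p counts covariates of the full model
r2_0, r2_1 = 0.790, 0.793

# F0 = [(RegSS1 - RegSS0)/q] / [RSS1/(n - p - 1)], using RSS = (1 - R^2) * TSS
f0 = ((r2_1 - r2_0) / q) / ((1 - r2_1) / (n - p - 1))
p_value = stats.f.sf(f0, q, n - p - 1)
# f0 is well below 1 here, so beds and baths can be dropped jointly
```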

Variables Interaction

• When combining the variables linearly

E [Y |X ] = β0 + β1 X1 + . . . + βp Xp

▶ a unit increase in Xj ⇝ E [Y |X ] changes by βj , regardless of the other covariates (main effect)
▶ each regressor slope does not depend on the other regressors

• Interaction: the effect of each regressor depends on the other


regressors
▶ add product of covariates
▶ notation: main effects +; interaction ×
▶ when building models, always have the main effect for comparison
purposes

. . . Variables Interaction

• Example. House prices


▶ the categorical variable new changes the intercept
▶ what if we suspect that old houses gain less value by being more
sizeable, compared to new houses?
▶ we can model this situation with an interaction term

pricei = β0 + β1 · sizei + β2 · taxi + β3 · newi + β4 · sizei × newi + εi
       = β0 + β1 · sizei + β2 · taxi + εi                       if house is old
       = (β0 + β3 ) + (β1 + β4 ) · sizei + β2 · taxi + εi       if house is new

⇝ adjustment to both intercept and slope
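A sketch of fitting such an interaction with numpy: the product column sizei × newi is just an extra regressor. The data below are synthetic stand-ins for the house-price variables, not the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
size = rng.uniform(800, 3500, n)
tax = rng.uniform(1000, 3500, n)
new = rng.integers(0, 2, n).astype(float)

# synthetic "truth": new homes gain an extra 60 $/ft^2 (slope change beta4 = 60)
price = (5000 + 50 * size + 35 * tax + 40000 * new + 60 * size * new
         + rng.normal(scale=5000, size=n))

# design: intercept, main effects, and the interaction (product) column
Xd = np.column_stack([np.ones(n), size, tax, new, size * new])
beta_hat, *_ = np.linalg.lstsq(Xd, price, rcond=None)
# beta_hat[3], beta_hat[4]: intercept and slope adjustments for new homes
```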

. . . Variables Interaction

• Interaction ≡ product: the effect of one variable depends on the value of another; for covariates X1 , X2
▶ both X1 , X2 quantitative:

E [Y |X1 = x1 , X2 = x2 ] = β0 + β1 x1 + β2 x2 + β3 x1 x2

▶ X1 categorical with levels l1 , . . . , lg , X2 quantitative (baseline slope γ, level-specific slope changes γk ):

E [Y |X1 = lk , X2 = x2 ] = (β1 + βk ) + (γ + γk ) x2 ,

k = 2, . . . , g
▶ both categorical, X1 with levels l1 , . . . , lg and X2 with levels s1 , . . . , sf (level shifts δh , interaction terms ηkh ):

E [Y |X1 = lk , X2 = sh ] = β1 + βk + δh + ηkh ,

k = 2, . . . , g , h = 2, . . . , f

. . . Variables Interaction

• Example. House prices: fit model with interaction between size and
new: R 2 = 0.8168

βbi SE t p-value
(Intercept) -365.70 13680.00 -0.0267 9.78 · 10−1
size 46.29 12.42 3.7260 3.30 · 10−4
tax 38.81 6.33 6.1290 2.00 · 10−8
new -106900.00 43650.00 -2.4490 1.61 · 10−2
new×size 69.43 18.50 3.7540 2.99 · 10−4

. . . Variables Interaction

• R 2 and adjusted R 2 (Re2 )

▶ R 2 never decreases as explanatory variables are added, even non-significant ones! ⇝ not a good criterion for model selection

▶ keeping non-significant explanatory variables may not improve


predictions and makes the model unnecessarily complex

covariates R2 Re2
size, tax, beds, baths, new 0.793 0.782
size, tax, beds, new 0.793 0.785
size, tax, new 0.790 0.783
size, tax, new, size×new 0.817 0.809
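The adjustment only needs R2, n and the number of covariates p; a one-line helper (of my own naming) reproduces the table's last column:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2: 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# last row of the table: size, tax, new, size x new (p = 4 covariates, n = 100)
adj = adjusted_r2(0.817, n=100, p=4)    # ~0.809, matching the table
```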

Model Building

• What variables to use?

▶ quantitative

▶ categorical

▶ interactions ⇝ joint effects

▶ transformation of variables

• Is the model good?

▶ statistical significance

▶ practical significance (effect size)

▶ simplicity v complexity

▶ model assumptions

. . . Model Building
• With k explanatory variables,
▶ 2k possible models arise from including/omitting each covariate
▶ different ways to select a model exist

• Backward search:
▶ fit model with all k covariates
▶ remove one variable at a time according to some criterion; stop when
the criterion no longer applies

• Forward search
▶ fit model with no covariates
▶ add one variable at a time according to some criterion; stop when the
criterion no longer applies

• Criterion: eg based on |t|-value

. . . Model Building

• Stepwise regression algorithm: combine backward and forward


1 consider all possible k regressions using one explanatory variable
2 for each regression in (1), compute the t-value for the slope; choose
the variable with the largest |t| provided it exceeds a pre-specified
value; otherwise halt the process
3 add to the model in (2) the variable based on the largest significant
contribution in terms of |t|, provided it exceeds a pre-specified value;
otherwise halt the process
4 delete from the model in (3) the variable with the smallest
contribution in terms of |t|, provided it is lower than a pre-specified
value; otherwise halt the process
5 repeat steps (3) and (4) until all possible additions and deletions are
performed.
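A minimal sketch of the forward part of this procedure (steps (1)-(3)) on synthetic data; the helper names and the stopping threshold t_enter are my own choices, not part of any standard API.

```python
import numpy as np

def t_values(Xd, y):
    """OLS t statistics for every column of the design matrix Xd."""
    n, k = Xd.shape
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - k)             # residual variance estimate
    cov = s2 * np.linalg.inv(Xd.T @ Xd)      # estimated Cov(beta_hat)
    return beta / np.sqrt(np.diag(cov))

def forward_select(X, y, t_enter=2.0):
    """Add one covariate at a time, choosing the largest significant |t|."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        best_j, best_t = None, 0.0
        for j in remaining:
            cols = [np.ones(n)] + [X[:, s] for s in selected] + [X[:, j]]
            t_j = abs(t_values(np.column_stack(cols), y)[-1])
            if t_j > best_t:
                best_j, best_t = j, t_j
        if best_t < t_enter:
            break                            # no significant addition: halt
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

The backward step (4) would work analogously, removing the selected covariate with the smallest |t| when it falls below a deletion threshold.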

