
Multivariate Linear Regression (2)

Prediction
• We do regression because we want to carry out predictions
▶ what is the predicted response variable for a given new value of the
covariates?
▶ with xnew = (x1,new , . . . , xp,new ) the response is

ynew = β0 + β1 x1,new + . . . + βp xp,new + εnew

where the systematic part β0 + β1 x1,new + . . . + βp xp,new equals E [Ynew |X = xnew ]

▶ ybnew = estimate of E [Ynew |X = xnew ]

• But how confident can we be in our predictions?


▶ regression coefficients are estimates ⇝ uncertainty in βb ⇝ confidence limits
▶ even with the “true” model, the regression model does not allow exact predictions ⇝ uncertainty in εnew ⇝ prediction limits

Confidence and prediction intervals

• Under the normality assumption, confidence and prediction limits can


be obtained

▶ confidence limits for E [Ynew |X = xnew ]:


ybnew ± tn−p−1,α/2 sε √[ (1, xnew )T (X T X )−1 (1, xnew ) ]

▶ wider prediction limits for Ynew :


ybnew ± tn−p−1,α/2 sε √[ 1 + (1, xnew )T (X T X )−1 (1, xnew ) ]

• Alternatively, simulation methods (⇝ bootstrapping) can be used for both ⇝ extends easily to more complicated predictions
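As an illustration, the two formulas above can be evaluated directly with numpy; the data below are simulated and all names are placeholders, so this is a sketch of the computation, not a reference implementation.

```python
import numpy as np
from scipy import stats

# simulated data standing in for a real design matrix (illustrative only)
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
Xd = np.column_stack([np.ones(n), X])              # design with intercept column
y = Xd @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # OLS estimates
resid = y - Xd @ beta_hat
s_eps = np.sqrt(resid @ resid / (n - p - 1))       # residual standard error

x_new = np.array([1.0, 0.3, -0.2])                 # (1, x_new)
h = x_new @ np.linalg.solve(Xd.T @ Xd, x_new)      # (1,x_new)^T (X^T X)^{-1} (1,x_new)
t_crit = stats.t.ppf(0.975, df=n - p - 1)          # t_{n-p-1, alpha/2}, alpha = 5%
y_hat = x_new @ beta_hat

ci = (y_hat - t_crit * s_eps * np.sqrt(h),         # confidence limits for E[Y_new | x_new]
      y_hat + t_crit * s_eps * np.sqrt(h))
pi = (y_hat - t_crit * s_eps * np.sqrt(1 + h),     # wider prediction limits for Y_new
      y_hat + t_crit * s_eps * np.sqrt(1 + h))
```

The prediction interval always contains the confidence interval, since its half-width adds the irreducible noise term 1 under the square root.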

Categorical Variables

• So far covariates were assumed to be quantitative variables ⇝ the


product β · X makes sense

• frequently covariates are categorical: qualitative variables

▶ binary variables

▶ discrete variables taking a few values (eg counts)

▶ ordered or not

▶ continuous variables converted to categorical by dividing their range into bins

• the response must always be quantitative!

. . . Categorical Variables

• Values taken by a categorical variable: levels;


how can a categorical variable be inserted in a regression equation?

▶ X categorical with levels l1 , . . . , lg : one β for each level is required

▶ dummy coding approach: choose a comparison/baseline/control level,


say l1

▶ the equation is (say with only one covariate)

E [Y |X = l1 ] = β1
E [Y |X = lk ] = β1 + βk , k = 2, . . . , g

⇝ β1 = average response when the baseline level l1 applies


⇝ βk = change in the average response wrt the baseline level when
level lk , k > 1 applies
. . . Categorical Variables

• Formally, write the model using indicators:

Yi = β1 + β2 1{Xi =l2 } + . . . + βg 1{Xi =lg } + εi , i = 1, . . . , n

with indicator covariates Zik = 1{Xi =lk } , k = 2, . . . , g

or in short
Yi = β1 + βk + εi if level lk , k > 1, applies (Yi = β1 + εi at the baseline)
⇝ linearity always holds for such a model

▶ one categorical variable ⇝ multiple covariates

▶ intercept = baseline

▶ baseline: most important level/one for which most observations are


available
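A minimal numpy sketch of this encoding (data and level labels invented for illustration): with only indicator covariates, the fitted coefficients are exactly the baseline group mean and the shifts of the other groups.

```python
import numpy as np

# toy data: categorical covariate with g = 3 levels, baseline l1 (illustrative)
levels = np.array(["l1", "l2", "l3", "l2", "l1", "l3", "l3", "l2"])
y = np.array([1.0, 2.1, 3.2, 1.9, 0.8, 3.1, 2.9, 2.0])

Z2 = (levels == "l2").astype(float)        # Z_i2 = 1{X_i = l2}
Z3 = (levels == "l3").astype(float)        # Z_i3 = 1{X_i = l3}
Xd = np.column_stack([np.ones(len(y)), Z2, Z3])

beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
# beta_hat[0]: average response at baseline l1
# beta_hat[1], beta_hat[2]: changes of l2, l3 relative to l1
```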

. . . Categorical Variables

• When a categorical variable X1 (g levels l1 , . . . , lg ) is combined with


other covariates

▶ with a quantitative covariate X2 (say with slope γ, common to all levels):

E [Y |Xi1 = lk , Xi2 = x2 ] = β1 + βk + γ x2 ,

k = 2, . . . , g ⇝ change of intercept

▶ with a categorical covariate X2 (f levels s1 , . . . , sf , with shifts δh for its levels)

E [Y |Xi1 = lk , Xi2 = sh ] = β1 + βk + δh ,

k = 2, . . . , g , h = 2, . . . , f
⇝ here l1 and s1 are the baseline levels for X1 and X2 respectively

▶ similarly with combinations of categorical/quantitative


. . . Categorical Variables
• Example. Data on selling price of 100 homes; response: price in $;
covariates (p = 5):
▶ size: size in ft2
▶ tax: annual tax bill in $
▶ beds: number of bedrooms
▶ baths: number of bathrooms
▶ new: 1 if house is new and 0 if old

price   size  tax   beds  baths  new
279900  2048  3104  4     2      0
146500   912  1173  2     1      0
237700  1654  3076  4     2      0
499900  3153  2997  3     2      1
200000  2068  1608  3     2      0
...

. . . Categorical Variables
• Model is

pricei = β0 + β1 · sizei + β2 · taxi + β3 · bedsi + β4 · bathsi + β5 · newi + εi
       = β0 + β1 · sizei + β2 · taxi + β3 · bedsi + β4 · bathsi + εi          if house is old
       = (β0 + β5 ) + β1 · sizei + β2 · taxi + β3 · bedsi + β4 · bathsi + εi  if house is new

▶ adjustment to the intercept for new homes


▶ what if the number of bathrooms is treated as categorical with values
0, 1, 2?
. . . Categorical Variables

• Result:

βbi SE t p-value
(Intercept) 4526.00 24470.00 0.1849 8.531 · 10−1
size 68.35 13.94 4.9040 3.916 · 10−6
tax 38.14 6.81 5.5960 2.158 · 10−7
beds -11260.00 9115.00 -1.2350 2.198 · 10−1
baths -2114.00 11470.00 -0.1844 8.541 · 10−1
new 41710.00 16890.00 2.4700 1.531 · 10−2

▶ R 2 = 0.793

▶ intercept, beds and baths not significant at 5% level (p-values > 0.05)

Inference for Multiple Coefficients

• Test a null hypothesis about a subset of the regression slopes (say the
first 1 ≤ q < p)

▶ H0 : β1 = . . . = βq = 0; vs

▶ H1 : at least one of β1 , . . . , βq is ̸= 0

• Note:

▶ “overall significance test” is a special case

▶ useful to compare nested models

▶ drop more than one covariate at a time: significance test on categorical


variables (multiple covariates)
. . . Inference for Multiple Coefficients

• Parametric statistics: model M0 is nested in model M1 (full model) if


it can be obtained from M1 by constraining some parameters

• In the context of the null and alternative hypotheses

▶ regressing against covariates q + 1, . . . , p (model M0 ) nested in


regressing against all covariates 1, . . . , p (M1 )

▶ let RSSk , RegSSk , Rk2 , k = 0, 1 the RSS, RegSS and R squared under
the two models

▶ clearly RSS0 ≥ RSS1 and RegSS0 ≤ RegSS1 since

RSS0 + RegSS0 = RSS1 + RegSS1 = TSS

further R02 ≤ R12

. . . Inference for Multiple Coefficients

• Test a null hypothesis about a subset of the regression slopes

▶ if H0 is false ⇝ RSS0 − RSS1 = RegSS1 − RegSS0 should be large

▶ the test statistic is (under H0 )

F0 = [(RegSS1 − RegSS0 )/q] / [RSS1 /(n − p − 1)] = [(RSS0 − RSS1 )/q] / [RSS1 /(n − p − 1)] ∼ Fq,n−p−1

reject H0 if F0 > Fq,n−p−1,1−α

• Alternatively, a general purpose likelihood ratio test for nested models


can be used ⇝ see GLM
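The test can be packaged as a small helper; a sketch, with names of my own choosing (nested_f_test is not a standard library function):

```python
from scipy import stats

def nested_f_test(rss0, rss1, q, n, p, alpha=0.05):
    """F test of H0: the q coefficients dropped in the nested model are all 0.

    rss0, rss1: residual sums of squares of the nested and full model;
    p: number of covariates in the full model; n: sample size.
    """
    f0 = ((rss0 - rss1) / q) / (rss1 / (n - p - 1))
    p_value = stats.f.sf(f0, q, n - p - 1)   # P(F_{q, n-p-1} > f0)
    return f0, p_value, p_value < alpha      # last entry: reject H0?
```

For example, with illustrative numbers rss0 = 120, rss1 = 100, q = 2, n = 100, p = 5, this gives F0 = (20/2)/(100/94) = 9.4, well beyond the usual critical values.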

. . . Model Building

• Example. Home prices: fit model with baths and beds removed:
R 2 = 0.79

βbi SE t p-value
(Intercept) -21350.00 13310.00 -1.604 1.12 · 10−1
size 61.70 12.50 4.937 3.34 · 10−6
tax 37.23 6.73 5.528 2.78 · 10−7
new 46370.00 16460.00 2.818 5.87 · 10−3

• H0 : β3 = β4 = 0
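Since RSS = (1 − R2)·TSS, the F statistic can also be computed from the two R2 values; a rough check using the rounded figures reported on these slides (so the result is approximate):

```python
from scipy import stats

# rounded values from the slides: R^2 = 0.790 without beds/baths, 0.793 with them
n, p, q = 100, 5, 2                     # p counts covariates of the full model
r2_0, r2_1 = 0.790, 0.793

# F0 = [(RegSS1 - RegSS0)/q] / [RSS1/(n - p - 1)], using RSS = (1 - R^2) * TSS
f0 = ((r2_1 - r2_0) / q) / ((1 - r2_1) / (n - p - 1))
p_value = stats.f.sf(f0, q, n - p - 1)
# f0 is well below 1 here, so beds and baths can be dropped jointly
```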

Variables Interaction

• When combining the variables linearly

E [Y |X ] = β0 + β1 X1 + . . . + βp Xp

▶ a unit increase in Xj ⇝ E [Y |X ] changes by βj , regardless of the other covariates (main effect)
▶ each regressor slope does not depend on the other regressors

• Interaction: the effect of each regressor depends on the other


regressors
▶ add product of covariates
▶ notation: main effects +; interaction ×
▶ when building models, always have the main effect for comparison
purposes

. . . Variables Interaction

• Example. House prices


▶ the categorical variable new changes the intercept
▶ what if we suspect that old houses gain less value by being more
sizeable, compared to new houses?
▶ we can model this situation with an interaction term

pricei = β0 + β1 · sizei + β2 · taxi + β3 · newi + β4 · sizei × newi + εi
       = β0 + β1 · sizei + β2 · taxi + εi                       if house is old
       = (β0 + β3 ) + (β1 + β4 ) · sizei + β2 · taxi + εi       if house is new

⇝ adjustment to both intercept and slope
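A sketch of fitting such an interaction with numpy: the product column sizei × newi is just an extra regressor. The data below are synthetic stand-ins for the house-price variables, not the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
size = rng.uniform(800, 3500, n)
tax = rng.uniform(1000, 3500, n)
new = rng.integers(0, 2, n).astype(float)

# synthetic "truth": new homes gain an extra 60 $/ft^2 (slope change beta4 = 60)
price = (5000 + 50 * size + 35 * tax + 40000 * new + 60 * size * new
         + rng.normal(scale=5000, size=n))

# design: intercept, main effects, and the interaction (product) column
Xd = np.column_stack([np.ones(n), size, tax, new, size * new])
beta_hat, *_ = np.linalg.lstsq(Xd, price, rcond=None)
# beta_hat[3], beta_hat[4]: intercept and slope adjustments for new homes
```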

. . . Variables Interaction

• Interaction ≡ product: the effect of one variable depends on the value of another; for covariates X1 , X2
▶ both X1 , X2 quantitative:

E [Y |X1 = x1 , X2 = x2 ] = β0 + β1 x1 + β2 x2 + β3 x1 x2

▶ X1 categorical with levels l1 , . . . , lg , X2 quantitative (baseline slope γ, level-specific slope changes γk ):

E [Y |X1 = lk , X2 = x2 ] = (β1 + βk ) + (γ + γk ) x2 ,

k = 2, . . . , g
▶ both categorical, X1 with levels l1 , . . . , lg and X2 with levels s1 , . . . , sf (level shifts δh , interaction terms ηkh ):

E [Y |X1 = lk , X2 = sh ] = β1 + βk + δh + ηkh ,

k = 2, . . . , g , h = 2, . . . , f

. . . Variables Interaction

• Example. House prices: fit model with interaction between size and
new: R 2 = 0.8168

βbi SE t p-value
(Intercept) -365.70 13680.00 -0.0267 9.78 · 10−1
size 46.29 12.42 3.7260 3.30 · 10−4
tax 38.81 6.33 6.1290 2.00 · 10−8
new -106900.00 43650.00 -2.4490 1.61 · 10−2
new×size 69.43 18.50 3.7540 2.99 · 10−4

. . . Variables Interaction

• R 2 and adjusted R 2 (Re2 )

▶ R 2 never decreases as explanatory variables are added, even non-significant ones! ⇝ not a good criterion for model selection

▶ keeping non-significant explanatory variables may not improve


predictions and makes the model unnecessarily complex

covariates R2 Re2
size, tax, beds, baths, new 0.793 0.782
size, tax, beds, new 0.793 0.785
size, tax, new 0.790 0.783
size, tax, new, size×new 0.817 0.809
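The adjustment only needs R2, n and the number of covariates p; a one-line helper (of my own naming) reproduces the table's last column:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2: 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# last row of the table: size, tax, new, size x new (p = 4 covariates, n = 100)
adj = adjusted_r2(0.817, n=100, p=4)    # ~0.809, matching the table
```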

Model Building

• What variables to use?

▶ quantitative

▶ categorical

▶ interactions ⇝ joint effects

▶ transformation of variables

• Is the model good?

▶ statistical significance

▶ practical significance (effect size)

▶ simplicity v complexity

▶ model assumptions

. . . Model Building
• With k explanatory variables,
▶ 2k possible models arise from including/omitting each covariate
▶ different ways to select a model exist

• Backward search:
▶ fit model with all k covariates
▶ remove one variable at a time according to some criterion; stop when
the criterion no longer applies

• Forward search
▶ fit model with no covariates
▶ add one variable at a time according to some criterion; stop when the
criterion no longer applies

• Criterion: eg based on |t|-value

. . . Model Building

• Stepwise regression algorithm: combine backward and forward


1 consider all possible k regressions using one explanatory variable
2 for each regression in (1), compute the t-value for the slope; choose
the variable with the largest |t| provided it exceeds a pre-specified
value; otherwise halt the process
3 add to the model in (2) the variable based on the largest significant
contribution in terms of |t|, provided it exceeds a pre-specified value;
otherwise halt the process
4 delete from the model in (3) the variable with the smallest
contribution in terms of |t|, provided it is lower than a pre-specified
value; otherwise halt the process
5 repeat steps (3) and (4) until all possible additions and deletions are
performed.
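A minimal sketch of the forward part of this procedure (steps (1)-(3)) on synthetic data; the helper names and the stopping threshold t_enter are my own choices, not part of any standard API.

```python
import numpy as np

def t_values(Xd, y):
    """OLS t statistics for every column of the design matrix Xd."""
    n, k = Xd.shape
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - k)             # residual variance estimate
    cov = s2 * np.linalg.inv(Xd.T @ Xd)      # estimated Cov(beta_hat)
    return beta / np.sqrt(np.diag(cov))

def forward_select(X, y, t_enter=2.0):
    """Add one covariate at a time, choosing the largest significant |t|."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        best_j, best_t = None, 0.0
        for j in remaining:
            cols = [np.ones(n)] + [X[:, s] for s in selected] + [X[:, j]]
            t_j = abs(t_values(np.column_stack(cols), y)[-1])
            if t_j > best_t:
                best_j, best_t = j, t_j
        if best_t < t_enter:
            break                            # no significant addition: halt
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

The backward step (4) would work analogously, removing the selected covariate with the smallest |t| when it falls below a deletion threshold.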

