
1 Introduction/Overview

Introduction to Regression 
Understanding and quantifying variability in data is the essence of statistics and statistical inference.
In regression modelling, we attempt to explain or account for variation in a response variable, y , by
using a statistical model to describe the relationship between the response and one or
more explanatory variables, x1 , x2 , x3 …  
We can then use the model we've developed to learn and answer questions about relationships
between the explanatory variables and the response, and/or predict the value of the response for a
given set of values of the explanatory variables.
Regression is an extremely powerful and widely used tool for investigating and understanding
relationships in any field of interest, including biology, health and social sciences, economics, business,
and finance, and for making decisions based on those relationships. 
To begin to gain an appreciation of how we may use regression in empirical-based decision making, we
will introduce three of the examples from the field of business and finance that we will be exploring in
more detail throughout the course.

Example 1:
An auditor wishes to determine whether the cost of overhead claimed by offices in a certain group is
consistent with the office's attributes, including size, age, number of clients, number of employees,
and the cost of living index of the city in which the office is located.
To this end, the auditor creates a regression model to describe the relationship between these
attributes and the overhead claimed by the office in order to estimate the expected overhead for each
office.
The auditor can then investigate any claim for which a large discrepancy exists between the observed
overhead and the expected overhead estimated from the model.
Response variable: (Claimed) overhead
Potential explanatory variables: office size, office age, number of clients, ...
Objective: Examine the difference* between the claimed overhead and the expected overhead
estimated from the model for each office based on the office's size, age, etc., and allocate auditing
resources to those offices associated with a large (positive) difference.
(*we call these differences the residuals, which are extremely important values in regression
analysis, as we will see)


Example 2:
Is there systemic gender inequity in the salaries of Waterloo faculty members?
To answer this question, a Waterloo working committee obtained information on each faculty member,
including rank, academic unit, years of service, gender, and annual salary, and fit a regression model
to the data.
Based on the model, they found that, after accounting for rank, academic unit, years of service,
and several other variables, males were getting paid significantly more than females, on average.
The results from this regression analysis resulted in an immediate increase of $2905 to the annual
salaries of all female faculty members.
Response variable: Annual Salary
Potential explanatory variables: Rank (Lecturer, Professor, ...), performance review,
Faculty/Department, years of service, ..., gender.
Objective: To determine whether, after accounting for the variation in salary due to rank, years
service, Faculty/Dept., etc, there is any discrepancy in mean annual salary between male and female
Faculty members. 

Example 3:
Before listing a house, a realtor wishes to estimate its market value based on recent selling prices of
homes in the area.
Information is obtained on attributes of these homes that may help to account for selling price, such
as size, lot size, number of rooms, number of bathrooms, number of stories, whether the house has a
garage, etc., and a regression model is created to describe the relationship between selling price and
these variables.
The realtor can now use the model to estimate market value and predict selling price of the house,
based on its attributes.
Response variable: Selling price 
Potential explanatory variables: house size, style, age, # of bathrooms, district, garage, basement,
swimming pool, ...
Objective: To estimate the market value (i.e. predict selling price) of a house based on its size, age,
style, etc.


Note that in each of these examples, the objective for fitting a regression model is different,
illustrating the power and usefulness of regression modelling.

In the first example, the investigator wishes to detect discrepancies between an office's (claimed)
overhead and the expected overhead for that office estimated from the regression model.

In the second example, investigators wish to determine whether there is a relationship between


gender and salary (i.e. whether there is a difference in mean salaries between male and female faculty
members) after accounting for potentially confounding explanatory variables such as academic unit,
rank, etc.

Whereas in the last example, the objective was to predict the value of the response (selling price) for
a given set of explanatory variates. 

We will be looking at each of these examples in more detail throughout the course. First, however, we
will begin with a review of simple linear regression, in which we model the relationship between a
single explanatory variable and the response.


Graphical and Numerical Summaries for Bivariate Data


Consider the audit dataset associated with Example 1, which contains information on the size (sq. ft),
number of employees, number of clients, cost of living index, and annual overhead ($) for 24 offices
from a certain population of interest. For now, we will consider only the relationship between overhead
and the office size. The partial dataset is given below:
> audit=read.table('audit.txt', header = TRUE)  #reads the audit text file contained in the R working directory and creates a dataframe
> attach(audit)  #the audit dataframe is attached to the R search path, so column vectors can be accessed by name. Must be attached every session.
> audit
  office  overhead  size
       1    218955  1589
       2    224513  1912
       3     66542   741
       4    212349  1839
       ⁝         ⁝     ⁝
      23    188435  1812
      24     35099   607

Although the raw data contains all the available information about the relationship between size, x, and overhead, y, we cannot readily synthesize this information without the aid of appropriate graphical methods and numerical summaries.

With bivariate data, {x, y} , such as we have here, a scatterplot is an essential tool in visualizing
and understanding the nature and strength of the relationship between an explanatory variable and a
response. A scatterplot of the audit data is given below:
> plot(size,overhead,pch=19,xlab='size (sq.ft.)',ylab='overhead ($)',
main='Claimed overhead vs office size (n = 24)')

Note the underlying linear relationship between the response (overhead), and the explanatory variate
(office size).


In addition to a visual representation of the relationship between two quantitative variables, we also
require a quantitative measure of the strength of a linear relationship between two variables, as given
by the correlation coefficient, r , defined as

r = ∑(xi − x̄)(yi − ȳ) / √[∑(xi − x̄)² ∑(yi − ȳ)²] = Sxy / √(Sxx Syy)

Properties of r :
−1 ≤ r ≤ 1, where the closer r is to 1 (−1), the stronger the positive (negative) relationship.

r is unitless. (Note that the units of the numerator and denominator will cancel). We can thus
compare the relative strength of linear relationships across different scales and datasets.

For the audit data, r = 0.927, as obtained in R:

> cor(overhead,size)

[1] 0.9271985
This suggests a relatively strong positive relationship between office size and overhead.
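As a quick check of the formula (a sketch, assuming the audit dataframe has been read and attached as above), r can also be computed directly from Sxy, Sxx and Syy:

> Sxy = sum((size-mean(size))*(overhead-mean(overhead)))
> Sxx = sum((size-mean(size))^2)
> Syy = sum((overhead-mean(overhead))^2)
> Sxy/sqrt(Sxx*Syy)  #should match cor(overhead,size)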
Note that in calculating and interpreting r , we are not attempting to establish the presence of a
causal relationship between the explanatory variate and the response, only the strength of the linear
association or correlation between the two variables. For example, we cannot conclude that larger
offices cause larger overheads, as there may be one or more confounding variables, such as the
number of employees, that may be associated with both office size and overhead. An oft-quoted
saying is:
'Correlation does not imply causation'.


The Simple Linear Regression Model​​


Consider again the scatterplot of overhead, y , vs. office size, x, for the audit data:

 
 

We can describe the observed behaviour of the response with a statistical model that includes both

a deterministic component, that describes the variation in y accounted for by the functional form of the underlying relationship between y and x. Based on the scatterplot, the deterministic component can be adequately described by the linear function μ = β0 + β1 x, where μ is the mean value of y for a given value of x.

an error term, denoted by the random variable ϵ, that describes the random variation in y not accounted for by the underlying relationship with x.

Incorporating both the deterministic and error components into our model yields the simple
linear regression (SLR) model, expressed as

yi = β0 + β1 xi + ϵi i = 1, … , n

where
β0  denotes the intercept parameter

β1  denotes the slope parameter

the index i  denotes the observation number (e.g. {x3 , y3 }  denotes the size and overhead
associated with the third office in the dataset).

 
 


The Normal SLR Model


We will see in future lessons that, in order to derive the distributions of estimators for statistical
inference procedures (i.e. confidence intervals and hypothesis tests for model parameters), we
require certain distributional assumptions about the error random variable, ϵ .
In linear regression, we typically assume that the errors, ϵi, follow a normal distribution with mean 0 and variance denoted by σ². We also must assume that the errors are independent (recall that for normal random variables, independent errors ↔ Cov(ϵj, ϵk) = 0, j ≠ k).
Incorporating these assumptions into our SLR model yields the normal model

Yi = β0 + β1 xi + ϵi,    ϵi ∼ N(0, σ²) ind.,    i = 1, … , n

Assumptions of the Normal Model


the functional form (e.g. linear) of the relationship between y and x  is correctly specified by
the deterministic component of the model

the errors follow a normal distribution

errors have a constant variance, denoted by σ²  (this property is sometimes referred to as homoskedasticity)

the errors are independent 

For the normal model to be an appropriate model to use in investigating the relationship between y

and x, these assumptions must hold. Otherwise, our model will be inappropriate and any
conclusions we obtain from our regression analysis will be invalid. 
We will be examining these model assumptions in more detail in later sections.
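To make these assumptions concrete, the following small simulation is a sketch (the seed and the parameter values are arbitrary, assumed choices for illustration, not taken from the audit data) that generates data from a normal SLR model and fits it with lm():

> set.seed(123)  #arbitrary seed, for reproducibility
> x.sim = runif(24, 500, 2000)  #explanatory values, roughly the range of the audit office sizes
> y.sim = -28000 + 126*x.sim + rnorm(24, mean=0, sd=23000)  #assumed values for beta0, beta1 and sigma
> lm(y.sim ~ x.sim)  #the estimates should be close to the assumed values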


Least Squares Estimation of Model Parameters


Analysing and drawing conclusions about the linear relationship between the explanatory and response variables requires fitting the SLR model to the data and estimating the values of the parameters β0 and β1.
Consider again the audit dataset. Based on the observations and the relationship described by the SLR
model,

218955 = β0 + β1 (1589) + ϵ1

224513 = β0 + β1 (1912) + ϵ2

66542 = β0 + β1 (741) + ϵ3

⋮

35099 = β0 + β1 (607) + ϵ24

for unknown constants, β0  and β1 .

In least squares estimation, we find the values for β0 and β1 that yield the smallest sum of squares of the errors, ∑ ϵi², i = 1, … , n.

The values obtained by this procedure, denoted by β̂0 and β̂1, are referred to as the least squares estimates of the intercept and slope parameters.

Using calculus, we find these least squares estimates by minimizing the function

S(β0 , β1) = ∑ ϵi² = ∑ [yi − (β0 + β1 xi)]²

with respect to β0  and β1 .

Taking the partial derivatives and setting to 0:

∂S/∂β0 = −2 ∑ [yi − (β̂0 + β̂1 xi)] = 0

∂S/∂β1 = −2 ∑ xi [yi − (β̂0 + β̂1 xi)] = 0

yield the normal equations:


n β̂0 + β̂1 ∑ xi = ∑ yi

β̂0 ∑ xi + β̂1 ∑ xi² = ∑ xi yi

 
Solving these normal equations yields the least squares estimates:

β̂0 = ȳ − β̂1 x̄

β̂1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)² = Sxy / Sxx


For the audit data, the least squares parameter estimates are β̂0 = −27877.1 and β̂1 = 126.3, as given by the R output:

> audit.slr.lm=lm(overhead ~ size) #fits the linear model y ~ x (intercept is fit as default)

> audit.slr.lm
Call:
lm(formula = overhead ~ size)
Coefficients:

(Intercept)         size  

   -27877.1        126.3 
       β̂0          β̂1
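As a check (a sketch, assuming the audit dataframe is still attached as above), the estimates can also be computed directly from the closed-form expressions derived earlier:

> Sxy = sum((size-mean(size))*(overhead-mean(overhead)))
> Sxx = sum((size-mean(size))^2)
> beta1.hat = Sxy/Sxx  #126.33
> beta0.hat = mean(overhead) - beta1.hat*mean(size)  #-27877.06
> c(beta0.hat, beta1.hat)  #should match coef(audit.slr.lm)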


The Fitted Model


The fitted model, or fitted line, for the SLR model is expressed as μ̂ = β̂0 + β̂1 x, where μ̂ is the estimated mean value of the response y for a given value of x.

For the audit data, the fitted model is μ̂ = −27877.06 + 126.33x

(Note that the fitted model is sometimes expressed in terms of the predicted value of the response, ŷ = β̂0 + β̂1 x. While μ̂ and ŷ are identical in terms of the value they represent, there are subtle differences in their interpretation that we will discuss in a later section)

The (fitted) residual of the ith observation, ei, is the difference between the observed response, yi, and the fitted value, μ̂i, defined as

ei = yi − μ̂i = yi − (β̂0 + β̂1 xi)

For example, the second office in the dataset, a 1912 sq. ft. office with a claimed overhead of $224513, has an estimated mean overhead of

μ̂2 = −27877.06 + 126.33 x2 = −27877.06 + 126.33(1912) = $213666

with an associated residual value of

e2 = y2 − μ̂2 = 224513 − 213666 = $10847
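These values can also be extracted from the fitted model object (a sketch; small differences from the hand calculation are due to rounding of the coefficient estimates):

> fitted(audit.slr.lm)[2]  #estimated mean overhead for the second office, approx 213666
> residuals(audit.slr.lm)[2]  #residual for the second office, approx 10847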

The fitted line, fitted value and residual for the audit data are illustrated in the plot below:


Some notes on the residuals:


We will see in later sections that much of our statistical analysis from a regression model relies on the calculated value of the sum of squares of the residuals, ∑ ei².

Understand the distinction between the residual, ei, and the error, ϵi = yi − μi = yi − (β0 + β1 xi).

The error is the random variable, on which we impose certain distributional assumptions, that we use to model the random variation in the response for a given value of x.
The residual, ei = yi − μ̂i, is the difference between the value of the observed response and the estimated mean response, the value of which we calculate from the fitted line. We can think of the residuals as estimates of the errors.

By taking the partial derivative with respect to each parameter and setting it to 0 in our least squares estimation procedure, we have imposed two constraints on our residuals, namely:
∑ ei = 0

∑ xi ei = 0

(these follow directly from the normal equations)

Given any n − 2 of the residuals, these two constraints allow us to compute the remaining two. Thus, we say that the fitted model is associated with n − 2 degrees of freedom.


Least Squares Estimation of σ²


Recall the normal model given by 
yi = β0 + β1 xi + ϵi,    ϵi ∼ N(0, σ²) ind.,    i = 1, … , n

Inference for model parameters requires not only the estimation of β0 and β1, but also the estimation of the error variance, σ².
In any least squares regression model, this is obtained by dividing the sum of squares of the
residuals by the degrees of freedom, giving the least squares estimate as 

σ̂² = ∑ ei² / (n − 2) = ∑ (yi − μ̂i)² / (n − 2)

 
Note that σ̂² is an unbiased estimate of σ² (i.e. E(σ̂²) = σ²).

The residual standard error is the square root of the estimated variance, given by
σ̂ = √[∑ ei² / (n − 2)]

The residual standard error can be interpreted as the estimated standard deviation of the errors, and
is a measure of the random variation in the response for a given value of x. The smaller the value of
the residual standard error, the more variation in the response is explained by the relationship with x,
and the better the fit of the model.

The residual standard error is part of the summary R output for the fitted model:

> summary(audit.slr.lm)
Call:
lm(formula = overhead ~ size)
Residuals:

   Min     1Q Median     3Q    Max 

-36639 -12874  -1997   8642  56686 


Coefficients:

             Estimate Std. Error t value Pr(>|t|)    

(Intercept) -27877.06   14172.00  -1.967   0.0619 .  

size           126.33      10.88  11.610 7.47e-11 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 23480 on 22 degrees of freedom

Multiple R-squared:  0.8597,    Adjusted R-squared:  0.8533 

F-statistic: 134.8 on 1 and 22 DF,  p-value: 7.472e-11

Note the intercept and slope parameter estimates are also provided. We will be exploring
R summary output in great detail throughout the course.
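As a check (a sketch, using the SLR fit from the previous sections), the residual standard error can be computed directly from the residuals, or extracted with sigma():

> sqrt(sum(residuals(audit.slr.lm)^2)/22)  #n - 2 = 24 - 2 = 22 degrees of freedom
> sigma(audit.slr.lm)  #residual standard error, approx 23480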


Interpretation of Parameter Estimates​​


Recall the fitted regression model for the audit data, with parameter estimates

β̂0 = −27877.06

β̂1 = 126.33

What do these values represent in the context of the study?

Interpreting and communicating results of statistical analyses is an extremely important skill, requiring and demonstrating a fundamental understanding of the core statistical concepts behind the numbers.
It is not sufficient that we merely learn how to calculate or obtain the parameter estimates (or any
other statistical results) through rote application of an algorithm.
It is far more important that we understand how we can interpret these results in the context of the
study, and how we can effectively and concisely communicate these interpretations to various interest
groups, be they academics, technicians, coworkers, or the general public.
 
 

Interpretation of the Slope Parameter Estimate, β̂1

Let μ̂x0 denote the estimated mean response at x = x0, and μ̂x0+1 the estimated mean response at x = x0 + 1.

Then

μ̂x0+1 = β̂0 + β̂1 (x0 + 1)

       = β̂0 + β̂1 x0 + β̂1

       = μ̂x0 + β̂1

 
Thus we can see that, in general, β̂1 can be interpreted as

the estimated mean change in the response, y, associated with a change of one unit in x.
 
For the audit model:
β̂1 = 126.33 can be interpreted in the context of the study as: the estimated mean overhead increases by $126.33 for every increase of one square foot in office size.


Interpretation of the Intercept Parameter Estimate, β̂0

Note that for x = 0, the estimated mean response reduces to μ̂ = β̂0 + β̂1(0) = β̂0.

Thus, β̂0 may be interpreted in certain situations as the estimated mean value of y at x = 0.

However, there is an important caveat:

This interpretation may be nonsensical or meaningless in cases where x = 0 is not a relevant value, or
where x = 0 is not in the range of values used in the fit of the model.

In the audit model, for example, β̂0 = −$27877.06. This leads to the interpretation:

An office 0 sq. ft. in size has an estimated mean overhead of negative $27877.06 ???

Clearly, as overhead cost is non-negative, this is a nonsensical interpretation.

The linear relationship between overhead and size we observed in the scatterplot is only evident for
offices between approx. 500 and 2000 sq. ft in size.   

This serves as an important reminder:

Never extrapolate results to values of x outside the range used to fit the model

Interpretation of the Residual Standard Error, σ̂

Recall the least squares estimate, σ̂, of the standard deviation of the errors, called the residual standard error and given by:

σ̂ = √[∑ ei² / (n − 2)]

where ei = yi − μ̂i are the residuals of the fitted model.

Note that the residual standard error is similar to the (sample) standard deviation of the residuals (only with n − 2 degrees of freedom instead of n − 1), and is thus a measure of the variability of the response about the fitted line. The smaller the residual standard error, the closer the data are to the fitted line, and the better the fit of the model.

Similar to a standard deviation, the residual standard error can be roughly interpreted as a typical or 'standard' distance (or absolute difference) between the response, yi, and the fitted value, μ̂i.

For the audit model, σ̂ = 23480.

This tells us that a typical difference between the observed overhead and estimated overhead for an
office of a given size is approximately $24000. This provides us with a reasonable reference on the
variability of the response relative to its expected value. 


Inference for β1
'Is there a relationship between overhead and office size in the population?'

Based on the SLR model,

Yi = β0 + β1 xi + ϵi,    ϵi ∼ N(0, σ²) ind.,    i = 1, … , n,

We can assess whether β1 = 0  and thus whether a (linear) relationship exists by employing one of two
methods of inference for β1 , namely:
Confidence intervals, or

Hypothesis tests

To carry out these procedures, we must first obtain the distribution of the least squares estimator*,

β̂1 = ∑(xi − x̄)(Yi − Ȳ) / ∑(xi − x̄)²

*A note on terminology and notation:

When we refer to the estimator β̂1, we are referring to β̂1 as a random variable, the distribution of which we need to determine for inference purposes. When we refer to the estimate β̂1, we are referring to a single realization or observed value of the random variable.

Note that the estimator β̂1 is a function of the random variables, Yi, whereas the estimate β̂1 is a function of the observed values, yi.


Distribution of β̂1

First, we note that the estimator β̂1 can be expressed as a linear combination of the response random variables:

β̂1 = ∑(xi − x̄)(Yi − Ȳ) / ∑(xi − x̄)²  =  [∑(xi − x̄)Yi − Ȳ ∑(xi − x̄)] / ∑(xi − x̄)²

    = ∑(xi − x̄)Yi / ∑(xi − x̄)²        (∑(xi − x̄) = 0)

    = ∑ ci Yi

where ci = (xi − x̄) / ∑(xi − x̄)².

For the SLR model with normal, independent errors,

ϵi ∼ N(0, σ²) ind.  →  Yi ∼ N(β0 + β1 xi , σ²) ind.

  →  β̂1 = ∑ ci Yi ∼ Normal    (linear combination of ind. normal r.v.'s)

Expressing β̂1 as a linear combination of independent random variables also allows us to easily derive its mean and variance:

E(β̂1) = E(∑ ci Yi) = ∑ ci E(Yi)

       = ∑ [(xi − x̄) / ∑(xi − x̄)²] (β0 + β1 xi)

       = [β0 ∑(xi − x̄) + β1 ∑ xi(xi − x̄)] / ∑(xi − x̄)²

       = [β1 ∑ xi(xi − x̄) − β1 x̄ ∑(xi − x̄)] / ∑(xi − x̄)²        (∑(xi − x̄) = 0)

       = β1 ∑(xi − x̄)(xi − x̄) / ∑(xi − x̄)²

       = β1        (unbiased estimator)

Var(β̂1) = Var(∑ ci Yi) = ∑ ci² Var(Yi)        (Yi's ind.)

         = σ² ∑ [(xi − x̄) / ∑(xi − x̄)²]²

         = σ² ∑(xi − x̄)² / [∑(xi − x̄)²]²

         = σ² / ∑(xi − x̄)²  =  σ²/Sxx


From these three results we have derived the distribution of β̂1 as:

β̂1 ∼ N(β1 , σ²/Sxx)

Recall: In general, for any (unbiased, normally distributed) least squares estimator of a parameter, θ,

(θ̂ − θ) / SE(θ̂) ∼ tdf

where the degrees of freedom, df, is n minus the number of estimated model parameters (i.e. the degrees of freedom associated with σ̂).

Thus for the SLR normal model,

(β̂1 − β1) / SE(β̂1) = (β̂1 − β1) / (σ̂/√Sxx) ∼ tn−2

where σ̂ = √[∑(yi − μ̂i)² / (n − 2)] = √[∑ ei² / (n − 2)].

We use this result to obtain t-based confidence intervals and hypothesis tests for β1 .
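The standard error formula can be verified numerically (a sketch, assuming the audit data and the SLR fit audit.slr.lm from the earlier sections):

> Sxx = sum((size-mean(size))^2)
> sigma(audit.slr.lm)/sqrt(Sxx)  #should match the 'Std. Error' for size (approx 10.88) in the summary output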

Distribution of β̂0

In the same way we derived the distribution of β̂1, so could we derive the distribution of β̂0, yielding

β̂0 ∼ N(β0 , σ²(1/n + x̄²/Sxx))

  →  (β̂0 − β0) / SE(β̂0) ∼ tn−2

where SE(β̂0) = σ̂ √(1/n + x̄²/Sxx).

Confidence intervals and hypothesis tests for β0 would follow in the same way as for β1.

However, inference for β0 is typically of little relevance, and we will focus primarily on confidence intervals and hypothesis tests for β1.


Confidence Interval for Slope Parameter, β1
Recall: a (1 − α)100% confidence interval for the population mean parameter, μ:

μ̂ ± tn−1,1−α/2 SE(μ̂)  =  x̄ ± tn−1,1−α/2 σ̂/√n

In the same way, we obtain a (1 − α)100% confidence interval for β1 of the form:

β̂1 ± tn−2,1−α/2 SE(β̂1)

where SE(β̂1) = σ̂/√Sxx

Notes:
tn−2,1−α/2  denotes the critical value from a tn−2  distribution corresponding to confidence level 
(1 − α)100% . (Be sure you know how to obtain this value for a given confidence level from both
R and the posted t - tables)

tn−2,1−α/2 SE(β̂1) is called the margin of error of the interval. It can be thought of as the bound on the difference between the value of the estimate and the actual (unknown) value of the parameter for the given confidence level.

It should be obvious, both intuitively and from the form of the confidence interval, that:

- the higher the confidence level, the wider the interval

- the larger the standard error, SE(β̂1), the wider the interval

Example: Provide a 95% confidence interval for β1 from the audit SLR model.

We need to provide the interval:

β̂1 ± t22,0.975 SE(β̂1)

We can obtain the values of β̂1 and SE(β̂1) from the summary R output:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    

(Intercept) -27877.06   14172.00  -1.967   0.0619

size           126.33      10.88  11.610 7.47e-11

 
We can obtain the critical 
t value, ‌t22,0.975 , from either a 
t-table or R:

> qt(.975,22)

[1] 2.073873

These values give us a 95% confidence interval for β1 of:

126.33 ± 2.074(10.88)

 = 126.33 ± 22.57

 = (103.76,  148.90)
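The same interval can be obtained directly with confint() (a sketch; values agree with the hand calculation up to rounding):

> confint(audit.slr.lm, 'size', level=.95)  #95% confidence interval for the slope parameter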


Interpretation of Confidence Intervals


Recall: A 95% confidence interval for β1  from the audit model:

 (103.76, 148.90)

How can we interpret this interval in the context of the study?

Recall our interpretation of β̂1 as: the estimated mean change in the response for a change of one unit in the explanatory variate.

The same interpretation applies when considering the endpoints of a confidence interval.

For the audit model, the 95% confidence interval for β1 of (103.76, 148.90) can be interpreted as:

We are 95% confident* that for every additional increase of one square foot in office size, the mean
increase in overhead is between $103.76 and $148.90.

*The phrase '95% confident' is ambiguous, but is a sufficient and common phrase for conveying the results of a 95% confidence interval. A more formal interpretation would be:
In repeated sampling of a given size from the population, it is expected that 95% of these samples
would yield (95%) confidence intervals that would contain the actual (unknown) value of β1 . ​​
Note that this interpretation implies that there is a probability of .95 that the true (unknown) value
of β1  is contained in any one of these intervals, including the interval of (103.76, 148.90) that we
obtained from our sample.

Conclusions about the significance of parameters


What conclusions can be drawn from our confidence interval about whether a relationship exists
between overhead and office size? 

Consider again the question that motivated the creation of a confidence interval:

'Is there a relationship between overhead and office size in the population?''

Since β1 = 0  is not in the interval, (103.76, 148.90) , we conclude that there is a significant* positive
relationship between overhead and office size.

(Had 0 been in the interval, then 0 would be considered a plausible value for β1  and we would thus
conclude that there was no significant relationship between overhead and office size)

*When we use the word 'significant' in conclusions from confidence intervals and hypothesis tests,
we are referring to statistical significance.
Whether or not, or the extent to which, our conclusions are significant in terms of their practical implications is not considered in our statistical conclusions.


Steps in a Hypothesis Test - review


Recall (e.g. STAT 231): 
In a hypothesis test, we assess the strength of the evidence in the data against an assumed value
of a model parameter.
Steps in a Hypothesis Test:
1. Present a null hypothesis, H0, and alternative hypothesis, Ha, concerning the value of the parameter.

2. Calculate the value of the test statistic, or discrepancy measure. The test statistic is a
measure based on the difference between the value of the estimate of the parameter that we
observe in our data and the hypothesized value of the parameter.

3. Calculate the p-value from the test statistic. The p-value is the probability that we would
observe at least as big a difference between the observed estimate and the hypothesized
value of the parameter, if H0  is true.                                                                              
The smaller the p-value, the more evidence against H0 .                                          

4. Draw a conclusion about the value of the parameter hypothesized in H0 , based on the p-
value:

If p-value ≤ .05 , reject H0 in favour of Ha .

If p-value ​> .05 ​, do not reject H0 .

We call .05 the significance level of the test.


 

Hypothesis Test for Slope Parameter, β1


'Is there a relationship between overhead and office size in the population?'

According to our linear model, no relationship exists iff β1 = 0.

1. Present the null and alternative hypotheses.

H0 : β1 = 0     (no relationship exists)             Ha : β1 ≠ 0     (relationship exists)

2. Calculate the value of the test statistic.

Recall, from the distribution of β̂1:    (β̂1 − β1) / SE(β̂1) ∼ tn−2

Under H0 , β1 = 0 , yielding the test statistic:

t = β̂1 / SE(β̂1) = 126.33 / 10.88 = 11.61

 
 


3. Obtain the p-value = P(|T| > |t|) = 2P(T > |t|), where T ∼ t22, and t is the value of the test statistic.

p-value = 2P (T > 11.61)

Obtaining the p-value from the t-table:

2P(T > 2.819) = 2(0.005) = 0.01  →  p-value = 2P(T > 11.61) < .01

Obtaining the p-value using R:


 > 2*(1-pt(11.61, 22)) # 1 - pt(t, df) gives the upper tail.

[1] 7.47824e-11
     4. Conclusion in context of study. Reject H0 (p-value < .05), and conclude that 
 There is a significant positive relationship between overhead and office size. 

Note that the summary R output displays the results of hypothesis tests for both the intercept and
slope parameters:

> summary(audit.slr.lm)
Call:
lm(formula = overhead ~ size)
Residuals:

   Min     1Q Median     3Q    Max 

-36639 -12874  -1997   8642  56686 


Coefficients:

             Estimate Std. Error t value Pr(>|t|)    

(Intercept) -27877.06   14172.00  -1.967 0.0619   ← (H0 : β0 = 0)  

size           126.33      10.88  11.610 7.47e-11 ← (H0 : β1 = 0)

---

Residual standard error: 23480 on 22 degrees of freedom


Multiple R-squared:  0.8597,    Adjusted R-squared:  0.8533 

F-statistic: 134.8 on 1 and 22 DF,  p-value: 7.472e-11


Further Notes on Hypothesis Tests


Types of errors in hypothesis testing
Note that whenever we draw a conclusion from a hypothesis test regarding the significance of the parameter, we could be in error, since our conclusion is based on a probabilistic criterion and is, therefore, subject to uncertainty.

For example, we may reject H0 : β1 = 0  and conclude there is a significant relationship, when no
relationship exists (β1 = 0 ). Conversely, we may accept H0 : β1 = 0  when, in fact, β1 ≠ 0  and a
relationship exists. 

The possibility that we could have made one of these errors should always be kept in mind when
drawing conclusions from a hypothesis test (as well as from a confidence interval). These two errors
are referred to as:

Type I error: Rejecting the null hypothesis when it is true

Type II error: Accepting (i.e. not rejecting) the null hypothesis when it is false.

Note that for any hypothesis test, P(Type I error) = .05 (Convince yourself of this. It will help in your
understanding of p-values)
 

Two-sided vs one-sided tests


By default, we will use a two-sided alternative hypothesis when performing hypothesis tests, since, in
most cases, we are concerned with discovering significant relationships in either direction (positive or
negative), and have little or no prior reliable knowledge of the possible direction of the relationship. If
a one-sided alternative seems appropriate, it will be specified. 
Note that hypothesis tests for which a one-sided alternative (e.g. Ha : β1 > 0 , or Ha : β1 < 0 ) is
appropriate yield a p-value = P (tn−2 > |t|), half the p-value that one would obtain with the two-sided
alternative,  Ha : β1 ≠ 0 .
 

The relationship between confidence interval and hypothesis tests


Note that the conclusions (i.e. whether or not a significant relationship exists) drawn from a 95% confidence interval for β1 will always be consistent with conclusions drawn from a test of H0 : β1 = 0.

That is:
If the 95% confidence interval contains 0, then a (two-sided) test of H0 : β1 = 0  would yield a p-
value ≥ .05

If the 95% confidence interval does not contain 0, then a (two-sided) test of H0 : β1 = 0  would
yield a p-value < .05


Multiple Regression Model


Consider the expanded audit dataset, which, in addition to office size, includes data on age of office
(yrs), # of employees, col (cost of living index, relative to a standard index of 1.0), and # of clients.

  overhead size age  employees    col clients


1  218955  1589   3        11    1.00    2450
2  224513  1912  19        15    1.00    2310

⋮                                      ⋮                    ⋮
23 188435  1812  15        14    1.00    2147
24 35099    607  15         2    0.95     492

By extending the SLR model to include p explanatory variables, we obtain the multiple linear
regression model:

yi = β0 + β1 xi1 + β2 xi2 + … + βp xip + ϵi i = 1, 2, … , n

The multiple regression model can be expressed in matrix form as:

⎡ y1 ⎤   ⎡ 1  x11  x12  …  x1p ⎤ ⎡ β0 ⎤   ⎡ ϵ1 ⎤
⎢ y2 ⎥   ⎢ 1  x21  x22  …  x2p ⎥ ⎢ β1 ⎥   ⎢ ϵ2 ⎥
⎢ ⋮  ⎥ = ⎢ ⋮    ⋮    ⋮   …   ⋮  ⎥ ⎢ ⋮  ⎥ + ⎢ ⋮  ⎥
⎣ yn ⎦   ⎣ 1  xn1  xn2  …  xnp ⎦ ⎣ βp ⎦   ⎣ ϵn ⎦

   y                X               β        ϵ

which we write as

y = Xβ + ϵ

Note the column of 1's prior to the vectors of explanatory variables in the X  matrix ​​that must be
included for models fit with an intercept, β0 .

The normal model


For the normal model, for which we assume ϵi ∼ N(0, σ²) ind., we write

Y = Xβ + ϵ,    ϵ ∼ N(0, σ²I)

where Var(ϵ) = σ²I is the covariance matrix of the error random vector, ϵ, with Cov(ϵi , ϵi) = Var(ϵi) = σ², i = 1, ..., n as the diagonal elements and Cov(ϵj , ϵk), j, k = 1, ..., n, j ≠ k on the off-diagonals.
Note that expressing the covariance matrix in this way captures both the constant variance assumption (Var(ϵi) = σ² for all i), and the independence assumption, since

Var(ϵ) = σ²I  →  Cov(ϵj , ϵk) = 0, j ≠ k  →  independent errors for ϵi ∼ normal


Least squares estimation of Beta


As with the SLR model, we wish to minimize the sum of squares of the errors function with respect to
the model parameters. For multiple regression this function takes the form:
S(β0 , β1 , … , βp) = ∑ ϵi² = ∑ [yi − (β0 + β1 xi1 + … + βp xip)]²

which we wish to minimize with respect to β0 , β1 , … , βp .


Taking the partial derivatives and setting them to 0 gives us:

∂S/∂β0 = −2 ∑ (yi − (β̂0 + β̂1 xi1 + … + β̂p xip)) = 0

∂S/∂β1 = −2 ∑ xi1 (yi − (β̂0 + β̂1 xi1 + … + β̂p xip)) = 0

⋮

∂S/∂βp = −2 ∑ xip (yi − (β̂0 + β̂1 xi1 + … + β̂p xip)) = 0

which yield the normal equations:


n β̂0 + (∑ xi1) β̂1 + … + (∑ xip) β̂p = ∑ yi

(∑ xi1) β̂0 + (∑ xi1²) β̂1 + … + (∑ xi1 xip) β̂p = ∑ xi1 yi

⋮

(∑ xip) β̂0 + (∑ xi1 xip) β̂1 + … + (∑ xip²) β̂p = ∑ xip yi

These normal equations can be expressed in matrix form as:


(Xᵀ X) β̂ = Xᵀ y

Solving for β̂ by multiplying both sides of the equation by (Xᵀ X)⁻¹ yields the least squares estimate:

β̂ = (Xᵀ X)⁻¹ Xᵀ y

(*For Xᵀ X to be invertible, X must be of full rank. That is, all p + 1 columns of X must be linearly independent. Otherwise a unique solution will not exist. We will explore this issue in more detail in a later topic.)

Notes:

The fitted line, μ̂ = β̂0 + β̂1 x1 + … + β̂p xp, can be represented in matrix form by μ̂ = xᵀ β̂, where xᵀ = (1, x1 , ..., xp)

the vector of fitted values is given by μ̂ = X β̂

The residual vector is given by e = y − μ̂

The sum of squares of residuals, ∑ ei² = ∑ (yi − μ̂i)², can be expressed in matrix form as eᵀ e = (y − μ̂)ᵀ (y − μ̂).

(Note the use of bold type when representing a matrix or vector, and regular type when representing a scalar)


Least squares estimation of β - audit model

> audit.lm=lm(overhead~size+age+employees+col+clients)
> audit.lm
Coefficients:

(Intercept)         size          age    employees          col      clients


  -198262.24        31.26       330.38      4695.73    178136.66      38.52 ​​

β̂ᵀ = [ β̂0            β̂1          β̂2           β̂3           β̂4          β̂5 ]

Verifying these estimates in R:


> X = model.matrix(audit.lm) #gives us the X matrix

> y = overhead

> XtXinv = solve(t(X)%*%X)  #solve(A) gives us A⁻¹; t(A) gives us Aᵀ

> beta_hat = XtXinv%*%t(X)%*%y  #computes the least squares estimate, beta_hat = (XᵀX)⁻¹Xᵀy

> beta_hat

(Intercept) -198262.24220

size             31.26158

age             330.37544

employees      4695.73212

col          178136.65602

clients          38.52180

Fitted values, μ̂i = β̂0 + β̂1 xi1 + β̂2 xi2 + … + β̂p xip = xiᵀ β̂:

> fitted(audit.lm)

         1         2         3         4           22        23      24       


 200698.68 205345.02  59566.12 207988.84 …  171866.94 189922.58 21921.68 
> fitted.audit.lm=X%*%beta_hat  #computes the vector of fitted values, μ̂ = Xβ̂

> t(fitted.audit.lm)

       1         2         3         4           22        23      24       


 200698.7   205345  59566.12  207988.8 …    171866.9 189922.6 21921.68 

Residuals, ei = yi − μ̂i:

> residuals(audit.lm)
        1         2           3          4           22         23         24   
18256.323  19167.976   6975.875   4360.160 … -4493.936  -1487.579  13177.321
> res.audit.lm=y-X%*%beta_hat  #computes the residual vector, e = y − μ̂ = y − Xβ̂

> t(res.audit.lm)

       1         2           3          4           22         23        24   


18256.32  19167.98    6975.875    4360.16 … -4493.936  -1487.579  13177.32


The Hat Matrix


Recall the vector of fitted values, given by μ̂ = Xβ̂, where β̂ = (Xᵀ X)⁻¹ Xᵀ y

We can thus express μ̂ as

μ̂ = X(Xᵀ X)⁻¹ Xᵀ y = Hy

where the hat matrix, H = X(Xᵀ X)⁻¹ Xᵀ, maps the vector of response variables to the vector of fitted values.
Properties of H:

H is symmetric (Hᵀ = H)
Proof:
Hᵀ = [X(Xᵀ X)⁻¹ Xᵀ]ᵀ

   = X[(Xᵀ X)⁻¹]ᵀ Xᵀ        ((AB)ᵀ = Bᵀ Aᵀ)

   = X[(Xᵀ X)ᵀ]⁻¹ Xᵀ        ((A⁻¹)ᵀ = (Aᵀ)⁻¹)

   = X(Xᵀ X)⁻¹ Xᵀ = H

H is idempotent (HH = H)
Proof:
HH = [X(Xᵀ X)⁻¹ Xᵀ][X(Xᵀ X)⁻¹ Xᵀ]

   = X(Xᵀ X)⁻¹ (Xᵀ X)(Xᵀ X)⁻¹ Xᵀ

   = X(Xᵀ X)⁻¹ Xᵀ = H

Note that the residual vector, e, can be expressed as

e = y − μ̂

  = y − Hy

  = (I − H)y

We can thus express our response vector as

y = μ̂ + e

  = Hy + (I − H)y

where Hy ⊥ (I − H)y. (This can be easily shown from the symmetric and idempotent properties of H)

This tells us that the response vector, y, can be decomposed into its two orthogonal elements: the vector of fitted values, μ̂, and the vector of residuals, e. This decomposition forms the basis of ANOVA methods, in which the variation in the response is partitioned into its two components - variation accounted for by the fitted model, and variation not accounted for by the model.
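These properties can be checked numerically for the audit fit (a sketch, assuming X and y were created as in the previous section):

> H = X%*%solve(t(X)%*%X)%*%t(X)  #hat matrix for the audit model
> all.equal(H, t(H))  #should return TRUE: H is symmetric
> all.equal(H, H%*%H)  #should return TRUE: H is idempotent
> all.equal(as.vector(H%*%y), as.vector(fitted(audit.lm)))  #should return TRUE: Hy gives the fitted values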
 


Least squares estimation of σ²


Recall the least squares estimate of σ² for the SLR model, given by:

σ̂² = ∑ ei² / (n − 2) = ∑ (yi − μ̂i)² / (n − 2)

where n − 2 is the degrees of freedom (df), resulting from the two constraints imposed on the residuals through the least squares estimation of the two parameters, β0 and β1.

The degrees of freedom for a p explanatory variable multiple regression model with p + 1 parameters (including the intercept, β0) is thus n − (p + 1), yielding the least squares estimate:

σ̂² = ∑ ei² / (n − (p + 1))

and residual standard error

σ̂ = √[∑ ei² / (n − (p + 1))]

For the 5-variable audit model with 24 − 6 = 18 degrees of freedom, the residual standard error is

σ̂ = √(∑ ei² / 18) = 14430

as seen in the output below:

> summary(audit.lm)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)  

(Intercept) -198262.24   74354.09  -2.666   0.0157 *

size             31.26      21.47   1.456   0.1625  

age             330.38     502.03   0.658   0.5188  

employees      4695.73    5492.21   0.855   0.4038  

col          178136.66   69013.51   2.581   0.0188 *

clients          38.52      33.03   1.166   0.2587  

---

Residual standard error: 14430 on 18 degrees of freedom


Multiple R-squared:  0.9566,    Adjusted R-squared:  0.9446 

F-statistic: 79.41 on 5 and 18 DF,  p-value: 1.261e-11


Least squares vs maximum likelihood estimation


Note that for a regression model with normal errors, the maximum likelihood and least squares
estimates of β
β are equivalent:

For ϵi ∼ N(0, σ²) ind., the likelihood function is given by:

L(β0 , β1 , … , βp ∣ y1 , … , yn) = ∏ f(yi)        (product over i = 1, … , n)

   = ∏ [1/√(2πσ²)] exp[−(yi − μi)² / (2σ²)]

   = (2πσ²)^(−n/2) exp[−∑(yi − μi)² / (2σ²)]

where μi = β0 + β1 xi1 + ⋯ + βp xip.

We see that the log-likelihood, of the form

ℓ = log(L) = c − ∑[yi − (β0 + β1 xi1 + … + βp xip)]² / (2σ²)

is maximized for the same value of β = (β0 , β1 , … , βp)ᵀ that minimizes the sum of squares of the errors,

∑ ϵi² = ∑ [yi − (β0 + β1 xi1 + … + βp xip)]²

Gauss-Markov theorem and BLUE


We have shown that, for linear regression models with normal errors, the maximum likelihood and
least squares methods yield identical estimators for β
β.

How do these two (and other possible) estimation methods compare when the errors are not assumed
to be normal?   

The Gauss-Markov theorem states that the least squares estimator β̂ = (Xᵀ X)⁻¹ Xᵀ Y is the 'best linear unbiased estimator' (BLUE) of β.

Stated more formally:

Consider the model given by Y = Xβ + ϵ, where E(ϵ) = 0, Var(ϵ) = σ²I

Among all unbiased, linear estimators, β̂* = MY, the least squares estimator, given by β̂ = M̂Y, where M̂ = (Xᵀ X)⁻¹ Xᵀ, has the smallest variance. That is:

Var(β̂*) = Var(β̂) + σ²(M − M̂)(M − M̂)ᵀ

where (M − M̂)(M − M̂)ᵀ is a positive semidefinite matrix (a matrix A is positive semidefinite if aᵀ A a ≥ 0 for any vector a).


Distribution of β̂

Y = Xβ + ϵ,  ϵ ∼ N(0, σ²I)  →  Y ∼ N(Xβ, σ²I)  →  β̂ = (Xᵀ X)⁻¹ Xᵀ Y ∼ (multivariate) Normal

E[β̂] = E[(Xᵀ X)⁻¹ Xᵀ Y]

      = (Xᵀ X)⁻¹ Xᵀ E[Y]

      = (Xᵀ X)⁻¹ Xᵀ Xβ

      = β

Var[β̂] = Var[(Xᵀ X)⁻¹ Xᵀ Y]

        = (Xᵀ X)⁻¹ Xᵀ Var[Y] [(Xᵀ X)⁻¹ Xᵀ]ᵀ        (Var(AY) = A Var(Y) Aᵀ)

        = σ² (Xᵀ X)⁻¹ Xᵀ [(Xᵀ X)⁻¹ Xᵀ]ᵀ

        = σ² (Xᵀ X)⁻¹ Xᵀ X (Xᵀ X)⁻¹

        = σ² (Xᵀ X)⁻¹

Thus, the distribution of β̂ is given by

β̂ ∼ N(β, σ²(Xᵀ X)⁻¹)

and the distribution of β̂j, the jth element of β̂, by

β̂j ∼ N(βj , σ²(Xᵀ X)⁻¹jj)        j = 0, 1, 2, … , p

where (Xᵀ X)⁻¹jj represents the jth diagonal element of (Xᵀ X)⁻¹

β̂j ∼ N(βj , σ²(Xᵀ X)⁻¹jj)  →  (β̂j − βj) / SE(β̂j) ∼ tn−(p+1)

Note the following results from the distribution of β̂ and β̂j:


Var(β̂j) = σ²(Xᵀ X)⁻¹jj        (parameter estimators do not have constant variance)

  →  SE(β̂j) = σ̂ √[(Xᵀ X)⁻¹jj]        (required for confidence intervals and hypothesis tests)

Cov(β̂j , β̂k) = σ²(Xᵀ X)⁻¹jk ≠ 0        (parameter estimators not, in general, independent)
 
Interpretation of β̂j

Since the parameter estimators are not independent, the value of β̂j, the estimate associated with the variable xj, will depend on the other variables in the model.

β̂j can thus be interpreted as: the estimated mean change in the response associated with a change of one unit in xj after accounting for the other variables (i.e. while holding all other variables constant).


Example: Fitted audit model:

> summary(audit.lm)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -198262.24   74354.09  -2.666   0.0157

size             31.26      21.47   1.456   0.1625

age             330.38     502.03   0.658   0.5188

employees      4695.73    5492.21   0.855   0.4038

col          178136.66   69013.51   2.581   0.0188

clients          38.52      33.03   1.166   0.2587

---

Residual standard error: 14430 on 18 degrees of freedom


Multiple R-squared:  0.9566,    Adjusted R-squared:  0.9446 

F-statistic: 79.41 on 5 and 18 DF,  p-value: 1.261e-11


Note the difference in the standard errors, SE(β̂j) = σ̂ √[(Xᵀ X)⁻¹jj], for the different estimators.

Note also the change in the size parameter estimate (31.26) from that of the SLR fitted model (126.33).

Interpretation of β̂j: β̂2 = 330.38 → after accounting for size, # of employees, col, and # of clients, each additional year in the age of the office is associated with an estimated increase in overhead of $330.38

> coef(summary(audit.lm))[,'Std. Error']

(Intercept)        size         age   employees         col     clients 

74354.08519    21.46704   502.02680  5492.21433 69013.51183    33.02616 

 Verifying these standard errors in R:


> sigma_hat=sigma(audit.lm) #yields residual standard error
> SE_betahats=sigma_hat*sqrt(diag(XtXinv))  #SE(β̂j) = σ̂ √[(Xᵀ X)⁻¹jj]

> SE_betahats

(Intercept)        size         age   employees         col     clients 

74354.08519    21.46704   502.02680  5492.21433 69013.51183    33.02616 


Inference for model parameters


After accounting for office size, age, # of employees, and cost of living, is there a relationship between
# of clients and overhead?

Confidence intervals for βj


Recall a (1 − α)100% confidence interval for β1  from the SLR model:

β̂1 ± tn−2,1−α/2 SE(β̂1)

This result can easily be extended to a (1 − α)100% confidence interval for βj:

β̂j ± tn−(p+1),1−α/2 SE(β̂j)

Example: A 95% CI for β5 (the clients coefficient) from the audit model is

β̂5 ± t18,0.975 SE(β̂5)

= 38.52 ± 2.101(33.03)        (from t-table or R: P(t18 < 2.101) = .975)

= 38.52 ± 69.40

= (−30.88, 107.92)

Since the interval encompasses 0, we conclude that, after accounting for the other model variates,
there is no significant relationship between number of clients and overhead.
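The same interval is available directly from confint() (a sketch, using the full five-variable fit audit.lm):

> confint(audit.lm, 'clients', level=.95)  #approx (-30.9, 107.9), matching the hand calculation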

Hypothesis Tests for βj

H0 : β5 = 0

Ha : β5 ≠ 0

t = (β̂5 − β5) / SE(β̂5) = (β̂5 − 0) / SE(β̂5) = β̂5 / SE(β̂5) = 38.52 / 33.03 = 1.166

p-value = 2P(t18 > 1.166)

From t-table: P(t18 > 1.330) = 0.10 → 2P(t18 > 1.330) = .20 → p-value = 2P(t18 > 1.166) > 0.20

In R: > 2*(1-pt(1.166,18)) → p-value = .2588

We do not reject H0  (p-value > .05). There is no significant relationship between # of clients and
overhead, after accounting for the other model variables.

> summary(audit.lm)
                Estimate Std. Error t value Pr(>|t|)  
(Intercept) -198262.24   74354.09  -2.666   0.0157

size             31.26      21.47   1.456   0.1625  

age             330.38     502.03   0.658   0.5188  

employees      4695.73    5492.21   0.855   0.4038  

col          178136.66   69013.51   2.581   0.0188

clients          38.52      33.03   1.166   0.2587   ← (H0 : β5 = 0)


Multicollinearity
Consider the matrix of pairwise scatterplots for the audit dataset (using >plot(audit) in R):

Take a minute to examine the relationships between the response and explanatory variates (top row)
and the relationships among the explanatory variates in the remaining rows. Note especially the
relationships among size, employees, and clients.

The scatterplots reveal strong linear associations (correlations) among some of the explanatory variables - particularly between employees and clients (r > .99)

When strong (linear) relationships are present among two or more explanatory variables, we say these variables exhibit multicollinearity.

Multicollinearity leads to inflated (i.e. increased) variances of the associated parameter estimators,
and correspondingly, inflated standard errors. This in turn leads to wide (imprecise) confidence
intervals and inaccurate conclusions from hypothesis tests, due to inflated p-values.

To assess the degree of multicollinearity associated with an explanatory variable, xj :


1. Regress xj  onto all other explanatory variables. That is, we consider xj  to be the response
variable, and fit the model with all other explanatory variables.

2. Calculate the variance inflation factor (VIF) associated with xj :


1
V I Fj =
2
1−R
j

where R
2
j
 is the coefficient of determination (Multiple R-squared in R) of the model fit with 
xj  as the response variable.

VIFj can be interpreted as the factor by which the variance of β̂j is increased, through the multicollinearity among xj and the other explanatory variables, relative to the case in which all explanatory variables are uncorrelated.

One simple solution is to remove xj from the model if VIFj > 10. Note that this corresponds to Rj² > .90. This is a general rule of thumb - some references consider VIFj > 5 to be cause for concern, depending on the context.
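If the car package is available, the VIFs for all explanatory variables in a fitted model can also be obtained in one step with its vif() function (a sketch; car is an add-on package that would need to be installed and is not otherwise used in these notes):

> library(car)  #install.packages('car') first if necessary
> vif(audit.lm)  #one VIF per explanatory variable in the fitted model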


Multicollinearity - audit data example


From the scatterplots, # of employees and the # of clients appear to have the strongest relationships.
We'll investigate # of employees here (we can do the same with clients, or any other variables).

> audit.emp.lm=lm(employees~size+age+col+clients)

> summary(audit.emp.lm)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.5700313  2.9959159  -1.192   0.2481    

size         0.0016606  0.0008117   2.046   0.0549 

age          0.0148300  0.0206924   0.717   0.4823    

col          2.1395850  2.8406723   0.753   0.4606    

clients      0.0055651  0.0005226  10.649 1.89e-09

---

Residual standard error: 0.6028 on 19 degrees of freedom


Multiple R-squared:  0.9855,    Adjusted R-squared:  0.9824 

F-statistic: 322.7 on 4 and 19 DF,  p-value: < 2.2e-16

 > VIF.emp = 1/(1-0.9855)


 > VIF.emp

 [1] 68.96552
Since the VIF is extremely large, we would typically remove employees from the model.

Note the effect the removal of employees from the model has on the standard errors, and
subsequently the p-values, of the other variables (esp. clients). 

Before removal:

              Estimate Std. Error t value Pr(>|t|)  

(Intercept) -198262.24   74354.09  -2.666   0.0157

size             31.26      21.47   1.456   0.1625  

age             330.38     502.03   0.658   0.5188  

employees      4695.73    5492.21   0.855   0.4038  

col          178136.66   69013.51   2.581   0.0188

clients          38.52      33.03   1.166   0.2587  ​

 After removal:

              Estimate Std. Error t value Pr(>|t|)    

(Intercept) -215026.15   71212.70  -3.019  0.00705

size             39.06      19.30   2.024  0.05723 

age             400.01     491.86   0.813  0.42614    

col          188183.57   67522.57   2.787  0.01175

clients          64.65      12.42   5.205 5.04e-05

We will consider the model with employees removed going forward.


Note that we could also have considered removing size (VIF = 8.4) or clients (VIF = 8.2) from the second model, but we will keep them both in the model for now.


Confidence Intervals for a Response Mean and Prediction Intervals for a Response
Once we have fit the model to our data, we may wish to use the fitted model to estimate the mean
response, or predict the value of the response, of a new unit in the population (one that was not used
in the fit of the model).
For example, after fitting the audit model to the 24 offices, an auditor may wish to use the model on
other offices in the population to assess the consistency of their claimed overhead with the office
attributes (size, age, ...). This, in fact, would likely be the main objective for fitting a regression model
to the audit data.
Consider the following questions an auditor might wish to address about a new office from the
population that is 1000 f t2 , 12 years old, with 1300 clients and a cost of living index of 1.02:
1. What is the estimated mean overhead for all offices with these attributes in the population?​​
2. What is the predicted overhead of this office?
Both these questions will yield the same value regardless of whether we're talking about the estimated mean response, μ̂new, or the predicted response, ŷnew, since

μ̂new = ŷnew = β̂0 + β̂1 xnew,1 + … + β̂p xnew,p = xnewᵀ β̂

where xnewᵀ = (1, xnew,1 , … , xnew,p).


The difference is in the variability associated with these values, resulting in a different margin of error
for the respective confidence/prediction interval.

Confidence Interval for $\mu_{new}$

As with the confidence interval we derived for $\beta_j$, obtaining the form of a confidence interval for $\mu_{new}$ requires that we first derive the distribution of $\hat{\mu}_{new}$.

Distribution of $\hat{\mu}_{new}$

First, recall the distribution of $\hat{\boldsymbol{\beta}}$, given by

$$\hat{\boldsymbol{\beta}} \sim N\!\left(\boldsymbol{\beta},\ \sigma^2 (X^TX)^{-1}\right)$$

We can use this result to derive the distribution of $\hat{\mu}_{new} = x_{new}^T\hat{\boldsymbol{\beta}}$ as follows:

$$\hat{\mu}_{new} = x_{new}^T\hat{\boldsymbol{\beta}} \sim \text{Normal} \quad \text{(linear combination of normal r.v.'s)}$$

$$E(\hat{\mu}_{new}) = E(x_{new}^T\hat{\boldsymbol{\beta}}) = x_{new}^T E(\hat{\boldsymbol{\beta}}) = x_{new}^T\boldsymbol{\beta} \ (= \mu_{new})$$

$$Var(\hat{\mu}_{new}) = Var(x_{new}^T\hat{\boldsymbol{\beta}}) = x_{new}^T\, Var(\hat{\boldsymbol{\beta}})\, x_{new} = \sigma^2\, x_{new}^T (X^TX)^{-1} x_{new}$$

giving us the distribution of $\hat{\mu}_{new}$ as

$$\hat{\mu}_{new} \sim N\!\left(x_{new}^T\boldsymbol{\beta},\ \sigma^2\, x_{new}^T (X^TX)^{-1} x_{new}\right)$$


$(1-\alpha)100\%$ confidence interval for $\mu_{new}$

The distribution of $\hat{\mu}_{new}$ gives us a $(1-\alpha)100\%$ confidence interval for $\mu_{new}$ of the form

$$\hat{\mu}_{new} \pm t_{n-(p+1),\,1-\alpha/2}\ \hat{\sigma}\sqrt{x_{new}^T (X^TX)^{-1} x_{new}}$$

Example: Provide a 95% confidence interval for the mean overhead for offices in the population that are 1000 ft², 12 years old, with 1300 clients and a cost of living index of 1.02.

> new_x=data.frame(size=1000,age=12,col=1.02,clients=1300)

> predict(audit2.lm,new_x,interval='confidence',level=.95)

       fit      lwr      upr

  104831.2 97460.07 112202.3    → (97460, 112202), i.e. $104831 ± $7371

Interpretation: We can be 95% confident that the mean overhead for offices in the population with
these characteristics is between $97,460 and $112,202.

Prediction interval for $y_{new}$

Whereas the variance of $\hat{\mu}_{new} = x_{new}^T\hat{\boldsymbol{\beta}}$ is based solely on the variance of the parameter estimators, the variance of the predicted response, $\hat{y}_{new}$, is comprised of two sources of variation: the variation associated with the parameter estimators, and the variance, $\sigma^2$, associated with a random response.

Adding these two (independent) sources of variation:

$$Var(y_{new}) + Var(\hat{\mu}_{new}) = \sigma^2 + \sigma^2\, x_{new}^T (X^TX)^{-1} x_{new} = \sigma^2\left(1 + x_{new}^T (X^TX)^{-1} x_{new}\right)$$

This gives us a $(1-\alpha)100\%$ prediction interval for $y_{new}$ of the form

$$\hat{y}_{new} \pm t_{n-(p+1),\,1-\alpha/2}\ \hat{\sigma}\sqrt{1 + x_{new}^T (X^TX)^{-1} x_{new}}$$

Example: Provide a 95% prediction interval for an office in the population that is 1000 ft², 12 years old, with 1300 clients and a cost of living index of 1.02.

> predict(audit2.lm,new_x,interval='prediction',level=.95)

       fit      lwr      upr
  104831.2 73946.72 135715.7    → (73946, 135715), i.e. $104831 ± $30885

(Note the four-fold increase in the margin of error, from $7371 to $30885)

Interpretation: We predict with 95% confidence that the overhead for this office is between $73,946
and $135,715.
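To make the two margin-of-error formulas concrete, here is a minimal R sketch that reproduces both intervals directly from the fitted model object rather than through predict(). It assumes the fitted model audit2.lm and the coefficient ordering (Intercept, size, age, col, clients) shown in the earlier output; treat it as an illustration of the formulas, not a required approach.

> x_new <- c(1, 1000, 12, 1.02, 1300)                          # intercept, size, age, col, clients
> fit   <- sum(coef(audit2.lm) * x_new)                        # point estimate x_new' beta-hat
> se_mu <- as.numeric(sqrt(t(x_new) %*% vcov(audit2.lm) %*% x_new))   # sigma-hat * sqrt(x'(X'X)^-1 x)
> se_y  <- sqrt(summary(audit2.lm)$sigma^2 + se_mu^2)          # adds sigma^2 for a new response
> tcrit <- qt(0.975, df = df.residual(audit2.lm))
> fit + c(-1, 1) * tcrit * se_mu                               # 95% confidence interval for mu_new
> fit + c(-1, 1) * tcrit * se_y                                # 95% prediction interval for y_new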


Confidence and Prediction Bands ​


Note that for the SLR model, $\hat{\sigma}\sqrt{x_{new}^T (X^TX)^{-1} x_{new}}$ and $\hat{\sigma}\sqrt{1 + x_{new}^T (X^TX)^{-1} x_{new}}$ reduce to

$$\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{S_{xx}}} \quad \text{and} \quad \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{S_{xx}}},$$

respectively (try showing this as an exercise).


We thus see that the confidence and prediction intervals will be narrower the closer the value of the
explanatory variate, xnew  is to the sample mean, x̄, resulting in the confidence bands and prediction
bands seen below.

This can be generalized to the multiple regression model: the closer $\{x_1, x_2, \ldots, x_p\}$ is to the centroid $\{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p\}$, the narrower the interval.
 


Modelling Categorical Explanatory Variables


To investigate the effect of two promotion types on store sales, researchers carried out a study
according to the following sampling protocol, and yielding the data below:
30 stores were randomly selected from all stores in the population (e.g. all Tim Hortons franchises in Canada)

10 stores were randomly assigned to implement each of the three promotion types: promo A, promo B, and the control (no promo).

Response variable: The percent change in sales over the two-week period of the study.

Explanatory variable: Promotion type (a categorical variable)

Promo A Promo B No promo


4 0.7 -5.7
2.2 8.2 5.5
9.5 1.3 -0.6
10.7 9 -3.4
5.7 0.4 -0.4
9.2 -4 9
4.5 -3.5 -5.8
5 4.2 1.9
9.3 7.5 4.7
14.7 -2.8 -13.9

A regression analysis was conducted to investigate the relationship between promo type and sales.

Coding of Categorical Explanatory Variables


We can code categorical explanatory variables using indicator or 'dummy' variables, that typically
take on values of 0 or 1 to define the category level.
For example, coding promotion type with the indicator variables

$$x_1 = \begin{cases} 1 & \text{if store uses promo A} \\ 0 & \text{otherwise (promo B / no promo)} \end{cases} \qquad x_2 = \begin{cases} 1 & \text{if store uses promo B} \\ 0 & \text{otherwise (promo A / no promo)} \end{cases}$$

describes the promotion type for each store, where

$$\{x_1, x_2\} = \{1, 0\} \to \text{Promo A}; \quad \{x_1, x_2\} = \{0, 1\} \to \text{Promo B}; \quad \{x_1, x_2\} = \{0, 0\} \to \text{No promo}$$

Note that only two indicator variables are required to model the three category levels. In general, in models fit with an intercept, $\ell - 1$ indicator variables are needed to describe $\ell$ category levels.
To understand this from a mathematical perspective, consider the consequences if we were to
attempt to add a third indicator variate

$$x_3 = \begin{cases} 1 & \text{if no promo store} \\ 0 & \text{otherwise} \end{cases}$$

to the model.

Since $x_3 = 1 - (x_1 + x_2)$, $X$ is not of full rank, having only three linearly independent columns (the fourth column, containing the $x_3$ values, is a linear combination of the preceding three columns).

As a result, $(X^TX)$ is not invertible, and the least squares estimate $\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^Ty$ does not exist.
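As a quick illustration of how this coding arises in practice, the sketch below builds the indicator columns that R generates automatically when a categorical variable is stored as a factor. The vector of promotion labels here is made up for the example; with the course data, lm() would create the equivalent columns itself.

> promo <- factor(c("none", "A", "B", "A", "none"))   # hypothetical promotion labels
> promo <- relevel(promo, ref = "none")               # make 'no promo' the baseline level
> model.matrix(~ promo)                               # intercept plus l - 1 = 2 indicator columns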


Interpretation of Parameter Estimates


Model output for the promotion data is given below. 

            Estimate Std. Error t value Pr(>|t|)   


(Intercept)  -0.870      1.665  -0.523  0.60552   

PromoA        8.350      2.354   3.547  0.00145

PromoB        2.970      2.354   1.261  0.21792   


Residual standard error: 5.264 on 27 degrees of freedom

Multiple R-squared:  0.3238,    Adjusted R-squared:  0.2737 

F-statistic: 6.464 on 2 and 27 DF,  p-value: 0.005084

 
To understand how we can correctly interpret parameter estimates associated with indicator
variables, we refer to the fitted model,
$$\hat{\mu} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$$

where $\hat{\mu}$ is the estimated mean (percent change in) sales associated with a given promotion type, and the indicator variables, $x_1$ and $x_2$, are as previously defined.


 

For stores in which no promotion was used:

$$\hat{\mu}_{\text{no promo}} = \hat{\beta}_0 + \hat{\beta}_1(0) + \hat{\beta}_2(0) = \hat{\beta}_0$$

Thus, $\hat{\beta}_0$ can be interpreted as the estimated mean sales for stores that used no promotion. $\hat{\beta}_0 = -0.870$ tells us that, for those stores that used no promotion, mean sales decreased by an estimated 0.87 percent over the period of the study.

For stores in which Promotion A was used:

$$\hat{\mu}_A = \hat{\beta}_0 + \hat{\beta}_1(1) + \hat{\beta}_2(0) = \hat{\beta}_0 + \hat{\beta}_1 = \hat{\mu}_{\text{no promo}} + \hat{\beta}_1$$

Thus, $\hat{\beta}_1$ can be interpreted as the difference in estimated mean sales for stores that used Promotion A relative to the stores that used no promotion. $\hat{\beta}_1 = 8.35$ tells us that stores that used Promotion A had an estimated 8.35 higher percent change in mean sales than stores that used no promotion.

Similarly, $\hat{\beta}_2 = 2.97$ tells us that stores that used Promotion B had an estimated 2.97 higher percent change in mean sales than stores that used no promotion.


Inference for parameters associated with indicator variables


Is there a difference in mean sales between stores that used promotion A and stores that used no
promotion?

Recall the output from the model fit to the promotion data: 

            Estimate Std. Error t value Pr(>|t|)   


(Intercept)  -0.870      1.665  -0.523  0.60552   

PromoA        8.350      2.354   3.547  0.00145

PromoB        2.970      2.354   1.261  0.21792   

We can answer this question using a confidence interval or hypothesis test for β1 in the same manner
as learned previously:

H0 : β1 = 0 Ha : β1 ≠ 0

t = 3.547

p-value = 2P (t27 > 3.547) = 0.00145

Since p-value < 0.05, we reject H0 , and conclude that stores using promotion A had significantly
higher sales than stores using no promotion.
We could also calculate a 95% confidence interval for β1  to reach the same conclusion.
Similarly, we conclude that there was no significant difference in mean sales between stores using
promotion B and stores using no promotion (p-value = 0.21792).

Is there a difference in mean sales between promotion A and promotion B stores?


Since no difference in mean sales between these two promotion types ↔ β1 − β2 = 0 , we can test
for a difference with the null hypothesis, H0 : β1 − β2 = 0 .
One way to test this hypothesis is with a $t$ test statistic, the form of which we can obtain from the distribution of the estimator $\hat{\beta}_j - \hat{\beta}_k$, $j, k = 1, 2, \ldots, p$.

We can easily derive the distribution of $\hat{\beta}_j - \hat{\beta}_k$ from the distribution of $\hat{\beta}_j$ we derived previously:

$$\hat{\beta}_j \sim N\!\left(\beta_j,\ \sigma^2 (X^TX)^{-1}_{jj}\right) \ \Rightarrow\ \hat{\beta}_j - \hat{\beta}_k \sim \text{normal (lin. comb. of normal random variables)}$$

$$E(\hat{\beta}_j - \hat{\beta}_k) = E(\hat{\beta}_j) - E(\hat{\beta}_k) = \beta_j - \beta_k$$

$$Var(\hat{\beta}_j - \hat{\beta}_k) = Var(\hat{\beta}_j) + Var(\hat{\beta}_k) - 2Cov(\hat{\beta}_j, \hat{\beta}_k) = \sigma^2\left((X^TX)^{-1}_{jj} + (X^TX)^{-1}_{kk} - 2(X^TX)^{-1}_{jk}\right)$$

Then $\hat{\beta}_1 - \hat{\beta}_2 \sim N\!\left(\beta_1 - \beta_2,\ \sigma^2\left((X^TX)^{-1}_{11} + (X^TX)^{-1}_{22} - 2(X^TX)^{-1}_{12}\right)\right)$ yields the test statistic

$$t = \frac{\hat{\beta}_j - \hat{\beta}_k}{SE(\hat{\beta}_j - \hat{\beta}_k)} \sim t_{n-(p+1)} \quad \text{under } H_0: \beta_1 - \beta_2 = 0,$$

where $SE(\hat{\beta}_j - \hat{\beta}_k) = \hat{\sigma}\sqrt{(X^TX)^{-1}_{jj} + (X^TX)^{-1}_{kk} - 2(X^TX)^{-1}_{jk}}$.
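As a rough illustration of this standard error in R, the sketch below computes $\hat{\beta}_1 - \hat{\beta}_2$ and its standard error directly from the estimated variance-covariance matrix of the coefficients. It assumes the promo model is stored as promo.lm with coefficients ordered (Intercept, PromoA, PromoB), as in the output above; the object name is an assumption.

> a <- c(0, 1, -1)                                          # contrast picking out beta1 - beta2
> est <- sum(a * coef(promo.lm))                            # estimate of beta1 - beta2
> se  <- as.numeric(sqrt(t(a) %*% vcov(promo.lm) %*% a))    # SE(beta1-hat - beta2-hat)
> t_stat <- est / se
> 2 * pt(-abs(t_stat), df = df.residual(promo.lm))          # two-sided p-value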

A more widely applicable method to test hypotheses for any linear combination of model parameters, including $H_0: \beta_j - \beta_k = 0$, is the additional sum of squares method, based on the $F$ test statistic.
We will test H0 : β1 − β2 = 0  for the promo model using this method in an upcoming lesson.


Analysis of Variance (ANOVA)


Recall (STAT 230): The (sample) variance of a set of observations, $\{y_1, y_2, \ldots, y_n\}$, is given by

$$s^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$$

where we can think of the total sum of squares, $SS(Tot) = \sum (y_i - \bar{y})^2$, as representing the total variation in $y$.

In regression modelling, we partition this total variation in the response into two components - the variation in the response explained by the model variables, and the variation left unexplained.

We can express this partition algebraically as

$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum (\hat{\mu}_i - \bar{y})^2 + \sum (y_i - \hat{\mu}_i)^2$$

$$SS(Tot) = SS(Reg) + SS(Res)$$

where the regression sum of squares, SS(Reg) , is the variation explained by the model, and the
residual sum of squares, SS(Res), is the variation in the response left unexplained (i.e., not
accounted for by the model variables).

In ANOVA (ANalysis Of VAriance) methods of inference, we draw conclusions about the relative fit of a model or models by comparing these two sources of variation. The greater the variation explained by the model relative to the variation unexplained, the better the fit of the model.

Proof of sum of squares decomposition:

$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum (y_i - \hat{\mu}_i + \hat{\mu}_i - \bar{y})^2 = \sum (\hat{\mu}_i - \bar{y})^2 + \sum (y_i - \hat{\mu}_i)^2 + 2\sum (y_i - \hat{\mu}_i)(\hat{\mu}_i - \bar{y})$$

For the cross-product term,

$$\sum (y_i - \hat{\mu}_i)(\hat{\mu}_i - \bar{y}) = \sum \hat{\mu}_i (y_i - \hat{\mu}_i) - \bar{y}\sum (y_i - \hat{\mu}_i) = \sum \hat{\mu}_i e_i - \bar{y}\sum e_i = \sum \hat{\mu}_i e_i \quad \left(\textstyle\sum e_i = 0\right)$$

$$= \hat{\boldsymbol{\mu}}^T e = (Hy)^T(I-H)y = y^TH^Ty - y^TH^THy = 0 \quad (H \text{ symmetric, idempotent})$$

Therefore

$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum (\hat{\mu}_i - \bar{y})^2 + \sum (y_i - \hat{\mu}_i)^2$$

 
 


Coefficient of Determination

By partitioning the total variation in the response into its two component sources of variation, as described by the relationship $SS(Tot) = SS(Reg) + SS(Res)$, we see that the ratio $\frac{SS(Reg)}{SS(Tot)}$, or, equivalently, $1 - \frac{SS(Res)}{SS(Tot)}$, measures the proportion of the variation in the response explained by the model. We call this measure the coefficient of determination, or more simply, the (multiple) R-squared, and denote it by

$$R^2 = 1 - \frac{SS(Res)}{SS(Tot)}$$

For the fitted audit model below (fit now without # of employees due to multicollinearity issues)

              Estimate Std. Error t value Pr(>|t|)    

(Intercept) -215026.15   71212.70  -3.019  0.00705 ** 

size             39.06      19.30   2.024  0.05723 .  

age             400.01     491.86   0.813  0.42614    

col          188183.57   67522.57   2.787  0.01175 *  

clients          64.65      12.42   5.205 5.04e-05 ***


Residual standard error: 14330 on 19 degrees of freedom
Multiple R-squared:  0.9549,    Adjusted R-squared:  0.9454 

F-statistic: 100.5 on 4 and 19 DF,  p-value: 1.661e-12​

an R-squared value of 0.9549 tells us that over 95% of the variation in overhead is explained by the office's size, age, col and # of clients.
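As a small check of this definition, the sketch below recomputes the R-squared reported above from the residual and total sums of squares. It assumes the fitted model object is audit2.lm and that the response vector overhead is available in the workspace (both names are assumptions based on the earlier output).

> SS_res <- sum(residuals(audit2.lm)^2)           # variation left unexplained
> SS_tot <- sum((overhead - mean(overhead))^2)    # total variation in the response
> 1 - SS_res/SS_tot                               # should reproduce the Multiple R-squared (0.9549)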


F-test for model parameters


Is there a relationship between overhead and at least one of size, age, col, or number of clients?
We can test the hypothesis:
H0 : β1 = β2 = … = βp = 0

Ha : at least one of  βj ≠ 0 j = 1, 2, 3, … , p

with a test statistic that compares the relative magnitudes of the variation explained by the model, 
SS (Reg), and the variation left unexplained, SS (Res).

This test statistic takes the form

$$F = \frac{SS(Reg)/p}{SS(Res)/(n-(p+1))} = \frac{MS(Reg)}{MS(Res)}$$

where, under $H_0$, $F$ has an $F$ distribution on $p$ and $n-(p+1)$ degrees of freedom.

$MS(Reg)$ and $MS(Res)$ are the mean squared values, obtained by dividing the sums of squares by their respective degrees of freedom.

Examples of $F$ distributions for various degrees of freedom are shown in the following graph:

The F test statistic and p-value are provided in the last line of the summary output.

              Estimate Std. Error t value Pr(>|t|)    

(Intercept) -215026.15   71212.70  -3.019  0.00705 ** 

size             39.06      19.30   2.024  0.05723 .  

age             400.01     491.86   0.813  0.42614    

col          188183.57   67522.57   2.787  0.01175 *  

clients          64.65      12.42   5.205 5.04e-05 ***


Residual standard error: 14330 on 19 degrees of freedom
Multiple R-squared:  0.9549,    Adjusted R-squared:  0.9454 

F-statistic: 100.5 on 4 and 19 DF,  p-value: 1.661e-12​


 
Is there a (linear) relationship between overhead and at least one of size, age, col, or # of clients?

H0 : β1 = β2 = … = βp = 0

Ha : at least one of  βj ≠ 0 j = 1, 2, 3, … , p

$F = 100.5$

p-value $= P(F_{4,19} > 100.5) = 1.661 \times 10^{-12}$

Reject $H_0$. At least one of size, age, col, # of clients is significantly related to overhead.


The test of $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ is often summarized in an ANOVA table, which shows not only the $F$ test statistic and p-value, but also the breakdown of the sums of squares (and mean squares) of the two sources of variation.

Source      df        SS        MS                  F                 p-value
Regression  p         SS(Reg)   SS(Reg)/p           MS(Reg)/MS(Res)   P(F_{p, n-(p+1)} > F)
Residual    n-(p+1)   SS(Res)   SS(Res)/(n-(p+1))
Total       n-1       SS(Tot)

Exercise: For the audit model, confirm the value of the test statistic, $F = 100.5$, and complete the ANOVA table from values in the summary output.

We can obtain $SS(Res)$ from the residual standard error, $\hat{\sigma} = \sqrt{\frac{\sum e_i^2}{n-(p+1)}} = \sqrt{\frac{SS(Res)}{n-(p+1)}}$:

$$SS(Res) = \hat{\sigma}^2(n-(p+1)) = 14330^2(19) = 3901629100$$

$$R^2 = 1 - \frac{SS(Res)}{SS(Tot)} = 0.9549 \ \to\ SS(Tot) = \frac{SS(Res)}{1 - R^2} = \frac{3901629100}{1 - 0.9549} = 86510623060$$

Source      df   SS            MS            F      p-value
Regression  4    82608993960   20652248490   100.6  1.66e-12
Residual    19   3901629100    205348900
Total       23   86510623060

(slight discrepancy in F value due to round-off error)
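One way to check these values in R (a sketch, assuming the fitted model audit2.lm and response overhead from earlier) is to compare the intercept-only model to the full model; the resulting F statistic and p-value should match the last line of the summary output.

> audit.null.lm <- lm(overhead ~ 1)     # intercept-only model: SS(Res) = SS(Tot)
> anova(audit.null.lm, audit2.lm)       # F = MS(Reg)/MS(Res), matching the summary F-statistic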


 


Additional Sum of Squares


After accounting for col index and # of clients, does either size or age account for significant variation
in overhead?

Consider a 'full' model of $p+1$ parameters given by

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$$

and consider a 'reduced' or restricted model that reflects the restrictions imposed by $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$, $k \le p$, given by

$$Y = \beta_0 + \beta_{k+1} x_{k+1} + \beta_{k+2} x_{k+2} + \cdots + \beta_p x_p + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$$

To determine the better model, we assess the difference in the variation explained by the full and reduced models, expressed as $SS(Reg)_{full} - SS(Reg)_{red}$, or equivalently, $SS(Res)_{red} - SS(Res)_{full}$.

We call this difference in variation between the two models the additional sum of squares.

We test $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ with the $F$ test statistic:

$$F = \frac{(SS(Res)_{red} - SS(Res)_{full})/(df_{red} - df_{full})}{SS(Res)_{full}/df_{full}}$$

where, under $H_0$, $F \sim F_{df_{red} - df_{full},\, df_{full}}$.

Additional Sum of Squares - audit example


After accounting for col and clients, does either size or age account for significant variation in
overhead?

H0 : β1 = β2 = 0

Ha : at least one of β1 , β2 ≠ 0

full model: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$

reduced model (under $H_0$): $Y = \beta_0 + \beta_3 x_3 + \beta_4 x_4 + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$

$$F = \frac{(SS(Res)_{red} - SS(Res)_{full})/(df_{red} - df_{full})}{SS(Res)_{full}/df_{full}}$$

As an exercise, we can calculate the test statistic by obtaining the residual sums of squares of the full and reduced models from the output, in the same way we did in creating the ANOVA table.

$SS(Res)_{full} = \hat{\sigma}^2_{full}(df_{full}) = 14330^2(19)$   (from output of full model shown previously)

$SS(Res)_{red} = \hat{\sigma}^2_{red}(df_{red}) = 15360^2(21)$   (from output for the reduced model, below)

              Estimate Std. Error t value Pr(>|t|)    

(Intercept) -1.594e+05  7.154e+04  -2.227   0.0370 *  

col          1.484e+05  6.989e+04   2.124   0.0457 *  

clients      8.774e+01  4.734e+00  18.532 1.71e-14 ***

Residual standard error: 15360 on 21 degrees of freedom


$$F = \frac{(SS(Res)_{red} - SS(Res)_{full})/(df_{red} - df_{full})}{SS(Res)_{full}/df_{full}} = \frac{(15360^2(21) - 14330^2(19))/2}{14330^2} = 2.564$$

(Note that $MS(Res)_{full} = SS(Res)_{full}/df_{full} = \hat{\sigma}^2_{full}$.)

From the $F$ table: $P(F_{2,19} > 3.52) = 0.05 \ \to\ $ p-value $= P(F_{2,19} > 2.56) > 0.05$

Using R: 
> 1-pf(2.564,2,19)

[1] 0.1033253
Since the p-value > .05, we do not reject H0 . The reduced model is preferred.
More specifically, age and size together do not account for significant additional variation in overhead
after accounting for col and clients, so we do not need them in the model.
We can verify these results using the anova function in R:

> anova(audit.red.lm,audit.full.lm)

Analysis of Variance Table


Model 1: overhead ~ col + clients

Model 2: overhead ~ size + age + col + clients

  Res.Df RSS         Df Sum of Sq   F      Pr(>F)

1     21 4954374034                           

2     19 3901347198  2  1.053e+09   2.5642 0.1033


Additional Sum of Squares Test and Categorical Variables


Recall the promo model for the sales/promotion dataset from a previous lesson, given by:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I)$$

$$x_1 = \begin{cases} 1 & \text{if store uses promo A} \\ 0 & \text{otherwise} \end{cases} \qquad x_2 = \begin{cases} 1 & \text{if store uses promo B} \\ 0 & \text{otherwise} \end{cases}$$

with fitted model

                 Estimate Std. Error t value Pr(>|t|)   

(Intercept)   -0.870      1.665  -0.523  0.60552

typeA          8.350      2.354   3.547  0.00145

typeB          2.970      2.354   1.261  0.21792

Residual standard error: 5.264 on 27 degrees of freedom

 for which we wished to answer the question


Is there a difference in mean sales between promotion A and promotion B?
by considering a test of
H0 : β1 − β2 = 0

Ha : β1 − β2 ≠ 0

with test statistic

$$t = \frac{(\hat{\beta}_1 - \hat{\beta}_2) - 0}{SE(\hat{\beta}_1 - \hat{\beta}_2)}$$

Alternatively, we can test $H_0: \beta_1 - \beta_2 = 0$ using the additional sum of squares test statistic

$$F = \frac{(SS(Res)_{red} - SS(Res)_{full})/(df_{red} - df_{full})}{MS(Res)_{full}}$$

To do so, we need to fit the reduced model under $H_0: \beta_1 - \beta_2 = 0$ (i.e. under the restriction that $\beta_1 = \beta_2 = \beta^*$), given by

$$Y = \beta_0 + \beta^* x_1 + \beta^* x_2 + \epsilon = \beta_0 + \beta^*(x_1 + x_2) + \epsilon = \beta_0 + \beta^* x^* + \epsilon$$

where

$$x^* = \begin{cases} 1 & \text{if a promotion (either A or B) is used} \\ 0 & \text{otherwise} \end{cases}$$

Defining $x^*$ and fitting the reduced model in R:
> x_promo=x1+x2

> x_promo

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

    ⋮
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -0.870      1.786  -0.487   0.6299  

x_promo        5.660      2.187   2.588   0.0151

Residual standard error: 5.647 on 28 degrees of freedom


Is there a difference in mean sales between promotion A and promotion B?

Testing H0 : β1 − β2 = 0  using additional sum of squares:

> anova(promo_red.lm,promo.lm)

Analysis of Variance Table

Model 1: Promo_sales ~ x_red

Model 2: Promo_sales ~ type

  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  

1     28 893.02                              

2     27 748.30  1    144.72 5.2218 0.03038​

(As an exercise, confirm the value of the test statistic and the p-value using the procedure we used in
the previous lesson)

We reject $H_0$ and conclude that the mean sales associated with promotion A is significantly higher than the mean sales for promotion B.


Test of H0 : β1 = β2 = ⋯ = βp = 0  Revisited


Is there a (linear) relationship between overhead and at least one of size, age, col, or # of clients?
Recall in the lesson on ANOVA we addressed this question with the hypotheses

H0 : β1 = β2 = … = βp = 0

Ha : at least one of βj ≠ 0 j = 1, 2, … , p

and test statistic

$$F = \frac{SS(Reg)/p}{SS(Res)/(n-(p+1))} = \frac{MS(Reg)}{MS(Res)}$$

This is just another example of an additional sum of squares test, for which the reduced model is

$$Y = \beta_0 + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$$

To see this, consider the least squares estimate of $\beta_0$ under the reduced model:

$$S(\beta_0) = \sum (y_i - \beta_0)^2$$

$$\frac{\partial S}{\partial \beta_0} = -2\sum (y_i - \hat{\beta}_0) = 0 \ \Rightarrow\ -n\hat{\beta}_0 + \sum y_i = 0 \ \Rightarrow\ \hat{\beta}_0 = \frac{\sum y_i}{n} = \bar{y}$$

This result is consistent with our intuitive understanding of the fitted model. With no explanatory variables in the model, $\hat{\mu} = \hat{\beta}_0 = \bar{y}$. (The sample mean, $\bar{y}$, is the least squares estimate of $\mu$.)

Note that the residual sum of squares for the reduced model is equivalent to the total sum of squares, since

$$SS(Res)_{red} = \sum e_i^2 = \sum (y_i - \hat{\mu}_i)^2 = \sum (y_i - \hat{\beta}_0)^2 = \sum (y_i - \bar{y})^2 = SS(Tot)$$

This should also be intuitively obvious, since, with no explanatory variables in the model, no variation is explained by the model, and $SS(Reg)_{red} = 0$.
Then the additional sum of squares test statistic reduces to:

$$F = \frac{(SS(Res)_{red} - SS(Res)_{full})/p}{MS(Res)_{full}} = \frac{(SS(Tot) - SS(Res)_{full})/p}{MS(Res)_{full}} = \frac{SS(Reg)_{full}/p}{MS(Res)_{full}} = \frac{MS(Reg)}{MS(Res)}$$

as defined previously.


Test of H0 : βj = 0  Revisited
After accounting for size, col, and clients, is age related to overhead?

Recall that in an earlier lesson we addressed this question with the hypothesis $H_0: \beta_j = 0$ and test statistic

$$t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$

for $j = 2$ (the age parameter), the value of which, and the associated p-value, can be obtained directly from the summary output.

              Estimate Std. Error t value Pr(>|t|)    

(Intercept) -215026.15   71212.70  -3.019  0.00705 

size             39.06      19.30   2.024  0.05723  

age             400.01     491.86   0.813  0.42614

col          188183.57   67522.57   2.787  0.01175

clients          64.65      12.42   5.205 5.04e-05

Residual standard error: 14330 on 19 degrees of freedom

An equivalent test of $H_0: \beta_j = 0$ that yields an identical p-value can be obtained with the additional sum of squares test statistic

$$F = \frac{(SS(Res)_{red} - SS(Res)_{full})/(df_{red} - df_{full})}{SS(Res)_{full}/df_{full}} = \frac{SS(Res)_{red} - SS(Res)_{full}}{MS(Res)_{full}}$$

Note that the degrees of freedom of the numerator, $df_{red} - df_{full}$, is one, since there is only one restriction imposed on the model by the null hypothesis. That is, the full model has only one more parameter than the reduced model.

To illustrate, we can perform an additional sum of squares test on the full and reduced audit models
associated with H0 : β2 = 0 :

> anova(audit_minus_age.lm,audit.lm)

Analysis of Variance Table


Model 1: overhead ~ size + col + clients

Model 2: overhead ~ size + age + col + clients

  Res.Df        RSS Df Sum of Sq      F Pr(>F)

1     20 4037157770                           

2     19 3901347198  1 135810572 0.6614 0.4261

 
Note the equivalent p-value (0.4261) associated with the $F$ and $t$ test statistics.

Note also the relationship between the values of $F$ and $t$: we see that $F = 0.661 = 0.813^2 = t^2$.

This relationship between the $F$ and $t$ test statistics holds for any hypothesis test with a single restriction (e.g. $H_0: \beta_j = 0$, $H_0: \beta_j - \beta_k = 0$). In general, for $\upsilon$ degrees of freedom,

$$P(|t_\upsilon| > |t|) = P(F_{1,\upsilon} > t^2)$$


General Linear Hypothesis


Consider the hypotheses we have tested so far in lessons using additional sum of squares:

1) $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$   (audit model, $p = 4$)

2) $H_0: \beta_1 = \beta_2 = 0$   (audit model)

3) $H_0: \beta_1 - \beta_2 = 0$   (promo model)

4) $H_0: \beta_2 = 0$   (audit model)

These hypotheses all test linear combinations of the model parameters. As such, they can all be expressed in the form of the general linear hypothesis:

$$H_0: A\boldsymbol{\beta} = 0$$

where $A$ is an $\ell \times (p+1)$ matrix that imposes the $\ell$ linear constraints on the full model as described by $H_0$.

Result: Consider the ('full') normal model given by $Y = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim N(0, \sigma^2 I)$, and the corresponding ('reduced') model associated with a set of linear hypotheses of the form $H_0: A\boldsymbol{\beta} = 0$. Under $H_0$,

$$F = \frac{(SS(Res)_{red} - SS(Res)_{full})/(df_{red} - df_{full})}{SS(Res)_{full}/df_{full}} \sim F_{df_{red}-df_{full},\, df_{full}}$$

That is, the additional sum of squares test statistic can be used to test any set of linear hypotheses of the form $H_0: A\boldsymbol{\beta} = 0$.

For the hypotheses we have looked at so far:

$H_0: \beta_1 = \beta_2 = \cdots = \beta_4 = 0$ can be expressed in the form

$$H_0:\ \underbrace{\begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix}}_{\boldsymbol{\beta}} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

Similarly, for $H_0: \beta_1 = \beta_2 = 0$:

$$A = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}$$

For $H_0: \beta_1 - \beta_2 = 0$ (promo model):

$$A = \begin{bmatrix} 0 & 1 & -1 \end{bmatrix}$$

and for $H_0: \beta_2 = 0$:

$$A = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \end{bmatrix}$$
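For reference, hypotheses of this form can also be run directly in R. The sketch below uses the linearHypothesis() function from the car package (assuming that package is installed, and that the promo model is stored as promo.lm with coefficients named PromoA and PromoB); it carries out the same F test as comparing the full and reduced models with anova().

> library(car)
> linearHypothesis(promo.lm, "PromoA - PromoB = 0")   # F test of H0: beta1 - beta2 = 0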


Assessing Model Adequacy (Residual Analysis)


Consider the assumptions of the normal model, $Y = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim N(0, \sigma^2 I)$:

the functional form of the relationship between the response and the explanatory variables is correctly specified by the deterministic component of the model. For the linear model, this relationship is described by $\boldsymbol{\mu} = X\boldsymbol{\beta}$.

the errors follow a normal distribution

the errors have a constant variance, denoted by $\sigma^2$ (this property is sometimes referred to as homoskedasticity)

the errors are independent

(The last two assumptions are described by $Var(\boldsymbol{\epsilon}) = \sigma^2 I$.)

Distribution of the parameter estimators and subsequent methods of inference were derived based on
these assumptions. If the assumptions do not hold, the model is inadequate, and any conclusions
drawn from the fit of the model are inaccurate and meaningless.

We can assess model adequacy (i.e. the validity of the model assumptions) through examination and analysis of the fitted residuals, $e = y - \hat{\boldsymbol{\mu}}$, to ensure that the behaviour of these residuals is consistent with the assumptions of the model.


The best way to examine the residuals for evidence of departures from the model assumptions is
through residual plots.​​

Residual Plots
There are many different types of residual plots that may be used to assess model adequacy. We will
introduce only two of the more common ones here. 
Plot of the residuals, $e_i$, vs the fitted values, $\hat{\mu}_i$

The most common and useful diagnostic plot for assessing model assumptions is a plot of $e_i$ vs $\hat{\mu}_i$. It can be shown that, if the model assumptions hold, $e_i$ and $\hat{\mu}_i$ are uncorrelated.

Thus, if the model is adequate, we would expect to see no observable pattern in a plot of $e_i$ vs $\hat{\mu}_i$. The plot should only exhibit random scatter, as illustrated in the plots below from the fit of a SLR model.

If the assumptions do not hold, we would expect to see a pattern or relationship consistent with the assumption that has been violated. Two examples are provided on the following slide.


In the figure below, a SLR model is fit to a relationship that includes a quadratic component. Thus, the functional form of the model is not correctly specified ($\mu \ne \beta_0 + \beta_1 x$), resulting in an observed quadratic relationship in the plot of the residuals vs the fitted values.

Another common violation of the model assumptions is a non-constant variance of the errors. Often, for example, the variance of the errors may increase as the mean of the response increases, thus violating the assumption of $Var(\epsilon) = \sigma^2$, as seen in the example below.

QQ plots
QQ (Quantile-Quantile) plots are used to assess the assumption of normal errors. (This assumption is
more critical for very small datasets than for relatively large ones. Why?)
In normal QQ plots, the ordered residuals ('Sample Quantiles' in the R plot), $e_{(i)}$, are plotted vs the expected ordered values ('Theoretical Quantiles'), $E(Z_{(i)})$, where $Z_i \sim N(0,1)$.

If the residuals are from a normal distribution, then $e_{(i)}$ should be proportional to $E(Z_{(i)})$. Thus a straight-line relationship is an indication that the assumption of normal errors has been well met.
The QQ plots below provide an example of a fitted model that meets the assumption of normal errors
(left) and a model that does not (right).

  

(Note: we have discussed the use of residual plots to assess the assumptions of correct functional
form, constant variance of the errors, and normality of the errors. We will consider plots that address
the assumption of independence of the errors when we discuss time series data)
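As a practical aside, the two plots discussed above can be produced with a few lines of base R. The sketch below assumes a fitted model object called fit.lm (a placeholder name); the same calls apply to any lm object.

> fit.res <- residuals(fit.lm)
> plot(fitted(fit.lm), fit.res, xlab='fitted', ylab='residuals')   # residuals vs fitted values
> qqnorm(fit.res)                                                  # normal QQ plot of the residuals
> qqline(fit.res)                                                  # reference line through the quartiles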


Variance Stabilizing Transformations


There are several approaches available that may be used to attempt to address model inadequacies
revealed in the residual plots.
Often, a transformation of the response (and possibly one or more of the explanatory variables) is
sufficient to improve the adequacy of the model in terms of the model assumptions.
We call such transformations variance stabilizing transformations, since they address violation of
the constant variance assumption, in addition to the assumptions of model misspecification and non-
normality of the errors. 
Examples of common transformations include:

$\log(y)$   ((natural) log transformation)

$y^{1/2}$   (square root transformation)

$y^{-1}$   (reciprocal transformation)

These transformations are particularly useful when the error variance, $\sigma^2$, is a function of the mean response, $\mu$.

For example, it can be shown that, when the standard deviation is proportional to the mean, the log transformation is most appropriate, whereas a square root transformation is more suitable in cases where the variance is proportional to the mean.

Other possible approaches to addressing model inadequacies include:

the addition of higher order terms (e.g. $x^2$) in one or more of the explanatory variables

the inclusion of an interaction term

These approaches will be discussed in more detail in later lessons.


Residual Analysis - audit model


Consider again the audit model output

              Estimate Std. Error t value Pr(>|t|)    

(Intercept) -215026.15   71212.70  -3.019  0.00705

size             39.06      19.30   2.024  0.05723

age             400.01     491.86   0.813  0.42614    

col          188183.57   67522.57   2.787  0.01175

clients          64.65      12.42   5.205 5.04e-05

Residual standard error: 14330 on 19 degrees of freedom

Multiple R-squared:  0.9549,    Adjusted R-squared:  0.9454 

F-statistic: 100.5 on 4 and 19 DF,  p-value: 1.661e-12

The large R-squared value and relatively small p-values associated with most of the variables suggest
that we have a very good fit, ​​providing the model assumptions are valid.
There is nothing in the output that provides any information on the validity of the assumptions.
Whenever we fit a model, we must always perform a residual analysis to ensure that the assumptions
have been met and the model is adequate. If the model is not adequate, any conclusions we draw
from the fit of the model are meaningless.
We will begin by examining a plot of the residuals vs the fitted values.

> plot(fitted(audit.lm),residuals(audit.lm),pch=19,xlab='fitted',ylab='residuals')

The presence of an obvious observable pattern indicates the model is not adequate. We see that
taking a square root transformation of the response helps to address problems with the model:


Note that by taking an appropriate transformation to address issues with the model assumptions, we
have also arrived at a better fitting model, as evidenced by the resulting increase in the R-squared
value and decrease in p-values seen in the output below:

> audit.sqrt.lm=lm(sqrt(overhead)~size+age+col+clients)

> summary(audit.sqrt.lm)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    

(Intercept) -161.86531   75.93121  -2.132 0.046302 

size           0.04694    0.02057   2.281 0.034226 

age            0.70536    0.52445   1.345 0.194472    

col          280.24570   71.99658   3.892 0.000979

clients        0.10285    0.01324   7.765  2.6e-07

---

Residual standard error: 15.28 on 19 degrees of freedom               


Multiple R-squared:  0.9762,    Adjusted R-squared:  0.9712 

F-statistic: 194.6 on 4 and 19 DF,  p-value: 3.926e-15

Notes:

Remember that the response is now in units of $\sqrt{\text{dollars}}$. When taking a transformation of the response, we need to back-transform estimated mean values, $\hat{\mu}$, and associated confidence and prediction intervals to the original units for interpretation purposes.

A log transformation was also applied, but did not adequately address problems with model
assumptions.
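As a rough sketch of that back-transformation (assuming the square-root model audit.sqrt.lm above and a data frame new_x of new office attributes, as in the earlier prediction example), the interval endpoints on the square-root scale can simply be squared to return to dollars:

> pi.sqrt <- predict(audit.sqrt.lm, new_x, interval='prediction', level=.95)   # interval in sqrt(dollars)
> pi.sqrt^2                                                                    # back-transformed to dollars

(Squaring the endpoints preserves the prediction interval's coverage here because the square-root transformation is monotone for positive overhead values.)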


 Properties of the Residuals


Before we continue in our analysis of the residuals to assess model adequacy, it is important that we
understand certain properties of the residuals. We will attempt to do so here.
We begin by examining the relationship between the fitted residuals and the model errors. Note that we can express the residual vector, $e$, as a function of the error vector, $\boldsymbol{\epsilon}$:

$$e = Y - \hat{\boldsymbol{\mu}} = Y - X\hat{\boldsymbol{\beta}} = Y - X(X^TX)^{-1}X^TY = Y - HY = (I - H)Y$$

$$= (I - H)(X\boldsymbol{\beta} + \boldsymbol{\epsilon}) = X\boldsymbol{\beta} - X(X^TX)^{-1}X^TX\boldsymbol{\beta} + \boldsymbol{\epsilon} - H\boldsymbol{\epsilon} = X\boldsymbol{\beta} - X\boldsymbol{\beta} + \boldsymbol{\epsilon} - H\boldsymbol{\epsilon} = (I - H)\boldsymbol{\epsilon}$$

(Aside): Note that $e = (I - H)Y = (I - H)\boldsymbol{\epsilon}$. This would seem to imply that $Y = \boldsymbol{\epsilon}$, since, by multiplying both sides by the inverse, $(I - H)^{-1}$,

$$(I - H)^{-1}(I - H)Y = (I - H)^{-1}(I - H)\boldsymbol{\epsilon} \ \Rightarrow\ Y = \boldsymbol{\epsilon}$$

However, for $(I - H)$ to be invertible, it must be of full rank, with $\text{rank}(I - H) = n$. This is not the case: it can be shown that $\text{rank}(I - H) = n - (p+1)$, so $(I - H)$ is not invertible.

Distribution of $e$

Now that we have established $e = (I - H)\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim N(0, \sigma^2 I)$, we can easily derive the distribution of $e$:

$e \sim$ Normal (since $\boldsymbol{\epsilon} \sim$ Normal)

$$E[e] = E[(I - H)\boldsymbol{\epsilon}] = (I - H)E(\boldsymbol{\epsilon}) = 0$$

$$Var(e) = (I - H)\,Var(\boldsymbol{\epsilon})\,(I - H)^T = \sigma^2 (I - H)(I - H)^T = \sigma^2 (I - H) \quad (H \text{ symmetric, idempotent})$$

Thus

$$e \sim N\!\left(\mathbf{0},\ \sigma^2 (I - H)\right) \ \to\ e_i \sim N\!\left(0,\ \sigma^2 (1 - h_{ii})\right)$$

where $h_{ii}$ is the $i$th diagonal element of $H$, $i = 1, 2, \ldots, n$.

Note in particular:

$Var(e_i) = \sigma^2(1 - h_{ii})$   (residuals have non-constant variance)

$Cov(e_j, e_k) = -\sigma^2 h_{jk}$, $j \ne k$   (residuals are not independent; this is a consequence of the constraint, $\sum e_i = 0$, placed on the residuals in least squares estimation)

We will be referring to these properties of the residuals in upcoming lessons.


Studentized Residuals
Recall that we standardize a random variable by subtracting the mean and dividing by the standard deviation. For example, in the case of a normal random variable, $X$:

$$X \sim N(\mu, \sigma^2) \ \to\ Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

where $Z$ is the standardized normal random variable.

Similarly, we studentize a random variable by subtracting the mean and dividing by the estimate of the standard deviation.

In a previous statistics course we studentized the sample mean of a normal random variable:

$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \ \to\ t = \frac{\bar{X} - \mu}{\hat{\sigma}/\sqrt{n}}$$

In this course, we studentized the parameter estimates:

$$\hat{\beta}_j \sim N\!\left(\beta_j,\ \sigma^2 (X^TX)^{-1}_{jj}\right) \ \to\ t = \frac{\hat{\beta}_j - \beta_j}{\hat{\sigma}\sqrt{(X^TX)^{-1}_{jj}}} = \frac{\hat{\beta}_j - \beta_j}{SE(\hat{\beta}_j)}$$
Note that in both cases, the resulting studentized random variable follows a $t$ distribution.

In the same way, we can studentize the residuals from the fit of a normal regression model, for which we have previously shown that $e_i \sim N(0, \sigma^2(1 - h_{ii}))$.

The studentized residual associated with the $i$th observation, denoted by $d_i$, is defined as

$$d_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$$

where the distribution of $d_i$ can be reasonably approximated by a $N(0, 1)$ distribution* for large $n$.

*In previous examples (e.g. $\bar{X}$, $\hat{\beta}_j$), the studentized random variables can be shown to follow a $t$ distribution. However, since $e_i$ and $\hat{\sigma}$ are not independent, the studentized residuals, $d_i$, do not have an exact $t$ distribution. The $t$ distribution will, however, be a reasonable approximation to the distribution of $d_i$ for large $n$. This implies that $d_i$ is approximately $N(0, 1)$ for large $n$.

 
Note that by studentizing the residuals, we have ensured that the resulting residuals, $d_i$, will have a constant (estimated) variance of 1. For this reason, the studentized residuals, $d_i$, are often used instead of the fitted residuals, $e_i$, in residual plots.
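In R, these (internally) studentized residuals are available directly; a minimal sketch, assuming the square-root audit model audit.sqrt.lm from the previous lesson, is:

> d <- rstandard(audit.sqrt.lm)                    # e_i / (sigma-hat * sqrt(1 - h_ii))
> plot(fitted(audit.sqrt.lm), d, xlab='fitted', ylab='studentized residuals')
> abline(h = c(-2.5, 2.5), lty = 2)                # rough outlier guides (see the next lesson)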


Extreme Values of the Response ('Outliers')


An outlier can be loosely defined in general as any observation which is extreme (either extremely
large or extremely small) relative to the other observations.
Here we think of an outlier in the response as an observation for which $y_i$ is much larger or smaller than its estimated mean response, $\hat{\mu}_i$, relative to the other observations.

Equivalently, we can define an outlier in the response as any observation for which the residual, $e_i = y_i - \hat{\mu}_i$, is extreme relative to the other residuals.

Recall the plot of the residuals vs the fitted values for the audit model, fit with the square root
transformation of the response:

Note that the transformation of the response reveals an outlier in the residuals, associated with an
office with an unusually high overhead relative to the mean overhead estimated from the model.

We can obtain additional perspective on how extreme the outlier is by plotting the studentized
residuals, di , in place of the fitted residuals, ei.

Note that there is little qualitative difference between the plot of the studentized residuals and that of the fitted residuals. However, we can use our understanding of the distribution of $d_i$ to determine how often as extreme an outlier would occur due to random variation.

Recall that, for large $n$, the distribution of $d_i$ can be reasonably approximated by a $N(0, 1)$ distribution. Based on normal probability theory, we know that approximately 99% of all observations will fall within $\pm 2.5$; anything within this range is acceptable variation.

Thus, as a general rule of thumb, an observation may be considered an outlier in the response if $|d_i| > 2.5$ or so. Here, the studentized residual in question is greater than 3, suggesting that the associated response is a moderate outlier.


Addressing Outliers in the Response


Once an outlier is detected, the associated observation should be investigated for a possible cause. 
Causes of outliers may include:
typos, misrecording of data

values of associated potential explanatory variables not included in the model

random variability

Deciding how to deal with outliers will depend on the cause, and should be dealt with on a case-by-
case basis. It is never a good idea to remove an observation deemed to be an outlier from the fit of
the model without further investigation. 
 


Leverage
Leverage is a measure used to identify those observations whose set of explanatory
variables is extreme relative to the sets of explanatory variables of the other observations.
The leverage of the $i$th observation in a dataset is defined as the $i$th diagonal element of the hat matrix, denoted by $h_{ii}$. It is a function of the distance between the point $(x_{i1}, x_{i2}, \ldots, x_{ip})$ and the centroid, $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)$, of the sets of explanatory variables of the dataset.

To see how we can use leverage to identify extreme values in the sets of explanatory variables, we consider the SLR case. It can be shown that the leverage can be expressed as

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}$$

(You can try to show this as an exercise.)

Note that the more extreme the value of the explanatory variable, $x_i$, relative to the mean, $\bar{x}$, the larger the leverage.

Recall also that $\hat{\boldsymbol{\mu}} = Hy \Rightarrow \hat{\mu}_i = h_{ii} y_i + \sum_{j \ne i} h_{ij} y_j$. The leverage, $h_{ii}$, can therefore be thought of as the weight of the contribution of $y_i$ to the fitted value, $\hat{\mu}_i$. The larger the leverage relative to the other observations, the more $y_i$ contributes to the fit of the line.

Leverage properties:

$\frac{1}{n} \le h_{ii} \le 1$

$\sum h_{ii} = tr(H) = \text{rank}(H) = \text{rank}(X) = p + 1$

Finally, note that, as $H = X(X^TX)^{-1}X^T$, $h_{ii}$ is a function only of the explanatory variables, not of the response.

Identifying High Leverage Cases


Unlike in assessing model assumptions, plots of the residuals, ei, are not useful in revealing high
leverage points. To see this, recall the distribution of the residuals we derived in a previous lesson,
where ei ∼ N (0, σ 2 (1 − hii )).  Note that as hii → 1, V ar(ei ) → 0 . Consequently, the residuals of high
leverage observations will tend to be close to zero. 
Instead, we can plot the leverage values ('hatvalues' in R). As a rough general rule, an observation $i$ is considered to have high leverage if

$$h_{ii} > 2\bar{h} = \frac{2(p+1)}{n}$$

A plot of the hatvalues for the audit model (with square root transformation of $y$) is shown below.

> plot(hatvalues(audit.sqrt.lm),cex.lab=1.3,cex.axis=1.3,cex=1.3,pch=19)

We see that there are no high leverage observations in the dataset. Note that all values are less than $2\bar{h} = 2(5)/24 = 0.417$.


Influential Observations
An observation is considered influential if its removal from the fit of the line changes the fitted line
(i.e. changes the parameter estimates) considerably.
Only high leverage cases have the potential to be influential. Whereas leverage depends only on the
explanatory variables, the influence of an observation also depends on the value of the response, as
illustrated below.

In the above plots, the fitted regression line with the leverage point included is given by the solid black line, and the line fit with the leverage point omitted is given by the dotted red line.
Note that in the plot on the left, the removal of the high leverage point does not dramatically alter the
fitted line, whereas in the plot on the right, omission of the leverage point alters the fitted line
considerably.
Thus the high leverage observation seen in both plots is not influential in the scenario on the left, but
is influential in the scenario on the right.

Identifying Influential Observations


As defined previously, an observation will be influential if its removal from the fit of the line changes the fitted line considerably. The larger the influence of an observation $i$, the larger the distance, $(\hat{\boldsymbol{\mu}} - \hat{\boldsymbol{\mu}}_{(i)})^T(\hat{\boldsymbol{\mu}} - \hat{\boldsymbol{\mu}}_{(i)})$, between the vector of fitted values, $\hat{\boldsymbol{\mu}}$, and the vector of fitted values with the $i$th observation omitted, $\hat{\boldsymbol{\mu}}_{(i)}$.

Cook's distance is one common measure of influence that is a function of this distance, defined as

$$D_i = \frac{(\hat{\boldsymbol{\mu}} - \hat{\boldsymbol{\mu}}_{(i)})^T(\hat{\boldsymbol{\mu}} - \hat{\boldsymbol{\mu}}_{(i)})}{\hat{\sigma}^2 (p+1)}$$

where $\hat{\sigma}^2$ is the estimate of the variance from the model fit with the $i$th observation included.

This would seem to imply that, to measure the influence of each observation, we need to fit the model both with and without the $i$th observation, for all $i$. However, it can be shown that Cook's distance can be expressed in the form

$$D_i = \frac{h_{ii}}{1 - h_{ii}} \cdot \frac{d_i^2}{p+1}$$

which can be calculated from the fit of the model with all observations included.

Note that to be influential, an observation must have both a relatively high leverage, $h_{ii}$, and a large (absolute) studentized residual, $d_i$.

As a general rule of thumb, Di ≥ 1  suggests a strongly influential observation.


As expected, there are no influential observations associated with the audit model:
> max(cooks.distance(audit.sqrt.lm))

[1] 0.2202189
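To connect this output to the formula above, the following sketch (again assuming the fitted model audit.sqrt.lm) recomputes Cook's distance from the hatvalues and studentized residuals; the result should agree with cooks.distance().

> h <- hatvalues(audit.sqrt.lm)
> d <- rstandard(audit.sqrt.lm)                               # studentized residuals d_i
> D.manual <- (h/(1 - h)) * d^2/length(coef(audit.sqrt.lm))   # h_ii/(1-h_ii) * d_i^2/(p+1)
> max(abs(D.manual - cooks.distance(audit.sqrt.lm)))          # should be (numerically) zero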


Model Selection - Introduction


Consider the house dataset that contains the selling price (value, in $1000s) of 100 houses, as well as information on size (m²), # of stories, # of bathrooms, # of rooms, age (yrs), lot size (m²), and whether the house has a basement and garage.
A plot of the data suggests that a linear model might be a suitable model to fit to the data to
investigate the relationship between selling price and one or more of the house characteristics.

A fit of the linear model to all the variables in the dataset (with housing price as the response) is given
below:

Call: lm(formula = value~size+stories+baths+rooms+age+lotsize+basement+garage)

             Estimate Std. Error t value Pr(>|t|)    


(Intercept) -65.92366   39.47105  -1.670 0.098321

size          1.57860    0.21994   7.178 1.86e-10

stories      17.19540    9.78106   1.758 0.082105

baths         7.65047   16.81162   0.455 0.650142    

rooms        -1.31162    5.03983  -0.260 0.795258    

age          -3.73867    0.78717  -4.750 7.57e-06

lotsize       0.16106    0.04331   3.719 0.000345

basement      1.50462    9.11014   0.165 0.869185    

garage      -42.11920   15.04650  -2.799 0.006253 


Residual standard error: 41.19 on 91 degrees of freedom

Multiple R-squared:  0.6704,    Adjusted R-squared:  0.6414 

F-statistic: 23.14 on 8 and 91 DF,  p-value: < 2.2e-16​

 
The output suggests a reasonably well-fitting model, with 67% of the variation in selling price
accounted for by the eight variables, and several strong relationships between the selling price and
variables size, age, lot size, and presence/absence of a garage.   


Subsequent residual analysis suggests that the model is adequate, and reveals no problems with
outliers or the model assumptions (we say the residuals are 'well-behaved').

 
 

 
Further analysis revealed no high leverage or influential observations.​​

Note, however, the large p-values associated with some of the variables (e.g. # of bathrooms, # of
rooms, presence/absence of a basement).
This suggests that the model might be improved by including only a subset of the eight variables,
since increasing the degrees of freedom without significantly increasing the variation unexplained by
the model will result in a lower residual standard error and, consequently, more precise parameter
estimates and estimated mean values.
To understand why removing certain variables might yield a 'better' model, consider the expression for the residual standard error, given by

$$\hat{\sigma} = \sqrt{\frac{SS(Res)}{n - (p+1)}}$$

Note that removing one or more variables from the model will increase both the $SS(Res)$ and the residual degrees of freedom.

If the increase in $SS(Res)$ resulting from the removal of variables is small relative to the degrees of freedom gained, as is often the case for variables associated with large p-values, then $\hat{\sigma}$ will decrease, resulting in smaller standard errors and a more precise model.

If, however, the increase in degrees of freedom obtained from removing one or more variables is not sufficient to counterbalance the associated increase in $SS(Res)$, then $\hat{\sigma}$ will increase, and we will have a less precise model.


Note that we cannot simply remove all the variables that are associated with large p-values, since the
removal of any one variable will change the estimates and associated p-values of all  the remaining
variables. A variable that is associated with a large p-value may be associated with a small p-value
once a variable is removed. 
We must therefore rely on established model selection procedures to arrive at a chosen model. 


Iterative Model Selection Procedures


Iterative methods of model selection involve building a model by adding or removing variables one at
a time, and refitting the model at each iteration until no more variables can be added or removed. 
1. Backward elimination
Fit all p variables

Remove the variable with the largest p-value that is greater than some predetermined threshold
value, α (e.g. α = .10 )

Refit the model with the remaining p−1  variables

Continue removing one variable at each iteration of the above steps until no more variables can
be removed (all p-values < α)

2. Forward selection
Fit ​​all ​
p single variable (i.e. SLR) models  

Select the variable associated with the smallest p-value < α

Fit the p−1  two-variable models that include the variable selected in the previous step

Continue adding one variable at each iteration, including all variables selected in the previous
step, until no more variables can be added (all p-values > α )

3. Stepwise selection

Begin with forward selection, and employ both forward selection and backward elimination at each step until no more variables can be added or removed. (A minimal R sketch of an automated iterative selection run is given below.)
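The sketch below shows one readily available way to run an automated iterative selection in R, using the built-in step() function on the full house model (assuming the house variables are available in the workspace, as in the output shown later). Note that step() adds and drops variables based on AIC rather than the p-value thresholds described above, so it illustrates the iterative idea rather than the exact procedures listed.

> house.full.lm <- lm(value ~ size + stories + baths + rooms + age + lotsize + basement + garage)
> step(house.full.lm, direction = 'backward')   # backward elimination based on AIC
> step(house.full.lm, direction = 'both')       # stepwise: add or drop one variable at each step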

Selection From All Model Subsets


With $p$ potential variables, there are $2^p - 1$ possible models to choose from.

(The total number of model subsets, $\sum_{k=1}^{p} \binom{p}{k} = \sum_{k=0}^{p} \binom{p}{k} - 1 = 2^p - 1$, follows from the binomial theorem, $\sum_{x=0}^{n} \binom{n}{x} a^x b^{n-x} = (a+b)^n$.)
= (a + b)

For the house model with p = 8  variables, for example, there are 255 potential models.
Selection of reasonable models from all potential models is based on some measure of fit that takes into account both the $SS(Res)$ and the number of variables. Two such measures are:

Adjusted R-squared

Mallows' $C_p$

Adjusted R-squared

Recall the coefficient of determination, given by

$$R^2 = 1 - \frac{SS(Res)}{SS(Tot)}$$

Note that with the addition of more variables, $SS(Res)$ will always decrease, and consequently $R^2$ will always increase when variables are added, regardless of whether the variables account for a significant amount of the variation in the response. For this reason, we cannot use $R^2$ as a relative measure of fit when comparing model subsets with different numbers of parameters.

Instead, we can use the adjusted R-squared, given by

$$R^2_{adj} = 1 - \frac{SS(Res)/(n-(p+1))}{SS(Tot)/(n-1)}$$

Since $R^2_{adj}$ takes into account the number of variables in the model, it will only increase if the variation accounted for by the added variable(s) increases proportionally more than the degrees of freedom decrease through the estimation of the additional parameters.

Note that since we can express

$$R^2_{adj} = 1 - \frac{SS(Res)/(n-(p+1))}{SS(Tot)/(n-1)} = 1 - \frac{\hat{\sigma}^2}{SS(Tot)/(n-1)}$$

as a function of the residual standard error, model selection based on a large $R^2_{adj}$ is equivalent to selection based on a low residual standard error, $\hat{\sigma}$.
Mallows' $C_p$

For a $k$-variable model ($k = 1, 2, \ldots, p$), Mallows' $C_p$ is defined as

$$C_p = \frac{SS(Res)_k}{MS(Res)_p} + 2(k+1) - n$$

Intuitively, the smaller the $SS(Res)$ for a given $k$, the better the model. Thus, smaller $C_p$ values relative to the number of variables are associated with more suitable models.

Mallows' $C_p$ is used to compare a $k$-variable model ($k < p$) with the full model (for which $k = p$). A $k$-variable model is preferred over the full model if $C_p \le k + 1$.

Note that for the full ($k = p$) model,

$$C_p = \frac{SS(Res)_p}{SS(Res)_p/(n-(p+1))} + 2(p+1) - n = p + 1$$

regardless of the fit of the full model.


Model Selection - House data


Consider again the fit of the full ($p = 8$) model to the housing data:

Call: lm(formula = value~size+stories+baths+rooms+age+lotsize+basement+garage)

             Estimate Std. Error t value Pr(>|t|)    


(Intercept) -65.92366   39.47105  -1.670 0.098321

size          1.57860    0.21994   7.178 1.86e-10

stories      17.19540    9.78106   1.758 0.082105

baths         7.65047   16.81162   0.455 0.650142    

rooms        -1.31162    5.03983  -0.260 0.795258    

age          -3.73867    0.78717  -4.750 7.57e-06

lotsize       0.16106    0.04331   3.719 0.000345

basement      1.50462    9.11014   0.165 0.869185    

garage      -42.11920   15.04650  -2.799 0.006253 


Residual standard error: 41.19 on 91 degrees of freedom

Multiple R-squared:  0.6704,    Adjusted R-squared:  0.6414 

F-statistic: 23.14 on 8 and 91 DF,  p-value: < 2.2e-16​

 
Note that basement would be the first variable to be removed in backward selection.

We will select an appropriate model from all possible subsets based on the $R^2_{adj}$ and $C_p$ criteria.

> leaps(house[,-9],value,method=c('adjr'),nbest=2,names=names(house[-9]))

   size stories baths rooms   age lotsize basement garage

1  TRUE   FALSE FALSE FALSE FALSE   FALSE    FALSE  FALSE

1 FALSE   FALSE  TRUE FALSE FALSE   FALSE    FALSE  FALSE

2  TRUE   FALSE FALSE FALSE  TRUE   FALSE    FALSE  FALSE

2  TRUE    TRUE FALSE FALSE FALSE   FALSE    FALSE  FALSE

3  TRUE   FALSE FALSE FALSE  TRUE    TRUE    FALSE  FALSE

3  TRUE   FALSE FALSE FALSE  TRUE   FALSE    FALSE   TRUE

4  TRUE   FALSE FALSE FALSE  TRUE    TRUE    FALSE   TRUE

4  TRUE    TRUE FALSE FALSE  TRUE    TRUE    FALSE  FALSE

5  TRUE    TRUE FALSE FALSE  TRUE    TRUE    FALSE   TRUE

5  TRUE   FALSE FALSE  TRUE  TRUE    TRUE    FALSE   TRUE

6  TRUE    TRUE  TRUE FALSE  TRUE    TRUE    FALSE   TRUE

6  TRUE    TRUE FALSE  TRUE  TRUE    TRUE    FALSE   TRUE

7  TRUE    TRUE  TRUE  TRUE  TRUE    TRUE    FALSE   TRUE

7  TRUE    TRUE  TRUE FALSE  TRUE    TRUE     TRUE   TRUE

8  TRUE    TRUE  TRUE  TRUE  TRUE    TRUE     TRUE   TRUE


$adjr2
[1]0.549407 0.317135 0.572833 0.560069 0.618481 0.593374 0.644686 0.6251132 
[9]0.651625 0.641082 0.648770 0.648308 0.645221 0.645063 0.641430

We see that the preferred model based on R2adj is the model fit with size, stories, age, lotsize, and
garage.

Selecting a model based on Mallows' Cp :

> leaps(house[,-9],value,method=c('Cp'),nbest=1,names=names(house[-9]))

  size stories baths rooms   age lotsize basement garage

1 TRUE   FALSE FALSE FALSE FALSE   FALSE    FALSE  FALSE

2 TRUE   FALSE FALSE FALSE  TRUE   FALSE    FALSE  FALSE

3 TRUE   FALSE FALSE FALSE  TRUE    TRUE    FALSE  FALSE

4 TRUE   FALSE FALSE FALSE  TRUE    TRUE    FALSE   TRUE

5 TRUE    TRUE FALSE FALSE  TRUE    TRUE    FALSE   TRUE

6 TRUE    TRUE  TRUE FALSE  TRUE    TRUE    FALSE   TRUE

7 TRUE    TRUE  TRUE  TRUE  TRUE    TRUE    FALSE   TRUE

8 TRUE    TRUE  TRUE  TRUE  TRUE    TRUE     TRUE   TRUE


$Cp
[1] 27.150656 21.556807 10.144146  4.137270  3.327332  5.096306  7.027277  9.000000

We see that there are several models that meet the criterion of Cp < k + 1 .
We will select the model with the variables size, stories, age, lotsize, and garage. This is consistent
with the model selected using R2adj .

The fits of the full model and selected model are given on the following slide. Note the following:
The higher R² value for the full model, due to having more parameters
The higher R²_adj (and lower σ̂) for the selected model
The p-value > .05 associated with stories in the selected model. Retaining variables with
associated p-values > .05 is common in model selection procedures.

             Estimate Std. Error t value Pr(>|t|)    

(Intercept) -65.92366   39.47105  -1.670 0.098321

size          1.57860    0.21994   7.178 1.86e-10

stories      17.19540    9.78106   1.758 0.082105 

baths         7.65047   16.81162   0.455 0.650142    

rooms        -1.31162    5.03983  -0.260 0.795258    

age          -3.73867    0.78717  -4.750 7.57e-06

lotsize       0.16106    0.04331   3.719 0.000345

basement      1.50462    9.11014   0.165 0.869185    

garage      -42.11920   15.04650  -2.799 0.006253

Residual standard error: 41.19 on 91 degrees of freedom

Multiple R-squared:  0.6704,    Adjusted R-squared:  0.6414 

              Estimate Std. Error t value Pr(>|t|)    

(Intercept) -67.72531   33.81377  -2.003 0.048069  

size          1.64000    0.15173  10.809  < 2e-16

stories      15.77880    9.27816   1.701 0.092317 

age          -3.86942    0.73110  -5.293 7.87e-07

lotsize       0.16073    0.04164   3.860 0.000208

garage      -42.37070   14.76986  -2.869 0.005089


Residual standard error: 40.6 on 94 degrees of freedom

Multiple R-squared:  0.6692,    Adjusted R-squared:  0.6516​​

Finally, residual analysis on our selected model indicates conformity with the model assumptions.

 
We have arrived at an appropriate and well-fitted model that adequately describes the relationship
between the value of a house and its attributes.
 

Interaction
Consider the house data. 
Does the effect of having a garage on the value of a house depend on the age of the house?
It may be, for example, that having a garage contributes to the value of a house more markedly
for older houses than it does for newer houses, or vice versa.
When the effect of a variable, xj , on the response depends on the value of another variable, xk , we
say there is interaction between variables xj , xk .
We can account for a possible interaction effect by including the term xj xk  in our model.
For example, to address the question above, we include the interaction term in our model:

Y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + β6 (x3 ∗ x5 ) + ϵ

where x3  represents the age and x5  represents whether the house has a garage.
To see how this effectively addresses interaction, note that we can rewrite our model as

Y = β0 + β1 x1 + β2 x2 + (β3 + β6 x5)x3 + β4 x4 + β5 x5 + ϵ

where β3  is the effect of age on house price if the house has no garage (i.e. when x5 = 0 ), and
β3 + β6  is the effect of age on house price if the house has a garage.

Including an age*garage interaction term into the model selected in the previous lesson yields the
following output:

> house.int.lm=lm(value~size+stories+age+lotsize+garage+age*garage)

> summary(house.int.lm)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    

(Intercept) -33.83130   37.57164  -0.900 0.370208    

size          1.65248    0.14965  11.043  < 2e-16

stories      16.30724    9.14648   1.783 0.077866

age          -5.23119    1.00279  -5.217  1.1e-06

lotsize       0.16370    0.04106   3.986 0.000133

garage      -91.03581   28.86644  -3.154 0.002171  

age:garage    2.11762    1.08476   1.952 0.053929


Residual standard error: 40 on 93 degrees of freedom

Multiple R-squared:  0.6822,    Adjusted R-squared:  0.6617 

F-statistic: 33.28 on 6 and 93 DF,  p-value: < 2.2e-16​

The estimates β̂3 and β̂6 tell us that, whereas the mean value of a house decreases by an estimated
$523 for each year of age for houses with no garage, it only decreases by an estimated $311 for each
year of age for houses with a garage (after accounting for the other variables).
Note that, although we would not reject H0: β6 = 0 (p-value > .05), based on model selection
methods we would keep the interaction term in the model, since it leads to a decrease in σ̂ (i.e. it
leads to an increase in R²_adj).
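To see the two age effects directly, here is a minimal sketch that reads them off the coefficients of the house.int.lm fit above (the coefficient names assume the model formula as written above):

# Sketch: estimated change in mean value per year of age, by garage status,
# from the interaction model house.int.lm fit above.
b <- coef(house.int.lm)
b["age"]                     # effect of age with no garage (beta3-hat, about -5.23)
b["age"] + b["age:garage"]   # effect of age with a garage (beta3-hat + beta6-hat, about -3.11)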

Fitting Linear Models to Time Series Data


Consider time series data, {yt}, where yt is the value of the response at time t, t = 1, 2, …, n.
Examples: Monthly sales data, daily stock prices, quarterly economic indicators, weekly traffic flow,...

Variation in a time series is comprised of four main components:


Seasonal
Variations in the data that repeat regularly over fixed intervals, typically repeating within a 12
month span (e.g., week, month, quarterly,... )
Trend
Persistent increase/decrease in mean response, μt , as t increases.
Cyclical
Oscillations repeating over irregular intervals, typically with a span of more than a year (e.g.,
business, economic cycles)
Irregular or random
Variation remaining in the time series when variation caused by the other three components is
accounted for

We will investigate the use of linear models to account for variation in the time series due to seasonal
and/or trend components (we may also use linear models to attempt to account for variation due to
cyclical components, but we will not do so here).

Autocorrelation Function
Before we attempt to model the components present in time series data, we must first investigate and
quantify the nature of the autocorrelation between yt and yt−k for any lag, k .
We do so with the (sample) autocorrelation function (acf), defined as
rk = [ Σ_{t=k+1}^{n} (yt − ȳ)(yt−k − ȳ) ] / [ Σ_{t=1}^{n} (yt − ȳ)² ]


Note that rk is the correlation coefficient, r, between values x and y, where x = yt−k and y = yt.
That is, it is the (auto)correlation between values of the response k time units apart, with properties:
−1 ≤ rk ≤ 1 for all k
unitless

A plot of rk for k = 1, 2, …, referred to as a correlogram, is an indispensable tool in modelling time
series data.
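To make the definition concrete, here is a small, hedged R sketch that computes rk from the formula above for a generic numeric series y and compares it with R's built-in acf(); the simulated series is purely illustrative and is not one of the datasets used in these notes.

# Sketch: sample autocorrelation at lag k, computed from the definition,
# for a generic numeric series y (placeholder data).
acf.by.hand <- function(y, k) {
  n <- length(y)
  ybar <- mean(y)
  sum((y[(k + 1):n] - ybar) * (y[1:(n - k)] - ybar)) / sum((y - ybar)^2)
}

set.seed(1)
y <- arima.sim(model = list(ar = 0.6), n = 200)   # illustrative series only
acf.by.hand(y, 1)                                 # manual r_1
acf(y, plot = FALSE)$acf[2]                       # R's r_1 (index 1 is lag 0)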

Wine sales series


Consider the monthly wine sales time series data, {y1, y2, ..., y187} = {1952, 2302, ..., 4290}, where
yt is the wine sales (1000s of litres) in month t, t = 1, 2, 3, ..., 187, from Jan. 1980 to July 1995.

A time series plot of yt vs t is shown below.

Not surprisingly, the time series plot suggests a strong seasonal pattern recurring every 12 time units
(months).
A correlogram of the series will provide us with a much better understanding of the autocorrelation
structure so that we can attempt to model the components of variation present in the time series.

A correlogram for the wine sales data is shown below:


> acf(wine.sales)

As seen in the time series plot, there is a strong positive autocorrelation at lag k = 12. As well, we see
significant* negative autocorrelations at lags 6 and 18, which supports the intuitive notion of a
difference in wine sales between the winter and summer months. We also see a significant positive
autocorrelation at lag 1, mainly due to the presence of an increasing trend in wine sales over time.
(Note: the value r0 = 1  is irrelevant and should be ignored, as it provides no useful information about
the autocorrelation of the time series)
*The significance lines on the acf plot provide an (approximate) visual test of the hypotheses ρk = 0,
where the parameter ρk is the process autocorrelation at lag k, estimated by the sample acf, rk.
The lines correspond to ±2 standard errors, where SE(rk) ≈ √(1/n). Values of rk outside these lines
suggest significant autocorrelation (ρk ≠ 0).
For the wine sales acf plot, these lines correspond to ±2/√187 = ±0.146.

Modelling Seasonal Component of Time Series Data


Continuing with the wine sales time series: Based on the correlogram, we suspect that the month of
the year accounts for much of the variation in wine sales.
We can attempt to account for this seasonal component by modelling month of the year as a
categorical variable in a linear regression model:
yt = β0 + β1 xt1 + β2 xt2 + … + β11 xt11 + ϵt,   ϵt ∼ N(0, σ²) independent

where

xt1 = 1 if the t-th month is Jan, 0 otherwise;   xt2 = 1 if Feb, 0 otherwise;   ...   xt11 = 1 if Nov, 0 otherwise

Recall that by defining the indicator variables in this way:

β̂j, j = 1, …, 11, is the estimate of the mean difference in wine sales between the j-th month
and December, and

β̂0 is the estimate of the mean sales for December. Be sure you can verify this with your
understanding of the fitted model.

> month=c(rep(c('Jan','Feb','Mar','Apr',...,'Oct','Nov','Dec'),length=187))
  #creates a categorical variable giving the month for each observation
> month=factor(month,levels=c('Jan','Feb','Mar','Apr','May', ...'Oct','Nov','Dec'))   
  #establishes the order of the factor levels (overriding alpha-numeric ordering)
> month=relevel(month,'Dec')  #assigns December as the reference month
> wine.seas.lm=lm(Sales~month)

                 Estimate Std. Error t value Pr(>|t|)    

(Intercept)   4536.1      120.2  37.734  < 2e-16

monthJan     -2247.6      167.3 -13.432  < 2e-16

monthFeb     -1723.4      167.3 -10.299  < 2e-16

monthMar     -1386.0      167.3  -8.283 3.01e-14

monthApr     -1616.1      167.3  -9.658  < 2e-16

monthMay     -1557.7      167.3  -9.309  < 2e-16

monthJun     -1576.9      167.3  -9.424  < 2e-16

monthJul     -1155.3      167.3  -6.904 8.94e-11

monthAug      -946.4      170.0  -5.567 9.61e-08

monthSep     -1355.7      170.0  -7.974 1.91e-13

monthOct     -1141.7      170.0  -6.716 2.52e-10

monthNov      -418.9      170.0  -2.464   0.0147


Residual standard error: 465.6 on 175 degrees of freedom

Multiple R-squared:  0.6155,    Adjusted R-squared:  0.5914 

We see that, conditional on model assumptions being met, we have a reasonably well fitting model
with over 60% of the variation in wine sales accounted for by month of the year.
Based on the negative values for all β̂j, j = 1, 2, …, 11, and the associated p-values, we can conclude
that December has significantly higher mean wine sales than any other month
(μ̂_Dec = β̂0 = 4,536,100 litres).

We suspect Nov. has significantly higher wine sales than all months other than Dec., but we would
have to confirm that with an appropriate test.

Consider now the time series of the residuals of this model, et, t = 1, 2, … , 187 .
By accounting for the seasonal component, the residual time series is effectively the time series of
seasonally adjusted wine sales, given in the time series plot below:

Note that by removing the seasonal variation, the trend component becomes much more evident.
Before we attempt to model the trend, let us consider the acf of the residuals, given by

rk = [ Σ_{t=k+1}^{n} et et−k ] / [ Σ_{t=1}^{n} et² ]


(confirm that this is consistent with the definition of the acf provided in an earlier lesson)

Under the model assumption of independent errors, we would expect the acf plot of the residuals to
reveal no significant autocorrelation at any lag.
However, due to the trend, the products et et−1 will be predominantly positive, as will
et et−2, et et−3, …, resulting in persistently large values of r1, r2, r3, ….

This characteristic of the autocorrelation in time series that exhibit a strong trend is seen in the acf
plot of the residuals for the wine sales model:
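A minimal sketch of the plots referred to on this page, using the wine.seas.lm object fit above (the figures themselves are not reproduced in this text version):

# Sketch: the seasonally adjusted series (residuals of the seasonal-only model)
# and its correlogram.
res.seas <- residuals(wine.seas.lm)
plot.ts(res.seas, ylab = "seasonally adjusted sales (residuals)")
acf(res.seas)   # persistently large r1, r2, r3, ... reflect the remaining trend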

Modelling Trend Component


Recall that after accounting for seasonal variation, we observe a strong non-linear (e.g. quadratic)
trend in wine sales, as observed previously in the time series plot of the residuals:

We can account for the trend in our model through the addition of appropriate time variables - in this
case, with the linear and quadratic terms t and t², respectively, resulting in the model

yt = β0 + β1 xt1 + … + β11 xt11 + β12 t + β13 t² + ϵt,   ϵt ∼ N(0, σ²) independent

where β0 + β1 xt1 + … + β11 xt11 is the seasonal component and β12 t + β13 t² is the trend component.

Fitting the model in R:


> t=c(1:187)
> tsq = t^2
> wine.seas.trend.lm=lm(Sales~month+t+tsq)

              Estimate Std. Error t value Pr(>|t|)    

(Intercept)  4.293e+03  1.148e+02  37.387  < 2e-16

monthJan    -2.237e+03  1.226e+02 -18.253  < 2e-16

monthFeb    -1.718e+03  1.226e+02 -14.019  < 2e-16

monthMar    -1.386e+03  1.226e+02 -11.310  < 2e-16 

monthApr    -1.621e+03  1.226e+02 -13.231  < 2e-16

monthMay    -1.568e+03  1.226e+02 -12.798  < 2e-16

monthJun    -1.593e+03  1.226e+02 -13.000  < 2e-16

monthJul    -1.177e+03  1.226e+02  -9.605  < 2e-16

monthAug    -9.251e+02  1.245e+02  -7.432 4.72e-12

monthSep    -1.340e+03  1.245e+02 -10.763  < 2e-16

monthOct    -1.131e+03  1.245e+02  -9.086 2.32e-16 

monthNov    -4.134e+02  1.245e+02  -3.322  0.00109

t           -2.864e+00  1.861e+00  -1.539  0.12555    

tsq          4.356e-02  9.587e-03   4.544 1.03e-05


Residual standard error: 340.8 on 173 degrees of freedom

Multiple R-squared:  0.7963,    Adjusted R-squared:  0.781​

Note that the trend terms dramatically increase the variation in wine sales explained by the
model, as is reflected by both the much higher R²_adj (0.781 compared to 0.5914) and the small p-value
associated with the quadratic term.
(Note: we always retain all lower-order terms in a model when the higher-order terms are included.
This holds for both polynomial terms (e.g. quadratic, cubic, ...) and interaction terms.)

The time series plot of the residuals (below) suggests that we have successfully captured the trend.

The correlogram of the residuals (below right) indicates our assumption of independent errors is
reasonably met. We see that we have accounted for virtually all the autocorrelation in the response
(left), by including seasonal and trend components in a linear model. 

Further analysis (e.g. a plot of residuals vs fitted values, not shown here) indicates that the rest of the
model assumptions are reasonably well met. We now have an adequate and well-fit model that we can
use to forecast future wine sales, where the forecasted value at time t is described by:

ŷt = β̂0 + β̂1 xt1 + β̂2 xt2 + … + β̂11 xt11 + β̂12 t + β̂13 t²

For example, the forecasted wine sales for the following month (Aug/95) is

ŷ188 = β̂0 + β̂8 + β̂12(188) + β̂13(188²) = 4,369,053 litres
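Equivalently, the forecast can be obtained in R with predict(); the sketch below assumes the wine.seas.trend.lm fit and the month factor created earlier in this lesson.

# Sketch: forecasting Aug 1995 (t = 188) wine sales from the seasonal + trend model.
new <- data.frame(month = factor('Aug', levels = levels(month)),
                  t = 188, tsq = 188^2)
predict(wine.seas.trend.lm, newdata = new)   # approx. 4369 (thousands of litres)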

Recall that the seasonal and trend components are not the only components of variation in a time
series. Often, there is autocorrelation structure remaining in the time series after accounting for
seasonal and trend components with a linear model, rendering the assumption of independence of the
errors invalid.
 We will see an example of this in an upcoming lesson.

Modelling seasonal and trend components - Kenora temperature time series


The Kentemp time series consists of monthly (mean daily maximum) temperatures in Kenora, Ontario
over a 50 year period (Jan. 1963 to Dec. 2012). Time series plot and correlogram are given below.

Whereas the time series plot provides limited information on the main characteristics and
autocorrelation of the time series, the correlogram clearly illustrates the autocorrelation structure that
we would anticipate* in a monthly temperature series. 
(*consider the numerator of rk, given by Σ (yt − ȳ)(yt−k − ȳ). We know that yt − ȳ will be negative
for t associated with the winter months (e.g., Nov, Dec, Jan, Feb, ...) and positive for the summer
months, resulting in mostly negative values of (yt − ȳ)(yt−k − ȳ) for k = 4, 5, 6, 7, 8, and mostly
positive values for k = 1, 2, 10, 11, 12.)

Fitting a seasonal component as we did for the wine sales data yields the following fitted model

              Estimate Std. Error t value Pr(>|t|)    

(Intercept)    -9.4580     0.3770 -25.087  < 2e-16

Jan            -2.8880     0.5332  -5.417 8.86e-08

Feb             1.1760     0.5332   2.206   0.0278

Mar             8.8580     0.5332  16.614  < 2e-16

Apr            18.3820     0.5332  34.477  < 2e-16

May            26.0120     0.5332  48.788  < 2e-16

Jun            31.0420     0.5332  58.222  < 2e-16

Jul            33.9700     0.5332  63.714  < 2e-16

Aug            32.4780     0.5332  60.915  < 2e-16

Sep            26.3160     0.5332  49.358  < 2e-16

Oct            18.4120     0.5332  34.533  < 2e-16

Nov             8.4860     0.5332  15.916  < 2e-16


Residual standard error: 2.666 on 588 degrees of freedom

Multiple R-squared:  0.9593,    Adjusted R-squared:  0.9585 

F-statistic:  1260 on 11 and 588 DF,  p-value: < 2.2e-16

Note the extremely high R² value, with almost 96% of the variation in mean daily maximum
temperature accounted for by month (why is this understandable?)
We see also that, based on the estimates and associated p-values, January was the coldest month,
with a mean monthly (daily max.) temperature of −12.346°C.

A time series plot of the residuals suggests the possibility of a linear trend:

The correlogram of the residuals also indicates that the autocorrelation has not been wholly accounted
for by the seasonal component:

The presence of a positive linear trend is confirmed by the fit of the model:

                Estimate Std. Error t value Pr(>|t|)    

(Intercept)   -1.035e+01  4.158e-01 -24.895  < 2e-16

Jan           -2.856e+00  5.238e-01  -5.452 7.33e-08 

Feb            1.205e+00  5.238e-01   2.301   0.0218  

Mar            8.884e+00  5.238e-01  16.961  < 2e-16

 ⋮                 ⋮          ⋮          ⋮        ⋮

Sep            2.632e+01  5.238e-01  50.261  < 2e-16

Oct            1.842e+01  5.238e-01  35.164  < 2e-16

Nov            8.489e+00  5.238e-01  16.208  < 2e-16

t              2.916e-03  6.174e-04   4.723 2.91e-06


Residual standard error: 2.619 on 587 degrees of freedom

Multiple R-squared:  0.9608,    Adjusted R-squared:   0.96 ​​

Note that the positive trend parameter estimate and its associated p-value are consistent with the evidence of a linear trend seen in the residual plots.
We can say that, after adjusting for season, the mean daily maximum temperature in Kenora has increased by an estimated 0.0029°C per month.
On a more relevant scale: over the past 50 years (600 months), the mean daily maximum temperature in Kenora has increased by approximately 1.75°C.
For comparison, from a NOAA global climate update: 'The global annual temperature has increased at an average rate of 0.07°C per decade since 1880 and over twice that rate (+0.18°C/0.32°F) since 1981.'
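The 50-year figure above follows directly from the fitted trend coefficient (the model is indexed by month); a quick check:

# Sketch: converting the fitted monthly trend into decade- and 50-year-scale changes.
beta.t <- 2.916e-03        # estimated trend coefficient (deg C per month)
beta.t * 12 * 10           # approx. 0.35 deg C per decade
beta.t * 600               # approx. 1.75 deg C over the 600-month (50-year) record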
 

Now that we appear to have an extremely well-fitted model, we can perform a residual analysis to
assess model adequacy.
The plot of the (studentized) residuals vs the fitted values is shown below: 

The absence of a relationship suggests that the model is well specified.


Correspondingly, and not unexpectedly, a normal qqplot (not shown here) does not reveal any issues
with the assumption of normal errors (why 'not unexpectedly'?​​)
Also not unexpectedly, there appears to be more variability in monthly temperature for the colder
months than for the warmer months (a minor violation of the assumption of constant variance).
We also see an unusually warm mean monthly temperature one year (Oct. 1963, with a mean monthly
temp of 17.3 ∘ C, which was 9.2 degrees higher than the estimated mean for October).
We will assess the assumption of independence of the errors in the next lesson.

Durbin Watson test for lag 1 autocorrelation in model residuals 


In the previous lesson, we plotted the residuals vs the fitted values for the Kenora temperature model
(fit with seasonal and trend components) to assess model adequacy. We saw that, apart from a minor
non-constant variance and the presence of a minor outlier, our model appeared adequate.
This analysis did not address the assumption of independence of the errors. We can do so now with a
correlogram of the residuals:

We see that after accounting for the seasonal and trend components, there still appears to be
significant autocorrelation, at lag 1 in particular, not accounted for by our model.
This is often the case after modelling the seasonal and/or trend components in time series data. (Why
might this be the case with temperature data?)

Durbin Watson test statistic


The Durbin Watson test is a hypothesis test for the presence of (lag 1) autocorrelation, denoted by
the parameter, ρ1 , in the residuals of a regression model fit to time series data. The test statistic is
DW = [ Σ_{t=2}^{n} (et − et−1)² ] / [ Σ_{t=1}^{n} et² ]


Intuitively, we would expect positive lag 1 autocorrelation to be associated with relatively small
differences in successive residuals, given by et − et−1, and would therefore expect small values of DW
to provide evidence for ρ1 > 0.
We can confirm this by rewriting the test statistic as

DW = [ Σ et² + Σ et−1² − 2 Σ et et−1 ] / Σ et²  ≈  2 − 2r1 = 2(1 − r1)

where −1 < r1 < 1.

Expressing the test statistic in this way suggests the following properties:
0 ≤ DW ≤ 4

The closer DW is to 0, the more evidence that ρ1 > 0  (positive lag 1 auto-correlation)

The closer DW is to 4, the more evidence that ρ1 < 0  (negative lag 1 auto-correlation)
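The approximation DW ≈ 2(1 − r1) used above is easy to verify numerically; a sketch for a generic residual vector res (for example, the residuals of the Kenora model used later in this lesson):

# Sketch: numerical check of DW approx. 2(1 - r1) for a residual vector 'res'.
DW <- sum(diff(res)^2) / sum(res^2)     # Durbin-Watson statistic
r1 <- acf(res, plot = FALSE)$acf[2]     # lag-1 sample autocorrelation
c(DW = DW, approx = 2 * (1 - r1))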

The distribution of DW under H0 : ρ1 = 0 , required to obtain p-values, depends on both the number
of parameters and the number of observations.  
In the absence of software to compute p-values, Durbin-Watson tables provide lower and upper critical
(α = .05 ) values, DW L  and DW U , against which we compare the value of the test statistic DW to
determine whether we are to accept or reject H0 : ρ1 = 0  according to the following rules:

To test for positive lag 1 autocorrelation (H0 : ρ1 = 0  vs Ha : ρ1 > 0 ):


If DW < DW L , reject H0  (conclude ρ1 > 0 )

If DW > DW U , do not reject H0  (conclude ρ1 = 0 )

If DW L < DW < DW U , results inconclusive

To test for negative lag 1 autocorrelation (H0 : ρ1 = 0  vs Ha : ρ1 < 0 ), we use the test statistic
4 − DW and proceed as follows: 

If 4 − DW < DW L , reject H0  (conclude ρ1 < 0 )

If 4 − DW > DW U , do not reject H0  (conclude ρ1 = 0 )

If DW L < 4 − DW < DW U , results inconclusive

We can use this test to confirm the presence of a positive lag 1 autocorrelation in the residuals from
the Kenora temperature model. 
 
 

A plot of et vs et−1 helps to illustrate the nature of the lag 1 autocorrelation:


> res=residuals(Kentemp.seas.trend)

> plot(res[-600],res[-1],pch=19,xlab='residuals (lag 1)',ylab='residuals')

We see a positive linear association between et and e t−1 . To test for positive lag 1 autocorrelation:
H0 : ρ1 = 0 vs Ha : ρ1 > 0

Test statistic:
> DW = sum(diff(res)^2)/sum(res^2)

> DW

[1] 1.554483

Using Durbin Watson tables (https://www.real-statistics.com/statistics-tables/durbin-watson-table/)
with α = .05, n = 600, k = 12:
DW_L = 1.825, DW_U = 1.907
Since DW = 1.554 < DW_L = 1.825, we reject H0: ρ1 = 0 and conclude ρ1 > 0.

To verify our calculations and conclusion, we can use the dwtest function in the lmtest library:

> library(lmtest)

> dwtest(Kentemp.seas.trend)
        Durbin-Watson test
data:  Kentemp.seas.trend

DW = 1.5545, p-value = 3.12e-08

alternative hypothesis: true autocorrelation is greater than 0​

 
If autocorrelation still exists in a time series once the seasonal and trend components have been
accounted for, we can employ another class of forecasting models on the resulting series.
These models will be introduced in the next lesson.
 

3. Introduction to Process Performance
Measuring Process Performance
Introduction
A process is a series of operations or actions repeated over time, each
iteration of which produces a unit. The output, yt, is the variable(s) of
interest associated with the unit produced at time t.
Consider the pull dataset from 453 vehicles produced over a 24hr
period.
Process: production of vehicles at this plant
Unit: vehicle
Output of interest: the pull (a measure of alignment, in degrees)
The specification limits for this process are 0.23 ± 0.25.
(Cars with pull outside these limits will need to be pulled from
production and realigned)
Process Performance Summaries
Graphical summaries:
• Histograms
• Run plots



Process Performance Summaries
Numerical summaries:
• Sample mean: x̄ = μ̂ (where μ is the process mean)
• Sample standard deviation: s = σ̂ (where σ is the process std. dev.)
For the pull data,
μ̂ = 0.2084
σ̂ = 0.0682
Measures of performance with respect to the specification limits are based on our estimates of μ and σ:
• For a given σ̂, the closer μ̂ is to the centre of the spec. limits (e.g. 0.23 for the pull data), the better the performance
• For a given μ̂, the smaller the σ̂, the better.
Capability Ratio
For continuous data, one common measure of process performance is the capability ratio, Ppk, defined as

Ppk = min(U − μ̂, μ̂ − L) / (3σ̂)

where U and L are the upper and lower specification limits, respectively.
It should be evident from the expression that the larger the value of Ppk, the better the performance of the process relative to the specification limits.
For the pull data:

Ppk = min(U − μ̂, μ̂ − L) / (3σ̂) = (0.2084 − (−0.02)) / (3(0.0682)) = 1.116
Interpretation of Ppk
It can be shown that for a centred Gaussian output (i.e. Y ~ N(μ, σ²), where μ = (U + L)/2),
the proportion, p, of units produced that are outside specification limits is estimated by

p̂ = 1 − P(−3Ppk ≤ Z ≤ 3Ppk)

where Z ~ N(0, 1).
Example: for the pull process (which appears reasonably Gaussian and centred from the histogram and run plot),

p̂ = 1 − P(−3(1.116) ≤ Z ≤ 3(1.116)) = 1 − P(−3.348 ≤ Z ≤ 3.348) = .00081

Thus, approx. 8 out of 10,000 vehicles can be expected to be outside specification limits.
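A hedged R sketch of these two calculations, assuming the pull measurements are available in a numeric vector named pull (as in the example later in these notes):

# Sketch: Ppk and the estimated out-of-spec proportion for the pull data
# (spec limits 0.23 +/- 0.25; 'pull' is an assumed vector of measurements).
L <- -0.02; U <- 0.48
mu.hat <- mean(pull); sigma.hat <- sd(pull)
Ppk <- min(U - mu.hat, mu.hat - L) / (3 * sigma.hat)     # approx. 1.12
p.hat <- 1 - (pnorm(3 * Ppk) - pnorm(-3 * Ppk))          # centred-Gaussian estimate
c(Ppk = Ppk, prop.out.of.spec = p.hat)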
2. Control Charts - Introduction

Introduction
Control charts are a means of assessing if (and in some cases, when)
the process mean, µ, and/or the process std. dev., σ, have changed
substantially over time.
In control chart construction, subgroups of units are sampled and the
subgroup means and standard deviations are plotted over time.
Control limits are added to the plot to represent the limits of acceptable
variation in the subgroup mean and standard deviation.
A process is considered stable or in-control if all subgroup means (or
standard deviations) are within the control limits and unstable or out-of-
control if any are outside control limits.
We will examine two types of control charts: the Xbar chart and S chart
Introduction
Example: Spring height process (Example 1 of Control Chart notes):
• Sampling protocol: Four consecutively produced springs selected
every hour over 25 hr. period.
• y = height (mm) of a coil spring, was recorded for each unit in
sample
• The subgroup means and standard deviations were calculated and
plotted.
• ‘3 sigma’ control limits added to plot to indicate the limits of
acceptable random variation if process mean/std. dev. did not
change.

[Figure: Xbar Chart for Spring Height - subgroup means (height) plotted against subgroup number, 1 to 25]
[Figure: S Chart for Spring Height - subgroup standard deviations (s) plotted against subgroup number, 1 to 25]
3. Creating Control Charts
[Figure: Xbar Chart for Spring Height (repeated from the previous section)]


[Figure: S Chart for Spring Height (repeated from the previous section)]


Creating Xbar and S Charts (‘Phase I’)
1) Define and sample subgroups according to an appropriate sampling
protocol (e.g. sample 4 consecutive units every hour for 25 hours)
2) Calculate mean and std. dev. of each subgroup
3) Create S Chart:
a) Plot subgroup standard deviations vs subgroup #
b) Add horizontal line at s̄, the mean of the subgroup std. dev.'s
c) Add upper and lower 3-sigma control limits:
LCL = B3 s̄ , UCL = B4 s̄
where B3, B4 can be obtained from Table 1 of Control Chart notes
d) For process monitoring going forward, remove any subgroups
corresponding to out-of-control points, and recalculate control limits.
Creating Xbar and S Charts (‘Phase I’)
5) Create Xbar chart:
a) Plot sample means of remaining subgroups
b) Add horizontal line at X̄, the mean of the subgroup means
c) Add upper and lower 3-sigma* control limits:
X̄ ± A3 s̄
where A3 is obtained from Table 1 of Control Chart notes
d) For process monitoring going forward, remove any subgroups with points outside control limits, and recalculate control limits.
* Note that A3 s̄ corresponds to 3σ̂/√n, where n is the subgroup size (see the R sketch below).
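The sketch assumes the subgroup data are in a matrix heights (one row per subgroup, four springs per row); the constants A3 = 1.628, B3 = 0 and B4 = 2.266 are the standard values for subgroups of size 4, like those tabulated in Table 1 of the Control Chart notes.

# Sketch: Xbar and S chart limits from subgroup data.
# 'heights' is an assumed matrix: one row per subgroup, n = 4 units per subgroup.
n  <- 4
A3 <- 1.628; B3 <- 0; B4 <- 2.266

s    <- apply(heights, 1, sd)       # subgroup standard deviations
xbar <- apply(heights, 1, mean)     # subgroup means

s.bar  <- mean(s)
s.lims <- c(LCL = B3 * s.bar, UCL = B4 * s.bar)

grand.mean <- mean(xbar)
x.lims <- c(LCL = grand.mean - A3 * s.bar, UCL = grand.mean + A3 * s.bar)

plot(s, type = "b", ylim = range(c(s, s.lims, 0)))
abline(h = c(s.bar, s.lims), lty = c(1, 2, 2))
plot(xbar, type = "b", ylim = range(c(xbar, x.lims)))
abline(h = c(grand.mean, x.lims), lty = c(1, 2, 2))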





Ongoing Use of Xbar and S Charts (‘Phase II’)
Once created, Xbar and S charts can be used going forward to monitor
the process for any significant changes to the mean or std. dev.
The process can be periodically sampled, and deemed unstable if any
sampled subgroups are outside control limits.
Out-of-control points may suggest that an adjustment to the process
may be necessary.
Out-of-control points may also help to identify the changes to an input
responsible for causing instability in the process (special causes).



3. Introduction to Process Performance
Control charts for count data
Bead Demonstration
1: https://youtu.be/AfxabXHL9zY 2: https://youtu.be/dCv12-kLTXI 3: https://youtu.be/PoM5gGkshGw 4: https://youtu.be/VXmIUToadfs

• Volunteers from class were recruited as workers in a bead production process


• After a ‘training’ period, workers produced 100 beads per day using a sampling
paddle method (pic below), and the # of defective (blue) beads was recorded
• Workers were given rewards\motivational pep talks\warnings based on their
daily performance



Bead Demonstration
Results from two workers over a 20-day period:

● ● ● Worker A
● Worker B

● ●
● ● ●
# defective

● ●
● ● ● ● ●
● ●


Day
Control charts for count data- 3
3-Sigma Control Limits for Binomial Count Data
We can create 3-sigma control charts to determine whether the mean, µ (expected number of defective beads), for this process has changed week to week.
Let Yt be the # of defective beads produced on day t.
Assuming beads are produced independently with constant probability, π, of producing a defective bead, then

Yt ~ Binomial(100, π)

with mean µ = E(Yt) = 100π, and std. dev. σ = SD(Yt) = √(100π(1 − π)).
From the data, µ̂ = 100π̂ = 10.4 and σ̂ = √(100(.104)(.896)) = 3.053,
yielding 3-sigma control limits:

10.4 ± 9.16 = (1.24, 19.56)
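A quick sketch of this calculation in R, assuming the daily counts of defective beads are in a vector named defects:

# Sketch: 3-sigma limits for the binomial bead counts (100 beads sampled per day).
m <- 100
pi.hat <- mean(defects) / m                       # estimated defect probability (0.104 here)
mu.hat <- m * pi.hat                              # 10.4
sigma.hat <- sqrt(m * pi.hat * (1 - pi.hat))      # 3.053
c(LCL = mu.hat - 3 * sigma.hat, UCL = mu.hat + 3 * sigma.hat)   # approx. (1.24, 19.56)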
3-Sigma Control Chart for Bead Data
[Figure: daily number of defective beads plotted with the centre line at 10.4 and 3-sigma control limits at 1.24 and 19.56]


Measuring Process Performance
Every organization measures the performance of some of its processes with the idea that managers can
make decisions to maintain or improve the outputs. For example, each month the University produces
a detailed report of spending and income for a large number of accounts measured in absolute terms
and relative to the budget. Large organizations have expensive Management Information Systems that
track financial and other processes and provide regular reports used by management to make all sorts
of decisions.

In this chapter, we concentrate on measuring process outputs and describing the performance of the
process by summarizing the data collected. Statistical methods and thinking play an important role
because the outputs of the process vary from unit to unit. Failure to understand the variation can lead a
manager to odd behaviour and poor decisions.

Example 1
A large manufacturing organization conducts a daily audit of its production that involves careful
checking of 30 units for a large number of possible defects and other failures to conform to
specifications. The results of the previous day’s audit are discussed in a morning meeting by quality
and production managers. The managers spend most of their time in the meeting discussing what went
wrong the previous day and what steps have been taken to ensure that problems have been resolved.
One measure of performance that receives a lot of attention is the number of defects per unit detected
in the audit. Figure 1 is a typical report.

[Figure 1: Part of a Daily Report of Audit Results - defects per unit for the year-to-date average, the day before yesterday, and yesterday, with a dashed target line]

The dashed line is the target set at the start of the year. By year end, the management team expects that
the daily defects per unit will be below the target. The morning meeting is part of this undertaking to
improve the process. The report in Figure 1 would generate, among others, the following
comments:

• “We did a lot better yesterday than the day before. Good job!”
• “We were better than average yesterday”
• “We’re trending in the right direction!”
• “I am a bit worried about meeting the target. We need to be more vigilant!”

There are many questions that someone with statistical training would ask. For example,

• How is the sample of 30 units selected?
• What is known about the measurement system that determines the defects?
• What do the data look like for the last month, day by day? Or the whole year?

Suppose the answers were

• 3 or 4 units are haphazardly selected each hour until the quota of 30 is reached.
• The same crew conducts the audit every day so they know what they are doing. Besides, they
are warned to look extra carefully at the major problems from the previous day.
• The plot of the data is given in Figure 2.

[Figure 2: Defects per Unit for the Previous 20 Days - run chart of daily defects per unit, days 1 to 20]

Then the statistically trained visitor might comment that

• With a sample of size 30 on each of yesterday and the previous day, it is not clear if the
observed difference is real or just due to sampling error.
• Two points do not make a trend! Look at Figure 2! (you should also see the exercises)
• From Figure 2, the process performance is not changing in the long run. As a consequence,
unless there is a fundamental change to the process, the target will not be met. Calling for more
vigilance is wishful thinking.

You should ask yourself the following questions with respect to improving the process and meeting the
target:

• What is the value of the audit?
• What is the value of the daily meeting?

The point of Example 1 is that the people involved had difficulty dealing with the variation in the
process. They forgot that the sample of 30 units did not give an exact measure of the process
performance. They wasted a lot of time explaining the ups and downs in the process output from day to
day and were not achieving any improvement despite their best efforts. At best, you could argue that
they were maintaining the process performance through their efforts.

In terms of summarizing the performance of the process, the run chart in Figure 2 is much better than
Figure 1. The run chart puts yesterday’s result in a meaningful context which helps avoid
misinterpretation and over-interpretation. Putting the data into a reasonable context is a fundamental
rule for presenting process performance data – see Wheeler (1993) for an excellent discussion of this
important point. The following example is adapted from material in that book.

Example 2
A large organization ships products on order to many customers. A key feature of the shipping process
was the promise of on-time delivery. Orders were normally delivered by truck but if there was a
problem, air freight, an expensive alternative was used to ensure that the delivery was on-time. Use of
air freight was called a premium or expedited shipment. The transportation manager receives a
monthly report which contains the information in Table 1.

Table 1: Extract from shipping manager’s report, July 2003

                                   this month   last month   same month     year-to-date
                                                             last year
number of shipments                      5546         4050         5275            36249
number of premium shipments               654          535          324             3611
percentage of premium shipments          11.8         13.2          6.1             10.0

The manager is concerned about the high cost of premium shipments in July. On a percentage basis,
the manager notes that July 2003 is much worse than July 2002 but a bit better than June 2003. Are
things getting better or worse? What would happen in August?

These questions are impossible to answer without more context. Figure 3 is a plot of the percentage of
premium shipments by month since January 2001. From this plot, it is evident that the percentage of
premium shipments has been increasing over the last two years. August is likely to be a bad month.

One of the manager’s staff offers the explanation that since the number of orders has also been
increasing, it is not surprising that the percentage of premium shipments has been increasing because
the shipping department has not been given any extra resources. However, we can see from the run
chart in Figure 4 that this contention is not true and that some other explanation is required.

[Figure 3: Percentage of Premium Shipments by Month - run chart, January 2001 to July 2003]

[Figure 4: Total Shipments by Month - run chart, January 2001 to July 2003]

We see from Example 2 that to interpret process performance for a given month, we need to look in
the context of the process over a longer period of time. The run chart is an excellent way to present this
context. Quoting Joiner (1994), another excellent book you should read, the key to providing context is
to

“PLOT THE DATA”

In other words, we need to give a visual display of the process over time in order to interpret a single
or small set of points correctly.

Some Issues in Assessing Process Performance


The reasons for collecting and presenting process performance data are:

• to assess the effects of fundamental changes to the process (i.e. in our language, the effect of
changing one or more fixed inputs) in the past
• to predict the future performance to see what, if any, action should be taken (i.e. should any
fixed inputs be changed?)

You need to decide among other issues:

• what outputs should be measured
• how can we ensure that the measurement system is adequate
• which of the outputs should be summarized
• over what time frame should the summaries be done (e.g. once a week, month, quarter etc.)
• what is an appropriate summary statistic

We often track outputs that are especially important to the customer (sometimes called key product
characteristics or special characteristics) and major cost drivers, important to the process owners. Be
careful not to choose too many outputs or the useful information gets diluted by the mass of reports. A
good rule of thumb is to stop producing reports that are not used on a regular basis. If you think of
producing the reports as a process and the users of the reports as the customers, then the key is to select
characteristics that meet the customers’ needs.

As discussed in the previous section, a key point in presenting process performance data is to set an
appropriate context. In statistical language, we need to define a study population. Since we are usually
interested in current and future performance, we need to decide how far back in time to go in order to
establish the context. This may not be an easy decision. If we are assessing a fundamental change in
the process, we need to include sufficient time before the change in order to capture the long run
behaviour of the process output. When we are looking at future performance, we need a sufficiently
long record to make the prediction. See Chapter 4 on forecasting.

We often construct a performance measure based on a sample of units from the study population. In
the audit example, 30 units were selected haphazardly from each day’s production. We need to ensure
that the sample is representative of the process over the selected time period and to remember the
possibility of sampling error.

Some Standard Performance Measures


Many outputs are binary as in Example 2 (every order is a premium shipment or not). Other examples
include over-budget or not, defective or not, within specifications or not, on-time or not, customer
satisfied or not, etc. If we code the two possible values of the output y as 0 or 1, we summarize
performance over a fixed time period using the total or percentage of realizations with y = 1. If the
number of realizations in different time periods varies, then these two measures of performance can
give different pictures of how the process is doing.

If the output is a continuous measurement, we use histograms, averages and standard deviations to
summarize performance.

Example 3
In an assembly plant, a number of output characteristics related to the wheel alignment is measured on
every vehicle produced. One output important to the driver is called pull which measures the tendency
of the vehicle to turn on a straight surface if the driver’s hands are removed from the steering wheel.
The specifications for pull are 0.23 ± 0.25. If an alignment characteristic is outside the specification
limits, the vehicle is taken out of sequence for repair. The file ch2example3.txt contains the data,
including the pull measurements, for one day when 453 vehicles were produced.

To summarize the daily performance, we plot a histogram and run chart of the data that include the
specification limits.

The R code used to produce the plots is

br <- -0.15+0.05*(1:15) # creates boundaries for the histogram
hist(pull, breaks=br, freq=F) # creates histogram with vertical scale relative frequency
abline(v=-.02, lty=2, col='red') # creates dashed red vertical line at lower spec limit
abline(v=0.48, lty=2, col='red') # creates dashed red vertical line at upper spec limit
plot(pull, ylim=c(-.1,.6), type='l', main='Run Chart of Pull') # ylim determines the scale on the y-axis
abline(h=-.02, lty=2, col='red') # creates dashed red horizontal line at lower spec limit
abline(h=0.48, lty=2, col='red') # creates dashed red horizontal line at upper spec limit

Figure 4: Histogram and Run Chart of Pull Values

We see from the plots that the pull values are centred near the target value 0.23 and almost all values
are within the specification limits. There are no obvious trends over the day.

The average and standard deviation for the pull values for the day are μ̂ = 0.208 and σ̂ = 0.068. We
can plot the average and standard deviation on a run chart to assess the process in the context of day to
day performance.

Another common measure of performance when the output is measured on a continuous scale is the
capability ratio Ppk, defined as

Ppk = min(U − μ̂, μ̂ − L) / (3σ̂)

where U and L are the upper and lower specification limits. For the pull output, we have

Ppk = min(0.48 − 0.208, 0.208 − (−0.02)) / (3(0.068)) = 1.12

The larger the capability ratio, the better the process performance relative to the specifications. If the
process is centred so that the average μ̂ is (U + L)/2, the numerator of the capability ratio is as large
as possible for the given variation. If there is little variation in the process, then σ̂, and hence the
denominator of Ppk, is small.

If the histogram is bell-shaped (matching a Gaussian density), we can interpret the capability ratio in
terms of the proportion of units outside of specification. See the exercises.

As in the earlier examples we must interpret these measures of performance in a broader context.

References
1. Wheeler, Donald J. (1993), Understanding Variation: The Key to Managing Chaos, SPC Press,
Knoxville, TN.

Stat 372 © R.J. MacKay and S.H. Steiner, University of Waterloo, 2006 II-6
