1 Introduction/Overview
Introduction to Regression
Understanding and quantifying variability in data is the essence of statistics and statistical inference.
In regression modelling, we attempt to explain or account for variation in a response variable, y , by
using a statistical model to describe the relationship between the response and one or
more explanatory variables, x1 , x2 , x3 …
We can then use the model we've developed to learn and answer questions about relationships
between the explanatory variables and the response, and/or predict the value of the response for a
given set of values of the explanatory variables.
Regression is an extremely powerful and widely used tool for investigating and understanding
relationships in any field of interest, including biology, health and social sciences, economics, business,
and finance, and for making decisions based on those relationships.
To begin to appreciate how we may use regression in evidence-based decision making, we
will introduce three examples from the field of business and finance that we will be exploring in
more detail throughout the course.
Example 1:
An auditor wishes to determine whether the cost of overhead claimed by offices in a certain group is
consistent with the office's attributes, including size, age, number of clients, number of employees,
and the cost of living index of the city in which the office is located.
To this end, the auditor creates a regression model to describe the relationship between these
attributes and the overhead claimed by the office in order to estimate the expected overhead for each
office.
The auditor can then investigate any claim for which a large discrepancy exists between the observed
overhead and the expected overhead estimated from the model.
Response variable: (Claimed) overhead
Potential explanatory variables: office size, office age, number of clients, ...
Objective: Examine the difference* between the claimed overhead and the expected overhead
estimated from the model for each office based on the office's size, age, etc., and allocate auditing
resources to those offices associated with a large (positive) difference.
(*we call these differences the residuals, which are extremely important values in regression
analysis, as we will see)
https://uwmo.mobius.cloud/566/6868/assignments/43958/0 1/3
9/7/21, 3:51 PM UW Möbius - 1.1 Introduction/Overview
Example 2:
Is there systemic gender inequity in the salaries of Waterloo faculty members?
To answer this question, a Waterloo working committee obtained information on each faculty member,
including rank, academic unit, years of service, gender, and annual salary, and fit a regression model
to the data.
Based on the model, they found that, after accounting for rank, academic unit, years of service,
and several other variables, males were getting paid significantly more than females, on average.
This regression analysis resulted in an immediate increase of $2905 to the annual
salaries of all female faculty members.
Response variable: Annual Salary
Potential explanatory variables: Rank (Lecturer, Professor, ...), performance review,
Faculty/Department, years of service, ..., gender.
Objective: To determine whether, after accounting for the variation in salary due to rank, years
of service, Faculty/Dept., etc., there is any discrepancy in mean annual salary between male and female
faculty members.
Example 3:
Before listing a house, a realtor wishes to estimate its market value based on recent selling prices of
homes in the area.
Information is obtained on attributes of these homes that may help to account for selling price, such
as size, lot size, number of rooms, number of bathrooms, number of stories, whether the house has a
garage, etc., and a regression model is created to describe the relationship between selling price and
these variables.
The realtor can now use the model to estimate market value and predict selling price of the house,
based on its attributes.
Response variable: Selling price
Potential explanatory variables: house size, style, age, # of bathrooms, district, garage, basement,
swimming pool, ...
Objective: To estimate the market value (i.e. predict selling price) of a house based on its size, age,
style, etc.
Note that in each of these examples, the objective for fitting a regression model is different,
illustrating the power and usefulness of regression modelling.
In the first example, the investigator wishes to detect discrepancies between an office's (claimed)
overhead and the expected overhead for that office estimated from the regression model.
In the last example, by contrast, the objective was to predict the value of the response (selling price) for
a given set of explanatory variates.
We will be looking at each of these examples in more detail throughout the course. First, however, we
will begin with a review of simple linear regression, in which we model the relationship between a
single explanatory variable and the response.
1.2 Graphical and Numerical Summaries for Bivariate Data
Although the raw data contain all the available information about the relationship between size, x,
and overhead, y, we cannot readily synthesize this information without the aid of appropriate
graphical methods and numerical summaries.

With bivariate data, {x, y}, such as we have here, a scatterplot is an essential tool in visualizing
and understanding the nature and strength of the relationship between an explanatory variable and a
response. A scatterplot of the audit data is given below:
> plot(size,overhead,pch=19,xlab='size (sq.ft.)',ylab='overhead ($)',
main='Claimed overhead vs office size (n = 24)')
Note the underlying linear relationship between the response (overhead), and the explanatory variate
(office size).
In addition to a visual representation of the relationship between two quantitative variables, we also
require a quantitative measure of the strength of a linear relationship between two variables, as given
by the correlation coefficient, r, defined as

r = Sxy / √(Sxx Syy)

where Sxx = ∑(xi − x̄)², Syy = ∑(yi − ȳ)², and Sxy = ∑(xi − x̄)(yi − ȳ).
Properties of r :
−1 ≤ r ≤ 1, where the closer r is to 1 (−1), the stronger the positive (negative) relationship.
r is unitless. (Note that the units of the numerator and denominator will cancel). We can thus
compare the relative strength of linear relationships across different scales and datasets.
> cor(overhead,size)
[1] 0.9271985
This suggests a relatively strong positive relationship between office size and overhead.
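The computation of r is easy to carry out directly from its definition. Below is a small Python sketch; the data values are made up for illustration (the full audit dataset is not reproduced here):

```python
import math

# Hypothetical (size, overhead) pairs for illustration only -- not the audit data
x = [1589, 1912, 741, 1200, 950]
y = [218955, 224513, 66542, 150000, 98000]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Sums of squares and cross-products appearing in the definition of r
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# r = Sxy / sqrt(Sxx * Syy): unitless, always between -1 and 1
r = sxy / math.sqrt(sxx * syy)
print(r)
```

Because the units of x and y cancel in the ratio, the same r would be obtained if size were measured in square metres or overhead in thousands of dollars.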
Note that in calculating and interpreting r , we are not attempting to establish the presence of a
causal relationship between the explanatory variate and the response, only the strength of the linear
association or correlation between the two variables. For example, we cannot conclude that larger
offices cause larger overheads, as there may be one or more confounding variables, such as the
number of employees, that may be associated with both office size and overhead. An oft-quoted
saying is:
'Correlation does not imply causation'.
1.3 The Simple Linear Regression Model
We can describe the observed behaviour of the response with a statistical model that includes both:

a deterministic component that describes the variation in y accounted for by the functional
form of the underlying relationship between y and x. Based on the scatterplot, the deterministic
component can be adequately described by the linear function μ = β0 + β1 x, where μ is the mean
value of y for a given value of x.

an error term, denoted by the random variable ϵ, that describes the random variation in y not
accounted for by the underlying relationship with x.
Incorporating both the deterministic and error components into our model yields the simple
linear regression (SLR) model, expressed as
yi = β0 + β1 xi + ϵi i = 1, … , n
where

β0 denotes the intercept parameter
β1 denotes the slope parameter
the index i denotes the observation number (e.g. {x3, y3} denotes the size and overhead
associated with the third office in the dataset).
Adding distributional assumptions on the errors yields the normal SLR model:

Yi = β0 + β1 xi + ϵi,  ϵi ∼ N(0, σ²) ind.,  i = 1, … , n

Under this model, the errors:

have mean zero
are normally distributed
are independent
have a constant variance, denoted by σ² (this property is sometimes referred to
as homoskedasticity)
For the normal model to be an appropriate model to use in investigating the relationship between y
and x, these assumptions must hold. Otherwise, our model will be inappropriate and any
conclusions we obtain from our regression analysis will be invalid.
We will be examining these model assumptions in more detail in later sections.
1.4 Least Squares Estimation of Model Parameters
Substituting each observed (size, overhead) pair into the SLR model yields one equation per office:

218955 = β0 + β1 (1589) + ϵ1
224513 = β0 + β1 (1912) + ϵ2
66542 = β0 + β1 (741) + ϵ3
⋮
In least squares estimation, we find the values for β0 and β1 that yield the smallest sum of squares
of the errors, ∑ ϵi². Using calculus, we find these least squares estimates by minimizing the function

S(β0, β1) = ∑ ϵi² = ∑ [yi − (β0 + β1 xi)]²

(all sums taken over i = 1, … , n).
Setting the partial derivatives of S equal to zero gives:

∂S/∂β0 = −2 ∑ [yi − (β̂0 + β̂1 xi)] = 0

∂S/∂β1 = −2 ∑ xi [yi − (β̂0 + β̂1 xi)] = 0
Rearranging yields the normal equations:

nβ̂0 + β̂1 ∑ xi = ∑ yi

β̂0 ∑ xi + β̂1 ∑ xi² = ∑ xi yi
Solving these normal equations yields the least squares estimates:

β̂1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²

β̂0 = ȳ − β̂1 x̄
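These closed-form estimates are straightforward to compute by hand or in code. A Python sketch on a small hypothetical dataset (not the audit data), which also verifies the two normal-equation constraints on the residuals:

```python
# Hypothetical (x, y) data for illustration
x = [741, 950, 1200, 1589, 1912]
y = [66542, 98000, 150000, 218955, 224513]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# beta1_hat = sum (xi - xbar)(yi - ybar) / sum (xi - xbar)^2
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
beta1_hat = sxy / sxx

# beta0_hat = ybar - beta1_hat * xbar
beta0_hat = ybar - beta1_hat * xbar

# Residuals of the fitted line; by construction they satisfy
# sum(e_i) = 0 and sum(x_i * e_i) = 0
resid = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
print(beta0_hat, beta1_hat)
```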
For the audit data, the least squares parameter estimates are β̂0 = −27877.1 and β̂1 = 126.3, as given
by the R output:
> audit.slr.lm=lm(overhead ~ size) #fits the linear model y ~ x (intercept is fit as default)
> audit.slr.lm
Call:
lm(formula = overhead ~ size)
Coefficients:
(Intercept) size
-27877.1 126.3
The (Intercept) estimate is β̂0 and the size coefficient estimate is β̂1.
1.5 The Fitted Model
(Note that the fitted model is sometimes expressed in terms of the predicted value of the response,
ŷ = β̂0 + β̂1 x. While μ̂ and ŷ are identical in terms of the value they represent, there are subtle
differences in their interpretation that we will discuss in a later section.)
The (fitted) residual of the ith observation, ei, is the difference between the observed response,
yi, and the fitted value, μ̂i, defined as

ei = yi − μ̂i = yi − (β̂0 + β̂1 xi)
For example, the second office in the dataset, a 1912 sq. ft. office with a claimed overhead of
$224513, has an estimated mean overhead of
μ̂2 = −27877.06 + 126.33x2 = −27877.06 + 126.33(1912) = $213666

and residual

e2 = y2 − μ̂2 = 224513 − 213666 = $10847
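This arithmetic is easy to verify; a quick Python sketch using the parameter estimates reported in the R output above:

```python
# Least squares estimates from the fitted audit model
beta0_hat = -27877.06
beta1_hat = 126.33

# Office 2: size x2 = 1912 sq. ft., claimed overhead y2 = $224513
x2, y2 = 1912, 224513

mu2_hat = beta0_hat + beta1_hat * x2   # estimated mean overhead
e2 = y2 - mu2_hat                      # residual

print(round(mu2_hat), round(e2))       # 213666 10847
```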
The fitted line, fitted value and residual for the audit data are illustrated in the plot below:
It is important to distinguish between the errors and the residuals:

The error, ϵi = yi − μi = yi − (β0 + β1 xi), is the random variable, on which we impose certain
distributional assumptions, that we use to model the random variation in the response for a given
value of x.

The residual, ei = yi − μ̂i, is the difference between the value of the observed response and the
estimated mean response, which we calculate from the fitted line. We can think of
the residuals as estimates of the errors.
By taking the partial derivative with respect to each parameter and setting it equal to zero in our least
squares estimation procedure, we have imposed two constraints on our residuals, namely:

∑ ei = 0

∑ xi ei = 0
1.6 Least Squares Estimation of σ2
Inference for the model parameters requires not only the estimation of β0 and β1, but also the
estimation of the error variance, σ².
In any least squares regression model, this is obtained by dividing the sum of squares of the
residuals by the degrees of freedom, giving the least squares estimate as
σ̂² = ∑ ei² / (n − 2) = ∑ (yi − μ̂i)² / (n − 2)

Note that σ̂² is an unbiased estimate of σ² (i.e. E(σ̂²) = σ²).
The residual standard error is the square root of the estimated variance, given by
σ̂ = √( ∑ ei² / (n − 2) )
The residual standard error can be interpreted as the estimated standard deviation of the errors, and
is a measure of the random variation in the response for a given value of x. The smaller the value of
the residual standard error, the more variation in the response is explained by the relationship with x,
and the better the fit of the model.
The residual standard error is part of the summary R output for the fitted model:
> summary(audit.lm)
Call:
lm(formula = overhead ~ size)
Residuals:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 23480 on 22 degrees of freedom
Note the intercept and slope parameter estimates are also provided. We will be exploring
R summary output in great detail throughout the course.
1.7 Interpretation of Slope Parameter Estimates
β̂1 = 126.33

Interpretation of the Slope Parameter Estimate, β̂1
Let μ̂x0 be the estimated mean response at x = x0, and let μ̂x0+1 be the estimated mean response at
x = x0 + 1. Then

μ̂x0+1 = β̂0 + β̂1(x0 + 1)
      = β̂0 + β̂1 x0 + β̂1
      = μ̂x0 + β̂1
Thus we can see that, in general, β̂1 can be interpreted as:

The estimated mean change in the response, y, associated with a change of one unit in x.
For the audit model, β̂1 = 126.33 can be interpreted in the context of the study as: the estimated
mean overhead increases by $126.33 for every increase of one square foot in office size.
Interpretation of the Intercept Parameter Estimate, β̂0

Setting x = 0 in the fitted line gives μ̂ = β̂0. Thus, β̂0 may be interpreted in certain situations as
the estimated mean value of y at x = 0.

This interpretation may be nonsensical or meaningless in cases where x = 0 is not a relevant value, or
where x = 0 is not in the range of values used in the fit of the model.
An office 0 sq. ft. in size has an estimated mean overhead of negative $27877.06 ???
The linear relationship between overhead and size we observed in the scatterplot is only evident for
offices between approx. 500 and 2000 sq. ft in size.
Never extrapolate results to values of x outside the range used to fit the model
Recall the least squares estimate of the standard deviation of the errors, σ, called the residual
standard error and given by:

σ̂ = √( ∑ ei² / (n − 2) )

where ei = yi − μ̂i are the residuals of the fitted model.
Note that the residual standard error is similar to the (sample) standard deviation of the residuals
(only with n − 2 degrees of freedom instead of n − 1 ), and is thus a measure of the variability of the
response about the fitted line. The smaller the residual standard error, the closer the data are to the
fitted line, and the better the fit of the model.
Similar to a standard deviation, the residual standard error can be roughly interpreted as a typical or
'standard' distance (or absolute difference) between the response, yi, and the fitted value, μ̂i.
For the audit model, σ̂ = 23480. This tells us that a typical difference between the observed
overhead and estimated overhead for an office of a given size is approximately $24000. This provides
us with a reasonable reference on the variability of the response relative to its expected value.
1.8 Inference for the Slope Parameter
Inference for β1
'Is there a relationship between overhead and office size in the population?'

Under the normal SLR model,

Yi = β0 + β1 xi + ϵi,  ϵi ∼ N(0, σ²) ind.,  i = 1, … , n,
We can assess whether β1 = 0 and thus whether a (linear) relationship exists by employing one of two
methods of inference for β1 , namely:
Confidence intervals, or
Hypothesis tests
To carry out these procedures, we must first obtain the distribution of the least squares estimator*,
β̂1 = ∑(xi − x̄)(Yi − Ȳ) / ∑(xi − x̄)²
1.8.1 Distribution of Parameter Estimators
Distribution of β̂1

β̂1 = ∑(xi − x̄)(Yi − Ȳ) / ∑(xi − x̄)²
   = [∑(xi − x̄)Yi − Ȳ ∑(xi − x̄)] / ∑(xi − x̄)²
   = ∑(xi − x̄)Yi / ∑(xi − x̄)²      (since ∑(xi − x̄) = 0)
   = ∑ ci Yi

where ci = (xi − x̄) / ∑(xi − x̄)².

→ β̂1 = ∑ ci Yi ∼ Normal (a linear combination of independent normal random variables)
Expressing β̂1 as a linear combination of independent random variables also allows us to easily derive
its mean and variance:

E(β̂1) = E(∑ ci Yi) = ∑ ci E(Yi)
      = ∑ [(xi − x̄)/∑(xi − x̄)²](β0 + β1 xi)
      = β1      (using ∑ ci = 0 and ∑ ci xi = 1)

Var(β̂1) = Var(∑ ci Yi) = ∑ ci² Var(Yi)
        = σ² ∑(xi − x̄)² / (∑(xi − x̄)²)²
        = σ² / ∑(xi − x̄)² = σ² / sxx
β̂1 ∼ N(β1, σ²/sxx)
Recall: In general, for any (unbiased, normally distributed) least squares estimator of a parameter, θ,

(θ̂ − θ) / SE(θ̂) ∼ tdf

where the degrees of freedom, df, is n minus the number of estimable model parameters.
Applying this result to β̂1:

(β̂1 − β1) / SE(β̂1) = (β̂1 − β1) / (σ̂/√sxx) ∼ tn−2

where σ̂ = √( ∑ (yi − μ̂i)² / (n − 2) ) = √( ∑ ei² / (n − 2) ).
We use this result to obtain t-based confidence intervals and hypothesis tests for β1 .
Distribution of β̂0

β̂0 ∼ N( β0, σ² (1/n + x̄²/sxx) )

→ (β̂0 − β0) / SE(β̂0) ∼ tn−2

where SE(β̂0) = σ̂ √(1/n + x̄²/sxx).
Confidence intervals and hypothesis tests for β0 would follow in the same way as for β1.
However, inference for β0 is typically of little relevance, and we will focus primarily on confidence
intervals and hypothesis tests for β1.
1.8.2 Confidence Interval for Slope Parameter
Recall the (1 − α)100% confidence interval for a population mean, μ:

μ̂ ± tn−1,1−α/2 SE(μ̂) = x̄ ± tn−1,1−α/2 σ̂/√n

A (1 − α)100% confidence interval for β1 takes the same form:

β̂1 ± tn−2,1−α/2 SE(β̂1)

where SE(β̂1) = σ̂/√sxx.
Notes:
tn−2,1−α/2 denotes the critical value from a tn−2 distribution corresponding to confidence level
(1 − α)100% . (Be sure you know how to obtain this value for a given confidence level from both
R and the posted t - tables)
tn−2,1−α/2 SE(β^1 ) is called the margin of error of the interval. It can be thought of
as the bound on the difference between the value of the estimate and the actual (unknown)
value of the parameter for the given confidence level.
It should be obvious, both intuitively and from the form of the confidence interval that:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
We can obtain the critical t value, t22,0.975, from either a t-table or R:

> qt(.975,22)
[1] 2.073873

The 95% confidence interval for β1 is thus:

126.33 ± 2.074(10.88)
= 126.33 ± 22.57
= (103.76, 148.90)
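The interval arithmetic can be reproduced in a few lines of Python, using the estimate and standard error from the summary output and the critical value from qt above:

```python
beta1_hat = 126.33   # slope estimate from the audit model
se_beta1 = 10.88     # its standard error
t_crit = 2.074       # t_{22, 0.975}, rounded as in the text

margin = t_crit * se_beta1               # margin of error
lo, hi = beta1_hat - margin, beta1_hat + margin
print(round(lo, 2), round(hi, 2))
```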
1.8.2.1 Interpretation of Confidence Intervals
For the audit model, the 95% confidence interval for β1 of (103.76, 148.90) can be interpreted as:
We are 95% confident* that for every additional increase of one square foot in office size, the mean
increase in overhead is between $103.76 and $148.90.
*The phrase '95% confident' is ambiguous, but is a sufficient and common phrase for conveying
the results of a 95% confidence interval. A more formal interpretation would be:
In repeated sampling of a given size from the population, it is expected that 95% of these samples
would yield (95%) confidence intervals that would contain the actual (unknown) value of β1 .
Note that this interpretation does not imply that there is a probability of .95 that the true (unknown)
value of β1 is contained in the particular interval, (103.76, 148.90), that we obtained from our sample:
once computed, a given interval either contains β1 or it does not.
Consider again the question that motivated the creation of a confidence interval:
'Is there a relationship between overhead and office size in the population?'
Since β1 = 0 is not in the interval, (103.76, 148.90) , we conclude that there is a significant* positive
relationship between overhead and office size.
(Had 0 been in the interval, then 0 would be considered a plausible value for β1 and we would thus
conclude that there was no significant relationship between overhead and office size)
*When we use the word 'significant' in conclusions from confidence intervals and hypothesis tests,
we are referring to statistical significance.
Whether or not, or the extent to which, our conclusions are significant in terms of the
practical implications are not considered in our statistical conclusions.
1.8.3 Hypothesis Tests for Slope Parameter
1. State the null hypothesis, H0, and the alternative hypothesis, Ha.
2. Calculate the value of the test statistic, or discrepancy measure. The test statistic is a
measure based on the difference between the value of the estimate of the parameter that we
observe in our data and the hypothesized value of the parameter.
3. Calculate the p-value from the test statistic. The p-value is the probability that we would
observe at least as big a difference between the observed estimate and the hypothesized
value of the parameter, if H0 is true.
The smaller the p-value, the more evidence against H0 .
4. Draw a conclusion about the value of the parameter hypothesized in H0 , based on the p-
value:
Recall, from the distribution of β̂1:  (β̂1 − β1) / SE(β̂1) ∼ tn−2

Under H0: β1 = 0, the observed value of the test statistic is

t = β̂1 / SE(β̂1) = 126.33 / 10.88 = 11.61
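The test statistic computation, sketched in Python with the values from the summary output:

```python
beta1_hat = 126.33     # slope estimate from the audit model
se_beta1 = 10.88       # its standard error
beta1_null = 0         # hypothesized value of beta1 under H0

# t = (estimate - hypothesized value) / standard error
t = (beta1_hat - beta1_null) / se_beta1
print(round(t, 2))     # 11.61
```

Since |t| = 11.61 far exceeds the critical value t22,0.975 = 2.074, the corresponding p-value is tiny.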
3. Obtain the p-value = P(|T| > |t|) = 2P(T > |t|), where T ∼ t22 and t is the value of the test
statistic.
[1] 7.47824e-11
4. Conclusion in context of study. Reject H0 (p-value < .05), and conclude that
There is a significant positive relationship between overhead and office size.
Note that the summary R output displays the results of hypothesis tests for both the intercept and
slope parameters:
> summary(audit.lm)
Call:
lm(formula = overhead ~ size)
Residuals:
---
1.8.4 Further Notes on Hypothesis Tests
For example, we may reject H0 : β1 = 0 and conclude there is a significant relationship, when no
relationship exists (β1 = 0 ). Conversely, we may accept H0 : β1 = 0 when, in fact, β1 ≠ 0 and a
relationship exists.
The possibility that we could have made one of these errors should always be kept in mind when
drawing conclusions from a hypothesis test (as well as from a confidence interval). These two errors
are referred to as:

Type I error: Rejecting the null hypothesis when it is true.

Type II error: Accepting (i.e. not rejecting) the null hypothesis when it is false.
Note that for any hypothesis test carried out at the 5% significance level, P(Type I error) = .05.
(Convince yourself of this. It will help in your understanding of p-values.)
That is:
If the 95% confidence interval contains 0, then a (two-sided) test of H0 : β1 = 0 would yield a p-
value ≥ .05
If the 95% confidence interval does not contain 0, then a (two-sided) test of H0 : β1 = 0 would
yield a p-value < .05
2.1 Multiple Regression Model
office overhead size age employees col clients
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
23 188435 1812 15 14 1.00 2147
24 35099 607 15 2 0.95 492
By extending the SLR model to include p explanatory variables, we obtain the multiple linear
regression model:

yi = β0 + β1 xi1 + β2 xi2 + … + βp xip + ϵi,  i = 1, … , n

Collecting the n responses into the vector y, the parameters into β = (β0, β1, … , βp)ᵀ, the errors
into ϵ, and the explanatory variables into the n × (p + 1) matrix X whose ith row is
(1, xi1, xi2, … , xip), we write the model in matrix form as

y = Xβ + ϵ
Note the column of 1's prior to the vectors of explanatory variables in the X matrix that must be
included for models fit with an intercept, β0 .
Adding the normal error assumptions yields the normal multiple linear regression model:

Y = Xβ + ϵ,  ϵ ∼ N(0, σ²I)

where Var(ϵ) = σ²I is the covariance matrix of the error random vector, ϵ, with
Cov(ϵi, ϵi) = Var(ϵi) = σ², i = 1, … , n, as the diagonal elements and
Cov(ϵj, ϵk), j, k = 1, … , n, j ≠ k, on the off-diagonals.

Note that expressing the covariance matrix in this way captures both the constant variance
assumption (Var(ϵi) = σ² for all i) and the independence assumption, since

Var(ϵ) = σ²I → Cov(ϵj, ϵk) = 0, j ≠ k → independent errors for ϵi ∼ normal
2.2 Least Squares Estimation of Beta
Setting the partial derivatives of S(β0, β1, … , βp) equal to zero gives:

∂S/∂β0 = −2 ∑ (yi − (β̂0 + β̂1 xi1 + … + β̂p xip)) = 0

∂S/∂β1 = −2 ∑ xi1 (yi − (β̂0 + β̂1 xi1 + … + β̂p xip)) = 0

⋮

∂S/∂βp = −2 ∑ xip (yi − (β̂0 + β̂1 xi1 + … + β̂p xip)) = 0
Rearranging yields the normal equations:

nβ̂0 + (∑ xi1)β̂1 + … + (∑ xip)β̂p = ∑ yi

(∑ xi1)β̂0 + (∑ xi1²)β̂1 + … + (∑ xi1 xip)β̂p = ∑ xi1 yi

⋮

(∑ xip)β̂0 + (∑ xi1 xip)β̂1 + … + (∑ xip²)β̂p = ∑ xip yi
In matrix form, the normal equations are XᵀXβ̂ = Xᵀy. Solving for β̂ by multiplying both sides of
the equation by (XᵀX)⁻¹* yields the least squares estimate:

β̂ = (XᵀX)⁻¹Xᵀy
(*For XᵀX to be invertible, X must be of full rank. That is, all p + 1 columns of X must be
linearly independent. Otherwise a unique solution will not exist. We will explore this issue in more
detail in a later topic.)
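A NumPy sketch of the matrix computation on a small hypothetical design (two explanatory variables plus an intercept). Solving the normal equations XᵀXβ̂ = Xᵀy directly is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Hypothetical data for illustration only -- not the audit dataset
x1 = np.array([1589., 1912., 741., 1200., 950., 1400.])
x2 = np.array([25., 30., 10., 18., 12., 22.])
y = np.array([218955., 224513., 66542., 150000., 98000., 180000.])

# Design matrix X: a column of 1's (for the intercept) followed by the
# explanatory variables, so X has p + 1 = 3 columns
X = np.column_stack([np.ones(len(y)), x1, x2])

# Solve X'X beta = X'y, i.e. beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```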
Notes:

The fitted line, μ̂ = β̂0 + β̂1 x1 + … + β̂p xp, can be represented in matrix form by μ̂ = xᵀβ̂,
where xᵀ = (1, x1, … , xp).

(Note the use of bold type when representing a matrix or vector, and regular type when representing
a scalar.)
2.2.1 Least Squares Estimation - audit model
> audit.lm=lm(overhead~size+age+employees+col+clients)
> audit.lm
Coefficients:
β̂ᵀ = [β̂0 β̂1 β̂2 β̂3 β̂4 β̂5]
Computing the estimates directly via β̂ = (XᵀX)⁻¹Xᵀy:

> y = overhead
> beta_hat = XtXinv%*%t(X)%*%y
> beta_hat
(Intercept) -198262.24220
size 31.26158
age 330.37544
employees 4695.73212
col 178136.65602
clients 38.52180
Fitted values, μ̂i = β̂0 + β̂1 xi1 + β̂2 xi2 + … + β̂p xip = xiᵀβ̂:
> fitted(audit.lm)
> t(fitted.audit.lm)
Residuals, ei = yi − μ̂i:
> residuals(audit.lm)
1 2 3 4 22 23 24
18256.323 19167.976 6975.875 4360.160 … -4493.936 -1487.579 13177.321
Computing the residuals directly, e = y − μ̂ = y − Xβ̂:

> res.audit.lm=y-X%*%beta_hat
> t(res.audit.lm)
2.3 The Hat Matrix
The vector of fitted values can be written as

μ̂ = X(XᵀX)⁻¹Xᵀy = Hy

where H = X(XᵀX)⁻¹Xᵀ is called the hat matrix. H is symmetric:

Hᵀ = [X(XᵀX)⁻¹Xᵀ]ᵀ
   = X[(XᵀX)⁻¹]ᵀXᵀ      ((AB)ᵀ = BᵀAᵀ)
   = X[(XᵀX)ᵀ]⁻¹Xᵀ      ((A⁻¹)ᵀ = (Aᵀ)⁻¹)
   = X(XᵀX)⁻¹Xᵀ = H

and idempotent:

HH = X(XᵀX)⁻¹(XᵀX)(XᵀX)⁻¹Xᵀ
   = X(XᵀX)⁻¹Xᵀ = H
The residual vector can then be written as

e = y − μ̂ = y − Hy = (I − H)y

and the response vector as

y = μ̂ + e = Hy + (I − H)y
where Hy⊥(I − H)y . (This can be easily shown from the symmetric and idempotent properties of H )
This tells us that the response vector, y, can be decomposed into its two orthogonal elements: the
vector of fitted values, μ̂, and the vector of residuals, e. This decomposition forms the basis of ANOVA
methods, in which the variation in the response is partitioned into its two components - variation
accounted for by the fitted model, and variation not accounted for by the model.
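These properties are easy to verify numerically. A NumPy sketch on a small hypothetical design matrix, checking that H is symmetric and idempotent and that the fitted values are orthogonal to the residuals:

```python
import numpy as np

# Hypothetical design: intercept plus one explanatory variable
X = np.column_stack([np.ones(5), np.array([1., 2., 3., 4., 5.])])
y = np.array([2., 4., 5., 4., 5.])

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

mu_hat = H @ y               # fitted values, mu_hat = Hy
e = (np.eye(5) - H) @ y      # residuals, e = (I - H)y

# y decomposes into the two orthogonal pieces mu_hat and e
print(mu_hat @ e)
```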
2.4 Least Squares Estimation of Sigma^2
Recall, for the SLR model, the least squares estimate of σ²:

σ̂² = ∑ ei² / (n − 2) = ∑ (yi − μ̂i)² / (n − 2)

where n − 2 is the degrees of freedom (df), resulting from the two constraints imposed on the
residuals through the least squares estimation of the two parameters, β0 and β1.
The degrees of freedom for a p explanatory variable multiple regression model with p + 1
parameters (including the intercept, β0) is thus n − (p + 1), yielding the least squares estimate:

σ̂² = ∑ ei² / (n − (p + 1))
For the audit model (n = 24, p = 5):

σ̂ = √( ∑ ei² / 18 ) = 14430
> summary(audit.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
---
2.5 Least Squares vs. Maximum Likelihood Estimation
For ϵi ∼ N(0, σ²) ind., the likelihood function is given by:

L(β0, β1, … , βp | y1, … , yn) = ∏ f(yi)
= ∏ [1/√(2πσ²)] e^(−(yi − μi)²/(2σ²))
= (2πσ²)^(−n/2) e^(−∑(yi − μi)²/(2σ²))

(products and sums over i = 1, … , n). The log-likelihood,

l = log(L) = c − ∑(yi − (β0 + β1 xi1 + … + βp xip))² / (2σ²)

is maximized for the same value of β = (β0, β1, … , βp)ᵀ that minimizes the sum of squares of the
errors,

∑ ϵi² = ∑ (yi − (β0 + β1 xi1 + … + βp xip))²

That is, under normal errors, the maximum likelihood and least squares estimates of β coincide.
How do these two (and other possible) estimation methods compare when the errors are not assumed
to be normal?
Stated more formally (the Gauss-Markov theorem):

Among all unbiased, linear estimators, β̂* = M*Y, the least squares estimator, β̂ = MY, where
M = (XᵀX)⁻¹Xᵀ, has the smallest variance. That is:

Var(β̂*) = Var(β̂) + σ²(M* − M)(M* − M)ᵀ

where (M* − M)(M* − M)ᵀ is a positive semidefinite matrix (a matrix A is positive semidefinite if
aᵀAa ≥ 0 for any vector a).
2.6 Inference: Distribution of Parameter Estimators
Distribution of β̂

Y = Xβ + ϵ, ϵ ∼ N(0, σ²I) → Y ∼ N(Xβ, σ²I) → β̂ = (XᵀX)⁻¹XᵀY ∼ (multivariate) Normal
E[β̂] = E[(XᵀX)⁻¹XᵀY]
     = (XᵀX)⁻¹Xᵀ E[Y]
     = (XᵀX)⁻¹XᵀXβ
     = β
Var[β̂] = Var[(XᵀX)⁻¹XᵀY]
       = (XᵀX)⁻¹Xᵀ Var[Y] [(XᵀX)⁻¹Xᵀ]ᵀ      (Var(AY) = A Var(Y) Aᵀ)
       = σ²(XᵀX)⁻¹XᵀX(XᵀX)⁻¹
       = σ²(XᵀX)⁻¹

→ β̂ ∼ N(β, σ²(XᵀX)⁻¹)
The marginal distribution of each estimator is

β̂j ∼ N(βj, σ²(XᵀX)⁻¹jj),  j = 0, 1, 2, … , p

→ (β̂j − βj) / SE(β̂j) ∼ tn−(p+1)
Notes:

Var(β̂j) = σ²(XᵀX)⁻¹jj (the parameter estimators do not have constant variance)

→ SE(β̂j) = σ̂ √((XᵀX)⁻¹jj) (required for confidence intervals and hypothesis tests)

Cov(β̂j, β̂k) = σ²(XᵀX)⁻¹jk ≠ 0 (the parameter estimators are not, in general, independent)
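A NumPy sketch (hypothetical data) showing how the standard errors arise from the diagonal of σ̂²(XᵀX)⁻¹; for a single explanatory variable, SE(β̂1) reduces to the SLR formula σ̂/√sxx:

```python
import numpy as np

# Hypothetical data: intercept plus one explanatory variable (p = 1)
X = np.column_stack([np.ones(6), np.array([1., 2., 3., 4., 5., 6.])])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
n, k = X.shape                      # k = p + 1 estimated parameters

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

e = y - X @ beta_hat                # residuals
sigma2_hat = (e @ e) / (n - k)      # sigma^2 estimated on n - (p + 1) df

# SE(beta_j) = sigma_hat * sqrt[(X'X)^{-1}_jj]
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
print(beta_hat, se)
```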
Interpretation of β̂j

Since the parameter estimators are not independent, the value of β̂j, the estimate associated with the
variable xj, will depend on the other variables in the model.

β̂j can thus be interpreted as: the estimated mean change in the response associated with a
change of one unit in xj, after accounting for the other variables (i.e. while holding all other
variables constant).
Example: Fitted audit model:
> summary(audit.lm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -198262.24 74354.09 -2.666 0.0157
---
Note the different standard errors, resulting from the different values of (XᵀX)⁻¹jj, for the
different estimators.
Note also the change in the size parameter estimate (31.26) from that of the SLR fitted model
(126.33).
Interpretation of β̂j: β̂2 = 330.38 → after accounting for size, # of employees, col, and # of clients,
each additional year in the age of the office is associated with an estimated increase in overhead of
$330.38.
> SE_betahats
2.7 Inference - Confidence Intervals and Hypothesis Tests for Model Parameters
Recall the (1 − α)100% confidence interval for β1 in the SLR model:

β̂1 ± tn−2,1−α/2 SE(β̂1)

This result can easily be extended to a (1 − α)100% confidence interval for βj:

β̂j ± tn−(p+1),1−α/2 SE(β̂j)

For example, a 95% confidence interval for β5, the parameter associated with clients in the audit
model, is:

38.52 ± 2.101(33.03)
= 38.52 ± 69.40
= (−30.88, 107.92)
Since the interval encompasses 0, we conclude that, after accounting for the other model variates,
there is no significant relationship between number of clients and overhead.
Hypothesis Tests for βj

H0: β5 = 0
Ha: β5 ≠ 0

t = (β̂5 − β5) / SE(β̂5) = (β̂5 − 0) / SE(β̂5) = β̂5 / SE(β̂5) = 38.52 / 33.03 = 1.166
From the t-table: P(t18 > 1.330) = 0.10 → 2P(t18 > 1.330) = .20 → p-value = 2P(t18 > 1.166) > 0.20
We do not reject H0 (p-value > .05). There is no significant relationship between # of clients and
overhead, after accounting for the other model variables.
> summary(audit.lm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -198262.24 74354.09 -2.666 0.0157
2.8 Multicollinearity
Consider the matrix of pairwise scatterplots for the audit dataset (using >plot(audit) in R):
Take a minute to examine the relationships between the response and explanatory variates (top row)
and the relationships among the explanatory variates in the remaining rows. Note especially the
relationships among size, employees, and clients.
The scatterplot reveals strong linear associations (correlations) among some of the explanatory variables - particularly between employees and clients (r > .99).

When strong (linear) relationships are present among two or more explanatory variables, we say these variables exhibit multicollinearity.
Multicollinearity leads to inflated (i.e. increased) variances of the associated parameter estimators,
and correspondingly, inflated standard errors. This in turn leads to wide (imprecise) confidence
intervals and inaccurate conclusions from hypothesis tests, due to inflated p-values.
We quantify the multicollinearity associated with each explanatory variable using the variance inflation factor,

VIF_j = 1 / (1 − R²_j)

where R²_j is the coefficient of determination (Multiple R-squared in R) of the model fit with x_j as the response variable and the remaining explanatory variables as predictors.
> audit.emp.lm=lm(employees~size+age+col+clients)
> summary(audit.emp.lm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.5700313 2.9959159 -1.192 0.2481
---
[1] 68.96552
Since the VIF is extremely large, we would typically remove employees from the model.
Note the effect the removal of employees from the model has on the standard errors, and
subsequently the p-values, of the other variables (esp. clients).
Before removal:
After removal:
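The VIF calculation can be sketched outside of R as well. The Python example below uses synthetic stand-in data (not the audit dataset), with an employees variable built to be nearly collinear with clients, and computes VIF_j = 1/(1 − R²_j) by regressing one explanatory variable on the others:

```python
import numpy as np

# Compute a VIF "by hand": regress one explanatory variable on the others
# and apply VIF_j = 1 / (1 - R^2_j). Synthetic stand-ins for the audit data.
rng = np.random.default_rng(1)
n = 24
clients = rng.uniform(500, 2000, n)
size = rng.uniform(400, 1500, n)
employees = 0.01 * clients + rng.normal(0, 0.5, n)   # nearly collinear with clients

# Least squares fit of employees ~ clients + size
X = np.column_stack([np.ones(n), clients, size])
beta, *_ = np.linalg.lstsq(X, employees, rcond=None)
fitted = X @ beta

r2 = 1 - np.sum((employees - fitted) ** 2) / np.sum((employees - employees.mean()) ** 2)
vif = 1 / (1 - r2)
print(vif > 10)   # True: severe multicollinearity, as with the audit VIF of 69
```

A VIF of 68.97, as in the audit output, corresponds to R²_j ≈ 0.986: almost all the variation in employees is reproduced by the other explanatory variables.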
2.9 Confidence Intervals for a Response Mean and Prediction Intervals for a Response
Once we have fit the model to our data, we may wish to use the fitted model to estimate the mean
response, or predict the value of the response, of a new unit in the population (one that was not used
in the fit of the model).
For example, after fitting the audit model to the 24 offices, an auditor may wish to use the model on
other offices in the population to assess the consistency of their claimed overhead with the office
attributes (size, age, ...). This, in fact, would likely be the main objective for fitting a regression model
to the audit data.
Consider the following questions an auditor might wish to address about a new office from the population that is 1000 ft², 12 years old, with 1300 clients and a cost of living index of 1.02:
1. What is the estimated mean overhead for all offices with these attributes in the population?
2. What is the predicted overhead of this office?
Both these questions will yield the same value regardless of whether we're talking about the estimated mean response, μ̂_new, or the predicted response, ŷ_new, since

μ̂_new = ŷ_new = β̂_0 + β̂_1 x_{new,1} + … + β̂_p x_{new,p} = x_new^T β̂
Distribution of μ̂_new

β̂ ∼ N(β, σ²(X^T X)^{-1})

μ̂_new = x_new^T β̂ ∼ Normal (linear combination of normal r.v.'s)

E(μ̂_new) = E(x_new^T β̂)
= x_new^T E(β̂)
= x_new^T β (= μ_new)

Var(μ̂_new) = Var(x_new^T β̂)
= x_new^T Var(β̂) x_new
= σ² x_new^T (X^T X)^{-1} x_new
A (1 − α)100% confidence interval for μ_new:

μ̂_new ± t_{n−(p+1), 1−α/2} σ̂ √(x_new^T (X^T X)^{-1} x_new)
Example: Provide a 95% confidence interval for the mean overhead for offices in the population that are 1000 ft², 12 years old, with 1300 clients and a cost of living index of 1.02.
> new_x=data.frame(size=1000,age=12,col=1.02,clients=1300)
> predict(audit2.lm,new_x,interval='confidence',level=.95)
Interpretation: We can be 95% confident that the mean overhead for offices in the population with
these characteristics is between $97,460 and $112,202.
Prediction interval for y_new

Whereas the variance of μ̂_new = x_new^T β̂ is based solely on the variance of the parameter estimators, the variance of the prediction error, y_new − ŷ_new, also includes the variance of the new response about its mean:

Var(y_new − ŷ_new) = σ²(1 + x_new^T (X^T X)^{-1} x_new)
Example: Provide a 95% prediction interval for an office in the population that is 1000 ft², 12 years old, with 1300 clients and a cost of living index of 1.02.

> predict(audit2.lm,new_x,interval='prediction',level=.95)
104831.2 73946.72 135715.7

→ $104831 ± $30885 = (73946, 135715)

Interpretation: We predict with 95% confidence that the overhead for this office is between $73,946 and $135,715.
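The two interval formulas can be compared numerically. The Python sketch below (synthetic SLR data, not the audit fit) computes the standard error of the estimated mean response and of a prediction at a new x, showing that the prediction interval is always the wider of the two because of the extra "1 +" under the square root:

```python
import numpy as np

# Standard errors for a mean response vs a prediction at a new x, using the
# matrix formulas above. Synthetic SLR data, not the audit fit.
rng = np.random.default_rng(2)
n = 24
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
sigma_hat = np.sqrt(resid @ resid / (n - 2))
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 5.0])                              # new observation: x_new = 5
se_mean = sigma_hat * np.sqrt(x0 @ XtX_inv @ x0)       # for the CI on the mean
se_pred = sigma_hat * np.sqrt(1 + x0 @ XtX_inv @ x0)   # for the PI on y_new
print(se_pred > se_mean)   # True: the prediction interval is always wider
```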
Recall that in SLR the confidence and prediction interval half-widths at x_new are proportional to

σ̂ √(1/n + (x_new − x̄)²/Sxx)   and   σ̂ √(1 + 1/n + (x_new − x̄)²/Sxx),

respectively, so both intervals are narrowest at x̄. This can be generalized to the multiple regression model. The closer {x_1, x_2, . . . , x_p} is to the centroid {x̄_1, x̄_2, . . . , x̄_p}, the narrower the interval.
2.10 Modelling Categorical Explanatory Variables
10 stores were randomly assigned to implement each of three promotion types: promo A, promo B, and the control (no promo).
Response variable: The percent change in sales over the two-week period of the study.
A regression analysis was conducted to investigate the relationship between promo type and sales.
{x1 , x2 } = {1, 0} → Promo A; {x1 , x2 } = {0, 1} → Promo B; {x1 , x2 } = {0, 0} → No promo
Note that only two indicator variables are required to model the three categorical levels. In general,
in models fit with an intercept, l − 1 indicator variables are needed to describe l category levels.
To understand this from a mathematical perspective, consider the consequences if we were to attempt to add a third indicator variate

x_3 = 1 if no promo store, 0 otherwise

in the model.
Since x3 = 1 − (x1 + x2 ) , X is not of full rank, having only three linearly independent columns (the
fourth column, containing the x3 values, is a linear combination of the preceding three columns)
As a result, (X T X) is not invertible, and the least squares estimate β
^
^
β = (X
T
X)
−1
X
T
y does not
exist.
To understand how we can correctly interpret parameter estimates associated with indicator
variables, we refer to the fitted model,
μ̂ = β̂_0 + β̂_1 x_1 + β̂_2 x_2

where μ̂ is the estimated mean (percent change in) sales associated with a given promotion type.

For the no-promo stores, {x_1, x_2} = {0, 0}, so μ̂_no promo = β̂_0. Thus, β̂_0 can be interpreted as the estimated mean sales for stores that used no promotion.

β̂_0 = −0.870 tells us that, for those stores that used no promotion, mean sales decreased by an estimated 0.87 percent over the period of the study.

For the Promotion A stores,

μ̂_A = β̂_0 + β̂_1(1) + β̂_2(0) = β̂_0 + β̂_1 = μ̂_no promo + β̂_1

Thus, β̂_1 can be interpreted as the difference in estimated mean sales for stores that used Promotion A relative to the stores that used no promotion.

β̂_1 = 8.35 tells us that stores that used Promotion A had an estimated 8.35 higher percent change in sales than stores that used no promotion.

Similarly, β̂_2 = 2.97 tells us that stores that used Promotion B had an estimated 2.97 higher percent change in sales than stores that used no promotion.
2.10.1 Modelling Categorical Variables - Inference for Parameters
Recall the output from the model fit to the promotion data:

Is the mean sales for stores using Promotion A significantly different from that of stores using no promotion? We can answer this question using a confidence interval or hypothesis test for β_1 in the same manner as learned previously:
H0 : β1 = 0 Ha : β1 ≠ 0
t = 3.547
Since p-value < 0.05, we reject H0 , and conclude that stores using promotion A had significantly
higher sales than stores using no promotion.
We could also calculate a 95% confidence interval for β1 to reach the same conclusion.
Similarly, we conclude that there was no significant difference in mean sales between stores using
promotion B and stores using no promotion (p-value = 0.21792).
E(β̂_j − β̂_k) = E(β̂_j) − E(β̂_k) = β_j − β_k

Var(β̂_j − β̂_k) = Var(β̂_j) + Var(β̂_k) − 2Cov(β̂_j, β̂_k)
= σ²(X^T X)^{-1}_{jj} + σ²(X^T X)^{-1}_{kk} − 2σ²(X^T X)^{-1}_{jk}
= σ²((X^T X)^{-1}_{jj} + (X^T X)^{-1}_{kk} − 2(X^T X)^{-1}_{jk})

Then β̂_1 − β̂_2 ∼ N(β_1 − β_2, σ²((X^T X)^{-1}_{11} + (X^T X)^{-1}_{22} − 2(X^T X)^{-1}_{12})) yields the test statistic

t = (β̂_j − β̂_k)/SE(β̂_j − β̂_k) ∼ t_{n−(p+1)} under H_0: β_1 − β_2 = 0, where

SE(β̂_j − β̂_k) = σ̂ √((X^T X)^{-1}_{jj} + (X^T X)^{-1}_{kk} − 2(X^T X)^{-1}_{jk})
A more widely applicable method to test hypotheses for any linear combination of model parameters, including H_0: β_j − β_k = 0, is the additional sum of squares method, based on the F test statistic.
We will test H0 : β1 − β2 = 0 for the promo model using this method in an upcoming lesson.
2.12 Analysis of Variance (ANOVA)
where the regression sum of squares, SS(Reg) , is the variation explained by the model, and the
residual sum of squares, SS(Res), is the variation in the response left unexplained (i.e., not
accounted for by the model variables).
In ANOVA (ANalysis Of VAriance) methods of inference, we draw conclusions about the relative fit
of a model or models by comparing these two sources of variation. The greater the variation explained
by the model relative to the variation unexplained, the better the fit of the model.
∑_i (y_i − ȳ)² = ∑_i (y_i − μ̂_i + μ̂_i − ȳ)²
= ∑_i (μ̂_i − ȳ)² + ∑_i (y_i − μ̂_i)² + 2 ∑_i (y_i − μ̂_i)(μ̂_i − ȳ)

For the cross term:

∑_i (y_i − μ̂_i)(μ̂_i − ȳ) = ∑_i μ̂_i (y_i − μ̂_i) − ȳ ∑_i (y_i − μ̂_i)
= ∑_i μ̂_i e_i − ȳ ∑_i e_i
= ∑_i μ̂_i e_i    (∑ e_i = 0)
= μ̂^T e
= (Hy)^T (I − H)y
= y^T H^T y − y^T H^T Hy
= 0    (H symmetric, idempotent)

Thus,

∑_i (y_i − ȳ)² = ∑_i (μ̂_i − ȳ)² + ∑_i (y_i − μ̂_i)²
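The decomposition can be verified numerically for any least squares fit. A Python sketch on arbitrary synthetic data:

```python
import numpy as np

# Numerical check that the cross term vanishes, i.e. SS(Tot) = SS(Reg) + SS(Res),
# for an arbitrary least squares fit (any full-rank X with an intercept will do).
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
mu_hat = H @ y
e = y - mu_hat

ss_tot = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((mu_hat - y.mean()) ** 2)
ss_res = np.sum(e ** 2)
print(np.isclose(ss_tot, ss_reg + ss_res))   # True
print(np.isclose(np.sum(e), 0))              # True: residuals sum to zero
```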
Coefficient of Determination
By partitioning the total variation in the response into its two component sources of variation, as described by the relationship SS(Tot) = SS(Reg) + SS(Res), we see that the ratio SS(Reg)/SS(Tot), or, equivalently, 1 − SS(Res)/SS(Tot), measures the proportion of the variation in the response explained by the model. We call this measure the coefficient of determination, or more simply, the (multiple) R-squared, and denote it by

R² = 1 − SS(Res)/SS(Tot)
For the fitted audit model below (fit now without # of employees due to multicollinearity issues), an R-squared value of 0.9549 tells us that over 95% of the variation in overhead is explained by the office's size, age, col and # of clients.
2.12.1 F test and the ANOVA table
with a test statistic that compares the relative magnitudes of the variation explained by the model,
SS (Reg), and the variation left unexplained, SS (Res).
The F test statistic and p-value are provided in the last line of the summary output.
Is there a (linear) relationship between overhead and at least one of size, age, col, or # of clients?
H0 : β1 = β2 = … = βp = 0
F = 100.5
Reject H_0. At least one of size, age, col, # of clients is significantly related to overhead.
The test of H_0: β_1 = β_2 = … = β_p = 0 is often summarized in an ANOVA table, that shows not only the F test statistic and p-value, but also the breakdown of the sums of squares (and mean squares) of the two sources of variation.

Source       df        SS        MS                  F                 p-value
Regression   p         SS(Reg)   SS(Reg)/p           MS(Reg)/MS(Res)   P(F_{p, n−(p+1)} > F)
Residual     n−(p+1)   SS(Res)   SS(Res)/(n−(p+1))
Total        n−1       SS(Tot)
Exercise: For the audit model, confirm the value of the test statistic, F = 100.5, and complete the ANOVA table from values in the summary output.

We can obtain SS(Res) from the residual standard error,

σ̂ = √(∑ e_i² / (n − (p+1))) = √(SS(Res) / (n − (p+1)))

so that

SS(Res) = σ̂²(n − (p+1)) = 14330²(19) = 3901629100
Source       df   SS            MS            F       p-value
Regression   4    82608993960   20652248490   100.6   1.66 × 10^{-12}
Residual     19   3901629100    205348900
Total        23   86510623060
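These table entries follow from the summary output alone; a short Python check of the arithmetic (the slight discrepancy with F = 100.5 comes from σ̂ being rounded to 14330 in the output):

```python
# Confirm the audit ANOVA entries from the summary output alone:
# sigma_hat = 14330 on 19 residual df, SS(Reg) = 82608993960 on 4 df.
ss_res = 14330 ** 2 * 19        # from the residual standard error
ms_res = ss_res / 19            # = sigma_hat^2
ss_reg = 82608993960
ms_reg = ss_reg / 4

F = ms_reg / ms_res
print(ss_res)         # 3901629100
print(round(F, 1))    # 100.6 (vs 100.5 in the output, due to rounding of sigma_hat)
```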
2.13 Additional Sum of Squares
Y = β_0 + β_{k+1} x_{k+1} + β_{k+2} x_{k+2} + … + β_p x_p + ϵ,  ϵ ∼ N(0, σ²)
To determine the better model, we assess the difference in the variation explained by the full and reduced models, expressed as SS(Reg)_full − SS(Reg)_red, or equivalently, by SS(Res)_red − SS(Res)_full.
We call this difference in variation between the two models the additional sum of squares.
H0 : β1 = β2 = 0
Ha : at least one of β1 , β2 ≠ 0
As an exercise, we can calculate the test statistic by obtaining the residual sum of squares of the full
and reduced models from the output, in the same way we did in creating the ANOVA table.
F = [(SS(Res)_red − SS(Res)_full)/(df_red − df_full)] / [SS(Res)_full/df_full]
= [(15360²(21) − 14330²(19))/2] / 14330²
= 2.564

From F table: P(F_{2,19} > 3.52) = .05 → p-value = P(F_{2,19} > 2.56) > .05
Using R:
> 1-pf(2.564,2,19)
[1] 0.1033253
Since the p-value > .05, we do not reject H0 . The reduced model is preferred.
More specifically, age and size together do not account for significant additional variation in overhead
after accounting for col and clients, so we do not need them in the model.
We can verify these results using the anova function in R:
> anova(audit.red.lm,audit.full.lm)
1 21 4954374034
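The F statistic above can be reproduced from the two residual standard errors alone; a Python check of the arithmetic:

```python
# Reproduce the additional-sum-of-squares F statistic from the two residual
# standard errors: sigma_hat_red = 15360 on 21 df, sigma_hat_full = 14330 on 19 df.
ss_res_red = 15360 ** 2 * 21
ss_res_full = 14330 ** 2 * 19
df_red, df_full = 21, 19

F = ((ss_res_red - ss_res_full) / (df_red - df_full)) / (ss_res_full / df_full)
print(round(F, 3))   # 2.564
```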
2.13.1 Additional Sum of Squares Test and Categorical Variables
Suppose we wish to test for a difference in mean sales between Promotion A and Promotion B:

H_0 : β_1 − β_2 = 0
H_a : β_1 − β_2 ≠ 0

To do so, we need to fit the reduced model under H_0: β_1 − β_2 = 0 (i.e. under the restriction that β_1 = β_2 = β*), given by

Y = β_0 + β* x_1 + β* x_2 + ϵ
= β_0 + β*(x_1 + x_2) + ϵ
= β_0 + β* x* + ϵ

where x* = x_1 + x_2:

> x_promo=x1+x2
> x_promo
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
⋮
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.870 1.786 -0.487 0.6299
> anova(promo_red.lm,promo.lm)
1 28 893.02
(As an exercise, confirm the value of the test statistic and the p-value using the procedure we used in
the previous lesson)
We reject H_0 and conclude that the mean sales associated with Promotion A is significantly higher than the mean sales for Promotion B.
2.13.2 Additional Sum of Squares Test and ANOVA
H0 : β1 = β2 = … = βp = 0
This is just another example of an additional sum of squares test for which the reduced model is
Y = β_0 + ϵ,  ϵ ∼ N(0, σ²)
To see this, consider the least squares estimate of β0 under the reduced model:
S(β_0) = ∑ (y_i − β_0)²

∂S/∂β_0 = −2 ∑ (y_i − β̂_0) = 0
⇒ −n β̂_0 + ∑ y_i = 0
⇒ β̂_0 = ∑ y_i / n = ȳ
This result is consistent with our intuitive understanding of the fitted model. With no explanatory variables in the model, μ̂ = β̂_0 = ȳ. (The sample mean, ȳ, is the least squares estimate of μ.)
Note that the residual sum of squares for the reduced model is equivalent to the total sum of squares,
since
SS(Res)_red = ∑ e_i²
= ∑ (y_i − μ̂_i)²
= ∑ (y_i − β̂_0)²
= ∑ (y_i − ȳ)²
= SS(Tot)
This should also be intuitively obvious, since, with no explanatory variables in the model, no variation is explained by the model, and SS(Reg)_red = 0.
Then the additional sum of squares test statistic reduces to:
F = [(SS(Res)_red − SS(Res)_full)/p] / MS(Res)_full
= [SS(Reg)_full/p] / MS(Res)_full
= MS(Reg)/MS(Res)
as defined previously.
2.13.3 Additional Sum of Squares Test for Individual Parameters
Test of H0 : βj = 0 Revisited
After accounting for size, col, and clients, is age related to overhead?
An equivalent test of H0 : βj = 0 that yields an identical p-value can be obtained with the additional
sum of squares test statistic
F = [(SS(Res)_red − SS(Res)_full)/(df_red − df_full)] / [SS(Res)_full/df_full] = (SS(Res)_red − SS(Res)_full)/MS(Res)_full

Note that the degrees of freedom of the numerator, df_red − df_full, is one, since there is only one restriction imposed on the model by the null hypothesis. That is, the full model has only one more parameter than the reduced model.
To illustrate, we can perform an additional sum of squares test on the full and reduced audit models
associated with H0 : β2 = 0 :
> anova(audit_minus_age.lm,audit.lm)
1 20 4037157770
Note the equivalent p-value (0.4261) associated with the F and t test statistics.

Note also the relationship between the values of F and t. We see that F = 0.661 = 0.813² = t².
2.13.4 The General Linear Hypothesis
These hypotheses all test linear combinations of the model parameters. As such, they can all be
expressed in the form of the general linear hypothesis:
H_0 : Aβ = 0

where A is an ℓ × (p + 1) matrix that imposes the ℓ linear constraints on the full model as described by H_0.

Result: Consider the ('full') normal model given by Y = Xβ + ϵ, ϵ ∼ N(0, σ²I), and the corresponding ('reduced') model associated with the set of linear hypotheses of the form H_0: Aβ = 0. Under H_0,

F = [(SS(Res)_red − SS(Res)_full)/ℓ] / [SS(Res)_full/(n − (p + 1))] ∼ F_{ℓ, n−(p+1)}

That is, the additional sum of squares test statistic can be used to test any set of linear hypotheses of the form H_0: Aβ = 0.
H_0: β_1 = β_2 = ⋯ = β_4 = 0 can be expressed in the form

       [ 0 1 0 0 0 ] [ β_0 ]   [ 0 ]
H_0 :  [ 0 0 1 0 0 ] [ β_1 ] = [ 0 ]
       [ 0 0 0 1 0 ] [ β_2 ]   [ 0 ]
       [ 0 0 0 0 1 ] [ β_3 ]   [ 0 ]
                     [ β_4 ]
             A          β    =   0
Similarly, for H0 : β1 = β2 = 0 :
A = [ 0 1 0 0 0
      0 0 1 0 0 ]
H0 : β1 − β2 = 0 :
A = [0 1 −1 0 0]
and H0 : β2 = 0 :
A = [0 0 1 0 0]
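These constraint matrices can be checked numerically: Aβ = 0 holds exactly when the hypothesized restrictions hold. A Python sketch with hypothetical parameter values (not the fitted audit estimates):

```python
import numpy as np

# H0: beta1 - beta2 = 0 in a 5-parameter model, written as A beta = 0.
# The beta vectors below are hypothetical values for illustration.
A = np.array([[0, 1, -1, 0, 0]], dtype=float)      # one constraint => l = 1

beta_null = np.array([5.0, 3.0, 3.0, 1.0, -2.0])   # satisfies beta1 = beta2
beta_alt = np.array([5.0, 8.0, 3.0, 1.0, -2.0])    # violates the restriction

print(np.allclose(A @ beta_null, 0))   # True: H0 holds
print(np.allclose(A @ beta_alt, 0))    # False: H0 violated
```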
2.14 Assessing Model Adequacy (Residual Analysis)
Recall the assumptions of the linear model, among them:

- the functional form of the relationship between the response and the explanatory variables is correctly specified by the deterministic component of the model. For the linear model, this relationship is described by μ = Xβ.
- the errors have a constant variance, denoted by σ² (this property is sometimes referred to as homoskedasticity)
Distribution of the parameter estimators and subsequent methods of inference were derived based on
these assumptions. If the assumptions do not hold, the model is inadequate, and any conclusions
drawn from the fit of the model are inaccurate and meaningless.
We can assess model adequacy (i.e. the validity of the model assumptions) through examination and analysis of the fitted residuals, e = y − μ̂, to ensure that the behaviour of these residuals is consistent with these assumptions.
Residual Plots
There are many different types of residual plots that may be used to assess model adequacy. We will
introduce only two of the more common ones here.
Plot of the residuals, e_i, vs the fitted values, μ̂_i

The most common and useful diagnostic plot for assessing model assumptions is a plot of e_i vs μ̂_i. Thus, if the model is adequate, we would expect to see no observable pattern in a plot of e_i vs μ̂_i. The plot should only exhibit random scatter, as illustrated in the plots below from the fit of a SLR model.
If the assumptions do not hold, we would expect to see a pattern or relationship consistent with the
assumption that has been violated. Two examples are provided on the following slide.
In the figure below, a SLR model is fit to a relationship that includes a quadratic component. Thus, the functional form of the model is not correctly specified (μ ≠ β_0 + β_1 x), resulting in an observed quadratic relationship in the plot of the residuals vs the fitted values.
Another common violation of the model assumptions is a non-constant variance of the errors.
Often, for example, the variance of the errors may increase as the mean of the response increases,
thus violating the assumption of V ar(ϵ) = σ 2 , as seen in the example below.
QQ plots
QQ (Quantile-Quantile) plots are used to assess the assumption of normal errors. (This assumption is
more critical for very small datasets than for relatively large ones. Why?)
In normal QQ plots, the ordered residuals ('Sample Quantiles' in R plot), e (i) , are plotted vs the
expected ordered values ('Theoretical Quantiles'), E(Z(i) ) , where Zi ∼ N (0, 1) .
If the residuals are from a normal distribution, then e(i) should be proportional to E(Z(i) ). Thus a
straight line relationship is an indication that the assumption of normal errors has been well met.
The QQ plots below provide an example of a fitted model that meets the assumption of normal errors
(left) and a model that does not (right).
(Note: we have discussed the use of residual plots to assess the assumptions of correct functional
form, constant variance of the errors, and normality of the errors. We will consider plots that address
the assumption of independence of the errors when we discuss time series data)
Transformations of the response, such as √y, log(y), and y⁻¹ (reciprocal transformation), are commonly used to address violations of the model assumptions.
2.14.1 Residual Analysis - audit model
The large R-squared value and relatively small p-values associated with most of the variables suggest
that we have a very good fit, providing the model assumptions are valid.
There is nothing in the output that provides any information on the validity of the assumptions.
Whenever we fit a model, we must always perform a residual analysis to ensure that the assumptions
have been met and the model is adequate. If the model is not adequate, any conclusions we draw
from the fit of the model are meaningless.
We will begin by examining a plot of the residuals vs the fitted values.
> plot(fitted(audit.lm),residuals(audit.lm),pch=19,xlab='fitted',ylab='residuals')
The presence of an obvious observable pattern indicates the model is not adequate. We see that
taking a square root transformation of the response helps to address problems with the model:
Note that by taking an appropriate transformation to address issues with the model assumptions, we
have also arrived at a better fitting model, as evidenced by the resulting increase in the R-squared
value and decrease in p-values seen in the output below:
> audit.sqrt.lm=lm(sqrt(overhead)~size+age+col+clients)
> summary(audit.sqrt.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
---
Notes:

Remember that the response is now in units of √dollars. When taking a transformation of the response, we need to back-transform estimated mean values, μ̂, and associated confidence and prediction intervals to the original units.

A log transformation was also applied, but did not adequately address problems with model assumptions.
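A sketch of the back-transformation in Python (the sqrt-scale numbers below are hypothetical, not the audit output): because the square root is monotone, interval endpoints can simply be squared.

```python
# Back-transforming a sqrt-scale fit and interval to dollars: square the
# endpoints (valid because the transformation is monotone).
# The numbers below are hypothetical, not the audit output.
fit_sqrt, lwr_sqrt, upr_sqrt = 320.0, 305.0, 335.0   # sqrt-dollar scale

fit, lwr, upr = fit_sqrt ** 2, lwr_sqrt ** 2, upr_sqrt ** 2
print(fit, lwr, upr)   # 102400.0 93025.0 112225.0
```

One caveat: squaring a fitted mean on the √y scale gives an estimate of (E√y)², which is not exactly the mean of y, so back-transformed point estimates of the mean carry some bias.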
2.14.2 Properties of the Residuals
e = Y − μ̂
= Y − HY = (I − H)Y
= (I − H)(Xβ + ϵ)
= Xβ − X(X^T X)^{-1} X^T Xβ + ϵ − Hϵ
= Xβ − Xβ + ϵ − Hϵ
= (I − H)ϵ

Note that we cannot recover the errors from the residuals by writing ϵ = (I − H)^{-1} e: for (I − H) to be invertible, it must be of full rank, with rank(I − H) = n. This is not the case, since rank(I − H) = n − (p + 1).
Distribution of e

Now that we have established e = (I − H)ϵ, where ϵ ∼ N(0, σ²I), we can easily derive the distribution of e:

e ∼ Normal (since ϵ ∼ Normal)

E(e) = (I − H)E(ϵ) = 0

Var(e) = (I − H)Var(ϵ)(I − H)^T
= σ²(I − H)(I − H)^T
= σ²(I − H)    (H symmetric, idempotent)

Thus e ∼ N(0, σ²(I − H)) → e_i ∼ N(0, σ²(1 − h_ii))

Note in particular:

Var(e_i) = σ²(1 − h_ii)    (Residuals have non-constant variance)

Cov(e_j, e_k) = −σ²h_jk, j ≠ k    (Residuals are not independent. This is a consequence of the constraint, ∑ e_i = 0, placed on the residuals in least squares estimation)
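The matrix facts used in this derivation are easy to verify numerically; a Python sketch with a small SLR design matrix:

```python
import numpy as np

# Verify the properties used above: H is symmetric and idempotent, so
# Var(e) = sigma^2 (I - H) and Var(e_i) = sigma^2 (1 - h_ii). Small SLR design.
n = 8
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

print(np.allclose(H, H.T))       # symmetric
print(np.allclose(H @ H, H))     # idempotent
print(np.allclose(M @ M.T, M))   # so (I-H) Var(eps) (I-H)^T = sigma^2 (I-H)
print(np.allclose(np.diag(M), 1 - np.diag(H)))   # Var(e_i) = sigma^2 (1 - h_ii)
```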
2.14.3 Studentized Residuals
Recall that we standardize a random variable by subtracting the mean and dividing by the standard deviation. For example, in the case of a normal random variable, X:

X ∼ N(μ, σ²) → Z = (X − μ)/σ ∼ N(0, 1)

Similarly, we studentize a random variable by subtracting the mean and dividing by the estimate of the standard deviation.

In a previous statistics course we studentized the sample mean of a normal random variable:

X̄ ∼ N(μ, σ²/n) → t = (X̄ − μ)/(σ̂/√n)

Likewise, for the parameter estimators:

β̂_j ∼ N(β_j, σ²(X^T X)^{-1}_{jj}) → t = (β̂_j − β_j)/(σ̂ √((X^T X)^{-1}_{jj})) = (β̂_j − β_j)/SE(β̂_j)

Note that in both cases, the resulting studentized random variable follows a t distribution.
In the same way, we can studentize the residuals from the fit of a normal regression model, for which we have previously shown that e_i ∼ N(0, σ²(1 − h_ii)):

d_i = e_i / (σ̂ √(1 − h_ii))

where the distribution of d_i can be reasonably approximated by a N(0, 1) distribution* for large n.

*Because σ̂ is not independent of e_i, d_i does not have an exact t distribution. The t distribution will, however, be a reasonable approximation to the distribution of d_i for large n. This implies that d_i is approximately N(0, 1) for large n.

Note that by studentizing the residuals, we have ensured that the resulting residuals, d_i, will have a constant (estimated) variance = 1. For this reason, the studentized residuals, d_i, are often used instead of the fitted residuals, e_i, in residual plots.
2.14.4 Extreme Values in the Response ('Outliers')
Equivalently, we can define an outlier in the response as any observation for which the residual, e_i = y_i − μ̂_i, is extreme relative to the other residuals.
Recall the plot of the residuals vs the fitted values for the audit model, fit with the square root
transformation of the response:
Note that the transformation of the response reveals an outlier in the residuals, associated with an
office with an unusually high overhead relative to the mean overhead estimated from the model.
We can obtain additional perspective on how extreme the outlier is by plotting the studentized
residuals, di , in place of the fitted residuals, ei.
Note that there is little qualitative difference between the plot of the studentized residuals and that of the fitted residuals. However, we can use our understanding of the distribution of d_i to determine how often an outlier this extreme would occur due to random variation.
Recall that, for large n, the distribution of d_i can be reasonably approximated by a N(0, 1) distribution. Based on our understanding of normal probability theory, we know that approximately 99% of all observations will fall within ±2.5 standard deviations of the mean. Anything within this range is acceptable variation.

Thus, as a general rule of thumb, an observation may be considered an outlier in the response if |d_i| > 2.5.
random variability
Deciding how to deal with outliers will depend on the cause, and should be dealt with on a case-by-
case basis. It is never a good idea to remove an observation deemed to be an outlier from the fit of
the model without further investigation.
2.14.5 Leverage and Influential Observations
Leverage
Leverage is a measure used to identify those observations whose set of explanatory
variables is extreme relative to the sets of explanatory variables of the other observations.
The leverage of the ith observation in a dataset is defined as the ith diagonal element of the hat matrix, denoted by h_ii. It is a function of the distance between the point (x_i1, x_i2, . . . , x_ip) and the centroid, (x̄_1, x̄_2, . . . , x̄_p), of the sets of explanatory variables of the dataset.
To see how we can use leverage to identify extreme values in the sets of explanatory variables, we
consider the SLR case. It can be shown that leverage can be expressed as:
h_ii = 1/n + (x_i − x̄)² / ∑(x_i − x̄)²

(You can try to show this as an exercise)

Note that the more extreme the value of the explanatory variable, x_i, relative to the mean, x̄, the larger the leverage.
Recall also that μ̂ = Hy ⇒ μ̂_i = h_ii y_i + ∑_{j≠i} h_ij y_j. The leverage, h_ii, can therefore be thought of as the weight of the contribution of y_i to the fitted value, μ̂_i. The larger the leverage relative to the other weights, h_ij, the greater the contribution of y_i to its own fitted value.
Finally, note that, as H = X(X^T X)^{-1} X^T, h_ii is a function only of the explanatory variables, not of the response.

As a rule of thumb, an observation is considered to have high leverage if its leverage exceeds twice the average leverage, i.e. if

h_ii > 2h̄ = 2(p + 1)/n
A plot of the hatvalues for the audit model (with square root transformation of y) is shown below.

> plot(hatvalues(audit.sqrt.lm),cex.lab=1.3,cex.axis=1.3,cex=1.3,pch=19)

We see that there are no high leverage observations in the dataset. Note that all values are less than 2h̄ = 2(5)/24 = .417.
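The SLR leverage formula can be checked numerically; a Python sketch with synthetic x values (not the audit data), including one point far from the mean:

```python
import numpy as np

# Check the SLR leverage formula: diag(H) equals 1/n + (x_i - x_bar)^2 / Sxx,
# and the most extreme x gets the highest leverage.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])   # last point far from the mean
n = len(x)
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T

Sxx = np.sum((x - x.mean()) ** 2)
h_formula = 1 / n + (x - x.mean()) ** 2 / Sxx

print(np.allclose(np.diag(H), h_formula))   # True
print(int(np.argmax(np.diag(H))))           # 4: the extreme x has highest leverage
```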
Influential Observations
An observation is considered influential if its removal from the fit of the line changes the fitted line
(i.e. changes the parameter estimates) considerably.
Only high leverage cases have the potential to be influential. Whereas leverage depends only on the
explanatory variables, the influence of an observation also depends on the value of the response, as
illustrated below.
In the above plots, the fitted regression line with the leverage point included is given by the solid black line, and the line fit with the leverage point omitted is given by the dotted red line.
Note that in the plot on the left, the removal of the high leverage point does not dramatically alter the
fitted line, whereas in the plot on the right, omission of the leverage point alters the fitted line
considerably.
Thus the high leverage observation seen in both plots is not influential in the scenario on the left, but
is influential in the scenario on the right.
One way to measure the influence of the ith observation is to compare the fitted values from the model fit to all observations, μ̂, with the fitted values from the model fit with the ith observation omitted, μ̂_(i).

Cook's distance is one common measure of influence that is a function of this distance, defined as

D_i = (μ̂ − μ̂_(i))^T (μ̂ − μ̂_(i)) / (σ̂²(p + 1))

where σ̂² is the estimate of the variance from the model fit with the ith observation included.
This would seem to imply that to measure the influence for each observation, we need to fit the model both with and without the ith observation for all i. However, it can be shown that Cook's distance can be expressed in the form

D_i = [h_ii / (1 − h_ii)] · [d_i² / (p + 1)]

which can be calculated from the fit of the model with all observations included.

Note that to be influential, an observation must have both a relatively high leverage, h_ii, and a large (absolute) studentized residual, d_i.
[1] 0.2202189
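The equivalence of the leave-one-out definition and the single-fit shortcut can be verified numerically; a Python sketch on synthetic data (not the audit fit):

```python
import numpy as np

# Check that the leave-one-out definition of Cook's distance matches the
# single-fit shortcut D_i = h_ii/(1 - h_ii) * d_i^2/(p+1).
rng = np.random.default_rng(3)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
mu_hat = H @ y
e = y - mu_hat
sigma2 = e @ e / (n - (p + 1))
h = np.diag(H)
d = e / np.sqrt(sigma2 * (1 - h))          # studentized residuals

D_short = h / (1 - h) * d ** 2 / (p + 1)   # shortcut, one fit only

D_direct = np.empty(n)
for i in range(n):                          # direct definition: refit without obs i
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D_direct[i] = np.sum((mu_hat - X @ beta_i) ** 2) / (sigma2 * (p + 1))

print(np.allclose(D_short, D_direct))   # True
```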
2.15 Model Selection - Introduction
A fit of the linear model to all the variables in the dataset (with housing price as the response) is given
below:
The output suggests a reasonably well-fitting model, with 67% of the variation in selling price
accounted for by the eight variables, and several strong relationships between the selling price and
variables size, age, lot size, and presence/absence of a garage.
Subsequent residual analysis suggests that the model is adequate, and reveals no problems with
outliers or the model assumptions (we say the residuals are 'well-behaved').
Further analysis revealed no high leverage or influential observations.
Note, however, the large p-values associated with some of the variables (e.g. # of bathrooms, # of
rooms, presence/absence of a basement).
This suggests that the model might be improved by including only a subset of the eight variables,
since increasing the degrees of freedom without significantly increasing the variation unexplained by
the model will result in a lower residual standard error and, consequently, more precise parameter
estimates and estimated mean values.
To understand why removing certain variables might yield a 'better' model, consider the expression for
the residual standard error, given by
$$\hat\sigma = \sqrt{\frac{SS(Res)}{n - (p+1)}}$$
Note that removing one or more variables from the model will increase both the SS(Res) and the
degrees of freedom.
If the increase in SS(Res) resulting from the removal of variables is small relative to the degrees of
freedom gained, as is often the case for variables associated with large p-values, then σ̂ will
decrease, resulting in smaller standard errors and a more precise model.
If, however, the increase in degrees of freedom obtained from removing one or more variables is not
sufficient to counterbalance the associated increase in SS(Res), then σ̂ will increase, and we will
obtain a less precise model.

2.15.1 Model Selection Methods

1. Backward elimination
Fit the model containing all p variables
Remove the variable with the largest p-value that is greater than some predetermined threshold
value, α (e.g. α = .10 )
Continue removing one variable at each iteration of the above steps until no more variables can
be removed (all p-values < α)
2. Forward selection
Fit all p single-variable (i.e. SLR) models
Fit the p−1 two-variable models that include the variable selected in the previous step
Continue adding one variable at each iteration, including all variables selected in the previous
step, until no more variables can be added (all p-values > α )
3. Stepwise selection
Begin with forward selection, and employ both forward selection and backward elimination
at each step until no more variables can be added or removed
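The backward-elimination loop above can be sketched in a few lines of Python. This is illustrative only, not the course's R code: the helper `fit_pvalues` and every p-value here are hypothetical stand-ins, and a real implementation would refit the model and recompute the p-values at each step.

```python
ALPHA = 0.10   # predetermined removal threshold

# Hypothetical p-values for each variable (held fixed here for simplicity;
# in practice these change every time the model is refit).
PVALS = {'size': 0.001, 'stories': 0.08, 'age': 0.002,
         'lotsize': 0.01, 'garage': 0.03, 'basement': 0.74,
         'rooms': 0.55, 'baths': 0.31}

def fit_pvalues(variables):
    """Stand-in for refitting the model with `variables` and extracting p-values."""
    return {v: PVALS[v] for v in variables}

model = set(PVALS)
while True:
    pvals = fit_pvalues(model)
    worst = max(pvals, key=pvals.get)   # variable with the largest p-value
    if pvals[worst] <= ALPHA:           # all p-values below threshold: stop
        break
    model.remove(worst)                 # drop the worst variable and refit

print(sorted(model))   # → ['age', 'garage', 'lotsize', 'size', 'stories']
```

With these made-up p-values, basement, rooms, and baths are removed in turn, mirroring the house-data narrative later in this lesson.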
For the house model with p = 8 variables, for example, there are 2⁸ − 1 = 255 potential models.
Selection of reasonable models from all potential models is based on some measure of fit that takes
into account both the SS(Res) and the number of variables. Two such measures are:
Adjusted R-squared
Mallows' Cp
Adjusted R-squared
Recall the coefficient of determination, given by
$$R^2 = 1 - \frac{SS(Res)}{SS(Tot)}$$
Note that with the addition of more variables, SS(Res) will always decrease, and subsequently, R2
will always increase when variables are added regardless of whether the variables account for a
significant amount of the variation in the response. For this reason, we cannot use R2 as a relative
measure of fit when comparing model subsets with different numbers of parameters.
Instead, we can use the adjusted R-squared, given by
$$R^2_{adj} = 1 - \frac{SS(Res)/(n-(p+1))}{SS(Tot)/(n-1)}$$
Since R²_adj takes into account the number of variables in the model, it will only increase if the variation
accounted for by the added variable(s) increases proportionally more than the degrees of freedom
decrease through the estimation of the additional parameters.
Note that since we can express
$$R^2_{adj} = 1 - \frac{SS(Res)/(n-(p+1))}{SS(Tot)/(n-1)} = 1 - \frac{\hat\sigma^2}{SS(Tot)/(n-1)}$$
as a function of the residual standard error, model selection based on a large R²_adj is equivalent to
selection based on a low residual standard error, σ̂.
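A quick numeric illustration of the contrast between the two measures (made-up numbers, not the house data): adding a variable that barely reduces SS(Res) raises R² but lowers R²_adj.

```python
# Made-up numbers: n = 27 observations, SS(Tot) = 1000. Adding a 4th variable
# to a 3-variable model reduces SS(Res) only from 300 to 296.

def r2(ss_res, ss_tot):
    return 1 - ss_res / ss_tot

def r2_adj(ss_res, ss_tot, n, p):
    return 1 - (ss_res / (n - (p + 1))) / (ss_tot / (n - 1))

n, ss_tot = 27, 1000.0

# R^2 rises (0.700 -> 0.704) while R^2_adj falls (0.6609 -> 0.6502):
print(r2(300.0, ss_tot), r2_adj(300.0, ss_tot, n, 3))   # 3-variable model
print(r2(296.0, ss_tot), r2_adj(296.0, ss_tot, n, 4))   # 4-variable model
```

The extra parameter costs a degree of freedom that the tiny drop in SS(Res) does not repay, so σ̂ (and hence R²_adj) gets worse.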
Mallows' Cp
For a k-variable model (k = 1, 2, …, p), Mallows' Cp is defined as
$$C_p = \frac{SS(Res)_k}{MS(Res)_p} + 2(k+1) - n$$
where SS(Res)_k is the residual sum of squares of the k-variable model and MS(Res)_p is the residual
mean square of the full p-variable model.
Intuitively, the smaller the SS(Res) for a given k , the better the model. Thus, smaller Cp values
relative to the number of variables are associated with more suitable models.
Mallows' Cp is used to compare a k-variable model (k < p) with the full model (for which k = p).
A k-variable model is preferred over the full model if Cp ≤ k + 1.
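As a numeric sketch of the criterion (all quantities made up, not taken from the house data):

```python
# Mallows' Cp and the rule Cp <= k + 1, on hypothetical values.
def mallows_cp(ss_res_k, ms_res_full, k, n):
    """SS(Res)_k from the k-variable model; MS(Res) from the full model."""
    return ss_res_k / ms_res_full + 2 * (k + 1) - n

n = 26
ms_res_full = 10.0   # hypothetical MS(Res) of the full (p = 8) model

# hypothetical SS(Res) for a good 5-variable subset and a poor 2-variable one
cp_good = mallows_cp(ss_res_k=172.0, ms_res_full=ms_res_full, k=5, n=n)
cp_poor = mallows_cp(ss_res_k=400.0, ms_res_full=ms_res_full, k=2, n=n)

assert cp_good <= 5 + 1   # 3.2: preferred over the full model
assert cp_poor > 2 + 1    # 20.0: not preferred
```

A subset whose SS(Res) is close to that of the full model earns a small Cp; a subset that discards useful variables inflates SS(Res) and pushes Cp well above k + 1.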
2.15.2 Model Selection - House data
Note that basement would be the first variable to be removed in backward selection.
We will select an appropriate model from all possible subsets based on the R²_adj and Cp criteria.
> leaps(house[,-9],value,method=c('adjr'),nbest=2,names=names(house[-9]))
We see that the preferred model based on R2adj is the model fit with size, stories, age, lotsize, and
garage.
> leaps(house[,-9],value,method=c('Cp'),nbest=1,names=names(house[-9]))
We see that there are several models that meet the criterion of Cp ≤ k + 1.
We will select the model with the variables size, stories, age, lotsize, and garage. This is consistent
with the model selected using R²_adj.
The fits of the full model and selected model are given on the following slide. Note the following:
The higher R² value for the full model, due to having more parameters
The higher R²_adj (and lower σ̂) for the selected model
The p-value > .05 associated with stories in the selected model. Retaining variables with
associated p-values > .05 is common in model selection procedures.
Finally, residual analysis on our selected model indicates conformity with the model assumptions.
We have arrived at an appropriate and well-fitted model that adequately describes the relationship
between the value of a house and its attributes.
2.16 Interaction
Consider the house data.
Does the effect of having a garage on the value of a house depend on the age of the house?
It may be, for example, that having a garage contributes to the value of a house more markedly
for older houses than it does for newer houses, or vice versa.
When the effect of a variable, xj , on the response depends on the value of another variable, xk , we
say there is interaction between variables xj , xk .
We can account for a possible interaction effect by including the term xj xk in our model.
For example, to address the question above, we include the interaction term in our model:
Y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + β6 (x3 ∗ x5 ) + ϵ
where x3 represents the age and x5 represents whether the house has a garage.
To see how this effectively addresses interaction, note that we can rewrite our model as
Y = β0 + β1 x1 + β2 x2 + (β3 + β6 x5 )x3 + β4 x4 + β5 x5 + ϵ
where β3 is the effect of age on house price if the house has no garage (i.e. when x5 = 0 ), and
β3 + β6 is the effect of age on house price if the house has a garage.
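As a numeric sketch of this rewriting (using the per-year effects quoted below for this fitted model, about −$523 per year without a garage and −$311 with one, which together imply β̂6 ≈ 212):

```python
# How the interaction term changes the age slope.
# beta6_hat = 212 is inferred from the two quoted slopes: -523 + 212 = -311.
beta3_hat = -523.0   # effect of one year of age when there is no garage (x5 = 0)
beta6_hat = 212.0    # interaction coefficient (assumed, from the two slopes)

def age_slope(x5):
    """Effect of one extra year of age on mean value, given garage indicator x5."""
    return beta3_hat + beta6_hat * x5

assert age_slope(0) == -523.0   # no garage
assert age_slope(1) == -311.0   # garage
```

The age slope is no longer a single number: it is a function of the garage indicator, which is exactly what "interaction" means.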
Including an age*garage interaction term into the model selected in the previous lesson yields the
following output:
> house.int.lm=lm(value~size+stories+age+lotsize+garage+age*garage)
> summary(house.int.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
The estimates β̂3 and β̂6 tell us that, whereas the mean value of a house decreases by an estimated
$523 for each year of age for houses with no garage, it decreases by only an estimated $311 for each
year of age for houses with a garage (after accounting for the other variables).
Note that, although we would not reject H0 : β6 = 0 (p-value > .05), based on model selection
methods we would keep the interaction term in the model, since it leads to a decrease in σ̂ (i.e. a
more precise model).
2.17 Fitting Linear Models to Time Series Data - Introduction
We will investigate the use of linear models to account for variation in the time series due to seasonal
and/or trend components (we may also use linear models to attempt to account for variation due to
cyclical components, but we will not do so here).
Autocorrelation Function
Before we attempt to model the components present in time series data, we must first investigate and
quantify the nature of the autocorrelation between yt and yt−k for any lag, k .
We do so with the (sample) autocorrelation function (acf), defined as
$$r_k = \frac{\sum_{t=k+1}^{n} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{n} (y_t - \bar{y})^2}$$
Note that r_k is unitless.
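The definition translates directly into code. Here is a Python sketch (the course itself uses R, where `acf()` computes the same quantity):

```python
# Direct implementation of the sample autocorrelation function r_k.
def acf(y, k):
    n = len(y)
    ybar = sum(y) / n
    num = sum((y[t] - ybar) * (y[t - k] - ybar) for t in range(k, n))
    den = sum((yt - ybar) ** 2 for yt in y)
    return num / den

# A series that alternates every observation has strong negative lag-1 and
# strong positive lag-2 autocorrelation:
y = [1.0, 5.0] * 6
print(acf(y, 0), round(acf(y, 1), 3), round(acf(y, 2), 3))   # → 1.0 -0.917 0.833
```

The pattern of alternating signs across lags is exactly what the correlograms of seasonal series in the next pages display, just with period 12 instead of period 2.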
Not surprisingly, the time series plot suggests a strong seasonal pattern recurring every 12 time units
(months).
A correlogram of the series will provide us with a much better understanding of the autocorrelation
structure so that we can attempt to model the components of variation present in the time series.
As seen in the time series plot, there is a strong positive autocorrelation at lag k = 12. As well, we see
significant* negative autocorrelations at lags 6 and 18, which supports the intuitive notion of a
difference in wine sales between the winter and summer months. We also see a significant positive
autocorrelation at lag 1, mainly due to the presence of an increasing trend in wine sales over time.
(Note: the value r0 = 1 is irrelevant and should be ignored, as it provides no useful information about
the autocorrelation of the time series)
*The significance lines on the acf plot provide an (approximate) visual test of the hypotheses H0 : ρk = 0,
where the parameter ρk is the process autocorrelation at lag k, estimated by the sample acf, rk.
The lines correspond to ± 2 standard errors, where SE(rk) ≈ √(1/n). Values of rk outside these lines
suggest significant autocorrelation (ρk ≠ 0).
For the wine sales acf plot, these lines correspond to ±2/√187 = ±0.146
2.17.1 Modelling Seasonal and Trend Components with Linear Models
where
$$x_{t1} = \begin{cases}1 & \text{if month } t \text{ is Jan}\\0 & \text{otherwise}\end{cases} \qquad x_{t2} = \begin{cases}1 & \text{if Feb}\\0 & \text{otherwise}\end{cases} \qquad \ldots \qquad x_{t11} = \begin{cases}1 & \text{if Nov}\\0 & \text{otherwise}\end{cases}$$
and β̂0 is the estimate of the mean sales for December. Be sure you can verify this with your
understanding of the fitted model.
> month=rep(c('Jan','Feb','Mar','Apr',...,'Oct','Nov','Dec'),length=187)
#creates a categorical variable giving the month for each observation
> month=factor(month,levels=c('Jan','Feb','Mar','Apr','May', ...'Oct','Nov','Dec'))
#establishes the order of the factor levels (overriding alpha-numeric ordering)
> month=relevel(month,'Dec') #assigns December as the reference month
> wine.seas.lm=lm(Sales~month)
We see that, conditional on model assumptions being met, we have a reasonably well fitting model
with over 60% of the variation in wine sales accounted for by month of the year.
Based on the negative values for all β̂j, j = 1, 2, …, 11, and the associated p-values, we can conclude
that December has significantly higher mean wine sales than any other month
(μ̂_Dec = β̂0 = 4,536,200 litres).
We suspect Nov. has significantly higher wine sales than all months other than Dec., but we would
have to confirm that with an appropriate test.
Consider now the time series of the residuals of this model, et, t = 1, 2, … , 187 .
By accounting for the seasonal component, the residual time series is effectively the time series of
seasonally adjusted wine sales, given in the time series plot below:
Note that by removing the seasonal variation, the trend component becomes much more evident.
Before we attempt to model the trend, let us consider the acf of the residuals, given by
$$r_k = \frac{\sum_{t=k+1}^{n} e_t e_{t-k}}{\sum_{t=1}^{n} e_t^2}$$
(confirm that this is consistent with the definition of the acf provided in an earlier lesson)
Under the model assumption of independent errors, we would expect the acf plot of the residuals to
reveal no significant autocorrelation at any lag.
However, due to the trend, the products e_t e_{t−1} will be predominantly positive, as will
e_t e_{t−2}, e_t e_{t−3}, …, resulting in persistently large values of r1, r2, r3, ….
This characteristic of the autocorrelation in time series that exhibit a strong trend is seen in the acf
plot of the residuals for the wine sales model:
We can account for the trend in our model through the addition of appropriate time variables - in this
case, with the linear and quadratic terms t and t², respectively, resulting in the model
$$y_t = \underbrace{\beta_0 + \beta_1 x_{t1} + \ldots + \beta_{11} x_{t11}}_{\text{seasonal component}} + \underbrace{\beta_{12}\, t + \beta_{13}\, t^2}_{\text{trend component}} + \epsilon_t, \qquad \epsilon_t \sim N(0, \sigma^2) \text{ independent}$$
Note that the trend terms dramatically increase the variation in wine sales explained by the
model, as is reflected by both the much higher R²_adj (0.781 compared to 0.5914) and the small p-value
associated with the quadratic term.
(Note: we always retain all lower order terms in a model when the higher order terms are included.
This holds for both polynomial terms (e.g. quadratic, cubic, ...) as well as for interaction terms)
The time series plot of the residuals (below) suggests that we have successfully captured the trend.
The correlogram of the residuals (below right) indicates our assumption of independent errors is
reasonably met. We see that we have accounted for virtually all the autocorrelation in the response
(left), by including seasonal and trend components in a linear model.
Further analysis (e.g. a plot of residuals vs fitted values, not shown here) indicates that the rest of the
model assumptions are reasonably well met. We now have an adequate and well-fitting model that we can
use to forecast future wine sales, where the forecasted value at time t is described by:
$$\hat{y}_t = \hat\beta_0 + \hat\beta_1 x_{t1} + \hat\beta_2 x_{t2} + \ldots + \hat\beta_{11} x_{t11} + \hat\beta_{12}\, t + \hat\beta_{13}\, t^2$$
For example, the forecasted wine sales for the following month (Aug/95) is
$$\hat{y}_{188} = \hat\beta_0 + \hat\beta_8 + \hat\beta_{12}(188) + \hat\beta_{13}(188^2) = 4{,}369{,}053 \text{ litres}$$
Recall that the seasonal and trend components are not the only components of variation in a time
series. Often, there is autocorrelation structure remaining in the time series after accounting for
seasonal and trend components with a linear model, rendering the assumption of independence of the
errors invalid.
We will see an example of this in an upcoming lesson.
2.17.2 Modelling Seasonal and Trend components - Kenora temperature time series
Whereas the time series plot provides limited information on the main characteristics and
autocorrelation of the time series, the correlogram clearly illustrates the autocorrelation structure that
we would anticipate* in a monthly temperature series.
(*consider the numerator of rk, given by Σ (yt − ȳ)(yt−k − ȳ). We know that yt − ȳ will be negative
for t associated with the winter months (e.g., Nov, Dec, Jan, Feb) and positive for the summer
months, resulting in mostly negative values of (yt − ȳ)(yt−k − ȳ) for k = 4, 5, 6, 7, 8, and mostly
positive values for k = 1, 2, 10, 11, 12)
Fitting a seasonal component as we did for the wine sales data yields the following fitted model
Note the extremely high R2 value, with almost 96% of the variation in mean daily maximum
temperature accounted for by month (why is this understandable?)
We see also that, based on the estimates and associated p-values, January was the coldest month,
with a mean monthly (daily max.) temperature of −12.346 °C.
A time series plot of the residuals suggests the possibility of a linear trend:
The correlogram of the residuals also indicates that the autocorrelation has not been wholly accounted
for by the seasonal component:
The presence of a positive linear trend is confirmed by the fit of the model:
Note that the positive trend parameter estimate and its associated p-value are consistent with the
evidence of a trend seen in the residual plots.
We can say that, after adjusting for season, the mean daily maximum temperature in Kenora has
increased over the period of the data.
On a more relevant scale: over the past 50 years, the mean daily maximum temperature in Kenora has
increased ...
'The global annual temperature has increased at an average rate of 0.07°C per decade since 1880 ...'
Now that we appear to have an extremely well-fitted model, we can perform a residual analysis to
assess model adequacy.
The plot of the (studentized) residuals vs the fitted values is shown below:
2.17.3 Durbin Watson test for lag 1 autocorrelation in linear model residuals
We see that after accounting for the seasonal and trend components, there still appears to be
significant autocorrelation, at lag 1 in particular, not accounted for by our model.
This is often the case after modelling the seasonal and/or trend components in time series data. (Why
might this be the case with temperature data?)
The Durbin-Watson test statistic for lag 1 autocorrelation in the residuals is defined as
$$DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$
Intuitively, we would expect positive lag 1 autocorrelation to be associated with relatively small
differences in successive residuals, given by et − et−1 , and would therefore expect small values of DW
to provide evidence for ρ1 > 0 .
We can confirm this by rewriting the test statistic as
$$DW = \frac{\sum e_t^2 + \sum e_{t-1}^2 - 2\sum e_t e_{t-1}}{\sum e_t^2} \approx 2 - 2r_1 = 2(1 - r_1)$$
Expressing the test statistic in this way suggests the following properties:
0 ≤ DW ≤ 4
The closer DW is to 0, the more evidence that ρ1 > 0 (positive lag 1 auto-correlation)
The closer DW is to 4, the more evidence that ρ1 < 0 (negative lag 1 auto-correlation)
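These properties are easy to check numerically. The following Python sketch (made-up residual series, not the Kenora data) computes DW and compares it with 2(1 − r1):

```python
import math

def dw(e):
    """Durbin-Watson statistic for a residual series e."""
    return sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e))) / sum(x * x for x in e)

def r1(e):
    # lag-1 residual acf (no centring, matching the residual acf defined earlier)
    return sum(e[t] * e[t - 1] for t in range(1, len(e))) / sum(x * x for x in e)

# a smooth, slowly varying series has strong positive lag-1 autocorrelation
e = [math.sin(0.2 * t) for t in range(200)]
stat = dw(e)

assert 0 <= stat <= 4
assert stat < 2                              # small DW: evidence of rho_1 > 0
assert abs(stat - 2 * (1 - r1(e))) < 0.05    # DW ≈ 2(1 - r1)
```

An alternating series such as `[1.0, -1.0] * 50` gives DW near 4, the negative-autocorrelation extreme.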
The distribution of DW under H0 : ρ1 = 0 , required to obtain p-values, depends on both the number
of parameters and the number of observations.
In the absence of software to compute p-values, Durbin-Watson tables provide lower and upper critical
(α = .05 ) values, DW L and DW U , against which we compare the value of the test statistic DW to
determine whether we are to accept or reject H0 : ρ1 = 0 according to the following rules:
To test for negative lag 1 autocorrelation (H0 : ρ1 = 0 vs Ha : ρ1 < 0 ), we use the test statistic
4 − DW and proceed as follows:
We can use this test to confirm the presence of a positive lag 1 autocorrelation in the residuals from
the Kenora temperature model.
We see a positive linear association between et and e t−1 . To test for positive lag 1 autocorrelation:
H0 : ρ1 = 0 vs Ha : ρ1 > 0
Test statistic:
> DW = sum(diff(res)^2)/sum(res^2)
> DW
[1] 1.554483
Using Durbin-Watson tables (https://www.real-statistics.com/statistics-tables/durbin-watson-table/)
with (α = .05, n = 600, k = 12):
DW_L = 1.825, DW_U = 1.907
Since DW = 1.554 < DW_L, we reject H0 and conclude that there is significant positive lag 1
autocorrelation (ρ1 > 0).
To verify our calculations and conclusion, we can use the dwtest function in the lmtest library:
> library(lmtest)
> dwtest(Kentemp.seas.trend)
Durbin-Watson test
data: Kentemp.seas.trend
If autocorrelation still exists in a time series once the seasonal and trend components have been
accounted for, we can employ another class of forecasting models on the resulting series.
These models will be introduced in the next lesson.
3. Introduction to Process Performance
Measuring Process Performance
Introduction
A process is a series of operations or actions repeated over time, each
iteration of which produces a unit. The output, yt, is the variable(s) of
interest associated with the unit produced at time t.
Consider the pull dataset from 453 vehicles produced over a 24hr
period.
Process: production of vehicles at this plant
Unit: vehicle
Output of interest: the pull (a measure of alignment, in degrees)
The specification limits for this process are 0.23 ± 0.25.
(Cars with pull outside these limits will need to be pulled from
production and realigned)
Process Performance Summaries
Graphical summaries:
• Histograms
• Run plots
Control Charts
Introduction
Control charts are a means of assessing if (and in some cases, when)
the process mean, µ, and/or the process std. dev., σ, have changed
substantially over time.
In control chart construction, subgroups of units are sampled and the
subgroup means and standard deviations are plotted over time.
Control limits are added to the plot to represent the limits of acceptable
variation in the subgroup mean and standard deviation.
A process is considered stable or in-control if all subgroup means (or
standard deviations) are within the control limits and unstable or out-of-
control if any are outside control limits.
We will examine two types of control charts: the Xbar chart and S chart
Example: Spring height process (Example 1 of Control Chart notes):
• Sampling protocol: Four consecutively produced springs selected
every hour over 25 hr. period.
• y = height (mm) of a coil spring, was recorded for each unit in
sample
• The subgroup means and standard deviations were calculated and
plotted.
• ‘3 sigma’ control limits added to plot to indicate the limits of
acceptable random variation if process mean/std. dev. did not
change.
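The construction described above can be sketched in Python on hypothetical spring data. Note that real Xbar/S charts estimate σ from the subgroup standard deviations using bias-correction constants (e.g. c4); this sketch uses a simple uncorrected average purely to show the mechanics:

```python
import random
import statistics

random.seed(1)
# hypothetical spring heights: 25 subgroups of 4 consecutively produced springs
subgroups = [[random.gauss(5.0, 0.1) for _ in range(4)] for _ in range(25)]

xbars = [statistics.mean(g) for g in subgroups]   # subgroup means (Xbar chart)
sds = [statistics.stdev(g) for g in subgroups]    # subgroup std. devs. (S chart)

grand_mean = statistics.mean(xbars)
sigma_hat = statistics.mean(sds)   # crude process-sd estimate (no c4 correction)
n = 4                              # subgroup size

# '3 sigma' control limits for the subgroup means
lcl = grand_mean - 3 * sigma_hat / n ** 0.5
ucl = grand_mean + 3 * sigma_hat / n ** 0.5

# subgroups whose mean falls outside the limits signal an out-of-control process
flagged = [i + 1 for i, xb in enumerate(xbars) if not (lcl <= xb <= ucl)]
```

Plotting `xbars` against subgroup number with horizontal lines at `lcl` and `ucl` reproduces the style of Xbar chart shown on the slides.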
[Figure: Xbar Chart for Spring Height - subgroup mean height (mm) plotted against subgroup number (1-25), with 3-sigma control limits]
[Figure: S Chart for Spring Height - subgroup standard deviation, s, plotted against subgroup number (1-25), with 3-sigma control limits]
3. Creating Control Charts
[Figure: Xbar Chart for Spring Height (repeated)]
[Figure: S Chart for Spring Height (repeated)]
Bead Demonstration
1: https://youtu.be/AfxabXHL9zY 2: https://youtu.be/dCv12-kLTXI 3: https://youtu.be/PoM5gGkshGw 4: https://youtu.be/VXmIUToadfs
[Figure: Run chart of the number of defective beads per day, for Worker A and Worker B]
3-Sigma Control Limits for Binomial Count Data
We can create 3-sigma control charts to determine whether the mean, µ
(expected number of defective beads), for this process has changed
week to week.
Let Yt be the number of defective beads produced on day t.
Assuming beads are produced independently with constant probability, π, of
producing a defective bead, then
Yt ~ Binomial(100, π)
with mean µ = E(Yt) = 100π and std. dev. σ = SD(Yt) = √(100π(1 − π)).
From the data, µ̂ = 100π̂ = 10.4 and σ̂ = √(100(.104)(.896)) = 3.053,
yielding 3-sigma control limits:
10.4 ± 3(3.053) = 10.4 ± 9.16 = (1.24, 19.56)
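The same limits computed directly (100 beads sampled per day, with π estimated from the data as 0.104):

```python
# 3-sigma control limits for binomial count data (bead demonstration).
n_beads, pi_hat = 100, 0.104

mu_hat = n_beads * pi_hat                              # estimated E(Yt)
sigma_hat = (n_beads * pi_hat * (1 - pi_hat)) ** 0.5   # estimated SD(Yt)
lcl, ucl = mu_hat - 3 * sigma_hat, mu_hat + 3 * sigma_hat

print(round(mu_hat, 1), round(sigma_hat, 3), round(lcl, 2), round(ucl, 2))
# → 10.4 3.053 1.24 19.56
```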
[Figure: 3-Sigma Control Chart for Bead Data - number of defective beads per day, with control limits at 1.24 and 19.56]
In this chapter, we concentrate on measuring process outputs and describing the performance of the
process by summarizing the data collected. Statistical method and thinking play an important role
because the outputs of the process vary from unit to unit. Failure to understand the variation can lead a
manager to odd behaviour and poor decisions.
Example 1
A large manufacturing organization conducts a daily audit of its production that involves careful
checking of 30 units for a large number of possible defects and other failures to conform to
specifications. The results of the previous day’s audit are discussed in a morning meeting by quality
and production managers. The managers spend most of their time in the meeting discussing what went
wrong the previous day and what steps have been taken to ensure that problems have been resolved.
One measure of performance that receives a lot of attention is the number of defects per unit detected
in the audit. Figure 1 is a typical report.
[Figure 1: Bar chart of defects per unit - average year to date, day before yesterday, yesterday - with a dashed target line]
The dashed line is the target set at the start of the year. By year end, the management team expects that
the daily defects per unit will be below the target. The morning meeting is part of this undertaking to
improve the process. The report in Figure 1 would generate, among others, the following
comments:
“We did a lot better yesterday than the day before. Good job!”
“We were better than average yesterday”
“We’re trending in the right direction!”
“I am a bit worried about meeting the target. We need to be more vigilant!”
Stat 372 © R.J. MacKay and S.H. Steiner, University of Waterloo, 2006 II-1
There are many questions that someone with statistical training would ask. For example,
3 or 4 units are haphazardly selected each hour until the quota of 30 is reached.
The same crew conducts the audit every day so they know what they are doing. Besides, they
are warned to look extra carefully at the major problems from the previous day.
The plot of the data is given in Figure 2.
[Figure 2: Run chart of defects per unit by day, days 1-20]
With a sample of size 30 on each of yesterday and the previous day, it is not clear if the
observed difference is real or just due to sampling error.
Two points do not make a trend! Look at Figure 2! (you should also see the exercises)
From Figure 2, the process performance is not changing in the long run. As a consequence,
unless there is a fundamental change to the process, the target will not be met. Calling for more
vigilance is wishful thinking.
You should ask yourself the following questions with respect to improving the process and meeting the
target:
The point of Example 1 is that the people involved had difficulty dealing with the variation in the
process. They forgot that the sample of 30 units did not give an exact measure of the process
performance. They wasted a lot of time explaining the ups and downs in the process output from day to
day and were not achieving any improvement despite their best efforts. At best, you could argue that
they were maintaining the process performance through their efforts.
In terms of summarizing the performance of the process, the run chart in Figure 2 is much better than
Figure 1. The run chart puts yesterday’s result in a meaningful context which helps avoid
misinterpretation and over-interpretation. Putting the data into a reasonable context is a fundamental
rule for presenting process performance data – see Wheeler (1993) for an excellent discussion of this
important point. The following example is adapted from material in that book.
Example 2
A large organization ships products on order to many customers. A key feature of the shipping process
was the promise of on-time delivery. Orders were normally delivered by truck but if there was a
problem, air freight, an expensive alternative, was used to ensure that the delivery was on-time. Use of
air freight was called a premium or expedited shipment. The transportation manager receives a
monthly report which contains the information in Table 1.
The manager is concerned about the high cost of premium shipments in July. On a percentage basis,
the manager notes that July 2003 is much worse than July 2002 but a bit better than June 2003. Are
things getting better or worse? What would happen in August?
These questions are impossible to answer without more context. Figure 3 is a plot of the percentage of
premium shipments by month since January 2001. From this plot, it is evident that the percentage of
premium shipments has been increasing over the last two years. August is likely to be a bad month.
One of the manager’s staff offers the explanation that since the number of orders has also been
increasing, it is not surprising that the percentage of premium shipments has been increasing because
the shipping department has not been given any extra resources. However, we can see from the run
chart in Figure 4 that this contention is not true and that some other explanation is required.
[Figure 3: Run chart of the percentage of premium shipments by month, Jan 2001 - Jul 2003]
[Figure 4: Run chart of total shipments by month, Jan 2001 - Jul 2003]
We see from Example 2 that to interpret process performance for a given month, we need to look at it in
the context of the process over a longer period of time. The run chart is an excellent way to present this
context. Quoting Joiner (1994), another excellent book you should read, the key to providing context is
to plot the data over time.
In other words, we need to give a visual display of the process over time in order to interpret a single
or small set of points correctly.
We also track process performance over time:
• to assess the effects of fundamental changes to the process (i.e. in our language, the effect of
changing one or more fixed inputs) in the past
• to predict the future performance to see what, if any, action should be taken (i.e. should any
fixed inputs be changed?)
We often track outputs that are especially important to the customer (sometimes called key product
characteristics or special characteristics) and major cost drivers, important to the process owners. Be
careful not to choose too many outputs or the useful information gets diluted by the mass of reports. A
good rule of thumb is to stop producing reports that are not used on a regular basis. If you think of
producing the reports as a process and the users of the reports as the customers, then the key is to select
characteristics that meet the customers’ needs.
As discussed in the previous section, a key point in presenting process performance data is to set an
appropriate context. In statistical language, we need to define a study population. Since we are usually
interested in current and future performance, we need to decide how far back in time to go in order to
establish the context. This may not be an easy decision. If we are assessing a fundamental change in
the process, we need to include sufficient time before the change in order to capture the long run
behaviour of the process output. When we are looking at future performance, we need a sufficiently
long record to make the prediction. See Chapter 4 on forecasting.
We often construct a performance measure based on a sample of units from the study population. In
the audit example, 30 units were selected haphazardly from each day’s production. We need to ensure
that the sample is representative of the process over the selected time period and to remember the
possibility of sampling error.
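Haphazard selection is hard to audit or reproduce; one simple alternative is a pseudo-random sample of unit indices. A minimal sketch in Python — the run size (453) and sample size (30) echo the examples in the text, but everything else here is illustrative:

```python
import random

# Sketch: draw a random sample of 30 units from one day's production run.
# The run size (453) and sample size (30) mirror the examples in the text;
# the seed and indexing scheme are illustrative assumptions.
random.seed(2006)                      # fixed seed so the selection is reproducible
day_units = list(range(1, 454))        # unit indices 1..453 for the day
sample = random.sample(day_units, 30)  # 30 distinct units, each equally likely
```

Fixing the seed makes the selection repeatable, which is useful when the sampling procedure itself may later be questioned.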
If the output is a continuous measurement, we use histograms, averages and standard deviations to
summarize performance.
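As a small illustration, the two summary numbers mentioned above can be computed directly with the standard library (the measurement values below are made up, not from any example in the text):

```python
import statistics

# Hypothetical continuous measurements for one day (values are illustrative).
values = [0.21, 0.18, 0.25, 0.20, 0.23, 0.19, 0.22]

avg = statistics.mean(values)   # the average
sd = statistics.stdev(values)   # the sample standard deviation
```

A histogram of `values` would complete the summary; the average and standard deviation alone say nothing about shape or trends over time.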
Example 3
In an assembly plant, a number of output characteristics related to the wheel alignment are measured on
every vehicle produced. One output important to the driver is called pull, which measures the tendency
of the vehicle to turn on a straight surface if the driver's hands are removed from the steering wheel.
The specifications for pull are 0.23 ± 0.25. If an alignment characteristic is outside the specification
limits, the vehicle is taken out of sequence for repair. The file ch2example3.txt contains the data,
including the pull measurements, for one day when 453 vehicles were produced.
To summarize the daily performance, we plot a histogram and run chart of the data that include the
specification limits.
Figure 4: Histogram and Run Chart of Pull Values
We see from the plots that the pull values are centred near the target value 0.23 and almost all values
are within the specification limits. There are no obvious trends over the day.
The average and standard deviation of the pull values for the day are μ̂ = 0.208 and σ̂ = 0.068. We
can plot the average and standard deviation on a run chart to assess the process in the context of day-to-day
performance.
Another common measure of performance when the output is measured on a continuous scale is the
capability ratio Ppk, defined as

    Ppk = min(U − μ̂, μ̂ − L) / (3σ̂)

where U and L are the upper and lower specification limits. For the pull output, with L = −0.02 and
U = 0.48, we have

    Ppk = min(0.48 − 0.208, 0.208 − (−0.02)) / (3 × 0.068) = 0.228 / 0.204 ≈ 1.12
The larger the capability ratio, the better the process performance relative to the specifications. If the
process is centred, so that the average μ̂ is (U + L) / 2, the numerator of the capability ratio is as large
as possible for the given variation. If there is little variation in the process, then σ̂, and hence the
denominator of Ppk, is small.
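The definition translates directly into code. A sketch, assuming pull specifications of 0.23 ± 0.25 (so L = −0.02 and U = 0.48) and using the day's summary statistics quoted in the text:

```python
def ppk(mean, sd, lower, upper):
    """Capability ratio Ppk: distance from the mean to the nearer
    specification limit, measured in units of three standard deviations."""
    return min(upper - mean, mean - lower) / (3 * sd)

# Assumed inputs: specs 0.23 +/- 0.25 and the day's mean 0.208, sd 0.068.
value = ppk(mean=0.208, sd=0.068, lower=-0.02, upper=0.48)
```

With these inputs the nearer limit is the lower one (0.208 − (−0.02) = 0.228), so Ppk ≈ 1.12; a centred process whose limits sit exactly 3σ̂ from the average has Ppk = 1.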
If the histogram is bell-shaped (matching a Gaussian density), we can interpret the capability ratio in
terms of the proportion of units outside specification. See the exercises.
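Under that bell-shaped (normal) assumption, the out-of-specification proportion can be estimated from the fitted normal distribution. A sketch using only the Python standard library; the numbers below are illustrative, not from the pull example:

```python
from statistics import NormalDist

def frac_outside(mean, sd, lower, upper):
    """Estimated proportion of units outside (lower, upper),
    assuming the output is normally distributed."""
    d = NormalDist(mu=mean, sigma=sd)
    return d.cdf(lower) + (1 - d.cdf(upper))

# A centred process whose spec limits sit 3 sigma from the mean (Ppk = 1)
# has roughly 0.27% of units out of specification.
p = frac_outside(mean=0.0, sd=1.0, lower=-3.0, upper=3.0)
```

This makes the link between Ppk and the proportion of nonconforming units concrete: larger Ppk pushes both specification limits further into the tails, shrinking the estimated proportion outside.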
As in the earlier examples we must interpret these measures of performance in a broader context.