# CHEE824

Nonlinear Regression Analysis
J. McLellan
Winter 2004
Module 1:
Linear Regression
Outline -
• assessing systematic relationships
• matrix representation for multiple regression
• least squares parameter estimates
• diagnostics
» graphical
» quantitative
• further diagnostics
» testing the need for terms
» lack of fit test
• precision of parameter estimates, predicted responses
• correlation between parameter estimates
The Scenario
We want to describe the systematic relationship
between a response variable and a number of
explanatory variables
multiple regression
we will consider the case which
is linear in the parameters
Assessing Systematic Relationships
Is there a systematic relationship?
Two approaches:
• graphical
» scatterplots, casement plots
• quantitative
» form correlations between response, explanatory
variables
» consider forming correlation matrix - table of pairwise
correlations between response and explanatory variables, and
between pairs of explanatory variables
• correlation between explanatory variables leads to
correlated parameter estimates
chee824 - Winter 2004 J. McLellan 6
Graphical Methods for Analyzing Data
Visualizing relationships between variables
Techniques
• scatterplots
• scatterplot matrices
» also referred to as “casement plots”
• Time sequence plots
Scatterplots
… are also referred to as “x-y diagrams”
• plot values of one variable against another
• look for systematic trend in data
» nature of trend
• linear?
• exponential?
» degree of scatter - does spread increase/decrease over
range?
• indication that variance isn’t constant over range of data
Scatterplots - Example
[Figure: scatterplot of DISCOLOR vs. FLUORIDE (teeth 4v*20c); annotation: trend - possibly nonlinear?]
• tooth discoloration data - discoloration vs. fluoride
Scatterplot - Example
[Figure: scatterplot of DISCOLOR vs. BRUSHING (teeth 4v*20c)]
• tooth discoloration data - discoloration vs. brushing
• significant trend? - doesn’t appear to be present
Scatterplot - Example
[Figure: scatterplot of DISCOLOR vs. BRUSHING (teeth 4v*20c)]
• tooth discoloration data - discoloration vs. brushing
• variance appears to decrease as # of brushings increases
Scatterplot matrices
…are a table of scatterplots for a set of variables
Look for -
» systematic trend between “independent” variable and
dependent variables - to be described by estimated
model
» systematic trend between supposedly independent
variables - indicates that these quantities are correlated
• correlation can negatively influence model estimation results
• not independent information
• scatterplot matrices can be generated automatically
with statistical software, manually using Excel
Scatterplot Matrices - tooth data
[Figure: matrix plot (teeth 4v*20c) of pairwise scatterplots for FLUORIDE, AGE, BRUSHING, DISCOLOR]
Time Sequence Plot - Naphtha 90% Point
[Figure: time sequence plot of naphtha 90% point (degrees F, 390 to 480) vs. time (0 to 270)]
- for naphtha 90% point - indicates amount of heavy
hydrocarbons present in gasoline range material
Annotations on the plot:
• excursion - sudden shift in operation
• average operating point
• time correlation in data
What do dynamic data look like?
[Figure: time series plot of industrial data - var1 and var2 vs. sample number (#1 to #2101)]
Assessing Systematic Relationships
Quantitative Methods
• correlation
» formal def’n plus sample statistic (“Pearson’s r”)
• covariance
» formal def’n plus sample statistic
These provide a quantitative measure of systematic LINEAR
relationships
Covariance
Formal Definition
• given two random variables X and Y, the covariance
is
$$\mathrm{Cov}(X, Y) = E\{(X - \mu_X)(Y - \mu_Y)\}$$
• E{ } - expected value
• sign of the covariance indicates the sign of the slope
of the systematic linear relationship
» positive value --> positive slope
» negative value --> negative slope
• issue - covariance is SCALE DEPENDENT
Covariance
• motivation for covariance as a measure of systematic
linear relationship
» look at pairs of departures about the mean of X, Y
[Figure: two scatterplots of Y vs. X, each marking the point (mean of X, mean of Y)]
Correlation
• is the “dimensionless” covariance
» divide covariance by standard dev’ns of X, Y
• formal definition
$$\mathrm{Corr}(X, Y) = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
• properties
» dimensionless
» range
$$-1 \le \rho(X, Y) \le 1$$
» near -1: strong linear relationship with negative slope
» near +1: strong linear relationship with positive slope
Note - the correlation gives NO information about the
actual numerical value of the slope.
Estimating Covariance, Correlation
…from process data (with N pairs of observations)
Sample Covariance
$$R = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})$$
Sample Correlation
$$r = \frac{\frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{s_X s_Y}$$
Making Inferences
The sample covariance and correlation are
STATISTICS, and have their own probability
distributions.
Confidence interval for the correlation -
» the following is approximately distributed as the standard
normal random variable
$$\sqrt{N-3}\,\left(\tanh^{-1}(r) - \tanh^{-1}(\rho)\right)$$
» derive confidence limits for $\tanh^{-1}(\rho)$ and convert to
confidence limits for the true correlation using tanh
Confidence Interval for Correlation
Procedure
1. find $z_{\alpha/2}$ for desired confidence level
2. confidence interval for $\tanh^{-1}(\rho)$ is
$$\tanh^{-1}(r) \pm z_{\alpha/2} \sqrt{\frac{1}{N-3}}$$
3. convert these limits to confidence limits for the correlation by
taking tanh of the limits in step 2
A hypothesis test can also be performed using this function of the
correlation and comparing to the standard normal distribution
Example - Solder Thickness
Objective - study the effect of temperature on solder
thickness
Data - in pairs
| Solder Temperature (C) | Solder Thickness (microns) |
|---|---|
| 245 | 171.6 |
| 215 | 201.1 |
| 218 | 213.2 |
| 265 | 153.3 |
| 251 | 178.9 |
| 213 | 226.6 |
| 234 | 190.3 |
| 257 | 171 |
| 244 | 197.5 |
| 225 | 209.8 |
Example - Solder Thickness
[Figure: scatterplot of solder thickness (microns) vs. temperature (C)]
Correlation matrix:
| | Solder Temperature (C) | Solder Thickness (microns) |
|---|---|---|
| Solder Temperature (C) | 1 | |
| Solder Thickness (microns) | -0.920001236 | 1 |
Example - Solder Thickness
Confidence Interval
zalpha/2 of 1.96 (95% confidence level)
limits in tanh^-1(rho) -2.329837282 -0.848216548
limits in rho -0.981238575 -0.690136605
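The numbers in this example can be reproduced with a short script. The sketch below is one possible way to do it, not the method used to prepare the slides: the function names `sample_correlation` and `correlation_interval` are illustrative, and the value 1.96 is the z-value quoted above for 95% confidence.

```python
import math

# Solder data from the example (temperature in C, thickness in microns)
temperature = [245, 215, 218, 265, 251, 213, 234, 257, 244, 225]
thickness = [171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8]


def sample_correlation(x, y):
    """Pearson's r; the 1/(N-1) factors cancel between numerator and denominator."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def correlation_interval(r, n, z=1.96):
    """Confidence limits for rho: work in tanh^-1 space, then map back with tanh."""
    centre = math.atanh(r)
    half_width = z / math.sqrt(n - 3)
    return math.tanh(centre - half_width), math.tanh(centre + half_width)


r = sample_correlation(temperature, thickness)
lower, upper = correlation_interval(r, len(temperature))
print(f"r = {r:.4f}, 95% limits for rho: ({lower:.4f}, {upper:.4f})")
```

The interval excludes zero, which is consistent with a significant negative linear relationship between temperature and thickness.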
Empirical Modeling - Terminology
• response
» “dependent” variable - responds to changes in other
variables
» the response is the characteristic of interest which we are
trying to predict
• explanatory variable
» “independent” variable, regressor variable, input, factor
» these are the quantities that we believe have an
influence on the response
• parameter
» coefficients in the model that describe how the
regressors influence the response
Models
When we are estimating a model from data, we
consider the following form:
$$Y = f(X, \theta) + \varepsilon$$
• $Y$ - response
• $X$ - explanatory variables
• $\theta$ - parameters
• $\varepsilon$ - “random error”
The Random Error Term
• is included to reflect fact that measured data contain
variability
» successive measurements under the same conditions
(values of the explanatory variables) are likely to be
slightly different
» this is the stochastic component
» the functional form describes the deterministic
component
» random error is not necessarily the result of mistakes in
experimental procedures - reflects inherent variability
» “noise”
Types of Models
• linear/nonlinear in the parameters
• linear/nonlinear in the explanatory variables
• number of response variables
– single response (standard regression)
– multi-response (or “multivariate” models)
From the perspective of statistical model-building,
the key point is whether the model is linear or
nonlinear in the PARAMETERS.
Linear Regression Models
• linear in the parameters
• can be nonlinear in the regressors
$$T_{95} = b_1 T_{LGO} + b_2 T_{mid} + \varepsilon$$
Nonlinear Regression Models
• nonlinear in the parameters
– e.g., Arrhenius rate expression
$$r = k_0 \exp\left(-\frac{E}{RT}\right)$$
• $k_0$ - linear (if E is fixed)
• $E$ - nonlinear
Nonlinear Regression Models
• sometimes transformably linear: start with
$$r = k_0 \exp\left(-\frac{E}{RT}\right) + \varepsilon$$
and take ln of both sides to produce
$$\ln(r) = \ln(k_0) - \frac{E}{RT} + \delta$$
which is of the form
$$Y = \beta_0 + \beta_1 \frac{1}{RT} + \delta$$
which is linear in the parameters
Transformations
• note that linearizing the nonlinear equation by
transformation can lead to misleading estimates if the
proper estimation method is not used
• transforming the data can alter the statistical
distribution of the random error term
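A short worked expansion may help make the second point concrete. It is a sketch that assumes the error ε enters the original Arrhenius rate expression additively and is small relative to the rate; the symbol σ² for the variance of ε is introduced here for illustration.

```latex
\ln(r) = \ln\!\left(k_0 e^{-E/RT} + \varepsilon\right)
       = \ln(k_0) - \frac{E}{RT}
         + \ln\!\left(1 + \frac{\varepsilon}{k_0 e^{-E/RT}}\right)

\delta = \ln\!\left(1 + \frac{\varepsilon}{k_0 e^{-E/RT}}\right)
       \approx \frac{\varepsilon}{k_0 e^{-E/RT}}
\quad\Rightarrow\quad
\mathrm{Var}(\delta) \approx \frac{\sigma^2}{\left(k_0 e^{-E/RT}\right)^2}
```

Under these assumptions, even if ε has constant variance, the transformed error δ has a variance that changes with T, so ordinary least squares on the transformed model no longer meets the constant-variance assumption.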
Ordinary LS vs. Multi-Response
• single response (ordinary least squares)
$$T_{95} = b_1 T_{LGO} + b_2 T_{mid} + \varepsilon$$
• multi-response (e.g., Partial Least Squares)
– issue - joint behaviour of responses, noise
$$T_{95,LGO} = b_{11} T_{LGO} + b_{12} T_{mid} + \varepsilon_1$$
$$T_{95,kero} = b_{21} T_{kero} + b_{22} T_{mid} + \varepsilon_2$$
We will be focussing on single response models.
Linear Multiple Regression
Model Equation
$$Y_i = \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i$$
• $Y_i$ - i-th observation of response (i-th data point)
• $X_{i1}$ - i-th value of explanatory variable $X_1$
• $X_{ip}$ - i-th value of explanatory variable $X_p$
• $\varepsilon_i$ - random noise in i-th observation of response
The intercept can be considered as corresponding
to an X which always has the value “1”
Assumptions for Least Squares Estimation
Values of explanatory variables are known EXACTLY
» random error is strictly in the response variable
» practically - a random component will almost always be
present in the explanatory variables as well
» we assume that this component has a substantially
smaller effect on the response than the random
component in the response
» if random fluctuations in the explanatory variables are
important, consider alternative method (“Errors in
Variables” approach)
Assumptions for Least Squares Estimation
The form of the equation provides an adequate
representation for the data
» can test adequacy of model as a diagnostic
Variance of random error is CONSTANT over range of
data collected
» e.g., variance of random fluctuations in thickness
measurements at high temperatures is the same as
variance at low temperatures
» data is “heteroscedastic” if the variance is not constant -
different estimation procedure is required
» thought - percentage error in instruments?
Assumptions for Least Squares Estimation
The random fluctuations in each measurement are
statistically independent from those of other
measurements
» at same experimental conditions
» at other experimental conditions
» implies that random component has no “memory”
» no correlation between measurements
Random error term is normally distributed
» typical assumption
» not essential for least squares estimation
» important when determining confidence intervals,
conducting hypothesis tests
Least Squares Estimation - graphically
least squares - minimize sum of squared prediction errors
[Figure: response (solder thickness) vs. T, showing the data points, the deterministic “true” relationship, and a prediction error (“residual”)]
More Notation and Terminology
Random error is “independent, identically distributed”
(I.I.D.) -- can say that it is IID Normal
Capitals - Y - denotes random variable
- except in case of explanatory variable - capital used
to denote formal def’n
Lower case - y, x - denotes measured values of
variables
Model
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
Measurement
$$y = \beta_0 + \beta_1 x + \varepsilon$$
More Notation and Terminology
Estimate - denoted by “hat”
» examples - estimates of response, parameter: $\hat{y}$, $\hat{\beta}_0$
Residual - difference between measured and predicted
response
$$e = y - \hat{y}$$
Matrix Representation for Multiple Regression
We can arrange the observations in “tabular” form - vector of
observations, and matrix of explanatory values:
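$$
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{N-1} \\ Y_N \end{bmatrix}
=
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1p} \\
X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & & \vdots \\
X_{N-1,1} & X_{N-1,2} & \cdots & X_{N-1,p} \\
X_{N,1} & X_{N,2} & \cdots & X_{N,p}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{N-1} \\ \varepsilon_N \end{bmatrix}
$$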
Matrix Representation for Multiple Regression
The model is written as:
$$Y = X\beta + \varepsilon$$
• $Y$ - Nx1 vector
• $X$ - Nxp matrix
• $\beta$ - px1 vector
• $\varepsilon$ - Nx1 vector
N --> number of data observations
p --> number of parameters
Least Squares Parameter Estimates
We make the same assumptions as in the straight line
regression case:
» independent random noise components in each
observation
» explanatory variables known exactly - no randomness
» variance constant over experimental region (identically
distributed noise components)
Residual Vector
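Given a set of parameter values $\tilde{\beta}$, the residual vector is formed
from the matrix expression:
$$
e =
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{N-1} \\ e_N \end{bmatrix}
=
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{N-1} \\ Y_N \end{bmatrix}
-
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1p} \\
X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & & \vdots \\
X_{N-1,1} & X_{N-1,2} & \cdots & X_{N-1,p} \\
X_{N,1} & X_{N,2} & \cdots & X_{N,p}
\end{bmatrix}
\begin{bmatrix} \tilde{\beta}_1 \\ \tilde{\beta}_2 \\ \vdots \\ \tilde{\beta}_p \end{bmatrix}
$$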
Sum of Squares of Residuals
…is the same as before, but can be expressed as the squared
length of the residual vector:
$$SSE = \sum_{i=1}^{N} e_i^2 = e^T e = \|e\|^2 = (Y - X\tilde{\beta})^T (Y - X\tilde{\beta})$$
Least Squares Parameter Estimates
Find the set of parameter values that minimize the sum
of squares of residuals (SSE)
» apply necessary conditions for an optimum from calculus
(stationary point)
$$\left. \frac{\partial\, SSE}{\partial \beta} \right|_{\hat{\beta}} = 0$$
» system of N equations in p unknowns, with number of
parameters < number of observations : over-determined
system of equations
» solution - set of parameter values that comes “closest to
satisfying all equations” (in a least squares sense)
Least Squares Parameter Estimates
The solution is:
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
$(X^T X)^{-1} X^T$ is the generalized matrix inverse of X
- generalization of the standard concept of matrix inverse to the case of a
non-square matrix
Example - Solder Thickness
Let’s analyze the data considered for the straight line case:
| Solder Temperature (C) | Solder Thickness (microns) |
|---|---|
| 245 | 171.6 |
| 215 | 201.1 |
| 218 | 213.2 |
| 265 | 153.3 |
| 251 | 178.9 |
| 213 | 226.6 |
| 234 | 190.3 |
| 257 | 171 |
| 244 | 197.5 |
| 225 | 209.8 |

Model:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
Example - Solder Thickness
In matrix form:
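$$
\begin{bmatrix} 171.6 \\ 201.1 \\ 213.2 \\ 153.3 \\ 178.9 \\ 226.6 \\ 190.3 \\ 171 \\ 197.5 \\ 209.8 \end{bmatrix}
=
\begin{bmatrix} 1 & 245 \\ 1 & 215 \\ 1 & 218 \\ 1 & 265 \\ 1 & 251 \\ 1 & 213 \\ 1 & 234 \\ 1 & 257 \\ 1 & 244 \\ 1 & 225 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \end{bmatrix}
\quad \Leftrightarrow \quad Y = X\beta + \varepsilon
$$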
Example - Solder Thickness
In order to calculate the Least Squares Estimates:
$$
X^T X = \begin{bmatrix} 10 & 2367 \\ 2367 & 563335 \end{bmatrix};
\qquad
X^T Y = \begin{bmatrix} 1910 \\ 449420 \end{bmatrix}
$$
Example - Solder Thickness
The least squares parameter estimates are obtained as:
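$$
\hat{\beta} = (X^T X)^{-1} X^T Y
= \begin{bmatrix} 18.373 & -0.0772 \\ -0.0772 & 0.0003 \end{bmatrix}
\begin{bmatrix} 1910 \\ 449420 \end{bmatrix}
= \begin{bmatrix} 458.10 \\ -1.13 \end{bmatrix}
$$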
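The matrix calculation for the solder thickness example can be checked numerically. The following is a minimal sketch using NumPy (an assumption; the course material itself refers to Excel), with illustrative variable names. It solves the normal equations directly instead of forming the inverse, and cross-checks against NumPy's own least squares routine.

```python
import numpy as np

# Solder data from the example (temperature in C, thickness in microns)
temperature = np.array([245, 215, 218, 265, 251, 213, 234, 257, 244, 225], dtype=float)
thickness = np.array([171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8])

# Column of ones for the intercept, then the temperature column
X = np.column_stack([np.ones_like(temperature), temperature])
Y = thickness

XtX = X.T @ X
XtY = X.T @ Y

# Solve the normal equations (X^T X) beta = X^T Y
beta_hat = np.linalg.solve(XtX, XtY)

# Cross-check with NumPy's least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("X^T X =\n", XtX)
print("X^T Y =", XtY)
print("beta_hat =", beta_hat)
```

The printed X^T Y is computed from the unrounded data, so it may differ slightly from rounded values shown on a slide; the estimates should agree with 458.10 and -1.13 to the precision quoted.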
Example - Wave Solder Defects
(page 8-31, Course Notes)
Wave Solder Defects Data
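| Run | Conveyor Speed | Pot Temp | Flux Density | No. of Defects |
|---|---|---|---|---|
| 1 | -1 | -1 | -1 | 100 |
| 2 | 1 | -1 | -1 | 119 |
| 3 | -1 | 1 | -1 | 118 |
| 4 | 1 | 1 | -1 | 217 |
| 5 | -1 | -1 | 1 | 20 |
| 6 | 1 | -1 | 1 | 42 |
| 7 | -1 | 1 | 1 | 41 |
| 8 | 1 | 1 | 1 | 113 |
| 9 | 0 | 0 | 0 | 101 |
| 10 | 0 | 0 | 0 | 96 |
| 11 | 0 | 0 | 0 | 115 |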
Example - Wave Solder Defects
In matrix form:
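$$
\begin{bmatrix} 100 \\ 119 \\ 118 \\ 217 \\ 20 \\ 42 \\ 41 \\ 113 \\ 101 \\ 96 \\ 115 \end{bmatrix}
=
\begin{bmatrix}
1 & -1 & -1 & -1 \\
1 & 1 & -1 & -1 \\
1 & -1 & 1 & -1 \\
1 & 1 & 1 & -1 \\
1 & -1 & -1 & 1 \\
1 & 1 & -1 & 1 \\
1 & -1 & 1 & 1 \\
1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \end{bmatrix}
\quad \Leftrightarrow \quad Y = X\beta + \varepsilon
$$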
Example - Wave Solder Defects
To calculate least squares parameter estimates:
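$$
X^T X = \begin{bmatrix} 11 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 8 & 0 \\ 0 & 0 & 0 & 8 \end{bmatrix};
\qquad
X^T Y = \begin{bmatrix} 1082 \\ 212 \\ 208 \\ -338 \end{bmatrix}
$$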
Example - Wave Solder Defects
Least squares parameter estimates:
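$$
\hat{\beta} = (X^T X)^{-1} X^T Y
= \begin{bmatrix} \frac{1}{11} & 0 & 0 & 0 \\ 0 & \frac{1}{8} & 0 & 0 \\ 0 & 0 & \frac{1}{8} & 0 \\ 0 & 0 & 0 & \frac{1}{8} \end{bmatrix}
\begin{bmatrix} 1082 \\ 212 \\ 208 \\ -338 \end{bmatrix}
= \begin{bmatrix} 98.36 \\ 26.50 \\ 26.0 \\ -42.25 \end{bmatrix}
$$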
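The same calculation for the wave solder defects data can be sketched in a few lines. NumPy is assumed here and the variable names are illustrative; the data are the 11 coded runs from the table in this example.

```python
import numpy as np

# Coded run conditions: conveyor speed, pot temperature, flux density
design = np.array([
    [-1, -1, -1], [1, -1, -1], [-1, 1, -1], [1, 1, -1],
    [-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1],
    [0, 0, 0], [0, 0, 0], [0, 0, 0],
], dtype=float)
defects = np.array([100, 119, 118, 217, 20, 42, 41, 113, 101, 96, 115], dtype=float)

# Leading column of ones corresponds to the intercept
X = np.column_stack([np.ones(len(defects)), design])

XtX = X.T @ X          # diagonal for this designed experiment
XtY = X.T @ defects
beta_hat = np.linalg.solve(XtX, XtY)

print("X^T X =\n", XtX)
print("X^T Y =", XtY)
print("beta_hat =", beta_hat)
```

Because X^T X is diagonal, each estimate is simply the corresponding element of X^T Y divided by 11 (intercept) or 8 (factors).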
• if there are N runs, and the model has p parameters, $X^T X$ is a pxp
matrix (smaller dimension than number of runs)
• elements of $X^T Y$ are $\sum_{i=1}^{N} x_{ij} y_i$ for parameters j=1, …, p
• in the Wave Solder Defects example, the values of the
explanatory variables for the runs followed very specific patterns
of -1 and +1, and $X^T X$ was a diagonal matrix
• in the Solder Thickness example, the values of the explanatory
variable did not follow a specific pattern, and $X^T X$ was not
diagonal
Graphical Diagnostics
Basic Principle - extract as much trend as possible from
the data
Residuals should have no remaining trend -
» with respect to the explanatory variables
» with respect to the data sequence number
» with respect to other possible explanatory variables
(“secondary variables”)
» with respect to predicted values
Graphical Diagnostics
Residuals vs. Predicted Response Values
[Figure: residual $e_i$ vs. predicted response $\hat{y}_i$]
- even scatter over range of prediction
- no discernible pattern
- roughly half the residuals are positive, half negative
DESIRED RESIDUAL PROFILE
Graphical Diagnostics
Residuals vs. Predicted Response Values
[Figure: residual $e_i$ vs. predicted response $\hat{y}_i$]
- outlier lies outside main body of residuals
RESIDUAL PROFILE WITH OUTLIERS
Graphical Diagnostics
Residuals vs. Predicted Response Values
[Figure: residual $e_i$ vs. predicted response $\hat{y}_i$]
- variance of the residuals appears to increase with higher predictions
NON-CONSTANT VARIANCE
Graphical Diagnostics
Residuals vs. Explanatory Variables
» ideal - no systematic trend present in plot
» inadequate model - evidence of trend present
[Figure: residual $e_i$ vs. explanatory variable x]
- left over quadratic trend - need quadratic term in model
Graphical Diagnostics
Residuals vs. Explanatory Variables Not in Model
» ideal - no systematic trend present in plot
» inadequate model - evidence of trend present
[Figure: residual $e_i$ vs. variable w (not in model)]
- systematic trend not accounted for in model - include a linear term in “w”
Graphical Diagnostics
Residuals vs. Order of Data Collection
[Figure: two plots of residual $e_i$ vs. order of data collection t]
• failure to account for time trend in data
• successive random noise components are correlated
- consider more complex model
- time series model for random component?
Quantitative Diagnostics - Ratio Tests
Residual Variance Test
» is the variance of the residuals significant relative to the inherent
noise variance?
» same test as that for the straight line data
» only distinction - number of degrees of freedom for the
Mean Squared Error => N-p, where p is the number of
parameters in the model
» compare ratio to $F_{N-p,\,M-1,\,0.05}$, where M is the number of
data points used to estimate inherent variance
» significant? -> model is INADEQUATE
Quantitative Diagnostics - Ratio Tests
Residual Variance Ratio
$$\frac{s^2_{residuals}}{s^2_{inherent}} = \frac{\text{Mean Squared Error of Residuals (MSE)}}{s^2_{inherent}}$$
Mean Squared Error of Residuals (Var. of Residuals):
$$s^2_{residuals} = MSE = \frac{\sum_{i=1}^{N} e_i^2}{N - p}$$
Quantitative Diagnostics - Ratio Tests
Mean Square Regression Ratio
» same as in the straight line case except for degrees of
freedom
Variance described by model:
$$MSR = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{p - 1}$$
Quantitative Diagnostics - Ratio Test
Test Ratio:
$$\frac{MSR}{MSE}$$
is compared against $F_{p-1,\,N-p,\,0.95}$
Conclusions?
– ratio is statistically significant --> significant trend has been modeled
– NOT statistically significant --> significant trend has NOT
been modeled, and model is inadequate in its present form
For the multiple regression case, this test is a coarse
measure of whether some trend has been modeled -
it provides no indication of which X’s are important
Analysis of Variance Tables
The ratio tests involve dissection of the sum of squares:
$$SSR = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2$$
$$SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
$$TSS = \sum_{i=1}^{N} (y_i - \bar{y})^2$$
Analysis of Variance (ANOVA) for Regression
| Source of Variation | Degrees of Freedom | Sum of Squares | Mean Square | F-Value | p-value |
|---|---|---|---|---|---|
| Regression | p-1 | SSR | MSR = SSR/(p-1) | F = MSR/MSE | p |
| Residuals | N-p | SSE | MSE = SSE/(N-p) | | |
| Total | N-1 | TSS | | | |
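To make the ANOVA bookkeeping concrete, the sketch below fills in the table entries for the wave solder defects data. It assumes NumPy and uses illustrative variable names; the p-value column is left out because it would need an F-distribution routine.

```python
import numpy as np

design = np.array([
    [-1, -1, -1], [1, -1, -1], [-1, 1, -1], [1, 1, -1],
    [-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1],
    [0, 0, 0], [0, 0, 0], [0, 0, 0],
], dtype=float)
defects = np.array([100, 119, 118, 217, 20, 42, 41, 113, 101, 96, 115], dtype=float)

X = np.column_stack([np.ones(len(defects)), design])
N, p = X.shape  # 11 runs, 4 parameters

beta_hat = np.linalg.solve(X.T @ X, X.T @ defects)
y_hat = X @ beta_hat
y_bar = defects.mean()

# Dissection of the total sum of squares
SSR = float(np.sum((y_hat - y_bar) ** 2))
SSE = float(np.sum((defects - y_hat) ** 2))
TSS = float(np.sum((defects - y_bar) ** 2))

# Mean squares and the F ratio, with degrees of freedom p-1 and N-p
MSR = SSR / (p - 1)
MSE = SSE / (N - p)
F = MSR / MSE

print(f"Regression: df={p - 1}, SS={SSR:.1f}, MS={MSR:.1f}, F={F:.2f}")
print(f"Residuals:  df={N - p}, SS={SSE:.3f}, MS={MSE:.2f}")
print(f"Total:      df={N - 1}, SS={TSS:.3f}")
```

The F ratio would then be compared against the tabulated value with p-1 and N-p degrees of freedom.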
Quantitative Diagnostics - $R^2$
Coefficient of Determination (“$R^2$ Coefficient”)
» square of correlation between observed and predicted
values:
$$R^2 = [\mathrm{corr}(y, \hat{y})]^2$$
» relationship to sums of squares:
$$R^2 = 1 - \frac{SSE}{TSS} = \frac{SSR}{TSS}$$
» values typically reported in “%”, i.e., 100 $R^2$
» ideal - $R^2$ near 100%
Issues with $R^2$
• $R^2$ is sensitive to extreme data points, resulting in misleading
indication of quality of fit
• $R^2$ can be made artificially large by adding more parameters to
the model
» put a curve through every point - “connect the dots”
model --> simply modeling noise in the data, rather than
trend
» solution - define the “adjusted $R^2$”, which penalizes the
addition of parameters to the model
Adjusted $R^2$
Adjust for number of parameters relative to number of observations
» account for degrees of freedom of the sums of squares
» define in terms of Mean Squared quantities
$$R^2_{adj} = 1 - \frac{MSE}{TSS/(N-1)} = 1 - \frac{SSE/(N-p)}{TSS/(N-1)}$$
» want value close to 1 (or 100%), as before
» if N>>p, adjusted $R^2$ is close to $R^2$
» provides measure of agreement, but does not account for
magnitude of residual error
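As an illustration, the two coefficients can be computed from the sums of squares quoted for the wave solder defects regression (SSE = 2694.0, SSR = 25306.5, with N = 11 runs and p = 4 parameters). This is a small sketch with illustrative function names.

```python
def r_squared(sse, tss):
    """Coefficient of determination from the sums of squares."""
    return 1.0 - sse / tss


def adjusted_r_squared(sse, tss, n, p):
    """Adjusted R^2: ratio of mean squares instead of sums of squares."""
    return 1.0 - (sse / (n - p)) / (tss / (n - 1))


# Values quoted for the wave solder defects regression
SSE = 2694.0
SSR = 25306.5
TSS = SSR + SSE
N, p = 11, 4

r2 = r_squared(SSE, TSS)
r2_adj = adjusted_r_squared(SSE, TSS, N, p)
print(f"R^2 = {100 * r2:.1f}%, adjusted R^2 = {100 * r2_adj:.1f}%")
```

With only 11 runs and 4 parameters, the adjustment lowers the coefficient noticeably; the gap shrinks as N grows relative to p.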
Testing the Need for Groups of Terms
In words: “Does a specific group of terms account for significant
trend in the model”?
Test
» compare difference in residual variance between full and
reduced model
» benchmark against an estimate of the inherent variation
» if significant, conclude that the group of terms IS
required
» if not significant, conclude that the group of terms can be
dropped from the model - not explaining significant trend
» note that remaining parameters should be re-estimated in
this case
Testing the Need for Groups of Terms
Test:
A - denotes the full model (with all terms)
B - denotes the reduced model (group of terms deleted)
Form:
$$\frac{SSE_{model\,B} - SSE_{model\,A}}{s^2\,(p_A - p_B)}$$
$p_A$, $p_B$ are the numbers of parameters in models A, B
$s^2$ is an estimate of the inherent noise variance:
» estimate as $SSE_A/(N - p_A)$
Testing the Need for Groups of Terms
Compare this ratio to $F_{p_A - p_B,\,\nu_{inherent},\,0.95}$
» if $MSE_A$ is used as estimate of inherent variance, then
degrees of freedom of inherent variance estimate is $N - p_A$
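A sketch of this test on the wave solder defects data is given below. The choice of reduced model (dropping conveyor speed and pot temperature, keeping flux density) is hypothetical and made only for illustration; NumPy is assumed, and the critical value 4.74 for F with 2 and 7 degrees of freedom at the 95% point is taken from standard F tables rather than from the course notes.

```python
import numpy as np

design = np.array([
    [-1, -1, -1], [1, -1, -1], [-1, 1, -1], [1, 1, -1],
    [-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1],
    [0, 0, 0], [0, 0, 0], [0, 0, 0],
], dtype=float)
defects = np.array([100, 119, 118, 217, 20, 42, 41, 113, 101, 96, 115], dtype=float)


def sse(X, y):
    """Residual sum of squares after a least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


ones = np.ones(len(defects))
X_full = np.column_stack([ones, design])           # model A: all three factors
X_reduced = np.column_stack([ones, design[:, 2]])  # model B: flux density only (hypothetical)

N, p_A = X_full.shape
p_B = X_reduced.shape[1]

sse_A = sse(X_full, defects)
sse_B = sse(X_reduced, defects)
s2 = sse_A / (N - p_A)  # MSE of the full model as the inherent variance estimate

ratio = (sse_B - sse_A) / (s2 * (p_A - p_B))
F_CRIT = 4.74  # F_{2,7,0.95} from tables (assumed value)

print(f"test ratio = {ratio:.2f}, group of terms required: {ratio > F_CRIT}")
```

Here the ratio exceeds the tabulated value, which would suggest the two dropped terms do account for significant trend and should stay in the model.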
Lack of Fit Test
If we have replicate runs in our regression data set, we can break
out the noise variance from the residuals, and assess the
component of the residuals due to unmodelled trend
Replicates -
» repeated runs at the SAME experimental conditions
» note that all explanatory variables must be at fixed
conditions
» indication of inherent variance because no other factors
are changing
» measure of repeatability of experiments
Using Replicates
We can estimate the sample variance for each set of replicates,
and pool the estimate of the variance
» constancy of variance can be checked using Bartlett’s
test
» constant variance is assumed for ordinary least squares
estimation
For each replicate set, we have:
$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}{n_i - 1}$$
• $y_{ij}$ - values in replicate set “i”
• $\bar{y}_i$ - average of values in replicate set “i”
• $n_i$ - number of values in replicate set “i”
Using Replicates
The pooled estimate of variance is:
$$s^2_{pooled} = \frac{\sum_{i=1}^{m} (n_i - 1)\, s_i^2}{\left( \sum_{i=1}^{m} n_i \right) - m}$$
i.e., convert back to sums of squares, and divide by the total
number of degrees of freedom (the sum of the degrees of
freedom for each variance estimate)
The Lack of Fit Test
Back to the sum of squares “block”:
[Figure: sum of squares “block” - TSS is split into SSR and SSE; SSE is split further into SSELOF (“lack of fit” sum of squares) and SSEP (“pure error” sum of squares)]
The Lack of Fit Test
We partition the SSE into two components:
» component due to inherent noise
» component due to unmodeled trend
Pure error sum of squares (SSEP):
$$SSEP = \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \right)$$
i.e., add together sums of squares associated with each replicate
group (there are “m” replicate groups in total)
The Lack of Fit Test
The “lack of fit sum of squares” (SSELOF) is formed by backing out
SSEP from SSE:
$$SSELOF = SSE - SSEP$$
Degrees of Freedom:
- for SSEP:
$$\left( \sum_{i=1}^{m} n_i \right) - m$$
- for SSELOF:
$$N - p - \left( \left( \sum_{i=1}^{m} n_i \right) - m \right)$$
The Lack of Fit Test
The test ratio:
$$\frac{MSELOF}{MSEP} = \frac{SSELOF / \nu_{LOF}}{SSEP / \nu_{Pure}}$$
Compare to $F_{\nu_{LOF},\,\nu_{Pure},\,0.95}$
» significant? - there is significant unmodeled trend, and
model should be modified
» not significant? - there is no significant unmodeled trend,
which supports model adequacy
Example - Wave Solder Defects
From earlier regression, SSE = 2694.0 and SSR = 25306.5
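LACK OF FIT TEST
ANOVA
| Source | df | SS | MS | F | value from F-table (95% pt) |
|---|---|---|---|---|---|
| Residual | 7 | 2694.045 | | | |
| LOF | 5 | 2500.045 | 500.0091 | 5.154733 | 19.3 (this is $F_{5,2,0.95}$) |
| Pure Error | 2 | 194 | 97 | | |

Replicate Set
| Run | Conveyor Speed | Pot Temp | Flux Density | No. of Defects |
|---|---|---|---|---|
| 9 | 0 | 0 | 0 | 101 |
| 10 | 0 | 0 | 0 | 96 |
| 11 | 0 | 0 | 0 | 115 |

std. devn 9.848858
sample var 97
sum of sq 194 (as $(n_i-1)s^2$)
This was done by hand - Excel has no Lack of Fit test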
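The hand calculation above can be scripted. The sketch below takes the residual sum of squares (2694.045) and the tabulated F value (19.3) from the example as given, and recomputes the pure error and lack of fit pieces from the replicate runs; the function and variable names are illustrative.

```python
# Quantities quoted in the example
SSE = 2694.045           # residual sum of squares from the regression
N, p = 11, 4             # runs and model parameters
F_TABLE = 19.3           # F_{5,2,0.95} as quoted in the example

# Replicate sets: runs 9-11, all at the centre point (0, 0, 0)
replicate_sets = [[101, 96, 115]]


def pure_error(sets):
    """Pure error sum of squares and its degrees of freedom."""
    ssep = 0.0
    df = 0
    for values in sets:
        mean = sum(values) / len(values)
        ssep += sum((v - mean) ** 2 for v in values)
        df += len(values) - 1
    return ssep, df


SSEP, df_pure = pure_error(replicate_sets)
SSELOF = SSE - SSEP
df_lof = (N - p) - df_pure

F_ratio = (SSELOF / df_lof) / (SSEP / df_pure)
lack_of_fit = F_ratio > F_TABLE

print(f"SSEP = {SSEP:.0f} (df {df_pure}), SSELOF = {SSELOF:.3f} (df {df_lof})")
print(f"F = {F_ratio:.4f}, significant lack of fit: {lack_of_fit}")
```

Since the ratio is below the tabulated value, this calculation gives no evidence of significant unmodeled trend, although with only 2 degrees of freedom for pure error the test is not very sensitive.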
A Comment on the Ratio Tests
Order of Preference (or “value”) - from most definitive to
least definitive:
• Lack of Fit Test -- MSELOF/MSEP
• MSE/$s^2_{inherent}$
• MSR/MSE
If at all possible, try to include replicate runs in your experimental
program so that the Lack of Fit test can be conducted
Many statistical software packages will perform the Lack of Fit test
in their Regression modules - Excel does NOT
The Parameter Estimate Covariance Matrix
…summarizes the variance-covariance structure of the parameter
estimates
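$$
\Sigma =
\begin{bmatrix}
\mathrm{Var}(\hat{\beta}_1) & \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) & \cdots & \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_p) \\
\mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) & \mathrm{Var}(\hat{\beta}_2) & \cdots & \mathrm{Cov}(\hat{\beta}_2, \hat{\beta}_p) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_p) & \mathrm{Cov}(\hat{\beta}_2, \hat{\beta}_p) & \cdots & \mathrm{Var}(\hat{\beta}_p)
\end{bmatrix}
$$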
Properties of the Covariance Matrix
• symmetric -- Cov(b1,b2) = Cov(b2,b1)
• diagonal entries are always non-negative
• off-diagonal entries can be +ve or -ve
• matrix is positive definite: for any non-zero vector v,
$$v^T \Sigma\, v > 0$$
Parameter Estimate Covariance Matrix
The covariance matrix of the parameter estimates is defined as:
$$\Sigma = E\{ (\hat{\beta} - \beta)(\hat{\beta} - \beta)^T \}$$
Compare expression with variance for single parameter:
$$\mathrm{Var}(\hat{\beta}) = E\{ (\hat{\beta} - \beta)^2 \}$$
For linear regression, the covariance matrix is obtained as:
$$\Sigma = (X^T X)^{-1} \sigma_\varepsilon^2$$
Parameter Estimate Covariance Matrix
Key point - the covariance structure of the parameter estimates is
governed by the experimental run conditions used for the
explanatory variables -
the Experimental Design
Example - the Wave Solder Defects data
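$$
X^T X = \begin{bmatrix} 11 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 8 & 0 \\ 0 & 0 & 0 & 8 \end{bmatrix};
\qquad
(X^T X)^{-1} = \begin{bmatrix} \frac{1}{11} & 0 & 0 & 0 \\ 0 & \frac{1}{8} & 0 & 0 \\ 0 & 0 & \frac{1}{8} & 0 \\ 0 & 0 & 0 & \frac{1}{8} \end{bmatrix}
$$
Parameter estimates are uncorrelated, and variances of the
non-intercept parameters are the same
- towards “uniform precision” of parameter estimates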
Estimating the Parameter Covariance Matrix
The X matrix is known - set of run conditions - so the only
estimated quantity is the inherent noise variance
» from replicates, external estimate, or MSE
For the wave solder defect data, the residual variance from the MSE
is 384.86 with 7 degrees of freedom, and the parameter
covariances are:
$$
\hat{\Sigma} = (X^T X)^{-1} s_e^2
= \begin{bmatrix} \frac{1}{11} & 0 & 0 & 0 \\ 0 & \frac{1}{8} & 0 & 0 \\ 0 & 0 & \frac{1}{8} & 0 \\ 0 & 0 & 0 & \frac{1}{8} \end{bmatrix} (384.86)
= \begin{bmatrix} 34.99 & 0 & 0 & 0 \\ 0 & 48.11 & 0 & 0 \\ 0 & 0 & 48.11 & 0 \\ 0 & 0 & 0 & 48.11 \end{bmatrix}
$$
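The estimated covariance matrix can be assembled numerically as a check. This sketch assumes NumPy and uses the MSE as the noise variance estimate, as in the example; variable names are illustrative. It also forms the correlation matrix of the estimates by dividing each covariance by the two standard errors.

```python
import numpy as np

design = np.array([
    [-1, -1, -1], [1, -1, -1], [-1, 1, -1], [1, 1, -1],
    [-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1],
    [0, 0, 0], [0, 0, 0], [0, 0, 0],
], dtype=float)
defects = np.array([100, 119, 118, 217, 20, 42, 41, 113, 101, 96, 115], dtype=float)

X = np.column_stack([np.ones(len(defects)), design])
N, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ defects

# Noise variance estimated by the MSE, with N - p degrees of freedom
resid = defects - X @ beta_hat
s2 = float(resid @ resid) / (N - p)

cov = XtX_inv * s2                       # estimated covariance matrix
std_err = np.sqrt(np.diag(cov))          # standard errors of the estimates
corr = cov / np.outer(std_err, std_err)  # correlations between estimates

print("variances:", np.diag(cov))
print("standard errors:", std_err)
print("correlation matrix:\n", corr)
```

For this design the correlation matrix is the identity, which reflects the uncorrelated parameter estimates noted above.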
Using the Covariance Matrix
Variances of parameter estimates
» are obtained from the diagonal of the matrix
» square root is the standard dev’n, or “standard error”, of
the parameter estimates
• use to formulate confidence intervals for the parameters
• use in hypothesis tests for the parameters
Correlations between the parameter estimates
» can be obtained by taking covariance from appropriate
off-diagonal element, and dividing by the standard errors
of the individual parameter estimates
Correlation of the Parameter Estimates
Note that
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$$
I.e., the parameter estimate for the intercept depends
linearly on the slope!
» the slope and intercept estimates are correlated
» changing slope changes point of intersection with
axis because the line must go through the centroid of the
data
Getting Rid of the Covariance
Let’s define the explanatory variable as the deviation
from its average:
$$Z = X - \bar{X}$$
Least Squares parameter estimates:
$$\hat{\beta}_0 = \bar{Y}, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{N} z_i Y_i}{\sum_{i=1}^{N} z_i^2}$$
- note that now there is no explicit dependence on the slope value
in the intercept expression
- average of z is zero
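The effect of centring can be seen numerically with the solder thickness data. The sketch below (NumPy assumed, names illustrative) fits the straight line with the raw temperature and with the temperature expressed as a deviation from its average, and compares the correlation between the intercept and slope estimates. The noise variance cancels in that correlation, so only the inverse of X^T X is needed.

```python
import numpy as np

temperature = np.array([245, 215, 218, 265, 251, 213, 234, 257, 244, 225], dtype=float)
thickness = np.array([171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8])


def fit_and_correlation(x):
    """Least squares fit of thickness on [1, x], plus corr(intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ thickness
    corr = XtX_inv[0, 1] / np.sqrt(XtX_inv[0, 0] * XtX_inv[1, 1])
    return beta_hat, float(corr)


beta_raw, corr_raw = fit_and_correlation(temperature)
beta_centred, corr_centred = fit_and_correlation(temperature - temperature.mean())

print("raw:     beta =", beta_raw, " corr =", round(corr_raw, 4))
print("centred: beta =", beta_centred, " corr =", round(corr_centred, 4))
```

The slope is unchanged by centring, while the intercept becomes the average thickness and its estimate is no longer correlated with the slope estimate.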
Getting Rid of the Covariance
In this form of the model, the slope and intercept
parameter estimates are uncorrelated
Why is lack of correlation useful?
» allows independent decisions about parameter estimates
» decide whether slope is significant, intercept is significant
individually
» “unique” assignment of trend
• intercept clearly associated with mean of y’s
• slope clearly associated with steepness of trend
» correlation can be eliminated by altering form of model,
and choice of experimental points
Confidence Intervals for Parameters
…similar procedure to straight line case:
» given standard error for parameter estimate, use
appropriate t-value, and form interval as:
$$\hat{\beta}_i \pm t_{\nu,\alpha/2}\, s_{\hat{\beta}_i}$$
The degrees of freedom for the t-statistic come from the
estimate of the inherent noise variance
» the degrees of freedom will be the same for all of the
parameter estimates
If the confidence interval contains zero, the parameter is plausibly
zero and consideration should be given to deleting the term.
Hypothesis Tests for Parameters
…represent an alternative approach to testing whether the term
should be retained in the model
Null hypothesis - parameter = 0
Alternate hypothesis - parameter is not equal to 0
Test statistic:
$$\frac{\hat{\beta}_i}{s_{\hat{\beta}_i}}$$
» compare absolute value to $t_{\nu,\alpha/2}$
» if test statistic is greater (“outside the fence”), parameter
is significant -- retain
» inside the fence? - consider deleting the term
Example - Wave Solder Defects Data
Test statistic will be compared to $t_{7,0.025} = 2.365$
because MSE is used to calculate standard errors of parameters,
and has 7 degrees of freedom.
Test statistic for intercept:
$$\frac{\hat{\beta}_0}{s_{\hat{\beta}_0}} = \frac{98.36}{\sqrt{34.99}} = 16.63$$
Since 16.63 > 2.365, conclude that intercept parameter IS
significant and should be retained.
Example - Wave Solder Defects Data
For the next term in the model:
$$\frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \frac{26.5}{\sqrt{48.11}} = 3.82 > 2.365$$
Therefore this term should be retained in the model.
Because the parameter estimates are uncorrelated in this model,
terms can be dropped without the need to re-estimate the other
parameters in the model -- in general, you will have to
re-estimate the final model once more to obtain the parameter
estimates corresponding to the final model form.
Example - Wave Solder Defects Data
From Excel:
| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 98.36363636 | 5.915031978 | 16.62943 | 6.948E-07 | 84.376818 | 112.3505 |
| Conveyor Speed | 26.5 | 6.935989803 | 3.820652 | 0.0065367 | 10.099002 | 42.901 |
| Pot Temp | 26 | 6.935989803 | 3.748564 | 0.0071817 | 9.599002 | 42.401 |
| Flux Density | -42.25 | 6.935989803 | -6.09142 | 0.0004953 | -58.651 | -25.849 |

• Standard Error - standard dev’ns. of each parameter estimate
• t Stat - test statistic for each parameter
• P-value - prob. that a value is greater than computed test ratio - 2-tailed test!
• Lower 95%, Upper 95% - confidence limits
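The t statistics and confidence limits in the Excel output follow directly from the coefficients and standard errors. The sketch below recomputes them from the tabulated values, using the rounded critical value 2.365 quoted earlier for 7 degrees of freedom, so the limits may differ from Excel's in the third decimal place. The dictionary layout is illustrative.

```python
T_CRIT = 2.365  # t_{7,0.025} as quoted in the example

# (estimate, standard error) pairs taken from the Excel output
terms = {
    "Intercept": (98.36363636, 5.915031978),
    "Conveyor Speed": (26.5, 6.935989803),
    "Pot Temp": (26.0, 6.935989803),
    "Flux Density": (-42.25, 6.935989803),
}

results = {}
for name, (estimate, std_err) in terms.items():
    t_stat = estimate / std_err
    half_width = T_CRIT * std_err
    results[name] = {
        "t": t_stat,
        "lower": estimate - half_width,
        "upper": estimate + half_width,
        "retain": abs(t_stat) > T_CRIT,  # "outside the fence"
    }

for name, res in results.items():
    print(f"{name:15s} t = {res['t']:8.3f}  "
          f"95% limits = ({res['lower']:.2f}, {res['upper']:.2f})  "
          f"retain = {res['retain']}")
```

A term whose interval contains zero would have `retain` equal to `False`; in this example all four terms are retained.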
Precision of the Predicted Responses
The predicted response from an estimated model has uncertainty,
because it is a function of the parameter estimates which have
uncertainty:
e.g., Wave Solder Defects Model - first response at the point -1,-1,-1
$$\hat{y}_1 = \hat{\beta}_0 + \hat{\beta}_1(-1) + \hat{\beta}_2(-1) + \hat{\beta}_3(-1)$$
If the parameter estimates were uncorrelated, the variance of the
predicted response would be:
$$\mathrm{Var}(\hat{y}_1) = \mathrm{Var}(\hat{\beta}_0) + \mathrm{Var}(\hat{\beta}_1) + \mathrm{Var}(\hat{\beta}_2) + \mathrm{Var}(\hat{\beta}_3)$$
(recall results for variance of sum of random variables)
Precision of the Predicted Responses
In general, both the variances and covariances of the parameter
estimates must be taken into account.
For prediction at the k-th data point:
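$$
\mathrm{Var}(\hat{y}_k) = x_k^T (X^T X)^{-1} x_k\, \sigma_\varepsilon^2
= \begin{bmatrix} x_{k1} & x_{k2} & \cdots & x_{kp} \end{bmatrix}
(X^T X)^{-1}
\begin{bmatrix} x_{k1} \\ x_{k2} \\ \vdots \\ x_{kp} \end{bmatrix}
\sigma_\varepsilon^2
$$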
Example - Wave Solder Defects Model
In this example, the parameter estimates are uncorrelated
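» $X^T X$ is diagonal
» variance of the predicted response is in fact the sum of the
variances of the parameter estimates, each weighted by the square
of the corresponding explanatory variable value
Variance of prediction at run #11 (0,0,0):
$$
\mathrm{Var}(\hat{y}_{11}) = \mathrm{Var}(\hat{\beta}_0) + \mathrm{Var}(\hat{\beta}_1)(0)^2 + \mathrm{Var}(\hat{\beta}_2)(0)^2 + \mathrm{Var}(\hat{\beta}_3)(0)^2
= \mathrm{Var}(\hat{\beta}_0)
$$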
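The general quadratic form for the prediction variance can be evaluated at any run. The following sketch assumes NumPy and takes the noise variance estimate 384.86 from the example; `prediction_variance` is an illustrative helper, and each `x_k` includes the leading 1 for the intercept.

```python
import numpy as np

design = np.array([
    [-1, -1, -1], [1, -1, -1], [-1, 1, -1], [1, 1, -1],
    [-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1],
    [0, 0, 0], [0, 0, 0], [0, 0, 0],
], dtype=float)
X = np.column_stack([np.ones(len(design)), design])

S2_E = 384.86  # noise variance estimate (MSE) quoted in the example
XtX_inv = np.linalg.inv(X.T @ X)


def prediction_variance(x_k):
    """Estimated variance of the predicted response: x_k^T (X^T X)^-1 x_k * s_e^2."""
    x_k = np.asarray(x_k, dtype=float)
    return float(x_k @ XtX_inv @ x_k) * S2_E


var_run11 = prediction_variance([1, 0, 0, 0])     # centre point
var_run1 = prediction_variance([1, -1, -1, -1])   # corner of the design

print(f"run 11: {var_run11:.2f}, run 1: {var_run1:.2f}")
```

The prediction at the centre point is the most precise; at a corner of the design the variances of all four parameter estimates contribute.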
Precision of “Future” Predictions
Suppose we want to predict the response at conditions other than
those of the experimental runs --> future run.
The value we observe will consist of the deterministic component
plus the noise component.
In predicting this value, we must consider:
» uncertainty from our prediction of the deterministic
component
» noise component
The variance of this future prediction is
$$\mathrm{Var}(\hat{y}_{future}) + \sigma_\varepsilon^2$$
where $\mathrm{Var}(\hat{y}_{future})$ is computed using the same expression
as for the variance of predicted responses at experimental run conditions
103
Estimating Precision of Predicted Responses
Use an estimate of the inherent noise variance:
$$s_{\hat{y}_k}^2 = \mathbf{x}_k^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_k \, s_e^2$$
The degrees of freedom for the estimated variance of the predicted
response are those of the estimate of the noise variance
» replicates, external estimate, MSE
104
Confidence Limits for Predicted Responses
Follow an approach similar to that for parameters - $100(1-\alpha)\%$
confidence limits for predicted response at the k-th run are:
$$\hat{y}_k \pm t_{\nu,\alpha/2} \, s_{\hat{y}_k}$$
» degrees of freedom are those of the inherent noise
variance estimate
If the prediction is for a response at conditions OTHER than one of
the experimental runs, the limits are:
$$\hat{y}_{future} \pm t_{\nu,\alpha/2} \sqrt{s_{\hat{y}_{future}}^2 + s_e^2}$$
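The two sets of limits can be compared numerically. This is a sketch under stated assumptions: the wave solder design is taken to be a 2^3 factorial plus three centre points (so X^T X = diag(11, 8, 8, 8)), the noise variance estimate is backed out of the intercept standard error, t(7, 0.025) = 2.364624 comes from a t-table, and the point (0.5, 0.5, 0.5) is a hypothetical future run.

```python
import math

BETA_HAT = [98.36363636, 26.5, 26.0, -42.25]   # from the Excel output
XTX_DIAG = [11.0, 8.0, 8.0, 8.0]               # assumed design, see lead-in
S2_E = 11.0 * 5.915031978 ** 2                 # implied noise variance estimate
T_CRIT = 2.364624                              # tabulated t_{7, 0.025}


def limits(x, future=False):
    """Confidence limits for the response at regressor vector x.

    With future=True the noise variance is added to the variance of the
    predicted response, as for a run that has not been performed yet.
    """
    y_hat = sum(b * xj for b, xj in zip(BETA_HAT, x))
    s2_pred = S2_E * sum(xj * xj / dj for xj, dj in zip(x, XTX_DIAG))
    s2_total = s2_pred + S2_E if future else s2_pred
    half_width = T_CRIT * math.sqrt(s2_total)
    return y_hat - half_width, y_hat + half_width


centre_limits = limits([1.0, 0.0, 0.0, 0.0])
new_point = [1.0, 0.5, 0.5, 0.5]               # hypothetical future conditions
mean_limits = limits(new_point)
future_limits = limits(new_point, future=True)
print(centre_limits, mean_limits, future_limits)
```

The limits for a future observation are always wider, because the noise term does not shrink no matter how well the deterministic component is estimated.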
105
Practical Guidelines for Model Development
1) Consider CODING your explanatory variables
Coding - one standard form:
$$\tilde{x}_i = \frac{x_i - \bar{x}_i}{\tfrac{1}{2}\,\mathrm{range}(x_i)}$$
» places designed experiment into +1,-1 form
» if run conditions are from an experimental design, this
coding must be used in order to obtain all of the benefits
from the design - uncorrelated parameter estimates
» if conditions are not from an experimental design, such a
coding improves numerical conditioning of the problem --
similar numerical scales for all variables
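The coding formula can be sketched as a small function. Two assumptions are made here: the centre value is taken as the midpoint of the observed range (for a balanced two-level design this equals the mean), and the pot temperature settings are hypothetical values chosen for illustration.

```python
def code_variable(values):
    """Map values onto a scale where the lowest is -1 and the highest is +1."""
    low, high = min(values), max(values)
    centre = (low + high) / 2.0       # assumption: centre = midpoint of range
    half_range = (high - low) / 2.0
    if half_range == 0:
        raise ValueError("cannot code a variable that never changes")
    return [(v - centre) / half_range for v in values]


pot_temp = [240.0, 260.0, 240.0, 260.0, 250.0]   # hypothetical settings
coded = code_variable(pot_temp)
print(coded)
```

After coding, all explanatory variables share the same numerical scale, so the sizes of the parameter estimates can be compared directly.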
106
Practical Guidelines for Model Development
2) Types of models -
» linear in the explanatory variables
» linear with two-factor interactions ($x_i x_j$)
» general polynomials
3) Watch for collinearity in the X matrix - run condition patterns for
two or more explanatory variables are almost the same
» prevents clear assignment of trend to each factor
» shows up as singularity (or near-singularity) in the $\mathbf{X}^T\mathbf{X}$ matrix
» associated with very strong correlation between
parameter estimates
107
Practical Guidelines for Model Development
4) Be careful not to extrapolate excessively beyond the range of
the data
5) Maximum number of parameters that can be fit to a data set =
number of unique run conditions
$$N - \left( \sum_{i=1}^{m} n_i - m \right)$$
» N - number of data points
» m - number of replicate sets
» $n_i$ - number of points in replicate set “i”
» as number of parameters increases, precision of
predictions decreases - start modeling noise
108
Practical Guidelines for Model Development
6) Model building sequence
» “building” approach - start with few terms and add as
necessary
» “pruning” approach - start with more terms and remove
those which aren’t statistically significant
» stepwise regression - terms are added, and retained
according to some criterion - frequently $R^2$
• uncorrelated? criterion?
» “all subsets” regression - consider all subsets of model
terms of certain type, and select model with best criterion
• significant computational load
109
Polynomial Models
Order - maximum over the p terms in the model of the sum of the
exponents in a given term
e.g.,
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^2 + \beta_3 x_1^2 x_2^3 + \varepsilon$$
is a fifth-order model
Two factor interaction -
» product term - $x_1 x_2$
» implies that impact of $x_1$ on response depends on value
of $x_2$
110
Polynomial Models
» polynomial models can sometimes suffer from collinearity
problems - coding helps this
» polynomials can provide approximations to nonlinear
functions - think of Taylor series approximations
» high-order polynomial models can sometimes be
replaced by fewer nonlinear function terms
• e.g., ln(x) vs. 3rd order polynomial
111
Joint Confidence Region (JCR)
Where do the true values of the parameters lie?
Recall that for individual parameters, we gain an understanding of
where the true value lies by:
» examining the variability pattern (distribution) for the
parameter estimate
» identify a range in which most of the values of the
parameter estimate are likely to lie
» manipulate this range to determine an interval which is
likely to contain the true value of the parameter
112
Joint Confidence Region
Confidence interval for individual parameter:
Step 1) The ratio of the estimation error to the standard deviation of
the estimate is distributed as a Student’s t-distribution with degrees
of freedom equal to those of the variance estimate:
$$\frac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}} \sim t_\nu$$
Step 2) Find the interval $[-t_{\nu,\alpha/2},\ t_{\nu,\alpha/2}]$ which contains $100(1-\alpha)\%$
of values - i.e., probability of a t-value falling in this interval is $(1-\alpha)$
Step 3) Rearrange this interval to obtain the interval
$$\hat{\beta}_i \pm t_{\nu,\alpha/2} \, s_{\hat{\beta}_i}$$
which contains the true value of the parameter $100(1-\alpha)\%$ of the time
113
Joint Confidence Region
Comments on Individual Confidence Intervals:
» sometimes referred to as marginal confidence intervals -
cf. marginal distributions vs. joint distributions from earlier
» marginal confidence intervals do NOT account for
correlations between the parameter estimates
» examining only marginal confidence intervals can
sometimes be misleading if there is strong correlation
between several parameter estimates
• value of one parameter estimate depends in part on another
• deletion of the other changes the value of the parameter
estimate
• decision to retain might be altered
114
Joint Confidence Region
Sequence:
Step 1) Identify a statistic which is a function of the parameter
estimate statistics
Step 2) Identify a region in which values of this statistic lie a certain
fraction of the time (a $100(1-\alpha)\%$ region)
Step 3) Use this information to determine a region which contains
the true value of the parameters $100(1-\alpha)\%$ of the time
115
Joint Confidence Region
The quantity
$$\frac{(\hat{\beta} - \beta)^T \mathbf{X}^T\mathbf{X} (\hat{\beta} - \beta) / p}{s_\varepsilon^2} \sim F_{p,\,n-p}$$
is the ratio of two sums of squares, and is distributed as an F-
distribution with p degrees of freedom in the numerator, and n-p
degrees of freedom in the denominator
» $s_\varepsilon^2$ - estimate of inherent noise variance
(if MSE is used, degrees of freedom is n-p)
116
Joint Confidence Region
We can define a region by thinking of those values of the ratio
which have a value less than $F_{p,n-p,1-\alpha}$
i.e.,
$$\frac{(\hat{\beta} - \beta)^T \mathbf{X}^T\mathbf{X} (\hat{\beta} - \beta) / p}{s_\varepsilon^2} \le F_{p,n-p,1-\alpha}$$
Rearranging yields:
$$(\hat{\beta} - \beta)^T \mathbf{X}^T\mathbf{X} (\hat{\beta} - \beta) \le p \, s_\varepsilon^2 \, F_{p,n-p,1-\alpha}$$
117
Joint Confidence Region - Definition
The $100(1-\alpha)\%$ joint confidence region for the parameters $\beta$ is
defined as those parameter values satisfying:
$$(\hat{\beta} - \beta)^T \mathbf{X}^T\mathbf{X} (\hat{\beta} - \beta) \le p \, s_\varepsilon^2 \, F_{p,n-p,1-\alpha}$$
Interpretation:
» the region defined by this inequality contains the true
values of the parameters $100(1-\alpha)\%$ of the time
» if values of zero for one or more parameters lie in this
region, those parameters are plausibly zero, and
consideration should be given to dropping the
corresponding terms from the model
118
Joint Confidence Region - Example with 2 Parameters
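Let’s reconsider the solder thickness example:
$$\mathbf{X}^T\mathbf{X} = \begin{bmatrix} 10 & 2367 \\ 2367 & 563335 \end{bmatrix}; \quad \hat{\beta} = \begin{bmatrix} 458.10 \\ -1.13 \end{bmatrix}; \quad s_\varepsilon^2 = 135.38$$
95% Joint Confidence Region (JCR) for slope & intercept:
$$(\hat{\beta} - \beta)^T \mathbf{X}^T\mathbf{X} (\hat{\beta} - \beta) = \begin{bmatrix} \hat{\beta}_0 - \beta_0 & \hat{\beta}_1 - \beta_1 \end{bmatrix} \mathbf{X}^T\mathbf{X} \begin{bmatrix} \hat{\beta}_0 - \beta_0 \\ \hat{\beta}_1 - \beta_1 \end{bmatrix} \le p \, s_\varepsilon^2 \, F_{p,n-p,1-\alpha} = 2 \, s_\varepsilon^2 \, F_{2,10-2,0.95}$$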
119
Joint Confidence Region - Example with 2 Parameters
95% Joint Confidence Region (JCR) for slope & intercept:
$$\begin{bmatrix} 458.10 - \beta_0 & -1.13 - \beta_1 \end{bmatrix} \mathbf{X}^T\mathbf{X} \begin{bmatrix} 458.10 - \beta_0 \\ -1.13 - \beta_1 \end{bmatrix} \le 2(135.38) F_{2,8,0.95} = 2(135.38)(4.46) = 1207.59$$
The boundary is an ellipse...
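Whether a candidate pair of parameter values lies in this region can be checked directly from the inequality. The sketch below uses only the numbers quoted for the solder thickness example; the function names and the candidate points are illustrative choices.

```python
XTX = [[10.0, 2367.0], [2367.0, 563335.0]]
BETA_HAT = [458.10, -1.13]      # intercept, slope
S2 = 135.38                     # noise variance estimate
F_CRIT = 4.46                   # F_{2,8,0.95} as quoted in the example
P = 2                           # number of parameters

BOUND = P * S2 * F_CRIT


def quadratic_form(beta):
    """(beta_hat - beta)^T X^T X (beta_hat - beta) for the 2-parameter case."""
    d0 = BETA_HAT[0] - beta[0]
    d1 = BETA_HAT[1] - beta[1]
    return XTX[0][0] * d0 * d0 + 2.0 * XTX[0][1] * d0 * d1 + XTX[1][1] * d1 * d1


def in_jcr(beta):
    """True if beta lies inside (or on) the 95% joint confidence region."""
    return quadratic_form(beta) <= BOUND


print(f"bound = {BOUND:.2f}")
print(in_jcr([458.10, -1.13]))   # the least squares estimates themselves
print(in_jcr([480.0, -1.22]))    # larger intercept paired with a steeper slope
print(in_jcr([458.10, 0.0]))     # slope of zero at the estimated intercept
```

The second point is accepted because it moves along the long axis of the ellipse; the third is rejected, so a zero slope is not plausible at the estimated intercept.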
120
Joint Confidence Region - Example with 2 Parameters
[Figure: joint confidence region (ellipse); Intercept on the horizontal axis (320 to 600), Slope on the vertical axis (-1.6 to -0.6)]
» rotated - implies correlation between estimates of slope
and intercept
» centred at least squares parameter estimates
» greater “shadow” along horizontal axis --> variance of
intercept estimate is greater than that of slope
121
Interpreting Joint Confidence Regions
1) Are axes of the ellipse aligned with coordinate axes?
» is ellipse horizontal or vertical?
» if so, indicates no correlation between parameter estimates
2) Which axis has the greatest shadow?
» projection of ellipse along axis
» indicates which parameter estimate has the greatest
variance
3) The elliptical region is, by definition, centred at the least squares
parameter estimates
4) Long, narrow, rotated ellipses indicate significant correlation
between parameter estimates
5) If a value of zero for one or more parameters lies in the region,
these parameters are plausibly zero - consider deleting from
model
122
Joint Confidence Regions
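What is the motivation for the ratio
$$\frac{(\hat{\beta} - \beta)^T \mathbf{X}^T\mathbf{X} (\hat{\beta} - \beta)}{p \, s_\varepsilon^2}$$
used to define the joint confidence region?
Consider the joint distribution for the parameter estimates:
$$\frac{1}{(2\pi)^{p/2} \det(\Sigma_{\hat{\beta}})^{1/2}} \exp\left\{ -\frac{1}{2} (\hat{\beta} - \beta)^T \Sigma_{\hat{\beta}}^{-1} (\hat{\beta} - \beta) \right\}$$
Substitute in estimate for
parameter covariance matrix:
$$(\hat{\beta} - \beta)^T \left( (\mathbf{X}^T\mathbf{X})^{-1} s_\varepsilon^2 \right)^{-1} (\hat{\beta} - \beta) = \frac{(\hat{\beta} - \beta)^T \mathbf{X}^T\mathbf{X} (\hat{\beta} - \beta)}{s_\varepsilon^2}$$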
123
Confidence Intervals from Densities
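[Figure: two panels.
Individual Interval - density $f_{\hat{\beta}}(b)$ plotted against $b$; area between the lower and upper limits = 1-alpha.
Joint Region - joint density $f_{\hat{\beta}_0 \hat{\beta}_1}(b_0, b_1)$ over the $(b_0, b_1)$ plane; volume over the Joint Confidence Region = 1-alpha.]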
124
Relationship to Marginal Confidence Limits
[Figure: joint confidence region; Intercept on the horizontal axis (320 to 600), Slope on the vertical axis (-1.6 to -0.6); region centred at least squares parameter estimates, with the marginal confidence interval for intercept and the marginal confidence interval for slope marked]
125
Relationship to Marginal Confidence Limits
[Figure: same axes, Intercept (320 to 600) and Slope (-1.6 to -0.6), with the marginal confidence intervals for intercept and slope marked; the plot compares the 95% confidence region for parameters considered jointly with the 95% confidence region implied by considering parameters individually]
126
Relationship to Marginal Confidence Intervals
Marginal confidence intervals are contained within the extent
(“shadow”) of the joint confidence region along each axis
» potential to miss portions of plausible parameter values
at tails of ellipsoid
» using individual confidence intervals implies a
rectangular region, which includes sets of parameter
values that lie outside the joint confidence region
» both situations can lead to
• erroneous acceptance of terms in model
• erroneous rejection of terms in model
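Both situations can be demonstrated with the solder thickness numbers. This is a sketch under assumptions: the marginal limits use t(8, 0.025) = 2.306 from a t-table (8 degrees of freedom is implied by n = 10, p = 2), and the two test points are constructed for illustration. Points are expressed as offsets from the least squares estimates.

```python
import math

XTX = [[10.0, 2367.0], [2367.0, 563335.0]]
S2 = 135.38        # noise variance estimate from the example
F_CRIT = 4.46      # F_{2,8,0.95} as quoted in the example
T_CRIT = 2.306     # tabulated t_{8, 0.025} (assumption, see lead-in)

# Inverse of the 2x2 matrix X^T X.
det = XTX[0][0] * XTX[1][1] - XTX[0][1] * XTX[1][0]
inv = [[XTX[1][1] / det, -XTX[0][1] / det],
       [-XTX[1][0] / det, XTX[0][0] / det]]

# Half-widths of the marginal 95% intervals (intercept, slope).
half_widths = [T_CRIT * math.sqrt(S2 * inv[j][j]) for j in range(2)]
bound = 2 * S2 * F_CRIT


def in_rectangle(delta):
    """Inside both marginal intervals (offsets from the estimates)."""
    return all(abs(d) <= h for d, h in zip(delta, half_widths))


def in_jcr(delta):
    """Inside the joint confidence region (offsets from the estimates)."""
    d0, d1 = delta
    q = XTX[0][0] * d0 * d0 + 2.0 * XTX[0][1] * d0 * d1 + XTX[1][1] * d1 * d1
    return q <= bound


# Corner of the rectangle: both parameters at their upper marginal limits.
corner = list(half_widths)
# Point 90% of the way to where the ellipse reaches its largest intercept.
scale = 0.9 * math.sqrt(bound / inv[0][0])
tail = [scale * inv[0][0], scale * inv[1][0]]

print("corner:", in_rectangle(corner), in_jcr(corner))
print("tail:  ", in_rectangle(tail), in_jcr(tail))
```

The corner passes both marginal checks but lies far outside the ellipse, while the tail point is jointly plausible yet outside both marginal intervals.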
127
Going Further - Nonlinear Regression Models
Model:
$$Y_i = \eta(\mathbf{x}_i, \theta) + \varepsilon_i$$
» $\mathbf{x}_i$ - explanatory variables
» $\theta$ - parameters
» $\varepsilon_i$ - random noise component
Estimation Approach:
» linearize model with respect to parameters
» treat linearization as a linear regression problem
» iterate by repeating linearization/estimation/linearization
about new estimates,… until convergence to parameter
values - Gauss-Newton iteration - or solve numerical
optimization problem
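The linearize/estimate/repeat loop can be sketched for a two-parameter model. Everything specific here is an assumption made for illustration: the model form eta(x, theta) = theta1 * exp(-theta2 * x), the noise-free synthetic data, the starting guess, and the absence of any step-size control, which a practical implementation would normally add.

```python
import math


def model(x, theta):
    """Assumed illustrative model: theta1 * exp(-theta2 * x)."""
    return theta[0] * math.exp(-theta[1] * x)


def sensitivities(x, theta):
    """Derivatives of the model with respect to theta1 and theta2."""
    e = math.exp(-theta[1] * x)
    return [e, -theta[0] * x * e]


def gauss_newton(xs, ys, theta, max_iter=50, tol=1e-10):
    theta = list(theta)
    for _ in range(max_iter):
        # Linearize about the current estimates.
        rows = [sensitivities(x, theta) for x in xs]
        resid = [y - model(x, theta) for x, y in zip(xs, ys)]
        # Linear least squares for the update: (J^T J) step = J^T r.
        a11 = sum(r[0] * r[0] for r in rows)
        a12 = sum(r[0] * r[1] for r in rows)
        a22 = sum(r[1] * r[1] for r in rows)
        b1 = sum(r[0] * e for r, e in zip(rows, resid))
        b2 = sum(r[1] * e for r, e in zip(rows, resid))
        det = a11 * a22 - a12 * a12
        step = [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]
        theta = [t + s for t, s in zip(theta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return theta


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
true_theta = [2.0, 0.5]
ys = [model(x, true_theta) for x in xs]      # noise-free synthetic data
theta_hat = gauss_newton(xs, ys, [1.5, 0.3])
print(theta_hat)
```

Each pass solves a linear regression in which the sensitivity matrix J plays the role of X, which is why the linear results in this module carry over approximately to the nonlinear case.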
Interpretation - Columns of X
– values of a given variable at different operating points
– entries in $\mathbf{X}^T\mathbf{X}$
» dot products of vectors of regressor variable values
» related to correlation between regressor variables
– form of $\mathbf{X}^T\mathbf{X}$ is dictated by experimental design
• e.g., $2^k$ design - diagonal form
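The diagonal form is easy to verify by computing the dot products directly. The design below is an assumption chosen for illustration: a 2^3 factorial in coded units with three added centre points and an intercept column.

```python
from itertools import product

# Assumed design: 2^3 factorial in coded units plus three centre points.
factorial = [list(levels) for levels in product([-1.0, 1.0], repeat=3)]
centre = [[0.0, 0.0, 0.0] for _ in range(3)]
X = [[1.0] + run for run in factorial + centre]   # leading 1.0 = intercept


def xtx(X):
    """Each entry of X^T X is the dot product of two columns of X."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)]
            for i in range(p)]


XTX = xtx(X)
for row in XTX:
    print(row)
```

All off-diagonal dot products are zero because the columns of the coded design are orthogonal, which is what makes the parameter estimates uncorrelated.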
Parameter Estimation - Graphical View
[Figure: observations vector $\mathbf{y}$, approximating observation vector $\hat{\mathbf{y}}$, and the residual vector between them]
Parameter Estimation - Nonlinear Regression Case
[Figure: observations vector $\mathbf{y}$, approximating observation vector $\hat{\mathbf{y}}$ lying on the model surface, and the residual vector between them]
Properties of LS Parameter Estimates
Key Point - parameter estimates are random variables
» because of how stochastic variation in data propagates
through estimation calculations
» parameter estimates have a variability pattern -
probability distribution and density functions
Unbiased
» “average” of repeated data collection / estimation
sequences will be true value of parameter vector
$$E\{\hat{\beta}\} = \beta$$
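The idea of repeated data collection / estimation sequences can be imitated by simulation. In this sketch the true line, the noise level, the run conditions and the number of repetitions are all hypothetical choices, and a fixed seed keeps the run repeatable.

```python
import random


def fit_line(xs, ys):
    """Least squares intercept and slope for a straight-line model."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


random.seed(824)
TRUE_B0, TRUE_B1, SIGMA = 1.0, 2.0, 1.0      # hypothetical true values
xs = [float(i) for i in range(10)]

estimates = []
for _ in range(2000):
    # New noise each time: one simulated data collection / estimation sequence.
    ys = [TRUE_B0 + TRUE_B1 * x + random.gauss(0.0, SIGMA) for x in xs]
    estimates.append(fit_line(xs, ys))

mean_b0 = sum(b0 for b0, _ in estimates) / len(estimates)
mean_b1 = sum(b1 for _, b1 in estimates) / len(estimates)
print(f"average intercept = {mean_b0:.3f}, average slope = {mean_b1:.3f}")
```

Individual estimates scatter around the true values, but their averages settle close to them, which is what unbiasedness means.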
Properties of Parameter Estimates
Consistent
» behaviour as number of data points tends to infinity
» with probability 1,
$$\lim_{N \to \infty} \hat{\beta} = \beta$$
» distribution narrows as N becomes large
Efficient
» variance of least squares estimates is no greater than that of
any other linear unbiased parameter estimates (Gauss-Markov theorem)
Properties of Parameter Estimates
Covariance Structure
» summarized by variance-covariance matrix
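$$\mathrm{Cov}(\hat{\beta}) = (\mathbf{X}^T\mathbf{X})^{-1} \sigma^2$$
» $(\mathbf{X}^T\mathbf{X})^{-1}$ - structure dictated by experimental design
» $\sigma^2$ - variance of noise
$$\mathrm{Cov}(\hat{\beta}) = \begin{bmatrix} \mathrm{Var}(\hat{\beta}_0) & \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) \\ \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) & \mathrm{Var}(\hat{\beta}_1) \end{bmatrix}$$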
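As a worked instance, the covariance matrix can be evaluated for the solder thickness example. This sketch substitutes the noise variance estimate 135.38 for the unknown true variance, so the result is an estimated covariance matrix; the variable names are illustrative.

```python
import math

XTX = [[10.0, 2367.0], [2367.0, 563335.0]]   # from the solder thickness example
S2 = 135.38                                  # estimate used in place of sigma^2

# Closed-form inverse of a 2x2 matrix, scaled by the noise variance estimate.
det = XTX[0][0] * XTX[1][1] - XTX[0][1] * XTX[1][0]
cov = [[S2 * XTX[1][1] / det, -S2 * XTX[0][1] / det],
       [-S2 * XTX[1][0] / det, S2 * XTX[0][0] / det]]

std_errors = [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
corr = cov[0][1] / (std_errors[0] * std_errors[1])

print(f"Var(intercept) = {cov[0][0]:.1f}, Var(slope) = {cov[1][1]:.5f}")
print(f"correlation between estimates = {corr:.4f}")
```

The correlation is close to -1, which matches the long, narrow, rotated ellipse seen for this example: an increase in the intercept estimate is offset by a steeper negative slope.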
Prediction Variance
…in matrix form -
$$\mathrm{var}(\hat{y}_k) = \mathbf{x}_k^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_k \, \sigma^2$$
where $\mathbf{x}_k$ is vector of conditions at k-th data point
Joint Confidence Regions
Variability in data can affect parameter estimates jointly
depending on structure of data and model
[Figure: section of sum of squares (or likelihood) function in the plane of $\beta_1$ and $\beta_2$, with marginal confidence limits marked]