Linear regression
Y = m X + b
A linear relationship is assumed to exist
between two factors.
This was already discussed in an earlier unit.
Regression methods
Multiple linear regression
Y = m_1 X_1 + m_2 X_2 + ... + m_n X_n + e
This is a linear regression fit that is extended to
several variables.
It is useful when several factors contribute to the
overall observed response.
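As a minimal sketch of fitting the multiple linear regression model above, the coefficients m_1 ... m_n can be estimated by least squares. The data here are synthetic, and the names `true_m`, `X` and `y` are illustrative assumptions, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 samples, 3 contributing factors (X_1..X_3).
X = rng.normal(size=(20, 3))
true_m = np.array([2.0, -1.0, 0.5])            # assumed "true" coefficients
y = X @ true_m + 0.01 * rng.normal(size=20)    # e = small random error

# Solve Y = m_1 X_1 + ... + m_n X_n + e in the least-squares sense.
m, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", m)
```

With low noise the estimated coefficients should land close to the assumed ones.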
Multivariate calibration
Typically, a multivariate method implies that
you have multiple X (independent) and
multiple Y (dependent) variables.
We will outline three multivariate
approaches to creating a calibration curve.
Ordinary Least Squares (OLS)
Principal component regression (PCR)
Partial least squares regression (PLS)
While each optimizes the fit of your data
differently, method evaluation, optimization
and the results are often the same.
OLS
With traditional linear and multiple linear regression,
we're limited to a single Y (dependent) variable.
OLS (also called a general linear model - GLM) can
be seen as an extension of this approach. You have
a Y matrix instead of a Y vector.
Mathematically, the matrix formulations for MLR and
OLS (GLM) are identical - except for allowing for
a Y matrix. Basically, it is a combination of MLR and
simultaneous equations.
XLStat will handle either approach - based on the
number of Y variables you give it.
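The point that a Y matrix only extends the single-Y case can be sketched in NumPy. This is an illustration on synthetic data, not XLStat's implementation, and `B_true`, `B_matrix` and `B_columns` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))            # 15 samples, 4 X variables
B_true = rng.normal(size=(4, 2))        # one coefficient column per Y variable
Y = X @ B_true + 0.01 * rng.normal(size=(15, 2))

# Matrix form: all Y columns are fitted in one least-squares solve.
B_matrix, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The same coefficients come from fitting each Y column separately.
B_columns = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(Y.shape[1])]
)
print(np.allclose(B_matrix, B_columns))
```

Each column of the coefficient matrix is the multiple linear regression solution for one dependent variable.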
OLS
R^2 Ycum index.
Sum of the coefficients of determination (R^2)
between the dependent variables and the h first
components for the dependent variables.
R^2 Xcum index.
Sum of the coefficients of determination (R^2)
between the independent variables and the h
first components for the independent variables.
These are similar to the Q^2 cum(h) index - but only
for one of the blocks of data.
Note: other programs will either a) call these
different things or b) use different measures.
Octane number
Rating Octane of Gasoline using Near IR.
ASTM method is complex and expensive.
A simple method would be more desirable.
Experimental
Unleaded gasoline samples were assayed
by the ASTM method.
NIR spectra (900-1600 nm) were obtained.
OLS, PCR and PLS models were studied.
X matrix - spectra at 20 nm intervals.
ASTM octane number by Research method
was used as the Y matrix (vector).
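A short sketch of how an X matrix on a 20 nm grid might be assembled. The 1 nm raw grid, the sample count and the random stand-in spectra are assumptions made for illustration only.

```python
import numpy as np

# Wavelengths kept for the X matrix: 900, 920, ..., 1600 nm.
wavelengths = np.arange(900, 1601, 20)

# Hypothetical raw spectra on a 1 nm grid (assumed; random stand-in data).
rng = np.random.default_rng(0)
full_grid = np.arange(900, 1601)                  # 701 points
n_samples = 10                                    # placeholder sample count
raw_spectra = rng.random((n_samples, full_grid.size))

# Keep every 20th point so each column of X matches one kept wavelength.
X = raw_spectra[:, ::20]
print(X.shape, wavelengths.size)
```

A 20 nm spacing over 900-1600 nm gives 36 X variables per sample.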
Octane Number
NIR spectra
A 915 nm, CH2 stretch
B 1021 nm, CH2/CH3 combination band
C 1151 nm, aromatic and CH3 stretch
D 1194 nm, CH3 stretch
E 1394 nm, CH2 combination bands
F 1412 nm, aromatic & CH2 combination bands
G 1435 nm, aromatic & CH2 combination bands
Octane Number - OLS
High R^2 value, and RMSE and Press
RMSE are similar.
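The comparison of RMSE with Press RMSE can be sketched as follows. RMSE comes from the residuals of the model fitted to all samples, while PRESS sums the squared errors when each sample is predicted by a model built without it. The data are synthetic, and using the same n - p - 1 denominator for both statistics is an assumption; packages differ here.

```python
import numpy as np

def fit_predict(X_train, y_train, X_new):
    """OLS with an intercept, fitted on the training rows."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))                      # hypothetical X variables
y = 88.0 + X @ np.array([1.5, -0.7, 0.3]) + 0.2 * rng.normal(size=30)

# RMSE: residuals of the model fitted to all samples.
n, p = X.shape
resid = y - fit_predict(X, y, X)
rmse = np.sqrt(np.sum(resid ** 2) / (n - p - 1))

# PRESS: each sample is predicted by a model that never saw it.
press = 0.0
for i in range(n):
    keep = np.arange(n) != i
    pred_i = fit_predict(X[keep], y[keep], X[i:i + 1])[0]
    press += (y[i] - pred_i) ** 2
press_rmse = np.sqrt(press / (n - p - 1))

print(rmse, press_rmse)
```

When the two values are close, the model predicts unseen samples about as well as it fits the calibration samples.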
Octane Number
Only 9 variables ended up being used in
building the OLS model.
Octane Number, OLS
[Figure: Predicted Octane # vs. Octane # (Active and Validation samples)]
OLS residual.
[Figure: Octane # / Standardized residuals]
Octane Number, PCR
PCR
OLS
Note that you get a small
improvement in the fit
with PCR. There might be a
problem with outliers.
Also, you have a larger
number of degrees of
freedom since all of the
original variables were
used. With OLS, most
were discarded.
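A minimal sketch of PCR, assuming the usual recipe: centre X, project onto the first few principal components, then regress Y on the scores. Every original variable contributes through the loadings. The function name `pcr_fit` and the synthetic data are illustrative assumptions, not the octane data set.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal component regression: regress y on the first PCA scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are the principal component loadings.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                 # loadings, shape (n_vars, k)
    T = Xc @ P                              # scores, shape (n_samples, k)
    q, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    coef = P @ q                            # coefficients on the original X
    intercept = y_mean - x_mean @ coef
    return coef, intercept

rng = np.random.default_rng(4)
scores = rng.normal(size=(25, 2))           # two hypothetical latent factors
X = scores @ rng.normal(size=(2, 36)) + 0.01 * rng.normal(size=(25, 36))
y = 87.0 + scores @ np.array([1.0, -0.5]) + 0.05 * rng.normal(size=25)

coef, intercept = pcr_fit(X, y, n_components=2)
y_hat = X @ coef + intercept
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)
```

Note that this example has more X variables (36) than samples (25), yet the regression step only estimates two coefficients.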
Octane Number, PCR
[Figure: scree plot - Eigenvalue and Cumulative variability (%) vs. Component]
Almost 90% of the
variance is captured in
the first component.
Over 99% in the first
three.
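The variance captured per component can be computed from the singular values of the centred X matrix. The sketch below uses synthetic spectra with one dominant source of variation, so its percentages are not those of the octane data.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical spectra: one dominant source of variation plus small noise.
t = rng.normal(size=(40, 1))
loadings = rng.normal(size=(1, 36))
X = t @ loadings + 0.05 * rng.normal(size=(40, 36))

Xc = X - X.mean(axis=0)                       # centre each variable
s = np.linalg.svd(Xc, compute_uv=False)       # singular values
eigenvalues = s ** 2 / (Xc.shape[0] - 1)      # variance along each component
cumulative = 100 * np.cumsum(eigenvalues) / eigenvalues.sum()

print(cumulative[:3])                         # cumulative variability (%)
```

The `eigenvalues` and `cumulative` arrays correspond to the two quantities usually shown on a scree plot.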
Scores plot - potential outliers
[Figure: scores plot of the Active and Validation samples]
Octane Number, PCR
[Figure: Predicted Octane # vs. Octane # (Active and Validation samples)]
[Figure: Octane # / Standardized residuals]
Octane Number, PLS
[Figure: Model quality by number of components (Index vs. Components)]
Octane Number, PCR
PCR
OLS
PLS
Correlation plot
[Figure: correlation plot of the variables at 900-1600 nm, axes from -1 to 1]
Similar to a scores plot.
Octane Number, PLS
[Figure: Predicted Octane # vs. Octane # (Active and Validation samples)]
[Figure: Octane # / Standardized residuals]
Octane number
For this data set, there is no significant
difference for OLS, PCR and PLS calibration
models.
Each produces a comparable fit of the data
and has about the same level of residual
error.
Let's look at another example - this time with
multiple dependent (Y) variables.
X-Ray fluorescence example
A series of nickel alloys were assayed by X-ray
fluorescence. Four of the elements are
known to have specific spectral features that
allow prediction. There are a total of 15
samples.
Elements present:
Si, Mn, Ni, Cr, Mo, Ti and Fe.
(bold = with spectral features)
OLS, PCR and PLS models were built and
compared.
Sample spectra
[Figure: sample spectra - Absorbance (x 10^4) vs. Variable (0-300)]
OLS
Due to the limited number of samples,
there are too many X points/sample.
By creating replicate copies of the samples
(in triplicate), an initial OLS was conducted
to determine which X points would have
been automatically eliminated.
This left 8 X points/sample.
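The sample-count constraint described above can be illustrated with a small sketch. The 300 X points per sample is an assumed figure and the spectra are random stand-ins. Replicating the rows raises the row count, which lets the software run, but the rank of X stays at the number of distinct samples.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples, n_points = 15, 300            # 300 X points is an assumed value
X = rng.random((n_samples, n_points))    # random stand-in for the spectra

rank_original = np.linalg.matrix_rank(X)

# Triplicate copies: 45 rows, but no new independent information.
X_triplicate = np.vstack([X, X, X])
rank_triplicate = np.linalg.matrix_rank(X_triplicate)

print(X_triplicate.shape, rank_original, rank_triplicate)
```

This is why only a small subset of X points can survive in the OLS model.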
OLS
Model Quality
Summary.
[Figures, one pair per element (Si, Mn, Ni, Cr, Mo, Ti, Fe): Pred(% element) / % element, and % element / Standardized residuals]
PCR gives higher R^2 values, but look at the Press RMSE.
OLS results
[Figure: scree plot - Eigenvalue and Cumulative variability (%) vs. Component]
[Figures: per-element plots for Si, Mn, Ni, Cr, Mo, Ti and Fe, each with a standardized residuals plot]
PLS
[Figure: Model quality by number of components (Index vs. Components)]
The Q^2 pattern indicates that most of the information is brought out in the first
few components, but then noise is brought out to finally find a way to fit the
non-spectral species.
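A rough sketch of how a Q^2-by-components curve can be produced for a single Y variable. It uses a NIPALS-style PLS1 and a simplified leave-one-out definition, Q^2 = 1 - PRESS / SS_total, which is an assumption and may differ from the index a given package reports. The data are synthetic with two real latent factors.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS regression for a single y (NIPALS-style PLS1)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)          # weight vector
        t = E @ w                       # scores
        tt = t @ t
        p = E.T @ t / tt                # X loadings
        c = (f @ t) / tt                # y loading
        E = E - np.outer(t, p)          # deflate X
        f = f - c * t                   # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, y_mean - x_mean @ coef

def q2_by_components(X, y, max_components):
    """Leave-one-out Q2 = 1 - PRESS / SS_total for 1..max_components."""
    n = len(y)
    ss_total = np.sum((y - y.mean()) ** 2)
    q2 = []
    for h in range(1, max_components + 1):
        press = 0.0
        for i in range(n):
            keep = np.arange(n) != i
            coef, b0 = pls1_fit(X[keep], y[keep], h)
            press += (y[i] - (X[i] @ coef + b0)) ** 2
        q2.append(1 - press / ss_total)
    return np.array(q2)

rng = np.random.default_rng(6)
scores = rng.normal(size=(20, 2))               # two real latent factors
X = scores @ rng.normal(size=(2, 30)) + 0.05 * rng.normal(size=(20, 30))
y = scores @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=20)

q2 = q2_by_components(X, y, max_components=6)
print(q2)
```

Components beyond the real factors can only model noise, so they add little or nothing to Q^2.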
PLS
PLS gives an even better fit - but not a
huge improvement compared to PCR.
[Figures: per-element plots for Si, Mn, Ni, Cr, Mo and Ti, each with a standardized residuals plot]
Mo and Ti
Fantastic fits, considering
that there is
NO data to
support
them.
Fe Results
[Figure: Fe plots, including standardized residuals]
Summary
Although we were able to develop models
that appeared to be able to predict the
amounts of all seven species, there is actually
only real information about four of them.
The PCR and PLS models will produce a fit
regardless of noise, lack of a positive
response, ...
Care must be taken to ensure that your data
set contains real information about all of the
components.
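The warning above can be demonstrated with a sketch in which neither X nor Y contains any real information. PCR with enough components still fits the calibration samples almost perfectly, while leave-one-out prediction fails. The data are pure random numbers, and `pcr_predict` is a hypothetical helper written for this illustration.

```python
import numpy as np

def pcr_predict(X_train, y_train, X_new, n_components):
    """Fit PCR on the training rows and predict the new rows."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                       # loadings
    q, *_ = np.linalg.lstsq(Xc @ P, y_train - y_mean, rcond=None)
    return (X_new - x_mean) @ (P @ q) + y_mean

rng = np.random.default_rng(7)
n_samples = 15
X = rng.normal(size=(n_samples, 50))    # "spectra" that are pure noise
y = rng.normal(size=n_samples)          # response unrelated to X

ss_total = np.sum((y - y.mean()) ** 2)

# Fit with every available component (n - 1 after centring).
y_fit = pcr_predict(X, y, X, n_samples - 1)
r2_fit = 1 - np.sum((y - y_fit) ** 2) / ss_total

# Leave-one-out: each sample is predicted by a model that never saw it.
press = 0.0
for i in range(n_samples):
    keep = np.arange(n_samples) != i
    pred = pcr_predict(X[keep], y[keep], X[i:i + 1], n_samples - 2)[0]
    press += (y[i] - pred) ** 2
q2 = 1 - press / ss_total

print(r2_fit, q2)
```

A high R^2 on the calibration samples alone is therefore not evidence that the data contain real information about a component.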
One last example