Professional Documents
Culture Documents
Elsevier - Handbook of Qualimetrics Part B
Elsevier - Handbook of Qualimetrics Part B
In 1991 two of us, Luc Massart and Bernard Vandeginste, discussed, during one of
our many meetings, the possibility and necessity of updating the book
Chemometrics: a textbook. Some of the newer techniques, such as partial least
squares and expert systems, were not included in that book which was written
some 15 years ago. Initially, we thought that we could bring it up to date with
relatively minor revision. We could not have been more wrong. Even during the
planning of the book we witnessed a rapid development in the application of
natural computing methods, multivariate calibration, method validation, etc.
When approaching colleagues to join the team of authors, it was clear from the
outset that the book would not be an overhaul of the previous one, but an almost
completely new book. When forming the team, we were particularly happy to be
joined by two industrial chemometricians. Dr. Paul Lewi from Janssen
Pharmaceutica and Dr. Sijmen de Jong from Unilever Research Laboratorium
Vlaardingen, each having a wealth of practical experience. We are grateful to
Janssen Pharmaceutica and Unilever Research Vlaardingen that they allowed
Paul, Sijmen and Bernard to spend some of their time on this project. The three
other authors belong to the Vrije Universiteit Brussel (Prof. An Smeyers-Verbeke
and Prof. D. Luc Massart) and the Katholieke Universiteit Nijmegen (Professor
Lutgarde Buydens), thus creating a team in which university and industry are
equally well represented. We hope that this has led to an equally good mix of
theory and application in the new book.
Much of the material presented in this book is based on the direct experience of
the authors. This would not have been possible without the hard work and input of
our colleagues, students and post-doctoral fellows. We sincerely want to acknowl-
edge each of them for their good research and contributions without which we
would not have been able to treat such a broad range of subjects. Some of them
read chapters or helped in other ways. We also owe thanks to the chemometrics
community and at the same time we have to offer apologies. We have had the
opportunity of collaborating with many colleagues and we have profited from the
research and publications of many others. Their ideas and work have made this
book possible and necessary. The size of the book shows that they have been very
productive. Even so, we have cited only a fraction of the literature and we have not
included the more sophisticated work. Our wish was to consolidate and therefore
to explain those methods that have become more or less accepted, also to
newcomers to chemometrics. Our apologies, therefore, to those we did not cite or
not extensively: it is not a reflection on the quality of their work.
Each chapter saw many versions which needed to be entered and re-entered in
the computer. Without the help of our secretaries, we would not have been able to
complete this work successfully. All versions were read and commented on by all
authors in a long series of team meetings. We will certainly retain special
memories of many of our two-day meetings, for instance the one organized by Paul
in the famous abbey of the regular canons of Premontre at Tongerlo, where we
could work in peace and quiet as so many before us have done.
Much of this work also had to be done at home, which took away precious time
from our families. Their love, understanding, patience and support was indispens-
able for us to carry on with the seemingly endless series of chapters to be drafted,
read or revised.
Chapter 28
Introduction to Part B
In the introduction to Part A we discussed the "arch of knowledge" [1] (see Fig.
28.1), which represents the cycle of acquiring new knowledge by experimentation
and the processing of the data obtained from the experiments. Part A focused
mainly on the first step of the arch: a proper design of the experiment based on the
hypothesis to be tested, evaluation and optimization of the experiments, with the
accent on univariate techniques. In Part B we concentrate on the second and third
steps of the arch, the transformation of data and results into information and the
combination of information into knowledge, with the emphasis on multivariate
techniques.
In order to obtain information from a series of experiments, we need to interpret
the data. Very often the first step in understanding the data is to visualise them in a
plot or a graph. This is particularly important when the data are complex in nature.
These plots help in discovering any structure that might be present in the data and
which can be related to a property of the objects studied. Because plots have to be
represented on paper or on a flat computer screen, data need to be projected and
compressed.
Analytical results are often represented in a data table, e.g., a table of the fatty
acid compositions of a set of olive oils. Such a table is called a two-way multi-
variate data table. Because some olive oils may originate from the same region and
others from a different one, the complete table has to be studied as a whole instead
as a collection of individual samples, i.e., the results of each sample are interpreted
in the context of the results obtained for the other samples. For example, one may
ask for natural groupings of the samples in clusters with a common property,
namely a similar fatty acid composition. This is the objective of cluster analysis
(Chapter 30), which is one of the techniques of unsupervised pattern recognition.
The results of the clustering do not depend on the way the results have been
arranged in the table, i.e., the order of the objects (rows) or the order of the fatty
acids (columns). In fact, the order of the variables or objects has no particular
meaning.
In another experiment we might be interested in the monthly evolution of some
constituents present in the olive oil. Therefore, we decide to measure the total
amount of free fatty acids and the triacylglycerol composition in a set of olive oil
Knowledge
creativity intelligence
Deduction Induction
(synthesis) (analysis)
Hypothesis Information
design data
Experiment
samples under fixed storage conditions. Each month a two-way table is obtained
— six in total after six months. We could decide to analyse all six tables
individually. However, this would not provide information on the effect of the
storage time and its relation to the origin of the oil. It is more informative to
consider all tables together. They form a so-called three-way table. The analysis of
such a table is discussed in Chapter 31. If, in addition, all olive samples are split
into portions which are stored under different conditions, e.g., open and closed
bottles, darkness and daylight, we obtain several three-way tables or in general a
multi-way data table.
Some analytical instruments produce a table of raw data which need to be
processed into the analytical result. Hyphenated measurement devices, such as
HPLC linked to a diode array detector (DAD), form an important class of such
instruments. In the particular case of HPLC-DAD, data tables are obtained con-
sisting of spectra measured at several elution times. The rows represent the spectra
and the columns are chromatograms detected at a particular wavelength. Conse-
quently, rows and columns of the data table have a physical meaning. Because the
data table X can be considered to be a product of a matrix C containing the
concentration profiles and a matrix S containing the pure (but often unknown)
spectra, we call such a table bilinear. The order of the rows in this data table
corresponds to the order of the elution of the compounds from the analytical
column. Each row corresponds to a particular elution time. Such bilinear data
tables are therefore called ordered data tables. Trilinear data tables are obtained
from LC-detectors which produce a matrix of data at any instance during the
elution, e.g., an excitation-emission spectrum as a function of time. Bilinear and
trilinear data tables are also measured when a chemical reaction is monitored over
time, e.g., the biomass in a fermenter by Near infrared Spectroscopy. An example
of a non-bilinear two-way table is a table of MS-MS spectra or a 2D-NMR
spectrum. These tables cannot be represented as a product of row and column
spectra.
So far, we have discussed the structure of individual data tables: two-way to
multiway, which may be bilinear to multilinear. In some cases two or more tables
are obtained for a set of samples. The simplest situation is a two-way data table
associated with a vector of properties, e.g., a table with the fatty acid composition
of olive oils and a vector with the coded region of the oils. In this case we do not
only want to display the data table, but we want to derive a classification rule,
which classifies the collected oils in the right category. With this classification
rule, oils of unknown origin can be classified. This is the area of supervised pattern
recognition, which classically is based on multivariate statistics (Chapter 33) but
more recently neural nets have been introduced in this field (Chapter 44). The
technique of neural networks belongs to the area of so-called natural computation
methods. Genetic algorithms — another technique belonging to the family of
natural computation methods — were discussed in Part A. They are called natural
because the principle of the algorithms to some extent mimics the way biological
systems function. Originally, neural networks were considered to be a model for
the functioning of the brain. In the example above, the property is a discrete class,
region of origin, healthy or ill person, which is not necessarily a quantitative value.
However, in many cases the property may be a numerical value, e.g., a concent-
ration or a pharmacological activity (Chapter 37). Modelling a vector of properties
to a table of measurements is the area of multivariate calibration, e.g., by principal
components regression or by partial least squares, which are described in Chapter
36. Here the degree of complexity of the data sets is almost unlimited: several data
tables may be predictors for a table of properties, the relationship between the
tables may be non-linear and some or all tables may be multiway.
As indicated before, the columns and the rows of a bilinear or trilinear dataset
have a particular meaning, e.g., a spectrum and a chromatogram or the concent-
ration profiles of reactants and the reaction products in an equilibrium or kinetic
study. The resulting data table is made up by the product of the tables of these pure
factors, e.g., the table of the elution profiles of the pure compounds and the table of
the spectra of these compounds. One of the aims of a study of such a table is the
decomposition of the table into its pure spectra and pure elution profiles. This is
done by factor analysis (Chapter 34).
A special type of table is the contingency table, which has been introduced in
Chapter 16 of Part A. In Part B the 2x2 contingency table is extended to the general
case (Chapter 32) which can be analyzed in a multivariate way. The above
examples illustrate that the complexity of data and operations discussed in Part B
require advanced chemometric techniques. Although Part B can be studied inde-
pendently from Part A, we will implicitly assume a chemometrics background
equivalent to Part A, where a more informal treatment of some of the same topics
can be found. For instance, vectors and matrices are introduced for the first time in
Chapter 9 of Part A, but are treated in more depth in Chapter 29 of Part B. Principal
components analysis, which was introduced in Chapter 17, is discussed in more
detail in Chapter 31. These two chapters provide the basis for more advanced
techniques such as Procrustes analysis, canonical correlation analysis and partial
least squares discussed in Chapter 35. In Part B we also concentrate on a number of
important application areas of multivariate statistics: multivariate calibration
(Chapter 36) quantitative structure-activity relationships (QSAR, Chapter 37),
sensory analysis (Chapter 38) and pharmacokinetics (Chapter 39).
The success of data analysis depends on the quality of the data. Noise and other
instrumental factors may hide the information in the data. In some instances it is
possible to improve the quality of the data by a suitable preprocessing technique
such as signal filtering and signal restoration by deconvolution (Chapter 40). Prior
to signal enhancement and restoration it may be necessary to transform the data by,
e.g., a Fourier transform or wavelet transform. Both are discussed in Chapter 40. A
special type of filter is the Kalman filter which is particularly applicable to the
real-time modelling of systems of which the model parameters are time dependent.
For instance, the slope and intercept of a calibration line may be subject to a drift.
At each new measurement of a calibration standard, the Kalman filter updates the
calibration factors, taking into account the uncertainty in the calibration factors
and in the data. Because the Kalman filter is driven by the difference between a
new measurement and the predicted value of that measurement, the filter ignores
outlying measurements, caused by stochastic or systematic errors.
The last step of the arch of knowledge is the transformation of information into
knowledge. Based on this knowledge one is able to make a decision. For instance,
values of temperatures and pressures in different parts of a process have to be
interpreted by the operator in the control room of the plant, who may take a series
of actions. Which action to take is not always obvious and some guidance is
usually found in a manual. In analytical method development the same situation is
encountered. Guidance is required for instance to select a suitable stationary phase
in HPLC or to select the solvents that will make up the mobile phase. This type of
knowledge may be available in the form of "If Then" rules. Such rules can be
combined in a rule-based knowledge base, which is consulted by an expert system
(Chapter 43). Questions such as "What If can also be answered by models
developed in Operations Research (Chapter 42). For instance, the average time a
sample has to wait in the sample queue can be predicted by queuing theory for
various priority strategies.
After completion of one cycle of the arch of knowledge, we are back at the
starting point of the arch, where we should accept or reject the hypothesis. At this
stage a new cycle can be started based on the knowledge gained from all previous
ones. The chemometric techniques described in Part A and B aim to support the
scientist in running through these cycles in an efficient way.
References
holds only when all p coefficients c^ are zero. Otherwise, the p vectors are linearly
dependent (see also Section 9.2.8).
The following three vectors z,, Zj and Zj with dimension four are linearly
independent:
'-\ " 2" "-5l
4 2 1
z, = Z2 = Z3 =
0 3 -6
_ 2_ -1 _ 4_
as it is not possible to find a set of coefficients Cj, C2, C3 which are not all equal to
zero, and which satisfy the system of four equations:
-C] + 2C2 -5C3 = 0
4c, + 2c2 + IC3 = 0
Oc, + 3c2 - 6C3 = 0
2c, - IC2 + 4c3 = 0
In the case of linearly dependent vectors, each of them can be expressed as a
linear combination of the others. For example, the last of the three vectors below
can be expressed in the form Z3 = Zj - IT^^,
'-\ " 2" '-5l
4 2 0
^2 = Z3 =
0 3 -6
_ 2_ -1 _ 4_
A vector space spanned by a set of/? vectors (Zj... z^ with the same dimension n
is the set of all vectors that are linear combinations of the p vectors that span the
space [3]. A vector space satisfies the three requirements of an algebraic set which
follow hereafter. (1) Any vector obtained by vector addition and scalar multiplication
of the vectors that span the space also belongs to this space. This includes the null
vector whose elements are all equal to zero. (2) Addition of the null vector to any
vector of the space reproduces the vector. (3) For every vector that belongs to the
space, another one can be found, such that vector addition of these two vectors
produces the null vector.
A set of n vectors of dimension n which are linearly independent is called a basis
of an n-dimensional vector space. There can be several bases of the same vector
space.
The set of unit vectors of dimension n defines an n-dimensional rectangular (or
Cartesian) coordinate space S^. Such a coordinate space 5" can be thought of as
being constructed from n base vectors of unit length which originate from a
common point and which are mutually perpendicular. Hence, a coordinate space is
a vector space which is used as a reference frame for representing other vector
spaces. It is not uncommon that the dimension of a coordinate space (i.e. the
number of mutually perpendicular base vectors of unit length) exceeds the dimension
of the vector space that is embedded in it. In that case the latter is said to be a
subspace of the former. For example, the basis of 5"^ is:
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
: = X^/"/ (29.2)
where the n vectors (Uj ... u j form a basis of 5". The n coefficients (xj ... x^ are
called the coordinates of the vector x in the basis (Uj ... u j .
For example, a vector x in 5"* may be expressed in the previously defined usual
basis as the vector sum:
1 0 0 0
0 1 0 0
x=3 + (-4) +0 +2
0 0 1 0
0 0 0 1
10
where the coefficients 3, ^ , 0 and 2 are the coordinates of x in the usual basis.
Another basis of 5"* can be defined by means of the set of vectors:
1 1 -1 -1
1 -1 1 -1
1 -1 -1 1
1 1 1 1
and the same vector x which we have defined above can be expressed in this
particular basis by means of the vector sum:
'\ r -1 -1
1 -1 1 -1
x = 0.25 + 2.25 + (-1.25) + 0.75
1 -1 -1 1
1 1 1 1
where the coefficients 0.25, 2.25, -1.25 and 0.75 nbw represent the coordinates of
X in this particular basis. From the above it follows that a vector can be expressed
by different coordinates according to the particular basis that has been chosen. In
multivariate data analysis one often changes the basis in order to highlight particular
properties of the vectors that are represented in it. This automatically causes a
change of their coordinates. A change of basis and its effect on the coordinates can
be defined algebraically, as is shown in Chapters 31 and 32.
By way of example, if x = [2,1,3, -1 ]^ and y = [4,2, -1,5]^ then we obtain that the
scalar product of x with y equals:
x'ry = 2 x 4 + l x 2 + 3 x ( - l ) + ( - l ) x 5 = 2
Note that in Section 9.2.2.3 the dot product x • y is used as an equivalent notation
for the scalar product x^y.
In Euclidean space we define squared distance from the origin of a point x by
means of the scalar product of x with itself:
n
xTx = ^ x f =llxl|2 (29.4)
where 11 x - yl I is the norm or length of the vector x - y. Note that the expression of
distance from the origin in eq. (29.4) can be derived from that of distance between
two points in eq. (29.5) by replacing the vector y by the null vector 0:
xTx = ( x - 0 ) ^ ( x - 0 ) = llx-OlP =llxlP
Given the vectors x = [2,1,3, -1 ]^ and y = [4,2, -1,5]^, we derive the norms:
llxlP - 2 ^ +1^ + 3 ^ +(-1)2 =15
llyiP = 4 ^ + 2 ^ + ( - 1 ) ' + 5^ =46
l l x - y i P = ( 2 - 4 ) 2 + ( 1 - 2 ) 2 +(3_(_i))2 ^ ( _ i _ 5 ) 2 ^ 5 7
It may be noted that some authors define the norm of a vector x as the square of
11 xl I rather than 11 xl I itself, e.g. Gantmacher [2].
Angular distance or angle between two points x and y, as seen from the origin of
space, is derived from the definition of the scalar product in terms of the norms of
the vectors:
n
xTy = ^ x , . y,. =llxllllyllcosi^ (29.6)
llyll
11x11 cos d
Fig. 29.1. Geometrical interpretation of the scalar product of x'^y as the projection of the vector x upon
the vector y. The lengths of x and y are denoted by 11 xl I and 11 yl I, respectively, and their angular separa-
tion is denoted by i3.
cos-d^ = 0.0761
15V46
or:
^ = 85.64 degrees
One can define three special configurations of two vectors, namely parallel in
the same direction, parallel in opposite directions, and orthogonal (or perpend-
icular). The three special configurations depend on the angular distances between
the two vectors, being 0, 180 and 90 degrees respectively (Fig. 29.3).
More generally, two vectors x and y are orthogonal when their scalar product is
zero:
13
Fig. 29.2. Distance 11 x - yl I between two vectors x and y of length 11 xl I and 11 yl I, separated by an angle i^.
f^ ^ x y = 11x11 llyll
180°
V 0 V > x y = -11x11 llyll
0 X
/^y
tki T
X y=0
Fig. 29.3. Three special configurations of two vectors x and y and their corresponding scalar product
x^y. Angular separations of x and y are 0, 180 and 90 degrees, respectively.
x^y = 0 (29.9)
It can be shown that the vectors x = [2, - 1 , 8, 0]''' and y = [10, 4, -2, 3]^ are
orthogonal, since:
x'ry = [2,-l,8,0][10,4,-2,3]T
= 2 x l 0 + (-l)x4 + 8x(-2) + 0 x 3 = 0
14
x^x = y V = l (29.10)
The n basis vectors which define the basis of a coordinate space 5" are n
mutually orthogonal and normalized vectors. Together they form a frame of
reference axes for that space.
If we represent by x and y the arithmetic means of the elements of the vectors x
and y:
1 "
^ ' (29.11)
1 "
y = -Yuyi
^ i
then we can relate the norms of the vectors (x - x) and (y - y) to the standard
deviations s^ and s^ of the elements in x and y (Section 2.1.4):
1 " 1
^x =-Y,{x,-xy =-\\x-x\\^
n , n (29.12)
29.3 Matrices
On the other hand, one can also regard the same matrix X as an ordered array of n
vectors of dimension p, each of the form:
X • — [X-1 , . . . , X;„ with / = 1, ..., n
In our notation, x^j represents the element of matrix X at the crossing of row / and
column j . The vector Xy defines a vector which contains the n elements of the jth
column of X. The vector x, refers to a vector which comprises the/? elements of the
/th row of X.
In the matrix X of the following example:
3 2 -2
2 1 0
X=
0 -4 -3
-1 2 4
2
1
X,=2=i
-4
2
16
1 P
1 X
T
r
''ij
nI
''j
p
Fig. 29.4. Schematic representation of a matrix X as a stack of horizontal rows x,- , and as an assembly
of vertical columns x,.
In the illustration of Fig. 29.4 we regard the matrix X as either built up from n
horizontal rows x^ of dimension p, or as built up from p vertical columns x^ of
dimension n. This exemplifies the duality of the interpretation of a matrix [9].
From a geometrical point of view, and according to the concept of duality, we can
interpret a matrix with n rows and p columns either as a pattern of n points in a
/7-dimensional space, or as a pattern of p points in an n-dimensional space. The
former defines a row-pattern P"" in column-space SP, while the latter defines a
column-pattern P^ in row-space 5^". The two patterns and spaces are called dual (or
conjugate). The term dual space also possesses a specific meaning in another
17
1 1 p
1 X
T
-^1
Xij
Fig. 29.5. Geometrical interpretation of an nxp matrix X as either a row-pattern of n points P" in
/7-dimensional column-space S^ (left panel) or as a column-pattern of p points P^ in n-dimensional
row-space S" (right panel). The/? vectors u, form a basis of 5^ and the n vectors v, form a basis of 5".
mathematical context which is distinct from the one which is implied here. The
occasion for confusion, however, is slight.
In Fig. 29.5, the column-space S^ is represented as a/^-dimensional coordinate
space in which each row x^ of X defines a point with coordinates {xn,.., x^p.., x^^) in
an orthonormal basis (u^,.., Uy,.., u^ such that:
with:
rr '0'
0 0
Each element x^j can thus be reconstructed from the scalar product:
Xij=xJ u . = u j X. (29.14)
In the same Fig. 29.5, the row-space S"^ is shown as an n-dimensional coordinate
space in which each column Xj of X defines a point with coordinates (x^j,.., x-j,.., x^p
in an orthonormal basis (Vj,.., v,,.., v„) such that:
Xj=X,j V, -\-,., + X,j V,- + . . . + X„^. V„
with:
T "o"
0 0
Each element x^j can also be reconstructed from the scalar product:
X^j — \ • Xj — Xj \ • (29.15)
The dimension of a matrix X with n rows and p columns is nxp (pronounced n
by p). Here X is referred to as an nxp matrix or as a matrix of dimension nxp.
A matrix is called square if the number of rows is equal to the number of
columns. Such a matrix can be referred to as a pxp matrix or as a square matrix of
dimension p.
The transpose of a matrix X is obtained by interchanging its rows and columns
and is denoted by X^. If X is an nxp matrix then X^ is apxn matrix. In particular,
we have that:
T\T
(X^)' _ (29.16)
19
3 2 0 -1
2 1 -4 2
•2 0 -3 4
Special matrices are the null matrix 0 in which all elements are zero, and the sum
matrix 1 in which all elements are unity. In the case of 3x3 matrices we obtain:
"0 0 0" "1 1 r
0= 0 0 0 1= 1 1 1
0 0 0 1 1 1
Chapter 9 dealt with the basic operations of addition of two matrices with the
same dimensions, of scalar multiplication of a matrix with a constant, and of
arithmetic multiplication element-by-element of two matrices with the same
20
dimensions. Here, we formalize the properties of the matrix product that have
already been introduced in Section 9.3.2.3. If X is of dimension nxp and Y is of
dimension pxq, then the product Z = XY is an nxq matrix, the elements of which
are defined by:
Note that the inner dimensions of X and Y must be equal. For this reason the
operation is also called inner product, as the inner dimensions of the two terms
vanish in the product. Any element of the product, say z,^, can also be thought of as
being the sum of the products of the corresponding elements of row / of X with
those of column k of Y. Hence the descriptive name of rows-by-columns product.
In terms of the scalar product (Section 29.2) we can write:
= X (29.19)
Throughout the book, matrices are often subscripted with their corresponding
dimensions in order to provide a check on the conformity of the inner dimensions
of matrix products. For example, when a 4x3 matrix X is multiplied with a 3x2
matrix Y rows-by-columns, we obtain a 4x2 matrix Z:
3 2 -f 1 38
3 8'
0 -5 4 26 -44
X - Y = -2 4 Z = X Y =
4x3 2 1 -2 3x2 4x2 4x3 3x2 -4 32
4 - 6J
4 3 0 6 44
"3 4
"l 0' "1 0 0] [3 4' '3 4'
2 -1 0 1 — 0 1 0 2 -1 — 2 -1
0 3 0 0 ij [o 3 0 3_
A matrix is orthogonal if the product with its own transpose produces a diagonal
matrix. An orthogonal matrix of dimension nxp satisfies one or both of the
following relationships:
XX^ = D or X^X = D„ (29.21)
where D„ is a diagonal matrix of dimension n and D^ is a diagonal matrix of
dimension p. In an orthogonal matrix we find that all rows or all columns of the
matrix are mutually orthogonal as defined above in Section 9.2. In the former case
we state that X is row-orthogonal, while in the latter case X is said to be column-
orthogonal.
The following 3x3 matrix X can be shown to be column-orthogonal:
2.351 0.060 0.686
X = 4.726 1.603 -0.301
-1.840 4.196 0.105
where I is the identity matrix with the same dimension as U (or U^).
The 3x3 matrix U shown below is both row- and column-orthonormal:
22
nT T
'3 2 -1] 1 38"
r 3 8]
0 -5 4 __ 26 -44 ^ 1 26-4 6
-2 4
2 1 -2 -4 32 38 -44 32 44
L 4 -6]
4 3 oj 6 44_
and
8" T 3 2 -1 2 4'
3 r 3 0
4 0 -5 4 '3 -2 4] 1 3
-2 = 2 --5
2 1 -2 8 4 -6j
4 -6 [-1 4 -2 0_
4 3 0
For example:
23
4 2 - 1
tr 2 8 - 4 = 4 + 8 + 3=15
1 -4 3
n p
tr(X^X) = tr(XX^) = ^ ^ J ^j =1^J ^i = l E ^ ' (29.27)
I J
where x, represents the ith row and x^ denotes theyth column of the nxp matrix X.
The proof follows from working out the products in the manner described above.
This relationship is important in the case when Y equals X^.
Matrix multiplication can be applied to vectors, if the latter are regarded as
one-column matrices. This way, we can distinguish between four types of special
matrix products, which are explained below and which are represented schem-
atically in Fig. 29.6.
(1) In the matrix-by-vector product one postmultiplies an nxp matrix X with ap
vector y which results in an n vector z:
Xy = z (29.28)
For example:
3 2 -1] r '^1
I
0 -5 4 26
-2
2 1 -2 A
-4
4 3 oj 6
For example:
24
1 ^ P
z = Xy
1 Y P
z X
1 P 1 n
T T„
z =x Y
1z p ^ y
Z = xy'
T
z=x y
Fig. 29.6. Schematic illustration of four types of special matrix products: the matrix-by-vector prod-
uct, the vector-by-matrix product, the outer product and the scalar product between vectors, respec-
tively from top to bottom.
25
3 2
[3 2 -1] -2 1 = [1 5]
4 3
(3) The outer product results from premultiplying an n vector x with ap vector
y^ yielding an nxp matrix Z:
xy^ = Z (29.30)
The outer product of two vectors can be thought of as the matrix product between a
single-column matrix with a single-row matrix:
For the purpose of completeness, we also mention the vector product which is
extensively used in physics and which is defined as:
xxy = z (29.31)
T
(29.33)
x'y = z
For example:
•-1
4
[3 0 2 -1] :3x(-l) + 0 x 4 + 2 x ( - 2 ) + (-l)x0 = - 7
-2
0
26
The product of a matrix with a diagonal matrix is used to multiply the rows or
the columns of a matrix with given constants. If X is an nxp matrix and if D^ is a
diagonal matrix of dimension n we obtain a product Y in which the iih row y-
equals the iih row of X, i.e. x,, multiplied by the iih element on the main diagonal of
Y = D„X (29.34)
It has been shown that the/? columns of an nxp matrix X generate a pattern of p
points in S"^ which we call PP. The dimension of this pattern is called rank and is
indicated by r(P^). It is equal to the number of linearly independent vectors from
which all p columns of X can be constructed as Hnear combinations. Hence, the
rank of P^ can be at most equal top. Geometrically, the rank of P^can be seen as the
minimum number of dimensions that is required to represent the p points in the
pattern together with the origin of space. Linear dependences among the p columns
of X will cause coplanarity of some of the p vectors and hence reduce the minimum
number of dimensions.
The same matrix X also generates a pattern oin points in S^ which we call P*^ and
which is generated by the n rows of X. The rank of P"is denoted as riP"^) and equals
the number of linearly independent vectors from which all n rows of X can be
produced as linear combinations. Hence, the rank of P"^ can be at most equal to n.
Using the same geometrical arguments as above, one can regard the rank of P^ as
the minimum number of dimensions that is required to represent the n points in the
pattern together with the origin of space. Linear dependences between the n rows
of X will also reduce this minimum number of dimensions because of coplanarity
of some of the corresponding n vectors.
It can be shown that the rank of P^ must be equal to that of P'^ and, hence, that the
rank of X is at most equal to the smaller of n and p [3]:
r{P^ ) = r{PP) = r(X) < min(n, p) (29.36)
where r(X) is the rank of the matrix X (see also Section 9.3.5). For example, the
rank of a 4x3 matrix can be at most equal to 3. In the case when there are linear
dependences among the rows or columns of the matrix, the rank can be even
smaller.
An nxp matrix X with n>p is called singular if linear dependences exist between
the columns of X, otherwise the matrix is called non-singular. In this case the rank
of X equals p minus the number of linear dependences among the columns of X. If
n < p, then X is singular if linear dependences exist between the rows of X,
otherwise X is non-singular. In that case, the rank of X equals n minus the number
of linear dependences among the rows of X. A matrix is said to be of full rank when
X is non-singular or alternatively when r(X) equals the smaller of n or/?.
Dimensions and rank of a matrix are distinct concepts. A matrix can have
relatively large dimensions say 100x50, but its rank can be small in comparison
with its dimensions. This point can be made more clearly in geometrical terms. In a
100-dimensional row-space 5^^, it is possible to represent the 50 columns of the
matrix as 50 points, the coordinates of which are defined by the 100 elements in
each of them. These 50 points form a pattern which we represent by P^^. It is clear
28
that the true dimension of this pattern of 50 points must be less than the number of
coordinate axes of the space 5^^ in which they are represented. In fact, it cannot be
larger than 50. The true dimension of the pattern P^^ defines its rank. In an extreme
case when all 50 points are located at the origin of 5^^, the rank is zero. In another
extreme situation we may obtain that all 50 points are collinear on a line through
the origin, in which case the rank is one. All 50 columns may be coplanar in a plane
that comprises the origin, which results in a rank of 2. In practical situations, we
often find that the points form patterns of low rank, when the data are sufficiently
filtered to eliminate random variation and artifacts. Multivariate data analysis
capitalizes on this point, and in a subsequent section on eigenvectors we will deal
with the algebra which allows us to find the true number of dimensions or rank of a
pattern in space.
We can now define the rank of the column-pattern P^^ as the number of linearly
independent columns or rank of X. If all 50 points are coplanar, then we can
reconstruct each of the 50 columns, by means of linear combinations of two
independent ones. For example, if x^^j and Xj^2 ^^^ linearly independent then we
must have 48 linear dependences among the 50 columns of X:
where 0,3, C23,... are the coefficients of the linear combinations. The rank of X and
hence the rank of the column-pattern P^^ is thus equal to 50 - 48 = 2. In this case it
appears that 48 of the 50 columns of X are redundant, and that a judicious choice of
two of them could lead to a substantial reduction of the dimensionality of the data.
The algebraic approach to this problem is explained in Section 29.6 on eigen-
vectors.
A similar argument can be developed for the dual representation of X, i.e. as a
row-pattern of 100 points P^^ in a 50-dimensional column-space 5^^. Here again, it
is evident that the rank of P^^ can at most be equal to 50 (as it is embedded in a
50-dimensional space). This implies that in our illustration of a 100x50 matrix, we
must of necessity have at least 100 - 50 = 50 linear dependences among the rows of
X. In other words, we can eliminate 50 of the 100 points without affecting the rank
of the row-pattern of points in 5^^. Let us assume that this is obtained by reducing
the 100x50 matrix X into a 50x50 matrix X'. The resulting pattern of the rows in X'
now comprises only 50 points instead of 100 and is denoted here by the symbol
pioo-50 ^hich has the same rank as P^^. Since we previously assumed 48 linear
29
Hence we can remove the fourth row of X without affecting its rank, which
results in the 3x3 matrix X":
3 5 8
X= 7 2 9
1 2 3
Note that the linear dependence among the columns still persists. We now show
that there remains a second linear dependence among the rows of X':
X i=3 = •
11 i=2
29 29
Thus we have illustrated that the number of independent rows, the number of
independent columns and the rank of the matrix are all identical. Hence, from
geometrical considerations, we conclude that the ranks of the patterns in row- and
column-space must also be equal. The above illustration is also rendered geomet-
rically in Fig. 29.7.
The rank of a product of two matrices X and Y is equal to the smallest of the rank
30
Fig. 29.7. Illustration of a pattern of points with rank of 2. The pattern is represented by a matrix X with
dimensions 5x4 and a linear dependence between the three columns of X is assumed. The rank is
shown to be the smallest number of dimensions required to represent the pattern in column-space S^
and in row-space S".
ofXandY:
This follows from the fact that the columns of XY are linear combinations of X and
that the rows of XY are linear combinations of Y [3].
From the above property, it follows readily that:
where the products XX^ and X^X are of special interest in data analysis as will be
explained in Section 29.7.
I^i=tr(A) (29.42)
and that the product of the eigenvalues is equal to the determinant of the symmetric
matrix A:
n^=iAi (29.43)
"2 -\
-5 4 39 -24
A = XTX =
1 -2 -24 21
_3 0
IA-;i2i|^(39-;i2)(21-X^)-(-24)(-24) = 0
(X^y -(21+39)>.2 + 3 9 x 2 1 - 2 4 x 2 4 = 0
or
(?i2)2 _60;i2 +243 = 0
From the form of this equation we deduce that the characteristic equation has two
positive roots:
?i^ 2 = 30 ± (30^ - 243)^/2 ^ 30 + 25.632
or
X] =55.632 and X\ =4.368
It can be easily verified that:
2
XM=tr(A) = 60
n^ 2^=IAI = 243
(Vy - ( 1 5 + 60);i2 + 1 5 x 6 0 - 3 0 x 3 0 = 0
The last term of the characteristic equation is always equal to the determinant of
A, which in this case equals zero. Hence we obtain:
XM=tr(A) = 75
k
and
n^i=iAi=o
k
where I^ is thepxp identity matrix and where A^ is apx/? diagonal matrix in which
the elements of the main diagonal are the p eigenvalues associated to the eigen-
vectors (columns) in the pxp matrix V. Because A^ is a diagonal matrix, the
decomposition is also called diagonalization of A. Algorithms for eigenvalue
decomposition are discussed in Section 31.4. These are routinely used in the
multivariate analysis of measurement tables and contingency tables (Chapters 31
and 32).
Because of the orthonormality condition we can rearrange the terms of the
decomposition in eq. (29.46) into the expression:
A = VA^V^ (29.47)
which is known as the spectral decomposition of A. The latter can also be expanded
in the form:
A = ^^ V, v^ + . . . + ^^, V, v l + . . . + ^ V, v^ (29.48)
where v^ represents the Jdh column of V and where \\ is the eigenvalue associated
tov^.
By way of example, we propose to extract the eigenvectors from the symmetric
matrix A of which the eigenvalues have been derived at the beginning of this
section:
39 -24
A=
-24 21
where v^ and Vj2 are the unknown elements of the eigenvector Vj associated to A,^.
The determinant of the corresponding system of homogeneous linear equations
equals zero:
IA-?i2i| = (_i6.632)(-34.632)-(-24)(-24) = 0
within the limits of precision of our calculation. Hence, we can solve for the
unknowns v'^ and v'21, which are the non-normalized elements v^ and V2\.
35
v'n = 1
21 -16.632/24 = -24/34.632 = -0.693
Since the norm of \\ is defined as:
IIv^ Il = (v7i + v'l, )i/2 =(1 + 0.6932)1/2 ^1.217
we derive the elements of the normalized eigenvector Vj from:
vji =v'ii/llvVI = l/l.217=0.822
V2i= v'21/llv'ill = -0.693/1.217 = - 0 . 5 6 9
or
Vi = [0.822-0.569f
In order to compute the second eigenvector, we make use of the spectral decomp-
osition of the matrix A:
A-?l2 y ..T _ •
1 M^l ' 2 V.Vn^
^2^2
where the left-hand member is called the residual matrix (or deflated matrix) of A
after extraction of the first eigenvector.
This reduces the problem to that of finding the eigenvector V2 associated to 'k\
from the residual matrix:
39 -24 0.822
-Mv,v;^ = - 55.6322 [0.822 -0.569]
-24 21_ -0.569
"1.417 2.045
2.045 2.S>51
where v^2 and V22 are the unknown elements of the eigenvector V2 associated to X, 2.
The determinant of the residual matrix is also zero:
36
12 • 1
v'„ = 2.951/2.045 = 2.045/1.417 = 1.443
After normalization we obtain the elements v,2 and V22 of the eigenvector V2 associ-
ated to A, 2:
v,2 =v',2 /llv'2 11=1/1.756 = 0.569
V22 =v'22 /llv'211= 1.433/1.756 = 0.822
where
llv'2 II = (1+1.4332)"2 ^ i 7 5 g
It can be shown that the second eigenvector Vj can also be computed directly from
the original matrix A, rather than from the residual matrix A - X\\f\J, by solving
the relation:
( A - ^ 1)^2 = 0
This follows from the orthogonality of the eigenvectors v, and V2. We have preferred
the residual matrix because this approach is used in iterative algorithms for the
calculation of eigenvectors, as is explained in Section 31.4.
Finally, we arrange the eigenvectors column-wise into the matrix V, and the
eigenvalues into the diagonal matrix A^:
From V and A-^ one can reconstruct the original matrix A by working out the
consecutive matrix products:
0.822 0.569 55.632 0 0.822 -0.569
VA^V T _
-0.569 0.822 0 4.368 0.569 0.822
37
39 -24
=A
-24 21
where I^ represents the identity matrix of dimension rxr, and where Js} is the
diagonal matrix whose elements on the main diagonal are the r positive eigen-
values, with r < /?. As a general rule we cannot write that VV^ equals I^ but the
product possesses a property which is analogous to that of an identity matrix:
VV^A = A VV^ = A (29.50)
which holds for the matrix A from which V has been computed. This remarkable
property is illustrated by means of the singular 2x2 matrix A which we have
already introduced at the beginning of this section:
15 -30 0.447
A= [75] [0.447 - 0.894] = VA^ V^
-30 60 -0.894
and which possesses only one eigenvector and associated eigenvalue. By working
out the matrix product of V with its transpose we obtain that:
0.2 -0.4
VV T _
-0.4 0.8
which is different from the 2x2 identity matrix. Nevertheless we find that:
T _
15 -301 0.2 -0.4 15 -30
AW =A
30 60 -0.4 0.8 -30 60
38
and
0.2 - 0 . 4 ] r 15 -30 15 -30
=A
•0.4 0.8 -30 60 -30 60
S = VA
and where A is a diagonal matrix whose main diagonal elements are equal to the
square roots of those of A^. This expression is also known as the Young-House-
holder factorization of a symmetric matrix [6].
The spectral decomposition can also be used to compute various powers of a
symmetric matrix. For example, the square of a symmetric matrix A of dimension
p can be computed by means of its spectral decomposition:
A^ = A A = ( V A 2 V T ) ( V A 2 V ^ ) = VA^V'^ (29.52)
where use is made of the orthonormality of the columns of V. This procedure
generally applies to all real exponents and holds also for singular symmetric
matrices (for which r<p):
A' = VA^^V^ for all real exponents s. (29.53)
It should be noted that in the case of a singular matrix A, the dimensions of V and
A^ are pxr and rxr, respectively, where r is smaller than p. The expression in eq.
(29.53) allows us to compute the generalized inverse, specifically the Moore-
Penrose inverse, of a symmetric matrix A from the expression:
which holds even when A is singular. It provides a more general method for
computing inverses, but requires that an algorithm for eigenvector extraction is
available.
For the previously defined 2x2 matrix A we obtain the inverse from its
eigenvalue decomposition, which has already been derived:
2\TT
= \A^\
AA-^=A-iA = l2
X^X = VA^2\7T
V with \^Y = V and r<p (29.55)
The transposed product XX^ has the same r eigenvalues in A^, but generally
possesses r different eigenvectors in U:
2ITT
XXT = UA2U with U^U = L and r<n (29.56)
We have already mentioned that the determinant of a square symmetric pxp
matrix A of the form X^X can be expressed in terms of its p eigenvalues:
This is an alternative way of computing determinants which, unlike the usual one
described in Section 9.3.4, allows for a geometrical interpretation.
As we have shown above (Section 29.3), a pxn matrix X can be interpreted
geometrically as a pattern of points P*" in the space S^. In the case of an (hyper)
ellipsoidal pattern we find that the axes of symmetry coincide with the eigen-
40
vectors of X^X and that the lengths of the semi-axes are equal to the square roots of
the corresponding eigenvalues. This is the case when the data represented by X
follow a multivariate normal distribution [3]. The product of the square roots of the
eigenvalues can thus be regarded as the product of the lengths of the semi-axes of
the ellipsoid. Hence, this product is proportional to the volume enclosed by the
(hyper)ellipsoid. From eq. (29.57) follows that the determinant lAI or IX'^XI can
also be regarded as being proportional to the volume enclosed by an ellipsoidal
pattern. In this case, the semi-axes of the (hyper)ellipsoid are also oriented in the
direction of the eigenvectors of X^X but their lengths are proportional to the
corresponding eigenvalues. When /? = 2 we obtain that the determinant is propor-
tional to the area of the ellipse nX] X\; when /? = 3 the determinant is proportional
to the volume of the ellipsoid (4n/3)X]X\'k\, etc. In some applications it is
required that this volume be minimal. For example, in an experimental design with
design matrix X, D-optimality is achieved when the determinant, i.e. the volume of
the ellipsoid, derived from (X^X)~^ is minimized (Section 24.2.1). This geo-
metrical interpretation may clarify the concept of degeneracy, when two or more
eigenvalues are identical. In that case, the (hyper)ellipsoid which encloses the
pattern P^ can be seen to be spherical in two or more dimensions of S^.
Singularity of the matrix A occurs when one or more of the eigenvalues are zero,
such as occurs if linear dependences exist between the p rows or columns of A.
From the geometrical interpretation it can be readily seen that the determinant of a
singular matrix must be zero and that under this condition, the volume of the
pattern P^ has collapsed along one or more dimensions of 5^. Applications of
eigenvalue decomposition of dispersion matrices are discussed in more detail in
Chapter 31 from the perspective of data analysis.
Singular value decomposition (SVD) of a rectangular matrix X is a method
which yields at the same time a diagonal matrix of singular values A and the two
matrices of singular vectors U and V such that:
From the 4x2 matrix X of our previous illustration we already derived V and A^
from the eigenvalue decomposition of the 2x2 cross-product matrix X^X:
r 2 -f
-5 4
X=
1 -2
_ 3 0_
X'^X = VA^V^
39 -24'
•24 21_
XX^ =UA2U^
0.297 0.152
-0.856 0.210 55.632 0 0.297 -0.856 0.263 0.331
0.263 -0.514 0 4.368 0.152 0.210 -0.514 0.818
0.331 0.818
5 -14 4 6
14 41 -13 -15
4 -13 5 3
6 -15 3 9
From these results we can now define the singular vector decomposition of the 4x2
data matrix X:
42
0.297 0.152]
-0.856 0.210 r7.459 0 1 [0.822 -0.569
UAV =
0.263 -0.514 [o 2.09oJ [o.569 0.822
0.331 0.818 J
2 -1"
-5 4
=X
1 -2
3 0
h=
points, thus giving more weight to certain points than to others. This aspect will be
further developed in Chapter 32.
In order to be consistent with the concept of the two dual data spaces, we must
also define the vector of row-means m^:
1 1 ^
m ^ = —XI or ^i =-~y\^ij with/= 1,..., n (29.60)
P P j
where 1^ is the sum vector of dimension p.
The elements of m^ represent the coordinates of the row-centroid, which
corresponds with the center of mass of the column-pattern P^ of points formed in
row-space S"^.
The global mean m is a scalar defined by:
Fig. 29.8. (a) Pattern of points in column-space S^ (left panel) and in row-space S" (right panel) before
column-centering, (b) After column-centering, the pattern in S^ is translated such that the centroid co-
incides with the origin of space. Distances between points in S^ are conserved while those in 5" are not.
(c) After column-standardization, distances between points in 5^ and S" are changed. Points in S" are
located on a (hyper)sphere centered around the origin of space.
45
or
J IJ ^ IJ "^l with /= 1,..., n andy = 1, ...,p
where the outer product has dimension nxp. Row-centering achieves a parallel
translation of the pattern of points P^ representing the columns in row-space S"^
such that the center of mass coincides with the origin. It leaves distances in 5^^
unchanged, but affects distances in 5^.
Finally, we can center the matrix X simultaneously by rows and by columns,
which yields the matrix of deviations from row- and column-means or the double-
centered matrix Y:
Y= X-\ml-m„\l+m\M (29.64)
or
yij = X,
rri: m, + m with / = 1,..., n andj = 1, ...,p
we derive
(a) the vector of column-means nip and the column-centered matrix Y^:
-0.5 1 0.25
5.5 - 5 1.25
p
-4.5 5 3.25
-0.5 -1 -475
In the above matrix Y we obtain that the sums computed row- and column-wise are
zero.
A vector of column-standard deviations dp provides for each column of X a
measure of the spread of the elements around the corresponding column-mean:
,1/2 \ Ml
dT = withy = 1, ...,/? (29.65)
p
or dj =
z^yl
where Y^ is the column-centered matrix obtained from X and where the power
symbols indicate that all elements of Y^ are to be squared and that the square root
of all elements in the vector-to-matrix product is to be taken. We divide by n rather
than by n-\ in the above definition of standard deviation. The former seems to be
more appropriate in exploratory data analysis, where the emphasis lies in revealing
geometrical structure in the data. The latter is preferred in confirmatory data
analysis, where the focus is on estimating statistical properties from sampled data.
Each element dj of the vector d^ contains the norm or length of the corresp-
onding column y^ in Y^, multiplied by a constant:
.1/2
1
dj = |y./| smce iy,n = \yl
The geometrical interpretation of column-standard deviations is in terms of
distances of the points representing the columns of Y^ from the origin of S'^Cmulti-
47
z.=Y^; (29.66)
or
'yij'^^j
with /= 1, ..., n andy = 1, ...,/?
All 111
,1/2
Since iy,n =
VJ
Z . =D-^ Y (29.68)
or
s =xx^ (29.69)
s, = x^x (29.70)
where X is an nxp data matrix. The nxn matrix S„ is symmetric and contains all
scalar products that can be formed from the rows of X. The px/? matrix S^ is also
symmetric and contains all scalar products that can be formed from the columns of
X. Using the previously defined properties of vectors we can derive from these
cross-products the norms of the vectors, their angular distances and the distances
between their endpoints. This shows that the structural information of the patterns
in space are contained in the cross-product matrices.
49
From a previously derived property of the trace of a product of two matrices (eq.
(29.26)) follows that:
tr(S,) = tr(XXT) = tr(X^X) = tr(Sp (29.71)
For the 4x3 matrix X in the previous illustration in this section, we derive the
cross-product matrices:
35 14 60 10
14 66 -6 -4
s„ = 60 -6 126 12
10 -4 12 14
C p = - Y ^p Yp (29.72)
n
The matrix C^ contains the variances of the columns of X on the main diagonal and
the covariances between the columns in the off-diagonal positions (see also Section
9.3.2.4.4). The correlation matrix R^ is derived from the column-standardized
matrix Z^:
(29.73)
n
which contains the coefficients of correlation between the columns of X.
From Y^ and Z^, which have been computed in the previous examples, we
obtain the corresponding column-covariance matrix C^ and column-correlation
matrix R^:
12.750 -12.500 -1.375
Cp = -12.500 13.000 3.750
-1.375 3.750 8.688
50
1 -0.971 -0.131
R. -0.971 1 0.353
-0.131 0.353 1
Similar expressions to those in eqs. (29.72) and (29.73) can be derived for the
variance-covariances and correlations between the rows of X:
1
Y Y^ (29.74)
n n
(29.75)
P
®
Fig. 29.9. (a) Geometrical representation of the image s in 5" of a vector v in S^. To each point in S" cor-
responds a vector in S^, and vice versa, (b) Geometrical representation of the image 1 in S^ of a vector u
in S". To each point in S^ corresponds a vector in 5", and vice versa.
s = Xv (29.76)
in particular, we can express the /th element of s by means of the scalar product of
the /th row of X with v:
S: =X with /= 1, ..., n
52
The matrix X defines a pattern P"" of n points, e.g. x^, in 5^ which are projected
perpendicularly upon the axis v. The result, however, is a point s in the dual space
5". This can be understood as follows. The matrix X is of dimension nxp and the
vector V has dimensions p. The dimension of the product s is thus equal to n. This
means that s can be represented as a point in 5^^. The net result of the operation is
that the axis v in 5^ is imaged by the matrix X as a point s in the dual space 5". For
every axis v in 5^ we will obtain an image s formed by X in the dual space. In this
context, we use the word image when we refer to an operation by which a point or
axis is transported into another space. The word projection is reserved for operations
which map points or axes in the same space [11]. The imaging of v in S^ into s in S""
is represented geometrically in Fig. 29.9a. Note that the patterns of points P"^ and P^
are represented schematically by elliptic envelopes.
In a similar way, we can think of X^as representing a pattern P^ ofp points, e.g.
Xy, in 5" which can be projected upon an axis u and which results into a point I (not
to be confounded with the sum vector 1) in the complementary (or dual) space 5^
(Fig. 29.9b):
l = X^u (29.77)
in particular, we can express the jih element of 1 as the scalar product of the jth
column of X with u:
/. =xTu withy = 1, ...,/7
Using the same argument as above, we can see that the product is of dimension p,
since the matrix X^ has dimensions pxn and the vector u possesses dimension n.
Hence, the vector can be imaged as a single point in the dual space S^. We state that
the vector u in 5" is imaged by the matrix X^ into the point 1 in the dual space S^.
:P ^
Fig. 29.10. Geometrical interpretation of multiple linear regression (MLR). The pattern of points in S^
representing a matrix X is projected upon a vector b, which is imaged in S" by the point y. The orienta-
tion of the vector b is determined such that the distance between y and the given y is minimal.
53
In a general way, we can state that the projection of a pattern of points on an axis
produces a point which is imaged iJi the dual space. The matrix-to-vector product
can thus be seen as a device for passing from one space to another. This property of
swapping between spaces provides a geometrical interpretation of many procedures
in data analysis such as multiple! linear regression and principal components
analysis, among many others [12] (see Chapters 10 and 17).
In multiple linear regression (A^LR) we are given an nxp matrix X and an
n vector y. The problem is to find aii unknown p vector b such that the product y of
X with b is as close as possible to the original y using a least squares criterion:
y = Xb (29.78)
h = (X'X)-'X'y (29.79)
where X is the nxp matrix of independent variables, y is the n vector of dependent
observations and b is the p vector of regression coefficients. This leads to an
expression for the n vector of estimated dependent observations y:
y = X(X'^X)-^X^y (29.80)
where the term X(X^X)"^X^ is called an orthogonal projection operator. In geo-
metrical terms, this operator projects from 5'' to 5^ and back to S"^. When applied to
the dependent observations in y, it returns the predictions in y that can be obtained
by using the information contained in X. In other words, y reproduces the part of
the variation in y that can be predicted from X. The above expression is useful, for
example, in the case when one obtained p independent measurements from a
number n^ of additional samples in the form of an n^xp additional data matrix X^,.
Predictions for the corresponding dependent observations y^ are then derived
from:
i =:X,(X^X)-^X^y (29.81)
In the general case we use the symbols U and V to represent projection matrices
in S'^ and 5^, each containing r projection vectors, and the symbols S and L to
represent their images in the dual space:
for a projection in 5^, producing an image in S^ (29.82)
s = xv
L = X^U for a projection in 5^^, producing an image in S^ (29.83)
Fig. 29.11. Geometrical interpretation of a rotation of 5^ as a change of the frame of coordinate axes to
new directions which are defined by the columns in the rotation matrix V (left panel). Likewise, a rota-
tion ofS" can be interpreted as a change of the frame of coordinate axes to new directions which are de-
fined by the columns in the rotation matrix U (right panel).
55
where X is an nxp matrix and where V and U are/7Xr and nxr projection matrices,
respectively. The images S and L are nxr and/?xr matrices, respectively. We have
chosen the symbols S and L in accordance with conventional nomenclature used in
multivariate data analysis, where projections of rows of a data matrix are called
scores and where projections of columns are called loadings.
An orthogonal projection is obtained when the projection matrices V or U are
orthonormal:
V^V = U^U = I, (29.84)
where I^ is the identity matrix of dimension r, i.e. the number of projection vectors
in V and U.
Eigenvector projections are those in which the projection vectors u and v are
eigenvectors (or singular vectors) of the data matrix. They play an important role in
multivariate data analysis, especially in the search for meaningful structures in
patterns in low-dimensional space, as will be explained further in Chapters 31 and
32 on the analysis of measurement tables and general contingency tables.
A projection is called a rotation when the projection matrix U or V is non-
singular and square. In a geometrical sense, one obtains in this case that the number
of projection vectors equals the number of dimensions of space. The result of a
rotation is a change of the frame of coordinates (Fig. 29.11). A non-orthogonal
rotation will result in an oblique frame of coordinate axes, and interdistances
between points that represent rows and columns will generally not be conserved.
Of special interest are orthogonal rotations which are obtained by multiplying a
matrix with an orthonormal rotation matrix. In the case of an nxp non-singular
matrix X with n>p,we have:
or
LL^ = X^UU^X = X'^X (29.88)
where use is made of the orthogonality of rotation matrices.
After an orthogonal rotation one can also perform a backrotation toward the
original frame of reference axes:
SV^ = XVV^ = X (29.89)
or
LU^ = X'^UU^ = X^ (29.90)
where V and U are orthonormal rotation matrices and where use is made of the
same property of orthogonality as stated above.
References
1. W.R. Dillon and M. Goldstein, Multivariate Analysis, Methods and Applications. Wiley, New
York, 1984.
2. F.R. Gantmacher, The Theory of Matrices. Vols. 1 and 2. Chelsea Publ., New York, 1977.
3. N.C. Giri, Multivariate Statistical Inference. Academic Press, New York, 1972.
4. N. Cliff, Analyzing Multivariate Data. Academic Press, San Diego, CA, 1987.
5. R.J. Harris, A Primer on Multivariate Statistics. Academic Press, New York, 1975.
6. C. Chatfield and A.J. Collins, Introduction to Multivariate Analysis. Chapman and Hall, Lon-
don, 1980.
7. M.S. Srivastana and E.M. Carter, An Introduction to Applied Multivariate Statistics. North
Holland, New York, 1983.
8. T.W. Anderson, An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1984.
9. O.M. Kvalheim, Interpretation of direct latent-variable projection methods and their aims and
use in the analysis of multicomponent spectroscopic and chromatographic data. Chemom.
Intell. Lab. Syst., 4 (1988) 11-25.
10. P.E. Green and J.D. Carroll, Mathematical Tools for Applied Multivariate Analysis. Academic
Press, New York, 1976.
11. A. Gifi, Non-linear Multivariate Analysis. Wiley, Chichester, UK, 1990.
12. K. Beebe and B.R. Kowalski, An introduction to multivariate calibration and analysis. Anal.
Chem., 59 (1987) 1007A-1009A.
57
Chapter 30
Cluster analysis
30.1 Clusters
Concentration of Ni
i
Concentration of Ge
ABCDEFGHIJ
. 1
ABFDG CEHIJ
I ' I I ' 1
ABF DG CEJI H
rti rh I r I I
A B F D G C E J I
Fig. 30.2. Hierarchical clustering for the meteorites described in Fig. 30.1.
grouped in genera, genera in families, etc. This classification was obtained histor-
ically by determining characteristics such as the number of cotyledons, the flower
formula, etc. More recently, botanists and scientists from other areas such as
bacteriology where classification is needed, have reviewed the classifications in
their respective fields by numerical taxonomy [1,2]. This consists of considering
the species as objects, characterized by certain variables (number of cotyledons,
etc.). The data table thus obtained is then subjected to clustering. Numerical
taxonomy has inspired other experimental scientists, such as chemists to apply
clustering techniques in their own field.
The other main possibility of representing clustered data is to make a table
containing different clusterings. A clustering is a partition into clusters. For the
example of Fig. 30.1, this could yield Table 30.1. Such a table does not necessarily
yield a complete hierarchy (e.g. in going from 6 to 7, objects J and I, separated for
clustering 6, are joined again for 7). Therefore, the presentation is called non-
hierarchical.
Classical books in the field are the already cited book by Sneath and Sokal [1]
and that by Everitt [3]. A more recent book has been written by Kaufman and
59
(angiosperms)
(monocotyledones) (dicotyledones)
(rosaceae) (papilionaceae)
pi
o
r
&
I I I
Fig. 30.3. Taxonomy of some plants.
TABLE 30.1
1 A B C D E F G H I J
2 A B F D G - C E H I J
3 A B F - D G - C E H I J
4 A B F - D G - C E J I - H
6 A - B F - D G - C I - J E - H
7 A - B F - D G - C - E - J I - H
0 A - B - F - D - G - C - E - J -
Rousseeuw [4]. Massart and Kaufman [5] and Bratchell [6] wrote specifically for
chemometricians. Massart and Kaufman's book contains many examples, relevant
to chemometrics, including the meteorite example [7]. More recent examples
concern classification, for instance according to structural descriptions for toxicity
testing [8] or in connection with combinatorial chemistry [9], according to chemical
60
composition for aerosol particles [10], Chinese teas [11] or mint species [12] and
according to physicochemical parameters for solvents [13]. The selection of
representative samples from a larger group for multivariate calibration (Chapter
36) by clustering was described by Naes [14].
To be able to cluster objects, one must measure their similarity. From our
introduction, it is clear that "distance" may be such a measure. However, many
types of similarity coefficients may be applied. While the terms similarity or
dissimilarity have no unique definitions, the definition of distance is much clearer
[6]. A dissimilarity between two objects / and i' is a distance if
(where x- and x- are the row-vectors of the data table X with the measurements
describing objects / and /')
Ar = A . (30.2)
30.2.2.1 Distances
The equation for the Euclidean distance between objects / and i is
Ar=JX(^,v-^.';)' (30.4)
In some cases, one wants to give larger weights to some variables. This leads to
the weighted Euclidean distance:
^/r=JS[(^//-^.y)/^,]' (30.6)
'j=\lt^-,-^j)' (30.7)
It can be shown that the standardized Euclidean distance is the Euclidean distance
of the autoscaled values of X (see further Section 30.2.2.3). One should also note
that in this context the standard deviation is obtained by dividing by n, instead of
by(n-l).
The Mahalanobis distance [15] is given by:
Z)^,=(x,-x,)^C-Ux,-x,) (30.8)
x^i Xi»
a) b)
Fig. 30.4. Mahalanobis distance: (a) object Bis closer to centroidC of cluster Gl then object A; (b)the
distance between clusters Gl and G2 is smaller than between G3 and G4.
D,^,=(x,-x,yW(x,-x,) (30.9)
where W represents an mxm weighting matrix. Four particular cases of the gener-
alized distance are mentioned below:
W = I defines ordinary Euclidean distances
W = diag (w) produces weighted Euclidean distances
W = diag (1/d^) where d represents the vector of column-standard
deviations of X, yielding standardized Euclidean distances
W = C~\ where C represents the variance-covariance matrix as defined in
eq. (30.8), defines Mahalanobis distances.
Euclidean distances (ordinary or standardized) are used very often for clustering
purposes. This is not the case for Mahalanobis distance. An application of Mahal-
anobis distances can be found in Ref. [16].
Fig. 30.5. The point / is equidistant to /' and f according to the Euclidean distances {Dw and Du") but
much closer to /' (cos 0,/') than to i" (cos 6r), when a correlation-based similarity measure is applied.
those SFs that have more or less the same over-all retention, i.e., the same polarity
towards a variety of substances, are considered to be similar. In that case, SF3 is
very dissimilar from both SFj and SF2, while SFj and SF2 are quite similar. The
best way to express this is the Euclidean distance. D^^ and D22, are then much higher
than D^2'
On the other hand, the analyst might not be interested in global retention
indices. Indeed, by increasing the temperature for SF3, he would obtain similar
retention indices as for the other two. He will then observe that the relative
retention time, i.e. the retention times of the substances compared with each other,
are the same for SFj and SF3 and different from SF2. Chemically, this means that
SF3 has different polarity from SFj, but the same specific interactions. This is best
expressed by using the correlation coefficient as the similarity measure. Indeed,
ri3 = 1, indicating complete similarity, while r^2 ^^^ ^23 ^ ^ niuch lower. Since both
r = 1 and r = -1 are considered to indicate absolute similarity and if, as with
Euclidean distance, one would like the numerical value of the similarity measures
to increase with increasing dissimilarity, one should use, for instance, 1 - |r|.
TABLE 30.2
Retention indices of five substances on three stationary phases in GLC
Stationary 1 2 3 4 5
phases (SF)
1 100 130 150 160 170
2 120 110 170 150 145
3 200 260 300 320 340
64
30.2.2.3 Scaling
In the meteorite example, the concentration of Ni is of the order of 50000 ppm
and the Ga content of the order of 50 ppm. Small relative changes in the Ni content
then have, of course, a much higher effect on the Euclidean distance than equally
high relative changes of the Ga content. One might also consider two metals M and
N, one ranging in concentration from 900 to 1100 ppm, the other from 500 to 1500
ppm. Concentration changes from one end of the range to the other in N would then
be more important in the Euclidean distance than the same kind of change in M. It
is probable that the person carrying out the classification will not agree with these
numerical consequences and consider them as artefacts. Both problems can be
solved by scaling the variables. The most usual way of doing this is using the
z-transform, also called autoscaling (see also Chapter 3.3). One then determines
^^^ij_2^ (30.10)
where x^j is the value for object / of variabley, x^ is the mean for variable7, and Sj is
the standard deviation for variable j . One then uses z in eq. (30.4), which is
equivalent to applying the standardized Euclidean distance (eq. (30.6)) to the
jc-values.
Other possibilities are range scaling and logarithmic transformation. In range
scaling one does not divide by s^ as in eq. (30.6), but by the range r^ of variable/
x^^_-^ (30.11)
r.
where JC^^^J^ is the lowest value of jc^. The logarithmic transform, too, reduces vari-
ation between variables. Its effect is not to make absolute variation equal but to
make variation comparable in the following sense. Suppose that variable 1 has a
mean value of 100 and variable 2 a mean value of 10. Variable 1 varies between 50
and 150. If the variation is proportional to the mean values of the variables, then
one expects variable 2 to vary between more or less 5 and 15. In absolute values the
variation in variable 1 is therefore much larger. When one transforms variables 1
and 2 by taking their logarithms the variation in the two transformed variables
becomes comparable. Log-transformation to correct for heteroscedasticity in a
regression context is described in Section 8.2.3.1. It also has the advantage that the
scaling does not change when data are added. This is not so for eqs. (30.10) and
(30.11), since one must recompute Xj, r- or Sy
65
V/ = 0 if ^/y^^ry
The matching coefficient is the mean of the 5--values for all m attributes
1 m
(30.13)
m j=i
This means that one counts the number of attributes for which / and /' have the
same value and divides this by the number of attributes.
The Jaccard similarity coefficient is slightly more complex. It considers that the
simultaneous presence of an attribute in objects / and i indicates similarity, but that
the absence of the attribute has no meaning. Therefore:
^..,.= 1 if x^. = x^^=i
and
It can be shown [5] that the Hamming distance is a binary version of the city block
distance (Section 30.2.3.2).
Some authors use the Hamming distance as the equivalent of Euclidean distance
of binary data. In that case:
The literature also mentions a normalized Hamming distance, which is then equal
to either:
1 '^
"^.1=
or
1 '"
\mf:t
The first of these two is also called the Tanimoto coefficient by some authors. It
can be verified that, since distance = 1 - similarity, this is equal to the simple
matching coefficient. Clearly, confusion is possible and authors using a certain
distance or similarity measure should always define it unambiguously.
D,,=X4o (30-14)
67
X2 4
^1
Fig. 30.6. Dii' is the Euclidean distance between / and /', dm and dwj are the city-block distances be-
tween / and /' for variables xi and X2 respectively. The city-block distance is dm + dm.
Here, too, scaling can be required when the ranges of the variables are dissimilar.
In this case, one divides the distances by the range r^ for variable j
IJ
The Manhattan distance is obtained for r = 1 and the Euclidean distance for r = 2. In
this context the Euclidean distance is also referred to as the L2-norm.
for ordinal variables in Section 30.2.3.2, while binary variables are expressed
naturally on a 0-1 scale. The range scaled similarity for variables on an interval
scale is obtained as
with Zij and z- as defined in eq. (30.12). Then one can determine the similarities of
the objects / and /' by summing the range scaled similarities for all variables j . A
distance measure can be obtained by computing
where d^^^j is the range scaled similarity between objects / and i' for variable 7.
The similarities between all pairs of objects are measured using one of the
measures described earlier. This yields the similarity matrix or, if the distance is
used as measure of (dis)similarity, the distance matrix. It is a symmetrical nxn
matrix containing the similarities between each pair of objects. Let us suppose, for
example, that the meteorites A, B, C, D, and E in Table 30.3 have to be classified
and that the distance measure selected is Euclidean distance. Using eq. (30.4), one
obtains the similarity matrix in Table 30.4. Because the matrix is symmetrical,
only half of this matrix needs to be used.
TABLE 30.3
TABLE 30.4
Similarity matrix (based on Euclidean distance) for the objects from Table 30.3
A 0
B 40.0 0
C 38.7 17.3 0
D 110.4 70.7 78.1 0
E 111.4 72.1 80.6 14.1
70
A B C D E A B C D E A B C D E
OT
U M IUM U M
bO\
1001
Fig. 30.7. Dendrograms for the data of Tables 30.3-30.7: (a) average linkage; (b) single linkage; (c)
complete linkage.
TABLE 30.5
Successive reduced matrices for the data of Table 4 obtained by average linkage
(a)
B D*
A 0
B 40.0 0
C 38.7 17.3 0
D* 110.9 71.4 79.3
D* is the object resulting from the combination of D and E.
(b)
A B* D*
A 0
B* 39.3 0
D* 110.9 75.3 0
B* is the object resulting from the combination of B and C.
(c)
A* D*
A* 0
D* 93.1 0
A* is the object resulting from the combination of A and B*.
(d)
The last step consists in the junction of A* and D*.
The resulting dendrogram is given in Fig. 30.7(a).
71
TABLE 30.6
Successive reduced matrices for the data of Table 30.4 obtained by single linkage
(a)
A B C D*
A 0
B 40.0 0
C 38.7 17.3 0
D* 110.4 70.7 78.1
(b)
A B* D*
A 0
B* 38.7 0
D* 110.4 70.7 0
(c)
A* D*
A* 0
D* 70.7 0
(d)
The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(b).
TABLE 30.7
Successive reduced matrices for the data of Table 30.4 obtained by complete linkage
(a)
A B C D*
A 0
B 40.0 0
C 38.7 17.3 0
D* 111.4 72.1 80.6
(b)
A B* D*
A 0
B* 40.0 0
D* 111.4 80.6 0
(c)
A* D*
A* 0
D* 111.4 0
(d)
The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(c).
X,*
Fig. 30.8. Dissimilar objects A and X are chained together in a cluster obtained by single linkage.
says that the single linkage methods is space contracting. The complete linkage
method leads to small, tight clusters and is space dilating. Average linkage and
Ward's method are space conserving and seem, in general, to give the better results.
73
TABLE 30.8
Values characterizing the objects of Fig. 30.9
X2
A 45 24
B 24 43
C 14 23
D 64 52
E 36 121
F 56 140
G 20 148
TABLE 30.9
Euclidean distance between points in Fig. 30.9 (from Ref [18])
B C D
A 0
B 28 0
C 32 23 0
D 35 40 60 0
E 100 80 103 76 0
F 119 104 128 90 29 0
G 127 105 126 105 30 35
Fig. 30.9. Examples of trees in a graph; (a) is the minimal spanning tree [18].
minimal spanning tree is also the optimal solution for the highway problem. The
terminology used in this chapter comes from graph theory. Graph theory is
described in Chapter 42.
Several algorithms can be used to find the minimal spanning tree. One of these
is KruskaVs algorithm [19] which can be stated as follows: add to the tree the edge
with the smallest value which does not form a cycle with the edges already part of
the tree. According to this algorithm, one selects first the smallest value in Table
30.9 (link BC, value 23). The next smallest value is 28 (link AB). The next smallest
values are 29 and 30 (links EF and EG). The next smallest value in the table is 32
(link AC). This would, however, close the cycle ABC and is therefore eliminated.
Instead, the next link that satisfies the conditions of Kruskal's algorithm is AD and
the last one is DE. The minimal spanning tree obtained in this way is that given in
Fig. 30.9(a).
After careful inspection of this figure, one notes that two clusters can be
obtained in a formal way by breaking the longest edge (DE). When a more detailed
classification is needed, one breaks the second longest edge, and so on until the
desired number of classes is obtained (see Section 30.3.4.2). In the same way,
clusters were obtained from Fig. 30.7 by breaking first the lowest link, i.e. the one
with highest distance.
An example of an application is shown in Fig. 30.10. This concerns the
classification of 42 solvents based on three solvatochromic parameters (para-
meters that describe the interaction of the solvents with solutes) [13]. Different
methods were applied, among which was the average linkage method, the result of
which is shown in the figure. According to the method applied, several clusterings
can be found. For instance, the first cluster to split off from the majority of solvents
consists of solvents 36, 37, 38, 39, 40, 41, 42 (r-butanol, isopropanol, n-butanol,
75
0.371 -r
0.332 I
0.293 t
0.254
0.215 +
0.176
0.138 -f
0.099 I
0.060 +
0.021
Fig. 30.10. Hierarchical agglomerative classification of solvents according to solvent-solute and sol-
vent-solvent interactions [13].
ethanol, methanol, ethyleneglycol and water). This solvent class consists of am-
phiprotic solvents (alcohols and water). This is then split further into the mono-
alcohols on the one hand and ethyleneglycol and water, which have higher
association ability. In this way, one can develop a detailed classification of the
solvents. Another use of such a classification is to select different, representative
objects. Snyder [20] used this to select a few solvents, that would be different and
representative of certain types of solvent-solute interactions. These solvents were
then used in a successful strategy for the optimization of mobile phases for liquid
chromatographic separation.
The hierarchical methods so far discussed are called agglomerative. Good
results can also be obtained with hierarchical divisive methods, i.e., methods that
first divide the set of all objects in two so that two clusters result. Then each cluster
is again divided in two, etc., until all objects are separated. These methods also lead
to a hierarchy. They present certain computational advantages [21,22].
Hierarchical methods are preferred when a visual representation of the cluster-
ing is wanted. When the number of objects is not too large, one may even compute
a clustering by hand using the minimum spanning tree.
One of the problems in the approaches described above and, in fact, also in those
described in the next sections, is that when objects are added to the data set, the
76
05%
LI 03[l [\
Va-1
Fig. 30.11. The three-distance clustering method [23]. The new object A has to be classified. In node
Ka it must be decided whether it fits better in the group of nodes represented by Vi, the group of nodes
represented by V2, or does not fit in any of the nodes already represented by V^.
whole clustering must be carried out again. A hierarchical procedure which avoids
this problem has been proposed by Zupan [23,24] and is called the three-distance
clustering method (3-DM). Let us suppose that a hierarchical clustering has
already been obtained and define V^ as a node in the dendrogram, representing all
the objects above it. Thus
1
Z^ ^ii^Xi ^i2^"'^2j
i=\ i=\ i=\
where v,^ is the mean vector of the n^ vectors, representing the % objects / = 1,..., n^
above it and m is the number of variables. For instance, in Fig. 30.11, Vj is the mean
of three objects Oj, O2, O3.
Suppose now also that in an earlier stage one has decided that the new object A
belongs rather to the group of objects above V^ than to the group represented by V^.
We will now take one of three possible decisions:
(a) A belongs to Vj
(b) A belongs to V2
(c) A does not belong to V^ or V2.
This depends on the similarity or distance between A, V^ and Vo- O^ie determines
the similarities 5^^ , 5^^^, and 5^ ^ . If 5^^ is highest, then A belongs to V^ and
the same process as for V^ is repeated for Vj. If 5^^^ is highest, then A belongs to
the group represented by V2 and if 5^, y^ is highest, then A belongs to V^^ but not to
V, or Vo. A new branch is started for A between K, and V,a-\'
Let us now cluster the objects of Table 30.8 with a non-hierarchical algorithm.
Instead of clustering by joining objects successively, one wants to determine
77
1I
oD
oF
»t1
^ E ^
X2
Fig. 30.12. Forgy's non-hierarchical classification method. A,..., G are objects to be classified; * 1,
* 4 are successive centroids of clusters.
^2*
Fig. 30.13. Agglomerative methods will first link A and B, so that meaningless clusters may result.
The non-hierarchical K=2 clustering will yield clusters I and II.
where T is the total sums of squares and products matrix, related to the variance-
covariance matrix, B is the same matrix for the centroids and W is obtained by
pooling the sums of squares and product matrices for the clusters.
One can also write
tr(T) = tr(B) + tr(W)
Since T and therefore also tr(T) is constant, minimizing tr(W) is equivalent to
maximizing tr(B). It can be shown that tr(B) is the sum of squared Euclidean
distances between the group centroids.
An advantage of non-hierarchical methods compared to hierarchical methods is
that one is not bound by earlier decisions. A simple example of how disastrous this
can be is given in Fig. 30.13 where an agglomerative hierarchical method would
start by linking A and B. On the other hand, the agglomerative methods allow
better visualization, although some visualization methods (e.g. Ref. [28]) have
been proposed for non-hierarchical methods.
A group of methods quite often used is based on the idea of describing high local
densities of points. This can be done in different ways. One such way, mode
analysis [29], is described by Bratchell [6]. A graph theoretical method (see also
Chapter 42) by Jardine and Sibson [30] starts by considering each object as a node
and by linking those objects which are more similar than a certain threshold. If a
Euclidean distance is used, this means that only those nodes are linked for which
the distance is smaller thap a chosen threshold distance. When this is done, one
determines the so-called maximal complete subgraphs. A complete subgraph is a
80
Fig. 30.14. Step 1 of the Jardine and Sibson method [30]. Objects less distant than Dj are linked.
set of nodes for which all the nodes are connected to each other. Maximal complete
subgraphs are then the largest (i.e. those containing most objects) of these comp-
lete subgraphs. In Fig. 30.14, they are (B, G, F, D, E, C), (A, B, C, F, G), (H, I, J,
K), etc. Each of these is considered as the kernel of a cluster. One now joins those
kernels that overlap to a large degree with, as criterion, the fact that they should
have at least a prespecified number of nodes in common (for instance, 3). Since
only the first two kernels satisfy this requirement, one considers A, B, C, D, E, F, G
as one cluster and H, I, J, K as another.
Another technique, originally derived from the potential methods described for
supervised pattern recognition (Chapter 33), was described by Coomans and
Massart [31]. A kernel or potential density function is constructed around each
object. In Fig. 30.15 A, B, C and D are objects around which a triangular potential
field is constructed (solid lines). The potentials in each point are summed (broken
line). One selects the point with highest summed potential (B) as cluster center and
measures the summed potential in the closest point. All points, such as A and C,
that can be reached from B along a potential path, which decreases continuously,
belong to the same summed potential hill and such objects are considered to be part
of the cluster. When the potential is higher again, in a certain object, or there is a
point mid-way between two points which has lower potential, then this means that
one has started to climb a new hill and therefore the object is part of another cluster.
The method has the advantage that the form of the cluster is not important, while
most other methods select spherical or ellipsoid clusters. The disadvantage is that
the width of the potential field must be optimized.
All these methods and the methods of the preceding section have one charact-
eristic in common: an object may be part of only one cluster. Fuzzy clustering
applies other principles. It permits objects to be part of more than one cluster. This
leads to results such as those illustrated by Fig. 30.16. Each object / is given a value
81
potential
B'
\
\
\
\
\
AgC D distance
Xjt
II /B
^1
Fig. 30.16. Fuzzy clustering. Two fuzzy clusters (I and II) are obtained. For example/M = 1 ,/AII = 0,/BI
= 0 , / B I I = 1 , / C I = 0.47,/CII = 0.53.
f-i^ for a membership function (see Chapter 19) in cluster k. The following relation-
ships are defined for all / and k:
and
K
82
where 5 is an empirical constant and d^^ a distance. The criterion is minimized and
therefore requires membership values of / and i to be similar when their distance is
small. Fuzzy clustering has been applied only to a very limited extent in chemo-
metrics. A good example concerning the classification of seeds from images is
found in Ref. [33].
As described in the Introduction to this volume (Chapter 28), neural networks
can be used to carry out certain tasks of supervised or unsupervised learning. In
particular, Kohonen mapping is related to clustering. It will be explained in more
detail in Chapter 44.
Fig. 30.17. Square symbols are the actual objects and circled squares are the marked objects. Open cir-
cles are artificial points (adapted from Ref. [34]).
'24
Fig. 30.18. Nine objects, A-I, are clustered. K=2 and A^ = 3 clusterings are obtained. The 2-clustering
is correct but not relevant. The 3-clustering is meaningful and robust.
as the similarity value at which links are cut or the optimization criterion of Section
30.3.3 in function of K. Large steps in that function occur when a significant
A'-clustering is encountered. An information theoretical procedure is described in
[10]. Another approach to select clusters is to validate them using a supervised
pattern recognition technique (see Chapter 33), such as linear discriminant anal-
ysis (LDA) [36]. Cross-validation can be applied to decide whether the LDA does
separate the clusters, in which case they are considered to be significant.
30.3.5 Conclusion
References
1. P.H. A. Sneath and R.R. Sokal, Numerical Taxonomy. The Principles and Practice of Numeri-
cal Classification. Freeman, San Francisco, CA, 1973.
2. G.C. Mead, A.P. Norris and N. Bratchell, Differentiation of Staphylococcus aureus from
freshly slaughtered poultry and strains 'endemic' to processing plants by biochemical and
physiological tests. J. Appl. Bacteriol., 66 (1989) 153-159.
3. B.S. Everitt, Cluster Analysis. Heineman, London, 1981.
4. L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis.
Wiley, New York, 1990.
5. D.L. Massart and L. Kaufman, Interpretation of Analytical Data by the Use of Cluster Analy-
sis. Wiley, New York, 1983.
6. N. Bratchell, Cluster analysis. Chemom. Intell. Lab. Syst., 6 (1989) 105-125.
7. D.L. Massart, L. Kaufman and K.H. Esbensen, Hierarchical non-hierarchical clustering strat-
egy and application to classification of iron-meteorites according to their trace element pat-
terns. Anal. Chem., 54 (1982) 911-917.
8. R.G. Lawson and P.C. Jurs, Cluster analysis of acrylates to guide sampling for toxicity testing.
J. Chem. Inf. Comput. Sci., 30 (1990) 137-144.
9. R.D. Brown and Y.C. Martin, Use of structure-activity data to compare structure-based clus-
tering methods and descriptors for use in compound selection. J. Chem. Inf. Comput. Sci., 36
(1996) 572-584.
10. I. Bondarenko, H. Van Malderen, B. Treiger, P. Van Espen and R. Van Grieken, Hierarchical
cluster analysis with stopping rules built on Akaike's information criterion for aerosol particle
classification based on electron probe X-ray microanalysis. Chemom. Intell. Lab. Syst., 22
(1994) 87-95.
11. L.X. Sun, F. Xu, Y.Z. Liang, Y.L. Xie and R.Q. Yu, Cluster analysis by the K-means algorithm
and simulated annealing. Chemom. Intell. Lab. Syst., 25 (1994) 51-60.
12. E. Marengo, C. Baiocchi, M.C. Gennaro, P.L. Bertolo, S. Lanteri and W. Garrone, Classifica-
tion of essential mint oils of different geographic origin by applying pattern recognition meth-
ods to gas chromatographic data. Chemom. Intell. Lab. Syst., 11 (1991) 75-88.
13. A. de Juan, G. Fonrodona and E. Casassas, Solvent classification based on solvatochromic pa-
rameters: a comparison with the Snyder approach. Trends Anal. Chem., 16(1) (1997) 52-62.
14. T. Naes, The design of calibration in near infra-red reflectance analysis by clustering. J.
Chemom., 1 (1987) 129-134.
15. P.C. Mahalanobis, On the Generalized Distance in Statistics, Proc. Nat. Inst. Sci. (India) 12
(1936)49-55.
16. P. Dagnelie and A. Merckx, Using generalized distances in classification of groups. Biom. J.,
33(1991)683-695.
17. J.H. Ward Jr., Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc, 58
(1963) 236-244.
18. D.L. Massart and L. Kaufman, Operations research in analytical chemistry. Anal. Chem., 47
(1975) 1244A-1253A.
19. J.B. Kruskal Jr., On the shortest spanning subtree of a graph and the traveling salesman prob-
lem. Proc. Am. Math. Soc, 7 (1956) 48-50.
20. L.R. Snyder, Classification of the solvent properties of common liquids. J. Chrom. Sci., 16
(1978)223-234.
21. P. MacNaughton-Smith, W.T. Williams, M.B. Dale and H.T. Clifford. Computer J., 14 (1972)
162.
86
22. A. Thielemans, M.P. Derde and D.L. Massart, CLUE. Elsevier Scientific Software, Elsevier,
Amsterdam, 1985.
23. J. Zupan, A new approach to binary tree-based heuristics. Anal. Chim. Acta, 122 (1980)
337-346.
24. J. Zupan and D.L. Massart, Application of the three-distance clustering method in analytical
chemistry. Anal. Chem., 61 (1989) 2098-2102.
25. E.W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifi-
cations. Biometrics, 21 (1965) 768.
26. J. MacQueen, Some methods for classification and analysis of multivariate observations. In: L.
Le Cam and J. Neyman (eds.). Proceedings 5th Berkeley Symposium on Mathematical Statis-
tics and Probability, University of California Press, Berkeley, CA, 1967, pp. 281-297.
27. D.L. Massart, F. Plastria and L. Kaufman, Non-hierarchical clustering with MASLOC. Pattern
Recogn., 16(1983)507-516.
28. P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster anal-
ysis. J. Comput. Applied. Math., 20 (1987) 53-65.
29. D. Wishart, Mode analysis. In: A.J. Cole (ed.). Numerical Taxonomy. Academic Press, Lon-
don, 1969, pp. 282-308.
30. N. Jardine and R. Sibson, The construction of hierarchic and non-hierarchic classifications.
Computer J., 11 (1968) 177-184.
31. D. Coomans and D.L. Massart, Potential methods in pattern recognition. Part 2. CLUPOT: an
unsupervised pattern recognition technique. Anal. Chim. Acta., 133 (1981) 225-239.
32. E. Ruspini, Numerical methods for fuzzy clustering, Inf Sci., 2 (1970) 319-350.
33. Y. Chtioui, D. Bertrand, D. Barba and Y. Dattee, Application of fuzzy C-means clustering for
seed discrimination by artificial vision. Chemom. Intell. Lab. Syst., 38 (1997) 75-87.
34. B. Hopkins, A new method for determining the type of distribution of plant individuals. Ann.
Bot., 18(1954)213-227.
35. J. Jurs and R.G. Lawson, New index for clustering tendency and its application to chemical
problems. Chemom. Intell. Lab. Syst., 10 (1991) 81-83.
36. E. Marengo and R. Todeschini, Linear discriminant hierarchical clustering: a modeling and
cross-validable divisive clustering method. Chemom. Intell. Lab. Syst., 19 (1993) 43-51.
Additional reading
G.N. Lance and W.T. Williams, A general theory of classificatory sorting strategies. 1. Hierarchical
systems, Comput. J., 9 (1967) 373-380.
P. Willett and V. Winterman, A comparison of some measures for the determination of inter-molecu-
lar structural similarity. Measures of inter-molecular structural similarity. Quant. Struct.- Act.
Relat.,5(1986) 18-25.
87
Chapter 31
Introduction
Measurement tables are the raw data that result from measurements on a set of
objects. For the sake of simplicity we restrict our arguments to measurements
obtained by means of instruments on inert objects, although they equally apply to
sensory observations and to living subjects. By convention, a measurement table is
organized such that its rows correspond to objects (e.g. chemical substances) and
that its columns refer to measurements (e.g. physicochemical parameters). Here
we adopt the point of view that objects are described in the table by means of the
measurements performed upon them. Objects and measurements will also be
referred to in a more general sense as row-variables and column-variables.
A heterogeneous table contains measurements that are defined with different
units; for example, chromatographic retention time in seconds, molecular volume
in nm^, biological activity in mg substance per kg body weight, partition coefficient
(dimensionless), etc. In a homogeneous table, all measurements are expressed in
the same unit, such as retention times obtained with various chromatographic
methods in seconds, or biological activities from a battery of tests in mg substance
per kg body weight, etc.
A special type of homogeneous measurements is found in a compositional table
which describes chemical samples by means of the relative concentrations of their
components. By definition, relative concentrations in each row of a compositional
table add up to unity or to 100%. Such a table is said to be closed with respect to the
rows. In general, closure of a table results when their rows or columns add up to a
constant value. This operation is only applicable to homogeneous tables. Yet
another type of homogeneous table arises when the rows or columns can be
ordered according to a physical parameter, such as in a table of spectroscopic
absorptions by chemical samples obtained at different wavelengths.
Multivariate analysis of these different types of measurements (heterogeneous,
homogeneous, compositional, ordered) may require special approaches for each of
them. For example, compositional tables that are closed with respect to the rows,
require a different type of analysis than heterogeneous tables where the columns
are defined with different units. The basic approach of principal components
analysis, however, can be very helpful in understanding the overall structure of the
data, i.e. its dimensionality, correlations between measurements, clustering of
objects, specificities of objects for certain measurements, patterns of objects
correlating with external parameters, and much more. The objective of this chapter
is to discuss principal components analysis in relation to measurement tables. An
introduction to principal components analysis has already been presented in
Chapter 17, the notation of which is followed here as closely as possible, with two
exceptions which will be indicated where they first occur.
A measurement table is different from a contingency table. The latter results
from counting the number of objects that belong simultaneously to various cat-
egories of two measurements (e.g. molar refractivity and partition coefficient of
chemical compounds). It is also called a two-way table or cross-tabulation, as the
total number of objects is split up in two ways according to the two measurements
that are crossed with one another. The analysis of contingency tables is dealt with
specifically in Chapter 32.
Most of the algebra of vectors and matrices that is used in this chapter has been
explained in Chapters 9 and 29. Small discrepancies between the tabulated values
in the examples and their exact values may arise from rounding of intermediate
results.
Let us suppose that we have a measurement table X with n rows and/7 columns.
Each element x^y of the table then represents the value of they th measurement on the
/th object, where / ranges from 1 to AI and where y ranges from 1 to p. In contrast
with the notation in Chapter 17, we use the symbol p instead of m to denote the
number of columns in a date table in order to avoid a conflict with the symbol m
which we reserve for denoting means, later on in this chapter.
An important theorem of matrix algebra, called singular value decomposition
(SVD), states that any nxp table X can be written as the matrix product of three
terms U, A and V:
TABLE 31.1
Concentrations of trace elements in atmospheric samples at prevailing wind directions
Na CI Si
0 0.212 0.399 0.190
90 0.072 0.133 0.155
180 0.036 0.063 0.213
270 0.078 0.141 0.273
with the orthonormality condition for the singular vectors of the wind directions:
r0753 0.618"
[0753 0.343 0.302 0.473 f 0.343 -0.127 '1 O'
U^U =1.
i 0.618 -0.127 -0.567 -0.529 0.302 -0.567 0 1
0.473 -0.529
and with the orthonormality condition for the singular vectors of the trace elements:
[0.371 0.280"
0.371 0.690 0.622 1 0
VTv = 0.690 0.556 I,
0.280 0.556 -0783 0 1
0.622 -0.7831
Note that, although X has dimensions 4x3, only 2 nonzero eigenvalues can be
extracted. This is the result of a linear dependency among the columns of X which
takes the form:
x^^i =0.5245x^.^2 +0.01437x^.^3
91
The number of singular vectors r is at most equal to the smallest of the number
of rows n or the number of columns p of the data table X. For the sake of simplicity
we will assume here that p is smaller than n, which is most often the case with
measurement tables. Hence, we can state here that r is at most equal to p or
equivalently that r<p<n. The value of r is called the rank of X. It represents the
number of independent measurements in X. Independent measurements are those
that cannot be expressed as a linear combination or weighted sum of the other
variables.
Using eqs. (31.1) and (31.2) we can readily prove that:
C, = XX^ = UAV^VAU^ = UA^U^
C^ = X^X = VAU^UAV^ = VA^V^ (31.3)
where C^ is thepxp square symmetric matrix of cross-products of the columns of
X, and where C^ is tht nxn square symmetric matrix of cross-products of the rows
ofX.
92
0.371 0.280
0.392 0 0.371 0.690 0.622
VA^yT = 0.690 0.556
0 0.046 0.280 0.556 -0.783
0.622 -0.783
u T C „ u , = v T c ^ v , =^^ (315^)
where Uj and Vj are such that X] is maximal [6]. By virtue of the orthonormality of
eigenvectors we can rewrite the above expressions in the forms:
or (31.5b)
The same applies to the other eigenvectors U2 and V2, etc., with additional
constraints of orthonormality of Uj, U2, etc. and of Vj, V2, etc. By analogy with eq.
(31.5b) it follows that the r eigenvalues in A^ must satisfy the system of linear
homogeneous equations:
(C^-X\IJu,=0, with/:=l,...,r (31.6a)
or
where !„ represents the nxn identity matrix and where Ip is thepxp identity matrix
[7].
For the sake of completeness we mention here an alternative definition of
eigenvalue decomposition in terms of a constrained maximization problem which
can be solved by the method of Lagrange multipliers:
max [u;^ C^ Ui - X\ (uj u, - 1)] (31.6b)
94
or
m^x[yj Cpy,-X]{yJ v^ -1)]
where X] represents the Lagrange multiplier associated with the first eigenvectors
u, and Vj. The above Lagrange expressions define that the sum of squares of the
projections of X^ and X upon the normalized vectors u^ and v, must be maximal.
Differentiation of these expressions with respect to the unknown Uj and Vj, and
equalization of the results to zero, leads to the expressions in eq. (31.5b). Similar
Lagrange expressions as in eq. (31.6b) can be derived for the other eigenvectors U2
and V2, etc., with additional constraints of orthonormality of Uj, U2, etc. and of Vj,
V2, etc. Differentiation with respect to the unknown eigenvectors then leads to the
system of linear homogeneous equations of eq. (31.6a).
A necessary condition for obtaining non-trivial solutions for u^ and v^ in eq.
(31.6a) is that the determinants of the coefficient matrices must be zero, which
results into the so-called characteristic equation:
or
ic,-x^,i,i=o
The determinants can be developed into a polynomial equation of degree r of
which the r positive roots are the eigenvalues X\, where r<p, and provided that
/7 < n as we have assumed above. (The derivation of these properties is given in
Section 29.6.) More practical methods for computing the eigenvalues will be
discussed in Section 31.4 on algorithms.
A property of EVD is that the traces of C„, C^ and A^ are the same and that they
are equal to the global sum of squares c of the elements of X:
r fi P
where trace (tr) means sum of the elements on the main diagonal of a square
matrix. Note that the global sum of squares c is denoted by the symbol SS^ in other
chapters, where it is referred to as the total sum of squares.
From the numerical results in the previous example we find that the global sum
of squares of the concentrations of all trace elements in all wind directions amounts
to:
tr(A2) = 0.392 + 0.046 = 0.438
95
In order to avoid confusion we will in this chapter consistently use the term
latent vector or latent variable as being synonymous with singular vector and with
eigenvector. Similarly, we will use the term latent value when referring to singular
value or to the square root of eigenvalue. In the literature, latent vectors are also
indicated by means of the term factor or component. The process of extracting
latent vectors from a matrix of cross-products according to eqs. (31.3) or (31.4) is
cdiWtd factorization. In a broad sense, factor analysis designates a class of methods
of multivariate data analysis which involve factorization, although, historically,
the name factor analysis has been first proposed for the analysis of tables of
correlation coefficients [8]. This subject is treated extensively in Chapter 34.
In the same way, we define the coordinates for the p columns of X in factor-
space S^\
L = VAP (31.11)
where the factor scaling coefficient p is usually assigned the value 0,0.5 or 1. The
matrix L is a pxr orthogonal matrix, called loading matrix. Orthogonality of L can
be derived from the orthonormality of V, using eq. (31.2):
L ' ^ L = AP V ^ V A P = A 2 P (31.12)
In the special case where a = 1 we can express the score matrix S in the form:
S = UA = XV (31.13)
as follows fromeqs. (31.1) and (31.2).
Each element of S can be regarded as a scalar product:
p
where the vector x^ represents the rth row of X and where the vector v^ represents
the kth column of V.
Each column of S represents a row-principal component of X and can be
interpreted as a linear combination of the columns of X using the elements of V as
weighting coefficients:
In the case of a = 1 we compute the scores S for the wind directions in the
current example of concentrations of trace elements in atmospheric samples as
follows:
97
0.753 0.618
0.343 -0.127 0.626 0
S = UA =
0.302 -0.567 0 0.214
0.473 -0.529
In Fig. 31.1a these scores are used as the coordinates of the four wind directions in
2-dimensional factor-space. From this so-called score plot one observes a large
degree of association between the wind directions of 90, 180 and 270 degrees,
while the one at 0 degrees stands out from the others.
When a = 0.5 we obtain somewhat different values for the scores in S:
0753 0.618 0.596 0.286
0.343 -0.127 0.791 0 0.271 -0.059
S = UAi'2^
0.302 -0.567 0 0.462 0.239 -0.262
0.473 -0.529 0.374 -0.244
where the vector x^ represents the yth column of X and where the vector u^ rep-
resents the k\h column of U.
Each column of L represents also a column-principal component of X and can
be regarded as a linear combination of the rows of X using the elements of U as
weighting coefficients:
© a=l
S'' \= .046
^ 1 A,.= .392
-.5
I 1 ^ , = .392
-.5
-.5 -^
Fig. 31.1. (a) Score plot in which the distances between representations of rows (wind directions) are
reproduced. The factor scaling coefficient a equals 1. Data are listed in Table 31.1. (b) Loading plot in
which the distances between representations of columns (trace elements) are preserved. The factor
scaling coefficient p equals 1. Data are defined in Table 31.1.
99
0.371 0.280
0.626 0
L = V A = 0.690 0.556
0 0.214
0.622 -0.783
0.753 0.618
0.212 0.072 0.036 0.078
0.343 -0.127
XTU = 0.399 0.133 0.063 0.141
0.302 -0.567
0.190 0.155 0.213 0.273
0.473 -0.529
0.232 0.060
0.432 0.119
0.390 -0.168
These loadings have been used for the construction of the so-called loading plot in
Fig. 31.1b which shows the positions of the three trace elements in 2-dimensional
factor-space. The elements Na and CI are clearly related, while Si takes a position
of its own in this plot.
When p = 0.5 we obtain:
The relationships between scores, loadings and latent vectors can be written in a
compact way by means of the so-called transition formulae:
s = xv
U = S A-^ (from eq. (31.13)) (31.17)
L = X^U
In this way, given U one can compute V, and vice versa. These transition
formulae are the basis for the calculation of U and V by the so-called NIPALS
algorithm which will be explained in Section 31.4.
31.1.7 Reconstructions
0.472 0.132]
0.215 -0.027 ro.472 0.215 0.189 0.296
SS'T =
0.189 -0.122 [o.l32 -0.027 -0.122 -0.113
0.296 -0.113
L^ L = A V^ V A = A^ (31.20)
(31.21)
using eqs. (31.2), (31.3) and (31.15). Note that the diagonal matrix A is invariant
under transposition.
In the current example using p = 1, we reconstruct the sums of cross-products C^
between the trace elements from their loadings L:
0.232 0.060
T _
0.232 0.432 0.390
LL 0.432 0.119
0.060 0.119 -0.168
0.390 -0.168
A very special case arises when a + P equals 1. If we form the product of the
score matrix S with the transpose of the loading matrix L, then we obtain the
original measurement table X:
SL'T =UA« AP yi" =UAV'^ = X (31.22)
by making use of eqs. (31.1), (31.9) and (31.11) where a + P = 1.
102
This implies that the scalar product of a score vector s^ with a loading vector ly
allows us to reconstruct the value x^j in the table X. Written in full, this is equivalent
to:
r
^yi=l^ikhi:=^u (31-23)
k
where the vector sj is the iih row of S and where the vector IJ is theyth row of L.
Summarizing, we find that, depending on the choice of a and (5, we are able to
reconstruct different features of the data in factor-space by means of the latent
vectors. On the one hand, if a = 1 then we can reproduce the cross-products C„
between the rows of the table. On the other hand, if p equals 1 then we are able to
reproduce the cross-products C^ between columns of the table. Clearly, we can
have both a = 1 and p = 1 and reproduce cross-products between rows as well as
between columns. In the following section we will explain that cross-products can
be related to distances between the geometrical representations of the corres-
ponding rows or columns.
If we want at the same time to reconstruct the data in the table X we must have
that a + p = 1. A frequently used compromise is a = 1 and p = 0, where we sacrifice
distances between columns for distances between rows, while still satisfying the
constraint a + p = 1 for reconstruction of the data. This is an asymmetrical choice
for a and p. Another compromise is to have a = 0.5 and P = 0.5 which allows for an
approximate reconstruction of distances together with the reconstruction of the
data. This is a symmetrical choice for a and p.
An important aspect of latent vectors analysis is the number of latent vectors
that are retained. So far, we have assumed that all latent vectors are involved in the
reconstruction of the data table (eq. (31.1)) and the matrices of cross-products (eq.
(31.3)). In practical situations, however, we only retain the most significant latent
vectors, i.e. those that contribute a significant part to the global sum of squares c
(eq.(31.8)).
If we only include the first r* latent variables we have to redefine our relation-
ships between data, latent vectors and latent values:
X = U* A* V*^ 4- E = X* + E (31.24)
where U*, A* and V* represent the matrices formed from the first r* columns of U
and V and from the first r* rows and columns of A, where E is the residual data
matrix, and where X* refers to the partially reconstructed data matrix. The matrix
X* can be regarded as a least squares approximation of the original matrix X by
means of a reduced number of latent variables r* out of a total of r [4].
If we retain only the first factor in the current example of concentrations of trace
elements in atmospheric samples, we obtain:
103
0.753
0.343
X* =U* A* V * j (0.626) (0.371 0.690 0.622) =
0.302
0.473
= X-E
The residual matrix E contains the contributions of the remaining second factor to
the concentrations of the trace elements at the various wind directions.
If we disregard E, we have to rewrite the reconstruction formulae (eqs. (31.19),
(31.21) and (31.22)) accordingly:
T *T
X* = S** L' with a + P = 1
IK 1^
(31.26)
I^
In our example, we find that y = 0.895 for r* = 1. We will discuss various
methods which can guide in the choice of the number of relevant latent vectors r* in
Section 31.5.
104
Fig. 31.2. Geometrical example of the duality of data space and the concept of a common factor space,
(a) Representation of n rows (circles) of a data table X in a space S^ spanned hyp columns. The pattern
P" is shown in the form of an equiprobability ellipse. The latent vectors V define the orientations of the
principal axes of inertia of the row-pattern, (b) Representation of/? columns (squares) of a data table X
in a space S" spanned by n rows. The pattern P^ is shown in the form of an equiprobability ellipse. The
latent vectors U define the orientations of the principal axes of inertia of the column-pattern, (c) Result
of rotation of the original column-space S^ toward the factor-space 5'' spanned by r latent vectors. The
original data table X is transformed into the score matrix S and the geometric representation is called a
score plot, (d) Result of rotation of the original row-space S" toward the factor-space 5'^ spanned by r
latent vectors. The original data table X^ is transformed into the loading table L and the geometric rep-
resentation is referred to as a loading plot, (e) Superposition of the score and loading plot into a biplot.
106
axes of symmetry of the ellipse. In the general case, Vj and V2 are called axes of
inertia, because they account for a maximal part of the global sum of squares of the
data which we have denoted by the symbol c in eq. (31.8). Note that inertia of a
pattern of points can be related to the sum of squares of their coordinates, if we
assume that the masses of the points are all equal. If we project a particular row x^
upon VJ, then from eq. (31.13) we deduce that the projection is at a distance s^^ from
the origin:
x^ y,=s,,
or more generally for all the rows:
Xv, =Si (31.27)
Since this latent vector is defined as the vector for which the sum of squares of
the projections is maximum (eq. (31.5)), we can interpret Vj as an axis of maximal
inertia:
s^ s, =v;^ X^ Xvi ^v?' C^ Vi =\\ (31.28)
Hence ^^ is the fraction of the total sum of squares (or inertia) c of the data X that is
accounted for by Vj. The sum of squares (or inertia) of the projections upon a
certain axis is also proportional to the variance of these projections, when the mean
value (or sum) of these projections is zero. In data analysis we can assign different
masses (or weights) to individual points. This is the case in correspondence factor
analysis which is explained in Chapter 32, but for the moment we assume that all
masses are identical and equal to one.
We now consider a subspace of SP which is orthogonal to Vj and we repeat the
argument. This leads to V2, and in the multidimensional case to all r columns in V.
By the geometrical construction, all r latent vectors are mutually orthogonal, and r
is equal to the number of dimensions of the pattern of points represented by X. This
number r is the rank of X and cannot exceed the number of columns /? in X and, in
our case, is smaller than the number of rows in X (because we assume that n is
larger than/?).
One of the earliest interpretations of latent vectors is that of lines of closest fit
[9]. Indeed, if the inertia along Vj is maximal, then the inertia from all other
directions perpendicular to Vj must be minimal. This is similar to the regression
criterion in orthogonal least squares regression which minimizes the sum of
squared deviations which are perpendicular to the regression line (Section 8.2.11).
In ordinary least squares regression one minimizes the sum of squared deviations
from the regression line in the direction of the dependent measurement, which
assumes that the independent measurement is without error. Similarly, the plane
formed by Vj and V2 is a plane of closest fit, in the sense that the sum of squared
deviations perpendicularly to the plane is minimal. Since latent vectors v^ contribute
107
in decreasing order to the global sum of squares (as measured by their contri-
butions X\), we obtain increasingly better approximations to the data structure X
by means of the (hyper)planes of closest fit which are defined by V. Usually we
will not include all r latent vectors in such a regression model. If a large fraction of
the total inertia is accounted for by the r* first latent variables, where r is much
smaller than p, then we have achieved a substantial dimension reduction. The
question of how many latent variables should be included is discussed in Section
31.5.
The same geometrical considerations can be applied to the dual representation
of the column-pattern P^ in row-space S"- (Fig. 31.2b). Here Uj is the major axis of
symmetry of the equiprobability envelope. The projection of theyth column Xy of X
upon Uj is at a distance from the origin given by:
xj n,=lj,
or more generally from eq. (31.15) for all the columns:
X^ Ui = l i (31.29)
Since the latent vector maximizes the sum of squares of the projections (eq.
(31.5)) we can also interpret u^ as an axis of inertia:
Note that the inertia X] loaded by Uj in S"^ is the same as that loaded by Vj in S^.
For this reason, we must consider Uj and Vj as two different expressions of one and
the same latent vector. The former is developed in S^ while the latter is constructed
in 5^.
It is important to note that the score vector Sj results from the projection of X
upon Vj. While Vj is a latent vector in S^ (because it has p elements), Sj is a vector in
5" (because it possesses n elements). In Chapter 29 we have shown that the pattern
P" in 5^, when projected upon the axis represented by Vj, produces an image
thereof in the dual space S"^ at the point represented by Sj. If we normalize Sj (by
division with the corresponding latent value X^) we obtain the latent vector Uj.
Similarly, the loading vector Ij results from the projection of the pattern P^ in S^
upon Uj. While the latent vector Uj is in 5" (because of its n elements), Ij is in 5^
(because it hasp elements). By the same argument as employed above, we find that
the pattern P^ in 5"^, when projected upon the axis represented by Uj, produces an
image thereof in the dual space 5^ at the point represented by Ij. Normalization of Ij
(by division with X^) yields the latent vector Vj. This is the geometrical basis for the
transition formulae which have been discussed before (eq. (31.17)) and for the
NIPALS algorithm which is used for the calculation of singular vectors and which
is explained in Section 31.4.
108
Once we have obtained the projections S and L of X upon the latent vectors V
and U, we can do away with the original data spaces S^ and S"^. Since V and U are
orthonormal vectors that span the space of latent vectors S^, each row / and each
columny of X is now represented as a point in S\ as shown in Figs. 31.2c and d. The
coordinate axes of this space S^ are oriented along the axes of inertia of the
equiprobability envelopes. In the case of multinormally distributed data the semi-
axes of the (hyper)ellipsoids have lengths equal to the corresponding latent values
in A. The lengths of these semi-axes can also be interpreted as the standard
deviations of the corresponding latent vectors. Since the scores and the loadings
are associated with the same latent values, we obtain that the semi-axes of the
ellipses in Figs. 31.2a and b are identical. The geometrical interpretation of PCA,
as expressed by eqs. (31.13) and (31.15) can thus be seen as an orthogonal rotation
of the original data spaces 5^ and S"" into 5^ The rotation is called orthogonal
because the rotation matrices V and U are orthonormal (eq. (31.2)). Orthogonal
rotations do not change Euclidean distances between the points in the patterns. In
particular, the total inertia of the pattern is invariant under orthogonal rotation, a
property which has been defined already in eq. (31.8). Fig. 31.2c is traditionally
called a score plot, while Fig. 31.2d is called a loading plot.
Since U and V express one and the same set of latent vectors, one can
superimpose the score plot and the loading plot into a single display as shown in
Fig. 31.2e. Such a display was called a biplot (Section 17.4), as it represents two
entities (rows and columns of X) into a single plot [10]. The biplot plays an
important role in the graphic display of the results of PCA. A fundamental property
of PCA is that it obviates the need for two dual data spaces and that instead of these
it produces a single space of latent variables.
31.2.2 Distances
It has been shown in the previous section that there is an infinity of decomp-
ositions of a data table X into scores S and loadings L depending upon the choice of
the factor scaling coefficients a and p (eqs. (31.9) and (31.11)). The most current
settings of a and (3, however, are 0,0.5 and 1. It has been noted that when a equals
1, distances between the representations of the rows are preserved in the score plot.
In the case when p equals 1, distances between the representations of the columns
are preserved in the loading plot. In the score plot of Fig. 31.3a we represent the
Euclidean distance between rows / and i' by (i,,-, while their angular distance, as
seen from the origin, is denoted by i^^-. Euclidean distances from the origin of the
rows / and /' are indicated as d^ and J,, respectively. As shown in Section 9.2.3, this
means in algebraic terms that:
109
where the vectors sj and sj^ are the / and i'-th row of the score matrix S, where the
vectors x^ and xj^ are the / and /'-th row of the matrix X and where C^ is the matrix
of cross-products between rows of X (eq. (31.3)). The above relations can also be
derived from eq. (31.13) and by taking account of the orthonormality of singular
vectors.
In particular, c^^ and C,Y represent the sums of squared elements of the row-
vectors Xj and Xj/. Consequently, d^ and d^^ are the norms (or lengths) of the ith and
i'-th rows, respectively. The relation that links the three expressions of eq. (31.31)
together is called the triangle equation or cosine rule:
dl = df + df^ - Id^d^. cos ^ii. (31.32a)
and from eq. (31.31) we can also derive that:
dl.=c,+c,,-2c,, (31.32b)
where 0^^ is the sum of cross-products x] x,. between the / and i-\h rows of X. The
term cos i^- is a measure for the association (or similarity) between the rows / and
I. From the score plot in Fig. 31.3a we learn that it represents the angular distance
between the points / and i as seen from the origin of space. This measure of
association is often used to define similarities between objects in cluster analysis
(Chapter 30).
In the current example we can calculate the distances of the wind directions
from the origin either from the data X, from the scores S or from the sums of
cross-products C^. In the case of wind direction 0 we obtain:
t/f^i =s;?Li s^.^i =(0.472^ +0.132^) = 0.240
= x^^i x.^i =(0.212^ +0.399^ +0.190^) = 0.240
^ c , . i , =0.240
or
J,^i =0.490
The distances between wind directions can be calculated from the sums of cross-
products in C„. In the case of wind directions 0 and 90° we derive:
dl'=n =(^ii=u +^/r=22 -2c,.,.,=i2 =0.240 +0.047 - 2(0.098) = 0.092
or
d..^^^2 =0.303
110
@ a=l ® P=i
S^
>v
-> —7
0
© a + p=l @ a+p=l
/N
s^ S"^
s ./
y\
s / U
•P 1 O ,/ L
/ 0
/
/
/
• ^ • >
© a+p=l ® a+p=l
/N A^
s-^ S^
s, X
-> ^
Fig. 31.3. (a,b) Reproduction of distances D and angular distances 0 in a score plot (a = 1) or loading
plot (p = 1) in the common factor-space 5^ (c,d) Unipolar axis through the representation of a row or
column and through the origin 0 of space. Reproduction of the data X is obtained by perpendicular
projection of the column- or row-pattern upon the unipolar axis (a + P = 1). (e,0 Bipolar axis through
the representation of two rows or two columns. Reproduction of differences (contrasts) in the data X is
obtained by perpendicular projection of the column- or row-pattern upon the bipolar axis (a + p = 1).
Ill
4,=(l,-I,,)^(l,-V) = (x,-x,0^(x,-v)
d^-^h=^^j='jj (31-33)
where the vectors 1J and 1 J. are they and/-th rows of the loading matrix L, where
the vectors Xj and Xy/ are the j and /-th columns of the matrix X and where C^
represents the matrix of cross-products between the columns of X (eq. (31.3)). In
particular, Cjj and Cjy represent the sums of squared elements of the jth and y'-th
columns of X respectively. The above relations can also be derived from eq.
(31.15) and by taking account of the orthonormality of singular vectors.
Here dj and di^ are called the norms (or lengths) of they andy'-th columns of X
respectively. The triangle equation or cosine rule links the above expressions in eq.
(31.33) together:
d].. = dj + d} - Idjdy COS ^^y (31.34a)
and from eq. (31.33) we can also derive that:
where Cjj^ is the sum of cross-products xJ Xy between the y and y'-th column-
variables of X. The term cos i^^^/ is a measure of association (or similarity)
between these column-variables. From Fig. 31.3b we can see that it can be ob-
tained from the angular distance between the pointsy andy" as seen from the origin
of space. This measure of association is closely related to the coefficient of correl-
ation between measurements (Section 8.3).
From the current example we can calculate the distances from the origin of the
trace elements, either from the data X, from the loadings L (with p = 1) or from the
sums of cross-products C^. In the case of Na:
d%^ ^ I j l i I,.^i =(0.232^ +0.060^) = 0.057
= x]^, x.^i =(0.212^ +0.072^ +0.036^ +0.0782) = 0.058
= c^..ii =0.058
or
rf,.^i =0.240
The distances between trace elements are calculated from the sums of cross-
products in Cp. In the case of Na and CI:
112
Figure 31.4 shows the biplot of the trace elements and wind directions for the
case when a = p = 0.5. Since here we have that a -H (3 equals 1, we can reconstruct
the values in the columns of the data table X by means of perpendicular projections
upon unipolar axes. In Fig. 31.4a we have drawn a unipolar axis through CI.
Perpendicular projection of the four wind directions upon this axis reconstructs the
order of the concentrations of CI at the four wind directions as listed in Table 31.1.
Now we have established a way which leads back from the graphic display to the
tabulated data. This interpretation of the biplot emphasizes the one-to-one relation-
ship between the data and the plot. Such a relationship is also inherent in the
ordinary bivariate (or Cartesian) diagram.
The distance from the origin d^ or dj is a measure of the information contained in
the corresponding row or column. If the distance is relatively large, in comparison
to others, then the corresponding row or column can be seen to contribute more
information (inertia, variance) to the result of the analysis. In the case when the
distance is zero, then the row or column carries no information (inertia, variance)
since all its elements are zero.
In the case when not all r latent variables have been retained, the reproduction
will be approximate and we have to rewrite eq. (31.35) in the form:
sY \]=\y s* =jc*. (31.36)
where the asterisk denotes that only the first r* latent variables have been included
in the reconstruction of the data X, and where r* < r.
A bipolar axis is defined by two rows s, and s^/ in Fig. 31.3e. If we project a
column Ij upon this bipolar axis, we reconstruct the difference between two
elements in X. This follows readily from eq. (31.22):
(s,.-s,)^l;=sn,-sn,.=^,-^,,, (31.37)
This bipolar axis defines a contrast between the row-variables / and /'. Note that a
contrast involves the difference of two quantities.
By virtue of the symmetry between scores and loadings, we can also construct
bipolar axes through two columns ly and 1^^ such as is shown in Fig. 31.3f. When we
project a row s^ upon this bipolar axis we construct a difference between two
elements in X. The proof follows readily from eq. (31.22):
sjilj-l,) = sjlj-sllj,=x,-x,, (31.38)
This bipolar axis defines a contrast between the column-variables7 and/.
In Fig. 31.4b we have constructed a bipolar axis through CI and Si. Perpend-
icular projections of the wind directions upon this axis are in the same order as the
114
© a = p = 0.5
6 O
180 270
a = P = 0.5
® .5n
/
i
Na
/
.5
-.5 90^
?
270.'
180
/
.5 -•
Fig. 31.4. (a) Biplot in which the concentrations of an atmospheric trace element (CI) are recon-
structed by perpendicular projection upon a unipolar axis, (b) Biplot in which the differences (con-
trasts) between two atmospheric trace elements (CI, Si) are reproduced by perpendicular projection
upon a bipolar axis.
115
31.3 Preprocessing
1
^ i
In our notation m^ represents an element of the vector m^ and rrij is an element of the
vector m^. The same convention applies to the elements d-, dj and the
corresponding vectors d„, d^. The global mean m and the global norm d are defined
as follows:
1 n p
^P i J
d'=-tt4 (31.40)
^Pi J
Note that a norm is defined here as a root mean square rather than as a sum of
squares, because of the division by n and/7. This greatly simplifies our notation.
The vector of column-means m^ defines the coordinates of the centroid (or
center of mass) of the row-pattern P^ that represents the rows in column-space S^.
Similarly, the vector of row-means m„ defines the coordinates of the center of mass
of the column-pattern P^ that represents the columns in row-space 5". If the
column-means are zero, then the centroid will coincide with the origin of 5^ and the
data are said to be column-centered. If both row- and column-means are zero then
the centroids are coincident with the origin of both 5'^ and 5^. In this case, the data
are double-centered (i.e. centered with respect to both rows and columns). In this
chapter we assume that all points possess unit mass (or weight), although one can
extend the definitions to variable masses as is explained in Chapter 32.
The vector of column-norms d^ of the table defines the weighted distances of the
points in P^ from the origin of space in S^, Similarly, the vector of row-norms d^
represents the weighted distances of the points in P"^ from the origin of space in S^.
If all column-norms are equal then the pattern PP will appear on a (hyper)sphere
about the origin of 5". In this case the data are said to be column-normalized. If, in
addition, the center of mass of P^ is at the origin of S^, then we deal with
column-standardized data (i.e. column-centered and column-normalized). Note
that the norm is also proportional to the length of the vectors, i.e. the distance of its
endpoint from the origin of space. These concepts are explained in Chapter 29.
By way of graphical example of the various algebraic and geometrical concepts
that are introduced in this chapter, we will make use of a measurement table
adapted from Walczak etal. [II]. Table 31.2 describes 23 substituted chalcones in
terms of eight chromatographic retention times. Chalcone molecules are consti-
tuted of two phenyl rings joined by a chain of three-carbon atoms which carries a
double bond and a ketone function. Substitutions have been made on each of the
phenyl rings at the para-positions with respect to the chain. The substituents are
CF3, F, H, methyl, ethyl, /-propyl, r-butyl, methoxy, dimethylamine, phenyl and
NO2. Not all combinations two-by-two of these substituents are represented in the
117
TABLE 31.2
Chromatographic retention times of 23 doubly substituted chalcones, as determined by 8 HPLC chromato-
graphic methods using heptane as the mobile phase. The methods differ by addition of 0.5 percent of a
chemical modifier.
X-Y CH2CI2 THE Dioxan Ethanol Propano][ Octanol DMSO DMF Mean
H-CH3 2.02 1.78 1.72 0.62 0.57 0.78 2.98 2.85 1.66
H-tBu 2.99 2.24 2.30 0.64 0.59 0.92 2.20 1.53 1.68
H-iPr 3.00 2.24 2.29 0.66 0.62 0.96 2.26 1.62 1.71
H-H 2.98 2.39 2.31 0.78 0.76 1.17 3.00 2.80 2.02
F-H 3.56 2.94 2.81 0.98 0.92 1.40 4.66 4.15 2.68
H-F 2.82 2.38 2.30 0.83 0.76 1.13 3.74 3.40 2.17
H-Et 3.12 2.34 2.43 0.70 0.68 1.08 2.38 1.90 1.83
H-Me 3.28 2.52 2.57 0.81 0.76 1.22 2.67 2.25 2.01
F-Me 3.87 3.00 3.06 0.91 0.87 1.38 4.33 3.22 2.58
F-F 3.47 2.85 2.92 1.05 0.96 1.39 6.24 4.98 2.98
Me-Phe 5.86 4.59 4.43 0.96 0.93 1.68 4.84 3.82 3.39
MeO-Me 9.00 6.62 6.68 1.78 1.58 2.92 6.81 5.62 5.13
Me-MeO 9.12 6.76 6.74 1.75 1.63 3.01 6.59 5.76 5.17
F-MeO 10.34 8.22 7.59 2.03 2.02 3.48 11.90 10.50 7.01
H-NO2 6.76 5.60 5.92 1.80 1.66 2.68 16.85 12.82 6.76
MeO-Phe 14.58 10.72 10.68 2.19 2.08 4.08 14.39 11.41 8.77
F-NO2 10.15 8.11 7.10 2.68 2.37 3.68 27.62 20.24 10.24
N02-Me 12.15 9.82 9.42 2.50 2.30 3.95 18.04 15.28 9.18
NO2-H 11.93 9.81 9.40 2.68 2.59 4.27 24.38 18.94 10.50
MeO-MeO 22.19 16.08 15.79 3.44 3.48 6.85 21.27 17.12 13.28
MeO-NOj 15.94 12.67 12.74 3.36 3.08 5.48 41.17 26.35 15.10
NO2-F 14.32 12.43 12.01 3.52 3.20 5.04 39.34 21.80 13.96
NMe2-N02 26.07 19.28 18.93 4.24 3.92 8.78 42.38 26.48 18.76
Mean 8.67 6.76 6.61 1.78 1.67 2.93 13.48 9.78 6.46
first two latent variables as defined by eqs. (31.9) and (31.11), using the compromise
a = P = 0.5. This particular choice of a and (3 ensures that projections on unipolar
and bipolar axes can be carried out, while compromising on the representation of
distances between rows and between columns.
The relative inertias contributed by the two latent vectors are indicated along-
side the horizontal and vertical axes. In some cases we have reflected one or two of
these axes such as to make the individual biplots comparable with each other.
Rows of Table 31.2 refer to the 23 substituted chalcones which are represented on
the biplots by means of graded circles. Areas of the circles are proportional to the
mean retention times of the corresponding compounds as obtained with the eight
chromatographic methods. Here large areas of circles correspond with compounds
whose average retention times are large. Columns of the table refer to the chroma-
tographic methods (heptane plus 0.5% of one of the different modifiers). They are
represented by means of graded squares on the biplot. Areas of these squares are
proportional to the mean retention times produced with the corresponding methods
by the 23 substituted chalcones. Large areas of squares correspond with methods
that on the average tend to produce large retention times in the 23 compounds. The
areas of circles and squares provide an indication of the average size or importance
of the objects and measurements described by the table, i.e. average retention time
of compounds and average retention time with chromatographic methods.
31.3.1 No transformation
The case in which the original data are left unchanged can be simply defined by
means of:
Zij-x^j with / = 1, ..., n andy = 1, ...,/7 (31.41)
Generally when no preprocessing is applied, we will find that all means and
norms are distinct as in Table 31.3.
TABLE 31.3
Atmospheric data from Table 31.1, after no transformation. The means m and norms d have been computed
column-wise, row-wise and globally over all elements of the table
Na CI Si m„ d„
0 0.2120 0.3990 0.1900 0.2670 0.2830
90 0.0720 0.1330 0.1550 0.1200 0.1250
180 0.0360 0.0630 0.2130 0.1040 0.1299
270 0.0780 0.1410 0.2730 0.1640 0.1830
—
)HmO-H02
9r'N02 ^Q
# a-N02
DMFfH H02-H
PROPAJfOL
B-CF3\ »™^^^
B-lPr^-B^ 2
^kiatm2-lK)2
.H.-PH. ^,_^^
DIOXAH^TBT
«-°"^tM.-M*0
^MmO-MmO
97%
Fig. 31.5. Biplot of 23 substituted chalcones (circles) and 8 chromatographic methods (squares) as de-
scribed by their retention times in Table 31.2, after no transformation of the data. Areas of circles and
squares are related to the mean retention times of the corresponding compounds and methods, such as
they appear in the margins of the table.
Figure 31.5 shows the biplot of the 23 substituted chalcones and eight chrom-
atographic methods as described by the retention times in Table 31.2. The first
latent variable accounts for 97% of the inertia, while the second one carries about
3%. A small fraction of the inertia (less than 0.3%) is distributed over the other
latent variables. The origin of the space is represented by a small cross and appears
at the far left of the biplot. The two centroids (of compounds and of methods) are
manifestly at a large distance from the origin. Note that the circles and squares are
ordered by increasing areas along the horizontal axis. We conclude from this, that
the first latent variable represents a strong size component. It accounts for the size of
compounds and measurements as appears from the marginal means of Table 31.2,
31.3.2 Column-centering
TABLE 31.4
Na CI Si ^n dn
0 0.1125 0.2150 -0.0178 0.1032 0.1405
90 -0.0275 -0.0510 -0.0528 -0.0437 0.0452
180 -0.0635 -0.1210 0.0053 -0.0598 0.0790
270 -0.0215 -0.0430 0.0653 0.0003 0.0468
m^ 0 0 0 0
^ 0.0669 0.1278 0.0430 — 0.0869
After this transformation we find that the column-means m^ are zero as shown in
Table 31.4.
In this case the column-norms d^ are called column-standard deviations. The
square of these numbers are the column-variances, whose sum represents the
global variance in the data. Note that the column-variances are heterogeneous
which means that they are very different from each other.
The column-centered biplot of the chromatographic data is shown in Fig. 31.6.
The first and second latent variables contribute 94% and 5% to the total inertia of
the data (or total variance in this case). About 0.5% is distributed over the residual
latent variables. Here the centroid of compounds coincides with the origin of space
(indicated by the small cross near the middle of the plot). The centroid of the
methods is not at the origin, since all methods are projected at the right side of the
origin on Fig. 31.6. This is indicative of a strong size component, which is also
evidenced by the ordering of the compounds according to increasing average
retention time. This observation can also be made by looking at the areas of the
circles which increase from left to right.
Inspection of the positions of the chromatographic methods on the biplot of Fig.
31.6 shows that the three alcohols (ethanol, propanol and octanol) contribute
relatively little to the global inertia (variance) of the data, since their distances from
the origin are small in comparison to those of DMSO, DMF, THF, dioxane and
methylenedichloride (CH2CI2). Judging from the angular distances, as seen from
the origin, we find rather strong correlations between DMSO and DMF, on the one
121
P* \ 00 Hm0-K02 f ^^j^
X -W02
\. 2
9'- N02-F^ W mC''^<'^^^
4 # fl-N02 «^
Nv V02-H^
B-cr3* B-r •'-' ^''[UDMT /^20 1
fl-fl,* mr-H
X ^ ..^^-^^^^^^^ ^^
/N
0M0O-Ph« \^^^/
5
CH2Cl2j3^ 20
22
/^^
N ^ 24
/ WoO-Ate<J^
X 34 »
Fig. 31.6. Biplot of chromatographic retention times in Table 31.2, after column-centering of the data.
Two unipolar axes and one bipolar axis have been drawn through the representations of the methods
DMSO and methylenedichloride (CH2CI2). The projections of three selected compounds are indicated
by dashed lines. The values read off from the unipolar axes reproduce the retention times in the corre-
sponding columns. The values on the bipolar axis reproduce the differences between retention times.
hand, and between dioxane, THF and methylenedichloride, on the other hand. The
correlation between the three alcohols cannot be established accurately on this plot
because of their proximity to the origin.
There are two outstanding poles on this biplot. DMSO and dimethylchloride are
at a large distance from the origin and from one another. These poles are the most
likely candidates for the construction of unipolar axes. As has been explained in
the previous section, perpendicular projections of points (representing comp-
ounds) upon a unipolar axis (representing a method) leads to a reproduction of the
data in Table 31.3. In this case we have to substitute the untransformed value x^^ in
eq. (31.35) by z^. of eq. (31.42):
s7i 7 ~'^ij -^ij ^j (31.43)
Since m^ is a constant for the given unipolar axis through the yth method, we
obtain that the projections on this axis are equal to x^j minus a constant.
The unipolar axis through the origin and DMSO reproduces rather well the data
in the corresponding column of Table 31.2. By perpendicular projection of the
122
center of the circles we can read off the approximate retention times on this
unipolar axis. For example, the compound with substituents F-NO2 projects at
about a value of 28 which compares well with the tabulated value of 27.62. The
DMSO axis separates compounds with NO2 and methoxy substituents from the
others. The unipolar axis through methylenedichloride also reproduces the data in
the corresponding column of Table 31.2. This additive seems to selectively delay
the elution of compounds with dimethylamine and some methoxy substituents.
Finally, we have constructed a bipolar axis through DMSO and methylene-
dichloride. By perpendicular projection of the centers of the circles upon this bipolar
axis we obtain the differences in retention times obtained respectively with DMSO
and with methylenedichloride. Using a similar reasoning as developed above for
unipolar axes we can perform a substitution of eq. (31.42) in eq. (31.38) which leads
to:
sj{\j -lj.) = Zij -Zij'=Xij -x.y-{mj -my) (31.44)
This shows that the bipolar axis reproduces the difference x^j - x- between
method7 a n d / of the table minus a constant (nij - m^). This bipolar axis defines a
clear contrast between the N02-substituted compounds and the others.
For example, projection of the compound with substituents dimethylamine-N02
defines a DMSO-methylenedichloride contrast of about 16 (Fig. 31.6), where the
real difference is 42.38 - 26.07 or 16.31 (Table 31.2). Graphical estimation of the
same contrast for the MeO-MeO compound yields a value o f - 1 , which compares
well with the difference between tabulated data of 21.27 - 22.19 or -0.92.
31.3.3 Column-standardization
with
d]=^l(^,-mjr
TABLE 31.5
Na CI Si in„ d.
0 1.681 1.683 -0.413 0.984 1.394
90 -0.411 -0.399 -1.228 -0.679 0.782
180 -0.949 -0.947 0.122 -0.591 0.777
270 -0.321 -0.337 1.519 0.287 0.917
m^ 0 0 0 0 —
d. 1 1 1 — 1
4%
# a-N02
B-Hm
• a-LPr B-Et
Fig. 31.7. Biplot of chromatographic retention times in Table 31.2, after column-standardization of
the data. Unipolar semi-axes have been drawn through all points representing methods. The particular
arrangement of the methods is indicative for the presence of a strong size component in the data.
In this case it is required that the original data in X are strictly positive. The effect
of the transformation appears from Table 31.6. Column-means are zero, while
column-standard deviations tend to be more homogeneous than in the case of
simple column-centering in Table 31.4 as can be seen by inspecting the corresp-
onding values for Na and CI.
In the log column-centered biplot of Fig. 31.8 one observes that the centroid of
the compounds coincides with the origin. Also, the chromatographic methods are
at a more uniform distance from the origin than it is the case with simple
column-centering in Fig. 31.6. The effect of log column-centering is to reduce the
heterogeneity of the variances. The result is close to that of column-standard-
ization in Fig. 31.7. While the logarithmic function reduces the effect of large
values it enhances that of the smaller ones. In the chromatographic application this
is an advantage, as appears from the widening of the cluster of compounds with the
substituents CF3, F, H, methyl, ethyl, /-propyl, r-butyl on the left side of Fig. 31.7.
With log column-centering we obtain unipolar axes by substituting eq. (31.46)
in eq. (31.35):
TABLE 31.6
Na CI Si m^ d.
0 0.4183 0.4326 -0.0297 0.2738 0.3479
90 -0.0507 -0.0445 -0.1181 -0.0711 0.0785
180 -0.3517 -0.3690 0.0200 -0.2336 0.2945
270 -0.0159 -0.0191 0.1278 0.0309 0.0751
m^ 0 0 0 0
^ 0.2746 0.2853 0.0888 — 0.2343
where rrij is the logarithm of the geometric mean of theyth column in Table 31.2,
which is a constant for this unipolar axis. Hence, the unipolar axis through DMSO
reproduces the logarithms of the data in the corresponding column of the table.
Note that the unipolar axis in Fig. 31.8 has been constructed with logarithmical
subdivisions. A similar construction applies to the unipolar axis through methylene-
dichloride.
The bipolar axis through DMSO and methylenedichloride defines logarithms of
the ratio of the data in the corresponding columns. By substitution of eq. (31.46) in
eq. (31.38) and after rearrangement we obtain that:
Consequently, the bipolar axis represents the (log) ratio of columns j and / in
the original data table up to the term nij - rrif which is a constant for the given
bipolar axis. Note that the axis which represents the ratio of retention times with
DMSO/methylenedichloride is also divided according to a logarithmic scale.
In Fig. 31.8 we determine the DMSO/methylenedichloride ratio of the di-
methylamine-N02 substituted compound approximately at 1.6, where the com-
puted ratio from Table 31.2 is 42.38/26.07 or 1.63.
\3%
r 4 ^
\ ^ 2 r 2.s^^
DMSO ^y^
• H-W02 F-JI02
DMFW^LS
B-CF3 ^^v^ 9r-F
j ^^MAO-K02
^ N02
• H-H ^ ^ NO2-Mm
• f-ite ETBANOL^
i%
8 PROPAITOL"
l^-^-"^^
\^^ OCTAWOIQ f NMtt^-KO^'^
R-tBu ^ ^ 4 •^'•-PhB CHF
#toO-Ph«^v^^3| DJOXAN
CB2C12*f^
M«0-Ha
^^,y^ 2 %H»-MQO
97%*^
Fig. 31.8. Biplot of chromatographic retention times in Table 31.2, after log column-centering of the
data. The values on the bipolar axis reproduce the (log) ratios between retention times in the two corre-
sponding columns.
with
1 n p
and m=— ^ ^ y..
It is assumed that the original data in X are strictly positive. As is evident from
Table 31.7 both the row-means m„ and the column-means m^ of the transformed
table Z are equal to zero.
The biplot of Fig. 31.9 shows that both the centroids of the compounds and of
the methods coincide with the origin (the small cross in the middle of the plot). The
first two latent variables account for 83 and 14% of the inertia, respectively. Three
percent of the inertia is carried by higher order latent variables. In this biplot we
can only make interpretations of the bipolar axes directly in terms of the original
data in X. Three prominent poles appear on this biplot: DMSO, methylene-
dichloride and ethylalcohol. They are called poles because they are at a large
distance from the origin and from one another. They are also representative for the
three clusters that have been identified already on the column-standardized biplot
in Fig. 31.7.
127
TABLE 31.7
Na CI Si m„ dn
0 0.1446 0.1589 -0.3034 0 0.2146
90 0.0204 0.0266 -0.0470 0 0.0333
180 -0.1181 -0.1354 0.2536 0 0.1794
270 -0.0468 -0.0500 0.0969 0 0.0685
m^ 0 0 0 0
d. 0.0968 0.1082 0.2049 — 0.1450
114%
# MmO-Phm . — • 194m2-N02
€.5 —
Mm-Phmm^
CH2Cl2i 3 tf
DIOXAJ p i ^ j ^ 0M.O-WO2 -^
\ «02-B% \ ^ ^ 11
• a-tBu \ aoczAMOL 4. ^y^^^io
• E-iPr
• B-Xt
\ ^^^^^ 1 %F-N02
• B-H»
\ '-B^^
E-B% \ > ^
\ ^^"^ ^
PRQPANOLB \ x ^ • E-Cr3
Fig. 31.9. Biplot of chromatographic retention times in Table 31.2, after log double-centering. Two bi-
polar axes have been drawn through the representation of the methods DMSO, methylenedichloride
(CH2CI2) and ethanol.
14%
83%
Fig. 31.10. Same biplot of chromatographic retention times as in Fig. 31.9. The line segments connect
compounds that share a common substituent. The horizontal contrast reflects the presence or absence
of a NO2 substituent. The vertical contrast expresses the electronegativity of the substituents.
One can also state that the log double-centered biplot shows interactions
between the rows and columns of the table. In the context of analysis of variance
(ANOVA), interaction is the variance that remains in the data after removal of the
main effects produced by the rows and columns of the table [12]. This is precisely
the effect of double-centering (eq. (31.49)).
Sometimes it is claimed that the double-centered biplot of latent variables 1 and
2 is identical to the column-centered biplot of latent variables 2 and 3. This is only
the case when the first latent variable coincides with the main diagonal of the data
space (i.e. the line that makes equal angles with all coordinate axes). In the present
application of chromatographic data this is certainly not the case and the results are
different. Note that projection of the compounds upon the main diagonal produces
the size component.
The transformation by log double-centering has received various names among
which spectral mapping [13], logarithmic analysis [14], saturated RC association
model [15], log-bilinear model [16] and spectral map analysis or SMA for short
[17].
130
31.3.6 Double-closure
A matrix is said to be closed when the sums of the elements of each row or
column are equal to a constant, for example, unity. This is the case with a table of
compositional data where each row or column contains the relative concentrations
of various components of a sample. Such compositional data are closed to 100%.
Centering is a special case where the rows or columns of a table are closed with
respect to zero. Here we only consider closure with respect to unity. A data table
becomes column-closed by dividing each element with the corresponding column-
sum. A table is row-closed when each element has been divided by its corresp-
onding row-sum. By analogy with double-centering, double-closure involves the
division of each element of a table by its corresponding row- and column-sum and
multiplication by the global sum:
with
p n n p
J i i J
where x„^, x+^ and x^^ represent the vector of row-sums, the vector of column-sums
and the global sum of the elements in the table X. Note that eq. (31.50) can also be
written as:
Zii=-^^— (31.51)
m, nij
where m^, m^, m represent the row-means, column-means and global mean of the
elements of X.
Double-closure is only applicable to data that are homogeneous, i.e. when
measurements are expressed in the same unit. The data must also be non-negative.
If these requirements are satisfied, then the row- and column-sums of the table can
be thought to express the size or importance of the items that are represented by the
corresponding rows and columns. Double-closure is the basic transformation of
correspondence factor analysis (CFA), which is applicable to contingency tables
and, by extension, to homogeneous data tables [18]. Data in contingency tables
represent counts, while data in homogeneous tables need only be defined in the
same unit. Although CFA will be discussed in greater detail in Chapter 32 on the
multivariate analysis of contingency tables, it is presented here for comparison
with the other methods of PC A which have been discussed above.
131
where E(x^j) is called the expected value of x^y under the condition of fixed marginal
totals. (See Chapter 26 on 2 x 2 contingency tables). For this reason the operation
of double-closure can also be considered as a division by expected values. It
follows that the transformed data Z in eqs. (31.50) or (31.51) can be regarded as
ratios of X to their expected values E(X).
In CFA the means m^, m^, m and norms d„, d^, d are computed by weighted
sums and weighted sums of squares:
p
m^ = 2^Wi Zij = 1 with / = 1, ..., n
J
n
n p
' J
dj^^w.zfj
n p
i J
where the vectors of weight coefficients w„ and w^ are defined from the marginal
sums:
w,=^^ (31.54)
132
TABLE 31.8
Atmospheric data from Table 31.1, after double-closure. The weights w are proportional to the row- and
column-sums of the original data table. They are normalized to unit sum.
Na CI Si w„ m. d.
0 0.3067 0.3299 -0.4391 0.4075 0 0.3759
90 -0.0126 -0.0136 0.0181 0.1832 0 0.0155
180 -0.4303 -0.4609 0.6143 0.1587 0 0.5259
270 -0.2173 -0.2349 0.3121 0.2503 0 0.2672
When all weight coefficients w„ and w^ are constant then we have the special case:
6%
• B-tBu P JDIOKAN^
PROPANOLa a ^ - ^
ETHAIfOl
90%
Fig. 31.11. Biplot of chromatographic retention times in Table 31.2, resulting from correspondence
factor analysis, i.e. after double-closure of the data. The line segments have been added to emphasize
contrasts in the same way as in Fig. 31.10.
where m^, m^ and m now represent the means of the logarithmically transformed
data Y.
It has been proved that, in the limiting case when the contrasts in the data are
small, the two expressions produce equivalent results [14]. The main difference
between the methods resides in the quantity that is analyzed. In CFA one analyses
the global distance ofchi-square of the data [18] as explained in Section 32.5. In
log double-centered PCA or SMA one analyses the global interaction in the data
[17]. Formally, distances of chi-square and interactions are defined in the same
way as weighted sums of squares of the transformed data. Only the type of
transformation of the data differs from one method to another. (See Goodman [15]
for a thorough review of the two approaches.) Contrasts in CFA are expressed in
terms of distances of chi-square, while in SIMA they can be interpreted as (log)
ratios. While the former is only applicable to homogeneous data, the latter lends
itself also to heterogeneous data, since differences between units are cancelled out
by the combined operations of logarithms and centering.
134
31.4 Algorithms
The 'non-linear iterative partial least square' or NIPALS algorithm has been
designed by H. Wold [19] for the solution of a broad class of problems involving
relationships between several data tables. For a discussion of partial least squares
(PLS) the reader is referred to Chapter 35. In the particular case where NIPALS is
applied to a single rectangular data table one obtains the row- and column-singular
vectors one after the other, in decreasing order of their corresponding singular
values. The original concept of the NIPALS approach is attributed to Fisher [30].
In NIPALS one starts with an initial vector t with n arbitrarily chosen values
(Fig. 31.12). In a first step, the matrix product of the transpose of the nxp table X
with the AX-vector t is formed, producing the p elements of vector w. Note that in the
traditional NIPALS notation, w has a different meaning than that of a weighting
vector which has been used in Section 31.3.6. In a second step, the elements of the
p-vector w are normalized to unit sum of squares This prevents values from
becoming too small or too large for the purpose of numerical computation. The
135
NIPALS
X t
y^
/
/ .'1
//
/
/ i
/3
0 7
/
/~7
A
w -,Jr^
Fig. 31.12. Dance-step diagram, illustrating a cycle of the iterative NIPALS algorithm. Step 1 multi-
plies the score vector t with the data table X, which produces the weight vector w. Step 2 normalizes w
to unit sum of squares. In step 3, X is multiplied by w, yielding an updated t.
third step involves the multiplication of X with w, which produces updated values
for t. The complete cycle of the NIPALS algorithm can be represented by means of
the operations, given an initial vector t:
w = X^t
w -^ W (31.57)
(w^ w) 1/2
t = Xw
which is equivalent to the transition formulae of eq. (31.17). The three steps of the
cycle are represented in Fig. 31.12 which shows the transitions in the form of a
dance step diagram [31]. The cycle is repeated until convergence of w or t, when
changes between current and previous values are within a predefined tolerance
(e.g. 10"^). At this stage one can derive the first normalized row-latent vector Uj
and column-latent vector Vj from the final w and t:
(31.58)
( t ^ t ) 1/2
W
X = X^it("it v J ) = U A V ^ (31.61)
k
-> • 0 • -^ • \ 0 •
0^ 0 \
(b) Jacobi
c A2
-^ • 1 • •
\ 0
0 \
•
+
•
M •
+
• m
(c) Householder - QR
c A2 VT
•
\ 0
-^ -^
0 \
Householder step QR step
Fig. 31.13. Schematic example of three common algorithms for singular value and eigenvalue decom-
position.
138
v = w/(wTw)^/2 (31.62b)
The normalization step prevents the elements in v from becoming either too large
or too small during the numerical computation. The two operations above define
the cycle of the powering algorithm, which can be iterated until convergence of the
elements in the vector v within a predefined tolerance. It can be easily shown that
after n iterations the resulting vector w can be expressed as:
wocC^ V (31.63)
where the symbol C^ represents the nth power of the matrix C^, i.e. the matrix
product of n terms equal to C^. The name of the method is derived from this
property. The normalized vector Vj which results from the iterative procedure is
the first dominant eigenvector of C^. Its associated eigenvalue X\ follows from eq.
(31.5a):
^^^v^^C^v, (31.64)
C , - M ( v , v?-)^Cp (31.65)
matrix are zero (or near zero within a specified tolerance), then we have exhausted
the cross-product matrix C^. Otherwise, the algorithm (eq. (31.62)) is repeated,
yielding the second eigenvector V2 and its associated eigenvalue \\. Because of
the definition of the residual matrix (eq. (31.65)), we obtain that y^ is orthogonal to
Vj. A new residual matrix is computed and the process is repeated until complete
exhaustion of C^. At the end we have decomposed the cross-product matrix C^ into
r orthogonal components (where r is at most equal to p)\
C , = i M ( v , v J ) = VA2V^ (31.66)
k
31.5 Validation
A question that often arises in multivariate data analysis is how many meaning-
ful eigenvectors should be retained, especially when the objective is to reduce the
dimensionality of the data. It is assumed that, initially, eigenvectors contribute
only structural information, which is also referred to as systematic information.
141
100
Time
s
Power
Jacobi
10 -\
Householder-QR
Fig. 31.14. Performance of three computer algorithms for eigenvalue decomposition as a function of
the dimension of the input matrix. The horizontal and vertical scales are scaled logarithmically. Exe-
cution time is proportional to a power 2.6 of the dimension.
'•' d
with
. p . n ^ n p
p j n i -^ ^P i j
1 n p
^p i j
31.5.1 Scree-plot
This empirical test is based on the so-called Scree-plot which represents the
residual variance as a function of the number of eigenvectors that have been
extracted [42]. The residual variance V of the r*-th eigenvector is defined by:
r
V(r*) A
Fig. 31.15. Scree-plot, representing the residual variance V* as a function of the number of factors r*
that has been extracted. The diagram is based on a factor analysis of Table 31.2 after log dou-
ble-centering. A break point occurs after the second factor, which suggests the presence of only two
structural factors, the residual factors being attributed to noise and artefacts in the data.
TABLE 31.9
Validation results obtained from factor analysis of Table 31.2, containing the retention times of 23
chalcones in 8 chromatographic methods, after log double-centering and global normalization. The results
are used in the Malinowski's F-test and in cross-validation by PRESS.
r* r-r*
^i" V(r*) F(r*) PRESS(r*) W(r*)
vector. This structural variance should be larger than the average V of the vari-
ances contributed by the r-r* error eigenvectors that follow:
- ] ' 1
V(r*) = y X\=--^V(r*) withl<r*<r (31.70)
Hence, the number of structural eigenvectors is the largest r* for which Malinowski 's
F-ratio is still significant at a predefined level of probability a (say 0.05):
31.5.3 Cross-validation
Wfl ff
Fig. 31.16. Two of the four masks that are used in the calculation of a PRESS value in cross-validation.
The remaining two masks are obtained by a parallel shift of the diagonal lines that represent data
points to be deleted.
number of analyses to four. In this scheme, elements to be left out are located on
equally spaced diagonals (pseudodiagonals) such as illustrated by the masks of
Fig. 31.16. Actually, there are four such masks, but only two are shown here. This
scheme ensures that not more and not less than about one fourth of each row and
column are left out at the same time. The elements on the diagonals are predicted
from the remaining ones. Then the diagonals are shifted one element to the right
and the procedure is repeated. After four such analyses we have predicted all
elements of the table. The predicted residual error sum of squares or PRESS for r*
eigenvectors is given by:
where z* is the value predicted for Zij using r* eigenvectors [44]. For a general
discussion of PRESS the reader is referred to Section 10.3.4 and to Chapter 36.
As more stmctural eigenvectors are included we expect PRESS to decrease up to
a point when the structural information is exhausted. From this point on we expect
PRESS to increase again as increasingly more error eigenvectors are included. In
order to determine the transition point r* one can compare PRESS(r*+l) with the
previously obtained PRESS(r*). The number of structural eigenvectors r* is reached
when the ratio:
PRESS(r* +1)
W(r*): with 1 < r* < r (31.73)
PRESS(r*)
becomes larger than 1 [45]. Note that the symbols Vand Win this section represent
the traditional notation which differs from our standardized nomenclature in this
chapter.
146
where J. J is the reconstructed value of z,y using r* eigenvectors and without leaving
any element out. The results of cross-validation of the transformed retention times
of Table 31.2 are compiled in the last two columns of Table 31.9. They point
toward the presence of two structural factors.
where the vectors xj and x^. represent the iih and i'-th rows of X.
It can be regarded as a special case of the squared weighted Euclidean distance
(Section 30.2.2.1). A property of (weighted) Euclidean distance functions is that
the distances between row-items D are invariant under column-centering of the
table X:
J n p
In the case of r = 2 we obtain the ordinary Euclidean distance of eq. (31.75), which
is also called the L2-norm. In the case of r = 1 we derive the city-block distance
(also called Hamming-, taxi- or Manhattan-distance), which is also referred to as
the Lj-norm:
^/'.+ y
=1^ (31.80)
where x-^, x^-, x^^ represent the /th row-sum, theyth column-sum and the global sum
over all rows and columns, respectively.
The Chi-square distance can be seen as a weighted Euclidean distance on the
transformed data:
Zij=Xij/E{Xij) (31.81)
where the expected value Eix^^ has been defined in Section 16.2.3 as:
•^i+ ^+j
E{x,) = (31.82)
148
In this respect, the weight coefficients are proportional to the column-sums. Dist-
ances of Chi-square form the basis of correspondence factor analysis (CFA) which
is discussed in Chapter 32.
In Section 31.2.2 we have derived the Euclidean distances D between two items
/ and /' from their corresponding variances and covariances C:
The inverse relationship has been established by Young and Householder [48] and
is also referred to as Gower's formula because of the extensive use he has made of
it [49]:
Cn'-^-\idl-dl-d\-^d^) (31.84)
where d], d\ and d^ refer to the mean of row /, the mean of column i and the
global mean over all rows and columns, respectively, of the squared distance
matrix D^. The variance-covariance matrix C equals the double-centered squared
distance matrix D (multiplied by the constant -0.5).
Once the nXAz variance-covariance matrix C has been derived one can apply
eigenvalue decomposition (EVD) as explained in Section 31.4.2. In this case we
obtain:
C = U A^ U'^ with U'^ U = I
where U represents the Azxr matrix of orthonormal factor scores, where A^ is the
ry.r diagonal matrix of eigenvalues and where r is the rank of C.
149
Factor scores for each of the n objects with respect to the r computed factors are
then defined in the usual way by means of the wxr orthogonal factor score matrix S:
S = UA
These factor scores can be used for the construction of a score plot in which each
object is represented as a point in the plane of the first two dominant factors.
A close analogy exists between PCoA and PCA, the difference lying in the
source of the data. In the former they appear as a square distance table, while in the
latter they are defined as a rectangular measurement table. The result of PCoA also
serves as a starting point for multidimensional scaling (MDS) which attempts to
reproduce distances as closely as possible in a low-dimensional space. In this
context PCoA is also referred to as classical metric scaling. In MDS, one mini-
mizes the stress between observed and reconstructed distances, while in PCA one
maximizes the variance reproduced by successive factors.
Non-linear PCA can be obtained in many different ways. Some methods make
use of higher order terms of the data (e.g. squares, cross-products), non-linear
transformations (e.g. logarithms), metrics that differ from the usual Euclidean one
(e.g. city-block distance) or specialized applications of neural networks [50]. The
objective of these methods is to increase the amount of variance in the data that is
explained by the first two or three components of the analysis. We only provide a
brief outline of the various approaches, with the exception of neural networks for
which the reader is referred to Chapter 44.
done, for example, in the HOMALS program (Homogeneity and Alternating Least
Squares) [51] where non-linear transformations are applied in order to maximize
the homogeneity of the data. In this context, homogeneity is inversely related to the
within-variance of the rows of the data table. For example, a table in which the
values in each individual row are constant (but different from zero) possesses
maximum homogeneity [52]. The HOMALS approach is most suitable in the case
of categorical data, i.e. when the column-items represent categories of one or more
variables such as age, gender, etc. Correspondence factor analysis (CFA) which is
discussed in Chapter 32 may be regarded as a special case of HOMALS.
The logairithmic transformation prior to column- or double-centered PCA (Sect-
ion 31.3) can be considered as a special case of non-linear PCA. The procedure
tends to make the row- and column-variances more homogeneous, and allows us to
interpret the resulting biplots in terms of log ratios.
The non-linear PCA biplot approach makes use of distance metrics that differ
from the ordinary and weighted Euclidean metrics. We have already discussed the
linear biplot resulting from PCA of a column-centered measurement table (Section
31.3.3). In this case, row-items are represented as a pattern of points and column-
items are depicted by means of unipolar axes, i.e. straight lines through the origin
of the plot (which is at the same time the centroid of the pattern). The original data
in a column of the table can also be read off by means of perpendicular projection
of the row-points upon the unipolar axes and by interpolating between equidistant
markers (tick marks). In the non-linear case, the axes are curved and the markers
on them are no longer equidistant (Fig. 31.17).
The theory of the non-linear PCA biplot has been developed by Gower [49] and
can be described as follows. We first assume that a column-centered measurement
table X is decomposed by means of classical (or linear) PCA into a matrix of factor
scores S and a matrix of factor loadings L:
X = U A V^ = S L'^
with
S =UA and L =V
This corresponds with a choice of factor scaling coefficients a = 1 and (3 = 0, as
defined in Section 31.1.4. Note that classical PCA implicitly assumes a Euclidean
metric as defined above. Let us consider theyth coordinate axis of column-space,
which is defined by a ;?-vector of unit length e^ of the form:
e. = (0 0 ... 1 ... 0 0)
151
Fig. 31.17. (a) In a classical PCA biplot, data values xij can be estimated by means of perpendicular
projection of the iih row-point upon a unipolar axis which represents the jth column-item of the data
table X. In this case the axis is a straight line through the origin (represented by a small cross), (b) In a
non-linear PCA biplot, theyth column-item traces out a curvilinear trajectory. The data value JC/, is now
estimated by defining the shortest distance between the iih row point and theyth trajectory.
which has zeroes in all p positions except at theyth one. The projection of e^ onto
factor-space yields the coordinates of the jth column-item of X, which are con-
tained in the 7th row of the factor loading matrix L. Instead of e^ we may also
project the vector kj Cy, where kj is a variable coefficient ranging from -00 to +00. In
that case, we obtain a series of points the locus of which is the unipolar axis
through the origin of space and through the representation of the 7th column-item
of X. In classical PCA, the origin of space is also the centroid of the row-items of
the column-centered table X. This projection method can also be used to define
markers along the 7th unipolar axis, by choosing appropriate values of the co-
152
efficient/:y (e.g. ...,-10,-5,0,5,10, ...).If we set/:y equal to the successive values x^^
in theyth column of X we obtain the exact projections of the row-items (/ = 1,..., n)
along theyth axis. These projections usually differ from the orthogonal projections
in the plane of the biplot when the contribution of the first two factors is less than
100% of the variance in X.
The same idea can be developed in the case of a non-Euclidean metric such as
the city-block metric or Lj-norm (Section 31.6.1). Here we find that the traject-
ories, traced out by the variable coefficient kj are curvilinear, rather than linear.
Markers between equidistant values on the original scales of the columns of X are
usually not equidistant on the corresponding curvilinear trajectories of the non-
linear biplot (Fig. 31.17b). Although the curvilinear trajectories intersect at the
origin of space, the latter does not necessarily coincide with the centroid of the
row-points of X. We briefly describe here the basic steps of the algorithm and we
refer to the original work of Gower [53,54] for a formal proof.
Given the original nxp measurement table X, one derives the table of nxn
distances D between the n row-items, using a particular distance function cp:
p
f^.,=-Udfikj)-dl+^dA (31.86)
2 V 2
where the terms of the right-hand member have been defined above.
The result of the linear regression can be expressed in the form:
Note that df and d^ represent the row- and global means of the squared distance
matrix D, respectively. The latter needs only be computed once and for all. The n
distances of the variable point d{kj) to the n row-points, however, are to be re-
evaluated for every column-item of X and for every marker which is to appear in
the corresponding trajectory in the biplot. Usually, the range of k^ is limited
between the minimum and maximum value in the 7th column of X or somewhat
beyond (say 10% of the range on either side).
31.8.1 Unfolding
pl I I I
I }
/ . I ^ I
n
nXq
PI
nXp
\
nL _ _Ji _
Fig. 31.18. Three ways of unfolding a three-way table into a rectangular matrix.
(31.88)
f S h
155
where I^, I^ and I^ represent the unit matrices of dimensions r, s and t, respectively.
The number of factors r, s and t, assigned to each mode, is generally different.
They are chosen such as to be less than the dimensions of the original three-way
table in order to achieve a considerable amount of data reduction. The elements of
Z represent the magnitude of the factors and the extent of their interaction [57].
Computationally, the core matrix Z and the loadings matrices A, B and C are
derived such as to minimize the sum of squared residuals.
By analogy with singular vector decomposition (SVD) and ordinary PCA, one
can also define the Tucker3 model in an extended matrix notation:
(31.89)
X=A Z B+ E
The extended matrix notation is represented here by the three dots that surround
the core matrix. A graphical example of the Tucker3 model is rendered in Fig.
31.19.
>
I r'
Fig. 31.19. Tucker3 or core matrix decomposition of an nx/7X^ three-way table X. The matrix Z repre-
sents the rx5xr core matrix. A, B and C are thenxr,/?X5 and ^xdoading matrices of the row-, column-
and layer-items of X, respectively.
156
where e^j^ represents a residual error term. This decomposition is also referred to as
the trilinear model
In this case, the nxpxq three-way table X is decomposed into the nxr, pxr, qxr
loading matrices A, B, C for the row-, column- and layer-items of X.
In the PARAFAC model, the three loading matrices A, B, C are not necessarily
orthogonal [56]. The solution of the PARAFAC model, however, is unique and
does not suffer from the indeterminacy that arises in principal components and
factor analysis.
In contrast to the Tucker3 model, described above, the number of factors in each
mode is identical. It is chosen to be much smaller than the original dimensions of
the data table in order to achieve a considerable reduction of the data. The elements
of the loading matrices A, B, C are computed such as to minimize the sum of
squared residuals.
The PARAFAC model can also be defined by means of an extended matrix
notation:
C
X=A i B (31.91)
where I represents an rxrxr identity array (with ones on the superdiagonal and
zeroes in the off-diagonal positions). The decomposition is represented schematic-
ally in Fig. 31.20. It can be seen from (eqs. (31.89) and (31.91)) that the PARAFAC
model is a special case of the Tucker3 model where Z = I and r = 5 = r.
A
c
/ r_
I y---
Fig. 31.20. PARAFAC decomposition of an nxpxq three-way table X. The core matrix I is an rxrxr
three-way identity matrix. A, B and C represent the nxr, pxr and qxr loading matrices of the row-,
column- and layer-items of X, respectively.
validation (Section 31.5). If only two or three structural factors are present in the
data, it can be helpful to provide the clustering tree in addition to the plot of factor
scores in order to better comprehend the latent structure. It is also possible to
represent the cluster structure graphically on the score plot, for example, by
grouping those points that cluster together at a certain level in the clustering tree
[59].
Although cluster analysis is based on the distances or similarities between a set
of objects, its graphic representation in the form of a clustering tree has no metric
property. The separation between any two leaves (i.e. terminal objects) in the tree
usually bears no relation to their actual distance as defined by the data table or on
the score plot with a reduced number of factors. In fact, given n objects one can
easily show that there are 2^"^ equivalent hierarchical clustering trees, which can be
obtained by reversals of the branches of the tree. In some special cases it is
possible, however, to derive a unique or canonical representation of the objects in a
clustering tree, by relating the order of appearance in the tree to their correspond-
ing positions on the score plot. This is relevant, for example, in so-called phylo-
genetic applications, where the distance or similarity between any two objects
reflects the time that elapsed after their evolutionary divergence. Such applications
arise in biological taxonomy, in the study of proteins, in the classification of DNA
sequences, etc. In these cases, one often finds that the objects (species, proteins,
DNA strings) in a score plot are arranged approximately on a circle around the
origin, at least when a two-factor solution is capable of reproducing the hereditary
158
or ancestral structure adequately. If the center of the score plot represents the
primeval ancestor, then one expects all living present-day species, proteins or
DNA strings to be more or less at an equal distance in evolutionary space. There
may be discrepancies, however, because biological clocks may tick at different
rates in different branches of the evolutionary tree. The positions of the objects are
defined unambiguously on the score plot, apart from reflections about the factor
axes. Hence, given a particular orientation of the axes, one may rearrange the
clustering tree such that the order of its leaves corresponds optimally with the order
of the objects in the score plot [60]. This then leads to a canonical representation of
the clustering tree which is based on a preliminary PCA of the data.
References
17. P.J. Lewi, Spectral map analysis. Analysis of contrasts, especially from log-ratios. Chemom.
Intell. Lab. Syst., 5 (1989) 105-116.
18. J.P. Benzecri, L'analyse des Donnees. Vol II. L'analyse des Correspondances. Dunod, Paris,
1973.
19. H. Wold, Soft modelling by latent variables: the non-linear iterative partial least squares
(NIPALS) algorithm. In: Perspectives in Probability and Statistics, J. Gani (Ed.). Academic
Press, London, 1975, pp. 117-142.
20. G.H. Golub and C. Reinsch, Singular value decomposition and least squares solutions. Numer.
Math., 14(1970)403-420.
21. H. Hotelling, Analysis of a complex of statistical variables into principal components. J. Educ.
Psychol., 24 (1933) 417^41.
22. C.G.J. Jacobi, Ueber ein leichtes Verfahren die in der Theorie der Saekularstoerungen vor-
kommender Gleichungen numerisch aufzuloesen. J. Reine Angew. Math., 30 (1846) 51-95.
23. B.N. Parlett, The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs, NJ, 1980.
24. K.E. Iverson, Notation as a tool of thought (1979 Turing award lecture). CACM, 23 (1980)
444^65.
25. MATLAB, Engineering Computation Software, The Math Works Inc., Natick, 1992.
26. SAS/IML (Interactive Matrix Language) Software: User and Reference Manual. Version 6,1 st
Ed., SAS Institute Inc., Gary, NC, 1990.
27. J.J. Dongarra, C.B. Moler, J.R. Bunch and G.W. Stewart, LINPACK. SIAM, Philadelphia, 1979.
28. B.T. Smith, Matrix Eigensystem Routines — EISPACK Guide. 2nd edn.. Lecture Notes in
Computer Science, Vol. 6. Springer, New York, 1976.
29. W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, Numerical Recipes. The Art
of Scientific Computing. Cambridge Univ. Press, Cambridge, UK, 1986, pp. 335-363.
30. R. Fisher and W. A. MacKenzie, Studies in crop variation II. The manurial response of different
potato varieties, J. Agricult. Sci., 13 (1923) 311-320.
31. P.J. Lewi, B. Vekemans and L.M. Gypen, Partial least squares (PLS) for the prediction of
real-life performance from laboratory results. In: Scientific Computing and Automation (Eu-
rope) 1990. E.J. Karjalainen (Ed.). Elsevier, Amsterdam, 1990, pp. 199-210.
32. A.S. Householder, The Theory of Matrices in Numerical Analysis. Blaisdell, New York, 1964.
33. J.G.F. Francis, The QR transformation. Parts I and II. Computer J., 6 (1958) 378-392.
34. A. Ralston and H.S. Wilf, Mathematical Methods for Digital Computers. Wiley, New York,
1960.
35. H. Rutishauser, Solution of eigenvalue problems with the LR-transformation. Appl. Math. Ser.
Nat. Bur. Stand., 49 (1958) 47-81.
36. J.H. Wilkinson, Global convergence of tridiagonal QR algorithm with origin shift. Algorithms
Applic, 1 (1968) 409-420.
37. W.W. Cooley and P.P. Lohnes, Multivariate Procedures for the Behavioral Sciences. Wiley,
New York, 1962.
38. A. Ley sen. Systematic study of algorithms for eigenvector extraction on PC. Thesis, Kath. Ind.
Hogeschool der Kempen, Geel, Belgium, 1989.
39. S. Rannar, F. Lindgren, P. Geladi and S. Wold, A PLS kernel algorithm for data sets with many
variables and fewer objects. Part I: theory and algorithm. J. Chemom., 8 (1994) 11-125.
40. J.M. Deane, Data reduction using principal components analysis. In: Multivariate Pattern Rec-
ognition in Chemometrics, R. Brereton (Ed.). Chapter 5, Elsevier, Amsterdam, 1992, pp.
125-165.
41. R.G. Brereton, Chemometrics, Applications of Mathematics and Statistics to Laboratory Sys-
tems. Ellis Horwood, New York, 1990, p. 221.
160
42. R.B. Cattell, The Scree test for the number of factors. Multivariate Behavioral Res., 1 (1966)
245-276.
43. E.R. Malinowski, Statistical F-tests for abstract factor analysis and target testing. J. Chemom.,
3(1988)49-60.
44. S. Wold, Cross-validatory estimation of the number of components in factor and principal
components models. Technometrics, 20 (1978) 397^05.
45. M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics. Wiley, New York, 1986, p. 255.
46. J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 27
(1971)857-874.
47. I.E. Frank and R. Todeschini, The Data Analysis Handbook. Elsevier, Amsterdam, 1994.
48. G. Young and A.S. Householder, Discussion of a set of points in terms of their mutual dis-
tances. Psychometrika, 3 (1938) 19-22.
49. J.C. Gower, Some distance properties of latent root and vector methods used in multivariate
analysis. Biometrika, 53 (1966) 325-328.
50. G.E. Hinton, How neural networks learn from experience. Sci. Am., Sept. (1992) 145-151.
51. A. Gifi, Non-linear Multivariate Analysis. Wiley, Chichester, UK, 1990.
52. J.P. van de Geer, Multivariate Analysis of Categorical Data: Applications. Sage Publications,
Newbury Park, CA, 1993.
53. J.C. Gower and S.A. Harding, Non-linear biplots. Biometrika, 75 (1988) 445-455.
54. J.C. Gower, Adding a point to vector diagrams in multivariate analysis. Biometrika, 55 (1968)
582-585.
55. T. Kourti and J.F. MacGregor, Process analysis, monitoring and diagnosis, using multivariate
projection methods. Chemom. Intell. Lab. Syst., 28 (1995) 3-21.
56. P. Geladi, Analysis of multi-way (multi-mode) data. Chemom. Intell. Lab. Syst., 7 (1989)
11-30.
57. A.K. Smilde, Three-way analysis. Problems and prospects. Chemom. Intell. Lab. Syst., 15
(1992) 143-157.
58. R. Henrion, N-way principal component analysis. Theory, algorithms and applications.
Chemom. Intell. Lab. Syst., 25 (1994) 1-23.
59. P.N. Nyambi, J. Nkengasong, P. Lewi, K. Andries, W. Janssens, K. Fransen, L. Heyndrickx, P.
Piot and G. van der Groen, Multivariate analysis of human immunodeficiency virus type 1 neu-
tralization data. J. Virol., 70 (1996) 6235-6243.
60. P.J. Lewi, H. Moereels and D. Adriaensen, Combination of dendrograms with plots of latent
variables, Application of G-protein coupled receptor sequences. Chemom. Intell. Lab. Syst., 16
(1992) 145-154.
161
Chapter 32
Table 32.1 describes 30 persons who have been observed to use one of four
available therapeutic compounds for the treatment of one of three possible disorders.
The four compounds in this measurement table are the benzodiazepine tranquil-
lizers Clonazepam (C), Diazepam (D), Lorazepam (L) and Triazolam (T). The
three disorders are anxiety (A), epilepsy (E) and sleep disturbance (S). In this
example, both measurements (compounds and disorders) are defined on nominal
scales. Measurements can also be defined on ordinal scales, or on interval and ratio
scales in which case they need to be subdivided in discrete and non-overlapping
categories.
Each column of a measurement table can be expanded into an indicator table.
The rows of an indicator table refer to the same objects and in the same order as in
the measurement table. The columns of the indicator table represent non-over-
lapping categories of the selected measurement. Table 32.1 has been expanded
into the indicator Table 32.2 for compounds and into the indicator Table 32.3 for
disorders. In the indicator table for compounds, a value of one in a particular row is
recorded if a person has used the corresponding compound. In the indicator table
for disorders, one in a particular row indicates that the person has been treated for
the corresponding disorder. All other elements of the row are set to zero. Note that
the order of the columns in the indicator tables is not relevant.
A contingency table X can be constructed by means of the matrix product of two
indicator tables:
X = FJ (32.1)
where I is the indicator table for the first measurement and J is the indicator matrix
for the second measurement. The rows of I and J must refer to the same objects and
in the same order. The dimensions of a contingency table are defined by the
number of categories of the two measurements.
Each element x^^ in an nxp contingency table X represents the number of objects
that can be associated simultaneously with category / of the first measurement and
with category j of the second measurement. The element x^y can be interpreted as
162
TABLE 32.1
Measurement table describing the use of four compounds (C, D, L, T) for the treatment of three disorders
(A, E, S) by 30 persons. For explanation of symbols, see text.
# Compound Disorder
1 L A
2 D E
3 C S
4 T S
5 L A
6 D A
7 C E
8 C S
9 D A
10 C E
11 D S
12 L A
13 D S
14 C E
15 C S
16 D A
17 L S
18 D E
19 L S
20 L A
21 C E
22 T S
23 D A
24 T S
25 D E
26 T A
27 T S
28 C E
29 D E
30 D A
163
TABLE 32.2
Indicator table for compounds, from Table 32.1
Compound
1 0 0 1 0
2 0 1 0 0
3 1 0 0 0
4 0 0 0 1
5 0 0 1 0
6 0 1 0 0
7 1 0 0 0
8 1 0 0 0
9 0 1 0 0
10 1 0 0 0
11 0 1 0 0
12 0 0 1 0
13 0 1 0 0
14 1 0 0 0
15 1 0 0 0
16 0 1 0 0
17 0 0 1 0
18 0 1 0 0
19 0 0 1 0
20 0 0 1 0
21 1 0 0 0
22 0 0 0 1
23 0 1 0 0
24 0 0 0 1
25 0 1 0 0
26 0 0 0 1
27 0 0 0 1
28 1 0 0 0
29 0 1 0 0
30 0 1 0 0
All 8 11 6 5
164
TABLE 32.3
Indicator table for disorders, from Table 32.1
# Disorder
A E S
1 1 0 0
2 0 1 0
3 0 0 1
4 0 0 1
5 1 0 0
6 1 0 0
7 0 1 0
8 0 0 1
9 1 0 0
10 0 1 0
11 0 0 1
12 1 0 0
13 0 0 1
14 0 1 0
15 0 0 1
16 1 0 • 0
17 0 0 1
18 0 1 0
19 0 0 1
20 1 0 0
21 0 1 0
22 0 0 1
23 1 0 0
24 0 0 1
25 0 1 0
26 1 0 0
27 0 0 1
28 0 1 0
29 0 1 0
30 1 0 0
All 10 9 11
165
TABLE 32.4
Contingency table X derived from the indicator Tables 32.2 and 32.3
Sum 10 9 11 30
p
x^^ = ^x^j with / = 1,..., n (32.2)
./
n p
=11'.
I J
166
Table 32.5 presents the expected values of the elements in the contingency
Table 32.4. Note that the marginal sums in the two tables are the same. There are,
however, large discrepancies between the observed and the expected values. Small
discrepancies between the tabulated values of our illustrations and their exact
values may arise due to rounding of intermediate results.
For example, the observed value for Clonazepam in anxiety has been recorded
as 0 in Table 32.4. The corresponding expected value in Table 32.5 has been
computed from eq. (32.3) as follows:
x^+ x+1 _ 8x 10
£:(x,j): = 2.67
jc_ ~ 30
where jCj j is the element in the first row and first column of the table, where Xj^, x^^
are the corresponding marginal sums, and where x^^ is the global sum.
The generalization of Pearson's chi-square statistic x^ for 2x2 contingency
tables, which has been discussed in Section 16.2.3, can be written as:
( ^1+ -"'•+./ V
n P (Y _ f(y \\2 n p
^-h+ J
(32.4)
Eix,) •^1+ -^+7
Sum 10 9 11 30
167
The chi-square statistic is a measure for the global deviation of the observed
values in a contingency table from their expected values. The number of degrees of
freedom of the chi-square statistic defines the number of elements of the table that
can be varied independently of each other. Since n row-sums and/? column-sums
are held fixed and since both add up to the same global sum, the number of degrees
of freedom of an nxp contingency table is np minus n -\- p - I or (n - I) (p - I). The
chi-square statistic can be tested for significance using tabulated critical values of
the statistic for a fixed level of significance (e.g. 0.05, 0.01 or 0.001) and for the
specified degrees of freedom.
In the case of the 4x3 contingency Table 32.4 we obtain a chi-square value of
15.3 with 6 degrees of freedom, which is significant at the 0.05 level of probability
as it exceeds the critical value of 12.6. The probability corresponding with a
chi-square value exceeding 15.3 equals 0.02. From this result we may be led to
conclude that there are significant differences between the prescription patterns of
the four compounds, as well as between the treatment patterns of the three
disorders. One should be cautious, however, in drawing such a conclusion,
because of three considerations. First, the statistical evidence of this positive
outcome is not overwhelming, as it leaves a two percent chance of being false.
Secondly, the chi-square test is not quite appropriate in this case, since all but two
cell frequencies are smaller than five and three of them are even zero. Finally, a
statistically significant result does not necessarily mean that the differences between
compounds and disorders are of practical relevance. These considerations should
not preclude, however, an exploratory analysis of the data in order to search for
meaningful correlations and trends that might be the object of more detailed
studies in the future. In the case of a negative outcome of the chi-square test, one
might have assumed that there are no differences between the drugs and between
the disorders. This conclusion also might be erroneous in view of the lack of power
that is inherent to a small sample (30 persons in this example). In this case, an
exploratory analysis may still yield relevant associations and tendencies which can
guide the design of future studies.
32.3 Closure
Clonazepam
Diazepam
Lorazepam y/^y^yyM
Triazolam V^/M^^A
Anxiety Epilepsy Sleep
Fig. 32.1. Row-profiles representing the data in the rows of Table 32.4.
32.3.1 Row-closure
fx^A +7
(32.5)
^/+
32.3.2 Column-closure
Anxiety
Epilepsy
Sleep
Fig. 32.2. Column-profiles representing the data in the columns of Table 32.4.
^
^^:/=- (32.6)
"+./ v^+^ J "+./
32.3.3 Double-closure
TABLE 32.6
Table of double-closed data Z, with weighted means, weighted sums of squares and weight coefficients
added in the margins, from Table 32.4
The effect of these operations is shown in Table 32.6 using the data of our
example in Table 32.4. For example, in the case of Clonazepam in anxiety, the
observed value x^^ has been shown to be equal to 0 in Table 32.4, and the
corresponding expected value has been derived to be 2.67 in Table 32.5. The
corresponding deviation of the double-closed value from its expected value is then
computed as:
^^ £(jcii) 2.67
It is important to realize that closure may reduce the rank of the data matrix by
one. This is the case with row-closure when n>p, and with column-closure when
n<pAi is always the case with double-closure. This reduction of the rank by one is
the result of a linear dependence between the rows or columns of the table that
results from closure of the data matrix.
where x and y are vectors with the same dimension n and where W is a diagonal
matrix of dimension n:
W = diag(w)
171
in which the elements of the weight vector w are positive constants which add to
unity. In what follows we will omit the predicate 'Euclidean' which is intended
implicitly throughout this chapter.
A weighted metric is uniquely defined by the metric matrix W. The usual metric
is but a special case of the weighted metric, which is obtained when W equals the
identity matrix I (divided by the dimension n). In the usual metric we assign equal
weights to all the axes of coordinate space. This is the metric that is used in daily
life for measuring distances, lengths, widths, heights, etc. Weighted metrics play
an important role in multivariate data analysis when different weights are assigned
to objects and measurements according to their size or importance. This is the case
in correspondence factor analysis which will be discussed in Section 32.6.
It has been shown in Chapter 29 that the set of vectors of the same dimension
defines a multidimensional space S in which the vectors can be represented as
points (or as directed line segments). If this space is equipped with a weighted
metric defined by W, it will be denoted by the symbol S^. The squared weighted
distance between two points representing the vectors x and y in S^ is defined by the
weighted scalar product:
n
where llx - yll^ refers to the weighted norm or length of the vector x - y in the
weighted metric defined by W.
Distances in S^ are different from those in the usual space S. A weighted space
S^ can be represented graphically by means of stretched coordinate axes [2]. The
latter result when the basis vectors of the space are scaled by means of the
corresponding quantities in vw, where the vector w contains the main diagonal
elements of W. Figure 32.3 shows that a circle is deformed into an ellipse if one
passes from usual coordinate axes in the usual metric I to stretched coordinate axes
in the weighted metric W. In this example, the horizontal axis in5^ is stretched by
a factor ^1.6 = 1.26 and the vertical axis is shrunk by a factor V0.4 = 0.63.
A similar effect in S^ is observed on angular distances (or angles), which are
also different from those in the unweighted space 5:
cosd^ = ^— (32.10)
11x11 J l y l l ,
with
where d^ is the angular distance and where 11x11^ and llyll^are the weighted norms or
172
1.
(Ya4)
1 (Yr6)
1 0 1.6 0
W: w=
2 2
0 1 0 0.4
Fig. 32.3. Effect of a weighted metric on distances, (a) representation of a circle in the space S defined
by the usual Euclidean metric, (b) representation of the same circle in the space 5^ defined by a
weighted Euclidean metric. The metric is defined by the metric matrix W.
lengths of the vectors x and y as measured from the origin of 5^. Two vectors x and
y are said to be orthogonal in the weighted metric defined by W if and only if:
x'^Wy = 0 (32.12)
Generally, two vectors that are orthogonal in S will be oblique in 5^, unless the
vectors are parallel to the coordinate axes. This is illustrated in Fig. 32.4. Further-
more, if X and y are orthogonal vectors in S, then the vectors W~^^^x and W"^^^y are
orthogonal in 5^. This follows from the definition of orthogonality in the metric W
(eq. 32.12):
xTW-l/2;Y^-l/2y = xV = 0
In Section 29.3 it has been shown that a matrix generates two dual spaces: a
row-space 5^^ in which the p columns of the matrix are represented as a pattern P^,
and a column-space 5^ in which the n rows are represented as a pattern P^. Separate
weighted metrics for row-space and column-space can be defined by the corre-
sponding metric matrices W^ and W^. This results into the complementary
weighted spaces 5^ and 5^, each of which can be represented by stretched
coordinate axes using the stretching factors in ^w „ and J w ^ , where the vectors
w^ and w^ contain the main diagonal elements of W„ and W^.
173
1
(Yoi)
1 (VT.6)
Fig. 32.4. Effect of a weighted metric on angular distances, (a) representation of two line segments that
are perpendicular in the space S defined by the usual Euclidean metric, (b) representation of the same
two line segments in the space 5H, defined by a weighted Euclidean metric. The metric matrix W is the
same as in Fig. 32.3b.
m^ = x ^ w = xTWl = ^ W. X: (32.13)
1 1
w, = • and w^=- with / = l,...,n and y = !,...,/?
174
In the case of an nxp matrix X we can define two sets of weighted measures of
location and spread, one for each of the two dual spaces S^ and 5 ^ , in terms of
weighted means and weighted variances:
m:=XW^l^=XWp or <=X>^v^,y
J
(32.15)
n
m;=X^W„l„=X'rw„ or mJ=J^w,x,j
c: = ( X - i n - l p 2 W^ 1^ or c- =j^Wjix,-mry
(32.16)
where 1„ and 1^ represent sum vectors with n and p elements equal to one, resp-
ectively. In this notation, X^ results in squaring each element of X. The expression
XW^l^ produces a weighted summation over the columns of X, and X^W„1„
performs a weighted summation over the rows of X.
One may assign greater weight (or mass) to objects and measurements that
should have a greater influence in the analysis than the others. In regression
problems one may define weights to be inversely proportional to the variance of
the objects, such as to lend more influence to those that have been measured with
greater precision. In correspondence factor analysis (CFA) the weights are defined
from the marginal sums of the table, as will be shown in Section 32.6. The result of
an exploratory analysis depends to some extent on the choice of the weight
coefficients in w„ and w^ . Sometimes the effect is only slight, for example, when
the observed data in a contingency table are close to their expected values. In other
cases, when the discrepancies between observed and expected values are large, the
effect of variable weighting can be prominent. As a general rule, one will select the
weights according to the nature of the data and the objective of the analysis.
175
.2 ^ P X
52=^ =XS^ -^^^ (32.17)
•^++ / ./• -^ i+ •^+j
32.5.1 Row-closure
with
W: = and w: =
' X .
176
^ ij "^ +j
fij = with / = 1,..., nandy = 1, ...,p
In 5^ the row-profiles are centred about the origin, as can be shown by working
out the expression for the weighted column-means m^:
•^i+ -^+7
m =0 with;= 1, ...,/7
y-^++ "++ y
v^±:__^jiiZ__s2
^r=l^jf.;=l =6 with /= 1, ..., n (32.19)
"+j
The distance 5, is a measure of the difference of the ith row-closed profile with
respect to the average or expected row-profile.
The distance 6,,^ between two row-profiles / and /' can also be expressed
formally as a distance of chi-square:
.2
U I J
32.5.2 Column-closure
with
177
"+.1
w, = • and Wj=-
X;^
where G stands for the matrix of deviations of column-closed profiles, cf. eq.
(32.6):
In this case we find that the pattern of column-profiles in S^ is centred about the
origin, as can be seen by computing the weighted row-means m„:
•^i+ •^+J
<=l^j8, =Z =0 with / = 1,..., n
\^-^++ ^++ J
The weighted sums of squares c^ of the points around the origin of 5^ can be
expressed as distances of chi-square:
\2
(
V^+7'
-.r=I-<^^=i: = 8y withy = !,...,/? (32.22)
The distance 5^ is a measure of the difference of the yth column-profile from the
average or expected column-profile.
The distance 6^y.between two column-closed profilesy and/ can also be defined
as a distance of chi-square:
(X X .^
ij iJ
V^+./ 1LZ_-K2
^^i(8il -Sij'f =S 6t with7,/= l,...,p (32.23)
X^
32.5.3 Double-closure
l^l-^4=l-il-.4 (32.24)
++ J •^++
178
with
''+./
w- =- and w. =-
and where Z is the matrix of deviations of the double-closed data, cf. eq. (32.7):
The pattern of points produced by Z is centred in both dual spaces S^ and 5^,
since the weighted row- and column-means m„ and m^ are zero:
'^ f
^ij
m :0 withy = 1,..., P
V^+7
P f X
m r=I-.^. =S X.
ij ^+j
^++y
=0 with / = 1,..., n
m Z^/Z^7^.y =0
Weighted sums of squares c]^ and c^ around the origins of the two spaces can
be interpreted as distances of chi-square:
2
v-X ^8j withy = 1, ...,/?
(32.25)
where 5, and 5^ have been defined above in terms of the elements and marginal
sums of X. These distances are measures for the differences of the row- or column-
closed profiles from their average or expected profiles. The symbols cf' and c'J
denote the weighted sums of squares of row / and column j in the transformed
matrix Z.
The values of Z in Table 32.6 already reveal some peculiar aspects of the data.
One should compare these values with zero, i.e. the value that would be obtained if
the observed values were equal to their expected ones. The larger the deviation
from zero, the stronger is the interaction between the corresponding row and
179
or
5,..4 =0.931
Similarly, we compute the distance of chi-square from the origin for epilepsy,
using the same deviations Z and the row-weights w^:
4
+ 0.167(-1.000)2 = 0.696
or
6 ,.^2 = 0-834
Distances between points in 5^ and S^ can also be defined formally as distances
of chi-square:
p
where 5 - and 5^y. have been defined before in terms of the elements and marginal
sums of X. We use the symbols cJJ. and cjj^ to denote the weighted sums of
cross-products between rows / and i' and between columnsy and/ in the transformed
matrix Z.
180
or
8,r=34 = i n 7
In a similar way we derive the distance of chi-square between anxiety (/ = 1) and
epilepsy (/ = 2) using the same deviations Z and the row-weights w„:
4
^r=\2 =Ya^Mi\ -^ii)^ =0.267(-1.000-1.083)2+ 0.367(0.364-0.212)2
One finds an illustration of the measures of location and spread in the margins of
Table 32.6. The geometric properties of row-closed and column-closed profiles are
summarized in Fig. 32.5.
In the following section on the analysis of contingency tables we will relate the
distances of chi-square in terms of contrasts. In the present context we use the word
contrast in the sense of difference (see also Section 31.2.4). For example, we will
show that the distance of chi-square from the origin 5, can be related to the amount
of contrast contained in row / of the data tables, with respect to what can be
expected. Similarly, the distance 5^ can be associated to the amount of contrast in
column y, relative to what can be expected. In a geometrical sense, one will find
rows and columns with large contrasts at a relatively large distance from the origin
of 5^ and 5 ^ , respectively. The distance of chi-square 6- then represents the
amount of contrast between rows / and i with respect to the difference between
their expected values. Similarly, the distance of chi-square 6^^ indicates the amount
of contrast between columns^ a n d / in comparison to the difference between their
181
Fig. 32.5. (a) Geometric representation of row-closed profiles in the weighted column-space S^. The
row-profiles are shown as an elliptic pattern P". The figure shows the distance of a row-profile / from
the origin and the distance between two row-profiles i and i\ (b) Geometric representation of col-
umn-closed profiles in the weighted row-space S^. The column-profiles are shown as an elHptic pat-
tern P\ The figure shows the distance of a column-profile; from the origin and the distance between
two column-profiles7 and/.
expected values. In geometrical terms, one will find in 5^ rows with large contrast
between them at a relatively large distance from one-another. Likewise, columns
with large contrast between them will also be found in S^ at a relatively large
distance from one another.
Note that double-closure yields the same results as those produced by row-
closure in 5^ and by column-closure in S^. From an algorithmic point of view,
double-closure is the more attractive transformation, although row- and column-
closure possess a strong didactic appeal.
182
The origin of the analysis of contingency tables can be traced back to the
definition by Pearson in 1900 of the chi-square statistic as a measure for goodness
of fit [3]. The partition of the chi-square into main effects and interaction has been
described around 1950 by Lancaster [4]. One of the first practical applications of
multivariate analysis to contingency tables is due to Hill [5] who developed the
method of reciprocal averaging. This method is used for the ordination of plant
species from tabulated surveys carried out in various plots of land yielding a
species x plots matrix. The result allows one to order species according to a latent
vector which can be ascribed to environmental characteristics, such as density of
the soil, abundance of sunlight, thickness of cover, etc.
Reciprocal averaging is considered as the forerunner of correspondence factor
analysis (CFA) which has been developed by Benzecri [6,7] and a group of French
statisticians [8]. It is often also referred to as correspondence analysis (CA) for
short. Originally, the term correspondence means association between rows and
columns of a table, i.e. the number of joint occurrences of the categories represented
by the rows and columns of a contingency table. The French term 'analyse des
correspondances' points to the analysis of the interaction between rows and
columns. The French school also used a geometrical approach for the inter-
pretation of its results, especially by means of the biplot which has been designed
by Gabriel [9]. In the discussion of measurement tables in Chapter 31, we have
defined biplots as the joint representation of rows and columns of a table in a
low-dimensional space spanned by latent vectors. The concept of latent vectors
will be generalized to the case of weighted metrics in the following section.
Initially, CFA has been applied in linguistic studies. Later on, applications have
been described in many diverse fields of the empirical sciences. The method of
CFA can be extended to the analysis of tables with non-negative elements, not
necessarily counts, provided that these have been recorded with a common unit of
measurement. Such a table is called homogeneous, as opposed to a heterogeneous
table in which the rows or columns are defined with different units. Heterogeneous
tables can be analyzed by means of CFA, after subdivision of the measurements
into discrete categories (e.g., in the case of lipophilicity: strongly lipophilic,
weakly lipophilic or hydrophilic, strongly hydrophilic). CFA has played a decisive
role in the development of multivariate data analysis, especially by means of its
graphic aspect. It also has stimulated interest in related areas such as principal
components analysis and cluster analysis.
The position of CFA, as well as that of its related methods, is that of a
preliminary step to statistical inference. In this respect, it has been regarded as an
183
inductive approach [2]. This means that, first, hypotheses have to be formulated
from an analysis of empirical evidence. After this exploratory phase, the newly
formulated hypotheses have to be tested more rigorously using appropriate
statistical methods of inference.
Correspondence factor analysis can be described in three steps. First, one
applies a transformation to the data which involves one of the three types of closure
that have been described in the previous section. This step also defines two vectors
of weight coefficients, one for each of the two dual spaces. The second step
comprises a generalization of the usual singular value decomposition (SVD) or
eigenvalue decomposition (EVD) to the case of weighted metrics. In the third and
last step, one constructs a biplot for the geometrical representation of the rows and
columns in a low-dimensional space of latent vectors.
TABLE 32.7
1/2 1/2
Usual singular vectors extracted from the transformed Z in Table 32.6, i.e. W^ ZW^
Compound
Clonazepam u,0.750 Uj0.113 Disorder
Anxiety Vj
-0.644 ^-0.502
Diazepam -0.025 -0.540 Epilepsy 0.762 -0.346
Lorazepam -0.619 -0.148 Sleep -0.075 0.792
Triazolam -0.232 0.820
In Table 32.7 we observe a contrast (in the sense of difference) along the first
row-singular vector Uj between Clonazepam (0.750) and Lorazepam (-0.619).
Similarly we observe a contrast along the first column-singular vector Vi between
epilepsy (0.762) and anxiety (-0.644). If we combine these two observations then
we find that the first singular vector (expressed by both Uj and Vj) is dominated by
the positive correspondence between Clonazepam and epilepsy and between
Lorazepam and anxiety. Equivalently, the observations lead to a negative corres-
pondence between Clonazepam and anxiety, and between Lorazepam and epilepsy.
In a similar way we can interpret the second singular vector (expressed by both U2
and V2) in terms of positive correspondences between Triazolam and sleep and
between Diazepam and anxiety.
From eq. (32.30) we can derive the solution for the generalized SVD problem of
Z:
B = W-'^^\ (32.31)
|Ll = A
TABLE 32.8
Generalized singular vectors extracted from Z in Table 32.6
Compound Disorder
Clonazepam 1.452 0.220 Anxiety -1.115 -0.870
Diazepam -0.042 -0.892 Epilepsy 1.390 -0.632
Lorazepam -1.385 -0.331 Sleep -0.124 1.308
Triazolam -0.569 2.010
A = ZW^BA^
(32.32)
B = Z^W A A
From the solution of the usual EVD problem (Section 31.4.2) follows that:
which yields:
which yields:
B = W;^/2V and ^ =A
Note that the eigenvalues in A^ are the same in both the row- and column-problems.
The rank r of Z is also the rank of the weighted cross-products matrices Z^W^Z
and TN^^, Singular vectors and eigenvectors are the same, and it is appropriate to
refer to them by the more general terms of latent vectors [10]. Singular values are the
square roots of the corresponding eigenvalues. The latter express the contribution of
the associated latent vectors to the global interaction between rows and columns.
The trace of A^ is equal to the traces of the weighted cross-product matrices,
which in turn are equal to the global weighted sum of squares c"^ or global
interaction 6^ (eq. (32.27)):
tr(A2) = tr(Wy2Z'^W„ZW^/2)^
32.6.3 Biplots
In CFA we can derive biplots for each of the three types of transformed
contingency tables which we have discussed in Section 32.3 (i.e., by means of
row-, column- and double-closure). These three transformations produce, respect-
ively, the deviations (from expected values) of the row-closed profiles F, of the
column-closed profiles G and of the double-closed data Z. It should be reminded
that each of these transformations is associated with a different metric as defined
by W^ and W^. Because of this, the generalized singular vectors A and B will be
different also. The usual latent vectors U, V and the matrix of singular values A,
however, are identical in all three cases, as will be shown below. Note that the
usual singular vectors U and V are extracted from the matrix W^^^ZW^^^.
In what follows we restrict the discussion of biplots to the case of double-closed
data Z as defined by the elements:
All other cases can be readily derived by analogy. In the case of row-closed
profiles we have to perform the following substitutions:
_ •^++
w,. = - ^ and Wf
J
•^-\-+ ^+7
^ = Sij = (32.40)
_ •^++
w,. and W . := - ^
J
•*^/+ •^++
It can be proved that the matrix W^^^ ZW^^^ is the same in all cases. Hence, the
decomposition UAV^ is also the same, and since the decomposition is unique we
obtain that U, V and A are the same for all three cases. The matrices A and B,
however, depend on the type of closure that has been applied to the data.
From the latent vectors and singular values one can compute the nxr general-
ized score matrix S and the pxr generalized loading matrix L. These matrices
contain the coordinates of the rows and columns in the space spanned by the latent
vectors:
S = AA« (32.41a)
and
L = BAP (32.41b)
where the factor scaling coefficients a and P may take the values 0, 0.5 and 1. In
Chapter 31 we have amply discussed the implications of the various choices of a
and p for the interpretation of the biplots. The main considerations are reviewed
below in the context of CFA:
a=l
for reproduction of distances from the origin 5, and distances 6,,/ between the
row-profiles / and i\
p=i
for reproduction of distances from the origin 5y and distances 5^^. between the
column-profilesy and/,
a + p=l
for reproduction of the values Zij by means of perpendicular projections upon
unipolar axes; also for reproduction of the differences (or contrasts) z^ - Zi^j and
Zij - Zif by means of perpendicular projections upon bipolar axes.
Unipolar and bipolar axes have been discussed in Section 31.2. Briefly, a
unipolar axis is defined by the origin and the representation of a row or column. A
bipolar axis is drawn through the representations of two rows or through the
representations of two columns. Projections upon unipolar axes reproduce the
values in the transformed data table. Projections upon bipolar axes reproduce the
contrasts (i.e. differences) between values in the data table.
In the case when a = 1 we obtain:
SS^ ^AAAA'r = A A 2 A ' ^
(32.42)
189
which shows that the distances of the rows of the biplot in S'^ are equal to the
distances of chi-square in 5^ which we have derived in Section 32.5.
In the case of (3 = 1 we can show that:
from which we conclude that distances of the columns on the biplot in S"^ are equal
to the distances of chi-square in 5^ which we have already derived in Section 32.5.
In the special case of a + (3 = 1 we can state:
SL^ ^AA^APB'T = A A B T
(32.46)
= W;i/2 UAV^ W;i/2 = Z
which shows that the projection in 5"^ of point s, upon a unipolar axis 1^ reproduces
the value z„:
r
(32.48)
which defines the projection in S"^ of point s^ upon the bipolar axis 1^ - \y resulting in
the contrast Zij - Zif.
190
TABLE 32.9
Compound Disorder
Clonazepam 0.823 0.095 Anxiety -0.632 -0.378
Diazepam -0.024 -0.388 Epilepsy 0.788 -0.275
Lorazepam -0.785 -0.144 Sleep -0.070 0.568
Triazolam -0.322 0.873
which defines the projection in y of point \j upon the bipolar axis s, - s,^.
The generalized scores and loadings from the 4x3 contingency table are given in
Table 32.9 for the particular choice of a = (3 = 1. A plot of these generalized scores
and loadings is shown in Fig. 32.6 using the coordinates of the points which are
listed in Table 32.9. Distances from the origin and distances between points have
been computed according to the expressions given in Section 32.5. In this example
we find two distinct eigenvectors with eigenvalues of 0.322 and 0.188, respect-
ively. This is compatible with the maximal rank of a 4x3 data matrix after
double-closure. It should be reminded that double-closure always reduces the rank
of the data matrix by one. The sum of the eigenvalues (0.510) is equal to the global
sum of squares (interaction) which is shown in Table 32.6.
The two plots can be superimposed into a biplot as shown in Fig. 32.7. Such a
biplot reveals the correspondences between the rows and columns of the cont-
ingency table. The compound Triazolam is specific for the treatment of sleep
disturbances. Anxiety is treated preferentially by both Lorazepam and Diazepam.
The latter is also used for treating epilepsy. Clonazepam is specifically used with
epilepsy. Note that distances between compounds and disorders are not to be
considered. This would be a serious error of interpretation. A positive correspondence
between a compound and a disorder is evidenced by relatively large distances from
the origin and a common orientation (e.g. sleep disturbance and Triazolam). A
negative correspondence is manifest in the case of relatively large distances from
the origin and opposite orientations (e.g. sleep disturbance and Diazepam).
The geometrical reconstruction of the values in Z by perpendicular projection of
points upon axes can be justified algebraically by the matrix product of the scores S
with the transpose of the loadings L according to eqs. (32.41) and (32.46):
191
Triazolam 1 -i
\= 0.188
?^,= 0.322 \
Lorazepam
O
Diazepam
\= 0.188
Sleep
®
0.834'**-'^>.A= 0.322 j
1.423 Epilepsy
Fig. 32.6. (a) Generalized score plot derived by correspondence factor analysis (CFA) from Table
32.4. The figure shows the distance of Triazolam from the origin, and the distance between Triazolam
and Lorazepam. (b) Generalized loading plot derived by CFA from Table 32.4. The figure shows the
distance of epilepsy from the origin, and the distance between epilepsy and anxiety.
SL^ = A A « A P B T = A A B T =Z (32.50)
provided that a + |3 = 1.
The most common choices for the exponents in CFA are a = (3 = 1 and a = (3 = 0.5.
The choices a = 1, p = 0 and a = 0, p = 1 will produce biplots that are
192
H
X.2= 0.188
Triazolam
Sleep
Clonazepam
O
Lorazepam
-1 O \= 0.322 1
D
Epilepsy
Anx?etv DiazepaS
Fig. 32.7. CFA biplot resulting from the superposition of the score and loading plots of Figs. 32.6a and
b. The coordinates of the products and the disorders are contained in Table 32.9.
y= lK'lK (32.52)
32.6A Application
CFA is applied to the contingency Table 32.10 which has been adapted from a
retrospective study of US doctorates in chemistry and in other fields awarded to
men and women [11]. The rows of this table refer to consecutive time intervals
from 1920 up to 1989. The columns indicate non-overlapping categories of
doctorates. The marginal sums provide the totals by time interval and by category
of doctorate. These will be used for the assignment of weights to the rows and
columns of the table. Note that the data from 1920 up to 1959 have been grouped
into intervals of 10 or 5 years in order to level out statistical fluctuations due to
small counts, especially in the category of women doctorates in chemistry. This
pooling of rows does not affect the subsequent analysis by CFA. Indeed, rows (or
columns) that are similar can be grouped together, provided that the corresponding
weights are also added together. This is called the principle of distributional
equivalence of CFA [2].
Casual inspection of the table reveals that the total number of doctorates has
reached a peak value of 33727 in 1973 which has only been surpassed in the last
year of the time series. Peak values are also observed in the 1973 category of men
in other fields and in the 1970 category of men in chemistry. In both categories
related to women, no such peak values are observed as the annual counts seem to
increase more or less steadily over the whole period.
A more comprehensive analysis is obtained by CFA, the results of which are
summarized in Tables 32.11 and 32.12. The first column of these tables contains
the diagonal elements of the metric matrices W^ and W^. These normalized weight
coefficients are proportional to the marginal sums of the original contingency
Table 32.10 and sum to unity. The following three columns form the matrix of
generalized scores S and the matrix of generalized loadings L. These matrices
satisfy the relations (eq. (39.41)):
S = AA and L = BA
which follows from the generalized SVD (eq. (32.50)):
Z = AART with A^W^A = B^W^B = I,
where:
194
TABLE 32.10
Year W o m e n in M e n in W o m e n in M e n in Total
chemistry chemistry other fields other fields
TABLE32.il
Weights, scores, distances, contributions and precisions from CFA applied to Table 32.10
/ w. Sl S2 S3 K In T^n
TABLE 32.12
Weights, loadings, distances, contributions and precisions from CFA applied to Table 32.10
1, 1. 1. ^P
1 ( X,i X^
with / = 1, ..., n andy = 1, ...,p (32.54)
^ 5
\^i-\- ^+j
and where r is the rank of Z. The rank r is at most equal to the smaller ofn-l and
p- 1. Hence, in the present application to a 35x4 contingency table we can at most
obtain three distinct latent vectors. For practical reasons we have divided the
transformed data (eq. (32.7)) by the global distance of chi-square 6 (eq. (32.24)).
This type of normalization ensures that the global interaction in the data equals
unity:
The three latent vectors account for respectively 86, 13 and 1% of the interaction.
The next two columns in Tables 32.11 and 32.12 show the distances 5„ and 6^ of rows
and columns from the origin of space and their contributions y^ and y^ to the
interaction:
(32.55)
The final column in Tables 32.11 and 32.12 lists the precisions n^ and n^ with
which the rows and columns are represented in the plane spanned by the first two
latent vectors:
197
M&n Chemistry
m ®^40-49
j \.® 20-29
1 ® 30-39
||IlVoi77e;3 Chemistry
^50-54
^^^^^^m^^^^^^i"®^^^^
Fig. 32.8. CFA biplot computed from the data in Table 32.10. Circles represent years and squares
identify the four educational categories. The centre of the plot is represented by a small cross. The co-
ordinates of the years and the categories are contained in Tables 32.11 and 32.12. Factor scaling coef-
ficients were defined as a = P = 1.
with / = 1, ..., n
k
(32.56)
with7= 1, ...,/7
Figure 32.8 shows the biplot constructed from the first two columns of the
scores matrix S and from the loadings matrix L (Table 32.11). This biplot
corresponds with the exponents a = 1 and p = 1 in the definition of scores and
loadings (eq. (39.41)). It is meant to reconstruct distances between rows and
between columns. The rows and columns are represented by circles and squares
respectively. Circles are connected in the order of the consecutive time intervals.
The horizontal and vertical axes of this biplot are in the direction of the first and
second latent vectors which account respectively for 86 and 13% of the interaction
between rows and columns. Only 1% of the interaction is in the direction perpen-
dicular to the plane of the plot. The origin of the frame of coordinates is indicated
198
by a small cross (+) near the centre of the biplot. As can be seen from the precisions
K^ in Table 32.11, all years are well-represented in the plane of the biplot. The
precisions n^ in Table 32.12, however, show that the category of women in
chemistry is not so well-represented in the biplot as its precision amounts only to
0.770.
It is readily evident that the most dominant (horizontal) latent variable reflects a
contrast between women and men. From 1966 onwards there appears a sustained
increase of the proportion of women doctorates. The increase is most prominent
during the seventies as evidenced by the relatively large distances between adjacent
time intervals from 1971 to 1980. The second (vertical) latent variable is defined
by a contrast between chemistry and the other fields. Contrast is to be understood
here as a difference between profiles. Initially there is a progressive decrease in the
share of chemistry doctorates when compared to those in other fields. Around
1973, however, the trend reverses and the proportion of chemical degrees rises
slowly, but steadily.
The correspondences between rows and columns are evidenced by the positions
of the circles and squares with respect to the origin of the plot. Those points that are
closest to the origin are most similar to the average profile. Those that are further
away show specific differences with respect to the average profile. Circles and
squares that moved toward the border of the biplot, and in the same direction,
possess a positive correspondence. They seem to attract each other. Those that
moved in the opposite direction demonstrate a negative correspondence. They
seem to repel each other. For example, the category of chemistry doctorates
awarded to men exhibits a positive correspondence with the early years, and at the
same time a negative correspondence with the more recent years. A mechanical
analogy of forces of attraction and repulsion between a circle and a square is
appropriate here. One should refrain, however, from judging distances between
circles and squares. As stated already before, their closeness is not a measure of
their correspondence, only the distances from the origin and their angular distance
matter.
In the case when one of the two measurements of the contingency table is
divided in ordered categories, one can construct a so-called thermometer plot. On
this plot we represent the ordered measurement along the horizontal axis and the
scores of the dominant latent vectors along the vertical axis. The solid line in Fig.
32.9 displays the prominent features of the first latent vector which, in the context
of our illustration, is called the women/men factor. It clearly indicates a sustained
progress of the share of women doctorates from 1966 onwards. The dashed line
corresponds with the second latent vector which can be labelled as the chemistry/
other fields factor. This line shows initially a decline of the share of chemistry and
a slow but steady recovery from 1973 onwards. The successive decline and rise are
responsible for the horseshoe-like appearance of the pattern of points representing
199
1.5 n
Score
-0.5
1973
Women / Men
1966
-1.5
20 30 40 50 60 70 80 90
Years
Fig. 32.9. Thermometer plot representing the scores of the first and second component of a CFA ap-
plied to Table 32.10. The solid line denotes the first component which accounts for the women/men
contrast in the data. The broken line corresponds with the second component which reveals a contrast
between chemistry and other fields.
the rows in the biplot of Fig. 32.8. This phenomenon is also called the Guttman
effect [2]. The effect is due to the fact that a linearly varying series of scores or
loadings on factor 1 (e.g. 1, 2, 3) can be uncorrelated or orthogonal to a parabolic
series on factor 2 (e.g. -30, 0, 10).
For the purpose of comparison, we also discuss briefly the biplot constructed
from the CFA using the exponents a = 0.5 and (3 = 0.5 (Fig. 32.10). Such a display
is meant to reconstruct the values in the transformed contingency table Z by
projections of points representing rows upon axes representing columns (or vice
versa):
where:
This type of biplot does not reproduce distances or angles accurately, especially
when A,j and A-2 are very distinct. It can be shown that the distortion of distances that
occurs in the vertical direction is proportional to the square root of \IX2. The
advantage of this type of biplot is that it allows us to construct bipolar axes for any
200
Fig. 32.10. CFA biplot computed from the data in Table 32.10. Factor scaling coefficients were de-
fined as a = p = 0.5. This definition allows us to draw bipolar axes through the four educational catego-
ries, showing the contrast between women and men (horizontally) and between chemistry and other
fields (vertically).
pair of categories. For example, the more horizontal axes represent the contrasts
between women and men. The more vertical axes display contrasts between chem-
istry and the other fields. By perpendicular projection of the circles upon any
bipolar axis one obtains a relative ordering of the time intervals according to a
particular contrast. In the case of the women/men contrast in other fields (the
slightly positively sloping axis) we first note a somewhat stable situation until
1966, when the contrast evolves rapidly in favour of women doctorates. Similarly,
the chemistry/other fields contrast in women (the more vertical axis at the right)
shows a strong regression of chemistry until about 1975. In later years the situation
seems to have stabilized.
Both types of symmetric displays exhibited in Figs. 32.9 and 32.10 have their
merits. They are called symmetric because they produce equal variances in the
scores and in the loadings. In the case when a = p = 1, we obtain that the variances
along the horizontal and vertical axes are equal to the eigenvalues'h?associated to
the dominant latent vectors. In the other case when a = (3 = 0.5, the variances are
found to be equal to the singular values X.
201
32.7.2 Algorithm
n p
= 1^ W YW 1 =w^ Yw or y.. yij
where the sum vectors 1„ and 1^ have been defined before in eqs. (32.15) and
(32.16).
From this point on, the analysis is identical to that of CFA. Briefly, this involves
a generalized singular vector decomposition (SVD) of Z using the metrics W„ and
W^, such that:
Z = AAB'^ cf. eq. (32.50)
where A is the rxr diagonal matrix of singular values, A is the nxr matrix of
generalized latent vectors for rows, B is the pxr matrix of generalized latent
vectors for columns, and where r is the rank of Z. The generalized latent vectors A
and B are orthogonal in the weighted metrics W^ and W^, which implies that:
A^W^A = B^W^B = I,
For the same reason as for double-closure, double-centring always reduces the
rank of the data matrix by one, as a result of the introduction of a linear dependence
among the rows and columns of the data table.
Biplots can be constructed in the same way as in the case of CFA, by defining
scores S and loadings L, cf. eq. (32.41):
S = AA«
L = BAP
The choice of a = (3 = 1 will reproduce distances between rows and between
columns of Z. The choice a = p = 0.5 allows one to reconstruct the values and
contrasts of Z by perpendicular projections of points, representing rows, upon axes
representing columns (or vice versa). With these two choices for a and (3, the
analysis is symmetrical with respect to rows and columns.
203
The conventions of the LLM biplot are similar to those defined for the CFA
biplot. Circles represent the row-categories and squares denote the column-
categories of the contingency table. When the areas of circles and squares are
made proportional to the marginal sums of the table, then the areas confer a
notion of size or importance of the row- and column-categories that are represented.
The positions of the circles and squares from the origin (which appears as a small
cross near the centre) express the degree of difference or contrast of the corres-
ponding rows and columns with respect to their average profiles. Rows or columns
that are represented close to the origin show little contrast with their average
profiles. Those that are further away will have greater contrasts in their profiles.
The direction in which circles and squares move from the origin, in the same or in
opposite directions, is an expression of their positive or negative correspondence.
The way interaction is expressed in the LLM biplot is identical to that of the CFA
biplot. We can also define bipolar axes through two selected row-categories or
through two selected column-categories which form the poles of bipolar axes.
Additionally, these axes can be calibrated according to the ratios of the two
categories that define the axes. Perpendicular projection of the centres of the
circles upon any bipolar axis produces an (approximate) reading of the corres-
ponding ratios, as has been explained in Chapter 31. The approximation depends
on the precision with which the points are represented in the plane of the plot and is
limited by the visual interpolation on logarithmic scales.
LLM can also be defined as an expansion of a contingency table X using the
generalized latent vectors in A and B and the singular values in A by analogy with
eq. (32.53):
^/. ^.j
x^j = - e x p ^^k^ikt^. with/= 1, ...,/2 and 7 = 1,...,/? (32.58)
J
where x^, x j and x are the geometrical row-, column- and global means of X. If the
contrasts in the data are small, then the exponents in the above expression will be
small also. In that case we can further expand the exponential function, which
produces:
and which is similar to the expansion in eq. (32.53) obtained with CFA.
204
32.7.3 Application
The LLM approach described above has been applied to the 35x4 contingency
Table 32.10 after some modification. In this case, we have replaced the pooled data
in the first five rows by the corresponding average annual values. Further, columns
1 and 2 have been combined with columns 3 and 4 in order to produce the
categories of women in all fields and of men in all fields. The reason for the change
is our objective to produce an analysis of the ratios between chemistry and all
fields rather than between chemistry and the other fields.
In this analysis, weight coefficients for rows and for columns have been defined
as constants. They could have been made proportional to the marginal sums of
Table 32.10, but this would weight down the influence of the earlier years, which
we wished to avoid in this application. As with CFA, this analysis yields three
latent vectors which contribute respectively 89,10 and 1% to the interaction in the
data. The numerical results of this analysis are very similar to those in Table 32.11
and, therefore, are not reproduced here. The only notable discrepancies are in the
precision of the representation of the early vears up to 1972, which is less than in
the previous application, and in the precision of the representation of the category
of women chemists which is better than in the previous analysis by CFA (0.960 vs
0.770).
The overall interpretation of the LLM biplot of Fig. 32.11 is the same as
obtained with the CFA biplot of Fig. 32.10. The first (horizontal) latent variable
seems to be associated primarily with the women/men contrast, while the second
(vertical) latent variable is mostly associated with the chemistry/all fields contrast.
Thermometer plots, which represent the scores of the various time intervals as a
function of time, are similar to those in Fig. 32.9. They are not reproduced here as
they point to the same remarkable events, i.e. the sustained rise of the proportion of
women since 1966 and the recovery of the share of chemistry in 1973.
In Fig. 32.11 the women/men ratio in 1980 is estimated visually to be 0.190 in
chemistry and 0.430 in all fields. The exact ratios as computed from Table 32.10
are 0.198 and 0.435 respectively. The chemistry/all fields ratio in 1970 is visually
estimated at 0.090 for men and at 0.041 for women. The exact values from Table
32.10 are 0.080 and 0.046 respectively.
The question may be asked why there should be different methods when they
provide more or less the same kind of information. One answer is that not all
contingency tables will yield results that are agree as well as in the case we have
studied here. When there are very large contrasts, the results of CFA and LLM tend
to disagree. This occurs when there are several zeroes in the table, which in LLM
have to be replaced by small positive values. In that case it may pay off to analyse
the same contingency table with different methods, each of which may reveal
different aspects of the multivariate patterns [18]. Furthermore, LLM can be
205
Fig. 32.11. Log-linear model (LLM) biplot computed from the data in Table 32.10. Conventions are
the same as in Fig. 32.10. The areas of circles (representing years) and of squares (representing catego-
ries) are made proportional to the row- and column-totals in Table 32.10.
applied to tables of heterogeneous data, i.e. data that have been recorded with
different units, while CFA can only be applied to tables of homogeneous data,
unless the heterogeneous scales of measurement are subdivided into discrete
categories.
References
1. B.S. Everitt, The Analysis of Contingency Tables. Chapman and Hall, London, 1977.
2. M.J. Greenacre, Theory and Applications of Correspondence Analysis. Academic Press, Lon-
don, 1984.
3. A. Agresti, Categorical Data Analysis. Wiley, New York, 1990.
4. M.G. Kendall and A. Stuart, The Advanced Theory of Statistics. Vol. II. Ch. Griffin, London,
1960.
5. M.O. Hill, Reciprocal averaging: an eigenvector method of ordination. J. EcoL, 61 (1973)
237-224.
6. J.-P. Benzecri, L'analyse des Donnees. Vol. II, L'analyse des Correspondances. Dunod, Paris,
1973.
7. J.-P. Benzecri, Histoire et Prehistoire de 1'Analyse des Donnees. Dunod, Paris, 1982.
206
8. M.O. Hill, Correspondence analysis: a neglected multivariate method. Appl. Statist., 23 (1974)
340-355.
9. K.R. Gabriel, The biplot graphic display of matrices with application to principal components
analysis. Biometrika, 58 (1971) 453-446.
10. O.M. Kvalheim, Interpretation of direct latent-variable projection methods and their aims and
use in the analysis of multicomponent spectroscopic and chromatographic data. Chemom.
Intell. Lab. Syst., 4 (1988) 11-25.
11. K.G. Everett and W.S. Deloach, Chemistry doctorates awarded to women in the United States,
A historical perspective. J. Chem. Educ, 68 (1991) 545-547.
12. E.B. Andersen, Discussion of paper by L.A. Goodman. Int. Stat. Review, 54 (1986) 271-272.
13. P.J. Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical
compounds. Arzneim. Forsch. (Drug Res.), 26 (1976) 1295-1300.
14. P.J. Lewi, Spectral Map Analysis, Analysis of contrasts especially from log-ratios. Chemom.
Intell. Lab. Syst., 5 (1989) 105-116.
15. J.B. Kasmierczak, Analyse logarithmique, Deux exemples d'applicarion. Rev. Statist.
Appliquee, 33 (1985) 13-24.
16. L.A. Goodman, Some useful extensions of the usual correspondence analysis approach and the
usual log-linear models approach in the analysis of contingency table. Int. Stat. Review, 54
(1986)243-309.
17. Y. Escoufier and S. Junca, Least squares approximation of frequencies or their logarithms. Int.
Stat. Rev., 54 (1986) 279-283.
18. A. Thielemans, P.J. Lewi and D.L. Massart, Similarities and differences among multivariate
display techniques by Belgian Cancer Mortality Distribution data. Chemom. Intell. Lab. Syst.,
3(1988)277-300.
207
Chapter 33
Many books have been published about pattern recognition; one of these is
directed towards chemometrics [1].
There are many types of pattern recognition which essentially differ in the way
they define classification rules. In this section, we will describe some of the
approaches, which we will then develop further in the following sections. We will
not try to develop a classification of pattern recognition methods but merely
indicate some characteristics of the methods, that are found most often in the
chemometric literature and some differences between those methods.
A first distinction which is often made is that between methods focusing on
discrimination and those that are directed towards modelling classes. Most methods
explicitly or implicitly try to find a boundary between classes. Some methods such
as linear discriminant analysis (LDA, Sections 33.2.2 and 33.2.3) are designed to
find explicit boundaries between classes while the k-nearest neighbours (A;-NN,
Section 33.2.4) method does this implicitly. Methods such as SIMCA (Section
33.2.7) put the emphasis more on similarity within a class than on discrimination
between classes. Such methods are sometimes called disjoint class modelling
methods. While the discrimination oriented methods build models based on all the
classes concerned in the discrimination, the disjoint class modelling methods
model each class separately.
LDA and several other supervised methods focus on finding optimal boundaries
between classes: their first goal is to discriminate. In Section 17.9 we explained
also with an example from clinical chemistry that canonical variates can be
determined and plotted to discriminate between classes and that this is a way of
performing LDA. The example concerned the thyroid gland. People whose thyroid
gland functions normally are called euthyroid (EU) and patients whose thyroid
gland is too active or not active enough are called, respectively, hyperthyroid
(HYPER) or hypothyroid (KYPO). Clinicians want to make a distinction between
the three classes. This can be done using five chemical determinations among
which e.g. serum thyroxine or thyroid-stimulating hormone. This is clearly a
multivariate (five-dimensional) situation and LDA is applied. Allocation of patients
to one of the classes can be achieved in a more formal way by determining the
centroid of each group and drawing a boundary half-way between the two centroids
[2]. This is shown in Fig. 33.1.
Figure 33.1 is typical of many situations in clinical chemistry: it shows a tight
normal group (the EU group) and spreading out from it much more disperse
209
a
.2
a
Discriminant function 1
Fig. 33.1. Canonical variate plot for three classes with different thyroid status. The boundaries are ob-
tained by linear discriminant analysis [2].
abnormal groups (the HYPO and HYPER groups). In fact, this kind of picture is
also found in many non-clinical situations too (see, for instance, the air pollution
situation of Fig. 17.10). If one now determines boundaries as described in the
preceding paragraph, namely by situating linear boundaries half-way between the
centroids of adjacent classes, some patients of the more disperse abnormal classes
are classified as members of the more condensed class.
This classification problem can then be solved better by developing more
suitable boundaries. For instance, using so-called quadratic discriminant analysis
(QDA) (Section 33.2.3) or density methods (Section 33.2.5) leads to the boundaries
of Fig. 33.2 and Fig. 33.3, respectively [3,4]. Other procedures that develop
irregular boundaries are the nearest neighbour methods (Section 33.2.4) and neural
nets (Section 33.2.9).
One of the problems with discrimination-oriented methods is that we need to
classify each object in one of the given classes. It is, however, quite possible that an
210
o
o
•4—•
c
ac
a
c
>
3 . - - i .
Q .^r' 1 1
t' I 1 1 1
3 ,'3111 I1 1 1 I '*
, 3 \ 31 1 1 I I I I I I n 1 \
^ :^^ nil 11 11111 i
^'^^ 1 111 111 1 1 1 1 I 1 1 V' 1 2
3 3 ^ s l l l l l 2 x ' 2 2 2 2
1 "->>^.J-'2 2 2 2
3 2
•I 2
2 2 2
2 2 2 2
3 \ 2
\ 2
3 \ ' 2 2
3 3
Discriminant function 1
Fig. 33.2. As Fig. 33.1 but with boundaries obtained by quadratic discriminant analysis [3].
object should not be classified in any of these classes. Returning to the wine
example, where a classification between wines from three given origins was
wanted, we must realize that the sample submitted for classification may, in
reality, belong to none of the given origins, but to a fourth one. However, using the
discrimination-oriented methods, we will classify it necessarily in one of the three
given regions. A different approach to supervised pattern recognition can then be
useful. This consists of making a separate model of each class. Objects which fit
the model for a class are considered part of it and objects which do not fit are
classified as non-members. In discrimination terms, we could say that the class
model discriminates between membership and non-membership of a certain class.
In statistical terms, we can state that these methods perform outlier tests.
The conceptually simplest model, which for reasons explained later is called
UNEQ, is based on the multivariate normal distribution. Suppose we have carried
211
c
3 X 1
\
3|
/ 1 1 I
1/ 1 11 1
;3111 II /2
\ 31 M il
11111 iiim '
3 3
^ 1111
" 1' 111 111112'
3N ••••• ^/
V 11 1 1 1 1 1 1 11 1 1 1 1 1 / 1 2
3
^\ 1 11 1 1 2 ^ ^ 2 2 2 2
1 \ 1 , ''2 2 2
3
\ 1 /
2 2
\>
2 2 2
2 2
Discriminant function 1
Fig. 33.3. As Fig. 33.1 but with boundaries obtained by a density method [4].
out two tests, jCj and X2, with which we want to describe the health of a patient. Only
healthy patients are investigated. In Fig. 33.4, we could say that the ellipse
describing the 95% confidence limit for a bivariate normal distribution can be
considered as a model of the class of healthy patients. Those within its limits are
considered healthy and those outside would be considered non-members of the
healthy class. The bivariate normal distribution is therefore a model of the healthy
class.
In three dimensions (Fig. 33.5), the model takes the shape of an ellipsoid and in
m dimensions, we must imagine an m-dimensional hyperellipsoid. In the figure,
two classes are considered and we now observe that four situations can be
encountered when classifying an object, namely:
(a) the object is part of class K,
(b) the object is part of class L,
(c) the object is not a member of class K or L: it is an outlier, and
212
X
1 f
Fig. 33.4. Nintety-five percent confidence limit for a bivariate distribution as class envelope.
Fig. 33.5. Class envelopes in three dimensions as derived from the three-variate normal distribution.
(d) if K and L overlap the object can be a member of both classes K and L: it is
situated in a region of doubt.
UNEQ is applied only when the number of variables is relatively low. For more
variables, one does not work with the original variables, but rather with latent
variables. A latent variable model is built for each class separately. The best known
such method is SIMCA.
We also make a distinction between parametric and non-parametric techniques.
In the parametric techniques such as linear discriminant analysis, UNEQ and
SIMCA, statistical parameters of the distribution of the objects are used in the
derivation of the decision function (almost always a multivariate normal distribution
213
LDA is the best studied method of pattern recognition. It was originally proposed
by Fisher [2] and is applied very often in chemometrics. Applications can be found
for instance in the classification of Eucalyptus oils based on gas-chromatographic
data [6], the automatic recognition of substance classes from GC/MS [7], the
recognition of tablets and capsules with different dosages with the use of NIR
spectra [8] and in the already cited clinical chemical example (see Section 33.1). It
appears that there are several ways of deriving essentially the same methodology.
This may be confusing and, following a short article by Fearn [9], we will try to
explain the different approaches. A detailed overview is found in the book by
McLachlan [10].
Let us first consider two classes K and L in a bivariate space (xj, Xr^. Figure
33.6a shows the objects in this space. In Fig. 33.6b bivariate probability ellipses
are drawn representing the normal (bivariate) probability distributions to which
the objects belong. Since there are two classes, there are two such ellipses.
Basically, an object will be classified in the class for which it has the highest
probability. In Fig. 33.6b, object A is classified in class ^ because it has a (much)
higher probability in K than in L. In Fig. 33.6c, an additional ellipse is drawn for
each class. These ellipses both represent the same probability level in their
respective classes; they touch in point O half-way between the two class centres.
Line a is the tangent to the two ellipses in point O. Any point to the left of it, has a
higher probability to belong to K and to the right it is more probable that it belongs
to L. Line a can be used as a boundary, separating K from L.
In practice, we would prefer an algebraic way to define the boundary. For this
purpose, we define line d, perpendicular to a. One can project any object or point
on that line. In Fig. 33.6c this is done for point A. The location of A on d is given by
its score on d. This score is given by:
Z) = WQ + WjX] + ^2^2 (33.1)
a)
X
X X X
V X X
X X „ X
• • •
• X• • •
• • • •
. 'K
b)
X 0
-2
-6 -4 0
X1
Fig. 33.6. (a) Two classes A'and L to be discriminated, (b) confidence limits around the centroids oiK
and L, (c) the iso-probability confidence limits touch in O; a is a line tangential to both ellipses; d is the
optimal discriminating direction; A is an object.
It is observed that D as defined by eq. (33.1) is a latent variable, in the same way
as a principal component. We can consider LDA, as was the case for principal
components analysis, as a feature reduction method. Let us therefore again consider
the two-dimensional space of Fig. 33.6. For feature reduction, we need to determine
a one-dimensional space (a line) on which the points will be projected from higher,
here two-dimensional space. However, while principal components analysis selects
a direction which retains maximal structure in a lower dimension among the data,
LDA selects a direction which achieves maximum separation among the given
classes. The latent variable obtained in this way is a linear combination of the
original variables. This function is called the canonical variate. When there are k
classes, one can determine k- \ canonical variates. In Fig. 33.1, two canonical
variates are plotted against one another for three overlapping classes. A new
sample can be allocated by determining its location in the figure. The second way
of introducing LDA, discovered by Fisher, is therefore to rotate through O a line
216
• •
• • • • • • • • • X X X X X X X
• • • • • • • • • • • X X X X X X ^^
until the optimal discriminating direction is found (d in Fig. 33.6c). This rotation is
determined by the values of Wj and W2 in eq. (33.1).
These weights depend on several characteristics of the data. To understand
which ones, let us first consider the univariate case (Fig. 33.7). Two classes, K and
L, have to be distinguished using a single variable, jCj. It is clear that the discrim-
ination will be better when the distance between Xj^^ and x^j (i.e. the mean values,
or centroids, of jc, for classes K and L) is large and the width of the distributions is
small or, in other words, when the ratio of the squared difference between means to
the variance of the distributions is large. Analytical chemists would be tempted to
say that the resolution should be as large as possible.
When we consider the multivariate situation, it is again evident that the discrim-
inating power of the combined variables will be good when the centroids of the two
sets of objects are sufficiently distant from each other and when the clusters are
tight or dense. In mathematical terms this means that the between-class variance is
large compared with the within-class variances.
In the method of linear discriminant analysis, one therefore seeks a linear
function of the variables, D, which maximizes the ratio between both variances.
Geometrically, this means that we look for a line through the cloud of points, such
that the projections of the points of the two groups are separated as much as
possible. The approach is comparable to principal components, where one seeks a
line that explains best the variation in the data (see Chapter 17). The principal
component line and the discriminant function often more or less coincide (as is the
case in Fig. 33.8a) but this is not necessarily so, as shown in Fig. 33.8b.
Generalizing eq. (33.1) to m variables, we can write:
D = w'^x + Wo (33.2)
where it can be shown [10] that the weights w are determined for a two-class
discrimination as
w^ =(x, - X 2 ) ' ' S-' (33.3)
and
217
Xi4
o o
a)
^ PC.DF
Xi4
b)
Fig. 33.8. Situation where principal component (PC) and linear discriminant function (DF) are essen-
tially the same (a) and very different (b).
Wo = - - ( ^ 1 - ^ 2 ) ' ^ S U x i + X 2 ) (33.4)
In eq. (33.3) and (33.4) Xj and X2 ^re the sample mean vectors, that describe the
location of the centroids in m-dimensional space and S is the pooled sample
variance-covariance matrix of the training sets of the two classes.
The use of a pooled variance-covariance matrix implies that the variance-
covariance matrices for both populations are assumed to be the same. The conse-
quences of this are discussed in Section 33.2.3.
Example:
A simple two-dimensional example concerns the data from Table 33.1 and Fig.
33.9. The pooled variance-covariance matrix is obtained as [K^K + L^L]/(ni + ^2
- 2), i.e. by first computing for each class the centred sum of squares (for the
diagonal elements) and the cross-products between variables (for the other
218
lO 1 — 1 1 •y—I 1
\
o
14 - -
n
12 o ^ ^
o o ^
10 ^ ^
o
^ 8 o o )K -
^
6 o )K ~
o o )K X
4 - )i( -
2 -
{ 1 1 1
10 12 14
x1
Fig. 33.9. LDA applied to the data of Table 33.1; n is a new object to be classified.
elements), then summing the two matrices and dividing each element hyn^+n2-k
(here 10 + 10 - 2 = 18). As an example we compute the cross-term s^2 (which is
equal to 521). This calculation is performed in Table 33.2. In the same way we can
compute the diagonal elements, yielding
2.78 3.78 ^
3.78 10.56
and
0.70 -0.25
s-' = -0.25 0.18
Since
(6-11) -5
(x, - X 2 ) :
( 9 - 7) 2
219
TABLE 33.1
Example data set for linear discriminant analysis
Class 1 Class 2
Object ^1 X2 Object ^1 X2
1 8 15 11 11 11
2 7 12 12 9 5
3 8 11 13 11 8
4 5 11 14 12 6
5 7 9 15 13 10
6 4 8 16 14 12
7 6 8 17 10 7
8 4 5 18 12 4
9 5 6 19 10 5
10 6 5 20 8 2
mean 6 9 mean 11 7
TABLE 33.2
Computation of the cross-product term in the pooled variance-covariance matrix for the data of Table 33.1
Class 1 Class 2
S class 1 = 30 I class 2 = 38
Degrees of freedom: 1 0 + 1 0 - 2 = 1 8 .
Cross-product term: (30 + 38)/18 = 3.78
220
Wi=-4.00
W2 = 1.62
Wo= 21.08
and
D = 21.08-4.00 jci +1.62 JC2
We can now classify a new object n. Consider an object with x^ = 9 and JC2 = 13. For
this object
For two classes, Fisher arrived at similar results for the equations given above
by considering LDA as a regression problem. A response factory, indicating class
membership, is introduced: >^ = -1 for all objects belonging to class K and y = -\-\
for all objects belonging to class L. We then obtain the regression equation for y =
y(xj,X2). It is shown that, when there is an equal number of objects in K and L, the
same w values are obtained. If the number is not the same, then w^ and W2 are still
the same but WQ changes.
So far, we have described only situations with two classes. The method can also
be applied to K classes. It is then sometimes called descriptive linear discriminant
analysis. In this case the weight vectors can be shown to be the eigenvectors of the
matrix:
A = W-^ B (33.5)
where W is the within-group sum of squares and cross-products matrix and B is the
between-groups sum of squares and cross-products matrix. As described in [11]
this leads to non-symmetrical eigenvalue problems.
where ^ij^ and F^ are the population mean vector and variance-covariance matrix
of K respectively. They can be estimated by the sample parameters x^ and C^.
From these equations, one can derive the following classification rule: classify
object u of unknown class in the class K for which Dj^j^^^ is minimal, given
DIK.U=(^U-^KVC-,\X^-X,) (33.8)
where % is the prior probability that an object belongs to class K, rij^ is the number
of objects in the training set for class K and N is the total number of objects in the
training set. One computes OMA:,M ^^
X^t
Q)
^2
X, •
b)
X2
Fig. 33.10. Situations with unequal variance-covariance: (a) unequal variance, (b) unequal co-
variance.
over all classes. It follows that both QDA and LDA have advantages: QDA is less
subject to constraints in the distribution of objects in space and LDA requires less
objects than QDA. Friedman [12] has also shown that regularised discriminant
analysis (RDA), a form of discriminant analysis intermediate between QDA and
LDA, has advantages compared to both: it is less subject to constraints without
requiring more objects. The method has been used in chemometrics, e.g. for the
classification of seagrass [13] or pharmaceutical preparations [14].
X,
• • U
•• • J • ••
• • • • • • ^
• • • • o
• • • -
Fig. 33.13. A situation which necessitates classification of the unknown u by alternative A:-NN criteria.
The nearest neighbour method is often applied, with, in view of its simplicity,
surprisingly good results. An example where k-NN performs well in a comparison
with neural networks and SIMCA (see further) can be found in [16].
In density or kernel methods one imagines a potential field around the objects of
the learning set. For this reason these methods have also been called potential
methods. A variant for clustering was described in Section 30.3.3. One starts with
the selection of a potential function. Many functions can be used for this purpose,
but for practical reasons it is recommended that a simple one such as a triangular or
a Gaussian function is selected. The function is characterized by its width. This is
important for its smoothing behaviour (see below). Figure 33.14 shows a Gaussian
function for a class ^ in a one-dimensional space. The cumulative potential
function is determined by adding the heights of the individual potential functions
in each position along the x axis. The figure shows that the cumulative function
constitutes a continuous line which is never zero within a class. This is done
separately for each class.
226
Fig. 33.14. Density estimate for a test set using normal potential functions (univariate case).
Fig. 33.15. Classification of an unknown object u. f(K) and f(L) indicate the potential functions for
classes K and L.
Fig. 33.16. Influence of the smoothing parameters on the potential surfaces of classes which are (a) too
small and (b) too large.
learning procedure is then to select the most suitable value of the smoothing
parameter.
Advantages of these methods are that no a priori assumptions about distrib-
utions are necessary and that probabilistic decisions can be taken more easily than
with k-NN. In chemometrics, the method was introduced under the name ALLOC
[17, 18]. The methodology was described in detail in a book by Coomans and
Broeckaert [19]. The method was developed further by Forina and coworkers
[20,21].
In Section 18.4, we explained that inductive expert systems can be applied for
classification purposes and we refer to that section for further information and
example references. It should be pointed out that the method is essentially uni-
variate. Indeed, one selects a splitting point on one of the variables, such that it
achieves the "best" discrimination, the "best" being determined by, e.g., an
entropy function. Several references are given in Chapter 18. A comparison with
other methods can be found, for instance, in an article by Mulholland et al. [22].
Additionally, Breiman et al. [23] developed a methodology known as classifi-
cation and regression trees (CART), in which the data set is split repeatedly and a
binary tree is grown. The way the tree is built, leads to the selection of boundaries
parallel to certain variable axes. With highly correlated data, this is not necessarily
the best solution and non-Hnear methods or methods based on latent variables have
been proposed to perform the splitting. A combination between PLS (as a feature
reduction method — see Sections 33.2.8 and 33.3) and CART was described by
228
Yeh and Spiegelman [24]. Very good results were also obtained by using simple
neural networks of the type described in Section 33.2.9 to derive a decision rule at
each branching of the tree [25]. Classification trees have been used relatively
rarely in chemometrics, but it seems that in general [26] their performance is
comparable to that of the best pattern recognition methods.
As explained in Section 33.2.1, one can prefer to consider each class separately
and to perform outlier tests to decide whether a new object belongs to a certain
class or not. The earliest approaches, introduced in chemometrics, were called
SIMCA (soft independent modelling of class analogy) [27] and UNEQ [28].
UNEQ can be applied when only a few variables must be considered. It is based
on the Mahalanobis distance from the centroid of the class. When this distance
exceeds a critical distance, the object is an outlier and therefore not part of the
class. Since for each class one uses its own covariance matrix, it is somewhat
related to QDA (Section 33.2.3). The situation described here is very similar to that
discussed for multivariate quality control in Chapter 20. In eq. (20.10) the original
variables are used. This equation can therefore also be used for UNEQ. For
convenience it is repeated here.
Dl={x,-x^y C-\x,-x^) (33.11)
where Dj^ is the squared Mahalanobis distance between object / and the centroid
x^ of the class A'and C is the variance-covariance matrix of the n training objects
defining class K (see also eq. (33.7)). When D^ becomes too large for a certain
object, this means that it is no longer considered to be part of the class. The
Mahalanobis distance follows the Hotelling 7^-distribution. The critical value t^ is
defined as:
2 _ m{n - \){n •\-\)
^crit ~ Z ^^ Ma,m,n-m) (33.12)
n{n-m)
UNEQ requires a multivariate normal distribution and can be applied only when
the ratio objects/variables is sufficiently high (e.g. 3). When the ratio is less, as also
explained in Chapter 20, one can measure distances in the principal component
(PC)-space instead of the original space and in the pattern recognition context this
is usually either necessary or preferable. In SIMCA one applies latent variables
instead of the original variables. It, too, can be viewed as a variant of the quadratic
discriminant rule. SIMCA, the original version of which [27] was published in
1976, starts by determining the number of principal components or eigenvectors
needed to describe the structure of the training class (Section 31.5). Usually
229
PCI PCI
*-x, • X ,
a) b)
PCI PCI
c) d)
Fig. 33.17. SIMCA: (a) step 1 in a 1 PC model; (b) step 1 in a 2 PC model; (c) step 2 in a 1 PC model; (d)
step 2 in a 2 PC model.
^ = JSE4/[(''-''*K'^-^*-l)J (33.13)
The residuals from the model can be computed from the scores on the non-
retained eigenvectors, i.e. the scores t^j on the eigenvectors r* + 1 to r (r = min {{n -
1), m)). Then
^ = J S I,tfj/[(r-r^)(n^r^-l)] (33.14)
If care is not taken about the way s is obtained, SIMCA has a tendency to exclude
more objects from the training class than necessary. The 5-value should be deter-
mined by cross-vahdation. Each object in the training set is then predicted, using
the r*-dimensionaI PCA model obtained, for the other (n-l) training set objects.
The (residual) scores obtained in this way for each object are used in eq. (33.14)
[30].
A confidence limit is obtained by defining a critical value of the (Euclidean)
distance towards the model. This is given by
^crit=V^^ (33.15)
F^rit is the tabulated one-sided value for (r - r*) and (r - r*) (n - r* - 1) degrees
of freedom. The s^^-^^ is used to determine the boundary (the cylinder) around the
PCI line in Fig. 33.17c and the planes around the PCI, PC2 plane in Fig. 33.17d.
Objects with s < 5^^^ belong to class K, otherwise they do not. To predict whether a
new object x^^^ belongs to class K one verifies whether it falls within the cylinder
(for a one-dimensional model), between the limiting planes (for a two-dimensional
model, etc.). Suppose the following r* dimensional PC model was obtained
X^=T^L1,4-E^ (33.16)
with X^ the centred X-matrix for class K, T^^ the (un-normed) score matrix (nxr*),
(T^ = U^ A;^, where U^ is the normed score matrix and A^^ is the singular value
matrix).
L^ = the loading matrix (mxr'^)
Ef. = the matrix of residuals (nxm)
For a new object x^^^ one first determines the scores using eq. (33.17)
tLw=(x„ew-x^)'^L^ (33.17)
231
The Euclidean distance from the model is then obtained, similarly to eq. (33.14)
as:
(33.18)
V./=r*+l
If ^ng^ < s^^-^^, then the new object belongs to class K, otherwise it does not. A
discussion concerning the number of degrees of freedom can be found in [31 ]. This
article also compares SIMCA with several other methods.
A useful tool in the interpretation of SIMCA is the so-called Coomans plot [32].
It is applied to the discrimination of two classes (Fig. 33.18). The distance from the
model for class 1 is plotted against that from model 2. On both axes, one indicates
the critical distances. In this way, one defines four zones: class 1, class 2, overlap
of class 1 and 2 and neither class 1 nor class 2. By plotting objects in this plot, their
classification is immediately clear. It is also easy to visualize how certain a
classification is. In Fig. 33.18, object a is very clearly within class 1, object b is on
the border of that class but is not close to class 2 and object c clearly belongs to
neither class.
The first versions of SIMCA stop here: it is considered that a new object belongs
to the class, if it fits the r*-dimensional PC model. However, one can also consider
overlap 1 • a
1
zone 1
class 1
1
1
Fig. 33.18. The Coomans plot. Object a belongs to class 1, object b is a borderline class 1 object, object
c is an outlier towards the two classes.
232
that objects that fit the PC-model but, in that model, are far from the members of
the training class, are also outliers. Therefore, a second step was added to the
original version of SIMCA by closing the tube or the box (Fig. 33.17). This was
originally done by treating each PC in an univariate way. The limits were situated
on each of the r* PCs:
reduction (see Section 33.3) and the other is to apply methods such as partial least
squares (PLS - see Chapter 35) [45]. When there are two classes to be discrim-
inated PLSl is applied, which means that there is one independent variable y,
which for each object has a value 0 or 1. When there are more classes, PLS2 is
applied. The independent variable then becomes a vector of class variables, one for
each class, with a value of 1 for the class to which it belongs and zeros for all other
variables. Suppose that there are 4 classes and that a certain object belongs to class
2, then for that object y^ = [0 1 0 0]. One might be tempted to use PLSl, with an
independent variable that can take the values 1, 2, 3 and 4. However, this would
imply an ordered relationship between the four classes, such that the distance
between class 3 and 1 is twice that between class 2 and 1.
OUT
IN
Fig. 33.19. An artificial neuron. The inputs are weighted and summed according to Z = wixi + W2X2, X
is transformed by comparison with T and leads to a 0/1 value for y.
234
In the second stage, Z is transformed with the aid of a transfer function. For
instance, it can be compared to a threshold value. If Z > 7, then y=l and if I. <T,y
= 0. This is the simplest possible type of neuron, used here for didactic purposes
and not because it is the configuration to be recommended. Let us suppose that for
this isolated neuron Wj = 1, W2 = 2 and T=l. The line in Fig. 33.20 then gives the
values of X, and X2 for which Z = 7. All combinations of jCj and X2 on and above the
line will yield Z > 7and therefore lead to an output j j = 1 (i.e. the object is class K),
all combinations below it to 3^1 = 0. The procedure described here is equivalent to a
method called the linear learning machine, which was one of the first supervised
pattern recognition methods to be applied in chemometrics. It is further explained,
including the training phase, in Chapter 44.
Neurons are not used alone, but in networks in which they constitute layers. In
Fig. 33.21 a two-layer network is shown. In the first layer two neurons are linked
each to two inputs, jCj and ^2. The upper one is the one we already described, the
lower one has w, = 2, W2 = 1 and also T = 1. It is easy to understand that for this
neuron, the output y2 is 1 on and above line b in Fig. 33.22a and 0 below it. The
outputs of the neurons now serve as inputs to a third neuron, constituting a second
layer. Both have weight 0.5 and 7 for this neuron is 0.75. The output y^-^^^i of this
neuron is 1 if Z = 0.5 y^ + 0.5 ^2 > 0.75 and 0 otherwise. Since ^i and y2 have as
possible values 0 and 1, the condition for r > 0.75 is fulfilled only when both are
equal to 1, i.e. in the dashed area of Fig. 33.22b. The boundary obtained is now no
longer straight, but consists of two pieces. This network is only a simple demon-
stration network. Real networks have many more nodes and transfer functions are
usually non-linear and it will be intuitively clear that boundaries of a very complex
nature can be developed. How to do this, and applications of supervised pattern
recognition are described in detail in Chapter 44 but it should be stated here that
excellent results can be obtained.
235
Layer 1 Layer 2
a) b)
Fig. 33.22. (a) Intermediate (yi mdy2) outputs of the neural network of Fig. 33.21; (b) final output of
the neural network.
The similarity in approach to LDA (Section 33.2.2) and PLS (Section 33.2.8)
should be pointed out. Neural classification networks are related to neural regression
networks in the same way that PLS can be applied both for regression and
classification and that LDA can be described as a regression application. This can
be generalized: all regression methods can be applied in pattern recognition. One
must expect, for instance, that methods such as ACE and MARS (see Chapter 11)
will be used for this purpose in chemometrics.
236
One can — and sometimes must — reduce the number of features. One way is to
combine the original variables in a smaller number of latent variables such as
principal components or PLS functions. This is calltd feature reduction.
The combination of PCA and LDA is often applied, in particular for ill-posed
data (data where the number of variables exceeds the number of objects), e.g. Ref.
[46]. One first extracts a certain number of principal components, deleting the
higher-order ones and thereby reducing to some degree the noise and then carries
out the LDA. One should however be careful not to eliminate too many PCs, since
in this way information important for the discrimination might be lost. A method in
which both are merged in one step and which sometimes yields better results than
the two-step procedure is reflected discriminant analysis. The Fourier transform is
also sometimes used [14], and this is also the case for the wavelet transform (see
Chapter 40) [13,16]. In that case, the information is included in the first few
Fourier coefficients or in a restricted number of wavelet coefficients.
In feature selection one selects from the m variables a subset of variables that
seem to be the most discriminating. Feature selection therefore constitutes a means
of choosing sets of optimally discriminating variables and, if these variables are
the results of analytical tests, this consists, in fact, of the selection of an optimal
combination of analytical tests or procedures.
One way of selecting discriminating features is to compare the means and the
variances of the different variables. Variables with widely different means for the
classes and small intraclass variance should be of value and, for a binary discrim-
ination, one therefore selects those variables for which the expression
is maximal. It should be noted that, in this way, we select the individually best
variables. As the correlation between variables is not taken into account, this
means one not necessarily selects the best combination of variables.
Most of the supervised pattern recognition procedures permit the carrying out of
stepwise selection, i.e. the selection first of the most important feature, then, of the
second most important, etc. One way to do this is by prediction using e.g.
cross-validation (see next section), i.e. we first select the variable that best
classifies objects of known classification but that are not part of the training set,
then the variable that most improves the classification already obtained with the
first selected variable, etc. The results for the linear discriminant analysis of the
EU/HYPER classification of Section 33.2.1 is that with all 5 or 4 variables a
selectivity of 91.4% is obtained and for 3 or 2 variables 88.6% [2] as a measure of
classification success. Selectivity is used here. It is applied in the sense of Chapter
237
In the training or learning step, one develops a decision model (a rule) which
allows classification of the unknown samples to be carried out. The decision model
of Fig. 33.6 consists of line a and the classification rule is that objects to the right of
it are assigned to class L and objects to the left to class K. Once a decision rule has
been obtained, it is still necessary to demonstrate that it is a good one. This can be
done by observing how successful it is at classifying known samples (test set). One
distinguishes between recognition and prediction ability. The recognition (or
classification) ability is characterized by the percentage of the members of the
training set that are correctly classified. The prediction ability is determined by the
percentage of the members of the test set correctly classified by using the decision
functions or classification rules developed during the training step. When one only
determines the recognition ability, there is a risk that one will be deceived into
taking an overoptimistic view of the classification result. It is therefore also
necessary to verify the prediction ability. Both recognition and prediction ability
are usually expressed as (correct) classification rate, although other possibilities
exist (see Section 33.3).
The situation is very similar to that in regression (see Section 10.3.4), where we
validated the regression model by looking how well it modelled the objects that
were included in the calibration set (goodness- or lack-of-fit) and new samples
(prediction performance). This analogy should not surprise since regression and
classification are both modelling methods. Validation in pattern recognition is
therefore similar to validation in multivariate calibration and the reader should
refer to Chapter 36 for more details.
The ideal situation is when there are enough samples available to create separate
(independent) training and test sets. When this is not possible, an artifice is
necessary. The prediction ability is determined by developing the decision model
on a part of the training set only and using the other part as a mock test set. Often
this is repeated a few times until all training samples have been used as test
samples. If several objects at a time are considered as test samples, this is called a
resampling or {internal) cross-validation method, k-fold cross-validation ov jack-
knife method; when only one sample at a time is removed from the training set, it is
called a leave-one-out procedure. If the training set consists of 20 objects, a
jackknife method could be carried out as follows. We first delete objects 1-6 from
the training set and develop the classification rules with the remaining objects
7-20. Then, we consider 1-6 as the test set, classify them with the rules obtained
on objects 7-20 and note how many objects were classified correctly. The whole
procedure is then repeated after replacing first objects 1-6 in the training set but
deleting 7-12. The latter objects are classified using the classification rule developed
on a training set consisting of objects 1-6 and 13-20. Finally, a training set
239
consisting of 1-12 is used and objects 13-20 serve as test set and are classified.
The percentage of successes on the three runs together is then called the prediction
ability.
In general, it is found that prediction ability is somewhat less good than
recognition ability. If the prediction and the recognition ability are substantially
different, this means that the decision rules depend too much on the actual objects
in the training set: the solution obtained is not stable and should therefore not be
trusted.
Many other subjects are important to achieve successful pattern recognition. To
name only two, it should be investigated to what extent outliers are present,
because these can have a profound influence on the quality of a model and to what
extent clusters occur in a class (e.g. using the index of clustering tendency of
Section 30.4.1). When clusters occur, we must wonder whether we should not
consider two (or more) classes instead of a single class. These problems also affect
multivariate calibration (Chapter 36) and we have discussed them to a somewhat
greater extent in that chapter.
References
13. Y. Mallet, D. Coomans and O. de Vel, Recent developments in discriminant analysis on high
dimensional spectral data. Chemom. Intell. Lab. Systems, 35 (1996) 157-173.
14. W. Wu, Y. Mallet, B. Walczak, W. Penninckx, D.L. Massart, S. Heuerding and F. Erni, Com-
parison of regularized discriminant analysis, linear discriminant analysis and quadratic
discriminant analysis, applied to NIR data. Anal. Chim. Acta, 329 (1996) 257-265.
15. D. Coomans and D.L. Massart, Alternative K-nearest neighbour rules in supervised pattern
recognition. Part 2. Probabilistic classification on the basis of the kNN method modified for di-
rect density estimation. Anal. Chim. Acta, 138 (1982) 153-165.
16. E.R. Collantes, R. Duta, W.J. Welsh, W.L. Zielinski and J. Brower, Reprocessing of HPLC
trace impurity patterns by wavelet packets for pharmaceutical finger printing using artificial
neural networks. Anal. Chem. 69 (1997) 1392-1397.
17. J.D.F. Habbema, Some useful extensions of the standard model for probabilistic supervised
pattern recognition. Anal. Chim. Acta, 150 (1983) 1-10.
18. D. Coomans, M.P. Derde, I. Broeckaert and D.L. Massart, Potential methods in pattern recog-
nition. Anal. Chim. Acta, 133 (1981) 241-250.
19. D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Deci-
sion Making. Wiley, Chichester, 1986.
20. M. Forina, C. Armanino, R. Leardi and G. Drava, A class-modelling technique based on poten-
tial functions. J. Chemom. 5 (1991) 435^53.
21. M. Forina, S. Lanteri and C. Armanino, Chemometrics in Food Chemistry. Topics Curr.
Chem., 141 (1987)93-143.
22. M. Mulholland, D.B. Hibbert, P.R. Haddad and P. Parslov, A comparison of classification in
artificial intelligence, induction versus a self-organising neural networks. Chemom. Intell.
Lab. Systems, 30 (1995) 117-128.
23. L.J. Breiman, R. Freidman, R. Olsen and C. Stone, Classification and Regression Trees.
Wadsworth, Pacific Grove, CA, 1984.
24. C.H. Yeh and C.H. Spiegelman, Partial least squares and classification and regression trees.
Chemom. and Intell. Lab. Systems, 22 (1994) 17-23.
25. A. Sankar and R. Mammone, A fast learning algorithm for tree neural networks. In: Proc. 1990
Conf on Information Sciences and Systems, Princeton, NJ, 1990, pp. 638-642.
26. D.H. Coomans and O. Y. de Vel, Pattern analysis and classification, in J. Einax (ed). The Hand-
book of Environmental Chemistry, Vol. 2, Part G. Springer, 1995, pp. 279-324.
27. S. Wold, Pattern recognition by means of disjoint principal components models. Pattern
Recogn.,8(1976) 127-139.
28. M.P. Derde and D.L. Massart, UNEQ: a disjoint modelling technique for pattern recognition
based on normal distribution. Anal. Chim. Acta, 184 (1986) 33-51.
29. I. Frank and J.M. Friedman, Classification: oldtimers and newcomers. J. Chemom. 3 (1989)
463-475.
30. R. De Maesschalck, A. Candolfi, D.L. Massart and S. Heuerding, Decision criteria for SIMCA
applied to Near Infrared data. Chemom. Intell. Lab. Syst., in prep.
31. H. Van der Voet and P.M. Coenegracht, The evaluation of probabilistic classification methods,
Part 2. Comparison of SIMCA, ALLOC, CLASSY and LDA. Anal. Chim. Acta, 209 (1988)
1-27.
32. D. Coomans, I. Broeckaert, M.P. Derde, A. Tassin, D.L. Massart and S. Wold, Use of a micro-
computer for the definition of multivariate confidence regions in medical diagnosis based on
clinical laboratory profiles. Comp. Biomed. Res., 17 (1984) 1-14.
33. I. Frank, DASCO: a new classification method. Chemom. Intell. Lab. Syst., 4 (1988) 215-222.
241
34. H. Van Der Voet, P.M.J. Coenegracht and J.B. Hemel, New probabilistic versions of the Simca
and Classy classification methods. Part 1. Theoretical description. Anal. Chim. Acta, 192
(1987) 63-75.
35. H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using ker-
nel density estimation. Part 1. Anal. Chim. Acta, 161 (1984), 115-123; Part 2. Anal. Chim.
Acta, 161 (1984) 125-134.
36. M. Forina, G. Drava and G. Contarini, Feature selection and validation of SIMCA models: a
case study with a typical Italian cheese. Analusis 21 (1993) 133-147.
37. A. Candolfi, R. De Maesschalk, D.L. Massart, P.A. Hailey and A.C.E. Harrington, Identifica-
tion of pharmaceutical excipients using NIR spectroscopy and SIMCA. J. Pharm. Biomed.
Anal., in prep.
38. M.A. Dempster, B.F. MacDonald, P.J. GemperHne and N.R. Boyer, A near-infrared reflect-
ance analysis method for the non invasive identification of film-coated and non-film-coated,
blister-packed tablets. Anal. Chim. Acta, 310 (1995) 43-51.
39. D. Scott, W. Dunn and S. Emery, Pattern recognition classification and identification of trace
organic pollutants in ambient air from mass spectra. J. Res. Natl. Bur. Stand., 93 (1988)
281-283.
40. E. Saaksjarvi, M. Khaligi and P. Minkkinen, Waste water pollution modeling in the southern
area of Lake Saimaa, Finland, by the simca pattern recognition method. Chemom. Intell. Lab.
Systems, 7 (1989) 171-180.
41. O.M. Kvalheim, K. 0ygard and O. Grahl-Nielsen, SIMCA multivariate data analysis of blue
mussel components in environmental pollution studies. Anal. Chim. Acta, 150 (1983) 145-152.
42. B.G.J. Massart, O.M. Kvalheim, F.O. Libnau, K.I. Ugland, K. Tjessem and K. Bryne, Projec-
tive ordination by SIMCA: a dynamic strategy for cost-efficient environmental monitoring
around offshore installations. Aquatic Sci., 58 (1996) 121-138.
43. B.K. Lavine, H. Mayfield, P.R. Kromann and A. Faruque, Source identification of under-
ground fuel spills by pattern recognition analysis of high-speed gas chromatograms. Anal.
Chem., 67 (1995) 3846-3852.
44. B. Mertens, M. Thompson and T. Fearn, Principal component outlier detection and SIMCA: a
synthesis. Analyst 119 (1994) 2777-2784.
45. L Stable and S. Wold, Partial least square analysis with cross-vahdation for the two-class prob-
lem: a Monte Carlo study. J. Chemometrics, 1 (1987) 185-196.
46. A.F.M. Nierop, A.C. Tas, J. Van der Greef, Reflected discriminant analysis. Chemom. Intell.
Lab. Syst., 25 (1994) 249-263.
243
Chapter 34
In Chapter 31 we stated that any data matrix can be decomposed into a product
of two other matrices, the score and loading matrix. In some instances another
decomposition is possible, e.g. into a product of a concentration matrix and a
spectrum matrix. These two matrices have a physical meaning. In this chapter we
explain how a loading or a score matrix can be transformed into matrices to which
a physical meaning can be attributed. We introduce the subject with an example
from environmental chemistry and one from liquid chromatography.
Let us suppose that dust particles have been collected in the air above a city and
that the amounts of/? constituents, e.g. Si, Al, Ca,..., Pb have been determined in
these samples.The elemental compositions obtained for n (e.g. 100) samples, taken
over a grid of sampling points, can be arranged in a data matrix X (Fig. 34.1). Each
row of the table represents the elemental composition of one of the samples. A
column represents the amount of one of the elements found in the sample set. Let
us further suppose that there are two main sources of dust in the neighbourhood of
the sampled area, and that the particles originating from each source have a
specific concentration pattern for the elements Si to Pb. These concentration
patterns are described by the vectors Sj and S2. For instance the dust in the air may
originate from a power station and from an incinerator, having each a specific
concentration pattern, sj = [Si^, Al^, Ca^,... PbJ with A: = 1,2.
Obviously, each sample in the sampled area contains particles from each source,
but in a varying proportion. Some of the samples mainly contain particles from the
power station and less from the incinerator. Other samples may contain an equal
amount of particles of each source. In general, one can say that the composition x^
of any sample / of dust is a linear combination of the two source patterns s^ and S2
given by: x^ = c^^ s^ + c^2 S2. In this expression c^ gives the contribution of the first
source and c^2 ^he contribution of the second dust source in sample /. For all n
samples these contributions can be arranged in a nx2 matrix C giving X = CS^
where S is thepx2 matrix of the source patterns. If the concentration patterns of the
244
variables
Si Al Ca .... Pb 12 Si Al Ca .... Pb
1
2
factors
i
FA
measurements concentrations
Si Al Ca .... Pb
V
PC loadings
PCA
PC-scores
elements (Si to Pb) of the two dust sources are known, the relative contribution C
of each source in the samples is estimated by solving the equation X = CS^ by
multiple linear regression (see Chapter 10) provided that the concentration patterns
are sufficiently different and that the number of sources nc is less than or equal to
the number of measured elements p: C = XS(S^S)"^
Matters become more complex when the concentration patterns of the sources
are not known. It becomes even more complicated when the number and the origin
of the potential sources of the dust are unknown. In this case the number of sources
and the concentration patterns of each source have to be estimated from the
measured data table X. This operation is called factor analysis. In the terminology
of factor analysis, the two sources of dust in our example are called factors, and the
concentration patterns of the compounds in each source are calltdfactor loadings.
In Chapters 17 and 31 we explained that a matrix X can be decomposed by SVD
in a product of three matrices: the two matrices of singular vectors U and V and a
diagonal matrix of singular values A such that:
X - U A V^ (34.1)
nxp nxp pxp pxp
245
X = T V^ (34.2)
nxp nxp pxp
where T (= UA) is the matrix of scores proportional to their 'size' and V is the
loading matrix. The columns of V are colloquially denoted as the principal compo-
nents of X. Each row of the original data matrix X can be represented as a linear
combination of all PCs in the loading matrix. The multiplicative factors (or
regressors) of that linear combination for a particular row, x^ of X are given by the
corresponding row t^ in the score matrix: x, = t^ V^.
X can be reconstructed within the measurement error E by taking the first nc
PCs.
X - T * V*T+ E
nxp nxnc ncxp nxp
The symbol * means that the first nc significant columns of V and T are retained.
The number of significant principal components, nc, which is the pseudo-rank of
X, is usually unknown. Methods for estimating the number of components are
discussed in Section 31.5.
In the case of city air pollution caused by two sources of dust, one expects to
find two significant PCs: PCj and PC2, which are the two first rows of V^. One
could conclude that this decomposition yielded the requested information. Each
row of X is represented by a linear combination of two source profiles PCj and
PC2. The question arises then whether these profiles represent the true
concentration profiles of the sources. Unfortunately, the answer is no! In Chapter
29, we explained that PCj and PC2 are obtained under some specific constraints,
i.e., PCi is calculated under the constraint that it describes the maximum variance
of X. The second loading vector PC2 also describes the maximum variance in X but
now in the orthogonal direction to PCj. These constraints are not necessarily valid
for the true factors. It is very improbable that the true source profiles are ortho-
gonal. Therefore the PCs are called abstract factors. The aim of factor analysis is to
transform these abstract factors into real factors. PC A is a purely mathematical
operation, using no other information than that the rows in X can be described by a
linear combination of a number of linearly independent vectors. However factor
analysis usually requires the formulation of additional constraints to find a solution.
These constraints are defined by the characteristics of the system being investi-
gated. In this example, a constraint could be that all concentrations should be
non-negative.
Before going into more detail, a second example is given. Suppose that during
the elution of two compounds from an HPLC, one measures n (=15) UV-visible
246
0.25
(T3
0.15
0.05
220 240 260 280 300 320 340 360 380 400 420
• wavelength (nm)
Fig. 34.2. UV-visible spectra of mixtures offluorantheneand chrysene (see Fig. 34.3 for the pure
spectra).
X = C S^ (34.3)
nx/? ny^nc ncxp
0)
o
c
CD
220 240 260 280 300 320 340 360 380 400 420
wavelength
(nm)
Fig. 34.3. UV-visible spectra of two polyaromatic hydrocarbons (PAHs), fluoranthene and chrysene.
By decomposing the HPLC data matrix of spectra shown in Fig. 34.2 according
to eq. (34.4), a matrix V* is obtained containing the two significant columns of V.
Evidently the loading plots shown in Fig. 34.4 do not represent the two pure
spectra, though each mixture spectrum can be represented as a linear combination
of these two PCs. Therefore, these two PCs are called abstract spectra. Equations
(34.3) and (34.4) show a decomposition of the data matrix X in two ways: the first
is a decomposition in real factors, a product of a matrix S^ of the spectra with a
matrix C of concentration profiles, and the second is a decomposition in abstract
factors T* and V*^. By factor analysis one transforms V*^ in eq. (34.4) into S^ in eq.
(34.3). The score matrix T* gives the location of the spectra in the space defined by
the two principal components. Figure 34.5 shows a scores plot thus obtained with a
clear structure (curve). The cause of this structure is explained in Section 34.2.1.
248
• — • — • — • — f
20 240 260 280 300 320 340 360 380 400 420
wavelength (nm)
Fig. 34.4. The two first principal components of the data matrix of the spectra given in Fig. 34.2.
scores on PC2
Fig. 34.5. Score plot (PCi score vs PC2 score) of the mixture spectra given in Fig. 34.2.
249
The decomposition into elution profiles and spectra may also be represented as:
X^ = S C^
pxn px2 2xn
XT ^ p* Q*T + E
pxn pxnc ncxn pxn
where the rows in Q*^ now represent abstract elution profiles. It should be noted
that the score matrix P* has another meaning than T* in eq. (34.4). It represents
here the location of the chromatograms in the factor space Q*. It should also be
noted that Q is equivalent to U in eq. (34.1) and that P is equivalent to AV^ in eq.
(34.1). Here, too, one wants to transform abstract elution profiles Q*^ into real
elution profiles C^ by factor analysis. The result of a PC A carried out in this
retention time space is given in Fig. 34.6. The two first PCs clearly have the
appearance of elution profiles, but are not the true elution profiles. Because elution
profiles have a much smoother appearance than spectra, which may have a very
irregular form, abstract elution profiles are sometimes easier to interpret than
abstract spectra. For instance, one can easily derive the positions of the peak
maxima and also distinguish significant PCs from those which represent noise.
The scores plot (Fig. 34.7) in which the chromatograms are plotted in the space
defined by the wavelengths is less easily interpreted than the corresponding scores
plot of the spectra shown in Fig. 34.5. The plot in Fig. 34.7 does not reveal any
structure because the consecutive chromatograms in X^ follow the irregular pattern
of the absorptivity coefficients as a function of the wavelength. Therefore, if the
aim of factor analysis is to transform PCs into real factors (by one of the methods
explained in this chapter) one prefers the retention time space, because this yields
loadings which are the easiest to interpret. On the other hand, if the aim of the
analysis is to detect structure in the scores plot, the wavelength space is preferred.
In this space regions of pure spectra (selective parts of the chromatograms), or
regions where only binary mixtures are present, are more easily detected. The
above considerations form the basis of the HELP procedure explained in Section
34.3.3.
In some cases a principal components analysis of a spectroscopic- chromato-
graphic data-set detects only one significant PC. This indicates that only one
chemical species is present and that the chromatographic peak is pure. However,
by the presence of noise and artifacts, such as a drifting baseline or a nonlinear
response, conclusions on peak purity may be wrong. Because the peak purity
assessment is the first step in the detection and identification of an impurity by
factor analysis, we give some attention to this subject in this chapter.
250
13 17 21 25 29
retention time
Fig. 34.6. The two first principal components of a LC-DAD data set in the retention time space.
scores on PC2
• 0.1
•
0.05 - •
. •• • • •
0 t
*• • •
•
-0.05 • •
•
•
-0.1
•
1 1 . _- 1
-0.15
0.2 0.4 0.6 0.8
scores on PC1
Pi
Fig. 34.7. Score plot (PCi score vs PC2 score) of chromatograms in the retention time space.
251
Basically, we make a distinction between methods which are carried out in the
space defined by the original variables (Section 34.4) or in the space defined by the
principal components. A second distinction we can make is between full-rank
methods (Section 34.2), which consider the whole matrix X, and evolutionary
methods (Section 34.3) which analyse successive sub-matrices of X, taking into
account the fact that the rows of X follow a certain order. A third distinction we
make is between general methods of factor analysis which are applicable to any
data matrix X, and specific methods which make use of specific properties of the
pure factors.
Before going into detail about various specific methods to estimate pure factors,
we qualitatively describe how pure factors can be derived from the principal
components. This is illustrated with a data matrix of the two-component HPLC
example discussed in the previous section. When discussing the scores plot (Fig.
34.5) we mentioned that the scores showed some structure. From the origin, two
straight lines depart which are connected with a curved line. In Chapter 17 we
explained that these straight lines coincide with the pure spectra present in the pure
elution time zones. The distance from the origin is a measure for the 'size' of the
spectrum. The curved part represents the zone where two compounds co-elute.
Going through the curve starting from the origin we find pure spectra of compound
1 in increasing concentrations, then mixtures of compounds 1 and 2, followed by
the spectra of compound 2 in decreasing concentration, back to the origin. The
angle between the two lines is a measure for the correlation between the two pure
spectra. If the spectra are uncorrelated (i.e. very dissimilar) the two lines are
orthogonal. At high correlations the angle becomes very small. The two lines
define the directions of the pure factors in the PC1-PC2 space. In this simplified
situation no factor analysis is needed to find the pure factors. From the scores plot
we also observe that the pure factors have an angle aj and a2 respectively with PCj
and PC2. Conversely, one could also find pure factors by rotating PCj over an angle
a^ and PC2 over an angle a2. Finding these angles is the purpose of factor analysis.
When pure spectra are present in the data set, finding these angles is quite
straightforward. Therefore, several factor analysis methods aim at finding the
purest rows (spectra) or purest columns (wavelengths) in the data set, which is
discussed in Section 34.4. When no pure row or column is available for one of the
factors, we cannot directly derive the rotation angles from the scores plot, because
the straight line segments are missing. In this case we need to make some
252
assumptions about the pure factors in order to estimate the rotation angles.
Obvious assumptions in chromatography are the non-negativity of the absorbances
and the concentration profiles. If no constraints can be formulated to estimate the
rotation angles, one must rely on abstract rotation procedures. An example of this
type of rotation is the Varimax method of Kaiser [1] which is explained in Section
34.2.3.
A row of a data matrix can be interpreted as a point in the space defined by its
column variables.
yi
n yn
For instance, the first row of the matrix X defines a point with the coordinates (JCJ,
3^i) in the space defined by the two orthogonal axes x^ = (1 0) and y^ = (0 1). Factor
rotation means that one rotates the original axes x^ = (1 0) and y^ = (0 1) over a
certain angle 9. With orthogonal rotation in two-dimensional space both axes are
rotated over the same angle. The distance between the points remains unchanged.
Fig. 34.8. Orthogonal rotation of the (jcj) axes into (/:,/) axes.
253
Rotated axes are characterized by their position in the original space, given by the
vectors k^ = [cos9 -sinG] and F = [sinG cosG] (see Fig. 34.8). In PCA or FA, these
axes fulfil specific constraints (see Chapter 17). For instance, in PCA the direction
of k is the direction of the maximum variance of all points projected on this axis. A
possible constraint in FA is maximum simplicity of k, which is explained in
Section 34.2.3. The new axes (k,I) define another basis of the same space. The
position of the vector [x- y-] is now [k^ /•] relative to these axes.
Factor rotation involves the calculation of [k^ /J from [x^ y^], given a rotation
angle with respect to the original axes. Suppose that after rotation the matrix X is
transformed into a matrix F:
'k, h'
2 h
zn
.K K.
Then the following relationship exists between [k^ /J and [x^ y^
k^ = x^ cosG - y^ sinG
li = x^ sinG + jjCosG
or in matrix notation (for all vectors /)
'k, h' Xi y
k2 k ^2 yi cosG sinG
=
-sinG cosG
.n K _ _^n yn\
which gives:
F = XR (34.4)
witr
cosG sinG
R—
iv = - s i r iG cosG
Columns and rows of R are orthogonal with a norm equal to one. Therefore, R
defines a rotation, for which RR^ = R^ R = I.
254
The aim of factor analysis is to calculate a rotation matrix R which rotates the
abstract factors (V) (principal components) into interpretable factors. The various
algorithms for factor analysis differ in the criterion to calculate the rotation matrix
R. Two classes of rotation methods can be distinguished: (i) rotation procedures
based on general criteria which are not specific for the domain of the data and (ii)
rotation procedures which use specific properties of the factors (e.g. non-negativity).
In the previous section we have seen that axes defined by the column variables
can be rotated. It is also possible to rotate the principal components. Instead of
rotating the axes which define the column space of X, we rotate here the significant
PCs in the sub-space defined by V*^:
F =V*'r R (34.5)
ncxp ncxp pxp
The columns of V* are the abstract factors of X which should be rotated into real
factors. The matrix V*^ is rotated by means of an orthogonal rotation matrix R, so
that the resulting matrix F = V*^ R fulfils a given criterion. The criterion in
Varimax rotation is that the rows of F obtain maximal simplicity, which is usually
denoted as the requirement that F has a maximum row simplicity. The idea behind
this criterion is that 'real' factors should be easily interpretable which is the case
when the loadings of the factor are grouped over only a few variables. For instance
the vector f^ = [ 0 0 0 0.5 0.8 0.33] may be easier to interpret than the vector f J =
[0.1 0.3 0.1 0.4 0.4 0.75]. It is more likely that the simple vector is a pure factor
than the less simple one. Returning to the air pollution example, the simple vector
f J may represent the concentration profile of one of the pollution sources which
mainly contains the three last constituents.
A measure for the simplicity of a vector f, is the variance of the squares of its p
elements, which should be maximized [2]:
1 ^ —
Simp = var(f,2 ) = _ V (f 2 ^ f 2 )2
P i=i
To illustrate
3 vectors the concept
f, and fj: of simplicity of a vector we calculate the simplicity of
f f f.^ f2
*2
0 0.1 0 0.01
0 0.3 0 0.09
0 0.1 0 0.01
0.5 0.4 0.25 0.16
255
The simplicity of a matrix is the sum of the simplicities of its rows.Thus the
simplicity of a matrix F with the rows fj^ and fj as given before is 0.088. fj and £2
are called varivectors.
By a varimax rotation the matrix V*^ is rotated by means of an orthogonal
rotation matrix R, so that the simplicity of the resulting matrix F = V*^ R is
maximal. Several algorithms have been proposed [2] for the calculation of R. Let
us suppose that a PC A of X with four variables yields the following two significant
principal components: PCi! [0.1023 -0.767 0.153 0.614] and PC2: [0.438 0.501
0.626 0.407]. The simplicity of the matrix with the rows PC^ and PC2 is equal to
0.0676. A varimax rotation of this matrix yields the varivectors f^ = [-0.167
-0.916 -0.233 0.269] and f2^ =[0.417 -0.030 0.601 0.685] at an angle of 35
degrees with the PCs, and a corresponding simplicity equal to 0.1486. One can see
that varivector f^ is mainly directed along variable 2, which is probably a pure
factor. The other variables (1,3 and 4) load high on varivector ^2- Therefore, they
belong to the second pure factor. Anyway, because the varivectors are simpler than
the original PCs one can safely conclude that they resemble better the pure factors.
Several applications of varimax rotation in analytical chemistry have been
reported. As an example the varimax rotation is applied on the HPLC data table of
1 3 5 7 9 11 13 15
Retention time
PAHs introduced in Section 34.1. A PCA applied on the transpose of this data
matrix yields abstract 'chromatograms' which are not the pure elution profiles.
These PCs are not simple as they show several minima and/or maxima coinciding
with the positions of the pure elution profiles (see Fig. 34.6). By a varimax rotation
it is possible to transform these PCs into vectors with a larger simplicity (grouped
variables and other variables near to zero). When the chromatographic resolution
is fairly good, these simple vectors coincide with the pure factors, here the elution
profiles of the species in the mixture (see Fig. 34.9). Several variants of the
varimax rotation, which differ in the way the rotated vectors are normalized, have
been reviewed by Forina et al. [2].
In the previous section, factors were searched by rotating the abstract factors
into pure factors, obeying a number of constraints. In some cases, however, one
may have a collection of candidate pure factors, e.g., a set of UV-visible or Mass
spectra of chemical compounds. Having measured a data matrix of mixture spectra
one could investigate whether compounds present in the mixture match with
compounds available in the data base of pure spectra. In that situation one could
first estimate the pure spectra from the mixture spectra and thereafter compare the
obtained spectra with the spectra in the data base. Alternatively, one could identify
the pure spectra by solving equation X = C S^ for C where S is the set of candidate
spectra to be tested and X contains the mixture spectra. All non-zero rows of C
indicate the presence of the spectrum in the corresponding row in S. This method
fails when S does not contain all pure spectra present in the mixtures. Moreover,
this procedure becomes unpractical when the number of candidate spectra is very
large and the whole data base has to be checked. Furthermore, when the spectra to
be checked are quite similar, the calculation of (S^S)~^ becomes unstable, leading
to large errors in C and the indication of wrong spectra [3].
By target transformation factor analysis (TTFA), each candidate spectrum,
called target is tested individually on its presence in the mixtures [4,5]. Here,
targets are tested in the space defined by the significant principal components of
the data matrix. Therefore, TTFA begins with a PCA of the data matrix X of the
measured spectra. The principle of TTFA can be explained in an algebraic as well
as geometrical way. We start with the algebraic approach. Because according to
eq. (34.4) any row of X can be written as
spectrum taken from the library can be tested on this property. If the test passes, the
spectrum or target may be one of the pure factors. How this is done is explained
below.
The first step is to calculate the scores t i„ of the target spectrum, in, to be tested
by solving equation:
givmg
t;„ =inV*(V*T V*)-^ in V
These scores give the linear combination of the PCs that provides the best
estimation (in a least squares sense) of the target spectrum. How good that
estimation is can be evaluated by calculating the sum of squares of the residuals
between the re-estimated target or output target (from its scores) and the input
target. The output target out is equal to t •„ V*^. The overall expression for TTFA
therefore becomes:
out = inV*V*T (34.6)
If the difference between out and in (||out - in||) can be explained by the variance
of the noise, the test passes and the target is possibly one of the pure factors.
9 11 13 15 17 19 21 23 25 27 29
Retention time
Fig. 34.10. Simulated LC-DAD data-set of the separation of three PAHs (spectra 4, 5 and 6 in Fig.
34.11) (individual profiles and the sum).
258
C ^'•
^ 0>
<D ON
1^
§ CO
h ! ON
^§
^^
- ^ ON
^^
cxo
*r: ^
W OS
s^
3°
"^ IT)
c a-
o :3
&^
a ^
OS CU
<u c
C/3
Xi
o C/5
00
cd O
^-1
O (^
(1)
o
<j J3
259
Otherwise, the test fails. In principle there are no limitations on the number of
compounds in the mixture.
Let us illustrate the method with a LC-UV-visible data set of three overlapping
peaks (Fig. 34.10). The TFA results obtained with six candidate spectra, including
the pure spectra are given in Fig. 34.11. The assignment of the right spectra by an
evaluation of the correlation coefficients between input and output target is
obvious.
Strasters et al. [6] applied this method to data sets obtained by HPLC-DAD and
concluded that the pure factors can be very well recovered even at very low
chromatographic resolutions (R^ < 0.02). A critical step in PCA-based methods is
the determination of the number of significant PCs, which indicates the number of
pure factors which have to be searched.
A geometrical representation of target FA is as follows. Suppose a data set with
spectra of a two-component system, measured at p wavelengths. All spectra are
located on a plane defined by the two PCs, PCj and PC2 (see Fig. 34.12 where/? = 3).
An input target in is a vector in the originalp-dimensional space. A target (e.g. inj)
which lies in the plane PC1-PC2 is a linear combination of the two eigenvectors.
Other targets (e.g. in2) are not. Whether a target lies in the plane defined by the PCs
is tested by projecting the target on the plane (or space) defined by the eigen-
vectors. For a target which lies in the plane defined by the eigenvectors the input
Fig. 34.12. Geometrical representation of TTFA: target ini is accepted; target in2 is rejected.
260
and output targets are the same. For any other target (e.g. in2), the projected target
out2 differs from the original one. If that difference is significantly greater than the
noise, then this target is not one of the factors. Figure 34.12 also shows that the
projected vector out2 is obtained by rotating one of the PCs. It also illustrates that
another way to express the difference between the input target and output target is
the angle between these two vectors. This angle is related to the correlation
coefficient between the two vectors (Chapter 9).
scores on PC2
0.2
0.1
-0.1
-0.2
-0.3
Fig. 34.13. Score plot (ti vs 12) of the spectra given in Fig. 34.2 after normalisation. The points A and B
are the purest spectra in the data set. The points A' and B' are the spectra at the boundaries of the
non-negativity constraint.
zones). The curved line segment corresponds with the mixture spectra. When no
pure spectra are present, no straight lines are observed making it very difficult to
extract the pure spectra from this scores plot. Such plots are the basis of the
HELP-method to assess peak purity in liquid chromatography, which is discussed
in Section 34.3.3. When the spectra are normalized to a sum equal to one, the
scores plot given in Fig. 34.5 becomes much simpler. All spectra are situated on a
straight line (see Fig. 34.13). This is caused by the fact that by normalization of the
rows the dimensionality of the data is reduced by one (because the spectra are not
mean centred, we still find two significant PCs). The spectra on this line are
arranged in a certain order. At both ends are the two spectra with the smallest
contamination from the other compound. These are the purest spectrum (A) and
the purest spectrum (B). In fact, these two spectra are the two purest rows in X
which are the best available estimators of the two pure spectra. These estimates can
be improved. First we know that the normalized pure spectra are located on the line
passing through the line segment (A-B) of the mixture spectra. Thus the pure
spectra should be found somewhere on the extrapolated ends of the line. The
question arises how far one should extrapolate. At this point one needs constraints
to bound the solution. The first constraint is that the absorbances are non-negative.
At a given critical point (say AO located on the line, this constraint may be
violated. This point has to be found. Let us explain how this is done with an
example.
262
Suppose that 15 spectra have been measured at 20 wavelengths and that after
normalization on a sum equal to one they are compiled in a data matrix. Suppose
also that by PCA the following score and loading matrices are obtained:
.372 .160
.609 .391
.148 -.208
.340 .073
.218 -.450
.339 .071
.314 -.583
.339 .068
.402 -.070
.338 .064
.088 -.230
.336 .058
.101 -.307
.333 .049
.058 -.063
.330 .035
.101 -J005
7= .324 .027 v= (34.7)
.100 J023
.317 -.008
.206 .144
.309 -.038
.139 .110
.300 -.072
.239 .212
.289 -.107
.016 -.007
.279 -.140
.010 -.0108
.271 -.169
.0035 .0006
.265 -.191
.002 -.0009
.0004 -.001
0 0
The scores and the loadings are plotted in Figs. 34.13 and 34.4. The fact that all
absorbances of the pure spectra Sj and S2 should be non-negative means that the
linear combination of the two PCs, s = x^y^ + T2V2 should be non-negative for each
element 5^ of s, given by:
Because the elements of v, all have equal sign, contrary to the elements of V2
which have mixed signs, x^ /Xj values satisfying eq. (34.8) are found on the interval
[m,n], where m is equal to
263
and n is equal to
minP^ (V2^<0)
with Tj > 0.
The boundaries, m and n, define two lines in the (Vj, V2) plane: the line Xj = mi2
and the line T^ = nT2. The intersection of these two lines with the line defined by the
mixture spectra gives two points A' and B' which are the estimates of the pure
spectra. The intervals (A-AO and (B-BO define the solution bands between which
the pure spectra are situated.
For the example discussed here, the points A' and B' are found as follows. First,
we calculate m and n by dividing all elements of Vj by the corresponding elements
of ¥2, which gives:
S2 = 0.3620 Vi+0.1187 V2
which are plotted in Fig. 34.14.
It should be borne in mind that the two spectra given in Fig. 34.14 are estimates
of the pure spectra, which exactly fulfil the constraints. Two other estimates of the
pure spectra are the purest mixture spectra A and B. When plotted in the same
figure (see Figs. 34.15 and 34.16) a good impression is obtained of the remaining
264
-0.05
220 240 260 280 300 320 340 360 380 400 420
• wavelength (nm)
Fig. 34.14. Estimated pure spectra at the points A' and B', with respectively the scores ^A'I , /A'2 and t^'i,
^B'2-
220 240 260 280 300 320 340 360 380 400 420
• wavelength (nm)
uncertainty (or solution bands) in the spectra. The widths of the solution bands
depend on the similarity of the spectra, and on the concentration ratios spanned by
the mixtures. In HPLC with DAD these concentration ratios depend on the
265
220 240 260 280 300 320 340 360 380 400 420
• wavelength (nm)
(^A'l»^A'2) ^^d (^B'l»^B'2) ^^^ the scores of the two spectra meeting the non-negativ-
ity criterion. Equally, each mixture spectrum is a linear combination of the two
first principal components:
^i ~ h\ ^1 "•" hi ^2
Therefore, for all spectra / = 1,...,« the scores t^^ and t^2 are given by:
U\ ~^i\ ^A'l "^ ^i2 ^B'l ^^^Ul —^i\ ^A'2 "•" ^/2 ^B'2
By solving these two equations for c^ and c^2 ^^^ by applying the condition that
c^j > 0 and c,2 > 0 we find [7] that the scores of the first pure spectrum (^^^ and ^^^^2)
should fulfil the condition:
266
scores on PC2
0.1
n-line
0.05
-0.05
-0.1
-0.15
m-line
Fig. 34.17. Non-negativity constraints (m- and w-line) applied on the scores of the normalized
chromatograms.
'A'2
> max hi. m
^'1 til
and that the scores of the second pure spectrum (rg/i and ^3^2) should fulfil the
condition:
^B'2 'i2
< min =n
^B'l
If these conditions are not fulfilled by the scores (r^/j, ^^^2) ^^^ (^B'I' ^B'2) ^he pure
spectra should be adapted by taking the intersections of the line defined by the
mixture spectra and the lines T2 = m'T^ and T2 = nx^. In our example, the above
conditions are met and therefore the solution bands are not affected.
We can also estimate the pure column factors (here elution profiles). Therefore,
we take the transpose of X (rows are chromatograms), normalize all rows to a sum
equal to one, and calculate the PCs. These PCs represent abstract chromatograms
as is shown in Fig. 34.6. The scores of the chromatograms are again situated on a
straight line (Fig. 34.17). The purest elution profiles are at the two ends of the line
segment. By applying eq. (34.9) one obtains the m- and n-lines defined by the
constraint that the elution profiles should be non-negative (see Fig. 34.17):
267
0.15
0.05
29 33 37
retention time
T2 =-0.5103 Xi(m-line)
T2 = 0.1909 Ti(n-line)
The intersection of these two lines with the line defined by the mixture chromato-
grams gives the scores of the 'pure' elution profiles (0.360, 0.070) and (0.325,
-0.166) resulting in the chromatograms given in Fig. 34.18. Knowing the pure
elution profiles, the spectra are easily calculated by solving eq. (34.3).
Having derived a solution for two-component systems, we could try and extend
this solution to three-component systems. A PC A of a data set of spectra of
three-component mixtures yields three significant eigenvectors and a score matrix
with three scores for each spectrum. Therefore, the spectra are located in a
three-dimensional space defined by the eigenvectors. For the same reason, exp-
lained for the two-component system, by normalization, the ternary spectra are
found on a surface with one dimension less than the number of compounds, in this
case, a plane.
If the ternary mixtures span the whole concentration range, i.e. from 0 to 100%
for each component, some of the mixtures contains only one or two compounds.
As explained before, the binary mixtures should be found on straight lines. In this
268
PureB it
B+C
A+B • A+B+C
Puree
# • • • • • • •
A+C
Pure A
Fig. 34.19. Three-component spectra (normalized) projected in the space defined by the two first PCs.
case, three straight lines should be found representing the binary mixtures of
compounds (A,B), (A,C) and (B,C), respectively. These three lines define a
triangle. Ternary mixtures are located inside the triangle. The comers of the
triangle represent the pure spectra, and on each side are the binary mixtures (see
Fig. 34.19). The problem of finding the pure spectra (or pure factors) in a
three-component (or three-factor) system is to first estimate the position of the
lines on which the binary spectra are situated, followed by an estimate of the
intersections (comer points). If no spectra of binary mixtures are in the data set,
these lines are not directly observed and should be estimated. In a similar way as
for the two-component case, these lines and comer points have to be found by
applying constraints on the spectra (non-negativity of the absorbances) and on the
resulting concentrations. Details of this method are found in Ref. [9]. Clearly, the
factor analysis of three-component systems by curve resolution is much more
complex than it is for the two-component case. In fact the generalisation of the
method to systems with more than three components appears to be impractical.
Fig. 34.20. ITTFA: projection of the input vector ini in the PC-space gives outi. A new input target in2
is obtained by adapting outi to specific constraints. in2 projected in the PC-space gives out2.
that, in the absence of good candidate targets to be tested by TTFA, one defines an
initial target, which is gradually improved until the TTFA test passes.
The first step of ITTFA is to project any input target on the eigenvector space
(Fig. 34.20) defined by the data matrix of the mixtures, for example inj into outj.
Next, the projected vector is inspected on its likelihood to be a pure factor. By
examining outj, we find that some loadings do not fulfil specific requirements, e.g.
if the factor should represent an elution profile, then all values should be non-
negative and no shoulders or secondary maxima should be present. Instead of
rotating outj in the PC1-PC2 plane, we adapt the loadings of this vector in such a
way that it obeys all imposed constraints; i.e. all negative values are replaced by
zeros and, in the case of chromatography for instance, all secondary peaks are cut
away. This step is schematically shown in Fig. 34.21. The consequence of this
adaptation is that the vector outj is lifted from the plane and rotated over a small
angle, giving in2. By this procedure, we hope to obtain a result that is closer to one
of the true factors, than inj was.
In the following step the vector in2 is considered as a new target, which in its
turn is submitted to a TTFA. As a result, a second projected output target out2 is
obtained. The overall result is a rotation of outj into out2 as is schematically shown
in Fig. 34.20. The procedure of inspection and adaptation of the loadings is then
repeated, until the difference between output and input targets converges to zero,
indicating that a stable solution is obtained. This solution is one of the pure factors.
To find a second factor, the procedure is repeated with a new initial target. The
main condition for success is that good constraints can be formulated for adapting
the targets and that good initial targets can be found.
270
retention time
retention time
of the elution profiles. The other one also has a smooth shape, but with negative
values because it is orthogonal to the first one.
(2) Several procedures have been proposed to select a target. The first one is the
so-called needle search [13,14,16]. This procedure starts the ITTFA with as many
needle targets (i.e., all elements of the target are zero except one which has a value
equal to one) as there are variables. Each needle is projected into the space defined
by the eigenvectors once, and the correlation between input and output target is
evaluated. The nc (number of significant eigenvectors) best targets are retained to
start the iterations. A second procedure is to apply first a VARIMAX rotation on
the eigenvectors (see Section 34.2.3). Because elution profiles are present in
specific windows of the data matrix and because they are almost orthogonal, a
VARIMAX yields good initial profiles for each factor, indicating at least the
position of the peaks.
(3) Targets are adapted by replacing all negative values by zero. Furthermore,
all secondary peaks or shoulders are removed. Some typical examples are given in
Fig. 34.21.
Because all elution profiles are normalized to a sum equal to one, the resulting
factors are also normalized. The overall result of ITTFA is therefore a set of peaks
with equal peak area or height. Depending on the kind of information wanted,
either the spectra or elution profiles, ITTFA is concluded by a terminating step.
The spectra are obtained by solving eq. (34.3), where X is the original and
unnormalized data set and C contains the normalized elution profiles. After the
spectra are obtained, we can denormalize the elution profiles by dividing them by
the norm of the spectra. This is easily seen from the fact that X = C K S, where C
and S are respectively the elution profiles and spectra, normalized to a sum equal to
one, and K is a diagonal matrix with on its diagonal the norms of the spectra. From
C and X we obtain KS. The lengths of the spectra KS are equal to the diagonal
elements of K. Multiplication of C by K in its turn gives CK, the denormalized
elution profiles.
An example of the results obtained by ITTFA is given in Figs. 34.22 and 34.23.
The solid lines in the two figures are the estimated spectra and elution profiles. The
dotted lines are the true spectra and profiles The ability to estimate spectra and
elution profiles by ITTFA depends on the similarity of the spectra, the overlap of
the elution profiles and the relative intensity of the signal. By simulation, Strasters
et al. [6] evaluated the quality of the spectra as a function of the peak overlap and
spectrum similarity. The quality of the spectra was expressed as the correlation
coefficient between the true and estimated spectrum. As can be seen from Fig.
34.24, spectra of good quality are obtained for a chromatographic resolution
/?, > 0.45. The quality of the elution profiles, expressed as the accuracy of the peak
areas, has been evaluated by Vandeginste et al. [11].
272
a
O4
o
o
:3
o
CO
a.
- CO
(D
^
o 0
•A. (D
-
• * — '
(• CO T3
Vm O cd
\ 'w- Q^
M CO "^ f0
••
O 0
CN
CO
<
O
O
CO ^
O
00
CM 13
CD
O 03
CD
CM
C/3
O (U
cd
-^
CM V-i
^*1l« O
CM
CO
CO m CM in O
CN
o o o (N
d d
CO
PLH
273
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29
retention time
Fig. 34.23. Individual elution profiles and the total profile estimated by ITTFA compared to the true
profiles.
spectral
similarity
0.3 0.4
resolution
Fig. 34.24. Quality (correlation coefficient between true and estimated spectra) of the spectra, esti-
mated by ITTFA depending on similarity of the spectra and the chromatographic resolution.
274
One of the early applications of ITTFA was developed by Hopke et al. in the
area of geochemistry [12]. By measuring the elemental composition in the crossing
section of two lava beds, Hopke was able to derive the elemental composition of
the two lava bed sources. This application is very similar to the air dust case given
in the introduction of this chapter (see Section 34.1).
Methods for evolutionary rank analysis are explained and discussed in this
section. The different approaches of evolutionary rank analysis have in common
that the two-way data structure is analyzed piece-wise to locally reveal the
presence of the analytes. A reference to the review of evolutionary methods by
Toft et al. is included in the additional recommended reading list at the end of this
chapter.
tl 12_ t4 15_ t6
2 3 4 4 4
4 4 4 3 2
J-L
Fig. 34.25. Time windows in which compounds are present in a composite 4-component peak, with
the ranks of the data matrices formed by adding rows to a top matrix To (top down) or to a bottom ma-
trix B() (bottom up).
20
r e t e n t i o n time
to
i->
c
=}
/Axx
Q
u
5
•••1 1
a •• •• h .. Is s
3
'['••• •••1
«0 "••••• 1 ••• 1
> • • • . '
• .
cr f-. I'-.
a»
O)
••••.. 1 ••• 1 ••.
1
a; •-, 1 •••. 1
1 •• 1
o •• 1
. i . 1 '
20 40
r e t e n t i o n time
<v
Z3
to
c>
0)
en
"•*
<u
^1 ^2 s s H S k
1
1 - - - ;
1
' \ " 1 1 1
1 I 1 , 1 _ ,..,J
20 40
retention time
Fig. 34.28. Reconstructed concentration profiles from the combination of Figs. 34.26 and 34.27.
II
T R ^1 1
EZ]
1
I ^1
10
Ci
-pO
T i r;
Fig. 34.29. Construction of a sub-matrix containing only the zero rows of c,.
profiles in equilibria studies. One could consider these bell-shaped profiles of the
eigenvalues as being a first rough estimate of the concentration profiles C in eq.
(34.3). It is now possible to calculate the pure spectra S in an iterative way by
278
solving the eq. (34.3) first for S (by a least-squares method). Because C is a first
approximation, the spectra have negative values, v^hich can be set to zero. In a next
step the corrected spectra are used to solve eq. (34.3) for C. These steps are
repeated until S and C are stable. This is the principle of alternating regression, also
called alternating least squares, introduced by Karjalainen [21]. This iterative
resolution method has been successfully applied to many different analytical data,
including chromatographic examples (see recommended additional reading).
An alternative and faster method estimates the pure spectra in a single step. The
compound windows derived from an EPCA, are used to calculate a rotation matrix
R by which the PCs are transformed into the pure spectra: X = C S^ = T* R R"* V*^.
Consequently, C = T* R, where T* is the score matrix of X. Focusing on a
particular component, c^, (ith column of the matrix C) one can write c^ = T* r^,
where r, is the /th column of R. Because compound / is not present in the shaded
areas of c, (see Fig. 34.29), the values of c^ in these areas are zero. This allows us to
calculate the rotation vector r^. Therefore, all zero rows of c^ are combined into a
new column vector, cf, and the corresponding rows of T* are combined into a new
matrix T*^. As a result one obtains:
When the rows of a data matrix follow a certain pattern, e.g. the appearance and
disappearance of compounds as a function of time, a. fixed-size window EFA is
applicable. This is the case, for instance, for data sets generated by hyphenated
measurement techniques such as HPLC with DAD. Fixed-size window EFA [22]
can be applied for detecting the presence of minor compounds (< 1 %) and for the
resolution of a data set into its components (pure spectra and elution profiles).
279
0) a>
a>
E
E E
'43
3
2
1
0
1
logEY 2
<J
-4
-5
-6
7
10 20 30 40 50 60
Fig. 34.31. Eigenvalue plot obtained by fixed-size moving window FA for a pure peak.
Contrary to EFA which calculates a PC A of a sub data matrix to which rows are
added, in fixed-size window EFA a small window of rows is selected which is
moved over the data set (see Fig. 34.30). Typically, a window of seven consecutive
spectra is used. At each new position of the window a PCA is calculated and the
eigenvalues associated with each PC are recalculated and are plotted as a function
of the position of the window. This yields a number of eigenvalue-lines. Figure
34.31 shows the eigenvalue-lines obtained for a simulated pure LC-DAD peak. In
the baseline zones (null spectra) all eigenvalue-lines are noisy horizontal lines. In
the selective retention time regions (one component present) the eigenvalue-line
associated with the first PC follows the appearance and disappearance of the
280
%10
3
Q.
E detectable
0.5
0.3 not detectable
0.2
0.1
0.2 0.4 0.6 0.8 1
— • resolution
Fig. 34.32. Detectability of a minor compound by fixed-size window evolving factor analysis.
compound. The other eigenvalue-lines remain noisy horizontal lines. From the
variations observed for eigenvalue-lines representing noise one can derive a
critical value to decide on the significance of an eigenvalue. Figure 34.32 demon-
strates the capabilities of this method for the detection of a minor compound which
is partly separated from the main constituent. Impurities as low as 1% are detected
when chromatographic resolution > 0.3.
Eigenstmcture tracking (ETA), developed by Toft and Kvalheim [23] is very
similar to FSWEFA. Here, one starts initially with a window of size 2 and the size
is increased by one until a singular value representing noise is obtained, i.e. the
window size is one higher than the number of co-eluting compounds.
Heuristic evolving latent projection [24], which was introduced in Chapter 17,
concentrates on the evaluation of the results of a local principal components
analysis (LPCA), performed on a part of the data matrix. Both, the scores and
loadings, obtained by a local PCA are inspected for the presence of structures
which indicate selective retention time regions (pure spectra) and selective wave-
lengths. Because the data are not pre-treated (normalization or any other type of
pretreatment) the score plots obtained are very similar to the scores plots previously
discussed in Section 34.2.1. Straight lines departing from the origin in the scores
plot indicate pure spectra. HELP can be conducted in the wavelength space and in
the retention time space. Purity checks and the search for selective retention time
281
1.277
U 0.000
0 O
-1.277 J
1 J
Fig. 34.33. Simulated three-component system: (a) spectra (b) elution profiles and (c) scores plot ob-
tained by a global PCA with HELP.
regions are usually conducted in the wavelength space. Selective wavelengths are
found in the retention time space, and if present yield directly the pure elution
profiles.
The method also provides what is called a data-scope, which zooms in on a
particular part of the data set. The functioning of the data-scope is illustrated with a
simulated three-component system given in Figs. 34.33a and b. The scores plot
(Fig. 34.33c) obtained by a global PCA in wavelength space shows the usual line
structures. In this case the data-scope technique is applied to evaluate the purity of
the up-slope and down-slope elution zones of the peak. Therefore, data-scope
performs a local PCA on the up-slope and down-slope regions of the data.
In Fig. 34.34 the first three local PCs with the associated eigenvalues are given
for four retention time regions: (a) a zero component region (Fig. 34.34a), (b)
selective chromatographic regions at the up-slope side (Fig. 34.34b), (c) and at the
282
1.277
LM^W^t/^
imnnniiiimiimn»imimiinimiiiBtniiuiimwiimr
\A/W\Mf\NM
nfmiiiini|uiuiuniinmniiJtmtiiwinHHiBimuBiil
k Y ^ WH^^V^Ayj
WAVELENGTH
(a)
Fig. 34.34. The three first principal components obtained by a local PC A: (a) zero component region,
(b) up-slope selective region, (c) down-slope selective region (d) three-component region. The spectra
included in the local PCA are indicated in the score plot and in the chromatogram.
down-slope side (Fig. 34.34c), and a three-component region in the centre of the
peak profile (Fig. 34.34d). The selected regions of the data set are indicated in the
scores plot (wavelength space) and in the chromatogram. The loading patterns of
the three PCs obtained for the suspected selective chromatographic regions,
clearly show structure in the first principal component (Fig. 34.34 b,c). The
significance of the second principal component should be confirmed by separating
the variation originating from this component from the variation due to
experimental or instrumental errors. This is done by comparing the second eigen-
value in the suspected selective region and the first singular value in the zero-
component region (see Table 34.1). In the first instance, the rank of the
investigated local region was established by means of an F-test. For the up-slope
283
1.277
0.000
1 0 o
-1.277
-3.438
i
0.000
J
3.438
RETENTION TIME
PC1
nimHi«uui|«mminmnmninmnniininBmiiiT
\r
Bmn|iBwmmnwii|Bnnii|iiuii.|Bmiiuniinnnuii
mmi)mminiiiHmiiiiiiiniinu|iiiimnnuimmul
WAVELENGTH (b)
TABLE 34.1
Local rank analysis of suspected selective regions in a LC-DAD data set (see Fig. 34.34)
Eigenvalues
1.277
O 0.000
Q.
0 O
RETENTION TIME
-1.277
inniB>HHiim{iiim.nii.i.»:|mmiii|BUiimHiininLiiiii(
B»pg>iiit|nmn:»|niiiiiii{iimBnnnuanniiniHBmir
K
innrwnmnui;BunM{wiinriminBiHBmBinMiiuiniinii1
WAVELENGTH (C)
1.277
O 0.000
Q.
0.0
RETENTION TIME
-1.277
iiiiiiHi|iniiiiii|niiiHii|Hiiiiiii|NinHM|niiinii|i^innii|Hiiiiii
WAVELENGTH (d)
Pure variables are fiilly selective for one of the factors. This means that only
one pure factor contributes to the values of that variable. When the pure variables
or selective wavelengths for each factor are known then the pure spectra can be
calculated in a straightforward manner from the mixture spectra by solving:
X = PS'^ (34.11)
where X is the matrix of the mixture spectra, S is the matrix of the unknown pure
spectra and P is a sub-matrix of X which contains the nc pure columns (one
selective column per factor). The pure spectra are then estimated by a least squares
procedure:
sT = (pTp)-^p'rx
The pure variable technique can be applied in the column space (wavelength) as
well as in the row space (time). When applied in the column space, a pure column
is one of the column factors. In LC-DAD this is the elution profile of the compound
which contains that selective wavelength in its spectrum. When applied in the row
space, a pure row is a pure spectrum measured in a zone where only one compound
elutes.
Fig. 34.35. Principle of the VARDIA technique demonstrated on a two-component system, f is rotated
in the V1-V2 plane. The variables Xi, X2 and X3 are projected on f during the rotation. In this figure f is
oriented in the direction of Xi and X2.
that the angle of this variable axis with Vj is small. Conversely a small loading for
the variable with Vj means that the angle with this variable is large. Instead of
projecting the variables on the principal components, one can also project them on
any other vector f located in the space defined by the PCs and which makes an
angle T with Vj. Vector f is defined as:
f = Vi cosF + V2 sinP
The loading of variabley on this vector is:
The larger the loading^ of variable) on f, the smaller the angle is between f and this
variable. This means that f lies in the direction of that variable, f may lie in the
direction of several correlated variables (jCj and X2 in Fig. 34.35), or a cluster of
variables. Because correlated variables are assumed to belong to the same pure
factor, f could be one of the pure factors. Thus by rotating fin the space defined by
the principal components and observing the loadings during that rotation, one may
find directions where variables cluster in the multivariate space. Such a cluster
belongs to a pure factor. This is illustrated with the LC-DAD data discussed in
Section 34.2.5.1. The two principal components found for this data matrix are
given in eq. (34.7). By illustration, the loadings of the variables 2 and 15 on fare
288
TABLE 34.2
The loadings of variables 2 and 15 during the rotation of PC^ (see eq. (34.7) for the values of PC^ and PCj)
10 0.668 0.014
20 0.706 0.013
30 0.723 0.010
40 0.718 0.008
50 0.691 0.005
60 0.643 0.002
70 0.576 -0.001
80 0.491 -0.004
90 0.391 -0.007
100 0.279 -0.010
110 0.159 -0.012
120 0.034 -0.014
130 -0.092 -0.016
140 -0.215 -0.017
150 -0.332 -0.017
160 -0.438 -0.017
170 -0.532 -0.017
180 -0.609 -0.016
190 -0.668 -0.014
200 -0.706 -0.013
210 -0.723 -0.010
220 -0.718 -0.008
230 -0.691 -0.005
240 -0.643 -0.002
250 -0.576 0.001
260 -0.491 0.004
270 -0.391 0.007
280 -0.279 0.010
290 -0.159 0.012
300 -0.034 0.014
310 0.092 0.016
320 0.215 0.017
330 0.332 0.017
340 0.438 0.017
350 0.532 0.017
360 0.609 0.016
289
If
fj> pi + vi)cosm)
one concludes that the variable j is pure. The cosine term introduces a small
projection window from -p/2 to +p/2. Windig [27] recommends a value of 10
degrees for p.
In order to decide on the purity of the factor f as a whole, the purity of the
loadings of all variables should be checked. At 30 degrees, where f = 0.866vi +
0.5V2, the variables 2 and 12 (see Table 34.3) fulfil the purity criterion. The sum of
squares of all variable loadings fulfilling the purity criterion are plotted in a
so-called variance diagram. At 30 degrees, this is the value (/2^3o + /i2,3o)- ^^^^ ^^
the principle of the VARDIA technique in which var(/) is plotted as a function of F,
with
for allfj for whichy^ > J{v^j + v|y )cos(P/2), with^ = v^cosF + V2ySinF (F is the
rotation angle).
The variance diagram obtained for the example discussed before is quite simple.
Clusters of pure variables are found at 30 degrees (var = 0.5853) and at 300 degrees
(var = 0.4868) (see Fig. 34.36). The distance from the centre of the diagram to each
point is proportional to the variance value. Neighbouring points are connected by
solid lines. All values were scaled in such a way that the highest variance is full
scale. As can be seen from Fig. 34.36, two clusters of pure variables are found. The
290
TABLE 34.3
VARDIA calculation at 30 degrees
Vl f2 *
V2 *3() ^(v?+V2)cos(57i/180) *• 3 0
Spectra at 30 degrees and 300 degrees (see Figs. 34.37 and 34.38) are in good
agreement with the pure spectra given in Fig. 34.3. This demonstrates that this
procedure yields quite a good estimate of the pure spectra. The rotation of a factor
in the space defined by two eigenvectors is straightforward and, therefore, the
method is well applicable to solve two-component systems. Also three-component
systems can be solved after all rows (spectra) have been normalized.
Windig [27,28] also proposed procedures to handle more complex systems with
more than two components. Therefore, several variance diagrams have to be com-
bined. Instead of observing the variance of the loadings of a factor being rotated in
the space defined by the eigenvectors, one can also plot the loadings, and guide
iteratively the rotation to a factor that obeys the constraints imposed by the system
291
180
210
Fig. 34.36. Variance diagram obtained for the data in Fig. 34.2.
wavelength (nm)
Fig. 34.37. Spectrum (f) at an angle of 30 degrees with Vi (in the V1-V2 plane).
292
0)
o
c:
CO
to
220 240 260 280 300 320 340 360 380 400
wavelength (nm)
Fig. 34.38. Spectrum (f) at an angle of 300 degrees with Vi (in the V1-V2 plane).
34.4.2 Simplisma
Windig et al. [29] proposed an elegant method, called SIMPLISMA to find the
pure variables, which does not require a principal components analysis. It is based
on the evaluation of the relative standard deviation (Sj/Xj) of the columnsy of X.
This yields a so-called standard deviation spectrum. A large relative standard
deviation indicates a high purity of that column. For instance, variable 1 in Table
34.4 has the same value (0.2) for the two pure factors and is therefore very impure.
By mixing the two pure factors no variation is observed in the value of that
variable. On the other hand, variable 2 is pure for factor 1 and varies between 0.1
and 0.4 in the mixtures. In order to avoid that wavelengths with a low mean
intensity obtain a high purity value, the relative standard deviation is truncated by
introducing a small off-set value 6, leading to the following expression for the
purity pj of a variable (column 7):
Pj =
Xj + 6
The purity plotted as a function of the wavelength; gives a purity spectrum. The
calculation of a purity spectrum is illustrated by the following example. Suppose
293
TABLE 34.4
Determination of pure variables from mixture spectra
1 2 3 4 5 6
Estimated
Pure 1 0.4 1.0 1.4 0.8 0.8 0
Pure 2 0.25 0 0.25 0.375 0.625 1.0
that spectra have been obtained of a number of mixtures (see Table 34.4). A plot of
the relative standard deviation spectrum together with the pure spectra (see Fig.
34.39) shows that a high relative standard deviation coincides with a pure variable
and that a small relative standard deviation coincides with an impure variable
(impure in the sense of non selective). Pure variables either belong to the same
spectrum, e.g. because it contains several selective wavelengths, or belong to
spectra from different compounds. This has to be sorted out by considering the
correlation between the pure variable columns of the data matrix. If the pure
variables are not positively correlated (here variable 2 and 6) one can also conclude
that these pure variables belong to different compounds. In this example the
correlation between columns 2 and 6 is -1.0 which indicates that the pure variables
belong to different compounds. In practice, however, one has to follow a more
complex procedure to find the successive pure variables belonging to different
species and to determine the number of compounds [29].
294
(a)
8
C
to
JO
o
CO
5 6
wavelength
0) 0.4
5 6
— • wavelength
Fig. 34.39. Mixture spectra and the corresponding relative standard deviation spectrum.
Once the pure variables have been identified, the data set can be resolved into
the pure spectra by solving eq. (34.11). For the mixture spectra in Table 34.4, this
gives:
estimate of the pure column factors. These are the elution profiles which are the
less contaminated by the other compounds.
An alternative procedure is to evaluate the purity of the rows / instead of the
purity of the columns;:
X- + O
The rows with the highest purities are estimates of the row factors, i.e. the purest
spectra from the data set, which are refined afterward by alternating regression.
1 1^ , , , , , , , , ,
1
0.9
0.8
\
MM
\ \ \
|o.6
0.7
/ 1/ 1 I
CO
\ \
§0.5
c
o
f "\
0.4
\ \
\ \ 1
0.3
/ ! \ 1
0.2
0.1 j
/ A \
1 >• .mrntf ,. ..1 . . . ^ ^ 1 i^'*'*»>..., 1 i,?TTrrrtii 1 1
10 20 30 40 50 60 70 80 90 100
Time
Fig. 34.40. Normalized concentration profiles of a minor and main compound for a system with 0.2%
of prednisone; chromatographic resolution is 0.8.
In summary, the selection procedure consists of three steps: (1) compare each
spectrum in X with all spectra already selected by applying eq. (34.14). Initially,
when no spectrum has been selected, the spectra are compared with the average
spectrum of matrix X; (2) plot of the dissimilarity values as a function of the
retention time (dissimilarity plot) and (3) select the spectrum with the highest
dissimilarity value by including it as a reference in matrix Y^. The selection of the
spectra is finished when the dissimilarity plot shows a random pattern. It is
considered that there are as many compounds as there are spectra. Once the purest
spectra are available, the data matrix X can be resolved into its spectra and elution
profiles by Alternating Regression explained in Section 34.3.1.
By way of illustration, let us consider the separation of 0.2% prednisone in
etrocortysone eluting with a chromatographic resolution equal to 0.8 [30] (Fig.
34.40). The dissimilarity of each spectrum with respect to the mean spectrum is
plotted in Fig. 34.41a. Two clearly differentiated peaks with maxima around times
46 and 63 indicate the presence of at least two compounds. In this case, the
297
a) xlO"
Fig. 34.41. Dissimilarity of each spectrum with respect to (a) the mean spectrum, (b) spectrum at time
46 and (c) spectra at times 46 and 63, for the system of Fig. 34.40.
dissimilarity of the spectrum at time 46 is slightly higher than the one of the
spectrum at time 63 and it is the first spectrum selected. Each spectrum is then
compared with the spectrum at time 46 and the dissimilarity is plotted versus time
(Fig. 34.41b). The spectrum at time 63 has the highest dissimilarity value and,
therefore, it is the second spectrum selected. The procedure is continued by
calculating the dissimilarity of all spectra left with respect to the two already
selected spectra, which is plotted in Fig. 34.41c. As one can see the dissimilarity
values are about 1000 times smaller than the smallest than obtained so far.
Moreover, no peak is observed in the plot. This leads to the conclusion that no third
component is present in the data. A comparison of the performance of FWSEFA,
SIMPLISMA and OPA on a real data set of LC-FTIR spectra containing three
complex clusters of co-eluting compounds is given in Ref. [31]. An alternative
method, key-set factor analysis, which looks for a set of purest rows, called key-set,
has been developed by Malinowski [32].
298
Fig. 34.42. RAFA plot of the least significant eigenvalue as a function of ^ (see text for an explanation
ofk).
In order to apply residual bilinearization [35] at least two data sets are needed:
Xy which is the data set measured for the unknown sample and X^ which is the data
matrix of a calibration standard, containing the analyte of interest. In the absence
of interferences these two data matrices are related to each other as follows:
X^ = bX^-^R (34.15)
R = T*V*'r + E (34.16)
By combining eqs. (34.15) and (34.16) we obtain:
X„ =fcX,+ r V*^ + E (34.17)
By RBL the regression coefficient b is calculated by minimizing the sum of
squares of the elements in E. Because the rank of R in eq. (34.16) is unknown, the
estimation of b from eq. (34.17) should be repeated for an increasing number of
principal components included in V*^.
Schematically the procedure proceeds as follows:
(1) Start with an initial estimate b^ofb',
(2) Calculate R = X, - fcoX,;
(3) Determine the rank of R and decompose R into T*V*^ + E;
(4) Obtain a new estimate b^ofbby solving X^ = bX^ + f *Y*T f^j. ^^-^^which
T*V*^ is the result ofstep 3. This yields: Z7i = (X J X,)-'Xj ( X , - T V ) ;
(5) Repeat steps (2) to (4) after substituting b^ with by
The iteration process is stopped after b converged to a constant value. If na
analytes are quantified simultaneously, data matrices of standard samples are
measured for each analyte separately. These matrices X^j, X^2' •••' ^s na ^^^
collected in a three-way data matrix Xg of the size nxpxna, where n is the number
of spectra in X^j,..., X^ „3, p is the number of wavelengths and na is the number of
analytes. The basic equation for this multicomponent system is given by:
301
X, = X,b + R (34.18)
where X,. is the three-way matrix of calibration data and b is a vector of regression
coefficients related to the unknown concentrations by c^ = c^ b^. How to perform
matrix operations on a three-way table is discussed in Section 31.17. The procedure
is then continued in a similar way as for the one-component case. Eq. (34.18) is
solved for b iteratively by substituting R = T*V*^ + E as explained before. Because
the concentrations c^ are known, the three-way data matrix X^ measured for the
standard samples can be directly resolved in its elution profiles and spectra by
Parafac [36] explained in Section 31.8.3. References to other methods for the
decomposition of three-way multicomponent profiles are included in the list of
additional recommended reading.
34.5.3 Discussion
usually moved over the cluster. All global methods, except PCA, usually apply a
stepwise approach, e.g. SIMPLISMA, OPA and HELP. HELP is a very versatile
tool for a visual inspection and exploration of the data.
Several complications can be present, such as heteroscedastic noise, sloping
baseline, large scan time and non-linear absorbance [40]. This may lead to the
overestimation of the number of existing compounds. The presence of hetero-
scedastic noise and non-linearities have an important effect on all PCA based
methods, such as EFA and FSWEFA. Non-zero and sloping baselines have a
critical effect in SIMPLISMA, HELP and FSWEFA. In any case it is better to
correct for the baseline prior to the application of any multivariate technique.
Baseline correction can be done by subtracting a linear interpolation of the noise
spectra before and after the peak, or by row-centring the data [40]. Most analytical
instruments have a restricted linear range and outside that range Beer's law no
longer holds. Non-linear absorbance indicates the presence of more compounds in
all the approaches discussed in this chapter. In some cases it is possible to detect a
characteristic profile indicating the presence of non-linearities. In any case the best
remedy is to keep the signal within the linear range. A non-linearity may also be
introduced because the DAD needs about 10-50 ms to measure a whole spectrum.
During that time the concentration of the eluting compound(s) may change signific-
antly. The most sensitive methods for the detection of small amounts of impurities
eluting at low chromatographic resolutions, OPA and HELP, are also the ones
most affected by these non-linearities. If the scan time is known, a partial correction
is possible. EFA, FSWEFA and ETA, which belong to the family of evolutionary
methods are somewhat less performing for purity checking. They may also flag
impurities due to the heteroscedasticity of the noise and non-linearity of the signal.
For a more detailed discussion we refer to Ref. [40].
The first step in analysing a data table is to determine how many pure factors
have to be estimated. Basically, there are two approaches which we recommend.
One starts with a PCA or else either with OPA or SIMPLISMA. PCA yields the
number of factors 2ind the significant principal components, which are abstract
factors. OPA yields the number of factors and the purest rows (or columns)
(factors) in the data table. If we suspect a certain order in the spectra, we
preferentially apply evolutionary techniques such as FSWEFA or HELP to detect
pure zones, or zones with two or more components.
Depending on the way the analysis was started, either the abstract factors found
by a PCA or the purest rows found by OPA, should be transformed into pure
factors. If no constraints can be formulated on the pure factors, the purest rows
303
(spectra) found by OPA cannot be improved. On the contrary, a PCA can either be
followed by a Varimax rotation or by constructing a variance diagram which yields
factors with the greatest simplicity. If constraints can be formulated on the pure
factors, a PCA can be followed by a curve resolution under the condition that only
two compounds are present. OPA (or SIMPLISMA or FSWEFA) can be followed
by alternating regression to iteratively estimate the pure row-factors (spectra) and
pure column-factors (elution profiles). In a similar way, the varimax and vardia
factors can be improved by alternating regression. The success of ITTFA in
finding pure factors depends on its convergence to a pure factor by a stepwise
application of constraints on the solution, which has been demonstrated on elution
profiles. However, it then requires a PCA in the retention time space.
Although the decomposition of a data table yields the elution profiles of the
individual compounds, a calibration step is still required to transform peak areas
into concentrations. Essentially we can follow two approaches. The first one is to
start with a decomposition of the peak cluster by one of the techniques described
before, followed by the integration of the peak of the analyte. By comparing the
peak area with those obtained for a number of standards we obtain the amount. One
should realize that the decomposition step is necessary because the interfering
compound is unknown. The second approach is to directly calibrate the method by
RAFA, RBL or GRAFA or to decompose the three-way table by Parafac. A serious
problem with these methods is that the data sets measured for the sample and for
the standard solution should be perfectly synchronized.
References
1. H.F. Kaiser, The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23
(1958) 187-200.
2. M. Forina, C. Armanino, S. Lanteri and R. Leardi, Methods of Varimax rotation in factor anal-
ysis with applications in clinical and food chemistry. J. Chemom., 3 (1988) 115-125.
3. J.K. Strasters, H. A.H. Billiet, L. de Galan, B.G.M. Vandeginste and G. Kateman, Evaluation of
peak-recognition techniques in liquid chromatography with photodiode array detection. J.
Chromatog., 385 (1987) 181-200.
4. E.R. Malinowski and D. Howery, Factor Analysis in Chemistry. Wiley, New York, 1980.
5. P.K. Hopke, Target transformation factor analysis. Chemom. Intell. Lab. Syst., 6 (1989) 7-19.
6. J.K. Strasters, H.A.H. Billiet, L. de Galan, B.G.M. Vandeginste and G. Kateman, Reliability of
iterative target transformation factor analysis when using multiwavelength detection for peak
tracking in liquid-chromatographic separation. Anal. Chem., 60 (1988) 2745-2751.
7. W.H. Lawton and E A. Sylvestre, Self modeling curve resolution. Technometrics, 13 (1971)
617-633.
8. B.G.M. Vandeginste, R. Essers, T. Bosman, J. Reijnen and G. Kateman, Three-component
curve resolution in HPLC with multiwavelength diode array detection. Anal. Chem., 57 (1985)
971-985.
9. O.S. Borgen and B.R. Kowalski, An extension of the multivariate component-resolution
method to three components. Anal. Chim. Acta, 174 (1985) 1-26.
304
10. A. Meister, Estimation of component spectra by the principal components method. Anal.
Chim. Acta, 161 (1984) 149-161.
11. B.G.M. Vandeginste, F. Leyten, M. Gerritsen, J.W. Noor, G. Kateman and J. Frank, Evaluation
of curve resolution and iterative target transformation factor analysis in quantitative analysis by
liquid chromatography. J. Chemom., 1 (1987) 57-71.
12. P.K. Hopke, D.J. Alpert and B.A. Roscoe, FANTASIA — A program for target transformation
factor analysis to apportion sources in environmental samples. Comput. Chem., 7 (1983)
149-155.
13. P.J. Gemperline, A priori estimates of the elution profiles of the pure components in over-
lapped liquid chromatography peaks using target factor analysis. J. Chem. Inf. Comput. Sci., 24
(1984)206-212.
14. P.J. Gemperline, Target transformation factor analysis with linear inequality constraints ap-
plied to spectroscopic-chromatographic data. Anal. Chem., 58 (1986) 2656-2663.
15. B.G.M. Vandeginste, W.Derks and G. Kateman, Multicomponent self modelling curve resolu-
tion in high performance liquid chromatography by iterative target transformation analysis.
Anal. Chim. Acta, 173 (1985) 253-264.
16. A. de Juan, B. van den Bogaert, F. Cuesta Sanchez and D.L. Massart, Application of the needle
algorithm for exploratory analysis and resolution of HPLC-DAD data. Chemom. Intell. Lab.
Syst.,33(1996) 133-145.
17. M. Maeder, Evolving factor analysis for the resolution of overlapping chromatographic peaks.
Anal. Chem., 59 (1987) 527-530.
18. H. Gampp, M. Maeder, C.J. Meyer and A.D. Zuberbuhler, Calculation of equilibrium con-
stants from multiwavelength spectroscopic data. Ill Model-free analysis of spectrophotometric
and ESR titrations. Talanta, 32 (1985) 1133-1139.
19. M. Maeder and A.D. Zuberbuhler, The resolution of overlapping chromatographic peaks by
evolving factor analysis. Anal. Chim. Acta, 181 (1986) 287-291.
20. R.Tauler and E. Casassas, Application of principal component analysis to the study of multiple
equilibria systems — Study of Copper(II) salicylate monoethanolamine, diethanolamine and
triethanolamine systems. Anal. Chim. Acta, 223 (1989) 257-268.
21. E.J. Karjalainen, Spectrum reconstruction in GC/MS. The robustness of the solution found
with alternating regression, in: E.J. Karjalainen (Ed.), Scientific Computing and Automation
(Europe). Elsevier, Amsterdam, 1990, pp. 477-488.
22. H.R. Keller and D.L. Massart, Peak purity control in liquid-chromatography with photodiode
array detection by fixed size moving window evolving factor analysis. Anal. Chim. Acta, 246
(1991)379-390.
23. J. Toft and O.M. Kvalheim, Eigenstructure tracking analysis for revealing noise patterns and
local rank in instrumental profiles: application to transmittance and absorbance IR spectros-
copy. Chemom. Intell. Lab. Syst., 19 (1993) 65-73.
24. O.M. Kvalheim and Y.-Z. Liang, Heuristic evolving latent projections — resolving 2-way
multicomponent data. 1. Selectivity, latent projective graph, datascope, local rank and unique
resolution. Anal. Chem., 64 (1992) 936-946.
25. M.J.P. Gerritsen, H. Tanis, B.G.M. Vandeginste and G. Kateman, Generalized rank annihila-
tion factor analysis, iterative target transformation factor analysis and residual bilinearization
for the quantitative analysis of data from liquid-chromatography with photodiode array detec-
tion. Anal. Chem., 64 (1992) 2042-2056.
26. H.R. Keller and D.L. Massart, Artifacts in evolving factor analysis-based methods for peak pu-
rity control in liquid-chromatography with diode array detection. Anal. Chim. Acta, 263 (1992)
21-28.
305
27. W. Windig and H.L.C. Meuzelaar, Nonsupervised numerical component extraction from py-
rolysis mass spectra of complex mixtures. Anal. Chem., 56 (1984) 2297-2303.
28. W. Windig and J. Guilement, Interactive self-modeling mixture analysis. Anal. Chem., 63
(1991) 1425-1432.
29. W. Windig, C.E. Heckler, FA. Agblevor and R.J. Evans, Self-modeling mixture analysis of
categorized pyrolysis mass-spectral data with the Simplisma approach. Chemom. Intell. Lab.
Syst., 14 (1992) 195-207.
30. F.C. Sanchez, J. Toft, B. van den Bogaert and D.L. Massart, Orthogonal projection approach
applied to peak purity assessment. Anal. Chem., 68 (1996) 79-85.
31. F. C. Sanchez, T. Hancewicz, B.G.M. Vandeginste and D.L. Massart, Resolution of complex
liquid chromatography Fourier transform infrared spectroscopy data. Anal. Chem., 69 (1997)
1477-1484.
32. E.R. Malinowski, Obtaining the key set of typical vectors by factor analysis and subsequent
isolation of component spectra. Anal. Chim. Acta, 134 (1982) 129-137.
33. C.N. Ho, G.D. Christian and E.R. Davidson, Application of the method of rank annihilation to
quantitative analysis of multicomponent fluorescence data from the video fluorometer. Anal.
Chem., 52 (1980) 1108-1113.
34. E. Sanchez and B.R. Kowalski, Generalized rank annihilation factor analysis. Anal. Chem., 58
(1986) 496-499.
35. J. Ohman, P. Geladi and S. Wold, Residual bilinearization. Part I Theory and algorithms. J.
Chemom., 4 (1990) 79-90.
36 A.K. Smilde, Three-way analysis. Problems and prospects. Chemom. Intell. Lab. Syst., 15
(1992) 143-157.
37. M.J.P. Gerritsen, N.M. Faber, M. van Rijn, B.G.M. Vandeginste and G. Kateman, Realistic
simulations of high-performance liquid-chromatographic ultraviolet data for the evaluation of
multivariate techniques. Chemom. Intell. Lab. Syst., 2 (1992) 257-268.
38. E. Sanchez, L.S. Ramos and B.R. Kowalski, Generalized rank annihilation method. I. Applica-
tion to liquid chromatography-diode array ultraviolet detection data. J. Chromatog., 385
(1987) 151-164.
39. L.S. Ramos, E. Sanchez and B.R. Kowalski, Generalized rank annihilation method. II Analysis
of bimodal chromatographic data. J. Chromatog., 385 (1987) 165-180.
40. F. C. Sanchez, B. van den Bogaert, S.C. Rutan and D.L. Massart, Multivariate peak purity ap-
proaches. Chemom. Intell. Lab. Syst., 34 (1996) 139-1171.
Books
E.R. Malinowski and D.G. Howery, Factor Analysis in Chemistry, 2nd Edn. Wiley, New York, 1992.
R. Coppi and S. Bolasco (Eds.), Multiway Data Analysis. North-Holland, Amsterdam, 1989.
Articles
Target transformation factor analysis:
P.K. Hopke, Tutorial: Target transformation factor analysis. Chemom. Intell. Lab. Syst., 6 (1989)
7-19.
306
Chapter 35
35.1 Introduction
Studying the relationship between two or more sets of variables is one of the
main activities in data analysis. This chapter mainly deals with modelling the
linear relationship between two sets of multivariate data. One set holds the
dependent variables (or responses), the other set holds the independent variables
(or predictors). However, we will also consider cases where such a distinction
cannot be made and the two data sets have the same status. Each set is in the usual
objects X measurements format. There is a choice of techniques for estimating the
model, all closely related to multiple linear regression (see Chapter 10). Roughly,
the model found can be used in two ways. One usage is for a better understanding
of the system under investigation by an interpretation of the model results. The
other usage is for the future prediction of the dependent variable from new
measurements on the predictor variables. Examples of problems amenable to such
multivariate modelling are legion, e.g. relating chemical composition to
spectroscopic or chromatographic measurements in analytical chemistry, studying
the effect of structural properties of chemical compounds, e.g. drug molecules, on
functional behaviour in pharmacology or molecular biology, linking flavour
composition and sensory properties in food research or modelling the relation
between process conditions and product properties in manufacturing.
The present chapter provides an overview of the wide range of techniques that
are available to tackle the problem of relating two sets of multivariate data.
Different techniques meet specific objectives: simply identifying strong
correlations, matching two multi-dimensional point configurations, analyzing the
effects of experimental factors on a set of responses, multivariate calibration,
predictive modelling, etc. It is important to distinguish the properties of these
techniques in order to make a balanced choice.
As an example consider the data presented in Tables 35.1-35.4. These tables are
extracted from a much larger data base obtained in an international cooperative
study on the sensory aspects of olive oils [1]. Table 35.1 gives the mean scores for
16 samples of olive oil with respect to six appearance attributes given by a Dutch
sensory panel. Table 35.2 gives similar scores for the same samples as judged by a
308
TABLE 35.1
Olive oils: mean scores for appearance attributes from Dutch sensory panel
British panel. Note that the sensory attributes are to some extent different. Table
35.3 gives some information on the country of origin and the state of ripeness of
the olives. Finally, Table 35.4 gives some physico-chemical data on the same
samples that are related to the quality indices of olive oils: acid and peroxide level,
UV absorbance at 232 nm and 270 nm, and the difference in absorbance at
wavelength 270 nm and the average absorbance at 266 nm and 274 nm.
Given these tables of multivariate data one might be interested in various
relationships. For example, do the two panels have a similar perception of the
different olive oils (Tables 35.1 and 35.2)? Are the oils more or less similarly
scattered in the two multidimensional spaces formed by the Dutch and by the
British attributes? How are the two sets of sensory attributes related? Does the
309
TABLE 35.2
Olive oils: mean scores of appearance attributes from British sensory panel
country of origin or the state of ripeness affect the sensory characteristics (Tables
35.1 and 35.3)? Can we possibly predict the sensory properties from the physico-
chemical measurements (Tables 35.1 and 35.4)?
An important aspect of all methods to be discussed concerns the choice of the
model complexity, i.e., choosing the right number of factors. This is especially
relevant if the relations are developed for predictive purposes. Building validated
predictive models for quantitative relations based on multiple predictors is known
as multivariate calibration. The latter subject is of such importance in chemo-
metrics that it will be treated separately in the next chapter (Chapter 36). The
techniques considered in this chapter comprise Procrustes analysis (Section 35.2),
canonical correlation analysis (Section 35.3), multivariate linear regression
310
TABLE 35.3
Country of origin and state of ripeness of 16 olive oils. The last 4 columns contain the same information in
the form of a coded design matrix
35.2.1 Introduction
TABLE 35.4
Physico-chemical quality parameters of the 16 olive oils
little bear
Great Bear
^ ^ ^ \
data set. The next step is a reflection (Fig. 35.1c). Again this is a transformation
which leaves distances between objects unaltered. The following step is a rotation,
which changes the orientation, but not the internal structure of the configuration
(Fig. 35.Id). When this best match is found one is at liberty to rotate all config-
urations equally. This will not affect the match but it may yield an overall
orientation that is more appealing (Fig. 35.le). Finally, by taking the mean position
for each star, one obtains an average configuration, often called the consensus, that
is representative for the two separate configurations (Fig. 35.If).
The major problem is to find the rotation/reflection which gives the best match
between the two centered configurations. Mathematically, rotations and reflec-
tions are both described by orthogonal transformations (see Section 29.8). These
are linear transformations with an orthonormal matrix (see Section 29.4), i.e. a
square matrix R satisfying R^R = RR^ = I, or R^ = R"^ When its determinant is
positive R represents a pure rotation, when the determinant is negative R also
involves a reflection.
The best match is defined as the one which minimizes the sum of squared
distances between the transformed Z-objects and the corresponding objects in the
target configuration given by Y. The Procrustes problem then is equivalent to
minimizing the sum of squares of the deviations matrix E = Y - XR, assuming both
X and Y have been column-mean centered. This looks like a straightforward
least-squares regression problem, Y = XR + E, but it is not since R is restricted to
be an orthogonal rotation/reflection matrix.
Using a shorthand notation for a matrix sum of squares, ||E|p = YJ^'^J = tr(E^E),
we may state the Procrustes optimization problem as:
Using elementary properties of the trace of a matrix (viz. tr(A + B) = tr(A) + tr(B)
and tr(AB) = tr(BA), see Section 29.4) we may write:
II Y - XR IP = tr((Y - XR)'r(Y - XR)) = tr(YTY - R^X'^Y - Y^XR + R'^X^XR) =
= tr(YTY) - 2tr(Y^XR) + tr(RTX'rXR) (35.2)
The first term on the right-hand side represents the total sum of squares of Y, that
obviously does not depend on R. Likewise, the last term represents the total sum of
squares of the transformed X-configuration, viz. XR. Since the rotation/reflection
given by R does not affect the distance of an object from the origin, the total sum of
squares is invariant under the orthogonal transformation R. (This also follows
from tr(R'^X^XR) = tr(X^XRRT) = tT(X^XI) = tT{X^X).) The only term then in eq.
(35.2) that depends on R is tr(Y'^XR), which we must seek to maximize.
314
Let the SVD of V^X be given by V^X = Q D W ^ , with D being the diagonal
matrix of singular values. The properties of singular vector decomposition (SVD,
Section 29.6) tell us that, among all possible orthonormal matrices, Q and W are
the ones that maximize tr(Q'^Y'^XW). Since tr(Q'^Y'^XW) = trCY^XWQ^), it
follows that R = WQ^ is the rotation/reflection which maximizes tr(Y^XR), and
hence it minimizes the squared distances, ||Y - XR |p (eq. 35.2) between X and Y.
Given this optimal Procrustes rotation applied to X, one may compute an average
configuration Z as (Y + XR)/2. Usually, this is followed by a principal component
analysis (Section 31.1) of the average Z. The rotation matrix V, obtained as the
matrix of eigenvectors of Z^Z, is then applied each to Y, XR and Z.
It must be emphasized that Procrustes analysis is not a regression technique. It
only involves the allowed operations of translation, rotation and reflection which
preserve distances between objects. Regression allows any linear transformation;
there is no normality or orthogonality restriction to the columns of the matrix B
transforming X. Because such restrictions are released in a regression setting Y =
XB will fit Y more closely than the Procrustes match Y = XR (see Section 35.3).
35.2.2 Algorithm
35.2.3 Discussion
In the field of structural chemistry Procrustes analysis has been used to compare
the geometries of similar structures from different studies: e.g. the conformations
of a molecule crystallized in two crystalline forms, or the 3D-configuration of the
main chain of two proteins, or the geometries of differently substituted com-
pounds. Figure 35.2a shows the 3D-configuration of two comparable amino-acid
sequences taken from related protein fragments obtained from different structural
studies. After Procrustes matching the two structures are overlaid (Fig. 35.2b)
revealing both the close overall similarity and the local deviations.
315
AA,
/-O- --A
Fig. 35.2. Comparable fragments of different protein fragments, before (a) and after (b) matching.
Let us now compare the sensory data of Table 35.1 (Dutch panel, X) and Table
35.2 (British panel, Y) using Procrustes analysis. Since the British panel has one
variable less than the Dutch panel we add a column of zeros to ensure that the two
data sets have the same size (16x6). This so-called 'zero padding' is allowed since
it does not affect the distances between the objects. Another form of data pre-
processing that we apply is the overall scaling of the British scores which have a
larger total variance. After shrinking these data by a factor 0.73, both data sets have
the same total variance (= 1038). Figure 35.3 shows the object scores of the
combined data, obtained after matching, projected on the principal axes of the
316
20
10 C
to
b D d
(0
" %
a a ^
•
rflt
I
o
c
-10^ M
-20\
1 1 — • — I — ' — \ — ' -1 p- —^ r — ^ I I r -
principal axis 1
Fig. 35.3. Scatter plot of 16 olive oils scored by two sensory panels (Dutch panel: lower case; British
panel: upper case). The combined data are shown after Procrustes matching and projection onto the
principal plane of the average configuration.
consensus configuration. This scatter plot clearly shows that the relative positions
of the 16 olive oil samples by the two panels compare very well. It is also clear that
samples from different countries, especially Greece and Spain, are well separated.
As to the interpretation of the various attributes one may plot the correlations of the
individual attributes with the two principal dimensions (Fig. 35.4). This plot
shows, for example, that indeed the attributes 'yellow' and 'green' (and to a lesser
extent 'brown') are similarly perceived by the two panels. It also shows that the
second dimension is better represented by the attributes from the Dutch panel.
Practically all attributes lie close to the circle of radius 1. This means that a large
fraction of their variance is explained by dimensions 1 and 2, indicating that there
is little correlation with the higher dimensions. Hence, the common two-
dimensional representation provides an adequate summary of the data of both
panels.
Procrustes analysis has been generalized in two ways. One extension is that
more than two data sets may be considered. In that case the algorithm is iterative.
One then must rotate, in turn, each data set to the average of the other data sets. The
cycle must be repeated until the fit no longer improves. Procrustes analysis of
many data sets has been applied mostly in the field of sensory data analysis [4].
Another extension is the application of individual scaling to the various data sets in
order to improve the match. Mathematically, it amounts to multiplying all entries
in a data set by the same scalar. Geometrically, it amounts to an expansion (or
317
1.0
BRIC
YELLJOW
Fig. 35.4. Correlations of sensory attributes with principal axes of average configuration after Pro-
crustes matching (Dutch panel: lower case; British panel: upper case).
shrinkage) that is isotropic, i.e. equal in all directions. While it does affect the
absolute distances between objects, it does not affect their relative distances nor
the angles involving three objects. In Chapter 38 we give an example of such a
Generalized Procrustes Analysis (GPA) where it is used to find a 'best average'
configuration of different food products coming from individual assessments by
different panellists. Procrustes analysis has been little exploited in typical chemo-
metrical areas, since it is less common that one only wants to investigate the
similarity of two comparable data tables. One is more often interested in a
predictive regression-type linear relation between two data tables. The data tables
involved then may be of quite different origin, e.g. sensory and chemical.
35.3.1 Introduction
TABLE 35.5
Two small bivariate data sets.
A. Raw data
X Y
•^1 ^2 ^'i yi
8 11 10 8
8 8 9 7
9 12 6 11
10 10 1 14
7 9 13 3
0 2 10 7
9 11 5 12
5 2 11 2
B. Correlation matrix
Xi X2 yi yi
For example, let us take a look at the data of Table 35.5a. This table shows two
very simple data sets, X and Y, each containing only two variables. Is there a
relationship between the two data sets? Looking at the matrix of correlation
coefficients (Table 35.5b) we find that the so-called intra-set (or within-set)
correlations are strong:
r(jCi,jC2) = +0.86 and r(y^,y2) = -0.93.
The inter-set (or between-set) correlations, however, are relatively low:
r{x,,y,) = -0.55, rix.^y^) = +0.53, rix^^y,) = -0.46, r(x2,y2) = +0.62.
The squared inter-set correlation coefficients vary from 0.21 to 0.38. Thus, only
some 20% to 40% of the variance of the individual variables can be explained by
one of the variables from the other data set. At a first glance these low inter-set
correlations do not indicate a strong relation between the two data tables. In
319
35.3.2 Algorithm
correlations p^. Instead of a single SVD of R one may apply a spectral de-
composition (Section 29.6) of RR^ giving eigenvectors W* and a spectral de-
composition of R^R giving eigenvectors Q*, the eigenvalues corresponding to the
squared canonical correlations. The canonical variables are now obtained as in
eqs. (35.7a,b). The factor {n-\f^ is included to ensure that the canonical variables
have unit variance. Back-transformation to the centred X- and F-variables yields
the sets of canonical weights collected in matrices W and Q, respectively (eqs.
35.8a,b). Applying these weights to the original variables again yields the
canonical variables (eqs. 35.9a,b). Regressing the X-variables and F-variables on
their corresponding canonical variables gives the loading matrices P and C (eqs.
35.10a,b)) which appear in the canonical decomposition: X = TP'^ = T(T'rT)-iT^X
and Y = UC^ = U(U^U)"^U^Y. The loadings, defining the original mean-centred
variables in terms of the orthogonal canonical variables, are better suited for
interpretation than the weights. Each row of P (or C) corresponds to a variable and
tells how much each canonical variable contributes to (or "loads" on) this variable.
In case the X-variables and the F-variables are also scaled to unit variance, P and C
contain the intra-set correlations between the original variables and the canonical
variables (so-called structure correlations', see Table 35.6).
It should be appreciated that canonical correlation analysis, as the name implies,
is about correlation not about variance. The first step in the algorithm is to move
from the original data matrices X and Y, to their singular vectors, Ux and Uy,
respectively. The singular values, or the variances of the PCs of X and Y, play no
role.
353.3 Discussion
Let us take a closer look at the analysis of the data of Table 35.5. In Table 35.6
we summarize the correlations of the canonical variates and also their correlations
with the original variables. The high value of the first canonical correlation
(pi = 0.95) suggests a strong relationship between the two data sets. However, the
canonical variables tj and Uj are only strongly related to each other, not with the
original variables (Table 35.6). On the other hand, the second set of canonical
variates t2 and U2 are strongly related to their original variables, but not to each
other (p2 = 0.55). Thus, the analysis yields a pair of strongly linked, but un-
interesting factors and a pair of more interesting factors, which are weakly related,
however.
A major limitation to the value of CCA thus already has become apparent in the
example shown. There is no guarantee that the most important canonical variable
t, (or Ui) is highly correlated to any of the individual variables of X (or Y). It is
possible then for the first canonical variable tj of X to be strongly correlated with
Uj, yet to have very little predictive value for Y. In terms of principal components
322
TABLE 35.6
Canonical structure: correlations between the original variables (jc, y) and their canonical variates (/, u).
X Y
^1 h "i "2
(Chapter 17): only the minor principal components of X and Y happen to be highly
correlated. It is questionable whether such a high correlation is then of much
interest. This dilemma of choosing between high correlation and large variance
presents a major problem when analyzing the relation between measurements
tables. The regression techniques treated further on in this chapter address this
dilemma in different ways. A second limitation of CCA is that it cannot deal in a
meaningful way with data tables in 'landscape mode', i.e. wide data tables having
more variables than objects. This severely limits the importance of CCA as a
general tool for multivariate data analysis in chemometrics, e.g. when X represents
spectral data.
As the name implies CCA analyses correlations. It is therefore insensitive to any
rescaling of the original variables. This advantage is not shared with most other
techniques discussed in this chapter. As in Procrustes analysis X and Y play
entirely equivalent roles in canonical correlation analysis: there is no distinction in
terms of dependent variables (or responses) versus independent variables (or
predictors, regressors, etc.). This situation is fairly uncommon. Usually, the X and
Y data are of a different nature and one is interested in understanding one set of
data, say Y, in the light of the information contained in the other data set, X. Rather
than exploring correlations in a symmetric X<->Y relation, one is searching for an
asymmetric regression relation X->Y explaining the dependent F-variables from
the predictor X-variables. Thus, the symmetrical nature of CCA limits its practical
importance. In the following sections we will discuss various asymmetric re-
gression methods where the goal is to fit the matrix of dependent variables Y by
linear combination(s) of the predictor variables X.
323
35.4.1 Introduction
In this section we will distinguish multivariate regression from multiple
regression. The former deals with a multivariate response (Y), the latter with the
use of multiple predictors (X). When studying the relation between two multi-
variate data sets via regression analysis we are therefore dealing with multivariate
multiple regression. Perhaps the simplest approach to studying the relation
between two multivariate data sets X and Y is to perform for each individual
univariate variable y^ {k = 1,..., m) a separate multiple (i.e. two or more predictor
variables) regression on the Z-variables. The obvious advantage is that the whole
analysis can be done with standard multiple regression programs. A drawback of
this approach of many isolated regressions is that it does not exploit the multi-
variate nature of Y, viz. the interdependence of the y-variables. Genuine multi-
variate analysis of a data table Y in relation to a data table X should be more than
just a collection of univariate analyses of the individual columns of Y!
One might suspect that fitting all 7-variables simultaneously, i.e. in one overall
multivariate regression, might make a difference for the regression model. This is
not the case, however. To see this, let us state the multivariate (i.e. two or more
dependent variables) regression model as:
[yi' y2' — Ym] = X [bj, b2,..., b j + [ei, t^^..., e j (35.11)
which shows explicitly the various responses, y^ (/ = 1, 2, ..., m), as well as the
vector of regression coefficients b^, and the residual vector, e^, corresponding to
each response. This model may be written more compactly as:
Y = XB + E (35.12)
where Y is the nxm data set of responses, X the nxp data set of regressors, B the
pxm matrix of regression coefficients and E the nxm error matrix. Each column of
Y, B and E corresponds to one of the m responses, each column of X and each row
of B to one of the p predictor variables and each row of Y, X, and E to one of the n
observations.
The total residual sum of squares, taken over all elements of E, achieves its
minimum when each column e^ separately has minimum sum of squares. The latter
occurs if each (univariate) column of Y is fitted by X in the least-squares way.
Consequently, the least-squares minimization of E is obtained if each separate
dependent variable is fitted by multiple regression on X. In other words: the
multivariate regression analysis is essentially identical to a set of univariate
regressions. Thus, from a methodological point of view nothing new is added and
we may refer to Chapter 10 for a more thorough discussion of theory and
application of multiple regression.
324
35.4.2 Algorithm
35.4.3 Discussion
35.5.7 Introduction
Reduced rank regression (RRR), also known as redundancy analysis (or PCA
on Instrumental Variables), is the combination of multivariate least squares
regression and dimension reduction [7]. The idea is that more often than not the
dependent F-variables will be correlated. A principal component analysis of Y
might indicate that A {A « m) PCs may explain Y adequately. Thus, a full set of m
325
35.5.2 Algorithm
Equation (35.15) represents the projection of each 7-variable onto the space
spanned by the X-variables, i.e. each F-variable is replaced by its fit from multiple
regression on X.
Step 2. Next one appUes an SVD (or PCA) to centered Y, denoted as Y*(= Y - I m ^):
Y* = U S V ^ (35.16)
Step 3. Dimension (rank) reduction by only retaining A major components to
approximate Y*. This gives the RRR fit:
Y*,A] = ^A]StAiVtIj (35.17)
Step 4. The RRR model coefficients are then found by a multivariate linear
regression of the RRR fit, Yj^j = (Y*j^j + In"** Y ) ^^ original X, which should have a
column of ones:
BRRR = ( X ^ ) - ' X\Y\^^+ l„m^) (35.18)
35.5.3 Discussion
35.5.4 Example
Let us try to relate the (standardized) sensory data in Table 35.1 to the explan-
atory variables in Table 35.3. Essentially, this is an analysis-of-variance problem.
We try to explain the effects of two qualitative factors, viz. Country and Ripeness,
on the sensory responses. Each factor has three levels: Country = {Greece, Italy,
327
Spain} and Ripeness = {Unripe, Normal, Overripe}. Since not all combinations
out of the complete 3x3 block design are duplicated, there is some unbalance
making the design only nearly orthogonal. We treat this multivariate ANOVA
problem as a regression problem, coding the regressors as indicated in Table 35.3
and omitting the Italy column and Normal column to avoid ill-conditioning of X.
Some of the results are collected in Table 35.7. Table 35.7a shows that some
sensory attributes can be fitted rather well by the RRR model, especially 'yellow'
and 'green' (/?^« 0.75), whereas for instance 'brown' and 'syrup' do much worse
(/?2 ^ 0.40). These fits are based on the first two PCs of the least-squares fit (Y). The
PCA on the OLS predictions showed the 2-dimensional approximation to be very
good, accounting for 99.2% of the total variation of Y. The table shows the PC
weights of the (fitted) sensory variables. Particularly the attributes 'brown', and to
a lesser extent 'syrup', stand out as being different and being the main contributors
to the second dimension.
TABLE 35.7
(a) Basic results of the reduced rank regression analysis. The columns PCI and PC2 give the weights of the
PCA model of Y (OLS fitted Y). The columns /?2 (in %) show how well Y and Y are fitted by the first two
principal components of Y.
(b) The columns PCI and PC2 give the X-weights of the PCA model of Y (OLS fitted Y). The columns /?2
(in %) show how well the X variables are fitted by the first two principal components of Y.
The two principal axes can also be defined as linear combinations of the
explanatory variables. This is given in Table 35.7b. The larger coefficients for the
Country variables when regressing the PCs on the four predictor variables show
that the country of origin is strongly related to the most predictive principal
dimensions and that the state of ripeness is not. This also appears from the fact that
the Country variables can be fitted very well (high R^) by the first two PCs, in
contrast to the low R^ values for the ripeness variables. In other words, the country
of origin is the dominant factor affecting the appearance of olive oils whereas the
state of ripeness has little effect. The first predictive dimension mainly represents a
contrast between the olive oils of Spanish origin versus non-Spanish origin and to a
much lesser extent a contrast between unripe versus the (over)ripe olives. The
second predictive factor, which is mostly used to fit the 'brown' sensory attribute,
represents a contrast between Italy and Greece, with Spain in the middle. Fig. 35.5
summarizes the relationships between samples (objects), predictor variables and
dependent variables. The objects are plotted as standardized scores (first two
columns of (n-iy^^V), the variables as loading vectors, taken from X'^U and Y^U,
respectively, scaled to fit on the graph. For a thorough treatment of biplotting the
results of rank reduced multivariate regression models, see Ref [8].
By combining the coefficients in the two parts of the table one can express each
sensory attribute in terms of the explanatory factors. Note that the above regression
Fig. 35.5. Biplot of reduced rank regression model showing objects, predictors and responses.
329
35.6.1 Introduction
35.6.3 Algorithm
The equation represents the projection of each K-variable onto the space spanned
by the first A PCs of X.
Step 4. The PCR model coefficient matrix, pxm Bp^R, can be obtained in a variety
of equivalent ways:
35.6.3 Discussion
The PCR approach has many attractive features. First of all there is the aspect of
a prior dimension reduction of the data set (measurement table) X. Using PCA this
is done in such a way as to maintain the maximum amount of information. The
neglected minor components are supposed to contain noise that is in no way
relevant for the relation with Y. Another advantage is that the principal compo-
nents are orthogonal (uncorrelated). This greatly simplifies the multiple regression
of the y-variables, allowing the effect of the individual principal components to be
assessed independently. The chief advantage is that the major principal compo-
nents have, by definition, large variance. This leads to a stable regression as the
variance of an estimated regression coefficient is inversely proportional to the
variance of the regressor (si = s^ /Z(A:? - Jc"); see Section 8.2.4.1)
The orthogonality of the principal components has the advantage that the effect
of the various PCs are estimated independently: multiple regression becomes
equivalent with a sequence of separate regressions of the response(s) on the
individual PCs. The fact that the X-components are chosen on the basis of
representing X rather than Y does not only have advantages. It also gives rise to a
major concern. What if a minor component happens to be important for the
regression? And what is the use of a major principal component if it is not related to
Y? The answer to the latter question is simple: it is of little use but it does not harm
either. The problem of discarding minor X-components that possibly are highly
correlated to Y is more severe. One way to address this problem is to include the
minor components in the regression if they are really needed. That is, one should
go on adding principal components in the regression model until Y is fitted well,
provided such a model also passes the (cross-)validation procedure (Section
36.10). Another strategy that is gaining popularity is to enter the principal compo-
nents in a different order than the standard order of descending variance (PCI,
PC2,...). Rather than this top-down procedure, one may apply variable selection:
one starts with the principal component that is most highly correlated with the
331
responses, then move to the PC with the second highest correlation, etc. The only
thing that is needed is to compute the correlation coefficients of the PCs with each
response. For a univariate y the PCs may then be ranked according to their
descending (squared) correlation coefficient. By applying this forward selection
procedure one ensures that highly correlating PCs are not overlooked. For multi-
variate response data Y one should compute for each PC an average index of its
importance for all 7-variables together, e.g. the average squared correlation or the
total variance explained (I I Yl P = 11 u J Yl P) by that PC. Comparative studies have
shown that the latter method of PCR frequently performs better, i.e. it gives good
predictive models with fewer components [9].
Since principal components regression starts with a PCA of X, its solution
depends on the particular scaling chosen for the X-variables. It does not depend on
the scaling of the 7-variables. When the maximum number of factors are used the
regression model becomes equivalent with multivariate regression. There is no
special multivariate version of principal component regression: each K-variable is
separately regressed on the set of X-components. One might also consider
regressing the major PCs of Y or of Y (Eq. 35.14) on the PCs of X.
The purpose of Partial Least Squares (PLS) regression is to find a small number
A of relevant factors that (/) are predictive for Y and (//) utilize X efficiently. The
method effectively achieves a canonical decomposition of X in a set of orthogonal
factors which are used for fitting Y. In this respect PLS is comparable with CCA,
RRR and PCR, the difference being that the factors are chosen according to yet
another criterion.
We have seen that PCR and RRR form two extremes, with CCA somewhere in
between. RRR emphasizes the fit of Y (criterion ii). Thus, in RRR the X-
components t- preferably should correlate highly with the original 7-variables.
Whether X itself can be reconstructed ('back-fitted') from such components t^ is of
no concern in RRR. With standard PCR, i.e. top-down PCR, the emphasis is
initially more on the X-side (criterion /) than on the F-side. CCA emphasizes the
importance of correlation; whether the canonical variates t and u account for much
variance in each respective data set is immaterial. Ideally, of course, one would
like to have the best of all three worlds, i.e. when the major principal components
of X (as in PCR) and the major principal components of Y (as in RRR) happen to
be very similar to the major canonical variables (as in CCA). Is there a way to
combine these three desiderata — summary of X, summary of Y and a strong link
between the two — into a single criterion and to use this as a basis for a
compromise method? The PLS method attempts to do just that.
332
PLS has been introduced in the chemometrics literature as an algorithm with the
claim that it finds simultaneously important and related components of X and of Y.
Hence the alternative explanation of the acronym PLS: Projection to Latent
Structure. The PLS factors can loosely be seen as modified principal components.
The deviation from the PCA factors is needed to improve the correlation at the cost
of some decrease in the variance of the factors. The PLS algorithm effectively
mixes two PCA computations, one for X and one for Y, using the NIPALS
algorithm. It is assumed that X and Y have been column-centred as usual. The
basic NIPALS algorithm can best be demonstrated as an easy way to calculate the
singular vectors of a matrix, viz. via the simple iterative sequence (see Section
31.4.1):
t = Xw (35.19)
wocX^t (35.20)
for X, and
u = Yq (35.21)
q oc Y^u (35.22)
for Y. The «:-symbol is used here to imply that the resultant vector has to be
normalized, i.e. w^w = q^q = 1. In eq. (35.19) t represents the regression coeffi-
cients of the rows of X regressed on w. Likewise, w in eq. (35.20) is proportional to
the vector of regression coefficients obtained by regressing each column (vari-
ables) of X on the score vector t. This iterative process of criss-cross regressions is
graphically illustrated in Fig. 35.6.
Iterating eq. (35.19) and eq. (35.20) leads w to converge to the first eigenvector
of X^X. One may easily verify this by substituting eq. (35.19) into eq. (35.20),
which yields w oc X^t«: X^Xw, the defining relation for an eigenvector. Similarly,
t is proportional to an eigenvector of XX^. It can be shown that the eigenvectors w
and t are the dominant eigenvectors, i.e. the ones corresponding to the largest
eigenvalue. Thus, w and (normalized) t form the first pair of singular vectors of X.
Likewise, q and (normalized) u are the dominant eigenvectors of Y^Y and YY^,
u
T
Fig. 35.7. Principle of PLS/NIPALS algorithm.
respectively, or the first pair of singular vectors of Y. Once this first pair of singular
vectors is determined one extracts this dimension by fitting t to X (or u to Y) and
proceeding with the matrix Ej (or Fj) of residuals. Using the residual matrix Ej (or
Fj) and the basic NIPALS algorithm one may find the pair of dominant singular
vectors which in fact is the second pair of singular vectors of the starting matrix X
(or Y). The process is repeated until the starting matrix is fully depleted.
Instead of separately calculating the principal components for each data set, the
two iterative sequences are interspersed in the PLS-NIPALS algorithm (see Fig.
35.7):
Wcx: X^U (35.23)
t = Xw (35.19)
qocY^t (35.24)
u = Yq (35.21)
One starts the iterative process by picking some column of Y for u and then
repeating the above steps cyclically until consistency. Upon convergence we have
w oc X^u oc X^Yq oc X^YY^t oc X^YY^Xw. Thus, w is an eigenvector of X'^YY'^X
and, similarly, q is an eigenvector of Y^XX^Y [10]. These matrices are the two
symmetric matrix products, viz. (X'^Y)(X'^Y)'^ and (X'^Y)'^(X'^Y), based on the
same cross-product matrix (X^Y). Apart from a factor (n - 1), the latter matrix is
equal to the matrix of inter-set covariances. Another interpretation of the weight
vectors w and q in PLS is therefore as the first pair of singular vectors of the
CO variance matrix X^Y. As we found in Chapter 29 this first pair of singular
vectors forms the unique pair of normalized weight vectors that maximizes the
expression w^(X^Y)q = (Xw)'^(Yq) = t^u. Up to a factor (n - 1), the latter inner
product equals the covariance of the two score vectors t = Xw and u = Yq. This
then leads to the following important interpretation of the PLS factors: t = Xw and
u = Yq are chosen to maximize their covariance [10,11].
334
Let us take a closer look at this covariance criterion. A covariance involves three
terms (see Section 8.3):
Thus, the PLS covariance criterion capitalizes on precisely the three links that
connect two sets of data via their latent factors:
(i) the X-factor t should have appreciable variance, var(t);
(ii) similarly, the K-factor u should have a large variance, var(u), and
(iii) the two factors t and u should be strongly related also (high r^).
Of the three aspects inherent to the covariance criterion (35.26), CCA just con-
siders the so-called inner relation between t and u as expressed by r^, RRR
entirely neglects the var(t) aspect, whereas PCR emphasizes this var(t) compo-
nent. One might maintain that PLS forms a well-balanced compromise between
the methods treated thus far. PLS neither emphasizes one aspect of the X<->Y
relation unduly, nor does it completely neglect any.
The covariance criterion as such suggests a symmetrical situation, X and Y
playing equivalent roles. In fact, up to here, there is little difference with Procrustes
analysis which also utilizes the singular vectors of the covariance matrix (Section
35.2). The difference is that in PLS the first X-factor, say tj, is now used as a
regressor to fit both the X-block and the K-block:
X(=Eo) = t,pT + E , (35.27a)
Y(=:Fo) = t , c ^ + F i (35.27b)
Here, the loading vector pj contains the coefficients of the separate univariate
regressions of the individual X-variables on tj. The/^ element of Pj, py, represents
the regression coefficient of X: regressed on ti'-Pij = ^J^j f tJ ty The full vector of
loadings becomes p, = E^tj / t ^ t j . Similarly, Cj contains the regression
coefficients relating tj to the K-variables: Cj =FQ t^/tjt^.
The residuals of these regressions are collected in residual matrices Ej and Fj:
E,=E,-i,pJ (35.28a)
F, =Fo-t,c;^ (35.28b)
A second PLS factor t2 is extracted in a similar way maximizing the covariance of
linear combinations of the residual matrices Ej and Fj. Subsequently, Ej and Fj are
regressed on t2, yielding new residual matrices Ej and F2 from which a third PLS
335
factor t3 is computed, and so on. If one does not limit the number of factors, the
process automatically stops when the Z-matrix has been fully depleted. This
occurs when the number of factors A equals the rank of X, i.e. A = mm(n - 1, /?). As
for PCR, such a full-rank PLS model is entirely equivalent with multivariate
regression on the original X-variables.
The PLS algorithm is relatively fast because it only involves simple matrix
multiplications. Eigenvalue/eigenvector analysis or matrix inversions are not
needed. The determination of how many factors to take is a major decision. Just as
for the other methods the 'right' number of components can be determined by
assessing the predictive ability of models of increasing dimensionality. This is
more fully discussed in Section 36.5 on validation.
Let us now consider a new set of values measured for the various X-variables,
collected in a supplementary row vector x*. From this we want to derive a row
vector y* of expected F-values using the predictive PLS model. To do this, the
same sequence of operations is followed transforming x* into a set of factor scores
{^1*, ^2, ..., ^^ } pertaining to this new observation. From these r*-scores y * can be
estimated using the loadings C.
Prediction starts by equating yo to the mean (my) for the training data and
removing the mean m^ from x* giving CQ :
y 0 = my
6*0 = X* - n i x
Then we compute the score of the new observation x* on the first PLS dimension
and from that we calculate an updated prediction (y\) and we remove the first
dimension from CQ giving e\:
r, =eo' w,
^1 =^o-h Pi
This sequence is repeated for dimension 2:
62 —Cj — ^2 P 2
and so on.
336
vector w by some scalar constant. Then, the length of the score vector t increases
by the same factor. In contrast, the lengths of the loading vectors p and c decrease
by that factor. Since the loadings appear in combination with scores (e.g. tp^, tc^,
(t^t)(p^p), and (tTt)(c^c)) or with weights (W(P'rW)-i C^), the chosen multipli-
cation factor has no effect.
A slight alternative to the above NIPALS algorithm is replacing the iteration
loop (line 8-14) by:
This formulation is attractive since it gives the weight vector q accurately and
(usually) very fast. The latter holds true since more often than not the number of
response variables (m) is not that large, hence the matrix F^EE^F (mxm) is usually
quite small. This is especially true when Y is univariate. One may verify that in this
situation Y and F become vectors,say y and f, hence the loading 'vector'
c = F'^t/(t'^t) becomes a scalar, c = f^t/(t'^t), and so do the 'matrix' F'^EE'^F
(=f^EE^f) and its eigenvector: q=^=l. The NIPALS procedure also becomes very
efficient when there is a single response y as it requires only one passage through
the iteration steps (lines 9-14). The distinction between multivariate and
univariate Y is an important one and the two situations are distinguished by the
notation PLS2 and PLSl, respectively. In this chapter, which is about relating two
multivariate data tables, the emphasis is on PLS2, and the case of univariate y is not
considered. PLSl regression is considered further in Chapter 36.
35.7.3 Discussion
One may wonder why it is necessary to take the variance of predictors t^ into
account, when the goal, after all, is not to fit X but to fit and predict Y. There is a
good reason, however, to prefer a fit based on a large-variance component t above
an equally good fit based on a component of small variance. This ties in with the
intuitive appeal that a regression relation should preferably be based on a regressor
varying over as wide a range as possible. Or, more precisely, the variance of an
estimated regression parameter is inversely proportional to the variance of the
regressor (Section 8.4 and 10.4).
As an example we try to model the relation between the sensory data of Table
35.1 and the instrumental measurements of Table 35.4. The PLS analysis results
are shown in Table 35.8. The first PLS dimension loads about equally high on
338
S52
^?eaB2
S61
133 S53
U 132 G12
1 322
G13
G11
-2 132 012
142
131
T ' 1 '
-4 - 2 0 2 4
t1
peroxide, K232 and K270 and much less on DK and Acidity. As to the K-variables
all variables are about equally well approximated from tj. Thus the first PLS factor
(tj) tries to model some average overall appearance feature (Ui) of the F-block.
There is a good linear relation between the first X-factor and the first F-factor (Fig.
35.8). One notices that this dimension separates the Spanish olive oils from the
other nationalities. The second dimension captures a different feature. It chiefly
loads on acidity and a little bit on peroxide and DK. This second PLS factor t2 loads
mainly on the *brown' variable and slightly on 'green'. A scatter plot of U2 versus t2
again reveals a fairly good correlation (0.69) and, this time, a contrast between the
Greek and the non-Greek oils (Fig. 35.9). While a plot of u-scores versus t-scores
shows the strength of the linear relation, a scatter plot of the t-scores for the first
few dimensions gives a view of the arrangements of the objects in multivariate
space. From Fig. 35.10 one concludes, for example, that there is a weak clustering
of objects according to the country of origin. Figure 35.11 shows the pattern of
loadings of the predictors and the dependent variables. One observes that the first
PLS factor is associated with a contrast between glossy, transparency and yellow-
ness on the one hand and green, syrupy and brown on the other. One also sees that
the UV measurements K232 and K270 are most relevant for this dimension, or that
K232 is more associated with 'syrupy' and 'brown', whereas K270 and DK are
more associated with the attribute 'green'. The various projections obtained via
PLS regression such as the ones shown in Figs. 35.8-11 are a powerful tool in
interpreting the structure and the relationship of the two datasets.
339
G11
G12 G13
U
1^ 52
022
1
ssg
l3iS61^
142
-l ,—1—r- 1 1 — , — . — 1 — , —
-3 -2 0 1 2 3
t2
G11
G13
k322
133^ ^ ^ ^ ^
131
132 132
G12 U,
-2 142
-3 -2 -1 0
tl
u . a.
ACIDITY
0.6
0.3- DK
d green
i K270
glossy
0.0
transpar
m
^gsyrupy yellow
2 -0.3
PI :ROXIDE
brown
-0.6 — , — , — , — . — . — 1 — . — , — f J
1 — ' — • — 1 — ' — ' —
dim 1
Fig. 35.11. Loadings of the two dominant PLS factors on the X-variables (lower case) and 7-variables
(upper case).
TABLE 35.8
Summary of PLS results
X Y
PLS dimension PLS dimension
OL (2) (3) (4) (5) W_ (2) (3) (4) (5)
Weights
Acidity -21 77 11 53 -24 Yellow 39 -29 -66 5 66
Peroxide -53 -44 -20 61 31 Green -36 40 55 -16 -40
K232 -56 -22 2 -25 -75 Brown -39 -79 47 25 16
K270 -51 16 59 -36 47 Glossy 44 11 13 -12 22
DK -30 35 -76 -38 21 Transpar 41 -2 6 -22 11
Syrupy -42 -32 -1 91 -55
Loadings
Acidity -246 814 107 497 -241 Yellow 375 -199 -337 43 541
Peroxide -507 -368 -174 659 316 Green -343 266 280 -136 -328
K232 -545 -271 -30 -361 -756 Brown -377 -526 243 209 139
K270 ^88 128 625 -292 471 Glossy 423 76 70 -102 188
DK -394 349 -755 -349 218 Transpar 395 -14 31 -181 90
Syrupy -405 -218 -9 734 -456
Scores
Gil -1969 2504 -556 -356 -332 -1595 1662 524 253 -701
G12 724 -708 620 -54 -105 -2075 782 1726 1103 -2969
G12 1205 -221 1106 -55 -385 319 1204 1482 -722 -1085
G13 -1533 1725 1708 282 323 -797 1138 389 -359 837
G22 333 1483 26 130 64 -675 594 -620 643 -204
131 -2617 -878 -499 -254 176 -3220 -957 -491 464 -280
132 -843 -539 -760 -206 -36 283 449 -729 -1422 701
132 -1797 -554 -965 260 350 -2018 606 764 -138 -1191
133 -615 -106 -1024 390 -275 867 -162 -1498 -258 1040
142 2676 -1905 1171 62 -189 -2643 -2423 138 338 1041
S51 1411 -459 647 -387 296 2328 -278 -311 -725 997
S52 1770 144 -218 -985 115 3172 235 -453 -1179 987
S53 1029 -73 -548 267 -5 694 -495 -750 285 283
S61 1542 -513 -281 -129 -134 1158 -869 -112 701 -269
S62 2215 62 -505 391 218 2110 -721 40 522 164
S63 1822 39 77 645 -81 2093 -766 -99 493 646
% VAF
Acidity 17.6 95 95.8 99.7 100 Yellow 40.9 45.6 53.3 53.3 54.9
Peroxide 74.8 90.6 92.7 99.5 100 Green 34.3 42.6 47.9 48.2 48.8
K232 86.2 94.8 94.8 96.9 100 Brown 41.3 73.7 77.7 78.4 78.5
K270 69.1 71 97.5 98.8 100 Glossy 51.9 52.6 52.9 53.1 53.2
Dk 45 59.3 97.8 99.7 100 Transpar 45.4 45.4 45.5 46 46.1
Syrupy 47.6 53.2 53.2 61.6 62.7
Average 58.5 82.1 95.7 98.9 100 Average 43.6 52.2 55.1 56.8 57.4
342
constituting Z = [(1 -a)YlaX], one may, to some extent, control the outcome of the
analysis. If the importance of X is downgraded (a -^ 0), Y dominates and the
analysis becomes entirely equivalent to RRR. In the opposite case (a -^ 1), X will
dominate and the analysis becomes equivalent to PCR. When Y and X are scaled to
equal "size", i.e. sum of squares, they are weighted equally and the results
resemble those of PLS.
The main advantage of PCovR lies in its flexibility of being able to move
between the two extremes. It releases the data analyst from making a choice
between the various techniques, letting the data determine the best predictive
model. One might also consider it an advantage of PCovR compared to PLS that it
employs the familiar least-squares criterion. PCovR also allows for a nice graph-
ical presentation of the dependent and independent variables in conjunction
through a biplot of the augmented Z data table. Figures 35.12-14 show such
biplots for the sensory and physico-chemical data. Figure 35.12 shows the RRR
extreme, optimally displaying the fitted sensory variables (total Y variance
accounted/or by two dimensions, VAF = 54%, i.e. 95% of the maximum attain-
able percentage 57%) and the explanatory variables (VAF = 77%) projected on
this PCs-of-Y plane. Figure 35.13 shows the PCR extreme, optimally displaying
the explanatory variables (VAF = 82%) and the sensory variables (VAF = 51%, or
89% of maximum) projected on this PCs-of-X plane. Figure 35.14 forms the
PCovR compromise, accounting for 53% of the total Y-variance (94% of max-
imum) and 81% of the total X-variance. These results and the accompanying
graphs show once again that the difference between the various methods is
15%
] ^^m
G13C
GREEN \
GS22C
S52C
Q12P GLOSSY
80%
Y^mjp::^afi^^^^^*
»2iErSfflS62C
J .^^^^ ^^ I32P
VEILOIV
\BR6WN
1 1 , 1 1 fJ
-3 0
PCI
Fig. 35.12. Scores, X-loadings and F-loadings for the two dominant RRR dimensions.
344
23%
P
C
2
59%
I YELLOW^
-1 PEROX?DE /
BROWN
142
-2 -1
PCI
Fig. 35.13. Scores, X-loadings and K-loadings for the two dominant PCR dimensions.
20 %
G22
d
i
m S52
IMQSSY
G12 ^ YELLOW
IPEROXYD^
BROWN
1142
—r-
-1
dim 1 (66%)
Fig. 35.14. Scores, X-loadings and K-loadings for the two dominant PCovR dimensions.
345
marginal in this case. The reason is that the principal components of X and those of
Y are already highly correlated, i.e. both sets of data can be explained by the same
latent variables.
Finally, another alternative to continuum regression has been put forward by
Wise and de Jong [18]. Their continuum power-PLS (CP-PLS) method modifies
the matrix X = USV^ into X^^^ = US^^^V^, i.e. the singular values are raised to a
certain power y = oc/(l - a) and a modified ('powered') predictor matrix is
constructed. Then one applies PLS regression and the results are back-transformed
to the original X matrix. Again by changing a from 0 to 1 (or y from 0 to «>) the
model changes from RRR via PLS to PCR.
Different techniques have been presented that can be used to relate two
measurements tables. Are we now in a position to answer the obvious question:
Which one is to be preferred? As expected there is no clear cut answer to this
question. It all depends on the objective of the comparison and the nature of the
data. Some general conclusions can be drawn, however. Procrustes analysis stands
apart from the other techniques in that it is not a regression technique. Its main use
lies in the comparison of two or more data sets which are supposed to be essentially
similar apart from basic transformations such as translation, rotation and
reflection. Procrustes analysis finds those transformations needed to match the sets
of data. Multivariate regression is only useful when both the number of dependent
variables and the number of regressors are not too large. It does not provide an
interpretation in terms of latent variables. All other regression techniques dis-
cussed apply some form of dimension reduction and become essentially equivalent
to multivariate regression when they are carried through to the maximum number
of factors. Canonical correlation analysis is not to be recommended when the
interest lies in understanding or predicting Y from X. However, when the aim is to
discover dimensions in Y and X that are very 'similar', and this in itself is of
substantial interest, then CCA can be a very useful tool. Reduced rank regression is
especially useful when the Y data are highly collinear and the X data are not.
Hence it is especially useful when X is designed and Y is multivariate. It is of little
use when the Y data are nearly orthogonal. Principal component regression is to be
preferred when the X data are highly collinear. It provides little advantage when
the predictor variables are nearly orthogonal. Partial least squares should do well
under most circumstances. When the X data are orthogonal, e.g. when coming
from a designed experiment, PLS becomes equivalent to reduced rank regression.
Principal covariates regression is in many ways quite comparable to PLS. It adds
some additional flexibility, covering RRR and PCR as extreme cases, as does
346
continuum regression. When we are in the lucky circumstance that the major
dimensions of X and Y are strongly related, then it will be hard to miss this by any
technique: all will reveal this strong relation. On the other extreme side, when there
is no relation whatsoever between the two data tables, all techniques will arrive to
the same negative conclusion.
The similarities and differences between the regression methods have been
investigated in various instances (see e.g. [19]). The relation between CCA, PLS,
SIMPLS and RRR has been discussed by Bumham, Viveros and MacGregor [20].
When X is an orthogonal design matrix PLS2 becomes equivalent with RRR [21].
Breiman and Friedman [22] have devised a method, called Curds-and-Whey, that
uses the canonical variables and combines ideas from canonical correlation anal-
ysis and ridge regression (Section 10.6). A thorough discussion of various multi-
variate regression techniques and their application to QSAR problems is given by
Jonathan and Stone [23]. A good collection of papers on the applicability of PLS
regression to QSAR problems can be found in Ref [24]. Total Least Squares (TLS,
[25]) is a technique popular with numerical analysts and engineers. It is a multi-
variate generalization of errors-in-variables regression or orthogonal distance
regression (cf Section 8.2.11), where measurement errors in both X and Y are
explicitly taken into account.
References
1. C. Peri, A. Garrido and A. Lopez, eds., The sensory and nutritional quality of virgin olive oil.
Grasas y Aceitas, 45, (Symposium Issue) (1994) 1-81.
2. J.C. Gower, Generalized Procrustes analysis. Psychometrika, 40 (1975) 33-51.
3. J.M.F. ten Berge, Orthogonal Procrustes rotation for two or more matrices. Psychometrika, 42
(1977)267-276
4. G.B. Dijksterhuis, Procrustes analysis in sensory research, Ch. 7 in: Multivariate analysis of
data in sensory science (T. Naes and E. Risvik, eds), Elsevier, Amsterdam (1996).
5. H. Hotelling, Relationships between two sets of variates. Biometrika, 28 (1936) 321-327.
6. T.W. Anderson, Introduction to Multivariate Statistical Analysis, 2nd ed., Wiley, New York
(1984).
7. P.T. Davies and M.K.S. Tso, Procedures for reduced-rank regression, Appl. Stat., 31 (1982)
244^255.
8. C.J.F. ter Braak, Interpreting canonical correlation analysis through biplots of structure corre-
lations and weights. Psychometrika, 55 (1990) 519-531.
9. Y.L. Xie and J.H. Kalivas, Evaluation of principal component selection methods to form a
global prediction model by principal component regression. Anal. Chim. Acta (1997) 348,
19-27.
10. A. Hoskuldsson, PLS regression methods. J. Chemom., 2 (1988) 211-228.
11. I.E. Frank, Intermediate least-squares regression method. Chemom. Intell. Lab. Syst., 1 (1987)
233-242.
12. H. Martens and T. Naes. Multivariate Calibration. Wiley, Chichester, 1989.
347
13. S. de Jong, Comparison of PLS algorithms. Chemom. Intell. Lab. Syst. (1998)
14. S. de Jong, SIMPLS: an alternative approach to partial least squares regression. Chemom.
Intell. Lab. Syst., 18 (1993) 251-263.
15. M. Stone and R.J. Brooks, Continuum regression; cross-validated sequentially constructed
prediction embracing ordinary least sqaures, partial least squares, and principal component re-
gression. J. Roy. Stat. Soc. B52 (1990) 237-269.
16. R.J. Brooks and M. Stone, Joint Continuum Regression for multiple predictands. J. Am. Stat.
Assoc, 89 (1994) 1374-1377.
17. S. de Jong and H.A.L. Kiers, Principal covariates regression. Chemom. Intell. Lab. Syst., 14
(1992) 155-164.
18. S. de Jong and B.M. Wise, J. Chemom., 12, (1998).
19. I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools.
Technometrics, 35 (1993) 109-135.
20. A. Bumham, R. Viveros, J.F. MacGregor, Frameworks for latent variable multivariate regres-
sion. J. Chemom., 10 (1996) 3 1 ^ 6 .
21. 0. Langsrud and T. Nses, On the structure of PLS in orthogonal designs. J. Chemom., 9 (1995)
483-487.
22. L. Breiman and J.H. Friedman, Predicting multivariate responses in multiple linear regression.
J. Roy. Stat. Soc. B59 (1997) 3-37.
23. M. Stone and P. Jonathan, Statistical thinking and technique for QSAR and related studies. J.
Chemom. 7 (1993) 455-475; 8 (1994) 1-20, 303.
24. S. Wold, in: QSAR: Chemometric Methods in Molecular Design, ed. H. van de Waterbeemd,
VCH, Weinheim, 1995 (especially Ch 3.2, Ch 4.4, Ch 5.1, Ch 5.2).
25. S. Van Huff el and J. Vandewalle, The Total Least Squares Problem: Computational Aspects
and Analysis. SIAM, Philadelphia, PA, 1991.
Chapter 36
Multivariate calibration
36.1 Introduction
i.e. the analysis of chemical systems or processes using multiple sensors, depends
heavily on the applicability of multivariate calibration models for the quantitative
monitoring of the systems or processes of interest. Particularly, the application of
near-infrared spectroscopy to analyze samples needing little or no pretreatment has
found wide-spread use in the chemical and food industries [1]. The technique is
also used to characterize product properties related to composition, e.g. octane
number of gasolines, iodine value of fats and oils or crystallinity of polymers.
Applications outside the field of analytical chemistry are, for example, the pre-
diction of biochemical or pharmacological properties from structural parameters
(QSAR) [2], the understanding of sensory profiles from physico-chemical data in
food research [3], and the modelling of environmental data [4].
The ultimate goal of multivariate calibration is the indirect determination of a
property of interest (y) by measuring predictor variables (X) only. Therefore, an
adequate description of the calibration data is not sufficient: the model should be
generalizable to future observations. The optimum extent to which this is possible
has to be assessed carefully: when the calibration model chosen is too simple
(underfitting) systematic errors are introduced, when it is too complex (overfitting)
large random errors may result {cf. Section 10.3.4).
In many applications the goal of predictive modelling is not a detailed under-
standing of the relation between dependent and independent variables. Ability to
interpret the model, therefore, is not a requirement per 5^. This should not preclude
the exploitation of available background knowledge on the problem at hand during
calibration modelling. A model that can be sensibly interpreted certainly adds
value and confidence to the calibration result.
Section 36.2 deals with the types of regression models that are commonly used
in multivariate calibration. The treatment of the subject will mostly focus on
multi-component analysis, i.e. predicting the analyte concentrations in a mixture
from some multi-channel responses. We will gradually add complexity to the
analytical chemical problem (unknown pure spectra, interferents, less samples
than channels, etc) which automatically calls for different calibration methods.
These topics are better addressed after modelling aspects have been discussed.
Section 36.3 deals with the important issue of validation and assessing the predic-
tive performance. Section 36.4 treats procedures preceding the actual modelling
stage, viz. design and data pretreatment and outlier detection. We conclude in
Section 36.5 with briefly mentioning some recent developments, such as wave-
length selection and calibration standardization. We will also discuss non-linear
modelling as an extension to the mainstream linear calibration setting.
351
Let us assume that we have collected a set of calibration data (X, Y), where the
matrix X (nxp) contains thep > 1 predictor variables (columns) measured for each
of n samples (rows). The data matrix Y (nxq) contains the q variables which
depend on the X-data. The general model in calibration reads
Y = / ( X ; 0 ) + EY (36.1)
where / is some multivariate function, 0 is a vector of parameters and Ey is the
error. We will mainly discuss linear calibration functions, i.e./is a linear function
of the parameters. The case of non-linear regression modelling is treated for the
univariate case in Chapter 11. Non-linear multivariate modelling is briefly dealt
with in the concluding Section 36.5 on recent developments.
There are several good reasons to focus on linear models. Theory may indicate
that a linear relation is to be expected, e.g. Lambert-Beer's law of the linear
relationship between concentration and absorbance. Even when a linear relation
does not hold strictly it can be a sufficiently good local approximation. Finally, one
may try and find a transformation of the individual variables (e.g. a logarithmic
transformation), in order to obtain an acceptable linear model for the transformed
variables. Thus, we simplify eq. (36.1) to
Y = X B +EY (36.2)
nxq nxp pxq nxq
used to deal with such conditions. The main feature of these methods is that the
spectral data are projected onto a low-dimensional subspace of the ^wavelengths
(predictors) retaining nearly all relevant information. These methods are discussed
in Sections 36.2.4 and 36.2.5. Finally, in Section 36.2.6, we briefly discuss some
other approaches to the multivariate linear calibration problem.
From now on, we adopt a notation that reflects the chemical nature of the data,
rather than the statistical nature. Let us assume one attempts to analyze a solution
containing p components using UV-VIS transmission spectroscopy. There are n
calibration samples ('standards'), hence n spectra. The spectra are recorded at q
wavelengths ('sensors'), digitized and collected in an nxq matrix S. The infor-
mation on the known concentrations of the chemical constituents in the calibration
set is stored in an nxp matrix C. Each column of C contains the concentrations of
one of the p analytes, each row the concentrations of the analytes for a particular
calibration standard.
The calibration standards must have been well-chosen. In the simplest case it
suffices to measure the spectra of the respective single-component solutions. This
presumes that pure samples can be made, that the whole range of concentrations is
relevant and that one is convinced that the relation is linear over the entire
concentration range. Normally, not all of these conditions will be fulfilled, al-
though it may apply, for example, to the case of re-calibrating an instrument for a
system that has already been amply studied. Usually, we would have to set up a
design for measuring a number of mixtures spanning the range of interest.
Assuming Lambert-Beer's law to hold, we may write
S = C K + Es (36.4)
nxq nxp pxq nxq
TABLE 36.1
Concentrations and digitized spectra (xlOO) of four calibration samples (compare Fig. 36.1a)
Sample 1 2 3 5 15 25 35 45 55 65 75 85 95
Note that eq. (36.5) is a collection of many univariate multiple regression models:
for each wavelength y the multiple regression of the corresponding spectral 'chan-
nel', i.e. Sp on the concentration matrix C yields a vector of regression coefficients,
k^ (theyth column of K). For K to be estimable C^C must be invertible, i.e. the
number of calibration standards should at least be as large as the number of
analytes. It is clearly not possible to obtain, directly or indirectly, say 3 pure
spectra from recording the spectra of just 1 or 2 standards of known composition.
In practice, the condition n>p, or more precisely rank(C) =p, is hardly a restriction.
As an example, we give in Table 36.1 data on calibration spectra at 10
wavelengths of 4 calibration standards for 3 analytes. The corresponding spectra
are shown in Fig. 36.1a at a 10-fold higher resolution. The pure spectra calculated
according to eq. (36.5), using all 100 wavelengths, are displayed in Fig. 36.1b.
In many cases, one may measure spectra of solutions of the pure components
directly, and the above estimation procedure is not needed. For the further dev-
elopment of the theory of multicomponent analysis we will therefore abandon the
hat-notation in K. Given the pure spectra, i.e. given K (pxq), one may try and
estimate the vector of concentrations c^^^ (pxl) of a new sample from its measured
spectrum Sne^(^xl):
sLw-cLwK (36.6)
calibration spectra
0.2 [- ,."i
1 I '. •'! /•
15 0.1 L
c
So 0
•vvvl
-0.1
1* . /',
-0.2
0 50 100
channel
Fig. 36.1. (a) Spectra of four different calibration standards with three analytes of known composition;
(b) Extracted pure spectra for the three analytes; (c) Regression weights for estimating concentrations
of the three analytes; (d) New spectrum resolved into three spectral contributions from pure analytes.
or
^T _ T »
(36.9)
Here, BQ^ = K^(KK^)"^ is the final qxp matrix of regression coefficients for
converting a spectral measurement into concentration estimates. For KK^ (pxp) to
be invertible K should be of full rank. A first requirement for this is that p<q, i.e.
the number of chemical components should not exceed the number of wave-
lengths. Furthermore, the set of pure spectra in K should be independent, i.e. no
pure spectrum may be an exact linear combination of the other pure spectra.
Elaborating on the example we show in Fig. 36.1c the three regression models
(regression coefficients in columns of B^LS) ^^^ the three analytes. Each of these
356
models has the property that it gives an estimate close to zero when applied to the
pure spectra of the other two analytes. Figure 36. Id gives the (simulated) spectrum
of a supposedly unknown mixture and its reconstruction in terms of concentra-
tion-weighted pure spectra. The estimated composition (34.7, 16.0, 64.1) reflects
the true composition (35.0,15.0,65.0) well. The greatest error occurs for analyte 2,
which is clearly overestimated at the cost of analytes 1 and 3. Such a result is in line
with the spectral characteristics of the analytes, viz. the fact that the spectrum of
analyte 2 strongly overlaps with those of 1 and 3 (see Fig. 36.1b).
The attraction of the CLS approach is that the whole spectral domain is used for
estimating each constituent. Using redundant information has an effect equivalent
to replicated measurement and signal averaging, hence it improves the precision of
the concentration estimate. This is true if the entire spectral region contains
information with respect to the constituents, which need not be the case. A further
improvement in precision is possible by taking the error structure of S into account
using Generalized Least Squares (GLS) regression [5,6]. The error of the spectral
signal is not necessarily constant over the range of wavelengths, e.g. it may
increase with the signal itself. In this situation of heteroscedasticity a more
efficient estimator than the one given by eq. (36.9) for c^g^ is
c„,, = (KV-^KVKV-is„,^ (36.10)
where V (qxq) is the error covariance. It may be estimated from the matrix of
IS][dual sEs:
C = S P + Ec (36.13)
where P (qxp) is a matrix of regression coefficients. Each column of P contains a
set of q weights for a particular analyte. Multiplying the q weights of the jth
column of P with the corresponding spectral intensities and summing the results
yields a concentration estimate for theyth analyte.
The P-matrix is chosen to fit best, in a least-squares sense, the concentrations in
the calibration data. This is called inverse regression, since usually we fit a random
variable prone to error (>;) by something we know and control exactly (x). The
least-squares estimate P is given by
F = (S^Sr^S^C. (36.14)
It is only possible to invert S^S (pxp) if the matrix of calibration spectra S is of full
rank/?. As a consequence, the number of calibration samples should at least be as
large as the number of wavelengths, n > p. This precludes the use of the full
spectrum and poses the problem of selecting the 'best' wavelengths. The variable
selection methods discussed in Chapter 10 (forward selection, stepwise regression,
all possible subsets regression; cf. Section 10.3.3) or genetic algorithms (Chapter
27) can be used to determine a good set of wavelengths. A major concern when
making this selection is the correlation between the absorbances at the various
wavelengths for the calibration set. Highly dependent predictors inflate the vari-
ance of the regression coefficients P, leading to unreliable predictions (cf. Section
10.5), particularly if the prediction samples move out of the region covered by the
calibration samples. The PCR and PLS methods provide a way out for the problem
of wavelength selection and the associated collinearity problem.
The advantage of the inverse calibration approach is that we do not have to
know all the information on possible constituents, analytes of interest and inter-
ferents alike. Nor do we need pure spectra, or enough calibration standards to
determine those. The columns of C (and P) only refer to the analytes of interest.
Thus, the method can work in principle when unknown chemical interferents are
present. It is of utmost importance then that such interferents are present in the
calibration samples. A good prediction model can only be derived from calibration
data that are representative for the samples to be measured in the future.
358
With this proviso a new unknown sample with the spectrum s^^^^ can be directly
translated into a concentration estimate:
cL=SnewB,LS (36.15)
where the matrix of prediction vectors BJLS (qxp) has already been established in
the calibration step:
B,Ls=P = (S'^Sr^S'^C. (36.16)
It should be noted that Bj^s is just a collection of separate multiple regression
models. Each column h^^ of BILS is associated with a particular analyte and
follows from a multiple regression of the corresponding column c of C on the
predictor matrix of spectra S.
T = SoW (36.18)
In PCA, the weights W and the normalized loadings P are identical {cf. Chapters
31 and 35). For any other data decomposition, e.g PLS, weights and loadings
differ.
The main idea of PCA is to approximate the original data, spectra in our case, by
a small number {A < r) of factors, discarding the minor factors:
S-l„s'^ + rP*T (36.19)
The suffix * in T* {nxA) and P* {qxA) indicates that only the first A columns of T
and P are used, A being much smaller than n and q. In principal component
regression we use the PC scores as regressors for the concentrations. Thus, we
apply inverse calibration of the property of interest on the selected set of factor
scores:
C = l„c'^ +T*Q* + Ec (36.20)
The regression coefficients Q* follow immediately by least-squares estimation
Q* = {T^Ty'T^C^ (36.21)
where CQ = C - 1 ^ c^ is the matrix of concentrations, expressed as deviations from
the mean composition c. Substituting eq. (36.18) into eq. (36.20) gives
C = l „ c ^ +SoW*Q* (36.22)
low X-content
velcngth
samples
high X-content
1100
Fig. 36.2. Raw data: near-infrared spectra of a food product with different X-content.
predictors than samples. There are no problems of collinearity since the regressors
in T* are uncorrelated. There is no problem of having too many variables {q > n),
since we have made the step from the q original variables, possibly hundreds of
wavelengths, to just a few PCs (dimension reduction). There is no problem of
selecting predictor variables: all are represented in the PCs via the weights in W*
that define T*. The only problem is choosing A, the number of PCs to be used for
the calibration model. This important decision is given separate attention in
Section 36.3.
As an example we give in Fig. 36.2 a set of near-infrared calibration spectra of a
food product along with the varying content of a major component that we must
call X for proprietary reasons. The food product is measured as such in reflection
mode. We see that there is little variation in the spectra. Figure 36.3 gives the
mean spectrum (A) and the 'spectrum' of standard deviations (B). Comparing A
and B, we observe bands at roughly the same wavelengths, e.g. at wavelengths
1450 and 1880, indicating that the variation in intensity is largest at the peaks.
Figure 36.4 shows the loadings for the major 4 PCs (cf. Section 17.6.2). Notice
that a loadings vector has an entry for each wavelength, so it can be drawn as a
spectrum-like graph. One observes that the loadings plot of PCI more or less
reflects the standard deviation spectrum (Fig. 36.3b), and that practically all
individual loadings are positive. The implication is that PCI gives positive weight
361
250
0.04
Fig. 36.3. Summary of spectral information: (a) average spectrum, (b) standard deviation spectrum.
to each wavelength and that the resulting PCI scores more or less reflects the total
intensity of the spectra (c/. Section 31.3 where we also found that the first PC
represents overall size when the measurements are positively correlated). The
loadings of PC2, PC3, etc. show more complex patterns. Negative parts enter since
the different loadings spectra are orthogonal, i.e. the inner products of the loading
vectors are zero. One may also interpret the loadings spectra of the higher PCs in
terms of contrasts between different spectral regions. Plots such as Fig. 36.4 are
helpful in interpreting the nature of the PCs from a spectroscopic point of view.
They also can aid in detecting artefacts in the data. For example, a spike in one of
the spectra will show up as a spike in one or more of the loadings plots at the
'faulty' wavelength.
Figure 36.5 shows the scores plots for PC2 v. PCI (A) and PC4 v. PC3 (B). Such
plots are useful in indicating a possible clustering of samples in subsets or the
presence of influential observations. Again, a spectrum with a spike may show up
as an outlier for that sample in one of the scores plots. If outliers are indicated, one
should try and identify the cause of the outlying behaviour. Only when a satis-
factory explanation is found can the outlier be safely omitted. In practice, one will
362
PCI PC 2
-0.1
PC 3 PC 4
0.2
' '
D
GO
0.1 s
c
n 1 V^J_ ^ 1
Fig. 36.4. Spectral loadings for the first four principal components.
also remove the outlier solely on the basis of its extreme deviation. The most
difficult cases arise when suspect outliers are only mildly deviating and there is no
good explanation for the outlying behaviour. In our example, there are no such
remarkable outlying features in the score plots. Robust methods for PCA and PCR
have been developed which are less vulnerable to deviating observations {cf.
Section 36.4.3).
Finally, one may plot the X-content against any of the PC scores (Fig. 36.6). In
this case we observe a relationship of the amount of X with PC2 and with PC3. The
amount of spectral variance explained by the PCs is shown in Fig. 36.7. It would
appear that the first four PCs account for practically all variation (99.2 %). Thus, a
model with A=4 PCs will capture most of the spectral variation and, hopefully,
most of the correlation with the X-content. The estimated model {cf. eq. 36.20) is
0.6 — " T
A 0 0
0.4 \- « o
o
..
O O
0 0
0.2 L 0
0
0 ^" 0 o o
0
o o Oo
<h
U 0
o 0 0
0 0 ^
o 0 O 0 0 0 0 O
o
-0.2 - G 0 8 0
0°
|0 0 1
-0.4 1 1 1 1 1
B : " 0 0 O
0.1
0 0 „ 00 °
o cP Op G O
Oo 0
0 o " "o"
5 -0.1 ^' 0 Oo G
o "
-0.2 h - • • • • • o
o o
-0.3
-0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4
PC3 scores
Fig. 36.5. Score plot of the samples in PC space: (a) scatterplot of PC2-scores vs. PCl-scores, (b)
scatterplot of PC4-scores vs. PC3-scores.
where the constant term represents the average X-content in the calibration set,
^mois =50.3, and the regression coefficients correspond to the slopes in Fig. 36.7. If
we multiply the weights (= loadings) given in Fig. 36.4 with these corresponding
slopes and add up the result we arrive at the regression model shown in Fig. 36.8.
This is the graphical version of eq. (36.24). It gives the weights bp^R in the closed
model (Cn = c„ ,1^ + SJ bpcR) which describes how to convert spectral
measurements into best fitting estimates of the X-content, based on 4 PCs as an
intermediary. Figure 36.9 gives the fitted versus the observed X-content.
We chose the number of PCs in the PCR calibration model rather casually. It is,
however, one of the most consequential decisions to be made during modelling.
One should take great care not to overfit, i.e. using too many PCs. When all PCs
are used one can fit exactly all measured X-contents in the calibration set. Perfect
as it may look, it is disastrous for future prediction. All random errors in the
calibration set and all interfering phenomena have been described exactly for the
calibration set and have become part of the 'predictive' model. However, all one
needs is a description of the systematic variation in the calibration data, not the
364
50 PCI scores 50
PC2 scores
A B
45 ~ ~ 45 o o o
u ft) tOo o 0 ()
0 0
^^?^->aO ^ O 0
40 - • • - •
o 0 %'^^^^«i:^, o o - 40 0 o
0 '^"°^>;-^ «
X <>„o 0 o°o'' 0^ oo&b
0 35 0 0
35 o o
in 1 1 I
30
-0.5 0.5 -0.5 0.5
50 50
D
45 h 45 - -
o
40 o
o
o o
35 o
30
0.4 -0.2 0 0.2 0.4
PC3 scores PC4 scores
Fig. 36.6. Scatter plot of dependent variable (X-content) versus the first four principal components.
random variation. As a model becomes more complex, e.g. using more PCs in PCR
calibration, the noise term starts to dominate the added systematic contribution.
Procedures to establish the optimum complexity, i.e. choosing the best-predictive
calibration model from a range of increasingly complex models, are given in
Section 36.3.
Thus far w^e have discussed the traditional method of applying PCR, where the
PCs are chosen in order of decreasing variance (top-down approach). Another
approach is to calculate the correlation of each PC with y and to enter the PCs in
order of their decreasing squared correlation. Since the PC scores are uncorrelated,
this is equivalent to applying multiple regression with variable selection on the set
of PCs using forward selection (cf. Section 10.3.3). Several studies, e.g. Refs.
[7,8], have shown that, with respect to prediction accuracy, this correlation-PCR
approach performs no worse and often better than the traditional top-down app-
roach. Correlation-PCR often gives simpler models with a smaller number of PCs
than top-down PCR. The added advantage of such parsimonious models is that
they are easier to interpret.
365
100
>
Fig. 36.7. Percentage variance of X-content explained by the principal components from spectral data.
Individual percentages (bars) are shown as well as cumulative percentages (circles).
I • T — T
2r
45 0 °
O ,
1 CP
T3
o
U
o[ 40
-1
35 0
-21-
100 200
35 40 45
wavelength
X-content (measure
Fig. 36.8. Regression coefficients obtained Fig. 36.9. Final calibration showing fitted versus
from PCR model. measured values.
366
The procedure for PLS calibration is very akin to PCR calibration. It also
proceeds via a small set of orthogonal factors constructed from the predictor
variables. The main difference with PCR lies in the way the factors are determined.
This has been discussed in Chapter 35. In PLS regression the factor is not solely
determined by the spread (variance) of the predictor data; the correlation of the
PLS factor with the variable to be predicted also plays a role. A high-variance
factor that is not at all correlated to the dependent property is given less weight by
PLS from the outset. In PCR, such a factor at first seems important on account of its
high variance. However, it is downweighted in the final model through the small
regression coefficient q^ for that factor (see, for example, the low coefficient for
PCI in the PCR regression model of eq. (36.25)). Since in PLS regression such
factors are avoided, PLS models often have less factors than PCR models built on
the same data. This aspect of parsimony is sometimes seen as an advantage of PLS
over traditional PCR. Correlation-PCR behaves like PLS in this respect. Another
difference between PLS and PCR is that the loadings P, even when normalized to
unit length, differ from the weights W. Qualitatively, however, e.g. with regard to
the sign pattern, the difference often is not large. Usually one plots the loadings
since they have a simpler interpretation. In Chapter 35 we showed that the loadings
P are the regression coefficients obtained from regressing the predictors (spectral
signals at each wavelength) on the PLS factor scores. One may consider the
loadings (columns of P) as abstract spectra from which the measured spectra can
be reconstructed. A high loading p^j^ implies that the spectral region around thejth
wavelength has a strong contribution from the kih factor.
PLS calibration of a multicomponent system can be performed in two different
ways. One may do a separate regression for each analyte. Such univariate (in >^)
regressions are called PLSl regressions. One may also model the various analytes
collectively in one and the same multivariate PLS2 regression model. Which of
these two approaches should be chosen? The use of PLS2 regression has a few
advantages. Firstly, there is one common set of PLS factors T for all analytes. This
simplifies interpretation and enables a simultaneous graphical inspection. Secondly,
when the analyte concentrations are strongly correlated one may expect on theo-
retical grounds that the PLS2 model is more robust than separate PLSl models.
This is especially true when the Y matrix is closed, as for compositional data.
Finally, when the number of analytes is large the development of a single PLS2
model is done much quicker than the development of many separate PLS 1 models.
Practical experience, however, indicates that PLSl calibration usually performs
equally well or better in terms of predictive accuracy. Thus, when the ultimate
requirement of the calibration study is to enable the best possible predictions, a
separate PLSl regression for each analyte is advised.
367
There are many more methods that are well suited for calibration of collinear
data. Latent root regression [9,10], principal covariates regression [11], total least
squares [12], and reduced rank regression [13] are related to PCR and PLS in that a
calibration model is based on orthogonal factors derived from the predictors
(spectra). However, a comparative evaluation of these techniques has not yet been
published. Continuum regression [14,15] provides a continuum of models which
can be viewed as interpolations between three 'anchor' models, viz. PCR, PLS
(more specifically SIMPLS, Section 35.7.4), and MLR (or reduced rank re-
gression). It still has to be established whether there is a need for a continuum of
models filling the gaps between three anchors. Ridge regression is a technique that
has been specially devised for dealing with multi-coUinear data (cf. Section 10.6).
It replaces the OLS regression estimator (X'^X)-^X'^y by (X^X+kiy^X^y. The addi-
tion of the constant k, the ridge parameter, to the diagonal of X^X has the effect of
artificially decreasing the dependence among the X-predictors. As a result it leads
to regression coefficients that are slightly biased towards the low side (shrinkage
property). This small systematic error is, however, more than compensated by the
beneficial effect of the greatly reduced variance of the estimates. The value of k
can be identified from the graph of the regression parameters as a function of /:,
namely as the value where the regression parameters start to stabilize. In a
comparative simulation study it has been shown that ridge regression performs as
well as PCR or PLS, all of them outperforming MLR with forward variable
selection [16]. Other comparative studies also have shown that often the results
obtained with various prediction methods are very similar. When each method is
applied carefully it turns out that there is no overall superior technique. Thus, it is
better to get well acquainted with one or two methods and to apply that method in a
professional way, rather than applying different techniques which are known only
superficially. Even multiple linear regression (MLR, Chapter 10) has regained
some of its lost respectability in the field of multivariate calibration. In combina-
tion with modem wavelength selection procedures (Section 36.5.1) and a safe-
guard for overfitting, well-performing models can be derived that are based on a
small number of wavelengths.
is assumed that the responses are a linear function of the analyte concentrations.
The calibration model reads
R = ( V J + AC) K + E (36.26)
where R (nxq) is the set of responses measured on the calibration samples accord-
ing to the 'design' matrix AC (nxp), containing the added concentrations. The
unknown concentration is given by the vector CQ 07X1) and K (pxq) is the unknown
matrix of sensitivities of the q responses with respect to the p different analytes.
One may eliminate the unknown concentration vector CQ by subtracting the res-
ponse vector TQ corresponding to the original sample, without standard additions,
from each row of the response matrix R, giving AR, and at the same time sub-
tracting the unknown CQ from the matrix (l^c J + AC). The sensitivities K can be
estimated from this reduced system of equations using multiple linear regression
of the corrected responses (AR) on the multiple standard additions (AC). Given the
estimate
K = ((AC)'^(AC))-^ (AC)'^ AR
one may estimate the unknown concentration as:
Co=K^(KKT)-^ro (36.27)
More efficient estimation methods exist than the simple method described here
[17]. The generalized standard addition method (GS AM) shares the strong points
(e.g correction for interferences) and weak points (e.g. error amplification because
of the extrapolation involved) of the simple standard addition method [18].
36.3 Validation
PRESS = I ( q - c , ) 2 (36.28)
5
# factors
Fig. 36.10. Prediction error (RMSPE) as a function of model complexity (number of factors) obtained
from leave-one-out cross-validation using PCR (o) and PLS (*) regression.
all. Apparently, the first PC-factor captures an important source of spectral vari-
ation that has no predictive value.
Leaving out one object at a time represents only a small perturbation to the data
when the number (n) of observations is not too low. The popular LOO procedure
has a tendency to lead to overfitting, giving models that have too many factors and
a RMSPE that is optimistically biased. Another approach is k-fold cross-validation
where one applies k calibration steps (5 < k < 15), each time setting a different
subset of (approximately) n/k samples aside. For example, with a total of 58
samples one may form 8 subsets (2 subsets of 8 samples and 6 of 7), each subset
tested with a model derived from the remaining 49 or 50 samples. In principle, one
may repeat this /:-fold cross-validation a number of times using a different splitting
[20].
Van der Voet [21] advocates the use of a randomization test (cf. Section 12.3) to
choose among different models. Under the hypothesis of equivalent prediction
performance of two models, A and B, the errors obtained with these two models
come from one and the same distribution. It is then allowed to exchange the
observed errors, e^^ and e^^, for the /th sample that are associated with the two
models. In the randomization test this is actually done in half of the cases. For each
object / the two residuals are swapped or not, each with a probability 0.5. Thus, for
all objects in the calibration set about half will retain the original residuals, for the
other half they are exchanged. One now computes the error sum of squares for each
of the two sets of residuals, and from that the ratio F = SSEJSSE^. Repeating the
process some 100-200 times yields a distribution of such F-ratios, which serves as
a reference distribution for the actually observed F-ratio. When for instance the
observed ratio lies in the extreme higher tail of the simulated distribution one may
371
less evenly over the experimental region. One way to generate such a space filling
design is by applying the Kennard-Stone algorithm {cf. Section 24.4.2). Generally,
this will put less weight on the center of the experimental region and more on the
extremes which should give more precise estimates. Feam [24] gives an interesting
discussion on the two choices, the main message being that when the number of
calibration objects is limited the latter choice may be preferable. One is not always
in a position to choose the samples. In that case it is never wise to discard samples
from a given calibration set for the sole sake of making the distribution of
calibration objects in the region of interest more uniform.
When a linear model does not hold over the full range of calibration samples
there are two options. One may apply a nonlinear regression method to the
complete set of data {cf. Section 36.5.3) or one may split the experimental region
into smaller subregions and estimate a separate linear model for each. Naes and
Isaksson [25] use fuzzy clustering for splitting the training sets into smaller subsets
with improved linearity in each group.
spectral data to get rid of uncontrolled random noise. Various options are available
to implement such smoothing {cf. Chapter 40): moving window (box car) aver-
aging, Fourier filtering, Savitzky-Golay smoothing. A side effect of smoothing is
that the spectral resolution may be lost.
A very common pre-treatment of spectral data is to convert spectra to first- (or
second)- derivative form [27]. This has the effect of removing any offset and
constant slope (or curvature). Applying second derivatives has the advantage of
sharpening peaks and resolving overlapping bands to some extent, although it also
introduces spurious satellite peaks. Derivatization in general amplifies the noise in
the data (Section 40.5.5). As a remedy to the latter drawback one may carry out the
derivatization in combination with some degree of smoothing, for example, em-
ploying Savitzky-Golay filtering (Section 40.5.2.3).
An effective preprocessing method is the use of standard normal variates
(SNV). This type of standardization boils down to considering each spectrum x^ as
a set of q observations and calculating their z-scores:
z. = ( x . -x)l s (36.30)
It has the effect of removing an overall offset by subtracting the mean spectral
reading x and it corrects for differences affecting the overall variation. In various
settings it has been found to be an effective preprocessing method.
Another popular form of data pre-processing with near-infrared data is the
application of the Multiplicative Scatter Correction (MSC, [28]). It is well known
that particle size distribution of non-homogeneous powders has an overall effect
on the spectrum, raising all intensities as the average particle size increases.
Individual spectra x^ are approximated by a general offset plus a multiple of a
reference spectrum, z.
x. = a^-\-b^z + e,. (36.31)
The offset a^ and the multiplication constant b^ are estimated by simple linear
regression of the /th individual spectrum on the reference spectrum z. For the latter
one may take the average of all spectra. The deviation e, from this fit carries the
unique information. This deviation, after division by the multiplication constant, is
used in the subsequent multivariate calibration. For the above correction it is not
mandatory to use the entire spectral region. In fact, it is better to compute the offset
and the slope from those parts of the wavelength range that contain no relevant
chemical information. However, this requires spectroscopic knowledge that is not
always available.
A special type of data pre-treatment is the transformation of data into a smaller
number of new variables. Principal components analysis is a natural example and
we have treated it in Section 36.2.3 as PCR. Another way to summarize a spectrum
in a few terms is through Fourier analysis. McClure [29] has shown how a NIR
374
36.4.3 Outliers
Brown et al. [36] published an interesting, simple method for selecting wave-
lengths in NIR calibration. The method boils down to ranking the wavelengths in
order of diminishing squared correlation (R^) with the analyte concentration. The
model is then built using the first m wavelengths, j = 1 ...m, from this ordered list as
a weighted average of the m simple regression models corresponding to these
wavelengths. The idea is to minimize the confidence interval for future pre-
dictions. As m increases, so does the total signal-to-noise ratio (sum of F-ratios or
^-values for the m separate, simple regressions), but also does the model complex-
ity. The optimum number of wavelengths m is established by minimizing the ratio
X^(m;a)/(S^J), where X^(m;a) is the critical value of a chi-square distribution with
m degrees of freedom at the chosen level of confidence. Having found the selection
of wavelengths one may proceed using any of the aforementioned regression
methods, e.g. ILS or GLS.
Many other new developments are under way in this area of variable selection.
Wavelength selection can also be done using forward selection using covariance
rather than correlation as a criterion (Intermediate Least Squares [37,38]). One
may see this as the PLS analogon to forward selection in MLR. Recently, also
genetic algorithms {cf. Chapter 27) have been applied to the problem of finding
small sets of predictive wavelengths among the legion of candidate wavelengths
[39]. The challenge with the application of such methods is not to fall in the trap of
overfitting. Another major problem is that of chance correlations: with the large set
376
All other approaches try and relate the child spectra to the parent spectra. In the
patented method of Shenk and Westerhaus [41: Sh], in its simplest form, one first
applies a wavelength correction and then a correction for the absorbance. Each
wavelength channel / of the parent instrument is linked to a nearby wavelength
channel j(i) in the child instrument, namely the one to which it is maximally
correlated. Then, for each pair of wavelengths, / for the parent Mid j(i) for the child,
a simple linear regression is carried out, linking the pair of measured absorbances
In this way the child spectrum is transformed into a spectrum as if measured on the
parent instrument. In a more refined implementation one establishes the highest
correlating wavelength channel through quadratic interpolation and, subsequently,
the corresponding intensity at this non-observed channel through linear inter-
polation. In this way a complete spectrum measured on the child instrument can be
transformed into an estimate of the spectrum as if it were measured on the parent
instrument. The calibration model developed for the parent instrument may be
applied without further ado to this spectrum. The drawback of this approach is that
it is essentially univariate. It cannot deal with complex differences between dis-
similar instruments.
In the direct standardization introduced by Wang et al. [42] one finds the
transformation needed to transfer spectra from the child instrument to the parent
instrument using a multivariate calibration model for the transformation matrix:
Z^ = ZgF. The transformation matrix F (qxq) translates spectra Zg that are
actually measured on the child instrument B into spectra Z^ that appear as if they
were measured on instrument A. Predictions are then obtained by applying the old
calibration model b^ to these simulated spectra Z^:
y^=Z^h^=Z^Fh^, (36.34)
giving
be = Fb^ (36.35)
In recent years there has been much activity to devise methods for multivariate
calibration that take non-linearities into account. Artificial neural networks (Chap-
ter 44) are well suited for modelling non-linear behaviour and they have been
applied with success in the field of multivariate calibration [47,48]. A drawback of
neural net models is that interpretation and visualization of the model is difficult.
Several non-linear variants of PCR and PLS regression have been proposed.
Conceptually, the simplest approach towards introducing non-linearity in the
regression model is to augment the set of predictor variables (jCj, ^2,...) with their
respective squared terms (jc,^, jc|,...) and, optionally, their possible cross-product
terms (JCJJC2, ...). Since the number of predictors grows appreciably, PCR or PLS
regression is called for. A non-linear variant of PLS employing splines for the
inner relation between y and the r-scores has been proposed that has some analogy
with neural nets. However, in multivariate calibration this splines-PLS approach
[49] has not yet met with success. In fact, using a quadratic regression model using
factor scores from PC A can be just as effective [50]. One may employ linear PLS
regression as a first step and then proceed with these PLS scores in a quadratically
extended regression model (LQ-PLS, [51]). Locally weighted regression (LWR],
an approach combining elements of PCA or PLS, weighted regression and local
modelling, has been more successful [52]. In this approach one starts with a
transformation of the spectra into a few PC scores. The spectrum of any new
sample is transformed to the same PC space and a small set of similar spectra from
the calibration set is determined using the Mahalanobis distance as a criterion for
379
similarity. Multiple linear regression is then used to relate the response y to the PC
scores for this small local set and this is used as an interpolating model to estimate
the unknown response for the new sample. The number of PC dimensions, the
number of neighbours are determined through cross-validation. More elaborate
extensions of this approach not only take spectral similarity but also the estimated
chemical similarity into account [53]. An interesting study comparing a variety of
modem non-linear methods is given in Ref. [54].
References
1. K.I. Hildrum, T. Isaksson, T. Nses and A. Tandberg, Near Infra-red Spectroscopy Bridging the
Gap between Data Analysis and NIR Applications. Ellis Horwood, New York, 1992.
2. H. van de Waterbeemd, ed., QSAR: Chemometric Methods in Molecular Design. VCH,
Weinheim, 1995.
3. T. Naes and E. Risvik (Editors), Multivariate Analysis of Data in Sensory Science, Data Han-
dling in Science and Technology Series. Elsevier, Amsterdam, 1996
4. P.K. Hopke and X.-H. Song, The chemical mass balance as a multivariate calibration problem.
Chemom. Intell. Lab. Assist., 37 (1997) 5-14.
5. P.J. Brown, Measurement, Calibration and Regression. Clarendon Press, Oxford, 1993.
6. T. Nses, Progress in multivariate calibration, pp. 52-60, in Ref. [1].
7. J.M. Sutter, J.H. Kalivas and P.M. Lang, Which principal components to utilize for principal
component regression. J. Chemometr., 6 (1992) 217-225.
8. Y.L. Xie and J.H. Kalivas, Evaluation of principal component selection methods to form a
global prediction model by principal component regression. Anal. Chim. Acta, 348 (1997)
19-27.
9. R.F. Gunst and R.L. Mason, Regression Analysis and its Application: A Data-Oriented Ap-
proach. Marcel Dekker, New York, 1980.
10. E. Vigneau, D. Bertrand and E.M. Qannari, Application of latent root regression for calibration
in near-infrared spectroscopy. Comparison with principal component regression and partial
least squares. Chemometr. Intell. Lab. Syst., 35 (1996) 231-238.
11. S. de Jong and H.A.L. Kiers, Principal covariates regression. Chemom. Intell. Lab. Syst., 14
(1992) 155-164.
12. S. Van Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects
and Analysis. SIAM, Philadelphia, PA, 1991.
13. P.T. Davies and M.K.S. Tso, Procedures for reduced-rank regression. Appl. Stat., 31 (1982)
244-255.
14. M. Stone and R.J. Brooks, Continuum regression; cross-validated sequentially constructed
prediction ernbracing ordinary least sqaures, partial least squares, and principal component re-
gression. J. Roy. Stat. Soc. B52 (1990) 237-269.
15. R.J. Brooks and M. Stone, Joint continuum regression for multiple predictands. J. Am. Stat.
Assoc, 89 (1994) 1374-1377.
16. I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools.
Technometrics, 35 (1993) 109-135.
17. R. Sundberg, Interplay between chemistry and statistics, with special reference to calibration
and the generalized standard addition method. Chemom. Intell. Lab. Syst., 4 (1988) 299-305.
380
18. M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics. Wiley, New York, 1986.
19. D.M. Allen, The relationship between variable selection and data augmentation and a method
for prediction. Technometrics, 16 (1974) 125-127.
20 M. Forina, G. Drava, R. Boggia, S. Lanteri and P. Conti, Validation procedures in near-infrared
spectrometry. Anal. Chim. Acta, 295 (1994) 109-118.
21. H. van der Voet, Comparing the predictive accuracy of models using a simple randomization
test. Chemom. Intell. Lab. Syst., 25 (1994) 313-323.
22. B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Wiley, New York, 1993.
23. R. Wehrens and W. van der Linden, Bootstrapping principal regression models. J. Chemom.,
11 (1997)157-172.
24. T. Fearn, Flat or natural? A note on the choice of calibration samples, pp. 61-66 in Ref. [1].
25. T. Naes and T. Isaksson, SpHtting of calibration data by cluster-analysis. J. Chemometr., 5
(1991)49-65.
26. O.E. de Noord, The influence of data preprocessing on the robustness nd parsimony of
multivariate calibration models. Chemom. Intell. Lab. Systems, 23 (1994) 65-70.
27. W.R. Hruschka, Data analysis: wavelength selection methods, pp. 35-55 in: P.C. Williams and
K. Norris, eds. Near-infrared Reflectance Spectroscopy. Am. Cereal Assoc, St. Paul MI, 1987.
28. P. Geladi, D. McDougall and H. Martens, Linearization and scatter-correction for near-
infrared reflectance spectra of meat. Appl. Spectrosc, 39 (1985) 491-500.
29. F. McClure, Analysis using Fourier transforms, in: Handbook of Near-Infrared Analysis, D. A.
Bums and E.W. Ciurczak, eds. Dekker, New York, pp. 181-224 (1992).
30. A.C. Atkinson, Masking unmasked. Biometrika, 73 (1986) 533-541.
32. I.N. Wakeling and H.J.H. MacFie, A robust PLS procedure. J. Chemom., 6 (1992) 189-198.
31. B. Walczak and D.L. Massart, Robust PCR as outliers detection tool. Chemom. Intell. Lab.
Syst., 27 (1995) 41-54.
33. B. Walczak, Outlier detection in bilinear calibration. Chemom. Intell. Lab. Syst., 29 (1995)
63-73.
34. A. Singh, Outliers and robust procedures in some chemometric applications. Chemom. Intell.
Lab. Syst., 33 (1996) 75-100.
35. A.S. Hadi, A modification of a method for the detection of outliers in multivariate samples. J.
Roy. Stat. Soc. B56 (1994) 393-396.
36. P.J. Brown, C.H. Spiegelman and M.C. Denham, Chemometrics and spectral frequency selec-
tion. Phil. Trans. R. Soc. Ser. A, 337 (1991) 311-322.
37. I.E. Frank, Intermediate least squares regression method. Chemom. Intell. Lab. Syst., 1 (1987)
232-242.
38. A. Hoskuldsson, The H-principle in modelling with applications to chemometrics. Chemom.
Intell. Lab. Syst., 14 (1992) 139-153.
39. D. Jouan-Rimbaud, D.L. Massart, R. Leardi, et al.. Genetic algorithms as a tool for wavelength
selection in multivariate calibration. Anal. Chem., 67 (1995) 4295-4301.
40. V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.G.M. Vandeginste, and C. Sterna,
Elimination of uninformative variables for multivariate calibration. Anal. Chem., 68 (1996)
3851-3858.
41. J.S. Shenk and M.O. Westerhaus, US Patent No. 4866644, Sept. 12, 1989.
42. Y.D. Wang, D.J. Veltkamp and B.R. Kowalski, Multivariate instrument standardization. Anal.
Chem., 63 (1991) 2750-2756.
43. Z.Y. Wang, T. Dean and B.R. Kowalski, Additive background correction in multivariate in-
strument standardization. Anal. Chem., 67 (1995) 2379-2385.
381
44. M. Forina, G. Drava, C. Armanino, et al., Transfer of calibration function in near-infrared spec-
troscopy. Chemom. Intell. Lab. Syst., 27 (1995) 189-203.
45. O.E. de Noord, Multivariate calibration standardization. Chemom. Intell. Lab. Systems, 25
(1995) 85-97.
46. E. Bouveresse and D.L. Massart, Standardisation of near-infrared spectrometric instruments.
Vib. Spectrosc, 11 (1996) 3-15.
47. B.J. Wythoff, Backpropagation neural networks — a tutorial. Chemom. Intell. Lab. Syst., 18
(1993)115-155.
48. C. Borggaard and H.H. Thodberg, Optimal minimal neural interpretation of spectra. Anal.
Chem., 64 (1992) 545-551.
49. S. Wold, Non-linear partial least squares modelling. II. Spline inner relation. Chemom. Intell.
Lab. Syst., 14(1992)71-84.
50. S.D. Oman, T. Naes and A. Zube, Detecting and adjusting for nonlinearities in calibration of
near-infrared data using principal components. J. Chemom., 7 (1993) 195-212.
51. S. Wold, N. Kettaneh-Wold and B.Skagerberg, Nonlinear PLS modeling. Chemom. Intell.
Lab. Syst., 7 (1989) 53-65.
52. T. Naes, T. Isaksson and B. Kowalski, Locally weighted regression in NIR analysis. Anal.
Chem., 2 (1990) 664-673.
53 . Z.Y. Wang, T. Isaksson and B.R. Kowalski, New approach for distance measurement in lo-
cally weighted regression. Anal. Chem., 66 (1994) 249-260.
54. S. Sekulics, B.R. Kowalski, Z.Y. Wang, et al.. Nonlinear multivariate calibration methods in
analytical chemistry. Anal. Chem., 65 (1993) A835-A845.
Books
H. Martens and T. Naes, Multivariate Calibration, Wiley, Chichester, 1989.
J.H. Kalivas and P.M. Lang, Mathematical Analysis of Spectral Orthogonality. Dekker, New York,
1994.
Articles
K.R. Beebe and B.R. Kowalski, An introduction to multivariate calibration and analysis. Anal.
Chem., 59 (1987) 1007A-1017A.
B.R. Kowalski and M.B. Seasholtz, Recent developments in multivariate calibration. J. Chemo-
metrics, 5 (1990) 129-145.
H. Martens and T. Naes, Multivariate Calibration by Data Compression, Chapter 4 in Ref. [1].
P.J. Brown, Multivariate Calibration. J. Roy. Stat. Soc, B44 (1982) 287-321.
K. Faber and B.R. Kowalski, Propagation of measurement errors for the validation of predictions ob-
tained by principal component regession and partial least squares. J. Chemom., 11 (1997)
181-238.
D.M. Haaland, Multivariate Calibration Methods Applied to the Quantitative Analysis of Infrared
Spectra, Chapter 1 in "Computer-Enhanced Analytical Spectroscopy, Volume 3", edited by
P.C. Jurs. Plenum Press, New York, 1992.
383
Chapter 37
of the opioid receptor and which is used as an antidote for poisoning by morphine
and related narcotics.
The stereospecificity of the drug-receptor interaction was discovered in 1894
by Fischer in his study of the cleavage of glycosides by yeast [5]. He devised the
still famous 'lock and key' paradigm. A drug acts as a key which can turn the
receptor on or off, and which must satisfy narrow constraints (although the key
may be flexible and the lock may appear to be somewhat plastic). The actual term
receptor was coined in 1913 by Ehrlich, the discoverer of salvarsan (arsphen-
amine) which was regarded as the 'magic bullet' for the treatment of syphilis [6].
The part of the drug molecule that interacts with the receptor is called the
pharmacophore. Compounds that share the same pharmacophore are considered to
be similar with respect to the biological activity that is triggered by their receptor.
In practice, however, the therapeutic benefit may vary considerably within a series
of similar compounds due to differences in absorption, secretion, metabolization,
uptake by fatty tissues, transport through membranes, toxicity, etc. The design of
drugs is both an art and a science, but rational approaches, such as QSAR, play an
ever increasing role in the generation of new lead compounds and the optimization
of existing leads.
The foundation of QSAR as a practical tool of drug design took place around
1964 with the introduction of two so-called extrathermodynamic methods. One of
these methods is based on so-called linear free energy relationships (LFER)
between biological activities of structurally related (congeneric) drugs and the
physicochemical properties of the chemical substituents on a common parent
molecule. For example, benzoic acid may be regarded as a parent compound which
can be substituted at the ortho, meta and para positions by chlorine, bromine,
amine, hydroxyl, acetyl, etc. The latter are referred to as the substituent groups.
The ortho, meta and para locations with respect to the carboxyl group in the parent
benzoic acid molecule are called the substituent positions. This approach is also
referred to as Hansch analysis. Another extrathermodynamic method is based on
the additivity of the contributions to the biological activity by various substituent
groups at multiple substituent positions, and is called Free-Wilson analysis. Both
the Hansch and Free-Wilson methods make use of multiple linear regression.
These approaches allow us to determine the combination of substituents that
provide maximal activity in a series of structurally related molecules. These
developments have been extensively reviewed by Martin [7] and Kubinyi [8].
Multivariate chemometric techniques have subsequently broadened the arsenal
of tools that can be applied in QSAR. These include, among others, Multivariate
ANOVA [9], Simplex optimization (Section 26.2.2), cluster analysis (Chapter 30)
and various factor analytic methods such as principal components analysis (Chapter
31), discriminant analysis (Section 33.2.2) and canonical correlation analysis
(Section 35.3). An advantage of multivariate methods is that they can be applied in
385
supervised and unsupervised modes. They can handle series of structurally unrelated
compounds and some of them can be used in the case when multiple biological
activities are obtained. As we will see later on, a drug may act on a wide variety of
receptors and hence may provide a typical spectrum of biological activities.
Spectral map analysis is a factor analytic method which makes use of this property
(Section 31.3.5).
Partial Least Squares (PLS) regression (Section 35.7) is one of the more recent
advances in QS AR which has led to the now widely accepted method of Compara-
tive Molecular Field Analysis (CoMFA). This method makes use of local physico-
chemical properties such as charge, potential and steric fields that can be determined
on a three-dimensional grid that is laid over the chemical structures. The determina-
tion of steric conformation, by means of X-ray crystallography or NMR spectro-
scopy, and the quantum mechanical calculation of charge and potential fields are
now performed routinely on medium-sized molecules [10]. Modem optimization
and prediction techniques such as neural networks (Chapter 44) also have found
their way into QSAR.
Much attention is devoted today to automatic searching of large libraries of
chemical compounds in order to find interesting chemical structures that possess
some similarity to known lead compounds. Current emphasis is on searching of
three-dimensional structures within libraries of hundreds of thousands of com-
pounds [11]. Finally, our knowledge of proteins, specifically of enzymes and
receptors, has greatly increased by the use of techniques from biotechnology. It is
now possible in many cases to isolate and clone these macromolecules in pure
form and to determine their primary amino acid sequence, as well as their exact
three-dimensional structure. This has led to so-called rational drug design by
which drugs are designed by means of computer modelling rather than by the
empirical (serendipitous) method of trial-and-error.
In the light of these considerations one may regard QSAR as a multidisciplinary
field of chemometrics. Statistics, optimization, pattem recognition and information
technology converge in QSAR with chemistry, physicochemistry, biology, bio-
chemistry and biotechnology. The start of QSAR coincides with the development
of linear free energy relationships (LFER) between biological activities and the
physicochemical properties of molecules. This development is generally referred
to as the extrathermodynamic approach in QSAR for reasons that will become
apparent later on in this section. A vast amount of literature has appeared on the
subject since 1964. The content of this section has been largely inspired by a recent
review of the subject by Kubinyi [8].
As we have remarked above, biological activity is the result of the binding of a
drug D onto a receptor protein (or enzyme) P which results in the drug-receptor
complex DP. The strength of the binding and, hence, the magnitude of the effect
can be expressed by means of the change in Gibbs' free energy AG between the
386
free D and P on the one hand and the bound DP on the other hand. According to
classical thermodynamics, the free energy release AG is made up from an enthalpic
and an entropic term [12]:
AG = AH-T AS (37.1)
where R represents the gas constant. The relation shows that if the reaction between
drug and receptor proceeds in the direction of the formation of the drug-receptor
complex (K> I), then a decrease of free energy will occur (AG < 0).
Interactions of a drug with a receptor (or enzyme) are of a reversible non-
covalent nature. The free energy released by drug-receptor interactions is smaller
by one to two orders of magnitude than that of a covalent bond. The interactions
take place mainly by means of electrostatic, hydrophobic and dispersion forces.
Electrostatic interactions between charged groups are enthalpic. Hydrogen bonding
is a weak electrostatic interaction which takes place between electron donating and
electron accepting groups, such as amine, ketone, hydroxyl groups, etc. Hydro-
phobic interactions are mostly entropic. They can displace water molecules that
are bound to the drug molecule or to the receptor surface. This diminishes the
ordering of the water molecules and, hence, increases entropy, which in turn
lowers the free energy of the drug-receptor complex. Dispersion forces are
attractive when a drug molecule closely approaches the receptor surface. They are
repulsive when the van der Waals radii of drug and receptor molecules tend to
overlap. The nature of dispersion forces is mainly enthalpic.
The question that now arises is how to define biological (or therapeutic) activity
of a drug. There are good reasons for defining biological activity as the logarithm
of the reciprocal of the effective dose (or concentration) that is needed in order to
produce a well-defined biological effect. This is in accordance with Weber-
Fechner's psychophysical law which states that a biological response is an approx-
imately linear function of the logarithm of the physical stimulus [13]. Furthermore,
experimental observations also show that the doses or concentrations that are
tolerated by living organisms follow a log-normal distribution. The reciprocal of
the effective dose is required, since the more active or potent drugs require a lower
387
dose in order to produce the desired effect. Usually, an effective dose is defined as
the dose which produces the required effect in half of the samples or subjects that
has been tested. In classical pharmacology on living subjects this produces the
so-called median effective dose (or ED5Q). In biochemical pharmacology on isolated
receptors results are often reported in the form of median inhibitory concentrations
(or IC5Q). If we combine the previous observations with eq. (37.2), we obtain an
expression which relates the biological activity (log 1/C) of a drug to the equi-
librium constant (K) of the drug-receptor interaction:
where b^, b^, Z?2 are coefficients that can be determined by means of multiple linear
regression (see Chapter 10).
The Hansch model can be expressed in a general form:
y = Xb + e (37.8)
where X represents the matrix of independent parameters (extended by a unit
column-vector in order to account for the constant term), y is the vector of observed
biological activities, e contains the residuals between observed and computed
activities, and b is the vector of regression coefficients.
The above Hansch equations are also generally referred to as linear free energy
relationships (LFER) as they are derived from the free energy concept of the
drug-receptor complex. They also assume that biological activity is linearly
related to the electronic and lipophilic contributions of the various substituents on
the parent molecule.
A typical Hansch analysis has been applied to the 50% inhibitory concentrations
(ICgo) of oxidative phosphorylation of 11 doubly substituted salicylanilides (Table
37.1) as reported by Williamson and Metcalf [17]. Multiple linear regression leads
to the following model:
log I/IC50 = -2.190 + 0.708 a + 0.348 log P
389
TABLE 37.1
Lipophilicity (log P), Hammett electronic parameter (a) and inhibitory concentration (1050) for oxidative
phosphorylation of 11 doubly substituted salicylanilides [17]. The two substitution positions are labeled A
andB.
OH
,0^~- ™^^
# A B logF a IC50
The residual standard deviation of regression (s^) equals 0.346, the coefficient
of determination (R^) is 0.548 and the F-statistic amounts to 4.840 with 2 and 8
degrees of freedom (df) which is significant at the 0.05 level of significance (p).
All coefficients of the regression are significant at the 0.05 level of probability,
which lends evidence to the importance of both electrostatic (a) and hydrophobic
interactions (log P). The coefficient of determination (R^) is small and the error
term is relatively large. This suggests that predicted IC50 values are expected to
deviate by more than a factor two from the observed values. Inspection of the
residuals indicates that there may be two outliers (compounds 8 and 11) in this data
set that do not fit well to the proposed Hansch model.
Frequently, the relationship between biological activity and log P is curved and
shows a maximum [18]. In that case, quadratic and non-linear Hansch models have
been proposed [19]. The parabolic model is defined as:
log lie = bQ^b^ logP + b^ (logPf (37.9)
which reaches a maximum at (log P)^ = -b^l{2b'2).
390
TABLE 37.2
Lipophilicity (log P) and bactericidal concentrations (O of 17 doubly substituted phenols [20]
1 H CI 2.39 0.81
2 Methyl CI 2.89 1.34
3 Ethyl CI 3.39 1.73
4 Propyl CI 3.89 2.26
5 Butyl CI 4.39 2.52
6 Amyl CI 4.89 2.63
7 sec-Amyl CI 4.69 2.23
8 Cyclohexyl CI 4.90 2.25
9 Heptyl CI 5.89 2.51
10 Octyl CI 6.39 1.83
11 CI H 2.15 0.50
12 CI Methyl 2.65 0.91
13 CI Ethyl 3.15 1.35
14 CI Propyl 3.65 1.86
15 CI Butyl 4.15 2.20
16 CI Amyl 4.65 2.23
17 CI tert-Amyl 4.33 2.00
:).U -
m
• •
2.5 -
T-H 2.0 -
•
o
1.5 -
J •
1.0 - /•
0.5 -
0.0 -
1 1 I 1
LogP
Fig. 37.1. Quadratic Hansch model fitted to the bactericidal activities (log 1/C) of 10 doubly substi-
tuted phenols in Table 37.2 as a function of lipophilicity (log F) [20].
that the field and resonance parameters are less correlated than Hammett's
electronic parameters. Steric parameters account for the geometric properties of
the compounds. Among these we only cite the steric hulk factor E^ of Taft [24], the
molecular weight MH^and the five STERIMOL variables which were proposed by
Verloop [25] to describe the size and shape of the molecules. Molar refractivity
MR appears to be a parameter which correlates with lipophilicity and steric
parameters. Connectivity indices have been introduced by Kier and Hall [26] for
the description of the topological (graph) structure of the molecules. A connec-
tivity index is a single number which expresses how atoms are arranged in a
molecule. There are several different indices, each of which expresses a particular
feature of the molecule, such as its degree of branchedness, the presence of double
and triple bonds, of non-carbon atoms (apart from hydrogen), of aromatic rings,
etc. Finally indicator variables can be included into the Hansch equation in order to
account for the presence or absence of specific chemical functions. The use of
indicator variables will be discussed more amply in the following section which
deals with Free-Wilson analysis.
Extensive data bases are now available which list lipophilicity, molar refrac-
tivity, electronic and steric values for a wide collection of substituents [27,28].
Starting from a particular parent compound one computes the value of a physico-
chemical parameter (e.g. lipophilicity) for a given drug by adding the contributions
393
of all the substituents that have been introduced on the parent compound. Hence,
for a particular analog derived from an unsubstituted parent compound we obtain:
logP = X O o g ^ ) / (37.11)
CJ = X<^.- (37.12)
where the summation index / extends over all substituents on the parent molecule.
Computed values of log P are often represented by the symbol 7i, in order to
distinguish them from the empirical values which have been derived from partition
coefficients or from chromatographic retention times.
A difficulty with Hansch analysis is to decide which parameters and functions
of parameters to include in the regression equation. This problem of selection of
predictor variables has been discussed in Section 10.3.3. Another problem is due to
the high correlations between groups of physicochemical parameters. This is the
multicoUinearity problem which leads to large variances in the coefficients of the
regression equations and, hence, to unreliable predictions (see Section 10.5). It can
be remedied by means of multivariate techniques such as principal components
regression and partial least squares regression, appUcations of which are discussed
below.
The fundamental assumption of Hansch analysis is that substituent values are
additive. This implies that substituents are mutually independent, i.e. that the
effect of a substitution group at one position in the parent molecule is independent
of substitution groups at other positions. The assumption of additivity is violated,
for example, when hydrogen bonding occurs between two adjacent substituents.
where b^ represents the biological activity of the reference compound. The indexy
ranges between 1 and the number of substitution positions /?, while the index /
varies between 1 and the number of possible substituents rip at the position y,
minus 1. The coefficient b^j in eq. (37.13) denotes the contribution to the activity by
the /th substituent group at theyth substitution position on the reference compound,
and x^j is the corresponding indicator variable. The latter equals one if the /th
substitution group is present at theyth position, and zero otherwise. Note that there
is a distinction between the parent compound in Hansch analysis and the reference
compound in Free-Wilson analysis. The parent compound is by definition the
unsubstituted compound (with hydrogen atoms at each substitution site), while the
reference compound can be chosen arbitrarily from the set of compounds. The
substituent groups that appear in the reference compound do not appear in the
Free-Wilson model that has been defined above.
The number n of independent variables x^ is defined by:
n = f,(nj-l) (37.14)
./•
TABLE 37.3
Complete indicator table and bacteriostatic activities of 10 triply substituted tetracyclines. The three substi-
tution positions are labeled U, V and W [31].
CO NHo
# U V w 1/Z
1 1 0 1 0 0 1 0 0 0.60
2 1 0 0 1 0 1 0 0 0.21
3 1 0 0 0 1 1 0 0 0.15
4 1 0 0 1 0 0 1 0 5.25
5 1 0 0 0 1 0 1 0 3.20
6 1 0 1 0 0 0 1 0 2.75
7 0 1 1 0 0 0 1 0 1.60
8 0 1 1 0 0 0 0 1 0.15
9 0 1 0 0 1 0 1 0 1.40
10 0 1 0 0 1 0 0 1 0.75
which shows that all compounds with the NH2 substituent at W possess high
activity, while the remaining compounds reached only low activities. It is not
surprising, therefore, that the Free-Wilson equation indicates that the W site in the
molecule is the most critical one for obtaining bacteriostatic activity. A Free-Wilson
analysis can also be performed by means of analysis of variance (ANOVA) instead
of multiple regression (Chapter 6). The interpretation of the results obtained by the
two methods is essentially the same.
In a broad sense, one may include the Free-Wilson equation within the class of
linear free energy relationships (LFER). It is also subjected to the assumption of
additivity of the contributions to the biological activity by substituent groups at
different substitution sites. The assumption requires, for example, that there is no
hydrogen bonding interaction between the various substitution groups.
The interpretation of the result of a Free-Wilson analysis is somewhat different
from that of a Hansch analysis. The coefficients in the Hansch model represent
absolute contributions to the biological activity of a compound from the various
396
TABLE 37.4
Reduced indicator table and bacteriostatic activities 1/Zof 10 triply substituted tetracyclines, as derived
from Table 37.3. The compound with substitution groups H, NO2 and NO2 at the positions U, V and W is
taken as the reference compound.
# U V w 1/Z
1 0 0 0 0 0 0.60
2 0 1 0 0 0 0.21
3 0 0 1 0 0 0.15
4 0 1 0 1 0 5.25
5 0 0 1 1 0 3.20
6 0 0 0 1 0 2.75
7 1 0 0 1 0 1.60
8 1 0 0 0 1 0.15
9 1 0 1 1 0 1.40
10 1 0 1 0 1 0.75
and indicator variables that are to be included in the Hansch model. The latter
approach can be considered as quantitative, as it makes use of experimentally or
computationally derived parameters, such as 7C, o, E^, etc. The goodness of fit of
the Free-Wilson model to the observed biological activities also indicates whether
the additivity assumption is supported by the data. A comparison between the
Hansch and Free-Wilson models has been made by Craig [32].
Attention has already been drawn to correlations that occur between various
physicochemical and conformational descriptors of QSAR. Since Hansch analysis
involves multiple linear regression, these correlations may lead to unreliable
predictions of biological activity. In the design of chemical synthesis one also
wishes to sample the space of the variables in a uniform way such as to produce the
greatest possible variety of compounds. For these reasons, use has been made of
so-called Craig plots [33] in which compounds are plotted according to two
selected variables (e.g. a and log P). Such plots have been useful for the optimi-
zation of biological activity by means of sequential Simplex optimization (Section
26.2) [34] or response surface methodology (Section 24.5).
Due to the rapid proliferation of physicochemical, quantum mechanical and
conformational variables, the need was soon felt to apply multivariate methods and
pattern recognition techniques in QSAR. Cluster analysis was introduced by
Kowalski [35] in QSAR as a tool for studying the correlation structure of the
variables and for establishing similarities between the compounds. Factor analysis
has been used at an early stage in QSAR by Franke [36] for analyzing the
correlation structure of the variables (see also Chapter 34). This was followed by
applications of principal components analysis and biplot representations of
rectangular (compounds x properties) tables (see Chapters 17 and 31). Spectral
map analysis (SMA) is one such technique which has been introduced by Lewi
[37] for the study of drug-test specificities in pharmacological screening (Section
31.3.5). Principal components analysis, cluster analysis, multivariate regression
involving multiple dependent variables and other multivariate techniques have
been integrated by IMager [9] into structure-activity relationships in combination
with multivariate bioassay. This approach, which is called rational empiricism,
holds the middle between purely rational and purely empirical drug design.
In the following sections we propose typical methods of unsupervised learning
and pattern recognition, the aim of which is to detect patterns in chemical,
physicochemical and biological data, rather than to make predictions of biological
activity. These inductive methods are useful in generating hypotheses and models
which are to be verified (or falsified) by statistical inference. Cluster analysis has
398
initially been used in QSAR for the detection of patterns in physicochemical and
biological data [38]. These clusters can often be displayed visually in a two- or
three-dimensional plot of principal components.
TABLE 37.5
Correlations between 7 physicochemical substituent parameters obtained from 90 substituents groups [39]
logP ^m ^. F R MR MW
PC2
PCI
Fig. 37.2. Principal components loading plot of 7 physicochemical substituent parameters, as ob-
tained from the correlations in Table 37.5 [39,40]. The horizontal and vertical axes account for 46 and
31%, respectively, of the correlations. IVIost of the residual correlation is along the perpendicular to the
plane of the diagram. The line segments define clusters of parameters that have been computed by
means of cluster analysis.
and six thiazepines. These compounds are classified as neuroleptics (major tranquil-
lizers) which are designed for the treatment of psychosis. The common chemical
structure is shown in the insert of Table 37.6, where the dummy element X
represents either O or S. The oxazepines and thiazepines are also substituted at the
position R in one of the two phenyl rings by means of 6 different substituent
400
TABLE 37.6
o I
ax^^
# X R <^n, logP (log P)2 Apo-morphine Catalepsy
groups. The substituent parameters o^, log P and the square of log P are shown in
Table 37.6 together with the biological activities (log I/ED50) in pharmacological
tests which measure the ability to inhibit the stereotyped effects produced by
apomorphine and the ability to produce catalepsy (abnormal waxy postures) in
rats. These two tests are considered to be good animal models for testing neuro-
leptic effects of compounds.
The biplot in Fig. 37.3 has been constructed from the factor scores of the 12
compounds and the factor loadings of the five physicochemical and biological
variables [42,43]. (The biplot graphic technique is explained in Section 31.2.) It is
401
PC2 "-. ;
V
^".
^^
" Q logP
y , o,CN y \ ' \ y
^^-' ^\^.^
-^
" " / ^©^'"''^
y^ 1 S, qtl APOMORPHXNE - -^ '
"" ^ y O,CLO -""
' / -TT ,
y 0^ K x< .C'''
y S,ClQf
0,CH^j ^
V, ,' , * ^ ' '
QO, SCH3 ^ -^'
1 <«' / / ^ \
1 '^ / ^
^"^ / // / / \
^s,scn3 ^^
/ -^ "^
/
^sfCH3
^ x
^
• (loaP)2
P 5 , S02CH3
O O,S02CH3
PCI
Fig. 37.3. Principal components biplot showing the positions of 6 substituted oxazepine (O) and 6 sub-
stituted thiazepine (S) neuroleptics with respect to three physicochemical parameters and two biologi-
cal activities [41,43]. The data are shown in Table 37.6. The thiazepine analogs are represented by
means of filled symbols. The horizontal and vertical components represent 50 and 39%, respectively,
of the variance in the data.
For a detailed description of spectral map analysis (SMA), the reader is referred
to Section 31.3.5. The method has been designed specifically for the study of
drug-receptor interactions [37,44]. The interpretation of the resulting spectral map
is different from that of the usual principal components biplot. The former is
symmetric with respect to rows and columns, while the latter is not. In particular,
the spectral map displays interactions between compounds and receptors. It shows
which compounds are most specific for which receptors (or tests) and vice versa.
This property will be illustrated by means of an analysis of data reporting on the
binding affinities of various opioid analgesics to various opioid receptors [45,46].
In contrast with the previous approach, this application is not based on extra-
thermodynamic properties, but is derived entirely from biological activity spectra.
Binding studies are performed on isolated receptors that are labeled by means of
a specific radioactive marker. Compounds that are tested may compete with the
specific marker at the binding site of the receptor. The fraction of the marker that is
displaced by the test compound is a measure of the affinity of the latter for that
receptor. Usually, an active compound will show a spectrum of affinities for
several related receptors. The aim of the multivariate analysis is to map the
spectrum of affinities, hence the name spectral mapping. At the time when the
report was published, pharmacologists had identified two opioid receptors which
were labeled |i and 5. The former binds specifically to opioid analgesics such as
morphine and its synthetic derivatives, the latter binds to derivatives of the
so-called endogenous opioid peptides, in particular to enkephalins and endorphins.
Strong evidence for another opioid receptor, named K, was available and this had
been already identified in brain tissues. The data in Table 37.7 report on the results
of 26 narcotic agonists and antagonists in four binding assays. The four radioactive
markers comprised dihydromorphine (DHM, a |i-agonist), an enkephalin analog
403
TABLE 37.7
Inhibition constants (Kf) of 26 narcotic analgesic agonists and antagonists as obtained in 4 receptor binding
tests (EKC, DHM, DADLE and NAL) [45].
PCI *EKC
BUPRENORPHINZ ,'\
KETAZOCINE* ^ / ^
ETKYLKETOCYCLAZOCINEi \
LOFKNTANI^O » O NALOXONE
' \ O LEVALLORPHAN
BpTORPHANOZ^
___i,,_-__-_ I .,^NALBXTPEINE
PENTAZOCINE > _
^PHENAZOCINE
MET-ZNKEPHALXN'
LEU-ENKEPHALIN* ^/' Q /fj? Q*^^™^^^
0-AIJl,U-I.£I7-KNKSPHAI,XW» ,' UK 20Jtf / y ; NALORPHINE
, + ' ETORPHINE
)CINE
'O LEVORPHANOL
NAL
DADLE' i—--_ _? n
i
/ LBU-EiiKEPHALIN,ARa-PHE
I DIHYDROMORPHINE^
OLY-127623
T FK-33824Q
BETA-ENDORPHIN
PC2
Fig. 37.4. Spectral map of the 26 opioid agonists and antagonists in 4 receptor binding tests, as de-
scribed by Table 37.7 [45, 46]. Circles refer to the compounds. Squares represent the binding tests.
Areas of circles and squares are proportional to the marginal mean affinities in the table. The lines that
join the three poles (DHM, DADLE and EKC) of the map represent axes of contrast between the p.-, 5-
and K-opioid receptors. The horizontal and vertical components represent 18 and 79%, respectively,
of the interaction in the data.
specificity for the 6-receptor. The larger the distance between two poles, the
greater is the contrast between the corresponding receptors. The areas of the circles
are proportional to the average affinity of the corresponding compounds for the
four labeled receptors.
It appears from the spectral map that the K-receptor is a highly specific receptor
which produces strong contrasts in binding affinities of opioid analgesics. The
contrast is most evident in ketazocine, ethylketazocine and buprenorphine which
possess much more affinity for the K-receptor than for the two others. The contrast
is also strong with dihydromorphine, beta-endorphin, an enkephalin analog and
two experimental compounds (LY and FK) which have little or no affinity for the
K-receptor.
TABLE 37.8
Data obtained from a screening test on rats, differentiating between morphinomimetic (opioid analgesic)
and neuroleptic compounds [48]. The six observations are scored on a 6-point scale ranging from absent (0)
to highly pronounced (6). Compounds 1 to 13 are morphinomimetics; compounds 14 to 26 are neuroleptics.
1 Propoxyphene 0 3 1 3 0 0 7
2 Morphine 0 0 4 2 0 2 8
3 Methadone 0 0 6 6 0 3 15
4 Pethidine 0 1 4 6 0 2 13
5 Codeine 0 4 0 3 0 1 8
6 Dextromoramide 0 0 6 4 0 3 13
7 Anileridine 0 2 3 4 0 2 11
8 Phenoperidine 0 2 5 2 0 3 12
9 Etonitazine 0 0 6 6 0 0 12
10 Phenazocine 0 2 4 4 0 2 12
11 Piritramide 0 2 4 4 0 3 13
12 Fentanyl 0 1 6 6 0 3 16
13 Bezitramide 0 0 2 2 0 1 5
14 Chlorpromazine 6 0 6 5 0 6 23
15 Perphenazine 6 0 6 0 0 6 18
16 Haloperidol 0 0 6 0 3 6 15
17 Trifluoperazine 6 0 3 0 0 6 15
18 Chlorprothixene 5 0 6 6 0 6 23
19 Spirilene 0 0 3 0 0 5 8
20 Pimozide 0 0 5 0 1 5 11
21 Pipamperone 2 0 6 0 1 6 15
22 Sulpiride 0 0 0 0 1 2 3
23 Benperidol 4 0 6 6 0 6 22
24 Spiperone 0 3 0 3 0 5 11
25 Oxypertine 6 0 6 0 1 6 19
26 Thioridazine 1 0 6 6 0 6 19
PROPOXYPHENEO
9HALOPERIDOL
@ SPIPERONE
PIRITRAMIDE
\^-G ANILERIDINE
:DE PHENOPERIDII:ufi O^-^HENAZOCINE
FSNTANYL^ Q PETHIDINE
DBXTROMORAISINE ^ ^ n PUPIL DIAM. INCREASE
MORPHINE^
SPIRILENE0 ^OO 0 OETONITAZINE
•f^ \ METHADONE
PIPAMPERONS^
^^
P
PALPEBRAL DECREASE
I
~n ^
9
n ^ ^gy
\
»-«.^m».»rTn-.
BEZITRAMIDE
THIORIDAZINE
TEMPERAT. DECREASE
^ BENPERIDOL
P f CHLORPROTHIXENE
OXYPERTINE^ \CHLORPROMAZINE
^ PERPHENAZINE
@TRIFLUOPERAZINE
n PROSTRATION
PCI
Fig. 37.5. Biplot obtained from correspondence factor analysis of the data in Table 37.8 [43]. Circles
refer to compounds. Squares relate to observations. Areas of circles and squares are proportional to the
marginal sums of the rows and columns in the table. The horizontal and vertical components represent
40 and 31%, respectively, of the interaction in the data.
Linear regression, such as used in the Hansch and Free-Wilson methods, seeks
to predict the actual value of biological activities from chemical, physicochemical,
quantum mechanical and biological parameters of a set of compounds. In linear
discriminant analysis (LDA) one only attempts to predict the class membership of
the compounds from the above mentioned parameters (Section 17.9). The discrim-
inant equation is obtained from a learning or training set of compounds with
known class membership and applied to newly synthesized compounds for the
prediction of their classification. In the case of two classes (e.g. morphinomimetics
and neuroleptics) the discriminant equation can be obtained directly by means of
linear regression (Chapter 10) or principal components regression (Section 35.6).
In the general case of LDA with more than two classes one makes use of canonical
variate analysis (Section 33.2.2). The result can be displayed graphically in the
form of a biplot which allows to determine visually which parameters discriminate
between which classes of compounds.
Linear discriminant analyses have been carried out using the extrathermo-
dynamic parameters that normally appear in Hansch equations (Log P and its
squared value, a^, E^, etc.). The method can also be applied to biological spectra in
order to discriminate between pharmacological classes of compounds that are
submitted to screening tests. One of the assumptions of LDA is that the classes of
compounds are normally distributed in the multivariate predictor space and that
their multivariate equidensity (hyper)ellipsoids are similarly oriented and shaped.
This is a stringent assumption which, in the strictest form, is seldom satisfied in
practice. Another problem of LDA occurs when a highly discriminating parameter
contributes little to the total variance in the data. These drawbacks can be allevi-
ated by applying partial least squares (PLS) regression which is the subject of
Section 35.7.
409
Partial least squares analysis (PLS) has rapidly found applications in QSAR
since its introduction by Wold et al. [52]. The reader is referred to Sections 35.7
and 36.2.4 for a thorough discussion of PLS.
TABLE 37.9
Pharmacological results (ED^^y in mg/kg) obtained in rats from 17 reference neuroleptic compounds in the
combined ATN test [48].
of the ED5Q and K^ values, and the resulting activities were double-centered. The
biplots of Figs. 37.6 and 37.7 are constructed from the first two PLS components,
and display the pharmacological and biochemical spectra, respectively. The
reading rules of these biplots are the same as those of the spectral map which have
been explained above (Section 37.2.2). Briefly, circles represent neuroleptic
compounds and squares refer to the tests. Areas of circles are proportional to the
potencies of the compounds, while areas of squares are proportional to the
sensitivities of the tests. Potencies and sensitivities have been defined from the
marginal totals of the data tables. Compounds and tests that possess mutual
specificity are found at a distance and in the same direction with respect to the
center of the plot.
The biplots show which binding tests are predictive for which pharmacological
tests. Binding to the haloperidol and apomorphine labeled receptors corresponds
with inhibition of agitation and stereotypy in rats, binding to the spiperone receptor
414
TABLE 37.10
Biochemical results (A', in nM) obtained in binding assays of radioactively labeled brain receptors [57]
from the same 17 neuroleptics shown in Table 37.9.
corresponds with inhibition of seizures and tremors, and, finally, binding to the
alpha-adrenergic receptor (labeled by WB4101) matches with prevention of mor-
tality. These triangular patterns indicate a high specificity of the in-vivo tests for
three receptors (dopamine, serotonin and norepinephrine) that are thought to be
involved in psychotic manifestations such as hallucinations, delusions, mania,
extreme agitation, anxiety, etc. These three receptors form distinct poles on the
pharmacological and biochemical biplots of Figs. 37.6 and 37.7. There also is
agreement between the positions of the compounds on the two biplots. Haloperidol
and pimozide appear as specific inhibitors of the dopamine receptor, promazine
and thioridazine are shown to be specific blockers of the norepinephrine receptor,
and, finally, pipamperone is the only compound that exhibits high specificity for
the serotonin receptor. A comparison of the two biplots (Figs. 37.6 and 37.7)
reveals which binding assays are predictive for which pharmacological effects. It
also shows that activity spectra in animals can be reliably predicted from binding
profiles. The biplots can also be analyzed from a chemical point of view by
415
Serotonin
I
I
•VTHEWDRS
"S
mseiZURES BENPERJDOL Q TRIFLUOPERAZINE^
•PIPAMPEMNE
i '>'SULPIRIDE [PERPHENAZINE ^
CLOZAPINE' CLOTHIAPINE O
CHLOHPROMAZINE*
1UPHEM4ZIMEC :0 V 0«)HALOPERiooL D o p a m i n G
I PENFLURIDOL
CHLOPPROTHIXENE%
I
I - - 0 "^XD] " 1
DROPERIDOL
AGITATION ^ N
OPIHOZIDE
STEREOTYPy
*PROHAZIHE
I
I
•THIORIDAZIHE • 7H10TIXENE
I
jttNORTALITY
Norepinephrine
Fig. 37.6. PLS biplot obtained from the pharmacological data in Table 37.9, after log double-centering
and analysis by two-block PLS [56]. Circles represent 17 reference neuroleptic compounds, squares
denote tests. Areas of circles and squares are proportional to the potencies of the compounds and the
sensitivities of the tests, respectively. Reproduced with permission of E.J. Karjalainen.
Serotonin I
I
PIPANPERONE» J
aSPIPERONE
; \
•CLOTHIAPINE
•PERPHENAZINE ^
TfilOTIXENEO ^PENFLURIDOL
•CLOZAPINE FLUPHENAZINE
piMOiZDE Dopamine
CHLORPROTHIXENEO -h ^ T^I^\^in.UOPERAZINE
'DROPERIDOLO hi-/p--B£NPffljoa.
CHLORPROMAZINEO
• APOMORPHINE HALOPERIDOL
THIORIDAZINE ^ v
I ^ ^ OHALOPERIOOL ^^
OPRONAZINE
I
Norepinepifirine
Fig. 37.7. PLS biplot derived from the biochemical binding data in Table 37.10, after log double-
centering and analysis by two-block PLS [56]. The reading rules are the same as those indicated with
Fig. 37.6. Reproduced with permission of E.J. Karjalainen.
416
correlating structural properties with positions on the maps. It appears that com-
pounds attracted by the dopamine pole are chiefly of the butyrophenone and
diphenylbutyl type, while those attracted by the adrenergic pole belong to the
phenothiazine class. In a similar way as with PLS regression, one may determine
the optimal number of components that must be included for obtaining the most
reliable predictions, using cross-validation and minimization of PRESS (Section
36.3).
Two-block PLS can be extended to multi-block PLS for the prediction of drug
activities from several independent predictor blocks [58]. For example, one may
attempt to predict clinical effects (the ultimate goal of drug design) from all
available information (pharmacological, biochemical, pharmacokinetic, toxicol-
ogical, etc.). Multi-block PLS has been used to predict clinical scores for various
antischizophrenic effects [59] from in-vivo and in-vitro data [56]. The analysis
showed that only a single component of clinical effects could be reliably predicted
from animal and receptor studies, while all higher order clinical and biological
components showed only weak correlations.
In this chapter we have only addressed a selected number of topics and for lack
of space we have left out many others. Cluster analysis has played a larger role in
QSAR than appears from our overview. This technique is an established QSAR
tool in recognition or classification of known patterns [38,60] as well as for
cognition or detection of novel patterns [61].
Neural networks have been introduced in QSAR for non-linear Hansch analyses.
The Perceptron, which is generally considered as a forerunner of neural networks
has been developed by the Russian school of Rastrigin and coworkers [62] within
the context of QSAR. The learning machine is another prototype of neural network
which has been introduced in QSAR by Jurs et al. [63] for the discrimination
between different types of compounds on the basis of their properties.
Decision trees for optimal drug design strategies have been proposed by Topliss
[64] and by Purcell et al [65].
The determination of the three-dimensional conformation of molecules is an
important aspect of QSAR, which can be obtained from x-ray crystallography [66],
NMR spectroscopy or, in the case of small molecular fragments by quantum-
mechanical calculations [67,68].
A topic of actuality is the study of receptor proteins and enzymes for which data
bases with crystallographic information are now made available. Computer model-
ling of the active sites of receptors and enzymes are important tools in rational drug
design. Principal components and cluster analysis can be applied to the primary
417
References
14. L.P. Hammett, Physical Organic Chemistry. McGraw-Hill, New York, 1940.
15. C. Hansch and T. Fujita, pen Analysis. A method for the correlation of biological activity and
chemical structure. J. Am. Chem. Soc, 86 (1964) 1616-1626.
16. C. Hansch and W.J. Dunn III, Linear relationships between lipophilic character and biological
activity of drugs. J. Pharmaceut. Sci., 61 (1972) 1-19.
17. R.L. Williamson and R.L. Metcalf, Salicylanilides: a new group of active uncouplers of oxida-
tive phosphorylation. Science, 158 (1967) 1694-1695.
18. C. Hansch and J.M. Clayton, Lipophilic character and biological activity of drugs II: The para-
bolic case. J. Pharmaceut. Sci., 62 (1973) 1-21.
19. J.W. McFarland, On the parabolic relationship between drug potency and hydrophobicity. J.
Med. Chem., 13 (1970) 1192-1196.
20. E. Klarmann, V.A. Shtemov and L.W. Gates, The alkyl derivatives of halogen phenols and
their bactericidal action. I. Chlorophenols. J. Am. Chem. Soc, 55 (1933) 2576-2589.
21. L. Buydens, D.L. Massart and P. Geerlings, Gas chromatographic behaviour and pharmacol-
ogical activity of neuroleptica. Anal. Chim. Acta, 174 (1985) 237-244.
22. R.F. Rekker and R. Mannhold, Calculation of Lipophilicity. The Hydrophobic Fragmental
Constant Approach. VCH, Weinheim, Germany, 1992.
23. C.G. Swain and E.C. Lupton, Field and resonance components of substituent effects. J. Am.
Chem. Soc, 90 (1968) 4328-4337.
24. R.W. Taft, Separation of polar, steric and resonance effects in reactivity. In: Steric Effects in
Organic Chemistry (M.S. Newman, Ed.). Wiley, New York, 1956, pp. 556-575.
25. A. Verloop, The STERIMOL Approach to Drug Design. Marcel Dekker, New York, 1987.
26. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Wiley, New
York, 1986.
27. C. Hansch and A. Leo, Substituent Constants for Correlation Analysis in Chemistry and Biol-
ogy. Wiley, New York, 1979.
28. G. Grassy, R. Lahana and Oxford Molecular Ltd., TSAR, Tools for Structure-Activity Rela-
tionships, User's Guide, Issue 3. Proprietary Software developed by Oxford Molecular Ltd.,
Oxford, UK, 1993. _ .^
29. S.M. Free and J.XVL-Wilson, A mathematical contribution to structure-activity relationships. J.
Med. Ciiem.,"7 (1964) 395-399.
30. T. Fujita and T. Ban, Structure-activity study of phenethylamines as substrates of biosynthetic
enzymes of sympathetic transmitters. J. Med. Chem., 14 (1971) 148-152.
31. J.L. Spencer, J.J. Hlavka, J. Petisi, H.M. Krazinski and J.H. Boothe, 6-Deoxytetracyclines, V.
7,9-Disubstituted Products. J. Med. Chem., 6 (1963) 405-407.
32. P.N. Craig, Comparison of the Hansch and Free-Wilson approaches to structure-activity cor-
relation. In: Biological Correlations — The Hansch Approach (R.F. Gould, Ed.). Advances in
Chemistry Series, No. 114. American Chemical Society, Washington DC, 1972, pp. 115-129.
33. P.N. Craig, Interdependence between physical parameters and selection of substituent groups
for correlation studies. J. Med. Chem., 14 (1971) 680-684.
34. F. Darvas, Application of the sequential simplex method in designing drug analogs. J. Med.
Chem., 17(1974)99-804.
35. B.R. Kowalski and C.F. Bender, The application of pattern recognition to screening prospec-
tive anticancer drugs. J. Am. Chem. Soc, 96 (1974) 916-918.
36. R. Franke, S. Dove and R. Kuehne, Hydrophobicity and hydrophobic interactions, 1. On the
physical nature of aromatic hydrophobic substituent constants. Eur. J. Med. Chem., 14 (1979)
363-374.
419
37. P.J. Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical
compounds. Arzneim.-Forsch. (Drug Res.), 26 (1976) 1205-1300.
38. B.R. Kowalski and C.F. Bender, Pattern recognition. A powerful approach to interpreting
chemical data. J. Am. Chem. Soc, 94 (1972) 5632-5639.
39. C. Hansch, Quantitative approaches to pharmacological structure-activity relationships. In:
Structure-Activity Relationships (C.J. Cavallito, Ed.), Vol. 1. Pergamon, Oxford, 1973, pp.
75-165.
40. P.J. Lewi, Multivariate data analysis in structure-activity relationships, In: Drug Design (E.J.
Ariens, Ed.), Vol. X. Academic Press, New York, 1980.
41. J. Schmutz, Neuroleptic piperazinyl-dibenzo-azepines. Arzneim.-Forsch. (Drug Res.), 25
(1975)712-719.
42. K.R. Gabriel, The biplot graphical display of matrices with appHcation to principal component
analysis. Biometrika, 58 (1971) 453-467.
43. P.J. Lewi, Multidimensional data representation in medicinal chemistry. In: Chemometrics.
Mathematics and Statistics in Chemistry (B.R. Kowalski, Ed.), Reidel, Dordrecht, 1984, pp.
351-376.
44. P.J. Lewi, Spectral mapping of drug-test specificities. In: Advanced Computer-Assisted Tech-
niques in Drug-Discovery (H. van de Waterbeemd, Ed.). VCH, Weinheim, Germany, 1994, pp.
219-253.
45. P.L. Wood, S.E. Charleson, D. Lane and R.L. Hudgin, Multiple opiate receptors: differential
binding of mu, kappa and delta agonists. Neuropharmacology, 20 (1981) 1215-1220.
46. G. Calomme and P.J. Lewi, Multivariate analysis of structure-activity data. Spectral map of
opioid narcotics in receptor binding. Actual. Chim. Therap., S.ll (1984) 121-126.
47. J.-P. Benzecri, Analyse des Donnees. Analyse des Correspondences, Dunod, Paris, 1973.
48. C.J.E. Niemegeers and P.A.J. Janssen, A systematic study of the pharmacology of DA-
antagonists. Life Sci., 24 (1979) 2201-2216.
49. M.J. Greenacre, Correspondence Analysis in Practice. Academic Press, London, 1993.
50. R. Franke, Optimierungs-Methoden in der Wirkstoff-Forschung. Quantitative Struktur-
Wirkungs-Analyse. Akademie-Verlag, Berlin, 1980.
51. T.W. Anderson, Introduction to Multivariate Statistical Analysis. Wiley, New York, 1984.
52. S. Wold, C. Albano, W.J. Dunn III, K. Esbensen, S. Hellberg, E. Johansson and M. Sjostrom,
Pattern recognition: finding and using patterns in multivariate data. In: Food Research and Data
Analysis (H. Martens and H. Russwurm Jr., Eds.). Applied Science, London, 1983, p. 147.
53. S. de Jong, PLS fits closer than PCR. J. Chemom., 7 (1993) 551-557.
54. R.D. Cramer III, D.E. Patterson and J.D. Bunce, Comparative Molecular Field Analysis
(CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc, 110
(1988)5959-5967.
55. L.M. Kauvar, D.L. Higgins, H.O. Villar, J.R. Sportman, A. Engqvist-Goldstein, R. Bukar, K.E.
Bauer, H. Dilley and D.M. Rocke, Predicting ligand binding to proteins by affinity finger print-
ing. Chem. Biol., 2 (1995) 107-118.
56. P.J. Lewi, B. Vekemans and L.M. Gypen, Partial least squares (PLS) for the prediction of
real-life performance from laboratory results. In: Scientific Computing and Automation (Eu-
rope) (E.J. Karjalainen, Ed.). Elsevier, Amsterdam, 1990, pp. 199-209.
57. J.E. Leysen, Review of neuroleptic receptors: specificity and multiplicity of in-vitro binding
related to pharmacological activity. In: Clinical pharmacology in psychiatry (E. Usdin, S. Dahl,
L.F. Gram and O. Lingjaerde, Eds.). MacMillan, London, 1982, pp. 35-62.
58. L.E. Wangen and B.R. Kowalski, A multiblock partial least squares algorithm for investigating
complex chemical systems, J. Chemom., 3 (1988) 3-10.
420
59. J. Bobon, D.P. Bobon, A. Pinchard, J. Collard, T.A. Ban, R. De Buck, H. Hippius, P.A. Lam-
bert and O.A. Vinar, A new comparative physiognomy of neuroleptics: a collaborative clinical
report. Acta Psych. Belg., 72 (1972) 542-554.
60. C. Hansch, S.H. Unger and A.B. Forsythe, Strategy in drug design. Cluster analysis as an aid in
the selection of substituents. J. Med. Chem., 16 (1973) 1212-1222.
61. D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of
Cluster Analysis. Wiley, New York, 1983.
62. S.A. Hiller, V.E. Golender, A.B. Rosenblit, L.A. Rastrigin and A.B. Glaz, Cybernetic methods
of drug design. 1. Statement of the problem — The Perceptron approach. Comp. Biomed. Res.,
6(1972)411-421.
63. P.C. Jurs, B.R. Kowalski and T.L. Isenhour, Computerized learning machines applied to chem-
ical problems. Molecular formula determination from low resolution mass spectrometry. Anal.
Chem., 41 (1969)21-27.
64. J.G. Topliss, AppHcation of operational schemes for analog synthesis in drug design. J. Med.
Chem., 15(1972) 1001-1011.
65. W.P. Purcell, G.E. Bass and J.M. Clayton, Strategy of drug design: a guide to biological activ-
ity. Wiley, New York, 1973.
66. J.P. Tollenaere, H. Moereels and L.A. Raymaekers, Atlas of Three-Dimensional Structure of
Drugs. Elsevier, Amsterdam, 1979.
67. L.B. Kier, Molecular Orbital Theory in Drug Research. Academic Press, New York, 1971.
68. P.S. Portoghese, In: Molecular and Quantum Pharmacology (E.D. Bergman and B. Pullman,
Eds.). Reidel, Dordrecht, 1974, pp. 352-353.
69. P.J. Lewi and H. Moereels, Receptor mapping and phylogenetic clustering. In: Advanced
Computer-Assisted techniques in Drug Discovery (H. van de Waterbeemd, Ed.). VCH,
Weinheim, Germany, 1994, pp. 131-162.
70. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Wiley, New
York, 1986.
71. S.C. Basak, V.R. Magnuson, G.J. Niems and R.R. Regal, Determining structural similarity of
chemicals using graph-theoretic indices. Discr. Appl. Math., 19 (1988) 17-44.
421
Chapter 38
38.1 Introduction
© © ®
Fig. 38.1. Triangle test: two similar products and one different product are presented; the assessor has
to indicate the product that is different.
recognize the dissimilar sample one knows that a sensory difference exists. Table
38.1 gives the critical values for different panel sizes and different significance
levels. For example, with 27 panellists (n = 27) one concludes that there is a
difference (at the a = 5% significance level) when at least 14 correct responses are
obtained.
One should be aware that the conclusion of 'no significant difference detected'
may not be interpreted as 'no difference present'. If the panel is small the triangle
test has low power, i.e. there is a high probability that real differences may go
unnoticed (see Sections 4.7 and 5.3). Therefore, a sizeable panel of at least 25
assessors is recommended. When the assessors have been selected and trained for
their task the panel size may be somewhat smaller (at least 20). In general,
economizing on the number of assessors for sensory tests may be a false economy:
choosing a small inexpensive panel may result in great losses due to missing real
opportunities for introducing improved products with increased market share.
In the duo-trio test one presents to each panellist (or 'subject') an identified
reference sample, followed by two coded samples, one of which matches the
reference sample (Fig. 38.2). The subjects are asked to indicate which of the two
coded samples matches the reference. If enough correct replies are obtained the
two coded samples are perceived as different. Table 38.2 gives the critical values
Which is equal to X: A or B?
® ®
®
Fig. 38.2. Duo-trio test: two different products are presented; the assessor has to indicate which of
these two is similar to a third product.
423
TABLE 38.1
Table of critical values for the triangle test for differences
n
« 10 5 1 0.1 n 10 5 1 0.1 n 10 5 1 0.1 « 10 5 1 0.1
1 26 13 14 15 17 51 22 24 26 29 76 32 33 36 39
2 27 13 14 16 18 52 23 23 26 29 77 32 34 36 40
3 3 3 28 14 15 16 18 53 53 24 27 30 78 32 34 37 40
4 4 4 29 14 15 17 19 54 23 25 27 30 79 33 34 37 41
5 4 4 5 30 14 15 17 19 55 24 25 28 30 80 33 35 38 41
6 5 5 6 31 15 16 18 20 56 24 26 28 31 81 33 35 38 41
7 5 5 6 7 32 15 16 18 20 57 25 26 28 31 82 34 35 38 42
8 5 6 7 8 33 15 17 18 21 58 25 26 29 32 83 34 36 39 42
9 6 6 7 8 34 16 17 19 21 59 25 27 29 32 84 35 36 39 43
10 6 7 8 9 35 16 17 19 22 60 26 27 30 33 85 35 37 40 43
11 7 7 8 10 36 17 18 20 22 61 26 27 30 33 86 35 37 40 44
12 7 8 9 10 37 17 18 20 22 62 26 28 30 33 87 36 37 40 44
13 8 8 9 11 38 17 19 21 23 63 27 28 31 34 88 36 38 41 44
14 8 9 10 11 39 18 19 21 23 64 27 29 31 34 89 36 38 41 45
15 8 9 10 12 40 18 19 21 24 65 28 29 32 35 90 37 38 42 45
16 9 9 11 12 41 19 20 22 24 66 28 29 32 35 91 37 39 42 46
17 9 10 11 13 42 19 20 22 25 67 28 30 33 36 92 37 39 42 46
18 10 10 12 13 43 19 20 23 25 68 29 30 33 36 93 38 40 43 46
19 10 11 12 14 44 20 21 23 26 69 29 31 33 36 94 38 40 43 47
20 10 11 13 14 45 20 21 24 26 70 29 31 34 37 95 39 40 44 47
21 11 12 13 15 46 20 22 24 27 71 30 31 34 37 96 39 41 44 48
22 11 12 14 15 47 21 22 24 27 72 30 32 34 38 97 39 41 44 48
23 12 12 14 16 48 21 22 25 27 73 31 32 35 38 98 40 41 45 48
24 12 13 15 16 49 22 23 25 28 74 31 32 35 39 99 40 42 45 49
25 12 13 15 17 50 22 23 26 28 75 31 33 36 39 100 40 42 46 49
424
TABLE 38.2
Table of critical values for the duo-trio test
n 10 5 1 0.1 n 10 5 1 0.1
1 26 17 18 20 22 51 31 32 35 37 76 45 46 49 52
2 27 18 19 20 22 52 32 33 35 38 77 45 47 50 53
3 28 18 19 21 23 53 32 33 36 39 78 46 47 50 54
4 4 29 19 20 22 24 54 33 34 36 39 79 46 48 51 54
5 5 5 30 20 20 22 24 55 33 35 37 40 80 47 48 51 55
6 6 6 31 20 21 23 25 56 34 35 38 40 81 47 49 52 55
7 6 7 7 32 21 22 24 26 57 34 36 38 41 82 48 49 52 56
8 7 7 8 33 21 22 24 26 58 35 36 39 42 83 48 50 53 56
9 7 8 9 34 22 23 25 27 59 35 37 39 42 84 49 51 54 57
10 8 9 10 10 35 22 23 25 27 60 36 37 40 43 85 49 51 54 58
11 9 9 10 11 36 23 24 26 28 61 37 38 41 43 86 50 52 55 58
12 9 10 11 12 37 23 24 27 29 62 37 38 41 44 87 50 52 55 59
13 10 10 12 13 38 24 25 27 29 63 38 39 42 45 88 51 53 56 59
14 10 11 12 13 39 24 26 28 30 64 38 40 42 45 89 52 53 56 60
15 11 12 13 14 40 25 26 28 31 65 39 40 43 46 90 52 54 57 61
16 12 12 14 15 41 26 27 29 31 66 39 41 43 46 91 53 54 58 61
17 12 13 14 16 42 26 27 29 32 67 40 41 44 47 92 53 55 58 62
18 13 13 15 16 43 27 28 30 32 68 40 42 45 48 93 54 55 59 62
19 13 14 15 17 44 27 28 31 33 69 41 42 45 48 94 54 56 59 63
20 14 15 16 18 45 28 29 31 34 70 41 43 46 49 95 55 57 60 63
21 14 15 17 18 46 28 30 32 34 71 42 43 46 49 96 55 57 60 64
22 15 16 17 19 47 29 30 32 35 72 42 44 47 50 97 56 58 61 65
23 16 16 18 20 48 29 31 33 36 73 43 45 47 51 98 56 58 61 65
24 16 17 19 20 49 30 31 34 36 74 44 45 48 51 99 57 59 62 66
25 17 18 19 21 50 31 32 34 37 75 44 46 49 52 100 57 59 63 66
425
for various panel sizes and significance levels. For example, with a panel of n = 30
panellists one would need at least 20 correct answers to conclude that there is a
significant (a = 5%) difference.
In paired comparison tests two different samples are presented and one asks
which of the two samples has 'most' of the sensory property of interest, e.g. which
of two products has the sweetest taste (Fig. 38.3). The pairs are presented in
random order to each assessor and preferably tested twice, reversing the pre-
sentation order on the second tasting session. Fairly large numbers (>30) of test
subjects are required. If there are more than two samples to be tested, one may
compare all possible pairs ('round robin'). Since the number of possible pairs
grows rapidly with the number of different products this is only practical for sets of
three to six products. By combining the information of all paired comparisons for
all panellists one may determine a rank order of the products and determine
significant differences. For example, in a paired comparison one compares three
food products: (A) the usual freeze-dried form, (B) a new freeze-dried product, (C)
the new product, not freeze-dried. Each of the three pairs are tested twice by 13
panellists in two different presentation orders, A-B, B-A, A-C, C-A, B-C, C-B.
The results are given in Table 38.3.
The results of such multiple paired comparison tests are usually analyzed with
Friedman's rank sum test [4] or with more sophisticated methods, e.g. the one
using the Bradley-Terry model [5]. A good introduction to the theory and appli-
cations of paired comparison tests is David [6]. Since Friedman's rank sum test is
based on less restrictive, ordering assumptions it is a robust alternative to two-way
analysis of variance which rests upon the normality assumption. For each panellist
(and presentation) the three products are scored, i.e. a product gets a score 1,2 or 3,
when it is preferred twice, once or not at all, respectively. The rank scores are
summed for each product /. One then tests the hypothesis that this result could be
obtained under the null hypothesis that there is no difference between the three
products and that the ranks were assigned randomly. Friedman's test statistic for
this reads
Which of the two samples
is the sweetest ? A or B?
© CD
Fig. 38.3. Paired comparison test: two different products are presented and the assessor has to indi-
cated the one that has most of a specified attribute.
426
TABLE 38.3
A B C
T=l2[C^Rf)/{bt(t+l)}]-3b(t-^l) (38.1)
along a single scale. One way out of this situation is to reconsider the testing
design, improve the instructions to the assessors, rephrase the question asked in a
way less prone to misinterpretation, etc. Another way out is to adapt the analysis of
the data allowing for additional dimensions. In the next section we show how the
paired comparison of a set of products (objects, items) can give the basic input for
such a multidimensional scaling.
TABLE 38.4
Table of approximate flying times (hours) between nine major European cities
From/To: At Lo Ma Mu Pa Ro St VI Wa
Here, d^j is the dissimilarity of object / and object y, d^. is the mean of row /, d^ the
mean of column;, and d the overall mean of the matrix of dissimilarities. When
the solution is found one hopes that a sufficiently good approximation can be
obtained in a low-dimensional space, preferably a two-dimensional space for easy
graphical display. Figure 38.4 shows the map obtained by displaying the scores for
each city along the principal axes 1 and 2. One should be aware that flying times
are only an approximate indication of the true distance, since the average flying
speed may differ between different connections. Yet, one notices that the config-
uration of the cities is quite close to reality. One may well imagine that a similar
analysis for cities from different continents would give a poor representation in
two dimensions. Here, a three-dimensional representation would be much more
adequate. With sensory data the best dimensionality of the solution cannot be
guessed beforehand and is itself an analysis result of major interest. In practice one
must carry out the MDS analysis for different choices of the dimensionality. The
final choice is guided by goodness-of-fit criteria, interpretability of the solution
and a desire to keep the results as simple as possible.
4
Stockholm
3
c London Warzaw
o 1
•(/) Vlaardingen
c
(D
E 0
-1
• Rome
- 2 Madrid
Athens
-3 - 2 - 1 0 1
Dimension 2
Fig. 38.4. Two-dimensional map of European cities derived from the flight time data of Table 38.4 us-
ing classical (metric) MDS.
429
Stockholm
2
•
1- Warzaw
London •
• en
o Vlaarding
C/5
C
CD
0 •
Paris
Munich
-1 • Rome
Madrid
• Athens
- 2 - 1 0 1 2
Dimension 2
Fig. 38.5. Two-dimensional map of European cities obtained from non-metric MDS (using ranked
distances) applied to the flight time data of Table 38.4,
Even when the input data are much less accurate a reasonable configuration can
be recovered by MDS. For example, replacing the distances by their rank numbers
and then applying non-metric MDS gives the configuration displayed in Fig. 38.5,
which still represents the actual distance configuration quite well.
In non-metric MDS the analysis takes into account the measurement level of the
raw data (nominal, ordinal, interval or ratio scale; see Section 2.1.2). This is most
relevant for sensory testing where often the scale of scores is not well-defined and
the differences derived may not represent Euclidean distances. For this reason one
may rank-order the distances and analyze the rank numbers with, for example, the
popular method and algorithm for non-metric MDS that is due to Kruskal [7]. Here
one defines a non-linear loss function, called STRESS, which is to be minimized:
ordering of the dissimilarities, which may still be in accordance with the measure-
ment level (ordinal). There are many extensions and adaptations to multidi-
mensional scaling methods [8], which are beyond the scope of this book. In the
computer science and pattern recognition literature one finds closely related
procedures under the name of non-linear mapping [9].
Example
Twenty-four different types of bread were assessed visually by 12 assessors.
The samples differed in colour, size, shape, aeration and filling. The assessors
were asked to indicate on a line scale the difference in appearance between
photographs of two samples. A tick mark on the left side of the line scale indicates
no difference, a tick mark on the right end of the line scale a large difference. The
distances of the tick marks from the left end were measured for each of the pairs
and for each panellist. The distances were averaged over all panellists and the
average distances obtained were ranked and analyzed using Kruskal's non-metric
multidimensional scaling technique. A 4-dimensional configuration satisfactorily
approximated the original rank-distances. Figure 38.6 shows a 3D-graph of the
MDS analysis. The graph reveals a clear clustering into six groups. These could be
labelled as white bread, wheat bread, currant bread and pastry bread, and two
intermediate forms. The largest perceived differences are associated with a dif-
ference in colour (-dim 1) and filling (-dim 2) and aeration (-dim 3). The other
dimensions, related to size and shape differences, played a minor role in discrimi-
nating the various products.
€URRANr_
BftEAD
-0.22
0.32
dim 2 -0-69 -1.69 -1.25
Attribute
Appearance Pale fl • Yellow
Smooth o' * * Coarse
Spreading slippery jy Sticky
Smooth "'"C^ Brittle
Fine ^B > • Coarse
Mouthfeel Quick • \ * Slow
Thin P \ * Thick
Fine O ^* it Coarse
Flavour None Very Intense
None Very Salty
T"\
None 6 ^ ^ * Off-Flavour
0 5 10 15 20 25 30 35
Score
Fig. 38.7. Example of average sensory profile for two food products as obtained in a QDA panel. For
five attributes (*) the difference is statistically significant (95% confidence level).
attributes arranged along one axis and the other axis providing the scale for the
scores. Attribute scores belonging to the same product are joined, thus giving a
multivariate profile of the sensory characteristics of the product.
The analysis of QDA panel data is usually done by means of analysis of
variance (see Chapter 6) on the individual attributes. With the individual data it is
arguable whether analysis of variance is a proper method with its assumption of
normality. In a two-way ANOVA lay-out with "Products" and "Panellist" as main
factors one can test the main effects against the Product*Panellist interaction (see
Section 6.8.2). The latter interaction can be tested against the pure error when the
testing has been repeated in different sessions. Ideally, one would like this inter-
action to be non-significant. Absence of a significant Product*Panellist interaction
means that the panellists have similar views on the relative ordering of the
products. A practical consequence is that it is less of a problem when not all
panellists are present at each tasting session. With panel mean scores the assump-
tion of normality is much more easily satisfied. For such panel average scores one
may for example test for a difference between products when the samples have
been tasted at two or more sessions. In Fig. 38.7 the attributes for which there is a
significant difference between the two products have been indicated.
When there are many samples and many attributes the comparison of profiles
becomes cumbersome, whether graphically or by means of 2inalysis of variance on
all the attributes. In that case, PC A in combination with a biplot (see Sections 17.4
and 31.2) can be a most effective tool for the exploration of the data. However, it
does not allow for hypothesis testing. Figure 38.8 shows a biplot of the panel-
average QDA results of 16 olive oils and 7 appearance attributes. The biplot of the
433
134%
21
131 ^ G11
Fig. 38.8. Biplot of panel average QDA profiles (attributes) for 16 olive oil products.
column-centered data shows the approximate position of the products and the
attributes in sensory space. The 2D-approximation of Fig. 38.8 is a fairly good one,
since it accounts for 83% of the variance. Noteworthy in the biplot is the presence
of one distinct sample, G13, characterized by a particular low transparency,
perhaps due to the presence of "particles". The plot, which is based on a factoring
using a = 0 and (3 = 1 (see Section 31.2), also gives a geometrical view on the
correlation between the various attributes. For example, the attributes "brown" and
"green" are highly correlated and negatively correlated with "yellow". The
attributes "transparency" and "particles" are also strongly negatively correlated.
Instead of using the principal component axes one might define two factors (see
Section 34.2), obtained by a slight rotation over 20 degrees, which are associated
with the particles/transparency contrast and the green/yellow contrast, respec-
tively. The attributes "syrup" and "glossy" are not well represented in this
2-dimensional projection.
When many data sets for the same set of products (objects) are available it is of
interest to look for the common information and to analyze the individual
deviations. When the panellists in a sensory panel test a set of food products one
might be interested in the answer to many questions. How are the products
positioned, on the average, in sensory space? Are there regions which are not well
434
covered, which may provide opportunities for new products? How are the
attributes related? Are some panellists deviating significantly from the majority
signalling a need for retraining? When comparing different panels, is there a
consensus among the panels? Can one compare the results of different panels
across (cultural, culinary) borders. After all, descriptive panels should be more or
less objective! Can one compare the results of panels (panellists) which have tested
the same samples, but have used different attributes? Can this then be used to
interrelate the different sets of attributes?
A powerful technique which allows to answer such questions is Generalized
Procrustes Analysis (GPA). This is a generalization of the Procrustes rotation
method to the case of more than two data sets. As explained in Chapter 36
Procrustes analysis applies three basic operations to each data set with the
objective to optimize their similarity, i.e. to reduce their distance. Each data set can
be seen as defining a configuration of its rows (objects, food samples, products) in
a space defined by the columns (sensory attributes) of that data set. In geometrical
terms the (squared) distance between two data sets equals the sum over the squared
distances between the two positions (one for data set X^ and one for Xg) for each
object.
The first operation in Procrustes analysis is to shift the center of gravity of each
configuration of data points to the origin. This is a geometric way of saying that
each attribute is mean-centered. In the next step the data configurations are rotated
and possibly reflected to further reduce the distances between corresponding
objects. When more than one data set is involved each data set is rotated in turn to
the mean configuration of the other data sets. In the third step the data sets are
shrunk or stretched by a scaling factor in order to increase their match. For each
data set its scaling factor applies equally to all its attributes, thus the data
configuration of objects shrinks or expands by an equal amount in any direction.
The process of alternating rotations and scaling is repeated until convergence to
stable and optimally close configurations.
We now give a qualitative discussion of the usage and interpretation of the GPA
method to sensory data. The most common application in sensory analysis of GPA
is the comparison of the test results from different panellists. The interest may
simply be to find the 'best' average configuration of the samples. For example, Fig.
38.9 shows the results of a Procrustes analysis on 7 cheese products measured in
triplicate by a QDA panel of 8 panellists using 18 attributes describing odour,
flavour, appearance and texture. The graph shows the average position of the 7
cheeses after optimal Procrustes transformation of the 8x3 = 24 different data sets.
This final average GPA configuration is often referred to as the consensus
configuration. The term 'consensus' is somewhat misleading as the final GPA
configuration is merely the result of an averaging process, it is not the result of
some group discussion between the panellists! In principle, the 7 products span a
435
/>.
A /D '
CM
W
X
Q.
O
^1—1^
Q. G^ "E
^c
principal axis 1
Fig. 38.9. Consensus plot showing the relative position of 7 cheese products (A-G) as assessed by a
panel in 3 sessions. Triangles for each product indicate the three sessions. Differences between the
three sessions are much smaller than differences between products.
6-dimensional space. In this case a two-dimensional projection onto the first two
principal components of the GPA average configuration is a good enough
approximation, accounting for 83% of the total variation. The triangles in the plot
around each product indicate the average positions obtained at the three sessions.
Clearly, the difference among the products exceeds the within-product between-
session variability. Therefore, an interpretation of the results with regard to the
products can be meaningful. Conclusions which can be drawn from the graph are,
for example: A takes a somewhat isolated position, E and F are close, so are C and
G, B is an 'average' product, the lower right area is empty, and so on. For the
product developer who has a background knowledge about the various products
such a graphical summary of the sensory properties can be a useful aid in his work.
For an interpretation of the principal axes one may draw a correlation plot. This
is a plot of the loadings (correlations) of the individual attributes with the principal
axis scores. Figure 38.10 shows an example of an international collaborative study
involving panels from five different institutes [11]. The aim was to assess the
degress of cross-cultural differences in the sensory perception of coffee. The
panels characterized eight brands of coffee, each with an independently developed
list of attributes. The correlation plot reveals that attributes with the same or a
similar name are in general positioned close together. This lends credit to the
'objectivity' of the QDA technique. Attributes which are close to the circle of
radius 1 are well represented by the 2-dimensional space of the first two principal
axes. Thus, the correlation plot and the reference to the common configuration is
helpful in judging the relations between the various attributes within and between
different panels. One may also try to label each principal axis with a name that is
436
i/BURNT
CVi BITTER
BURNT
I WOODY
BITTER
I
umt
bitter
bitter
^ bitter
-0.5
DENMARK
Poland
GERMANY
-1.0 France
-0.5 0.0 0.5 GB/ICO
principal axis 1
Fig. 38.11. Deviations of the 8 panellists from the consensus, based on a PC A of the residuals x panel-
list table, where the residuals comprise all products, sessions and attributes.
Example
Beilken et al. [12] have applied a number of instrumental measuring methods to
assess the mechanical strength of 12 different meat patties. In all, 20 different
physical/chemical properties were measured. The products were tasted twice by 12
panellists divided over 4 sessions in which 6 products were evaluated for 9 textural
attributes (rubberiness, chewiness, juiciness, etc.). Beilken et al. [12] subjected the
two sets of data, viz. the instrumental data and the sensory data, to separate
principal component analyses. The relation between the two data sets, mechanical
measurements versus sensory attributes, was studied by their intercorrelations.
Although useful information can be derived from such bivariate indicators, a truly
multivariate regression analysis may give a simpler overall picture of the relation.
In recent years the application of techniques such as PLS regression to link the
block of sensory variables to the block of predictor variables has become popular.
PLS regression is well suited to data sets with relatively few objects and many
highly correlated variables. It provides an analysis in terms of a few latent
variables that often allows a meaningful interpretation and an effective graphical
summary. When we analyze the data of Beilken et al. with PLS2 regression (see
Section 35.7) a two-dimensional model is found to account for 65-90% of the
variance of the sensory attributes, with the exception of the attributes juicy and
greasy which cannot be modelled well with this set of explanatory variables.
439
2
1 eg
a
0 f h
-2]
-4
i
-61.ll , , ^ 1 1 1 1 ' r*
-6 -4 -2 0
t1
Fig. 38.12. Scores of products (meat patties) on the first two PLS dimensions.
Figure 38.12 shows the position of the twelve meat patties in the space of the first
two PLS dimensions. Such plots reveal the similarity of certain products (e.g. C
and D, or E and G) or the extreme position of some products (e.g. A or I or L).
Figure 38.13 shows the loadings of the instrumental variables on these PLS factors
and Fig. 38.14 the loadings of the sensory attributes. The plot of the products in the
0.5
COOKLOSS
WB SLOPE I COHESIV
CM
0.0
w e PEAK
CPJYD
CPl jyp WB SHEAR
PH
TENSILE
-0.5
-0.5 0.0 0.5
p1
Fig. 38.13. Loadings of predictor (instrumental) variables on the first two PLS dimensions.
440
0.61
COARSE
CHEWY
0.4i
TEXaiRE
0.2
CM
O
GREAS'^dUICY
CRUMBLY
0.0
T_^RUBBER
0.2
ADHESION
0.4-L , 1 , 1 n 1 1 . r'
Fig. 38.14. Loadings of dependent (sensory) variables on the first two PLS dimensions.
space of the PLS dimensions has a fair resemblance to the PC A scores plot only for
the sensory variables. As a consequence, the PLS loading plot of the sensory
variables (Fig. 38.14) gives a similar picture as a PCA loading plot would give.
The associations between the variables within each set are immediately apparent
from Fig. 38.13 or Fig. 38.14. For example, "hardness", "hrdxchv", "R-punch"
and "CP-peak" are all highly correlated and indicate the firmness of the product.
As another example, "tensile" strength is positively correlated with the amount of
protein and negatively correlated with "moisture" and "fat". It would require an
exhaustive inspection of the 20x9 correlation table to obtain similar conclusions
about the relationship between variables of the two sets as the conclusions derived
from simply comparing or overlaying Figs. 38.13 and 38.14.
In the foregoing we loosely talked about the intensity of a sensory attribute for a
given sample, as if the assessors perceive a single (scalar) response. In reality,
perception is a dynamic process, and a very complex one. For example, when a
food product is taken in the mouth, the product disintegrates, emulsions are
broken, flavours are released and transported from the mouth to the olfactory
(smell) receptors in the nose. The measurement of these processes, analyzing and
interpreting the results and, eventually, their control is of importance to the food
441
CO
CD
time I s
manufacturer. There are many ways to study the temporal aspects of sensory
perception. Experimentally many methods have been developed to measure so-
called time-intensity curves or Tl-curves. Currently popular methodology is to use
a slide-wire potentiometer or a computer mouse and to feed the data directly into
the computer. The sort of curves that is obtained is shown in Fig. 38.15.
Typically, one may characterize such a curve by a number of parameters, such
as time-to-maximum-intensity, maximum intensity, time of decay, total area. The
way to average such curves over panellists in order to derive a panel-average
Tl-curve is not trivial. Geometrical averaging in both the intensity and time
direction may help to best preserve shape. Separate analysis of variance of the
characteristic parameters of the average curves can be used to assess the
differences between the products. One may also try to fit a parametric function to
each individual curve, for example, a combination of two exponential functions
(see Chapters 11 and 39). The curves are then characterized by their best-fitting
parameters, and these are compared in a subsequent analysis. Another method is to
leave the curves as they are and to analyze the whole set of curves by PCA. It
would seem natural to consider each curve as a multivariate observation and the
intensities at equidistant points in time as the variables. Since there is a natural zero
point (^ = 0, / = 0) in these measurements it makes sense in this case not to center
each curve around its mean intensity, but to analyze the raw intensity data. Also,
since the different maximum intensities may be related to concentration of the
bitter component it would be imprudent to scale the curves to a common standard
(e.g. maximum, mean intensity, area). However, one might consider a log trans-
formation allowing for the nature (ratio scale) of the intensity scale.
Figures 38.16 to 38.18 show the result of such an uncentered PCA applied to a
set of Tl-curves obtained by 9 panellists [13]. The perceived intensity of bitterness
442
25Ch
Caffeine 1
Fig. 38.16. Loading curves (PCI, non-centered PCA) for 4 bitter solutions.
of four solutions, caffeine and tetrahop at two concentration levels, was recorded
as a function of time. The analysis is applied separately to each of four bitter
solutions. The loading plots for PCI and PC2 are shown in Figs. 38.16 and 38.17.
Notice that PCI (Fig. 38.16) has little structure: it represents an equal weighting
over most of the time axis. In fact the PCI loading plot very much resembles the
average curve for each product. This is a common outcome with noncentered PCA.
The loading plot for PC2 (Fig. 38.17) has a more distinct structure. Since it has a
negative part it does not represent a particular type of intensity curve. PC2 affects
the shape of the curve. One notices in Fig. 38.17 a distinct shape between the two
tetrahop solutions and the two caffeine solutions. This interpretation of PCI as a
size component and PC2 as a contrast component is a familiar phenomenon in
principal component analysis of data (see Chapter 31). A different interpretation is
obtained if one considers a rotation of the PCs. Rotation of PCI and PC2 does give
more interpretable curves: PCI + PC2 gives a curve that rises steeply and decays
deeply, representing a fast perception, whereas PCI - PC2 gives a curve that starts
to rise much more slowly, reaching its maximum much later with a longer lasting
perception. The score plot of the panellists in the space of PCI and PC2 is shown in
Fig. 38.18, for one of the four product. A high score along PCI, e.g. panellist 6,
implies that the panellist gave overall high intensity scores. A high score on PC2
443
_ 60-1
<
O - Tetra 2
Q.
c
Q>
O
c 2(H
o
c
0)
I <H
O
CO
>
"55 -2(H
c
C
C -4(H
o
o
0) -SO n— "T r- "1 "I
10 20 30 40 50 60 70 80 90
Time
Fig. 38.17. Loading curves (PC2, non-centered PCA) for 4 bitter solutions.
Caffeine 1
4.6
4
73
5 8
-1
-4-
dimension 1
4-6
c
o
Fig. 38.18. Score plot (PC2 v. PCI) based on non-centered PCA of Tl-curves from 9 panellists for a
bitter (caffeine) solution.
444
(e.g. panellists 9), implies a Tl-curve with a relatively fast rise and early peak, in
contrast to a low (negative) score on PC2 (e.g. panellist 1), implying a TI curve
with a relatively slow rise and late peak.
There have also been attempts to describe the temporal aspects of perception
from first principles, the model including the effects of adaptation and integration
of perceived stimuli. The parameters in the specific analytical model derived were
estimated using non-linear regression [14]. Another recent development is to
describe each individual TI-curve,y;(r), / = 1, 2,..., n, as derived from a prototype
curve, S(t). Each individual Tl-curve can be obtained from the prototype curve by
shrinking or stretching the (horizontal) time axis and the (vertical) intensity axis,
i.e. f.(t) = a^ S(bi i). The least squares fit is found in an iterative procedure,
alternately adapting the parameter sets {^,, Z?.} for / = 1,2,..., n and the shape of the
prototype curve [15].
Example
Figure 38.19 shows the contour plots of the foaming behaviour, uniformity of
air cells and the sweetness of a whipped topping based on peanut milk with varying
com syrup and fat concentrations [16]. Clearly, fat is the most important variable
determining foam (Fig. 38.19A), whereas com symp concentration determines
sweetness (Fig. 38.19C). It is rather the mle than the exception that more than one
sensory attribute are needed to describe the sensory characteristics of a product. An
effective way to make a final choice is to overlay the contour plots associated with
the response surfaces for the various plots. If one indicates in each contour plot
which regions are preferred, then in the overlay a window region of products with
acceptable properties is left (see Fig. 38.19D and Sections 24.5 and 26.4). In the
20 / 100
110'
1 y B
,^ -"'—--.
W""
1 9D_ — " ^ • - , ^
L 8Q "^^^^
[ 70 ~"~~-- ^ ^-^
10
15 20 15 20 25
% Corn Syrup % Corn Syrup
^u 20 ^ ^ ^ C ^ _ _ _ ^ _ J : , , - ^ - — ^
\
m-^-^:
\ C
CO CO
15- 15
15 20 25 15 20 25
% Corn Syrup % Corn Syrup
Fig. 38.19. Contour plots of foam (A), uniformity of air cells (B) and sweetness (C) as a (fully-
quadratic) function of the levels of fat and corn syrup. An overlay plot (D) shows the region of overall
acceptability.
446
case of this example products with >135 g fat and <175 g com syrup per kg would
give acceptable foaming behaviour, uniformity of air cells and sweetness. One
might also construct a composite desirability function and optimize this composite
response (see Section 26.4 on multi-criteria decision making). Sometimes, the best
settings for the various responses are conflicting and there is no region where all
properties are acceptable. In that case one has either to relax the acceptability limits
for the various responses or to change the product concept.
References
1. M. Meilgaard, G. V. Civille and B.T. Carr, Sensory Evaluation Techniques, 2nd Edition. CRC
Press, Boca Raton, PL, 1987.
2. H.J.H. MacFie, Data Analysis in Flavour Research: Achievements, Needs and Perspectives,
in: Flavour Science and Technology, M. Martens, G.A. Dalen and H. Russwurm Jr (Editors).
Wiley, 1987.
3. H. Stone and J.L. Sidel, Sensory Evaluation Practices, 2nd ed. Academic Press, San Diego,
1993.
4. S. Siegel. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, Tokyo, 1956.
5. R.A. Bradley, Science, statistics, and paired comparisons. Biometrics, 32 (1976) 213-232.
6. H. A. David, The Method of Paired Comparisons. Charles Griffin, London, 1963.
7. J.B. Kruskal, Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothe-
sis. Psychometrika, 29 (1964) 1-27, 115-129.
8. S.S. Schiffman, M.L. Reynolds and F.W. Young, Introduction to Multi-dimensional Scaling:
Theory, Methods and Applications. Academic Press, New York, 1981.
9. J.W. Sammon, A nonlinear mapping for data structure analysis. IEEE Trans. Comput., C-18
(1969)401-409.
10. ABC Executive Flight Planner, June 1994. Reed Telepublishing, Dunstable, 1994.
11. S. de Jong, J. Heidema and H.C.M. van der Knaap, Generalized Procrustes analysis of coffee
brands tested by five European sensory panels. Food Qual. Pref, 9 (1998) 111-114.
12. S. Beilken, L.M. Eadie, I. Griffiths, P.N. Jones and P.V. Harris, Assessment of the textural
quality of meat patties. J. Food Sci., 56 (1991) 1465-1470.
13. G. Dijksterhuis, Principal component analysis of Tl-curves: three methods compared. Food
Qual. Pref., 5 (1994) 121-127.
14. P. Overbosch and S. de Jong, A theoretical model for perceived intensity in human taste and
smell. Physiol. Behav., 45 (1989) 607-613.
15 G. Dijksterhuis and P. Filers, Modelling time-intensity curves using prototype curves. Food
Qual. Pref., 8 (1997) 131-140.
16. A. Abdullah, A.V.A Resurreccion and L.R. Beuchat, Formulation and evaluation of a peanut
milk based whipped topping using response surface methodology. Lebensm. Wiss. Technol.,
26(1993)162-166.
G.E. Arteaga, E. Li-Chan, M.C. Vazquez-Aretaga and S. Nakai, Systematic experimental designs for
product formula optimization. Trends in Food Science Technol., 5 (1994) 243-253.
J.B. Kruskal and M. Wish, Multidimensional Scaling. SAGE, Newberry Park, CA, 1991.
447
P. Lea, T. Naes and M. R0dbotton, Analysis of Variance for Sensory Data. Wiley, London, 1997
D. H. Lyon, M.A. Francombe, T.A. Hasdell and K. Lawson, Guidelines for Sensory Analysis in Prod-
uct Development and Quality Control. Chapman and Hall, London, 1990.
H. Martens and H. Russwurm, eds., Food Research and Data Analysis. Applied Science, London,
1983.
M. O'Mahony, Sensory Evaluation of Foods. Statistical Methods and Procedures. Marcel Dekker,
New York, 1986
T. Naes and E. Risvik (Editors), Multivariate Analysis of Data in Sensory Science. Data Handling in
Science and Technology Series, Elsevier, Amsterdam, 1996
J.R. Piggott (Editor), Sensory Analysis of Foods. Elsevier, London, 1984.
J.R. Piggott, Statistical Procedures in Food Research. Elsevier, London, 1987.
449
Chapter 39
Pharmacokinetic Models
Introduction
Rescigno and Segre [1], in one of the first significant text books on the subject,
defined pharmacokinetics as "the study of drug concentrations in different tissues
and the construction of models suited to interpret such phenomena". We can
distinguish four aspects of pharmacokinetics, namely absorption, distribution,
biotransformation and excretion of a drug. The term drug includes here any
naturally occurring or synthetic chemical compound.
Although it is highly desirable to deliver a drug to the site where it must exert its
effect, this is not generally possible except in those cases where the site of
treatment is readily accessible (such as the skin). When topical (local) treatment is
precluded, the drug must be carried to the target site via the circulatory or
lymphatic system. To this effect it must first be absorbed by the blood of the central
circulatory system. Absorption is extremely rapid and complete when a drug is
delivered directly to the circulation by means of an intravenous injection. Slower
and occasionally less efficient ways of absorption are via the mucous membranes,
by oral intake, by injection into muscle or fatty tissues, or by means of patches
which are applied to the skin. Pharmaceutical technology continuously tries to
explore novel ways of administration in order to deliver a drug under the most
optimal conditions. Important considerations in this respect are the solubility of the
compound, the speed of release, and the susceptibility to biological degradation.
Once the substance has been absorbed, it is distributed by the blood towards the
various tissues of the body. If the drug is highly lipophilic, it may be rapidly
deposited into fatty tissues and slowly released thereafter. Some drugs are readily
adsorbed to membranes of cells, especially those of the red blood cells, others may
adsorb onto circulating proteins. The plasma of blood, however, must carry the
drug to the target tissue. If the treatment of a disease calls for a rapid and burstlike
delivery to the target, then it is undesirable to have a large fraction of the drug
stored away in various buffers. On the other hand, when a slow and progressive
release is required, temporal storage into fatty tissues may be an advantage. Some
tissues possess membranes which are highly selective for the passage of specific
chemical substances. The blood-brain barrier and the placenta are among these.
450
Excretion is the process by which a substance leaves the body. The most
common ways are via the kidneys and via the gut. Renal excretion is favored by
water-soluble compounds that can be filtered (passively by the glomeruli) or
secreted (actively by the tubuli) and that are collected into urine. Fecal excretion is
followed by more lipid substances that are excreted from the liver into the bile,
which is collected in the gut and passed out by the feces. Other routes of excretion
are available through the skin and the lungs.
As long as a compound circulates in the body it is at risk of being degraded by
enzymes. This process is called biotransformation or metabolism. Many of the
biotransformation processes occur in the liver which contains a wide variety of
enzymes of the cytochrome P-450 family, all geared towards the breakdown of
compounds that are foreign to the body (so-called xenobiotics). Metabolic break-
down may also occur in plasma by circulating enzymes or even in the vicinity of
the target site. The pathway of biodegradation can be very complex and often
results in several metabolites, each of which is distributed throughout the body and
excreted in its own particular way. Some metabolites are neutral, others may be
harmful, still others may possess a therapeutic effect. In some cases, a prodrug is
delivered into the system which is inert by itself but which is metabolized near the
target site into the active compound. Such a strategy is highly effective and least
harmful to the system. In all these instances where biotransformation plays a
relevant role, it will be necessary to study the kinetics of the major and most
reactive metabolites in addition to those of the original compound.
Pharmacokinetics is closely related to pharmacodynamics, which is a recent
development of great importance to the design of medicines. The former attempts
to model and predict the amount of substance that can be expected at the target site
at a certain time after administration. The latter studies the relationship between
the amount delivered and the observable effect that follows. In some cases the
observable effect can be related directly to the amount of drug delivered at the
target site [2]. In many cases, however, this relationship is highly complex and
requires extensive modeling and calculation. In this text we will mainly focus on
the subject of pharmacokinetics which can be approached from two sides. The first
approach is the classical one and is based on so-called compartmental models. It
requires certain assumptions which will be explained later on. The second one is
non-compartmental and avoids the assumptions of compartmental analysis.
Although the methods that are discussed in this chapter deal explicitly with the
disposition of drugs in animals and humans, their scope is much wider. In general,
these methods can be applied to study the transport of substances within parts of a
system provided that these transports can be described by zero or first order
kinetics. This applies, for example, when the rate of change of the amount in one
part of the system depends linearly on the amounts present in all the various parts
of the system. Applications are found commonly in first order chemical reactions,
451
Uptake and washout of tracers, radioactive and fluorescent decay, etc. In a broader
context, kinetic models find application in population dynamics, ecology, meteor-
ology, epidemiology, etc. Non-linear models are of special importance in these
fields. Although their formulation is often only slightly more involved than their
linear counterparts, the complexity of their solutions has only recently been
explored. They lie at the origin of fractal geometry and deterministic chaos. Fractal
geometry arises when a system does not evolve along smooth paths, such as the
stream lines in laminar flow, but follows highly complex patterns, such as the
eddies in turbulent flow. The latter system appears to behave chaotically, yet obeys
constraints which force it to remain under the influence of so-called attractors. The
dimensionality of an attractor in a seemingly chaotic system is characterized by a
fractional number rather than by an integer one, hence the name fractal [3].
dX "" ( "" ^
X, (39.1)
where X^ is the amount of drug within compartment / at time t, k^j is the transfer
constant for transport from compartment / to compartment; (in reciprocal time
units) and where n represents the number of compartments. An illustration of
transfer constants in the case of a two-compartment system is given in Fig. 39.1.
The equation above states that the net content in any compartment / equals the
sum of all inflows from the other compartments minus the sum of all outflows
towards the other compartments. Such a relation is also referred to as a mass
balance differential equation, which is well-known in chemical engineering. In the
452
Skin
2
A
kj2 ^21
itravenous T
injection Blood Plasma
^ ' W^ Elimination
^ ^
1
Fig. 39.1. Two-compartment open model composed of a central (plasma) compartment and a target
(skin) compartment. This model assumes that a dnig is delivered rapidly into plasma from which it is
either exchanging with the target organ or eliminated by excretion or metabolism.
IntraveilOUS
injecti on Skin
1 t
w
'^-
Liver ^
c Blood Plasma Kidney
^ ^
>
>i >r T
Vletabolism Adipose Excretion
Tissues
Fig. 39.2. Multicompartment model which, in addition to Fig. 39.1, takes into account that the drug is
buffered in adipose (fatty) tissue, excreted by the kidneys and metabolized in the liver.
circulation to the liver. In this case, a large fraction of the drug may be metabolized
during its first pass through the liver. Some of the drug (and its metabolites) is
returned to the gut lumen via the excretion in the bile. The substances are excreted
by the gut into the feces or can be re-entered into the system through the gut tissue.
The remaining part is exchanged between the liver and the plasma fluid. From the
plasma, some part of the drug can be excreted through the kidneys, it can be stored
in adipose tissues or it can be adsorbed on the membranes of red cells. Eventually,
the drug penetrates the skin w^here it must produce its therapeutic benefits. This is
an example of a mixed mammillary-catenary model [5].
An additional problem arises when the exchange processes are rate-limited.
This may be caused by enzymes that become saturated when all their active sites
are occupied by the drug, or it may be due to adsorbing proteins that have a limited
binding capacity. In such cases, one obtains a type of Michaelis-Menten kinetics of
the form:
^max^
(39.2)
dt K +x
where V is the rate of the process, V^^^ represents the maximal rate when the
process is completely saturated, and K^ denotes the Michaelis constant. A rate-
limited process introduces a non-linearity which is incompatible with the usual
454
Oral
administration
Gut Fecal
Lumen excretion
Bile
Gut Skin
Tissue
Fig. 39.3. Multicompartment model which, in addition to Fig. 39.2, models absorption of a drug via
the gut, excretion via the bile and adsorption to the membranes of red cells.
© ®
DoseD
Cp<0)
x„.Cp
Plasma
Compartment
1^
Xe
>r D-
Elimination
Pool
Fig. 39.4. (a) One-compartment open model with single-dose intravenous injection of a dose D. The
transfer constant of elimination (excretion and metabolism) is kp^. (b) Time course of the plasma con-
centration Cp and of the contents in the elimination pool Xe.
456
- ^ = -^pe^P (39.4)
and also
X,(r) = D ( l - e " ^ ' )
Note that the solution is in terms of amounts X^ rather than observed plasma
concentrations Cp. The conversion from amounts to concentrations is defined by
means of:
where V^ represents the volume of distribution of the drug in plasma. The volume
of distribution should not be confounded with the physical plasma volume (which
is about four liters in adult humans). The former may exceed the latter by several
orders of magnitude. This occurs when there is a large binding affinity of the drug
molecules with membranes of red blood cells or with circulating proteins. In that
case, only a minute fraction of the drug is recovered in the plasma, the larger part
being buffered reversibly on membranes and proteins. We can derive the volume
of distribution Vp from the dose D and from the observed plasma concentration at
time zero Cp(0):
V = - ^ (39.7)
P Cp(0)
The half-life time ty2 of a drug is an important pharmacokinetic parameter. In this
simple model it can be obtained immediately from the solution in eq. (39.6):
- ^ — = Cp(0)e"'-^- (39.8)
,,.-f^ (39.9)
457
Cp(0) D
AUC = - ^ = (39.12)
^pe *^^p ^pe
The AUC is a measure of bioavailability, i.e. the amount of substance in the central
compartment that is available to the organism. It takes a maximal value under
intravenous administration, and is usually less after oral administration or par-
enteral injection (such as under the skin or in muscle). In the latter cases, losses
occur in the gut and at the injection sites. The definition also shows that for a
constant dose Z), the area under the curve varies inversely with the rate of elimina-
tion /Cpg and with the volume of distribution Vp. Figure 39.6 illustrates schemat-
ically the different cases that can be obtained by varying the volume of distribution
Vp and the rate of elimination k^^ both on linear and semilogarithmic diagrams.
These diagrams show that the slope (time course) of the curves are governed by the
rate of elimination and that elevation (amplitude) of the curve is determined by the
volume of distribution.
458
Cp(0) —50-1
40 -^ \
30 H \
Cp(0) \ Cp(0)
2 1
20 H
10 -\
0 -1 '^__^ , ,— 1 ' 1 1 n* 1
0 100 200 300 400 500 600 700 800
t (min)
Arctan SQ
—I— —1— —T
200 300 500 600 700 800
t (min)
Fig. 39.5. (a) Plot of plasma concentration Cp (ng T^) versus time t. Cp(0) is the extrapolated initial
concentration at time zero. At the half-life time tm the plasma concentration is half that of Cp(0). (b)
Semilogarithmic plot of plasma concentration Cp (|ig 1"^) versus time t. The intercept B of the fitted
line is the plasma concentration Cp at time 0. The slope s^ is proportional to the transfer constant of
elimination kr^e.
459
® ®
Fig. 39.6. (a) Time courses of plasma concentration Cp in the one-compartment open model for intra-
venous injection, with different contingencies for the transfer constant of elimination kp^ and the vol-
ume of distribution Vp. (b) Time courses of plasma concentration Cp as in panel (a) on semilogarithmic
plots.
From the AUC and knowing the dose, one can immediately derive from eq.
(39.12) an important pharmacokinetic parameter which is the clearance Cl^ of the
drug from the plasma:
- ^ p ^pe (39.13)
P AUC
Example
We consider the following synthetic plasma concentrations which are sup-
posedly obtained after intravenous injection of 10 mg of a drug:
The data are also represented in Fig. 39.5a and have been replotted semi-
logarithmically in Fig. 39.5b. Least squares linear regression of log C^ with respect
to time t has been performed on the first nine data points. The last three points have
been discarded as the corresponding concentration values are assumed to be close
to the quantitation limit of the detection system and, hence, are endowed with a
large relative error. We obtained the values of 1.701 and 0.005117 for the intercept
log B and slope s^, respectively. From these we derive the following pharmaco-
kinetic quantities:
The synthetic data have been derived from a theoretical one-compartment model
with the following settings of the parameters:
Dose: /)=10mg
Plasma volume of distribution: V^ = 200 1
Initial plasma concentration: C (0) = D/V = 50 L
| Lg l'
Half-life time: t^,^ = 60 min
Rate constant of elimination: k^ = 0.693/ty^ = 0.01155 min'
Area under the curve: AUC = 4.329 mg min 1"'
Plasma clearance: CI =2.31 1 min"'
p
The synthetic data have been obtained by adding random noise with standard
deviation of about 0.4 \ig 1"^ to the theoretical plasma concentrations. As can be
seen, the agreement between the estimated and the computed values is fair.
Estimates tend to deteriorate rapidly, however, with increasing experimental error.
This phenomenon is intrinsic to compartmental models, the solution of which
always involves exponential functions.
461
® ©
DoseD
Extravascular
Compartment
"ap
or"""
Plasma
Compartment
V
t
Elimination
Pool
Fig. 39.7. (a) Two-compartment catenary model for extravascular (oral or parenteral) administration
of a single dose D which is completely absorbed. The transfer constant of absorption isfcap-(b) Time
courses of the amount in the extravascular compartment Xa, the concentration in the plasma compart-
ment Cp and the content in the elimination pool X^.
462
"dT"" "p^"
X, = D - ( X , +Xp)
with the initial conditions:
X,(0) = D and Xp(0) == 0
where X^, X^ and Xj, represent the quantities in the absorption and central plasma
compartments, and in the elimination pool, respectively.
The analytic solution of the system of differential equations in eq. (39.14) can
be written as follows:
k (39.15)
Xp(0 = g ;" (e-V-e-V)
The content X^ in the absorption compartment varies with time according to the
solution of the one-compartment model which has been described in the previous
section. This time course is represented by a single negative exponential function.
Initially, the content is equal to the dose D and at a sufficiently long time after
administration, the content becomes zero. The concentration C^ in the central
compartment is initially zero, reaches a maximum and eventually becomes zero
again. The content X^ of the elimination pool starts from 0 and asymptotically
approaches the dose D at prolonged times (Fig. 39.7b). The solution for the central
plasma compartment gives rise to three distinct cases depending on the difference
between the transfer constants for absorption and elimination.
Case 1
When /:^p > k^^, the time course of the plasma concentration C^ is dominated by
the rate of elimination which is the slower of the two. This is desirable when a drug
must be delivered as rapidly as possible to the plasma compartment, for example in
the relief of acute pain. At sufficiently large times following administration, the
463
transient effect of absorption will have decayed such that one obtains approx-
imately from eq. (39.16):
Case 2
If A:^p < /^pe, the time course of C^ is dominated by the slower rate of absorption.
This is desirable when a drug is to be delivered over a prolonged period of time, for
example in the relief of chronic pain. When a sufficiently large period of time has
elapsed, the transient effect of elimination has decayed and the solution for the
plasma compartment in eq. (39.16) becomes approximately:
Case 3
In the theoretical situation when k^^ = k^^ the solution for C^ degenerates since
both the numerator and denominator vanish and the expression becomes indeter-
minate. The limiting solution must then be derived from a rearrangement of the
expression of C^ in eq. (39.16):
(a)
log Cp(t)= 1.745-0.0051661
Arctansp/ '\^
—T ^ — ^
200 I 400 500 600 700
t(min)
(b)
log C_(t)= 1.745-0.0621 t
Arctan s„ /
I :••;;....
— I —
20
t(min)
Fig. 39.8. (a) Semilogarithmic plot of plasma concentration Cp (|ig T^) versus time t. The straight line
is fitted to the later part of the curve (slow p-phase), with the exception of points that fall below the
quantitation limit. The intercept B of the fitted line is the extrapolated plasma concentration C^ that
would have been obtained at time 0 with an intravenous injection. The slope 5p is proportional to the
transfer constant of elimination kp^. (b) Semilogarithmic plot of the residual plasma concentration C "
(|ig r') versus time r, on an expanded time scale t. The straight line is fitted to the first part of the resid-
ual curve (fast a-phase), with the exception of points whose residuals fall below the quantitation limit.
The intercept B of the fitted line, is the same as that in panel a. The slope 5a is proportional to the trans-
fer constant of absorption from the extravascular compartment.
465
D k
logCP(0-log— '' -OA343k^J = logB-s^t (39.21)
with intercept log B and slope s^ = 0.4343 k^^. From the slope one obtains the
transfer constant of elimination k , The next step is to compute the residuals
between the observed plasma concentrations and the computed p-phase values,
which are again replotted on a logarithmic concentration scale (Fig. 39.8b). The
resulting line represents the absorption phase or a-phase of the plasma
concentration:
= logB-s^t
with the same intercept log B as derived above and slope s^^ = 0.4343 k^^, which
yields the absorption transfer constant k^^. The procedure by which a sum of
exponential functions is resolved into its component terms is called curve peeling.
It allows to include empirical knowledge in the analysis about the number of
exponential functions and possible time lag. A comparison of curve peeling with
non-linear regression is presented in Section 39.1.6.
Knowing the transfer constants k^^, k^^, the dose D and the intercept 5, we can
compute the volume of distribution Vp from the expression (eq. (39.18):
D ^an
— ^^— = B (39.23)
V k -k
*^p '^ap '^pe
The various effects produced by varying k^^, k^^ and Vp are illustrated in Fig. 39.9.
In practice, the half-life time of the drug in the plasma compartment is derived
from the p-phase, and is therefore denoted as t\i2:
.r„=5^ (39.24)
^pe
Likewise the half-life time of the drug in the extravascular compartment r "2 '^
given by:
^«2 z . 2 ^ (39.25)
^ap
The area under the curve AUC is obtained by integrating the plasma con-
centration function between times 0 and infinity. This integral can be obtained
analytically from eq. (39.16):
466
k-n small
ap
Fig. 39.9. Time courses of plasma concentration Cp in a one-compartment model for extravascular ad-
ministration, with different contingencies of (a) the transfer constant of absorption k^p, (b) the transfer
constant of elimination kpc and (c) the volume of distribution Vp.
D
AUC= j C^(t)dt = (39.26)
^P^e
and has the same form as in the case of the intravenous one-compartment model
developed in the previous section. This is not surprising, however, since we know
that the AUC is a measure of the bio-availability of the substance to the plasma
compartment. In this model, all of the initially administered dose D must enter the
467
C l = ^ - = V^k^, (39.27)
^ AUC P P
An important pharmacokinetic parameter is the time of appearance of the
maximum t^ of the plasma concentration. This can be derived by setting the first
derivative of the plasma concentration function in eq. (39.16) equal to zero and
solving for t, which yields:
'- - k -k " k -k ^ ^ ^
where In means natural logarithm (base e). From this expression, one can see that
the time t^^ only depends on the transfer constants of absorption k^^ and elimination
k^^. Substitution of this expression in the plasma concentration function yields the
value of the peak concentration C^it^J. Knowledge of the peak concentration is
important since a drug usually has a minimum threshold for therapeutic effect and
a maximum level above which toxic effects become manifest.
Differentiation of the plasma concentration function in eq. (39.16) at time zero
also yields the initial rate of change of C^
fdC^ D
C'iO)
p
= y K, (39.29)
dt
This shows that the initial steepness of the plasma curve depends, for a given dose
£), on the volume of distribution Vp and on the transfer constant of absorption k^^.
The various characteristics of the plasma concentration curve are also schem-
atically displayed in Fig. 39.9 for different contingencies of/c^p, k^^ and V^,
Example
The following synthetic plasma concentrations are representative for the time
course of the plasma concentration C^ after a single oral administration of a dose of
10 mg:
468
The semilogarithmic plot of the data is given in Fig. 39.8a. On this plot we can
readily identify the linear p-phase of the plasma concentrations between 30 and
240 minutes. The last three points (at 360, 480 and 720 minutes) have been
discarded because the corresponding plasma values are supposed to be close to the
quantitation limit of the detection system.
A least squares linear regression has been applied to the data pertaining to the
p-phase, yielding the values of 1.745 and 0.005166 for the intercept log B and
slope 5p, respectively. Using these results, we can compute the extrapolated plasma
concentrations C^ between 0 and 20 minutes. From the latter, we subtract the
observed concentrations C^ which yields the concentrations of the a-phase C " :
r(min): 0 2 5 10 20
C^Cngl-^): 55.5 54.2 52.3 49.3 43.8
Cp(|ngH): 0 12.3 23.6 35.5 40.7
Cp«(|jgH): 55.5 41.9 28.7 13.8 3.1
The semilogarithmic plot of the a-phase is shown in Fig. 39.8b. Through these
data we force a least squares linear regression through the vertical intercept, i.e. the
point with coordinates (0, 55.5). The latter is achieved by assigning a very large
weight to this single point, for example 100 times the weight of each of the other
points. This will guarantee that the regression line will pass through the vertical
intercept of the semilogarithmic plot, as required from the theoretical consid-
erations above. The resulting intercept log B and slope s^ are 1.745 and 0.0621,
respectively.
The pharmacokinetic parameters of the model are then readily derived from the
defining equations and the results of the regression:
Initial plasma cone, p-phase: C^ (0) = B = 55.5 \lg 1"'
Transfer constant of elimination: k = 5„/0.4343 = 0.0119 min"'
pe p
Dose: D = 10 mg
Plasma volume of distribution: n= :200l
.a = 5 min
Half-life time of absorption: '-111
Transfer constant of absorption: K-~: 0.693/r «2 = 0.1386
Half-life time of elimination: '1/2 = 60 min
Transfer constant of elimination: =0.01155
K- = 0.693/rf/2
Area under the curve: AUC = 4.329 mg min \~'
Plasma clearance: CK-= 2.311min"^
The agreement with the theoretical and the 'experimental' results is still fair.
This is due, however, to the rather moderate noise which has been superimposed
on the theoretical plasma concentration values. The standard deviation of the
random noise amounted to about 0.4 |ig 1~^ The success of a curve peeling
operation also depends on the ratio of the transfer constants k^^ and k^^, which
should differ at least by a magnitude of about 10. Errors tend to accumulate in the
successive steps of a curve peeling process. As a consequence, we obtain that the
transfer constant of the faster a-phase has been determined less accurately than the
one of the slower p-phase. In general, the fitting of sums of exponential functions
is an ill-conditioned problem, the solution of which rapidly deteriorates with
increasing noise and with decreasing distinction between the transfer constants.
F.'J^ (39.30)
= -k^X
dt P^ P (39.32)
X, = k^x-X^
with the conditions that the same result Xp(x) is obtained from both parts of the
equation.
The solution of the first part of the model is obtained by straightforward
integration:
X ro = — d - e - * - ' ) (39.33)
or equivalently:
471
® ®
Reservoir 1
;T t
>
Cp(t)
Plasma
Compartment
^ ?T t
f Xe
1 ^^^1
Elimination
Pool
Fig. 39.10. (a) One-compartment open model for continuous intravenous infusion of a dose D from a
reservoir. The rate of infusion is k^p. (b) Time courses of the content in the reservoir Xr, of the plasma
concentration Cp and of the content of the elimination pool X^.
CM) = "rp
(1- -V ) (39.34)
^P^pe
C, =limC„(T) = (39.35)
^ ^ p e
CL
Note that the steady-state plasma concentration Q^ varies proportionally with the
rate of infusion k^ and inversely with the plasma clearance C/p, the definition of
472
which has been given by eq. (39.13). This result allows us to rewrite the solution of
the model in eq. (39.34) in the form:
Cp(0 = C , 3 ( l - e - V ) (39.36)
or equivalently:
log 1 - = -0A343kt
pe
(39.37)
C.
Hence, the slope of the semilogarithmic plot of 1 - C^(t)/C^^ versus time t yields the
transfer constant of elimination k^. From the known rate constant of infusion ^^p*
the transfer constant of elimination k^ and a graphical estimate of C^^, one can then
derive the plasma volume of distribution V^, using the steady-state condition which
has been derived above.
The transfer constant of elimination k^^ has already been shown to be related to
the half-life time of the drug in the plasma compartment (eq. (39.9)):
ty2 = (39.38)
After substitution of this expression in the solution of the model in eq. (39.36) we
obtain:
Cp(r) = C,,(l-e-^-^^^'^''-) (39.39)
which shows that the steady-state plasma concentration Q^ is only obtained when
the time of infusion T is several times the half-life time ty2-
When infusion stops at time T, the plasma concentration Cp(T) decays exp-
onentially according to the solution of the one-compartment open model in eq.
(39.6):
Cp(0 = Cp(T)e-^^'-^^ (39.40)
From a semilogarithmic plot of C^(t) versus r - T, one can again estimate the
transfer constant of elimination k^. From the known values of Cp(T), ^^p and ^p^ one
also obtains a new estimate of V^.
In practice, one will seek to obtain an estimate of the elimination constant k^^
and the plasma volume of distribution V^ by means of a single intravenous
injection. These pharmacokinetic parameters are then used in the determination of
the required dose D in the reservoir and the input rate constant k^ (i.e. the drip rate
or the pump flow) in order to obtain an optimal steady state plasma concentration
473
Example
We make use of the previously defined parameters of the one-compartment
open system (eq. (39.6)).
From the solution of the continuous administration model we can now derive
the plasma concentration at the end of infusion T:
1 ftf\ 1
C(x) = • (1 - exp(-0.1155 X 60)) = 72.16 x (1 - 0.5) = 36.1 |ig l'^
P 200x0.01155 ^
This should be compared with the initial plasma concentration of 50 |Lig 1"^
which is obtained when the same dose D is injected as a single bolus. The final
concentration C^(i) is also still far from the steady-state concentration C^^ (72.2 |ig
r^), since the duration of infusion T is only equal to once the half-life time of the
drug ty2 (60 min).
® ©
Y
Y
Plasma
Compartment
Elimination
0 e 29 39
Pool
Fig. 39.11. (a) One compartment open model for repeated intravenous injection of the same dose D at
constant intervals 9. (b) Time course of the plasma concentration Cp with peaks C ^ ^ and troughs
C ™" at multiples of the time interval 9. The peaks and troughs tend asymptotically toward the
steady-state values C^^
which follows from a consideration of Fig. 39.11 and from the general solution of
the one-compartment open model in eq. (39.6).
It is readily observed that the expression within brackets for the peak
concentration C ^^"^ forms a geometric progression in which the first term equals
one and with a common ratio of e~^®. (The common ratio is the factor which
results from dividing a term of the progression by its preceding one.) For the sum
of the finite progression in eq. (39.41) we obtain after n cycles:
In the case of a sufficiently large number of cycles n we thus find that the peak
and trough values approach their respective steady-state values C^^"^ and C^'"":
^ max ^ ^ max / ^ x ^
^ (39.43)
/ ^ min _ /^ max
^ SS ~ ^ SS XT
p
Usually, one has obtained an estimate for the elimination constant k^^ and the
distribution volume Vp from a single intravenous injection. These pharmacokinetic
parameters, together with the interval between administrations 9 and the
single-dose Z), then allow us to compute the steady-state peak and trough values.
The criterion for an optimal dose regimen depends on the minimum therapeutic
concentration (which must be exceeded by C^^) and on the maximum safe
concentration (which may not be exceeded by C^^^). It is readily seen that the
magnitudes of the peak and trough values of C^ are inverse functions of the
elimination constant k^^, the interval between successive dosings 9 and the
distribution volume Vp. They are also directly proportional to the single dose D,
In the limit, when the interval 9 between administrations becomes extremely
small in comparison with the elimination half-life 0.693/^pg, the steady-state
solutions are reduced to those already derived for a continuous intravenous
infusion (eq. (39.35)):
C j r ^ C ™ - — —^ = — ^ (39.44)
V k Q V k
since the rate of input k^ can be written in the form:
^rp-f (39.45)
Example
We consider again the pharmacokinetic parameters of the one-compartment
model for a single intravenous injection (eq. (39.6)).
Transfer constant of elimination: k^ = 0.01155 min"^
pe
Note that the half-hfe time of the drug in the plasma compartment is equal to
0.693//:pe which amounts, in this case, to exactly 60 min.
From the previously derived formula we compute the steady-state peak and
trough concentrations in plasma:
dr (39.46)
dX b
•^pb^p ^bp^b
dr
with
where Xp, X^ and X^ are the amounts of drug in the plasma and buffer compartments
and in the elimination pool, respectively, and where D represents the single intra-
venous dose.
These equations form a set of first order linear differential equations with
constant coefficients and with initial conditions:
477
® ®
Plasma
Compartment
^bp
"^pe y
Buffer
Elimination
Pool
Fig. 39.12. (a) Two-compartment mammillary model for single intravenous injection of a dose D. The
buffer compartment exchanges with plasma with transfer constants /:pb and /:bp- (b) Time courses of the
plasma concentration Cp, the content in the buffer compartment X^ and in the elimination pool X^..
While the simple linear models in the previous sections have been solved by
straightforward integration, the present model (and more complicated ones) are
more conveniently solved by means of the Laplace transform.
In this section, we will outline only those properties of the Laplace transform that
are directly relevant to the solution of systems of linear differential equations with
constant coefficients. A more extensive coverage can be found, for example, in the
text book by Franklin [6].
The Laplace transform of a time-dependent variable X{i) is denoted by Lap X{t)
or x{s) and is defined by means of the definite integral over the positive time
domain:
478
Since the integral is over time t, the resulting transform no longer depends on t, but
instead is a function of the variable s which is introduced in the operand. Hence, the
Laplace transform maps the function X(t) from the time domain into the 5-domain.
For this reason we will use the symbol x{s) when referring to Lap X(t). To some
extent, the variable s can be compared with the one which appears in the Fourier
transform of periodic functions of time t (Section 40.3). While the Fourier domain
can be associated with frequency, there is no obvious physical analogy for the
Laplace domain. The Laplace transform plays an important role in the study of
linear systems that often arise in mechanical, electrical and chemical kinetic
systems. In particular, their interest lies in the transformation of linear differential
equations with respect to time t into equations that only involve simple functions of
s, such as polynomials, rational functions, etc. The latter are solved easily and the
results can be transformed back to the original time domain.
The Laplace transform of a first-order derivative is defined consistently with eq.
(39.48) by means of the integral:
Lap^^=fe-^d/ (39.49)
dt i0 dt
which can be integrated by parts:
dXit)
L a p ^ ^ = e-^'X(0
e "^yi) I -(-s)
-^-s) { e-"X(t)dt = -X{0)+sLapXit) =s x(s)-X(0)
dt ^0 i
0
(39.50)
provided that X is a bounded function of t.
The inverse Laplace transform is denoted by Lap"^:
Lap-^jc(5) = X(0 (39.51)
Inverse Laplace transforms have been tabulated for most analytical functions,
including power, exponential, trigonometric, hyperbolic and other functions. In
this context we require only the inverse Laplace transform which yields a simple
exponential:
L a p - ^ — ^ = e-^^ (39.52)
479
-^pb-^p+('^ + ^bp)-^b=0
The solution for x^ in eq. (39.54) can be written explicitly in the form:
x(s) = D— ^^ =D ^ (39.55)
P 5 2 + 5 ( a + p) + ap ( s + a ) ( s + P)
where a and P are called the hybrid transport constants. The latter are related to the
physical transport constants by means of their sum and product:
pb bp pe (39.56)
«P=^pe^bp
If the transfer constants k^^, k^^ and k^^ are known, then the hybrid transfer
constants a and P are the roots of the quadratic equation:
c(s) = ^ ^ =— ^^ = — ^ + —^ (39.58)
P Vp Vp (jy + a)(jy + P) s + a s + ^
where the coefficients A^ and B^ are obtained by working out the right-hand part of
the expression in eq. (39.58) and by equating the corresponding terms of the
numerators in its right- and left-hand members. This yields the solution for Ap and
fip as functions of a and p:
480
\ =^ -^r and
^"d B„^ P= = T r - ^ ^ (39.59)
Kp a-p " V a-p
The plasma concentration function C^ in the time domain is obtained by applying
the inverse Laplace transform to the two rational functions in the expression for c^
in eq. (39.58):
Cp(0 = Lap-^ Cp(5) = Ap e-«^ + B^ e'P^ (39.60)
where the coefficients A^, B^, a, P and V^ have to be estimated from the observed
concentration-time data.
From the initial conditions we must have that:
Cp(0) = A p + 5 p = - ^ (39.61)
P
the three transfer constants k^y^, k^^, k^^ and their derived quantities, such as the
plasma clearance Cl^ and the bio-availability AUC. By convention, we assume that
p is the smaller of the two hybrid transfer constants ((3 < a). If the difference is
marked (e.g. by one decade) then we can clearly distinguish between the two
exponentially decaying phases of the plasma concentration curve. The a-phase is
the fastest and disappears rapidly after administration and the (i-phase is slower
and dominates at longer times after administration. Under this circumstance it is
possible to apply curve peeling on a semilogarithmic diagram, in the same way as
has been done in the case of the two-compartment model with oral or parenteral
administration defined by eq. (39.16).
First, we determine the p-phase function on a semilog plot by means of linear
regression on the later part of the concentration curve for which the time course is
most linear (Fig. 39.13a). This yields:
This yields numerical values for the intercept log B^ and slope P, which have been
defined analytically before.
The next step requires the determination of the residual concentrations C p by
subtracting C^ from the observed C^. The resulting a-phase function is again
obtained by means of linear regression on the earlier part of the time course of the
logarithmic plasma concentration (Fig. 39.13b):
This way, we obtain numerical values for the intercept log A^ and the slope a,
which have been defined formally before.
Since we have now determined the hybrid constants a and p, we can derive V^
and k^^ from the coefficients A^ and B^ which resulted from our graphical analysis.
The ratio of Ap to B^ produces (according to eq. (39.59)):
^ ^ a ^ - ^ (39.66)
^p ^bp-P
from which the transfer constant ^^p from the buffer to the plasma compartments
can be solved:
, -MlM ,39.67)
482
Using this result we now obtain the plasma distribution volume V^ from either A
or fip(eq. (39.59)):
(39.68)
" Ap a - p "fip a-p
Finally, we derive the remaining transfer constants k^y, and k^^ from the
previously defined sum and product of the hybrid rate constants a and P (eq.
(39.56)):
(a)
800
Fig. 39.13. (a) Semilogarithmic plot of the plasma concentration Cp (jiig 1"^) versus time t. The straight
line is fitted to the later part of the curve (slow P-phase) with the exception of points that fall below the
quantitation limit. The intercept B^ of the extrapolated plasma concentration C^ appears as a coeffi-
cient in the solution of the model. The slope is proportional to the hybrid transfer constant p, which is
itself a function of the transfer constants of/:pe, itpb and k^p of the model, (b) Semilogarithmic plot of the
residual plasma concentration C " (\xg 1"*) on an extended time scale t. The straight line is fitted on the
first part of the residual curve (fast a-phase), with the exception of points whose residuals fall below
the quantitation level. The intercept Ap appears as a coefficient in the solution of the model. The slope
a is proportional to the hybrid transfer constant a, which is itself a function of the transfer constants of
the pharmacokinetic model, (c) Semilogarithmic plot of the content X^ (|ig) in the buffer compartment.
The rate constants of the exponential functions are the hybrid transfer constants a and p. The coeffi-
cients have been computed by means of the y-method (see text), using the results of curve peeling rep-
resented in panels a and b.
483
100
(b)
log Cp(t)= 1.542-0.024081
0.5 i
(c)
700 800
t (min)
k - ^
^bp (39.69)
^bp=a + P-^bp-^pe
The area under the plasma concentration curve UAC results from integration of
the sum of exponentials in eq. (39.60) between zero and infinity:
which is, once again, the maximal bio-availability of the drug. This expression has
also been obtained in the case of the one-compartment models for intravenous and
for complete extravascular absorption (Sections 39.1.1 and 39.1.2).
The plasma clearance Cl^ follows in the usual way from the dose D and the
bio-availability AUC (eq. (39.27)):
P AUC P P'
Example
We assume that the following synthetic plasma concentrations C^ have been
obtained after a single intravenous administration at different times t\
a-phase p-phase
These data are also shown in the semilogarithmic plot in Fig. 39.13a, which
clearly shows two distinct phases. A straight line has been fitted by least-squares
regression through the data starting from the observation at 90 minutes down to the
last one. This yields the values of 1.086 and -0.001380 for the intercept log B^ and
slope 5p, respectively. From these results we have computed the extrapolated
p-phase values C^ between 2 and 60 minutes. These have been subtracted from the
experimental C^ values in order to yield the a-phase concentrations C " :
r(min) 2 5 10 20 30 60
Cp(Mgl-') 45.7 38.4 31.5 22.3 15.7 11.5
CP(Mgl-') 12.2 12.0 11.8 11.5 11.1 10.1
33.5 26.4 19.7 10.8 4.6 1.4
485
The residual a-phase concentrations C " are shown in the semilogarithmic plot
of Fig. 39.13b. Least-squares linear regression of log C " upon time produced
1.524 and -0.02408 for the intercept log A^ and the slope ^^c, respectively.
From the curve peeling operation we thus obtained the following intercepts,
hybrid transfer constants and half-life times of the a-and P-phases:
^p = 33.4 |ig r^ a = ^^ /0.4343 = 0.0555 min'^ t^2 = 0.693/a = 12.5 min
5p=12.2|Ligr^ p = 5p 70.4343 = 0.00318 rnin"^ /f/2 = 0.693/(3 = 218 min
The ratio of the intercepts A^ to B^ indicates that the amplitude or peak value of
the a-phase is 2.7 times larger than the one of the p-phase, while the ratio of a to p
(or rf/2 to ty2) shows that the a-phase decays at a rate which is 17 times larger than
that of the p-phase.
Using the formulas derived above we can now compute the pharmacokinetic
properties of the model:
Transfer constant buffer to plasma: ^^^p = (App + i5pa)/(Ap + B^) = 0.0172 min '
Transfer constant of elimination: k^ = a^/k^^ = 0.0103 min"^
Transfer constant plasma to buffer: ^pb = ^ + P ~ ^bp ~ ^pe = 0.0312 min~^
Plasma volume of distribution: Vp = D/(A^ + 5p) = 219 1
Area under the curve: AUC = (AJa) + (Bp/p) = 4.442 mg min 1"'
Plasma clearance: Cl^ = D/AUC = 2.25 1 min"'
The synthetic data have been derived from a theoretical model with the
following parameters:
Dose: D = 10 mg
Plasma volume of distribution: n=
= 2001
Transfer constant of elimination: K-= 0.01155 min"'
Transfer constant buffer to plasma: K = 0.040 min"'
Transfer constant plasma to buffer: pb
= 0.020 min"'
Area under the curve: AUC = 4.329 mg min 1
Plasma clearance: ch = 2.311 min"'
Fast phase hybrid transfer constant: a == 0.06816
Slow phase hybrid transfer constant: p= : 0.003390
A random noise with standard deviation of 0.4 |ig 1"^ has been added to the
theoretical values in order to produce a realistic example. The specifications of the
model are in part the same as those used for the one-compartment models which
have been discussed above. The major distinction between this model and the
486
former ones is the addition of the buffer compartment which exchanges with the
plasma compartment. A comparison between the various results (Figs. 39.5b,
39.8a and 39.13a) clearly shows that the buffer compartment produces a signif-
icant delay on the time course of the plasma concentration.
At the beginning of this section we have assumed that hybrid rate constants
should be larger than p by at least about one order of magnitude. In the case when a
is close to p, residual plasma concentrations, after peeling off of the P-phase, may
become very small and may even take negative values. Hence, the relative error on
the semilogarithmic plot, which leads to the determination of a, will become very
prominent. As a consequence, p will be estimated more accurately than a, A^ and
^p. Unfortunately, there is no real good alternative. One may perform a non-linear
regression (see Chapter 11) on the plasma concentrations C^ using the graphic
determinations of a, p, A^ and B^ as initial estimates for the parameters of the sum
of exponential functions of time t. This will tend to spread out the estimation error
over the four parameters, which are computed concurrently by the non-linear
regression algorithm. In practical applications, however, it may be preferable to
obtain the best possible estimate of p, even at the expense of the accuracy of the
other parameters. Indeed, the rate constant of the P-phase determines the overall
half-life of the substance in the system that is being modelled. In repetitive drug
administration, for example, the half-life of the slowest phase (p in this case)
determines the schedule of administration (once daily, twice daily, etc.) The
determination of residues in tissues after discontinuation of medication is also
governed by the slowest decaying phase.
To conclude this section on two-compartment models we note that the hybrid
constants a and p in the exponential function are eigenvalues of the matrix of
coefficients of the system of linear differential equations:
^ pb "^ ^ pe ^ bp
K= (39.71)
~^pb ^bp
If a and p are referred to by the general symbol y we must have that the following
determinant A must be zero:
A = |K-Yl| = 0 (39.72)
where I is the 2 x 2 identity matrix. Working out the determinant leads to the
characteristic equation of the system (Section 29.6):
macokinetic system [7]. The complete solution of this model in terms of the
eigenvalues of the matrix of transfer constants is discussed below in Section
39.1.7.2.
x,(s) 1 Xa(s)
® ^ •
>'':' s >-•
^ •
©
x/s)
•
•'ap
(s+k^p) (s+kp^)
^\f
^ • s
Xe(s)
Fig. 39.14. (a) Catenary compartmental model representing a reservoir (r), absorption (a) and plasma
(p) compartments and the elimination (e) pool. The contents Xr, Xa, Xp and X^ are functions of time t.
(b) The same catenary model is represented in the form of a flow diagram using the Laplace transforms
Xr, Xa and Xp in the ^-domain. The nodes of the flow diagram represent the compartments, the boxes
contain the transfer functions between compartments [ 1 ]. (c) Flow diagram of the lumped system con-
sisting of the reservoir (r), and the absorption (a) and plasma (p) compartments. The lumped transfer
function is the product of all the transfer functions in the individual links.
488
transfer function from the reservoir to the absorption compartment is l/(s + k^^),
from the adsorption to the plasma compartment is k^^(s + ^p^), and from the plasma
to the elimination pool is k^Js. Note that the transfer constant from an emitting
input node and to a receiving output node appears systematically in the numerator
and denominator of the connecting transfer function, respectively. This makes it
rather easy to model linear systems, of the catenary, mammillary and mixed types.
Parts of the model in the Laplace domain can be lumped by multiplying the transfer
functions that appear between the input and output nodes. In Fig. 39.14c we have
lumped the absorption and plasma compartments. The resulting Laplace transform
of the output jCp(5) is then related to the input jc/5) by means of the transfer function
g(s):
x^(s) = g(s)x,is) (39.74)
where
This model can now be solved for various inputs to the absorption compartment.
In the case of rapid administration of a dose D to the absorption compartment
(such as the gut, skin, muscle, etc.), the Laplace transform of the reservoir function
is given by:
x^s) = D (39.76)
In this case we obtain the simplest possible expression for the plasma function in
the 5-domain:
The inverse transform X^(t) in the time domain can be obtained by means of the
method of indeterminate coefficients, which was presented above in Section
39.1.6. In this case the solution is the same as the one which was derived by
conventional methods in Section 39.1.2 (eq. (39.16)). The solution of the two-
compartment model in the Laplace domain (eq. (39.77)) can now be used in the
analysis of more complex systems, as will be shown below.
When the administration is continuous, for example by oral infusion at a
constant rate offc^p,the input function is given by:
k
x^(s) = -^ (39.78)
s
489
The inverse Laplace transform can be obtained again by means of the method of
indeterminate coefficients. In this case the coefficients A, B and C must be solved
by equating the corresponding terms in the numerators of the left- and right-hand
parts of the expression:
"^^ ^ ^ + _ A _ + _ ^ (39.80)
After substitution of the values of A, B and C we finally obtain the plasma concen-
tration function X^{t) for the two-compartment open system with continuous oral
administration:
k. l-e"'-^ 1 e -k.j A
^,it) = k^ (39.82)
^ap ^pe ^pe ^ap
From the above solution we can now easily determine the steady-state plasma
content X^^ after a sufficiently long time t:
X^^=-^ (39.83)
^pe
If X^(t) and X^(t) are the input and output functions in the time domain (for
example, the contents in the reservoir and in the plasma compartment), then X^{t) is
the convolution of X-{t) with G(t), the inverse Laplace transform of the transfer
function between input and output:
X^(r) = JG(x)X,(t-x)d%=\G{t-x)X;{x)dx
0 0
= G(trX,it)
where the symbol * means convolution, and where T denotes the integration
variable.
490
^(-iy^u^ (39.89)
G,j=xm A' Jy=yj
where Aj, is the minor of A, which is obtained by crossing out row 1 and column /,
and where A' denotes the derivative (dA/dy) of A with respect to y.
This general approach for solving linear pharmacokinetic problems is referred
to as the y-method. It is a generalization of the approach by means of the Laplace
transform, which has been applied in the previous Section 39.1.6 to the case of a
two-compartment model.
The theoretical solution from the y-method allows us to study the behaviour of
the model, provided that the transfer constants in K are known. In the reverse
problem, one must estimate the transfer constants in K from an observed plasma
concentration curve C^(k). In this case, we may determine the hybrid transfer
constants jj and the associated intercepts of the plasma concentration curve Gj • by
means of graphical curve peeling or by non-linear regression techniques. Using
these experimental results we must then seek to compute the transfer constants
from systems of equations, which relate the hybrid rate constants y^ and the
associated intercepts Gy to the transfer constants in K. We may also insert the
general solution into the differential equation, their derivatives (at time 0) and their
integrals (between 0 and infinity) in order to obtain useful relationships between
the hybrid transfer constants y^, the intercepts Gy and the model parameters in K.
The calculations can be done by computer programs such as PROC MODEL in
SAS [8]. Not all linear pharmacokinetic models are computable, however, and
criteria for computability have been described [9]. Some models may be in-
determinate, yielding an infinity of solutions, while others may have no solution.
By way of illustration, we apply the y-method to the two-compartment
mammillary model for intravenous administration which we have already seen in
Section 39.1.6. The matrix K of transfer constants for this case is defined by means
of:
^ pb "^ ^ pe ^ bp
K= (39.90)
~^pb ^bp
492
p pe + "^ pb " y ~ ^ bp
A=IK-Yll = =0 (39.91)
-pb ^bp-Y
The eigenvalues y of the model are the roots a and P of the quadratic equation:
Y,Y2 = a p =^bp^pe
The general solution for the plasma compartment is now expressed as follows:
Cp(f) = G„e-T''' +G,2e-^=' = Ape-™ +B^e-^ (39.94)
with
f
D (-l)A,
'"-''-t^ A' ^Y=Yi=a
(39.95)
and
D f (-l)A,
<^12 = ^ p =
V„p V A'
— ^T=72=P
and
The time course of X,^ in the buffer compartment is represented in Fig. 39.13c
using the theoretical values for D, k^^, a and p of the two-compartment model of
Section 39.1.6. At the time t^^ of peak concentration at 46.3 min, about half of the
initial dose D, i.e. 5.02 mg, is stored in the buffer.
Fig. 39.15. Area under a plasma concentration curve AUC as the sum of a truncated and an extrapo-
lated part. The former is obtained by numerical integration (e.g. trapezium rule) between times 0 and
r, the latter is computed from the parameters of a least squaresfitto the exponentially decaying part of
the curve (p-phase).
This parameter can be obtained by numerical integration, for example using the
trapezium rule, between time 0 and the time T when the last plasma sample has
been taken. The remaining tail of the curve (between T and infinity) must be
estimated from an exponential model of the slowest descending part of the
observed plasma curve (P-phase) as shown in Fig. 39.15. The area under the curve
AUC can thus be decomposed into a truncated and extrapolated part:
with
n-\
0 ^ '=1
and
B (39.101)
T P
The mean residence time MRT of the drug in plasma can be expressed in the
form:
495
MRT = ^ ^AUMC
~ AUC
J C^(t)dt
0
where AUMC is called the area under the moment curve and is defined as the
integral of t C^(t) between time t ranging from 0 to infinity. The expression for
MRT in eq. (39.102) defines the mean of the continuous variable t, using the
corresponding plasma concentration C^(t) as a weighting function. From this point
of view, the denominator can be regarded as a normalizing constant which ensures
that the integral of the weighting function over the whole time domain equals
unity. This is consistent with the definition of the mean of a discrete variable in
terms of the first moment of its probability density function, as was explained in
Section 3.2. For this reason, the graph of tC^it) is called the (first) moment curve.
The quantity AUMC can be decomposed into an observed and an extrapolated
part:
T
jtC^{t)dt DJ tQ-''^' dt
MRT. = ^ = -^ =— (39.105)
IV ex, OO J^
Since
f r e ^^ dr = —- and f e ^^ dt = —
0 '^pe 0 pe
The mean residence time MRT can thus be defined as the time it takes for a
single intravenous dose to be reduced to 36.8% in a one-compartment open model.
This follows from the property derived above in eq. (39.105) and from eq. (39.5):
DQ-^^^^^^^ =DQ-^ =0.368 D (39.106)
since
/:p,MRTi,= l
This relationship shows the analogy with the previously derived half-life time ty2
(in Section 39.1.1), which is the time required for a single intravenous dose to be
halved in a one-compartment open system. Since we already derived in eq. (39.9)
that:
0.693
^1/2 ~ •
V.=^5^!5Ii. (39.109)
AUC
497
In the special case of a one-compartment open model, it can easily be shown that
the steady-state volume V^^ is identical to the plasma volume V^ which has been
defined before in Section 39.1.1 (eq. (39.12)):
Ks= = ^0 (39.110)
" AUC/:p, P
since in this case MRT is the reciprocal of the transfer constant of elimination ^p^.
In the general case, however, V^^ is independent of the elimination of the drug and
only depends on the non-compartmental parameters MRT and AUC, for a given
dose D. In this respect, one can state that the plasma clearance Cl^ is also a
non-compartmental parameter which, for a given dose D, depends only on the area
under the curve AUC (cf. eq. (39.13)):
P AUC
j(r-MRT)2 Cp(Odf
VRT = ^ ^AUSC (39.111)
~ AUC
j Cp(Od/
VRT and the second moment function {t - MRT)^ C^{t) are related in the same way
as MRT and the (first) moment function t C^{t) in eq. (37.102).
The quantity AUSC can be decomposed into an observed and an extrapolated
part:
AUSC= AUSCo,7^ + AUSCy. ,^
T oo (39.112)
= j ( r - M R T ) 2 Cp(Odr + ^{t-MKY^ BQ'^' dt
498
where the mean residence time MRT and the area under the curve AUC have
already been defined. An analytical expression can be derived for the extrapolated
partof AUSC:
Example
We reconsider the data used previously in Section 39.1.2 in the discussion of the
two-compartment system for extravascular administration (e.g. oral, subcuta-
neous, intravascular). The data are truncated at 120 minutes in order to obtain a
realistic case. It is recalled that these data have been synthesized from a theoretical
model and that random noise with a standard deviation of about 0.4 |jg 1"^ has been
superimposed.
r(min) 0 2 5 10 20 30 60 90 120
Cp(Mgl-') 0 12.3 23.6 35.5 40.7 37. 1 26.9 19.8 14.0
These data can be integrated numerically between times 0 and 120 minutes. The
remaining part between 120 minutes and infinity must be extrapolated from the
downslope of the curve (P-phase) which can be modelled by means of the
exponential function:
CP(0 = 5e-P^=55.5e^-^i^^^
499
the coefficients of which have been determined graphically in Section 39.1.2 from
the experimental data. We now provide the results that can be derived numerically
from the data by means of the statistical moments model. The exact values
provided by the theoretical model in Section 39.1.2 are added within parentheses.
Dose: Z) = 10 mg
Intercept of p-phase: B = 55.5 jug V
Hybrid transfer constant of p-phase: p = 0.0119 min'^
Last sample time: T=120 min
Compartmental analysis is the most widely used method of analysis for systems
that can be modeled by means of linear differential equations with constant
coefficients. The assumption of linearity can be tested in pharmacokinetic studies,
for example by comparing the plasma concentration curves obtained at different
dose levels. If the curves are found to be reasonably parallel, then the assumption
of linearity holds over the dose range that has been studied. The advantage of linear
501
Fig. 39.16. Paradigm for the fitting of sums of exponentials from a compartmental model (c) to ob-
served concentration data (o) as contrasted by the results of statistical moment analysis (s). (After
Thorn [13].)
© ®
1 _,
V ,*
/\^^^ope = ^
j^'^'^ 1 max
_\
max
0
0 1
Fig. 39.17. Schematic illustration of Michaelis-Menten kinetics in the absence of an inhibitor (solid
line) and in the presence of a competitive inhibitor (dashed line), (a) Plot of initial rate (or velocity) V
against amount (or concentration) of substrate X. Note that the two curves tend to the same horizontal
asymptote for large values of X. (b) Lineweaver-Burk linearized plot of 1/Vagainst \/X. Note that the
two lines intersect at a common intercept on the vertical axis.
dX (39.114)
V =
" dt
where V represents the rate (or velocity) of the process and X represents the amount
(or concentration) of the substrate.
The parameters of this model are the maximal rate of the process V^^^x ^^d the
saturation constant K^. The latter can also be defined as the amount of substrate
which produces a rate V which is half the maximal rate V^^, as can be verified by
substituting K^ by X in the above relation.
Usually, one plots the initial rate V against the initial amount X, which produces
a hyperbolic curve, such as shown in Fig. 39.17a. The rate and amount at time 0 are
larger than those at any later time. Hence, the effect of experimental error and of
possible deviation from the proposed model are minimal when the initial values
are used. The Michaelis-Menten equation can be linearized by taking reciprocals
on both sides of eq. (39.114) (Section 8.2.13), which leads to the so-called
Lineweaver-Burk form:
503
- =^ ^ - +^ - (39.115)
V V X V
^ ^ max ^*^ ^ max
slope = KJV^^^
intercept = l/V^^
The parameters V^^^^ and K^^ can be obtained from the intercept and slope of the
linear relationship between 1/Vand l/X as shown in Fig. 39.17b.
Least-squares linear regression (Chapter 8) of 1/Vupon l/X, although a simple
solution to the problem, suffers from a lack of robustness which is induced by the
reciprocal transformation of V and X. Serious heteroscedasticity (or lack of
homogeneity of variance) occurs in the differences between observed and
computed values of l/V. Indeed, the variances of these residuals tend to increase
with large values of l/X. (It should be noted that the variance of l/V is approx-
imately equal to the variance of V divided by the fourth power of V, where V
becomes vanishingly small when X tends to zero.) This could be remedied by
means of weighted linear regression, in which each point is assigned a weight
which is inversely proportional to the corresponding variance. (Heteroscedasticity
and weighted regression have been explained in Section 8.2.3.) Unfortunately, the
distribution of l/V at the various levels of l/X is usually unknown. It has been
recommended, therefore, to use distribution-free methods of linear regression
[14]. A robust procedure is to compute the intercept and slopes of the linear
relationship by means of the single median regression method (Section 12.1.5.1).
Another linearized form of the Michaelis-Menten equation is defined by:
- = —i-X + ^ ^ (39.116)
V V V
max ' max
slope = l/V^^^
intercept = i^JV^,,
This variant can be derived from the Lineweaver-Burk form in eq. (39.115) by
multiplying both sides by X. From a statistical point of view, it does not seem to
have an advantage over the Lineweaver-Burk form [14]. The latter variant, how-
ever, can be more easily extended to more complex systems of substrate-enzyme
reactions, as will be shown below.
In the case of competitive inhibition, the substrate is displaced by a substance
which has greater affinity for the enzyme (or receptor protein) than its natural
substrate. For example, a competitive inhibitor (or antagonist) will try to occupy
the binding sites such that the enzyme is prevented from exerting its normal
activity on the substrate. It is assumed here that the binding between inhibitor and
504
enzyme is reversible. The inhibiting activity depends on the amount (or con-
centration) Y of the competing substance and on the inhibition constant K^ of the
inhibitor-enzyme complex. The potency of a competitive inhibitor is inversely
proportional to its K^ value. The relationship between the rate of the enzymatic
reaction V and the amount of substrate X can be expressed in a linearized form:
^ K 1
"^ -f-^- (39.117)
slope = (I ^Y/K.)KJV^^
intercept = l/V^^
Note the close analogy with the Lineweaver-Burk form of the simple Michaelis-
Menten equation. In a diagram representing l/V against l/X one obtains a line
which has the same intercept as in the simple case. The slope, however, is larger by
a factor (1 + Y/K^ as shown in Fig. 39.17b. Usually, one first determines V^^^ and
Kj^ in the absence of a competitive inhibitor (Y = 0), as described above. Sub-
sequently, one obtains K^ from a new set of experiments in which the initial rate V
is determined for various levels of X in the presence of a fixed amount of inhibitor
Y. The slope of the new line can be obtained by means of robust regression.
The cases of non-competitive inhibition and even more complex non-linear
reaction kinetics will not be discussed further here.
Example
We consider the initial velocities V, observed with different substrate
concentrations X in a rate-limited enzymatic reaction [15]:
Ordinary least squares regression of l/V upon l/X produces a slope of 9.32 and
an intercept of 2.36. From these we derive the parameters of the simple Michaelis-
Menten reaction (eq. (39.116)):
V,,, = 0.424 mg/min
/^^ = 3.95mM
We now consider the case of a competitive inhibitor which has been added to
the above reaction at the fixed concentration of 40 mM [15]. The following initial
velocities of the competitively inhibited Michaelis-Menten process are observed
at the same substrate concentrations as above:
505
Ordinary least squares regression of l/V upon l/X yields a slope of 17.7 and an
intercept of 2.46. Using the previously derived values for V^^^ ^"^ K^, and setting
Y equal to 40, we can derive the inhibition constant of the competitively inhibited
Michaelis-Menten reaction (eq. (39.117))
K. = 44 A mM.
References
1. A. Rescigno and G. Segre, Drug and Tracer Kinetics. Blaisdell, Waltham, MA, 1966.
2. P.J. Lewi, J.J.P. Heykants, F.T.N. Allewijn, J.G.H. Dony and P.A.J. Janssen, On the distribu-
tion and metabolism of neuroleptic drugs. Part I: Pharmacokinetics of haloperidol. Drug. Res.,
20 (1970)943-948.
3. J. Gleick, Chaos: Making a New Science. Viking, New York, 1987.
4. T. Teorell, Kinetics of distribution of substances administered to the body. I. The extravascular
modes of administration. Arch. Int. Pharmacodyn., 57 (1937) 205-225. II. The intravascular
modes of administration. Arch. Int. Pharmacodyn., 57 (1937) 226-240.
5. F.F. Farris, R.L. Dedrick, P.V. Allen and J.C. Smith, Physiological model for the pharmaco-
kinetics of methyl mercury in the growing rat. Toxicol. Appl. Pharmacol., 119 (1993) 74-90.
6. P. Franklin, An Introduction to Fourier Methods and the Laplace Transformation. Dover, New
York, 1958.
7. P.J. Lewi, Pharmacokinetics and eigenvector decomposition. Arch. Int. Pharmacodyn., 215
(1975)283-300.
8. G.S. Klonicki, Using SAS/ETS Software for Analysis of Pharmacokinetic Data. SAS Institute
Inc., Gary, NC, 1993.
9. P.L. Williams, Structural identifiability of pharmacokinetic models — compartments and ex-
perimental design. J. Vet. Pharmacol. Therap., 13 (1990) 121-131.
10. P.R. Mayer and R.K. Brazell, Application of statistical moment theory to pharmacokinetics. J.
Clin. Pharmacology, 28 (1988) 481-483.
11. J. Powers, Statistical analysis of pharmacokinetic data. J. Vet. Pharmacol. Therap., 13 (1990)
113-120.
12. K. Zierler, A critique of compartmental analysis. Ann. Rev. Biophys. Bioeng., 10 (1981)
531-562.
13. R. Thom, Structural Stability and Morphogenesis. Benjamin-Cummings, Reading, MA, 1975.
506
14. I. A. Nimmo and G.L Atkins., The statistical analysis on non-normal (real?) data. TiBS (Trends
in Biological Science), 4 (1979) 236-239.
15. R.J. Tallarida and R.B. Murray, Manual of Pharmacological Calculations. Springer, New
York, 1981, pp. 35-37.
Chapter 40
Signal Processing
When measuring a signal, one records the magnitude of the output or the
response of a measurement device as a function of an independent variable. For
instance, in chromatography the signal of a Flame Ionization Detector (FID) is
measured as a function of time. In spectrometry the signal of a photomultiplier or
diode array is measured as a function of the wavelength. In a potentiometric
titration the current of an electrode is measured as a function of the added volume
of titrant.
Time, wavelength and added volume in the above-mentioned examples are the
domains of the measurement. A chromatogram is measured in the time domain,
whereas a spectrum is measured in the wavelength domain. Usually, signals in
these domains are directly translated into chemical information. In spectrometry
for example peak positions are calculated in the wavelength domain and in
chromatography they are calculated in the time domain. Signals in these domains
are directly interpretable in terms of the identity or amount of chemical substances
in the sample.
In some cases one may consider a measurement domain which is complementary
to one of the domains mentioned before. By 'complementary' we mean that each
value in the complementary domain contains information on all variables in the
other domain. In this terminology, the wavelength and frequency of the radiation
emitted by a light source in spectrometry are not complementary domains. One
frequency corresponds to a single wavelength (by the relation v = \/X) and not a
range of wavelengths. On the contrary, each individual point of an interferogram
measured with a Fourier Transform Infra Red (FTIR) spectrometer at a certain
displacement 5 of one of the mirrors in the Michelson interferometer contains
information on all wavelengths in the IR spectral domain, i.e. on the whole IR
spectrum. One can switch between two complementary domains by a math-
ematical operation. The Fourier transform (FT), which is the main subject of this
chapter, is such a mathematical operation. We illustrate this with the way an FTIR
spectrophotometer works [1]. In FTIR spectroscopy, the radiation from the light
source is split into two beams by a beam-splitter. Behind the splitter, the two beams
508
fixed mirror
t . movable mirror
source
direction of
motion
beam
splitter
detector
are recombined by a mirror before reaching the detector (Fig. 40.1). If the path
lengths of both beams are equal, the intensity of the light reaching the detector is
equal to the intensity of the light source before splitting. However, when the path
lengths of both beams are unequal the measured intensity is lower, because some
frequencies of the light are out-of-phase. This is shown in Fig. 40.2 for light
consisting of a single frequency. For a path length difference 6 equal to zero, the
intensity is maximal, for b = )J2 the intensity is zero and for 6 = A,/5 the intensity is
in between. By slowly moving a mirror, the path length difference between the two
beams is changed as a function of the position of the mirror. As a result, the light
i (a+d)
5=X/5 ' , / \id) / \
i t
8=0 ^ '' \ / ^ . /
(a+b)
'(a)'
t
Fig. 40.2. Interference of two beams with the same frequency (wavelength). The path-length differ-
ence between beams (a) and (b) is zero, between (a) and (c) X/2, between (a) and (d) X/5; (a+b), (a+c)
and (a+d) are the amplitudes after recombination of the beams.
509
intensity reaching the detector varies according to a pattern which depends on the
path length difference and absorbance of a sample placed in one of the beams. At
each path length difference or displacement 6, frequencies coinciding with 5 = {2k
4- \ykll are out-of-phase, others for which 5 =fc?iare in phase and others for which
kX<h< {2k + l)?l/2 are partly out-of-phase. Frequencies of the light source which
are out-of-phase interfere destructively and frequencies which are in-phase interfere
constructively. At a given displacement specific frequencies are constructively
combined and others are destructively combined. The intensity measured at a
given displacement of the mirror is the sum of all destructively and constructively
combined frequencies. Thus the signal measured at each position of the mirror,
contains information on all wavelengths (or wavenumbers) emitted by the light
source. This is fundamentally different from ordinary IR spectrometry where each
measured signal (transmittance) corresponds to a single wavelength. When the
light intensity is measured as a function of the mirror displacement an interfero-
gram is obtained. Each individual point of the interferogram contains information
about the entire spectrum (thus all wavelengths or wavenumbers emitted by the
light source). It can be shown [1] that the interferogram, which is the amplitude
/(8) as a function of the displacement 5 is the Fourier transform of the spectrum in
the wavelength domain. It represents the spectrum in iht frequency domain. In
order to obtain an interpretable spectrum (the transmittance as a function of
wavenumber or wavelength) the interferogram needs to be back transformed by a
Fourier transform of 7(6).
It may thus be necessary to calculate the Fourier transform of the measured
signal to return to the domain of interpretation, here wavelength or wavenumber.
In FTIR the signal is measured in the displacement domain 6 and transformed to
the wavelength or wavenumber domain by a Fourier transform. Because the
wavelength domain is the Fourier transform of the displacement domain, and vice
versa, we say that the spectrum is measured in the Fourier domain.
Instead of considering a spectrum as a signal which is measured as a function of
wavelength, in this chapter we consider it measured as a function of time. The
frequency scale in the Fourier domain is given in cycles (v) per second (Hz) or
radians (co) per second (s"^). These units are related by 1 Hz = 2n s~^
Before discussing the Fourier transform, we will first look in some more detail
at the time and frequency domain. As we will see later on, a FT consists of the
decomposition of a signal in a series of sines and cosines. We consider first a signal
which varies with time according to a sum of two sine functions (Fig. 40.3). Each
sine function is characterized by its amplitude A and its period T, which corresponds
to the time required to run through one cycle {In radials) of the sine function. In
this example the frequencies are 1 and 3 Hz. The frequency of a sine function can
be expressed in two ways: the radial frequency oo (radians per second), which is
511
05
Fig. 40.3. A composite sine function (solid line) which is the sum of two sine functions of 1 Hz (dotted
line) and 3 Hz (dashed line).
equal to 2%IT, and the number of cycles per second, v, which is equal to 1/rHz. For
all t which are a multiple of period T, the amplitude of the sine is zero. The function
describing a sum of two sines is given by:
0) I
It
Q.
E
03
^ ^ ^ ^~^ V (CDS)
1 2 3 ^^^
—^ frequency
2n An 67c CO
Fig. 40.4. The signal of Fig. 40.3 represented in the frequency domain.
Fig. 40.5. (a) Sine function with a radial frequency +0) and -co. (b) Cosine function with a radial fre-
quency +0) and -CO. (c) Representation of (a) in the frequency domain, (d) Representation of (b) in the
frequency domain.
Let us consider the antisymmetric block shaped signal/(O shown in Fig. 40.6a,
with the values -1 and +1 measured during a finite time, T^. In many textbooks the
measuring time is defined as 2T^ for practical reasons. In this chapter we also
adopt this convention. Usually the origin of the time axis is shifted to the centre of
the measurement time, which is then defined from -T^ to +T^. As a first approx-
imation this block signal f{t) can be represented by a sine function with a period
equal to 2T^, or a frequency v equal to 1/(27^), i.e. fit) ^ 1 sm(2nt/2T^). As can be
seen from Fig. 40.6b, this approximation is imperfect and is improved by adding a
second sine function. Because the period T of this sine function should exactly fit a
number of times in the measurement time 27^, only the frequencies l / ( 2 r j ,
2 / ( 2 r j , 3/(2rj,..., n/(2TJ are considered. In this particular case the fit cannot be
improved by adding a sine with a period equal to 2/(27^). However, including a
sine with a period equal to 3/(27^) and an amplitude 1/3 yields the next best
approximation, which is:
The approximation is still imperfect and can be improved by adding a third sine
function, now with a period 5/(2T^) and an amplitude 1/5, i.e.
The process of adding sine functions can be continued giving the following
expression:
(a) f(t)|
+1
-T T
m
(b) f(t)
(C)
frequency (Hz)
Fig. 40.6. (a) Block signal measured from -T„t to +r;„. (b) Signal (a) approximated by a sum of one,
two and three sine functions, (c) Signal (a) represented in the frequency domain.
515
(nlTti) „ . (nlnt
/(0=I A„ cos +B„sin
n=0 m J
(rilKt\
= x A„ cos + B„ sin
V^^m )
or
A =
2 7m
j m cos .Unt^
T .
dt (n =-oo,..., oo)
-T
(40.2)
fi = J /(r)sm nnt
— dt (n = -oo,..., oo)
27m -T m y
m-\ S 2 2j'
Becausey^ = -l,A^ = A_^ and B^ = -B_^ (in analogy to A^^ = A.^^ and B^^ = -fi_J
jnnt
1
m = - ZiA^-jBJe-'-' (40.3)
The amplitudes (A^) of the cosines, therefore, represent the real part of the FT.
They are the real Fourier coefficients. The amplitudes (5„) of the sines are the
imaginary part of the FT, which are the imaginary Fourier coefficients. Recalling
the fact that the antisymmetric block signal, given in Fig. 40.6a, could be described
with a sum of sine functions, means that its FT only consists of imaginary Fourier
coefficients. A symmetric signal only contains real Fourier coefficients. As a
consequence, setting the imaginary coefficients equal to zero symmetrizes the
signal obtained after inverse transform.
Usually, the Fourier coefficients A^ and B^ in eq. (40.2) are combined in a single
complex number [A„ -jBJ:
517
+ 7-
rnit ^rnit^
K-JB„=--27, J/(0 cos 7 Sin At
m -J \^m J
(40.4)
-jnnt
1
j / ( O e ^- df
27,m -7
Equations (40.3) and (40.4) are called the Fourier transform pair. Equation
(40.3) represents the transform from the frequency domain back to the time
domain, and eq. (40.4) is the forward transform from the time domain to the
frequency domain. A closer look at eqs. (40.3) and (40.4) reveals that the forward
and backward Fourier transforms are equivalent, except for the sign in the exponent.
The backward transform is a summation because the frequency domain is discrete
for finite measurement times. However, for infinite measurement times this sum-
mation becomes an integral.
Summarizing, two complementary representations of a signal have been derived:
f(0 in the time domain and [A^ - jBJ in the frequency domain. The imaginary
Fourier coefficients, 5^, represent the frequencies of the sine functions and the real
Fourier coefficients, A^, represent the frequencies of the cosine functions. In the
remainder of this chapter we use the shorthand notation F(n) for [A„ -jB^].
Many signals, such as a Gaussian peak, have a symmetric clock-shaped form
about their maximum. If the origin of the time axis is chosen in the centre of
symmetry the FT contains only real Fourier coefficients. If, however, the origin of
the measurement axis is put at the start of the peak, the function looses its
symmetry about the origin resulting in a FT with real and imaginary coefficients.
This indicates that the location of the origin of the data influences the Fourier
transform.
The Fourier coefficients can be combined in a so called power spectrum which
is defined as P^ = ^J[(A'^ + B^)]. One should realize that because the contributions
of A„ and B^ are implicitly included in P„, one cannot back-transform the power
spectrum to the time domain. However, the power spectrum gives the total
amplitude of the frequencies present in the signal.
The Fourier transform may be considered as a special case of the general data
transform
Fiy)= \ f(t)-K(vj)dt
K(v,t) is called the transform kernel. For the Fourier transform the kernel is e"^^^
Other transforms (see Section 40.8) are the Hadamard, wavelet and the Laplace
transforms [4].
518
In this section we derive the Fourier transform of a sine function f(0 = sin(coO by
solving eq. (40.4). Because cos(jc) = (e'^ + e"^^)/2 and sin(jc) =(d^ - e"^^)/(2/), we can
express sin(coO in the complex number notation: sin(coO = (e®^^ - t~^^y{2j)
The Fourier transform of sin(cor) is:
1 r -jnn —
( A „ - ; f i „ ) = — J sin(0)0e ^ df
-T (40.5)
t
-inn -
IT _J, 2j
1
y n J n/ 2T J 2j 2T JJ 02j1
OT
AITI
d/ (0
_ j sin(cor -n%)
2 (oT -nn
In the same way we find that the second term 5„ is equal to
This gives the following expression for the Fourier transform of a sine function:
519
All sin(m - n)n are zero for all integer (m - n) except forn = m and for n = -m for
which respectively the first and the second term of the above equation are equal to
1. As a result the FT is:
A„-jB, = -j/2(l+0) = (0-j/2)
and
A_„-;B_„ = - ; / 2 ( 0 - l ) = (0+;/2)
The frequency corresponding to n = m is co^ = InmllT = co and to n = -m is co_^ =
-InmllT = -CO. The FT is thus imaginary, which is in agreement with the fact that a
sine is an antisymmetric function. The factor 1/2 is introduced because positive
and negative frequencies are considered. Namely, f(0 = sin(coO = 1/2 sin(coO
-l/2sin(-co0.
V^
^ I I I I
0 1 2 2 0 2 ^
Fig. 40.7. (a) A continuous signalJ{t) measured from to =0 to ^o +2Tn, = 2 seconds, (b) Same signal after
digitization,/(/:A0 as a function of/:.
The expressions for the forward and backward Fourier transforms of a data
array of2N-\- 1 data points with the origin in the centre point are [3]:
Forward:
k=+N -jnlnk
1
F(n) = X / ( M r ) e 2^-1 (40.7)
2A^ + 1 k=-N
for n = -^max ^^ '^max' ^max corrcspouds to the maximum frequency which is in the
ta. The value
data. vah of n^^ is derived in the next section.
Backward:
. ..... ^ jnlnk
f(kAt) = - X ^(^)e 2^^^ (40.8)
for/: = -M...,+A^
The frequency associated with F(n) is v^. This frequency should be equal to n
times the basis frequency, which is equal to 1/(27^) (this is the period of a sine or
cosine which exactly fits in the measurement time). Thus v„ = n/(2T^) = n/(2NAt).
It should be noted that in literature one may find other conventions for the
normalization factor used in front of the integral and summation signs.
(D
Q.
E
(0
Fig. 40.8. (a) Sine function sampled at the Nyquist frequency (2 points per period), (b) An un-
der-sampled sine function.
522
1 1
0 At 32
16 0 16
-N N
^0
A(n)
• n
N real coefficients 1
1
\ ii
(+ value at n=0) v = 1/(2T„) v = 1/(2 At)
Av=1/(2Tm)
Fig. 40.9. Relationship between measurement time {2T„,), digitization interval and the maximum and
minimal observable frequencies in the Fourier domain.
TABLE 40.1
Signal measured at five time points with Ar = 0.5 s and 2T^ = 2 s
l . n = 0:Vo = OHz
-^ k=-2 -^ k=-2
= 1/5(2 + 3 + 4 + 4 +2) = 3
ni)=l/5lf()tA0e--'^''*'5 =
= l/5(2e'""'5 +3e'^"'5 + 4e° + AQ-J^"^^ + 2e-''*"'^)
= l/5(2cos47i/5 + 2/sin47i/5 + 3cos27t/5 + 3;sin27i/5 + 4 + 4cos27i/5
- 4/sin27i/5 + 2cos47i/5 - 2jsm4n/5)
= 1/5 (-1.618 +1.1756/- + 0.9271 + 2.8532/- + 4 + 1.2361 - 3.8042; - 1.618
-1.1756/) =
f(kAt)
Fig. 40.10. Fourier transform of the data points listed in Table 40.1.
40.3.6 Sampling
In the previous section we have seen that the highest observable frequency
present in a discrete signal depends on the sampling interval (At) and is equal to
l/2Ar, the Nyquist frequency. This has two important consequences. The minim-
ally required digitizing rate (sampling points per second) in order to retain all
information in a continuous signal is defined by its maximum frequency. Secondly,
the Fourier transform is disturbed if the continuous signal contains frequencies
higher than the Nyquist frequency. This disturbance is called aliasing or folding.
The principle of folding is illustrated in Fig. 40.11. In this figure two sine functions
A and B are shown which have been sampled at the same sampling rate of 16 Hz.
Signal A (8 Hz) is sampled with a rate which exactly fits the Nyquist frequency
(v^J, namely 2 data points per period. Signal B (11 Hz) is under-sampled as it
requires a sampling rate of minimally 22 Hz. The frequency of signal B is 3 Hz (=
5v) higher than the maximally observable frequency. This does not mean that for
signal B no frequencies are observed in the frequency domain. Indeed, from Fig.
40.11 one can see that the data points of signal B can also be fitted with a 5 Hz sine,
which is 3 Hz (= 6v) lower than v^^^ (= 8 Hz). As a consequence, if a signal
contains a frequency which is 5v higher than the Nyquist frequency, false frequen-
525
t (sec)
Fig. 40.11. Aliasing or folding, (a) Sine of 8 Hz sampled at 16 Hz (Nyquist frequency), (b) Sine of 11
Hz sampled at 16 Hz (under-sampled), (c) A sine of 5 Hz fitted through the data points of signal (b).
cies are observed at a frequency which is 5v lower than the Nyquist frequency. One
should always be aware of the possible presence of 'false' frequencies by aliasing.
This can be easily checked by changing the sampling rate. True frequencies remain
unaffected whereas 'aliased' frequencies shift to other values. The Nyquist fre-
quency defines the minimally required sampling rate of analytical signals. As an
example we take a chromatographic peak with a Gaussian shape given by y(x) =
atxp(-x^/2s^) (Gaussian function with a standard deviation equal to s located in the
origin, x = 0). The FT of this function is [5]:
From this equation it follows that the amplitudes of the frequencies present in a
Gaussian peak are normally distributed about a frequency equal to zero with a
standard deviation inversely proportional to the standard deviation s of the
Gaussian peak. As a consequence, the maximal frequency present in a Gaussian
peak is approximately equal to (0 + 3{l/s)). In order to be able to observe that
frequency, a sampling rate of at least 6/s is required or 6 data points per standard
deviation of the original signal. As the width of the base of a Gaussian peak is
approximately 6 times the standard deviation, the sampling rate should be at least
36 points over the whole peak. In practice higher sampling rates are applied in
order to avoid aliasing of high noise frequencies and to allow signal processing
526
(b)
(a) 0.1
0.1
-0.1
0 05 -0.2
-0.3 «
64 127
64 127
0.2
0.1
-0.1
-0.2 I
127
Fig. 40.12. (a) FT (real coefficients) of a Gaussian peak located in the origin of the measurements (256
data points). Solid line: wVi = 20; dashed line: wVi = 5 and corresponding maximal frequencies, (b) FT
(real and imaginary coefficients) of the same peak shifted by 50 data points.
(see Section 40.5). Figure 40.12a gives the real Fourier coefficients of two
Gaussian peaks centered in the origin of the data and half-height widths
respectively equal to 20 and 5 data points. As one can read from the figure, the
maximum frequency in the narrow peak (dashed line in Fig. 40.12) is about 4 times
higher than the maximum frequency in the wider peak (solid line in Fig. 40.12).
In Section 40.3.4 we have shown that the FT of a discrete signal consisting of 2A^
+ 1 data points, comprises A^ real, A^ imaginary Fourier coefficients (positive
frequencies) and the average value (zero frequency). We also indicated that A^ real
and N imaginary Fourier coefficients can be defined in the negative frequency
domain. In Section 40.3.1 we explained that the FT of signals, which are sym-
metrical about the r = 0 in the time domain contain only real Fourier coefficients.
528
6.4 s
1 I 10 Hz 10 Hz
Fig. 40.14. Effect of zero filling on the back transform of the pulse NMR signal given in Fig. 40.12. (a)
before zero filling, (b) after zero filling.
whereas signals which are antisymmetric about the r = 0 point in the time domain
contain only imaginary Fourier coefficients (sines). This property of symmetry
may be applied to obtain a transform which contains only real Fourier coefficients.
For example, when a spectrum encoded in 2A^ + 1 = 512 + 1 data points is
artificially mirrored into the negative wavelength domain, the spectrum is symm-
etrical about the origin. The FT of this spectrum consists of 512 real and 512
imaginary Fourier coefficients. However, because the signal is symmetric all
imaginary coefficients are zero, reducing the FT to 512 real coefficients (plus one
for the mean) of which 256 coefficients at negative frequencies and 256 coefficients
at positive frequencies.
modulation depends on ^Q, the magnitude of the shift. The back transform has the
same property. F{n - n^ is back-transformed to f(t)txp(j2nnQt/2N). A shift by A^
data points, therefore, results in f{kAt)Qxp(j2nNk/2N) = f{kAt)cxp(jnk) = f{kAt)
(-1)^. This property is often applied to shift the origin of the Fourier domain to the
centre of its 2A^ + 1 data array when the software by default places its origin at the
first data point.
In Section 40.3.6 we mentioned that the Fourier transform of a Gaussian peak
positioned in the origin is also a Gaussian function. In practice, however, peaks are
located at some distance from the origin. The real output, therefore, is a damped
cosine wave and the imaginary output is a damped sine wave (see Fig. 40.12b).
The frequency of the damping is proportional to the distance of the peak from the
origin. The functional form of the damped wave is defined by the peak shape and is
a Gaussian for a Gaussian peak. Inversely, the frequency of the oscillations of the
Fourier coefficients contains information on the peak position.
The phase spectrum 0{n) is defined as 0(n) = arctan(A(n)/S(n)). One can prove
that for a symmetrical peak the ratio of the real and imaginary coefficients is
constant, which means that all cosine and sine functions are in phase. It is
important to note that the Fourier coefficients A(n) and B(n) can be regenerated
from the power spectrum P(n) using the phase information. Phase information can
be applied to distinguish frequencies corresponding to the signal and noise,
because the phases of the noise frequencies randomly oscillate.
The Fourier transform is distributive over summation, which means that the
Fourier transform of the two individual signals is equal to the sum of the Fourier
transforms of the two individual signals F\f^{t) +/2(0] = F\f\if)\ + ^l/iCOl-
The enhancement of the signal-to-noise ratio (or filtering) in the Fourier domain
is based on that property. If one assumes that the noise n{t) is additive to the signal
s{t), the measured signal m{i) is equal to s{t) + n{t). Therefore, F[m{t)\ = F[s{t)\ +
F[n{i)], or
M(v) = 5(v)+7V(v)
Assuming that the Fourier transformed spectra 5(v) and A^(v) contribute at specific
frequencies, the true signal, s(t), can be recovered from M(v) after elimination of
A^(v). This is csllcd filtering (see further Section 40.5.3)
The Fourier transform is not distributive over multiplication:
Fm)f2(f))^F{f,{twm))'
It is also easy to show that for the scalar a F{a(j{t))) = aF(t).
530
As explained before, the FT can be calculated by fitting the signal with all
allowed sine and cosine functions. This is a laborious operation as this requires the
calculation of two parameters (the amplitude of the sine and cosine function) for
each considered frequency. For a discrete signal of 1024 data points, this requires
the calculation of 1024 parameters by linear regression and the calculation of the
inverse of a 1024 by 1024 matrix.
The FT could also be calculated by directly solving eq. (40.7). For each
frequency (2N + 1 values) we have to add 2N + 1 (the number of data points) values
which are each the result of a multiplication of a complex number with a real value.
The number of complex multiplications and additions is, therefore, proportional to
{2N + 1)^. Even for fast computers this is a considerable task. Therefore, so-called
fast Fourier transform (FFT) algorithms have been developed, originally by
Cooley and Tukey [6], which are available in many software packages. The
number of operations in FFT is proportional to (2N + l)log2(2A^ +1) permitting
considerable savings of calculation time. The calculation of a signal digitized over
1024 points now requires 10"^ operations instead of 10^, which is about 100 times
faster. A condition for applying the FFT algorithm is that the number of points is a
power of 2. The principle of the FFT algorithm can be found in many textbooks
(see additional recommended reading).
Because the FFT algorithm requires the number of data points to be a power of
2, it follows that the signal in the time domain has to be extrapolated (e.g. by zero
filling) or cut off to meet that requirement. This has consequences for the resolution
in the frequency domain as this virtually expands or shortens the measurement
time.
40.4 Convolution
Fig. 40.15. (a) Slit function (point-spread function) h{X) for a spectrometer tuned at 304 nm. (b) f(X) is
the true absorbance spectrum of the sample.
peak shape is disturbed. The mechanism behind this disturbance is called con-
volution. Convolution also occurs when filtering a signal with an electronic filter
or with a digital filter, as explained in Section 40.5.2.
By way of illustration the spectrometry example is worked out. Two functions
are involved in the process, the signal/(^) and the convolution function h{X). Both
functions should be measured in the same domain and should be digitized with the
same interval and at the same ^values (in spectrometry: X-values). Let us further-
more assume that the spectrum/(^) and convolution function h{X) have a simple
triangular shape but with a different half-height width.
Let us suppose that the true spectrum y(^) (absorbance xlOO) is known (Fig.
40.15b):
?ii = 301nm=>f(301) = 0
?i2 = 302nm=»y(302) = 0
?i3 = 303nm=>y(303) = 0
?i4 = 304nm=>/(304) = 5
^5 = 305nm=>y(305)=10
^6 = 306 nm =^J{306) = 5
;i7 = 307nm=^y(307) = 0
gix)=^fiy)h(y-x) (40.9)
Fig. 40.16. Measured absorbance spectrum for the system shown in Fig. 40.15.
(a)
FT
FT (b)
(d) (C)
RFT
13 17 :^1 25
Fig. 40.17. Convolution in the time domain ofJ{t) with h(t) carried out as a multiplication in the Fou-
rier domain, (a) A triangular signal (wi/j = 3 data points) and its FT. (b) A triangular slit function h{t)
(w./, = 5 data points) and its FT. (c) Multiplication of the FT of (a) with that of (b). (d) The inverse FT of
(c).
As said before, there are two main applications of Fourier transforms: the
enhancement of signals and the restoration of the deterministic part of a signal.
Signal enhancement is an operation for the reduction of the noise leading to an
improved signal-to-noise ratio. By signal restoration deformations of the signal
introduced by imperfections in the measurement device are corrected. These two
operations can be executed in both domains, the time and frequency domain.
Ideally, any procedure for signal enhancement should be preceded by a charac-
terization of the noise and the deterministic part of the signal. Spectrum (a) in Fig.
40.18 is the power spectrum of "white noise" which contains all frequencies with
approximately the same power. Examples of white noise are shot noise in photo-
multiplier tubes and thermal noise occurring in resistors. In spectrum (b), the
power (and thus the magnitude of the Fourier coefficients) is inversely prop-
ortional to the frequency (amplitude -- 1/v). This type of noise is often called 1//
IF(/'
L^'VAMMM/
iFd'ii (b)
IFI/11
Fig. 40.18. Noise characterisation in the frequency domain. The power spectrum IF(v)l of three types
of noise, (a) White noise, (b) Flicker or 1//noise, (c) Interference noise.
536
o
u
noise (/is the frequency) and is caused by slow fluctuations of ambient conditions
(temperature and humidity), power supply, vibrations, quality of chemicals etc.
This type of noise is very common in analytical equipment. An example is the
power spectrum (Fig. 40.20) of the noise of a UV-Vis detector in HPLC (Fig.
40.19). In some cases the power spectrum may have peaks at some specific
frequencies (see Fig. 40.18c). A very common source of this type of noise is the 50
Hz periodic interference of the power line. In reality most noise is a combination of
the noise types described. Spectral analysis is a useful tool in assessing the
frequency characteristics of these types of signal.
We first discuss signal enhancement in the time domain, which does not require
a transform to the frequency domain. It is noted that all discrete signals should be
sampled at uniform intervals.
In many instances the quality of the signal has to be improved before the
chemical information can be derived from it. One of the possible improvements is
the reduction of the noise. In principle there are two options, the enhancement of
the analog signal by electronic devices (hardware), e.g. an electronic filter, and the
537
(b)
Fig. 40.20. The power spectrum of the baseline noise given in Fig. 40.19.
5 6 6 6
time (s) time (s)
The simplest form of smoothing consists of using a moving average, which has
been introduced in Chapter 7 for quahty control. A window with an odd number of
data points is defined. The values in that window are averaged, and the central
point is replaced by that value. Thereafter, the window is shifted one data point by
dropping the last one in the window and including the next one (un-smoothed
measurement). The averaging process is repeated until all data points have been
averaged. For the smoothing of a given data point, data points are used which were
measured after this data point, which is not possible for analog filters. The
expression for the moving average of point / is:
j=m
j=-m
where g^ is the smoothed data point i^f^^j is the original data point /+/, (2m + 1) is
the size of the smoothing window.
The process of moving averaging — and we will see later, the process of
smoothing in general — can be represented as a convolution. Consider, therefore,
the data points/(I),/(2), .... , fin). The moving average of point/(5) using a
smoothing window of 5 data points is calculated as follows:
If we consider the zeros and ones to form a function h(t) and if we position the
origin of that function /i(0) in the centre of the defined window, then the above
process becomes:
0 0 1 1 1 1 1 0...
m= -4 -3 -2 -1 0 1 2 3...
givmg
^(5) = 1/5 JJ{m)h{m - 5)
or in general
g{t) = Yj'{m)h{m - 0/NORM (40.11)
for all m for which f(m) and h(m-t) are defined. NORM is a factor to keep the
integral of the signal constant.
540
0 6
Q.5
0.4
(window)/\A^^2
1.2
(0) (5) (9) (17) (25) h/hp
(b)
1 1 f • •-
0.8
0.6
0.4
0,2
Fig. 40.22. Distortion (h/ho) of a Gaussian peak for various window sizes (indicated within parenthe-
ses), (a) Moving average, (b) Polynomial smoothing.
541
windows. As one can see, the peak becomes lower and broader, but remains
symmetrical. The peak remains symmetric because of the symmetry of the filter
about its central point. Analog filters are by definition asymmetric because they
can only include data points at the left side of the data point being smoothed (see
also Section 40.5.2.4 on exponential smoothing). The applicability of moving
averaging and smoothing in general depends on the degree of deformation assoc-
iated with a certain improvement of the signal-to-noise ratio. Figure 40.22a shows
the effect of a moving averaging applied on a Gaussian peak to which white
random noise is added, as a function of the window size. Clearly, for window sizes
larger than half the half-height peak width, the peak is broadened and the intensity
drops. As a result adjacent peaks may be less resolved. Peak areas, however,
remain unaffected. From Fig. 40.22a one can derive the maximally applicable
window size (number of data points) to avoid peak deformation, given the scan
speed and the digitisation rate. Suppose, for instance, that the half-height width of
the narrowest peak in an IR spectrum is about 10 cm~\ the digitization rate is 10 Hz
and the scan rate is 2 cm"^ per sec. Hence, the half-height width is digitized in 50
data points. The largest smoothing window, therefore, introducing minor disturb-
ances is 25 data points. If the noise is white the signal-to-noise ratio is improved by
a factor 5.
Another unwanted effect of smoothing is the alteration of the frequency charac-
teristics of the noise. This calls for caution. Because low frequencies present in the
noise are not removed, the improvement of the signal-to-noise ratio may be
limited. This is illustrated in Fig. 40.23 where one can see that after smoothing
with a 25 point window low-frequency noise is left. In Section 40.5.3 filtering
v ^
Fig. 40.23. Polynomial smoothing (noise = N(0,3%)): 5-point; 17-point; 25-point smoothing window
and the noise left after smoothing.
542
methods in the frequency domain are discussed, which are capable of removing
specific frequencies from the noise.
h{-2) = - 3 ; h{-\) = 12; /i(0) = 17; /z(+l) = 12; /i(+2) = - 3 ; else h{t) = 0
In order to keep the average signal amplitude unaffected a scaling factor NORM
is introduced, which is the sum off all convolutes, here 35. The smoothing
procedure is now
g(t) = If(m)h(m-t)/lh{m)
TABLE 40.2
Convolutes for quadratic and cubic smoothing (adapted from Refs. [8,10])
Points 25 23 21 19 17 15 13 11 9 7 5
-12 -253
-11 -138 -42
-10 -33 -21 -171
-09 62 -2 -76 -136
-08 147 15 9 -51 -21
-07 222 30 84 24 -6 -78
-06 287 43 149 89 7 -13 -11
-05 343 54 204 144 18 42 0 -36
-04 387 63 249 189 27 87 9 9 -21
-03 422 70 284 224 34 122 16 44 14 -2
-02 447 75 309 249 39 147 21 69 39 3 -3
-01 462 78 324 264 42 162 24 84 54 6 12
00 467 79 329 269 43 167 25 89 59 7 17
01 462 78 324 244 42 162 24 84 54 6 12
02 447 75 309 249 39 147 21 69 39 3 -3
03 422 70 284 224 34 122 16 44 14 -2
04 387 63 249 189 27 87 9 9 -21
05 343 54 204 144 18 42 0 -36
06 287 43 149 89 7 -13 -11
07 222 30 84 24 -6 -78
08 147 15 9 -51 -21
09 62 -2 -76 -136
10 -33 -21 -171
11 -138 -42
12 -253
NORM 5175 805 3059 2261 323 1105 143 429 231 21 35
well, but to a less extent than for moving averaging. For a Gaussian peak the
window size of a quadratic polynomial smoothing should now be less than 1.5
times the half-height width compared to the value of 0.5 found for moving
averaging. The effect on the noise reduction and frequencies left in the noise is
comparable to the moving average filter. We refer to the work of Enke [9] for a
detailed discussion of peak deformation versus signal-to-noise improvement under
544
- 3 - 2 - 1 0 1 2 3
Fig. 40.24. Polynomial smoothing: window of 7 data points fitted with polynomials of degrees 0,1,2,
3 and 4.
which states that a data point / is smoothed by taking a weighted average of the
measurement at time / and the smoothed previous data point at time / - 1. By tuning
the weight X, the smoothing can be varied from no smoothing at all (for A. = 0) to
obtaining a constant value jCj (for A = 1), which remains unchanged regardless of
the value of new incoming measurements. The effect of the exponential smoothing
process is thus defined by a single parameter. Moreover, the smoothed data points
can easily be calculated by hand. This simplicity and versatility are the reasons for
the popularity of the filter. Exponential averaging is mainly applied to smooth time
series or stochastic processes in which one wants to detect a drift or a deviation
from the stationary state. The filter is less appropriate to smooth analytical signals
545
TABLE 40.3
Example of exponential smoothing according to Jcj = XJCJ_I + (1 - X)xi (k = 0.2)
(1 - X)xi + Xxi.
TABLE 40.4
The exponential shape of the moving average filter (xi = (1 - 'k)xi + Xxj_i)
1 xi ^1 = (l->-)xi
2 X2 X2 = (l-X)X2+'kxi = (1-X)x2 + X(l-X)xi
3 X3 ^3 = (1-A-)X3+XJC2 = (1-A-)X3 + M1->.)X2 + X\l-X)Xy
4 X4 ^4 =(1->.)X4+AJC3 = (l->-)X4 + >-(l-A,)X3 + A-^(1->.)X2 + X\l-k)Xi
5 X5 X5 = (1-X)X5+X^4 = (1-X)X5 + X,(1->.)X4 + A,2(1-X)X3 + P(1->.)X2 + X\l-X)X^
546
x{t)
10
J L_
5 6 10
t
Fig. 40.25. Effect of exponential smoothing on the data points listed in Table 40.3 (solid line: original
data; dotted line: smoothed data).
Signal
1
0.8
0.6
0.4
0.2
0 5 10 15 20 26 30 35 40 45 50
Time
Fig. 40.26. Effect of exponential smoothing {X = 0.6) on a Gaussian peak (wy^ = 6 data points) (solid
line: original data; bold line: smoothed data).
Table 40.3. As one can see, the filter introduces a slower response to stepwise
changes of the signal, as if it were measured with an instrument with a large
response time. Because fluctuations are smoothed, the standard deviation of the
signal is decreased, in this example from 2.58 to 1.95. A Gaussian peak is broadened
and becomes asymmetric by exponential smoothing (Fig. 40.26).
547
Instead of smoothing the data directly in the domain in which they were
acquired, the signal-to-noise ratio can also be improved by transforming the signal
to the frequency domain and eliminating noise frequencies present in the measure-
ments, after which one returns to the original domain. For instance, the power
spectrum of the noise of a flame ionization detector (Fig. 40.27) reveals the
presence of two dominant frequencies, namely at 2 and at 10 Hz. By substituting
all Fourier coefficients at frequencies higher than 5 Hz by zero, all high frequency
noise is eliminated after back transforming to the time domain. This operation is
called filtering, and because in this particular case low frequencies are retained,
this filter is called a low-pass filter. Equally, one could define a high-pass filter hy
setting all low frequency values equal to zero.
Mathematically this operation can be described by the same equation (eq.
(40.13)) as derived for polynomial smoothing, namely:
G(v) = F{v)H(v)
where H(v) is the filter function, which is now defined in the frequency domain.
Often used filter functions are:
Low-pass filter: H(v) = 1 for all v < VQ else H(v) = 0
High-pass filter: H(v) = 1 for all v > VQ else H(v) = 0
o
X
0.64
a
u
0.3 2
16 32 48
F r e q . in CPS
VQ is called the cut-off frequency. H(v) is referred in this context as di filter transfer
function.
Many other filter functions can be designed, e.g. an exponential or a trapezoidal
function, or a band pass filter. As a rule exponential and trapezoidal filters perform
better than cut-off filters, because an abrupt truncation of the Fourier coefficients
may introduce artifacts, such as the annoying appearance of periodicities on the
signal. The problem of choosing filter shapes is discussed in more detail by Lam
and Isenhour [11] with references to a more thorough mathematical treatment of
the subject. The expression for a band-pass filter is: H(v) = 1 for v^i„ < v < v^^^^ else
//(v) = 0. This filter is particularly useful for removing periodic disturbances of the
signal.
The effect of a low-pass filter applied on a Gaussian peak is shown in Fig. 40.28
for two cut-off frequencies. The lower the cut-off frequency of the filter, the more
noise is removed. However, this increasingly effects the high frequencies present
in the signal itself, causing a deformation. On the other hand, the higher the cut-off
frequency the more high-frequency noise is left in the signal. Thus the choice of
the cut-off frequency is often a compromise between the noise one wants to
eliminate and the deformation of the signal one can accept. The more the fre-
quencies of noise and signal are similar, the more difficult it becomes to improve
Fig. 40.28. Effect of alow-pass filter, (a) original Gaussian signal, (b) FT of (a), (c) Signal (a) filtered
with V() = 10. (d) Signal (a) filtered with Vo = 20.
549
the signal-to-noise ratio. For this reason it is very difficult to eliminate 1//noise,
because its power increases with lower frequencies.
Filtering and smoothing are related and are in fact complementary. Filtering is
more complicated because it involves a forward and a backward Fourier transform.
However, in the frequency domain the noise and signal frequencies are distin-
guished, allowing the design of a filter that is tailor-made for these frequency
characteristics.
Polynomial smoothing is more or less a trial and error operation. It gives an
improvement of the signal-to-noise ratio but the best smoothing function has to be
empirically found and there are no hard rules to do so. However, because of its
computational simplicity, polynomial smoothing is the preferred method of many
instrument manufacturers. By calculating the Fourier transform of the smoothing
convolutes derived by Savitzky and Golay one can see that polynomial smoothing
is equivalent to low-pass filtering. Figure 40.29 shows the Fourier transforms of
the 5-point, 9-point, 17-point and 25-point second-order convolutes given in Table
40.2 (the frequency scale is arbitrarily based on a 1024 data points sampled with
Fig. 40.29. Fourier spectrum of second-order Savitzky-Golay convolutes. (a) 5-point. (b) 9-point. (c)
17-point. (d) 25-point (arrows indicate cut-off frequencies).
550
Signals are differentiated for several purposes. Many software packages for
chromatography and spectrometry offer routines for determining the peak position
and for finding the up-slope and down-slope integration limits of a peak. These
algorithms are based on the calculation of the first- or second-derivative. In NIRA
small differences between spectra are magnified by taking the first or second
derivative of the spectra. Baseline drifts are eliminated as well. The simplest
procedure to calculate a derivative is by taking the difference between two
successive data points. However, by this procedure the noise is magnified by
several orders of magnitude leading to unacceptable results. Therefore, the
calculation of a derivative is usually linked to a smoothing procedure. In principle
one could smooth the data first. This requires a double sweep through the data, the
first one to smooth the data and the second one to calculate the derivative.
However, the smoothing and differentiation can be combined into a single step. To
explain this, we recall the way Savitzky and Golay derived the smoothing
convolutes by moving a window over the data and fitting a polynomial through the
data in the window. The central point in the window is replaced by the value of the
polynomial. Instead, one may replace it by the value of the first or second
derivative of that polynomial in that point. Savitzky and Golay [8] published
convolutes (corrected later on by Steinier [10]) for that operation (see Table 40.5).
This procedure is the recommended method for the calculation of derivatives. Fig.
40.30 gives the second-derivative of two noisy overlapped Gaussian peaks,
obtained with a quadratic 7-points smoothed derivative. The two negative regions
(shaded areas in the figure) reveal the presence of two peaks.
Sets of spectroscopic data (IR, MS, NMR, UV-Vis) or other data are often
subjected to one of the multivariate methods discussed in this book. One of the
issues in this type of calculations is the reduction of the number variables by
selecting a set of variables to be included in the data analysis. The opinion is
gaining support that a selection of variables prior to the data analysis improves the
results. For instance, variables which are little or not correlated to the property to
be modeled are disregarded. Another approach is to compress all variables in a few
features, e.g. by a principal components analysis (see Section 31.1). This is called
551
TABLE 40.5
Convolutes for the calculation of the smoothed second derivative (adapted from Ref. [8])
Points 25 23 21 19 17 15 13 11 9 7 5
-12 92
-11 69 77
-10 48 56 190
-09 29 37 133 51
-08 12 20 82 34 40
-07 -3 5 37 19 25 91
-06 -16 -8 -2 6 12 52 22
-05 -27 -19 -35 -5 1 19 11 15
-04 -36 -28 -62 -14 -8 -8 2 6 28
-03 -43 -35 -83 -21 -15 -29 -5 -1 7 5
-02 -48 -40 -98 -26 -20 -48 -10 -6 -8 0 2
-01 -51 -43 -107 -29 -23 -53 -13 -9 -17 -3 -1
00 -52 -44 -110 -30 -24 -56 -14 -10 -20 -4 -2
01 -51 -43 -107 -29 -23 -53 -13 -9 -17 -3 -1
02 -48 -40 -98 -26 -20 ^8 -10 -6 8 0 2
03 -43 -35 -83 -21 -15 -29 -5 -1 7 5
04 -36 -28 -62 -14 -8 -8 2 6 28
05 -27 -19 -35 -5 1 19 11 15
06 -16 -8 -2 6 12 52 22
07 -3 5 37 19 25 91
08 12 20 82 34 40
09 29 37 133 51
10 48 56 190
11 69 77
12 92
NORM 26910 17710 33649 6783 3876 6188 1001 429 462 42
5 40 45 50
0.05 h
•0.D5
-0.1
-0.15
of real coefficients which facilitates the calculations (see Section 40.3.8). Figure
40.31 shows a spectrum of 512 data points, which is reconstructed from resp-
ectively the first 2, 4, 8...,256 Fourier coefficients. The effect is more or less
comparable to wavelength selection by deleting data points at regular intervals. As
a consequence one looses high frequency information. When the rows of a data
table are replaced by the first n relevant Fourier coefficients, the properties of a
data table are retained. For instance, the Fourier coefficients of the rows of a
two-way data table of mixture spectra remain additive (distributivity property).
Similarities and dissimilarities between rows (the objects) are retained as well,
allowing the application of pattern recognition [12] and other multivariate
operations [13,14].
553
FFT
(b)
16
32
64
128
256
Fig. 40.31. Data compression by a Fourier transform, (a) A spectrum measured at 512 wavelengths;
(b) spectrum after reconstruction with 2, 4,..., 256 Fourier coefficients.
domain and a careful design of the inverse filter. The basic deconvolution algorithm
follows directly from eq. (40.13)
G(v)
F(v) =
H(v)
where G(v) is the Fourier transform of the damaged signal, F(v) is the FT of the
recovered signal and H(v) is the FT of the point-spread function. The back transform
of F(v) gives/(A:). Thus a deconvolution requires the following three steps:
(1) Calculate the FT of the measured signal and of the point-spread function to
obtain respectively G(v) and //(v).
(2) Divide G(v) by ^(v) at corresponding frequency values (according to the
rules for the division of two complex numbers), which gives F(v)
(3) Back-transform F(v) by which the undamaged signaly(jc) is estimated.
The effect of deconvolution applied on a noise-free Gaussian peak is shown in
Fig. 40.32a. Unfortunately as can be seen in Fig. 40.32c a deconvolution carried
out in the presence of noise (s=l%of the signal maximum) leads to no results at
all. This is caused by the fact that two different kinds of damage are present
sjgrlal 1.2
1
(a)
1
0.8 *
tl 0.8
0.6 ^
11 0.6
0.4
1 1 0.4
0.2
u
0 16 32
/l
48 64 80 96 112 128
0.2
time
Fig. 40.32. Deconvolution (result in solid line) of a Gaussian peak (dashed line) for peak broadening
((>v./Jpsf/(H'i/jG = 1). (a) Without noise, (b) With coloured noise (A^(0,1 %), 7JC = 1.5): inverse filter in
combination with a low-pass filter, (c) With coloured noise (M0,1 %), Tx=\.5): inverse filter without
low-pass filter.
555
simultaneously, namely signal broadening and noise. The model for the damaged
signal, therefore, needs to be expanded to the following expression:
g(x)=f(x)^h{x) + n{x)
H(v) H(v)
The unacceptable results of Fig. 40.32c are caused by the term N(v)/H{v) which is
large for high v values (= high frequency). Indeed H(v) approaches zero for high
frequencies whereas the value of A^(v) does not. The influence of the noise can be
limited by combining the inverse filter with a low-pass noise filter, which removes
all frequencies larger than a threshold value VQ. In this way one can avoid that the
term N(v)/H{v) inflates to large values (Fig. 40.32b). We observe that the overall
procedure consists of two contradictory operations: one which sharpens the signal
by removing the broadening effect of the measuring device, and one which increases
broadening because noise has to be removed. Consequently, the broadening effect
of the measuring device can only be partially removed. In Section 40.7 we discuss
other approaches such as the Maximum Entropy method and Maximum Likeli-
hood method, which are less sensitive to noise.
An essential condition for performing signal restoration by deconvolution is the
knowledge of the point-spread function (psf) h(x). In some instances h{x) can be
postulated or be determined experimentally by measuring a narrow signal having a
bandwidth which is at least 10 times narrower than the width of the point-spread
function. The effect of deconvolution is very well demonstrated by the recovery of
two overlapping peaks from a composite profile (see Fig. 40.33). The half-height
width of the psf was 1.25 times the peak width for the peak systems (a) and (b). For
the peak system (c) the half-height width of the psf and signal were equal.
It is still possible to enhance the resolution also when the point-spread function
is unknown. For instance, the resolution is improved by subtracting the second-
derivative g'\x) from the measured signal g{x). Thus the signal is restored by ag(x)
- (1 - a)g'Xx) with 0 < a < 1. This algorithm is called pseudo-deconvolution.
Because the second-derivative of any bell-shaped peak is negative between the
two inflection points (second-derivative is zero) and positive elsewhere, the
subtraction makes the top higher and narrows the wings, which results in a better
resolution (see Fig. 40.30). Pseudo-deconvolution methods can correct for sym-
556
1 0
U 9
O 6
0 7
O 6
0 !>
0 4
O J
0 2
O 1
0 0
O 20 40 6 0 60 100 120 140 160 180 200 20 40 60 80 100 120 140 160 180 200
11 (c)
1 U
0 9
o tt "V -r-v 1
o / // \ / \1
V M
c o
i
O t
J 4
tl
G J ij M
0 J 1 I
O J
ij L V
00 _-.yi/. v:-.^
20 40 60 8 0 100 120 140 160 1 8 0 2 0 0
Fig. 40.33. Restoration of two overlapping peaks by deconvolution. Dashed line: measured data. Solid
line: after restoration. Dotted line: difference between true and restored signals.
that the calculation of fix) from the measured spectrum g{x) by solving g(x) =
y(x)*/z(x) by deconvolution is a compromise between restoration and filtering.
Because of a lack of hard rules the selection of the noise filter introduces some
arbitrariness in the procedure.
Another class of methods such as Maximum Entropy, Maximum Likelihood
and Least Squares Estimation, do not attempt to undo damage which is already in
the data. The data themselves remain untouched. Instead, information in the data is
reconstructed by repeatedly taking revised trial data f(x) (e.g. a spectrum or
chromatogram), which are damaged as they would have been measured by the
original instrument. This requires that the damaging process which causes the
broadening of the measured peaks is known. Thus an estimate g(x) is calculated
from a trial spectrum J{x) which is convoluted with a supposedly known point-
spread function h{x). The residuals e{x) = g{x) - g(x) are inspected and compared
with the noise n{x). Criteria to evaluate these residuals are Maximum Entropy (see
Section 40.7.2) and Maximum Likelihood (Section 40.7.1).
P{gi\f)^—^ exp
2s}
Under the assumption that the noise in point / is uncorrelated with the noise in
pointy, the likelihood that (/*/i). for all measurements, /, represents the measured
set gj, g2, ..., g^ is the product of all probabilities:
(g,-(/*/^),)'
(40.15)
1=1 i=i 5,V27i 2s}
This likelihood function has to be maximized for the parameters in f. The maxim-
ization is to be done under a set of constraints. An important constraint is the
knowledge of the peak-shapes. We assume that f is composed of many individual
558
peaks of known shape. However, we make no assumption about the number and
position of the peaks. Because f is non-linear and contains many parameters to be
estimated, the solution of eq. (40.15) is not straightforward and should be calcul-
ated in an iterative way by a sequential optimisation strategy. Figure 40.34 shows
the kind of resolution improvement one obtains. Under the normality assumption
the Maximum Likelihood and the least squares criteria are equivalent. Thus one
can also minimize Zef by a sequential optimization strategy [15].
Before going into detail in the meaning of entropy and maximum entropy, the
effect of applying this principle for signal enhancement is shown in Fig. 40.35. As
can be seen, the effect is a drastic improvement of the signal-to-noise ratio and an
enhancement of the resolution. This effect is thus comparable to what is achieved
by the Maximum Likelihood procedure and by inverse filtering. However, the
Maximum Entropy technique apparently does improve the resolution of the signal
without increasing noise.
In physical chemistry, entropy has been introduced as a measure of disorder or
lack of structure. For instance the entropy of a solid is lower than for a fluid,
because the molecules are more ordered in a solid than in a fluid. In terms of
probability it means also that in solids the probability distribution of finding a
molecule at a given position is narrower than for fluids. This illustrates that
entropy has to do with probability distributions and thus with uncertainty. One of
the earliest definitions of entropy is the Shannon entropy which is equivalent to the
definition of Shannon's uncertainty (see Chapter 18). By way of illustration we
559
(a)
Ctfcefnlwl Shift/ppm
(b)
t« 140
ChgrnTcalSKVl/ppm
fi=-^Pi^0S2 Pi
i=\
TABLE 40.6
Shannon's uncertainty for two probability distributions (a broad and narrow distribution)
Distribution 1 2 3 5 5 3 2
Probability 0.10 0.15 0.25 0.25 0.15 0.10
-logaP/ 3.32 2.737 2.00 2.00 2.737 3.32
-Pi^og^Pi 0.332 0.411 0.50 0.50 0.411 0.332 H = 2.486
Distribution 2 0 2 8 8 2 0
Probability 0 0.10 0.40 0.40 0.10 0
-•og2P, 3.323 1.322 1.322 3.323
-Pi^og2Pi 0 0.332 0.529 0.529 0.332 0 H= 1.722
TABLE 40.7
Entropy of noise, noise plus a spike at / = 5, a spike at / = 5 and two spikes at / = 5 and 6
The maximum entropy method thus consists of maximizing the entropy under
the y^ constraint. An algorithm to maximize entropy is the so-called Cambridge
algorithm [16].
When the maximum entropy approach is used for signal restoration a step has to
be included between steps (1) and (2) in which the trial spectrum is first convoluted
(see Section 40.4) with the point-spread function before calculating and testing the
differences with the measured spectrum. The entropy of the trial spectrum before
convolution is evaluated as usual.
1 2
B,
B„
Nl
6 H
~i I ~T 1 1 r — I —
12 16 20 24
FHT
J L
16
32
"^~>^V
64
128
i\
256
Fig. 40.38. Spectrum given in Fig. 40.31 reconstructed with 2, 3,..., 256 Hadamard coefficients.
564
Hadamard transform [17]. For example the IR spectrum (512 data points) shown
in Fig. 40.31a is reconstructed by the first 2, 4, 8, ... 256 Hadamard coefficients
(Fig. 40.38). In analogy to spectrometers which directly measure in the Fourier
domain, there are also spectrometers which directly measure in the Hadamard
domain. Fourier and Hadamard spectrometers are called non-dispersive. The
advantage of these spectrometers is that all radiation reaches the detector whereas
in dispersive instruments (using a monochromator) radiation of a certain wave-
length (and thus with a lower intensity) sequentially reaches the detector.
A common feature of the Fourier and Hadamard transform is that they describe
an overall property of the signal in the measurement range, ^ = 0 to ^ = 7^.
However, one may be interested in local features of the signal. For instance, it may
well be that at the beginning of the signal the frequencies are much higher than at
the end of the signal as shows Fig. 40.39. This is certainly true when the signal
contains noise and peaks with different peak widths. In Fig. 40.39 there are regions
with a high, low and intermediate frequency. One way to detect these local features
is by calculating the FT in a moving window of size T^, and to observe the
i+1,w
i+1,w
^1,0 • • • ^ l , n ^ l , 0 "•^\,n
^ 2 , 0 '"^2,n^2,0 "^2,n
The columns of this matrix contain the time information (amplitudes at a specific
frequency as a function of time) and the rows the frequency information.
566
^Ja K a
A series of Morlet wavelets for various dilation values is shown in Fig. 40.42. In
a similar way as for the FT, where only frequencies are considered which fit an
exact number of times in the measurement time, here, only dilation values are
considered which are stretched by a factor of two. A wavelet transform consists of
fitting the measurements with a basis of wavelets as shown in Fig. 40.42, which are
generated by stretching and shifting a mother wavelet h(t). The narrowest wavelet
(level a^) is shifted in small steps, whereas a broad wavelet is shifted in bigger
steps. The shift b is usually a multiple of the dilation value. The fitting of these
basis of wavelets on the data yields the wavelet transform coefficients. Coefficients
associated with narrow wavelets describe the local features in a signal, whereas the
broad wavelets describe the smooth features in the signal.
a)
c)
A
Fig. 40.41. The Haar wavelet (a) and three Daubechies wavelets (b-d).
567
C) C2 0 0 0 0 0 0
0 0 Cl C2 0 0 0 0
0 0 0 0 Cl C2 0 0
0 0 0 0 0 0 c, C-,
Multiplication of this 4x8 transformation matrix with the 8x1 column vector of
the signal results in 4 wavelet transform coefficients or N/2 coefficients for a data
vector of length A^. For Cj = C2 = C3 = C4 = 1, these wavelet transform coefficients are
equivalent to the moving average of the signal over 4 data points. Consequently,
568
these wavelet filter coefficients define a low-pass filter (see Section 40.5.3), and
the resulting wavelet transform coefficients contain the 'smooth' information of
the signal. For this reason this set of wavelet filter coefficients is called the
approximation coefficients and the resulting transform coefficients are the a-compo-
nents. The transform matrix containing the approximation coefficients is denoted
as the G-matrix. In the above example with 8 data points, the highest possible
transform level is level 3 (8 non-zero coefficients). The result of this transform is
the average of the signal. The level zero (1 non-zero coefficient) returns the signal
itself
Besides this first set of coefficients, a second set of filter coefficients is defined
which is the equivalent of a high-pass filter (see Section 40.5.3) and describes the
detail in the signal. The high-pass filter uses the same set of wavelet coefficients,
but with alternating signs and in reversed order. These coefficients are arranged in
the H-matrix. The H-matrix for the level two transform of a signal with length 8 is:
Cl -C\ 0 0 0 0 0 0
0 0 C2 -C\ 0 0 0 0
0 0 0 0 Cl -C| 0 0
0 0 0 0 0 0 C-, -c
The coefficients in the H-matrix are the detail coefficients. The output of the
H-matrix are the ^-components. With Nil detail components and Nil approx-
imation components, we are able to reconstruct a signal of length N.
The discrete wavelet transform can be represented in a vector-matrix notation
as:
a = W'^f (40.16)
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 .
Rows 1-8 are the approximation filter coefficients and rows 9-16 represent the
detail filter coefficients. At each next row the two coefficients are moved two
positions (shift b equal to 2). This procedure is schematically shown in Fig. 40.43
for a signal consisting of 8 data points. Once W has been defined, the a^ wavelet
transform coefficients are found by solving eq. (40.16), which gives:
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ro.oooo" " 0.1470"
0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0.2079 0.7032
0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0.4067 1.1378
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0.5878 1.3757
0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0.7431 1.3757
0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0.8660 1.1378
0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0.9511 0.7032
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0.9945 0.1470
/1/2
1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.9945 -0.1470
0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0.9511 -0.1281
0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0.8660 -0.0869
0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0.7431 -0.0307
0 0 0 0 0 0 0 0 1-10 0 0 0 0 0 0.5878 0.0307 '
0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0.4067 0.0869
0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0.2079 0.1281
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 [o.oooo_ . 0.1470 J
The factor •\ll/ 2 is introduced to keep the intensity of the signal unchanged. The 8
first wavelet transform coefficients are the a or smooth components. The last eight
coefficients are the d or detail components. In the next step, the level 2 components
are calculated by applying the transformation matrix, corresponding to the a^ level
on the original signal. This a^ transformation matrix contains 4 wavelet filter
570
original data
approximation detail
rn m
1
Ti m
, [, \, \
J 1
iTTTi rrm 1 iTi m
IJ 11FT]
• iJ
Fig. 40.43. Waveforms for the discrete wavelet transform using the Haar wavelet for an 8-points long
signal with the scheme of Mallat' s pyramid algorithm for calculating the wavelet transform coefficients.
1 1 0 0 ] [0.6012" 1.6819"
0 0 1 1 1.7773 1.6818
^n/2
1-10 0 1.7773 -0.8316
0 0 1 -ij [0.6012 0.8316J
and finally the level-4 coefficients (a^) are calculated according to:
1 1 1.6819 2.3785
^fU2
1 - 1 1.6818 0.0000
Having a closer look at the pyramid algorithm in Fig. 40.43, we observe that it
sequentially analyses the approximation coefficients. When we do analyze the
detail coefficients in the same way as the approximations, a second branch of
decompositions is opened. This generalization of the discrete wavelet transform is
called the wavelet packet transform (WPT). Further explanation of the wavelet
packet transform and its comparison with the DWT can be found in [19] and [21].
The final results of the DWT applied on the 16 data points are presented in Fig.
40.44. The difference with the FT is very well demonstrated in Fig. 40.45 where
we see that wavelet a^ describes the locally fast fluctuations in the signal and
wavelet a^ the slow fluctuations. An obvious application of WT is to denoise
spectra. By replacing specific WT coefficients by zero, we can selectively remove
572
approximations details
J
^ ^
2^^.
Fig. 40.44. Wavelet decomposition of a 16-point signal (see text for the explanation).
QI—•^'^—^^A^-*^
0.2
Hri
V -0.2
123 128
noise from distinct areas in the signal without disturbing other areas [22]. Mitter-
mayr et al. [23] compared the wavelet filters to Fourier filters and to polynomial
smoothers such as the Savitzky-Golay filters.
Wavelets have been applied to analyze signals arising from several areas, as
acoustics [24], image processing [22], seismics [25] and analytical signals [23,26,
27]. Another obvious application is to use wavelets to detect peaks in a noisy
signal. Each sudden change of the signal by the appearance of a peak results in a
573
wavelet coefficient at that position [27]. Recently, it has been shown that signals
can be compressed to a fairly small number of coefficients without much loss of
information. Bos et al. [26] applied this property to compress IR spectra by a factor
of 20 prior to a classification by a neural net. Feature reduction by wavelet
transform for multivariate calibration has been studied by Jouan-Rimbaud et al.
[28].
References
1. F.C. Strong III, How the Fourier transform infrared spectrophotometer works. J. Chem. Educ,
56(1979)681-684.
2. R. Brereton, Tutorial: Fourier transforms. Use, theory and applications to spectroscopic and re-
lated data. Chem. Intell. Lab. Syst., 1 (1986) 17-31.
3. R.N. Bracewell, The Fourier Transform and its Applications. 2nd rev. ed., McGraw-Hill, New
York, 1986.
4. G. Doetsch, Anleitung zum praktischen gebrauch der Laplace-transformation und der Z-
transformation. R. Oldenbourg, Munchen 1989.
5. K. Schmidt-Rohr and H.W. Spiess, Multidimensional Solid-state NMR and Polymers. Aca-
demic Press, London 1994, pp. 141.
6. J.W. Cooley and J.W. Tukey, An algorithm for the machine calculation of complex Fourier se-
ries. Math. Comput., 19 (1965) 297-301.
7. E.G. Brigham, The Fast Fourier Transform. Prentice-Hall, Englewood Cliffs NJ, 1974.
8. A. Savitzky and M.J.E. Golay, Smoothing and differentiating of data by simplified least-
squares procedures. Anal. Chem., 36 (1964) 1627-1639.
9. C.G. Enke and T.A. Nieman, Signal-to-noise ratio enhancement by least-squares polynomial
smoothing. Anal. Chem., 48 (1976) 705A-712A.
10. J. Steinier, Y. Termonia and J. Deltour, Comments on smoothing and differentiation of data by
simplified least squares procedure. Anal. Chem., 44 (1972) 1906-1909.
11. B. Lam and T.L. Isenhour, Equivalent width criterion for determining frequency domain cut-
offs in Fourier transform smoothing. Anal. Chem., 53 (1981) 1179-1182.
12. W. Wu, B. Walczak, W. Pennincks and D.L, Massart, Feature reduction by Fourier transform
in pattern recognition of NIR data. Anal. Chim. Acta, 331 (1996) 75-83.
13. L. Pasti, D. Jouan-Rimbaud, D.L. Massart and O.E. de Noord, Application of Fourier trans-
form to multivariate caUbration of Near Infrared Data. Anal. Chim. Acta, 364 (1998) 253-263.
14. W.F. McClure, A. Hamid, E.G. Giesbrecht and W.W. Weeks, Fourier analysis enhances NIR
diffuse reflectance spectroscopy. Appl. Spectrosc, 38 (1988) 322-329.
15. L.K. DeNoyer and J.G. Dodd, Maximum Likelihood deconvolution for spectroscopy and chro-
matography. Am. Lab., 23 (1991) D24-H24.
16. E.D. Laue, M.R. Mayger, J. Skilling and J. Staunton, Reconstruction of phase-sensitive two-
dimensional NMR-spectra by maximum entropy. J. Magn. Reson., 68 (1986) 14-29.
17. S.A. Dyer, Tutorial: Hadamard transform spectrometry. Chemom. Intell. Lab. Syst., 12 (1991)
101-115.
18. I. Daubechies, Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math.,
41(1988) 909-996.
19. B. Walczak and D.L.Massart, Tutorial: Noise suppression and signal compression using the
wavelet packet transform. Chemom. Intell. Lab. Syst., 36 (1997) 81-94.
574
20. I. Daubechies, S. Mallat and A.S. Willsky, Special issue on wavelet transforms and multireso-
lution signal analysis. IEEE Trans. Info Theory, 38 (1992) 529-531.
21. B. Walczak, B. van den Bogaert and D.L. Massart, Application of wavelet packet transform in
pattern recognition of near-IR data. Anal. Chem., 68 (1996) 1742-1747.
22. S.G. Nikolov, H. Hutter and M. Grasserbauer, De-noising of SIMS images via wavelet shrink-
age. Chemom. Intell. Lab. Syst., 34 (1996) 263-273.
23. C.R. Mittermayr, S.G. Nikolov, H. Hutter and M. Grasserbauer, Wavelet denoising of Gaus-
sian peaks: a comparative study. Chemom. Intell. Lab. Syst., 34 (1996) 187-202.
24. R. Kronland-Martinet, J. Morlet and A. Grossmann, Analysis of sound patterns through
wavelet transforms. Int. J. Pattern Recogn. Artif. Intell., 1 (1987) 273-302.
25. P. Goupillaud, A. Grossmann and J. Morlet, Cycle-octave and related transforms in seismic
signal analysis. Geoexploration, 23 (1984) 85-102.
26. M. Bos and J. A.M. Vrielink, The wavelet transform for pre-processing IR spectra in the identi-
fication of mono- and di-substituted benzenes. Chemom. Intell. Lab. Syst., 23 (1994) 115-122.
27. M. Bos and E. Hoogendam, Wavelet transform for the evaluation of peak intensities in flow-
injection analysis. Anal. Chim. Acta, 267 (1992) 73-80.
28. D. Juan-Rimbaud, B. Walczak, R.J. Poppi, O.E. de Noord and D.L. Massart, Application of
wavelet transform to extract the relevant component from spectral data for multivariate calibra-
tion. Anal. Chem., 69 (1997) 4317-4323.
Chapter 41
Kalman Filtering
41.1 Introduction
compounds as a function of time during the data acquisition, two models are
needed, the Lambert-Beer model and the kinetics model. The Lambert-Beer
model relates the measurements (absorbances) to the concentrations. This model is
called the measurement model {A =J{a,c)). The kinetics model describes the way
the concentrations vary as a function of time and is called the system model (c =
J{k,t)). In this particular instance, the system model is an exponential function in
which the reaction rate k is the regression parameter. The terms * systems' and
'states' are associated with the Kalman fdter. The system here is the chemical
reaction and its state at a given time is the set of concentrations of the compounds at
that time. The output of the system model is the state of the system. The measure-
ment and system models fully describe the behaviour of the system and are
referred to as the state-space model. Thus the system and measurement models are
connected. The parameters of the measurement model (in our example, the conc-
entrations of the reactants and the reaction product) are the dependent variables of
the system model, in which the reaction rate is the regression parameter and time is
the independent variable. Later on we explain that system models may also contain
a stochastic part. It should be noted that this dual definition implies that two sets of
parameters are estimated simultaneously, the parameters of the measurement
model and the parameters of the systems model.
In this chapter we introduce the Kalman filter, with which it is possible to
estimate the actual values of the parameters of a state-space model, e.g., the rate of
a chemical reaction from the evolution of the concentrations of a reaction product
and the reactants as a function of time which in turn are estimated from the
measured absorbances. Let us consider an experiment where a flow-through cell is
connected to a reaction vessel, in which the reaction takes place. During the
reaction, which may be fast with respect to the scan speed of the spectrometer, a
spectrum is measured. At r = 0 the measurement of the spectrum is started, e.g., at
320 nm in steps of 2 nm per second. Every second (2 nm) estimates of the
concentrations of the compounds are updated by the Kalman filter. When arrived
at the end of the spectral range (e.g. 600 nm) we may continue the measurement
process as long as the reaction takes place, by reversing the scan of the spectrometer.
As mentioned before, the confidence in the estimates of the model parameters
improves during the measurement-estimation process. Therefore, we want to
prevent a new but deviating measurement to influence the estimated parameters
too much. In the Kalman filter this is implemented by attributing a lower weight to
new measurements.
An important property of a Kalman filter is that during the measurement and
estimation process, regions of the measurement range can be identified where the
model is invalid. This allows us to take steps to avoid these measurements
affecting the accuracy of the estimated parameters. Such a filter is called the
adaptive Kalman fdter. An increasing number of applications of the Kalman filter
577
has been published, taking advantage of the formulation of a systems model which
describes the dynamic behaviour of the parameters in the measurement model and
exploiting the adaptive properties of the filter.
In this chapter we discuss the principles of the Kalman filter with reference to a
few examples from analytical chemistry. The discussion is divided into three parts.
First, recursive regression is applied to estimate the parameters of a measurement
equation without considering a systems equation. In the second part a systems
equation is introduced making it necessary to extend the recursive regression to a
Kalman filter, and finally the adaptive Kalman filter is discussed. In the concluding
section, the features of the Kalman filter are demonstrated on a few applications.
yi = ^0 + ^1 ^i + ^i
bg and b^ are the estimates of the true regression parameters % and Pj, calculated
by linear regression and e^ is the contribution of measurement noise to the response.
In matrix notation the model becomes:
y. = xj h + e^
h
A.
A recursive algorithm which estimates PQ and pj has the following general form:
New estimate = Previous estimate + Correction
578
After each new observation, the estimates of the model parameters are updated
(= new estimate of the parameters). In all equations below we treat the general case
of a measurement model with p parameters. For the straight line model /? = 2. An
estimate of the parameters b based ony - 1 measurements is indicated by b(/ - 1).
Let us assume that the parameters are recursively estimated and that an estimate b(/
- 1) of the model parameters is available fromj - 1 measurements. The next
measurement y(j) is then performed at x(/), followed by the updating of the model
parameters to b(/).
The first step of the algorithm is to calculate the innovation (I), which is the
difference between measured y(j) and predicted response y(j) at x(/). Therefore, the
last estimate b(/ - 1) of the parameters is substituted in the measurement model in
order to forecast the response y(j), which is measured at x(j):
y(j) = bo(j-l) + b,(j-l)x(j)or
>;(/•) = xT(/*)b(/'-l) (41.1)
The innovation /(/) (not to be confused with the identity matrix I) is the difference
between measured and predicted response at x(j). Thus /(/) = y(j) - yij)-
The value of the innovation is used to update the estimates of the parameters of
the model as follows:
b(7) = b ( 7 - l ) + k ( ; ) / ( j ) (41.2)
px\ px\ pxl 1x1
Equation (41.2) is the first equation of the recursive algorithm. k(/) is a/7xl vector,
called the gain vector. Looking in more detail at this gain vector, we see that it
weights the innovation. For k equal to the null vector, b(/) = b(/ - 1), leaving the
model parameters unadapted. The larger the k, the more weight is attributed to the
innovation and as a consequence to the last observation. Therefore, one may
intuitively understand that the gain vector depends on the confidence we have in
the estimated parameters. As usual this confidence is expressed in the variance-
covariance matrix of the parameters. Because this confidence varies during the
estimation process, it is indicated by P(/) which is given by:
nj)--
One expects that during the measurement-prediction cycle the confidence in the
parameters improves. Thus, the variance-covariance matrix needs also to be
updated in each measurement-prediction cycle. This is done as follows [1]:
P(7)=P(;-l)-k(;)xT(7)P(7-l) (41.3)
pxp pxp pxl Ixp pxp
579
Equation (41.3) is the second equation of the recursive filter, which expresses
the fact that the propagation of the measurement error depends on the design (X) of
the observations. Once the fitting process is complete the square root of the
diagonal elements of P give the standard deviations of the parameter estimates.
Key factor in the recursive algorithm is the gain vector k, which controls the
updating of the parameters as well as the updating of the variance-covariance
matrix P. The gain vector k is the most affected by the variance of the experimental
error r(j) of the new observation y(j) and the uncertainty PO - 1) in the parameters
b(/ - 1). When the elements of PO* - 1) are much larger than r(/), the gain factor is
large, otherwise it is small. After each new measurement j the gain vector is
updated according to eq. (41.4)
k(7) = P ( y - l ) x ( y ) ( x T ( y ) P ( y - l ) x ( ; ) + r ( ; ) ) - i (41.4)
pxl pxp pxl \xp pxp px\ 1x1
The expression x^(/)P(/ - l)x(/) in eq. (41.4) represents the variance of the
predictions, y(j), at the value x(j) of the independent variable, given the uncertainty
in the regression parameters P(/). This expression is equivalent to eq. (10.9) for
ordinary least squares regression. The term r(j) is the variance of the experimental
error in the response y(j). How to select the value of r(j) and its influence on the
final result are discussed later. The expression between parentheses is a scalar.
Therefore, the recursive least squares method does not require the inversion of a
matrix. When inspecting eqs. (41.3) and (41.4), we can see that the variance-
covariance matrix only depends on the design of the experiments given by x and on
the variance of the experimental error given by r, which is in accordance with the
ordinary least-squares procedure.
Typically, a recursive algorithm needs initial estimates for b(0) and P(0) to start
the iteration process and an estimate of r(j) for all j during the measurement-
estimation process. When no prior information on the regression parameters is
available b(0) is usually set equal to zero. In many textbooks [1], it is recom-
mended to choose for P(0) a diagonal matrix with very large diagonal elements,
expressing the large uncertainty in the chosen starting values b(0). As explained
before, r(j) represents the variance of the experimental error in observation y(j).
Although the choice of P(0) and r(j) introduces some arbitrariness in the calcu-
lation, we show in an example later on that the gain vector (k) which fully controls
the updating of the estimates of the parameters is fairly insensitive for the chosen
values r(/) and P(0). However, the obtained final value of P depends on a correct
estimation of r(j). Only when the value of r(j) is equal to the variance of the
experimental error, does P converge to the variance-covariance matrix of the
estimated parameters. Otherwise, no meaning can be attributed to P.
In summary, after each new measurement a cycle of the algorithm starts with the
calculation of the new gain vector (eq. (41.4)). With this gain vector the variance-
580
Iteration step
Fig. 41.1. Evolution of the innovation during the recursive estimation process (iteration steps) (see
Table 41.1).
covariance matrix (eq. (41.3)) is updated and the estimates of the parameters are
updated (eq. (41.2)). By monitoring the innovation sequence, the stability of the
iteration process can be followed. Initially, the innovation shows large fluctu-
ations, which gradually fade out to the level expected from the measurement noise
(Fig. 41.1). An innovation which fails to converge to the level of the experimental
error indicates that the estimation process is not completed and more observations
are required. However, P(0) also influences the convergence rate of the estimation
process as we show with the calibration example discussed below.
By way of illustration, the regression parameters of a straight line with slope = 1
and intercept = 0 are recursively estimated. The results are presented in Table 41.1.
For each step of the estimation cycle, we included the values of the innovation,
variance-covariance matrix, gain vector and estimated parameters. The variance
of the experimental error of all observations y is 25 10"^ absorbance units, which
corresponds to r = 25 10"^ au for a l l / The recursive estimation is started with a
high value (10^) on the diagonal elements of P and a low value (1) on its
off-diagonal elements.
The sequence of the innovation, gain vector, variance-covariance matrix and
estimated parameters of the calibration lines is shown in Figs. 41.1^1.4. We can
clearly see that after four measurements the innovation is stabilized at the
measurement error, which is 0.005 absorbance units. The gain vector decreases
monotonously and the estimates of the two parameters stabilize after four measure-
ments. It should be remarked that the design of the measurements fully defines the
variance-covariance matrix and the gain vector in eqs. (41.3) and (41.4), as is the
case in ordinary regression. Thus, once the design of the experiments is chosen
581
TABLE 41.1
Recursive estimation of the parameters BQ and Z?^ of a straight line (see text for symbols; y are the measure-
ments). InitiaUsation: bT(0) = [0 0]; r = 25 10-^; P^^(0) = P2,2(^)= 1000; Fi 2CO) = PiA^) = 1
(x(j),j = 1, ..., n), one can predict how the variance-covariance matrix behaves
during the iterative process and decide on the number of experiments or decide on
the design itself. This is further explained in Section 41.3. The relative insensitivity
of the estimation process for the initial value of P is illustrated by repeating the
calculation with the diagonal elements P(l,l) and P(2,2) of P set equal to 100
instead of equal to 1000 (see Table 41.2). As can be seen from Table 41.2 the gain
vector, the variance-covariance matrix and the estimated regression parameters
rapidly converge to the same values. Also using unrealistically high values for the
experimental error (e.g. r(j) = 1) does not affect the convergence of the gain factors
too much as long as the diagonal elements of P remain high. However, we also
582
Iteration number
Fig. 41.2. Gain factor during the recursive estimation process (see Table 41.1).
lt+4
1E+2
b
1E+0
\ \ P(2,2)
1E-2
\pair'"---~.^___^
1E-4
8 9 10
Iteration number
Fig. 41.3. Evolution of the diagonal elements of the variance-covariance matrix (P) during the estima-
tion process (see Table 41.1).
Q.
i 1
<D i
^—" •
o 1 1 slope
Q) C 0.8
"(0 o 0.6 : /
0) (A
0.4 - /
0.2
intercept
^ / ^
u
,,. 1 1 ,1 1 , 1 , 1. , 1 1 1
10
Iteration number
Fig. 41.4. Evolution of the model parameters during the iteration process (see Table 41.1).
This straight line example exemplifies the fact that by exploiting the recursive
approach alternative calibration practices are possible. The standard practice in
calibration is to measure calibration standards prior to the analysis of the unknown
samples and to construct a calibration line. Thereafter one starts the analysis of a
sample batch. To check whether the analysis system performs well, each batch of
samples is preceded by a check sample with a known concentration. The results of
the check sample are plotted in a control chart (see Chapter 7). When the value
exceeds the warning or action limits, one may decide to re-establish the calibration
line by remeasuring a new set of calibration standards. This procedure disregards
the knowledge on the calibration factors already available from the previous
calibration run. By applying the recursive algorithm, it is in principle sufficient to
remeasure a limited set of calibration standards and to update the estimates of the
calibration factors already available from the previous calibration step. However,
as can be seen from the example given in Table 41.1, this would mean that the
weight attributed to the new measurements steadily decreases (small gain vector).
This leads to the undesirable situation where the calibration factors are hardly
updated. However, it should be bome in mind that the reason to decide to
recalibrate was that during the measurement of the unknown samples, calibration
factors may be changed. As a result, the uncertainty in the calibration factors were
judged to be unacceptably large, requiring recalibration. Expressed in terms of a
system, the system state (here slope and intercept) has not been observed for some
time. Because this system state varies in time (in an unknown deterministic or
stochastic way), the uncertainty about the system state increases and the state has
to be observed again. This uncertainty is taken into account in the Kalman filter by
the system equation, which is discussed in the next section. One of the methods to
584
TABLE 41.2
Influence of initial values of the diagonal values of the variance-covariance matrix P(0) and the variance of
the experimental error on the gain vector and the Innovation sequence (see Table 41.1 for the experimental
values, y)
kl k2 I kl k2 I kl k2 I kl k2 I
1 0.99 0.10 0.101 0.98 0.10 0.101 0.99 0.11 0.101 0.98 0.11 0.101
2 -1.0 10.0 0.094 -0.75 8.31 0.094 -1.0 9.99 0.094 0.00 3.33 0.095
3 -0.73 5.20 0.011 -0.62 4.76 0.035 -0.66 4.99 0.011 -0.33 3.31 0.105
4 -0.53 3.08 0.002 -0.48 2.94 0.011 -0.50 2.99 0.002 -0.37 2.48 0.069
b -0.41 2.02 -0.005 -0.39 1.98 0.0005 -0.40 1.99 -0.045 -0.34 1.80 0.034
6 -0.33 1.43 0.007 -0.33 1.42 -0.010 -0.33 1.42 0.007 -0.30 1.34 0.034
7 -0.28 1.07 -0.003 -0.28 1.07 -0.001 -0.28 1.07 -0.003 -0.27 1.03 0.016
8 -0.24 0.83 -0.005 -0.25 0.83 -0.003 -0.25 0.83 -0.004 -0.24 0.81 0.009
9 -0.22 0.66 -0.007 -0.22 0.66 -0.006 -0.22 0.66 -0.007 -0.21 0.65 0.003
10 -0.19 0.54 0.008 -0.20 0.54 0.009 -0.20 0.54 0.008 -0.19 0.54 0.016
where v(j) includes the random measurement error (not to be confounded with r(j)
which is the variance of the random measurement error) and the systematic error
caused by the bias in model parameters, x(/ - 1). The equations for the variance-
covariance update and Kalman gain update are adapted accordingly. The overall
sequence of the equations is given in Table 41.3. It should be noted that here the
state vector x is time-invariant.
TABLE 41.3
Kalman filter algorithm equations for time-invariant system states
Variance-covariance update
PO; = PO"-l)-kWh'^0-)PO'-l) (41.7)
pxp pxp pxl Ixp pxp
Measurement zij)
State update
x(j)=xU-l)-^k(j)(z{j)-h^ (j)xU-l)) (41.8)
pxl pxl pxl 1x1 pxl 1x1
586
TABLE 41.4
Absorptivities of CI2 and Br2 in chloroform [2]
za;=h^(/>o'-i)
where z(j) is the predicted absorbance at the 7th measurement, using the latest
estimated concentrations x(j - 1) obtained after the (/* - l)th measurement. h^O)
contains the absorptivities of CI2 and Br2 at the wavelength chosen for the jth
measurement.
Step 1. Initialisation (/ = 0)
"1000 1
P(0) = x(0) =
1 1000
Step 2. Update of the Kalman gain vector k(l) and variance-covariance matrix
P(l)
k(l) = P(0)h(l)(hT(l)P(0)h(l) + i r '
P(l) = P(0)-k(l)h'^(l)P(0)
This gives for design A:
[0.1596"
' L 5-744 _
TABLE 41.5
Calculated gain vector k and variance covariance matrix P\ f 11(0) = P2.2(^) = 1000; P^j^O) = ^2.1(0) = 1 > ''= 1
TABLE 41.6
Estimated concentrations (see Table 41.5 for the starting conditions)
Step Wavelength x2
CL Br.
1 22 0.0054 0.196
2 24 0.0072 0.2004
3 26 0.023 0.2022
4 28 0.0802 0.1993
5 30 0.0947 0.1982
6 32 0.0959 0.198
TABLE 41.7
Calculated gain vector, variance covariance matrix and estimated concentrations (see Table 41.5 for the
starting conditions)
TABLE 41.8
Concentration estimation with the optimal set of wavelengths (see Table 41.5 for the starting conditions)
(2) From the evolution of P in design B (Table 41.7), one can conclude that the
measurement at wavenumber 24 10^ cm~^ does not really improve the
estimates already available after the sequence 22, 32. Equally the measure-
ment at 26 10^ cm~^ does not improve the estimate already available after
the sequence 22,32,24 and 30 10^ cm~^. This means that these wavelengths
do not contain new information. Therefore, a possibly optimal set of wave-
numbers is 22,32 and 30. Inclusion of a fourth wavelength namely at 28 10^
cm~^ probably does not improve the estimates of the concentrations already
available, since the value of P converged to a stable value. To confirm this
conclusion, the recursive regression was repeated for the set of wave-
lengths 22, 32 and 30 10^ cm"^ (see Table 41.8). Thyssen et al. [3] devel-
oped an algorithm for the selection of the best set of m wavelengths out of
n. Instead of having to calculate 10^^ determinants to find the best set of six
wavelengths out of 300, the recursive approach only needs to evaluate a
rather straightforward equation 420 times. The influence of the measure-
ment sequence on the speed of convergence is well demonstrated for the
four-component analysis (Fig. 41.5) of a mixture of aniline, azobenzene,
nitrobenzene and azoxybenzene [3]. In the forward scan mode a quick
convergence is attained, whereas in the backward scan mode, convergence is
slower. Using an optimized sequence, convergence is complete after less
than seven measurements (Fig. 41.6). Other methods for wavelength
selection are discussed in Chapters 10 and 27.
" K
M •
1 .6 " K
X
K
M
K M
X
1.2 «"" K
•". X * X X
X Z
0.8 r> *-
rzi
AA _
xa^ie""
-4
1 1 1 1 1 1 1 1 1- > <
14 28 42 56 70 —> k
0.8 +
0.4 \
78 —> k
Fig. 41.5. Multicomponent analysis (aniline (jci), azobenzene fe), nitrobenzene fe) and azoxy-
benzene (JC4)) by recursive estimation (a) forward run of the monochromator (b) backward run (k indi-
cates the sequence number of the estimates; solid lines are the concentration estimates; dotted lines are
the measurements z).
591
l.B f
1.2
XltUB
M 28 42 56 70 —> k
Fig. 41.6. Multicomponent analysis (see Fig. 41.5) with an optimized wavelength sequence.
the system equation. As explained before, the system equation describes how the
system changes in time. In the kinetics example the system equation describes the
change of the concentrations as a function of time. This is a deterministic system
equation. The random fluctuation of the slope and intercept of a straight line can be
described by a stochastic model, e.g., an autoregressive model (see Chapter 20).
Any unmodelled system fluctuations left are included in the system noise, w(j).
The system equation is usually expressed in the following way:
where F(/j - 1) is the system transition matrix, which describes how the system
state changes from time ti_^ to time /^. The vector w(/) consists of the noise
contributions to each of the system states. These are system fluctuations which are
not modelled by the system transition matrix. The parameters of the measurement
equation, the h-vector and system transition matrix for the kinetics and calibration
model are defined in Table 41.9. In the next two sections we derive the system
equations for a kinetics and a calibration experiment. System state equations are
not easy to derive. Their form depends on the particular system under
consideration and no general guidance can be given for their derivation.
592
TABLE 41.9
Definition of the state parameter (x), h-vector and transition matrix (F) for two systems
1. Calibration Slope and intercept of [ 1 c] where c is the concent- time constant of the
the calibration line at ration of the calibration the variations of slope
time t standard measured at time t and intercept
Let us assume that a kinetics experiment is carried out and we want to follow the
concentrations of component B which is formed from A by the reaction: A -> B.
For a first-order reaction, the concentrations of A (= jCj) and B (= x^^) as a function
of time are described by two differential equations:
dxj Idt = -k^x^
djC2/dr = fcjjCj
1 1 •^i(y) 0
+ (41.12)
.a(7 + l) 0 1 a(7) WaCi+l)
A similar equation can be derived for the slope, where P is the drift parameter:
X2(;+l)
1 ip2(;) 0
(41.13)
.P0"+1). .0 iJIpCy). H ' p ( ; + l)
Equations (41.12) and (41.13) can now be combined in a single system model
which describes the drift in both parameters:
10 10" Xi
0 10 1
F= andx =
0 0 10 a
0 0 0 1 .P.
F describes how the system state changes from time tj to tj^^.
594
For the time invariant calibration model discussed in Section 41.2, eq. (41.14)
reduces to:
"1 OTA:,
•^iO')1
0 lJ[jC2lU)}
41.5.1 Theory
TABLE41.10
Kalman filter algorithm equations
Initialisation: x(0), P(0)
State extrapolation: system equation
x(/!/- 1) = F(/)xO- 11/'- 1) + wO) (41.15)
Covariance extrapolation
P(/V- 1) = F(/-)P(/- II7- 1)F''(/) + Q 0 - 1) (41.16)
Covariance update
P(/'iy) = P(/"iy - 1) - mh'ijWiiy -1) (41.17)
State update
x(J\J) = ^(J\J-^) + mKz(j)-h'(j)x(j\j- D) (41.19)
595
update the state x(j) (eq. (41.19) in Table 41.10). In order to make a distinction
between state extrapolations by the system equation and state updates when
making a new observation, a double index (j\j - 1) is used. The index (j\j - 1)
indicates the best estimates at time tj, based on measurements obtained up to and
including those obtained at point tj^^.
Equations (41.15) and (41.19) for the extrapolation and update of system states
form the so-called state-space model. The solution of the state-space model has
been derived by Kalman and is known as the Kalman filter. Assumptions are that
the measurement noise v(j) and the system noise w(;) are random and independent,
normally distributed, white and uncorrected. This leads to the general formulation
of a Kalman filter given in Table 41.10. Equations (41.15) and (41.19) account for
the time dependence of the system. Eq. (41.15) is the system equation which tells
us how the system behaves in time (here inj units). Equation (41.16) expresses
how the uncertainty in the system state grows as a function of time (here inj units)
if no observations would be made. Q(/ - 1) is the variance-covariance matrix of
the system noise which contains the variance of w.
The algorithm is initialized in the same way as for a time-invariant system. The
sequence of the estimations is as follows:
Cycle 1
1. Obtain initial values for x(OIO), k(0) and P(OIO)
2. Extrapolate the system state (eq. (41.15)) to x(llO)
3. Calculate the associated uncertainty (eq. (41.16)) P( 110)
4. Perform measurement no. 1
5. Update the gain vector k(l) (eq. (41.18)) using P(IIO)
6. Update the estimate x(llO) to x(lll) (eq. (41.19)) and the associated
uncertainty P(lll) (eq. (41.17))
Cycle 2
1. Extrapolate the estimate of the system state to x(2ll) and the associated
uncertainty P(2I1)
2. Peform measurement no. 2
3. Update the gain vector k(2) using P(2I1)
4. Update (filter) the estimate of the system state to x(2l2) and the associated
uncertainty P(2I2)
and so on.
In the next section, this cycle is demonstrated on the kinetics example intro-
duced in Sections 41.1 and 41.4.
Time-invariant systems can also be solved by the equations given in Table
41.10. In that case, F in eq. (41.15) is substituted by the identity matrix. The system
state, x(/), of time-invariant systems converges to a constant value after a few
cycles of the filter, as was observed in the calibration example. The system state,
596
"c ik 1?
(0 I i
•G
(Q
g> 1
O
C
o 08
(Q
C
8 0.6
oC
o
0.4
0.2
o o o o
0 _l I I l_ _J I \ L_
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
• time
Fig. 41.7. Concentrations of the reactant A (reaction A ^ B) as a function of time (dotted line) (CA = 1,
CB = 0); • state updates (after a new measurement), O state extrapolations to the next measurement
(see Table 41.11 for Kalman filter settings).
597
TABLE41.il
The prediction of concentrations and reaction rate constant by a Kalman filter; (JCJCO) =l,k^=0.l min-^
')State extrapolations in italic are obtained by substituting the last estimate of k^ equal to -\n{x(j\j))/t in the
system equation (41.15).
dashed line in Fig. 41.7 shows the true evolution of the concentration of the
reactant A. For a matter of simplicity we have not included the covariance
extrapolation (eq. (41.16)) in the calculations. Although more cycles are needed
for a convincing demonstration that a Kalman filter can follow time varying states,
the example clearly shows its principle.
598
Before we can apply an adaptive filter, we should define a criterion to judge the
validity of the model to describe the measurements. Such a criterion can be based
on the innovation defined in Section 41.2. The concept of innovation, /, has been
introduced as a measure of how well the filter predicts new observations:
As explained before, the second term in the above equation is the variance of the
response, £(/), predicted by the model at the value h(/) of the independent variable,
given the uncertainty in the regression parameters P(/ - 1) obtained so far. This
equation reflects the fact that the fluctuations of the innovation are larger at the
beginning of the filtering procedure, when the states are not well known, and
converge to standard deviation of the experimental error when the confidence in
the estimated parameters becomes high (small P). Therefore, it is more difficult to
detect model errors at the beginning of the estimation sequence than later on.
Rejection and acceptance of a new measurement is then based on the following
criterion:
if |/0)|>3a,(/): reject
otherwise: accept
Adaptation of the Kalman filter may then simply consist of ignoring the rejected
observations without affecting the parameter estimates and covariances. When
using eq. (41.21) a complication arises at the beginning of the recursive estimation
because the value of P depends on the initially chosen values P(0) and thus is not a
good measure for the uncertainty in the parameters. When large values are chosen
for the diagonal terms of P in order to speed up the convergence (high P means
large gain factor k and thus large influence of the last observation), eq. (41.21)
overestimates the variance of the innovation, until P becomes independent of P(0).
For short runs one can evaluate the sequence of the innovation and look for regions
with significantly larger values, or compare the innovation with r(j). By way of
illustration we apply the latter procedure to solve the multicomponent system
discussed in Section 41.3 after adding an interfering compound which augments
the absorbance at 26 10^ cm"^ with 0.01 au and at 28 10^ cm"^ with 0.015 au. First
we apply the non-adaptive Kalman filter to all measurements. The estimation then
proceeds as shown in Table 41.12.
The above example illustrates the self adaptive capacity of the Kalman filter.
The large interferences introduced at the wavelengths 26 and 28 10^ cm~^ have not
really influenced the end result. At wavelengths 26 and 28 10^ cm"^ the innovation
is large due to the interferent. At 30 10^ cm"^ the innovation is high because the
concentration estimates obtained in the foregoing step are poor. However, the
observation at 30 10^ cm~^ is unaffected by which the concentration estimates are
restored within the true value. In contrast, the OLS estimates obtained for the
above example are inaccurate (jCj = 0.148 and X2 = 0.217) demonstrating the
sensitivity of OLS for model errors.
601
TABLE 41.12
Non-adaptive Kalman filter with interference at 26 and 28 10-3 cm-^ (see Table 41.5 for the starting condi-
tions)
0 0 0
1 22 0.054 0.196 0.0341 0 0.0341
2 24 0.072 0.2004 0.0429 0.0414 0.0015
3 26 0.169 0.211 0.0435 0.0331 0.0104
4 28 0.313 0.204 0.0267 0.0158 0.0110
5 30 0.099 0.215 0.0110 0.0409 -0.030
6 32 0.098 0.215 0.0080 0.0082 -0.0002
The estimation sequence when the Kalman filter is adapted is given in Table
41.13. This illustrates well the adaptive procedure, which has been followed. At 26
10-^ cm~^ the new measurement is 0.0104 absorbance units higher than expected
from the last available concentration estimates (x^ = 0.072 and X2 = 0.2004). This
deviation is clearly larger than the value 0.005 expected from the measurement
noise. Therefore, the observation is disregarded and the next measurement at 28
10^ cm~^ is predicted with the last accepted concentration estimates. Again, the
difference between predicted and measured absorbance (= innovation) cannot be
explained from the noise and the observation is disregarded as well. At 30 10^ cm"^
the predicted absorbance using the concentration estimates from the second step is
back within the expectations, P and k can be updated leading to new concentration
estimates x^ = 0.098 and X2 = 0.1995. Thereafter, the estimation process is
continued in the normal way. The effect of this whole procedure is that the two
measurements corrupted by the presence of an interferent have been eliminated
after which the measurement-filtering process is continued.
41.7 Applications
One of the earliest applications of the Kalman filter in analytical chemistry was
multicomponent analysis by UV-Vis spectrometry of time and wavelength inde-
pendent concentrations, which was discussed by several authors [7-10]. Initially,
the spectral range was scanned in the upward and downward mode, but later on
602
TABLE 41.13
Adaptive Kalman filter with interference at 26 and 28 cm-' (see Table 41.5 for the starting conditions)
0 0 0
1 22 0.054 0.196 0.0341 0 0.0341
2 24 0.072 0.2004 0.0429 0.0414 0.0015
3 26 - - 0.0435 0.0331 0.0104
4 28 - - 0.0267 0.0100 0.0166
5 30 0.0980 0.1995 0.0110 0.00814 -0.002
6 32 0.0979 0.1995 0.0080 0.00801 -0.00001
optimal sequences were derived for faster convergence to the result [3]. The
measurement model can be adapted to include contributions from linear drift or
background [6,11]. This requires an accurate model for the background or drift. If
the background response is not precisely described by the model the Kalman filter
fails to estimate accurate concentrations. Rutan [12] applied an adaptive Kalman
filter in these instances with success. In HPLC with diode array detection, three-
dimensional data are available. The processing of such data by multivariate
statistics has been the subject of many chemometric studies, which are discussed in
Chapter 34. Under the restriction that the spectra of the analytes should be
available, accurate concentrations can be obtained by Kalman filtering in the
presence of unknown interferences [13].
One of the earliest reports of a Kalman filter which includes a system equation is
due to Seelig and Blaunt [14] in the area of anodic stripping voltametry. Five
system states — potential, concentration, potential sweep rate and the slope and
intercept of the non-Faradaic current — were predicted from a measurement
model based on the measurement of potential current. Later on the same approach
was applied in polarography. Similar to spectroscopic and chromatographic appli-
cations, overlapping voltamograms can be resolved by a Kalman filter [15]. A vast
amount of applications of Kalman filters in kinetic analysis has been reported
[16,17] and the performance has been compared with conventional non-linear
regression. In most cases the accuracy and precision of the results obtained from
the two methods were comparable. The Kalman filter is specifically superior for
detecting and correcting model errors.
603
References
12. S.C. Rutan, E. Bouveresse, K.N. Andrew, P.J. Worsfold and D.L. Massart, Correction for drift
in multivariate systems using the Kalman filter. Chemom. Intell. Lab. Syst., 35 (1996)
199-211.
13. J. Chen and S.C. Rutan, Identification and quantification of overlapped peaks in liquid chroma-
tography with UV diode array detection using an adaptive Kalman filter. Anal. Chim. Acta,
335(1996) 1-10.
14. P. Seelig and H. Blount, Kalman Filter applied to Anodic Stripping Voltametry: Theory. Anal.
Chem., 48 (1976) 252-258.
15. C.A. Scolari and S.D. Brown, Multicomponent determination in flow-injection systems with
square-wave voltammetric detection using the Kalman filter. Anal. Chim. Acta, 178 (1985)
239-246.
16. B.M. Quencer, Multicomponent kinetic determinations with the extended Kalman filter. Diss.
Abstr. Int. B 54 (1994) 5121-5122.
17. M. Gui and S.C. Rutan, Determination of initial concentration of analyte by kinetic detection of
the intermediate product in consecutive first-order reactions using an extended Kalman filter.
Anal. Chim. Acta, 66 (1994) 1513-1519.
18. P.C. Thijssen, L.T.M. Prop, G. Kateman and H.C. Smit, A Kalman filter for caHbration, evalu-
ation of unknown samples and quality control in drifting systems. Part 4. Flow Injection Analy-
sis. Anal. Chim. Acta, 174 (1985) 27-40.
19. P.C. Thijssen, N.J.M.L. Janssen, G. Kateman and H.C. Smit, Kalman filter applied to setpoint
control in continuous titrations. Anal. Chim. Acta, 177 (1985) 57-69.
Chapter 42
42.1 An overview
Ackoff and Sasieni [1] defined operations research (OR) as "the application of
scientific method by interdisciplinary teams to problems involving the control of
organized (man-machine) systems so as to provide solutions which best serve the
purposes of the organization as a whole".
Operations research consists of a collection of mathematical techniques. Some
of these are linear programming, integer programming, queuing theory, dynamic
programming, graph theory, game theory, multicriteria decision making, and
simulation. They often are optimization techniques and are characterized by their
combinatorial character: their aim is to find an optimal combination. Typical
problems that can be solved are:
(1) allocation,
(2) inventory,
(3) replacement,
(4) queuing,
(5) sequencing and combination,
(6) routing,
(7) competition, and
(8) search.
Several, but not all, of these mathematical methods (e.g. multicriteria decision
making. Chapter 26) or problems (the non-hierarchical clustering methods of
Chapter 30, which can be treated as allocation models) have been treated earlier. In
this chapter, we will briefly discuss the methods that are relevant to chemo-
metricians and have not been treated in earlier chapters yet.
Suppose that a manufacturer prepares a food product by adding two oils (A and
B) of different sources to other ingredients. His purpose is to optimize the quality
606
of the product and at the same time minimize cost. The quality parameters are the
amount of vitamin A (y^) and the amount of polyunsaturated fatty acids (y2). The
cost of a unit amount of oil A is 40, that of B is 25. In this introductory example, we
will suppose that the cost of the other ingredients is negligible and does not have to
be taken into account. Moreover, we suppose that the volume remains constant by
adaptation of the other ingredients. The cost or objective function to be minimized
is therefore
Z = 4 0 J C , + 25JC2 (42.1)
where jc j is the amount in grams of oil A per litre of product and X2 the amount of oil
B. An optimal product contains at least 65 vitamin A units and 40 polyunsaturated
units per litre (all numbers in this section have been chosen for mathematical
convenience and are not related to real values). More vitamin A or polyunsaturated
fatty acids are not considered to have added benefit. Suppose now that oil A
contains 30 units of vitamin A per litre and 10 units of polyunsaturated fatty acids
and oil B contains 15 units of vitamin A and 25 units of polyunsaturated fatty acid.
One can then write the following set of constraints.
y^ = 30x^ + 15A:2>65
The line y^=65 = SOx, + 15JC2 is shown in Fig. 42.1. All points on that line or
above satisfy the constraint 30JCJ + 15JC2 ^ 65. Similarly, all points lying above line
J2 = IOJC, + 25JC2 = 40 satisfy the second constraint of eq. (42.2), while the last
constraints limit the acceptable solutions to positive or zero values for x^ and X2.
The acceptable region is the shaded area of Fig. 42.1.
We can now determine which pairs of (JC,, X2) values yield a particular z. In Fig.
42.1 line z, shows all values for which z = 50. These (JC,, X2) do not belong to the
acceptable area. However, we can now draw parallel lines until we meet the
acceptable area. This happens in point B with line 12- The coordinates of this point
are obtained by solving the set of simultaneous equations
30JC, + 15JC2 = 6 5 l
IOJCJ + 2 5 J C 2 =40J
Let us look at another example (Fig. 42.2) [2]. A laboratory must carry out
routine determinations of a certain substance and uses two methods, A and B, to do
this. With method A, one technician can carry out 10 determinations per day, with
method B 20 determinations per day. There are only 3 instruments available for
method B and there are 5 technicians in the laboratory. The first method needs no
sophisticated instruments and is cheaper. It costs 100 units per determination while
method B costs 300 units per determination. The available daily budget is 14000
units. How should the technicians be divided over the two available methods, so
that as many determinations as possible are carried out? Let the number of
technicians working with method A be x^ and with method B ^2, and the total
number of determinations z; then, the objective function to be maximized is given
by:
^2 = X, + ^2 < 5
solution is not feasible and one should then apply a related method, called integer
programming in which only integer values are allowed for the variable x^. When
only binary values are allowed, this is called binary programming.
Problems resembling the first example, but much more complex, are often
studied in industry. For instance in the agro-food industry linear programming is a
current tool to optimize the blending of raw materials (e.g. oils) in order to obtain
the wanted composition (amount of saturated, monounsaturated and polyunsaturated
fatty acids) or property of the final product at the best possible price. Here linear
programming is repeatedly applied each time when the price of raw materials is
adapted by changing markets.
Integer programming has been applied by De Vries [3] (a short English-
language description can be found in [2]) for the determination of the optimal
configuration of equipment in a clinical laboratory and by De Clercq et al. [4] for
the selection of optimal probes for GLC. From a data set with retention indices for
68 substances on 25 columns, sets of p probes (substances) (/? = 1, 2,..., 20) were
selected, such that the probes allow to obtain the best characterization of the
columns. This type of application would nowadays probably be carried out with
genetic algorithms (see Chapter 27).
The fact that only linear objective functions are possible, limits the applicability
of the methodology in chemometrics. Quadratic or non-linear programming are
possible however. The former has been applied in the agro-food industry for the
determination of the composition of an unknown fat blend from its fatty acid
profile and the fatty acid profiles of all possible pure oils [5]. The solution of this
problem is searched under constraints of the number of oils allowed in the solution,
a minimal or maximal content, or a content range. This problem can be solved by
quadratic programming. The objective function is to minimize the squared differ-
ences between the calculated and actual fatty acid composition of the oil blend. An
attractive feature of the programming approach to solve this type of problems is
that it provides several solutions in a decreasing order of value of the objective
function. All these methods together are also known as mathematical programming.
In several chapters we discussed how the quality of the analytical result defines
the amount of information which is obtained on a sampled system. Obvious quality
criteria are accuracy and precision. An equally important criterion is the analysis
time. This is particularly true when dynamic systems are analyzed. For instance a
relationship exists between the measurability and the sampling rate, analysis time
and precision (see Chapter 20). The monitoring of environmental and chemical
processes are typical examples where the management of the analysis time is
610
important. In this chapter we will focus on the analysis time. The time between the
arrival of the sample and the reporting of the analytical result is usually sub-
stantially longer than the net analysis time. Delays may be caused by a congestion
of the laboratory or analysis station or by managerial policies, e.g. priorities
between samples and waiting until a batch of a certain size is available for analysis.
A branch of Operations Research is the study of queues and the influence of
scheduling policies on the formation of queues. Queues in waiting rooms, for
instance, or the occupation of beds in hospitals, telephone and computer networks
have been extensively studied by queuing theory. On the contrary only a few
studies have been conducted on queues and delays in analytical laboratories
[6-10]. Despite the fact that several operational parameters can be registered with
modem Laboratory Information Management Systems (LIMS), laboratory activ-
ities are apparently too complex to be described by models from queuing theory.
For the same reason the alternative approach by simulation (see Section 42.4) is
not a real management tool for decision support in analytical laboratory manage-
ment. On the other hand simulation techniques proved to be a useful tool for the
scheduling of robots.
In this section waiting and queues are discussed in order to provide some basic
understanding of general queuing behaviour, in particular in analytical lab-
oratories. This should allow a qualitative forecast of the effect of managerial
decisions.
No queues would be formed if no new samples are submitted during the time
that the analyst is busy with the analysis of the previous sample. If all analysis
times were equally long and if each new sample arrived exactly after the previous
analysis is finished, the analytical facility could be utilized up to 100%. On the
other hand, if samples always arrive before the analysis of the previous sample is
completed, more samples arrive than can be analyzed, causing the queue to grow
indefinitely long. Mathematically this means that:
with AZq the number of samples waiting in queue, w the waiting time in queue, X the
number of samples submitted per unit of time (e.g. a day), and \i the number of
samples which can be analyzed per unit of time.
611
Because
X = l/lAT, wherelAT is the mean interarrival time, and
L
| L = 1 / AT, where AT is the mean analysis time
:^/|i = Af/iAf = p
p is called the utilization factor of the facility or service station.
In reality, the queue size {n^ and waiting time (w) do not behave as a zero-
infinity step function at p = 1. Also at lower utilization factors (p < 1) queues are
formed. This queuing is caused by the fact that when analysis times and arrival
times are distributed around a mean value, incidently a new sample may arrive
before the previous analysis is finished. Moreover, the queue length behaves as a
time series which fluctuates about a mean value with a certain standard deviation.
For instance, the average lengths of the queues formed in a particular laboratory for
spectroscopic analysis by IR, ^H NMR, MS and ^^C NMR are respectively 12, 39,
14 and 17 samples and the sample queues are Gaussian distributed (see Fig. 42.3).
This is caused by the fluctuations in both the arrivals of the samples and the
analysis times.
According to the queuing theory the average waiting time (w) exponentially
grows with increasing utilization factor (p) and asymptotically approaches infinity
when p goes to 100% (see Fig. 42.4). Figure 42.4 shows the waiting time for the
simplest queuing system, consisting of one server, independent analysis and
arrival processes, Poisson distributed (see Section 15.3) arrivals (number of
samples per day) and exponentially distributed analysis times. In the jargon of
queuing theory such a system is denoted by M/M/1 where the two Ms indicate the
arrival and analysis processes respectively (M = Markov process) and ' 1' is the
number of servers.
The number of arrivals follows a Poisson distribution when samples are sub-
mitted independently from each other, which is generally valid when the samples
are submitted by several customers. The probability of n arrivals in a time interval t
is given by:
P M ^ ' - ^ ^ (42.5)
where Xt is the average number of arrivals during the interval t. For instance the
number of samples (samples/day) submitted to the spectroscopic department men-
tioned earher [9] can be modelled by Poisson distributions with the means 2.8,7.7,
2.06 and 2.5 samples per day (Fig. 42.5).
612
20 60
Fig. 42.3. Time series of the observed queue lengths (n) in a department for structural analysis, with
their corresponding histograms fitted with a Gaussian distribution.
613
Fig. 42.4. The ratio between the average waiting time (w) and the average analysis time (AT) as a func-
tion of the utilization factor (p) for a system with exponentially distributed interarrival times and anal-
ysis times (M/M/1 system).
1
IR HNMR
t 20
c
^^ 15
a.
10
0 1 2 3 4
n^ 5 7
»• n
8
lii
0 2 4 6
I I I
10 12 14 16 18
13
MS CO 25 r CNMR
m
> t> 20 V
03
CO
c:
- • . - ^
15 [
tl.
0 1 2 3
L
4 5 6 7 8 0 1 2 3 4 5
L^ 6 7 8
Fig. 42.5. The distribution of the probability that n samples arrive per day, observed in a department
for structural analysis. (I) observed (•) Poisson distribution with mean [i.
614
- the average waiting time (vv) in queue (excluding the analysis time of the sam-
ple):
w = ATp/(l-p):
^(1-p)
Queuing models also describe the distribution of the waiting times though only
for relatively simple queuing systems. Waiting times in an M/M/1 system are
exponentially distributed. The probability of a waiting time shorter than a given
Wj^gx is given by (see Fig. 42.6):
,/AT
P ( w < w ^ ^ ) = l - p e -^iVVm = l - p e -P^mj^
It means that a large part of the samples (65% for p = 0.7) waits for less than the
mean waiting time (w^^ = vv). On the other hand, there is a significant probability
(35% for p = 0.7) that a sample has to wait longer than this average. It also means
that when the laboratory management wants to guarantee a certain maximal turn
around time (e.g. 95% of the samples within w^^J, the mean waiting time should
be 27% of w^3^ (p = 0.7). Figure 42.7 shows the waiting time distributions of the
samples in the IR, *H NMR, ^^C NMR and MS departments mentioned before.
The probability of finding k samples in a queue is:
t/AT
Fig. 42.6. Probability that the waiting time is smaller than t (t given in units relative to the average anal-
ysis time).
615
cumulative % IP cumulativ
%
100 100
1 • <»• y/^
80
15h| /
A 60
i
lOhl 1
40
20
0 2 4 6 8 10 12 14 16 18 20 0 2
lllllir
4 6 8 10 12 14 16 18 20
Waiting time (days) Waiting time (days)
13
% C NMR cumulative % MS cumulative %
20 ( ^ _ _ ^ - • 1100 100
15
10
r*La
0 2 4 6 8 10 12 14 16 18 20
Waiting time (days)
i
0 2 4
1
lllh
bl
llllllfrrr^f*..
6 8 10 12 14 16 18 20
Waiting time (days)
Fig. 42.7. Histograms and cumulative distributions of the delays (waiting time + analysis time) in a de-
partment for structural analysis. (I) Observed values. (•) Cumulative distribution. (•) Fit witli a theo-
retical model (not discussed in this chapter).
r(T) 1 1 . ..
A 08
\ HNMR r(T) 1 IR
A 08
T 06
\ ' 06
04
0.2
0
^ 0 ^---^^^ ^ ^
-0 2
-0.2
-0.4
0 2 4 6 8 10 12 14 16 18 20
-0.4
0 2 4 6 8 10 12 14 16 18 20
— ^ ^ T (days)
— ^ - T(days)
\ /
-0.2
-0.4
0 2 4 6
\
8
v.^,
10 12
^-'-^
14
y
16 18 20
0
-0.2
-0 4 «
0 2 4 6 8 10 12 14 16 18 20
— ^ - T (days) ^^ t (days)
Fig. 42.8. Autocorrelograms of the queue lengths observed in a department for structural analysis.
TABLE 42.1
Classification of queuing systems
1. Single arrivals [11] 1. Analysis time (single) [11] 1. Queue discipline [11]
a. uniform a. uniform a. first-in-first-out (FIFO)
b. Poisson b. exponential b. last-in-first-out (LIFO)
c. other distribution c. other distribution c. priority to easy samples [10]
2. Batch arrivals [12] d. interrupted (illness, failure) [9] (expected analysis time)
a. uniform e. depends of batchsize d. random
b. Poisson 2. Batch size [12] 2. Priority rules [9]
c. other a. fixed a. absolute classes
3. Batch size [12] b. distributed b. depends on waiting
a. fixed c. minimal size 3. Overhead scheduling [9]
b. distribution a. random interruption
4. No arrival if n > n^Ht b. scheduled interruption
c. interruption ifn< w^rit
fact that queues are never empty does not necessarily indicate an oversaturated
system.
The overview given in Table 42.1 demonstrates that queues and waiting can
only be studied by queuing theory in a limited number of cases. Specifically the
queuing systems that are of interest to the analytical chemist are too complex.
However the behaviour of simple queuing systems provides a good qualitative
insight in the queuing processes occurring in the laboratory. An alternative approach
which has been applied extensively in other fields, is to simulate queuing systems
and to support the decision making by the simulation of the effect of the decision.
This is the subject of Section 42.4. Before this, we will summarize a number of
rules of thumb relevant to the laboratory manager, who may control the queuing
process by controlling the input, the analytical process and the resources.
(1) Input control: Maximum delays may be controlled by monitoring the work
load (Wq AT). When n^ AT > w^^^, or when n^ exceeds a critical value (n^rit)'
customers are requested to refrain from submitting samples. Too high a frequency
of such warnings indicates insufficient resources to achieve the desired Wj^^x-
618
In Section 42.2 we have discussed that queuing theory may provide a good
qualitative picture of the behaviour of queues in an analytical laboratory. However
the analytical process is too complex to obtain good quantitative predictions. As
this was also true for queuing problems in other fields, another branch of Operations
Research, called Discrete Event Simulation emerged. The basic principle of
discrete event simulation is to generate sample arrivals. Each sample is chara-
cterized by a number of descriptors, e.g. one of those descriptors is the analysis
time. In the jargon of simulation software, a sample is an object, with a number of
attributes (e.g. analysis time) and associated values (e.g. 30 min). Other objects are
e.g. instruments and analysts. A possible attribute is a list of the analytical
619
procedures which can be carried out by the analyst or instrument. The more one
wants to describe the reaUty in detail, the more attribute-value pairs are added to
the objects. For example, if one wants to include down times of the instrument or
absence due to illness, such attributes have to be added to the object.
An event takes place when the state of the laboratory changes. Examples of
events are:
- a sample arrival: this introduces a state change because the sample joins the
queue;
- an analysis is ready: this introduces a state change because the instrument and
analyst become idle.
With each event a number of actions is associated. For example when a sample
arrives, the following actions are taken:
- if instrument and analyst are idle and if all other conditions are met (batch
size), start the analysis. This implies that the next event 'analysis is ready' is
generated, the status of the instrument and analyst is switched to 'busy'. Gen-
erate the time of next arrival.
- otherwise: register the arrival time in the queue; augment the queue size by
one. Generate the time of the next arrival.
As events generate other events, the simulation keeps going from event to event
until some terminating conditions are met (e.g. the end of the simulation time, or
the maximum number of samples has been generated). As one can see a specific
programming environment, called object oriented programming [17], is required
to develop a simulation model, consisting of object-attribute-value (O-A-V) triplets
and rules (see also Section 43.4.2). The little research that has been conducted on
the simulation of laboratory systems [9,10,13-15] was primarily focused on the
demonstration that it is possible to develop a validated simulation model that
exhibits the same behaviour in terms of queues, delays as in reality. Next, such a
validated model is interrogated with the question "What if?". For instance, what if:
- priorities are changed?
- resources are modified?
- minimal batch size is increased?
In Fig. 42.9 we show the simulation results obtained by Janse [8] for a municipal
laboratory for the quality assurance of drinking water. Simulated delays are in
good agreement with the real delays in the laboratory. Unfortunately, the develop-
ment of this simulation model took several man years which is prohibitive for a
widespread application. Therefore one needs a simulator (or empty shell) with
predefined objects and rules by which a laboratory manager would be capable to
develop a specific model of his laboratory. Ideally such a simulator should be
linked to or be integrated with the laboratory information management system in
order to extract directly the attribute values.
620
i*^ ^
•^ "w"
o
c3
c\ ^-^
;: E o | B
O
<D
to| ^
rr^ -o
c
« 8 :^ 2
CO
o_c5
^
CD
o
x^
OS
^?
T- n u^
Si
CO a bO
C
;g
:i ^
(N •c
73
tiJO
CM
x:
J
T3
CM S,
CD
CO
-Q
O
JO)
O
621
Step 1
There are two possibilities, namely:
(a) one can elute one element and retain the other two on the column. This leads
to nodes AB//C, AC//B, BC //A; or
(b) one can elute two elements and retain the other. This leads to nodes A//BC,
B//ACandC//AB.
Step 2
(a) Following step la, one of the two remaining ions is eluted. For example, if in
the first step A was eluted, one now elutes B or C. In step 2a one can reach the
situations A//B/C, B//A/C or C//A/B.
(b) Following step lb, two ions are eluted together and are therefore not
separated; they have to be adsorbed first on another column. In the mean time, the
single ion, left on the first column, can be eluted. Two ions are then adsorbed on a
623
M//BC A//B/C
a8^
//A/B/C
a 11
column and one is eluted. The different possibilities are AB//C, AC//B, BC//A, i.e.
the situations reached also after step la. From there one proceeds to step 2a.
Step 3
This follows step 2a. Only one ion remains on the column. It is now eluted, so
that the terminal node is reached.
These different possibilities and their relationships can be depicted as a directed
graph (Fig. 42.10). Directed graphs are graphs in which each edge has a specific
direction. To find the shortest path one has to give values to the edges of the graph.
As the problem is to find the procedure that permits to carry out the separation in
the shortest time possible, these values should be the times necessary to carry out
the steps symbolized by the links in the graph or a value proportional to the time.
We shall not detail the manner in which these times were derived. Essentially,
there are three possibilities:
(a) If the separation depicted by a particular link is possible, the time is
considered to be equal to the distribution coefficient of the ion which is the slowest
to be eluted in this step (the distribution coefficient as defined in ion exchange
chromatography is proportional to the elution time).
(b) If the separation depicted by a particular link is not possible, a very high
value (900) is given to that link.
(c) If the link contains a transfer from one column to another, a value corres-
ponding to the estimated time needed for this transfer is given.
624
One may question whether the application of graph theory is really necessary as
no doubt a separation such as that described above can be investigated easily
without it. However, when more ions are to be separated, the number of nodes
grows very rapidly.
The calculation of the number of nodes is rather complicated. It is simpler in the
special case where transfers from one column to another are not allowed (all ions
eluted from the column must be completely separated from all the other ions). In
this instance, and for three elements, the following nodes should be considered:
ABC//, A//B/C, AB//C, AC//B, BC//A, B//A/C, C//A/B, and //A/B/C. If one
considers only the stationary phase (to the left of //), one notes that all the
combinations of zero, one, two, and three ions out of three are present.
Calling n the total number of ions and p (0 < p < n) the number of ions in a
particular combination taken from these n, the total number of combinations is
equal to
p=0
where C^ is the symbol used for the number of combinations ofp elements out of a
set of A2. This can be shown to be equal to 2^. In this particular instance, a separation
scheme for eight elements would contain 256 nodes. For the general case (transfers
allowed), no less than 17008 nodes would have to be considered! Even for 4
elements, 38 nodes are obtained and it becomes difficult to consider all of the
possibilities without using graph theory. The shortest (cheapest) path in a graph
can be found, for example, with a very simple algorithm. Let us suppose that one
has to construct a highway from town a^ to a town a^ (Fig. 42.10). There are
several possible layouts, which are determined by the towns through which one
must pass, and these must be selected from ^2 to a^Q. The values of the links in the
resulting graph are given by the estimated costs. The value 900 is a way to indicate
that this link is impossible or undesirable. The problem is, of course, to find the
cheapest route. A value A | = 0 is assigned to town (node) a, and the value of all the
nodes a^ directly linked to a j is computed by using the equation A^ = Aj + /(a,, a J
where /(a,, a^) is the length of edge (^j, aj. In this way, we assign the values 3, 5,
900, 0, 900 and 900 to the nodes <22, a^, a^, a^, a^ and a-^, respectively. This
procedure is repeated for the nodes a^ linked directly to one of the nodes a^ by
using the equation A^ = A^ + /(a„, a^. We continue to do this until a value has been
assigned to each of the nodes in the graph. We would, for example, assign the
values 15, 1800, 900 and 902 to ag, a^, a^^ and a^, respectively.
In the first stage, we assigned a possible value to all of the nodes, but not
necessarily the lowest possible value. For example, the value A-j of node a^ is now
900. The value 900 is artificially high, indicating that for practical reasons it is
625
References
1. R.I. Ackoff and M.W. Sasieni, Fundamentals of Operations Research. Wiley, New York, 1968.
2. D.L, Massart, A. Dijkstra and L. Kaufman, Evaluation and Optimization of Laboratory
Methods and Analytical Procedures. Elsevier, Amsterdam, 1978.
3. T. De Vries, Het klinisch-chemisch laboratorium in economisch perspectief. Stenfert-Kroese,
Leiden, 1974.
4. H. De Clercq, M. Despontin, L. Kaufman and D.L. Massart, The selection of representative
substances by operations research techniques. J. Chromatogr., 122 (1976) 535-551.
5. S. de Jong and Th.J. R. de Jonge, Computer assisted fat blend recognition using regression
analysis and mathematical programming. Fat Sci. Technol., 93 (1991) 532-536.
6. I.K. Vaananen, S. Kivirikko, J. Koskennieni, J. Koskimies, A. Relander, Laboratory Simula-
tor: a study of laboratory activities by a simulation method. Meth. Inform. Med., 13 (1974)
158-168.
7. B. Schmidt, Rechnergesteuerte Strategien zur Probenverteilung in einem klinisch-chemischen
Labo. Z. Anal. Chem., 287 (1977) 157-160.
8. T. A.H.M. Janse, G. Kateman, Enhancement of the performance of analytical laboratories by a
digital simulation approach. Anal. Chim. Acta, 159 (1984), 181-189.
9. B.G.M. Vandeginste, Strategies in molecular spectroscopic analysis with application of
queueing theory and digital simulation. Anal. Chim. Acta, 112 (1979) 253-275.
626
10. B.G.M. Vandeginste, Digital simulation of the effect of dispatching rules on the performance
of a routine laboratory for structural analysis. Anal. Chim. Acta, 122 (1980) 435-454.
11. L. Kleinrock, Queueing Systems, Vol. I, Theory. Wiley, New York, 1976.
12. J.G. Vollenbroek and B.G.M. Vandeginste, Some considerations on batch arrival and batch
analysis in analytical laboratories. Anal. Chim. Acta, 133 (1981) 85-97.
13. B. Van de Wijdeven, J. Lakeman, J. Klaessens, B. Vandeginste and G. Kateman, Digital simu-
lation as an aid to sample scheduling in a routine laboratory for liquid chromatography. Anal.
Chim. Acta, 184 (1986) 151-164.
14. J. Klaessens, T. Saris, B. Vandeginste and G. Kateman, Expert system for knowledge-based
modelling of analytical laboratories as a tool for laboratory management. J. Chemometrics, 2
(1988)49-65.
15. J. Klaessens, J. Sanders, B. Vandeginste and G. Kateman, LABGEN, expert system for knowl-
edge based modelling of analytical laboratories. Part 2, Application to a laboratory for quality
control. Anal. Chim. Acta, 222 (1989) 19-34.
16. J. Klaessens, B. Vandeginste and G. Kateman, Towards Computer-supported management of
analytical laboratories. Anal. Chim. Acta, 223 (1989) 205-221.
17. J. Klaessens, L. Van Beysterveldt, T. Saris, B. Vandeginste and G. Kateman, Labgen, expert
system for knowledge-based modelling of analytical laboratories. Part 1. Laboratory organisa-
tion. Anal. Chim. Acta, 222 (1989) 1-17.
18. G.M. Birtwisle, O.J. Dahl, B. Myhrhaug and K. Nygard, SIMULA Begin, Petrocelli/Charter,
New York, 1973.
19. C.A. Copenhaver, Knowledge Engineering Environment (KEE) from Intellicorp, Packag.
Software Rep., 15 (1989) 5-20.
20. M.J. Randic, On the recognition of identical graphs representing molecular topology. J. Chem.
Physics, 60 (1974) 3920-3928.
21. A.T. Balaban, ed.. Chemical Applications of Graph Theory, Academic Press, London, 1976.
22. V. Kvasnicka and J. Pospichal, An improved version of the constructive enumeration of mo-
lecular graphs with described sequence of valence states. Chemom. Intell. Lab. Systems, 18
(1993) 171-181.
23. L.B. Kiers and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Research
Studies Press, Letchwork, U.K., 1986.
24. D.L. Massart, C. Janssens, L. Kaufman and R. Smits, Application of the theory of graphs to the
optimalisation of chromatographic separation schemes for multicomponent samples. Anal.
Chem., 44 (1972) 2390-2399.
L. Kleinrock, Queueing Systems, Vol. II: Computer Applications. Wiley, New York, 1976.
H.M. Wagner, Principles of Operations Research, With Applications to Managerial Decisions.
Prentice/Hall, Englewood Cliffs, NJ, 1975.
S.P. Sanoff, D. Poilevey, Integrated information processing for production scheduling and control.
Comput. Integr. Manuf., 4 (1991) 164-175.
627
Chapter 43
Expert systems are computer programs that are intended to support tasks at an
expert level. What makes them different from other programs that compute
complicated formulas by means of ingenious algorithms? Certainly, these
programs are necessary and valuable tools for the expert; they make it possible to
carry out computations instantaneously and reliably. When making decisions in a
specialized area there is, however, another important aspect, namely, experience.
During the decision-making process, the expert — often without realizing it
explicitly — makes choices or excludes possibilities to reach conclusions
efficiently. His or her major guides are qualitative rules of thumb, optimized
during his/her own learning period and experience with his/her job. A simple
example will best illustrate the concept of heuristics.
Classification of French wines according to their region of origin is a problem
that can be tackled in different ways. Many studies are published based on different
pattern recognition techniques on chemical and sensory characteristics of the
wines. All these techniques perform well to very well, but they all require many
measurements. However, this strategy is totally inefficient and unnecessary when
some typical prior knowledge is available. When the sample is offered in its
original bottle, the shape and colour of the bottle contain at least part of the
required information. To reason with this kind of prior knowledge is difficult in
classical pattern recognition techniques. The inclusion of qualitative rules, also
called heuristic rules, would certainly enhance the performance of the classical
techniques.
Sometimes these heuristics are an extension of the theory and sometimes they
are experience-based rules, with no apparent theoretical justification, but they
simply work most of the time. One way of explaining human reasoning that has
influenced the expert system research is that these heuristic rules of thumb are
629
implicit in the expert's mind as a kind of decision tree that he or she is able to
consult and evaluate instantaneously. This is what defines an expert. An expert
system is then the implementation of a part of this decision tree in a computer
program. By chaining the heuristic rules, the expert system can emulate a part of
the decision capacity of the expert. Expert systems are not simulations of human
intelligence in general. They are simply practical programs that use heuristics to
solve specific problems. Such programs can obviously be useful as decision
support tools. Since they are developed to give advice at expert level, ideally they
should:
- be able to explain their own reasoning process, answer queries such as 'why'
or 'how' certain conclusions have been reached or 'why' certain conclusions
have 'not' been drawn;
- be easily modifiable; since by definition expert knowledge is dynamic, ex-
pert systems must allow easy implementation of changes;
- allow a certain degree of inexactness; since they are based on heuristics
which work most of the time, but not in 100% of the cases, a strategy to cope
with uncertainty should be available.
The last feature brings us to what is considered to be one major disadvantage of
expert systems. Since they are acting on an expert level, with a certain level of
uncertainty, evaluating them in an exact way is impossible, as is possible with
algorithmic programs. These programs should work correctly and reliably in 100%
of all cases. Test methods are available to evaluate them. Expert systems, however,
are based on heuristics, which fail occasionally. The problem is then how to
evaluate expert systems. What is acceptable? The only reasonable way is to
compare their performance with that of human experts. Even when they are
performing well, eliminating the occurrence of a failure is not possible. This is
especially a drawback when expert systems are to be integrated in an automatic
environment. One way to alleviate this problem is to include a strong
meta-knowledge component in the expert system, e.g., knowledge about its
boundaries and limitations in order to recognize cases where failures are likely to
occur.
The typical structure of an expert system is shown in Fig. 43.1. Three basic
components are present in all expert systems: the knowledge base, the inference
engine and the interaction module (user interface). The knowledge base is the heart
of the expert system. It contains the necessary expert knowledge and experience to
act as a decision support. Only if this is correct and complete enough the expert
system can produce meaningful and useful conclusions and advice. The inference
630
KNOWLEDGE
BASE
INFERENCE
ENGINE
INTERACTION
MODULE
engine contains the strategy to use this knowledge to reach a conclusion. This
separation between the knowledge in the knowledge base and the inference engine
is an important characteristic of expert systems. An advantage is that changes,
made in the knowledge base will not corrupt the active component, the inference
engine. Another spin-off of this separated architecture is the possibility of multiple
uses of the inference engine for different knowledge bases. Expert-system shells
are commercially available software programs that contain all elements of Fig.
43.1, except that the knowledge base is empty and must be filled to create specific
applications. The interaction module takes care of communication between the
user and the expert system. Explanation facilities are an important part of the
interface.
systems will serve as human decision support tools they must incorporate all the
knowledge of the domain in a natural way. The focus is changed from computer
efficiency aspects towards the programmer's or user's comfort. It is thanks to
research in this field that the focus of modem computer tools and languages is
shifted more to the user than to the machine.
In general, a representation method for expert systems should fulfil as well as
possible the following criteria:
- incorporate qualitative knowledge in a natural way,
- allow an efficient use of this knowledge by the inference engine,
- allow structuring of the knowledge,
- allow representation of meta-knowledge: knowledge about the system itself,
so that it is able to recognize its own boundaries.
The two most important representation schemes are the rule-based scheme and
the frame-based or object-oriented scheme.
In what follows we will use the semi-formal way of describing rules, since there
are multiple equivalent ways to implement them formally. The rules may consist of
one or more conditions in the IF part and one or more conclusions in the THEN
part. The collection of all the rules in the knowledge base constitutes the expertise
present in the system. Typically a medium-sized expert system contains a few
hundreds of rules. Rules allow a flexible and modular representation of this
expertise. It is important that each rule represents a single step in the reasoning
process. In this case rules can be seen as independent chunks of knowledge that can
be changed or modified easily without affecting the other rules. When the know-
ledge domain becomes complex, this condition is almost impossible to fulfil.
Larger applications require the possibility of structuring the rules in small rule sets
that are easier to manipulate.
Rules seemingly have the same format as "IF.. THEN.." statements in any other
conventional computer language. The major difference is that the latter statements
are constructed to be executed sequentially and always in the same order, whereas
expert system rules are meant as little independent pieces of knowledge. It is the
task of the inference engine to recognize the applicable rules. This may be different
in different situations. There is no preset order in which the rules must be executed.
Clarity of the rule base is an essential characteristic because it must be possible to
control and follow the system on reasoning errors. The structuring of rules into rule
sets favours comprehensibility and allows a more efficient consultation of the
system. Because of the natural resemblance to real expertise, rule-based expert
systems are the most popular. Many of the earlier developed systems are pure
rule-based systems.
Frame-like structures can be used to represent the facts, objects and concepts. In
this context frames must be interpreted as a software structure (a frame) in which
all characteristics of an object or a concept is described. The simplest example of
frame-like structures are the so-called ''object-attribute-value'' triplets. Examples
of such triplets to describe a chromatographic column are:
Column 1 - Tradename - Lichrosorb
Column 1 - Functionality - C8
Frames can be seen as structures where all relevant information about an object
or a concept is collected. As an example the relevant information about a column in
a chromatographic method can be represented by a dedicated general frame,
COLUMN. Separate columns can be represented by so-called instantiations of this
frame. Instantiations are copies of the general frame that contain the characteristics
of a specific object, in this example the separate columns.
COLUMN
Tradename: [Spherisorb, Hypersil, Lichrosorb,...]
Functionality: [C8,C18,...]
Particle size: [Real number] (|im)
Column length: [Integer number] (cm)
Batch number: [Integer number]
Internal diameter: [Real number] (mm)
COLUMN 1 COLUMN 2
Tradename: Lichrosorb Tradename: |Li-Bondapalc
Functionality: C8 Functionality: C18
Column length: 30 cm Column length: 30 cm
Particle size: 4.0 |Lim Particle size: 5.0 |Lim
Batch number: 2347645 Batch number: 3459863
Internal diameter: 4.6 mm Internal diameter: 4.0 mm
In these frames all specific columns that are relevant for the reasoning process of
the expert system can be described in a structured and comprehensive way. The
frame-based and rule-based Icnowledge representation are both required to rep-
resent expertise in a natural way. Therefore, in most expert systems a combination
of rule-based and frame-based knowledge representation is used. The rule base
together with the factual and descriptive knowledge by means, of e.g., frames
constitute the knowledge base of the expert system.
The inference engine is the active component of the expert system. It contains a
strategy to use the knowledge, present in the knowledge base, to draw conclusions.
In a rule-based expert system its major task is to recognize the applicable rules and
how they must be combined in order to derive new knowledge that eventually
leads to the conclusion. Suppose the following two rules are present in a
knowledge base:
634
This is obviously not a complete rule base and, moreover, we have over-
simplified the chemical knowledge, but for clarity we will restrict the example to
these six (over-simplified) rules. The knowledge base is summarized in Fig. 43.2.
We will simulate a forward and a backward inference engine to find the solvent
for acetylsalicylic acid. The goal-driven or backward strategy starts with a possible
solution. In our case the two possible solvents are methanol and an alkaline
solution. The inference engine searches the rule base for all rules that decide on the
solvent. Rules 1, 2 and 3 are selected on this basis. At this point the inference
engine must decide which one to investigate first. The simplest strategy is to
investigate the rules in order of appearance in the knowledge base. In our case this
means that the inference engine selects Rule 1. In more sophisticated expert
systems priority rules can be defined that decide on the order of the rules or rule
sets that must be investigated. In our example we will first investigate Rule 1. The
THEN part of this rule is valid if the compound is not polar. The IF part of this rule
becomes now a new subgoal (SGI: is the polarity of the compound = not polar?)
that is treated in a similar way. The inference engine now tries to prove this subgoal
by searching other rules that conclude in the THEN part on the polarity of a
compound. This is the case for Rules 4 and 6. Rule 4 is first selected. The IF part of
Solvent =
Solvent = Methanol
alkaline solution
Rule Rules
Stability in
Polarity alkaline solution
Fig. 43.2. Representation of a small example knowledge base to select a solvent (see text for further
explanation).
636
Rule 4 is now the new subgoal (SG2: are there functional groups present?). No
rules are found in this rule base that conclude on the presence of functional groups.
At this point the inference engine checks whether this information is available in
the factual database (e.g. the frames). In our example we assume that there is no
such factual database and the working memory is still empty. Only in this case
does the inference engine ask the user (through the interaction module) for
information about this point. When the user confirms the presence of functional
groups in acetylsalicylic acid the inference engine realizes that Rule 4 is not valid.
Rule 6 is now investigated as the second and last possibility to conclude on the
polarity of the compound. The compound is polar when a carboxylic functional
group is present. The presence of a carboxylic function becomes subgoal 3. Since
no rules or databases are available to decide on this, the user is again asked for
information on this topic. Because acetylsalicylic acid contains a carboxylic
function, the inference engine can establish the polarity of the compound: The
polarity of acetylsalicylic acid = polar. Based on this conclusion SGI must be
rejected. The inference engine is now back to Rule 1 and has to conclude that the
path followed was unsuccessful. The process just described is called backtracking.
Rule 2 can now be investigated in the same way. SG4 is now the instability of the
compound in an alkaline solution. Rule 5 is selected because it concludes in the
THEN part on the stability of compounds in an alkaline solution. The presence of
an ester function becomes SG5. Since no rules are available that conclude on this
issue, the user is again asked for information. Through the interaction module the
user now confirms the presence of an ester function in acetylsalicylic acid. This
proves SG5 which in turn establishes SG4. We are now back to Rule 2, which can
be executed. The second path turned out to be successful. The expert system has
come to a conclusion and advises methanol as a suitable solvent. In the same way
Rule 3 is investigated. The reader can verify that the inference engine can reject
this third path without further information of the user.
It is already clear from this small example that aspects such as conflict resol-
ution can have an important influence on the performance of the search strategy.
A forward chaining inference engine first searches for rules that make use of
characteristics of the molecule. This is the case for Rules 4, 5 and 6. The inference
engine checks whether the conditions of the IF parts of these rules can be proven
from the knowledge already present in the descriptive part of the knowledge base,
e.g. in the frames. In our example there is no such information. In this case the
expert system queries the user for information on functional groups, ester
functions, and carboxylic acid functions. Rule 4 is selected first (order of appear-
ance). Because there are indeed functional groups present in acetylsalicylic acid,
this rule cannot be executed and the inference engine continues with Rule 5. This
rule can be executed because the user has confirmed the presence of an ester-
function in acetylsalicylic acid. The stability in an alkaline solution has now been
637
established as: 'not stable'. The inference engine now looks for rules that use in the
IF part the stability in an alkaline solution of the compound = 'not stable'. This is
the case for Rule 2 and this rule can be executed, leading to the conclusion, namely
to advise methanol as solvent for acetylsalicylic acid. The inference engine
continues with Rule 6. Based on the information of the user that a carboxylic group
is present, the polarity of the compound can be established as 'polar'. The
inference engine looks for rules that use the fact that the compound is polar in the
IF part. This is the case for Rule 3. This rule can, however, not be executed since
the second IF statement of the rule is not true.
Both strategies show advantages and drawbacks. When most information of the
bottom line rules (here Rules 4,5 and 6) is present in the knowledge base or can be
obtained automatically, e.g. by a database search, then forward chaining is the
natural way to proceed. When, however, much information must be queried from
the user in this way, the use of the expert system becomes irritating and the system
gives the impression to ask for many facts, that are seemingly not used. Backward
chaining avoids this drawback. Information is asked only where necessary for the
reasoning process. The consultation of such a system creates a closer relationship
with the user since following the reasoning path of the system is easier, and
questions come in a logical sequence.
43.5.2.1 Inheritance
A typical feature of expert systems that support frames is inheritance. Frames
can be organized in a hierarchical structure. They can inherit properties (attributes)
from frames that are higher in the hierarchy. The latter are therefore called parent
frame and the former child frame. There are many varieties of the inheritance
principle. Frames can have only one parent frame (simple inheritance) or may have
multiple parent frames (multiple inheritance). All attributes can be inherited (full
inheritance) or only a few, selected by the knowledge engineer, may be inherited
(partial inheritance) by the child frames. An example of a simple inheritance
organization of frames is shown in Table 43.1. The frame 'Organic Compound' is
the parent frame. The frames 'Ester' and 'Acids' are child frames of 'Organic
Compound'. A typical example of inheritance is instantiation. The frame Acetic
acid is a child of 'Acids' and, since no extra attributes are added, it is also an
instantiation.
The hierarchical structure of the frames defines the relation between them. This
hierarchical structure is static, however, and cannot be easily modified. Inheritance
is therefore an elegant way to represent taxonomical relations or well-established
relations between objects. It is less suited for variables or ill-defined relations
between objects.
638
TABLE 43.1
An example of inheritance
Organic Compound
Name
Molecular formula
Molecular weight
Ester Acids
(child of 'Organic Compound') (child of 'Organic Compound')
Name Name
Molecular formula Molecular formula
Molecular weight Molecular weight
StabiHty in alkaline solution pKa value
Acetic Acid
(child of 'Acids')
Name: acetic acid
Molecular formula: C2H4O2
Molecular weight: 60
pKa value: 4.7
TABLE 43.2
An example of object-oriented programming
Procedure 'start'
Send message to SAMPLE: activate procedure 'find polarity'
Send message to SAMPLE: activate procedure 'find alkaline solution stability'
if valuel = 'yes'
and value2 = 'yes'
SOLVENT/'Alkaline solution' = 'yes'
elsif SOLVENT/'Methanol' = 'yes'
endif
Because the rules or procedures in expert systems are heuristic they are often
not well-defined in a logical sense. Nevertheless, they are used to draw conc-
lusions. A conclusion can be uncertain because the truth of the rules deriving it
cannot be established with 100% certainty or because the facts or evidence on
which the rule is based are uncertain. Some measure of reliability of the obtained
conclusions is therefore useful. There are different approaches used in expert
systems to model uncertainty. They can be divided into methods that are based on
640
Bayesian probability theory and methods that are based on fuzzy-set theory. The
principles of both theories are explained in Chapter 16 and Chapter 19, respect-
ively. Both approaches have advantages and disadvantages for the use in expert
systems and it must be emphasized that none of the methods, developed up to now
are satisfactory [7,11].
The application of Bayesian theory requires the knowledge of probabilities of
all occurring situations. This is in practice impossible. In one of the earliest
successful expert systems, MYCIN, a medical diagnosis system [12], this problem
has been circumvented by assigning two numbers to each rule, not to be interpreted
as probabilities but as likelihoods. The measure of belief (MB, reflecting the
strength of the rule) and a measure of disbelief (MD). These two measures are
combined in a certainty factor (CF). There are different ways, loosely based on
probability theory, to combine these measures when rules must be chained to
arrive at the conclusion. Instead of assigning a certainty factor to a rule or
statement, a fuzzy measure can be assigned that is based on a function, sometimes
called a credibility function in expert system terminology.
All these methods pretend to represent the intuitive way an expert deals with
uncertainty. Whether this is true remains an open question. No method has yet
been evaluated thoroughly. Modelling uncertainty to obtain a reasonable relia-
bility measure for the conclusions remains one of the major unsolved issues in
expert system technology. Therefore, it is important that in the expert system a
mechanism is provided to define its boundaries, within which it is reasonably safe
to accept the conclusions of the expert system.
The interaction module takes care of the communication with the user. It has
essentially the same function as the user interface of other software programs. All
queries to and from the expert system are passed through this module. Although it
has nothing to do with the quality of the advice, it often determines the degree of
acceptance of the system. Most expert systems provide also an interface for the
programmer and for the user. The programmer's interface can be a knowledge base
editor that allows the introduction of the rules in the correct format in an efficient
way. Sometimes an automatic debugger for syntax errors is also provided. The
possibility to retrieve or forward information from or to the user is always
provided. Expert systems can use question/answer dialogues in a natural-like
language, or by menu-driven interfaces or by graphical style interactions. The
preferred interface is mainly application-dependent.
Another essential component, different from classical software, of the inter-
action module is the explanation facility subsystem. The simplest provision is the
641
43.7 Tools
explicit. This process often leads to the conclusion that for some situations no
heuristic rules or experience are available. The awareness of these knowledge gaps
can prevent unforeseen difficulties. There may be of course also other reasons that
justify an expert system project. Obtaining experience in this kind of projects often
is a reason or it may be useful for demonstration purposes.
The next few steps are very similar to those required in any software project.
One of the first stages is the clear definition of the knowledge domain. It must be
clear which problems the expert system must solve. It is at this stage not the
intention to define how this can be done. Clarity and specificity must be the major
guides here. Fuzziness at this stage will, more than in classical software projects,
have to be paid for later when different interpretations cause misunderstandings.
Equally important is the clear definition of the end user(s). An expert system set up
as decision support tool for professionals is totally different from an expert system
that can be used as a training support for less professional people.
The next step is to make sure that the whole knowledge area is covered by
appropriate expertise. In most cases the knowledge source is a person with
extensive experience in the problem area. The expert will also be responsible for
the contents of the knowledge base and the quality of advice. Usually, literature
references will serve as a secondary source to find basic or background knowledge.
Using literature puts fewer demands on the experts. There are situations where
literature can be the main source of knowledge. The knowledge domain must then
be stable for a longer period. Still there has to be an expert who is responsible for
extracting the relevant information from the literature.
After the knowledge domain, the end users and the experts have been defined, a
choice must be made for the software environment for the development of the
expert system. In addition, the hardware for the development process as well as for
the delivery system must be identified. Real application expert systems always
require at least a mid-sized tool (see Section 43.7).
If possible, the best procedure for selecting a tool is to implement a smaller test
knowledge base similar to the final knowledge base in a few different tools. This
test knowledge base should be small enough to be implemented quickly in
different tools. On the other hand, it must be specific enough to highlight the
differences between the tools.
The knowledge base of an expert system contains the relevant knowledge about
the specific area of the expert system. For this all necessary heuristics must be
644
made explicit. Two persons play a pivotal role in this step: the expert and the
knowledge engineer. The expert is responsible for the contents of the expert
system while the knowledge engineer is responsible for the implementation of the
knowledge in the selected tools. Knowledge acquisition is the bottleneck of the
development stage. Here expert and knowledge engineer together try to unravel
the decision path of the problem solving. All the factual knowledge, the decision
rules and rules of thumb must be explicitly defined. This process is quite tedious
because experts usually have much trouble in defining why and how certain
conclusions are reached. The expert often uses many assumptions that are obvious
to him or her. Yet these have to be stated explicitly since nothing is obvious for the
computer. It is the task of the knowledge engineer to assist the expert in making all
necessary knowledge explicit. Up to now numerous knowledge acquisition
techniques have been applied, ranging from interviews of the expert by the
knowledge engineer to closely observing the expert in his daily practice. Research
is still going on to support this step and to automate it as far as possible.
The final aim is to construct a formalized representation of the decision process.
Decision trees and structured system analysis are possibilities. Some types of
expert systems can derive their own rules from examples. These are described in
Chapters 18 and 33.
43.8.4 Implementation
After knowledge acquisition has been completed, the next step is implement-
ation. Since in practice explicit knowledge emerges piecewise, it is useful to start
implementation as soon as some structured knowledge is available. In doing so, the
knowledge engineer has the opportunity to become acquainted with the tool. The
expert on the other hand uses the prototype to check whether no misunderstandings
exist. This tends to greatly enhance the confidence in the project and to reduce
misunderstandings between the knowledge engineer and the expert. An important
point that should always be borne in mind during implementation is the future
maintainability of the system.
answer, different from the expert's, may still be acceptable. Because of these facts
it is not to be expected that general test procedures will become available. The
practical question that must be answered is whether the expert system is useful in
the working environment. One procedure to test an expert system is a validation
followed by an evaluation of the expert system. Validation involves verifying
whether the system responds in the way the expert has in mind. Evaluation
involves verifying whether the expert system meets the requirements of the
intended end users. Validation can be performed with a number of carefully
selected test cases. Evaluation must be performed in real situations. The users must
apply the system in their daily work and check the practical usefulness.
43,8,6 Maintenance
43.9 Conclusion
Expert systems were investigated most intensively during the eighties, and it
appeared that the problems for their practical usability were greater than expected.
In particular, the time-consuming acquisition phase, the absence of a general test
procedure and the lack of a satisfactory method to reason with uncertain know-
ledge prohibited their real breakthrough. Another typical feature of expert systems
that limits their distribution compared with classical software is the fact that they
contain very subjective and site-specific knowledge. The utility of this kind of
knowledge on a larger scale is, of course, limited. All these problems caused a
decline in the development of expert system applications. The phrase 'expert
system' became almost taboo. Nevertheless, some expert systems have been
developed that are useful on a large scale, particularly in the medical domain. In
the chemical domain there are also a few famous expert systems such as
DENDRAL [15], while other expert systems in chemistry concern organic syn-
thesis, LHASA [1]. LHASA uses a retrosynthetic approach; starting from the
molecule to be synthesized, it gives different possibilities for how that molecule
can be synthesized in one step from smaller molecules. The user then selects one of
the possibilities and subsequently LHASA continues with the molecules involved
646
in the selected reaction. The process continues until basic starting molecules such
as, e.g., ethanol are reached. LHASA is still undergoing development to make it
more efficient to use [16]. Other applications can be found in spectroscopy,
chromatography and AAS; for an overview, see Refs [2-4,17-22].
Two trends can be observed in the last few years. First, small-scale expert
systems are being incorporated in chemical instrumentation under the name
decision-support systems. Secondly, in many companies site-specific knowledge
is being incorporated in computer programs. For example, the strategy for
validation of the analytical methods is a crucial issue, especially for accreditation
and GLP reasons. The development of a satisfactory strategy for a specific
laboratory requires much expertise. In many companies programs are being
developed that contain this strategy. While these are often programmed in
conventional programming languages, the knowledge acquisition phase is the
same as for expert systems [23].
References
1. E.J. Corey, A.K. Long and S.D. Rubenstein, Computer-assisted analysis in organic synthesis.
Science, 228 (1985) 408-418.
2. L.M.C. Buydens and P. Schoenmakers (eds.). Intelligent Software for Chemical Analysis.
Elsevier, Amsterdam, 1993.
3. M. Peris, An overview of recent expert system applications in analytical chemistry. Crit. Rev.
Anal. Chem., 26 (4) (1996) 219-237.
4. C.H. Bryant, A. Adam, D.R. Taylor and R.C. Rowe, A review of expert systems for chroma-
tography. Anal. Chim. Acta, 297 (1994) 317-347.
5. M. Cadisch and E. Pretsch, SpecTool: a knowledge-based hypermedia system for interpreting
molecular spectra. Fres. J. Anal. Chem., 344 (1992) 173-177.
6. W. Pennincks, P. Vankeerberghen, D.L. Massart and J. Smeyers-Verbeke, A knowledge-based
computer system for the detection of matrix interferences. Atomic Absorption Spectrometric
Methods, J. Anal. Atom. Spectrom. inch Atomic Absorption Spectrom. Updates, 10 (3) (1995)
207-214.
7. F. Hayes-Roth, D.A. Waterman and D.B. Lenat (eds.). Building Expert Systems. Adison-
Wesley, London, 1983.
8. R. Forsyth and R. Rada, Machine Learning: Applications in Expert Systems and Information
Retrieval. Ellis Horwood Series in Artificial Intelligence. Ellis Horwood, Wiley, Chichester,
1986.
9. P.G. Raeth, Expert systems; a software methodology for modern applications. IEEE Computer
Society Press Reprint Collection, IEEE Computer Society Press, Los Alamitos, CA, USA,
1990.
10. A. Hart, Knowledge Acquisition for Expert Systems. Kogan Page, London, 1986.
11. P. Walley, Measures of uncertainty in expert systems. Artif. Intell., 83 (1) (1996) 1-58.
12. E.H. Shortlife, Computer-based Medical Consultation: MYCIN. Elsevier, New York, 1976.
13. H.Y. Chung, I.S. Park, S.K. Hur, S.W. Cheon, H.G. Kim, S.H. Chang, Diagnostic strategies for
primary-side systems in nuclear power plants. Elect. Power Energy Syst., 14 (4) (1994)
284-297.
647
14. K.P. Adlassnig, H. Leitich and G. Kolarz, On the applicability of dignostic criteria for the diag-
nosis of rheumatoid arthritis in an expert system. Expert Syst. Applic, 13 (1) (1997) 73-79.
15. B.G. Buchanan and E.A. Feigenbaum, Dendral and Meta-Dendral; their application dimen-
sion. Artif. Intel!., 11 (1978) 5-24.
16. M.A. Ott and J.H. Noordik, Long-range strategies in the LHASA program: the Quinone
Diels-Alder transform. J. Chem. Inf. Comput. Sci., 37 (1997) 98-108.
17. P. Van Keerbergen, J. Smeyers-Verbeke and D.L. Massart, Decision support system for run
suitability checking and explorative method validation in electrothermal atomic absorption
spectrometry. J. Anal. Atomic Spectrom. incl. Atomic Spectrom. Updates, 11 (2) (1996)
149-158.
18. J.E. Matek and G.F. Luger, An expert system controller for gas chromatography automation.
Instrum. Sci. Technol., 25 (2) (1997) 107-120.
19. M.E. Elyashberg, Y. Karasev, E.R. Martirosian, H. Thiele and H. Somberg, Expert systems as
a tool for the molecular structure elucidation by spectral methods, strategies of solution to the
problem. Anal. Chim. Acta, 348 (1997) 443-464.
20. M. De Smet, G. Musch, A. Peeters, L. Buydens and D.L. Massart, Expert systems for the selec-
tion of HPLC methods for the analysis of drugs. J. Chromatogr., 485 (1989) 237-253.
21. P.J. Schoenmakers, A. Peeters and R.J. Lynch, Optimization of chromatographic methods by a
combination of optimization software and expert systems. J. Chromatogr., 506 (1990)
169-184.
22. M. Mulholland, N. Walker, J.A. van Leeuwen, L. Buydens, H. Hindriks and P.J.
Schoenmakers, Expert systems for method development and validation in HPLC. Microchim.
Acta, 2 (1991) 493-503.
23. J. Sandberg, B. Wielinga and G. Schreiber, Methods and techniques for knowledge manage-
ment: what has knowledge engineering to offer? Expert Syst. Applic, 13(1) (1997) 73-79.
Additional Reading
H.J. Luinge, A knowledge-based system for structure analysis from infrared and mass spectral data.
Trends Anal. Chem., 9 (1990) 66-69.
M. Otto, Fuzzy expert systems. Trends Anal. Chem., 9 (1990) 69-72.
J. Klaessens, L. Van Beysterveldt, T. Saris, B. Vandeginste and G. Kateman, LABGEN, expert sys-
tem for knowledge-based modelling of analytical laboratories. Part I, Laboratory organization.
Anal. Chim. Acta, 222 (1989) 1-17.
K. Janssens and P. van Espen, Implementation of an expert system for the qualitative interpretation of
X-ray fluorescence spectra. Anal. Chim. Acta, 184 (1991) 117-132.
B.A. Hohne and T.H. Pierce, (eds.). Expert-system Applications in Chemistry. ACS Symp. Ser., Vol.
306, ACS, Washington, DC, 1989.
649
Chapter 44
44.1 Introduction
The beginning of the research into artificial neural networks is often considered
to be 1943 when McCulloch and Pitts published their paper on the functioning of
the nervous system [1]. They tried to explain it by small units that are based on
mathematical logic and that are interconnected. These units are abstractions of the
biological neurons, and their connections. In 1949 Hebb [2] tried to explain some
psychological results by means of a learning law for biological synapses. With the
introduction of computers it was possible to develop and test artificial neural
networks. The first artificial neural networks on computer were developed by
Rosenblatt (the perceptron) [3] and by Widrow [4] (the AD ALINE: ADAptive
LINear Element). The main difference between them is the learning rule that they
use. These simple networks were able to learn and perform some simple tasks. This
success stimulated research in the field; the book by Nilsson on linear learning
machines [5] summarizes most of the work of this early period. The possibilities of
artificial neural networks were believed to be tremendous, and gave rise to
unrealistically high expectations by the public at large. At that time (1969) Papert
and Minsky showed [6] that many of these expectations could not be fulfilled by
the perceptron. Their book had a very negative impact on the research in the field:
funding became difficult and publication of papers declined. Apart from some
enthusiastic researchers who continued their efforts in the field, research stopped
for many years. Some of the investigations continued under the heading adaptive
signal processing or pattern recognition. In chemistry the best known example is
the linear learning machine which was a popular pattern recognition method (see
also Chapter 33).
In 1986 the second breakthrough was caused by the publication of a book by
Rumelhart in which a learning strategy, the back-propagation, developed earlier by
Werbos was proposed [7,8]. This new learning rule allowed the construction of
networks which were able to overcome the problems of the perceptron-based
networks. They were now able to solve more complicated (non-linear) problems.
This breakthrough caused a new interest and up-to-date research is still increasing
with encouraging results.
Output
Fig. 44.2. An artificial neuron; xi...Xp are the incoming signals; wi... Wp are the corresponding weight
factors and F is the transfer function.
function), into the outgoing signal (the output of the neuron). The propagation of
the signal is determined by the connections between the neurons and by their
associated weights. These weights represent the synaptic strengths in the biol-
ogical neuron.
Different types of networks have been developed. The connection pattern is an
important differentiating factor, since this determines the information flow
through the network.
The weights associated with each connection are also essential since, together
with the connection pattern, they determine the signal propagation through the
network. They are real numbers that determine the strength of the connectivity
between the two connected neurons. Each signal that is transported through the
connection is multiplied by the associated weight of the connection. Weights can
be positive or negative. The appropriate setting of the weights is essential for the
proper functioning of the network. They determine the relationship between the
input and the eventual output of the network. Therefore, they are considered as the
distributed knowledge content of the network. Finding the proper weight setting is
achieved in a training phase in which the weights are adapted according to a
so-called learning rule. The training of the network is a major phase of the
developing process of artificial neural networks. Different networks require
different training procedures, the two major variants being supervised and
unsupervised training (see also Chapters 30 and 33).
653
44.4.1 Principle
The perceptron-like linear networks were the first networks that were developed
[3,4]. They are described in an intuitive way in Chapter 33. In this section we
explain their working principle as an introduction to that of the more advanced
MLF networks. We explain the principle of these early networks by means of the
Linear Learning Machine (LLM) since it is the best known example in chemistry.
The LLM is a classical supervised pattern recognizer. It tries to find boundaries
between classes. In Fig. 44.3 an example of two classes A and B is shown in two
dimensions {x^ and X2). Each possible object is thus defined by its values on x^ and
JC2. The boundary between the two classes A and B is defined by the line, L,
described by eq. (44.1):
WjX2+W2X2 - 6 = 0
Fig. 44.3. Two classes in two dimensions and a possible boundary line.
654
W, X| + W2 JC2 = 6
or (44.1)
^2 - e =o
where w, and W2 ^^^ called the weights, 0 is the offset. In neural networks term-
inology it is called the bias. The offset or bias is added for generality. If no offset
(or bias) is provided the line is forced to go through the origin.
The line L divides the two-dimensional space in two regions (corresponding to
the two classes). All points or objects (combinations of Xj and X2) below the line
yield a value for the function, which we will call NET(xi,X2) (eq. (44.2)), larger
(smaller) than 0; the points above the line yield a value of NET(xi,X2) smaller
(larger) than 0; the points on the line satisfy eq. (44.1) and thus yield a value, 0.
where
xj =[x,,X2i] wT = [w,W2] (44.2)
Object / is classified as
Class A, if: NET(jCi,, JC2,) > 0 or NET(jCi„ JC2,) - 0 > 0
Class B, if: NET(jCi„ ^2,) < 0 or NET(jCi., X2) - 0 < 0
Another approach to eq. (44.2) is to add an extra dimension to the object vector
x^, on which all objects have the same value. Usually 1 is taken for this extra term.
The 0 term can then be included in the weight vector, w^(wi,W2,-0). This is the
same procedure as for MLR where an extra column of ones is added to the
X-matrix to accommodate for the intercept (Chapter 10). The objects are then
characterized by the vector xJ (JCI-,JC2„1). Equation 44.2 can then be written:
NET(x.) = xj yv
Object / is classified as
Class A if: NET(x,.) > 0
Class B if: NET(x,.) < 0
The classification of objects is based on the threshold value, 0, of NET(x), also
called the bias. The procedure can be described by means of a transfer function, F
(Fig. 44.4a). The weighted sum of the input values of x is transmitted through a
655
+1 (CLASS A)
F(vj[5j+\^3^ - 9) I ^-
-1 (CLASS B)
+1 +1
0
K^ + ^^1 [Wj^+\^Xj- 6]
-1 J_
Fig. 44.4. (a) Schematic representation of the LLM. F represents the threshold transfer function: F =
sign(Net(x)). (b) On the left: the threshold function for NET(x) = wiJCi + W2X2; on the right for NET(x)
= wiXi + W2X2 - 6; see text for explanation.
transfer function, a threshold step function, also called a hard delimiter function.
Figure 44.4b shows the threshold function. The classification is based on the
output value of the threshold function (0 or +1 for class A; - 1 for class B). When
the input to the threshold function (i.e. NET(Xj)) exceeds the threshold, then the
output value, y, of the function is 1, otherwise it is - 1 .
Instead of such a hard delimiter function it is possible to use other transfer
functions such as the threshold logic, also called the semi-linear function (Fig.
44.5a). In this function there is a region where the output value of the transfer
function is linearly related to the input value with a slope, a. The first point, A, of
this region is: x^w^ + X2W2 = NET(JCI,JC2) = 6. The endpoint, B, of the linear region is
reached when: a{x^w^ + X2W2 - 9) = 1 or x^Wi + JC2W2 = NET(xi,X2) = 6 + lla. The
width of the interval between A and B is thus lla. It is, moreover, possible to use
non-linear functions. The sigmoidal transfer function (Fig. 44.5b) is the most
widely used transfer function in the more advanced MLF networks. It will be
discussed more in detail in Section 44.5.
656
a
A
+1
e + l/a NET
F
+1
NET
Fig. 44.5. (a) The semi-linear function, F = max(0,min(l,a(NET(x) -9))) with NET(x) = Wixi + W2X2
and (b) the sigmoid function: F = ft.Mox. ^ a^^
The weights, as described in the previous section, determine the position of the
boundary that the LLM draws between the classes. The strategy to find these
weights is at the heart of the LLM. This procedure is called the learning rule [7,8].
It is a supervised strategy and is based on the learning rule that Hebb suggested for
biological neurons [2]. Initially the weights are set randomly. A training set with a
number of objects with known classification is presented to the classifier. At each
presentation the weights are updated with an amount Aw, determined by the
learning rule. The most important variant, used in networks, is the delta rule
657
developed by Widrow and Hoff (eq.(44.4)) [4]. In this learning strategy the actual
output of the neuron is adapted with a term, based on 6, the error, i.e. the difference
between the actual output and the desired or target output of the neuron for a
specific object (eq. (44.4)).
Aw. ~ X. 6,
5-4-a, (44.4)
The trivial solution (w„g^ = -w^y) is not interesting since it defines the same
boundary line. A non-trivial solution is found by the following procedure:
Aw ~ 6, X.
or
xJ Aw
5,=T1-V— (0<r|<l)
x ; x^.
Aw = w„,^-Woid
/ I - T
x ; X,
TABLE 44.1
Cycle 1
1 (1,2,1) A -1 Yes -
2 (3,2,1) A +1 No wj =(0.57,-1.28,-0.14)
3 (-1,0,1) B -0.71 No W2 =(-0.14,-1.26,0.57)
4 (-1,2,1) B -1.86 No W3 =(-0.76,-0.05,1.19)
Cycle 2
1 (1,2,1) A +0.33 No w j =(-0.87,-0.27,1.08)
2 (3,2,1) A -2.07 Yes -
3 (-1,0,1) B + 1.95 Yes -
4 (-1,2,1) B + 1.41 Yes -
Cycle 3
I (1,2,1) A -0.33 Yes -
2 (3,2,1) A -2.07 Yes -
3 (-1,0,1) B + 1.95 Yes -
4 (-1,2,1) B + 1.41 Yes -
All objects are correctly classified after three cycles; final w'^ = (-0.87,-0.27,1.08)
xj X,.
W„ew=W,,-'^"J^°'^X, (44.6)
xj X.
When Ti is set to 1 the new weights yield the opposite NET-value for the object,
/. This mirroring effect can sometimes yield undesirably large weight changes.
Therefore r| is usually set to a value < 1. In Table 44.1 a calculation on a two-class
example is given. In Fig. 44.6 the boundaries found by LLM are shown. As an
659
1 1 1 1 1
1 ' '
l \NO^
-1
-2 w4 \ \ w3
1 1 1 1 1 1 _ 1 J
-0.5 0.5 1 1.5 2.5
x1
Fig. 44.6. The successive boundaries (wO to w4) as found by the LLM, Airing the draining procedure
described in Table 44.1. The crosses indicate the positions of the objects (1 to 4); their class member-
ship (A or B) is given between parentheses.
example the weight adaptation in the first cycle, caused by the wrong classification
of the second object, can be calculated using eq. (44.6):
ll
2[3 2 1] - 1
oj [3"
2
" 0.57'
— -1.28
[3 2 1]
[l -0.14
44.4,3 Limitations
A X2
4-A +B
xl
\«- -+Ar
Fig. 44.7. The XOR classification problem (see also Table 44.2).
TABLE 44.2
B 0 0 -1 -1 -1
A 0 1 +1 -1 +1
A 1 0 +1 -1 +1
B 1 1 +1 +1 -1
ification in 2 classes A and B for all objects. The fact that perceptron-like networks
were only able to solve simple linearly separable problems was criticized first by
Minsky and Papert [6] and marked the end of the first period of interest in neural
network research. To solve the XOR problem it must be possible to find two
boundary lines. For this an additional layer of neurons is necessary (see Fig.
44.8a). The two additional neurons in the hidden layer of neurons each determine
one boundary line. By combining these lines it is possible to obtain the correct
classification. In Fig. 44.8a an example of a combination of weights and thresholds
is given that yields the correct classification. The first hidden unit determines
'line 1' and the other one determines 'line T in Fig. 44.8b. In Table 44.2 the result
of applying the different hard delimiter transfer functions, of the two hidden units
and the output unit, is summarized. The output values of the transfer functions of
the two hidden units, hu\ and hul are given in the third and fourth column. The two
objects of class A yield the same results for hul and hul. The objects of class B still
yield different results. In Fig. 44.8c the output values of the hidden units for the
661
Output unit(line3)
Line 2 xl+x2-1.5=0
Linel xl+x2-0.5=0
hu2
B / Line 3 hul-hu2-0.5=0
hul
Fig. 44.8. (a) The structure of the neural network, for solving the XOR classification problem of Fig.
44.7. (b) The two boundary lines as defined by the hidden units in the input space (xl,x2). (c) Repre-
sentation of the objects in the space defined by the output values of the two hidden units {hul,hu2) and
the boundary line defined in this space by the output unit. The two objects of class A are at the same
location.
662
different objects is plotted. It is the output unit that determines a third line, 'line 3'
in Fig. 44.8c, that yields the correct classification (see Table 44.2).
44.5.1 Introduction
44.5.2 Structure
The general structure is shown in Fig. 44.9. The units are ordered in layers.
There are three types of layers: the input layer, the output layer and the hidden
layer(s). All units from one layer are connected to all units of the following layer.
The network receives the input signals through the input layer. Information is then
passed to the hidden layer(s) and finally to the output layer that produces the
response of the network. There may be zero, one or more hidden layers. Networks
with one hidden layer make up the vast majority of the networks. The number of
units in the input layer is determined by /?, the number of variables in the {nxp)
matrix X. The number of units in the output layer is determined by q, the number of
variables in the {nxq) matrix Y, the solution pattern.
663
input layer
hidden layer
output layer
Fig. 44.9. General structure of an MLF Network, (a) without bias and (b) with a bias neuron.
xj =[x,,,„x,pl]
wj =[Wj,...Wjp-Q]
Wj is the weight vector associated with the hidden unity, including 0, the bias term.
X. is the input vector, including an extra 'one' to accommodate for the bias, that
characterizes object /. Since the input units are pass-through units, x. is also the
input vector for each of the hidden units.
2. The net input, NETy, is then passed to the transfer function that transforms it
into the output signal of the unit. Different transfer functions may be used, the most
common non-linear one being the sigmoidal function (Fig. 44.5b).
wJ =[wj,.,.Wjf^-Q]
oJ is the input vector to the output units. It contains the output signals of the h
hidden units plus an additional 'one' to take the bias term for the output units into
account. The whole signal propagation is shown in Fig. 44.10. As can be seen the
signals are only forwarded from one layer to the next layer. There is no signal
propagation within a layer or to a previous layer. This explains the term feed-
forward networks.
665
0.583 Output
Fig. 44.10. An example of signal propagation in an MLF network with transfer function:
1
-(NET(x))-
1+e
NET
2 3 4
Fig. 44.11. Different response regions in a sigmoid function. For explanation, see text.
667
« = -2.5
(a)
6=2.5
Fig. 44.12. (a) and (b): The output value (y) of the MLF network as a function of the input value (x)
with two different weight settings, (c) and (d): The contour plots of the output value of an MLF net-
work in the specified input space with two different weight settings.
8 10
e = o^
(b)
e=-o.5
seem surprising at first sight that such a simple non-linear unit suffices to model
successfully complex non-linear relationships. One must not forget however that it is
the proper combination (weight setting) of several of those sigmoid functions that
assures the overall modelling power. In Fig. 44.12a an example of a combination of
two sigmoidal functions of two hidden units is given so that a Gaussian-like
relationship between input and output is modelled. Changing the weights yields Fig.
44.12b. It is easy to imagine that in such a way complex relationships can be
modelled with more sigmoids. In this view an MLF network can be seen as a kind of
automatic spline fitter (Chapter 11). It is automatic because the combination of basic
functions and knot setting is done automatically during the weight and bias setting
in the training phase by means of the learning rule (Section 44.5.5).
669
-4 -3 -2
9=0
xl
x2
e=o
When the MLF is used for classification its non-linear properties are also
important. In Fig. 44.12c the contour map of the output of a neural network with
two hidden units is shown. It shows clearly that non-linear boundaries are
obtained. Totally different boundaries are obtained by varying the weights, as
shown in Fig. 44.12d. For modelling as well as for classification tasks, the
appropriate number of transfer functions (i.e. the number of hidden units) thus
depends essentially on the complexity of the relationship to be modelled and must
be determined empirically for each problem. Other functions, such as the tangens
hyperbolicus function (Fig. 44.13a) are also sometimes used. In Ref. [19] the
authors came to the conclusion that in most cases a sigmoidal function describes
non-linearities sufficiently well. Only in the presence of periodicities in the data
670
1 1 1 1 1
' 1 ' '/ ' '
1
4
1 ^^ ^
cvj 0
-1
-2
-3
-4
1 1 1 / 1 1 1 1 1 ... 1 1.. 1
-4 -3 -2 -1 0
x1
9=0
xl
x2
9=0
this type of function is not adequate. In that case a sinusoidal transfer function (Fig.
44.13b) is necessary. In a special kind of MLF networks the so-called radial basis
function is used as transfer function. We describe them in Section 44.6.
-1
b A
Ej=^(dj-Ojf (44.9)
Wij
Fig. 44.14. An example of an error surface of an MLF as function of one weight value. The gradient de-
termines the change of the weight in the next iteration.
673
4. Calculate the adaptation of the weights to the hidden layer. The desired output
and thus the error is not known directly. In the back-propagation strategy it is
calculated from the accumulated errors of the output neurons. It can be shown (e.g.
Refs. [14] and [15]) that this yields eq. (44.12). In this equation, k represents the
units in the output layer. All weight adaptations can be calculated using eq. (44.10)
and eq. (44.12). The error is said to be back-propagated through the network.
k ^jk
•'• diNETj) h
(44.12)
5. Repeat this process for all input patterns. One iteration or epoch is defined as
one weight correction for all examples of the training set.
The learning rate, r|, in eq. (44.10) is important for the performance of the
training procedure. If it is small, the convergence of the weights to an optimum
may be slow and there is a danger of getting stuck at a local optimum. If the
learning rate is high, the system may oscillate. An appropriate value of the learning
rate depends on the transfer function of the output units. For a sigmoidal transfer
function, r| values between 0.5 and 1 are used. For a linear transfer function much
smaller values (0.001 < r| < 0.1) are applied. To limit the danger of oscillation, eq.
(44.10) can be adapted as follows:
In eq. (44.13) the term ^Wy^fn) is the current weight change, l^w^fn - 1) is the
weight change from the previous learning cycle, a is called the momentum. It
represents the proportion of the weight change of the previous learning cycle that is
taken into account to determine the step size of the weight in the present learning
cycle. Its value is usually set between 0.6 and 0.8. The influence of the momentum
term is less important than that of the learning rate. The optimal values of a and y\
depend on the problem under study and can be found by means of an optimization
procedure, although this is usually not done and only some values of r| and a are
tried.
674
During a supervised training session the weights of the network are optimized,
using a training set. Since the weight adaptations are done by means of the training
set, it should contain a sufficient number of relevant examples for the relationship
to be modelled. The more hidden units, the more examples are necessary to train
the network. The number of weights to be optimized increases indeed drastically
with the number of units in the network:
number of weights = ph + qh = h(p + q)
where p is the number of input units, h the number of hidden units and q the number
of output units.
As described in Section 44.5.5, the weights are adapted along the gradient that
minimizes the error in the training set, using the back-propagation strategy. One
iteration is not sufficient to reach the minimum in the error surface. Care must be
taken that the sequence of input patterns is randomized at each iteration, otherwise
bias can be introduced. Several (50 to 5000) iterations are typically required to
reach the minimum.
q is the number of output units, n is the number of examples in the training set, d is
the desired output value and a is the actual output value. The NSE is calculated
after each iteration of the training session. The NSE is not always the best
performance criterion. One badly predicted pattern (e.g. an outlier) may influence
strongly this performance criterion. It also gives no indication on which or how
many patterns are well or badly predicted. Small errors, close to 0 for almost all
training examples and a few badly predicted examples will yield the same NSE as
for average errors for all examples. For classification purposes the NSE is not the
ideal criterion since it is more important that the output values are above or below a
certain threshold (e.g. 0.9 or 0.1) than that they are exactly 0 or 1.
The NSE does give, however, a good general idea of the performance of the
network and is therefore commonly used during the training session. In Fig. 44.15a
a performance curve is shown. The NSE decreases monotonically with the number
675
a Error A
ITERATIONS
Error
Fig. 44.15. (a) Performance curve of a MLF network for a training set. (b) Performance curve of a
MLF network with too high a learning rate.
of iterations. Training with too high a learning rate, Ti, yields the performance
behaviour of Fig 44.15b.
When the network must be used to predict future unknown examples it is not
good practice however to judge the performance of the network based on the
training set only. Together with the NSE of the training set the NSE of an
independent set, the monitoring set, must always be calculated. The NSE value for
the monitoring set is usually somewhat larger than the NSE for the training set. In
Fig. 44.16a an ideal performance behaviour of a network is shown.
A typical performance behaviour is shown in Fig. 44.16b. The increase of the
NSE for the monitoring set is a phenomenon that is called overtraining. This
phenomenon can be compared to fitting a curve with a polynomial of a too high
order or with a PCR or PLS model with too many latent variables. It is caused by
the fact that after a certain number of iterations, the noise present in the training set
is modelled by the network. The network acts then as a memory, able to recall
676
Error
Error
ITERATIONS ITERATIONS
Fig. 44.16. (a) Ideal performance curve of a MLF network. The solid line represents the training set;
the dashed line represents the monitoring set. (b) overtraining of the network begins at point A. (c) Par-
alysed network, (d) Performance curve of an MLF network for which the training and control samples
represent different models.
exactly the training examples but loosing its ability to predict. This is harmful for
the predictive performance of the network. When the NSE of the monitoring set is
acceptable it suffices to stop the training at point A. Overtraining occurs sooner
when the number of hidden units is too large for the problem complexity or
because the number of training examples is too small to train adequately all the
weights. The remedy then is to use a network with a smaller number of units.
In Fig. 44.16c the performance behaviour of SL paralysed network is shown. The
normal decrease of the NSE stops too soon and the NSE remains too high to be
acceptable. The paralysation of the network occurs when the weighted input value,
677
NETy, is situated in the response regions 1 or 5 of the transfer function for too many
units (Fig. 44.11). As long as Net^ is smaller than A or larger than D, the response
does not change when changing Net^ (i.e changing the weights). A remedy is to
increase the learning rate, r|, or to explicitly decrease the slope, (3, of the transfer
function, so that the input values fall again inside the dynamic region. To retrain
the network with a different weight initialization is another option. This perf-
ormance behaviour is also observed when too few hidden units are used. The
network is then not able to properly model the relationship. Increasing the number
of hidden units is the remedy in this case.
The performance behaviour shown in Fig. 44.16d is caused by the fact that the
monitoring set and the training set represent different relationships or when some
outliers are present in the monitoring set that are not present in the training set.
When not enough examples are available to make an independent monitoring
set, the cross-validation procedure can be applied (see Chapter 10). The data set is
split into C different parts and each part is used once as monitoring set. The
network is trained and tested C times. The results of the C test sessions give an
indication on the performance of the network. It is strongly advised to validate the
network that has been trained by the above procedure with a second independent
test set (see Section 44.5.10).
From the previous sections it is clear that it is important to use a network with a
suitable number of hidden units. When too few hidden units are used, the
relationship cannot be modelled properly and the network shows poor
performance. Too large a number of hidden units causes severe overtraining. The
suitable number of hidden units depends on the problem complexity and on the
number of training examples that are available. It must be determined empirically.
There are basically three approaches for this:
678
Error
10 12 14 16 18
Hidden units
Fig. 44.17. Determining the number of hidden units for an MLF network. The whiskers represent the
range of the error with different random weight initializations or by cross-validation.
Train and test a network with a certain number of hidden units, based on an
educated guess. When the NSE of the network is acceptable and no severe
overtraining occurs, the network is suitable.
A second approach is to train different networks with a different number of
hidden units. Preferably, each network is trained several times with a differ-
ent weight initialization. From a plot as in Fig. 44.17 it is then straightforward
to select a suitable number of hidden units. This approach is certainly the best
but it involves the training and testing of many networks and is thus a time
consuming procedure.
Another approach that has been used is the so-called pruning procedure
[26,27]. One starts with a network with a large number of hidden units. Dur-
ing the training phase the weight changes of all units are monitored. Those
units or connections whose weights remain low are removed and the training
is continued. This procedure has not been very successful for MLF networks
and is not often applied.
679
44.5.9.1 Scaling
So far, we have not considered the nature of the data in the X or Y matrix.
However, as with all other data handling techniques, preprocessing of the data may
be desirable in some cases. The reasons for scaling are essentially the same as
described in earlier chapters. Methods that are often used for continuous values are
autoscaling and range scaling. The data can be scaled by columns or by rows.
Scaling is often applied if the variables have different ranges of values. Variables
with large values can dominate the model. Another reason for scaling in neural
networks is the fact that large input values for a certain unit, especially together
with large weights, may cause the NET input of that unit to fall in the tail of the
sigmoid transfer function. In this region the derivative is small and, as explained in
Section 44.5.5, weight corrections according to the back-propagation rule are
proportional to the derivative of the transfer function (see eqs. (44.10 and 44.11)).
This may then lead to paralysation of the network. Scaling of the input brings the
net input within the appropriate range of the sigmoid transfer function.
network that yields the lowest NSE for the test set. As mentioned before, however,
the NSE does not cover all performance aspects. It is good practice to consider also
other performance criteria at this stage of the development. The distribution of the
errors is certainly important. Another important performance criterion is the
robustness to noise on the input values. Derks et al. [29] proposed an empirical
procedure to test this robustness. It is based on the gradually increasing addition of
noise to the input data. Networks whose performance is less sensitive to this noise
injection are to be preferred over sensitive networks. Gemperline studied the
robustness of MLF networks in comparison with PLS [30]. The more performance
criteria used to validate the model the higher the probability that a good impression
is obtained of the network's predictive value for samples that lie within the range
of the training set. Since no theoretical knowledge is available about the model, it
is dangerous to extrapolate with MLF networks. Before using the network for
prediction it should be checked whether the input falls within the training set range.
MLF networks are very powerful but should be applied with caution. They are
powerful since practical experience has shown that they are able to model in an
easy way difficult non-linear relationships, much better and faster than alternative
techniques.
There are however many pitfalls in the use of MLF networks.
- Since they are used for complex relationships, about which only little prior
knowledge is available, it is very hard to be sure that the training, monitoring
and test examples are representative for the data.
- Usually many local minima are present in the error surface. It is impossible to
guarantee that the overall optimum is reached.
- The validation of the obtained result is another problem. It is yet not possible
to obtain a confidence interval around the obtained output. Research on this
topic is going on [21,28].
- It is not possible to extract in a direct way theoretical information on the
model. It is to be considered as an empirical model builder such as a spline or
polynomial fitting procedure.
Especially the last few years, the number of applications of neural networks has
grown exponentially. One reason for this is undoubtedly the fact that neural
networks outperform in many applications the traditional (linear) techniques. The
large number of samples that are needed to train neural networks remains certainly
a serious bottleneck. The validation of the results is a further issue for concern.
681
TABLE 44.3
Examples of chemical applications
Even more than in traditional techniques, care must be taken not to extrapolate.
When the training data are not evenly distributed, even intrapolation can cause
problems [20].
The applications concern classification as well as quantification problems. In
Table 44.3 some examples are given in both problem domains.
44.6.1 Structure
Radial basis function networks (RBF) are a variant of three-layer feed forward
networks (see Fig 44.18). They contain a pass-through input layer, a hidden layer
and an output layer. A different approach for modelling the data is used. The
transfer function in the hidden layer of RBF networks is called the kernel or basis
function. For a detailed description the reader is referred to references [62,63].
Each node in the hidden unit contains thus such a kernel function. The main
difference between the transfer function in MLF and the kernel function in RBF is
that the latter (usually a Gaussian function) defines an ellipsoid in the input space.
Whereas basically the MLF network divides the input space into regions via
hyperplanes (see e.g. Figs. 44.12c and d), RBF networks divide the input space
into hyperspheres by means of the kernel function with specified widths and
centres. This can be compared with the density or potential methods in pattern
recognition (see Section 33.2.5).
The output of a hidden unit, in case of a Gaussian kernel function is defined as:
output^. = Ofx) = e'^"^-'^'^^^' (44.13)
Ix - c-l is the euclidean distance (other distance measures are also possible) between
the input vector, x, and c^, the centroid of the Gaussian kemel function. The
682
yi
y2
parameter b represents the width of the Gaussian function. The centroids, Cy and the
widths, bj of all the hidden units together define the so-called activation space,
where the Gaussian function has a value larger than a given threshold value. Nodes
containing a kernel with a centroid close (in comparison with the width, b) to the
input pattern will be predominantly activated. This is in contrast to the distant
objects which cause low output values of the hidden units. This activation space
can thus be regularized by the width factor and the centroid position, providing a
means to adapt the local behaviour of the network.
The output of these hidden nodes, o^, is then forwarded to all output nodes
through weighted connections. The output yj of these nodes consists of a linear
combination of the kernel functions:
yj = I wji Oi(x)
where Wj^ represents the weights of the connections between the hidden layer / and
output layer, j .
44.6.2 Training
The centroid positions, c^, of the kernel functions and their widths, bj, determine
the output of the hidden units. The centroid positions are fixed at the initialization
of the network. The widths and the weights are obtained by supervised training.
Initializing and training these positions and widths is a critical issue in constructing
RBF networks. The supervised network training is generally performed by error
back-propagation by means of the generalized delta learning rule (Section 44.5.5).
The Gaussian kernels are continuously differentiable functions and are therefore
suited for this algorithm.
683
Approaches for the initialization and training of the position of the centres, the
widths and weights are summarized below:
1. For the distribution of the position of the centres the following methods or
combinations of methods are used [29]:
- random distribution within the range of interest;
- random selection of input patterns;
- maximal coverage of the range of interest, e.g. with the Kennard and Stone
algorithm (see Chapter 24);
- selection of representative input patterns, defined by means of the output of a
Kohonen network, applied to the input pattems (see Section 44.7);
~ selection of representative input pattems, defined by means of clustering
methods (e.g. ^-means [65] (see Section 30.3.2), genetic algorithms or simu-
lated annealing [64] (see Chapter 27).
2. The widths, b, of the kernel functions are obtained by:
- the supervised training procedure (error back-propagation);
- fixing them at the initialization phase based on expertise and prior knowledge
on the data structure.
3. The weights, w^p are trained with the usual error back-propagation strategy.
It is generally advised to use an abundant number of kernels on random positions
which are then successively pruned during training.
44,6.3 An example
.•.
0.,
I >i - 1 1 -1 . 1 •, 1 , 1
•• 4 # .• %. • # i
(-i,.i)(i-i,i)Ci4i)(i.i) (-i;-i)(i-i,i).(i.^i)<W)
i "* i •
-1 -I
X I -1
1 1
AND OR
Fig. 44.19. The logical operators AND and OR and their RBF implementation.
44.6.4 Applications
In chemical practice, problems are far more complex than the example
problems described in the previous section. To be able to apply RBF networks
properly, it is useful to have a considerable amount of prior knowledge about the
685
Fig. 44.20. (a) The trained logical OR. The contour lines in the xl ,x2-plane around the centres denote
the 20, 40, 60 and 80 percent confidence limits of the Gaussian kernels, (b) The trained logical OR
with increased width factors.
686
Fig. 44.21. Grid (40x40) prediction of the logical AND for (a) the MLF model and (b) for the RBF
model.
687
data especially for the distribution of the centroid positions and widths [29].
Although this might be considered as a disadvantage, at least the weights have
physical or chemical meaning, allowing a better understanding of the model. They
have been applied in process control [66,67]. A combination of RBF-PLS as a
flexible non-linear regression technique has also been proposed [68].
44.7.1 Structure
a o—o—o—o—o—o c o—o—o—o
/ \ / \ / \ / \
o—o—o—o—o
/ \ / \ / \ / \ / \
b o—o—o—o—o—o o—o—o—o—o—o
i4444_|, / \ / \ / \ / \ / \ / \
I I I I I V O O O O O O O
o—o—o—o—o—o
o—o—o—o—o—o
o—o—o—o—o—o \ / \ / \ / \ / \ /
o—o—o—o—o—o o—o—o—o—o
I I I I I I \ / \ / \ / \.
o—o—o—o—o—o o—o—o—o
Fig. 44.22. Three commonly used Kohonen network structures, (a) One-dimensional array; (b)
two-dimensional rectangular network (each unit, apart from the borderline units has 8 neighbours)
and (c) two-dimensional hexagonal network (each unit, apart from the borderline units, has 6 neigh-
bours). (Reprinted with permission from Ref. [70]).
688
44J.2 Training
n(r)
a
Fig. 44.23. Some common neighbourhood functions, used in Kohonen networks, (a) a block function,
(b) a triangular function, (c) a Gaussian-bell function and (d) a Mexican-hat shaped function. In each
of the diagrams is the winning unit situated at the centre of the abscissa. The horizontal axis represents
the distance, r, to the winning unit. The vertical axis represents the value of the neighbourhood func-
tion. (Reprinted with permission from [70]).
matching
input vector
Fig. 44.24. Example of gradually narrowing of the neighbourhood function during the training pro-
cess. Light grey: definition of neighbourhood at stage 1; middle grey: at stage 2 and dark grey: at stage
3. O is a unit and • is the winning unit. (Adapted from Ref. [70]).
690
X^ r\(X-W(t))
Fig. 44.25. Example of the weight update of the winning unit in a Kohonen network. (Reprinted with
permission from Ref. [70]).
After the training procedure, the weight vectors of the units are fixed and the
map is ready to be interpreted. There are different possibilities to interpret the
weight combination, depending on the purpose of the network use. In this section
some possibilities are described.
The output-activity map. A trained Kohonen network yields for a given input
object, Xj, one winning unit, whose weight vector, w^, is closest (as defined by the
criterion used in the learning procedure) to x-. However, x^ may be close to the
weight vectors, w^, of other units as well. The output )^ of the units of the map can
also be defined as:
yj = D(Wj, X.)
D is the similarity measure as used in the training procedure. This results in a map
as in Fig.44.26a. This map allows the inspection of regions (neighbouring neurons)
that have a similar weight vector as a given input x^. Note that each input x^ yields a
different output activity map.
The counting map (Fig. 44.26b). This map is obtained by counting for each unit
of the Kohonen network, the number of training objects for which the unit is the
winning one. This map provides insight in the number of clusters that are present in
the dataset. When e.g. all input patterns are assigned to e.g two distinct regions
(sets of neighbouring neurons) in the map, it can be concluded that there are two
clusters in the dataset.
691
+ 4- 4- + 4- - X X A A A A A A
+ + 4- 4- 4- 4- X X A A A A
4- + 4- X - X - A A A B B
4- 4- 4 - - - - B B B
+ 4- - - - -1 C X B B X B B
++4 « - - C C BX X C B B
4- 4- 4- 4- - - C C B X C C B B
4- 4- - - - C C C C C B B B
k
Fig. 44.26. Some possibilities for analysing a two-dimensional Kohonen map (Reprinted with per-
mission from Ref. [70]). (a) Grey-encoded output activity map for a given training example. Dark ar-
eas in the map indicate a high similarity between the weight vector of the unit and the input object, (b)
A counting map: dark areas indicate a large number of training examples that map on the unit. Units on
which no training examples map are indicated white, (c) Feature map indicating units on which train-
ing examples map with (+) or without (-) a certain feature. (X) indicates that different labels are as-
signed to the unit, (d) Feature map with class identification A and B as labels. (X) indicates multiple
class labels.
The feature map. This map can be obtained when labels can be assigned to the
training objects. Labels may consist of a known classification, or the presence
versus absence of certain features. For each training object, x-, the winning unit is
determined and to this unit the label of x^ is assigned. When this is completed for all
training objects, each unit in the map is labelled in the map with zero, one or more
labels (see Figs. 44.26c and 44.26d). When one unit is labelled with more labels it
means that an overlap is present. In this way it can be compared with fuzzy
clustering techniques (see Chapter 30). A detailed description can be found in
references [69,70].
44.7.4 Applications
Due to the Kohonen learning algorithm, the individual weight vectors in the
Kohonen map are arranged and oriented in such a way that the structure of the
input space, i.e. the topology is preserved as well as possible in the resulting
692
44.8.1 Introduction
The Adaptive Resonance Theory (ART) networks are also designed for un-
supervised pattern recognition. There exists a variant (ARTMAP) that is meant for
supervised pattern recognition, but we will consider here the more common
unsupervised variant. For a detailed description see Ref. [14]. Grossberg designed
the adaptive resonance theory to overcome a general drawback of classifiers (e.g.
MLF networks), i.e. the fact that once the training procedure (supervised or
unsupervised) is completed the weights are fixed. The network or classifier is
designed for a certain well-defined classification task. The network is trained by
means of a set of representative training examples. This is an acceptable procedure
only when the classification problem is well defined and only as long as it remains
stable. Unfortunately this is not often the case in real situations. It may happen that
a new class of objects appears that was not represented in the training set. The only
remedy in that case is to retrain the network including training objects of the new
class. It is also possible that existing classes tend to change in time, e.g. drifting
away. Here too, the proper action is to retrain the network with new representative
training objects. This is what Grossberg called the stability-plasticity dilemma:
how can a system be adaptive to relevant new input and at the same time be stable
to irrelevant input changes? In response to this, he and others developed the
adaptive resonance theory. The essence of ART is pattern matching. When a
familiar pattern is presented to the network (i.e. satisfies the limiting conditions of
the previous examples) the network recognizes the input and it also incorporates
693
the new information of the new input by adapting the weights. When, on the other
hand, a novel input is presented, i.e. not satisfying the limiting conditions of the
previous examples, the structure is adapted and the novel input is identified as the
first representative of a new class. In this way ART is stable enough to preserve
past learning but remains adaptable to incorporate new information when it
appears.
44.8.2 Structure
ART networks consist of units that contain a weight vector of the same
dimension as the input patterns. Each unit is meant to represent one class or cluster
in the input patterns. The structure of the ART network is such that the number of
units is larger than the expected number of classes. The units in excess are dummy
units that can be taken into use when a new input pattern shows up that does not
belong to any of the already learned classes.
There exist many different types of ART. The variant ARTl is the original
Grossberg algorithm. It allows only binary input vectors. ART2 allows also
continuous input. It is the basic variant of this type that we will describe.
The structure of ART networks is hard to visualize. It is in fact a theory that can
better be explained by means of a sequence of steps that follow the strategy.
44.8.3 Training
- The similarity between x, and the winning unit is compared with a threshold
value, p*, in the range from zero to one. When p^^ < p* the input pattern, x^, is
not considered to fall into the existing class. It is decided that a so-called nov-
elty is detected and the input vector is copied into one of the unused dummy
units. Otherwise the input pattern, x^, is considered to fall into the existing
class (to resonate with it). A large p* will result in many novelties, thus many
small clusters. A small p* results in few novelties and thus in a few large clus-
ters.
- When the resonance step succeeds the weight vector of the winning unit is
changed. It adapts itself a little towards the new input pattern x^, belonging to
the same class, according to:
w(r+l) = r| x^ + (1 - r|) w(0
w(r+l) = w(r+l)/lw(r+l)l
Usually the learning rate, rj, is chosen between 0 and 1. In this step the network
incorporates the new information present in the input object by moving the
centroid of the class a little towards the new input pattern x^. This step is intended to
make the network flexible enough when clusters are changing in time.
44.8.4 Application
From the previous discussion, it is clear that ART networks are more applicable
for pattern recognition than for quantitative applications. In Fig. 44.27 it is shown
how ART networks functions with different values for p*. It has been applied to
Fig. 44.27. The influence of different values of p* on the classification performance of the ART net-
work, (a) large p* value; (b) small p* values.
695
classify process control data [77], for classification of UV/VIS/NIR spectra and of
airborne particles [78]. It has also been applied for QSAR [79]. There is not,
however, much expertise available yet on the performance of these networks in
real situations. The general experience is that the networks are very sensitive to
noise. It is difficult to select a proper threshold value for the similarity measure.
Moreover, the experience is that within one network in fact different threshold
values are necessary for different classes. Strategies that use also an adaptive
threshold (e.g. based on the different class properties) may be more successful.
References
1. W.S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity.
Bull. Math. Biophy., 5 (1943) 115-133.
2. D.O. Hebb, The Organization of Behavior. Wiley, New York, 1949.
3. F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization
in the brain. Psycholog. Rev., 65 (1958) 386-408.
4. B. Widrow and M.E. Hoff, Adaptive switching circuits. In: IRE WESCON Convention Re-
cord, New York, 1960, p. 96-104.
5. N. Nilsson, Learning Machines, McGraw-Hill, New York, 1965.
6. M.L. Minsky and S.A. Papert, Perceptrons. MIT Press, Cambridge, MA, 1969.
7. D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning internal representations by error
propagation. In: Parallel Distributing Processing, Explorations in the Microstructure of Cogni-
tion, Vol. 1: Foundations, D.E. Rumelhart and J.L. McClelland (eds.), MIT Press, Cambridge,
MA, 1986, pp. 318-362.
8. P. Werbos, Beyond regression: new tools for prediction and analysis the behavioral sciences.
PhD Thesis, Harvard, Cambridge, MA, 1974.
9. B. Wythoff, Backpropagation neural networks. Chemom. Intell. Lab. Syst., 18 (1993)
115-155.
10. J. Zupan and J. Gasteiger, Neural Networks for Chemists. An Introduction. VCH
Verlaggesellschaft, Weinheim, 1993.
11. J. Devillers (ed.), Neural Networks in QSAR and Drug Design. Academic Press, London,
1996.
12. R. Beale and T. Jackson, Neural Computing: An Introduction. Institute of Physics Publishing,
Bristol, 1992.
13. J.L. McClelland and D.E. Rumelhart, Parallel Distributed Processing, Vol. 1. MIT Press, Lon-
don, 1988.
14. J.A. Freeman and D.M. Skapura, Neural Networks, Algorithms, Applications and Pro-
gramming Techniques. Addison-Wesley, Reading, MA, 1991.
15. R. Hecht-Nielsen, Neurocomputing. Addison-Wesley, Reading, MA, 1991.
16. P.D. Wasserman, Neural Computing: Theory and Practice. Van Nostrand Reinhold, New
York, 1989.
17. J. Smits, W.J. Meissen, L.M.C. Buydens and G. Kateman, Using artificial neural networks for
solving chemical problems. Chemom. Intell. Lab. Syst., 22 (1994) 165-189.
18. D. Svozil, Introduction to multi-layer feed-forward neural networks. Chemom. Intell. Lab.
Syst., 39 (1997) 43-62.
696
19. W.J. Meissen and L.M.C. Buydens, Aspects of multi-layer feed-forward neural networks in-
fluencing the quality of the fit of univariate non-linear relationships. Anal. Proc, 32 (1995)
53-56.
20. E.P.P. A. Derks, Aspects of Artificial networks and experimental noise. PhD thesis, University
of Nijmegen, the Netherlands, Chapter 2, 1997.
21. R. Tibshirani, A comparison of some error estimates for neural network models. Neural Com-
putation, 8 (1995) 152-163.
22. B. Walczak, Neural networks with robust backpropagation learning algorithm. Anal. Chim.
Acta, 322 (1996) 21-30.
23. P. Deveka and L. Achenie, On the use of quasi-Newton based training of a feedforward neural
network for time series forecasting. J. Intell. Fuzzy Syst., 3 (1995) 287-294.
24. M. Norgaard, Neural network based system identification toolbox. Technical report. Institute
of Automation, Technical University, Denmark, 1995.
25. E.P.P. A. Derks, Aspects of artificial networks and experimental noise. PhD thesis, Nijmegen,
the Netherlands, Chapter 3, 1997.
26. G. Castellano, A.M. Fanelli and M. Pelillo, An iterative pruning algorithm for feedforward
neural networks. IEEE Trans. Neural Networks, 8 (1997) 519-531.
27. J. Zhang, J.-H. Jiang, P. Liu, Y.-Z. Liang and R.-Q. Yu, Multivariate nonlinear modelling of
fluorescence data by neural network with hidden node pruning algorithm. Anal. Chim. Acta,
344(1997)29-40.
28. E.P.P.A. Derks, M.L.M. Beckers, W.J. Meissen and L.M.C. Buydens, A parallel
cross-validation procedure for artificial neural networks. Computers Chem., 20 (1995)
439-448.
29. E.P.P.A. Derks, M.S. Sanchez Pastor and L.M.C. Buydens, A robustness analysis for MLF and
RBF neural networks models. Chemom. Intell. Lab. Syst., 34 (1996) 299-301.
30. P.J. Gemperline, Rugged Spectroscopic calibration. Chemom. Intell. Lab. Syst., 39 (1997)
29-42.
31. G.J. Salter, M. Lazzari, L. Giansante, R. Goodacre, A. Jones, G. Surrichio, D.B. Kell and B.
Bianchi, Determination of the geographical origin of Italian extra-virgin olive oil using pyroly-
sis-mass spectrometry and neural networks. J. Anal. Appl. Pyrol., 40 (1997) 159-170.
32. J.R.M. Smits, L.W. Breedveld, M.W.J. Derksen, G. Kateman, H.W. Balfoort, J. Snoek and
J.W. Hofstraat, Pattern classification with artificial neural networks: classification of Algae,
based upon flow cytometry data. Anal. Chim. Acta, 258 (1992) 11-25.
33. H. Chan, A. Butler, D.M. Falk and M.S. Freund, Artificial neural network processing of strip-
ping analysis responses for identifying and quantifying heavy metals in the presence of
intermetallic compound formation. Anal. Chem., 69 (1997) 2373-2378.
34. C.W. McCarrick, D.T. Ohmer, L.A. Gilliland, P.A. Edwards and H.T. Mayfield, Fuel identifi-
cation by neural network analysis of the responses of vapour-sensitive sensor arrays. Anal.
Chem., 68 (1996) 4264^269.
35. M. Click and G.M. Hieftje, Classification of alloys with an artificial neural network and
multivariate calibration of Glow-Discharge emission spectra. Appl. Spectrosc, 45 (1991)
1706-1716.
36. R. Goodacre, D.B. Kell and G. Bianchi, Rapid identification of species using pyrolysis mass
spectrometry and artificial neural networks of propionibacterium acnes isolated from dogs. J.
Appl. Bacteriol., 76 (1994) 124-134.
37. W. Werther, H. Lohninger, F. Stand and K. Vermuza, Classification of mass spectra. A com-
parison of yes/no classification methods for the recognition of simple structural properdes.
Chemom. Intell. Lab. Syst., 22 (1994) 63-67.
697
38. J.M. Andrews and S.H. Lieberman, Neural network approach to qualitative identifications of
fuels from laser induced fluorescence spectra. Anal. Chim. Acta, 285 (1994) 237-246.
39. U. Hare, J. Brian and J.H. Prestegard, Application of neural networks to automated assignment
of NMR spectra of proteins. J. Biomolec. NMR, 4 (1994) 35-46.
40. T. Visser, H.J. Luinge and J.H. van der Maas, Recognition of visual characteristics of infrared
spectra by artificial neural networks and partial least squares regression. Anal. Chim. Acta, 296
(1994).
41. H.J. Luinge, E.D. Leussink, and T. Visser, Trace-level identity confirmation from infrared
spectra by library searching and artificial neural networks. Anal. Chim. Acta, 345 (1997)
173-184.
42. C. Affolter and J.T. Clerc, Prediction of infrared spectra from chemical structures of organic
compounds using neural networks. Chemom. Intell. Lab. Syst., 21 (1993) 151-157.
43. J.R.M. Smits, P. Schoenmakers, A. Stehmann, F. Sijstermans and G. Kateman, Interpretation
of Infrared spectra with modular neural-network systems. Chemom. Intell. Lab. Syst., 18
(1993) 27-39.
44. M.E. Munk, M.S. Madison and E.W. Robb, Neural network models for infrared spectrum in-
terpretation. Microchim. Acta, 2 (1991) 505-524.
45. W. Wu and D.L. Massart, Artificial neural networks in classification of Nir spectral data: selec-
tion of the input. Chemom. Intell. Lab. Syst., 35 (1996) 127-135.
46. C. Hoskins and D.M. Himmelblau, Process Control via artificial Neural networks and rein-
forced learning. Computers Chem. Eng., 16 (1992) 241-251.
47. C. Hoskins and D.M. Himmelblau, Fault diagnosis in complex chemical plants using artificial
neural networks. AIChE J, 37 (1991) 137-141.
48. M. Bhat and T.J. McAvory, Use of neural nets for dynamic modelling and control of chemical
process systems. Computers Chem. Eng., 15 (1990) 573-578.
49. C. Puebla, Industrial process control of chemical reactions using spectroscopic data and neural
networks. Chemom. Intell. Lab. Syst., 26 (1994) 27-35.
50. N. Qian and T.J. Sejnowski, Predicting the secondary structure of globular proteins using neu-
ral network models. J. Molec. Biol., 202 (1988) 568-584.
51. M. Vieth, A. Kolinsky, J. Skolnicek, and A. Sikorski, Prediction of protein secondary structure
by neural networks, encoding short and long range patterns of amino acid packing. Acta
Biochim. Pol., 39 (1992) 369-392.
52. R. Wehrens and W.E. van der Linden, Cahbration of an array of voltammetric microelectrodes.
Anal. Chim. Acta, 334 (1996) 93-100.
53. J.R.M. Smits, W.J. Meissen, G.J. Daalmans and G. Kateman, Using molecular representations
in combination with neural networks. A case study: prediction of the HPLC retention index.
Computers Chem., 18 (1994) 157-172.
54. A. Bos, M. Bos and W.E. Van der Linden, Artificial neural networks as a multivariate calibra-
tion tool: modeUng the ion-chromium nickel system in x-ray fluorescence spectra. Anal. Chim.
Acta, 277 (1993) 289-295.
55. M.N. Tib and R. Narayanaswamy, Multichannel calibration technique for optical-fibre chemi-
cal sensor using artificial neural network. Sensors Actuators, B39 (1997) 365-370.
56. H.M. Wei, L.S. Wang, B.G. Zhang, C.J. Liu and J.X. Feng, An application of artificial neural
networks. Simultaneous determination of the concentration of sulfur dioxide and relative hu-
midity with a single coated piezoelectric crystal. Anal. Chem., 69 (1997) 699-702.
57. H.J. Miao, M. H. Yu and S.X. Hu, Artificial neural networks aided deconvolving overlapped
peaks in chromatograms. J. Chromatogr., A, 749 (1996) 5-11.
698
58. R.M. Lopes Marques, P.J. Schoenmakers, C.B. Lucasius and L.M.C. Buydens, Modelling
chromatographic behavior as a function of pH and solvent composition in RPLC.
Chromatographia, 36 (1993) 83-95.
59. A.P. de Weyer, L.M.C. Buydens, G. Kateman and H.M. Heuvel, Neural networks used as a soft
modelling technique for quantitative description of the inner relation between physical proper-
ties and mechanical properties of poly ethylene terephthalate yams. Chemom. Intell. Lab.
Syst., 16(1992)77-82.
60. E.P.P.A. Derks, B.A. Pauly, J. Jonkers, E.A.H. Timmermans and L.M.C. Buydens, Adaptive
noise cancellation on inductively coupled plasma spectroscopy. Chemom. Intell. Lab. Syst.,
(1998) in press.
61. A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing.
Wiley, New York, 1993.
62. J. Park and I.W. Sandberg, Universal approximation using radial basis function networks. Neu-
ral Computation, 3 (1991) 246-257.
63. J. Moody and C.J. Darken, Fast learning in networks of locally-tuned processing units. Neural
Computation, 1 (1989) 281-294.
64. B. Carse and T.C. Fogarty, Fast evolutionary learning of minimal radial basis function neural
networks using a genetic algorithm. Lecture Notes in Computer Science, 1143, (1996) 1-22.
65. L. Kieman, J.D. Mason and K. Warwick, Robust initialisation of Gaussian radial basis function
networks using partitioned k-means clustering. Electron. Lett., 32 (1996) 671-672.
66. W. Luo, M.N. Karim, A.J. Morris and E.B. Martin, Control relevant identification of a pH
waste water neutralisation process using adaptive radial basis function networks. Computers
Chem.Eng.,20(1996)S1017
67. B. Walczack and D.L. Massart, Application of radial basis functions-partial least squares to
non-linear pattern recognition problems: diagnosis of process faults. Anal. Chim. Acta, 331
(1996) 187-193.
68. B. Walczack and D.L. Massart, The radial basis functions — partial least squares approach as a
flexible non-linear regression technique. Anal. Chim. Acta, 331 (1996) 177-185.
69. T. Kohonen, Self Organization and Associated Memory. Springer-Verlag, Heidelberg, 1989.
70. W.J. Meissen, J.R.M. Smits, L.M.C. Buydens and G. Kateman, Using artificial neural net-
works for solving chemical problems. II. Kohonen self-organizing feature maps and Hopfield
networks. Chemom. Intell. Lab. Syst., 23 (1994) 267-291.
71. W.J. Meissen, J.R.M. Smits, G.H. Rolf and G. Kateman,Two-dimensional mapping of IR
spectra using a parallel implemented self-organising feature map. Chemom. Intell. Lab. Syst.,
18(1993)195-204.
72. S. Anzali, G. Bamickel, M. Krug, J. Sadowski, M. Wagener, J. Gasteiger and J. Polanski, The
comparison of geometric and electronic properties of molecular surfaces by neural networks:
Application to the analysis of corticosteroid-binding globulin activity of steroids. J. Com-
puter-Aided Molec. Design, 10 (1996) 521-534.
73. X.H. Song and P.K. Hopke, Kohonen neural network as a pattern-recognition method, based
on weight interpretation. Anal. Chim. Acta, 334 (1996) 57-66.
74. M. Walkenstein, H. Hutter, C. Mittermayr, W. Schiesser and M. Grasserbauer, Classification
of SIMS images using a Kohonen network. Anal. Chem., 69 (1997) 777-782.
75. R. Goodacre, J. Pygall and D.B. Kell, Plant seed classification using pyrolysis mass spectrome-
try with unsupervised learning: the application of auto-associative and Kohonen artificial neu-
ral networks. Chemom. Intell. Lab. Syst., 33 (1996) 69-83.
76. J. Zupan, M. Novic and I. Ruisanchez, Kohonen and counterpropagation networks in analytical
chemistry. Chemom. Intell. Lab. Syst., 38 (1997) 1-23.
699
77. D. Wienke and L.M.C. Buydens, Adaptive resonance theory neural networks — the ART of
real-time pattern recognition in chemical process monitoring? Trends Anal. Chem., 99 (1995)
1-8.
78. Y. Xie, P.K. Hopke and D. Wienke, Airborne particle classification with a combination of
chemical composition and shape index utilizing an adaptive resonance artificial neural net-
work. Environ. Sci. Technol., 28 (1994) 1399-1407.
79. D. Domine, D. Wienke, J. Devillers and L.M.C. Buydens, A new nonlinear neural mapping
technique for visual exploration of QSAR data. In: Neural Networks in QSAR and Drug De-
sign, J. Devillers (ed.). Academic Press, London, 1996, p. 223-253.
80. S.H. Barondes, Geestesziekten en Moleculen. De Wetenschappelijke Bibliotheek van Natuur
& Techniek, 1993.
701
Subject Index
disjoint class modelling, 208 eigenvalue decomposition, 33, 92, 148, 183
dispersion matrix, 31 eigenvector, 33, 228, 230
dissimilarity plot, 295 eigenvector projection, 55
dissociation constant, 387 electron acceptor, 127
distance, 43, 45, 47, 60, 108, 176-178 electron donor, 127
— between two points, 11 electrostatic interaction, 386
— of chi-square, 133, 175 elimination pool, 455
— city block, 66, 149 ellipsoid, 39
— Cook, 374 enthalpy, 386
— Euclidean,60, 62,63, 67, 108, 146,230, entropy, 386
231 environmental applications, 232
— from the origin, 11 enzyme, 383, 411
— function, 152 epoch, 673
— generalized, 61 equilibrium constant, 386
— Hamming, 66, 147 equiprobability envelope, 104, 107
— Mahalanobis, 61, 62, 65, 147, 220, 221, error eigenvector, 143
228, 274 Euclidean distance, 60, 62, 63, 67, 108, 146,
— Manhattan, 67, 147 230,231
— Minkowski, 67 euroleptic effect, 400
— matrix, 68 evolution program, 375
— metrics, 150 evolutionary method, 251
— Pythagorean, 146 evolutionary rank analysis, 274
— standardized Euclidean, 61, 62, 64 evolving factor analysis, 274
— taxi, 147 evolving principal components analysis, 274
distribution, 449 exclusive or (XOR) problem, 659
— volume of, 456 excretion, 449, 452
distributional equivalence, 193 exhaustion, 136
D-optimality, 40 expected value, 147
double-closure, 130, 169 experimental design, 40
drift, 593, 598 expert system, 4, 627
drug-receptor complex, 385 — if...then... rules, 631
drug-receptor interaction, 402 — inductive, 227
drug-test specificity, 397 — inference engine, 629, 633
dual representation, 28 — inheritance, 637
dual space, 174 — shell, 630, 641
duality, 16 explanation facility, 640
duo-trio test, 422 exploratory analysis, 46, 167
dynamic programming, 605 exponential function, 203, 462
exponential smoothing, 544
earning machine, 416 extended matrix, 155
edges, 621 extrathermodynamic method, 384
eigenfunction, 486 extravascular administration, 461,469
eigenstructure tracking, 280
eigenvalue, 31, 186, 486, 490, 492 factor, 244, 319
705
maximum likelihood method, 555, 557 multivariate cahbration, 60, 239, 349
mean absorption time, 496 multivariate normal distribution, 40, 212, 221,
measure of belief, 640 228
measure of disbelief, 640 multivariate quality control, 232
measure of (dis)similarity, 60, 65 multivariate regression, 323
measurement equation, 577 multivariate statistics, 4
measurement model, 576 multi-way contingency table, 165
measurement table, 7, 87 multi-way data table, 2
mechanical analogy, 128 MYCIN, 640
membrane, 456
metabolism, 452 natural calibration, 352
metabolite, 450 near infrared spectra, 232
meta-knowledge, 629, 631 nearest neighbour method, 209, 223
meteorites, 57 nearest neighbours, 213
metric matrix, 171 needle search, 271
metric MDS, 428 network, 621
Michaelis constant, 453 neural network, 82, 149, 209, 213, 225, 233,
Michaelis-Menten equation, 502 378,385,416,649
Michaelis-Menten kinetics, 453 — application of, 680
minimal path, 622 — ART, 649
minimal spanning tree, 73, 74 — back-propagation learning rule, 662, 671
Minkowski distance, 67 — backtracking, 636
mint species, 60 — backward chaining, 634
mixed variables, 67 — hidden layer, 660, 662
mixture design, 444 — hidden unit, 660, 677
mode analysis, 79 — input layer, 662
modelling power, 237 — multi-layer-feed-forward network, 649
molar refractivity, 392 — output layer, 662
molecular field, 410 — performance behaviour of, 674, 677
momentum, 673 — performance criterion, 680
monitoring set, 675 — radial basis function, 649, 670, 681
Moore-Penrose inverse, 38 — validation of, 679
Morlet wavelet, 566 neuron, 650
moving average, 538 NIPALS, 134, 139,332
multicompartment model, 487 NIPALS algorithm, 40, 107
multicomponent analysis, 575 NIR spectra, 213
multicriteria decision making, 426, 605 node, 621
multidimensional scaling, 149, 427 noise, 143, 535
multidimensional space, 16 nominal scale, 161
multinormal, 104 non-compartmental analysis, 493, 500
multiple linear regression, 53 non-hierarchical classification, 58
multiple inheritance, 637 non-hierarchical clustering, 59, 83
multiplicative scatter correction, 373 non-hierarchical methods, 76
multivariate bioassay, 397 non-linear Hansch model, 389
709